


Quantitative Methods

OXFORD LIBRARY OF PSYCHOLOGY

Editor-in-Chief

Peter E. Nathan

Area Editors:

Clinical Psychology

David H. Barlow

Cognitive Neuroscience

Kevin N. Ochsner and Stephen M. Kosslyn

Cognitive Psychology

Daniel Reisberg

Counseling Psychology

Elizabeth M. Altmaier and Jo-Ida C. Hansen

Developmental Psychology

Philip David Zelazo

Health Psychology

Howard S. Friedman

History of Psychology

David B. Baker

Todd D. Little

Neuropsychology

Kenneth M. Adams

Organizational Psychology

Steve W. J. Kozlowski

Kay Deaux and Mark Snyder

OXFORD LIBRARY OF PSYCHOLOGY

The Oxford Handbook of Quantitative Methods

Edited by

Todd D. Little

Volume 1: Foundations

Oxford University Press, Inc., publishes works that further Oxford University's objective of excellence in research, scholarship, and education.

Auckland Cape Town Dar es Salaam Hong Kong Karachi

Kuala Lumpur Madrid Melbourne Mexico City Nairobi

New Delhi Shanghai Taipei Toronto

With ofﬁces in

Argentina Austria Brazil Chile Czech Republic France Greece

Guatemala Hungary Italy Japan Poland Portugal Singapore

South Korea Switzerland Thailand Turkey Ukraine Vietnam

198 Madison Avenue, New York, New York 10016

www.oup.com

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means,

electronic, mechanical, photocopying, recording, or

otherwise, without the prior permission of Oxford University Press

The Oxford handbook of quantitative methods / edited by Todd D. Little.

v. cm. – (Oxford library of psychology)

ISBN 978–0–19–993487–4

ISBN 978–0–19–993489–8

1. Psychology–Statistical methods. 2. Psychology–Mathematical models. I. Little, Todd D.

BF39.O927 2012

150.72 1—dc23

2012015005

9 8 7 6 5 4 3 2 1

Printed in the United States of America

on acid-free paper

SHORT CONTENTS

Contributors xi

Chapters 1–479

Index 480



OXFORD LIBRARY OF PSYCHOLOGY

Oxford University Press, one of the world's oldest and most highly respected publishers, has a tradition of publishing significant books in psychology. The ambitious goal of the Oxford Library of Psychology is nothing less than to span a vibrant, wide-ranging field and, in so doing, to fill a clear market need.

Encompassing a comprehensive set of handbooks, organized hierarchically, the Library incorporates volumes at different levels, each designed to meet a distinct need. At one level are a set of handbooks designed broadly to survey the major subfields of psychology; at another are numerous handbooks that cover important current focal research and scholarly areas of psychology in depth and detail. Planned as a reflection of the dynamism of psychology, the Library will grow and expand as psychology itself develops, thereby highlighting significant new research that will impact on the field. Adding to its accessibility and ease of use, the Library will be published in print and, later, electronically.

The Library surveys psychology's principal subfields with a set of handbooks that capture the current status and future prospects of those major subdisciplines. This initial set includes handbooks of social and personality psychology, clinical psychology, counseling psychology, school psychology, educational psychology, industrial and organizational psychology, cognitive psychology, cognitive neuroscience, methods and measurements, history, neuropsychology, personality assessment, developmental psychology, and more. Each handbook undertakes to review one of psychology's major subdisciplines with breadth, comprehensiveness, and exemplary scholarship. In addition to these broadly conceived volumes, the Library also includes a large number of handbooks designed to explore in depth more specialized areas of scholarship and research, such as stress, health and coping, anxiety and related disorders, cognitive development, or child and adolescent assessment. In contrast to the broad coverage of the subfield handbooks, each of these latter volumes focuses on an especially productive, more highly focused line of scholarship and research. Whether at the broadest or most specific level, however, all of the Library handbooks offer synthetic coverage that reviews and evaluates the relevant past and present research and anticipates research in the future. Each handbook in the Library includes introductory and concluding chapters written by its editor to provide a roadmap to the handbook's table of contents and to offer informed anticipations of significant future developments in that field.

An undertaking of this scope calls for handbook editors and chapter authors who are established scholars in the areas about which they write. Many of the nation's and world's most productive and best-respected psychologists have agreed to edit Library handbooks or write authoritative chapters in their areas of expertise.

For whom has the Oxford Library of Psychology been written? Because of its breadth, depth, and accessibility, the Library serves a diverse audience, including graduate students in psychology and their faculty mentors, scholars, researchers, and practitioners in psychology and related fields. Each will find in the Library the information they seek on the subfield or focal area of psychology in which they work or are interested.

Befitting its commitment to accessibility, each handbook includes a comprehensive index, as well as extensive references to help guide research. And because the Library was designed from its inception as an online as well as a print resource, its structure and contents will be readily and rationally searchable online. Further, once the Library is released online, the handbooks will be regularly and thoroughly updated.

In summary, the Oxford Library of Psychology will grow organically to provide a thoroughly informed perspective on the field of psychology, one that reflects both psychology's dynamism and its increasing interdisciplinarity. Once published electronically, the Library is also destined to become a uniquely valuable interactive tool, with extended search and browsing capabilities. As you begin to consult this handbook, we sincerely hope you will share our enthusiasm for the more than 500-year tradition of Oxford University Press for excellence, innovation, and quality, as exemplified by the Oxford Library of Psychology.

Peter E. Nathan

Editor-in-Chief

Oxford Library of Psychology

ABOUT THE EDITOR

Todd D. Little

Todd D. Little, Ph.D., is a Professor of Psychology, Director of the Quantitative training program, Director of the undergraduate Social and Behavioral Sciences Methodology minor, and a member of the Developmental training program. Since 2010, Todd has been Director of the Center for Research Methods and Data Analysis (CRMDA) at Kansas University. Little is internationally recognized for his quantitative work on various aspects of applied SEM (e.g., indicator selection, parceling, modeling developmental processes) as well as his substantive developmental research (e.g., action-control processes and motivation, coping, and self-regulation). In 2001, Little was elected to membership in the Society for Multivariate Experimental Psychology. In 2009, he was elected President of APA's Division 5 (Evaluation, Measurement, and Statistics), and in 2010 he was elected Fellow of the division. In 2012, he was elected Fellow in the Association for Psychological Science. He founded, organizes, and teaches in the internationally renowned KU "Stats Camps" each June (see crmda.KU.edu for details of the summer training programs). Little has edited five books related to methodology, including The Oxford Handbook of Quantitative Methods and the Guilford Handbook of Developmental Research Methods (with Brett Laursen and Noel Card). Little has been principal investigator or co-principal investigator on more than 15 grants and contracts, statistical consultant on more than 60 grants, and he has guided the development of more than 10 different measurement tools.


CONTRIBUTORS

Department of Psychology
Arizona State University
Tempe, AZ

Rawni A. Anderson
Center for Research Methods and Data Analysis
University of Kansas
Lawrence, KS

Luc Anselin
GeoDa Center for Geospatial Analysis and Computation
School of Geographical Sciences and Urban Planning
Arizona State University
Tempe, AZ

Amanda N. Baraldi
Department of Psychology
Arizona State University
Tempe, AZ

David E. Bard
Department of Pediatrics
University of Oklahoma Health Sciences Center
Oklahoma City, OK

Theodore P. Beauchaine
Department of Psychology
Washington State University
Pullman, WA

Gabriëlla A.M. Blokland
Genetic Epidemiology Laboratory
Queensland Institute of Medical Research
School of Psychology and Centre for Advanced Imaging
University of Queensland
Brisbane, Australia

Annette Brose
Max Planck Institute for Human Development
Berlin, Germany
Max Planck Institute for Human Cognitive and Brain Sciences

Department of Psychology
Boston University
Boston, MA

Trent D. Buskirk
Department of Community Health-Biostatistics Division
Saint Louis University
Saint Louis, MO

Noel A. Card
Family Studies and Human Development
University of Arizona
Tucson, AZ

Deborah M. Casper
Family Studies and Human Development
University of Arizona
Tucson, AZ

Daniel R. Cavagnaro
Department of Psychology
The Ohio State University
Columbus, OH

Rand D. Conger
Department of Human and Community Development
University of California at Davis
Davis, CA

David Cook
Abt Associates Inc.

Thomas D. Cook
Institute for Policy Research
Northwestern University
Evanston, IL

Stefany Coxe
Department of Psychology
Arizona State University
Tempe, AZ

R.J. de Ayala
Department of Educational Psychology
University of Nebraska Lincoln
Lincoln, NE

Pascal R. Deboeck
Department of Psychology
University of Kansas
Lawrence, KS

Sarah Depaoli
Department of Educational Psychology
University of Wisconsin Madison
Madison, WI

Cody S. Ding
College of Education
University of Missouri-Saint Louis
Saint Louis, MO

M. Brent Donnellan
Department of Psychology
Michigan State University
East Lansing, MI

Dawnté R. Early
Department of Human and Community Development
University of California at Davis
Davis, CA

Craig K. Enders
Department of Psychology
Arizona State University
Tempe, AZ

David M. Erceg-Hurn
School of Psychology
University of Western Australia
Crawley, WA, Australia

Aurelio José Figueredo
Department of Psychology
School of Mind, Brain, & Behavior
Division of Family Studies and Human Development
College of Agriculture and Life Sciences
University of Arizona
Tucson, AZ

Rafael Antonio Garcia
Department of Psychology
School of Mind, Brain, & Behavior
University of Arizona
Tucson, AZ

Amanda C. Gottschall
Department of Psychology
Arizona State University
Tempe, AZ

Michael J. Greenacre
Department of Economics and Business
Universitat Pompeu Fabra, Barcelona
Barcelona, Spain

Brian D. Haig
Department of Psychology
University of Canterbury
Canterbury, New Zealand

Kelly Hallberg
Institute for Policy Research
Northwestern University
Evanston, IL

Lisa L. Harlow
Department of Psychology
University of Rhode Island
Kingston, RI

Emily J. Hart
Department of Psychology
University at Buffalo
The State University of New York
Buffalo, NY

Kit-Tai Hau
The Chinese University of Hong Kong
Hong Kong

Joop J. Hox
Department of Methodology and Statistics
Utrecht University
Utrecht, The Netherlands

James Jaccard
Department of Psychology
Florida International University
Boca Raton, FL

Paul E. Johnson
Department of Political Science
Kansas University
Lawrence, KS

Kelly M. Kadlec
Department of Psychology
University of Southern California
Los Angeles, CA

David Kaplan
Department of Educational Psychology
University of Wisconsin-Madison
Madison, WI

Ken Kelley
Department of Management
University of Notre Dame
Notre Dame, IN

Harvey J. Keselman
Department of Psychology
University of Manitoba
Winnipeg, Canada

Neal M. Kingston
School of Education
University of Kansas
Lawrence, KS

Yasemin Kisbu-Sakarya
Department of Psychology
Arizona State University
Tempe, AZ

Laura B. Kramer
School of Education
University of Kansas
Lawrence, KS

Todd D. Little
Center for Research Methods and Data Analysis
Department of Psychology
University of Kansas
Lawrence, KS

Richard E. Lucas
Department of Psychology
Michigan State University
East Lansing, MI

David P. MacKinnon
Department of Psychology
Arizona State University
Tempe, AZ

Patrick Mair
Institute for Statistics and Mathematics
Vienna University of Economics and Business
Vienna, Austria

Herbert W. Marsh
Department of Education
University of Oxford
Oxford, UK

Katherine E. Masyn
Graduate School of Education
Harvard University
Cambridge, MA

John J. McArdle
Department of Psychology
University of Southern California
Los Angeles, CA

Roderick P. McDonald†
Sydney University
Sydney, Australia
Professor Emeritus
University of Illinois at Urbana-Champaign
Professor Emeritus
Macquarie University
† April 16, 1928 – October 29, 2011

Sarah E. Medland
Genetic Epidemiology Laboratory
Queensland Institute of Medical Research
School of Psychology
University of Queensland
Brisbane, Australia

Peter C. M. Molenaar
Department of Human Development and Family Studies
Pennsylvania State University
University Park, PA

Alexandre J.S. Morin
Department of Psychology
University of Sherbrooke
Sherbrooke, Quebec, Canada

Miriam A. Mosing
Genetic Epidemiology Laboratory
Queensland Institute of Medical Research
School of Psychology
University of Queensland
Brisbane, Australia

Keith E. Muller
Department of Health Outcomes and Policy
University of Florida
Gainesville, FL

Eun-Young Mun
Center of Alcohol Studies
Rutgers University
Piscataway, NJ

Alan T. Murray
GeoDa Center for Geospatial Analysis and Computation
School of Geographical Sciences and Urban Planning
Arizona State University
Tempe, AZ

Jay I. Myung
Department of Psychology
The Ohio State University
Columbus, OH

Benjamin Nagengast
Department of Education
Oxford University
Oxford, UK

Sally Gayle Olderbak
Department of Psychology
School of Mind, Brain, & Behavior
University of Arizona
Tucson, AZ

Jamie M. Ostrov
Department of Psychology
University at Buffalo
The State University of New York
Buffalo, NY

Trond Peterson
Department of Sociology
University of California-Berkeley
Berkeley, CA

Mark A. Pitt
Department of Psychology
The Ohio State University
Columbus, OH

Larry R. Price
College of Education and College of Science
Texas State University-San Marcos
San Marcos, TX

Nilam Ram
Department of Human Development and Family Studies
Pennsylvania State University
University Park, PA
Max Planck Institute for Human Development
Berlin, Germany

Sergio J. Rey
GeoDa Center for Geospatial Analysis and Computation
School of Geographical Sciences and Urban Planning
Arizona State University
Tempe, AZ

Joseph L. Rodgers
Department of Psychology
University of Oklahoma
Norman, OK

Robert Rosenthal
Department of Psychology
University of California, Riverside
Riverside, CA

Ralph L. Rosnow
Department of Psychology
Temple University
Philadelphia, PA

André A. Rupp
Department of Measurement, Statistics, and Evaluation (EDMS)
University of Maryland
College Park, MD

Gabriel Lee Schlomer
Division of Family Studies and Human Development
College of Agriculture and Life Sciences
University of Arizona
Tucson, AZ

Christof Schuster
Department of Psychology
Justus-Liebig-Universität Giessen
Giessen, Germany

James P. Selig
Department of Psychology
University of New Mexico
Albuquerque, NM

Paul E. Spector
Department of Psychology
University of South Florida
Tampa, FL

Peter M. Steiner
Department of Educational Psychology
University of Wisconsin-Madison
Madison, WI

Carolin Strobl
Ludwig-Maximilians-Universität München
Faculty of Mathematics, Informatics and Statistics
Institute of Statistics
Munich, Germany

Bruce Thompson
Baylor College of Medicine
Austin, TX

Terry T. Tomazic
Department of Sociology and Criminal Justice
Saint Louis University
Saint Louis, MO

James T. Townsend
Department of Psychological and Brain Sciences
Indiana University
Bloomington, IN

Trisha Van Zandt
Department of Psychology
The Ohio State University
Columbus, OH

Alexander von Eye
Departments of Psychology
Michigan State University
East Lansing, MI
University of Vienna
Vienna, Austria

Stefan von Weber
Department of Mechanical Engineering
Hochschule Furtwangen University
Furtwangen im Schwarzwald, Germany

Karin J.H. Verweij
Genetic Epidemiology Laboratory
Queensland Institute of Medical Research
School of Psychology
University of Queensland
Brisbane, Australia

Theodore A. Walls
Department of Behavioral Science
University of Rhode Island
Kingston, RI

Lihshing Leigh Wang
Educational Studies Program
University of Cincinnati
Cincinnati, OH

Amber S. Watts
Center for Research Methods and Data Analysis and
Lifespan Institute, Gerontology Center
University of Kansas
Lawrence, KS

William W.S. Wei
Department of Statistics
Temple University
Philadelphia, PA

Zhonglin Wen
South China Normal University

Stephen G. West
Department of Psychology
Arizona State University
Tempe, AZ

Keith F. Widaman
Department of Psychology
University of California at Davis
Davis, CA

Rand R. Wilcox
Department of Psychology
University of Southern California
Los Angeles, CA

Lisa M. Willoughby
Department of Psychology
Saint Louis University
Saint Louis, MO

Coady Wing
Institute for Policy Research
Northwestern University
Evanston, IL

Pedro Sofio Abril Wolf
Department of Psychology
University of Cape Town
Cape Town, South Africa

Vivian Wong
School of Education and Social Policy
Northwestern University
Evanston, IL

Carol M. Woods
Center for Research Methods and Data Analysis and
Department of Psychology
University of Kansas
Lawrence, KS

Wei Wu
Center for Research Methods and Data Analysis and
Department of Psychology
University of Kansas
Lawrence, KS

Ke-Hai Yuan
Department of Psychology
University of Notre Dame
Notre Dame, IN


CONTENTS

1. Introduction 1

Todd D. Little

2. The Philosophy of Quantitative Methods 7

Brian D. Haig

3. Quantitative Methods and Ethics 32

Ralph L. Rosnow and Robert Rosenthal

4. Special Populations 55

Keith F. Widaman, Dawnté R. Early, and Rand D. Conger

5. Theory Construction, Model Building, and Model Selection 82

James Jaccard

6. Teaching Quantitative Psychology 105

Lisa L. Harlow

7. Modern Test Theory 118

Roderick P. McDonald

8. The IRT Tradition and its Applications 144

R.J. de Ayala

9. Survey Design and Measure Development 170

Paul E. Spector

10. High-Stakes Test Construction and Test Use 189

Neal M. Kingston and Laura B. Kramer

11. Effect Size and Sample Size Planning 206

Ken Kelley

12. Experimental Design for Causal Inference: Clinical Trials and

Regression Discontinuity Designs 223

Kelly Hallberg, Coady Wing, Vivian Wong, and Thomas D. Cook

13. Matching and Propensity Scores 237

Peter M. Steiner and David Cook

14. Designs for and Analyses of Response Time Experiments 260

Trisha Van Zandt and James T. Townsend

15. Observational Methods 286

Jamie M. Ostrov and Emily J. Hart


16. A Primer of Epidemiologic Methods, Concepts, and Analysis with

Examples and More Advanced Applications within Psychology 305

David E. Bard, Joseph L. Rodgers, and Keith E. Muller

17. Program Evaluation: Principles, Procedures, and Practices 332

Aurelio José Figueredo, Sally Gayle Olderbak, Gabriel Lee Schlomer,

Rafael Antonio Garcia, and Pedro Soﬁo Abril Wolf

18. Overview of Statistical Estimation Methods 361

Ke-Hai Yuan and Christof Schuster

19. Robust Statistical Estimation 388

David M. Erceg-Hurn, Rand R. Wilcox, and Harvey J. Keselman

20. Bayesian Statistical Methods 407

David Kaplan and Sarah Depaoli

21. Mathematical Modeling 438

Daniel R. Cavagnaro, Jay I. Myung, and Mark A. Pitt

22. Monte Carlo Analysis in Academic Research 454

Paul E. Johnson

Index 480


CHAPTER 1

Introduction

Todd D. Little

Abstract

In this introductory chapter to The Oxford Handbook of Quantitative Methods, I provide an overview of the two volumes. More specifically, I describe the rationale and motivation for the selected topics that are presented in the volumes. I also list out my instructions to the chapter authors and then describe how the chapters fit together into thematic groupings. I also extend my sincerest gratitude to the persons who assisted me along the way, as no work this comprehensive can be done without the considerable help and assistance of many persons. I conclude with how pleased I am with the quality and comprehensiveness of the chapters that are included.

Key Words: Overview; Quantitative Methods; Methodology; Statistics

Handbooks provide a crucial venue to communicate the current state of the field. They also provide a one-stop source for learning and reviewing current best practices in a field. The Oxford Handbook of Quantitative Methods serves both of these functions. The field of quantitative methods is quite broad, as you can probably imagine. I have tried to be thorough in my selection of topics to be covered. As with any handbook of this magnitude, some topics were all set to have a contribution submitted, only to have some unforeseen hindrance preclude its inclusion at the last minute (e.g., graphical representations of data, ecological inference, history of quantitative methods). Some topics overlap with others and may not have found their way to become a separate chapter, but their fundamental elements are found in parts of other chapters.

This handbook is one of many that Oxford University Press (OUP) is assembling but will be the capstone methodology handbook. As many of you know, OUP is building a comprehensive and […] of psychology (the Editor-in-Chief of the library is Peter Nathan, University of Iowa Foundation Distinguished Professor of Psychology and Public Health). The library comprises handbooks in the truest sense of the word: books that summarize and synthesize a topic, define the current scholarship, and set the agenda for future research. Each handbook is published as a bound book, and it will also be developed for electronic delivery. In this format, the content will be integrated across topics and available as a fully integrated electronic library. I think the idea of a comprehensive electronic library is very forward-thinking. This format is a very attractive opportunity to have a fully comprehensive and up-to-date handbook of methods in our field. Hence, I agreed to take on the role of editor of The Oxford Handbook of Quantitative Methods.

I am very pleased with the quality of the work that each author provided. As per my request to the contributing authors, each chapter is meant to be both accessible and comprehensive; nearly all the authors were very responsive to my requests. The guidelines I asked authors to consider were:

• Handbook chapters should be comprehensive and authoritative; readers will rely heavily on these chapters, particularly when they move to the online format.
• Handbook chapters should present not only the strengths of the topic covered but also any limitations.
• Handbook chapters should make all assumptions underlying the topic explicit.
• Regarding citations, handbook chapters should cover the historical origins as well as the recent renditions of a given key topic.
• Handbook chapters should not present one-sided views on any debate; rather, they should report the issues and present the arguments—both pro and con. Authors can direct readers to other platforms where a position piece is presented.
• To facilitate the online linkages, handbook chapters should point to other online resources related to the topic presented.
• Every element of every formula presented must be explicitly explained; assume no knowledge of how to read formulae.
• Examples, examples, examples, and, when in doubt, provide an example! Concrete examples are absolutely critical to communicate quantitative content.
• Avoid jargon and acronyms. Please spell out acronyms, and if you use jargon, please remind the reader of the meaning or definition of the jargon every three to four times it is used; similarly, if you use an acronym, then remind the reader of what it means every three to four times it is used.
• Use active voice, and do not shy away from the use of I/me or we/us. Channel how you lecture on the topic. It will create a crisp and enjoyable read.
• Do not start a sentence with "This" followed by a verb. The referent to "this" must be restated because of the ambiguity this creates. This general guideline should be followed as a rule!

Authors, like editors, have preferences and habits, so you will find places, chapters, and so on where some of my admonitions were not followed. But the quality of the product that each chapter provides is nonetheless uncompromised. We have established a Wiki-based resource page for the handbook, which can be found at crmda.KU.edu/oxford. Each author has been asked to maintain and upload materials to support his or her chapter contribution. At the top of that page is a link that encourages you to offer comments and suggestions on the topics and coverage of the handbook. These comments will be reviewed and integrated into future editions of this handbook. I encourage you, therefore, to take advantage of this opportunity to help shape the directions and content coverage of this handbook.

Statistical software has blossomed with the advent of hardware that provides the necessary speed and memory and programming languages coupled with numerical algorithms that are more efficient and optimized than yesteryear. These software advances have allowed many of the advances in modern statistics to become accessible to the typical end-user. Modern missing data algorithms and Bayesian estimation procedures, for example, have been the beneficiaries of these advances. Of course, some of the software developments have included simplified interfaces with slick graphic user interfaces. The critical options are usually prefilled with default settings. These latter two aspects of advancing software are unfortunate because they lead to mindless applications of the statistical techniques. I would prefer that options not be set as default but, rather, have the software prompt the user to make a choice (and give good help for what each choice means). I would prefer that a complete script of the GUI choices and the order in which steps were taken be automatically saved and displayed.

I have organized the handbook by starting with some basics. It begins with the philosophical underpinnings associated with science and quantitative methods (Haig, Chapter 2, Volume 1) followed by a discussion of how to construct theories and models so that they can be tested empirically and the best model selected (Jaccard, Chapter 5, Volume 1). I then turn to an enlightened discussion of ethics in the conduct of quantitative research (Rosnow & Rosenthal, Chapter 3, Volume 1) and related issues when quantitative methods are applied in special populations (Widaman, Early, & Conger, Chapter 4, Volume 1). Harlow (Chapter 6, Volume 1) follows with an encompassing and impassioned discussion of teaching quantitative methods.

The theme in the next grouping of chapters centers on measurement issues. First, the late McDonald (Chapter 7, Volume 1) provides a thorough overview of Modern Test Theory. De Ayala (Chapter 8, Volume 1) adds a detailed discussion of Item Response Theory as an essential measurement and analysis tool. After these principles of measurement are discussed, the principles and practices surrounding survey design and measure development are presented (Spector, Chapter 9, Volume 1). Kingston and Kramer (Chapter 10, Volume 1) further this discussion in the context of high-stakes testing.

A next grouping of chapters covers various design issues. Kelley (Chapter 11, Volume 1) begins this section by covering issues of power, effect size, and sample size planning. Hallberg, Wing, Wong, and Cook (Chapter 12, Volume 1) then address key experimental designs for causal inference: the gold standard randomized clinical trials (RCT) design and the underutilized regression discontinuity design. Some key quasi-experimental procedures for comparing groups are discussed in Steiner and Cook's (Chapter 13, Volume 1) chapter on using matching and propensity scores. Finally, Van Zandt and Townsend (Chapter 14, Volume 1) provide a detailed discussion of the designs for and analyses of response time experiments. I put observational methods (Ostrov & Hart, Chapter 15, Volume 1), epidemiological methods (Bard, Rodgers, & Muller, Chapter 16, Volume 1), and program evaluation (Figueredo, Olderbak, & Schlomer, Chapter 17, Volume 1) in with these chapters because they address more collection and design issues, although the discussion of program evaluation also addresses the unique analysis and presentation issues.

I have a stellar group of chapters related to estimation issues. Yuan and Schuster (Chapter 18, Volume 1) provide an overview of statistical estimation methods; Erceg-Hurn, Wilcox, and Keselman (Chapter 19, Volume 1) provide a nice complement with a focus on robust estimation techniques. Bayesian statistical estimation methods are thoroughly reviewed in the Kaplan and Depaoli (Chapter 20, Volume 1) contribution. The details of mathematical modeling are synthesized in this section by Cavagnaro, Myung, and Pitt (Chapter 21, Volume 1). This section is completed by Johnson (Chapter 22, Volume 1), who discusses the many issues and nuances involved in conducting Monte Carlo simulations to address the what-would-happen-if questions that we often need to answer.

The foundational techniques for the statistical […] Volume 2) then extend the General Linear Model to discuss the Generalized Linear Model. This discussion is easily followed by Woods (Chapter 4, Volume 2), who synthesizes the various techniques of analyzing categorical data. After the chapter on configural frequency analysis by Von Eye, Mun, Mair, and von Weber (Chapter 5, Volume 2), I then segue into nonparametric techniques (Buskirk, Tomazic, & Willoughby, Chapter 6, Volume 2) and the more specialized techniques of correspondence analysis (Greenacre, Chapter 7, Volume 2) and spatial analysis (Anselin, Murray, & Rey, Chapter 8, Volume 2). This section is capped with chapters dedicated to special areas of research—namely, techniques and issues related to the analysis of imaging data (e.g., fMRI; Price, Chapter 9, Volume 2). The closely aligned worlds of behavior genetics (i.e., twin studies; Blokland, Mosing, Verweij, & Medland, Chapter 10, Volume 2) and genes (Medland, Chapter 11, Volume 2) follows.

The foundations of multivariate techniques are grouped beginning with Ding's (Chapter 12, Volume 2) presentation of multidimensional scaling and Brown's (Chapter 13, Volume 2) summary of the foundations of latent variable measurement models. Hox layers in the multilevel issues as handled in both the manifest regression framework and the latent variable work of structural equation modeling. McArdle and Kadlec (Chapter 15, Volume 2) detail, in broad terms, different structural equation models and their utility. MacKinnon, Kisbu-Sakarya, and Gottschall (Chapter 16, Volume 2) address the many new developments in mediation analysis, while Marsh, Hau, Wen, and Nagengast (Chapter 17, Volume 2) do the same for analyses of moderation.

The next group of chapters focuses on repeated measures and longitudinal designs. It begins with a chapter I co-wrote with Wu and Selig and provides a general overview of longitudinal models (Wu, Selig, & Little, Chapter 18, Volume 2). Deboeck (Chapter 19, Volume 2) takes things further into the burgeoning world of dynamical systems and continuous-time models for longitudinal data. Relatedly, Walls (Chapter 20, Volume 2) provides an overview of designs for doing intensive longitudinal collection and analysis designs. The wonderful

analysis of quantitative data start with a detailed world of dynamic-factor models (a multivariate

overview of the traditional methods that have model for single-subject data) is presented by Ram,

marked social and behavioral sciences (i.e., the Brose, and Molenaar (Chapter 21, Volume 2). Wei

General Linear Model; Thompson, Chapter 2, (Chapter 22, Volume 2) covers all the issues of tradi-

Volume 2). Coxe, West, and Aiken (Chapter 3, tional time-series models and Peterson (Chapter 23,


Volume 2) rounds out this section with a thorough coverage of event history models.

The volume finishes with two small sections. The first focuses on techniques dedicated to finding heterogeneous subgroups in one's data. Rupp (Chapter 24, Volume 2) covers traditional clustering and classification procedures. Masyn and Nylund-Gibson (Chapter 25, Volume 2) cover the model-based approaches encompassed under the umbrella of mixture modeling. Beauchaine (Chapter 26, Volume 2) completes this first group with his coverage of the nuances of taxometrics. The second of the final group of chapters covers issues related to secondary analyses of extant data. I put the chapter on missing data in here because it generally is applied after data collection occurs, but it is also a little out of order here because of the terrific and powerful features of planned missing data designs. In this regard, Baraldi and Enders (Chapter 27, Volume 2) could have gone into the design section. Donnellan and Lucas (Chapter 28, Volume 2) cover the issues associated with analyzing the large-scale archival data sets that are available via federal funding agencies such as NCES, NIH, NSF, and the like. Data mining can also be classified as a set of secondary modeling procedures, and Strobl's (Chapter 29, Volume 2) chapter covers the techniques and issues in this emerging field of methodology. Card and Casper (Chapter 30, Volume 2) cover the still-advancing world of meta-analysis and current best practices in the quantitative synthesis of published studies. The final chapter of The Oxford Handbook of Quantitative Methods is one I co-authored with Wang, Watts, and Anderson (Wang, Watts, Anderson, & Little, Chapter 31, Volume 2). In this capstone chapter, we address the many pervasive fallacies that still permeate the world of quantitative methodology.

A venture such as this does involve the generous and essential contributions of expert reviewers. Many of the chapter authors also served as reviewers for other chapters, and I won't mention them by name here. I do want to express gratitude to a number of ad hoc reviewers who assisted me along the way (in arbitrary order): Steve Lee, Kris Preacher, Mijke Rhemtulla, Chantelle Dowsett, Jason Lee, Michael Edwards, David Johnson (I apologize now if I have forgotten that you reviewed a chapter for me!). I also owe a debt of gratitude to Chad Zimmerman at OUP, who was relentless in guiding us through the incremental steps needed to herd us all to a final and pride-worthy end product, and to Anne Dellinger, who was instrumental in bringing closure to this mammoth project.

Author Note

Partial support for this project was provided by grant NSF 1053160 (Todd D. Little & Wei Wu, co-PIs) and by the Center for Research Methods and Data Analysis at the University of Kansas (Todd D. Little, director). Correspondence concerning this work should be addressed to Todd D. Little, Center for Research Methods and Data Analysis, University of Kansas, 1425 Jayhawk Blvd., Watson Library 470, Lawrence, KS 66045. E-mail: yhat@ku.edu. Web: crmda.ku.edu.

Note

1. This chapter was completed shortly before Rod's unexpected passing. His legacy and commitment to quantitative methods were uncompromising, and we will miss his voice of wisdom and his piercing intellect. R.I.P., Rod McDonald and, as you once said, pervixi… .

References

Anselin, L., Murry, A. T., & Rey, S. J. (2012). Spatial analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 154–174). New York: Oxford University Press.

Baraldi, A. N., & Enders, C. K. (2012). Missing data methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 635–664). New York: Oxford University Press.

Bard, D. E., Rodgers, J. L., & Muller, K. E. (2012). A primer of epidemiology methods with applications in psychology. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 305–349). New York: Oxford University Press.

Beauchaine, T. P. (2012). Taxometrics. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 612–634). New York: Oxford University Press.

Brown, T. A. (2012). Latent variable measurement models. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 257–280). New York: Oxford University Press.

Buskirk, T. D., Tomazic, T. T., & Willoughby, L. (2012). Nonparametric statistical techniques. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 106–141). New York: Oxford University Press.

Card, N. A., & Casper, D. M. (2012). Meta-analysis and quantitative research synthesis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 701–717). New York: Oxford University Press.

Cavagnaro, D. R., Myung, J. I., & Pitt, M. A. (2012). Mathematical modeling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 438–453). New York: Oxford University Press.

Coxe, S., West, S. G., & Aiken, L. S. (2012). Generalized linear models. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 26–51). New York: Oxford University Press.

De Ayala, R. J. (2012). The IRT tradition and its applications. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 144–169). New York: Oxford University Press.

Deboeck, P. R. (2012). Dynamical systems and models of continuous time. In T. D. Little (Ed.), The Oxford Handbook


of Quantitative Methods (Vol. 2, pp. 411–431). New York: Oxford University Press.

Ding, C. S. (2012). Multidimensional scaling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 7–25). New York: Oxford University Press.

Donnellan, M. B., & Lucas, R. E. (2012). Secondary data analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 665–677). New York: Oxford University Press.

Erceg-Hurn, D. M., Wilcox, R. R., & Keselman, H. H. (2012). Robust statistical estimation. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 388–406). New York: Oxford University Press.

Figueredo, A. J., Olderbak, S. G., & Schlomer, G. L. (2012). Program evaluation: Principles, procedures, and practices. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 332–360). New York: Oxford University Press.

Greenacre, M. J. (2012). Correspondence analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 142–153). New York: Oxford University Press.

Haig, B. D. (2012). The philosophy of quantitative methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 7–31). New York: Oxford University Press.

Hallberg, K., Wing, C., Wong, V., & Cook, T. D. (2012). Experimental design for causal inference: Clinical trials and regression discontinuity designs. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 223–236). New York: Oxford University Press.

Harlow, L. (2012). Teaching quantitative psychology. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 105–117). New York: Oxford University Press.

Hox, J. J. (2012). Multilevel regression and multilevel structural equation modeling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 281–294). New York: Oxford University Press.

Jaccard, J. (2012). Theory construction, model building, and model selection. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 82–104). New York: Oxford University Press.

Johnson, P. E. (2012). Monte Carlo analysis in academic research. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 454–479). New York: Oxford University Press.

Kaplan, D., & Depaoli, S. (2012). Bayesian statistical methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 407–437). New York: Oxford University Press.

Kelley, K. (2012). Effect size and sample size planning. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 206–222). New York: Oxford University Press.

Kingston, N. M., & Kramer, L. B. (2012). High stakes test construction and test use. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 189–205). New York: Oxford University Press.

MacKinnon, D. P., Kisbu-Sakarya, Y., & Gottschall, A. C. (2012). Developments in mediation analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 338–360). New York: Oxford University Press.

Marsh, H. W., Hau, K.-T., Wen, Z., & Nagengast, B. (2012). Moderation. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 361–386). New York: Oxford University Press.

Masyn, K. E., & Nylund-Gibson, K. (2012). Mixture modeling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 551–611). New York: Oxford University Press.

McArdle, J. J., & Kadlec, K. M. (2012). Structural equation models. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 295–337). New York: Oxford University Press.

McDonald, R. P. (2012). Modern test theory. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 118–143). New York: Oxford University Press.

Medland, S. E. (2012). Quantitative analysis of genes. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 219–234). New York: Oxford University Press.

Ostrov, J. M., & Hart, E. J. (2012). Observational methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 286–304). New York: Oxford University Press.

Peterson, T. (2012). Analyzing event history data. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 486–516). New York: Oxford University Press.

Price, L. R. (2012). Analysis of imaging data. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 175–197). New York: Oxford University Press.

Ram, N., Brose, A., & Molenaar, P. C. M. (2012). Dynamic factor analysis: Modeling person-specific process. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 441–457). New York: Oxford University Press.

Rosnow, R. L., & Rosenthal, R. (2012). Quantitative methods and ethics. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 32–54). New York: Oxford University Press.

Rupp, A. A. (2012). Clustering and classification. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 517–611). New York: Oxford University Press.

Spector, P. E. (2012). Survey design and measure development. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 170–188). New York: Oxford University Press.

Steiner, P. M., & Cook, D. (2012). Matching and propensity scores. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 237–259). New York: Oxford University Press.

Strobl, C. (2012). Data mining. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 678–700). New York: Oxford University Press.

Thompson, B. (2012). Overview of traditional/classical statistical approaches. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 7–25). New York: Oxford University Press.

Van Zandt, T., & Townsend, J. T. (2012). Designs for and analyses of response time experiments. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 260–285). New York: Oxford University Press.

von Eye, A., Mun, E. U., Mair, P., & von Weber, S. (2012). Configural frequency analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 73–105). New York: Oxford University Press.


Walls, T. A. (2012). Intensive longitudinal data. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 432–440). New York: Oxford University Press.

Wang, L. L., Watts, A. S., Anderson, R. A., & Little, T. D. (2012). Common fallacies in quantitative research methodology. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 718–758). New York: Oxford University Press.

Wei, W. W. S. (2012). Time series analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 458–485). New York: Oxford University Press.

Widaman, K. F., Early, D. R., & Conger, R. D. (2012). Special populations. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 55–81). New York: Oxford University Press.

Woods, C. M. (2012). Categorical methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 52–73). New York: Oxford University Press.

Wu, W., Selig, J. P., & Little, T. D. (2012). Longitudinal data analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 387–410). New York: Oxford University Press.

Yuan, K.-H., & Schuster, C. (2012). Overview of statistical estimation methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 361–387). New York: Oxford University Press.


CHAPTER

2 The Philosophy of Quantitative Methods

Brian D. Haig

Abstract

This chapter provides a philosophical examination of a number of quantitative research methods that are prominent in the behavioral sciences. It begins by outlining a scientific realist

methodology that can help illuminate the conceptual foundations of behavioral research methods. The

methods selected for critical examination are exploratory data analysis, statistical signiﬁcance testing,

Bayesian conﬁrmation theory, meta-analysis, exploratory factor analysis, and causal modeling. Typically,

these methods contribute to either the detection of empirical phenomena or the construction of

explanatory theory. The chapter concludes with a brief consideration of directions that might be taken

in future philosophical work on quantitative methods.

Key Words: scientiﬁc realism, methodology, exploratory data analysis, statistical signiﬁcance testing,

Bayesianism, meta-analysis, exploratory factor analysis, causal modeling, latent variables, phenomena

detection, hypothetico-deductive method, inference to the best explanation

Historically, philosophers of science have given research methods in science limited attention, concentrating mostly on the nature and purpose of theory in the physical sciences. More recently, however, philosophers of science have shown an increased willingness to deal with methodological issues in sciences other than physics—particularly biology, but also psychology to a limited extent. There is, then, a developing literature in contemporary philosophy of science that can aid both our understanding and use of a variety of research methods and strategies in psychology (e.g., Trout, 1998).

At the same time, a miscellany of theoretically oriented psychologists, and behavioral and social scientists more generally, have produced work on the conceptual foundations of research methods that helps illuminate those methods. The work of both professional philosophers of science and theoretical scientists deserves to be included in a philosophical examination of behavioral research methods.

This chapter provides a philosophical examination of a number of different quantitative research methods that are prominent in the behavioral sciences. It begins by outlining a scientific realist methodology that can help illuminate the conceptual foundations of behavioral research methods. The methods submitted to critical examination are exploratory data analysis, statistical significance testing, Bayesian confirmation theory, meta-analysis, exploratory factor analysis, and causal modeling methods. The chapter concludes with a brief and selective consideration of directions that might be taken in future philosophical work on quantitative methods.

Quantitative Methods and Scientific Realism

The three major philosophies of science that bear on psychology are empiricism, social constructionism, and scientific realism (Greenwood,


1992; Manicas & Secord, 1983). Nineteenth century British empiricism had a major influence on the development of British statistics in the first half of the twentieth century (e.g., Mulaik, 1985). The statistical methods developed in that intellectual milieu remain an important part of psychology's statistical research practice. For example, Karl Pearson's product moment correlation coefficient was taken by its founder to be the quantitative expression of a causal relation viewed in empiricist terms. Similarly, Fisher's endorsement of inductive methods as the proper view of scientific method stemmed from a commitment to the empiricism of his day. Even in the current postpositivist philosophical climate, authors of research methods textbooks sometimes portray quantitative research as essentially positivist in its empiricist commitments (Yu, 2006). Among other things, positivism restricts its attention to what can be observed and regards theories as instruments that organize claims about observables but that do not explain them by appeal to hidden causes.

Qualitative methodologists also often bolster their preferred conception of qualitative research by comparing it with an unflattering positivist picture of quantitative research. They tend to adopt the philosophy of social constructionism, which is opposed to the traditional notions of truth, objectivity, and reason, maintaining that our understanding of the world is determined by social negotiation. In one or another of its various forms, it is the philosophy of choice for many qualitative researchers, and it tends to be employed by those who are opposed, or indifferent, to quantitative methods. I shall not consider it further in this chapter.

In what follows, I will adopt a scientific realist perspective on research methods. Although the subject of considerable debate, and opposed by many antirealist positions, scientific realism is the dominant philosophy of science today. It is also the tacit philosophy of most working scientists. This fact, combined with its current heavy emphasis on the nature of scientific practice, makes scientific realism a philosophy for science—not just a philosophy of science.

Scientific Realism

The philosophies of positivism, social constructionism, and scientific realism just mentioned are really family positions. This is especially true of scientific realism, which comes in many forms. Most versions of scientific realism display a commitment to at least two doctrines: (1) that there is a real world of which we are part and (2) that both the observable and unobservable features of that world can be known by the proper use of scientific methods. Some versions of scientific realism incorporate additional theses (e.g., the claims that truth is the primary aim of science and that successive theories more closely approximate the truth), and some will also nominate optional doctrines that may, but need not, be used by scientific realists (e.g., the claim that causal relations are relations of natural necessity; see Hooker, 1987). Others who opt for an "industrial strength" version of scientific realism for the physical sciences are more cautious about its successful reach in the behavioral sciences. Trout (1998), for example, subscribes to a modest realism in psychology, based on his skepticism about the discipline's ability to produce deeply informative theories like those of the physical sciences.

Given that this chapter is concerned with the philosophical foundations of quantitative methods, the remaining characterization of scientific realism will limit its attention to research methodology.

Scientific Realist Methodology

Scientific realism boasts a rich conception of methodology, which is of considerable help in understanding and guiding research. The resourcefulness of realist methodology is suggested in the following description of its major characteristics (see Hooker, 1987; Nickles, 1987). First, realist methodology has three major tasks: to describe how methods function; to evaluate methods critically against their rivals; and to recommend how to use particular methods to pursue chosen research goals.

Second, realist methodology is critically aim-oriented. At a broad level, it recommends the pursuit of valuable truth, explanatory understanding, and effective control as primary research goals; and it is concerned with the mutual adjustment of methods and research goals.

Third, realist methodology is naturalistic—that is, it is a substantive domain that uses the methods of the various sciences to study method itself. Proctor and Capaldi (2001) advocate a naturalistic approach to methodology in psychology in which the empirical justification of methodological ideas is emphasized.

A fourth feature of realist methodology is that it is both generative and consequentialist. Generative methodology involves reasoning to, and accepting, knowledge claims in question from warranted premises. Exploratory factor analysis is a prominent example of a method in psychology that involves


a generative justification of the factorial hypotheses to which it gives rise. By contrast, consequentialist methodology focuses on reasoning from knowledge claims in question to their testable consequences. The widely used hypothetico-deductive method, with its emphasis on predictive accuracy, clearly exhibits a consequentialist approach to justifying knowledge claims.

Fifth, realist methodology acknowledges the need for two quite different approaches to justifying knowledge claims. In philosophy these are commonly known as reliabilism and coherentism. With reliabilism, a belief is justified to the extent that it is acquired by reliable processes. In general, the innumerable methods that contribute to the detection of empirical phenomena are concerned with reliabilist justification. With coherentism, a belief is justified in virtue of its coherence with other beliefs. Thagard's (1992) theory of explanatory coherence, used for the comparative evaluation of scientific theories, embodies an illuminating coherentist perspective on knowledge justification. These two forms of justification are different, complementary, and of equal importance.

As a sixth feature, realist methodology regards science as a problem-oriented endeavor in which problems are conceptualized as constraints on their effective solution (Haig, 1987; Nickles, 1981). On this formulation, the constraints are actually constitutive of the problem itself; they characterize the problem and give it structure. Further, by including all the constraints in the problem's articulation, the problem enables the researcher to direct inquiry effectively by pointing the way to its own solution. In a real sense, stating the problem is half the solution!

Finally, realist methodology takes the researcher's makeup as a "knowing subject" seriously. Among other things, the researcher is regarded as a satisficer who makes heavy use of heuristics to guide her inquiries. For example, McGuire (1997) discusses many useful heuristics that can be employed to facilitate the generation of hypotheses in psychological research.

Scientific realist methodology undergirds a wide variety of methods, strategies, and heuristics that have been successfully used to produce worthwhile knowledge about both empirical phenomena and explanatory theories. If quantitative researchers in psychology engage this literature seriously, then they will find resources for enhancing their understanding of research methods.

I turn now to a philosophical consideration of the selected research methods.

Exploratory Data Analysis

In psychological research, the major emphasis in data analysis is placed on statistical inference, where the task is to find out whether a data set exhibits a designated feature of interest characterized with reference to a probabilistic model. Unfortunately, the dominance of this goal has had the effect of discouraging a concerted examination of data sets in terms of their quality and structure. Detailed explorations of data are important in science, and it often makes good sense to conduct them instead of testing a probabilistic model, or before such a model is formulated and adopted.

Consistent with this emphasis on the close examination of data, the last 30 years have witnessed the strong development of an empirical, data-oriented approach to statistics. One important part of this movement is exploratory data analysis, which contrasts with the more familiar traditional statistical methods and their characteristic emphasis on the confirmation of knowledge claims.

Exploratory Data Analysis and John Tukey

Spelling out a philosophy of exploratory data analysis is difficult, and few methodologists have attempted to do so (for an initial attempt to do this from a Bayesian perspective, see Good, 1983). However, the intellectual progenitor of modern exploratory data analysis, John Tukey, has developed a systematic perspective on the subject that has helped to highlight its importance to research. It deserves to be considered as a philosophy of data analysis in its own right. Therefore, this brief examination of the philosophy of exploratory data analysis pays particular attention to Tukey's thinking on the topic.

According to Tukey (1980), data analysis should be treated as a two-stage compound process in which the patterns in the data are first suggested by exploratory data analysis and then critically checked through the use of confirmatory data analysis procedures. Exploratory data analysis involves descriptive—and frequently quantitative—detective work designed to reveal structure or pattern in the data sets under scrutiny. The data analyst is encouraged to undertake an open-eyed investigation of the data and perform multiple analyses using a variety of intuitively appealing and easily used techniques.

The compendium of methods for the exploration of data, many of which were developed by Tukey (1977), is designed to facilitate both discovery and communication of information. These
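Tukey's two-stage process can be sketched in a small Python example (a hedged illustration with simulated data, not a procedure from this chapter): a pattern is suggested by open-ended exploration of one random half of a data set and then critically checked on the half that played no role in finding it.

```python
import random
import statistics

# Simulated (x, y) pairs with a genuine linear pattern plus noise.
random.seed(1)
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(40)]

# Split the data: one half for suggesting patterns, one half for checking them.
random.shuffle(data)
explore, confirm = data[:20], data[20:]

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in pairs)
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Stage 1: exploration suggests a strong linear association ...
r_explore = corr(explore)
# Stage 2: ... which is then checked on data not used to find it.
r_confirm = corr(confirm)
print(round(r_explore, 3), round(r_confirm, 3))
```

Splitting the data is only one way to keep the confirmatory check honest; the point is that the pattern-suggesting and pattern-checking stages draw on different evidence.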


methods are concerned with the effective organization of data, the construction of graphical and semi-graphical displays, and the examination of distributional assumptions and functional dependencies. Two additional attractive features of Tukey's methods are their robustness to changes in underlying distributions and their resistance to outliers in data sets. Exploratory methods with these two features are particularly suited to data analysis in psychology, where researchers are frequently confronted with ad hoc sets of data on amenable variables, which have been acquired in convenient circumstances.

Exploratory Data Analysis and Scientific Method

In his writings on data analysis, Tukey (1969) has emphasized the related ideas that psychology is without an agreed-upon model of data analysis and that we need to think more broadly about scientific inquiry. In an invited address to the American Psychological Association in 1968, Tukey presented the following excerpt from a prominent psychologist for his audience to ponder. I quote in part:

I have the feeling that psychology is currently without a dominant viewpoint concerning a model for data analysis. In the forties and early fifties, a hypothetico-deductive framework was popular, and our mentors were keen on urging the design of "crucial" experiments for the refutation of specific predictions made from one or another theory. Inductive empiricism was said to be disorderly and inefficient. You and I knew then, as we know now, that no one approach is uniformly most powerful. (Tukey, 1969, p. 90)

Consider the hypothetico-deductive and inductive conceptions of scientific methods, which are mentioned here as candidate models for data analysis. Most psychological researchers continue to undertake their research within the confines of the hypothetico-deductive method. Witness their heavy preoccupation with theory testing, where confirmatory data analyses are conducted on limited sets of data gathered in accord with the dictates of the test predictions of theories. In this regard, psychologists frequently employ tests of statistical significance to obtain binary decisions about the credibility of the null hypothesis and its substantive alternatives. However, the use of statistical significance tests in this way strongly blunts our ability to look for more interesting patterns in the data. Indeed, the continued neglect of exploratory data analysis in psychology is due in good part to the fact that there is no acknowledged place for such work in the hypothetico-deductive conception of inquiry (Wilkinson & the Task Force on Statistical Inference, 1999).

I think the worth of the inductive method as a model for data analysis is dismissed too quickly in the above quotation. The major failing of the inductive account of scientific method lies not so much with its perspective on data analysis, but with its prohibition of the formulation of explanatory theories. A modern conception of inductive method is embedded in the important scientific process of phenomena detection. Phenomena are relatively stable, recurrent, general features of the world that we seek to explain (Woodward, 1989), and their detection frequently involves an inductive process of empirical generalization. With its emphasis on phenomena detection, inductive method reserves an important place for the exploratory analysis of data. In detecting phenomena, one is concerned to extract a signal from the noise of data, and for this the intensive search of large amounts of data is frequently essential. This is because securing a heavy information yield from our data is likely to throw up potentially interesting data patterns that might turn out to be genuine phenomena. In this context, data mining is encouraged, and the capabilities of exploratory techniques in this regard often make them the appropriate methods of choice.

By contrast, Behrens and Yu (2003) suggest that the inferential foundations of exploratory data analysis are to be found in the idea of abduction, or explanation (and, by implication, not in the notions of hypothetico-deductive testing and inductive generalization). However, exploratory data analysis is a descriptive pattern-detection process that is a precursor to the inductive generalizations involved in phenomena detection. As will be seen later in the consideration of exploratory factor analysis, abductive inference is reserved for the construction of causal explanatory theories that are introduced to explain empirical phenomena. Behrens and Yu's suggestion conflates the quite different ideas of descriptive and explanatory inference.

Exploratory Data Analysis and a Model of Data Analysis

In the spirit of Tukey's (1962, 1980) push for breadth of vision in data analysis, one might usefully take a perspective on data analysis that extends Tukey's two-stage model (Haig, 2005b). Before exploring data for patterns of potential interest, researchers should assiduously screen their data for
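The contrast between binary significance testing and exploratory summarization can be made concrete with a toy Python sketch (hypothetical numbers; the cutoff 2.1 is a rough two-tailed 5% critical value for about 14 degrees of freedom): the significance test reduces two samples to a single binary verdict, while even minimal exploratory summaries keep the pattern itself in view.

```python
import statistics

# Hypothetical scores for two groups (made-up numbers for illustration only).
a = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]
b = [4.6, 4.4, 4.8, 4.5, 4.7, 4.3, 4.6, 4.5]

# Significance-testing route: the data collapse to one binary verdict.
t = (statistics.mean(b) - statistics.mean(a)) / (
    (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
)
decision = "reject H0" if abs(t) > 2.1 else "retain H0"  # 2.1: rough 5% cutoff

# Exploratory route: simple min/median/max summaries keep the pattern visible.
summary = {name: (min(xs), statistics.median(xs), max(xs))
           for name, xs in [("a", a), ("b", b)]}
print(decision, summary)
```

Nothing here is wrong with the test as such; the point of the passage is that stopping at the binary decision leaves the structure of the data unexamined.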
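The robustness and resistance just mentioned can be seen in a minimal Python sketch (made-up numbers): a single wild value drags the mean and inflates the standard deviation, while the median and interquartile range, summaries of the kind Tukey favored, are barely moved.

```python
import statistics

# Made-up scores; the contaminated set swaps in one gross outlier
# (e.g., a data-entry error) for the last value.
clean = [12, 14, 15, 15, 16, 17, 18, 19]
contaminated = clean[:-1] + [190]

def summaries(xs):
    """Non-resistant (mean, sd) beside resistant (median, IQR) summaries."""
    q = statistics.quantiles(xs, n=4)  # quartiles (default exclusive method)
    return {
        "mean": statistics.mean(xs),
        "sd": statistics.stdev(xs),
        "median": statistics.median(xs),
        "iqr": q[2] - q[0],
    }

# The outlier shifts the mean and inflates the sd dramatically,
# while the median and IQR barely move.
print(summaries(clean))
print(summaries(contaminated))
```

This resistance is why such summaries suit the ad hoc, conveniently acquired data sets the passage describes: one aberrant case cannot dominate the description.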

psychological research occurs in good part because their quality. This initial data analysis involves

10 the philosophy of quantitative methods

checking for the accuracy of data entries, identifying and dealing with missing and outlying data, and examining the data for their fit to the assumptions of the data analytic methods to be used. This important, and time-consuming, preparatory phase of data analysis has not received the amount of explicit attention that it deserves in behavioral science education and research practice. Fidell and Tabachnick (2003) provide a useful overview of the task and techniques of initial data analysis.

Confirmation of the initial data patterns suggested by exploratory data analysis is a “just checking” strategy and as such should be regarded as a process of close replication. However, it is essential to go further and undertake constructive replications to ascertain the extent to which results hold across different methods, treatments, subjects, and occasions. Seeking results that are reproducible through constructive replications requires data analytic strategies that are designed to achieve significant sameness rather than significant difference (Ehrenberg & Bound, 1993). Exploratory data analysis, then, can usefully be regarded as the second in a four-stage sequence of activities that, in turn, attend to data quality, pattern suggestion, pattern confirmation, and generalization.

Resampling Methods and Reliabilist Justification

Since the 1980s, statisticians have been able to exploit the massive computational power of the modern computer and develop a number of computer-intensive resampling methods, such as the jackknife, the bootstrap, and cross-validation (Efron & Tibshirani, 1993). These methods constitute one important set of confirmatory procedures that are well suited to the task of checking on the data patterns thrown up by exploratory data analysis. By exploiting the computer’s computational power, these resampling methods free us from the restrictive assumptions of modern statistical theory, such as the belief that the data are normally distributed, and permit us to gauge the reliability of chosen statistics by making thousands, even millions, of calculations on many data points.

It is important to appreciate that the resampling methods just mentioned make use of a reliabilist approach to justification. Here, the reliability checks on emergent data patterns are provided by consistency of test outcomes, which are time-honored validating strategies. Our willingness to accept the results of such checks is in accord with what Thagard (1992) calls the principle of data priority. This

principle asserts that statements about observational data, including empirical generalizations, have a degree of acceptability on their own. Such claims are not indubitable, but they do stand by themselves better than claims justified solely in terms of what they explain. What justifies the provisional acceptance of data statements is that they have been achieved by reliable methods; what strengthens our provisional belief in the patterns thrown up by exploratory data analysis is their confirmation through use of computer-based resampling methods.

Further, it is important to appreciate that the acceptability of claims provided by the reliabilist justification of computer-intensive resampling methods can be enhanced by making appropriate use of a coherentist approach to justification. One important form of coherence is explanatory coherence, and one method that delivers judgments of explanatory coherence is the theory of explanatory coherence (Thagard, 1992). According to this theory, data claims, including empirical generalizations, receive additional justification if and when they enter into, and cohere with, the explanatory relations of the theory that explains them.

A Philosophy for Teaching Data Analysis

An underappreciated, but important, feature of Tukey’s writings on exploratory data analysis is the illuminating remarks on the teaching of data analysis that they contain. These remarks can be assembled into a constructive philosophy for teaching data analysis, which can properly be regarded as an aspect of an overall philosophy of exploratory data analysis. This philosophy of teaching advises us to think about and teach data analysis in a way that is quite different from the prevailing custom.

Provocatively, Tukey (1980) maintained that the proper role of statistics teachers is to teach that which is most difficult and leave that which is more manageable to good textbooks and computers. He recommended teaching data analysis the way he understood biochemistry was taught, concentrating on what the discipline of statistics has learned, perhaps with a discussion of how such things were learned. The detail of methods should be assigned to laboratory work, and the practice of learning data analytic techniques should be assigned to a different course in which problems arose. He foresaw that such a redirection in teaching data analysis would have to be introduced in phases. In Tukey’s (1962) words, “The proposal is really to go in the opposite direction from cookbookery; to teach not

haig 11

‘what to do,’ nor ‘how we learned what to do,’ but rather, ‘what we have learned’” (p. 63). This advice is broadly consistent with the idea that we should teach research methods in terms of their accompanying methodology, a recommendation considered at the end of this chapter.

Another prominent feature of Tukey’s philosophy of teaching data analysis is his recommendation that we should teach both exploratory and confirmatory data analysis and that we have an obligation to do so. Tukey’s strong promotion of the value of exploratory data analysis was intended as a counter to the dominance of confirmatory data analysis in statistical practice. However, for Tukey, exploratory data analysis was not to be understood as more important than confirmatory data analysis because both are essential to good data analysis.

Tukey also suggested that exploratory data analysis should probably be taught before confirmatory data analysis. There are several reasons why this recommendation has merit. Properly taught, exploratory data analysis is probably easier to learn, and it promotes a healthy attitude to data analysis (encouraging one to be a dataphile without becoming a data junkie). It requires the investigator to get close to the data, analyze them in various ways, and seek to extract as much potentially important information from the data as possible. This is done to detect indicative patterns in the data before establishing through confirmatory data analysis that they are genuine patterns.

Tukey emphasized that learning exploratory data analysis centrally involves acquiring an appropriate attitude toward the data, which includes the following elements: exploratory data analysis is sufficiently important to be given a great deal of time; exploratory data analysis should be carried out flexibly with multiple analyses being performed; and exploratory data analysis should employ a multiplicity of methods that enhance visual display.

Statistical Significance Testing

It is well known that tests of statistical significance are the most widely used methods for evaluating hypotheses in psychology (e.g., Hubbard & Ryan, 2000). These tests have been popular in psychology for nearly 50 years and in statistics for about 75 years. Since the 1960s, there has developed a massive critical literature in psychology regarding their worth. Important early contributions to this debate are collected in Morrison and Henkel (1970; see also Giere, 1972). Cohen (1994) provides a short perceptive review of the controversy, whereas Nickerson (2000)

has undertaken a useful extensive review of the controversy since its beginning. Despite the plethora of critiques of statistical significance testing, most psychologists understand them poorly, frequently use them inappropriately, and pay little attention to the controversy they have generated (Gigerenzer, Krauss, & Vitouch, 2004).

The significance testing controversy is multifaceted. This section will limit its attention to a consideration of the two major schools of significance testing, their hybridization and its defects, and the appropriateness of testing scientific hypotheses and theories using tests of statistical significance.

Psychologists tend to assume that there is a single unified theory of tests of statistical significance. However, there are two major schools of thought regarding significance tests: Fisherian and Neyman-Pearson. Initially, Neyman and Egon Pearson sought to build on and improve Fisher’s theory, but they subsequently developed their own theory as an alternative to Fisher’s theory. There are many points of difference between the two schools, which adopt fundamentally different outlooks on the nature of scientific method. The uncritical combination of the two schools in psychology has led to a confused understanding of tests of statistical significance and to their misuse in research.

The Fisherian Significance Testing School

The Fisherian school of significance testing (e.g., Fisher, 1925) tests a hypothesis or theory of substantive interest against the null hypothesis that the experimental effect to be demonstrated is in fact absent. Fisher argued that an experiment is performed solely to give the data an opportunity to disprove the null hypothesis. No alternative hypothesis is specified, and the null hypothesis is the hypothesis to be nullified; it need not be the hypothesis of zero difference. Because one cannot accept the null hypothesis, no provision is made for Type II error, and relatedly, there is no place for a statistical concept of power. Most importantly, and as noted earlier, Fisher subscribed to an inductive conception of scientific method and maintained that significance tests are vehicles of inductive reasoning. As such they are concerned with evidence for beliefs.

Should We Use Fisher’s Significance Tests?

The question of whether behavioral scientists should use Fisherian significance tests as a defensible form of hypothesis testing largely centers on whether p-values are good measures of scientific evidence.
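Before turning to the arguments on each side, the point at issue can be made concrete with a minimal simulation sketch (the sample size, means, and standard deviation below are illustrative assumptions, not values from the chapter). It estimates a two-sided p-value for a sample mean by counting how often data at least as extreme as those observed would arise if the null hypothesis were true:

```python
import random
import statistics

random.seed(1)

# Null hypothesis: population mean is 50 (standard deviation assumed known, 10).
# Observed: a sample of n = 30 with mean 52.
null_mean, sd, n = 50.0, 10.0, 30
observed_mean = 52.0

# Simulate the sampling distribution of the mean under the null and count
# samples at least as far from the null mean as the observed sample.
reps = 20_000
extreme = 0
for _ in range(reps):
    sample_mean = statistics.mean(random.gauss(null_mean, sd) for _ in range(n))
    if abs(sample_mean - null_mean) >= abs(observed_mean - null_mean):
        extreme += 1

p_value = extreme / reps
# p_value approximates Pr(data at least this extreme | null true), about 0.27 here.
# It is NOT Pr(null true | data); that quantity would require a prior probability
# for the null, combined via Bayes' theorem.
print(f"two-sided p = {p_value:.3f}")
```

Under a true null, p-values produced this way are spread uniformly between 0 and 1, so a single p-value conditions on the null rather than saying anything direct about the null's own credibility, which is the distinction at stake in the criticisms that follow.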


Although many psychological researchers think, and some methodologists argue, that p-values can be used to measure strength of evidence, others hold them to be deeply problematic in this respect (e.g., Royall, 1997; Hubbard & Lindsay, 2008).

Some have argued that the concept of evidence adopted by Fisher is defective. For one thing, it is widely agreed by philosophers of science that theory or hypothesis evaluation is a comparative affair in which evidence against one hypothesis is evidence for another hypothesis (e.g., Sober, 2008). However, as just noted, Fisher only countenanced the null, and without an alternative hypothesis with which to compare the null, the logic of his significance testing is importantly incomplete; one cannot have evidence against the null without it being evidence for another hypothesis. The idea that one might allow for alternative hypotheses by joining Fisher’s perspective with that of Neyman and Pearson will be seen by many to make matters worse. As will be noted shortly, Neyman and Pearson were concerned with the reliability of errors in decision making in the long run rather than with evidence for believing hypotheses in a particular experiment.

Others contend that a further major problem with Fisher’s p-value is that it doesn’t measure evidence properly (e.g., Goodman, 1993). In this regard, four claims about the p-value are thought to disqualify it as a proper measure of evidence. First, the p-value is not a direct measure of the probability that the null is false; it is a conditional probability of obtaining the data, calculated on the assumption that the null hypothesis is true. Second, to take a number that quantifies rare data under the null is to confuse the strength of evidence with the probability of its occurrence (Royall, 1986). These two things are different because the probability is an indication of the long-run Type I error rate, which is separate from strength of evidence. Third, the calculation of a p-value combines the rarity of an obtained result with the probability of results that didn’t happen. As Jeffreys (1939) stated long ago, “What the use of P implies . . . is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred” (p. 136). Finally, it is claimed that the p-value can exaggerate the strength of evidence against the point null and small interval hypotheses (Berger & Sellke, 1987), which are frequently tested in psychology.

These criticisms are not convincing to all. For example, Hurlbert and Lombardi (2009) recently considered these arguments and recommend a shift in focus from the classical Fisherian framework to a

neo-Fisherian alternative. Key elements of this alternative are that the probability of Type I error is not specified, p-values are not misleadingly described as “significant” or “nonsignificant,” judgment is suspended about accepting the null hypothesis on the basis of high p-values, the “three-valued logic” that gives information about the direction of the effect being tested is adopted, accompanying effect size information is provided, and use is made of adjunct information such as confidence intervals, where appropriate. Two things to note here about this neo-Fisherian perspective are that it is concerned with significance assessments but not null hypothesis significance tests and that it is concerned with statistical tests as distinct from the tests of scientific hypotheses. There are empirical studies in psychology that approximate this modified Fisherian perspective on significance tests.

The Neyman-Pearson Hypothesis Testing School

Neyman and Pearson rejected Fisher’s notion of a significance test and its use of a threshold p-value as a basis for rejecting the null hypothesis. In this regard, they added the requirement of the specification of an alternative hypothesis as well as the null hypothesis, and they replaced Fisher’s evidential p-value with the Type I error rate, α (e.g., Neyman & Pearson, 1933). In addition, Neyman and Pearson permitted a more liberal formulation of the null hypothesis than did Fisher and regarded it as legitimate to speak of its acceptance. Thus, Type II error was admitted, and explicit provision was made for a statistical concept of power. To capture these differences, Neyman and Pearson spoke of their approach as hypothesis testing rather than significance testing.

However, the Neyman-Pearson school differs from the Fisherian school most fundamentally in maintaining that significance tests are rules of inductive behavior rather than vehicles for inductive reasoning. On this view, significance testing is regarded as a theory of prudential decision-making; accepting or rejecting a hypothesis amounts to adopting an appropriate course of action, rather than believing it to be probably true or false. At root, Neyman and Pearson held different views from Fisher about the nature of science.

Should We Use Neyman and Pearson’s Hypothesis Tests?

It might seem that by focusing on hypothesis testing and making provision for alternative hypotheses


and a statistical concept of power, Neyman and Pearson’s approach marks an improvement over Fisher’s significance tests. However, as noted earlier, Neyman and Pearson were concerned with measuring the reliability of errors in decision making in the long run rather than with evidence for believing hypotheses in a particular experiment; Type I and Type II error are objective probabilities, understood as long-run relative frequencies. As such, they belong to reference classes of endless series of trials that might have happened but never did. They are not about single events such as individual experiments. Further, being concerned with decision making understood as behavioral courses of action, Neyman-Pearson statistics do not measure strength of evidence for different hypotheses and thus do not tell us how confident we should be in our beliefs about those hypotheses.

It would seem, then, that the Neyman-Pearson approach is suitable for use only when the focus is on controlling errors in the long run (e.g., quality control experiments). However, this does not happen often in psychology.

The Hybrid

Although the Neyman-Pearson theory of testing is the official theory of statistical testing in the field of professional statistics, textbooks in the behavioral sciences ensure that researchers are instructed to indiscriminately adopt a hybrid account of tests of statistical significance, one that is essentially Fisherian in its logic but that is often couched in the decision-theoretic language of Neyman and Pearson. The hybrid logic is a confused and inconsistent amalgam of the two different schools of thought (Acree, 1979; Gigerenzer, 1993; Spielman, 1974; but see Lehmann, 1993, for the suggestion that the best elements of both can be combined in a unified position). To the bare bones of Fisherian logic, the hybrid adds the notion of Type II error (opposed by Fisher) and the associated notion of statistical power (which Fisher thought could not be quantified) but only at the level of rhetoric (thereby ignoring Neyman and Pearson), while giving a behavioral interpretation of both Type I and Type II errors (vigorously opposed by Fisher)! Because the authors of statistical textbooks in the behavioral sciences tend to present hybrid accounts of significance testing, aspiring researchers in these sciences almost always acquire a confused understanding of such tests. It is most unfortunate that many writers of statistics textbooks in the behavioral sciences have unwittingly perpetuated these basic misunderstandings.

To make matters worse, this confusion is compounded by a tendency of psychologists to misrepresent the cognitive accomplishments of significance tests in a number of ways. For example, levels of statistical significance are taken as measures of confidence in research hypotheses, likelihood information is taken as a gauge of the credibility of the hypotheses being tested, and reported levels of significance are taken as measures of the replicability of findings (Gigerenzer, Krauss, & Vitouch, 2004).

Significance Tests and Theory Testing

Meehl (1967, 1978, 1997) has made one of the strongest criticisms of the use of tests of statistical significance in psychology. He argued that the widespread use of tests of statistical significance to test substantive hypotheses and theories is deeply flawed because the support for a hypothesis or theory obtained by rejecting the null hypothesis is very weak.

Sometimes psychological researchers test a hypothesis of substantive interest against the point null hypothesis that the difference between the relevant population parameters is exactly zero. But a fact, long known to professional statisticians and appreciated by Meehl, is that the point null hypothesis is virtually always false in the behavioral and social sciences. The reason for this is that in these sciences, most things are related to one another at least to some small extent. In many parts of psychology, “everything in the brain is connected with everything else,” resulting in a large positive manifold in which many variables correlate positively with one another to a significant degree. Thus, in the “softer” precincts of psychology, where “true” experiments are often not possible, obtaining a reasonable sample size makes the achievement of a statistically significant result the likely outcome of an empirical study. Meehl (1967) reasoned that if the null hypothesis of zero group differences is almost always false, then with sufficient power, directional hypotheses in these parts of psychology have a 50:50 chance of achieving statistical significance! Meehl and Lykken provided some empirical evidence for this claim more than 40 years ago (Meehl, 1967; see also Meehl, 1997). A recent simulation study on real data carried out by Waller (2004) confirmed Meehl’s claim.

One can better appreciate what is wrong with using tests of statistical significance to appraise psychological theories by considering the logic involved in such testing. It is helpful to begin by observing the important distinction between scientific


hypotheses and the statistical hypotheses that may be derived from them (Bolles, 1962). Often in psychology, scientific theories about psychological processes or structures are formulated, and then statistical hypotheses are derived to facilitate their empirical testing. The former will characteristically invoke causal mechanisms for explanatory purposes, whereas the latter will appeal to statistical tests of null hypotheses about the population parameters of observed variables. Meehl has argued that psychological researchers tend to conflate the substantive theory and the statistical hypothesis and unwarrantedly take the successful refutation of the null hypothesis as grounds for concluding that the substantive theory has been strongly confirmed. However, if we have good grounds for believing that the point null hypothesis is probably false at the outset, and we use the null as the observation hurdle for our theories to surmount, then support for a theory by rejecting this implausible null alternative is quite feeble.

For good reason, then, Meehl (1990) has urged psychologists to abandon tests of statistical significance for purposes of substantive theory testing in psychology. He suggests that psychologists should replace them with a strategy that is adapted from the philosopher of science, Imre Lakatos (1970), who argued against Popper’s strict falsificationist position on theory testing for the comparative theory appraisal of research programs over time. Meehl has maintained that one should defend and amend a theory only if it has a good track record of successful or near-miss predictions of low prior probability.

In conclusion, it is clear that there are fundamental philosophical differences between the Fisherian and Neyman-Pearson schools of statistical thought. Fisher’s statistical contributions can be seen as a deliberate attempt to develop an objective alternative to Bayesian statistical thinking popular in Europe at the time, whereas those of Neyman and Pearson can be seen as an attempt to develop a position that is even more objective than Fisher’s. Nevertheless, Bayesian thinking today is an attractive option for many statisticians who hold misgivings about one or both of these schools of thought. It is also the focus of much attention in the philosophy of science. It is to the elements of the Bayesian position that we now turn.

Bayesian Confirmation Theory

What is it for empirical evidence to provide confirmation or disconfirmation of a scientific hypothesis or theory? Methodologists of science have worked

long and hard to answer this important and challenging question by developing theories of scientific confirmation. Despite the considerable fruits of their labors, there is widespread disagreement about which theory of confirmation we should accept. In recent times, a large number of philosophers of science have contributed to Bayesian confirmation theory (e.g., Earman, 1992; Howson & Urbach, 2006). Many philosophical methodologists now believe that Bayesianism, including Bayesian philosophy of science, holds the best hope for building a comprehensive and unified theory of scientific inference.

Bayesianism is a comprehensive position. It comprises a theory of statistical inference, an account of scientific method, and a perspective on a variety of challenging methodological issues. Today, it also boasts a fully fledged philosophy of science. In this section, attention is limited to a consideration of the strengths and weaknesses of Bayesian statistical inference, the ability of Bayesian confirmation theory to improve upon the hypothetico-deductive method, and the question of whether Bayesianism provides an illuminating account of the approach to theory evaluation known as inference to the best explanation.

Bayesian Statistical Inference

The Bayesian approach to statistical inference is so called because it makes central use of a theorem of the mathematical calculus of probability known as Bayes’ theorem. This theorem can be written in a simple form as:

Pr(H/D) = [Pr(H) × Pr(D/H)] / Pr(D)

With the proviso that Pr(D) and Pr(H) cannot be zero, the theorem says that the posterior probability of the hypothesis is obtained by multiplying the prior probability of the hypothesis by the probability of the data, given the hypothesis (the likelihood), and dividing the product by the prior probability of the data. It is through use of this and other versions of Bayes’ theorem that Bayesians are able to implement their view of statistical inference, which is the orderly revision of opinion in the light of new information.

For Bayesians, a couple of features of this gloss on Bayesian statistical inference recommend themselves. Most importantly, the Bayesian approach squares with the stated purpose of scientific inquiry noted above—namely, securing the probability of a hypothesis in the light of the relevant evidence. The


informational output of a traditional test of significance is the probability of the data, given the truth of our hypothesis, but it is just one input in the Bayesian scheme of things. A second stated desirable feature of the Bayesian view is its willingness to make use of relevant information about the hypothesis before the empirical investigation is conducted and new data are obtained, explicitly in the form of a prior probability estimate of our hypothesis. Traditional tests of statistical significance are premised on the assumption that inferences should be based solely on present data, without any regard for what we might bring to a study in the way of belief or knowledge about the hypothesis to be tested—a position that Bayesians contend is hardly designed to maximize our chances of learning from experience.

To achieve their goal of the systematic revision of opinion on the basis of new information, Bayesians are able to employ Bayes’ theorem iteratively. Having obtained a posterior probability assignment for their hypothesis via Bayes’ theorem, they can then go on and use that posterior probability as the new prior probability in a further use of Bayes’ theorem designed to yield a revised posterior probability, and so on. In this way, the Bayesian researcher learns from experience.

Criticisms of Bayesian Hypothesis Testing

Although my consideration of the merits of conventional significance tests and their Bayesian alternative is both sketchy and selective, some readers will sense that the Bayesian view provides an attractive alternative to the traditional approach, particularly when the latter assumes its hybrid form. However, as with all theories of confirmation, the Bayesian approach has come in for its share of criticism. These criticisms have tended to focus on the alleged problematic nature of prior probabilities. In this regard, it is objected that because Bayesians adopt a subjectivist conception of probability and resort to personal estimates of the prior probabilities of their hypotheses, they introduce an ineliminable, but highly undesirable, subjective element into their calculations. To this objection, the Bayesians have two plausible replies: they can concede that personal estimates of prior probabilities are subjective, that they may differ markedly from person to person, and that they are often very rough estimates, and then go on to point out that when prior estimates err, they are brought into line by freshly obtained sets of data; or, they may appeal to the failure of strictly empiricist theories of confirmation, which hold that one may obtain an adequate test of a hypothesis solely

on the basis of evidence and logic, and assert that in real-life situations, there is no alternative to relying on a subjective component in our testing efforts.

In deciding whether to adopt a Bayesian position on statistical inference, it should be kept in mind that one does not have to embrace a general Bayesian theory of scientific confirmation rather than, say, the hypothetico-deductive alternative. One might be a Bayesian when dealing with problems of statistical inference but remain wedded to a general hypothetico-deductive conception of scientific method. Or, more plausibly, one might employ Bayesian statistical methods when concerned with inferential problems about hypotheses for which we have the relevant probabilistic information, but adopt a non-probabilistic account of theory evaluation such as Thagard’s theory of explanatory coherence, which will be referred to later in the chapter. The general point to be made here is that Bayes’ theorem can help us deal with some problems of statistical inference, but clearly, a great deal of scientific work will be done with the use of other methods—some of them statistical and some of them not.

Bayesianism and the Hypothetico-Deductive Method

One of the clear achievements of Bayesianism is its ability to improve on the unsatisfactory approach to hypothesis and theory appraisal taken by the hypothetico-deductive method. The hypothetico-deductive method has long been the method of choice for the evaluation of scientific theories (Laudan, 1981), and it continues to have a dominant place in psychology. Despite its popularity, it is usually characterized in an austere manner: The researcher takes a hypothesis or theory of interest and tests it indirectly by deriving from it one or more observational predictions that are themselves directly tested. Predictions borne out by the data are taken to confirm the theory to some degree; those predictions that do not square with the data count as disconfirming instances of the theory. Normally, the theory is not compared with rival theories in respect of the data, only with the data themselves.

The hypothetico-deductive method, in something like this form, has been strongly criticized by methodologists on a number of counts (e.g., Glymour, 1980; Rozeboom, 1997). One major criticism of the method is that it is confirmationally lax. This laxity arises from the fact that any positive confirming instance of a hypothesis submitted to empirical test can confirm any hypothesis that is

16 the philosophy of quantitative methods

conjoined with the test hypothesis, regardless of how plausible it might be. This state of affairs is known as the fallacy of irrelevant conjunction, or the tacking problem, because confirmation of a test hypothesis also confirms any conjunct that is attached to the test hypothesis. The fallacy of irrelevant conjunction arises with the hypothetico-deductive method because predictions are deduced from hypotheses only by making use of auxiliary hypotheses drawn from background knowledge.

Clearly, this is an unacceptable state of affairs. Bayesians have challenged the assumption that the occurrence of the consequences of a theory confirms the theory and its conjuncts holistically. They argue that the Bayesian approach enables the differential support of the elements of a theory, specifying conditions showing that E never increases the probability of H conjoined with any additional hypothesis by more than it increases the probability of H.

Another major criticism of the hypothetico-deductive method is that it tests a single hypothesis or theory of interest against the empirical evidence; it does not test a hypothesis or theory in relation to rivals in respect of the evidence. This is held to be a major flaw because it is widely agreed that theory evaluation is a comparative affair involving simultaneous evaluation of two or more hypotheses or theories.

The comparative nature of theory evaluation is straightforwardly handled by the Bayesian position by rewriting the simple form of Bayes' theorem given earlier to deal with two or more hypotheses. Here, Bayes' theorem is presented for the case of two hypotheses, where the theorem can be written for each hypothesis in turn. For the first hypothesis,

Pr(H1/D) = Pr(H1) × Pr(D/H1) / [Pr(H2) × Pr(D/H2) + Pr(H1) × Pr(D/H1)]

This says that the posterior probability of the first hypothesis is obtained by multiplying its prior probability by the probability of the data, given that hypothesis (the likelihood), and dividing the product by the value that results from adding the prior probability of the second hypothesis, multiplied by the likelihood for that hypothesis, to the prior probability of the first hypothesis, multiplied by its likelihood. Bayes' theorem for the second hypothesis is written in a similar way.

Bayesianism and Inference to the Best Explanation

Recently, some Bayesians have claimed that their perspective on scientific method can also provide an enhanced characterization of the important approach to theory evaluation known as inference to the best explanation. Inference to the best explanation is based on the belief that much of what we know about the world is based on considerations of explanatory worth. In contrast to the Bayesian approach, accounts of inference to the best explanation take theory evaluation to be a qualitative exercise that focuses on explanatory criteria rather than a quantitative undertaking in which one assigns probabilities to theories (Haig, 2009; Thagard, 1992).

Although inference to the best explanation has typically been regarded as a competitor for Bayesian theory evaluation, Lipton (2004) has recently argued that the two approaches are broadly compatible and that, in fact, their proponents "should be friends." In broad terms, he suggests that judgments of the loveliest explanation, which are provided by the evaluative criteria of inference to the best explanation, such as unificatory power, precision, and elaboration of explanatory mechanisms, contribute to assessments of the likeliest explanation, which are provided by the probabilities of the Bayesian approach. Specifically, Lipton maintains that the explanatory considerations invoked in inference to the best explanation guide determination of the prior probabilities (and the likelihoods) that are inserted in Bayes' theorem.

However, although appeal to explanatory matters might be one way in which Bayesians can determine their prior probabilities, Lipton does not suggest how this might be done. Further, those who hold inference to the best explanation to be a normative approach to scientific theory evaluation, with its own distinctive character, will worry that Lipton relegates it to a descriptive role within a Bayesian normative framework (e.g., Psillos, 2004).

Another way of showing the compatibility of inference to the best explanation and Bayesianism is to translate the evaluative criteria employed within inference to the best explanation into probabilistic terms. McGrew (2003) has done this by taking the important theoretical virtue of consilience, or explanatory breadth, and showing that its Bayesian form leads to higher posterior probabilities of the hypotheses being evaluated. Nevertheless, McGrew has acknowledged that by translating consilience into its "flattened" probabilistic form, it no longer

haig 17

remains a genuine explanatory virtue. Not only is there no guarantee that consilience will be concerned with an explanation of the evidence, there is no way that probabilistic translations of the explanatory virtues can refer to the causal connections that are often appealed to in scientific explanations. Further, Weisberg (2009) has recently argued that the explanatory loss incurred in such translations will occur for any distinctively explanatory virtue that is given such probabilistic treatment. In short, it would seem that Bayesianism cannot capture the intuitively important notion of explanatory power without significant loss.

What Should We Think About Bayesianism?

Philosophical assessments of the worth of Bayesianism range from claims that it is without peer as a theory of scientific reasoning to the view that it is fundamentally wrong-headed. Howson and Urbach (2006) exemplify the former view, claiming that scientific reasoning is both inductive and probabilistic and that the axioms of probability suffice to articulate such reasoning. The latter view is exemplified by Bunge (2008), who has argued that Bayesianism is fundamentally wrong for three reasons: (1) it assigns probabilities to statements rather than treating probabilities as objective features of the world; (2) it conceives of probabilities as subjective; and (3) it appeals to probabilities in the absence of randomness.

To add to this mix of views, many statisticians take Bayesian statistical inference to be a superior alternative to classical statistical inference, for the reasons stated earlier. Finally, some advocates of Bayesianism see it as a comprehensive theory of confirmation, whereas others see it as having only context-specific application.

The difficulties of deciding just what to think about Bayesianism are captured well by the ambivalence of John Earman (1992), a Bayesian philosopher of science. He confesses to being an enthusiastic Bayesian on Mondays, Wednesdays, and Fridays. But on Tuesdays, Thursdays, and Saturdays, he holds doubts about the totalizing ambitions of Bayesianism and, indeed, about whether it can serve as a proper basis for scientific inference. Faced with such difficulty, it is probably prudent to settle for a contextual application of Bayesian thinking, as indicated earlier in this section. For example, in particular domains such as medical diagnosis, where the relevant probabilistic information is often available, scientists sometimes appeal to the Bayesian corpus to justify the selective use of its methods. By contrast, in domains where the evaluation of explanatory hypotheses and theories is of primary concern, scientists have, for good reason, often employed something like inference to the best explanation. Like it or not, the intending Bayesian scientist will have to consult the relevant philosophical literature, among other methodological literatures, to furnish an informed justification for their Bayesian practices.

Meta-Analysis

In the space of three decades, meta-analysis has become a prominent methodology in behavioral science research, with the major developments coming from the fields of education and psychology (Glass, McGaw, & Smith, 1981; Hedges & Olkin, 1985; Hunter & Schmidt, 2004). Meta-analysis is an approach to data analysis that involves the quantitative analysis of the data analyses of primary empirical studies; hence the term meta-analysis, coined by Glass (1976). Meta-analysis, which comes in a variety of forms (Bangert-Drowns, 1986), is concerned with the statistical analyses of the results from many individual studies in a given domain for the purpose of integrating or synthesizing those research findings.

The following selective treatment of meta-analysis considers its possible roles in scientific explanation and evaluation research before critically examining one extended argument for the conclusion that meta-analysis is premised on a faulty conception of science.

Meta-Analysis and Explanation

Meta-analysis is a prominent example of a distinctive use of statistical methods by behavioral scientists to aid in the detection of empirical phenomena. By calculating effect sizes across primary studies in a common domain, meta-analysis helps us detect robust empirical generalizations (cf. Schmidt, 1992). By using statistical methods to ascertain the existence of such regularities, meta-analysis can be usefully viewed as the statistical analog of direct experimental replication. It is in this role that meta-analysis currently performs its most important work in science.

However, given that the detection of empirical phenomena and the construction of explanatory theories are quite different research tasks, the recent suggestion that meta-analysis can directly contribute to the construction of explanatory theory (Cook et al., 1992; Schmidt, 1993) is an arresting methodological claim. In approving this


extension of meta-analysis beyond a concern with phenomena detection, Schmidt has acknowledged that scientific explanation normally involves the causal explanation of observed phenomena. Nevertheless, he maintains that it is appropriate to take scientific explanation to include "all research processes that contribute ultimately to theory building, including the first step of determining what the relationships are among important variables or constructs and how stable these relationships are" (Schmidt, 1993, p. 1164). Thus, the demonstration of a general effect, such as the pervasive influence of psycho-educational treatments on adult surgical patients, is deemed to be a meta-analysis at the "lowest level of explanation." On the other hand, the use of meta-analysis to test competing theories of how patients cope with the stress of surgery is viewed as higher level explanatory meta-analysis.

However, this attempt to extend the role of meta-analytic methods beyond phenomena detection to explanation obscures the basic methodological distinction between phenomena detection and scientific explanation. As noted earlier in the chapter, the stable general effects gleaned from meta-analysis are empirical phenomena, and statements about phenomena are the objects of scientific explanations; they are not the explanations themselves. The question, "What do statements of empirical phenomena explain?" occasions no natural reply. This is not surprising, for the successful detection of phenomena is essentially a descriptive achievement that involves investigative practices that are, for the most part, quite different from explanatory endeavors. In psychology, these methods are often statistical in kind. By contrast, scientific explanation is often causal-mechanistic in nature (Salmon, 1984). On this view, explanation requires the identification of the mechanisms that underlie and give rise to empirical phenomena, along with a detailing of the ways in which those mechanisms produce the phenomena we seek to understand.

When meta-analysis enters into the process of testing explanatory theories, it contributes to an evaluation of those theories in terms of predictive success. However, this common strategy for evaluating scientific theories is not directly concerned with their explanatory adequacy. To repeat, it is not being denied that meta-analytic methods can be employed in the course of testing theories, but meta-analysis itself is not an approach to theory testing (Chow, 1996). To employ meta-analysis to assist in the predictive testing of an explanatory theory does not thereby confer an explanatory role on meta-analysis itself. One does not assign explanatory status simply on the basis of association.

Meta-Analysis and Evaluative Inquiry

It is surprising that methodological discussions of meta-analysis and its applications have shown little regard for the rationale that Glass originally provided for its use. Glass claims that many researchers misunderstand meta-analyses of outcome research because they fail to take cognizance of his rationale. Specifically, this failure is offered by him as the reason for the widespread misunderstanding of Smith, Glass, and Miller's (1980) original meta-analysis of psychotherapy outcome studies.

In a number of different publications, Glass insists that meta-analysis should be understood as an exercise in evaluative research rather than in scientific research (Glass, 1972; Smith, Glass, & Miller, 1980; Glass & Kleigl, 1983). The core of Glass's underlying rationale for meta-analysis involves drawing a strong distinction between scientific and evaluative inquiry. Glass's position is that researchers as scientists are concerned to satisfy their curiosity by seeking truthful conclusions in the form of theories comprising explanatory laws. By contrast, evaluators undertake research on behalf of a client that is aimed at producing useful decisions based on descriptive determinations of the worth of particular products or programs. For Glass, the meta-analysis of outcome studies properly involves the integration of the products of evaluative research only.

The methodology for this conception of meta-analysis fashions the distinction between scientific and evaluative inquiry in terms of the relevance for each of the concepts of truth, explanation, values, problems, and generalizations. Because of space limitations, I will consider just one of these contrasts—that of explanation. Glass contends that scientific inquiry involves the continual search for subsurface explanations of surface phenomena. Evaluative inquiry, on the other hand, does not seek explanations:

"A fully proper and useful evaluation can be conducted without producing an explanation of why the product or program being evaluated is good or bad or of how it operates to produce its effects . . . [It] is usually enough for the evaluator to know that something attendant upon the [product or program] is responsible for the valued outcomes." (Glass, 1972, pp. 5–6)
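Whatever one makes of Glass's evaluative rationale, the statistical core of an outcome meta-analysis of this kind is effect-size aggregation. A minimal fixed-effect sketch in Python, using inverse-variance weighting; the effect sizes and variances below are invented for illustration, not data from Smith, Glass, and Miller or any other study cited in this chapter:

```python
# Illustrative fixed-effect meta-analysis: inverse-variance pooling of
# standardized mean differences from four hypothetical primary studies.
import math

# (effect size d, variance of d) for each hypothetical study
studies = [(0.40, 0.04), (0.55, 0.09), (0.30, 0.02), (0.68, 0.12)]

weights = [1.0 / v for _, v in studies]          # inverse-variance weights
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))               # standard error of pooled d
ci = (pooled - 1.96 * se, pooled + 1.96 * se)    # 95% confidence interval

print(round(pooled, 3), tuple(round(x, 3) for x in ci))
# → 0.389 (0.188, 0.591)
```

A random-effects model would add a between-study variance component to each weight; the pooling arithmetic is otherwise the same.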


Glass's position seems to be that although program treatments can be causally responsible for their measured outcomes, it matters little that knowledge of this gleaned from evaluation studies does not tell us how programs produce their effects, because such knowledge is not needed for policy action.

Glass is surely correct in asserting that scientists are centrally concerned with the construction of causal theories to explain phenomena, for this is the normal way in which they achieve understanding of the empirical regularities they discover. However, he is wrong to insist that proper evaluations should deliberately ignore knowledge of underlying causal mechanisms. The reason for this is that the effective implementation and alteration of social programs often benefits from knowledge of the relevant causal mechanisms involved (Gottfredson, 1984), and strategic intervention in respect of these is often the most effective way to bring about social change. Although standard versions of scientific realism are wrong to insist that the relevant causal mechanisms are always unobserved mechanisms, it is the case that appeal to knowledge of covert causal mechanisms will frequently be required for understanding and change.

To conclude this highly selective evaluation of Glass's rationale for meta-analysis, science itself is best understood as a value-laden, problem-oriented human endeavor that tries to construct causal explanatory theories of the phenomena it discovers. There is no sound way of drawing a principled contrast between scientific and evaluative inquiry. These critical remarks are not directed against the worth of evaluation as such or against the use of meta-analysis in evaluating program or product effectiveness. They are leveled specifically at the conception of evaluation research that appears to undergird Glass's approach to meta-analysis.

Meta-Analysis and the Nature of Science

Proponents of meta-analysis often justify the use of these methods by pointing out the need to glean valuable knowledge from the information in a domain that lies dormant in the cornucopia of scattered primary studies. However, Sohn (1996) has expressed concern about the quality of empirical psychological studies that are used in meta-analysis. He has urged resistance to the generally accepted view that meta-analysis is a form of research rather than a review of research, and he has balked at Schmidt's (1992) revisionist model of possible future science as ". . . a two-tiered research enterprise [where] one group of researchers will specialize in conducting individual studies [and] another group will apply complex and sophisticated meta-analysis methods to those cumulative studies and will make the scientific discoveries" (p. 1180). Sohn's primary concerns are to challenge the claim that meta-analysis is an important vehicle of scientific discovery and to identify the major problems besetting mainstream psychological research.

Sohn (1996) has questioned the basic idea of meta-analysis as a standalone literature review capable of discovering truths, whereas traditionally scientific discoveries were contained in the empirical findings of the primary studies themselves. For Sohn, the idea that meta-analytic literature reviews can make discoveries about nature rests on the assumption that the primary research literature is a proxy for nature. It is an assumption that he has roundly rejected.

Noting the tendency of meta-analysts to paint a bleak picture of progress in twentieth century psychology, Sohn (1996) has suggested that although meta-analysis has been introduced to improve matters in this regard, it is in fact symptomatic of that poor progress. In his judgment, this lack of good progress is a consequence of psychology adopting a hypothesis-testing view of science. For Sohn, this view of science seeks knowledge by testing research hypotheses about the relationship of descriptive variables without regard for causal mediating variables. Essentially, the approach amounts to the hypothetico-deductive testing of outcome studies through use of significance tests and effect size measures. Sohn maintains that there are, in fact, two deleterious consequences of such an approach to research: one is the lack of agreement about outcomes, and the other is the absence of knowledge of the causal mechanisms that are responsible for those alleged outcomes. Meta-analysis is indicted by Sohn for failing to remedy both types of defect.

However, Sohn has supported his claim that meta-analysis does not produce demonstrable evidence for treatment effects in a curious way. He has acknowledged that Smith, Glass, and Miller's (1980) well-known meta-analytic treatment of the benefits of psychotherapy has been corroborated by subsequent meta-analyses yet has maintained that this does not constitute evidence for replicable effects. He has expressed a distrust of research that relies on statistical methods for making claims about replicable effects. This distrust appears to


be founded in part on an extension of the view attributed to Lord Rutherford that if an experimental study requires statistics, then the experiment is in need of improvement. For Sohn, "If one's science needs [meta-analysis], one should have done better science."

However, this view flies in the face of widely accepted scientific practice. Woodward's (1989) detailed examination of the practice of phenomena detection in science strongly supports the view that different parts of the various sciences, from physics to anthropology, appropriately make extensive use of statistical methods in the detection of empirical phenomena. It is hard to imagine that statistics would exist as we currently know it unless it provided a necessary armament for science. In this regard, it is worth noting that Sohn has acknowledged the claim made by Hedges and Olkin (1985) that meta-analysis in some form or other has a long history of use in the hard sciences. Sohn has stated his disagreement with this position, but he has not argued against it. Space limitations preclude further analysis of Sohn's (1996) argument, but perhaps enough has been said to suggest that it should be regarded with some skepticism.

More work on the philosophical foundations of meta-analysis is clearly needed. However, from this highly selective examination of its conceptual foundations, it can be concluded that meta-analysis receives its primary justification in scientific research by articulating one, but only one, way in which researchers can fashion empirical generalizations from the findings of primary studies. It derives its importance in this role directly from the importance accorded the goal of phenomena detection in science.

Exploratory Factor Analysis

Despite the advanced statistical state and frequent use of exploratory factor analysis in the behavioral sciences, debate about its basic nature and worth abounds. Thurstone (1947) has appropriately emphasized the exploratory nature of the method, and many methodologists take it to be a method for postulating latent variables that are thought to underlie patterns of correlations. Some, however, understand exploratory factor analysis as a method of data reduction that provides an economical description of correlational data. The present section considers this and other important foundational issues that have figured prominently in discussions of the method.

Factor Analytic Inference

Alongside the debate between the fictionalist and realist interpretations of factors, there is a difference of view about whether the basic nature of factor analytic inference is inductive or abductive. Expositions of exploratory factor analysis seldom consider its inferential nature, but when they do, the method is usually said to be inductive in character. This is not surprising, given that exploratory factor analysis can be plausibly located historically within seventeenth- and eighteenth-century empiricist philosophy of science and its inductive conception of inquiry (Mulaik, 1987). However, even if one relaxes the Baconian ideal that inductive method is an algorithm that produces incorrigible knowledge, an inductive characterization of exploratory factor analysis seems inappropriate. This is because inductive inference, being descriptive inference, cannot take the researcher from manifest effects to theoretical entities that are different in kind from those effects. However, abductive inference, which is concerned with the generation and evaluation of explanatory hypotheses, can do so. For this reason, exploratory factor analysis is better understood as an abductive method of theory generation (Haig, 2005a), a characterization that coheres well with its general acceptance as a latent variable method. With exploratory factor analysis, abductive inference is explanatory inference that leads back from presumed effects to underlying causes.

There are different forms of abductive reasoning. Exploratory factor analysis is a method that can facilitate the drawing of explanatory inferences that are known as existential abductions. Existential abductions enable researchers to hypothesize the existence, but not the nature, of entities previously unknown to them. The innumerable examples of existential abduction in science include the initial postulation of hidden entities such as atoms, genes, tectonic plates, and personality traits. In cases like these, the primary thrust of the initial abductive inferences is to claims about the existence of theoretical entities to explain empirical facts or phenomena. Similarly, the hypotheses given to us through the use of exploratory factor analysis postulate the existence of latent variables such as Spearman's g and extraversion. It remains for further research to elaborate on the first rudimentary conception of these variables and their interrelation.

The factor analytic use of existential abduction to infer the existence of, say, the theoretical entity


g, can be coarsely reconstructed in accordance with the following argument schema:

The surprising empirical phenomenon of the positive correlations among different tests of ability is identified.

If g exists, and it is validly and reliably measured by a Wechsler intelligence scale (and/or some other objective test), then the positive manifold would follow as a matter of course.

Hence, there are grounds for judging the hypothesis of g to be initially plausible and worthy of further pursuit.

Note that the schema for abductive inference and its application to the generation of the hypothesis of g is concerned with the form of the arguments involved, rather than with the actual generation of the explanatory hypotheses. The explanatory hypothesis is given in the second premise of the argument. An account of the genesis of the explanatory hypothesis must, therefore, be furnished by some other means. It is plausible to suggest that reasoning to explanatory hypotheses trades on our evolved cognitive ability to abductively generate such hypotheses (Carruthers, 2002). Whatever its origin, an informative methodological characterization of the abductive nature of factor analytic inference must appeal to the scientist's own psychological resources as well as those of logic. This injunction is motivated by the realist methodological thesis of naturalism stated near the beginning of the chapter.

Although exploratory factor analysis exemplifies well the character of existential abduction, it is clearly not an all-purpose method for abductively generating explanatory hypotheses and theories. With its focus on common factors, it can properly serve as a generator of elementary theories only in those multivariate domains that have common causal structures.

The Principle of the Common Cause

It is well known that exploratory factor analysis is a common factor analytic model in which the latent factors it postulates are referred to as common factors. Less well known is the fact that there is an important principle of scientific inference, known as the principle of the common cause (e.g., Sober, 1988), that can be used to drive the nature and shape of the existential abductive inferences involved in exploratory factor analysis. The principle of the common cause can be formulated concisely as follows: "Whenever two or more events are improbably, or significantly, correlated, infer one or more common causes unless there is good reason not to." Clearly, the principle should not be taken as a hard-and-fast rule, for in many cases proper inferences about correlated events will not be in terms of common causes. The qualifier, "unless there is good reason not to," should be understood as an injunction to consider causal interpretations of the correlated events other than the common causal kind. For example, in a given research situation, the correlated events might be related as direct causes, or their relationship might be mediated by a third variable in a causal sequence.

Although exploratory factor analysis is used to infer common causes, expositions of common factor analysis that explicitly acknowledge the importance of the principle of the common cause are rare. Kim and Mueller's (1978) textbook exposition of factor analysis is a noteworthy exception. In discussing the conceptual foundations of factor analysis, these authors evince the need to rely on what they call the postulate of factorial causation, which they characterize as "the assumption that the observed variables are linear combinations of underlying factors and that the covariation between observed variables solely results from their common sharing of one or more of the common factors" (p. 78). The authors make clear that the common factors mentioned in the assumption are to be regarded as underlying causal variables. Understood as a methodological injunction, this postulate functions as a variant of the principle of the common cause. Without appeal to this principle, factor analysts could not identify the underlying factor pattern from the observed covariance structure.

There are two features of the principle of the common cause that make it particularly suitable for use in exploratory factor analysis. First, it can be applied in situations where we do not know how likely it is that the correlated effects result from a common cause. The abductive inference to common causes is a basic explanatory move that is non-probabilistic and qualitative in nature. It is judgments about the soundness of the abductive inferences, rather than the assignment of probabilities, that confer initial plausibility on the factorial hypotheses spawned by exploratory factor analysis. Second, the principle can also be used in situations where we are essentially ignorant of the nature of the common cause. With this second feature, the principle of the common cause accommodates the fact that exploratory factor analysis trades in existential abductions.
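The common-cause structure that this principle licenses one to infer can also be run in the forward direction as a check on the intuition: if a single latent variable does drive two observed indicators, their correlation follows as a matter of course, as in the positive-manifold example discussed earlier. A small Python simulation, using a hypothetical two-indicator model with loadings chosen purely for illustration (this is a data-generating sketch, not a factor analysis):

```python
# Simulate the common-cause structure assumed by common factor analysis:
# one latent variable drives two observed indicators, so the indicators
# correlate only because they share that latent cause.
# Hypothetical loadings (0.8 and 0.7); residual SDs give ~unit variances.
import random

random.seed(1)
n = 20000
latent = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.8 * g + 0.6 * random.gauss(0, 1) for g in latent]
x2 = [0.7 * g + 0.714 * random.gauss(0, 1) for g in latent]

def corr(a, b):
    # Pearson correlation computed from scratch
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    va = sum((x - ma) ** 2 for x in a) / len(a)
    vb = sum((y - mb) ** 2 for y in b) / len(b)
    return cov / (va * vb) ** 0.5

# The model implies a correlation equal to the product of the loadings,
# 0.8 * 0.7 = 0.56; the sample value should fall close to that.
print(round(corr(x1, x2), 2))  # prints a value close to 0.56
```

Observing such a correlation in real data, the existential abduction runs the inference in reverse: postulate one common cause of unknown nature, unless there is good reason not to.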


Further, it is important to appreciate that the principle of the common cause does not function in isolation from other methodological constraints. Embedded in exploratory factor analysis, the principle helps one limit existential abductive inference to those situations where we reason back from correlated effects to one or more common causes. Although covariation is an important basic datum in science, not all effects are expressed as correlations, and of course, not all causes are of the common causal variety. It follows from this that one should not always look for common causal interpretations of multivariate data, for there are numerous alternative latent variable models.

The Underdetermination of Factors

The methodological literature on exploratory factor analysis has given considerable attention to the indeterminacy of factors in the common factor model. Factor indeterminacy arises from the fact that the common factors are not uniquely determined by their related manifest variables. As a consequence, a number of different common factors can be produced to fit the same pattern of correlations in the manifest variables.

Although typically ignored by factor analytic researchers, factor indeterminacy is an epistemic fact of life that continues to challenge factor analytic methodologists. Some methodologists regard factor indeterminacy as a serious problem for common factor analysis and recommend the use of alternative methods, such as component analysis methods, because these are considered to be determinate methods.

One constructive perspective on the issue of factor indeterminacy has been suggested by Mulaik and McDonald (Mulaik & McDonald, 1978; McDonald & Mulaik, 1979; Mulaik, 1987). Their position is that the indeterminacy involved in interpreting the common factors in exploratory factor analysis is just a special case of the general indeterminacy of theory by empirical evidence widely encountered in science, and it should not, therefore, be seen as a debilitating feature that forces us to give up on common factor analysis.

Indeterminacy is pervasive in science. It occurs in semantic, metaphysical, and epistemological forms (McMullin, 1995). Factor indeterminacy is essentially epistemological in nature. The basic idea of epistemological or, more precisely, methodological indeterminacy is that the truth or falsity (better, acceptance or rejection) of a hypothesis or theory is not determined by the relevant evidence (Duhem, 1954). In effect, methodological indeterminacy arises from our inability to justify accepting one theory among alternatives on the basis of empirical evidence alone.

Mulaik (1987) sees underdetermination in exploratory factor analysis as involving inductive generalizations that go beyond the data. However, inductive underdetermination should be seen as applying specifically to the task of establishing factorial invariance, where one seeks constructive or external replication of factor patterns. For exploratory factor analysis there is also a need to acknowledge and deal with the abductive underdetermination involved in the generation of explanatory factorial theories. The sound abductive generation of hypotheses is essentially educated guesswork. Thus, drawing from background knowledge, and constrained by correlational empirical evidence, the use of exploratory factor analysis can reasonably be expected to yield a plurality of factorial hypotheses or theories that are thought to be in competition. This contrasts strongly with the unrealistic expectation, held by many earlier users of exploratory factor analysis, that the method would deliver them strongly justified claims about the one best factorial hypothesis or theory.

How, then, can exploratory factor analysis deal with the specter of underdetermination in the context of theory generation? One plausible answer is that exploratory factor analysis narrows down the space of a potential infinity of candidate theories to a manageable subset by facilitating judgments of initial plausibility (Haig, 2005a). It seems clear enough that scientists often make judgments about the initial plausibility of the explanatory hypotheses and theories that they generate. However, it is less clear just to what this evaluative criterion amounts (cf. Whitt, 1992). With an abductive conception of exploratory factor analysis, judgments of the initial plausibility of theories are judgments about the soundness of the abductive arguments employed in generating those theories. It seems reasonable to suppose that those who employ exploratory factor analysis as an abductive method of theory generation often make compressed judgments of initial plausibility. By conferring judgments of initial plausibility on the theories it spawns, exploratory factor analysis deems them worthy of further pursuit, whereupon it remains for the factorial theories to be further developed and evaluated, perhaps through the use of confirmatory factor analysis.
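The claim that different common factors can reproduce the same pattern of manifest correlations can be made concrete in its simplest, rotational form (factor-score indeterminacy, the problem at issue above, is a further and deeper version of the same predicament). In the sketch below, with hypothetical loadings, two different two-factor loading matrices imply exactly the same common covariance structure:

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# A hypothetical two-factor solution for four manifest variables.
L = [[0.8, 0.0],
     [0.7, 0.0],
     [0.0, 0.6],
     [0.0, 0.5]]

# Any orthogonal rotation T gives a different loading matrix L* = L T.
theta = 0.7  # arbitrary rotation angle, in radians
T = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
L_star = matmul(L, T)

# Yet both solutions imply exactly the same common covariance L L',
# so the manifest correlations alone cannot decide between them.
common = matmul(L, transpose(L))
common_star = matmul(L_star, transpose(L_star))
```

Because L T T' L' = L L' for every orthogonal T, the data constrain the family of solutions but not a unique member of it, which is one elementary way of seeing why extra-statistical judgments of initial plausibility are needed.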

haig 23

It should be emphasized that using exploratory factor analysis to facilitate judgments about the initial plausibility of hypotheses will still leave the domains being investigated in a state of considerable theoretical underdetermination. It should also be stressed that the resulting plurality of competing theories is entirely to be expected and should not be thought of as an undesirable consequence of employing exploratory factor analysis. To the contrary, it is essential for the growth of scientific knowledge that we vigorously promote theoretical pluralism (Hooker, 1987), develop theoretical alternatives, and submit them to critical scrutiny.

Exploratory Factor Analysis and Confirmatory Factor Analysis

The aforementioned consideration of exploratory factor analysis supports the conclusion that there is an important role for its use in factor analytic research. However, this conclusion raises the question of how exploratory factor analysis relates to its confirmatory namesake. In contrast to popular versions of the classical inductivist view of science, on which inductive method can generate secure knowledge claims, the use of exploratory factor analysis as an abductive method of theory generation can only furnish researchers with a weak logic of discovery—one that gives them educated guesses about underlying causal factors. It is for this reason that those who use exploratory factor analysis to generate theories need to supplement their generative assessments of the initial plausibility of those theories with additional consequentialist justification in the form of confirmatory factor analytic testing or some alternative approach to theory appraisal.

However, in the factor analytic literature there is a division of opinion about whether exploratory factor analysis and confirmatory factor analysis should be viewed as complementary or competing methods of common factor analysis. Quite a number of factor analytic methodologists have expressed views that discourage their complementary use in factor analytic research. For example, Gorsuch (1983), in his well-known book on factor analysis, has expressed a view about the relative importance of exploratory and confirmatory factor analysis that seems to be quite widely held today:

Although the next three chapters [of Factor analysis] are primarily concerned with exploratory factor analysis, the space and time given to that technique is a function of the complexity of resolving its problems, not of its theoretical importance. On the contrary, confirmatory factor analysis is the more theoretically important—and should be the much more widely used—of the two major factor analytic approaches. (p. 134)

Although Gorsuch makes his claim in emphatic terms, he provides no justification for it. He seems to assume that theory testing is more important than theory generation. However, this belief is difficult to defend, given the fact that there are many other important phases of scientific inquiry that together demand most of the researcher's methodological time. Recall, for example, the importance to science of the detection of empirical phenomena and the generation, development, and comparative appraisal of theories. Viewed in this light, theory testing is just one, albeit important, part of scientific method (cf. Simon, 1968). Given that science is as much concerned with theory generation as it is with theory testing, and acknowledging that exploratory factor analysis is a useful abductive method of theory generation, exploratory factor analysis deserves to be regarded as being as important as confirmatory factor analysis in the theory constructor's toolkit.

To conclude: despite the fact that exploratory factor analysis has been frequently employed in psychological research, the extant methodological literature on the method seldom acknowledges the explanatory and ontological import of the method's inferential nature. Abduction is a major form of creative reasoning in science, and the principle of the common cause is a maxim of scientific inference with important application in research. By incorporating these two related elements into its fold, exploratory factor analysis is ensured an important, albeit circumscribed, role in the construction of explanatory theories in psychology and other sciences. By generating structural models about common causes, exploratory factor analysis can serve as a valuable precursor to confirmatory factor analysis.

Causal Modeling

During the last 50 years, social and behavioral science methodologists have developed a variety of increasingly sophisticated statistical methods to help researchers draw causal conclusions from correlational data. These causal modeling methods, as they have sometimes been called, include path analysis, confirmatory factor analysis, and full structural equation modeling.


Despite the fact that psychological researchers are increasingly employing more sophisticated causal modeling methods in place of simple regression and partial correlation procedures, worries about both their accompanying methodology and their misuse have been expressed (e.g., Cliff, 1983). In this section, I consider some philosophical aspects of three foundational issues that have been discussed in the literature on causal modeling: the different ideas of causation presupposed by causal modeling; the suggestion that causal modeling can be viewed as a form of inference to the best explanation; and the contested nature of latent variables.

Causal Modeling and Theories of Causation

One central methodological issue in the debates about causal modeling has to do with the appropriateness of the nature of causation involved in various causal modeling procedures. A popular view of the matter is clearly expressed by Kenny (1979), who points out that three conditions must be satisfied for a researcher to claim that one variable is the cause of another. The first condition is that the relationship be asymmetric. The second condition is that a functional relationship be present between cause and effect. The third condition is that the causal relationship be direct or non-spurious. These three conditions are exactly those of the regularity theory of causation, which depicts the causal relationship between events in terms of their regular succession, covariation, and contiguity. The regularity theory, which is more or less Humean in character, provides an important part of the epistemic backdrop against which traditional causal modeling methods like path analysis have been understood.

However, like other parts of the standard empiricist enterprise, this theory has received strong criticism. Its claimed limitations can best be appreciated by contrasting it with a scientific realist alternative known as the generative theory of causation (Harré & Madden, 1975). Briefly stated, the generative theory depicts causation as a relationship where, under appropriate conditions, a causal mechanism produces its effect. For this to happen, the causal mechanism must connect to its effect and have the power to generate that effect, usually when stimulated by the appropriate causal condition. It is the productivity of a generative mechanism that makes it a causal mechanism, and for this to occur, there must be a naturally necessary connection that allows for the transmission of power from cause to effect. This causal power exists irrespective of whether it is currently being exercised. As such, it is properly viewed as a tendency—that is, an existing state of an object which, if unimpeded, will produce its effect. We are, therefore, able to infer abductively the presence of the causal mechanism on the basis of knowledge of the triggering condition and/or its presumed effect.

Advocates of the generative theory of causation claim it has a number of important advantages over the regularity theory. One advantage of the generative theory is that it is able to accommodate deep structural, explanatory theories that postulate unobserved generative mechanisms. It is argued that we need a theory of causation that affords us the conceptual space to do this, because many of the world's causal mechanisms are not open to direct inspection. The latent variables of many of our causal modeling methods are thought by many to be precisely of this kind. A related advantage of the generative theory is that it is needed for enlightened social policy because, as noted in the discussion of evaluation research earlier, the possibility of ameliorative action depends on effecting change based on an understanding of how things work, and for this, knowledge of the relevant underlying causal mechanisms is often essential.

A third, and significant, advantage of the theory of generative causation is that it enables us to draw the important distinction between empirical regularities and genuine causal laws. An adequate methodology of causal modeling must be able to draw the distinction between empirical regularities and causal laws, because the ability to do so is a conceptual requirement of being able to differentiate properly direct causal relations from spurious correlations. By collapsing this distinction, empiricists, with their regularity theory of causation, are unable to articulate a satisfactory notion of spuriousness. For example, Simon's (1985) influential analysis of spurious correlation explicitly rejects the generative theory of causation and endeavors to ground the distinction between true and spurious correlations on a commitment to an empiricist view of causation. The common or intervening causes that bring about spurious correlations will typically be unobserved. However, for a statistical treatment of these variables to be consistent with the regularity theory, Simon's view of causation forces researchers to focus on altogether different variables at the manifest level.
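The contrast between a direct causal relation and a spurious correlation produced by a common cause can be illustrated with a short simulation (the coefficients are invented for the example; stdlib Python only). Note that the statistical "control" in the last step is possible here only because the simulation makes the common cause available—precisely what, as just noted, is typically not the case:

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def partial_corr(xs, ys, zs):
    """First-order partial correlation r_xy.z."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

random.seed(2)
n = 20000
z = [random.gauss(0, 1) for _ in range(n)]          # unobserved common cause
x = [0.8 * zi + random.gauss(0, 0.6) for zi in z]   # z -> x
y = [0.7 * zi + random.gauss(0, 0.7) for zi in z]   # z -> y; no x -> y path

r_xy = pearson(x, y)                  # sizeable, though x never causes y
r_xy_given_z = partial_corr(x, y, z)  # near zero once z is conditioned on
```

The manifest correlation between x and y is substantial, yet it vanishes once the generating variable z is partialled out; when z is unmeasured, nothing at the manifest level marks the correlation as spurious.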


But this cavalier ontological slide wrecks our efforts to obtain worthwhile causal knowledge, because the manifest replacement variables cannot act as effective surrogates for their presumed common and intervening causes. They are ontologically distinct from such causes and, although as causal conditions they may trigger their latent counterparts, they do not function as major causal mechanisms that can bring about spurious correlations.

Although it can plausibly be argued that a generative view of causation is required to make sense of research that embraces hidden causal mechanisms, it does not follow, as is often supposed (e.g., Manicas, 1989; Sayer, 1992), that the regularity theory has no place in a realist conception of science. With its emphasis on the idea of regularity, it would seem to be a suitable account of causation for claims about phenomena that take the form of empirical generalizations. Nor should it be thought that the regularity theory and the generative theory together give one a full understanding of causation in science. For example, structural equation modeling provides knowledge of causal networks. As such, it does not so much encourage the development of detailed knowledge of the nature of latent variables as it specifies the range and order of causal relations into which latent and manifest variables enter. For this type of research, a network theory of causation is needed (Thagard, 1999).

The suggestion that different conceptions of causation are relevant to causal modeling fits with a philosophy of causal pluralism, which is increasingly being recommended in contemporary methodological studies of the nature of causation (Godfrey-Smith, 2009).

Inference to the Best Explanation

The guess-and-test strategy of the hypothetico-deductive method takes predictive accuracy as the sole criterion of theory goodness. However, it seems to be the case that in research practice, the hypothetico-deductive method is sometimes combined with the use of supplementary evaluative criteria such as simplicity, scope, and fruitfulness. When this happens, and one or more of the supplementary criteria have to do with explanation, the combined approach can appropriately be regarded as a version of inference to the best explanation, rather than just an augmented account of the hypothetico-deductive method (Haig, 2009). This is because the central characteristic of the hypothetico-deductive method is a relationship of logical entailment between theory and evidence, whereas with inference to the best explanation the relationship is also one of explanation. The hybrid version of inference to the best explanation being considered here will allow the researcher to say that a good explanatory theory will rate well on the explanatory criteria and, at the same time, boast a measure of predictive success. Most methodologists and scientists will agree that an explanatory theory that also makes accurate predictions will be a better theory for doing so.

Although the use of structural equation modeling in psychology often involves testing models in hypothetico-deductive fashion, it also contains a minority practice that amounts to inference to the best explanation in the sense just noted. This latter practice involves the explicit comparison of models or theories, in which an assessment of their goodness-of-fit to the empirical evidence is combined with the weighting of the fit statistics in terms of parsimony indices (Kaplan, 2000). Here goodness-of-fit provides information about the empirical adequacy of the model, whereas parsimony functions as a criterion having to do with the explanatory value of the model. Both are used in judgments of model goodness. Markus, Hawes, and Thasites (2008) recently have suggested that in structural equation modeling, model fit can be combined with model parsimony, understood as explanatory power, to provide an operationalized account of inference to the best explanation. They discussed the prospects of using structural equation modeling in this way to evaluate the comparative merits of two- and three-factor models of psychopathy.
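The combination of goodness-of-fit with a parsimony penalty just described can be illustrated, outside the structural equation modeling machinery, with the familiar AIC-style trade-off. This is a hedged sketch with simulated data (AIC is computed here only up to an additive constant):

```python
import math
import random

random.seed(3)

# Simulated data from a linear "theory": y = 1 + 0.5 x + noise.
xs = [i / 10 for i in range(100)]
ys = [1.0 + 0.5 * x + random.gauss(0, 0.5) for x in xs]
n = len(xs)

def rss_mean_only(ys):
    """Residual sum of squares for the 1-parameter model y = c."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def rss_line(xs, ys):
    """Residual sum of squares for the 2-parameter model y = a + b x."""
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def aic(rss, k):
    # The first term rewards empirical adequacy (fit); the 2k term
    # penalizes each extra parameter, i.e., lost parsimony.
    return n * math.log(rss / n) + 2 * k

aic_mean = aic(rss_mean_only(ys), k=1)
aic_line = aic(rss_line(xs, ys), k=2)
# Here the slope buys enough fit to outweigh its parsimony cost, so the
# two-parameter model scores lower (better); had it not, the simpler
# model would be preferred despite its larger residuals.
```

The comparative logic is the same as in the structural equation modeling case: neither raw fit nor parsimony alone decides between rival models; it is their weighted combination that operationalizes the judgment of which model is the better explanation.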


The Contested Nature of Latent Variables

Many causal modeling methods are latent variable methods, whose conceptual foundations are to be found in the methodology of latent variable theory (Borsboom, 2005; 2008). Central to this theory is the concept of a latent variable itself. However, the notion of a latent variable is a contested concept, and there are fundamental philosophical differences in how it should be understood.

A clear example of the contested nature of the concept of a latent variable is to be found in the two quite different interpretations of the nature of the factors produced by exploratory factor analysis. One view, known as fictionalism, maintains that the common factors, the output of exploratory factor analysis, are not theoretical entities invoked to explain why the observed variables correlate the way that they do. Rather, these factors are taken to be summary expressions of the way manifest variables co-vary. Relatedly, theories that marshal descriptions of such factors are properly considered to serve the instrumentalist function of economically redescribing the original correlational data. This interpretation of exploratory factor analysis has been quite influential in psychometrics (Block, 1976) and has been taught to generations of psychology students through textbooks on psychological testing (e.g., Anastasi & Urbina, 1997). Fictionalism seems to be the preferred option of many factor analytic researchers in the domains of personality and intelligence.

However, fictionalism is a difficult position to defend, and it seems to fail in factor analysis for the reason it fails in science generally: it inappropriately grants ontological significance to a sharp distinction between observation and theory to buttress the claim that only observable, or manifest, entities exist, when observability is really a matter of degree. Fictionalists argue that because we do not have perceptual experience of theoretical entities, we do not have grounds for saying they exist; we only have grounds for claiming that observable entities exist. But realist philosophers of science (e.g., Maxwell, 1962) assert in reply that fictionalists cannot maintain a sharp distinction between what is observable and what is unobservable. What cannot be seen directly by the unaided eye might be observable through a magnifying glass, and what cannot be observed through a magnifying glass might be observed through a microscope. Importantly, how we draw the observable/unobservable distinction at a particular time is a function of prior knowledge, our physiological make-up, and available instrumentation. Thus, the distinction provides no basis for deciding what entities do, and do not, exist. To assert that factors are theoretical entities is not to regard them as having a special existence; rather, it is to acknowledge that we come to know them indirectly in terms of their correlated effects. On this realist interpretation, the factors are regarded as latent variables that underlie, and give rise to, the correlated manifest variables. Borsboom (2005) has made a strong case for adopting a realist attitude to latent variables more generally by combining an argument similar to Maxwell's with other foundational considerations in the philosophy of science and psychometrics.

This general argument against fictionalism simultaneously supports the doctrine of realism in science, but it does not by itself establish that the factors of exploratory factor analysis should be given a realist interpretation. Whether this should happen depends also on whether exploratory factor analysis can facilitate the drawing of sound abductive inference about the existence of latent factors.

This highly selective consideration of the philosophy of causal modeling points to three conclusions: (1) that causation in causal modeling manifests itself in a number of different ways; (2) that causal modeling can transcend the limitations of the hypothetico-deductive method and adopt the practice of inference to the best explanation; and (3) that latent variables deserve to be given a realist interpretation as genuine theoretical entities.

Conclusion

The philosophy of research methods is an aspect of research methodology that receives limited attention in behavioral science education. The majority of students and research practitioners in the behavioral sciences obtain the bulk of their knowledge of research methods from textbooks. However, a casual examination of these texts shows that they tend to pay little, if any, serious regard to the philosophy of science and its bearing on the research process. As Kuhn pointed out nearly 50 years ago (Kuhn, 1962; 1996), textbooks play a major role in dogmatically initiating students into the routine practices of normal science. Serious attention to the philosophy of research methods would go a considerable way toward overcoming this uncritical practice. As contemporary philosophy of science increasingly focuses on the contextual use of research methods in the various sciences, it is to be hoped that research methodologists and other behavioral scientists will avail themselves of the genuine methodological insights that it contains.

Future Directions

In this final section of the chapter, I suggest a number of directions that future work in the philosophy of quantitative methods might take. The first three suggestions are briefly discussed; the remaining suggestions are simply listed.

Understand Quantitative Methods Through Methodology

A proper understanding of research methods cannot be had without an appreciation of their accompanying methodology (see Proctor & Capaldi, 2001).


Methodology is the interdisciplinary field that studies methods. It draws from the disciplines of statistics, philosophy of science, and cognitive science, among others. And yet the professional literature of these disciplines does not figure in the content of research methods courses. Further, it is important to appreciate that methodology has descriptive, critical, and advisory dimensions. Again, the typical methods curriculum does not systematically deal with research methods with these considerations in mind. It is not surprising, therefore, that psychologists' understanding of research methods often leaves a lot to be desired.

A realist-oriented methods curriculum would profitably consider methods in the light of the primary characteristics of realist methodology outlined early in the chapter. To mention just three of these: Greater prominence would be given to generative methodology, in which reasoning well to hypotheses and theories would figure in the assessment of those knowledge claims. The sound abductive reasoning to factorial hypotheses using exploratory factor analysis is perhaps psychology's best example of generative justification. Similarly, the coherentist justification of explanatory theories using methods of inference to the best explanation would feature much more prominently than it does at present. Finally, in adopting methods that are apt for us as knowing subjects, heuristic procedures would receive much more explicit attention in the methods curriculum as realistic guides to our thinking.

The British Psychological Society now takes conceptual and historical issues as one of psychology's seven core areas. Teaching methods through methodology is the appropriate way to employ this core area in the teaching of research methods. The American Psychological Association and the Association for Psychological Science would do well to follow suit, for it is only by making considered use of methodology that a genuine education in research methods can be achieved.

Rethink the Quantitative/Qualitative Distinction

A major feature of the methodological landscape has been the discussion of the distinction between quantitative and qualitative methods. Although perhaps necessary in establishing a legitimate role for the use of qualitative methods in research, the distinction is now the subject of critical scrutiny. The way the original distinction was drawn has been questioned (e.g., Michell, 2004), and the combination of qualitative and quantitative methods in mixed methods strategies has been strongly promoted in recent times.

However, the quantitative/qualitative debate has not considered the possibility that most methods have both quantitative and qualitative dimensions. In many cases, we are likely to gain a better understanding of the research methods we use not by viewing them as either qualitative or quantitative but by regarding them as having both qualitative and quantitative dimensions. Three examples are mentioned here. First, grounded theory (e.g., Strauss, 1987), the most prominent extant qualitative methodology, is in good part the product of a translation from some sociological quantitative methods of the 1950s. Moreover, there is nothing in principle to stop researchers from using quantitative methods within the fold of grounded theory. Exploratory factor analysis, for example, could sometimes be used for generating grounded theory.

Second, although exploratory factor analysis itself is standardly characterized as a multivariate statistical method, the inferential heart of the method is the important scientific heuristic known as the principle of the common cause. Importantly, this principle, which guides the factor analytic inference from correlations to underlying common factors, can be effectively formulated in qualitative terms.

Finally, the theory of explanatory coherence (Thagard, 1992), which evaluates theories in terms of their explanatory power, is a qualitative method of theory appraisal, but it is implemented by a computer program that is part of the method proper, and that has a connectionist architecture that is mathematically constrained.

It is recommended, then, that methodologists and researchers seriously entertain the prospect that individual methods are likely to have a mix of qualitative and quantitative features—that is, that individual methods are themselves mixed methods.
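The theory of explanatory coherence is a case in point: its appraisal criteria are qualitative, but their implementation (Thagard's program ECHO) is a mathematically constrained connectionist network. The stand-alone sketch below only mimics the general settling dynamics of such a network; the units, links, and weights are invented for the example and are not drawn from ECHO itself:

```python
# Hypothesis units H1, H2 and evidence units E1, E2. H1 explains both
# pieces of evidence, H2 explains only E1, and the two hypotheses
# contradict each other, so they compete for acceptance.
excit, inhib, decay = 0.04, -0.06, 0.05

links = {("H1", "E1"): excit, ("H1", "E2"): excit,
         ("H2", "E1"): excit, ("H1", "H2"): inhib,
         ("SPECIAL", "E1"): excit, ("SPECIAL", "E2"): excit}

def weight(a, b):
    """Symmetric link weight between two units (0 if unlinked)."""
    return links.get((a, b)) or links.get((b, a)) or 0.0

units = ["H1", "H2", "E1", "E2"]
act = {u: 0.01 for u in units}
act["SPECIAL"] = 1.0  # clamped unit that feeds activation to evidence

for _ in range(200):  # synchronous settling of the network
    new = {}
    for u in units:
        net = sum(weight(u, v) * act[v] for v in act if v != u)
        if net > 0:
            new[u] = act[u] * (1 - decay) + net * (1.0 - act[u])
        else:
            new[u] = act[u] * (1 - decay) + net * (act[u] + 1.0)
        new[u] = max(-1.0, min(1.0, new[u]))
    act.update(new)

# After settling, H1 (which explains more) holds a clearly positive
# activation, while inhibition from its better-supported rival drives
# H2 down: acceptance and rejection emerge from constraint satisfaction.
```

The qualitative relations (explains, contradicts) fix the network's constraints, while the numerical settling process adjudicates them—one concrete sense in which a single method can be qualitative and quantitative at once.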


Evaluate the Philosophical Critiques of Quantitative Research Methods

Most of the occasional references to scientific realism in psychology are to Bhaskar's (1975; 1979) critical realism (e.g., Manicas & Secord, 1983), a philosophy that has had considerable impact on the social sciences (e.g., Sayer, 1992). Interestingly, critical realists have expressed strong reservations about the use of statistical methods in quantitative research. Bhaskar himself goes so far as to say that causal models should be "totally discarded." There are various reasons for this attitude (see Pratschke, 2003), but perhaps the most fundamental one is the claim that statistical models themselves do not provide researchers with the substantive models that are sought in causal modeling research.

However, this claim rests on a mistaken conception of the relation between statistical models and substantive theoretical models. It is hard to deny that consideration of much more than the statistical machinery of causal modeling is needed to ground substantive conclusions. Indeed, it is difficult to see how any statistical method could be properly understood and used in research without appeal to suprastatistical matters. Consider, for example, the oft-made claim that the factors of exploratory factor analysis are statistical entities and that the method cannot, therefore, be used to favor one substantive factorial theory of intelligence over another (e.g., Gould, 1996). This claim is false because factor analysts typically transcend the statistical level of the method and make use of the relevant part of latent variable theory to generate plausible hypotheses about the existence of latent variables. Of central relevance here is the fact that exploratory factor analysis exploits the so-called "principle of the common cause" to sanction inferences to the initial plausibility of interpreted latent variables. We saw earlier that inferences from manifest to latent variables made in accordance with this principle are abductive, or explanatory, in nature and are made by factor analysts themselves. Although the statistical machinery of multiple regression and partial correlation theory is obviously an important part of exploratory factor analysis, its primary function is to facilitate researchers' suprastatistical inferences to latent factors.

It is important to appreciate that the interpretive dimension of causal modeling methods is a proper part of their methodology. There is nothing in critical realism, or other variants of scientific realism, that prevents one from taking such an outlook on causal modeling. Indeed, scientific realism comports well with causal modeling methods that countenance […]

• Apply insights from the "new experimentalism" in the philosophy of science to the understanding of quantitative research methods.
• Develop the philosophical foundations of theory construction methods in the behavioral sciences.
• Assess the implications of different theories of causality for research methods.
• Examine the philosophical foundations of "new" research methods such as data mining, structural equation modeling, and functional neuroimaging.

Author note
Brian D. Haig, Department of Psychology, University of Canterbury, Christchurch, New Zealand. Email: brian.haig@canterbury.ac.nz

References
Acree, M. C. (1979). Theories of statistical inference in psychological research: A historico-critical study. Dissertation Abstracts International, 39, 5073B. (UMI No. 7907000)
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice-Hall.
Bangert-Drowns, R. L. (1986). Review of developments in meta-analytic method. Psychological Bulletin, 99, 388–399.
Behrens, J. T., & Yu, C.-H. (2003). Exploratory data analysis. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology, Vol. 2 (pp. 33–64). New York: Wiley.
Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence (with comments). Journal of the American Statistical Association, 82, 112–139.
Bhaskar, R. (1975). A realist theory of science. Brighton, England: Harvester.
Bhaskar, R. (1979). The possibility of naturalism. Brighton, England: Harvester.
Block, N. J. (1976). Fictionalism, functionalism, and factor analysis. In R. S. Cohen, C. A. Hooker, & A. C. Michalos (Eds.), Boston Studies in the Philosophy of Science, Vol. 32 (pp. 127–141). Dordrecht, the Netherlands: Reidel.
Bolles, R. C. (1962). The difference between statistical and scientific hypotheses. Psychological Reports, 11, 629–645.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in

latent variables. contemporary psychometrics. Cambridge, England: Cambridge

University Press.

Borsboom, D. (2008). Latent variable theory. Measurement:

Additional Directions Interdisciplinary Research and Perspectives, 6, 25–53.

Space considerations prevent discussion of addi- Bunge, M. (2008). Bayesianism: Science or pseudoscience?

International Review of Victimology, 15, 165–178.

tional future directions in the philosophy of quan- Carruthers, P. (2002). The roots of scientiﬁc reasoning: Infancy,

titative methods. However, the following points modularity, and the art of tracking. In P. Carruthers, S. Stich,

deserve to be on an agenda for future study. & M. Siegal (Eds.), The cognitive basis of science (pp. 73–95).

Cambridge, England: Cambridge University Press.

• Develop a modern interdisciplinary concep- Chow, S. L. (1996). Statistical signiﬁcance: Rationale, validity, and

tion of research methodology. utility. London, England: Sage.

• Give more attention to investigative strategies Cliff, N. (1983). Some cautions concerning the application of

causal modeling methods. Multivariate Behavioral Research,

in psychological research. 18, 115–126.

• Take major philosophical theories of scientiﬁc Cohen, J. (1994). The earth is round (p <.05). American

method seriously. Psychologist, 49, 997–1003.


Cook, T. D., Cooper, H., Cordray, D. S., Hartmann, H., Hedges, L. V., Light, R. J., et al. (1992). Meta-analysis for explanation: A casebook. New York: Russell Sage Foundation.
Duhem, P. (1954). The aim and structure of physical theory (2nd ed., P. P. Weiner, Trans.). Princeton, NJ: Princeton University Press.
Earman, J. (1992). Bayes or bust? A critical examination of Bayesian confirmation theory. Cambridge, MA: MIT Press.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Ehrenberg, A. S. C., & Bound, J. A. (1993). Predictability and prediction. Journal of the Royal Statistical Society, Part 2, 156, 167–206.
Fidell, L. S., & Tabachnick, B. G. (2003). Preparatory data analysis. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology, Vol. 2 (pp. 115–121). New York: Wiley.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd.
Giere, R. N. (1972). The significance test controversy. British Journal for the Philosophy of Science, 23, 170–180.
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences (pp. 311–339). Hillsdale, NJ: Lawrence Erlbaum.
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 391–408). Thousand Oaks, CA: Sage.
Glass, G. V. (1972). The wisdom of scientific inquiry on education. Journal of Research in Science Teaching, 9, 3–18.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3–8.
Glass, G. V., & Kleigl, R. M. (1983). An apology for research integration in the study of psychotherapy. Journal of Consulting and Clinical Psychology, 51, 28–41.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Glymour, C. N. (1980). Theory and evidence. Princeton, NJ: Princeton University Press.
Godfrey-Smith, P. (2009). Causal pluralism. In H. Beebee, C. Hitchcock, & P. Menzies (Eds.), The Oxford handbook of causation (pp. 326–337). Oxford, England: Oxford University Press.
Good, I. J. (1983). The philosophy of exploratory data analysis. Philosophy of Science, 50, 283–295.
Goodman, S. N. (1993). P values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137, 485–495.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Gottfredson, G. D. (1984). A theory-ridden approach to program evaluation. American Psychologist, 39, 1101–1112.
Gould, S. J. (1996). The mismeasure of man (2nd ed.). New York: Norton.
Greenwood, J. D. (1992). Realism, empiricism, and social constructionism. Theory and Psychology, 2, 131–151.
Haig, B. D. (1987). Scientific problems and the conduct of research. Educational Philosophy and Theory, 19, 22–32.
Haig, B. D. (2005a). Exploratory factor analysis, theory generation, and scientific method. Multivariate Behavioral Research, 40, 303–329.
Haig, B. D. (2005b). An abductive theory of scientific method. Psychological Methods, 10, 371–388.
Haig, B. D. (2009). Inference to the best explanation: A neglected approach to theory appraisal in psychology. American Journal of Psychology, 122, 219–234.
Harré, R., & Madden, E. H. (1975). Causal powers. Oxford, England: Blackwell.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.
Hooker, C. A. (1987). A realistic theory of science. New York: State University of New York Press.
Howson, C., & Urbach, P. (2006). Scientific reasoning: The Bayesian approach (3rd ed.). La Salle, IL: Open Court.
Hubbard, R., & Lindsay, R. M. (2008). Why p values are not a useful measure of evidence in statistical significance testing. Theory and Psychology, 18, 69–88.
Hubbard, R., & Ryan, P. A. (2000). The historical growth of statistical significance testing in psychology—and its future prospects. Educational and Psychological Measurement, 60, 661–681.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Newbury Park, CA: Sage.
Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46, 311–349.
Jeffreys, H. (1939). Theory of probability. Oxford, England: Clarendon.
Kaplan, D. (2000). Structural equation modeling: Foundations and extensions. Thousand Oaks, CA: Sage.
Kenny, D. (1979). Correlation and causation. New York: Wiley.
Kim, J-O., & Mueller, C. W. (1978). Introduction to factor analysis. Beverly Hills, CA: Sage.
Kuhn, T. S. (1996). The structure of scientific revolutions (3rd ed.). Chicago, IL: University of Chicago Press (originally published, 1962).
Lakatos, I. (1970). Falsification and the methodology of scientific research programmes. In I. Lakatos & A. Musgrave (Eds.), Criticism and the growth of knowledge (pp. 91–195). Cambridge, England: Cambridge University Press.
Laudan, L. (1981). Science and hypothesis. Dordrecht, The Netherlands: Reidel.
Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88, 1242–1249.
Lipton, P. (2004). Inference to the best explanation (2nd ed.). London, England: Routledge.
Manicas, P. T. (1989). Explanation and quantification. In B. Glassner & J. D. Moreno (Eds.), The qualitative-quantitative distinction in the social sciences (pp. 179–205). Dordrecht, The Netherlands: Kluwer.
Manicas, P. T., & Secord, P. F. (1983). Implications for psychology of the new philosophy of science. American Psychologist, 38, 399–413.
Markus, K., Hawes, S. S., & Thasites, R. (2008). Abductive inference to psychological variables: Steiger's question and best explanations in psychopathy. Journal of Clinical Psychology, 64, 1069–1088.
Maxwell, G. (1962). The ontological status of theoretical entities. In H. Feigl & G. Maxwell (Eds.), Minnesota Studies in the Philosophy of Science, Vol. 3 (pp. 3–28). Minneapolis, MN: University of Minnesota Press.
McDonald, R. P., & Mulaik, S. A. (1979). Determinacy of common factors: A nontechnical review. Psychological Bulletin, 86, 297–306.


McGrew, T. (2003). Confirmation, heuristics, and explanatory reasoning. British Journal for the Philosophy of Science, 54, 553–567.
McGuire, W. J. (1997). Creative hypothesis generating in psychology: Some useful heuristics. Annual Review of Psychology, 48, 1–30.
McMullin, E. (1995). Underdetermination. The Journal of Medicine and Philosophy, 20, 233–252.
Meehl, P. E. (1967). Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103–115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant using it. Psychological Inquiry, 1, 108–141, 173–180.
Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 393–425). Mahwah, NJ: Lawrence Erlbaum.
Michell, J. (2004). The place of qualitative research in psychology. Qualitative Research in Psychology, 1, 307–319.
Morrison, D. E., & Henkel, R. E. (1970). The significance test controversy. Chicago: Aldine.
Mulaik, S. A. (1985). Exploratory statistics and empiricism. Philosophy of Science, 52, 410–430.
Mulaik, S. A. (1987). A brief history of the philosophical foundations of exploratory factor analysis. Multivariate Behavioral Research, 22, 267–305.
Mulaik, S. A., & McDonald, R. P. (1978). The effect of additional variables on factor indeterminacy in models with a single common factor. Psychometrika, 43, 177–192.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A, 231, 289–337.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
Nickles, T. (1981). What is a problem that we might solve it? Synthese, 47, 85–118.
Nickles, T. (1987). Twixt method and madness. In N. J. Nersessian (Ed.), The process of science (pp. 41–67). Dordrecht, The Netherlands: Martinus Nijhoff.
Pratschke, J. (2003). Realistic models? Critical realism and statistical models in the social sciences. Philosophica, 71, 13–38.
Proctor, R. W., & Capaldi, E. J. (2001). Empirical evaluation and justification of methodologies in psychological science. Psychological Bulletin, 127, 759–772.
Psillos, S. (2004). Inference to the best explanation and Bayesianism. In F. Stadler (Ed.), Induction and deduction in the sciences (pp. 83–91). Dordrecht, The Netherlands: Kluwer.
Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. American Statistician, 40, 313–315.
Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. New York: Chapman & Hall.
Rozeboom, W. W. (1997). Good science is abductive, not hypothetico-deductive. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335–391). Hillsdale, NJ: Lawrence Erlbaum.
Salmon, W. C. (1984). Scientific explanation and the causal structure of the world. Princeton, NJ: Princeton University Press.
Sayer, A. (1992). Methods in social science: A realist approach (2nd ed.). London, England: Routledge.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173–1181.
Schmidt, F. L. (1993). Meta-analysis and cumulative knowledge. Contemporary Psychology, 38, 1163–1165.
Simon, H. (1985). Spurious correlation: A causal interpretation. In H. M. Blalock (Ed.), Causal models in the social sciences (2nd ed., pp. 7–21). New York: Aldine.
Simon, H. A. (1968). On judging the plausibility of theories. In B. van Rootselaar & J. F. Staal (Eds.), Logic, methodology, and philosophy of science, Vol. 3 (pp. 439–459). Amsterdam, The Netherlands: North-Holland.
Smith, M. L., Glass, G. V., & Miller, T. I. (1980). The benefits of psychotherapy. Baltimore, MD: Johns Hopkins University Press.
Sober, E. (1988). The principle of the common cause. In J. H. Fetzer (Ed.), Probability and causality (pp. 211–229). Dordrecht, The Netherlands: Reidel.
Sober, E. (2008). Evidence and evolution: The logic behind the science. Cambridge, England: Cambridge University Press.
Sohn, D. (1996). Meta-analysis and science. Theory and Psychology, 6, 229–246.
Spielman, S. (1974). The logic of tests of significance. Philosophy of Science, 14, 211–225.
Strauss, A. L. (1987). Qualitative analysis for social scientists. New York: Cambridge University Press.
Thagard, P. (1992). Conceptual revolutions. Princeton, NJ: Princeton University Press.
Thagard, P. (1999). How scientists explain disease. Princeton, NJ: Princeton University Press.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago, IL: University of Chicago Press.
Trout, J. D. (1998). Measuring the intentional world: Realism, naturalism, and quantitative methods in the behavioral sciences. New York, NY: Oxford University Press.
Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33, 1–67.
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83–91.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison Wesley.
Tukey, J. W. (1980). We need both exploratory and confirmatory. American Statistician, 34, 23–25.
Waller, N. G. (2004). The fallacy of the null hypothesis in soft psychology. Applied and Preventive Psychology, 11, 83–86.
Weisberg, J. (2009). Locating IBE in the Bayesian framework. Synthese, 167, 125–143.
Whitt, L. A. (1992). Indices of theory promise. Philosophy of Science, 59, 612–634.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Woodward, J. (1989). Data and phenomena. Synthese, 79, 393–472.
Yu, C-H. (2006). Philosophical foundations of quantitative research methodology. Lanham, MD: University Press of America.


CHAPTER 3
Quantitative Methods and Ethics
Ralph L. Rosnow and Robert Rosenthal

Abstract

The purpose of this chapter is to provide a context for thinking about the role of ethics in quantitative

methodology. We begin by reviewing the sweep of events that led to the creation and expansion of

legal and professional rules for the protection of research subjects and society against unethical

research. The risk–beneﬁt approach has served as an instrument of prior control by institutional

review boards. After discussing the nature of that approach, we sketch a model of the costs and

utilities of the “doing” and “not doing” of research. We illustrate some implications of the expanded

model for particular data analytic and reporting practices. We then outline a 5 × 5 matrix of general

ethical standards crossed with general data analytic and reporting standards to encourage thinking

about opportunities to address quantitative methodological problems in ways that may have mutual

ethical and substantive rewards. Finally, we discuss such an opportunity in the context of problems

associated with risk statistics that tend to exaggerate the absolute effects of therapeutic interventions

in randomized trials.
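The abstract's closing point about risk statistics can be previewed with a small numerical sketch. The trial counts and the `risk_summary` helper below are invented for illustration, not taken from the chapter; they simply show how a dramatic-sounding relative risk can coexist with a modest absolute risk reduction.

```python
# Hypothetical two-arm randomized trial (all numbers invented for
# illustration): the same result looks dramatic as a relative risk
# but modest in absolute terms.

def risk_summary(events_treat, n_treat, events_ctrl, n_ctrl):
    """Return relative risk (RR), absolute risk reduction (ARR),
    and number needed to treat (NNT) for a two-arm trial."""
    risk_treat = events_treat / n_treat   # event risk, treated group
    risk_ctrl = events_ctrl / n_ctrl      # event risk, control group
    rr = risk_treat / risk_ctrl           # relative risk
    arr = risk_ctrl - risk_treat          # absolute risk reduction
    nnt = 1 / arr                         # number needed to treat
    return rr, arr, nnt

# 5 events among 200 treated vs. 10 events among 200 controls:
rr, arr, nnt = risk_summary(5, 200, 10, 200)
print(f"RR  = {rr:.2f} (reads as a '50% risk reduction')")
print(f"ARR = {arr:.3f} (only 2.5 percentage points)")
print(f"NNT = {nnt:.0f} patients treated per event prevented")
```

Here the relative risk of 0.50 invites the headline "treatment halves the risk," while the absolute reduction is 2.5 percentage points, i.e., about 40 patients must be treated to prevent one event — the kind of gap the final section of the chapter examines.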

Key Words: Accountability, American Psychological Association (APA) Ethics Code, Belmont Report, ethical principles, health statistics, institutional review board (IRB), moral dilemmas, Nuernberg (Nuremberg) Code, quantitative methodology, risk–benefit assessment, statistical illiteracy, transparency, volunteer bias

In this chapter we sketch an historic and heuristic framework for assessing certain ethical implications of the term quantitative methods. We use this term in the broadest sense to include not only statistical procedures but also what is frequently described as quantitative research (in contrast to qualitative research) in psychology and some other disciplines. As defined in the APA Dictionary of Psychology, the traditional distinction between these two general types of research rests on whether "the approach to science" does (quantitative research) or does not (qualitative research) "employ the quantification (expression in numerical form) of the observations made" (VandenBos, 2007, pp. 762–763). Of course, quantitative and qualitative methods should not be treated as mutually exclusive; it can be illuminating to use both types in the same research. For example, in the typical psychological experiment in which the observations take a numerical form, it may be edifying to ask some of the participants in postexperimental interviews to reflect on the context in which the experiment was conducted and to speculate on the ways in which it may have influenced their own and other participants' behaviors (Orne, 1962, 1969). By the same token, it is usually possible to quantify nonquantitative observations by, for example, decomposing the qualitative subject matter element by element and then numerically and visually analyzing and summarizing the results. Blogs and online discussion groups are currently a popular source of qualitative subject


matter, which researchers have trolled for patterns or relationships that can be quantified by the use of simple summary statistics (e.g., Bordia & Rosnow, 1995) or coded and visually mapped out using social network analysis to highlight links and nodes in the observed relationships (e.g., Kossinets & Watts, 2006; see also Wasserman & Faust, 1994). Whether blogs and online discussion groups' data are treated quantitatively or qualitatively, their use may raise ethical questions regarding the invasion of privacy. The fact that bloggers and participants in online discussion groups are typically fully aware that their communications are quite public minimizes the risk of invasion of privacy.

The term ethics was derived from the Greek ethos, meaning "character" or "disposition." We use the term here to refer to the dos and don'ts of codified and/or culturally ingrained rules by which morally "right" and "wrong" conduct can be differentiated. Conformity to such rules is usually taken to mean morality, and our human ability to make ethical judgments is sometimes described as a moral sense (a tradition that apparently goes back to David Hume's A Treatise of Human Nature in the eighteenth century). Philosophers and theologians have frequently disagreed over the origin of the moral sense, but on intuitive grounds it would seem that morality is subject to societal sensitivities, group values, and social pressures. It is not surprising that researchers have documented systematic biases in ethical judgments. For example, in a study by Kimmel (1991), psychologists were asked to make ethical judgments about hypothetical research cases. Kimmel reported that those psychologists who were more (as compared to less) approving in their ethical judgments were more often men; had held an advanced degree for a longer period of time; had received the advanced degree in an area such as experimental, developmental, or social psychology rather than counseling, school, or community psychology; and were employed in a research-oriented context as opposed to a service-oriented context. Citing this work of Kimmel's (1991), an American Psychological Association (APA) committee raised the possibility that inconsistent implementation of ethical standards by review boards might result not only from the expanded role of review boards but also from the composition of particular boards (Rosnow, Rotheram-Borus, Ceci, Blanck, & Koocher, 1993). Assuming that morality is also predicated on people's abilities to figure out the meaning of other people's actions and underlying intentions, it might be noted that there is also empirical evidence of (1) individual differences in this ability (described as interpersonal acumen) and (2) a hierarchy of intention–action combinations ranging from the least to most cognitively taxing (Rosnow, Skleder, Jaeger, & Rind, 1994).

Societal sensitivities, group values, and situational pressures are subject to change in the face of significant events. On the other hand, some moral values seem to be relatively enduring and universal, such as the golden rule, which is frequently expressed as "Do unto others as you would have them do unto you." In the framework of quantitative methods and ethics, a categorical imperative might be phrased as "Thou shalt not lie with statistics." Still, Huff, in his book, How to Lie with Statistics, first cautioned the public in 1954 that the reporting of statistical data was rife with "bungling and chicanery" (Huff, 1982, p. 6). The progress of science depends on the good faith that scientists have in the integrity of one another's work and the unbiased communication of findings and conclusions. Lying with statistics erodes the credibility of the scientific enterprise, and it can also present an imminent danger to the general public. "Lying with statistics" can refer to a number of more specific practices: for example, reporting only the data that agree with the researcher's bias, omitting any data not supporting the researcher's bias, and, most serious of all, fabricating the results of the research. For example, there was a case reported in 2009 in which an anesthesiologist fabricated the statistical data that he had published in 21 journal articles purporting to give the results of clinical trials of a pain medicine marketed by the company that funded much of the doctor's research (Harris, 2009). Another case, around the same time, involved a medical researcher whose accounts of a blood test for diagnosing prostate cancer had generated considerable excitement in the medical community, but who was now being sued for scientific fraud by his industry sponsor (Kaiser, 2009). As the detection of lying with statistics is often difficult in the normal course of events, there have been calls for the public sharing of raw data so that, as one scientist put it, "Anyone with the skills can conduct their own analyses, draw their own conclusions, and share those conclusions with others" (Allison, 2009, p. 522). That would probably help to reduce some of the problems of biased data analysis, but it would not help much if the shared data had been fabricated to begin with.
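The earlier remarks about quantifying qualitative material — decomposing it element by element and summarizing it with simple statistics, or mapping links and nodes among participants — can be sketched with a toy example. The thread data and coding categories below are invented for illustration; a real analysis would use a dedicated network package, but the point is only that coded qualitative material becomes countable.

```python
from collections import Counter

# Hypothetical coded discussion-group thread: (poster, replied_to, code),
# where `code` is a category assigned by a human coder.
thread = [
    ("ann", None,  "question"),
    ("bob", "ann", "answer"),
    ("cal", "ann", "answer"),
    ("ann", "bob", "thanks"),
    ("bob", "cal", "question"),
    ("cal", "bob", "answer"),
]

# Simple summary statistics over the coded categories.
code_counts = Counter(code for _, _, code in thread)

# A minimal links-and-nodes summary: each reply is a directed edge
# from the replier to the poster being answered.
edges = Counter((src, dst) for src, dst, _ in thread if dst is not None)
degree = Counter()
for (src, dst), n in edges.items():
    degree[src] += n   # replies sent
    degree[dst] += n   # replies received

print(code_counts)
print(degree.most_common())
```

Even this minimal tally exposes structure (who answers whom, how often, and with what kind of message) that could then be mapped visually, which is the step at which the privacy questions discussed above arise.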


In the following section, we review the sweep of events that led to the development and growth of restraints for the protection of human subjects and society against unethical research.1 A thread running throughout the discussion is the progression of the APA's code of conduct for psychological researchers who work with human subjects. We assume that many readers of this Handbook will have had a primary or consulting background in some area of psychology or a related research area. The development of the APA principles gives us a glimpse of the specific impact of legal regulations and societal sensitivities in an area in which human research has been constantly expanding into new contexts, including "field settings and biomedical contexts where research priorities are being integrated with the priorities and interests of nonresearch institutions, community leaders, and diverse populations" (Sales & Folkman, 2000, p. ix). We then depict an idealized risk–benefit approach that review boards have used as an instrument of prior control of research, and we also describe an expanded model focused on the costs and utilities of "doing" and "not doing" research. The model can also be understood in terms of the cost–utility of adopting versus not adopting particular data analytic and reporting practices. We then outline a matrix of general ethical standards crossed with general data analytic and reporting standards as (1) a reminder of the basic distinction between ethical and technical mandates and (2) a framework for thinking about promising opportunities for ethical and substantive rewards in quantitative methodology (cf. Blanck, Bellack, Rosnow, Rotheram-Borus, & Schooler, 1992; Rosenthal, 1994; Rosnow, 1997). We discuss such an opportunity in the context of the way in which a fixation on relative risk (RR) in large sample randomized trials of therapeutic interventions can lead to misconceptions about the practical meaning to patients and health-care providers of the particular intervention tested.

The Shaping of Principles to Satisfy Ethical and Legal Standards

If it can be said that a single historical event in modern times is perhaps most responsible for initially galvanizing changes in the moral landscape of science, then it would be World War II. On December 9, 1946 (the year after the surrender of Germany on May 8, 1945 and the surrender of Japan on August 14, 1945), criminal proceedings against Nazi physicians and administrators who had participated in war crimes and crimes against humanity were presented before a military tribunal in Nuernberg, Germany. For allied atomic scientists, Hiroshima had been an epiphany that vaporized the old iconic image of a morally neutral science. For researchers who work with human participants, the backdrop to the formation of ethical and legal principles to protect the rights and welfare of all research participants were the shocking revelations of the war crimes documented in meticulous detail at the Nuernberg Military Tribunal. Beginning with the German invasion of Poland at the outbreak of World War II, Jews and other ethnic minority inmates of concentration camps had been subjected to sadistic tortures and other barbarities in "medical experiments" by Nazi physicians in the name of science. As methodically described in the multivolume report of the trials, "in every one of the experiments the subjects experienced extreme pain or torture, and in most of them they suffered permanent injury, mutilation, or death" (Trials of War Criminals before the Nuernberg Military Tribunals under Control Council Law No. 10, p. 181). Table 3.1 reprints the principles of the Nuernberg Code, which have resonated to varying degrees in all ensuing codes for biomedical research with human participants as well as having had a generative influence on the development of principles for the conduct of behavioral and social research.

We pick up the story again in the 1960s in the United States, a period punctuated by the shocking assassinations of President John F. Kennedy in 1963 and then of Dr. Martin Luther King, Jr., and Senator Robert F. Kennedy in 1968. The 1960s were also the beginning of the end of what Pattullo (1982) called "the hitherto sacrosanct status" of the human sciences, which moved "into an era of uncommonly active concern for the rights and welfare of segments of the population that had traditionally been neglected or exploited" (p. 375). One highly publicized case in 1963 involved a noted cancer researcher who had injected live cancer cells into elderly, noncancerous patients, "many of whom were not competent to give free, informed consent" (Pattullo, p. 375). In 1966, the U.S. Surgeon General issued a set of regulations governing the use of subjects by researchers whose work was funded by the National Institutes of Health (NIH). Most NIH grants funded biomedical research, but there was also NIH support for research in the behavioral and social sciences. In 1969, following the exposure of further instances in which the welfare of subjects had been ignored or endangered in biomedical research (cf. Beecher, 1966, 1970; Katz,

34 quantitative methods and ethics

Table 3.1. The Nuernberg Principles of 1946–1949 for Permissible Medical Experiments∗

1. The voluntary consent of the human subject is absolutely essential. This means that the person involved should have legal capacity to give consent; should be so situated as to be able to exercise free power of choice, without the intervention of any element of force, fraud, deceit, duress, over-reaching, or other ulterior form of constraint or coercion; and should have sufficient knowledge and comprehension of the elements of the subject matter involved as to enable him to make an understanding and enlightened decision. This latter element requires that before the acceptance of an affirmative decision by the experimental subject there should be made known to him the nature, duration, and purpose of the experiment; the method and means by which it is to be conducted; all inconveniences and hazards reasonably to be expected; and the effects upon his health or person which may possibly come from his participation in the experiment. The duty and responsibility for ascertaining the quality of the consent rests upon each individual who initiates, directs or engages in the experiment. It is a personal duty and responsibility which may not be delegated to another with impunity.

2. The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods or means of study, and not random and unnecessary in nature.

3. The experiment should be so designed and based on the results of animal experimentation and a knowledge of the natural history of the disease or other problem under study that the anticipated results will justify the performance of the experiment.

4. The experiment should be so conducted as to avoid all unnecessary physical and mental suffering and injury.

5. No experiment should be conducted where there is an a priori reason to believe that death or disabling injury will occur; except, perhaps, in those experiments where the experimental physicians also serve as subjects.

6. The degree of risk to be taken should never exceed that determined by the humanitarian importance of the problem to be solved by the experiment.

7. Proper preparations should be made and adequate facilities provided to protect the experimental subject against even remote possibilities of injury, disability, or death.

8. The experiment should be conducted only by scientifically qualified persons. The highest degree of skill and care should be required through all stages of the experiment of those who conduct or engage in the experiment.

9. During the course of the experiment the human subject should be at liberty to bring the experiment to an end if he has reached the physical or mental state where continuation of the experiment seems to him to be impossible.

10. During the course of the experiment the scientist in charge must be prepared to terminate the experiment at any stage, if he has probable cause to believe, in the exercise of the good faith, superior skill and careful judgment required of him that a continuation of the experiment is likely to result in injury, disability, or death to the experimental subject.

∗ Reprinted from pp. 181–182 in Trials of War Criminals before the Nuernberg Military Tribunals under Control Council Law No. 10, October 1946–April 1949, Vol. II. Washington, DC: U.S. Government Printing Office.

1972), the Surgeon General extended the earlier safeguards to all human research. In a notorious case (not made public until 1972), a study conducted by the U.S. Public Health Service (USPHS) simply followed the course of syphilis in more than 400 low-income African-American men residing in Tuskegee, Alabama, from 1932 to 1972 (Jones, 1993). Recruited from churches and clinics with the promise of free medical examinations and free health care, the men who were subjects in this study were never informed they had syphilis but only told they had "bad blood." They also were not offered penicillin when it was discovered in 1943 and became widely available in the 1950s, and they were warned not to seek treatment elsewhere or they would be dropped from the study. The investigators went so far as to have local doctors promise not to treat the men in the study with antibiotics (Stryker, 1997). As the disease progressed in its predictable course without any treatment, the men experienced damage to

rosnow, rosenthal 35

their skeletal, cardiovascular, and central nervous systems and, in some cases, death. In 1972, the appalling details were finally made public by a lawyer who had been an epidemiologist for the USPHS, and the study was halted (Fairchild & Bayer, 1999). The following year, the Senate Health Subcommittee (chaired by Senator Edward Kennedy) aired the issue of scientific misconduct in public hearings.

The early 1960s was also a period when emotions about invasions of privacy were running high in the United States after a rash of reports of domestic wiretapping and other clandestine activities by federal agencies. In the field of psychology, the morality of the use of deception was being debated. As early as the 1950s, there had been concerned statements issued about the use of deception in social psychological experiments (Vinacke, 1954). The spark that lit a fuse in the 1960s in the field of psychology was the publication of Stanley Milgram's studies on obedience to authority, in which he had used an elaborate deception and found that a majority of ordinary research subjects were willing to administer an allegedly dangerous level of shock to another person when "ordered" to do so by a person in authority, although no shock was actually administered (cf. Blass, 2004; Milgram, 1963, 1975). Toward the end of the 1960s, there were impassioned pleas by leading psychologists for the ethical codification of practices commonly used in psychological research (Kelman, 1968; Smith, 1969). As there were new methodological considerations and federal regulations since the APA had formulated a professional code of ethics in 1953, a task force was appointed to draft a set of ethical principles for research with human subjects. Table 3.2 shows the final 10 principles adopted by the APA's Council of Representatives in 1972, which were elucidated in a booklet that was issued the following year, Ethical Principles in the Conduct of Research with Human Participants (APA, 1973). An international survey conducted 1 year later found there were by then two dozen codes of ethics that had been either adopted or were under review by professional organizations of social scientists (Reynold, 1975). Although violations of such professional codes were supported by penalties such as loss of membership in the organization, the problem was that many researchers engaged in productive, rewarding careers did not belong to these professional organizations.

By the end of the 1970s, the pendulum had swung again, as accountability had become the watchword of the decade (National Commission on Research, 1980). In 1974, the guidelines provided by the Department of Health, Education, and Welfare (DHEW) 3 years earlier were codified as government regulations by the National Research Act of July 12, 1974 (Pub. L. 93–348). Among the requirements instituted by the government regulations was that institutions that received federal funding establish an institutional review board (IRB) for the purpose of making prior assessments of the possible risks and benefits of proposed research.2 This federal act also created the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. Following hearings that were held over a 3-year period, the document called "The Belmont Report" was issued in April, 1979 (available online and also reprinted in Sales & Folkman, 2000). Unlike other reports of the Commission, the Belmont Report did not provide a list of specific recommendations for administrative action by the DHEW, but the Belmont Report recommended that the report be adopted in its entirety as a statement of DHEW policy. In the preamble, the report mentioned the standards set by the Nuernberg ("Nuremberg") Code as the prototype of many later codes consisting of rules, some general and others specific, to guide researchers and assure that research involving human participants would be carried out in an ethical manner. Noting that the rules were often inadequate to cover complex situations, that they were often difficult to apply or interpret, and that they often came into conflict with one another, the National Commission had decided to issue broad ethical principles to provide a basis on which specific rules could then be formulated, criticized, and interpreted. As we track the development of the APA principles in this discussion, we will see that there has been a similar progression, and later we will emphasize some broad ethical principles when we discuss the interface of ethical and technical standards in quantitative methodology. For now, however, it can be noted that the Belmont Report proposed that (1) respect for persons, (2) beneficence, and (3) justice provide the foundation for research ethics. The report also proposed norms for scientific conduct in six major areas: (1) the use of valid research designs, (2) the competence of researchers, (3) the identification of risk–benefit consequences, (4) the selection of research participants, (5) the importance of obtaining informed voluntary consent, and (6) compensation for injury.3

In 1982, the earlier APA code was updated, and a new version of Ethical Principles in the Conduct of Research with Human Participants was published

36 quantitative methods and ethics

Table 3.2. The Ethical Principles Adopted in December, 1972, by the Council of Representatives of the American Psychological Association∗

The decision to undertake research rests upon a considered judgment by the individual psychologist about how best to contribute to psychological science and to human welfare. The responsible psychologist weighs alternative directions in which personal energies and resources might be invested. Having made the decision to conduct research, psychologists must carry out their investigation with respect for the people who participate and with concern for their dignity and welfare. The Principles that follow make explicit the investigator's ethical responsibilities toward participants over the course of research, from the initial decision to pursue a study to the steps necessary to protect the confidentiality of research data. These Principles should be interpreted in terms of the context provided in the complete document offered as a supplement to these Principles.

1. In planning a study the investigator has the personal responsibility to make a careful evaluation of its ethical acceptability, taking into account these Principles for research with human beings. To the extent that this appraisal, weighing of scientific and humane values, suggests a deviation from any Principle, the investigator incurs an increasingly serious obligation to seek ethical advice and to observe more stringent safeguards to protect the rights of the human research participants.

2. Responsibility for the establishment and maintenance of acceptable ethical practice in research always remains with the individual investigator. The investigator is also responsible for the ethical treatment of research participants by collaborators, assistants, students, and employees, all of whom, however, incur parallel obligations.

3. Ethical practice requires the investigator to inform the participant of all features of the research that reasonably might be expected to influence willingness to participate and to explain all other aspects of the research about which the participant inquires. Failure to make full disclosure gives added emphasis to the investigator's responsibility to protect the welfare and dignity of the research participant.

4. Openness and honesty are essential characteristics of the relationship between investigator and research participant. When the methodological requirements of a study necessitate concealment or deception, the investigator is required to ensure the participant's understanding of the reasons for this action and to restore the quality of the relationship with the investigator.

5. Ethical research practice requires the investigator to respect the individual's freedom to decline to participate in research or to discontinue participation at any time. The obligation to protect this freedom requires special vigilance when the investigator is in a position of power over the participant. The decision to limit this freedom increases the investigator's responsibility to protect the participant's dignity and welfare.

6. Ethically acceptable research begins with the establishment of a clear and fair agreement between the investigator and the research participant that clarifies the responsibilities of each. The investigator has the obligation to honor all promises and commitments included in that agreement.

7. The ethical investigator protects participants from physical and mental discomfort, harm, and danger. If the risk of such consequences exists, the investigator is required to inform the participant of that fact, secure consent before proceeding, and take all possible measures to minimize distress. A research procedure may not be used if it is likely to cause serious and lasting harm to participants.

8. After the data are collected, ethical practice requires the investigator to provide the participant with a full clarification of the nature of the study and to remove any misconceptions that may have arisen. Where scientific or human values justify delaying or withholding information, the investigator acquires a special responsibility to assure that there are no damaging consequences for the participant.


9. Where research procedures may result in undesirable consequences for the participant, the investigator has the responsibility to detect and remove or correct these consequences, including, where relevant, long-term aftereffects.

10. Information obtained about the research participants during the course of an investigation is confidential. When the possibility exists that others may obtain access to such information, ethical research practice requires that this possibility, together with the plans for protecting confidentiality, be explained to the participants as a part of the procedure for obtaining informed consent.

∗ Quoted from pp. 1–2 in Ethical Principles in the Conduct of Research with Human Participants. Washington, DC: American Psychological Association. Copyright © 1973 by the American Psychological Association.

by the APA. In the earlier version and in the 1982 version, the principles were based on actual ethical problems that researchers had experienced, and extensive discussion throughout the profession was incorporated in each edition of Ethical Principles. The principles in the 1982 code are reprinted in Table 3.3. Notice that there were several new terms (subject at risk and subject at minimal risk) and also an addendum sentence to informed consent (referring to "research with children or with participants who have impairments that would limit understanding and/or communication"). The concept of minimal risk (which came out of the Belmont Report) means that the likelihood and extent of harm to the participants are presumed to be no greater than what may be typically experienced in everyday life or in routine physical or psychological examinations (Scott-Jones & Rosnow, 1998, p. 149). In actuality, the extent of harm may not be completely anticipated, and estimating the likelihood of harm is frequently difficult or impossible. Regarding the expanded statement on deception, the use of deception in research had been frowned upon for some years although there had long been instances in which active and passive deceptions had been used routinely. An example was the withholding of information (passive deception). Randomized clinical trials would be considered of dubious value in medical research had the experimenters and the participants not been deprived of information regarding which condition was assigned to each participant. On the other hand, in some areas of behavioral experimentation, the use of deception has been criticized as having "reached a 'taken-for-granted' status" (Smith, Kimmel, & Klein, 2009, p. 486).4

Given the precedence of federal (and state) regulations since the guidelines developed by the DHEW were codified by the National Research Act in 1974 (and revised as of November 6, 1975), researchers were perhaps likely to take their ethical cues from the legislated morality and its oversight by IRBs as opposed to the aspirational principles embodied in professional codes, such as the APA code. Another complication in this case is that there was a fractious splintering of the APA in the late-1980s, which resulted in many members resigning from the APA and the creation of the rival American Psychological Society, subsequently renamed the Association for Psychological Science (APS). For a time in the 1990s, a joint task force of the APA and the APS attempted to draft a revised ethics code, but the APS then withdrew its participation following an apparently irresolvable disagreement. In 2002, after a 5-year revision process, APA adopted a reworked ethics code that emphasized the five general principles defined (by APA) in Table 3.4 and also "specific standards" that fleshed out these principles.5 The tenor of the final document was apparently intended to reflect the remaining majority constituency of the APA (practitioners) but also the residual constituency of psychological scientists who perform either quantitative or qualitative research in fundamental and applied contexts. Of the specific principles with some relevance to data analysis or quantitative methods, there were broadly stated recommendations such as sharing the research data for verification by others (Section 8.14), not making deceptive or false statements (Section 8.10), using valid and reliable instruments (Section 9.02), drawing on current knowledge for design, standardization, validation, and the reduction or elimination of bias when constructing any psychometric instruments (Section 9.05). We turn next to the risk–benefit process, but we should also note that ethical values with relevance to statistical practices are embodied in the codes developed by statistical organizations (e.g., American Statistical Association, 1999; see also Panter & Sterba, 2011).


Table 3.3. The Revised Ethical Principles Adopted in August, 1982, by the Council of Representatives of the American Psychological Association for Research with Human Participants∗

The decision to undertake research rests upon a considered judgment by the individual psychologist about how best to contribute to psychological science and human welfare. Having made the decision to conduct research, the psychologist considers alternative directions in which research energies and resources might be invested. On the basis of this consideration, the psychologist carries out the investigation with respect and concern for the dignity and welfare of the people who participate and with cognizance of federal and state regulations and professional standards governing the conduct of research with human participants.

A. In planning a study, the investigator has the responsibility to make a careful evaluation of its ethical acceptability. To the extent that the weighing of scientific and human values suggests a compromise of any principle, the investigator incurs a correspondingly serious obligation to seek ethical advice and to observe stringent safeguards to protect the rights of human participants.

B. Considering whether a participant in a planned study will be a "subject at risk" or a "subject at minimal risk," according to recognized standards, is of primary ethical concern to the investigator.

C. The investigator always retains the responsibility for ensuring ethical practice in research. The researcher is also responsible for the ethical treatment of research participants by collaborators, assistants, students, and employees, all of whom, however, incur similar obligations.

D. Except in minimal-risk research, the investigator establishes a clear and fair agreement with research participants, prior to their participation, that clarifies the obligations and responsibilities of each. The investigator has the obligation to honor all promises and commitments included in that agreement. The investigator informs the participants of all aspects of the research that might reasonably be expected to influence willingness to participate and explains all other aspects of the research about which the participants inquire. Failure to make full disclosure prior to obtaining informed consent requires additional safeguards to protect the welfare and dignity of the research participants. Research with children or with participants who have impairments that would limit understanding and/or communication requires special safeguarding procedures.

E. Methodological requirements of a study may make the use of concealment or deception necessary. Before conducting such a study, the investigator has a special responsibility to (1) determine whether the use of such techniques is justified by the study's prospective scientific, educational, or applied value; (2) determine whether alternative procedures are available that do not use concealment or deception; and (3) ensure that the participants are provided with sufficient explanation as soon as possible.

F. The investigator respects the individual's freedom to decline to participate in or to withdraw from the research at any time. The obligation to protect this freedom requires careful thought and consideration when the investigator is in a position of authority or influence over the participant. Such positions of authority include, but are not limited to, situations in which research participation is required as part of employment or in which the participant is a student, client, or employee of the investigator.

G. The investigator protects the participant from physical and mental discomfort, harm, and danger that may arise from research procedures. If risks of such consequences exist, the investigator informs the participant of that fact. Research procedures likely to cause serious or lasting harm to a participant are not used unless the failure to use these procedures might expose the participant to risk of greater harm or unless the research has great potential benefit and fully informed and voluntary consent is obtained from each participant. The participant should be informed of procedures for contacting the investigator within a reasonable time period following participation should stress, potential harm, or related questions or concerns arise.

H. After the data are collected, the investigator provides the participant with information about the nature of the study and attempts to remove any misconceptions that may have arisen. Where scientific or human values justify delaying or withholding this information, the investigator incurs a special responsibility to monitor the research and to ensure that there are no damaging consequences for the participant.


I. Where research procedures result in undesirable consequences for the individual participant, the investigator has the responsibility to detect and remove or correct these consequences, including long-term effects.

J. Information obtained about a research participant during the course of an investigation is confidential unless otherwise agreed upon in advance. When the possibility exists that others may obtain access to such information, this possibility, together with the plans for protecting confidentiality, is explained to the participant as part of the procedure for obtaining informed consent.

∗ Quoted from pp. 5–7 in Ethical Principles in the Conduct of Research with Human Participants. Washington, DC: American Psychological Association.

Expanding the Calculation of Risks and Benefits

After the Belmont Report, it seemed that everything changed permanently for scientists engaged in human subject research, and it made little difference whether they were engaged in biomedical, behavioral, or social research. As the philosopher John E. Atwell (1981) put it, the moral dilemma was to defend the justification of using human subjects as the means to an end that was beneficial in some profoundly significant way (e.g., the progression of science, public health, or public policy) while protecting the moral "ideals of human dignity, respect for persons, freedom and self-determination, and a sense of personal worth" (p. 89). Review boards were now delegated the responsibility of making prior assessments of the future consequences of proposed research on the basis of the probability that a certain magnitude of psychological, physical, legal, social, or economic harm might result, weighed against the likelihood that "something of positive value to health or welfare" might result. Quoting the Belmont Report, "risk is properly contrasted to probability of benefits, and benefits are properly contrasted with harms rather than risks of harms," where the "risks and benefits of research may affect the individual subjects, the families of the individual subjects, and society at large (or special groups of subjects in society)." The moral calculus of benefits to risks was said to be "in a favorable ratio" when the anticipated risks were outweighed by the anticipated benefits to the subjects (assuming this was applicable) and the anticipated benefit to society in the form of the advancement of knowledge. Put into practice, however, researchers and members of review boards found it difficult to "exorcize the devil from the details" when challenged by ethical guidelines that frequently conflicted with traditional technical criteria (Mark, Eyssell, & Campbell, 1999, p. 48). As human beings are not omniscient, there was also the problem that "neither the risks nor the benefits . . . can be perfectly known in advance" (Mark et al., 1999, p. 49).

These complications notwithstanding, another catch-22 of the risk–benefit assessment is that it focuses only on the doing of research. Some years ago, we proposed a way of visualizing this predicament: first, in terms of an idealized representation of the risk–benefit assessment and, second, in terms of an alternative model focused on the costs and benefits of both the doing and not doing of research (Rosenthal & Rosnow, 1984). The latter model also has implications for the risk–benefit (we prefer the term cost–utility) of using or not using particular quantitative methods (we return to this idea in a moment). First, however, Figure 3.1 shows an idealized representation of the traditional risk–benefit assessment. Risk (importance or probability of harm) is plotted from low (C) to high (A) on the vertical axis, and the benefit is plotted from low (C) to high (D) on the horizontal axis. In other words, studies in which the risk–benefit assessment is close to A would presumably be less likely to be approved; studies close to D would be more likely to be approved; and studies falling along the B–C "diagonal of indecision" exist in a limbo of uncertainty until relevant information nudges the assessment to either side of the diagonal. The idea of "zero risk" is a methodological conceit, however, because all human subject research can be understood as carrying some degree of risk. The potential risk in the most benign behavioral and social research, for example, is the "danger of violating someone's basic rights, if only the right of privacy" (Atwell, 1981, p. 89). However, the fundamental problem of the traditional model represented in Figure 3.1 is that it runs the risk of ignoring the "not doing of research." Put another way, there are also moral costs when potentially useful research is forestalled, or if the design or implementation is


Table 3.4. General Principles Adopted in 2003 by the American Psychological Association∗

General Principles

General Principles, as opposed to Ethical Standards, are aspirational in nature. Their intent is to guide and inspire psychologists toward the very highest ethical ideals of the profession. General Principles, in contrast to Ethical Standards, do not represent obligations and should not form the basis for imposing sanctions. Relying upon General Principles for either of these reasons distorts both their meaning and purpose.

Principle A: Beneficence and Nonmaleficence

Psychologists strive to benefit those with whom they work and take care to do no harm. In their professional actions, psychologists seek to safeguard the welfare and rights of those with whom they interact professionally and other affected persons, and the welfare of animal subjects of research. When conflicts occur among psychologists' obligations or concerns, they attempt to resolve these conflicts in a responsible fashion that avoids or minimizes harm. Because psychologists' scientific and professional judgments and actions may affect the lives of others, they are alert to and guard against personal, financial, social, organizational, or political factors that might lead to misuse of their influence. Psychologists strive to be aware of the possible effect of their own physical and mental health on their ability to help those with whom they work.

Principle B: Fidelity and Responsibility

Psychologists establish relationships of trust with those with whom they work. They are aware of their professional and scientific responsibilities to society and to the specific communities in which they work. Psychologists uphold professional standards of conduct, clarify their professional roles and obligations, accept appropriate responsibility for their behavior, and seek to manage conflicts of interest that could lead to exploitation or harm. Psychologists consult with, refer to, or cooperate with other professionals and institutions to the extent needed to serve the best interests of those with whom they work. They are concerned about the ethical compliance of their colleagues' scientific and professional conduct. Psychologists strive to contribute a portion of their professional time for little or no compensation or personal advantage.

Principle C: Integrity

Psychologists seek to promote accuracy, honesty, and truthfulness in the science, teaching, and practice of psychology. In these activities psychologists do not steal, cheat, or engage in fraud, subterfuge, or intentional misrepresentation of fact. Psychologists strive to keep their promises and to avoid unwise or unclear commitments. In situations in which deception may be ethically justifiable to maximize benefits and minimize harm, psychologists have a serious obligation to consider the need for, the possible consequences of, and their responsibility to correct any resulting mistrust or other harmful effects that arise from the use of such techniques.

Principle D: Justice

Psychologists recognize that fairness and justice entitle all persons to access to and benefit from the contributions of psychology and to equal quality in the processes, procedures, and services being conducted by psychologists. Psychologists exercise reasonable judgment and take precautions to ensure that their potential biases, the boundaries of their competence, and the limitations of their expertise do not lead to or condone unjust practices.

Principle E: Respect for People's Rights and Dignity

Psychologists respect the dignity and worth of all people, and the rights of individuals to privacy, confidentiality, and self-determination. Psychologists are aware that special safeguards may be necessary to protect the rights and welfare of persons or communities whose vulnerabilities impair autonomous decision making. Psychologists are aware of and respect cultural, individual, and role differences, including those based on age, gender, gender identity, race, ethnicity, culture, national origin, religion, sexual orientation, disability, language, and socioeconomic status and consider these factors when working with members of such groups. Psychologists try to eliminate the effect on their work of biases based on those factors, and they do not knowingly participate in or condone activities of others based upon such prejudices.

∗ Quoted from the American Psychological Association's Ethical Principles of Psychologists and Code of Conduct (http://www.apa.org/ethics/code2002.html). Effective date June 1, 2003, copyrighted in 2002 by the American Psychological Association.

compromised in a way that jeopardizes the integrity of the research (cf. Haywood, 1976).

Figure 3.2 shows an alternative representing a cost–utility assessment of both the doing and not doing of research. In Part A, the decision plane model on the left corresponds to a cost–utility appraisal of the "doing of research," and the model on the right corresponds to an appraisal of the "not doing of research." We use the terms cost and utility each in a collective sense. That is, the cost of doing and the cost of not doing a particular research study include more than only the risk of psychological or


A B physical harm; they also include the cost to society,

High funding agencies, and to scientiﬁc knowledge when

imagination and new scientiﬁcally based solutions

are stiﬂed. As one scientist observed, “Scientists

know that questions are not settled; rather, they are

on

given provisional answers for which it is contingent

Risk of doing

isi

ec

upon the imagination of followers to ﬁnd more illu-

nd

fi minating solutions” (Baltimore, 1997, p. 8). We

lo

a

on

iag

D

useful, but in a way that may have no immedi-

ate application and instead “speaks to our sense

Low

of wonder and paves the way for future advances”

Low High (Committee on Science, Engineering, and Public

Beneﬁt of doing

C D Policy, 2009, p. 3). These ﬁgurative deﬁnitions of

cost and utility aside, Part B of Figure 3.2 suggests

Figure 3.1 Idealized decision-plane model representing the rel-

ative risks and beneﬁts of research submitted to a review board

a way of transforming the three dimensions of Part

for prior approval (after Rosenthal & Rosnow, 1984; Rosnow & A to a two-dimensional model. Suppose an A–D

Rosenthal, 1997). “decision diagonal” for each of the decision planes

Figure 3.2 Decision-planes representing the ethical assessment of the costs and utilities of doing and not doing research (after Rosenthal & Rosnow, 1984, 2008). (A) Costs and utilities of doing (left plane) and not doing (right plane) research. (B) Composite plane representing both cases in Part A (above).
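The composite plane in Part B lends itself to a simple computation. As a minimal sketch (one way to operationalize the idea, not the authors' own formula), assume cost and utility are each scaled 0 to 1, with corner A at low utility/high cost and corner D at high utility/low cost; a point's position along the A–D decision diagonal can then be taken as its distance from A relative to its summed distances from A and D:

```python
import math

def diagonal_position(utility, cost):
    """Locate a (utility, cost) point on the A-D decision diagonal.

    Corners follow Figure 3.2: A = (low utility, high cost) and
    D = (high utility, low cost), each axis scaled 0 to 1.
    Returns a value between 0 (at A, "don't do") and 1 (at D, "do").
    """
    dist_a = math.hypot(utility - 0.0, cost - 1.0)  # distance from corner A
    dist_d = math.hypot(utility - 1.0, cost - 0.0)  # distance from corner D
    return dist_a / (dist_a + dist_d)

# A point twice as far from A as from D lands two-thirds of the way toward D.
print(diagonal_position(2 / 3, 1 / 3))  # about two-thirds of the way from A to D
```

The ratio reproduces the worked example in the text: a point twice as far from A as from D maps to a position two-thirds of the way along the diagonal, closer to D ("do") than to A.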

42 Quantitative Methods and Ethics

in Part A (in contrast to B–C and B'–C', the diagonals of indecision). For any point in the plane of doing, there would be a location on the cost axis and on the utility axis, where any point could be translated to an equivalent position on the decision diagonal. Thus, if a point were twice as far from A as from D, the transformed point would then be located two-thirds of the way on the decision diagonal A–D (closer to D than to A). Similar reasoning is applicable to not doing, with the exception that closeness to A would mean "do" rather than "not do." Points near D tell us the research should be done, and points near D' tell us the research should not be done.6

Figure 3.2 can also be a way of thinking about cost–utility dilemmas regarding quantitative methods and statistical reporting practices. In the 2009 edition of the U.S. National Academy of Sciences (NAS) guide to responsible conduct in scientific research, there are several hypothetical scenarios, including one in which a pair of researchers (a postdoctoral and a graduate student) discuss how they should deal with two anomalous data points in a graph they are preparing to present in a talk (Committee on Science, Engineering, and Public Policy, 2009). They want to put the best face on their research, but they fear that discussing the two outliers will draw people's attention away from the bulk of the data. One option would be to drop the outliers, but, as one researcher cautions, this could be viewed as "manipulating" the data, which is unethical. The other person comments that if they include the anomalous points, and if a senior person then advises them to include the anomalous data in a paper they are drafting for publication, this could make it harder to have the paper accepted by a top journal. That is, the reported results will not be unequivocal (a potential reason for rejection), and the paper will also then be too wordy (another reason to reject it?). In terms of Figure 3.2, not including the two anomalous data points is analogous to the "not doing of research." There are, of course, additional statistical options, which can also be framed in cost–utility terms, such as using a suitable transformation to pull in the outlying stragglers and make them part of the group (cf. Rosenthal & Rosnow, 2008, pp. 310–311). On the other hand, outliers that are not merely recording errors or instrument errors can sometimes provide a clue as to a plausible moderator variable. Suppressing this information could potentially impede scientific progress (cf. Committee on Science, Engineering, and Public Policy, 2009, p. 8).

Unfortunately, there are also cases involving the suppression of data where the cost is not only that it impedes progress in the field, but it also undermines the authority and trustworthiness of scientific research and, in some instances, can cause harm to the broader society, such as when public policy is based on only partial information or when there is selective outcome reporting of the efficacy of clinical interventions in published reports of randomized trials (Turner, Matthews, Linardatos, Tell, & Rosenthal, 2008; Vedula, Bero, Scherer, & Dickersin, 2009). In an editorial in Science, Cicerone (2010), then president of the NAS, stated that his impression—based on information from scattered public opinion polls and various assessments of leaders in science, business, and government—was that "public opinion has moved toward the view that scientists often try to suppress alternative hypotheses and ideas and that scientists will withhold data and try to manipulate some aspects of peer review to prevent dissent" (p. 624). Spielmans and Parry (2010) described a number of instances of "marketing-based medicine" by pharmaceutical firms. Cases included the "cherry-picking" of data for publication, the suppression or understatement of negative results, and the publication (and distribution to doctors) of journal articles that were not written by the academic authors who lent their names, titles, and purported independence to the papers but instead had been written by ghost writers hired by pharmaceutical and medical-device firms to promote company products. Spielmans and Parry displayed a number of screen shots of company e-mails, which we do not usually get to see because they go on behind the curtain. In an editorial in PLoS Medicine (2009) lamenting the problem of ghost writers and morally dubious practices in the medical marketing of pharmaceutics, the editors wrote:

How did we get to the point that falsifying the medical literature is acceptable? How did an industry whose products have contributed to astounding advances in global health over the past several decades come to accept such practices as the norm? Whatever the reasons, as the pipeline for new drugs dries up and companies increasingly scramble for an ever-diminishing proportion of the market in "me-too" drugs, the medical publishing and pharmaceutical industries and the medical community have become locked into a cycle of mutual dependency, in which truth and a lack of bias have come to be seen as optional extras. Medical journal editors need to decide whether they want to


roll over and just join the marketing departments of pharmaceutical companies. Authors who put their names to such papers need to consider whether doing so is more important than having a medical literature that can be believed in. Politicians need to consider the harm done by an environment that incites companies into insane races for profit rather than for medical need. And companies need to consider whether the arms race they have started will in the end benefit anyone. After all, even drug company employees get sick; do they trust ghost authors?

Ethical Standards and Quantitative Methodological Standards

We turn now to Table 3.5, which shows a matrix of general ethical standards crossed with quantitative methodological standards (after Rosnow & Rosenthal, 2011). We do not claim that the row and column standards are either exhaustive or mutually exclusive but only that they are broadly representative of (1) aspirational ideals in the society as a whole and (2) methodological, data analytic, and reporting standards in science and technology. The matrix is a convenient way of reminding ourselves of the distinction between (1) and (2), and it is also a way of visualizing a potential clash between (1) and (2) and, frequently, the opportunity to exploit this situation in a way that could have rewarding ethical and scientific implications. Before we turn specifically to the definitions of the row and column headings in Table 3.5, we will give a quick example of what we mean by "rewarding ethical and scientific implications" in the context of the recruitment of volunteers. For this example, we draw on some of our earlier work on specific threats to validity (collectively described as artifacts) deriving from the volunteer status of the participants for research participation. Among our concerns when we began to study the volunteer was that ethical sensitivities seemed to be propelling psychological science into a science of informed volunteers (e.g., Rosenthal & Rosnow, 1969; Rosnow & Rosenthal, 1970). It was long suspected that people who volunteered for behavioral and social research might not be fully adequate models for the study of behavior in general. To the extent that volunteers differ from nonvolunteers on dimensions of importance, the use of volunteers could have serious effects on such estimated parameters as means, medians, proportions, variances, skewness, and kurtosis. The estimation of parameters such as these is the principal goal in survey research, whereas in experimental research the focus is usually on the magnitude of the difference between the experimental and control group means. Such differences, we and other investigators observed, were sometimes affected by the use of volunteers (Rosenthal & Rosnow, 1975, 2009).

With problems such as these serving as beginning points for empirical and meta-analytic investigations, we explored the characteristics that differentiated volunteers and nonvolunteers, the situational determinants of volunteering, some possible interactions of volunteer status with particular treatment effects, the implications for predicting the direction and, sometimes, the magnitude of the biasing effects in research situations, and we also thought about the broader ethical implications of these findings (Rosenthal & Rosnow, 1975; Rosnow & Rosenthal, 1997). For example, in one aspect of our meta-analytic inquiry, we put the following question to the research literature: What are the variables that tend to increase or decrease the rates of volunteering obtained? Our preliminary answers to this question may have implications for both the theory and practice of behavioral science. That is, if we continue to learn more about the situational determinants of volunteering, we can learn more about the social psychology of social influence processes. Methodologically, once we learn more about the situational determinants of volunteering, we should be in a better position to reduce the bias in our samples that derives from the volunteer subjects being systematically different from nonvolunteers in a variety of characteristics. For example, one situational correlate was that the more important the research was perceived, the more likely people were to volunteer for it. Thus, mentioning the importance of the research during the recruitment phase might coax more of the "nonvolunteers" into the sampling pool. It would be unethical to exaggerate or misrepresent the importance of the research. By being honest, transparent, and informative, we are treating people with respect and also giving them a well-founded justification for asking them to volunteer their valuable time, attention, and cooperation. In sum, the five column headings of Table 3.5 frequently come precorrelated in the real world of research, often with implications for the principles in the row headings of the table.

Turning more specifically to the row headings in Table 3.5, rows A, B, C, and E reiterate the three "basic ethical principles" in the Belmont Report, which were described there as respect for persons, beneficence, and justice. Beneficence (the ethical


Table 3.5. General Ethical Standards Crossed with Quantitative Methodology Standards (after Rosnow & Rosenthal, 2011)

General ethical standards (rows):
A. Beneficence
B. Nonmaleficence
C. Justice
D. Integrity
E. Respect

Quantitative methodological standards (columns): transparency, informativeness, precision, accuracy, groundedness.

ideal of "doing good") was conflated with the principle (B) of nonmaleficence ("not doing harm"), and the two were also portrayed as obligations assimilating two complementary responsibilities: (1) do not harm and (2) maximize possible benefits and minimize possible harms. Next in Table 3.5 is justice, by which we mean a sense of "fairness in distribution" or "what is deserved" (quoting from the Belmont Report). As the Belmont Report went on to explain: "Injustice occurs when some benefit to which a person is entitled is denied without good reason or when some burden is imposed unduly." Conceding that "what is equal?" and "what is unequal?" are often complex, highly nuanced questions in a specific research situation (just as they are when questions of justice are associated with social practices, such as punishment, taxation, and political representation), justice was nonetheless considered a basic moral precept relevant to the ethics of research involving human subjects. Next in Table 3.5 is integrity, an ethical standard that was not distinctly differentiated in the Belmont Report but that was discussed in detail in the NAS guide (Committee on Science, Engineering, and Public Policy, 2009). Integrity implies honesty and truthfulness; it also implies a prudent use of research funding and other resources and, of course, the disclosure of any conflicts of interest, financial or otherwise, so as not to betray public trust. Finally, respect was described in the Belmont Report as assimilating two obligations: "first, that individuals should be treated as autonomous agents, and second, that persons with diminished autonomy are entitled to protection." In the current APA code, respect is equated with civil liberties—that is, privacy, confidentiality, and self-determination.

Inspecting the column headings in Table 3.5, first, by transparency, we mean here that the quantitative results are presented in an open, frank, and candid way, that any technical language used is clear and appropriate, and that visual displays do not obfuscate the data but instead are as crystal clear as possible. Elements of graphic design are explained and illustrated in a number of very useful books and articles, particularly the work of Tufte (1983, 1990, 2006) and Wainer (1984, 1996, 2000, 2009; Wainer & Thissen, 1981), and there is a burgeoning literature in every area of science on the visual display of quantitative data. Second, by informativeness, we mean that there is enough information reported to enable readers to make up their own minds on the basis of the primary results and enough to enable others to re-analyze the summary results for themselves. The development of meta-analysis, with emphasis on effect sizes and moderator variables, has stimulated ways of recreating summary data sets and vital effect size information, often from minimal raw ingredients. Third, the term precision is used not in a statistical sense (the likely spread of estimates of a parameter) but rather in a more general sense to mean that quantitative results should be reported to the degree of exactitude required by the given situation. For example, reporting the average scores on an attitude questionnaire to a high degree of decimal places is psychologically meaningless (false precision), and reporting the weight of mouse subjects to six decimal places is pointless (needless precision). Fourth, accuracy means that a conscientious effort is made to identify and correct mistakes in measurements, calculations, and the reporting of numbers. Accuracy also means not exaggerating results by, for example, making false claims that applications of the results are unlikely to achieve. Fifth, groundedness implies that the method of choice is appropriate to the question of interest, as opposed to using whatever is fashionable or having


a computer program repackage the data in a one-size-fits-all conceptual framework. The methods we choose must be justifiable on more than just the grounds that they are what we were taught in graduate school, or that "this is what everyone else does" (cf. Cohen, 1990, 1994; Rosnow & Rosenthal, 1995, 1996; Zuckerman, Hodgins, Zuckerman, & Rosenthal, 1993).

Clinical Significance and the Consequences of Statistical Illiteracy

To bring this discussion of quantitative methods and ethics full circle, we turn finally to a problem that has been variously described as innumeracy (Paulos, 1990) and statistical illiteracy. The terms are used to connote a lack of knowledge or understanding of the meaning of numbers, statistical concepts, or the numeric expression of summary statistics. As the authors of a popular book, The Numbers Game, put it: "Numbers now saturate the news, politics, life. . . . For good or for evil, they are today's preeminent public language—and those who speak it rule" (Blastland & Dilnot, 2009, p. x). To be sure, even people who are most literate in the language of numbers are prone to wishful thinking and fearful thinking and, therefore, sometimes susceptible to those who use numbers and gimmicks to sway, influence, or even trick people. The mathematician who coined the term innumeracy told of how his vulnerability to whim "entrained a series of ill-fated investment decisions," which he still found "excruciating to recall" (Paulos, 2003, p. 1). The launching point for the remainder of our discussion was an editorial in a medical journal several years ago, in which the writers of the editorial lamented "the premature dissemination of research and the exaggeration of medical research findings" (Schwartz & Woloshin, 2003, p. 153). A large part of the problem is an emphasis on RR statistics that hook general readers into making unwarranted assumptions, a problem that may often begin with researchers, funders, and journals that "court media attention through press releases" (Woloshin, Schwartz, Casella, Kennedy, & Larson, 2009, p. 613). Confusion about risk and risk statistics is not limited to the general public (cf. Prasad, Jaeschke, Wyer, Keitz, & Guyatt, 2008), but it is the susceptible public (Carling, Kristoffersen, Herrin, Treweek, Oxman, Schünemann, Akl, & Montori, 2008) that must ultimately pay the price of the accelerating costs of that confusion. Stirring the concept of statistical significance into this mix can frequently produce a truly astonishing amount of confusion. For example, writing in the Journal of the National Cancer Institute, Miller (2007) mentioned that many doctors equate the level of statistical significance of cancer data with the "degree of improvement a new treatment must make for it to be clinically meaningful" (p. 1832).7

In the space remaining, we concentrate on misconceptions and illusions regarding the concepts of RR and statistical significance when the clinical significance of interventions is appraised through the lens of these concepts in randomized clinical trials (RCTs). As a case in point, a highly cited report on the management of depression, a report that was issued by the National Institute for Health and Clinical Excellence (NICE), used RR of 0.80 or less as a threshold indicator of clinical significance in RCTs with dichotomous outcomes and statistically significant results.8 We use the term clinical significance here in the way that it was defined in an authoritative medical glossary, although we recognize that it is a hypothetical construct laden with surplus meaning as well (cf. Jacobson & Truax, 1991). In the glossary, clinical significance was taken to mean that "an intervention has an effect that is of practical meaning to patients and health care providers" (NICHSR, 2010; cf. Jeans, 1992; Kazdin, 1977, 2008). By intervention, we mean a treatment or involvement such as a vaccine used in a public health immunization program to try to eradicate a preventable disease (e.g., the Salk poliomyelitis vaccine), or a drug that can be prescribed for a patient in the doctor's office, or an over-the-counter medicine (e.g., aspirin) used to reduce pain or lessen the risk of an adverse event (e.g., heart attack), or a medication and/or psychotherapy to treat depression. By tradition, RCTs are the gold standard in evidence-based medicine when the goal is to appraise the clinical significance of interventions in a carefully controlled scientific manner. Claims contradicted by RCTs are not always immediately rejected in evidence-based medicine, as it has been noted that some "claims from highly cited observational studies persist and continue to be supported in the medical literature despite strong contradictory evidence from randomized trials" (Tatsioni, Bonitsis, & Ioannidis, 2007). Of course, just as gold can fluctuate in value, so can conclusions based on the belief that statistical significance is a proxy for clinical significance, or when it is believed that given statistical significance, clinical significance is achieved only if the reduction in RR reaches some arbitrary fixed magnitude (recall, for example, NICE, 2004). The challenge


Table 3.6. Final results for myocardial infarction (MI) and hemorrhagic stroke (HS) for the aspirin (325 mg every other day) component of the Physicians' Health Study (Steering Committee of the Physicians' Health Study Research Group, 1989). The increase in relative risk (RRI) for HS was more than twice the reduction in relative risk (RRR) for MI. Having one more case of HS in the aspirin group would have yielded a chi-square significant at p < 0.05, RR = 2.0, and RRI = 100%. In the combined samples, the event rate of MI (378/22,071 = 0.0171, or 1.71%) exceeded the event rate of HS (35/22,071 = 0.0016, or 0.16%) by a ratio of about 10:1, and a difference of 1.71% − 0.16% = 1.55%. In the subtable on the right, RRI is the relative risk increase, computed as RRR (see Table 3.7), but indicated as RRI when the treatment increases the risk of the adverse outcome.

Myocardial infarction (heart attack)          Hemorrhagic stroke

            MI     No MI    Total                         HS     No HS     Total
Aspirin     139    10,898   11,037            Aspirin     23     11,014    11,037
Placebo     239    10,795   11,034            Placebo     12     11,022    11,034

RR = 0.58                                     RR = 1.92
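The RR values at the foot of Table 3.6 can be checked directly from the cell counts. The sketch below is an illustration only (not code from the study); the aspirin counts used are the ones implied by the placebo rows and the combined totals given in the caption:

```python
def relative_risk(treated_cases, treated_total, control_cases, control_total):
    """RR: ratio of the event rate in the treated sample to the control sample."""
    return (treated_cases / treated_total) / (control_cases / control_total)

# Myocardial infarction: aspirin 139/11,037 vs. placebo 239/11,034
rr_mi = relative_risk(139, 11_037, 239, 11_034)

# Hemorrhagic stroke: aspirin 23/11,037 vs. placebo 12/11,034
rr_hs = relative_risk(23, 11_037, 12, 11_034)

rrr_mi = (1 - rr_mi) * 100  # relative risk reduction for MI, in percent
rri_hs = (rr_hs - 1) * 100  # relative risk increase for HS, in percent

print(round(rr_mi, 2), round(rr_hs, 2))  # 0.58 and 1.92, as in Table 3.6
print(round(rrr_mi), round(rri_hs))      # 42 and 92
```

The last line reproduces the comparison discussed in the text: an RRR of about 42% for heart attack against an RRI of about 92% for hemorrhagic stroke.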

is to reverse the accelerating cost curve of statistical illiteracy in an area that affects us all (see, for example, Gigerenzer, Gaissmaier, Kurz-Milcke, Schwartz, & Woloshin, 2008).

Table 3.6 helps us illustrate the folly of a delicate balancing act that is sometimes required between statistical significance and RR. The table shows a portion of the results from the aspirin component of a highly cited double-blind, placebo-controlled, randomized trial to test whether 325 milligrams of aspirin every other day reduces the mortality from cardiovascular disease and whether beta-carotene decreases the incidence of cancer (Steering Committee of the Physicians' Health Study Research Group, 1989). The aspirin component of the study was terminated earlier than planned on finding "a statistically significant, 44 [sic] percent reduction in the risk of myocardial infarction for both fatal and nonfatal events . . . [although] there continued to be an apparent but not significantly increased risk of stroke" (p. 132). RR (for relative risk) refers to the ratio of the incidence rate of the adverse event (the illness) in the treated sample to the control sample; RRR is the relative risk reduction; and RRI is the relative risk increase (the computation of these indices is described in Table 3.7). When tables of independent counts are set up as shown in Tables 3.6 and 3.7, an RR less than 1.0 indicates that the treated sample fared better than the control sample (thereby implying RRR), and an RR greater than 1.0 indicates the treated sample did more poorly than the control (thereby implying RRI). Observe that the "slightly increased risk of stroke" (RRI = 92%) was actually more than twice the reduction in risk of heart attack (RRR = 42%)! Suppose the study had continued, and one more case of stroke had turned up in the aspirin group. The p-value would have reached the 0.05 level, and the researchers might have arrived at a different conclusion, possibly that the benefit with respect to heart attack was more than offset by the increased risk in stroke. Apparently, a p-value only a hair's-breadth greater than 0.05 can trump an RR increase of 92%. On the other hand, the event rate of stroke in the study as a whole was only 0.16%, less than one-tenth the magnitude of the 1.7% event rate of heart attack in the study as a whole.9 However, we would never know this from the RR alone.

The fact is that RR statements are oblivious to event rates in the total N. To give a quick example, suppose in a study with 100 people each in the treated and control samples that 1 treated person and 5 untreated people (controls) became ill. RR and RRR would be 0.20 and 80%, respectively. Stating there was an 80% reduction in risk of the adverse event conveys hope. However, suppose we increase each sample size to 1,000 but still assume 1 case of illness in the treated sample and 5 cases


Table 3.7. Six studies each with total sample size (N) of 2,000 and 1% event rates in Studies 1 and 4 (20 cases out of 2,000), 25% event rates in Studies 2 and 5 (500 cases out of 2,000), and 50% event rates in Studies 3 and 6 (1,000 cases out of 2,000). RR, the relative risk or risk ratio, indicates the ratio of the level of risk in the treated group to the level of risk in the control group. With cells labeled A, B, C, D from upper left (A) to upper right (B), to lower left (C), to lower right (D), RR = [A/(A+B)]/[C/(C+D)], where RR < 1.0 favors the treatment effect (risk reduction) and RR > 1.0 favors the control effect (risk increase). OR, the odds ratio, also called relative odds or the cross-product ratio, is the ratio of A/B to C/D, or the cross-product AD/BC. RRR, the relative risk reduction, is the reduction in risk of the adverse outcome (e.g., illness) in the treated sample relative to the control, which is indicated as a percentage by dividing RD (defined next) by [C/(C+D)] and then multiplying by 100. RD, the risk difference, also called the absolute risk reduction (ARR), is the reduction in risk of the particular adverse outcome (e.g., cancer, heart attack, stroke) in the treated group compared with the level of baseline risk in the control—that is, [A/(A+B)] − [C/(C+D)]. Multiplying RD (or ARR) times 10,000 estimates the number of people in a group of 10,000 that are predicted to benefit from the treatment. NNT = 1/RD = 1/ARR, is the number needed to treat to prevent a single case of the particular adverse outcome.

           Event    Treatment        Control
           rate     (adverse/none)   (adverse/none)   RR     OR     RRR     RD (ARR)   Phi    Chi-square             NNT
Study 1    1%       1 / 999          19 / 981         0.05   0.05   94.7%   0.018      0.09   16.4 (p < 0.0001)      55.6
Study 2    25%      25 / 975         475 / 525        0.05   0.03   94.7%   0.45       0.52   540.0 (p < 0.0001)     2.2
Study 3    50%      50 / 950         950 / 50         0.05   0.003  94.7%   0.90       0.90   1,620.0 (p < 0.0001)   1.1
Study 4    1%       9 / 991          11 / 989         0.82   0.82   18.2%   0.002      0.01   0.2 (ns)               500
Study 5    25%      225 / 775        275 / 725        0.82   0.77   18.2%   0.05       0.06   6.7 (p < 0.01)         20
Study 6    50%      450 / 550        550 / 450        0.82   0.67   18.2%   0.10       0.10   20.0 (p < 0.0001)      10
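The formulas in the caption of Table 3.7 can be applied mechanically to the six studies. The sketch below is an illustration of those definitions, not the authors' own code (the sign of RD is flipped from the caption's [A/(A+B)] − [C/(C+D)] so that NNT comes out positive), and it makes the point of Figure 3.3 concrete: RR and RRR do not move within each trio of studies, while RD (ARR) and NNT track the event rates.

```python
def risk_indices(a, b, c, d):
    """Indices from Table 3.7 for a 2x2 table of counts:
    A, B = treated (adverse outcome, no adverse outcome);
    C, D = control (adverse outcome, no adverse outcome)."""
    risk_treated = a / (a + b)
    risk_control = c / (c + d)
    rr = risk_treated / risk_control   # relative risk
    odds_ratio = (a * d) / (b * c)     # cross-product ratio AD/BC
    rd = risk_control - risk_treated   # risk difference (ARR), sign flipped
    rrr = 100 * rd / risk_control      # relative risk reduction, in percent
    nnt = 1 / rd                       # number needed to treat
    return rr, odds_ratio, rd, rrr, nnt

studies = {
    1: (1, 999, 19, 981),      # 1% event rate
    2: (25, 975, 475, 525),    # 25% event rate
    3: (50, 950, 950, 50),     # 50% event rate
    4: (9, 991, 11, 989),      # 1% event rate
    5: (225, 775, 275, 725),   # 25% event rate
    6: (450, 550, 550, 450),   # 50% event rate
}

for label, cells in studies.items():
    rr, odds, rd, rrr, nnt = risk_indices(*cells)
    print(f"Study {label}: RR={rr:.2f} RRR={rrr:.1f}% RD={rd:.3f} NNT={nnt:.1f}")
```

Running the loop shows RR fixed at about 0.05 (RRR 94.7%) for Studies 1 to 3 and about 0.82 (RRR 18.2%) for Studies 4 to 6, while NNT swings from 55.6 down to 1.1 and from 500 down to 10 as the event rates climb.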


Figure 3.3 Histograms based on the six studies in Table 3.7, in which the total sample size (N) was 2,000 in each study. Darkened areas of the bars indicate the number of adverse outcomes (event rates), which increased from 1% (20 cases out of 2,000) in Studies 1 and 4, to 25% (500 cases out of 2,000) in Studies 2 and 5, to 50% (1,000 cases out of 2,000) in Studies 3 and 6. However, the relative risk (RR) and relative risk reduction (RRR) were insensitive to these vastly different event rates. In Studies 1, 2, and 3, the RR and RRR remained constant at 0.05 and 94.7%, respectively, whereas in Studies 4, 5, and 6, the RR and RRR remained constant at 0.82 and 18.2%, respectively.

of illness in the control sample. We would still find RR = 0.20 and RRR = 80%. It makes no difference how large we make the sample sizes, as RR and RRR will not budge from 0.20 and 80% so long as we assume 1 case of illness in the treated sample and 5 cases of illness in the control sample. Suppose we now hold the N constant and see what happens to the RR and RRR when the event rate in the overall N changes from one study to another. In Figure 3.3, we see the results of six hypothetical studies in which the event rates increased from 1% in Studies 1 and 4, to 25% in Studies 2 and 5, to 50% in Studies 3 and 6. Nonetheless, in Studies 1, 2, and 3, RR remained constant at 0.05 and RRR remained constant at an attention-getting 95%. In Studies 4, 5, and 6, RR and RRR stayed constant at 0.82 and 18%, respectively.

Further details of the studies in Figure 3.3 are given in Table 3.7. The odds ratio (OR), for the ratio of two odds, was for a time widely promoted as a measure of association in 2 × 2 tables of counts (Edwards, 1963; Mosteller, 1968) and is still frequently reported in epidemiological studies (Morris & Gardner, 2000). As Table 3.7 shows, OR and RR are usually highly correlated. The absolute risk reduction (ARR), also called the risk difference (RD), refers to the absolute reduction in risk of the adverse event (illness) in the treated patients compared with the level of baseline risk in the control group. Gigerenzer et al. (2008) recommended using the absolute risk reduction (RD) rather than the RR. As Table 3.7 shows, RD (or ARR) is sensitive to the differences in the event rates. There are other advantages as well to RD, which are discussed elsewhere (Rosenthal & Rosnow, 2008, pp. 631–632). Phi is the product-moment correlation (r) when the two correlated variables are dichotomous, and Table 3.7 shows it is sensitive to the event rates and natural frequencies. Another useful index is NNT, for the number of patients that need to be treated to prevent a single case of the adverse event. Relative risk may be an easy-to-handle description, but it is only an alerting indicator that tells us that something happened and we need to explore the data further. As Tukey (1977), the consummate exploratory data analyst, stated: "Anything


As Tukey (1977), the consummate exploratory data analyst, stated: “Anything that makes a simpler description possible makes the description more easily handleable; anything that looks below the previously described surface makes the description more effective” (p. v). And, we can add, any index of the magnitude of effect that is clear enough, transparent enough, and accurate enough to inform the nonspecialist of exactly what we have learned from the quantitative data increases the ethical value of those data (Rosnow & Rosenthal, 2011).

Conclusion

In a cultural sphere in which so many things compete for our attention, it is not surprising that people seem to gravitate to quick, parsimonious forms of communication and, in the case of health statistics, to numbers that appear to speak directly to us. For doctors with little spare time to do more than browse abstracts of clinical trials or the summaries of summaries, the emphasis on parsimonious summary statistics such as RR communications in large-sample RCTs may seem heavily freighted with clinical meaning. For the general public, reading about a 94.7% reduction in the risk of some illness, either in a pharmaceutical advertisement or in a news story about a “miracle drug that does wonders,” is attention-riveting. It is the kind of information that is especially likely to arouse an inner urgency in patients but also in anyone who is anxious and uncertain about their health. Insofar as such information exaggerates the absolute effects, it is not only the patient or the public that will suffer the consequences; the practice of medicine and the progress of science will as well. As Gigerenzer et al. (2008) wrote, “Statistical literacy is a necessary precondition for an educated citizenship in a technological democracy” (p. 53). There are promising opportunities for moral (and societal) rewards for quantitative methodologists who can help us educate our way out of statistical illiteracy. And that education will be beneficial, not only to the public but to many behavioral, social, and medical researchers as well. As that education takes place, there will be increased clarity, transparency, and accuracy of the quantitative methods employed, thereby increasing their ethical value.

Future Directions

An important theoretical and practical question remains to be addressed: To what extent is there agreement among quantitative methodologists in their evaluation of quantitative procedures as to the degree to which each procedure in a particular study meets the methodological standards of transparency, informativeness, precision, accuracy, and groundedness? The research program called for to address these psychometric questions of reliability will surely find that specific research contexts, specific disciplinary affiliations, and other specific individual differences (e.g., years of experience) will be moderators of the magnitudes of agreement (i.e., reliabilities) achieved. We believe that the results of such research will demonstrate that there will be some disagreement (that is, some unreliability) in quantitative practice. And, as we noted above, that is likely to be associated with some disagreement (that is, some unreliability) in their evaluations of the ethical value of various quantitative procedures.

Another important question would be addressed by research asking the degree to which the specific goals and specific sponsors of the research may serve as causal factors in researchers’ choices of quantitative procedures. Teams of researchers (e.g., graduate students in academic departments routinely employing quantitative procedures in their research) could be assigned at random to analyze the data of different types of sponsors with different types of goals. It would be instructive to learn that choice of quantitative procedure was predictable from knowing who was paying for the research and what results the sponsors were hoping for. Recognition of the possibility that the choice of quantitative procedures used might be affected by the financial interests of the investigator is reflected in the increased frequency with which scientific journals (e.g., medical journals) require a statement from all co-authors of their financial interest in the company sponsoring the research (e.g., pharmaceutical companies).

Finally, it would be valuable to quantify the costs and utilities of doing and not doing a wide variety of specific studies, including classic and not-so-classic studies already conducted, and a variety of studies not yet conducted. Over time, there may develop a disciplinary consensus over the costs and utilities of a wide array of experimental procedures. And, although such a consensus is building over time, it will be of considerable interest to psychologists and sociologists of science to study disciplinary differences in such consensus-building. Part of such a program of self-study of disciplines doing quantitative research would focus on the quantitative procedures used, but the primary goal would be to apply survey research methods to establish

50 quantitative methods and ethics

the degree of consensus on research ethics of the behavioral, social, educational, and biomedical sciences. The final product of such a program of research would include the costs and utilities of doing, and of not doing, a wide variety of research studies.

Notes

1. Where we quote from a document but do not give the page numbers of the quoted material, it is because either there was no pagination or there was no consistent pagination in the online and hard copy versions that we consulted. Tables 3.1–3.4 reprint the original material, as there were slight discrepancies between original material and online versions.

2. Pattullo (1982) described the logical basis on which “rule-makers” (like DHEW) had proceeded in terms of a syllogism emphasizing not the potential benefits of research but only the avoidance of risks of harm: “(1) Research can harm subjects; (2) Only impartial outsiders can judge the risk of harm; (3) Therefore, all research must be approved by an impartial outside group” (p. 376).

3. Hearings on the recommendations in the Belmont Report were conducted by the President’s Commission for the Study of Ethical Problems in Medicine, Biomedical, and Behavioral Research. Proceeding on the basis of the information provided at these hearings and on other sources of advice, the Department of Health and Human Services (DHHS) then issued a set of regulations in the January 26, 1981, issue of the Federal Register. A compendium of regulations and guidelines that now govern the implementation of the National Research Act and subsequent amendments can be found in the DHHS manual known as the “Gray Booklet,” specifically titled Guidelines for the Conduct of Research Involving Human Subjects at the National Institutes of Health (available online at http://ohsr.od.nih.gov/guidelines/index.html).

4. Smith, Kimmel, and Klein (2009) reported that 43.4% of the articles on consumer research in leading journals in the field in 1975 through 1976 described some form of deception in the research. By 1989 through 1990, the number of such articles increased to 57.7%, where it remained steady at 56% in 1996 through 1997, increased to 65.7% in 2001 through 2002, and jumped to 80.4% in 2006 through 2007. The issue of deception is further complicated by the fact that active and passive deceptions are far from rare in our society. Trial lawyers manipulate the truth in court on behalf of their clients; prosecutors surreptitiously record private conversations; journalists get away with using hidden cameras and undercover practices to obtain stories; and the police use sting operations and entrapment procedures to gather incriminating evidence (cf. Bok, 1978, 1984; Saxe, 1991; Starobin, 1997).

5. The document, titled “Ethical Principles of Psychologists and Code of Conduct,” is available online at http://www.apa.org/ETHICS/code2002.html.

6. Adaptations of the models in Figures 3.1 and 3.2 have been used to cue students about possible ethical dilemmas in research and data analysis (cf. Bragger & Freeman, 1999; Rosnow, 1990; Strohmetz & Skleder, 1992).

7. The confusion of statistical significance with practical importance may be a more far-reaching problem in science. In a letter in Science, the writers noted that “almost all reviews and much of the original research [about organic foods] report only the statistical significance of the differences in nutrient levels—not whether they are nutritionally important” (Clancy, Hamm, Levine, & Wilkins, 2009, p. 676).

8. NICE (2004) also recommended that researchers use a standardized mean difference (SMD) of half a standard deviation or more (i.e., d or g ≥ 0.5) with continuous outcomes as the threshold of clinical significance for initial assessments of statistically significant summary statistics. However, effects far below the 0.5 threshold for SMDs have been associated with important interventions. For example, in the classic Salk vaccine trial (Brownlee, 1955; Francis, Korns, Voight, Boisen, Hemphill, Napier, & Tolchinsky, 1955), phi = 0.011, which only has a d-equivalent of 0.022 (Rosnow & Rosenthal, 2008). It is probably the case across the many domains in which clinical significance is studied that larger values of d or g are in fact generally associated with greater intervention benefit, efficacy, or clinical importance. But it is also possible for large SMDs to have little or no clinical significance. Suppose a medication was tested on 100 pairs of identical twins with fever, and in each and every pair, the treated twin loses exactly one-tenth of 1 degree more than the control twin. The SMD will be infinite, inasmuch as the variability (the denominator of d or g) will be 0, but few doctors would consider this ES clinically significant. As Cohen (1988) wisely cautioned, “the meaning of any given ES is, in the final analysis, a function of the context in which it is embedded” (p. 535).

9. The high RR of HS in this study, in which participants (male physicians) took 325 milligrams every other day, might explain in part why the current dose for MI prophylaxis is tempered at only 81 milligrams per day.

Author Note

Ralph L. Rosnow is Thaddeus Bolton Professor Emeritus at Temple University, Philadelphia, PA (rosnow@temple.edu). Robert Rosenthal is a Distinguished Professor at the University of California at Riverside (robert.rosenthal@ucr.edu) and Edgar Pierce Professor of Psychology, Emeritus, Harvard University.

References

Allison, D. B. (2009). The antidote to bias in research. Science, 326, 522.

American Psychological Association. (1973). Ethical principles in the conduct of research with human participants. Washington, DC: Author.

American Psychological Association. (1982). Ethical principles in the conduct of research with human participants. Washington, DC: Author.

American Statistical Association. (1999). Ethical guidelines for statistical practice. http://www.amstat.org/profession/index.cfm?fusaction=ethical

Atwell, J. E. (1981). Human rights in human subjects research. In A. J. Kimmel (Ed.), New directions for methodology of social and behavioral science: Ethics of human subject research (No. 10, pp. 81–90). San Francisco, CA: Jossey-Bass.

Baltimore, D. (January 27, 1997). Philosophical differences. The New Yorker, p. 8.

Beecher, H. K. (July 2, 1966). Documenting the abuses. Saturday Review, 45–46.

Beecher, H. K. (1970). Research and the individual. Boston: Little Brown.

Blanck, P. D., Bellack, A. S., Rosnow, R. L., Rotheram-Borus, M. J., & Schooler, N. R. (1992). Scientific rewards and conflicts of ethical choices in human subjects research. American Psychologist, 47, 959–965.

Blass, T. (2004). The man who shocked the world: The life and legacy of Stanley Milgram. New York: Basic Books.

Blastland, M., & Dilnot, A. (2009). The numbers game. New York: Gotham Books.

Bok, S. (1978). Lying: Moral choice in public and private life. New York: Pantheon.

Bok, S. (1984). Secrets: On the ethics of concealment and revelation. New York: Vintage Books.

Bordia, P., & Rosnow, R. L. (1995). Rumor rest stops on the information highway: A naturalistic study of transmission patterns in a computer-mediated rumor chain. Human Communication Research, 25, 163–179.

Bragger, J. D., & Freeman, M. A. (1999). Using a cost–benefit analysis to teach ethics and statistics. Teaching of Psychology, 26, 34–36.

Brownlee, K. A. (1955). Statistics of the 1954 polio vaccine trials. Journal of the American Statistical Association, 50, 1005–1013.

Carling, C., Kristoffersen, D. T., Herrin, J., Treweek, S., Oxman, A. D., Schünemann, H., et al. (2008). How should the impact of different presentations of treatment effects on patient choice be evaluated? PLoS One, 3(11), e3693.

Cicerone, R. J. (2010). Ensuring integrity in science. Science, 327, 624.

Clancy, K., Hamm, M., Levine, A. S., & Wilkins, J. (2009). Organics: Evidence of health benefits lacking. Science, 325, 676.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

Committee on Science, Engineering, and Public Policy. (2009). On being a scientist: A guide to responsible conduct in research (3rd ed.). Washington, DC: National Academies Press.

Edwards, A. W. F. (1963). The measure of association in a 2×2 table. Journal of the Royal Statistical Society, 126, 109–114.

Fairchild, A. L., & Bayer, R. (1999). Uses and abuses of Tuskegee. Science, 284, 918–921.

Francis, T., Jr., Korns, R. F., Voight, R. B., Boisen, M., Hemphill, F., Napier, J., & Tolchinsky, E. (1955). An evaluation of the 1954 poliomyelitis vaccine trials: A summary report. American Journal of Public Health, 45(5), 1–63.

Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2008). Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest, 8, 53–96.

Harris, G. (March 10, 2009). Doctor admits pain studies were frauds, hospital says. The New York Times. Retrieved March 20, 2009, www.nytimes.com/2009/03/11/health/research/11pain.html?ref=us.

Haywood, H. C. (1976). The ethics of doing research . . . and of not doing it. American Journal of Mental Deficiency, 81, 311–317.

Huff, D. (1982). How to lie with statistics. New York: Norton.

Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.

Jeans, M. E. (1992). Clinical significance of research: A growing concern. Canadian Journal of Nursing, 24, 1–2.

Jones, H. H. (1993). Bad blood: The Tuskegee syphilis experiment (Revised edition). New York: Free Press.

Kaiser, J. (September 18, 2009). Researcher, two universities sued over validity of prostate cancer test. Science, 235, 1484.

Katz, J. (1972). Experimentation with human beings. New York: Russell Sage.

Kazdin, A. E. (1977). Assessing the clinical or applied importance of behavior change through social validation. Behavior Modification, 1, 427–451.

Kazdin, A. E. (2008). Evidence-based treatment and practice: New opportunities to bridge clinical research and practice, enhance the knowledge base, and improve patient care. American Psychologist, 63, 146–159.

Kelman, H. C. (1968). A time to speak: On human values and social research. San Francisco, CA: Jossey-Bass.

Kimmel, A. J. (1991). Predictable biases in the ethical decision making of psychologists. American Psychologist, 46, 786–788.

Kossinets, G., & Watts, D. J. (2006). Empirical analysis of an evolving social network. Science, 311, 88–90.

Mark, M. M., Eyssell, K. M., & Campbell, B. (1999). The ethics of data collection and analysis. In J. L. Fitzpatrick & M. Morris (Eds.), Ethical issues in program evaluation (pp. 47–56). San Francisco, CA: Jossey-Bass.

Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and Social Psychology, 67, 371–378.

Milgram, S. (1975). Obedience to authority: An experimental view. New York: Harper Colophon Books.

Miller, J. D. (2007). Finding clinical meaning in cancer data. Journal of the National Cancer Institute, 99(24), 1832–1835.

Morris, J. A., & Gardner, M. J. (2000). Epidemiological studies. In D. A. Altman, D. Machin, T. N. Bryant, & M. J. Gardner (Eds.), Statistics with confidence (2nd ed., pp. 57–72). London: British Medical Journal Books.

Mosteller, F. (1968). Association and estimation in contingency tables. Journal of the American Statistical Association, 63, 1–28.

National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (April 18, 1979). The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research. Retrieved April 8, 2009, http://ohsr.od.nih.gov/guidelines/belmont.html.

National Commission on Research. (1980). Accountability: Restoring the quality of the partnership. Science, 207, 1177–1182.

NICE. (2004). Depression: Management of depression in primary and secondary care (Clinical practice guideline No. 23). London: National Institute for Health and Clinical Excellence.

NICHSR. (2010). HTA 101: Glossary. Retrieved March 3, 2010 from http://www.nlm.nih.gov/nichsr/hta101/ta101014.html

Orne, M. T. (1962). On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American Psychologist, 17, 776–783.

Orne, M. T. (1969). Demand characteristics and the concept of quasi-control. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral research (pp. 143–179). New York: Academic Press. (Reissued in Rosenthal & Rosnow, 2009, pp. 110–137).

Panter, A. T., & Sterba, S. K. (2011). Handbook of ethics in quantitative methodology. Taylor & Francis.

Pattullo, E. L. (1982). Modesty is the best policy: The federal role in social research. In T. L. Beauchamp, R. R. Faden, R. J. Wallace, Jr., & L. Walters (Eds.), Ethical issues in social research (pp. 373–390). Baltimore, MD: Johns Hopkins University Press.

Paulos, J. A. (1990). Innumeracy: Mathematical illiteracy and its consequences. New York: Vintage Books.

Paulos, J. A. (2003). A mathematician plays the stock market. New York: Basic Books.

PLoS Medicine Editors. (September, 2009). Ghostwriting: The dirty little secret of medical publishing that just got bigger. PLoS Medicine, 6, Issue 9, e1000156. Accessed September 19, 2009 from http://www.plosmedicine.org/static/ghostwriting.action.

Prasad, K., Jaeschke, R., Wyer, P., Keitz, S., & Guyatt, G. (2008). Tips for teachers of evidence-based medicine: Understanding odds ratios and their relationship to risk ratios. Journal of General Internal Medicine, 23(5), 635–640.

Reynold, P. D. (1975). Value dilemmas in the professional conduct of social science. International Social Science Journal, 27, 563–611.

Rosenthal, R. (1994). Science and ethics in conducting, analyzing, and reporting psychological research. Psychological Science, 5, 127–134.

Rosenthal, R., & Rosnow, R. L. (1969). The volunteer subject. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral research (pp. 59–118). New York: Academic Press. (Reissued in Rosenthal & Rosnow, 2009, pp. 48–92).

Rosenthal, R., & Rosnow, R. L. (1975). The volunteer subject. New York: Wiley-Interscience. (Reissued in Rosenthal & Rosnow, 2009, pp. 667–862).

Rosenthal, R., & Rosnow, R. L. (1984). Applying Hamlet’s question to the ethical conduct of research. American Psychologist, 45, 775–777.

Rosenthal, R., & Rosnow, R. L. (2008). Essentials of behavioral research: Methods and data analysis (3rd ed.). New York: McGraw-Hill.

Rosenthal, R., & Rosnow, R. L. (2009). Artifacts in behavioral research. New York: Oxford University Press.

Rosnow, R. L. (1990). Teaching research ethics through role-play and discussion. Teaching of Psychology, 17, 179–181.

Rosnow, R. L. (1997). Hedgehogs, foxes, and the evolving social contract in psychological science: Ethical challenges and methodological opportunities. Psychological Methods, 2, 345–356.

Rosnow, R. L., & Rosenthal, R. (1970). Volunteer effects in behavioral research. In K. H. Craik, B. Kleinmuntz, R. L. Rosnow, R. Rosenthal, J. A. Cheyne, & R. H. Walters, New directions in psychology (pp. 211–277). New York: Holt, Rinehart & Winston.

Rosnow, R. L., & Rosenthal, R. (1995). “Some things you learn aren’t so”: Cohen’s paradox, Asch’s paradigm, and the interpretation of interaction. Psychological Science, 6, 3–9.

Rosnow, R. L., & Rosenthal, R. (1996). Contrasts and interactions redux: Five easy pieces. Psychological Science, 7, 253–257.

Rosnow, R. L., & Rosenthal, R. (1997). People studying people: Artifacts and ethics in behavioral research. New York: W. H. Freeman.

Rosnow, R. L., & Rosenthal, R. (2008). Assessing the effect size of outcome research. In A. M. Nezu & C. M. Nezu (Eds.), Evidence-based outcome research (pp. 379–401). New York: Oxford University Press.

Rosnow, R. L., & Rosenthal, R. (2011). Ethical principles in data analysis: An overview. In A. T. Panter & S. K. Sterba (Eds.), Handbook of ethics in quantitative methodology (pp. 37–58). New York: Routledge.

Rosnow, R. L., Rotheram-Borus, M. J., Ceci, S. J., Blanck, P. D., & Koocher, G. P. (1993). The institutional review board as a mirror of scientific and ethical standards. American Psychologist, 48, 821–826.

Rosnow, R. L., Skleder, A. A., Jaeger, M. E., & Rind, B. (1994). Intelligence and the epistemics of interpersonal acumen: Testing some implications of Gardner’s theory. Intelligence, 19, 93–116.

Sales, B. D., & Folkman, S. (Eds.). (2000). Ethics in research with human participants. Washington, DC: American Psychological Association.

Saxe, L. (1991). Lying: Thoughts of an applied social psychologist. American Psychologist, 46, 409–415.

Schwartz, L. M., & Woloshin, S. (2003). On the prevention and treatment of exaggeration. Journal of General Internal Medicine, 18(2), 153–154.

Scott-Jones, D., & Rosnow, R. L. (1998). Ethics and mental health research. In H. Friedman (Ed.), Encyclopedia of mental health (Vol. 2, pp. 149–160). San Diego, CA: Academic Press.

Smith, M. B. (1969). Social psychology and human values. Chicago, IL: Aldine.

Smith, N. C., Kimmel, A. J., & Klein, J. G. (2009). Social contract theory and the ethics of deception in consumer research. Journal of Consumer Psychology, 19, 486–496.

Spielmans, G. I., & Parry, P. I. (2010). From evidence-based medicine to marketing-based medicine: Evidence from internal industry documents. Journal of Bioethical Inquiry. doi:10.1007/s11673-010-9208-8.

Starobin, P. (January 28, 1997). Why those hidden cameras hurt journalism. The New York Times, p. A21.

Steering Committee of the Physicians’ Health Study Research Group. (1989). Final report on the aspirin component of the ongoing Physicians’ Health Study. New England Journal of Medicine, 321, 129–135.

Strohmetz, D. B., & Skleder, A. A. (1992). The use of role-play in teaching research ethics: A validation study. Teaching of Psychology, 19, 106–108.

Stryker, J. (April 13, 1997). Tuskegee’s long arm still touches a nerve. The New York Times, p. 4.

Tatsioni, A., Bonitsis, N. G., & Ioannidis, J. P. A. (2007). Persistence of contradicted claims in the literature. Journal of the American Medical Association, 298, 2517–2526.

Trials of War Criminals before the Nuernberg Military Tribunals under Control Council Law No. 10, October 1946–April 1949, Vol. II. Washington, DC: U.S. Government Printing Office.

Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.

Tufte, E. R. (1990). Envisioning information. Cheshire, CT: Graphics Press.

Tufte, E. R. (2006). Beautiful evidence. Cheshire, CT: Graphics Press.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358(3), 32–40.

VandenBos, G. R. (Ed.). (2007). APA dictionary of psychology. Washington, DC: American Psychological Association.

Vedula, S. S., Bero, L., Scherer, R. W., & Dickersin, K. (2009). Outcome reporting in industry-sponsored trials of gabapentin for off-label use. New England Journal of Medicine, 361(20), 1963–1971.

Vinacke, W. E. (1954). Deceiving experimental subjects. American Psychologist, 9, 155.

Wainer, H. (1984). How to display data badly. American Statistician, 38, 137–147.

Wainer, H. (1996). Depicting error. American Statistician, 50(2), 101–111.

Wainer, H. (2000). Visual revelations: Graphical tales of fate and deception from Napoleon Bonaparte to Ross Perot. Mahwah, NJ: Erlbaum.

Wainer, H. (2009). Picturing the uncertain world. Princeton, NJ: Princeton University Press.

Wainer, H., & Thissen, D. (1981). Graphical data analysis. Annual Review of Psychology, 32, 191–241.

Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge, England: Cambridge University Press.

Woloshin, S., Schwartz, L. M., Casella, S. L., Kennedy, A. T., & Larson, R. J. (2009). Press releases by academic medical centers: Not so academic? Annals of Internal Medicine, 150(9), 613–618.

Zuckerman, M., Hodgins, H. S., Zuckerman, A., & Rosenthal, R. (1993). Contemporary issues in the analysis of data: A survey of 551 psychologists. Psychological Science, 4, 49–53.

CHAPTER 4

Special Populations

Keith F. Widaman, Dawnté R. Early, and Rand D. Conger

Abstract

Special populations offer unique opportunities and challenges for mathematical/statistical modeling of data. First, we discuss several ways of construing the notion of special populations, including the basis on which we argue that the general notion of special populations is a rather recent one. Then, we discuss four key methodological implications when considering special populations: (1) properly defining and accessing participants from the special population; (2) ensuring that the same dimensions are present across multiple populations; (3) assessing whether empirical implications of psychological theories hold across populations; and (4) exploiting unusual variation in special populations that may allow tests of unique hypotheses. Next, we provide examples that illustrate how to deal with each of the methodological issues. We close with a discussion of issues occasioned by our discussion of special populations, issues that represent topics for future developments.

Key Words: Special populations, group differences, factorial invariance, measurement invariance, structural invariance, individual differences

The topic of the current chapter is the place or importance of special populations, particularly with regard to how quantitative methods or techniques can be used to understand or characterize special populations or inform about the nature of special populations. Research on special populations has burgeoned during the past quarter-century, and the pace of development of quantitative methods has also expanded rapidly during this period. In this chapter, we deal with the intersection of these two streams of research—special populations and quantitative methods—to illuminate both. That is, we discuss ways in which the use of state-of-the-art quantitative techniques can help explain the nature of special populations in unique and informative ways. In turn, we hope that consideration of special populations may provide feedback that will lead to interesting developments in quantitative methods for these groups.

We develop several goals for the chapter based on our considerations of the application of quantitative methods to special populations. An initial goal is to identify the nature or conception of special populations. Here, we discuss our observations on how special populations are identified. A second goal is to explore how research on special populations offers challenges to or ready application of methodological or quantitative approaches. To meet this goal, we discuss four major implications we draw when thinking about conducting research with special populations, and we discuss techniques that would be particularly appropriate in pursuing these implications. Our third goal is to describe a series of applications of quantitative techniques in the study of special populations to provide a substantive instantiation of how quantitative techniques can be used to clarify

the nature of special populations and the dynamics of psychological and biological processes in these groups. We close with conclusions and a series of questions that represent issues for future research.

In pursuing the nature and implications of special populations for quantitative methods, we performed several literature searches to help bound or circumscribe our thinking about special populations. A PsycINFO search in August 2010 using “special population” and “special populations” as possible title words yielded 397 citations. The oldest citation was to a publication by Uehling in 1952, the next oldest was to a publication in 1975, and 390 (or more than 98%) of the 397 citations were to publications in 1980 or later. Thus, the clear majority of publications listing “special population” as a title word have occurred relatively recently, within the last 30 years.

More inclusively, we next conducted a PsycINFO search using “special population” and “special populations” as possible keywords. This search turned up 1546 citations, a far broader net of citations than provided by the title-word search. Here, the oldest citation was to a chapter in a volume resulting from a White House Conference on Child Health and Protection, published in 1931 during the Hoover administration (see Folks, 1931). The second oldest citation was the Uehling (1952) paper that was the oldest publication in the title-word search, and 1503 (or more than 97%) of the 1546 citations were to publications in 1980 or later.

Finally, a PsycINFO search using “special population” or “special populations” as phrases to be found anywhere in the database led to 4266 citations, our search that led to the most inclusive list of reference citations. The two oldest citations were the Folks (1931) and Uehling (1952) papers uncovered in the earlier searches, and a full 4221 (or more than 98.9%) of the citations were to works published in 1980 or later. Consistent with the preceding two searches, the vast bulk of citations under the explicit title or heading of “special populations” occurred within the past three decades.

Each of these three searches outlined above supports the conclusion that work on or thinking about special populations under that specific rubric is a fairly recent phenomenon. However, work on special populations has been a hallmark of research in psychology for more than 100 years, if not more, even if this research has not been published under the heading of special populations. For example, research and theory on persons with mental illness can be traced back more than 200 years, as can research on persons with mental retardation or intellectual disability. Clinical methods in use more than 200 years ago for dealing with persons with mental illness or intellectual disability appear barbaric to the twenty-first-century practicing scientist or informed citizen; indeed, many clinical methods in use only 50 or 75 years ago seem rather unfortunate and misguided. Thus, research on special populations has long been pursued in psychology, allied behavioral and social sciences, and medical sciences. But, the time is ripe for renewed exploration of behavioral phenomena in special populations to provide a fuller understanding of persons in these populations, which should lead to improved ways of treating and improving the lives of persons in special populations.

Conceptions of Special Populations

To consider methodological implications and applications related to special populations, one must consider first what a special population is. The term special population has no obvious and ready referent. That is, if one asked 10 different psychologists what was meant by “special populations,” one might get 10 different answers with little in the way of overlap aside from the indication that “special population” implies deviation from general population norms. To confront this definitional problem, we read a random sample of publications identified in our literature searches to identify the ways different researchers used the term “special populations.” In the rest of this section, we discuss several ways in which investigators have used the term, admitting that our categorization is not exhaustive.

Disability Groups

One common use of the term special population appears, implicitly, to represent a less pejorative way of referring to one or another disability group. Research on persons with a slowed rate of mental development was long published under the heading of research on mentally retarded persons, with a clear implication that this was a special population. More recently, the term mentally retarded persons was replaced by persons with mental retardation, which was a less pejorative term; still more recently, the accepted term is now persons with intellectual disability. Historical changes in these terms are reflected in the name of the leading professional organization for research and practice in this domain, which has changed its name from the American Association on Mental Deficiency to the American Association on Mental Retardation and then to the American

Association on Intellectual and Developmental Disabilities. Regardless of the precise term used to refer to this population, the key defining feature is that a person is considered to be a member of the special population if he or she has exhibited a significantly slower rate of mental development relative to persons in the general population.

As noted above, persons with mental illness are another group that is treated as a special population, one defined on the basis of a perceived disability. Many additional disability groups can be identified, including persons with visual deficits or blindness, persons with hearing loss or deafness, persons with physical disabilities, and individuals with learning disabilities, to name only a few. The one family resemblance shared by all of these uses of the term "special population" is that the generic label is applied to a group to denote the presence of a behavioral disability that is common to all members of the group.

One problem with the use of the term "special population" in connection with disability groups arises in the context of comorbidity, or the presence of more than one identifier of disability. For example, consider the case of a person with intellectual disability who is also deaf. Is this person a member of two different special populations—that is, is such a person a member of the special population of persons with intellectual disability and also a member of the special population of persons who are deaf? Or, is the person a member of a new and still more special population, the population of persons with dual diagnoses of intellectual disability and deafness? No resolution of this issue is proposed here. Rather, we raise it to highlight a vexing issue in evolving notions of disability groups.

Research on disability groups takes a number of forms, and we survey a few of these here. As one example, Martz and Daniel (2010) collected data in the United States and Kenya to determine whether disability prototypes for four disability groups (persons with AIDS, hearing impairment, mental illness, and spinal cord injury) differed across the four groups and whether this was moderated by country of residence. As a second example, Or, Cohen, and Tirosh (2010) used discriminant analysis to determine whether measures from a parent-rated questionnaire could discriminate among three groups of students: those with attention-deficit hyperactivity disorder (ADHD), those with a learning disability, and those with a combination of ADHD and learning disability. Finally, Cobb, Lehman, Newman-Gonchar, and Alwell (2009) synthesized results from prior meta-analyses of self-determination interventions for various categories of persons with disability. In some meta-analyses, research on students from any disability group was included, whereas other meta-analyses focused on more restricted groups, such as students with intellectual disability or developmental disabilities or students with ADHD. The synthesis by Cobb et al. ended on an unfortunate note, concluding that interventions to enhance levels of self-determination appear to have relatively weak effects, perhaps because of the heterogeneity of the special populations combined in the meta-analyses.

Echoing issues identified above, Barton (2009) has decried the lack of a common definition of the basic term disability by persons with disabilities or their advocates. In 1990, the U.S. Congress passed the landmark Americans with Disabilities Act, which outlawed discrimination on the basis of disability and outlined many ways in which accommodations for persons with disabilities must be provided. Given this legislation and the relatively large numbers of persons in the United States who fall in one or another disability category, the population of persons with disability could and should be a potent political force. But, without a common and inclusive definition of the term disability, the political clout of this population typically is watered down, as advocates for particular disability groups (e.g., persons with intellectual disability, students with deafness) pursue their own agendas. If persons with disability and their advocates could unite subgroups under a commonly accepted and inclusive definition of disability, this would almost certainly lead to a more united stance on issues, and thereby to greater power. Until that time, the political power of persons with disabilities is likely to remain fragmented and therefore weaker.

"Superability" Groups

In contrast to the notion of disability grouping, "special population" can also be used to define individuals with superior levels of ability or performance, who can be termed superability groups to contrast with the term disability group. One of the most common "superability" groups is the intellectually gifted, often defined as persons with intelligence quotients (IQs) greater than or equal to 130. But giftedness has many dimensions; some individuals are considered gifted in general, whereas others may be deemed gifted in narrower domains such as the arts or specific academic domains.
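The 130-point cutoff just mentioned implies a specific base rate. With IQ conventionally scaled to a mean of 100 and a standard deviation of 15, a cutoff of 130 sits two SDs above the mean, and under a normal model (our simplifying assumption, not a claim made in this chapter) the implied selection rate can be computed directly:

```python
import math

def normal_sf(x, mean=100.0, sd=15.0):
    """Survival function P(X >= x) for a normal distribution."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Proportion of a normally distributed population at or above IQ 130:
gifted_rate = normal_sf(130)  # about 0.0228, roughly 1 person in 44
print(f"P(IQ >= 130) = {gifted_rate:.4f}")
```

Base rates this small are one reason why assembling sizable gifted samples requires screening very large pools of children, as the identification studies described below illustrate.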

One issue in the study of superability groups is the early identification of such individuals. Two notable attempts at early identification were undertaken during the twentieth century. In the first of these, Terman established what became known as the Genetic Studies of Genius. Terman had recently developed the Stanford Binet Scale of Intelligence, which was published in 1916. Then, from 1921 to 1923, he asked fifth grade teachers in California to nominate the three brightest children in their classes and the youngest child. After testing these children with the new Stanford Binet scale, children who scored 130 or above (i.e., two or more SDs above the population mean) were invited into the study. Initially, more than 1450 children were enrolled in the study, and later additions resulted in a final sample size of 1528 children.

The second large-scale study of early identification was the work begun in 1971 by Stanley to identify junior high school students with very high levels of math skill, an undertaking subsequently dubbed the Study of Mathematically Precocious Youth (SMPY). These youth were identified at age 12 or 13 years and had to score within the top 3% on a test of school achievement. Then, the young people were given the SAT, a standard test for college admission, and had to score within the top 1% on the test. Once selected into a cohort, the young students were given a variety of accelerated academic experiences to facilitate their learning in domains of science and math. As described by Lubinski and Benbow (2006), who are now the co-directors of SMPY, a total of five cohorts of SMPY youth were recruited. Recruitment for the various cohorts took place between 1972 and 1992, and a total of more than 5300 youth have participated in the program.

A more recent example of early identification of gifted individuals is a study by Kuo, Maker, Su, and Hu (2010), who described an early identification protocol used in Taiwan to identify gifted preschoolers. Children so identified were then enrolled in an enrichment program to offer an optimal environment for them to increase their problem-solving abilities in multiple modalities. The protocol used various kinds of information—from interviews, checklists, portfolios, intelligence tests, and observer ratings—to identify giftedness in several domains. Although most children were deemed gifted in one or another of the domains, almost 20% of children were identified as gifted in more than one domain. Whether these very early identified children will remain characterized as gifted at later points in their lifespan will be an interesting result to track.

In an interesting twist, a superior level of ability or performance in one area can be exhibited in the presence of rather low performance in other areas. For example, Olson, Berryhill, Drowos, Brown, and Chatterjee (2010) reported on a patient who had rather severe impairments in episodic memory that presumably arose as a result of anoxia during birth. The impairments in episodic memory were quite general, resulting in rather poor performance on many memory measures. However, the patient had an extremely accurate ability to recall calendar data with regard to day, month, and year, and this unusually high skill enabled the patient to recall the precise date of many of his personal experiences. Another example is that of hyperlexia, or precocious development of single-word reading, which has typically been identified only in persons who have a developmental disability (Grigorenko, Klin, & Volkmar, 2003). As a result, the identification of a particular form of performance as a "superability" does not ensure that the person displaying such an ability is thereby a member of a generally advantaged group. Rather, the superability may merely represent an unusually superior level of performance in the context of a generally depressed level of functioning.

Demographic Groups

Another way of using the "special population" term is to refer to groups of individuals who are identified on the basis of demographic characteristics. The most common characteristics used to classify individuals in the United States are sex, age, ethnic (or racial) status, and socioeconomic status (SES) grouping, although other demographic variables are also used. For example, Sussman (2006) has discussed the prevention of adolescent alcohol problems in various special populations, citing variation across gender, ethnicity, region of the country, and SES groups.

The issue of special populations often arises when investigating the differential validity of psychological tests with persons from different ethnic groups. Cleary (1968) offered a series of statistical tests using multiple regression analysis to determine whether intercept bias and/or slope bias existed in the use of test scores when evaluating students from different ethnic groups for college admission. If neither intercept nor slope bias were found to occur, then a test would be deemed unbiased for use as a selection device. If fewer applicants from a special population

(e.g., African-American) were selected using such a test, then one could justify the result by claiming that the lower level of selection resulted from lower levels of scores obtained by members of the special population on unbiased tests. Although so-called "Cleary tests" have been used routinely for more than four decades in college admissions and in personnel selection, concerns about the utility of the approach remain (e.g., Meade & Tonidandel, 2010; Colarelli, Han, & Yang, 2010).

Retrieving literature that references "special populations," we were struck by the frequency with which authors who referred to special populations cited groups such as women, children, the elderly, and minority ethnic groups. This designation of special populations implies that non-women, non-children, non-elderly, non-minority individuals—that is, adult White males and, perhaps, college sophomores—are "the" populations that serve as common reference groups, and anyone who deviates from these norms is a member of a special population.

Why White adult males and college sophomores became standard reference groups is anyone's guess. Most likely, the emergence of these reference groups resulted from not just one but a combination of factors. Since the founding of the United States of America, with few notable exceptions, adult White males have tended to occupy the highest positions of power in business, academia, and government. Regardless of the optimality of this distribution of power, the mere presence of this demographic in positions of power may have led, explicitly or implicitly, to acceptance of adult White males as the reference group against which outcomes for other groups would be compared. Additionally, the use of a single demographic, such as adult White males, might have been considered a way to reduce heterogeneity that might otherwise cloud research results. Assuming that results for different demographic groups should not differ dramatically, reduction of heterogeneity might allow trends in data to be seen more clearly.

Other reasons are, almost surely, responsible for the use of college sophomores as a standard or reference group. The top reason for selecting college sophomores as "typical" research participants must be convenience. When taking introductory psychology courses in college, students often must participate in experiments so that they learn about how studies are conducted, and college sophomores are a common demographic in introductory psychology classes. A quarter-century ago, Sears (1986, updated by Henry, 2008; Sears, 2008) decried the use of college sophomores as typical research participants, arguing that college sophomores may provide systematically different responses than members of the general population on many, if not most, experimental questionnaires and paradigms. College sophomores are in a stage of life when they are attempting to "find" or define their identities and thus may be much more susceptible to various influences, such as experimental inductions, than would others. Despite the potentially limited utility of college sophomores as research participants, given their unrepresentativeness relative to the population, convenience in obtaining their participation in research is a leading cause of their continued predominance in studies in social and personality psychology.

Political pressure often appears to be another factor related to the choice of research participants, either restricting or promoting research on a given topic. If a researcher chooses to study a topic (e.g., romantic love) or a group (e.g., homosexual males) that a member of Congress finds objectionable, the research project might be highlighted as a waste of taxpayer money. Senator William Proxmire (D-Wisconsin) made headlines in the 1970s and 1980s when announcing his Golden Fleece Awards, which derided programs of research he personally opposed. On the other hand, the decades-long support for research on persons with intellectual disability by members of the Kennedy family in Congress led to far more research funding in this area than otherwise would have occurred. Indeed, continued research support for any special population usually requires the presence of a special champion in Congress for that population, given conflicting funding priorities at the national level.

Yet another reason why certain groups are not often represented as research participants is the sheer difficulty of finding and recruiting participants in these groups. Masten and colleagues (e.g., Obradović, Long, Cutuli, Chan, Hinz, Heistad, & Masten, 2009) have been studying homeless and highly mobile children, and identifying members of this group and then tracking them longitudinally has been difficult. Other challenges often face researchers who study court-related samples, such as victims of physical or sexual abuse or children or adolescents in foster care, where concurrence of legal entities is yet another impediment to research.

Regardless of the basis for the designation of reference populations, signs of the breakup of the hegemony of White adult males and college sophomores as reference groups are clearly in evidence.
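The regression logic of the Cleary (1968) approach described earlier in this section can be made concrete. The sketch below uses tiny, noise-free fabricated data (ours, purely for illustration) and fits the test-criterion regression separately in each group; actual Cleary-type analyses instead test group and group-by-score terms for significance within a single multiple regression:

```python
def ols_line(xs, ys):
    """Least-squares intercept and slope for a simple regression of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

test_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Both groups share slope 0.5, but the criterion runs 0.5 points lower
# in group B at every test score: intercept bias without slope bias.
crit_a = [1.0 + 0.5 * x for x in test_scores]  # reference group
crit_b = [0.5 + 0.5 * x for x in test_scores]  # comparison group

a0, a1 = ols_line(test_scores, crit_a)
b0, b1 = ols_line(test_scores, crit_b)

print(f"slope difference:     {a1 - b1:.3f}")  # 0.000 -> no slope bias
print(f"intercept difference: {a0 - b0:.3f}")  # 0.500 -> intercept bias
```

When the intercepts differ in this way, a single regression line fit to the pooled data would systematically over-predict the criterion for the lower-intercept group, which is the pattern the Cleary tests are designed to detect.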

For example, when applying for grants through the National Institutes of Health, researchers must carefully describe the projected sample with regard to sex, age, and ethnicity, and any exclusion of persons from a particular demographic category must be justified in convincing fashion. Further, concerted efforts have been made to correct the demographic imbalance in prior research on health. The Women's Health Initiative (http://www.nhlbi.nih.gov/whi/) was a major program of research funded by the government to investigate various outcomes in postmenopausal women, focusing on cardiovascular disease, cancer, and osteoporosis as common causes of disability, morbidity, and death of women. Furthermore, the Office of Minority Health of the U.S. Department of Health and Human Services has pursued Minority Health Initiatives (http://minorityhealth.hhs.gov/) to investigate causes of disease, morbidity, and death in minority populations that deviate from patterns common in the majority (i.e., White) population. We look forward to the day when women, children, the elderly, and ethnic minorities are not considered special populations but are considered major portions of the general population that deserve just as much attention as research subjects as any other demographic group.

Functional Groups

The term special population may also be applied to identify persons who, in our terminology, are members of identifiable functional groups. By functional groups, we refer to individuals who have particular combinations of behavioral profiles or life situations that may have unique importance for understanding their behavior. That is, individuals can be characterized by the full repertoire of positive and negative behaviors they exhibit and the life situations they have selected or that have been imposed on them. Any of these factors, particularly in combination, may define groups that function in unique fashion to determine their behavior and their susceptibility to environmental presses, such as treatments.

In our literature searches on special populations, we found that a great many of the articles retrieved concerned treatment outcomes in special populations, where these special populations met our description of functional groups. Over a decade ago, Polinsky, Hser, and Grella (1998) described the extremely varied types of clients who received services from drug treatment programs in Los Angeles County, highlighting characteristics such as the health status, ethnicity, language needs, and gender-related issues of clients. A key issue Polinsky et al. discussed was the application of types of treatments to types of clients, under the assumption that client characteristics may alter the effectiveness of particular treatments. Taking the issue further, Polinsky et al. proposed that treatments might be specially adapted to the characteristics of the client to obtain maximal success. More recently, Diaz, Horton, McIlveen, Weiner, and Nelson (2009), using data from substance abuse programs, found that almost half (48%) of the clients in their sample had dysthymia. Building on this finding, Diaz et al. recommended that treatment programs consider whether clients have psychological disorders such as dysthymia when treatments are formulated, because treatment success may depend on the presence of significant comorbid characteristics.

Expanding the reach of functional groups beyond personal traits or characteristics, aspects of personal life situations may also be used to define functional groups. Individuals are born into families, these families live in communities, and communities are nested within larger geographical entities. In many publications, Bronfenbrenner (e.g., 1977, 1986a, 1986b, 1999) laid out the ever-expanding circles of embedded environments from micro- to macrosystems. A recent book edited by Little, Bovaird, and Card (2007) was dedicated to presenting statistical and other methodological solutions to the modeling of contextual effects, such as embedded social systems, that are rife in studies conducted in representative, everyday contexts. In one contribution to the Little et al. volume, Widaman (2007) discussed the integration of embedded versions of both the social environment and the physical environment as these combine with personal characteristics to influence behavior. Although any of these embedded levels may play a role in moderating behavioral change, variables associated with the more proximal environments of family and community probably play a larger role than do more distal variables. In a recent study, Chassin, Knight, Vargas-Chanes, Losoya, and Naranjo (2009) found that treatments to reduce certain forms of negative behavior were effective only if families were involved in the treatment. The upshot of this finding is that personal characteristics and aspects of both the social and physical life situations within which an individual functions should be considered when attempting to understand the behavior and adaptability of the individual.
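The Chassin et al. pattern just described, a treatment that works only when families are involved, is statistically a treatment-by-context interaction. A minimal numerical sketch with fabricated cell means (all numbers are ours, for illustration only):

```python
# Fabricated mean problem-behavior scores (lower = better) for a 2 x 2 design:
# treatment vs. control, crossed with family involvement vs. no involvement.
means = {
    ("control",   "family_out"): 10.0,
    ("control",   "family_in"):  10.0,
    ("treatment", "family_out"):  9.8,  # essentially no effect alone
    ("treatment", "family_in"):   6.0,  # large effect when families take part
}

def simple_effect(family):
    """Treatment effect (control minus treatment) within one family condition."""
    return means[("control", family)] - means[("treatment", family)]

effect_out = simple_effect("family_out")  # about 0.2 -> negligible
effect_in = simple_effect("family_in")    # 4.0 -> substantial
interaction = effect_in - effect_out      # about 3.8 -> the moderation effect

print(f"effect without family involvement: {effect_out:.1f}")
print(f"effect with family involvement:    {effect_in:.1f}")
print(f"interaction contrast:              {interaction:.1f}")
```

Averaging over family involvement (i.e., reporting only the main effect of treatment) would badly understate what the treatment can accomplish and for whom, which is precisely the sense in which membership in a functional group moderates treatment outcomes.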

Biological or Genetic Markers of Group Membership

A fifth and final type of special population is perhaps the most current and state-of-the-art way of defining special populations—by the presence of specific biological or genetic markers that have ties to behavior. Since its rebirth in the early 1970s, behavior genetic research has accumulated at a seemingly ever-increasing rate. The general thrust of findings in this field is that most, if not all, behavioral traits have some heritability, and certain important traits (e.g., intelligence) have high levels of heritability. Consistent with this focus, research continues to focus on biological markers related to behavior, where biological markers are broadly defined, from the molecular level of particular genes or single-nucleotide polymorphisms (SNPs) to the molar level of performance on experimental tasks.

Searching for the molecular genetic bases of behavioral traits is a major focus at the present time (Petrill, 2010). The successful identification of particular genes related to a certain condition is often touted in the press. For example, Chakrabarti et al. (2010) recently reported that certain genes associated with sex steroids and neural growth are related to autistic traits and Asperger syndrome. But the mere presence of certain SNPs may not be the key. Rather, gene expression at the molecular level may be the important feature, as Simunovic et al. (2009) reported in their study of the pathology underlying Parkinson's disease. Although successes in finding certain SNPs or particular forms of gene expression related to a disease or behavior have been reported, most of these involve very small portions of variance explained and suffer from lack of replication in follow-up investigations.

Perhaps more promising is research at a more molar level. For example, Pennington et al. (2008) reported synaptic and metabolic abnormalities in the prefrontal cortex in persons with schizophrenia or bipolar disorder. Focusing on the anterior cingulate, Eastwood and Harrison (2010) found increased synaptic transmission and plasticity in persons with bipolar disorder. At a still more molar level, Koychev, El-Deredy, Haenschel, and Deakin (2010) reported visual information processing deficits in persons with schizotypy, which is a marker of vulnerability to schizophrenia. And Reichenberg, Caspi, Harrington, Houts, Keefe, Murray, Poulton, and Moffitt (2010) argued that they had identified patterns in performance on standard psychological tests that reflected cognitive deficits related to childhood schizophrenia. The varied levels at which this research is undertaken—from the level of synaptic processes to molar patterns in behavior—are remarkable, yet the patterns uncovered all point to the biological nature of the processes involved.

Two additional ways of understanding biological markers and their effects deserve mention here. First, the search for a solitary SNP or a small number of SNPs responsible for a particular psychological or behavioral trait is almost certainly an unrealistic goal. A more likely outcome is reflected in research on phenylketonuria (PKU). As reported on the Phenylalanine Hydroxylase Locus Knowledgebase website (http://www.pahdb.mcgill.ca/), more than 500 mutations of the phenylalanine hydroxylase (PAH) gene have been identified, and all of these lead to reduced metabolism of phenylalanine into tyrosine, the underlying problem in PKU. Although a small number of mutations (e.g., 5 or 6) may account for the majority of mutations in particular populations (e.g., European), any of the mutations can cause PKU. Further, many of the mutations have been categorized with regard to severity, indexed by the degree of disruption of phenylalanine metabolism. If a large number of mutations are found to underlie a single, rather restricted phenotype such as PKU, we should expect that very large numbers of mutations or SNPs are related to broader phenotypes such as intelligence, intellectual disability, or personality disorders.

Second, we think that researchers must pay at least as much attention to the environment as to gene SNPs when searching for genes that affect behavior, an approach that has been characterized as the search for Gene X Environment, or G X E, interactions. Specific genes or SNPs are important in the G X E approach, but the SNPs alone do not directly herald the emergence of a behavioral trait. Rather, behavioral differences among groups identified with different genetic alleles may arise only in particular environmental circumstances or may be clearly exacerbated in such environments, so main effects of genes or environments supply less-than-complete information. Rather than main effects, the G X E interaction indicates that the effects of genetic alleles are moderated by environmental circumstances, so environments modulate how genetic factors are expressed in behavior. Two early papers on the G X E approach, by Caspi et al. (2002) and by Caspi et al. (2003), were both published in Science, the leading general journal in all of science. In the second of these papers, Caspi et al. (2003) investigated the 5-HTTLPR region, which is associated with

serotonin function. The 5-HTTLPR region is characterized by either short (s) or long (l) alleles; because individuals obtain one copy from mother and one from father, individuals can be characterized as s/s, s/l, or l/l based on whether they have (1) two short, (2) one short and one long, or (3) two long alleles, respectively. Caspi et al. presented data suggesting that persons with the l/l allele were relatively impervious to stressful environments and therefore tended to have lower levels of negative outcomes (e.g., depressive symptoms, suicide ideation/attempts) in the most stressful environments. In contrast, individuals with the s/s allele tended to have the most negative outcomes in the most stressful environments, and persons with the s/l allele had outcomes that fell midway between the l/l and s/s groups. However, these differences did not hold in all environmental situations. Indeed, in the least stressful environments, essentially no differences across allele groups were found. Thus, in low-stress environments or in environments with no maltreatment, the allele groups did not differ in depressive outcomes; differences across groups appeared only as the stressfulness of the environment increased.

More recently, researchers have been investigating the notion of a genetic basis for differential susceptibility to the environment. The studies by Caspi et al. (2002, 2003) support the contention of G X E interactions, but the allele group that fared worst in the most stressful environments rarely exhibited any benefit relative to the other groups in the least stressful environments. However, as Belsky, Bakermans-Kranenburg, and van IJzendoorn (2007) argued, the Caspi et al. studies may not have investigated the widest possible range of environmental circumstances, instead generally looking only at stressful versus average environments. If one studied the entire range of environmental circumstances—from worst through average to superior environments—then a true cross-over G X E interaction might be found. That is, persons with certain genetic alleles (e.g., the l/l allele from the 5-HTTLPR) may do relatively well in very poor environments but also may not do much better in superior environments, representing a group of persons who are relatively impervious to environmental circumstances and therefore have low susceptibility to effects of the environment. In contrast, persons with other alleles (e.g., the s/s allele from the 5-HTTLPR) may indeed perform rather poorly in the worst environments but might show the best outcomes of all groups in superior environments. If this were to occur, these individuals would be the most susceptible to environmental influence, with their behavioral outcomes tracking the negativity or positivity of the environments within which they have developed. Although firm conclusions about the presence of differential susceptibility and the resulting cross-over G X E interactions have not yet been provided, many experimental results published in the last few years seem to support this idea. If G X E interactions—particularly cross-over G X E interactions consistent with the differential susceptibility notion—are present in many behavioral domains, a more nuanced picture must be drawn, with genes and environments having co-equal status as the basis for behavioral phenotypes. Groups may still be identified by their genetic alleles, but the implications of their genes for these special populations can only be understood by considering the environments in which the persons have developed.

Summary

As a reading of this section demonstrates, the notion of special populations is often invoked but can refer to very different kinds of groupings of individuals. We have identified five ways of characterizing special populations, based on disability status, superability status, demographic characteristics, functional characteristics, and genetic markers. Others might be able to identify additional classes of variables that could be used to designate special populations. Although the distinctions among different types of special populations underscore the heterogeneous nature of the alternate bases for groupings of persons, all of the distinctions among groups have an important familial resemblance: Researchers must investigate whether patterns of empirical results vary in important ways across special populations or when moving from the general population to a special population. If results vary importantly across groups, then special population status is a moderator of results, and conclusions about "the ways that things work" do not generalize across groups. Thus, special populations constitute a crucible for research in the social sciences, and we must guard against unwarranted generalization of findings across groups unless research supports such conclusions.

Methodological Implications of Special Populations

Having established some guidelines for distinguishing among special populations, we turn next to the methodological or quantitative implications that arise when considering special populations. In

In most graduate education in psychology, quantitative experts teach classes in which students are prepared to obtain data from a sample (typically described in nebulous terms); take standard, off-the-shelf statistical methods; estimate population parameters in a particular analytic model; and use their results to make conjectures about the population, as if the population were a single, monolithic entity. But, once one acknowledges the presence of special populations (which represent specially identified subsets of the larger population), our thinking about methodological and statistical procedures must be amended. Rather than estimating "the" population parameter in a monolithic population, we should begin trying to understand the nature of special populations, how parameter estimates might vary across populations, when it is possible to compare parameter estimates across groups, and similar difficult questions. In this section, we slice up this task under four headings (identifying and accessing special populations, measuring the same constructs across groups, exploring the bounds of psychological theories, and exploiting unusual variation in special populations) and discuss the methodological implications of the substantive issues that arise.

Identifying and Accessing Participants in Special Populations

The first task of any investigation into participants from special populations is gaining access to the participants. But, this characterization of "gaining access to the participants" masks several necessary steps in the design and conduct of an empirical investigation. To provide some cognitive structuring of this first concern, we have broken down the research process associated with drawing a sample from a special population into four steps or issues.

The first task is the clear identification of the special population to be studied. As earlier sections of this chapter have shown, a single, overarching notion of what constitutes a special population is not present in the research literature. Rather, various ways of defining special populations can be used, and some of these will provide partially overlapping subsets of possible participants. Because of the seemingly murky nature of some special populations, we offer only the most general recommendations of how to deal with the problem of identification of the population. Researchers should be careful to develop a clear statement of the population to be studied. If the target population is all persons receiving mental health services in a catchment area, then no explicit or implicit exclusion criteria should be used or one might inadvertently draw an unrepresentative sample. Of course, a more restricted population could be the topic of a study, such as persons from minority groups who seek mental health services with regard to alcohol problems but without attendant drug use problems. We hope the reader is clear on the issue of identification of a special population: Both inclusion and exclusion criteria for participants must be clearly elucidated and justified by the nature of the study, and the research team should remain vigilant to ensure that no unexpected factors are biasing the nature of the population defined.

A second issue is the development of a plan to access participants from the special population. If one is interested in studying clinical populations, access to potential participants will often be sought through professional agencies, such as mental health centers, regional centers that offer services for members of particular populations (e.g., intellectual disability), state agencies, or the like. If the goal is to study students in elementary school, developing ties to schools is the most obvious approach. But, members of some populations are much harder to access. For example, researchers who study child physical or sexual abuse or neglect must often access their participants through arrangements with court systems, child welfare organizations, and so forth. In these research endeavors, investigators frequently find that a great deal of time must be spent and a large amount of red tape must be surmounted to gain access to research subjects. Still, one should not let the difficulty of the access alter the research goals of a program of research. To provide the most valid data from the most representative settings, proper and optimal access must be developed.

A third step, once the special population has been identified and modes of accessing participants have been developed, is to develop a plan for sampling participants from the population. In some situations, a sampling plan can be relatively straightforward. For example, a researcher might select each birth (or every second birth) from a set of hospitals during a given time period as the sample to be drawn, and this will often lead to the drawing of a representative sample from the population that is served by the hospitals in question. Studying birth outcomes in low SES families might be approached in this fashion, if the hospitals selected tended to serve the low SES families in a community. But, if a researcher intends to investigate differences across certain strata (e.g., African-American, Hispanic [or Latino]) and participants from these strata are unequally represented in the population, then oversampling the low-frequency participants and then re-weighting in the analysis stage can ensure statistical analyses that have greater external validity or generalizability. Full discussion of sampling plans is beyond the scope of this chapter; here, we merely emphasize the need to consider the sampling plan with as much attention and concern as other aspects of study design.

The final step in an empirical study is the recruitment of participants from the special population. This step involves the initial contact with potential participants and their recruitment into the study. The recruitment rate into a study is a crucial statistic and should be reported in every published report. Recruitment rates vary across types of studies, so a low recruitment rate for a given study may not be a fatal flaw if this is representative of the studies in the domain. However, researchers should collect as much basic data as possible on potential participants so that trends in participation versus nonparticipation might be discerned. If nonparticipants differ systematically from participants on any variables, this may limit the generalizations to be drawn from the study. Researchers should report the recruitment rate in any published paper, so readers will have a basis for placing the research results in context.

Implication 1: Informed identification of special populations and members of such groups is often a difficult task. But, the failure to identify, sample, and recruit members of special populations in appropriate ways may render any empirical results of dubious value.

The history of research in psychology tends to reflect the conduct of studies on samples of convenience. Researchers often are careful to describe the basic demographic characteristics of their participants but infrequently discuss whether the sample is representative of any larger population. As psychology matures as a science, greater attention will be paid to the issues of drawing appropriate samples from well-defined populations. Our hope is that as researchers pay greater attention to this set of issues, research results will begin demonstrating greater replication across studies than has often been the case to date.

Measuring the Same Constructs Across Groups

Perhaps the initial analytic task to undertake when studying special populations is to determine the meaning and the metrics of dimensions of individual differences within and among populations or subpopulations. Researchers typically assess individual differences on key dimensions in a domain in their research projects. For example, a study on mental abilities may include measures of fluid and crystallized intelligence, and a study of personality is likely to assess the Big Five dimensions of Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness. Further, the listing of certain dimensions of intelligence and personality represents only a meager sampling of the many different types of characteristics we assess in our work as researchers in psychology. However, using a standard measure developed and normed on the general population to assess individuals from a special population is fraught with problems, and a researcher cannot assume that the individual differences reflected in scores on the measure are directly comparable across populations.

Consider the use of a widely employed measuring device in a special population. For example, a researcher might want to assess dimensions of personality in a large sample of persons with intellectual disability. The researcher may intend to determine whether persons with intellectual disability have different mean levels on the personality dimensions (e.g., higher levels of Agreeableness) and whether individual differences on the personality traits relate in meaningful ways to success in community placements. To do so, the researcher might select the Big Five Inventory (BFI; John, Donahue, & Kentle, 1991), a 44-item measure of the major dimensions of personality that contains 8 to 10 items for each of the five dimensions. On each item, the respondent is asked to indicate, using a 1-to-5 rating scale, his or her degree of agreement with certain adjectives describing personal behaviors or descriptions.

In a situation like this, a researcher may blithely assume that responses by persons with intellectual disability can easily be compared at the scale level with responses by persons who do not have intellectual disability. Thus, one might sum up the item scores on the Extraversion scale and compare mean and variance differences on scale scores between relevant samples of persons with and without intellectual disability. However, psychometric investigations over the past 50 years and more have demonstrated that incorrect conclusions may be drawn in such situations unless one is confident that one is assessing "the same constructs" across the different populations.
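The concern is easy to demonstrate numerically. In the following sketch (all parameter values are invented for illustration, not taken from the chapter), two groups have identical latent trait distributions, but two of the items carry lower constant shifts (intercepts) in the second group, so summed scale scores differ even though the groups do not differ at all on the underlying trait:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                        # persons per group (hypothetical)

lam = np.array([0.8, 0.7, 0.6])   # factor loadings, identical in both groups
tau1 = np.array([3.0, 3.0, 3.0])  # item intercepts, group 1
tau2 = np.array([3.0, 2.4, 2.2])  # group 2: two intercepts are not invariant

# Both groups have the same latent trait distribution: mean 0, variance 1
eta1 = rng.normal(0.0, 1.0, n)
eta2 = rng.normal(0.0, 1.0, n)

y1 = tau1 + np.outer(eta1, lam) + rng.normal(0.0, 0.5, (n, 3))
y2 = tau2 + np.outer(eta2, lam) + rng.normal(0.0, 0.5, (n, 3))

# Mean sum scores differ although the latent means are identical:
mean1 = y1.sum(axis=1).mean()     # close to 9.0
mean2 = y2.sum(axis=1).mean()     # close to 7.6 (gap is pure intercept bias)
print(mean1, mean2)
```

A naive scale-level comparison here would report a "group difference" of roughly 1.4 points that reflects only how the items function, which is exactly the kind of incorrect conclusion that tests of measurement invariance are designed to guard against.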

Methods of verifying that one is assessing "the same constructs" across groups have been published under several rubrics, including methods to assess measurement invariance, factorial invariance, or lack of differential item functioning (DIF). The upshot of this concern leads to an important implication when researching special populations:

Implication 2: Establishing the measurement invariance of instruments across groups is a key result in the investigation of any special population. Measurement invariance must hold for meaningful comparisons to be made across samples from different populations.

Measurement invariance is a broad topic, one that subsumes research and statistical models that cross many boundaries. For example, some work on measurement invariance has looked at prediction models, which may be used to predict some behavioral outcome (e.g., college grade point average) from relevant predictors (e.g., high school grade point average, admission test scores). Multiple models can be employed in such research, and regression analysis is often used. In regression analysis, questions arise regarding any potential intercept bias or slope bias across groups when predicting the outcome.

When measurement invariance is considered within the context of factor analysis models, the term applied is usually factorial invariance, which is a restricted form of measurement invariance (Meredith, 1993). In factor models, we typically begin with a data model of the form:

Yji = τj + λj1 η1i + · · · + λjr ηri + εji,    (1)

where Yji is the score of person i on manifest variable j (j = 1, . . . , p), τj is the intercept for manifest variable j, λjk is the factor loading (or regression weight) for predicting manifest variable j from latent variable k (k = 1, . . . , r), ηki is the factor score for person i on latent variable k, and εji is the score of person i on the unique factor for manifest variable j. The model in Equation 1 is termed the linear factor analysis data model to signify the fact that the linear model was developed as a model for understanding persons' scores on manifest variables as linear functions of their scores on latent variables. Writing Equation 1 in matrix notation results in:

Y = τ + Λη + ε,    (2)

where Y is a p × 1 vector of scores for person i on the p manifest variables, τ is a p × 1 vector of intercepts for the p manifest variables, Λ is a p × r matrix of loadings of the p manifest variables on the r latent variables, η is an (r × 1) vector of scores for person i on the r latent variables, and ε is a p × 1 vector of scores of person i on the p unique factors.

One can use the model in Equation 2 to develop moment expectations for the manifest variables, moment expectations that yield expressions for the covariance structure and the mean structure of the manifest variables. In a single-group case, these expectations are:

Σ = ΛΨΛ′ + Θ    (3a)
μ = τ + Λα,    (3b)

where Σ is the p × p matrix of population covariances among manifest variables, Ψ is the (r × r) matrix of covariances among the latent variables, Θ is the p × p matrix (usually diagonal) of covariances among unique factors, μ is a p × 1 vector of population means on manifest variables, α is an (r × 1) vector of means on the latent variables, and other symbols were defined above. Equation 3a is the population covariance structure model, and Equation 3b is the population mean structure model.

In any sample from a population, we observe sample covariances among manifest variables, which we signify as S, and sample means on the manifest variables, which we can signify as Ȳ. Given these sample estimators of population values, we can write the sample covariance and mean models as:

S ≅ Λ̂Ψ̂Λ̂′ + Θ̂ = Σ̂    (4a)
Ȳ ≅ μ̂ = τ̂ + Λ̂α̂,    (4b)

where carets (ˆ) are added to matrices to indicate that sample estimates of population parameters are contained in the matrices, and all symbols are defined above.

The covariance structure model in Equation 4a signifies that the matrix of sample covariances among manifest variables S is approximated by the covariance structure model, Λ̂Ψ̂Λ̂′ + Θ̂; with estimates in the three parameter matrices Λ̂, Ψ̂, and Θ̂, the matrix expression yields an estimate of the population covariances among manifest variables, Σ̂, under the assumption that the model is correct in the population. The mean structure model in Equation 4b shows that the sample means are estimators of population means, and these are approximated as a function of intercepts, factor loadings, and means of the latent variables.

If we generalize the model in Equations 4a and 4b to the multiple-group context, we arrive at:

Sg ≅ Λ̂g Ψ̂g Λ̂g′ + Θ̂g = Σ̂g    (5a)
Ȳg ≅ μ̂g = τ̂g + Λ̂g α̂g,    (5b)

where the subscript g (g = 1, . . . , G) has been added to each matrix or vector to indicate that the elements of equations are derived from group g, and all other symbols are defined above.

Levels of factorial invariance. Given the multiple-group model shown in Equations 5a and 5b, consideration of factorial invariance can commence. A simple rendition of factorial invariance is this: The same factors should be present in multiple groups. But, this simple statement masks key issues. How can we tell if the same factors are present in different groups? What empirical results would give us confidence that we have identified the same factors in different groups?

One way to get a bit more definite about how to verify that invariant factors have been identified in multiple groups is to consider a mathematical statement regarding the expectations of the manifest variables, which can be written as E(Y|η, g). This equation states that the expected values of the manifest variables in Y are a function of the common latent variables in η and the group g to which a person belongs. Now, if E(Y|η, g) = E(Y|η), or if the expectations of the manifest variables given η and g equal the expectations of the manifest variables given just η, then the expectations are not dependent on the group of which a person is a member. If this identity holds, then the same latent variables are present in the different groups.

To translate the expectation equations into implications regarding data, consider the factor model shown in Equations 5a and 5b above. The expectation equality E(Y|η, g) = E(Y|η) will hold only if the common factors are "translated" into manifest variable scores in the same fashion in each group. This requirement has led researchers to discuss several levels of factorial invariance. Many researchers (e.g., Byrne, Shavelson, & Muthén, 1989; Chen, Sousa, & West, 2005; Cheung & Rensvold, 1999; Ferrer, Balluerka, & Widaman, 2008; Hancock, Kuo, & Lawrence, 2001; Little, 1997; McArdle, 1988; Meredith & Horn, 2001; Millsap & Meredith, 2007; Nesselroade, 1983; Rensvold & Cheung, 1998) have written on the topic of factorial invariance, and the preceding listing of contributions merely scratches the surface of work in this area and therefore necessarily misses many contributions that should also be cited.

In our presentation here, we will follow the summary of approaches to factorial invariance research provided by Widaman and Reise (1997), who synthesized prior work by Jöreskog (1971), Horn, McArdle, and Mason (1983), and Meredith (1993) to arrive at four levels of factorial invariance. These four levels constitute levels of increasing restriction on parameters of the factor analysis model.

Horn et al. (1983) termed the first level of invariance configural invariance. By configural invariance, we mean that the same pattern of fixed and free loadings in Λ is observed in each group. This form of invariance states merely that within each group, each manifest variable is predicted by the same latent variable(s) as occurs in other groups. Under configural invariance, the g subscript is still employed for each Λ matrix, because different estimates can be found for particular factor loadings across groups and the stipulation of configural invariance will still be satisfied.

The second level of factorial invariance was termed weak factorial invariance by Widaman and Reise (1997) and refers to a model in which factor loadings are constrained to be invariant, or identical, on a one-by-one basis across groups. Thus, not only must the factor loadings display the same pattern of fixed and free loadings across groups, but the free loadings across groups are constrained to invariance. If the invariance constraint on factor loadings is supported, then regression weights for predicting manifest variables from latent variables are invariant across groups. This equality is one key element in showing that latent variables are translated into manifest variables in the same fashion. If this holds and the factor loading matrices can be constrained to invariance across groups, the g subscript can be deleted from the loading matrices, leading to:

Sg ≅ Λ̂Ψ̂g Λ̂′ + Θ̂g = Σ̂g    (6a)
Ȳg ≅ μ̂g = τ̂g + Λ̂α̂g,    (6b)

where all symbols in Equations 6a and 6b are defined above.
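The moment structures above can be computed directly from the parameter matrices. The following sketch (with invented parameter values, not an example from the chapter) builds the model-implied covariance matrix and mean vector for two groups that share an invariant loading matrix Λ and invariant intercepts τ but differ in factor mean and factor variance; observed-variable differences then trace back entirely to the latent parameters:

```python
import numpy as np

def implied_moments(Lam, Psi, Theta, tau, alpha):
    """Model-implied moments: Sigma = Lam Psi Lam' + Theta, mu = tau + Lam alpha."""
    Sigma = Lam @ Psi @ Lam.T + Theta
    mu = tau + Lam @ alpha
    return Sigma, mu

Lam = np.array([[1.0], [0.8], [0.6]])   # loading matrix, invariant across groups
Theta = np.diag([0.4, 0.5, 0.6])        # unique factor variances
tau = np.array([2.0, 2.0, 2.0])         # intercepts, invariant across groups

# Group 1 (reference): factor mean 0, factor variance 1
S1, m1 = implied_moments(Lam, np.array([[1.0]]), Theta, tau, np.array([0.0]))
# Group 2: factor mean 0.5, factor variance 1.5
S2, m2 = implied_moments(Lam, np.array([[1.5]]), Theta, tau, np.array([0.5]))

print(m2 - m1)           # manifest mean gaps = loadings times the latent mean gap
print(np.diag(S2 - S1))  # variance gaps = squared loadings times the latent variance gap
```

Because Λ and τ carry no group subscript here, every between-group difference in the implied means and variances of the manifest variables is generated by the latent parameters alone, which is the substantive payoff of the invariance constraints discussed in this section.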

The third level of invariance, strong factorial invariance, adds invariance across groups of the measurement intercepts to the weak factorial invariance model. Thus, under strong factorial invariance, the regression equation for predicting each manifest variable from the set of latent variables is invariant across groups, satisfying a key criterion for measurement invariance: that the latent variables relate to manifest variables in the same fashion across groups. The resulting equations are:

Sg ≅ Λ̂Ψ̂g Λ̂′ + Θ̂g = Σ̂g    (7a)
Ȳg ≅ μ̂g = τ̂ + Λ̂α̂g,    (7b)

where all symbols are defined above. Note that both the factor loading matrix and the vector of intercepts have the g subscript deleted, because these matrices have invariant estimates across groups. Two key outcomes accompany strong factorial invariance: (1) all differences across groups in the means on manifest variables are due to mean differences on the latent variables; and (2) group differences in means and variances on the latent variables are identified in a comparable metric across groups, enabling comparisons at the latent variable level across groups. Thus, the latent variable model in Equations 7a and 7b identifies the latent variables as the sources of differences across groups in mean levels on the manifest variables, and variances on latent variables can also be compared.

The fourth and most restricted level of invariance is strict factorial invariance. Under this level of invariance, the unique factor variances are additionally constrained to invariance across groups. The resulting equations for the covariance structure and mean structure, respectively, are:

Sg ≅ Λ̂Ψ̂g Λ̂′ + Θ̂ = Σ̂g    (8a)
Ȳg ≅ μ̂g = τ̂ + Λ̂α̂g,    (8b)

where all symbols are defined above. As shown in Equation 8a, the g subscript is deleted from the unique factor covariance matrix Θ̂ because estimates in this matrix are invariant across groups. Under strict factorial invariance, all group differences in mean levels and in variances on the manifest variables are due to mean and variance differences, respectively, in the latent variables. Thus, under strict factorial invariance, we have a concise representation of all between-group differences on manifest variables. Although strict factorial invariance is the most concise of the levels of factorial invariance, researchers often find that equality constraints on unique factor variances across groups are too restrictive. This is not a problem, because comparisons across groups on the latent variables are justified if strong factorial invariance holds.

Representing mean and variance/covariance differences across groups. If at least strong factorial invariance is satisfied for a set of data, then latent variables are identified in a form that allows investigation of group differences on the latent variables. These differences across groups are contained in particular matrices in Equations 8a and 8b. Specifically, group differences in mean levels are obtained from the α̂g matrices in Equation 8b. Models are often identified by fixing factor means to zero in one group (e.g., group 1, which serves as the reference group), so mean values for the other groups (e.g., groups 2, . . . , G) are estimated as mean differences from the reference group.

Group differences in variance on the latent variables or covariances among latent variables are obtained from the Ψ̂g matrices in Equation 8a. If latent variables are identified by fixing latent variable variances to unity in the reference group (e.g., group 1), then variances on the latent variables are estimated relative to this reference group.

In summary, if strong factorial invariance holds for a set of data, latent variables are identified in an invariant fashion across groups. Given this, between-group comparisons on mean and/or variance on the latent variables are justified, and these comparisons can be made at the error-free, latent variable level.

Exploring the Bounds of Psychological Theories

As discussed above, obtaining strong factorial invariance allows the researcher to assume that mean and/or variance differences on dimensions of individual differences are interpretable across populations. However, psychological theories frequently lead to predicted relations among constructs. For example, many researchers have sought to outline dimensions of parenting styles, and the isolation and replication of such dimensions, particularly across groups (e.g., across ethnic groups), is an important matter. But, once these initial steps have been taken, researchers are typically interested in whether parenting styles have impacts on, or at least consistent patterns of asymmetric relations with, other variables, such as child behavior. To pursue such research, we must focus on the relations among latent variables, determining whether the patterns and strength of relations among latent variables are similar or different across groups.

Implication 3: Results of studies of the general population should not be generalized to special populations without research that supports such generalization. Investigations of the structural relations among latent variables across populations hold the key to determining whether conclusions regarding relations among variables generalize to special populations.

In the structural modeling literature, researchers often distinguish between the measurement model and the structural model. The measurement model consists of the relations of the latent variables to the manifest variables, and the structural model contains relations among the latent variables. In the foregoing section, we described several levels of factorial invariance, such as weak and strong factorial invariance, and these concerned invariance of parameter estimates in the measurement model. Once at least strong factorial invariance is established, we may pursue other forms of invariance that are of great importance for generalizing theoretical conclusions across groups. These additional forms of invariance fall under the rubric we are calling structural invariance, because they involve invariance of the parameter estimates in the structural model. A great deal of work has been done on measurement or factorial invariance, but relatively little research has been published under the heading of structural invariance. To document this, PsycINFO searches were made with "measurement invariance" and "factorial invariance" as keywords; these searches returned 412 and 402 citations, respectively. A search with "structural invariance" as keyword returned only 101 citations, a much lower number.

In addition, the term structural invariance seemed to be used in a much less consistent, more confused fashion in prior research. Many authors used the term "structural invariance" synonymously with what we have called factorial invariance, arguing that satisfying tests of factorial invariance implied that the factors were structurally invariant across groups. When used to refer to elements in the structural model, "structural invariance" has typically been interpreted as denoting the invariance of the pattern of significant and nonsignificant directed relations among latent variables.

To discuss levels of structural invariance, we need to revise slightly the structural model shown in Equations 8a and 8b. If we return to Equation 2 and allow directed relations among latent variables, we can write an equation for the latent variables as:

η = Bη + ζ,    (9)

where B is an (r × r) matrix of regression weights for predicting latent variables from other latent variables, ζ is an (r × 1) vector of latent variable residuals, and other terms are defined above. Solving Equation 9 for η leads to:

η = (I − B)⁻¹ζ,    (10)

where I is an (r × r) identity matrix, the superscript −1 indicates the inverse of the associated matrix, and all other symbols were defined above.

The covariance expectations for Equation 10 are:

E(ηη′) = (I − B)⁻¹Ψ[(I − B)⁻¹]′,    (11)

where all symbols are defined above. To ease interpretation of two parameter matrices, we will distinguish between independent latent variables and dependent latent variables, using superscripts (i) and (d), respectively. Thus, we will write the matrix of covariances among latent variables as:

Ψ = [Ψ(i), 0; 0, Ψ(d)]    (12)

and the matrix of factor means as

α = [α(i); α(d)],    (13)

where Ψ(i) contains free covariances among the independent latent variables, Ψ(d) contains covariances among residuals of the dependent latent variables, the independent latent variables are assumed to be uncorrelated with the residuals of the dependent latent variables, α(i) is a vector of means of the independent latent variables, α(d) is a vector of intercepts of the dependent latent variables, and other symbols are as defined above.

Placing Equations 11, 12, and 13 into Equations 7a and 7b for a multiple-group version of the strong factorial invariance model yields:

Sg ≅ Λ̂(I − B̂g)⁻¹[Ψ̂g(i), 0; 0, Ψ̂g(d)][(I − B̂g)⁻¹]′Λ̂′ + Θ̂g = Σ̂g    (14a)
Ȳg ≅ μ̂g = τ̂ + Λ̂[α̂g(i); α̂g(d)],    (14b)

where g subscripts on vectors or matrices indicate the presence of differing parameter estimates across groups, and all symbols are defined above.
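The reduced-form algebra in Equations 10 and 11 is easy to verify numerically. The sketch below (all values invented for illustration) computes (I − B)⁻¹ for a simple two-variable structural model, derives the implied covariance matrix of the latent variables, and then maps it to the manifest variables through a loading matrix, as in the covariance portion of Equation 14a for a single group:

```python
import numpy as np

# Two latent variables: eta1 (independent) and eta2 (dependent), with eta2
# regressed on eta1. All numerical values are invented for illustration.
B = np.array([[0.0, 0.0],
              [0.7, 0.0]])          # directed path eta1 -> eta2, weight 0.7
Psi = np.diag([1.0, 0.5])           # Psi(i) = 1.0; residual Psi(d) = 0.5

inv = np.linalg.inv(np.eye(2) - B)  # (I - B)^-1, as in Equation 10
cov_eta = inv @ Psi @ inv.T         # E(eta eta'), as in Equation 11
# Var(eta2) = 0.7^2 * 1.0 + 0.5 = 0.99; Cov(eta1, eta2) = 0.7

# Map the latent covariances to the manifest variables (two indicators each)
Lam = np.array([[1.0, 0.0],
                [0.8, 0.0],
                [0.0, 1.0],
                [0.0, 0.9]])
Theta = np.diag([0.3, 0.3, 0.3, 0.3])
Sigma = Lam @ cov_eta @ Lam.T + Theta
print(cov_eta[1, 1], Sigma[0, 2])   # 0.99 and 0.70
```

The off-diagonal blocks of the implied Σ inherit the latent covariance induced by the directed path, so group differences in B translate directly into group differences in the implied manifest covariances, which is what the invariance constraints discussed next are testing.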

process, help establish a common nomenclature for discussing structural invariance.

The first and most basic form of structural invariance is configural structural invariance of the pattern of fixed and free regression weights in the B̂g matrices in Equation 14a. If a restricted pattern of directed paths is estimated from independent latent variables to dependent latent variables and among dependent latent variables, configural structural invariance implies that the same latent variables have directed effects on the same outcome latent variables in each group.

The second level of structural invariance is invariance of the regression weight estimates for predicting certain latent η variables from other latent variables. Paralleling distinctions made above for factorial invariance, placing invariance on the regression weights leads to what may be called weak structural invariance. These regression weights are contained in the B̂g matrices in Equation 14a. If parameter estimates in the B̂g matrices are constrained to invariance across groups, the g subscript would be dropped from these matrices, leading to

\[
S_g \cong \hat{\Lambda}\,(I - \hat{B})^{-1}
\begin{bmatrix} \hat{\Psi}^{(i)}_g & 0 \\ 0 & \hat{\Psi}^{(d)}_g \end{bmatrix}
\bigl[(I - \hat{B})^{-1}\bigr]'\,\hat{\Lambda}' + \hat{\Theta}_g = \hat{\Sigma}_g , \tag{15}
\]

where all terms are defined above, and the lack of subscripts on the B̂ matrices indicates that across-groups invariance constraints have been imposed on parameter estimates in this matrix.

At least two issues must be mentioned with regard to constraints on regression weights in the B̂ matrices. First, across-group constraints on the B̂ matrices are interpretable only if weak or strong factorial invariance has been established. Thus, if only configural factorial invariance holds for a given set of data, no clear and convincing substantive interpretation can be placed on constraints on elements in the B̂ matrices. Data must support at least the hypothesis of weak factorial invariance to yield interpretable constraints on the B̂ matrices, and still stronger interpretations of such constraints accompany the successful specification of strong factorial invariance. For this reason, we placed the restricted B̂ matrices in the strong factorial invariance model in Equation 15, because strong factorial invariance allows a more adequate basis for discussing these constraints.

Second, if across-group constraints are imposed on the B̂ matrices, the constraints should generally be placed on raw score regression weights (or equivalents of these), rather than standard score regression weights. As in typical multiple regression analyses, raw score regression weights are presumed to be invariant across samples from a common population, whereas standardized regression weights are expected to vary as a result of range restriction resulting from sampling. Now, with latent variable models, the scale of each latent variable may be fixed in any of several ways, and the “raw score regression weights” among the latent η variables will vary as a result. However, if the scale of each latent variable is fixed in one group and strong factorial invariance constraints are placed on the Λ̂ and τ̂ matrices, then the latent η variables are on the same scale across groups. As a result, the regression weights in the B̂ matrices are analogous to raw score regression weights and are on a comparable metric, so invariance constraints on these weights are reasonable a priori hypotheses to test.

A third form of structural invariance involves the intercepts for the dependent latent variables, contained in the α̂g(d) matrices. Placing invariance constraints on these latent intercepts leads to strong structural invariance. If across-group constraints on latent intercepts are imposed, then the resulting equation for mean expectations has the form:

\[
\bar{Y}_g \cong \hat{\mu}_g = \hat{\tau} + \hat{\Lambda}
\begin{bmatrix} \hat{\alpha}^{(i)}_g \\ \hat{\alpha}^{(d)} \end{bmatrix} , \tag{16}
\]

where all terms are defined above, and the lack of subscripts on the α̂(d) matrix indicates the presence of cross-group invariance constraints on parameter estimates in this matrix.

More importantly, if these latent intercept terms are constrained to invariance across groups with no significant loss in model fit, then any group differences in mean level on the dependent latent variables result from group differences in mean level on the independent latent variables. This condition is analogous to the distinction at the measured variable level that distinguishes weak factorial invariance from strong factorial invariance; hence, our distinction here between weak and strong structural invariance.

The fourth and final type of structural invariance to be represented and tested involves the residual covariances among the dependent latent variables, contained in the Ψ̂g(d) matrices. If these residual covariances are constrained to invariance, the resulting model is characterized as conforming to strict

structural invariance. Under this model, Equation 15 becomes:

\[
S_g \cong \hat{\Lambda}\,(I - \hat{B})^{-1}
\begin{bmatrix} \hat{\Psi}^{(i)}_g & 0 \\ 0 & \hat{\Psi}^{(d)} \end{bmatrix}
\bigl[(I - \hat{B})^{-1}\bigr]'\,\hat{\Lambda}' + \hat{\Theta}_g = \hat{\Sigma}_g , \tag{17}
\]

where all terms are defined above, and the lack of subscripts on certain parameter matrices indicates cross-group invariance constraints on parameter estimates in these matrices.

As a summary of issues in structural invariance, distinctions among levels of constraints similar to those made for factorial invariance may be drawn. Specifically, invariance constraints on the α̂(d) and B̂ matrices are the most important for tests of substantive theory. Once cross-group constraints are imposed on the α̂(d) and B̂ matrices, identical raw score regression models—both the intercepts and regression weights—hold in each group at the latent variable level. Invariance of the α̂(d) and B̂ matrices is a reasonable a priori hypothesis; if sustained, then the lack of group differences at this level is an important finding. By comparison, additional constraints on the Ψ̂g(d) matrices are nice but not necessary. In fact, there are reasonable bases for expecting that the Ψ̂g(d) matrices will vary significantly across groups under sampling from a population, although the α̂(d) and B̂ matrices may display invariance across groups.

Finally, we reiterate our earlier statement that invoking strong or strict structural invariance constraints makes no sense substantively unless at least strong factorial invariance constraints have been placed on the model. Only if invariant latent variables are identified in an identical metric across samples does it make sense to test whether invariant regression parameters hold among these latent variables. Moreover, if the α̂(d) and B̂ are invariant across the general population and specific special populations, then processes bringing about the dependent latent variables are the same across populations, allowing one to generalize theoretical conclusions across groups. On the other hand, if the α̂(d) and B̂ are not invariant across the general population and special populations, then results from the general population cannot be extended to the special populations, and theories regarding the nature of relations among latent variables would require modifications in connection with special populations.

Exploiting Unusual Variation in Special Populations

A final issue that arises in the study of special populations is the frequent finding of unique forms of variability in the special populations. That is, relative to the general population, individuals in a given special population may exhibit substantial variation on key variables that does not exist in marked form in the general population. For example, adaptive behaviors are forms of behavior that enable one to live independently in the community. Because of the way that adaptive behaviors are measured, persons with IQ scores that are at or above the mean of the population may exhibit little variability on a measure of adaptive behavior, as they score at the highest level on every item. This is a quite reasonable outcome, as persons with IQ scores at or above the population mean do not have problems living independently in the community. One tends to find substantially more variability in ratings of adaptive behavior in samples of persons with intellectual disability; again, this is a reasonable outcome, as many persons with intellectual disability have difficulties in one or more domains of adaptive functioning, limiting the quality of their independent living in the community. Of course, if a special population is a subset of the general population, then the marked variation on key variables does exist in the general population. But, if the special population is a rather small subset of the population, then extreme variability on certain variables may be submerged in data on the full population and therefore go unnoticed.

This issue of unusual variation within special populations leads to our fourth methodological implication of the presence of special populations:

Implication 4: Special populations may exhibit unique forms of variability that can be exploited to test theoretical conjectures in ways unavailable in the general population.

We have not uncovered any particular methodological techniques that are unique to the issue of unusual variation within special populations. In the preceding sections, we outlined how quantitative techniques could provide informative ways of answering questions regarding measurement equivalence and structural invariance. If unusual variation were present on one or more variables in a special population, then this unusual variation could well be uncovered in particular matrices in the general structural equation model we described.

Here, we merely highlight the issue of potential unusual variation in a special population. In


emphasizing this issue, we encourage researchers to be vigilant for the presence of unusual variation. Unusual variation may take any of several forms. Persons in a special population may exhibit variation on dimensions on which members of the general population may show no variability at all. An example of this is special artistic abilities, such as individual differences in composing symphonies or directing movies. No one doubts the presence of individual differences among composers and movie directors, and preference for certain music or movies is often a lively topic of discussion. But, we rarely reflect on the fact that relatively few individuals compose symphonies or direct movies, so no individual differences on these dimensions are evident outside the small group of persons who pursue these distinctive endeavors.

Alternately, members of special populations may exhibit unusually large or small variation on behavioral traits relative to the general population. Persons in a special population may show a particular symptomatic behavior and so exhibit essentially no variation in this behavior, which may fall at one end of a behavioral continuum on which the general population exhibits wide individual variability. For example, persons with Down syndrome often have facial features, including small chin, round face, almond eye shape, and so forth, that are an unusual combination in the general population. Thus, the common facial appearance of persons with Down syndrome, which is distinct from other members of the population, is an early signal that the individual is a member of this special population. Or, persons in a special population may exhibit enhanced or enlarged variability on a dimension relative to the general population. Regardless of the form of unusual variability, such variability may offer unique ways to understand differences across groups.

Examples of Quantitative Explorations of Special Populations

Having outlined various methodological issues and approaches that can be used to characterize special populations, we now describe empirical studies that embody each of the methodological issues that we have discussed. The approaches we have taken and the results obtained can be extended readily by other researchers to the study of special populations.

Identifying and Accessing Participants in Special Populations

One example of the problems that arise when recruiting members of a special population was reported by Nary, Froehlich-Grobe, and Aaronson (2011), who sought to recruit wheelchair users into a randomized controlled exercise trial. Nary et al. set out to recruit 180 participants for their study, or 60 persons in each of three cohorts. The research team used a wide variety of recruiting strategies, beginning with the traditional approach of contacting hospitals, health-care organizations, health-care providers, and disability agencies. When these sources of recruitment led to discouraging results, Nary et al. began using more innovative approaches, including putting flyers in movie theaters and public transportation (e.g., busses), advertising in newspapers, newsletters, and so forth, and employing direct mail coupon packets (identified as ValPak). These recruitment efforts were extremely varied, as Nary et al. listed approximately 30 different locations/activities associated with their recruitment. When participants were recruited, they were asked where they heard about the project. Somewhat surprisingly, the single most effective method was the ValPak direct mail coupon packet and other contacts through the media (e.g., radio and TV advertisements, newspaper advertisements); a full third of the sample of participants was recruited through the media contacts. Although the research team did not fully meet their recruitment goal of 180 participants, they did come close to the goal—and only did so by employing a wide array of approaches that cost much more in time and effort than the research team had anticipated. The Nary et al. paper is an entertaining, if sobering, accounting of the myriad approaches the researchers took to recruit participants and the relative value of the different approaches. Nary et al. have summarized their experiences in a series of five “lessons learned,” such as needing to be cognizant that recruiting of persons with disabilities will be more difficult than recruiting persons without disabilities and that more time, effort, and money will be required in recruiting members of special populations than initially thought. The most unfortunate lesson learned was the fourth lesson, which related to the research team finding that health-care agencies were not especially helpful in recruiting participants, although the special population of wheelchair users has, in general, greater reliance on such agencies. The Nary et al. paper is a wonderful place to start when planning to recruit members of populations with disabilities, and learning from the efforts of these researchers might ease recruitment in future studies.

A second example of ways researchers have dealt with identifying and accessing participants in a

special population comes from our California Families Project (CFP), a study of 674 Mexican-origin families in California (e.g., Conger, Song, Stockdale, Ferrer, Widaman, & Cauce, 2012; Conger, Stockdale, Song, Robins, & Widaman, in press). We worked through the school districts in two cities in Northern California to identify potential members of our Mexican-origin population, which we defined as the population of fifth grade public or Catholic school children whose four grandparents were born in Mexico or whose forebears were born in Mexico. Children were drawn at random from the student rosters for the school districts of these two cities. Families of these children were then recruited by telephone or, for cases in which there was no listed phone number, by a recruiter who went to their home. Of the 982 families contacted, 68.6% of these eligible families (N = 674) agreed to participate in the study. All family members were of Mexican origin as determined by their ancestry and their self-identification as being of Mexican heritage. First-, second-, and third-generation children of Mexican origin were eligible for the study. Also, the focal child had to be living with his or her biological mother. Either two-parent (82% of the sample) or single-parent (18% of the sample) families were eligible to participate. In two-parent families, the father had to be the child’s biological father.

In addition to the initial recruitment, we face challenges in tracking our families over time, especially given high mobility of this population during times of economic downturn and resulting economic pressure on families. We are currently in the sixth year of assessing families in our longitudinal study and are happy to report that we have a retention rate over 90% for the families in our study. This has taken a number of special approaches to retaining families, including going door-to-door if attempts to contact a family by phone are unsuccessful, contacting persons who know a family (especially if the family has moved), and following families for assessments as they move to other states (e.g., Texas, Arizona) or back to Mexico. In the future, we should document all of the different strategies we have used to keep contact with our sample. An article detailing the different approaches and their relative success would provide a useful comparison to the recruitment problems confronted by Nary et al. (2011) in their study of wheelchair users. Only by keeping recruitment and retention at the highest possible levels will the data generated by our study and by others (e.g., Nary et al., 2011) be optimal for drawing the conclusions we wish to draw. Optimal recruitment and retention is an unexpectedly time-consuming and expensive proposition, and we have our own list of lessons learned in this necessary, thankless task.

Measuring the Same Constructs Across Groups

Economic pressure and depression. As one example of measuring the same constructs across groups, we present an example of measurement invariance across groups using data from three ethnically diverse samples living in the United States. Data for these samples were obtained from the following studies: (1) The Family and Community Health Study (FACHS), a study of 889 African-American children and their families; (2) the CFP, a study of 674 Mexican-origin families and children; and (3) the Family Transitions Project (FTP), a study of 550 European American children and families. For these analyses, we focused our attention on the impact of economic pressure on husbands’ and wives’ depressive symptoms. Because we were interested in impacts on two-parent families, all three samples were reduced in number, yielding 300 families from the FACHS study, 482 families from the CFP study, and 281 families from the FTP study, for a total of 1063 families across the three groups.

The confirmatory factor model fit to the data for the three samples contained seven manifest variables and three latent variables and is illustrated in Figure 4.1. In the figure, standard figural notation is used: Triangles denote the unit constant used to estimate means (αk) or intercepts (τj), squares or rectangles represent manifest variables, circles or ellipses stand for latent variables, single-headed arrows reflect directed effects (e.g., factor loadings λj, regression weights), and double-headed arrows represent latent variable (ψkk) and unique factor (θj) variances or covariances.

In Figure 4.1, three latent variables are shown: Economic Pressure, Husband Depression, and Wife Depression. All three studies included the same three indicators of Economic Pressure: (1) Unmet Material Needs (U), the average of 4 items assessing unmet material needs in different domains, including “having enough money to afford the kind of home, clothing and food your family needs” (response scale 1 = “strongly agree” to 4 = “strongly disagree”); (2) Can’t Make Ends Meet (C), the average of 2 items measuring difficulties in having


money to cover expenses, including “difficulty paying bills” and “the amount of money left at the end of the month” (response scale: 1 = “no difficulty at all” to 4 = “a great deal of difficulty”); and (3) Financial Cutbacks (F), the number of 11 areas in which the family had to make cutbacks, including “the family postponing major household purchases, changing residence, and/or eliminating medical insurance” (each area scored dichotomously, 0 = “no cutbacks” and 1 = “cutbacks”). Husbands and wives provided answers on the preceding scales, and average responses across husbands and wives were used as analyzed scores. For families in which husbands did not participate, scores on the economic pressure variables were treated as missing data. Scores on all three indicators were reverse-scored, when necessary, so that higher scores indicated greater economic pressure on the family.

Across the three ethnic groups, the depression latent construct was obtained from the Mini-Mood and Anxiety Questionnaire (Mini-MASQ; Clark & Watson, 1997). For this analysis, we used the five items that measure general distress—depression. These items ask both husband and wife to self-report on how much they had felt depressed, discouraged, and/or worthless in the past week. Each item was answered on a scale ranging from 1 = “not at all” to 4 = “very much.” Two parcels (one two-item and one three-item parcel) were formed, and these are denoted M1H, M2H, M1W, and M2W for parcels 1 and 2 from husband and wife, respectively.

Assessing factorial invariance across groups. We then fit a series of three-group confirmatory factor models to test factorial invariance of the model across the three samples. The first model, Model 1, was a three-factor model that had the same pattern of fixed and free factor loadings in each group. This configural invariance model fit the data well, χ²(33, N = 1063) = 41.99, with RMSEA of 0.028, and CFI and TLI values of 0.996 and 0.992, respectively.

Next, we imposed invariance constraints on factor loadings in the Λ matrix across gender and ethnicity. For example, the loadings for the two indicators of husband depression were constrained to be equal to the loadings for the two indicators of wife depression within each sample; these loadings were then constrained to equality across all three samples. The fit of the resulting model, Model 2, was slightly worse; however, indicators of model fit remained in an acceptable range, χ²(42, N = 1063) = 61.25, p = 0.02, with RMSEA of 0.036, and CFI and TLI scores of 0.991 and 0.987, respectively. The latter indices of practical fit suggest that the invariance constraints imposed on Model 2, which represent weak factorial invariance, are appropriate. In prior research, several authors have discussed changes in TLI or CFI values that might be considered practically significant when comparing nested models. Thus, Widaman (1985) considered a difference in the TLI of less than 0.01 when comparing two models to be a practically unimportant difference in fit, and Cheung and Rensvold (1999) and

Figure 4.1 A figural presentation of the relations between latent variables for Economic Pressure, Husband Depression, and Wife Depression and their respective indicators. (Note: triangle stands for the unit constant, squares for manifest variables, circles or ellipses for latent variables, single-headed arrows for direct effects, and double-headed arrows for symmetric effects such as variances or covariances.)
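As a numerical cross-check, the RMSEA values reported for Models 1–4 in this section can be reproduced from the χ² values and degrees of freedom alone, assuming one common multiple-group convention in which the single-group formula sqrt(max(χ² − df, 0) / (df · (N − 1))) is scaled by the square root of the number of groups; software packages differ in the exact correction, so the formula choice here is an assumption. The 0.01 change-in-CFI screening rule is applied the same way:

```python
import math

def rmsea_multigroup(chisq, df, n, n_groups):
    # One common multiple-group RMSEA convention (an assumption here):
    # sqrt(G) times the single-group value sqrt(max(chisq - df, 0) / (df * (N - 1))).
    return math.sqrt(n_groups * max(chisq - df, 0.0) / (df * (n - 1)))

# chi-square and df for Models 1-4 as reported in the text (N = 1063, 3 groups)
models = [
    ("Model 1 (configural)",      41.99, 33),
    ("Model 2 (weak)",            61.25, 42),
    ("Model 3 (partial strong)",  95.29, 49),
    ("Model 4 (strict)",         317.84, 65),
]
for label, chisq, df in models:
    # -> 0.028, 0.036, 0.052, 0.105, matching the reported values
    print(label, round(rmsea_multigroup(chisq, df, 1063, 3), 3))

# Change-in-CFI screen: a drop of 0.01 or more between nested models is
# treated as a practically significant loss of fit.
delta_cfi = 0.996 - 0.991          # Model 1 CFI vs. Model 2 CFI
print(delta_cfi < 0.01)            # True: weak invariance is retained
```

That all four reported RMSEA values round to the same three decimals under this formula suggests it matches the convention used in the original analyses.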

Little, Card, Slegers, and Ledford (2007) argued that the CFI must meet this criterion (i.e., a difference in CFI of 0.01 or more between models) to be deemed a difference in fit worthy of note. Using these standards, the changes in fit when moving from the configural invariance model to the weak factorial invariance model were small and unimportant.

Because fit indices for Model 2 were acceptable, we next invoked invariance constraints on the manifest variable intercepts in τ, leading to Model 3. Invariance constraints on all but one τ value in each sample were acceptable; the τ value allowed to vary freely across the three samples was the intercept for the third economic pressure indicator. This partial strong invariance model resulted in poorer statistical fit than prior models, χ²(49, N = 1063) = 95.29, p < 0.001, reflecting a significant worsening of fit relative to the weak invariance model, Δχ²(7, N = 1063) = 34.04, p < 0.001. However, practical fit indices were only modestly worse, with an RMSEA of 0.052, and CFI and TLI values of 0.979 and 0.973, respectively. Despite having worse fit that slightly exceeded the “change of 0.01” criterion for a practically significant difference in fit, the practical fit indexes for Model 3 were still acceptable (i.e., the CI for the RMSEA included 0.05, and both TLI and CFI values were above 0.95), indicating that the partial strong invariance constraints applied were appropriate.

The final confirmatory model, Model 4, imposed invariance constraints on the unique factor variances in Θ and therefore corresponded to strict factorial invariance. Model 4 resulted in relatively poor fit, χ²(65, N = 1063) = 317.84, p < 0.001, with RMSEA of 0.105, and CFI and TLI scores of 0.886 and 0.890, respectively. This represented a significant worsening of both the statistical index of fit, Δχ²(16, N = 1063) = 222.55, p < 0.001, and practical indices of fit when compared to Model 3. Based on the poor practical fit indexes of the strict factorial invariance model, we accepted the partial strong factorial invariance model (Model 3) as the optimal model for the data.

Group differences in means and variances on latent variables. Given the fit of the partial strong factorial invariance model, mean and variance differences across groups could be evaluated in an unbiased fashion. For the latent variables of Economic Pressure and Husband Depression, means were fixed at 0 and variances at unity for the FTP sample, and means and variances were estimated for all other latent variables. In the FTP sample, wives had significantly higher levels of depression (M = 0.40 [SE = 0.10], SD = 1.46) relative to their husbands. The FACHS sample had elevated means of 0.12 [SE = 0.10] (SD = 1.03), 0.28 [SE = 0.13] (SD = 1.37), and 0.45 [SE = 0.12] (SD = 1.27) on the Economic Pressure, Husband Depression, and Wife Depression latent variables, respectively, and the CFP sample also showed higher means on the three factors, respectively, of 0.60 [SE = 0.09] (SD = 0.97), 0.36 [SE = 0.11] (SD = 1.47), and 0.69 [SE = 0.11] (SD = 1.52). Thus, relative to European American families, African-American and Mexican-origin families showed both higher mean levels and greater variability on the three factors assessed in these confirmatory factor models.

Exploring Bounds of Psychological Theories

Ethnic differences in effects of economic pressure on depression. We turn next to ways of exploring the bounds of psychological theories. A substantial body of prior research supports the conclusion that family economic pressure has negative effects on husband and wife depression. Much of this research has been based on data from European American families, and we wanted to see whether the effect of economic pressure on husband and wife depression was similar across ethnic groups.

To answer this question, we returned to the three-group data used to test factorial invariance but examined models that substituted direct effects of economic pressure on husband and wife depression in place of the correlations estimated in prior models. In the first of these models, we imposed invariance constraints on the path coefficients leading from economic pressure to depression across husbands and wives within each sample. This cross-gender constraint resulted in a significant worsening in fit compared to the partial strong invariance model, Δχ²(3, N = 1063) = 12.96, p = 0.01, with moderately worse practical fit indexes, RMSEA of 0.055, and CFI and TLI values of 0.975 and 0.969, respectively. Next, we freed the cross-gender constraints, but imposed equality constraints on the path coefficients of economic pressure on husband depression across groups and on wife depression across groups. This model also resulted in a significant worsening of fit when compared to the partial strong factorial invariance model, Δχ²(4, N = 1063) = 19.65, p < 0.001, with worse practical fit indexes, RMSEA of 0.057, and CFI and TLI values of 0.972 and 0.967, respectively.

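The nested-model comparisons reported in this section are chi-square difference tests: the difference in χ² between the more and less constrained model is referred to a chi-square distribution with degrees of freedom equal to the difference in df. The p-values can be checked with a minimal standard-library sketch; the survival function below is built up from the closed forms for 1 and 2 degrees of freedom, so no statistics package is needed:

```python
import math

def chi2_sf(x, df):
    # Survival function P(X > x) of a chi-square variable with df degrees of
    # freedom, using the recurrences Q(df + 2) = Q(df) + t(df) and
    # t(df + 2) = t(df) * x / (df + 2).
    if df % 2 == 0:
        q = math.exp(-x / 2)                                  # Q for df = 2
        term = (x / 2) * math.exp(-x / 2)                     # t(2)
        k = 2
    else:
        q = math.erfc(math.sqrt(x / 2))                       # Q for df = 1
        term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)  # t(1)
        k = 1
    while k < df:
        q += term
        term *= x / (k + 2)
        k += 2
    return q

# Equality of economic-pressure paths across groups vs. the partial strong
# invariance model: delta chi-square = 19.65 on 4 df (values from the text).
print(chi2_sf(19.65, 4))            # well below 0.001, as reported

# Final model (paths constrained across the two minority samples only):
# delta chi-square = 1.52 on 2 df.
print(round(chi2_sf(1.52, 2), 2))   # 0.47, matching the reported p-value
```

For 2 df the survival function reduces to exp(−x/2), which is why the final-model comparison, Δχ² = 1.52, yields p = exp(−0.76) ≈ 0.47.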

In our final model, we imposed invariance constraints on the path coefficients for economic pressure on husband and wife depression across the two minority samples, allowing the corresponding coefficients to be free for the European American sample. This resulted in a nonsignificant change in model fit when compared to the partial strong invariance model, Δχ²(2, N = 1063) = 1.52, p = 0.47. The practical fit indexes for this model were essentially identical to those for the strong factorial invariance model, so we chose this model as the optimal representation of our data.

Parameter estimates in this model suggest that the impact of economic pressure on the level of depression for minority couples differs from that for European American couples. Specifically, economic pressure appears to have the strongest effect (β = 0.69) on depression in minority husbands (both African-American and Mexican-American husbands), and a significantly weaker effect (β = 0.47) on depression in minority wives. In contrast, economic pressure had a stronger effect (β = 0.54) on European American wives than on European American husbands (β = 0.23), who were least affected by economic pressure.

Parenting styles and child behavior. A second example of exploring the bounds of psychological theories developed on the general population is worthy of note. In this study, Widaman and Borthwick-Duffy (April, 1990) reported results from 109 families with a child with intellectual disability. Key findings on parenting styles, or styles of parenting behaviors, by researchers such as Baumrind (1968, 1971, 1991; Baumrind, Larzelere, & Owens, 2010) and Hoffman (1975, 1979, 1994) supported the presence of several dimensions of parenting behavior. The general finding of this research was that authoritative parenting (high control, plus high nurturance/warmth) was associated with optimal forms of child behavior and that authoritarian parenting (high control and power assertion, plus low nurturance and warmth) and permissive parenting (low control, plus high nurturance and warmth) both led to less optimal levels of child behavior.

Widaman and Borthwick-Duffy (April, 1990) isolated a total of seven dimensions of parenting behavior in their study, including (1) nurturance/warmth, (2) induction, (3) maturity demands, (4) promoting autonomy, (5) firm control, (6) love withdrawal, and (7) power assertion. Consistent with research on the general population, Widaman and Borthwick-Duffy found that the positive parenting behaviors of induction, maturity demands, and promoting autonomy had the strongest, positive effects on longitudinal changes in different forms of adaptive functioning, such as practical skills (e.g., dressing oneself), conceptual competence (e.g., functional academics), and social competence (e.g., making and keeping friends). However, contrary to research on the general population, Widaman and Borthwick-Duffy found that the authoritarian dimension of power assertion consistently had the strongest effects on longitudinal changes in negative behaviors such as social maladaption (e.g., aggression, property destruction) and personal maladaption (e.g., self-injurious behavior). Although the standardized regression weights were not large (βs ranged from 0.15 to 0.24), power assertion was the only parenting style to impact changes in maladaptive behaviors. Thus, the proscription of power assertion as a less useful form of parenting behavior was based on research on families in the general population. However, to reduce the levels of maladaptive behaviors in children with intellectual disability, power assertion appears to be the only viable parenting option.

Exploiting Unusual Variability in Special Populations

Adaptive behavior in persons with intellectual disability. The final analytic issue to illustrate is the exploitation of unusual variability in special populations. One example of unusual variability in special populations involves the domain of adaptive behavior, which attains significance in the population of persons with intellectual disability. The three key dimensions of adaptive behavior are practical competence (or independent living skills), conceptual competence (or cognitive skills), and social competence (or social skills); all three dimensions represent everyday behaviors that enable a person to live independently in the community.

Widaman, Gibbs, and Geary (1987) utilized a database maintained by the Department of Developmental Services (DDS) of the State of California. DDS required all persons receiving state services for developmental disability to be assessed on a 66-item instrument of adaptive behavior. The instrument assessed the three key dimensions of practical, conceptual, and social competence and also assessed three additional dimensions of motor competence, social maladaption, and personal maladaption. Widaman et al. extracted 14 groups of individuals based on a crossing of age (children,

adolescents, and adults), levels of intellectual dis- the Maternal PKU Collaborative (MPKUC) study.

ability (mild, moderate, and severe), and placement Phenylketonuria is a well-known genetically based

(home, community). Thus, these 14 groups are disorder, which results in disrupted metabolism of

special subgroups deﬁned by demographic and func- phenylalanine (PHE) into tyrosine. If left untreated,

tional characteristics from the special population infants with PKU who are normal at birth suffer

of persons with intellectual disability. Widaman et permanent brain damage that leads to severe mental

al. conﬁrmed essentially identical six-factor solu- retardation (mean IQ of 50) by age 2 years. However,

tions across all 14 groups; interested readers are with early identiﬁcation and placement of infants on

referred to the published article for details. Of a diet low in PHE, the negative effects of PKU can be

importance here is the fact that, given the way the circumvented, and children with PKU can grow to

66 items are phrased, persons who do not have adulthood with no evidence of intellectual disability

an intellectual disability would likely score at the (Koch & de la Cruz, 1999a, 1999b).

highest scale point on each item, therefore failing However, the story of maternal PKU is more

to exhibit any variability in responses on the items. complex. If a woman with PKU does not maintain

Only persons with intellectual disability exhibit sub- a low-PHE diet when pregnant, then the increased

stantial variability on the items, so the dimensions PHE in her blood crosses the placental barrier and

of adaptive behavior only have ready application in exposes the developing fetus to high levels of PHE.

this special population. For persons with PKU, levels of endogenous PHE

However, this research on adaptive behavior over 6 mg|dL have been found to be teratogenic,

offers a chance to make recommendations for future which means that such levels lead to negative effects

research. The instrument developed by the Cal- on behavior. However, for the developing fetus, lev-

ifornia DDS contained items that exhibit much els of exogenous PHE (i.e., PHE from the mother)

variability only in samples of persons with develop- that were sufﬁcient to cause teratogenic effects had

mental disabilities, because persons without devel- never been identiﬁed. The MPKUC study (Koch, de

opmental disabilities would have ceiling effects on la Cruz, & Azen, 2003) began in 1984 as an inter-

all items. This does not mean, however, that persons vention study to help pregnant women maintain

without developmental disabilities do not display low-PHE diets and therefore maintain low blood

individual differences in adaptive forms of behav- PHE levels during pregnancy. A secondary goal of

ior. More likely, the lack of variance in a sample the MPKUC study was to monitor maternal PHE

of persons without developmental disabilities is a levels throughout pregnancy to study the relation

measurement issue or problem. This measurement between PHE levels and child outcomes.

problem could be confronted by developing a set Table 4.1 shows descriptive statistics for partic-

of harder items for each dimension of adaptive ipants from the MPKUC study for three mother

behavior, so that items would have a higher ceil- variables (Full Scale IQ, PHE level when on an

ing and persons without developmental disabilities unrestricted diet, and average PHE level during

might not “top out” on every item. If this were pregnancy) and three child variables (Verbal, Perfor-

done, then a computerized adaptive testing (CAT) mance, and Full Scale IQ). Various forms of unusual

approach could be developed to administer a unique variability are contained in this table. First, all four of

set of items to each examinee, presenting items the IQ variables exhibit substantial deviation from

that would enable precise measurement of the indi- the population mean: The mean mother IQ was

vidual’s standing on each dimension, although a approximately a full standard deviation below the

minimal number of items were needed to do so. population mean of 100, and the three mean child

Waller and Reise (1989) described how to apply IQ scores are more than one-half a standard devia-

the CAT approach to personality measurement, and tion below the population mean. Furthermore, the

Reise and Waller (2003) discussed application of three child IQ scores show markedly greater variabil-

sophisticated techniques to assessing psychopathol- ity than in the population, with SDs greater than

ogy. Similar approaches could undoubtedly be used 21 relative to a population SD of 15. The exces-

to assess adaptive behaviors across the entire popu- sive variability in child IQ scores is the result of a

lation in a more adequate way. larger-than-expected number of observations with

Effects of prenatal exposure to phenylalanine. A rather low IQ, presumably resulting from higher

second example of unusual variability in special levels of prenatal exposure to PHE. The indices of

populations that provides a unique opportunity to mother PHE levels also exhibit unusual variability.

explore relations among variables is derived from Persons who do not have the PKU genetic defect

76 special populations

Table 4.1. Descriptive Statistics on Variables from the Maternal PKU Study

    Variable                                   N     Mean    SD     Min    Max
    Mother PHE level on unrestricted diet      413   22.03   9.18   3.30   51.10
    Mother PHE level during pregnancy          412    8.23   4.49   1.30   28.30

Note: Mother full-scale IQ based on Wechsler Adult Intelligence Scale—Revised; Mother PHE levels on regular diet and during pregnancy are in mg/dL; Child verbal, performance, and full-scale IQ based on Wechsler Intelligence Scale for Children—Revised.

would exhibit extremely low PHE values, below 1.0 and near 0, on these two measures. In contrast to this, mother PHE levels on an unrestricted diet varied between 3.3 and 51.1, and mother PHE levels during pregnancy varied between 1.3 and 28.3.

To model the effect of prenatal PHE exposure on child Full Scale IQ at 7 years, one could use a linear regression model. The result of fitting this model is the following equation:

predicted IQ = 119.5 − 3.53 (PHE)

The standard errors for the intercept and regression slope were 2.13 and 0.24, respectively, so interval estimates (95% CIs) for the two parameter estimates were approximately 117.2 to 123.7 for the intercept and −4.00 to −3.05 for the slope. The above equation represents the unrealistic expectation that a child’s IQ at age 7 years would be approximately 120 if his/her mother had maintained a PHE level of 0 during pregnancy, and that IQ would drop about 3.5 points for every 1 mg/dL increase above this value of 0.

An alternative regression model is a two-piece linear spline, with three parameter estimates: an intercept, a knot point that estimates the PHE exposure at which a teratogenic effect begins to occur, and the linear slope representing the teratogenic effect after the knot point. The results for this model were:

predicted IQ = 103.9 − 4.14 (PHE), knot = 5.50

In this equation, children who experienced prenatal PHE levels between 0 and 5.50 all had an expected 7-year IQ of about 104, near the population mean. Moreover, after the knot point of 5.50, the predicted IQ drops over 4.1 IQ points for every 1 mg/dL increase. Because the standard errors for the three coefficients were 1.70, 0.32, and 0.62, the point estimates and interval estimates for the coefficients are: intercept = 103.9, 95% CI [100.5, 107.3], regression slope for PHE = −4.14, 95% CI [−4.78, −3.50], and knot point = 5.50, 95% CI [4.25, 6.75].

Exploiting the unusual variability of mother PHE levels during pregnancy allowed us to reach several important goals. First, we could verify the nature of the teratogenic effect of prenatal exposure to PHE on child cognitive outcomes, which is nonlinear in nature with a threshold for teratogenic effects. Second, we could estimate the level of exogenous PHE at which the teratogenic effect begins to occur, which is close to the level of endogenous PHE often assumed to have teratogenic effects for persons with PKU. Third, these results could be used as the basis for recommendations for monitoring PHE levels in mothers with PKU, attempting to ensure that mothers with PKU keep their PHE levels below the level at which teratogenic effects are likely to occur. Additionally, all of these goals were accomplished only because of the unusual variability in PHE levels exhibited by mothers with PKU in the MPKUC study. Readers are referred to Widaman (2009) for an informative summary of findings of the MPKUC study.

Conclusions

The presence of special populations—however delineated—provides opportunities beyond those afforded by drawing repeated samples that are representative of the general population. If we disregarded the presence of special populations, then we might develop a corpus of scientific findings that

applies to “the” population—whatever that is—but in actuality applies to very few individuals or, at least, fails to apply to members of important special populations. The mere existence of special populations, however these are defined, challenges us to verify that the most important of our findings are not moderated substantially as a function of special population status.

We outlined four primary methodological issues that arise when investigating phenomena in the presence of special populations. These four issues involve identifying and assessing participants in special populations, verifying that we are measuring the same things across populations, discovering whether sociobehavioral processes unfold in the same ways across populations, and searching for unusual variability that might provide unique ways of viewing behavioral phenomena and testing theoretical conjectures. Given the rather recent upsurge of research pursued under the explicit rubric of special populations, the implications of special populations for the way we do science will only increase in the future. Furthermore, we expect that the recognition of the presence of special populations will only enrich the way we view social and behavioral science as we move further into the twenty-first century.

Future Directions

Several future directions for research and theory with regard to quantitative methodology as applied to special populations can be drawn from the material presented earlier in this chapter. These are provided as a series of questions to guide future research.

Question 1: What is the best way to conceptualize special populations? Or, are alternative bases for identifying special populations the best way to proceed?

Question 2: Is a “one-size-fits-all” conception of special populations possible, or will the definition of special populations vary as a function of the research question asked?

Question 3: How should a researcher proceed if full strong factorial invariance does not hold but only partial strong invariance is exhibited by data? How seriously does partial strong factorial invariance impede scientific conclusions relative to full invariance of all factor loadings and intercepts in the strong factorial invariance model?

Question 4: How large must differences be across special populations before differences are considered important? Large sample sizes lead to increased power to detect statistically significant differences across groups, but how should we characterize the magnitudes of effects, what magnitude of effects should be considered practically important, and would the magnitude of effects considered important vary across domains of research?

Question 5: When does unusual variability represent a valid measurement outcome, and when does unusual (or different) variability across groups represent a failure to assess individual differences adequately or in comparable fashion across groups? Measurement is the basis on which the whole enterprise of science is erected, and concerted attention to accurate measurement of individual differences across the entire span of a dimension is crucial to answering questions such as these. Use of the most up-to-date measurement approaches, such as CAT, would go a long way to resolving issues of differences in variance across groups.

The future is bright for methodological and quantitative innovations in the study of special populations. As we documented at the start of this chapter, labeling groups as special populations is a relatively recent phenomenon, largely a product of the past three decades. Moreover, the majority of statistical methods and techniques for studying differences across groups are also of recent origin. Many of the most advanced techniques have been available for, at most, 30 years, and new ways in which these methods can be used to illuminate similarities and differences across populations are being developed on an almost daily basis. Improved, more sophisticated understanding of the nature of special populations is occurring at the genetic, biological, and psychological/behavioral levels, and optimal use of methodological approaches and quantitative techniques is a crucial element that will push this endeavor forward. Indeed, new quantitative techniques or innovative use of existing techniques may well be the key that unlocks the door to advanced understanding of why special populations deserve their distinctive status.

Author Note

Support for this work was provided by grants from the National Institute of Child Health and Human Development, the National Institute on Drug Abuse, and the National Institute of Mental Health (DA017902, HD047573, HD051746, and MH051361).

Correspondence regarding this manuscript should be addressed to Keith Widaman, Department of Psychology, University of California at Davis,


One Shields Avenue, Davis, CA 95616. E-mail: kfwidaman@ucdavis.edu

References

Barton, B. (2009). Dreams deferred: Disability definitions, data, models, and perspectives. Journal of Sociology and Social Welfare, 36(4), 13–24.
Baumrind, D. (1968). Authoritarian vs. authoritative parent control. Adolescence, 3, 255–272.
Baumrind, D. (1971). Current patterns of parental authority. Developmental Psychology, 4(1, Pt. 2), 1–103.
Baumrind, D. (1991). The influence of parenting style on adolescent competence and substance abuse. The Journal of Early Adolescence, 11, 56–95.
Baumrind, D., Larzelere, R. E., & Owens, E. B. (2010). Effects of preschool parents’ power assertive patterns and practices on adolescent development. Parenting: Science and Practice, 10, 157–201.
Belsky, J., Bakermans-Kranenburg, M. J., & van IJzendoorn, M. H. (2007). For better and for worse: Differential susceptibility to environmental influences. Current Directions in Psychological Science, 16, 300–304.
Bronfenbrenner, U. (1977). Toward an experimental ecology of human development. American Psychologist, 32, 513–531.
Bronfenbrenner, U. (1986a). Ecology of the family as a context for human development: Research perspectives. Developmental Psychology, 22, 723–742.
Bronfenbrenner, U. (1986b). Recent advances in research on the ecology of human development. In R. K. Silbereisen, K. Eyferth, & G. Rudinger (Eds.), Development as action in context: Problem behavior and normal youth development (pp. 287–309). Heidelberg: Springer-Verlag.
Bronfenbrenner, U. (1999). Environments in developmental perspective: Theoretical and operational models. In S. L. Friedman & T. D. Wachs (Eds.), Measuring environment across the life span: Emerging methods and concepts (pp. 3–28). Washington, DC: American Psychological Association.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466.
Caspi, A., McClay, J., Moffitt, T. E., Mill, J., Martin, J., Craig, I. W., et al. (2002). Role of genotype in the cycle of violence in maltreated children. Science, 297, 851–854.
Caspi, A., Sugden, K., Moffitt, T. E., Taylor, A., Craig, I. W., Harrington, H., et al. (2003). Influence of life stress on depression: Moderation by a polymorphism in the 5-HTT gene. Science, 301, 386–389.
Chakrabarti, B., Dudbridge, E., Kent, L., Wheelwright, S., Hill-Cawthorne, G., Allison, C., et al. (2010). Genes related to sex steroids, neural growth, and social-emotional behavior are associated with autistic traits, empathy, and Asperger syndrome. Autism Research, 2, 157–177.
Chassin, L., Knight, G., Vargas-Chanes, D., Losoya, S. H., & Naranjo, D. (2009). Substance use treatment outcomes in a sample of male serious juvenile offenders. Journal of Substance Abuse Treatment, 36, 183–194.
Chen, F. F., Sousa, K. H., & West, S. G. (2005). Testing measurement invariance of second-order factor models. Structural Equation Modeling, 12, 471–492.
Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1–27.
Clark, L. A., & Watson, D. (1997). The Mini Mood and Anxiety Symptom Questionnaire (Mini-MASQ). Unpublished manuscript, University of Iowa.
Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.
Cobb, B., Lehmann, J., Newman-Gonchar, R., & Alwell, M. (2009). Self-determination for students with disabilities: A narrative metasynthesis. Career Development for Exceptional Individuals, 32, 108–114.
Colarelli, S. M., Han, K., & Yang, C. (2010). Biased against whom? The problems of “group” definition and membership in test bias analyses. Industrial and Organizational Psychology: Perspectives on Science and Practice, 3, 228–231.
Conger, R. D., Song, H., Stockdale, G. D., Ferrer, E., Widaman, K. F., & Cauce, A. M. (2012). Resilience and vulnerability of Mexican origin youth and their families: A test of a culturally-informed model of family economic stress. In P. K. Kerig, M. S. Schulz, & S. T. Hauser (Eds.), Adolescence and beyond: Family processes and development (pp. 268–286). New York: Oxford University Press.
Conger, R. D., Stockdale, G. D., Song, H., Robins, R. W., & Widaman, K. F. (in press). Predicting change in substance use and substance use cognitions of Mexican origin youth during the transition from childhood to early adolescence. In Y. F. Thomas, L. N. Price, & A. V. Lybrand (Eds.), Drug use trajectories among African American and Hispanic youth. New York: Springer.
Diaz, N., Horton, E. G., McIlveen, J., Weiner, M., & Nelson, J. (2009). Dysthymia among substance abusers: An exploratory study of individual and mental health factors. International Journal of Mental Health and Addiction, 7, 357–367.
Eastwood, S. L., & Harrison, P. J. (2010). Markers of glutamate synaptic transmission and plasticity are increased in the anterior cingulate cortex in bipolar disorder. Biological Psychiatry, 67, 1010–1016.
Ferrer, E., Balluerka, N., & Widaman, K. F. (2008). Factorial invariance and the specification of second-order latent growth models. Methodology, 4, 22–36.
Folks, H. (1931). Committee C-1—Socially handicapped—Dependency and neglect. In White House Conference on Child Health and Protection, White House conference 1930: Addresses and abstracts of committee reports, White House conference on child health and protection called by President Hoover (pp. 319–340). New York: The Century Co.
Grigorenko, E. L., Klin, A., & Volkmar, F. (2003). Annotation: Hyperlexia: Disability or superability? Journal of Child Psychology and Psychiatry, 44, 1079–1091.
Hancock, G. R., Kuo, W.-L., & Lawrence, F. R. (2001). An illustration of second-order latent growth models. Structural Equation Modeling, 8, 470–489.
Henry, P. J. (2008). College sophomores in the laboratory redux: Influences of a narrow data base on social psychology’s view of the nature of prejudice. Psychological Inquiry, 19, 49–71.
Hoffman, M. L. (1975). Altruistic behavior and the parent-child relationship. Journal of Personality and Social Psychology, 31, 937–943.
Hoffman, M. L. (1979). Development of moral thought, feeling, and behavior. American Psychologist, 34, 958–966.
Hoffman, M. L. (1994). Discipline and internalization. Developmental Psychology, 30, 26–28.
Horn, J. L., McArdle, J. J., & Mason, R. (1983). When is invariance not invariant: A practical scientist’s look at the

ethereal concept of factorial invariance. Southern Psychologist, 1, 179–188.
John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory-Versions 4a and 54. Berkeley, CA: University of California, Berkeley, Institute of Personality and Social Research.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.
Koch, R., & de la Cruz, F. (1999a). Historical aspects and overview of research on phenylketonuria. Mental Retardation and Developmental Disabilities Research Reviews, 5, 101–103.
Koch, R., & de la Cruz, F. (Eds.). (1999b). Phenylketonuria. Mental Retardation and Developmental Disabilities Research Reviews, 5, 101–161.
Koch, R., de la Cruz, F., & Azen, C. G. (Eds.). (2003). The Maternal Phenylketonuria Collaborative Study: New developments and the need for new strategies. Pediatrics, 112, 1513–1587.
Koychev, I., El-Deredy, W., Haenschel, C., & Deakin, J. F. W. (2010). Visual information processing deficits as biomarkers of vulnerability to schizophrenia: An event-related potential study in schizotypy. Neuropsychologia, 48, 2205–2214.
Kuo, C.-C., Maker, J., Su, F.-L., & Hu, C. (2010). Identifying young gifted children and cultivating problem solving abilities and multiple intelligences. Learning and Individual Differences, 20, 365–379.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53–76.
Little, T. D., Bovaird, J. A., & Card, N. A. (Eds.). (2007). Modeling contextual effects in longitudinal studies. Mahwah, NJ: Erlbaum.
Little, T. D., Card, N. A., Slegers, D. W., & Ledford, E. C. (2007). Representing contextual effects in multiple-group MACS models. In T. D. Little, J. A. Bovaird, & N. A. Card (Eds.), Modeling contextual effects in longitudinal studies (pp. 121–147). Mahwah, NJ: Erlbaum.
Lubinski, D., & Benbow, C. P. (2006). Study of Mathematically Precocious Youth after 35 years: Uncovering antecedents for the development of math-science expertise. Perspectives on Psychological Science, 1, 316–345.
Martz, E., & Daniel, S. (2010). Comparing disability prototypes in the United States and Kenya. Journal of Applied Rehabilitation Counseling, 41, 19–25.
McArdle, J. J. (1988). Dynamic but structural equation modeling of repeated measures data. In J. R. Nesselroade & R. B. Cattell (Eds.), Handbook of multivariate experimental psychology (2nd ed., pp. 561–614). New York: Plenum.
Meade, A. W., & Tonidandel, S. (2010). Not seeing clearly with Cleary: What test bias analyses do and do not tell us. Industrial and Organizational Psychology: Perspectives on Science and Practice, 3, 192–205.
Meredith, W., & Horn, J. (2001). The role of factorial invariance in modeling growth and change. In L. M. Collins & A. G. Sayer (Eds.), New methods for the analysis of change (pp. 203–240). Washington, DC: American Psychological Association.
Meredith, W. M. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543.
Millsap, R. E., & Meredith, W. (2007). Factorial invariance: Historical perspectives and new problems. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and new directions (pp. 131–152). Mahwah, NJ: Erlbaum.
Nary, D. E., Froehlich-Grobe, K., & Aaronson, L. (2011). Recruitment issues in a randomized controlled exercise trial targeting wheelchair users. Contemporary Clinical Trials, 32, 188–195.
Nesselroade, J. R. (1983). Temporal selection and factor invariance in the study of development and change. In P. B. Baltes & O. G. Brim, Jr. (Eds.), Life-span development and behavior (Vol. 5, pp. 59–87). New York: Academic Press.
Obradović, J., Long, J. D., Cutuli, J. J., Chan, C.-K., Hinz, E., Heistad, D., & Masten, A. S. (2009). Academic achievement of homeless and highly mobile children in an urban school district: Longitudinal evidence on risk, growth, and resilience. Development and Psychopathology, 21, 493–518.
Olson, I. R., Berryhill, M. E., Drowos, D. B., Brown, L., & Chatterjee, A. (2010). A calendar savant with episodic memory impairments. Neurocase, 16, 208–218.
Or, D., Cohen, A., & Tirosh, E. (2010). The reliability and validity of the Aggregate Neurobehavioral Student Health and Educational Review Parent’s Questionnaire (ANSER-PQ). Journal of Child Neurology, 25, 157–164.
Pennington, K., Beasley, C. L., Dicker, P., Fagan, A., English, J., Pariante, C. M., et al. (2008). Prominent synaptic and metabolic abnormalities revealed by proteomic analysis of the dorsolateral prefrontal cortex in schizophrenia and bipolar disorder. Molecular Psychiatry, 13, 1102–1117.
Petrill, S. A. (2010). Editorial: Identifying the genetic and environmental influences on psychological outcomes. Journal of Child Psychology and Psychiatry, 51, 745–746.
Polinsky, M. L., Hser, Y.-I., & Grella, C. E. (1998). Consideration of special populations in the drug treatment system of a large metropolitan area. Journal of Behavioral Health Services & Research, 25, 7–21.
Reichenberg, A., Caspi, A., Harrington, H., Houts, R., Keefe, R. S. E., Murray, R. M., et al. (2010). Static and dynamic cognitive deficits in childhood schizophrenia: A 30-year study. The American Journal of Psychiatry, 167, 160–169.
Reise, S. P., & Waller, N. G. (2003). How many IRT parameters does it take to model psychopathology items? Psychological Methods, 8, 164–184.
Rensvold, R. B., & Cheung, G. W. (1998). Testing measurement models for factorial invariance: A systematic approach. Educational and Psychological Measurement, 58, 1017–1034.
Sears, D. O. (1986). College sophomores in the laboratory: Influences of a narrow database on social psychology’s view of human nature. Journal of Personality and Social Psychology, 57, 515–530.
Sears, D. O. (2008). College student-itis redux. Psychological Inquiry, 19, 72–77.
Simunovic, F., Yi, M., Wang, Y., Macey, L., Brown, L. T., Krichevsky, A. M., et al. (2009). Gene expression profiling of substantia nigra dopamine neurons: Further insights into Parkinson’s disease pathology. Brain, 132, 1795–1809.
Sussman, S. (2006). Prevention of adolescent alcohol problems in special populations. In M. Galanter (Ed.), Alcohol problems in adolescents and young adults: Epidemiology, neurobiology, prevention, and treatment (pp. 225–253). New York: Springer Science & Business Media.
Uehling, H. F. (1952). Rorschach “shock” for two special populations. Journal of Consulting Psychology, 16, 224–225.


Waller, N. G., & Reise, S. P. (1989). Computerized adaptive personality assessment: An illustration with the Absorption scale. Journal of Personality and Social Psychology, 57, 1051–1058.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9, 1–26.
Widaman, K. F. (2007). Intrauterine environment affects infant and child intellectual outcomes: Environment as direct effect. In T. D. Little, J. A. Bovaird, & N. A. Card (Eds.), Modeling contextual effects in longitudinal studies (pp. 387–436). Mahwah, NJ: Erlbaum.
Widaman, K. F. (2009). Phenylketonuria in children and mothers: Genes, environments, behavior. Current Directions in Psychological Science, 18, 48–52.
Widaman, K. F., & Borthwick-Duffy, S. A. (April, 1990). Parental influences on the development of adaptive behaviors. Paper presented at the Gatlinburg Conference on Research and Theory in Mental Retardation and Developmental Disabilities, Brainerd, MN.
Widaman, K. F., Gibbs, K. W., & Geary, D. C. (1987). The structure of adaptive behavior: I. Replication across fourteen samples of nonprofoundly retarded persons. American Journal of Mental Deficiency, 91, 348–360.
Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association.

CHAPTER 5

Theory Construction, Model Building, and Model Selection

James Jaccard

Abstract

General issues involved in (1) building causal theories, (2) translating those theories into a set of mathematical representations, (3) choosing an analytic strategy to estimate parameters in the equations implied by the theory, and (4) choosing the “best” model from a set of competing models are discussed. Theory construction fundamentally relies on six relationship types, including direct effects, indirect effects, moderated effects, reciprocal effects, spurious effects, and unanalyzed relationships. Once specified, each type of effect can be represented mathematically, thereby translating a path diagram into a set of (linear) equations. Parameters in the equations are then estimated using either limited information estimation approaches or full information estimation approaches, taking into account measurement properties, population distributions, and matters of robustness. Choices between competing models are based on indices of relative fit with the data and relative fit with competing models, as well as more general theoretical criteria (e.g., parsimony, consistency with related theories).

Key Words: Theory; theory construction; modeling; parameter estimation

Few people dispute that theory is at the heart of the scientific enterprise. We use theories to explain phenomena and to help solve important applied problems. With theoretical propositions in hand, we design research to gain perspectives on the viability of the theory and imbue those propositions with a certain degree of confidence based on the results of research and the quality of the research design. Theory is fundamental to the social sciences.

Advances in statistical analysis have been considerable during the past 30 years. Our statistical toolbox was at one time somewhat limited, and it was not uncommon for analysts to adopt practices that forced data to conform to the assumptions of our statistical tools so that we could apply inferential methods of analysis to those data. For example, although we knew that a construct like depression was non-normally distributed in the population and that our measures of depression and data reflected this, we would transform data on depression so that it would approximate a normal distribution and be amenable to analysis using statistical methods that assumed normality. Such days, fortunately, are over. The statistical tools available now allow us to give precedence to theory and model testing without being slave to many of the traditional assumptions made by methods of analysis. The present chapter discusses issues that analysts need to take into account as they move from theory to analysis. I focus first on the nature of theory in the social sciences, with an emphasis on describing theories that invoke the language of causality. Although there are many other approaches to theory and modeling, one cannot deny the prominence and pervasiveness of causal thinking in social science research. Hence, causal frameworks capture the bulk of my attention. Next, I discuss issues involved in moving from a well-specified causal


Next, I discuss issues involved in moving from a well-specified causal theory to a mathematical representation of that theory for purposes of statistical analysis. I then discuss a range of issues that must be considered as one moves from a set of equations representing a theory to formal data analysis, including measurement considerations, full versus limited information estimation strategies, distributional assumptions, and the possible presence of outliers. Finally, I consider issues in model selection, which refers to the process of choosing one model as being the "best" from a set of candidate models.

In the present chapter, I use the terms theory and model interchangeably, although some scientists do not do so. As examples, various authorities contend that models are a special type of theory (e.g., Coombs, Dawes, & Tversky, 1970, p. 4; Kaplan, 1964, p. 263), models are portions of theories (Sheth, 1967, p. 444; Torgerson, 1958, p. 4), models are derived from theories (e.g., Pap, 1962, p. 355), models are simplified versions of theories (e.g., Carnap, 1971, p. 54), models represent correspondence between two or more theories (Brodbeck, 1968), or theories represent specific interpretations of (i.e., are derived from) models (e.g., Green & Tull, 1975, p. 42). Others consider the terms to be synonymous (cf. Dubin, 1976; Simon & Newell, 1956). Although there may indeed be meaningful distinctions between theories and models, it also is the case that models, like theories, involve variables and relationships between variables, usually invoking the concept of causality. Accordingly, I will use the terms theory and model interchangeably.

Specifying a Causal Theory

The Nature of Causality

Theories take many forms in the social sciences. One common form involves specifying presumed relationships between variables while invoking the concept of causality. The nature of causality has been debated extensively by philosophers of science (e.g., Bunge, 1961; Cartwright, 2007; Frank, 1961; Morgan & Winship, 2007; Pearl, 2000; Russell, 1931; Rubin, 1974, 1978; Shadish, Cook, & Campbell, 2002), and most agree that causality is an elusive concept that is fraught with ambiguities. It is beyond the scope of this chapter to delve into the underlying philosophical issues (see Jaccard & Jacoby, 2010). Rather, I emphasize here a "working model" of causality that is adopted by most social scientists.

The concept of causality is usually thought of in terms of change—that is, X is a cause of Y if changes in X produce changes in Y (but see Sowa, 2000, and Lewis, 2000, for alternative conceptualizations). Four properties of causality are typically emphasized. First, a cause always must precede an effect in time. Second, the time that it takes for a change in X to produce a change in Y can vary, ranging from almost instantaneous change to years, decades, centuries, or millennia. Third, the nature and/or strength of the effect of X on Y can vary depending on context: X may influence Y in one context but not another context. Finally, cause and effect must be in some form of spatial contact or must be connected by a chain of intermediate events. We return to these ideas in later sections of this chapter.

The Nature of Theories that Use Causality

The focus of most causal theories is on explaining why variation in one or more variables exists. Some people make a great deal of money and others are poor. Why? What can account for this variation? Some people are able to remember complex material easily, whereas for other people it is difficult to do so. Why? What explains this variability? We answer these questions by specifying the presumed causes of the variability, and then we seek to test our theoretical proposition(s).

In any given causal theory, there are six fundamental types of relationships that can be referenced; these are illustrated in Figure 5.1. These six relationship types are the cornerstone of causal theories and define the universe of causal relationships that theorists draw upon. In Figure 5.1, a variable is indicated by a box, and a causal influence is represented by a straight arrow emanating from the cause and pointing to the effect. I discuss the bidirectional curved arrow in Figure 5.1 shortly.

Referring to Figure 5.1, a direct causal relationship is one in which a given cause is assumed to have a direct causal impact on some outcome variable. For example, exposure to violence in the media is assumed to influence aggressive behavior. Or, the quality of the relationship between a mother and her adolescent child is assumed to influence whether the child uses drugs. By contrast, an indirect causal relationship is one in which a variable influences another variable indirectly through its impact on an intermediary variable (see Fig. 5.1). For example, failing to accomplish a goal may lead to frustration that, in turn, causes someone to aggress against another. In this case, the failure to obtain a goal is an indirect cause of aggression. It influences aggression through its impact on frustration.

jaccard 83

[Figure 5.1. Path diagrams of the six relationship types. Direct causal relationship: X → Y. Indirect causal relationship: X → Z → Y. Spurious relationship: Z causing both X and Y. Moderated causal relationship: Z moderating the X → Y path. Bidirectional causal relationship: X ⇄ Y. Unanalyzed relationship: X and Y joined by a curved double-headed arrow.]

Frustration is formally called a mediating variable or, more informally, a mediator, because other variables "work through" it to influence the outcome. Indirect relationships are sometimes called mediated relationships.

Whenever a theorist posits an indirect relationship, the issue of whether to specify partial or complete mediation presents itself. In partial mediation, the distal variable has a direct effect on the outcome variable over and above its effect on the mediator. In complete mediation, all of the impact of the distal variable on the outcome variable is accounted for by the mediator. With partial mediation, in addition to the two causal paths characterizing the indirect effect illustrated in Figure 5.1, one adds an additional causal path linking the distal variable (X) and the outcome variable (Y) directly. As an example, we might posit that the quality of the relationship between a mother and her adolescent child (X) impacts the child's motivation to perform well in school (Z) and that one's motivation to do well in school, in turn, impacts (negatively) the tendency for an adolescent to use drugs (Y). As we think about the matter of specifying partial versus complete mediation, we might decide that there are other mechanisms by which the quality of the relationship between parent and adolescent can impact drug use, such as by lessening the attention that adolescents give to peers who use drugs. We therefore decide to posit partial mediation to reflect this fact and add a direct causal path from X to Y.

If we are able to specify another mechanism by which X influences Y over and beyond Z, then why not just incorporate that additional mediator into the theory? Of course, we could very well do this, but then the issue becomes whether the two mediators in the theory, considered together, are complete or partial mediators of the causal effect of X on Y. This might lead us to speculate about yet a third mechanism, and once we have specified it, the issue of partial or complete mediation will present itself yet again. At some point, we must decide to close out the system and just let a direct path between X and Y stand so as to reflect partial mediation without formally bringing additional mediators into the model. If pressed, we could articulate one, but we simply do not want to complicate the theory further.
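The partial/complete mediation distinction can be made concrete with a pair of linear equations — a sketch in my own notation, not the chapter's, with X as relationship quality, Z as school motivation, and Y as drug use:

```latex
\begin{align*}
Z &= \alpha_1 + \beta_1 X + \delta_1 \\
Y &= \alpha_2 + \beta_2 Z + \beta_3 X + \delta_2
\end{align*}
```

Complete mediation is the constraint β3 = 0: every effect of X on Y is carried by the indirect path, whose strength is the product β1β2. Partial mediation frees β3 ≠ 0, which is exactly the extra direct arrow from X to Y described above.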

A spurious relationship is one in which two variables are related because they share a common cause, but not because either causes the other (see Fig. 5.1). As an example, if we select a random sample of people in the United States and calculate the association between height and length of hair, then we would find a moderate relationship between the two variables: People with shorter hair grow taller. Does this mean that a causal relationship exists between these variables—that is, that cutting one's hair will make one grow taller? Of course not. The reason the variables are correlated is because they share a common cause: gender. Males tend to have shorter hair than females, and males tend to grow taller than females. The common cause of gender produces a correlation between length of hair and height, but it is spurious.

A moderated causal relationship, like spurious and indirect relationships, involves at least three variables (see Fig. 5.1). In this case, the causal relationship between two variables, X and Y, differs depending on the value of a third variable, Z. For example, it might be found that a given type of psychotherapy (X) is effective for reducing depression (Y) for females but not for males. In this case, the causal relationship between being exposed to psychotherapy and depression is moderated by gender. When gender has the value "female," X impacts Y. However, when gender has the value "male," X does not impact Y. Gender is called a moderator variable because the relationship between the presence or absence of psychotherapy (X) and depression (Y) changes as a function of (or is "moderated by") the levels of gender.

A bidirectional causal relationship exists when two variables are conceptualized as influencing each other (see Fig. 5.1). For example, in the area of reproductive health, a theorist might posit a bidirectional influence between a woman's belief that the rhythm method is effective at preventing pregnancy (X) and her attitude toward the rhythm method (Y). A woman may have a positive attitude toward the rhythm method because she believes it is effective. Simultaneously, she may believe it is effective, in part, because she has a positive attitude toward it, via a mechanism that involves rationalization of behavior.

Technically, there can never be simultaneous reciprocal causation, because there always must be a time interval, no matter how infinitesimally small, between the cause and the effect that follows from that cause. If we observed the causal dynamics within the appropriate time frames, then the true dynamic underlying a reciprocal causal relationship would appear as follows:

Xt1 → Yt2 → Xt3 → Yt4,

where Xt1 is variable X at time 1, Yt2 is variable Y at time 2, Xt3 is variable X at time 3, and Yt4 is variable Y at time 4. As an example, suppose we conduct a cross-sectional study and at a given point in time we measure adolescent drug use and grade point averages in school. It is likely that the measured drug use reflects the influence of prior poor performance in school, because adolescents who do poorly in school might turn to drugs as a coping mechanism or as a way of spending free time that normally would have been directed to school work. Similarly, the measured school performance likely reflects the effects of any prior drug use, which can cause students to lose interest in school and to not concentrate on tests and studying. It would be wrong to assume there is unidirectional causality from one construct to the other in this study. More realistically, the two measures reflect a more fine-grained process that has played itself out—that is, poor performance in school at time t influenced drug use at time t + 1, which in turn influenced school performance at time t + 2, which in turn influenced drug use at time t + 3, and so on. It is only when we are unable to capture the more fine-grained time intervals and must instead work with coarser time intervals that the dynamic of the reciprocal causal relationship as illustrated in Figure 5.1 applies. By working with coarser time units, the more fine-grained temporal causal dynamics are assumed to have already played themselves out (often referred to as the equilibrium assumption). In this sense, there exists reciprocal causality per Figure 5.1.

The final type of relationship that can occur in a causal model is an unanalyzed relationship. In Figure 5.1, the two variables for this type of relationship are connected by a double-headed curved arrow. This arrow signifies that the two variables are possibly correlated but that the theorist is not going to specify why they are correlated. The correlation may be spurious, or it may result from a causal connection of some kind. The theorist wants to recognize the possible correlation between the variables, but trying to explain it is beyond the scope of the theoretical effort. The relationship will remain unanalyzed.

Most causal models have more than one of these six types of relationships in them. An example of a multivariate causal model appears in Figure 5.2, which was based on an endeavor that developed an intervention to reduce unprotected sexual intercourse and thereby the spread of sexually transmitted diseases (STDs). The intervention is represented as a two-level, qualitative variable in which individuals are randomly assigned to either an intervention group or a control group.
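The height–hair length example of spuriousness described earlier is easy to check by simulation. The sketch below uses invented effect sizes purely for illustration: gender causes both variables, neither variable causes the other, yet the two are marginally correlated — and the correlation vanishes once the common cause is held constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Common cause: 0 = female, 1 = male (a hypothetical coding).
gender = rng.integers(0, 2, size=n)

# Height and hair length each depend on gender plus independent noise;
# there is no causal path between them.
height_cm = 165 + 12 * gender + rng.normal(0, 6, size=n)
hair_cm = 30 - 15 * gender + rng.normal(0, 8, size=n)

# Marginally, the two variables are (spuriously) correlated ...
r_marginal = np.corrcoef(height_cm, hair_cm)[0, 1]

# ... but within each level of the common cause the association disappears.
r_within_female = np.corrcoef(height_cm[gender == 0], hair_cm[gender == 0])[0, 1]
r_within_male = np.corrcoef(height_cm[gender == 1], hair_cm[gender == 1])[0, 1]

print(f"marginal r = {r_marginal:.2f}")  # clearly negative
print(f"within-gender r = {r_within_female:.2f}, {r_within_male:.2f}")  # near zero
```

Conditioning on gender (statistically "controlling" for the common cause) is precisely what distinguishes a spurious association from a causal one in this example.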


[Figure 5.2. A multivariate causal model incorporating an experimental design. Intervention versus Control → "Belief: Unprotected Sex leads to STDs" (path a) and "Belief: STDs are Serious" (path b); the two beliefs → "Engages in Unprotected Sex" (paths c and d); "Engages in Unprotected Sex" → "Has an STD" (path e).]
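As a preview of translating a path diagram into a set of linear equations, the Figure 5.2 model can be sketched with one equation per endogenous variable, the path labels a–e becoming coefficients (my notation, not the chapter's):

```latex
\begin{align*}
\text{RiskBelief} &= \alpha_1 + a\,\text{Intervention} + \delta_1\\
\text{SeverityBelief} &= \alpha_2 + b\,\text{Intervention} + \delta_2\\
\text{UnprotectedSex} &= \alpha_3 + c\,\text{RiskBelief} + d\,\text{SeverityBelief} + \delta_3\\
\text{HasSTD} &= \alpha_4 + e\,\text{UnprotectedSex} + \delta_4
\end{align*}
```

Here Intervention is the 0/1 treatment indicator, and each endogenous variable carries a disturbance term δ, consistent with the discussion of disturbance terms later in the chapter.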

The intervention is designed to influence (1) the belief that having unprotected sex increases the risk of contracting an STD (see path a in Fig. 5.2) and (2) the belief that contracting an STD is bad for one's health (see path b in Fig. 5.2). These two beliefs, in turn, are assumed to influence the tendency for a person to engage in unprotected sex (paths c and d). The tendency to engage in unprotected sex, in turn, is thought to impact the extent to which the person contracts STDs (path e). Paths a through e each represent direct causal relationships. There also is a spurious relationship in this model, as seen by the fact that the two beliefs share a common cause, the intervention (paths a and b). Because of this common cause, we expect the two beliefs to be correlated to some extent, although there is no presumed causal connection between them. There also are several indirect causal relationships in the model. For example, the intervention indirectly affects the occurrence of STDs through the two belief mediators and, in turn, their influence on the tendency to engage in unprotected sex.

In causal theories, distinctions are made between exogenous and endogenous variables. Any variable that has a straight arrow going to it in a path diagram is called an endogenous variable. Endogenous variables, essentially, are outcome variables—that is, they are presumed to be influenced by another variable in the theoretical system. Variables that do not have a straight arrow going to them are called exogenous variables. They do not have presumed causes that are elaborated upon in the theory.

In sum, the fundamental orientation to constructing a causal theory is to explain variation in one or more outcomes. This is accomplished with reference to six types of relationships: direct effects, indirect effects, reciprocal effects, moderated effects, spurious effects, and unanalyzed relationships. In the theory construction process, it is not uncommon for the theorist to first identify the outcome variable he or she wishes to explain and then to specify a few direct causes of that variable. One or more of the direct causes can then be turned into an indirect effect by elaborating the mechanisms by which the cause produces the effect. The "mechanisms" are, essentially, mediators. The theory might be further elaborated by specifying the boundary conditions of effects, thereby introducing moderated relationships into the framework (e.g., the effect holds when condition A is operative but not when condition B is operative; or the effect holds for one type of individual but not another type of individual). For a discussion of the many heuristics scientists use to identify mechanisms and boundary conditions when developing causal theories, see Jaccard and Jacoby (2010).

Theories can be represented in the form of path diagrams using the graphical schemes illustrated in Figures 5.1 and 5.2. Such representations are possible no matter what research design is used to test a theory—that is, we can represent the underlying theory with path diagrams for experiments just as readily as for purely observational designs. Figure 5.2 is an example of a causal theory that incorporates an experimental design.

Causal Theories with Explicit Temporal Dynamics

An important feature of many (but not all) causal theories is a focus on longitudinal dynamics. Causal theories that focus on longitudinal dynamics contain one or more of the six types of relationships described above, but there also is an explicit temporal dynamic that is of theoretical interest. A nomenclature has emerged around such theories, which I briefly describe here.

One popular type of causal model that includes longitudinal dynamics is called a panel model, in which multiple variables are modeled at multiple points in time, also called waves.

[Figure 5.3. A two-variable, four-wave panel model. Alcohol Use at times t through t + 3 is linked by autoregressive paths a, and Delinquency at times t through t + 3 by autoregressive paths b; contemporaneous paths c run from Alcohol Use to Delinquency at each wave, and lagged paths d run from Alcohol Use at each wave to Delinquency at the next wave.]

Figure 5.3 presents an example of a two-variable, four-wave panel model. This model examines the relationship between adolescent alcohol use and delinquency across the 4 years of high school. Heavy alcohol use at time t is thought to have a direct effect on alcohol use at time t + 1, and the same type of dynamic is thought to hold for delinquency. A path that specifies a variable at time t + 1 as being a function of that same variable at time t is said to reflect an autoregressive effect (see paths a and b in Fig. 5.3). In the model in Figure 5.3, the autoregressive effects are first-order effects, because the variable at a given time, t, is assumed to be a function of that same variable at the time just prior to it. A second-order autoregressive effect is one where the variable at time t is impacted (also) by the same variable at time t − 2. A third-order autoregressive effect implies a direct effect between a variable at time t and that same variable at time t − 3. And so on.

The theory in Figure 5.3 also specifies that drinking alcohol at time t predisposes one to delinquency at that same time period, under the rationale that people often commit acts of delinquency when they are drunk. When a causal relationship exists between two variables at the same period of time, it is said to represent a contemporaneous causal effect (see path c in Fig. 5.3). Finally, when a variable at time t has an effect on a different variable at time t + 1, it is referred to as a lagged effect (see path d in Fig. 5.3). In this case, alcohol use at time t is thought to have a delayed effect on delinquency at time t + 1 independent of the other indirect causal chains that link alcohol use at time t to delinquency at time t + 1.

In sum, it is common in longitudinal models to theorize about autoregressive effects, contemporaneous effects, and lagged effects. These effects are common features of panel models (Collins, 2006; Finkel, 2008).

A second type of theory that formally incorporates temporal dynamics is theory based on growth processes. This approach views variation in outcomes across time as arising from an unobserved "growth" process that causes changing values of an outcome over time. These models are typically associated with latent growth curve models in the social science literature or, more simply, latent curve models (LCMs). The classic form of these models is presented in Figure 5.4, as applied to the alcohol-delinquency example considered earlier. Observed variables or measures are represented by boxes, and latent (unobserved) variables are represented by circles. This model parameterizes a latent "growth" or "maturation" process for alcohol use as students progress through high school (represented by the variable called "Alcohol Slope" in Fig. 5.4), as well as a latent growth or maturation process for delinquency as students progress through high school (see the variable "Delinquency Slope" in Fig. 5.4). The "growth process" for alcohol use is assumed to impact the "growth process" for delinquency across the four time periods. One often will encounter longitudinal causal theories expressed in this form instead of in the more traditional panel model form (Collins, 2006).

Of course, it is possible that both types of dynamics in the two types of models operate. When the two types of processes are integrated into a single model, we obtain what is called an autoregressive latent trajectory model (Bollen & Curran, 2006). Figure 5.5 presents this model.

In sum, when constructing theories that incorporate longitudinal dynamics, one will explicitly take into account the possible causal dynamics described by panel models, by LCMs, or by autoregressive latent trajectory models.
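In equation form, the autoregressive (a, b), contemporaneous (c), and lagged (d) effects of the Figure 5.3 panel model can be sketched as follows (my notation, not the chapter's):

```latex
\begin{align*}
\text{Alc}_{t+1} &= \alpha_1 + a\,\text{Alc}_{t} + \delta_{1,t+1}\\
\text{Del}_{t+1} &= \alpha_2 + b\,\text{Del}_{t} + c\,\text{Alc}_{t+1} + d\,\text{Alc}_{t} + \delta_{2,t+1}
\end{align*}
```

Each wave's endogenous variables carry their own disturbance terms; the same pair of equations repeats for each of the transitions between the four waves.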


[Figure 5.4. A bivariate latent curve model. Latent Baseline Alcohol and Alcohol Slope factors underlie the four Alcohol Use measures (times t through t + 3), and latent Baseline Delinquency and Delinquency Slope factors underlie the four Delinquency measures; path a links the alcohol growth process to the delinquency growth process.]

[Figure 5.5. An autoregressive latent trajectory model combining Figures 5.3 and 5.4: the latent baseline and slope factors of Figure 5.4 are added to the panel model's autoregressive (a, b), contemporaneous (c), and lagged (d) paths, with path e linking the two trajectory processes.]
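The latent curve model in Figure 5.4 treats each adolescent's four measurements as noisy readings of a person-specific trajectory. A sketch in conventional LCM notation (mine, not the chapter's), with i indexing adolescents and t = 0, 1, 2, 3 indexing the four waves:

```latex
\begin{align*}
\text{Alc}_{it} &= \alpha^{A}_{i} + \beta^{A}_{i}\,t + \varepsilon^{A}_{it}\\
\text{Del}_{it} &= \alpha^{D}_{i} + \beta^{D}_{i}\,t + \varepsilon^{D}_{it}
\end{align*}
```

The α terms are the latent baseline (intercept) factors and the β terms the latent slope factors; the structural path in Figure 5.4 from the Alcohol Slope to the Delinquency Slope amounts to regressing the latent slope for delinquency on the latent slope for alcohol use.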

Multilevel Causal Theories

Another common type of causal model that has received considerable attention in the social sciences is one that incorporates multiple levels of analysis, or what is called a multilevel model. These models deal with scenarios where there is nesting—for example, where individuals are nested within different, higher level organizational units. For example, students are nested within schools, and both characteristics of the students and characteristics of the schools can influence an outcome variable, such as performance by students on a standardized math test. Employees are nested within organizations, and both characteristics of the employees and characteristics of the organization can influence employee behavior. Patients are nested within hospitals, and both characteristics of the patients and characteristics of the hospitals can influence patient recovery.

In sum, causal theories seek to explain variation in one or more outcome variables by identifying causes of those variables. Social scientists posit theories or models to represent the presumed causal relationships among variables, and these theories typically have one or more of the six fundamental relationship types in them—namely, direct effects, indirect effects, spurious effects, moderated effects, reciprocal causality, and unanalyzed relationships. The theories can focus on a single time period or explicitly deal with temporal dynamics. If they deal with temporal dynamics, then this usually takes the form of a panel model, an LCM, or an autoregressive latent trajectory model. Theorizing also can occur at a single level of analysis or at multiple levels of analysis, in which lower order units are nested within higher order units, with characteristics of the units at both levels influencing outcomes.

There is a more subtle facet of theory construction beyond those elucidated thus far, and this concerns the concept of disturbance terms. Consider the simple theory in Figure 5.6a. This theory has two direct causes, where variables X and Z are assumed to influence variable Y. A fourth "variable" is represented in the system by a circle. This "variable" reflects all unspecified variables that influence Y other than X and Z. It formally recognizes that the theory is incomplete and that we have not specified every cause of the outcome variable. This "variable" is called a disturbance term, and it represents the totality of all unspecified causal effects on the endogenous variable it is associated with. The presence of a disturbance term explicitly recognizes that not all causal influences on a variable have been specified in the model. Traditionally, each endogenous variable in a theory has a disturbance term associated with it.

[Figure 5.6. Theories with Disturbance Terms. (a) Theory with Disturbance Term: X → Y and Z → Y, with disturbance δ on Y. (b) Smoking and Drug Example with Uncorrelated Disturbance Terms: Gender → Smoking and Gender → Drug Use, with disturbances δ1 and δ2. (c) Smoking and Drug Example with Correlated Disturbance Terms: the same model with δ1 and δ2 correlated.]

Consider another example, in Figure 5.6b. There are two endogenous variables, and they share a common cause. One of the endogenous variables is adolescent tobacco use, and the other is adolescent drug use. The common cause is gender. The theory posits that boys are more likely than girls to smoke cigarettes and that boys also are more likely than girls to use drugs. There is a disturbance term for each of the endogenous variables. These terms recognize that factors other than gender impact tobacco use and drug use.
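In equation form (my notation), the Figure 5.6b theory is:

```latex
\begin{align*}
\text{Smoking} &= \alpha_1 + \gamma_1\,\text{Gender} + \delta_1\\
\text{DrugUse} &= \alpha_2 + \gamma_2\,\text{Gender} + \delta_2,
\qquad \operatorname{Cov}(\delta_1, \delta_2) = 0
\end{align*}
```

The Figure 5.6c variant discussed next keeps the same two equations but frees Cov(δ1, δ2) ≠ 0, allowing unspecified common causes to induce an association between the two outcomes beyond the one produced by gender.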


But there is a problem with this theory. According to the theory, the only reason that smoking cigarettes and drug use in adolescence are correlated is because they share the common cause of gender. In reality, there are many other common causes of these two constructs. For example, social class impacts both tobacco use and drug use during adolescence, with more economically disadvantaged youth having an increased tendency to smoke cigarettes and to use drugs. Essentially, social class resides within the disturbance term for smoking cigarettes, and it also resides within the disturbance term for drug use. If the same unspecified cause is in each disturbance term, then we would expect the two disturbance terms to be correlated. Figure 5.6c presents a more plausible theory that includes this correlation between disturbances. According to this theory, there are two reasons why adolescent cigarette smoking and adolescent drug use are correlated. One reason is because they share the common cause of gender. Another reason is that they share other common causes that are unspecified by the theory and that reside in both disturbance terms.

A well-developed theory provides explicit statements about which disturbance terms in the framework are correlated and which disturbance terms are not. The lazy way out for a theorist is to simply assume all disturbance terms are correlated. But this is not satisfactory, and it can create considerable difficulties for testing the theory empirically. A better approach is to carefully consider every pair of disturbance terms and to articulate a possible common cause that resides in each of the separate disturbance terms. If such a common cause can be articulated, then it makes sense to posit correlated disturbances. If one cannot articulate any such variable, or if its effects are thought to be trivial, then one does not posit correlated disturbances.

For models with a longitudinal component, many theorists have a "knee-jerk" reaction that disturbances directed at the same variable at two different points in time must be correlated. Again, if one can articulate a compelling rationale for correlated disturbances, then by all means, correlated disturbances should be incorporated into the theory. Otherwise, correlated disturbances should be viewed with theoretical skepticism.

If a theorist is able to articulate a variable that resides in two disturbance terms to create correlated disturbances, then why not explicitly incorporate the variable into the theoretical system? For example, for the smoking cigarette and drug use example in Figure 5.6, why not explicitly bring social class into the theoretical system? This, of course, is desirable. But at some point, we want to close out the theoretical system and work just with the variables we have specified. By including disturbance terms and correlated disturbances, we are explicitly recognizing the operation of other variables, but we choose not to give them central focus in our theory.

Latent Variables, Structural Models, and Measurement Models

Some researchers take matters a step further and also incorporate a measurement theory into their conceptual frameworks when they are performing an empirical test of the theory. This goes beyond the typical province of theory construction per se, but I mention the ideas here, as they ultimately impact data analysis and the choice of statistical methods. The integration of conceptual and measurement theories is something that should be standard practice for social science research.

An empirical test of a theory necessarily requires developing and using measures of the theoretical constructs in the theory. Just as one can build a theory linking one concept to another concept, so too can one build a theory linking a construct to a measure of that construct. Some theorists combine both types of theories into a single overarching framework. Traditional measurement models make a distinction between a latent variable and an observed measure of that variable. The latent variable is the true construct about which one is interested in making statements, such as depression. Although we can directly observe many of the symptoms of depression, we can't directly observe the "seat" of depression in a person's mind. Rather, we rely on some observable response(s) to assess the latent variable, such as a multi-item inventory of depression that a person completes. Figure 5.7a presents one representation of a measurement model. The latent variable of depression is contained in a circle, and the observed measure thought to reflect depression is contained in a square (the label "AR" stands for adolescent report of depression). A causal path is drawn from the latent variable to the observed measure, under the assumption that how depressed a person is influences how he or she responds to the questions on the inventory. There also is an error term that reflects measurement error—that is, there are factors other than depression that influence a person's responses on the inventory. Ideally, measurement error is minimal, but it is a fact of life for many research endeavors.

[Figure 5.7. Measurement Models. (a) Single Indicator: latent Depression (circle) → observed AR (square), with error ε. (b) Multiple Indicators: latent Depression → observed AR and MR, with errors ε1 and ε2.]

The relationship between the latent construct and the observed indicator is usually assumed to be linear, but it could also be nonlinear.

Sometimes we obtain multiple indicators of a construct. For example, a researcher might obtain a self-report of depression from an adolescent as well as a report from the adolescent's mother about how depressed the child is (MR). Figure 5.7b presents a measurement model for this scenario. The latent variable of depression is assumed to influence both of the observed measures, and each measure is assumed to have some measurement error, as reflected by the presence of error terms. The errors are assumed to be uncorrelated because we cannot articulate any viable reason why we would expect them to be correlated. However, one can introduce correlated measurement error, if appropriate.

Figure 5.8 presents an example of a more elaborate theoretical framework that incorporates a theory about the relationship between constructs as well as a measurement theory. Although it appears somewhat intimidating, it is a straightforward model. There are five latent constructs, and the main substantive theory is focused on them. The portion of the diagram

[…] happens to her is beyond her control. The theory is that the more a mother thinks that what happens is not under her control, the more likely she will be to keep smoking and drinking during pregnancy. These two latent exogenous variables are assumed to be correlated. The three latent endogenous variables each have a disturbance term, indicated by a circle with a δ inside of it. The disturbances are assumed to be uncorrelated.

The portion of the diagram with arrows from the latent constructs to the observed measures constitutes the measurement model. Each of the latent variables has multiple indicators. In other words, the researcher obtains three measures of each construct, with the exception of birth weight, which is measured using two different indicators. In the interest of space, we do not describe these measures, but note that each is assumed to be fallible (i.e., subject to some measurement error; see the circles ranging from ε1 to ε14). The measurement errors are assumed to be uncorrelated. Figure 5.8 provides an explicit roadmap for a researcher to test the combined structural theory and measurement theory.

In experiments that involve a formal manipulation, the manipulation typically is considered to be an observed variable in its own right, with no latent construct underlying it (see Fig. 5.2). In some cases, the manipulation is designed to reflect or produce differences in an underlying psychological state (e.g., one's mood when studying the effects of mood on memory). In this case, a measurement model may be introduced into the causal system, treating the manipulation as a formative rather than a reflective indicator of the construct (see Schumacker & Lomax, 2004, for elaboration).

Many social scientists view measurement models as interesting causal theories in their own right rather than just a methodological feature of theory testing. The most common case is when the conceptual focus is on specifying the facets or dimen-

focused on the causal relations among the latent sions of a construct vis-à-vis factor analysis. The

variables is called the structural model. The primary causal theory underlying factor analysis is that one or

outcome variable in this model is the birth weight more (possibly correlated) unmeasured latent vari-

of a newborn. Birth weight is thought to be inﬂu- ables are each a common cause of a set of observed

enced by two factors: how much alcohol the mother measures of constructs that are of interest in their

consumes during her pregnancy and how much she own right. For example, theorists have suggested

smokes during her pregnancy. Both of these variables there are four facets of social support: informational

are thought to be inﬂuenced by two other variables. support, emotional support, tangible support, and

The ﬁrst determinant is the extent of support the companionship support. Each of these facets is con-

mother has from friends and relatives who can help ceptualized as a latent variable that impacts distinct

her quit smoking and drinking. The second is the manifestations of social support. For elaboration of

mother’s locus of control. Locus of control refers to theory-based expressions of measurement models,

the extent to which the mother believes that what see Brown (2006).

jaccard 91
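The attenuating effect of measurement error on multiple indicators can be made concrete with a small simulation. The sketch below is illustrative only (the loadings, error standard deviations, and sample size are invented, not taken from the chapter): two fallible indicators that share nothing but a latent cause end up correlated with each other, but well below 1.0.

```python
import random
import statistics as st

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = st.mean(x), st.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * st.pstdev(x) * st.pstdev(y))

random.seed(1)
n = 20000
# Latent depression scores (unobservable in practice).
latent = [random.gauss(0, 1) for _ in range(n)]
# Two fallible indicators, e.g., adolescent report (AR) and mother
# report (MR): each a linear function of the latent variable plus
# independent measurement error.
ar = [0.8 * d + random.gauss(0, 0.6) for d in latent]
mr = [0.7 * d + random.gauss(0, 0.7) for d in latent]

r = corr(ar, mr)   # induced solely by the shared latent cause
```

With these illustrative values the population correlation works out to about 0.57; larger error standard deviations attenuate it further, and error-free indicators would drive it toward 1.0.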

[Figure 5.8: Path diagram combining a structural model and a measurement model. Latent constructs Support (indicators S1–S3), Control (C1–C3), Alcohol (A1–A3), Smoking (K1–K3), and Birthweight (W1, W2), with measurement errors ε1–ε14 and disturbances δ1–δ3.]

In sum, a well-developed theory will not only include causal relationships between constructs but also will include disturbance terms and a theory of the correlational structure of those disturbance terms. As one moves to an empirical test of a theory, one also specifies a measurement model that links theoretical constructs to measures obtained in the empirical test.

From Theories to Mathematical Representations

Specifying Core Model Equations

With a well-articulated theory in hand, the next step for choosing a form of statistical analysis is to translate the theory into a set of equations that then guide the statistical analysis. The basic strategy for doing so can be illustrated by making some simplifying assumptions, which I relax later. First, unless otherwise specified, I will assume that all variables in the theoretical system are continuous. Second, I will assume that the measures of all variables have at least interval level properties. Finally, I will assume that all relationships between variables are linear. I adopt a strategy whereby the theory under consideration is expressed in the form of a set of linear equations. This does not mean, however, that researchers always parameterize data analysis in terms of the linear model. Many well-known methods of analysis, such as t tests, analysis of variance, and analysis of covariance, evolved outside of the context of the linear model. However, these methods can be re-expressed as linear equations, and in this sense, they are compatible with the current approach. Complex models of dichotomous, nominal, and count variables also can be approached from the perspective of linear equations using what is known as the generalized linear model (see Yuan & Schuster, Chapter 18, Volume 1). Finally, newer methods of analysis that focus on robust indices of central tendency and variability can be viewed in the context of linear equations but where the parameterization shifts from means and variances to trimmed means, M estimators, and the like (see Erceg-Hurn, Wilcox, & Keselman, Chapter 19, Volume 1).

[Figure 5.9: Path diagram in which X influences Y3 (β4) and Y4 (β5), Y3 and Y4 influence Y2 (β2 and β3), and Y2 influences Y1 (β1), with disturbances ε1–ε4.]

Consider the case where the initial theory is represented as a path or influence diagram, such as the theory in Figure 5.9. A path diagram can be viewed as a pictorial representation of a set of equations. There is a separate equation for each endogenous variable in the theory. More specifically, given the aforementioned assumptions, each endogenous variable can be expressed as being a linear function of all variables with arrows going directly to and explicitly touching the box representing the endogenous variable. For the model in Figure 5.9, there are four linear equations that are of primary theoretical interest because there are four endogenous variables. Using the above rule, the relevant equations are:

Y1 = α1 + β1 Y2 + ε1,
Y2 = α2 + β2 Y3 + β3 Y4 + ε2,
Y3 = α3 + β4 X + ε3, and
Y4 = α4 + β5 X + ε4,

where the various α are intercepts, the various β are linear coefficients, and the various ε are disturbance terms. The primary interest of the researcher is estimating and interpreting the parameters α1 through α4, β1 through β5, and the variances of ε1 through ε4 relative to the variances of Y1 through Y4. We select statistical methods of analysis that provide the best and most well-behaved estimates of these parameters.

The rule for expressing a path diagram in terms of a set of core equations also applies to models with latent and observed variables, such as the model in Figure 5.10. For this model, the structural model has the following core equations:

LY = α1 + β1 LM + δ1 and
LM = α2 + β2 LX + β3 LZ + δ2,

with each equation focusing on a latent endogenous variable as an outcome. We are interested in estimating the parameters in these equations, but the task is more challenging statistically because we do not have direct access to a person's scores on the latent variables. Nevertheless, statisticians have derived methods for obtaining such estimates (see McDonald, Chapter 7, Volume 1).

The measurement model for the theory in Figure 5.10 implies the following equations (again, using the rule of specifying an equation for each endogenous variable, in this case, the observed endogenous variables):

X1 = α3 + β4 LX + ε1,
X2 = α4 + β5 LX + ε2,
Z1 = α5 + β6 LZ + ε3,
Z2 = α6 + β7 LZ + ε4,
M1 = α7 + β8 LM + ε5,
M2 = α8 + β9 LM + ε6,
Y1 = α9 + β10 LY + ε7, and
Y2 = α10 + β11 LY + ε8.

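To see how a set of core equations maps onto estimation, the Figure 5.9 system can be simulated with arbitrary, made-up coefficient values and then estimated one equation at a time. The sketch below (mine, not the chapter's) fits only the single-predictor equations, for which the ordinary least squares slope is cov(x, y)/var(x); the equation for Y2, with two predictors, would require multiple regression.

```python
import random
import statistics as st

def ols_slope(x, y):
    """OLS slope for a one-predictor linear equation: cov(x, y) / var(x)."""
    mx, my = st.mean(x), st.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

random.seed(7)
n = 50000
# Generate data from the four Figure 5.9 equations with known
# (hypothetical) parameter values.
x  = [random.gauss(0, 1) for _ in range(n)]
y3 = [0.5 + 0.9 * v + random.gauss(0, 1) for v in x]    # beta4 = 0.9
y4 = [1.0 - 0.4 * v + random.gauss(0, 1) for v in x]    # beta5 = -0.4
y2 = [0.2 + 0.6 * a + 0.3 * b + random.gauss(0, 1)      # beta2 = 0.6, beta3 = 0.3
      for a, b in zip(y3, y4)]
y1 = [0.7 * v + random.gauss(0, 1) for v in y2]         # beta1 = 0.7

# Estimate each single-predictor equation on its own.
b4 = ols_slope(x, y3)
b5 = ols_slope(x, y4)
b1 = ols_slope(y2, y1)
```

Because each disturbance is independent of that equation's predictor, the per-equation estimates land close to the generating values.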

[Figure 5.10: Latent variable model in which LX and LZ influence LM (β2 and β3) and LM influences LY (β1). Indicators: X1 and X2 for LX (β4, β5), Z1 and Z2 for LZ (β6, β7), M1 and M2 for LM (β8, β9), and Y1 and Y2 for LY (β10, β11), with measurement errors ε1–ε8 and disturbances δ1 and δ2.]

It is not uncommon to express the set of core equations in a theory in more succinct form than the above using matrix algebra. Readers unfamiliar with matrix algebra can skip to the summary paragraph at the end of this section. As an example, the above structural model is expressed by making distinctions between the latent endogenous variables and the latent exogenous variables, defining a vector of latent endogenous variables (η) and a vector of latent exogenous variables (ξ) as

η = [ η1 ]      ξ = [ ξ1 ]
    [ η2 ]          [ ξ2 ],

and a vector of disturbance terms (δ) and a vector of intercepts (α) for the latent η as

δ = [ δ1 ]      α = [ α1 ]
    [ δ2 ]          [ α2 ].

In the present example, there are m = 2 latent endogenous variables, q = 2 latent exogenous variables, r = 4 observed endogenous measures with respect to the two latent η, and p = 4 observed endogenous variables with respect to the two latent ξ. We further specify an m × m matrix (B) representing the linear (regression) coefficients regressing the η onto the η and an m × q matrix (Γ) representing the linear (regression) coefficients regressing the η onto the ξ:

B = [ β11  β12 ]      Γ = [ γ11  γ12 ]
    [ β21  β22 ]          [ γ21  γ22 ].

In the example in Figure 5.10, let LY = η1, LM = η2, LX = ξ1, and LZ = ξ2. Then

B = [ 0  β12 ]      Γ = [  0    0  ]
    [ 0   0  ]          [ γ21  γ22 ]

and the structural model is defined as

η = α + Bη + Γξ + δ.

The measurement model for the latent endogenous and exogenous variables defines separate matrices of factor loadings for the linear (regression) coefficients from the latent endogenous variables to the observed endogenous variables/indicators with respect to them (λY, which is an r × m matrix) and for the linear (regression) coefficients from the latent exogenous variables to the observed endogenous variables/indicators of them (λX, which is a p × q matrix):

λY = [ λY11  λY12 ]      λX = [ λX11  λX12 ]
     [ λY21  λY22 ]           [ λX21  λX22 ]
     [ λY31  λY32 ]           [ λX31  λX32 ]
     [ λY41  λY42 ]           [ λX41  λX42 ].

For the present example, let the four observed endogenous variables with respect to the latent endogenous variables be Y1, Y2, M1, and M2, respectively, and the four observed endogenous variables with respect to the latent exogenous variables be X1, X2, Z1, and Z2, respectively. Then the above matrices are

λY = [ λY11   0   ]      λX = [ λX11   0   ]
     [ λY21   0   ]           [ λX21   0   ]
     [  0   λY32  ]           [  0   λX32  ]
     [  0   λY42  ]           [  0   λX42  ].

A vector of intercepts (τY) and a vector of error terms (εY) are defined for the observed indicators of the latent endogenous variables:

τY = [ τY1 ]      εY = [ εY1 ]
     [ τY2 ]           [ εY2 ]
     [ τY3 ]           [ εY3 ]
     [ τY4 ]           [ εY4 ],

and the observed indicators of the latent exogenous variables:

τX = [ τX1 ]      εX = [ εX1 ]
     [ τX2 ]           [ εX2 ]
     [ τX3 ]           [ εX3 ]
     [ τX4 ]           [ εX4 ],

and the measurement models are defined as:

Y = τY + λY η + εY and
X = τX + λX ξ + εX,

where Y is a column vector of the observed endogenous variables with respect to the latent endogenous variables, and X is a column vector of the observed endogenous variables with respect to the latent exogenous variables.

In addition to these matrices and vectors, one also typically specifies a matrix of covariances between the latent exogenous variables, a matrix of covariances for the disturbance terms, a matrix of covariances for the error terms, and a vector of means for the latent exogenous variables, although these are not typically part of the "core" equations in the model. Nevertheless, statisticians make use of these matrices in the analysis of data and model evaluation.

In sum, given a well-developed causal theory that takes the form of a path or influence diagram, one can translate that diagram into a set of core equations that guide statistical analysis. This process is central to building correspondence between our theoretical models and our statistical models. The core equations can be represented in traditional algebraic terms or in matrix form, but either way, they guide the choice of the statistical model for purposes of data analysis.

Some Qualifications

For models with moderated causal effects, the rule for translating a path diagram into a set of core equations must be slightly modified. Moderated effects are traditionally (but not always) represented by including product terms in the equation. Consider a simple causal model of the following form:

[Diagram: a simple causal model in which X influences Y and Z moderates the X → Y relationship.]

The core equation in this case describes the outcome variable Y as a linear function of the two variables involved in the moderated relationship and the product of the two variables:

Y = α + β1 X + β2 Z + β3 XZ + δ.

The coefficient associated with the product term reflects the degree of moderation that is operating. Traditionally, the component parts of the product term are included in the core equation as separate predictors in addition to the product term because doing so protects against arbitrary scaling affecting conclusions. For elaboration, see Blanton and Jaccard (2006) and Yuan and Schuster (Chapter 18, Volume 1).

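For the special case of a binary moderator, the product-term equation implies that the simple slope of Y on X is β1 when Z = 0 and β1 + β3 when Z = 1. A simulation sketch (the coefficient values are invented for illustration) recovers the moderation coefficient as the difference between the two within-group slopes:

```python
import random
import statistics as st

def ols_slope(x, y):
    """OLS slope: cov(x, y) / var(x)."""
    mx, my = st.mean(x), st.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sum((a - mx) ** 2 for a in x)

random.seed(11)
n = 40000
xs = [random.gauss(0, 1) for _ in range(n)]
zs = [random.randint(0, 1) for _ in range(n)]   # binary moderator
# Y = alpha + beta1*X + beta2*Z + beta3*X*Z + disturbance,
# with beta1 = 0.4, beta2 = 0.8, beta3 = 0.6 (hypothetical values).
ys = [0.5 + 0.4 * x + 0.8 * z + 0.6 * x * z + random.gauss(0, 1)
      for x, z in zip(xs, zs)]

slope_z0 = ols_slope([x for x, z in zip(xs, zs) if z == 0],
                     [y for y, z in zip(ys, zs) if z == 0])
slope_z1 = ols_slope([x for x, z in zip(xs, zs) if z == 1],
                     [y for y, z in zip(ys, zs) if z == 1])
b3 = slope_z1 - slope_z0   # estimate of the moderation coefficient
```

Fitting the full equation with the product term and both component parts in a single multiple regression is the standard approach; the split-by-group version above is simply the most transparent way to see what β3 means.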

The above equation suggests that another way of diagramming a moderated effect is to include three separate direct effects, one from each of the component parts of the product and a third from the product term. Interaction analysis does not have to be pursued through product terms in a linear equation (Wilcox, 2005), but this is the most common approach.

Another qualification to specifying core equations is whether the relationship between variables is nonlinear rather than linear. For example, the relationship between an outcome variable and a causal variable might be characterized by a logarithmic function, an exponential function, a power function, a polynomial function, or a trigonometric function, among others. In such cases, the form of the core equation differs from those specified earlier. Nonlinear modeling can be complex, and consideration of such models is beyond the scope of this chapter. Readers are referred to Jaccard and Jacoby (2010) and Yuan and Schuster (Chapter 18, Volume 1).

Although there are other nuances that can further shape the nature of the equations one derives from a causal theory, the general point I stress here is (1) that it is possible to represent most causal models in the form of one or more equations and (2) that translating a model into equation form is a fundamental step for choosing an appropriate analytic method for testing the viability of a causal model.

Additional Considerations for Choosing Analytic Strategy

With a carefully specified set of equations that are theoretically guided, the next task of the researcher is to choose a data analytic method to estimate the parameters in those equations. The present section considers the role of facets that must be taken into account when making such decisions, including measurement/metric considerations, the use of full versus limited information estimation, distributional assumptions of estimation strategies, and the anticipated presence of outliers. Of course, there are a host of more nuanced considerations that researchers must account for, and these are developed in the different chapters in this volume. My focus is on categories of more general considerations.

Measurement and Metric Considerations

Strategies for estimating parameters in core equations, or the act of defining the parameters themselves, are influenced by the psychometric properties of the measures of the variables that comprise the theory. One important metric consideration is the level of measurement of the measures. The classic distinction between nominal, ordinal, interval, and ratio level measurement is particularly important, because different methods of analysis often are called for depending on the operative levels of measurement. Nominal measurement involves the assignment of numbers to levels of a variable that are categorical in nature and that have no inherent ordering on an underlying dimension. For the variable gender, a researcher might assign the values of 0 to females and 1 to males. The fact that one number is larger than the other has no substantive interest. The numbers function much like non-numeric labels.

Ordinal, interval, and ratio level measures can be defined relative to the measurement equations between quantitative latent variables and observed measures described earlier. In describing the properties of these levels of measurement, I will assume no measurement error is operating, to simplify the presentation. Assuming no measurement error, an interval level measure is one that is a linear function of the underlying latent variable—that is,

X = α + β LX,

where X is the observed measure and LX is the latent variable. The intercept and the linear coefficient can take on any value. For ratio level measures, the underlying latent variable, LX, should have a meaningful conceptual zero-point (such as zero weight when measuring how heavy an object is), and the intercept in the measurement equation must equal zero (as would be the case for a weight measure in pounds or a measure in grams). For ordinal measures, the relationship between X and LX is nonlinear but monotonic. For example, X might be a power function of LX, in which case the measure has ordinal properties.

As discussed in the other chapters in this volume, analytic strategies for parameter estimation vary as a function of these measurement levels. For example, when a measure of an exogenous variable in a causal model is nominal, it is common to represent that variable in a core equation using dummy variables (Chapter 18). If one wants to compare two groups on their average heights, then a measure of height that has ordinal properties is potentially problematic and not well-suited to comparing means on the underlying latent construct of height.

An important but underappreciated point when taking measurement levels into account is the fact that the distinction between the different measurement levels is best viewed as approximate rather than all-or-none. Metric properties are not inherent in scales but, rather, are inherent in data and, hence, are influenced by all of the facets of data collection. The extent to which a measure has interval properties not only depends on the scale used to make observations but also possibly on the particular set of individuals on which the observations are made, the time at which the data are collected, the setting in which the data are collected, and so on. Consider the following simplistic yet pedagogically useful example. The height of five individuals is measured on two different metrics, inches and a rank order of height:

Individual    Height in Inches    Rank Order of Height
A             72"                 5
B             71"                 4
C             70"                 3
D             69"                 2
E             67"                 1

As is well known, the measures taken in inches have interval level properties. For example, a difference of 1 between any two scores corresponds to the same physical difference on the underlying dimension of height. The actual height difference between individuals A and B corresponds to the same true underlying height difference between individuals C and D, and the metric reflects this (i.e., 72 − 71 = 1 and 70 − 69 = 1). Similarly, the difference between D and E is 69 − 67 = 2, and the difference between A and C is 2. These differences also reflect the same amount on the underlying dimension of height. Note, however, that these properties do not hold for the rank order measure. The difference in scores between individuals A and B is 1 (i.e., 5 − 4), and the difference in scores for individuals D and E is also 1 (i.e., 2 − 1). These identical differences correspond to differing degrees of height disparities on the underlying dimension of height (i.e., the true difference between individuals D and E is larger than the true difference between individuals A and B, as is evident for the measure using inches). For these individuals, the rank order measures have ordinal properties but not interval properties.

Now consider five different individuals with the following scores:

Individual    Height in Inches    Rank Order of Height
A             72"                 5
B             71"                 4
C             70"                 3
D             69"                 2
E             68"                 1

Note that for these five individuals, the rank order measure has interval level properties. The difference in scores between individuals A and B is 1, as is the difference between individuals D and E. These differences correspond to the exact same amount on the underlying physical dimension. In this case, what we think of as traditionally being an ordinal "scale" actually yields measures with interval level properties. Suppose that individual E was not 68" tall but instead was 67.9" tall. In this case, the rank order measure is not strictly interval. But it is close and probably can be treated as if it is interval level without adverse effects.

This example illustrates that the crucial issue is not whether a set of measures is interval or ordinal. Rather, the critical issue is the extent to which a set of measures approximates interval level characteristics. If the approximation is close, then the data often can be effectively analyzed using statistical methods that assume interval level properties. If the approximation is poor, an alternative analytic strategy is called for. In this sense, we often can apply statistical strategies that assume interval level measures to ordinal data without consequence, as long as the ordinal data reasonably approximate interval level properties.

Some researchers confuse the concept of measurement level with measurement precision. Precision refers to the number of scale points of a measure, such as a 5-point scale, a 10-point scale, or a 21-point scale. Measures can have interval level properties, for example, but be imprecise, or they can have ordinal properties yet be relatively precise. Precision of a measure may shape the way that one chooses to model data to evaluate a theory, but to the extent it does, it is not because of the levels of measurement of the measures. With coarse and imprecise measures of continuous constructs, analytic methods that assume high levels of precision can be problematic, and alternative analytic strategies are required (see Yuan & Schuster, Chapter 18, Volume 1).

Another measurement matter that shapes the method of data analysis is whether the outcome measure is discrete and zero-bounded (such as measures of counts, like the number of times an adolescent has smoked marijuana) and whether measures are censored. Censoring occurs when a value occurs outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 300 pounds. If a 350-pound individual is weighed using the scale, then we would only know that the individual's weight is at least 300 pounds. For details of how these metric qualities affect analytic strategy, see Long (1997) and Yuan and Schuster (Chapter 18, Volume 1).

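The two height tables above can be verified mechanically. This short sketch (mine, not the chapter's) computes the rank orders and the score differences that the example walks through:

```python
# Heights in inches for the two sets of five individuals.
first_set  = {"A": 72, "B": 71, "C": 70, "D": 69, "E": 67}
second_set = {"A": 72, "B": 71, "C": 70, "D": 69, "E": 68}

def ranks(heights):
    """Rank order of height: the shortest person gets rank 1."""
    ordered = sorted(heights, key=heights.get)
    return {person: i + 1 for i, person in enumerate(ordered)}

r1, r2 = ranks(first_set), ranks(second_set)

# First set: identical rank differences mask unequal height differences,
# so the ranks are ordinal but not interval.
assert r1["A"] - r1["B"] == r1["D"] - r1["E"] == 1
assert first_set["A"] - first_set["B"] == 1   # one inch
assert first_set["D"] - first_set["E"] == 2   # two inches

# Second set: each rank difference corresponds to exactly one inch,
# so here the rank "scale" happens to yield interval-level data.
assert all(second_set[p] - 67 == r2[p] for p in r2)
```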

Full Information versus Limited Information Estimation

Given a set of equations derived from a causal theory, one approach to estimating parameters in those equations is to use a full information estimation approach. In full information estimation, the coefficients in all equations in the model are estimated simultaneously. This can be contrasted with a limited information estimation approach, where the coefficients in the equations are estimated one equation at a time or, alternatively, by focusing on only a portion of the larger model rather than the entire model. Full information estimation is common in structural equation modeling (SEM), but there also are limited information variants of SEM (see Bollen, 2001). Full information estimation has the advantage that parameter estimates typically (but not always) are more efficient (in a strict statistical sense of the term) than limited information estimators as long as the model being tested is correctly specified. Limited information estimation has the advantage that it often is less susceptible to adverse effects of specification error, because the effects of specification error are limited to the particular portion of the model where the error occurs. It also allows one to use the strongest methods of analysis available, depending on the properties of measures and variables in different portions of the model, rather than applying the same homogeneous algorithm throughout the model. A decision point for analysts when choosing a statistical method to estimate parameters defined by a theoretical model is whether to adopt a full information or limited information estimation strategy.

Statistical Assumptions

Another issue that analysts must consider when moving from equations to estimation is the distributional assumptions of the estimation method and whether they are appropriate for the task at hand. Three assumptions are typical: (1) normality of scores in the population, (2) homogeneity of variances in the population, and (3) independence of replicates. The ways these assumptions are instantiated vary by analytic method, and other assumptions also can come into play (Maxwell & Delaney, 2004). I focus on the above assumptions primarily to make general points about approaches to distributional assumptions. I will make reference to the robustness of a test to violations of underlying assumptions. A statistical test is said to be robust to violations of assumptions if (1) the nominal Type I error rate (alpha level) set by the investigator a priori (usually 0.05) is maintained in the face of assumption violations and (2) the statistical power of the test is relatively unaffected by assumption violations. For a more technical and nuanced discussion of robustness, see Wilcox (2005).

A common perception of many researchers is that traditional F and t tests in analysis of variance (ANOVA) and regression are quite robust to violations of normality and homogeneity of variance assumptions. Because of this, these methods are often applied to data where the population assumptions are tenuous. Studies have shown that ANOVA and regression are not necessarily robust to assumption violations of normality and variance heterogeneity (see Keselman et al., 1998; Maxwell & Delaney, 2004; Wilcox, 2005). One strategy for dealing with assumption violations is to perform a preliminary test of the viability of the assumption in question and, if the test suggests a problem, perform a metric transformation or use a robust analytic alternative. This two-step strategy is controversial for several reasons. First, many preliminary tests lack power without large sample sizes and yield nonsignificant results for testing an assumption violation even when the violation is problematic (Wilcox, Charlin, & Thompson, 1986; Wilcox, 2003). Second, the crucial issue is not whether the null hypothesis of normality or variance homogeneity can be rejected but, rather, estimating the degree to which the assumption is violated and making a decision as to whether the degree of violation is consequential. This requires documenting the magnitude of the assumption violation in the sample data and then taking sampling error into account when making decisions. For example, we might find that a variance ratio comparing the variances of two groups is 4.0, with a margin of error of plus or minus 3.0. The margin of error suggests that the variance ratio could be as large as 7.0, which could be problematic. Unfortunately, it is rare for researchers to take margins of error into account when evaluating preliminary tests. Third, many tests of non-normality are based on asymptotic theory and only perform adequately with large sample sizes (Shapiro & Wilk, 1965). However, with large N, such tests tend to detect minor departures from normality that may be of little consequence. In addition, normality tests can be differentially sensitive to different types of non-normality. Some tests are sensitive mostly to skew, whereas others are sensitive mostly to kurtosis. Fourth, the preliminary tests often make assumptions in their own right and may perform poorly when their assumptions are violated. For example, many tests of variance homogeneity assume the population data are normally distributed (Carroll & Schneider, 1985; Keyes & Levy, 1997; Parra-Frutos, 2009). If the population data are non-normal, then the preliminary test of variance homogeneity may be invalid. Fifth, using preliminary tests as a screen can change the sampling distribution of F tests and t tests in unpredictable ways. Although it seems reasonable, the strategy of conducting preliminary tests of model assumptions faces numerous difficulties.

Transformation strategies for dealing with assumption violations have also been criticized. For example, Budescu and Appelbaum (1981) found that transformations to address variance heterogeneity can create more problems than they solve in inferential tests because they can adversely impact normality (see also Blaylock et al., 1980; Doksum & Wong, 1983; Milligan, 1987; Wilcox, 1996, 1998). Transformed variables often are difficult to interpret (e.g., the mean log of annual income is not easily interpreted). In models with multiple predictors, transformations of the dependent variable can create specification error that undermines covariate control because it alters the relationships between the outcome variable and all predictors in the equation. Years ago, before high-speed computers were widespread, analysts had little choice but to use transformations to make measures conform to the assumptions of the limited number of parametric statistical strategies available. Such practices are rarely needed today given the array of modern methods of analysis that are available.

A growing number of statisticians recommend that analysts simply abandon the more traditional tests that make strong population assumptions unless they are confident in assumption viability based on theory or extant research. Rather, analysts should routinely use modern-day robust methods of analysis or, at the very least, routinely supplement traditional methods with modern robust methods (Keselman et al., 2008; Wilcox, 2005). These scientists recognize that cases may occur where defaulting to robust analytic strategies will result in some loss of statistical power and inaccurate probability coverage of confidence intervals (CIs). However, the argument is that in the long run, the use of robust methods will result in better Type I error protection, increased power to detect effects, and CIs that more accurately reflect the desired probability coverage (Wilcox, 1998). Of course, it is always important to explore the shapes of distributions and dispersions of data. However, the recommendation is to view traditional tests of model assumptions and remedial strategies based on transformations with caution, deferring instead to the use of more modern robust analytic methods.

Earlier, I discussed causal theories that are multilevel in nature, such as theories of how characteristics of organizations as well as characteristics of individuals affect the behavior of individuals within an organization. Research that tests multilevel models often strategically samples organizations (called Level 2 units) and individuals nested within those organizations (called Level 1 units). In such cases, the statistical assumption of independent residuals/errors often is untenable because of the impact that individuals within an organization have on one another, either directly or indirectly. In such cases, specialized statistical methods must be used to deal with the dependencies (see Yuan & Schuster, Chapter 18, Volume 1).

In sum, the choice of an analytic method to use with data is impacted by the equations used to represent a theory, the psychometric properties of the measures, whether one seeks full information estimation or limited information estimation, and the population distributional assumptions that the statistical tools available to the researcher make relative to the actual population distributions. A growing number of scientists suggest adopting robust methods in favor of traditional methods because of the complexities of two-step strategies that rely on preliminary tests of assumptions.

Outliers

Outliers are unusually small or large scores that distort basic trends in the data. For example, for the scores 2, 3, 4, 5, 6, 7, 8, 9, 10, and 50, the last score is an outlier that distorts the mean and makes the use of the mean suspect as a way to characterize the central tendency of the data. Simple methods for outlier detection compare the results of an analysis when the case is included versus the results of an analysis when the case is deleted.

jaccard 99

results of an analysis when the case is deleted. Such theory must be in agreement with known data and
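The include-versus-delete logic can be made concrete in a few lines of Python (an illustrative sketch of my own, not code from the chapter). Using the ten scores above, it contrasts the ordinary mean with and without the extreme case, and also computes two outlier-resistant summaries, the median and a 20% trimmed mean, that require no deletion decision at all:

```python
import statistics

scores = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

# Mean with the suspect case included versus deleted.
mean_included = statistics.mean(scores)       # pulled up to 10.4 by the 50
mean_deleted = statistics.mean(scores[:-1])   # 6.0 for the remaining nine scores

# Resistant summaries characterize the bulk of the data directly.
median_all = statistics.median(scores)        # 6.5

# 20% trimmed mean: discard the two smallest and two largest scores.
trimmed_mean = statistics.mean(sorted(scores)[2:-2])  # 6.5

print(mean_included, mean_deleted, median_all, trimmed_mean)
```

The single extreme score moves the mean from 6.0 to 10.4, while the median and trimmed mean stay near the bulk of the data; this is the intuition behind the outlier-resistant estimation methods discussed in this section.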

Such approaches, however, can be nondiagnostic when multiple outliers are present. For example, if there are two individuals in an analysis who distort a mean upward, deleting only one of them may not reveal an "outlier" effect as long as the second outlier is still in the data. Only when both outliers are removed is their distorting character revealed. Outlier identification is a complex enterprise, with some of the most sophisticated work being pursued in the literature on robust statistics (see Wilcox, 2003, 2005, for elaboration).

Wilcox (1998, 2006) objects to applying traditional inferential methods to data that have eliminated outliers based on simple outlier detection methods. He argues that doing so invalidates the statistical theory on which the inferential tests are based because of dependencies that outlier elimination creates. Others recommend conducting analyses with and without outliers to determine if conclusions change. If conclusions do change, then one moves forward with any conclusions on a tentative basis. Probably the most effective strategy for dealing with outliers is to focus on parameter estimation methods that are resistant to outliers. For an introduction to these methods, see Wilcox (2005).

Model Selection

Model selection refers to the process of choosing what one believes is the "best" model for describing a phenomenon from a set of candidate models. The set of plausible models might consist of many models or it might consist of just one or two models. Criteria for model selection can be considered in general terms using a mindset of specifying criteria to evaluate the overall quality of a theory in general; or they can be discussed in specific, quantitative terms when choosing between competing models within an experiment. I consider both perspectives.

General Criteria for Evaluating Theories

Consensual validation is one basis by which theories are evaluated. This refers to the degree of consensus among the scientific community about the validity of the theory. If a theory enjoys widespread acceptance, then it is seen as being a "good" theory. Shaw and Costanzo (1982) have argued that three criteria are crucial for the acceptance of a theory. First, the theory must be logically consistent—that is, the theoretical statements within the conceptual system must not be contradictory. Second, the theory must be in agreement with known data and facts. Third, the theory must be testable—that is, a theory must ultimately be subject to empirical evaluation.

In addition to these criteria, Shaw and Costanzo (1982) have specified six additional features of a theory that are desirable but not necessarily critical. First, a theory should be stated in terms that can be understood and communicated to most other scientists. Second, the theory should be parsimonious in that it adequately explains a phenomenon but with a minimum of concepts and principles. All other things being equal, preference is given to theories that make fewer assumptions. Third, although we recognize that theories are occasionally so novel that they upset the theoretical apple cart, a theory should be consistent with other accepted theories that have achieved consensus among the scientific community—that is, it should be able to be integrated into existing bodies of theory. A fourth desideratum is scope. Other things being equal, the greater the range of the theory, the better it is thought to be. That said, there are times when narrow range theories tend to hold up better over time than broad range theories. Creativity or novelty is a fifth criterion sometimes suggested for evaluating a theory. A theory that explains the obvious is generally not as highly valued by the scientific community as one that provides a novel insight into an interesting phenomenon. Finally, many scientists suggest that a good theory is one that generates research activity.

Brinberg and McGrath (1985) have noted that the various theory desiderata sometimes conflict with each other. For example, parsimonious theories tend to be more limited in scope. As such, theorists often make trade-offs as they construct theories to maximize what is valued by the scientific community.

Choosing Between Models in a Given Study

As one pursues the theory construction process, one may develop competing theories that either make opposite predictions or that account for the same phenomena but using different explanations and assumptions. Faced with such scenarios, we design research to help us choose between the competing theories. Many scientists consider research that chooses between two or more logical and plausible theories to be inherently more interesting than studies that yield results regarding a single theory (Platt, 1964).

In some cases, competing models make qualitatively opposite predictions about how data should pattern themselves in a given study. In these cases, deciding which theory better accounts for the data is reasonably straightforward.

As an example, suppose we describe the personal qualities of a political candidate to a person who has not heard of him or her by providing the person with three pieces of information about the candidate. Suppose that the three pieces of information all are quite positive (e.g., the candidate is said to be honest, smart, and empathic). For purposes of developing this example, we characterize how positive each piece of information is considered to be using a metric that ranges from 0 to 10, with higher numbers reflecting higher degrees of positivity. Suppose we want to predict how favorable a person will feel toward the candidate based on the three pieces of information. One plausible model states that the overall degree of favorability is a function of the sum of the positivity of each individual piece of information—that is, that people "tally up" the affective implications of each piece of information when forming their overall impression. Another model, by contrast, specifies a different function—namely, an averaging function. In this case, the overall feeling of favorability is thought to be a function of the average positivity of the information presented.

What are the implications of specifying the function as being summative versus averaging in form? It turns out, they are considerable. Let's explore the summation model first. Suppose the positivity values of the three pieces of information are 8, 8, and 8, respectively. The overall feeling of favorability toward the candidate will be a scaled function of 8 + 8 + 8 = 24. Now suppose we describe a second candidate to this person using the same three pieces of information but we add a fourth descriptor to them (cunning) that has a positivity value of 4. According to the summation model, the overall feeling of favorability toward this new candidate will be a scaled function of 8 + 8 + 8 + 4 = 28, and the person will prefer the second candidate to the first candidate. Psychologically, it is as if the second candidate has all the same qualities as the first candidate, and then "as a bonus," you get a fourth positive attribute as well. Hence, the person prefers the second candidate to the first candidate.

Now consider instead the averaging function. The overall feeling toward the first candidate is predicted to be (8 + 8 + 8)/3 = 8.0, and the overall feeling toward the second candidate is said to be (8 + 8 + 8 + 4)/4 = 7.0. In the averaging model, exactly the reverse prediction is made in terms of candidate preference—namely, the person now will prefer the first candidate to the second candidate. Psychologically, the first candidate has nothing but very positive qualities, whereas the second candidate has very positive qualities but also some qualities that are only somewhat positive. The person prefers the first candidate, who has nothing but very positive qualities, to the second candidate, who has very positive qualities but also moderately positive qualities.

In the above example, the summation and averaging models make opposite predictions about candidate preference, and it is straightforward to choose between the theories based on an empirical study that asks people which of the two candidates they prefer. Of course, if the study found that neither candidate tended to be preferred, then this would question both models.

Although studies like the above are compelling, it is common to conduct studies where competing theories do not make qualitatively opposite predictions, but, rather, they make predictions about how data should pattern themselves that allows researchers to choose between them. As an example, in the area of intelligence testing, theorists agree that there are different kinds of intelligence and cognitive abilities, such as math skills, vocabulary breadth, spatial skills, motor skills, and memory. Studies suggest that measures of these constructs are moderately correlated with one another. Some theorists believe that the correlations among them result from the common influence of general intelligence, sometimes called g. We refer to this as a "single factor model" because it posits that the correlations among the measures can be accounted for by a single underlying factor. Other theorists believe that g does not exist and that the correlations among the different abilities are a result of a more complex constellation of determinants consisting of multiple factors.

Suppose in a population of individuals the correlations among the five variables are as presented in Table 5.1.

Table 5.1. Correlations for Intelligence Example

      Y1    Y2    Y3    Y4    Y5
Y1   1.00  0.72  0.63  0.54  0.45
Y2   0.72  1.00  0.56  0.48  0.40
Y3   0.63  0.56  1.00  0.42  0.35
Y4   0.54  0.48  0.42  1.00  0.30
Y5   0.45  0.40  0.35  0.30  1.00

It can be shown using somewhat involved, but tedious, algebra that the one-factor model predicts that the correlations should pattern themselves in a certain way—that is, they should follow certain mathematical regularities. For example, if we choose any two columns and ignore entries where a 1.00 occurs, then the ratios of the respective row entries for the two columns should be equal in value to the comparable ratio for any row in the matrix. For example,

For columns 1 and 2:
.63/.56 = .54/.48 = .45/.40 = 1.125

and

For columns 2 and 5:
.72/.45 = .56/.35 = .48/.30 = 1.600.
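The same regularity can be checked mechanically for every pair of columns. The Python sketch below is illustrative only (the helper name and tolerance are my own choices), but it reproduces the two ratios worked out above and shows that all ten column pairs of the Table 5.1 matrix behave this way:

```python
from itertools import combinations

# Correlation matrix from Table 5.1 (diagonal entries are 1.00).
R = [
    [1.00, 0.72, 0.63, 0.54, 0.45],
    [0.72, 1.00, 0.56, 0.48, 0.40],
    [0.63, 0.56, 1.00, 0.42, 0.35],
    [0.54, 0.48, 0.42, 1.00, 0.30],
    [0.45, 0.40, 0.35, 0.30, 1.00],
]

def column_ratio(R, a, b, tol=1e-9):
    """Ratios R[k][a] / R[k][b] over rows k that avoid a diagonal entry;
    returns (whether the ratio is constant, the first ratio)."""
    ratios = [R[k][a] / R[k][b] for k in range(len(R)) if k not in (a, b)]
    return max(ratios) - min(ratios) < tol, ratios[0]

for a, b in combinations(range(5), 2):
    constant, ratio = column_ratio(R, a, b)
    print(f"columns {a + 1} and {b + 1}: constant={constant}, ratio={ratio:.3f}")
```

The pattern arises because a one-factor model implies that each correlation is a product of two loadings, r_ij = λ_i λ_j, so the ratio of two columns' entries, λ_a/λ_b, does not depend on the row; with more underlying factors this proportionality generally breaks down.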

an outcome than the other model. For example,

By contrast, models that posit more complex pat- when explaining adolescent drug use, one might

terns of underlying determinants (e.g., a two-factor compare a model that assumes drug use is a sim-

model, a three-factor model) predict a different set ple additive function of gender and grade in school

of regularities, which we do not elaborate here in versus a model that assumes drug use is an additive

the interest of space (see McDonald, 1985). The function of gender, grade, plus the interaction effect

model selection task is to choose which of the com- between gender and grade. If the models account

peting models (e.g., a one-factor model, a two-factor for about the same amount of variation in drug

model, etc.) is most consistent with the data. use, then preference for the ﬁrst model will take

Statisticians have developed quantitative indices precedence on grounds of parsimony and because

of ﬁt that represent the overall degree of correspon- taking into account the interaction effect does not

dence between the predicted pattern of data by a seem to matter. By contrast, if the three-parameter

model and the observed data pattern. When there model explains substantially more variation in drug

are competing models, a ﬁt index is derived for use than the two-parameter model, then the three-

each model and then the ﬁt indices are compared to parameter model is preferred.

identify the model that best accounts for the data. As scientists consider the complex issues sur-

The technical details of such comparisons can be rounding model selection, a paramount concern

quite complex, and these are elaborated through- is not to mistakenly embrace a model that is mis-

out the many chapters in this volume. The point speciﬁed. A misspeciﬁed model is a model that is

I am emphasizing here is that even if two models wrong because it (1) left out an important variable

do not make qualitatively opposite predictions, as from the model whose omission biases parame-

was the case for the example on summation versus ter estimates and leads to faulty conclusions (also

averaging, models often can be compared in terms called left-out-variable-error or LOVE problems);

of their relative consistency with the data, with one (2) assumed one type of function between variables

model ultimately being declared as being more con- (e.g., linear) when, in fact, a different type of func-

sistent with the data than another model based on tion was operative (e.g., a threshold function), with

the comparison of quantitatively deﬁned ﬁt indices. such misspeciﬁcation leading to faulty conclusions;

There is, however, an additional complication in and/or (3) incorrectly modeled the temporal dynam-

model selection. In practice, it would be surprising ics within a model in such a way that parameter

if sample data yielded a set of correlations that per- estimates and conclusions are non-trivially biased.

fectly followed the predicted regularities of, say, a Throughout this volume, chapters elaborate on

one-factor model, even if the one-factor model was issues of model misspeciﬁcation and ways to gain

operative in the population. This is because sam- perspectives on it. Cudeck (1989) has argued that

ple correlations are subject to sampling error and the crucial issue is not whether a model is misspeci-

will randomly deviate to a greater or lesser extent ﬁed because misspeciﬁcation is inevitable in so much

from the true population correlations. Even if this of the research we conduct. Rather, the more cen-

is the case, one expects that if the one factor model tral issue is whether the degree of misspeciﬁcation is

sufﬁciently large that it leads to conclusions that are future, A festschrift in honor of Karl Jöreskog. (pp. 119–138).

incorrect or misplaced. Lincoln, IL: Scientiﬁc Software.
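The explained-variance comparison described above (an additive gender-plus-grade model versus a model that adds their interaction) can be sketched with ordinary least squares. This is a hypothetical illustration of my own with simulated data, not an analysis from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data in the spirit of the adolescent drug use example:
# the outcome is simulated WITH a genuine gender-by-grade interaction.
gender = rng.integers(0, 2, n)   # 0/1
grade = rng.integers(7, 13, n)   # grades 7 through 12
use = 1.0 + 0.5 * gender + 0.3 * grade + 0.4 * gender * grade + rng.normal(0, 2, n)

def r_squared(predictors, y):
    """R^2 from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

r2_additive = r_squared([gender, grade], use)                     # two-parameter model
r2_interaction = r_squared([gender, grade, gender * grade], use)  # three-parameter model

print(f"additive model R^2:    {r2_additive:.3f}")
print(f"interaction model R^2: {r2_interaction:.3f}")
```

Because the additive model is nested in the interaction model, R² can never decrease when the product term is added; the model selection question is whether the increase is large enough to outweigh parsimony.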

In sum, model selection can involve two processes: (1) evaluating a model by and of itself to determine whether it meets criteria the scientific community judges to be important (such as being logical, testable, consistent with known facts/data, etc.), and (2) quantitatively comparing a model with other competing models within a given study to identify which model best accounts for the observed data or explains the most variation in an outcome. Three strategies are typically used to choose between models. First, one designs a study where the competing models make qualitatively opposite predictions. Second, one deduces from the model how data should pattern themselves and then derives a quantitative (fit) index representing the correspondence between the predicted patterns and the observed patterns. The model that "fits" the data best is the preferred model, everything else being equal. Third, one selects a model that can explain the most meaningful variation in a targeted outcome variable. There are many nuances surrounding the above strategies, and these are elaborated in other chapters in this volume.

Concluding Comments

This chapter has described general issues involved in (1) building causal theories, (2) translating those theories into a set of mathematical representations, (3) choosing an analytic strategy to estimate parameters and sampling error vis-à-vis those equations, and (4) choosing the "best" model from a set of competing models. There is a vast array of technical issues to consider as one approaches these four tasks, and these technicalities are elaborated throughout the chapters in this volume. My purpose here was to provide a "big picture" view of the broader enterprise so that we can "see the forest through the trees" as we approach the task of building cumulative bodies of knowledge and solving social problems.

References

Blanton, H., & Jaccard, J. (2006). Tests of multiplicative models in psychology: A case study using the unified theory of implicit attitudes, stereotypes, self esteem and self concept. Psychological Review, 122, 155–169.
Blaylock, J., Salathe, L., & Green, R. (1980). A note on the Box-Cox transformation under heteroskedasticity. Western Journal of Agricultural Economics, 45, 129–135.
Bollen, K. A. (2001). Two-stage least squares and latent variable models: Simultaneous estimation and robustness to misspecifications. In R. Cudeck, S. Du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present and future, a festschrift in honor of Karl Jöreskog (pp. 119–138). Lincoln, IL: Scientific Software.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation approach. Hoboken, NJ: Wiley.
Brinberg, D., & McGrath, J. (1985). Validity and the research process. Newbury Park, CA: Sage.
Brodbeck, M. (1968). Readings in the philosophy of the social sciences. New York: Macmillan.
Brown, T. (2006). Confirmatory factor analysis for applied research. New York: Guilford.
Budescu, D., & Appelbaum, M. (1981). Variance stabilizing transformations and the power of the F test. Journal of Educational and Behavioral Statistics, 6, 55–74.
Bunge, M. (1961). Causality, chance, and law. American Scientist, 69, 432–488.
Carnap, R. (1971). A basic system of inductive logic, Part 1. In R. Carnap & R. C. Jeffrey (Eds.), Studies in inductive logic and probability, Vol. I (pp. 18–39). Berkeley, CA: University of California Press.
Carroll, R. J., & Schneider, H. (1985). A note on Levene's test for equality of variances. Statistics & Probability Letters, 3, 191–194.
Cartwright, N. (2007). Hunting causes and using them. New York: Cambridge University Press.
Collins, L. (2006). Analysis of longitudinal data: The integration of theoretical model, temporal design, and statistical model. Annual Review of Psychology, 57, 505–528.
Coombs, C., Dawes, R., & Tversky, A. (1970). Mathematical psychology: An elementary introduction. Englewood Cliffs, NJ: Prentice Hall.
Cudeck, R. (1989). Analysis of correlation matrices using covariance structure models. Psychological Bulletin, 105, 317–327.
Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417.
Dubin, R. (1976). Theory building in applied areas. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 17–39). Chicago: Rand McNally.
Erceg-Hurn, D. M., Wilcox, R. R., & Keselman, H. H. (2012). Robust statistical estimation. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 388–406). New York: Oxford University Press.
Finkel, S. (2008). Linear panel analysis. In S. Menard (Ed.), Handbook of longitudinal research (pp. 475–504). New York: Elsevier Press.
Frank, P. (1961). Modern science and its philosophy. New York: Collier Books.
Green, P., & Tull, D. S. (1975). Research for marketing decisions. Englewood Cliffs, NJ: Prentice-Hall.
Jaccard, J., & Jacoby, J. (2010). Theory construction and model building skills: A practical guide for social scientists. New York: Guilford.
Kaplan, A. (1964). The conduct of inquiry. San Francisco, CA: Chandler.
Keselman, H., Algina, J., Lix, L., Wilcox, R., & Deering, K. (2008). A generally robust approach for testing hypotheses and setting confidence intervals for effect sizes. Psychological Methods, 13, 110–129.
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R., Donahue, B., et al. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350–386.


Keyes, T., & Levy, M. S. (1997). Analysis of Levene's test under design imbalance. Journal of Educational and Behavioral Statistics, 22, 227–236.
Lewis, D. (2000). Causation as influence. Journal of Philosophy, 97, 182–197.
Long, S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective. Mahwah, NJ: Erlbaum.
McDonald, R. (1985). Factor analysis and related methods. Hillsdale, NJ: Erlbaum.
McDonald, R. P. (2012). Modern test theory. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 118–143). New York: Oxford University Press.
Milligan, G. (1987). The use of the arc-sine transformation in the analysis of variance. Educational and Psychological Measurement, 47, 563–573.
Morgan, S., & Winship, C. (2007). Counterfactuals and causal inference. New York: Cambridge University Press.
Pap, A. (1962). An introduction to the philosophy of science. Glencoe, IL: The Free Press of Glencoe (Macmillan).
Parra-Frutos, I. (2009). The behaviour of the modified Levene's test when data are not normally distributed. Computational Statistics, 24, 671–693.
Pearl, J. (1999). Causality: Models, reasoning, and inference. New York: Cambridge University Press.
Platt, J. (1964). Strong inference. Science, 146, 347–353.
Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6, 34–58.
Russell, B. (1931). The scientific outlook. London: George Allen & Unwin.
Schumacker, R. E., & Lomax, R. G. (2004). A beginner's guide to structural equation modeling. Mahwah, NJ: Erlbaum.
Shadish, W. R., Cook, T., & Campbell, D. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52, 591–611.
Shaw, M., & Costanzo, P. (1982). Theories of social psychology. New York: McGraw-Hill.
Sheth, J. N. (1967). A review of buyer behavior. Management Science, 13, B718–B758.
Simon, H. A., & Newell, A. (1956). Models: Their uses and limitations. In L. D. White (Ed.), The state of the social sciences (pp. 61–83). Chicago, IL: University of Chicago Press.
Sowa, J. (2000). Processes and causality. From www.jfsowa.com/ontology/causal.htm. Accessed August 15, 2011.
Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.
Wilcox, R. (2006). Graphical methods for assessing effect size: Some alternatives to Cohen's d. Journal of Experimental Education, 74, 353–367.
Wilcox, R. R. (1996). Statistics for the social sciences. New York: Academic Press.
Wilcox, R. R. (1998). How many discoveries have been lost by ignoring modern statistical methods. American Psychologist, 53, 300–314.
Wilcox, R. R. (2003). Applying contemporary statistical techniques. San Diego, CA: Academic Press.
Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing (2nd ed.). San Diego, CA: Academic Press.
Wilcox, R. R., Charlin, V. L., & Thompson, K. (1986). New Monte Carlo results on the robustness of the ANOVA F, W, and F* statistics. Communications in Statistics: Simulation and Computation, 15, 933–944.
Yuan, K.-H., & Schuster, C. (2012). Overview of statistical estimation methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 360–386). New York: Oxford University Press.

CHAPTER 6

Teaching Quantitative Psychology

Lisa L. Harlow

Abstract

This chapter strives to enliven quantitative psychology teaching and encourage statistical literacy. Summaries of quantitative training, at the undergraduate and graduate levels, offer guidelines to improve instruction and call for greater emphasis on measurement, research, and quantitative methods. Strategies for effectively teaching quantitative psychology are suggested, including active, hands-on learning for engaging students, e-learning and Web-based instruction, mentoring and role models, and encouraging conceptual understanding. Quantitative students are encouraged to focus on the nature of research questions, similarities and differences in statistical methods, and interpreting findings with statistical tests, effect sizes, and confidence intervals. Future directions are offered regarding model building, quantitative learning beyond the classroom through workshops, membership in quantitative societies, and reading the quantitative literature. The reach of quantitative training should be widened to more readily include those from disadvantaged, early career, and under-represented groups to further strengthen the field and enlighten greater numbers about the wonders of quantitative psychology.

Key Words: quantitative training, statistical literacy, engaging students, strategies for teaching, active learning, mentoring, underlying concepts, widening quantitative reach

Readers of this volume are part of a rare and unique subset of individuals who resonate with the term quantitative psychology. When the topic of psychology is discussed, whether by students, faculty, or the general public, the qualifier "quantitative" does not always enter the conversation. Students in psychology often delay taking required statistics and research courses, short-changing their ability to understand and develop scientific skills necessary to open up their career options and further the field of psychology (Rajecki, Appleby, Williams, Johnson, & Jeschke, 2005). In more than 25 years of teaching quantitative psychology, I have learned not to be too surprised by cringing and flinching, if only subtle, in students enrolled in my courses. Students often approach quantitative courses with little interest or confidence, coupled with anxiety and misperceptions about learning statistical material (Ashcraft & Kirk, 2001; DeVaney, 2010; Harlow, Burkholder, & Morrow, 2006; Onwuegbuzie, 2000; Onwuegbuzie & Wilson, 2003; Piotrowski, Bagui, & Hemasinha, 2002). What compounds the problem is that students with high anxiety tend to have poorer attitudes toward and lower performance in quantitative studies (Budé, Van De Wiel, Imbos, Candel, Broers, & Berger, 2007; Harlow, Burkholder, & Morrow, 2002; Mills, 2004; Rodarte-Luna & Sherry, 2008; Tremblay, Gardner, & Heipel, 2000). Fortunately, quantitative attitudes have been shown to predict quantitative performance just as strongly as


pre-course quantitative skill in a class with engaging activities to get students involved, suggesting that it is worthwhile to try to reduce quantitative anxiety by making quantitative learning more appealing (Harlow, Burkholder, & Morrow, 2002).

Thus, it is probably not extraordinary that I find it my greatest joy to try to engage and enlighten students with the remarkable perspective of quantitative thinking and analysis. I go so far as to argue that any student interested in psychology has the making of a quantitative scientist, if only at some latent, implicit level. What student has not posed a possible theory for why something is, or is not, a certain way? Doesn't everyone wake up each morning with one or more hypotheses about how the day will emerge and which variables or factors will help bring about the desired effect, whether small, moderate, or large? And, don't we all think about how much confidence we can place on our hypotheses or expectations, depending on specific mitigating or confounding circumstances? I believe that we all engage in these activities, however formal or informal, in which quantitative psychology can play a pivotal role. Quantitative psychology has the potential to empower and enlighten with the training, skills, reasoning, and capabilities to formalize these kinds of questions, helping us to describe and understand the essence of myriad data that come our way. With more advanced quantitative training, we could further develop and analyze intricate theoretical models that help explain and predict complex processes and behaviors, integrating information necessary to inform and improve interventions, policies, and the human condition (Harlow, 2010; Rodgers, 2010).

In this chapter, I highlight some major issues and topics that go into teaching quantitative psychology. First, I provide an overview of quantitative training to get an overarching picture of the field. Second, I suggest a number of issues that should be taken into account when teaching quantitative courses and research that investigates ways to address these concerns. Third, I offer several themes that run through many statistical methods to help tie together the numerous, seemingly isolated and obscure quantitative procedures. Finally, I summarize the current field of quantitative teaching and give recommendations for other options to supplement, enrich, and expand statistical learning.

Overview of Quantitative Training

Aiken et al. (1990, 2008) surveyed more than 200 graduate training programs regarding statistics, measurement, and methodology. Only 15% of the graduate programs had a quantitative area, although about 50% offered a minor in quantitative studies. Almost all of the graduate programs required students to take at least one or two quantitative courses during their first year, although more than 25% sent students to other departments to get this training. Required quantitative courses usually included basic analysis of variance (ANOVA) and some regression, with half to two-thirds of programs including limited training in measurement or research. Computer training in SPSS and SAS is often part of the curriculum, with the possibility of EQS, AMOS, Mplus, LISREL, and other software exposure in more advanced courses (e.g., structural equation modeling) in about half of the programs. Over a span of almost two decades between surveys, quantitative training had not improved much; over time there was slightly more emphasis on measurement and deterioration in coverage of research design. Much improvement in advanced methods training is still needed in most departments to impart the expertise needed to effectively compete in quantitative studies and fill needed positions in the workforce.

It is becoming readily apparent that efforts are needed to increase the number of quantitatively trained individuals. A recent report from an American Psychological Association (APA) Task Force to Increase the Quantitative Pipeline (Aiken et al., 2007) reported that in recent history, whereas there were approximately two applicants for each non-quantitative doctoral level job (e.g., cognitive, developmental, social), there were almost 2.5 jobs for every quantitatively trained PhD student. Further, the focus of quantitative methodology is moving away from emphasis on learning a set of specific methods and procedures and instead placing greater priority on developing a broader vision of quantitative science through theory building, modeling underlying processes, and integrating information across meta-analytic studies (e.g., Harlow, 2010; Jaccard & Jacoby, 2009; McGrath, 2011; Rodgers, 2010; Rosenthal & DiMatteo, 2001). Consistent with this vision is a focus on encouraging greater statistical reasoning, thinking and literacy, rather than rote learning (Gal, 2003; Garfield & delMas, 2010; Ridgway, Nicholson, & McCusker, 2007). Researchers are realizing that quantitative literacy is a needed goal in undergraduate studies, beginning with the most basic, introductory statistics courses (Ben-Zvi & Garfield, 2004; Mulhern & Wylie, 2004, 2006; Rumsey, 2002; Watson, 2006). Resources are becoming available to help encourage

106 t e a c h i n g q u a n t i tat i v e p s yc h o l o g y

quantitative thinking (Garfield & Ben-Zvi, 2008; Saville, Zinn, Lawrence, Barron, & Andre, 2008; Stark & Krause, 2009), along with a realization of the challenges involved (Ben-Zvi & Garfield, 2004; Sen, 2004).

Garfield and colleagues from the University of Minnesota developed a program in Quantitative Methods in Education that is at the forefront of improving undergraduate statistical training. Much of their research focuses on training statistics teachers (Garfield & Everson, 2009), reforming the way introductory statistics is taught (Garfield, Hogg, Schau, & Whittinghill, 2002), and investigating how to teach and learn statistics (Garfield & Ben-Zvi, 2007). Other researchers (Friedrich, Buday, & Kerr, 2000) also help outline the field by surveying undergraduate programs across the country with respect to quantitative training. In the next section, I review research focusing on strategies for teaching quantitative psychology (e.g., Gelman & Nolan, 2002), including a set of guidelines developed by Garfield and colleagues (Franklin & Garfield, 2006).

Strategies for Teaching Quantitative Psychology

There is an abundance of resources to aid instructors of statistics courses, including best practice volumes (e.g., Dunn, Smith, & Beins, 2007; Hulme, 2007; Hulsizer & Woolf, 2009), compilations of research on teaching statistics (e.g., Bjornsdottir & Garfield, 2010; Zieffler, Garfield, Alt, Dupuis, Holleque, & Chang, 2008), learning objectives for introductory statistics and research (Tomcho, Rice, Foels, Folmsbee, Vladescu, Lissman, Matulewicz, & Bopp, 2009), and suggestions for teaching quantitative courses (Garfield & Everson, 2009; Ware & Johnson, 2000). In this section, I present several strategies that have been researched and recommended to improve statistical literacy.

Active Learning

Research has demonstrated that actively involving students significantly improves performance in quantitative courses (e.g., Helman & Horswill, 2002). Guidelines to improve statistical learning (e.g., encouraging statistical thinking, using examples with technology and real data, emphasizing concepts, promoting active learning) reiterate this approach of engaging students in the process (Everson, Zieffler, & Garfield, 2008; Franklin & Garfield, 2006). Humor is also a great ice-breaker, catching students' attention (e.g., Cobb, 1999, 2007; Zeedyk, 2006), helping to diffuse tension, and surreptitiously calling students back to quantitative learning.

Creating a lively environment is essential for capturing the interest of students. A number of researchers have emphasized the importance of hands-on and interactive learning (Dolinsky, 2001; Kolar & McBride, 2003; Wulff & Wulff, 2004). A simple strategy could be to invite students to write down what is clear and not clear at the end of each lecture, with the faculty clarifying unclear points at the beginning of the next lecture (Harlow, Burkholder, & Morrow, 2006). It takes little time, and students get fairly immediate feedback on how to clear up misunderstandings, whereas faculty get a clearer idea of what the students understood and what needs to be clarified. Moreover, involving students with creative examples (Chew, 2007; Schwartz & Martin, 2004) and analyzing data (e.g., Nie & Lau, 2010; Watkins, Scheaffer, & Cobb, 2004) bring about more in-depth learning than traditional lecture-based approaches. Demonstrating visual images and graphs of procedures also helps improve understanding (e.g., Peden, 2001).

If there is more time, then students can be arranged into small groups and be given a research scenario in which they need to consult among themselves to recommend statistical procedures to address the research question (Harlow, Burkholder, & Morrow, 2006). For a basic example, students could be asked how to assess whether men and women differed on hours of exercise per week (i.e., with a two-sample independent t-test) or whether the number of hours worked at an outside job was related to GPA (i.e., with a correlation). A number of researchers have recommended engaging students in these cooperative small groups (DaRos-Voseles, Collins, Onwuegbuzie, & Jiao, 2008; Onwuegbuzie, Collins, & Jiao, 2009; Peterson & Miller, 2004) or learning communities (e.g., Barron, Benedict, Saville, Serdikoff, & Zinn, 2007) where students work together to understand quantitative material and immerse themselves in the process. Krause, Stark, and Mandl (2009), on the other hand, have found that cooperative groups did not directly improve statistics performance, although students reported greater perceived efficacy when working with others. It may be that group learning is not effective for all students, with more advanced students possibly benefiting the least. For example, Harlow, Burkholder, and Morrow (2002) found that learning activities

harlow 107

that included working in groups and with peer mentors were viewed more favorably when students were more anxious and had lower confidence about quantitative learning. Future research could investigate whether ability or achievement level is a moderator for group learning and performance, to clarify who benefits from group learning.

Technology and Learning

Other research has revealed the value of technology in heightening quantitative learning using computer-assisted analysis (Bartz & Sabolik, 2001), Web-based tutorials (Bliwise, 2005), and specific online learning programs such as Estimating Statistics (EStat; Britt, Sellinger, & Stillerman, 2002), Simulation-Assisted Learning Statistics (SALS; Liu, Lin, & Kinshuk, 2010), the Utah Virtual Lab (Malloy & Jensen, 2001), Statistical Understanding Made Simple (SUMS; Swingler, Bishop, & Swingler, 2009), or Web Interface for Statistics Education (WISE; Berger, n.d., http://wise.cgu.edu/). In a meta-analysis of 45 studies, there was a small- to medium-sized Cohen's d (i.e., 0.33) performance benefit effect size attributed to learning statistics with computer-assisted instruction versus learning in a lecture-based control group that did not provide such input (Sosa, Berger, Saw, & Mary, 2010). Thus, students who had technology-enhanced instruction demonstrated one-third of a standard deviation higher performance than those students without such advantage; further, the improvement was even more salient when involving those who were more advanced (i.e., graduate students) and when more time was allotted for instruction.

Still, whereas research has demonstrated the benefits of e-learning approaches (e.g., Fillion, Limayem, Laferrière, & Mantha, 2008; Hanley, 2004; Sosa et al., 2010; Wender & Muehlboeck, 2003), more research is needed, as others remain unconvinced of the merits of adding technology to the classroom environment (e.g., Härdle, Klinke, & Ziegenhagen, 2007). Even the value of online discussions, and whether or how much faculty should facilitate or contribute, is not entirely clear (e.g., Mazzolini & Maddison, 2007). Instructors are encouraged to consider literature on cognitive learning and assessment to improve Web-based materials for students, particularly in quantitative learning (e.g., Romero, Berger, Healy, & Aberson, 2000).

Mentors and Role Models

Mentoring can help students get the extra input needed to understand quantitative concepts (e.g., Ferreira, 2001) and can help to supplement class lectures and faculty input (e.g., Katayama, 2001). Fortunately, graduate teaching assistants (TAs) are often provided for undergraduate- and graduate-level quantitative courses. When TAs are not funded, I have found it very effective to invite one or more top achievers from a previous semester to serve as volunteer peer mentors or TAs, offering independent study or teaching practicum credit. Students in the course benefit from having a peer of similar age demonstrating and facilitating expertise in quantitative methods. TAs or peer mentors gain efficacy and greater confidence, often choosing to become even more involved with other quantitative courses and research to continue building their skills (e.g., Harlow, Burkholder, & Morrow, 2002, 2006).

Mentoring can be particularly valuable for women and individuals from under-represented groups who have few role models in the quantitative field, for providing direction, support, and encouragement (e.g., Kosoko-Lasaki, Sonnino, & Voytko, 2006; Neal-Barnett, Mitchell, & Boeltar, 2002; Zirkel, 2002).

Conceptual Approach to Teaching

Perhaps the most effective idea for conveying complex quantitative material is to focus on the concepts rather than using a strictly mathematical approach (e.g., Atkinson, Catrambone, & Merrill, 2003). Chiou (2009) found that encouraging students to collaborate on conceptually mapping statistical material significantly improved performance compared to having students complete textbook exercises and calculations. In another study, Aberson, Berger, Healy, and Romero (2003) demonstrated that an interactive approach to hypothesis testing concepts was received more positively and improved performance over traditional laboratory exercises. Similarly, Meletiou-Mavrotheris and Lee (2002) found that helping students to understand concepts, improve statistical reasoning, and build intuitions about statistical ideas was more facilitating than using a traditional teaching approach.

Presenting and encouraging understanding of core concepts—particularly through hands-on engagement and research—can help foster more in-depth insight and greater involvement in inquiry-based future learning (Aulls & Shore, 2008; Dewey, 1997). Inviting students to seek out solutions to quantitative research problems promotes greater statistical awareness and literacy. In the next section, I present a set of conceptual themes that are common


to many statistical methods and that help provide a foundation for understanding introductory and advanced quantitative learning.

Themes that Run Through Quantitative Psychology

Quantitative understanding is increased when common ideas are revealed that occur in many statistical procedures. Harlow (2005) has emphasized a number of themes that run through multivariate methods and that can be extended to encompass univariate methods as well. Three basic themes are presented below to facilitate approaching, analyzing, and interpreting quantitative methods.

Considering the Research Question

First, it is helpful to encourage students to consider the kind of research question that needs to be addressed. Group difference questions can be analyzed with basic z-tests when information is known about the mean and the variability in the population. For example, we could investigate whether an exceptional students' class evinced an IQ that was different from the known average of 100, with a standard deviation of 15. When only a population mean is known or there are two groups involved, a t-test would be appropriate, requiring that a researcher estimate the population standard deviation(s) from the sample(s). For studies with two or more groups, an ANOVA would be useful for assessing whether there were differences among multiple groups. For example, a teacher may want to compare the level of interest in taking math courses for male and female students, which could be examined with a two-sample independent t-test. To investigate potential differences in math interest among students from three different college majors, an ANOVA would be helpful. To examine whether several groups differ on math interest, after accounting for the number of previous math classes taken, analysis of covariance (ANCOVA) could be used. Further, group differences across several groups could be assessed on several measures (e.g., math interest, math efficacy, and math anxiety) using multivariate analysis of variance (MANOVA). Similarly to the way that ANCOVA extended ANOVA by taking into account another predictor that correlated or covaried with the outcome, a multivariate analysis of covariance (MANCOVA) can extend a MANOVA when wanting to examine whether several groups differ on several outcomes (e.g., the three math attitudes suggested for the MANOVA) after taking into account one or more covariates such as number of previous math courses or GPA.

For those who are not specifically interested in mean differences across groups, a correlational question may be asked. Researchers interested in finding what is associated with achievement could conduct a simple correlation analysis, assessing whether math interest is related to college GPA. This bivariate procedure (between just two variables) could expand to canonical correlation (CC) that allows an examination of two sets of variables. For example, a researcher could examine whether several math attitude scores (e.g., math interest, math efficacy, and math anxiety) are related to several achievement outcomes (e.g., GPA, the number of conferences attended, and the number of memberships in honor societies). Correlation could also be extended to multiple regression (MR) to relate a linear combination of several continuous predictors (e.g., math ability, verbal ability, previous GPA, achievement motivation) to a single outcome such as current GPA. If the outcome were dichotomous (e.g., success or failure in a course or grade level), then the merit of similar predictors could be examined with logistic regression (LR) or discriminant function analysis (DFA), depending on whether there was interest in conveying the odds or the degree of correlation with the outcome, respectively. Multilevel modeling (MLM) would be useful to predict an outcome such as achievement, when there is reason to believe that participants are clustered in separate groups (e.g., classrooms, school districts). Thus, MLM allows, and even models, heterogeneity of variance across several groups in the data, contrary to more restrictive prediction methods (e.g., MR, LR, and DFA) that assume that data are drawn from a homogeneous group.

Other, more advanced, research questions could be addressed with more sophisticated methods. Factor analysis (FA) or principal components analysis (PCA) would be useful when faced with a large set of measures with a goal of identifying a few key dimensions that explain the underlying structure or associations (e.g., quantitative and verbal intelligence dimensions in an IQ test of multiple subtests). Factor analysis would allow the delineation of unique or error variance in measures before forming factors with the variance that was in common among the measures. In contrast, PCA analyzes all of the variance in the variables when forming components. In practice, there may not be much actual difference in results across these two


methods if the loadings are high for relevant variables on their respective factors, even with the initial difference in whether unique variance is included in the analysis (Velicer & Jackson, 1990). Structural equation modeling (SEM) could be used to examine whether latent factors, each with several measures, could be theoretically modeled to explain hypothesized underlying processes. For example, using Bandura's (1997) social learning theory, we could test whether different factors of intelligence (i.e., quantitative and verbal) would predict degree of self-efficacy in learning, which in turn could predict an achievement factor (measured by homework, quiz, and exam scores). In testing the SEM, other covariates (e.g., socioeconomic status, previous GPA) could be added as possible predictors of the mediator (i.e., self-efficacy) to see if they are important, or to rule them out as predictors of achievement. Further, multiple group analyses could examine whether such a predictive SEM of achievement held in different groups (e.g., men versus women, different ethnicities). If findings differed across groups, then it would indicate that the grouping variable moderated the prediction. Longitudinal modeling (e.g., latent growth curve modeling) could examine whether achievement changed over time in mean level and in the rate of change, and whether predictors (e.g., previous GPA, IQ, gender) could predict the level (called the intercept) and rate of change (called the slope) across time. Other methods discussed in this volume could also be examined to further fine-tune the nature of the question being assessed.

Therefore, the main point of the first theme (type of research question asked) is that it is the nature of the research and the questions asked of the data that drive the kind of quantitative method that is selected. There is a whole world of methods to consider, and the choice is not so daunting when realizing that certain methods readily lend themselves to different kinds of research questions (e.g., group difference, correlation or prediction, underlying structure, longitudinal, etc.). In the next section, similarities and differences among methods is the second theme that is discussed. Here, it will become apparent that although there are distinctions among methods that make them more likely to be applied to specific research questions, many quantitative procedures share similarities in how they reveal the essence of the data. For example, researchers who find a significant difference in well-being between groups who exercise regularly and those who do not will also find that there is a significant relationship between well-being and exercise. Thus, both group difference and correlational methods can examine how much variance is shared between independent and dependent variables. In this regard, Cohen, Cohen, West, and Aiken (2003) have described how categorical grouping variables can be coded for use in correlational and regression procedures.

Noting Similarities and Differences in Quantitative Methods

Examining research questions and seeing specific methods that seem to be relevant to such questions leads into a second theme, which is to notice the similarities and differences among various quantitative procedures. Quilici and Mayer (2002) helped students notice the underlying similarities in different statistical word problems. This helped students to know how to analyze specific problems by noting their similarities to other statistical problems that were addressed by certain methods. Similarly, Derryberry, Schou, and Conover (2010) helped students understand how to conduct rank-based statistical tests by revealing their resemblance to already studied parametric tests.

Quantitative methods can be classified in several ways to help delineate common aspects or distinguishing differences. Thus, group-difference methods (e.g., z-test, t-test, ANOVA, ANCOVA, and MANOVA) are similar in allowing an examination of potential mean differences between groups on one or more outcome variables. In contrast, correlational methods (e.g., Pearson's r, CC) do not tend to focus on means but, rather, on assessing association between independent and dependent variables. Prediction methods (e.g., MR, LR, DFA, and MLM) all focus on predicting an outcome from a set of predictors and may have some emphasis on grouping variables—particularly LR and DFA, which explicitly include a categorical dependent variable, and MLM, which takes into account different levels or groups in the data. However, each of these prediction methods differs somewhat from ANOVA-based methods in focusing more on weights linking independent and dependent variables and less on mean scores such as group averages or centroids. Dimensional or structural methods (e.g., FA, PCA) are similar in involving one or more sets of measures with a smaller set of underlying factors or components posited or revealed to help explain the patterns of relationships among variables. Thus, dimensional methods are similar to correlational methods in their


focus on associations among variables, and different from group difference methods that are more concerned with mean differences among groups. SEM combines the best of group difference, correlation/prediction, and structural methods by allowing an investigation of the means of latent factors across different samples or groups, while still allowing an examination or confirmation of the structural or correlational relationships among variables. Longitudinal methods (e.g., time series, latent growth modeling) add the element of time to allow an examination of temporal ordering that can help in assessing causal precedence among variables.

The main point of this second theme is to encourage students to see how various quantitative methods are similar and how they are different. Discerning these distinctions and similarities will go a long way toward highlighting the overriding features and underlying aspects of quantitative methods and in selecting an appropriate method to address a specific kind of research question.

Interpreting Findings from Quantitative Methods

Third, after identifying a research question and noticing similarities and differences that lead to selecting a method to analyze one's data, it is important to examine and interpret findings from multiple perspectives. Although the field of quantitative psychology has traditionally focused largely on significance testing, there is growing emphasis on considering other alternatives such as effect sizes and confidence intervals (e.g., Harlow, Mulaik, & Steiger, 1997; Kline, 2004; Wilkinson & The Task Force on Statistical Inference, 1999). Initially, it can be helpful to assess whether the overall variation in the data is significantly different from chance. Significance tests usually involve some ratio of variances. We can think of group difference models as explaining the ratio of how scores vary among groups, relative to how much scores vary within each group. This ratio can be readily recognized as the F-test. The closer this ratio is to 1.0, the less we are apt to consider between-groups variance as anything more meaningful than the within-group error variance. Correlational methods examine whether the covariance between variables is salient when contrasted with the variances within each variable. Thus, a correlation is simply a ratio of the covariance between variables over the product of the standardized variance (i.e., standard deviations) within each variable. Using an analogy relating individuals, when the covariance between two people is salient, despite their individual variance (or uniqueness), a meaningful relationship emerges.

After assessing significance, it is important to examine the magnitude of an overall finding, called an effect size (ES). For group-difference methods, an ES can indicate the number of standard deviation units of difference there is among group means (i.e., Cohen's d; Cohen, 1988), with small, medium, and large effects having values of about 0.20 (almost a quarter of a standard deviation), 0.50 (half a standard deviation), and 0.80 (almost a full standard deviation) (Cohen, 1988). For correlational methods, we would hope to show a meaningful relationship between pertinent variables, with values of 0.1, 0.3, and 0.5 indicating small, medium, and large Pearson product moment correlations, respectively. Particularly for prediction methods, as well as other methods, proportions of shared variance effect sizes (e.g., η² or R²) are useful for showing how much the independent and dependent variables have in common, ranging in size from 0 to 1.0. Proportion-of-variance ES values of 0.01, 0.09, and 0.25 indicate small, medium, and large univariate effects, respectively (obtained by squaring 0.1, 0.3, and 0.5 correlations, respectively); and 0.02, 0.13, and 0.26 or more refer to small, medium, and large multivariate effect sizes, respectively (e.g., Cohen, 1988; Harlow, 2005).

When interpreting ES, it is also important to provide an indication of the margin of error with confidence intervals (e.g., Cumming & Fidler, 2009; Gilliland & Melfi, 2010; Odgaard & Fowler, 2010). Confidence intervals provide a range of values, with narrower intervals indicating more precision for an estimated effect. Research journals such as the Journal of Consulting and Clinical Psychology and others are beginning to require that statistical findings be supplemented with effect sizes and confidence intervals (e.g., La Greca, 2005; Odgaard & Fowler, 2010).

Thus, the third theme, concerned with interpreting results, involves providing an indication as to whether a result is significantly different from chance, the magnitude of the effect, and the degree of certainty about the result.

Ultimately, understanding quantitative themes and concepts (e.g., Abelson, 1995), including framing initial research questions; noticing similarities and differences when selecting statistical methods; and interpreting the significance, extent, and precision of one's findings will go a long way toward moving psychology more in the forefront of quantitative science.
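The effect-size benchmarks above lend themselves to a short computational illustration. The sketch below (in Python; the scores and function name are invented for this example, not taken from the chapter) computes Cohen's d as the standardized mean difference over a pooled standard deviation, and shows how squaring the small, medium, and large correlations yields the univariate proportion-of-variance benchmarks:

```python
import math

def cohens_d(group1, group2):
    """Standardized mean difference: (m1 - m2) / pooled sample SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Invented quiz scores for two hypothetical teaching conditions.
lecture = [10, 12, 11, 9, 13, 10, 11, 12]
active = [x + 1 for x in lecture]            # every score shifted up one point
print(round(cohens_d(active, lecture), 2))   # 0.76, a medium-to-large effect

# Squaring the r benchmarks gives the proportion-of-variance benchmarks.
print([round(r * r, 2) for r in (0.1, 0.3, 0.5)])  # [0.01, 0.09, 0.25]
```

A one-point gap here amounts to roughly three-quarters of a pooled standard deviation, which makes the abstract d scale concrete for students.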

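The margin-of-error idea can be made concrete the same way. A minimal sketch, assuming a large-sample normal approximation (hence the fixed critical value 1.96; a t critical value would be used with small samples), computes a 95% confidence interval for a mean from invented exam scores:

```python
import math

def mean_ci95(values, z=1.96):
    """Approximate 95% CI for the mean: x-bar +/- z * s / sqrt(n)."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((x - m) ** 2 for x in values) / (n - 1))  # sample SD
    half_width = z * s / math.sqrt(n)
    return m - half_width, m + half_width

scores = [72, 85, 78, 90, 66, 81, 77, 88, 74, 83]  # invented exam scores
low, high = mean_ci95(scores)
print(f"mean = {sum(scores) / len(scores):.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

Because the half-width shrinks with the square root of n, quadrupling the sample size halves the interval, which is one way to convey the precision message in class.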

Conclusions overarching constructs and processes rather than

This chapter features a number of issues regard- limiting teaching to narrow and isolated method-

ing the teaching of quantitative psychology. First, ological procedures (Embretson, 2010; McGrath,

research on quantitative training underscores the 2011; Rodgers, 2010). Building quantitative mod-

need to provide greater guidelines and opportunities els of behavior will generate more integrated and

for learning more about measurement, research, and explanatory understanding that is necessary to keep

statistical methods (e.g., Franklin & Garﬁeld, 2006; psychology scientiﬁcally rigorous.

Garﬁeld & Everson, 2009). Job opportunities are Second, students, faculty, and professionals

more available than are applicants for quantitative should be encouraged to explore options for fur-

careers (e.g., Aiken et al., 1990, 2008). An APA task thering quantitative learning outside of the class-

force on increasing the quantitative pipeline (Aiken room. For example, quantitative journals (e.g.,

et al., 2007) is aimed at bringing more people into Multivariate Behavioral Research, Psychological Assess-

this ﬁeld to ﬁll this need. ment, Psychological Methods, Structural Equation

Second, strategies for effectively teaching quan- Modeling ) as well as quantitative handbooks (e.g.,

titative psychology emphasize the advantages of Cooper et al., 2012), statistical dictionaries (e.g.,

active learning (e.g., Onwuegbuzie, Collins, & Dodge, 2003; Everitt, 2002; Upton & Cook,

Jiao, 2009), technology, and Web-based instruction 2008), and even online resources on quantita-

(e.g., Bliwise, 2005; Fillion, Limayem, Laferrière, tive methods (e.g., Journal of Statistics Educa-

& Mantha, 2008; Sosa et al., 2010), mentoring tion, http://www.amstat.org/publications/jse/) pro-

and role models (e.g., Ferreira, 2001), and concep- vide opportunities for further quantitative studies.

tual understanding (e.g., Mulhern & Wylie, 2004; There are also numerous quantitative volumes that

Swingler, Bishop, & Swingler, 2009). All of these provide input and guidelines for understanding

strategies implicitly invite students to become more the main aspects of various quantitative methods

immersed in the quantitative learning process and (Hancock & Mueller, 2010), including multi-

ultimately to become more statistically literate. ple regression/correlation (Cohen, Cohen, West,

Third, several themes have been offered to bring & Aiken, 2003), multivariate statistics (Stevens,

about greater clarity in quantitative learning (Har- 2009), statistical mediation (MacKinnon, 2008),

low, 2005). One of the themes includes considering missing data analysis (Enders, 2010), and structural

the nature of the research question. Students should equation modeling (e.g., Kline, 2011), among oth-

be encouraged to ask whether their research question involves group differences, correlation, prediction, underlying structure, or longitudinal studies. Another theme encourages students to notice the similarities and differences among quantitative procedures, as this kind of awareness is helpful in selecting appropriate methods to analyze data and address research questions. Still another theme presents suggestions for interpreting quantitative findings using statistical tests, effect sizes (ES), and confidence intervals.

Being aware of quantitative training practices, encouraging more engaged learning, and focusing on conceptual thinking and underlying themes can help us feature and spotlight quantitative psychology as a highly worthwhile and exciting field in which to take part.

Future Directions

A number of future directions are suggested to point out areas in which quantitative teaching is emerging and needs to grow. First, teaching should focus on actively engaging students and helping them to understand basic, underlying concepts. This would also involve more emphasis on modeling

ers. For those interested in more advanced statistical material, they could venture into the British Journal of Mathematical and Statistical Psychology and Psychometrika, or even the Millsap and Maydeu-Olivares (2009) volume on Contemporary Psychometrics, realizing that these latter sources would not be as accessible as the more translational articles in the former outlets.

Those interested in additional training could consider workshops (e.g., the University of Kansas Summer Institute held by Todd Little), annual conferences featuring quantitative presentations (e.g., Division 5 talks at the APA convention, Division D talks at the American Educational Research Association convention, and presentations and workshops at the Association for Psychological Science). Websites focusing on quantitative material (e.g., UCLA statistical website, http://www.ats.ucla.edu/stat/, and a similar one at Georgetown, http://statpages.org/) and online tutorials in quantitative methods (Dinov & Christou, 2009; Garfield & delMas, 2010) offer additional occasions to heighten quantitative proficiency. Further, there are quantitative societies that provide
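One of the themes discussed above, interpreting quantitative findings with statistical tests, effect sizes (ES), and confidence intervals, can be made concrete with a short computational sketch. The data and the helper function below are hypothetical illustrations, not taken from the chapter, and the interval uses a simple normal approximation rather than an exact t-based interval:

```python
import statistics as st

def cohens_d_and_ci(a, b, z=1.96):
    """Cohen's d for two independent samples, plus a normal-approximation
    95% confidence interval for the mean difference (illustrative only)."""
    ma, mb = st.mean(a), st.mean(b)
    va, vb = st.variance(a), st.variance(b)      # sample variances (n - 1)
    na, nb = len(a), len(b)
    # pooled standard deviation across the two groups
    sp = ((va * (na - 1) + vb * (nb - 1)) / (na + nb - 2)) ** 0.5
    d = (ma - mb) / sp                           # standardized effect size
    se = (va / na + vb / nb) ** 0.5              # SE of the mean difference
    diff = ma - mb
    return d, (diff - z * se, diff + z * se)

# Hypothetical course scores under active-learning vs. lecture instruction
scores_active = [78, 85, 90, 72, 88, 81, 79, 92]
scores_lecture = [70, 75, 80, 68, 77, 73, 74, 79]
d, ci = cohens_d_and_ci(scores_active, scores_lecture)
```

Reporting the effect size and its interval alongside any significance test is exactly the kind of interpretive habit the theme encourages: a CI that excludes zero conveys both the direction and the precision of the estimated difference.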

112 teaching quantitative psychology

useful opportunities and contacts (e.g., American Psychological Association, Division 5: Evaluation, Measurement, and Statistics; American Statistical Association; Psychometric Society; Society for Mathematical Psychology, and the Society of Multivariate Experimental Psychology), including international statistical societies (e.g., European Association of Methodology, International Association for Statistical Education, International Statistical Institute, Society for Multivariate Analysis in the Behavioral Sciences). Participating in quantitative forums helps to enlarge quantitative networks, providing a stronger foundation to further one's skills.

Third, quantitative teaching should be widened to include individuals from low-income (Kitchen, DePree, Celedón-Pattichis, & Brinkerhoff, 2007) and under-represented groups (Kosoko-Lasaki, Sonnino, & Voytko, 2006). We also need to consider whether traditional approaches to quantitative teaching have helped to shrink or amplify the differences among mainstream and marginalized students (Ceci & Papierno, 2005). Other offerings can be encouraged (e.g., Quantitative Training for Underrepresented Groups) to help bring about greater equity and effectiveness that can open doors to more career options in the field of quantitative science to an even larger and more diverse group of individuals.

Fourth, it is important to reach out to quantitative researchers who are newly graduated to encourage them to become involved in the larger field. Early career psychologists make up about 10% of the 150,000 members of the APA (2010), and Division 5 (measurement, statistics and evaluation) involves approximately 1% of APA members (i.e., Gruber, 2010). The APA Quantitative Task Force (Aiken et al., 2007) helps in this regard, with a website (http://www.apa.org/research/tools/quantitative/index.aspx) that provides definitions of the areas of focus within quantitative methods, recommendations for preparing for quantitative training, and lists of quantitative graduate programs that can help in recruiting individuals into quantitative psychology and in retaining those that are already participating.

Finally, continuing efforts to generate greater interest, participation, and performance in quantitative literacy is an important goal. Unfortunately, psychology is not always considered as having the scientific credentials of other disciplines such as physics, chemistry, and biology (Simonton, 2004). Particularly, recent efforts to improve technological education in the United States (Chopra, 2010) have focused more on traditionally defined science, technology, engineering, and mathematics that sometimes exclude psychology. Further, research reveals that students from the United States fall in the bottom half regarding quantitative performance when compared to other countries (Gonzales, Guzmán, Partelow, Pahlke, Jocelyn, Kastberg, & Williams, 2003; McQuillan & Kennelly, 2005).

[Figure 6.1 (diagram): Quant Teaching (Active Learning, E-Learning, Mentoring), Quant Concepts (Research Questions, Similarities & Differences, Interpret Results), Other Quant Learning (Models, Workshops, Societies, Literature), and Widening Quant Reach (Disadvantaged, Early Career, Underrepresented), all feeding into Statistical Literacy.]

Figure 6.1 Bringing about statistical literacy through quantitative teaching, other quantitative learning, and widening the quantitative reach

Figure 6.1 depicts the main focus of this chapter, where innovative quantitative teaching (with active learning, e-learning, mentoring, and concepts learning), other quantitative learning (e.g., model building, workshops, quantitative societies, and quantitative literature), and widening the quantitative reach (to the disadvantaged, early career individuals, and the under-represented) can help bring about greater statistical literacy. More emphasis needs to be placed on involving others in the marvels of quantitative psychology, whether at the undergraduate or graduate level, in the larger profession of psychology or in the general public. Ideas presented in this chapter, and in the larger handbook, are offered to

harlow 113

help in teaching quantitative psychology, within the classroom, in research centers, and in every facet of life that involves quantitative thinking.

Author Note

Lisa L. Harlow is Professor of Quantitative Psychology at the University of Rhode Island. Please address correspondence to Lisa L. Harlow, Department of Psychology, University of Rhode Island, 142 Flagg Rd, Chafee, Kingston, RI 02881-0808.

Acknowledgments

I wish to acknowledge support from the National Science Foundation (NSF) Grant (# 0720063) on Quantitative Training for Underrepresented Groups; Lisa L. Harlow, Principal Investigator (PI). I would like to thank students, faculty and staff at the University of Rhode Island and the Cancer Prevention Research Center in Rhode Island for their ongoing support, inspiration and encouragement. Similarly, I would like to thank Dr. Lisa Dierker (PI) and her colleagues at the Quantitative Analysis Center at Wesleyan University. Serving on the advisory board for their NSF Grant (# 0942246) on An Inquiry-Based, Supportive Approach to Statistical Reasoning and Application has opened up my thinking and those of my students on how to productively engage students in the wonders of quantitative inquiry.

References

Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum.

Aberson, C. L., Berger, D. E., Healy, M. R., & Romero, V. L. (2003). Evaluation of an interactive tutorial for teaching hypothesis testing concepts. Teaching of Psychology, 30, 75–78.

Aiken, L. S., Aguinis, H., Appelbaum, M., Boodoo, G., Edwards, M. C., Gonzalez, R. D., Panter, A., et al. (2007). Report of the task force for increasing the number of quantitative psychologists. Washington, DC: American Psychological Association. Retrieved July 1, 2010 from http://www.apa.org/research/tools/quantitative/quant-task-force-report.pdf.

Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno's (1990) survey of PhD programs in North America. American Psychologist, 63, 32–50.

Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of PhD programs in North America. American Psychologist, 45, 721–734.

APA (2010). Early career psychologists. Washington, DC: American Psychological Association. Retrieved on July 1, 2010 from http://www.apa.org/membership/index.aspx.

Ashcraft, M. H., & Kirk, E. P. (2001). The relationship among working memory, math anxiety, and performance. Journal of Experimental Psychology: General, 130, 224–237.

Atkinson, R. K., Catrambone, R., & Merrill, M. M. (2003). Aiding transfer in statistics: Examining the use of conceptually oriented equations and elaborations during subgoal learning. Journal of Educational Psychology, 95, 762–773.

Aulls, M. W., & Shore, B. M. (2008). Inquiry in education, volume 1: The conceptual foundations for research as a curricular imperative. New York: Taylor & Francis/Erlbaum.

Bandura, A. (1997). Self-efficacy: The exercise of control. New York: W. H. Freeman.

Barron, K. E., Benedict, J. O., Saville, B. K., Serdikoff, S. L., & Zinn, T. E. (2007). Innovative approaches to teaching statistics and research methods: Just-in-time teaching, interteaching, and learning communities. In Dunn, D. S., Smith, R. A., & Beins, B. (Eds.), Best practices for teaching statistics and research methods in the behavioral sciences (pp. 143–158). Mahwah, NJ: Erlbaum.

Bartz, A. E., & Sabolik, M. A. (2001). Computer and software use in teaching the beginning statistics course. Teaching of Psychology, 28, 147–149.

Ben-Zvi, D., & Garfield, J. (Eds.) (2004). The challenge of developing statistical literacy, reasoning, and thinking. Dordrecht, Netherlands: Kluwer.

Berger, D. (n.d.). WISE Web Interface for Statistics Education. Retrieved December 20, 2010 from http://wise.cgu.edu/.

Bjornsdottir, A., & Garfield, J. (2010). Teaching bits: Statistics education articles from 2009. Journal of Statistics Education, 18. Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/garfieldtb.pdf

Bliwise, N. G. (2005). Web-based tutorials for teaching introductory statistics. Journal of Educational Computing Research, 33, 309–325.

Britt, M. A., Sellinger, J., & Stillerman, L. M. (2002). A review of ESTAT: An innovative program for teaching statistics. Teaching of Psychology, 29, 73–75.

Budé, L., Van De Wiel, M. W. J., Imbos, T., Candel, M. J. J. M., Broers, N. J., & Berger, M. P. F. (2007). Students' achievements in a statistics course in relation to motivational aspects and study behaviour. Statistics Education Research Journal [Online], 6, 5–21. Retrieved May 9, 2012 from http://www.stat.auckland.ac.nz/∼iase/serj/SERJ6(1)_Bude.pdf

Ceci, S. J., & Papierno, P. B. (2005). The rhetoric and reality of gap closing: When the "have nots" gain but the "haves" gain even more. American Psychologist, 60, 149–160.

Chew, S. L. (2007). Designing effective examples and problems for teaching statistics. In Dunn, D. S., Smith, R. A., & Beins, B. (Eds.), Best practices for teaching statistics and research methods in the behavioral sciences (pp. 73–91). Mahwah, NJ: Erlbaum.

Chiou, C.-C. (2009). Effects of concept mapping strategy on learning performance in business and economics statistics. Teaching in Higher Education, 14, 55–69.

Chopra, A. (2010). Building a future for technological innovation. White House Office of Science & Technology Policy. Retrieved January 13, 2011 from http://www.whitehouse.gov/sites/default/files/microsites/ostp/CommClubr081710FINAL.pdf.

Cobb, G. (1999). Move over, Bill Nye: George Cobb makes learning about statistics practical and fun. Mount Holyoke College news and events. Retrieved May 9, 2012


from http://www.mtholyoke.edu/offices/comm/csj/990409/gummy.html.

Cobb, G. W. (2007). The introductory statistics course: A ptolemaic curriculum? Technology Innovations in Statistics Education, 1, 1–15. Retrieved May 9, 2012 from http://escholarship.org/uc/item/6hb3k0nz#page-15

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd edition). Hillsdale, NJ: Erlbaum.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd edition). Mahwah, NJ: Erlbaum.

Cooper, H., Camic, P. M., Long, D. L., Panter, A. T., Rindskopf, D., & Sher, K. J. (Eds.). (2012). APA handbook of research methods in psychology (Vols. 1–3). Washington, DC: American Psychological Association.

Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie/Journal of Psychology, 217, 15–26.

DaRos-Voseles, D. A., Collins, K. M. T., Onwuegbuzie, A. J., & Jiao, Q. G. (2008). Effect of self-perception on performance of graduate-level cooperative groups in research methodology courses. Journal of Instructional Psychology, 35, 254–259.

Derryberry, D. R., Schou, S. B., & Conover, W. J. (2010). Teaching rank-based tests by emphasizing structural similarities to corresponding parametric tests. Journal of Statistics Education, 18. Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/derryberry.pdf.

DeVaney, T. A. (2010). Anxiety and attitude of graduate students in on-campus vs. online statistics courses. Journal of Statistics Education, 18. Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/devaney.pdf.

Dewey, J. (1997). How we think. New York: Dover Publications.

Dinov, I. D., & Christou, N. (2009). Statistics online computational resource for education. Teaching Statistics, 31, 49–51.

Dodge, Y. (Ed.) (2003). The Oxford dictionary of statistical terms. New York: Oxford University Press.

Dolinsky, B. (2001). An active learning approach to teaching statistics. Teaching of Psychology, 28, 55–56.

Dunn, D. S., Smith, R. A., & Beins, B. (Eds.) (2007). Best practices for teaching statistics and research methods in the behavioral sciences. Mahwah, NJ: Erlbaum.

Embretson, S. E. (Ed.) (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.

Enders, C. K. (2010). Applied missing data analysis. New York: Guilford.

Everitt, B. S. (2002). The Cambridge dictionary of statistics (2nd edition). Cambridge, UK: Cambridge University Press.

Everson, M., Zieffler, A., & Garfield, J. (2008). Implementing new reform guidelines in teaching introductory college statistics courses. Teaching Statistics, 30, 66–69.

Ferreira, M. (2001). Building communities through role models, mentors, and hands-on-science. The School Community Journal, 11, 27–37.

Fillion, G., Limayem, M., Laferrière, T., & Mantha, R. (2008). Integrating ICT into higher education: A study of onsite vs. online students' and professors' perceptions. International Journal of Web-Based Learning and Teaching Technologies, 3, 48–72.

Franklin, C., & Garfield, J. (2006). The guidelines for assessment and instruction in statistics education (GAISE) project: Developing statistics education guidelines for grades pre K-12 and college courses. In G. F. Burrill (Ed.), Thinking and reasoning with data and chance: Sixty-eighth annual NCTM yearbook (pp. 345–375). Reston, VA: National Council of Teachers of Mathematics.

Friedrich, J., Buday, E., & Kerr, D. (2000). Statistical training in psychology: A national survey and commentary on undergraduate programs. Teaching of Psychology, 27, 248–257.

Gal, I. (2003). Teaching for statistical literacy and services of statistics agencies. The American Statistician, 57, 80–84.

Garfield, J., & Ben-Zvi, D. (2007). How students learn statistics revisited: A current review of research on teaching and learning statistics. International Statistical Review, 75, 372–396.

Garfield, J., & Ben-Zvi, D. (2008). Developing students' statistical reasoning: Connecting research and teaching practice. New York: Springer.

Garfield, J., & delMas, R. (2010). A web site that provides resources for assessing students' statistical literacy, reasoning and thinking. Teaching Statistics, 32, 2–7.

Garfield, J., & Everson, M. (2009). Preparing teachers of statistics: A graduate course for future teachers. Journal of Statistics Education, 17(2). Retrieved May 9, 2012 from www.amstat.org/publications/jse/v17n2/garfield.html.

Garfield, J., Hogg, B., Schau, C., & Whittinghill, D. (2002). First courses in statistical science: The status of educational reform efforts. Journal of Statistics Education, 10(2). Retrieved May 9, 2012 from www.amstat.org/publications/jse/v10n2/garfield.html.

Gelman, A., & Nolan, D. (2002). Teaching statistics: A bag of tricks. New York: Oxford University Press.

Gilliland, D., & Melfi, V. (2010). A note on confidence interval estimation and margin of error. Journal of Statistics Education, 18(1). Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/gilliland.pdf.

Gonzales, P., Guzmán, J. C., Partelow, L., Pahlke, E., Jocelyn, L., Kastberg, D., & Williams, T. (2003). Highlights from the trends in international mathematics and science study (TIMSS). U.S. Department of Education Institute of Education Sciences, National Center for Educational Statistics.

Gruber, C. (Ed.) (2010). The score newsletter, Vol. XXXII, No. 3. Division 5, American Psychological Association. Retrieved July 1, 2010 from http://www.apa.org/divisions/div5/pdf/July10Score.pdf.

Hancock, G. R., & Mueller, R. O. (Eds.). (2010). The reviewer's guide to quantitative methods in the social sciences. New York: Taylor & Francis.

Hanley, G. L. (2004). E-learning and the science of instruction. Applied Cognitive Psychology, 18, 123–124.

Härdle, W., Klinke, S., & Ziegenhagen, U. (2007). On the utility of e-learning in statistics. International Statistical Review, 75, 355–364.

Harlow, L. L. (2005). The essence of multivariate thinking: Basic themes and methods. Mahwah, NJ: Erlbaum.

Harlow, L. L. (2010). On scientific research: Invited commentary on the role of statistical modeling and hypothesis testing. Journal of Modern Applied Statistical Methods, 9, 348–358.


Harlow, L. L., Burkholder, G., & Morrow, J. (2002). Evaluating attitudes, skill and performance in a learning enhanced quantitative methods course: A structural modeling approach. Structural Equation Modeling Journal, 9, 413–430.

Harlow, L. L., Burkholder, G., & Morrow, J. (2006). Engaging students in learning: An application with quantitative psychology. Teaching of Psychology, 33, 231–235.

Harlow, L. L., Mulaik, S., & Steiger, J. (Eds.) (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.

Helman, S., & Horswill, M. S. (2002). Does the introduction of non-traditional teaching techniques improve psychology undergraduates' performance in statistics? Psychology Learning & Teaching, 2, 12–16.

Hulme, J. (2007). Review of Handbook of demonstrations and activities in the teaching of psychology volume 1: Introductory, statistics, research methods and history. Psychology Learning & Teaching, 6, 164–165.

Hulsizer, M. R., & Woolf, L. M. (2009). Guide to teaching statistics: Innovations and best practices. New York: Wiley-Blackwell.

Jaccard, J., & Jacoby, J. (2009). Theory construction and model building skills: A practical guide for social scientists. New York: Guilford Press.

Katayama, A. D. (2001). Bi-modal instructional practices in educational psychology: Mentoring and traditional instruction. Journal of Instructional Psychology, 28, 171–177.

Kitchen, R. S., DePree, J., Celedón-Pattichis, S., & Brinkerhoff, J. (2007). Mathematics education at highly effective schools that serve the poor: Strategies for change. Mahwah, NJ: Erlbaum.

Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.

Kline, R. B. (2011). Principles and practice of structural equation modeling (3rd edition). New York: Guilford.

Kolar, D. W., & McBride, C. A. (2003). Creating problems to solve problems: An interactive teaching technique for statistics courses. Teaching of Psychology, 30, 67–68.

Kosoko-Lasaki, O., Sonnino, R. E., & Voytko, M. L. (2006). Mentoring for women and underrepresented minority faculty and students: Experience at two institutions of higher education. Journal of the National Medical Association, 98(9), 1449–1459.

Krause, U.-M., Stark, R., & Mandl, H. (2009). The effects of cooperative learning and feedback on e-learning in statistics. Learning and Instruction, 19, 158–170.

La Greca, A. M. (2005). Editorial. Journal of Consulting and Clinical Psychology, 73, 3–5.

Liu, T.-C., Lin, Y.-C., & Kinshuk (2010). The application of Simulation-Assisted Learning Statistics (SALS) for correcting misconceptions and improving understanding of correlation. Journal of Computer Assisted Learning, 26, 143–158.

MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. New York: Taylor & Francis.

Malloy, T. E., & Jensen, G. C. (2001). Utah Virtual Lab: JAVA interactivity for teaching science and statistics on line. Behavior Research Methods, Instruments & Computers, 33, 282–286.

Mazzolini, M., & Maddison, S. (2007). When to jump in: The role of the instructor in online discussion forums. Computers & Education, 49, 193–213.

McGrath, R. E. (2011). Quantitative models in psychology. Washington, DC: American Psychological Association.

McQuillan, L., & Kennelly, L. (2005). New study finds U.S. math students consistently behind their peers around the world: Findings challenge conventional wisdom about U.S. math success in early grades. Washington, D.C.: American Institutes for Research.

Meletiou-Mavrotheris, M., & Lee, C. (2002). Teaching students the stochastic nature of statistical concepts in an introductory statistics course. Statistics Education Research Journal, 1, 22–37.

Mills, J. D. (2004). Students' attitudes toward statistics: Implications for the future. College Student Journal, 38, 349–361.

Millsap, R. E., & Maydeu-Olivares, A. (Eds.) (2009). The Sage handbook of quantitative methods in psychology. Thousand Oaks, CA: Sage Publications.

Mulhern, G., & Wylie, J. (2004). Changing levels of numeracy and other core mathematical skills among psychology undergraduates between 1992 and 2002. British Journal of Psychology, 95, 355–370.

Mulhern, G., & Wylie, J. (2006). Mathematical prerequisites for learning statistics in psychology: Assessing core skills of numeracy and mathematical reasoning among undergraduates. Psychology Learning & Teaching, 5, 119–132.

Neal-Barnett, A., Mitchell, M., & Boeltar, C. (2002). Faculty of color serving students, serving self: The psychology group. Teaching of Psychology, 29, 44–45.

Nie, Y., & Lau, S. (2010). Differential relations of constructivist and didactic instruction to students' cognition, motivation, and achievement. Learning and Instruction, 20, 411–423.

Odgaard, E. C., & Fowler, R. L. (2010). Confidence intervals for effect sizes: Compliance and clinical significance in the Journal of Consulting and Clinical Psychology. Journal of Consulting and Clinical Psychology, 78, 287–297.

Onwuegbuzie, A. J. (2000). Attitudes toward statistics assessments. Assessment and Evaluation in Higher Education, 25, 321–339.

Onwuegbuzie, A. J., Collins, K. M. T., & Jiao, Q. G. (2009). Performance of cooperative learning groups in a postgraduate education research methodology course: The role of social interdependence. Active Learning in Higher Education, 10, 265–277.

Onwuegbuzie, A. J., & Wilson, V. A. (2003). Statistics anxiety: Nature, etiology, antecedents, effects, and treatments-a comprehensive review of literature. Teaching in Higher Education, 8, 195–209.

Peden, B. F. (2001). Correlational analysis and interpretation: Graphs prevent gaffes. Teaching of Psychology, 28, 129–131.

Peterson, S. E., & Miller, J. A. (2004). Quality of college students' experiences during cooperative learning. Social Psychology of Education, 7(2), 161–183.

Piotrowski, C., Bagui, S. C., & Hemasinha, R. (2002). Development of a measure of statistics anxiety in graduate-level psychology students. Journal of Instructional Psychology, 29, 97–100.

Quilici, J. L., & Mayer, R. E. (2002). Teaching students to recognize structural similarities between statistics word problems. Applied Cognitive Psychology, 16, 325–342.

Rajecki, D. W., Appleby, D., Williams, C. C., Johnson, K., & Jeschke, M. P. (2005). Statistics can wait: Career plans activity and course preferences of American psychology undergraduates. Psychology Learning & Teaching, 4, 83–89.

Ridgway, J., Nicholson, J., & McCusker, S. (2007). Teaching statistics—Despite its applications. Teaching Statistics, 29, 44–48.


Rodarte-Luna, B., & Sherry, A. (2008). Sex differences in the relation between statistics anxiety and cognitive/learning strategies. Contemporary Educational Psychology, 33, 327–344.

Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65, 1–12.

Romero, V. L., Berger, D. E., Healy, M. R., & Aberson, C. L. (2000). Using cognitive learning theory to design effective on-line statistics tutorials. Behavior Research Methods, Instruments & Computers, 32, 246–249.

Rosenthal, R., & DiMatteo, M. R. (2001). Meta-analysis: Recent developments in quantitative methods for literature reviews. Annual Review of Psychology, 52, 59–82.

Rumsey, D. J. (2002). Statistical literacy as a goal for introductory statistics courses. Journal of Statistics Education, 10(3). Retrieved May 9, 2012 from http://www.amstat.org/publications/jse/v10n3/rumsey2.html.

Saville, B. K., Zinn, T. E., Lawrence, N. K., Barron, K. E., & Andre, J. (2008). Teaching critical thinking in statistics and research methods. In Dunn, D. S., Halonen, J. S., & Smith, R. A. (Eds.), Teaching critical thinking in psychology: A handbook of best practices (pp. 149–160). New York: Wiley-Blackwell.

Schwartz, D., & Martin, T. (2004). Inventing to prepare for future learning: The hidden efficiency of encouraging original student production in statistics instruction. Cognition and Instruction, 22, 129–184.

Sen, S. (2004). Teaching quantitative thinking: Comment. Science, 304(5670), 518.

Simonton, D. K. (2004). Psychology's status as a scientific discipline: Its empirical placement within an implicit hierarchy of the sciences. Review of General Psychology, 8, 59–67.

Sosa, S., Berger, D. E., Saw, A. T., & Mary, J. C. (2010). Effectiveness of computer-assisted instruction in statistics: A meta-analysis. Review of Educational Research, published online August 12, 2010. DOI: 10.3102/0034654310378174

Stark, R., & Krause, U.-M. (2009). Effects of reflection prompts on learning outcomes and learning behaviour in statistics education. Learning Environments Research, 12, 209–223.

Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York: Routledge.

Swingler, M. V., Bishop, P., & Swingler, K. M. (2009). SUMS: A flexible approach to the teaching and learning of statistics. Psychology Learning & Teaching, 8, 39–45.

Tomcho, T. J., Rice, D., Foels, R., Folmsbee, L., Vladescu, J., Lissman, R., Matulewicz, R., et al. (2009). APA's learning objectives for research methods and statistics in practice: A multimethod analysis. Teaching of Psychology, 36, 84–89.

Tremblay, P. F., Gardner, R. C., & Heipel, G. (2000). A model of the relationships among measures of affect, aptitude, and performance in introductory statistics. Canadian Journal of Behavioural Science, 32, 40–48.

Upton, G., & Cook, I. (2008). A dictionary of statistics. New York: Oxford University Press.

Velicer, W. F., & Jackson, D. N. (1990). Component analysis vs. common factor analysis: Some issues in selecting an appropriate procedure. Multivariate Behavioral Research, 25, 1–28.

Ware, M. E., & Johnson, D. E. (Eds.) (2000). Handbook of demonstrations and activities in the teaching of psychology: Introductory, statistics, research methods, and history, Vol. I (2nd ed.). Mahwah, NJ: Erlbaum.

Watkins, A. E., Scheaffer, R. L., & Cobb, G. W. (2004). Statistics in action: Understanding a world of data. New York: Key College Publishing.

Watson, J. M. (2006). Statistical literacy at school: Growth and goals. Mahwah, NJ: Erlbaum.

Wender, K. F., & Muehlboeck, J.-S. (2003). Animated diagrams in teaching statistics. Behavior Research Methods, Instruments & Computers, 35, 255–258.

Wilkinson, L., & The Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.

Wulff, S. S., & Wulff, D. H. (2004). "Of course I'm communicating, I lecture every day": Enhancing teaching and learning in introductory statistics. Communication Education, 53, 92–103.

Zeedyk, M. S. (2006). Detective work on statistics street: Teaching statistics through humorous analogy. Psychology Learning & Teaching, 5, 97–109.

Zieffler, A., Garfield, J., Alt, S., Dupuis, D., Holleque, K., & Chang, B. (2008). What does research suggest about the teaching and learning of introductory statistics at the college level? A review of literature. Journal of Statistics Education, 16(2). Retrieved May 9, 2012 from http://www.amstat.org/publications/jse/v16n2/zieffler.html.

Zirkel, S. (2002). Is there a place for me? Role models and academic identity among white students and students of color. Teachers College Record, 104, 357–376.


CHAPTER

7 Modern Test Theory

Roderick P. McDonald†

Abstract

This chapter provides a unified treatment of seven major topics in test theory. It relates Modern Test Theory, based on item response modeling, to classical linear modeling through the use of the common factor model. The topics include choosing a metric; measurement and measurement error (reliability); item selection; homogeneity and dimensionality; validity; equating tests; and comparing populations. The treatment of these problems makes no distributional assumptions.

Key Words: test theory, metric, homogeneity, dimensionality, validity, equating, differential item functioning

The subject of this chapter is very large, so the approach to it requires a broad brush and has to be somewhat selective. I apologize for this in advance.

We begin, as is commonly the case, with questions of definition. Test theory consists of mathematical models for item scores and theory for the test scores derived from them by a scoring formula. Further, a test (a set of items) is designed to measure a quantifiable attribute or set of related attributes of examinees. The term "attributes" has its ordinary dictionary definition—properties or characteristics of the examinees. There is an inevitable looseness in these three statements, but they should serve as a platform for the following account.

The topics of test theory have developed in a piecemeal fashion, with little attempt to examine the relationships between them. We might date the beginnings of the "classical" period of test theory to Spearman's papers on True Score Theory (1904a) and on the general factor model (1904b). The two

However, for decades Psychometric Theory has treated them as foundations of two major, unrelated topics.

The primary problems of classical test theory were the homogeneity, the reliability, and the validity of a test. These remain major topics in the modern era. The treatment of these problems rested on simple linear additive models—nothing could be simpler than Spearman's true score model—but with an uneasy recognition that linear models were inappropriate for binary items. The interested reader would find Gulliksen's (1950) neglected classic an excellent account of thinking in the classical period.

The incorporation of Alan Birnbaum's rigorous treatment of item response models in Lord and Novick (1968) marks the beginning of the modern era. Lord and Novick's text, accompanied by Lord (1980), should still be required reading for any student or researcher who wishes to have a general understanding of the field.
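For orientation, the "simple linear additive model" of Spearman's true score theory referred to above can be written in its standard textbook form. This is a reconstruction added for the reader's convenience, not the chapter's own equations, using the E{ }, Cov{ }, Var{ } notation the chapter adopts:

```latex
% Spearman's true score model: observed score X, true score T, error E
X = T + E, \qquad \mathrm{E}\{E\} = 0, \qquad \mathrm{Cov}\{T, E\} = 0,
% so the variances are additive, and the reliability is the
% true-score share of the observed variance:
\mathrm{Var}\{X\} = \mathrm{Var}\{T\} + \mathrm{Var}\{E\},
\qquad
\rho_{XX'} = \frac{\mathrm{Var}\{T\}}{\mathrm{Var}\{X\}}.
```

The difficulty with binary items is already visible in this form: when X can take only the values 0 and 1, an additive decomposition of this linear kind cannot hold exactly, which is part of what motivated the item response models of the modern era.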

† Editor’s Note: Sadly, Rod passed away in October of 2011. His contributions to quantitative methods are inestimable

and he will be dearly missed. I would like to thank Aaron Boulton who completed the tables, ﬁgures, and proofs for

this chapter.


The Lord and Novick text contains the necessary foundations for a unified treatment of test theory but leaves implicit the relationship between the linear models of classical theory and item response models for binary test items. A general misconception is still apparently widely held that test theory is made up of two parts—Classical Test Theory for quantitative data, and Item Response Theory (IRT) for binary data, with a great gulf fixed between them.

The possibility of a unified treatment of linear and nonlinear models for item scores was adumbrated by McDonald (1967), who showed how an item response model could be approximated by a linear model using standard procedures, common in physics, for linearizing a nonlinear model. A more rigorous unified treatment, based on the generalized linear model of Nelder and Wedderburn (1972), was provided by Bartholomew (see Bartholomew & Knott, 1999, and Skrondal & Rabe-Hesketh, 2004). However, these accounts do not address the central problems of Test Theory. McDonald (1999) sought to present a "unified treatment" by applying the same psychometric concepts to a linear (unidimensional or multidimensional) model for quantitative data and to the parallel nonlinear model for binary data. This treatment shows how the linear model serves as a first approximation to the nonlinear model. Other authors have contributed to this unification of theory. See, for example, Jones and Thissen (2007) and Thissen and Wainer (2001).

This chapter contains a brief introduction to the treatment given by McDonald (1999), with some revision and with restructuring to exhibit parallels between linear and nonlinear models. I rely on more specialized accounts in the present volume to fill in details that I must omit here. Specifically, I will not examine standard estimation methods for item parameters. (See Hallberg, Wing, Wong, & Cook, Chapter 12, Volume 1; Steiner and Cook, Chapter 13, Volume 1.)

Following the next section, which sets out the properties of the linear (factor) model and the parallel nonlinear (item response) model that we need, I will discuss the application of these models to seven problems in test theory. These are: (1) imposing a metric on the measured attribute, (2) measurement and error of measurement, (3) item selection, (4) homogeneity and dimensionality, (5) validity, (6) equating tests, and (7) comparing populations. The last section, not surprisingly, is general discussion.

The Models
With reference to notation, I will use uppercase italics for random variables, and lowercase Roman for the values they take or the scale on which they are distributed. I assume the reader is familiar with the algebra of expectations and with variances, covariances, and of course correlations. I will write an expected value—a mean—as E{ }, covariance as Cov{ }, and variance as Var{ }. This allows us to have E{Y | X = x} for the conditional mean of random Y when random X takes the value x, and Var{Y | X = x} for its conditional variance. Any change from, say, X to x signals a change from representing a random variable to representing a specific value for an individual or points on the scale of x. When a sentence defines a new term, the term will be written in italics.

The models we consider are mathematical idealizations and can be regarded only as approximations to the behavior of any real measures (see McDonald, 2010). A set of m items is given to a large sample of examinees. For our first model, we suppose that the responses to them can be coded by item scores that range over enough numerical values to apply a linear model as an acceptable approximation. The item scores might be subtest scores, or Likert-scaled scores: for example, coding Strongly agree, Agree, Indifferent, Disagree, and Strongly disagree as the integers 5, 4, 3, 2, and 1, respectively.

We take as a suitable model Spearman's general factor model, written in the form

Xj = μj + λj F + Uj,   (1)

where Xj is the score on the jth item of a randomly chosen subject from a defined population, F is the (unobserved) value of the attribute given by the model, and Uj is a random interaction between the item and the examinee. (To simplify the presentation, it will be understood that the subscript j, wherever it appears, ranges from 1 to m. We leave out the formal j = 1, 2, . . . , m. All summations are over this range, unless otherwise stated.)

Equation 1 is a simple regression model, with F the "explanatory" variable, Xj the response variable, and Uj the residual. Accordingly, F and Uj are uncorrelated. We assume that the interaction terms Uj of distinct items are uncorrelated and write ψj² for their variances. Then by the algebra of expectations, the variance of Xj is given by

Var{Xj} = λj² Var{F} + ψj²,   (2)

and the covariance of two distinct item scores Xj and Xk by

Cov{Xj, Xk} = λj λk Var{F}.   (3)

(The reader who is familiar with the factor analysis of correlations needs to be told that for test theory

mcdonald 119

applications we use covariances.) In this mathematical model, F is the common factor that accounts for the covariance of the item scores through the relation given by Equation 3. The common factor F is linked to the real world of applications by identifying it with a measure of the abstract attribute that the items share as their common property. The interaction terms Uj—the unique components—are linked to applications by identifying them with measures of specific properties of the items. This identification requires the strong assumption that their specific properties are strictly unique to the items. The assumption is realized only approximately in practice. (The unique components may also contain an error of replication that cannot be separated in a single set of observations.) The regression constant μj is the item score mean, representing item difficulty in cognitive applications. (Strictly, this should be item facility—the easier the item, the higher the mean score.) The regression slope λj measures the extent to which the item discriminates between subjects at different levels of the attribute. It is traditionally termed a factor loading. (For reasons that will appear, I would like to call it the discrimination parameter, the counterpart term in item response models, but tradition is too strong.) The parameter ψj² is the unique variance of the item score.

We can write Equation 1 as

E{Xj | F = f} = μj + λj f,   (4)

the alternative way to write a regression—that is, as the expected value of Xj conditioning on a value f of F. Defining

Uj = Xj − E{Xj | F = f}   (5)

returns us to Equation 1, but in the form Equation 4 it allows us to write

Var{Xj | F = f} = ψj²,   (6)

and

Cov{Xj, Xk | F = f} = 0.   (7)

That is, for a fixed value of the attribute the item scores are uncorrelated. Equation 7 is a weak version of the Principle of Local Independence, which governs item response models. A strong form of this principle requires that X1, X2, . . . , Xm are mutually statistically independent for fixed f. The strong form of the principle implies the weak form, and if the data have a multivariate normal distribution, then the weak form implies the strong form. Fitting and testing the model using just limited information from the item covariances is commonly good enough. Higher moments of the joint distribution are extremely unstable in sampling and cannot be relied on to improve estimates. It is implicit in applications of factor models that the weak form of the principle of local independence is used to fit the model, but the strong form is intended. That is, we do not suppose that the common factor accounts for the covariances between the items but leaves other forms of statistical dependence unaccounted for.

Equation 4 gives a simple linear relationship between the item scores and the factor. We might instead suppose that each item score has a distinct relationship to the factor, writing

E{Xj | F = f} = γj(f),   (8)

where γj is intended to represent a different nonlinear function for each item. There has been very little work on this kind of generality (see, for example, McDonald, 1967; Yalcin & Amemiya, 2001).

A generalized linear model, following Nelder and Wedderburn (1972), is obtained by substituting a common nonlinear link function γ for a set of distinct functions γj, relating the item scores to a linear function of f, with

E{Xj | F = f} = γ(μj + λj f).   (9)

In principle there is a wide choice of link functions. In applications the choice is motivated by metric properties of the item scores (see Skrondal & Rabe-Hesketh, 2004, for a fairly comprehensive list). For our purposes, it must suffice to consider just two. The first, which we already have, is the simple linear relationship given by Equation 4, with the link function just the identity function. This would be chosen whenever the item score ranges appear to allow a linear model, especially if we can suppose—approximately—multivariate normality.

The only alternative we will consider is motivated by binary data, items scored with just two values—0 and 1. This will include multiple choice cognitive items, with one answer—the correct answer, we hope—scored 1 and the rest 0. It also includes noncognitive items with one response keyed for the attribute to be measured, and the other response(s) nonkeyed—for example, a checklist of symptoms of a behavior disorder. In such a case, it is easily shown that E{Xj | F = f} is the probability of giving the keyed response, conditioned on f. To avoid a clash of notations I will write this as:

E{Xj | F = f} = P{Xj = 1 | F = f} = γ(aj + bj f),   (10)

where aj replaces μj and bj replaces λj. The former notation is set in stone for the factor model and will

Figure 7.1 [Figure: item characteristic curves, P(F) plotted against F for F from −3 to 3, with curves labeled 4, 2, 1, .5, and −1.]

be needed again. The notation in Equation 10 is standard for regression coefficients.

The choice of a link function is motivated by the fact that for binary data, E{Xj | F = f} becomes a probability, which is bounded below by 0 and above by 1. If we suppose that the link function should be monotone, then we naturally choose a cumulative distribution function for γ. There are many distribution functions, but theory has been developed for just two—namely, the cumulative normal curve (the normal ogive) and the cumulative form of the logistic distribution (the logistic function). I will just write N(z) = N(aj + bj f) for the normal ogive—given by the integral from negative infinity to z of the normal density function:

n(t) = [1/(2π)^(1/2)] exp[−(1/2)t²],   (11)

and L(aj + bj f) for the logistic function—given by:

L(z) = 1/[1 + exp(−Dz)].   (12)

With D set equal to 1.701, these functions are virtually indistinguishable, but each has distinct and useful mathematical properties, and I will switch between them as needed. When I wish to refer to both, I will just write P(aj + bj f). Figure 7.1 can be regarded as a graphical representation of either.

There are three accepted ways to define the parameters of this model. The first we already have in Equation 10. I will refer to it as the regression parameterization. The second—Lord's (1980) parameterization—is conventionally written as:

P{Xj = 1 | F = f} = γ(aj*(θ − bj*)).   (13)

(But it is written without the stars, which I have added to distinguish it from my regression notation.) Lord's notation is firmly established—even to the use of θ for "ability"—but will not be used here. It has the advantage that the constant bj* corresponds to the position on the attribute scale where P(f) is 0.5. Its disadvantage is that it does not generalize to multidimensional models (or allow a simple notational distinction between a random variable and values of it). The regression parameterization in Equation 10 easily generalizes, as we will see.

The third parameterization comes from the work of Christoffersson (1975). He developed the normal ogive form of the model by supposing that an underlying tendency Xj* to give the keyed response follows the linear model (Equation 1) with a normal density function. The keyed response is given when Xj* is greater than a threshold value τj. This assumption leads to the parameterization:

P{Xj = 1 | F = f} = N[(λj f − τj)/ψj].   (14)

The chief advantage of this form of the model is that with a proper choice of origin and unit for F (see next section), the factor loadings λj are bounded by +1 and −1. They are much more stable than the corresponding parameters in the other forms, which range over the entire real line, and the loadings can be interpreted by the established standards of the common factor model. I will refer to Equation 14 as giving the item factor parameterization. The functions in Equations 10, 13, and 14 are variously referred to as item characteristic functions, item characteristic curves, and item response functions or curves.
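The equivalence of these link functions and parameterizations lends itself to a quick numerical check. The sketch below is not from the chapter; the item values in it are hypothetical. It verifies that with D = 1.701 the logistic and normal-ogive links agree to about 0.01 everywhere, and that the item factor parameterization of Equation 14 maps to the regression parameterization of Equation 10 via aj = −τj/ψj and bj = λj/ψj, taking ψj = (1 − λj²)^(1/2) for a standardized underlying response tendency:

```python
import math

def N(z):
    """Normal ogive: standard normal CDF (the integral of Equation 11)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def L(z, D=1.701):
    """Logistic function with scaling constant D (Equation 12)."""
    return 1.0 / (1.0 + math.exp(-D * z))

# With D = 1.701 the two links are virtually indistinguishable.
max_diff = max(abs(N(z) - L(z)) for z in [i / 100.0 for i in range(-400, 401)])

def item_factor_to_regression(lam, tau):
    """Map item factor parameters (lam, tau) of Equation 14 to the
    regression parameters (a, b) of Equation 10, assuming a standardized
    underlying tendency so that psi = sqrt(1 - lam**2)."""
    psi = math.sqrt(1.0 - lam ** 2)
    return -tau / psi, lam / psi

# The two parameterizations give identical response probabilities
# (hypothetical item: lam = 0.4, tau = -0.8, evaluated at f = 1.3).
lam, tau, f = 0.4, -0.8, 1.3
a, b = item_factor_to_regression(lam, tau)
p_factor = N((lam * f - tau) / math.sqrt(1.0 - lam ** 2))
p_regression = N(a + b * f)
```

The conversion is simple algebra on Equation 14, so the sketch only confirms consistency of the notation, not any property of real data.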

Although not an established convention, we can apply the same terms to the linear function in Equation 4. The constant in these models is referred to as the difficulty parameter (although this hardly applies to noncognitive items), and the slope parameter—the multiplier of F—as the discrimination parameter. The threshold parameter τj actually measures in the direction of difficulty for cognitive items, whereas aj measures in the direction of easiness. In Lord's parameterization, bj* can be called a location parameter. It is the location on the scale of the point of inflexion of the response curve. These parameters serve the same functions as in the counterpart linear model (Equation 4).

The model:

P(f) = L(aj + bj f)   (15)

is referred to as the two-parameter (2PL) model, a term pointing to its two item parameters and the choice of (L)ogistic function. We can write a one-parameter (1PL) model by equating the slope parameters, giving

P(f) = L(aj + bf).   (16)

This is a member of the family of models identified by the work of Rasch (1960). It is often referred to as "the" Rasch model. A three-parameter (3PL) model

P(f) = cj + (1 − cj)L(aj + bj f)   (17)

allows for the effects of guessing in multiple-choice items. It will not be considered further here.¹

As in the linear case, we can fit the model by the weak form of the Principle of Local Independence, using just bivariate relations between items. (See McDonald, 1982, 1997; Muthén, 1984.) These estimation procedures can be referred to as limited or bivariate information methods. Alternatively we can fit the model using the strong form of the principle, which for binary items reduces to the statement:

Prob{X1 = 1, X2 = 1, . . . , Xm = 1 | F = f} = Prob{X1 = 1 | F = f} × Prob{X2 = 1 | F = f} × . . . × Prob{Xm = 1 | F = f}.   (18)

This is a way of saying that the items are related in probability only through their relations with the attribute. Methods of estimating the item parameters using this strong form of the principle are referred to as full information methods (e.g., Bock & Aitkin, 1981). There is at present no clear advantage to fitting these models using the limited information from pairwise relations or the full information from patterns of responses. Each has advantages and disadvantages. (See McDonald, 1999, Chapter 12.) We certainly assume the strong form in applications of the fitted model.

Because the common factor model and the item response models developed independently, established terminology for the former refers to F as the common factor of the items whereas Item Response Theory calls it a latent trait or latent variable. From here I will refer to it as a latent trait in both the linear factor model and the item response models and regard it as a measure of the attribute in the metric supplied by the model. (See Metric section below.)

It might be desirable to give an example of an application of Equation 1 to a suitable set of quantitative data, and an application of Equation 10 to a set of binary data. But for brevity, and to exhibit parallels, I will deliberately apply both to a single, much-analyzed set: the LSAT6 data set. Responses from 1000 examinees to a section of a Law School Admissions Test (items 11–15 of Section 6) have been reanalyzed by a number of psychometric theorists (e.g., Christoffersson, 1975; Bock & Aitkin, 1981; Muthén, 1978). The items themselves have been lost. Table 7.1 gives the proportion pj passing each item, and the item variances and covariances. Table 7.2 gives the fitted parameters and the discrepancies between the matrix of sample covariances S and the matrix of fitted covariances (given by Equations 2 and 3), using a standard factor analysis program.²

Table 7.1. LSAT-6—Difficulties and Covariance Matrix

Item   pj      1        2        3        4        5
1      .924    .0702    .664     .524     .710     .806
2      .708    .0089    .2063    .418     .553     .630
3      .553    .0130    .0259    .2472    .445     .490
4      .763    .0050    .0120    .0231    .1808    .678
5      .870    .0021    .0132    .0089    .0142    .1131

(Diagonal entries are item variances; entries below the diagonal are covariances; entries above the diagonal are joint proportions passing both items.)

The corresponding normal ogive model, fitted to the same data set by the NOHARM program (McDonald, 1982, 1997), gives the parameters in Table 7.3, in the item factor and the regression parameterizations. Any direct relations between the parameters of the factor model in Table 7.2 and those of the normal ogive are not visible to inspection, but Table 7.4 gives the values of the fitted normal

ogive item response functions N(aj + bj f), with the corresponding values of the approximating linear functions μj + λj f from the factor Equation 4, in parentheses, for six values of f. In this example, the approximation is remarkably good.

Table 7.2. LSAT-6—Spearman Analysis

Item   λ̂j      ψ̂j²     Discrepancy matrix
1      .0605   .0665    .0       .0008    .0017    .0021    −.0024
2      .1345   .1882    .0008    .0       .0009    .0038    .0032
3      .1861   .2126    .0017    .0009    .0       .0012    .0050
4      .1174   .1670    .0021    .0038    .0012    .0       .0054
5      .0745   .1076    −.0024   .0032    .0050    .0054    .0

Table 7.3. LSAT-6—NOHARM Analysis

Item   τ̂j       λ̂j      aj       bj
1      −1.433   .381     1.549    .412
2      −0.550   .379     0.595    .410
3      −0.133   .478     0.152    .544
4      −0.716   .377     0.773    .406
5      −1.126   .345     1.200    .368

Some Test Theory Problems
We turn now to a series of problems that can be solved with the use of the models.

Metric
The attribute that we first aim to model, and then aim to measure, must be quantifiable in principle. By this I mean that it must have ordinal properties, admitting of "more" or "less." However, its metric—not only the origin and unit of measurement, but its entire calibration—is not given by data and generally must be imposed by the model. Imagine a meter stick drawn on a rubber sheet, with the millimeters able to be stretched or compressed at will. The units in which the attribute is measured are determined by the choice of (1) a link function and (2) an origin and unit. The distribution of F in the calibration population will depend on the choice of link function and be determined by it. As we will see, it is also possible to determine a metric by choosing a formula score—a function of the item scores. The simple sum of the item scores is an obvious choice but not the only possibility.

To calibrate the model, we fit it to the responses of a (hopefully large, hopefully random) sample of subjects from a suitable population. At least initially we choose the origin of the scale for the latent trait as the mean of F and the unit as its standard deviation. That is, we standardize the latent trait in the population chosen. (This simplifies Equations 2 and 3, setting Var{F} = 1.) The entire metric, and consequently the distribution of F, is then determined by the link function.³

If, in the course of developing the test items, a set is chosen that fits a model with equal coefficients of F, and the link function is monotone, then it can be shown that the difference f1 − f2 in F between two subpopulations is independent of the items chosen to measure them. This property was termed specific objectivity by Rasch (1960). It is sometimes thought of as a special and valuable property of the 1PL model and somewhat hyperbolically referred to as "item-free measurement" (e.g., Wright & Stone, 1979). However, specific objectivity is a property of any model, linear or nonlinear, in which the item slope parameters do not vary (see McDonald, 1999, p. 425). The claim is sometimes also made that the 1PL model, identified as the Rasch model, gives interval scale measurement—equality of attribute differences as measured by the model. However, the choice of distinct link functions for 1PL models gives distinct and mutually contradictory "interval" scales. To repeat, the metric of the measurements is imposed by the model chosen, and alternative metrics are always available. Note that specific objectivity does not make the items of a test interchangeable measures of the attribute. The error of measurement in a 1PL model depends on the difficulty parameter of the item, a fact that makes a problem for equating tests (see Alternate Forms and Test Equating).
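The fit reported in Table 7.2 can be checked by hand from the published values. With Var{F} = 1, Equation 3 says the fitted covariance of items j and k is λj·λk, and subtracting this from the sample covariance in Table 7.1 reproduces the discrepancy matrix; Equation 2 similarly recovers the item variances. A minimal sketch (a numerical check, not part of the chapter):

```python
# Loadings and unique variances as printed in Table 7.2;
# sample variances and covariances as printed in Table 7.1.
loadings  = [0.0605, 0.1345, 0.1861, 0.1174, 0.0745]
uniques   = [0.0665, 0.1882, 0.2126, 0.1670, 0.1076]
variances = [0.0702, 0.2063, 0.2472, 0.1808, 0.1131]
cov = {(1, 2): 0.0089, (1, 5): 0.0021, (2, 3): 0.0259}

def fitted_cov(j, k):
    """Equation 3 with Var{F} = 1: Cov{Xj, Xk} = lambda_j * lambda_k."""
    return loadings[j - 1] * loadings[k - 1]

def discrepancy(j, k):
    """Sample covariance minus fitted covariance (Table 7.2)."""
    return cov[(j, k)] - fitted_cov(j, k)

# Equation 2: each fitted variance is lambda_j**2 + psi_j**2.
fitted_var = [l * l + p for l, p in zip(loadings, uniques)]
```

Rounding the discrepancies to four decimals reproduces the entries of Table 7.2 (e.g., .0008 for items 1 and 2, and −.0024 for items 1 and 5).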

Table 7.4. LSAT-6—Normal-Ogive Item Response Functions

Item   f = −3       −2           −1           0            1            2
1      .62 (.74)    .76 (.80)    .87 (.86)    .94 (.92)    .97 (.98)    .99 (1.04)
2      .26 (.30)    .41 (.44)    .57 (.57)    .72 (.71)    .84 (.84)    .92 (0.98)
3      .07 (−.01)   .17 (.18)    .35 (.37)    .56 (.55)    .76 (.74)    .89 (0.93)
4      .33 (.41)    .48 (.53)    .64 (.65)    .78 (.76)    .88 (.88)    .94 (1.00)
5      .54 (.65)    .68 (.72)    .80 (.80)    .88 (.87)    .94 (.94)    .97 (1.02)

(Values of the approximating linear functions are in parentheses.)
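The entries of Table 7.4 can be reproduced from the regression parameters in Table 7.3 and the linear-model parameters in Tables 7.1 and 7.2; summing the item response functions over items also gives the test characteristic curve of Equation 29. A sketch (a numerical check, not part of the chapter):

```python
import math

def N(z):
    """Standard normal CDF (normal ogive)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Regression parameters (aj, bj) from Table 7.3.
a = [1.549, 0.595, 0.152, 0.773, 1.200]
b = [0.412, 0.410, 0.544, 0.406, 0.368]
# Means and loadings of the approximating linear model (Tables 7.1 and 7.2).
mu  = [0.924, 0.708, 0.553, 0.763, 0.870]
lam = [0.0605, 0.1345, 0.1861, 0.1174, 0.0745]

def irf(j, f):
    """Normal-ogive item response function N(aj + bj*f), Equation 10."""
    return N(a[j] + b[j] * f)

def linear_irf(j, f):
    """Approximating linear function mu_j + lambda_j*f, Equation 4."""
    return mu[j] + lam[j] * f

def tcc(f):
    """Sum of the item response functions: the test characteristic curve."""
    return sum(irf(j, f) for j in range(5))
```

Note that the linear approximation for item 3 dips just below zero at f = −3, illustrating the failure of the linear model at extreme trait values discussed below.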

To calibrate the model, we need to sample a carefully chosen population, but we do not need to use the population mean and standard deviation to define the scale. Alternatives are available that do not rest on the population chosen for their origin and unit. A simple alternative measure—long antedating IRT—is the raw sum or the mean of the item scores. The test score, the raw sum of the item scores,

Y = ΣXj,   (19)

and the mean score,

M = Y/m,   (20)

provide alternatives that many users of tests may justifiably prefer. For example, a psychiatrist may prefer a count of symptoms of a behavior disorder to a score on the latent trait.

In the linear model, from Equation 4,

E{Y | F = f} = Σμj + (Σλj)f,   (21)

or, from Equation 1,

Y = Σμj + (Σλj)F + ΣUj.   (22)

We can rewrite Equation 21 as

t = μY + λ.f,   (23)

where μY is the test score mean, and λ. the sum of the loadings λj. We can rewrite Equation 22 as:

Y = T + E = [μY + λ.F] + ΣUj.   (24)

Here, T is the true score of classical test theory, and E the error of measurement of the attribute by the total test score.

By Equation 23,

f = (t − μY)/λ.,   (25)

we can define

F_Y = (Y − μY)/λ.,   (26)

and rescale E into E_Y = E/λ., so that

F_Y = F + E_Y.   (27)

That is, the rescaled test score F_Y is a measure of the latent trait F with error of measurement E_Y, the error in F from a rescaling of Y. Note that the correspondence between T and F fails, and consequently the linear model fails, when values of F would give values of T outside the range of possible values of Y.

Before proceeding we should note that in Spearman's original true score model, as in Equation 24, the distinction between true score and error remained undefined for decades, and various devices were proposed to deal with the lack of definition (see McDonald, 1999, Ch. 5). If, as is usual, F has been standardized in the calibration population, Equation 23 gives a simple relation between the attribute so measured and the true score in sum score units, whereas Equation 24 gives an appropriate foundation for the classical model. This relationship justifies using the raw sum score or mean score as a measure of the attribute, each with an origin and unit that can have a direct and important meaning. Consider, as an example, a set of items with the same Likert-scale format, where a respondent's mean score might represent, say, modal disagreement with all of the statements measuring an attitude.

In any model with a nonlinear link function, we have, correspondingly, from Equation 19,

E{Y | F = f} = Σγ(aj + bj f).   (28)

For the case of binary items, Equation 28 becomes

t = E{Y | F = f} = ΣP(aj + bj f).   (29)

Equation 29 will give a nonlinear monotone relationship, with f unbounded on the real line and t bounded by 0 and the number of items. The graph of t on f is known as the test characteristic curve. The relation given by Equation 23 for the linear model is also a test characteristic curve, although not always recognized as such. In the LSAT6 data, as Table 7.5 shows, the test characteristic curve from

the linear model gives a remarkably good approximation to that from the normal ogive model in this example. Undoubtedly cases can be found in which the approximation is poor.

Table 7.5. LSAT-6—Test Characteristic Curves

f      −3      −2      −1      0       1       2
NO     1.82    2.51    3.23    3.88    4.39    4.72
Lin    2.10    2.67    3.24    3.82    4.39    4.96

Note. NO, normal-ogive; Lin, linear.

The metric given by the latent trait f is unbounded on the real line. We can regard Equations 28 and 29 as applicable to any function Y of the item scores. The sum score metric is bounded by 0 and m, and, an equivalent, the proportion keyed—the mean score—is bounded by 0 and 1. The metrics are related nonlinearly by the (formula score) test characteristic curve. As we will see, an important function of the model is to supply information about measurement error in the latent trait, in the sum or mean score, or in any function of the item scores. Because the attribute to be calibrated is not itself bounded, in any sense, it might seem that we should regard the metric given by the latent trait as more fundamental than that given by a (formula) score from a set of items chosen to measure it. Following Lord (1980), we would regard the test characteristic curve as exhibiting "distortions of mental measurement" (Lord, 1980, p. 49) resulting from the test score chosen.

When the metric of the scale has been determined by the choice of a link function, the distribution of F and of any formula score is determined by that choice and is not open to arbitrary assumption. McDonald (1967) gave a method for estimating the moments of the distribution of F from the distribution of the observations. This provides a simple test of departure from normality. Bartholomew and Knott (1999) have taken the view that the distribution is arbitrary along with the choice of link function.

Measurement and Error of Measurement (Reliability)
When we have calibrated the attribute by fitting a model, we may then wish to use the created test to assign a measure of the attribute to an examinee. We also wish to obtain a standard error of measurement as scaled by the model or put confidence bounds on the measurement. We may wish to obtain a measurement of an attribute from one or more members of the calibration population or examinees from a different population. This consideration does not affect the measurement process. I suggest as an axiom that a measure of a defined quantity (which requires a calibrated scale) equals the quantity to be measured plus an error of measurement. Also the variance of the measurement should equal the variance of the measured quantity plus the variance of measurement error. Spearman's true score model essentially expresses this axiom.

In the linear (factor) model, Equation 24 satisfies the axiom, with

Var{Y} = Var{T} + Var{E},   (30)

and from

Var{T} = (Σλj)²,   (31)

and

Var{E} = Σψj²,   (32)

we have

Var{Y} = Var{T} + Var{E} = (Σλj)² + Σψj².   (33)

Thus, the fitted factor model gives us the variance of the error of measurement from the unique variances of the item scores, and the true score variance from the factor loadings. Note that in this realization of Spearman's (1904a) classical true score model, the error of measurement arises in a single administration of the test and results from specific properties of the items, although it may include confounded errors of unrealized replication.

One of the oldest concepts in Test Theory is, of course, the reliability coefficient of a total test score, defined as the ratio of the variance of the true score to the variance of the total score, and conventionally denoted by ρYY. A somewhat neglected alternative concept is the reliability index, the correlation between the true score and the total score, denoted by ρYT. The two coefficients are related by

ρYY = ρYT².   (34)

However, for decades these quantities remained undefined, like the true score itself.

From the parameters of the factor model, the reliability coefficient is given by

ρYY = ρYT² = ω = λ.²/(λ.² + Σψj²).   (35)

This is also Var{F}/Var{F_Y}, the ratio of the variance of the factor score to the variance of its measure obtained by rescaling the total score in Equation 26.
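Equation 35 is immediate to compute from a fitted factor solution. A minimal sketch using the LSAT6 values as printed in Table 7.2 (a numerical check, not part of the chapter):

```python
# Loadings and unique variances as printed in Table 7.2.
loadings = [0.0605, 0.1345, 0.1861, 0.1174, 0.0745]
uniques  = [0.0665, 0.1882, 0.2126, 0.1670, 0.1076]

lam_dot   = sum(loadings)      # lambda., the sum of the loadings
true_var  = lam_dot ** 2       # Var{T} = lambda.**2   (Equation 31)
error_var = sum(uniques)       # Var{E} = sum of psi_j**2 (Equation 32)

# Equation 35: reliability of the total score (coefficient omega).
omega = true_var / (true_var + error_var)
```

With these rounded inputs, omega comes out at about 0.307.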

Coefficient omega, defined by Equation 35, was originally given by McDonald (1970). From the parameters of the linear model in Table 7.2, it is easy to compute coefficient omega. It is 0.307, which, of course, is very low.

Guttman (1945) gave a lower bound to the reliability of a total test score, which was then further studied by Cronbach (1951) and is commonly known as Cronbach's alpha. I prefer to acknowledge Guttman's original contribution and call it the Guttman-Cronbach alpha. It is defined by

α = [m/(m − 1)][1 − (ΣVar{Xj})/Var{Y}].   (36)

A sample estimate of G.-C. alpha is still commonly used for reliability, although Guttman (1945) clearly showed that it was a lower bound. Novick and Lewis (1967) showed that alpha gives reliability if the items are true score-equivalent, meaning that the items fit the model

Xj = T + Ej,   (37)

and McDonald (1970) showed further that it gives reliability if and only if the factor loadings in Equation 1 are equal. In applications, G.-C. alpha is often a very good lower bound to coefficient omega. In the LSAT6 data, from Table 7.1 we obtain a value for alpha of 0.295, very little less than omega. The case for using omega rests on the fact that it is a simple byproduct of a prior analysis that gives other useful results, as will be seen.

A reliability coefficient is not an end in itself. From its origin, it was a device for overcoming the problem of replicating test scores, to obtain a standard error of measurement. The measurement error variance can be expected to be approximately invariant over populations, whereas the reliability varies widely (with the true score variance) from population to population. This variability can be seen in lists of reliability estimates from different populations, recorded in handbooks of tests and measurements.

The simple sum score, possibly rescaled to a mean by dividing by m, or rescaled to the metric of the common factor by Equation 26, is not the best measure of the attribute. With scaling to latent-trait metric (and latent-trait variance 1), the weighted sum

F_B = Σwj(Xj − μj),   (38)

with

wj = [1/(Σλj²/ψj²)](λj/ψj²),   (39)

gives a measure (resulting from Bartlett, 1937),

F_B = [1/(Σλj²/ψj²)] Σ(λj/ψj²)(Xj − μj) = F + E_B,   (40)

with minimum error variance

Var{E_B} = 1/(Σλj²/ψj²),   (41)

and maximum reliability coefficient 1/[1 + Var{E_B}]. For the LSAT6 data, the maximum reliability, given by these weights, is 0.309—hardly an improvement on 0.307 from the simple sum score.

The reciprocal of the error variance in Equation 41,

I = Σ(λj²/ψj²),   (42)

is a sum of m independent terms, one for each item. Each makes a separate contribution to the reduction of the measurement error. The more informative items are those with the largest ratio of squared loading to unique variance. We can take the test information to be defined by this reciprocal and the item information to be the contribution of each term in Equation 42.⁴ The usefulness of the information concept for us rests on the additivity of these terms, enabling, as we will see, the independent selection of good items. The weights given by Equation 39 minimize the measurement error variance and maximize the information and reliability, among all possible weighted sums of the items.⁵ The maximum reliability can be written in terms of information as I/(I + 1).

The raw sum score is an equally weighted sum of the item scores. Scaled to F_Y as in Equation 26, it has error variance

Var{E_Y} = Σψj²/(Σλj)²,   (43)

and test score (sum score) information

I_Y = (Σλj)²/(Σψj²).   (44)

The ratio

RE = I_Y/I = Var{E_B}/Var{E_Y},   (45)

which is the ratio of the information in the simple sum score to the maximum information in the test (given by Equation 38 with Equation 39), is the relative efficiency of this test score, necessarily less than 1. The relative efficiency of the sum score for the LSAT6 is 0.994. There are other possibilities—for example, we can obtain the relative efficiency of sum scores from a subset of items.⁶

Given the error variance, we have the corresponding standard error of measurement as its square root. By the Central Limit Theorem, the error of measurement will approach a normal distribution as the number of items becomes large, because it is a sum of m independent unique components.

as the number of items becomes large, because it is a sum of m independent unique components. We can then put confidence bounds on an examinee's score by standard methods, without imposing distribution assumptions on the model. Using the linear model for the LSAT6 gives an error variance Var{E_Y} = Σ ψ_j² of the raw score Y equal to 0.742 and a standard error of measurement of 0.861. Ninety-five percent confidence bounds on an examinee's score of 3 would be 3 ± 1.96 × 0.861—that is, 1.31 and 4.69, which nearly covers the range (0 to 5) of the scale. The usefulness of the LSAT6 clearly lies in the pleasures of analysis it has given to psychometric theorists rather than in its precision as a measuring instrument. We can rescale these numbers to the scale of the latent trait, from Σ λ_j = 0.573 and the test mean μ_Y = Σ μ_j = 3.818, giving f_Y = (3 − 3.818)/0.573 = −1.43, with error variance 0.742/0.573² = 2.260, standard error of measurement 1.503, and confidence bounds −1.43 ± 1.96 × 1.503—that is, −4.377 and 1.517.

On the face of it, the linear (factor) model makes the strong assumption that the errors of measurement are homoscedastic. This assumption is easily tested and will very commonly be found to be false. A classical method for testing it results from Mollenkopf (1949) and Thorndike (1951). (See also Lord, 1984, Feldt et al., 1985, and Qualls-Payne, 1992.) The principle, which is capable of refinement, is: We split the items into two parallel halves, and plot the variance of the difference (which estimates error variance) against the sum (which estimates the true test score). We can call this a practical or empirical method. A method based on IRT is given later. When the item response model fits, there should be little to choose between the methods, and the model-based method has the best theoretical justification. The resulting standard error of measurement, a function of test score, is referred to as a conditional standard error of measurement.

In a nonlinear model, the item information and the test information—and hence the corresponding measurement error variance—are, by theory, functions of the position of the examinee on the scale. We can use the model to define a true score and a conditional error of measurement for any formula score—any quantity calculated from the item scores. Let S be any such score. Then a corresponding true score t_S is given by, for example,

t_S = E{S|F = f} = g(f). (46)

The error of measurement of the formula score is E_S = S − t_S. Then

S = T_S + E_S = g(F) + E_S. (47)

If the function g(f) is invertible, then formally there is a nonlinear transformation

g⁻¹(S) = F + E_f^S = F + g⁻¹(E_S). (48)

Thus, g⁻¹(S) is a measure of F, with error of measurement E_f^S = g⁻¹(E_S). Now suppose we take

S_W = Σ w_j X_j, (49)

a weighted sum with fixed weights. (With weights equal to 1, this gives the simple raw sum.) Then,

T_W = E{S_W|F = f} = Σ w_j P_j(F), (49a)

and

E_W = Σ w_j X_j − Σ w_j P_j(F). (49b)

Here T_W and E_W represent the true and error components of the weighted sum S_W given by Equation 48. Then,

Var{E_W} = Σ w_j² P_j(1 − P_j). (50)

If we intend to measure the attribute in the metric of the chosen formula score (the chosen weighted sum), then this last result is all we need. A common choice will be the raw sum score with

Y = T_Y + E_Y, (51)

where

Var{E_Y} = Σ {P_j(f)[1 − P_j(f)]}. (52)

This result supplies the conditional standard error of measurement of the sum score, based on the item response model. As remarked already, when the model fits, model-based conditional standard errors will agree closely with the empirical results from classical methods.

A measure of the latent trait f for any individual can be obtained from her/his formula score s_W in Equation 48 by equating it to its expectation, writing s_W = t_W—that is,

Σ w_j x_j = Σ w_j P_j(f). (53)

Equation 53 can be solved for a measure f_W by plotting t_W against f, and finding the point where the equality is satisfied—or by an equivalent computer program. This corresponds to applying Equation 47. Then,

F_W = F + E_f^W, (54)

mcdonald 127
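The LSAT6 standard-error arithmetic quoted above can be checked numerically. The following is an illustrative sketch, not part of the original analysis; it takes the reported linear-model quantities (Σψ_j² = 0.742, Σλ_j = 0.573, μ_Y = 3.818) as given inputs.

```python
import math

# Numerical check of the LSAT6 error-of-measurement arithmetic quoted above.
# Inputs are the linear-model values reported in the text.
var_e_raw = 0.742                      # Var{E_Y} = sum of psi_j^2 (raw-score metric)
sem_raw = math.sqrt(var_e_raw)         # standard error of measurement
lo = 3 - 1.96 * sem_raw                # 95% bounds on an examinee's score of 3
hi = 3 + 1.96 * sem_raw

sum_lambda = 0.573                     # sum of the factor loadings
mu_y = 3.818                           # test mean, sum of the item means
f_y = (3 - mu_y) / sum_lambda          # score rescaled to the latent-trait metric
var_e_f = var_e_raw / sum_lambda ** 2  # error variance on that metric
sem_f = math.sqrt(var_e_f)

print(round(sem_raw, 3))               # 0.861
print(round(lo, 2), round(hi, 2))      # 1.31 4.69
print(round(f_y, 2), round(var_e_f, 3), round(sem_f, 3))  # -1.43 2.26 1.503
print(round(f_y - 1.96 * sem_f, 2), round(f_y + 1.96 * sem_f, 2))
```

The last line recovers the latent-metric confidence bounds (about −4.37 and 1.52; the text's −4.377 and 1.517 reflect rounding of the intermediate values).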

where F_W is the measure of F given by Equation 53. The corresponding error of measurement, E_f^W, has variance

Var{E_f^W} = [Σ w_j² P_j(f)(1 − P_j(f))]/[Σ w_j P′_j(f)]², (55)

where P′_j(f) is the gradient of P_j(f). Any choice of the weights will give a measure of f for each individual, from Equation 53, and a corresponding variance of measurement error, given by Equation 55.

If we set weights equal to the discrimination (slope) parameters of the items, then the variance of the measurement error is the minimum possible from all choices of weights. This is the choice w_j = b_j in the regression parameterization (Equation 10), a*_j in Lord's parameterization (Equation 13), or λ_j/ψ_j in the item factor parameterization (Equation 14). In Lord's (1980) original account of these results, Equation 13 is used. Here it is convenient to use Equation 14, to exhibit the relation to the linear model. That is, the formula score

s = Σ (λ_j/ψ_j)(x_j − μ_j) (56)

gives a measure f_b of f for any individual from the solution f_b of

Σ (λ_j/ψ_j)x_j = Σ (λ_j/ψ_j)P_j(f). (57)

It has the property that

F_b = F + E_b, (58)

with

Var{F_b} = Var{F} + Var{E_b}, (59)

where Var{F} = 1 and

Var{E_b} = Var{F_b|F = f} = 1/I(f). (60)

Here,

I(f) = Σ [P′_j(f)]²/[P_j(f)(1 − P_j(f))], (61)

and P′_j(f) is the gradient of P_j(f). As in the linear case, I(f) is the information, the reciprocal of the error variance. The choice of these weights minimizes the error variance and, equivalently, maximizes the information from the data. We notice that Equation 61 parallels Equation 41 for the linear model. The measures f_b are nonlinear counterparts of the Bartlett scores given by Equations 37 and 38. The information is the sum of terms that constitute item information—the contribution of each item to the reduction of error.

There is no counterpart of the reliability coefficient in the nonlinear model. This is not a defect, from the modern perspective. However, Raju et al. (2007) have suggested inventing a conditional reliability for a sum score, defined as the unit complement of the ratio of the conditional variance of the error of measurement to the total variance. The intended use of this index is to compare the conditional precision of two tests of the same attribute. It is not clear whether such an index has any advantages over the use of relative efficiency as suggested by Lord (1980) for such purposes.

Unlike the linear case, the item information is a function of f. Any other formula score must give less information and a greater measurement error variance than the solution of Equation 57—a counterpart of Bartlett's minimum error measure, Equation 37 with 38—at every point on the scale of f. In particular, the sum score, Y = Σ X_j, with true score T_Y = Σ P_j(F), gives a conditional error variance of Y

Var{E_Y} = Σ [P_j(f)(1 − P_j(f))], (62)

and conditional error variance of the measure of f

Var{E_f} = [Σ P_j(f)(1 − P_j(f))]/[Σ P′_j(f)]², (63)

parallel to Equation 37. At every point of the scale, this must be greater than the minimum variance given by Equation 57. In applications, the difference may be small, and for some purposes it would be reasonable to use the total test score as a measure, or its transformation onto the latent-trait scale. The information function and its reciprocal—the error variance—change radically under transformations of the scale. For example, at the floor and ceiling of the test, the error variance of the latent trait becomes infinite, whereas that of the true score becomes zero. The unbounded metric shows clearly that we cannot get good estimates for examinees for whom the test is too easy or too difficult. At these extremes, the true score is constrained from changing in response to the attribute and conveys no information about it.

For the LSAT6 data, Table 7.6 gives the item information functions. Table 7.7 summarizes the further results relevant to this section. These are: (1) TCC: the test characteristic curve—the sum of the item characteristic curves; (2) I(f): the test information function—the sum of the item information functions; (3) Var{E_b}: the minimum error variance available from the test; (4) S.E.M.(f): the Standard Error of Measurement from f_b; (5) the TCC from the linear model; and (6) Var{E_f^Y}: the (constant) error variance of F from the linear model. We observe that all the items are easy for this population, and we have lower standard errors of measurement (more information) for examinees of low ability than

Table 7.6. Item Information Functions

                              f
Item     −3      −2      −1       0       1       2       3
1       .114    .086    .054    .031    .016    .008    .004
2       .092    .117    .118    .095    .063    .037    .020
3       .059    .120    .192    .210    .154    .083    .038
4       .104    .119    .108    .080    .050    .028    .015
5       .097    .084    .062    .040    .024    .013    .008

for examinees of high ability. For some purposes, this might be desirable, whereas for others it would be a defect of the test. We can immediately see how we could select items from a large calibrated set to form a test information curve of desired form. This is the subject of the next section.

Over the last couple of decades or so in psychometric theory, there has been a general movement away from the method of maximum likelihood to Bayesian estimators, both for parameters of the items and for predicting values of the attributes of examinees (see, for example, Bartholomew & Knott, 1999). I need to point out that there remains some confusion in terminology about the "estimation" of latent traits, with no distinction made between measurement and prediction, and some writers loosely referring to obtaining, assessing, constructing, or finding "proxies" for factor scores—values of individual latent traits. To discuss the Bayesian treatment, I find it convenient to distinguish measurement, as treated so far, and prediction.

My treatment of measurement and errors of measurement has been free of distribution assumptions. Rather than measures of factor scores/latent variables, in the sense employed here, we can ask for best predictors of them from any information we have about the subjects. In the Spearman model, given no information about a subject beyond his/her item scores, we can use the regression of the latent variable on the item scores as a best linear predictor, minimizing the residual variance. This predictor, given by Thomson (1934), takes the form

f_T = E{F|X_1 = x_1, …, X_m = x_m} = [1/(1 + I)] Σ (λ_j/ψ_j²)(x_j − μ_j). (64)

Thomson's predictor decomposes the measured quantity F into orthogonal components, giving

F = F_T + E_T, (65)

with

Var{F} = Var{F_T} + Var{E_T} = I/(1 + I) + 1/(1 + I) = 1. (66)

This may be contrasted with the Bartlett measure, which decomposes the measure F_B into orthogonal components

F_B = F + E_B, (67)

with

Var{F_B} = Var{F} + Var{E_B} = 1 + 1/I. (68)

We note also that

Var{F_T} = 1/Var{F_B} < Var{F} < Var{F_B}, (69)

Var{E_T} < Var{E_B}, (70)

[Table 7.7 omitted: only the column header (f = −3 to 3) and the label "Linear Approximation" survive in this copy.]
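The contrast between the Thomson (regression) predictor and the Bartlett measure can be sketched numerically. In this sketch the LSAT6 linear-model test information I = 0.448, quoted in the item-selection discussion, is taken as a given input; the rest follows Equations 66 to 70.

```python
# Contrast of the Thomson (regression) predictor and the Bartlett measure
# for the Spearman model, using the LSAT6 test information I = 0.448
# quoted in the text (an assumed input here).
I = 0.448                 # test information, sum of lambda_j^2 / psi_j^2

var_f_t = I / (1 + I)     # Var{F_T}, Equation 66: shrunken predictor variance
var_e_t = 1 / (1 + I)     # Var{E_T}, prediction-error variance
var_f_b = 1 + 1 / I       # Var{F_B}, Equation 68: inflated measure variance
var_e_b = 1 / I           # Var{E_B}, minimum measurement-error variance

assert abs(var_f_t + var_e_t - 1.0) < 1e-9  # Equation 66: components of Var{F} = 1
assert abs(var_f_t - 1 / var_f_b) < 1e-9    # Equation 69: Var{F_T} = 1/Var{F_B}
assert var_f_t < 1 < var_f_b                # Equation 69: ... < Var{F} < Var{F_B}
assert var_e_t < var_e_b                    # Equation 70

print(round(var_f_t, 3), round(var_f_b, 3))  # 0.309 3.232
```

Note that Var{F_T} here equals the maximum reliability I/(I + 1) mentioned earlier, which for the LSAT6 is about 0.309.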

and

E{F_T|F = f} = [I/(1 + I)]f. (71)

The regression predictor is a shrunken, conditionally biased estimator of the latent variable. The standard error of prediction is less than the standard error of measurement. It is not presently clear what would motivate the choice between measurement and prediction in applications of the model. Prediction appears to be the current "default" option.

If it is found empirically that the latent trait has a normal distribution, then the regression predictor is also a Bayes predictor (see Bartholomew & Knott, 1999). In the corresponding 2PL model, if it turns out that empirically the latent trait has a normal distribution, then we can obtain a Bayes predictor by methods described in Bartholomew and Knott (1999) and Skrondal and Rabe-Hesketh (2004). The Bayes predictor is, again, a shrunken, conditionally biased estimator.

Item Selection

In developing a test, it is common practice to try out a set of items on a calibration sample, of which only a subset will be kept for actual application—measuring examinees with the final calibrated test. We need a convenient way to choose a "best" subset. It may also happen that the items under trial seem insufficient to measure the attribute well enough, and we wish to know how many more need to be written.

The conception of our freedom to shorten or lengthen a test measuring a given attribute contains, at least implicitly, the recognition that the attribute is not "operationally" defined by just the set of items chosen to measure it. The possibility of shortening or lengthening a test for an attribute rests on an idealization. In effect, we suppose that the items written come from a quasi-infinite set of items that would, if written and administered, define and measure the attribute precisely. Such a quasi-infinite set has been called a behavior domain, or a universe of content. I prefer to call it an item domain. Although it will virtually never be the case that the m items we have are a random sample from an item domain, it is necessary to think of them as obtained from it and to take the true score or latent trait to be determined by it. The limit results justifying this do not depend on random sampling. They do depend on the strong assumption that we know how to realize indefinitely more items measuring just the attribute we intend to measure. This requires very careful conceptualization.

In Classical Test Theory, a large number of heuristic devices for item selection have been developed, with varying degrees of theoretical motivation. These are no longer needed (see McDonald, 1999, Chapter 11). In the linear case, with a set of items fitting a single-factor model, we see immediately that a best subset of m items would contain the items with the largest information values, λ_j²/ψ_j², thus yielding the smallest error variance and largest reliability for a given number of items. As an example in miniature, if we wanted the best three items from the LSAT6, using the linear model as approximating it well enough, then we would take items 3, 2, and 4, with information 0.163, 0.096, and 0.082, respectively, rejecting items 1 and 5, with information 0.055 and 0.052, respectively. Keeping the first three gives test information 0.341 and error variance 2.93, to be compared with information 0.448 and error variance 2.232 from the full set.

The nonlinear counterpart procedure is more complex. We use the item information functions to select a subset of items giving a desirable test information function, recognizing that we may not be able to minimize conditional error variance at all levels of the latent trait, and may wish to have "small" error variance in specified intervals. This depends on the purpose of the test. A careful study of the item information functions in Table 7.6 and the way they give different orders at different points of the scale will indicate the possibilities and difficulties of the task.

Of course, the principles just adduced for eliminating items from a set apply also to adding items of known characteristics. For the problem of "prophesying" the effect of adding items that are not yet available, Modern Test Theory adds little to the classical treatment. The Spearman-Brown classical "prophesy" formula (Spearman, 1910; Brown, 1910) allowed the prediction of the reliability of a lengthened test, of r items, from the reliability of a test of m items. It required the very strong condition that the items in the given test be parallel. In Classical Test Theory, this condition requires that each measures a shared true score with equal error variance. Interpreting this in terms of the linear factor model, the condition is that each of the m items has the same factor loading and the same unique variance. An assumption is also needed that the additional items have the same loading and unique variance.

The one mild advantage of applying the factor model over the classical treatment is that the condition and assumption can be weakened considerably. If the m given items fit the Spearman model, we can

define an approximate reliability for them as

ω = λ.²/(λ.² + ψ.²), (72)

where λ. is the sum of the loadings and ψ.² the sum of the unique variances. The Spearman-Brown formula is derived in Classical Test Theory as a special case of G.-C. alpha, and can also be derived from the Spearman model with equal loadings and equal unique variances (see McDonald, 1999, Chapter 6). The formula is given by

ρ_m = mρ_1/[(m − 1)ρ_1 + 1]. (73)

This is an expression for the reliability of a test of m parallel items from the reliability of just one of them. We can use it with Equation 72 to obtain the reliability of a projected test of r items with the strong condition eliminated, and the assumption weakened to the hope that the added items have the same average loadings and the same average unique variances as the given set. This is still a strong demand on the item writer.

There is no clear strategy in the conceptually parallel model for binary data, allowing the investigator to predict the number of items needed to meet a criterion for error variance. For this purpose, it is not unreasonable to use the linear model as an approximation and apply the modernized classical formula. This is perhaps a little better than consulting a crystal ball, and perhaps the question is not of great importance.

The most important function of the behavior domain concept is that it serves to determine the latent trait or true score as the score on a test of infinite length. The behavior domain gives a clear distinction between these scores and a measure of them from a chosen set of items. In Lord and Novick's account, there is room for alternative treatments of a true score as either the score on a test of infinite length or as the mean of a "propensity distribution." A propensity distribution is the distribution of the score obtained when one examinee is brainwashed to forget previous responses and retested many times under ideal conditions (but see McDonald, 2003).

Homogeneity and the Dimensionality of Tests

An unexamined assumption of the previous discussion is that the items are measures of just one attribute. In the classical period, this was discussed as the question of test homogeneity. In terms of the Greek root, the question is whether the items are of the same (homo-) kind (genos). In that period, a remarkable number of indices or heuristic devices were invented to measure or test the extent to which a set of items is homogeneous. These were based on rather intuitive notions of what homogeneity means. Hattie (1984, 1985) studied the behavior of 87 of these and found that only one could be recommended. As expected from McDonald (1981), this exception was based on a check to see whether the item scores fit a model with a single latent trait. Indeed, we may now take it that what was always intended by the term "homogeneous test" is one whose item scores fit a model with a single latent trait. This follows from our identification of the latent trait in the mathematical model with the attribute as measured by the item responses.

In the early literature, psychometric theorists often treated the single common factor and the m unique factors as on the same conceptual level and described the model as containing m + 1 factors. This way of expressing it makes an ambiguity over what we should mean by the dimensionality of a test.7 Writing the linear model as the expected value of an item score for fixed factor score, as in Equation 4, and the item response model as the corresponding Equation 10, we regard f as the single dimension on which the attribute varies.

We now consider the possibility of writing alternative p-dimensional models. I will just illustrate this possibility with two-dimensional models, writing

E{X_j|F_1 = f_1, F_2 = f_2} = μ_j + λ_j1 f_1 + λ_j2 f_2, (74)

with

U_j = X_j − E{X_j|F_1 = f_1, F_2 = f_2}, (75)

a two-dimensional model for quantitative data, or

Prob{X_j = 1} = P(a_j + b_j1 f_1 + b_j2 f_2), (76)

a two-latent-trait 2PL model for binary data, in regression equation notation. This can also be written as

Prob{X_j = 1} = P[(λ_j1 f_1 + λ_j2 f_2 − τ_j)/ψ_j], (77)

the obvious extension of Equation 10.

The extension of the factor model to multiple dimensions has a long history. The history of the corresponding latent trait models is short, and unfortunately it has not always been recognized that multidimensional item response models require the techniques invented for multiple-factor models if we are to fit and interpret them in applications to real data. The latent traits in models of this kind are correlated, and each of the latent traits is standardized. The parameters of the linear model are

the item means μ_j, the factor loadings λ_j1 and λ_j2, the unique variances ψ_j², and, for this model, the correlation between the latent traits, which we will write as φ_12. Correspondingly, the parameters of the two-dimensional item response model are a_j, b_j1, b_j2, and the correlation of the latent traits φ_12. In two or more dimensions, in addition to the problem of choosing an origin and unit for each axis, we face the very old problem of choosing an orientation of the axes in the two- or p-dimensional space—the rotation problem of classical factor analysis. In exploratory studies, this problem has usually been solved by fitting the model on some convenient basis and then transforming the factor loadings to Thurstonian simple structure, with uncorrelated (orthogonal) factors or correlated (oblique) factors (see, for example, Mulaik, 2010).

In the context of test construction, we write items to measure an attribute and should not need to use exploratory methods. Even so, in the early stages of developing and calibrating a test, the conceptual denotation of the attribute may be unclear, and it may be that the attribute is conceived at a high level of abstraction from behavior, and the domain of possible items divides into subdomains. The paradigm case is the set of correlated primary mental abilities into which Thurstone divided Spearman's general "intelligence"—scholastic ability. If the items are written so that they fall neatly into the subdomains, they form clusters that, if fitted separately, are homogeneous and fit the unidimensional models given by Equations 1 or 10. Jointly, they fit multidimensional models here represented by Equations 75 and 77, with correlated factors, and an item with nonzero factor loadings/slope parameters on one latent trait has zero factor loadings on the other. The items are said to be factorially simple, belonging clearly to just one attribute. This case is commonly referred to as having independent clusters (see McDonald 1999, Chapter 9).

We may not succeed in creating pure clusters of items measuring just one latent trait. Some items may be factorially complex, with nonzero loadings on two or more factors. Although pure clusters are a desirable goal of measurement, at least the aim should be to create enough factorially simple items for these to form a basis for analyzing the measurement properties of the complex items and possibly eliminating them. (An example following should make these statements clearer.) Without such a basis, we cannot be sure what we are measuring (McDonald, 1999, Chapter 9, calls this case an independent clusters basis).

Reckase (2009) has provided a very different treatment of multidimensional item response models. Reckase seeks to describe the multidimensional space without reference to the methods developed in common factor modeling to determine what is measured. This interesting development awaits evaluation.

In the linear Equation 4, and the nonlinear counterpart in Equation 10, if the items form independent clusters, then a measure of each latent trait with minimum error variance is given by the corresponding cluster of items, with the same expressions (Equations 37 and 65) as for the unidimensional case. Even if the factors are highly correlated, the errors of measurement are uncorrelated. I suggest that we call this the case of pure measurement, with each attribute measured by just its own items and with uncorrelated measurement errors. This is an important property, because generally we would not wish to have the measurement of one ability, say, affected by items measuring another correlated ability.

Some writers would reject the distinction made here between measurement and prediction, regarding both as "estimation." Bartholomew and Knott (1999), for example, suggest the use of Bayes predictors in place of the measures I recommend. The Bayes predictors have the property that the predicted value of one attribute of an examinee is increased or decreased by a high or low value of another. Thus, being good at English improves an examinee's mathematics score. This effect can be described as "measurement contamination" or "borrowing strength from extraneous information," depending on whether we wish to measure or to predict. If, indeed, the intention is to predict the value of an attribute, then we can use items belonging to related attributes and any other information about the examinee (e.g., socioeconomic status, educational history, etc.) that is relevant to prediction. My view is that prediction from extraneous information is not measurement, and it would require a distinct research motive. On the face of it, measurement models are designed for the purpose of measurement.

As an example of a multidimensional model, I will give a brief account of an example in McDonald (1999, Chapter 14). Fifteen items taken from the ACT Mathematics Test have the item stems listed in Table 7.8. It seems reasonable to describe items 1 through 5 as measuring geometry achievement, and items 6 through 10 as measuring algebra achievement. Items 11 through 15 are less readily

Table 7.8. Fifteen ACT Mathematics Items

Item    Item Stem
1       Angles in a right triangle
2       Areas of bisected triangles
3       Length hypotenuse—right triangle
4       Length adjacent—right triangle
5       Area trapezoid
6       2√28 + 3√175
7       1/(√2 − 1)
8       (−3)² + 3⁻²
9       x, for which [x(x − 2)]/[(x − 1)(x − 2)] is undefined
10      2² + 2⁰ + 2⁻²
11      Application of 73/3 + 17.85 + 61/2
12      Slope of line 2x + 3y + 6 = 0
13      Radius of circle given circumference
14      Speed given distance and time
15      Longest diagonal in box

Table 7.9. ACT Independent Clusters Basis Solution

             Loadings
Item       I        II      Uniqueness
1        .766               .413
2        .642               .588
3        .451               .814
4        .604               .636
5        .485               .765
6                 .439      .809
7                 .502      .650
8                 .386      .851
9                 .666      .556
10                .388      .849
11       .365     .367      .534
12       .355     .363      .551
13       .358     .349      .567
14       .223     .436      .615
15      −.335     .548      .859

classified. We fit the model given by Equation 78, specifying a corresponding pattern of zero and nonzero coefficients, allowing the last group of items to be, possibly, composites of geometry and algebra. The fitted parameters are given in Table 7.9.8 The correlation between the latent traits is 0.739. The last five items appear to be fairly equally balanced combinations of geometry and algebra abilities. The clusters formed by the first two sets give simple item characteristic curves and the subtest characteristic curves given in Table 7.10. The complex items in the last group give the subtest characteristic surface tabulated also in Table 7.10. The simplicity of the basis supplied by the pure clusters is what gives us an understanding of the complex items in the last group. Without the basis for interpretation supplied by the geometry and algebra items, it would be difficult, if not impossible, to determine what we are measuring and clearly impossible in models of higher dimensionality. The information functions and error variances have the same structure, with uncorrelated errors for formula scores from the first two sets and high correlations between the errors of measurement of the complex items. We can have pure measurements by keeping only the first 10 items.

Validity

The classical problem of determining the extent to which a test score is a valid measure of an attribute remains with us, although not quite in its classical form. I accept as a definition the statement: "A test score is valid to the extent that it measures the attribute of the respondents that it is employed to measure, in the population(s) in which it is used."

Early validity theory was influenced by an extremely behaviorist psychology and a logical positivist philosophy of science. The attributes we wish to measure were regarded as invented, convenient, fictional "constructs." This view led to a recognition of three forms of validity—namely, predictive validity, concurrent validity, and content validity, with hesitant admission of a fourth—"construct" validity. The many predictive validities of a test were its abilities to predict external measures. Concurrent validity required correlating the test with another test of

[Table 7.10, ACT Test Characteristic Curves: only the column labels s1, s2, s3 survive in this copy.]

the same name. Content validity, regarded with suspicion by behaviorists, rested on the "subjective" judgment that the item contents were indicators of a common attribute.

Largely through the work of Cronbach and Meehl (1955), and that of Messick (1989), the common view now seems to be that there is one concept of validity, still called construct validity (out of pure habit), with predictive, concurrent, and content validity seen as evidence of it. Validation would now include all forms of evidence that the score is an acceptable measure of a specified attribute. My account will be limited to those forms of evidence that rest on the models we are considering as a framework for applications of test theory.9

It is possible to take the view that primary evidence of the validity of a test score comes from establishing that the test is homogeneous in two senses—that the test is unidimensional and that the item contents are judged to be measuring an abstract attribute in common. If we regard the attribute as what is perfectly measured by a test of infinite length, then a measure of validity can be taken to be the correlation between the total test score and the true, domain score. The redundant qualifier "construct" can be omitted, and we simply refer to "validity."

In the case of the Spearman model, the validity coefficient is just the reliability index given by the square root of coefficient omega. We can also call this index a coefficient of generalizability, from the m items in the test to the infinite set given by the item domain. Thus, on this view, reliability is validity is generalizability, which is a great simplification (see McDonald, 1985, 1999).

In the case of binary items, the error variance is a function of the latent trait, so it seems impossible to define an overall reliability index. However, using the normal ogive model, we can usefully define a validity and generalizability index for a set of binary items. The biserial correlation between a standardized variable Z and a binary variable X is the correlation between Z and an underlying response tendency X*, such that X = 1 if X* exceeds a threshold value τ. It can be shown that Cov{X, Z} = n(τ)Cor{X*, Z}, where n( ) is the ordinate of the normal density function. In the normal ogive model in Equation 14,

Cov{X*_j, F} = λ_j,

so

Cov{X_j, F} = n(τ_j)λ_j,

giving

Cor{Y, F} = [Σ n(τ_j)λ_j]/[Var{Y}]^(1/2). (78)

Evidence of validity comes from an examination of what the test does not measure as well as what it does. Suppose a set of items fits the multiple factor Equation 75 with independent clusters and correlated factors. Each cluster gives a subtest sum that should be correlated with its own factor, but it may also have a high correlation with the factors belonging to other clusters. The independent clusters model supplies the natural explication of a somewhat intuitive suggestion by Campbell and Fiske (1959). They suggested that multiple measures of a construct have convergent validity if they are sufficiently highly correlated and discriminant validity if they have sufficiently low correlations with tests of other, distinct constructs. The main contribution of Campbell and Fiske (1959) was the suggestion that any item can be regarded as a trait-method unit—a union of a particular trait-content and a method of measurement. To segregate trait from method,

a number of methods in a crossed (multitrait–multimethod) design. This rather casual suggestion has spawned a large and confusing literature. The position I take is that multitrait–multimethod designs have not yet been shown to contribute to convergent/discriminant validity.

To quantify convergent and discriminant validity by the independent clusters model, we compute (1) the correlation between each cluster sum and its own factor, and (2) the correlation between each cluster sum and the other factors. We hope to find the former high for convergent validity and the latter small for discriminant validity. The correlation of each cluster-sum with its own factor is just its reliability index, the square root of omega. The correlation of each with the other factors is just its reliability index multiplied by its correlation(s) with the other factor(s). This holds for binary items also, using Equation 79 for the correlation between the cluster-sum and its latent trait. Table 7.11 gives the correlations for the ACT data between the three cluster sums (s1 for items 1–5, s2 for items 6–10, and s3 for items 11–15) and the geometry [I] and algebra [II] latent traits; see McDonald, 1999, pp. 322–323, for …

… sets of items. It is customary to refer to these as alternate forms of a test. Two tests Y and V are item-parallel if the items in each are paired to have equal parameters in a jointly unidimensional model. (They are sometimes called "strictly parallel.") A necessary and sufficient condition for the scores Y and V on test forms Y and V to have the same distribution for examinees with the same value of their latent trait is that the forms are item-parallel. This is a condition for the complete exchangeability of the test forms. In some applications, it is a condition for equity, meaning that it cannot matter to the examinee which form is administered.

Two item-parallel test forms have (1) equal test characteristic functions, (2) equal test information functions, (3) equal test-score information functions, and (4) matched true scores and matched error variances at every point on the scale. We can recognize three distinct levels of matching or equivalence between test forms—namely, (1) item-parallel, (2) equal test characteristic and test-score information functions, and (3) equal test characteristic curves—matched true scores, but possibly different error variances.
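The validity and generalizability index of Equation 78 rests on the identity Cov{Xj, F} = n(τj)λj, so that Cov{Y, F} = Σj n(τj)λj for the test score Y. As a minimal sketch, this can be checked by simulating the normal ogive model in Python; the loadings and thresholds below are illustrative values chosen for the example, not the chapter's ACT parameters.

```python
# Monte Carlo check of the identity behind Equation 78:
# Cov{Xj, F} = n(tau_j) * lambda_j, hence Cov{Y, F} = sum_j n(tau_j) * lambda_j.
# Item parameters are illustrative, not taken from the ACT example.
import math
import random

def n(x):
    # Ordinate of the standard normal density, n( ) in the text
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

lambdas = [0.7, 0.6, 0.5, 0.8]   # loadings of the response tendencies Xj* on F
taus    = [-0.5, 0.0, 0.3, 0.8]  # thresholds: Xj = 1 if Xj* exceeds tau_j

rng = random.Random(1)
n_draws = 200_000
sum_f = sum_y = sum_fy = 0.0
for _ in range(n_draws):
    f = rng.gauss(0.0, 1.0)  # latent trait F, standard normal
    y = 0
    for lam, tau in zip(lambdas, taus):
        # Response tendency Xj* = lambda_j * F + ej, with unit total variance
        x_star = lam * f + math.sqrt(1.0 - lam * lam) * rng.gauss(0.0, 1.0)
        y += 1 if x_star > tau else 0  # binary item score Xj
    sum_f += f
    sum_y += y
    sum_fy += f * y

cov_yf_mc = sum_fy / n_draws - (sum_f / n_draws) * (sum_y / n_draws)
cov_yf_theory = sum(n(tau) * lam for lam, tau in zip(lambdas, taus))
print(round(cov_yf_mc, 3), round(cov_yf_theory, 3))
```

Dividing either covariance by the square root of the simulated Var{Y} then reproduces the index Cor{Y, F} of Equation 78.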
