CHAPTER 9

Quasi Experimentation
Thomas D. Cook
Northwestern University

Donald T. Campbell
Lehigh University
Laura Peracchio
University of Wisconsin, Milwaukee

This chapter has two purposes. The first is to explicate four kinds of validity.
Statistical conclusion validity refers to the validity of conclusions about
whether the observed covariation between variables is due to chance. Internal
validity is concerned with whether the covariation implies cause. Construct
validity refers to the validity with which cause and effect operations are labeled
in theory-relevant or generalizable terms. External validity refers to the validity
with which a causal relationship can be generalized to various populations of
persons, settings, and times.
The second purpose is to describe and critically examine some quasi-experimental designs from the perspective of these four kinds of validity, especially internal validity. We argue that the quality of causal inference depends
on the structural attributes of a quasi-experimental design, the local particulars
of each research project, and the quality of substantive theory available to aid in
interpretation. We place special emphasis on quasi-experimental designs that
allow multiple empirical probes of the causal hypothesis under scrutiny on the
assumption that this usually rules out more threats to internal validity.
This chapter is a revision of one that appeared in the last Handbook,
which was itself an offshoot of the earlier Campbell and Stanley monograph
(1966). The present chapter covers much of the same ground as its immediate
predecessor, though there are subtle changes in (a) the discussion of validity

types, (b) the presentation of new quasi-experimental designs, (c) the emphasis
placed on the conceptualization of causal generalization, (d) the advocacy of
multiple probes of a causal hypothesis, and (e) the importance assigned to
knowledge that transcends the data collected for a particular research study.
Unfortunately, space limitations preclude much discussion of nonexperiments,
randomized experiments, and methods for promoting causal generalization
through meta-analysis or explanatory model testing. These topics are discussed
elsewhere (Cook, 1990a, 1990b; Cook & Campbell, 1979).

The Theory of
Quasi Experimentation
Introduction

In reviewing the major journals devoted to


industrial and organizational psychology, we
have been struck by the relative paucity of field
experiments. By field, we mean any setting
that respondents do not perceive as having
been set up for the primary purpose of conducting research. By experiment, we mean any
experimenter-controlled or naturally occurring event with rapid onset (a "treatment")
whose possible consequences are to be empirically assessed. Experiments are traditionally
divided into two major categories. Randomized
experiments assign respondents to treatments at
random, while quasi experiments primarily
depend on self-selection or administrative
decisions to determine who is to be exposed to
a treatment. Quasi experiments can have all the
other major structural features of experiments
though, including pretest and posttest observations and comparison groups.
This chapter deals only with quasi experiments. This is because the relevant theory
of method is less well known and less clear-cut
than the theory buttressing random assignment. Also, it is generally easier to implement
quasi experiments than randomized experiments in many of the field settings where
causal conclusions are needed. And sometimes in a randomized experiment, different
types of units drop out of each treatment group,
creating a quasi experiment out of an intended

randomized experiment. Though these concerns justify a chapter on quasi experiments,


we must not forget that such experiments often
produce less interpretable causal conclusions
than their randomized counterparts. Random
assignment is the method of choice wherever
quality of causal inference is the dominant
intellectual concern.
A broad definition of industrial and organizational psychology undergirds this chapter
because most types of human behavior can
be found in formal and informal organizations. This broad definition justifies us in presenting quasi-experimental studies that were
originally reported in other social science disciplines-sometimes for purposes of building
theory and sometimes for assessing the effectiveness of organizational changes made for
pragmatic reasons. Our conception of organizational research is therefore close to Pugh's
(1966) dual concern with improving substantive theory in the field and enlarging the store
of practices that might enhance organizational
functioning.

Theories of Causation
Experiments are vehicles for testing a particular type of causal hypothesis. For most laypersons, causation probably implies manipulating
one thing and observing whether a later change
occurs in a phenomenon that is plausibly linked
to the change agent. Intrinsic to this conception
is the notion of deliberate manipulation, and
some philosophers of science have called
this the manipulability or activity theory of

causation (Collingwood, 1940; Gasking, 1955).


It is usually associated with manipulanda specified as multivariate hodgepodges rather than
unidimensional constructs so that each manipulandum could potentially be broken down
into its constituent components. Since the combination of components might vary from one
research setting to the next, knowledge claims
predicated on a manipulability theory of cause
are not likely to hold under all conditions.
Most philosophers of science aspire to identify causal relationships that are invariably true
because all the contingencies are known on
which the relationship depends. They therefore prefer essentialist theories of causation to
manipulability ones, particularly those that
specify the efficacious components of the global
manipulandum, the causally implicated components of the global effect, and any mediating
processes that occurred after the cause varied
and before the effect was observed and that also
generated the cause-effect relationship. Essentialist causal knowledge is explanatory and
promises more understanding and human
control than the multivariate hodgepodges
typically manipulated and measured in field
research. Control is enhanced because explanatory knowledge specifies the processes absolutely required for a given result. Potential
users of the knowledge are then free to choose
how to instantiate these processes in light of
their local circumstances and needs. To recreate a causal-generative process is crucial; how
it is set in motion is less relevant (Cronbach,
1982).
In the statistical tradition of experiments
associated with Fisher (1925, 1935) and used
in the social sciences, random assignment is
the key to causal inference. But manipulating
more than a few independent variables is
rarely possible, so that experiments test hypotheses about the independent and interactive effects of a small number of manipulated
variables. There is no pretense about directly
probing hypotheses either about the causally
efficacious components of the treatments

manipulated or about mediating processes linking the independent and dependent variables.


To be sure, independent and dependent variables can and should be chosen for maximal
theory relevance. Moreover, process measures
can be built into experiments and measured
before the major outcomes are assessed. But
still the core of the theory of causation underlying the design of experiments is to estimate the
effects of a treatment contrast, however theory-relevant or irrelevant it may be. Understanding
treatment and outcome components and establishing mediating processes are admirable goals,
but they are secondary to estimating treatment
effects. The theory of experimentation that
statisticians and social scientists espouse is
therefore more conceptually akin to the manipulability theory of cause than to any of the
more essentialist or explanatory theories.
In the manipulability theory, several
conditions have to be met for concluding that
two variables are causally related and that the
direction of causation is from A to B. First, a
cause must precede an effect in time. Meeting
this condition is easy if investigators know
when respondents experienced a treatment. In
a randomized experiment, a properly executed
assignment procedure ensures pretest equivalence between the comparison groups within
the limits of sampling error. Hence, any posttest
group differences are a product of the treatments (or chance) and have to have happened
after the treatment was introduced. In quasi
experiments, investigators can combine knowledge of when treatment assignment occurred
with their observation of changes in pretest and
posttest performance.
The treatment and effect also have to covary. Many social scientists use statistical tests
to help decide about covariation, accepting
somewhat arbitrary social standards (e.g., p<
.05) for deciding whether the covariation is
"real." Though statistics function as gatekeepers, they are fallible even when properly used,
sometimes failing to detect true patterns of covariation and sometimes indicating that there

is covariation when there is not. Since covariation is a requirement for cause and statistical
tests are usually used to make such judgments, it seems wise to explicate the major
factors that can lead to false conclusions about
covariation. We call these threats to statistical
conclusion validity.
We must confess to trepidation about this
neologism since decisions about covariation
strike us as less important than decisions about
causal magnitude. There are two reasons for
this. First, all social research (except, perhaps,
consulting) has many different stakeholders
with their own standards of risk. Some prefer
little risk and are willing to ignore some genuine patterns of covariation in order to protect
themselves against concluding there is covariation when there is not. Other stakeholders are
less risk-averse and want to ensure that they do
not miss true relationships. Thus, whatever the
level of risk researchers adopt (or slip into) in
their statistical reasoning, it is bound to meet
the needs of some stakeholders more than
others. This would not be so much the case if
conclusions were drawn about the size of an
effect. Potential users of the information could
then decide for themselves how important a
given effect was. Second, statistical conclusion
validity has to do with "statistical significance,"
which can be a highly misleading construct.
Many inexperienced researchers and laypersons fail to realize that even the most trivial
relationship can be statistically significant if a
test has enough statistical power. Since statistical significance, theoretical significance, and
policy significance are not synonyms, we henceforth refer to relationships as being "reliable"
rather than "statistically significant."
The third necessary condition for causal
inference is that there are no plausible alternative explanations of B other than A. This is the
most difficult condition to meet. For instance, if
a new machine were under evaluation and was
associated with an increase in productivity, the
increase might be due not to the machine, but to
a seasonal increase that occurs every year at the

same time. This is only one of many possible


third variables, and we shall later present a systematic list of them as threats to internal validity.
They all imply that the change observed in B is
spurious because the same change would have
occurred even if there had been no treatment.
This is different from another meaning of
"alternative interpretation" that does not question whether A-as-manipulated is causally
connected to B-as-measured. Instead, it questions whether the operations used in the research represent the theoretical constructs the
investigator claims they do. Most of the controversies in the theoretical social sciences are of
this kind. They are not controversies about
whether being paid more money in a particular
study led to higher productivity. Instead, they
are controversies about whether the payment
created feelings of inequity or led the experimenters to expect that the payment would
improve productivity. The issue is: How should
the payment be labeled in theory-relevant and
generalizable terms? No doubt need be expressed
about whether the manipulation had the effect
attributed to it. To give another example, for
some persons the interpretative problem in the
famous Roethlisberger and Dickson (1939)
experiments at the Hawthorne plant is one of
labeling what caused the women to increase
productivity; it is not one of determining
whether the operational treatment increased
productivity. Was the causal agent (a) a change
in illumination; (b) the fact that an organizational change took place, irrespective of its
nature; (c) feedback about one's own job performance; or (d) new perception of management interest in the workers? We shall later
discuss threats to construct validity of the cause
and effect, and these should be understood as
threats to the correct labeling of the cause and
effect operations in abstract terms that come
from common language or formal theory.
Past commentators have misunderstood
"internal validity/' using it to refer to concerns
about how the cause and effect operations
should be labeled. This confusion probably

arose because most theory-centered scholars
attribute little value to causal statements where
the causal agent and its effect cannot be convincingly described in a general way (Kruglanski & Kroy, 1975). It may also have arisen because the same (fallible) deductive methodology is used in ruling out alternative interpretations to both internal and construct validity.
But while internal validity involves ruling out
alternative interpretations of the presumed
causal relationship between A-as-manipulated
and B-as-measured (Campbell, 1986), construct
validity involves ruling out alternative interpretations of the entities claimed as A and B. The
rationale for experiments is to probe whether
variables are causally related, and so the alternatives that necessarily have to be ruled out
before inferring cause are alternative interpretations of the relationship between A and B (i.e.,
threats to internal validity). Nonetheless, eliminating all alternative hypotheses about the
constructs involved in a descriptive causal relationship aids in causal explanation and the
understanding and control that such explanation fosters.
At least one further step is useful. To infer a
causal relationship at one moment in time in a
single research setting and with one sample of
respondents would give us little confidence
that a demonstrated causal relationship is robust. External validity concerns the generalizability of findings across times, settings, and
persons. It takes two overlapping but nonetheless distinct forms. The first has to do with
generalizing to the times, places, and persons
specified in the claims researchers make about
the generalizability of the causal findings they
have shown. Usually, these claims touch on the
populations of persons, settings, and times
specified in the original research question, but
that is not inevitable. The core component is
using a sample to generalize to a population
that the sample is thought to represent. The
second form external validity takes has to do
with extrapolating beyond the instances captured in the sampling plan so as to draw

inferences about entities that are manifestly


different. Such extrapolation is not easy and
does not depend on sampling considerations
alone.
Most of our validity distinctions can be
translated into the language Cronbach (1982)
uses to formulate the major issues in research
design. He uses the term u to refer to the units
(individuals, groups, or institutions) that are
assigned to treatments within a study. He uses
t to refer to treatments, o to refer to observations
(of which outcome measures are a particularly
important subclass), and s to refer to settings.
(We shall add ch to refer to time [chronos], since
Cronbach does not deal with time and t is
already spoken for in his system.) Cronbach
writes lowercase utosch to refer to the samples
achieved in a research study; he uses capital
UTOSCH to refer to the universes, populations,
and categories represented by the sample-level
particulars; and he uses *UTOSCH (pronounced
star UTOSCH) to refer to populations with
attributes different from those found in utosch
but about which causal statements from the
sample data might still be warranted through
extrapolation.
Figure 1 illustrates Cronbach's notational
system and also some potential terminological
confusion for unwary readers. The arrow
linking t to o at the level of observables is the
connection that Campbell (1957) designates as
internal validity, a product of two inferences-one about covariation (statistical conclusion
validity) and the other about cause (internal
validity). Cronbach assigns no special status to
the t to o link. In Cronbach's system, he uses the
term internal validity to describe the relationship indicated by the arrows from u to U, t to T,
o to O, s to S, and ch to CH. These generalizations about populations, universes, categories,
classes-call them what you will-are warranted from consideration of the sample-level
particulars, cases, instances, exemplars-call
these what you will. Campbell calls these same
relationships external validity because they
have to do with generalizing. To add to the

FIGURE 1
Validity Distinctions Translated Into Cronbach's Research Design Language

[Figure: a diagram linking the sample-level terms u, t, o, s, and ch to the population-level terms U, T, O, S, and CH, and beyond them to the extrapolation targets *U, *T, *O, *S, and *CH. Legend: internal validity for Campbell and for Cook & Campbell; construct validity for Cook & Campbell; internal validity for Cronbach; external validity for Campbell and for Cook & Campbell; external validity for Cronbach.]

confusion, in the conceptualization in this chapter, the arrows from t to T and o to O indicate
construct validity of cause and effect, respectively. To Cronbach, these are part of what he
means by internal validity, while in Campbell's
earlier formulations, these same relationships
were called external validity because they also
have to do with generalizing to target populations. All of this is surely confusing, and we
urge readers to study Figure 1 in detail.
Cronbach acknowledges that it is often difficult to designate the population or category a
particular sample or operation represents, but
he insists that there is still a population in
theory. In his system, the arrows to *U, *T, *O,
*S, and *CH represent extrapolation to populations and categories with attributes different
from those at the sampling level, and this is the
most important inference of all for him. Indeed,
it is the core around which his own theory of research design was developed (Shadish, Cook,
& Leviton, in press), though it is not central in

other recent theories of design (e.g., Kish, 1987).


Campbell is not specific about whether external validity should have to do with both generalization to target universes and extrapolation
to unstudied ones, and he is certainly not explicit about preferred methods for extrapolation. Instead, he centers his conception of external validity around the open-ended question,
"To which populations of persons, settings,
and times can a causal relationship be generalized?", and he emphasizes the practice in the
natural sciences of assuming the robustness of
a relationship after a few demonstrations and
maintaining this belief until contrary evidence
builds up. The theory is then either abandoned
or hedged around with causal contingencies
limiting its generalizability (Campbell, 1969b).
Statistical Conclusion Validity
Conceptual Issues. Experiments are conducted to make decisions. Sometimes, investigators specify target effect sizes that have to be reached before claiming that a treatment is
effective (e.g., productivity has to increase by
10%, or absenteeism has to fall to less than 2%).
But this is rare, and most investigators want to
decide whether a treatment has had some effect, no effect, or whether no decision can be
made at present. Statistical significance tests
are used to this end, and they usually involve
contrasting statistics within a treatment group
(e.g., pretest and posttest means) or from a
treatment group with those from a comparison
group. In describing threats to valid conclusions about the existence of bivariate relationships and in providing readers with some ways
of overcoming these threats, we hope that readers will realize that such null-hypothesis testing is merely a preliminary to deciding how
large an effect is and why it might be important
for theory or practice.
Arbitrary statistical traditions have developed for drawing conclusions about covariation from sample data. Relationships below
the five percent probability level are typically
considered "true," while those above the five
percent probability level are treated as though
they were "false." But we can be wrong in concluding that population means differ even when
the probability level is less than .05; and we can
be wrong in concluding they do not differ when
the probability level is higher. Moreover, relationships can be inferred using many means
other than statistics, and in fields like ethnography they are! Nonetheless, statistics are widely
used in a gatekeeper role, and this is why
Campbell (1969c) added "instability" to his
earlier list of threats to internal validity, describing it as "unreliability of measures, fluctuations in sampling persons or components,
autonomous instability of repeated or equivalent measures" (p. 411). But since covariation is
a precondition for cause, and since statistics are
so fallible, we have chosen to place more emphasis on covariation, justifying the distinction
between statistical conclusion validity and internal validity.

The close connection between statistical


conclusion validity and internal validity can be
seen in the distinction between bias and error.
Bias refers to factors that systematically affect
the value of means; error refers to factors that
increase variability (and so decrease the chance
of obtaining statistically reliable effects). If we
erroneously concluded from a quasi experiment that A causes B, this might either be
because threats to internal validity bias the
relevant means or because, for a specifiable
percentage of possible comparisons, differences
as large as those found in a study would be
obtained by chance. If we erroneously concluded that A cannot be demonstrated to affect
B, this can either be because threats to internal
validity bias means and obscure true differences or because the uncontrolled variability
obscures true differences. Statistical conclusion validity is concerned with sources of error
and the appropriate use of statistical tests for
dealing with such error. Concern with bias is
the domain of internal validity.
It is standard in statistics to note that the
null hypothesis cannot be proven. There is
always the possibility, however remote, that
statistics have failed to detect a true difference
or that a different substantive conclusion would
have been drawn if a treatment had been implemented at a higher dosage level, if its theoretical integrity had been greater, if more sources
of random error had been controlled, if suppressor variables had been measured, or if a
statistical test of greater power had been used.
Nonetheless, circumstances sometimes require
acting as though the null hypothesis were true.
In some practical contexts, failure to reject the
null hypothesis implies that the treatment-as-implemented makes so small a difference that
it would not be worth worrying about even if it
had made a reliable difference (Greenwald,
1975). But in more theory-relevant contexts, we
can only consider accepting the null hypothesis
if all the theory-derived conditions facilitating
the effect were demonstrably present and a test
with high statistical power failed to produce a

reliable effect (Cook, Gruder, Hennigan, &


Flay, 1979).
A special problem of statistical conclusion
validity arises in designs with two (or more)
experimental groups if the pretest-posttest
difference within one or more groups is reliable
but the difference between groups is not. The
apparent puzzle can be solved by noting that
the within-group comparisons test whether
change has taken place, but it fails to specify the
locus of change. Is it due to the treatment or to
some other factor that varied between the measurement waves? The between-group comparison tests whether there is more change in one
experimental group than another, potentially
specifying the locus of cause more adequately
because irrelevant temporal changes have now
been controlled. But if there is overlap in the
treatment components or in the mediating
processes the treatments stimulate, then the
between-group comparison runs the risk of
underestimating a treatment's total potential
impact. Experiments with comparison groups
test contrasts between treatments rather than
individual treatments so that the within-group
and between-group analyses test different
hypotheses. It is not a genuine contradiction if
they suggest different conclusions about a
treatment's effectiveness.
This chapter deals with the design of quasi
experiments rather than the analysis of the data
they provide. We occasionally provide references to recent statistical sources, but we warn
readers that the statistical literature is now in a
productive state of chaos as concerns the analysis of data from nonequivalent groups when
there is no long pretreatment time series of
observations (cf. Heckman & Hotz, 1989a, 1989b,
vs. Holland, 1989). However, the statistical issue is more one of using control variables to
guard against bias than using them to reduce
random error. The theoretical conditions required for an unbiased test are well known, but
are unfortunately beyond simple implementation. These theoretical conditions are (a) complete knowledge of all the outcome-correlated

differences between nonequivalent treatment


groups and (b) error-free measurement of these
differences. (The data-analytic situation is less
bleak with time series analysis, though even
there, much art and many assumptions are
required to deal with the correlation between errors if unbiased standard errors are to result [Box & Jenkins,
1976].)
List of Threats. Here is our taxonomy of
threats to statistical conclusion validity.

Statistical Power. The likelihood of making an incorrect no-difference conclusion, or Type II
error, increases when sample sizes are small,
alpha is set low, one-sided hypotheses are incorrectly tested, major sources of extraneous
variance remain uncontrolled, variables are
dichotomously distributed, and distribution-free statistics are used for hypothesis testing.
Many books now address issues of statistical
power (e.g., Cohen, 1988; Kraemer & Thiemann,
1987; Lipsey, 1990), and ways of dealing with
the problem should be obvious-increase
sample sizes, control major sources of extraneous variance, implement continuous measures,
and use the most powerful statistical tests
appropriate for such measures.
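
To make the power point concrete, here is a minimal sketch in Python using the common normal-approximation formula for two-group comparisons; the effect sizes, alpha, and power values are illustrative assumptions rather than figures from this chapter.

    from scipy.stats import norm

    def n_per_group(d, alpha=0.05, power=0.80):
        # Normal-approximation sample size for detecting a standardized
        # mean difference d between two groups with a two-sided test.
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        return 2 * (z_alpha + z_beta) ** 2 / d ** 2

    print(round(n_per_group(0.8)))   # about 25 respondents per group for a large effect
    print(round(n_per_group(0.5)))   # about 63 per group for a medium effect
    print(round(n_per_group(0.2)))   # about 392 per group for a small effect

The steep rise in required sample size as the expected effect shrinks is the practical reason small field studies so often fail to detect genuine treatment effects.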
Fishing and the Error Rate Problem. Type I errors will result when multiple comparisons are made since a certain proportion of them will
differ by chance. Ryan (1959) has illustrated
one method of adjusting for the error rate per
experiment. This involves computing a new t
value that has to be reached before reliability at
a given alpha level can be claimed. The new t
requires deciding on an alpha (e.g., .05) and
then dividing this value by the number of
comparisons made, resulting in a proportion,
or p value, that will be lower than .05. The t value
corresponding to this adjusted p is looked up in
the appropriate tables, and it will be higher
than the t value normally associated with
alpha = .05. (This higher value reflects the

stringency required for obtaining a true level of


statistical significance when multiple tests are
made.) A second method for dealing with the
error rate problem involves using the conservative multiple comparison tests of Tukey or
Scheffe, which are discussed in most moderately advanced statistics texts. And when there
are multiple dependent variables in a factorial
experiment, a multivariate analysis of variance
strategy can be used for determining whether
any of the significant univariate F tests within a
particular main or interaction effect are due to
chance rather than to the manipulations.
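
As a concrete illustration of the per-experiment adjustment described above, the following minimal Python sketch divides the experiment-wise alpha by the number of comparisons and looks up the more stringent critical t; the number of comparisons and the degrees of freedom are hypothetical.

    from scipy.stats import t

    alpha = 0.05
    n_comparisons = 10
    df = 60                                     # hypothetical error degrees of freedom

    p_per_test = alpha / n_comparisons          # .005, the adjusted p described above
    t_critical = t.ppf(1 - p_per_test / 2, df)  # two-sided critical value at the adjusted p

    print(p_per_test)    # 0.005
    print(t_critical)    # roughly 2.9, versus roughly 2.0 for a single test at alpha = .05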

The Reliability of Measures. Measures of low reliability (conceptualized either as "stability"
or "test-retest") cannot be depended on to register true changes. Error terms will be inflated,
reducing the chance of rejecting the null hypothesis. One way of controlling for this, where
possible, is to use longer tests for which items
have been carefully selected for their high intercorrelation, provided that the intercorrelation has not been achieved by restricting the
conceptual domain being measured. Another
way is to decrease the interval between tests in
longitudinal studies. Also, where larger units
of analysis can be used without creating ecological fallacies (e.g., group means become the
unit of analysis instead of individual scores),
reliability will increase. Indeed, if experience in meta-analysis is anything to judge by,
the gain in reliability will often more than
compensate for the decrease in degrees of freedom that occurs when using more aggregated
units of analysis. Failing all these things, standard corrections for unreliability can and should
be used to avoid the false conclusions about
covariation associated with measures of low
reliability.
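
The standard correction mentioned above is the classical correction for attenuation, in which the observed correlation is divided by the geometric mean of the two reliabilities. A minimal sketch in Python, with hypothetical numbers:

    def correct_for_attenuation(r_obs, rel_x, rel_y):
        # Estimate the correlation between true scores from the observed
        # correlation and the reliabilities of the two measures.
        return r_obs / (rel_x * rel_y) ** 0.5

    # An observed correlation of .30 between measures with reliabilities of
    # .70 and .80 implies a true-score correlation of about .40.
    print(correct_for_attenuation(0.30, 0.70, 0.80))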
The Reliability of Treatment Implementation.
The way a treatment is implemented may differ
from one respondent to another for a variety
of reasons. Foremost among them are that different persons may implement the treatment

differently, and there may be differences from


occasion to occasion even when the same person implements the treatment. Such lack of
standardization will inflate random error and
decrease the chance of obtaining true differences. The threat can be most obviously controlled by making the treatment and its implementation as standard as possible across
occasions of implementation. Alternatively, the
variability in implementation should be measured and "somehow" used in the data analysis
to see how responsive the outcome is to variation in the intensity or quality of the treatment (Boruch & Gomez, 1977; Sechrest, West,
Phillips, Redner, & Yeaton, 1979).

Random Irrelevancies in the Setting. Some features of an experimental setting other than the
treatment will undoubtedly affect scores on the
dependent variable, thereby inflating error
variance. No setting is quite like another, and
any one setting is not likely to stay constant
from one time period to another. This threat
can most obviously be controlled by choosing
settings free of extraneous sources of variation,
which is the logic behind the sealed-off laboratory setting. Alternatively, experimental procedures can be selected that focus respondents'
attention on the treatment and thereby lower
the saliency of environmental variables. Finally, it is possible to measure some of the
major sources of extraneous setting variance
and "somehow" use them in the statistical
analysis.
Random Heterogeneity of Respondents. The respondents in an experiment usually differ from
each other within treatment groups in ways
related to the dependent variables. (This is different from some types of respondents being
more affected by a treatment than others, which
we shall soon see is a matter of external validity.) The more respondents differ from each
other within groups, the less will be the ability
to reject the null hypothesis when between-subject error terms are used. This threat can

obviously be controlled by (a) selecting homogeneous respondent populations (at some cost
in external validity), (b) blocking on respondent characteristics most highly correlated with
the dependent variable, or (c) choosing within-subject error terms as in pretest-posttest designs. In designs with both pretest and posttest
measures, the extent to which within-subject
error terms reduce the error terms depends on
the correlation between scores over time: The
higher it is, the greater the reduction in error.
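
The dependence on the pretest-posttest correlation can be seen in a small simulation; the correlation of .70 and the common variance of 100 are arbitrary assumptions. With equally variable waves, the variance of change scores is 2 x variance x (1 - correlation), so the higher the correlation, the smaller the within-subject error term.

    import numpy as np

    rng = np.random.default_rng(1)
    rho, sigma = 0.7, 10.0
    cov = [[sigma**2, rho * sigma**2],
           [rho * sigma**2, sigma**2]]
    pre, post = rng.multivariate_normal([50.0, 50.0], cov, size=5000).T

    print(np.var(post))          # between-subject variability, about 100
    print(np.var(post - pre))    # change-score variability, about 2 * 100 * (1 - .7) = 60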
Internal Validity
List of Threats. Threats to internal validity
compromise inferences about whether the relationship observed between two variables would
have occurred even without the treatment under
analysis. We distinguish the following threats.
Ways of dealing with them other than through
random assignment will be dealt with in even
greater detail later as we discuss individual
quasi-experimental designs.
History is a threat when the relationship between the presumed cause and effect might be
due to some event that took place between a
pretest and posttest and that is not part of the
treatment under analysis.
Maturation is a threat when a presumed
causal relationship might be due to respondents growing older, wiser, stronger, and so on
between the pretest and posttest, assuming
that such maturation is not the treatment of
interest.
Testing is a threat when a relationship might
be due to the consequences of taking a test
different numbers of times-for example, the
first test making respondents think of, or look
up, the answers.
Instrumentation is a threat when a relationship might be due to the measuring instrument
changing between the pretest and posttest.
The measuring instrument can be persons recording observations or a physical instrument
whose properties vary at different times,
perhaps because of such scaling artifacts as

ceiling or basement effects or because of shifts


in reliability.
Statistical regression is a threat when a relationship might be due to respondents being
classified into experimental groups on the basis
of pretest scores or their correlates that have
been measured with error. High pretest scorers
will then score relatively lower at the posttest,
and low pretest scorers will score higher. This
is because the high initial scores will contain
more error inflating obtained scores than deflating them, and because the low initial scores
will have more error deflating them than raising them. All things being equal, the error
should not be so unevenly distributed at the next
testing, leading the high scores to decline on the
average and the low ones to rise. Such differential "change" is due to classification on the basis
of fallible scores rather than to the treatment
under analysis.
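
A small simulation makes the regression mechanism visible; all of the numbers are hypothetical. When a "high" group is formed from fallible pretest scores, its posttest mean drifts back toward the overall mean even though no treatment and no real change intervened.

    import numpy as np

    rng = np.random.default_rng(0)
    true_score = rng.normal(50, 10, size=10_000)           # stable underlying standing
    pretest = true_score + rng.normal(0, 5, size=10_000)   # fallible pretest
    posttest = true_score + rng.normal(0, 5, size=10_000)  # equally fallible retest, no treatment

    high = pretest > 65                                    # classify "high scorers" on the pretest
    print(pretest[high].mean(), posttest[high].mean())
    # roughly 70 at the pretest versus 66 at the posttest: the apparent
    # decline is a regression artifact, not a treatment effect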
Selection is a threat when a relationship may
be due to different kinds of persons serving in
each of the treatment groups rather than to the
different treatments each group has received.
Mortality is a threat when a relationship
may be due to units with different attributes
dropping out of one or more of the treatment
groups. This results in a selection artifact at
posttest, since the various groups are then
composed of different kinds of persons on the
average.

Interactions With Selection. Many of the threats to internal validity listed above can interact
with selection to produce forces that can masquerade as treatment effects. Perhaps the
most common such spurious force is selection-maturation. This results when experimental
groups are composed of different kinds of
persons on the average, and the unique population groups so constituted are maturing at
different rates. Such treatment-correlated
growth differences often occur in education
when one treatment group has more middle-class children than another and learning gains
are the outcome measured. Selection-history (or

local history as it is sometimes called) occurs
when a contemporaneous but irrelevant source
of external change takes place between the
pretest and posttest in one treatment group but
not another.

Ambiguity About the Direction of Causal Influence.


It is possible to imagine all plausible alternative
explanations of an A-B relationship being ruled
out without it being clear whether A causes B or
B causes A. This threat is most salient in cross-sectional correlational studies where it might
not be clear, for instance, whether less foreman
supervision caused higher productivity or
higher productivity caused less supervision.
The threat is not salient in most experiments,
however, since the order of temporal precedence is clearer there. Nor is it a problem in
longitudinal research or where one direction of
causal influence is relatively implausible. For
example, it is more plausible to argue that a
decrease in external temperature increases
fuel consumption than to argue that an increase in fuel consumption decreases outside
temperature.
Conceptual Issues. This chapter is about how
quasi experiments facilitate causal inference.
However, experiments are not unique as a
means of inferring cause. A science like astronomy has progressed without experimentation,
in part because it has been blessed with reliable
observational methods and quantitative theories that have predicted precise locations in
space, precise orbits, and precise time intervals
for crossing space. The numerical precision of
predictions has meant, first, that predictions
could be tested with a high degree of accuracy,
and second, that different numerical predictions could be pitted against each other. This is
not to say that all validity problems are answered in astronomy, or that the investigator
can give up the required task of trenchantly
thinking through as many alternative hypotheses as possible and consciously pitting each of
them against the data to see if they can be ruled

out. Our point is that there will typically be


fewer validity threats where theories are as
precise as in astronomy and measurement as
reliable.
Unfortunately, the social sciences are not
yet blessed with such powerfully precise theories, such reliable measurement, or such recurrent cyclical orders in the observational data.
Imagine observing a manager's performance
between a pretest taken before beginning a
year-long special course and a posttest taken
after completing the course. What are the
chances of predicting how much of this difference can be explained in terms of the course
itself, spontaneous maturation, gains in test-taking skill, unique historical events that had
some effect on the dependent variable, or any
combination of these forces? Even if specific
numerical predictions could be attached to each
of these explanations, how confident would we
be of measuring each factor and performance
reliably enough to discriminate between the
theories? We believe there are few social science theories (outside of some areas of economics) where precise numerical predictions can be
successfully used to test competing causal hypotheses, though we agree with Meehl (1978)
that such numerical specificity would make
theory tests much stronger.
And so we turn to experiments as a vehicle
for facilitating causal inference. The sources of
bias associated with internal validity are much
more problematic for quasi experiments than
they are for randomized experiments. With
nonequivalent treatment groups, selection is
obviously a potential problem, whether in its
simplest form or in interaction with, say, maturation or history. Since nonequivalent groups
often differ in the physical, social, and psychological settings they inhabit, local history is also
often a problem. So, too, is maturation or instrumentation, particularly where no control
groups exist. The situation is simpler with
randomized experiments. Assuming that a
correct randomization procedure has been
correctly implemented and that there have been

no treatment-correlated refusals to participate


in a study, nearly all the foregoing threats to
internal validity can be ruled out because the
groups are initially equivalent in composition,
history, testing sequence, and other factors.
However, we cannot rely on randomized
experiments alone for causal inference (Sechrest
& Hannah, 1990). In particular, there are two
internal validity threats the experiment does
not automatically rule out in those cases where
experiments are feasible-which is not always
the case. The most salient threat is mortality.
With longer lasting treatments that differ in
intrinsic desirability, attrition is likely to be
greater from the less desirable treatment conditions. This happened, for example, with the
New Jersey Negative Income Tax Experiment
(Rossi & Lyall, 1976), where the families guaranteed a lower income dropped out of the
experiment in greater numbers, leading to a
confound between the level of income guaranteed and the type of families remaining in the
study. This problem was eventually attenuated
by providing a small payment for filling out
questionnaires to those who received a smaller
guaranteed income or no guarantee at all.
However, it is not clear that side payments to
no-treatment control groups or low benefit
treatment groups will always eliminate all treatment-correlated attrition.
The second exception is the selection-history threat that occurs when the various
treatment and control groups are not treated
equivalently in all matters other than treatment
assignment. Experimenters sometimes collect
different subsets of measures from the various
groups on the grounds that some measures will
seem bizarre to some groups because they
presuppose knowledge of treatment particulars. But when such treatment-correlated
differences in the measurement plan occur, it
is not logically clear whether any observed
effects are due to the treatment contrast or
the differences in constructs measured. Likewise, when the researcher randomly assigns
units but collects all the data for a particular

treatment group in a single data-collection


session, all idiosyncratic events that took place
during that session will be confounded with
the experimental treatment and may be responsible for effects. The cure for this, where
feasible, is to administer the treatment to individuals or smaller groups, randomizing the
experimental and control sessions as to location and time and basing the degrees of freedom in the statistical analysis on the number of
groups rather than persons. If small groups are
the unit of treatment assignment, then they,
too, should be the unit of analysis.
Estimating the internal validity of a relationship is a mostly deductive process in which
researchers have to be their own most trenchant critics and systematically think through
how each threat may have influenced the data.
But personal criticism is inevitably limited, and
it is highly desirable to receive critical commentary from knowledgeable outsiders. The more
tough-minded their commentary and the more
ideologically opposed they are to the emerging
findings, the more helpful their comments are
likely to be. Once potential alternative interpretations have been identified, tests have to be
carried out to examine which plausible threats
can and cannot be ruled out. When all can be
ruled out-including those unique to the local
setting studied and not in lists of validity
threats-then the confident provisional conclusion can be made that a relationship is causal.
When some of the plausible threats cannot be
rejected because the appropriate data are not
available or analyses suggest that the threat in
question may indeed have operated, then the
investigator has to conclude that it is not clear
whether a demonstrated relationship is causal
or not.
When practical decisions have to be made,
the researcher may sometimes have to act as if a
relationship were causal, whatever the quality
of evidence available. At other times, there may
be a limited number of alternative interpretations that cannot be ruled out, and the researcher can confidently dismiss those that

seem implausible because their implications
clash with frequently replicated findings from
a well-grounded substantive theory. But plausibility is often less clear-cut than this. Obtaining high interjudge agreement about the plausibility of a particular alternative is often difficult, and anyway, theorists place great emphasis on testing predictions that conflict with both
common sense and existing substantive theories. Such an emphasis suggests that the "implausible" is sometimes true. Judgments about
plausibility are necessary in interpreting any
experiment, but they are necessarily time- and
place-bound. While they should reduce doubt
about whether a relationship is causal, they
cannot totally eliminate it.
Construct Validity of the Cause and Effect
Conceptual Issues. Construct validity is what experimental psychologists mean by "confounding"-the possibility that the operational definition of a cause or effect can be
construed in terms of more than one construct
specified at the same level of reduction. What
one investigator interprets as a causal relationship between A and B, another might interpret as a causal relationship between A andY,
or X and B,orevenXand Y. Later studies might
help support just one of these reinterpretations.
Note what is at issue here. There may be no
doubt that something related to a treatment
contrast has brought about an effect; the question is one of labeling the cause and effect
particulars in general, abstract language.
This should be differentiated from specifying the causal sequences that occur after a
global cause has varied and before its effects
have been observed. Identifying such "causal
paths," "develop1nental sequences," or "explanatory causal theories" is important because it
provides clues about the processes that absolutely have to be recreated if an effect is to be
replicated (Bhaskar, 1975; Cronbach, 1982). It is
therefore especially important for the transfer
of knowledge to contexts obviously different

from those captured by the sampling particulars to date. But explaining causal mediating
processes is not what we mean by the construct
validity of the cause or effect. The latter concerns alternative interpretations of manipulations and measures, not alternative specifications of intervening causal processes.
High construct validity obviously requires
a rigorous and comprehensive description of
the theoretical cause and effect. Without this,
neither the manipulations nor measures can be
tailored to the constructs they are meant to
represent. But single exemplars of constructs
never give a perfect fit. Each instance contains
unique irrelevancies and usually also fails to
include some theoretically important components. From this follows the recommendation
that researchers should deliberately use multiple exemplars
of each construct that have been
selected both to share common variance attributable to the target construct and to differ from
each other in irrelevant components (Campbell
& Fiske, 1959). Such "multiple operationalism"
(Campbell, 1969a) allows tests of whether a
particular cause-effect relationship is robust
across the somewhat different set of relevant
and irrelevant components built into the research operations available. While irrelevancies should be randomly distributed across the
available operations, it is in practice impossible
to assume this, and the weaker assumption
undergirding Campbell's theory of measurement is that the irrelevancies are d-istributed
heterogeneously across the instances sampled.
That is, there are at least some instances where
each irrelevancy is and is not present. The
worst situation is, of course, where a construct
and an irrelevancy are totally confounded.
Two other points are worth making about
assessing the construct validity of causes and
effects. First, independent variables should
demonstrably vary what they are meant to
vary, and independent measures of this are
useful. Outcome measures should also measure what they are meant to assess, and independent evidence of this is also desirable. The

assumption that the 'take' of the independent


variable can and should be independently
measured implies that our understanding of
construct validity includes what Cronbach and
Meehl (1955) call "content validity." Content
validity is based on defending interpretations
of research operations by means of arguments
about how much the content of the operation
overlaps with the content specified in some
definition. Cronbach and Meehl's own conception of construct validity is restricted to manifestly unobservable psychological states, such
as anxiety or drive, and inferences about them
are presumed to depend on a network of evidence and argumentation rather than on the
degree of overlap between the content of a
construct and a research operation. We believe
it is important to judge how well experimental
manipulations and measures tap into the target
content space. Since this involves judgments
about higher-order constructs-even if many
of them are more ostensive than psychological
states-we prefer a broad description of construct validity. In this connection, we are pleased
to note that Cronbach (1989) now espouses the
view that content and construct validities are
not as distinct as they seemed in the 1950s.
Second, construct validity requires not only
that manipulations and measures of the same
construct covary (the convergent validation
criterion), but also that there should be noticeably less covariation between measures of related
but different substantive constructs (the discriminant validity criterion). It is as important to
show what a measure does not represent
(Campbell & Fiske, 1959) as to show what it
does. To be concrete about this, imagine a
manipulation of "communicator expertise."
Convergence dictates that it should be highly
correlated with respondents' reports about the
communicator's level of knowledge, while
discrimination dictates that it should be noticeably less correlated with their reports about
the communicator's trustworthiness, likability, or power.

We can further illustrate convergent and


discriminant validity by considering a possible
experiment on the effects of supervisory distance. Suppose we operationalized "supervision by a foreman" as standing within comfortable speaking distance of workers. The treatment might therefore be more exactly specified
as "supervision from a speaking distance." It
would exclude longer distances from which
speaking was impossible but from which workers might nonetheless feel supervised. Since
supervision from speaking and nonspeaking
distances might have different consequences,
construct validity would be promoted if supervisory distances were systematically varied to
probe whether discriminating between supervision from speaking and nonspeaking distances is warranted. Of course, distance is not
the only attribute of supervision that would
vary in a study. Foremen might also differ in
whether they tend to supervise with a smile, in
an officious manner, or even inadvertently
during the execution of some other task. Neither the smile, the officiousness, nor the inadvertence are necessary components of "supervision," and researchers might hope they would
occur with equal frequency across all the instances of supervisory distance sampled. To
check on this, they should measure these attributes directly and in their data analysis probe
whether close supervision had similar effects if
a foreman smiled, was officious, or supervised
inadvertently. If a cause-effect relationship
could not be generalized across any one of
these irrelevancies, then a causal contingency
formulation would be achieved and we would
know one of the factors that codetermines a
particular effect. We would then be forced to
speak of"close supervision with a smile" rather
than "close supervision."
Though construct validity concerns the
correspondence between research operations
and abstract theoretical labels, it would be
a mistake to think that the construct validity
of causes and effects should concern only

theorists. Most social treatments are best described as complex packages, especially in
applied research. This complexity makes it
difficult even to reproduce the package, let
alone replicate any effects in the (usual) absence of full information about effective treatment components. High construct validity of
the cause helps promote reproducibility. It also
promotes efficiency because it provides clues
about those components of the cause that do
not contribute to the effect and so can be
dropped. For instance, "Sesame Street" was
evaluated in an experiment where researchers
regularly visited children in their homes and
encouraged them to view the show, leaving
behind toys, books, and balloons advertising
the series (Ball & Bogatz, 1970; Bogatz & Ball,
1971). Such face-to-face encouragement cost
between $100 and $200 per child per viewing
season. Viewing without encouragement costs
just $1 to $2 per child per season and is much
more commonplace (Cook, Appleton, Conner,
Schaffer, Tamkin, & Weber, 1975). Would it not
be useful to know whether viewing "Sesame
Street" without encouragement has similar
effects to viewing it with encouragement,
for to drop the encouragement component
would be highly desirable financially, if not
pedagogically?
Unfortunately, individual experiments afford poor prospects for achieving high construct validity of the cause. This is primarily
because implementing multiple operationalizations of an independent variable is costly,
and multiple operationalization is a better
means of enhancing construct validity than
carefully tailoring a single operation to a referent construct. Fortunately, the prospects are
brighter for the high construct validity of outcomes because investigators typically have
much greater latitude for multiple measurement than multiple manipulation.
List of Threats. Below is a list of threats to
construct validity. We want to draw special

attention to a subclass of threats associated


with units in one treatment group learning of
the content of another treatment and reacting
to this knowledge. Cook and Campbell (1979)
earlier listed such treatment crossovers under
internal validity because they can lead to false
conclusions about causal connections. However, Cronbach (1982) has pointed out that such
crossover effects would not have come about
unless a treatment contrast had been implemented, and that all experiments test contrasts
rather than individual treatments. How, he
asks, can treatment crossovers be threats to
internal validity if internal validity is defined in
terms of relationships that might have come
about even in the absence of a treatment, for
treatment crossovers depend on the presence
of a treatment contrast to start with? Cronbach
is undoubtedly correct, and it seems better to
classify threats following from treatment crossover as threats to the construct validity of the
cause. The analyst is not sure whether the crossovers or the planned treatment caused the
obtained effect (or absence of such).

Inadequate Preoperational Explication of Constructs. The choice of operations should depend on the result of a conceptual analysis of
the prototypical components of a construct
(Rosch, 1978), perhaps through consulting dictionaries (social science or otherwise), relevant
substantive theory, and the past literature on a
topic. Doing this, one would find, for example,
that "attitude" is usually defined as a stable
predisposition to respond and that this stability is understood either as consistency across
time or response modes (affective, cognitive,
and behavioral). Such an analysis immediately
indicates that the standard practice of measuring preferences or beliefs at a single session
and labeling this "attitude" is not adequate.
To give another example, most definitions of
aggression include both the intent to harm and
the consequence of harm. This is to distinguish between (a) the black eye one child gives

another as they collide coming round a blind


bend; (b) the black eye that one child gives
another to get his or her candy (instrumental
aggression) or to harm that other child (noninstrumental aggression); and (c) the verbal
threat to give another child a black eye unless
the target child gives some candy to the other.
Since intent and physical harm are stressed in
the definition, only (b) above is adequate to
exemplify the construct aggression, though it
will not be adequate for the minority of persons
who prefer a more idiosyncratic definition. A
precise explication (and justification) of each
construct and the components that constitute it
is vital for high construct validity. Without it,
manipulations and measures cannot be tailored to whichever construct description
emerges from the explication.

Mono-operation Bias. Many experiments are designed to have only one exemplar of a possible cause, and some have just a single measure of the possible effect. Since single operations both underrepresent constructs and contain irrelevancies, construct validity will be lower in single exemplar research than in research where each construct is operationalized multiple times to triangulate on the referent.
There is usually no excuse for single operations
of effects, since gathering data on additional
measures is not costly. There is usually more
excuse for a single manipulation of the possible
cause, given that increasing the number of treatments leads to very large sample research or to
small sizes within each cell of the factorial design that usually results when more than one
treatment is involved. Even so, there is really
no substitute for deliberately varying two or
three exemplars of a treatment. Thus, the researcher interested in communicator expertise
might use one distinguished male professor
from a distinguished university, one distinguished female research scientist from a prestigious research center, and the most famous
science journalist in Western Germany. This
varies different combinations of irrelevancies (gender, affiliation, nationality, and academic standing), and tests can be made of whether
each source had an effect-or a differential
effect-on responses. If sample sizes do not
permit separate analyses by source, then the
data could at least be combined from all three
sources to test whether expertise was effective
despite the irrelevant sources of heterogeneity.
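To make this pooling strategy concrete, the sketch below (in Python, with invented data, sample sizes, and source labels) tests the expertise contrast within each source exemplar, where cell sizes permit, and then across all three exemplars combined.

```python
# Hypothetical sketch: pooling responses across three source exemplars to test
# whether "expertise" has an effect despite deliberately varied irrelevancies.
# All numbers, sample sizes, and variable names are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 30                                   # respondents per cell (hypothetical)
sources = ["male professor", "female research scientist", "science journalist"]
high_expertise = {s: rng.normal(6.0, 1.5, n) for s in sources}
low_expertise = {s: rng.normal(5.0, 1.5, n) for s in sources}

# Per-source tests, feasible only when cell sizes permit separate analyses.
for s in sources:
    t, p = stats.ttest_ind(high_expertise[s], low_expertise[s])
    print(f"{s}: t = {t:.2f}, p = {p:.3f}")

# Pooled test: is expertise effective despite heterogeneity in gender,
# affiliation, nationality, and academic standing?
t, p = stats.ttest_ind(np.concatenate(list(high_expertise.values())),
                       np.concatenate(list(low_expertise.values())))
print(f"pooled across sources: t = {t:.2f}, p = {p:.3f}")
```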

Mono-method Bias. To have more than one operational representation of a construct does
not necessarily imply that all irrelevancies have
been made heterogeneous. Indeed, this is never
the case when all the manipulations use the
same means of presenting treatments and all
the outcome measures use the same means of
recording responses. Thus, if all the experts in
the previous hypothetical example had been
presented to respondents in writing, it would
not logically be possible to generalize to experts
who are seen or heard. It would be more accurate to label the treatment as "experts presented in writing" than as "high expertise."
Attitude scales are often presented to respondents without apparent thought to (a) using
methods of recording other than paper-and-pencil; (b) varying the attitude statements so that some are positively and some negatively
worded; and (c) varying the position of the
positive and negative end of the response scale
so that it appears on the right or left of the page.
Attention to these three points determines whether one can rule out the possibility that the measure elicits "paper-and-pencil nonaccountable responses," "acquiescence," or "response bias" rather than reflecting "personal private attitude."
Interaction of Procedure and Treatment. Sometimes the respondents in the treatment and
control groups will gain new information or
undergo new experiences as part of the context
in which treatments are embedded, and this
may influence how treatments are reacted to.
For example, respondents in the New Jersey
Income Maintenance Experiments were guaranteed a fixed minimum income of various

Quasi Experimentation 507


levels for three years. Thus, each respondent
may have been reacting not just to the payment,
but to knowing the payment would last only
three years. To give another example of how
time can be conceptually interwoven with a
construct that has no necessary temporal attributes, we often do not know for how long a
treatment will affect an outcome, yet measurement
has to end some time. If experimental results
showed that an effect was short-lived, it would
be more useful to state that A caused B for n
weeks than to state that A caused B.

Diffusion or Imitation of the Treatment. If the treatments involve widely diffused informational programs-as with many mass media programs-or if the experimental and control groups can communicate with each other, then the possibility of treatment crossovers exists (Rubin, 1986). In the extreme case where no-treatment controls receive a treatment in similar dosage to the planned treatment group, the experiment becomes invalid. There is no longer a functional difference between the treatment and control groups. In many quasi experiments, the need to minimize nonequivalence makes it desirable to sample units that are as similar as possible in all respects other than the treatment. Physically adjacent units facilitate this. But their very propinquity also enhances the chances of treatment differences becoming obscured. For example, if one of the New England states were used as a control group to study the effects of changes in the New York abortion law, any true effects of the new law would be obscured if New Englanders went freely to New York for abortions.

Compensatory Equalization of Treatment. When the experimental treatment provides goods generally believed to be desirable, there may emerge administrative and constituency reluctance to tolerate the focused inequality that results. Thus, in nationwide educational experiments such as Follow Through, the control schools, particularly if equally needy, tended to be given Title 1 funds in amounts equivalent to those coming to the experimental schools. Several other experimental evaluations of compensatory education have met the same problem. It exemplifies a problem of administrative equity that must certainly occur among units of an industrial organization, and it explains some administrators' reluctance to employ random assignment to treatments which their constituencies consider valuable.

Compensatory Rivalry. Where the assignment of persons or organizational units to experimental and control conditions is made public (as it usually must be in experiments in industrial and organizational psychology), conditions of social competition are generated. The control group, as the natural underdog, is motivated to a rivalrous effort to reduce or reverse the expected difference. This result is particularly likely where intact units (such as departments, plants, work crews, etc.) are assigned to treatments, or if members of the control group stand to lose if a treatment were successful. For instance, Saretsky (1972) has pointed out that performance contracting would threaten the job security of schoolteachers and has reviewed evidence suggesting that the academic performance of children taught by teachers in the control groups of the OEO Performance Contracting Experiment was better during the experiment than it had been in past years. The net effect of atypically high learning gains by controls would be to diminish the difference in learning between them and the experimental children who were taught by outside contractors whose payments depended on the size of gains made by the children they taught. Saretsky describes the controls' special effort as a John Henry effect in honor of the railroad steel driver who, when learning that his work was to be compared with that of a steam drill, did so well that he outperformed the drill only to die of overexertion. Compensatory rivalry is in many ways like compensatory equalization. However, the latter results from administrators anticipating
problems from the groups that receive less
desirable treatments, while the former results
from the way that members of these groups
react to their treatments or nontreatments.

Resentful Demoralization of Respondents Receiving Less Desirable Treatments. When an experiment is obtrusive, the reaction of a no-treatment control group can be associated with
resentment and demoralization as well as with
compensatory rivalry. This is because controls
are often relatively deprived when compared
to experimentals. In an industrial setting, the
controls might retaliate by lowering productivity and company profits. This situation is likely
to lead to a posttest difference between treatment and no-treatment groups and it might be
quite wrong to attribute the difference to the
treatment. It would be more apt to label the no-treatment as the resentment treatment. (Of course,
this phenomenon is not restricted to control
groups. It can occur whenever treatments differ in desirability and respondents are aware of
the difference.)
Hypothesis-guessing Within Experimental
Conditions. Reactive research can also result
in uninterpretable treatment effects when persons in one treatment group guess how they are
supposed to behave and respond accordingly.
(In many situations it is not difficult to guess
what is desired, especially in education where
academic achievement is paramount or in industrial organization where productivity and
satisfaction are.) The problem can best be
avoided by making hypotheses hard to guess
(if there are any), by decreasing the general
level of reactivity in the experiment, or by deliberately giving different hypotheses to different respondents. But these solutions are at best
partial. Respondents are not passive, and they
will sometimes generate their own treatment-related hypotheses whatever experimenters do
to dampen such behavior. However, having a hypothesis does not necessarily imply the motivation or ability to comply with it-or to
sabotage it, for that matter. Despite the widespread discussion of treatment confounds presumed to result from wanting to give data that
will please researchers-which we suspect is
a result of discussions of the Hawthorne
effect-there is neither widespread evidence of
the Hawthorne effect in field experiments (see
reviews by D. Cook, 1967, and Diamond, 1974)
nor is there evidence of an analog orientation in
social psychology laboratory contexts (Weber
& Cook, 1972). However, we still lack a sophisticated and empirically corroborated theory of
the conditions under which (a) hypothesis-guessing occurs, (b) it is treatment-specific, and (c) it is translated into behavior that could lead to
erroneous conclusions about the nature of a
treatment construct.

Evaluation Apprehension. Weber and Cook (1972) reviewed considerable data from
laboratory experiments in social psychology,
which indicate that subjects attempted to
present themselves as both competent and psychologically healthy because they were apprehensive about being evaluated by experimenters whom they considered experts in personality adjustment and task performance. It is
not clear how widespread such an orientation
is in social science field experiments where
treatments tend to last longer and so populations may worry less about how social scientists evaluate them. They have to go about their
lives, after all! Nonetheless, some past treatment effects may be due to respondents in the
most desirable treatment groups presenting
themselves so as to be evaluated favorably.
Since this is rarely the target construct around
which experiments are designed, it must be
classed as a confound.
Experimenter Expectancies. There is considerable literature which indicates that an
experimenter's expectancies can bias the data
he or she obtains (Rosenthal, 1966). When this happens, it will not be clear whether the causal treatment is the treatment-as-labeled or the
experimenter's expectations. This threat can be
decreased by having experimenters who have
no expectations or false expectations, or by
analyzing the data separately for experimenters with different kinds or levels of expectancy.

Confounding Levels of Constructs With Constructs. Experiments typically involve the
manipulation of several discrete levels of an independent variable that is continuous. Thus,
one might conclude from an experiment that A
does not affect B when in fact A-at-level-1 does
not affect B, or A-at-level-4 does not affect B,
whereas A-at-level-2 might affect B, as might
A-at-level-3. Obviously, this threat is a problem when A and B are not linearly related, and
it is especially prevalent, we assume, when
treatments have only a weak impact. Thus, low
levels of A are manipulated, but conclusions
may be drawn about A without any qualifications about the strength of the manipulation.
The best control for this threat is to conduct parametric research in which many levels of A are
varied, many levels of B are measured, and a
wide range is estimated on both variables.
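The following sketch illustrates this parametric strategy with invented data; the curvilinear dose-response pattern, sample sizes, and level labels are assumed purely for demonstration. Comparing a straight-line and a quadratic description of the relationship, and reporting cell means by level, qualifies any conclusion about "A" by the strength of the manipulation.

```python
# Hypothetical sketch of the parametric strategy: manipulate several levels of A,
# then ask whether the A-B relationship is linear or curvilinear before drawing
# conclusions about "A" in general. Dose-response values are invented.
import numpy as np

rng = np.random.default_rng(1)

levels = np.repeat([1, 2, 3, 4], 50)                 # four manipulated strengths of A
true_effect = {1: 0.0, 2: 0.8, 3: 0.8, 4: 0.0}       # a curvilinear pattern by assumption
b = np.array([true_effect[x] for x in levels]) + rng.normal(0, 1, levels.size)

# Compare a straight-line and a quadratic description of the dose-response curve.
for degree in (1, 2):
    coefficients = np.polyfit(levels, b, degree)
    predicted = np.polyval(coefficients, levels)
    print(f"polynomial degree {degree}: residual variance = {np.mean((b - predicted) ** 2):.3f}")

# Cell means by level make the qualification explicit: "A-at-level-2 affects B"
# is a different claim from "A affects B."
for level in (1, 2, 3, 4):
    print(f"mean B at A-level {level}: {b[levels == level].mean():.2f}")
```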

External Validity

List of Threats. Of the concerns that Campbell and Stanley (1966) mentioned under external validity, those that focus on generalizing to, or across, times, settings, and persons are retained here under external validity as well. Those that have to do with interpreting operationalized treatments and measures in generalized terms were earlier grouped under construct validity. In both cases, the concern is to use information about particular samples, cases, instances, or exemplars to draw inferences about general categories, classes, types, populations, or universes.

Bracht and Glass (1968) pointed out that external validity has to do with the relationships among available samples or instances, the populations or categories they represent, and the populations or categories to which potential research users want to generalize. Cronbach (1982) reiterated this tripartite distinction, adding that some stakeholders seek to generalize a particular causal result to populations like those studied, while others want to extrapolate it to populations, contexts, and constructs manifestly different from those examined to date. As noted earlier, Campbell and Stanley's (1966) discussions of external validity are not clear about the need to conceptualize generalization and extrapolation differently. They build their discussion of external validity around identifying how generalizable a causal relationship is, with the fully generalizable causal statement specifying a causal relationship that holds for all types of persons in all settings at all times and with all operational representations of the cause and effect. Threats to external validity are therefore factors that limit such generalization; they are causal contingencies that produce a statistical interaction between the treatment and the threat. Thus, we will express external validity threats in the language of statistical interactions.

Interaction of Treatments With Treatments. This threat occurs if units experience more than one treatment, and we do not know if a causal finding can be generalized to the situation where they receive only one. The solution is either to give only one treatment to respondents or, if efficiency considerations demand multiple treatments, then the order of presenting them should be systematically varied and separate analyses should be conducted of the treatments received first and later.
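A minimal sketch of such counterbalanced ordering, with hypothetical participants and treatment labels, might look as follows.

```python
# Hypothetical sketch of systematically varying treatment order so that effects
# can be analyzed separately for treatments received first versus later.
from itertools import cycle, permutations

participants = [f"P{i:02d}" for i in range(1, 13)]          # invented unit labels
orders = cycle(permutations(["A", "B"]))                     # alternate A-B with B-A

assignment = {person: next(orders) for person in participants}
received_A_first = [p for p, order in assignment.items() if order[0] == "A"]
received_B_first = [p for p, order in assignment.items() if order[0] == "B"]

print("A first:", received_A_first)
print("B first:", received_B_first)
# Separate analyses would then compare responses to A when it came first with
# responses to A when it followed B, and likewise for B.
```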

Interaction of Testing With Treatment. To which kind of testing situations can a cause-effect
relationship be generalized? In particular, can
it be generalized beyond the testing conditions
that were originally used to probe the relationship? This is especially important when the
pretesting of respondents might condition the reception of the experimental stimulus.


We would want to know whether the same
result would have been obtained without a
pretest. A posttest-only control group is necessary for this. Similarly, if repeated posttest
measurements are made, we would want to
know whether the same results would be obtained if respondents were posttested once
rather than at each delay interval. The recommended solution here is to have independent
experimental groups at each delayed test
session.

Interaction of Selection With Treatment. To which categories of persons can a cause-effect relationship be generalized? Can it be generalized
beyond the groups used to establish the initial
relationship-that is, to various racial, social,
geographical, age, gender, or personality
groups? In particular, whenever the conditions
of recruiting respondents are systematically
selective, we are apt to have findings applicable
only to volunteers, exhibitionists, hypochondriacs, scientific do-gooders, those with nothing else to do with their time, and so on. One
feasible way of reducing this bias is to make
cooperation in the experiment as convenient as
possible. For example, volunteers who have to
come downtown to a television studio to participate in research are likely to be less typical of
the general population than randomly selected
volunteers in an experiment carried door-to-door. Or, an experiment involving executives is
likely to be less generalizable if it takes a day's
time to complete as opposed to ten minutes, for
the former experiment is likely to exclude more
busy persons who, by definition almost, do not
have the time to participate.
Interaction of Setting With Treatment. Can a
causal relationship obtained in a factory be
obtained in a bureaucracy, in a military camp,
on a university campus? The solution here is to
vary settings and to analyze for a causal relationship within each. This threat is of particular
relevance to organizational psychology since its settings are at such disparate levels as the central organization, the peripheral worksite,
the small group within the worksite, and the
individual. When can we generalize from any
one of these units to the others? The threat
is also relevant because of the volunteer bias as
to which organizations cooperate. The refusal
rate in getting the cooperation of industrial
organizations, school systems, and other settings must be nearer 75 percent than 25 percent,
especially if we include those that were never
contacted because it was considered certain
they would refuse. The volunteering organizations will often be the most progressive, proud,
and institutionally exhibitionist. For example,
Campbell (1956), although working with Office
of Naval Research funds, could not get access to
destroyer crews and had to settle for high-morale submarine crews. Can we extrapolate
from such situations to those where morale,
exhibitionism, pride, or self-improvement
needs are lower?

Interaction of History With Treatment. To which periods in the past and future can a particular
causal relationship be generalized? Sometimes,
an experiment takes place on a very special day
(e.g., when a president dies), and the researcher
is left wondering whether the same cause and
effect relationship would have been obtained
under more mundane circumstances. But we
can never logically extrapolate findings from
the present to the future. Even so, commonsense solutions lie in replicating an experiment
at different times (for other advantages of consecutive replication, see Cook, 1974) or in conducting a literature review to see if prior evidence at least does not refute a causal relationship found in a single study at a particular time
in history.
Conceptual Issues. Assessing external validity is a complex and difficult process. We
cannot extrapolate with any logical certainty
from the persons, settings, and times of a study
to all the different classes of person, setting,

Quasi Experimentation 511

and time of interest to the various stakeholders


in social research. Indeed, we cannot even generalize with certainty from the limited set of
persons and settings specified in study questions to similar entities in the present and
immediate future. All sampling has a past
orientation. Add to this that many of the
populations involved in research cannot be
comprehensively specified because they are
persons, settings, and times selected for
convenience rather than formal representativeness understood in sampling theory terms.
While we can and should measure and report
attributes of these samples, the descriptions
achieved will not fully describe the populations from which the samples of convenience
were drawn.
These difficulties do not mean that we are
without techniques for increasing external
validity in research aimed at probing causal
hypotheses. The formal random sampling model
is sometimes used in experiments to obtain
representative samples. This tends to be primarily when the population is of persons (rather
than settings or times), the setting is highly
localized (as in, say, Palo Alto or Ann Arbor),
and the treatment costs little to deliver (as in
studies of marketing by mail, though for an
important exception, see Connell, Turner, &
Mason, 1985). The high financial cost involved
in the quality implementation of most other
treatments across multiple settings and times
precludes random sampling, as does the fact
that few organizations are willing to tolerate researchers. Volunteerism is the reality-not the
benign coercion that facilitates randomly
selecting organizations (or even cases within
organizations). The experimental ideal is of
random selection for purposes of sample
representativeness, followed by random assignment for purposes of initial group comparability. But even its proponents acknowledge
the many serious limitations to implementing
the first of these two steps in much research
practice (e.g., Kish, 1987; Lavori, Louis, Bailar,
& Polansky, 1986).

A second model of generalization to specific populations entails generalizing to modal instances. This insight is captured by the
politician's criterion when assessing a new
policy or program idea: "Will it fly in Peoria?"
The presumption here is that Peoria is modal of
the entire United States in terms of population
size, voter turnout, Democratic-Republican
split, or whatever else politicians think will
influence voter approval. The same insight is
captured by the factory owner who wants to
generalize to his or her own factory with its
current work force that might modally be described as white, middle-aged, and with ten
years experience. Should resources or politics
preclude sampling workers at random, a consultant working with the factory owner might
choose a purposive sample that resembles the
mode as closely as possible, with the mode
being established from archival records or, failing these, from expert judgments. The key is a
sample belonging in the category described as
the mode. It is not that the cases are randomly
selected from within the modal category, though
that would be advantageous, also.
A third model is really a more general
version of the preceding. It involves impressionistically generalizing to target universes on
the basis of category membership alone. If generalization is sought to factories employing fewer
than 200 persons, then factories should be selected for study that meet this criterion. Nothing is necessarily implied about them being
representative of all such factories, as would be
the case if random sampling had occurred or if
presumptively modal instances had been selected. All that is implied is that the sample of
cases falls within the target category. In our
experience, most social research that tests intervention hypotheses has sampling plans following this simple model of generalization. Its
major drawback is that the samples so selected
may be homogeneous on other irrelevant attributes. That is, the factories with fewer than
200 employees may come from very few regions of the country or produce a restricted range of goods. However, the instances chosen for category membership will be heterogeneous on other irrelevant attributes by which the
data can be stratified-as in, for example, the
ratio of male to female employees, hours
worked, and so on. Subanalyses can then be
conducted to examine whether such factors
moderate a treatment effect, thus limiting its
external validity. To proceed this way makes
no claim that the moderator variable is representative of the category-say, African-Americans or entrepreneurs. Rather, the weaker claim
is that a causal relationship was (or was not)
disproved among people belonging in these
categories, even though they may all live in
Maine or work for Macy's. Generalization
depends here on stratification and subanalysis
more than on the nature of the assignment
process to strata.
Some scholars aspire to universalistic
inferences that hold for all times, all places, and
all persons. Other scholars are interested in
multiple populations rather than the single one
around which study sampling designs are often crafted. One of the most useful models of
generalization for both these stakeholder groups
involves purposive sampling to obtain
heterogeneous samples of persons, settings, and
times. It is then possible to probe the robustness
of a particular cause and effect relationship
over a wide range of possible moderating factors, even though researchers may not know
whether the units representing each factor are
formally representative of it in the strict sampling theory sense. They would merely know
that the units belong in the particular category
of interest. In this model, generalizing across
categories is more important than generalizing
to them, and the key is to sample many categories that theory or experience suggest will interact with the treatment. Robustness of results is
the watchword, not formal representativeness.
The heterogeneous sampling model
underlies the success of meta-analysis as a
tool for synthesizing causal studies and estimating the generalizability of a causal connection. Take the case of patient education and its effects on recovery from surgery
(Devine & Cook, 1983). Across the more than
100 studies now available, the effects of patient
education have been examined in many
quite different ways. In conceptualizing
patient education, studies differ in the emphasis they place on different combinations of social
support, on informing patients about what will
happen to them during their hospital stay, and
on teaching the patients skills and exercises
designed to reduce postsurgical complications.
Recovery from surgery has also been measured
quite heterogeneously, as on the basis of length
of hospital stay, pain medication taken, and
self-reports of physical and psychological
comfort. The patients studied have had many
different types of operations involving a diversity of body parts. The hospitals they attended
were sometimes privately and sometimes
publicly owned, and had quite different staff
mixes, physical locations, and philosophies of
patient care. The intervention was provided by
nurses in many cases, but by physicians and
chaplains in others. And research on the topic
has been continuous since the early 1960s. Meta-analysis is best known for probes of the main
effect question: How effective is patient education? But this can be paraphrased in different
ways that speak to different facets of causal
generalization. One way is to ask whether an
effect is detectable despite the heterogeneity in
such external validity threats. Here the irrelevant variability is a counterforce against which
the treatment is somehow made to struggle
under the logic that if it is powerful enough to
emerge through all these counterforces, then it
must indeed be powerful. A second, more
specific and useful way of phrasing the causal
generalization question uses the language of
external validity threats explicitly: Does the
size or direction of a causal relationship depend on the type of treatment, patient, hospital,
or time period when the study was conducted?
Causal relationships that can be specified in terms of major variables conditioning effectiveness are very valuable. Potential users
of the information can then examine attributes
of the persons and settings to which they want
to generalize and ask whether these correspond
with the attributes on which research suggests
the treatment's effectiveness depends. At
issue is the link from utos to UTOS. Causal relationships that remain stubbornly robust across
a wide range of possible conditioning factors
also have important implications for extrapolation from utos to *UTOS. They inspire confidence that the same results are likely to hold
under novel circumstances that were not found
in the studies entering the literature under
review.
We should distinguish here, though,
between findings that replicate the sign and
those that replicate the magnitude of a cause
and effect relationship. Meta-analysis is not a
fine enough tool for us to place much confidence in small differences in average effect size
attributable to a causal contingency variable.
This is primarily because of the difficulty of
removing all the variance associated with irrelevant differences between the kinds of studies with and without the potential external
validity threat under review. Also, in many
policy contexts, it is often infeasible to act upon
differences in effect size. One cannot disseminate a treatment to some populations but withhold it from others, or provide one treatment
for one kind of setting and a different one for
another. The focused differences are too salient
and too politically and ethically troublesome.
Differences in causal sign have quite different
consequences for the policy world, however.
They suggest not only that a treatment helps
some populations, but also that it is actively
harmful to others. The policy consequences of
doing harm are much more serious than those
associated with helping some groups more than
others when all are helped or none are harmed.
In our discussion of external validity and the
heterogeneous (and usually purposive) sampling that promotes it, we understand replicability in terms of the reproduction of causal signs rather than the reproduction of comparable effect sizes of the same sign.
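The kind of probe we have in mind can be sketched with invented effect sizes; the moderator categories and values below are hypothetical and stand in for the contrasts a real synthesis would code. Study-level effects are grouped by a potential external validity threat, and what is checked is whether the sign of the effect replicates within each group, not whether the magnitudes coincide.

```python
# Hypothetical sketch of a moderator probe in a meta-analysis: study-level effect
# sizes are grouped by who delivered the intervention, and the question asked is
# whether the sign of the effect replicates within each group. All values invented.
import numpy as np

rng = np.random.default_rng(2)

effect_sizes = {
    "delivered by nurses":     rng.normal(0.45, 0.20, 40),
    "delivered by physicians": rng.normal(0.35, 0.25, 15),
    "delivered by chaplains":  rng.normal(0.30, 0.30, 8),
}

for moderator, d in effect_sizes.items():
    share_positive = (d > 0).mean() * 100
    print(f"{moderator}: mean d = {d.mean():.2f}, "
          f"{share_positive:.0f}% of studies positive in sign")
```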
Individual studies obviously provide less
scope than literature reviews for probing the
robustness of replication across a heterogeneous assortment of external validity threats.
Nonetheless, the model has implications for
single studies. It suggests sampling broadly,
making sure that persons with a wide variety of
different but conceptually relevant backgrounds are included in the sampling plan in
sufficient numbers for responsible analysis. The
model also suggests making sure that there is
heterogeneity in the settings and times studied,
even though cost considerations usually constrain intervention studies to a small number of
settings and times.
Since resources for sampling settings are
limited in many individual studies, it is useful
to consider a subcase of the heterogeneous
sampling model. This involves using theory or
other forms of experience to sample types that are
maximally different from each other and are therefore particularly likely to condition the treatment impact. Thus, if a new personnel testing
system were to be initiated in Army recruiting
stations, researchers might do well to conduct
field tests not only in some recruitment centers
close to the mode but also in others that have
shown themselves to be run particularly well
and particularly poorly. Any treatment that is
effective only in the best run centers has a
demonstrated potential for success, but there is
a legitimate question about its generality and
about whether its implementation can be improved in other recruiting centers to get the
same results. Of course, if the results are contingent on factors only found in the best organized
settings, then the restricted external validity is
a problem indeed. But if the effect is found in
both the modal and best centers, this helps give
an impressionistic fix on the generality of the
new practice. It is effective from the top to the
middle of the distribution of presumed organizational effectiveness, with the mode representingwheremostrecruiting centers are likely

-------------------------------------------------------------m~

514

Cook, Campbell, and Peracchio

to be.lf the effect is found across all three levels


of presumed organizational quality, then the
inference can be drawn that the effect is robust
across most types of recruiting centers. At
play here is a notion of generalization through
interpolation between conditions that have
been deemed to be among the most and least
likely to show the effect. This says nothing
about the transferability of the causal relationship to different settings in or beyond the Army;
and it also says nothing about transfer to different times or variants of the treatment or outcome measure. But it does offer a probe of
generalization through interpolation.
These various methods for strengthening
external validity differ in the scientific warrant
they provide, in the circumstances where they
are appropriate, and in their feasibility of implementation. Anyone designing research into
generalized causal connections should be aware
of them. But they are not enough. Causal generalization is also promoted when the causal
mediating processes are known that take place
after a successful treatment has been varied and
before it has produced its effect (Cronbach,
1982). The difficulty here is that the quantitative analysis of causal mediating processes relies
so heavily on high-quality measurement and
fully specified theories that can be pitted against
each other (Glymour, 1987). Acknowledging
these limitations, Cronbach has suggested
looking to ethnography, journalism, and history for methods of causal explanation/mediation. This recommendation will fall on deaf
ears in many quarters in the social sciences, and
we feel confident in claiming that no simple
recommendations can be forthcoming about
methods to achieve causal generalization
through causal explanation, just as no simple
recommendations are possible about causal
generalization through sampling techniques.
We persist in believing, therefore, that external validity is the Achilles' heel of research
design. The best tools-random selection from
a clearly designated universe and explanation
through the causal modeling of mediating processes-are rarely implementable or definitive in the context of testing social interventions.
Some Relationships Among
the Four Kinds of Validity
We have previously noted how internal and
statistical conclusion validity are like each other
in that they both promote causal relationships,
while external and construct validity are like
each other in dealing with the generalization of
such causal relationships. We now want to
mention some other relationships among the
different kinds of validity.
First, increasing one kind of validity will often decrease another. For instance, internal validity is best served by carrying out randomized
experiments. But the organizations willing to
tolerate this are probably less representative
than the organizations willing to tolerate passive measurement. Statistical conclusion validity is increased if the experimenter can rigidly
control the stimuli impinging on respondents.
But this procedure can decrease both external
and construct validity. The construct validity
of causes and effects can be increased by the
critical selection of multiple manipulations or
measures. But each of these is likely to increase
attrition from the experiment (influencing internal validity if it is treatment-related and
external validity if it is not). These and other
countervailing relationships suggest that crucial components of any experiment are explicating the priority-ordering among the four
kinds of validity and searching for ways to
avoid or minimize all unnecessary tradeoffs.
However, some tradeoffs are inevitable, and
we think it unrealistic to expect that a single
piece of research will effectively answer all of
the validity questions surrounding even the
simplest causal relationship.
Second, we presume that internal validity
has priority over all the other types of validity.
Since experiments are conducted to justify
causal statements of the manipulability theory sort, choosing to conduct any experiment already presupposes the primacy of internal
validity. Indeed, this primacy is built into the
most esteemed form of experimentation-the
randomized experiment-through the definitional saliency of an assignment procedure that
controls for threats to internal validity but is
irrelevant to construct and external validity.
Quasi experiments do not have random assignment, and so the investigator's arduous
and ultimately less fulfilling task is to rule out
internal validity threats one by one. In the rest
of this chapter, we argue that ruling out internal validity threats is better promoted by experimental design than by statistical adjustments and that, assuming random assignment
is not possible, designs are preferred that test
more numerous or more specific data-based
implications of the causal hypothesis under
test.
The primacy of internal validity is not universally acknowledged. For the field of evaluation, Cronbach (1982) argues that internal
validity as we understand it is trivial because
the treatment effect is retrospective and because the treatment itself is specified operationally rather than conceptually. This conception of internal validity tells us little about what
the causal agent is, and it makes no reference to
what will be effective in the immediate future.
He wants the primacy to be placed instead on
external validity understood as extrapolation
to *UTOS. He believes that identifying causal
explanatory processes is the best way to achieve
such extrapolation under the assumption that
it should be possible to recreate these processes
in novel settings using novel procedures that
are tailored to the particulars of the world to
which potential users of the information want
to generalize. For Cronbach, the important
thing is to recreate causal generative processes;
how they are set in motion is less important and
can be locally determined. Cronbach wants to
modify the function of experiments, getting
much deeper inside the black box and downplaying the molar treatment contrast that

dominates thinking in the Fisher tradition of experimentation.
One problem with this viewpoint is the
difficulty of gaining convincing quantitative
evidence about causal explanatory processes
given the inevitable underspecification of
substantive theories, the impossibility of
theory-neutral measurement, and resistance to
look to ethnography, history and journalism
for appropriate methods. Another problem is
that Cronbach's position depends on the presence of valid causal connections to be explained.
Here again a clash of standards emerges. Cronbach is deliberately willing to be less conservative about causal claims than most other scholars (and us). He believes that current standards
in the social sciences are so conservative about
inferring causation that important and effective treatments are passed over because of low
alpha levels, poor measurement, inadequate
treatment implementation, and inadequate
theories that do not make clear the contingencies on which a causal relationship depends.
Taste plays a major role in the risk levels scholars are willing to tolerate. So let us state here
that financial cost, avoiding unnecessary disruptions to human lives, and avoiding a political backlash against social change and its honest evaluation all reinforce conservatism, making us afraid to advocate the widespread dissemination of social changes that later evidence
might show to be ineffective. We fear this more
than the danger of overlooking effective treatments that have slipped through the imperfect
editing net of social science methods. More
than anything else, this dictates the primacy of
internal validity.
Cronbach based his critique of the primacy
of internal validity on its failure to meet the
needs of those who look to evaluation for guidance in improving social programs. He specifically excluded basic research from his comments. The third point we would like to make
about the validity threats explicated above is
that their priority-ordering seems to vary with
the kind of stake one has in social research.

For persons interested in theory-testing, it is almost as important to show that A causes B (a
problem of construct validity of the cause and
effect) as it is to show that something causes
something (a problem of internal validity).
Moreover, most theories do not specify target
settings, populations, times, or the like, so that
external validity is of relatively little importance. In particular, it can be sacrificed for the
statistical power that comes through having
isolated settings, standardized procedures, and
homogeneous respondent populations. Thus,
for investigators with theoretical interests, the
rank-ordering of validity types is probably (a)
internal, (b) construct, (c) statistical conclusion,
and (d) external. (Actually, the construct validity of causes is likely to be much more important for theorists than the construct validity of
effects. Think, for example, how easily "attitude" is operationalized in many persuasion
experiments, or "cooperation" in bargaining
studies, or "aggression" in interpersonal violence studies. On the other hand, think how
much care goes into demonstrating that a
manipulation varied cognitive dissonance and
not reactance or evaluation apprehension.)
Much applied research has a different set
of priorities. It is often concerned with testing
whether a particular specific problem has been
alleviated. This makes internal validity important for demonstrating "alleviation," and construct validity of the effect for demonstrating
"problem" relevance. It is also critical that a
study is conducted in a context that permits
either broad generalization or generalization to
specific target populations of obvious importance (high interest in external validity). The
research is relatively less concerned with
whether the causal treatment is, for example,
better lighting or a Hawthorne effect, so long as
the treatment and its effects are generally reproducible. Thus, the rank ordering of validity
types for applied researchers is likely to be
something like (a) internal, (b) external, (c)
construct validity of the effect, (d) statistical
conclusion validity, and (e) construct validity of the cause. External validity and the construct validity of the effect rank higher than in basic
research and define an important part of the
context within which concern for internal validity is expressed. We agree with Cronbach
that external and construct validities are more
important in applied work; the issue is whether
they take priority over internal validity.
Though experiments are designed to test
causal hypotheses, and though internal validity is the sine qua non of causal inference, there
are nonetheless some contexts where it would
not be advisable to subordinate too much to
internal validity. In the consulting world, for
instance, someone commissioning research to
improve the efficiency of his or her own organization might not take kindly to the idea of
testing a proposed improvement in a laboratory setting with sophomore respondents. A
necessary condition for meeting clients' needs
is that clients can generalize findings to their
own organizations and to the indicators of efficiency regularly used for monitoring performance. Indeed, the need for generalizability in this
respect may be so great that a client may be
prepared to sacrifice some gains in internal
validity for a necessary minimum of external
validity. However, the client then runs the risk
of being unable to answer any preliminary
causal question with confidence. Hence, while
the sociology of doing research might sometimes incline one to rate external validity higher
than internal validity, the logic of testing causal
propositions suggests tempering this inclination lest the very hypothesis-testing rationale
for experiments be vitiated.

Quasi-experimental Designs
This section is devoted to an exposition of
some quasi-experimental designs evaluated primarily with respect to their ability to rule out
internal validity threats. In outlining them, we
shall use a notational system in which X stands
for a treatment; O stands for an observation; subscripts 1 through n refer to the sequential order of implementing treatments (X1 ... Xn) or of recording observations (O1 ... On); and a dashed
line between experimental groups indicates
that the groups were not randomly formed.
Three Generally Uninterpretable Designs
We present below three designs often used in
industrial and organizational research. While
they are useful for suggesting new hypotheses
and cross-validating causal claims from higher
quality quantitative research (Trend, 1979), they
are normally not sufficient for permitting strong
tests of causal hypotheses. The reasons for this
are discussed below.
The One Group Posttest-Only Design. This
design requires observations on respondents
who have experienced a treatment, but there
are no control groups or pretest observations
on the same scale as the posttest. The design is
generally uninterpretable causally because the
absence of a pretest makes it difficult to infer
that any change has taken place, while the
absence of a no-treatment control group makes
it difficult to rule out many plausible threats to
internal validity.

There are some theory-linked conditions, though, under which the design is more interpretable. Noting that detectives can be successful in discovering the cause of major crimes,
Scriven (1984) examines why. He attributes
success to the usually obvious fact that a crime
has been committed, to the availability of
multiple "clues" at the site, and to the ability to
link these clues to the modus operandi of known
criminals. Thus, a murder investigation is facilitated if there is a body, if the body and the
environment provide a pattern of clues that
specify the time and manner of death, and if
there is a file of known murderers who commit
their crimes in distinctive ways that heavily overlap with details found at the scene of the crime. If more than one person has a particular
modus operandi, then the known suspects can
be questioned to probe their alibis. This approach assumes that some causes leave unique
signatures on the patterned effect, and so it
implies a modification to the design presented
above. It should include multiple outcome
variables. Even so, for the modus operandi
approach to work, the effect has to stand out
clearly, the pattern of evidence surrounding it
has to be clear, the potential causes all have to
be known, and auxiliary information has to be
available for discriminating among alternatives
when several are viable.
Other disciplines use a similar approach.
Pathologists have to discover why someone
died (the effect) using the many sources of
evidence from the corpse and the setting (the
clues) to come up with a finite set of possible
diseases and then relating each of these possibilities to highly specific and known patterns of
their effects in the relevant scientific literature.
Epidemiologists have to discover how an epidemic like AIDS (the effect) could have come to
the United States using all the clues available
(early prevalence in the homosexual and Haitian communities) in order to trace the disease's
origin to Cuban soldiers serving in Equatorial
Africa where AIDS is endemic, then visiting
Haiti after their return from Africa, and homosexuals from the United States then visiting
Haiti on vacation. The crucial question is: How
frequently does one find situations in which
the modus operandi approach is appropriate
outside of the disciplines above? When are the
effects so large and clear, the pattern of clues so
distinctive, and the relevant scientific knowledge of all the plausible causes so definitive
that a "unique signature" can be identified?
Our guess is that these conditions are rarely
met in the social sciences.
However, Yin (1984) suggests that they can
be met in explanatory case studies where the
researcher seeks to learn why an intervention
affected an outcome under the assumption that it did. Yin postulates that explanation is the most important purpose for case studies and
relegates to secondary status those case studies
that are descriptive (did it have an effect?-the
major question to which this chapter is addressed) or exploratory (what did I discover
that is new?). Thus, Yin's claims about the
utility of case studies are predicated on an
essentialist theory of causation that is largely
irrelevant to the more modest aim of quasi
experimentation: describing grosser cause-effect connections. Critics might wonder what
sense it makes in Yin's conception (as in
Cronbach's) to explain a causal relationship
unless it has already been identified well. Notwithstanding, we have no issue with the useful
role that case studies can play in helping explain those descriptive causal links that can be
confidently postulated. Nor do we have any
issue with using case studies to discover relationships previously unknown, or with using
findings from case studies to confront the findings of quantitative science when the two produce conflicting results. Our worry is with the
use of the case studies for describing causal
connections between manipulated causal agents
and outcome changes when there is no theory
of method, like the modus operandi approach,
that seems obviously relevant.
The One Group Pretest-Posttest Design. The
one group pretest-posttest design requires a
pretest observation taken on a single group of
respondents (O1) who later receive a treatment (X) followed by a single posttest observation (O2). This frequently used design is diagrammed below. An obvious structural limitation is that it does not include a no-treatment control group or other form of comparison group for assessing some of the internal validity threats mentioned below.

O1     X     O2

Finkle (1979) used the design to evaluate how flexible working hours influenced lateness and short-term absenteeism. He obtained managers' reports of employee lateness and
absenteeism and found they decreased after
flexible working hours were introduced. He
concluded, therefore, that the change in work
hours decreased lateness and absenteeism. But
the effects might alternatively be due to history, for other events could have happened
between the pretest and posttest that affected
lateness and absenteeism. For example, a new
salary scale might have been introduced, a
union policy changed, or a new training program implemented. Since any of these could
have affected the outcomes singly or in combination, it is incumbent upon the researcher
either to demonstrate through observation that
they did not operate or to otherwise render
them implausible as alternative explanations
for the results.
Consider next, statistical regression. Why
should working hours be changed in the first
place? One possibility is that lateness and absenteeism have become unacceptably high,
prompting a concern to do something about
them. They could be high for many reasons,
only some of which are relevant to regression.
For instance, if levels have always been high or
are part of a genuine long-term increase, regression would not be a problem. But if they
have suddenly become high due to some short-term fluctuation or other source of "error,"
then implementing the treatment could lead to
regression. This is because measures at the time
of the intervention will have a larger error
component increasing scores than decreasing
them. All other things being equal, there will
then be regression downward toward the grand
mean of the respective trends, resulting in
"improved" absenteeism and lateness. Likewise, when managers seek aid from organizational psychologists because the performance
of their staff is suddenly and inexplicably decreasing and a one group pretest-posttest design is used to evaluate the advice, the resulting regression will make the consultants who advise
them look good (Einhorn & Hogarth, 1981).

A more common form of regression artifact
for this design arises when a special program is
given only to those with extreme pretest scores.
For example, a compensatory educational program given only to low scorers will seem to
produce improvement where the pretest-posttest correlation is less than unity (McNemar, 1940) because at the first testing, the
children's obtained scores contain more error
deflating scores than inflating them. Similarly,
if an advanced but totally ineffective training
program is given to the best salespeople of one
year, it will appear to be harmful. The sales
volume of its graduates will decline over time
when compared to other salespeople. The
magnitude of such regression depends on how
highly each salesperson's sales volume is correlated from year to year and how far from the
population mean the salespeople score. No
regression is involved when the correlation is
perfect or when the true score equals the population mean.
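The arithmetic of this artifact is easy to demonstrate by simulation. In the hypothetical sketch below, no treatment is administered at all, yet the group selected for extreme pretest scores still drifts back toward the population mean; the error model and all figures are assumptions made for illustration.

```python
# Hypothetical simulation of the regression artifact: the "best salespeople" are
# selected on an error-laden pretest, receive no treatment at all, and still
# decline toward the population mean at posttest. The error model is assumed.
import numpy as np

rng = np.random.default_rng(3)

n, r = 10_000, 0.6                       # population size; year-to-year correlation
true_score = rng.normal(100, 15, n)
error_sd = 15 * np.sqrt(1 / r - 1)       # chosen so that corr(pretest, posttest) = r
pretest = true_score + rng.normal(0, error_sd, n)
posttest = true_score + rng.normal(0, error_sd, n)   # no treatment given

selected = pretest > np.quantile(pretest, 0.90)      # top 10% on the pretest
print("selected group pretest mean :", round(pretest[selected].mean(), 1))
print("selected group posttest mean:", round(posttest[selected].mean(), 1))
print("population mean             :", round(pretest.mean(), 1))
# Expected drift back toward the mean is roughly (1 - r) times the selected
# group's pretest deviation from the population mean; it vanishes when r = 1.
```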
Even when the pretest-posttest correlation
is high, a posttest decrease in lateness or absenteeism in the Finkle study could still be accounted for in terms of maturation. Typically,
lateness and absenteeism do not stay constant
from year to year or from season to season
within any one year, being subject to both random and systematic fluctuations. Whenever
systematic changes take place, the posttest level
can be different from the pretest level. For
instance, if the pretest measures were made in
September and the posttest in February, it would
not be too difficult to imagine seasonal variation in climate causing greater absenteeism
and lateness. Also, if the work force is aging
and the pretest and posttest are separated by
several years, who's to say whether any observed change is due to flexible working hours
or workers being more financially secure when
they are older and feeling free to come in when
they like.
In some contexts, there are ways of estimating whether maturation could plausibly
account for any pretest-posttest changes in

absenteeism and lateness. Imagine that the pretest and posttest are separated by a year and
that the mean level of experience (years worked
in an organization) increases between the pretest and posttest. If the pool of pretest scores
were sufficiently large, one could regress absenteeism and lateness onto years worked. If
the resulting relationship were zero, then the
maturation hypothesis would be rendered
implausible. If there were a relationship, one
could then use the regression equation to predict what scores would have been at the posttest
in the absence of the treatment. The obtained
performance could then be compared with the
predicted level. Care must be taken with this
estimation procedure because the mean expected posttest performance is obtained from a
pool of pretest scores, but the posttest data with
which this prediction is compared come from a
second measurement wave. This testing problem is less plausible, of course, if measurement
is unobtrusive, perhaps because it comes from
regularly collected administrative records.
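A sketch of this estimation procedure follows; the records, coefficients, and the obtained posttest figure are invented for illustration.

```python
# Hypothetical sketch of the maturation probe: regress pretest absenteeism on
# years worked, then project what the posttest would look like if the only
# change were one additional year of seniority. All figures are invented.
import numpy as np

rng = np.random.default_rng(4)

years_worked = rng.uniform(0, 30, 500)                         # pretest pool
absent_days = 12 - 0.15 * years_worked + rng.normal(0, 2, 500) # pretest absenteeism

slope, intercept = np.polyfit(years_worked, absent_days, 1)
print(f"pretest relationship: absenteeism = {intercept:.2f} + ({slope:.3f} * years worked)")

# Posttest level expected from maturation alone (everyone one year more senior).
expected_posttest_mean = (intercept + slope * (years_worked + 1)).mean()
print("expected posttest mean under maturation alone:", round(expected_posttest_mean, 2))

# Compare with the mean actually obtained at the second measurement wave; a drop
# well beyond the prediction is harder to attribute to maturation.
obtained_posttest_mean = 9.0                                   # hypothetical figure
print("obtained posttest mean:", obtained_posttest_mean)
```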
The internal validity threats mentioned
above do not operate in every single situation
where the one group pretest-posttest design is
used. For instance, Robertson and Rossiter
(1976) used the design to study the impact of
television advertising for children's toys and
games during the Christmas season. Children
were asked to nominate their five most strongly
preferred choices for a Christmas present five
weeks before Christmas and then again four
weeks later. All brand-name items reported by
the children were traced to network television
logs for November and December. The researchers hypothesized that any change in requests
for toys and games from the first to the second
wave of data collection would be evidence of
the influence of television advertising. Thus,
when a five percent increase in the nomination
of advertised toys and games was found, it was
interpreted as an advertising effect. However,
the children's preference change might be
due to history, other events that occurred between the pretest and posttest. The increase in advertising for toy and game products at Christmas is not limited to television; it also occurs
with radio, print, and store circular advertising. Given the short period between the two
testing waves, we do not believe maturation to
be a plausible threat, nor is regression a problem, since it is difficult to imagine that the
children's preferences were initially biased systematically against advertised toys. Threats to
validity are only potential, and those that are
usually associated with a particular design do
not operate in all concrete settings where the
design is used.
Unfortunately, it is rare to find specific research projects where the threats of history,
maturation, and regression are implausible and
where the one group pretest-posttest design is
causally interpretable. To rule out history, either the respondents have to be physically isolated from all other forces that might affect the
experimental outcomes, or the outcomes have
to have no external forces acting upon them, or
the test-retest interval has to be very short. To
rule out statistical regression, either a series of
pretest observations has to show that the
immediate pretreatment observation did not
deviate from the pretest trend or an argument
has to be made based on the high reliability of
measures. To rule out maturation, we either
have to show that the posttreatment observations deviated from whatever pretreatment
maturational trend was observed or we have to
analyze pretest scores by the maturational
variable and show no trend. Since the one group
pretest-posttest design has only a single pretest observation by definition, we know nothing definitive about the pretest trend or the typicality of
the immediate pretreatment observation. We
are then thrown back on less direct data probes
and other plausibility arguments to rule out the
threats in question.
Posttest-Only Design With Nonequivalent
Groups. Sometimes a treatment is implemented before the researcher can prepare for it,
and the evaluation procedure must be designed

after the treatment has begun. Such ex post facto research does not imply the absence of
pretest observations, for archival records can
often be used to establish pretest performance.
However, in the present context, we understand that no pretest observations are available on a scale that is identical with, or equivalent to, the scale used to collect posttest data. The design is diagrammed below, with the treatment and comparison groups being measured only at the posttest.

X          O
------------------
           O

The most obvious structural flaw with


this design is the absence of pretests so that any
group differences at the posttest can be attributed to the treatment or to group selection
differences. Donaldson (1975) used the design
to evaluate the effects of job enlargement on the
satisfaction of female assembly line workers
who worked either in an enlarged job or a job
whose design remained unchanged. The former group was more satisfied, and the researchers attributed this to the enlargement of their
jobs. Unfortunately, although the two treatment groups were similar in age, they could
have differed in prior job satisfaction, productivity, or a host of other variables correlated
with job satisfaction. The possibility of group
differences makes it impossible to separate out
treatment and selection effects.
One rationale offered for preferring this
weak design seems to be flawed. Some persons
fear that pretest measurement will sensitize
subjects and differentially influence posttest
scores across the treatment groups (Lana, 1969).
However, there have been few demonstrations
of differential pretest sensitization, even in a
meta-analysis of 32 studies (Wilson & Putnam,
1982). Note in this connection that pretest effects which are constant across the various
treatment groups do not provide a threat to
internal validity. Only differential effects do.

Undoubtedly, there are some as yet unknown
circumstances where differential pretest sensitization occurs, and it is then advisable to
administer two equivalent forms of a test, one at the pretest and the other at the posttest, rather than use the same test twice. Lengthening the time interval between pretest and
posttest, as reported by Wilson and Putnam
(1982), may also minimize differential pretest
effects. However, it seems that, in general,
pretests do not lead to differential treatment
sensitization. Such sensitization does not provide, therefore, a good rationale for avoiding
pretests.
Pretests are not always available, however.
How can the researcher deal with their absence? Hutton and McNeil (1981) used retrospective pretests in their evaluation of an energy
conservation program. They asked subjects
when they had implemented a list of conservation tips recommended in an experimental
communication, and from the dates determined
whether these actions were taken before or
after the conservation program began. This
retrospective approach assumes that the treatment does not influence memory, inflating or
deflating estimates of time or outcome-related
behavior when compared to the control population. We echo here the opinion of Campbell
and Stanley (1966): "Given the autistic factors
known to distort memory and interview reports, such data (i.e., retrospective pretest data)
can never be crucial" (p. 66). We may one day
have a rich literature on the conditions under
which retrospection does and does not bias
estimates of prior performance, but until then,
caution should be the watchword.
A second technique advocated for handling
the lack of a pretest is to form the treatment and
control groups through matching on correlates of
the pretest. Levy, Mathews, Stephenson, Tenney, and Schucker (1985) took ten test stores in
Washington and ten control stores in Maryland
and matched them on store size and the socioeconomic characteristics of their sites. The
purpose of the study was to evaluate the effects

of a nutrition information program on product


sales and market share. All stores were owned
by the same supermarket chain and followed
the same management procedures. The most
serious threat is the possibility of undermatching and the selection bias that results
from this. Although the researchers matched
the stores on four variables, other variables not
used might discriminate between the stores
and be correlated with the outcomes, such as
the product line the stores carry and the proximity of other stores. To deal with this, Levy et
al. might have used even more matching variables from the set currently known to predict
store sales. But while this would have created
greater equivalence between the test and control stores, it would still not have induced
equivalence, at least not in any way where
the equivalence would be independently
known.
A third technique for overcoming the lack
of pretest assessments is to measure proxies for
them and then make statistical adjustments.
Such proxies are variables that correlate with
the posttest within treatments but are not
measured on the same scale. In practice, this
often entails using such readily available demographic measures as age, gender, social class,
race, place of birth, or residence. But proxies
can be conceptually closer to the outcome. For
instance, when evaluating an algebra course
for children who have never before had algebra, an algebra pretest is not possible. But a pretreatment test of mathematical aptitude or arithmetic achievement would be feasible. The hope
with all proxies is that they correlate with the
posttest within each treatment group so as to
increase statistical power and model the selection process better. But they usually correlate
less well with the posttest than pretests collected on the same instrument as the posttest
(Campbell & Reichardt, 1990). Consequently,
they influence statistical power less and provide less complete information about the ways
the treatment groups initially differ. Selection
is modelled better, the higher the selection

variables of any type correlate with posttest outcomes.
The proxy pretest design is diagrammed
below. The subscripts A and B refer to different
measures, with A representing the proxy and B
the posttest.

OA      X      OB
---------------------
OA             OB

A concrete example is offered by Booton and Lane (1985), who examined how
hourly wages were influenced by whether


nurses did or did not have a baccalaureate
degree and whether they worked in a hospital
or nonhospital setting. Since pretest observations were lacking, proxy pretest measures were
collected at the posttest, including reports of
the years of experience as a registered nurse,
the number of children under six present in the
nurse's household, other household income,
age, and whether the nurses were reentering
the work force. Using a regression model to
control for these selection-defining variables,
the researchers concluded that although nurses
with baccalaureate degrees were paid more
than the other nurses, the wage differential
was larger in nonhospital than hospital settings. Unfortunately, we do not know if the
background variables measured cover all of
the ways the two types of nurses differ; nor do
we know if the quality of measurement was
high enough to remove nearly all of the variance attributable to those proxies that were
measured.
The consequences of such imperfect adjustment can be seen by considering what happens
if the correlation between the proxies and
posttest is zero. No adjustment then takes place,
and all the pretest differences between groups
remain in the posttest. If the correlation is substantial but still less than unity, some pretest
difference will be removed, but not all. Unreliability in the proxies (Lord, 1960) and ignorance
of the full selection model (Campbell & Erlebacher, 1970) are each likely to attenuate corre-

lations between the true and unmeasured pretest and the posttest, perhaps spuriously leading to the conclusion that the difference in
wages between baccalaureate and nonbaccalaureate nurses is due to the place of employment rather than the more productive nurses
tending to leave hospitals and seek employment in other settings. We are skeptical about
proxy-based adjustment procedures, even
though they are widely advertised in econometrics (e.g., Heckman & Hotz, 1989a, 1989b).
Empirical work on these procedures has not
been particularly promising (e.g., LaLonde,
1986; Murnane, Newstead, & Olsen, 1985), and
many mathematical statisticians pronounce
themselves skeptical (e.g., Holland, 1989).
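The under-adjustment problem can be illustrated with a small simulation. The sketch below is hypothetical in every detail (Python, invented numbers, and an assumed reliability for the proxy); it merely shows that when the covariate is an error-laden stand-in for the true pretest, residualizing the posttest on it removes only part of a purely selection-based group difference.

    # Hypothetical simulation: no treatment effect exists, yet adjustment with an
    # unreliable proxy leaves part of the selection difference in place.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000

    # True pre-existing ability differs between the two groups (selection).
    true_pre = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(0.5, 1.0, n)])
    group = np.concatenate([np.zeros(n), np.ones(n)])

    # The outcome depends only on true ability; there is no treatment effect.
    posttest = true_pre + rng.normal(0.0, 1.0, 2 * n)

    # The proxy is the true pretest plus measurement error (reliability well below 1).
    proxy = true_pre + rng.normal(0.0, 1.0, 2 * n)

    # Residualize the posttest on the proxy (a simple covariance adjustment).
    b = np.cov(posttest, proxy)[0, 1] / np.var(proxy, ddof=1)
    adjusted = posttest - b * proxy

    raw_diff = posttest[group == 1].mean() - posttest[group == 0].mean()
    adj_diff = adjusted[group == 1].mean() - adjusted[group == 0].mean()
    print(f"raw group difference: {raw_diff:.2f}")
    print(f"difference remaining after proxy adjustment: {adj_diff:.2f}")

With a perfectly reliable covariate the adjusted difference would approach zero; with the noisy proxy a substantial residual difference remains, and it could easily be mistaken for a treatment effect.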
There are some contexts, though, where
even without a pretest, substantive theory is
good enough to generate a highly differentiated
causal hypothesis that, if corroborated, would
rule out most internal validity threats because
they are not capable of generating the same
pattern of empirical implications. The intuition
is that, all other things being equal, the more
specific or complex the form predicted for the
test data, the fewer the viable alternatives. The
importance of this can be illustrated by Seaver's (1973) archival quasi experiment on the effects of teacher performance expectancies on students' academic achievement. Seaver used
school records to locate a group of children
whose older siblings had obtained high or low
grades and achievement scores in school. He
then divided these two groups into those who
had had the same teacher as their older sibling
and those who had had a different teacher. This
resulted in a 2 x 2 design (same or different
teacher crossed with high or low performing
sibling). Seaver predicted that teacher expectancies should cause children with high-performing siblings to outperform children with
low-performing siblings by a greater amount if
they had the same teachers rather than different ones. The data corroborated this predicted
statistical interaction on several subsets of the
Stanford Achievement Test.
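Read simply as an analysis template, Seaver's prediction amounts to a test of a two-way interaction. The sketch below shows, in Python with a wholly invented miniature data set and hypothetical column names, how such a sibling-performance by teacher-assignment interaction could be examined; it is not a reanalysis of Seaver's data.

    # Hypothetical miniature example of testing the predicted 2 x 2 interaction.
    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per younger sibling: achievement score, older sibling's level,
    # and whether the younger child had the same teacher as that sibling.
    df = pd.DataFrame({
        "achievement":   [62, 58, 55, 51, 54, 53, 50, 49],
        "sibling_level": ["high", "high", "low", "low",
                          "high", "high", "low", "low"],
        "same_teacher":  [1, 1, 1, 1, 0, 0, 0, 0],
    })

    # The expectancy hypothesis predicts a sibling_level x same_teacher
    # interaction: the high/low gap should be wider with the same teacher.
    model = smf.ols("achievement ~ C(sibling_level) * same_teacher", data=df).fit()
    print(model.summary())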

It is not easy to invoke a simple selection
interpretation of such a data pattern. However,
Reichardt (1985) has argued that the results
may be due to a regression artifact. He noted
that, when Seaver partitioned the students into
four groups, he also partitioned the teachers.
Assuming that an older sibling's performance
is affected by the ability of the teacher, when
Seaver selected older siblings with above average performance, he may also have been inadvertently selecting teachers whose abilities were
genuinely above average. A younger sibling
assigned to the same teacher would therefore
receive the above average teacher. But the
younger sibling assigned to a different teacher
would more likely have a teacher closer to the
mean level of teaching ability. By the same
process, for older siblings who had performed
poorly, Seaver may have selected teachers
who were less able on the average. A younger
sibling assigned to a different teacher would
then be likely, again because of regression, to
receive a more able teacher. Differences in
teacher effectiveness might thus account for
the crossover interaction Seaver obtained. But
all other internal validity threats seem to be
controlled for.
The Seaver study shows that posttest-only
designs with nonequivalent groups can be quite
strong under the specific circumstances he
chose: (a) substantive theory that predicts a
somewhat complex data pattern; (b) sibling
comparison groups that, although nonequivalent, are nonetheless matched on most
family background factors; (c) outcome measures (academic achievement) that are quite
reliably measured; and (d) large sample sizes.
Although these circumstances may not often be
forthcoming in social research, they serve to
remind us that the structure of a quasi-experimental design does not by itself determine the
quality of a causal inference. The uniqueness of
a theory-derived hypothesis also plays a role.
We now distinguish between several kinds
of nonequivalent control group designs that
generally produce interpretable causal results.


We do not want to make too rigid a distinction


between these and the designs just discussed.
For, as we sought to show with the modus
operandi technique and Seaver's study, there
are some rarer contexts even with structurally
poor designs where causal inference may
nonetheless seem warranted. Likewise, we
will soon see some particular instances of
better designs where quite poor causal inferences resulted. Designs facilitate causal inference, but do not guarantee it. In this spirit, the
following designs are offered as "generally"
more interpretable.
Untreated Control
Group Designs With Pretests
The Untreated Control Group Design With
Dependent Pretest and Posttest Samples.
The most commonly used quasi-experimental
design involves a treatment and an untreated
control group, with the same units providing
both pretest and posttest data. It is diagrammed
below. Though most internal validity threats
are the same whatever the pattern of data obtained, we nonetheless distinguish five different ways the data might be patterned. This is to
show how the quality of causal inference partly
depends on this factor as well as on the structural attributes of a design and on the quality of
the relevant substantive theory.
O1        X        O2
------------------------
O1                 O2

Outcome 1: No Change in the Control Group.


Narayanan and Nath (1982) used an untreated
control group design with pretest and posttest
to examine how flexible working hours (flexitime) influenced a unit of employees who were
contrasted with another unit within the same
company. The results suggested that flexitime
enhanced employee flexibility and workgroup
and supervisor-subordinate relations because
more change occurred in the flexitime group


FIGURE 2
First Outcome of the
No-Treatment Control Group Design
With Pretest and Posttest
(Axis labels: Pretest and Posttest; line label: Control)

than among controls. Indeed, no changes at all


were observed in the nontreatment controls, as
Figure 2 illustrates. But we could expect this
same pattern of data if the two groups were
differently composed and therefore differently
changing over time because respondents in one
group are growing older, wiser, or more experienced, when compared to respondents in
another group.
However, when there is no temporal change
in the controls the investigator has to think
through why there might be spontaneous
growth only in the treatment group. It will
often be easier to think why both groups are
maturing at different rates in the same direction than to think why one group should be
changing in one direction while the other is not
changing at all. Different growth rates in the
same direction are common in, say, educational contexts and in work environments where
performance increases with experience. In other
instances, it will be easier to think why neither
group should be spontaneously changing over
time than to think why one group should be
changing and the other not. Of course, pretest
data can sometimes be analyzed within conditions to assess whether growth rates differ in
each group by age or experience level. If the

data suggest that no growth is expected in


either group over a period equivalent to the
pretest-posttest time interval, this suggests
(but does not "prove") that selection-maturation is implausible with outcomes like those in
Figure 2.
A second internal validity problem can arise
with the design because of instrumentation. It
is not clear with most scales that the intervals
are truly equal, and "change" may be easier to
detect at some points on a scale than others. In
a quasi experiment, the nonequivalent groups
begin almost by definition at different points
on a scale. Scaling problems are presumably
more acute (a) the greater the initial nonequivalence between treatment groups, (b) the
greater the pretest-posttest change recorded,
and (c) the closer any group means are to one
end of the scale so that ceiling or floor effects are
likely. To investigate possible instrumentation
effects, it is easy to inspect pretest and posttest
frequency distributions within each treatment
group to see if the distributions are skewed and
if the group means and variances are correlated. Sometimes, the raw data can be rescaled
to reduce any detected problems, while at other
times a careful choice must be made of intact
and unmatched groups that score at about the
middle of a scale and so close to each other.
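The distributional inspection just described is easy to automate. The following sketch (Python, with invented scores on a hypothetical 7-point scale) simply computes, within one group, the summary quantities a researcher would look at: skew, variance, and the number of cases piled up at the top of the scale.

    # Minimal sketch of the within-group distributional checks described above.
    import numpy as np
    from scipy.stats import skew

    SCALE_MAX = 7                      # hypothetical top of the response scale
    pretest = np.array([3, 4, 4, 5, 5, 5, 6, 6, 7, 7], dtype=float)
    posttest = np.array([5, 6, 6, 6, 7, 7, 7, 7, 7, 7], dtype=float)

    for label, scores in [("pretest", pretest), ("posttest", posttest)]:
        # Marked skew, shrinking variance, or a pile-up at the scale maximum
        # warns of ceiling effects and unequal intervals that can mimic
        # differential change between nonequivalent groups.
        print(label,
              "mean =", round(scores.mean(), 2),
              "variance =", round(scores.var(ddof=1), 2),
              "skew =", round(skew(scores), 2),
              "cases at ceiling =", int((scores == SCALE_MAX).sum()))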
A third problem has to do with differential
statistical regression. Consider the consequences of having a predetermined experimental group of low scorers (say, children eligible
for Head Start) and then selecting no-treatment
controls from among nearby children. Their
mean score on most education-relevant outcomes will be above those of the Head Start
children since they are not eligible for the program. Knowing this, the researcher might put
forth a special (but misguided) effort to select
as controls those noneligibles who have particularly low pretest scores. Such matching will
bring into the study non-Head Start children
whose obtained achievement scores were deflated by a particularly large (and one time)
negative error component. All things being



equal, the error should be less on later measurement, making scores regress toward the population mean for controls that is higher than the
Head Start mean. The initially more advantaged controls will thus appear to gain more,
making the Head Start program look harmful!
A fourth problem relates to local history: the possibility that a one-time event occurred
between pretest and posttest and affected one
treatment group more than another. In Narayanan and Nath (1982), flexitime was initiated in
one unit of a company while another served as
no-treatment controls. To ensure that differential supervisory changes did not occur during
the experimental period, Narayanan and Nath
measured them but found none. Of course, this
is only one particular example of local history,
and many others could be surfaced. Each would
then have to be examined. Researchers have to
be vigilant lest ruling out one study-specific
selection-history force lulls them into believing
that all such threats have been ruled out.

Outcome 2: Both Groups Grow Apart in the Same


Direction. There is a pattern of the selection-maturation interaction that, in our experience,
is both more common and lawful than the
foregoing. It occurs when initially nonequivalent groups are growing apart at different
average rates in the same direction, as in Figure
3. This data pattern is certainly consistent with
treatment effects. The question is: Can alternative interpretations be identified and ruled out?
Ralston, Anthony, and Gustafson (1985)
examined the effects of flexitime on productivity among programmers within two state
government agencies. They found that productivity was initially lower in the agency
without flexitime and increased slightly over
time, and it was initially higher in the agency
with flexitime but increased at an even faster
rate. Differential growth rates are common in
quasi-experimental research, particularly
when respondents self-select into
receiving a treatment. But even when administrators assign individuals, treatments are often

FIGURE 3
Second Outcome of the
No-Treatment Control Group Design
With Pretest and Posttest
(Axis labels: Pretest and Posttest)

made available to the specially meritorious or


most keen to improve themselves. Since these
are often the more able or better networked,
they are likely to change at a faster rate over
time for reasons that have nothing to do with a
treatment.
Several clues are available for assessing
whether nonequivalent groups are maturing at
different rates. First, if the group mean differences are a result of biased social aggregation
or selection only, then the differential growth
between groups should also be occurring
within groups. To help understand this, let us
suppose that flexitime was given to units where
the workers had higher performance ratings on
the average and so might gain more than controls for reasons associated with their superior
aptitude or interest. We might then expect the
more able and more interested to gain faster
than others among both the experimentals and
controls. Indeed, many patterns of selection-maturation lead to posttest within-group variances that are greater than the corresponding
pretest variances. Simple inspection of variances will help explore this possibility, of course.
The second clue comes from plotting pretest
outcome scores against the maturational variable (e.g., age or years of experience) for the


experimental and control groups separately.


If the regression lines differ, this is presumptive evidence of differential average growth
rates. This is because the slope differences
cannot be due to the treatment; only pretest
scores entered the analysis. Sociologically, the
growth pattern we have been describing is
associated with phrases like "the rich get richer"
or "those that are ahead get further ahead."
Data analytically, the pattern entails a constant increase over time in the mean difference
between the nonequivalent groups and within-group variances that increase with their
respective means.
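Both clues can be checked with very little computation. The sketch below (Python, hypothetical scores and an invented experience variable) compares pretest and posttest variances within each group and then estimates, from pretest data alone, the slope of outcome on the maturational variable in each group.

    # Hypothetical illustration of the two selection-maturation probes above.
    import numpy as np

    # Control group: years of experience, pretest, and posttest scores
    exp_ctrl = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    pre_ctrl = np.array([10, 11, 11, 12, 13, 13, 14, 15], dtype=float)
    post_ctrl = np.array([11, 12, 12, 13, 14, 15, 15, 16], dtype=float)

    # Treatment group: same variables
    exp_trt = np.array([2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
    pre_trt = np.array([12, 13, 15, 16, 18, 19, 21, 22], dtype=float)
    post_trt = np.array([14, 16, 18, 20, 22, 24, 26, 28], dtype=float)

    # Probe 1: within-group variances that grow from pretest to posttest are
    # consistent with fan-spread growth rather than a treatment effect.
    print("control variances:", pre_ctrl.var(ddof=1), post_ctrl.var(ddof=1))
    print("treated variances:", pre_trt.var(ddof=1), post_trt.var(ddof=1))

    # Probe 2: regress PRETEST scores on the maturational variable separately
    # by group; slope differences cannot be due to the treatment because only
    # pretest data enter, so they point to different average growth rates.
    slope_ctrl, _ = np.polyfit(exp_ctrl, pre_ctrl, 1)
    slope_trt, _ = np.polyfit(exp_trt, pre_trt, 1)
    print("pretest-on-experience slopes:", round(slope_ctrl, 2), round(slope_trt, 2))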
Nothing makes the initial group difference increase at a constant linear rate. There can be
simple linear growth in one condition but
quadratic change in another. The patterns of
change expected have to be thought through
anew in each study and responsibly probed
wherever relevant data are available. However, in our experience, differential maturation
of the fan-spread type (in which group differences increase linearly over time) is commonplace. Indeed, many longitudinal descriptive data sets in education show children
with higher achievement growing steadily
ahead of their lower scoring contemporaries.
We suspect that many other longitudinal
data sets on performance will show the same
pattern. Nonetheless, there are some theoretical formulations that predict a selection-maturation pattern other than a fan spread. For
instance, within education, Piaget's theory of
conservation predicts sharper discontinuities
in growth differences as some children suddenly acquire the concept and others do not.
For each longitudinal data analysis, an argument has to be constructed presenting and
justifying the assumptions made about
maturational differences. Sometimes pretest
data will play an important role in this; sometimes data from longitudinal samples collected
for other purposes will serve the function; and
sometimes theoretical speculation is all that
can be presented.

Figures 2 and 3 differ most of all in their


implications for the plausibility of selection-maturation. The control data in Figure 2 suggest that no change occurred between the pretest and posttest and so there is no general
pattern of change. To accept this conclusion
depends on the measures and statistical tests in
the control group being sensitive enough to
detect real change and on the assumption that
the experimental group is not subject to irrelevant forces inducing a maturational change.
Figure 3, on the other hand, suggests that change
is generally present and that the group with the
earlier advantage may be changing at a faster
rate than the group with the lesser advantage.
Since many descriptive longitudinal studies
show just such a data pattern, researchers have
to ask, "How can I probe whether selection-maturation occurred, how large it was, and
how it can be dealt with through design modifications or statistical adjustments?" Though
this may be the researcher's biggest single task
with Outcome 2, it does not exhaust all the
internal validity problems to be faced. Local
history, differential instrumentation shifts,
differential testing, and differential statistical
regression cannot be ignored as potential interpretations of the outcomes in Figure 3, just as
they could not for Figure 2.

Outcome 3: Pretest Differences Diminished.


Let us now look at Figure 4 in which the pretest
superiority is diminished or eliminated by the
posttest, not enhanced as above. This particular
outcome was obtained from a sample of black
third, fourth, and fifth graders in a study of the
effects of school integration on academic self-concept (Weber, Cook, & Campbell, 1971). At the pretest, black children who attended all-black schools had a higher academic self-concept mean than black children who attended
integrated schools in the same district. But after
formal school integration had taken place, the
initially segregated and initially integrated black
children did not differ. While the logic of experimentation with control groups involves


FIGURE 4
Third Outcome of the
No-Treatment Control Group Design
With Pretest and Posttest
(Axis labels: Pretest and Posttest)

starting with equality between the groups and


finishing with differences, we should also be
alert to "catch-up" designs in which the control
group already has the treatment which the
experimental group then receives between
pretest and posttest.
All the internal validity threats described
for Figures 2 and 3 are relevant to Figure 4,
including the possibility of a selection-maturation interaction. However, sticky plausibility arguments become relevant here, for it
is rare to find maturational patterns such that
those further ahead fall back in a relative sense.
It can happen, of course. Take physical strength
as an example. When one group is slightly
older than another, it is possible that between
the pretest and posttest, the initially superior
group comes to pass the peak age of strength
while the other group just reaches it. But in educational and mental health contexts, such phenomena are much rarer, and in the Weber et al.
example, the groups were equivalent in age.
Thus, the argument is that no known and independently validated selection-maturation pattern can account for the pattern of results found
in Figure 4. In the future, some mechanism
might be found, but it is not yet known.

Outcome 4: The Compensatory Treatment Case Without a Crossover Effect. The salient characteristics of a fourth possible outcome of the
no-treatment control group design is that, as
with Outcome 3, the experimental-control
difference is greater at the pretest than the
posttest. But now the experimentals initially
underperform the controls rather than outperform them. This is a particularly interesting
case because it reflects the outcome desired
when organizations introduce compensatory
inputs to increase the performance of the disadvantaged or of those whose performance
is disappointing, as in the case when a firm
makes changes to try to improve on past poor
performance.
Keller and Holland (1981) found the result
under discussion when they assessed the impact of a job change on employees' performance, innovativeness, satisfaction, and integration in three different research and development organizations. All employees who were
promoted or assigned to a different job were
placed in the treatment group, while all others
were considered controls. Outcomes were
measured twice, a year apart. Thus, there was
no explicit compensatory focus in the work; it
merely turned out that the data fit the pattern
under discussion here.
The outcome is subject to the typical
scaling (i.e., instrumentation) and local history
(i.e., selection x history) threats discussed earlier. But two special features stand out. Though
some people were promoted in the Keller and
Holland study, the majority of job changers
moved laterally. If their jobs changed because
their performance was unexpectedly low or
their supervisors suddenly wanted to get rid of
them, this would entail a source of measurement error operating to depress scores initially.
All other things being equal, it should regress
upward by the posttest, the very data pattern
depicted in Figure 5. If the treatment-control
differences in Keller and Holland were temporally stable, then statistical regression would
not be a problem. But with this design, we


FIGURE 5
Fourth Outcome of the
No-Treatment Control Group Design
With Pretest and Posttest
(Axis labels: Pretest and Posttest; line label: Control)

FIGURE 6
Fifth Outcome of the
No-Treatment Control Group Design
With Pretest and Posttest
(Axis labels: Pretest and Posttest)

would not know anything about pretreatment


time patterns. In nonequivalent control group
designs, it is imperative to explore the reasons
for initial group differences, including why
some groups assign themselves, or are assigned,
to one particular treatment rather than another.
It is especially important to do this when the
treatment is assigned to units performing less
well.
The second point to note is that the outcome in Figure 5 rules out cumulative selection-maturation of the fan-spread type. More exactly, Figure 5 implies that if there were such an effect, the treatment was so powerful that it overcame it. However, less common selection-maturation patterns could be invoked. In Keller
and Holland, for example, the job changers
may have been junior in the organization, accounting for their lower pretest scores, but
particularly open to learning from new experiences, making their performance rise disproportionately quickly. Data on age and time in
the organization would have to be analyzed to
examine this possibility. Though Outcome 4
reduces the likelihood of a fan spread, it does
not mean that other, even more complex forms
of selection-maturation are ruled out.

Pretest

Posttest

Outcome 5: Outcomes That Cross Over. Bracht


and Glass (1968) have noted the desirability of
basing causal inferences on interaction patterns like Figure 6, where the trend lines cross
over and the means are reliably different from
each other in one direction at the pretest and in
the opposite direction at the posttest. The important point is not the crossover per se; it is the
switching pattern of reliable mean differences
which tells us that the low-scoring pretest group
(the "experimentals") has come to overtake the
initial high-scoring controls. None of the other
interaction patterns that we have presented
thus far does this; nor is it done if the trend lines
cross but the two posttest means do not reliably
differ.
There are several reasons why the
outcome in Figure 6 is particularly amenable
to causal interpretation. First, the plausibility
of an alternative scaling interpretation is reduced, for no logarithmic or other transformation will remove the interaction. While a ceiling
effect might explain why a lower-scoring pretest group comes to score as high as a higher-scoring group, it cannot by itself explain how
the lower-scoring group then came to draw
ahead. A more convincing scaling artifact would



have to postulate that true change occurs in the
initially lower-scoring group and to a level
above that of the other group. However, the
estimate of change is inflated because the interval properties of the test make change easier at
positions further away from the grand mean
than close to it. But this entails the exacerbation of a true effect and not the creation of a totally artifactual one.
Second, selection-maturation threats are
much less likely with Figure 6, for such crossover interaction patterns are not widely known
in descriptive or theoretical literatures. However, Cook et al. (1975) have commented that
selection-maturation is not impossible, even
with Figure 6. They reanalyzed some of the
Educational Testing Service (ETS) data on the
effectiveness of "Sesame Street" and found that
children in Winston-Salem who had been encouraged to view the show knew reliably less at
the pretest than children who had not been so
encouraged. However, by the posttest the treatment group knew reliably more than the control group, resulting in a data pattern like Figure 6. But were the encouraged children younger
and brighter, thus scoring lower than controls
at the pretest, but changing more over time
because of their greater ability? Fortunately,
data analysis indicated that the encouraged
and nonencouraged groups did not differ in
age or several pretest measures of presumed
cognitive ability, reducing the plausibility of
this interpretation.
Third, the Figure 6 outcome renders a regression alternative explanation less likely.
Greene and Podsakoff (1978) found the depicted crossover when they examined how
removing a pay incentive plan affected employee satisfaction in a paper mill. The employees were divided into high, middle, and low
performers, and satisfaction was measured
before and after removal of the pay incentive
plan. Following the removal, the high performers' satisfaction decreased reliably, the
low performers increased, and the midlevel

performers did not change. These slope differences might be due to regression with all three
groups converging on the same grand mean (as
in Figure 5). But the low performers had overtaken the high performers by the posttest and
differed from them reliably. By itself, statistical
regression does not provide a plausible alternative for these results, though it may have
inflated treatment effect estimates. A regression explanation would have to assume that the
initial low performers were genuinely better
performers than the others, but for some unknown reason a particularly large error component depressed their scores at the earlier time.
It seems implausible to us to assume that low
scorers on the pretest would regress to a mean
higher than that of both the middle and high
scorers.
Though the outcome in Figure 6 is often
interpretable in causal terms, any attempt to
set up a design to achieve it involves considerable risk and should not be undertaken lightly.
This is especially true in growth situations where
a true treatment effect would have to countervail against the lower expected growth rate in
the treatment group. A no-difference finding
from a study with considerable statistical power
but with an inevitably incomplete selection
model would not make it clear whether the
treatment had no effect or whether two countervailing forces (the treatment and selection-maturation) had cancelled each other out.
Even if there were a difference in slopes, this
would more readily take the form of Figure 5
than Figure 6, and Figure 5 is less interpretable
than Figure 6. It is one thing to comment on the
interpretive advantages of a crossover interaction with reliable and switching pretest and
posttest differences, and it is quite another to
obtain such a data pattern.

The Untreated Control Group Design With Independent Pretest and Posttest Samples. The basic pretest-posttest design with nonequivalent groups is
sometimes used with separate samples being


measured in each treatment group at each time


interval. Independent groups tend to be used
when researchers suspect that pretest measurement will be differentially reactive, when budget
constraints make it simpler to contact independent groups instead of tracking down prior
participants, or when there is an ideological
commitment to studying intact communities
even though their populations change with
time. The actual design is diagrammed below,
with the vertical line indicating noncomparability across time.

O1 |      X      O2
------------------------
O1 |             O2

This design is frequently used in epidemiology, marketing, and political polling and may
be gaining in popularity. The only justifiable
context for using it is with random selection of
the pretest and posttest groups within each
(noncomparable) treatment condition. It is the
random selection that makes the samples representative of the population such as it is at
each time point. Selection is not completely
avoided, of course. First, random selection
equates the pretest and posttest only within
limits of sampling error, making comparability
problematic with smaller and heterogeneous
samples. Second, the populations will probably change in composition between the two
measurement waves and the changes need not
be the same in each treatment group, entailing
a problem with selection-maturation. Indeed,
most of the internal validity threats that apply
to the nonequivalent control group design with
repeated measurement also apply when separate pretest and posttest samples are involved,
particularly local history. Statistical conclusion
validity becomes more of a problem with the
independent samples at each measurement
wave. Units no longer serve as their own statistical controls as they do when dependent
samples are analyzed.

Given these restrictions, anyone considering a design with nonequivalent groups and
independent pretest and posttest samples
should first critically assess whether the need
for independent groups is compelling. Only
then should a researcher pay special attention
to such matters as sample sizes, how the sampling design is implemented, and how comparability can be assessed using measures that are
stable, reliable, and beyond influence by the
treatment. However, such sampling and measurement concerns are irrelevant to the most
crucial aspect of population comparability: How
would the two nonequivalent groups change
over time even without a treatment, given their
initial noncomparability and the possibility of
population differences in composition over
time?

The Untreated Control Group Design With a Double


Pretest. This design is a variant on the most
commonly used untreated control group design with dependent samples. It involves adding an antecedent pretest of the same form used
at the posttest, preferably administered with
the same time delay as between the posttest
and second pretest. The design is below, and
Wortman, Reichardt, and St. Pierre (1978) used
it to evaluate how the Alum Rock educational
voucher experiment affected reading test scores.

O1     O2     X     O3
---------------------------
O1     O2           O3

Under the program, parents selected whatever local school they wanted for their child
and received a credit or voucher equal to the
cost of the child's education at that place. The
objective was to improve schools by fostering
competition. Initial evaluations had suggested
that vouchers decreased the academic performance of children, but Wortman, Reichardt,
and St. Pierre doubted this. So they reanalyzed
reading test scores using the repeat pretest


design, following a cohort of students through


the first to the third grades in both voucher and
nonvoucher schools. Unlike the initial analysis,
the reanalysis examined two types of voucher
programs, traditional and nontraditional. The
additional pretest allowed the researchers to
contrast the pretreatment maturation rates in
reading with the change in rates following it.
The decrease in reading previously attributed
to voucher schools in general was then attributed only to the nontraditional voucher group,
with the traditional voucher and nonvoucher
groups continuing at the same growth rates.
The advantage of using a double pretest is
that it permits an assessment of the likelihood
of selection-maturation under the assumption
that the rates between O1 and O2 will also be found between O2 and O3. However, the within-group growth rates will be fallibly estimated, given measurement error. Moreover, instrument shifts could make the measured growth between O1 and O2 unrepresentative of what we might expect between O2 and O3. And the assumption that the rate of growth obtained between O1 and O2 would have continued between O2 and O3 is testable only for the
untreated control group. These difficulties
notwithstanding, the second pretest can help
considerably in assessing the plausibility of
selection-maturation. This is not the only benefit of the design, though. If the O2 observation in either group was atypically low or high, then a spurious treatment effect could emerge because of statistical regression. Having the O1 measure goes some way toward assessing this.
A third benefit of the design cannot perhaps be
fully appreciated at this point, for it has more to
do with statistical analysis than experimental
design. In some analyses, it is desirable to be
able to estimate quite precisely what the correlation is between observations taken from a
single group across a known time interval. To
compute the correlation between O2 and O3 in the treated group gives an unclear estimate of what the correlation would have been in the absence of a treatment. Yet this is the crucial information needed for the statistical analysis. The O2-O3 correlation from the untreated group is often used as the best estimate of the corresponding correlation in the treated group. However, a good case can be made that the O1-O2 correlation in the treated group will usually be at least as good an estimate as the O2-O3 correlation from the untreated group. (It is because correlations are sensitive to the length of the test-retest interval that it is desirable to keep a constant interval between O1, O2, and O3.)
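As an arithmetic matter the comparison is trivial; the sketch below (Python, with invented score vectors) simply makes explicit which two correlations are being set side by side.

    # Hypothetical data: O1 and O2 are the two pretests, O3 the posttest.
    import numpy as np

    o1_treated = np.array([10., 12., 9., 14., 11., 13., 8., 15.])
    o2_treated = np.array([11., 13., 10., 15., 11., 14., 9., 16.])

    o2_control = np.array([10., 11., 12., 9., 13., 10., 12., 11.])
    o3_control = np.array([11., 12., 12., 10., 14., 11., 13., 12.])

    # With a constant test-retest interval, the treated group's O1-O2
    # correlation is untouched by the treatment and can serve as an estimate
    # of what its O2-O3 correlation would have been without treatment,
    # alongside the O2-O3 correlation observed in the untreated group.
    r_o1_o2_treated = np.corrcoef(o1_treated, o2_treated)[0, 1]
    r_o2_o3_control = np.corrcoef(o2_control, o3_control)[0, 1]
    print(f"treated O1-O2 r = {r_o1_o2_treated:.2f}")
    print(f"control O2-O3 r = {r_o2_o3_control:.2f}")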
Adding a second pretest clearly helps the interpretation of possible causal relationships. Why,
then, is the design not used more often? One
reason may be relative ignorance, but another
is surely that the design is sometimes infeasible.
In many situations, one is lucky to be able to
delay the treatment implementation process in
order to obtain a single pretest, let alone two.
Sometimes, archives will make the second
pretest possible, though with archives the researcher can often extend the pretest time series
over many more periods, thus creating an even
more powerful time series analysis. Unfortunately, time is not the only feasibility constraint.
Some persons responsible for authorizing research expenditures are loath to see money
spent for any features of experimental design
other than the treatment and posttest measures
on persons who have received particular treatments. Convincing them about pretests and
conventional control groups is not easy.
Convincing them of double pretests is hard
indeed! Nonetheless, where the archival system, time-frame, or political system permits,
the same pretest should be administered at two
different times prior to the treatment.
Cohort Designs in Institutions With Cyclical
Turnover. Many formal institutions are characterized by regular turnover as one group of
persons "graduates" to another level of the
institution and is followed by another group.
Schools provide an obvious example of this as


children move from grade to grade. So, too, do


many businesses as one group of trainees follows another. So, too, do families as one sibling
follows another. We use the term cohort to
designate this process, noting that the term is
used in other disciplines to refer to samples that
are repeatedly measured over time. As we
understand cohorts, they are useful for experimental purposes because (a) some cohorts
experience a given treatment while preceding
or subsequent cohorts do not; (b) it is often
reasonable to assume that a cohort differs in
only minor ways from its contiguous cohorts;
(c) intact organizations sometimes insist that a
treatment has to be made available to all, precluding the use of simultaneous controls and
making only historical controls possible; and
(d) an organization's own archival records can
often be used for comparing historical cohorts
that differ in the treatment received.
A crucial feature that makes cohort designs
particularly useful for drawing causal inferences is the quasi comparability or attenuated
selection difference that can often be assumed
between cohorts exposed and not exposed to a
treatment. Quasi comparability cannot be assumed willy-nilly and must be probed anew in
each study through analyses of the background
characteristics presumed to correlate with the
major research outcomes. The degree of comparability will never be as high with cohorts as
with random assignment. Indeed, a review of
behavioral genetics research concluded that, as
far as intellectual performance is concerned,
"environmental influence makes two children
in the same family as different from one another as are pairs of children selected randomly
from the population" (Plomin & Daniels, 1987).
If this position generalized to nonintellectual
domains, it would seriously undermine the
case for assigning any special status to siblings
as controls for shared family variance. Moreover, if it were substantiated for other types of
cohorts, for example, in schools or work organizations, it would seriously undermine
cohort analysis altogether.

Our discussion of cohort designs will not


deal with the growing use of cohort designs to
try to determine the potential relevance of the
effects of age (i.e., growing older), birth cohort
(i.e., being born within a given time span), and
period of history (i.e., the events occurring
between any two time intervals). Using sample
survey data to unconfound and model these
effects is a thorny task that demographers and
developmentalists have set themselves (Glenn,
1977; Mason & Fienberg, 1985). Instead, our
focus will be on using cohorts to strengthen
causal inferences and on illustrating how the
basic cohort design feature can be supplemented
with other design features already discussed.
The Basic Cohort Design.
Minton (1975)
wanted to examine how the first season of
"Sesame Street" affected the Metropolitan Readiness Test scores of a socially heterogeneous
sample of kindergarten children. She located a
kindergarten where the test was used at the
end of the child's first year. However, she had
no data from a control group. Fortunately, she
had access to the Metropolitan scores of the
children's older siblings who had attended the
same kindergarten before "Sesame Street" went
on the air when they were of the same age (and
presumably maturational stage) as the "Sesame Street" viewers. She was able, therefore, to
compare the postkindergarten knowledge
level of children who were "Sesame Street"
viewers with the knowledge level of their siblings who had terminated kindergarten before
"Sesame Street" was ever broadcast. The essential design can be diagrammed below, with the
wavy line indicating a restricted degree of selection nonequivalence. Note that the numerical subscripts refer to time of measurement,
even though in each case there is only one
wave of measurement.

O1
~~~~~~~~~~~~~~~~~~~~
          X      O2

Despite the similarities between the cohorts
in maturational status and many other family-based selection variables, to contrast just these
two observations provides a weak test of the
causal hypothesis. First, a selection problem
remains since the older siblings are more likely
to be first-borns and first-borns tend to outperform siblings born later (Zajonc & Markus,
1975). One way of reducing this threat would
be to analyze the data separately by birth order
of the older child. The assumption would be
that ordinal position makes more of a difference with first- and second-borns than with second- and third-borns, etc. (Zajonc & Markus,
1975). The design is also weak with respect to
history, for the older and younger siblings could
have experienced unique sets of events other
than "Sesame Street" that affected knowledge
levels. One way of partially examining this
threat would be to break the cohorts down
into those whose kindergarten experience was
separated by one, two, three, or more years
from their siblings'. This would be to see if the
greater learning of the younger group held
over all the different sets of historical events.
However, this procedure would be less than
optimal because there would still be no control
for historical events other than "Sesame Street"
that took place the same year the show was
introduced.
Direct measurement can sometimes be
used in an attempt to deal with selection and
history. Thus, Devine, O'Connor, Cook, Wenk,
and Curtin (1988) conducted a quasi experiment to examine how a psychoeducational
care workshop influenced nurses' care of cholecystectomy (gall bladder) surgical patients.
Reports were collected from different patients
over a seven-month pretreatment period, and
again after the intervention for another six
months on another group. This created pretreatment and posttreatment cohort groups.
Did the two cohort groups differ in background characteristics? An analysis of many
such characteristics revealed no differences,
minimizing the selection threat for the

variables examined but obviously not for unmeasured attributes. Still, it would have been
better if practical circumstances had allowed
collecting both the pretest and posttest data for
a calendar year each instead of for seven and six
months, respectively, for the data collection
procedure actually implemented is confounded
with seasons. History was examined as an alternative interpretation in the sense that research staff were in the target hospital almost
every day and detected no major irrelevant
changes that might have influenced recovery
from surgery. This provides no guarantee, of
course, and design modifications would obviously be better than measurement for ruling
out this internal validity threat. Therefore, data
were also collected from a control hospital in a
nearby suburb that was owned by the same
corporation and had some of the same physicians admitting patients. It provided a better
control for history, and in this instance, like the passive measurement, led to the conclusion that the treatment effect was not due to
general history.
Cohort Design With Pretests From Each Unit.
Another example of the cohort design leads to
an important design elaboration that strengthens causal inference. In a study comparing the
relative effectiveness of regular teachers versus outside contractors hired to stimulate
children's achievement, Saretsky (1972) noted
that the teachers exerted special efforts and
performed better than would have been expected on the basis of their previous years'
performance. He attributed such compensatory rivalry to the teachers' fear that they
would lose their jobs if the contractors outperformed them. It is not clear how Saretsky tested
this hypothesis. Let us assume for pedagogic
purposes that he compared the average grade
equivalent gain in classes taught by the regular teachers during the study period with the
average gain from the same classes taught by
the same teachers in previous years when they
were not aware of being in a study. The resulting design would be of the form below, with O1 and O2 representing beginning and end of year scores for the earlier cohort who could not have been influenced by teacher fears, and O3 and O4 representing scores for the later cohort that
might have been so influenced. The null hypothesis is that the change in one cohort equals
that in the other. But the design, which we
might call the institutional cycles design, can be
extended back over time to include multiple
"control" cohorts rather than the single one
pictured here. Indeed, it seems that Saretsky
reported data for two pre-experimental years.

O1          O2
~~~~~~~~~~~~~~~~~~~~~~
O3     X    O4

Adding earlier measures to each cohort


shows the similarity of the resulting design to
the basic nonequivalent control group design
with pretest and posttest. The major differences are that measurement occurs earlier in
the control group, and the cohorts are assumed
to be less nonequivalent than most independent groups would be. This last point can be
checked by comparing each cohort's pretest
mean, and the ability to do this is one of the
major advantages of including pretest measures in cohort designs. Other advantages of
the pretest are, of course, the increase in statistical power that comes from within-subject
error terms and the possibility of better (though
not perfect) statistical adjustment for any
group noncomparability.
History is perhaps the most salient internal validity threat in the institutional cycle design with or without pretests. A series of control cohort years helps here. If the amount of change is comparable across all pretreatment cohorts, this suggests that performance was not atypically low in the immediate pretreatment cohorts. While this rules out one form of history likely to produce spurious results, it still leaves viable the history threat that some unique

event happened in the intervention time period. To control for general history, we would
be even better served if a nonequivalent, no-treatment control group could be found and
measured at exactly the same time points as the
treatment cohorts. Failing this, the design could
be greatly strengthened if nonequivalent dependent variables were specified, some of which
should be affected by history while others
should not.
The plausibility of a history threat can be
examined if the research question is slightly
modified. Imagine wanting to know about the
effects, not of a new practice like participating
in a performance contracting experiment, but
of a long established practice like asking second graders to do one-half hour of homework
each school night. With access to past school
records or with at least two years to do a study
with original data collection, a design like the
one below might be possible. It involves three
cohorts entering the second grade in consecutive years, and the spacing of observations indicates that O1 and O2 are not simultaneously observed because one might be at the end of a school year and the other at the beginning of the next. This institutional testing cycle is repeated again with the O3 and O4 observations to create a design that Campbell and Stanley (1966) discussed in detail as the recurrent institutional cycle design.

X   O1
~~~~~~~~~~~~~~~~~~~~~~
      O2   X   O3
~~~~~~~~~~~~~~~~~~~~~~
               O4

A treatment main effect is suggested if O1 and O3 are, say, higher than O2 and O4, and if O2 and O4 (and O1 and O3) are not different from each other.

A partial control for history is provided if, in addition to O3 being greater than O2, O1 surpasses O2, and O3 surpasses O4. These last comparisons suggest that a treatment has been



effective at two different times. Hence, any
single history alternative would have had to
operate twice if it is to explain both O1 > O2 and O3 > O4; alternatively, two independent historical forces would have to be invoked. Selection is also ruled out in this version of a cohort design since the same persons are involved in some of the comparisons, particularly O2 - O3. Another problem that affects some cohort designs, testing, is not ruled out since all the comparisons involve contrasting a first testing (O2 or O4) with a second testing (O1 or O3). This
is why Campbell and Stanley recommended
extending the design further by splitting the
middle group that is both pretested and
posttested into random halves, one of which
receives a pretest, treatment, and posttest sequence while the other receives a treatment and
posttest sequence but no pretest. Any differences between these two groups at the posttest
would presumably be due to repeated measurement; the failure to obtain reliable differences (or strong but unreliable trends in the
direction of differences) would suggest that
repeated measurement is nota problem. Though
the three-group design we have outlined is
practical for use in institutional settings where
everyone has to receive a treatment, it has
one major drawback over and above testing.
Causal interpretation depends on a complex
pattern of outcomes in which three contrasts
involve 0 2 Since a chance elevation of 0 2 would
have disastrous implications, the design should
only be used with reliable measures and large
samples.
Cohort Designs With Treatment Partitioning.
Yet another way to strengthen the generic cohort
design where organizational constraints prevent forming independent control groups is to
add treatment partitioning to help reduce the
plausibility of selection and history threats. For
example, in Minton's analysis of "Sesame
Street," the kindergarten children she observed
could have been partitioned into heavier and

lighter viewing groups. If the show were effective, we would then expect larger knowledge
differences between the heavy and light viewers from the cohort that viewed the show when
contrasted with their siblings who did not. The
key assumption here is that the older siblings
would themselves have been heavy and light
viewers of "Sesame Street" if the show had
been available to them during their kindergarten years. Under this assumption, the null
hypothesis is that the difference in knowledge
between heavy and light viewers who saw the
show is no different from the difference in
knowledge between their respective siblings
who did not see it. A difference of these differences would suggest that "Sesame Street" was
effective and that a simple selection alternative
could be ruled out. While selection could account for the differences between heavier and
lighter viewers who had seen the show, it could
not readily account for their difference from
their respective siblings who had not seen it.
Moreover, we would expect the heavier and
lighter viewers from the "Sesame Street" cohort to have experienced the same general history (though not necessarily the same local
history). Partitioning respondents into treatment groups based on treatment exposure levels strengthens the internal validity of cohort designs. Even with self-selection into viewing
levels, it is difficult to come up with plausible
alternative interpretations when the data look
like they do in Figure 7.
Partitioning has a further advantage for
internal validity. If testing conditions differ
between the earlier and later cohorts, then testing alone might bias scores in the later experimental cohort. Partitioning respondents by
the length of exposure to the treatment rules
out a simple testing threat, for there is no reason
why testing should have a greater effect in the
longer exposure treatment group than the
shorter one. For a number of reasons, then, we
advocate partitioning respondents and implementing a modified cohort design, as depicted

below, with three treatment levels represented by subscripted X's.

FIGURE 7
Interpretable Outcome of a Posttest-Only Cohort Design
(Outcomes plotted for pretreatment cohorts and posttreatment cohorts, by viewing level)

Ball and Bogatz' (1970) evaluation of "Sesame Street" also used cohorts, but they were
older children from the local neighborhood
instead of siblings. They took a sample of children and tested them before "Sesame Street"
went on the air and six months later. Many of
the children were aged between 53 and 58
months at the posttest and were called the
posttest cohort. Other children were aged between 53 and 58 months at the pretest and were
called the pretest cohort. Comparing the posttest scores of the posttest cohort and the pretest scores of the pretest cohort rules out maturation since all the cohorts are presumably
at comparable maturational stages. The major
difference between them is that one cohort
has seen "Sesame Street" and the other has not.
A selection effect is also not likely, provided
that the analysis included data from all the
children available to be in a particular cohort.
And as a further check, background characteristics of the pretest and posttest cohorts can be
assessed.
The design is vulnerable to a history interpretation, though. Did the older pretest cohorts experience unique, outcome-modifying events before their younger cohorts were born? Or were the older children at particularly sensitive maturational stages when they learned information from the environment that the other
cohorts were too young to take advantage of?
Also, the design as portrayed here and as
implemented by Ball and Bogatz has a unique
testing problem. The scores of the pretest cohort came from a first testing, while those of the
posttest cohort came from a second. This makes
it unclear whether cohort differences in knowledge were due to the treatment or to differences
in the frequency of measurement. To deal with
these history and testing problems, Ball and
Bogatz used measures of the reported frequency
of viewing "Sesame Street" to partition each
cohort into four viewing levels. The consequences of this partitioning are displayed in
Figure 8, and an analysis of variance showed
that the differences in knowledge between the
four viewing levels were greater within the
posttest cohort than the pretest one. Since the
cohorts were of the same mean age, of comparable social background within the different
viewing groups, and had experienced the same
history and testing sequences (all posttest cohorts were pretested), the complex, theory-predicted outcome in Figure 8 rules out all the internal validity threats discussed thus far. Cohort designs gain in inferential


strength as other design features like treatment
partitioning and nonequivalent no-treatment
control groups are added to them. Further, they
are always improved if attributes are measured that might differ between cohorts. The
working assumption has to be that institutional cohorts reduce initial nonequivalence.
It cannot be that they eliminate selection
altogether.
Switching Replications Design. The switching replication design strengthens causal inference by replicating the treatment effect at a later
date within the group that initially served as a
no-treatment control. The design is diagrammed
below, and its name comes from the fact that
between the first and second measurement
wave one group serves as the experimentals
and the other as controls, while between the
second and third waves the roles are switched
and so the treatment statuses are reversed.
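As a rough illustration of the logic, the Python sketch below computes a simple change-score contrast for each phase of a switching replication. The three waves of scores and both group compositions are invented, and real analyses would model the repeated measures more formally.

import numpy as np

# Hypothetical attitude scores at three waves for two nonequivalent groups.
# Group A is treated between waves 1 and 2; group B between waves 2 and 3.
group_a = np.array([[50, 62, 63],
                    [48, 59, 60],
                    [52, 64, 66],
                    [47, 58, 59]], dtype=float)
group_b = np.array([[49, 50, 61],
                    [51, 52, 64],
                    [46, 47, 58],
                    [50, 51, 62]], dtype=float)

# Phase 1 (wave 1 -> wave 2): A should gain relative to B.
phase1 = (group_a[:, 1] - group_a[:, 0]).mean() - (group_b[:, 1] - group_b[:, 0]).mean()
# Phase 2 (wave 2 -> wave 3): the pattern should reverse, with B now gaining.
phase2 = (group_b[:, 2] - group_b[:, 1]).mean() - (group_a[:, 2] - group_a[:, 1]).mean()

print(f"Phase 1 gain of A over B: {phase1:.2f}")
print(f"Phase 2 gain of B over A: {phase2:.2f}")
# A treatment effect is most credible when both contrasts are positive,
# that is, when the effect replicates in the group that initially served as the control.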

Basadur, Graen, and Scandura (1986) used


a version of the switching replication design
in their analysis of how training affected engineers' attitudes toward divergent thinking
in solving problems. Measurement was taken
prior to the treatment, following the training
of one group of engineers, and then following
the training of a second nonequivalent group.
The later treatment group served as controls
during the early phase of the study, while the
roles were switched during the study's second
phase. However, it should be noted that the
second phase does not involve an "exact" replication. The context surrounding the second
treatment introduction is different from the
first, both historically and because a treatment
has "somehow" been removed from the first
group or, if it has not been removed, it is assumed to be having no impact between the later measurement waves.

FIGURE 8
Interpretable Outcome of a Selection Cohorts Design With Pretest and Posttest Cohorts
(Knowledge scores for pretest cohorts and posttest cohorts, plotted by viewing level, beginning with very light viewers)

Thus, the second
introduction of the treatment provides a
modified replication, probing the external
validity issue of whether there is an interaction
of treatments with treatments. That is, does the
treatment have different effects when presented
in the context of another treatment?
In some situations, the researcher will be
fortunate enough to introduce the treatment
only to a subsample of the original control
group, leaving a subsample of controls to remain
untreated for the entire study. Similarly, it is
sometimes possible to reintroduce the treatment a second time to a subsample of the original treatment group in order to evaluate the
benefits of additional treatment exposure.
Gunn, Iverson, and Katz (1985) implemented
these design variants in a study of a health
education program that was introduced into
1,071 classrooms nationwide. This design is
diagrammed at the top of the next page, with
R indicating random assignment.

Year 1        Year 2

Classrooms were initially divided into


nonequivalent treatment and control groups.
In the first year, students in each group were
tested on knowledge of health both before and
after the program. In the second year, the initial
control group was divided in half. One group
received the health education program, and the
other remained without instruction, allowing
both a replication of the treatment effect and
the chance to continue with a no-treatment
control. In addition, a random subsample of the
original treatment group received a second
year of instruction to determine the incremental
benefit of additional training in health education. Here we see switching replications yoked
to both a continuation of the original controls
and a booster for the original treatment group.
This variant strengthens the simple switching
replication design, especially if the second phase
of the study involves partitioning on a random
basis.
The Reversed-Treatment Control Group
Design With Pretest and Posttest. This design can be diagrammed as shown below. X+
represents a treatment expected to influence an
effect in one direction, and X- represents a conceptually opposite treatment expected to reverse
the pattern of findings.

Hackman, Pearce, and Wolfe (1978) used


the design to investigate how changes in the

motivational properties of jobs affect work attitudes and behaviors. As a result of a technological innovation, clerical jobs in a bank were
changed to make the work on some units more
complex and challenging but to make work on
other units less so. These changes were made
without the company personnel being told of
their possible motivational consequences, and
measures of job characteristics, employee attitudes, and work behaviors were taken before
and after the jobs were redesigned. The pretest
scores for the group with enriched jobs were
systematically below those of the other group,
entailing an initial selection difference. If the
groups also matured at different rates, this
would manifest itself as the statistical interaction that indicates a treatment effect in the
design under consideration. Such a threat is not
very plausible, however, if the results in each
treatment group show reliably different trends
with opposite causal signs. This is because
selection-maturation patterns that operate in
different directions are much rarer than patterns where change occurs at different rates in
the same direction.
The reversed-treatment design often has a
special construct validity advantage. The causal
construct has to be rigorously specified and
carefully built into manipulations in order to
create a sensitive test that depends on one
version of the cause (e.g., job enrichment) affecting one group one way, while its conceptual opposite (job impoverishment) affects the
second group the opposite way. Moreover, some
of the irrelevancies associated with one treatment will be different from those associated
with the reversed treatment, adding to the
heterogeneity of irrelevancies. To understand
this better, consider what would have happened had Hackman, Pearce, and Wolfe used an
enriched job group only and no-treatment
controls. A steeper pretest-posttest slope in the
enriched condition could then be attributed
either to the job changes or to respondents
feeling specially treated or guessing the hypothesis. The plausibility of such alternatives is

lessened in the design under discussion if the


expected pretest-posttest decrease in job satisfaction is found in the de-enriched group. This
is because awareness of being in a research
study is typically considered to elicit socially
desirable responses. The only construct validity interpretation that would fit both the increase in the enriched group and the decrease
in the contrast group would be if each set of
respondents guessed the hypothesis and
wanted to corroborate it in their own way.
A disturbing feature of the design is its
dependence on obtaining a data pattern with
opposite causal signs. When change is differential across treatments but in the same direction,
results are not easily interpretable. Without a
no-treatment control group, we cannot say
whether the two same-sign slopes would have
had different signs if the effects of temporal
change had been removed. To be maximally
interpretable, the reversed treatment design
needs a no-treatment group or preferably a
placebo control group. Another problem is that,
in many organizational contexts, ethical and
practical considerations prevent the implementation of a reversed treatment design. The
vast majority of treatments have ameliorative
and prosocial goals, suggesting that a conceptually opposite treatment might be downright
harmful. Hence, we should expect to see the
reversed treatment design only rarely, and then
in situations where reversals occur in unplanned
fashion.
The Untreated Control Group Design With a
Double Pretest and Both Independent and
Dependent Samples. The design features
we have presented thus far can be mixed and
matched in many ways that aid causal interpretation. Indeed, we have already seen this in
discussing cohort designs when nonequivalent
control groups and treatment partitioning were
added to the basic cohort feature. We shall later
see the idea more systematically developed
when we examine interrupted time series designs. For now, however, let us describe a hybrid design that utilizes a number of different design features.


To evaluate similar, but not identical, interventions aimed at reducing cardiovascular risk
factors at the community level, Blackburn et al.
(1984) and Farquhar et al. (1990) combined a
double pretest with samples where the outcome was measured on both independent
and dependent (i.e., repeated measurement)
samples. We diagram the logic of the design
below, glossing over some complexities and
using the perpendicular lines between some
O's to show independent samples.


The major outcome measures were physiological indicators of heart problems collected
annually, including blood pressure and cholesterol level. Three matched treatment and control communities were used in Blackburn's
study and two in Farquhar's. The double pretest was included to get some estimate of pretest trends in the two nonequivalent communities, and independent random samples were
drawn in the first two years both out of concern
that obtrusive annual physiological measurement would sensitize repeatedly measured
respondents to the treatment and out of a desire
to generalize to the community at large as it
evolved over time. However, in the Blackburn
et al. study, the variability between years within
cities was much greater than expected and,
since it could not be readily explained, statistical adjustments were not very helpful. Hence,
Blackburn modified his design in midstream so
that some respondents who had provided pretest information were followed up at a number
of posttest time points, creating a longitudinal
sample of community residents to complement

the independent samples that continued to be


drawn in most years. The Farquhar et al. (1990)
study was different in this regard since it was
designed from scratch to include both independent and dependent samples.
Farquhar et al. discovered that repeated
measurement enhanced the power to reject the
null hypothesis and also led to mean differences that were slightly larger than with independent samples. It is not clear why the latter
was the case, and chance cannot be ruled out.
Another possibility, though, is that the independent samples included persons who had
recently moved into the communities and so
had less exposure to the treatment. Including
them should lower treatment intensity and
increase random error. A third possibility is
that the obtrusive drawing of blood led the
sample with repeated measurement to use the
intervention differently or better, helping them
to exercise more, eat more healthfully, or give
up cigarettes. It is fortunate, therefore, that in
both the Blackburn et al. and Farquhar et al.
studies, the independent and dependent
samples produced roughly similar patterns of
mean differences, though effects did tend to be
slightly larger for the dependent samples. The
overall similarity renders testing less viable as
the sole cause of effects, though it may have
marginally inflated those that were found. Only
through having both types of samples could we
have been made aware of a possible role of
testing, given its special saliency in this public
health context where physiological measures
are routinely taken.
The Farquhar study is interesting for another reason that is particularly pertinent to
studies where the unit of assignment is some
large aggregate, like a community or business.
For reasons of cost and logistics, it is rarely
possible to have a large number of such aggregates. Indeed, in the form presented for publication, the Farquhar et al. study had only four
communities, two treatment and two controls.
The results showed that risk of cardiovascular

disease decreased in the two treatment communities and in one of the control communities, and by amounts that hardly differed. But
the risk appears to have increased over time in
the second control community, despite a national trend downward over the period studied. Omitting this one community from some
analyses would have been desirable and would
have reduced the treatment-control differences
to a level close to zero. But with only two
communities per condition this could not be
done. With so few units there can be no pretense of achieving comparability between the
treatment and control groups, however conscientiously the communities were paired before
assigning them to the treatment and control
status. Larger sample sizes are required. This
entails either adding more communities (which
would be prohibitively expensive even if a
greater ratio of control to treatment communities was involved) or combining studies that
have similar treatments, even though they will
not be identical and other contextual and evaluation factors are bound to differ between the
studies being synthesized.
Designs Without Pretests
In this section, we discuss designs that are
feasible and useful when it is absolutely impossible to collect comparison group data, especially no-treatment control group data. These
are in many ways fall-back designs born of
necessity rather than desire. They vary considerably in their potential for justifying causal
inferences, and some are frankly better used as
parts of more complex designs than as standalone designs.
The Removed Treatment Control Group
Design. When it is not feasible to obtain a
nonequivalent control group, the researcher is
forced to create the functional equivalent of
such a group. The design shown at the top of
the next page does this in many instances.

It calls for a simple one-group pretest-posttest design (see O1 to O2), with a third wave of data collection being added (see O3), after which the treatment is removed from the treatment group (X symbolizes this removal), and a final measure is made (O4). The link from O1 to O2 is the experimental sequence, as it were, while the link from O3 to O4 attempts to serve as a no-treatment control. Note that a single group of units is involved, and we only expect the predicted data pattern if the treatment has no long-term effects that carry over once the treatment has been removed. A persisting effect will bias the analysis against the reversal in the direction of change predicted between O3 and O4.
The most interpretable outcome of the design is presented in Figure 9. However, statistical conclusion validity is a problem, since the
pattern of results can be thrown off by even a
single deviant data point. Hence, large sample
sizes and highly reliable measures are a desideratum. Second, many treatments are ameliorative in nature. Removing them might be hard
to defend ethically; it may also arouse a frustration that should be correlated with measures
of aggression, satisfaction, and performance,
not to speak of the resentful demoralization
and compensatory rivalry listed earlier under
threats to construct validity of the cause. Such
considerations indicate that it will sometimes
be impossible to implement the design.
Third, there are many instances where respondents voluntarily decide to discontinue
exposure to a treatment for reasons that are not
related to the social research taking place. Since
the present design is most likely to be used
when respondents self-select themselves out of
a treatment, very special care has to be taken
in this circumstance. Imagine someone who
becomes a foreman (X), develops promanagerial attitudes between 0 1 and 0 2, dislikes his
new contact with managers, and becomes less

promanagerial by O3.

FIGURE 9
Generally Interpretable Outcome of the Removed-Treatment Design

This person would be a


likely candidate for resignation from the foremanship or being relieved of it. Any continuation of his less promanagerial attitudes after
changing back to a worker would result in an
0 3-04 difference that differed from the 0 1-0 2
difference. Thus, the researcher has to decide
whether the 0 3-04 difference reflects spontaneous maturation or the change of jobs. The
maturation explanation would be more likely if
the 0 3-0 4 difference were similar to the 0 2-03
difference (see the interpretable outcome in
Figure 9). A rise in promanagerial attitudes
between 0 3 and 0 4 that was greater than the
0 2-0 3 decline would strongly suggest that
entering a new role causes one to adopt the
attitudes appropriate to that role.
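The comparisons just described reduce to contrasts among successive mean differences. A minimal Python sketch follows, with invented attitude scores for the four waves; it is meant only to show which differences are being compared.

import numpy as np

# Hypothetical promanagerial-attitude scores at the four waves for one group
# of workers who receive and then lose the treatment (all numbers are made up).
o1, o2, o3, o4 = (np.array(x, dtype=float) for x in (
    [40, 42, 38, 41, 39],   # O1: before becoming foremen
    [55, 57, 52, 56, 54],   # O2: after becoming foremen (treatment in place)
    [54, 56, 51, 55, 53],   # O3: later, treatment still in place
    [44, 45, 42, 46, 43],   # O4: after reverting to worker status (treatment removed)
))

gain_with_treatment   = (o2 - o1).mean()   # O1 -> O2: the "experimental" sequence
drift_under_treatment = (o3 - o2).mean()   # O2 -> O3: spontaneous change while treated
change_after_removal  = (o4 - o3).mean()   # O3 -> O4: the quasi "control" sequence

print(f"O1 -> O2: {gain_with_treatment:+.2f}")
print(f"O2 -> O3: {drift_under_treatment:+.2f}")
print(f"O3 -> O4: {change_after_removal:+.2f}")
# The interpretable pattern in Figure 9 is a gain from O1 to O2 and a reversal from
# O3 to O4 that clearly exceeds any O2 -> O3 drift, with equally spaced observations.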
Fourth, it is advantageous if observations
are made at equally spaced intervals. This permits a control for any spontaneous linear
changes that take place within a given time
period. A simple comparison of the differences
between 0 2-03 and 0 3-04 would be meaningless if the 0 3-04 time interval were longer than
the 0 2-03 interval. This is because a constant
rate of decay would reveal larger 0 3-04 differences than 0 2-03 differences. It is sometimes

possible to estimate rates of change per time


interval, thus lessening the need for equal spacing. But a sensitive treatment-free estimate of
the rate of spontaneous change is not possible
with the design under consideration, increasing the reliance on equal intervals.
Lieberman (1956) used a simpler version
of the removed-treatment design in his examination of the attitude change that follows
role change. He sampled men before becoming a foreman, after becoming a foreman, and
then after reverting back to worker status.
His design differed from the one we have
outlined since only three measurement
waves were made. Hence, any differences between his 0 1-02 and 0 2-03 measures might be
due to (a) role change influencing attitudes, (b)
foremen whose attitudes were becoming less
managerial being selected for demotion, or (c)
sampling error distorting the 0 2 mean that
appears in both comparisons (with 0 1 and 0 3),
resulting in the data pattern indicating a treatment effect. Having two observations during the treatment period helps rule out these
possibilities in the four-wave design we have
advocated.
The Repeated-Treatment Design. When
the investigator has access to only a single
research population, it will sometimes be possible to introduce, remove, and then reintroduce the treatment at different dates, correlating the treatment's availability with changes in
the dependent variable. Such a design is only
viable when the initial treatment effect is presumed to be transient. If it were not, it would
countervail against the treatment removal reversing the direction of effect and might give
rise to a ceiling effect that prevents the reintroduction of the treatment registering an effect.
The design is below, and its most interpretable
outcome is when 0 1 differs from 0 2, 0 3 differs
from 0 4, and the 0 3-04 difference is in the same
direction as the 0 1-02 difference.
O1    X    O2    (treatment removed)    O3    X    O4

The design is of the general type associated


with Skinnerian research in psychology, and its
logic was used in the original Hawthorne studies (Roethlisberger & Dickson, 1939). In some of
these studies, female factory workers were
separated from their larger work groups and
were given different kinds of rest periods at
different times so that the researchers could investigate the effects of rest on productivity. In
some cases, the same rest period was introduced at two different times. If we were to
regard only these repeated rest periods, we
would have the basic design under discussion
here.
One possible threat to internal validity
comes from cyclical maturation. For example,
if 0 2 and 0 4 were recorded on Tuesday morning, and 0 1 and 0 3 on Friday afternoon, any
differences in productivity might be related to
daily differences in performance rather than
treatment. Another even less plausible possibility is of unique historical events that happen
to mimic the patterning of the treatment introduction and removal. In general, though, the
design is very strong from an internal validity
perspective, especially when the investigator
introduces and removes the treatment at will.
However, the design is often particularly
vulnerable on external and statistical conclusion validity grounds. For example, many of
the performance graphs in the Hawthorne
studies of Roethlisberger and Dickson (1939)
are of individual female workers, and in the
Relay Room Experiment there was a grand
total of only six women! Moreover, there appears to be considerable variability in how the
women reacted to the treatments (particularly
the Mica Splitting Room Experiment), and we
cannot be sure to what extent the results would
be statistically reliable in analyses that summed
across all the women. The Hawthorne studies
closely parallel the design of Skinnerian experiments with a common preference for few
subjects and repeated reintroductions of the
treatment and with a common disdain for
statistical tests. Of course, small populations

and no statistical tests are merely correlates of


the design's use in the past. They are not requirements, and we strongly urge using larger
samples and statistical tests.
Construct validity of the cause presents a
major threat when respondents notice the introduction, removal, or reintroduction of the
treatment, as will often be the case. This permits respondents to achieve self-generated
hypotheses about the study and to respond to
them even when there is none of the obtrusive
observation or special group status of the
Hawthorne experiment. Resentful demoralization can also be an issue when the treatment
is removed between O2 and O3. The O3 data point might then be affected, and a difficult-to-interpret increase might also occur between O3
and 0 4 as the treatment is reinstated and the
source of frustration removed! This design is
better, therefore, when there are short-lasting
effects, unobtrusive treatments, a long delay
between the treatment and its reintroduction,
and no confounding of temporal cycles when
the treatment is introduced, removed, and then
reintroduced. The design is best of all, though,
when reintroductions are frequent and randomly distributed across time blocks, thereby
creating a randomized experiment with time
blocks as the unit of assignment.
Constructed No-Cause Baseline Designs.
When no comparison groups are available, we
must somehow locate or develop functional
equivalents of the no-cause baseline. A number
of alternatives exists for doing this by constructing a no-cause baseline as opposed to
directly observing no-treatment controls at the
same time as a treatment group is observed.
Such constructed control designs include (a) a
regression extrapolation design in which actual and projected posttest scores are compared, (b) a normed comparison design in which
respondents are pretested and posttested and
their scores compared to normed samples, and
(c) a secondary data design in which respondents are compared to samples drawn from

population-based surveys conducted for different purposes than serving as a control group.


All of these possibilities have distinct weaknesses. We do not endorse them other than
faute de mieux, and we strongly urge that all
causal interpretations offered from such designs be hedged around with many explicit
conditionals.
Regression Extrapolation Design. In the regression extrapolation design, we want to
compare a treatment group's obtained posttest
score to a predicted one that takes account of
internal validity threats. This is routinely done
in interrupted time series work, but here the
emphasis is on contexts where less pretreatment information is available. Specifically, the
regression of pretest values on a maturational
variable is used to estimate the rate of change
that would be expected for a time period equivalent to the pretest-posttest interval. Thus, maturation is highlighted as the major threat to be
ruled out in the basic pretest-posttest design
without control groups, though it is not the
only threat.
The analysis is perhaps best understood in
the context of an example. Cook et al. (1975)
wanted to evaluate "Sesame Street" in several
different areas of the country. A pretest was
administered, followed six months later by a
posttest. Using pretest scores only, a scatterplot was computed to show the relationship
of age (in months) to academic achievement.
The regression lines were basically linear,
and from them area-specific unstandardized
regression equations were calculated to estimate how much achievement gain was expected for each month a child grew older. It
was then possible to estimate by how much
the children at each site would have changed
due to maturation in the six months separating
the pretest from the posttest. This provided a
maturation-sensitive expected posttest mean
that could be evaluated against the obtained
posttest mean. Since the expected gain was
derived from the children's pretest scores, it

could not have been influenced by "Sesame


Street."
The analysis depends on stable estimation,
which entails the need for reliable measures
and large samples. Also important is that the
design is predicated on estimating spontaneous change between the pretest and posttest. It
is therefore moot about the possibility of history causing spurious effects. A testing artifact
is also possible, for the pretest estimation of
posttest performance is based on the first testing of respondents, while the obtained posttest
is based on a second testing. For all these reasons, the design is not strong by itself, though
probably worth doing carefully when nothing
else is possible.
Normed Comparison Designs. Normed comparison designs have frequently been used,
especially in education. In this design, the
experimental groups' obtained performance
at the pretest and posttest is expressed in terms
of the norms for a population as similar as
possible to the population under study. Imagine that the average raw score on a pretest
language arts exam might be equal to the
50th percentile on some standardized (and,
hence, presumptively nationally normed) test.
Imagine, further, that the raw scores rose to a
level equivalent to the 55th percentile after a
program designed to improve language. The
hypothesis is that this gain of five percentile
points is due to the social program under
analysis which has, in essence, been contrasted
against national norms over the pretest-posttest
time period (Tallmadge, 1982). Thus, the
norming population serves as the no-treatment
control group.
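The arithmetic of the normed comparison can be sketched as follows in Python, assuming a publisher-style table that maps raw scores to percentile ranks. Both the norming table and the study means are invented for illustration.

import numpy as np

# Invented norming table: raw score cutpoints and their percentile equivalents.
norm_raw_scores  = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])
norm_percentiles = np.array([ 5, 15, 30, 45, 55, 70, 82, 92, 98])

def to_percentile(raw):
    """Interpolate a raw score into the normed percentile metric."""
    return np.interp(raw, norm_raw_scores, norm_percentiles)

# Invented study-group raw-score means before and after the language program.
pretest_raw_mean  = 27.0
posttest_raw_mean = 30.5

pre_pct  = to_percentile(pretest_raw_mean)
post_pct = to_percentile(posttest_raw_mean)
print(f"Pretest percentile rank:  {pre_pct:.1f}")
print(f"Posttest percentile rank: {post_pct:.1f}")
print(f"Apparent gain against the norming population: {post_pct - pre_pct:.1f} points")
# The inference rests on the assumptions noted in the text: without a treatment effect
# the percentile rank should stay constant, and the norming population must resemble
# the study population.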
There are two crucial assumptions of the
normed comparison design. The first is that
when there is no treatment effect, the pretest
percentile rank equals the posttest rank. The
second is that data must be available from a
norming population that is similar to the study
population and whose ages cover the same
range. This last requirement is often met in

education. Test publishers conduct national


norming studies, hoping to provide manuals
with raw scores and their percentile equivalents for various ages. Some school districts
even construct their own local norms based on
past data collection efforts. Some business
organizations with effective record-keeping
systems presumably do the same.
The normed comparison design presents
numerous problems for its potential users.
Among them are instrumentation, statistical
regression, and selection. The model assumes
that a change in percentile rank is systematically and meaningfully related to a change in
raw scores. Yet one unit of change in the original metric often entails quite different percentile transformations, depending on the raw
score obtained. Typically, one unit of change
in the original metric makes more of a difference to those scoring close to the end of a distribution than to those scoring around the
mean. Therefore, random error has different
consequences at different points on the original
scale. A second instrumentation threat is evoked
because, in conducting Title I analyses, many
school districts converted student scores to
percentile ranks by hand. According to the
available empirical evidence (Linn, Dunbar,
Harnisch, & Hastings, 1982), this resulted in
errors that were not random but were systematically biased in favor of inflating treatment
gains.
Research on Title I evaluations has also
shown that statistical regression can act to bias
the normed reference model when the selection
test for entry into the program is made closer to
either the pretest or posttest, as opposed to
being right between them (Glass, 1978; Linn et
al., 1982). Linn (1980) tried to estimate the exact
size of this regression effect and discovered
that it was greater, as one would expect, the
farther the respondent's test score was below
the population mean. But it also seems to have
been larger with children below the third grade,
perhaps because measures were less reliable
then. Linn, Dunbar, Harnisch, and Hastings


(1982) concluded that the amount of regression bias was small but systematically biased
in favor of enhancing gains.
Selection is perhaps the most salient threat
to the normed reference model because it assumes that the study population and the "controls" in the norming population are equivalent in composition and rates of change. However, the populations used for norming are
often not equivalent (Roberts, 1980). Test publishers rely on the voluntary cooperation of
school districts to collect normative data. While
they try to select districts representative of
the nation as a whole, districts very often decline to participate in the norming sample
(Linn et al., 1982), and in recent years more and
more districts are refusing to participate. Publishers' norms, then, are not representative of
any national or study population. If the rate of
knowledge growth per month is lower in a
publisher's population than a study population, then a study population will seem to gain
from a treatment even if it has not.
Local norms are better, therefore. They
function as cohort controls, capitalizing on the
assumption of enhanced similarity as groups
follow each other through an organization.
Thus, if a school district or other organization
had a regular testing program that was well
conducted over a number of years in an environment where there were no radical changes
in population, their scores could be used to
construct norms that would be more appropriate to local study populations. However, if the
standards above were met, there would be no
need for norms. One would merely construct a
pretreatment time series and evaluate a treatment in terms of deviations from the level and
slope of the pretreatment series! Thus, the
strongest form of a normed comparison design
renders such a design unnecessary.
Creating a Baseline From Other Sources.
Even without published norms, researchers
can sometimes construct opportunistic comparison groups from archival sources that were

intended for purposes other than serving as


controls for a particular study. Labor economists often use national data sets this way,
going to current population surveys, panel
studies on income dynamics, or one of the national longitudinal surveys to create baselines
against which to evaluate job training programs in particular. This is a controversial
practice, even within labor economics (LaLonde,
1986), despite its widespread use. We also have
deep reservations.
Jackson and Mohr (1986) used a population-based control group in their study of the
effects of an experimental housing allowance
program. Housing outcomes at the program
sites were contrasted against a comparison
group constructed from two years of data
from the nationwide Annual Housing Survey.
These were the years that most closely approximated the time period of the experiment
and so provided a control for simpler history
effects. To deal with selection, families were
selected from the national surveys according
to the same criteria used to determine eligibility for the housing allowance program. The
final step of the analysis involved contrasting
the performance of families in the program
with the performance of the matched controls
on those outcome variables that were common
to the annual survey and the demonstration
program.
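In practice, assembling such a comparison group amounts to filtering the survey file with the program's own eligibility rule and then comparing the groups on the outcomes the two sources share. The Python sketch below illustrates the steps; the file names, column names, and eligibility rule are hypothetical, not those of the housing allowance study.

import pandas as pd

# Hypothetical files: program participants and a national survey sample.
program = pd.read_csv("housing_program_families.csv")   # hypothetical file name
survey  = pd.read_csv("annual_housing_survey.csv")       # hypothetical file name

# Apply the same eligibility rule used by the program to the survey records
# (the rule below is invented for illustration).
eligible = survey[(survey["income"] < 12000) & (survey["household_size"] >= 2)]

# Compare the groups only on outcomes measured the same way in both sources.
common_outcomes = ["rent_burden", "housing_quality_index"]   # hypothetical columns
comparison = pd.DataFrame({
    "program_mean": program[common_outcomes].mean(),
    "survey_control_mean": eligible[common_outcomes].mean(),
})
comparison["difference"] = comparison["program_mean"] - comparison["survey_control_mean"]
print(comparison)
# As the text stresses, eligibility matching of this kind undermatches on the
# motivational factors that led some eligible families to participate.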
It is important to remember that, in the
original demonstration, many of the families
eligible for housing assistance did not receive
it, and for systematic reasons. For instance,
some refused the service, others did not hear
about it, some received it initially but dropped
out, and so on. Thus, the matching procedure
used inevitably undermatches on all those
motivational and other factors that led some
eligibles to participate but not others. Undermatching is a potential and likely selection
threat in this and any other study that seeks to
construct a control group from within the
population sampled in national long-term
data collection programs designed for other

purposes. This is recognized by proponents


of the technique who usually complement
matching with statistical adjustments designed
to account for any remaining population differences between the original study population
and the constructed controls. However, the
adjustment procedure they favor is under intensive criticism right now in the mathematical
statistics literature (e.g., Holland, 1989) and
even among some labor economists (e.g., Moffitt, 1989).
Constructing a comparison group in the
way outlined above is insensitive to any local
history forces that might affect responding
between the data collection waves. Moreover,
national longitudinal surveys often involve
measuring respondents for many years, while
cross-sectional studies involve independent
samples. Neither of these is similar to what
happens in the treatment group where the
posttest is usually the second measurement
period. While differences in the frequency of
measurement are not likely to have large effects in our judgment, they may still not be
zero. The cost and time savings associated with
constructing a comparison group from archival data have probably led to this technique
being used, even when there are only posttest
data available from the treatment group (e.g.,
Brett, 1982). Without pretest data, the design is
likely to be particularly problematic because of
simple selection differences between the treatment and the opportunistic, archival control
group. Matching to reduce the differences will
inevitably be partial, and statistical procedures
for controlling any remaining selection differences are incomplete.
We are not much swayed by the argument
that the costs of directly collecting control data
can be reduced by opportunistically constructing controls from available archival sources,
particularly when only posttest data are collected in the treatment group. But even when
pretest data are available, we join those cited
above who are suspicious of the quality of the
matching and statistical adjustment procedures

used to control for the more complex forms of


selection relevant to longitudinal data, particularly selection-maturation. However, with
more national data bases becoming available,
we expect to see more future use of designs
with comparison groups constructed from
archival data originally collected for other
purposes.
The Nonequivalent Dependent Variables
Design. When used alone, this is one of the
weakest interpretable quasi-experimental designs. It is diagrammed below to highlight its
similarities to the untreated control group
design, even though the A and B represent
different measures collected from a single
group.

The design depends on the availability of


scales measuring two constructs. One is expected to change because of the treatment, while
the other is not. However, the second one is
expected to respond to all internal validity
threats in the same way as the first. Use of this
design should be restricted, therefore, to theoretical contexts where differential change is
predicted and a case can be convincingly made
that all alternative interpretations affect both
outcomes equally. These are strong assumptions, and they cannot be easily met in actual
research practice.
Trulson (1986) used the nonequivalent
dependent variables design as part of his evaluation of how training in TaeKwonDo, a Korean
martial art, influenced the delinquency-related
behavior of boys. Substantive theory was used
to predict that the treatment would influence
psychopathic deviation, schizophrenia, and hypomania scores on the Minnesota Multiphasic
Personality Inventory (MMPI), but would not
affect other mental health constructs like depression and hysteria that do not differentiate

delinquent from nondelinquent boys. After six
months of TaeKwonDo training, the delinquent boys showed pretest-posttest decreases
in the expected mental health constructs related to delinquency potential and no change in
the second set of mental health outcomes.
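A minimal Python sketch of the differential-change comparison follows, with invented scores on one delinquency-related scale and one theoretically unrelated scale for a single group of boys. It is illustrative only and does not reproduce Trulson's analysis.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30

# Invented MMPI-style scores for one group measured before and after training.
psychopathic_pre  = rng.normal(65, 8, n)
psychopathic_post = psychopathic_pre - 6 + rng.normal(0, 4, n)   # expected to drop
depression_pre    = rng.normal(55, 8, n)
depression_post   = depression_pre + rng.normal(0, 4, n)         # expected not to change

change_a = psychopathic_post - psychopathic_pre   # outcome predicted to change
change_b = depression_post - depression_pre       # nonequivalent control outcome

print(f"Mean change, delinquency-related scale: {change_a.mean():+.2f}")
print(f"Mean change, control scale:             {change_b.mean():+.2f}")
t, p = stats.ttest_rel(change_a, change_b)
print(f"Paired test of the two change scores: t = {t:.2f}, p = {p:.3f}")
# Interpretation still assumes that history, maturation, and instrumentation
# would have affected both scales equally.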
Plausible alternative interpretations have
to be specified in terms of construct-related
differences in maturation rates, local history,
or instrumentation (particularly as regards differences between the measures in reliability or
ceiling/basement effects). High-quality measurement and good substantive theory are the
sine qua nons of this design. For instance, it
would be trivial to demonstrate that the Tae
Kwon Do training was related to a pretestposttest difference in aggressiveness but not to
changes in hairstyles or school performance.
The importance of examining two related but
different dependent variables stems from the
fact that alternative explanations like history,
instrumentation, or maturation are expected to
affect the two classes of mental health outcomes equally. The nonequivalent dependent
variables design is strongest where a specific
complex pattern of change is predicted that
allows many alternative explanations to be
ruled out (Campbell, 1966), and Trulson definitely helped his case by having several outcomes that were, and several that were not,
predicted to change. Nonetheless, the nonequivalent dependent variables design will
rarely be effective in a stand-alone capacity.
The Regression-Discontinuity Design
When merit-based awards or need-based
services are given to groups, we would often
like to discover the consequences of such provisions. A regression-discontinuity design can
be appropriate for these situations if a number
of conditions are met. The logic behind the
design is simple. Imagine that each unit in a
sample can be classified according to its score
on a quantified continuum and that there is
a specified cutting point such that persons

scoring above the point gain an award while


those scoring below it do not. Assignment to
treatment and control conditions is thus based
on the cutoff score and nothing else. Now
imagine fitting separate regression lines to the
groups above and below the cutoff, thereby
representing how scores on the assignment
variable are related to the outcome. If there
were a simple main effect of the treatment, a
discontinuity between the regression lines
would appear at the cutoff point because the
persons scoring above it have had their outcome scores increased while those below it
have not.
To illustrate this, consider a continuous
interval scale measure of pretest organizational
performance, perhaps productivity or sales
performance. Consider next a bonus given to
those persons who score above a particular
output level, followed by measurement of
performance in the next time period. A scatterplot could then be constructed with pretest
performance along the horizontal axis and
posttest performance along the vertical. In this
plot would be entered the point at which each
person's pretest and posttest scores intersect.
If the award were effective, a discontinuity
should be observed when separate regression
lines are fit to the persons above and below the
output cutoff. This hypothetical case is portrayed in Figure 10. If a major question of
interest were sustained effects, then the posttest
scores could be examined two or three years
later and the same analysis conducted.
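For the bonus example, the basic analysis can be sketched as below in Python: simulate a sharp cutoff on pretest output, then regress the posttest on a treatment indicator and the pretest centered at the cutoff, allowing separate slopes on each side. The data and effect size are invented.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400

# Simulated pretest output; a bonus (the treatment) goes to everyone at or above
# the cutoff, and the bonus adds a constant to posttest performance.
pretest = rng.normal(100, 15, n)
cutoff = 110
treated = (pretest >= cutoff).astype(int)
posttest = 20 + 0.8 * pretest + 5.0 * treated + rng.normal(0, 8, n)

df = pd.DataFrame({
    "posttest": posttest,
    "centered_pretest": pretest - cutoff,   # center so the intercept shift falls at the cutoff
    "treated": treated,
})

# Separate slopes on each side of the cutoff; the coefficient on `treated`
# estimates the discontinuity in level at the cutoff.
model = smf.ols("posttest ~ treated * centered_pretest", data=df).fit()
print(model.params[["treated"]])
print(model.conf_int().loc[["treated"]])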
The portrayal of the regression-discontinuity design in Figure 10 highlights its similarity
with how a treatment effect would appear in a
randomized experiment where posttest scores
are plotted as a function of pretest scores. A
simple main effect of the treatment adds a
constant to all values of the pretest scores,
creating parallel treatment and control regression lines with different intercepts. In the
regression-discontinuity design, a treatment
effect would also involve the same constant
being added and parallel regression lines. But

FIGURE 10
Hypothetical Outcome of a Pretest-Posttest Regression-Discontinuity Quasi Experiment
(Posttest values plotted against pretest values, with separate No Treatment and Treatment regression lines on either side of the cutoff)

now all the control cases would be missing on


one side of the cutoff point and all the experimental cases on the other. But the most important resemblance between the regression-discontinuity design and the randomized experiment is that, unlike with nonequivalent group designs, the selection process is perfectly known. In the pure regression-discontinuity
case, selection is based on the fallible obtained
scores used for assigning units to treatments
and on nothing else. In the randomized experiment, assignment is based on a lottery and
nothing else. It is knowledge of the selection
process that makes the regression-discontinuity design so potentially amenable to causal
interpretation and one of the strongest alternatives to the randomized experiment (Trochim,
1984).
A greater variety of effects can be obtained
with the design than just a discontinuity in

level at the cutoff point. The slope of the regression lines might also come to differ on each side
of the cutoff, suggesting a statistical interaction
of the pretest and treatment such that persons
on the treatment side of the cutoff do better or
worse depending on their initial scores.
Suppose the National Science Foundation
were to award individual fellowships to graduate students solely on the basis of their Graduate Record Examination scores and then wanted
to know what effect this had on a particular college outcome. If a regression discontinuity
design were used and resulted in different regression slopes with the steeper one being on
the awards side of the cutoff, this would suggest that fellowships have more of an impact on
students whose GRE scores are among the
highest than on students whose scores just
qualify them for fellowships. (However, the
interpretation of slope differences is extremely


difficult without clear intercept differences because it is then especially likely that a nonlinear selection function can be fit to the data; see below.)
Although there are many situations in which social resources are allocated on the basis of quantitative eligibility criteria, the regression-discontinuity design is not frequently employed. Trochim (1984) examined this issue in a review of evaluations of Title One compensatory education programs (now Chapter One). Each school district was free to choose among three methods to use in their local evaluation. One was the regression-discontinuity design, but only two percent of all administrators chose to use it over a randomized control group or a normed reference design. (The last was overwhelmingly preferred.) They chose not to use the regression-discontinuity design for a variety of reasons. Many had unfounded fears that using it would yield negative program effects; others believed that it was technically difficult to use; others maintained (erroneously) that the within-group pretest-posttest correlations had to be above some minimum size for the method to be useful; others worried about assigning students to the treatment on the basis of a strict cutoff point alone; finally, and perhaps most importantly, the state departments of education encouraged the school districts to use other evaluation models. The few administrators who did use the regression-discontinuity design cited several reasons. First, it was conceptually close to the idea of Title One in that it depends on explicit quantified measurement of need and a clear decision rule for allocating the service. Second, the design fits in well with the annual testing cycle of many school districts and so requires no additional testing; last year's scores can be used as the pretest and this year's as the posttest. Third, the design is methodologically stronger for inferring causality when compared to the normed comparison design most districts actually used.
The following discussion of the regression-discontinuity design highlights some of the

most cited examples of use of the design and is


divided into two sections. The first deals with
the situation where pretest scores are available
for classifying units and assigning them to treatments; the second deals with the use of quantitative scores other than the pretest. The distinction has no theoretical importance. It is made
solely to remind readers that the regressiondiscontinuity design can be used even when no
pretest measures are available.

Regression-Discontinuity With Pretest


Measures on the Same Scale as the Posttest.
Seaver and Quarton (1976) used the regression-discontinuity design to examine how making the Dean's list because of one quarter's grades influenced college students' grades in the next quarter. They obtained grades from 1,002 students for the quarters before and after the list was published, and their sample included some persons who qualified for the distinction and others who did not. Students who made the list should do better in the next quarter than those who did not for a variety of reasons. The issue with regression-discontinuity analysis is whether the rewarded students performed better than the others by more than would have been the case without the reward. Figure 11 makes it look as though they did.
However, Seaver provided Joyce Sween with his data, and she replotted it as individual scores instead of the array means in Figure 11. It then appeared to her as though grades might be curvilinearly related to each other across quarters. Since the plot of 1,002 scores is complex, we have replotted the data using finer
array means than Seaver and Quarton and
have fitted both curvilinear and linear regressions to the data. Figure 12 provides the data,
and the underlying relationship still appears to
be curvilinear. What are the consequences of
this? As Figure 12 shows, fitting a curvilinear
trend produces no evidence of a discontinuity
at the cutoff point and so there is no evidence
that making the Dean's list had any effect. In
other words, there is a selection-maturation

FIGURE 11
Regression of Grade Point Average Term 2 on Grade Point Average Term 1 for Non-Dean's List and Dean's List Groups
(Grade point average in Term 2 plotted against grade point average in Term 1)

phenomenon such that performance in a later


quarter depends on performance in a prior
quarter, but the relationship varies with the
level of earlier score actually received. When
linear regressions are fitted to the data on each
side of a cutoff point, spurious differences can
emerge if one or more of the regressions are not
in fact linear.
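One routine check on this threat is to refit the model with curvilinear terms and see whether the apparent discontinuity survives. The Python sketch below simulates a smooth curvilinear relationship with no true treatment effect; the quadratic specification is only one of many possible nonlinear forms.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 600

# Simulate a curvilinear pretest-posttest relationship with NO treatment effect.
pretest = rng.uniform(0.5, 4.0, n)          # e.g., term-1 grade point average
cutoff = 3.5
treated = (pretest >= cutoff).astype(int)
posttest = 0.5 + 0.9 * pretest + 0.35 * (pretest - 2.0) ** 2 + rng.normal(0, 0.25, n)

df = pd.DataFrame({"posttest": posttest,
                   "x": pretest - cutoff,
                   "treated": treated})

# A purely linear two-line fit can show a spurious jump at the cutoff...
linear = smf.ols("posttest ~ treated * x", data=df).fit()
# ...that shrinks or disappears once curvature is allowed on both sides.
quadratic = smf.ols("posttest ~ treated * (x + I(x ** 2))", data=df).fit()

print("Linear model 'effect' at cutoff:   ", round(linear.params["treated"], 3))
print("Quadratic model 'effect' at cutoff:", round(quadratic.params["treated"], 3))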
The point we are making can perhaps best
be understood by considering the hypothetical
example in Figure 13 where the scatterplot portrays an underlying nonlinear relationship.
Fitting linear relationships gives a discontinuity at the cutting point; fitting a nonlinear
regression does not. Imagine, now, that the
underlying distribution was of a J-form and
that the tail at the high end of the pretest distribution in Figure 13 did not exist. Then, it would

not be appropriate to fit a linear regression to


the data at the lower end of the pretest continuum, but it would be appropriate at the higher
end. However, if linear regressions were fitted
on both sides of the cutting point, a discontinuity in both the level and slope would be obtained. The major threat to internal validity
with the regression-discontinuity design is
selection-maturation-the possibility that
change between the pretest and posttest does
not follow a simple linear pattern across the
whole of the pretest distribution.
Seaver and Quarton were sensitive to this
threat. But instead of examining the form of
their data more closely for the broader range of
cases who did not make the Dean's list, they
chose another strategy. They reasoned that,
if there were spurious selection-maturation,

FIGURE 12
Plot of the Column Means for the Seaver-Quarton Data
(Grade point average in Term 2 plotted against grade point average in Term 1, with linear and curvilinear regressions fitted to the column means)

then it should appear in the data when the pretest in Figure 11 was used as the posttest and
the previous quarter's grades were used as the pretest. When this analysis was conducted, it produced no discontinuities. However, the test is not as accurate as critically examining scatterplots to see whether nonlinear forms
of regression fit the data better than linear
forms. If they do, regressions of a higher order
than the linear have to be fit; otherwise, spurious causal conclusions will result.
Another set of problems with the regression-discontinuity design arises from the fact

that in most social settings, the treatment is


made available to only a small percentage of
persons who score at one of the ends of a
quantitative continuum, maximizing the
nonequivalence between the groups. This is
not, of course, a necessary feature of the
regression-discontinuity design, though it is
certainly to be expected in many cases where
the design is appropriate. It leads to two problems. First, the dependence on small numbers
of extremely high or low scores makes it difficult to estimate with any reasonable certainty
the shape of the distribution of scores on the

short side of the cutoff point.

FIGURE 13
The Consequences of Fitting Inappropriate Linear Regressions When the Underlying Regression Is Not Linear

The range is


severely curtailed. Yet it is important to be
able to estimate this distribution to determine
if it can be considered simply as an extension
of the distribution found on the better estimated side of the cutoff. Second, being restricted to the most needy and most meritorious, the regression-discontinuity design is
weak when it comes to generalizing to other
types of persons. For instance, in an example to
be presented later, we show that Medicaid significantly increased the number of visits that
poor persons made to the doctor. Some health
planners would like to be able to estimate how
the demand for medical services will increase
once we have some form of a national health
insurance for all citizens and not just the less
affluent. How confident would one feel generalizing to the United States at large based on
Medicaid experience with those who were
most poor and least healthy? (In practice, of
course, one would not rely solely on Medicaid

for an estimate of the demand for health services; some labor unions have won completely free medical services for their members, and some cities have private medical schemes allowing unlimited free medical services.)
A difficulty that can often be anticipated
with the regression-discontinuity design is that
the cutoff point will not be as clear-cut as our
discussion may suggest. Lack of clarity is especially likely when the cutoff point is widely
known, for this can give rise to special pressures to help some persons achieve the cutoff
point score. For instance, the Irish government publishes the passing score on various national
examinations in education. In an unpublished
manuscript, Greaney, Kellaghan, Takata, and
Campbell (1979) found that a frequency distribution of the number of children obtaining all
possible scores on the physics exam showed
fewer students than expected scoring just below the cutoff point and more than expected
scoring just above it. Did examiners give "an
extra hand" to students scoring just below the
cutoff? A similar phenomenon occurs in
many social service settings, where eligibility
certification workers help clients disguise part
of their income if they suspect that full disclosure will take clients above the eligibility
point for services. Similarly, clients know the
cutoffs and systematically manipulate the information they provide with the cutoffs in mind
(Jencks & Edin, 1990).
A major problem with a fuzzy cutoff point
is that the systematic distortions around the
cutoff can masquerade as treatment effects.
Imagine a social service setting where clients
know the income cutoff point for obtaining
supplementary services and so are motivated
to report their income as lower than it actually
is. To examine how these services influence the
social mobility aspirations of young children,
we might plot the reported income of a wide
range of parents against the mobility aspirations of their children. Let us suppose that the
overall relationship is linear and positive in
general, indicating that higher incomes are

associated with higher aspirations. But one


group of parents will have recorded incomes
below the cutoff point, while their real incomes
are above it. The obtained aspiration scores of
their children will thus be higher than those of
children with more honest parents reporting
the same income but who actually have less
than the others. Combining the scores of all
persons reporting the same income will artifactually increase the scores of children who score
just below the cutoff point, contributing to
spurious differences in intercept and slope.
Since achieving the cutoff point permits
access to scarce and desired resources, we
should always anticipate that some individuals
will gain what, strictly speaking, they should
not receive. Political pull, administrative error,
or the inability to deny services to borderline
cases all make for fuzzy cutoff points. Where
biased scores can be identified, the solution is to
drop the relevant cases from the analysis and to
proceed as though they were not part of the
study. If it is not possible to identify all the
misclassified cases then an estimate should be
made of the range around the cutoff point in
which it is not likely that biased assignments
occurred. This range is then treated as the cutoff point, with regression lines fitted above and
below the endpoints of the range. The logic is
the same whether a range or single cutoff point
is used in the analysis. Statistical sensitivity
will be greater with a point, though.
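The following sketch illustrates, with hypothetical column names, the strategy just described: cases falling within a suspect range around the cutoff are dropped, and the regressions are then fitted to the remaining cases.

```python
# Sketch of the fuzzy-cutoff strategy described above, with hypothetical
# column names: drop cases inside a window around the cutoff where biased
# income reports are suspected, then fit the regressions on what remains.
import pandas as pd
import statsmodels.formula.api as smf

def fuzzy_cutoff_rd(data, score_col, outcome_col, cutoff, window):
    """Exclude cases within +/- window of the cutoff and fit separate slopes
    on each side; services are assumed to go to those scoring below the cutoff."""
    d = data[(data[score_col] < cutoff - window) |
             (data[score_col] > cutoff + window)].copy()
    d["centered"] = d[score_col] - cutoff
    d["treated"] = (d[score_col] < cutoff).astype(int)
    return smf.ols(f"{outcome_col} ~ treated + centered + treated:centered",
                   data=d).fit()

# Hypothetical use, given a data frame of reported parental income and the
# children's aspiration scores:
# result = fuzzy_cutoff_rd(df, "reported_income", "aspiration",
#                          cutoff=3000, window=250)
# print(result.params["treated"])
```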
Quantified Multiple Control Groups,
Posttest-Only Designs. Regression-discontinuity designs are possible whenever units
can be ordered along some quantifiable dimension systematically related to treatment assignment (Trochim, 1984). While desirable,
pretests are not necessary. We can illustrate this
by referring to a study by Lohr (1972; Wilder,
1972) who was interested in exploring the effects of Medicaid. The program was designed
to make medical care available to the very poor
(income level under $3000 per family per year
at the time) by means of federal government

payments. One important question was whether


the poor would actually take advantage of the
new policy.
Lohr's data can be displayed to plot the
mean visits to the doctor per family per year
as a function of annual family income (each
measure was based on interviews done in connection with the Current Population Reports).
The relationship of the two variables is portrayed in Figure 14. A discontinuity occurs for
families with an income under $3000. The
number of physician visits sharply increases
and even tends to exceed the number of visits
made by more affluent families. Since these
data indicate that Medicaid might have increased medical visits by the poor, we have to
ask ourselves the perennial question: Are there
any plausible alternative interpretations of the
relationship?
The chronically sick aside, visits to the doctor are presumably highest among the aged.
Income is often lowest among the aged and
was especially so in the very early 1970s. Thus,
if the aged fell disproportionately into the lowest income category, the relationship in Figure
14 might reflect merely an aging phenomenon.
The relationship of age to income is ultimately
an empirical issue, and national demographic
data exist for checking it. But it is perhaps more
important to note that persons over age 65 are
eligible for both Medicare and Medicaid, and
many older persons use both programs. Since
the evaluation of Medicaid should not be confounded with Medicare, the analysis should be
restricted to persons under 65 years of age.
Thus, there are good reasons for wanting to see
Figure 14 data presented separately for families where no one is eligible for Medicare and
for families where someone is eligible.
A different age explanation is based on the
possibility that medical visits are most needed
by pregnant women and young children. Hence, we have to consider whether the lowest-income group was disproportionately composed of persons prone to pregnancy and large
families. If so, they might have had more

FIGURE 14
Quantified Multiple Control Group Posttest-Only Analysis of the Effects of Medicaid
(Mean physician visits per family per year, roughly 4.0 to 4.9, plotted against income level: Under $3,000 [the experimental group], $3,000-4,999, $5,000-6,999, $7,000-9,999, $10,000-14,999, and $15,000 or more)
frequent visits to doctors even before Medicaid, though these visits might have been more
often to state hospitals on a nonpayment basis.
There is every need, therefore, to disaggregate
the data even further to examine the relationship of income to medical visits among persons
of different family sizes.
A different kind of possible selection bias
cannot be ruled out merely by disaggregating
on the basis of demographic factors that are
routinely measured in surveys and are archived
for general use. Some families in the lowest-income group were eligible for assistance from

many programs, some of which mandated (and


paid for) medical visits as a precondition for
receiving aid or continuing to receive it. Were
the disproportionately frequent physician visits by the poor the result of Medicaid meeting a
need or of the pre-Medicaid requirement of
other programs that a doctor be consulted?
This problem requires knowing something
about the programs in which family members
were enrolled, especially those requiring
work-related and welfare-related physical
checkups. To our knowledge, such data are
not continuously collected so that the best one

could do would be to consult the most reliable


data available on the number of persons eligible for mandated medical visits and determine (a) if such eligibility is related to income
in the discontinuous manner suggested by
Figure 14 and (b) if it was of a magnitude that
could plausibly account for the pattern in that
figure.
Another alternative interpretation is based
on selection-maturation. This interpretation
suggests that the demand for medical care was
greatest among the poor, that the supply of
doctors was increasing year by year, and that
the new supply could find outlets only among
those sections of the population whose prior
demands had not been met and whose physical
state required urgent care. Against this, however, is the fact reported to us by Lohr that,
though the number of doctors per capita increased between 1960 and 1970, the number in
medical practice did not. Presumably, some
doctors went into medical research or nonmedical careers.
A final threat to internal validity arises
because the direction of causality is not clear
from Figure 14. Did Medicaid cause an increase in medical visits, or did the desire for
medical visits lead the sick and hypochondriac
to underreport their true income to both doctors and interviewers in order to continue the
pretense that they were eligible for program
support? An indirect check of this might be
possible by using nonmedical surveys to estimate the proportion of persons in each of the
income categories in Figure 14. If equal proportions fell into each income category in
each type of survey, then it would be unlikely
that Medicaid caused a change in income reporting rather than medical visits. What should
be emphasized about most (but not all) of the
internal validity threats listed above is that
their plausibility can be assessed without undue effort by consulting available archives or
collecting additional data. Our list of threats
should not discourage researchers; like other
lists for other projects, it should spur into action

those whose interest lies in strengthening a


particular causal inference.
The Lohr-Wilder data actually cover three
waves, one coming before Medicaid. As displayed in Riecken et al. (1974, Fig. 4.18), the
data for the year prior to Medicaid indicated
that doctors devoted the lowest level of attention to the least financially advantaged group,
which, as Figure 14 shows, was not the case
after Medicaid. Such a change invalidates many
of the alternative interpretations listed above
that were presented here for pedagogical reasons. It also clearly documents the desirability
of adding a pretest measurement wave to the
basic regression-discontinuity design.
Some important problems of construct
validity should also be mentioned with respect
to Lohr's quasi experiment. Given the rapid
stimulation of demand from new Medicaid
patients and a slower response from physician
supply, does an increase in the quantity of care
for the poor entail a decrease in the quality of
care for them and for others? Furthermore, is
the dependent variable appropriately labeled
as "an increase in physician visits" or as "a
temporary increase in physician visits"? The
frequency of chronic and ill-monitored disease
is presumably higher among the poor and might
well be decreased by Medicaid, thereby leading to a later decrease in visits as more and
more chronic problems are cured or detected
before they become worse. Only longer term
data can explore this and other related issues.
Interrupted Time Series Designs

A time series is involved when we have multiple observations over time, either on the
same units, as when particular individuals are
repeatedly observed, or on different but similar units, as when achievement scores for institutional cohorts are displayed over a series of
years. Interrupted time series analysis requires
knowing the specific point in the series when a
treatment occurred so as to infer whether the
treatment has had any impact on the series. If

it did, observations after the treatment should


be different from those before it, creating a
change or "interruption" in the series at the
expected time. This change can involve pre- and posttreatment series differing in their mean,
slope, variability around the slope, or even in
the nature of seasonal patterns. Our presentation begins with the basic design and then adds
variants. The attentive reader will come to see
that combining the basic interrupted series with
many of the design features discussed earlier
facilitates causal inference over what it would
otherwise have been.
Simple Interrupted Time Series. This is the
most basic time series design. It requires only
one group in which multiple observations are
made before and after a treatment is implemented. It is diagrammed below:

O1 O2 O3 O4 O5 X O6 O7 O8 O9 O10

We begin discussion with examples from


the classic studies of the British Industrial Fatigue Research Board in the early 1920s which
introduced our present period of experimental
quantitative management science. These were
the studies that inspired, and were eclipsed
by, the Hawthorne studies. While their methodology leaves much to be desired by current
standards, it was a great leap forward in the direction here advocated and was probably
stronger than the methodology used in the
Hawthorne studies. Figure 15 comes from
Farmer (1924), who concluded that shortening the workday from ten to eight hours improved hourly productivity. With modern
methodological concerns, we cannot be so
sure. First, there is the possibility of a maturation alternative explanation, since an upward
self-improvement trend is visible before the
treatment. We assume it could have continued
even without the change to an eight-hour day.
One of the major advantages of a time series
over other forms of quasi-experimental

analysis is that we can assess the maturational


trend prior to some intervention.
The advantages of a pretreatment series that is long enough to establish maturation trends reliably become clear if we examine
Viscusi (1985). He studied how workers' health
was affected by tightening of cotton dust standards in textile plants by the Occupational
Safety and Health Administration (OSHA). The
five-year pretreatment series showed an increase in injury rates for at least two years prior
to the change in standards, and this trend seems
to have continued afterward. Thus, maturation
is a threat. Lacking a no-treatment control series, Viscusi needed to compute the expected
level of posttreatment injuries attributable to
maturation to see if the obtained series differed
from it. He used three variables from the pretreatment period to predict injury rates at that
time, then used this prediction equation to
construct the expected posttest level of injuries.
The analysis indicated that the level of the
series after the treatment was higher than predicted, which led Viscusi to conclude that the
new safety standards increased injury rates!
But the pretreatment time series was very short,
and posttest injury rates were assuredly poorly
estimated from a three-variable prediction
equation developed on five pretreatment measurement waves. We see here a heroic effort to
adjust statistically for what was an inadequate
design. The Farmer study, with its large pretreatment series, turned out to be superior on
this score.
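The logic Viscusi followed can be sketched in general form as below. The data and predictor names are invented for illustration; they are not his actual variables.

```python
# Hypothetical numbers, not Viscusi's data: fit a prediction equation on the
# pretreatment waves, project it into the posttreatment period, and compare
# the projection with the injury rates actually observed.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
years = np.arange(1972, 1982)
d = pd.DataFrame({
    "year": years,
    "hours": 100 + 2 * (years - 1972) + rng.normal(0, 1, 10),
    "hires": 10 + rng.normal(0, 1, 10),
    "injury_rate": 5 + 0.3 * (years - 1972) + rng.normal(0, 0.2, 10),
})
treatment_year = 1977

pre = d[d["year"] < treatment_year]      # only five pretreatment waves
post = d[d["year"] >= treatment_year]

eq = smf.ols("injury_rate ~ hours + hires", data=pre).fit()
expected = eq.predict(post)              # the maturation-based expectation
excess = post["injury_rate"].to_numpy() - expected.to_numpy()
print(excess)
# With so few pretreatment waves, this projection carries large uncertainty,
# which is exactly the weakness of the design criticized in the text.
```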
A second threat to internal validity is apparent in the British data. Figure 15 also suggests
the possibility of a seasonal trend masquerading as a treatment effect. The change in the
number of hours worked occurred in August,
1919. The preceding August was a low productivity month and was followed by an upward
trend, mimicking what happened in 1919 when
the change was introduced. The plausibility of
a seasonal effect is reduced because there is no
summer slump in 1920, but without an even

FIGURE 15
Change in Hourly Productivity as a Result of Shifting From a Ten-Hour to an Eight-Hour Day
(Monthly hourly output from early 1918 through August 1920)
From "A Comparison of Different Shift Systems in the Glass Trade" by E. Farmer, 1924, Report No. 24 of the Medical Research Council Industrial Fatigue Research Board. Adapted by permission.
longer time series it is impossible to assess the


role of seasonality.
Figure 16 presents time series data from a
quasi experiment by Lawler and Hackman
(1969) where the threats to validity are somewhat different from those in the previous example. The treatment was the introduction of a
participative decision-making scheme to three
groups of men doing janitorial work at night,
and the dependent variable was absenteeism
(the proportion of possible work hours actually
worked). A noteworthy feature of Figure 16 is
that a false conclusion might have been drawn

if there had been a single pretest and posttest,

for the entire pretest series reveals that the last


pretest measure is atypically low. As a consequence, statistical regression may be inflating
the difference between the last pretest and first
posttest. Time series designs allow assessment
of the pretest time trend, thereby permitting a
check on the plausibility of both maturation
and statistical regression.
The major threat to internal validity with
most single time series designs is a main effect
of history. In the Lawler and Hackman case,
this would be if participative decision making

was introduced simultaneously with a police


drive that made coming to work at night safer,
or some janitors may have staged a strike or
begun a car pool. Several controls for history
are possible, perhaps the best being to add a no-treatment control group. But, though advisable, this is not always necessary. For instance,
Lawler and Hackman's unobtrusive measure
of absenteeism was calibrated into weekly
intervals, and the historical events that can
explain an apparent treatment effect are fewer
with weekly than with monthly or yearly intervals. Also, if records are kept of all plausible
effect-causing events that could influence respondents during a quasi experiment, it should
be possible to ascertain whether any of them
operated between the last pretest and the first
posttest.
Another threat is instrumentation. A
change in administrative procedures will
sometimes lead to a change in the way records
are kept. Persons with a mandate to change an
organization may interpret this to include
making changes in the way records are kept or
how criteria of success and failure are defined.
This seems to have happened, for instance,
when Orlando Wilson took charge of the Chicago police. By redefining how crimes should
be classified, he appeared to have caused an increase in crime. But the increase was spurious,
reflecting record-keeping changes and not
changes in criminal behavior. Time series always deserve critical scrutiny lest the numbers suppose a constancy of definition they do
not enjoy. Note the experience of Kerpelman,
Fox, Nunes, Muse, and Stoner (1978). As part
of their study of the effects of a lawn mower
safety campaign, they used time series data
about the accidents reported to hospitals. Some
hospitals routinely monitor accidents and their
reported causes, and Kerpelman et al. wanted
to discover if the monthly accident rate attributable to lawn mowers decreased after the
safety campaign was launched. At one site,
the data looked to simple visual inspection as

though the campaign had backfired, for accidents seemed higher after the campaign than
for the two years before it. Since this result
was surprising, the hospital records were
checked. It seems that the persons responsible
for compiling the data became much more
conscious of "lawn mower accidents" as a category for assigning causes after they learned
that a lawn mower safety campaign was under way and that the data they provided would
be used as part of the evaluation of the campaign. When Kerpelman et al. went back to the
basic data for all three years and recoded them
with the same definition across all the years,
the apparent negative effect was smaller than
before and probably not statistically reliable.
Another threat is simple selection. This
occurs when the composition of the experimental group changes abruptly at the time of
the intervention, perhaps because the treatment causes attrition from the measurement
framework. This attrition can be a genuine
treatment effect. But it will obscure inferences
about the treatment's effect on other outcomes,
and it will not be possible without further
analysis to disentangle whether an interruption in the series was due to the treatment or to
different persons being in the pretreatment
series when compared to the posttreatment
series. The simplest solution to the selection
problem, where available, is to restrict at least
one of the data analyses to those units providing measures at all time points. But repeated
measurement on the same units is not always
possible (e.g., in cohort designs where, say, the
third-grade achievement scores from a single
school over 20 years are involved). Then, background characteristics of units have to be
analyzed to ascertain whether a sharp discontinuity in the profile of the units occurs
at the time the treatment is introduced. If
there is, selection is a problem; if there is not,
selection is not likely to be a serious threat
unless the background characteristics were
poorly measured or were not the appropriate

FIGURE 16
Mean Attendance of the Participative Groups for the 12 Weeks Before the Incentive Plan and the 16 Weeks After the Plan
Note: Attendance is expressed in terms of the percentage of hours scheduled to be worked that were actually worked.
From "Impact of Employee Participation in the Development of Pay Incentive Plans: A Field Experiment" by E. E. Lawler III and J. R. Hackman, 1969, Journal of Applied Psychology, 53. Copyright 1969 by the American Psychological Association. Adapted by permission.
ones for explaining shifts in the dependent


variable.
As far as statistical conclusion validity is
concerned, Lawler and Hackman compared
the collapsed pretest hours worked with the
collapsed posttest hours, failing to use all the
pretest and posttest data. Moreover, the statistical test they used is only appropriate for
comparing independent groups and the
groups they contrasted were dependent. They
would have done better to use the methods
outlined in Cook and Campbell (1979) and Box
and Jenkins (1976), though their series is rather

short for them. In any event, we cannot be sure


from the reported analysis that the decrease in
absenteeism is reliable.
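As a minimal illustration of using every weekly observation rather than collapsed pretest and posttest means, a segmented regression of the following kind can be fitted. This is a simplification, not the Box and Jenkins ARIMA approach; it ignores the autocorrelation those methods are designed to model, and the attendance values are invented.

```python
# A minimal segmented-regression sketch using all 28 weekly observations
# rather than collapsed pre/post means. It is not the Box-Jenkins ARIMA
# approach and ignores autocorrelation; the attendance values are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
weeks = np.arange(1, 29)                                   # 12 pre + 16 post
attendance = np.where(weeks <= 12, 88.0, 92.0) + rng.normal(0, 1.5, 28)
d = pd.DataFrame({
    "week": weeks,
    "post": (weeks > 12).astype(int),                      # level change
    "weeks_since": np.clip(weeks - 12, 0, None),           # slope change
    "attendance": attendance,
})

m = smf.ols("attendance ~ week + post + weeks_since", data=d).fit()
print(m.params[["post", "weeks_since"]])
```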
As far as construct validity is concerned,
reactive arrangements are only likely to hinder
interpretation of the data for a single series if
the same respondents are repeatedly measured and they know when the treatment was
implemented. It is rare for both these conditions to be met in archival studies, with the
major exception being when an innovation is
heralded with much prior publicity. However, the social context of each time series

experiment should be carefully examined to


determine whether the observed results might
be due to evaluation apprehension or some
similar threat to construct validity. Typically,
there is only one operationalization of the
treatment or effect in time series work, entailing problems with the construct validity of the
cause. Moreover, many treatments are pragmatic in their origins and are not closely tailored to substantive theory or well-specified
constructs. Multiple measures of the effect
construct are more common, though many
archives still measure academic achievement,
sales, or unemployment in just one way because of the expense of record-keeping or old-fashioned definitional operationalist thinking.
With computers being so prevalent now in
organizations, we can expect even more multiple measurement of an effect, permitting the
time series for each measure to be examined
separately or a composite formed.
Concerning external validity, it is sometimes possible in archival studies to use available data on the background characteristics
of units in order to stratify them by gender,
age, or whatever. The series for each subgroup
can then be examined to assess how robust
the effect is across the various subgroups. Of
course, there is no need to restrict the stratification to person variables. Variables relevant to
setting can also be used in the same way to
specify the range of an effect. This disaggregation has to be accomplished with caution,
though, because statistical power is reduced
by creating subgroups.
Interrupted Time Series With a Nonequivalent No-Treatment Control Group Time Series.
Consider adding to the simple interrupted
time series a second series from a nonequivalent no-treatment control group. The resulting design is diagrammed below:

O1 O2 O3 O4 O5 X O6 O7 O8 O9 O10
---------------------------------
O1 O2 O3 O4 O5   O6 O7 O8 O9 O10

Tybout
and Koppelman (1980) used it in their evaluation of an information campaign designed to
increase bus ridership in Evanston, Illinois. The

independent variable was a multidimensional


information campaign that included mailings
of bus route maps and timetables to all households,
and posters displayed in banks and stores
around the city that illustrated the bus routes.
A major dependent variable was ridership.
Average weekday bus ridership figures were
therefore obtained for the city for each month
from January 1974 to December 1978. Comparable records were also collected from a control
city 60 miles away, Elgin, Illinois. Variation in
bus ridership was modeled for both cities in an
analysis that considered not only when the
treatment was introduced (July 1978) but also
such relevant causes of bus ridership as weather
and demographic differences between the two
cities. The analysis suggested that the campaign was associated with a reliable 14 percent
increase in bus ridership but that the effect was
only temporary.
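One way such a treated-versus-control series analysis can be set up is sketched below. This is not Tybout and Koppelman's actual specification; the column names and covariates are assumptions made for illustration.

```python
# Sketch of a treated-city versus control-city model; this is not Tybout and
# Koppelman's actual specification, and the column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

def campaign_shift(d: pd.DataFrame, campaign_month: int) -> float:
    """d is assumed to hold one row per city per month with columns
    city ('Evanston' or 'Elgin'), month_index, ridership, and temperature."""
    d = d.copy()
    d["treated_city"] = (d["city"] == "Evanston").astype(int)
    d["post"] = (d["month_index"] >= campaign_month).astype(int)
    m = smf.ols("ridership ~ treated_city * post + month_index + temperature",
                data=d).fit()
    # The treated_city:post coefficient is the campaign-associated shift over
    # and above history and weather shared by both cities.
    return m.params["treated_city:post"]
```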
It is unlikely that a national event caused
the apparent increase in bus ridership, for such
a history alternative interpretation should affect the control city as well. The ability to test for
history is the major strength of the control
group time series design. However, local
history (selection x history) still remains a possibility. One city may experience a unique set of
events that influences ridership, such as a plant
closing or certain roads being resurfaced. The
untreated control series allows tests of other
threats to internal validity that operate with the
single time series. There is no obvious reason in
the Tybout and Koppelman case, for example,
why the measurement instrument should have
differed between the treatment and control
groups. Nor, considering the pretest series, is
there any indication that each group experienced a different cyclical pattern of ridership.

Moreover, the possibility of differential regression can be assessed by noting whether there is
an immediate pretreatment shift in one series
but not the other.
Since threats to internal validity become
less problematic the more comparable groups
are, it might be thought that matching the two
series at the intervention time point will produce comparability and reduce threats. This is
not the case, however. Such matching can cause
regression effects. Consider a study of the effects of television on, say, library circulation in
two nonequivalent sets of communities, one of
which got television in 1953 and the other did
not. Communities can be selected for each set
by matching on per capita library circulation
in 1953 (i.e., by choosing communities from
each set that are "close" matches). When this is
done, the data might appear as in the hypothetical Figure 17, which should be contrasted
with the unmatched series in Figure 18. It can
be seen that the matching is successful and that
the television and nontelevision communities
have similar library circulation figures at the
point of intervention. But, given erroneous
measures of circulation, it can also be seen that
the two groups drift apart before and after 1953,
regressing to their respective means. Assuming communities that differed in circulation,
matching circulation in 1953 is bound to capitalize on error, because to achieve matches the
communities chosen for the television group
are likely to have had more error inflating than
deflating their scores, while the control communities are likely to have had more error
initially deflating their scores than inflating
them. A surprising result of the Tybout and
Koppelman study was that the increase in bus
ridership decayed so rapidly. This might indicate that bus ridership was inherently dissatisfying to the new riders, that it is difficult to
make permanent changes in any well-formed
behavior, or that the bus information campaign
was not maintained over time. When the authors added to their time series pre- and

posttreatment survey data they had also collected, the results suggested that the campaign
improved residents' knowledge of the bus
system and their attitudes toward it. This led
the authors to surmise that the decay in ridership was not due to the poor performance of the
bus system but rather to the absence of an
ongoing campaign to maintain a permanent
behavior change. Without the postintervention
series and the exploratory surveys, even this
tentative conclusion about causal mediation
would not have been possible.
Interrupted Time Series With Nonequivalent
Dependent Variable. History is the main
threat to internal validity in a single interrupted time series. As we have already seen,
its effects can sometimes be examined by
minimizing the time interval between measures or including a no-treatment control group
in the design. History can also be examined by
collecting time series data for nonequivalent
dependent variables, all of which should be
equally affected by history (and other plausible alternative causal forces), while only the dependent variable targeted by the treatment should have been influenced by the intervention. The design is diagrammed as:

OA1 OA2 OA3 OA4 OA5 X OA6 OA7 OA8 OA9 OA10
OB1 OB2 OB3 OB4 OB5 X OB6 OB7 OB8 OB9 OB10

An example of its use comes from a study


of the effectiveness of the British breathalyser crackdown (Ross, Campbell, & Glass,
1970). The breathalyser was used in an attempt
to curb drunken driving and reduce serious
traffic accidents. One feature of British drinking laws at the time was that pubs could only be
open during certain hours. Thus, if we are
prepared to assume that a considerable proportion of British drinking takes place in pubs
rather than at home, an effective breathalyser
would entail decreases in the number of serious traffic accidents during the hours pubs are

FIGURE 17
Hypothetical Library Circulation Data After Matching Just Before a Treatment Is Introduced
(Library circulation plotted by year, 1948 to 1966)

FIGURE 18
Hypothetical Library Circulation Data Prematching
(Library circulation plotted by year, 1948 to 1966)
FIGURE 19
British Casualties (Fatalities Plus Serious Injuries) Before and After the British Breathalyser Crackdown of October 1967, Seasonally Adjusted
(Monthly series, January 1966 through December 1970)
From "Determining the Social Effects of a Legal Reform: The British 'Breathalyser' Crackdown of 1967" by H. L. Ross, D. T. Campbell, and G. V. Glass, 1970, American Behavioral Scientist, 13(4), pp. 493-509. Copyright 1970 by Sage Publications. Reprinted by permission.
open, particularly during weekend nights, but


not during the morning and midafternoon hours
when pubs are closed. The importance of the
distinction between serious accidents when
pubs are open or closed derives from the fact
that most history alternative interpretations of
a decrease in serious accidents cannot account
for time of the day differences. This is the case
with weather changes, the introduction of safer
cars, a police crackdown on speeding, contemporaneous newspaper reports of high accident
rates or particularly gory accidents, and so
forth. It is obvious from a visual inspection of

Figure 19 that a steeper drop in the accident rate


occurs on weekend nights than during the
nondrinking hours in the week. Statistical
analysis corroborated the decrease in 1967 in
the weekend-nights time series (and also in the
all-hours and all-days series). It is very difficult
to fault either the internal or statistical conclusion validity of these data.
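The comparison logic of the nonequivalent dependent variable design can be sketched as follows. This is not the analysis reported by Ross, Campbell, and Glass; the column names are hypothetical.

```python
# Sketch of the nonequivalent-dependent-variable comparison (not the actual
# Ross, Campbell, and Glass analysis; column names are hypothetical): stack
# the weekend-night and commuting-hour casualty series and test whether the
# post-crackdown drop is larger in the series the breathalyser should affect.
import pandas as pd
import statsmodels.formula.api as smf

def crackdown_contrast(d: pd.DataFrame) -> float:
    """d holds one row per month per series, with columns month_index,
    casualties, series ('weekend_nights' or 'commuting_hours'), and post
    (1 for months after October 1967, else 0)."""
    d = d.copy()
    d["drinking_hours"] = (d["series"] == "weekend_nights").astype(int)
    m = smf.ols("casualties ~ drinking_hours * post + month_index",
                data=d).fit()
    return m.params["drinking_hours:post"]   # extra drop in the affected series
```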
As always, however, questions can be
raised about external validity. An obvious
question is: Would the same results be obtained
in other countries? Other questions relate to
possible unanticipated side effects of the

breathalyser. For example, how did it affect


accident insurance rates, sales of liquor, public confidence in the role of technological innovations for solving social problems, the
sale of technical gadgetry to the police, and the
way courts handled drunken driving cases?
Many such issues are examined in Ross (1973).
However, Ross's major concern is to probe the
construct validity of the cause. If we consider Figure 19 more carefully, it is apparent
that not all of the initial decrease in serious accidents during the weekend is maintained.
That is, the accident rate drops at first but
then continually rises toward the level in the
control time series, though the two trends still
do not meet by the last measure. Thus, the effect construct needs modifying to point out
that it is a temporary reduction in serious
accidents.
The fact that "something" either inflated
the effects of the breathalyser when it was
introduced or deflated its subsequent effects
suggests that the causal agent was not the
breathalyser alone. The breathalyser was introduced into Britain with considerable publicity that may have made the general public
especially mindful of not driving after they
had been drinking. Or it may have made the
police especially vigilant in controlling the speed
of traffic, especially during and immediately
after pub hours when they (like the researchers) would also expect more drunken drivers.
Ross (1973) also suggested that use of the
breathalyser may have reduced the overall
number of hours driven, may have cut down on
drinking, or may have led drunken drivers to
drive more carefully. He very ingeniously ruled
out some of these explanations. He took the
regular surveys by the British Road Research
Laboratory of miles driven, converted the
number of serious accidents and fatalities to the
number of accidents per mile driven, and
showed that the introduction of the breathalyser was still associated with a decrease in
accidents when the estimate of accidents per
mile driven was used. He also examined the

sale of beer and spirits before and after the


breathalyser and could find no evidence of a
discontinuity in sales when the instrument was
introduced. This ruled out the interpretation
that the breathalyser had reduced drinking.
He was also able to show that for ten months
after the breathalyser, more persons reported
walking home after drinking than had been the
case in the equivalent ten months preceding
the breathalyser. Finally, he also showed that
fewer of the postbreathalyser traffic fatalities
had high alcohol levels in their blood than had
the corpses of the prebreathalyser fatalities.
These last data indirectly supported the explanation that the causal construct was a reduction
in the number of heavy drinkers driving. As
with Tybout and Koppelman, the analysis of
other, not necessarily time series, data sources
can help in the causal explanation of demonstrated descriptive causal relationships.
However, the attempts by Ross to use data
to rule out alternative explanations of the
causal construct should alert us to the difficulty
and expense of doing this and to the large
number of irrelevancies typically associated
with the introduction of a new practice. Having
claimed that the breathalyser caused heavy
drinkers to drive less-which led to fewer
serious accidents-Ross was then faced with
the problem of explaining why the effects of
the breathalyser were not more permanent.
His analysis suggested to him that the British
courts increasingly failed to punish drinkers
detected by the breathalyser so that the instrument lost its deterrent power. Thus, Ross's final
inference from this study took the highly useful form: A breathalyser will help reduce serious traffic accidents if it leads to less driving
by drinkers and if the courts enforce the law.
Otherwise, its potential for reducing serious accidents will not be realized.
Interrupted Time Series With Multiple
Replications. In some controlled settings, it
is possible to introduce a treatment, remove
it, reintroduce it, remove it again, and so on,

FIGURE 20
The Percentage of Motorists Driving at or Over 58 km/hr During Each Session Across Experimental Conditions and the Percentage of Motorists Driving at or Above This Speed Posted for Each Week During the Follow-Up
(Sessions are grouped into baseline, posting, and no-posting phases, followed by a weekly follow-up series)
From "An Analysis of Public Posting in Reducing Speeding Behavior on an Urban Highway" by R. Van Houten, P. Nau, and Z. Marini, 1980, Journal of Applied Behavior Analysis, 13. Copyright 1980 by Journal of Applied Behavior Analysis. Adapted by permission.
according to a planned schedule. This design


can be diagrammed as:

O1 O2 X O3 O4 X̄ O5 O6 X O7 O8 X̄ O9 O10 X O11 O12 X̄ O13 O14

where X̄ indicates removal of the treatment.

A treatment effect would be suggested if the


dependent variable responded in similar fashion each time the treatment was introduced but
in a different fashion each time it was removed.
This is in many ways the ultimate in a Skinner
type experimental design.
Van Houten, Nau, and Marini (1980) assessed how a publicly posted sign showing the
percentage of drivers not speeding influenced

highway driving speed. To do this they used


an interrupted time series with multiple replications and assessed speed by means of a concealed radar unit. During the baseline phase, a
measure was constructed of the percentage of
persons driving over 58 kilometers per hour.
During the treatment or posting conditions, a
sign was erected providing feedback on the
percentage of drivers not exceeding this limit.
The sign was then covered and reintroduced at
three later times. Figure 20 details the driving
speed of those exceeding 58 kilometers per
hour and shows that it was reduced whenever
a posting period occurred but returned to normal when the sign was covered. The authors

concluded that the sign was effective in reducing speeding. (Actually, to enhance generality,
they also used three other speed postings and
all showed the same type of effect.)
A salient issue with this design concerns
the scheduling of treatments and their removal. This is probably best done randomly,
though in a way that preserves the alternation
of X and X̄, that is, in a block randomized
design. Such random assignment rules out the
threat of cyclical maturation mimicking the
obtained pattern of increases and decreases,
even in the absence of a treatment. Where such
cyclicity is ruled out by theory or a long preintervention series that permits direct observation of maturational patterns, randomization is
less important. But even when the introductions are haphazardly rather than randomly
scheduled, the design is very powerful for inferring causal effects and can easily be modified to compare different treatments, with X1
being substituted for the global X and X2 for X̄. It would even be possible to add an interaction factor, X1X2.
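One way of reading this scheduling recommendation, randomizing the length of each phase while keeping the strict alternation of treatment and removal, can be sketched as follows; the session counts are arbitrary.

```python
# Illustrative scheduling sketch: the order of phases strictly alternates
# between treatment (X) and removal, while the length of each phase is
# chosen at random so that onsets cannot line up with natural cycles.
import random

def alternating_schedule(n_cycles, phase_lengths=(2, 3, 4), seed=42):
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_cycles):
        schedule.extend(["X"] * rng.choice(phase_lengths))        # treatment on
        schedule.extend(["no-X"] * rng.choice(phase_lengths))     # treatment off
    return schedule

print(alternating_schedule(4))
```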
The major limitations of the basic design
are practical. First, it can be implemented only
where the initial effect of the treatment is expected to dissipate rapidly. Secondly, it requires a degree of experimental control that is
most likely to be achieved in the laboratory, in
institutional settings like schools or prisons, or
in settings where the treatment constitutes a
very minor deviation from the regular state of
affairs (as in the highway example above). Our
distinct impression is that the design has been
used most often in mental health, particularly
in research with a behavior modification bent
where the therapeutic situation allows investigators the very control over respondents that
the design requires.
Interrupted Time Series With Switching Replications. Imagine two nonequivalent
samples, each of which receives the treatment
at different times so that when one group
receives the treatment the other serves as a

control, and when the control group later receives the treatment the original treatment
group serves as the control. The design can be
diagrammed as:

O1 O2 O3 O4 X O5 O6 O7 O8 O9 O10
---------------------------------
O1 O2 O3 O4 O5 O6 O7 O8 X O9 O10

The power of the design derives from its


control for most threats to internal validity and
from its potential in extending external and
construct validity. External validity is enhanced
because an effect can be demonstrated with
two populations in at least two settings at different moments in history. Moreover, there are
likely to be different irrelevancies associated
with application of each treatment and, if
measures are unobtrusive, there need be no
fear of the treatment's interacting with testing.
Figure 21 gives data from a study which
used the replicated time series design (Parker,
Campbell, Cook, Katzman, & Butler-Paisley,
1971). The treatment was the introduction of
television into various Illinois communities,
and the hypothesis was that television would
cause a decrease in library circulation (Parker,
1963) because television would serve as a substitute for reading. Thus, the dependent variable was the annual per capita circulation of
library books. The unique feature of this particular quasi experiment is the sharp differentiation of the treatment groups. This occurred
because the Federal Communications Commission stopped issuing new licenses for television stations in 1951, splitting Illinois into
two groups: (a) an urban, wealthy group with
growing population that had television before
the freeze (the so-called early TV communities)
and (b) a rural, poor group with static population growth that received television only after
the freeze was lifted in 1953 (the so-called late
TV communities). It can be seen from Figure 21
that library circulation declined about 1948 for
the early television communities and during
1953 for the late television group. The Glass,

FIGURE 21
Per Capita Library Circulation in Two Sets of Illinois Communities as a Function of the Introduction of Television
(Annual per capita circulation, 1945 to 1960, for the early television communities [n = 55] and the late television communities [n = 55], with the approximate dates of television's introduction marked)
Tiao, and Maguire (1971) statistic confirmed


that each of these decreases was statistically
reliable. This corroborated the hypothesis, and
it is noteworthy that the design involved archival measures, distinctly different populations,
different historical moments for introducing
the treatment (in nonreactive fashion), different irrelevancies associated with how the treatment was introduced, and repeated measures
to ascertain if an initial effect can be generalized
over time.
But even the replicated time series design
can have problems of internal validity. Paperback books may have been introduced earlier
into the rich early TV communities than into
the poor late TV communities. If so, the declines in library circulation in 1948 and 1953
might have been due to historical differences in
the availability of paperbacks-an interaction

of history and selection. This alternative interpretation could be ruled out by collecting data
on the circulation of paperbacks in each set of
communities. This would be a laborious but
worthwhile operation for someone with a vested
interest in knowing that the introduction of
television caused a decrease in library circulation. A different strategy, following the example of Parker (1963), would be to split the
library circulation into fiction and nonfiction
books. Television, as a predominantly fictional
medium, would be expected to have a greater
effect on the circulation of fiction than of fact
books. Using the nonequivalent dependent
variables in this way would render the paperback explanation less plausible because we
would have to postulate that fiction books
were introduced into the different communities at different times but that nonfiction

books were not. This is not impossible-only


relatively implausible.
The switching replications design is also
useful because it can help detect effects that
have an unpredictable delay period. Assuming
an equal delay of effect in each group, we
would expect a discontinuity at an earlier date
in one series than in the other. We would expect
the period between the discontinuities to be
equal to the known period that elapsed between implementing the treatment with different groups. However, in some cases it will not
be possible to accept the assumption of equal
delay intervals, particularly (a) when the nonequivalent groups are highly dissimilar and/
or (b) when the gap between implementing a
treatment in one group and then another is
large. Many treatments may interact in complex ways with history and group characteristics to determine how long it takes for an effect
to manifest itself. Therefore, it is more realistic
to look for relative differences when an apparent treatment effect appears in each group than
it is to look for an exact match between the
difference in time when each group received
the treatment and the difference in time when
effects become manifest. Nonetheless, the delayed causal effects are most clear when the
time interval between the treatment and the
onset of the response is the same in both
groups.
Komaki, Barwick, and Scott (1978) also used
a switching replications time series in their
behavioral analysis of the safety practices of
two departments in a manufacturing plant.
Their independent variable was a behavioral
program that included an explanation of the
need for safety, a visual presentation of desired
safety measures, and the frequent, low-cost
reinforcement of safe practices. Two nonequivalent departments were used, one of which
received the treatment before the other. The
incidence of safe practices was monitored on
the job both before, during, and after implementation in each of the departments. The data
are presented in Figure 22. For each series, they

show an initial effect and a return to the baseline level after the treatment was removed.
The replication is important for, as Figure
22 shows, safety practices seem to have been
increasing slightly before the intervention in
the setting that experienced it first. Might part
of the effectiveness of the safety treatment be
attributed to this trend rather than to the intervention? The baseline data from the delayed
intervention group are crucial here because
they indicate that safety practices were not
increasing and may even have been decreasing. Yet the same initial effect is noted. Moreover, when the intervention is removed, the
safety behavior returns to baseline in both treatment groups, further demonstrating the effectiveness of the treatment so long as it is in place.
An obvious advantage of switching replications is that the threat of history is reduced.
The only relevant historical threat is one that
operates in different settings at different times,
or involves two different historical forces operating in different directions at different times.
Simple selection is also less of a threat since the
only viable selection threat requires different
kinds of attrition at different time points. Instrumentation is also less likely, though it will
be worthwhile exploring whether a spurious
effect occurred when the treatment was introduced (say, a history-related effect) while a
simple ceiling or floor effect caused an inflection in an upward or downward general trend
at the point when the treatment was removed.
Given these considerable advantages, a key
question is: How practical is the switching
replication design?

Conclusion
The warrant for causal conclusions is clear in
the case of the randomized experiment. Assuming competent implementation of a correct
randomization procedure, random assignment
creates a probabilistic equivalence between
groups at the pretest. Assuming further that

FIGURE 22
Percentage of Items Performed Safely by Employees in Two Departments of a Food Manufacturing Plant During a 25-Week Period of Time
(Separate panels for the Wrapping Department and the Make-Up Department, each showing baseline, intervention, and reversal phases across observation sessions)
From "A Behavioral Approach to Occupational Safety: Pinpointing Safe Performance in a Food Manufacturing Plant" by J. Komaki, K. D. Barwick, and L. R. Scott, 1978, Journal of Applied Psychology, 63. Copyright 1978 by the American Psychological Association. Adapted by permission.
there is no differential attrition from the experiment, any group differences at the posttest must be due to chance or to the treatment
contrast actually achieved. The likelihood of
chance can be reduced by having large samples, stratifying on correlates of the dependent
variable prior to assignment, and using statistics correctly. If chance is ruled out and a treatment-outcome relationship persists, the strong
likelihood is that the treatment contrast caused
the observed effect. However, random assignment does not ensure that this contrast is isomorphic with the construct the treatment was
supposed to represent. The assignment procedure merely permits one to conclude that something about the treatment contrast caused the

effect. It might have been the intended causal


agent, but it could also have been unsuspected
forces of substantive significance or such
nuisance factors as treatment crossovers,
Hawthorne effects, or any other of the threats
to construct validity listed earlier. Thus, the
randomized experiment creates a single and
elegant warrant for some causal influence
having emanated from the operationalized
treatment and having influenced the operationalized effect. This is its strength. But it does
not describe or explain this influence or test its
generality.
The causal warrant from quasi experiments
cannot, alas, be so simple or so elegant, for the
inferential linchpin of random assignment is

missing. A different warrant has to be constructed. This chapter has sought to construct
a warrant that depends on five shaky but
not totally unreasonable assumptions. The first
is that only falsification can be justified as a
means of certain knowledge (Popper, 1959).
From this follows the recommendation that
researchers try not just to corroborate a causal
hypothesis but also to falsify it by ruling out
threats to both statistical conclusion validity
(to deal with chance) and internal validity (to
deal with forces leading to the same pattern
of results as expected from the manipulated
treatment contrast). However, theorists of
quasi experimentation acknowledge that falsification can never be perfect in research practice since, as Kuhn (1962) has pointed out, it
depends on fully specified theories and perfectly valid measurement. Neither of these is
forthcoming in any science, let alone the social
sciences. Falsification is fallible, but not for that
matter useless, as we have tried to show in
conceptualizing threats to validity and then
attempting to rule them out.
The second assumption on which a warrant for causal inference from quasi experiments rests is that published lists of threats to
statistical conclusion and internal validity are
comprehensive. Without this assumption one
cannot be sure of having falsified all the relevant threats. The available lists are long and
reflect the criticisms most often raised in the
past by scholars when commenting on others'
work. But while this makes our lists long and
empirical, it does not make them infallible.
They are subject to correction as some current
threats are removed from the lexicon because
they do not operate often enough to be worried
about, and others are added as the community
of field researchers discovers them in the course
of their work.
The third assumption is that quasi experiments are more dependent on the notion of
plausibility than is the case with randomized
experiments. Judgments have to be made
about (a) which validity threats are plausible

enough to need ruling out and (b) how well a


particular quasi-experimental design or measurement detail has ruled out an identified threat.
With randomized experiments, the mechanical procedure of assignment handles these
matters. Though dependence on plausibility is
especially high in quasi experimentation, plausibility is a slippery concept. There is not always interrater agreement about what is plausible, and what one generation considers to be
plausible or implausible is judged differently
by later generations operating from their unique
knowledge and assumption bases. All social
research requires human judgment, but its
need seems to be particularly acute in quasi experimentation given its dependence on ruling
out those alternative interpretations judged
plausible, and its dependence on judgments
about how well a particular threat has been
ruled out.
The fourth assumption on which the warrant for causal inferences from quasi experiments rests is that structural elements of design
other than random assignment-pretests,
comparison groups, the way treatments are
scheduled across groups, as in a Latin Squares
design-provide the best way of ruling out
threats to internal validity. This is in many
ways the most important of all the assumptions
buttressing quasi experimentation. The major
alternative to control via design is control via
measurement and statistical adjustment.
Throughout this chapter, we mentioned the
need for measurement to help identify and rule
out threats where no other method of falsification exists. But the quality of existing substantive theory and measurement prevents us
from being even lukewarm fans of most statistical adjustment procedures, especially those
that purport to control for selection. We are
heartened to see this belief shared by statisticians (e.g., Holland, 1986; Rosenbaum, 1984;
Rosenbaum & Rubin, 1983; Rubin, 1986), some
econometricians (LaLonde, 1986; Moffitt, 1989),
and even by the developers of LISREL who
make clear that their causal modeling program

estimates causal parameters presumed to be


true as opposed to testing whether the relationships themselves are causal (Joreskog & Sorbom, 1990). As Light, Singer, and Willett (1990)
write, "You can't fix by analysis what you
bungled by design."
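
A minimal numerical sketch may make this concern about adjustment-based control of selection concrete. The simulation below is our own illustration, not an analysis from any study discussed in this chapter; the sample size, reliabilities, and variable names are arbitrary assumptions. Two groups self-select on a stable attribute that a pretest measures only fallibly, and the true treatment effect is zero. Covariance adjustment on the fallible pretest under-corrects for selection, echoing the point made by Lord (1960), whereas the same adjustment applied under random assignment yields an estimate near zero.

# Illustrative sketch only: covariance adjustment on a fallible pretest
# under-corrects for selection; random assignment needs no such correction.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Nonequivalent groups: units self-select on a stable true attribute.
ability = rng.normal(0, 1, n)
treated = (ability + rng.normal(0, 1, n)) > 0      # selection, not randomization
pretest = ability + rng.normal(0, 1, n)            # fallible measure of ability
posttest = ability + rng.normal(0, 1, n)           # true treatment effect is zero

def adjusted_effect(y, t, x):
    """OLS coefficient on the treatment dummy, adjusting for covariate x."""
    X = np.column_stack([np.ones_like(y), t.astype(float), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Quasi experiment: ANCOVA on the fallible pretest leaves residual selection bias.
print("self-selected, pretest-adjusted:", round(adjusted_effect(posttest, treated, pretest), 2))

# Same adjustment under random assignment: the bias disappears by design.
random_t = rng.random(n) > 0.5
print("randomized, pretest-adjusted:   ", round(adjusted_effect(posttest, random_t, pretest), 2))

Under these assumptions the first estimate is clearly positive even though no treatment operates, while the second hovers near zero, which is the sense in which design does work that adjustment cannot.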
The final assumption undergirding the
warrant for causal inference is that, in the absence of random assignment, conclusions are
more plausible if they are based on evidence
that corroborates numerous, complex, or numerically precise predictions drawn from a descriptive causal hypothesis. We tried to model this principle in the sections on cohort designs and interrupted time series designs in
particular, adding one design feature after
another to the basic design to show how more
and more threats were controlled and to show
that those remaining were less plausible because they depended on many forces operating
in complex interaction with each other. With
quasi experiments, there is no omnibus mechanism like random assignment that rules out
nearly all internal validity threats through a
simple procedural device. Instead, threats have
to be forced out and falsified within the limits of
the designs, theory, and measures that are available. Casting causal hypotheses so that they
detail specific sets of conditions under which a
given effect should and should not appear
promotes stronger causal conclusions.
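
As a toy illustration of what we mean by casting a hypothesis as a pattern of predictions, consider scoring observed estimates against a set of places where an effect should and should not appear. Everything in the sketch below (the outcome labels, predicted signs, tolerance band, and estimates) is invented for the example; nothing comes from a real study.

# Illustrative sketch only: count how many predicted patterns the data corroborate.
predictions = {
    "targeted outcome, after onset":       "+",   # effect expected
    "targeted outcome, before onset":      "0",   # no effect expected
    "untargeted outcome, after onset":     "0",   # no effect expected
    "targeted outcome, treatment removed": "-",   # effect should reverse
}
observed = {
    "targeted outcome, after onset":       2.1,
    "targeted outcome, before onset":      0.2,
    "untargeted outcome, after onset":     -0.1,
    "targeted outcome, treatment removed": -1.8,
}

def sign(estimate, tol=0.5):
    """Classify an estimate as +, -, or ~0 within a tolerance band."""
    return "+" if estimate > tol else "-" if estimate < -tol else "0"

matches = [p for p, expected in predictions.items() if sign(observed[p]) == expected]
print(f"{len(matches)} of {len(predictions)} predicted patterns corroborated")

The more numerous and specific the corroborated patterns, the harder it becomes to construct a validity threat that mimics the whole configuration rather than any single result.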
Frank acknowledgment of the assumption base on which quasi experimentation rests for answering molar descriptive causal questions should not engender disillusionment. Indeed, the many excellent examples presented earlier should nip in the bud
any tendency in that direction, as should the
growing recognition that causal inference is
strengthened by (a) increasing the number,
complexity, and specificity of the data-based
predictions derived from a causal hypothesis,
(b) designing research so that specific threats
are minimized (e.g., by matching on stable
attributes highly correlated with the outcome
to deal with selection), and (c) empirically estimating threats better (e.g., as when a long pretest time series describes maturational processes within a group and estimates the likely
selection-maturation differences between
groups).
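
Point (c) can be illustrated with a small sketch of how a long pretest series helps estimate likely selection-maturation differences: fit each group's maturational trend from its own pretest waves, project that trend to the posttest occasion, and compare the observed mean with the projection. The group means, trend values, and posttest figures below are hypothetical, and the linear trend is only one of several defensible growth models.

# Illustrative sketch only: each group is compared against its own projected trend.
import numpy as np

pre_waves = np.arange(8)                       # eight pretest observations
post_wave = 8                                  # first posttest observation

# Hypothetical group means over the pretest waves (treatment starts after wave 7).
treat_pre = 50 + 1.0 * pre_waves + np.array([0.3, -0.2, 0.1, 0.0, -0.1, 0.2, -0.3, 0.1])
comp_pre = 45 + 0.6 * pre_waves + np.array([-0.1, 0.2, 0.0, -0.2, 0.1, 0.0, 0.2, -0.1])
treat_post, comp_post = 62.0, 49.5             # observed posttest means (invented)

def projected(y, waves, target):
    """Extrapolate a linear maturational trend fitted to the pretest waves."""
    slope, intercept = np.polyfit(waves, y, 1)
    return slope * target + intercept

for label, pre, post in [("treatment", treat_pre, treat_post),
                         ("comparison", comp_pre, comp_post)]:
    gap = post - projected(pre, pre_waves, post_wave)
    print(f"{label}: observed minus projected posttest = {gap:.1f}")

In this invented case the comparison group tracks its own maturational trend while the treatment group departs from it, which is the kind of evidence that makes a selection-maturation explanation less plausible.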

References
Ball, S., & Bogatz, G. A. (1970). The first year of Sesame
Street: An evaluation. Princeton, NJ: Educational Testing Service.
Basadur, M., Graen, G. B., & Scandura, T.A. (1986).
Training effects on attitudes toward divergent
thinking among manufacturing engineers. Journal of Applied Psychology, 71, 612-617.
Bhaskar, R. (1975). A realist theory of science. Leeds,
England: Leeds.
Blackburn, H., Luepker, R., Kline, F. G., Bracht, N.,
Carlaw, R., Jacobs, D., Mittelmark, M., Stauffer,
L., & Taylor, H. L. (1984). The Minnesota Heart
Health Program: A research and demonstration project in cardiovascular disease prevention. In J. D. Matarazzo, S. Weiss, J. A. Herd, N. E. Miller, & S. M. Weiss (Eds.), Behavioral health. New York:
Wiley.
Bogatz, G. A., & Ball, S. (1971). The second year of
"Sesame Street": A continuing evaluation. Princeton, NJ: Educational Testing Service.
Booton, L. A., & Lane, J. I. (1985). Hospital market
structure and the return to nursing education.
The Journal of Human Resources, 20, 184-196.
Boruch, R. F., & Gomez, H. (1977). Sensitivity, bias,
and theory in impact evaluation. Professional Psychology, 8, 411-434.
Box, G. E. P., & Jenkins, G. M. (1976). Time-series
analysis: Forecasting and control. San Francisco:
Holden-Day.
Bracht, G. H., & Glass, G. V. (1968). The external
validity of experiments. American Educational Research Journal, 5, 437-474.


Brett, J. M. (1982). Job transfer and well-being. Journal of Applied Psychology, 67, 450-463.
Campbell, D. T. (1956). Leadership and its effects
upon the group. In Ohio studies in personnel.
Columbus: Ohio State University.
Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297-312.


Campbell, D. T. (1966). Pattern matching as an essential in distal knowing. In K. R. Hammond (Ed.), The psychology of Egon Brunswik. New York: Holt, Rinehart.
Campbell, D.T. (1969a). Definitional versus multiple
operationalism. et al., 2, 14-17.
Campbell, D. T. (1969b). Prospective: Artifact and
control. In R. Rosenthal & R. Rosnow (Eds.),
Artifact in behavioral research (pp. 351-382). New
York: Academic Press.
Campbell, D. T. (1969c). Reforms as experiments.
American Psychologist, 24, 409-429.
Campbell, D. T. (1986). Relabelling internal and
external validity for applied social scientists.
In W. M. K. Trochim (Ed.), Advances in quasi-experimental design and analysis. New Directions for Program Evaluation, 31, 66-77. San Francisco: Jossey-Bass.
Campbell, D. T., & Erlebacher, A. E. (1970). How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In J. Hellmuth (Ed.), Compensatory education: A national debate (Vol. 3, Disadvantaged child). New York: Brunner/Mazel.
Campbell, D. T., & Fiske, D. W. (1959). Convergent
and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56,
81-105.
Campbell, D. T., & Reichardt, C. S. (1990). Problems
in assuming the comparability of pretest and
posttest in autoregressive and growth models.
In R. E. Snow & D. E. Wiley (Eds.), Strategic
thinking: A volume in honor of Lee J. Cronbach. San
Francisco: Jossey-Bass.
Campbell, D. T., & Stanley, J. C. (1966). Experimental
and quasi-experimental designs for research. Chicago: Rand McNally.
Cohen, J. (1988). Statistical power analysis. Hillsdale,
NJ: Erlbaum.
Collingwood, R. G. (1940). An essay on metaphysics. Oxford, England: Clarendon Press.
Connell, D. B., Turner, R. R., & Mason, E. F. (1985).
Summary of findings of the school health education evaluation: Health promotion effectiveness,
implementation and costs. Journal of School Health,
55, 316-321.
Cook, D. (1967). The impact of the Hawthorne effect in
experimental designs in educational research (U.S.
Office of Education, No. 0726). Washington, DC:
U.S. Government Printing Office.
Cook, T. D. (1974). The potential and limitations of secondary research. In M. W. Apple, M. J. Subkoviak, & H. S. Lufler, Jr. (Eds.), Educational evaluation: Analysis and responsibility. Berkeley, CA: McCutchan.
Cook, T. D. (1990a). Clarifying the warrant for generalized causal inferences in quasi-experimentation. In M. W. McLaughlin & D. Phillips (Eds.),
Evaluation and education at quarter century. NSSE
Yearbook.
Cook, T. D. (1990b). The generalization of causal
connections: Multiple theories in search of clear
practice. In L. Sechrest, E. Perrin, & J. B. Bunker (Eds.), Research methodology: Strengthening causal interpretations of nonexperimental data (PHS Pub. No. 90-3454). Rockville, MD: Agency for Health Care Policy & Research.
Cook, T. D., Appleton, H., Conner, R., Schaffer,
A., Tamkin, G., & Weber, S. J. (1975). "Sesame
Street" revisited: A case study in evaluation research.
New York: Russell Sage Foundation.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Boston: Houghton Mifflin.
Cook, T. D., Gruder, C. L., Hennigan, K. M., & Flay,
B. R. (1979). The history of the sleeper effect:
Some logical pitfalls in accepting the null
hypothesis. Psychological Bulletin, 86, 662-679.
Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.
Cronbach, L. J. (1989). Construct validity after thirty
years. In R. L. Linn (Ed.), Intelligence: Measurement, theory and public policy. Urbana and Chicago: University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct
validity in psychological tests. Psychological
Bulletin, 52, 281-302.
Devine, E. C., & Cook, T. D. (1983). A meta-analytic
analysis of effects of psychoeducational intervention on length of postsurgical hospital stay.
Nursing Research, 32, 267-274.
Devine, E. C., O'Connor, F. W., Cook, T. D., Wenk, V.
A., & Curtin, T. R. (1988). Clinical and financial
effects of psychoeducational care provided by
staff nurses to adult surgical patients in the post-DRG environment. American Journal of Public
Health, 78, 1293-1297.
Diamond, S. S. (1974). Hawthorne effects: Another look.
Unpublished manuscript, University of Illinois
at Chicago.



Donaldson, L. (1975). Job enlargement: A multidimensional process. Human Relations, 28,
593-610.
Einhorn, H. J., & Hogarth, R. M. (1981). Behavioral
decision theory: Processes of judgment and
choice. Annual Review of Psychology, 32, 53-88.
Farmer, E. (1924). A comparison of different shift
systems in the glass trade (Tech. Rep. No. 24,
Medical Research Council, Industrial Fatigue
Research Board). London: Her Majesty's Stationery Office.
Farquhar, J. W., Fortmann, S. P., Flora, J. A., Taylor, C. B., Haskell, W. L., Williams, P. T., Maccoby, N., & Wood, P. D. (1990). The Stanford Five-City Project: Effects of community-wide education on cardiovascular disease risk factors. Journal of the American Medical Association, 264, 359-365.
Finkle, A. L. (1979). Flexitime in government. Public
Personnel Management, 8, 152-155.
Fisher, R. A. (1925). Statistical methods for research
workers. London: Oliver & Boyd.
Fisher, R. A. (1935). The design of experiments. London: Oliver & Boyd.
Gasking, D. (1955). Causation and recipes. Mind, 64,
479-487.
Glass, G. V. (1978). Regression effect. Memorandum,
March 8.
Glass, G., Tiao, G. C., & Maguire, T. O. (1971). Analysis of the data on the 1900 revision of German
divorce laws as a time-series quasi-experiment.
Law and Society Review, 4, 539-562.
Glenn, N. (1977). Cohort analysis. Beverly Hills, CA:
Sage.
Glymour, C. (1987). Discovering causal structure: Artificial intelligence, philosophy of science, and statistical modelling. Orlando, FL: Academic Press.
Greaney, V., Kellaghan, T., Takata, G., & Campbell,
D. T. (1979). Regression-discontinuities in the Irish
"Leaving Certificate." Unpublished manuscript.
Greene, C. N., & Podsakoff, P. M. (1978). Effects of
removal of a pay incentive: A field experiment. Proceedings of the Academy of Management, 38,
206-210.
Greenwald, A. G. (1975). Consequences of prejudice
against the null hypothesis. Psychological Bulletin, 82, 1-20.
Gunn, W. J., Iverson, D. C., & Katz, M. (1985). Design
of school health education evaluation. Journal of School Health, 55, 301-304.


Hackman, J. R., Pearce, J. L., & Wolfe, J. C. (1978). Effects of changes in job characteristics on work attitudes and behaviors: A naturally occurring quasi-experiment. Organizational Behavior and Human Performance, 21, 289-304.
Heckman, J. J., & Hotz, J. (1989a). Choosing among
alternative nonexperimental methods for estimating the impact of social programs: The case
of manpower training. Journal of the American
Statistical Association, 84, 862-874.
Heckman, J. J., & Hotz, J. (1989b). Rejoinder. Journal
of the American Statistical Association, 84, 878-880.
Holland, P. W. (1986). Statistics and causal inference.
Journal of the American Statistical Association, 81,
945-959.
Holland, P. W. (1989). Comment: It's very clear. Journal of the American Statistical Association, 84, 875-877.
Hutton, R. B., & McNeil, D. L. (1981). The value of
incentives in stimulating energy conservation. Journal of Consumer Research, 8, 291-298.
Jackson, B. O., & Mohr, L. B. (1986). Rent subsidies: An impact evaluation and an application of the random-comparison-group design. Evaluation Review, 10, 483-517.
Jencks, C., & Edin, K. (1990). The real welfare problem.
The American Prospect, 1, 31-50.
Joreskog, K. G., & Sorbom, D. (1990). Comment on Spirtes, Scheines and Glymour: Simulation studies of the reliability of computer aided model specification using the TETRAD II, EQS, and LISREL VI programs. Unpublished manuscript, Department of Statistics, University of Uppsala, Sweden.
Keller, R. T., & Holland, W. E. (1981). Job change: A
naturally occurring field experiment. Human
Relations, 34, 1053-1067.
Kerpelman, L. C., Fox, G., Nunes, D., Muse, D., &
Stoner, D. (1978). Evaluation of the effectiveness of outdoor power equipment information and education programs. Final Report of Abt Associates to U.S.
Consumer Products Safety Commission.
Kish, L. (1987). Statistical design for research. New
York: Wiley.
Komaki, J., Barwick, K. D., & Scott, L. R. (1978). A
behavioral approach to occupational safety:
Pinpointing and reinforcing safe performance in
a food manufacturing plant. Journal of Applied
Psychology, 63, 434-445.


Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Beverly Hills, CA: Sage.
Kruglanski, A. W., & Kroy, M. (1975). Outcome
validity in experimental research: A reconceptualization. Journal of Representative Research
in Social Psychology, 7, 168-178.
Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: University of Chicago Press.
LaLonde, R. J. (1986). Evaluating the econometric
evaluations of training programs with experimental data. American Economic Review, 76,
604-620.
Lana, R. E. (1969). Pretest sensitization. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral
research. New York: Academic Press.
Lavori, P. W., Louis, T. A., Bailar, J. C., & Polansky,
H. (1986). Designs for experiments: Parallel
comparisons of treatment. In J. C. Bailar & F.
Mosteller (Eds.), Medical uses of statistics.
Waltham, MA: New England Journal of Medicine.
Lawler, E. E., III, & Hackman, J. R. (1969). Impact of
employee participation in the development of
pay incentive plans: A field experiment. Journal
of Applied Psychology, 53, 467-471.
Levy, A. S., Mathews, O., Stephenson, M., Tenney, J.
E., & Schucker, R. E. (1985). The impact of a
nutrition information program on food
purchases. Journal of Public Policy and Marketing,
4, 1-13.
Lieberman, S. (1956). The effects of changes in roles
on the attitudes of role occupants. Human
Relations, 9, 385-402.
Light, R. J., Singer, J. D., & Willett, J. B. (1990). By design: Planning research on higher education. Cambridge, MA: Harvard University Press.
Linn, R. L. (1980). Discussion: Regression toward the
mean and the interval between test
administrations. New Directions for Testing and
Measurement, 8, 83-89.
Linn, R. L., Dunbar, S. B., Harnisch, D. L., & Hastings, C. N. (1982). The validity of the Title I evaluation
and reporting system. In E. R. House, S. Mathison,
J. A. Pearsol, & H. Preskill (Eds.), Evaluation
studies review annual. Beverly Hills, CA: Sage.
Lipsey, M. W. (1990). Design sensitivity: Statistical
power for experimental research. Newbury Park,
CA: Sage.

Lohr, B. W. (1972). An historical view of the research on the factors related to the utilization of health services. Duplicated research report, Bureau for Health
Services Research and Evaluation, Social and
Economic Analysis Division, Rockville, MD.
Lord, F. M. (1960). Large-sample covariance analysis
when the control variable is fallible. Journal of the
American Statistical Association, 55, 307-321.
Mason, W. M., & Fienberg, S. E. (1985). Cohort analysis in social research. New York: Springer-Verlag.
McNemar, Q. (1940). A critical examination of the
University of Iowa studies of environmental influences upon the I.Q. Psychological Bulletin, 37,
63-90.
Meehl, P. E. (1978). Theoretical risks and tabular
asterisks: Sir Karl, Sir Ronald and the slow progress of soft psychology. Journal of Consulting and
Clinical Psychology, 46, 806-834.
Minton, J. H. (1975). The impact of "Sesame Street"
on reading readiness of kindergarten children.
Sociology of Education, 48, 141-151.
Moffitt, R. (1989). Comment. Journal of the American
Statistical Association, 84, 877-878.
Murnane, R. J., Newstead, S., & Olsen, R. J. (1985).
Comparing public and private schools: The
puzzling role of selectivity bias. Journal of Business & Economic Statistics, 3, 23-35.
Narayanan, V. K., & Nath, R. (1982). A field test of
some attitudinal and behavioral consequences
of flexitime. Journal of Applied Psychology, 67, 214-218.
Parker, E. B. (1963). The effects of television on public
library circulation. Public Opinion Quarterly, 27,
578-589.
Parker, E. B., Campbell, D. T., Cook, T. D., Katzman,
N., & Butler-Paisley, M. (1971). Time-series analysis of effects of television on library circulation in
Illinois. Unpublished manuscript, Northwestern
University, Evanston, IL.
Plomin, R., & Daniels, D. (1987). Why are children from the same family so different from one another? Behavioral and Brain Sciences, 10, 1-60.
Popper, K. R. (1959). The logic of scientific discovery.
New York: Basic Books.
Pugh, D. S. (1966). Modern organizational theory: A
psychological and sociological study. Psychological Bulletin, 66, 233-251.
Ralston, D. A., Anthony, W. P., & Gustafson, D. J. (1985). Employees may love flextime, but what does it do to the organization's productivity? Journal of Applied Psychology, 70, 272-279.


Reichardt, C. S. (1985). Reinterpreting Seaver's (1973)
study of teacher expectancies as a regression
artifact. Journal of Educational Psychology, 77,
231-236.
Riecken, H. W., Boruch, R. F., Campbell, D. T., Caplan, N., Glennan, T. K., Pratt, J., Rees, A., & Williams,
W. (1974). Social experimentation: A method for
planning and evaluating social innovations. New
York: Academic Press.
Roberts, S. J. (1980). Differences in NCE gain estimates resulting from the use of local norms versus national norms. Mountain View, CA: RMC Corporation.
Robertson, T. S., & Rossiter, J. R. (1976). Short-run
advertising effects on children: A field study.
Journal of Marketing Research, 8, 68-70.
Roethlisberger, F. J., & Dickson, W. J. (1939). Management and the worker. Cambridge, MA: Harvard University Press.
Rosch, E. H. (1978). Principles of categorization. In E.
Rosch & B. B. Lloyd (Eds.), Cognition and categorization. Hillsdale, NJ: Erlbaum.
Rosenbaum, P. R. (1984). From association to causation in observational studies: The role of tests of strongly ignorable treatment assignment. Journal of the American Statistical Association, 79,
41-48.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70,
41-55.
Rosenthal, R. (1966). Experimenter effects in behavioral
research. New York: Appleton-Century-Crofts.
Enlarged ed. published in 1976 by Irvington
Publishers in New York.
Ross, H. L. (1973). Law, science and accidents: The
British Road Safety Act of 1967. Journal of Legal
Studies, 2, 1-75.
Ross, H. L., Campbell, D. T., & Glass, G. V. (1970).
Determining the social effects of a legal reform:
The British "Breathalyser" crackdown of 1967.
American Behavioral Scientist, 13, 493-509.
Rossi, P. H., & Lyall, K. (1976). Reforming public welfare: A critique of the negative income tax experiment. New York: Russell Sage Foundation.
Rubin, D. B. (1986). Which ifs have causal answers? Journal of the American Statistical Association, 81, 961-962.
Ryan, T. A. (1959). Multiple comparisons in psychological research. Psychological Bulletin, 56, 26-47.


Saretsky, G. (1972). The OEO P.C. experiment and
the John Henry Effect. Phi Delta Kappan, 153,
589-591.
Scriven, M. (1984). Maximizing the power of causal
investigations: The modus operandi method.
In R. F. Conner, D. G. Altman, & C. Jackson
(Eds.), Evaluation studies review annual (Vol. 9).
Beverly Hills, CA: Sage.
Seaver, W. B. (1973). Effects of naturally induced
teacher expectancies. Journal of Personality and
Social Psychology, 28, 333-342.
Seaver, W. B., & Quarton, R. J. (1976). Regression-discontinuity analysis of dean's list effects. Journal of Educational Psychology, 68, 459-465.
Sechrest, L., West, S. G., Phillips, M.A., Redner, R.,
& Yeaton, W. (1979). Some neglected problems
in evaluation research: Strength and integrity of
treatments. In L. Sechrest, S. G. West, M. A.
Phillips, R. Redner, & W. Yeaton (Eds.), Evaluation studies review annual (Vol. 4). Beverly Hills,
CA: Sage.
Sechrest, L., & Hannah, M. (1990). The critical importance of nonexperimental data. In L. Sechrest, E. Perrin, & J. Bunker (Eds.), Research methodology: Strengthening causal interpretation (PHS Pub. No. 90-3454). Rockville, MD: Agency
for Health Care Policy and Research.
Shadish, W. R., Cook, T. D., & Leviton, L. C. (in
press). Foundations of program evaluation. Newbury Park, CA: Sage.
Tallmadge, G. K. (1982). An empirical assessment
of norm-referenced evaluation methodology.
Journal of Educational Measurement, 19, 97-112.
Trend, M. G. (1979). On the reconciliation of qualitative and quantitative analyses: A case study. In T. D. Cook & C. S. Reichardt (Eds.), Qualitative and quantitative methods in evaluation research. Beverly Hills, CA: Sage.
Trochim, W. M. K. (1984). Research design for program evaluation: The regression-discontinuity approach. Beverly Hills, CA: Sage.
Trulson, M. E. (1986). Martial arts training: A novel "cure" for juvenile delinquency. Human Relations, 39, 1131-1140.
Tybout, A. M., & Koppelman, F. S. (1980). Consumer-oriented transportation service: Modification and evaluation. Report to the U.S. Department of Transportation, Northwestern University Transportation Center.


Van Houten, R., Nau, P., & Marini, Z. (1980). An analysis of public posting in reducing speeding behavior on an urban highway. Journal of Applied Behavior Analysis, 13, 383-395.
Viscusi, W. K. (1985). Cotton dust regulation: An
OSHA success story? Journal of Policy Analysis
and Management, 4, 325-343.
Weber, S. J., & Cook, T. D. (1972). Subject effects in
laboratory research: An examination of subject
roles, demand characteristics and valid inference.
Psychological Bulletin, 77, 273-295.
Weber, S. J., Cook, T. D., & Campbell, D. T. (1971). The effects of school integration on the academic self-concept of public-school children. Paper presented
at the meeting of the Midwestern Psychological
Association, Detroit.

Wilder, C. S. (1972). Physician visits, volume, and interval since last visit, U.S. 1969 (Series 10, No. 75; DHEW Pub. No. [HSM] 72-1064). Rockville, MD:
National Center for Health Statistics.
Wilson, V. L., & Putnam, R. R. (1982). A meta-analysis of pretest sensitization effects in experimental design. American Educational Research Journal, 19, 249-258.
Wortman, P.M., Reichardt, C. S., & St. Pierre, R. G.
(1978). The first year of the education voucher
demonstration. Evaluation Quarterly, 2, 193-214.
Yin, R. K. (1984). Case study research. Beverly Hills,
CA: Sage.
Zajonc, R. B., & Markus, H. (1975). Birth order and
intellectual development. Psychological Review,
82, 74-88.
