CHAPTER 9
Quasi Experimentation
Thomas D. Cook
Northwestern University
Donald T. Campbell
Lehigh University
Laura Peracchio
University of Wisconsin, Milwaukee
This chapter has two purposes. The first is to explicate four kinds of validity.
Statistical conclusion validity refers to the validity of conclusions about
whether the observed covariation between variables is due to chance. Internal
validity is concerned with whether the covariation implies cause. Construct
validity refers to the validihJ with which cause and effect operations are labeled
in theory-relevant or generalizable terms. External validity refers to the validity
types, (b) the presentation of new quasi-experimental designs, (c) the emphasis
placed on the conceptualization of causal generalization, (d) the advocacy of
multiple probes of a causal hypothesis, and (e) the importance assigned to
knowledge that transcends the data collected for a particular research study.
Unfortunately, space limitations preclude much discussion of nonexperiments,
randomized experiments, and methods for promoting causal generalization
through meta-analysis or explanatory model testing. These topics are discussed
elsewhere (Cook, 1990a, 1990b; Cook & Campbell, 1979).
The Theory of
Quasi Experimentation
Introduction
Theories of Causation
Experiments are vehicles for testing a particular type of causal hypothesis. For most laypersons, causation probably implies manipulating
one thing and observing whether a later change
occurs in a phenomenon that is plausibly linked
to the change agent. Intrinsic to this conception
is the notion of deliberate manipulation, and
some philosophers of science have called
this the manipulability or activity theory of
is covariation when there is not. Since covariation is a requirement for cause and statistical
tests are usually used to make such judgments, it seems wise to explicate the major
factors that can lead to false conclusions about
covariation. We call these threats to statistical
conclusion validity.
We must confess to trepidation about this
neologism since decisions about covariation
strike us as less important than decisions about
causal magnitude. There are two reasons for
this. First, all social research (except, perhaps,
consulting) has many different stakeholders
with their own standards of risk. Some prefer
little risk and are willing to ignore some genuine patterns of covariation in order to protect
themselves against concluding there is covariation when there is not. Other stakeholders are
less risk-averse and want to ensure that they do
not miss true relationships. Thus, whatever the
level of risk researchers adopt (or slip into) in
their statistical reasoning, it is bound to meet
the needs of some stakeholders more than
others. This would not be so much the case if
conclusions were drawn about the size of an
effect. Potential users of the information could
then decide for themselves how important a
given effect was. Second, statistical conclusion
validity has to do with "statistical significance"
which can be a highly misleading construct.
Many inexperienced researchers and laypersons fail to realize that even the most trivial
relationship can be statistically significant if a
test has enough statistical power. Since statistical significance, theoretical significance, and
policy significance are not synonyms, we henceforth refer to relationships as being "reliable"
rather than "statistically significant."
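The point that even a trivial relationship becomes "statistically significant" with enough power can be made concrete. Below is a minimal sketch that computes the t statistic for a sample correlation from first principles; the correlation of .02 and the sample sizes are purely illustrative:

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing H0: rho = 0, given a sample correlation r and n pairs."""
    return r * math.sqrt((n - 2) / (1 - r * r))

r = 0.02  # a trivially small correlation
for n in [100, 10_000, 1_000_000]:
    t = t_for_correlation(r, n)
    # With large n the t distribution is essentially normal, so |t| > 1.96
    # corresponds to a two-tailed p < .05.
    verdict = "reliable at .05" if abs(t) > 1.96 else "not reliable"
    print(f"n = {n:>9}: t = {t:6.2f} -> {verdict}")
```

The same r = .02 is "not reliable" at n = 100 but comfortably "reliable" at n = 1,000,000, even though its practical and theoretical significance is unchanged.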
The third necessary condition for causal
inference is that there are no plausible alternative explanations of B other than A. This is the
most difficult condition to meet. For instance, if
a new machine were under evaluation and was
associated with an increase in productivity, the
increase might be due not to the machine, but to
a seasonal increase that occurs every year at the
arose because most theory-centered scholars
attribute little value to causal statements where
the causal agent and its effect cannot be convincingly described in a general way (Kruglanski & Kroy, 1975). It may also have arisen because the same (fallible) deductive methodology is used in ruling out alternative interpretations to both internal and construct validity.
But while internal validity involves ruling out
alternative interpretations of the presumed
causal relationship between A-as-manipulated
and B-as-measured (Campbell, 1986), construct
validity involves ruling out alternative interpretations of the entities claimed as A and B. The
rationale for experiments is to probe whether
variables are causally related, and so the alternatives that necessarily have to be ruled out
before inferring cause are alternative interpretations of the relationship between A and B (i.e.,
threats to internal validity). Nonetheless, eliminating all alternative hypotheses about the
constructs involved in a descriptive causal relationship aids in causal explanation and the
understanding and control that such explanation fosters.
At least one further step is useful. To infer a
causal relationship at one moment in time in a
single research setting and with one sample of
respondents would give us little confidence
that a demonstrated causal relationship is robust. External validity concerns the generalizability of findings across times, settings, and
persons. It takes two overlapping but nonetheless distinct forms. The first has to do with
generalizing to the times, places, and persons
specified in the claims researchers make about
the generalizability of the causal findings they
have shown. Usually, these claims touch on the
populations of persons, settings, and times
specified in the original research question, but
that is not inevitable. The core component is
using a sample to generalize to a population
that the sample is thought to represent. The
second form external validity takes has to do
with extrapolating beyond the instances captured in the sampling plan so as to draw
FIGURE 1
Validity Distinctions Translated Into Cronbach's Research Design Language
[Diagram not legible in this scan; visible labels include Cronbach's U, T, O, and S symbols.]
Random Irrelevancies in the Setting. Some features of an experimental setting other than the
treatment will undoubtedly affect scores on the
dependent variable, thereby inflating error
variance. No setting is quite like another, and
any one setting is not likely to stay constant
from one time period to another. This threat
can most obviously be controlled by choosing
settings free of extraneous sources of variation,
which is the logic behind the sealed-off laboratory setting. Alternatively, experimental procedures can be selected that focus respondents'
attention on the treatment and thereby lower
the saliency of environmental variables. Finally, it is possible to measure some of the
major sources of extraneous setting variance
and "somehow" use them in the statistical
analysis.
Random Heterogeneity of Respondents. The respondents in an experiment usually differ from
each other within treatment groups in ways
related to the dependent variables. (This is different from some types of respondents being
more affected by a treatment than others, which
we shall soon see is a matter of external validity.) The more respondents differ from each
other within groups, the less will be the ability
to reject the null hypothesis when between-subject error terms are used. This threat can
obviously be controlled by (a) selecting homogeneous respondent populations (at some cost
in external validity), (b) blocking on respondent characteristics most highly correlated with
the dependent variable, or (c) choosing within-subject error terms as in pretest-posttest designs. In designs with both pretest and posttest
measures, the extent to which within-subject
error terms reduce the error terms depends on
the correlation between scores over time: The
higher it is, the greater the reduction in error.
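The dependence of the within-subject error term on the pretest-posttest correlation follows directly from the variance of a difference score: with equal variances sigma-squared at both waves, Var(post − pre) = 2·sigma²·(1 − rho). A minimal sketch of this relationship (values illustrative):

```python
# Variance of a gain (posttest - pretest) score as a function of the
# pretest-posttest correlation rho, assuming equal variances sigma2 at
# both waves:  Var(post - pre) = 2 * sigma2 * (1 - rho).
# The higher rho, the smaller the within-subject error term.

def gain_score_variance(sigma2, rho):
    return 2 * sigma2 * (1 - rho)

for rho in [0.0, 0.5, 0.9]:
    print(f"rho = {rho}: error variance = {gain_score_variance(1.0, rho):.2f}")
```

At rho = 0 the gain score is noisier than either wave alone (variance 2.0); at rho = .9 the error variance shrinks to 0.2, illustrating why highly stable measures make within-subject designs so efficient.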
Internal Validity
List of Threats. Threats to internal validity
compromise inferences about whether the relationship observed between two variables would
have occurred even without the treatment under
analysis. We distinguish the following threats.
Ways of dealing with them other than through
random assignment will be dealt with in even
greater detail later as we discuss individual
quasi-experimental designs.
History is a threat when the relationship between the presumed cause and effect might be
due to some event that took place between a
pretest and posttest and that is not part of the
treatment under analysis.
Maturation is a threat when a presumed
causal relationship might be due to respondents growing older, wiser, stronger, and so on
between the pretest and posttest, assuming
that such maturation is not the treatment of
interest.
Testing is a threat when a relationship might
be due to the consequences of taking a test
different numbers of times, for example, the
first test making respondents think of, or look
up, the answers.
Instrumentation is a threat when a relationship might be due to the measuring instrument
changing between the pretest and posttest.
The measuring instrument can be persons recording observations or a physical instrument
whose properties vary at different times,
perhaps because of such scaling artifacts as
from those captured by the sampling particulars to date. But explaining causal mediating
processes is not what we mean by the construct
validity of the cause or effect. The latter concerns alternative interpretations of manipulations and measures, not alternative specifications of intervening causal processes.
High construct validity obviously requires
a rigorous and comprehensive description of
the theoretical cause and effect. Without this,
neither the manipulations nor measures can be
tailored to the constructs they are meant to
represent. But single exemplars of constructs
never give a perfect fit. Each instance contains
unique irrelevancies and usually also fails to
include some theoretically important components. From this follows the recommendation
that researchers should deliberately use multiple exemplars of each construct that have been
selected both to share common variance attributable to the target construct and to differ from
each other in irrelevant components (Campbell
& Fiske, 1959). Such "multiple operationalism"
(Campbell, 1969a) allows tests of whether a
particular cause-effect relationship is robust
across the somewhat different set of relevant
and irrelevant components built into the research operations available. While irrelevancies should be randomly distributed across the
available operations, it is in practice impossible
to assume this, and the weaker assumption
undergirding Campbell's theory of measurement is that the irrelevancies are distributed
heterogeneously across the instances sampled.
That is, there are at least some instances where
each irrelevancy is and is not present. The
worst situation is, of course, where a construct
and an irrelevancy are totally confounded.
Two other points are worth making about
assessing the construct validity of causes and
effects. First, independent variables should
demonstrably vary what they are meant to
vary, and independent measures of this are
useful. Outcome measures should also measure what they are meant to assess, and independent evidence of this is also desirable. The
theorists. Most social treatments are best described as complex packages, especially in
applied research. This complexity makes it
difficult even to reproduce the package, let
alone replicate any effects in the (usual) absence of full information about effective treatment components. High construct validity of
the cause helps promote reproducibility. It also
promotes efficiency because it provides clues
about those components of the cause that do
not contribute to the effect and so can be
dropped. For instance, "Sesame Street" was
evaluated in an experiment where researchers
regularly visited children in their homes and
encouraged them to view the show, leaving
behind toys, books, and balloons advertising
the series (Ball & Bogatz, 1970; Bogatz & Ball,
1971). Such face-to-face encouragement cost
between $100 and $200 per child per viewing
season. Viewing without encouragement costs
just $1 to $2 per child per season and is much
more commonplace (Cook, Appleton, Conner,
Schaffer, Tamkin, & Weber, 1975). Would it not
be useful to know whether viewing "Sesame
Street" without encouragement has similar
effects to viewing it with encouragement,
for to drop the encouragement component
would be highly desirable financially, if not
pedagogically?
Unfortunately, individual experiments afford poor prospects for achieving high construct validity of the cause. This is primarily
because implementing multiple operationalizations of an independent variable is costly,
and multiple operationalization is a better
means of enhancing construct validity than
carefully tailoring a single operation to a referent construct. Fortunately, the prospects are
brighter for the high construct validity of outcomes because investigators typically have
much greater latitude for multiple measurement than multiple manipulation.
List of Threats. Below is a list of threats to
construct validity. We want to draw special
Inadequate Preoperational Explication of Constructs. The choice of operations should depend on the result of a conceptual analysis of
the prototypical components of a construct
(Rosch, 1978), perhaps through consulting dictionaries (social science or otherwise), relevant
substantive theory, and the past literature on a
topic. Doing this, one would find, for example,
that "attitude" is usually defined as a stable
predisposition to respond and that this stability is understood either as consistency across
time or response modes (affective, cognitive,
and behavioral). Such an analysis immediately
indicates that the standard practice of measuring preferences or beliefs at a single session
and labeling this "attitude" is not adequate.
To give another example, most definitions of
aggression include both the intent to harm and
the consequence of harm. This is to distinguish between (a) the black eye one child gives
If the
treatments involve widely diffused informational programs-as with many mass media
programs-or if the experimental and control
groups can communicate with each other, then
the possibility of treatment crossovers exists
(Rubin, 1986). In the extreme case where no-treatment controls receive a treatment in similar dosage to the planned treatment group, the
experiment becomes invalid. There is no longer
a functional difference between the treatment
and control groups. In many quasi experiments,
the need to minimize nonequivalence makes it
desirable to sample units that are as similar as
possible in all respects other than the treatment. Physically adjacent units facilitate this.
But their very propinquity also enhances the
chances of treatment differences becoming
obscured. For example, if one of the New England states were used as a control group to
study the effects of changes in the New York
abortion law, any true effects of the new law
would be obscured if New Englanders went
freely to New York for abortions.
Resentful Demoralization of Respondents Receiving Less Desirable Treatments. When an experiment is obtrusive, the reaction of a no-treatment control group can be associated with
resentment and demoralization as well as with
compensatory rivalry. This is because controls
are often relatively deprived when compared
to experimentals. In an industrial setting, the
controls might retaliate by lowering productivity and company profits. This situation is likely
to lead to a posttest difference between treatment and no-treatment groups and it might be
quite wrong to attribute the difference to the
treatment. It would be more apt to label the no-treatment condition as the resentment treatment. (Of course,
this phenomenon is not restricted to control
groups. It can occur whenever treatments differ in desirability and respondents are aware of
the difference.)
Hypothesis-guessing Within Experimental
Conditions. Reactive research can also result
in uninterpretable treatment effects when persons in one treatment group guess how they are
supposed to behave and respond accordingly.
(In many situations it is not difficult to guess
what is desired, especially in education where
academic achievement is paramount or in industrial organization where productivity and
satisfaction are.) The problem can best be
avoided by making hypotheses hard to guess
(if there are any), by decreasing the general
level of reactivity in the experiment, or by deliberately giving different hypotheses to different respondents. But these solutions are at best
partial. Respondents are not passive, and they
will sometimes generate their own treatmentrelated hypotheses whatever experimenters do
to dampen such behavior. However, having a
signs rather than the reproduction of comparable effect sizes of the same sign.
Individual studies obviously provide less
scope than literature reviews for probing the
robustness of replication across a heterogeneous assortment of external validity threats.
Nonetheless, the model has implications for
single studies. It suggests sampling broadly,
making sure that persons with a wide variety of
different but conceptually relevant backgrounds are included in the sampling plan in
sufficient numbers for responsible analysis. The
model also suggests making sure that there is
heterogeneity in the settings and times studied,
even though cost considerations usually constrain intervention studies to a small number of
settings and times.
Since resources for sampling settings are
limited in many individual studies, it is useful
to consider a subcase of the heterogeneous
sampling model. This involves using theory or
other forms of experience to sample types that are
maximally different from each other and are therefore particularly likely to condition the treatment impact. Thus, if a new personnel testing
system were to be initiated in Army recruiting
stations, researchers might do well to conduct
field tests not only in some recruitment centers
close to the mode but also in others that have
shown themselves to be run particularly well
and particularly poorly. Any treatment that is
effective only in the best run centers has a
demonstrated potential for success, but there is
a legitimate question about its generality and
about whether its implementation can be improved in other recruiting centers to get the
same results. Of course, if the results are contingent on factors only found in the best organized
settings, then the restricted external validity is
a problem indeed. But if the effect is found in
both the modal and best centers, this helps give
an impressionistic fix on the generality of the
new practice. It is effective from the top to the
middle of the distribution of presumed organizational effectiveness, with the mode representing where most recruiting centers are likely
Quasi-experimental Designs
This section is devoted to an exposition of
some quasi-experimental designs evaluated primarily with respect to their ability to rule out
internal validity threats. In outlining them, we
shall use a notational system in which X stands
for a treatment; O stands for an observation;
absenteeism and lateness. Imagine that the pretest and posttest are separated by a year and
that the mean level of experience (years worked
in an organization) increases between the pretest and posttest. If the pool of pretest scores
were sufficiently large, one could regress absenteeism and lateness onto years worked. If
the resulting relationship were zero, then the
maturation hypothesis would be rendered
implausible. If there were a relationship, one
could then use the regression equation to predict what scores would have been at the posttest
in the absence of the treatment. The obtained
performance could then be compared with the
predicted level. Care must be taken with this
estimation procedure because the mean expected posttest performance is obtained from a
pool of pretest scores, but the posttest data with
which this prediction is compared come from a
second measurement wave. This testing problem is less plausible, of course, if measurement
is unobtrusive, perhaps because it comes from
regularly collected administrative records.
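The estimation procedure just described can be sketched in a few lines. All data below are hypothetical: one regresses pretest absenteeism on years worked within the pretest pool, then projects the expected posttest mean given the additional year of experience:

```python
# Sketch of the maturation check described in the text (hypothetical data).
# Step 1: regress pretest absenteeism on years worked, using the pretest pool.
# Step 2: if the slope is non-zero, predict the expected posttest mean given
#         the extra year of experience, and compare it with the observed mean.

def ols(xs, ys):
    """Least-squares slope and intercept for a simple regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

years_worked = [1, 2, 3, 4, 5, 6, 7, 8]      # hypothetical pretest pool
absenteeism  = [12, 11, 11, 10, 9, 9, 8, 8]  # days absent per year

slope, intercept = ols(years_worked, absenteeism)
mean_experience_at_posttest = sum(years_worked) / len(years_worked) + 1  # one year later
expected_posttest = intercept + slope * mean_experience_at_posttest
print(f"slope = {slope:.2f}; expected posttest mean = {expected_posttest:.2f} days absent")
# An observed posttest mean well below expected_posttest would suggest a
# treatment effect over and above the maturational trend.
```

As the text cautions, this comparison mixes a prediction from one measurement wave with data from a second wave, so testing artifacts must still be considered unless measurement is unobtrusive.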
The internal validity threats mentioned
above do not operate in every single situation
where the one group pretest-posttest design is
used. For instance, Robertson and Rossiter
(1976) used the design to study the impact of
television advertising for children's toys and
games during the Christmas season. Children
were asked to nominate their five most strongly
preferred choices for a Christmas present five
weeks before Christmas and then again four
weeks later. All brand-name items reported by
the children were traced to network television
logs for November and December. The researchers hypothesized that any change in requests
for toys and games from the first to the second
wave of data collection would be evidence of
the influence of television advertising. Thus,
when a five percent increase in the nomination
of advertised toys and games was found, it was
interpreted as an advertising effect. However,
the children's preference change might be
due to history, other events that occurred between the pretest and posttest. The increase in
advertising for toy and game products at Christmas is not limited to television; it also occurs
with radio, print, and store circular advertising. Given the short period between the two
testing waves, we do not believe maturation to
be a plausible threat, nor is regression a problem, since it is difficult to imagine that the
children's preferences were initially biased systematically against advertised toys. Threats to
validity are only potential, and those that are
usually associated with a particular design do
not operate in all concrete settings where the
design is used.
Unfortunately, it is rare to find specific research projects where the threats of history,
maturation, and regression are implausible and
where the one group pretest-posttest design is
causally interpretable. To rule out history, either the respondents have to be physically isolated from all other forces that might affect the
experimental outcomes, or the outcomes have
to have no external forces acting upon them, or
the test-retest interval has to be very short. To
rule out statistical regression, either a series of
pretest observations has to show that the
immediate pretreatment observation did not
deviate from the pretest trend or an argument
has to be made based on the high reliability of
measures. To rule out maturation, we either
have to show that the posttreatment observations deviated from whatever pretreatment
maturational trend was observed or we have to
analyze pretest scores by the maturational
variable and show no trend. Since the one group
pretest-posttest design has only a single pretest observation by definition, we know nothing definitive about the pretest trend or the typicality of
the immediate pretreatment observation. We
are then thrown back on less direct data probes
and other plausibility arguments to rule out the
threats in question.
Posttest-Only Design With Nonequivalent
Groups. Sometimes a treatment is implemented before the researcher can prepare for it,
and the evaluation procedure must be designed
lations between the true and unmeasured pretest and the posttest, perhaps spuriously leading to the conclusion that the difference in
wages between baccalaureate and nonbaccalaureate nurses is due to the place of employment rather than the more productive nurses
tending to leave hospitals and seek employment in other settings. We are skeptical about
proxy-based adjustment procedures, even
though they are widely advertised in econometrics (e.g., Heckman & Hotz, 1989a, 1989b).
Empirical work on these procedures has not
been particularly promising (e.g., LaLonde,
1986; Murnane, Newstead, & Olsen, 1985), and
many mathematical statisticians pronounce
themselves skeptical (e.g., Holland, 1989).
There are some contexts, though, where
even without a pretest, substantive theory is
good enough to generate a highly differentiated
causal hypothesis that, if corroborated, would
rule out most internal validity threats because
they are not capable of generating the same
pattern of empirical implications. The intuition
is that, all other things being equal, the more
specific or complex the form predicted for the
test data, the fewer the viable alternatives. The
importance of this can be illustrated by Seaver's
(1973) archival quasi experiment on the effects
of teacher performance expectancies on students' academic achievement. Seaver used
school records to locate a group of children
whose older siblings had obtained high or low
grades and achievement scores in school. He
then divided these two groups into those who
had had the same teacher as their older sibling
and those who had had a different teacher. This
resulted in a 2 x 2 design (same or different
teacher crossed with high or low performing
sibling). Seaver predicted that teacher expectancies should cause children with high-performing siblings to outperform children with
low-performing siblings by a greater amount if
they had the same teachers rather than different ones. The data corroborated this predicted
statistical interaction on several subsets of the
Stanford Achievement Test.
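Seaver's prediction amounts to a 2 x 2 interaction contrast: the high-versus-low sibling gap should be larger under same teachers than under different teachers. A minimal sketch with hypothetical cell means (the numbers are for illustration only, not Seaver's data):

```python
# Interaction contrast implied by Seaver's expectancy prediction.
# Cell means are hypothetical, chosen only to show the computation.
cell_means = {
    ("same", "high"): 55.0,  # same teacher, high-performing older sibling
    ("same", "low"):  45.0,
    ("diff", "high"): 52.0,
    ("diff", "low"):  49.0,
}

expectancy_gap_same = cell_means[("same", "high")] - cell_means[("same", "low")]
expectancy_gap_diff = cell_means[("diff", "high")] - cell_means[("diff", "low")]
interaction = expectancy_gap_same - expectancy_gap_diff
print(f"gap (same teacher) = {expectancy_gap_same}, gap (different teacher) = {expectancy_gap_diff}")
print(f"interaction contrast = {interaction}")  # positive under the expectancy hypothesis
```

A reliably positive contrast is the differentiated pattern the text refers to: simple selection would have to mimic this specific interaction, not merely a main effect, to remain a plausible rival.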
It is not easy to invoke a simple selection
interpretation of such a data pattern. However,
Reichardt (1985) has argued that the results
may be due to a regression artifact. He noted
that, when Seaver partitioned the students into
four groups, he also partitioned the teachers.
Assuming that an older sibling's performance
is affected by the ability of the teacher, when
Seaver selected older siblings with above average performance, he may also have been inadvertently selecting teachers whose abilities were
genuinely above average. A younger sibling
assigned to the same teacher would therefore
receive the above average teacher. But the
younger sibling assigned to a different teacher
would more likely have a teacher closer to the
mean level of teaching ability. By the same
process, for older siblings who had performed
poorly, Seaver may have selected teachers
who were less able on the average. A younger
sibling assigned to a different teacher would
then be likely, again because of regression, to
receive a more able teacher. Differences in
teacher effectiveness might thus account for
the crossover interaction Seaver obtained. But
all other internal validity threats seem to be
controlled for.
The Seaver study shows that posttest-only
designs with nonequivalent groups can be quite
strong under the specific circumstances he
chose: (a) substantive theory that predicts a
somewhat complex data pattern; (b) sibling
comparison groups that, although nonequivalent, are nonetheless matched on most
family background factors; (c) outcome measures (academic achievement) that are quite
reliably measured; and (d) large sample sizes.
Although these circumstances may not often be
forthcoming in social research, they serve to
remind us that the structure of a quasi-experimental design does not by itself determine the
quality of a causal inference. The uniqueness of
a theory-derived hypothesis also plays a role.
We now distinguish between several kinds
of nonequivalent control group designs that
generally produce interpretable causal results.
FIGURE 2
First Outcome of the
No-Treatment Control Group Design
With Pretest and Posttest
[Graph not legible in this scan; visible labels include Control, Pretest, and Posttest.]

FIGURE 3
Second Outcome of the
No-Treatment Control Group Design
With Pretest and Posttest
[Graph not legible in this scan; visible labels include Pretest and Posttest.]
FIGURE 4
[Graph not legible in this scan; visible labels include Pretest and Posttest.]

FIGURE 5
[Graph not legible in this scan.]

FIGURE 6
[Graph not legible in this scan; visible labels include Control, Pretest, and Posttest.]
performers did not change. These slope differences might be due to regression with all three
groups converging on the same grand mean (as
in Figure 5). But the low performers had overtaken the high performers by the posttest and
differed from them reliably. By itself, statistical
regression does not provide a plausible alternative for these results, though it may have
inflated treatment effect estimates. A regression explanation would have to assume that the
initial low performers were genuinely better
performers than the others, but for some unknown reason a particularly large error component depressed their scores at the earlier time.
It seems implausible to us to assume that low
scorers on the pretest would regress to a mean
higher than that of both the middle and high
scorers.
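The logic of that implausibility argument can be checked with a small simulation. Under a classical true-score model (observed score = true score + independent noise at each wave), a group selected for low pretest scores regresses toward the grand mean at posttest, but never past it. All parameters below are illustrative:

```python
import random

random.seed(1)
# Regression to the mean: observed = true score + fresh noise at each wave.
# Selecting the lowest pretest scorers guarantees their posttest mean moves
# back toward the grand mean -- but not beyond it.
true_scores = [random.gauss(50, 10) for _ in range(10_000)]
pretest  = [t + random.gauss(0, 10) for t in true_scores]
posttest = [t + random.gauss(0, 10) for t in true_scores]

# Bottom 10 percent on the pretest:
cutoff = sorted(pretest)[len(pretest) // 10]
low = [i for i, p in enumerate(pretest) if p <= cutoff]

pre_mean  = sum(pretest[i] for i in low) / len(low)
post_mean = sum(posttest[i] for i in low) / len(low)
grand     = sum(pretest) / len(pretest)
print(f"low group: pretest mean = {pre_mean:.1f}, posttest mean = {post_mean:.1f}, "
      f"grand mean = {grand:.1f}")
```

The selected group's posttest mean rises substantially yet stays below the grand mean, which is why a crossover in which low pretest scorers end up above the other groups cannot be explained by regression alone.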
Though the outcome in Figure 6 is often
interpretable in causal terms, any attempt to
set up a design to achieve it involves considerable risk and should not be undertaken lightly.
This is especially true in growth situations where
a true treatment effect would have to countervail against the lower expected growth rate in
the treatment group. A no-difference finding
from a study with considerable statistical power
but with an inevitably incomplete selection
model would not make it clear whether the
treatment had no effect or whether two countervailing forces (the treatment and selection-maturation) had cancelled each other out.
Even if there were a difference in slopes, this
would more readily take the form of Figure 5
than Figure 6, and Figure 5 is less interpretable
than Figure 6. It is one thing to comment on the
interpretive advantages of a crossover interaction with reliable and switching pretest and
posttest differences, and it is quite another to
obtain such a data pattern.
The Untreated Control Group Design With Independent Pretest and Posttest Samples. The basic pretest-posttest design with nonequivalent groups is
sometimes used with separate samples being
[Design diagram not legible in this scan.]
This design is frequently used in epidemiology, marketing, and political polling and may
be gaining in popularity. The only justifiable
context for using it is with random selection of
the pretest and posttest groups within each
(noncomparable) treatment condition. It is the
random selection that makes the samples representative of the population such as it is at
each time point. Selection is not completely
avoided, of course. First, random selection
equates the pretest and posttest only within
limits of sampling error, making comparability
problematic with smaller and heterogeneous
samples. Second, the populations will probably change in composition between the two
measurement waves and the changes need not
be the same in each treatment group, entailing
a problem with selection-maturation. Indeed,
most of the internal validity threats that apply
to the nonequivalent control group design with
repeated measurement also apply when separate pretest and posttest samples are involved,
particularly local history. Statistical conclusion
validity becomes more of a problem with the
independent samples at each measurement
wave. Units no longer serve as their own statistical controls as they do when dependent
samples are analyzed.
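The statistical price of giving up units as their own controls can be stated directly. With equal variances sigma-squared and n units per wave, the standard error of the pretest-to-posttest change is smaller for dependent samples whenever the over-time correlation rho exceeds zero. A minimal sketch (values illustrative):

```python
import math

# Standard error of the pretest-to-posttest change, equal variances sigma2,
# n units per wave:
#   dependent samples (same units twice): sqrt(2 * sigma2 * (1 - rho) / n)
#   independent samples at each wave:     sqrt(2 * sigma2 / n)

def se_dependent(sigma2, rho, n):
    return math.sqrt(2 * sigma2 * (1 - rho) / n)

def se_independent(sigma2, n):
    return math.sqrt(2 * sigma2 / n)

sigma2, n = 1.0, 100
for rho in (0.5, 0.8):
    print(f"rho = {rho}: SE dependent = {se_dependent(sigma2, rho, n):.3f}, "
          f"SE independent = {se_independent(sigma2, n):.3f}")
```

At rho = .8 the dependent-samples standard error is less than half the independent-samples value, which is why statistical conclusion validity suffers when separate samples are drawn at each wave.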
Given these restrictions, anyone considering a design with nonequivalent groups and
independent pretest and posttest samples
should first critically assess whether the need
for independent groups is compelling. Only
then should a researcher pay special attention
to such matters as sample sizes, how the sampling design is implemented, and how comparability can be assessed using measures that are
stable, reliable, and beyond influence by the
treatment. However, such sampling and measurement concerns are irrelevant to the most
crucial aspect of population comparability: How
would the two nonequivalent groups change
over time even without a treatment, given their
initial noncomparability and the possibility of
population differences in composition over
time?
Under the program, parents selected whatever local school they wanted for their child
and received a credit or voucher equal to the
cost of the child's education at that place. The
objective was to improve schools by fostering
competition. Initial evaluations had suggested
that vouchers decreased the academic performance of children, but Wortman, Reichardt,
and St. Pierre doubted this. So they reanalyzed
reading test scores using the repeat pretest
variables examined but obviously not for unmeasured attributes. Still, it would have been
better if practical circumstances had allowed
collecting both the pretest and posttest data for
a calendar year each instead of for seven and six
months, respectively, for the data collection
procedure actually implemented is confounded
with seasons. History was examined as an alternative interpretation in the sense that research staff were in the target hospital almost
every day and detected no major irrelevant
changes that might have influenced recovery
from surgery. This provides no guarantee, of
course, and design modifications would obviously be better than measurement for ruling
out this internal validity threat. Therefore, data
were also collected from a control hospital in a
nearby suburb that was owned by the same
corporation and had some of the same physicians admitting patients. It provided a better
control for history, and in this instance, like the passive measurement, led to the conclusion that the treatment effect was not due to
general history.
Cohort Design With Pretests From Each Unit.
Another example of the cohort design leads to
an important design elaboration that strengthens causal inference. In a study comparing the
relative effectiveness of regular teachers versus outside contractors hired to stimulate
children's achievement, Saretsky (1972) noted
that the teachers exerted special efforts and
performed better than would have been expected on the basis of their previous years'
performance. He attributed such compensatory rivalry to the teachers' fear that they
would lose their jobs if the contractors outperformed them. It is not clear how Saretsky tested
this hypothesis. Let us assume for pedagogic
purposes that he compared the average grade
equivalent gain in classes taught by the regular teachers during the study period with the
average gain from the same classes taught by
the same teachers in previous years when they
were not aware of being in a study. The result-
event happened in the intervention time period. To control for general history, we would
be even better served if a nonequivalent, no-treatment control group could be found and
measured at exactly the same time points as the
treatment cohorts. Failing this, the design could
be greatly strengthened if nonequivalent dependent variables were specified, some of which
should be affected by history while others
should not.
The plausibility of a history threat can be
examined if the research question is slightly
modified. Imagine wanting to know about the
effects, not of a new practice like participating
in a performance contracting experiment, but
of a long established practice like asking second graders to do one-half hour of homework
each school night. With access to past school
records or with at least two years to do a study
with original data collection, a design like the
one below might be possible. It involves three
cohorts entering the second grade in consecutive years, and the spacing of observations
indicates that O1 and O2 are not simultaneously observed because one might be at the end of a school year and the other at the beginning of the next. This institutional testing cycle is repeated again with the O3 and O4 observations to create a design that Campbell and Stanley (1966) discussed in detail as the recurrent institutional cycle design. A treatment main effect is suggested if O1 and O3 are, say, higher than O2 and O4, and if O2 and O4 (and O1 and O3) are not different from each other.
lighter viewing groups. If the show were effective, we would then expect larger knowledge
differences between the heavy and light viewers from the cohort that viewed the show when
contrasted with their siblings who did not. The
key assumption here is that the older siblings
would themselves have been heavy and light
viewers of "Sesame Street" if the show had
been available to them during their kindergarten years. Under this assumption, the null
hypothesis is that the difference in knowledge
between heavy and light viewers who saw the
show is no different from the difference in
knowledge between their respective siblings
who did not see it. A difference of these differences would suggest that "Sesame Street" was
effective and that a simple selection alternative
could be ruled out. While selection could account for the differences between heavier and
lighter viewers who had seen the show, it could
not readily account for their difference from
their respective siblings who had not seen it.
Moreover, we would expect the heavier and
lighter viewers from the "Sesame Street" cohort to have experienced the same general history (though not necessarily the same local
history). Partitioning respondents into treatment groups based on treatment exposure levels strengthens the internal validity of cohort designs. Even with self-selection into viewing
levels, it is difficult to come up with plausible
alternative interpretations when the data look
like they do in Figure 7.
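The difference-of-differences logic just described can be sketched in a few lines; the group means below are invented for illustration and are not data from the study:

```python
# Hypothetical mean knowledge scores (illustrative numbers only).
heavy_viewers = 62.0    # heavy viewers who saw the show
light_viewers = 50.0    # light viewers who saw the show
heavy_siblings = 48.0   # older siblings of heavy viewers (show unavailable)
light_siblings = 46.0   # older siblings of light viewers (show unavailable)

# Selection alone predicts the two gaps are about equal; a surplus gap
# in the viewing cohort is evidence of a treatment effect.
viewing_gap = heavy_viewers - light_viewers    # 12.0
sibling_gap = heavy_siblings - light_siblings  # 2.0
effect_estimate = viewing_gap - sibling_gap
print(effect_estimate)  # 10.0
```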
Partitioning has a further advantage for
internal validity. If testing conditions differ
between the earlier and later cohorts, then testing alone might bias scores in the later experimental cohort. Partitioning respondents by
the length of exposure to the treatment rules
out a simple testing threat, for there is no reason
why testing should have a greater effect in the
longer exposure treatment group than the
shorter one. For a number of reasons, then, we
advocate partitioning respondents and implementing a modified cohort design, as depicted in Figure 7.

FIGURE 7
Interpretable Outcome of a Posttest-Only Cohort Design
[figure: outcomes for the pretreatment and posttreatment cohorts, partitioned by viewing level]
Ball and Bogatz' (1970) evaluation of "Sesame Street" also used cohorts, but they were
older children from the local neighborhood
instead of siblings. They took a sample of children and tested them before "Sesame Street"
went on the air and six months later. Many of
the children were aged between 53 and 58
months at the posttest and were called the
posttest cohort. Other children were aged between 53 and 58 months at the pretest and were
called the pretest cohort. Comparing the posttest scores of the posttest cohort and the pretest scores of the pretest cohort rules out maturation since all the cohorts are presumably
at comparable maturational stages. The major
difference between them is that one cohort
has seen "Sesame Street" and the other has not.
A selection effect is also not likely, provided
that the analysis included data from all the
children available to be in a particular cohort.
And as a further check, background characteristics of the pretest and posttest cohorts can be
assessed.
The design is vulnerable to a history interpretation, though. Did the older pretest cohorts experience unique, outcome-modifying events before their younger cohorts were born? Or were the older children at particularly sensitive maturational stages when they learned information from the environment that the other
cohorts were too young to take advantage of?
Also, the design as portrayed here and as
implemented by Ball and Bogatz has a unique
testing problem. The scores of the pretest cohort came from a first testing, while those of the
posttest cohort came from a second. This makes
it unclear whether cohort differences in knowledge were due to the treatment or to differences
in the frequency of measurement. To deal with
these history and testing problems, Ball and
Bogatz used measures of the reported frequency
of viewing "Sesame Street" to partition each
cohort into four viewing levels. The consequences of this partitioning are displayed in
Figure 8, and an analysis of variance showed
that the differences in knowledge between the
four viewing levels were greater within the
posttest cohort than the pretest one. Since the
cohorts were of the same mean age, of comparable social background within the different
viewing groups, and had experienced the same
history and testing sequences (all posttest cohorts were pretested), the complex, theory-predicted outcome in Figure 8 can account for all the internal validity threats discussed thus far.
FIGURE 8
Interpretable Outcome of a Selection Cohorts Design With Pretest and Posttest Cohorts
[figure: outcomes for the pretest cohorts (Year 1) and posttest cohorts (Year 2), by viewing level]
motivational properties of jobs affect work attitudes and behaviors. As a result of a technological innovation, clerical jobs in a bank were
changed to make the work on some units more
complex and challenging but to make work on
other units less so. These changes were made
without the company personnel being told of
their possible motivational consequences, and
measures of job characteristics, employee attitudes, and work behaviors were taken before
and after the jobs were redesigned. The pretest
scores for the group with enriched jobs were
systematically below those of the other group,
entailing an initial selection difference. If the
groups also matured at different rates, this
would manifest itself as the statistical interaction that indicates a treatment effect in the
design under consideration. Such a threat is not
very plausible, however, if the results in each
treatment group show reliably different trends
with opposite causal signs. This is because
selection-maturation patterns that operate in
different directions are much rarer than patterns where change occurs at different rates in
the same direction.
The reversed-treatment design often has a
special construct validity advantage. The causal
construct has to be rigorously specified and
carefully built into manipulations in order to
create a sensitive test that depends on one
version of the cause (e.g., job enrichment) affecting one group one way, while its conceptual opposite (job impoverishment) affects the
second group the opposite way. Moreover, some
of the irrelevancies associated with one treatment will be different from those associated
with the reversed treatment, adding to the
heterogeneity of irrelevancies. To understand
this better, consider what would have happened had Hackman, Pearce, and Wolf used an
enriched job group only and no-treatment
controls. A steeper pretest-posttest slope in the
enriched condition could then be attributed
either to the job changes or to respondents
feeling specially treated or guessing the hypothesis. The plausibility of such alternatives is
The major outcome measures were physiological indicators of heart problems collected
annually, including blood pressure and cholesterol level. Three matched treatment and control communities were used in Blackburn's
study and two in Farquhar's. The double pretest was included to get some estimate of pretest trends in the two nonequivalent communities, and independent random samples were
drawn in the first two years both out of concern
that obtrusive annual physiological measurement would sensitize repeatedly measured
respondents to the treatment and out of a desire
to generalize to the community at large as it
evolved over time. However, in the Blackburn
et al. study, the variability between years within
cities was much greater than expected and,
since it could not be readily explained, statistical adjustments were not very helpful. Hence,
Blackburn modified his design in midstream so
that some respondents who had provided pretest information were followed up at a number
of posttest time points, creating a longitudinal
sample of community residents to complement
disease decreased in the two treatment communities and in one of the control communities, and by amounts that hardly differed. But
the risk appears to have increased over time in
the second control community, despite a national trend downward over the period studied. Omitting this one community from some
analyses would have been desirable and would
have reduced the treatment-control differences
to a level close to zero. But with only two
communities per condition this could not be
done. With so few units there can be no pretense of achieving comparability between the
treatment and control groups, however conscientiously the communities were paired before
assigning them to the treatment and control
status. Larger sample sizes are required. This
entails either adding more communities (which
would be prohibitively expensive even if a
greater ratio of control to treatment communities was involved) or combining studies that
have similar treatments, even though they will
not be identical and other contextual and evaluation factors are bound to differ between the
studies being synthesized.
Designs Without Pretests
In this section, we discuss designs that are feasible and useful when it is absolutely impossible to collect comparison group data, especially no-treatment control group data. These
are in many ways fall-back designs born of
necessity rather than desire. They vary considerably in their potential for justifying causal
inferences, and some are frankly better used as
parts of more complex designs than as standalone designs.
The Removed Treatment Control Group
Design. When it is not feasible to obtain a
nonequivalent control group, the researcher is
forced to create the functional equivalent of
such a group. The design shown below does this in many instances:

O1   X   O2   O3   X̄   O4

It calls for a simple one-group pretest-posttest design (see O1 to O2), with a third wave of data collection being added (see O3), after which the treatment is removed from the treatment group (X̄ symbolizes this removal), and a final measure is made (O4). The link from O1 to O2 is the experimental sequence, as it were, while the link from O3 to O4 attempts to serve as a no-treatment control. Note that a single group of units is involved, and we only expect the predicted data pattern if the treatment has no long-term effects that carry over once the treatment has been removed. A persisting effect will bias the analysis against the reversal in the direction of change predicted between O3 and O4.
The most interpretable outcome of the design is presented in Figure 9. However, statistical conclusion validity is a problem, since the
pattern of results can be thrown off by even a
single deviant data point. Hence, large sample
sizes and highly reliable measures are a desideratum. Second, many treatments are ameliorative in nature. Removing them might be hard
to defend ethically; it may also arouse a frustration that should be correlated with measures
of aggression, satisfaction, and performance,
not to speak of the resentful demoralization
and compensatory rivalry listed earlier under
threats to construct validity of the cause. Such
considerations indicate that it will sometimes
be impossible to implement the design.
Third, there are many instances where respondents voluntarily decide to discontinue
exposure to a treatment for reasons that are not
related to the social research taking place. Since
the present design is most likely to be used
when respondents self-select themselves out of
a treatment, very special care has to be taken in this circumstance. Imagine someone who becomes a foreman (X), develops promanagerial attitudes between O1 and O2, dislikes his
new contact with managers, and becomes less
FIGURE 9
Generally Interpretable Outcome of the Removed-Treatment Design
[figure: the outcome measure plotted at O1, O2, O3, and O4]
FIGURE 10
Hypothetical Outcome of a Pretest-Posttest Regression-Discontinuity Quasi Experiment
[figure: posttest values plotted against pretest values, with separate regression lines for the No Treatment and Treatment groups and a discontinuity at the cutoff]
level at the cutoff point. The slope of the regression lines might also come to differ on each side
of the cutoff, suggesting a statistical interaction
of the pretest and treatment such that persons
on the treatment side of the cutoff do better or
worse depending on their initial scores.
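This estimation step can be sketched with simulated data. The sketch below is a minimal version under two assumptions that are not guaranteed in practice: a sharp cutoff and a linear pretest-posttest relation. A treatment dummy estimates the displacement of the regression line at the cutoff, and a dummy-by-pretest interaction estimates the change in slope:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: units at or above the cutoff receive the treatment.
n, cutoff = 200, 0.0
pretest = rng.uniform(-10, 10, n)
treated = (pretest >= cutoff).astype(float)
true_effect = 5.0  # jump at the cutoff built into the simulation
posttest = 2.0 + pretest + true_effect * treated + rng.normal(0, 1, n)

# Regress posttest on the centered pretest, the treatment dummy, and
# their interaction.  The dummy coefficient estimates the displacement
# at the cutoff; the interaction coefficient estimates the slope change.
x = pretest - cutoff
X = np.column_stack([np.ones(n), x, treated, treated * x])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
print(f"displacement at cutoff: {beta[2]:.2f}")  # near the built-in 5.0
print(f"slope change:           {beta[3]:.2f}")  # near 0.0
```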
Suppose the National Science Foundation were to award individual fellowships to graduate students solely on the basis of their Graduate Record Examination scores and then wanted
to know what effect this had on a particular college outcome. If a regression discontinuity
design were used and resulted in different regression slopes with the steeper one being on
the awards side of the cutoff, this would suggest that fellowships have more of an impact on
students whose GRE scores are among the
highest than on students whose scores just
qualify them for fellowships. (However, the
interpretation of slope differences is extremely
FIGURE 11
Regression of Grade Point Average Term 2 on Grade Point Average Term 1 for Non-Dean's List and Dean's List Groups
[figure: Term 2 GPA plotted against Term 1 GPA, both axes running from .5 to 4.0]
FIGURE 12
Plot of the Column Means for the Seaver-Quarton Data
[figure: column mean GPAs plotted against Term 1 GPA, both axes running from .5 to 4.0]
then it should appear in the data when the pretest in Figure 11 was used as the posttest and the previous quarter's grades were used as the pretest. When this analysis was conducted, it produced no discontinuities. However, the test is not as accurate as critically examining scatterplots to examine whether nonlinear forms of regression fit the data better than linear forms. If they do, regressions of a higher order than the linear have to be fit; otherwise, spurious causal conclusions will result.
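The danger just noted can be made concrete with a small simulation (artificial data, not the Seaver-Quarton records): when the underlying relation is curvilinear and no treatment effect exists, forcing a linear fit manufactures a discontinuity, while a higher-order fit does not:

```python
import numpy as np

rng = np.random.default_rng(1)

# No treatment effect at all: the true pretest-posttest relation is a
# smooth cubic, and treatment is assigned to cases above the cutoff.
n, cutoff = 400, 0.0
x = rng.uniform(-10, 10, n)
treated = (x >= cutoff).astype(float)
y = 0.01 * x**3 + x + rng.normal(0, 1, n)

def jump_at_cutoff(order):
    """Fit polynomial terms up to `order` plus a treatment dummy and
    return the dummy coefficient (the estimated discontinuity)."""
    cols = [x**k for k in range(order + 1)] + [treated]
    beta, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return beta[-1]

linear_jump = jump_at_cutoff(1)  # spurious nonzero "effect" from misfit
cubic_jump = jump_at_cutoff(3)   # near zero once curvature is modeled
print(linear_jump, cubic_jump)
```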
Another set of problems with the regression-discontinuity design arises from the fact
FIGURE 13
The Consequences of Fitting Inappropriate Linear Regressions When the Underlying Regression Is Not Linear
[figure: curvilinear data fit with straight regression lines on either side of the cutoff, producing a spurious discontinuity]
for an estimate of the demand for health services some labor unions have won completely
free medical services for their members, and
some cities have private medical schemes allowing unlimited free medical services.)
A difficulty that can often be anticipated
with the regression-discontinuity design is that
the cutoff point will not be as clear-cut as our
discussion may suggest. Lack of clarity is especially likely when the cutoff point is widely
known, for this can give rise to special pressures to help some persons achieve the cutoff
point score. For instance, the Irish government publishes the passing score on various national
examinations in education. In an unpublished
manuscript, Greaney, Kellaghan, Takata, and
Campbell (1979) found that a frequency distribution of the number of children obtaining all
possible scores on the physics exam showed
fewer students than expected scoring just below the cutoff point and more than expected
scoring just above it. Did examiners give "an
extra hand" to students scoring just below the
cutoff? A similar phenomenon occurs in
many social service settings, where eligibility
certification workers help clients disguise part
of their income if they suspect that full disclosure will take clients above the eligibility
point for services. Similarly, clients know the
cutoffs and systematically manipulate the information they provide with the cutoffs in mind
(Jencks & Edin, 1990).
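The frequency check that Greaney and colleagues performed can be sketched as a simple count comparison around the cutoff. The scores, cutoff, and manipulation rate below are entirely hypothetical; the point is the telltale deficit just below the cutoff and surplus just above it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical integer exam scores with a known passing score.
n, cutoff = 5000, 40
scores = rng.normal(50, 15, n).round().astype(int)

# Simulate "an extra hand": half the scores landing just below the
# cutoff are nudged up to the passing score.
just_below = (scores >= cutoff - 3) & (scores < cutoff)
nudged = just_below & (rng.random(n) < 0.5)
scores[nudged] = cutoff

# A smooth distribution should put similar counts in adjacent bands;
# a deficit below and surplus above the cutoff suggests manipulation.
below = int(np.sum((scores >= cutoff - 3) & (scores < cutoff)))
above = int(np.sum((scores >= cutoff) & (scores < cutoff + 3)))
print(below, above)  # far fewer just below than just above
```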
A major problem with a fuzzy cutoff point
is that the systematic distortions around the
cutoff can masquerade as treatment effects.
Imagine a social service setting where clients
know the income cutoff point for obtaining
supplementary services and so are motivated
to report their income as lower than it actually
is. To examine how these services influence the
social mobility aspirations of young children,
we might plot the reported income of a wide
range of parents against the mobility aspirations of their children. Let us suppose that the
overall relationship is linear and positive in
general, indicating that higher incomes are
FIGURE 14
Quantified Multiple Control Group Posttest-Only Analysis of the Effects of Medicaid
[figure: mean number of physician visits (4.0 to 4.9) plotted against income level, from under $3,000 to $15,000 or more, with the experimental group at the lowest income level]
frequent visits to doctors even before Medicaid, though these visits might have been more
often to state hospitals on a nonpayment basis.
There is every need, therefore, to disaggregate
the data even further to examine the relationship of income to medical visits among persons
of different family sizes.
A different kind of possible selection bias
cannot be ruled out merely by disaggregating
on the basis of demographic factors that are
routinely measured in surveys and are archived
for general use. Some families in the lowest-income group were eligible for assistance from
A time series is involved when we have multiple observations over time, either on the
same units, as when particular individuals are
repeatedly observed, or on different but similar units, as when achievement scores for institutional cohorts are displayed over a series of
years. Interrupted time series analysis requires
knowing the specific point in the series when a
treatment occurred so as to infer whether the
treatment has had any impact on the series. If
[figure: a monthly time series plotted from 1918 through 1920]
From "A Comparison of Different Shift Systems in the Glass Trade" by E. Farmer, 1924. Report No. 24 of the Medical Research Council Industrial Fatigue Research Board. Adapted by permission.
though the campaign had backfired, for accidents seemed higher after the campaign than
for the two years before it. Since this result
was surprising, the hospital records were
checked. It seems that the persons responsible
for compiling the data became much more
conscious of "lawn mower accidents" as a category for assigning causes after they learned
that a lawn mower safety campaign was under way and that the data they provided would
be used as part of the evaluation of the campaign. When Kerpelman et al. went back to the
basic data for all three years and recoded them
with the same definition across all the years,
the apparent negative effect was smaller than
before and probably not statistically reliable.
Another threat is simple selection. This
occurs when the composition of the experimental group changes abruptly at the time of
the intervention, perhaps because the treatment causes attrition from the measurement
framework. This attrition can be a genuine
treatment effect. But it will obscure inferences
about the treatment's effect on other outcomes,
and it will not be possible without further
analysis to disentangle whether an interruption in the series was due to the treatment or to
different persons being in the pretreatment
series when compared to the posttreatment
series. The simplest solution to the selection
problem, where available, is to restrict at least
one of the data analyses to those units providing measures at all time points. But repeated
measurement on the same units is not always
possible (e.g., in cohort designs where, say, the
third-grade achievement scores from a single
school over 20 years are involved). Then, background characteristics of units have to be
analyzed to ascertain whether a sharp discontinuity in the profile of the units occurs
at the time the treatment is introduced. If
there is, selection is a problem; if there is not,
selection is not likely to be a serious threat
unless the background characteristics were
poorly measured or were not the appropriate
[figure: attendance plotted for 12 weeks before and 16 weeks after the intervention]
Note: Attendance is expressed in terms of the percentage of hours scheduled to be worked that were actually worked.
From "Impact of Employee Participation in the Development of Pay Incentive Plans: A Field Experiment" by E. E. Lawler III and J. R. Hackman, 1969, Journal of Applied Psychology, 53. Copyright 1969 by the American Psychological Association. Adapted by permission.
posttreatment survey data they had also collected, the results suggested that the campaign
improved residents' knowledge of the bus
system and their attitudes toward it. This led
the authors to surmise that the decay in ridership was not due to the poor performance of the
bus system but rather to the absence of an
ongoing campaign to maintain a permanent
behavior change. Without the postintervention
series and the exploratory surveys, even this
tentative conclusion about causal mediation
would not have been possible.
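As a minimal, hedged illustration of the interrupted time series logic (simulated data, not from any study discussed here), a segmented regression with a post-intervention dummy can estimate an abrupt level shift at a known intervention point:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated series of 60 observations: a gentle upward trend with a
# built-in level drop of 4 units at a known intervention point (t = 30).
t = np.arange(60)
post = (t >= 30).astype(float)
series = 10.0 + 0.1 * t - 4.0 * post + rng.normal(0, 1, 60)

# Segmented regression: intercept, trend, and post-intervention dummy.
# The dummy coefficient estimates the abrupt change in level.
X = np.column_stack([np.ones(60), t, post])
beta, *_ = np.linalg.lstsq(X, series, rcond=None)
print(f"estimated level shift: {beta[2]:.2f}")  # near the built-in -4.0
```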
Interrupted Time Series With Nonequivalent
Dependent Variable. History is the main
threat to internal validity in a single interrupted time series. As we have already seen,
its effects can sometimes be examined by
minimizing the time interval between measures or including a no-treatment control group
in the design. History can also be examined by
collecting time series data for nonequivalent
dependent variables, all of which should be
equally affected by history (and other plausible
alternative causal forces), while only the treatment group should have been influenced by
the intervention. The design is diagrammed as:
OA1 OA2 OA3 OA4 X OA5 OA6 OA7
OB1 OB2 OB3 OB4 X OB5 OB6 OB7
FIGURE 17
Hypothetical Library Circulation Data After Matching Just Before a Treatment Is Introduced
[figure: circulation for two hypothetical series, A1 and B1, matched just before the treatment, plotted annually from 1948 to 1966]
FIGURE 18
Hypothetical Library Circulation Data Prematching
[figure: circulation for the two series plotted annually from 1948 to 1966, before matching]
[figure: monthly casualty series plotted from 1966 through 1970]
From "Determining the Social Effects of a Legal Reform: The British 'Breathalyser' Crackdown of 1967" by H. L. Ross, D. T. Campbell, and G. V. Glass, 1970, American Behavioral Scientist, 13 (4), pp. 493-509. Copyright 1970 by Sage Publications. Reprinted by permission.
[figure: percentage of drivers exceeding the speed limit per session, across baseline, posting (daily and weekly), no-posting, and follow-up phases]
From "An Analysis of Public Posting in Reducing Speeding Behavior on an Urban Highway" by R. Van Houten, P. Nau, and Z. Marini, 1980, Journal of Applied Behavior Analysis, 13. Copyright 1980 by Journal of Applied Behavior Analysis. Adapted by permission.
O9 O10 X O11 O12 X̄ O13 O14
concluded that the sign was effective in reducing speeding. (Actually, to enhance generality,
they also used three other speed postings and
all showed the same type of effect.)
A salient issue with this design concerns
the scheduling of treatments and their removal. This is probably best done randomly,
though in a way that preserves the alternation
of X and X̄, that is, in a block randomized
design. Such random assignment rules out the
threat of cyclical maturation mimicking the
obtained pattern of increases and decreases,
even in the absence of a treatment. Where such
cyclicity is ruled out by theory or a long preintervention series that permits direct observation of maturational patterns, randomization is
less important. But even when the introductions are haphazardly rather than randomly
scheduled, the design is very powerful for inferring causal effects and can easily be modified to compare different treatments, with X1
being substituted for the global X and X2 for X̄.
It would even be possible to add an interaction
factor, X1X2.
The major limitations of the basic design
are practical. First, it can be implemented only
where the initial effect of the treatment is expected to dissipate rapidly. Secondly, it requires a degree of experimental control that is
most likely to be achieved in the laboratory, in
institutional settings like schools or prisons, or
in settings where the treatment constitutes a
very minor deviation from the regular state of
affairs (as in the highway example above). Our
distinct impression is that the design has been
used most often in mental health, particularly
in research with a behavior modification bent
where the therapeutic situation allows investigators the very control over respondents that
the design requires.
Interrupted Time Series With Switching
Replications.
Imagine two nonequivalent
samples, each of which receives the treatment
at different times so that when one group
receives the treatment the other serves as a
control, and when the control group later receives the treatment the original treatment
group serves as the control. The design can be
diagrammed as:
O1 O2 O3 O4 O5 X O6 O7 O8 O9 O10
O1 O2 O3 O4 O5 O6 O7 O8 X O9 O10
FIGURE 21
Per Capita Library Circulation in Two Sets of Illinois Communities
as a Function of the Introduction of Television
[figure: per capita circulation, about 4.0 to 7.0, plotted from 1945 to 1960 for communities where television was introduced early versus late]
of history and selection. This alternative interpretation could be ruled out by collecting data
on the circulation of paperbacks in each set of
communities. This would be a laborious but
worthwhile operation for someone with a vested
interest in knowing that the introduction of
television caused a decrease in library circulation. A different strategy, following the example of Parker (1963), would be to split the
library circulation into fiction and nonfiction
books. Television, as a predominantly fictional
medium, would be expected to have a greater
effect on the circulation of fiction than of fact
books. Using the nonequivalent dependent
variables in this way would render the paperback explanation less plausible because we
would have to postulate that fiction books
were introduced into the different communities at different times but that nonfiction
show an initial effect and a return to the baseline level after the treatment was removed.
The replication is important for, as Figure
22 shows, safety practices seem to have been
increasing slightly before the intervention in
the setting that experienced it first. Might part
of the effectiveness of the safety treatment be
attributed to this trend rather than to the intervention? The baseline data from the delayed
intervention group are crucial here because
they indicate that safety practices were not
increasing and may even have been decreasing. Yet the same initial effect is noted. Moreover, when the intervention is removed, the
safety behavior returns to baseline in both treatment groups, further demonstrating the effectiveness of the treatment so long as it is in place.
An obvious advantage of switching replications is that the threat of history is reduced.
The only relevant historical threat is one that
operates in different settings at different times,
or involves two different historical forces operating in different directions at different times.
Simple selection is also less of a threat since the
only viable selection threat requires different
kinds of attrition at different time points. Instrumentation is also less likely, though it will be worthwhile exploring whether a spurious effect could have arisen because some force (say, a history-related one) operated when the treatment was introduced while a simple ceiling or floor effect caused an inflection in an upward or downward general trend at the point when the treatment was removed.
Given these considerable advantages, a key
question is: How practical is the switching
replication design?
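The logic behind this reduced vulnerability to history can be made concrete with a small simulation, again with purely hypothetical numbers rather than data from any study cited above. In a switching replication, two groups receive the same treatment at different times, and each group's effect is estimated against its own switch point; a one-time historical event would shift both series at the same calendar time, so it could not reproduce jumps located at two different onsets.

```python
import numpy as np

rng = np.random.default_rng(1)
weeks = np.arange(30)

def outcome(onset):
    """Hypothetical weekly series: the treatment adds 10 points
    from its onset onward; noise has a standard deviation of 2."""
    y = 50 + 2 * rng.standard_normal(weeks.size)
    y[weeks >= onset] += 10.0
    return y

group_a = outcome(onset=8)    # treated early
group_b = outcome(onset=20)   # treated late (the switching replication)

# Each group's effect estimate uses its own introduction point.
effect_a = group_a[8:].mean() - group_a[:8].mean()
effect_b = group_b[20:].mean() - group_b[:20].mean()
```

Because group B's jump occurs at week 20, a historical event at week 8, which would have shifted both series simultaneously, is implausible as an explanation of the replicated effect.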
Conclusion
The warrant for causal conclusions is clear in
the case of the randomized experiment. Assuming competent implementation of a correct
randomization procedure, random assignment
creates a probabilistic equivalence between
groups at the pretest. Assuming further that
FIGURE 22
Percentage of Items Performed Safely by Employees in
Two Departments of a Food Manufacturing Plant During a 25-Week Period of Time

[Two-panel line graph: percentage of items performed safely (scale 50 to 100) across observation sessions 1 to 65 for the Wrapping Department and the Make-Up Department, with Baseline, Intervention, and Reversal phases marked in each panel.]

From "A Behavioral Approach to Occupational Safety: Pinpointing Safe Performance in a Food Manufacturing Plant" by J. Komaki, K. D. Barwick, and L. R. Scott, 1978, Journal of Applied Psychology, 63. Copyright 1978 by the American Psychological Association. Adapted by permission.
there is no differential attrition from the experiment, any group differences at the posttest must be due to chance or to the treatment
contrast actually achieved. The likelihood of
chance can be reduced by having large samples, stratifying on correlates of the dependent
variable prior to assignment, and using statistics correctly. If chance is ruled out and a treatment-outcome relationship persists, the strong
likelihood is that the treatment contrast caused
the observed effect. However, random assignment does not ensure that this contrast is isomorphic with the construct the treatment was
supposed to represent. The assignment procedure merely permits one to conclude that something about the treatment contrast caused the
missing. A different warrant has to be constructed. This chapter has sought to construct
a warrant that depends on five shaky but
not totally unreasonable assumptions. The first
is that only falsification can be justified as a
means of certain knowledge (Popper, 1959).
From this follows the recommendation that
researchers try not just to corroborate a causal
hypothesis but also to falsify it by ruling out
threats to both statistical conclusion validity
(to deal with chance) and internal validity (to
deal with forces leading to the same pattern
of results as expected from the manipulated
treatment contrast). However, theorists of
quasi experimentation acknowledge that falsification can never be perfect in research practice since, as Kuhn (1962) has pointed out, it
depends on fully specified theories and perfectly valid measurement. Neither of these is
forthcoming in any science, let alone the social
sciences. Falsification is fallible, but not for that
matter useless, as we have tried to show in
conceptualizing threats to validity and then
attempting to rule them out.
The second assumption on which a warrant for causal inference from quasi experiments rests is that published lists of threats to
statistical conclusion and internal validity are
comprehensive. Without this assumption one
cannot be sure of having falsified all the relevant threats. The available lists are long and
reflect the criticisms most often raised in the
past by scholars when commenting on others'
work. But while this makes our lists long and
empirical, it does not make them infallible.
They are subject to correction as some current
threats are removed from the lexicon because
they do not operate often enough to be worried
about, and others are added as the community
of field researchers discovers them in the course
of their work.
The third assumption is that quasi experiments are more dependent on the notion of
plausibility than is the case with randomized
experiments. Judgments have to be made
about (a) which validity threats are plausible
References
Ball, S., & Bogatz, G. A. (1970). The first year of Sesame
Street: An evaluation. Princeton, NJ: Educational
Testing Service.
Basadur, M., Graen, G. B., & Scandura, T. A. (1986). Training effects on attitudes toward divergent thinking among manufacturing engineers. Journal of Applied Psychology, 71, 612-617.
Bhaskar, R. (1975). A realist theory of science. Leeds, England: Leeds Books.
Blackburn, H., Luepker, R., Kline, F. G., Bracht, N., Carlaw, R., Jacobs, D., Mittelmark, M., Stauffer, L., & Taylor, H. L. (1984). The Minnesota Heart Health Program: A research and demonstration project in cardiovascular disease prevention. In J. D. Matarazzo, S. Weiss, J. A. Herd, N. E. Miller, & S. M. Weiss (Eds.), Behavioral health. New York: Wiley.
Bogatz, G. A., & Ball, S. (1971). The second year of
"Sesame Street": A continuing evaluation. Princeton, NJ: Educational Testing Service.
Booton, L. A., & Lane, J. I. (1985). Hospital market
structure and the return to nursing education.
The Journal of Human Resources, 20, 184-196.
Boruch, R. F., & Gomez, H. (1977). Sensitivity, bias,
and theory in impact evaluation. Professional
Psychology, 8, 411-434.
Box, G. E. P., & Jenkins, G. M. (1976). Time-series
analysis: Forecasting and control. San Francisco:
Holden-Day.
Bracht, G. H., & Glass, G. V. (1968). The external validity of experiments. American Educational Research Journal, 5, 437-474.