
Statistical Science
1991, Vol. 6, No. 4, 363-403

Replication and Meta-Analysis in Parapsychology

Jessica Utts

Abstract. Parapsychology, the laboratory study of psychic phenomena, has had its history interwoven with that of statistics. Many of the controversies in parapsychology have focused on statistical issues, and statistical models have played an integral role in the experimental work. Recently, parapsychologists have been using meta-analysis as a tool for synthesizing large bodies of work. This paper presents an overview of the use of statistics in parapsychology and offers a summary of the meta-analyses that have been conducted. It begins with some anecdotal information about the involvement of statistics and statisticians with the early history of parapsychology. Next, it is argued that most nonstatisticians do not appreciate the connection between power and "successful" replication of experimental effects. Returning to parapsychology, a particular experimental regime is examined by summarizing an extended debate over the interpretation of the results. A new set of experiments designed to resolve the debate is then reviewed. Finally, meta-analyses from several areas of parapsychology are summarized. It is concluded that the overall evidence indicates that there is an anomalous effect in need of an explanation.

Key words and phrases: Effect size, psychic research, statistical controversies, randomness, vote-counting.

Jessica Utts is Associate Professor, Division of Statistics, University of California at Davis, 469 Kerr Hall, Davis, California 95616.

1. INTRODUCTION

In a June 1990 Gallup Poll, 49% of the 1236 respondents claimed to believe in extrasensory perception (ESP), and one in four claimed to have had a personal experience involving telepathy (Gallup and Newport, 1991). Other surveys have shown even higher percentages; the University of Chicago's National Opinion Research Center recently surveyed 1473 adults, of which 67% claimed that they had experienced ESP (Greeley, 1987).

Public opinion is a poor arbiter of science, however, and experience is a poor substitute for the scientific method. For more than a century, small numbers of scientists have been conducting laboratory experiments to study phenomena such as telepathy, clairvoyance and precognition, collectively known as "psi" abilities. This paper will examine some of that work, as well as some of the statistical controversies it has generated.

Parapsychology, as this field is called, has been a source of controversy throughout its history. Strong beliefs tend to be resistant to change even in the face of data, and many people, scientists included, seem to have made up their minds on the question without examining any empirical data at all. A critic of parapsychology recently acknowledged that "The level of the debate during the past 130 years has been an embarrassment for anyone who would like to believe that scholars and scientists adhere to standards of rationality and fair play" (Hyman, 1985a, page 89). While much of the controversy has focused on poor experimental design and potential fraud, there have been attacks and defenses of the statistical methods as well, sometimes calling into question the very foundations of probability and statistical inference.

Most of the criticisms have been leveled by psychologists. For example, a 1988 report of the U.S. National Academy of Sciences concluded that "The committee finds no scientific justification from research conducted over a period of 130 years for the existence of parapsychological phenomena" (Druckman and Swets, 1988, page 22). The chapter on parapsychology was written by a subcommittee

chaired by a psychologist who had published a similar conclusion prior to his appointment to the committee (Hyman, 1985a, page 7). There were no parapsychologists involved with the writing of the report. Resulting accusations of bias (Palmer, Honorton and Utts, 1989) led U.S. Senator Claiborne Pell to request that the Congressional Office of Technology Assessment (OTA) conduct an investigation with a more balanced group. A one-day workshop was held on September 30, 1988, bringing together parapsychologists, critics and experts in some related fields (including the author of this paper). The report concluded that parapsychology needs "a fairer hearing across a broader spectrum of the scientific community, so that emotionality does not impede objective assessment of experimental results" (Office of Technology Assessment, 1989).

It is in the spirit of the OTA report that this article is written. After Section 2, which offers an anecdotal account of the role of statisticians and statistics in parapsychology, the discussion turns to the more general question of replication of experimental results. Section 3 illustrates how replication has been (mis)interpreted by scientists in many fields. Returning to parapsychology in Section 4, a particular experimental regime called the "ganzfeld" is described, and an extended debate about the interpretation of the experimental results is discussed. Section 5 examines a meta-analysis of recent ganzfeld experiments designed to resolve the debate. Finally, Section 6 contains a brief account of meta-analyses that have been conducted in other areas of parapsychology, and conclusions are given in Section 7.

2. STATISTICS AND PARAPSYCHOLOGY

Parapsychology had its beginnings in the investigation of purported mediums and other anecdotal claims in the late 19th century. The Society for Psychical Research was founded in Britain in 1882, and its American counterpart was founded in Boston in 1884. While these organizations and their members were primarily involved with investigating anecdotal material, a few of the early researchers were already conducting "forced-choice" experiments such as card-guessing. (Forced-choice experiments are like multiple choice tests; on each trial the subject must guess from a small, known set of possibilities.) Notable among these was Nobel Laureate Charles Richet, who is generally credited with being the first to recognize that probability theory could be applied to card-guessing experiments (Rhine, 1977, page 26; Richet, 1884).

F. Y. Edgeworth, partly in response to what he considered to be incorrect analyses of these experiments, offered one of the earliest treatises on the statistical evaluation of forced-choice experiments in two articles published in the Proceedings of the Society for Psychical Research (Edgeworth, 1885, 1886). Unfortunately, as noted by Mauskopf and McVaugh (1979) in their historical account of the period, Edgeworth's papers were "perhaps too difficult for their immediate audience" (page 105).

Edgeworth began his analysis by using Bayes' theorem to derive the formula for the posterior probability that chance was operating, given the data. He then continued with an argument "savouring more of Bernoulli than Bayes," in which "it is consonant, I submit, to experience, to put 1/2 both for α and β," that is, for both the prior probability that chance alone was operating, and the prior probability that "there should have been some additional agency." He then reasoned (using a Taylor series expansion of the posterior probability formula) that if there were a large probability of observing the data given that some additional agency was at work, and a small objective probability of the data under chance, then the latter (binomial) probability "may be taken as a rough measure of the sought a posteriori probability in favour of mere chance" (page 195). Edgeworth concluded his article by applying his method to some data published previously in the same journal. He found the probability against chance to be 0.99996, which he said "may fairly be regarded as physical certainty" (page 199). He concluded:

Such is the evidence which the calculus of probabilities affords as to the existence of an agency other than mere chance. The calculus is silent as to the nature of that agency - whether it is more likely to be vulgar illusion or extraordinary law. That is a question to be decided, not by formulae and figures, but by general philosophy and common sense [page 199].

Both the statistical arguments and the experimental controls in these early experiments were somewhat loose. For example, Edgeworth treated as binomial an experiment in which one person chose a string of eight letters and another attempted to guess the string. Since it has long been understood that people are poor random number (or letter) generators, there is no statistical basis for analyzing such an experiment. Nonetheless, Edgeworth and his contemporaries set the stage for the use of controlled experiments with statistical evaluation in laboratory parapsychology. An interesting historical account of Edgeworth's involvement and the role telepathy experiments played in the early history of randomization and experimental design is provided by Hacking (1988).
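Edgeworth's argument can be made concrete. The sketch below (plain Python; the 100 trials, chance rate of 0.2 and 35 hits are hypothetical illustration, not Edgeworth's actual data) applies Bayes' theorem with his priors of 1/2 for "mere chance" and for "some additional agency," and checks his approximation that, when the likelihood under the agency hypothesis dwarfs the likelihood under chance, the likelihood ratio is a rough measure of the posterior probability of chance.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical forced-choice experiment (not Edgeworth's data):
# 100 guesses with a chance hit rate of 0.2, and 35 hits observed.
n, k = 100, 35
like_chance = binom_pmf(k, n, 0.20)   # likelihood under "mere chance"
like_agency = binom_pmf(k, n, 0.35)   # likelihood under an "additional agency"

# Bayes' theorem with Edgeworth's priors of 1/2 on each hypothesis.
posterior_chance = (0.5 * like_chance) / (0.5 * like_chance + 0.5 * like_agency)

# Edgeworth's approximation: when like_agency >> like_chance,
# the posterior is roughly the likelihood ratio itself.
approx = like_chance / like_agency

print(posterior_chance, approx)
```

With numbers like these the posterior probability of chance is on the order of a tenth of a percent, and the likelihood-ratio approximation agrees with the exact posterior to several decimal places, which is the behavior Edgeworth's Taylor-series argument exploits.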
One of the first American researchers to use statistical methods in parapsychology was John Edgar Coover, who was the Thomas Welton Stanford Psychical Research Fellow in the Psychology Department at Stanford University from 1912 to 1937 (Dommeyer, 1975). In 1917, Coover published a large volume summarizing his work (Coover, 1917). Coover believed that his results were consistent with chance, but others have argued that Coover's definition of significance was too strict (Dommeyer, 1975). For example, in one evaluation of his telepathy experiments, Coover found a two-tailed p-value of 0.0062. He concluded, "Since this value, then, lies within the field of chance deviation, although the probability of its occurrence by chance is fairly low, it cannot be accepted as a decisive indication of some cause beyond chance which operated in favor of success in guessing" (Coover, 1917, page 82). On the next page, he made it explicit that he would require a p-value of 0.0000221 to declare that something other than chance was operating.

It was during the summer of 1930, with the card-guessing experiments of J. B. Rhine at Duke University, that parapsychology began to take hold as a laboratory science. Rhine's laboratory still exists under the name of the Foundation for Research on the Nature of Man, housed at the edge of the Duke University campus.

It wasn't long after Rhine published his first book, Extrasensory Perception, in 1934, that the attacks on his methodology began. Since his claims were wholly based on statistical analyses of his experiments, the statistical methods were closely scrutinized by critics anxious to find a conventional explanation for Rhine's positive results.

The most persistent critic was a psychologist from McGill University named Chester Kellogg (Mauskopf and McVaugh, 1979). Kellogg's main argument was that Rhine was using the binomial distribution (and normal approximation) on a series of trials that were not independent. The experiments in question consisted of having a subject guess the order of a deck of 25 cards, with five each of five symbols, so technically Kellogg was correct.

By 1937, several mathematicians and statisticians had come to Rhine's aid. Mauskopf and McVaugh (1979) speculated that since statistics was itself a young discipline, "a number of statisticians were equally outraged by Kellogg, whose arguments they saw as discrediting their profession" (page 258). The major technical work, which acknowledged that Kellogg's criticisms were accurate but did little to change the significance of the results, was conducted by Charles Stuart and Joseph A. Greenwood and published in the first volume of the Journal of Parapsychology (Stuart and Greenwood, 1937). Stuart, who had been an undergraduate in mathematics at Duke, was one of Rhine's early subjects and continued to work with him as a researcher until Stuart's death in 1947. Greenwood was a Duke mathematician, who apparently converted to a statistician at the urging of Rhine.

Another prominent figure who was distressed with Kellogg's attack was E. V. Huntington, a mathematician at Harvard. After corresponding with Rhine, Huntington decided that, rather than further confuse the public with a technical reply to Kellogg's arguments, a simple statement should be made to the effect that the mathematical issues in Rhine's work had been resolved. Huntington must have successfully convinced his former student, Burton Camp of Wesleyan, that this was a wise approach. Camp was the 1937 President of IMS. When the annual meetings were held in December of 1937 (jointly with AMS and AAAS), Camp released a statement to the press that read:

Dr. Rhine's investigations have two aspects: experimental and statistical. On the experimental side mathematicians, of course, have nothing to say. On the statistical side, however, recent mathematical work has established the fact that, assuming that the experiments have been properly performed, the statistical analysis is essentially valid. If the Rhine investigation is to be fairly attacked, it must be on other than mathematical grounds [Camp, 1937].

One statistician who did emerge as a critic was William Feller. In a talk at the Duke Mathematical Seminar on April 24, 1940, Feller raised three criticisms of Rhine's work (Feller, 1940). They had been raised before by others (and continue to be raised even today). The first was that inadequate shuffling of the cards resulted in additional information from one series to the next. The second was what is now known as the "file-drawer effect," namely, that if one combines the results of published studies only, there is sure to be a bias in favor of successful studies. The third was that the results were enhanced by the use of optional stopping, that is, by not specifying the number of trials in advance. All three of these criticisms were addressed in a rejoinder by Greenwood and Stuart (1940), but Feller was never convinced. Even in its third edition published in 1968, his book An Introduction to Probability Theory and Its Applications still contains his conclusion about Greenwood and Stuart: "Both their arithmetic and their experiments have a distinct tinge of the supernatural" (Feller, 1968, page 407).
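Feller's third criticism, optional stopping, is easy to demonstrate by simulation. The sketch below (plain Python; the block size, number of interim looks and simulation count are arbitrary choices for illustration) runs card-guessing experiments in which the true hit rate really is the chance rate of 1/4, but lets the experimenter test for significance after every block of trials and stop at the first "significant" result. The realized type I error rate comes out well above the nominal 5%.

```python
import random
from math import sqrt

random.seed(1)

P0 = 0.25          # chance hit rate (e.g., 25-card decks with 5 symbols)
BLOCK = 100        # trials added between significance checks
LOOKS = 10         # number of interim significance tests allowed
Z_CRIT = 1.645     # nominal one-tailed 5% critical value

def stops_early():
    """Guess at chance; test after each block and stop if 'significant'."""
    hits = trials = 0
    for _ in range(LOOKS):
        hits += sum(random.random() < P0 for _ in range(BLOCK))
        trials += BLOCK
        z = (hits - P0 * trials) / sqrt(trials * P0 * (1 - P0))
        if z > Z_CRIT:
            return True   # experimenter stops and reports significance
    return False

n_sim = 2000
rate = sum(stops_early() for _ in range(n_sim)) / n_sim
print(f"Realized type I error with optional stopping: {rate:.3f}")
```

With ten looks at the data, the chance of declaring significance at least once is several times the nominal 5%, even though nothing but chance is operating, which is exactly the inflation Feller objected to and which fixing the number of trials in advance prevents.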

In his discussion of Feller's position, Diaconis (1978) remarked, "I believe Feller was confused. . . he seemed to have decided the opposition was wrong and that was that."

Several statisticians have contributed to the literature in parapsychology to greater or lesser degrees. T. N. E. Greville developed applicable statistical methods for many of the experiments in parapsychology and was Statistical Editor of the Journal of Parapsychology (with J. A. Greenwood) from its start in 1937 through Volume 31 in 1967; Fisher (1924, 1929) addressed some specific problems in card-guessing experiments; Wilks (1965a, b) described various statistical methods for parapsychology; Lindley (1957) presented a Bayesian analysis of some parapsychology data; and Diaconis (1978) pointed out some problems with certain experiments and presented a method for analyzing experiments when feedback is given.

Occasionally, attacks on parapsychology have taken the form of attacks on statistical inference in general, at least as it is applied to real data. Spencer-Brown (1957) attempted to show that true randomness is impossible, at least in finite sequences, and that this could be the explanation for the results in parapsychology. That argument re-emerged in a recent debate on the role of randomness in parapsychology, initiated by psychologist J. Barnard Gilmore (Gilmore, 1989, 1990; Utts, 1989; Palmer, 1989, 1990). Gilmore stated that "The agnostic statistician, advising on research in psi, should take account of the possible inappropriateness of classical inferential statistics" (1989, page 338). In his second paper, Gilmore reviewed several non-psi studies showing purportedly random systems that do not behave as they should under randomness (e.g., Iversen, Longcor, Mosteller, Gilbert and Youtz, 1971; Spencer-Brown, 1957). Gilmore concluded that "Anomalous data . . . should not be found nearly so often if classical statistics offers a valid model of reality" (1990, page 54), thus rejecting the use of classical statistical inference for real-world applications in general.

3. REPLICATION

Implicit and explicit in the literature on parapsychology is the assumption that, in order to truly establish itself, the field needs to find a repeatable experiment. For example, Diaconis (1978) started the summary of his article in Science with the words "In search of repeatable ESP experiments, modern investigators. . ." (page 131). On October 28-29, 1983, the 32nd International Conference of the Parapsychology Foundation was held in San Antonio, Texas, to address "The Repeatability Problem in Parapsychology." The Conference Proceedings (Shapin and Coly, 1985) reflect the diverse views among parapsychologists on the nature of the problem. Honorton (1985a) and Rao (1985), for example, both argued that strict replication is uncommon in most branches of science and that parapsychology should not be singled out as unique in this regard. Other authors expressed disappointment in the lack of a single repeatable experiment in parapsychology, with titles such as "Unrepeatability: Parapsychology's Only Finding" (Blackmore, 1985), and "Research Strategies for Dealing with Unstable Phenomena" (Beloff, 1985).

It has never been clear, however, just exactly what would constitute acceptable evidence of a repeatable experiment. In the early days of investigation, the major critics "insisted that it would be sufficient for Rhine and Soal to convince them of ESP if a parapsychologist could perform successfully a single 'fraud-proof' experiment" (Hyman, 1985a, page 71). However, as soon as well-designed experiments showing statistical significance emerged, the critics realized that a single experiment could be statistically significant just by chance. British psychologist C. E. M. Hansel quantified the new expectation, that the experiment should be repeated a few times, as follows:

If a result is significant at the .01 level and this result is not due to chance but to information reaching the subject, it may be expected that by making two further sets of trials the antichance odds of one hundred to one will be increased to around a million to one, thus enabling the effects of ESP - or whatever is responsible for the original result - to manifest itself to such an extent that there will be little doubt that the result is not due to chance [Hansel, 1980, page 298].

In other words, three consecutive experiments at p ≤ 0.01 would convince Hansel that something other than chance was at work.

This argument implies that if a particular experiment produces a statistically significant result, but subsequent replications fail to attain significance, then the original result was probably due to chance, or at least remains unconvincing. The problem with this line of reasoning is that there is no consideration given to sample size or power. Only an experiment with extremely high power should be expected to be "successful" three times in succession.
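The role of power here can be made concrete. The sketch below (plain Python; the 100-trial design and the hypothetical true hit rate of 0.33 against a chance rate of 0.25 are assumptions chosen for illustration) computes the exact power of a one-tailed binomial test at the 0.01 level, and the probability that three independent experiments of this kind would all reach significance.

```python
from math import comb

def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, p0, p1, alpha = 100, 0.25, 0.33, 0.01

# Smallest number of successes whose chance tail probability is <= alpha.
k_crit = next(k for k in range(n + 1) if binom_sf(k, n, p0) <= alpha)

power = binom_sf(k_crit, n, p1)   # P(reject) when the true hit rate is p1
p_three = power ** 3              # all three consecutive experiments significant

print(k_crit, round(power, 3), round(p_three, 3))
```

Even with a genuine effect of this size, the power of a single 100-trial experiment at the 0.01 level is only roughly 0.3, so a run of three significant experiments in a row would occur only a few percent of the time. A string of "failed replications" is the expected outcome under such a design, not evidence against the effect.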
It is perhaps a failure of the way statistics is taught that many scientists do not understand the importance of power in defining successful replication. To illustrate this point, psychologists Tversky and Kahneman (1982) distributed a questionnaire to their colleagues at a professional meeting, with the question:

An investigator has reported a result that you consider implausible. He ran 15 subjects, and reported a significant value, t = 2.46. Another investigator has attempted to duplicate his procedure, and he obtained a nonsignificant value of t with the same number of subjects. The direction was the same in both sets of data. You are reviewing the literature. What is the highest value of t in the second set of data that you would describe as a failure to replicate? [1982, page 28].

In reporting their results, Tversky and Kahneman stated:

The majority of our respondents regarded t = 1.70 as a failure to replicate. If the data of two such studies (t = 2.46 and t = 1.70) are pooled, the value of t for the combined data is about 3.00 (assuming equal variances). Thus, we are faced with a paradoxical state of affairs, in which the same data that would increase our confidence in the finding when viewed as part of the original study, shake our confidence when viewed as an independent study [1982, page 28].

At a recent presentation to the History and Philosophy of Science Seminar at the University of California at Davis, I asked the following question. Two scientists, Professors A and B, each have a theory they would like to demonstrate. Each plans to run a fixed number of Bernoulli trials and then test H0: p = 0.25 versus Ha: p > 0.25. Professor A has access to large numbers of students each semester to use as subjects. In his first experiment, he runs 100 subjects, and there are 33 successes (p = 0.04, one-tailed). Knowing the importance of replication, Professor A runs an additional 100 subjects as a second experiment. He finds 36 successes (p = 0.009, one-tailed).

Professor B only teaches small classes. Each quarter, she runs an experiment on her students to test her theory. She carries out ten studies this way, with the results in Table 1.

TABLE 1
Attempted replications for Professor B

n    Number of successes    One-tailed p-value

I asked the audience by a show of hands to indicate whether or not they felt the scientists had successfully demonstrated their theories. Professor A's theory received overwhelming support, with approximately 20 votes, while Professor B's theory received only one vote.

If you aggregate the results of the experiments for each professor, you will notice that each conducted 200 trials, and Professor B actually demonstrated a higher level of success than Professor A, with 71 as opposed to 69 successful trials. The one-tailed p-values for the combined trials are 0.0017 for Professor A and 0.0006 for Professor B.

To address the question of replication more explicitly, I also posed the following scenario. In December of 1987, it was decided to prematurely terminate a study on the effects of aspirin in reducing heart attacks because the data were so convincing (see, e.g., Greenhouse and Greenhouse, 1988; Rosenthal, 1990a). The physician-subjects had been randomly assigned to take aspirin or a placebo. There were 104 heart attacks among the 11,037 subjects in the aspirin group, and 189 heart attacks among the 11,034 subjects in the placebo group (chi-square = 25.01, p < 0.00001). After showing the results of that study, I presented the audience with two hypothetical experiments conducted to try to replicate the original result, with outcomes in Table 2.

TABLE 2
Hypothetical replications of the aspirin/heart attack study

               Replication #1         Replication #2
               Heart attack           Heart attack
               Yes     No             Yes     No
Aspirin        11      1156           20      2314
Placebo        19      1090           48      2170
Chi-square     2.596, p = 0.11        13.206, p = 0.0003

I asked the audience to indicate which one they thought was a more successful replication. The audience chose the second one, as would most journal editors, because of the "significant p-value." In fact, the first replication has almost exactly the same proportion of heart attacks in the two groups as the original study and is thus a very close replication of that result. The second replication has very different proportions, and in fact the relative risk from the second study is not even contained in a 95% confidence interval for relative risk from the original study. The magnitude of the effect has been much more closely matched by the "nonsignificant" replication.
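The comparison in Table 2 is easy to verify directly. The sketch below (plain Python) computes the Pearson chi-square statistic and the relative risk for the original aspirin study and for both hypothetical replications; the "nonsignificant" first replication reproduces the original study's relative risk almost exactly, while the "significant" second one falls well away from it.

```python
def chi_square(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table
    [[a, b], [c, d]], rows = treatment group, columns = outcome yes/no."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def relative_risk(a, b, c, d):
    """Risk of the outcome in row 1 divided by the risk in row 2."""
    return (a / (a + b)) / (c / (c + d))

# Original aspirin study: (attacks, no attacks) for aspirin, then placebo.
original = (104, 11037 - 104, 189, 11034 - 189)
rep1 = (11, 1156, 19, 1090)    # hypothetical replication 1 (Table 2)
rep2 = (20, 2314, 48, 2170)    # hypothetical replication 2 (Table 2)

for name, tbl in [("original", original), ("rep 1", rep1), ("rep 2", rep2)]:
    print(name, round(chi_square(*tbl), 2), round(relative_risk(*tbl), 3))
```

Running this reproduces the chi-square values quoted in the text (about 25.01, 2.60 and 13.21), and shows relative risks of roughly 0.55 for both the original study and the first replication, against roughly 0.4 for the second: the effect size, not the p-value, is what the first replication matches.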

Fortunately, psychologists are beginning to notice that replication is not as straightforward as they were originally led to believe. A special issue of the Journal of Social Behavior and Personality was entirely devoted to the question of replication (Neuliep, 1990). In one of the articles, Rosenthal cautioned his colleagues: "Given the levels of statistical power at which we normally operate, we have no right to expect the proportion of significant results that we typically do expect, even if in nature there is a very real and very important effect" (Rosenthal, 1990b, page 16).

Jacob Cohen, in his insightful article titled "Things I Have Learned (So Far)," identified another misconception common among social scientists: "Despite widespread misconceptions to the contrary, the rejection of a given null hypothesis gives us no basis for estimating the probability that a replication of the research will again result in rejecting that null hypothesis" (Cohen, 1990, page 1307).

Cohen and Rosenthal both advocate the use of effect sizes as opposed to significance levels when defining the strength of an experimental effect. In general, effect sizes measure the amount by which the data deviate from the null hypothesis in terms of standardized units. For instance, the effect size for a two-sample t-test is usually defined to be the difference in the two means, divided by the standard deviation for the control group. This measure can be compared across studies without the dependence on sample size inherent in significance levels. (Of course there will still be variability in the sample effect sizes, decreasing as a function of sample size.) Comparison of effect sizes across studies is one of the major components of meta-analysis.

Similar arguments have recently been made in the medical literature. For example, Gardner and Altman (1986) stated that the use of p-values "to define two alternative outcomes - significant and not significant - is not helpful and encourages lazy thinking" (page 746). They advocated the use of confidence intervals instead.

As discussed in the next section, the arguments used to conclude that parapsychology has failed to demonstrate a replicable effect hinge on these misconceptions of replication and failure to examine power. A more appropriate analysis would compare the effect sizes for similar experiments across experimenters and across time to see if there have been consistent effects of the same magnitude. Rosenthal also advocates this view of replication:

The traditional view of replication focuses on significance level as the relevant summary statistic of a study and evaluates the success of a replication in a dichotomous fashion. The newer, more useful view of replication focuses on effect size as the more important summary statistic of a study and evaluates the success of a replication not in a dichotomous but in a continuous fashion [Rosenthal, 1990b, page 28].

The dichotomous view of replication has been used throughout the history of parapsychology, by both parapsychologists and critics (Utts, 1988). For example, the National Academy of Sciences report critically evaluated "significant" experiments, but entirely ignored "nonsignificant" experiments.

In the next three sections, we will examine some of the results in parapsychology using the broader, more appropriate definition of replication. In doing so, we will show that the results are far more interesting than the critics would have us believe.

4. THE GANZFELD DEBATE IN PARAPSYCHOLOGY

An extensive debate took place in the mid-1980s between a parapsychologist and a critic, questioning whether or not a particular body of parapsychological data had demonstrated psi abilities. The experiments in question were all conducted using the ganzfeld setting (described below). Several authors were invited to write commentaries on the debate. As a result, this data base has been more thoroughly analyzed by both critics and proponents than any other and provides a good source for studying replication in parapsychology.

The debate concluded with a detailed series of recommendations for further experiments, and left open the question of whether or not psi abilities had been demonstrated. A new series of experiments that followed the recommendations was conducted over the next few years. The results of the new experiments will be presented in Section 5.

4.1 Free-Response Experiments

Recent experiments in parapsychology tend to use more complex target material than the cards and dice used in the early investigations, partially to alleviate boredom on the part of the subjects and partially because they are thought to "more nearly resemble the conditions of spontaneous psi occurrences" (Burdick and Kelly, 1977, page 109).
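Free-response experiments of this kind are typically scored against a chance hit rate of 1/4, either by counting direct hits or by summing the judge's ranks. The sketch below (plain Python; the trial counts, hit count and rank sum are hypothetical) shows both analyses: an exact binomial tail probability for direct hits, and a z-score for the sum of ranks, using the fact that a rank drawn uniformly from {1, 2, 3, 4} has mean 2.5 and variance 1.25.

```python
from math import comb, sqrt

def direct_hit_pvalue(hits, trials, p=0.25):
    """Exact one-tailed binomial P(X >= hits) under chance guessing."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(hits, trials + 1))

def rank_sum_z(rank_sum, trials):
    """z-score for a sum of ranks over independent trials; under chance
    each rank is uniform on {1, 2, 3, 4}, so the sum has mean 2.5n and
    variance 1.25n. Low rank sums indicate success, so negative z is 'good'."""
    return (rank_sum - 2.5 * trials) / sqrt(1.25 * trials)

# Hypothetical session: 30 trials with 15 direct hits (chance expects 7.5).
p_hits = direct_hit_pvalue(15, 30)

# Hypothetical rank data: rank sum of 60 over 30 trials (chance expects 75).
z_ranks = rank_sum_z(60, 30)

print(round(p_hits, 4), round(z_ranks, 2))
```

Doubling the chance hit rate over 30 trials already yields a small tail probability (well below 0.01), and a rank sum 15 below its expectation gives a z-score of about -2.4; both summaries feed naturally into the effect-size comparisons discussed above.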

These experiments fall under the general heading of "free-response" experiments, because the subject is asked to give a verbal or written description of the target, rather than being forced to make a choice from a small discrete set of possibilities. Various types of target material have been used, including pictures, short segments of movies on video tapes, actual locations and small objects.

Despite the more complex target material, the statistical methods used to analyze these experiments are similar to those for forced-choice experiments. A typical experiment proceeds as follows. Before conducting any trials, a large pool of potential targets is assembled, usually in packets of four. Similarity of targets within a packet is kept to a minimum, for reasons made clear below. At the start of an experimental session, after the subject is sequestered in an isolated room, a target is selected at random from the pool. A sender is placed in another room with the target. The subject is asked to provide a verbal or written description of what he or she thinks is in the target, knowing only that it is a photograph, an object, etc.

After the subject's description has been recorded and secured against the potential for later alteration, a judge (who may or may not be the subject) is given a copy of the subject's description and the four possible targets that were in the packet with the correct target. A properly conducted experiment either uses video tapes or has two identical sets of target material and uses the duplicate set for this part of the process, to ensure that clues such as fingerprints don't give away the answer. Based on the subject's description, and of course on a blind basis, the judge is asked to either rank the four choices from most to least likely to have been the target, or to select the one from the four that seems to best match the subject's description. If ranks are used, the statistical analysis proceeds by summing the ranks over a series of trials and comparing the sum to what would be expected by chance. If the selection method is used, a "direct hit" occurs if the correct target is chosen, and the number of direct hits over a series of trials is compared to the number expected in a binomial experiment with p = 0.25.

Note that the subjects' responses cannot be considered to be "random" in any sense, so probability assessments are based on the random selection of the target and decoys. In a correctly designed experiment, the probability of a direct hit by chance is 0.25 on each trial, regardless of the response, and the trials are independent. These and other issues related to analyzing free-response experiments are discussed by Utts (1991).

4.2 The Psi Ganzfeld Experiments

The psi ganzfeld experiments use a perceptual isolation technique originally developed by Gestalt psychologists for other purposes. Evidence from spontaneous case studies and experimental work had led parapsychologists to a model proposing that psychic functioning may be masked by sensory input and by inattention to internal states (Honorton, 1977). The ganzfeld procedure was specifically designed to test whether or not reduction of external "noise" would enhance psi performance.

In these experiments, the subject is placed in a comfortable reclining chair in an acoustically shielded room. To create a mild form of sensory deprivation, the subject wears headphones through which white noise is played, and stares into a constant field of red light. This is achieved by taping halved translucent ping-pong balls over the eyes and then illuminating the room with red light. In the psi ganzfeld experiments, the subject speaks into a microphone and attempts to describe the target material being observed by the sender in a distant room.

At the 1982 Annual Meeting of the Parapsychological Association, a debate took place over the degree to which the results of the psi ganzfeld experiments constituted evidence of psi abilities. Psychologist and critic Ray Hyman and parapsychologist Charles Honorton each analyzed the results of all known psi ganzfeld experiments to date, and they reached strikingly different conclusions (Honorton, 1985b; Hyman, 1985b). The debate continued with the publication of their arguments in separate articles in the March 1985 issue of the Journal of Parapsychology. Finally, in the December 1986 issue of the Journal of Parapsychology, Hyman and Honorton (1986) wrote a joint article in which they highlighted their agreements and disagreements and outlined detailed criteria for future experiments. That same issue contained commentaries on the debate by 10 other authors.

The data base analyzed by Hyman and Honorton (1986) consisted of results taken from 34 reports written by a total of 47 authors. Honorton counted 42 separate experiments described in the reports, of which 28 reported enough information to determine the number of direct hits achieved. Twenty-three of the studies (55%) were classified by Honorton as having achieved statistical significance at 0.05.

4.3 The Vote-Counting Debate

Vote-counting is the term commonly used for the technique of drawing inferences about an experimental effect by counting the number of significant versus nonsignificant studies of the effect. Hedges
4.2 The Psi Gandeld Experiments
and Olkin (1985) give a detailed analysis of the
The ganzfeld procedure is a particular kind of inadequacy of this method, showing that it is more
free-response experiment utilizing a perceptual and more likely to make the wrong decision as the
370 J. UTTS

number of studies increases. While Hyman acknowledged that "vote-counting raises many problems" (Hyman, 1985b, page 8), he nonetheless spent half of his critique of the ganzfeld studies showing why Honorton's count of 55% was wrong.

Hyman's first complaint was that several of the studies contained multiple conditions, each of which should be considered as a separate study. Using this definition he counted 80 studies (thus further reducing the sample sizes of the individual studies), of which 25 (31%) were "successful." Honorton's response to this was to invite readers to examine the studies and decide for themselves if the varying conditions constituted separate experiments.

Hyman next postulated that there was selection bias, so that significant studies were more likely to be reported. He raised some important issues about how pilot studies may be terminated and not reported if they don't show significant results, or may at least be subject to optional stopping, allowing the experimenter to determine the number of trials. He also presented a chi-square analysis that "suggests a tendency to report studies with a small sample only if they have significant results" (Hyman, 1985b, page 14), but I have questioned his analysis elsewhere (Utts, 1986, page 397).

Honorton refuted Hyman's argument with four rejoinders (Honorton, 1985b, page 66). In addition to reinterpreting Hyman's chi-square analysis, Honorton pointed out that the Parapsychological Association has an official policy encouraging the publication of nonsignificant results in its journals and proceedings, that a large number of reported ganzfeld studies did not achieve statistical significance and that there would have to be 15 studies in the "file-drawer" for every one reported to cancel out the observed significant results.

The remainder of Hyman's vote-counting analysis consisted of showing that the effective error rate for each study was actually much higher than the nominal 5%. For example, each study could have been analyzed using the direct hit measure, the sum of ranks measure or one of two other measures used for free-response analyses. Hyman carried out a simulation study that showed the true error rate would be 0.22 if "significance" was defined by requiring at least one of these four measures to achieve the 0.05 level. He suggested several other ways in which multiple testing could occur and concluded that the effective error rate in each experiment was not the nominal 0.05, but rather was probably close to the 31% he had determined to be the actual success rate in his vote-count.

Honorton acknowledged that there was a multiple testing problem, but he had a two-fold response. First, he applied a Bonferroni correction and found that the number of significant studies (using his definition of a study) only dropped from 55% to 45%. Next, he proposed that a uniform index of success be applied to all studies. He used the number of direct hits, since it was by far the most commonly reported measure and was the measure used in the first published psi ganzfeld study. He then conducted a detailed analysis of the 28 studies reporting direct hits and found that 43% were significant at 0.05 on that measure alone. Further, he showed that significant effects were reported by six of the 10 independent investigators and thus were not due to just one or two investigators or laboratories. He also noted that success rates were very similar for reports published in refereed journals and those published in unrefereed monographs and abstracts.

While Hyman's arguments identified issues such as selective reporting and optional stopping that should be considered in any meta-analysis, the dependence of significance levels on sample size makes the vote-counting technique almost useless for assessing the magnitude of the effect. Consider, for example, the 24 studies where the direct hit measure was reported and the chance probability of a direct hit was 0.25, the most common type of study in the data base. (There were four direct hit studies with other chance probabilities and 14 that did not report direct hits.) Of the 24 studies, 13 (54%) were "nonsignificant" at α = 0.05, one-tailed. But if the 367 trials in these "failed replications" are combined, there are 106 direct hits, z = 1.66, and p = 0.0485, one-tailed. This is reminiscent of the dilemma of Professor B in Section 3.

Power is typically very low for these studies. The median sample size for the studies reporting direct hits was 28. If there is a real effect and it increases the success probability from the chance 0.25 to an actual 0.33 (a value whose rationale will be made clear below), the power for a study with 28 trials is only 0.181 (Utts, 1986). It should be no surprise that there is a "repeatability" problem in parapsychology.

4.4 Flaw Analysis and Future Recommendations

The second half of Hyman's paper consisted of a "Meta-Analysis of Flaws and Successful Outcomes" (1985b, page 30), designed to explore whether or not various measures of success were related to specific flaws in the experiments. While many critics have argued that the results in parapsychology can be explained by experimental flaws, Hyman's analysis was the first to attempt to quantify the relationship between flaws and significant results.

Hyman identified 12 potential flaws in the ganzfeld experiments, such as inadequate random-
ization, multiple tests used without adjusting the significance level (thus inflating the significance level from the nominal 5%) and failure to use a duplicate set of targets for the judging process (thus allowing possible clues such as fingerprints). Using cluster and factor analyses, the 12 binary flaw variables were combined into three new variables, which Hyman named General Security, Statistics and Controls.

Several analyses were then conducted. The one reported with the most detail is a factor analysis utilizing 17 variables for each of 36 studies. Four factors emerged from the analysis. From these, Hyman concluded that security had increased over the years, that the significance level tended to be inflated the most for the most complex studies and that both effect size and level of significance were correlated with the existence of flaws.

Following his factor analysis, Hyman picked the three flaws that seemed to be most highly correlated with success, which were inadequate attention to both randomization and documentation and the potential for ordinary communication between the sender and receiver. A regression equation was then computed using each of the three flaws as dummy variables, and the effect size for the experiment as the dependent variable. From this equation, Hyman concluded that a study without these three flaws would be predicted to have a hit rate of 27%. He concluded that this is "well within the statistical neighborhood of the 25% chance rate" (1985b, page 37), and thus "the ganzfeld psi data base, despite initial impressions, is inadequate either to support the contention of a repeatable study or to demonstrate the reality of psi" (page 38).

Honorton discounted both Hyman's flaw classification and his analysis. He did not deny that flaws existed, but he objected that Hyman's analysis was faulty and impossible to interpret. Honorton asked psychometrician David Saunders to write an Appendix to his article, evaluating Hyman's analysis.

Saunders first criticized Hyman's use of a factor analysis with 17 variables (many of which were dichotomous) and only 36 cases and concluded that "the entire analysis is meaningless" (Saunders, 1985, page 87). He then noted that Hyman's choice of the three flaws to include in his regression analysis constituted a clear case of multiple analysis, since there were 84 possible sets of three that could have been selected (out of nine potential flaws), and Hyman chose the set most highly correlated with effect size. Again, Saunders concluded that "any interpretation drawn from [the regression analysis] must be regarded as meaningless" (1985, page 88).

Hyman's results were also contradicted by Harris and Rosenthal (1988b) in an analysis requested by Hyman in his capacity as Chair of the National Academy of Sciences' Subcommittee on Parapsychology. Using Hyman's flaw classifications and a multivariate analysis, Harris and Rosenthal concluded that "Our analysis of the effects of flaws on study outcome lends no support to the hypothesis that ganzfeld research results are a significant function of the set of flaw variables" (1988b, page 3).

Hyman and Honorton were in the process of preparing papers for a second round of debate when they were invited to lunch together at the 1986 Meeting of the Parapsychological Association. They discovered that they were in general agreement on several major issues, and they decided to coauthor a "Joint Communiqué" (Hyman and Honorton, 1986). It is clear from their paper that they both thought it was more important to set the stage for future experimentation than to continue the technical arguments over the current data base. In the abstract to their paper, they wrote:

    We agree that there is an overall significant effect in this data base that cannot reasonably be explained by selective reporting or multiple analysis. We continue to differ over the degree to which the effect constitutes evidence for psi, but we agree that the final verdict awaits the outcome of future experiments conducted by a broader range of investigators and according to more stringent standards [page 351].

The paper then outlined what these standards should be. They included controls against any kind of sensory leakage, thorough testing and documentation of randomization methods used, better reporting of judging and feedback protocols, control for multiple analyses and advance specification of number of trials and type of experiment. Indeed, any area of research could benefit from such a careful list of procedural recommendations.

4.5 Rosenthal's Meta-Analysis

The same issue of the Journal of Parapsychology in which the Joint Communiqué appeared also carried commentaries on the debate by 10 separate authors. In his commentary, psychologist Robert Rosenthal, one of the pioneers of meta-analysis in psychology, summarized the aspects of Hyman's and Honorton's work that would typically be included in a meta-analysis (Rosenthal, 1986). It is worth reviewing Rosenthal's results so that they can be used as a basis of comparison for the more recent psi ganzfeld studies reported in Section 5.

Rosenthal, like Hyman and Honorton, focused only on the 28 studies for which direct hits were known. He chose to use an effect size measure
called Cohen's h, which is the difference between the arcsin transformed proportions of direct hits that were observed and expected:

    h = 2(arcsin sqrt(p1) - arcsin sqrt(p2)),

where p1 is the observed proportion of direct hits and p2 the proportion expected by chance. One advantage of this measure over the difference in raw proportions is that it can be used to compare experiments with different chance hit rates. If the observed and expected numbers of hits were identical, the effect size would be zero. Of the 28 studies, 23 (82%) had effect sizes greater than zero, with a median effect size of 0.32 and a mean of 0.28. These correspond to direct hit rates of 0.40 and 0.38 respectively, when 0.25 is expected by chance. A 95% confidence interval for the true effect size is from 0.11 to 0.45, corresponding to direct hit rates of from 0.30 to 0.46 when chance is 0.25.

A common technique in meta-analysis is to calculate a "combined z," found by summing the individual z scores and dividing by the square root of the number of studies. The result should have a standard normal distribution if each z score has a standard normal distribution. For the ganzfeld studies, Rosenthal reported a combined z of 6.60 with a p-value of 3.37 x 10^-11. He also reiterated Honorton's file-drawer assessment by calculating that there would have to be 423 studies unreported to negate the significant effect in the 28 direct hit studies.

Finally, Rosenthal acknowledged that, because of the flaws in the data base and the potential for at least a small file-drawer effect, the true average effect size was probably closer to 0.18 than 0.28. He concluded, "Thus, when the accuracy rate expected under the null is 1/4, we might estimate the obtained accuracy rate to be about 1/3" (1986, page 333). This is the value used for the earlier power calculation.

It is worth mentioning that Rosenthal was commissioned by the National Academy of Sciences to prepare a background paper to accompany its 1988 report on parapsychology. That paper (Harris and Rosenthal, 1988a) contained much of the same analysis as his commentary summarized above. Ironically, the discussion of the ganzfeld work in the National Academy Report focused on Hyman's 1985 analysis, but never mentioned the work it had commissioned Rosenthal to perform, which contradicted the final conclusion in the report.

5. A META-ANALYSIS OF RECENT GANZFELD EXPERIMENTS

After the initial exchange with Hyman at the 1982 Parapsychological Association Meeting, Honorton and his colleagues developed an automated ganzfeld experiment that was designed to eliminate the methodological flaws identified by Hyman. The execution and reporting of the experiments followed the detailed guidelines agreed upon by Hyman and Honorton.

Using this "autoganzfeld" experiment, 11 experimental series were conducted by eight experimenters between February 1983 and September 1989, when the equipment had to be dismantled due to lack of funding. In this section, the results of these experiments are summarized and compared to the earlier ganzfeld studies. Much of the information is derived from Honorton et al. (1990).

5.1 The Automated Ganzfeld Procedure

Like earlier ganzfeld studies, the "autoganzfeld" experiments require four participants. The first is the Receiver (R), who attempts to identify the target material being observed by the Sender (S). The Experimenter (E) prepares R for the task, elicits the response from R and supervises R's judging of the response against the four potential targets. (Judging is double blind; E does not know which is the correct target.) The fourth participant is the lab assistant (LA) whose only task is to instruct the computer to randomly select the target. No one involved in the experiment knows the identity of the target.

Both R and S are sequestered in sound-isolated, electrically shielded rooms. R is prepared as in earlier ganzfeld studies, with white noise and a field of red light. In a nonadjacent room, S watches the target material on a television and can hear R's target description ("mentation") as it is being given. The mentation is also tape recorded.

The judging process takes place immediately after the 30-minute sending period. On a TV monitor in the isolated room, R views the four choices from the target pack that contains the actual target. R is asked to rate each one according to how closely it matches the ganzfeld mentation. The ratings are converted to ranks and, if the correct target is ranked first, a direct hit is scored. The entire process is automatically recorded by the computer. The computer then displays the correct choice to R as feedback.

There were 160 preselected targets, used with replacement, in 10 of the 11 series. They were arranged in packets of four, and the decoys for a given target were always the remaining three in the same set. Thus, even if a particular target in a set were consistently favored by Rs, the probability of a direct hit under the null hypothesis would remain at 1/4. Popular targets should be no more
likely to be selected by the computer's random number generator than any of the others in the set. The selection of the target by the computer is the only source of randomness in these experiments. This is an important point, and one that is often misunderstood. (See Utts, 1991, for elucidation.) Eighty of the targets were "dynamic," consisting of scenes from movies, documentaries and cartoons; 80 were "static," consisting of photographs, art prints and advertisements. The four targets within each set were all of the same type. Earlier studies indicated that dynamic targets were more likely to produce successful results, and one of the goals of the new experiments was to test that theory.

The randomization procedure used to select the target and the order of presentation for judging was thoroughly tested before and during the experiments. A detailed description is given by Honorton et al. (1990, pages 118-120).

Three of the 11 series were pilot series, five were formal series with novice receivers, and three were formal series with experienced receivers. The last series with experienced receivers was the only one that did not use the 160 targets. Instead, it used only one set of four dynamic targets in which one target had previously received several first place ranks and one had never received a first place rank. The receivers, none of whom had had prior exposure to that target pack, were not aware that only one target pack was being used. They each contributed one session only to the series. This will be called the "special series" in what follows.

Except for two of the pilot series, numbers of trials were planned in advance for each series. Unfortunately, three of the formal series were not yet completed when the funding ran out, including the special series, and one pilot study with advance planning was terminated early when the experimenter relocated. There were no unreported trials during the 6-year period under review, so there was no "file drawer."

Overall, there were 183 Rs who contributed only one trial and 58 who contributed more than one, for a total of 241 participants and 355 trials. Only 23 Rs had previously participated in ganzfeld experiments, and 194 Rs (81%) had never participated in any parapsychological research.

5.2 Results

While acknowledging that no probabilistic conclusions can be drawn from qualitative data, Honorton et al. (1990) included several examples of session excerpts that Rs identified as providing the basis for their target rating. To give a flavor for the dream-like quality of the mentation and the amount of information that can be lost by only assigning a rank, the first example is reproduced here. The target was a painting by Salvador Dali called "Christ Crucified." The correct target received a first place rank. The part of the mentation R used to make this assessment read:

    . . . I think of guides, like spirit guides, leading me and I come into a court with a king. It's quiet. . . . It's like heaven. The king is something like Jesus. Woman. Now I'm just sort of summersaulting through heaven. . . . Brooding. . . . Aztecs, the Sun God. . . . High priest. . . . Fear. . . . Graves. Woman. Prayer. . . . Funeral. . . . Dark. Death. . . . Souls. . . . Ten Commandments. Moses. . . . [Honorton et al., 1990].

Over all 11 series, there were 122 direct hits in the 355 trials, for a hit rate of 34.4% (exact binomial p-value = 0.00005) when 25% were expected by chance. Cohen's h is 0.20, and a 95% confidence interval for the overall hit rate is from 0.30 to 0.39. This calculation assumes, of course, that the probability of a direct hit is constant and independent across trials, an assumption that may be questionable except under the null hypothesis of no psi abilities.

Honorton et al. (1990) also calculated effect sizes for each of the 11 series and each of the eight experimenters. All but one of the series (the first novice series) had positive effect sizes, as did all of the experimenters.

The special series with experienced Rs had an exceptionally high effect size with h = 0.81, corresponding to 16 direct hits out of 25 trials (64%), but the remaining series and the experimenters had relatively homogeneous effect sizes given the amount of variability expected by chance. If the special series is removed, the overall hit rate is 32.1%, h = 0.16. Thus, the positive effects are not due to just one series or one experimenter.

Of the 218 trials contributed by novices, 71 were direct hits (32.5%, h = 0.17), compared with 51 hits in the 137 trials by those with prior ganzfeld experience (37%, h = 0.26). The hit rates and effect sizes were 31% (h = 0.14) for the combined pilot series, 32.5% (h = 0.17) for the combined formal novice series, and 41.5% (h = 0.35) for the combined experienced series. The last figure drops to 31.6% if the outlier series is removed. Finally, without the outlier series the hit rate for the combined series where all of the planned trials were completed was 31.2% (h = 0.14), while it was 35% (h = 0.22) for the combined series that were terminated early. Thus, optional stopping cannot account for the positive effect.
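Several of the summary statistics quoted in this section and in Sections 4.3 and 4.5 can be checked directly from the counts given. The following short calculation (a sketch in Python, standard library only) is one way to do so; the continuity correction in the first step and the one-tailed 1.645 criterion in the fail-safe computation are assumptions about how the published figures were obtained.

```python
from math import asin, comb, erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p0 = 0.25  # chance probability of a direct hit with four choices

# Section 4.3: pooling the 13 "nonsignificant" studies gives 106 direct
# hits in 367 trials.  A normal approximation with continuity correction
# yields the z of about 1.66 and one-tailed p of about 0.049 quoted there.
n, hits = 367, 106
z = (hits - 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))
p_one_tailed = 1.0 - normal_cdf(z)

# Section 4.3: exact power of a 28-trial study at one-tailed alpha = 0.05
# when the true hit rate is 0.33 rather than the chance 0.25.
n28 = 28
crit = min(k for k in range(n28 + 1) if binom_tail(k, n28, p0) <= 0.05)
power = binom_tail(crit, n28, 0.33)  # roughly 0.18

# This section: 122 direct hits in 355 autoganzfeld trials.
n_auto, hits_auto = 355, 122
p_hat = hits_auto / n_auto
h = 2 * (asin(sqrt(p_hat)) - asin(sqrt(p0)))  # Cohen's h, about 0.2
p_exact = binom_tail(hits_auto, n_auto, p0)   # about 0.00005

# Section 4.5: Rosenthal's fail-safe N for the 28 direct-hit studies,
# computed from the combined z of 6.60 at a one-tailed 0.05 criterion.
k_studies, z_combined = 28, 6.60
fail_safe = (z_combined * sqrt(k_studies)) ** 2 / 1.645 ** 2 - k_studies
```

Under these conventions the calculation reproduces the published figures to rounding error: z of about 1.66, power of about 0.18, h of about 0.21 (reported as 0.20), an exact binomial p near 0.00005, and a fail-safe N of about 423.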
There were two interesting comparisons that had been suggested by earlier work and were preplanned in these experiments. The first was to compare results for trials with dynamic targets with those for static targets. In the 190 dynamic target sessions there were 77 direct hits (40%, h = 0.32) and for the static targets there were 45 hits in 165 trials (27%, h = 0.05), thus indicating that dynamic targets produced far more successful results.

The second comparison of interest was whether or not the sender was a friend of the receiver. This was a choice the receiver could make. If he or she did not bring a friend, a lab member acted as sender. There were 211 trials with friends as senders (some of whom were also lab staff), resulting in 76 direct hits (36%, h = 0.24). Four trials used no sender. The remaining 140 trials used nonfriend lab staff as senders and resulted in 46 direct hits (33%, h = 0.18). Thus, trials with friends as senders were slightly more successful than those without.

Consonant with the definition of replication based on consistent effect sizes, it is informative to compare the autoganzfeld experiments with the direct hit studies in the previous data base. The overall success rates are extremely similar. The overall direct hit rate was 34.4% for the autoganzfeld studies and was 38% for the comparable direct hit studies in the earlier meta-analysis. Rosenthal's (1986) adjustment for flaws had placed a more conservative estimate at 33%, very close to the observed 34.4% in the new studies.

One limitation of this work is that the autoganzfeld studies, while conducted by eight experimenters, all used the same equipment in the same laboratory. Unfortunately, the level of funding available in parapsychology and the cost in time and equipment to conduct proper experiments make it difficult to amass large amounts of data across laboratories. Another autoganzfeld laboratory is currently being constructed at the University of Edinburgh in Scotland, so interlaboratory comparisons may be possible in the near future. Based on the effect size observed to date, large samples are needed to achieve reasonable power. If there is a constant effect across all trials, resulting in 33% direct hits when 25% are expected by chance, to achieve a one-tailed significance level of 0.05 with 95% probability would require 345 sessions.

We end this section by returning to the aspirin and heart attack example in Section 3 and expanding a comparison noted by Atkinson, Atkinson, Smith and Bem (1990, page 237). Computing the equivalent of Cohen's h for comparing observed heart attack rates in the aspirin and placebo groups results in h = 0.068. Thus, the effect size observed in the ganzfeld data base is triple the much publicized effect of aspirin on heart attacks.

6. OTHER META-ANALYSES IN PARAPSYCHOLOGY

Four additional meta-analyses have been conducted in various areas of parapsychology since the original ganzfeld meta-analyses were reported. Three of the four analyses focused on evidence of psi abilities, while the fourth examined the relationship between extroversion and psychic functioning. In this section, each of the four analyses will be briefly summarized.

There are only a handful of English-language journals and proceedings in parapsychology, so retrieval of the relevant studies in each of the four cases was simple to accomplish by searching those sources in detail and by searching other bibliographic data bases for keywords. Each analysis included an overall summary, an analysis of the quality of the studies versus the size of the effect and a "file-drawer" analysis to determine the possible number of unreported studies. Three of the four also contained comparisons across various conditions.

6.1 Forced-Choice Precognition Experiments

Honorton and Ferrari (1989) analyzed forced-choice experiments conducted from 1935 to 1987, in which the target material was randomly selected after the subject had attempted to predict what it would be. The time delay in selecting the target ranged from under a second to one year. Target material included items as diverse as ESP cards and automated random number generators. Two investigators, S. G. Soal and Walter J. Levy, were not included because some of their work has been suspected to be fraudulent.

Overall Results. There were 309 studies reported by 62 senior authors, including more than 50,000 subjects and nearly two million individual trials. Honorton and Ferrari used z / sqrt(n) as the measure of effect size (ES) for each study, where n was the number of Bernoulli trials in the study. They reported a mean ES of 0.020, and a mean z-score of 0.65 over all studies. They also reported a combined z of 11.41, p = 6.3 x 10^-25. Some 30% (92) of the studies were statistically significant at α = 0.05. The mean ES per investigator was 0.033, and the significant results were not due to just a few investigators.

Quality. Eight dichotomous quality measures were assigned to each study, resulting in possible
scores from zero for the lowest quality, to eight for the highest. They included features such as adequate randomization, preplanned analysis and automated recording of the results. The correlation between study quality and effect size was 0.081, indicating a slight tendency for higher quality studies to be more successful, contrary to claims by critics that the opposite would be true. There was a clear relationship between quality and year of publication, presumably because over the years experimenters in parapsychology have responded to suggestions from critics for improving their methodology.

File Drawer. Following Rosenthal (1984), the authors calculated the "fail-safe N" indicating the number of unreported studies that would have to be sitting in file drawers in order to negate the significant effect. They found N = 14,268, or a ratio of 46 unreported studies for each one reported. They also followed a suggestion by Dawes, Landman and Williams (1984) and computed the mean z for all studies with z > 1.65. If such studies were a random sample from the upper 5% tail of a N(0, 1) distribution, the mean z would be 2.06. In this case it was 3.61. They concluded that selective reporting could not explain these results.

Comparisons. Four variables were identified that appeared to have a systematic relationship to study outcome. The first was that the 25 studies using subjects selected on the basis of good past performance were more successful than the 223 using unselected subjects, with mean effect sizes of 0.051 and 0.008, respectively. Second, the 97 studies testing subjects individually were more successful than the 105 studies that used group testing; mean effect sizes were 0.021 and 0.004, respectively. Timing of feedback was the third moderating variable, but information was only available for 104 studies. The 15 studies that never told the subjects what the targets were had a mean effect size of -0.001. Feedback after each trial produced the best results; the mean ES for the 47 studies was 0.035. Feedback after each set of trials resulted in mean ES of 0.023 (21 studies), while delayed feedback (also 21 studies) yielded a mean ES of only 0.009. There is a clear ordering; as the gap between time of feedback and time of the actual guesses decreased, effect sizes increased.

The fourth variable was the time interval between the subject's guess and the actual target selection, available for 144 studies. The best results were for the 31 studies that generated targets less than a second after the guess (mean ES = 0.045), while the worst were for the seven studies that delayed target selection by at least a month (mean ES = 0.001). The mean effect sizes showed a clear trend, decreasing in order as the time interval increased from minutes to hours to days to weeks to months.

6.2 Attempts to Influence Random Physical Systems

Radin and Nelson (1989) examined studies designed to test the hypothesis that "The statistical output of an electronic RNG [random number generator] is correlated with observer intention in accordance with prespecified instructions" (page 1502). These experiments typically involve RNGs based on radioactive decay, electronic noise or pseudorandom number sequences seeded with true random sources. Usually the subject is instructed to try to influence the results of a string of binary trials by mental intention alone. A typical protocol would ask a subject to press a button (thus starting the collection of a fixed-length sequence of bits), and then try to influence the random source to produce more zeroes or more ones. A run might consist of three successive button presses, one each in which the desired result was more zeroes or more ones, and one as a control with no conscious intention. A z score would then be computed for each button press.

The 832 studies in the analysis were conducted from 1959 to 1987 and included 235 "control" studies, in which the output of the RNGs was recorded but there was no conscious intention involved. These were usually conducted before and during the experimental series, as tests of the RNGs.

Results. The effect size measure used was again z / sqrt(n), where z was positive if more bits of the specified type were achieved. The mean effect size for control studies was not significantly different from zero (-1.0 x 10^-5). The mean effect size for the experimental studies was also very small, 3.2 x 10^-4, but it was significantly higher than the mean ES for the control studies (z = 4.1).

Quality. Sixteen quality measures were defined and assigned to each study, under the four general categories of procedures, statistics, data and the RNG device. A score of 16 reflected the highest quality. The authors regressed mean effect size on mean quality for each investigator and found a slope of 2.5 x 10^-5 with standard error of 3.2 x 10^-5, indicating little relationship between quality and outcome. They also calculated a weighted mean effect size, using quality scores as weights, and found that it was very similar to the unweighted mean ES. They concluded that "differences in methodological quality are not significant predictors of effect size" (page 1507).

File Drawer. Radin and Nelson used several methods for estimating the number of unreported
UTTS

studies (pages 1508-1510). Their estimates ranged from 200 to 1000, based on models assuming that all significant studies were reported. They calculated the fail-safe N to be 54,000.

6.3 Attempts to Influence Dice

Radin and Ferrari (1991) examined 148 studies, published from 1935 to 1987, designed to test whether or not consciousness can influence the results of tossing dice. They also found 31 "control" studies in which no conscious intention was involved.

Results. The effect size measure used was z/√n, where z was based on the number of throws in which the die landed with the desired face (or faces) up, in n throws. The weighted mean ES for the experimental studies was 0.0122 with a standard error of 0.00062; for the control studies the mean and standard error were 0.00093 and 0.00255, respectively. Weights for each study were determined by quality, giving more weight to high-quality studies. Combined z scores for the experimental and control studies were reported by Radin and Ferrari to be 18.2 and 0.18, respectively.

Quality. Eleven dichotomous quality measures were assigned, ranging from automated recording to whether or not control studies were interspersed with the experimental studies. The final quality score for each study combined these with information on method of tossing the dice, and with source of subject (defined below). A regression of quality score versus effect size resulted in a slope of -0.002, with a standard error of 0.0011. However, when effect sizes were weighted by sample size, there was a significant relationship between quality and effect size, leading Radin and Ferrari to conclude that higher-quality studies produced lower weighted effect sizes.

File Drawer. Radin and Ferrari calculated Rosenthal's fail-safe N for this analysis to be 17,974. Using the assumption that all significant studies were reported, they estimated the number of unreported studies to be 1152. As a final assessment, they compared studies published before and after 1975, when the Journal of Parapsychology adopted an official policy of publishing nonsignificant results. They concluded, based on that analysis, that more nonsignificant studies were published after 1975, and thus "We must consider the overall (1935-1987) data base as suspect with respect to the filedrawer problem."

Comparisons. Radin and Ferrari noted that there was bias in both the experimental and control studies across die face. Six was the face most likely to come up, consistent with the observation that it has the least mass. Therefore, they examined results for the subset of 69 studies in which targets were evenly balanced among the six faces. They still found a significant effect, with mean and standard error for effect size of 8.6 × 10⁻³ and 1.1 × 10⁻³, respectively. The combined z was 7.617 for these studies.

They also compared effect sizes across types of subjects used in the studies, categorizing them as unselected, experimenter and other subjects, experimenter as sole subject, and specially selected subjects. Like Honorton and Ferrari (1989), they found the highest mean ES for studies with selected subjects; it was approximately 0.02, more than twice that for unselected subjects.

6.4 Extroversion and ESP Performance

Honorton, Ferrari and Bem (1991) conducted a meta-analysis to examine the relationship between scores on tests of extroversion and scores on psi-related tasks. They found 60 studies by 17 investigators, conducted from 1945 to 1983.

Results. The effect size measure used for this analysis was the correlation between each subject's extroversion score and ESP score. A variety of measures had been used for both scores across studies, so various correlation coefficients were used. Nonetheless, a stem and leaf diagram of the correlations showed an approximate bell shape with mean and standard deviation of 0.19 and 0.26, respectively, and with an additional outlier at r = 0.91. Honorton et al. reported that when weighted by degrees of freedom, the weighted mean r was 0.14, with a 95% confidence interval covering 0.10 to 0.19.

Forced-Choice versus Free-Response Results. Because forced-choice and free-response tests differ qualitatively, Honorton et al. chose to examine their relationship to extroversion separately. They found that for free-response studies there was a significant correlation between extroversion and ESP scores, with mean r = 0.20 and z = 4.46. Further, this effect was homogeneous across both investigators and extroversion scales.

For forced-choice studies, there was a significant correlation between ESP and extroversion, but only for those studies that reported the ESP results to the subjects before measuring extroversion. Honorton et al. speculated that the relationship was an artifact, in which extroversion scores were temporarily inflated as a result of positive feedback on ESP performance.

Confirmation with New Data. Following the extroversion/ESP meta-analysis, Honorton et al. attempted to confirm the relationship using the autoganzfeld data base. Extroversion scores based on the Myers-Briggs Type Indicator were available for 221 of the 241 subjects who had participated in autoganzfeld studies.
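Two of the numerical claims in these meta-analyses are easy to verify with Python's standard library: the expected mean z of about 2.06 for studies selected from the upper 5% tail of a N(0, 1) distribution (the selective-reporting check of Dawes, Landman and Williams), and the 95% confidence interval of roughly 0.05 to 0.30 for the autoganzfeld extroversion correlation (r = 0.18, n = 221). The interval check assumes the standard Fisher z transformation, which may not be exactly the method the original authors used. A sketch:

```python
from statistics import NormalDist
from math import atanh, tanh, sqrt

nd = NormalDist()

# 1. Selective-reporting check: if only studies with z above the upper 5%
# cutoff were reported, and z-scores are really N(0, 1), the reported z's
# have mean phi(c) / (1 - Phi(c)), the inverse Mills ratio.
c = nd.inv_cdf(0.95)                      # cutoff, about 1.645
tail_mean = nd.pdf(c) / (1 - nd.cdf(c))   # expected mean z, about 2.06
print(f"expected mean z under selective reporting: {tail_mean:.2f}")

# 2. A 95% confidence interval for a correlation via Fisher's z transform
# (r = 0.18 and n = 221 are the autoganzfeld figures quoted in the text).
r, n = 0.18, 221
z = atanh(r)                  # Fisher z transform of r
se = 1 / sqrt(n - 3)          # approximate standard error on the z scale
half = nd.inv_cdf(0.975) * se
lo, hi = tanh(z - half), tanh(z + half)
print(f"95% CI for r: ({lo:.2f}, {hi:.2f})")
```

Both computations agree with the values quoted in the text (2.06, and an interval of about 0.05 to 0.30) under these assumptions.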
REPLICATION IN PARAPSYCHOLOGY 377
The correlation between extroversion scores and ganzfeld rating scores was r = 0.18, with a 95% confidence interval from 0.05 to 0.30. This is consistent with the mean correlation of r = 0.20 for free-response experiments, determined from the meta-analysis. These correlations indicate that extroverted subjects can produce higher scores in free-response ESP tests.

7. CONCLUSIONS

Parapsychologists often make a distinction between "proof-oriented research" and "process-oriented research." The former is typically conducted to test the hypothesis that psi abilities exist, while the latter is designed to answer questions about how psychic functioning works. Proof-oriented research has dominated the literature in parapsychology. Unfortunately, many of the studies used small samples and would thus be nonsignificant even if a moderate-sized effect exists.

The recent focus on meta-analysis in parapsychology has revealed that there are small but consistently nonzero effects across studies, experimenters and laboratories. The sizes of the effects in forced-choice studies appear to be comparable to those reported in some medical studies that had been heralded as breakthroughs. (See Section 5; also Honorton and Ferrari, 1989, page 301.) Free-response studies show effect sizes of far greater magnitude.

A promising direction for future process-oriented research is to examine the causes of individual differences in psychic functioning. The ESP/extroversion meta-analysis is a step in that direction. In keeping with the idea of individual differences, Bayes and empirical Bayes methods would appear to make more sense than the classical inference methods commonly used, since they would allow individual abilities and beliefs to be modeled. Jefferys (1990) reported a Bayesian analysis of some of the RNG experiments and showed that conclusions were closely tied to prior beliefs even though hundreds of thousands of trials were available.

It may be that the nonzero effects observed in the meta-analyses can be explained by something other than ESP, such as shortcomings in our understanding of randomness and independence. Nonetheless, there is an anomaly that needs an explanation. As I have argued elsewhere (Utts, 1987), research in parapsychology should receive more support from the scientific community. If ESP does not exist, there is little to be lost by erring in the direction of further research, which may in fact uncover other anomalies. If ESP does exist, there is much to be lost by not doing process-oriented research, and much to be gained by discovering how to enhance and apply these abilities to important world problems.

ACKNOWLEDGMENTS

I would like to thank Deborah Delanoy, Charles Honorton, Wesley Johnson, Scott Plous and an anonymous reviewer for their helpful comments on an earlier draft of this paper, and Robert Rosenthal and Charles Honorton for discussions that helped clarify details.

REFERENCES

ATKINSON, R. L., ATKINSON, R. C., SMITH, E. E. and BEM, D. J. (1990). Introduction to Psychology, 10th ed. Harcourt Brace Jovanovich, San Diego.
BELOFF, J. (1985). Research strategies for dealing with unstable phenomena. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 1-21. Parapsychology Foundation, New York.
BLACKMORE, S. J. (1985). Unrepeatability: Parapsychology's only finding. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 183-206. Parapsychology Foundation, New York.
BURDICK, D. S. and KELLY, E. F. (1977). Statistical methods in parapsychological research. In Handbook of Parapsychology (B. B. Wolman, ed.) 81-130. Van Nostrand Reinhold, New York.
CAMP, B. H. (1937). (Statement in Notes Section.) Journal of Parapsychology 1 305.
COHEN, J. (1990). Things I have learned (so far). American Psychologist 45 1304-1312.
COOVER, J. E. (1917). Experiments in Psychical Research at Leland Stanford Junior University. Stanford Univ.
DAWES, R. M., LANDMAN, J. and WILLIAMS, J. (1984). Reply to Kurosawa. American Psychologist 39 74-75.
DIACONIS, P. (1978). Statistical problems in ESP research. Science 201 131-136.
DOMMEYER, F. C. (1975). Psychical research at Stanford University. Journal of Parapsychology 39 173-205.
DRUCKMAN, D. and SWETS, J. A., eds. (1988). Enhancing Human Performance: Issues, Theories, and Techniques. National Academy Press, Washington, D.C.
EDGEWORTH, F. Y. (1885). The calculus of probabilities applied to psychical research. In Proceedings of the Society for Psychical Research 3 190-199.
EDGEWORTH, F. Y. (1886). The calculus of probabilities applied to psychical research. II. In Proceedings of the Society for Psychical Research 4 189-208.
FELLER, W. K. (1940). Statistical aspects of ESP. Journal of Parapsychology 4 271-297.
FELLER, W. K. (1968). An Introduction to Probability Theory and Its Applications 1, 3rd ed. Wiley, New York.
FISHER, R. A. (1924). A method of scoring coincidences in tests with playing cards. In Proceedings of the Society for Psychical Research 34 181-185.
FISHER, R. A. (1929). The statistical method in psychical research. In Proceedings of the Society for Psychical Research 39 189-192.
GALLUP, G. H., JR. and NEWPORT, F. (1991). Belief in paranormal phenomena among adult Americans. Skeptical Inquirer 15 137-146.
GARDNER, M. J. and ALTMAN, D. G. (1986). Confidence intervals rather than p-values: Estimation rather than hypothesis testing. British Medical Journal 292 746-750.

GILMORE, J. B. (1989). Randomness and the search for psi. Journal of Parapsychology 53 309-340.
GILMORE, J. B. (1990). Anomalous significance in pararandom and psi-free domains. Journal of Parapsychology 54 53-58.
GREELEY, A. (1987). Mysticism goes mainstream. American Health 7 47-49.
GREENHOUSE, J. B. and GREENHOUSE, S. W. (1988). An aspirin a day...? Chance 1 24-31.
GREENWOOD, J. A. and STUART, C. E. (1940). A review of Dr. Feller's critique. Journal of Parapsychology 4 299-319.
HACKING, I. (1988). Telepathy: Origins of randomization in experimental design. Isis 79 427-451.
HANSEL, C. E. M. (1980). ESP and Parapsychology: A Critical Re-evaluation. Prometheus Books, Buffalo, N.Y.
HARRIS, M. J. and ROSENTHAL, R. (1988a). Interpersonal Expectancy Effects and Human Performance Research. National Academy Press, Washington, D.C.
HARRIS, M. J. and ROSENTHAL, R. (1988b). Postscript to Interpersonal Expectancy Effects and Human Performance Research. National Academy Press, Washington, D.C.
HEDGES, L. V. and OLKIN, I. (1985). Statistical Methods for Meta-Analysis. Academic, Orlando, Fla.
HONORTON, C. (1977). Psi and internal attention states. In Handbook of Parapsychology (B. B. Wolman, ed.) 435-472. Van Nostrand Reinhold, New York.
HONORTON, C. (1985a). How to evaluate and improve the replicability of parapsychological effects. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 238-255. Parapsychology Foundation, New York.
HONORTON, C. (1985b). Meta-analysis of psi ganzfeld research: A response to Hyman. Journal of Parapsychology 49 51-91.
HONORTON, C., BERGER, R. E., VARVOGLIS, M. P., QUANT, M., DERR, P., SCHECHTER, E. I. and FERRARI, D. C. (1990). Psi communication in the ganzfeld: Experiments with an automated testing system and a comparison with a meta-analysis of earlier studies. Journal of Parapsychology 54 99-139.
HONORTON, C. and FERRARI, D. C. (1989). "Future telling": A meta-analysis of forced-choice precognition experiments, 1935-1987. Journal of Parapsychology 53 281-308.
HONORTON, C., FERRARI, D. C. and BEM, D. J. (1991). Extraversion and ESP performance: A meta-analysis and a new confirmation. Research in Parapsychology 1990. The Scarecrow Press, Metuchen, N.J. To appear.
HYMAN, R. (1985a). A critical overview of parapsychology. In A Skeptic's Handbook of Parapsychology (P. Kurtz, ed.) 1-96. Prometheus Books, Buffalo, N.Y.
HYMAN, R. (1985b). The ganzfeld psi experiment: A critical appraisal. Journal of Parapsychology 49 3-49.
HYMAN, R. and HONORTON, C. (1986). Joint communiqué: The psi ganzfeld controversy. Journal of Parapsychology 50 351-364.
IVERSEN, G. R., LONGCOR, W. H., MOSTELLER, F., GILBERT, J. P. and YOUTZ, C. (1971). Bias and runs in dice throwing and recording: A few million throws. Psychometrika 36 1-19.
JEFFERYS, W. H. (1990). Bayesian analysis of random event generator data. Journal of Scientific Exploration 4 153-169.
LINDLEY, D. V. (1957). A statistical paradox. Biometrika 44 187-192.
MAUSKOPF, S. H. and MCVAUGH, M. (1979). The Elusive Science: Origins of Experimental Psychical Research. Johns Hopkins Univ. Press.
MCVAUGH, M. R. and MAUSKOPF, S. H. (1976). J. B. Rhine's Extrasensory Perception and its background in psychical research. Isis 67 161-189.
NEULIEP, J. W., ed. (1990). Handbook of replication research in the behavioral and social sciences. Journal of Social Behavior and Personality 5 (4) 1-510.
OFFICE OF TECHNOLOGY ASSESSMENT (1989). Report of a workshop on experimental parapsychology. Journal of the American Society for Psychical Research 83 317-339.
PALMER, J. (1989). A reply to Gilmore. Journal of Parapsychology 53 341-344.
PALMER, J. (1990). Reply to Gilmore: Round two. Journal of Parapsychology 54 59-61.
PALMER, J. A., HONORTON, C. and UTTS, J. (1989). Reply to the National Research Council study on parapsychology. Journal of the American Society for Psychical Research 83 31-49.
RADIN, D. I. and FERRARI, D. C. (1991). Effects of consciousness on the fall of dice: A meta-analysis. Journal of Scientific Exploration 5 61-83.
RADIN, D. I. and NELSON, R. D. (1989). Evidence for consciousness-related anomalies in random physical systems. Foundations of Physics 19 1499-1514.
RAO, K. R. (1985). Replication in conventional and controversial sciences. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 22-41. Parapsychology Foundation, New York.
RHINE, J. B. (1934). Extrasensory Perception. Boston Society for Psychical Research, Boston. (Reprinted by Branden Press, 1964.)
RHINE, J. B. (1977). History of experimental studies. In Handbook of Parapsychology (B. B. Wolman, ed.) 25-47. Van Nostrand Reinhold, New York.
RICHET, C. (1884). La suggestion mentale et le calcul des probabilités. Revue Philosophique 18 608-674.
ROSENTHAL, R. (1984). Meta-Analytic Procedures for Social Research. Sage, Beverly Hills.
ROSENTHAL, R. (1986). Meta-analytic procedures and the nature of replication: The ganzfeld debate. Journal of Parapsychology 50 315-336.
ROSENTHAL, R. (1990a). How are we doing in soft psychology? American Psychologist 45 775-777.
ROSENTHAL, R. (1990b). Replication in behavioral research. Journal of Social Behavior and Personality 5 1-30.
SAUNDERS, D. R. (1985). On Hyman's factor analysis. Journal of Parapsychology 49 86-88.
SHAPIN, B. and COLY, L., eds. (1985). The Repeatability Problem in Parapsychology. Parapsychology Foundation, New York.
SPENCER-BROWN, G. (1957). Probability and Scientific Inference. Longmans Green, London and New York.
STUART, C. E. and GREENWOOD, J. A. (1937). A review of criticisms of the mathematical evaluation of ESP data. Journal of Parapsychology 1 295-304.
TVERSKY, A. and KAHNEMAN, D. (1982). Belief in the law of small numbers. In Judgment Under Uncertainty: Heuristics and Biases (D. Kahneman, P. Slovic and A. Tversky, eds.) 23-31. Cambridge Univ. Press.
UTTS, J. (1986). The ganzfeld debate: A statistician's perspective. Journal of Parapsychology 50 395-402.
UTTS, J. (1987). Psi, statistics, and society. Behavioral and Brain Sciences 10 615-616.
UTTS, J. (1988). Successful replication versus statistical significance. Journal of Parapsychology 52 305-320.
UTTS, J. (1989). Randomness and randomization tests: A reply to Gilmore. Journal of Parapsychology 53 345-351.
UTTS, J. (1991). Analyzing free-response data: A progress report. In Psi Research Methodology: A Re-examination (L. Coly, ed.). Parapsychology Foundation, New York. To appear.
WILKS, S. S. (1965a). Statistical aspects of experiments in telepathy. N.Y. Statistician 16 (6) 1-3.
WILKS, S. S. (1965b). Statistical aspects of experiments in telepathy. N.Y. Statistician 16 (7) 4-6.
Comment
M. J. Bayarri and James Berger

M. J. Bayarri is Titular Professor, Department of Statistics and Operations Research, University of Valencia, Avenida Dr. Moliner 50, 46100 Burjassot, Valencia, Spain. James Berger is the Richard M. Brumfield Distinguished Professor of Statistics, Purdue University, West Lafayette, Indiana 47907.

1. INTRODUCTION

There are many fascinating issues discussed in this paper. Several concern parapsychology itself and the interpretation of statistical methodology therein. We are not experts in parapsychology, and so have only one comment concerning such matters: In Section 3 we briefly discuss the need to switch from P-values to Bayes factors in discussing evidence concerning parapsychology.

A more general issue raised in the paper is that of replication. It is quite illuminating to consider the issue of replication from a Bayesian perspective, and this is done in Section 2 of our discussion.

2. REPLICATION

Many insightful observations concerning replication are given in the article, and these spurred us to determine if they could be quantified within Bayesian reasoning. Quantification requires clear delineation of the possible purposes of replication, and at least two are obvious. The first is simple reduction of random error, achieved by obtaining more observations from the replication. The second purpose is to search for possible bias in the original experiment. We use "bias" in a loose sense here, to refer to any of the huge number of ways in which the effects being measured by the experiment can differ from the actual effects of interest. Thus a clinical trial without a placebo can suffer a placebo "bias"; a survey can suffer a "bias" due to the sampling frame being unrepresentative of the actual population; and possible sources of bias in parapsychological experiments have been extensively discussed.

Replication to Reduce Random Error

If the sole goal of replication of an experiment is to reduce random error, matters are very straightforward. Reviewing the Bayesian way of studying this issue is, however, useful and will be done through the following simple example.

EXAMPLE 1. Consider the example from Tversky and Kahneman (1982), in which an experiment results in a standardized test statistic of z1 = 2.46. (We will assume normality to keep computations trivial.) The question is: What is the highest value of z2 in a second set of data that would be considered a failure to replicate? Two possible precise versions of this question are: Question 1: What is the probability of observing z2 for which the null hypothesis would be rejected in the replicated experiment? Question 2: What value of z2 would leave one's overall opinion about the null hypothesis unchanged?

Consider the simple case where Z1 ~ N(θ, 1) and (independently) Z2 ~ N(θ, 1), where θ is the mean and 1 is the standard deviation of the normal distribution. Note that we are considering the case in which no experimental bias is suspected and so the means for each experiment are assumed to be the same.

Suppose that it is desired to test H0: θ ≤ 0 versus H1: θ > 0, and suppose that initial prior opinion about θ can be described by the noninformative prior π(θ) = 1. We consider the one-sided testing problem with a constant prior in this section, because it is known that then the posterior probability of H0, to be denoted by P(H0 | data), equals the P-value, allowing us to avoid complications arising from differences between Bayesian and classical answers.

After observing z1 = 2.46, the posterior distribution of θ is

    θ | z1 ~ N(z1, 1).

Question 1 then has the answer (using predictive Bayesian reasoning)

    P(rejecting at level α | z1) = P(Z2 > cα | z1) = Φ((z1 − cα)/√2),

where Φ is the standard normal cdf and cα is the (one-sided) critical value corresponding to the level, α, of the test. For instance, if α = 0.05, then this probability equals 0.7178, demonstrating that there is a quite substantial probability that the second experiment will fail to reject. If α is chosen to be the observed significance level from the first experiment, so that cα = z1, then the probability that the

second experiment will reject is just 1/2. This is nothing but a statement of the well-known martingale property of Bayesianism, that what you "expect" to see in the future is just what you know today. In a sense, therefore, question 1 is exposed as being uninteresting.

Question 2 more properly focuses on the fact that the stated goal of replication here is simply to reduce uncertainty in stated conclusions. The answer to the question follows immediately from noting that the posterior from the combined data (z1, z2) is

    θ | z1, z2 ~ N((z1 + z2)/2, 1/2),

so that

    P(H0 | z1, z2) = Φ(−(z1 + z2)/√2).

Setting this equal to P(H0 | z1) and solving for z2 yields z2 = (√2 − 1)z1 = 1.02. Any value of z2 greater than this will increase the total evidence against H0, while any value smaller than 1.02 will decrease the evidence.

Replication to Detect Bias

The aspirin example dramatically raises the issue of bias detection as a motive for replication. Professor Utts observes that replication 1 gives results that are fully compatible with those of the original study, which could be interpreted as suggesting that there is no bias in the original study, while replication 2 would raise serious concerns of bias. We became very interested in the implicit suggestion that replication 2 would thus lead to less overall evidence against the null hypothesis than would replication 1, even though in isolation replication 2 was much more "significant" than was replication 1. In attempting to see if this is so, we considered the Bayesian approach to study of bias within the framework of the aspirin example.

EXAMPLE 2. For simplicity in the aspirin example, we reduce consideration to

    θ = true difference in heart attack rates between aspirin and placebo populations multiplied by 1000;
    Y = difference in observed heart attack rates between aspirin and placebo groups in original study multiplied by 1000;
    Xi = difference in observed heart attack rates between aspirin and placebo groups in Replication i multiplied by 1000.

We assume that the replication studies are extremely well designed and implemented, so that one is very confident that the Xi have mean θ. Using normal approximations for convenience, the data can be summarized as

    X1 ~ N(θ, 4.82) and X2 ~ N(θ, 3.63),

with actual observations x1 = 7.704 and x2 = 13.07.

Consider now the bias issue. We assume that the original experiment is somewhat suspect in this regard, and we will model bias by defining the mean of Y to be

    E(Y) = θ + β,

where β is the unknown bias. Then the data in the original experiment can be summarized by

    Y ~ N(θ + β, (1.54)²),

with the actual observation being y = 7.707.

Bayesian analysis requires specification of a prior distribution, π(β), for the suspected amount of bias. Of particular interest then are the posterior distribution of β, assuming replication i has been performed, given by

    π(β | y, xi) ∝ π(β) exp(−(y − xi − β)² / [2((1.54)² + σi²)]),

where σi² is the variance (4.82 or 3.63) from replication i; and the posterior probability of H0, given by

    P(H0 | y, xi) = ∫ P(θ ≤ 0 | y, xi, β) π(β | y, xi) dβ.

Recall that our goal here was to see if Bayesian analysis can reproduce the intuition that the original experiment could be trusted if replication 1 had been done, while it could not be trusted (in spite of its much larger sample size) had replication 2 been performed. Establishing this requires finding a prior distribution π(β) for which π(β | y, x1) has little effect on P(H0 | y, x1), but π(β | y, x2) has a large effect on P(H0 | y, x2). To achieve the first objective, π(β) must be tightly concentrated near zero. To achieve the second, π(β) must be such that large |y − xi|, which suggests presence of a large bias, can result in a substantial shift of posterior mass for β away from zero.
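Both numerical answers in Example 1, the 0.7178 replication probability and the break-even value z2 = 1.02, are easy to verify with the standard library:

```python
from statistics import NormalDist
from math import sqrt

nd = NormalDist()
z1 = 2.46

# Question 1: predictively, Z2 | z1 ~ N(z1, 2), so the probability of
# rejecting at level alpha in the replication is Phi((z1 - c_alpha)/sqrt(2)).
c05 = nd.inv_cdf(0.95)                    # one-sided 0.05 critical value
p_reject = nd.cdf((z1 - c05) / sqrt(2))   # about 0.7178
p_at_obs = nd.cdf((z1 - z1) / sqrt(2))    # exactly 1/2 when c_alpha = z1

# Question 2: the combined posterior is N((z1+z2)/2, 1/2), so
# P(H0 | z1, z2) = Phi(-(z1+z2)/sqrt(2)); equating this to
# P(H0 | z1) = Phi(-z1) gives the break-even value z2 = (sqrt(2) - 1)*z1.
z2 = (sqrt(2) - 1) * z1                   # about 1.02
assert abs(nd.cdf(-(z1 + z2) / sqrt(2)) - nd.cdf(-z1)) < 1e-12

print(round(p_reject, 4), p_at_obs, round(z2, 2))
```

The equality in the assertion holds exactly, since (z1 + z2)/√2 = z1 when z2 = (√2 − 1)z1.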
A sensible candidate for the prior density π(β) is the Cauchy (0, V) density

    π(β) = [πV(1 + (β/V)²)]⁻¹.

Flat-tailed densities, such as this, are well known to have the property that when discordant data is observed (e.g., when |y − xi| is large), substantial mass shifts away from the prior center towards the likelihood center. It is easy to see that a normal prior for β can not have the desired behavior.

Our first surprise in consideration of these priors was how small V needed to be chosen in order for P(H0 | y, x1) to be unaffected by the bias. For instance, even with V = 1.54/100 (recall that 1.54 was the standard deviation of Y from the original experiment), computation yields P(H0 | y, x1) = 4.3 × 10⁻⁷, compared with the P-value (and posterior probability from the original experiment assuming no bias) of 2.8 × 10⁻⁷. There is a clear lesson here; even very small suspicions of bias can drastically alter a small P-value. Note that replication 1 is very consistent with the presence of no bias, and so the posterior distribution for the bias remains tightly concentrated near zero; for instance, the mean of the posterior for β is then 7.2 x and the standard deviation is 0.25.

When we turned attention to replication 2, we found that it did not seriously change the prior perceptions of bias. Examination quickly revealed the reason; even the maximum likelihood estimate of the bias is no more than 1.4 standard deviations from zero, which is not enough to change strong prior beliefs. We, therefore, considered a third experiment, defined in Table 1. Transforming to approximate normality, as before, yields

    X3 ~ N(θ, σ3²),

with x3 = 22.72 being the actual observation. The maximum likelihood estimate of bias is now 3.95 standard deviations from zero, so there is potential for a substantial change in opinion about the bias. Sure enough, computation when V = 1.54/100 yields that E[β | y, x3] = −4.9 with (posterior) standard deviation equal to 6.62, which is a dramatic shift from prior opinion (that β is Cauchy (0, 1.54/100)).

TABLE 1
Frequency of heart attacks in replication 3

              Yes      No
  Aspirin       5    2309
  Placebo      54    2116

The effect of this is to essentially ignore the original experiment in overall assessments of evidence. For instance, P(H0 | y, x3) = 3.81 × 10⁻¹¹, which is very close to P(H0 | x3) = 3.29 × 10⁻¹¹. Note that, if β were set equal to zero, the overall posterior probability of H0 (and P-value) would be 2.62 × 10⁻¹³.

Thus Bayesian reasoning can reproduce the intuition that replication which indicates bias can cast considerable doubt on the original experiment, while replication which provides no evidence of bias leaves evidence from the original experiment intact. Such behavior seems only obtainable, however, with flat-tailed priors for bias (such as the Cauchy) that are very concentrated (in comparison with the experimental standard deviation) near zero.

3. P-VALUES OR BAYES FACTORS?

Parapsychology experiments usually consider testing of H0: No parapsychological effect exists. Such null hypotheses are often realistically represented as point nulls (see Berger and Delampady, 1987, for the reason that care must be taken in such representation), in which case it is known that there is a large difference between P-values and posterior probabilities (see Berger and Delampady, 1987, for review). The article by Jefferys (1990) dramatically illustrates this, showing that a very small P-value can actually correspond to evidence for H0 when considered from a Bayesian perspective. (This is very related to the famous "Jeffreys" paradox.) The argument in favor of the Bayesian approach here is very strong, since it can be shown that the conflict holds for virtually any sensible prior distribution; a Bayesian answer can be wrong if the prior information turns out to be inaccurate, but a Bayesian answer that holds for all sensible priors is unassailable.

Since P-values simply cannot be viewed as meaningful in these situations, we found it of interest to reconsider the example in Section 5 from a Bayes factor perspective. We considered only analysis of the overall totals, that is, x = 122 successes out of n = 355 trials. Assuming a simple Bernoulli trial model with success probability θ, the goal is to test H0: θ = 1/4 versus H1: θ ≠ 1/4.

To determine the Bayes factor here, one must specify g(θ), the conditional prior density on H1. Consider choosing g to be uniform and symmetric, that is,

    g_r(θ) = 1/(2r)  if 1/4 − r ≤ θ ≤ 1/4 + r,
             0       otherwise.
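Returning to the bias analysis of Example 2 above, the posterior for β can be approximated by direct numerical integration on a grid. The sketch below reconstructs the model from the text, and the reconstruction should be treated as an assumption: with the flat prior on θ integrated out, (y − xi) | β ~ N(β, 1.54² + σi²) and β ~ Cauchy(0, 1.54/100). The replication 3 summary statistics are computed from the Table 1 counts rather than quoted from the comment.

```python
from math import exp, pi, sqrt

V = 1.54 / 100        # Cauchy scale of the bias prior
VAR_Y = 1.54 ** 2     # variance of the original study's estimate

def bias_posterior_mean(diff, var_rep, lo=-40.0, hi=40.0, step=0.001):
    # Posterior mean of beta on a grid; diff = y - x_i.
    # Reconstructed model: diff | beta ~ N(beta, VAR_Y + var_rep),
    # beta ~ Cauchy(0, V).
    s2 = VAR_Y + var_rep
    num = den = 0.0
    b = lo
    while b <= hi:
        w = (V / (pi * (V * V + b * b))) * exp(-(diff - b) ** 2 / (2 * s2))
        num += b * w
        den += w
        b += step
    return num / den

y = 7.707
# Replication 1: x1 = 7.704, variance 4.82 -- consistent with no bias.
m1 = bias_posterior_mean(y - 7.704, 4.82)

# Replication 3: summary statistics computed from the Table 1 counts.
pa, pp = 5 / 2314, 54 / 2170
x3 = (pp - pa) * 1000                                    # about 22.72
var3 = (pa * (1 - pa) / 2314 + pp * (1 - pp) / 2170) * 1e6
m3 = bias_posterior_mean(y - x3, var3)

# The standardized bias estimate matches the quoted "3.95 standard
# deviations from zero."
z_bias = (x3 - y) / sqrt(VAR_Y + var3)

print(f"m1 = {m1:.3f}, m3 = {m3:.1f}, z_bias = {z_bias:.2f}")
```

Under these assumptions the posterior mean for replication 3 comes out near the −4.9 reported, while replication 1 leaves the bias posterior essentially centered at zero, reproducing the qualitative behavior described in the text.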
Crudely, r could be considered to be the maximum change in success probability that one would expect given that ESP exists. Also, these distributions are the "extreme points" over the class of symmetric unimodal conditional densities, so answers that hold over this class are also representative of answers over a much larger class. Note that here r ≤ 0.25 (because 0 ≤ θ ≤ 1); for the given data the θ > 0.5 are essentially irrelevant, but if it were deemed important to take them into account one could use the more sophisticated binomial analysis in Berger and Delampady (1987).

For g_r, the Bayes factor of H1 to H0, which is to be interpreted as the relative odds for the hypotheses provided by the data, is given by

    B(r) = [ (1/(2r)) ∫_{1/4−r}^{1/4+r} θ^122 (1 − θ)^{355−122} dθ ] / [ (1/4)^122 (1 − 1/4)^{355−122} ].

This is graphed in Figure 1.

FIG. 1. The Bayes factor of H1 to H0 as a function of r, the maximum change in success probability that is expected given that ESP exists, for the ganzfeld experiment.

The P-value for this problem was 0.00005, indicating overwhelming evidence against H0 from a classical perspective. In contrast to the situation studied by Jefferys (1990), the Bayes factor here does not completely reverse the conclusion, showing that there are very reasonable values of r for which the evidence against H0 is moderately strong, for example 100/1 or 200/1. Of course, this evidence is probably not of sufficient strength to overcome strong prior opinions against H0 (one obtains final posterior odds by multiplying prior odds by the Bayes factor). To properly assess strength of evidence, we feel that such Bayes factor computations should become standard in parapsychology.

As mentioned by Professor Utts, Bayesian methods have additional potential in situations such as this, by allowing unrealistic models of iid trials to be replaced by hierarchical models reflecting differing abilities among subjects.

ACKNOWLEDGMENTS

M. J. Bayarri's research was supported in part by the Spanish Ministry of Education and Science under DGICYT Grant BE91-038, while visiting Purdue University. James Berger's research was supported by NSF Grant DMS-89-23071.

Comment
Ree Dawson

Ree Dawson is Senior Statistician, New England Biomedical Research Foundation, and Statistical Consultant, RFE/RL Research Institute. Her mailing address is 177 Morrison Avenue, Somerville, Massachusetts 02144.

This paper offers readers interested in statistical science multiple views of the controversial history of parapsychology and how statistics has contributed to its development. It first provides an account of how both design and inferential aspects of statistics have been pivotal issues in evaluating the outcomes of experiments that study psi abilities. It then emphasizes how the idea of science as replication has been key in this field, in which results have not been conclusive or consistent and thus meta-analysis has been at the heart of the literature in parapsychology. The author not only reviews past debate on how to interpret repeated psi studies, but also provides very detailed information on the Honorton-Hyman argument, a nice illustration of the challenges of resolving such de-
REPLICATION IN PARAPSYCHOLOGY 383
bate. This debate is also a good example of how statistical criticism can be part of the scientific process and lead to better experiments and, in general, better science.

The remainder of the paper addresses technical issues of meta-analysis, drawing upon recent research in parapsychology for an in-depth application. Through a series of examples, the author presents a convincing argument that power issues cannot be overlooked in successive replications and that comparison of effect sizes provides a richer alternative to the dichotomous measure inherent in the use of p-values. This is particularly relevant when the potential effect size is small and resources are limited, as seems to be the case for psi studies.

The concluding section briefly mentions Bayesian techniques. As noted by the author, Bayes (or empirical Bayes) methodology seems to make sense for research in parapsychology. This discussion examines possible Bayesian approaches to meta-analysis in this field.

BAYES MODELS FOR PARAPSYCHOLOGY

The notion of repeatability maps well into the Bayesian set-up in which experiments, viewed as a random sample from some superpopulation of experiments, are assumed to be exchangeable. When subjects can also be viewed as an approximately random sample from some population, it is appropriate to pool them across experiments. Otherwise, analyses that partially pool information according to experimental heterogeneity need to be considered. Empirical and hierarchical Bayes methods offer a flexible modeling framework for such analyses, relying on empirical or subjective sources to determine the degree of pooling. These richer methods can be particularly useful to meta-analysis of experiments in parapsychology conducted under potentially diverse conditions.

For the recent ganzfeld series, assuming them to be independent binomially distributed as discussed in Section 5, the data can be summed (pooled) across series to estimate a common hit rate. Honorton et al. (1990) assessed the homogeneity of effects across the 11 series using a chi-square test that compares individual effect sizes to the weighted mean effect. The chi-square statistic χ²_10 = 16.25, not statistically significant (p = 0.093), largely reflects the contribution of the last "special" series (contributes 9.2 units to the χ²_10 value), and to a lesser extent the novice series with a negative effect (contributes 2.5 units). The outlier series can be dropped from the analysis to provide a more conservative estimate of the presence of psi effects for this data (this result is reported in Section 5). For the remaining 10 series, the chi-square value = 7.01 strongly favors homogeneity, although more than one-third of its value is due to the novice series (number 4 in Table 1). This pattern points to the potential usefulness of a richer model to accommodate series that may be distinct from the others. For the earlier ganzfeld data analyzed by Honorton (1985b), the appeal of a Bayes or other model that recognizes the heterogeneity across studies is clear cut: χ²_23 = 56.6, p = 0.0001, where only those studies with common chance hit rate have been included (see Table 2).

TABLE 1
Recent ganzfeld series

Series type     N    Trials    Hit rate    Y_i    σ_i
Pilot
Pilot
Pilot
Novice
Novice
Novice
Novice
Novice
Experienced
Experienced
Experienced
Overall

Historic reliance on vote-counting approaches to determine the presence of psi effects makes it natural to consider Bayes models that focus on the ensemble of experimental effects from parapsychological studies, rather than individual estimates. Recent work in parapsychology that compares effect sizes across studies, rather than estimating separate study effects, reinforces the need to examine this type of model. Louis (1984) develops Bayes and empirical Bayes methods for problems that consider the ensemble of parameter values to be the primary goal, for example, multiple comparisons. For the simple compound normal model, Y_i ~ N(θ_i, 1), θ_i ~ N(μ, τ²), the standard Bayes estimates (posterior means)

    θ_i* = μ + D(Y_i − μ),   D = τ²/(1 + τ²),

where the θ_i represent experimental effects of interest, are modified approximately to

when an ensemble loss function is assumed. The new estimates adjust the shrinkage factor D so that their sample mean and variance match the posterior expectation and variance of the θ's. Similar results are obtained when the model is gener-
alized to the case of unequal variances, Y_i ~ N(θ_i, σ_i²).

For the above model, the fraction of θ_i* above (or below) a cut point C is a consistent estimate of the fraction of θ_i > C (or θ_i < C). Thus, the use of ensemble, rather than component-wise, loss can help detect when individual effects are above a specified threshold by chance. For the meta-analysis of ganzfeld experiments, the observed binomial proportions transformed on the logit (or arcsin square root) scale can be modeled in this framework. Letting d_i and m_i denote the number of direct hits and misses respectively for the ith experiment, and p_i as the corresponding population proportion of direct hits, the Y_i are the observed logits, Y_i = log(d_i/m_i), and σ_i², estimated by maximum likelihood as 1/d_i + 1/m_i, is the variance of Y_i conditional on θ_i = logit(p_i). The threshold logit(0.25) = −1.10 can be used to identify the number of experiments for which the proportion of direct hits exceeds that expected by chance.

Table 1 shows Y_i and σ_i for the 11 ganzfeld series. All but one of the series are well above the threshold; Y_4 marginally falls below −1.10. Any shrinkage toward a common hit rate will lead to an estimate, θ_4* or θ_4**, above the threshold. The use of ensemble loss (with its consistency property) provides more convincing support that all θ_i > −1.10, although posterior estimates of uncertainty are needed to fully calibrate this. For the earlier ganzfeld data in Table 2, ensemble loss can similarly be used to determine the number of studies with θ_i < −1.10 and specifically whether the negative effects of studies 4 and 24 (Y_4 = −1.21 and Y_24 = −1.33) occurred as a result of chance fluctuation.

TABLE 2
Earlier ganzfeld studies

N    Trials    Hit rate    Y_i    σ_i

Features of the ganzfeld data in Section 5, such as the outlier series, suggest that further elaboration of the basic Bayesian set-up may be necessary for some meta-analyses in parapsychology. Hierarchical models provide a natural framework to specify these elaborations and explore how results change with the prior specification. This type of sensitivity analysis can expose whether conclusions are closely tied to prior beliefs, as observed by Jefferys for RNG data (see Section 7). Quantifying the influence of model components deemed to be more subjective or less certain is important to broad acceptance of results as evidence of psi performance (or lack thereof).

Consider the initial model commonly used for Bayesian analysis of discrete data:

    Y_i | p_i, n_i ~ B(p_i, n_i),
    θ_i = logit(p_i),   θ_i ~ N(μ, τ²),

with noninformative priors assumed for μ and τ (e.g., log τ locally uniform). The distinctiveness of the last "special" series and, in general, the different types of series (pilot versus formal, novice versus experienced) raises the question of whether the experimental effects follow a normal distribution. Weighted normal plots (Ryan and Dempster, 1984) can be used to graphically diagnose the adequacy of second-stage normality (see Dempster, Selwyn and Weeks, 1983, for examples with binary response and normal superpopulation).

Alternatively, if nonnormality is suspected, the model can be revised to include some sort of heavy-tailed prior to accommodate possibly outlying series or studies. West (1985) incorporates additional scale parameters, one for each component of the model (experiment), that flexibly adapt to atypical θ_i and discount their influence on posterior estimates, thus avoiding under- or over-shrinkage due to such θ_i. For example, the second stage can specify the prior as a scale mixture of normals:

This approach for the prior is similar to others for
maximum likelihood estimation that modify the sampling error distribution to yield estimates that are "robust" against outlying observations.

Like its maximum likelihood counterparts, in addition to the robust effect estimates θ*, the Bayes model provides (posterior) scale estimates γ*. These can be interpreted as the weight given to the data for each θ_i in the analysis and are useful for diagnosing which model components (series or studies) are unusual and how they influence the shrinkage. When more complex groupings among the θ_i are suspected, for example, bimodal distribution of studies from different sites or experimenters, other mixture specifications can be used to further relax the shrinkage toward a common value.

For the 11 ganzfeld series, the last "outlier" series, quite distinct from the others (hit rate = 0.64), is moderately precise (N = 25). Omitting it from the analysis causes the overall hit rate to drop from 0.344 to 0.321. The scale mixture model is a compromise between these two values (on the logit scale), discounting the influence of series 11 on the estimated posterior common hit rate used for shrinkage. The scale factor γ_11*, an indication of how separate θ_11 is from the other parameters, also causes θ_11* to be shrunk less toward the common hit rate than other, more homogeneous θ_i, giving more weight to individual information for that series (see West, 1985). The heterogeneity of the earlier ganzfeld data is more pronounced, and studies are taken from a variety of sources over time. For these data, the γ* can be used to explore atypical studies (e.g., study 6, with hit rate = 0.90, contributes more than 25% to the χ²_23 value for homogeneity) and groupings among effects, as well as protect the analysis from misspecification of second-stage normality.

Variation among ganzfeld series or studies and the degree to which pooling or shrinking is appropriate can be investigated further by considering a range of priors for τ². If the marginal likelihood of τ² dominates the prior specification, then results should not vary as the prior for τ² is varied. Otherwise, it is important to identify the degree to which subjective information about interexperimental variability influences the conclusions. This sensitivity analysis is a Bayesian enrichment of the simpler test of homogeneity directed toward determining whether or not complete pooling is appropriate.

To assess how well heterogeneity among historical control groups is determined by the data, Dempster, Selwyn and Weeks (1983) propose three priors for τ² in the logistic-normal model. The prior distributions range from strongly favoring individual estimates, p(τ²) dτ² ∝ τ⁻¹, to the uniform reference prior p(τ²) dτ² ∝ τ⁻², flat on the log τ scale, to strongly favoring complete pooling, p(τ²) dτ² ∝ τ⁻³ (the latter forcing complete pooling for the compound normal model; see Morris, 1983). For their two examples, the results (estimates of linear treatment effects) are largely insensitive to variation in the prior distribution, but the number of studies in each example was large (70 and 19 studies available for pooling). For the 11 ganzfeld series, τ² may be less well determined by the data. The posterior estimate of τ² and its sensitivity to p(τ²) dτ² will also depend on whether individual scale parameters are incorporated into the model. Discounting the influence of the last series will both shift the marginal likelihood toward smaller values of τ² and concentrate it more in that region.

The issue of objective assessment of experimental results is one that extends well beyond the field of parapsychology, and this paper provides insight into issues surrounding the analysis and interpretation of small effects from related studies. Bayes methods can contribute to such meta-analyses in two ways. They permit experimental and subjective evidence to be formally combined to determine the presence or absence of effects that are not clear cut or controversial (e.g., psi abilities). They can also help uncover sources and degree of uncertainty in the scientific conclusions.
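The logit effect sizes and shrinkage estimates discussed above are straightforward to compute. The sketch below is my illustration, not Dawson's code: the (hits, misses) counts are hypothetical (the numeric cells of the tables did not survive extraction), and using the precision-weighted mean as the plug-in for μ is an empirical-Bayes assumption. It follows the formulas in the text: Y_i = log(d_i/m_i), σ_i² = 1/d_i + 1/m_i, the chance threshold logit(0.25) ≈ −1.10, and θ_i* = μ + D_i(Y_i − μ):

```python
import math

CHANCE = math.log(0.25 / 0.75)   # logit(0.25) ≈ -1.10, the no-psi threshold

def logit_effect(hits, misses):
    """Observed logit Y_i = log(d_i/m_i) and its ML variance 1/d_i + 1/m_i."""
    return math.log(hits / misses), 1.0 / hits + 1.0 / misses

def shrink(effects, variances, tau2):
    """theta_i* = mu + D_i (Y_i - mu) with D_i = tau2 / (tau2 + sigma_i^2);
    mu is taken as the precision-weighted mean (an empirical-Bayes plug-in)."""
    w = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    return [mu + (tau2 / (tau2 + v)) * (y - mu) for y, v in zip(effects, variances)]

# Hypothetical (hits, misses) counts, for illustration only:
series = [(8, 20), (12, 24), (10, 30), (6, 30)]
ys, vs = zip(*(logit_effect(d, m) for d, m in series))
for (d, m), y, v in zip(series, ys, vs):
    print(f"{d:2d}/{d + m:2d}   Y = {y:6.2f}   var = {v:.3f}   above chance: {y > CHANCE}")
print("shrunk:", [round(t, 2) for t in shrink(list(ys), list(vs), tau2=0.1)])
```

Larger τ² leaves the series near their individual Y_i; smaller τ² pools them toward the common value, which is the sensitivity at issue in the homogeneity discussion above.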
Comment

Persi Diaconis

In my experience, parapsychologists use statistics extremely carefully. The plethora of widely significant p-values in the many thousands of published parapsychological studies must give us pause for thought. Either something spooky is going on, or it is possible for a field to exist on error and artifact for over 100 years. The present paper offers a useful review by an expert and a glimpse at some tantalizing new studies.

My reaction is that the studies are crucially flawed. Since my reasons are somewhat unusual, I will try to spell them out.

I have found it impossible to usefully judge what actually went on in a parapsychology trial from their published record. Time after time, skeptics have gone to watch trials and found subtle and not-so-subtle errors. Since the field has so far failed to produce a replicable phenomenon, it seems to me that any trial that asks us to take its findings seriously should include full participation by qualified skeptics. Without a magician and/or knowledgeable psychologist skilled at running experiments with human subjects, I don't think a serious effort is being made.

I recognize that this is an unorthodox set of requirements. In fact, one cannot judge what "really goes on" in studies in most areas, and it is impossible to demand wide replicability in others. Finally, defining "qualified skeptic" is difficult. In defense, most areas have many easily replicable experiments and many have their findings explained and connected by unifying theories. It simply seems clear that when making claims at such extraordinary variance with our daily experience, claims that have been made and washed away so often in the past, such extraordinary measures are mandatory before one has the right to ask outsiders to spend their time in review. The papers cited in Section 5 do not actively involve qualified skeptics, and I do not feel they have earned the right to our serious attention.

The points I have made above are not new. Many appear in the present article. This does not diminish their utility nor applicability to the most recent studies.

Parapsychology is worth serious study. First, there may be something there, and I marvel at the patience and drive of people like Jessica Utts and Ray Hyman. Second, if it is wrong, it offers a truly alarming massive case study of how statistics can mislead and be misused. Third, it offers marvelous combinatorial and inferential problems. Chung, Diaconis, Graham and Mallows (1981), Diaconis and Graham (1981) and Samaniego and Utts (1983) offer examples not cited in the text. Finally, our budding statistics students are fascinated by its claims; the present paper gives a responsible overview providing background for a spectacular classroom presentation.

Persi Diaconis is Professor of Mathematics at Harvard University, Science Center, 1 Oxford Street, Cambridge, Massachusetts 02138.

Comment: Parapsychology - On the Margins

of Science?
Joel B. Greenhouse

Joel B. Greenhouse is Associate Professor of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213-3890.

Professor Utts reviews and synthesizes a large body of experimental literature as well as the scientific controversy involved in the attempt to establish the existence of paranormal phenomena. The organization and clarity of her presentation are noteworthy. Although I do not believe that this paper will necessarily change anyone's views regarding the existence of paranormal phenomena, it does raise very interesting questions about the process by which new ideas are either accepted or rejected by the scientific community. As students of science, we believe that scientific discovery
advances methodically and objectively through the accumulation of knowledge (or the rejection of false knowledge) derived from the implementation of the scientific method. But, as we will see, there is more to the acceptance of new scientific discoveries than the systematic accumulation and evaluation of facts. The recognition that there is a social process involved with the acceptance or rejection of scientific knowledge has been the subject of study of sociologists for some time. The scientific community's rejection of the existence of paranormal phenomena is an excellent case study of this process (Allison, 1979; Collins and Pinch, 1979).

Implicit in Professor Utts' presentation and paramount to the acceptance of parapsychology as a legitimate science are the description and documentation of the professionalization of the field of parapsychology. It is true that many researchers in the field have university appointments; there are organized professional societies for the advancement of parapsychology; there are journals with rigorous standards for published research; the field has received funding from federal agencies; and parapsychology has received recognition from other professional societies, such as the IMS and the American Association for the Advancement of Science (Collins and Pinch, 1979). Nevertheless, most readers of Statistical Science would agree that parapsychology is not accepted as part of orthodox science and is considered by most of the scientific community to be on the margins of science, at best (Allison, 1979; Collins and Pinch, 1979). Why is this the case? Professor Utts believes that it is because people have not examined the data. She states that "Strong beliefs tend to be resistant to change even in the face of data, and many people, scientists included, seem to have made up their minds on the question without examining any empirical data at all."

The history of science is replete with examples of resistance by the established scientific community to new discoveries. A challenging problem for science is to understand the process by which a new theory or discovery becomes accepted by the community of scientists and, likewise, to characterize the nature of the resistance to new ideas. Barber (1961) suggests that there are many different sources of resistance to scientific discovery. In 1900, for example, Karl Pearson met resistance to his use of statistics in applications to biological problems, illustrating a source of resistance due to the use of a particular methodology. The Royal Society informed Pearson that future papers submitted to the Society for publication must keep the mathematics separate from the biological applications.

Another obvious source of resistance to new scientific ideas, and the one referred to by Professor Utts above, is the prevailing substantive beliefs and theories held by scientists at any given time. Barber offers the opposition to Copernicus and his heliocentric theory and to Mendel's theory of genetic inheritance as examples of how, because of preconceived ideas, theories and values, scientists are not as open-minded to new advances as one might think they should be. It was R. A. Fisher who said that each generation seems to have found in Mendel's paper only what it expected to find and ignored what did not conform to its own expectations (Fisher, 1936).

Pearson's response to the antimathematical prejudice expressed by the Royal Society was to establish with Galton's support a new journal, Biometrika, to encourage the use of mathematics in biology. Galton (1901) wrote an article for the first issue of the journal, explaining the need for this new voice of "mutual encouragement and support" for mathematics in biology and saying that "a new science cannot depend on a welcome from the followers of the older ones, and [therefore]... it is advisable to establish a special Journal for Biometry." Lavoisier understood the role of preconceived beliefs as a source of resistance when he wrote in 1785,

    I do not expect my ideas to be adopted all at once. The human mind gets creased into a way of seeing things. Those who have envisaged nature according to a certain point of view during much of their career, rise only with difficulty to new ideas. (Barber, 1961.)

I suspect that this paper by Professor Utts synthesizing the accumulation of research results supporting the existence of paranormal phenomena will continue to be received with skepticism by the orthodox scientific community "even after examining the data." In part, this resistance is due to the popular perception of the association between parapsychology and the occult (Allison, 1979) and due to the continued suspicion and documentation of fraud in parapsychology (Diaconis, 1978). An additional and important source of resistance to the evidence presented by Professor Utts, however, is the lack of a model to explain the phenomena. Psychic phenomena are unexplainable by any current scientific theory and, furthermore, directly contradict the laws of physics. Acceptance of psi implies the rejection of a large body of accumulated evidence explaining the physical and biological world as we know it. Thus, even though the effect size for a relationship between aspirin and the prevention of heart attacks is three times smaller than the effect size observed in the ganzfeld data

base, it is the existence of a biological mechanism to explain the effectiveness of aspirin that accounts, in part, for acceptance of this relationship.

In evaluating the evidence in favor of the existence of paranormal phenomena, it is necessary to consider alternative explanations or hypotheses for the results and, as noted by Cornfield (1959), "If important alternative hypotheses are compatible with available evidence, then the question is unsettled, even if the evidence is experimental" (see also Platt, 1964). Many of the experimental results reported by Professor Utts need to be considered in the context of explanations other than the existence of paranormal phenomena. Consider the following examples:

(1) In the various psi experiments that Professor Utts discusses, the null hypothesis is a simple chance model. However, as noted by Diaconis (1978) in a critique of parapsychological research, "In complex, badly controlled experiments simple chance models cannot be seriously considered as tenable explanations: hence, rejection of such models is not of particular interest." Diaconis shows that the underlying probabilistic model in many of these experiments (even those that are well controlled) is much more complicated than chance.

(2) The role that experimenter expectancy plays in the reporting and interpreting of results cannot be underestimated. Rosenthal (1966), based on a meta-analysis of the effects of experimenters' expectancies on the results of their research, found that experimenters tended to get the results they expected to get. Clearly this is an important potential confounder in parapsychological research. Professor Utts comments on a debate between Honorton and Hyman, parapsychologist and critic, respectively, regarding evidence for psi abilities, and, although not necessarily a result of experimenter expectancy, describes how "...each analyzed the results of all known psi ganzfeld experiments to date, and reached strikingly different conclusions."

(3) What is an acceptable response in these experiments? What constitutes a direct hit? If the response is close, who decides whether or not it constitutes a hit (see (2) above)? In an example of a response of a Receiver in an automated ganzfeld procedure, Professor Utts describes the "dream-like quality of the mentation." Someone must evaluate these stream-of-consciousness responses to determine what is a hit. An important methodological question is: How sensitive are the results to different definitions of a hit?

(4) In describing the results of different meta-analyses, Professor Utts is careful to raise questions about the role of publication bias. Publication bias or "the file-drawer problem" arises when only statistically significant findings get published, while statistically nonsignificant studies sit unreported in investigators' file drawers. Typically, Rosenthal's method (1979) is used to calculate the "fail-safe N," that is, the number of unreported studies that would have to be sitting in file drawers in order to negate the significant effect. Iyengar and Greenhouse (1988) describe a modification of Rosenthal's method, however, that gives a fail-safe N that is often an order of magnitude smaller, suggesting that the sensitivity of the results of meta-analyses of psi experiments to unpublished negative studies is greater than is currently believed.

Even if parapsychology is thought to be on the margins of science by the scientific community, parapsychologists should not be held to a different standard of evidence to support their findings than orthodox scientists, but like other scientists they must be concerned with spurious effects and the effects of extraneous variables. The experimental results summarized by Professor Utts appear to be sensitive to the effect of alternative hypotheses like the ones described above. Sensitivity analyses, which question, for example, how large an effect due to experimenter expectancy there would have to be to account for the effect sizes being reported in the psi experiments, are not addressed here. Again, the ability to account for and eliminate the role of alternative hypotheses in explaining the observed relationship between aspirin and the prevention of heart attacks is another reason for the acceptance of those results.

A major new technology discussed by Professor Utts in synthesizing the experimental parapsychology literature is meta-analysis. Until recently, the quantitative review and synthesis of a research literature, that is, meta-analysis, was considered by many to be a questionable research tool (Wachter, 1988). Resistance by statisticians to meta-analysis is interesting because, historically, many prominent statisticians found the combining of information from independent studies to be an important and useful methodology (see, e.g., Fisher, 1932; Cochran, 1954; Mosteller and Bush, 1954; Mantel and Haenszel, 1959). Perhaps the more recent skepticism about meta-analysis is because of its use as a tool to advance discoveries that themselves were the objects of resistance, such as the efficacy of psychotherapy (Smith and Glass, 1977) and now the existence of paranormal phenomena. It is an interesting problem for the history of science to explore why and when in the development of a
of a discipline it turns to meta-analysis to answer research synthesis is in assessing the potential ef-
research questions or to resolve controversy (e.g., fects of study characteristics and to quantify the
Greenhouse et al., 1990). sources of heterogeneity in a research domain, that
One argument for combining information from is, to study systematically the effects of extraneous
different studies is that a more powerful result can be obtained than from a single study. This objective is implicit in the use of meta-analysis in parapsychology and is the force behind Professor Utts' paper. The issue is that by combining many small studies consisting of small effects there is a gain in power to find an overall statistically significant effect. It is true that the meta-analyses reported by Professor Utts find extremely small p-values, but the estimate of the overall effect size is still small. As noted earlier, because of the small magnitude of the overall effect size, the possibility that other extraneous variables might account for the relationship remains.

Professor Utts, however, also illustrates the use of meta-analysis to investigate how studies differ and to characterize the influence of difficult covariates or moderating variables on the combined estimate of effect size. For example, she compares the mean effect size of studies where subjects were selected on the basis of good past performance to studies where the subjects were unselected, and she compares the mean effect size of studies with feedback to studies without feedback. To me, this latter use of meta-analysis highlights the more valuable and important contribution of the methodology: specifically, the value of quantitative methods for studying moderating variables. Tom Chalmers and his group at Harvard have used meta-analysis in just this way not only to advance the understanding of the effectiveness of medical therapies but also to study the characteristics of good research in medicine, in particular, the randomized controlled clinical trial. (See Mosteller and Chalmers, 1991, for a review of this work.)

Professor Utts should be congratulated for her courage in contributing her time and statistical expertise to a field struggling on the margins of science, and for her skill in synthesizing a large body of experimental literature. I have found her paper to be quite stimulating, raising many interesting issues about how science progresses or does not progress.

ACKNOWLEDGMENT

This work was supported in part by MHCRC grant MH30915 and MH15758 from the National Institute of Mental Health, and CA54852 from the National Cancer Institute. I would like to acknowledge stimulating discussions with Professors Larry Hedges, Michael Meyer, Ingram Olkin, Teddy Seidenfeld and Larry Wasserman, and thank them for their patience and encouragement while preparing this discussion.
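The discussant's point about gaining power by pooling many small studies can be made concrete with a small sketch. This is not any discussant's own computation; it simply combines one-sided p-values by Stouffer's method, one standard pooling rule, to show how several individually unimpressive studies can yield a very small overall p-value:

```python
import math
from statistics import NormalDist

def stouffer_p(p_values):
    """Combine one-sided p-values via Stouffer's method:
    Z = sum(z_i) / sqrt(k), where z_i = Phi^{-1}(1 - p_i)."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1 - nd.cdf(z)

# Ten small studies, each only marginally suggestive on its own,
# combine to an overall p-value far below any individual one.
print(stouffer_p([0.15] * 10))   # well under 0.001
```

Note that this power gain says nothing about the size of the underlying effect, which is exactly the point made above: a tiny pooled p-value is compatible with a small combined effect size.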

Comment
Ray Hyman

Ray Hyman is Professor of Psychology, University of Oregon, Eugene, Oregon 97403.

Utts concludes that "there is an anomaly that needs explanation." She bases this conclusion on the ganzfeld experiments and four meta-analyses of parapsychological studies. She argues that both Honorton and Rosenthal have successfully refuted my critique of the ganzfeld experiments. The meta-analyses apparently show effects that cannot be explained away by unreported experiments nor over-analysis of the data. Furthermore, effect size does not correlate with the rated quality of the experiment.

Neither time nor space is available to respond in detail to her argument. Instead, I will point to some of my concerns. I will do so by focusing on those parts of Utts' discussion that involve me. Understandably, I disagree with her assertions that both Honorton and Rosenthal successfully refuted my criticisms of the ganzfeld experiments.

Her treatment of both the ganzfeld debate and the National Research Council's report suggests that Utts has relied on second-hand reports of the data. Some of her statements are simply inaccurate. Others suggest that she has not carefully read what my critics and I have written. This remoteness from the actual experiments and details of the arguments may partially account for her optimistic assessment of the results. Her paper takes
the reported data at face value and focuses on the statistical interpretation of these data.

Both the statistical interpretation of the results of an individual experiment and of the results of a meta-analysis are based on a model of an ideal world. In this ideal world, effect sizes have a tractable and known distribution and the points in the sample space are independent samples from a coherent population. The appropriateness of any statistical application in a given context is an empirical matter. That is why such issues as the adequacy of randomization, the non-independence of experiments in a meta-analysis and the over-analysis of data are central to the debate. The optimistic conclusions from the meta-analyses assume that the effect sizes are unbiased estimates from independent experiments and have nicely behaved distributional properties.

Before my detailed assessment of all the available ganzfeld experiments through 1981, I accepted the assertions by parapsychologists that their experiments were of high quality in terms of statistical and experimental methodology. I was surprised to find that the ganzfeld experiments, widely heralded as the best exemplar of a successful research program in parapsychology, were characterized by obvious possibilities for sensory leakage, inadequate randomization, over-analysis and other departures from parapsychology's own professed standards. One response was to argue that I had exaggerated the number of flaws. But even internal critics agreed that the rate of defects in the ganzfeld data base was too high.

The other response, implicit in Utts' discussion of the ganzfeld experiments and the meta-analyses, was to admit the existence of the flaws but to deny their importance. The parapsychologists doing the meta-analysis would rate each experiment for quality on one or more attributes. Then, if the null hypothesis of no correlation between effect size and quality were upheld, the investigators concluded that the results could not be attributed to defects in methodology.

This retrospective sanctification using statistical controls to compensate for inadequate experimental controls has many problems. The quality ratings are not blind. As the differences between myself and Honorton reveal, such ratings are highly subjective. Although I tried my best to restrict my ratings to what I thought were objective and easily codeable indicators, my quality ratings provide a different picture than do those of Honorton. Honorton, I am sure, believes he was just as objective in assigning his ratings as I believe I was.

Another problem is the number of different properties that are rated. Honorton's ratings of quality omitted many attributes that I included in my ratings. Even in those cases where we used the same indicators to make our assessments, we differed because of our scaling. For example, on adequacy of randomization I used a simple dichotomy. Either the experimenter clearly indicated using an appropriate randomization procedure or he did not. Honorton converted this to a trichotomous scale. He distinguished between a clearly inadequate procedure such as hand-shuffling and failure to report how the randomization was done. He then assigned the lowest rating to failure to describe the randomization. In his scheme, clearly inadequate randomization was of higher quality than failure to describe the procedure. Although we agreed on which experiments had adequate randomization, inadequate randomization or inadequate documentation, the different ways these were ordered produced important differences between us in how randomization related to effect size. These are just some of the reasons why the finding of no correlation between effect size and rated quality does not justify concluding that the observed flaws had no effect.

I will now consider some of Utts' assertions and hope that I can go into more detail in another forum. Utts discusses the conclusions of the National Research Council's Committee on Techniques for the Enhancement of Human Performance. I was chairperson of that committee's subcommittee on paranormal phenomena. She wrongly states that we restricted our evaluation only to significant studies. I do not know how she got such an impression since we based our analysis on meta-analyses whenever these were available. The two major inputs for the committee's evaluation were a lengthy evaluation of contemporary parapsychology experiments by John Palmer and an independent assessment of these experiments by James Alcock. Our sponsors, the Army Research Institute, had commissioned the report from the parapsychologist John Palmer. They specifically asked our committee to provide a second opinion from a non-parapsychological perspective. They were most interested in the experiments on remote viewing and random number generators. We decided to add the ganzfeld experiments. Alcock was instructed, in making his evaluation, to restrict himself to the same experiments in these categories that Palmer had chosen. In this way, the experiments we evaluated, which included both significant and nonsignificant ones, were, in effect, selected for us by a prominent parapsychologist.

Utts mistakenly asserts that my subcommittee on parapsychology commissioned Harris and Rosenthal to evaluate parapsychology experiments for
us. Harris and Rosenthal were commissioned by our evaluation subcommittee to write a paper on evaluation issues, especially those related to experimenter effects. On their own initiative, Harris and Rosenthal surveyed a number of data bases to illustrate the application of methodological procedures such as meta-analysis. As one illustration, they included a meta-analysis of the subsample of ganzfeld experiments used by Honorton in his rebuttal to my critique.

Because Harris and Rosenthal did not themselves do a first-hand evaluation of the ganzfeld experiments, and because they used Honorton's ratings for their illustration, I did not refer to their analysis when I wrote my draft for the chapter on the paranormal. Rosenthal told me, in a letter, that he had arbitrarily used Honorton's ratings rather than mine because they were the most recent available. I assumed that Harris and Rosenthal were using Honorton's sample and ratings to illustrate meta-analytic procedures. I did not believe they were making a substantive contribution to the debate.

Only after the committee's complete report was in the hands of the editors did someone become concerned that Harris and Rosenthal had come to a conclusion on the ganzfeld experiments different from the committee. Apparently one or more committee members contacted Rosenthal and asked him to explain why he and Harris were dissenting. Because some committee members believed that we should deal with this apparent discrepancy, I contacted Rosenthal and pointed out that if he had used my ratings with the very same analysis he had applied to Honorton's ratings, he would have reached a conclusion opposite to what Harris and he had asserted. I did this, not to suggest my ratings were necessarily more trustworthy than Honorton's, but to point out how fragile any conclusions were based on this small and limited sample. Indeed, the data were so lacking in robustness that the difference between my rating and Honorton's rating of one investigator (Sargent) on one attribute (randomization) sufficed to reverse the conclusions Harris and Rosenthal made about the correlation between quality and effect size.

Harris and Rosenthal responded by adding a footnote to their paper. In this footnote, they reported an analysis using my ratings rather than Honorton's. This analysis, they concluded, still supported the null hypothesis of no correlation between quality and effect size. They used 6 of my 12 dichotomous ratings of flaws as predictors and the z score and effect size as criterion variables in both multiple regression and canonical correlation analyses. They reported an "adjusted" canonical correlation between criterion variables and flaws of "only" 0.46. A true correlation of this magnitude would be impressive given the nature and split of the dichotomous variables. But, because it was not statistically significant, Harris and Rosenthal concluded that there was no relationship between quality and effect size. A canonical correlation on this sample of 28 nonindependent cases, of course, has virtually no chance of being significant, even if it were of much greater magnitude.

What this amounts to is that the alleged contradictory conclusions of Harris and Rosenthal are based on a meta-analysis that supports Honorton's position when Honorton's ratings are used and supports my position when my ratings are used. Nothing substantive comes from this, and it is redundant with what Honorton and I have already published. Harris and Rosenthal's footnote adds nothing because it supports the null hypothesis with a statistical test that has no power against a reasonably sized alternative. It is ironic that Utts, after emphasizing the importance of considering statistical power, places so much reliance on the outcome of a powerless test.

(I should add that the recurrent charge that the NRC committee completely ignored Harris and Rosenthal's conclusions is not strictly correct. I wrote a response to the Harris and Rosenthal paper that was included in the same supplementary volume that contains their commissioned paper.)

Utts' discussion of the ganzfeld debate, as I have indicated, also shows unfamiliarity with details. She cites my factor analysis and Saunders' critique as if these somehow jeopardized the conclusions I drew. Again, the matter is too complex to discuss adequately in this forum. The "factor analysis" she is talking about is discussed in a few pages of my critique. I introduced it as a convenient way to summarize my conclusions, none of which depended on this analysis. I agree with what Saunders has to say about the limitations of factor analysis in this context. Unfortunately, Saunders bases his criticism on wrong assumptions about what I did and why I did it. His dismissal of the results as "meaningless" is based on mistaken algebra. I included as dummy variables five experimenters in the factor analysis. Because an experimenter can only appear on one variable, this necessarily forces the average intercorrelation among the experimenter variables to be negative. Saunders falsely asserts that this negative correlation must be -1. If he were correct, this would make the results meaningless. But he could be correct only if there were just two investigators and each one accounted for 50% of the experiments. In my case, as I made sure to check ahead of time, the use of five
experimenters, each of whom contributed only a few studies to the data base, produced a mildly negative intercorrelation of -0.147. To make sure even that small correlation did not distort the results, I did the factor analysis with and without the dummy variables. The same factors were obtained in both cases.

However, I do not wish to defend this factor analysis. None of my conclusions depend on it. I would agree with any editor who insisted that I omit it from the paper on the grounds of redundancy. I am discussing it here as another example that suggests that Utts is not familiar with some relevant details in literature she discusses.

CONCLUSIONS

Utts may be correct. There may indeed be an anomaly in the parapsychological findings. Anomalies may also exist in non-parapsychological domains. The question is when is an anomaly worth taking seriously. The anomaly that Utts has in mind, if it exists, can be described only as a departure from a generalized statistical model. From the evidence she presents, we might conclude that we are dealing with a variety of different anomalies instead of one coherent phenomenon. Clearly, the reported effect sizes for the experiments with random number generators are orders of magnitude lower than those for the ganzfeld experiments. Even within the same experimental domain, the effect sizes do not come from the same population. The effect sizes obtained by Jahn are much smaller than those obtained by Schmidt with similar experiments on random number generators. In the ganzfeld experiments, experimenters differ significantly in the effect sizes each obtains.

This problem of what effect sizes are and what they are measuring points to a problem for parapsychologists. In other fields of science such as astronomy, an "anomaly" is a very precisely specified departure from a well-established substantive theory. When Leverrier discovered Neptune by studying the perturbations in the orbit of Uranus, he was able to characterize the anomaly as a very precise departure of a specific kind from the orbit expected on the basis of Newtonian mechanics. He knew exactly what he had to account for.

The "anomaly" or "anomalies" that Utts talks about are different. We do not know what it is that we are asked to account for other than something that sometimes produces nonchance departures from a statistical model, whose appropriateness is itself open to question.

The case rests on a handful of meta-analyses that suggest effect sizes different from zero and uncorrelated with some non-blindly determined indices of quality. For a variety of reasons, these retrospective attempts to find evidence for paranormal phenomena are problematical. At best, they should provide the basis for parapsychologists designing prospective studies in which they can specify, in advance, the complete sample space and the critical region. When they get to the point where they can specify this along with some boundary conditions and make some reasonable predictions, then they will have demonstrated something worthy of our attention.

In this context, I agree with Utts that Honorton's recent report of his automated ganzfeld experiments is a step in the right direction. He used the ganzfeld meta-analyses and the criticisms of the existing data base to design better experiments and make some predictions. Although he and Utts believe that the findings of meaningful effect sizes in the dynamic targets and a lack of a nonzero effect size in the static targets are somehow consistent with previous ganzfeld results, I disagree. I believe the static targets are closer in spirit to the original data base. But this is a minor criticism.

Honorton's experiments have produced intriguing results. If, as Utts suggests, independent laboratories can produce similar results with the same relationships and with the same attention to rigorous methodology, then parapsychology may indeed have finally captured its elusive quarry. Of course, on several previous occasions in its century-plus history, parapsychology has felt it was on the threshold of a breakthrough. The breakthrough never materialized. We will have to patiently wait to see if the current situation is any different.

Comment
Robert L. Morris
Robert L. Morris occupies the Koestler Chair of Parapsychology in the Department of Psychology at the University of Edinburgh, 7 George Square, Edinburgh EH8 9JZ, United Kingdom.

Experimental sciences by their nature have found it relatively easy to deal with simple closed systems. When they come to study more complex, open systems, however, they have more difficulty in generating testable models, must rely more on multivariate approaches, have more diversity from experiment to experiment (and thus more difficulty in constructing replication attempts), have more noise in the data, and more difficulty in constructing a linkage between concept and measurement. Data gatherers and other researchers are more likely to be part of the system themselves. Examples include ecology, economics, social psychology and parapsychology. Parapsychology can be regarded as the study of apparent new means of communication, or transfer of influence, between organism and environment. Any observer attempting to decide whether or not such psychic communication has taken place is one of several elements in a complex open system composed of an indefinite number of interactive features. The system can be modeled, as has been done elsewhere (e.g., Morris, 1986) such as to organise our understanding of how observers can be misled by themselves, or by deliberate frauds. Parapsychologists designing experimental studies must take extreme care to ensure that the elements in the experimental system do not interact in unanticipated ways to produce artifact or encourage fraudulent procedures. When researchers follow up the findings of others, they must ensure that the new experimental system sufficiently resembles the earlier one, regarding its important components and their potential interactions. Specifying sufficient resemblance is more difficult in complex and open systems, and in areas of research using novel methodologies.

As a result, parapsychology and other such areas may well profit from the application of modern meta-analysis, and meta-analytic methods may in turn profit from being given a good stiff workout by controversial data bases, as suggested by Jessica Utts in her article. Parapsychology would appear to gain from meta-analytic techniques, in at least three important areas.

First, in assessing the question of replication rate, the new focus on effect size and confidence intervals rather than arbitrarily chosen significance levels seems to indicate much greater consistency in the findings than has previously been claimed.

Second, when one codes the individual studies for flaws and relates flaw abundance with effect size, there appears to be little correlation for all but one data base. This contradicts the frequent assertion that parapsychological results disappear when methodology is tightened. Additional evidence on this point is the series of studies by Honorton and associates using an automated ganzfeld procedure, apparently better conducted than any of the previous research, which nevertheless obtained an effect size very similar to that of the earlier more diverse data base.

Third, meta-analysis allows researchers to look at moderator variables, to build a clearer picture of the conditions that appear to produce the strongest effects. Research in any real scientific discipline must be cumulative, with later researchers building on the work of those who preceded them. If our earlier successes and failures have meaning, they should help us obtain increasingly consistent, clearer results. If psychic ability exists and is sufficiently stable that it can be manifest in controlled experimental studies, then moderator variables should be present in groups of studies that would indicate conditions most favourable and least favourable to the production of large effect sizes. From the analyses presented by Utts, for instance, it seems evident that group studies tend to produce poor results and, however convenient it may be to conduct them, future researchers should apparently focus much more on individual testing. When doing ganzfeld studies, it appears best to work with dynamic rather than static target material and with experienced participants rather than novices. If such results are valid, then future researchers who wish to get strong results now have a better idea of what procedures to select to increase the likelihood of so doing, what elements in the experimental system seem most relevant. The proportion of studies obtaining positive results should therefore increase.

However, the situation may be more complex than the somewhat ideal version painted above. As noted earlier, meta-analysis may learn from parapsychology as well as vice versa. Parapsychological data may well give meta-analytic techniques a good workout and will certainly pose some challenges. None of the cited meta-analyses, as described above, apparently employed more than one judge or
evaluator. Certainly none of them cited any correlation values between evaluators, and the correlations between judges of research quality in other social sciences tend to be "at best around .50," according to Hunter and Schmidt (1990, page 497). Although Honorton and Hyman reported a relatively high correlation of 0.77 between themselves, they were each doing their own study and their flaw analyses did reach somewhat different conclusions, as noted by Utts. Other than Hyman, the evaluators cited by Utts tend to be positively oriented toward parapsychology; roughly speaking, all evaluators doing flaw analyses found what they might hope to find, with the exception of the PK dice data base. Were evaluators blind as to study outcome when coding flaws? No comment is made on this aspect. The above studies need to be replicated, with multiple (and blind) evaluators and reported indices of evaluator agreement. Ideally, evaluator attitude should be assessed and taken into account as well. A study with all hostile evaluators may report very high evaluator correlations, yet be a less valid study than one that employs a range of evaluators and reports lower correlations among evaluators.

But what constitutes a replication of a meta-analysis? As with experimental replications, it may be important to distinguish between exact and conceptual replications. In the former, a replicator would attempt to match all salient features of the initial analysis, from the selection of reports to the coding of features to the statistical tests employed, such as to verify that the stated original protocol had been followed faithfully and that a similar outcome results. For conceptual replication, replicators would take the stated outcome of the meta-analysis and attempt their own independent analysis, with their own initial report selection criteria, coding criteria and strategy for statistical testing, to see if similar conclusions resulted. Conceptual replication allows more room for bias and resultant debate when findings differ, but when results are similar they can be assumed to have more legitimacy. Given the strong and surprising (for many) conclusions reached in the meta-analysis reported by Utts, it is quite likely that others with strong views on parapsychology will attempt to replicate, hoping for clear confirmation or disconfirmation. The diversity of methods they are likely to employ and the resultant debates should provide a good opportunity for airing the many conceptual problems still present in meta-analysis. If results differ on moderator variables, there can come to be empirical resolution of the differences as further results unfold. With regard to flaw analysis, such analyses have already focused attention in ganzfeld research on the abundance of existing faults and how to avoid them. If results are as strong under well-controlled conditions as under sloppy ones, then additional research such as that done by Honorton and associates under tight conditions should continue to produce positive results.

In addition to the replication issue, there are some other problems that need to be addressed. So far, the assessment of moderator variables has been univariate, whereas a multivariate approach would seem more likely to produce a clearer picture. Moderator variables may covary, with each other or with flaws. For instance, in the dice data higher effect sizes were found for flawed studies and for studies with selected subjects. Did studies using special subjects use weaker procedures?

Given the importance attached to effect size and incorporating estimates of effect size in designing studies for power, we must be careful not to assume that effect size is independent of number of trials or subjects unless we have empirical reason to do so. Effect size may decrease with larger N if experimenters are stressed or bored towards the end of a long study or if there are too many trials to be conducted within a short period of time and subjects are given less time to absorb their instructions or to complete their tasks. On one occasion there is presentation of an estimated "true average effect size" (0.18 rather than 0.28) without also presenting an estimate of effect size dispersal. Future investigators should have some sense of how the likelihood that they will obtain a hit rate of 1/3 (where 1/4 is expected) will vary in accordance with conditions.

There are a few additional quibbles with particular points. In Utts' example experiment with Professor A versus Professor B, sex of professor is a possible confounding variable. When Honorton omitted studies that did not report direct hits as a measure, he may have biased his sample. Were there studies omitted that could have reported direct hits but declined to do so, conceivably because they looked at that measure, saw no results and dropped it? This objection is only with regard to the initial meta-analysis and is not relevant for the later series of studies which all used direct hits. In Honorton's meta-analysis of forced-choice precognition experiments, the comparison variables of feedback delay and time interval to target selection appear to be confounded. Studies delaying target selection cannot provide trial by trial feedback, for instance. Also, I am unsure about using an approximation to Cohen's h for assessing the effect size for the aspirin study. There would appear to be a very striking effect, with the aspirin condition heart attack rate only 55% that of the rate for the placebo condition. How was the expected proportion of
misses estimated; perhaps Cohen's h greatly underestimates effect size when very low probability events (less than 1 in 50 for heart attack in the placebo condition and less than 1 in 100 for aspirin) are involved. I'm not a statistician and thus don't know if there is a relevant literature on this point.

The above objections should not detract from the overall value of the Utts survey. The findings she reports will need to be replicated; but even as is, they provide a challenge to some of the cherished arguments of counteradvocates, yet also challenge serious researchers to use these findings effectively as guidelines for future studies.
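Morris's question about Cohen's h for rare events is easy to explore numerically. Cohen's h is the difference of arcsine-transformed proportions, h = 2 arcsin(sqrt(p1)) - 2 arcsin(sqrt(p2)). The sketch below uses illustrative heart-attack rates of roughly the size he quotes; the exact published rates may differ:

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Illustrative rates of the magnitude Morris quotes: just under
# 1 in 50 on placebo, just under 1 in 100 on aspirin.
p_placebo, p_aspirin = 0.0171, 0.0094

print(round(cohens_h(p_placebo, p_aspirin), 3))   # 0.068
print(round(p_aspirin / p_placebo, 2))            # risk ratio 0.55
```

Comparing h (about 0.068) with the raw difference in rates (about 0.008) and the risk ratio (about 0.55) gives a concrete sense of how differently these measures portray the same rare-event effect, which is the substance of Morris's concern.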

Comment
Frederick Mosteller

Dr. Utts's discussion stimulates me to offer some psychology, vol. 4 (1940), pp. 298-319, in particular
comments that bear on her topic but do not, in the p. 306." The 25 symbols of 5 kinds, 5 of each,
main, fall into an agree-disagree mode. My refer- correspond to the cards in a parapsychology deck.
ences refer to her bibliography. The point of page 306 is that Greenwood and
Let me recommend J. Edgar Coover's work to Stuart on that page claim to have generated two
statisticians who would like to read about a pretty random orders of such a deck using Tippett's table
sequence of experiments developed and executed of random numbers. Apparently Feller thought that
well before Fisher's book on experimental design it would have taken them a long time to do it. If
appeared. Most of the standard kinds of ESP exper- one assumes that Feller's way of generating a ran-
iments (though not the ganzfeld) are carried out dom shuffle is required, then it would indeed be
and reported in this 1917 book. Coover even began unreasonable to suppose that the experiments could
looking into the amount of information contained be carried out quickly. I wondered then whether
in cues such as whispers. He also worked at expos- Feller thought this was the only way to produce a
ing mediums. I found the book most impressive. As random order to such a deck of cards. If you happen
Utts says in her article, the question of significance to know how to shuffle a deck efficiently using
level was a puzzling one, and one we still cannot random numbers, it is hard to believe that others
solve even though some fields seem to have stan- do not know. I decided to test it out and so I
dardized on 0.05. proposed to a class of 90 people in mathematical
When Feller's comments on Stuart and Green- statistics that we find a way of using random num-
wood's sampling efperiments came out in the first bers to shuffle a deck of cards. Although they were
edition of his book, I was surprised. Feller devotes familiar with random numbers, they could not come
a problem to the results of generating 25 symbols up with a way of doing it, nor did anyone after class
from the set a, b, c, d and e @age 45, first edition) come in with a workable idea though several stu-
using random numbers with 0 and 1 corresponding dents made proposals. I concluded that inventing
to a, 2 and 3 to b, etc. He asks the student to find such a shuffling technique was a hard problem and
out how often the 25 produce 5 of each symbol. He that maybe Feller just did not know how at the
asks the student to check the results using random time of writing the footnote. My face-to-face at-
number tables. The answer seems to be about 1 tempts to verify this failed because his response
chance in 500. In a footnote Feller then says "They was evasive. I also recall Feller speaking at a
[fandom numbers] are occasionally extraordinarily scientific meeting where someone had complained
obliging: c.f. J. A. Greenwood and E. E. Stuart, about mistakes in published papers. He said essen-
Review of Dr. Feller's Critique, Journal o f Para- tially that we won't have any literature if mistakes
are disallowed and further claimed that he always
had mistakes in his own papers, hard as he tried to
avoid them. It was fun to hear him speak.
Frederick Mosteller is Roger I. Lee Professor of Mathematical Statistics, Emeritus, at Harvard University and Director of the Technology Assessment Group in the Harvard School of Public Health. His mailing address is Department of Statistics, Harvard University, Science Center, 1 Oxford Street, Cambridge, Massachusetts 02138.

Although I find Utts's discussion of replication engaging as a problem in human perception, I do always feel that people should not be expected to carry out difficult mathematical exercises in their head, off the cuff, without computers, textbooks or advisors. The kind of problem treated requires careful formulation and then careful analysis. Even
after a careful analysis is completed, there can be vigorous reasonable arguments about the appropriateness of the formulation and its analysis. These investigations leave me reinforced with the belief that people cannot do hard mathematical problems in their heads, rather than with an attitude toward or against ESP investigations.

When I first became aware of the work of Rhine and others, the concept seemed to me to be very important and I asked a psychologist friend why more psychologists didn't study this field. He responded that there were too many ways to do these experiments in a poorly controlled manner. At the time, I had just discovered that when viewed with light coming from a certain angle, I could read the backs of the cards of my parapsychology deck as clearly as the faces. While preparing these remarks in 1991, I found a note on page 305 of volume 1 of The Journal of Parapsychology (1937) indicating that imperfections in the cards precluded their use in unscreened situations, but that improvements were on the way. Thus I sympathize with Utts's conclusion that much is to be gained by studying how to carry out such work well. If there is no ESP, then we want to be able to carry out null experiments and get no effect, otherwise we cannot put much belief in work on small effects in non-ESP situations. If there is ESP, that is exciting. However, thus far it does not look as if it will replace the telephone.
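[Editorial illustration, not part of Mosteller's comment: his point about null experiments can be made concrete with a small simulation, a sketch with parameters of my own choosing rather than those of any study discussed here. Under the null hypothesis a ganzfeld-style trial is a 1-in-4 guess, so a well-run experiment with no effect should be declared "significant" at most about 5% of the time.]

```python
import math
import random

def binom_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p): the one-sided
    p-value for observing k or more hits in n guessing trials."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def null_rejection_rate(n_experiments=2000, trials=30, p=0.25,
                        alpha=0.05, seed=1991):
    """Simulate experiments with no effect (every trial a 1-in-4
    guess) and return the fraction declared significant at level
    alpha.  The critical value is the smallest hit count whose
    exact binomial tail probability does not exceed alpha."""
    crit = min(k for k in range(trials + 1)
               if binom_tail(trials, k, p) <= alpha)
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_experiments):
        hits = sum(rng.random() < p for _ in range(trials))
        significant += hits >= crit
    return significant / n_experiments

print("one-sided p-value for 12 hits in 30 trials:",
      round(binom_tail(30, 12, 0.25), 4))
print("null rejection rate:", null_rejection_rate())
```

By construction the simulated rejection rate stays at or below the nominal level; a well-controlled null experiment "gets no effect" apart from this expected rate of false positives.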

Rejoinder
Jessica Utts

I would like to thank this distinguished group of discussants for their thought-provoking contributions. They have raised many interesting and diverse issues. Certain points, such as Professor Mosteller's enlightening account of Feller's position, require no further comment. Other points indicate the need for clarification and elaboration of my original material. Issues raised by Professors Diaconis and Hyman and subsequent conversations with Robert Rosenthal and Charles Honorton have led me to consider the topic of "Satisfying the Skeptics." Since the conclusion in my paper was not that psychic phenomena have been proved, but rather that there is an anomalous effect that needs to be explained, comments by several of the discussants led me to address the question "Should Psi Research be Ignored by the Scientific Community?" Finally, each of the discussants addressed replication and modeling issues. The last part of my rejoinder comments on some of these ideas and discusses them in the context of parapsychology.

CLARIFICATION AND ELABORATION

Since my paper was a survey of hundreds of experiments and many published reports, I could obviously not provide all of the details to accompany this overview. However, there were details lacking in my paper that have led to legitimate questions and misunderstandings from several of the discussants. In this section, I address specific points raised by Professors Diaconis, Greenhouse, Hyman and Morris, by either clarifying my original statements or by adding more information from the original reports.

Points Raised by Diaconis

Diaconis raised the point that qualified skeptics and magicians should be active participants in parapsychology experiments. I will discuss this general concept in the next section, but elaborate here on the steps that were taken in this regard for the autoganzfeld experiments described in Section 5 of my paper. As reported by Honorton et al. (1990):

    Two experts on the simulation of psi ability have examined the autoganzfeld system and protocol. Ford Kross has been a professional mentalist [a magician who simulates psychic abilities] for over 20 years. . . . Mr. Kross has provided us with the following statement: "In my professional capacity as a mentalist, I have reviewed Psychophysical Research Laboratories' automated ganzfeld system and found it to provide excellent security against deception by subjects." We have received similar comments from Daryl Bem, Professor of Psychology at Cornell University. Professor Bem is well known for his research in social and personality psychology. He is also a member of the Psychic Entertainers Association and has performed for many years as a mentalist. He vis-
ited PRL for several days and was a subject in Series 101" [pages 134-135].

Honorton has also informed me (personal communication, July 25, 1991) that several self-proclaimed skeptics have visited his laboratory and received demonstrations of the autoganzfeld procedure and that no one expressed any concern with the security arrangements.

This may not completely satisfy Professor Diaconis' objections, but it does indicate a serious effort on the part of the researchers to involve such people. Further, the original publication of the research in Section 5 followed the reporting criteria established by Hyman and Honorton (1986), thus providing much more detail for the reader than the earlier published records to which Professor Diaconis alludes.

Points Raised by Greenhouse

Greenhouse enumerated four items that offer alternative explanations for the observed anomalous effects. Three of these (items 2-4) will be addressed in this section by elaborating on the details provided in my paper. His item 1 will be addressed in a later section.

Item 2 on his list questioned the role of experimenter expectancy effects as a potential confounder in parapsychological research. While the expectations of the experimenter may influence the reporting of results, the ganzfeld experiments (as well as other psi experiments) are conducted in such a way that experimenter expectancy cannot account for the results themselves. Rosenthal, whom Greenhouse cites as the expert in this area, addressed this in his background paper for the National Research Council (Harris and Rosenthal, 1988a) and concluded that the ganzfeld studies were adequately controlled in this regard. He also visited the autoganzfeld laboratory and was given a demonstration of that procedure.

Greenhouse's item 3, the question of what constitutes a direct hit, was addressed in my paper but perhaps needs elaboration. Although free-response experiments do generate substantial amounts of subjective data, the statistical analysis requires that the results for each trial be condensed into a single measure of whether or not a direct hit was achieved. This is done by presenting four choices to a judge (who of course does not know the correct answer) and asking the judge to decide which of the four best matches the subject's response. If the judge picks the target, a direct hit has occurred.

It is true that different judges may differ on their opinions of whether or not there has been a direct hit on any given trial, but in all cases the statistical question is the same. Under the null hypothesis, since the target is randomly selected from the four possibilities presented, the probability of a direct hit is 0.25 regardless of who does the judging. Thus, the observed anomalous effects cannot be explained by assuming there was an over-optimistic judge.

If Professor Greenhouse is suggesting that the source of judging may be a moderating variable that determines the magnitude of the demonstrated anomalous effect, I agree. The parapsychologists have considered this issue in the context of whether or not subjects should serve as judges for their own sessions, with differing opinions in different laboratories. This is an example of an area that has been suggested for further research.

Finally, Greenhouse raised the question of the accuracy of the file-drawer estimates used in the reported meta-analyses. I agree that it is instructive to examine the file-drawer estimate using more than one model. As an example, consider the 39 studies from the direct hit and autoganzfeld data bases. Rosenthal's fail-safe N estimates that there would have to be 371 studies in the file-drawer to account for the results. In contrast, the method proposed by Iyengar and Greenhouse gives a file-drawer estimate of 258 studies. Even this estimate is unrealistically large for a discipline with as few researchers as parapsychology. Given that the average number of trials per experiment is 30, this would represent almost 8000 unreported trials, and at least that many hours of work.

There are pros and cons to any method of estimating the number of unreported studies, and the actual practices of the discipline in question should be taken into account. Recognizing publication bias as an issue, the Parapsychological Association has had an official policy since 1975 against the selective reporting of positive results. Of the original ganzfeld studies reported in Section 4 of my paper, less than half were significant, and it is a matter of record that there are many nonsignificant studies and "failed replications" published in all areas of psi research. Further, the autoganzfeld database reported in Section 5 has no file-drawer. Given the publication practices and the size of the field, the proposed file-drawer cannot account for the observed effects.

Points Raised by Hyman

One of my goals in writing this paper was to present a fair account of recent work and debate in parapsychology. Thus, I was disturbed that Hyman, who has devoted much of his career to the study of parapsychology, and who had first-hand knowledge of the original published reports, be-
lieved that some of my statements were inaccurate and indicated that I had not carefully read the reports. I will address some of his specific objections and show that, except where noted, the accuracy of my original statements can be verified by further elaboration and clarification, with due apology for whatever necessary details were lacking in my original report.

Most of our points of disagreement concern the National Academy of Sciences (National Research Council) report Enhancing Human Performance (Druckman and Swets, 1988). This report evaluated several controversial areas, including parapsychology. Professor Hyman chaired the Parapsychology Subcommittee. Several background papers were commissioned to accompany this report, available from the "Publication on Demand Program" of the National Academy Press. One of the papers was written by Harris and Rosenthal, and entitled "Human Performance Research: An Overview."

Professor Hyman alleged that "Utts mistakenly asserts that my subcommittee on parapsychology commissioned Harris and Rosenthal to evaluate parapsychology experiments for us . . . ." I cannot find a statement in my paper that asserts that Harris and Rosenthal were commissioned by the subcommittee, nor can I find a statement that asserts that they were asked to evaluate parapsychology experiments. Nonetheless, I believe our substantive disagreement results from the fact that the work by Harris and Rosenthal was written in two parts, both of which I referenced in my paper. They were written several months apart, but published together, and each had its own history.

The first part (Harris and Rosenthal, 1988a) is the one to which I referred with the words "Rosenthal was commissioned by the National Academy of Sciences to prepare a background paper to accompany its 1988 report on parapsychology" (p. 372). According to Rosenthal (personal communication, July 23, 1991), he was asked to prepare a background paper to address evaluation issues and experimenter effects to accompany the report in five specific areas of research, including parapsychology.

The second part was a "Postscript" to the commissioned paper (Harris and Rosenthal, 1988b), and this is the one to which I referred on page 371 as "requested by Hyman in his capacity as Chair of the National Academy of Sciences' Subcommittee on Parapsychology." (It is probably this wording that led Professor Hyman to his erroneous allegation.) The postscript began with the words "We have been asked to respond to a letter from Ray Hyman, chair of the subcommittee on parapsychology, in which he raises questions about the presence and consequence of methodological flaws in the ganzfeld studies. . . ."

In reference to this postscript, I stand corrected on a technical point, because Hyman himself did not request the response to his own letter. As noted by Palmer, Honorton and Utts (1989), the postscript was added because:

    At one stage of the process, John Swets, Chair of the Committee, actually phoned Rosenthal and asked him to withdraw the parapsychology section of his [commissioned] paper. When Rosenthal declined, Swets and Druckman then requested that Rosenthal respond to criticisms that Hyman had included in a July 30, 1987 letter to Rosenthal [page 38].

A related issue on which I would like to elaborate concerns the correlation between flaws and success in the original ganzfeld data base. Hyman has misunderstood both my position and that of Harris and Rosenthal. He believes that I implicitly denied the importance of the flaws, so I will make my position explicit. I do not think there is any evidence that the experimental results were due to the identified flaws. The flaw analysis was clearly useful for delineating acceptable criteria for future experiments. Several experiments were conducted using those criteria. The results were similar to the original experiments. I believe that this indicates an anomaly in need of an explanation.

In discussing the paper and postscript by Harris and Rosenthal, Hyman stated that "The alleged contradictory conclusions [to the National Research Council report] of Harris and Rosenthal are based on a meta-analysis that supports Honorton's position when Honorton's [flaw] ratings are used and supports my position when my ratings are used." He believes that Harris and Rosenthal (and I) failed to see this point because the low power of the test associated with their analysis was not taken into account.

The analysis in question was based on a canonical correlation between flaw ratings and measures of successful outcome for the ganzfeld studies. The canonical correlation was 0.46, a value Hyman finds to be impressive. What he has failed to take into account, however, is that a canonical correlation gives only the magnitude of the relationship, and not the direction. A careful reading of Harris and Rosenthal (1988b) reveals that their analysis actually contradicted the idea that the flaws could account for the successful ganzfeld results, since "Interestingly, three of the six flaw variables correlated positively with the flaw canonical variable
and with the outcome canonical variable but three correlated negatively" (page 2, italics added). Rosenthal (personal communication, July 23, 1991) verified that this was indeed the point he was trying to make. Readers who are interested in drawing their own conclusions from first-hand analyses can find Hyman's original flaw codings in an Appendix to his paper (Hyman, 1985, pages 44-49).

Finally, in my paper, I stated that the parapsychology chapter of the National Research Council report critically evaluated statistically significant experiments, but not those that were nonsignificant. Professor Hyman "does not know how [I] got such an impression," so I will clarify by outlining some of the material reviewed in that report. There were surveys of three major areas of psi research: remote viewing (a particular type of free-response experiment), experiments with random number generators, and the ganzfeld experiments. As an example of where I got the impression that they evaluated only significant studies, consider the section on remote viewing. It began by referencing a published list of 28 studies. Fifteen of these were immediately discounted, since "only 13 . . . were published under refereed auspices" (Druckman and Swets, 1988, page 179). Four more were then dismissed, since "Of the 13 scientifically reported experiments, 9 are classified as successful" (page 179). The report continued by discussing these nine experiments, never again mentioning any of the remaining 19 studies. The other sections of the report placed similar emphasis on significant studies. I did not think this was a valid statistical method for surveying a large body of research.

Minor Point Raised by Morris

The final clarification I would like to offer concerns the minor point raised by Professor Morris, that "When Honorton omitted studies that did not report direct hits as a measure, he may have biased his sample." This possibility was explicitly addressed by Honorton (1985, page 59). He examined what would happen if z-scores of zero were inserted for the 10 studies for which the number of direct hits was not measured, but could have been. He found that even with this conservative scenario, the combined z-score only dropped from 6.60 to 5.67.

SATISFYING THE SKEPTICS

Parapsychology is probably the only scientific discipline for which there is an organization of skeptics trying to discredit its work. The Committee for the Scientific Investigation of Claims of the Paranormal (CSICOP) was established in 1976 by philosopher Paul Kurtz and sociologist Marcello Truzzi when "Kurtz became convinced that the time was ripe for a more active crusade against parapsychology and other pseudo-scientists" (Pinch and Collins, 1984, page 527). Truzzi resigned from the organization the next year (as did Professor Diaconis) "because of what he saw as the growing danger of the committee's excessive negative zeal at the expense of responsible scholarship" (Collins and Pinch, 1982, page 84). In an advertising brochure for their publication The Skeptical Inquirer, CSICOP made clear its belief that paranormal phenomena are worthy of scientific attention only to the extent that scientists can fight the growing interest in them. Part of the text of the brochure read: "Why the sudden explosion of interest, even among some otherwise sensible people, in all sorts of paranormal 'happenings'? . . . Ten years ago, scientists started to fight back. They set up an organization - The Committee for the Scientific Investigation of Claims of the Paranormal."

During the six years that I have been working with parapsychologists, they have repeatedly expressed their frustration with the unwillingness of the skeptics to specify what would constitute acceptable evidence, or even to delineate criteria for an acceptable experiment. The Hyman and Honorton Joint Communiqué was seen as the first major step in that direction, especially since Hyman was the Chair of the Parapsychology Subcommittee of CSICOP.

Hyman and Honorton (1986) devoted eight pages to "Recommendations for Future Psi Experiments," carefully outlining details for how the experiments should be conducted and reported. Honorton and his colleagues then conducted several hundred trials using these specific criteria and found essentially the same effect sizes as in earlier work for both the overall effect and effects with moderator variables taken into account. I would expect Professor Hyman to be very interested in the results of these experiments he helped to create. While he did acknowledge that they "have produced intriguing results," it is both surprising and disappointing that he spent only a scant two paragraphs at the end of his discussion on these results.

Instead, Hyman seems to be proposing yet another set of requirements to be satisfied before parapsychology should be taken seriously. It is difficult to sort out what those requirements should be from his account: "[They should] specify, in advance, the complete sample space and the critical region. When they get to the point where they can specify this along with some boundary conditions and make some reasonable predictions, then they
will have demonstrated something worthy of our attention."

Diaconis believes that psi experiments do not deserve serious attention unless they actively involve skeptics. Presumably, he is concerned with subject or experimenter fraud, or with improperly controlled experiments. There are numerous documented cases of fraud and trickery in purported psychic phenomena. Some of these were observed by Diaconis and reported in his article in Science. Such cases have mainly been revealed when investigators attempted to verify the claims of individual psychic practitioners in quasi-experimental or uncontrolled conditions. These instances have received considerable attention, probably because the claims are so sensational, the fraud is so easy to detect by a skilled observer and they are an easy target for skeptics looking for a way to discredit psychic phenomena. As noted by Hansen (1990), "Parapsychology has long been tainted by the fraudulent behavior of a few of those claiming psychic abilities" (page 25).

Control against deception by subjects in the laboratory has been discussed extensively in the parapsychological literature (see, e.g., Morris, 1986, and Hansen, 1990). Properly designed experiments should preclude the possibility of such fraud. Hyman and Honorton (1986, page 355) explicitly discussed precautions to be taken in the ganzfeld experiments, all of which were followed in the autoganzfeld experiments. Further, the controlled laboratory experiments discussed in my paper usually used a large number of subjects, a situation that minimizes the possibility that the results were due to fraud on the part of a few subjects. As for the possibility of experimenter fraud, it is of course an issue in all areas of science. There have been a few such instances in parapsychology, but since parapsychologists tend to be aware of this possibility, they were generally detected and exposed by insiders in the field.

It is not clear whether or not Diaconis is suggesting that a magician or "qualified skeptic" needs to be present at all times during a laboratory experiment. I believe that it would be more productive for such consultation to occur during the design phase, and during the implementation of some pilot sessions. This is essentially what was done for the autoganzfeld experiments, in which Professor Hyman, a skeptic as well as an accomplished magician, participated in the specification of design criteria, and mentalists Bem and Kross observed experimental sessions. Bem is also a well-respected experimental psychologist.

While I believe that the skeptics, particularly some of the more knowledgeable members of CSICOP, have served a useful role in helping to improve experiments, their counter-advocacy stance is counterproductive. If they are truly interested in resolving the question of whether or not psi abilities exist, I would expect them to encourage evaluation and experimentation by unbiased, skilled experimenters. Instead, they seem to be trying to discourage such interest by providing a moving target of requirements that must be satisfied first.

SHOULD PSI RESEARCH BE IGNORED BY THE SCIENTIFIC COMMUNITY?

In the conclusion of my paper, I argued that the scientific community should pay more attention to the experimental results in parapsychology. I was not suggesting that the accumulated evidence constitutes proof of psi abilities, but rather that it indicates that there is indeed an anomalous effect that needs an explanation. Greenhouse noted that my paper will not necessarily change anyone's view about the existence of paranormal phenomena, an observation with which I agree. However, I hope it will change some views about the importance of further investigation.

Mosteller and Diaconis both acknowledged that there are reasons for statisticians to be interested in studying the anomalous effects, regardless of whether or not psi is real. As noted by Mosteller, "If there is no ESP, then we want to be able to carry out null experiments and get no effect, otherwise we cannot put much belief in work on small effects in non-ESP situations." Diaconis concluded that "Parapsychology is worthy of serious study" partly because "If it is wrong, it offers a truly alarming massive case study of how statistics can mislead and be misused."

Greenhouse noted several sociological reasons for the resistance of the scientific community to accepting parapsychological phenomena. One of these is that they directly contradict the laws of physics. However, this assertion is not uniformly accepted by physicists (see, e.g., Oteri, 1975), and some of the leading parapsychological researchers hold Ph.D.s in physics.

Another reason cited by Greenhouse, and supported by Hyman, is that psychic phenomena are currently unexplainable by a unified scientific theory. But that is precisely the reason for more intensive investigation. The history of science and medicine is replete with examples where empirical departures from expectation led to important findings or theoretical models. For example, the causal connection between cigarette smoking and lung cancer was established only after years of statisti-
cal studies, resulting from the observation by one physician that his lung cancer patients who smoked did not recover at the same rate as those who did not. There are many medications in common use for which there is still no medical explanation for their observed therapeutic effectiveness, but that does not prohibit their use.

There are also examples where a coherent theory of a phenomenon was impossible because the requisite background information was missing. For instance, the current theory of endorphins as an explanation for the success of acupuncture would have been impossible before the discovery of endorphins in the 1970s.

Mosteller's observation that ESP will not replace the telephone leads to the question of whether or not psi abilities are of any use even if they do exist, since the effects are relatively small. Again, a look at history is instructive. For example, in 1938 Fortune Magazine reported that "At present, few scientists foresee any serious or practical use for atomic energy."

Greenhouse implied that I think parapsychology is not accepted by more of the scientific community only because they have not examined the data, but this misses the main point I was trying to make. The point is that individual scientists are willing to express an opinion without any reference to data. The interesting sociological question is why they are so resistant to examining the data. One of the major reasons is undoubtedly the perception identified by Greenhouse that there is some connection between parapsychology and the occult, or worse, religious beliefs. Since religion is clearly not in the realm of science, the very thought that parapsychology might be a science leads to what psychologists call "cognitive dissonance." As noted by Griffin (1988), "People feel unpleasantly aroused when two cognitions are dissonant - when they contradict one another" (page 33). Griffin continued by observing that there are also external reasons for scientists to discount the evidence, since "It is generally easier to be a skeptic in the face of novel evidence; skeptics may be overly conservative, but they are rarely held up to ridicule" (page 34).

In summary, while it may be safer and more consonant with their beliefs for individual scientists to ignore the observed anomalous effects, the scientific community should be concerned with finding an explanation. The explanations proposed by Greenhouse and others are simply not tenable.

REPLICATION AND MODELING

Parapsychology is one of the few areas where a point null hypothesis makes some sense. We can specify what should happen if there is no such thing as ESP by using simple binomial models, either to find p-values or Bayes factors. As noted by Mosteller, if there is no ESP, or other nonstatistical explanation for an effect, we should be able to carry out null experiments and get no effect. Otherwise, we should be worried about using these simple models for other applications.

Greenhouse, in his first alternative explanation for the results, questioned the use of these simple models, but his criticisms do not seem relevant to the experiments discussed in Section 5 of my paper. The experiments to which he referred were either poorly controlled, in which case no statistical analysis could be valid, or were specifically designed to incorporate trial by trial feedback in such a way that the analysis needed to account for the added information. Models and analyses for such experiments can be found in the references given at the end of Diaconis' discussion.

For the remainder of this discussion, I will confine myself to models appropriate for experiments such as the autoganzfeld described in Section 5. It is this scenario for which Bayarri and Berger computed Bayes factors, and for which Dawson discussed possible Bayesian models.

If ESP does exist, it is undoubtedly a gross oversimplification to use a simple non-null binomial model for these experiments. In addition to potential differences in ability among subjects, there were also observed differences due to dynamic versus static targets, whether or not the sender was a friend, and how the receiver scored on measures of extraversion. All of these differences were anticipated in advance and could be incorporated into models as covariates.

It is nonetheless instructive to examine the Bayes factor computed by Bayarri and Berger for the simple non-null binomial model. First, the observed anomalous effects would be less interesting if the Bayes factor was small for reasonable values of r, as it was for the random number generator experiments analyzed by Jefferys (1990), most of which purported to measure psychokinesis instead of ESP. Second, the Bayes factor provides a rough measure of the strength of the evidence against the null hypothesis and is a much more sensible summary than the p-value. The Bayes factors provided by Bayarri and Berger are probably more conservative, in the sense of favoring the null hypothesis, than those that would result from priors elicited from parapsychologists, but are probably reasonable for those who know nothing about past observed effects. I expect that most parapsychologists would not opt for a prior symmetric around chance, but would still choose one with some mass below
chance. The final reason it is instructive to examine these Bayes factors is that they provide a quantitative challenge to skeptics to be explicit about their prior probabilities for the null and alternative hypotheses.

Dawson discussed the use of more complex Bayesian models for the analysis of the autoganzfeld data. She proposed a hierarchical model where the number of successes for each experiment followed a binomial distribution with hit rate p_i, and logit(p_i) came from a normal distribution with noninformative priors for the mean and variance. She then expanded this model to include heavier tails by allowing an additional scale parameter for each experiment. Her rationale for this expanded model was that there were clear outlier series in the data.

The hierarchical model proposed by Dawson is a reasonable place to start given only that there were several experiments trying to measure the same effect, conducted by different investigators. In the autoganzfeld database, the model could be expanded to incorporate the additional information available. Each experiment contained some sessions with static targets and some with dynamic targets, some sessions in which the sender and receiver were friends and others in which they were not, and some information about the extraversion score of the receiver. All of this information could be included by defining the individual session as the unit of analysis and including a vector of covariates for each session. It would then make sense to construct a logistic regression model with a component for each experiment, following the model proposed by Dawson, and a term Xβ to include the covariates. A prior distribution for β could include information from earlier ganzfeld studies. The advantage of using a Bayesian approach over a simple logistic regression is that information could be continually updated. Some of the recent work in Bayesian design could then be incorporated so that future trials make use of the best conditions.

Several of the discussants addressed the concept of replication. I agree with Mosteller's implication that it was unwise for the audience in my seminar to respond to my replication questions so quickly, and that was precisely my point. Most nonstatisticians do not seem to understand the complexity of the replication question. Parenthetically, when I posed the same scenario to an audience of statisticians, very few were willing to offer a quick opinion.

Bayarri and Berger provided an insightful discussion of the purpose of replication, offering quantitative answers to questions that were implicit in my discussion. Their analyses suggest some alternatives to power analysis that might be considered when designing a new study to try to replicate a questionable result.

Morris addressed the question of what constitutes a replication of a meta-analysis. He distinguished between exact and conceptual replications. Using his distinction, the autoganzfeld meta-analysis could be viewed as a conceptual replication of the earlier ganzfeld meta-analysis. He noted that when such a conceptual replication offers results similar to those of the original meta-analysis, it lends legitimacy to the original results, as was the case with the autoganzfeld meta-analysis.

Greenhouse and Morris both noted the value of meta-analysis as a method of comparing different conditions, and I endorse that view. Conditions found to produce different effects in one meta-analysis could be explicitly studied in a conceptual replication. One of the intriguing results of the autoganzfeld experiments was that they supported the distinction between effect sizes for dynamic versus static targets found in the earlier ganzfeld work, and they supported the relationship between ESP and extraversion found in the meta-analysis by Honorton, Ferrari and Bem (1990).

Most modern parapsychologists, as indicated by Morris, recognize that demonstrating the validity of their preliminary findings will depend on identifying and utilizing "moderator variables" in future studies. The use of such variables will require more complicated statistical models than the simple binomial models used in the past. Further, models are needed for combining results from several different experiments that don't oversimplify at the expense of lost information.

In conclusion, the anomalous effect that persists throughout the work reviewed in my paper will be better understood only after further experimentation that takes into account the complexity of the system. More realistic, and thus more complex, models will be needed to analyze the results of those experiments. This presents a challenge that I hope will be welcomed by the statistics community.

ADDITIONAL REFERENCES

ALLISON, P. (1979). Experimental parapsychology as a rejected science. The Sociological Review Monograph 27 271-291.
BARBER, B. (1961). Resistance by scientists to scientific discovery. Science 134 596-602.
BERGER, J. O. and DELAMPADY, M. (1987). Testing precise hypotheses (with discussion). Statist. Sci. 2 317-352.
CHUNG, F. R. K., DIACONIS, P., GRAHAM, R. L. and MALLOWS, C. L. (1981). On the permanents of complements of the direct sum of identity matrices. Adv. Appl. Math. 2 121-137.
COCHRAN, W. G. (1954). The combination of estimates from different experiments. Biometrics 10 101-129.
COLLINS, H. and PINCH, T. (1979). The construction of the paranormal: Nothing unscientific is happening. The Sociological Review Monograph 27 237-270.
COLLINS, H. M. and PINCH, T. J. (1982). Frames of Meaning: The Social Construction of Extraordinary Science. Routledge & Kegan Paul, London.
CORNFIELD, J. (1959). Principles of research. American Journal of Mental Deficiency 64 240-252.
DEMPSTER, A. P., SELWYN, M. R. and WEEKS, B. J. (1983). Combining historical and randomized controls for assessing trends in proportions. J. Amer. Statist. Assoc. 78 221-227.
DIACONIS, P. and GRAHAM, R. L. (1981). The analysis of sequential experiments with feedback to subjects. Ann. Statist. 9 236-244.
FISHER, R. A. (1932). Statistical Methods for Research Workers, 4th ed. Oliver and Boyd, London.
FISHER, R. A. (1935). Has Mendel's work been rediscovered? Ann. of Sci. 1 116-137.
GALTON, F. (1901-2). Biometry. Biometrika 1 7-10.
GREENHOUSE, J., FROMM, D., IYENGAR, S., DEW, M. A., HOLLAND, A. and KASS, R. (1990). Case study: The effects of rehabilitation therapy for aphasia. In The Future of Meta-Analysis (K. W. Wachter and M. L. Straf, eds.) 31-32. Russell Sage Foundation, New York.
GRIFFIN, D. (1988). Intuitive judgment and the evaluation of evidence. In Enhancing Human Performance: Issues, Theories and Techniques, Background Papers-Part I. National Academy Press, Washington, D.C.
HANSEN, G. (1990). Deception by subjects in psi research. Journal of the American Society for Psychical Research 84 25-80.
HUNTER, J. and SCHMIDT, F. (1990). Methods of Meta-Analysis. Sage, London.
IYENGAR, S. and GREENHOUSE, J. (1988). Selection models and the file drawer problem (with discussion). Statist. Sci. 3 109-135.
LOUIS, T. A. (1984). Estimating an ensemble of parameters using Bayes and empirical Bayes methods. J. Amer. Statist. Assoc. 79 393-398.
MANTEL, N. and HAENSZEL, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22 719-748.
MORRIS, C. (1983). Parametric empirical Bayes inference: Theory and applications (rejoinder). J. Amer. Statist. Assoc. 78 47-65.
MORRIS, R. L. (1986). What psi is not: The necessity for experiments. In Foundations of Parapsychology (H. L. Edge, R. L. Morris, J. H. Rush and J. Palmer, eds.) 70-110. Routledge & Kegan Paul, London.
MOSTELLER, F. and BUSH, R. R. (1954). Selected quantitative techniques. In Handbook of Social Psychology (G. Lindzey, ed.) 1 289-334. Addison-Wesley, Cambridge, Mass.
MOSTELLER, F. and CHALMERS, T. (1991). Progress and problems in meta-analysis. Statist. Sci. To appear.
OTERI, L., ed. (1975). Quantum Physics and Parapsychology. Parapsychology Foundation, New York.
PINCH, T. J. and COLLINS, H. M. (1984). Private science and public knowledge: The Committee for the Scientific Investigation of Claims of the Paranormal and its use of the literature. Social Studies of Science 14 521-546.
PLATT, J. R. (1964). Strong inference. Science 146 347-353.
ROSENTHAL, R. (1966). Experimenter Effects in Behavioral Research. Appleton-Century-Crofts, New York.
ROSENTHAL, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin 86 638-641.
RYAN, L. M. and DEMPSTER, A. P. Weighted normal plots. Technical Report 3942, Dana-Farber Cancer Inst., Boston, Mass.
SAMANIEGO, F. J. and UTTS, J. (1983). Evaluating performance in continuous experiments with feedback to subjects. Psychometrika 48 195-209.
SMITH, M. and GLASS, G. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist 32 752-760.
WACHTER, K. (1988). Disturbed by meta-analysis? Science 241 1407-1408.
WEST, M. (1985). Generalized linear models: Scale parameters, outlier accommodation and prior distributions. In Bayesian Statistics 2 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.) 531-558. North-Holland, Amsterdam.
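As an illustration of the session-level modeling strategy discussed in the rejoinder (a logistic regression with a component for each experiment and a covariate term Xβ for target type, sender-receiver friendship and receiver extraversion), the following sketch simulates hypothetical ganzfeld-style data and fits that model by maximum likelihood. Everything here is invented for illustration: the experiment count, session counts, covariate effects and the `fit_logistic` helper are assumptions, not the analysis actually performed; only the 25% chance hit rate (the four-choice ganzfeld design) comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical session-level data: 10 experiments, 50 sessions each.
n_exp, n_per = 10, 50
exp_id = np.repeat(np.arange(n_exp), n_per)
dynamic = rng.integers(0, 2, n_exp * n_per)    # 1 = dynamic target
friend = rng.integers(0, 2, n_exp * n_per)     # 1 = sender and receiver are friends
extra = rng.normal(0.0, 1.0, n_exp * n_per)    # standardized extraversion score

# Assumed true model: logit(p) = experiment effect + covariate effects.
# log(0.25/0.75) is the logit of the 25% chance hit rate in a four-choice design.
exp_eff = rng.normal(np.log(0.25 / 0.75), 0.15, n_exp)
eta = exp_eff[exp_id] + 0.25 * dynamic + 0.10 * friend + 0.10 * extra
p = 1.0 / (1.0 + np.exp(-eta))
hit = rng.random(n_exp * n_per) < p            # session outcome: hit or miss

# Design matrix: one indicator column per experiment, plus the three covariates.
X = np.column_stack([np.eye(n_exp)[exp_id], dynamic, friend, extra])
y = hit.astype(float)

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted hit probabilities
        W = mu * (1.0 - mu)                    # binomial variance weights
        H = X.T @ (W[:, None] * X)             # observed information (Hessian)
        beta += np.linalg.solve(H, X.T @ (y - mu))  # Newton step
    return beta

beta = fit_logistic(X, y)
print("overall hit rate:", y.mean())
print("estimated dynamic-target effect (log-odds):", beta[n_exp])
```

A Bayesian version of the same model would place a prior on the coefficients, for instance one informed by the earlier ganzfeld studies, and update it as new sessions accumulate, which is the continual-updating advantage the rejoinder points to.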
