

Statistics Notes
Absence of evidence is not evidence of absence
Douglas G Altman, J Martin Bland

Medical Statistics Laboratory, Imperial Cancer Research Fund, London WC2A 3PX
Douglas G Altman, head

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, reader in medical statistics

Correspondence to: Mr Altman.

BMJ 1995;311:485

The non-equivalence of statistical significance and clinical importance has long been recognised, but this error of interpretation remains common. Although a significant result in a large study may sometimes not be clinically important, a far greater problem arises from misinterpretation of non-significant findings. By convention a P value greater than 5% (P>0.05) is called "not significant." Randomised controlled clinical trials that do not show a significant difference between the treatments being compared are often called "negative." This term wrongly implies that the study has shown that there is no difference, whereas usually all that has been shown is an absence of evidence of a difference. These are quite different statements.

The sample size of controlled trials is generally inadequate, with a consequent lack of power to detect real, and clinically worthwhile, differences in treatment. Freiman et al found that only 30% of a sample of 71 trials published in the New England Journal of Medicine in 1978-9 with P>0.1 were large enough to have a 90% chance of detecting even a 50% difference in the effectiveness of the treatments being compared, and they found no improvement in a similar sample of trials published in 1988.1 To interpret all these "negative" trials as providing evidence of the ineffectiveness of new treatments is clearly wrong and foolhardy. The term "negative" should not be used in this context.2

A recent example is given by a trial comparing octreotide and sclerotherapy in patients with variceal bleeding.3 The study was carried out on a sample of only 100 despite a reported calculation that suggested that 1800 patients were needed. This trial had only a 5% chance of getting a statistically significant result if the stated clinically worthwhile treatment difference truly existed. One consequence of such low statistical power was a wide confidence interval for the treatment difference. The authors concluded that the two treatments were equally effective despite a 95% confidence interval that included differences between the cure rates of the two treatments of up to 20 percentage points.

Similar evidence of the dangers of misinterpretation of non-significant results is found in numerous meta-analyses (overviews) of published trials, when few or none of the individual trials were statistically large enough. A dramatic example is provided by the overview of clinical trials evaluating fibrinolytic treatment (mostly streptokinase) for preventing reinfarction after acute myocardial infarction. The overview of randomised controlled trials found a modest but clinically worthwhile (and highly significant) reduction in mortality of 22%,4 but only five of the 24 trials had shown a statistically significant effect with P<0.05. The lack of statistical significance of most of the individual trials led to a long delay before the true value of streptokinase was appreciated.

While it is usually reasonable not to accept a new treatment unless there is positive evidence in its favour, when issues of public health are concerned we must question whether the absence of evidence is a valid enough justification for inaction. A recent publicised example is the suggested link between some sudden infant deaths and antimony in cot mattresses. Statements about the absence of evidence are common, for example in relation to the possible link between violent behaviour and exposure to violence on television and video, the possible harmful effects of pesticide residues in drinking water, the possible link between electromagnetic fields and leukaemia, and the possible transmission of bovine spongiform encephalopathy from cows. Can we be comfortable that the absence of clear evidence in such cases means that there is no risk or only a negligible one?

When we are told that "there is no evidence that A causes B" we should first ask whether absence of evidence means simply that there is no information at all. If there are data we should look for quantification of the association rather than just a P value. Where risks are small P values may well mislead: confidence intervals are likely to be wide, indicating considerable uncertainty. While we can never prove the absence of a relation, when necessary we should seek evidence against the link between A and B, for example from case-control studies. The importance of carrying out such studies will relate to the seriousness of the postulated effect and how widespread is the exposure in the population.

1 Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial: survey of two sets of "negative" trials. In: Bailar JC, Mosteller F, eds. Medical uses of statistics. 2nd ed. Boston, MA: NEJM Books, 1992:357-73.
2 Chalmers I. Proposal to outlaw the term "negative trial." BMJ 1985;290:1002.
3 Sung JJY, Chung SCS, Lai C-W, Chan FKL, Leung JWC, Yung M-L, Kassianides C, et al. Octreotide infusion or emergency sclerotherapy for variceal haemorrhage. Lancet 1993;342:637-41.
4 Yusuf S, Collins R, Peto R, Furberg C, Stampfer MJ, Goldhaber SZ, et al. Intravenous and intracoronary fibrinolytic therapy in acute myocardial infarction: overview of results on mortality, reinfarction and side-effects from 33 randomized controlled trials. Eur Heart J 1985;6:556-85.

BMJ VOLUME 311 19 AUGUST 1995 485



Education And Debate

Statistics Notes: Measurement error


BMJ 1996; 313 doi: https://doi.org/10.1136/bmj.313.7059.744 (Published 21 September 1996) Cite this as: BMJ
1996;313:744

J Martin Bland, professor of medical statistics (a), Douglas G Altman, head (b)

(a) Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
(b) ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF

Correspondence to: Professor Bland.

Several measurements of the same quantity on the same subject will not in general be the same. This may be
because of natural variation in the subject, variation in the measurement process, or both. For example, table 1
shows four measurements of lung function in each of 20 schoolchildren (taken from a larger study1). The first
child shows typical variation, having peak expiratory flow rates of 190, 220, 200, and 200 l/min.

Child No    PEFR (l/min)           Mean      SD

 1          190  220  200  200    202.50    12.58
 2          220  200  240  230    222.50    17.08
 3          260  260  240  280    260.00    16.33
 4          210  300  280  265    263.75    38.60
 5          270  265  280  270    271.25     6.29
 6          280  280  270  275    276.25     4.79
 7          260  280  280  300    280.00    16.33
 8          275  275  275  305    282.50    15.00
 9          280  290  300  290    290.00     8.16
10          320  290  300  290    300.00    14.14
11          300  300  310  300    302.50     5.00
12          270  250  330  370    305.00    55.08
13          320  330  330  330    327.50     5.00
14          335  320  335  375    341.25    23.58
15          350  320  340  365    343.75    18.87
16          360  320  350  345    343.75    17.02
17          330  340  380  390    360.00    29.44
18          335  385  360  370    362.50    21.02
19          400  420  425  420    416.25    11.09
20          430  460  480  470    460.00    21.60

Table 1
Repeated peak expiratory flow rate (PEFR) measurements for 20 schoolchildren

Let us suppose that the child has a true average value over all possible measurements, which is what we really
want to know when we make a measurement. Repeated measurements on the same subject will vary around the
true value because of measurement error. The standard deviation of repeated measurements on the same
subject enables us to measure the size of the measurement error. We shall assume that this standard deviation
is the same for all subjects, as otherwise there would be no point in estimating it. The main exception is when the
measurement error depends on the size of the measurement, usually with measurements becoming more
variable as the magnitude of the measurement increases. We deal with this case in a subsequent statistics note.
The common standard deviation of repeated measurements is known as the within-subject standard deviation,
which we shall denote by $s_w$.

To estimate the within-subject standard deviation, we need several subjects with at least two measurements for
each. In addition to the data, table 1 also shows the mean and standard deviation of the four readings for each
child. To get the common within-subject standard deviation we actually average the variances, the squares of the
standard deviations. The mean within-subject variance is 460.52, so the estimated within-subject standard
deviation is $s_w = \sqrt{460.52} = 21.5$ l/min. The calculation is easier using a program that performs one
way analysis of variance2 (table 2). The value called the residual mean square is the within-subject variance.
The analysis of variance method is the better approach in practice, as it deals automatically with the case of
subjects having different numbers of observations. We should check the assumption that the standard deviation
is unrelated to the magnitude of the measurement. This can be done graphically, by plotting the individual
subjects' standard deviations against their means (see fig 1). Any important relation should be fairly obvious, but
we can check analytically by calculating a rank correlation coefficient. For the figure there does not appear to be
a relation (Kendall's $\tau$ = 0.16, P = 0.3).
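
The averaging of variances described above can be checked with a short sketch. It assumes the readings are transcribed correctly from table 1; the variable names are illustrative and not part of the original note. It computes the mean within-subject variance two ways, by averaging the per-child variances and as the residual mean square of a one way analysis of variance, which coincide here because every child has four readings.

```python
from math import sqrt
from statistics import mean, variance

# PEFR readings (l/min) transcribed from table 1: four readings for each of 20 children.
pefr = [
    [190, 220, 200, 200], [220, 200, 240, 230], [260, 260, 240, 280],
    [210, 300, 280, 265], [270, 265, 280, 270], [280, 280, 270, 275],
    [260, 280, 280, 300], [275, 275, 275, 305], [280, 290, 300, 290],
    [320, 290, 300, 290], [300, 300, 310, 300], [270, 250, 330, 370],
    [320, 330, 330, 330], [335, 320, 335, 375], [350, 320, 340, 365],
    [360, 320, 350, 345], [330, 340, 380, 390], [335, 385, 360, 370],
    [400, 420, 425, 420], [430, 460, 480, 470],
]

# Average the per-child variances (not the standard deviations), then take the square root.
mean_within_var = mean(variance(readings) for readings in pefr)
s_w = sqrt(mean_within_var)

# The same quantity as a residual mean square: pooled within-child sum of
# squares divided by its degrees of freedom (60 for these data).
within_ss = sum(sum((x - mean(r)) ** 2 for x in r) for r in pefr)
within_df = sum(len(r) - 1 for r in pefr)
residual_ms = within_ss / within_df

print(round(mean_within_var, 2), round(s_w, 1), round(residual_ms, 2))
```

With unequal numbers of readings per child the two calculations would differ, and the residual mean square is the one to use.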

Source of     Degrees of    Sum of        Mean        Variance     Probability
variation     freedom       squares       square      ratio (F)    (P)

Children      19            285318.44     15016.78    32.6         <0.0001
Residual      60             27631.25       460.52
Total         79            312949.69

Table 2
One way analysis of variance for the data of table 1

Fig 1
Individual subjects' standard deviations plotted against their means

A common design is to take only two measurements per subject. In this case the method can be simplified
because the variance of two observations is half the square of their difference. So, if the difference between the
two observations for subject i is $d_i$, the within-subject standard deviation $s_w$ is given by

$$s_w^2 = \frac{1}{2n} \sum d_i^2$$

where n is the number of subjects. We can check for a relation between standard deviation and mean by plotting
for each subject the absolute value of the difference (that is, ignoring any sign) against the mean.
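
As a minimal sketch of the two-measurement shortcut, the function below applies the sum of squared differences divided by 2n and cross-checks it against averaging the per-subject variances. The function name is hypothetical, and the pairs are only an illustration: they are the first two PEFR readings of children 1 to 5 in table 1, not a separate dataset.

```python
from math import sqrt
from statistics import mean, variance


def within_subject_sd_from_pairs(pairs):
    """Within-subject SD from two measurements per subject: s_w^2 = sum(d_i^2) / (2n)."""
    n = len(pairs)
    sum_sq_diff = sum((a - b) ** 2 for a, b in pairs)
    return sqrt(sum_sq_diff / (2 * n))


# Illustration only: first two PEFR readings (l/min) of children 1 to 5 in table 1.
pairs = [(190, 220), (220, 200), (260, 260), (210, 300), (270, 265)]

s_w_pairs = within_subject_sd_from_pairs(pairs)

# Cross-check: the variance of two observations is half their squared difference,
# so averaging the per-subject variances must give the same answer.
s_w_check = sqrt(mean(variance(p) for p in pairs))

print(round(s_w_pairs, 2), round(s_w_check, 2))
```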

The measurement error can be quoted as $s_w$. The difference between a subject's measurement and the true
value would be expected to be less than $1.96\, s_w$ for 95% of observations. Another useful way of presenting
measurement error is sometimes called the repeatability, which is $\sqrt{2} \times 1.96\, s_w$ or $2.77\, s_w$. The
difference between two measurements for the same subject is expected to be less than $2.77\, s_w$ for 95% of pairs
of observations. For the data in table 1 the repeatability is 2.77 x 21.5 = 60 l/min. The large variability in peak
expiratory flow rate is well known, so individual readings of peak expiratory flow are seldom used. The variable
used for analysis in the study from which table 1 was taken was the mean of the last three readings.1
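
The arithmetic behind the repeatability can be sketched as follows. The factor of the square root of 2 arises because the difference between two independent measurements on the same subject has variance twice the within-subject variance; the variable names are illustrative, and the starting value is the residual mean square quoted for table 1.

```python
from math import sqrt

# Within-subject SD from the mean within-subject variance quoted in the text (l/min).
s_w = sqrt(460.52)

# 95% limit for the difference between one measurement and the true value.
limit_single = 1.96 * s_w

# Repeatability: 95% limit for the difference between two measurements on the
# same subject; the difference has standard deviation sqrt(2) * s_w.
coef = sqrt(2) * 1.96
repeatability = coef * s_w

print(round(coef, 2), round(limit_single, 1), round(repeatability, 1))
```

Using the unrounded standard deviation gives a repeatability a little under 60 l/min, in line with the rounded figure quoted above.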

Other ways of describing the repeatability of measurements will be considered in subsequent statistics notes.

References
1. Bland JM, Holland WW, Elliott A. The development of respiratory symptoms in a cohort of Kent schoolchildren. Bull
Physio-Path Resp 1974;10:699-716.

2. Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ 1996;312:1472.

Education And Debate

Statistics Notes: Measurement error and correlation coefficients

BMJ 1996; 313 doi: https://doi.org/10.1136/bmj.313.7048.41 (Published 06 July 1996) Cite this as: BMJ
1996;313:41

J Martin Bland, professor of medical statistics (a), Douglas G Altman, head (b)

(a) Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
(b) ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF

Correspondence to: Professor Bland

Measurement error is the variation between measurements of the same quantity on the same individual.1 To
quantify measurement error we need repeated measurements on several subjects. We have discussed the
within-subject standard deviation as an index of measurement error,1 which we like as it has a simple clinical
interpretation. Here we consider the use of correlation coefficients to quantify measurement error.

A common design for the investigation of measurement error is to take pairs of measurements on a group of
subjects, as in table 1. When we have pairs of observations it is natural to plot one measurement against the
other. The resulting scatter diagram (see figure 1) may tempt us to calculate a correlation coefficient between the
first and second measurement. There are difficulties in interpreting this correlation coefficient. In general, the
correlation between repeated measurements will depend on the variability between subjects. Samples containing
subjects who differ greatly will produce larger correlation coefficients than will samples containing similar
subjects. For example, suppose we split this group in whom we have measured forced expiratory volume in one
second (FEV1) into two subsamples, the first 10 subjects and the second 10 subjects. As table 1 is ordered by
the first FEV1 measurement, both subsamples vary less than does the whole sample. The correlation for the first
subsample is r = 0.63 and for the second it is r = 0.31, both less than r = 0.77 for the full sample. The correlation
coefficient thus depends on the way the sample is chosen, and it has meaning only for the population from which
the study subjects can be regarded as a random sample. If we select subjects to give a wide range of the
measurement, the natural approach when investigating measurement error, this will inflate the correlation
coefficient.
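
The dependence of the correlation on the spread of subjects can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of the FEV1 data: the true-value mean (1.6), between-subject SD (0.25), measurement error SD (0.1), sample size, and seed are all hypothetical choices. The measurement error is identical for every subject, yet the correlation falls when the sample is split by the first measurement, as in the text.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


random.seed(1)  # hypothetical seed, for repeatability of the sketch
n = 4000

# Hypothetical true values plus independent measurement error of fixed size.
true_values = [random.gauss(1.6, 0.25) for _ in range(n)]
first = [t + random.gauss(0, 0.1) for t in true_values]
second = [t + random.gauss(0, 0.1) for t in true_values]

# Order by the first measurement and split into two halves.
pairs = sorted(zip(first, second))
lower, upper = pairs[: n // 2], pairs[n // 2:]

r_full = pearson_r(first, second)
r_lower = pearson_r(*zip(*lower))
r_upper = pearson_r(*zip(*upper))

print(round(r_full, 2), round(r_lower, 2), round(r_upper, 2))
```

Both halves give a smaller correlation than the full sample although nothing about the measurement method has changed, which is why the coefficient has meaning only for the population sampled.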

              Measurement
Subject No    1st     2nd

 1            1.19    1.37
 2            1.33    1.32
 3            1.35    1.40
 4            1.36    1.25
 5            1.38    1.29
 6            1.38    1.40
 7            1.38    1.40
 8            1.40    1.38
 9            1.43    1.38
10            1.43    1.51
11            1.54    1.57
12            1.59    1.60
13            1.61    1.53
14            1.61    1.61
15            1.62    1.68
16            1.80    1.82
17            1.80    1.82
18            1.85    1.89
19            1.94    2.10
20            2.10    2.20

Table 1
Pairs of measurements of FEV1 (litres) a few weeks apart from 20 Scottish schoolchildren, taken from a
larger study (D Strachan, personal communication)

Fig 1
Measurements from pairs of observations plotted against each other

The correlation coefficient between repeated measurements is often called the reliability of the measurement
method. It is widely used in the validation of psychological measures such as scales of anxiety and depression,
where it is known as the test-retest reliability. In such studies it is quoted for different populations (university
students, psychiatric outpatients, etc) because the correlation coefficient differs between them as a result of
differing ranges of the quantity being measured. The user has to select the correlation from the study population
most like the user's own.

Another problem with the use of the correlation coefficient between the first and second measurements is that
there is no reason to suppose that their order is important. If the order were important the measurements would
not be repeated observations of the same thing. We could reverse the order of any of the pairs and get a slightly
different value of the correlation coefficient between repeated measurements. For example, reversing the order
of the even numbered subjects in table 1 gives r = 0.80 instead of r = 0.77. The intra-class correlation coefficient
avoids this problem. It estimates the average correlation among all possible orderings of pairs. It also extends
easily to the case of more than two observations per subject, where it estimates the average correlation between
all possible pairs of observations.

Few computer programs will calculate the intra-class correlation coefficient directly, but when the number of
observations is the same for each subject it can be found from a one way analysis of variance table2 such as
table 2. We need the total sum of squares, SST, and the sum of squares between subjects, SSB.
Then

$$r_I = \frac{m\,SSB - SST}{(m - 1)\,SST}$$

where m is the number of observations per subject. For table 2, m = 2 and

$$r_I = \frac{2 \times 1.52981 - 1.74651}{(2 - 1) \times 1.74651} = 0.75$$

Source of     Degrees of    Sum of      Mean       Variance     Probability
variation     freedom       squares     square     ratio (F)    (P)

Children      19            1.52981     0.08052    7.4          <0.0001
Residual      20            0.21670     0.01086
Total         39            1.74651

Table 2
One way analysis of variance for the data in table 1
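
The intra-class correlation formula given in the text can be applied directly to the sums of squares in table 2. The function name below is hypothetical; the formula and the two sums of squares are those quoted in the note, and the calculation assumes the same number of observations for every subject.

```python
def intraclass_correlation(ss_between, ss_total, m):
    """Intra-class correlation from a one way ANOVA table.

    ss_between: sum of squares between subjects (SSB)
    ss_total:   total sum of squares (SST)
    m:          number of observations per subject (equal for all subjects)
    """
    return (m * ss_between - ss_total) / ((m - 1) * ss_total)


# Sums of squares taken from table 2 (two FEV1 measurements per child).
r_i = intraclass_correlation(ss_between=1.52981, ss_total=1.74651, m=2)

print(round(r_i, 2))  # 0.75
```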

In practice, there will usually be little difference between r and rI for true repeated measurements. If, however,
there is a systematic change from the first measurement to the second, as might be caused by a learning effect,
rI will be much less than r. If there was such an effect the measurements would not be made under the same
conditions and so we could not measure reliability.

The correlation coefficient can be used to compare measurements of different quantities, such as different scales
for measuring anxiety. We could make repeated measurements of all the quantities on the same subjects and
calculate intra-class correlations. The measures with the highest correlation between repeated measurements
would discriminate best between individuals; in other words they would carry the most information. For most
applications, however, we prefer the within-subjects standard deviation as an index of measurement error, as it
has a more direct interpretation which can be applied to individual measurements.1

References
1. Bland JM, Altman DG. Measurement error. BMJ 1996;312:1654.

2. Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ 1996;312:1472-3.
Education and debate

Statistics notes
Bayesians and frequentists
J Martin Bland, Douglas G Altman

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, head

Correspondence to: Professor Bland

BMJ 1998;317:1151

There are two competing philosophies of statistical analysis: the Bayesian and the frequentist. The frequentists are much the larger group, and almost all the statistical analyses which appear in the BMJ are frequentist. The Bayesians are much fewer and until recently could only snipe at the frequentists from the high ground of university departments of mathematical statistics. Now the increasing power of computers is bringing Bayesian methods to the fore.

Bayesian methods are based on the idea that unknown quantities, such as population means and proportions, have probability distributions. The probability distribution for a population proportion expresses our prior knowledge or belief about it, before we add the knowledge which comes from our data. For example, suppose we want to estimate the prevalence of diabetes in a health district. We could use the knowledge that the percentage of diabetics in the United Kingdom as a whole is about 2%, so we expect the prevalence in our health district to be fairly similar. It is unlikely to be 10%, for example. We might have information based on other datasets that such rates vary between 1% and 3%, or we might guess that the prevalence is somewhere between these values. We can construct a prior distribution which summarises our beliefs about the prevalence in the absence of specific data. We can do this with a distribution having mean 2 and standard deviation 0.5, so that two standard deviations on either side of the mean are 1% and 3%. (The precise mathematical form of the prior distribution depends on the particular problem.)

Suppose we now collect some data by a sample survey of the district population. We can use the data to modify the prior probability distribution to tell us what we now think the distribution of the population percentage is; this is the posterior distribution. For example, if we did a survey of 1000 subjects and found 15 (1.5%) to be diabetic, the posterior distribution would have mean 1.7% and standard deviation 0.3%. We can calculate a set of values, a 95% credible interval (1.2% to 2.4% for the example), such that there is a probability of 0.95 that the percentage of diabetics is within this set. The frequentist analysis, which ignores the prior information, would give an estimate 1.5% with standard error 0.4% and 95% confidence interval 0.8% to 2.5%. This is similar to the results of the Bayesian method, as is usually the case, but the Bayesian method gives an estimate nearer the prior mean and a narrower interval.

Frequentist methods regard the population value as a fixed, unvarying (but unknown) quantity, without a probability distribution. Frequentists then calculate confidence intervals for this quantity, or significance tests of hypotheses concerning it. Bayesians reasonably object that this does not allow us to use our wider knowledge of the problem. Also, it does not provide what researchers seem to want, which is to be able to say that there is a probability of 95% that the population value lies within the 95% confidence interval, or that the probability that the null hypothesis is true is less than 5%. It is argued that researchers want this, which is why they persistently misinterpret confidence intervals and significance tests in this way.

A major difficulty, of course, is deciding on the prior distribution. This is going to influence the conclusions of the study, yet it may be a subjective synthesis of the available information, so the same data analysed by different investigators could lead to different conclusions. Another difficulty is that Bayesian methods may lead to intractable computational problems. (All widely available statistical packages use frequentist methods.)

Most statisticians have become Bayesians or frequentists as a result of their choice of university. They did not know that Bayesians and frequentists existed until it was too late and the choice had been made. There have been subsequent conversions. Some who were taught the Bayesian way discovered that when they had huge quantities of medical data to analyse the frequentist approach was much quicker and more practical, although they may remain Bayesian at heart. Some frequentists have had Damascus road conversions to the Bayesian view. Many practising statisticians, however, are fairly ignorant of the methods used by the rival camp and too busy to have time to find out.

The advent of very powerful computers has given a new impetus to the Bayesians. Computer intensive methods of analysis are being developed, which allow new approaches to very difficult statistical problems, such as the location of geographical clusters of cases of a disease. This new practicability of the Bayesian approach is leading to a change in the statistical paradigm and a rapprochement between Bayesians and frequentists.1 2 Frequentists are becoming curious about the Bayesian approach and more willing to use Bayesian methods when they provide solutions to difficult problems. In the future we expect to see more Bayesian analyses reported in the BMJ. When this happens we may try to use Statistics notes to explain them, though we may have to recruit a Bayesian to do it.

We thank David Spiegelhalter for comments on a draft.

1 Breslow N. Biostatistics and Bayes (with discussion). Statist Sci 1990;5:269-98.
2 Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomized trials (with discussion). J R Statist Soc A 1994;157:357-416.

BMJ VOLUME 317 24 OCTOBER 1998 www.bmj.com 1151


Clinical review

Statistics Notes
Survival probabilities (the Kaplan-Meier method)
J Martin Bland, Douglas G Altman

Department of As we have observed,1 analysis of survival data requires


Public Health
special techniques because some observations are 1.0

Survival probability
Sciences, St
Georges Hospital censored as the event of interest has not occurred for all
Medical School, patients. For example, when patients are recruited over
London SW17 0RE 0.75
J Martin Bland,
two years one recruited at the end of the study may be
professor of medical alive at one year follow up, whereas one recruited at the
statistics start may have died after two years. The patient who died 0.5
ICRF Medical has a longer observed survival than the one who still
Statistics Group,
survives and whose ultimate survival time is unknown.
Centre for Statistics
in Medicine, The table shows data from a study of conception in 0.25
Institute of Health subfertile women.2 The event is conception, and
Sciences, Oxford
OX3 7LF women survived until they conceived. One woman
0
Douglas G Altman, conceived after 16 months (menstrual cycles), whereas 0 6 12 18 24
head several were followed for shorter time periods during Time (months)
Correspondence to: which they did not conceive; their time to conceive was Survival curve showing probability of not conceiving among 38
Professor Bland thus censored. subfertile women after laparoscopy and hydrotubation2
We wish to estimate the proportion surviving (not
BMJ 1998;317:1572
having conceived) by any given time, which is also the
estimated probability of survival to that time for a There are three assumptions in the above. Firstly,
member of the population from which the sample is we assume that at any time patients who are censored
drawn. Because of the censoring we use the have the same survival prospects as those who
continue to be followed. This assumption is not easily
Kaplan-Meier method. For each time interval we
testable. Censoring may be for various reasons. In the
estimate the probability that those who have survived
conception study some women had received hormone
to the beginning will survive to the end. This is a condi-
treatment to promote ovulation, and others had
tional probability (the probability of being a survivor at
stopped trying to conceive. Thus they were no longer
the end of the interval on condition that the subject
part of the population we wanted to study, and their
was a survivor at the beginning of the interval). Survival
survival times were censored. In most studies some
Time (months) to to any time point is calculated as the product of the
subjects drop out for reasons unrelated to the
conception or conditional probabilities of surviving each time
condition under study (for example, emigration) If,
censoring in 38 interval. These data are unusual in representing
sub-fertile women however, for some patients in this study censoring was
months (menstrual cycles); usually the conditional
after laparoscopy related to failure to conceive this would have biased the
probabilities relate to days. The calculations are simplified by ignoring times at which there were no recorded survival times (whether events or censored times).

In the example, the probability of surviving for two months is the probability of surviving the first month times the probability of surviving the second month given that the first month was survived. Of 38 women, 32 survived the first month, a proportion of 0.842. Of the 32 women at the start of the second month (at risk of conception), 27 had not conceived by the end of the month. The conditional probability of surviving the second month is thus 27/32 = 0.844, and the overall probability of surviving (not conceiving) after two months is 0.842 × 0.844 = 0.711. We continue in this way to the end of the table, or until we reach the last event. Observations censored at a given time affect the number still at risk at the start of the next month. The estimated probability changes only in months when there is a conception. In practice, a computer is used to do these calculations. Standard errors and confidence intervals for the estimated survival probabilities can be found by Greenwood's method.3 Survival probabilities are usually presented as a survival curve (figure). The curve is a step function, with sudden changes in the estimated probability corresponding to times at which an event was observed. The times of the censored data are indicated by short vertical lines.

and hydrotubation2

Conceived (months):        1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 6, 6, 9, 9, 9, 10, 13, 16
Did not conceive (months): 2, 3, 4, 7, 7, 8, 8, 9, 9, 9, 11, 24, 24

estimated survival probabilities downwards.

Secondly, we assume that the survival probabilities are the same for subjects recruited early and late in the study. In a long term observational study of patients with cancer, for example, the case mix may change over the period of recruitment, or there may be an innovation in ancillary treatment. This assumption may be tested, provided we have enough data to estimate survival curves for different subsets of the data.

Thirdly, we assume that the event happens at the time specified. This is not a problem for the conception data, but could be, for example, if the event were recurrence of a tumour which would be detected at a regular examination. All we would know is that the event happened between two examinations. This imprecision would bias the survival probabilities upwards. When the observations are at regular intervals this can be allowed for quite easily, using the actuarial method.3

Formal methods are needed for testing hypotheses about survival in two or more groups. We shall describe the logrank test for comparing curves and the more complex Cox regression model in future Notes.

1 Altman DG, Bland JM. Time to event (survival) data. BMJ 1998;317:468-9.
2 Luthra P, Bland JM, Stanton SL. Incidence of pregnancy after laparoscopy and hydrotubation. BMJ 1982;284:1013-4.
3 Parmar MKB, Machin D. Survival analysis: a practical approach. Chichester: Wiley, 37, 47-9.
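The month by month calculation described in this note can be sketched in a few lines of Python. This is only an illustration, not the software the authors used: the function name `kaplan_meier` is made up for the example, and the two lists are transcribed from the table of times to conception or censoring.

```python
from collections import Counter

# Months to conception (event) or censoring, transcribed from the note's table.
conceived = [1] * 6 + [2] * 5 + [3] * 3 + [4] * 3 + [6] * 2 + [9] * 3 + [10, 13, 16]
censored = [2, 3, 4, 7, 7, 8, 8, 9, 9, 9, 11, 24, 24]


def kaplan_meier(event_times, censor_times):
    """Return {time: estimated probability of surviving beyond that time}."""
    events = Counter(event_times)
    losses = Counter(censor_times)
    at_risk = len(event_times) + len(censor_times)
    survival = 1.0
    curve = {}
    for t in sorted(set(events) | set(losses)):
        d = events.get(t, 0)
        if d:
            # Conditional probability of surviving this month, given survival so far.
            survival *= (at_risk - d) / at_risk
            curve[t] = survival
        # Events and censored observations at time t both leave the risk set afterwards.
        at_risk -= d + losses.get(t, 0)
    return curve


curve = kaplan_meier(conceived, censored)
for month, prob in curve.items():
    print(f"month {month:2d}: S = {prob:.3f}")
```

The first two printed values reproduce the 0.842 and 0.711 worked out in the text. The estimate changes only in months with a conception, while censored observations simply reduce the number at risk in later months.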

1572 BMJ VOLUME 317 5 DECEMBER 1998 www.bmj.com


Education and debate

Statistics notes
Treatment allocation in controlled trials: why randomise?
Douglas G Altman, J Martin Bland

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine
Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics
Correspondence to: Professor Altman.
BMJ 1999;318:1209

Since 1991 the BMJ has had a policy of not publishing trials that have not been properly randomised, except in rare cases where this can be justified.1 Why?

The simplest approach to evaluating a new treatment is to compare a single group of patients given the new treatment with a group previously treated with an alternative treatment. Usually such studies compare two consecutive series of patients in the same hospital(s). This approach is seriously flawed. Problems will arise from the mixture of retrospective and prospective studies, and we can never satisfactorily eliminate possible biases due to other factors (apart from treatment) that may have changed over time. Sacks et al compared trials of the same treatments in which randomised or historical controls were used and found a consistent tendency for historically controlled trials to yield more optimistic results than randomised trials.2 The use of historical controls can be justified only in tightly controlled situations of relatively rare conditions, such as in evaluating treatments for advanced cancer.

The need for contemporary controls is clear, but there are difficulties. If the clinician chooses which treatment to give each patient there will probably be differences in the clinical and demographic characteristics of the patients receiving the different treatments. Much the same will happen if patients choose their own treatment or if those who agree to have a treatment are compared with refusers. Similar problems arise when the different treatment groups are at different hospitals or under different consultants. Such systematic differences, termed bias, will lead to an overestimate or underestimate of the difference between treatments. Bias can be avoided by using random allocation.

A well known example of the confusion engendered by a non-randomised study was the study of the possible benefit of vitamin supplementation at the time of conception in women at high risk of having a baby with a neural tube defect.3 The investigators found that the vitamin group subsequently had fewer babies with neural tube defects than the placebo control group. The control group included women ineligible for the trial as well as women who refused to participate. As a consequence the findings were not widely accepted, and the Medical Research Council later funded a large randomised trial to answer the question in a way that would be widely accepted.4

The main reason for using randomisation to allocate treatments to patients in a controlled trial is to prevent biases of the types described above. We want to compare the outcomes of treatments given to groups of patients which do not differ in any systematic way. Another reason for randomising is that statistical theory is based on the idea of random sampling. In a study with random allocation the differences between treatment groups behave like the differences between random samples from a single population. We know how random samples are expected to behave and so can compare the observations with what we would expect if the treatments were equally effective.

The term random does not mean the same as haphazard but has a precise technical meaning. By random allocation we mean that each patient has a known chance, usually an equal chance, of being given each treatment, but the treatment to be given cannot be predicted. If there are two treatments the simplest method of random allocation gives each patient an equal chance of getting either treatment; it is equivalent to tossing a coin. In practice most people use either a table of random numbers or a random number generator on a computer. This is simple randomisation. Possible modifications include block randomisation, to ensure closely similar numbers of patients in each group, and stratified randomisation, to keep the groups balanced for certain prognostic patient characteristics. We discuss these extensions in a subsequent Statistics note.

Fifty years after the publication of the first randomised trial5 the technical meaning of the term randomisation continues to elude some investigators. Journals continue to publish "randomised" trials which are no such thing. One common approach is to allocate treatments according to the patient's date of birth or date of enrolment in the trial (such as giving one treatment to those with even dates and the other to those with odd dates), by the terminal digit of the hospital number, or simply alternately into the different treatment groups. While all of these approaches are in principle unbiased (being unrelated to patient characteristics), problems arise from the openness of the allocation system.1 Because the treatment is known when a patient is considered for entry into the trial this knowledge may influence the decision to recruit that patient and so produce treatment groups which are not comparable.

Of course, situations exist where randomisation is simply not possible.6 The goal here should be to retain all the methodological features of a well conducted randomised trial7 other than the randomisation.

1 Altman DG. Randomisation. BMJ 1991;302:1481-2.
2 Sacks H, Chalmers TC, Smith H. Randomized versus historical controls for clinical trials. Am J Med 1982;72:233-40.
3 Smithells RW, Sheppard S, Schorah CJ, Seller MJ, Nevin NC, Harris R, et al. Possible prevention of neural-tube defects by periconceptional vitamin supplementation. Lancet 1980;i:339-40.
4 MRC Vitamin Study Research Group. Prevention of neural tube defects: results of the Medical Research Council vitamin study. Lancet 1991;338:131-7.
5 Medical Research Council. Streptomycin treatment of pulmonary tuberculosis. BMJ 1948;2:769-82.
6 Black N. Why we need observational studies to evaluate the effectiveness of health care. BMJ 1996;312:1215-8.
7 Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT Statement. JAMA 1996;276:637-9.

BMJ VOLUME 318 1 MAY 1999 www.bmj.com 1209


General practice


Statistics notes
Variables and parameters
Douglas G Altman, J Martin Bland

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine
Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics
Correspondence to: Professor Altman.
BMJ 1999;318:1667

Like all specialist areas, statistics has developed its own language. As we have noted before,1 much confusion may arise when a word in common use is also given a technical meaning. Statistics abounds in such terms, including normal, random, variance, significant, etc. Two commonly confused terms are variable and parameter; here we explain and contrast them.

Information recorded about a sample of individuals (often patients) comprises measurements such as blood pressure, age, or weight and attributes such as blood group, stage of disease, and diabetes. Values of these will vary among the subjects; in this context blood pressure, weight, blood group and so on are variables. Variables are quantities which vary from individual to individual.

By contrast, parameters do not relate to actual measurements or attributes but to quantities defining a theoretical model. The figure shows the distribution of measurements of serum albumin in 481 white men aged over 20 with mean 46.14 and standard deviation 3.08 g/l. For the empirical data the mean and SD are called sample estimates. They are properties of the collection of individuals. Also shown is the normal1 distribution which fits the data most closely. It too has mean 46.14 and SD 3.08 g/l. For the theoretical distribution the mean and SD are called parameters. There is not one normal distribution but many, called a family of distributions. Each member of the family is defined by its mean and SD, the parameters1 which specify the particular theoretical normal distribution with which we are dealing. In this case, they give the best estimate of the population distribution of serum albumin if we can assume that in the population serum albumin has a normal distribution.

Figure: Measurements of serum albumin in 481 white men aged over 20 (data from Dr W G Miller). Histogram of frequency against albumin (g/l).

Most statistical methods, such as t tests, are called parametric because they estimate parameters of some underlying theoretical distribution. Non-parametric methods, such as the Mann-Whitney U test and the log rank test for survival data, do not assume any particular family for the distribution of the data and so do not estimate any parameters for such a distribution.

Another use of the word parameter relates to its original mathematical meaning as the value(s) defining one of a family of curves. If we fit a regression model, such as that describing the relation between lung function and height, the slope and intercept of this line (more generally known as regression coefficients) are the parameters defining the model. They have no meaning for individuals, although they can be used to predict an individual's lung function from their height.

In some contexts parameters are values that can be altered to see what happens to the performance of some system. For example, the performance of a screening programme (such as positive predictive value or cost effectiveness) will depend on aspects such as the sensitivity and specificity of the screening test. If we look to see how the performance would change if, say, sensitivity and specificity were improved, then we are treating these as parameters rather than using the values observed in a real set of data.

Parameter is a technical term which has only recently found its way into general use, unfortunately without keeping its correct meaning. It is common in medical journals to find variables incorrectly called parameters (but not in the BMJ we hope2). Another common misuse of parameter is as a limit or boundary, as in "within certain parameters." This misuse seems to have arisen from confusion between parameter and perimeter.

Misuse of medical terms is rightly deprecated. Like other language errors it leads to confusion and the loss of valuable distinction. Misuse of non-medical terms should be viewed likewise.

1 Altman DG, Bland JM. The normal distribution. BMJ 1995;310:298.
2 Endpiece: What's a parameter? BMJ 1998;316:1877.

BMJ VOLUME 318 19 JUNE 1999 www.bmj.com 1667


Education and debate

Much work still needs to be done to achieve this. To be useful in health policy at this level, all the targets need to be elaborated further and clear, practical statements must be made on their operation, especially the four targets on health policy and sustainable health systems. The WHO should stimulate the discussion of these important targets, but it should also be careful about being too prescriptive about health systems since this could be counterproductive.

In addition, more attention should be given to the usefulness of the targets in member states. One way of doing this is to rank the countries by target and to divide them into three groups. A specific level could be set for each group. For example, for target 2, three such groups could be distinguished as follows:
• Countries that have already achieved this target
• Countries for which the global target is achievable and challenging
• Countries that find the global target hard to achieve and therefore demotivating.

The first group needs stricter target levels, and the third group less stringent ones. If a breakdown of this kind is made for each target, some countries may be classified in different groups for different targets. In this way, the targets will provide an insight into the health status of the population and could be useful for policy makers in member states in encouraging action and allocating their resources.

We thank Dr J Visschedijk and Professor L J Gunning-Schepers and other referees of this article for their helpful comments.
Funding: This study was commissioned by Policy Action Coordination at the WHO and supported by an unrestricted educational grant from Merck & Co Inc, New Jersey, USA.
Competing interests: None declared.

1 World Health Assembly. Resolution WHA51.7. Health for all policy for the twenty-first century. Geneva: World Health Organisation, 1998.
2 World Health Association. Health for all in the 21st century. Geneva: WHO, 1998.
3 World Health Association. Global strategy for health for all by the year 2000. Geneva: WHO, 1981. (WHO Health for All series No 3.)
4 Visschedijk J, Simant S. Targets for health for all in the 21st century. World Health Stat Q 1998;51:56-67.
5 Van de Water HPA, van Herten LM. Never change a winning team? Review of WHO's new global policy: health for all in the 21st century. Leiden: TNO Prevention and Health, 1999.
6 World Health Organisation. Bridging the gaps. Geneva: WHO, 1995. (World health report.)
7 World Health Organisation. Fighting disease, fostering development. Geneva: WHO, 1996. (World health report.)
8 World Health Organisation. 1997: Conquering suffering, enriching humanity. Geneva: WHO, 1997. (World health report.)
9 Murray CJL, Lopez AD, eds. The global burden of disease. Boston: Harvard University Press, 1996.
10 United Nations. The world population prospects. New York: UN, 1998.
11 United Nations Development Programme. Human development report 1997. New York: Oxford University Press, 1997.
12 World Bank. Poverty reduction and the World Bank: progress and challenges in the 1990s. New York: World Bank, 1996.
13 World Health Organisation. Third evaluation of health for all by the year 2000. Geneva: WHO, 1999. (In press.)
14 Ad Hoc Committee on Health Research Relating to Future Intervention Options. Investing in health research and development. Geneva: WHO, 1996. (Document TDR/Gen/96.1.)
15 Taylor CE. Surveillance for equity in primary health care: policy implications from international experience. Int J Epidemiol 1992;21:1043-9.
16 Frerichs RR. Epidemiologic surveillance in developing countries. Annu Rev Public Health 1991;12:257-80.
17 World Health Organisation. Health for all renewal: building sustainable health systems: from policy to action. Report of meeting on 17-19 November 1997 in Helsinki, Finland. Geneva: WHO, 1998.
18 World Health Organisation. EMC annual report 1996. Geneva: WHO, 1996.
19 World Health Organisation. Physical status: the use and interpretation of anthropometry of a WHO expert committee. Geneva: WHO, 1995. (WHO technical report series No 834.)
20 World Health Organisation. Global database on child growth and malnutrition. Geneva: WHO, 1997.
21 World Health Organisation. Tobacco or health: a global status report. Geneva: WHO, 1997.
22 Erkens C. Cost-effectiveness of short course chemotherapy in smear-negative tuberculosis. Utrecht: Netherlands School of Public Health, 1996.
23 Van de Water HPA, van Herten LM. Bull's eye or Achilles' heel: WHO's European health for all targets evaluated in the Netherlands. Leiden: Netherlands Association for Applied Scientific Research (TNO) Prevention and Health, 1996.
24 Van de Water HPA, van Herten LM. Health policies on target? Review of health target and priority setting in 18 European countries. Leiden: Netherlands Association for Applied Scientific Research (TNO) Prevention and Health, 1998.

(Accepted 4 May 1999)

Statistics notes
How to randomise
Douglas G Altman, J Martin Bland

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine
Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics
Correspondence to: Professor Altman.
BMJ 1999;319:703-4

We have explained why random allocation of treatments is a required feature of controlled trials.1 Here we consider how to generate a random allocation sequence.

Almost always patients enter a trial in sequence over a prolonged period. In the simplest procedure, simple randomisation, we determine each patient's treatment at random independently with no constraints. With equal allocation to two treatment groups this is equivalent to tossing a coin, although in practice coins are rarely used. Instead we use computer generated random numbers. Suitable tables can be found in most statistics textbooks. The table shows an example2: the numbers can be considered as either random digits from 0 to 9 or random integers from 0 to 99.

For equal allocation to two treatments we could take odd and even numbers to indicate treatments A and B respectively. We must then choose an arbitrary place to start and also the direction in which to read the table. The first 10 two digit numbers from a starting place in column 2 are 85 80 62 36 96 56 17 17 23 87, which translate into the sequence A B B B B B A A A A for the first 10 patients. We could instead have taken each digit on its own, or numbers 00 to 49 for A and 50 to 99 for B. There are countless possible strategies; it makes no difference which is used.

Excerpt from a table of random digits.2 The numbers used in the example are those in column 2, rows 3 to 12

89 11 77 99 94
35 83 73 68 20
84 85 95 45 52
56 80 93 52 82
97 62 98 71 39
79 36 13 72 99
34 96 98 54 89
69 56 88 97 43
09 17 78 78 02
83 17 39 84 16
24 23 36 44 14
39 87 30 20 41
75 18 53 77 83
33 93 39 24 81
22 52 01 86 71

We can easily generalise the approach. With three groups we could use 01 to 33 for A, 34 to 66 for B, and 67 to 99 for C (00 is ignored). We could allocate treatments A and B in proportions 2 to 1 by using 01 to 66 for A and 67 to 99 for B.

At any point in the sequence the numbers of patients allocated to each treatment will probably differ, as in the above example. But sometimes we want to keep the numbers in each group very close at all times. Block randomisation (also called restricted randomisation) is used for this purpose. For example, if we consider subjects in blocks of four at a time there are only six ways in which two get A and two get B:

1: A A B B   2: A B A B   3: A B B A   4: B B A A   5: B A B A   6: B A A B

We choose blocks at random to create the allocation sequence. Using the single digits of the previous random sequence and omitting numbers outside the range 1 to 6 we get 5 6 2 3 6 6 5 6 1 1. From these we can construct the block allocation sequence B A B A / B A A B / A B A B / A B B A / B A A B, and so on. The numbers in the two groups at any time can never differ by more than half the block length. Block size is normally a multiple of the number of treatments. Large blocks are best avoided as they control balance less well. It is possible to vary the block length, again at random, perhaps using a mixture of blocks of size 2, 4, or 6.

While simple randomisation removes bias from the allocation procedure, it does not guarantee, for example, that the individuals in each group have a similar age distribution. In small studies especially some chance imbalance will probably occur, which might complicate the interpretation of results. We can use stratified randomisation to achieve approximate balance of important characteristics without sacrificing the advantages of randomisation. The method is to produce a separate block randomisation list for each subgroup (stratum). For example, in a study to compare two alternative treatments for breast cancer it might be important to stratify by menopausal status. Separate lists of random numbers should then be constructed for premenopausal and postmenopausal women. It is essential that stratified treatment allocation is based on block randomisation within each stratum rather than simple randomisation; otherwise there will be no control of balance of treatments within strata, so the object of stratification will be defeated.

Stratified randomisation can be extended to two or more stratifying variables. For example, we might want to extend the stratification in the breast cancer trial to tumour size and number of positive nodes. A separate randomisation list is needed for each combination of categories. If we had two tumour size groups (say <4 and >4 cm) and three groups for node involvement (0, 1-4, >4) as well as menopausal status, then we have 2 × 3 × 2 = 12 strata, which may exceed the limit of what is practical. Also with multiple strata some of the combinations of categories may be rare, so the intended treatment balance is not achieved.

In a multicentre study the patients within each centre will need to be randomised separately unless there is a central coordinated randomising service. Thus centre is a stratifying variable, and there may be other stratifying variables as well.

In small studies it is not practical to stratify on more than one or perhaps two variables, as the number of strata can quickly approach the number of subjects. When it is really important to achieve close similarity between treatment groups for several variables minimisation can be used; we discuss this method in a separate Statistics note.3

We have described the generation of a random sequence in some detail so that the principles are clear. In practice, for many trials the process will be done by computer. Suitable software is available at http://www.sghms.ac.uk/phs/staff/jmb/jmb.htm.

We shall also consider in a subsequent note the practicalities of using a random sequence to allocate treatments to patients.

1 Altman DG, Bland JM. Treatment allocation in controlled trials: why randomise? BMJ 1999;318:1209.
2 Altman DG. Practical statistics for medical research. London: Chapman and Hall, 1990: 540-4.
3 Treasure T, MacRae KD. Minimisation: the platinum standard for trials? BMJ 1998;317:362-3.
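The two worked examples in this note, simple randomisation from two digit numbers and block randomisation from single digits, can be retraced with a short Python sketch. The function names are hypothetical and chosen only for this illustration; the ten numbers are the ones read from column 2 of the table, and the block numbering follows the list of six blocks given in the text.

```python
# Two digit numbers read down column 2 of the note's table, starting at row 3.
numbers = [85, 80, 62, 36, 96, 56, 17, 17, 23, 87]

# The six blocks of four with two A and two B, numbered as in the note.
BLOCKS = {1: "AABB", 2: "ABAB", 3: "ABBA", 4: "BBAA", 5: "BABA", 6: "BAAB"}


def simple_allocation(two_digit_numbers):
    """Odd numbers indicate treatment A, even numbers treatment B."""
    return ["A" if n % 2 else "B" for n in two_digit_numbers]


def block_allocation(two_digit_numbers, n_blocks):
    """Read single digits, skip those outside 1 to 6, and map each to a block."""
    digits = [int(ch) for n in two_digit_numbers for ch in f"{n:02d}"]
    usable = [d for d in digits if d in BLOCKS][:n_blocks]
    return [BLOCKS[d] for d in usable]


simple = simple_allocation(numbers)
blocks = block_allocation(numbers, n_blocks=5)
print(" ".join(simple))
print(" / ".join(" ".join(b) for b in blocks))
```

Here the digits come from the printed table so that the output can be checked against the sequences in the text. For a real trial one would presumably draw the block numbers from a random number generator instead, for example with `random.choice` from Python's standard library, and for stratified randomisation build one such block list per stratum.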

One hundred years ago


Generalisation of salt infusions

The subcutaneous infusion of salt solution has proved of great benefit in the treatment of collapse after severe operations. The practice, it may be said, developed from two sources: the new method of transfusion where water, instead of another person's blood, is injected into the patient's veins; and flushing of the peritoneum, introduced by Lawson Tait. After flushing, much of the fluid left in the peritoneum is absorbed into the circulation, greatly to the patient's advantage. Dr. Clement Penrose has tried the effect of subcutaneous salt infusions as a last extremity in severe cases of pneumonia. He continues this treatment with inhalations of oxygen. He has had experience of three cases, all considered hopeless, and succeeded in saving one. In the other two the prolongation of life and the relief of symptoms were so marked that Dr. Penrose regretted that the treatment had not been employed earlier. Several physicians have adopted Dr. Penrose's method, and with the most gratifying results. The cases are reported fully in the Bulletin of the Johns Hopkins Hospital for July last. The infusions of salt solution were administered just as after an operation. The salt solution, at a little above body temperature, is poured into a graduated bottle connected by a rubber tube with a needle. The pressure is regulated by elevating the bottle, or by means of a rubber bulb with valves; the needle is introduced into the connective tissue under the breast or under the integuments of the thighs. There can be no doubt that subcutaneous saline infusions are increasing in popularity, and little doubt that their use will be greatly extended in medicine as well as surgery.
(BMJ 1899;ii:933)

704 BMJ VOLUME 319 11 SEPTEMBER 1999 www.bmj.com


Education and debate

Statistics Notes
The odds ratio
J Martin Bland, Douglas G Altman

Department of In recent years odds ratios have become widely used in The two odds ratios are
Public Health
Sciences, medical reportsalmost certainly some will appear in
St Georges todays BMJ. There are three reasons for this. Firstly,
Hospital Medical
they provide an estimate (with confidence interval) for
School, London
SW17 0RE the relationship between two binary (yes or no) vari- which can both be rearranged to give
J Martin Bland ables. Secondly, they enable us to examine the effects of
professor of medical
statistics
other variables on that relationship, using logistic
regression. Thirdly, they have a special and very
ICRF Medical
Statistics Group, convenient interpretation in case-control studies (dealt If we switch the order of the categories in the rows and
Centre for Statistics with in a future note). the columns, we get the same odds ratio. If we switch
in Medicine,
Institute of Health
The odds are a way of representing probability, the order for the rows only or for the columns only, we
Sciences, Oxford especially familiar for betting. For example, the odds get the reciprocal of the odds ratio, 1/4.89 = 0.204.
OX3 7LF that a single throw of a die will produce a six are 1 to These properties make the odds ratio a useful
Douglas G Altman
professor of statistics
5, or 1/5. The odds is the ratio of the probability that indicator of the strength of the relationship.
in medicine the event of interest occurs to the probability that it The sample odds ratio is limited at the lower end,
Correspondence to: does not. This is often estimated by the ratio of the since it cannot be negative, but not at the upper end,
Professor Bland number of times that the event of interest occurs to and so has a skew distribution. The log odds ratio,2
the number of times that it does not. The table shows however, can take any value and has an approximately
BMJ 2000;320:1468
data from a cross sectional study showing the Normal distribution. It also has the useful property that
prevalence of hay fever and eczema in 11 year old if we reverse the order of the categories for one of the
children.1 The probability that a child with eczema will variables, we simply reverse the sign of the log odds
also have hay fever is estimated by the proportion ratio: log(4.89) = 1.59, log(0.204) = 1.59.
141/561 (25.1%). The odds is estimated by 141/420. We can calculate a standard error for the log odds
Similarly, for children without eczema the probability ratio and hence a confidence interval. The standard
of having hay fever is estimated by 928/14 453 (6.4%) error of the log odds ratio is estimated simply by the
and the odds is 928/13 525. We can compare the square root of the sum of the reciprocals of the four
frequencies. For the example,
groups in several ways: by the difference between the
proportions, 141/561 928/14 453 = 0.187 (or 18.7
percentage points); the ratio of the proportions, (141/
561)/(928/14 453) = 3.91 (also called the relative risk); or the odds ratio, (141/420)/(928/13 525) = 4.89.

Association between hay fever and eczema in 11 year old children¹

                Hay fever
Eczema       Yes        No      Total
Yes          141       420        561
No           928    13 525     14 453
Total       1069    13 945     15 522

Now, suppose we look at the table the other way round, and ask what is the probability that a child with hay fever will also have eczema? The proportion is 141/1069 (13.2%) and the odds is 141/928. For a child without hay fever, the proportion with eczema is 420/13 945 (3.0%) and the odds is 420/13 525. Comparing the proportions this way, the difference is 141/1069 − 420/13 945 = 0.102 (or 10.2 percentage points); the ratio (relative risk) is (141/1069)/(420/13 945) = 4.38; and the odds ratio is (141/928)/(420/13 525) = 4.89. The odds ratio is the same whichever way round we look at the table, but the difference and ratio of proportions are not. It is easy to see why this is.

A 95% confidence interval for the log odds ratio is obtained as 1.96 standard errors on either side of the estimate. For the example, the log odds ratio is loge(4.89) = 1.588 and the confidence interval is 1.588 ± 1.96 × 0.103, which gives 1.386 to 1.790. We can antilog these limits to give a 95% confidence interval for the odds ratio itself,² as exp(1.386) = 4.00 to exp(1.790) = 5.99. The observed odds ratio, 4.89, is not in the centre of the confidence interval because of the asymmetrical nature of the odds ratio scale. For this reason, in graphs odds ratios are often plotted using a logarithmic scale. The odds ratio is 1 when there is no relationship. We can test the null hypothesis that the odds ratio is 1 by the usual χ² test for a two by two table.

Despite their usefulness, odds ratios can cause difficulties in interpretation.³ We shall review this debate and also discuss odds ratios in logistic regression and case-control studies in future Statistics Notes.

We thank Barbara Butland for providing the data.

1 Strachan DP, Butland BK, Anderson HR. Incidence and prognosis of asthma and wheezing illness from early childhood to age 33 in a national British cohort. BMJ 1996;312:1195-9.
2 Bland JM, Altman DG. Transforming data. BMJ 1996;312:770.
3 Sackett DL, Deeks JJ, Altman DG. Down with odds ratios! Evidence-Based Med 1996;1:164-6.
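
The arithmetic in the hay fever and eczema example can be checked with a short script. This is only a sketch: the function name and dictionary keys are made up for illustration, and the standard error uses the usual large-sample formula (the square root of the sum of the reciprocals of the four cell counts), which is an assumption here because this excerpt quotes only its value, 0.103.

```python
import math


def two_by_two_summary(a, b, c, d):
    """Summarise a 2x2 table whose rows are (a, b) and (c, d).

    The first column is the outcome of interest, so the row proportions
    are a/(a+b) and c/(c+d).
    """
    p1 = a / (a + b)
    p2 = c / (c + d)
    odds_ratio = (a / b) / (c / d)
    # Assumed large-sample standard error of the log odds ratio.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return {
        "difference": p1 - p2,
        "relative_risk": p1 / p2,
        "odds_ratio": odds_ratio,
        "se_log_or": se_log_or,
        "ci_low": math.exp(log_or - 1.96 * se_log_or),
        "ci_high": math.exp(log_or + 1.96 * se_log_or),
    }


# Rows = eczema yes/no, first column = hay fever yes.
by_eczema = two_by_two_summary(141, 420, 928, 13525)
# Same table read the other way: rows = hay fever yes/no, first column = eczema yes.
by_hay_fever = two_by_two_summary(141, 928, 420, 13525)

for label, result in (("by eczema", by_eczema), ("by hay fever", by_hay_fever)):
    print(label, {key: round(value, 3) for key, value in result.items()})
```

Run on the table above, the relative risk changes with the direction of reading (about 3.91 and 4.38) while the odds ratio is about 4.89 both ways, with a 95% confidence interval of roughly 4.00 to 5.99.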

1468 BMJ VOLUME 320 27 MAY 2000 bmj.com


Education and debate

Statistics Notes
Blinding in clinical trials and other studies
Simon J Day, Douglas G Altman

Leo Pharmaceuticals, Princes Risborough, Buckinghamshire HP27 9RR
Simon J Day, manager, clinical biometrics

ICRF Medical Statistics Group, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: S J Day

BMJ 2000;321:504

Human behaviour is influenced by what we know or believe. In research there is a particular risk of expectation influencing findings, most obviously when there is some subjectivity in assessment, leading to biased results. Blinding (sometimes called masking) is used to try to eliminate such bias.

It is a tenet of randomised controlled trials that the treatment allocation for each patient is not revealed until the patient has irrevocably been entered into the trial, to avoid selection bias. This sort of blinding, better referred to as allocation concealment, will be discussed in a future statistics note. In controlled trials the term blinding, and in particular "double blind," usually refers to keeping study participants, those involved with their management, and those collecting and analysing clinical data unaware of the assigned treatment, so that they should not be influenced by that knowledge.

The relevance of blinding will vary according to circumstances. Blinding patients to the treatment they have received in a controlled trial is particularly important when the response criteria are subjective, such as alleviation of pain, but less important for objective criteria, such as death. Similarly, medical staff caring for patients in a randomised trial should be blinded to treatment allocation to minimise possible bias in patient management and in assessing disease status. For example, the decision to withdraw a patient from a study or to adjust the dose of medication could easily be influenced by knowledge of which treatment group the patient has been assigned to.

In a double blind trial neither the patient nor the caregivers are aware of the treatment assignment. Blinding means more than just keeping the name of the treatment hidden. Patients may well see the treatment being given to patients in the other treatment group(s), and the appearance of the drug used in the study could give a clue to its identity. Differences in taste, smell, or mode of delivery may also influence efficacy, so these aspects should be identical for each treatment group. Even colour of medication has been shown to influence efficacy.¹

In studies comparing two active compounds, blinding is possible using the "double dummy" method. For example, if we want to compare two medicines, one presented as green tablets and one as pink capsules, we could also supply green placebo tablets and pink placebo capsules so that both groups of patients would take one green tablet and one pink capsule.

Blinding is certainly not always easy or possible. Single blind trials (where either only the investigator or only the patient is blind to the allocation) are sometimes unavoidable, as are open (non-blind) trials. In trials of different styles of patient management, surgical procedures, or alternative therapies, full blinding is often impossible.

In a double blind trial it is implicit that the assessment of patient outcome is done in ignorance of the treatment received. Such blind assessment of outcome can often also be achieved in trials which are open (non-blinded). For example, lesions can be photographed before and after treatment and assessed by someone not involved in running the trial. Indeed, blind assessment of outcome may be more important than blinding the administration of the treatment, especially when the outcome measure involves subjectivity. Despite the best intentions, some treatments have unintended effects that are so specific that their occurrence will inevitably identify the treatment received to both the patient and the medical staff. Blind assessment of outcome is especially useful when this is a risk.

In epidemiological studies it is preferable that the identification of "cases" as opposed to "controls" be kept secret while researchers are determining each subject's exposure to potential risk factors. In many such studies blinding is impossible because exposure can be discovered only by interviewing the study participants, who obviously know whether or not they are a case. The risk of differential recall of important disease related events between cases and controls must then be recognised and if possible investigated.² As a minimum the sensitivity of the results to differential recall should be considered. Blinded assessment of patient outcome may also be valuable in other epidemiological studies, such as cohort studies.

Blinding is important in other types of research too. For example, in studies to evaluate the performance of a diagnostic test those performing the test must be unaware of the true diagnosis. In studies to evaluate the reproducibility of a measurement technique the observers must be unaware of their previous measurement(s) on the same individual.

We have emphasised the risks of bias if adequate blinding is not used. This may seem to be challenging the integrity of researchers and patients, but bias associated with knowing the treatment is often subconscious. On average, randomised trials that have not used appropriate levels of blinding show larger treatment effects than blinded studies.³ Similarly, diagnostic test performance is overestimated when the reference test is interpreted with knowledge of the test result.⁴ Blinding makes it difficult to bias results intentionally or unintentionally and so helps ensure the credibility of study conclusions.

1 De Craen AJM, Roos PJ, de Vries AL, Kleijnen J. Effect of colour of drugs: systematic review of perceived effect of drugs and their effectiveness. BMJ 1996;313:1624-6.
2 Barry D. Differential recall bias and spurious associations in case/control studies. Stat Med 1996;15:2603-16.
3 Schulz KF, Chalmers I, Hayes R, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12.
4 Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061-6.

504 BMJ VOLUME 321 19-26 AUGUST 2000 bmj.com



their feet. There are even more fundamental reasons why depending on the rationality of the market will never work well for quality of care (box). Sensible policy for providing nursing home care requires a larger welfare state, a larger regulatory state, and encouragement of public, non-profit providers. Australia's recent experience shows that to head in the opposite direction is medically, economically, and politically irrational.

Irrationality, the market, and quality of care

Consider the irrationality of a person who pays extra so as not to share a hotel room with a colleague while on a business trip. He does this because he values privacy but he also scoffs at taking out long term care insurance to guarantee a private room in a nursing home. Why is he willing to risk sharing a room for the rest of his life with a person he does not like? This common irrationality is often masked by rationalisations such as "I would rather die than have to live in a nursing home." Yet we know that when the time comes most prefer the limited pleasures of life in a nursing home to suicide

Competing interests: None declared.

1 McCallum J, Geiselhart K. Australia's new aged: issues for young and old. Sydney: Allen and Unwin, 1996.
2 Osborne D, Gaebler T. Reinventing government. New York: Addison-Wesley, 1992.
3 Jost T. The necessary and proper role of regulation to assure the quality of health care. Houston Law Review 1988;25:525-98.
4 Tingle L. Moran the big winner as aged care goes private. Sydney Morning Herald 16 March 2001:2.
5 Lohr R, Head M. Kerosene baths reveal systemic aged care crisis in Australia. World Socialist Web Site. www.wsws.org/articles/2000/mar2000/aged-m10.shtml (accessed 10 Mar 2000).
6 Jenkins A, Braithwaite J. Profits, pressure and corporate lawbreaking. Crime, Law and Social Change 1993;20:221-32.
7 Braithwaite J, Makkai T, Braithwaite V, Gibson D. Raising the standard: resident centred nursing home regulation in Australia. Canberra: Department of Community Services and Health, 1993.
8 Braithwaite J, Braithwaite V. The politics of legalism: rules versus standards in nursing home regulation. Social and Legal Studies 1995;4:307-41.
9 Black J. Rules and regulators. Oxford: Clarendon Press, 1997.
10 Braithwaite J, Makkai T. Can resident-centred inspection of nursing homes work with very sick residents? Health Policy 1993;24:19-33.
11 Makkai T, Braithwaite J. Praise, pride and corporate compliance. Int J Sociology Law 1993;21:73-91.
12 Braithwaite J. Restorative justice and responsive regulation. New York: Oxford University Press (in press).
13 McKibbin H. Accreditation: the on-site audit. The Standard (Newsletter of the Aged Care Standards Agency) 1999;2(2):2.
14 Power M. The audit society. Oxford: Oxford University Press, 1997.

Statistics Notes
Concealing treatment allocation in randomised trials
Douglas G Altman, Kenneth F Schulz

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Family Health International, PO Box 13950, Research Triangle Park, NC 27709, USA
Kenneth F Schulz, vice president, Quantitative Sciences

Correspondence to: D G Altman

BMJ 2001;323:446-7

We have previously explained why random allocation of treatments is a required design feature of controlled trials¹ and explained how to generate a random allocation sequence.² Here we consider the importance of concealing the treatment allocation until the patient is entered into the trial.

Regardless of how the allocation sequence has been generated (such as by simple or stratified randomisation²), there will be a prespecified sequence of treatment allocations. In principle, therefore, it is possible to know what treatment the next patient will get at the time when a decision is taken to consider the patient for entry into the trial.

The strength of the randomised trial is based on aspects of design which eliminate various types of bias. Randomisation of patients to treatment groups eliminates bias by making the characteristics of the patients in two (or more) groups the same on average, and stratification with blocking may help to reduce chance imbalance in a particular trial.² All this good work can be undone if a poor procedure is adopted to implement the allocation sequence. In any trial one or more people must determine whether each patient is eligible for the trial, decide whether to invite the patient to participate, explain the aims of the trial and the details of the treatments, and, if the patient agrees to participate, determine what treatment he or she will receive.

Suppose it is clear which treatment a patient will receive if he or she enters the trial (perhaps because there is a typed list showing the allocation sequence). Each of the above steps may then be compromised because of conscious or subconscious bias. Even when the sequence is not easily available, there is strong anecdotal evidence of frequent attempts to discover the sequence through a combination of a misplaced belief that this will be beneficial to patients and lack of understanding of the rationale of randomisation.³

How can the allocation sequence be concealed? Firstly, the person who generates the allocation sequence should not be the person who determines eligibility and entry of patients. Secondly, if possible the mechanism for treatment allocation should use people not involved in the trial. A common procedure, especially in larger trials, is to use a central telephone randomisation system. Here patient details are supplied, eligibility confirmed, and the patient entered into the trial before the treatment allocation is divulged (and it may still be blinded⁴). Another excellent allocation concealment mechanism, common in drug trials, is to get the allocation done by a pharmacy. The interventions are sealed in serially numbered containers (usually bottles) of equal appearance and weight according to the allocation sequence.

If external help is not available the only other system that provides a plausible defence against allocation bias is to enclose assignments in serially numbered, opaque, sealed envelopes. Apart from neglecting to mention opacity, this is the method used in the famous 1948 streptomycin trial (see box). This method is not immune to corruption,³ particularly if poorly executed. However, with care, it can be a good mechanism for concealing allocation. We recommend that investigators ensure that the envelopes are opened sequentially, and only after the participant's name and other details are written on the appropriate envelope.³ If possible, that information should also be transferred to the assigned allocation by using pressure sensitive paper or carbon paper inside the envelope. If an investigator cannot use numbered containers, envelopes represent the best available allocation concealment mechanism without involving outside parties, and may sometimes be the only feasible option. We suspect, however, that in years to come we will see greater use of external "third party" randomisation.

Description of treatment allocation in the MRC streptomycin trial⁵

"Determination of whether a patient would be treated by streptomycin and bed-rest (S case) or by bed-rest alone (C case) was made by reference to a statistical series based on random sampling numbers drawn up for each sex at each centre by Professor Bradford Hill; the details of the series were unknown to any of the investigators or to the co-ordinator and were contained in a set of sealed envelopes, each bearing on the outside only the name of the hospital and a number. After acceptance of a patient by the panel, and before admission to the streptomycin centre, the appropriate numbered envelope was opened at the central office; the card inside told if the patient was to be an S or a C case, and this information was then given to the medical officer of the centre."

The desirability of concealing the allocation was recognised in the streptomycin trial⁵ (see box). Yet the importance of this key element of a randomised trial has not been widely recognised. Empirical evidence of the bias associated with failure to conceal the allocation⁶ ⁷ and explicit requirement to discuss this issue in the CONSORT statement⁸ seem to be leading to wider recognition that allocation concealment is an essential aspect of a randomised trial.

Allocation concealment is completely different from (double) blinding.⁴ It is possible to conceal the randomisation in every randomised trial. Also, allocation concealment seeks to eliminate selection bias (who gets into the trial and the treatment they are assigned). By contrast, blinding relates to what happens after randomisation, is not possible in all trials, and seeks to reduce ascertainment bias (assessment of outcome).

1 Altman DG, Bland JM. Treatment allocation in controlled trials: why randomise? BMJ 1999;318:1209.
2 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-4.
3 Schulz KF. Subverting randomization in controlled trials. JAMA 1995;274:1456-8.
4 Day SJ, Altman DG. Blinding in clinical trials and other studies. BMJ 2000;321:504.
5 Medical Research Council. Streptomycin treatment of pulmonary tuberculosis: a Medical Research Council investigation. BMJ 1948;2:769-82.
6 Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses. Lancet 1998;352:609-13.
7 Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12.
8 Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996;276:637-9.
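
The logic of a central randomisation service, as described in the note on concealing allocation, can be sketched in a few lines. This is a toy illustration only: the class and method names are hypothetical, simple randomisation is assumed, and a leading underscore in Python does not truly hide anything, so a real service would keep the sequence on a separate system run by people outside the trial.

```python
import random


class CentralAllocator:
    """Toy stand-in for a central telephone randomisation service."""

    def __init__(self, n_patients, seed=None):
        rng = random.Random(seed)
        # The prespecified sequence is generated once and held by the service,
        # not by the person who recruits patients.
        self._sequence = [rng.choice(["treatment", "control"]) for _ in range(n_patients)]
        self._entered = []  # (patient_id, allocation) in order of entry

    def enrol(self, patient_id, eligible):
        """Record the patient as entered, then divulge the next allocation."""
        if not eligible:
            raise ValueError("eligibility must be confirmed before allocation")
        if any(pid == patient_id for pid, _ in self._entered):
            raise ValueError("patient has already been entered")
        if len(self._entered) >= len(self._sequence):
            raise RuntimeError("allocation sequence exhausted")
        allocation = self._sequence[len(self._entered)]
        # Entry is logged before the allocation leaves the service, so it
        # cannot be withdrawn once the treatment is known.
        self._entered.append((patient_id, allocation))
        return allocation

    def audit_trail(self):
        """Return a copy of the entry log for later checking."""
        return list(self._entered)


service = CentralAllocator(n_patients=4, seed=2001)
print(service.enrol("P001", eligible=True))
print(service.audit_trail())
```

The design point is the ordering inside `enrol`: patient details are supplied, eligibility is confirmed, and entry is recorded before the allocation is returned, which mirrors the sequence the note describes for telephone randomisation.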

The public health benefits of mobile phones

The bread and butter of public health on call is identifying contacts in the case of suspected meningococcal disease. On the whole this is straightforward but can occasionally cause difficulties. Most areas that I have worked in include several universities, and during October it is common to experience the problem of contact tracing in the student population.

There are two main problems. The first is how to define household contacts when the index patient lives in a hall of residence containing several hundred students. Finding the appropriate university protocol and not being too concerned about the different approaches adopted by neighbouring universities can reduce the number of sleepless nights. The second problem is harder. Close kissing contacts among 18 year olds who have been set free from parental control for the first time is a minefield. My experience suggests that it is best to assume there will be lots and that names and contact details will not necessarily have been obtained. By the end of a weekend on call, you will feel like a cross between a detective and an agony aunt.

One year I volunteered to cover Christmas weekend in the belief that at least the students would be gone by then. I could not have been more mistaken. To add a further difficulty, the index patient presented to hospital on the night of the last day of term, and all contacts had already set off to the far reaches of the country. I could not believe my luck when the friend accompanying the patient produced both their mobile phones and confidently reassured me that between the two of them they would have the mobile numbers of all 15 household contacts. She was right, and in just over two hours all of them had been contacted.

There has been much coverage in the medical and popular press about the potential health hazards of mobile phones, and if these fears are realised the 100% ownership among this small sample of students is worrying. However, in terms of contact tracing for suspected meningococcal disease, mobile phones have potential health benefits not just for their owners but also for the mental health of public health doctors. Of course, this may not solve the close kissing contact problem.

Debbie Lawlor, senior lecturer in epidemiology and public health, University of Bristol

We welcome articles up to 600 words on topics such as A memorable patient, A paper that changed my practice, My most unfortunate mistake, or any other piece conveying instruction, pathos, or humour. If possible the article should be supplied on a disk. Permission is needed from the patient or a relative if an identifiable patient is referred to. We also welcome contributions for "Endpieces," consisting of quotations of up to 80 words (but most are considerably shorter) from any source, ancient or modern, which have appealed to the reader.

BMJ VOLUME 323 25 AUGUST 2001 bmj.com 447



Statistics Notes
Analysing controlled trials with baseline and follow up
measurements
Andrew J Vickers, Douglas G Altman

Integrative Medicine Service, Biostatistics Service, Memorial Sloan-Kettering Cancer Center, New York, New York 10021, USA
Andrew J Vickers, assistant attending research methodologist

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: Dr Vickers, vickersa@mskcc.org

BMJ 2001;323:1123-4

In many randomised trials researchers measure a continuous variable at baseline and again as an outcome assessed at follow up. Baseline measurements are common in trials of chronic conditions where researchers want to see whether a treatment can reduce pre-existing levels of pain, anxiety, hypertension, and the like.

Statistical comparisons in such trials can be made in several ways. Comparison of follow up (post-treatment) scores will give a result such as "at the end of the trial, mean pain scores were 15 mm (95% confidence interval 10 to 20 mm) lower in the treatment group." Alternatively a change score can be calculated by subtracting the follow up score from the baseline score, leading to a statement such as "pain reductions were 20 mm (16 to 24 mm) greater on treatment than control." If the average baseline scores are the same in each group the estimated treatment effect will be the same using these two simple approaches. If the treatment is effective the statistical significance of the treatment effect by the two methods will depend on the correlation between baseline and follow up scores. If the correlation is low using the change score will add variation and the follow up score is more likely to show a significant result. Conversely, if the correlation is high using only the follow up score will lose information and the change score is more likely to be significant. It is incorrect, however, to choose whichever analysis gives a more significant finding. The method of analysis should be specified in the trial protocol.

Some use change scores to take account of chance imbalances at baseline between the treatment groups. However, analysing change does not control for baseline imbalance because of regression to the mean¹ ²: baseline values are negatively correlated with change because patients with low scores at baseline generally improve more than those with high scores. A better approach is to use analysis of covariance (ANCOVA), which, despite its name, is a regression method.³ In effect two parallel straight lines (linear regression) are obtained relating outcome score to baseline score in each group. They can be summarised as a single regression equation:

follow up score = constant + a × baseline score + b × group

where a and b are estimated coefficients and "group" is a binary variable coded 1 for treatment and 0 for control. The coefficient b is the effect of interest: the estimated difference between the two treatment groups. In effect an analysis of covariance adjusts each patient's follow up score for his or her baseline score, but has the advantage of being unaffected by baseline differences. If, by chance, baseline scores are worse in the treatment group, the treatment effect will be underestimated by a follow up score analysis and overestimated by looking at change scores (because of regression to the mean). By contrast, analysis of covariance gives the same answer whether or not there is baseline imbalance.

As an illustration, Kleinhenz et al randomised 52 patients with shoulder pain to either true or sham acupuncture.⁴ Patients were assessed before and after treatment using a 100 point rating scale of pain and function, with lower scores indicating poorer outcome. There was an imbalance between groups at baseline, with better scores in the acupuncture group (see table). Analysis of post-treatment scores is therefore biased. The authors analysed change scores, but as baseline and change scores are negatively correlated (about r = −0.25 within groups) this analysis underestimates the effect of acupuncture. From analysis of covariance we get:

follow up score = 24 + 0.71 × baseline score + 12.7 × group

(see figure). The coefficient for group (b) has a useful interpretation: it is the difference between the mean change scores of each group. In the above example it can be interpreted as "pain and function score improved by an estimated 12.7 points more on average in the treatment group than in the control group." A 95% confidence interval and P value can also be calculated for b (see table).⁵ The regression equation provides a means of prediction: a patient with a baseline score of 50, for example, would be predicted to have a follow up score of 72.2 on treatment and 59.5 on control.

Results of trial of acupuncture for shoulder pain⁴

                 Pain scores (mean and SD)
                 Placebo group   Acupuncture group   Difference between
                 (n=27)          (n=25)              means (95% CI)        P value
Baseline         53.9 (14)       60.4 (12.3)         6.5
Analysis
  Follow up      62.3 (17.9)     79.6 (17.1)         17.3 (7.5 to 27.1)    0.0008
  Change score*  8.4 (14.6)      19.2 (16.1)         10.8 (2.3 to 19.4)    0.014
  ANCOVA                                             12.7 (4.1 to 21.3)    0.005
*Analysis reported by authors.⁴

Figure: Pretreatment and post-treatment scores in each group showing fitted lines (x axis: pretreatment score; y axis: post-treatment score; series: acupuncture, placebo). Squares show mean values for the two groups. The estimated difference between the groups from analysis of covariance is the vertical distance between the two lines.

An additional advantage of analysis of covariance is that it generally has greater statistical power to detect a treatment effect than the other methods.⁶ For example, a trial with a correlation between baseline and follow up scores of 0.6 that required 85 patients for analysis of follow up scores, would require 68 for a change score analysis but only 54 for analysis of covariance.

The efficiency gains of analysis of covariance compared with a change score are low when there is a high correlation (say r > 0.8) between baseline and follow up measurements. This will often be the case, particularly in stable chronic conditions such as obesity. In these situations, analysis of change scores can be a reasonable alternative, particularly if restricted randomisation is used to ensure baseline comparability between groups.⁷ Analysis of covariance is the preferred general approach, however.

As with all analyses of continuous data, the use of analysis of covariance depends on some assumptions that need to be tested. In particular, data transformation, such as taking logarithms, may be indicated.⁸ Lastly, analysis of covariance is a type of multiple regression and can be seen as a special type of adjusted analysis. The analysis can thus be expanded to include additional prognostic variables (not necessarily continuous), such as age and diagnostic group.

We thank Dr J Kleinhenz for supplying the raw data from his study.

1 Bland JM, Altman DG. Regression towards the mean. BMJ 1994;308:1499.
2 Bland JM, Altman DG. Some examples of regression towards the mean. BMJ 1994;309:780.
3 Senn S. Baseline comparisons in randomized clinical trials. Stat Med 1991;10:1157-9.
4 Kleinhenz J, Streitberger K, Windeler J, Gussbacher A, Mavridis G, Martin E. Randomised clinical trial comparing the effects of acupuncture and a newly designed placebo needle in rotator cuff tendonitis. Pain 1999;83:235-41.
5 Altman DG, Gardner MJ. Regression and correlation. In: Altman DG, Machin D, Bryant TN, Gardner MJ, eds. Statistics with confidence. 2nd ed. London: BMJ Books, 2000:73-92.
6 Vickers AJ. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med Res Methodol 2001;1:16.
7 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-4.
8 Bland JM, Altman DG. The use of transformation when comparing two means. BMJ 1996;312:1153.
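
The numbers quoted in the acupuncture example can be reproduced with a few lines of arithmetic. The sketch below makes some assumptions that are not stated in the note: the function names are invented for illustration, and the sample size ratios 2(1 − r) for change scores and 1 − r² for analysis of covariance are the textbook variance ratios under equal variances at baseline and follow up. They do, however, give the 68 and 54 patients quoted for r = 0.6.

```python
def predict_follow_up(baseline, group, constant=24.0, a=0.71, b=12.7):
    """Fitted ANCOVA equation from the example; group is 1 for treatment, 0 for control."""
    return constant + a * baseline + b * group


def adjusted_difference(diff_follow_up, diff_baseline, slope):
    """ANCOVA group effect when both groups share one slope.

    Equals the difference in follow up means minus the slope times the
    difference in baseline means.
    """
    return diff_follow_up - slope * diff_baseline


def patients_needed(n_follow_up, r):
    """Approximate patients needed by each analysis, given the number needed
    for a follow up score analysis and the baseline/follow up correlation r.

    Assumed textbook ratios: 2 * (1 - r) for change scores, 1 - r**2 for ANCOVA.
    """
    return {
        "follow up": n_follow_up,
        "change score": n_follow_up * 2 * (1 - r),
        "ancova": n_follow_up * (1 - r ** 2),
    }


# Predictions for a patient with a baseline score of 50.
print(round(predict_follow_up(50, group=1), 1), round(predict_follow_up(50, group=0), 1))

# Table values: follow up difference 17.3, baseline difference 6.5, slope 0.71.
print(round(adjusted_difference(17.3, 6.5, 0.71), 1))

# Correlation 0.6, 85 patients needed for a follow up score analysis.
print({name: round(n) for name, n in patients_needed(85, 0.6).items()})
```

The second calculation shows why the three analyses in the table differ: the follow up analysis ignores the 6.5 point baseline imbalance, the change score subtracts all of it (a slope of 1), and analysis of covariance subtracts 0.71 of it, giving about 12.7.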

A memorable patient
Informed consent

I first met Ivy three years ago when she came for her 29th oesophageal dilatation. She was an 86 year old spinster, deaf without speech from childhood, and the only sign language she knew was thumbs up, which she would use for saying good morning or for showing happiness. She had no next of kin and had lived in a residential home for the past 50 years. She developed a benign oesophageal stricture in 1992 and came to the endoscopy unit for repeated dilatations. The carers in the residential home used to say that she enjoyed her days out at the endoscopy unit.

We would explain the procedure to her in sign language. She would use the thumbs up sign and make a cross on the dotted line on the consent form. She would enter the endoscopy room smiling, put her left arm out to be cannulated, turn to her left side for endoscopy, and when fully awake would show her thumbs up again. Every time after her dilatation the nursing staff would question why an expandable oesophageal stent was not being considered. We would conclude that the indications for an expandable stent in benign strictures are not well established.

Her need for dilatation was becoming more frequent, and so on her 46th dilatation we decided to refer her to our regional centre for the insertion of a stent. She had an expandable stent inserted, and in his report the endoscopist mentioned the risk of the stent migrating down in the stomach beyond the stricture. Six weeks later she developed a bolus obstruction. At endoscopy it was noted that the stent had indeed migrated down. She consented to another stent. Four weeks later she had another bolus obstruction that could not be completely removed at the first attempt, and she was brought back the following day for removal of the bolus by endoscopy.

She came to the endoscopy room but did not have her familiar smile. She looked around for a minute, got off her trolley, and walked out. Everyone in the endoscopy room understood that she was trying to say, "I've had enough."

She did not come back for a repeat endoscopy, and she stayed nil by mouth on intravenous fluids. Two weeks later she died of an aspiration pneumonia. We think she understood all the procedures she had agreed to. We also think it was informed consent. I hope we were right. She gave us a very clear message without saying a word on her last visit to the endoscopy room.

Do we really understand what aphasic patients are trying to tell us when we get informed consent for invasive procedures? We should try to read the non-verbal messages very carefully.

I Tiwari, associate specialist in gastroenterology, Broomfield Hospital, Chelmsford

1124 BMJ VOLUME 323 10 NOVEMBER 2001 bmj.com



The quality and reliability of health information on the internet remains of paramount concern in Europe, as elsewhere. Self regulatory codes of ethics for health websites abound, yet the quality and practices of many are highly questionable.

Little progress seems to have been made, moreover, in assuring consumers that the information they share with health websites will not be misused. Several US studies have already concluded that websites' privacy practices do not match their proclaimed policies.⁵ In an attempt to counter this erosion of trust in Europe, the European Commission's guidelines for quality criteria for health related websites have recognised that there is no shortage of legislation in the field of privacy and security.⁶ They have drawn specific attention to a new recommendation regarding online data collection adopted in May 2001 that explains how European directives on issues such as data protection should be applied to the most common processing tasks carried out via the internet.⁷

The challenge facing Europe's health professionals and policymakers is to carefully craft the development of new approaches to the supervision of medical and pharmaceutical practice. Their ultimate goal is to raise consumers' confidence in online healthcare. They must ensure that the mechanisms are put in place whereby health professionals themselves can benefit from using the internet, while still ensuring the highest standards of medical practice.

Avienda was formerly known as the Centre for Law Ethics and Risk in Telemedicine.

Competing interests: None declared.

1 http://news.bbc.co.uk/hi/english/uk/england/newsid_1752000/1752670.stm (accessed 5 Feb 2002).
2 Case C-322/01: Reference for a preliminary ruling by the Landgericht Frankfurt am Main by order of that court of 10 August 2001 in the case of Deutscher Apothekerverband e.V. against DocMorris NV and Jacques Waterval. Official Journal of the European Communities No C 2001 December 8:348/10.
3 Council Directive 1992/28/EEC of 31 March 1992 on the advertising of medicinal products for human use. (Articles 1(3) and 3(1).) Official Journal of the European Communities No L 1995 11 February:32/26.
4 Directive 2000/31/EC on mutual recognition of primary medical and specialist medical qualifications and minimum standards of training. Official Journal of the European Communities No L 2001 July 31:206/1-51.
5 Schwartz J. Medical websites faulted on privacy. Washington Post 2000 February 1.
6 http://europa.eu.int/information_society/eeurope/ehealth/quality/draft_guidelines/index_en.htm (accessed 5 Feb 2002).
7 European Commission. Recommendation 2/2001 on certain minimum requirements for collecting personal data on-line in the European Union. Adopted on 17 May 2001. http://europa.eu.int/comm/internal_market/en/dataprot/wpdocs/wp43en.htm (accessed 25 Jan 2002).

Statistics Notes
Validating scales and indexes
J Martin Bland, Douglas G Altman

Papers p 569

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: Professor Bland mbland@sghms.ac.uk

BMJ 2002;324:606-7 (9 March 2002)

An index of quality is a measurement like any other, whether it is assessing a website, as in today's BMJ,1 a clinical trial used in a meta-analysis,2 or the quality of life experienced by a patient.3 As with all measurements, we have to decide whether it measures what we want it to measure, and how well.

The simplest measurements, such as length and distance, can be validated by an objective criterion. The earliest criteria must have been biological: the length of a pace, a foot, a thumb. The obvious problem, that the criterion varies from person to person, was eventually solved by establishing a fundamental unit and defining all others in terms of it. Other measurements can then be defined in terms of a fundamental unit. To define a unit of weight we find a handy substance which appears the same everywhere, such as water. The unit of weight is then the weight of a volume of water specified in the basic unit of length, such as 100 cubic centimetres. Such measurements have criterion validity, meaning that we can take some known quantity and compare our measurement with it.

For some measurements no such standard is possible. Cardiac stroke volume, for example, can be measured only indirectly. Direct measurement, by collecting all the blood pumped out of the heart over a series of beats, would involve rather drastic interference with the system. Our criterion becomes agreement with another indirect measurement. Indeed, we sometimes have to use as a standard a method which we know produces inaccurate measurements.

Some quantities are even more difficult to measure and evaluate. Cardiac stroke volume does at least have an objective reality; a physical quantity of blood is pumped out of the heart when it beats. Anxiety and depression do not have a physical reality but are useful artificial constructs. They are measured by questionnaire scales, where answers to a series of questions related to the concept we want to measure are combined to give a numerical score. Website quality is similar. We are measuring a quantity which is not precisely defined, and there is no instrument with which we can compare any measure we might devise. How are we to assess the validity of such a scale?

The relevant theory was developed in the social sciences in the context of questionnaire scales.4 First we might ask whether the scale looks right, whether it asks about the sorts of thing which we think of as being related to anxiety or website quality. If it appears to be correct, we call this face validity. Next we might ask whether it covers all the aspects which we want to measure. A phobia scale which asked about fear of dogs, spiders, snakes, and cats but ignored height, confined spaces, and crowds would not do this. We call appropriate coverage of the subject matter content validity.

Our scale may look right and cover the right things, but what other evidence can we bring to the question of validity? One question we can ask is whether our score has the relationships with other variables that we would expect. For example, does an anxiety measure distinguish between psychiatric patients and medical patients? Do we get different anxiety scores from students before and after an examination? Does a measure of depression predict suicide attempts? We call the property of having appropriate relationships with other variables construct validity.

We can also ask whether the items which together compose the scale are related to one another: does the scale have internal consistency? If not, do the items really measure the same thing? On the other hand, if the items are too similar, some may be redundant. Highly correlated items in a scale may make the scale overlong and may lead to some aspects being overemphasised, impairing the content validity. A handy summary measure for this feature is Cronbach's alpha.5

A scale must also be repeatable and be sufficiently objective to give similar results for different observers. If a measurement is repeatable, in that someone who has a high score on one occasion tends to have a high score on another, it must be measuring something. With physical measurements, it is often possible for the same observer (or different observers) to make repeated measurements in quick succession. When there is a subjective element in the measurement the observer can be blinded from their first measurement, and different observers can make simultaneous measurements. In assessing the reliability of a website quality scale, it is easy to get several observers to apply the scale independently. With websites, repeat assessments need to be close in time because their content changes frequently (as does bmj.com). With questionnaires, either self administered or recorded by an observer, repeat measurements need to be far enough apart in time for the earlier responses to be forgotten, yet not so far apart that the underlying quantity being measured might have changed. Such data enable us to evaluate test-retest reliability. If two measures have comparable face, content, and construct validity the more repeatable one may be preferred for the study of a given population.

1 Gagliardi A, Jadad AR. Examination of instruments used to rate quality of health information on the internet: chronicle of a voyage with an unclear destination. BMJ 2002;324:569-73.
2 Jüni P, Altman DG, Egger M. Assessing the quality of controlled clinical trials. BMJ 2001;323:42-6.
3 Muldoon MF, Barger SD, Flory JD, Manuck B. What are quality of life measurements measuring? BMJ 1998;316:542-5.
4 Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 2nd ed. Oxford: Oxford University Press, 1996.
5 Bland JM, Altman DG. Cronbach's alpha. BMJ 1997;314:572.

Honour a physician with the honour due unto him

A few years ago my general practitioner told me that anyone aged over 40 with upper abdominal discomfort needed investigating. At the local teaching hospital, a pleasant young doctor did a gastroscopy, which showed a mass in my stomach wall. I was sent for a barium meal. A consultant radiologist took the x ray films, instructing me briskly to turn this way and that but not otherwise paying me any attention. He told me to wait a few minutes while he checked the films to see if all the views were satisfactory. I sat alone in the room for about five minutes.

From the moment the consultant re-entered I could see that he was slightly agitated. "I'm terribly sorry," he called out as he came through the door at the far end. And then again, "I'm terribly sorry." Perhaps these words of regret, coupled with the concern on his face, might not have had the effect they did had I not been a man with an abdominal mass on his mind. At this moment of truth and reckoning, certain visions swam before my eyes.

Three strides later, he was in front of me and looking me full in the face: "I'm terribly sorry, I hadn't realised you were a doctor." In his hand was the request form, and I could see that my general practitioner had written "ex-SR here" in one corner. He must have spotted this when checking the form as he looked at the preliminary plates. Though no further x rays were needed, he proceeded a little breathlessly to deliver three or four minutes of almost a caricature of caring, empathic interest in a patient. What branch of medicine was I in, and where did I work? Good heavens, that must be tough. Is that an Australian accent I hear? A St Mary's old boy, ah yes. What did I think about. . .?

I don't mean to imply that this was insincere, merely splendidly different from his earlier matter of factness and economy of word. I had thought nothing of this at the time: in such a bread and butter procedure I had no more reason to expect the doctor to engage with me as a person than I would the phlebotomist taking a routine blood sample. Clearly, this consultant saw things similarly as a rule, but when the patient was a doctor the aesthetics of the encounter changed. He had apologised three times for what he felt was a lapse on his part, arising from his failure to notice what was written in the corner of the request form. Perhaps he thought I knew that my general practitioner had written this and that I expected this of a medical referral, and thus expected to be recognised by him not just as a patient but also as a colleague. He seemed to see this as my due. (As it happens, I didn't.)

I had forgotten this incident, but it was brought back to me by the aftermath of the Bristol cardiac surgery debacle, and by the publicity surrounding other recent medical scandals. These have all put a spotlight on relations between doctors, who seem to offer each other acknowledgement and empathy, as my consultant had sought belatedly to do to me. The general public may be coming to suspect that this collegiate solidarity is somehow not in their interests, associating it with mutual protectiveness and thus with cover-ups of medical malpractice. It is too soon to say how the profession will react, but my consultant was an older man and my guess is that, with younger generations of doctors, we will see the waning of a tradition whose roots lie with Hippocrates. For it was his oath that bound doctors to look well on each other (and not charge each other for their services).

It's another story, but I found out later that the mass was the gastroscopy instrument itself distorting the stomach wall, misdiagnosed by an inexperienced registrar. No special treatment there, anyway.

Derek Summerfield consultant psychiatrist, CASCAID, South London and Maudsley NHS Trust, London

BMJ VOLUME 324 9 MARCH 2002 bmj.com 607



Statistics Notes
Interaction revisited: the difference between two estimates
Douglas G Altman, J Martin Bland

Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

Correspondence to: D G Altman doug.altman@cancer.org.uk

BMJ 2003;326:219

We often want to compare two estimates of the same quantity derived from separate analyses. Thus we might want to compare the treatment effect in subgroups in a randomised trial, such as two age groups. The term for such a comparison is a test of interaction. In earlier Statistics Notes we discussed interaction in terms of heterogeneity of treatment effect.1-3 Here we revisit interaction and consider the concept more generally.

The comparison of two estimated quantities, such as means or proportions, each with its standard error, is a general method that can be applied widely. The two estimates should be independent, not obtained from the same individuals; examples are the results from subgroups in a randomised trial or from two independent studies. The samples should be large. If the estimates are $E_1$ and $E_2$ with standard errors $SE(E_1)$ and $SE(E_2)$, then the difference $d = E_1 - E_2$ has standard error $SE(d) = \sqrt{SE(E_1)^2 + SE(E_2)^2}$ (that is, the square root of the sum of the squares of the separate standard errors). This formula is an example of a well known relation that the variance of the difference between two estimates is the sum of the separate variances (here the variance is the square of the standard error). Then the ratio $z = d/SE(d)$ gives a test of the null hypothesis that in the population the difference d is zero, by comparing the value of z to the standard normal distribution. The 95% confidence interval for the difference is $d - 1.96\,SE(d)$ to $d + 1.96\,SE(d)$.

We illustrated this for means and proportions,3 although we did not show how to get the standard error of the difference. Here we consider comparing relative risks or odds ratios. These measures are always analysed on the log scale because the distributions of the log ratios tend to be closer to normal than those of the ratios themselves.

In a meta-analysis of non-vertebral fractures in randomised trials of hormone replacement therapy the estimated relative risk from 22 trials was 0.73 (P=0.02) in favour of hormone replacement therapy.4 From 14 trials of women aged on average <60 years the relative risk was 0.67 (95% confidence interval 0.46 to 0.98; P=0.03). From eight trials of women aged >60 the relative risk was 0.88 (0.71 to 1.08; P=0.22). In other words, in younger women the estimated treatment benefit was a 33% reduction in risk of fracture, which was statistically significant, compared with a 12% reduction in older women, which was not significant. But are the relative risks from the subgroups significantly different from each other? We show how to answer this question using just the summary data quoted.

Because the calculations were made on the log scale, comparing the two estimates is complex (see table). We need to obtain the logs of the relative risks and their confidence intervals (rows 2 and 4).5 As 95% confidence intervals are obtained as 1.96 standard errors either side of the estimate, the SE of each log relative risk is obtained by dividing the width of its confidence interval by 2 × 1.96 (row 6). The estimated difference in log relative risks is d = E1 − E2 = −0.2726 and its standard error 0.2206 (row 8). From these two values we can test the interaction and estimate the ratio of the relative risks (with confidence interval). The test of interaction is the ratio of d to its standard error: z = −0.2726/0.2206 = −1.24, which gives P=0.2 when we refer it to a table of the normal distribution. The estimated interaction effect is exp(−0.2726) = 0.76. (This value can also be obtained directly as 0.67/0.88 = 0.76.) The confidence interval for this effect is −0.7050 to 0.1598 on the log scale (row 9). Transforming back to the relative risk scale, we get 0.49 to 1.17 (row 12). There is thus no good evidence to support a different treatment effect in younger and older women.

The same approach is used for comparing odds ratios. Comparing means or regression coefficients is simpler as there is no log transformation. The two estimates must be independent: the method should not be used to compare a subset with the whole group, or two estimates from the same patients.

There is limited power to detect interactions, even in a meta-analysis combining the results from several studies. As this example illustrates, even when the two estimates and P values seem very different the test of interaction may not be significant. It is not sufficient for the relative risk to be significant in one subgroup and not in another. Conversely, it is not correct to assume that when two confidence intervals overlap the two estimates are not significantly different.6 Statistical analysis should be targeted on the question in hand, and not based on comparing P values from separate analyses.2

Calculations for comparing two estimated relative risks

Row  Quantity                   Group 1               Group 2
1    RR                         0.67                  0.88
2    *log RR                    −0.4005 (E1)          −0.1278 (E2)
3    95% CI for RR              0.46 to 0.98          0.71 to 1.08
4    *95% CI for log RR         −0.7765 to −0.0202    −0.3425 to 0.0770
5    Width of CI                0.7563                0.4195
6    SE [= width/(2 × 1.96)]    0.1929                0.1070

Difference between log relative risks
7    d [= E1 − E2]              −0.4005 − (−0.1278) = −0.2726
8    SE(d)                      √(0.1929² + 0.1070²) = 0.2206
9    CI(d)                      −0.2726 ± 1.96 × 0.2206, or −0.7050 to 0.1598
10   Test of interaction        z = −0.2726/0.2206 = −1.24 (P=0.2)

Ratio of relative risks (RRR)
11   RRR = exp(d)               exp(−0.2726) = 0.76
12   CI(RRR)                    exp(−0.7050) to exp(0.1598), or 0.49 to 1.17

*Values obtained by taking natural logarithms of values on preceding row.

1 Altman DG, Matthews JNS. Interaction 1: Heterogeneity of effects. BMJ 1996;313:486.
2 Matthews JNS, Altman DG. Interaction 2: Compare effect sizes not P values. BMJ 1996;313:808.
3 Matthews JNS, Altman DG. Interaction 3: How to examine heterogeneity. BMJ 1996;313:862.
4 Torgerson DJ, Bell-Syer SEM. Hormone replacement therapy and prevention of nonvertebral fractures. A meta-analysis of randomized trials. JAMA 2001;285:2891-7.
5 Bland JM, Altman DG. Logarithms. BMJ 1996;312:700.
6 Bland M, Peacock J. Interpreting statistics with confidence. Obstetrician and Gynaecologist (in press).
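The rows of the table of calculations can be reproduced from the quoted summary data alone. The following is a minimal Python sketch of that calculation, not the authors' own code: the function and variable names are illustrative assumptions, and the two sided P value is taken from the standard normal distribution through `math.erfc` rather than from a printed table.

```python
import math

Z95 = 1.96  # multiplier for a 95% confidence interval, as used in the text


def log_estimate(rr, lower, upper):
    """Return (log RR, SE of log RR) from a relative risk and its 95% CI.

    The SE is the width of the CI on the log scale divided by 2 x 1.96.
    """
    width = math.log(upper) - math.log(lower)
    return math.log(rr), width / (2 * Z95)


def compare_relative_risks(group1, group2):
    """Test of interaction between two independent relative risks.

    Each argument is a (RR, lower 95% limit, upper 95% limit) tuple.
    """
    e1, se1 = log_estimate(*group1)
    e2, se2 = log_estimate(*group2)
    d = e1 - e2                              # difference in log relative risks
    se_d = math.sqrt(se1 ** 2 + se2 ** 2)    # SE of the difference
    z = d / se_d                             # test of interaction
    p = math.erfc(abs(z) / math.sqrt(2))     # two sided P, standard normal
    return {
        "d": d,
        "se_d": se_d,
        "z": z,
        "p": p,
        "rrr": math.exp(d),                  # ratio of relative risks
        "rrr_ci": (math.exp(d - Z95 * se_d), math.exp(d + Z95 * se_d)),
    }


# Younger women (group 1) versus older women (group 2), values from the text.
result = compare_relative_risks((0.67, 0.46, 0.98), (0.88, 0.71, 1.08))
for name, value in result.items():
    print(name, value)
```

The same function applies to odds ratios. For means or regression coefficients the log and exponential steps would be dropped, since those estimates are compared on their original scale.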

BMJ VOLUME 326 25 JANUARY 2003 bmj.com 219



Statistics Notes
The logrank test
J Martin Bland, Douglas G Altman

Department of Health Sciences, University of York, York YO10 5DD
J Martin Bland, professor of health statistics

Cancer Research UK/NHS Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: Professor Bland

BMJ 2004;328:1073

We often wish to compare the survival experience of two (or more) groups of individuals. For example, the table shows survival times of 51 adult patients with recurrent malignant gliomas1 tabulated by type of tumour and indicating whether the patient had died or was still alive at analysis (that is, their survival time was censored).2 As the figure shows, the survival curves differ, but is this sufficient to conclude that in the population patients with anaplastic astrocytoma have worse survival than patients with glioblastoma?

[Figure: Survival curves for women with glioma by diagnosis. Two curves, astrocytoma and glioblastoma; vertical axis: survival proportion (0 to 1.0); horizontal axis: time (weeks), 0 to 208.]

We could compute survival curves3 for each group and compare the proportions surviving at any specific time. The weakness of this approach is that it does not provide a comparison of the total survival experience of the two groups, but rather gives a comparison at some arbitrary time point(s). In the figure the difference in survival is greater at some times than others and eventually becomes zero. We describe here the logrank test, the most popular method of comparing the survival of groups, which takes the whole follow up period into account. It has the considerable advantage that it does not require us to know anything about the shape of the survival curve or the distribution of survival times.

The logrank test is used to test the null hypothesis that there is no difference between the populations in the probability of an event (here a death) at any time point. The analysis is based on the times of events (here deaths). For each such time we calculate the observed number of deaths in each group and the number expected if there were in reality no difference between the groups. The first death was in week 6, when one patient in group 1 died. At the start of this week, there were 51 subjects alive in total, so the risk of death in this week was 1/51. There were 20 patients in group 1, so, if the null hypothesis were true, the expected number of deaths in group 1 is 20 × 1/51 = 0.39. Likewise, in group 2 the expected number of deaths is 31 × 1/51 = 0.61. The second event occurred in week 10, when there were two deaths. There were now 19 and 31 patients at risk (alive) in the two groups, one having died in week 6, so the probability of death in week 10 was 2/50. The expected numbers of deaths were 19 × 2/50 = 0.76 and 31 × 2/50 = 1.24 respectively.

The same calculations are performed each time an event occurs. If a survival time is censored, that individual is considered to be at risk of dying in the week of the censoring but not in subsequent weeks. This way of handling censored observations is the same as for the Kaplan-Meier survival curve.3

From the calculations for each time of death, the total numbers of expected deaths were 22.48 in group 1 and 19.52 in group 2, and the observed numbers of deaths were 14 and 28. We can now use a $\chi^2$ test of the null hypothesis. The test statistic is the sum of $(O - E)^2/E$ for each group, where O and E are the totals of the observed and expected events. Here $(14 - 22.48)^2/22.48 + (28 - 19.52)^2/19.52 = 6.88$. The degrees of freedom are the number of groups minus one, i.e. 2 − 1 = 1. From a table of the $\chi^2$ distribution we get P < 0.01, so that the difference between the groups is statistically significant. There is a different method of calculating the test statistic,4 but we prefer this approach as it extends easily to several groups. It is also possible to test for a trend in survival across ordered groups.4 Although we have shown how the calculation is made, we strongly recommend the use of statistical software.

The logrank test is based on the same assumptions as the Kaplan-Meier survival curve,3 namely that censoring is unrelated to prognosis, the survival probabilities are the same for subjects recruited early and late in the study, and the events happened at the times specified. Deviations from these assumptions matter most if they are satisfied differently in the groups being compared, for example if censoring is more likely in one group than another.

The logrank test is most likely to detect a difference between groups when the risk of an event is consistently greater for one group than another. It is unlikely to detect a difference when survival curves cross, as can happen when comparing a medical with a surgical intervention. When analysing survival data, the survival curves should always be plotted.

Because the logrank test is purely a test of significance it cannot provide an estimate of the size of the difference between the groups or a confidence interval. For these we must make some assumptions about the data. Common methods use the hazard ratio, including the Cox proportional hazards model, which we shall describe in a future Statistics Note.

Competing interests: None declared.

Weeks to death or censoring in 51 adults with recurrent gliomas1 (A=astrocytoma, G=glioblastoma)

A: 6, 13, 21, 30, 31*, 37, 38, 47*, 49, 50, 63, 79, 80*, 82*, 82*, 86, 98, 149*, 202, 219
G: 10, 10, 12, 13, 14, 15, 16, 17, 18, 20, 24, 24, 25, 28, 30, 33, 34*, 35, 37, 40, 40, 40*, 46, 48, 70*, 76, 81, 82, 91, 112, 181

*Censored survival time (still alive at follow up).

1 Rostomily RC, Spence AM, Duong D, McCormick K, Bland M, Berger MS. Multimodality management of recurrent adult malignant gliomas: results of a phase II multiagent chemotherapy study and analysis of cytoreductive surgery. Neurosurgery 1994;35:378-8.
2 Altman DG, Bland JM. Time to event (survival) data. BMJ 1998;317:468-9.
3 Bland JM, Altman DG. Survival probabilities. The Kaplan-Meier method. BMJ 1998;317:1572.
4 Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991: 371-5.
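The week by week bookkeeping described in the Note can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' software: the helper names are assumptions, the data are the survival times from the table, and the P value line relies on the fact that for one degree of freedom the upper tail of the chi-squared distribution equals `erfc(sqrt(x/2))`. For more than two groups a statistical package should supply the P value.

```python
import math


def parse(times):
    """Turn entries such as "31*" into (weeks, died) pairs; "*" marks censoring."""
    return [(int(t.rstrip("*")), not t.endswith("*")) for t in times.split()]


astrocytoma = parse(
    "6 13 21 30 31* 37 38 47* 49 50 63 79 80* 82* 82* 86 98 149* 202 219"
)
glioblastoma = parse(
    "10 10 12 13 14 15 16 17 18 20 24 24 25 28 30 33 34* 35 37 40 40 40* "
    "46 48 70* 76 81 82 91 112 181"
)


def logrank(groups):
    """Return observed deaths, expected deaths, and the chi-squared statistic."""
    observed = [sum(died for _, died in g) for g in groups]
    expected = [0.0] * len(groups)
    death_times = sorted({t for g in groups for t, died in g if died})
    for t in death_times:
        # A subject is at risk in week t if still under observation then;
        # censored subjects count in their week of censoring but not later.
        at_risk = [sum(1 for u, _ in g if u >= t) for g in groups]
        deaths = sum(1 for g in groups for u, died in g if died and u == t)
        total = sum(at_risk)
        for i, n in enumerate(at_risk):
            expected[i] += n * deaths / total
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return observed, expected, chi2


observed, expected, chi2 = logrank([astrocytoma, glioblastoma])
p_value = math.erfc(math.sqrt(chi2 / 2))  # valid for 1 degree of freedom only

print("observed:", observed)
print("expected:", [round(e, 2) for e in expected])
print("chi-squared:", round(chi2, 2), "P:", round(p_value, 4))
```

Recounting the patients at risk at every death time is quadratic in the number of subjects, which is of no concern for 51 patients and keeps the code close to the description in the text.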

BMJ VOLUME 328 1 MAY 2004 bmj.com 1073



Statistics Notes
Diagnostic tests 4: likelihood ratios
Jonathan J Deeks, Douglas G Altman

Screening and Test The properties of a diagnostic or screening test are often
Evaluation
Program, School of described using sensitivity and specificity or predictive Calculation of post-test probabilities using
Public Health, values, as described in previous Notes.1 2 Likelihood likelihood ratios
University of
Sydney, NSW 2006,
ratios are alternative statistics for summarising diag-
Pretest probability = p1 = 0.1
Australia nostic accuracy, which have several particularly powerful pretest odds = p1/(1 p1) = 0.1/0.9 = 0.11
Jonathan J Deeks properties that make them more useful clinically than post-test odds = pretest oddslikelihood ratio
senior research
biostatistician
other statistics.3 post-test odds = o2 = 0.1120.43 = 2.27
Each test result has its own likelihood ratio, which Post-test probability = o2/(1+ o2) = 2.27/3.37 = 0.69
Cancer Research
UK/NHS Centre summarises how many times more (or less) likely
for Statistics in patients with the disease are to have that particular
Medicine, Institute
for Health Sciences, result than patients without the disease. More formally, Likelihood ratios are ratios of probabilities, and can
Oxford OX3 7LF it is the ratio of the probability of the specific test result be treated in the same way as risk ratios for the
Douglas G Altman in people who do have the disease to the probability in
professor of statistics
purposes of calculating confidence intervals.6
in medicine people who do not. For a test with only two outcomes, likelihood ratios
Correspondence to:
A likelihood ratio greater than 1 indicates that the can be calculated directly from sensitivities and specifi-
Mr Deeks test result is associated with the presence of the disease, cities.1 For example, if smoking habit is dichotomised
Jon.Deeks@ whereas a likelihood ratio less than 1 indicates that the as above or below 40 pack years, the sensitivity is 28.4%
cancer.org.uk
test result is associated with the absence of disease. The (42/148) and specificity 98.6% (142/144). The positive
BMJ 2004;329:1689 further likelihood ratios are from 1 the stronger the likelihood ratio is the proportion with obstructive
evidence for the presence or absence of disease. Likeli- airway disease who smoked more than 40 pack years
hood ratios above 10 and below 0.1 are considered to (sensitivity) divided by the proportion without disease
provide strong evidence to rule in or rule out who smoked more than 40 pack years (1specificity),
diagnoses respectively in most circumstances.4 When 28.4/1.4 = 20.3, as before. The negative likelihood
tests report results as being either positive or negative ratio is the proportion with disease who smoked less
the two likelihood ratios are called the positive than 40 pack years (1sensitivity) divided by the
proportion without disease who smoked less than 40
likelihood ratio and the negative likelihood ratio.
pack years (specificity), 71.6/98.6 = 0.73. However,
The table shows the results of a study of the value of
unlike sensitivity and specificity, computation of
a history of smoking in diagnosing obstructive airway
likelihood ratios does not require dichotomisation of
disease.5 Smoking history was categorised into four
groups according to pack years smoked (packs per day
years smoked). The likelihood ratio for each category Pre-test Post-test
probability probability
is calculated by dividing the percentage of patients with
0.001 0.999
obstructive airway disease in that category by the
0.002 0.998
percentage without the disease in that category. For 0.003 0.997
example, among patients with the disease 28% had 40+ 0.005 0.995
0.007 0.993
smoking pack years compared with just 1.4% of 0.01 0.99
patients without the disease. The likelihood ratio is Likelihood
0.02 ratio 0.98
thus 28.4/1.4 = 20.3. A smoking history of more than 0.03 1000 0.97
500
40 pack years is strongly predictive of a diagnosis of 0.05 0.95
0.07 200 0.93
obstructive airway disease as the likelihood ratio is sub- 0.1 100 0.9
50
stantially higher than 10. Although never smoking or 20 0.8
0.2
smoking less than 20 pack years both point to not hav- 10
0.3 5 0.7
ing obstructive airway disease, their likelihood ratios 0.4 2 0.6
0.5 1 0.5
are not small enough to rule out the disease with 0.6 0.5 0.4
confidence. 0.7 0.2 0.3
0.1
0.8 0.05 0.2
0.02
0.9 0.01 0.1
Likelihood ratios are ratios of probabilities, and can be treated in the same way as risk 0.93 0.005 0.07
0.95 0.05
ratios for the purposes of calculating confidence intervals6 0.002
0.97 0.001 0.03
Obstructive airway disease 0.98 0.02
Smoking habit
(pack years) Yes (n (%)) No (n (%)) Likelihood ratio 95% CI 0.99 0.01
40 42 (28.4) 2 (1.4) (42/148)/(2/144)=20.4 5.04 to 82.8 0.993 0.007
0.995 0.005
20-40 25 (16.9) 24 (16.7) (25/148)/(24/144)=1.01 0.61 to 1.69 0.003
0.997
0-20 29 (19.6) 51 (35.4) (29/148)/51/144)=0.55 0.37 to 0.82 0.998 0.002
Never smoked or smoked 52 (35.1) 67 (46.5) (52/148)/67/144)=0.76 0.57 to 1.00
0.999 0.001
for <1 yr
Total 148 (100) 144 (100)
Use of Fagans nomogram for calculating post-test probabilities7

168 BMJ VOLUME 329 17 JULY 2004 bmj.com


Education and debate

test results. Forcing dichotomisation on multicategory study sample and can rarely be generalised beyond the
test results may discard useful diagnostic information.

Likelihood ratios can be used to help adapt the results of a study to your patients. To do this they make use of a mathematical relationship known as Bayes' theorem that describes how a diagnostic finding changes our knowledge of the probability of abnormality.3 The post-test odds that the patient has the disease are estimated by multiplying the pretest odds by the likelihood ratio. The use of odds rather than risks makes the calculation slightly complex (box) but a nomogram can be used to avoid having to make conversions between odds and probabilities (figure).7 Both the figure and the box illustrate how a prior probability of obstructive airway disease of 0.1 (based, say, on presenting features) is updated to a probability of 0.7 with the knowledge that the patient had smoked for more than 40 pack years.

In clinical practice it is essential to know how a particular test result predicts the risk of abnormality. Sensitivities and specificities1 do not do this: they describe how abnormality (or normality) predicts particular test results. Predictive values2 do give probabilities of abnormality for particular test results, but depend on the prevalence of abnormality in the study (except when the study is based on a suitable random sample, as is sometimes the case for population screening studies). Likelihood ratios provide a solution as they can be used to calculate the probability of abnormality, while adapting for varying prior probabilities of the chance of abnormality from different contexts.

1 Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552.
2 Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994;309:102.
3 Sackett DL, Straus S, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine. How to practise and teach EBM. 2nd ed. Edinburgh: Churchill Livingstone, 2000:67-93.
4 Jaeschke R, Guyatt G, Lijmer J. Diagnostic tests. In: Guyatt G, Rennie D, eds. Users' guides to the medical literature. Chicago: AMA Press, 2002:121-40.
5 Straus SE, McAlister FA, Sackett DL, Deeks JJ. The accuracy of patient history, wheezing, and laryngeal measurements in diagnosing obstructive airway disease. CARE-COAD1 Group. Clinical assessment of the reliability of the examination-chronic obstructive airways disease. JAMA 2000;283:1853-7.
6 Altman DG. Diagnostic tests. In: Altman DG, Machin D, Bryant TN, Gardner MJ, eds. Statistics with confidence. 2nd ed. London: BMJ Books, 2000:105-19.
7 Fagan TJ. Letter: Nomogram for Bayes' theorem. N Engl J Med 1975;293:257.
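The odds calculation described in this note (post-test odds = pretest odds × likelihood ratio) can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and the likelihood ratio of 19 is an assumed value, chosen because it takes a prior probability of 0.1 to roughly the 0.7 quoted in the text; the note's own box and figure are not reproduced in this section.

```python
def probability_to_odds(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1.0 - p)


def odds_to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def post_test_probability(pretest_probability, likelihood_ratio):
    """Bayes' theorem in odds form: post-test odds = pretest odds x LR."""
    pretest_odds = probability_to_odds(pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return odds_to_probability(post_test_odds)


# ASSUMED likelihood ratio for ">40 pack years"; not stated in this section.
ASSUMED_LR = 19.0

p = post_test_probability(0.1, ASSUMED_LR)
print(f"post-test probability: {p:.2f}")  # about 0.68, ie roughly 0.7
```

Working in odds is what makes the update a single multiplication; the two conversion helpers are the step that the nomogram lets a reader skip.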

A memorable patient
Living history

I was finally settling down at my desk when the pager bleeped: it was the outpatients department. An extra patient had been added to the afternoon list; would I see him?

The patient was a slightly built man in his 60s. He had brought recent documentation from another hospital. I asked about his presenting complaint.

"Well, I'll try, but I wasn't aware of everything that happened. That's why I've brought my wife: she was with me at the time."

This was turning out to be one of those perfect neurological consultations: documents from another hospital, a witness account, an articulate patient. The only question would be whether it was seizure, syncope, or transient ischaemic attack. As we went through his medical history, I studied his records and for the first time noticed the phrase "Tetralogy of Fallot."

"Yes, my lifelong diagnosis," he smiled. "I was operated on."

I saw the faint dusky blue colour of his lips.

"Blalock-Taussig shunt?" I asked, as dim memories from medical school somehow came back into focus.

"Yes, twice. By Blalock." He paused. "In those days the operation was only done at Johns Hopkins; all the patients went there. I remember my second operation well. I was 12 at the time."

"And the doctors?" I asked.

"Yes, especially Dr Taussig. She would come around with her entourage every so often. Deaf as a post, she was."

"What? Taussig deaf?"

"Yes. She had an amplifier attached to her stethoscope to examine patients."

I asked to examine him, only too aware that it was more for my benefit than his.

A half smile suggested that he had read my thoughts: "Of course."

To even my unpractised technique, his cardiovascular signs were a museum piece: absent left subclavian pulse, big arciform scars on the anterior chest created by the surgeon who had saved his life, central cyanosis, right-sided systolic murmurs, loud pulmonary valve closure sound (iatrogenic pulmonary hypertension, I reasoned), pulsatile liver; all these and undoubtedly more noted by the physician who first understood his condition.

That night, I read about his doctors. Helen Taussig indeed had had substantial hearing impairment, a disability that would have meant the end of a career in cardiology for a less able clinician. I also learnt of the greater challenges of sexual prejudice that she fought and overcame all her life. I learnt about Alfred Blalock, the young doctor denied a residency at Johns Hopkins only to be invited back in his later years to head its surgical unit.

The experiences of a life in medicine are sometimes overwhelming. For weeks, I reflected on the perspectives opened to me by this unassuming patient. The curious irony of a man with a life threatening condition who had outlived his saviours; the extraordinary vision of his all-too-human doctors; the opportunity to witness history played out in the course of a half-hour consultation. And memories were jogged, too: the words of my former professor of medicine, who showed us cases of Fallot's and first told us about Taussig, "the woman cardiologist"; the portrait of Blalock adorning a surgical lecture theatre in medical college.

(And my patient's neurological examination and investigations? Non-contributory. I still don't know whether it was seizure, syncope, or transient ischaemic attack.)

Giridhar P Kalamangalam clinical fellow in epilepsy, department of neurology, Cleveland Clinic Foundation, Cleveland OH, USA

BMJ VOLUME 329 17 JULY 2004 bmj.com 169


Education and debate

Statistics Notes
Treatment allocation by minimisation
Douglas G Altman, J Martin Bland

Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Oxford OX3 7LF
Douglas G Altman professor of statistics in medicine
Department of Health Sciences, University of York, York YO10 5DD
J Martin Bland professor of health statistics
Correspondence to: D Altman doug.altman@cancer.org.uk
BMJ 2005;330:843

In almost all controlled trials treatments are allocated by randomisation. Blocking and stratification can be used to ensure balance between groups in size and patient characteristics.1 But stratified randomisation using several variables is not effective in small trials. The only widely acceptable alternative approach is minimisation,2 3 a method of ensuring excellent balance between groups for several prognostic factors, even in small samples. With minimisation the treatment allocated to the next participant enrolled in the trial depends (wholly or partly) on the characteristics of those participants already enrolled. The aim is that each allocation should minimise the imbalance across multiple factors.

Table 1 shows some baseline characteristics in a controlled trial comparing two types of counselling in relation to dietary intake.4 Minimisation was used for the four variables shown, and the two groups were clearly very similar in all of these variables. Such good balance for important prognostic variables helps the credibility of the comparisons. How is it achieved?

Minimisation is based on a different principle from randomisation. The first participant is allocated a treatment at random. For each subsequent participant we determine which treatment would lead to better balance between the groups in the variables of interest.

The dietary behaviour trial used minimisation based on the four variables in table 1. Suppose that after 40 patients had entered this trial the numbers in each subgroup in each treatment group were as shown in table 2. (Note that two or more categories need to be constructed for continuous variables.)

Table 2 Hypothetical distribution of baseline characteristics after 40 patients had been enrolled in the trial

                      Behavioural          Nutrition
                      counselling (n=20)   counselling (n=20)
Women                 12                   11
Age >50               7                    5
Ethnicity:
  White               15                   15
  Black               4                    5
  Asian               1                    0
Current smokers       6                    8

The next enrolled participant is a black woman aged 52, who is a non-smoker. If we were to allocate her to behavioural counselling, the imbalance would be increased in sex distribution (12+1 v 11 women), in age (7+1 v 5 aged > 50), and in smoking (14+1 v 12 non-smoking) and decreased in ethnicity (4+1 v 5 black). We formalise this by summing over the four variables the numbers of participants with the same characteristics as this new recruit already in the trial:

Behavioural: 12 (sex) + 7 (age) + 4 (ethnicity) + 14 (smoking) = 37
Nutrition: 11 + 5 + 5 + 12 = 33

Imbalance is minimised by allocating this person to the group with the smaller total (or at random if the totals are the same). Allocation to behavioural counselling would increase the imbalance; allocation to nutrition would decrease it.

At this point there are two options. The chosen treatment could simply be taken as the one with the lower score; or we could introduce a random element. We use weighted randomisation so that there is a high chance (eg 80%) of each participant getting the treatment that minimises the imbalance. The use of a random element will slightly worsen the overall imbalance between the groups, but balance will be much better for the chosen variables than with simple randomisation. A random element also makes the allocation more unpredictable, although minimisation is a secure allocation system when used by an independent person.

After the treatment is determined for the current participant the numbers in each group are updated and the process repeated for each subsequent participant. If at any time the totals for the two groups are the same, then the choice should be made using simple randomisation. The method extends to trials of more than two treatments.

Minimisation is a valid alternative to ordinary randomisation,2 3 5 and has the advantage, especially in small trials, that there will be only minor differences between groups in those variables used in the allocation process. Such balance is especially desirable where there are strong prognostic factors and modest treatment effects, such as oncology. Minimisation is best performed with the aid of software, for example, minim, a free program.6 Its use makes trialists think about prognostic factors at the outset and helps ensure adherence to the protocol as a trial progresses.7
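The scoring and weighted allocation steps described in this note can be sketched as follows. This is a minimal illustration, not the minim program: the function names and the dictionary layout are assumptions made here, and the non-smoker counts (14 and 12) are derived from the table 2 smoker counts as 20 minus the current smokers.

```python
import random


def table2_counts():
    """Table 2 counts for the levels that match the new participant.

    Non-smoker counts are derived (20 minus current smokers per group).
    """
    return {
        "behavioural": {"woman": 12, "age>50": 7, "black": 4, "non-smoker": 14},
        "nutrition": {"woman": 11, "age>50": 5, "black": 5, "non-smoker": 12},
    }


def scores(counts, characteristics):
    """Sum, per arm, the enrolled participants sharing each characteristic."""
    return {
        arm: sum(levels.get(c, 0) for c in characteristics)
        for arm, levels in counts.items()
    }


def allocate(counts, characteristics, p_best=0.8, rng=random):
    """Allocate one participant and update the running counts.

    With probability p_best the arm with the lowest total is chosen;
    if all totals are tied the choice is simple randomisation.
    """
    totals = scores(counts, characteristics)
    lowest = min(totals.values())
    best = [arm for arm in totals if totals[arm] == lowest]
    rest = [arm for arm in totals if totals[arm] != lowest]
    if not rest or rng.random() < p_best:
        chosen = rng.choice(best)
    else:
        chosen = rng.choice(rest)
    for c in characteristics:
        counts[chosen][c] = counts[chosen].get(c, 0) + 1
    return chosen, totals


counts = table2_counts()
new_participant = ["woman", "age>50", "black", "non-smoker"]
print(scores(counts, new_participant))  # {'behavioural': 37, 'nutrition': 33}
chosen, _ = allocate(counts, new_participant, rng=random.Random(2005))
print("allocated to:", chosen)
```

Setting `p_best=1.0` gives the purely deterministic version of the method; values below 1 add the random element that the note recommends.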
Table 1 Baseline characteristics in two groups4

                        Behavioural     Nutrition
                        counselling     counselling
Women                   82              84
Mean (SD) age (years)   43.3 (13.8)     43.2 (14.0)
Ethnicity:
  White                 94              96
  Black                 37              32
  Asian                 3               5
Current smokers         47              44

1 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-4.
2 Treasure T, MacRae KD. Minimisation: the platinum standard for trials. BMJ 1998;317:362-3.
3 Scott NW, McPherson GC, Ramsay CR, Campbell MK. The method of minimization for allocation to clinical trials: a review. Control Clin Trials 2002;23:662-74.
4 Steptoe A, Perkins-Porras L, McKay C, Rink E, Hilton S, Cappuccio FP. Behavioural counselling to increase consumption of fruit and vegetables in low income adults: randomised trial. BMJ 2003;326:855-8.
5 Buyse M, McEntegart D. Achieving balance in clinical trials: an unbalanced view from the European regulators. Applied Clin Trials 2004;13:36-40.
6 Evans S, Royston P, Day S. Minim: allocation by minimisation in clinical trials. http://www-users.york.ac.uk/zmb55/guide/minim.htm. (accessed 24 October 2004).
7 Day S. Commentary: Treatment allocation by the method of minimisation. BMJ 1999;319:947-8.

BMJ VOLUME 330 9 APRIL 2005 bmj.com 843


Education and debate


Statistics Notes
Standard deviations and standard errors
Douglas G Altman, J Martin Bland

Cancer Research UK/NHS Centre for Statistics in Medicine, Wolfson College, Oxford OX2 6UD
Douglas G Altman professor of statistics in medicine
Department of Health Sciences, University of York, York YO10 5DD
J Martin Bland professor of health statistics
Correspondence to: Prof Altman doug.altman@cancer.org.uk
BMJ 2005;331:903

The terms "standard error" and "standard deviation" are often confused.1 The contrast between these two terms reflects the important distinction between data description and inference, one that all researchers should appreciate.

The standard deviation (often SD) is a measure of variability. When we calculate the standard deviation of a sample, we are using it as an estimate of the variability of the population from which the sample was drawn. For data with a normal distribution,2 about 95% of individuals will have values within 2 standard deviations of the mean, the other 5% being equally scattered above and below these limits. Contrary to popular misconception, the standard deviation is a valid measure of variability regardless of the distribution. About 95% of observations of any distribution usually fall within the 2 standard deviation limits, though those outside may all be at one end. We may choose a different summary statistic, however, when data have a skewed distribution.3

When we calculate the sample mean we are usually interested not in the mean of this particular sample, but in the mean for individuals of this type: in statistical terms, of the population from which the sample comes. We usually collect data in order to generalise from them and so use the sample mean as an estimate of the mean for the whole population. Now the sample mean will vary from sample to sample; the way this variation occurs is described by the "sampling distribution" of the mean. We can estimate how much sample means will vary from the standard deviation of this sampling distribution, which we call the standard error (SE) of the estimate of the mean. As the standard error is a type of standard deviation, confusion is understandable. Another way of considering the standard error is as a measure of the precision of the sample mean.

The standard error of the sample mean depends on both the standard deviation and the sample size, by the simple relation SE = SD/√(sample size). The standard error falls as the sample size increases, as the extent of chance variation is reduced; this idea underlies the sample size calculation for a controlled trial, for example. By contrast the standard deviation will not tend to change as we increase the size of our sample.

So, if we want to say how widely scattered some measurements are, we use the standard deviation. If we want to indicate the uncertainty around the estimate of the mean measurement, we quote the standard error of the mean. The standard error is most useful as a means of calculating a confidence interval. For a large sample, a 95% confidence interval is obtained as the values 1.96×SE either side of the mean. We will discuss confidence intervals in more detail in a subsequent Statistics Note. The standard error is also used to calculate P values in many circumstances.

The principle of a sampling distribution applies to other quantities that we may estimate from a sample, such as a proportion or regression coefficient, and to contrasts between two samples, such as a risk ratio or the difference between two means or proportions. All such quantities have uncertainty due to sampling variation, and for all such estimates a standard error can be calculated to indicate the degree of uncertainty.

In many publications a ± sign is used to join the standard deviation (SD) or standard error (SE) to an observed mean, for example, 69.4±9.3 kg. That notation gives no indication whether the second figure is the standard deviation or the standard error (or indeed something else). A review of 88 articles published in 2002 found that 12 (14%) failed to identify which measure of dispersion was reported (and three failed to report any measure of variability).4 The policy of the BMJ and many other journals is to remove ± signs and request authors to indicate clearly whether the standard deviation or standard error is being quoted. All journals should follow this practice.

Competing interests: None declared.

1 Nagele P. Misuse of standard error of the mean (SEM) when reporting variability of a sample. A critical evaluation of four anaesthesia journals. Br J Anaesthesiol 2003;90:514-6.
2 Altman DG, Bland JM. The normal distribution. BMJ 1995;310:298.
3 Altman DG, Bland JM. Quartiles, quintiles, centiles, and other quantiles. BMJ 1994;309:996.
4 Olsen CH. Review of the use of statistics in Infection and Immunity. Infect Immun 2003;71:6689-92.
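The relation SE = SD/√(sample size) and the large-sample 95% confidence interval described in this note can be sketched as below. The helper names are invented for illustration, and the worked numbers are hypothetical: they treat the 9.3 kg in the note's example as a standard deviation and assume a sample of 100, neither of which the note states.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def se_from_sd(sd, n):
    """Standard error from a reported SD and sample size."""
    return sd / math.sqrt(n)


def large_sample_ci95(mean, se):
    """Large-sample 95% confidence interval: mean +/- 1.96 x SE."""
    half_width = 1.96 * se
    return mean - half_width, mean + half_width


# HYPOTHETICAL reading of "69.4 +/- 9.3 kg": 9.3 taken as an SD, n assumed 100.
se = se_from_sd(9.3, 100)
low, high = large_sample_ci95(69.4, se)
print(f"SE = {se:.2f} kg, 95% CI {low:.1f} to {high:.1f} kg")

# The SE shrinks with sample size while the SD is unchanged.
for n in (25, 100, 400):
    print(n, round(se_from_sd(9.3, n), 3))
```

If the 9.3 kg were instead a standard error, the interval would be ten times wider under the same assumed sample size, which is why the note asks authors to say which quantity they are reporting.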

BMJ VOLUME 331 15 OCTOBER 2005 bmj.com 903


Practice

Statistics Notes
The cost of dichotomising continuous variables
Douglas G Altman, Patrick Royston

Cancer Research UK/NHS Centre for Statistics in Medicine, Wolfson College, Oxford OX2 6UD
Douglas G Altman professor of statistics in medicine
MRC Clinical Trials Unit, London NW1 2DA
Patrick Royston professor of statistics
Correspondence to: Professor Altman doug.altman@cancer.org.uk
BMJ 2006;332:1080

Measurements of continuous variables are made in all branches of medicine, aiding in the diagnosis and treatment of patients. In clinical practice it is helpful to label individuals as having or not having an attribute, such as being "hypertensive" or "obese" or having "high cholesterol," depending on the value of a continuous variable.

Categorisation of continuous variables is also common in clinical research, but here such simplicity is gained at some cost. Though grouping may help data presentation, notably in tables, categorisation is unnecessary for statistical analysis and it has some serious drawbacks. Here we consider the impact of converting continuous data to two groups (dichotomising), as this is the most common approach in clinical research.1

What are the perceived advantages of forcing all individuals into two groups? A common argument is that it greatly simplifies the statistical analysis and leads to easy interpretation and presentation of results. A binary split, for example, at the median, leads to a comparison of groups of individuals with high or low values of the measurement, leading in the simplest case to a t test or χ² test and an estimate of the difference between the groups (with its confidence interval). There is, however, no good reason in general to suppose that there is an underlying dichotomy, and if one exists there is no reason why it should be at the median.2

Dichotomising leads to several problems. Firstly, much information is lost, so the statistical power to detect a relation between the variable and patient outcome is reduced. Indeed, dichotomising a variable at the median reduces power by the same amount as would discarding a third of the data.2 3 Deliberately discarding data is surely inadvisable when research studies already tend to be too small. Dichotomisation may also increase the risk of a positive result being a false positive.4 Secondly, one may seriously underestimate the extent of variation in outcome between groups, such as the risk of some event, and considerable variability may be subsumed within each group. Individuals close to but on opposite sides of the cutpoint are characterised as being very different rather than very similar. Thirdly, using two groups conceals any non-linearity in the relation between the variable and outcome. Presumably, many who dichotomise are unaware of the implications.

If dichotomisation is used where should the cutpoint be? For a few variables there are recognised cutpoints, such as > 25 kg/m² to define "overweight" based on body mass index. For some variables, such as age, it is usual to take a round number, usually a multiple of five or 10. The cutpoint used in previous studies may be adopted. In the absence of a prior cutpoint the most common approach is to take the sample median. However, using the sample median implies that various cutpoints will be used in different studies so that their results cannot easily be compared, seriously hampering meta-analysis of observational studies.5 Nevertheless, all these approaches are preferable to performing several analyses and choosing that which gives the most convincing result. Use of this so called "optimal" cutpoint (usually that giving the minimum P value) runs a high risk of a spuriously significant result; the difference in the outcome variable between the groups will be overestimated, perhaps considerably; and the confidence interval will be too narrow. This strategy should never be used.6 7

When regression is being used to adjust for the effect of a confounding variable, dichotomisation will run the risk that a substantial part of the confounding remains.4 7 Dichotomisation is not much used in epidemiological studies, where the use of several categories is preferred.

Using multiple categories (to create an "ordinal" variable) is generally preferable to dichotomising. With four or five groups the loss of information can be quite small, but there are complexities in analysis.

Instead of categorising continuous variables, we prefer to keep them continuous. We could then use linear regression rather than a two sample t test, for example. If we were concerned that a linear regression would not truly represent the relation between the outcome and predictor variable, we could explore whether some transformation (such as a log transformation) would be helpful.7 8 As an example, in a regression analysis to develop a prognostic model for patients with primary biliary cirrhosis, a carefully developed model with bilirubin as a continuous explanatory variable explained 31% more of the variability in the data than when bilirubin distribution was split at the median.7

Competing interests: None declared.

1 Del Priore G, Zandieh P, Lee MJ. Treatment of continuous data as categoric variables in obstetrics and gynecology. Obstet Gynecol 1997;89:351-4.
2 MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative variables. Psychol Meth 2002;7:19-40.
3 Cohen J. The cost of dichotomization. Appl Psychol Meas 1983;7:249-53.
4 Austin PC, Brunner LJ. Inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses. Stat Med 2004;23:1159-78.
5 Buettner P, Garbe C, Guggenmoos-Holzmann I. Problems in defining cutoff points of continuous prognostic factors: example of tumor thickness in primary cutaneous melanoma. J Clin Epidemiol 1997;50:1201-10.
6 Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst 1994;86:829-35.
7 Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41.
8 Royston P, Sauerbrei W. Building multivariable regression models with continuous covariates in clinical epidemiology, with an emphasis on fractional polynomials. Methods Inf Med 2005;44:561-71.

Endpiece
Reading and reflecting
Reading without reflecting is like eating without digesting.
Edmund Burke
Kamal Samanta, retired general practitioner, Denby Dale, West Yorkshire
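The Statistics Note's claim that a median split costs about as much as discarding a third of the data can be explored with a small simulation. The sketch below is illustrative only: the data are simulated, the slope, sample size, and seed are arbitrary assumptions, and the function names are invented here. For a normally distributed predictor, theory suggests that the squared correlation with the outcome falls to roughly 2/π (about 0.64) of its original value after a median split, which is the sense in which about a third of the information is lost.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, slope=0.5, seed=1):
    """Simulate a linear relation, then split the predictor at its median."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [slope * xi + rng.gauss(0.0, 1.0) for xi in x]
    median = sorted(x)[n // 2]
    x_split = [1.0 if xi > median else 0.0 for xi in x]
    return pearson(x, y), pearson(x_split, y)


r_continuous, r_split = simulate()
ratio = (r_split / r_continuous) ** 2
print(f"r (continuous) = {r_continuous:.3f}")
print(f"r (median split) = {r_split:.3f}")
print(f"ratio of squared correlations = {ratio:.2f}")
```

The ratio is a simulation estimate, so it varies a little with the seed and sample size; the theoretical figure applies only under the normality assumption made here.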

1080 BMJ VOLUME 332 6 MAY 2006 bmj.com


PRACTICE

Statistics Notes
Missing data
Douglas G Altman1, J Martin Bland2
1 Cancer Research UK/NHS Centre for Statistics in Medicine, Oxford OX2 6UD
2 Department of Health Sciences, University of York, York YO10 5DD
Correspondence to: Professor Altman doug.altman@cancer.org.uk
BMJ 2007;334:424
doi:10.1136/bmj.38977.682025.2C

Almost all studies have some missing observations. Yet textbooks and software commonly assume that data are complete, and the topic of how to handle missing data is not often discussed outside statistics journals.

There are many types of missing data and different reasons for data being missing. Both issues affect the analysis. Some examples are:
(1) In a postal questionnaire survey not all the selected individuals respond;
(2) In a randomised trial some patients are lost to follow-up before the end of the study;
(3) In a multicentre study some centres do not measure a particular variable;
(4) In a study in which patients are assessed frequently some data are missing at some time points for unknown reasons;
(5) Occasional data values for a variable are missing because some equipment failed;
(6) Some laboratory samples are lost in transit or technically unsatisfactory;
(7) In a magnetic resonance imaging study some very obese patients are excluded as they are too large for the machine;
(8) In a study assessing quality of life some patients die during the follow-up period.

The prime concern is always whether the available data would be biased. If the fact that an observation is missing is unrelated both to the unobserved value (and hence to patient outcome) and the data that are available this is called "missing completely at random." For cases 5 and 6 above that would be a safe assumption. Sometimes data are missing in a predictable way that does not depend on the missing value itself but which can be predicted from other data, as in case 3. Confusingly, this is known as "missing at random." In the common cases 1 and 2, however, the missing data probably depend on unobserved values, called "missing not at random," and hence their lack may lead to bias.

In general, it is important to be able to examine whether missing data may have introduced bias. For example, if we know nothing at all about the non-responders to a survey then we can do little to explore possible bias. Thus a high response rate is necessary for reliable answers.1 Sometimes, though, some information is available. For example, if the survey sample is chosen from a register that includes age and sex, then the responders and non-responders can be compared on these variables. At the very least this gives some pointers to the representativeness of the sample. Non-responders often (but not always) have a worse medical prognosis than those who respond.

A few missing observations are a minor nuisance, but a large amount of missing data is a major threat to a study's integrity. Non-response is a particular problem in pair-matched studies, such as some case-control studies, as it is unclear how to analyse data from the unmatched individuals. Loss of patients also reduces the power of the trial. Where losses are expected it is wise to increase the target sample size to allow for losses. This cannot eliminate the potential bias, however.

Missing data are much more common in retrospective studies, in which routinely collected data are subsequently used for a different purpose.2 When information is sought from patients' medical notes, the notes often do not say whether or not a patient was a smoker or had a particular procedure carried out. It is tempting to assume that the answer is no when there is no indication that the answer is yes, but this is generally unwise.

No really satisfactory solution exists for missing data, which is why it is important to try to maximise data collection. The main ways of handling missing data in analysis are: (a) omitting variables which have many missing values; (b) omitting individuals who do not have complete data; and (c) estimating (imputing) what the missing values were.

Omitting everyone without complete data is known as complete case (or available case) analysis and is probably the most common approach. When only a very few observations are missing little harm will be done, but when many are missing omitting all patients without full data might result in a large proportion of the data being discarded, with a major loss of statistical power. The results may be biased unless the data are missing completely at random. In general it is advisable not to include in an analysis any variable that is not available for a large proportion of the sample. The main alternative approach to case deletion is imputation, whereby missing values are replaced by some plausible value predicted from that individual's available data. Imputation has been the topic of much recent methodological work; we will consider some of the simpler methods in a separate Statistics Note.

Competing interests: None declared.

1 Evans SJW. Good surveys guide. BMJ 1991;302:302-3.
2 Burton A, Altman DG. Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. Br J Cancer 2004;91:4-8.

This is part of a series of occasional articles on statistics and handling data in research.
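The cost of complete case analysis described in this note can be illustrated with a toy dataset. Everything below is hypothetical: the records, variable names, and helper functions are invented for illustration and use `None` to mark a missing value. The point is only that modest missingness in each variable can add up to a large share of individuals being dropped.

```python
# HYPOTHETICAL patient records; None marks a missing value.
records = [
    {"age": 54, "smoker": True, "bmi": 27.1},
    {"age": 61, "smoker": None, "bmi": 30.4},
    {"age": None, "smoker": False, "bmi": 24.8},
    {"age": 47, "smoker": False, "bmi": None},
    {"age": 70, "smoker": True, "bmi": 29.0},
    {"age": 58, "smoker": False, "bmi": 26.3},
]


def missing_fraction(rows, variable):
    """Proportion of rows in which the variable is missing."""
    return sum(row.get(variable) is None for row in rows) / len(rows)


def complete_cases(rows, variables):
    """Keep only rows with an observed value for every listed variable."""
    return [
        row for row in rows
        if all(row.get(v) is not None for v in variables)
    ]


variables = ["age", "smoker", "bmi"]
for v in variables:
    print(v, "missing:", round(missing_fraction(records, v), 2))

kept = complete_cases(records, variables)
print("complete cases:", len(kept), "of", len(records))
```

Here each variable is missing for one person in six, yet half the records are lost once all three variables are required, and nothing in the code can tell whether the remaining records are still representative.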

424 BMJ | 24 february 2007 | Volume 334



Practice Statistics Notes

Parametric v non-parametric methods for data analysis
BMJ 2009; 338 doi: https://doi.org/10.1136/bmj.a3167 (Published 02 April 2009) Cite this as: BMJ
2009;338:a3167

Douglas G Altman, professor of statistics in medicine1, J Martin Bland, professor of health statistics2
1 Centre for Statistics in Medicine, University of Oxford, Wolfson College Annexe, Oxford OX2 6UD
2 Department of Health Sciences, University of York, York YO10 5DD

Correspondence to: Professor Altman doug.altman@csm.ox.ac.uk

Continuous data arise in most areas of medicine. Familiar clinical examples include blood pressure, ejection
fraction, forced expiratory volume in 1 second (FEV1), serum cholesterol, and anthropometric measurements.
Methods for analysing continuous data fall into two classes, distinguished by whether or not they make
assumptions about the distribution of the data.

Theoretical distributions are described by quantities called parameters, notably the mean and standard
deviation.1 Methods that use distributional assumptions are called parametric methods, because we estimate the
parameters of the distribution assumed for the data. Frequently used parametric methods include t tests and
analysis of variance for comparing groups, and least squares regression and correlation for studying the relation
between variables. All of the common parametric methods (t methods) assume that in some way the data follow
a normal distribution and also that the spread of the data (variance) is uniform either between groups or across
the range being studied. For example, the two sample t test assumes that the two samples of observations come
from populations that have normal distributions with the same standard deviation. The importance of the
assumptions for t methods diminishes as sample size increases.

Alternative methods, such as the sign test, Mann-Whitney test, and rank correlation, do not require the data to
follow a particular distribution. They work by using the rank order of observations rather than the measurements
themselves. Methods which do not require us to make distributional assumptions about the data, such as the
rank methods, are called non-parametric methods. The term non-parametric applies to the statistical method
used to analyse data, and is not a property of the data.1 As tests of significance, rank methods have almost as
much power as t methods to detect a real difference when samples are large, even for data which meet the
distributional requirements.
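To make concrete what "using the rank order of observations rather than the measurements themselves" means, here is a minimal sketch of the Mann-Whitney U statistic computed from ranks. The function names and the two small samples are invented for illustration, and the sketch computes only the statistic, not a P value. Because only ranks enter the calculation, any order-preserving transformation of the data (such as taking logs) leaves U unchanged.

```python
import math


def midranks(values):
    """Ranks from 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """U statistic for sample a: its rank sum minus the minimum possible."""
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    return rank_sum_a - len(a) * (len(a) + 1) / 2


# HYPOTHETICAL measurements in two groups.
group_a = [1.2, 3.4, 2.2, 5.0]
group_b = [2.9, 6.1, 7.3, 4.8, 8.0]

u_raw = mann_whitney_u(group_a, group_b)
u_log = mann_whitney_u([math.log(v) for v in group_a],
                       [math.log(v) for v in group_b])
print(u_raw, u_log)  # 3.0 3.0
```

U here counts the pairs in which a value from the first group exceeds a value from the second, which is why it says nothing directly about the size of the difference between the groups.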

Non-parametric methods are most often used to analyse data which do not meet the distributional requirements
of parametric methods. In particular, skewed data are frequently analysed by non-parametric methods, although
data transformation can often make the data suitable for parametric analyses.2

Data that are scores rather than measurements may have many possible values, such as quality of life scales or
data from visual analogue scales, while others have only a few possible values, such as Apgar scores or stage of
disease. Scores with many values are often analysed using parametric methods, whereas those with few values
tend to be analysed using rank methods, but there is no clear boundary between these cases.

To compensate for the advantage of being free of assumptions about the distribution of the data, rank methods
have the disadvantage that they are mainly suited to hypothesis testing and no useful estimate is obtained, such
as the average difference between two groups. Estimates and confidence intervals are easy to find with t
methods. Non-parametric estimates and confidence intervals can be calculated, however, but depend on extra
assumptions which are almost as strong as those for t methods.3 Rank methods have the added disadvantage of
not generalising to more complex situations, most obviously when we wish to use regression methods to adjust
for several other factors.

Rank methods can generate strong views, with some people preferring them for all analyses and others believing
that they have no place in statistics. We believe that rank methods are sometimes useful, but parametric
methods are generally preferable as they provide estimates and confidence intervals and generalise to more
complex analyses.

The choice of approach may also be related to sample size, as the distributional assumptions are more important
for small samples. We consider the analysis of small data sets in a subsequent Statistics Note.

Notes
Cite this as: BMJ 2009;338:a3167

Footnotes
Competing interests: None declared.

Provenance and peer review: Commissioned, not externally peer reviewed.

References
1. Altman DG, Bland JM. Variables and parameters. BMJ 1999;318:1667.
2. Bland JM, Altman DG. Transforming data. BMJ 1996;312:770.

3. Campbell MJ, Gardner MJ. Medians and their differences. In: Altman DG, Machin D, Bryant TN, Gardner MJ, eds.
Statistics with confidence. 2nd ed. London: BMJ Books, 2000:36-44.

Practice Statistics Notes

Analysis of continuous data from small samples


BMJ 2009; 338 doi: https://doi.org/10.1136/bmj.a3166 (Published 06 April 2009) Cite this as: BMJ
2009;338:a3166

J Martin Bland, professor of health statistics 1, Douglas G Altman, professor of statistics in medicine 2
1 Department of Health Sciences, University of York, York YO10 5DD
2 Centre for Statistics in Medicine, University of Oxford, Wolfson College Annexe, Oxford OX2 6UD

Correspondence to: Professor Bland mb55@york.ac.uk

Studies with small numbers of measurements are rare in the modern BMJ, but they used to be common and
remain plentiful in specialist clinical journals. Their analysis is often more problematic than that for large samples.

Parametric methods, including t tests, correlation, and regression, require the assumption that the data follow a
normal distribution and that variances are uniform between groups or across ranges.1 In small samples these
assumptions are particularly important, so this setting seems ideal for rank (non-parametric) methods, which
make no assumptions about the distribution of the data; they use the rank order of observations rather than the
measurements themselves.1 Unfortunately, rank methods are least effective in small samples. Indeed, for very
small samples, they cannot yield a significant result whatever the data. For example, when using the Mann-Whitney
test for comparing two samples of fewer than four observations a statistically significant difference is
impossible: any data give P>0.05. Similarly, the Wilcoxon paired test, the sign test, and Spearman's and
Kendall's rank correlation coefficients cannot produce P<0.05 for fewer than six observations. Methods based on
the t distribution do not have this problem and can detect differences in samples as small as two for paired
differences and three for two groups, or detect correlations in samples of three.
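
The limits quoted above for the Mann-Whitney, Wilcoxon paired, and sign tests can be checked with a short calculation. The sketch below assumes exact two sided tests with no ties and no zero differences, so that the most extreme possible arrangement of the data has probability 1/C(n1+n2, n1) in each tail for the Mann-Whitney test and 1/2^n in each tail for the Wilcoxon paired and sign tests. The function names are illustrative.

```python
from math import comb


def min_p_mann_whitney(n1, n2):
    """Smallest two sided exact P for two samples of sizes n1 and n2 (no ties)."""
    return min(1.0, 2 / comb(n1 + n2, n1))


def min_p_paired_rank(n):
    """Smallest two sided exact P for the Wilcoxon paired or sign test with n non-zero differences."""
    return min(1.0, 2 / 2 ** n)


# Two samples of three observations: P can never fall below 0.1.
p_3_3 = min_p_mann_whitney(3, 3)
# Two samples of four observations: P<0.05 becomes possible.
p_4_4 = min_p_mann_whitney(4, 4)
# Five paired differences: smallest P is 0.0625; six are needed for P<0.05.
p_5 = min_p_paired_rank(5)
p_6 = min_p_paired_rank(6)
```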

For example, we were recently asked about the data in table 1, which shows before and after measurements of
pudendal nerve terminal motor latency. Should we use the Wilcoxon or the sign test? MB replied that the
Wilcoxon would be acceptable, giving P<0.05 (actually P=0.047), and so would the paired t test, which gave
P=0.04. The questioner also asked whether the Wilcoxon test could be used for the second group of four
observations alone, for patients who had received a slightly different intervention. Here all the differences are in
the same direction, but the Wilcoxon test gives P=0.125. It is not possible for it to give a significant difference.
The paired t test gives P=0.04, a significant difference.
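
The two results for the second subgroup can be reproduced with the sketch below, which uses the four pairs of subgroup 2 from table 1. It makes two simplifying assumptions: the P value for t uses a closed form for the t distribution that holds only for 3 degrees of freedom, and the Wilcoxon P value is found by enumerating all 16 possible sign patterns, which is practical only for very small samples without tied differences.

```python
from itertools import product
from math import atan, pi, sqrt

# Subgroup 2 of table 1: initial and follow-up latency (ms).
initial = [2.3, 2.4, 2.4, 2.6]
follow_up = [2.1, 1.8, 1.9, 1.6]
diffs = [f - i for i, f in zip(initial, follow_up)]
n = len(diffs)

# Paired t statistic: mean difference divided by its standard error.
mean = sum(diffs) / n
sd = sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
t = mean / (sd / sqrt(n))


def two_sided_p_t3(t_value):
    """Two sided P for a t statistic with exactly 3 degrees of freedom (closed form)."""
    x = abs(t_value) / sqrt(3)
    cdf = 0.5 + (x / (1 + x * x) + atan(x)) / pi
    return 2 * (1 - cdf)


p_t = two_sided_p_t3(t)

# Exact Wilcoxon paired test: rank the absolute differences (no ties here),
# then compare the observed sum of positive ranks with every sign pattern.
order = sorted(range(n), key=lambda i: abs(diffs[i]))
ranks = [0] * n
for rank, i in enumerate(order, start=1):
    ranks[i] = rank
w_obs = sum(r for r, d in zip(ranks, diffs) if d > 0)
centre = n * (n + 1) / 4
all_sums = [sum(r for r, s in zip(ranks, signs) if s)
            for signs in product([0, 1], repeat=n)]
p_wilcoxon = sum(abs(w - centre) >= abs(w_obs - centre) for w in all_sums) / len(all_sums)
```

Because only the two most extreme sign patterns are as extreme as the observed one, the Wilcoxon P value cannot fall below 2/16 for four pairs, whereas the t test uses the size of the differences as well as their direction.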

Subgroup   Pudendal nerve terminal motor latency (ms)
           Initial   Follow-up   Change
1          2.2       2.3          0.1
1          2.3       1.6         -0.7
1          2.1       2.2          0.1
1          2.4       2.3         -0.1
2          2.3       2.1         -0.2
2          2.4       1.8         -0.6
2          2.4       1.9         -0.5
2          2.6       1.6         -1.0

Table 1 Five year follow-up of patients receiving hyperbaric oxygen therapy for faecal incontinence

On the other hand, using t methods when their assumptions are greatly violated can also be misleading. Table 2
shows concentration of antibody to type II group B Streptococcus in 20 volunteers before and after
immunisation.2 3 The comparison of the antibody levels was summarised in the report of this study as t=1.8;
P>0.05. The paired t test is not suitable for these data, because the differences clearly have a very skewed
distribution. There are 8 zero differences, forming a clump at one end of the distribution, which would remain
whatever transformation we used. We could consider the Wilcoxon paired sample test, but this method assumes
that the differences have a symmetrical distribution, which they do not. The sign test is preferred here; it tests the
null hypothesis that non-zero differences are equally likely to be positive or negative, using the binomial
distribution. We have 1 negative and 11 positive differences, which gives P=0.006. Hence the original authors
failed to detect a difference because they used an inappropriate analysis.
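
The sign test calculation can be sketched as follows, using the 20 pairs of values in table 2. Zero differences are dropped, and the two sided P value is twice the binomial tail probability of seeing a count as small as the rarer sign when positive and negative differences are equally likely. The function name `sign_test_p` is illustrative.

```python
from math import comb

# Antibody concentrations from table 2, before and after immunisation.
before = [0.4, 0.4, 0.4, 0.4, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6,
          0.6, 0.7, 0.7, 0.8, 0.9, 0.9, 1.0, 1.0, 1.6, 2.0]
after = [0.4, 0.5, 0.5, 0.9, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6,
         12.2, 1.1, 1.2, 0.8, 1.2, 1.9, 0.9, 2.0, 8.1, 3.7]

n_positive = sum(a > b for a, b in zip(after, before))
n_negative = sum(a < b for a, b in zip(after, before))
n_zero = len(before) - n_positive - n_negative


def sign_test_p(n_pos, n_neg):
    """Two sided exact sign test P value, ignoring zero differences."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_sign = sign_test_p(n_positive, n_negative)
```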

Antibody to type II group B Streptococcus (µg/ml)

Before   After   Change
0.4      0.4      0.0
0.4      0.5      0.1
0.4      0.5      0.1
0.4      0.9      0.5
0.5      0.5      0.0
0.5      0.5      0.0
0.5      0.5      0.0
0.5      0.5      0.0
0.5      0.5      0.0
0.6      0.6      0.0
0.6      12.2     11.6
0.7      1.1      0.4
0.7      1.2      0.5
0.8      0.8      0.0
0.9      1.2      0.3
0.9      1.9      1.0
1.0      0.9     -0.1
1.0      2.0      1.0
1.6      8.1      6.5
2.0      3.7      1.7

We have often come across the idea that we should not use t distribution methods for small samples but should
instead use rank based methods. The statement is sometimes that we should not use t methods at all for
samples of fewer than six observations.4 But, as we noted, rank based methods cannot produce anything useful
for such small samples.

The aversion to parametric methods for small samples may arise from the inability to assess the distribution
shape when there are so few observations. How can we tell whether data follow a normal distribution if we have
only a few observations? The answer is that we have not only the data to be analysed, but usually also
experience of other sets of measurements of the same thing. In addition, general experience tells us that body
size measurements are usually approximately normal, as are the logarithms of many blood concentrations and
the square roots of counts.
Notes
Cite this as: BMJ 2009;338:a3166

Footnotes
We thank Jonathan Cowley for the data in table 1.

Competing interests: None declared.

Provenance and peer review: Commissioned, not externally peer reviewed.

References
1. Altman DG, Bland JM. Parametric v non-parametric methods for data analysis. BMJ 2009;338:a3167.
2. Baker CJ, Kasper DL, Edwards MS, Schiffman G. Influence of preimmunization antibody levels on the specificity of the
immune response to related polysaccharide antigens. N Engl J Med1980;303:173-8.
3. Altman DG. Practical statistics for medical research. London: Chapman and Hall, 1991: 224-5.
4. Siegel S. Nonparametric statistics for the behavioral sciences. 1st ed. Tokyo: McGraw-Hill Kogakusha, 1956:32.

Research Methods & Reporting Statistics Notes

Correlation in restricted ranges of data


BMJ 2011; 342 doi: https://doi.org/10.1136/bmj.d556 (Published 11 March 2011) Cite this as: BMJ 2011;342:d556

J Martin Bland, professor of health statistics 1, Douglas G Altman, professor of statistics in medicine 2
1 Department of Health Sciences, University of York, York YO10 5DD
2 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD

Correspondence to: Professor M Bland martin.bland@york.ac.uk

In a study of 150 adult diabetic patients there was a strong correlation between abdominal circumference and
body mass index (BMI) (r = 0.85).1 The authors went on to report that the correlation differed in different BMI
categories as shown in the table.

BMI group      r
<25            0.62
25 to 30       0.50
30 to 35       0.09
>35            0.86
All patients   0.85

Correlation between abdominal circumference and body mass index (BMI) in 1450 adult patients with diabetes
The authors' interpretation of these data was that in patients with low or high BMI values (BMI <25 kg/m² and
BMI >35 kg/m²) the correlation was strong, but in those with BMI values between 25 and 35 kg/m² the correlation
was weak or missing. They concluded that measuring abdominal circumference is of particular importance in
subjects with the most frequent BMI category (25 to 35 kg/m²).

When we restrict the range of one of the variables, a correlation coefficient will be reduced. For example, fig 1
shows some BMI and abdominal circumference measurements from a different population. Although these
people are from a rather thinner population, the correlation coefficient is very similar, r = 0.82 (P<0.0001). When
we divide the sample into the same four restricted ranges of BMI at 20, 25, and 30 kg/m², the correlation
coefficient in each interval is smaller than the correlation coefficient for the whole sample. This phenomenon is
to be expected; it is a result of restricting the range of data, not any particular property of BMI and abdominal
circumference.

Figure 1 BMI and abdominal circumference in 202 men and women, with correlation coefficients in four restricted
ranges and overall

One interpretation of the correlation coefficient r is that r² is the proportion of the variation in abdominal
circumference explained or predicted by the variation in BMI. If we restrict the range of BMI values we reduce the
variation in BMI, which will explain less variation in abdominal circumference, and r will fall. If we further reduce
the variation in BMI until all remaining patients have the same BMI, then we cannot explain any variation in
abdominal circumference and the correlation must be zero. (By contrast within any of the sections of fig 1 the
fitted regression line would be the same, apart from random variation.)
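
The effect of restricting the range can be illustrated with simulated data. The sketch below does not use the data in fig 1; it assumes a standardised bivariate normal population with correlation 0.82, a sample of 5000, and cut points at -1, 0, and 1 on the x scale, all of which are arbitrary choices for illustration. Within each band the correlation coefficient is smaller than in the whole sample, while the regression slope stays close to its overall value.

```python
import random
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r_and_slope(xs, ys):
    """Return the correlation coefficient and the least squares slope of y on x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy), sxy / sxx


rng = random.Random(1)      # fixed seed so the run is repeatable
rho = 0.82                  # assumed population correlation
n = 5000                    # assumed sample size
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rho * xi + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]

r_all, slope_all = pearson_r_and_slope(x, y)

# Split the sample into four restricted ranges of x.
bands = [(float("-inf"), -1.0), (-1.0, 0.0), (0.0, 1.0), (1.0, float("inf"))]
band_results = []
for low, high in bands:
    pairs = [(xi, yi) for xi, yi in zip(x, y) if low <= xi < high]
    xs, ys = zip(*pairs)
    band_results.append(pearson_r_and_slope(xs, ys))

for (low, high), (r, slope) in zip(bands, band_results):
    print(f"x in [{low}, {high}): r = {r:.2f}, slope = {slope:.2f}")
print(f"whole sample: r = {r_all:.2f}, slope = {slope_all:.2f}")
```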

For another example, fig 2 shows the weights and heights of the same sample, with different symbols for men
and women. Clearly, the lower end of the height range for men is higher than the lower end of the range for
women, but the upper ends of the ranges are very similar. The men's heights (SD 6.0 cm) are less variable than
those of the women (SD 8.9 cm) or the heights of both sexes combined (also SD 8.9 cm). The correlation
coefficients for women and for both men and women are very similar and considerably larger than that for men
alone.

Figure 2 Weight and height in 202 men and women, with correlation coefficients

The same phenomenon can arise when the sample is restricted using another variable related to the ones being
studied. For example, the correlation between weight and height of schoolchildren will increase as the age range
is increased. But a spurious correlation may also be seen in such a situation, for example between shoe size and
spelling ability.2 Such an example illustrates the well worn phrase that an observed association does not imply
causation.

Correlation coefficients are a property of the variables and also the population in which they are measured. If we
look at a restricted population, we should not conclude that there is little or no relation between the variables
because the correlation coefficient is small. But given a clear relation in the whole group, we see no point in
looking within categories of one of the variables. In any case, regression is generally the preferred approach to
considering the relation between two continuous variables.

Notes
Cite this as: BMJ 2011;342:d556

Footnotes
Acknowledgements: The data are taken from a student elective project by Dr Malcolm Savage.

Contributors: JMB and DGA jointly wrote and agreed the text, JMB did the statistical analysis.

Competing interests: All authors have completed the Unified Competing Interest form at
www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no
support from any organisation for the submitted work; no financial relationships with any organisations that
might have an interest in the submitted work in the previous 3 years; no other relationships or activities that
could appear to have influenced the submitted work.

References
1. Nádas J, Putz Z, Kolev G, Nagy S, Jermendy G. Intraobserver and interobserver variability of measuring waist
circumference. Med Sci Monit 2008;14:CR15-8.
2. Goodwin LD, Leech NL. Understanding correlation: factors that affect the size of r. J Exp Educ 2006;74:251-66.
BMJ 2011;342:d561 doi: 10.1136/bmj.d561 Page 1 of 3

Research Methods & Reporting


STATISTICS NOTES

Comparisons within randomised groups can be very misleading

J Martin Bland, professor of health statistics 1, Douglas G Altman, professor of statistics in medicine 2

1 Department of Health Sciences, University of York, York YO10 5DD
2 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD

When we randomise trial participants into two or more intervention groups, we do this to remove bias; the groups
will, on average, be comparable in every respect except the treatment which they receive. Provided the trial is
well conducted, without other sources of bias, any difference in the outcome of the groups can then reasonably be
attributed to the different interventions received. In a previous note we discussed the analysis of those trials
in which the primary outcome measure is also measured at baseline. We discussed several valid analyses, observing
that analysis of covariance (a regression method) is the method of choice.1

Rather than comparing the randomised groups directly, however, researchers sometimes look at the change in the
measurement between baseline and the end of the trial; they test whether there was a significant change from
baseline, separately in each randomised group. They may then report that this difference is significant in one
group but not in the other, and conclude that this is evidence that the groups, and hence the treatments, are
different. One such example was a recent trial in which participants were randomised to receive either an
anti-ageing cream or the vehicle as a placebo.2 A wrinkle score was recorded at baseline and after six months.
The authors gave the results of significance tests comparing the score with baseline for each group separately,
reporting the active treatment group to have a significant difference (P=0.013) and the vehicle group not
(P=0.11). Their interpretation was that the cosmetic cream resulted in significant clinical improvement in facial
wrinkles. But we cannot validly draw this conclusion, because the lack of a significant difference in the vehicle
group does not provide good evidence that the anti-ageing product is superior.3

The essential feature of a randomised trial is the comparison between groups. Within group analyses do not
address a meaningful question: the question is not whether there is a change from baseline, but whether any
change is greater in one group than the other. It is not possible to draw valid inferences by comparing P values.
In particular, there is an inflated risk of a false positive result, which we shall illustrate with a simulation.

The table shows simulated data for a randomised trial with two groups of 30 participants. Data were drawn from
the same population, so there is no systematic difference between the two groups. The true baseline measurements
had a mean of 10.0 with standard deviation (SD) 2.0, and the outcome measurement was equal to the baseline plus
an increase of 0.5 and a random element with SD 1.0. The difference between mean outcomes is -0.22 (95%
confidence interval -0.75 to 0.34, P=0.5), adjusting for the baseline by analysis of covariance.1 The difference
is not statistically significant, which is not surprising because we know that the null hypothesis of no
difference in the population is true. If we compare baseline with outcome for each group using a paired t test,
however, for group A the difference is statistically significant, P=0.03, for group B it is not significant,
P=0.2. These results are quite similar to those of the anti-ageing cream trial.2

We would not wish to draw any conclusions from one simulation. In 1000 runs, the difference between groups had
P<0.05 in the analysis of covariance 47 times, or for 4.7% of samples, very close to the 5% we expect. Of the
2000 comparisons between baseline and outcome, 1500 (75%) had P<0.05. In this simulation, where there is no
difference whatsoever between the two treatments, the probability of a significant difference in one group but
not the other was 38%, not 5%. Hence a significant difference in one group but not the other is not good evidence
of a significant difference between the groups. Even when there is a clear benefit of one treatment over the
other, separate P values are not the way to analyse such studies.4

How many pairs of tests will have one significant and one non-significant difference depends on the size of the
change from baseline to final measurement. If the population difference from baseline is very large, nearly all
the within group tests will be significant, and if the population difference is small, nearly all tests will be
not significant, so there will be few samples with only one significant difference. If the difference is such
that half the samples would show a significant change from baseline, as it would be in our simulation if the
underlying difference were 0.37 rather than 0.5, we would expect 50% of samples to have just one significant
difference.

The anti-ageing trial is not the only one where we have seen this misleading approach applied to randomised trial
data.3 We even found it once in the BMJ!5

Contributors: JMB and DGA jointly wrote and agreed the text, JMB did the statistical analysis.

Competing interests: All authors have completed the Unified Competing Interest form at
www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support
from any organisation for the submitted work; no financial relationships with any organisations that might have
an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear
to have influenced the submitted work.

Correspondence to: Professor M Bland martin.bland@york.ac.uk

Reprints: http://journals.bmj.com/cgi/reprintform Subscribe: http://resources.bmj.com/bmj/subscribers/how-to-subscribe

1 Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow-up measurements. BMJ
2001;323:1123-4.
2 Watson REB, Ogden S, Cotterell LF, Bowden JJ, Bastrilles JY, Long SP, et al. A cosmetic anti-ageing product
improves photoaged skin: a double-blind, randomized controlled trial. Br J Dermatol 2009;161:419-26.
3 Bland JM. Evidence for an anti-ageing product may not be so clear as it appears. Br J Dermatol
2009;161:1207-8.
4 Altman DG, Bland JM. Interaction revisited: the difference between two estimates. BMJ 2003;326:219.
5 Bland JM, Altman DG. Informed consent. BMJ 1993;306:928.

Cite this as: BMJ 2011;342:d561


Table

Table 1| Simulated data from a randomised trial comparing two groups of 30 patients, with a true change from baseline but no difference
between groups (sorted by baseline values within each group)

Group A Group B
Baseline 6 months Change Baseline 6 months Change
1 6.4 7.1 0.7 1 6.8 7.9 1.1
2 6.6 5.6 -1.0 2 7.2 7.5 0.3
3 7.3 8.3 1.0 3 7.2 6.9 -0.3
4 7.7 9.1 1.4 4 7.4 6.9 -0.5
5 7.7 9.5 1.8 5 7.5 8.3 0.8
6 7.9 9.6 1.7 6 7.5 9.4 1.9
7 8.0 8.5 0.5 7 8.3 9.0 0.7
8 8.0 8.5 0.5 8 8.4 8.8 0.4
9 8.1 9.1 1.0 9 8.7 8.0 -0.7
10 9.2 9.6 0.4 10 9.0 7.2 -1.8
11 9.3 8.7 -0.6 11 9.2 7.1 -2.1
12 9.6 10.7 1.1 12 9.6 10.6 1.0
13 9.7 9.0 -0.7 13 9.9 11.0 1.1
14 9.8 9.0 -0.8 14 10.1 11.5 1.4
15 9.8 8.0 -1.8 15 10.2 10.4 0.2
16 10.2 11.1 0.9 16 10.3 11.0 0.7
17 10.3 11.5 1.2 17 10.4 9.9 -0.5
18 10.6 9.1 -1.5 18 10.5 11.3 0.8
19 10.6 12.0 1.4 19 10.7 9.9 -0.8
20 10.7 13.2 2.5 20 10.8 10.7 -0.1
21 10.9 9.7 -1.2 21 10.8 11.8 1.0
22 11.1 12.2 1.1 22 11.1 10.0 -1.1
23 11.2 10.8 -0.4 23 11.1 13.2 2.1
24 11.8 11.9 0.1 24 11.4 11.8 0.4
25 12.3 12.2 -0.1 25 11.6 12.1 0.5
26 12.4 12.6 0.2 26 11.7 11.5 -0.2
27 13.1 15.0 1.9 27 12.0 12.7 0.7
28 13.2 13.8 0.6 28 12.3 13.7 1.4
29 13.3 14.1 0.8 29 13.7 12.6 -1.1
30 13.7 14.2 0.5 30 13.9 13.7 -0.2
Mean 10.02 10.46 0.44 Mean 9.98 10.21 0.24
SD 2.06 2.29 1.06 SD 1.90 2.09 1.02
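
The within group part of the simulation described in this note can be sketched as below. This is not the authors' program: the seed, the function names, and the use of a fixed critical value of 2.045 (the two sided 5% point of the t distribution with 29 degrees of freedom) are assumptions made to keep the example within the Python standard library, and the analysis of covariance comparison between groups is omitted. Each run draws two groups of 30 from the same population (baseline mean 10.0, SD 2.0; outcome equal to baseline plus 0.5 plus a random element with SD 1.0) and applies a paired t test within each group.

```python
import random
from math import sqrt

T_CRIT_29 = 2.045  # two sided 5% critical value of t with 29 degrees of freedom


def paired_t_significant(baseline, outcome, t_crit=T_CRIT_29):
    """Return True if the paired t test of outcome against baseline has P<0.05."""
    diffs = [o - b for b, o in zip(baseline, outcome)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return abs(mean / (sd / sqrt(n))) > t_crit


def simulate_group(rng, n=30, change=0.5):
    """Draw baseline and outcome values for one group under the stated model."""
    baseline = [rng.gauss(10.0, 2.0) for _ in range(n)]
    outcome = [b + change + rng.gauss(0.0, 1.0) for b in baseline]
    return baseline, outcome


def run_simulation(n_trials=1000, change=0.5, seed=2011):
    """Return the proportion of significant within group tests and the
    proportion of trials in which exactly one group was significant."""
    rng = random.Random(seed)
    n_significant = 0
    n_one_only = 0
    for _ in range(n_trials):
        sig_a = paired_t_significant(*simulate_group(rng, change=change))
        sig_b = paired_t_significant(*simulate_group(rng, change=change))
        n_significant += sig_a + sig_b
        n_one_only += sig_a != sig_b
    return n_significant / (2 * n_trials), n_one_only / n_trials


prop_significant, prop_one_only = run_simulation()
print(f"within group tests with P<0.05: {prop_significant:.0%}")
print(f"trials with exactly one significant group: {prop_one_only:.0%}")
```

The exact proportions depend on the seed, but they should be close to the 75% and 38% reported in the text, since the chance of exactly one significant result out of two independent tests with power 0.75 is 2 x 0.75 x 0.25.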



BMJ 2011;343:d2090 doi: 10.1136/bmj.d2090 (Published 8 August 2011) Page 1 of 2

Research Methods & Reporting


How to obtain the confidence interval from a P value


Douglas G Altman, professor of statistics in medicine 1, J Martin Bland, professor of health statistics 2
1 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD
2 Department of Health Sciences, University of York, Heslington, York YO10 5DD

Confidence intervals (CIs) are widely used in reporting statistical analyses of research data, and are usually
considered to be more informative than P values from significance tests.1 2 Some published articles, however,
report estimated effects and P values, but do not give CIs (a practice BMJ now strongly discourages). Here we
show how to obtain the confidence interval when only the observed effect and the P value were reported.

The method is outlined in the box below in which we have distinguished two cases.

(a) Calculating the confidence interval for a difference

We consider first the analysis comparing two proportions or two means, such as in a randomised trial with a
binary outcome or a measurement such as blood pressure.

For example, the abstract of a report of a randomised trial included the statement that "more patients in the
zinc group than in the control group recovered by two days (49% v 32%, P=0.032)."5 The difference in proportions
was Est = 17 percentage points, but what is the 95% confidence interval (CI)?

Following the steps in the box we calculate the CI as follows:
$z = -0.862 + \sqrt{0.743 - 2.404 \times \log(0.032)} = 2.141$;
$SE = 17/2.141 = 7.940$, so that $1.96 \times SE = 15.56$ percentage points;
95% CI is $17.0 - 15.56$ to $17.0 + 15.56$, or 1.4 to 32.6 percentage points.

(b) Calculating the confidence interval for a ratio (log transformation needed)

The calculation is trickier for ratio measures, such as risk ratio, odds ratio, and hazard ratio. We need to log
transform the estimate and then reverse the procedure, as described in a previous Statistics Note.6

For example, the abstract of a report of a cohort study includes the statement that "In those with a [diastolic
blood pressure] reading of 95-99 mm Hg the relative risk was 0.30 (P=0.034)."7 What is the confidence interval
around 0.30?

Following the steps in the box we calculate the CI as follows:
$z = -0.862 + \sqrt{0.743 - 2.404 \times \log(0.034)} = 2.117$;
$Est = \log(0.30) = -1.204$;
$SE = -1.204/2.117 = -0.569$ but we ignore the minus sign, so $SE = 0.569$, and $1.96 \times SE = 1.115$;
95% CI on log scale = $-1.204 - 1.115$ to $-1.204 + 1.115$ = $-2.319$ to $-0.089$;
95% CI on natural scale = $\exp(-2.319) = 0.10$ to $\exp(-0.089) = 0.91$.
Hence the relative risk is estimated to be 0.30 with 95% CI 0.10 to 0.91.

Limitations of the method

The methods described can be applied in a wide range of settings, including the results from meta-analysis and
regression analyses. The main context where they are not correct is in small samples where the outcome is
continuous and the analysis has been done by a t test or analysis of variance, or the outcome is dichotomous and
an exact method has been used for the confidence interval. However, even here the methods will be approximately
correct in larger studies with, say, 60 patients or more.

P values presented as inequalities

Sometimes P values are very small and so are presented as P<0.0001 or something similar. The above method can be
applied for small P values, setting P equal to the value it is less than, but the z statistic will be too small,
hence the standard error will be too large and the resulting CI will be too wide. This is not a problem so long
as we remember that the estimate is better than the interval suggests.

When we are told that P>0.05 or the difference is not significant, things are more difficult. If we apply the
method described here, using P=0.05, the confidence interval will be too narrow. We must remember that the
estimate is even poorer than the confidence interval calculated would suggest.

Correspondence to: D G Altman doug.altman@csm.ox.ac.uk

For personal use only: See rights and reprints http://www.bmj.com/permissions Subscribe: http://www.bmj.com/subscribe

Steps to obtain the confidence interval (CI) for an estimate of effect from the P value and the estimate (Est)

(a) CI for a difference
1 calculate the test statistic for a normal distribution test, z, from P (reference 3): $z = -0.862 + \sqrt{0.743 - 2.404 \times \log(P)}$
2 calculate the standard error: $SE = Est/z$ (ignoring minus signs)
3 calculate the 95% CI: $Est - 1.96 \times SE$ to $Est + 1.96 \times SE$.

(b) CI for a ratio
For a ratio measure, such as a risk ratio, the above formulas should be used with the estimate Est on the log scale (eg, the log risk ratio). Step
3 gives a CI on the log scale; to derive the CI on the natural scale we need to exponentiate (antilog) Est and its CI.4

Notes
All P values are two sided.
All logarithms are natural (ie, to base e).4
For a 90% CI, we replace 1.96 by 1.65; for a 99% CI we use 2.57.
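
The steps in the box can be sketched in a few lines of Python. The function names `z_from_p` and `ci_from_p` and the `ratio` and `multiplier` arguments are illustrative choices, not part of the note; the formula for z is the approximation given in step 1. The two calls reproduce the worked examples for the zinc trial and the relative risk of 0.30.

```python
from math import exp, log, sqrt


def z_from_p(p):
    """Approximate the normal test statistic z from a two sided P value (step 1)."""
    return -0.862 + sqrt(0.743 - 2.404 * log(p))


def ci_from_p(est, p, ratio=False, multiplier=1.96):
    """Return (lower, upper) confidence limits from an estimate and its P value.

    For ratio measures set ratio=True: the calculation is done on the log
    scale and the limits are transformed back. Use multiplier=1.65 for a
    90% CI or 2.57 for a 99% CI.
    """
    if ratio:
        est = log(est)
    se = abs(est / z_from_p(p))          # step 2, ignoring minus signs
    lower = est - multiplier * se        # step 3
    upper = est + multiplier * se
    if ratio:
        return exp(lower), exp(upper)
    return lower, upper


# Difference of 17 percentage points with P=0.032.
diff_ci = ci_from_p(17.0, 0.032)
# Relative risk of 0.30 with P=0.034.
rr_ci = ci_from_p(0.30, 0.034, ratio=True)

print("difference: %.1f to %.1f" % diff_ci)
print("relative risk: %.2f to %.2f" % rr_ci)
```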

1 Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing.
BMJ 1986;292:746-50.
2 Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, et al. CONSORT 2010. Explanation and
Elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c869.
3 Lin J-T. Approximating the normal tail probability and its inverse for use on a pocket calculator. Appl Stat
1989;38:69-70.
4 Bland JM, Altman DG. Statistics Notes. Logarithms. BMJ 1996;312:700.
5 Roy SK, Hossain MJ, Khatun W, Chakraborty B, Chowdhury S, Begum A, et al. Zinc supplementation in children with
cholera in Bangladesh: randomised controlled trial. BMJ 2008;336:266-8.
6 Altman DG, Bland JM. Interaction revisited: the difference between two estimates. BMJ 2003;326:219.
7 Lindblad U, Råstam L, Rydén L, Ranstam J, Isacsson S-O, Berglund G. Control of blood pressure and risk of first
acute myocardial infarction: Skaraborg hypertension project. BMJ 1994;308:681.

Cite this as: BMJ 2011;343:d2090
BMJ Publishing Group Ltd 2011

BMJ 2011;343:d2304 doi: 10.1136/bmj.d2304 Page 1 of 2

Research Methods & Reporting


STATISTICS NOTES

How to obtain the P value from a confidence interval


Douglas G Altman, professor of statistics in medicine 1, J Martin Bland, professor of health statistics 2
1 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD
2 Department of Health Sciences, University of York, Heslington, York YO10 5DD

We have shown in a previous Statistics Note1 how we can calculate a confidence interval (CI) from a P value. Some
published articles report confidence intervals, but do not give corresponding P values. Here we show how a
confidence interval can be used to calculate a P value, should this be required. This might also be useful when
the P value is given only imprecisely (eg, as P<0.05). Wherever they can be calculated, we are advocates of
confidence intervals as much more useful than P values, but we like to be helpful.

The method is outlined in the box below in which we have distinguished two cases.

(a) P from CI for a difference (no transformation needed)

The simple case is when we have a CI for the difference between two means or two proportions. For example,
participants in a trial received antihypertensive treatment with or without pravastatin. The authors report that
pravastatin performed slightly worse than a placebo. The estimated difference between group means was 1.9 (95%
CI -0.6 to 4.3) mm Hg.4 What was the P value?

Following the steps in the box above we calculate P as follows:
$SE = [4.3 - (-0.6)]/(2 \times 1.96) = 1.25$;
$z = 1.9/1.25 = 1.52$;
$P = \exp(-0.717 \times 1.52 - 0.416 \times 1.52^2) = 0.13$.

In this paper, the authors did indeed publish a P value of 0.13,4 as we have estimated from their confidence
interval.

(b) CI for a ratio (log transformation needed)

The calculation is trickier for ratio measures, such as risk ratio, odds ratio, and hazard ratio. We need to log
transform the estimate and confidence limits, so that Est, l, and u in the box are the logarithms of the
published values.

For example, in a meta-analysis of several studies comparing single versus bilateral mammary artery coronary
bypass grafts Taggart et al presented a hazard ratio of 0.81; 95% CI 0.70 to 0.94.5 They did not quote the P
value.

Following the steps in the box we calculate P as follows:
$Est = \log(0.81) = -0.211$;
$l = \log(0.70) = -0.357$, $u = \log(0.94) = -0.062$;
$SE = [-0.062 - (-0.357)]/(2 \times 1.96) = 0.0753$;
$z = -0.211/0.0753 = -2.802$. We take the positive value of z, 2.802;
$P = \exp(-0.717 \times 2.802 - 0.416 \times 2.802^2) = 0.005$.

Limitations of the method

The formula for P is unreliable for very small P values and if your P value is smaller than 0.0001, just report
it as P<0.0001.

The methods described can be applied in a wide range of settings, including the results from meta-analysis and
regression analyses. The main context where they are not correct is small samples where the outcome is continuous
and the analysis has been done by a t test or analysis of variance, or the outcome is dichotomous and an exact
method has been used for the confidence interval. However, even here the methods will be approximately correct in
larger studies with, say, 60 patients or more.

Contributors: JMB and DGA jointly wrote and agreed the text.

Competing interests: All authors have completed the Unified Competing Interest form at
www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support
from any organisation for the submitted work; no financial relationships with any organisations that might have
an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear
to have influenced the submitted work.

1 Altman DG, Bland JM. How to obtain a confidence interval from a P value. BMJ 2011;342:d2090.
2 Lin J-T. Approximating the normal tail probability and its inverse for use on a pocket calculator. Appl Stat
1989;38:69-70.

Correspondence to: D G Altman doug.altman@csm.ox.ac.uk

Reprints: http://journals.bmj.com/cgi/reprintform Subscribe: http://resources.bmj.com/bmj/subscribers/how-to-subscribe


BMJ 2011;343:d2304 doi: 10.1136/bmj.d2304 Page 2 of 2

RESEARCH METHODS & REPORTING

Steps to obtain the P value from the CI for an estimate of effect (Est)

(a) P from CI for a difference


If the upper and lower limits of a 95% CI are u and l respectively:
1 calculate the standard error: SE = (u − l)/(2×1.96)
2 calculate the test statistic: z = Est/SE
3 calculate the P value2: P = exp(−0.717×z − 0.416×z²).

(b) P from CI for a ratio


For a ratio measure, such as a risk ratio, the above formulas should be used with the estimate Est and the confidence
limits on the log scale (eg, the log risk ratio and its CI).
Notes
All P values are two sided.
All logarithms are natural (ie, to base e).3
exp is the exponential function.
The formula for P works only for positive z, so if z is negative we remove the minus sign.
For a 90% CI, we replace 1.96 by 1.65; for a 99% CI we use 2.57.
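
The steps in the box can be sketched in a few lines of Python. This is a minimal illustration rather than a validated implementation: the function name `p_from_ci` and its `level` and `ratio` parameters are our own labels, not part of the published method. The constants are those given in the box and its notes.

```python
import math

# Multipliers for common confidence levels, as listed in the notes to the box
Z_FOR_LEVEL = {90: 1.65, 95: 1.96, 99: 2.57}


def p_from_ci(est, lower, upper, level=95, ratio=False):
    """Approximate two sided P value from an estimate and its confidence interval."""
    if ratio:
        # Ratio measures (risk, odds, or hazard ratios) are analysed on the
        # natural log scale, so transform the estimate and both limits first.
        est, lower, upper = math.log(est), math.log(lower), math.log(upper)
    se = (upper - lower) / (2 * Z_FOR_LEVEL[level])  # step 1
    z = abs(est / se)  # step 2; the formula works only for positive z
    return math.exp(-0.717 * z - 0.416 * z ** 2)  # step 3


# Difference with a 95% CI, using the pravastatin figures from the text
p = p_from_ci(1.9, -0.6, 4.3)
print(round(p, 2))
```

As the article notes, the approximation is unreliable for very small P values, so anything below 0.0001 is better reported simply as P<0.0001.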

3 Mancia G, Parati G, Revera M, Bilo G, Giuliano A, Veglia F, et al. Statins, antihypertensive treatment, and blood pressure control in clinic and over 24 hours: evidence from PHYLLIS randomised double blind trial. BMJ 2010;340:c1197.
4 Taggart DP, D'Amico R, Altman DG. Effect of arterial revascularisation on survival: a systematic review of studies comparing bilateral and single internal mammary arteries. Lancet 2001;358:870-5.
5 Bland JM, Altman DG. Logarithms. BMJ 1996;312:700.

Cite this as: BMJ 2011;343:d2304



BMJ 2011;343:d570 doi: 10.1136/bmj.d570 Page 1 of 2

RESEARCH METHODS & REPORTING

STATISTICS NOTES

Brackets (parentheses) in formulas


Douglas G Altman professor of statistics in medicine 1, J Martin Bland professor of health statistics 2

1 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD; 2 Department of Health Sciences, University of York, York YO10 5DD

Each year, new health sciences postgraduate students in York are given a simple maths test. Each year the majority of them fail to calculate 20 − 3 × 5 correctly. According to the conventional rules of arithmetic, division and multiplication are done before addition and subtraction, so 20 − 3 × 5 = 20 − 15 = 5. Many students work from left to right and calculate 20 − 3 × 5 as 17 × 5 = 85. If that was what was actually meant, we would need to use brackets: (20 − 3) × 5 = 17 × 5 = 85. Brackets tell us that the enclosed part must be evaluated first. That convention is part of various mnemonic acronyms which indicate the order of operations, such as BODMAS (Brackets, Of (that is, power of), Divide, Multiply, Add, Subtract) and PEMDAS (Parentheses, Exponentiation, Multiplication, Division, Addition, Subtraction).1

Schoolchildren learn the basic rules about how to construct and interpret mathematical formulas.1 The conventions exist to ensure that there is absolutely no ambiguity, as mathematics (unlike prose) has no redundancy, so any mistake may have serious consequences. Our experience is that mistakes are quite common when formulas are presented in medical journal articles. A particular concern is that brackets are often omitted or misused. The following examples are typical and we mean nothing personal by choosing them.

Example 1

In a discussion of methods for analysing diagnostic test accuracy, Collinson 2 wrote:

Sensitivity = TP/TP + FN

where TP = true positive and FN = false negative. The formula should, of course, be:

Sensitivity = TP/(TP + FN).

Example 2

For a non-statistical example, Leyland 3 wrote that the total optical power of the cornea is:

P = P1 + P2 − t/n2(P1P2)

where P1 = n2 − n1/r1 and P2 = n3 − n2/r2. Here n1, n2, and n3 are refractive indices, r1, r2, and t are distances in metres, and P, P1, and P2, are powers in dioptres. But he should have written P1 = (n2 − n1)/r1 and P2 = (n3 − n2)/r2. P1 = n2 − n1/r1 is clearly wrong dimensionally, as P1 is dioptres, 1/metre, n2 and n1 are ratios and so pure numbers, and r1 is in metres. Also, it is not clear whether t/n2 (P1P2) means (t/n2) P1P2, which it does, or t/(n2P1P2).

Do such errors matter? Certainly. In our experience the calculations are usually correct in the paper, but anyone using the published formula would go wrong. Sometimes, however, the incorrect formula was used, as in the following case.

Example 3

In their otherwise exemplary evaluation of the chronic ankle instability scale, Eechaute et al4 made a mistake in their formula for the minimal detectable change (MDC) or repeatability coefficient,5 writing: MDC = 2.04 × √(2 × SEM). Here SEM is the standard error of a measurement or within subject standard deviation.5 This formula uses 2.04 where 2 or 1.96 is more usual,6 but, much more seriously, the SEM should not be included within the square root, as the brackets indicate. This might be dismissed as a simple typographical error, but the authors actually used this incorrect formula. Their value of SEM was 2.7 points, so they calculated the minimal detectable change as 2.04 × √(2 × 2.7) = 4.7. They should have calculated 2.04 × √(2) × 2.7 = 7.8. Their erroneous formula makes the scale appear considerably more reliable than it actually is.6 The formula is also wrong in terms of dimensions, because the minimum clinical difference should be in the same units as the measurement, not in square root units.

Some mistakes in formulas may be present in a submitted manuscript, but others might be introduced in the publication process. For example, problems sometimes arise when a displayed formula is converted to an "in-text" formula as part of the editing, and the implications are not realised or not noticed by either editing staff or authors. Often it is necessary to insert brackets when reformatting a formula. So the simple formula:

$$\frac{p}{1 - p}$$

should be changed to p/(1 − p) if moved to the text.

Formulas in published articles may be used by others, so mistakes may lead to substantive errors in research. It is essential that authors and editors check all formulas carefully.

Correspondence to: DG Altman doug.altman@csm.ox.ac.uk

Acknowledgements: We are very grateful to Phil McShane for pointing out a mistake in an earlier version of this statistics note.

Contributors: DGA and JMB jointly wrote and agreed the text.

Competing interests: All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.

1 Wikipedia. Order of operations. [cited 2010 Nov 23]. http://en.wikipedia.org/wiki/Order_of_operations
2 Collinson P. Of bombers, radiologists, and cardiologists: time to ROC. Heart 1998;80:215-7.
3 Leyland M. Validation of Orbscan II posterior corneal curvature measurement for intraocular lens power calculation. Eye 2004;18:357-60.
4 Eechaute C, Vaes P, Duquet W. The chronic ankle instability scale: Clinimetric properties of a multidimensional, patient-assessed instrument. Phys Ther Sport 2008;9:57-66.
5 Bland JM, Altman DG. Statistics Notes. Measurement error. BMJ 1996;312:744.
6 Bland JM. Minimal detectable change. Phys Ther Sport 2009;10:39.

Cite this as: BMJ 2011;343:d570
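
The order of operations examples in this note can be checked numerically. The sketch below is only an illustration: Python follows the same precedence conventions as ordinary arithmetic, the function names are our own, and the counts TP = 90 and FN = 10 are hypothetical values chosen to make the difference obvious. The SEM of 2.7 and the multiplier 2.04 are taken from Example 3.

```python
import math

# Multiplication is done before subtraction, so no brackets are needed here
assert 20 - 3 * 5 == 5
# Brackets force the subtraction to be evaluated first
assert (20 - 3) * 5 == 85


def sensitivity_as_printed(tp, fn):
    # TP/TP + FN: the division is done first, so this is 1 + FN
    return tp / tp + fn


def sensitivity(tp, fn):
    # TP/(TP + FN): the proportion of true cases that test positive
    return tp / (tp + fn)


def mdc_as_printed(sem, multiplier=2.04):
    # multiplier × √(2 × SEM): SEM wrongly placed inside the square root
    return multiplier * math.sqrt(2 * sem)


def mdc(sem, multiplier=2.04):
    # multiplier × √2 × SEM: the result stays in the units of the measurement
    return multiplier * math.sqrt(2) * sem


print(sensitivity_as_printed(90, 10), sensitivity(90, 10))
print(round(mdc_as_printed(2.7), 1), round(mdc(2.7), 1))
```

With the hypothetical counts, the formula as printed returns a "sensitivity" above 1, which is impossible for a proportion; that kind of sanity check is often the quickest way to spot a missing bracket.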



BMJ 2013;346:f3438 doi: 10.1136/bmj.f3438 (Published 6 June 2013) Page 1 of 4

RESEARCH METHODS & REPORTING

Statistics Notes: Missing outcomes in randomised trials

Andrew J Vickers attending research methodologist 1, Douglas G Altman professor of statistics in medicine 2

1 Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA; 2 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD, UK

In most randomised trials, some patients fail to provide data for study endpoints.1 We have previously described the analysis of a trial of acupuncture versus sham acupuncture for the treatment of shoulder pain.2 All 52 randomised patients provided baseline data on pain and range of motion, but only 45 returned for follow-up testing. The statistical question is how to handle those seven patients with missing data. The most straightforward approach is simply to ignore the seven patients and do what is known as an "available case analysis" (often confusingly known as "complete case analysis"). As not all randomised patients are included in the analysis, this leads to reduced statistical power.1

A method that attempts to include all randomised patients is "last observation carried forward," in which the last measurement obtained from the patient is used for all data points that were subsequently missed. This method is attractive because it is simple, but it has little else to recommend it. Substituting a missing data point with a value is known as "imputation,"1 and the data analyst needs a clear rationale for the type of imputation used. That a patient's responses would remain the same after drop-out is generally implausible. This is most obvious in chronic degenerative diseases. For instance, cognitive function scores decrease over time in dementia, so last observation carried forward gives overoptimistic scores for patients who drop out (figure). If a treatment was associated with toxicity, and this led to earlier drop-out than in the control group, the method would give results biased in favour of the experimental arm.3 By contrast, shoulder pain generally gets better over time, either because treatment is effective or because of the placebo effect and regression to the mean.4 In the randomised trial, patients in the control group improved by a mean of 9.8 points out of 100 from baseline to post-treatment follow-up, whereas patients who received acupuncture improved by 21.5 points. So assuming that patients lost to follow-up experienced precisely zero change in pain scores makes little sense. Last observation carried forward may also underestimate the standard deviation of the endpoint, especially in cases in which the last observation is the baseline, leading to confidence intervals that are too narrow.

A more sophisticated approach to missing data is known as multiple imputation, which uses a regression model to predict missing values.5 In randomised trials, the strongest predictors of future outcome are often the scores provided by the patient so far, but other variables can be included. To avoid underestimating the width of the confidence interval, multiple imputation involves a form of random sampling. For a given patient with a missing outcome, regression is used to predict the mean value of the missing outcome for similar patients and also the variability around the mean; a value is then selected at random from this distribution. The results from several imputations (hence "multiple") are combined using a method known as Rubin's rules.5 6 Multiple imputation is widely believed to be the preferred approach to missing data, not just for randomised trials.7 It is computationally complex, however, and needs to be implemented by special software, such as the ice command in Stata (see www.multiple-imputation.com).

The table shows the results of the shoulder pain study analysed by each method. The estimates for available case and multiple imputation do not differ much, although multiple imputation has a slightly narrower confidence interval. Last observation carried forward appears to be biased (it underestimates the effects of acupuncture) and gives a confidence interval that is too narrow.

Multiple imputation works best when good predictors of outcome are available. In the shoulder pain example, baseline score was only moderately correlated with follow-up score (r≈0.4). Had outcome been assessed halfway through treatment, this measure would have been more highly correlated with post-treatment score, markedly improving the properties of the multiple imputation.

Multiple imputation has several important strengths, but it does not adjust for the sort of bias created if patients were less likely to return for follow-up if they were in a lot of pain; this is an inherent limitation to missing data analysis. We cannot know whether patients' pain levels affect the chance that they will complete a pain questionnaire because, obviously enough, we do not have the pain scores of non-respondents.
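
The combining step of multiple imputation, Rubin's rules, is simple enough to sketch. The Python below is an illustration only, not the method used for the shoulder pain trial: the function name `rubin_pool` and the example numbers are hypothetical, and the sketch covers just the pooled estimate and its standard error, leaving out the imputation model itself and the degrees of freedom adjustment used for confidence intervals.

```python
import math
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Combine results from m imputed datasets using Rubin's rules.

    estimates: the effect estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    """
    m = len(estimates)
    pooled = mean(estimates)       # pooled estimate is the simple average
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical estimates and squared standard errors from five imputations
est, se = rubin_pool([14.1, 14.6, 13.9, 14.5, 14.4], [19.4, 20.1, 19.8, 19.6, 20.3])
print(round(est, 2), round(se, 2))
```

The between-imputation term is what keeps the standard error from being too small: it carries the extra uncertainty that comes from not knowing the missing values.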

Correspondence to: A Vickers vickersa@mskcc.org

For personal use only: See rights and reprints http://www.bmj.com/permissions Subscribe: http://www.bmj.com/subscribe

Sometimes simple common sense is more important than complex statistics. In the shoulder pain trial, three of the seven drop-outs were in the acupuncture group and four were controls, so it seems implausible that their omission had materially affected the results of the trial. If drop-out rates were very different between the two arms of a trial, that may raise concerns about bias. Above all, analysis of missing data teaches us the importance of avoiding missing data in the first place: an informed guess, even using a technique as sophisticated as multiple imputation, is still a guess.

Contributors: AJV and DGA jointly wrote and agreed the text.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Provenance and peer review: Not commissioned; not externally peer reviewed.

1 Altman DG, Bland JM. Missing data. BMJ 2007;334:424.
2 Vickers AJ, Altman DG. Statistics notes: analysing controlled trials with baseline and follow up measurements. BMJ 2001;323:1123-4.
3 Molnar FJ, Man-Son-Hing M, Hutton B, Fergusson DA. Have last-observation-carried-forward analyses caused us to favour more toxic dementia therapies over less toxic alternatives? A systematic review. Open Med 2009;3:e31-50.
4 Bland JM, Altman DG. Regression towards the mean. BMJ 1994;308:1499.
5 Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393.
6 White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med 2011;30:377-99.
7 White IR, Horton NJ, Carpenter J, Pocock SJ. Strategy for intention to treat analysis in randomised trials with missing outcome data. BMJ 2011;342:d40.

Cite this as: BMJ 2013;346:f3438
© BMJ Publishing Group Ltd 2013


Table 1 | Analysis of shoulder pain trial with three statistical methods

Analysis                             Effect of acupuncture: difference in points (95% CI)    Standard error    P value
Available case                       14.2 (5.1 to 23.4)                                      4.53              0.003
Last observation carried forward*    12.6 (3.9 to 21.3)                                      4.33              0.005
Multiple imputation†                 14.3 (5.4 to 23.3)                                      4.45              0.002

*Missing final value replaced by baseline value (n=7).
†Using baseline score and treatment group.


Figure

Function scores over time for patient with chronic degenerative disease

BMJ 2014;349:g7064 doi: 10.1136/bmj.g7064 (Published 25 November 2014) Page 1 of 2

RESEARCH METHODS & REPORTING

STATISTICS NOTES

Uncertainty and sampling error


Douglas G Altman professor of statistics in medicine 1, J Martin Bland professor of health statistics 2

1 Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford OX3 7LD, UK; 2 Department of Health Sciences, University of York, York YO10 5DD, UK

Medical research is conducted to help to reduce uncertainty. For example, randomised controlled trials aim to answer questions relating to treatment choices for a particular group of patients. Rarely, however, does a single study remove uncertainty. There are two reasons for this: sampling error and other (non-sampling) sources of uncertainty. The word "error" comes from a Latin root meaning "to wander," and we use it in its statistical sense of meaning variation from the average, not "mistake." Sampling error arises because any sample may not behave quite the same as the larger population from which it was drawn. Non-sampling error arises from the many ways a research study may deviate from addressing the question that the researcher wants to answer.

Sampling error is very much the concern of the statistician, who imagines that the group of people in the study is just one of the many possible samples from the population of interest. Despite it being widely condemned,1 the dominant way of summarising the evidence from a research study is by the P value. It should be obvious that the evidence from a research study cannot reasonably be summarised as just a single number, but the use of P values remains unshakeable. Further, the practice of labelling P values as significant or not significant leads not only to dichotomous decisions but often also to the belief that the research question has been answered.

P values represent the probability that the observed data (or a more extreme result) could have arisen when the true effect of interest is zero, for example, the true treatment effect in a randomised trial. It is common to interpret P<0.05 ("significant") as clear evidence that there is a real effect, and P>0.05 ("not significant") as evidence that there is no effect. However, the former interpretation may be unwise, and the latter is wrong. Although 0.05 is the conventional decision point, P<0.05 is far from representing certainty. One in 20 studies could have a difference of the observed size if there were really no difference in the population. "Not significant" indicates that we found insufficient evidence to conclude that there is a real effect, not that we have shown that there is not one.2 Referring to results as "statistically significant," or not, only helps a bit.

Interpretation of a study's results should be primarily based on the estimated effect and a measure of its uncertainty. In mainstream statistics, the uncertainty of estimates is indicated by the use of confidence intervals. Before the mid-1980s, confidence intervals were rarely seen in clinical research articles. Around 1986 things changed,3 and these days almost all clinical research articles in major journals include confidence intervals. The confidence interval is a range of uncertainty around the estimate of interest, such as the treatment effect in a controlled trial.

So, for example, in a study of the impact of a mental health worker on the management of depression in primary care, it was reported that "After adjustment for baseline depression, mean depression score was 1.33 PHQ-9 points lower (95% confidence interval 0.35 to 2.31, P=0.009) in participants receiving collaborative care than in those receiving usual care at four months."4 This means that we estimate that, in the population which these trial participants represent, the average difference in mean depression score if all were offered collaborative care would be between 0.35 and 2.31 scale points less than if all were treated in the usual way. It is only an estimate. For 2.5% of studies the confidence interval will be entirely below the true population difference, and 2.5% will have the interval entirely above it. We don't think "P=0.009" adds much to this, but researchers can seldom bear to do without it. The inevitable uncertainty from sampling error can be reduced by increasing the sample size, but usually only modestly. To halve the width of the confidence interval we would need to quadruple the sample size.

A common mistake is to believe that the confidence interval expresses all the uncertainty. Rather, the confidence interval expresses uncertainty from just one cause, namely the uncertainty due to having taken a sample from the population defined by the inclusion criteria. Often there are other sources of uncertainty that may be even more important to consider, in particular relating to possibly biased results. We address these in our linked statistics note.5
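
The statement that halving the width of a confidence interval requires four times the sample size follows from the standard error of a mean being proportional to 1/√n. A minimal Python sketch, assuming a simple normal based interval for a single mean, shows the relation; the function name and the standard deviation of 5.0 are hypothetical choices for illustration.

```python
import math


def ci_half_width(sd, n, z=1.96):
    # Half width of a normal based 95% CI for a mean: z × SD/√n
    return z * sd / math.sqrt(n)


w1 = ci_half_width(sd=5.0, n=100)
w4 = ci_half_width(sd=5.0, n=400)  # four times the sample size
print(w1, w4, w4 / w1)
```

The ratio of the two widths is one half whatever standard deviation is used, which is why increasing the sample size usually reduces sampling uncertainty only modestly.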
Contributors: DGA and JMB jointly wrote and agreed the text.

Correspondence to: D G Altman doug.altman@csm.ox.ac.uk


Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Provenance and peer review: Not commissioned; not externally peer reviewed.

1 Sterne JA, Davey Smith G. Sifting the evidence - what's wrong with significance tests? BMJ 2001;322:226-31.
2 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995;311:485.
3 Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746-50.
4 Richards DA, Hill JJ, Gask L, Lovell K, Chew-Graham C, Bower P, et al. Clinical effectiveness of collaborative care for depression in UK primary care (CADET): cluster randomised controlled trial. BMJ 2013;347:f4913.
5 Altman DG, Bland JM. Uncertainty beyond sampling error. BMJ 2014;349:g7065.

Cite this as: BMJ 2014;349:g7064
© BMJ Publishing Group Ltd 2014

BMJ 2014;349:g7065 doi: 10.1136/bmj.g7065 (Published 25 November 2014) Page 1 of 2

RESEARCH METHODS & REPORTING

STATISTICS NOTES

Uncertainty beyond sampling error


Douglas G Altman professor of statistics in medicine 1, J Martin Bland professor of health statistics 2

1 Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford OX3 7LD, UK; 2 Department of Health Sciences, University of York, York YO10 5DD, UK

Statistical analysis of research results mainly uses confidence intervals and hypothesis tests to capture the uncertainty arising from our study being on a sample of participants drawn from a much larger population, in which our interest mainly lies.1 But beyond the issue of sampling variation there are other sources of uncertainty that may be even more important to consider. In measurement, a distinction is made between precision, which is how variable are measurements on the same person by the same method made at the same time, and accuracy, which is how close the measurement is to what we actually want to know. For example, if we were to ask a group of patients on two occasions how much alcohol they typically consume, this would enable us to estimate precision, how repeatable answers are, but not how close these answers are to how much they actually drink, which we might suspect to be higher. In the same way, a confidence interval tells us about the precision of research results, what would happen if we were to repeat the same study, not their accuracy, which is how close the study is to the truth.

In general, beyond the imprecision or uncertainty of numerical results arising from sampling, the main concern is the possibility that the study results are biased. Recent developments in appraising published randomised trials have switched from considering "quality" (essentially undefinable) to assessing explicitly the risk of bias in relation to the way the study was done.2 Here sources of possible bias include lack of blinding and losses to follow up (missing data). But bias is especially relevant in non-randomised studies, where there will be major concerns about possible confounding, where an apparent relationship between two things may be the result of the relationships of both to a third.

A further source of uncertainty concerns the extent to which results of research conducted in a particular setting with selected participants can be taken as applying equally to a wider group of patients in a different location. Judgement of generalisability (also known as external validity) is challenging.3 In a clinical trial, the part of the larger population we particularly care about does not yet exist. We want to know what would happen to future patients if we were to apply either of the trial treatments. Changes over time in the nature of disease, in the fitness of the patient, in nutrition, in ancillary treatments, and so on, may all make our sample unrepresentative as a guide for future action.

For example, the UK review of evidence relating to mammographic screening outlined three sources of uncertainty in relation to the pooled estimated effect from a meta-analysis of the results of all the randomised trials.4 First, there was uncertainty due to sampling variation, as previously discussed.1 Second, there was uncertainty from some methodological weaknesses of the trials. Third was uncertainty about whether the results from the trials were still relevant 25 years later, after major changes in cancer incidence, management of and treatments for breast cancer, and the technology of mammography. Unlike sampling variation, which is quantified in the confidence interval and is uncontroversial, the uncertainty from the other causes cannot easily be quantified and remains the source of fierce debate.

Large numbers, increasingly common in this era of "big data," will produce narrow confidence intervals. These can create an illusion of accuracy, but they ignore all sources of possible bias that are not affected by sample size, and so these other sources become relatively more important.5 6 A recent example of a very precise but seriously wrong answer purported to show that skin cancer was protective for heart attacks and all-cause mortality.7 So, although confidence intervals are a valuable way of depicting uncertainty, they are always too narrow in the sense that they reflect only statistical uncertainty, precision rather than accuracy.

Many journals require authors to consider in their discussion the limitations of their study; some even require this in the article's abstract. Issues raised there help readers to judge what extra uncertainty might apply to the study, including whether the observed effects may be affected by bias. It is common for authors to say that their results should be "interpreted with caution" (including >700 in the BMJ), but who knows what that means in practice? The GRADE group have developed a framework for a more structured approach to assessing the reliability of research findings that addresses the aspects outlined above.8
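
The distinction between precision and accuracy can be illustrated with a small simulation. The Python sketch below is a toy example under stated assumptions, not an analysis of any study discussed here: the true mean of 10, the measurement bias of 1, the standard deviation of 2, and the sample size of 20 000 are all hypothetical values. A very large sample gives a very narrow confidence interval, yet the interval misses the true value because the bias is unaffected by sample size.

```python
import math
import random
from statistics import mean, stdev


def mean_ci(values, z=1.96):
    # Normal based 95% confidence interval for the mean of a sample
    m = mean(values)
    se = stdev(values) / math.sqrt(len(values))
    return m - z * se, m + z * se


rng = random.Random(0)  # fixed seed so the sketch is reproducible
TRUE_MEAN = 10.0        # hypothetical quantity we actually want to know
BIAS = 1.0              # hypothetical systematic error in every measurement

# Every observation is shifted by the same bias, however many we collect
sample = [rng.gauss(TRUE_MEAN + BIAS, 2.0) for _ in range(20000)]
low, high = mean_ci(sample)
print(f"95% CI {low:.2f} to {high:.2f}; contains true mean: {low <= TRUE_MEAN <= high}")
```

Collecting more data would narrow the interval further around the biased value, which is the sense in which large numbers can create an illusion of accuracy.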

Correspondence to: D G Altman doug.altman@csm.ox.ac.uk


Contributors: DGA and JMB jointly wrote and agreed the text.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Provenance and peer review: Not commissioned; not externally peer reviewed.

1 Altman DG, Bland JM. Uncertainty and sampling error. BMJ 2014;349:g7064.
2 Higgins JP, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, et al. The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ 2011;343:d5928.
3 Rothwell PM. External validity of randomised controlled trials: "to whom do the results of this trial apply?" Lancet 2005;365:82-93.
4 Marmot MG, Altman DG, Cameron DA, Dewar JA, Thompson SG, Wilcox M. The benefits and harms of breast cancer screening: an independent review. Br J Cancer 2013;108:2205-40.
5 Greenland S. Interval estimation by simulation as an alternative to and extension of confidence intervals. Int J Epidemiol 2004;33:1389-97.
6 Kaplan RM, Chambers DA, Glasgow RE. Big data and large sample size: a cautionary note on the potential for bias. Clin Transl Sci 2014;7:342-6.
7 Lange T, Keiding N. Skin cancer as a marker of sun exposure: a case of serious immortality bias. Int J Epidemiol 2014;43:971.
8 Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924-6.

Cite this as: BMJ 2014;349:g7065
© BMJ Publishing Group Ltd 2014

Research Methods & Reporting

open access
Statistics Notes: Bootstrap resampling methods
J Martin Bland,1 Douglas G Altman2

1Department of Health Sciences, University of York, York YO10 5DD, UK
2Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford OX3 7LD, UK
Correspondence to: J M Bland martin.bland@york.ac.uk
Cite this as: BMJ 2015;350:h2622
doi: 10.1136/bmj.h2622

In medical research we study a sample of individuals to make inferences about a target population. Estimates of interest, such as a mean or a difference in proportions, are calculated, usually accompanied by a confidence interval derived from the standard error. The data from a single sample are used here to quantify the variation in the estimate of interest across (hypothetical) multiple samples from the same population.1 As we have only one sample we need to make assumptions about the data. Most methods of analysis are called parametric because they incorporate assumptions about the distribution of the data, such as that observations follow a normal distribution. Non-parametric methods avoid assumptions about distributions but generally provide only P values and not estimates of quantities of interest.2

For a given dataset the assumptions may not be met. In such cases there is an alternative way to estimate standard errors and confidence intervals without any reliance on assumed probability distributions. We use the sample dataset and apply a resampling procedure called the bootstrap. (In general language, a bootstrap method is a self sustaining process that needs no external input.)

The clever idea behind the bootstrap is to create multiple datasets from the real dataset without needing to make any assumptions. Our observed sample is representative of a population about which we wish to make inferences, so a set of randomly chosen observations from our sample will be equally representative of the original population. We can generate a sample of the same size as the original data set by randomly choosing real observations one at a time. Each observation has an equal chance of being chosen each time, so some observations will be picked more than once and some won't be picked at all. That doesn't matter; the new bootstrap sample is comparable to the original data set and is equally representative of the target population.

For an example, CADET3 was a cluster randomised trial comparing collaborative care for depression detected in primary care with treatment as usual. The outcome measure was the PHQ-9 depression scale, and data were available for 505 participants. The estimated mean difference (collaborative care minus treatment as usual) was −1.33 points on the PHQ-9 scale (95% confidence interval −2.31 to −0.35) adjusted for baseline PHQ-9, age, the list size, index of multiple deprivation, city of the practice, and clustering.

We created another sample of 505 by resampling as described above, the full original sample being available for each of the 505 choices. The resulting new sample of 505 observations included 313 of the original 505 participants, some once, some more than once, a maximum of five times. The same regression analysis which produced the original treatment effect estimate was repeated for this new sample resulting in a slightly different estimated treatment difference of −1.25 points.

Instead of resampling once, we should do it many times and use the variability of the results to obtain a confidence interval. The distribution of the estimated treatment effect from 1000 resamplings of the CADET data is shown in the figure. The mean and standard deviation of this distribution are −1.353 and 0.565. This standard deviation provides an alternative estimate of the standard error of the mean difference between the treatments, which does not make use of any theory about the distribution of the data. There are two ways to use the bootstrap estimates to find a confidence interval. If the resampling distribution is close to normal, as is the case here, the 95% confidence interval will be −1.353 − (1.96 × 0.565) to −1.353 + (1.96 × 0.565), or −2.46 to −0.25. This interval is similar to that obtained using the standard error from the least squares regression on the real data. The other approach is to take the 95% confidence interval directly from the 2.5th and 97.5th centiles of the distribution. For these data the bootstrap confidence interval calculated this way is −2.44 to −0.26. This second approach can be used regardless of the distribution of the bootstrap estimates.

Clearly we need enough repetitions so that the estimates are stable; usually thousands of bootstrap samples are used, especially when using the observed centiles of the distribution of estimates. A repetition of the whole bootstrap analysis for CADET produced almost identical values of the mean (−1.335) and standard deviation (0.567).

This note gives the general idea of the bootstrap; there are many variations.4 We can get a bootstrap estimate for any quantity we can calculate from any sample. Bootstrap methods are particularly favoured by health economists, because cost data tend to be highly skewed and unsuited to conventional approaches.5 They are also useful for complex datasets, for example, when the observations aren't independent.
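
The resampling procedure and the two confidence interval methods described in this note can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CADET analysis: the data are synthetic, the estimate is a simple unadjusted difference in group means rather than the adjusted regression used in the trial, clustering is ignored, and the function names `normal_ci` and `bootstrap_mean_difference` are hypothetical.

```python
import numpy as np

def normal_ci(mean, se, z=1.96):
    """95% interval assuming the bootstrap estimates are close to normal."""
    return (mean - z * se, mean + z * se)

def bootstrap_mean_difference(treated, control, n_boot=1000, seed=0):
    """Bootstrap the difference in means (treated minus control).

    Whole participants are resampled with replacement, so each bootstrap
    sample has the same total size as the original data set.
    """
    rng = np.random.default_rng(seed)
    outcome = np.concatenate([np.asarray(treated, float),
                              np.asarray(control, float)])
    is_treated = np.concatenate([np.ones(len(treated), dtype=bool),
                                 np.zeros(len(control), dtype=bool)])
    n = len(outcome)
    estimates = []
    while len(estimates) < n_boot:
        # Each observation has an equal chance of being chosen each time.
        idx = rng.integers(0, n, size=n)
        y, t = outcome[idx], is_treated[idx]
        if t.all() or not t.any():
            continue  # skip the (very unlikely) sample with one arm empty
        estimates.append(y[t].mean() - y[~t].mean())
    estimates = np.array(estimates)
    mean = estimates.mean()
    se = estimates.std(ddof=1)  # bootstrap estimate of the standard error
    return {
        "estimates": estimates,
        "mean": mean,
        "se": se,
        "normal_ci": normal_ci(mean, se),
        "percentile_ci": tuple(np.percentile(estimates, [2.5, 97.5])),
    }

# Synthetic outcome scores (hypothetical values, not trial data).
data_rng = np.random.default_rng(42)
control_scores = data_rng.normal(12.0, 5.0, size=250)
treated_scores = data_rng.normal(10.7, 5.0, size=255)
result = bootstrap_mean_difference(treated_scores, control_scores)

# The normal-approximation arithmetic quoted in the text:
article_ci = normal_ci(-1.353, 0.565)  # rounds to (-2.46, -0.25)
```

The percentile interval needs no normality assumption but depends more heavily on the number of resamplings, which is why `n_boot` would usually be set to several thousand in practice.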
Contributors: JMB and DGA jointly wrote and agreed the text.
Competing interests: We have read and understood the BMJ Group policy on declaration of interests and have no relevant interests to declare.
Provenance and peer review: Not commissioned; not externally peer reviewed.

Figure: Histogram of 1000 resampling estimates of the treatment difference from the CADET data, with corresponding normal distribution curve, mean, and 2.5 and 97.5 centiles (x axis: estimated treatment difference; y axis: No of samplings).

1 Altman DG, Bland JM. Standard deviations and standard errors. BMJ 2005;331:903.
2 Altman DG, Bland JM. Parametric v nonparametric methods for data analysis. BMJ 2009;338:a3167.


3 Richards DA, Hill JJ, Gask L, et al. Clinical effectiveness of collaborative care for depression in UK primary care (CADET): cluster randomised controlled trial. BMJ 2013;347:f4913.
4 Carpenter J, Bithell J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat Med 2000;19:1141-64.
5 Schroeder E, Petrou S, Patel N, et al. Cost effectiveness of alternative planned places of birth in woman at low risk of complications: evidence from the Birthplace in England national prospective cohort study. BMJ 2012;344:e2292.
BMJ Publishing Group Ltd 2015

No commercial reuse: See rights and reprints http://www.bmj.com/permissions Subscribe: http://www.bmj.com/subscribe


BMJ 2016;352:i189 doi: 10.1136/bmj.i189 (Published 15 January 2016) Page 1 of 2

RESEARCH METHODS & REPORTING

STATISTICS NOTES

Inverse probability weighting


Mohammad Ali Mansournia, assistant professor of epidemiology,1 Douglas G Altman, professor of statistics in medicine2

1Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran; 2Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford OX3 7LD, UK

Statistical analysis usually treats all observations as equally important. In some circumstances, however, it is appropriate to vary the weight given to different observations. Well known examples are in meta-analysis, where the inverse variance (precision) weight given to each contributing study varies, and in the analysis of clustered data.1

Differential weighting is also used when different parts of the population are sampled with unequal probabilities of selection. Two examples of intentional unbalanced sampling are:

1. Surveys with unequal probabilities of selection: In a national survey of hypertension prevalence, certain groups with relatively rare characteristics (such as people aged ≥65 years) were oversampled to improve the precision of estimates for those groups.2
2. Two-phase prevalence studies: In the first phase of a two-phase prevalence study of mental health status, the sampled patients completed a short screening questionnaire. In the second phase, a subsample was selected for a definitive diagnostic test with oversampling of the screen-positive cases to ensure precise estimates for diagnostic prevalence.3

In such cases the ordinary unweighted sample quantities, such as means or proportions, are likely to be biased estimates of their corresponding population quantities. This selection bias can be eliminated by performing a weighted estimation, giving each individual's data a weight inversely proportional to their probability of selection. Intuitively, the weighting is used to deflate the weight for those individuals who are oversampled. The weighted analysis can be thought of as creating a study with no differential selection.

Inverse probability weighting can also be used when individuals vary in their probability of having missing information. Two contexts where there may be unintentional unbalanced selection are:

3. Studies with missing outcome data: In surveys such as that mentioned in example 1, the response rates will be affected by availability or willingness to participate. Likewise in a cohort study of the effect of obesity on hypertension, some individuals are censored due to loss to follow-up (such as emigration) or competing risks (such as death from other causes).4 In each case the amount of missing information will vary across subgroups.
4. Randomised trials with crossing over from one arm to the other: In a randomised trial 8010 postmenopausal women with early breast cancer were assigned to tamoxifen (n=2459) or letrozole (n=2463) for five years or to sequential treatment with two years of one of these agents followed by three years of the other. There was a selective crossover to letrozole of 619 patients in the tamoxifen arm after significant benefit was reported for letrozole compared with tamoxifen during the study. These 619 women may be artificially censored at the time they crossed over for analysis.5

In these situations, missing outcomes are unlikely to happen at random so that estimates will be biased. While the selection probabilities in examples 1 and 2 are known, the response or non-censoring probabilities in examples 3 and 4 are unknown. Inverse probability weighting can be used with weights estimated from a logistic regression model for predicting non-response or censoring. As in the first scenario, this application of the method aims to remove bias, but it is more controversial. Its validity relies on a correctly specified model including all prognostic variables associated with non-response or censoring, which cannot be assured.

In the breast cancer trial (example 4), although the intention-to-treat hazard ratio for overall survival (which ignores selective crossover) was 0.87 (95% confidence interval 0.77 to 1.00) in favour of letrozole, the adjusted hazard ratio using inverse probability of selection weights was 0.79 (0.69 to 0.90), suggesting that the true effect is greater than the intention-to-treat estimate.5

Correspondence to: M A Mansournia mansournia_ma@yahoo.com
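
The weighted estimation described for surveys with known selection probabilities (examples 1 and 2) can be illustrated with a small worked sketch. All numbers below are hypothetical and chosen only to make the arithmetic easy to follow: a population of 100 older people (60 hypertensive) and 900 younger people (180 hypertensive), with older people sampled with probability 0.5 and younger people with probability 0.1. The function name `ipw_mean` is likewise an assumption for illustration.

```python
import numpy as np

def ipw_mean(values, selection_prob):
    """Weighted mean with weights inversely proportional to selection probability."""
    values = np.asarray(values, dtype=float)
    prob = np.asarray(selection_prob, dtype=float)
    if np.any((prob <= 0) | (prob > 1)):
        raise ValueError("selection probabilities must lie in (0, 1]")
    weights = 1.0 / prob  # oversampled individuals receive smaller weights
    return np.sum(weights * values) / np.sum(weights)

# Hypothetical sample: 50 older people (30 hypertensive) sampled with
# probability 0.5, and 90 younger people (18 hypertensive) sampled with
# probability 0.1. The true population prevalence is 240/1000 = 0.24.
hypertensive = np.concatenate([np.repeat([1, 0], [30, 20]),
                               np.repeat([1, 0], [18, 72])])
selection_prob = np.repeat([0.5, 0.1], [50, 90])

unweighted = hypertensive.mean()                  # 48/140, biased upwards
weighted = ipw_mean(hypertensive, selection_prob)  # 240/1000 = 0.24
```

The unweighted proportion overstates prevalence because the high-prevalence older group is over-represented in the sample; the weighting restores each group to its population share.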


In observational studies, the probability of exposure can depend on external factors (called confounders) that also affect the outcome. The causal effect of interest is then confused with the effects of confounders. Such confounding can be thought of as a type of selection bias, because confounding essentially means that some causes of the outcome also influence selection for the exposure. A particularly important context is:

5. Non-randomised studies comparing different treatments: In a cohort study 12 552 warfarin-naive patients with atrial fibrillation admitted to hospital for ischaemic stroke and treated with warfarin were compared with patients who received no oral anticoagulant at discharge.6

Outside randomised trials the choice of treatment is likely to be influenced by predictors of outcome, so called "confounding by indication."7 Various strategies are used to try to remove the bias in non-randomised treatment comparisons. The conventional approach is to use multivariable regression, but a recent alternative is inverse probability of treatment weighting. Here the weights are based on each individual's probability of receiving a specific treatment given the confounders, which is known as the propensity score (PS). The weights are 1/PS for the treated participants and 1/(1−PS) for the untreated participants.8 The weights can be estimated from a logistic regression model for predicting treatment. Key assumptions are that all confounders have been measured and properly modelled in this treatment model. In the warfarin study (example 5) the unadjusted hazard ratio for cardiac events was 0.73 (99% confidence interval 0.67 to 0.80) in favour of warfarin, whereas the adjusted estimate using inverse probability of treatment weighting was 0.87 (0.78 to 0.98), about half the effect size.6 If the cohort is also affected by censoring (see example 3 above), one can adjust simultaneously for confounding and selection bias due to censoring.4 8

Although helpful for bias reduction, estimates weighted by design weights (examples 1 and 2) tend to be less precisely estimated than the unweighted estimates, which is not necessarily true for examples 3-5. The ordinary 95% confidence interval for inverse probability weighted estimates may not provide the correct coverage and should be avoided. Instead, robust "sandwich" variance estimators or non-parametric bootstrapping should be used to provide valid confidence intervals.8 Deeper discussion of inverse probability weighting methods can be found elsewhere.8 9

Contributors: MAM and DGA jointly wrote and agreed the text.
Competing interests: We have read and understood the BMJ Group policy on declaration of interests and have no relevant interests to declare.
Provenance and peer review: Not commissioned; not externally peer reviewed.

1 Kerry SM, Bland JM. Analysis of a trial randomised in clusters. BMJ 1998;316:54. PMID: 9451271
2 Korn EL, Graubard BI. Epidemiologic studies utilizing surveys: accounting for the sampling design. Am J Public Health 1991;81:1166-73. PMID: 1951829
3 Vázquez-Barquero JL, García J, Simón JA, et al. Mental health in primary care. An epidemiological study of morbidity and use of health resources. Br J Psychiatry 1997;170:529-35. PMID: 9330019
4 Alonso A, Seguí-Gómez M, de Irala J, Sánchez-Villegas A, Beunza JJ, Martínez-Gonzalez MA. Predictors of follow-up and assessment of selection bias from dropouts using inverse probability weighting in a cohort of university graduates. Eur J Epidemiol 2006;21:351-8. PMID: 16736275
5 Regan MM, Neven P, Giobbie-Hurder A, et al. BIG 1-98 Collaborative Group International Breast Cancer Study Group (IBCSG). Assessment of letrozole and tamoxifen alone and in sequence for postmenopausal women with steroid hormone receptor-positive breast cancer: the BIG 1-98 randomised clinical trial at 8.1 years median follow-up. Lancet Oncol 2011;12:1101-8. PMID: 22018631
6 Xian Y, Wu J, O'Brien EC, et al. Real world effectiveness of warfarin among ischemic stroke patients with atrial fibrillation: observational analysis from Patient-Centered Research into Outcomes Stroke Patients Prefer and Effectiveness Research (PROSPER) study. BMJ 2015;351:h3786. PMID: 26232340
7 Freemantle N, Marston L, Walters K, Wood J, Reynolds MR, Petersen I. Making inferences on treatment effects from real world data: propensity scores, confounding by indication, and other perils for the unwary in observational research. BMJ 2013;347:f6409. PMID: 24217206
8 Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550-60. PMID: 10955408
9 Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 2013;22:278-95. PMID: 21220355

Accepted: 05 01 2016
Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions
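
The inverse probability of treatment weights described for example 5 (1/PS for treated participants and 1/(1−PS) for untreated participants) can be illustrated with a small sketch. It rests on several assumptions that are not from the article: the data are hypothetical, there is a single binary confounder, and the propensity score is estimated as the observed proportion treated within each confounder stratum (which is what a saturated logistic model would give) rather than from a fitted logistic regression. The function names are illustrative only.

```python
import numpy as np

def iptw_weights(treated, ps):
    """Weights of 1/PS for treated and 1/(1 - PS) for untreated participants."""
    treated = np.asarray(treated, dtype=bool)
    ps = np.asarray(ps, dtype=float)
    return np.where(treated, 1.0 / ps, 1.0 / (1.0 - ps))

def weighted_risk(outcome, weights, mask):
    """Weighted proportion with the outcome among participants selected by mask."""
    return np.sum(weights[mask] * outcome[mask]) / np.sum(weights[mask])

# Hypothetical cohort of 200 people in four cells (confounder, treatment):
# (0, treated) 20 people, 2 events;  (0, untreated) 80 people, 16 events;
# (1, treated) 80 people, 24 events; (1, untreated) 20 people, 8 events.
confounder = np.repeat([0, 0, 1, 1], [20, 80, 80, 20])
treated = np.repeat([1, 0, 1, 0], [20, 80, 80, 20]).astype(bool)
outcome = np.concatenate([np.repeat([1, 0], [2, 18]),
                          np.repeat([1, 0], [16, 64]),
                          np.repeat([1, 0], [24, 56]),
                          np.repeat([1, 0], [8, 12])])

# Propensity score: proportion treated within each confounder stratum
# (0.2 when confounder = 0, 0.8 when confounder = 1).
ps = np.where(confounder == 1,
              treated[confounder == 1].mean(),
              treated[confounder == 0].mean())
weights = iptw_weights(treated, ps)

# Crude comparison: 0.26 - 0.24 = +0.02, treatment looks slightly harmful.
crude_rd = outcome[treated].mean() - outcome[~treated].mean()
# Weighted comparison: 0.20 - 0.30 = -0.10, matching the within-stratum effect.
iptw_rd = (weighted_risk(outcome, weights, treated)
           - weighted_risk(outcome, weights, ~treated))
```

In this constructed example the treatment lowers risk by 0.10 in both strata, but the crude comparison hides this because high-risk people are more likely to be treated. As the text notes, a confidence interval for the weighted estimate would need a robust variance estimator or bootstrapping; the sketch gives point estimates only.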
