Statistics Notes in the British Medical Journal (Bland JM, Altman DG. - NEJ)



Statistics Notes

Absence of evidence is not evidence of absence

Douglas G Altman, J Martin Bland

Medical Statistics Laboratory, Imperial Cancer Research Fund, London WC2A 3PX
Douglas G Altman, head

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, reader in medical statistics

Correspondence to: Mr Altman.

BMJ 1995;311:485

The non-equivalence of statistical significance and clinical importance has long been recognised, but this error of interpretation remains common. Although a significant result in a large study may sometimes not be clinically important, a far greater problem arises from misinterpretation of non-significant findings. By convention a P value greater than 5% (P>0.05) is called "not significant." Randomised controlled clinical trials that do not show a significant difference between the treatments being compared are often called "negative." This term wrongly implies that the study has shown that there is no difference, whereas usually all that has been shown is an absence of evidence of a difference. These are quite different statements.

The sample size of controlled trials is generally inadequate, with a consequent lack of power to detect real, and clinically worthwhile, differences in treatment. Freiman et al found that only 30% of a sample of 71 trials published in the New England Journal of Medicine in 1978-9 with P>0.1 were large enough to have a 90% chance of detecting even a 50% difference in the effectiveness of the treatments being compared, and they found no improvement in a similar sample of trials published in 1988. To interpret all these "negative" trials as providing evidence of the ineffectiveness of new treatments is clearly wrong and foolhardy. The term "negative" should not be used in this context.2

A recent example is given by a trial comparing octreotide and sclerotherapy in patients with variceal bleeding.3 The study was carried out on a sample of only 100 despite a reported calculation that suggested that 1800 patients were needed. This trial had only a 5% chance of getting a statistically significant result if the stated clinically worthwhile treatment difference truly existed. One consequence of such low statistical power was a wide confidence interval for the treatment difference. The authors concluded that the two treatments were equally effective despite a 95% confidence interval that included differences between the cure rates of the two treatments of up to 20 percentage points.

Similar evidence of the dangers of misinterpretation of non-significant results is found in numerous meta-analyses (overviews) of published trials, when few or none of the individual trials were statistically large enough. A dramatic example is provided by the overview of clinical trials evaluating fibrinolytic treatment (mostly streptokinase) for preventing reinfarction after acute myocardial infarction. The overview of randomised controlled trials found a modest but clinically worthwhile (and highly significant) reduction in mortality of 22%,4 but only five of the 24 trials had shown a statistically significant effect with P<0.05. The lack of statistical significance of most of the individual trials led to a long delay before the true value of streptokinase was appreciated.

While it is usually reasonable not to accept a new treatment unless there is positive evidence in its favour, when issues of public health are concerned we must question whether the absence of evidence is a valid enough justification for inaction. A recent publicised example is the suggested link between some sudden infant deaths and antimony in cot mattresses. Statements about the absence of evidence are common: for example, in relation to the possible link between violent behaviour and exposure to violence on television and video, the possible harmful effects of pesticide residues in drinking water, the possible link between electromagnetic fields and leukaemia, and the possible transmission of bovine spongiform encephalopathy from cows. Can we be comfortable that the absence of clear evidence in such cases means that there is no risk or only a negligible one?

When we are told that "there is no evidence that A causes B" we should first ask whether absence of evidence means simply that there is no information at all. If there are data we should look for quantification of the association rather than just a P value. Where risks are small P values may well mislead: confidence intervals are likely to be wide, indicating considerable uncertainty. While we can never prove the absence of a relation, when necessary we should seek evidence against the link between A and B, for example from case-control studies. The importance of carrying out such studies will relate to the seriousness of the postulated effect and how widespread is the exposure in the population.

1 Freiman JA, Chalmers TC, Smith H, Kuebler HR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial: survey of two sets of "negative" trials. In: Bailar JC, Mosteller F, eds. Medical uses of statistics. 2nd ed. Boston, MA: NEJM Books, 1992:357-73.
2 Chalmers I. Proposal to outlaw the term "negative trial." BMJ 1985;290:1002.
3 Sung JJY, Chung SCS, Lai C-W, Chan FKL, Leung JWC, Yung M-L, Kassianides C, et al. Octreotide infusion or emergency sclerotherapy for variceal haemorrhage. Lancet 1993;342:637-41.
4 Yusuf S, Collins R, Peto R, Furberg C, Stampfer MJ, Goldhaber SZ, et al. Intravenous and intracoronary fibrinolytic therapy in acute myocardial infarction: overview of results on mortality, reinfarction and side-effects from 33 randomized controlled trials. Eur Heart J 1985;6:556-85.
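The point made in "Absence of evidence is not evidence of absence" about wide confidence intervals in small trials can be illustrated with a short sketch. The cure counts used here (42/50 and 45/50) are hypothetical and are not taken from the octreotide trial, and the interval is the simple normal (Wald) approximation, chosen as an assumption for illustration only.

```python
import math

def diff_ci(cured_a, n_a, cured_b, n_b, z=1.96):
    """Wald 95% confidence interval for the difference in cure rates (b minus a)."""
    p_a, p_b = cured_a / n_a, cured_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical trial: 50 patients per arm, cure rates 84% and 90%.
lower, upper = diff_ci(42, 50, 45, 50)
print(f"95% CI for the difference in cure rates: {lower:+.3f} to {upper:+.3f}")
```

With these made-up numbers the interval includes zero, so the comparison would be "not significant", yet it also includes a difference of more than 15 percentage points. Such a result is an absence of evidence of a difference, not evidence that the treatments are equivalent.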


BMJ 1996;313:744 (published 21 September 1996). doi: https://doi.org/10.1136/bmj.313.7059.744

a Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
b ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF

Several measurements of the same quantity on the same subject will not in general be the same. This may be

because of natural variation in the subject, variation in the measurement process, or both. For example, table 1

shows four measurements of lung function in each of 20 schoolchildren (taken from a larger study1). The first

child shows typical variation, having peak expiratory flow rates of 190, 220, 200, and 200 l/min.

Table 1 Repeated peak expiratory flow rate (PEFR) measurements for 20 schoolchildren (columns: child No, PEFR (l/min), mean, SD)

Let us suppose that the child has a true average value over all possible measurements, which is what we really

want to know when we make a measurement. Repeated measurements on the same subject will vary around the

true value because of measurement error. The standard deviation of repeated measurements on the same

subject enables us to measure the size of the measurement error. We shall assume that this standard deviation

is the same for all subjects, as otherwise there would be no point in estimating it. The main exception is when the

measurement error depends on the size of the measurement, usually with measurements becoming more

variable as the magnitude of the measurement increases. We deal with this case in a subsequent statistics note.

The common standard deviation of repeated measurements is known as the within-subject standard deviation,

which we shall denote by $s_w$.

To estimate the within-subject standard deviation, we need several subjects with at least two measurements for

each. In addition to the data, table 1 also shows the mean and standard deviation of the four readings for each

child. To get the common within-subject standard deviation we actually average the variances, the squares of the

standard deviations. The mean within-subject variance is 460.52, so the estimated within-subject standard

deviation is $s_w = \sqrt{460.52} = 21.5$ l/min. The calculation is easier using a program that performs one

way analysis of variance2 (table 2). The value called the residual mean square is the within-subject variance.

The analysis of variance method is the better approach in practice, as it deals automatically with the case of

subjects having different numbers of observations. We should check the assumption that the standard deviation

is unrelated to the magnitude of the measurement. This can be done graphically, by plotting the individual

subject's standard deviations against their means (see fig 1). Any important relation should be fairly obvious, but

we can check analytically by calculating a rank correlation coefficient. For the figure there does not appear to be

a relation (Kendall's $\tau$ = 0.16, P = 0.3).
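The averaging of variances described above can be sketched in a few lines. The data below are hypothetical: only the first row (190, 220, 200, 200) is quoted in the text, and the other two rows are invented for illustration. The sketch also computes the residual mean square of a one way analysis of variance by hand, to show that the two routes agree when every subject has the same number of readings.

```python
import statistics

# Hypothetical PEFR readings (l/min); only the first row is quoted in the text.
readings = [
    [190, 220, 200, 200],
    [220, 200, 240, 230],
    [260, 260, 260, 280],
]

def within_subject_sd(data):
    """Square root of the mean of the per-subject sample variances."""
    variances = [statistics.variance(obs) for obs in data]
    return statistics.mean(variances) ** 0.5

def residual_mean_square(data):
    """Residual (within-subject) mean square of a one way ANOVA by subject."""
    ss_within = sum(
        sum((x - statistics.mean(obs)) ** 2 for x in obs) for obs in data
    )
    df_within = sum(len(obs) - 1 for obs in data)
    return ss_within / df_within

s_w = within_subject_sd(readings)
print(f"within-subject SD: {s_w:.1f} l/min")
print(f"residual mean square: {residual_mean_square(readings):.2f}")
```

When subjects have different numbers of readings the simple average of variances and the residual mean square no longer coincide, which is one reason the analysis of variance route is preferred in practice.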

Table 2 One way analysis of variance for the data of table 1

Source of variation    Degrees of freedom    Sum of squares    Mean square    Variance ratio (F)    Probability (P)
Total                  79                    312949.69

Fig 1 Individual subjects' standard deviations plotted against their means

A common design is to take only two measurements per subject. In this case the method can be simplified

because the variance of two observations is half the square of their difference. So, if the difference between the
two observations for subject i is $d_i$, the within-subject standard deviation $s_w$ is given by $s_w^2 = \frac{1}{2n}\sum_i d_i^2$,

where n is the number of subjects. We can check for a relation between standard deviation and mean by plotting

for each subject the absolute value of the difference (that is, ignoring any sign) against the mean.
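As a check on the two-measurement shortcut, the sketch below compares the squared within-subject standard deviation obtained from the differences, that is, the sum of squared differences divided by 2n, with the mean of the per-subject variances. The three pairs are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical pairs of repeated measurements (l/min), one pair per subject.
pairs = [(190, 220), (200, 200), (230, 210)]

def sw_squared_from_differences(data):
    """Within-subject variance from pairs: sum of squared differences / (2n)."""
    n = len(data)
    return sum((a - b) ** 2 for a, b in data) / (2 * n)

def sw_squared_from_variances(data):
    """Within-subject variance as the mean of the per-subject sample variances."""
    return statistics.mean(statistics.variance(pair) for pair in data)

from_differences = sw_squared_from_differences(pairs)
from_variances = sw_squared_from_variances(pairs)
print(from_differences, from_variances)
```

Both routes give the same value because the sample variance of two observations is exactly half the square of their difference.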

The measurement error can be quoted as $s_w$. The difference between a subject's measurement and the true
value would be expected to be less than 1.96 $s_w$ for 95% of observations. Another useful way of presenting
measurement error is sometimes called the repeatability, which is $\sqrt{2} \times 1.96\, s_w$ or 2.77 $s_w$. The
difference between two measurements for the same subject is expected to be less than 2.77 $s_w$ for 95% of pairs
of observations. For the data in table 1 the repeatability is 2.77 × 21.5 = 60 l/min. The large variability in peak

expiratory flow rate is well known, so individual readings of peak expiratory flow are seldom used. The variable

used for analysis in the study from which table 1 was taken was the mean of the last three readings.1
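The repeatability arithmetic can be reproduced directly. The sketch below uses the rounded values quoted in the text (a coefficient of 2.77 and a within-subject standard deviation of 21.5 l/min); the variable names are illustrative.

```python
import math

# The difference of two readings has SD sqrt(2) * s_w, so the 95% limit is
# sqrt(2) * 1.96 * s_w.
coefficient = math.sqrt(2) * 1.96

s_w = 21.5                    # within-subject SD in l/min, as quoted in the text
repeatability = 2.77 * s_w    # using the rounded coefficient, as the text does

print(f"coefficient: {coefficient:.2f}")
print(f"repeatability: {repeatability:.0f} l/min")
```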

Other ways of describing the repeatability of measurements will be considered in subsequent statistics notes.

References

1. Bland JM, Holland WW, Elliott A. The development of respiratory symptoms in a cohort of Kent schoolchildren. Bull Physio-Path Resp 1974;10:699-716.
2. Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ 1996;312:1472.


Measurement error and correlation coefficients

BMJ 1996;313:41 (published 6 July 1996). doi: https://doi.org/10.1136/bmj.313.7048.41

a Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
b ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF

Measurement error is the variation between measurements of the same quantity on the same individual.1 To

quantify measurement error we need repeated measurements on several subjects. We have discussed the

within-subject standard deviation as an index of measurement error,1 which we like as it has a simple clinical

interpretation. Here we consider the use of correlation coefficients to quantify measurement error.

A common design for the investigation of measurement error is to take pairs of measurements on a group of

subjects, as in table 1. When we have pairs of observations it is natural to plot one measurement against the

other. The resulting scatter diagram (see fig 1) may tempt us to calculate a correlation coefficient between the

first and second measurement. There are difficulties in interpreting this correlation coefficient. In general, the

correlation between repeated measurements will depend on the variability between subjects. Samples containing

subjects who differ greatly will produce larger correlation coefficients than will samples containing similar

subjects. For example, suppose we split this group in whom we have measured forced expiratory volume in one

second (FEV1) into two subsamples, the first 10 subjects and the second 10 subjects. As table 1 is ordered by

the first FEV1 measurement, both subsamples vary less than does the whole sample. The correlation for the first

subsample is r = 0.63 and for the second it is r = 0.31, both less than r = 0.77 for the full sample. The correlation

coefficient thus depends on the way the sample is chosen, and it has meaning only for the population from which

the study subjects can be regarded as a random sample. If we select subjects to give a wide range of the

measurement, the natural approach when investigating measurement error, this will inflate the correlation

coefficient.
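The dependence of the correlation coefficient on the spread of the subjects can be shown with a small constructed example. The data below are hypothetical and are not the FEV1 measurements: each "true" value is given all four combinations of a fixed error of ±1 on the first and second measurement, so the measurement error is identical in every sample and only the range of true values changes.

```python
import math

def pearson(pairs):
    """Pearson correlation coefficient between first and second measurements."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def make_pairs(true_values, error=1.0):
    """Hypothetical repeated measurements: every sign combination of the error."""
    return [
        (t + e1, t + e2)
        for t in true_values
        for e1 in (-error, error)
        for e2 in (-error, error)
    ]

r_full = pearson(make_pairs([1, 2, 3, 4]))   # wide range of true values
r_low = pearson(make_pairs([1, 2]))          # lower half only
r_high = pearson(make_pairs([3, 4]))         # upper half only
print(f"full sample r = {r_full:.2f}, subsamples r = {r_low:.2f} and {r_high:.2f}")
```

The measurement error is the same in all three samples, yet the correlation is lower in each narrower subsample, which is the pattern described for the two halves of the FEV1 data.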

Table 1 Pairs of measurements of FEV1 (litres) a few weeks apart from 20 Scottish schoolchildren, taken from a larger study (D Strachan, personal communication)

Fig 1 Measurements from pairs of observations plotted against each other

The correlation coefficient between repeated measurements is often called the reliability of the measurement

method. It is widely used in the validation of psychological measures such as scales of anxiety and depression,

where it is known as the test-retest reliability. In such studies it is quoted for different populations (university

students, psychiatric outpatients, etc) because the correlation coefficient differs between them as a result of

differing ranges of the quantity being measured. The user has to select the correlation from the study population

most like the user's own.

Another problem with the use of the correlation coefficient between the first and second measurements is that

there is no reason to suppose that their order is important. If the order were important the measurements would

not be repeated observations of the same thing. We could reverse the order of any of the pairs and get a slightly

different value of the correlation coefficient between repeated measurements. For example, reversing the order

of the even numbered subjects in table 1 gives r = 0.80 instead of r = 0.77. The intra-class correlation coefficient

avoids this problem. It estimates the average correlation among all possible orderings of pairs. It also extends

easily to the case of more than two observations per subject, where it estimates the average correlation between

all possible pairs of observations.

Few computer programs will calculate the intra-class correlation coefficient directly, but when the number of

observations is the same for each subject it can be found from a one way analysis of variance table2 such as

table 2. We need the total sum of squares, SST, and the sum of squares between subjects, SSB.

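Then

$$r_I = \frac{m\,\mathrm{SSB} - \mathrm{SST}}{(m-1)\,\mathrm{SST}}$$

where m is the number of observations per subject. For table 2, m = 2 and $r_I = (2\,\mathrm{SSB} - \mathrm{SST})/\mathrm{SST}$, with SST = 1.74651 from the total row.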

Table 2 One way analysis of variance for the data in table 1

Source of variation    Degrees of freedom    Sum of squares    Mean square    Variance ratio (F)    Probability (P)
Total                  39                    1.74651
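A minimal sketch of the intra-class correlation calculation from the sums of squares follows. It assumes the usual form for equal numbers of observations per subject, r_I = (m SSB - SST) / ((m - 1) SST); the function name and the two-subject data are hypothetical and are not the FEV1 values.

```python
def icc_from_sums_of_squares(data):
    """Intra-class correlation for subjects with m observations each."""
    m = len(data[0])
    if any(len(obs) != m for obs in data):
        raise ValueError("every subject needs the same number of observations")
    values = [v for obs in data for v in obs]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(m * (sum(obs) / m - grand_mean) ** 2 for obs in data)
    return (m * ss_between - ss_total) / ((m - 1) * ss_total)

# Hypothetical pairs of measurements on two subjects.
r_i = icc_from_sums_of_squares([[1, 2], [3, 4]])
# Reversing the order within a pair leaves the result unchanged.
r_i_reversed = icc_from_sums_of_squares([[2, 1], [3, 4]])
print(r_i, r_i_reversed)
```

Because only subject means and the overall spread enter the calculation, the order of the measurements within a subject has no effect, unlike the ordinary correlation between first and second measurements.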

In practice, there will usually be little difference between r and $r_I$ for true repeated measurements. If, however,
there is a systematic change from the first measurement to the second, as might be caused by a learning effect,
$r_I$ will be much less than r. If there was such an effect the measurements would not be made under the same

conditions and so we could not measure reliability.

The correlation coefficient can be used to compare measurements of different quantities, such as different scales

for measuring anxiety. We could make repeated measurements of all the quantities on the same subjects and

calculate intra-class correlations. The measures with the highest correlation between repeated measurements

would discriminate best between individuals; in other words they would carry the most information. For most

applications, however, we prefer the within-subjects standard deviation as an index of measurement error, as it

has a more direct interpretation which can be applied to individual measurements.1

References

1. Bland JM, Altman DG. Measurement error. BMJ 1996;312:1654.
2. Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ 1996;312:1472-3.

Education and debate

Statistics notes

Bayesians and frequentists

J Martin Bland, Douglas G Altman

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, head

Correspondence to: Professor Bland

BMJ 1998;317:1151

There are two competing philosophies of statistical analysis: the Bayesian and the frequentist. The frequentists are much the larger group, and almost all the statistical analyses which appear in the BMJ are frequentist. The Bayesians are much fewer and until recently could only snipe at the frequentists from the high ground of university departments of mathematical statistics. Now the increasing power of computers is bringing Bayesian methods to the fore.

Bayesian methods are based on the idea that unknown quantities, such as population means and proportions, have probability distributions. The probability distribution for a population proportion expresses our prior knowledge or belief about it, before we add the knowledge which comes from our data. For example, suppose we want to estimate the prevalence of diabetes in a health district. We could use the knowledge that the percentage of diabetics in the United Kingdom as a whole is about 2%, so we expect the prevalence in our health district to be fairly similar. It is unlikely to be 10%, for example. We might have information based on other datasets that such rates vary between 1% and 3%, or we might guess that the prevalence is somewhere between these values. We can construct a prior distribution which summarises our beliefs about the prevalence in the absence of specific data. We can do this with a distribution having mean 2 and standard deviation 0.5, so that two standard deviations on either side of the mean are 1% and 3%. (The precise mathematical form of the prior distribution depends on the particular problem.)

Suppose we now collect some data by a sample survey of the district population. We can use the data to modify the prior probability distribution to tell us what we now think the distribution of the population percentage is; this is the posterior distribution. For example, if we did a survey of 1000 subjects and found 15 (1.5%) to be diabetic, the posterior distribution would have mean 1.7% and standard deviation 0.3%. We can calculate a set of values, a 95% credible interval (1.2% to 2.4% for the example), such that there is a probability of 0.95 that the percentage of diabetics is within this set. The frequentist analysis, which ignores the prior information, would give an estimate 1.5% with standard error 0.4% and 95% confidence interval 0.8% to 2.5%. This is similar to the results of the Bayesian method, as is usually the case, but the Bayesian method gives an estimate nearer the prior mean and a narrower interval.

Frequentist methods regard the population value as a fixed, unvarying (but unknown) quantity, without a probability distribution. Frequentists then calculate confidence intervals for this quantity, or significance tests of hypotheses concerning it. Bayesians reasonably object that this does not allow us to use our wider knowledge of the problem. Also, it does not provide what researchers seem to want, which is to be able to say that there is a probability of 95% that the population value lies within the 95% confidence interval, or that the probability that the null hypothesis is true is less than 5%. It is argued that researchers want this, which is why they persistently misinterpret confidence intervals and significance tests in this way.

A major difficulty, of course, is deciding on the prior distribution. This is going to influence the conclusions of the study, yet it may be a subjective synthesis of the available information, so the same data analysed by different investigators could lead to different conclusions. Another difficulty is that Bayesian methods may lead to intractable computational problems. (All widely available statistical packages use frequentist methods.)

Most statisticians have become Bayesians or frequentists as a result of their choice of university. They did not know that Bayesians and frequentists existed until it was too late and the choice had been made. There have been subsequent conversions. Some who were taught the Bayesian way discovered that when they had huge quantities of medical data to analyse the frequentist approach was much quicker and more practical, although they may remain Bayesian at heart. Some frequentists have had Damascus road conversions to the Bayesian view. Many practising statisticians, however, are fairly ignorant of the methods used by the rival camp and too busy to have time to find out.

The advent of very powerful computers has given a new impetus to the Bayesians. Computer intensive methods of analysis are being developed, which allow new approaches to very difficult statistical problems, such as the location of geographical clusters of cases of a disease. This new practicability of the Bayesian approach is leading to a change in the statistical paradigm, and a rapprochement between Bayesians and frequentists.1 2 Frequentists are becoming curious about the Bayesian approach and more willing to use Bayesian methods when they provide solutions to difficult problems. In the future we expect to see more Bayesian analyses reported in the BMJ. When this happens we may try to use Statistics notes to explain them, though we may have to recruit a Bayesian to do it.

We thank David Spiegelhalter for comments on a draft.

1 Breslow N. Biostatistics and Bayes (with discussion). Statist Sci 1990;5:269-98.
2 Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomized trials (with discussion). J R Statist Soc A 1994;157:357-416.

Correction
North of England evidence based guidelines development project: guideline for the primary care management of dementia
An editorial error occurred in this article by Martin Eccles and colleagues (19 September, pp 802-8). In the list of authors the name of Moira Livingston [not Livingstone] was misspelt.
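The diabetes prevalence example in "Bayesians and frequentists" can be reproduced approximately with a short sketch. The note does not state the form of the prior distribution, so a beta distribution matched to the stated mean (2%) and standard deviation (0.5%) is assumed here; the function names are illustrative. The frequentist standard error uses the simple binomial formula.

```python
import math

def beta_from_mean_sd(mean, sd):
    """Beta(a, b) parameters matching a mean and SD on the proportion scale."""
    total = mean * (1 - mean) / sd ** 2 - 1   # a + b
    return mean * total, (1 - mean) * total

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

# Assumed beta prior with mean 2% and SD 0.5%.
prior_a, prior_b = beta_from_mean_sd(0.02, 0.005)

# Survey data from the text: 15 diabetics among 1000 subjects.
cases, n = 15, 1000
post_mean, post_sd = beta_mean_sd(prior_a + cases, prior_b + (n - cases))

# Frequentist estimate and standard error, ignoring the prior.
p_hat = cases / n
se = math.sqrt(p_hat * (1 - p_hat) / n)

print(f"posterior mean {100 * post_mean:.1f}%, SD {100 * post_sd:.1f}%")
print(f"frequentist estimate {100 * p_hat:.1f}%, SE {100 * se:.1f}%")
```

Under this assumed prior the posterior mean lies between the prior mean and the sample proportion, and the posterior standard deviation is smaller than the frequentist standard error, in line with the comparison made in the note.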

Clinical review

Statistics Notes

Survival probabilities (the Kaplan-Meier method)

J Martin Bland, Douglas G Altman

Public Health

special techniques because some observations are 1.0

Survival probability

Sciences, St

Georges Hospital censored as the event of interest has not occurred for all

Medical School, patients. For example, when patients are recruited over

London SW17 0RE 0.75

J Martin Bland,

two years one recruited at the end of the study may be

professor of medical alive at one year follow up, whereas one recruited at the

statistics start may have died after two years. The patient who died 0.5

ICRF Medical has a longer observed survival than the one who still

Statistics Group,

survives and whose ultimate survival time is unknown.

Centre for Statistics

in Medicine, The table shows data from a study of conception in 0.25

Institute of Health subfertile women.2 The event is conception, and

Sciences, Oxford

OX3 7LF women survived until they conceived. One woman

0

Douglas G Altman, conceived after 16 months (menstrual cycles), whereas 0 6 12 18 24

head several were followed for shorter time periods during Time (months)

Correspondence to: which they did not conceive; their time to conceive was Survival curve showing probability of not conceiving among 38

Professor Bland thus censored. subfertile women after laparoscopy and hydrotubation2

We wish to estimate the proportion surviving (not

BMJ 1998;317:1572

having conceived) by any given time, which is also the

estimated probability of survival to that time for a There are three assumptions in the above. Firstly,

member of the population from which the sample is we assume that at any time patients who are censored

drawn. Because of the censoring we use the have the same survival prospects as those who

continue to be followed. This assumption is not easily

Kaplan-Meier method. For each time interval we

testable. Censoring may be for various reasons. In the

estimate the probability that those who have survived

conception study some women had received hormone

to the beginning will survive to the end. This is a condi-

treatment to promote ovulation, and others had

tional probability (the probability of being a survivor at

stopped trying to conceive. Thus they were no longer

the end of the interval on condition that the subject

part of the population we wanted to study, and their

was a survivor at the beginning of the interval). Survival

survival times were censored. In most studies some

Time (months) to to any time point is calculated as the product of the

subjects drop out for reasons unrelated to the

conception or conditional probabilities of surviving each time

condition under study (for example, emigration) If,

censoring in 38 interval. These data are unusual in representing

sub-fertile women however, for some patients in this study censoring was

months (menstrual cycles); usually the conditional

after laparoscopy related to failure to conceive this would have biased the

probabilities relate to days. The calculations are simpli-

and hydrotubation2 estimated survival probabilities downwards.

fied by ignoring times at which there were no recorded

Secondly, we assume that the survival probabilities

Did not survival times (whether events or censored times).

Conceived conceive are the same for subjects recruited early and late in the

In the example, the probability of surviving for two

1 2 study. In a long term observational study of patients

months is the probability of surviving the first month

1 3 with cancer, for example, the case mix may change over

times the probability of surviving the second month

1 4 the period of recruitment, or there may be an innova-

given that the first month was survived. Of 38 women,

1 7 tion in ancillary treatment. This assumption may be

survived the first month, or 0.842. Of the 32 women at the start of the second month (at risk of conception), 27 had not conceived by the end of the month. The conditional probability of surviving the second month is thus 27/32 = 0.844, and the overall probability of surviving (not conceiving) after two months is 0.842 × 0.844 = 0.711. We continue in this way to the end of the table, or until we reach the last event. Observations censored at a given time affect the number still at risk at the start of the next month. The estimated probability changes only in months when there is a conception. In practice, a computer is used to do these calculations. Standard errors and confidence intervals for the estimated survival probabilities can be found by Greenwood's method.3 Survival probabilities are usually presented as a survival curve (figure). The curve is a step function, with sudden changes in the estimated probability corresponding to times at which an event was observed. The times of the censored data are indicated by short vertical lines.

tested, provided we have enough data to estimate survival curves for different subsets of the data.

Thirdly, we assume that the event happens at the time specified. This is not a problem for the conception data, but could be, for example, if the event were recurrence of a tumour which would be detected at a regular examination. All we would know is that the event happened between two examinations. This imprecision would bias the survival probabilities upwards. When the observations are at regular intervals this can be allowed for quite easily, using the actuarial method.3

Formal methods are needed for testing hypotheses about survival in two or more groups. We shall describe the logrank test for comparing curves and the more complex Cox regression model in future Notes.

1 Altman DG, Bland JM. Time to event (survival) data. BMJ 1997;317:468-9.

2 Luthra P, Bland JM, Stanton SL. Incidence of pregnancy after laparoscopy and hydrotubation. BMJ 1982;284:1013-4.

3 Parmar MKB, Machin D. Survival analysis: a practical approach. Chichester: Wiley, 37, 47-9.
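The month by month calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the starting figure of 38 women at risk is an assumption inferred from 32/38 = 0.842 rather than a number stated in this passage. Censoring is handled by the caller, who supplies the number still at risk at the start of each month.

```python
def product_limit(intervals):
    """Return the cumulative survival probability after each interval.

    Each item of `intervals` is a pair:
    (number at risk at the start of the interval, number of events in it).
    """
    survival = 1.0
    curve = []
    for at_risk, events in intervals:
        # Conditional probability of surviving this interval,
        # given survival up to its start.
        conditional = (at_risk - events) / at_risk
        # Overall probability is the running product of the conditionals.
        survival *= conditional
        curve.append(survival)
    return curve


# Month 1: 38 at risk (assumed, implied by 32/38 = 0.842), 6 conceptions.
# Month 2: 32 at risk, 5 conceptions, so 27 had not conceived.
months = [(38, 6), (32, 5)]
curve = product_limit(months)
print([round(p, 3) for p in curve])
```

With these inputs the two probabilities round to 0.842 and 0.711, the values quoted in the text.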

Education and debate

Statistics notes

Treatment allocation in controlled trials: why randomise?

Douglas G Altman, J Martin Bland

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

Correspondence to: Professor Altman.

BMJ 1999;318:1209

Since 1991 the BMJ has had a policy of not publishing trials that have not been properly randomised, except in rare cases where this can be justified.1 Why?

The simplest approach to evaluating a new treatment is to compare a single group of patients given the new treatment with a group previously treated with an alternative treatment. Usually such studies compare two consecutive series of patients in the same hospital(s). This approach is seriously flawed. Problems will arise from the mixture of retrospective and prospective studies, and we can never satisfactorily eliminate possible biases due to other factors (apart from treatment) that may have changed over time. Sacks et al compared trials of the same treatments in which randomised or historical controls were used and found a consistent tendency for historically controlled trials to yield more optimistic results than randomised trials.2 The use of historical controls can be justified only in tightly controlled situations of relatively rare conditions, such as in evaluating treatments for advanced cancer.

The need for contemporary controls is clear, but there are difficulties. If the clinician chooses which treatment to give each patient there will probably be differences in the clinical and demographic characteristics of the patients receiving the different treatments. Much the same will happen if patients choose their own treatment or if those who agree to have a treatment are compared with refusers. Similar problems arise when the different treatment groups are at different hospitals or under different consultants. Such systematic differences, termed bias, will lead to an overestimate or underestimate of the difference between treatments. Bias can be avoided by using random allocation.

A well known example of the confusion engendered by a non-randomised study was the study of the possible benefit of vitamin supplementation at the time of conception in women at high risk of having a baby with a neural tube defect.3 The investigators found that the vitamin group subsequently had fewer babies with neural tube defects than the placebo control group. The control group included women ineligible for the trial as well as women who refused to participate. As a consequence the findings were not widely accepted, and the Medical Research Council later funded a large randomised trial to answer the question in a way that would be widely accepted.4

The main reason for using randomisation to allocate treatments to patients in a controlled trial is to prevent biases of the types described above. We want to compare the outcomes of treatments given to groups of patients which do not differ in any systematic way. Another reason for randomising is that statistical theory is based on the idea of random sampling. In a study with random allocation the differences between treatment groups behave like the differences between random samples from a single population. We know how random samples are expected to behave and so can compare the observations with what we would expect if the treatments were equally effective.

The term random does not mean the same as haphazard but has a precise technical meaning. By random allocation we mean that each patient has a known chance, usually an equal chance, of being given each treatment, but the treatment to be given cannot be predicted. If there are two treatments the simplest method of random allocation gives each patient an equal chance of getting either treatment; it is equivalent to tossing a coin. In practice most people use either a table of random numbers or a random number generator on a computer. This is simple randomisation. Possible modifications include block randomisation, to ensure closely similar numbers of patients in each group, and stratified randomisation, to keep the groups balanced for certain prognostic patient characteristics. We discuss these extensions in a subsequent Statistics note.

Fifty years after the publication of the first randomised trial5 the technical meaning of the term randomisation continues to elude some investigators. Journals continue to publish "randomised" trials which are no such thing. One common approach is to allocate treatments according to the patient's date of birth or date of enrolment in the trial (such as giving one treatment to those with even dates and the other to those with odd dates), by the terminal digit of the hospital number, or simply alternately into the different treatment groups. While all of these approaches are in principle unbiased, being unrelated to patient characteristics, problems arise from the openness of the allocation system.1 Because the treatment is known when a patient is considered for entry into the trial this knowledge may influence the decision to recruit that patient and so produce treatment groups which are not comparable.

Of course, situations exist where randomisation is simply not possible.6 The goal here should be to retain all the methodological features of a well conducted randomised trial7 other than the randomisation.

1 Altman DG. Randomisation. BMJ 1991;302:1481-2.

2 Sacks H, Chalmers TC, Smith H. Randomized versus historical controls for clinical trials. Am J Med 1982;72:233-40.

3 Smithells RW, Sheppard S, Schorah CJ, Seller MJ, Nevin NC, Harris R, et al. Possible prevention of neural-tube defects by periconceptional vitamin supplementation. Lancet 1980;i:339-40.

4 MRC Vitamin Study Research Group. Prevention of neural tube defects: results of the Medical Research Council vitamin study. Lancet 1991;338:131-7.

5 Medical Research Council. Streptomycin treatment of pulmonary tuberculosis. BMJ 1948;2:769-82.

6 Black N. Why we need observational studies to evaluate the effectiveness of health care. BMJ 1996;312:1215-8.

7 Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT Statement. JAMA 1996;276:637-9.


Statistics notes

Variables and parameters

Douglas G Altman, J Martin Bland

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

Correspondence to: Professor Altman.

BMJ 1999;318:1667

Like all specialist areas, statistics has developed its own language. As we have noted before,1 much confusion may arise when a word in common use is also given a technical meaning. Statistics abounds in such terms, including normal, random, variance, significant, etc. Two commonly confused terms are variable and parameter; here we explain and contrast them.

Information recorded about a sample of individuals (often patients) comprises measurements such as blood pressure, age, or weight and attributes such as blood group, stage of disease, and diabetes. Values of these will vary among the subjects; in this context blood pressure, weight, blood group and so on are variables. Variables are quantities which vary from individual to individual.

By contrast, parameters do not relate to actual measurements or attributes but to quantities defining a theoretical model. The figure shows the distribution of measurements of serum albumin in 481 white men aged over 20 with mean 46.14 and standard deviation 3.08 g/l. For the empirical data the mean and SD are called sample estimates. They are properties of the collection of individuals. Also shown is the normal1 distribution which fits the data most closely. It too has mean 46.14 and SD 3.08 g/l. For the theoretical distribution the mean and SD are called parameters. There is not one normal distribution but many, called a family of distributions. Each member of the family is defined by its mean and SD, the parameters1 which specify the particular theoretical normal distribution with which we are dealing. In this case, they give the best estimate of the population distribution of serum albumin if we can assume that in the population serum albumin has a normal distribution.

[Figure: Measurements of serum albumin in 481 white men aged over 20 (data from Dr W G Miller). Frequency plotted against albumin (g/l), with the fitted normal distribution.]

Most statistical methods, such as t tests, are called parametric because they estimate parameters of some underlying theoretical distribution. Non-parametric methods, such as the Mann-Whitney U test and the log rank test for survival data, do not assume any particular family for the distribution of the data and so do not estimate any parameters for such a distribution.

Another use of the word parameter relates to its original mathematical meaning as the value(s) defining one of a family of curves. If we fit a regression model, such as that describing the relation between lung function and height, the slope and intercept of this line (more generally known as regression coefficients) are the parameters defining the model. They have no meaning for individuals, although they can be used to predict an individual's lung function from their height.

In some contexts parameters are values that can be altered to see what happens to the performance of some system. For example, the performance of a screening programme (such as positive predictive value or cost effectiveness) will depend on aspects such as the sensitivity and specificity of the screening test. If we look to see how the performance would change if, say, sensitivity and specificity were improved, then we are treating these as parameters rather than using the values observed in a real set of data.

Parameter is a technical term which has only recently found its way into general use, unfortunately without keeping its correct meaning. It is common in medical journals to find variables incorrectly called parameters (but not in the BMJ we hope2). Another common misuse of parameter is as a limit or boundary, as in "within certain parameters." This misuse seems to have arisen from confusion between parameter and perimeter.

Misuse of medical terms is rightly deprecated. Like other language errors it leads to confusion and the loss of valuable distinction. Misuse of non-medical terms should be viewed likewise.

1 Altman DG, Bland JM. The normal distribution. BMJ 1995;310:298.

2 Endpiece: What's a parameter? BMJ 1998;316:1877.
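The screening example, in which sensitivity and specificity are treated as parameters that can be altered, can be illustrated with a short Python sketch. It is a hedged illustration only: the function name, the prevalence of 1%, and the pairs of sensitivity and specificity are all hypothetical values chosen for the example, not figures from the note. The positive predictive value is obtained from Bayes' theorem.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of disease given a positive test (Bayes' theorem)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical prevalence of the condition in the screened population.
prevalence = 0.01

# Vary the two parameters and see how programme performance responds.
for sens, spec in [(0.80, 0.90), (0.90, 0.90), (0.90, 0.99)]:
    ppv = positive_predictive_value(sens, spec, prevalence)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.3f}")
```

Here sensitivity and specificity are inputs that we choose to vary, which is the sense of "parameter" described in the text, whereas values observed in a real data set would be sample estimates.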

Much work still needs to be done to achieve this. To be useful in health policy at this level, all the targets need to be elaborated further and clear, practical statements must be made on their operation, especially the four targets on health policy and sustainable health systems. The WHO should stimulate the discussion of these important targets, but it should also be careful about being too prescriptive about health systems since this could be counterproductive.

In addition, more attention should be given to the usefulness of the targets in member states. One way of doing this is to rank the countries by target and to divide them into three groups. A specific level could be set for each group. For example, for target 2, three such groups could be distinguished as follows:

- Countries that have already achieved this target
- Countries for which the global target is achievable and challenging
- Countries that find the global target hard to achieve and therefore demotivating.

The first group needs stricter target levels, and the third group less stringent ones. If a breakdown of this kind is made for each target, some countries may be classified in different groups for different targets. In this way, the targets will provide an insight into the health status of the population and could be useful for policy makers in member states in encouraging action and allocating their resources.

We thank Dr J Visschedijk and Professor L J Gunning-Schepers and other referees of this article for their helpful comments.

Funding: This study was commissioned by Policy Action Coordination at the WHO and supported by an unrestricted educational grant from Merck & Co Inc, New Jersey, USA.

Competing interests: None declared.

(Accepted 4 May 1999)

Statistics notes

How to randomise

Douglas G Altman, J Martin Bland

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

Correspondence to: Professor Altman.

BMJ 1999;319:703-4

We have explained why random allocation of treatments is a required feature of controlled trials.1 Here we consider how to generate a random allocation sequence.

Almost always patients enter a trial in sequence over a prolonged period. In the simplest procedure, simple randomisation, we determine each patient's treatment at random independently with no constraints. With equal allocation to two treatment groups this is equivalent to tossing a coin, although in practice coins are rarely used. Instead we use computer generated random numbers. Suitable tables can be found in most statistics textbooks. The table shows an example2: the numbers can be considered as either random digits from 0 to 9 or random integers from 0 to 99.

Excerpt from a table of random digits.2 The numbers used in the example are the ten entries in column 2 from the third row to the twelfth row

89 11 77 99 94
35 83 73 68 20
84 85 95 45 52
56 80 93 52 82
97 62 98 71 39
79 36 13 72 99
34 96 98 54 89
69 56 88 97 43
09 17 78 78 02
83 17 39 84 16
24 23 36 44 14
39 87 30 20 41
75 18 53 77 83
33 93 39 24 81
22 52 01 86 71

For equal allocation to two treatments we could take odd and even numbers to indicate treatments A and B respectively. We must then choose an arbitrary place to start and also the direction in which to read the table. The first 10 two digit numbers from a starting place in column 2 are 85 80 62 36 96 56 17 17 23 87, which translate into the sequence A B B B B B A A A A for the first 10 patients. We could instead have taken each digit on its own, or numbers 00 to 49 for A and 50 to 99 for B. There are countless possible strategies; it makes no difference which is used.

We can easily generalise the approach. With three groups we could use 01 to 33 for A, 34 to 66 for B, and 67 to 99 for C (00 is ignored). We could allocate treatments A and B in proportions 2 to 1 by using 01 to 66 for A and 67 to 99 for B.

At any point in the sequence the numbers of patients allocated to each treatment will probably differ, as in the above example. But sometimes we want to keep the numbers in each group very close at all times. Block randomisation (also called restricted randomisation) is used for this purpose. For example, if we consider subjects in blocks of four at a time there are only six ways in which two get A and two get B:

1: A A B B   2: A B A B   3: A B B A   4: B B A A   5: B A B A   6: B A A B

We choose blocks at random to create the allocation sequence. Using the single digits of the previous random sequence and omitting numbers outside the range 1 to 6 we get 5 6 2 3 6 6 5 6 1 1. From these we can construct the block allocation sequence B A B A / B A A B / A B A B / A B B A / B A A B, and so on. The numbers in the two groups at any time can never differ by more than half the block length. Block size is normally a multiple of the number of treatments. Large blocks are best avoided as they control balance less well. It is possible to vary the block length, again at random, perhaps using a mixture of blocks of size 2, 4, or 6.

While simple randomisation removes bias from the allocation procedure, it does not guarantee, for example, that the individuals in each group have a similar age distribution. In small studies especially some chance imbalance will probably occur, which might complicate the interpretation of results. We can use stratified randomisation to achieve approximate balance of important characteristics without sacrificing the advantages of randomisation. The method is to produce a separate block randomisation list for each subgroup (stratum). For example, in a study to compare treatments for breast cancer it might be important to stratify by menopausal status. Separate lists of random numbers should then be constructed for premenopausal and postmenopausal women. It is essential that stratified treatment allocation is based on block randomisation within each stratum rather than simple randomisation; otherwise there will be no control of balance of treatments within strata, so the object of stratification will be defeated.

Stratified randomisation can be extended to two or more stratifying variables. For example, we might want to extend the stratification in the breast cancer trial to tumour size and number of positive nodes. A separate randomisation list is needed for each combination of categories. If we had two tumour size groups (say <4 and >4 cm) and three groups for node involvement (0, 1-4, >4) as well as menopausal status, then we have 2 × 3 × 2 = 12 strata, which may exceed the limit of what is practical. Also with multiple strata some of the combinations of categories may be rare, so the intended treatment balance is not achieved.

In a multicentre study the patients within each centre will need to be randomised separately unless there is a central coordinated randomising service. Thus centre is a stratifying variable, and there may be other stratifying variables as well.

In small studies it is not practical to stratify on more than one or perhaps two variables, as the number of strata can quickly approach the number of subjects. When it is really important to achieve close similarity between treatment groups for several variables minimisation can be used; we discuss this method in a separate Statistics note.3

We have described the generation of a random sequence in some detail so that the principles are clear. In practice, for many trials the process will be done by computer. Suitable software is available at http://www.sghms.ac.uk/phs/staff/jmb/jmb.htm.

We shall also consider in a subsequent note the practicalities of using a random sequence to allocate treatments to patients.

1 Altman DG, Bland JM. Treatment allocation in controlled trials: why randomise? BMJ 1999;318:1209.

2 Altman DG. Practical statistics for medical research. London: Chapman and Hall, 1990: 540-4.

3 Treasure T, MacRae KD. Minimisation: the platinum standard for trials? BMJ 1998;317:362-3.
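The table based procedures described in this note can be sketched in Python as follows. This is a minimal illustration, not the authors' software: all function names are hypothetical, and the seed and the use of Python's `random` module for the stratified lists are assumptions made for the example. The first two steps use the ten numbers read from column 2 of the table, so they should reproduce the simple and blocked sequences given in the text.

```python
import random

# The ten two digit numbers read from column 2 of the table in the text.
TABLE_NUMBERS = [85, 80, 62, 36, 96, 56, 17, 17, 23, 87]

# The six blocks of four with two A and two B, numbered as in the text.
BLOCKS = {1: "AABB", 2: "ABAB", 3: "ABBA", 4: "BBAA", 5: "BABA", 6: "BAAB"}


def simple_allocation(numbers):
    """Simple randomisation: odd number gives A, even number gives B."""
    return ["A" if n % 2 == 1 else "B" for n in numbers]


def block_choices(numbers):
    """Split the numbers into single digits and keep only digits 1 to 6."""
    digits = [int(ch) for n in numbers for ch in f"{n:02d}"]
    return [d for d in digits if 1 <= d <= 6]


def block_allocation(choices):
    """Join the chosen blocks into one allocation sequence."""
    return [arm for choice in choices for arm in BLOCKS[choice]]


def stratified_lists(strata, blocks_per_stratum, rng):
    """Build one independent block randomisation list per stratum."""
    return {
        stratum: block_allocation(
            [rng.randint(1, 6) for _ in range(blocks_per_stratum)]
        )
        for stratum in strata
    }


simple = simple_allocation(TABLE_NUMBERS)
choices = block_choices(TABLE_NUMBERS)[:10]
blocked = block_allocation(choices[:5])
print(" ".join(simple))
print(choices)
print(" ".join(blocked))

# Stratified randomisation: the seed is arbitrary and only makes the
# illustration repeatable.
rng = random.Random(2024)
lists = stratified_lists(["premenopausal", "postmenopausal"], 3, rng)
for stratum, sequence in lists.items():
    print(stratum, " ".join(sequence))
```

Because every block contains two of each treatment, each stratum list is balanced after every fourth patient, which is the property that simple randomisation within strata would not guarantee.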

Generalisation of salt infusions

The subcutaneous infusion of salt solution has proved of great benefit in the treatment of collapse after severe operations. The practice, it may be said, developed from two sources: the new method of transfusion where water, instead of another person's blood, is injected into the patient's veins; and flushing of the peritoneum, introduced by Lawson Tait. After flushing, much of the fluid left in the peritoneum is absorbed into the circulation, greatly to the patient's advantage. Dr. Clement Penrose has tried the effect of subcutaneous salt infusions as a last extremity in severe cases of pneumonia. He continues this treatment with inhalations of oxygen. He has had experience of three cases, all considered hopeless, and succeeded in saving one. In the other two the prolongation of life and the relief of symptoms were so marked that Dr. Penrose regretted that the treatment had not been employed earlier. Several physicians have adopted Dr. Penrose's method, and with the most gratifying results. The cases are reported fully in the Bulletin of the Johns Hopkins Hospital for July last. The infusions of salt solution were administered just as after an operation. The salt solution, at a little above body temperature, is poured into a graduated bottle connected by a rubber tube with a needle. The pressure is regulated by elevating the bottle, or by means of a rubber bulb with valves; the needle is introduced into the connective tissue under the breast or under the integuments of the thighs. There can be no doubt that subcutaneous saline infusions are increasing in popularity, and little doubt that their use will be greatly extended in medicine as well as surgery.

(BMJ 1899;ii:933)


Statistics Notes

The odds ratio

J Martin Bland, Douglas G Altman

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: Professor Bland

BMJ 2000;320:1468

In recent years odds ratios have become widely used in medical reports; almost certainly some will appear in today's BMJ. There are three reasons for this. Firstly, they provide an estimate (with confidence interval) for the relationship between two binary ("yes or no") variables. Secondly, they enable us to examine the effects of other variables on that relationship, using logistic regression. Thirdly, they have a special and very convenient interpretation in case-control studies (dealt with in a future note).

The odds are a way of representing probability, especially familiar for betting. For example, the odds that a single throw of a die will produce a six are 1 to 5, or 1/5. The odds is the ratio of the probability that the event of interest occurs to the probability that it does not. This is often estimated by the ratio of the number of times that the event of interest occurs to the number of times that it does not. The table shows data from a cross sectional study showing the prevalence of hay fever and eczema in 11 year old children.1 The probability that a child with eczema will also have hay fever is estimated by the proportion 141/561 (25.1%). The odds is estimated by 141/420. Similarly, for children without eczema the probability of having hay fever is estimated by 928/14 453 (6.4%) and the odds is 928/13 525. We can compare the groups in several ways: by the difference between the proportions, 141/561 − 928/14 453 = 0.187 (or 18.7 percentage points); the ratio of the proportions, (141/561)/(928/14 453) = 3.91 (also called the relative risk); or the odds ratio, (141/420)/(928/13 525) = 4.89.

Association between hay fever and eczema in 11 year old children1

                 Hay fever
Eczema      Yes       No     Total
Yes         141      420       561
No          928   13 525    14 453
Total      1069   13 945    15 014

Now, suppose we look at the table the other way round, and ask what is the probability that a child with hay fever will also have eczema? The proportion is 141/1069 (13.2%) and the odds is 141/928. For a child without hay fever, the proportion with eczema is 420/13 945 (3.0%) and the odds is 420/13 525. Comparing the proportions this way, the difference is 141/1069 − 420/13 945 = 0.102 (or 10.2 percentage points); the ratio (relative risk) is (141/1069)/(420/13 945) = 4.38; and the odds ratio is (141/928)/(420/13 525) = 4.89. The odds ratio is the same whichever way round we look at the table, but the difference and ratio of proportions are not. It is easy to see why this is. The two odds ratios are

(141/420)/(928/13 525) and (141/928)/(420/13 525)

which can both be rearranged to give

(141 × 13 525)/(928 × 420)

If we switch the order of the categories in the rows and the columns, we get the same odds ratio. If we switch the order for the rows only or for the columns only, we get the reciprocal of the odds ratio, 1/4.89 = 0.204. These properties make the odds ratio a useful indicator of the strength of the relationship.

The sample odds ratio is limited at the lower end, since it cannot be negative, but not at the upper end, and so has a skew distribution. The log odds ratio,2 however, can take any value and has an approximately Normal distribution. It also has the useful property that if we reverse the order of the categories for one of the variables, we simply reverse the sign of the log odds ratio: log(4.89) = 1.59, log(0.204) = −1.59.

We can calculate a standard error for the log odds ratio and hence a confidence interval. The standard error of the log odds ratio is estimated simply by the square root of the sum of the reciprocals of the four frequencies. For the example,

SE(loge OR) = sqrt(1/141 + 1/420 + 1/928 + 1/13 525) = 0.103

A 95% confidence interval for the log odds ratio is obtained as 1.96 standard errors on either side of the estimate. For the example, the log odds ratio is loge(4.89) = 1.588 and the confidence interval is 1.588 ± 1.96 × 0.103, which gives 1.386 to 1.790. We can antilog these limits to give a 95% confidence interval for the odds ratio itself,2 as exp(1.386) = 4.00 to exp(1.790) = 5.99. The observed odds ratio, 4.89, is not in the centre of the confidence interval because of the asymmetrical nature of the odds ratio scale. For this reason, in graphs odds ratios are often plotted using a logarithmic scale. The odds ratio is 1 when there is no relationship. We can test the null hypothesis that the odds ratio is 1 by the usual χ² test for a two by two table.

Despite their usefulness, odds ratios can cause difficulties in interpretation.3 We shall review this debate and also discuss odds ratios in logistic regression and case-control studies in future Statistics Notes.

We thank Barbara Butland for providing the data.

1 Strachan DP, Butland BK, Anderson HR. Incidence and prognosis of asthma and wheezing illness from early childhood to age 33 in a national British cohort. BMJ 1996;312:1195-9.

2 Bland JM, Altman DG. Transforming data. BMJ 1996;312:770.

3 Sackett DL, Deeks JJ, Altman DG. Down with odds ratios! Evidence-Based Med 1996;1:164-6.
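The odds ratio, its standard error on the log scale, and the 95% confidence interval described in this note can be computed with a short Python sketch. The function name and argument order are hypothetical; the four frequencies are those of the hay fever and eczema table, and the multiplier 1.96 is the one used in the text.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and confidence interval for a two by two table.

    a, b are the frequencies in the first row and c, d those in the
    second row, so the odds ratio is (a/b)/(c/d) = (a*d)/(b*c).
    """
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    # Standard error of the log odds ratio: square root of the sum of
    # the reciprocals of the four frequencies.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Interval on the log scale, then antilogged.
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, se_log_or, lower, upper


# Eczema yes: 141 with hay fever, 420 without.
# Eczema no: 928 with hay fever, 13 525 without.
odds_ratio, se_log_or, lower, upper = odds_ratio_ci(141, 420, 928, 13525)
print(f"OR={odds_ratio:.2f} SE(log OR)={se_log_or:.3f} "
      f"95% CI {lower:.2f} to {upper:.2f}")

# Swapping the rows (or the columns) gives the reciprocal odds ratio.
swapped, _, _, _ = odds_ratio_ci(928, 13525, 141, 420)
print(f"reciprocal OR={swapped:.3f}")
```

The results should agree with the figures quoted in the text to the rounding shown there. Working on the log scale and antilogging the limits is what makes the interval asymmetrical about the estimate.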


Statistics Notes

Blinding in clinical trials and other studies

Simon J Day, Douglas G Altman

Leo Pharmaceuticals, Princes Risborough, Buckinghamshire HP27 9RR
Simon J Day, manager, clinical biometrics

ICRF Medical Statistics Group, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: S J Day

BMJ 2000;321:504

Human behaviour is influenced by what we know or believe. In research there is a particular risk of expectation influencing findings, most obviously when there is some subjectivity in assessment, leading to biased results. Blinding (sometimes called masking) is used to try to eliminate such bias.

It is a tenet of randomised controlled trials that the treatment allocation for each patient is not revealed until the patient has irrevocably been entered into the trial, to avoid selection bias. This sort of blinding, better referred to as allocation concealment, will be discussed in a future statistics note. In controlled trials the term blinding, and in particular "double blind," usually refers to keeping study participants, those involved with their management, and those collecting and analysing clinical data unaware of the assigned treatment, so that they should not be influenced by that knowledge.

The relevance of blinding will vary according to circumstances. Blinding patients to the treatment they have received in a controlled trial is particularly important when the response criteria are subjective, such as alleviation of pain, but less important for objective criteria, such as death. Similarly, medical staff caring for patients in a randomised trial should be blinded to treatment allocation to minimise possible bias in patient management and in assessing disease status. For example, the decision to withdraw a patient from a study or to adjust the dose of medication could easily be influenced by knowledge of which treatment group the patient has been assigned to.

In a double blind trial neither the patient nor the caregivers are aware of the treatment assignment. Blinding means more than just keeping the name of the treatment hidden. Patients may well see the treatment being given to patients in the other treatment group(s), and the appearance of the drug used in the study could give a clue to its identity. Differences in taste, smell, or mode of delivery may also influence efficacy, so these aspects should be identical for each treatment group. Even colour of medication has been shown to influence efficacy.1

In studies comparing two active compounds, blinding is possible using the "double dummy" method. For example, if we want to compare two medicines, one presented as green tablets and one as pink capsules, we could also supply green placebo tablets and pink placebo capsules so that both groups of patients would take one green tablet and one pink capsule.

Blinding is certainly not always easy or possible. Single blind trials (where either only the investigator or only the patient is blind to the allocation) are sometimes unavoidable, as are open (non-blind) trials. In trials of different styles of patient management, surgical procedures, or alternative therapies, full blinding is often impossible.

In a double blind trial it is implicit that the assessment of patient outcome is done in ignorance of the treatment received. Such blind assessment of outcome can often also be achieved in trials which are open (non-blinded). For example, lesions can be photographed before and after treatment and assessed by someone not involved in running the trial. Indeed, blind assessment of outcome may be more important than blinding the administration of the treatment, especially when the outcome measure involves subjectivity. Despite the best intentions, some treatments have unintended effects that are so specific that their occurrence will inevitably identify the treatment received to both the patient and the medical staff. Blind assessment of outcome is especially useful when this is a risk.

In epidemiological studies it is preferable that the identification of "cases" as opposed to "controls" be kept secret while researchers are determining each subject's exposure to potential risk factors. In many such studies blinding is impossible because exposure can be discovered only by interviewing the study participants, who obviously know whether or not they are a case. The risk of differential recall of important disease related events between cases and controls must then be recognised and if possible investigated.2 As a minimum the sensitivity of the results to differential recall should be considered. Blinded assessment of patient outcome may also be valuable in other epidemiological studies, such as cohort studies.

Blinding is important in other types of research too. For example, in studies to evaluate the performance of a diagnostic test those performing the test must be unaware of the true diagnosis. In studies to evaluate the reproducibility of a measurement technique the observers must be unaware of their previous measurement(s) on the same individual.

We have emphasised the risks of bias if adequate blinding is not used. This may seem to be challenging the integrity of researchers and patients, but bias associated with knowing the treatment is often subconscious. On average, randomised trials that have not used appropriate levels of blinding show larger treatment effects than blinded studies.3 Similarly, diagnostic test performance is overestimated when the reference test is interpreted with knowledge of the test result.4 Blinding makes it difficult to bias results intentionally or unintentionally and so helps ensure the credibility of study conclusions.

1 De Craen AJM, Roos PJ, de Vries AL, Kleijnen J. Effect of colour of drugs: systematic review of perceived effect of drugs and their effectiveness. BMJ 1996;313:1624-6.

2 Barry D. Differential recall bias and spurious associations in case/control studies. Stat Med 1996;15:2603-16.

3 Schulz KF, Chalmers I, Hayes R, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12.

4 Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061-6.

Irrationality, the market, and quality of care

Consider the irrationality of a person who pays extra so as not to share a hotel room with a colleague while on a business trip. He does this because he values privacy but he also scoffs at taking out long term care insurance to guarantee a private room in a nursing home. Why is he willing to risk sharing a room for the rest of his life with a person he does not like? This common irrationality is often masked by rationalisations such as "I would rather die than have to live in a nursing home." Yet we know that when the time comes most prefer the limited pleasures of life in a nursing home to suicide

[...] their feet. There are even more fundamental reasons why depending on the rationality of the market will never work well for quality of care (box). Sensible policy for providing nursing home care requires a larger welfare state, a larger regulatory state, and encouragement of public, non-profit providers. Australia's recent experience shows that to head in the opposite direction is medically, economically, and politically irrational.

1 McCallum J, Geiselhart K. Australia's new aged: issues for young and old. Sydney: Allen and Unwin, 1996.

2 Osborne D, Gaebler T. Reinventing government. New York: Addison-Wesley, 1992.

3 Jost T. The necessary and proper role of regulation to assure the quality of health care. Houston Law Review 1988;25:525-98.

4 Tingle L. Moran the big winner as aged care goes private. Sydney Morning Herald 16 March 2001:2.

5 Lohr R, Head M. Kerosene baths reveal systemic aged care crisis in Australia. World Socialist Web Site. www.wsws.org/articles/2000/mar2000/aged-m10.shtml (accessed 10 Mar 2000).

6 Jenkins A, Braithwaite J. Profits, pressure and corporate lawbreaking. Crime, Law and Social Change 1993;20:221-32.

7 Braithwaite J, Makkai T, Braithwaite V, Gibson D. Raising the standard: resident centred nursing home regulation in Australia. Canberra: Department of Community Services and Health, 1993.

8 Braithwaite J, Braithwaite V. The politics of legalism: rules versus standards in nursing home regulation. Social and Legal Studies 1995;4:307-41.

9 Black J. Rules and regulators. Oxford: Clarendon Press, 1997.

10 Braithwaite J, Makkai T. Can resident-centred inspection of nursing homes work with very sick residents? Health Policy 1993;24:19-33.

11 Makkai T, Braithwaite J. Praise, pride and corporate compliance. Int J Sociology Law 1993;21:73-91.

12 Braithwaite J. Restorative justice and responsive regulation. New York: Oxford University Press (in press).

13 McKibbin H. Accreditation: the on-site audit. The Standard (Newsletter of the Aged Care Standards Agency) 1999;2(2):2.

14 Power M. The audit society. Oxford: Oxford University Press, 1997.

Statistics Notes

Concealing treatment allocation in randomised trials

Douglas G Altman, Kenneth F Schulz

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Family Health International, PO Box 13950, Research Triangle Park, NC 27709, USA
Kenneth F Schulz, vice president, Quantitative Sciences

Correspondence to: D G Altman

BMJ 2001;323:446-7

We have previously explained why random allocation of treatments is a required design feature of controlled trials1 and explained how to generate a random allocation sequence.2 Here we consider the importance of concealing the treatment allocation until the patient is entered into the trial.

Regardless of how the allocation sequence has been generated (such as by simple or stratified randomisation2) there will be a prespecified sequence of treatment allocations. In principle, therefore, it is possible to know what treatment the next patient will get at the time when a decision is taken to consider the patient for entry into the trial.

The strength of the randomised trial is based on aspects of design which eliminate various types of bias. Randomisation of patients to treatment groups eliminates bias by making the characteristics of the patients in two (or more) groups the same on average, and stratification with blocking may help to reduce chance imbalance in a particular trial.2 All this good work can be undone if a poor procedure is adopted to implement the allocation sequence. In any trial one or more people must determine whether each patient is eligible for the trial, decide whether to invite the patient to participate, explain the aims of the trial and the details of the treatments, and, if the patient agrees to participate, determine what treatment he or she will receive.

Suppose it is clear which treatment a patient will receive if he or she enters the trial (perhaps because there is a typed list showing the allocation sequence). Each of the above steps may then be compromised because of conscious or subconscious bias. Even when the sequence is not easily available, there is strong anecdotal evidence of frequent attempts to discover the sequence through a combination of a misplaced belief that this will be beneficial to patients and lack of understanding of the rationale of randomisation.3

How can the allocation sequence be concealed? Firstly, the person who generates the allocation sequence should not be the person who determines eligibility and entry of patients. Secondly, if possible the mechanism for treatment allocation should use people not involved in the trial. A common procedure, especially in larger trials, is to use a central telephone randomisation system. Here patient details are supplied, eligibility confirmed, and the patient entered into the trial before the treatment allocation is divulged (and it may still be blinded4). Another excellent allocation concealment mechanism, common in drug trials, is to get the allocation done by a pharmacy. The interventions are sealed in serially numbered containers (usually bottles) of equal appearance and weight according to the allocation sequence.

If external help is not available the only other system that provides a plausible defence against allocation bias is to enclose assignments in serially numbered, opaque, sealed envelopes. Apart from neglecting to mention opacity, this is the method used in the famous 1948 streptomycin trial (see box). This [...] poorly executed. However, with care, it can be a good mechanism for concealing allocation. We recommend that investigators ensure that the envelopes are opened sequentially, and only after the participant's name and other details are written on the appropriate envelope.3 If possible, that information should also be transferred to the assigned allocation by using pressure sensitive paper or carbon paper inside the envelope. If an investigator cannot use numbered containers, envelopes represent the best available allocation concealment mechanism without involving outside parties, and may sometimes be the only feasible option. We suspect, however, that in years to come we will see greater use of external "third party" randomisation.

[...] recognised in the streptomycin trial5 (see box). Yet the importance of this key element of a randomised trial has not been widely recognised. Empirical evidence of the bias associated with failure to conceal the allocation6 7 and explicit requirement to discuss this issue in the CONSORT statement8 seem to be leading to wider recognition that allocation concealment is an essential aspect of a randomised trial.

Allocation concealment is completely different from (double) blinding.4 It is possible to conceal the randomisation in every randomised trial. Also, allocation concealment seeks to eliminate selection bias (who gets into the trial and the treatment they are assigned). By contrast, blinding relates to what happens after randomisation, is not possible in all trials, and seeks to reduce ascertainment bias (assessment of outcome).

Box: Description of treatment allocation in the MRC streptomycin trial5

Determination of whether a patient would be treated by streptomycin and bed-rest (S case) or by bed-rest alone (C case) was made by reference to a statistical series based on random sampling numbers drawn up for each sex at each centre by Professor Bradford Hill; the details of the series were unknown to any of the investigators or to the co-ordinator and were contained in a set of sealed envelopes, each bearing on the outside only the name of the hospital and a number. After acceptance of a patient by the panel, and before admission to the streptomycin centre, the appropriate numbered envelope was opened at the central office; the card inside told if the patient was to be an S or a C case, and this information was then given to the medical officer of the centre.

1 Altman DG, Bland JM. Treatment allocation in controlled trials: why randomise? BMJ 1999;318:1209.

2 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-4.

3 Schulz KF. Subverting randomization in controlled trials. JAMA 1995;274:1456-8.

4 Day SJ, Altman DG. Blinding in clinical trials and other studies. BMJ 2000;321:504.

5 Medical Research Council. Streptomycin treatment of pulmonary tuberculosis: a Medical Research Council investigation. BMJ 1948;2:769-82.

6 Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 1998;352:609-13.

7 Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12.

8 Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996;276:637-9.
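The central randomisation procedure described in this note (patient details supplied, eligibility confirmed, patient entered, and only then the allocation divulged) can be sketched in code. The following Python is a toy illustration under stated assumptions, not a description of any real trial system: the names `make_sequence` and `CentralAllocator`, the block of four, and the seed are all hypothetical choices made here.

```python
import random


def make_sequence(n_blocks, block=("A", "A", "B", "B"), seed=2001):
    """Blocked allocation sequence, prepared by someone who does not recruit patients.

    The block contents and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        shuffled = list(block)
        rng.shuffle(shuffled)
        sequence.extend(shuffled)
    return sequence


class CentralAllocator:
    """Toy central office: holds the sequence and reveals one allocation per entry."""

    def __init__(self, sequence):
        self._sequence = list(sequence)   # never shown to recruiting staff
        self.log = []                     # (patient_id, treatment) in order of entry

    def enrol(self, patient_id, eligible):
        # Eligibility must be confirmed before anything about the allocation is revealed.
        if not eligible:
            raise ValueError("eligibility must be confirmed before allocation")
        if any(pid == patient_id for pid, _ in self.log):
            raise ValueError("patient has already been entered")
        if len(self.log) >= len(self._sequence):
            raise RuntimeError("allocation sequence exhausted")
        # The patient is irrevocably logged against the next slot in the sequence,
        # and only then is the treatment returned to the caller.
        treatment = self._sequence[len(self.log)]
        self.log.append((patient_id, treatment))
        return treatment


office = CentralAllocator(make_sequence(n_blocks=2))
first = office.enrol("P001", eligible=True)
print("P001 allocated to", first)
```

The design point is that the recruiter never sees the sequence, and the log records each patient against a position in it, so allocations cannot be skipped or reordered without leaving a trace.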

The bread and butter of public health on call is identifying contacts in the case of suspected meningococcal disease. On the whole this is straightforward but can occasionally cause difficulties. Most areas that I have worked in include several universities, and during October it is common to experience the problem of contact tracing in the student population.

There are two main problems. The first is how to define household contacts when the index patient lives in a hall of residence containing several hundred students. Finding the appropriate university protocol and not being too concerned about the different approaches adopted by neighbouring universities can reduce the number of sleepless nights. The second problem is harder. Close kissing contacts among 18 year olds who have been set free from parental control for the first time is a minefield. My experience suggests that it is best to assume there will be lots and that names and contact details will not necessarily have been obtained. By the end of a weekend on call, you will feel like a cross between a detective and an agony aunt.

One year I volunteered to cover Christmas weekend in the belief that at least the students would be gone by then. I could not have been more mistaken. To add a further difficulty, the index patient presented to hospital on the night of the last day of term, and all contacts had already set off to the far reaches of the country. I could not believe my luck when the friend accompanying the patient produced both their mobile phones and confidently reassured me that between the two of them they would have the mobile numbers of all 15 household contacts. She was right, and in just over two hours all of them had been contacted.

There has been much coverage in the medical and popular press about the potential health hazards of mobile phones, and if these fears are realised the 100% ownership among this small sample of students is worrying. However, in terms of contact tracing for suspected meningococcal disease, mobile phones have potential health benefits not just for their owners but also for the mental health of public health doctors. Of course, this may not solve the close kissing contact problem.

Debbie Lawlor, senior lecturer in epidemiology and public health, University of Bristol

We welcome articles up to 600 words on topics such as A memorable patient, A paper that changed my practice, My most unfortunate mistake, or any other piece conveying instruction, pathos, or humour. If possible the article should be supplied on a disk. Permission is needed from the patient or a relative if an identifiable patient is referred to. We also welcome contributions for "Endpieces," consisting of quotations of up to 80 words (but most are considerably shorter) from any source, ancient or modern, which have appealed to the reader.

Statistics Notes

Analysing controlled trials with baseline and follow up measurements

Andrew J Vickers, Douglas G Altman

Integrative Medicine Service, Biostatistics Service, Memorial Sloan-Kettering Cancer Center, New York, New York 10021, USA
Andrew J Vickers, assistant attending research methodologist

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: Dr Vickers, vickersa@mskcc.org

BMJ 2001;323:1123-4

In many randomised trials researchers measure a continuous variable at baseline and again as an outcome assessed at follow up. Baseline measurements are common in trials of chronic conditions where researchers want to see whether a treatment can reduce pre-existing levels of pain, anxiety, hypertension, and the like.

Statistical comparisons in such trials can be made in several ways. Comparison of follow up (post-treatment) scores will give a result such as "at the end of the trial, mean pain scores were 15 mm (95% confidence interval 10 to 20 mm) lower in the treatment group." Alternatively a change score can be calculated by subtracting the follow up score from the baseline score, leading to a statement such as "pain reductions were 20 mm (16 to 24 mm) greater on treatment than control." If the average baseline scores are the same in each group the estimated treatment effect will be the same using these two simple approaches. If the treatment is effective the statistical significance of the treatment effect by the two methods will depend on the correlation between baseline and follow up scores. If the correlation is low using the change score will add variation and the follow up score is more likely to show a significant result. Conversely, if the correlation is high using only the follow up score will lose information and the change score is more likely to be significant. It is incorrect, however, to choose whichever analysis gives a more significant finding. The method of analysis should be specified in the trial protocol.

Some use change scores to take account of chance imbalances at baseline between the treatment groups. However, analysing change does not control for baseline imbalance because of regression to the mean1 2: baseline values are negatively correlated with change because patients with low scores at baseline generally improve more than those with high scores. A better approach is to use analysis of covariance (ANCOVA), which, despite its name, is a regression method.3 In effect two parallel straight lines (linear regression) are obtained relating outcome score to baseline score in each group. They can be summarised as a single regression equation:

follow up score = constant + a × baseline score + b × group

where a and b are estimated coefficients and group is a binary variable coded 1 for treatment and 0 for control. The coefficient b is the effect of interest: the estimated difference between the two treatment groups. In effect an analysis of covariance adjusts each patient's follow up score for his or her baseline score, but has the advantage of being unaffected by baseline differences. If, by chance, baseline scores are worse in the treatment group, the treatment effect will be underestimated by a follow up score analysis and overestimated by looking at change scores (because of regression to the mean). By contrast, analysis of covariance gives the same answer whether or not there is baseline imbalance.

As an illustration, Kleinhenz et al randomised 52 patients with shoulder pain to either true or sham acupuncture.4 Patients were assessed before and after treatment using a 100 point rating scale of pain and function, with lower scores indicating poorer outcome. There was an imbalance between groups at baseline, with better scores in the acupuncture group (see table). Analysis of post-treatment scores is therefore biased. The authors analysed change scores, but as baseline and change scores are negatively correlated (about r = −0.25 within groups) this analysis underestimates the effect of acupuncture. From analysis of covariance we get:

follow up score = 24 + 0.71 × baseline score + 12.7 × group

(see figure). The coefficient for group (b) has a useful interpretation: it is the difference between the mean change scores of each group. In the above example it can be interpreted as "pain and function score improved by an estimated 12.7 points more on average in the treatment group than in the control group." A 95% confidence interval and P value can also be calculated for b (see table).5 The regression equation provides a means of prediction: a patient with a baseline score of 50, for example, would be predicted to have a follow up score of 72.2 on treatment and 59.5 on control.

Table: Pain scores (mean and SD)

                  Placebo group   Acupuncture group   Difference between
                  (n=27)          (n=25)              means (95% CI)        P value
Baseline          53.9 (14)       60.4 (12.3)         6.5
Analysis:
  Follow up       62.3 (17.9)     79.6 (17.1)         17.3 (7.5 to 27.1)    0.0008
  Change score*   8.4 (14.6)      19.2 (16.1)         10.8 (2.3 to 19.4)    0.014
  ANCOVA                                              12.7 (4.1 to 21.3)    0.005

*Analysis reported by authors.4

[Figure: post-treatment score plotted against pretreatment score, both on axes from 20 to 100, for the acupuncture and placebo groups.] Pretreatment and post-treatment scores in each group showing fitted lines. Squares show mean values for the two groups. The estimated difference between the groups from analysis of covariance is the vertical distance between the two lines.

An additional advantage of analysis of covariance is that it generally has greater statistical power to detect a treatment effect than the other methods.6 For example, a trial with a correlation between baseline and follow up scores of 0.6 that required 85 patients for analysis of follow up scores, would require 68 for a change score analysis but only 54 for analysis of covariance.

The efficiency gains of analysis of covariance compared with a change score are low when there is a high correlation (say r > 0.8) between baseline and follow up measurements. This will often be the case, particularly in stable chronic conditions such as obesity. In these [...] reasonable alternative, particularly if restricted randomisation is used to ensure baseline comparability between groups.7 Analysis of covariance is the preferred general approach, however.

As with all analyses of continuous data, the use of analysis of covariance depends on some assumptions that need to be tested. In particular, data transformation, such as taking logarithms, may be indicated.8 Lastly, analysis of covariance is a type of multiple regression and can be seen as a special type of adjusted analysis. The analysis can thus be expanded to include additional prognostic variables (not necessarily continuous), such as age and diagnostic group.

We thank Dr J Kleinhenz for supplying the raw data from his study.

1 Bland JM, Altman DG. Regression towards the mean. BMJ 1994;308:1499.

2 Bland JM, Altman DG. Some examples of regression towards the mean. BMJ 1994;309:780.

3 Senn S. Baseline comparisons in randomized clinical trials. Stat Med 1991;10:1157-9.

4 Kleinhenz J, Streitberger K, Windeler J, Gussbacher A, Mavridis G, Martin E. Randomised clinical trial comparing the effects of acupuncture and a newly designed placebo needle in rotator cuff tendonitis. Pain 1999;83:235-41.

5 Altman DG, Gardner MJ. Regression and correlation. In: Altman DG, Machin D, Bryant TN, Gardner MJ, eds. Statistics with confidence. 2nd ed. London: BMJ Books, 2000:73-92.

6 Vickers AJ. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med Res Methodol 2001;1:16.

7 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-4.

8 Bland JM, Altman DG. The use of transformation when comparing two means. BMJ 1996;312:1153.
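The three analyses in the acupuncture example can be related to one another with the summary figures in the table. The sketch below is an illustration rather than a reanalysis of the trial data: it uses only the published means and the rounded coefficients from the regression equation, and the function names are chosen here. It relies on a general property of this model, namely that the estimated group effect equals the difference in follow up means minus the slope times the difference in baseline means; a slope of 0 gives the follow up comparison and a slope of 1 gives the change score comparison. The sample size factors 2(1 − r) and 1 − r² are the usual large sample variance ratios, assuming equal variance at baseline and follow up; that assumption is made here and is not stated in the text.

```python
# Group means from the table (placebo, acupuncture).
baseline_mean = {"placebo": 53.9, "acupuncture": 60.4}
follow_up_mean = {"placebo": 62.3, "acupuncture": 79.6}

# Rounded coefficients from the fitted equation in the text.
CONSTANT, SLOPE, GROUP_EFFECT = 24.0, 0.71, 12.7


def adjusted_difference(slope):
    """Treatment effect after adjusting follow up means for baseline with a given slope.

    slope = 0 -> follow up scores only; slope = 1 -> change scores;
    slope = fitted regression slope -> analysis of covariance.
    """
    follow_up_diff = follow_up_mean["acupuncture"] - follow_up_mean["placebo"]
    baseline_diff = baseline_mean["acupuncture"] - baseline_mean["placebo"]
    return follow_up_diff - slope * baseline_diff


def predicted_follow_up(baseline, group):
    """Prediction from the fitted equation; group is 1 for treatment, 0 for control."""
    return CONSTANT + SLOPE * baseline + GROUP_EFFECT * group


def relative_sample_sizes(n_follow_up, r):
    """Patients needed for change score and ANCOVA, relative to a follow up analysis.

    Assumes equal variances at baseline and follow up (an assumption made here).
    """
    n_change = n_follow_up * 2 * (1 - r)
    n_ancova = n_follow_up * (1 - r ** 2)
    return round(n_change), round(n_ancova)


print("follow up only :", round(adjusted_difference(0.0), 1))
print("change score   :", round(adjusted_difference(1.0), 1))
print("ANCOVA         :", round(adjusted_difference(SLOPE), 1))
print("predicted, baseline 50:",
      round(predicted_follow_up(50, 1), 1), "on treatment,",
      round(predicted_follow_up(50, 0), 1), "on control")
print("sample sizes for r = 0.6:", relative_sample_sizes(85, 0.6))
```

Because the acupuncture group started with better baseline scores, the follow up comparison gives the largest estimate and the change score the smallest, with the analysis of covariance estimate lying between them.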

A memorable patient

Informed consent

I first met Ivy three years ago when she came for her 29th oesophageal dilatation. She was an 86 year old spinster, deaf without speech from childhood, and the only sign language she knew was thumbs up, which she would use for saying good morning or for showing happiness. She had no next of kin and had lived in a residential home for the past 50 years. She developed a benign oesophageal stricture in 1992 and came to the endoscopy unit for repeated dilatations. The carers in the residential home used to say that she enjoyed her days out at the endoscopy unit.

We would explain the procedure to her in sign language. She would use the thumbs up sign and make a cross on the dotted line on the consent form. She would enter the endoscopy room smiling, put her left arm out to be cannulated, turn to her left side for endoscopy, and when fully awake would show her thumbs up again. Every time after her dilatation the nursing staff would question why an expandable oesophageal stent was not being considered. We would conclude that the indications for an expandable stent in benign strictures are not well established.

Her need for dilatation was becoming more frequent, and so on her 46th dilatation we decided to refer her to our regional centre for the insertion of a stent. She had an expandable stent inserted, and in his report the endoscopist mentioned the risk of the stent migrating down in the stomach beyond the stricture. Six weeks later she developed a bolus obstruction. At endoscopy it was noted that the stent had indeed migrated down. She consented to another stent. Four weeks later she had another bolus obstruction that could not be completely removed at the first attempt, and she was brought back the following day for removal of the bolus by endoscopy.

She came to the endoscopy room but did not have her familiar smile. She looked around for a minute, got off her trolley, and walked out. Everyone in the endoscopy room understood that she was trying to say, "I've had enough."

She did not come back for a repeat endoscopy, and she stayed nil by mouth on intravenous fluids. Two weeks later she died of an aspiration pneumonia. We think she understood all the procedures she had agreed to. We also think it was informed consent. I hope we were right. She gave us a very clear message without saying a word on her last visit to the endoscopy room.

Do we really understand what aphasic patients are trying to tell us when we get informed consent for invasive procedures? We should try to read the non-verbal messages very carefully.

I Tiwari, associate specialist in gastroenterology, Broomfield Hospital, Chelmsford

The quality and reliability of health information on the internet remains of paramount concern in Europe, as elsewhere. Self regulatory codes of ethics for health websites abound, yet the quality and practices of many are highly questionable.

Little progress seems to have been made, moreover, in assuring consumers that the information they share with health websites will not be misused. Several US studies have already concluded that websites' privacy practices do not match their proclaimed policies.5 In an attempt to counter this erosion of trust in Europe, the European Commission's guidelines for quality criteria for health related websites have recognised that there is no shortage of legislation in the field of privacy and security.6 They have drawn specific attention to a new recommendation regarding online data collection adopted in May 2001 that explains how European directives on issues such as data protection should be applied to the most common processing tasks carried out via the internet.7

The challenge facing Europe's health professionals and policymakers is to carefully craft the development of new approaches to the supervision of medical and pharmaceutical practice. Their ultimate goal is to raise consumers' confidence in online healthcare. They must ensure that the mechanisms are put in place whereby health professionals themselves can benefit from using the internet, while still ensuring the highest standards of medical practice.

Avienda was formerly known as the Centre for Law Ethics and Risk in Telemedicine.

Competing interests: None declared.

1 http://news.bbc.co.uk/hi/english/uk/england/newsid_1752000/1752670.stm (accessed 5 Feb 2002).

2 Case C-322/01: Reference for a preliminary ruling by the Landgericht Frankfurt am Main by order of that court of 10 August 2001 in the case of Deutscher Apothekerverband e.V. against DocMorris NV and Jacques Waterval. Official Journal of the European Communities No C 2001 December 8:348/10.

3 Council Directive 1992/28/EEC of 31 March 1992 on the advertising of medicinal products for human use. (Articles 1(3) and 3(1).) Official Journal of the European Communities No L 1995 11 February:32/26.

4 Directive 2000/31/EC on mutual recognition of primary medical and specialist medical qualifications and minimum standards of training. Official Journal of the European Communities No L 2001 July 31:206/1-51.

5 Schwartz J. Medical websites faulted on privacy. Washington Post 2000 February 1.

6 http://europa.eu.int/information_society/eeurope/ehealth/quality/draft_guidelines/index_en.htm (accessed 5 Feb 2002).

7 European Commission. Recommendation 2/2001 on certain minimum requirements for collecting personal data on-line in the European Union. Adopted on 17 May 2001. http://europa.eu.int/comm/internal_market/en/dataprot/wpdocs/wp43en.htm (accessed 25 Jan 2002).

Education and debate

Statistics Notes

Validating scales and indexes

J Martin Bland, Douglas G Altman

Papers p 569

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: Professor Bland mbland@sghms.ac.uk

BMJ 2002;324:606-7

An index of quality is a measurement like any other, whether it is assessing a website, as in today's BMJ,1 a clinical trial used in a meta-analysis,2 or the quality of life experienced by a patient.3 As with all measurements, we have to decide whether it measures what we want it to measure, and how well.

The simplest measurements, such as length and distance, can be validated by an objective criterion. The earliest criteria must have been biological: the length of a pace, a foot, a thumb. The obvious problem, that the criterion varies from person to person, was eventually solved by establishing a fundamental unit and defining all others in terms of it. Other measurements can then be defined in terms of a fundamental unit. To define a unit of weight we find a handy substance which appears the same everywhere, such as water. The unit of weight is then the weight of a volume of water specified in the basic unit of length, such as 100 cubic centimetres. Such measurements have criterion validity, meaning that we can take some known quantity and compare our measurement with it.

For some measurements no such standard is possible. Cardiac stroke volume, for example, can be measured only indirectly. Direct measurement, by collecting all the blood pumped out of the heart over a series of beats, would involve rather drastic interference with the system. Our criterion becomes agreement with another indirect measurement. Indeed, we sometimes have to use as a standard a method which we know produces inaccurate measurements.

Some quantities are even more difficult to measure and evaluate. Cardiac stroke volume does at least have an objective reality; a physical quantity of blood is pumped out of the heart when it beats. Anxiety and depression do not have a physical reality but are useful artificial constructs. They are measured by questionnaire scales, where answers to a series of questions related to the concept we want to measure are combined to give a numerical score. Website quality is similar. We are measuring a quantity which is not precisely defined, and there is no instrument with which we can compare any measure we might devise. How are we to assess the validity of such a scale?

The relevant theory was developed in the social sciences in the context of questionnaire scales.4 First we might ask whether the scale looks right, whether it asks about the sorts of thing which we think of as being related to anxiety or website quality. If it appears to be correct, we call this face validity. Next we might ask whether it covers all the aspects which we want to measure. A phobia scale which asked about fear of dogs, spiders, snakes, and cats but ignored height, confined spaces, and crowds would not do this. We call appropriate coverage of the subject matter content validity.

Our scale may look right and cover the right things, but what other evidence can we bring to the question of validity? One question we can ask is whether our score has the relationships with other variables that we would expect. For example, does an anxiety measure distinguish between psychiatric patients and medical patients? Do we get different anxiety scores from students before and after an examination? Does a measure of depression predict suicide attempts? We call the property of having appropriate relationships with other variables construct validity.

We can also ask whether the items which together compose the scale are related to one another: does the scale have internal consistency? If not, do the items really measure the same thing? On the other hand, if the items are too similar, some may be redundant. Highly correlated items in a scale may make the scale overlong and may lead to some aspects being overemphasised, impairing the content validity. A handy summary measure for this feature is Cronbach's alpha.5

A scale must also be repeatable and be sufficiently objective to give similar results for different observers. If a measurement is repeatable, in that someone who has a high score on one occasion tends to have a high score on another, it must be measuring something. With physical measurements, it is often possible for the same observer (or different observers) to make repeated measurements in quick succession. When there is a subjective element in the measurement the observer can be blinded from their first measurement, and different observers can make simultaneous measurements. In assessing the reliability of a website quality scale, it is easy to get several observers to apply the scale independently. With websites, repeat assessments need to be close in time because their content changes frequently (as does bmj.com). With questionnaires, either self administered or recorded by an observer, repeat measurements need to be far enough apart in time for the earlier responses to be forgotten, yet not so far apart that the underlying quantity being measured might have changed. Such data enable us to evaluate test-retest reliability. If two measures have comparable face, content, and construct validity the more repeatable one may be preferred for the study of a given population.

1 Gagliardi A, Jadad AR. Examination of instruments used to rate quality of health information on the internet: chronicle of a voyage with an unclear destination. BMJ 2002;324:569-73.
2 Jüni P, Altman DG, Egger M. Assessing the quality of controlled clinical trials. BMJ 2001;323:42-6.
3 Muldoon MF, Barger SD, Flory JD, Manuck B. What are quality of life measurements measuring? BMJ 1998;316:542-5.
4 Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 2nd ed. Oxford: Oxford University Press, 1996.
5 Bland JM, Altman DG. Cronbach's alpha. BMJ 1997;314:572.
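Cronbach's alpha is named in this note as a summary of internal consistency but is not defined here. The following is a minimal sketch, assuming the usual formula alpha = k/(k - 1) x (1 - sum of item variances / variance of the total score) for k items. The function name `cronbach_alpha` and the scores are made up for illustration and do not come from the article.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of scores (one per subject)."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total scale score for each subject: sum across the items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical data: three questionnaire items answered by five subjects.
items = [
    [2, 4, 3, 5, 1],
    [3, 5, 3, 4, 2],
    [2, 5, 4, 5, 1],
]
alpha = cronbach_alpha(items)
print(round(alpha, 2))
```

Values near 1 indicate that the items move together; as the note points out, very high values can also signal redundant items rather than a better scale.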

A few years ago my general practitioner told me that anyone aged over 40 with upper abdominal discomfort needed investigating. At the local teaching hospital, a pleasant young doctor did a gastroscopy, which showed a mass in my stomach wall. I was sent for a barium meal. A consultant radiologist took the x ray films, instructing me briskly to turn this way and that but not otherwise paying me any attention. He told me to wait a few minutes while he checked the films to see if all the views were satisfactory. I sat alone in the room for about five minutes.

From the moment the consultant re-entered I could see that he was slightly agitated. "I'm terribly sorry," he called out as he came through the door at the far end. And then again, "I'm terribly sorry." Perhaps these words of regret, coupled with the concern on his face, might not have had the effect they did had I not been a man with an abdominal mass on his mind. At this moment of truth and reckoning, certain visions swam before my eyes.

Three strides later, he was in front of me and looking me full in the face: "I'm terribly sorry, I hadn't realised you were a doctor." In his hand was the request form, and I could see that my general practitioner had written ex-SR here in one corner. He must have spotted this when checking the form as he looked at the preliminary plates. Though no further x rays were needed, he proceeded a little breathlessly to deliver three or four minutes of almost a caricature of caring, empathic interest in a patient. What branch of medicine was I in, and where did I work? Good heavens, that must be tough. Is that an Australian accent I hear? A St Mary's old boy, ah yes. What did I think about. . .?

I don't mean to imply that this was insincere, merely splendidly different from his earlier matter of factness and economy of word. I had thought nothing of this at the time: in such a bread and butter procedure I had no more reason to expect the doctor to engage with me as a person than I would the phlebotomist taking a routine blood sample. Clearly, this consultant saw things similarly as a rule, but when the patient was a doctor the aesthetics of the encounter changed. He had apologised three times for what he felt was a lapse on his part, arising from his failure to notice what was written in the corner of the request form. Perhaps he thought I knew that my general practitioner had written this and that I expected this of a medical referral, and thus expected to be recognised by him not just as a patient but also as a colleague. He seemed to see this as my due. (As it happens, I didn't.)

I had forgotten this incident, but it was brought back to me by the aftermath of the Bristol cardiac surgery debacle, and by the publicity surrounding other recent medical scandals. These have all put a spotlight on relations between doctors, who seem to offer each other acknowledgement and empathy, as my consultant had sought belatedly to do to me. The general public may be coming to suspect that this collegiate solidarity is somehow not in their interests, associating it with mutual protectiveness and thus with cover-ups of medical malpractice. It is too soon to say how the profession will react, but my consultant was an older man and my guess is that, with younger generations of doctors, we will see the waning of a tradition whose roots lie with Hippocrates. For it was his oath that bound doctors to look well on each other (and not charge each other for their services).

It's another story, but I found out later that the mass was the gastroscopy instrument itself distorting the stomach wall, misdiagnosed by an inexperienced registrar. No special treatment there, anyway.

Derek Summerfield consultant psychiatrist, CASCAID, South London and Maudsley NHS Trust, London


Statistics Notes

Interaction revisited: the difference between two estimates

Douglas G Altman, J Martin Bland

Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics

Correspondence to: D G Altman doug.altman@cancer.org.uk

BMJ 2003;326:219

We often want to compare two estimates of the same quantity derived from separate analyses. Thus we might want to compare the treatment effect in subgroups in a randomised trial, such as two age groups. The term for such a comparison is a test of interaction. In earlier Statistics Notes we discussed interaction in terms of heterogeneity of treatment effect.1-3 Here we revisit interaction and consider the concept more generally.

The comparison of two estimated quantities, such as means or proportions, each with its standard error, is a general method that can be applied widely. The two estimates should be independent, not obtained from the same individuals; examples are the results from subgroups in a randomised trial or from two independent studies. The samples should be large. If the estimates are E1 and E2 with standard errors SE(E1) and SE(E2), then the difference d = E1 − E2 has standard error SE(d) = √[SE(E1)² + SE(E2)²] (that is, the square root of the sum of the squares of the separate standard errors). This formula is an example of a well known relation that the variance of the difference between two estimates is the sum of the separate variances (here the variance is the square of the standard error). Then the ratio z = d/SE(d) gives a test of the null hypothesis that in the population the difference d is zero, by comparing the value of z to the standard normal distribution. The 95% confidence interval for the difference is d − 1.96SE(d) to d + 1.96SE(d).

We illustrated this for means and proportions,3 although we did not show how to get the standard error of the difference. Here we consider comparing relative risks or odds ratios. These measures are always analysed on the log scale because the distributions of the log ratios tend to be closer to normal than those of the ratios themselves.

In a meta-analysis of non-vertebral fractures in randomised trials of hormone replacement therapy the estimated relative risk from 22 trials was 0.73 (P=0.02) in favour of hormone replacement therapy.4 From 14 trials of women aged on average < 60 years the relative risk was 0.67 (95% confidence interval 0.46 to 0.98; P=0.03). From eight trials of women aged >60 the relative risk was 0.88 (0.71 to 1.08; P=0.22). In other words, in younger women the estimated treatment benefit was a 33% reduction in risk of fracture, which was statistically significant, compared with a 12% reduction in older women, which was not significant. But are the relative risks from the subgroups significantly different from each other? We show how to answer this question using just the summary data quoted.

Because the calculations were made on the log scale, comparing the two estimates is complex (see table). We need to obtain the logs of the relative risks and their confidence intervals (rows 2 and 4).5 As 95% confidence intervals are obtained as 1.96 standard errors either side of the estimate, the SE of each log relative risk is obtained by dividing the width of its confidence interval by 2 × 1.96 (row 6). The estimated difference in log relative risks is d = E1 − E2 = −0.2726 and its standard error 0.2206 (row 8). From these two values we can test the interaction and estimate the ratio of the relative risks (with confidence interval). The test of interaction is the ratio of d to its standard error: z = −0.2726/0.2206 = −1.24, which gives P=0.2 when we refer it to a table of the normal distribution. The estimated interaction effect is exp(−0.2726) = 0.76. (This value can also be obtained directly as 0.67/0.88 = 0.76.) The confidence interval for this effect is −0.7050 to 0.1598 on the log scale (row 9). Transforming back to the relative risk scale, we get 0.49 to 1.17 (row 12). There is thus no good evidence to support a different treatment effect in younger and older women.

The same approach is used for comparing odds ratios. Comparing means or regression coefficients is simpler as there is no log transformation. The two estimates must be independent: the method should not be used to compare a subset with the whole group, or two estimates from the same patients.

There is limited power to detect interactions, even in a meta-analysis combining the results from several studies. As this example illustrates, even when the two estimates and P values seem very different the test of interaction may not be significant. It is not sufficient for the relative risk to be significant in one subgroup and not in another. Conversely, it is not correct to assume that when two confidence intervals overlap the two estimates are not significantly different.6 Statistical analysis should be targeted on the question in hand, and not based on comparing P values from separate analyses.2

Calculations for comparing two estimated relative risks

                                 Group 1               Group 2
 1  RR                           0.67                  0.88
 2  *log RR                      −0.4005 (E1)          −0.1278 (E2)
 3  95% CI for RR                0.46 to 0.98          0.71 to 1.08
 4  *95% CI for log RR           −0.7765 to −0.0202    −0.3425 to 0.0770
 5  Width of CI                  0.7563                0.4195
 6  SE [= width/(2 × 1.96)]      0.1929                0.1070

Difference between log relative risks
 7  d [= E1 − E2]                −0.4005 − (−0.1278) = −0.2726
 8  SE(d)                        √(0.1929² + 0.1070²) = 0.2206
 9  CI(d)                        −0.2726 ± 1.96 × 0.2206, or −0.7050 to 0.1598
10  Test of interaction          z = −0.2726/0.2206 = −1.24 (P=0.2)

Ratio of relative risks (RRR)
11  RRR = exp(d)                 exp(−0.2726) = 0.76
12  CI(RRR)                      exp(−0.7050) to exp(0.1598), or 0.49 to 1.17

*Values obtained by taking natural logarithms of values on preceding row.

1 Altman DG, Matthews JNS. Interaction 1: Heterogeneity of effects. BMJ 1996;313:486.
2 Matthews JNS, Altman DG. Interaction 2: Compare effect sizes not P values. BMJ 1996;313:808.
3 Matthews JNS, Altman DG. Interaction 3: How to examine heterogeneity. BMJ 1996;313:862.
4 Torgerson DJ, Bell-Syer SEM. Hormone replacement therapy and prevention of nonvertebral fractures. A meta-analysis of randomized trials. JAMA 2001;285:2891-7.
5 Bland JM, Altman DG. Logarithms. BMJ 1996;312:700.
6 Bland M, Peacock J. Interpreting statistics with confidence. Obstetrician and Gynaecologist (in press).
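The steps in the table of calculations for comparing two relative risks can be sketched in a few lines of Python. This is an illustration under the note's own assumptions (independent estimates, large samples, 95% intervals built as 1.96 standard errors either side of the log estimate). The function names `se_from_ci` and `compare_ratios` are hypothetical, and the P value uses the standard normal distribution via `math.erfc`.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Standard error of a log ratio, recovered from the 95% CI of the ratio."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def compare_ratios(rr1, ci1, rr2, ci2, z=1.96):
    """Test of interaction between two independent ratio estimates (log scale)."""
    d = math.log(rr1) - math.log(rr2)  # difference in log ratios (row 7)
    se_d = math.sqrt(se_from_ci(*ci1, z) ** 2 + se_from_ci(*ci2, z) ** 2)  # row 8
    z_stat = d / se_d  # row 10
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))  # two sided normal P value
    return {
        "d": d,
        "se_d": se_d,
        "z": z_stat,
        "p": p_value,
        "ratio": math.exp(d),  # ratio of relative risks (row 11)
        "ci": (math.exp(d - z * se_d), math.exp(d + z * se_d)),  # row 12
    }


# Summary data quoted in the note: younger women, then older women.
result = compare_ratios(0.67, (0.46, 0.98), 0.88, (0.71, 1.08))
for name, value in result.items():
    print(name, value)
```

Working from unrounded logarithms gives values that agree with the table to the precision shown; the same function applies to odds ratios, as the note says.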


Statistics Notes

The logrank test

J Martin Bland, Douglas G Altman

Department of Health Sciences, University of York, York YO10 5DD
J Martin Bland, professor of health statistics

Cancer Research UK/NHS Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: Professor Bland

BMJ 2004;328:1073

We often wish to compare the survival experience of two (or more) groups of individuals. For example, the table shows survival times of 51 adult patients with recurrent malignant gliomas1 tabulated by type of tumour and indicating whether the patient had died or was still alive at analysis, that is, their survival time was censored.2 As the figure shows, the survival curves differ, but is this sufficient to conclude that in the population patients with glioblastoma have worse survival than patients with anaplastic astrocytoma?

[Figure: Survival curves for women with glioma by diagnosis (astrocytoma and glioblastoma). Vertical axis: survival proportion, 0 to 1.0. Horizontal axis: time, 0 to 208 weeks.]

We could compute survival curves3 for each group and compare the proportions surviving at any specific time. The weakness of this approach is that it does not provide a comparison of the total survival experience of the two groups, but rather gives a comparison at some arbitrary time point(s). In the figure the difference in survival is greater at some times than others and eventually becomes zero. We describe here the logrank test, the most popular method of comparing the survival of groups, which takes the whole follow up period into account. It has the considerable advantage that it does not require us to know anything about the shape of the survival curve or the distribution of survival times.

The logrank test is used to test the null hypothesis that there is no difference between the populations in the probability of an event (here a death) at any time point. The analysis is based on the times of events (here deaths). For each such time we calculate the observed number of deaths in each group and the number expected if there were in reality no difference between the groups. The first death was in week 6, when one patient in group 1 (astrocytoma) died. At the start of this week, there were 51 subjects alive in total, so the risk of death in this week was 1/51. There were 20 patients in group 1, so, if the null hypothesis were true, the expected number of deaths in group 1 is 20 × 1/51 = 0.39. Likewise, in group 2 (glioblastoma) the expected number of deaths is 31 × 1/51 = 0.61. The second event occurred in week 10, when there were two deaths. There were now 19 and 31 patients at risk (alive) in the two groups, one having died in week 6, so the probability of death in week 10 was 2/50. The expected numbers of deaths were 19 × 2/50 = 0.76 and 31 × 2/50 = 1.24 respectively.

The same calculations are performed each time an event occurs. If a survival time is censored, that individual is considered to be at risk of dying in the week of the censoring but not in subsequent weeks. This way of handling censored observations is the same as for the Kaplan-Meier survival curve.3

From the calculations for each time of death, the total numbers of expected deaths were 22.48 in group 1 and 19.52 in group 2, and the observed numbers of deaths were 14 and 28. We can now use a χ² test of the null hypothesis. The test statistic is the sum of (O − E)²/E for each group, where O and E are the totals of the observed and expected events. Here (14 − 22.48)²/22.48 + (28 − 19.52)²/19.52 = 6.88. The degrees of freedom are the number of groups minus one, i.e. 2 − 1 = 1. From a table of the χ² distribution we get P < 0.01, so that the difference between the groups is statistically significant. There is a different method of calculating the test statistic,4 but we prefer this approach as it extends easily to several groups. It is also possible to test for a trend in survival across ordered groups.4 Although we have shown how the calculation is made, we strongly recommend the use of statistical software.

The logrank test is based on the same assumptions as the Kaplan-Meier survival curve,3 namely that censoring is unrelated to prognosis, the survival probabilities are the same for subjects recruited early and late in the study, and the events happened at the times specified. Deviations from these assumptions matter most if they are satisfied differently in the groups being compared, for example if censoring is more likely in one group than another.

The logrank test is most likely to detect a difference between groups when the risk of an event is consistently greater for one group than another. It is unlikely to detect a difference when survival curves cross, as can happen when comparing a medical with a surgical intervention. When analysing survival data, the survival curves should always be plotted.

Because the logrank test is purely a test of significance it cannot provide an estimate of the size of the difference between the groups or a confidence interval. For these we must make some assumptions about the data. Common methods use the hazard ratio, including the Cox proportional hazards model, which we shall describe in a future Statistics Note.

Weeks to death or censoring in 51 adults with recurrent gliomas1 (A=astrocytoma, G=glioblastoma)

A: 6, 13, 21, 30, 31*, 37, 38, 47*, 49, 50, 63, 79, 80*, 82*, 82*, 86, 98, 149*, 202, 219
G: 10, 10, 12, 13, 14, 15, 16, 17, 18, 20, 24, 24, 25, 28, 30, 33, 34*, 35, 37, 40, 40, 40*, 46, 48, 70*, 81, 82, 91, 112, 181

*Censored survival time (still alive at follow up).

Competing interests: None declared.

1 Rostomily RC, Spence AM, Duong D, McCormick K, Bland M, Berger MS. Multimodality management of recurrent adult malignant gliomas: results of a phase II multiagent chemotherapy study and analysis of cytoreductive surgery. Neurosurgery 1994;35:378-8.
2 Altman DG, Bland JM. Time to event (survival) data. BMJ 1998;317:468-9.
3 Bland JM, Altman DG. Survival probabilities. The Kaplan-Meier method. BMJ 1998;317:1572.
4 Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991: 371-5.
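The logrank arithmetic described in this note can be sketched as follows. This is an illustration only, not a replacement for statistical software: the function names are hypothetical, the four-subject data set passed to `logrank_totals` is made up, and the P value shortcut with `math.erfc` holds only for one degree of freedom (two groups). Subjects censored at a given time are counted as at risk at that time, as the note describes.

```python
import math


def expected_deaths(at_risk, deaths):
    """Share the deaths at one event time across groups in proportion to numbers at risk."""
    total = sum(at_risk)
    return [n * deaths / total for n in at_risk]


def logrank_totals(times, died, groups):
    """Observed and expected deaths per group, summed over all death times."""
    labels = sorted(set(groups))
    observed = {g: 0 for g in labels}
    expected = {g: 0.0 for g in labels}
    death_times = sorted({t for t, d in zip(times, died) if d})
    for t in death_times:
        # At risk: everyone whose time (death or censoring) is at or after t.
        at_risk = [sum(1 for ti, gi in zip(times, groups) if gi == g and ti >= t)
                   for g in labels]
        deaths = [sum(1 for ti, di, gi in zip(times, died, groups)
                      if gi == g and ti == t and di) for g in labels]
        for g, o, e in zip(labels, deaths, expected_deaths(at_risk, sum(deaths))):
            observed[g] += o
            expected[g] += e
    return observed, expected


def logrank_chi2(observed, expected):
    """Sum of (O - E)^2 / E over the groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Figures quoted in the note.
week6 = expected_deaths([20, 31], 1)    # first death, week 6
week10 = expected_deaths([19, 31], 2)   # two deaths, week 10
chi2 = logrank_chi2([14, 28], [22.48, 19.52])
p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-squared tail area, 1 df only

# Made-up data set (hypothetical): times, death indicators, group labels.
toy_obs, toy_exp = logrank_totals([1, 3, 2, 4], [True, True, True, True],
                                  ["A", "A", "B", "B"])
print(week6, week10, chi2, p_value, toy_obs, toy_exp)
```

A useful check on any implementation is that the expected deaths summed over groups equal the total observed deaths, as they do in the note (22.48 + 19.52 = 14 + 28).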


Statistics Notes

Diagnostic tests 4: likelihood ratios

Jonathan J Deeks, Douglas G Altman

Screening and Test Evaluation Program, School of Public Health, University of Sydney, NSW 2006, Australia
Jonathan J Deeks, senior research biostatistician

Cancer Research UK/NHS Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Correspondence to: Mr Deeks Jon.Deeks@cancer.org.uk

BMJ 2004;329:168-9

The properties of a diagnostic or screening test are often described using sensitivity and specificity or predictive values, as described in previous Notes.1 2 Likelihood ratios are alternative statistics for summarising diagnostic accuracy, which have several particularly powerful properties that make them more useful clinically than other statistics.3

Each test result has its own likelihood ratio, which summarises how many times more (or less) likely patients with the disease are to have that particular result than patients without the disease. More formally, it is the ratio of the probability of the specific test result in people who do have the disease to the probability in people who do not.

A likelihood ratio greater than 1 indicates that the test result is associated with the presence of the disease, whereas a likelihood ratio less than 1 indicates that the test result is associated with the absence of disease. The further likelihood ratios are from 1 the stronger the evidence for the presence or absence of disease. Likelihood ratios above 10 and below 0.1 are considered to provide strong evidence to rule in or rule out diagnoses respectively in most circumstances.4 When tests report results as being either positive or negative the two likelihood ratios are called the positive likelihood ratio and the negative likelihood ratio.

The table shows the results of a study of the value of a history of smoking in diagnosing obstructive airway disease.5 Smoking history was categorised into four groups according to pack years smoked (packs per day × years smoked). The likelihood ratio for each category is calculated by dividing the percentage of patients with obstructive airway disease in that category by the percentage without the disease in that category. For example, among patients with the disease 28.4% had 40+ smoking pack years compared with just 1.4% of patients without the disease. The likelihood ratio is thus 28.4/1.4 = 20.3. A smoking history of more than 40 pack years is strongly predictive of a diagnosis of obstructive airway disease as the likelihood ratio is substantially higher than 10. Although never smoking or smoking less than 20 pack years both point to not having obstructive airway disease, their likelihood ratios are not small enough to rule out the disease with confidence.

Likelihood ratios are ratios of probabilities, and can be treated in the same way as risk ratios for the purposes of calculating confidence intervals6

                                     Obstructive airway disease
Smoking habit (pack years)           Yes (n (%))   No (n (%))    Likelihood ratio          95% CI
≥40                                  42 (28.4)     2 (1.4)       (42/148)/(2/144)=20.4     5.04 to 82.8
20-40                                25 (16.9)     24 (16.7)     (25/148)/(24/144)=1.01    0.61 to 1.69
0-20                                 29 (19.6)     51 (35.4)     (29/148)/(51/144)=0.55    0.37 to 0.82
Never smoked or smoked for <1 yr     52 (35.1)     67 (46.5)     (52/148)/(67/144)=0.76    0.57 to 1.00
Total                                148 (100)     144 (100)

Likelihood ratios are ratios of probabilities, and can be treated in the same way as risk ratios for the purposes of calculating confidence intervals.6

For a test with only two outcomes, likelihood ratios can be calculated directly from sensitivities and specificities.1 For example, if smoking habit is dichotomised as above or below 40 pack years, the sensitivity is 28.4% (42/148) and specificity 98.6% (142/144). The positive likelihood ratio is the proportion with obstructive airway disease who smoked more than 40 pack years (sensitivity) divided by the proportion without disease who smoked more than 40 pack years (1 − specificity), 28.4/1.4 = 20.3, as before. The negative likelihood ratio is the proportion with disease who smoked less than 40 pack years (1 − sensitivity) divided by the proportion without disease who smoked less than 40 pack years (specificity), 71.6/98.6 = 0.73. However, unlike sensitivity and specificity, computation of likelihood ratios does not require dichotomisation of test results. Forcing dichotomisation on multicategory test results may discard useful diagnostic information.

Likelihood ratios can be used to help adapt the results of a study to your patients. To do this they make use of a mathematical relationship known as Bayes' theorem that describes how a diagnostic finding changes our knowledge of the probability of abnormality.3 The post-test odds that the patient has the disease are estimated by multiplying the pretest odds by the likelihood ratio. The use of odds rather than risks makes the calculation slightly complex (box) but a nomogram can be used to avoid having to make conversions between odds and probabilities (figure).7 Both the figure and the box illustrate how a prior probability of obstructive airway disease of 0.1 (based, say, on presenting features) is updated to a probability of 0.7 with the knowledge that the patient had smoked for more than 40 pack years.

Box: Calculation of post-test probabilities using likelihood ratios

Pretest probability = p1 = 0.1
pretest odds = p1/(1 − p1) = 0.1/0.9 = 0.11
post-test odds = pretest odds × likelihood ratio
post-test odds = o2 = 0.11 × 20.43 = 2.27
Post-test probability = o2/(1 + o2) = 2.27/3.27 = 0.69

[Figure: Use of Fagan's nomogram for calculating post-test probabilities7. Three scales: pre-test probability (0.001 to 0.999), likelihood ratio (1000 to 0.001), and post-test probability (0.999 to 0.001).]

In clinical practice it is essential to know how a particular test result predicts the risk of abnormality. Sensitivities and specificities1 do not do this: they describe how abnormality (or normality) predicts particular test results. Predictive values2 do give probabilities of abnormality for particular test results, but depend on the prevalence of abnormality in the study sample and can rarely be generalised beyond the study (except when the study is based on a suitable random sample, as is sometimes the case for population screening studies). Likelihood ratios provide a solution as they can be used to calculate the probability of abnormality, while adapting for varying prior probabilities of the chance of abnormality from different contexts.

1 Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552.
2 Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994;309:102.
3 Sackett DL, Straus S, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine. How to practise and teach EBM. 2nd ed. Edinburgh: Churchill Livingstone, 2000:67-93.
4 Jaeschke R, Guyatt G, Lijmer J. Diagnostic tests. In: Guyatt G, Rennie D, eds. Users' guides to the medical literature. Chicago: AMA Press, 2002:121-40.
5 Straus SE, McAlister FA, Sackett DL, Deeks JJ. The accuracy of patient history, wheezing, and laryngeal measurements in diagnosing obstructive airway disease. CARE-COAD1 Group. Clinical assessment of the reliability of the examination-chronic obstructive airways disease. JAMA 2000;283:1853-7.
6 Altman DG. Diagnostic tests. In: Altman DG, Machin D, Bryant TN, Gardner MJ, eds. Statistics with confidence. 2nd ed. London: BMJ Books, 2000:105-19.
7 Fagan TJ. Letter: Nomogram for Bayes' theorem. N Engl J Med 1975;293:257.
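The likelihood ratio calculations in this note can be sketched in Python. The note says only that confidence intervals can be obtained "in the same way as risk ratios"; the sketch assumes the usual log method for a risk ratio, with SE(log LR) = √(1/a − 1/n1 + 1/b − 1/n2), which is an assumption about the method rather than something the note spells out. The function names are hypothetical; the counts are those in the smoking history table.

```python
import math


def likelihood_ratio(a, n_disease, b, n_no_disease, z=1.96):
    """LR for one test result category, with a 95% CI by the log method for a risk ratio.

    a, b: numbers with this result among those with and without the disease.
    """
    lr = (a / n_disease) / (b / n_no_disease)
    se_log = math.sqrt(1 / a - 1 / n_disease + 1 / b - 1 / n_no_disease)
    return lr, lr * math.exp(-z * se_log), lr * math.exp(z * se_log)


def post_test_probability(pretest_probability, lr):
    """Convert to odds, multiply by the likelihood ratio, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * lr
    return post_test_odds / (1 + post_test_odds)


# Counts from the table: (with disease, without disease), totals 148 and 144.
counts = {
    "40 or more": (42, 2),
    "20-40": (25, 24),
    "0-20": (29, 51),
    "never or <1 yr": (52, 67),
}
table = {name: likelihood_ratio(a, 148, b, 144) for name, (a, b) in counts.items()}

# Dichotomised at 40 pack years: LRs from sensitivity and specificity.
sensitivity, specificity = 42 / 148, 142 / 144
lr_positive = sensitivity / (1 - specificity)
lr_negative = (1 - sensitivity) / specificity

# Box calculation: prior probability 0.1, result of 40 or more pack years.
post = post_test_probability(0.1, table["40 or more"][0])
print(table, lr_positive, lr_negative, post)
```

Under this assumption the intervals agree with those in the table to the precision shown, and the post-test probability matches the box.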

A memorable patient

Living history

I was finally settling down at my desk when the pager To even my unpractised technique, his

bleeped: it was the outpatients department. An extra cardiovascular signs were a museum piece: absent left

patient had been added to the afternoon list: would I subclavian pulse, big arciform scars on the anterior

see him? chest created by the surgeon who had saved his life,

The patient was a slightly built man in his 60s. He central cyanosis, right-sided systolic murmurs, loud

had brought recent documentation from another pulmonary valve closure sound (iatrogenic pulmonary

hospital. I asked about his presenting complaint. hypertension, I reasoned), pulsatile liverall these and

Well, Ill try, but I wasnt aware of everything that undoubtedly more noted by the physician who first

happened. Thats why Ive brought my wifeshe was understood his condition.

with me at the time. That night, I read about his doctors. Helen Taussig

This was turning out to be one of those perfect indeed had had substantial hearing impairment, a

neurological consultations: documents from another disability that would have meant the end of a career in

hospital, a witness account, an articulate patient. The cardiology for a less able clinician. I also learnt of the

only question would be whether it was seizure, greater challenges of sexual prejudice that she fought

syncope, or transient ischaemic attack. As we went and overcame all her life. I learnt about Alfred Blalock,

through his medical history, I studied his records and the young doctor denied a residency at Johns Hopkins

for the first time noticed the phrase Tetralogy of only to be invited back in his later years to head its

Fallot. surgical unit.

Yes, my lifelong diagnosis, he smiled. I was The experiences of a life in medicine are sometimes

operated on. overwhelming. For weeks, I reflected on the

I saw the faint dusky blue colour of his lips. perspectives opened to me by this unassuming patient.

Blalock-Taussig shunt? I asked, as dim memories The curious irony of a man with a life threatening

from medical school somehow came back into focus. condition who had outlived his saviours; the

Yes, twice. By Blalock. He paused. In those days extraordinary vision of his all-too-human doctors; the

the operation was only done at Johns Hopkinsall the opportunity to witness history played out in the course

patients went there. I remember my second operation of a half-hour consultation. And memories were

well. I was 12 at the time. jogged, too: the words of my former professor of

And the doctors? I asked. medicine, who showed us cases of Fallots and first

Yes, especially Dr Taussig. She would come around told us about Taussig, the woman cardiologist; the

with her entourage every so often. Deaf as a post, she portrait of Blalock adorning a surgical lecture theatre

was. in medical college.

What? Taussig deaf? (And my patients neurological examination and

Yes. She had an amplifier attached to her investigations? Non-contributory. I still dont know

whether it was seizure, syncope, or transient ischaemic

stethoscope to examine patients.

attack.)

I asked to examine him, only too aware that it was

more for my benefit than his. Giridhar P Kalamangalam clinical fellow in epilepsy,

A half smile suggested that he had read my department of neurology, Cleveland Clinic Foundation,

thoughts: Of course. Cleveland OH, USA

Education and debate

Statistics Notes

Treatment allocation by minimisation

Douglas G Altman, J Martin Bland

UK Medical Statistics Group, Centre for Statistics in Medicine, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine

Department of Health Sciences, University of York, York YO10 5DD
J Martin Bland, professor of health statistics

Correspondence to: D Altman doug.altman@cancer.org.uk

BMJ 2005;330:843

In almost all controlled trials treatments are allocated by randomisation. Blocking and stratification can be used to ensure balance between groups in size and patient characteristics.1 But stratified randomisation using several variables is not effective in small trials. The only widely acceptable alternative approach is minimisation,2 3 a method of ensuring excellent balance between groups for several prognostic factors, even in small samples. With minimisation the treatment allocated to the next participant enrolled in the trial depends (wholly or partly) on the characteristics of those participants already enrolled. The aim is that each allocation should minimise the imbalance across multiple factors.

Table 1 shows some baseline characteristics in a controlled trial comparing two types of counselling in relation to dietary intake.4 Minimisation was used for the four variables shown, and the two groups were clearly very similar in all of these variables. Such good balance for important prognostic variables helps the credibility of the comparisons. How is it achieved?

Table 1 Baseline characteristics in two groups4

                         Behavioural    Nutrition
                         counselling    counselling
Women                    82             84
Mean (SD) age (years)    43.3 (13.8)    43.2 (14.0)
Ethnicity:
  White                  94             96
  Black                  37             32
  Asian                  3              5
Current smokers          47             44

Minimisation is based on a different principle from randomisation. The first participant is allocated a treatment at random. For each subsequent participant we determine which treatment would lead to better balance between the groups in the variables of interest.

The dietary behaviour trial used minimisation based on the four variables in table 1. Suppose that after 40 patients had entered this trial the numbers in each subgroup in each treatment group were as shown in table 2. (Note that two or more categories need to be constructed for continuous variables.)

Table 2 Hypothetical distribution of baseline characteristics after 40 patients had been enrolled in the trial

                         Behavioural    Nutrition
                         counselling    counselling
                         (n=20)         (n=20)
Women                    12             11
Age >50                  7              5
Ethnicity:
  White                  15             15
  Black                  4              5
  Asian                  1              0
Current smokers          6              8

The next enrolled participant is a black woman aged 52, who is a non-smoker. If we were to allocate her to behavioural counselling, the imbalance would be increased in sex distribution (12+1 v 11 women), in age (7+1 v 5 aged > 50), and in smoking (14+1 v 12 non-smoking) and decreased in ethnicity (4+1 v 5 black). We formalise this by summing over the four variables the numbers of participants with the same characteristics as this new recruit already in the trial:

Behavioural: 12 (sex) + 7 (age) + 4 (ethnicity) + 14 (smoking) = 37
Nutrition: 11 + 5 + 5 + 12 = 33

Imbalance is minimised by allocating this person to the group with the smaller total (or at random if the totals are the same). Allocation to behavioural counselling would increase the imbalance; allocation to nutrition would decrease it.

At this point there are two options. The chosen treatment could simply be taken as the one with the lower score; or we could introduce a random element. We use weighted randomisation so that there is a high chance (eg 80%) of each participant getting the treatment that minimises the imbalance. The use of a random element will slightly worsen the overall imbalance between the groups, but balance will be much better for the chosen variables than with simple randomisation. A random element also makes the allocation more unpredictable, although minimisation is a secure allocation system when used by an independent person.

After the treatment is determined for the current participant the numbers in each group are updated and the process repeated for each subsequent participant. If at any time the totals for the two groups are the same, then the choice should be made using simple randomisation. The method extends to trials of more than two treatments.

Minimisation is a valid alternative to ordinary randomisation,2 3 5 and has the advantage, especially in small trials, that there will be only minor differences between groups in those variables used in the allocation process. Such balance is especially desirable where there are strong prognostic factors and modest treatment effects, such as oncology. Minimisation is best performed with the aid of software, for example minim, a free program.6 Its use makes trialists think about prognostic factors at the outset and helps ensure adherence to the protocol as a trial progresses.7

1 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-4.
2 Treasure T, MacRae KD. Minimisation: the platinum standard for trials. BMJ 1998;317:362-3.
3 Scott NW, McPherson GC, Ramsay CR, Campbell MK. The method of minimization for allocation to clinical trials: a review. Control Clin Trials 2002;23:662-74.
4 Steptoe A, Perkins-Porras L, McKay C, Rink E, Hilton S, Cappuccio FP. Behavioural counselling to increase consumption of fruit and vegetables in low income adults: randomised trial. BMJ 2003;326:855-8.
5 Buyse M, McEntegart D. Achieving balance in clinical trials: an unbalanced view from the European regulators. Applied Clin Trials 2004;13:36-40.
6 Evans S, Royston P, Day S. Minim: allocation by minimisation in clinical trials. http://www-users.york.ac.uk/~mb55/guide/minim.htm (accessed 24 October 2004).
7 Day S. Commentary: Treatment allocation by the method of minimisation. BMJ 1999;319:947-8.
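The allocation rule described in this note can be sketched as a short routine. This is an illustrative sketch only, not the minim program: the function names, the dictionary layout, and the `p_best` parameter (the chance of giving the imbalance-minimising treatment, eg 0.8) are assumptions made for the example. The counts are those of table 2, with non-smokers derived as 20 minus the number of current smokers.

```python
import random

# Table 2 counts, restricted to the categories of the new recruit
# (black woman, aged over 50, non-smoker).
counts = {
    "behavioural": {"woman": 12, "age>50": 7, "black": 4, "non-smoker": 14},
    "nutrition": {"woman": 11, "age>50": 5, "black": 5, "non-smoker": 12},
}


def imbalance_scores(counts, characteristics):
    """Sum, per arm, the participants already sharing each characteristic."""
    return {arm: sum(levels[c] for c in characteristics)
            for arm, levels in counts.items()}


def minimise(counts, characteristics, p_best=1.0, rng=random):
    """Allocate one participant and update the running counts.

    p_best=1.0 always picks the arm with the lowest score; a value such as
    0.8 gives the weighted randomisation mentioned in the note.
    """
    scores = imbalance_scores(counts, characteristics)
    lowest = min(scores.values())
    best = [arm for arm, score in scores.items() if score == lowest]
    if len(best) > 1:
        arm = rng.choice(best)  # equal totals: simple randomisation
    elif rng.random() < p_best:
        arm = best[0]
    else:
        arm = rng.choice([a for a in scores if a != best[0]])
    for c in characteristics:
        counts[arm][c] += 1
    return arm, scores


recruit = ["woman", "age>50", "black", "non-smoker"]
arm, scores = minimise(counts, recruit)
print(scores, arm)
```

With `p_best=1.0` the routine is deterministic apart from ties, which makes it easy to check against the worked totals of 37 and 33; passing a seeded `random.Random` instance as `rng` would make the weighted version reproducible.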


Statistics Notes

Standard deviations and standard errors

Douglas G Altman, J Martin Bland

Cancer Research UK/NHS Centre for Statistics in Medicine, Wolfson College, Oxford OX2 6UD
Douglas G Altman, professor of statistics in medicine

Department of Health Sciences, University of York, York YO10 5DD
J Martin Bland, professor of health statistics

Correspondence to: Prof Altman doug.altman@cancer.org.uk

BMJ 2005;331:903

The terms "standard error" and "standard deviation" are often confused.1 The contrast between these two terms reflects the important distinction between data description and inference, one that all researchers should appreciate.

The standard deviation (often SD) is a measure of variability. When we calculate the standard deviation of a sample, we are using it as an estimate of the variability of the population from which the sample was drawn. For data with a normal distribution,2 about 95% of individuals will have values within 2 standard deviations of the mean, the other 5% being equally scattered above and below these limits. Contrary to popular misconception, the standard deviation is a valid measure of variability regardless of the distribution. About 95% of observations of any distribution usually fall within the 2 standard deviation limits, though those outside may all be at one end. We may choose a different summary statistic, however, when data have a skewed distribution.3

When we calculate the sample mean we are usually interested not in the mean of this particular sample, but in the mean for individuals of this type; in statistical terms, of the population from which the sample comes. We usually collect data in order to generalise from them and so use the sample mean as an estimate of the mean for the whole population. Now the sample mean will vary from sample to sample; the way this variation occurs is described by the "sampling distribution" of the mean. We can estimate how much sample means will vary from the standard deviation of this sampling distribution, which we call the standard error (SE) of the estimate of the mean. As the standard error is a type of standard deviation, confusion is understandable. Another way of considering the standard error is as a measure of the precision of the sample mean.

The standard error of the sample mean depends on both the standard deviation and the sample size, by the simple relation SE = SD/√(sample size). The standard error falls as the sample size increases, as the extent of chance variation is reduced. This idea underlies the sample size calculation for a controlled trial, for example. By contrast the standard deviation will not tend to change as we increase the size of our sample.

So, if we want to say how widely scattered some measurements are, we use the standard deviation. If we want to indicate the uncertainty around the estimate of the mean measurement, we quote the standard error of the mean. The standard error is most useful as a means of calculating a confidence interval. For a large sample, a 95% confidence interval is obtained as the values 1.96×SE either side of the mean. We will discuss confidence intervals in more detail in a subsequent Statistics Note. The standard error is also used to calculate P values in many circumstances.

The principle of a sampling distribution applies to other quantities that we may estimate from a sample, such as a proportion or regression coefficient, and to contrasts between two samples, such as a risk ratio or the difference between two means or proportions. All such quantities have uncertainty due to sampling variation, and for all such estimates a standard error can be calculated to indicate the degree of uncertainty.

In many publications a ± sign is used to join the standard deviation (SD) or standard error (SE) to an observed mean, for example 69.4±9.3 kg. That notation gives no indication whether the second figure is the standard deviation or the standard error (or indeed something else). A review of 88 articles published in 2002 found that 12 (14%) failed to identify which measure of dispersion was reported (and three failed to report any measure of variability).4 The policy of the BMJ and many other journals is to remove ± signs and request authors to indicate clearly whether the standard deviation or standard error is being quoted. All journals should follow this practice.

Competing interests: None declared.

1 Nagele P. Misuse of standard error of the mean (SEM) when reporting variability of a sample. A critical evaluation of four anaesthesia journals. Br J Anaesthesiol 2003;90:514-6.
2 Altman DG, Bland JM. The normal distribution. BMJ 1995;310:298.
3 Altman DG, Bland JM. Quartiles, quintiles, centiles, and other quantiles. BMJ 1994;309:996.
4 Olsen CH. Review of the use of statistics in Infection and Immunity. Infect Immun 2003;71:6689-92.
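The relation between the standard deviation, the standard error, and a large-sample confidence interval can be illustrated with a short calculation. This is a sketch under stated assumptions: the mean of 69.4 kg is the note's own example, the 9.3 kg is treated here as a standard deviation (the note's point being that the ± notation leaves this unclear), and the sample sizes of 100 and 400 are hypothetical, chosen only to show how the standard error shrinks while the standard deviation does not.

```python
import math
import statistics


def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


def ci95_large_sample(mean, se):
    """Large-sample 95% confidence interval: 1.96 x SE either side of the mean."""
    return mean - 1.96 * se, mean + 1.96 * se


mean_kg, sd_kg = 69.4, 9.3          # 9.3 assumed to be an SD for illustration
se_100 = standard_error(sd_kg, 100)  # hypothetical sample of 100
se_400 = standard_error(sd_kg, 400)  # quadrupling n halves the SE
low, high = ci95_large_sample(mean_kg, se_100)
print(se_100, se_400, low, high)

# The same quantities from raw (hypothetical) measurements.
weights = [60.0, 65.0, 70.0, 75.0, 80.0]
sample_sd = statistics.stdev(weights)               # describes the scatter
sample_se = standard_error(sample_sd, len(weights))  # precision of the mean
```

Reporting "mean 69.4 (SD 9.3) kg" or "mean 69.4 (SE 0.93) kg" removes the ambiguity that the ± notation creates, since the two figures differ by a factor of √n.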

Practice

Statistics Notes

The cost of dichotomising continuous variables

Douglas G Altman, Patrick Royston

Cancer Research UK/NHS Centre for Statistics in Medicine, Wolfson College, Oxford OX2 6UD
Douglas G Altman, professor of statistics in medicine

MRC Clinical Trials Unit, London NW1 2DA
Patrick Royston, professor of statistics

Correspondence to: Professor Altman doug.altman@cancer.org.uk

BMJ 2006;332:1080

Measurements of continuous variables are made in all branches of medicine, aiding in the diagnosis and treatment of patients. In clinical practice it is helpful to label individuals as having or not having an attribute, such as being "hypertensive" or "obese" or having "high cholesterol," depending on the value of a continuous variable.

Categorisation of continuous variables is also common in clinical research, but here such simplicity is gained at some cost. Though grouping may help data presentation, notably in tables, categorisation is unnecessary for statistical analysis and it has some serious drawbacks. Here we consider the impact of converting continuous data to two groups (dichotomising), as this is the most common approach in clinical research.1

What are the perceived advantages of forcing all individuals into two groups? A common argument is that it greatly simplifies the statistical analysis and leads to easy interpretation and presentation of results. A binary split (for example, at the median) leads to a comparison of groups of individuals with high or low values of the measurement, leading in the simplest case to a t test or χ² test and an estimate of the difference between the groups (with its confidence interval). There is, however, no good reason in general to suppose that there is an underlying dichotomy, and if one exists there is no reason why it should be at the median.2

Dichotomising leads to several problems. Firstly, much information is lost, so the statistical power to detect a relation between the variable and patient outcome is reduced. Indeed, dichotomising a variable at the median reduces power by the same amount as would discarding a third of the data.2 3 Deliberately discarding data is surely inadvisable when research studies already tend to be too small. Dichotomisation may also increase the risk of a positive result being a false positive.4 Secondly, one may seriously underestimate the extent of variation in outcome between groups, such as the risk of some event, and considerable variability may be subsumed within each group. Individuals close to but on opposite sides of the cutpoint are characterised as being very different rather than very similar. Thirdly, using two groups conceals any non-linearity in the relation between the variable and outcome. Presumably, many who dichotomise are unaware of the implications.

If dichotomisation is used where should the cutpoint be? For a few variables there are recognised cutpoints, such as > 25 kg/m² to define "overweight" based on body mass index. For some variables, such as age, it is usual to take a round number, usually a multiple of five or 10. The cutpoint used in previous studies may be adopted. In the absence of a prior cutpoint the most common approach is to take the sample median. However, using the sample median implies that various cutpoints will be used in different studies so that their results cannot easily be compared, seriously hampering meta-analysis of observational studies.5 Nevertheless, all these approaches are preferable to performing several analyses and choosing that which gives the most convincing result. Use of this so called "optimal" cutpoint (usually that giving the minimum P value) runs a high risk of a spuriously significant result; the difference in the outcome variable between the groups will be overestimated, perhaps considerably; and the confidence interval will be too narrow. This strategy should never be used.6 7

When regression is being used to adjust for the effect of a confounding variable, dichotomisation will run the risk that a substantial part of the confounding remains.4 7 Dichotomisation is not much used in epidemiological studies, where the use of several categories is preferred. Using multiple categories (to create an "ordinal" variable) is generally preferable to dichotomising. With four or five groups the loss of information can be quite small, but there are complexities in analysis.

Instead of categorising continuous variables, we prefer to keep them continuous. We could then use linear regression rather than a two sample t test, for example. If we were concerned that a linear regression would not truly represent the relation between the outcome and predictor variable, we could explore whether some transformation (such as a log transformation) would be helpful.7 8 As an example, in a regression analysis to develop a prognostic model for patients with primary biliary cirrhosis, a carefully developed model with bilirubin as a continuous explanatory variable explained 31% more of the variability in the data than when bilirubin distribution was split at the median.7

Competing interests: None declared.

1 Del Priore G, Zandieh P, Lee MJ. Treatment of continuous data as categoric variables in obstetrics and gynecology. Obstet Gynecol 1997;89:351-4.
2 MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative variables. Psychol Meth 2002;7:19-40.
3 Cohen J. The cost of dichotomization. Appl Psychol Meas 1983;7:249-53.
4 Austin PC, Brunner LJ. Inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses. Stat Med 2004;23:1159-78.
5 Buettner P, Garbe C, Guggenmoos-Holzmann I. Problems in defining cutoff points of continuous prognostic factors: example of tumor thickness in primary cutaneous melanoma. J Clin Epidemiol 1997;50:1201-10.
6 Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst 1994;86:829-35.
7 Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41.
8 Royston P, Sauerbrei W. Building multivariable regression models with continuous covariates in clinical epidemiology, with an emphasis on fractional polynomials. Methods Inf Med 2005;44:561-71.

Endpiece

Reading and reflecting

Reading without reflecting is like eating without digesting.

Edmund Burke

Kamal Samanta, retired general practitioner, Denby Dale, West Yorkshire
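The claim that a median split costs about as much as discarding a third of the data can be checked with a small simulation. This is a hedged sketch rather than the analysis in the cited papers: it assumes a normally distributed predictor with a linear effect on the outcome, for which the squared correlation after a median split is known to fall to about 2/π (roughly 0.64) of its original value. The seed, sample size, and slope are arbitrary choices for the example.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2006)  # fixed seed so the run is reproducible
n = 20000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]  # linear relation plus noise

median_x = statistics.median(x)
x_binary = [1.0 if xi > median_x else 0.0 for xi in x]  # "high" v "low"

r_continuous = pearson_r(x, y)
r_binary = pearson_r(x_binary, y)

# Ratio of squared correlations: the share of information retained.
efficiency = (r_binary / r_continuous) ** 2
print(round(efficiency, 2), round(2 / math.pi, 2))
```

An efficiency near 0.64 means that a study using the dichotomised predictor behaves roughly like a study about two thirds the size using the continuous one.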

PRACTICE

Statistics Notes

Missing data

Douglas G Altman1, J Martin Bland2

1 Cancer Research UK/NHS Centre for Statistics in Medicine, Oxford OX2 6UD
2 Department of Health Sciences, University of York, York YO10 5DD

Correspondence to: Professor Altman doug.altman@cancer.org.uk

BMJ 2007;334:424
doi:10.1136/bmj.38977.682025.2C

Almost all studies have some missing observations. Yet textbooks and software commonly assume that data are complete, and the topic of how to handle missing data is not often discussed outside statistics journals.

There are many types of missing data and different reasons for data being missing. Both issues affect the analysis. Some examples are:

(1) In a postal questionnaire survey not all the selected individuals respond;
(2) In a randomised trial some patients are lost to follow-up before the end of the study;
(3) In a multicentre study some centres do not measure a particular variable;
(4) In a study in which patients are assessed frequently some data are missing at some time points for unknown reasons;
(5) Occasional data values for a variable are missing because some equipment failed;
(6) Some laboratory samples are lost in transit or technically unsatisfactory;
(7) In a magnetic resonance imaging study some very obese patients are excluded as they are too large for the machine;
(8) In a study assessing quality of life some patients die during the follow-up period.

The prime concern is always whether the available data would be biased. If the fact that an observation is missing is unrelated both to the unobserved value (and hence to patient outcome) and the data that are available this is called "missing completely at random." For cases 5 and 6 above that would be a safe assumption. Sometimes data are missing in a predictable way that does not depend on the missing value itself but which can be predicted from other data, as in case 3. Confusingly, this is known as "missing at random." In the common cases 1 and 2, however, the missing data probably depend on unobserved values, called "missing not at random," and hence their lack may lead to bias.

In general, it is important to be able to examine whether missing data may have introduced bias. For example, if we know nothing at all about the non-responders to a survey then we can do little to explore possible bias. Thus a high response rate is necessary for reliable answers.1 Sometimes, though, some information is available. For example, if the survey sample is chosen from a register that includes age and sex, then the responders and non-responders can be compared on these variables. At the very least this gives some pointers to the representativeness of the sample. Non-responders often (but not always) have a worse medical prognosis than those who respond.

A few missing observations are a minor nuisance, but a large amount of missing data is a major threat to a study's integrity. Non-response is a particular problem in pair-matched studies, such as some case-control studies, as it is unclear how to analyse data from the unmatched individuals. Loss of patients also reduces the power of the trial. Where losses are expected it is wise to increase the target sample size to allow for losses. This cannot eliminate the potential bias, however.

Missing data are much more common in retrospective studies, in which routinely collected data are subsequently used for a different purpose.2 When information is sought from patients' medical notes, the notes often do not say whether or not a patient was a smoker or had a particular procedure carried out. It is tempting to assume that the answer is no when there is no indication that the answer is yes, but this is generally unwise.

No really satisfactory solution exists for missing data, which is why it is important to try to maximise data collection. The main ways of handling missing data in analysis are: (a) omitting variables which have many missing values; (b) omitting individuals who do not have complete data; and (c) estimating (imputing) what the missing values were.

Omitting everyone without complete data is known as complete case (or available case) analysis and is probably the most common approach. When only a very few observations are missing little harm will be done, but when many are missing omitting all patients without full data might result in a large proportion of the data being discarded, with a major loss of statistical power. The results may be biased unless the data are missing completely at random. In general it is advisable not to include in an analysis any variable that is not available for a large proportion of the sample. The main alternative approach to case deletion is imputation, whereby missing values are replaced by some plausible value predicted from that individual's available data. Imputation has been the topic of much recent methodological work; we will consider some of the simpler methods in a separate Statistics Note.

Competing interests: None declared.

1 Evans SJW. Good surveys guide. BMJ 1991;302:302-3.
2 Burton A, Altman DG. Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. Br J Cancer 2004;91:4-8.

This is part of a series of occasional articles on statistics and handling data in research.
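How quickly complete case analysis can discard data is easy to underestimate, so a small sketch may help. It assumes, purely for illustration, that each variable is missing independently with the same probability (a "missing completely at random" pattern); the records, field names, and the 5% figure are hypothetical and do not come from the note.

```python
def complete_cases(rows):
    """Keep only the records in which every variable has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def expected_complete_fraction(p_missing, n_variables):
    """Expected share of complete records when each of n_variables is
    missing independently with probability p_missing (an assumption)."""
    return (1.0 - p_missing) ** n_variables


# Hypothetical records; None marks a missing observation.
patients = [
    {"age": 61, "smoker": True, "bilirubin": 14.0},
    {"age": 47, "smoker": None, "bilirubin": 9.5},
    {"age": None, "smoker": False, "bilirubin": 22.1},
    {"age": 58, "smoker": False, "bilirubin": 11.3},
]
kept = complete_cases(patients)
print(len(kept), "of", len(patients), "records are complete")

# Ten variables, each missing for 5% of individuals.
fraction = expected_complete_fraction(0.05, 10)
print(round(fraction, 2))
```

Under these assumptions only about 60% of individuals would remain in a complete case analysis of ten variables, even though no single variable is missing for more than 5% of them.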


Parametric v non-parametric methods for data analysis

BMJ 2009; 338 doi: https://doi.org/10.1136/bmj.a3167 (Published 02 April 2009) Cite this as: BMJ

2009;338:a3167

Douglas G Altman, professor of statistics in medicine1, J Martin Bland, professor of health statistics2

1 Centre for Statistics in Medicine, University of Oxford, Wolfson College Annexe, Oxford OX2 6UD

2 Department of Health Sciences, University of York, York YO10 5DD

Continuous data arise in most areas of medicine. Familiar clinical examples include blood pressure, ejection

fraction, forced expiratory volume in 1 second (FEV1), serum cholesterol, and anthropometric measurements.

Methods for analysing continuous data fall into two classes, distinguished by whether or not they make

assumptions about the distribution of the data.

Theoretical distributions are described by quantities called parameters, notably the mean and standard

deviation.1 Methods that use distributional assumptions are called parametric methods, because we estimate the

parameters of the distribution assumed for the data. Frequently used parametric methods include t tests and

analysis of variance for comparing groups, and least squares regression and correlation for studying the relation

between variables. All of the common parametric methods ("t methods") assume that in some way the data follow

a normal distribution and also that the spread of the data (variance) is uniform either between groups or across

the range being studied. For example, the two sample t test assumes that the two samples of observations come

from populations that have normal distributions with the same standard deviation. The importance of the

assumptions for t methods diminishes as sample size increases.

Alternative methods, such as the sign test, Mann-Whitney test, and rank correlation, do not require the data to

follow a particular distribution. They work by using the rank order of observations rather than the measurements

themselves. Methods which do not require us to make distributional assumptions about the data, such as the

rank methods, are called non-parametric methods. The term "non-parametric" applies to the statistical method

used to analyse data, and is not a property of the data.1 As tests of significance, rank methods have almost as

much power as t methods to detect a real difference when samples are large, even for data which meet the

distributional requirements.
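The statement that rank methods use only the order of the observations can be made concrete with a short sketch. It computes the Mann-Whitney U statistic by direct pair counting (equivalent to the usual rank-sum formulation) and shows that it is unchanged by a log transformation, whereas the difference in means is not. The two samples are hypothetical values invented for the example.

```python
import math
import statistics


def mann_whitney_u(a, b):
    """U statistic for sample a: the number of (a, b) pairs in which the
    value from a exceeds the value from b, counting ties as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Hypothetical positive measurements for two groups.
group_a = [1.2, 3.4, 2.2, 8.9, 5.1]
group_b = [0.8, 1.9, 2.0, 4.0, 1.1]

log_a = [math.log(v) for v in group_a]
log_b = [math.log(v) for v in group_b]

u_raw = mann_whitney_u(group_a, group_b)
u_log = mann_whitney_u(log_a, log_b)  # ordering is preserved, so U is too

diff_raw = statistics.fmean(group_a) - statistics.fmean(group_b)
diff_log = statistics.fmean(log_a) - statistics.fmean(log_b)
print(u_raw, u_log, diff_raw, diff_log)
```

The invariance is what frees the rank method from distributional assumptions. The price is that U by itself does not give the size of the difference between the groups on the measurement scale.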

Non-parametric methods are most often used to analyse data which do not meet the distributional requirements

of parametric methods. In particular, skewed data are frequently analysed by non-parametric methods, although

data transformation can often make the data suitable for parametric analyses.2

Data that are scores rather than measurements may have many possible values, such as quality of life scales or

data from visual analogue scales, while others have only a few possible values, such as Apgar scores or stage of

disease. Scores with many values are often analysed using parametric methods, whereas those with few values

tend to be analysed using rank methods, but there is no clear boundary between these cases.

To compensate for the advantage of being free of assumptions about the distribution of the data, rank methods

have the disadvantage that they are mainly suited to hypothesis testing and no useful estimate is obtained, such

as the average difference between two groups. Estimates and confidence intervals are easy to find with t

methods. Non-parametric estimates and confidence intervals can be calculated, however, but depend on extra

assumptions which are almost as strong as those for t methods.3 Rank methods have the added disadvantage of

not generalising to more complex situations, most obviously when we wish to use regression methods to adjust

for several other factors.

Rank methods can generate strong views, with some people preferring them for all analyses and others believing

that they have no place in statistics. We believe that rank methods are sometimes useful, but parametric

methods are generally preferable as they provide estimates and confidence intervals and generalise to more

complex analyses.

The choice of approach may also be related to sample size, as the distributional assumptions are more important

for small samples. We consider the analysis of small data sets in a subsequent Statistics Note.

Notes

Cite this as: BMJ 2009;338:a3167

Footnotes

Competing interests: None declared.

References

1. Altman DG, Bland JM. Variables and parameters. BMJ 1999;318:1667.

2. Bland JM, Altman DG. Transforming data. BMJ 1996;312:770.

3. Campbell MJ, Gardner MJ. Medians and their differences. In: Altman DG, Machin D, Bryant TN, Gardner MJ, eds.

Statistics with confidence. 2nd ed. London: BMJ Books, 2000:36-44.


BMJ 2009; 338 doi: https://doi.org/10.1136/bmj.a3166 (Published 06 April 2009) Cite this as: BMJ

2009;338:a3166

J Martin Bland, professor of health statistics 1, Douglas G Altman, professor of statistics in medicine 2

1 Department of Health Sciences, University of York, York YO10 5DD

2 Centre for Statistics in Medicine, University of Oxford, Wolfson College Annexe, Oxford OX2 6UD

Studies with small numbers of measurements are rare in the modern BMJ, but they used to be common and

remain plentiful in specialist clinical journals. Their analysis is often more problematic than that for large samples.

Parametric methods, including t tests, correlation, and regression, require the assumption that the data follow a

normal distribution and that variances are uniform between groups or across ranges.1 In small samples these

assumptions are particularly important, so this setting seems ideal for rank (non-parametric) methods, which

make no assumptions about the distribution of the data; they use the rank order of observations rather than the

measurements themselves.1 Unfortunately, rank methods are least effective in small samples. Indeed, for very

small samples, they cannot yield a significant result whatever the data. For example, when using the Mann-Whitney

test for comparing two samples of fewer than four observations a statistically significant difference is

impossible: any data give P>0.05. Similarly, the Wilcoxon paired test, the sign test, and Spearman's and

Kendall's rank correlation coefficients cannot produce P<0.05 for fewer than six observations. Methods based on

the t distribution do not have this problem and can detect differences in samples as small as two for paired

differences and three for two groups, or detect correlations in samples of three.
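
The floor on the P value from a rank test can be checked by counting arrangements. The following is a minimal sketch, assuming exact two-sided tests with no ties; the function names are illustrative, and it covers only the Mann-Whitney test and the Wilcoxon paired or sign test, not the rank correlation coefficients.

```python
from math import comb


def min_p_mann_whitney(n1: int, n2: int) -> float:
    """Smallest exact two-sided P for a Mann-Whitney test (no ties).

    The most extreme result is complete separation of the two samples,
    which is one of comb(n1 + n2, n1) equally likely rank arrangements
    under the null hypothesis; doubling gives the two-sided value.
    """
    return min(1.0, 2 / comb(n1 + n2, n1))


def min_p_signed_rank(n: int) -> float:
    """Smallest exact two-sided P for the Wilcoxon paired test or the sign
    test with n non-zero differences: every difference has the same sign,
    which has probability (1/2)**n under the null, doubled for two sides."""
    return min(1.0, 2 * 0.5 ** n)


for n in (3, 4):
    print(f"Mann-Whitney, {n} v {n}: minimum P = {min_p_mann_whitney(n, n):.3f}")
for n in (4, 5, 6):
    print(f"Wilcoxon paired or sign test, n = {n}: minimum P = {min_p_signed_rank(n):.3f}")
```

With four paired differences the smallest attainable value is 0.125, which is why no set of four observations can reach P<0.05 with the Wilcoxon paired test.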

For example, we were recently asked about the data in table 1, which shows before and after measurements of

pudendal nerve terminal motor latency. Should we use the Wilcoxon or the sign test? MB replied that the

Wilcoxon would be acceptable, giving P<0.05 (actually P=0.047), and so would the paired t test, which gave

P=0.04. The questioner also asked whether the Wilcoxon test could be used for the second group of four

observations alone, for patients who had received a slightly different intervention. Here all the differences are in

the same direction, but the Wilcoxon test gives P=0.125. It is not possible for it to give a significant difference.

The paired t test gives P=0.04, a significant difference.

Subgroup Pudendal nerve terminal motor latency (ms)

Table 1 Five year follow-up of patients receiving hyperbaric oxygen therapy for faecal incontinence

On the other hand, using t methods when their assumptions are greatly violated can also be misleading. Table 2

shows concentration of antibody to type II group B Streptococcus in 20 volunteers before and after

immunisation.2 3 The comparison of the antibody levels was summarised in the report of this study as t=1.8;

P>0.05. The paired t test is not suitable for these data, because the differences clearly have a very skewed

distribution. There are 8 zero differences, forming a clump at one end of the distribution, which would remain

whatever transformation we used. We could consider the Wilcoxon paired sample test, but this method assumes

that the differences have a symmetrical distribution, which they do not. The sign test is preferred here; it tests the

null hypothesis that non-zero differences are equally likely to be positive or negative, using the binomial

distribution. We have 1 negative and 11 positive differences, which gives P=0.006. Hence the original authors

failed to detect a difference because they used an inappropriate analysis.
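
The sign test calculation can be sketched directly from the binomial distribution. This is an illustration only, assuming an exact two-sided test with zero differences excluded; the function name is hypothetical.

```python
from math import comb


def sign_test_p(n_positive: int, n_negative: int) -> float:
    """Exact two-sided sign test P value; zero differences are excluded."""
    n = n_positive + n_negative
    k = min(n_positive, n_negative)
    # Probability of k or fewer differences in the rarer direction
    # when each sign is equally likely (Binomial(n, 0.5)).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 11 positive and 1 negative difference, as in the antibody data
p = sign_test_p(n_positive=11, n_negative=1)
print(f"P = {p:.3f}")
```

For 1 negative and 11 positive differences the tail probability is 13/4096, so the two-sided value is 26/4096, which rounds to the 0.006 quoted above.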

Antibody to type II group B Streptococcus (µg/ml)

We have often come across the idea that we should not use t distribution methods for small samples but should

instead use rank based methods. The statement is sometimes that we should not use t methods at all for

samples of fewer than six observations.4 But, as we noted, rank based methods cannot produce anything useful

for such small samples.

The aversion to parametric methods for small samples may arise from the inability to assess the distribution

shape when there are so few observations. How can we tell whether data follow a normal distribution if we have

only a few observations? The answer is that we have not only the data to be analysed, but usually also

experience of other sets of measurements of the same thing. In addition, general experience tells us that body

size measurements are usually approximately normal, as are the logarithms of many blood concentrations and

the square roots of counts.

Notes

Cite this as: BMJ 2009;338:a3166

Footnotes

We thank Jonathan Cowley for the data in table 1.

References

1. Altman DG, Bland JM. Parametric v non-parametric methods for data analysis. BMJ 2009;338:a3167.

2. Baker CJ, Kasper DL, Edwards MS, Schiffman G. Influence of preimmunization antibody levels on the specificity of the

immune response to related polysaccharide antigens. N Engl J Med 1980;303:173-8.

3. Altman DG. Practical statistics for medical research. London: Chapman and Hall, 1991: 224-5.

4. Siegel S. Nonparametric statistics for the behavioral sciences. 1st ed. Tokyo: McGraw-Hill Kogakusha, 1956:32.


BMJ 2011; 342 doi: https://doi.org/10.1136/bmj.d556 (Published 11 March 2011) Cite this as: BMJ 2011;342:d556

J Martin Bland, professor of health statistics 1, Douglas G Altman, professor of statistics in medicine 2

1 Department of Health Sciences, University of York, York YO10 5DD

2 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD

In a study of 150 adult diabetic patients there was a strong correlation between abdominal circumference and

body mass index (BMI) (r = 0.85).1 The authors went on to report that the correlation differed in different BMI

categories as shown in the table.

BMI group (kg/m²)    r

<25                  0.62

25 to 30             0.50

30 to 35             0.09

>35                  0.86

Correlation between abdominal circumference and body mass index (BMI) in 1450 adult patients with diabetes

The authors' interpretation of these data was that in patients with low or high BMI values (BMI <25 kg/m² and

BMI >35 kg/m²) the correlation was strong, but in those with BMI values between 25 and 35 kg/m² the correlation

was weak or missing. They concluded that measuring abdominal circumference is of particular importance in

subjects with the most frequent BMI category (25 to 35 kg/m²).

When we restrict the range of one of the variables, a correlation coefficient will be reduced. For example, fig 1

shows some BMI and abdominal circumference measurements from a different population. Although these

people are from a rather thinner population, the correlation coefficient is very similar, r = 0.82 (P<0.0001). When

we divide the sample into the same four restricted ranges of BMI at 20, 25, and 30 kg/m², the correlation

coefficient in each interval is smaller than the correlation coefficient for the whole sample. This phenomenon is

to be expected; it is a result of restricting the range of data, not any particular property of BMI and abdominal

circumference.

Figure 1

BMI and abdominal circumference in 202 men and women, with correlation coefficients in four restricted

ranges and overall

One interpretation of the correlation coefficient r is that r² is the proportion of the variation in abdominal

circumference explained or predicted by the variation in BMI. If we restrict the range of BMI values we reduce the

variation in BMI, which will explain less variation in abdominal circumference, and r will fall. If we further reduce

the variation in BMI until all remaining patients have the same BMI, then we cannot explain any variation in

abdominal circumference and the correlation must be zero. (By contrast within any of the sections of fig 1 the

fitted regression line would be the same, apart from random variation.)
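
The effect of restricting the range can be seen in simulated data. The sketch below is illustrative only: the population (BMI with mean 24 and SD 4 kg/m², and a linear relation with noise chosen to give an overall r of roughly 0.8) is an assumption, not the data behind fig 1, and the seed is arbitrary.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: the same straight line holds at every BMI.
bmi = [rng.gauss(24, 4) for _ in range(4000)]
circ = [40 + 2.0 * b + rng.gauss(0, 5.6) for b in bmi]

r_all = pearson_r(bmi, circ)
print(f"all data: r = {r_all:.2f}")

# Restrict BMI using cut points at 20, 25, and 30 kg/m2.
bands = [(float("-inf"), 20), (20, 25), (25, 30), (30, float("inf"))]
r_bands = []
for lo, hi in bands:
    pairs = [(b, c) for b, c in zip(bmi, circ) if lo <= b < hi]
    r = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
    r_bands.append(r)
    print(f"BMI {lo} to {hi}: n = {len(pairs)}, r = {r:.2f}")
```

The relation between the two variables is identical in every band by construction, so any drop in r within a band reflects only the reduced variation in BMI.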

For another example, fig 2 shows the weights and heights of the same sample, with different symbols for men

and women. Clearly, the lower end of the height range for men is higher than the lower end of the range for

women, but the upper ends of the ranges are very similar. The men's heights (SD 6.0 cm) are less variable than

those of the women (SD 8.9 cm) or the heights of both sexes combined (also SD 8.9 cm). The correlation

coefficients for women and for both men and women are very similar and considerably larger than that for men

alone.

Figure 2

Weight and height in 202 men and women, with correlation coefficients

The same phenomenon can arise when the sample is restricted using another variable related to the ones being

studied. For example, the correlation between weight and height of schoolchildren will increase as the age range

is increased. But a spurious correlation may also be seen in such a situation, for example between shoe size and

spelling ability.2 Such an example illustrates the well worn phrase that an observed association does not imply

causation.

Correlation coefficients are a property of the variables and also the population in which they are measured. If we

look at a restricted population, we should not conclude that there is little or no relation between the variables

because the correlation coefficient is small. But given a clear relation in the whole group, we see no point in

looking within categories of one of the variables. In any case, regression is generally the preferred approach to

considering the relation between two continuous variables.

Notes

Cite this as: BMJ 2011;342:d556

Footnotes

Acknowledgements: The data are taken from a student elective project by Dr Malcolm Savage.

Contributors: JMB and DGA jointly wrote and agreed the text, JMB did the statistical analysis.

Competing interests: All authors have completed the Unified Competing Interest form at

www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no

support from any organisation for the submitted work; no financial relationships with any organisations that

might have an interest in the submitted work in the previous 3 years; no other relationships or activities that

could appear to have influenced the submitted work.

References

1. Nádas J, Putz Z, Kolev G, Nagy S, Jermendy G. Intraobserver and interobserver variability of measuring waist

circumference. Med Sci Monit 2008;14:CR15-8.

2. Goodwin LD, Leech NL. Understanding correlation: factors that affect the size of r. J Exp Educ 2006;74:251-66.

BMJ 2011;342:d561 doi: 10.1136/bmj.d561 Page 1 of 3

STATISTICS NOTES

misleading

J Martin Bland professor of health statistics 1, Douglas G Altman professor of statistics in medicine 2

1 Department of Health Sciences, University of York, York YO10 5DD; 2 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD

When we randomise trial participants into two or more intervention groups, we do this to remove bias; the groups will, on average, be comparable in every respect except the treatment which they receive. Provided the trial is well conducted, without other sources of bias, any difference in the outcome of the groups can then reasonably be attributed to the different interventions received. In a previous note we discussed the analysis of those trials in which the primary outcome measure is also measured at baseline. We discussed several valid analyses, observing that analysis of covariance (a regression method) is the method of choice.1

Rather than comparing the randomised groups directly, however, researchers sometimes look at the change in the measurement between baseline and the end of the trial; they test whether there was a significant change from baseline, separately in each randomised group. They may then report that this difference is significant in one group but not in the other, and conclude that this is evidence that the groups, and hence the treatments, are different. One such example was a recent trial in which participants were randomised to receive either an anti-ageing cream or the vehicle as a placebo.2 A wrinkle score was recorded at baseline and after six months. The authors gave the results of significance tests comparing the score with baseline for each group separately, reporting the active treatment group to have a significant difference (P=0.013) and the vehicle group not (P=0.11). Their interpretation was that the cosmetic cream resulted in significant clinical improvement in facial wrinkles. But we cannot validly draw this conclusion, because the lack of a significant difference in the vehicle group does not provide good evidence that the anti-ageing product is superior.3

The essential feature of a randomised trial is the comparison between groups. Within group analyses do not address a meaningful question: the question is not whether there is a change from baseline, but whether any change is greater in one group than the other. It is not possible to draw valid inferences by comparing P values. In particular, there is an inflated risk of a false positive result, which we shall illustrate with a simulation.

The table shows simulated data for a randomised trial with two groups of 30 participants. Data were drawn from the same population, so there is no systematic difference between the two groups. The true baseline measurements had a mean of 10.0 with standard deviation (SD) 2.0, and the outcome measurement was equal to the baseline plus an increase of 0.5 and a random element with SD 1.0. The difference between mean outcomes is −0.22 (95% confidence interval −0.75 to 0.34, P=0.5), adjusting for the baseline by analysis of covariance.1 The difference is not statistically significant, which is not surprising because we know that the null hypothesis of no difference in the population is true. If we compare baseline with outcome for each group using a paired t test, however, for group A the difference is statistically significant, P=0.03, for group B it is not significant, P=0.2. These results are quite similar to those of the anti-ageing cream trial.2

We would not wish to draw any conclusions from one simulation. In 1000 runs, the difference between groups had P<0.05 in the analysis of covariance 47 times, or for 4.7% of samples, very close to the 5% we expect. Of the 2000 comparisons between baseline and outcome, 1500 (75%) had P<0.05. In this simulation, where there is no difference whatsoever between the two treatments, the probability of a significant difference in one group but not the other was 38%, not 5%. Hence a significant difference in one group but not the other is not good evidence of a significant difference between the groups. Even when there is a clear benefit of one treatment over the other, separate P values are not the way to analyse such studies.4

How many pairs of tests will have one significant and one non-significant difference depends on the size of the change from baseline to final measurement. If the population difference from baseline is very large, nearly all the within group tests will be significant, and if the population difference is small, nearly all tests will be not significant, so there will be few samples with only one significant difference. If the difference is such that half the samples would show a significant change from baseline, as it would be in our simulation if the underlying difference were 0.37 rather than 0.5, we would expect 50% of samples to have just one significant difference.

The anti-ageing trial is not the only one where we have seen this misleading approach applied to randomised trial data.3 We even found it once in the BMJ!5

Contributors: JMB and DGA jointly wrote and agreed the text, JMB did the statistical analysis.

Competing interests: All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.

1 Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow-up measurements. BMJ 2001;323:1123-4.

2 Watson REB, Ogden S, Cotterell LF, Bowden JJ, Bastrilles JY, Long SP, et al. A cosmetic anti-ageing product improves photoaged skin: a double-blind, randomized controlled trial. Br J Dermatol 2009;161:419-26.

3 Bland JM. Evidence for an anti-ageing product may not be so clear as it appears. Br J Dermatol 2009;161:1207-8.

4 Altman DG, Bland JM. Interaction revisited: the difference between two estimates. BMJ 2003;326:219.

5 Bland JM, Altman DG. Informed consent. BMJ 1993;306:928.

Cite this as: BMJ 2011;342:d561


Table

Table 1 | Simulated data from a randomised trial comparing two groups of 30 patients, with a true change from baseline but no difference between groups (sorted by baseline values within each group)

           Group A                          Group B
  No Baseline 6 months Change      No Baseline 6 months Change
   1      6.4      7.1    0.7       1      6.8      7.9    1.1
   2      6.6      5.6   -1.0       2      7.2      7.5    0.3
   3      7.3      8.3    1.0       3      7.2      6.9   -0.3
   4      7.7      9.1    1.4       4      7.4      6.9   -0.5
   5      7.7      9.5    1.8       5      7.5      8.3    0.8
   6      7.9      9.6    1.7       6      7.5      9.4    1.9
   7      8.0      8.5    0.5       7      8.3      9.0    0.7
   8      8.0      8.5    0.5       8      8.4      8.8    0.4
   9      8.1      9.1    1.0       9      8.7      8.0   -0.7
  10      9.2      9.6    0.4      10      9.0      7.2   -1.8
  11      9.3      8.7   -0.6      11      9.2      7.1   -2.1
  12      9.6     10.7    1.1      12      9.6     10.6    1.0
  13      9.7      9.0   -0.7      13      9.9     11.0    1.1
  14      9.8      9.0   -0.8      14     10.1     11.5    1.4
  15      9.8      8.0   -1.8      15     10.2     10.4    0.2
  16     10.2     11.1    0.9      16     10.3     11.0    0.7
  17     10.3     11.5    1.2      17     10.4      9.9   -0.5
  18     10.6      9.1   -1.5      18     10.5     11.3    0.8
  19     10.6     12.0    1.4      19     10.7      9.9   -0.8
  20     10.7     13.2    2.5      20     10.8     10.7   -0.1
  21     10.9      9.7   -1.2      21     10.8     11.8    1.0
  22     11.1     12.2    1.1      22     11.1     10.0   -1.1
  23     11.2     10.8   -0.4      23     11.1     13.2    2.1
  24     11.8     11.9    0.1      24     11.4     11.8    0.4
  25     12.3     12.2   -0.1      25     11.6     12.1    0.5
  26     12.4     12.6    0.2      26     11.7     11.5   -0.2
  27     13.1     15.0    1.9      27     12.0     12.7    0.7
  28     13.2     13.8    0.6      28     12.3     13.7    1.4
  29     13.3     14.1    0.8      29     13.7     12.6   -1.1
  30     13.7     14.2    0.5      30     13.9     13.7   -0.2
Mean    10.02    10.46   0.44    Mean     9.98    10.21   0.24
  SD     2.06     2.29   1.06      SD     1.90     2.09   1.02
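
The within group part of the simulation described in this note can be sketched as follows. This is an illustration under stated assumptions, not the code used for the published results: it uses only the change scores (true change 0.5, random element with SD 1.0, 30 participants per group), an arbitrary seed, and a hard-coded critical value of 2.045 for the t distribution with 29 degrees of freedom in place of a computed P value. It does not reproduce the analysis of covariance.

```python
import random
from math import sqrt
from statistics import mean, stdev

T_CRIT_29DF = 2.045  # two-sided 5% point of t with 29 degrees of freedom
N, RUNS = 30, 1000
rng = random.Random(2011)  # arbitrary fixed seed


def paired_change_significant(true_change: float) -> bool:
    """Simulate one group and test its change from baseline at P<0.05.

    The outcome is baseline + true_change + noise (SD 1.0), so the paired
    differences do not depend on the baseline values themselves.
    """
    changes = [true_change + rng.gauss(0, 1.0) for _ in range(N)]
    t = mean(changes) / (stdev(changes) / sqrt(N))
    return abs(t) > T_CRIT_29DF


within_significant = 0
discordant = 0  # trials with one group significant and the other not
for _ in range(RUNS):
    sig_a = paired_change_significant(0.5)
    sig_b = paired_change_significant(0.5)
    within_significant += sig_a + sig_b
    discordant += sig_a != sig_b

prop_within = within_significant / (2 * RUNS)
prop_discordant = discordant / RUNS
print(f"within-group tests with P<0.05: {prop_within:.0%}")
print(f"trials significant in one group only: {prop_discordant:.0%}")
```

If each within group test is significant with probability about 0.75, the chance that exactly one of two independent tests is significant is 2 × 0.75 × 0.25, or about 38%, which is the figure reported in the text.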

BMJ 2011;343:d2090 doi: 10.1136/bmj.d2090 (Published 8 August 2011) Page 1 of 2

Douglas G Altman professor of statistics in medicine 1, J Martin Bland professor of health statistics 2

1 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD; 2 Department of Health Sciences, University of York, Heslington, York YO10 5DD

Confidence intervals (CIs) are widely used in reporting statistical analyses of research data, and are usually considered to be more informative than P values from significance tests.1 2 Some published articles, however, report estimated effects and P values, but do not give CIs (a practice BMJ now strongly discourages). Here we show how to obtain the confidence interval when only the observed effect and the P value were reported.

The method is outlined in the box below in which we have distinguished two cases.

(a) Calculating the confidence interval for a difference

[...] two means, such as in a randomised trial with a binary outcome or a measurement such as blood pressure.

For example, the abstract of a report of a randomised trial included the statement that "more patients in the zinc group than in the control group recovered by two days (49% v 32%, P=0.032)."5 The difference in proportions was Est = 17 percentage points, but what is the 95% confidence interval (CI)?

Following the steps in the box we calculate the CI as follows:

z = −0.862 + √[0.743 − 2.404×log(0.032)] = 2.141;

SE = 17/2.141 = 7.940, so that 1.96×SE = 15.56 percentage points;

95% CI is 17.0 − 15.56 to 17.0 + 15.56, or 1.4 to 32.6 percentage points.

(b) Calculating the confidence interval for a ratio (log transformation needed)

The calculation is trickier for ratio measures, such as risk ratio, odds ratio, and hazard ratio. We need to log transform the estimate and then reverse the procedure, as described in a previous Statistics Note.6

For example, the abstract of a report of a cohort study includes the statement that "In those with a [diastolic blood pressure] reading of 95-99 mm Hg the relative risk was 0.30 (P=0.034)."7 What is the confidence interval around 0.30?

Following the steps in the box we calculate the CI as follows:

z = −0.862 + √[0.743 − 2.404×log(0.034)] = 2.117;

Est = log(0.30) = −1.204;

SE = −1.204/2.117 = −0.569 but we ignore the minus sign, so SE = 0.569, and 1.96×SE = 1.115;

95% CI on log scale = −1.204 − 1.115 to −1.204 + 1.115 = −2.319 to −0.089;

95% CI on natural scale = exp(−2.319) = 0.10 to exp(−0.089) = 0.91.

Hence the relative risk is estimated to be 0.30 with 95% CI 0.10 to 0.91.

Limitations of the method

The methods described can be applied in a wide range of settings, including the results from meta-analysis and regression analyses. The main context where they are not correct is in small samples where the outcome is continuous and the analysis has been done by a t test or analysis of variance, or the outcome is dichotomous and an exact method has been used for the confidence interval. However, even here the methods will be approximately correct in larger studies with, say, 60 patients or more.

P values presented as inequalities

Sometimes P values are very small and so are presented as P<0.0001 or something similar. The above method can be applied for small P values, setting P equal to the value it is less than, but the z statistic will be too small, hence the standard error will be too large and the resulting CI will be too wide. This is not a problem so long as we remember that the estimate is better than the interval suggests.

When we are told that P>0.05 or the difference is not significant, things are more difficult. If we apply the method described here, using P=0.05, the confidence interval will be too narrow. We must remember that the estimate is even poorer than the confidence interval calculated would suggest.

For personal use only: See rights and reprints http://www.bmj.com/permissions Subscribe: http://www.bmj.com/subscribe


Steps to obtain the confidence interval (CI) for an estimate of effect from the P value and the estimate (Est)

(a) CI for a difference

1 calculate the test statistic for a normal distribution test, z, from P (reference 3): z = −0.862 + √[0.743 − 2.404×log(P)]

2 calculate the standard error: SE = Est/z (ignoring minus signs)

3 calculate the 95% CI: Est − 1.96×SE to Est + 1.96×SE.

For a ratio measure, such as a risk ratio, the above formulas should be used with the estimate Est on the log scale (eg, the log risk ratio). Step 3 gives a CI on the log scale; to derive the CI on the natural scale we need to exponentiate (antilog) Est and its CI.4

Notes

All P values are two sided.

All logarithms are natural (ie, to base e).4

For a 90% CI, we replace 1.96 by 1.65; for a 99% CI we use 2.57.
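
The three steps in the box can be sketched in Python. This is a minimal illustration of the approximation as stated there; the function name and its arguments are hypothetical, and `ratio=True` is an assumed convenience for working on the log scale.

```python
from math import exp, log, sqrt


def ci_from_p(est: float, p: float, ratio: bool = False, z_crit: float = 1.96):
    """Approximate CI from an estimate and its two-sided P value.

    Set ratio=True for risk, odds, or hazard ratios; the calculation is
    then done on the log scale and the limits are transformed back.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    # Step 1: test statistic z from P (natural logarithm)
    z = -0.862 + sqrt(0.743 - 2.404 * log(p))
    e = log(est) if ratio else est
    # Step 2: standard error, ignoring minus signs
    se = abs(e / z)
    # Step 3: confidence limits (1.96 gives a 95% CI)
    lower, upper = e - z_crit * se, e + z_crit * se
    return (exp(lower), exp(upper)) if ratio else (lower, upper)


# The two worked examples from the text
print(ci_from_p(17, 0.032))                # difference in percentage points
print(ci_from_p(0.30, 0.034, ratio=True))  # relative risk
```

Passing `z_crit=1.65` or `z_crit=2.57` gives the 90% or 99% interval described in the notes.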

1 Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746-50.

2 Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, et al. CONSORT 2010. Explanation and Elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c869.

3 Lin J-T. Approximating the normal tail probability and its inverse for use on a pocket calculator. Appl Stat 1989;38:69-70.

4 Bland JM, Altman DG. Statistics Notes. Logarithms. BMJ 1996;312:700.

5 Roy SK, Hossain MJ, Khatun W, Chakraborty B, Chowdhury S, Begum A, et al. Zinc supplementation in children with cholera in Bangladesh: randomised controlled trial. BMJ 2008;336:266-8.

6 Altman DG, Bland JM. Interaction revisited: the difference between two estimates. BMJ 2003;326:219.

7 Lindblad U, Råstam L, Rydén L, Ranstam J, Isacsson S-O, Berglund G. Control of blood pressure and risk of first acute myocardial infarction: Skaraborg hypertension project. BMJ 1994;308:681.

Cite this as: BMJ 2011;343:d2090

BMJ Publishing Group Ltd 2011


BMJ 2011;343:d2304 doi: 10.1136/bmj.d2304 Page 1 of 2

STATISTICS NOTES

Douglas G Altman professor of statistics in medicine 1, J Martin Bland professor of health statistics 2

1 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD; 2 Department of Health Sciences, University of York, Heslington, York YO10 5DD

We have shown in a previous Statistics Note1 how we can calculate a confidence interval (CI) from a P value. Some published articles report confidence intervals, but do not give corresponding P values. Here we show how a confidence interval can be used to calculate a P value, should this be required. This might also be useful when the P value is given only imprecisely (eg, as P<0.05). Wherever they can be calculated, we are advocates of confidence intervals as much more useful than P values, but we like to be helpful.

The method is outlined in the box below in which we have distinguished two cases.

(a) P from CI for a difference (no transformation needed)

The simple case is when we have a CI for the difference between two means or two proportions. For example, participants in a trial received antihypertensive treatment with or without pravastatin. The authors report that pravastatin performed slightly worse than a placebo. The estimated difference between group means was 1.9 (95% CI −0.6 to 4.3) mm Hg.4 What was the P value?

Following the steps in the box we calculate P as follows:

SE = [4.3 − (−0.6)]/(2×1.96) = 1.25;

z = 1.9/1.25 = 1.52;

P = exp(−0.717×1.52 − 0.416×1.52²) = 0.13.

In this paper, the authors did indeed publish a P value of 0.13,4 as we have estimated from their confidence interval.

(b) CI for a ratio (log transformation needed)

The calculation is trickier for ratio measures, such as risk ratio, odds ratio, and hazard ratio. We need to log transform the estimate and confidence limits, so that Est, l, and u in the box are the logarithms of the published values.

For example, in a meta-analysis of several studies comparing [...] Taggart et al presented a hazard ratio of 0.81; 95% CI 0.70 to 0.94.5 They did not quote the P value.

Following the steps in the box we calculate P as follows:

Est = log(0.81) = −0.211

l = log(0.70) = −0.357, u = log(0.94) = −0.062

SE = [−0.062 − (−0.357)]/(2×1.96) = 0.0753.

z = −0.211/0.0753 = −2.802. We take the positive value of z, 2.802.

P = exp(−0.717×2.802 − 0.416×2.802²) = 0.005.

Limitations of the method

The formula for P is unreliable for very small P values and if your P value is smaller than 0.0001, just report it as P<0.0001.

The methods described can be applied in a wide range of settings, including the results from meta-analysis and regression analyses. The main context where they are not correct is small samples where the outcome is continuous and the analysis has been done by a t test or analysis of variance, or the outcome is dichotomous and an exact method has been used for the confidence interval. However, even here the methods will be approximately correct in larger studies with, say, 60 patients or more.

Contributors: JMB and DGA jointly wrote and agreed the text.

Competing interests: All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.

1 Altman DG, Bland JM. How to obtain a confidence interval from a P value. BMJ 2011;343:d2090.

2 Lin J-T. Approximating the normal tail probability and its inverse for use on a pocket calculator. Appl Stat 1989;38:69-70.


Steps to obtain the P value from the CI for an estimate of effect (Est)

If the upper and lower limits of a 95% CI are u and l respectively:

1 calculate the standard error: SE = (u − l)/(2×1.96)

2 calculate the test statistic: z = Est/SE

3 calculate the P value (reference 2): P = exp(−0.717×z − 0.416×z²).

For a ratio measure, such as a risk ratio, the above formulas should be used with the estimate Est and the confidence limits on the log scale (eg, the log risk ratio and its CI).

Notes

All P values are two sided.

All logarithms are natural (ie, to base e).3

exp is the exponential function.

The formula for P works only for positive z, so if z is negative we remove the minus sign.

For a 90% CI, we replace 1.96 by 1.65; for a 99% CI we use 2.57.
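
The steps in the box translate directly into a short function. The sketch below is illustrative; the function name and the `ratio` argument are hypothetical conveniences, and the approximation is the one given in step 3.

```python
from math import exp, log


def p_from_ci(est: float, lower: float, upper: float,
              ratio: bool = False, z_crit: float = 1.96) -> float:
    """Approximate two-sided P value from an estimate and its CI.

    Set ratio=True for risk, odds, or hazard ratios so that the estimate
    and both limits are first put on the log scale.
    """
    if ratio:
        est, lower, upper = log(est), log(lower), log(upper)
    se = (upper - lower) / (2 * z_crit)        # step 1
    z = abs(est / se)                          # step 2, sign removed
    return exp(-0.717 * z - 0.416 * z ** 2)    # step 3


# The two worked examples from the text
print(round(p_from_ci(1.9, -0.6, 4.3), 2))               # difference in mm Hg
print(round(p_from_ci(0.81, 0.70, 0.94, ratio=True), 3))  # hazard ratio
```

As the note says, the formula is unreliable for very small P values, so a result below 0.0001 is better reported as P<0.0001 than quoted exactly.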

3 Mancia G, Parati G, Revera M, Bilo G, Giuliano A, Veglia F, et al. Statins, antihypertensive treatment, and blood pressure control in clinic and over 24 hours: evidence from PHYLLIS randomised double blind trial. BMJ 2010;340:c1197.

4 Taggart DP, D'Amico R, Altman DG. Effect of arterial revascularisation on survival: a systematic review of studies comparing bilateral and single internal mammary arteries. Lancet 2001;358:870-5.

5 Bland JM, Altman DG. Logarithms. BMJ 1996;312:700.

Cite this as: BMJ 2011;343:d2304

BMJ 2011;343:d570 doi: 10.1136/bmj.d570 Page 1 of 2

STATISTICS NOTES

Douglas G Altman professor of statistics in medicine 1, J Martin Bland professor of health statistics 2

1 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD; 2 Department of Health Sciences, University of York, York YO10 5DD

Each year, new health sciences postgraduate students in York are given a simple maths test. Each year the majority of them fail to calculate 20 − 3 × 5 correctly. According to the conventional rules of arithmetic, division and multiplication are done before addition and subtraction, so 20 − 3 × 5 = 20 − 15 = 5. Many students work from left to right and calculate 20 − 3 × 5 as 17 × 5 = 85. If that was what was actually meant, we would need to use brackets: (20 − 3) × 5 = 17 × 5 = 85. Brackets tell us that the enclosed part must be evaluated first. That convention is part of various mnemonic acronyms which indicate the order of operations, such as BODMAS (Brackets, Of (that is, power of), Divide, Multiply, Add, Subtract) and PEMDAS (Parentheses, Exponentiation, Multiplication, Division, Addition, Subtraction).1

Schoolchildren learn the basic rules about how to construct and interpret mathematical formulas.1 The conventions exist to ensure that there is absolutely no ambiguity, as mathematics (unlike prose) has no redundancy, so any mistake may have serious consequences. Our experience is that mistakes are quite common when formulas are presented in medical journal articles. A particular concern is that brackets are often omitted or misused. The following examples are typical and we mean nothing personal by choosing them.

where P1 = n2 − n1/r1 and P2 = n3 − n2/r2. Here n1, n2, and n3 are refractive indices, r1, r2, and t are distances in metres, and P, P1, and P2, are powers in dioptres. But he should have written P1 = (n2 − n1)/r1 and P2 = (n3 − n2)/r2. P1 = n2 − n1/r1 is clearly wrong dimensionally, as P1 is dioptres, 1/metre, n2 and n1 are ratios and so pure numbers, and r1 is in metres. Also, it is not clear whether t/n2 (P1P2) means (t/n2) P1P2, which it does, or t/(n2P1P2).

Do such errors matter? Certainly. In our experience the calculations are usually correct in the paper, but anyone using the published formula would go wrong. Sometimes, however, the incorrect formula was used, as in the following case.

Example 3

In their otherwise exemplary evaluation of the chronic ankle instability scale, Eechaute et al4 made a mistake in their formula for the minimal detectable change (MDC) or repeatability coefficient,5 writing: MDC = 2.04 × √(2 × SEM). Here SEM is the standard error of a measurement or within subject standard deviation.5 This formula uses 2.04 where 2 or 1.96 is more usual,6 but, much more seriously, the SEM should not be included within the square root, as the brackets indicate. This might be dismissed as a simple typographical error, but the

authors actually used this incorrect formula. Their value of SEM

Example 1 was 2.7 points, so they calculated the minimal detectable change

In a discussion of methods for analysing diagnostic test as 2.04 (2 2.7) = 4.7. They should have calculated 2.04

accuracy, Collinson 2 wrote: (2) 2.7 = 7.8. Their erroneous formula makes the scale appear

considerably more reliable than it actually is.6 The formula is

Sensitivity = TP/TP + FN also wrong in terms of dimensions, because the minimum

where TP = true positive and FN = false negative. The formula clinical difference should be in the same units as the

should, of course, be: measurement, not in square root units.

Some mistakes in formulas may be present in a submitted

Sensitivity = TP/(TP + FN).

manuscript, but others might be introduced in the publication

process. For example, problems sometimes arise when a

Example 2 displayed formula is converted to an in-text formula as part

For a non-statistical example, Leyland 3 wrote that the total of the editing, and the implications are not realised or not noticed

optical power of the cornea is: by either editing staff or authors. Often it is necessary to insert

brackets when reformatting a formula. So the simple formula:

P = P1 + P2 t/n2(P1P2)


      p
    -----
    1 − p

should be changed to p/(1 − p) if moved to the text.

Formulas in published articles may be used by others, so mistakes may lead to substantive errors in research. It is essential that authors and editors check all formulas carefully.

Acknowledgements: We are very grateful to Phil McShane for pointing out a mistake in an earlier version of this statistics note.

Contributors: DGA and JMB jointly wrote and agreed the text.

Competing interests: All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.

1 Wikipedia. Order of operations. [cited 2010 Nov 23]. http://en.wikipedia.org/wiki/Order_of_operations

2 Collinson P. Of bombers, radiologists, and cardiologists: time to ROC. Heart 1998;80:215-7.

3 Leyland M. Validation of Orbscan II posterior corneal curvature measurement for intraocular lens power calculation. Eye 2004;18:357-60.

4 Eechaute C, Vaes P, Duquet W. The chronic ankle instability scale: Clinimetric properties of a multidimensional, patient-assessed instrument. Phys Ther Sport 2008;9:57-66.

5 Bland JM, Altman DG. Statistics Notes. Measurement error. BMJ 1996;312:744.

6 Bland JM. Minimal detectable change. Phys Ther Sport 2009;10:39.

Cite this as: BMJ 2011;343:d570
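
The arithmetic in the examples of this note can be checked directly, because Python follows the same order of operations. The sketch below is only an illustration: the counts tp and fn are hypothetical, while the SEM of 2.7 points and the multiplier 2.04 are the values quoted in Example 3.

```python
import math

# Order of operations: multiplication is done before subtraction.
conventional = 20 - 3 * 5        # 20 - 15
left_to_right = (20 - 3) * 5     # brackets force the subtraction first

# Example 1 with hypothetical counts of true positives and false negatives.
tp, fn = 90, 10
as_printed = tp / tp + fn        # evaluated as (tp / tp) + fn
intended = tp / (tp + fn)        # sensitivity, a proportion between 0 and 1

# Example 3: minimal detectable change with SEM = 2.7 points.
sem = 2.7
mdc_as_published = 2.04 * math.sqrt(2 * sem)   # SEM wrongly inside the root
mdc_corrected = 2.04 * math.sqrt(2) * sem      # SEM outside the root

print(conventional, left_to_right)
print(as_printed, intended)
print(round(mdc_as_published, 1), round(mdc_corrected, 1))
```

With these hypothetical counts the formula as printed returns a "sensitivity" far greater than 1, which is one quick way to notice that brackets are missing.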

BMJ 2013;346:f3438 doi: 10.1136/bmj.f3438 (Published 6 June 2013) Page 1 of 4

trials

Andrew J Vickers attending research methodologist 1, Douglas G Altman professor of statistics in medicine 2

1 Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA; 2 Centre for Statistics in

In most randomised trials, some patients fail to provide data for study endpoints.1 We have previously described the analysis of a trial of acupuncture versus sham acupuncture for the treatment of shoulder pain.2 All 52 randomised patients provided baseline data on pain and range of motion, but only 45 returned for follow-up testing. The statistical question is how to handle those seven patients with missing data. The most straightforward approach is simply to ignore the seven patients and do what is known as an "available case analysis" (often confusingly known as "complete case analysis"). As not all randomised patients are included in the analysis, this leads to reduced statistical power.1

A method that attempts to include all randomised patients is "last observation carried forward," in which the last measurement obtained from the patient is used for all data points that were subsequently missed. This method is attractive because it is simple, but it has little else to recommend it. Substituting a missing data point with a value is known as "imputation,"1 and the data analyst needs a clear rationale for the type of imputation used. That a patient's responses would remain the same after drop-out is generally implausible. This is most obvious in chronic degenerative diseases. For instance, cognitive function scores decrease over time in dementia, so last observation carried forward gives overoptimistic scores for patients who drop out (figure). If a treatment was associated with toxicity, and this led to earlier drop-out than in the control group, the method would give results biased in favour of the experimental arm.3 By contrast, shoulder pain generally gets better over time, either because treatment is effective or because of the placebo effect and regression to the mean.4 In the randomised trial, patients in the control group improved by a mean of 9.8 points out of 100 from baseline to post-treatment follow-up, whereas patients who received acupuncture improved by 21.5 points. So assuming that patients lost to follow-up experienced precisely zero change in pain scores makes little sense. Last observation carried forward may also underestimate the standard deviation of the endpoint, especially in cases in which the last observation is the baseline, leading to confidence intervals that are too narrow.

A more sophisticated approach to missing data is known as multiple imputation, which uses a regression model to predict missing values.5 In randomised trials, the strongest predictors of future outcome are often the scores provided by the patient so far, but other variables can be included. To avoid underestimating the width of the confidence interval, multiple imputation involves a form of random sampling. For a given patient with a missing outcome, regression is used to predict the mean value of the missing outcome for similar patients and also the variability around the mean; a value is then selected at random from this distribution. The results from several imputations (hence "multiple") are combined using a method known as Rubin's rules.5 6 Multiple imputation is widely believed to be the preferred approach to missing data, not just for randomised trials.7 It is computationally complex, however, and needs to be implemented by special software, such as the ice command in Stata (see www.multiple-imputation.com).

The table shows the results of the shoulder pain study analysed by each method. The estimates for available case and multiple imputation do not differ much, although multiple imputation has a slightly narrower confidence interval. Last observation carried forward appears to be biased (it underestimates the effects of acupuncture) and gives a confidence interval that is too narrow.

Multiple imputation works best when good predictors of outcome are available. In the shoulder pain example, baseline score was only moderately correlated with follow-up score (r≈0.4). Had outcome been assessed halfway through treatment, this measure would have been more highly correlated with post-treatment score, markedly improving the properties of the multiple imputation.

Multiple imputation has several important strengths, but it does not adjust for the sort of bias created if patients were less likely to return for follow-up if they were in a lot of pain; this is an inherent limitation to missing data analysis. We cannot know whether patients' pain levels affect the chance that they will complete a pain questionnaire because, obviously enough, we do not have the pain scores of non-respondents.
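
The step in which results from several imputations are combined can be sketched as follows. This is a minimal illustration of the pooling in Rubin's rules only, not of the imputation itself, and it is not the Stata implementation mentioned in the note. The function name pool_rubin and the three estimates and standard errors are hypothetical, not values from the shoulder pain trial.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Pool m imputed-data estimates and their standard errors.

    Total variance = within-imputation variance
                     + (1 + 1/m) * between-imputation variance.
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean([se ** 2 for se in std_errors])
    between = statistics.variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical results from three imputed datasets.
estimate, se = pool_rubin([14.0, 14.5, 14.4], [4.4, 4.5, 4.6])
print(f"pooled estimate = {estimate:.2f}, pooled SE = {se:.2f}")
```

The between-imputation term is what stops the pooled standard error from being too small: it carries the extra uncertainty that comes from not knowing the missing values.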

For personal use only: See rights and reprints http://www.bmj.com/permissions Subscribe: http://www.bmj.com/subscribe


Sometimes simple common sense is more important than complex statistics. In the shoulder pain trial, three of the seven drop-outs were in the acupuncture group and four were controls, so it seems implausible that their omission had materially affected the results of the trial. If drop-out rates were very different between the two arms of a trial, that may raise concerns about bias. Above all, analysis of missing data teaches us the importance of avoiding missing data in the first place: an informed guess, even using a technique as sophisticated as multiple imputation, is still a guess.

Contributors: AJV and DGA jointly wrote and agreed the text.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Provenance and peer review: Not commissioned; not externally peer reviewed.

1 Altman DG, Bland JM. Missing data. BMJ 2007;334:424.

2 Vickers AJ, Altman DG. Statistics notes: analysing controlled trials with baseline and follow up measurements. BMJ 2001;323:1123-4.

3 Molnar FJ, Man-Son-Hing M, Hutton B, Fergusson DA. Have last-observation-carried-forward analyses caused us to favour more toxic dementia therapies over less toxic alternatives? A systematic review. Open Med 2009;3:e31-50.

4 Bland JM, Altman DG. Regression towards the mean. BMJ 1994;308:1499.

5 Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393.

6 White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med 2011;30:377-99.

7 White IR, Horton NJ, Carpenter J, Pocock SJ. Strategy for intention to treat analysis in randomised trials with missing outcome data. BMJ 2011;342:d40.

Cite this as: BMJ 2013;346:f3438

BMJ Publishing Group Ltd 2013


Table

Method                               Estimate (95% CI)     SE      P value
Available case                       14.2 (5.1 to 23.4)    4.53    0.003
Last observation carried forward*    12.6 (3.9 to 21.3)    4.33    0.005
Multiple imputation                  14.3 (5.4 to 23.3)    4.45    0.002

Using baseline score and treatment group.


Figure

Function scores over time for patient with chronic degenerative disease


BMJ 2014;349:g7064 doi: 10.1136/bmj.g7064 (Published 25 November 2014) Page 1 of 2

STATISTICS NOTES

Uncertainty and sampling error

Douglas G Altman professor of statistics in medicine 1, J Martin Bland professor of health statistics 2

1 Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford OX3 7LD, UK; 2 Department of Health Sciences, University of York, York YO10 5DD, UK

Medical research is conducted to help to reduce uncertainty. For example, randomised controlled trials aim to answer questions relating to treatment choices for a particular group of patients. Rarely, however, does a single study remove uncertainty. There are two reasons for this: sampling error and other (non-sampling) sources of uncertainty. The word "error" comes from a Latin root meaning "to wander," and we use it in its statistical sense of meaning "variation from the average," not "mistake." Sampling error arises because any sample may not behave quite the same as the larger population from which it was drawn. Non-sampling error arises from the many ways a research study may deviate from addressing the question that the researcher wants to answer.

Sampling error is very much the concern of the statistician, who imagines that the group of people in the study is just one of the many possible samples from the population of interest. Despite it being widely condemned,1 the dominant way of summarising the evidence from a research study is by the P value. It should be obvious that the evidence from a research study cannot reasonably be summarised as just a single number, but the use of P values remains unshakeable. Further, the practice of labelling P values as "significant" or "not significant" leads not only to dichotomous decisions but often also to the belief that the research question has been answered.

P values represent the probability that the observed data (or a more extreme result) could have arisen when the true effect of interest is zero (for example, the true treatment effect in a randomised trial). It is common to interpret P<0.05 ("significant") as clear evidence that there is a real effect, and P>0.05 ("not significant") as evidence that there is no effect. However, the former interpretation may be unwise, and the latter is wrong. Although 0.05 is the conventional decision point, P<0.05 is far from representing certainty. One in 20 studies could have a difference of the observed size if there were really no difference in the population. "Not significant" indicates that we found insufficient evidence to conclude that there is a real effect, not that we have shown that there is not one.2 Referring to results as "statistically significant," or not, only helps a bit.

Interpretation of a study's results should be primarily based on the estimated effect and a measure of its uncertainty. In mainstream statistics, the uncertainty of estimates is indicated by the use of confidence intervals. Before the mid-1980s, confidence intervals were rarely seen in clinical research articles. Around 1986 things changed,3 and these days almost all clinical research articles in major journals include confidence intervals. The confidence interval is a range of uncertainty around the estimate of interest, such as the treatment effect in a controlled trial.

So, for example, in a study of the impact of a mental health worker on the management of depression in primary care, it was reported that "After adjustment for baseline depression, mean depression score was 1.33 PHQ-9 points lower (95% confidence interval 0.35 to 2.31, P=0.009) in participants receiving collaborative care than in those receiving usual care at four months."4 This means that we estimate that, in the population which these trial participants represent, the average difference in mean depression score if all were offered collaborative care would be between 0.35 and 2.31 scale points less than if all were treated in the usual way. It is only an estimate. For 2.5% of studies the confidence interval will be entirely below the true population difference, and 2.5% will have the interval entirely above it. We don't think "P=0.009" adds much to this, but researchers can seldom bear to do without it. The inevitable uncertainty from sampling error can be reduced by increasing the sample size, but usually only modestly. To halve the width of the confidence interval we would need to quadruple the sample size.

A common mistake is to believe that the confidence interval expresses all the uncertainty. Rather, the confidence interval expresses uncertainty from just one cause, namely the uncertainty due to having taken a sample from the population defined by the inclusion criteria. Often there are other sources of uncertainty that may be even more important to consider, in particular relating to possibly biased results. We address these in our linked statistics note.5
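
The statement that halving the width of the confidence interval needs four times the sample size follows from the standard error shrinking with the square root of the sample size. The sketch below illustrates this for the simplest case, a single mean with a known standard deviation and a normal approximation; the function name ci_half_width and the standard deviation of 5.0 are hypothetical choices, not taken from the trial quoted above.

```python
import math


def ci_half_width(sd, n, z=1.96):
    """Half-width of a 95% confidence interval for a mean: z * sd / sqrt(n)."""
    return z * sd / math.sqrt(n)


sd = 5.0  # hypothetical standard deviation
for n in (100, 200, 400):
    print(n, round(ci_half_width(sd, n), 3))

# Quadrupling n from 100 to 400 halves the half-width.
ratio = ci_half_width(sd, 400) / ci_half_width(sd, 100)
print(ratio)
```

Doubling the sample size, by contrast, narrows the interval by only about 29%, which is why the gain from a larger sample is usually modest.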

Contributors: DGA and JMB jointly wrote and agreed the text.


Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Provenance and peer review: Not commissioned; not externally peer reviewed.

1 Sterne JA, Davey Smith G. Sifting the evidence: what's wrong with significance tests? BMJ 2001;322:226-31.

2 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995;311:485.

3 Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746-50.

4 Richards DA, Hill JJ, Gask L, Lovell K, Chew-Graham C, Bower P, et al. Clinical effectiveness of collaborative care for depression in UK primary care (CADET): cluster randomised controlled trial. BMJ 2013;347:f4913.

5 Altman DG, Bland JM. Uncertainty beyond sampling error. BMJ 2014;349:g7065.

Cite this as: BMJ 2014;349:g7064

BMJ Publishing Group Ltd 2014


BMJ 2014;349:g7065 doi: 10.1136/bmj.g7065 (Published 25 November 2014) Page 1 of 2

STATISTICS NOTES

Uncertainty beyond sampling error

Douglas G Altman professor of statistics in medicine 1, J Martin Bland professor of health statistics 2

1 Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford OX3 7LD, UK; 2 Department of Health Sciences, University of York, York YO10 5DD, UK

Statistical analysis of research results mainly uses confidence intervals and hypothesis tests to capture the uncertainty arising from our study being on a sample of participants drawn from a much larger population, in which our interest mainly lies.1 But beyond the issue of sampling variation there are other sources of uncertainty that may be even more important to consider. In measurement, a distinction is made between precision, which is how variable are measurements on the same person by the same method made at the same time, and accuracy, which is how close the measurement is to what we actually want to know. For example, if we were to ask a group of patients on two occasions how much alcohol they typically consume, this would enable us to estimate precision, how repeatable answers are, but not how close these answers are to how much they actually drink, which we might suspect to be higher. In the same way, a confidence interval tells us about the precision of research results, what would happen if we were to repeat the same study, not their accuracy, which is how close the study is to the truth.

In general, beyond the imprecision or uncertainty of numerical results arising from sampling, the main concern is the possibility that the study results are biased. Recent developments in appraising published randomised trials have switched from considering "quality" (essentially undefinable) to assessing explicitly the risk of bias in relation to the way the study was done.2 Here sources of possible bias include lack of blinding and losses to follow up (missing data). But bias is especially relevant in non-randomised studies, where there will be major concerns about possible confounding, where an apparent relationship between two things may be the result of the relationships of both to a third.

A further source of uncertainty concerns the extent to which the results of research conducted in a particular setting with selected participants can be taken as applying equally to a wider group of patients in a different location. Judgement of generalisability (also known as external validity) is challenging.3 In a clinical trial, the part of the larger population we particularly care about does not yet exist. We want to know what would happen to future patients if we were to apply either of the trial treatments. Changes over time in the nature of disease, in the fitness of the patient, in nutrition, in ancillary treatments, and so on, may all make our sample unrepresentative as a guide for future action.

For example, the UK review of evidence relating to mammographic screening outlined three sources of uncertainty in relation to the pooled estimated effect from a meta-analysis of the results of all the randomised trials.4 First, there was uncertainty due to sampling variation, as previously discussed.1 Second, there was uncertainty from some methodological weaknesses of the trials. Third was uncertainty about whether the results from the trials were still relevant 25 years later, after major changes in cancer incidence, management of and treatments for breast cancer, and the technology of mammography. Unlike sampling variation, which is quantified in the confidence interval and is uncontroversial, the uncertainty from the other causes cannot easily be quantified and remains the source of fierce debate.

Large numbers, increasingly common in this era of "big data," will produce narrow confidence intervals. These can create an illusion of accuracy, but they ignore all sources of possible bias that are not affected by sample size, and so these other sources become relatively more important.5 6 A recent example of a very precise but seriously wrong answer purported to show that skin cancer was protective for heart attacks and all-cause mortality.7 So, although confidence intervals are a valuable way of depicting uncertainty, they are always too narrow in the sense that they reflect only statistical uncertainty, precision rather than accuracy.

Many journals require authors to consider in their discussion the limitations of their study; some even require this in the article's abstract. Issues raised there help readers to judge what extra uncertainty might apply to the study, including whether the observed effects may be affected by bias. It is common for authors to say that their results "should be interpreted with caution" (including >700 in the BMJ), but who knows what that means in practice? The GRADE group have developed a framework for a more structured approach to assessing the reliability of research findings that addresses the aspects outlined above.8
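
The point that large numbers give precision without accuracy can be made concrete with a small idealised calculation. The sketch below assumes a fixed bias that does not depend on sample size and an interval centred on the expected (biased) estimate; the function names and all the numbers are hypothetical and serve only to illustrate the argument.

```python
import math


def expected_interval(true_value, bias, sd, n, z=1.96):
    """95% interval centred on the expected estimate (true value plus bias)."""
    centre = true_value + bias
    half_width = z * sd / math.sqrt(n)
    return centre - half_width, centre + half_width


def covers_truth(true_value, bias, sd, n):
    low, high = expected_interval(true_value, bias, sd, n)
    return low <= true_value <= high


# Hypothetical setting: no true effect, a bias of 0.5 units, SD of 10 units.
for n in (100, 10_000, 1_000_000):
    low, high = expected_interval(0.0, 0.5, 10.0, n)
    print(n, round(low, 3), round(high, 3), covers_truth(0.0, 0.5, 10.0, n))
```

As the sample grows the interval narrows around the biased value, so with a very large sample it excludes the truth even though it looks highly precise.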


Contributors: DGA and JMB jointly wrote and agreed the text.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Provenance and peer review: Not commissioned; not externally peer reviewed.

1 Altman DG, Bland JM. Uncertainty and sampling error. BMJ 2014;349:g7064.

2 Higgins JP, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, et al. The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ 2011;343:d5928.

3 Rothwell PM. External validity of randomised controlled trials: "to whom do the results of this trial apply?" Lancet 2005;365:82-93.

4 Marmot MG, Altman DG, Cameron DA, Dewar JA, Thompson SG, Wilcox M. The benefits and harms of breast cancer screening: an independent review. Br J Cancer 2013;108:2205-40.

5 Greenland S. Interval estimation by simulation as an alternative to and extension of confidence intervals. Int J Epidemiol 2004;33:1389-97.

6 Kaplan RM, Chambers DA, Glasgow RE. Big data and large sample size: a cautionary note on the potential for bias. Clin Transl Sci 2014;7:342-6.

7 Lange T, Keiding N. Skin cancer as a marker of sun exposure: a case of serious immortality bias. Int J Epidemiol 2014;43:971.

8 Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924-6.

BMJ Publishing Group Ltd 2014


Research Methods & Reporting

open access

Statistics Notes: Bootstrap resampling methods

J Martin Bland,1 Douglas G Altman2

1Department of Health Sciences, In medical research we study a sample of individuals to treatment as usual) was 1.33 points on the PHQ-9

University of York, York YO10 make inferences about a target population. Estimates of scale (95% confidence interval 2.31 to 0.35) adjusted

5DD, UK

interest, such as a mean or a difference in proportions, for baseline PHQ-9, age, the list size, index of multiple

2Centre for Statistics in

are calculated, usually accompanied by a confidence deprivation, city of the practice, and clustering.

of Orthopaedics, Rheumatology interval derived from the standard error. The data from a We created another sample of 505 by resampling as

and Musculoskeletal Sciences, single sample are used here to quantify the variation in described above, the full original sample being avail-

University of Oxford, Oxford

OX3 7LD, UK;

the estimate of interest across (hypothetical) multiple able for each of the 505 choices. The resulting new sam-

Correspondence to: J M Bland

samples from the same population.1 As we have only ple of 505 observations included 313 of the original 505

martin.bland@york.ac.uk one sample we need to make assumptions about the participants, some once, some more than once, a maxi-

Cite this as: BMJ 2015;350:h2622 data. Most methods of analysis are called parametric mum of five times. The same regression analysis which

doi: 10.1136/bmj.h2622 because they incorporate assumptions about the distri- produced the original treatment effect estimate was

bution ofthe data, such as that observations follow a repeated for this new sample resulting in a slightly dif-

normal distribution. Non-parametric methods avoid ferent estimated treatment difference of 1.25 points.

assumptions about distributions but generally provide Instead of resampling once, we should do it many times

only P values and not estimates of quantities of interest.2 and use the variability of the results to obtain a confidence

For a given dataset the assumptions may not be met. In interval. The distribution of the estimated treatment effect

such cases there is an alternative way to estimate stan- from 1000 resamplings of the CADET data is shown in the

dard errors and confidence intervals without any reli- figure. The mean and standard deviation of this distribu-

ance on assumed probability distributions. We use the tion are 1.353 and 0.565. This standard deviation provides

sample dataset and apply a resampling procedure called an alternative estimate of the standard error of the mean

the bootstrap. (In general language, a bootstrap method difference between the treatments, which does not make

is a self sustaining process that needs no e xternal input.) use of any theory about the distribution of the data. There

The clever idea behind the bootstrap is to create mul- are two ways to use the bootstrap estimates to find a confi-

tiple datasets from the real dataset without needing to dence interval. If the resampling distribution is close to

make any assumptions. Our observed sample is repre- normal, as is the case here, the 95% confidence interval

sentative of a population about which we wish to make inferences, so a set of randomly chosen observations from our sample will be equally representative of the original population. We can generate a sample of the same size as the original data set by randomly choosing real observations one at a time. Each observation has an equal chance of being chosen each time, so some observations will be picked more than once and some won't be picked at all. That doesn't matter; the new "bootstrap sample" is comparable to the original data set and is equally representative of the target population.

For an example, CADET was a cluster randomised trial comparing collaborative care for depression detected in primary care with treatment as usual.3 The outcome measure was the PHQ-9 depression scale, and data were available for 505 participants. The estimated mean difference (collaborative care minus [...]

[...] will be −1.353 − (1.96 × 0.565) to −1.353 + (1.96 × 0.565), or −2.46 to −0.25. This interval is similar to that obtained using the standard error from the least squares regression on the real data. The other approach is to take the 95% confidence interval directly from the 2.5th and 97.5th centiles of the distribution. For these data the bootstrap confidence interval calculated this way is −2.44 to −0.26. This second approach can be used regardless of the distribution of the bootstrap estimates.

Clearly we need enough repetitions so that the estimates are stable; usually thousands of bootstrap samples are used, especially when using the observed centiles of the distribution of estimates. A repetition of the whole bootstrap analysis for CADET produced almost identical values of the mean (−1.335) and standard deviation (0.567).

This note gives the general idea of the bootstrap; there are many variations.4 We can get a bootstrap estimate for any quantity we can calculate from any sample. Bootstrap methods are particularly favoured by health economists, because cost data tend to be highly [...] They are also useful for complex datasets, for example, when the observations aren't independent.
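The resampling procedure and the two confidence interval approaches described in this note can be sketched in Python. This is a minimal illustration, not the CADET analysis: the data are simulated, the function name `bootstrap_difference` and the arm labels are hypothetical, and the sketch resamples individual observations, so it ignores the clustering of the real trial.

```python
import random
import statistics


def bootstrap_difference(data, n_boot=2000, seed=0):
    """Bootstrap the difference in mean score between two arms.

    data: list of (arm, score) pairs, where arm is "care" or "usual"
    (hypothetical labels chosen for this sketch).
    """
    rng = random.Random(seed)
    estimates = []
    while len(estimates) < n_boot:
        # Draw a sample of the original size, with replacement.
        sample = rng.choices(data, k=len(data))
        care = [s for arm, s in sample if arm == "care"]
        usual = [s for arm, s in sample if arm == "usual"]
        if not care or not usual:
            continue  # redraw in the unlikely event that one arm is empty
        estimates.append(statistics.fmean(care) - statistics.fmean(usual))

    mean = statistics.fmean(estimates)
    sd = statistics.stdev(estimates)  # bootstrap standard error
    # Approach 1: mean of the estimates plus or minus 1.96 standard deviations.
    normal_ci = (mean - 1.96 * sd, mean + 1.96 * sd)
    # Approach 2: the 2.5th and 97.5th centiles of the estimates.
    cuts = statistics.quantiles(estimates, n=40)  # cut points every 2.5%
    centile_ci = (cuts[0], cuts[-1])
    return mean, sd, normal_ci, centile_ci


# Simulated scores for 505 participants (not the trial data).
gen = random.Random(1)
data = [("care", gen.gauss(11.0, 6.0)) for _ in range(250)]
data += [("usual", gen.gauss(12.3, 6.0)) for _ in range(255)]

observed = (statistics.fmean(s for a, s in data if a == "care")
            - statistics.fmean(s for a, s in data if a == "usual"))
mean, sd, normal_ci, centile_ci = bootstrap_difference(data)
print(observed, mean, sd, normal_ci, centile_ci)
```

When the bootstrap estimates are roughly normally distributed the two intervals should be close; the centile interval does not depend on that assumption.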

Figure: Histogram of 1000 resampling estimates of the treatment difference from the CADET data, with corresponding normal distribution curve, mean, and 2.5 and 97.5 centiles (horizontal axis: estimated treatment difference; vertical axis: number of samplings)

Contributors: JMB and DGA jointly wrote and agreed the text.

Competing interests: We have read and understood the BMJ Group policy on declaration of interests and have no relevant interests to declare.

Provenance and peer review: Not commissioned; not externally peer reviewed.

1 Altman DG, Bland JM. Standard deviations and standard errors. BMJ 2005;331:903.

2 Altman DG, Bland JM. Parametric v nonparametric methods for data analysis. BMJ 2009;338:a3167.

3 Richards DA, Hill JJ, Gask L, et al. Clinical effectiveness of collaborative care for depression in UK primary care (CADET): cluster randomised controlled trial. BMJ 2013;347:f4913.

4 Carpenter J, Bithell J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat Med 2000;19:1141-64.

5 Schroeder E, Petrou S, Patel N, et al. Cost effectiveness of alternative planned places of birth in woman at low risk of complications: evidence from the Birthplace in England national prospective cohort study. BMJ 2012;344:e2292.

© BMJ Publishing Group Ltd 2015

BMJ 2016;352:i189 doi: 10.1136/bmj.i189 (Published 15 January 2016) Page 1 of 2

STATISTICS NOTES

Mohammad Ali Mansournia, assistant professor of epidemiology (1); Douglas G Altman, professor of statistics in medicine (2)

(1) Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran; (2) Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford OX3 7LD, UK

Statistical analysis usually treats all observations as equally important. In some circumstances, however, it is appropriate to vary the weight given to different observations. Well known examples are in meta-analysis, where the inverse variance (precision) weight given to each contributing study varies, and in the analysis of clustered data.1

Differential weighting is also used when different parts of the population are sampled with unequal probabilities of selection. Two examples of intentional unbalanced sampling are:

1. Surveys with unequal probabilities of selection: In a national survey of hypertension prevalence, certain groups with relatively rare characteristics (such as people aged ≥65 years) were oversampled to improve the precision of estimates for those groups.2

2. Two-phase prevalence studies: In the first phase of a two-phase prevalence study of mental health status, the sampled patients completed a short screening questionnaire. In the second phase, a subsample was selected for a definitive diagnostic test with oversampling of the screen-positive cases to ensure precise estimates for diagnostic prevalence.3

In such cases the ordinary unweighted sample quantities, such as means or proportions, are likely to be biased estimates of their corresponding population quantities. This selection bias can be eliminated by performing a weighted estimation, giving each individual's data a weight inversely proportional to their probability of selection. Intuitively, the weighting is used to deflate the weight for those individuals who are oversampled. The weighted analysis can be thought of as creating a study with no differential selection.

Inverse probability weighting can also be used when individuals vary in their probability of having missing information. Two contexts where there may be unintentional unbalanced selection are:

3. Studies with missing outcome data: In surveys such as that mentioned in example 1, the response rates will be affected by availability or willingness to participate. Likewise in a cohort study of the effect of obesity on hypertension, some individuals are censored due to loss to follow-up (such as emigration) or competing risks (such as death from other causes).4 In each case the amount of missing information will vary across subgroups.

4. Randomised trials with crossing over from one arm to the other: In a randomised trial 8010 postmenopausal women with early breast cancer were assigned to tamoxifen (n=2459) or letrozole (n=2463) for five years or to sequential treatment with two years of one of these agents followed by three years of the other. There was a selective crossover to letrozole of 619 patients in the tamoxifen arm after significant benefit was reported for letrozole compared with tamoxifen during the study. These 619 women may be artificially censored at the time they crossed over for analysis.5

In these situations, missing outcomes are unlikely to happen at random so that estimates will be biased. While the selection probabilities in examples 1 and 2 are known, the response or non-censoring probabilities in examples 3 and 4 are unknown. Inverse probability weighting can be used with weights estimated from a logistic regression model for predicting non-response or censoring. As in the first scenario, this application of the method aims to remove bias, but it is more controversial. Its validity relies on a correctly specified model including all prognostic variables associated with non-response or censoring, which cannot be assured.

In the breast cancer trial (example 4), although the intention-to-treat hazard ratio for overall survival (which ignores selective crossover) was 0.87 (95% confidence interval 0.77 to 1.00) in favour of letrozole, the adjusted hazard ratio using inverse probability of selection weights was 0.79 (0.69 to 0.90), suggesting that the true effect is greater than the intention-to-treat estimate.5
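Weighting each individual by the inverse of a known selection probability, as in examples 1 and 2, can be sketched in Python. All numbers here are invented for illustration: a hypothetical population of 900 younger people (20% with the condition, sampled with probability 0.1) and 100 older people (60% with the condition, sampled with probability 0.5), giving a true prevalence of 24%. The function name `weighted_prevalence` is likewise an assumption of this sketch.

```python
def weighted_prevalence(records):
    """Inverse probability weighted proportion.

    records: iterable of (has_condition, selection_probability) pairs,
    where has_condition is 1 or 0 and the probability is known by design.
    """
    weighted_cases = 0.0
    weighted_total = 0.0
    for has_condition, p_select in records:
        weight = 1.0 / p_select  # oversampled people get smaller weights
        weighted_cases += weight * has_condition
        weighted_total += weight
    return weighted_cases / weighted_total


# Hypothetical sample: 90 younger people (18 cases), 50 older people (30 cases).
sample = ([(1, 0.1)] * 18 + [(0, 0.1)] * 72
          + [(1, 0.5)] * 30 + [(0, 0.5)] * 20)

unweighted = sum(y for y, _ in sample) / len(sample)
weighted = weighted_prevalence(sample)
print(unweighted, weighted)
```

In this constructed sample the unweighted proportion overstates the prevalence because the high prevalence group is oversampled, whereas the weighted estimate recovers the population value. When the probabilities are unknown (examples 3 and 4) they would first have to be estimated, for instance from a logistic regression model, which this sketch does not attempt.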

For personal use only: See rights and reprints http://www.bmj.com/permissions Subscribe: http://www.bmj.com/subscribe


In observational studies, the probability of exposure can depend on external factors (called confounders) that also affect the outcome. The causal effect of interest is then confused with the effects of confounders. Such confounding can be thought of as a type of selection bias, because confounding essentially means that some causes of the outcome also influence selection for the exposure. A particularly important context is:

5. Non-randomised studies comparing different treatments: In a cohort study 12 552 warfarin-naive patients with atrial fibrillation admitted to hospital for ischaemic stroke and treated with warfarin were compared with patients who received no oral anticoagulant at discharge.6

Outside randomised trials the choice of treatment is likely to be influenced by predictors of outcome, so called "confounding by indication."7 Various strategies are used to try to remove the [...] conventional approach is to use multivariable regression, but a recent alternative is inverse probability of treatment weighting. Here the weights are based on each individual's probability of receiving a specific treatment given the confounders, which is known as the propensity score (PS). The weights are 1/PS for the treated participants and 1/(1−PS) for the untreated participants.8 The weights can be estimated from a logistic regression model for predicting treatment. Key assumptions are that all confounders have been measured and properly modelled in this treatment model. In the warfarin study (example 5) the unadjusted hazard ratio for cardiac events was 0.73 (99% confidence interval 0.67 to 0.80) in favour of warfarin, whereas the adjusted estimate using inverse probability of treatment weighting was 0.87 (0.78 to 0.98), about half the effect size.6 If the cohort is also affected by censoring (see example 3 above), one can adjust simultaneously for confounding and selection bias due to censoring.4 8

Although helpful for bias reduction, estimates weighted by design weights (examples 1 and 2) tend to be less precisely estimated than the unweighted estimates, which is not necessarily true for examples 3-5. The ordinary 95% confidence interval for inverse probability weighted estimates may not provide the correct coverage and should be avoided. Instead, robust "sandwich" variance estimators or non-parametric bootstrapping should be used to provide valid confidence intervals.8 Deeper discussion of inverse probability weighting methods can be found elsewhere.8 9

Contributors: MAM and DGA jointly wrote and agreed the text.

Competing interests: We have read and understood the BMJ Group policy on declaration of interests and have no relevant interests to declare.

Provenance and peer review: Not commissioned; not externally peer reviewed.

1 Kerry SM, Bland JM. Analysis of a trial randomised in clusters. BMJ 1998;316:54. PMID 9451271

2 Korn EL, Graubard BI. Epidemiologic studies utilizing surveys: accounting for the sampling design. Am J Public Health 1991;81:1166-73. PMID 1951829

3 Vázquez-Barquero JL, García J, Simón JA, et al. Mental health in primary care. An epidemiological study of morbidity and use of health resources. Br J Psychiatry 1997;170:529-35. PMID 9330019

4 Alonso A, Seguí-Gómez M, de Irala J, Sánchez-Villegas A, Beunza JJ, Martínez-Gonzalez MA. Predictors of follow-up and assessment of selection bias from dropouts using inverse probability weighting in a cohort of university graduates. Eur J Epidemiol 2006;21:351-8. PMID 16736275

5 Regan MM, Neven P, Giobbie-Hurder A, et al. BIG 1-98 Collaborative Group; International Breast Cancer Study Group (IBCSG). Assessment of letrozole and tamoxifen alone and in sequence for postmenopausal women with steroid hormone receptor-positive breast cancer: the BIG 1-98 randomised clinical trial at 8.1 years median follow-up. Lancet Oncol 2011;12:1101-8. PMID 22018631

6 Xian Y, Wu J, O'Brien EC, et al. Real world effectiveness of warfarin among ischemic stroke patients with atrial fibrillation: observational analysis from Patient-Centered Research into Outcomes Stroke Patients Prefer and Effectiveness Research (PROSPER) study. BMJ 2015;351:h3786. PMID 26232340

7 Freemantle N, Marston L, Walters K, Wood J, Reynolds MR, Petersen I. Making inferences on treatment effects from real world data: propensity scores, confounding by indication, and other perils for the unwary in observational research. BMJ 2013;347:f6409. PMID 24217206

8 Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550-60. PMID 10955408

9 Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 2013;22:278-95. PMID 21220355

Accepted: 05 01 2016

Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions

