Measurement in Medicine: The Analysis of Method Comparison Studies

Author(s): D. G. Altman and J. M. Bland


Source: Journal of the Royal Statistical Society. Series D (The Statistician), Vol. 32, No. 3 (Sep.,
1983), pp. 307-317
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2987937 .
Accessed: 03/04/2013 00:56

The Statistician 32 (1983) 307-317
© 1983 Institute of Statisticians

Measurement in Medicine: the Analysis of Method Comparison Studies†

D. G. ALTMAN and J. M. BLAND‡

Division of Computing and Statistics, MRC Clinical Research Centre, Watford Road, Harrow HA1 3UJ; and
‡Department of Clinical Epidemiology and Social Medicine, St George's Hospital Medical School,
Cranmer Terrace, London SW17

Summary: Methods of analysis used in the comparison of two methods of measurement are
reviewed. The use of correlation, regression and the difference between means is criticized.
A simple parametric approach is proposed based on analysis of variance and simple graphical
methods.

1 The problem
In medicine we often want to compare two different methods of measuring some quantity,
such as blood pressure, gestational age, or cardiac stroke volume. Sometimes we compare
an approximate or simple method with a very precise one. This is a calibration problem,
and we shall not discuss it further here. Frequently, however, we cannot regard either
method as giving the true value of the quantity being measured. In this case we want to
know whether the methods give answers which are, in some sense, comparable. For example,
we may wish to see whether a new, cheap and quick method produces answers that agree
with those from an established method sufficiently well for clinical purposes. Many such
studies, using a variety of statistical techniques, have been reported. Yet few really answer
the question "Do the two methods of measurement agree sufficiently closely?"
In this paper we shall describe what is usually done, show why this is inappropriate,
suggest a better approach, and ask why such studies are done so badly. We will restrict
our consideration to the comparison of two methods of measuring a continuous variable,
although similar problems can arise with categorical variables.

2 Incorrect methods of analysis

We shall first describe some examples of method comparison studies, where the statistical
methods used were not appropriate to answer the question.

Comparison of means
Cater (1979) compared two methods of estimating the gestational age of human babies.
†Paper presented at the Institute of Statisticians conference, July 1981.
Gestational age was calculated from the last menstrual period (LMP) and also by the total
maturity score based on external physical characteristics (TMS). He divided the babies
into three groups: normal birthweight babies, low birthweight pre-term (<36 weeks
gestation) babies, and low birthweight term babies. For each group he compared the mean
by each method (using an unspecified test of significance), finding the mean gestational age
to be significantly different for pre-term babies but not for the other groups. It was concluded
that "the TMS is a convenient and accurate method of assessing gestational age in term
babies".
His criterion of agreement was that the two methods gave the same mean measurement;
"the same" appears to stand for "not significantly different". Clearly, this approach tells us
very little about the accuracy of the methods. By his criterion, the greater the measurement
error, and hence the less chance of a significant difference, the better.
Correlation
The favourite approach is to calculate the product-moment correlation coefficient, r,
between the two methods of measurement. Is this a valid measure of agreement? The
correlation coefficient in this case depends on both the variation between individuals (i.e.
between the true values) and the variation within individuals (measurement error). In some
applications the "true value" will be the subject's average value over time, and short-term
within-subject variation will be part of the measurement error. In others, where we wish to
identify changes within subjects, the true value is not assumed constant.
The correlation coefficient will therefore partly depend on the choice of subjects. For if
the variation between individuals is high compared to the measurement error the correlation
will be high, whereas if the variation between individuals is low the correlation will be low.
This can be seen if we regard each measurement as the sum of the true value of the measured
quantity and the error due to measurement. We have:
variance of true values = $\sigma_T^2$
variance of measurement error, method A = $\sigma_A^2$
variance of measurement error, method B = $\sigma_B^2$
In the simplest model errors have expectation zero and are independent of one another
and of the true value, so that
variance of method A = $\sigma_A^2 + \sigma_T^2$
variance of method B = $\sigma_B^2 + \sigma_T^2$
covariance = $\sigma_T^2$ (see appendix)
Hence the expected value of the sample correlation coefficient r is

$$\rho = \frac{\sigma_T^2}{\sqrt{(\sigma_A^2 + \sigma_T^2)(\sigma_B^2 + \sigma_T^2)}}$$

Clearly $\rho$ is less than one, and it depends only on the relative sizes of $\sigma_T^2$, $\sigma_A^2$ and $\sigma_B^2$.
If $\sigma_A^2$ and $\sigma_B^2$ are not small compared to $\sigma_T^2$, the correlation will be small no matter how
good the agreement between the two methods.
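
The dependence of the expected correlation on the spread of true values can be illustrated with a short numerical sketch. The function name `expected_correlation` and the variance values are illustrative assumptions, not taken from the paper; the formula is the one given above for the simple additive error model.

```python
import math


def expected_correlation(var_true, var_err_a, var_err_b):
    """Expected correlation between two methods under the simple additive error model."""
    return var_true / math.sqrt((var_err_a + var_true) * (var_err_b + var_true))


# Illustrative values only: the measurement errors are identical in all three cases,
# only the variance of the true values between subjects changes.
wide = expected_correlation(var_true=400.0, var_err_a=25.0, var_err_b=25.0)
narrow = expected_correlation(var_true=25.0, var_err_a=25.0, var_err_b=25.0)
none = expected_correlation(var_true=0.0, var_err_a=25.0, var_err_b=25.0)
```

With these made-up numbers the expected correlation falls from about 0.94 to 0.5 and then to 0 as the spread of true values shrinks, although the measurement errors, and hence the agreement between the methods, are unchanged.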
In the extreme case, when we have several pairs of measurements on the same individual,
$\sigma_T^2 = 0$ (assuming that there are no temporal changes), and so $\rho = 0$ no matter how close
the agreement is. Keim et al. (1976) compared dye-dilution and impedance cardiography by
finding the correlation between repeated pairs of measurements by the two methods on
each of 20 patients. The 20 correlation coefficients ranged from -0.77 to 0.80, with one
correlation being significant at the 5 per cent level. They concluded that the two methods did
not agree because low correlations were found when the range of cardiac output was small,
even though other studies covering a wide range of cardiac output had shown high
correlations. In fact the result of their analysis may be explained on the statistical grounds
discussed above, the expected value of the correlation coefficient being zero. Their conclusion
that the methods did not agree was thus wrong - their approach tells us nothing about
dye-dilution and impedance cardiography.
As already noted, another implication of the expected value of r is that the observed
correlation will increase if the between subject variability increases. A good example of
this is given by the measurement of blood pressure. Diastolic blood pressure varies less
between individuals than does systolic pressure, so that we would expect to observe a worse
correlation for diastolic pressures when methods are compared in this way. In two papers
(Laughlin et al., 1980; Hunyor et al., 1978) presenting between them 11 pairs of correlations,
this phenomenon was observed every time (Table 1). It is not an indication that the methods
agree less well for diastolic than for systolic measurements. This table provides another
illustration of the effect on the correlation coefficient of variation between individuals.
The sample of patients in the study of Hunyor et al. had much greater standard deviations
than the sample of Laughlin et al. and the correlations were correspondingly greater.

Table 1. Correlation coefficients between methods of measurement of blood pressure for
systolic and diastolic pressures

                            Systolic pressure           Diastolic pressure
                            s_A      s_B      r         s_A      s_B      r

Laughlin et al. (1980)
  1                         13.4a    15.3a    0.69      6.1a     6.3a     0.63
  2                                           0.83                        0.55
  3                                           0.68                        0.48
  4                                           0.66                        0.37
Hunyor et al. (1978)
  1                         40.0     40.3     0.997     15.9     13.2     0.938
  2                         41.5     36.7     0.994     15.5     14.0     0.863
  3                         40.1     41.8     0.970     16.2     17.8     0.927
  4                         41.6     38.8     0.984     14.7     15.0     0.736
  5                         40.6     37.9     0.985     15.9     19.0     0.685
  6                         43.3     37.0     0.987     16.7     15.5     0.789
  7                         45.5     38.7     0.967     23.9     26.9     0.941

a Standard deviations for four sets of data combined.

A further point of interest is that even what appears (visually) to be fairly poor agreement
can produce fairly high values of the correlation coefficient. For example, Serfontein and
Jaroszewicz (1978) found a correlation of 0.85 when they compared two more methods of
assessing gestational age, the Robinson and the Dubowitz. They concluded that because the
correlation was high and significantly different from zero, agreement was good. However,
from their data a baby with a gestational age of 35 weeks by the Robinson method could
have been anything between 34 and 39.5 weeks by the Dubowitz method. For two methods
which purport to measure the same thing the agreement between them is not close, because
what may be a high correlation in other contexts is not high when comparing things that
should be highly related anyway. The test of significance of the null hypothesis $\rho = 0$ is
beside the point. It is unlikely that we would consider totally unrelated quantities as
candidates for a method comparison study.
The correlation coefficient is not a measure of agreement; it is a measure of association.
Thus it is quite wrong, for example, to infer from a high correlation that "the methods ...
may be used interchangeably" (Hallman and Teramo, 1981).
At the extreme, when measurement error is very small and correlations correspondingly
high, it becomes difficult to interpret differences. Oldham et al. (1979) state that:
"Connecting [two types of peak flow meter] in series produces a correlation coefficient of 0.996,
which is a material improvement on the figure of 0.992 obtained when they are used
separately". It is difficult to imagine another context in which it were thought possible to improve
materially on a correlation of 0.992. As Westgard and Hunt (1973) have said: "The
correlation coefficient ... is of no practical use in the statistical analysis of comparison data".
Regression
Linear regression is another misused technique in method comparison studies. Often the
slope of the least squares regression line is tested against zero. This is equivalent to testing
the correlation coefficient against zero, and the above remarks apply. A more subtle
problem is illustrated by the work of Carr et al. (1979), who compared two methods of
measuring the heart's left ventricular ejection fraction. These authors gave not only
correlation coefficients but the regression line of one method, Teichholz, on the other, angiography.
They noted that the slope of the regression line differed significantly from the line of
identity. Their implied argument was that if the methods were equivalent the slope of the
regression line would be 1. However, this ignores the fact that both dependent and
independent variables are measured with error. In our previous notation the expected slope is
$\beta = \sigma_T^2/(\sigma_A^2 + \sigma_T^2)$ and is therefore less than 1. How much less than 1 depends on the amount
of measurement error of the method chosen as independent. Similarly, the expected value of
the intercept will be greater than zero (by an amount that is the product of the mean of
the true values and the bias in the slope) so that the conclusion of Ross et al. (1982) that
"with a slope not differing significantly from unity but a statistically highly significant
y-intercept, the presence of a systematic difference ... is demonstrated" is unjustified.
We do not reject regression totally as a suitable method of analysis, and will discuss it
further below.
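
A minimal sketch of this attenuation, under the simple error model used earlier and assuming that the two methods have no relative bias. The function names and the numbers are hypothetical and serve only to show the size of the effect.

```python
def expected_slope(var_true, var_err_independent):
    """Expected least squares slope when the independent method is measured with error."""
    return var_true / (var_err_independent + var_true)


def expected_intercept(mean_true, var_true, var_err_independent):
    """Expected intercept, assuming no relative bias: mean of true values times (1 - slope)."""
    return mean_true * (1.0 - expected_slope(var_true, var_err_independent))


# Illustrative values only: true values with mean 50 and variance 100,
# and an independent method with error variance 25.
slope = expected_slope(var_true=100.0, var_err_independent=25.0)
intercept = expected_intercept(mean_true=50.0, var_true=100.0, var_err_independent=25.0)
```

With these values the expected slope is 0.8 and the expected intercept is about 10, even though the two methods agree on average, so neither a slope below 1 nor a non-zero intercept demonstrates a systematic difference.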
Asking the right question
None of the previously discussed approaches tells us whether the methods can be
considered equivalent. We think that this is because the authors have not thought about what
question they are trying to answer. The questions to be asked in method comparison studies
fall into two categories:
(a) Properties of each method:
How repeatable are the measurements?
(b) Comparison of methods:
Do the methods measure the same thing on average? That is, is there any relative bias?
What additional variability is there? This may include both errors due to repeatability
and errors due to patient/method interactions. We summarize all this as "error".
Under properties of each method we could also include questions about variability
between observers, between times, between places, between position of subject, etc. Most
studies standardize these, but do not consider their effects, although when they are
considered, confusion may result. Altman's (1979) criticism of the design of the study by
Serfontein and Jaroszewicz (1978) provoked the response that: "For the actual study it was
felt that the fact assessments were made by two different observers (one doing only the
Robinson technique and the other only the Dubowitz method) would result in greater
objectivity" (Serfontein and Jaroszewicz, 1979). The effects of method and observer are, of
course, totally confounded.
We emphasize that this is a question of estimation, both of error and bias. What we need
is a design and analysis which provide estimates of both error and bias. No single statistic
can estimate both.

3 Proposed method of analysis
Just as there are several invalid approaches to this problem, there are also various possible
types of analysis which are valid, but none of these is without difficulties. We feel that a
relatively simple pragmatic approach is preferable to more complex analyses, especially
when the results must be explained to non-statisticians.
It is difficult to produce a method that will be appropriate for all circumstances. What
follows is a brief description of the basic strategy that we favour; clearly the various possible
complexities which could arise might require a modified approach, involving additional or
even alternative analyses.
Properties of each method: repeatability
The assessment of repeatability is an important aspect of studying alternative methods of
measurement. Replicated measurements are, of course, essential for an assessment of
repeatability, but to judge from the medical literature the collection of replicated data is
rare. One possible reason for this will be suggested later.
Repeatability is assessed for each measurement method separately from replicated
measurements on a sample of subjects. We obtain a measure of repeatability from the
within-subject standard deviation of the replicates. The British Standards Institution
(1979) define a coefficient of repeatability as "the value below which the difference between
two single test results ... may be expected to lie with a specified probability; in the absence
of other indications, the probability is 95 per cent". Provided that the differences can be
assumed to follow a Normal distribution this coefficient is $2.83\,\sigma_r$, where $\sigma_r$ is the
within-subject standard deviation. $\sigma_r$ must be estimated from a suitable experiment. For the
purposes of the present analysis the standard deviation alone can be used as the measure
of repeatability.
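
As a rough sketch of how the within-subject standard deviation and the coefficient of repeatability might be computed from replicated data for one method: the pooled within-subject estimator used below is an assumption on our part, as are the function names and the example readings. The factor 2.83 is the one quoted in the text.

```python
import math
from statistics import mean


def within_subject_sd(replicates):
    """Pooled within-subject SD; `replicates` holds one list of repeat readings per subject."""
    sum_squares = 0.0
    degrees_of_freedom = 0
    for readings in replicates:
        subject_mean = mean(readings)
        sum_squares += sum((value - subject_mean) ** 2 for value in readings)
        degrees_of_freedom += len(readings) - 1
    return math.sqrt(sum_squares / degrees_of_freedom)


def repeatability_coefficient(replicates, factor=2.83):
    """Coefficient of repeatability as factor times the within-subject SD."""
    return factor * within_subject_sd(replicates)


# Made-up duplicate readings for three subjects, for illustration only.
example = [[100.0, 104.0], [120.0, 118.0], [140.0, 146.0]]
sd_r = within_subject_sd(example)
coefficient = repeatability_coefficient(example)
```

Pooling the squared deviations over subjects gives the same quantity as the residual standard deviation of a one-way analysis of variance with subjects as the factor.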
It is important to ensure that the within-subject repeatability is not associated with the
size of the measurements, in which case the results of subsequent analyses might be
misleading. The best way to look for an association between these two quantities is to plot the
standard deviation against the mean. If there are two replicates $x_1$ and $x_2$ then this reduces
to a plot of $|x_1 - x_2|$ against $(x_1 + x_2)/2$. From this plot it is easy to see if there is any tendency
for the amount of variation to change with the magnitude of the measurements. The
correlation coefficient could be tested against the null hypothesis of r = 0 for a formal test of
independence.
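
For duplicate readings, the quantities to be plotted and their correlation can be obtained as in the following sketch. The helper names and the example pairs are hypothetical, and the data are constructed so that the spread grows with the size of the measurement.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation; assumes neither variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spread_against_size(pairs):
    """Return subject means, absolute differences, and their correlation."""
    means = [(x1 + x2) / 2 for x1, x2 in pairs]
    abs_diffs = [abs(x1 - x2) for x1, x2 in pairs]
    return means, abs_diffs, pearson_r(means, abs_diffs)


# Made-up duplicates whose difference is proportional to their size.
pairs = [(10.0, 11.0), (20.0, 22.0), (30.0, 33.0), (40.0, 44.0)]
means, abs_diffs, r = spread_against_size(pairs)
```

In this artificial example the absolute differences rise in exact proportion to the means, so the correlation is essentially 1 and a transformation would be indicated before any analysis of variance.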
If the within-subject repeatability is found to be independent of the size of the
measurements, then a one-way analysis of variance can be performed. The residual standard
deviation is an overall measure of repeatability, pooled across subjects.
If, however, an association is observed, the results of an analysis of variance could be
misleading. Several approaches are possible, the most appealing of which is the
transformation of the data to remove the relationship. In practice the logarithmic transformation
will often be suitable. If the relationship can be removed, a one-way analysis of variance can
be carried out. Repeatability can be described by calculating a 95 per cent range for the
difference between two replicates. Back-transformation provides a measure of repeatability
in the original units. In the case of log transformation the repeatability is a percentage of
the magnitude of the measurement rather than an absolute value. It would be preferable to
carry out the same transformation for measurement by each method, but this is not essential,
and may be totally inappropriate.
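
A hedged sketch of the back-transformation step, assuming that natural logarithms were used and that the 95 per cent range on the log scale is taken as plus or minus 2.83 times the within-subject standard deviation of the logged replicates. The function name and the value 0.05 are hypothetical.

```python
import math


def repeatability_ratio_limits(log_within_sd, factor=2.83):
    """Back-transform a symmetric range for the log difference into limits for the ratio."""
    half_width = factor * log_within_sd
    return math.exp(-half_width), math.exp(half_width)


# Illustrative within-subject SD of 0.05 on the natural log scale.
lower, upper = repeatability_ratio_limits(0.05)
```

Because a difference of logarithms is the logarithm of a ratio, the back-transformed limits are multiplicative: here two replicates would be expected to differ by a ratio between roughly 0.87 and 1.15, which is why repeatability is then expressed as a percentage of the measurement.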
If transformation is unsuccessful, then it may be necessary to analyse data from a restricted
range of measurements only, or to subdivide the scale into regions to be analysed separately.
Neither of these approaches is likely to be particularly satisfactory. Alternatively, the
repeatability can be defined as a function of the size of the measurement.

Properties of each method: other considerations
Many factors may affect a measurement, such as observer, time of day, position of subject,
particular instrument used, laboratory, etc. The British Standards Institution (1979)
distinguish between repeatability, described above, and reproducibility, "the value below
which two single test results ... obtained under different conditions ... may be expected
to lie with a specified probability". There may be difficulties in carrying out studies of
reproducibility in many areas of medical interest. For example, the gestational age of a
newborn baby could not be determined at different times of year or in different places.
However, when it is possible to vary conditions, observers, instruments, etc., the methods
described above will be appropriate provided the effects are random. When effects are fixed,
for example when comparing an inexperienced observer and an experienced observer,
the approach used to compare different methods, described below, should be used.
Comparison of methods
The main emphasis in method comparison studies clearly rests on a direct comparison of
the results obtained by the alternative methods. The question to be answered is whether the
methods are comparable to the extent that one might replace the other with sufficient
accuracy for the intended purpose of measurement.

[Figure 1: scatter plot; horizontal axis: Method 1; both axes run from 120 to 220]
Fig. 1. Comparison of two methods of measuring systolic blood pressure.
The obvious first step, one which should be mandatory, is to plot the data. We first
consider the unreplicated case, comparing methods A and B. Plots of this type are very
common and often have a regression line drawn through the data. The appropriateness of
regression will be considered in more detail later, but whatever the merits of this approach,
the data will always cluster around a regression line by definition, whatever the agreement.
For the purposes of comparing the methods the line of identity (A = B) is much more
informative, and is essential to get a correct visual assessment of the relationship. An example of
such a plot is given in Figure 1, where data comparing two methods of measuring systolic
blood pressure are shown.
Although this type of plot is very familiar and in frequent use, it is not the best way of
looking at this type of data, mainly because much of the plot will often be empty space.
Also, the greater the range of measurements the better the agreement will appear to be. It
is preferable to plot the difference between the methods (A - B) against (A + B)/2, the
average. Figure 2 shows the data from Figure 1 replotted in this way. From this type of
plot it is much easier to assess the magnitude of disagreement (both error and bias), spot
outliers, and see whether there is any trend, for example an increase in A - B for high
values. This way of plotting the data is a very powerful way of displaying the results of a
method comparison study. It is closely related to the usual plot of residuals after
model-fitting, and the patterns observed may be similarly varied. In the example shown (Figure 2)
there was a significant relationship between the method difference and the size of
measurement (r = 0.45, n = 25, P = 0.02). This test is equivalent to a test of equality of the total
variances of measurements obtained by the two methods (Pitman, 1939; see Snedecor and
Cochran, 1967, pp. 195-7).

[Figure 2: scatter plot; vertical axis: difference between the two methods (-10 to 40);
horizontal axis: average of two methods (120 to 220)]
Fig. 2. Data from Figure 1 replotted to show the difference between the two methods against the average
measurement.
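
The link between the correlation of A - B with the average and the equality of the two total variances can be checked numerically: the sample covariance of A - B with (A + B)/2 is algebraically half the difference of the two sample variances, so it is zero exactly when the variances are equal. The sketch below uses made-up readings and hypothetical helper names; it illustrates the identity only and is not an implementation of the Pitman test itself.

```python
def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1; sample_cov(x, x) is the sample variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def cov_difference_with_average(a, b):
    """Covariance between the method difference A - B and the average (A + B)/2."""
    diffs = [x - y for x, y in zip(a, b)]
    averages = [(x + y) / 2 for x, y in zip(a, b)]
    return sample_cov(diffs, averages)


def half_variance_gap(a, b):
    """Half the difference between the total variances of the two methods."""
    return (sample_cov(a, a) - sample_cov(b, b)) / 2


# Made-up paired readings, for illustration only.
method_a = [120.0, 135.0, 150.0, 170.0, 200.0]
method_b = [118.0, 130.0, 148.0, 160.0, 185.0]
lhs = cov_difference_with_average(method_a, method_b)
rhs = half_variance_gap(method_a, method_b)
```

Here both quantities equal 145.4 up to rounding, and the positive sign reflects the larger total variance of the hypothetical method A.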
As in the investigation of repeatability, we are looking here for the independence of the
between-method differences and the size of the measurements. With independence the
methods may be compared very simply by analysing the individual A - B differences. The
mean of these differences will be the relative bias, and their standard deviation is the
estimate of error. The hypothesis of zero bias can be formally examined by a paired t-test.
For the data of Carr et al. (1979) already discussed, the correlation of the individual
differences with the average value was -0036 (P > 0.1), so that an assumption of
independence is not contradicted by the data. Figure 3 shows these data plotted in the suggested
manner. Also shown is a histogram of the individual between-method differences, and
superimposed on the data are lines showing the mean difference and a 95 per cent range
calculated from the standard deviation. A composite plot like this is much more informative
than the usual plot (such as Figure 1).
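
The analysis of differences described above can be sketched as follows. The paper does not state which multiplier is used for the 95 per cent range; 1.96 is assumed here on the basis of Normally distributed differences. The function name, the dictionary keys and the readings are hypothetical, and only the paired t statistic is returned, not its P value.

```python
import math
from statistics import mean, stdev


def compare_methods(a, b, multiplier=1.96):
    """Relative bias, error SD, 95 per cent range and paired t statistic for A - B."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    error_sd = stdev(diffs)
    t_statistic = bias / (error_sd / math.sqrt(len(diffs)))
    return {
        "bias": bias,
        "error_sd": error_sd,
        "range": (bias - multiplier * error_sd, bias + multiplier * error_sd),
        "t": t_statistic,
    }


# Made-up paired readings by methods A and B on four subjects.
result = compare_methods([10.0, 12.0, 14.0, 16.0], [9.0, 12.0, 12.0, 15.0])
```

The t statistic would be referred to a t distribution with n - 1 degrees of freedom, and the analysis is appropriate only once the differences have been shown to be unrelated to the size of the measurement.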
If there is an association between the differences and the size of the measurements,
then as before, a transformation (of the raw data) may be successfully employed. In this
case the 95 per cent limits will be asymmetric and the bias will not be constant. Additional
insight into the appropriateness of a transformation may be gained from a plot of |A - B|
against (A + B)/2, if the individual differences vary either side of zero. In the absence of a
suitable transformation it may be reasonable to describe the differences between the methods
by regressing A - B on (A + B)/2.
For replicated data, we can carry out these procedures using the means of the replicates.
The estimate of bias will be unaffected, but the error will be reduced. We can estimate the
standard deviation of the difference between individual measurements from the standard
deviation of the difference between means by

$$\operatorname{var}(A - B) = n \operatorname{var}(\bar{A} - \bar{B})$$

where n is the number of replicates.
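
In terms of standard deviations the relation above amounts to multiplying by the square root of n. A minimal sketch, assuming the same number of replicates for both methods as the formula does; the function name and the example values are hypothetical.

```python
import math


def sd_of_single_difference(sd_of_mean_difference, n_replicates):
    """Scale the SD of differences between replicate means back to single measurements."""
    return math.sqrt(n_replicates) * sd_of_mean_difference


# Illustrative: an SD of 2.0 for differences between means of 4 replicates.
single_sd = sd_of_single_difference(2.0, 4)
```

With four replicates per method, an observed standard deviation of 2.0 for the differences between means corresponds to 4.0 for differences between single measurements.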
Within replicated data it may be felt desirable to carry out a two-way analysis of variance,
with main effects of individuals and methods, in order to get better estimates. Such an
analysis would need to be supported by the analysis of repeatability, and in the event of the
two methods not being equally repeatable the analysis would have to be weighted
appropriately. The simpler analysis of method differences (Figure 2) will also need to be carried
out to ascertain that the differences are independent of the size of the measurements, as
otherwise the answers might be misleading.
Alternative analyses
One alternative approach is least squares regression. We can use regression to predict the
measurement obtained by one method from the measurement obtained by the other, and
calculate a standard error for this prediction. This is, in effect, a calibration approach and
does not directly answer the question of comparability. There are several problems that
can arise, some of which have already been referred to. Regression does not yield a single
value for relative precision (error), as this depends upon the distance from the mean. If
we do try to use regression methods to assess comparability, difficulties arise because there
is no obvious estimate of bias, and the parameters are difficult to interpret. Unlike the analysis
of variance model, the parameters are affected by the range of the observations and for the
results to apply generally the methods ought to have been compared on a random sample
of subjects - a condition that will very often not be met. The problem of the underestimation
(attenuation) of the slope of the regression line has been considered by Yates (Healy, 1958),
but the other problems remain.
Other methods which have been proposed include principal component analysis (or
orthogonal regression) and regression models with errors in both variables (structural
relationship models) (see for example Carey et al., 1975; Lawton et al., 1979; Cornbleet
and Gochman, 1979; Feldmann et al., 1981). The considerable extra complexity of such
analysis will not be justified if a simple comparison is all that is required. This is especially
true when the results must be conveyed to and used by non-experts, e.g. clinicians. Such
methods will be necessary, however, if it is required to predict one measurement from the
other - this is nearer to calibration and is not the problem we have been addressing in this
paper.

[Figure 3: scatter plot of the difference between methods against the average of two methods,
with a histogram of the difference between methods]
Fig. 3. Comparison of two methods of measuring left ventricular ejection fraction (Carr et al., 1979) replotted
to show error and bias.

Why does the comparison of methods cause so much difficulty?

The majority of medical method comparison studies seem to be carried out without the
benefit of professional statistical expertise. Because virtually all introductory courses and
textbooks in statistics are method-based rather than problem-based, the non-statistician
will search in vain for a description of how to proceed with studies of this nature. It may
be that, as a consequence, textbooks are scanned for the most similar-looking problem,
which is undoubtedly correlation. Correlation is the most commonly used method, which
may be one reason for so few studies involving replication, since simple correlation cannot
cope with replicated data. A further reason for poor methodology is the tendency for
researchers to imitate what they see in other published papers. So many papers are published
in which the same incorrect methods are used that researchers can perhaps be forgiven for
assuming that they are doing the right thing. It is to be hoped that journals will become
enlightened and return papers using inappropriate techniques for reanalysis.
Another factor is that some statisticians are not as aware of this problem as they might be.
As an illustration of this, the blood pressure data shown in Figures 1 and 2 were taken from
the book Biostatistics by Daniel (1978), where they were used as the example of the
calculation of the correlation coefficient. A counter-example is the whole chapter devoted to method
comparison (by regression) by Strike (1981). More statisticians should be aware of this problem,
and should use their influence to similarly increase the awareness of their non-statistical
colleagues of the fallacies behind many common methods.

Conclusions
1. Most common approaches, notably correlation, do not measure agreement.
2. A simple approach to the analysis may be the most revealing way of looking at the
data.
3. There needs to be a greater understanding of the nature of this problem, by
statisticians, non-statisticians and journal referees.

Acknowledgements

We would like to thank Dr David Robson for helpful discussions during the preparation of
this paper, and Professor D. R. Cox, Professor M. J. R. Healy and Mr A. V. Swan for
comments on an earlier draft.

Appendix

Covariance of two methods of measurement in the presence of measurement errors

We have two methods A and B of measuring a true quantity T. They are related to T by
$A = T + \varepsilon_A$ and $B = T + \varepsilon_B$, where $\varepsilon_A$ and $\varepsilon_B$ are experimental errors. We assume that the
errors have mean zero and are independent of each other and of T, and define the following variances:
$\operatorname{var}(T) = \sigma_T^2$, $\operatorname{var}(\varepsilon_A) = \sigma_A^2$, and $\operatorname{var}(\varepsilon_B) = \sigma_B^2$

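Now the covariance of A and B is given by

$$
\begin{aligned}
\operatorname{cov}(A, B) &= E(AB) - E(A)E(B) \\
&= E\{(T + \varepsilon_A)(T + \varepsilon_B)\} - E(T + \varepsilon_A)E(T + \varepsilon_B) \\
&= E(T^2 + \varepsilon_A T + \varepsilon_B T + \varepsilon_A \varepsilon_B) - \{E(T) + E(\varepsilon_A)\}\{E(T) + E(\varepsilon_B)\}
\end{aligned}
$$

But $E(\varepsilon_A) = E(\varepsilon_B) = 0$, and the errors and T are independent, so

$$E(\varepsilon_A T) = E(\varepsilon_A)E(T) = 0, \qquad E(\varepsilon_B T) = E(\varepsilon_B)E(T) = 0$$

and

$$E(\varepsilon_A \varepsilon_B) = E(\varepsilon_A)E(\varepsilon_B) = 0$$

Hence $\operatorname{cov}(A, B) = E(T^2) - \{E(T)\}^2 = \sigma_T^2$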
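
The result can also be checked by simulation. The sketch below assumes Normally distributed true values and errors, which the derivation itself does not require, and uses arbitrary illustrative parameters, a fixed seed and a hypothetical function name.

```python
import random


def simulate_covariance(n_subjects, sd_true, sd_err_a, sd_err_b, seed=0):
    """Sample covariance of A = T + eA and B = T + eB with independent errors."""
    rng = random.Random(seed)
    a_values = []
    b_values = []
    for _ in range(n_subjects):
        true_value = rng.gauss(100.0, sd_true)
        a_values.append(true_value + rng.gauss(0.0, sd_err_a))
        b_values.append(true_value + rng.gauss(0.0, sd_err_b))
    mean_a = sum(a_values) / n_subjects
    mean_b = sum(b_values) / n_subjects
    cross_products = sum((a - mean_a) * (b - mean_b) for a, b in zip(a_values, b_values))
    return cross_products / (n_subjects - 1)


# True values with variance 100; error SDs of 3 and 4 for methods A and B.
estimate = simulate_covariance(20000, sd_true=10.0, sd_err_a=3.0, sd_err_b=4.0)
```

With a large simulated sample the estimate should lie close to the variance of the true values (100 here), whatever the sizes of the two error variances.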

References
Altman, D. G. (1979). Estimation of gestational age at birth - comparison of two methods. Archives of
Disease in Childhood 54, 242-3.
British Standards Institution (1979). Precision of test methods, part 1: guide for the determination of
repeatability and reproducibility for a standard test method. BS 5497, Part 1. London.
Carey, R. N., Wold, S. and Westgard, J. O. (1975). Principal component analysis: an alternative to "referee"
methods in method comparison studies. Analytical Chemistry 47, 1824-9.
Carr, K. W., Engler, R. L., Forsythe, J. R., Johnson, A. D. and Gosink, B. (1979). Measurement of left
ventricular ejection fraction by mechanical cross-sectional echocardiography. Circulation 59, 1196-1206.
Cater, J. I. (1979). Confirmation of gestational age by external physical characteristics (total maturity
score). Archives of Disease in Childhood 54, 794-5.
Cornbleet, P. J. and Gochman, N. (1979). Incorrect least-squares regression coefficients in
method-comparison analysis. Clinical Chemistry 25, 432-8.
Daniel, W. W. (1978). Biostatistics: a Foundation for Analysis in the Health Sciences, 2nd edn. Wiley, New
York.
Feldmann, U., Schneider, B., Klinkers, H. and Haeckel, R. (1981). A multivariate approach for the
biometric comparison of analytical methods in clinical chemistry. Journal of Clinical Chemistry and Clinical
Biochemistry 19, 121-37.
Hallman, M. and Teramo, K. (1981). Measurement of the lecithin/sphingomyelin ratio and
phosphatidylglycerol in amniotic fluid: an accurate method for the assessment of fetal lung maturity. British Journal
of Obstetrics and Gynaecology 88, 806-13.
Healy, M. J. R. (1958). Variations within individuals in human biology. Human Biology 30, 210-8.
Hunyor, S. M., Flynn, J. M. and Cochineas, C. (1978). Comparison of performance of various
sphygmomanometers with intra-arterial blood-pressure readings. British Medical Journal 2, 159-62.
Keim, H. J., Wallace, J. M., Thurston, H., Case, D. B., Drayer, J. I. M. and Laragh, J. H. (1976). Impedance
cardiography for determination of stroke index. Journal of Applied Physiology 41, 797-9.
Laughlin, K. D., Sherrard, D. J. and Fisher, L. (1980). Comparison of clinic and home blood-pressure
levels in essential hypertension and variables associated with clinic-home differences. Journal of Chronic
Diseases 33, 197-206.
Lawton, W. H., Sylvestre, E. A. and Young-Ferraro, B. J. (1979). Statistical comparison of multiple analytic
procedures: application to clinical chemistry. Technometrics 21, 397-416.
Oldham, H. G., Bevan, M. M. and McDermott, M. (1979). Comparison of the new miniature Wright
peak flow meter with the standard Wright peak flow meter. Thorax 34, 807-8.
Pitman, E. J. G. (1939). A note on normal correlation. Biometrika 31, 9-12.
Ross, H. A., Visser, J. W. E., der Kinderen, P. J., Tertoolen, J. F. W. and Thijssen, J. H. H. (1982). A
comparative study of free thyroxine estimations. Annals of Clinical Biochemistry 19, 108-13.
Serfontein, G. L. and Jaroszewicz, A. M. (1978). Estimation of gestational age at birth - comparison of two
methods. Archives of Disease in Childhood 53, 509-11.
Serfontein, G. L. and Jaroszewicz, A. M. (1979). Estimation of gestational age at birth - comparison of
two methods. Archives of Disease in Childhood 54, 243.
Snedecor, G. W. and Cochran, W. G. (1967). Statistical Methods, 6th edn. University Press, Iowa.
Strike, P. W. (1981). Medical Laboratory Statistics. P. S. G. Wright, Bristol.
Westgard, J. O. and Hunt, M. R. (1973). Use and interpretation of common statistical tests in
method-comparison studies. Clinical Chemistry 19, 49-57.
