Wiley and Royal Statistical Society are collaborating with JSTOR to digitize, preserve and extend access to
Journal of the Royal Statistical Society. Series D (The Statistician).
1 The problem
In medicine we often want to compare two different methods of measuring some quantity,
such as blood pressure, gestational age, or cardiac stroke volume. Sometimes we compare
an approximate or simple method with a very precise one. This is a calibration problem,
and we shall not discuss it further here. Frequently, however, we cannot regard either
method as giving the true value of the quantity being measured. In this case we want to
know whether the methods give answers which are, in some sense, comparable. For example,
we may wish to see whether a new, cheap and quick method produces answers that agree
with those from an established method sufficiently well for clinical purposes. Many such
studies, using a variety of statistical techniques, have been reported. Yet few really answer
the question "Do the two methods of measurement agree sufficiently closely?"
In this paper we shall describe what is usually done, show why this is inappropriate,
suggest a better approach, and ask why such studies are done so badly. We will restrict
our consideration to the comparison of two methods of measuring a continuous variable,
although similar problems can arise with categorical variables.
2 Incorrect methods of analysis
Comparison of means
Cater (1979) compared two methods of estimating the gestational age of human babies.
†Paper presented at the Institute of Statisticians conference, July 1981.
[Table header only; data rows not recovered. Columns: s_A, s_B and r for systolic pressure; s_A, s_B and r for diastolic pressure.]
A further point of interest is that even what appears (visually) to be fairly poor agreement
can produce fairly high values of the correlation coefficient. For example, Serfontein and
Jaroszewicz (1978) found a correlation of 0.85 when they compared two more methods of
assessing gestational age, the Robinson and the Dubowitz. They concluded that because the
correlation was high and significantly different from zero, agreement was good. However,
from their data a baby with a gestational age of 35 weeks by the Robinson method could
have been anything between 34 and 39.5 weeks by the Dubowitz method. For two methods
which purport to measure the same thing the agreement between them is not close, because
what may be a high correlation in other contexts is not high when comparing things that
should be highly related anyway. The test of significance of the null hypothesis $\rho = 0$ is
beside the point. It is unlikely that we would consider totally unrelated quantities as
candidates for a method comparison study.
The correlation coefficient is not a measure of agreement; it is a measure of association.
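The distinction between association and agreement can be illustrated with a minimal sketch. The gestational ages below are hypothetical, not data from any of the studies cited, and the second method is assumed to read exactly 3 weeks higher than the first on every baby. Under that assumption the correlation is perfect even though the two methods never agree.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical gestational ages (weeks) by a first method.
method_a = [30.0, 32.0, 34.0, 36.0, 38.0, 40.0]
# Hypothetical second method: assumed to read 3 weeks higher on every baby.
method_b = [a + 3.0 for a in method_a]

r = pearson_r(method_a, method_b)
differences = [b - a for a, b in zip(method_a, method_b)]

print(f"correlation r = {r:.3f}")
print(f"between-method differences (weeks) = {differences}")
```

The correlation reflects only how closely the points follow a straight line, so it is unaffected by the constant 3-week disagreement that the differences make obvious.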
3 Proposed method of analysis
Just as there are several invalid approaches to this problem, there are also various possible
types of analysis which are valid, but none of these is without difficulties. We feel that a
relatively simple pragmatic approach is preferable to more complex analyses, especially
when the results must be explained to non-statisticians.
It is difficult to produce a method that will be appropriate for all circumstances. What
follows is a brief description of the basic strategy that we favour; clearly the various possible
complexities which could arise might require a modified approach, involving additional or
even alternative analyses.
Properties of each method: repeatability
The assessment of repeatability is an important aspect of studying alternative methods of
measurement. Replicated measurements are, of course, essential for an assessment of
repeatability, but to judge from the medical literature the collection of replicated data is
rare. One possible reason for this will be suggested later.
Repeatability is assessed for each measurement method separately from replicated
measurements on a sample of subjects. We obtain a measure of repeatability from the
within-subject standard deviation of the replicates. The British Standards Institution
(1979) define a coefficient of repeatability as "the value below which the difference between
two single test results ... may be expected to lie with a specified probability; in the absence
of other indications, the probability is 95 per cent". Provided that the differences can be
assumed to follow a Normal distribution this coefficient is $2.83\sigma_r$, where $\sigma_r$ is the
within-subject standard deviation. $\sigma_r$ must be estimated from a suitable experiment. For the
purposes of the present analysis the standard deviation alone can be used as the measure
of repeatability.
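The factor 2.83 can be traced in a few lines. This sketch assumes that the two replicates $x_1$ and $x_2$ on a subject carry independent Normal errors with common standard deviation $\sigma_r$, and that "95 per cent" is rounded to two standard deviations of the difference $d$ (using 1.96 in place of 2 would give about 2.77).

```latex
\begin{align*}
d &= x_1 - x_2, \\
\operatorname{Var}(d) &= \operatorname{Var}(x_1) + \operatorname{Var}(x_2) = 2\sigma_r^2, \\
\operatorname{SD}(d) &= \sqrt{2}\,\sigma_r, \\
\Pr\bigl(|d| < 2\sqrt{2}\,\sigma_r\bigr) &\approx 0.95,
  \qquad 2\sqrt{2} \approx 2.83 .
\end{align*}
```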
It is important to ensure that the within-subject repeatability is not associated with the
size of the measurements, in which case the results of subsequent analyses might be
misleading. The best way to look for an association between these two quantities is to plot the
standard deviation against the mean. If there are two replicates $x_1$ and $x_2$ then this reduces
to a plot of $|x_1 - x_2|$ against $(x_1 + x_2)/2$. From this plot it is easy to see if there is any tendency
for the amount of variation to change with the magnitude of the measurements. The
correlation coefficient could be tested against the null hypothesis of $r = 0$ for a formal test of
independence.
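As a rough sketch of this check for the two-replicate case, the quantities to be plotted can be computed as follows. The replicate pairs are hypothetical, chosen so that the scatter grows with the size of the measurement, and the t statistic shown is the usual one for testing a Pearson correlation against zero, on n - 2 degrees of freedom; it is our choice of formal test, not one prescribed in the text.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical replicate pairs (x1, x2) from one method, one pair per subject.
replicates = [(120, 124), (135, 131), (150, 158),
              (165, 157), (180, 192), (200, 186)]

# Quantities for the plot: |x1 - x2| against (x1 + x2) / 2.
abs_diff = [abs(x1 - x2) for x1, x2 in replicates]
means = [(x1 + x2) / 2 for x1, x2 in replicates]

# Correlation between variability and magnitude, with its t statistic.
n = len(replicates)
r = pearson_r(means, abs_diff)
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

print("means    :", means)
print("abs diffs:", abs_diff)
print(f"r = {r:.3f}, t = {t_stat:.2f} on {n - 2} degrees of freedom")
```

With data like these the strong positive correlation would suggest that a transformation is needed before any pooled standard deviation is calculated.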
If the within-subject repeatability is found to be independent of the size of the
measurements, then a one-way analysis of variance can be performed. The residual standard
deviation is an overall measure of repeatability, pooled across subjects.
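The pooled within-subject standard deviation can be sketched directly, without a statistics package, as the square root of the residual mean square from a one-way analysis of variance with subjects as the groups. The data and the function name `within_subject_sd` are hypothetical; the factor 2.83 is the one quoted earlier for the coefficient of repeatability.

```python
import math

def within_subject_sd(groups):
    """Residual SD from a one-way ANOVA in which each subject is a group.

    `groups` is a list of lists, one inner list of replicates per subject.
    """
    ss_within = 0.0   # residual (within-subject) sum of squares
    df_within = 0     # residual degrees of freedom
    for values in groups:
        subject_mean = sum(values) / len(values)
        ss_within += sum((v - subject_mean) ** 2 for v in values)
        df_within += len(values) - 1
    return math.sqrt(ss_within / df_within)

# Hypothetical replicated measurements: two replicates on each of four subjects.
replicates = [[100, 104], [110, 106], [120, 124], [130, 126]]

sd = within_subject_sd(replicates)
coefficient = 2.83 * sd  # coefficient of repeatability as defined in the text

print(f"within-subject SD = {sd:.3f}")
print(f"coefficient of repeatability = {coefficient:.2f}")
```

The function accepts unequal numbers of replicates per subject, since each subject simply contributes its own sum of squares and degrees of freedom.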
If, however, an association is observed, the results of an analysis of variance could be
misleading. Several approaches are possible, the most appealing of which is the
transformation of the data to remove the relationship. In practice the logarithmic transformation
will often be suitable. If the relationship can be removed, a one-way analysis of variance can
be carried out. Repeatability can be described by calculating a 95 per cent range for the
difference between two replicates. Back-transformation provides a measure of repeatability
in the original units. In the case of log transformation the repeatability is a percentage of
the magnitude of the measurement rather than an absolute value. It would be preferable to
carry out the same transformation for measurement by each method, but this is not essential,
and may be totally inappropriate.
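One way of reading the log-transformed case is sketched here, under assumptions of our own: natural logarithms, two replicates per subject, hypothetical data in which the replicates differ by about 10 per cent at every magnitude, and the factor 2.83 applied on the log scale. Back-transforming the limit by exponentiation then gives a ratio between two replicates rather than an absolute difference.

```python
import math

# Hypothetical replicate pairs whose spread grows with the measurement.
replicates = [(50.0, 55.0), (100.0, 90.0), (200.0, 220.0), (400.0, 360.0)]

# Work on the natural-log scale, where the spread is roughly constant.
log_pairs = [(math.log(a), math.log(b)) for a, b in replicates]

# For pairs, each subject contributes d^2 / 2 to the residual sum of squares
# with one degree of freedom, so the residual mean square is the average.
var_within = sum((la - lb) ** 2 / 2 for la, lb in log_pairs) / len(log_pairs)
sd_log = math.sqrt(var_within)

# 95 per cent limit for the difference of two replicates on the log scale.
limit_log = 2.83 * sd_log

# Back-transform: limit on the ratio of the larger to the smaller replicate.
ratio_limit = math.exp(limit_log)
percent = (ratio_limit - 1) * 100

print(f"within-subject SD (log scale) = {sd_log:.4f}")
print(f"two replicates should differ by less than about {percent:.0f} per cent")
```

Because the limit is a ratio, the same percentage applies to a small measurement as to a large one, which is what the text means by repeatability being a percentage of the magnitude.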
Properties of each method: other considerations
Many factors may affect a measurement, such as observer, time of day, position of subject,
particular instrument used, laboratory, etc. The British Standards Institution (1979)
distinguish between repeatability, described above, and reproducibility, "the value below
which two single test results ... obtained under different conditions ... may be expected
to lie with a specified probability". There may be difficulties in carrying out studies of
reproducibility in many areas of medical interest. For example, the gestational age of a
newborn baby could not be determined at different times of year or in different places.
However, when it is possible to vary conditions, observers, instruments, etc., the methods
described above will be appropriate provided the effects are random. When effects are fixed,
for example when comparing an inexperienced observer and an experienced observer,
the approach used to compare different methods, described below, should be used.
Comparison of methods
The main emphasis in method comparison studies clearly rests on a direct comparison of
the results obtained by the alternative methods. The question to be answered is whether the
methods are comparable to the extent that one might replace the other with sufficient
accuracy for the intended purpose of measurement.
[Figure 1: scatter plot not reproduced; horizontal axis labelled METHOD 1, axes running from 120 to 220.]
Fig. 1. Comparison of two methods of measuring systolic blood pressure.
[Figure 2: plot not reproduced; horizontal axis labelled AVERAGE OF TWO METHODS, running from 120 to 220.]
Fig. 2. Data from Figure 1 replotted to show the difference between the two methods against the average
measurement.
[Figure not reproduced; horizontal axis labelled AVERAGE OF TWO METHODS, running from 0 to 100. Caption not recovered.]
Conclusions
1. Most common approaches, notably correlation, do not measure agreement.
2. A simple approach to the analysis may be the most revealing way of looking at the
data.
3. There needs to be a greater understanding of the nature of this problem, by
statisticians, non-statisticians and journal referees.
Acknowledgements
Appendix
References
Altman, D. G. (1979). Estimation of gestational age at birth - comparison of two methods. Archives of
Disease in Childhood 54, 242-3.
British Standards Institution (1979). Precision of test methods, part 1: guide for the determination of
repeatability and reproducibility for a standard test method. BS 5497, Part 1. London.
Carey, R. N., Wold, S. and Westgard, J. O. (1975). Principal component analysis: an alternative to "referee"
methods in method comparison studies. Analytical Chemistry 47, 1824-9.
Carr, K. W., Engler, R. L., Forsythe, J. R., Johnson, A. D. and Gosink, B. (1979). Measurement of left
ventricular ejection fraction by mechanical cross-sectional echocardiography. Circulation 59, 1196-1206.
Cater, J. I. (1979). Confirmation of gestational age by external physical characteristics (total maturity
score). Archives of Disease in Childhood 54, 794-5.
Cornbleet, P. J. and Gochman, N. (1979). Incorrect least-squares regression coefficients in
method-comparison analysis. Clinical Chemistry 25, 432-8.
Daniel, W. W. (1978). Biostatistics: a Foundation for Analysis in the Health Sciences, 2nd edn. Wiley, New
York.
Feldmann, U., Schneider, B., Klinkers, H. and Haeckel, R. (1981). A multivariate approach for the
biometric comparison of analytical methods in clinical chemistry. Journal of Clinical Chemistry and Clinical
Biochemistry 19, 121-37.
Hallman, M. and Teramo, K. (1981). Measurement of the lecithin/sphingomyelin ratio and
phosphatidylglycerol in amniotic fluid: an accurate method for the assessment of fetal lung maturity. British Journal
of Obstetrics and Gynaecology 88, 806-13.
Healy, M. J. R. (1958). Variations within individuals in human biology. Human Biology 30, 210-8.
Hunyor, S. M., Flynn, J. M. and Cochineas, C. (1978). Comparison of performance of various
sphygmomanometers with intra-arterial blood-pressure readings. British Medical Journal 2, 159-62.
Keim, H. J., Wallace, J. M., Thurston, H., Case, D. B., Drayer, J. I. M. and Laragh, J. H. (1976). Impedance
cardiography for determination of stroke index. Journal of Applied Physiology 41, 797-9.
Laughlin, K. D., Sherrard, D. J. and Fisher, L. (1980). Comparison of clinic and home blood-pressure
levels in essential hypertension and variables associated with clinic-home differences. Journal of Chronic
Diseases 33, 197-206.
Lawton, W. H., Sylvestre, E. A. and Young-Ferraro, B. J. (1979). Statistical comparison of multiple analytic
procedures: application to clinical chemistry. Technometrics 21, 397-416.
Oldham, H. G., Bevan, M. M. and McDermott, M. (1979). Comparison of the new miniature Wright
peak flow meter with the standard Wright peak flow meter. Thorax 34, 807-8.
Pitman, E. J. G. (1939). A note on normal correlation. Biometrika 31, 9-12.
Ross, H. A., Visser, J. W. E., der Kinderen, P. J., Tertoolen, J. F. W. and Thijssen, J. H. H. (1982). A
comparative study of free thyroxine estimations. Annals of Clinical Biochemistry 19, 108-13.
Serfontein, G. L. and Jaroszewicz, A. M. (1978). Estimation of gestational age at birth - comparison of two
methods. Archives of Disease in Childhood 53, 509-11.
Serfontein, G. L. and Jaroszewicz, A. M. (1979). Estimation of gestational age at birth - comparison of
two methods. Archives of Disease in Childhood 54, 243.
Snedecor, G. W. and Cochran, W. G. (1967). Statistical Methods, 6th edn. University Press, Iowa.
Strike, P. W. (1981). Medical Laboratory Statistics. P. S. G. Wright, Bristol.
Westgard, J. O. and Hunt, M. R. (1973). Use and interpretation of common statistical tests in
method-comparison studies. Clinical Chemistry 19, 49-57.