
Methodological Notes

Cerebrovasc Dis 2013;36:267–272
DOI: 10.1159/000353863

Received: June 18, 2013
Accepted: June 19, 2013
Published online: October 16, 2013

Diagnostic Accuracy Measures


Paolo Eusebi
Department of Epidemiology, Regional Health Authority of Umbria, Perugia, Italy

Key Words
Diagnostic accuracy · Sensitivity · Specificity · Predictive values · Likelihood ratios · Diagnostic odds ratio · Receiver operating characteristic · Area under the receiver operating characteristic curve

Abstract
Background: An increasing number of diagnostic tests and biomarkers have been validated during the last decades, and this will still be a prominent field of research in the future because of the need for personalized medicine. Strict evaluation is needed whenever we aim at validating any potential diagnostic tool, and the first requirement a new testing procedure must fulfill is diagnostic accuracy. Summary: Diagnostic accuracy measures tell us about the ability of a test to discriminate between and/or predict disease and health. This discriminative and predictive potential can be quantified by measures of diagnostic accuracy such as sensitivity and specificity, predictive values, likelihood ratios, area under the receiver operating characteristic curve, overall accuracy and diagnostic odds ratio. Some measures are useful for discriminative purposes, while others serve as a predictive tool. Measures of diagnostic accuracy vary in the way they depend on the prevalence, spectrum and definition of the disease. In general, measures of diagnostic accuracy are extremely sensitive to the design of the study. Studies not meeting strict methodological standards usually over- or underestimate the indicators of test performance and limit the applicability of the results of the study. Key Messages: The testing procedure should be verified on a reasonable population, including people with mild and severe disease, thus providing a comparable spectrum. Sensitivities and specificities are not predictive measures. Predictive values depend on disease prevalence, and their conclusions can be transposed to other settings only for studies which are based on a suitable population (e.g. screening studies). Likelihood ratios should be an optimal choice for reporting diagnostic accuracy. Diagnostic accuracy measures must be reported with their confidence intervals. We always have to report paired measures (sensitivity and specificity, predictive values or likelihood ratios) for clinically meaningful thresholds. How much discriminative or predictive power we need depends on the clinical diagnostic pathway and on misclassification (false positives/negatives) costs.

© 2013 S. Karger AG, Basel
1015–9770/13/0364–0267$38.00/0
E-Mail karger@karger.com
www.karger.com/ced

Paolo Eusebi, PhD
Department of Epidemiology, Regional Health Authority of Umbria
Via M. Angelonii, 61
IT–06124 Perugia (Italy)
E-Mail paoloeusebi@gmail.com

Introduction

An increasing number of diagnostic tests and biomarkers [1] have become available during the last decades, and the need for personalized medicine will strengthen the impact of this phenomenon in the future. Consequently, we need a careful evaluation of any potential new testing procedure in order to limit the potentially negative consequences on both health and medical care expenditures [2].

Evaluating the diagnostic accuracy of any diagnostic procedure or test is not a trivial task. In general it is about answering several questions. Will the test be used in a clinical or screening setting? In which part of the clinical pathway will it be placed? Has the test the ability to discriminate between health and disease? How well does the test do that job? How much discriminative ability do we need for our clinical purposes?

In the following pages we will try and answer these questions. We will give an overview of diagnostic accuracy measures, accompanied by their definitions including their purpose, advantages and weak points. We will implement some of these measures by discussing a real example from the medical literature performing an evaluation of the use of velocity criteria applied to transcranial Doppler (TCD) signals in the detection of stenosis of the middle cerebral artery [3]. Then we will end the paper with a list of take-home messages.

Diagnostic Accuracy Measures

Overview
The discriminative ability of a test can be quantified by several measures of diagnostic accuracy:
– sensitivity and specificity;
– positive and negative predictive values (PPV, NPV);
– positive and negative likelihood ratios (LR+, LR–);
– the area under the receiver operating characteristic (ROC) curve (AUC);
– the diagnostic odds ratio (DOR);
– the overall diagnostic accuracy.
While these measures are often reported interchangeably in the literature, they have specific features and fit specific research questions. These measures are related to two main categories of issues:
– classification of people between those who are and those who are not diseased (discrimination);
– estimation of the posttest probability of a disease (prediction).
While discrimination purposes are mainly of concern in health policy decisions, predictive measures are most useful for predicting the probability of a disease in an individual once the test result is known. Thus, these measures of diagnostic accuracy cannot be used interchangeably. Some measures largely depend on disease prevalence, and all of them are sensitive to the spectrum of the disease in the population studied [4]. It is therefore of great importance to know how to interpret them, as well as when and under what circumstances to use them.

When we conduct a test, we have a cutoff value indicating whether an individual can be classified as positive (above/below the cutoff) or negative (below/above the cutoff), and a gold standard (or reference method) which will tell us whether the same individual is ill or healthy. Therefore, the cutoff divides the population of examined subjects with and without disease into 4 subgroups, which can be displayed in a 2 × 2 table:
– true positive (TP) = subjects with the disease with the value of a parameter of interest above/below the cutoff;
– false positive (FP) = subjects without the disease with the value of a parameter of interest above/below the cutoff;
– true negative (TN) = subjects without the disease with the value of a parameter of interest below/above the cutoff;
– false negative (FN) = subjects with the disease with the value of a parameter of interest below/above the cutoff (table 1).

Table 1. 2 × 2 table reporting cross-classification of subjects by index and reference test result

                           Reference test
Index test    subjects with    subjects without    total
              the disease      the disease
Positive      TP               FP                  TP + FP
Negative      FN               TN                  FN + TN
Total         TP + FN          FP + TN             Total

Sensitivity and Specificity
Sensitivity [5] is generally expressed in percentage and defines the proportion of TP subjects with the disease in a total group of subjects with the disease: TP/(TP + FN). Sensitivity estimates the probability of getting a positive test result in subjects with the disease. Hence, it relates to the ability of a test to recognize the ill. Specificity [5], on the other hand, is defined as the proportion of subjects without the disease with a negative test result in a total group of subjects without the disease: TN/(TN + FP). In other words, specificity estimates the probability of getting a negative test result in a healthy subject. Therefore, it relates to the ability of a diagnostic procedure to recognize the healthy.
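The two definitions above reduce to simple ratios of the 2 × 2 counts. A minimal sketch (in Python; the function names are ours, not from the paper), using the TCD counts that appear later in table 2:

```python
def sensitivity(tp, fn):
    """Probability of a positive result in diseased subjects: TP/(TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Probability of a negative result in healthy subjects: TN/(TN + FP)."""
    return tn / (tn + fp)

# Counts from the TCD example (table 2): TP = 9, FP = 7, FN = 3, TN = 80
print(sensitivity(tp=9, fn=3))             # 0.75
print(round(specificity(tn=80, fp=7), 2))  # 0.92
```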
Both sensitivity and specificity are not dependent on disease prevalence, meaning that results from one study could easily be transferred to some other setting with a different prevalence of the disease in a population. Nonetheless, sensitivity and specificity may largely depend on the spectrum of the disease. In fact, both sensitivity and specificity benefit from evaluating patients with more severe disease. Sensitivity and specificity are good indices of a test's discriminative ability; however, in clinical practice a more common line of reasoning is to know how good the test is at predicting illness or health: How confident can we be about the disease status of a subject if the test has yielded a positive result? What is the probability that the person is healthy if the test is negative? These questions need predictive values to address them.

Predictive Values
PPV define the probability of being ill for a subject with a positive result. Therefore, a PPV represents a proportion of patients with a positive test result in a total group of subjects with a positive result: TP/(TP + FP) [6]. NPV describe the probability of not having a disease for a subject with a negative test result. An NPV is defined as a proportion of subjects without the disease and with a negative test result in a total group of subjects with a negative test result: TN/(TN + FN) [6].

Unlike sensitivity and specificity, both PPV and NPV depend on disease prevalence in an evaluated population. Therefore, predictive values from one study should not be transferred to some other setting with a different prevalence of the disease in the population. PPV increase while NPV decrease with an increase in the prevalence of a disease in a population. Both PPV and NPV increase when we evaluate patients with more severe disease.

Likelihood Ratios
LR are useful measures of diagnostic accuracy, but they are often disregarded, even if they have several particularly powerful properties that make them very useful from a clinical perspective. They are defined as the ratio of the probability of an expected test result in subjects with the disease to the probability in the subjects without the disease [7].

LR+ tells us how many times more likely a positive test result occurs in subjects with the disease than in those without the disease. The farther LR+ is from 1, the stronger the evidence for the presence of the disease. If LR+ is equal to 1, the test is not able to distinguish the ill from the healthy. For a test with only 2 outcomes, LR+ can simply be calculated according to the following formula: LR+ = sensitivity/(1 – specificity).

LR– represents the ratio of the probability that a negative result will occur in subjects with the disease to the probability that the same result will occur in subjects without the disease. Therefore, LR– tells us how much less likely the negative test result is to occur in a subject with the disease than in a healthy subject. LR– is usually less than 1, because it is less likely that a negative test result occurs in subjects with than in subjects without the disease. For a test with only 2 outcomes, LR– is calculated according to the following formula: LR– = (1 – sensitivity)/specificity.

The lower the LR–, the stronger the evidence for absence of the disease. Since both specificity and sensitivity are used to calculate the LR, it is clear that neither LR+ nor LR– depend on disease prevalence. Like all the other measures, LR depend on the disease spectrum of the population under study. LR are directly linked to posttest probabilities. Here, we concentrate on the posttest probability of having the disease. The pretest probability is the probability of having the disease before the test is done. If no other clinical characteristics are available for the subject, we can assume the disease prevalence as the pretest probability. If we have a pretest probability p, we calculate the pretest odds as a1 = p/(1 – p) and the posttest odds as a2 = a1 × LR; then, by reverse calculation, the posttest probability P = a2/(1 + a2).

ROC Curve
It is clear that diagnostic measures depend on the cutoff used. There is a pair of diagnostic sensitivity and specificity values for every single cutoff. To construct an ROC curve, we plot these pairs of values on an ROC space with 1 – specificity on the x-axis and sensitivity on the y-axis (fig. 1) [8]. The shape of an ROC curve and the AUC help us to estimate how high the discriminative power of a test is. The closer the curve is located to the upper-left corner and the larger the AUC, the better the test is at discriminating between disease and health. The AUC can have any value between 0 and 1 and is a good indicator of the goodness of the test. A perfect diagnostic test has an AUC of 1.0, whereas a nondiscriminating test has an area of 0.5 (as if one were tossing a coin).

AUC is a global measure of diagnostic accuracy. It does not tell us anything about individual parameters, such as sensitivity and specificity, which refer to specific cutoffs. A higher AUC indicates higher specificity and sensitivity along all the available cutoffs. Out of 2 tests with identical or similar AUC, one can have significantly
higher sensitivity, whereas the other can have significantly higher specificity. For comparing the AUC of two ROC curves, we use statistical tests which evaluate the statistical significance of an estimated difference, with a previously defined level of statistical significance.

Fig. 1. ROC curve. From bottom-left to upper-right corner, we observe increasing sensitivity and decreasing specificity at lowering thresholds.

Overall Diagnostic Accuracy
Another global measure is the so-called 'diagnostic accuracy', expressed as the proportion of correctly classified subjects (TP + TN) among all subjects (TP + TN + FP + FN). Diagnostic accuracy is affected by disease prevalence. With the same sensitivity and specificity, the diagnostic accuracy of a particular test increases as the disease prevalence decreases.

Diagnostic Odds Ratio
DOR is also a global measure of diagnostic accuracy, used for general estimation of the discriminative power of diagnostic procedures and also for comparison of diagnostic accuracies between 2 or more diagnostic tests. The DOR of a test is the ratio of the odds of positivity in subjects with the disease to the odds in subjects without the disease [9]. It is calculated according to the following formula: DOR = LR+/LR– = (TP/FN)/(FP/TN).

The DOR depends significantly on the sensitivity and specificity of a test. A test with high specificity and sensitivity with low rates of FP and FN has a high DOR. With the same sensitivity of the test, the DOR increases with the increase in the test's specificity. The DOR does not depend on disease prevalence; however, it depends on the criteria used to define the disease and its spectrum of pathological conditions of the population examined.

Example

With the help of a study published by Rorick et al. [10], we will now apply the diagnostic measures discussed in the previous section. The purpose of the study was to evaluate the use of mean velocity (MV) criteria applied to TCD signals in the detection of stenosis of the middle cerebral artery. The reference test was an angiographic examination. When the positiveness of TCD was fixed at an MV cutoff of >90 cm/s, results as can be seen in table 2 were observed (the 2 × 2 table was reconstructed by using information available from the study). We can calculate the diagnostic accuracy measures, which are reported in table 3.

Table 2. Cross-classification of lesions by TCD (MV cutoff >90 cm/s) and angiography (reference test)

              Angiography
TCD           positive    negative    total
Positive      9           7           16
Negative      3           80          83
Total         12          87          99

If a test aims at diagnosing a subject, then it is crucial to determine the posttest probability of the disease if the test is positive. If we assume that the pretest probability equals the prevalence of the disease (p = 12/99 = 0.12), the pretest odds are a1 = 0.12/(1 – 0.12) = 0.14 and the posttest odds a2 = a1 × LR+ = 0.14 × 9.32 = 1.29, then the posttest probability of the disease is P = a2/(1 + a2) = 1.29/(1 + 1.29) = 0.56.

Not surprisingly, the posttest probability equals the PPV. The reason is that we assumed as pretest probability the disease prevalence of the study. Let us suppose that the pretest probability is higher (0.25). Then, if we repeat our calculations, the pretest odds are a1 = 0.25/(1 – 0.25) = 0.33 and the posttest odds a2 = a1 × LR+ = 0.33 × 9.32 = 3.11, and the posttest probability of the disease is P = a2/(1 + a2) = 3.11/(1 + 3.11) = 0.76.
Table 3. Diagnostic accuracy measures

Measure       Formula                             Calculation                  Estimate    95% CI
Sensitivity   TP/(TP + FN)                        9/12                         0.75        0.46–0.93
Specificity   TN/(TN + FP)                        80/87                        0.92        0.88–0.94
PPV           TP/(TP + FP)                        9/16                         0.56        0.35–0.70
NPV           TN/(TN + FN)                        80/83                        0.96        0.92–0.99
LR+           sensitivity/(1 – specificity)       0.75/(1 – 0.92)              9.32        3.86–16.55
LR–           (1 – sensitivity)/specificity       (1 – 0.75)/0.92              0.27        0.08–0.61
Accuracy      (TP + TN)/(TP + FP + TN + FN)       (9 + 80)/(9 + 7 + 80 + 3)    0.90        0.83–0.94
DOR           LR+/LR–                             9.32/0.27                    34.29       6.33–214.66

The study also reported results when using a cutoff of >80 cm/s. Sensitivity increased to 83%, while specificity decreased to 87%. We can plot the results in the ROC space (fig. 2). The lower threshold (MV cutoff >80 cm/s) was positioned top right of the higher threshold (MV cutoff >90 cm/s). If we were able to have the cutoff along a continuum, we would see an ROC curve. The study was then reanalyzed in a review of Navarro et al. [11], which demonstrated a good TCD performance against angiography.

Fig. 2. Visual display of test results evaluated at an MV cutoff of >80 cm/s and an MV cutoff of >90 cm/s in an ROC space.

Key Messages

Population
The testing procedure should be verified on a reasonable population; thus it needs to include those with mild and severe disease, aiming at providing a comparable spectrum. Disease prevalence affects predictive values, but the disease spectrum has an impact on all diagnostic accuracy measures.

Sensitivity and Specificity, PPV and NPV or LR?
For clinicians and health professionals, the key issue is to know how a diagnostic procedure can predict a disease. Sensitivity and specificity are not predictive measures; they only describe how a disease predicts particular test results. Predictive values inform us about the probability of having the disease with a positive test result (PPV), or the probability of being healthy with a negative test result (NPV). Unfortunately, these probabilities largely depend on disease prevalence and their meaning can rarely be transferred beyond the study (except when the study is based on a suitably random sample, e.g. population screening studies). While often disregarded, LR should be an optimal choice in reporting diagnostic accuracy, because they share with sensitivity and specificity the feature of being independent of disease prevalence, but they can be used to calculate the probability of a disease while adapting for varying prior probabilities.

Multiple Thresholds: Paired Measures or ROC
Diagnostic accuracy can be presented at a specific threshold by using paired results, such as sensitivity and specificity, or, alternatively, predictive values or LR. Other methods summarize accuracy across all the different test thresholds available. In general, it is better to present both summary and paired results, because a good discriminative power of a test can occur only by chance with
a particular threshold. In that case, a graphical presentation can be highly informative, in particular an ROC plot. At the same time, concentrating only on the AUC when comparing 2 tests can be misleading, because this measure averages both clinically meaningful and irrelevant thresholds into a single piece of information. Thus, we always have to report paired measures for clinically relevant thresholds.

Variability
As always, it is crucial to report variability/uncertainty measures for diagnostic accuracy results (95% CI).

How Much Discriminative or Predictive Power?
In general, answering this question depends on the stage in which the test will be placed in the clinical diagnostic pathway, and on the misclassification cost. For instance, if I need a triage test and the cost of FP is of no importance, I need to focus on sensitivity, PPV or LR+.
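The paper does not state which method was used for the 95% CIs in table 3. A common choice for proportions such as sensitivity is the Wilson score interval, sketched below as one possible approach; the resulting limits are close to, but not necessarily identical with, the values reported in the table.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Sensitivity from table 3: 9/12 = 0.75, reported 95% CI 0.46-0.93
lo, hi = wilson_ci(9, 12)
print(round(lo, 2), round(hi, 2))  # 0.47 0.91
```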

References

1 Whiteley W, Wardlaw J, Dennis M, Lowe G, Rumley A, Sattar N, Welsh P, Green A, Andrews M, Graham C, Sandercock P: Blood biomarkers for the diagnosis of acute cerebrovascular diseases: a prospective cohort study. Cerebrovasc Dis 2011;32:141–147.
2 Koffijberg H, van Zaane B, Moons KGM: From accuracy to patient outcome and cost-effectiveness evaluations of diagnostic tests and biomarkers: an exemplary modelling study. BMC Med Res Methodol 2013;13:12.
3 Tamura A, Yamamoto Y, Nagakane Y, Takezawa H, Koizumi T, Makita N, Makino M: The relationship between neurological worsening and lesion patterns in patients with acute middle cerebral artery stenosis. Cerebrovasc Dis 2013;35:268–275.
4 Montori VM, Wyer P, Newman TB, Keitz S, Guyatt G, Evidence-Based Medicine Teaching Tips Working Group: Tips for learners of evidence-based medicine. 5. The effect of spectrum of disease on the performance of diagnostic tests. CMAJ 2005;173:385–390.
5 Altman DG, Bland JM: Diagnostic tests. 1. Sensitivity and specificity. BMJ 1994;308:1552.
6 Altman DG, Bland JM: Diagnostic tests. 2. Predictive values. BMJ 1994;309:102.
7 Deeks JJ, Altman DG: Diagnostic tests. 4. Likelihood ratios. BMJ 2004;329:168.
8 Zou KH, O'Malley AJ, Mauri L: Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models. Circulation 2007;115:654–657.
9 Glas AS, Lijmer JG, Prins MH, Bonsel GJ, Bossuyt PM: The diagnostic odds ratio: a single indicator of test performance. J Clin Epidemiol 2003;56:1129–1135.
10 Rorick MB, Nichols FT, Adams RJ: Transcranial Doppler correlation with angiography in detection of intracranial stenosis. Stroke 1994;25:1931–1934.
11 Navarro JC, Lao AY, Sharma VK, Tsivgoulis G, Alexandrov AV: The accuracy of transcranial Doppler in the diagnosis of middle cerebral artery stenosis. Cerebrovasc Dis 2007;23:325–330.

