
Endocrinol Metab Clin N Am

31 (2002) 567–581

Evidence-based diagnosis
in endocrinology
Roman Jaeschke, MD, MSc (a),
Gordon H. Guyatt, MD, MSc (a,b,*),
Victor M. Montori, MD, MSc (c)

(a) Department of Medicine, McMaster University, 1200 Main Street West,
Hamilton, Ontario, L8N 3Z5, Canada
(b) Department of Clinical Epidemiology and Biostatistics, McMaster University,
1200 Main Street West, Hamilton, Ontario, L8N 3Z5, Canada
(c) Division of Endocrinology, Metabolism, Nutrition, and Internal Medicine,
Mayo Clinic, 200 First Street Southwest, Rochester, MN 55905, USA

Any interaction with patients involves one or more tasks that may include
establishing and communicating the prognosis of a particular condition,
choosing, if available, an appropriate therapy or preventive strategy, and
supporting them during therapy. This chain of events clearly begins with a
determination of what is happening, that is, establishing the diagnosis.
A rapidly growing literature that describes the characteristics and usefulness of diagnostic tests and procedures provides information that can help
clinicians make an accurate diagnosis. The concepts used in this literature
(and the vocabulary used to describe them) are reviewed in this article. They
are introduced in the context of a clinical scenario that describes the testing
of a urine sample for the presence of microalbuminuria.
Recent guidelines for the management of diabetes recommend that
patients have annual screening for proteinuria. The presence of even a small
amount of protein in urine (microalbuminuria) has important prognostic
and therapeutic implications. Twenty-four-hour urine collections for albumin remain the gold standard for detecting microalbuminuria, but they are
quite cumbersome, and the need for a simpler, more convenient test is
obvious. A determination of the urine albumin concentration (UAC) or the
urine albumin:creatinine ratio (UACR) in a random urine sample is an
appropriate option. This was assessed in a 1997 article [1] that described the

* Corresponding author.
E-mail address: guyatt@mcmaster.ca (G.H. Guyatt).
0889-8529/02/$ - see front matter © 2002, Elsevier Science (USA). All rights reserved.
PII: S0889-8529(02)00018-X


R. Jaeschke et al / Endocrinol Metab Clin N Am 31 (2002) 567–581

diagnostic properties of a first morning urine sample using the concepts of
criterion standard, receiver operating characteristics curve, and sensitivity
and specificity. In this study, patients with diabetes collected 132 24-hour
urine specimens for measurement of the urinary albumin excretion rate
(UAER). When patients returned the 24-hour collection during a clinic visit,
a random urine specimen was taken for determination of the UAC and
UACR. The measurement of 24-hour UAER was considered adequate
when creatinine measurement in the same sample was 700 to 1500 mg/d for
women and 1000 to 1800 mg/d for men. Nine samples were excluded based
on this criterion. Thus, 123 pairs of 24-hour urine specimens and random
urine specimens were available to establish the diagnostic properties of the
random sample.
The following list provides a set of criteria that can be used to help judge
the value of the evidence provided by such a study.
What are the properties of the diagnostic test (ie, sensitivity, specificity,
predictive value, likelihood ratios, receiver operating characteristics)?
How applicable are the study's result and the diagnostic test to different
clinical settings?
The validity or accuracy of a diagnostic test is best determined by comparing it to the truth. A set of criteria that can help assess the validity
of a study that describes a diagnostic test is listed as follows [2].
Did the clinicians face diagnostic uncertainty?
Was there a blind comparison with an independent reference standard
applied similarly to the treatment group and the control group?
Did the results of the test being evaluated influence the decision to perform the reference standard?
The extent to which an article about a diagnostic test adheres to these criteria determines the confidence we can have in its results. Diagnostic studies
can be classified based on this confidence into levels of evidence (see Box).
It is clear that any new test must be compared to an appropriate reference
standard (such as biopsy, surgery, autopsy, or long-term follow-up) and that
this standard should be available for every study participant, along with
the test under investigation. In the study on microalbuminuria, in which
both tests were biochemical assays, the issue of choosing this standard was
arbitrary and was related to the definition of microalbuminuria. In other
situations (eg, stroke, myocardial infarction, urinary tract infection, or
osteomyelitis), the choice of an appropriate reference standard (also called
a gold standard, criterion standard, or diagnostic standard) may not be that
clear. The study of a diagnostic test that does not provide a reasonable criterion standard is unlikely to provide valid results.


Box. Levels of evidence for studies of diagnosis

Level 1
  Independent interpretation of test result (without knowledge of the result of diagnostic standard)
  Independent interpretation of the diagnostic standard (without knowledge of the test result)
  Selection of patients who are suspected (but not known) to have the disorder
  Reproducible description of both the test and diagnostic standard
  At least 50 patients with and 50 patients without the disorder
Level 2: Meets four of the Level 1 criteria
Level 3: Meets three of the Level 1 criteria
Level 4: Meets two or fewer of the Level 1 criteria

To preclude the possibility that the results of the new diagnostic test are
influenced by the results of the reference standard, it is important that the
test results and the reference standard be assessed independently of each
other (ie, by interpreters who were unaware of the results of the other investigation). This independence of comparisons is not crucial when considering
objective, biochemical tests. Its importance arises, however, when the interpretation of one test's results may be influenced by the knowledge of the
other test's results. Examples include assessing fundoscopy when one knows
angiography results, assessing results of clinical examination for neuropathy
when one knows the electromyogram (EMG) or nerve-conduction-study
results, interpreting bone radiographs when one knows bone scan results,
looking at chest radiographs when one knows CT scan results, and conducting heart auscultation when one knows echocardiogram results. The more
likely that the interpretation of a new test could be influenced by knowledge
of the reference standard result (or vice versa), the greater the importance of
the independent interpretation of both tests.
All patients should receive the test under evaluation and the reference
standard. This point may be illustrated by the situation in which all patients
with suspected peripheral neuropathy have nerve conduction studies but
only patients with abnormal velocities have a nerve biopsy. This situation,
sometimes called "verification bias" or "work-up bias," was not a problem
in the study under consideration, in which patients had both tests.
A diagnostic test is useful only to the extent that it distinguishes between
target states or disorders that might otherwise be confused. Almost any test
can distinguish the healthy from the severely affected. The pragmatic value
of a test is established only in a study that closely resembles clinical practice.


Testing the ability of UACR to distinguish between healthy volunteers and


patients with long-term diabetes with established nephropathy would not be
useful in this regard (although if it was not working in those populations, it
would not be useful in more challenging ones).
In the microalbuminuria article [1], there is no full description of the
process through which patients entered the study. It is not known how many
were considered, how many were excluded, and why. It is known only that
all patients classified at that time (1997) as having type 2 diabetes were eligible, that they ranged in age from 40 to 75 years (mean 61), that the mean
duration of diabetes was 11 years (range 1–45), that the mean HbA1c was
10.1% (range 6.9%–15.6%), and that the median UAER was 18.3 µg/min
(26 mg/d). It is likely that the appropriate patient sample was chosen.
What are the properties of a diagnostic test?
Sensitivity and specificity
From the data provided by the authors of the microalbuminuria study
[1], one can construct Table 1, in which a UAER of more than 28.8 mg/d
(corresponding to excretion of 20 µg of albumin per minute) is considered
abnormal. Sensitivity refers to the proportion of people with the target disorder in whom the test result is positive, and specificity refers to the proportion of people without the target disorder in whom the test result is negative.
To use these concepts, the results can be divided into normal and abnormal;
that is, a 2×2 table can be created. Table 1 (a 3×2 table) can be transformed
into one of two possible 2×2 tables, depending on what we call normal (negative) or abnormal (positive) test results. If one assumes that only UACR
>26.8 mg albumin/g of creatinine is abnormal (or positive), Table 2 can be
constructed.
Sensitivity = ability of a test to detect the disease among persons who have
it (proportion of people with disease who have positive test result)
Specificity = ability of a test to confirm normal status among people without disease (proportion of people without disease who have negative test
result)

From Table 2 it is clear that 69 people had proven abnormal UAER
and 61 of them had an abnormal test result. The sensitivity is 61/69 (88%).
Table 1
Relationship between UAER (gold standard) and UACR (test) and three levels of test results

                  UAER
UACR              >28.8 mg/d    <28.8 mg/d
>26.8 mg/g        61            6
15–26.8 mg/g      8             8
<15 mg/g          0             40
Total             69            54


Table 2
Relationship between UAER (gold standard) and UACR (test) and two levels of test results

                  UAER
UACR              >28.8 mg/d    <28.8 mg/d
>26.8 mg/g        61            6
<26.8 mg/g        8             48
Total             69            54

From Evidence-based Medicine Working Group. American Medical Association, Users'
guides to the medical literature: a manual for evidence-based clinical practice. Chicago: AMA
Press; 2002; with permission.

Similarly, 48 of 54 people with a normal UAER had a normal UACR, so
the specificity is 48/54 (89%).
If the threshold of positive versus negative were set differently (ie,
15 mg/g instead of 26.8 mg/g), Table 3 would be generated. This analysis yields
an improved sensitivity of 69/69, or 100%, but a poorer specificity of 40/54
(74%). By increasing the ability to detect existing disease (ie, increasing sensitivity, or the proportion of true-positive results), some unaffected people
are mislabeled as being target positive (ie, specificity is decreased, the proportion of false-positive cases is increased, and the proportion of true-negative
cases is decreased). This illustrates the fact that to use the concepts of sensitivity and specificity one must either throw away important information or
recalculate sensitivity and specificity for every cutpoint. The use of likelihood ratios (discussed later) presents a way of avoiding this problem.
Table 4 summarizes these definitions. Sensitivity is [a/(a + c)], specificity is
[d/(b + d)], and accuracy (ability of a test to classify patients properly) equals
[(a + d)/(a + b + c + d)]. Sensitivity is also sometimes referred to as the
proportion of true-positive results, and specificity as the proportion of true-negative cases.
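The definitions in Table 4 can be checked numerically. The following Python sketch is an illustration only: the function name `two_by_two_metrics` is hypothetical, and the counts are those of Tables 2 and 3.

```python
def two_by_two_metrics(a, b, c, d):
    """Return (sensitivity, specificity, accuracy) for a 2x2 table.

    Cell labels follow Table 4: a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    accuracy = (a + d) / (a + b + c + d)
    return sensitivity, specificity, accuracy


# Counts from Table 2 (cut-off 26.8 mg/g) and Table 3 (cut-off 15 mg/g).
table_2 = two_by_two_metrics(a=61, b=6, c=8, d=48)
table_3 = two_by_two_metrics(a=69, b=14, c=0, d=40)

for label, (sens, spec, acc) in (("26.8 mg/g", table_2), ("15 mg/g", table_3)):
    print(f"cut-off {label}: sensitivity={sens:.2f}, "
          f"specificity={spec:.2f}, accuracy={acc:.2f}")
```

Lowering the cut-off raises sensitivity and lowers specificity, which is the trade-off described above.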
Receiver operating characteristic curve
Different cut-offs that define positive and negative tests result in different
sensitivities and specificities. One may create a graph on which the vertical
axis denotes sensitivity (or true-positive rate) for different cut-offs and the
Table 3
Relationship between UAER (gold standard) and UACR (test) and two levels of test results
(different threshold)

                  UAER
UACR              >28.8 mg/d    <28.8 mg/d
>15 mg/g          69            14
<15 mg/g          0             40
Total             69            54


Table 4
Relationship between test results and truth

                  Disease
Test result       Present       Absent
Positive          a (TP)        b (FP)
Negative          c (FN)        d (TN)
Total             a + c         b + d

Abbreviations: FN, false negative; FP, false positive; TN, true negative; TP, true positive.

horizontal axis displays [1 − specificity] (or the false-positive rate) for the same
cut-offs. The curve established by connecting the points generated by using
different diagnostic cut-offs is called a receiver operating characteristic (ROC)
curve. For the data set under consideration, two points of this curve are known
on the basis of known sensitivities and specificities. A third point could
be read from the data provided in the article and represents 100% specificity
at the expense of only 42% sensitivity. The resulting ROC curve, a modified version of the ROC curve from the article, is presented in Fig. 1.
Such ROC curves can be used to compare formally the value of different
tests by examining the area under each curve; the better the test, the larger
the area under the curve. Zelmanovitz et al found that for the detection of

Fig. 1. Receiver operating characteristics curve for UACR. (Adapted from Gerstein H, Haynes
B. Evidence-based diabetes care. Hamilton, Ontario: BC Decker Inc; 2001; with permission.)


abnormal protein secretion by UACR the area under the ROC curve was 0.9689
and that the area for another test described in the same study (UAC) was
0.976 [1]. To put the discriminating abilities of those tests into perspective,
one may consider that the area under the ROC curve for ferritin in diagnosing
iron deficiency anemia is approximately 0.95 [3].
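As a rough illustration of how an area under the curve is obtained, the sketch below applies the trapezoidal rule to the three UACR points discussed above. It is not the method used in the cited study: the helper `trapezoid_auc` is hypothetical, the end points (0, 0) and (1, 1) are added by convention, and with only three data points the result is a coarse approximation that should not be expected to reproduce the published 0.9689.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear curve through (fpr, tpr) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# (1 - specificity, sensitivity) pairs.
roc_points = [
    (0.0, 0.0),           # conventional start point, not study data
    (0.0, 0.42),          # 100% specificity, 42% sensitivity
    (6 / 54, 61 / 69),    # cut-off 26.8 mg/g (Table 2)
    (14 / 54, 69 / 69),   # cut-off 15 mg/g (Table 3)
    (1.0, 1.0),           # conventional end point, not study data
]

auc = trapezoid_auc(roc_points)
print(f"approximate area under the curve: {auc:.3f}")
```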
Predictive value
Predictive values represent another way of expressing the properties of a
diagnostic test. In applying a given test to a patient, the vertical columns
of Table 4 are of limited interest, because if one really knew what column
the patient was in, the diagnostic test would not be required. The clinically
relevant questions are embedded in the rows. For example: what proportion
of patients with a UACR of more than 15 mg/g have abnormal 24-hour urinary albumin excretion? In this study the answer is 69 of 83 patients, or 83%
(a proportion called the positive predictive value, or PPV). For the same
threshold the probability that a patient with a negative test result has no disease is 100% (40/40), a proportion called the negative predictive value, or NPV
(Table 3). For the different threshold (26.8 mg/g) the respective values are
91% for PPV and 86% for NPV (Table 2). Using the symbols in Table 4,
PPV = [a/(a + b)] and NPV = [d/(c + d)].
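The predictive values quoted above follow directly from the rows of Tables 2 and 3. A minimal Python sketch (the function name `predictive_values` is hypothetical):

```python
def predictive_values(a, b, c, d):
    """Return (PPV, NPV) using the cell labels of Table 4."""
    ppv = a / (a + b)  # true positives among all positive test results
    npv = d / (c + d)  # true negatives among all negative test results
    return ppv, npv


# Cut-off 15 mg/g (Table 3) and cut-off 26.8 mg/g (Table 2).
ppv_15, npv_15 = predictive_values(a=69, b=14, c=0, d=40)
ppv_268, npv_268 = predictive_values(a=61, b=6, c=8, d=48)

print(f"15 mg/g:   PPV={ppv_15:.2f}, NPV={npv_15:.2f}")
print(f"26.8 mg/g: PPV={ppv_268:.2f}, NPV={npv_268:.2f}")
```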
The relationship between sensitivity and specificity on one hand and predictive values on the other can be illustrated using a hypothetical example
(Table 5), in which one assumes that a population has a smaller proportion of people with the disease of interest. In this example, the sensitivity
[a/(a + c)] and specificity [2d/2(b + d) = d/(b + d)] are unchanged, but the
PPV is reduced from [a/(a + b)] to [a/(a + 2b)] and the NPV has increased
from [d/(c + d)] to [2d/(c + 2d)].
As a general rule, although the sensitivity and specificity do not change,
decreasing the disease prevalence decreases the PPV and increases the NPV.
Similarly, it can be easily shown that maintaining test sensitivity and specificity but increasing the disease prevalence (2a and 2c) increases the PPV and
decreases the NPV. Predictive values reflect the test characteristics and the
disease prevalence in the population and are of limited value in populations
different from the studied one.
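The dependence of predictive values on prevalence can be made concrete with a short Python sketch. It holds sensitivity and specificity fixed at the values observed for the 26.8 mg/g cut-off and varies the prevalence. Apart from the study prevalence (69/123), the prevalences are hypothetical values chosen for illustration, as is the function name.

```python
def predictive_values_at_prevalence(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence, with test properties fixed."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


sens, spec = 61 / 69, 48 / 54  # cut-off 26.8 mg/g (Table 2)

# 69/123 is the prevalence in the study; the other values are hypothetical.
results = {
    prevalence: predictive_values_at_prevalence(sens, spec, prevalence)
    for prevalence in (0.05, 0.20, 69 / 123, 0.80)
}
for prevalence, (ppv, npv) in results.items():
    print(f"prevalence={prevalence:.2f}: PPV={ppv:.2f}, NPV={npv:.2f}")
```

At the study prevalence the function returns the PPV and NPV computed from Table 2. At lower prevalence the PPV falls and the NPV rises, as the general rule states.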
Sometimes sensitivity or specificity is so high that it can be used to rule in
or rule out a target disorder. When a test has a high sensitivity, a negative
Table 5
Relationship between prevalence of disease, sensitivity, specificity, and predictive values

                  Disease
Test result       Present       Absent
Positive          a             2b
Negative          c             2d
Total             a + c         2(b + d)


result rules out the diagnosis (a convenient mnemonic is sensitive-negative-out or SnNout); this result corresponds to a high negative predictive value.
When a test has a high specificity, a positive test result rules in the diagnosis
(specificity-positive-in or SpPin); this result corresponds to a high positive
predictive value. Calling a test result positive or negative may be useful when
the test has a good SpPin or SnNout, but for most tests, creating this dichotomy can lose much information.
There are clearly situations in which it is important to maximize sensitivity
or specificity. The requirement for high sensitivity is obvious when a test is
used as a screening tool. In that situation it is important to identify all patients
with a given condition in a population, not just part of them. The high sensitivity and associated high NPV, however, come at a price of lower specificity,
or an increased number of false-positive results and an increased need for confirmatory (sometimes invasive) tests. Examples include mammography for
breast cancer screening or prostate-specific antigen (PSA) for prostate cancer
screening. The opposite occurs when high specificity, and a correspondingly
high positive predictive value, is required, such as when establishing the
diagnosis definitively has important therapeutic or prognostic implications.
Likelihood ratios: pretest and posttest probabilities
Despite the relative simplicity of the concepts described previously, they
are limited by the need to choose different thresholds for a test result, which
can vary depending on the purpose of the test (ie, screening for or confirming
disease). They also lump different degrees of abnormality into a single category: either diabetic neuropathy is present or not; either ketoacidosis is
present or not. Unfortunately, this distinction does not correspond to the
clinical reality in which, for example, a daily albumin secretion of 400 mg and
4 g, a serum creatinine of 200 or 600 µmol/L, and a serum glucose of 20 and
60 mmol/L are all abnormal but have clearly different clinical implications.
These distinctions are best captured by the concept of likelihood ratios.
This concept recognizes the fact that different patients have different probabilities of having the disease of interest because of different risk factors,
such as age and comorbidities. The application of any test can be viewed
as a way of either increasing or decreasing the probability that the patient
has the disease of interest. That is, a test serves to modify the pretest probability of the disease and yields a new posttest probability. The direction and
magnitude of this change from pretest to posttest probability is determined
by the test's properties, which are called the likelihood ratios.
In the diagnostic process one frequently proceeds through a series of different diagnostic tests (information from history taking, physical examination, laboratory or radiologic tests). If the properties of each of these
pieces of information are known, one can move sequentially through them,
incorporating each piece of information and continuously recalculating the
probability of the target disorder. Clinicians implicitly do proceed in this


fashion, but because the properties of the individual items of history


and physical examination are often not available, they must rely on clinical
experience and intuition to arrive at the consecutive pretest probability that
precedes ordering a diagnostic test.
The limited information regarding the properties of items of history and
physical examination often results in widely varying estimates of pretest
probabilities by clinicians. Potential solutions include examining the literature, looking at the prevalence of the target condition in populations that
are similar to the one being considered, consulting other clinicians about
their probability estimates (the consensus view is likely to be more accurate
than our individual intuition), or assuming the extreme plausible pretest
probability and determining if this changes the clinical course of action.
Using the previously constructed Table 1 from the microalbuminuria
article [1], there were 69 people with an abnormal albumin excretion and
54 people in whom the UAER was negative. One may ask two questions:
How likely is a UACR of more than 26.8 mg/g among people who excrete
more than 28.8 mg/d of albumin? Table 1 shows 61 of 69 (or 0.88). How often
is the same test result (UACR >26.8 mg/g) found among people who,
although suspected of abnormal albuminuria, do not have it? The answer
is 6 of 54 of them (0.11). The ratio of these two likelihoods is called the likelihood ratio (LR), and for UACR more than 26.8 mg/g the LR is 0.88/0.11 or
8. In other words, this particular test result is eight times as likely to occur
among patients with, as opposed to among patients without, abnormal
albuminuria. In a similar fashion, the LR can be calculated for each level
of the diagnostic test result. Each calculation involves answering two questions: First, how likely is it to get a given test result (eg, UACR between 15
and 26.8 mg/g) among people with the target disorder (abnormal albuminuria)? Second, how likely is it to get the same test result (15–26.8 mg/g) among
people without the target disorder (no abnormal albuminuria)? For this
intermediate test result the likelihoods are 8/69 or 0.12 and 8/54 or 0.15, and
their ratio (the LR for this test result) is 0.8. To complete the picture, the likelihood
ratio for the UACR <15 mg/g is 0/69 divided by 40/54, or 0.
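The three likelihood ratios can be reproduced from the columns of Table 1. The Python sketch below is illustrative; the function and dictionary names are hypothetical, and returning infinity when a result never occurs among people without the disorder is a convention adopted here, not something taken from the study.

```python
def likelihood_ratios(with_disorder, without_disorder):
    """Return the LR for each test-result level.

    Each argument maps a result level to the number of people with
    (or without) the target disorder who had that result.
    """
    total_with = sum(with_disorder.values())
    total_without = sum(without_disorder.values())
    ratios = {}
    for level in with_disorder:
        p_with = with_disorder[level] / total_with
        p_without = without_disorder[level] / total_without
        # Convention for this sketch: a result never seen without the
        # disorder gets an infinite LR.
        ratios[level] = p_with / p_without if p_without else float("inf")
    return ratios


# Columns of Table 1 (UACR levels in mg/g).
abnormal_uaer = {">26.8": 61, "15-26.8": 8, "<15": 0}
normal_uaer = {">26.8": 6, "15-26.8": 8, "<15": 40}

lrs = likelihood_ratios(abnormal_uaer, normal_uaer)
for level, lr in lrs.items():
    print(f"UACR {level} mg/g: LR = {lr:.2f}")
```

Working from unrounded proportions gives values marginally different from the rounded arithmetic in the text (for example, just under 8 for the highest level).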
The clinical use of the LR is in its ability to indicate by how much a given
diagnostic test result will raise or lower the pretest probability of the target disorder. An LR of 1 means that the posttest probability is exactly the same as the
pretest probability. LRs more than 1 increase the probability that the target
disorder is present, and the higher the LR the greater this increase. Conversely,
LRs less than 1 decrease the probability of the target disorder, and the smaller
the LR, the greater the decrease in probability and the smaller its final value.
LRs more than 10 or less than 0.1 generate large and often conclusive changes
from pretest to posttest probability; LRs of 5 to 10 and 0.1 to 0.2 generate
moderate shifts in pretest to posttest probability; LRs of 2 to 5 and 0.5 to
0.2 generate small (but sometimes important) changes in probability; and
LRs of 1 to 2 and 0.5 to 1 alter probability to a small (and rarely important)
degree.


Having determined the magnitude and significance of the LRs, how does
one use them to go from pretest to posttest probability? LRs cannot be combined directly with probabilities; their formal use requires converting pretest probability to
odds, multiplying the result by the LR, and converting the consequent posttest odds into a posttest probability. Although not a difficult process (see
appendix), this calculation can be tedious and off-putting. Fortunately, there
is an easier way. A nomogram proposed by Fagan [4] (Fig. 2) does all the
conversions and facilitates the conversion from pretest to posttest probabilities. The first column of this nomogram represents the pretest probability,
the second column represents the LR, and the third shows the posttest probability. One may obtain the posttest probability by anchoring a ruler at the
pretest probability and rotating it until it lines up with the LR for the
observed test result.
Thus, the LR incorporates the information that is generally used when
arriving at a diagnosis: the specifics of a given clinical encounter (ie, the individual characteristics of a patient and one's clinical experience) and the
external evidence that comes from performing tests in populations of
patients. The former determines the assessment of pretest probabilities, and
the latter concerns the ability of the test's result to distinguish patients with
and without the condition of interest. These two elements are combined to
establish estimates of whether the patient has the target disorder (posttest
probabilities).
Table 6 provides likelihood ratios that apply to the evaluation of thyroid
nodules. Tables such as this can be useful in designing evidence-based diagnostic strategies and interpreting test results at the clinic using the framework presented previously.

How applicable are the study's result and the diagnostic test to different
clinical settings?
The value of any test depends on its ability to yield the same result when
reapplied to stable patients in one's own clinical setting. Poor reproducibility can result from problems with the test itself (eg, variations in reagents in
radioimmunoassay kits for determining hormone levels). A second cause for
different test results in stable patients arises whenever a test requires interpretation (eg, the extent of ST-segment elevation on an electrocardiogram).
Ideally, an article about a diagnostic test informs readers about how reproducible the test results can be expected to be. This is especially important
when expertise is required in performing or interpreting the test.
If the reproducibility of a test in the study setting is mediocre and disagreement between observers is common, yet the test still discriminates well
between patients with and without the target condition, it is useful. Under
these circumstances, it is likely that the test can be applied readily in any
clinical setting. If reproducibility of a diagnostic test is high and observer


Fig. 2. A likelihood ratio nomogram. (Adapted from Fagan T. Nomogram for Bayes's theorem.
N Engl J Med 1975;293:257; with permission. © 1975 Massachusetts Medical Society. All rights
reserved.)

variation is low, either the test is simple and unambiguous or the clinicians
who are interpreting it are highly skilled. If the latter applies, less skilled
interpreters may not fare as well.
Test properties may change with a different mix of disease severity or
a different distribution of competing conditions. If the population with the
target condition is severely affected, likelihood ratios move away from


Table 6
Likelihood ratios for the diagnosis of malignancy in euthyroid patients with a single or
dominant thyroid nodule

Prevalence (pretest   No. of patients
probability) (%)      included          Test                       Result         LR (95% CI)
20                    132               Fine-needle aspiration     Malignant      226 (4.4–11,739)
                                        cytology guided with       Suspicious     1.3 (0.52–3.2)
                                        ultrasound                 Insufficient   2.7 (0.52–15)
                                                                   Benign         0.24 (0.11–0.52)
7–22                  868               Fine-needle aspiration     Malignant      34 (15–74)
                                        cytology not guided        Suspicious     1.7 (0.94–3)
                                                                   Insufficient   0.5 (0.27–0.76)
                                                                   Benign         0.23 (0.13–0.42)

From Evidence-based Medicine Working Group. American Medical Association, Users'
guides to the medical literature: a manual for evidence-based clinical practice. Chicago: AMA
Press; 2002; with permission.

a value of 1 (sensitivity increases). If patients with the target condition are
all mildly affected, likelihood ratios usually move toward 1 (sensitivity decreases). If patients without the target disorder have competing conditions
that mimic the test results seen in patients who do have the target disorder, likelihood ratios move closer to 1 and the test seems less useful. In a different clinical setting in which fewer of the nondiseased patients have these competing
conditions, likelihood ratios move away from 1 and the test seems more useful.
The phenomenon of differing test properties in different subpopulations
has been most strikingly demonstrated for exercise electrocardiography in
the diagnosis of coronary artery disease. For instance, the more extensive
the average severity of coronary artery disease in the studied population, the
farther from 1 are the likelihood ratios of abnormal exercise electrocardiography for angiographic narrowing of the coronary arteries [5]. Another
example comes from the diagnosis of venous thromboembolism, in which
compression ultrasound for proximal-vein thrombosis has proved more
accurate in symptomatic outpatients than in asymptomatic postoperative
patients [6]. Sometimes a test fails in the very patients one hopes it will serve
best. The likelihood ratio of a negative dipstick test for the rapid diagnosis
of urinary tract infection is approximately 0.2 in patients with clear symptoms and thus a high probability of urinary tract infection, but it is
more than 0.5 in persons with low probability [7], which renders it of little
help in ruling out infection in the latter, low-probability patients.
If one's practice is in a setting similar to that of the investigation and
one's patient meets all the study inclusion criteria and does not violate any
of the exclusion criteria, the results of the study are likely applicable. If not,
a judgment is required. As with therapeutic interventions, one should ask
whether there are compelling reasons why the results should not be applied
to given patients, either because the severity of disease in target positive


patients or the mix of competing conditions in target negative patients is so
different that generalization is unwarranted. The issue of generalizability
may be resolved if an overview pools the results of a number of studies [8,9].
It is useful in making, learning, teaching, and communicating management decisions to link them explicitly to the probability of the target disorder. For any target disorder there are probabilities below which a clinician
would dismiss a diagnosis and order no further tests (a test threshold, also
called no-test-no-treatment threshold). Similarly, there are probabilities
above which a clinician would consider the diagnosis confirmed and would
stop testing and initiate treatment (a treatment threshold, also called test-treatment threshold). When the probability of the target disorder lies
between the test and treatment thresholds, further testing is mandated
(Fig. 3) [10]. Thresholds vary among diseases and individual patients.
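The threshold logic can be written out as a small decision rule. The Python sketch below is illustrative only: the function name `next_step` is hypothetical, and the threshold values in the example (0.05 and 0.80) are arbitrary assumptions, since real thresholds depend on the disease and the patient.

```python
def next_step(probability, test_threshold, treatment_threshold):
    """Classify a disease probability against test and treatment thresholds."""
    if not 0 <= test_threshold < treatment_threshold <= 1:
        raise ValueError("thresholds must satisfy 0 <= test < treatment <= 1")
    if probability < test_threshold:
        return "dismiss diagnosis: no further testing, no treatment"
    if probability > treatment_threshold:
        return "diagnosis accepted: stop testing and treat"
    return "between thresholds: test further"


# Hypothetical thresholds chosen only to illustrate the three zones.
for p in (0.02, 0.40, 0.90):
    print(p, "->", next_step(p, test_threshold=0.05, treatment_threshold=0.80))
```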
Once it is decided what test and treatment thresholds are, posttest probabilities have direct treatment implications. If most patients have test results
with LRs near 1, the test is not useful. The usefulness of a diagnostic test is
influenced strongly by the proportion of patients suspected of having the
target disorder and whose test results have high or low LRs so that the test
result moves their probability of disease across a test or treatment threshold.
In the patients suspected of abnormal albuminuria, Table 1 allows us to
determine the proportion of patients with extreme results (either >26.8 mg/g
or <15 mg/g). The proportion can be calculated as 107/123 (87%).
A final comment deals with the use of sequential tests. Each item of history, or each finding on physical examination, laboratory test, or imaging
procedure, represents a diagnostic test. Pretest probabilities are modified
with each new finding. If two tests are closely related, however, application
of the second test may provide little or no information and the sequential
application of likelihood ratios yields misleading results. For instance, once
one has the results of the most powerful laboratory test for iron deficiency,
serum ferritin, additional tests, such as serum iron or transferrin saturation,
add no further information [3].
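The caution about closely related tests can be illustrated numerically. The Python sketch below chains likelihood ratios by multiplying odds, which is valid only if the test results are independent of each other given disease status. The pretest probability (0.20) and the LRs (5 for each test) are hypothetical numbers, not data from any study cited here, and the function name is an assumption.

```python
def sequential_posttest_probability(pretest_probability, likelihood_ratios):
    """Apply several LRs in turn by multiplying the odds.

    Assumes the test results are conditionally independent given disease
    status; closely related tests violate this assumption.
    """
    odds = pretest_probability / (1 - pretest_probability)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (odds + 1)


# Hypothetical example: two closely related tests, each with LR = 5.
naive = sequential_posttest_probability(0.20, [5.0, 5.0])
single = sequential_posttest_probability(0.20, [5.0])

print(f"chaining both LRs:   {naive:.2f}")
print(f"first test only:     {single:.2f}")
```

If the second test adds no new information, the second figure is the appropriate one and the chained figure overstates the probability of disease.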
The ultimate criterion for the usefulness of a diagnostic test is whether it
adds information beyond that otherwise available and whether this information leads to a change in management that is ultimately beneficial to the

Fig. 3. Diagnostic process: test and treatment thresholds. (Adapted from Gerstein H, Haynes B.
Evidence-based diabetes care. Hamilton, Ontario: BC Decker Inc; 2001; with permission.)


patient. The value of an accurate test is undisputed when the target disorder,
if left undiagnosed, is dangerous, the test has acceptable risks, and effective
treatment exists.
In other clinical situations, tests may be accurate and management even
may change as a result of their application, but their impact on patient outcome may be far less certain. Examples include right heart catheterization
for many critically ill patients and the incremental value of MRI over CT
scanning for various problems.

Acknowledgment
This article is largely based on our previous publication on the subject
entitled "How should diagnostic tests be chosen and used?" [11].

References
[1] Zelmanovitz T, Gross JL, Oliveira JR, et al. The receiver operating characteristics curve in
the evaluation of a random urine specimen as a screening test for diabetic nephropathy.
Diabetes Care 1997;20:516–9.
[2] Evidence-based Medicine Working Group. Users' guides to the medical literature:
a manual for evidence-based clinical practice. Chicago: AMA Press; 2001.
[3] Guyatt G, Oxman A, Ali M. Diagnosis of iron deficiency. J Gen Intern Med 1992;7:145–53.
[4] Fagan T. Nomogram for Bayes's theorem. N Engl J Med 1975;293:257.
[5] Hlatky M, Pryor D, Harrell F. Factors affecting sensitivity and specificity of exercise
electrocardiography. Am J Med 1984;77:64–71.
[6] Ginsberg J, Caco C, Brill-Edwards P, et al. Venous thrombosis in patients who have
undergone major hip or knee surgery: detection with compression US and impedance
plethysmography. Radiology 1991;181:651–4.
[7] Lachs MS, Nachamkin I, Edelstein PH, et al. Spectrum bias in the evaluation of diagnostic
tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med
1992;117:135–40.
[8] Irwig L, Tosteson AN, Gatsonis C, et al. Guidelines for meta-analyses evaluating
diagnostic tests. Ann Intern Med 1994;120:667–76.
[9] Walter SD, Irwig L, Glasziou PP. Meta-analysis of diagnostic tests with imperfect reference
standards. J Clin Epidemiol 1999;52:943–51.
[10] Sackett D, Haynes R, Guyatt G, et al. Clinical epidemiology: a basic science for clinical
medicine. 2nd edition. Boston: Little, Brown and Co; 1991.
[11] Gerstein H, Haynes B. Evidence-based diabetes care. Hamilton, Ontario: BC Decker Inc;
2001.

Appendix: Calculation of post-test probabilities using likelihood ratios


The equation to convert probabilities into odds is as follows: odds = probability/
[1 − probability], which is equivalent to probability of having the target disorder/probability of not having the target disorder. A probability of 0.5 represents odds of 0.50/0.50, or 1 to 1; a probability of 0.80 represents odds of


0.80/0.20, or 4 to 1; a probability of 0.25 represents odds of 0.25/0.75, or
1 to 3, or 0.33. With the pretest odds known, the posttest odds are calculated
by multiplying the pretest odds by the LR. The posttest odds can be
converted back into probabilities using the formula probability = odds/
(odds + 1).
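The conversions described in this appendix can be sketched in a few lines of Python. The function names are hypothetical, and the final example (pretest probability 0.25, LR of 8) is an assumed combination chosen for illustration rather than a case taken from the article.

```python
def probability_to_odds(probability):
    """Odds = probability / (1 - probability); requires probability < 1."""
    return probability / (1 - probability)


def odds_to_probability(odds):
    """Probability = odds / (odds + 1)."""
    return odds / (odds + 1)


def posttest_probability(pretest_probability, likelihood_ratio):
    """Pretest probability -> pretest odds -> posttest odds -> probability."""
    posttest_odds = probability_to_odds(pretest_probability) * likelihood_ratio
    return odds_to_probability(posttest_odds)


# The three conversions quoted in the appendix: 0.5, 0.80 and 0.25.
example_odds = [probability_to_odds(p) for p in (0.5, 0.80, 0.25)]

# Hypothetical patient: pretest probability 0.25, test result with LR = 8.
example_posttest = posttest_probability(0.25, 8)

print([round(o, 2) for o in example_odds], round(example_posttest, 2))
```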
