Beruflich Dokumente
Kultur Dokumente
UMR181 Physiopathologie & Toxicologie Exprimentales INRA, ENVT, Ecole Nationale Vtrinaire, 23 Chemin des Capelles, 31076 Toulouse Cedex 3, France
Department of Clinical Sciences, Ecole Nationale Vtrinaire, 23 Chemin des Capelles, 31076 Toulouse Cedex 3, France
a r t i c l e
i n f o
Article history:
Received 23 February 2009
Received in revised form 27 March 2009
Accepted 31 March 2009
Available online 5 April 2009
Keywords:
Hospital-based reference interval
Indirect method
E.M. algorithm
a b s t r a c t
Background: Reference limits are some of the most widely used tools in the medical decision process. Their
determination is long, difcult, and expensive, mainly because of the need to select sufcient numbers of
reference individuals according to well-dened criteria. Data from hospitalized patients are, in contrast,
numerous and easily available. Even if all the information required for a direct reference interval computation
is usually not available, these data contain information that can be exploited to derive at least rough
reference intervals.
Methods: In this article, we propose a method for the indirect estimation of reference intervals. It relies on a
statistical method which has become a gold-standard in other sciences to separate components of mixtures.
It relies on some distributional assumptions that can be checked graphically. For the determination of
reference intervals, this new method is intended to separate the healthy and diseased distributions of the
measured analyte. We assessed its performance by using simulated data drawn from known distributions
and two previously published datasets (from human and veterinary clinical chemistry).
Results and discussion: The comparison of results obtained by the new method with the theoretical data of
the simulation and determination of the reference interval for the datasets was good, thus supporting the
application of this method for a rough estimation of reference intervals when the recommended procedure
cannot be used.
2009 Elsevier B.V. All rights reserved.
1. Introduction
Reference values are some of the most powerful tools in medical
decision-making both in human and veterinary medicine, even
though decision limits are being introduced in an increasing number
of cases. Recently, the recommendations for Determining, establishing, and verifying reference intervals in the clinical laboratory have
been updated by the International Federation of Clinical Chemistry
and Laboratory Medicine (IFCC) and the Clinical and Laboratory
Standards Institute (CLSI) [1].
This working group acknowledges that the de novo determination
of reference limits, and their regular updating for the main partitioning groups is an enormous, time-consuming and very expensive
undertaking, even in human clinical pathology, for which very large
data bases are available, e.g. through Preventive Medicine Centers [2].
This task is even more difcult in animal clinical pathology, due to the
various species, breeds, breeding conditions, productions, etc., and
often the small number of animals available, e.g. in wild species.
44
Yi = Ui Xi + 1 Ui Xi ; i = 1; :::; n
>
<x 1
if 0
k x =
>
:
logx if = 0:
If the actual status Ui of the ith individual was known, the probability density
function (pdf) of the observation Yi would be
yi 1
1
2
p exp 2 k yi m
2
2
with (m, 2,) = (m1,12,1) if Ui = 1 and (m, 2,) = (m2, 22,2) otherwise.
1 = 1
1 = 1
m
2
+ 2
1+
; 1+
1
1
1
1
1
1
#
2
where m1, 21, and 1 are the parameter estimations computed from the data of healthy
individuals only.
Since the Ui's are not observed, we need to estimate the percentage of healthy
individuals p and the parameters of the healthy distribution (m1, 21,1). The healthy
distribution can be well separated from the diseased distribution when the parameters
from the diseased distribution (m2, 22,2) are also estimated. We thus need to estimate
the parameter = (p,m1, 21,1,m2, 22,2) to get an estimation of the reference interval.
The best way to proceed is to use the maximum likelihood estimate of the
parameter that can be obtained as the value of that maximizes
n
Y
L =
pm1 ; 2 ;1 yi + 1 p m2 ; 2 ;2 yi :
1
i=1
k + 1
ai
2
2
2
21 1 log yi + log 1 + k1 yi m1 = 1 +
i=1
2
k
21 2 log yi + log 22 + k2 yi m2 = 22
1 ai
where
k
ai
where Ui is the unobserved status of this individual equal to 1 when he is healthy and 0
otherwise. Consequently, when the individual is healthy, one observes X1i while X2i is
observed when he is diseased. The distribution of Yi is therefore a mixture of the
distributions for healthy and diseased individuals.
We assume that the random variables Xji that appear in the model (Eq. (1)) are
mutually independent, independent of Ui and respectively distributed according to a N
(mj, 2j ) up to a BoxCox transformation kj where
m; 2 ; yi =
p mk ; 2k ;k yi
1
:
1 1
pk mk ; 2k ;k yi + 1 pk mk ; 2k ;k yi
1
>
>
P k
P k
>
k
k
k
k
>
>
w i = ai =
ai and vi = 1 ai
n
ai
>
>
>
i
i
>
>
>
>
>
> k + 1 = arg inf log P wk k y mk 2 2 P log y
>
i
i
>
1
1
i
>
>
i
i
>
>
>
>
P k
>
k + 1
>
>
=
wi kk + 1 yi
>
> m1
1
<
i
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
:
2k + 1
k + 1
P
i
wi
kk
k + 1
2k + 1
vi kk
+ 1
P
i
vi
+ 1
m2
2
kk
2
vi
k + 1 2
yi m1
k 2
k yi m2
yi
+ 1
k + 1 2
yi m2
P
i
log yi
k1 y m1
k2 y m2
+ 1 p
F y = p
1
2
V y ;
plot by representing F
(F (yi)) as a function of yi. When the points obtained are about
on a straight line, the model can be considered as reasonable. However and even if the
chosen model is the good one, such graphics always exhibit points far away from the
line because the sampling variation has not been taken into account to build it. Also, a
condence region/a P-value is helpful to compare these cdf. We propose to build a 95%
uniform region R0.95 for the QQ-Plot that can be used as follows: when at least one point
1
(yi, F
(F (yi))) falls outside R0.95 there is less than 5% risk that F F. Equivalently, we
propose to compute a P-value by simulating the distribution of sup j F y F yj
y
under H0: F = F.
To perform these simulations, we used the properties of the maximum likelihood
p
estimator: the Central Limit Theorem guarantees that when n is large, n is
1
approximately distributed according to the following Gaussian distribution N(0,I ())
where I() is the Fisher information matrix. This matrix depends on the true value
of which is unknown. Louis [14] proposed a method that gives a good approximation
of I() that we denote by I .
We therefore propose to build R0.95 using Monte-Carlo simulations. First, draw a large
1/n) and for each j, we simulated K2 =50
number (K1) of j (say K1 =10 000) from N( ,I
samples of size n using (Eq. (1)). We thus obtained K1 K2 samples of size n, each of them
giving an empirical cdf G and the distance sup j F y G y j . The P-value of the test is
y
discarding the 5% simulated samples with the highest distance sup j F y G yj , we built
y
the QQ-plot for each of the 95% remaining samples. The envelope of these QQ-plots gives
R0.95.
After the distributional assumptions of the model (Eq. (1)) have been checked, the
standard error of the estimates of the reference limits are computed as the standard
deviation of the K1 reference limits that can be computed using equation (Eq. (2)) but
with the Monte-Carlo sample j instead of .
3. Performance assessment
The objective of this simulation was to create samples from a
predened model distribution and to see whether or not the proposed
method was able to recover this known distribution from the samples
thus created.
The following parameters were used for the model distribution
p = 0.9, m1 = 3.1, 21 = 0.014, 1 = 0.18, m2 = 2.90, 22 = 0.022, 2 =
0.26. These values are not purely ctive and, except for the value of
p, come from example 2 given hereafter. In this distribution, the 2
subpopulations of healthy and diseased subjects are not clearly
separated and greatly overlap.
45
46
Table 1
Parameters of the simulated model distribution and estimations of the percentage of
healthy individuals and of the reference interval using the proposed method.
n = 1000
True value
a.s.e.
Mean
s.e.
n = 10,000
Lower lim.
Upper lim.
Lower lim.
Upper lim.
0.90
0.094
0.899
0.045
53.73
1.01
53.53
1.18
155.76
10.78
155.8
9.62
0.90
0.034
0.886
0.022
53.73
0.33
53.34
0.6
155.76
3.43
154.49
5.19
(p = percentage of healthy individuals; Lower and Upper lim. = the limits of the
reference interval; a.s.e. = average of asymptotic standard errors; s.e. = standard
deviation of estimates obtained with the 100 samples).
Fig. 2. Estimated reference limits obtained from 100 samples of size n = 1000 and 100
samples of size n = 10,000. The horizontal lines are the actual reference limits. The
lower reference limit is well estimated while the estimated upper reference limit has
quite a large imprecision probably because a large percentage of diseased individuals
had concentrations close to this limit.
Fig. 1. Estimated percentage of healthy individuals obtained from 100 samples of size
n = 1000 and 100 samples of size n = 10,000. The horizontal line is the value that
was used for simulations. This percentage is well estimated; the imprecision decreases
with n.
Fig. 3. The histogram represents the observed TSH concentrations in human males
obtained from [16]. The concentrations below the LOQ have not been represented (but
taken into account for the RI determination (see text). The dotted, thin and thick curves
represent the TSH distribution obtained using the proposed method for the diseased,
healthy individuals and for the entire population respectively.
Fig. 4. Diagnostic plot proposed in Section 2 for the TSH data. The empirical cdf F(thin
line) is totally included within the 95% condence region (thick lines) or (P N 0.05). The
assumptions about the model (shape of distributions) are thus in accordance with the
data.
5. Discussion
In this article we propose a method for estimating reference
interval of an analyte from a large set of data when the actual status of
the individuals in the analyzed sample is unknown. This method is not
new; it is currently considered as the gold-standard method for
mixtures analysis.
This method compensates some of the deciencies of existing
methods used to determine reference intervals. Bhattacharya and
Baadenhuijsen's methods assume specic shapes, mainly Gaussian
and log-Gaussian, for the distributions of healthy and diseased. These
methods are not exible enough to adapt to other distribution shapes,
unlike the proposed method. With these methods the computation of
reference intervals relies mainly on a graphical analysis that requires a
subjective choice of aligned points in a scatter plot. This choice
strongly inuences the results obtained and the condence intervals
for the provided estimates cannot be computed. Actually, these
47
Fig. 5. Histogram of the plasma creatinine concentrations in dogs obtained from [17]. The dotted, thin and thick curves represent the creatinine distribution obtained for diseased
animals, for healthy animals and for the entire population respectively, using the proposed method.
48
[4] Glick JH. Statistics of patient test values: application to indirect normal range and
to quality control. Clin Chem 1972;18:150413.
[5] Grossi E, Colombo R, Cavuto S, et al. The REALAB project: a new method for the
formulation of reference intervals based on current data. Clin Chem 2005;51:123240.
[6] Ferre-Masferrer M, Fuentes-Arderiu X, Puchal-Ane R. Indirect reference limits
estimated from patients' results by three mathematical procedures. Clin Chim Acta
1999;279:97105.
[7] Petitclerc C. Normality: the unreachable star? Clin Chem Lab Med 2004;42:698701.
[8] Kairisto V, Poola A. Software for illustrative presentation of basic clinical
characteristics of laboratory tests GraphROC for Windows. Scand J Clin Lab Invest
Suppl 1995;222:4360.
[9] Bhattacharya CG. A simple method of resolution of a distribution into Gaussian
components. Biometrics 1967;23:11535.
[10] Baadenhuijsen H, Smit JC. Indirect estimation of clinical chemical reference
intervals from total hospital patient data: application of a modied Bhattacharya
procedure. J Clin Chem Clin Biochem 1985;23:82939.
[11] Oosterhuis WP, Modderman TA, Pronk C. Reference values: Bhattacharya or the
method proposed by the IFCC? Ann Clin Biochem 1990;27:35965.
[12] Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via
the EM algorithm. J R Stat Soc B 1977;39:138 (With discussion).
[13] Wu CF. On the convergence properties of the EM algorithm. Ann Stat 1983;11:95103.
[14] Louis TA. Finding the observed information matrix when using the EM algorithm.
J R Stat Soc B 1982;44:22633.
[15] Giavarina D, Dorizzi RM, Sofati G. Indirect methods for reference intervals based
on current data. Clin Chem 2006;52:3356.
[16] Concordet D, Vergez F, Trumel C, Diqulou A, Lanore D, et al. A multicentric
retrospective study of serum/plasma urea and creatinine concentrations in dogs
using univariate and multivariate decision rules to evaluate diagnostic efciency.
Vet Clin Pathol 2008;37:96103.
[17] Holzmann H, Munk A, Gneiting T. Identiability of nite mixtures of elliptical
distributions. Scand J Statist 2006;33:75363.
[18] Royston P, Wright EM. Goodness-of-t statistics for age-specic reference
intervals. Stat Med 2000;19:294362.