Keywords: Survival analysis, outlier detection, robust regression, Cox proportional hazards, concordance c-index
Abstract: Outlier detection is an important task in many data-mining applications. In this paper, we present two parametric outlier detection methods for survival data. Both methods perform outlier detection in a multivariate setting, using the Cox regression as the model and the concordance c-index as a measure of goodness of fit. The first method is a single-step procedure that presents a delete-1 statistic based on bootstrap hypothesis testing for the increase in the concordance c-index. The second method is a sequential procedure that maximizes the c-index of the model using a greedy one-step-ahead search. Finally, we use both methods to perform robust estimation for the Cox regression, removing from the regression a fraction of the data according to their measure of outlyingness. Our preliminary results on three different datasets show improved estimation of the Cox regression coefficients and improved predictive ability of the model.
2.1 Swamping and Masking

Data sets with multiple outliers or clusters of outliers are subject to masking and swamping effects. Here we enunciate the following definitions (Acuna and Rodriguez, 2004):

Masking Effect: One outlier masks another outlier if the second outlier can be considered an outlier only by itself but not in the presence of the first outlier.

Swamping Effect: One outlier observation swamps a second observation if the latter can be considered as an outlier in presence of the first but not by itself.

As seen in (Fischler and Bolles, 1981), these effects are particularly harmful when developing sequential procedures for outlier detection, mainly because the subset of observations already deleted influences which observations will be deleted in the subsequent iterations.

2.2 Model-Specific Outlier Detection: Cox Proportional Hazards

In this paper the model chosen to represent the data was the Cox proportional hazards due to its simplicity, good results and great power of interpretability.

Several works have been developed to increase the robustness of the estimation of the Cox Regression by performing outlier detection, for example through residual analysis, estimating the variation in regression parameters with the removal of a given observation (Therneau et al., 1990). The outliers can then be detected by selecting the observations that cause the largest variation in the parameters upon its removal. This approach is susceptible to masking and swamping and also needs the tuning of the outlier or non-outlier threshold.

In (Farcomeni and Viviani, 2011) outlying observations are defined as the individuals that have the smallest contributions to the Cox partial likelihood. In order to find these observations they first make a robust fitting of the Cox regression and then in the

To assess the predictive ability of a survival model, we will use Harrell's concordance c-index (Harrell et al., 1982). It measures the ability of the model to predict a higher relative risk to an individual whose event occurs first. The relative risk is estimated from the output of the model for each individual; in a Cox model for instance, the relative risk corresponds to the hazard ratio. The c-index is calculated using the following procedure:

1. Form all possible pairs of individuals.

2. Omit the pairs whose shorter survival time is censored and all pairs where both observations are censored. These are the permissible pairs, being $N_{permissible}$ its cardinality.

3. To calculate Concordance, for each permissible pair when $T_i \neq T_j$: count 1 if the shorter survival time has higher predicted risk, count 0.5 otherwise. For $T_i = T_j$ and both not censored: count 1 if the predicted risks are the same, 0.5 otherwise; if at least one is censored and it corresponds to a lower risk, count 1 (0.5, otherwise). Concordance is defined as the sum of all counts for each permissible pair.

4. The c-index is given by $\text{c-index} = \text{Concordance} / N_{permissible}$.

The c-index is a rank measure, thus it only measures how well predicted values are concordant with rank-ordered response variables. For example, the c-index for two patients with predicted hazard ratios of 0.4 and 0.6 is the same as if the patients had hazard ratios of 0.1 and 0.9 (Harrell, 2001), it only measures if the outcome is concordant with the response variables or not. Thus, unlike measures such as the sum of squared errors, one observation by itself has a limited contribution for the overall concordance. This robustness may allow for the maximization of the c-index without worrying if it is being maximized at the cost of the majority of the data, only to fit better one or a cluster of outlying observations, as it can happen with the sum of squared errors (Fischler and Bolles, 1981).
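The pairwise counting behind the c-index can be illustrated with a short Python sketch. It is a simplified illustration, not the implementation used in this work: the names `c_index`, `times`, `events` and `risks` are hypothetical, pairs with tied survival times are skipped for brevity, and the scoring assumes a common convention (1 for a concordant pair, 0.5 for tied predicted risks, 0 for a discordant pair), so it covers the procedure only partially.

```python
from itertools import combinations


def c_index(times, events, risks):
    """Concordance c-index over permissible pairs (simplified sketch).

    times  : observed survival or censoring times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted relative risks (higher means earlier expected event)
    """
    concordance = 0.0
    n_permissible = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: pairs with tied times are skipped
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored: pair is not permissible
        n_permissible += 1
        if risks[a] > risks[b]:
            concordance += 1.0  # shorter survival time has the higher risk
        elif risks[a] == risks[b]:
            concordance += 0.5  # assumption: tied predictions score 0.5
        # assumption: discordant pairs contribute 0
    if n_permissible == 0:
        raise ValueError("no permissible pairs")
    return concordance / n_permissible
```

Because only the ordering of `risks` enters the count, calling the function with risks `[0.6, 0.4]` or `[0.9, 0.1]` for the same two uncensored patients gives the same value, which is the rank property discussed above.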
3 METHODS FOR OUTLIER DETECTION

We propose three novel methods for outlier detection in survival data based on the concordance index, described in Sections 3.1 to 3.3. Section 3.4 describes alternative proposals that will be further used for comparison purposes.

The proposed methods make use of an operational definition of outlier, defined as an observation that, when absent from the data, will likely decrease the prediction error of the fitted model. In a survival setting, this prediction error is measured using the concordance c-index, which has the particularity of using the predictive model as a black box.

3.1 Bootstrap Hypothesis Testing (BHT)

Ideally we would know the underlying distribution of the observations $(X_i, Y_i)$ and perform a hypothesis test about the difference in terms of concordance between the two distributions. Thus the idea is to perform $n$ hypothesis tests about the concordance variation, one for each observation $i$, and sort the resulting p-values.

The hypothesis tests follow the bootstrap approach (Efron, 1979). Each observation $(X_i, Y_i)$ is considered a discrete random variable having a distribution equal to the empirical distribution given by the original dataset. We consider $n$ different empirical distributions, each resulting from removing observation $i$ from the original data and adjusting the probability masses so that they sum to one. Denoting by $C$ the concordance c-index and by $C_{original}$ the concordance in the original data, the distributions $Data_i$ represent the adjusted empirical distributions having $P(X = X_i, Y = Y_i) = 0$. The hypothesis test for each observation is formulated as follows:

$$H_0: C_{Model,(X,Y) \sim Data_i} \le C_{original}$$
$$H_1: C_{Model,(X,Y) \sim Data_i} > C_{original}$$

Writing $C_i = C_{Model,(X,Y) \sim Data_i}$ and $\Delta C_i = C_i - C_{original}$, it is more useful to reformulate the hypothesis tests as:

$$H_0: \Delta C_i \le 0$$
$$H_1: \Delta C_i > 0$$

The rejection of the null hypothesis at a significance level $\alpha$ corresponds to estimating a confidence interval for the values of $\Delta C$ for each distribution $Data_i$: if this interval does not contain values less than or equal to zero, we can reject the null hypothesis at the significance level $\alpha$. Alternatively, we can calculate the test p-value.

These confidence intervals are computed using Monte Carlo bootstrap as explained in (Harrell, 2001). For each observation $i$ the procedure is the following: 1) produce $B$ bootstrap samples by sampling with replacement $n - 1$ observations from the empirical distribution $Data_i$; 2) compute the concordance for each bootstrap sample; 3) the p-value corresponds to the proportion of bootstrap samples having $C_i - C_{original} \le 0$.

The number of bootstrap samples $B$ has shown to depend on the number of individuals and the number of covariates. In our tests the value of $B$ was iteratively increased until the p-values converged.

Following the same reasoning provided in (Singh and Xie, 2003), given an outlying observation, the probability that a bootstrap sample does not contain it is approximately $(1 - 1/n)^n \to e^{-1}$ ($\approx 37\%$) as $n \to \infty$. Thus, each observation will be absent in approximately 37% of the samples. A low p-value for the hypothesis test mentioned above means that removing observation $i$ improves the concordance c-index in a systematic way, not depending on the cooperation of any other observation. On the other hand, if one outlier is masked by another, the masking outlier will not be present in approximately 37% of the bootstrap samples and thus we can expect a multimodal behavior for the expected $\Delta C$. Thus an outlier subject to masking may not systematically improve concordance (presenting a high p-value for the hypothesis test), but if it presents multimodality and one of the modes is relatively high, it is a candidate for an outlier.

To sum up, Bootstrap Hypothesis Testing (BHT) on $\Delta C$ works as follows: for each observation, a bootstrap hypothesis test is performed. The resulting statistics for each observation are a p-value and the expected value of $\Delta C$. The p-value gives the confidence level at which to reject the hypothesis that the removal of the observation causes no increase in the c-index. Experimentally we verified that these two values are correlated: when the p-value is low, the expected $\Delta C$ is usually very high, while the opposite relation has shown to be weaker. So, in order to obtain a one-dimensional metric for outlyingness, we consider the observations with the lowest p-values the most outlying ones.

3.2 Dual Bootstrap Hypothesis Test (DBHT)

This method aims to improve the approach taken in the BHT method. In BHT, removing one observation from the dataset and then assessing the impact of each removal on concordance has an undesired effect. Since the model has fewer observations to fit (the observation under test is removed), there is a tendency for the concordance to increase. This potentially introduces confusion in the hypothesis test made in BHT; in particular, it may increase the number of false positives. The rationale behind DBHT is to generate two histograms from two antagonistic versions of the bootstrap procedure, the poison and antidote bootstraps, and then compare them. The antidote bootstrap excludes the observation under test from every bootstrap sample, whereas the poison bootstrap forces the observation being tested into every bootstrap sample. We consider $\Delta C_{antidote}$ and $\Delta C_{poison}$ as two real random variables corresponding to the concordance variation in relation to the original dataset for an antidote bootstrap sample and a poison bootstrap sample. Then we perform the following hypothesis test:

$$H_0: E[\Delta C_{antidote}] > E[\Delta C_{poison}];$$
$$H_1: E[\Delta C_{antidote}] \le E[\Delta C_{poison}];$$

Again we believe that if observation $i$ is an outlier, when removed, the concordance of the samples tends to increase. Similar to the BHT method, besides a survival model and the input dataset $D$, DBHT takes only one input parameter: $B$, the number of bootstrap samples used in the antidote and poison bootstrap procedures. The DBHT method is a soft classifier and a single-step method. Thus the output is an outlying score for each observation; from this, one can extract the $k$ most outlying observations.

3.3 One-step deletion (OSD)

This method is a sequential procedure for outlier removal. We start with all data and, at each iteration of the algorithm, the observation that, when excluded, causes the largest increase in concordance is removed. This method is equivalent to a one-step-ahead greedy search for maximizing the c-index of the model on the data. The resulting subset of removed observations is considered to contain the most outlying ones.

Here we present alternative methods for outlier detection in survival data that will be used to assess the performance of the proposed methods.

3.4.1 Martingale residuals (MART)

These residuals arise from the counting process framework for censored survival data. First, a martingale process is defined as the difference between the observed and expected number of events (Hosmer et al., 2008). Let $N(t)$ be the number of events until $t$ and $H(t)$ the cumulative hazard function; we have for each individual $i$ the martingale residual process:

$$M_i(t) = N_i(t) - H_i(t). \tag{1}$$

The martingale residual is defined as the value of the process $M_i(t)$ at the time of failure/censoring. As $N(t)$ takes the value 1 if the event is observed and zero when censored (Collett, 2003), the residuals are given by:

$$r_{M_i} = \delta_i - H_i(t), \tag{2}$$

where $\delta_i$ is the censoring indicator for individual $i$. For the Cox model the residuals are given by:

$$r_{M_i} = \delta_i - \exp\{\beta X\} H_0(t). \tag{3}$$

3.4.2 Deviance residuals (DEV)

The deviance residuals are an attempt (Collett, 2003) to adjust the martingale residuals to be more centered around zero, given by:

$$r_{D_i} = \mathrm{sgn}(r_{M_i}) \left[ -2 \{ r_{M_i} + \delta_i \log(\delta_i - r_{M_i}) \} \right]^{1/2}. \tag{4}$$

3.4.3 Likelihood displacement statistic (LD)

Let $\hat{\beta}$ be the value of $\beta$ that maximizes the Cox partial likelihood and $\hat{\beta}_{(i)}$ the estimate when observation $i$ is eliminated from the fitting. The likelihood displacement statistic (LD) (Cook, 1977) is given by:

$$LD_i = 2 \log L(\hat{\beta}) - 2 \log L(\hat{\beta}_{(i)}). \tag{5}$$

Under the null hypothesis $\hat{\beta}_{(i)} = \hat{\beta}$, the LD statistic follows a chi-square distribution with one degree of freedom. Therefore we calculate the p-value of this test for all observations; the ones with the most significant values are considered the most outlying.

4 DATASETS

Similarly to the simulation data in (Farcomeni and Viviani, 2011), we generate datasets having the Cox proportional hazards as the underlying probabilistic model. Our goal is to recreate a realistic setting, with survival times and covariates as similar as possible to real datasets. In order to approximate these conditions, each simulated dataset will have a pure model $\beta_G$ that translates a general trend of the observations, and two other Cox models with different parameter values. Each dataset consists of 100 observations having covariates $X_1, X_2, X_3$. These follow a 3-D normal
distribution with zero mean and covariance matrix $\Sigma$, which will be equal to the identity matrix $I$. For the survival times, the probabilistic model for the hazard of each individual follows one of two possible models: the pure model $\beta_G$ and one outlier model $\beta'$. Having $k < n$ outliers, the hazard function for each observation $i$ is generated by:

$$h_i(t) = \begin{cases} h_0(t) \exp\{\beta X\}, & 1 \le i \le n - k \\ h_0(t) \exp\{\beta' X\}, & n - k < i \le n \end{cases} \tag{6}$$

Three baseline hazards $h_0(t)$ are used, given by a Weibull with the parameter combinations (1, 1), (1.5, 0.5) and (0.5, 1.5) (the second value of each pair being the shape parameter), corresponding respectively to constant, strictly decreasing, and strictly increasing hazard functions. The value of $k$ is set so as to have 10% of outliers. The cumulative hazard function $H_i(t)$ is then obtained as:

$$H_i(t) = \int_0^t h_i(u) \, du. \tag{7}$$

From each $H_i(t)$ we further calculate the corresponding survival curves by $S_i(t) = e^{-H_i(t)}$. Having this distribution, we generate 100 survival times according to the distribution given by $S_i(t)$ and generate a censoring vector $c_1, \ldots, c_{100}$ following a Bernoulli distribution with probability $p$, corresponding to the proportion of censored observations, typically a value around 0.2:

$$t_i \sim 1 - S_i(t), \qquad c_i \sim \mathrm{Bernoulli}(p). \tag{8}$$

4.2 Clinical Data

In order to test the procedures in a more realistic setting, we have further applied the methods to real clinical data, focusing on two studies:

WHAS: Dataset from the Worcester Heart Attack Study, with 100 individuals each with 5 covariates. This data concerns the survival times of patients having their first heart attack. Data publicly available at https://www.umass.edu/statdata/statdata/data/.

BMT: Bone Marrow Transplant Data (Klein and Moeschberger, 1997): contains data about 137 leukemia patients each with 10 covariates. The data concerns the survival time after the bone marrow transplant. It is publicly available in the R (R Development Core Team, 2006) package KMsurv.

5 RESULTS AND DISCUSSION

In this section we assess the performance of the two proposed outlier detection methods BHT and OSD and we compare their results with MART, DEV and LD. We start by presenting the configuration of our simulation study for outlier detection. Then we apply all methods to two real datasets, performing outlier detection. We further use the detected outliers to perform a robust Cox regression by removing them from the data; the coefficients and p-values of the regression will be compared.

5.1 SIM dataset

The outlier detection methods will be used on simulated datasets generated using the methodology described in Section 4.1. In order to test the outlier detection methods in a variety of conditions for the outlying models and for the general model, we fix the general trend model $\beta = (1, 1, 1)$ and use a set of configurations for the source of outlying observations. Each parameter for the outlier source is given by a three-dimensional normal distribution with a diagonal covariance matrix; the different values for the outlier generator model are presented in Table 1. Although the outlying values for the parameters may seem close to the general trend model, it is worth noting that the Cox model defines the hazards as an exponential function of $\beta X$; thus the ratio between the hazard of an outlying and a general trend observation is given by $\exp\{\beta' X - \beta X\}$. The reason behind the choice of this set of scenarios is to have a variety of combinations with different norms and contrasting parameters.

Table 1: The different outlier configurations used in the simulation data. The pure model is $\beta_G = (1,1,1)$.

#    Angle between $\beta'$ and $\beta_G$ (degrees)    $\|\beta'\| / \|\beta_G\|$    $\beta'$
1    180    1      (-1,-1,-1)
2    180    0.2    (-0.2,-0.2,-0.2)
3    180    5      (-5,-5,-5)
4    135    0.2    (-0.143,0,-0.283)
5    135    5      (-3.6,0,-7.07)
6    90     0.2    (-0.245,0,-0.245)
7    90     5      (6.12,0,-6.12)
8    0      0.2    (0.2,0.2,0.2)
9    0      5      (5,5,5)
10   180    10     (-10,-10,-10)
11   0      10     (10,10,10)
12   135    10     (-7.15,0,-14.15)

Table 2 reports the true positive rate (TPR) of the first 10 most outlying observations according to each method.

By inspecting Table 2 we see that DBHT attains the overall best performance, overcoming the other methods in 9 out of 12 of the scenarios. For scenarios 9 and 11 the MART residuals have shown the best performance.

Table 2: Average of top-k TPR grouped by outlier scenarios.

Scenario #   MART   DEV    LD     DFB    OSD    BHT    DBHT
1            0.29   0.36   0.43   0.36   0.47   0.43   0.47
2            0.22   0.25   0.31   0.29   0.32   0.31   0.34
3            0.50   0.58   0.59   0.52   0.63   0.59   0.65
4            0.22   0.23   0.30   0.28   0.30   0.29   0.32
5            0.44   0.54   0.52   0.48   0.58   0.53   0.58
6            0.21   0.22   0.28   0.26   0.27   0.26   0.28
7            0.40   0.50   0.40   0.41   0.44   0.37   0.42
8            0.18   0.18   0.23   0.22   0.22   0.20   0.23
9            0.32   0.36   0.18   0.25   0.09   0.06   0.07
10           0.53   0.63   0.64   0.57   0.68   0.60   0.70
11           0.38   0.46   0.24   0.32   0.14   0.11   0.12
12           0.49   0.60   0.54   0.51   0.60   0.52   0.60

Table 3 reports the AUC of the methods in each scenario, with the exception of OSD: as it does not output a score for every observation, the calculation of AUC is not applicable. By inspecting Table 3 we can again verify that DBHT outperforms the other methods in the same 9 scenarios. It is worth noticing that the performance margin of DBHT in terms of AUC is considerably larger than when analysing the top-10 TPR.

Table 3: Average of AUC grouped by outlier scenarios.

Scenario #   MART   DEV    LD     DFB    BHT    DBHT
1            0.70   0.70   0.74   0.68   0.78   0.82
2            0.65   0.65   0.70   0.64   0.71   0.75
3            0.80   0.80   0.78   0.77   0.86   0.90
4            0.64   0.64   0.69   0.63   0.71   0.73
5            0.78   0.77   0.74   0.75   0.82   0.84
6            0.63   0.63   0.67   0.63   0.68   0.71
7            0.76   0.76   0.66   0.73   0.70   0.72
8            0.62   0.62   0.66   0.62   0.65   0.68
9            0.74   0.72   0.61   0.69   0.60   0.60
10           0.83   0.83   0.80   0.81   0.87   0.92
11           0.78   0.76   0.61   0.73   0.59   0.61
12           0.80   0.80   0.74   0.78   0.81   0.86

5.2 WHAS Dataset

they consider as the most outlying; for example, all the methods identified observations 1 and 56 in their top-10 outliers. The estimates for the regression coefficients when fitting the Cox model to all observations are given in Table 5.

Table 5: Cox model fitted to the WHAS dataset.

Covariate   $\beta_i$   p-value
los         -0.022      0.3972
age          0.039      0.0025
gender       0.157      0.6066
bmi         -0.071      0.0497

We observe that only two covariates are statistically significant, corresponding to the age at the first heart attack (age) and the body mass index (bmi). After trimming the 10% most outlying observations according to each method (in Table 4), new models were obtained (Table 6 and Table 7). The goal is to unveil a trend model, unaffected by outlying observations. The results show that for BHT, and DBHT

Table 6: Cox model fit on the WHAS dataset with 10% outlier trimming using the proposed methods.

            OSD                  BHT                  DBHT
$X_i$       $\beta_i$   p        $\beta_i$   p        $\beta_i$   p
los         -0.025      0.374    -0.166      0.006    -0.129      0.017
age          0.068      0.000     0.0450     0.000     0.0590     0.000
gender       0.042      0.898     0.003      0.992     0.260      0.425
bmi         -0.133      0.002    -0.162      0.0012   -0.162      0.000

Table 7: Cox estimates removing the top 10% outlier observations in the WHAS dataset for methods MART, DEV and LD.

            MART                   DEV                    LD
            $\beta_i$   p-value    $\beta_i$   p-value    $\beta_i$   p-value
los         -0.016      0.498      -0.015      0.550      -0.016      0.506
age          0.045      0.001       0.032      0.012       0.069      0.000
gender      -0.082      0.800       0.155      0.653      -0.230      0.483
bmi         -0.082      0.029      -0.037      0.030      -0.146      0.001
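The trimming step behind Tables 6 and 7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `trim_most_outlying` and its parameters are hypothetical, each method is assumed to provide one outlyingness score per observation, and refitting the Cox model on the kept observations is left to whichever survival library is in use.

```python
def trim_most_outlying(scores, fraction=0.10, low_is_outlying=True):
    """Split observation indices into (kept, removed) by outlyingness.

    scores          : one outlyingness score per observation (e.g. p-values)
    fraction        : share of observations to trim (0.10 mirrors the 10% above)
    low_is_outlying : True when smaller scores mean more outlying (p-values),
                      False when larger scores mean more outlying
    """
    n = len(scores)
    n_trim = int(round(fraction * n))
    # Rank observations from most to least outlying.
    order = sorted(range(n), key=lambda i: scores[i], reverse=not low_is_outlying)
    removed = sorted(order[:n_trim])
    removed_set = set(removed)
    kept = [i for i in range(n) if i not in removed_set]
    return kept, removed
```

With p-value based scores the default `low_is_outlying=True` applies; for scores where larger values indicate stronger outlyingness the flag would be set to `False`. The returned `kept` indices select the rows on which the trimmed model is then refitted.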