Chemosensitivity Prediction of Tumours Based on Expression, miRNA, and Proteomics Data

Article · January 2012
DOI: 10.4018/ijsbbt.2012040101


I. Tsamardinos¹,², G. Borboudakis¹,², E. G. Christodoulou³, O. D. Røe⁴,⁵

¹ Bioinformatics Laboratory, ICS-FORTH, Heraklion, Crete, Greece
² Computer Science Department, University of Crete, Heraklion, Crete, Greece
³ Cellular Networks Group, BIOTEC TU Dresden, Dresden, Germany
⁴ Cancer Clinic, Levanger Hospital, Nord-Trøndelag Health Trust, Levanger, Norway
⁵ Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Trondheim, Norway

ABSTRACT
The chemosensitivity of tumours to specific drugs can be predicted based on molecular
quantities, such as gene expressions, miRNA expressions, and protein concentrations. This
finding is important for improving drug efficacy and personalizing drug use. In this paper, we
present an analysis strategy that, compared to prior work, retains more information in the data for
analysis and may lead to improved chemosensitivity prediction. We apply improved methods for
estimating the GI50 value of a drug (an indicator of the response to the drug), regression
methods for constructing predictive models of the GI50 value, advanced variable selection
techniques, such as MMPC, and a multi-task variable selection technique for identifying a small-
size signature that is simultaneously predictive for several drugs and cell lines. The methods are
applied on gene expression, miRNA expression, and proteomics data from 53 tumour cell lines
after treatment with 120 drugs, obtained from the National Cancer Institute databases. A
biological interpretation and discussion of the results is presented for the most clinically
important subset of 14 drugs.

Keywords: chemosensitivity prediction, variable selection, feature selection, regression,
classification, cell lines, microarray, mRNA, cancer, oncology

INTRODUCTION
Prior work shows that the sensitivity of a tumour to a drug can be predicted better than chance
based on the gene-expressions of the tumour (Potti, et al., 2006), (Augustine, et al., 2009). This
finding paves the way to personalized therapy models. In addition, identifying the molecular
quantities that are predictive may lead to a better understanding of the biological mechanisms a
drug employs to attack the tumour. In this paper, we develop an analysis strategy to produce
predictive models, estimate their performance, and identify the smallest, most-predictive set of
molecular quantities required for prediction. The strategy is first applied to and evaluated on the
prediction of the response to a set of 120 chemotherapeutic agents based on the responses
measured on 53 solid-tumour cell lines; the data have been obtained from the National Cancer
Institute databases and contain pre-treatment gene expression, miRNA expression, and protein
concentration profiles of the cell-lines. Subsequently, we focus our interest on a subset of 14
drugs that are the most interesting in clinical practice and provide a detailed presentation and
biological interpretation of the results. In addition, we apply a method for multi-task feature
selection, which selects molecular quantities that, combined, are simultaneously predictive for an
array of drugs. Such algorithms are important for selecting the optimal therapy, because they make it
possible to predict the response to several drugs at once by measuring only a small set of molecular
quantities.

Compared to prior works, our proposed strategy differs in several ways. The machine learning
and statistical analysis employed in the literature process the data in a way that reduces the
available information with potential detrimental effects both on the models' prediction
performance as well as the identification of the molecular signatures (Potti, et al., 2006),
(Augustine, et al., 2009), (Staunton, et al., 2001). First, the estimation of the response to a drug in
prior work may be sub-optimal (Potti, et al., 2006), (Augustine, et al., 2009). The response of a
tumour depends, of course, on the dosage. The National Cancer Institute has treated a panel of 60
cancer cell lines with several thousand drugs and has created a dosage-response profile for each
combination of drug and tumour. Often, this profile is summarized with a single value such as
the log10GI50. GI50 stands for “growth inhibition 50%”: the concentration of a given test
drug that causes 50% growth inhibition at 48 hours, corrected for the cell count at time zero.
In the majority of cases, NCI estimates the log10GI50 values by piece-wise linear interpolation, and
these estimates are then employed by all prior work (e.g., (Potti, et al., 2006), (Ma, et al., 2009),
(Staunton, et al., 2001)). In this paper, we show that estimating the log10GI50 values by fitting a sigmoid to the
dosage-response profile preserves more information about the effects of the drug, which leads to
statistically significantly improved predictive performance.

Second, prior work typically quantizes the log10GI50 values to create classes of tumours: (Potti,
et al., 2006) and (Augustine, et al., 2009) categorize tumours as sensitive and resistant, while
(Staunton, et al., 2001) and (Ma, et al., 2009) as sensitive, intermediate, and resistant. This type
of quantization allows the application of machine learning classification techniques, variable
selection methods for classification tasks, and statistical hypothesis testing techniques for
discrete outcomes. Our computational experiments, however, demonstrate that maintaining the
exact log10GI50 values and employing regression analysis instead of classification is often
preferable, as it improves chemosensitivity prediction in approximately half of the cases.

Third, prior work often employs simple methods for identifying molecular signatures, such as
selecting the top k genes that are most differentially expressed between different classes of
tumours. We show that more sophisticated methods such as the Max Min Parents and Children
(MMPC) algorithm for multivariate feature selection (Tsamardinos, Brown, & Aliferis, 2006)
often select more predictive signatures for the same parameter k and are preferable to apply.

Fourth, selecting minimal-size, most-predictive sets of variables for a drug is intended to increase
our understanding and intuition of the molecular mechanisms of the drugs. However, for clinical
use one should be able to predict the response to several drugs at the same time with as few
measurements as possible. Then, a selection of the optimal therapy can be planned based on
these predictions. Towards this goal, we apply a multi-task feature selection algorithm
(Argyriou, Evgeniou, & Pontil, 2008) that selects only a handful of molecular quantities (about
20) to reliably predict the response to 9 drugs of interest. The results indicate that multi-task
feature selection is a potentially useful technique that may further reduce the required number of
variables when prediction is required simultaneously for several drugs to optimize therapy.

The structure of the paper is as follows: we first present the data and the problem definition. In the
subsequent section we show that estimating the GI50 value by sigmoid fitting is preferable to the
standard NCI estimation using piecewise linear interpolation. The next section compares feature
selection methods. The following two sections discuss the results on the full panel of 120 drugs and a
subset of 14 drugs of particular clinical interest, respectively, including a biological
interpretation of results. The section after that presents the multi-task feature selection analysis. The final
section concludes the paper.

DATA AND PROBLEM DESCRIPTION


Data Description: Molecular profiles were obtained for the NCI-60 cell-line panel
(Developmental Therapeutics Program NCI/NIH) (these actually contain expressions only for 59
of the 60 cell lines) representing nine types of cancers: 5 Breast, 6 Central Nervous System
(CNS), 7 Colon, 6 Leukemia, 10 Melanoma, 9 Lung, 7 Ovarian, 2 Prostate, 7 Renal. All of these
types of cancers are solid tumours, except Leukemia. Leukemias differ in origin, biology, and
clinical response to drugs compared to solid tumours, and thus we conducted our computational
experiments on the subset of 53 solid-tumour cell lines.

The molecular profiles include gene expressions, miRNA expressions, and protein
concentrations. The gene expressions were measured on an Affymetrix U133plus2 array containing
54,675 probe sets that correspond to about 47,000 transcript variants, which in turn represent
more than 39,500 of the best characterized human genes. The miRNA expressions were
measured on a miRNA OSU V3 chip containing 627 probes. Finally, the proteomic data
consisted of 162 proteins measured in a protein lysate array. All data were downloaded from the
public NCI website (Developmental Therapeutics Program NCI/NIH). We denote with Xi the
vector of molecular quantities (variables) for cell-line i, with Xi,v the value of the molecular quantity v
on cell-line i, and with X = {Xi} the matrix of quantities. The raw gene expression data were
subjected to GCRMA normalization before analysis, as implemented in the BioConductor
platform (Bioconductor). The log2 values of the miRNA and the proteomic expressions are used.
The drug-response data for all 53 cell-lines were obtained from the CellMiner database
(Shankavaram, et al., 2009) for a panel of 120 drugs. The set of 120 drugs was selected as
follows: 118 of them are the ones denoted as fully characterized by the NCI. Our clinical expert
ODR also suggested another 14 drugs that are of clinical importance. The union of these two sets
is the set of 120 drugs included in the computational experiments. The drug responses for each
combination of drug and cell-line contain several pairs of ‹d,r›, where d is the log10 drug dosage
and r is the percentage of tissue that survived at 48 hours after treatment. We denote with Ri,j the
set of such pairs for cell-line i and drug j.

Problem Definition: The analysis task we address is to predict the response of a tissue to a drug
based on a vector X of predictive variables. Three sets of predictive variables are employed: gene
expressions, miRNA expressions, and protein concentrations. The response of cell-line i to a
drug j is often characterized with a single number that we denote with GI50i,j and corresponds to
log10GI50. GI50i,j is typically not available in the raw data, thus the value of GI50i,j is estimated
from the data in Ri,j. Learning predictive models for GI50i,j given a vector X is a regression task.
Additionally, we are interested in identifying minimal molecular signatures that are optimally
predictive of response and that could provide insight into the molecular mechanisms of the drug.

IMPROVING THE ESTIMATION OF GI50


The GI50i,j values in the publicly available NCI data are usually estimated as follows: the mean
response r(d) for each dosage d is calculated and a piece-wise linear function is interpolated
through these mean values. The estimated GI50 value is the concentration that corresponds to r =
50% on this function, denoted as GI50PLIi,j. According to the official NCI-60 site (NCI60
Methodology), this is the methodology followed for estimating 55% of the GI50i,j values. The
remaining 45% were either approximated (manually, we presume) or set to the highest
concentration tested.
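As an illustration of the piece-wise linear estimate described above, the following Python sketch averages replicate responses per dosage and interpolates the dosage at which the mean response crosses 50%. The function name and the choice to return None when no crossing exists are assumptions made for this example; they are not taken from the NCI methodology.

```python
def gi50_piecewise_linear(pairs, target=50.0):
    """Estimate log10(GI50) by piece-wise linear interpolation.

    pairs: iterable of (d, r) with d the log10 dosage and r the response in percent.
    Returns the log10 dosage at which the interpolated mean response first
    reaches `target`, or None when the target is never reached in the tested range.
    """
    # Average replicate responses measured at the same dosage.
    by_dose = {}
    for d, r in pairs:
        by_dose.setdefault(d, []).append(r)
    points = sorted((d, sum(rs) / len(rs)) for d, rs in by_dose.items())

    # Walk over consecutive segments and interpolate inside the first one
    # that brackets the target response.
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        if r0 == target:
            return d0
        if (r0 - target) * (r1 - target) < 0:
            return d0 + (target - r0) * (d1 - d0) / (r1 - r0)
    if points and points[-1][1] == target:
        return points[-1][0]
    return None  # hypothetical convention: the caller decides how to handle this case
```

A None result corresponds to the cases that, according to the text, were approximated or set to the highest concentration tested; how to treat them is left to the caller.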
 
 
 

Figure 1: The drug-response measurements for the cell line HS 578T (Breast Cancer) and the
drug Mechlorethamine. The response values Ri,j are shown, as well as the fitted sigmoid
curve (red color) and the respective piece-wise linear interpolation segments (green color).
 
We now present an estimation method that employs all available measurements in Ri,j. We
assume the dosage-response curve to have a sigmoid shape where at 0 dosage (i.e., its logarithm
approaches -∞) there is no reduction of the tumour (r = 100%) and at infinity the tumour size is
reduced to zero (r = -100%). The equation of a sigmoid that ranges asymptotically between α and
α+β and crosses the mid-range at γ is
$$ r = \alpha + \frac{\beta}{1 + e^{(d - \gamma)\,\delta}} \qquad (1) $$
where δ is a parameter controlling the slope of the function, r the response and d the dosage
(expressed by its logarithm). Considering that asymptotically (presumably for very high
concentrations of the drug) the tumour is completely eradicated and there is a 100% reduction in
its size, we set α = -100%. The parameter β was set to 200% so that the response ranges
between -100% and +100%. The remaining two parameters γ and δ were estimated using least-squares
numerical optimization. Specifically, we used the function nlinfit of Matlab with initial
values γ = -5 and δ = 1. This function iteratively updates the parameters γ and δ to reduce the
squared error until it converges to a good fit. In the cases
where the procedure did not converge with these initial values, we repeated it 100 times with
different initial values for the parameters γ and δ, uniformly sampled within [-15, 2] (the range of
all concentrations in the data). Out of these 100 repetitions, the parameter pair that led to the least
mean squared error (MSE) was selected. The estimated GI50i,j values are found by setting r =
50% and solving Eq. 1 for d. In certain cases, fitting a sigmoid leads to extreme values. To
detect such outliers we applied the Matlab function deleteoutliers, which iteratively applies the
Grubbs test, testing one value at a time (Grubbs, 1969). Any outliers found are trimmed
to ± 2·σj, where σj is the standard deviation of all currently fitted values for drug j. We denote the
final estimates as GI50Sigi,j. Figure 1 shows a graphical depiction of Ri,j for the CCRF-CEM
(Leukemia) cell-line and Carmustine (BCNU) with the fitted sigmoid superimposed. The
corresponding piece-wise linear interpolation segments are also shown in the figure.
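The sigmoid-based estimate can be sketched in Python as follows. The sketch assumes that Eq. 1 has the form r = α + β / (1 + e^((d − γ)·δ)) with α = -100 and β = 200, and it replaces the nlinfit optimization with a crude grid search over candidate (γ, δ) pairs. The grid search and all function names are stand-ins chosen for this illustration and are not the procedure used in the paper.

```python
import math

ALPHA = -100.0  # lower asymptote in percent, as fixed in the text
BETA = 200.0    # range, so that the curve spans -100% .. +100%


def sigmoid_response(d, gamma, delta):
    """Response at log10 dosage d under the assumed form of Eq. 1."""
    x = (d - gamma) * delta
    if x > 700.0:  # avoid overflow in math.exp; the curve is at its lower asymptote
        return ALPHA
    return ALPHA + BETA / (1.0 + math.exp(x))


def gi50_from_sigmoid(gamma, delta, target=50.0):
    """Solve the assumed Eq. 1 for d at r = target (ALPHA < target < ALPHA + BETA)."""
    ratio = BETA / (target - ALPHA) - 1.0  # equals exp((d - gamma) * delta)
    return gamma + math.log(ratio) / delta


def sum_squared_error(pairs, gamma, delta):
    """Squared error of the curve against the measured (d, r) pairs."""
    return sum((r - sigmoid_response(d, gamma, delta)) ** 2 for d, r in pairs)


def fit_sigmoid_grid(pairs, gammas, deltas):
    """Stand-in for nonlinear least squares: best (gamma, delta) on a grid."""
    candidates = ((g, dl) for g in gammas for dl in deltas)
    return min(candidates, key=lambda p: sum_squared_error(pairs, p[0], p[1]))
```

With α = -100 and β = 200, setting r = 50 gives e^((d − γ)·δ) = 1/3, so the estimate is d = γ − ln(3)/δ under the assumed form of the equation.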

We now show that this method of estimation leads to improvements in chemosensitivity
prediction. The analysis includes the following steps:

Feature Selection: The most commonly applied feature (variable) selection method in the field
of personalized medicine is to rank the genes according to their association with the class
(equivalently, by the corresponding p-value) and select the top k. We call this method univariate
filtering. In our work we additionally employed the Max Min Parents and Children (MMPC) algorithm
(Tsamardinos, Brown, & Aliferis, 2006) to select a minimal-size, optimally predictive set of
probe-sets. MMPC is an algorithm that seeks to identify the neighbors of the variable-to-predict
in the Bayesian Network capturing the data distribution by taking into account multivariate
associations. It has been shown to be very effective in recent extensive experiments (Aliferis,
Statnikov, Tsamardinos, Mani, & Koutsoukos, 2010) against an array of state-of-the-art feature
selection methods. In this work, the Causal Explorer implementation of MMPC was used
(Statnikov, Tsamardinos, Brown, & Aliferis, 2009) with the default values for the parameters.

Regression: We employed SVM Regression to construct the predictive models (Boser, Guyon,
& Vapnik, 1992), as implemented in the package libSVM, version 3.01 (Chang & Lin, 2001). In
our experiments we used the Radial Basis kernel with all other parameters set to their defaults.

Estimation of Performance: We used a leave-one-out cross validation protocol due to the small
number of available samples. For each training set, the combination of MMPC and SVM
regression produced a predictive model that was applied on the hold-out test sample. This avoids
overfitting due to the multiple testing problem (Hastie, Tibshirani, & Friedman, 2009).

Figure 2: RSig and RPLI are the R2 achieved on GI50’s estimated by a sigmoid fit and by
standard piece-wise linear interpolation respectively. The differences RSig - RPLI are shown for
gene expressions, miRNA expressions, and protein concentrations. The median differences are 5.02%, 8.13%,
4.09%, favoring estimation of GI50 using a sigmoid fit.

Metric of performance: The metric to measure prediction performance is the leave-one-out
cross-validated R2 (coefficient of determination), which is a conservative metric (Steel & Torrie,
1960). Specifically, for a given drug j, let µ\i be the mean value of GI50 in the data excluding cell
line i (i.e., only on training data), m\i the GI50 predicted by the model constructed excluding
cell-line i, and GI50i the GI50 as estimated by the experiments in the corresponding cell line i for
drug j. We define:

$$ R^2_j = 1 - \frac{\sum_i \left( GI50_i - m_{\setminus i} \right)^2}{\sum_i \left( GI50_i - \mu_{\setminus i} \right)^2} $$
The interpretation of R2 is that it corresponds to the fraction of variance (uncertainty) explained by the
model; alternatively, it is the relative reduction of variance achieved by the predictive model vs.
predicting using the mean (estimated on the training data only).
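The leave-one-out R2 defined above can be written as a short Python function. This is a minimal sketch: the function name is hypothetical, and the predictions are assumed to have been produced beforehand by models that never saw the corresponding held-out cell line.

```python
def loo_r2(observed, predicted):
    """Leave-one-out cross-validated R^2 for one drug.

    observed[i]  : GI50 value of cell line i
    predicted[i] : prediction for cell line i from a model trained without i
    The baseline for cell line i is the mean of all observed values except i.
    """
    n = len(observed)
    total = sum(observed)
    ss_model = 0.0
    ss_baseline = 0.0
    for y, m in zip(observed, predicted):
        mean_without_i = (total - y) / (n - 1)  # training-only mean
        ss_model += (y - m) ** 2
        ss_baseline += (y - mean_without_i) ** 2
    return 1.0 - ss_model / ss_baseline
```

A model that only reproduces the training mean scores exactly 0 under this definition, and a model that predicts worse than the training mean scores below 0.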

We have computed R2j for all 120 drugs both when the GI50 values are estimated using piece-wise
linear interpolation and when fitting a sigmoid function, as described above. We
denote the corresponding values as RPLIj and RSigj. The results are shown in Figure 2.
The figure shows that GI50 values estimated by the sigmoid are better predicted using the
protocol described above. Thus, at least for the combination of MMPC and SVM Regression,
GI50Sig values facilitate the induction of predictive models vs. using the GI50PLI. The median
differences (RSig - RPLI) are 5.02%, 8.13%, and 4.09%. The p-values for the null hypothesis that the
median of (RSig - RPLI) is zero, as estimated by a Wilcoxon signed-rank test, are 0.0546, 0.0060,
and 0.0169, when employing gene expressions, miRNA expressions, and protein concentrations
as predictive variables respectively; that is, the improvement is significant at the 0.05 level for
the miRNA and protein data and marginally above that level for gene expressions. Of course, one
could argue that the results may not transfer
to other feature selection or regression methods. The results, however, corroborate our intuition
that the sigmoid estimation better preserves information in the Ri,j measurements and, given no
evidence to the contrary, we would suggest this method of estimation in future analyses and
employ it for the rest of the paper.

COMPARISON OF FEATURE SELECTION TECHNIQUES


We compared the prediction performance of the models using MMPC and univariate filtering as
feature selection methodology. We computed the cross-validated R2 for both methods on all
drugs using the same protocol as before. The k parameter is set to the number of genes returned
by MMPC, so that both methods return signatures of the same size. This of course is unfair
to MMPC because the algorithm needs to discover k on its own, while univariate filtering is
provided with a good estimate of k. Figure 3 presents a histogram of the results. The median
differences RMMPC - Runi, where RMMPC and Runi are the R2 obtained using MMPC and
univariate filtering respectively, in the different datasets (gene expressions, miRNA expressions,
proteins) are respectively: 6.53%, -1.2%, -0.61%. The Wilcoxon signed-rank test returns the
following p-values: 10^-5, 0.25, 0.32. Thus, MMPC is statistically significantly better in the gene
expression dataset, while the performance of the two algorithms is not statistically
distinguishable in the latter two datasets (at the significance level of 0.05). Note, however, that
MMPC automatically selects the number k of variables to retain; the same k
was passed as extra information to univariate filtering. If one does not know k, this parameter
must be optimized somehow, e.g., by employing nested cross-validation procedures, which are
computationally more expensive. Due to the improved performance on at least one dataset, and
the automatic selection of k, we would suggest the use of MMPC instead of the simpler
univariate filtering methods and employ MMPC for the rest of the paper.
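For reference, univariate filtering with a given k can be sketched in Python as below. The sketch assumes the absolute Pearson correlation with the GI50 as the association measure (for a fixed sample size, ranking by it is equivalent to ranking by the p-value of the correlation test); the paper does not state which association statistic was used, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def univariate_filter(X, y, k):
    """Indices of the k variables with the largest |Pearson r| with y.

    X is a list of samples (rows), each a list of variable values (columns).
    """
    scores = []
    for v in range(len(X[0])):
        column = [row[v] for row in X]
        scores.append((abs(pearson(column, y)), v))
    # Strongest association first; ties are broken by the smaller column index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [v for _, v in scores[:k]]
```

Unlike MMPC, this procedure scores each variable in isolation, so k has to be supplied by the caller.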

COMPARISON OF REGRESSION VERSUS CLASSIFICATION


In all related prior work, to the best of our knowledge, classification and not regression models
have been constructed for predicting GI50 values (Potti, et al., 2006), (Staunton, et al., 2001),
(Ma, et al., 2009). Given that the latter values are continuous, the authors have quantized them
before applying any classifiers, as described in the previous sections. We now show that
quantization is sometimes detrimental to performance and regression techniques have greater
predictive power.

In the next set of computational experiments we pre-process the GI50 values of each drug to
discretize them as described in (Ma, et al., 2009). Specifically, the class Ci,j of a cell-line i and
drug j is set to sensitive, intermediate, or resistant if GI50i,j falls within (-∞, µj - 0.5σj], (µj
- 0.5σj, µj + 0.5σj], or (µj + 0.5σj, ∞) respectively, where µj is the average GI50 value over all
cell lines for drug j and σj the standard deviation.
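The discretization rule can be illustrated with a few lines of Python. The sketch assumes the sample standard deviation (statistics.stdev); the text does not say whether the sample or the population estimate was used, and the function name is hypothetical.

```python
import statistics


def discretize_gi50(values):
    """Label each GI50 value of one drug as sensitive, intermediate, or resistant."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # assumption: sample standard deviation
    low, high = mu - 0.5 * sigma, mu + 0.5 * sigma
    labels = []
    for v in values:
        if v <= low:
            labels.append("sensitive")      # (-inf, mu - 0.5 sigma]
        elif v <= high:
            labels.append("intermediate")   # (mu - 0.5 sigma, mu + 0.5 sigma]
        else:
            labels.append("resistant")      # (mu + 0.5 sigma, inf)
    return labels
```

Lower GI50 values mean that less drug is needed for 50% growth inhibition, which is why the lowest interval is the sensitive class.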
 

Figure 3: Performance differences when employing MMPC vs. univariate filtering for
feature selection. MMPC is statistically significantly better in the gene expression dataset.
The differences in the other datasets are not significant. Univariate filtering requires
optimization of the number of variables k to retain, while MMPC does not. For these reasons
MMPC is preferable on this task.

To evaluate classification, we employed the same overall protocol described in Section
“IMPROVING THE ESTIMATION OF GI50” with the following minimal modifications: we
used multi-class SVM classification instead of SVM Regression. SVMs have been very popular
and successful classifiers, particularly in bioinformatics (Statnikov, Aliferis, Tsamardinos,
Hardin, & Levy, 2005). As for regression, we used the libSVM implementation of SVMs with
the Radial Basis kernel and all other parameters set to default. The metric of performance for
classification is accuracy, i.e., the percentage of samples whose class is correctly predicted
(instead of the metric R2 used for regression). We denote with Aj the leave-one-out cross-
validated accuracy of the method on drug j.

Comparing regression vs. classification is not straightforward given that regression outputs a
continuous prediction for GI50i,j while classification outputs its class. To overcome this issue we
discretize the output of the regression models to the three stated classes using the same intervals
as above. This allows us to compute the cross-validated accuracy of the regression for each drug
j, denoted as Dj. In other words, Aj’s are computed by first discretizing the data, then using
classification, and measuring the accuracy of the output, while Dj’s are computed by using
regression, then discretizing the predictions, and computing accuracy.
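The computation of Dj can be sketched as follows. The boundaries low and high stand for µj - 0.5σj and µj + 0.5σj and are assumed to be computed beforehand; the function names are hypothetical and the sketch is only meant to make the order of operations explicit.

```python
from bisect import bisect_left

CLASSES = ("sensitive", "intermediate", "resistant")


def to_class(value, low, high):
    """Map a GI50 value to its class using the intervals (-inf, low], (low, high], (high, inf)."""
    return CLASSES[bisect_left([low, high], value)]


def accuracy(true_classes, predicted_classes):
    """Fraction of samples whose class is predicted correctly."""
    hits = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return hits / len(true_classes)


def regression_accuracy(true_gi50, predicted_gi50, low, high):
    """D_j: discretize the true values and the regression predictions, then score accuracy."""
    truth = [to_class(v, low, high) for v in true_gi50]
    preds = [to_class(v, low, high) for v in predicted_gi50]
    return accuracy(truth, preds)
```

Aj is obtained with the same accuracy function, applied to the classes output directly by a classifier trained on the discretized values.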

Figure 4 shows the histograms of their difference Dj – Aj for the gene expression, miRNA
expression, and proteomics data over the full set of 120 drugs and the 14 clinically important
ones. The legends of the figures also show the means of these differences. In some cases
regression accuracy scores higher than classification accuracy and in other cases the reverse
happens. Table 1 shows the minima and maxima of their difference Dj – Aj over the full or the
restricted set of drugs for each dataset.

Table 1: Minimum and maximum difference Dj – Aj with the corresponding drug names and NSC
ids, on all datasets and for all drugs and the selected drugs.

| Dataset           | Min (all drugs)    | Max (all drugs)       | Min (selected drugs) | Max (selected drugs) |
|-------------------|--------------------|-----------------------|----------------------|----------------------|
| Gene expressions  | -35.8              | 24.5                  | -20.7                | 24.5                 |
| Drug name         | Inosine dialdehyde | Oxanthrazale          | Doxorubicin          | Etoposide            |
| NSC id            | 118994             | 349174                | 123127               | 141540               |
| miRNA expressions | -41.5              | 26.4                  | -20.7                | 24.5                 |
| Drug name         | Pyrazoloimidazole  | Dichloroallyl lawsone | Paclitaxel           | Camptothecin         |
| NSC id            | 51143              | 126771                | 125973               | 94600                |
| Proteins          | -43.4              | 43.4                  | -18.9                | 30.2                 |
| Drug name         | Hydroxyurea        | 5-HP                  | Vinblastine-sulfate  | Camptothecin         |
| NSC id            | 32065              | 107392                | 49842                | 94600                |

These results suggest that in general one should also try regression methods and not only
classification. In terms of mean differences, classification accuracy A is on average higher than
regression accuracy D on the full set of 120 drugs for all three datasets. However, when we focus
on the 14 drugs of high interest, the situation is reversed: D is on average higher than A in two
out of three datasets. Finally, we note that the way this comparison is performed slightly favors
classification. This is because regression methods try to make predictions at a finer level, i.e.,
predict the exact value of the GI50. On the other hand, classification methods make cruder
predictions of the general levels sensitive, intermediate, resistant. When we discretize the output
of the regression methods, they lose this advantage and are compared against classification on
this cruder and less granular scale. Considering all the above, we decide to employ regression
methods for the rest of the paper.
 

Figure 4: Differences in accuracies D – A between the accuracy achieved by discretized
regression outputs (D) and the accuracy achieved by classification methods (A).
Classification accuracy A is on average higher than regression accuracy D on the full set of
120 drugs for all three datasets. However, when we focus on the 14 drugs of high interest,
the situation is reversed for two out of the three datasets.

A FULL ANALYSIS OF THE COMPLETE SET OF 120 DRUGS
Based on the previous sections we estimate the GI50’s using the sigmoid fitting, employ SVM
regression methods, the MMPC for feature selection, leave-one-out cross-validation for
estimation of performance, and the cross-validated R2 metric (coefficient of determination). We
stress that feature selection is also cross-validated, i.e., for each sample that is left out,
feature selection is performed on the remaining samples. This avoids overfitting (see Hastie,
Tibshirani, & Friedman, 2009, section 7.10.2) which is particularly a problem in this task where
the total number of variables greatly exceeds the number of samples (55465 variables vs. 53
samples). The final set of selected variables is produced by applying the feature selection method
on all the available data. Thus, the estimation of performance is produced by cross-validating the
complete method of selecting variables and producing a model.
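The protocol of repeating feature selection inside every leave-one-out fold can be sketched in Python as below. The two callables select_features and fit_model are placeholders for any selection method (such as MMPC) and any learner (such as SVM regression); their names and signatures are assumptions of this sketch, not an existing API.

```python
def project(row, columns):
    """Keep only the selected columns of one sample."""
    return [row[v] for v in columns]


def loo_predictions(X, y, select_features, fit_model):
    """Leave-one-out predictions with feature selection repeated in every fold.

    X is a list of samples (lists of variable values), y a list of GI50 values.
    select_features(X_train, y_train) -> list of column indices
    fit_model(X_train, y_train)       -> callable mapping a sample to a prediction
    """
    predictions = []
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        # Selection sees only the training samples, never the held-out one.
        selected = select_features(X_train, y_train)
        model = fit_model([project(row, selected) for row in X_train], y_train)
        predictions.append(model(project(X[i], selected)))
    return predictions
```

Selecting the variables once on all samples and cross-validating only the model would let information from the held-out sample leak into the signature, which is the source of optimism the protocol is designed to avoid.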

The above combination of methods and experimentation protocols was run on four datasets: gene
expressions, miRNA expressions, protein concentrations, and the combined dataset containing all
molecular quantities. To facilitate the interpretation of the results, we classify drugs according to
how well they are predicted. More specifically, for the Pearson correlation r between two
quantities, Cohen gives the following interpretation guidelines (Wikipedia - Effect Size): small
effect size, r = 0.1 - 0.23; medium, r = 0.24 - 0.36; large, r = 0.37 or larger. Interpreting R2 as
r2 and translating the values, we get approximately the intervals [0.01, 0.05), [0.05, 0.13), [0.13,
1]. The term “effect size” was coined for causal effects; in our case, a correlation does not
necessarily correspond to a causal effect, so the correct interpretation of “effect size” is the
predictability of the GI50 given the molecular quantities. Under this interpretation several drugs
have a large effect size, while others have a negative effect size, meaning that our prediction
does not improve on the prediction by the mean value. Summary results are presented in
Table 2.
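The translation of Cohen's thresholds and the resulting categorization can be checked with a short Python sketch. Squaring r = 0.1, 0.24, and 0.37 gives 0.01, 0.0576, and 0.1369, which the text rounds to 0.01, 0.05, and 0.13. The function names are hypothetical, and treating every R2 between 0 and 0.05 as a small effect follows the column headers of the summary tables rather than the interval [0.01, 0.05).

```python
def squared_thresholds(r_small=0.1, r_medium=0.24, r_large=0.37):
    """Square Cohen's correlation thresholds to obtain thresholds on R^2."""
    return r_small ** 2, r_medium ** 2, r_large ** 2


def effect_category(r2, medium=0.05, large=0.13):
    """Categorize a cross-validated R^2 using the rounded thresholds from the text."""
    if r2 < 0:
        return "not predicted"
    if r2 < medium:
        return "small"
    if r2 < large:
        return "medium"
    return "large"
```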
Table 2: Predictive performance over all 120 drugs and datasets. SVM regression, MMPC for
feature selection, sigmoid fitting for estimation of GI50, and leave-one-out cross validation are
employed. The response to several drugs can be predicted using molecular quantities such as
gene expressions, miRNA expressions, and protein concentrations.

| Dataset           | Not predicted (R2 < 0) | Small Effect (R2: 0 – 0.05) | Medium Effect (R2: 0.05 – 0.13) | Large Effect (R2 > 0.13) |
|-------------------|------------------------|-----------------------------|---------------------------------|--------------------------|
| Gene expressions  | 95                     | 12                          | 11                              | 6                        |
| Proteins          | 92                     | 13                          | 11                              | 4                        |
| miRNA expressions | 82                     | 10                          | 15                              | 13                       |
| Combined          | 95                     | 4                           | 15                              | 6                        |
 
FOCUS ON THE MOST CLINICALLY RELEVANT SET OF DRUGS
To discover new biological knowledge we focus on a set of 14 drugs that our clinical expert
ODR suggested as the most important for clinical practice. We note that the list was pre-selected
before the beginning of the analysis. Their names and NSC ids are shown in Table 3.
Table 3: A selected set of clinically interesting drugs on which to focus biological interpretation
of results.
Drug Name NSC Id
Methotrexate 740
Fluorouracil (5-FU) 19893
Mitomycin 26980
Vinblastine-sulfate 49842
Vincristine-sulfate 67574
Lomustine (CCNU) 79037
Camptothecin 94600
Cisplatin 119875
Doxorubicin 123127
Taxol (Paclitaxel) 125973
Etoposide 141540
Tamoxifen 180973
Carboplatin 241240
Gemcitabine 613327

In Table 4 we summarize the predictive performance for the 14 drugs using different datasets.
Notice that the addition of variables in the “Combined” dataset sometimes leads to worse
performance. The combined variable set contains more information, but the additional
dimensions of the problem may confuse the learning methods and reduce performance (“curse of
dimensionality”). This phenomenon is particularly acute when the sample size is small, as in this
task.
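Table 4: Categorization of predictive performance over the 14 clinically important drugs. GI50
was estimated with sigmoid fitting. SVM regression and MMPC for feature selection were
employed.

| Dataset                         | Not predicted (R2 < 0) | Small Effect (R2: 0 – 0.05) | Medium Effect (R2: 0.05 – 0.13) | Large Effect (R2 > 0.13) |
|---------------------------------|------------------------|-----------------------------|---------------------------------|--------------------------|
| Gene expressions                | 10                     | 1                           | 1                               | 2                        |
| Proteins                        | 9                      | 2                           | 2                               | 1                        |
| miRNA expressions               | 8                      | 1                           | 4                               | 1                        |
| Combined                        | 9                      | 0                           | 1                               | 4                        |
| Best achieved over all datasets | 5                      | 1                           | 1                               | 7                        |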
Table 5: The list of the selected drugs whose chemosensitivity can be reliably predicted (R2 > 0).
The dataset where the best performance is achieved is shown.

| Drug Name         | NSC ID | R2     | Dataset         |
|-------------------|--------|--------|-----------------|
| Camptothecin      | 94600  | 0.3764 | Gene Expression |
| Tamoxifen citrate | 180973 | 0.3190 | Combined        |
| Paclitaxel        | 125973 | 0.2145 | miRNA           |
| Doxorubicin       | 123127 | 0.2023 | Protein         |
| Carboplatin       | 241240 | 0.1829 | Gene Expression |
| Mitomycin         | 26980  | 0.1539 | Combined        |
| Gemcitabine       | 613327 | 0.1419 | Combined        |
| Lomustine         | 79037  | 0.1176 | Protein         |

Table 6: The selected molecular quantities for each drug on the dataset that achieves the best
predictability performance. The linear Pearson correlation of each quantity with the GI50 is also
shown.
Drug Name                   Dataset           Probe-set ID    Gene/microRNA symbol   Correlation with GI50
Camptothecin                Gene Expression   1563210_at      Unknown                -0.4940
                                              208425_s_at     DKFZP564D166           -0.6105
                                              221013_s_at     APOL2                  -0.5317
                                              226015_at       ZNF127                 -0.4921
                                              229284_at       MAT2B                   0.3820
                                              229986_at       LOC377064               0.4943
                                              230254_at       Unknown                -0.4414
                                              230410_at       NRP2                   -0.5528
                                              239664_at       C3orf17                 0.4253
Tamoxifen citrate           Combined          1569188_s_at    RPL10                  -0.6849
                                              200915_x_at     KTN1 / PDIA6            0.5577
                                              201840_at       NEDD8                   0.5215
                                              204798_at       MYB                    -0.6387
                                              223783_s_at     GEMIN4                 -0.5226
                                              225371_at       GLE1L                  -0.5203
                                              227481_at       CNKSR3                  0.6458
                                              (Protein)       PTPN11                 -0.5291
Paclitaxel                  miRNA             (miRNA)         hsa-mir-106a            0.6028
                                              (miRNA)         mir_95 left             0.4070
Doxorubicin                 Protein           (Protein)       BCAR1                  -0.4486
                                              (Protein)       CASP7                  -0.3588
                                              (Protein)       CCNA2                   0.3334
Carboplatin                 Gene Expression   1557805_at      C9orf77                 0.4797
                                              1558504_at      LOC440721               0.5643
                                              1565741_at      Unknown                -0.4302
                                              202049_s_at     ZNF262                 -0.4595
                                              213572_s_at     SERPINB1                0.5220
                                              225627_s_at     CACHD1                 -0.4540
                                              226825_s_at     TPARL                   0.4335
                                              228726_at       SERPINB1                0.6081
                                              244877_at       Unknown                 0.5260
Mitomycin                   Combined          1560854_s_at    ZNF588                 -0.5395
                                              1562303_at      ZNF306                  0.4147
                                              202031_s_at     WIPI2                  -0.4397
                                              209450_at       OSGEP                   0.4967
                                              224374_s_at     EMILIN2                 0.4866
                                              237104_at       CTSS                   -0.4330
                                              239637_at       Unknown                -0.6447
Gemcitabine hydrochloride   Combined          211358_s_at     CIZ1                   -0.4616
                                              212873_at       HMHA1                   0.5025
                                              222878_s_at     OTUB2                   0.4714
                                              227575_s_at     C14orf102               0.4422
                                              229935_s_at     MLL                     0.8181
                                              232922_s_at     C20orf59                0.4062
Lomustine                   Protein           (Protein)       GSK3B                  -0.3439
                                              (Protein)       HRAS                    0.4972
                                              (Protein)       MGMT                    0.4455
                                              (Protein)       PTPN11                 -0.4927
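The correlation column of Table 6 is a univariate Pearson correlation between one molecular quantity and the GI50 across samples. A minimal sketch of that computation follows; the function names and the toy measurements are hypothetical and are not taken from the NCI-60 data. The sign convention in `interpret` follows the text: a larger GI50 means a more resistant tumour.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def interpret(r):
    """Translate the sign of the correlation with GI50 into words."""
    if r > 0:
        return "higher level -> higher GI50 -> more resistant"
    if r < 0:
        return "higher level -> lower GI50 -> more sensitive"
    return "no linear association"

# Hypothetical toy values for one quantity and the GI50 over five cell lines.
expression = [1.0, 2.0, 3.0, 4.0, 5.0]
gi50 = [2.1, 2.9, 4.2, 4.8, 6.0]

r = pearson(expression, gi50)
print(round(r, 4), interpret(r))
```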
 
We now focus on the 8 drugs with medium and large effects. Their names and the dataset on which
the best performance is achieved are shown in Table 5. The selected variables are shown in Table 6.
GI50 stands for “growth inhibition 50%”: the concentration of a given test drug that causes 50%
growth inhibition, corrected for the cell count at time zero. The (univariate) linear Pearson
correlations of each quantity with the GI50 are also shown in Table 6. A positive correlation in
this context means that the larger the value of the mRNA/miRNA/protein, as measured by the
arrays, the larger the GI50 for the drug/sample combination, and thus the more resistant the
tumour is to the drug. A negative correlation, on the other hand, implies that the higher the value
of the mRNA/miRNA/protein, the smaller the GI50 and thus the more sensitive the tumour is to the
drug. A brief biological interpretation of the results now follows.

Camptothecins are central in the treatment of colorectal cancer. They target the cleavable complex
between topoisomerase I and DNA, inducing irreversible double-strand breaks. None of
the genes identified in this signature were previously linked to this drug.
Tamoxifen is an anti-estrogen used for the prevention of breast cancer recurrence in estrogen and
progesterone hormone receptor positive breast cancer. It is taken for five years after the
initial curative treatment. In this signature, NEDD8 gene overexpression confers resistance. The
NEDD8 pathway was proposed to provide a mechanism by which breast cancer cells acquire
anti-estrogen resistance while retaining expression of estrogen receptor alpha (Fan, Bigsby, &
Nephew, 2003).
MicroRNAs are short (18-24 nt) non-coding RNAs that are involved in post-transcriptional
regulation of gene expression in multicellular organisms by affecting both the stability and
translation of mRNAs. Each microRNA can theoretically control hundreds of genes, and there is
an inverse relation between a microRNA and its mRNA targets: a high microRNA level reduces the
level of its target mRNA, and vice versa. Paclitaxel is an antitubulin agent whose main mechanism
is stabilizing the dynamics of the microtubule system in the interphase and M-phase, inducing DNA
damage, chromosomal imbalance and subsequently apoptosis. For this drug, two miRNAs were highly
predictive, and overexpression of either conferred resistance. Overexpression of mir-106a has been
connected to increased proliferation in breast cancer, one of the main tumours where paclitaxel is
used, through down-regulation of ZBTB4 (Kim, Chadalapaka, Lee, Yamada, & Sastre-Garau, et al.,
2011). We also found that ZBTB4 has a 99% probability of being a target of mir-106a using
appropriate software (MicroRNA Target Prediction) (Saito & Sætrom, 2010). mir-95 was
recently shown to be overexpressed in 50% of colorectal cancers and to have oncogenic properties
through down-regulation of SNX1 (Huang, Huang, Wang, Liang, & Ni, et al., 2011).

Figure 5: The signature of Lomustine entered in KEGG (Kyoto Encyclopedia of Genes and
Genomes; Kanehisa, Araki, Goto, Hattori, & Hirakawa, et al., 2008), showing HRAS
overexpression (red) in the oncogenic pathways of brain cancer, here implying resistance
against Lomustine.

Doxorubicin is an anthracycline, a topoisomerase II poison that stabilizes the cleavable
complexes of DNA, inducing double-strand breaks, and forms covalent adducts, inducing DNA
damage. Here, high levels of CCNA2 protein and low levels of BCAR1 and CASP7 conferred
resistance. CCNA2 is critical for initiation of DNA replication, transcription and cell cycle
regulation, and its manipulation changes doxorubicin sensitivity. BCAR1 (breast cancer
anti-estrogen resistance 1) has been shown to confer resistance to tamoxifen and is a prognostic
factor in breast cancer (Dorssers, Grebenchtchikov, Brinkman, Look, & van Broekhoven, et al.,
2004). CASP7 is an important factor in drug-induced apoptosis.
DNA is the main cytotoxic target of Cisplatin and Carboplatin, which induce single and
double-strand DNA breaks through adducts and cross-linking, leading to cell death through
apoptosis. For Carboplatin, six genes (seven transcripts) were predictive, and overexpression of
only two genes predicted resistance. One of these, SERPINB1 or PAI-1, is the first predictive and
prognostic biomarker evaluated in a phase III study in breast cancer, where high expression
predicted low chemotherapy effect.
The novel antimetabolite gemcitabine targets RRM1 (ribonucleotide reductase subunit M1). The
signature includes only one gene that has been related to drug resistance, MLL. High MLL
was detected in osteosarcoma cell lines resistant to methotrexate, a drug with a mechanism
similar to that of gemcitabine (Hattinger, Stoico, Michelacci, Pasello, & Scionti, et al., 2009).
Lomustine, a chloroethylating chemotherapeutic, is active against tumours in the central nervous
system. Here we found that MGMT protein overexpression predicts resistance. MGMT is a
well-known DNA repair gene that is predictive for Lomustine as part of the PCV regimen for
aggressive brain cancer. MGMT down-regulation due to methylation is common in brain
tumours and strongly affects the response to treatment (Herrlinger, Rieger, Koch, Loeser, &
Blaschke, et al., 2006). Moreover, HRAS overexpression was predictive for Lomustine
resistance. Overexpression of HRAS is a key feature of aggressive brain cancer (Kanehisa,
Araki, Goto, Hattori, & Hirakawa, et al., 2008) and is predictive for survival (Serao, Delfino,
Southey, Beever, & Rodriguez-Zas, 2011), but its predictive value related to Lomustine has not
been established (see Figure 5). For both genes, expression of the respective protein has not
previously been established as a predictive marker; thus this is a novel finding. Moreover, for
GSK3B, part of the AKT/GSK3β/cyclin D1 pathway, where down-regulation confers radio-resistance
(Shimura, 2011), down-regulation here correlated with Lomustine resistance.
Several of these relations have not been explored previously and deserve evaluation in the wet
lab and the clinic.

FURTHER REDUCING THE SELECTED VARIABLES BY EMPLOYING MULTI-TASK VARIABLE SELECTION
Anticancer drugs such as chemotherapeutics, anti-hormones and targeted drugs are typically
effective only for subsets of patients, and thus a large patient population is treated
unnecessarily, only to experience the side effects; unnecessary treatment also incurs a high cost
to the patient and society at large. In cancer treatment one can often choose between two or more
compounds, and knowing which will work for each specific patient can save lives and prevent
unnecessary suffering.
The variable selection methods applied so far (MMPC and univariate filtering) select a set of
predictive variables for each drug. In order to apply these methods for personalized treatment
and selection of the optimal chemotherapeutic agent for a given patient, one would have to
measure the predictive variables for all the drugs, which potentially increases cost. One way to
further reduce the number of variables is to select a single set of variables that is
simultaneously predictive for all drugs.
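The cost argument can be made concrete by counting the measurements needed when each drug has its own signature. The sketch below uses the Tamoxifen citrate, Doxorubicin and Lomustine signatures from Table 6; the helper name is hypothetical, and treating PTPN11 in the Combined dataset as the same protein measurement as PTPN11 in the Protein dataset is an assumption.

```python
# Per-drug signatures copied from Table 6 (probe-set IDs or protein names).
signatures = {
    "Tamoxifen citrate": {
        "1569188_s_at", "200915_x_at", "201840_at", "204798_at",
        "223783_s_at", "225371_at", "227481_at", "PTPN11 (protein)",
    },
    "Doxorubicin": {"BCAR1 (protein)", "CASP7 (protein)", "CCNA2 (protein)"},
    "Lomustine": {
        "GSK3B (protein)", "HRAS (protein)", "MGMT (protein)", "PTPN11 (protein)",
    },
}

def variables_to_measure(signatures, drugs):
    """Union of the per-drug signatures: everything that must be measured
    to obtain a prediction for every drug in `drugs`."""
    needed = set()
    for drug in drugs:
        needed |= signatures[drug]
    return needed

panel = variables_to_measure(signatures, signatures.keys())
total_signature_size = sum(len(s) for s in signatures.values())
print(len(panel), total_signature_size)  # prints: 14 15
```

Only one variable is shared among these three signatures, so the measurement panel grows almost linearly with the number of drugs; this is the growth a common signature is meant to avoid.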

We approach this problem with a Multi-Task Variable Selection (MTVS) method; in our case, a task
is defined as the prediction of the response to a specific drug. Multi-task methods try to solve a
prediction problem for several tasks at once, in our case the selection of variables and the
prediction of the response to a panel of drugs. Such multi-task methods have two aims: (a) to
learn a small signature set that is common to many drugs, and (b) to improve learning performance
by exploiting the similarities among different predictive tasks. For example, a multi-task method
may determine that a gene that seems only marginally predictive when examined for a single drug
is important when examined for several drugs, because these drugs may share a similar chemical
composition or affect the same pathway.

Figure 6: Number of drugs with R2 > 0 as a function of the number of variables k selected
by the MTVS algorithm.

We chose to use the method introduced in (Argyriou, Evgeniou, & Pontil, 2008). We slightly
optimized the code (available online) for our specific problem. In particular, (a) since all drugs
(tasks) share the same data, we kept only one copy of the data for all tasks, thereby reducing the
total memory (and time) used by the procedure (by a factor linearly proportional to the total
number of drugs), and (b) we removed redundant computations, since we only want to select
variables and not learn new features. The number of selected variables is controlled by a
parameter k.
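The idea of selecting k variables jointly for all tasks can be illustrated with a much simpler stand-in than the method of Argyriou et al.: score each variable by the sum over drugs of its squared Pearson correlation with the response and keep the top k. This is not the algorithm used in the paper; the scoring rule, the function names and the toy data are all assumptions made for illustration. The single data matrix `variables` is shared by all tasks, mirroring optimization (a) above.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_variables(variables, tasks):
    """Rank variable indices by the sum over tasks of squared correlation."""
    scores = [sum(pearson(v, y) ** 2 for y in tasks) for v in variables]
    return sorted(range(len(variables)), key=lambda j: scores[j], reverse=True)

# Hypothetical toy data: 8 samples, mutually orthogonal +/-1 patterns.
shared    = [1, 1, 1, 1, -1, -1, -1, -1]   # variable 0: weak effect on every drug
specific  = [1, 1, -1, -1, 1, 1, -1, -1]   # variable 1: strong effect on drug 0 only
unrelated = [1, -1, 1, -1, 1, -1, 1, -1]   # variable 2: no effect on any drug
hidden_1  = [1, 1, -1, -1, -1, -1, 1, 1]   # unmeasured factor driving drug 1
hidden_2  = [1, -1, 1, -1, -1, 1, -1, 1]   # unmeasured factor driving drug 2

def response(weak, strong):
    """Toy GI50-like response: weak shared effect plus a stronger specific one."""
    return [w + 1.5 * s for w, s in zip(weak, strong)]

variables = [shared, specific, unrelated]   # one copy of the data for all tasks
tasks = [
    response(shared, specific),
    response(shared, hidden_1),
    response(shared, hidden_2),
]

k = 1
single_task_choice = rank_variables(variables, tasks[:1])[:k]
multi_task_choice = rank_variables(variables, tasks)[:k]
print(single_task_choice, multi_task_choice)  # prints: [1] [0]
```

In this toy setting, variable 0 looks only marginally predictive for drug 0 alone, but it is chosen once all three drugs are scored together, which is the behaviour described in aim (b).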

We ran MTVS on the complete set of drugs, in order to take advantage of possible drug
similarities, and show results only for the 14 clinically important drugs. Table 7 summarizes the
results and compares them with those achieved by MMPC. For k = 50, both methods have similar
predictive performance (MTVS obtains a positive R2 for more drugs, but the values are smaller in
magnitude), while MTVS selects fewer than half as many variables. The difference is
much larger if we consider the whole set of drugs (MMPC selects over 700 variables in total).
Table 7: Classification of predictive performance over the 14 clinically important drugs. SVM
regression, sigmoid fitting for estimation of GI50, and leave-one-out cross validation are
employed for both variable selection methods.
Method   # All selected var.   Not predicted   Small effect    Medium effect      Large effect
                               R2 < 0          R2: 0 – 0.05    R2: 0.05 – 0.13    R2 > 0.13
MMPC     105                   9               0               1                  4
MTVS     50                    4               6               1                  3
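The "Not predicted" column (R2 < 0) is possible because R2 is computed on held-out predictions. The sketch below shows leave-one-out cross validation with a one-variable least-squares line standing in for SVM regression; it assumes R2 is the usual coefficient of determination, 1 - SSE/SST, evaluated on the left-out samples. The function names and toy data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def loocv_r2(x, y):
    """R2 of leave-one-out predictions: each sample is predicted by a model
    trained on all the other samples."""
    predictions = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(slope * x[i] + intercept)
    my = sum(y) / len(y)
    sse = sum((obs - pred) ** 2 for obs, pred in zip(y, predictions))
    sst = sum((obs - my) ** 2 for obs in y)
    return 1.0 - sse / sst

# Hypothetical toy data: one informative and one uninformative response.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
informative = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]
uninformative = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(loocv_r2(x, informative) > 0.9)   # held-out predictions track the response
print(loocv_r2(x, uninformative) < 0)   # worse than predicting the overall mean
```

A negative value therefore means that the cross-validated model predicts the left-out samples worse than the overall mean of the response would.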

Figure 6 shows the number of drugs with a positive R2 (i.e., that can be predicted) as a function
of the number k of variables selected by MTVS. Using only 23 variables, MTVS can predict a
total of 9 drugs with a positive R2. Increasing the number of variables to 50 allows one more
drug to be predicted. For this specific set of 9 drugs, MMPC requires measuring a total of 71
variables. This is not a fair comparison, because the number of required variables is compared on
the drugs that MTVS turned out, a posteriori, to predict best. However, it indicates that
multi-task feature selection is a potentially useful technique that may further reduce the
required number of variables, compared to standard variable selection methods, when prediction
is required simultaneously for several drugs.

CONCLUSION
Predicting the chemosensitivity of tumours from gene expressions is important for selecting
treatment, understanding the molecular mechanisms of drug response, and selecting molecular
signatures. In this paper, we show that predictive performance can sometimes be improved by
employing a new method for estimating the GI50 (an indication of the response to a drug),
regression algorithms instead of classification, and state-of-the-art multivariate feature
selection. In addition, we show that multi-task feature selection methods can find common
signatures for several drugs that are smaller than those found with single-task variable selection
methods. The signatures identified here have several known links to cancer progression and
resistance to chemotherapy. Knowledge of these relations is still expanding, and the methods
used to identify those signatures may be an important tool for generating novel biological
hypotheses.

Acknowledgements
EC was supported for this research by the ContraCancrum EU FP7 STREP GA 223979 and the
Institute of Computer Science of the Foundation for Research and Technology, Hellas. We
would like to thank Prof. Sætrom of the Norwegian University of Science and Technology
(NTNU), Trondheim, Norway for microRNA-gene target analysis. We would like to thank Amos
Folarin for introducing us to this problem. We would like to thank Matina Fragoyanni for her
code on downloading drug response data. Thanks to Sofia Triantafillou, Vincenzo Lagani and
Angelos Armen for their feedback and fruitful comments.

REFERENCES
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010). Local causal and
Markov blanket induction for causal discovery and feature selection for classification part I:
Algorithms and empirical evaluation. Journal of Machine Learning Research, Special Topic on
Causality 11, 171-234.

Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex Multi-Task Feature Learning. Machine Learning,
Special Issue on Inductive Transfer Learning, 73(3), 243-272.

Augustine, C. K., Yoo, A., Potti, J. S., Yoshimoto, Y., Zipfel, P. A., Friedman, H. S., et al. (2009). Genomic
and molecular profiling predicts response to temozolomide in melanoma. Clinical Cancer Res
15(2), 502.

Bioconductor. (n.d.). Retrieved from http://www.bioconductor.org/

Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Fifth
Annual Workshop on Computational Learning Theory, 144-152.

Chang, C. C., & Lin, C. J. (2001). LIBSVM: a library for support vector machines.

Developmental Therapeutics Program NCI/NIH. (n.d.). Retrieved from http://dtp.nci.nih.gov/index.html

Dorssers, L. C., Grebenchtchikov, N., Brinkman, A., Look, M. P., van Broekhoven, S. P., et al. (2004). The
prognostic value of BCAR1 in patients with primary breast cancer. Clin Cancer Res 10,
6194-6202.

Fan, M., Bigsby, R. M., & Nephew, K. P. (2003). The NEDD8 pathway is required for proteasome-mediated
degradation of human estrogen receptor (ER)-alpha and essential for the antiproliferative
activity of ICI 182,780 in ERalpha-positive breast cancer cells. Mol Endocrinol 17, 356-365.

Grubbs, F. (1969). Procedures for Detecting Outlying Observations in Samples. Technometrics 11(1),
1-21.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Second Edition. Springer Series in Statistics.

Hattinger, C. M., Stoico, G., Michelacci, F., Pasello, M., Scionti, I., et al. (2009). Mechanisms of gene
amplification and evidence of coamplification in drug-resistant human osteosarcoma cell lines.
Genes Chromosomes Cancer 48, 289-309.

Herrlinger, U., Rieger, J., Koch, D., Loeser, S., Blaschke, B., et al. (2006). Phase II trial of lomustine plus
temozolomide chemotherapy in addition to radiotherapy in newly diagnosed glioblastoma:
UKT-03. J Clin Oncol 24, 4412-4417.

Huang, Z., Huang, S., Wang, Q., Liang, L., Ni, S., et al. (2011). MicroRNA-95 promotes cell proliferation
and targets sorting Nexin 1 in human colorectal carcinoma. Cancer Res 71, 2582-2589.

Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., et al. (2008). KEGG for linking genomes to life
and the environment. Nucleic Acids Res 36, D480-484.

Kim, K., Chadalapaka, G., Lee, S. O., Yamada, D., Sastre-Garau, X., et al. (2011). Identification of oncogenic
microRNA-17-92/ZBTB4/specificity protein axis in breast cancer. Oncogene.

Ma, Y., Ding, Z., Qian, Y., Wan, Y. W., Tosun, K., Shi, X., et al. (2009). An integrative genomic and
proteomic approach to chemosensitivity prediction. Int. J. Oncol. 34(1), 107-115.

NCI60 Methodology. (n.d.). Retrieved from
http://dtp.nci.nih.gov/docs/compare/compare_methodology.html

NTNU - Faculty of Medicine, MicroRNA Target Prediction. (n.d.). Retrieved from
tare.medisin.ntnu.no/mirna_target/

Potti, H. K., Dressman, A., Bild, A., Riedel, R. F., Chan, G., Sayer, R., et al. (2006). Genomic signatures to
guide the use of chemotherapeutics. Nature Medicine 12, 1294-1300.

Saito, T., & Sætrom, P. (2010). A two-step site and mRNA-level model for predicting microRNA targets.
BMC Bioinformatics, 612.

Serao, N. V., Delfino, K. R., Southey, B. R., Beever, J. E., & Rodriguez-Zas, S. L. (2011). Cell cycle and aging,
morphogenesis, and response to stimuli genes are individualized biomarkers of glioblastoma
progression and survival. BMC Med Genomics 4, 49.

Shankavaram, U. T., Varma, S., Kane, D., Sunshine, M., Chary, K. K., Reinhold, W. C., et al. (2009).
Cellminer: a relational database and query tool for the nci-60 cancer cell lines. BMC Genomics
10, 277.

Shimura, T. (2011). Acquired radioresistance of cancer and the AKT/GSK3beta/cyclin D1 overexpression
cycle. J Radiat Res (Tokyo) 52, 539-544.

Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of
multicategory classification methods for microarray gene expression cancer diagnosis.
Bioinformatics 21(5), 631-645.

Statnikov, A., Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2009). Causal explorer: A matlab library of
algorithms for causal discovery and variable selection for classification. Challenges in Causality 1.

Staunton, J. E., Slonim, D. K., Coller, H. A., Tamayo, P., Angelo Michael, J., Park, J., et al. (2001).
Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. 98(19),
10787-10792.

Steel, R. G., & Torrie, J. H. (1960). Principles and Procedures of Statistics. New York: McGraw-Hill.

Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network
structure learning algorithm. Journal of Machine Learning 65, 31-78.

Wikipedia - Effect Size. (n.d.). Retrieved from en.wikipedia.org/wiki/Effect_size
