Chemosensitivity Prediction of Tumours Based On Expression, miRNA, and Proteomics Data

Article · January 2012

DOI: 10.4018/ijsbbt.2012040101
I. Tsamardinos (1, 2), G. Borboudakis (1, 2), E. G. Christodoulou (3), O. D. Røe (4, 5)

(1) Bioinformatics Laboratory, ICS-FORTH, Heraklion, Crete, Greece
(2) Computer Science Department, University of Crete, Heraklion, Crete, Greece
(3) Cellular Networks Group, BIOTEC TU Dresden, Dresden, Germany
(4) Cancer Clinic, Levanger Hospital, Nord-Trøndelag Health Trust, Levanger, Norway
(5) Department of Cancer Research and Molecular Medicine,
Norwegian University of Science and Technology (NTNU), Trondheim, Norway
ABSTRACT
The chemosensitivity of tumours to specific drugs can be predicted based on molecular
quantities, such as gene expressions, miRNA expressions, and protein concentrations. This
finding is important for improving drug efficacy and personalizing drug use. In this paper, we
present an analysis strategy that, compared to prior work, retains more information in the data for
analysis and may lead to improved chemosensitivity prediction. We apply improved methods for
estimating the GI50 value of a drug (an indicator of the response to the drug), regression
methods for constructing predictive models of the GI50 value, advanced variable selection
techniques, such as MMPC, and a multi-task variable selection technique for identifying a small-
size signature that is simultaneously predictive for several drugs and cell lines. The methods are
applied on gene expression, miRNA expression, and proteomics data from 53 tumour cell lines
after treatment with 120 drugs, obtained from the National Cancer Institute databases. A
biological interpretation and discussion of the results is presented for the most clinically
important subset of 14 drugs.
Keywords: chemosensitivity prediction, variable selection, feature selection, regression,
classification, cell lines, microarray, mRNA, cancer, oncology
INTRODUCTION
Prior work shows that the sensitivity of a tumour to a drug can be predicted better than chance
based on the gene-expressions of the tumour (Potti, et al., 2006), (Augustine, et al., 2009). This
finding paves the way to personalized therapy models. In addition, identifying the molecular
quantities that are predictive may lead to a better understanding of the biological mechanisms a
drug employs to attack the tumour. In this paper, we develop an analysis strategy to produce
predictive models, estimate their performance, and identify the smallest, most-predictive set of
molecular quantities required for prediction. The strategy is first applied to and evaluated on the
prediction of the response to a set of 120 chemotherapeutic agents based on the responses
measured on 53 solid-tumour cell lines; the data have been obtained from the National Cancer
Institute databases and contain pre-treatment gene expression, miRNA expression, and protein
concentration profiles of the cell-lines. Subsequently, we focus our interest on a subset of 14
drugs that are the most interesting in clinical practice and provide a detailed presentation and
biological interpretation of the results. In addition, we apply a method for multi-task feature
selection, which selects molecular quantities that, combined, are simultaneously predictive for an
array of drugs. Such algorithms are important for selecting the optimal therapy, because they can
predict the response to several drugs at once by measuring only a small set of molecular
quantities.
Compared to prior works, our proposed strategy differs in several ways. The machine learning
and statistical analysis employed in the literature process the data in a way that reduces the
available information with potential detrimental effects both on the models' prediction
performance as well as the identification of the molecular signatures (Potti, et al., 2006),
(Augustine, et al., 2009), (Staunton, et al., 2001). First, the estimation of the response to a drug in
prior work may be suboptimal (Potti, et al., 2006), (Augustine, et al., 2009). The response of a
tumour depends, of course, on the dosage. The National Cancer Institute has treated a panel of 60
cancer cell lines with several thousand drugs and has created a dosage-response profile for each
combination of drug and tumour. Often, this profile is summarized with a single value such as
the log10GI50. GI50 stands for “growth inhibition 50%”: the concentration of a given test
drug that causes 50% growth inhibition at 48 hours, corrected for the cell count at time zero.
In the majority of cases, NCI estimates the log10GI50 values by piece-wise linear interpolation, and
these estimates are then employed by all prior work (e.g., (Potti, et al., 2006), (Ma, et al., 2009),
(Staunton, et al., 2001)). In this paper, we show that estimating the log10GI50 values by fitting a
sigmoid to the dosage-response profile preserves more information about the effects of the drug,
which leads to improved predictive performance (statistically significant at the 0.05 level for
miRNA expressions and protein concentrations).
Second, prior work typically quantizes the log10GI50 values to create classes of tumours: (Potti,
et al., 2006) and (Augustine, et al., 2009) categorize tumours as sensitive and resistant, while
(Staunton, et al., 2001) and (Ma, et al., 2009) as sensitive, intermediate, and resistant. This type
of quantization allows the application of machine learning classification techniques, variable
selection methods for classification tasks, and statistical hypothesis testing techniques for
discrete outcomes. Our computational experiments, however, demonstrate that maintaining the
exact log10GI50 values and employing regression analysis instead of classification is often
preferable, as it improves chemosensitivity prediction in approximately half of the cases.
Third, prior work often employs simple methods for identifying molecular signatures, such as
selecting the top k genes that are most differentially expressed between different classes of
tumours. We show that more sophisticated methods, such as the Max Min Parents and Children
(MMPC) algorithm for multivariate feature selection (Tsamardinos, Brown, & Aliferis, 2006),
often select more predictive signatures for the same parameter k and are preferable to apply.
Fourth, selecting minimal-size, most-predictive sets of variables for a drug is intended to increase
our understanding and intuition of the molecular mechanisms of the drugs. However, for clinical
use one should be able to predict the response to several drugs at the same time with as few
measurements as possible. Then, a selection of the optimal therapy can be planned based on
these predictions. Towards this goal, we apply a multi-task feature selection algorithm
(Argyriou, Evgeniou, & Pontil, 2008) that selects only a handful of molecular quantities (about
20) to reliably predict the response to 9 drugs of interest. The results indicate that multi-task
feature selection is a potentially useful technique that may further reduce the required number of
variables when prediction is required simultaneously for several drugs to optimize therapy.
The structure of the paper is as follows: we first present the data and the problem definition. In the
subsequent section we show that estimating the GI50 value by sigmoid fitting is preferable to the
standard NCI estimation using piecewise linear interpolation. The next two sections compare feature
selection methods and regression versus classification, respectively. The following two sections
discuss the results on the full panel of 120 drugs and on a subset of 14 drugs of particular clinical
interest, respectively, including a biological interpretation of results. The next section presents
the multi-task feature selection analysis. The final section concludes the paper.
DATA AND PROBLEM DESCRIPTION

Data Description: Molecular profiles were obtained for the NCI-60 cell-line panel
(Developmental Therapeutics Program NCI/NIH) (these actually contain expressions only for 59
of the 60 cell lines) representing nine types of cancers: 5 Breast, 6 Central Nervous System
(CNS), 7 Colon, 6 Leukemia, 10 Melanoma, 9 Lung, 7 Ovarian, 2 Prostate, 7 Renal. All of these
types of cancers are solid tumours, except Leukemia. Leukemias differ in origin, biology and
clinical response to drugs from solid tumours, and thus we conducted our computational
experiments on the subset of 53 solid-tumour cell lines.
The molecular profiles include gene expressions, miRNA expressions, and protein
concentrations. The gene expressions were measured on the Affymetrix U133 Plus 2.0 array containing
54,675 probe sets that correspond to about 47,000 transcript variants which in turn represent
more than 39,500 of the best characterized human genes. The miRNA expressions were
measured on a miRNA OSU V3 chip containing 627 probes. Finally, the proteomic data
consisted of 162 proteins measured in a protein lysate array. All data were downloaded from the
public NCI website (Developmental Therapeutics Program NCI/NIH). We denote with Xi the
vector of molecular quantities (variables) for cell-line i, Xi,v the value of the molecular quantity v
on cell-line i, and with X = {Xi} the matrix of quantities. The gene expression raw data have been
subjected to GCRMA normalization before analysis as implemented in the BioConductor
platform (Bioconductor). The log2 values of the miRNA and the proteomic expressions are used.
The drug-response data for all 53 cell-lines were obtained from the CellMiner database
(Shankavaram, et al., 2009) for a panel of 120 drugs. The set of 120 drugs was selected as
follows: 118 of them are the ones denoted as fully characterized by the NCI. Our clinical expert
ODR also suggested 14 drugs that are of clinical importance, some of which are already among the
118. The union of these two sets is the set of 120 drugs included in the computational experiments.
The drug responses for each
combination of drug and cell-line contain several pairs of ‹d,r›, where d is the log10 drug dosage
and r is the percentage of tissue that survived at 48 hours after treatment. We denote with Ri,j the
set of such pairs for cell-line i and drug j.
Problem Definition: The analysis task we address is to predict the response to a drug of a tissue
based on a vector X of predictive variables. Three sets of predictive variables are employed: gene
expressions, miRNA expressions, and protein concentrations. The response of cell-line i to a
drug j is often characterized with a single number that we denote with GI50i,j and corresponds to
log10GI50. GI50i,j is typically not available in the raw data, thus the value of GI50i,j is estimated
from the data in Ri,j. Learning predictive models for GI50i,j given a vector X is a regression task.
Additionally, we are interested in identifying minimal molecular signatures that are optimally
predictive of response and that could provide insight into the molecular mechanisms of the drug.
IMPROVING THE ESTIMATION OF GI50

The GI50i,j values in the publicly available NCI data are usually estimated as follows: the mean
response r(d) for each dosage d is calculated and a piece-wise linear function is interpolated
through these mean values. The estimated GI50 value is the concentration that corresponds to r =
50% on this function, denoted as GI50PLIi,j. According to the official NCI-60 site (NCI60
Methodology), this is the methodology followed for estimating 55% of the GI50i,j values. The
remaining 45% were either approximated (manually, we presume) or chosen to be the highest
concentration tested.
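
The piece-wise linear estimate described above can be sketched as follows. This is a minimal illustration and not the NCI implementation: the function name gi50_pli and the choice to return None when the 50% level is never crossed are assumptions made here for the example.

```python
from collections import defaultdict


def gi50_pli(pairs, target=50.0):
    """Estimate log10 GI50 by piece-wise linear interpolation.

    pairs: iterable of (d, r), with d the log10 dosage and r the response in percent.
    Returns the log10 dosage at which the interpolated mean response first
    crosses `target`, or None if no segment brackets the target (the fallback
    for that case is left to the caller).
    """
    by_dose = defaultdict(list)
    for d, r in pairs:
        by_dose[d].append(r)
    # Mean response at each tested dosage, ordered by increasing dosage.
    points = sorted((d, sum(rs) / len(rs)) for d, rs in by_dose.items())
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        if (r0 - target) * (r1 - target) <= 0 and r0 != r1:
            # Linear interpolation inside the segment that brackets the target.
            return d0 + (target - r0) * (d1 - d0) / (r1 - r0)
    return None
```

With replicate measurements at a dosage, only their mean enters the estimate, which is one way to see why this summary discards part of the information in Ri,j.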

Figure 1: The drug-response measurements for the cell line HS 578T (Breast Cancer) and the
drug Mechlorethamine. The response values Ri,j are shown, as well as the fitted sigmoid
curve (red color) and the respective piece-wise linear interpolation segments (green color).

We now present an estimation method that employs all available measurements in Ri,j. We
assume the dosage-response curve to have a sigmoid shape where at 0 dosage (i.e., its logarithm
approaches -∞) there is no reduction of the tumour (r = 100%) and at infinity the tumour size is
reduced to zero (r = -100%). The equation of a sigmoid that ranges asymptotically between α and
α+β and crosses the mid-range at γ is
$$ r = \alpha + \frac{\beta}{1 + e^{(d - \gamma)\,\delta}} \qquad (1) $$
where δ is a parameter controlling the slope of the function, r the response and d the dosage
(expressed by its logarithm). Considering that asymptotically (presumably for very high
concentrations of the drug) the tumour is completely eradicated and there is a 100% reduction in
its size, we set α = -100%. The parameter β was set to 200% so that the range of the response is
between -100% and +100%. The remaining two parameters γ and δ were estimated using least-squares
numerical optimization. Specifically, we used the function nlinfit of Matlab with initial
values γ = -5 and δ = 1. This function performs a number of steps in the steepest-descent
direction for the parameters γ and δ in order to converge to a good-fitting value. In the cases
where the procedure would not converge with these initial values, we repeated it 100 times with
different initial values for the parameters γ and δ, uniformly sampled within [-15, 2] (the range of
all concentrations in the data). Out of these 100 repetitions, the parameter pair that led to the least
mean squared error (MSE) was selected. The estimated GI50i,j values are found by setting r =
50% and solving Eq. 1 for d. In certain cases, fitting a sigmoid leads to extreme values. In order to
detect the outliers we applied the Matlab function deleteoutliers, which iteratively applies the
Grubbs test, testing one value at a time (Grubbs, 1969). If outliers are found, they are trimmed
to ± 2· σj, where σj is the standard deviation of all currently fitted values for drug j. We denote the
final estimates as GI50Sigi,j. Figure 1 shows a graphical depiction of Ri,j for the CCRF-CEM
(Leukemia) cell-line and Carmustine (BCNU) with the fitted sigmoid superimposed. The
corresponding piece-wise linear interpolation segments are also shown in the figure.
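
As an illustration of the sigmoid-based estimate, the sketch below fits γ and δ by least squares and then inverts the curve at r = 50%. It assumes that Eq. 1 has the form r = α + β / (1 + exp((d − γ)·δ)) with α = -100 and β = 200, and it replaces Matlab's nlinfit with a simple repeated grid refinement, so it is a stand-in for the procedure in the text and not a reproduction of it. The function names and the search range for δ are hypothetical, and the outlier trimming step is omitted.

```python
import math

ALPHA, BETA = -100.0, 200.0  # fixed lower asymptote and range (percent), as in the text


def sigmoid(d, gamma, delta):
    """Response (percent) at log10 dosage d; the exponent form is an assumption."""
    z = max(-700.0, min(700.0, (d - gamma) * delta))  # guard against overflow in exp
    return ALPHA + BETA / (1.0 + math.exp(z))


def mse(pairs, gamma, delta):
    """Mean squared error of the curve over all (d, r) measurements."""
    return sum((r - sigmoid(d, gamma, delta)) ** 2 for d, r in pairs) / len(pairs)


def fit_sigmoid(pairs, gamma_range=(-15.0, 2.0), delta_range=(0.05, 10.0),
                steps=40, rounds=6):
    """Least-squares fit of (gamma, delta) by repeated grid refinement."""
    g_lo, g_hi = gamma_range
    s_lo, s_hi = delta_range
    best = None
    for _ in range(rounds):
        g_step = (g_hi - g_lo) / steps
        s_step = (s_hi - s_lo) / steps
        grid = [(g_lo + i * g_step, s_lo + j * s_step)
                for i in range(steps + 1) for j in range(steps + 1)]
        best = min(grid, key=lambda p: mse(pairs, p[0], p[1]))
        # Zoom in on a window of two grid steps around the current best point.
        g_lo, g_hi = best[0] - 2 * g_step, best[0] + 2 * g_step
        s_lo, s_hi = max(best[1] - 2 * s_step, 1e-6), best[1] + 2 * s_step
    return best


def gi50_from_fit(gamma, delta, target=50.0):
    """Solve the assumed Eq. 1 for d at r = target."""
    return gamma + math.log(BETA / (target - ALPHA) - 1.0) / delta
```

Under these assumptions, setting r = 50% gives d = γ − ln(3)/δ, so the estimate lies slightly below the mid-range point γ whenever δ is positive.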
We now show that this method of estimation leads to improvements in chemosensitivity
prediction. The analysis includes the following steps:
Feature Selection: The most commonly applied feature (variable) selection method in the field
of personalized medicine is to rank the genes according to their association with the class
(equivalently, the p-value) and select the top k. We call this method univariate filtering. In our
work we additionally employed the Max Min Parents and Children (MMPC) algorithm
(Tsamardinos, Brown, & Aliferis, 2006) to select a minimal-size, optimally predictive set of
probe-sets. MMPC is an algorithm that seeks to identify the neighbors of the variable-to-predict
in the Bayesian Network capturing the data distribution by taking into account multivariate
associations. It has been shown to be very effective in recent extensive experiments (Aliferis,
Statnikov, Tsamardinos, Mani, & Koutsoukos, 2010) against an array of state-of-the-art feature
selection methods. In this work, the Causal Explorer implementation of MMPC was used
(Statnikov, Tsamardinos, Brown, & Aliferis, 2009) with the default values for the parameters.
Regression: We employed SVM Regression to construct the predictive models (Boser, Guyon,
& Vapnik, 1992), as implemented in the package libSVM, version 3.01 (Chang & Lin, 2001). In
our experiments we used the Radial Basis kernel and all other parameters set to default.
Estimation of Performance: We used a leave-one-out cross validation protocol due to the small
number of available samples. For each training set, the combination of MMPC and SVM
regression produced a predictive model that was applied on the hold-out test sample. This avoids
overfitting due to the multiple testing problem (Hastie, Tibshirani, & Friedman, 2009).
Figure 2: RSig and RPLI are the R2 achieved on GI50’s estimated by a sigmoid fit and by
standard piece-wise linear interpolation, respectively. The differences RSig - RPLI are shown for
gene expressions, miRNA expressions, and protein concentrations. The median differences are
5.02%, 8.13%, and 4.09%, respectively, favoring estimation of GI50 using a sigmoid fit.

Metric of performance: The metric to measure prediction performance is the leave-one-out
cross-validated R2 (coefficient of determination), which is a conservative metric (Steel & Torrie,
1960). Specifically, for a given drug j, let µ\i be the mean value of GI50 in the data excluding cell
line i (i.e., only on training data), m\i the GI50 predicted by the model constructed excluding
cell-line i, and GI50i the GI50 as estimated by the experiments in the corresponding cell line i for
drug j. We define:
$$ R^2_j = 1 - \frac{\sum_i \left( GI50_i - m_{\setminus i} \right)^2}{\sum_i \left( GI50_i - \mu_{\setminus i} \right)^2} $$
The interpretation of R2 is that it corresponds to the fraction of the variance (uncertainty) explained
by the model; equivalently, it is the relative reduction of variance by the use of the predictive model vs.
predicting using the mean (estimated on the training data only).
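
The metric can be computed directly from the observed values and the held-out predictions. The sketch below is one straightforward reading of the definition above; the function name loo_r2 is introduced here for illustration only.

```python
def loo_r2(y, predictions):
    """Leave-one-out cross-validated R^2 for one drug.

    y[i]: observed GI50 of cell line i.
    predictions[i]: GI50 predicted for cell line i by a model trained without i.
    The baseline for cell line i is the mean of y over the other cell lines,
    so the baseline never sees the held-out value.
    """
    n = len(y)
    total = sum(y)
    ss_model = sum((yi - mi) ** 2 for yi, mi in zip(y, predictions))
    ss_mean = sum((yi - (total - yi) / (n - 1)) ** 2 for yi in y)
    return 1.0 - ss_model / ss_mean
```

A model that always predicts the training-set mean scores exactly zero, and a model that predicts worse than that mean scores below zero, which is how negative R2 values arise in the results.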
We have computed R2j for all 120 drugs both when the GI50 values are estimated using piece-wise
linear interpolation and when fitting a sigmoid function, as described above. We
denote the corresponding values as RPLIj and RSigj. The results are shown in Figure 2.
The figure shows that GI50 values estimated by the sigmoid are better predicted using the
protocol described above. Thus, at least for the combination of MMPC and SVM Regression,
GI50Sig values facilitate the induction of predictive models vs. using the GI50PLI. The median
differences (RSig - RPLI) are 5.02%, 8.13%, and 4.09%. The p-values for the null hypothesis that the
median of (RSig - RPLI) is zero, as estimated by a Wilcoxon signed-rank test, are 0.0546, 0.0060,
and 0.0169 when employing gene expressions, miRNA expressions, and protein concentrations
as predictive variables, respectively; the difference for gene expressions thus falls marginally
short of significance at the 0.05 level. Of course, one could argue that the results may not transfer
to other feature selection or regression methods. The results, however, corroborate our intuition
that the sigmoid estimation better preserves information in the Ri,j measurements and, given no
evidence to the contrary, we would suggest this method of estimation in future analyses and
employ it for the rest of the paper.
COMPARISON OF FEATURE SELECTION TECHNIQUES

We compared the prediction performance of the models using MMPC and univariate filtering as
the feature selection method. We computed the cross-validated R2 for both methods on all
drugs using the same protocol as before. The k parameter of univariate filtering is set to the number
of variables returned by MMPC, so that both methods return signatures of the same size. This, of
course, is unfair to MMPC because the algorithm needs to discover k on its own, while univariate
filtering is provided with a good estimate of k. Figure 3 presents a histogram of the results. The
median differences RMMPC - Runi, where RMMPC and Runi are the R2 obtained using MMPC and
univariate filtering respectively, in the different datasets (gene expressions, miRNA expressions,
proteins) are 6.53%, -1.2%, and -0.61%, respectively. The Wilcoxon signed-rank test returns the
p-values 10^-5, 0.25, and 0.32. Thus, MMPC is statistically significantly better in the gene
expression dataset, while the performance of the two algorithms is not statistically
distinguishable in the latter two datasets (at the significance level of 0.05). If one does not know
k, this parameter should be optimized somehow, e.g., by employing nested cross-validation
procedures, which are computationally more expensive. Due to the improved performance on at
least one dataset, and the automatic selection of k, we would suggest the use of MMPC instead of
the simpler univariate filtering methods and employ MMPC for the rest of the paper.
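
For reference, univariate filtering can be sketched in a few lines. The text does not state which association measure was used, so the absolute Pearson correlation with the GI50 is an assumption here (for a fixed sample size, ranking by it is equivalent to ranking by the p-value of the corresponding test); the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def univariate_filter(X, y, k):
    """Indices of the k columns of X with the largest absolute correlation with y.

    X: list of rows (one per cell line), y: list of GI50 values.
    """
    n_vars = len(X[0])
    scores = [(abs(pearson([row[v] for row in X], y)), v) for v in range(n_vars)]
    scores.sort(key=lambda s: (-s[0], s[1]))  # strongest first, ties by column index
    return [v for _, v in scores[:k]]
```

Each variable is scored in isolation, so two strongly correlated probe-sets that carry the same information can both be selected; this is the redundancy that a multivariate method such as MMPC is designed to avoid.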
COMPARISON OF REGRESSION VERSUS CLASSIFICATION

In all related prior work, to the best of our knowledge, classification and not regression models
have been constructed for predicting GI50 values (Potti, et al., 2006), (Staunton, et al., 2001),
(Ma, et al., 2009). Given that the latter values are continuous, the authors have quantized them
before applying any classifiers, as described in the previous sections. We now show that
quantization is sometimes detrimental to performance and regression techniques have greater
predictive power.
In the next set of computational experiments we pre-process the GI50 values of each drug to
discretize them as described in (Ma, et al., 2009). Specifically, the class Ci,j of a cell-line i and
drug j is computed as sensitive, intermediate, or resistant if GI50i,j falls within (-∞, µj - 0.5σj], (µj
- 0.5σj, µj + 0.5σj], and (µj + 0.5σj, ∞) respectively, where µj is the average GI50 value over all
cell lines for drug j and σj the standard deviation.

Figure 3: Performance differences when employing MMPC vs. univariate filtering for
feature selection. MMPC is statistically significantly better in the gene expression dataset.
The differences in the other datasets are not significant. Univariate filtering requires
optimization of the number of variables k to retain, while MMPC does not. For these reasons
MMPC is preferable on this task.
To evaluate classification, we employed the same overall protocol described in Section
“IMPROVING THE ESTIMATION OF GI50” with the following minimal modifications: we
used multi-class SVM classification instead of SVM Regression. SVMs have been very popular
and successful classifiers, particularly in bioinformatics (Statnikov, Aliferis, Tsamardinos,
Hardin, & Levy, 2005). As for regression, we used the libSVM implementation of SVMs with
the Radial Basis kernel and all other parameters set to default. The metric of performance for
classification is accuracy, i.e., the percentage of samples whose class is correctly predicted
(instead of the metric R2 used for regression). We denote with Aj the leave-one-out cross-
validated accuracy of the method on drug j.
Comparing regression vs. classification is not straightforward given that regression outputs a
continuous prediction for GI50i,j while classification outputs its class. To overcome this issue we
discretize the output of the regression models to the three stated classes using the same intervals
as above. This allows us to compute the cross-validated accuracy of the regression for each drug
j, denoted as Dj. In other words, Aj’s are computed by first discretizing the data, then using
classification, and measuring the accuracy of the output, while Dj’s are computed by using
regression, then discretizing the predictions, and computing accuracy.
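
The computation of Dj can be sketched as follows. The class boundaries follow the µj ± 0.5σj rule stated earlier; using the sample standard deviation and assigning a value that lies exactly on a boundary to the lower class are assumptions made for this example, as are the function names.

```python
import math


def thresholds(gi50):
    """Class boundaries mu - 0.5*sigma and mu + 0.5*sigma for one drug."""
    n = len(gi50)
    mu = sum(gi50) / n
    sigma = math.sqrt(sum((g - mu) ** 2 for g in gi50) / (n - 1))  # sample SD (assumption)
    return mu - 0.5 * sigma, mu + 0.5 * sigma


def to_class(value, low, high):
    """Map a GI50 value to one of the three classes."""
    if value <= low:
        return "sensitive"
    if value <= high:
        return "intermediate"
    return "resistant"


def accuracy(true_classes, predicted_classes):
    """Percentage of samples whose class is predicted correctly."""
    hits = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return 100.0 * hits / len(true_classes)


def regression_accuracy(gi50_true, gi50_pred):
    """D_j: discretize the regression outputs, then score them against the true classes."""
    low, high = thresholds(gi50_true)
    truth = [to_class(g, low, high) for g in gi50_true]
    predicted = [to_class(g, low, high) for g in gi50_pred]
    return accuracy(truth, predicted)
```

Aj is obtained with the same accuracy function, but with the predicted classes coming directly from a classifier trained on the discretized GI50 values.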
Figure 4 shows the histograms of their difference Dj – Aj for the gene expression, miRNA
expression, and proteomics data over the full set of 120 drugs and the 14 clinically important
ones. The legends of the figures also show the means of these differences. In some cases
regression accuracy scores higher than classification accuracy and in other cases the reverse
happens. Table 1 shows the minima and maxima of their difference Dj – Aj over the full or the
restricted set of drugs for each dataset.
Table 1: Minimum and maximum difference Dj – Aj (in percentage points) with the corresponding
drug names and NSC ids, on all datasets, for all drugs and for the selected drugs.

Dataset             Quantity    Min (all drugs)      Max (all drugs)         Min (selected drugs)   Max (selected drugs)
Gene expressions    Dj – Aj     -35.8                24.5                    -20.7                  24.5
                    Drug name   Inosine dialdehyde   Oxanthrazale            Doxorubicin            Etoposide
                    NSC id      118994               349174                  123127                 141540
miRNA expressions   Dj – Aj     -41.5                26.4                    -20.7                  24.5
                    Drug name   Pyrazoloimidazole    Dichloroallyl lawsone   Paclitaxel             Camptothecin
                    NSC id      51143                126771                  125973                 94600
Proteins            Dj – Aj     -43.4                43.4                    -18.9                  30.2
                    Drug name   Hydroxyurea          5-HP                    Vinblastine-sulfate    Camptothecin
                    NSC id      32065                107392                  49842                  94600

These results suggest that in general one should also try regression methods and not only
classification. In terms of mean differences, classification accuracy A is on average higher than
regression accuracy D on the full set of 120 drugs for all three datasets. However, when we focus
on the 14 drugs of high interest, the situation is reversed: D is on average higher than A in two
out of three datasets. Finally, we note that the way this comparison is performed slightly favors
classification. This is because regression methods try to make predictions at a finer level, i.e.,
predict the exact value of the GI50. On the other hand, classification methods make cruder
predictions of the general levels sensitive, intermediate, resistant. When we discretize the output
of the regression methods, they lose this advantage and are compared against classification on
this cruder and less granular scale. Considering all the above, we decide to employ regression
methods for the rest of the paper.

Figure 4: Differences in accuracies D – A between the accuracy achieved by discretized
regression outputs (D) and the accuracy achieved by classification methods (A).
Classification accuracy A is on average higher than regression accuracy D on the full set of
120 drugs for all three datasets. However, when we focus on the 14 drugs of high interest,
the situation is reversed for two out of the three datasets.
A FULL ANALYSIS OF THE COMPLETE SET OF 120 DRUGS
Based on the previous sections we estimate the GI50’s using the sigmoid fitting, employ SVM
regression methods, the MMPC for feature selection, leave-one-out cross-validation for
estimation of performance, and the cross-validated R2 metric (coefficient of determination). We
stress that feature selection is also cross-validated, i.e., for each sample that is left out,
feature selection is performed on the remaining samples. This avoids overfitting (see Hastie,
Tibshirani, & Friedman, 2009, section 7.10.2) which is particularly a problem in this task where
the total number of variables greatly exceeds the number of samples (55465 variables vs. 53
samples). The final set of selected variables is produced by applying the feature selection method
on all the available data. Thus, the estimation of performance is produced by cross-validating the
complete method of selecting variables and producing a model.
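
The protocol of repeating feature selection inside every cross-validation fold can be sketched as below. To keep the example self-contained, MMPC is replaced by picking the single variable with the largest absolute Pearson correlation, and SVM regression by a one-variable least-squares line; both are deliberately simple stand-ins and not the methods used in this paper, and all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0


def select_top_feature(X, y):
    """Stand-in for MMPC: the column with the largest absolute correlation with y."""
    return max(range(len(X[0])), key=lambda v: abs(pearson([row[v] for row in X], y)))


def fit_line(x, y):
    """Stand-in for SVM regression: least-squares slope and intercept on one variable."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def loo_predictions(X, y):
    """Leave-one-out predictions with selection and fitting redone in every fold."""
    predictions = []
    for i in range(len(y)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        v = select_top_feature(X_train, y_train)  # sees training samples only
        slope, intercept = fit_line([row[v] for row in X_train], y_train)
        predictions.append(intercept + slope * X[i][v])
    return predictions
```

Selecting variables once on all samples and cross-validating only the regression step would let the held-out sample influence the signature, which is the source of optimistic bias that the protocol above avoids.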
The above combination of methods and experimentation protocols was run on four datasets: gene
expressions, miRNA expressions, protein concentrations, and the combined dataset containing all
molecular quantities. To facilitate the interpretation of the results, we classify drugs according to
how well they are predicted. More specifically, for the Pearson correlation r between two
quantities, Cohen gives the following interpretation guidelines (Wikipedia - Effect Size): small
effect size, r = 0.1 - 0.23; medium, r = 0.24 - 0.36; large, r = 0.37 or larger. Interpreting R2 as
r^2 and translating the values, we get approximately the intervals [0.01, 0.05), [0.05, 0.13), [0.13,
1]. The term “effect size” was coined for causal effects; in our case, a correlation does not
necessarily correspond to a causal effect, so the correct interpretation of “effect size” is the
predictability of the GI50 given the molecular quantities. Under this interpretation several drugs
have a large effect size, while others have a negative R2, meaning that our prediction
does not improve on the prediction by the mean value. Summary results are presented in
Table 2.
Table 2: Predictive performance over all 120 drugs and datasets. SVM regression, MMPC for
feature selection, sigmoid fitting for estimation of GI50, and leave-one-out cross validation are
employed. The response to several drugs can be predicted using molecular quantities such as
gene expressions, miRNA expressions, and protein concentrations.

Dataset              Not predicted   Small Effect      Medium Effect        Large Effect
                     (R2 < 0)        (R2: 0 – 0.05)    (R2: 0.05 – 0.13)    (R2 > 0.13)
Gene expressions     95              12                11                   6
Proteins             92              13                11                   4
miRNA expressions    82              10                15                   13
Combined             95              4                 15                   6
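
The categories used in Table 2 follow from squaring the correlation thresholds quoted in the text (0.1, 0.24, and 0.37) and rounding. A small sketch of that translation and of the resulting binning is given below; the function name and the handling of values that fall exactly on a boundary are assumptions.

```python
# Squares of the correlation thresholds quoted in the text.
R_THRESHOLDS = (0.10, 0.24, 0.37)
R2_THRESHOLDS = tuple(r * r for r in R_THRESHOLDS)  # approximately 0.01, 0.0576, 0.1369


def effect_category(r2):
    """Bin a cross-validated R^2 into the categories of Table 2 (rounded cut-offs)."""
    if r2 < 0:
        return "not predicted"
    if r2 < 0.05:
        return "small"
    if r2 < 0.13:
        return "medium"
    return "large"
```

For example, an R2 of 0.1176 falls in the medium category and an R2 of 0.3764 in the large one.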

FOCUS ON THE MOST CLINICALLY RELEVANT SET OF DRUGS
To discover new biological knowledge we focus on a set of 14 drugs that our clinical expert
ODR suggested as the most important for clinical practice. We note that the list was pre-selected
before the beginning of the analysis. Their names and NSC ids are shown in Table 3.
Table 3: A selected set of clinically interesting drugs on which to focus biological interpretation
of results.
Drug Name NSC Id
Methotrexate 740
Fluorouracil (5-FU) 19893
Mitomycin 26980
Vinblastine-sulfate 49842
Vincristine-sulfate 67574
Lomustine (CCNU) 79037
Camptothecin 94600
Cisplatin 119875
Doxorubicin 123127
Taxol (Paclitaxel) 125973
Etoposide 141540
Tamoxifen 180973
Carboplatin 241240
Gemcitabine 613327
In Table 4 we summarize the predictive performance for the 14 drugs using different datasets.
Notice that the addition of variables in the “Combined” dataset sometimes leads to worse
performance. Obviously, the combined variable set contains more information but the additional
dimensions to the problem may confuse the learning methods and reduce performance (“curse of
dimensionality”). This phenomenon is particularly acute when the sample size is small, as in this
task.
Table 4: Categorization of predictive performance over the 14 clinically important drugs. GI50
was estimated with sigmoid fitting. SVM regression and MMPC for feature selection were
employed.

Dataset                              R2 < 0   R2: 0 – 0.05   R2: 0.05 – 0.13   R2 > 0.13
Gene expressions                     10       1              1                 2
Proteins                             9        2              2                 1
miRNA expressions                    8        1              4                 1
Combined                             9        0              1                 4
Best achieved over all datasets      5        1              1                 7

Table 5: The list of the selected drugs whose chemosensitivity can be reliably predicted (R2 > 0).
The dataset where the best performance is achieved is shown.

Drug Name           NSC ID    R2       Dataset
Camptothecin        94600     0.3764   Gene Expression
Tamoxifen citrate   180973    0.3190   Combined
Paclitaxel          125973    0.2145   miRNA
Doxorubicin         123127    0.2023   Protein
Carboplatin         241240    0.1829   Gene Expression
Mitomycin           26980     0.1539   Combined
Gemcitabine         613327    0.1419   Combined
Lomustine           79037     0.1176   Protein

Table 6: The selected molecular quantities for each drug on the dataset that achieves the best
predictability performance. The linear Pearson correlation of each quantity with the GI50 is also
shown.
Drug Name                  Dataset          Probe-set ID  Gene/microRNA symbol  Correlation with GI50
Camptothecin               Gene Expression  1563210_at    Unknown               -0.4940
                                            208425_s_at   DKFZP564D166          -0.6105
                                            221013_s_at   APOL2                 -0.5317
                                            226015_at     ZNF127                -0.4921
                                            229284_at     MAT2B                  0.3820
                                            229986_at     LOC377064              0.4943
                                            230254_at     Unknown               -0.4414
                                            230410_at     NRP2                  -0.5528
                                            239664_at     C3orf17                0.4253
Tamoxifen citrate          Combined         1569188_s_at  RPL10                 -0.6849
                                            200915_x_at   KTN1 / PDIA6           0.5577
                                            201840_at     NEDD8                  0.5215
                                            204798_at     MYB                   -0.6387
                                            223783_s_at   GEMIN4                -0.5226
                                            225371_at     GLE1L                 -0.5203
                                            227481_at     CNKSR3                 0.6458
                                            (Protein)     PTPN11                -0.5291
Paclitaxel                 miRNA            (miRNA)       hsa-mir-106a           0.6028
                                            (miRNA)       mir_95 left            0.4070
Doxorubicin                Protein          (Protein)     BCAR1                 -0.4486
                                            (Protein)     CASP7                 -0.3588
                                            (Protein)     CCNA2                  0.3334
Carboplatin                Gene Expression  1557805_at    C9orf77                0.4797
                                            1558504_at    LOC440721              0.5643
                                            1565741_at    Unknown               -0.4302
                                            202049_s_at   ZNF262                -0.4595
                                            213572_s_at   SERPINB1               0.5220
                                            225627_s_at   CACHD1                -0.4540
                                            226825_s_at   TPARL                  0.4335
                                            228726_at     SERPINB1               0.6081
                                            244877_at     Unknown                0.5260
Mitomycin                  Combined         1560854_s_at  ZNF588                -0.5395
                                            1562303_at    ZNF306                 0.4147
                                            202031_s_at   WIPI2                 -0.4397
                                            209450_at     OSGEP                  0.4967
                                            224374_s_at   EMILIN2                0.4866
                                            237104_at     CTSS                  -0.4330
                                            239637_at     Unknown               -0.6447
Gemcitabine hydrochloride  Combined         211358_s_at   CIZ1                  -0.4616
                                            212873_at     HMHA1                  0.5025
                                            222878_s_at   OTUB2                  0.4714
                                            227575_s_at   C14orf102              0.4422
                                            229935_s_at   MLL                    0.8181
                                            232922_s_at   C20orf59               0.4062
Lomustine                  Protein          (Protein)     GSK3B                 -0.3439
                                            (Protein)     HRAS                   0.4972
                                            (Protein)     MGMT                   0.4455
                                            (Protein)     PTPN11                -0.4927
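
The correlation column of Table 6 is a univariate Pearson correlation between a measured quantity and the GI50 across cell lines. The following minimal sketch shows how such a value can be computed and how its sign may be read. The numbers are synthetic and purely illustrative (they are not taken from the study), and the helper names `pearson` and `interpret` are hypothetical.

```python
import math

def pearson(x, y):
    """Linear Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def interpret(r):
    """Read the sign of the correlation with GI50 (higher GI50 = more drug needed)."""
    if r > 0:
        return "higher level, higher GI50: associated with resistance"
    if r < 0:
        return "higher level, lower GI50: associated with sensitivity"
    return "no linear association"

# Synthetic example: one GI50 value per cell line and two hypothetical markers.
gi50 = [2.0, 2.9, 4.2, 4.8, 6.1]
marker_up = [1.0, 2.0, 3.0, 4.0, 5.0]    # rises with GI50
marker_down = [5.0, 4.0, 3.0, 2.0, 1.0]  # falls as GI50 rises

r_up = pearson(marker_up, gi50)
r_down = pearson(marker_down, gi50)
print(round(r_up, 4), interpret(r_up))
print(round(r_down, 4), interpret(r_down))
```

Note that a univariate correlation describes each quantity in isolation; the multivariate signature selected for a drug may combine quantities whose individual correlations are modest.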

We now focus on the 8 drugs with medium and strong effects. Their names and the dataset where the
best performance is achieved are shown in Table 5. The selected variables are shown in Table 6.
GI50 stands for “growth inhibition 50%”: the concentration of a given test drug that causes 50%
growth inhibition, corrected for the cell count at time zero. The (univariate) linear Pearson
correlations of each quantity with the GI50 are also shown in Table 6. A positive correlation in
this context means that the larger the value of the mRNA/miRNA/protein, as measured by the
arrays, the larger the GI50 for the drug/sample combination, and thus the larger the resistance of
the tumour to the drug. A negative correlation, on the other hand, implies that the higher the value
of the mRNA/miRNA/protein, the smaller the GI50, and thus the more sensitive the tumour is to the
drug. A brief biological interpretation of the results now follows.
Camptothecins are central in the treatment of colorectal cancer. They target the cleavable complex
between topoisomerase I and DNA, inducing irreversible double-strand breaks. None of
the genes identified in this signature has previously been linked to this drug.
Tamoxifen is an anti-estrogen used for the prevention of breast cancer recurrence in estrogen and
progesterone hormone receptor positive breast cancer. It is taken for five years after the
initial curative treatment. In this signature, NEDD8 gene overexpression confers resistance. The
NEDD8 pathway was proposed to provide a mechanism by which breast cancer cells acquire
anti-estrogen resistance while retaining expression of estrogen receptor alpha (Fan, Bigsby, &
Nephew, 2003).
microRNAs are short (18-24 nt) non-coding RNAs that are involved in post-transcriptional
regulation of gene expression in multicellular organisms by affecting both the stability and
translation of mRNAs. Each microRNA can theoretically control hundreds of genes, and there is
an inverse relation between a microRNA and its mRNA targets: a high microRNA level reduces the
level of its mRNA target, and vice versa. Paclitaxel is an antitubulin agent whose main effect is
to stabilize the dynamics of the microtubule system in interphase and M-phase, inducing DNA
damage, chromosomal imbalance, and subsequently apoptosis. For paclitaxel, two miRNAs were highly
predictive, and overexpression of either conferred resistance. Overexpression of mir-106a has been
connected to increased proliferation in breast cancer, one of the main tumours for which paclitaxel
is used, through down-regulation of ZBTB4 (Kim, Chadalapaka, Lee, Yamada, Sastre-Garau, et al.,
2011). We also found that ZBTB4 has a 99% probability of being a target of mir-106a using
dedicated software (MicroRNA Target Prediction) (Saito & Sætrom, 2010). mir-95 was
recently shown to be overexpressed in 50% of colorectal cancers and to have oncogenic properties
through down-regulation of SNX1 (Huang, Huang, Wang, Liang, Ni, et al., 2011).
Doxorubicin is an anthracycline, a topoisomerase II poison that stabilizes the cleavable
complexes of DNA, inducing double-strand breaks, and forms covalent adducts, inducing DNA
damage. Here, high levels of CCNA2 protein and low levels of BCAR1 and CASP7 conferred
resistance. CCNA2 is critical for initiation of DNA replication, transcription, and cell cycle
regulation, and its manipulation changes doxorubicin sensitivity. BCAR1 (breast cancer
resistance gene 1) has been shown to confer resistance to tamoxifen and is a prognostic factor in
breast cancer (Dorssers, Grebenchtchikov, Brinkman, Look, van Broekhoven, et al., 2004).
CASP7 is an important factor in drug-induced apoptosis.

Figure 5: The signature of Lomustine entered in KEGG (Kyoto Encyclopedia of Genes and
Genomes; Kanehisa, Araki, Goto, Hattori, Hirakawa, et al., 2008), showing HRAS
overexpression (red) in the oncogenic pathways of brain cancer, here implying resistance
against Lomustine.

DNA is the main cytotoxic target of Cisplatin and Carboplatin, which induce single- and
double-strand DNA breaks through adducts and cross-linking, leading to cell death through
apoptosis. For Carboplatin, six genes (seven transcripts) were predictive, where overexpression of
only two genes predicted resistance. One of them, SERPINB1 or PAI-1, is the first predictive and
prognostic biomarker evaluated in a phase III study in breast cancer, where high expression
predicted low chemotherapy effect.
The novel antimetabolite gemcitabine targets RRM1 (ribonucleotide reductase subunit M1). The
signature includes only one gene that has been related to drug resistance, MLL. High MLL
was detected in osteosarcoma cell lines resistant to methotrexate, a drug with a mechanism
similar to that of gemcitabine (Hattinger, Stoico, Michelacci, Pasello, Scionti, et al., 2009).
Lomustine, a chloroethylating chemotherapeutic, is active against tumours in the central nervous
system. Here we found that MGMT protein overexpression predicts resistance. MGMT is a well-known
DNA repair gene that is predictive for Lomustine as part of the PCV regimen for
aggressive brain cancer. MGMT down-regulation due to methylation is common in brain
tumours and strongly affects the response to treatment (Herrlinger, Rieger, Koch, Loeser,
Blaschke, et al., 2006). Moreover, HRAS overexpression was predictive of Lomustine
resistance. Overexpression of HRAS is a key feature of aggressive brain cancer (Kanehisa,
Araki, Goto, Hattori, Hirakawa, et al., 2008) and is predictive of survival (Serao, Delfino,
Southey, Beever, & Rodriguez-Zas, 2011), but its predictive value related to Lomustine has not
been established (see Figure 5). For both genes, the expression of the respective protein has not
previously been established as a predictive marker; thus, this is a novel finding. Moreover, for
GSK3B, part of the AKT/GSK3β/cyclin D1 pathway, where down-regulation confers radio-resistance
(Shimura, 2011), down-regulation here correlated with Lomustine resistance.
Several of these relations have not been explored previously and deserve evaluation in the wet
lab and the clinic.
FURTHER REDUCING THE SELECTED VARIABLES BY EMPLOYING MULTI-TASK VARIABLE SELECTION
Anticancer drugs such as chemotherapeutics, anti-hormones, and targeted drugs are typically
effective only for subsets of patients. A large patient population is thus treated unnecessarily,
only to experience the side effects; unnecessary treatment also incurs a high cost to the patient
and to society at large. In the treatment of cancer one can often choose between two or more
compounds, and knowing which will work for each specific patient can save lives and prevent
unnecessary suffering.
The variable selection methods applied so far (MMPC and univariate filtering) select a set of
predictive variables for each drug. In order to apply these methods for personalized treatment
and selection of the optimal chemotherapeutic agent for a given patient, one would have to
measure the predictive variables for all the drugs. This potentially increases cost. One way to
further reduce the number of variables would be to select a single set of variables that is
simultaneously predictive for all drugs.
We approach this problem with a Multi-Task variable selection (MTVS) method; in our case, a task is
defined as the prediction of a specific drug. Multi-task methods try to solve a
prediction problem for several tasks at once, in our case the selection of variables and prediction
for a panel of drugs. Such multi-task methods have two aims: (a) to learn a small
signature set that is common to many drugs, and (b) to improve learning performance by exploiting
the similarities among different predictive tasks. For example, a multi-task method may determine
that a gene that seems marginally predictive when examined for a single drug is
important when examined for several drugs, because these drugs may share similar chemical
composition or affect the same pathway.
Figure 6: Number of drugs with R2 > 0 as a function of the number of variables k selected
by the MTVS algorithm.

We chose to use the method introduced in (Argyriou, Evgeniou, & Pontil, 2008). We slightly
optimized the code (available online) for our specific problem. In particular, (a) since all drugs
(tasks) share the same data we kept only one copy of data for all tasks, therefore reducing the
total memory (and time) used by the procedure (by a factor linearly proportional to the total
number of drugs) and (b) removed redundant computations since we only want to select
variables and not learn new features. The number of selected variables is controlled by a
parameter k.
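To make the idea of joint variable selection concrete, the sketch below fits a row-sparse linear model to several tasks at once and keeps the k variables with the largest coefficient norms across tasks. It is not the code of Argyriou et al. nor the optimized version described above: it is a generic L2,1-regularized least-squares formulation solved by proximal gradient descent, chosen here as a simple stand-in. The function name `multitask_select`, the regularization weight `lam`, the iteration count, and the synthetic data are all assumptions made for illustration. Note that a single data matrix `X` is shared by all tasks, mirroring optimization (a).

```python
import numpy as np

def multitask_select(X, Y, k, lam=0.1, n_iter=500):
    """Select k variables shared across tasks via an L2,1-penalized linear model.

    X: (n_samples, n_variables) data shared by all tasks.
    Y: (n_samples, n_tasks) one response column per task (drug).
    Returns the indices of the k rows of W with the largest norms, and W.
    """
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    # Step size = 1 / Lipschitz constant of the gradient of (1/2n)||XW - Y||^2.
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y) / n
        V = W - step * grad
        # Proximal step: shrink each row (one variable, all tasks) towards zero.
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        shrink = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
        W = shrink * V
    row_norms = np.linalg.norm(W, axis=1)
    return np.argsort(-row_norms)[:k], W

# Synthetic example: 40 samples, 20 variables, 5 tasks; only variables 0-2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))
W_true = np.zeros((20, 5))
W_true[:3] = [[1, -1, 1, -1, 1],
              [2, 2, 2, 2, 2],
              [-1.5, 1.5, -1.5, 1.5, -1.5]]
Y = X @ W_true + 0.1 * rng.normal(size=(40, 5))

selected, W = multitask_select(X, Y, k=3)
print(sorted(selected.tolist()))
```

Because the penalty acts on whole rows of the coefficient matrix, a variable is kept or discarded for all tasks together, which is what allows one small signature to serve several drugs.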
We ran MTVS on the complete set of drugs, in order to take advantage of possible drug
similarities, and show results only for the 14 clinically important drugs. Table 7 summarizes the
results and juxtaposes them with those achieved by MMPC. For k = 50, both methods have similar
predictive performance (MTVS obtains a positive R2 for more drugs, but the values are smaller in
magnitude), while MTVS selects fewer than half as many variables. The difference is
much larger if we consider the whole set of drugs (MMPC selects over 700 variables in total).
Table 7: Categorization of predictive performance over the 14 clinically important drugs. SVM
regression, sigmoid fitting for estimation of GI50, and leave-one-out cross validation are
employed for both variable selection methods.
Method   #All Selected Var.   R2 < 0   R2: 0 – 0.05   R2: 0.05 – 0.13   R2 > 0.13
MMPC     105                  9        0              1                 4
MMTVS-placeholder
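
The negative R2 values in the tables may look surprising. The sketch below shows how they can arise under leave-one-out cross validation. It assumes that R2 is computed as 1 - SSE/SST from the held-out predictions, which is a common definition but is not spelled out in this section, and it replaces SVM regression with two simple stand-in predictors so that the example stays self-contained. The names `loo_r2`, `mean_model`, and `linear_model` are hypothetical.

```python
def loo_r2(x, y, fit_predict):
    """Leave-one-out R2: 1 - SSE/SST, using predictions for held-out samples.

    Assumed definition; fit_predict(x_train, y_train, x_new) returns a prediction.
    """
    n = len(y)
    preds = []
    for i in range(n):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        preds.append(fit_predict(x_train, y_train, x[i]))
    y_bar = sum(y) / n
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

def mean_model(x_train, y_train, x_new):
    """Uninformative predictor: ignores x and returns the training mean."""
    return sum(y_train) / len(y_train)

def linear_model(x_train, y_train, x_new):
    """Univariate least-squares line (a stand-in for SVM regression)."""
    n = len(x_train)
    mx, my = sum(x_train) / n, sum(y_train) / n
    sxx = sum((a - mx) ** 2 for a in x_train)
    slope = sum((a - mx) * (b - my) for a, b in zip(x_train, y_train)) / sxx
    return my + slope * (x_new - mx)

# Synthetic data with an exact linear relation y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

r2_mean = loo_r2(x, y, mean_model)
r2_linear = loo_r2(x, y, linear_model)
print(r2_mean, r2_linear)
```

With this definition, a predictor that carries no information scores below zero, because each held-out value is compared against a mean estimated without it; for n samples the uninformative predictor yields exactly 1 - (n/(n-1))^2. A positive cross-validated R2 therefore indicates performance better than guessing the mean.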
Figure 6 shows the number of drugs with a positive R2 (i.e., drugs that can be predicted) as a
function of the number k of variables selected by MTVS. Using only 23 variables, MTVS can predict a
total of 9 drugs with a positive R2. Increasing k to 50 allows one more drug to be
predicted. For this specific set of 9 drugs, MMPC requires measuring a total of 71 variables. This
is not a fair comparison, because the required number of variables is compared on the drugs
that MTVS turned out, a posteriori, to predict best. However, it indicates that multi-task
feature selection is a potentially useful technique that may further reduce the required number of
variables, when prediction is required simultaneously for several drugs, compared to standard
variable selection methods.
CONCLUSION
Predicting chemosensitivity of tumours from gene expressions is important for selecting
treatment, understanding the molecular mechanisms of drug response, and selecting molecular
signatures. In this paper, we show that predictive performance can sometimes be improved by
employing a new method for estimating the GI50 (an indication of response to a drug), regression
algorithms instead of classification, and state-of-the-art multivariate feature selection. In
addition, we show that by employing multi-task feature selection methods, common signatures
for several drugs can be found that are smaller than those found with single-task variable
selection methods. The signatures identified here have several known links to cancer progression
and resistance to chemotherapy. Knowledge of these relations is still expanding, and the methods
used to identify those signatures may be an important tool for generating novel biological
hypotheses.
Acknowledgements
EC was supported for this research by the ContraCancrum EU FP7 STREP GA 223979 and the
Institute of Computer Science of the Foundation for Research and Technology, Hellas. We
would like to thank Prof. Sætrom of the Norwegian University of Science and Technology
(NTNU), Trondheim, Norway for microRNA-gene target analysis. We would like to thank Amos
Folarin for introducing us to this problem. We would like to thank Matina Fragoyanni for her
code on downloading drug response data. Thanks to Sofia Triantafillou, Vincenzo Lagani and
Angelos Armen for their feedback and fruitful comments.
REFERENCES
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010). Local causal and
Markov blanket induction for causal discovery and feature selection for classification part I:
Algorithms and empirical evaluation. Journal of Machine Learning Research, Special Topic on
Causality 11, 171-234.
Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex Multi-Task Feature Learning. Machine Learning,
Special Issue on Inductive Transfer Learning, 73(3), 243-272.
Augustine, C. K., Yoo, A., Potti, J. S., Yoshimoto, Y., Zipfel, P. A., Friedman, H. S., et al. (2009). Genomic
and molecular profiling predicts response to temozolomide in melanoma. Clinical Cancer Res
15(2), 502.
Bioconductor. (n.d.). Retrieved from http://www.bioconductor.org/
Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Fifth
Annual Workshop on Computational Learning Theory, 144-152.
Chang, C. C., & Lin, C. J. (2001). LIBSVM: a library for support vector machines.
Developmental Therapeutics Program NCI/NIH. (n.d.). Retrieved from http://dtp.nci.nih.gov/index.html
Dorssers, L. C., Grebenchtchikov, N., Brinkman, A., Look, M. P., van Broekhoven, S. P., et al. (2004). The
prognostic value of BCAR1 in patients with primary breast cancer. Clin Cancer Res 10, 6194-6202.
Fan, M., Bigsby, R. M., & Nephew, K. P. (2003). The NEDD8 pathway is required for proteasome-mediated
degradation of human estrogen receptor (ER)-alpha and essential for the antiproliferative
activity of ICI 182,780 in ERalpha-positive breast cancer cells. Mol Endocrinol 17, 356-365.
Grubbs, F. (1969). Procedures for Detecting Outlying Observations in Samples. Technometrics 11(1), 1-21.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Second Edition. Springer Series in Statistics.
Hattinger, C. M., Stoico, G., Michelacci, F., Pasello, M., Scionti, I., et al. (2009). Mechanisms of gene
amplification and evidence of coamplification in drug-resistant human osteosarcoma cell lines.
Genes Chromosomes Cancer 48, 289-309.
Herrlinger, U., Rieger, J., Koch, D., Loeser, S., Blaschke, B., et al. (2006). Phase II trial of lomustine plus
temozolomide chemotherapy in addition to radiotherapy in newly diagnosed glioblastoma: UKT-03.
J Clin Oncol 24, 4412-4417.
Huang, Z., Huang, S., Wang, Q., Liang, L., Ni, S., et al. (2011). MicroRNA-95 promotes cell proliferation
and targets sorting Nexin 1 in human colorectal carcinoma. Cancer Res 71, 2582-2589.
Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., et al. (2008). KEGG for linking genomes to life
and the environment. Nucleic Acids Res 36, D480-484.
Kim, K., Chadalapaka, G., Lee, S. O., Yamada, D., Sastre-Garau, X., et al. (2011). Identification of oncogenic
microRNA-17-92/ZBTB4/specificity protein axis in breast cancer. Oncogene.
Ma, Y., Ding, Z., Qian, Y., Wan, Y. W., Tosun, K., Shi, X., et al. (2009). An integrative genomic and
proteomic approach to chemosensitivity prediction. Int. J. Oncol. 34(1), 107-115.
NCI60 Methodology. (n.d.). Retrieved from
http://dtp.nci.nih.gov/docs/compare/compare_methodology.html
NTNU - Faculty of Medicine, MicroRNA Target Prediction. (n.d.). Retrieved from
tare.medisin.ntnu.no/mirna_target/
Potti, H. K., Dressman, A., Bild, A., Riedel, R. F., Chan, G., Sayer, R., et al. (2006). Genomic signatures to
guide the use of chemotherapeutics. Nature Medicine 12, 1294-1300.
Saito, T., & Sætrom, P. (2010). A two-step site and mRNA-level model for predicting microRNA targets.
BMC Bioinformatics, 612.
Serao, N. V., Delfino, K. R., Southey, B. R., Beever, J. E., & Rodriguez-Zas, S. L. (2011). Cell cycle and aging,
morphogenesis, and response to stimuli genes are individualized biomarkers of glioblastoma
progression and survival. BMC Med Genomics 4, 49.
Shankavaram, U. T., Varma, S., Kane, D., Sunshine, M., Chary, K. K., Reinhold, W. C., et al. (2009).
CellMiner: a relational database and query tool for the NCI-60 cancer cell lines. BMC Genomics
10, 277.
Shimura, T. (2011). Acquired radioresistance of cancer and the AKT/GSK3beta/cyclin D1 overexpression
cycle. J Radiat Res (Tokyo) 52, 539-544.
Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of
multicategory classification methods for microarray gene expression cancer diagnosis.
Bioinformatics 21(5), 631-645.
Statnikov, A., Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2009). Causal Explorer: A Matlab library of
algorithms for causal discovery and variable selection for classification. Challenges in Causality 1.
Staunton, J. E., Slonim, D. K., Coller, H. A., Tamayo, P., Angelo Michael, J., Park, J., et al. (2001).
Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. 98(19), 10787-10792.
Steel, R. G., & Torrie, J. H. (1960). Principles and Procedures of Statistics. New York: McGraw-Hill.
Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network
structure learning algorithm. Machine Learning 65, 31-78.
Wikipedia - Effect Size. (n.d.). Retrieved from en.wikipedia.org/wiki/Effect_size