net/publication/235343751
All content following this page was uploaded by Oluf Dimitri Røe on 29 May 2014.
ABSTRACT
The chemosensitivity of tumours to specific drugs can be predicted based on molecular
quantities, such as gene expressions, miRNA expressions, and protein concentrations. This
finding is important for improving drug efficacy and personalizing drug use. In this paper, we
present an analysis strategy that, compared to prior work, retains more information in the data for
analysis and may lead to improved chemosensitivity prediction. We apply improved methods for
estimating the GI50 value of a drug (an indicator of the response to the drug), regression
methods for constructing predictive models of the GI50 value, advanced variable selection
techniques, such as MMPC, and a multi-task variable selection technique for identifying a small-
size signature that is simultaneously predictive for several drugs and cell lines. The methods are
applied on gene expression, miRNA expression, and proteomics data from 53 tumour cell lines
after treatment with 120 drugs, obtained from the National Cancer Institute databases. A
biological interpretation and discussion of the results is presented for the most clinically
important subset of 14 drugs.
INTRODUCTION
Prior work shows that the sensitivity of a tumour to a drug can be predicted better than chance
based on the gene-expressions of the tumour (Potti, et al., 2006), (Augustine, et al., 2009). This
finding paves the way to personalized therapy models. In addition, identifying the molecular
quantities that are predictive may lead to a better understanding of the biological mechanisms a
drug employs to attack the tumour. In this paper, we develop an analysis strategy to produce
predictive models, estimate their performance, and identify the smallest, most-predictive set of
molecular quantities required for prediction. The strategy is first applied to and evaluated on the
prediction of the response to a set of 120 chemotherapeutic agents based on the responses
measured on 53 solid-tumour cell lines; the data have been obtained from the National Cancer
Institute databases and contain pre-treatment gene expression, miRNA expression, and protein
concentration profiles of the cell-lines. Subsequently, we focus our interest on a subset of 14
drugs that are the most interesting in clinical practice and provide a detailed presentation and
biological interpretation of the results. In addition, we apply a multi-task feature selection
method that selects molecular quantities that, combined, are simultaneously predictive for an
array of drugs. Such algorithms are important for selecting the optimal therapy because they can
predict the response to several drugs at once from measurements of only a small set of molecular
quantities.
Compared to prior works, our proposed strategy differs in several ways. The machine learning
and statistical analysis employed in the literature process the data in a way that reduces the
available information with potential detrimental effects both on the models' prediction
performance as well as the identification of the molecular signatures (Potti, et al., 2006),
(Augustine, et al., 2009), (Staunton, et al., 2001). First, the estimation of the response to a drug in
prior work may be sub-optimal (Potti, et al., 2006), (Augustine, et al., 2009). The response of a
tumour depends, of course, on the dosage. The National Cancer Institute has treated a panel of 60
cancer cell lines with several thousand drugs and has created a dosage-response profile for each
combination of drug and tumour. Often, this profile is summarized with a single value such as
the log10GI50. GI50 stands for “growth inhibition 50%”: the concentration of a given test
drug that causes 50% growth inhibition at 48 hours, corrected for the cell count at time zero.
In the majority of cases, NCI estimates the log10GI50 values by piece-wise linear interpolation, and
these estimates are then employed by all prior work (e.g., (Potti, et al., 2006), (Ma, et al., 2009),
(Staunton, et al., 2001)). In this paper, we show that estimating the log10GI50 values by fitting a
sigmoid to the dosage-response profile preserves more information about the effects of the drug,
which leads to statistically significantly improved predictive performance.
Second, prior work typically quantizes the log10GI50 values to create classes of tumours: (Potti,
et al., 2006) and (Augustine, et al., 2009) categorize tumours as sensitive and resistant, while
(Staunton, et al., 2001) and (Ma, et al., 2009) as sensitive, intermediate, and resistant. This type
of quantization allows the application of machine learning classification techniques, variable
selection methods for classification tasks, and statistical hypothesis testing techniques for
discrete outcomes. Our computational experiments, however, demonstrate that maintaining the
exact log10GI50 values and employing regression analysis instead of classification is often
preferable, as it improves chemosensitivity prediction in approximately half of the cases.
Third, prior work often employs simple methods for identifying molecular signatures, such as
selecting the top k genes that are most differentially expressed between different classes of
tumours. We show that more sophisticated methods such as the Max Min Parents and Children
(MMPC) algorithm for multivariate feature selection (Tsamardinos, Brown, & Aliferis, 2006)
often select more predictive signatures for the same parameter k and are preferable to apply.
Fourth, selecting minimal-size, most-predictive sets of variables for a drug is intended to increase
our understanding of and intuition about the molecular mechanisms of the drugs. However, for clinical
use one should be able to predict the response to several drugs at the same time with as few
measurements as possible. Then, the optimal therapy can be selected based on
these predictions. Towards this goal, we apply a multi-task feature selection algorithm
(Argyriou, Evgeniou, & Pontil, 2008) that selects only a handful of molecular quantities (about
20) to reliably predict the response to 9 drugs of interest. The results indicate that multi-task
feature selection is a potentially useful technique that may further reduce the required number of
variables when prediction is required simultaneously for several drugs to optimize therapy.
The structure of the paper is as follows. We first present the data and the problem definition. In the
subsequent section we show that estimating the GI50 value by sigmoid fitting is preferable to the
standard NCI estimation using piecewise linear interpolation. The next section compares feature
selection methods. The following two sections discuss the results on the full panel of 120 drugs and a
subset of 14 drugs of particular clinical interest, respectively, including a biological
interpretation of results. We then present the multi-task feature selection analysis. The final
section concludes the paper.
The molecular profiles include gene expressions, miRNA expressions, and protein
concentrations. The gene expressions were measured on an Affymetrix U133 Plus 2 array containing
54,675 probe sets that correspond to about 47,000 transcript variants which in turn represent
more than 39,500 of the best characterized human genes. The miRNA expressions were
measured on a miRNA OSU V3 chip containing 627 probes. Finally, the proteomic data
consisted of 162 proteins measured in a protein lysate array. All data were downloaded from the
public NCI website (Developmental Therapeutics Program NCI/NIH). We denote with Xi the
vector of molecular quantities (variables) for cell-line i, Xi,v the value of the molecular quantity v
on cell-line i, and with X = {Xi} the matrix of quantities. The gene expression raw data have been
subjected to GCRMA normalization before analysis as implemented in the BioConductor
platform (Bioconductor). The log2 values of the miRNA and the proteomic expressions are used.
The drug-response data for all 53 cell-lines were obtained from the CellMiner database
(Shankavaram, et al., 2009) for a panel of 120 drugs. The set of 120 drugs was selected as
follows: 118 of them are the ones denoted as fully characterized by the NCI. Our clinical expert
ODR also suggested 14 drugs that are of clinical importance, some of which are already among
the 118. The union of these two sets
is the set of 120 drugs included in the computational experiments. The drug responses for each
combination of drug and cell-line contain several pairs of ‹d,r›, where d is the log10 drug dosage
and r is the percentage of tissue that survived at 48 hours after treatment. We denote with Ri,j the
set of such pairs for cell-line i and drug j.
Problem Definition: The analysis task we address is to predict the response of a tissue to a drug
based on a vector X of predictive variables. Three sets of predictive variables are employed: gene
expressions, miRNA expressions, and protein concentrations. The response of cell-line i to a
drug j is often characterized with a single number that we denote with GI50i,j and corresponds to
log10GI50. GI50i,j is typically not available in the raw data, thus the value of GI50i,j is estimated
from the data in Ri,j. Learning predictive models for GI50i,j given a vector X is a regression task.
Additionally, we are interested in identifying minimal molecular signatures that are optimally
predictive of response and that could provide insight into the molecular mechanisms of the drug.
Figure 1: The drug-response measurements for the cell line HS 578T (Breast Cancer) and the
drug Mechlorethamine. The response values Ri,j are shown, as well as the fitted sigmoid
curve (red color) and the respective piece-wise linear interpolation segments (green color).
We now present an estimation method that employs all available measurements in Ri,j. We
assume the dosage-response curve to have a sigmoid shape where at 0 dosage (i.e., its logarithm
approaches -∞) there is no reduction of the tumour (r = 100%) and at infinity the tumour size is
reduced to zero (r = -100%). The equation of a sigmoid that ranges asymptotically between α and
α+β and crosses the mid-range at γ is
$$ r = \alpha + \frac{\beta}{1 + e^{(d - \gamma)\,\delta}} \tag{1} $$
where δ is a parameter controlling the slope of the function, r the response and d the dosage
(expressed by its logarithm). Considering that asymptotically (presumably for very high
concentrations of the drug) the tumour is completely eradicated and there is a 100% reduction in
its size, then α = -100%. The parameter β was set to 200% so that the range of the response is
between -100% and +100%. The remaining two parameters γ and δ were estimated using least-squares
numerical optimization. Specifically, we used the function nlinfit of Matlab with initial
values γ = -5 and δ = 1. This function performs a number of steps towards the steepest descent
direction for the parameters γ and δ in order to converge to a good-fitting value. In the cases
where the procedure did not converge with these initial values, we repeated it 100 times with
different initial values for the parameters γ and δ, uniformly sampled within [-15, 2] (the range of
all concentrations in the data). Out of these 100 repetitions, the parameter pair that led to the least
mean squared error (MSE) was selected. The estimated GI50i,j values are found by setting r =
50% and solving Eq. 1 for d. In certain cases, fitting a sigmoid leads to extreme values. In order to
detect the outliers we applied the Matlab function deleteoutliers, which iteratively applies the
Grubbs test, testing one value at a time (Grubbs, 1969). If outliers are found, they are trimmed
to ± 2· σj, where σj is the standard deviation of all currently fitted values for drug j. We denote the
final estimates as GI50Sigi,j. Figure 1 shows a graphical depiction of Ri,j for the CCRF-CEM
(Leukemia) cell-line and Carmustine (BCNU) with the fitted sigmoid superimposed. The
corresponding piece-wise linear interpolation segments are also shown in the figure.
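To make the estimation procedure concrete, the following Python sketch fits the sigmoid of Eq. 1 to the ‹d,r› pairs of one cell line and drug and then reads off the GI50. It is an illustration under stated assumptions, not the Matlab code used in this work: the exponent of Eq. 1 is taken to be (d − γ)·δ, a simple compass (pattern) search stands in for nlinfit, random restarts are always run instead of only after a failed fit, and the names sigmoid, fit_sigmoid, and gi50 are hypothetical. The outlier trimming step is not included.

```python
import math
import random

ALPHA, BETA = -100.0, 200.0  # fixed in the text: response ranges from -100% to +100%


def sigmoid(d, gamma, delta):
    """Eq. 1 as read here: r = ALPHA + BETA / (1 + exp((d - gamma) * delta))."""
    z = max(-700.0, min(700.0, (d - gamma) * delta))  # clamp to keep exp() finite
    return ALPHA + BETA / (1.0 + math.exp(z))


def sse(pairs, gamma, delta):
    """Sum of squared errors over the (d, r) pairs of one cell line and drug."""
    return sum((r - sigmoid(d, gamma, delta)) ** 2 for d, r in pairs)


def local_search(pairs, gamma, delta, step=1.0, tol=1e-7, max_iter=5000):
    """Compass search: accept an axis move that lowers the SSE, else halve the step."""
    best = sse(pairs, gamma, delta)
    for _ in range(max_iter):
        if step < tol:
            break
        moved = False
        for dg, dd in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand = sse(pairs, gamma + dg, delta + dd)
            if cand < best:
                gamma, delta, best, moved = gamma + dg, delta + dd, cand, True
                break
        if not moved:
            step /= 2.0
    return gamma, delta, best


def fit_sigmoid(pairs, restarts=100, seed=0):
    """Start at gamma=-5, delta=1, add random starts from [-15, 2], keep the best fit."""
    rng = random.Random(seed)
    starts = [(-5.0, 1.0)] + [(rng.uniform(-15, 2), rng.uniform(-15, 2))
                              for _ in range(restarts)]
    return min((local_search(pairs, g, dl) for g, dl in starts), key=lambda t: t[2])


def gi50(gamma, delta, r=50.0):
    """Solve Eq. 1 for the log10 dosage d at which the response equals r."""
    return gamma + math.log(BETA / (r - ALPHA) - 1.0) / delta
```

Given a list `pairs` of (d, r) measurements, `gamma, delta, err = fit_sigmoid(pairs)` followed by `gi50(gamma, delta)` yields the sigmoid-based estimate. Under the assumed form of Eq. 1, setting r = 50% gives d = γ − ln(3)/δ.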
Feature Selection: The most commonly applied feature (variable) selection method in the field
of personalized medicine is to rank the genes according to their association with the class
(equivalently the p-value) and select the top k. We call this method univariate filtering. In our
work we additionally employed the Max Min Parents and Children algorithm (MMPC)
(Tsamardinos, Brown, & Aliferis, 2006) to select a minimal-size, optimally predictive set of
probe-sets. MMPC is an algorithm that seeks to identify the neighbors of the variable-to-predict
in the Bayesian Network capturing the data distribution by taking into account multivariate
associations. It has been shown to be very effective in recent extensive experiments (Aliferis,
Statnikov, Tsamardinos, Mani, & Koutsoukos, 2010) against an array of state-of-the-art feature
selection methods. In this work, the Causal Explorer implementation of MMPC was used
(Statnikov, Tsamardinos, Brown, & Aliferis, 2009) with the default values for the parameters.
Regression: We employed SVM Regression to construct the predictive models (Boser, Guyon,
& Vapnik, 1992), as implemented in the package libSVM, version 3.01 (Chang & Lin, 2001). In
our experiments we used the radial basis function kernel, with all other parameters set to their defaults.
Estimation of Performance: We used a leave-one-out cross validation protocol due to the small
number of available samples. For each training set, the combination of MMPC and SVM
regression produced a predictive model that was applied on the hold-out test sample. Because feature
selection is repeated on every training set, the held-out sample never influences the selected
variables, which avoids overfitting due to the multiple testing problem (Hastie, Tibshirani, & Friedman, 2009).
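The protocol can be sketched in a few lines of Python. The sketch below only illustrates the nesting of feature selection inside the cross-validation loop; it makes several assumptions that are not part of this work: univariate correlation filtering replaces MMPC, an average of one-variable least-squares fits replaces SVM regression, R2 is computed as one minus the ratio of the residual to the total sum of squares around the overall mean, and all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def select_top_k(X, y, k):
    """Univariate filtering: indices of the k variables most correlated with y."""
    cols = range(len(X[0]))
    score = {v: abs(pearson([row[v] for row in X], y)) for v in cols}
    return sorted(cols, key=lambda v: -score[v])[:k]


def fit_predict(X, y, x_new, features):
    """Average of one-variable least-squares fits (toy stand-in for SVM regression)."""
    n = len(y)
    my = sum(y) / n
    preds = []
    for v in features:
        col = [row[v] for row in X]
        mx = sum(col) / n
        sxx = sum((a - mx) ** 2 for a in col)
        sxy = sum((a - mx) * (b - my) for a, b in zip(col, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        preds.append(my + slope * (x_new[v] - mx))
    return sum(preds) / len(preds)


def loocv_r2(X, y, k):
    """Leave-one-out R^2 with feature selection redone inside every training fold."""
    preds = []
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = select_top_k(X_tr, y_tr, k)  # never sees the held-out sample
        preds.append(fit_predict(X_tr, y_tr, X[i], feats))
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, preds))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot
```

Selecting the variables once on all samples and only then cross-validating the regression would let each held-out sample influence its own prediction, which is the optimistic bias the nested design avoids.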
We have computed R2j for all 120 drugs both when the GI50 values are estimated using piece-
wise linear interpolation as well as when fitting a sigmoid function, as described above. We
denote the corresponding values as RPLIj and RSigj. The results are shown in Figure 2.
The figure shows that GI50 values estimated by the sigmoid are better predicted using the
protocol described above. Thus, at least for the combination of MMPC and SVM Regression,
GI50Sig values facilitate the induction of predictive models better than GI50PLI values. The median
differences (RSig - RPLI) are 5.02%, 8.13%, and 4.09%. The p-values for the null hypothesis that the
median of (RSig - RPLI) is zero, as estimated by a Wilcoxon signed-rank test, are 0.0546, 0.0060,
and 0.0169, when employing gene expressions, miRNA expressions, and protein concentrations
as predictive variables, respectively. Of course, one could argue that the results may not transfer
to other feature selection or regression methods. The results, however, corroborate our intuition
that the sigmoid estimation better preserves information in the Ri,j measurements and, given no
evidence to the contrary, we suggest this method of estimation for future analyses and
employ it for the rest of the paper.
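For readers who wish to repeat this kind of paired comparison, the following Python sketch computes the median difference and a two-sided Wilcoxon signed-rank p-value. It is a simplified illustration rather than the routine used for the numbers above: it relies on the normal approximation without tie or continuity corrections, and the function names and the input lists r2_sig and r2_pli (one R2 value per drug for each estimation method) are hypothetical.

```python
import math
import statistics


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed-rank test using the normal approximation.

    Zero differences are dropped and tied |differences| share their average rank.
    No tie or continuity correction is applied (a simplification).
    """
    d = [x for x in diffs if x != 0]
    n = len(d)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda idx: abs(d[idx]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (min(w_plus, w_minus) - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return w_plus, w_minus, p


def compare_estimators(r2_sig, r2_pli):
    """Median of the paired differences and the signed-rank p-value."""
    diffs = [s - p for s, p in zip(r2_sig, r2_pli)]
    return statistics.median(diffs), wilcoxon_signed_rank(diffs)[2]
```

With 120 paired values per dataset the normal approximation is usually adequate; for small samples an exact test would be the safer choice.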
In the next set of computational experiments we pre-process the GI50 values of each drug to
discretize them as described in (Ma, et al., 2009). Specifically, the class Ci,j of a cell-line i and
drug j is computed as sensitive, intermediate, or resistant if GI50i,j falls within (-∞, µj - 0.5σj], (µj
- 0.5σj, µj + 0.5σj], and (µj + 0.5σj, ∞), respectively, where µj is the average GI50 value over all
cell lines for drug j and σj the standard deviation.
Figure 3: Performance differences when employing MMPC vs. univariate filtering for
feature selection. MMPC is statistically significantly better in the gene expression dataset.
The differences in the other datasets are not significant. Univariate filtering requires
optimization of the number of variables k to retain, while MMPC does not. For these reasons
MMPC is preferable on this task.
Comparing regression vs. classification is not straightforward given that regression outputs a
continuous prediction for GI50i,j while classification outputs its class. To overcome this issue we
discretize the output of the regression models to the three stated classes using the same intervals
as above. This allows us to compute the cross-validated accuracy of the regression for each drug
j, denoted as Dj. In other words, Aj’s are computed by first discretizing the data, then using
classification, and measuring the accuracy of the output, while Dj’s are computed by using
regression, then discretizing the predictions, and computing accuracy.
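The computation of Dj can be sketched as follows. The Python code below is a minimal illustration, assuming that σj is the sample standard deviation and that the class boundaries are derived from the true GI50 values of the drug; the function names are hypothetical.

```python
import statistics


def thresholds(gi50_values):
    """Class boundaries mu - 0.5*sigma and mu + 0.5*sigma for one drug."""
    mu = statistics.mean(gi50_values)
    sigma = statistics.stdev(gi50_values)  # sample standard deviation (an assumption)
    return mu - 0.5 * sigma, mu + 0.5 * sigma


def to_class(value, low, high):
    """A low GI50 means the drug acts at a low dose, i.e. a sensitive cell line."""
    if value <= low:
        return "sensitive"
    if value <= high:
        return "intermediate"
    return "resistant"


def regression_accuracy(gi50_true, gi50_pred):
    """D_j: discretize true values and regression predictions, then compare classes."""
    low, high = thresholds(gi50_true)
    hits = sum(to_class(t, low, high) == to_class(p, low, high)
               for t, p in zip(gi50_true, gi50_pred))
    return hits / len(gi50_true)
```

Here gi50_pred would hold the cross-validated regression outputs for one drug, so that the resulting accuracy can be set against the classification accuracy Aj for the same drug.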
Figure 4 shows the histograms of their difference Dj – Aj for the gene expression, miRNA
expression, and proteomics data over the full set of 120 drugs and the 14 clinically important
ones. The legends of the figures also show the means of these differences. In some cases
regression accuracy scores higher than classification accuracy and in other cases the reverse
happens. Table 1 shows the minima and maxima of their difference Dj – Aj over the full or the
restricted set of drugs for each dataset.
Table 1: Minimum and maximum difference Dj – Aj with the corresponding drug names and NSC
ids, on all datasets and for all drugs and the selected drugs.
Dataset            Quantity   Min (all drugs)     Max (all drugs)        Min (selected drugs)  Max (selected drugs)
Gene expressions   Dj – Aj    -35.8               24.5                   -20.7                 24.5
                   Drug name  Inosine dialdehyde  Oxanthrazale           Doxorubicin           Etoposide
                   NSC id     118994              349174                 123127                141540
miRNA expressions  Dj – Aj    -41.5               26.4                   -20.7                 24.5
                   Drug name  Pyrazoloimidazole   Dichloroallyl lawsone  Paclitaxel            Camptothecin
                   NSC id     51143               126771                 125973                94600
Proteins           Dj – Aj    -43.4               43.4                   -18.9                 30.2
                   Drug name  Hydroxyurea         5-HP                   Vinblastine-sulfate   Camptothecin
                   NSC id     32065               107392                 49842                 94600
These results suggest that in general one should also try regression methods and not only
classification. In terms of mean differences, classification accuracy A is on average higher than
regression accuracy D on the full set of 120 drugs for all three datasets. However, when we focus
on the 14 drugs of high interest, the situation is reversed: D is on average higher than A in two
out of three datasets. Finally, we note that the way this comparison is performed slightly favors
classification. This is because regression methods try to make predictions at a finer level, i.e.,
predict the exact value of the GI50. On the other hand, classification methods make cruder
predictions of the general levels sensitive, intermediate, resistant. When we discretize the output
of the regression methods, they lose this advantage and are compared against classification on
this cruder and less granular scale. Considering all the above, we decide to employ regression
methods for the rest of the paper.
The above combination of methods and experimentation protocols was run on four datasets: gene
expressions, miRNA expressions, protein concentrations, and the combined dataset containing all
molecular quantities. To facilitate the interpretation of the results, we classify drugs according to
how well they are predicted. More specifically, for the Pearson correlation r between two
quantities, Cohen gives the following interpretation guidelines (Wikipedia - Effect Size): small
effect size, r = 0.1 - 0.23; medium, r = 0.24 - 0.36; large, r = 0.37 or larger. Interpreting R2 as
r2 and translating the values, we get approximately the intervals [0.01, 0.05), [0.05, 0.13), [0.13,
1]. The term “effect size” was coined for causal effects; in our case, a correlation does not
necessarily correspond to a causal effect, so the correct interpretation of “effect size” is the
predictability of the GI50 given the molecular quantities. Under this interpretation, several drugs
have a large effect size, while others have a negative effect size (R2 < 0), meaning that our prediction
does not improve on the prediction by the mean value. Summary results are presented in
Table 2.
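The translation from correlation cut-offs to R2 intervals is simple arithmetic, shown here as a short Python sketch. The function names are hypothetical, and values of R2 between 0 and 0.01 are binned as a small effect, following the column headings of the summary tables.

```python
def r2_bounds():
    """Square the correlation cut-offs quoted in the text."""
    cutoffs = {"small": (0.10, 0.23), "medium": (0.24, 0.36), "large": (0.37, 1.0)}
    return {name: (lo ** 2, hi ** 2) for name, (lo, hi) in cutoffs.items()}


def effect_category(r2):
    """Bin a cross-validated R^2 using the rounded intervals of the text."""
    if r2 < 0:
        return "not predicted"
    if r2 < 0.05:
        return "small"
    if r2 < 0.13:
        return "medium"
    return "large"
```

Squaring gives 0.1² = 0.01, 0.23² ≈ 0.053, 0.24² ≈ 0.058, 0.36² ≈ 0.130, and 0.37² ≈ 0.137, which round to the boundaries 0.01, 0.05, and 0.13 used above.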
Table 2: Predictive performance over all 120 drugs and datasets. SVM regression, MMPC for
feature selection, sigmoid fitting for estimation of GI50, and leave-one-out cross validation are
employed. The response to several drugs can be predicted using molecular quantities such as
gene expressions, miRNA expressions, and protein concentrations.
Dataset            Not predicted  Small Effect    Medium Effect      Large Effect
                   (R2 < 0)       (R2: 0 – 0.05)  (R2: 0.05 – 0.13)  (R2 > 0.13)
Gene expressions   95             12              11                 6
Proteins           92             13              11                 4
miRNA expressions  82             10              15                 13
Combined           95             4               15                 6
FOCUS ON THE MOST CLINICALLY RELEVANT SET OF DRUGS
To discover new biological knowledge we focus on a set of 14 drugs that our clinical expert
ODR suggested as the most important for clinical practice. We note that the list was pre-selected
before the beginning of the analysis. Their names and NSC ids are shown in Table 3.
Table 3: A selected set of clinically interesting drugs on which to focus biological interpretation
of results.
Drug Name NSC Id
Methotrexate 740
Fluorouracil (5-FU) 19893
Mitomycin 26980
Vinblastine-sulfate 49842
Vincristine-sulfate 67574
Lomustine (CCNU) 79037
Camptothecin 94600
Cisplatin 119875
Doxorubicin 123127
Taxol (Paclitaxel) 125973
Etoposide 141540
Tamoxifen 180973
Carboplatin 241240
Gemcitabine 613327
In Table 4 we summarize the predictive performance for the 14 drugs using different datasets.
Notice that the addition of variables in the “Combined” dataset sometimes leads to worse
performance. The combined variable set contains more information, but the additional
dimensions of the problem may confuse the learning methods and reduce performance (“curse of
dimensionality”). This phenomenon is particularly acute when the sample size is small, as in this
task.
Table 4: Categorization of predictive performance over the 14 clinically important drugs. GI50
was estimated with sigmoid fitting. SVM regression and MMPC for feature selection were
employed.
Dataset                          Not predicted  Small Effect    Medium Effect      Large Effect
                                 (R2 < 0)       (R2: 0 – 0.05)  (R2: 0.05 – 0.13)  (R2 > 0.13)
Gene expressions                 10             1               1                  2
Proteins                         9              2               2                  1
miRNA expressions                8              1               4                  1
Combined                         9              0               1                  4
Best Achieved over all datasets  5              1               1                  7
Table 5: The list of the selected drugs whose chemosensitivity can be reliably predicted (R2 > 0).
The dataset where the best performance is achieved is shown.
Drug Name          NSC ID  R2      Dataset
Camptothecin       94600   0.3764  Gene Expression
Tamoxifen citrate  180973  0.3190  Combined
Paclitaxel         125973  0.2145  miRNA
Doxorubicin        123127  0.2023  Protein
Carboplatin        241240  0.1829  Gene Expression
Mitomycin          26980   0.1539  Combined
Gemcitabine        613327  0.1419  Combined
Lomustine          79037   0.1176  Protein
Table 6: The selected molecular quantities for each drug on the dataset that achieves the best
predictability performance. The linear Pearson correlation of each quantity with the GI50 is also
shown.
Drug Name                  Dataset          Probe-set ID  Gene/microRNA symbol  Correlation with GI50
Camptothecin               Gene Expression  1563210_at    Unknown               -0.4940
                                            208425_s_at   DKFZP564D166          -0.6105
                                            221013_s_at   APOL2                 -0.5317
                                            226015_at     ZNF127                -0.4921
                                            229284_at     MAT2B                 0.3820
                                            229986_at     LOC377064             0.4943
                                            230254_at     Unknown               -0.4414
                                            230410_at     NRP2                  -0.5528
                                            239664_at     C3orf17               0.4253
Tamoxifen citrate          Combined         1569188_s_at  RPL10                 -0.6849
                                            200915_x_at   KTN1 / PDIA6          0.5577
                                            201840_at     NEDD8                 0.5215
                                            204798_at     MYB                   -0.6387
                                            223783_s_at   GEMIN4                -0.5226
                                            225371_at     GLE1L                 -0.5203
                                            227481_at     CNKSR3                0.6458
                                            (Protein)     PTPN11                -0.5291
Paclitaxel                 miRNA            (miRNA)       hsa-mir-106a          0.6028
                                            (miRNA)       mir_95 left           0.4070
Doxorubicin                Protein          (Protein)     BCAR1                 -0.4486
                                            (Protein)     CASP7                 -0.3588
                                            (Protein)     CCNA2                 0.3334
Carboplatin                Gene Expression  1557805_at    C9orf77               0.4797
                                            1558504_at    LOC440721             0.5643
                                            1565741_at    Unknown               -0.4302
                                            202049_s_at   ZNF262                -0.4595
                                            213572_s_at   SERPINB1              0.5220
                                            225627_s_at   CACHD1                -0.4540
                                            226825_s_at   TPARL                 0.4335
                                            228726_at     SERPINB1              0.6081
                                            244877_at     Unknown               0.5260
Mitomycin                  Combined         1560854_s_at  ZNF588                -0.5395
                                            1562303_at    ZNF306                0.4147
                                            202031_s_at   WIPI2                 -0.4397
                                            209450_at     OSGEP                 0.4967
                                            224374_s_at   EMILIN2               0.4866
                                            237104_at     CTSS                  -0.4330
                                            239637_at     Unknown               -0.6447
Gemcitabine hydrochloride  Combined         211358_s_at   CIZ1                  -0.4616
                                            212873_at     HMHA1                 0.5025
                                            222878_s_at   OTUB2                 0.4714
                                            227575_s_at   C14orf102             0.4422
                                            229935_s_at   MLL                   0.8181
                                            232922_s_at   C20orf59              0.4062
Lomustine                  Protein          (Protein)     GSK3B                 -0.3439
                                            (Protein)     HRAS                  0.4972
                                            (Protein)     MGMT                  0.4455
                                            (Protein)     PTPN11                -0.4927
We now focus on the 8 drugs with medium and strong effects. Their names and the datasets where the
maximum is achieved are shown in Table 5. The selected variables are shown in Table 6. As noted
earlier, the GI50 is the concentration of a given test drug that causes 50%
growth inhibition, corrected for the cell count at time zero. The (univariate) linear Pearson
correlations of each quantity with the GI50 are also shown in Table 6. A positive correlation in
this context means that the larger the value of the mRNA/miRNA/protein, as measured by the
arrays, the larger the GI50 for the drug/sample combination, and thus the larger the resistance of
the tumour to the drug. A negative correlation, on the other hand, implies that the higher the value
of the mRNA/miRNA/protein, the smaller the GI50 and thus the more sensitive the tumour is to the
drug. A brief biological interpretation of the results now follows.
Camptothecins are central in the treatment of colorectal cancer and target the cleavable complex
between topoisomerase I and the DNA, inducing irreversible double-strand breaks. None of
the genes identified in this signature were previously linked to this drug.
Tamoxifen is an anti-estrogen used for the prevention of breast cancer recurrence in estrogen and
progesterone hormone receptor positive breast cancer. It is taken for five years after the
initial curative treatment. In this signature, NEDD8 gene overexpression confers resistance. The
NEDD8 pathway was proposed to provide a mechanism by which breast cancer cells acquire
anti-estrogen resistance while retaining expression of estrogen receptor alpha (Fan, Bigsby, &
Nephew, 2003).
microRNAs are short (18-24 nt) non-coding RNAs that are involved in post-transcriptional
regulation of gene expression in multicellular organisms by affecting both the stability and
translation of mRNAs. Each microRNA can theoretically control hundreds of genes, and there is
an inverse relation between microRNA and mRNA: a high microRNA level reduces the level
of its mRNA target, and vice versa. For paclitaxel, an antitubulin whose main effect is
stabilizing the dynamics of the microtubule system in the interphase and M-phase, inducing DNA
damage, chromosomal imbalance and subsequently apoptosis, two miRNAs were highly
predictive, and overexpression of either conferred resistance. Overexpression of mir-106a has been
connected to increased proliferation in breast cancer, one of the main tumors where paclitaxel is
used (Kim, Chadalapaka, Lee, Yamada, & Sastre-Garau, et al., 2008), through down-regulation
of ZBTB4. We also found that ZBTB4 has a 99% probability of being a target of mir-106a using
appropriate software (MicroRNA Target Prediction) (Saito & Sætrom, 2010). The mir-95 was
recently shown to be overexpressed in 50% of colorectal cancers and to have oncogenic properties
through down-regulation of SNX1 (Huang, Huang, Wang, Liang, & Ni, et al., 2011).
Figure 5: The signature of Lomustine entered in KEGG (Kyoto Encyclopedia of Genes and
Genomes (Kanehisa, Araki, Goto, Hattori, Hirakawa, & et al., 2008)) showing HRAS
overexpression (red) in the oncogenic pathways of brain cancer, here implying resistance
against Lomustine.
damage. Here, high levels of CCNA2 protein and low levels of BCAR1 and CASP7 conferred
resistance. CCNA2 is critical for initiation of DNA replication, transcription and cell cycle
regulation, and its manipulation changes doxorubicin sensitivity; BCAR1 (breast cancer
resistance gene 1) has been shown to confer resistance to tamoxifen and is a prognostic factor in breast
cancer (Dorssers, Grebenchtchikov, Brinkman, Look, & van Broekhoven, et al., 2004); and
CASP7 is an important factor in drug-induced apoptosis.
DNA is the main cytotoxic target of Cisplatin and Carboplatin, which induce single- and
double-strand DNA breaks through adducts and cross-linking, leading to cell death through
apoptosis. For Carboplatin, six genes (seven transcripts) were predictive, and overexpression of
only two genes predicted resistance. One of them, SERPINB1 or PAI-1, is the first predictive and
prognostic biomarker evaluated in a phase III study in breast cancer, where high expression
predicted low chemotherapy effect.
The novel antimetabolite gemcitabine targets RRM1 (ribonucleotide reductase subunit M1). The
signature includes only one gene that has been related to drug resistance, MLL. High MLL expression
was detected in osteosarcoma cell lines resistant to methotrexate, a drug with a mechanism similar
to that of gemcitabine (Hattinger, Stoico, Michelacci, Pasello, & Scionti, et al., 2009).
Lomustine, a chloroethylating chemotherapeutic, is active against tumours in the central nervous
system. Here we found that MGMT protein overexpression predicts resistance. MGMT is a well-known
DNA repair gene that is predictive for Lomustine as part of the PCV regimen for
aggressive brain cancer. MGMT down-regulation due to methylation is common in brain
tumours and strongly affects the response to treatment (Herrlinger, Rieger, Koch, Loeser, &
Blaschke, et al., 2006). Moreover, HRAS overexpression was predictive for Lomustine
resistance. Overexpression of HRAS is a key feature of aggressive brain cancer (Kanehisa,
Araki, Goto, Hattori, & Hirakawa, et al., 2008) and is predictive for survival (Serao, Delfino, Southey,
Beever, & Rodriguez-Zas, et al., 2011), but its predictive value related to Lomustine has not been
established (see Figure 5). For both genes, expression of the respective protein has not previously been
established as a predictive marker, thus this is a novel finding. Moreover, for GSK3B, part
of the AKT/GSK3β/cyclin D1 pathway, where down-regulation confers radio-resistance
(Shimura, 2011), down-regulation here correlated with Lomustine resistance.
Several of these relations have not been explored previously and deserve evaluation in the wet
lab and the clinic.
We approach this problem with a Multi-Task variable selection method; in our case, a task is
defined as the prediction of a specific drug. Multi-task methods try to simultaneously solve a
prediction problem for several tasks at once, in our case the selection of variables and prediction
of a panel of drugs. Such multi-task methods have two aims: (a) attempt to learn a small
signature set that is common to many drugs, (b) improve learning performance by exploiting the
similarities among different predictive tasks. For example, a multi-task method may determine
that a gene that seems marginally predictive when examined for a single drug is deemed
important when examined for several drugs because these drugs may share similar chemical
composition or affect the same pathway.
We chose to use the method introduced in (Argyriou, Evgeniou, & Pontil, 2008). We slightly
optimized the code (available online) for our specific problem. In particular, (a) since all drugs
(tasks) share the same data we kept only one copy of data for all tasks, therefore reducing the
total memory (and time) used by the procedure (by a factor linearly proportional to the total
number of drugs) and (b) removed redundant computations since we only want to select
variables and not learn new features. The number of selected variables is controlled by a
parameter k.
Figure 6: Number of drugs with R2 > 0 as a function of the number of variables k selected
by the MTVS algorithm.
We ran MTVS on the complete set of drugs, in order to take advantage of possible drug
similarities, and show results only for the 14 clinically important drugs. Table 7 summarizes the
results and juxtaposes them with those achieved by MMPC. For k = 50, both methods have similar
predictive performance (MTVS obtains a positive R2 for more drugs, but the values are smaller in
magnitude), while MTVS selects fewer than half as many variables. The difference is
much larger if we consider the whole set of drugs (MMPC selects over 700 variables in total).
Table 7: Classification of predictive performance over the 14 clinically important drugs. SVM
regression, sigmoid fitting for estimation of GI50, and leave-one-out cross validation are
employed for both variable selection methods. Entries in the effect columns are numbers of drugs.
Method   #All Selected Var.   Not predicted   Small Effect     Medium Effect       Large Effect
                              (R2 < 0)        (R2: 0 – 0.05)   (R2: 0.05 – 0.13)   (R2 > 0.13)
MMPC     105                  9               0                1                   4
MTVS     50                   4               6                1                   3
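The binning used in Table 7 can be sketched as below. The thresholds are those of the table headers; how values falling exactly on a boundary (0, 0.05, 0.13) are assigned is an assumption of this example, as are the function names and the R2 values in the usage comment, which are hypothetical and not taken from the study.

```python
from collections import Counter

def effect_category(r2):
    """Map an R2 value to the effect classes used in Table 7."""
    if r2 < 0:
        return "Not predicted"
    if r2 <= 0.05:        # assumed: boundary values go to the lower class
        return "Small effect"
    if r2 <= 0.13:
        return "Medium effect"
    return "Large effect"

def summarize(r2_values):
    """Count how many drugs fall into each effect class."""
    return Counter(effect_category(r2) for r2 in r2_values)

# Hypothetical usage:
# summarize([-0.2, 0.01, 0.04, 0.10, 0.30])
```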
Figure 6 shows the number of drugs with a positive R2 (i.e., that can be predicted) as a function
of the number k of variables selected by MTVS. Using only 23 variables, MTVS can predict a
total of 9 drugs with a positive R2. Increasing the number of variables to 50 allows one more drug
to be predicted. For this specific set of 9 drugs, MMPC requires measuring a total of 71 variables.
This is not an entirely fair comparison, because the set of drugs is chosen a posteriori as the
ones that MTVS predicts best. However, it indicates that multi-task
feature selection is a potentially useful technique that may further reduce the required number of
variables, when prediction is required simultaneously for several drugs, compared to standard
variable selection methods.
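A curve such as the one in Figure 6 can be produced along the following lines. This is a minimal sketch under stated assumptions: ridge regression stands in for the SVM regression used in the paper, R2 is taken to be the usual coefficient of determination computed on leave-one-out predictions, and `loo_r2`, `drugs_predicted_vs_k`, `ranking` and `gamma` are hypothetical names introduced for the example.

```python
import numpy as np

def loo_r2(X, y, gamma=1.0):
    """Leave-one-out R2 of a ridge model (a stand-in for SVM regression)."""
    n, p = X.shape
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        Xtr, ytr = X[train], y[train]
        mu_x, mu_y = Xtr.mean(axis=0), ytr.mean()
        Xc = Xtr - mu_x                       # centre using training data only
        w = np.linalg.solve(Xc.T @ Xc + gamma * np.eye(p), Xc.T @ (ytr - mu_y))
        preds[i] = (X[i] - mu_x) @ w + mu_y   # prediction for the held-out sample
    sse = ((y - preds) ** 2).sum()
    sst = ((y - y.mean()) ** 2).sum()
    return 1.0 - sse / sst

def drugs_predicted_vs_k(X, Y, ranking, ks, gamma=1.0):
    """For each k, count the drugs with positive leave-one-out R2
    when only the k top-ranked variables are used."""
    curve = {}
    for k in ks:
        cols = ranking[:k]
        curve[k] = int(sum(loo_r2(X[:, cols], Y[:, t], gamma) > 0
                           for t in range(Y.shape[1])))
    return curve
```

Here `ranking` would be the variable ordering returned by a multi-task selection step. For a strictly unbiased estimate the ranking would have to be recomputed inside each leave-one-out fold, which this sketch omits for brevity.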
CONCLUSION
Predicting the chemosensitivity of tumours from gene expressions is important for selecting
treatment, understanding the molecular mechanisms of drug response, and selecting molecular
signatures. In this paper, we show that predictive performance can sometimes be improved by
employing a new method for estimating the GI50 (an indicator of the response to a drug), regression
algorithms instead of classification, and state-of-the-art multivariate feature selection. In
addition, we show that, by employing multi-task feature selection methods, common signatures
for several drugs can be found that are smaller than those found by single-task variable selection
methods. The signatures identified here have several known links to cancer progression and
resistance to chemotherapy. Knowledge of these relations is still expanding, and the methods
used to identify these signatures may be an important tool for generating novel biological hypotheses.
Acknowledgements
EC was supported for this research by the ContraCancrum EU FP7 STREP GA 223979 and the
Institute of Computer Science of the Foundation for Research and Technology, Hellas. We
would like to thank Prof. Sætrom of the Norwegian University of Science and Technology
(NTNU), Trondheim, Norway for microRNA-gene target analysis. We would like to thank Amos
Folarin for introducing us to this problem. We would like to thank Matina Fragoyanni for her
code on downloading drug response data. Thanks to Sofia Triantafillou, Vincenzo Lagani and
Angelos Armen for their feedback and fruitful comments.
REFERENCES
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010). Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, Special Topic on Causality 11, 171-234.
Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex Multi-Task Feature Learning. Machine Learning, Special Issue on Inductive Transfer Learning, 73(3), 243-272.
Augustine, C. K., Yoo, A., Potti, J. S., Yoshimoto, Y., Zipfel, P. A., Friedman, H. S., et al. (2009). Genomic and molecular profiling predicts response to temozolomide in melanoma. Clinical Cancer Res 15(2), 502.
Chang, C. C., & Lin, C. J. (2001). LIBSVM: a library for support vector machines.
Dorssers, L.C., Grebenchtchikov, N., Brinkman, A., Look, M.P., van Broekhoven, S.P., et al. (2004) The prognostic value of BCAR1 in patients with primary breast cancer. Clin Cancer Res 10: 6194-6202.
Fan, M., Bigsby, R.M., Nephew, K.P. (2003) The NEDD8 pathway is required for proteasome-mediated degradation of human estrogen receptor (ER)-alpha and essential for the antiproliferative activity of ICI 182,780 in ERalpha-positive breast cancer cells. Mol Endocrinol 17: 356-365.
Grubbs, F. (1969). Procedures for Detecting Outlying Observations in Samples. Technometrics 11(1), 1-21.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics.
Hattinger, C.M., Stoico, G., Michelacci, F., Pasello, M., Scionti, I., et al. (2009) Mechanisms of gene amplification and evidence of coamplification in drug-resistant human osteosarcoma cell lines. Genes Chromosomes Cancer 48: 289-309.
Herrlinger, U., Rieger, J., Koch, D., Loeser, S., Blaschke, B., et al. (2006) Phase II trial of lomustine plus temozolomide chemotherapy in addition to radiotherapy in newly diagnosed glioblastoma: UKT-03. J Clin Oncol 24: 4412-4417.
Huang, Z., Huang, S., Wang, Q., Liang, L., Ni, S., et al. (2011) MicroRNA-95 promotes cell proliferation and targets sorting Nexin 1 in human colorectal carcinoma. Cancer Res 71: 2582-2589.
Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., et al. (2008) KEGG for linking genomes to life and the environment. Nucleic Acids Res 36: D480-484.
Kim, K., Chadalapaka, G., Lee, S.O., Yamada, D., Sastre-Garau, X., et al. (2011) Identification of oncogenic microRNA-17-92/ZBTB4/specificity protein axis in breast cancer. Oncogene.
Ma, Y., Ding, Z., Qian, Y., Wan, Y. W., Tosun, K., Shi, X., et al. (2009). An integrative genomic and proteomic approach to chemosensitivity prediction. Int. J. Oncol. 34(1), 107-115.
NTNU - Faculty of Medicine, MicroRNA Target Prediction. (n.d.). Retrieved from tare.medisin.ntnu.no/mirna_target/
Potti, H. K., Dressman, A., Bild, A., Riedel, R. F., Chan, G., Sayer, R., et al. (2006). Genomic signatures to guide the use of chemotherapeutics. Nature Medicine 12, 1294-1300.
Saito, T., & Sætrom, P. (2010). A two-step site and mRNA-level model for predicting microRNA targets. BMC Bioinformatics, 612.
Serao, N.V., Delfino, K.R., Southey, B.R., Beever, J.E., Rodriguez-Zas, S.L. (2011) Cell cycle and aging, morphogenesis, and response to stimuli genes are individualized biomarkers of glioblastoma progression and survival. BMC Med Genomics 4: 49.
Shankavaram, U. T., Varma, S., Kane, D., Sunshine, M., Chary, K. K., Reinhold, W. C., et al. (2009). CellMiner: a relational database and query tool for the NCI-60 cancer cell lines. BMC Genomics 10, 277.
Shimura, T. (2011) Acquired radioresistance of cancer and the AKT/GSK3beta/cyclin D1 overexpression cycle. J Radiat Res (Tokyo) 52: 539-544.
Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21(5), 631-645.
Statnikov, A., Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2009). Causal explorer: A Matlab library of algorithms for causal discovery and variable selection for classification. Challenges in Causality 1.
Staunton, J. E., Slonim, D. K., Coller, H. A., Tamayo, P., Angelo Michael, J., Park, J., et al. (2001). Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. 98(19), 10787-10792.
Steel, R. G., & Torrie, J. H. (1960). Principles and Procedures of Statistics. New York: McGraw-Hill.
Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Journal of Machine Learning 65, 31-78.