
Applied Soft Computing Journal 74 (2019) 264–273

Contents lists available at ScienceDirect

Applied Soft Computing Journal


journal homepage: www.elsevier.com/locate/asoc

Conference

Comparison of common machine learning models for classification of tuberculosis using transcriptional biomarkers from integrated datasets✩

Carly A. Bobak a,b, Alexander J. Titus a,c, Jane E. Hill a,b,∗

a Program in Quantitative Biomedical Sciences, Dartmouth School of Graduate and Advanced Studies, Hanover, NH, USA
b Thayer School of Engineering, Dartmouth School of Graduate and Advanced Studies, Hanover, NH, USA
c Department of Epidemiology, Dartmouth Geisel School of Medicine, Hanover, NH, USA

Highlights

• Four datasets were integrated for the identification of biomarkers for TB diagnosis.
• Three machine learning algorithms were compared using the integrated data.
• Random Forest had the optimal global sensitivity of 0.8973.
• Analysis of the returned biomarkers reveals immune response pathways.

Article info

Article history:
Received 19 June 2018
Received in revised form 21 September 2018
Accepted 3 October 2018
Available online 22 October 2018

Keywords:
Tuberculosis diagnostics
Transcriptional biomarkers
Data integration
Machine learning models
Expression array data

Abstract

Tuberculosis (TB) is a top-10 cause of death worldwide, and new diagnostics are a key element of the World Health Organization's End TB strategy. Significant research efforts have gone into trying to identify a transcriptional signature from patient blood in order to diagnose TB, but a consistent signature for heterogeneous populations has remained elusive. In this work, we propose a data analysis framework which directly integrates multiple publicly-available expression array datasets in order to identify a more reliable gene signature for the diagnosis of TB. The proposed method was built using 4 distinct datasets spanning a total of 1164 samples and 4 countries. The performance and selected gene features of three different machine learning classifiers were compared in the context of this multi-cohort framework. A Random Forest classifier provided the best classification results, with an AUC of 0.8646 in our validation data. Gene ontology enrichment analysis revealed that the selected gene features across the three models are all related to immunological processes.

✩ This paper is an extended, improved version of the paper 'Investigating Random Forest Classification on Publicly Available Tuberculosis Data to Uncover Robust Transcriptional Biomarkers', presented at the AI4Health 2018 workshop and published in: BIOSTEC 2018, Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies, Volume 5: HEALTHINF, Funchal, Madeira, Portugal, 19–21 January, 2018, pp. XX-YY, ISBN: 978-989-758-281-3, INSTICC, 2018.
∗ Corresponding author at: Thayer School of Engineering, Dartmouth School of Graduate and Advanced Studies, Hanover, NH, USA.
E-mail addresses: Carly.A.Bobak.GR@dartmouth.edu (C.A. Bobak), Alexander.J.Titus.GR@dartmouth.edu (A.J. Titus), Jane.E.Hill@dartmouth.edu (J.E. Hill).

https://doi.org/10.1016/j.asoc.2018.10.005
1568-4946/

1. Introduction

Tuberculosis (TB), a disease caused by the bacterium Mycobacterium tuberculosis (Mtb), was the cause of 1.7 million deaths in 2016 — making it the deadliest infectious disease worldwide [1]. While TB infection may occur in many parts of the body, it is most frequently seen in the lungs. As such, sputum culture is the current gold standard diagnosis for pulmonary TB. However, this diagnostic takes weeks to return a positive result. Other screening technologies typically employed in high-burden TB settings, such as GeneXpert®, also rely on sputum samples. However, as many as one-third of TB patients, including most children, are unable to reliably produce sputum [2]. Therefore, a key element in the WHO's End TB Strategy is a call for rapid, non-sputum-based diagnostics [3].

Pathogens stimulate a unique host immune response, and measuring these responses can be used to develop diagnostic signatures [4]. A common method of developing a biological signature is using DNA microarray gene expression values, often obtained from the host's blood samples [5,6]. Given the importance of developing new diagnostics for TB, it is unsurprising that several blood-based transcriptional signatures have been proposed [7–11]. To date, none have been supported by the WHO for translation to a clinical setting [1].

A universal diagnostic must be robust to inherent population heterogeneity in order to successfully translate to all clinical settings. Unfortunately, biomarkers identified from population-specific TB transcriptional studies fail to reproduce across different cohorts [7,12]. Studies funded by the United States of America's National Institutes of Health (NIH) have to make their data publicly available, including those investigating TB [13,14]. Consequently, some researchers have attempted to develop a universal transcriptional signature for TB, with validation studies currently underway [7].

Integrating disparate transcriptional datasets and conducting data analysis on them is non-trivial. Differences in research objectives, cohorts, methods, instruments, data collected, and data processing often lead to confounding batch effects and result in spurious differences between study groups [15–17]. The work presented in this paper is a considerable extension of our preliminary analysis of four publicly available datasets [8–11], which was previously presented at the AI4Health Workshop in 2018 [18]. Here, we (1) propose a framework for directly integrating transcription data from multiple clinical cohorts and (2) compare and contrast three established machine learning tools in order to identify a whole blood transcriptional signature for TB that is resilient to global population heterogeneity. Moreover, the addition of a Gene Ontology (GO) enrichment analysis adds a layer of biological validation to the proposed biomarkers, as well as motivates the utility of our approach in a clinical context [19].

2. Background

2.1. Current multi-cohort method

One multi-cohort method has been proposed which aims to leverage publicly available microarray datasets to identify robust signatures [6,7,12,20]. This method is a meta-analytic framework using a DerSimonian-Laird random effects model that utilizes datasets from the NIH Gene Expression Omnibus (GEO) repository to identify transcriptional signatures which are resilient to heterogeneity (population, transcriptional profiling instrumentation, et cetera). Their method estimates an effect for every gene in every dataset, initially filters genes by effect size and adjusted p-values, and then runs a greedy forward search in order to maximize the Area Under the receiver operating characteristic Curve (AUC), using the geometric mean of up- and down-regulated expression to construct a score. This gene signature and subsequent score are then validated in independent datasets [6,7,12,20]. When applying this methodology to the study of TB, a three-gene signature was proposed that could differentiate between active TB, LTBI, healthy controls, and other diseases [7]. Sweeney and colleagues obtained AUCs of 0.9, 0.88, and 0.84 for discriminating between active TB and healthy controls, active TB and LTBI, and active TB and other diseases using this approach (see Table 3 for additional metrics).

2.2. Cross normalization and COCONUT

Machine learning models benefit from having diverse training data, but combining datasets is complicated by batch effects, such that machine learning classifiers cannot efficiently learn the distinctions between diseased and non-diseased samples. Adjustment is often necessary to reduce the impact of batch effect, but no gold standard exists in the context of many expression array datasets [17,21]. In this study, we elected to utilize a normalization method called COmbat CONormalization Using conTrols (COCONUT), wherein a ComBat empirical Bayes normalization method is applied to healthy controls across all datasets, and the derived parameters are used to normalize the diseased samples [6,22]. Many other conormalization techniques assume samples all come from the same distribution, but this assumption is violated with the wide variety of other diseases present across the selected datasets. Because COCONUT derives the ComBat parameters exclusively from the controls, bias due to the presence of diseased samples is not introduced to the normalization technique [6,23]. Patients with other diseases are, in clinical practice, an essential cohort to have contextualized. Specifically, patients will come into a clinic presenting as TB suspects, and it is these patients a transcriptional diagnostic needs to target effectively. Previous studies have noted that while COCONUT greatly improved the batch effect seen in the integrated cohort, study-specific projections were still noticeable in unsupervised visualizations of the data [18].

2.3. Machine learning for biomarker discovery

Machine learning (ML) algorithms have been demonstrated to be useful tools for classifying transcriptomic data [24]. DNA expression array data is high-dimensional, hence classifiers need to be adept at solving the p >> n problem, the case where the number of parameters greatly outnumbers the sample size. Moreover, gene expression features tend to be highly correlated, and so models need to be robust to multicollinearity [24–26]. A final, but crucial, consideration for the classification of DNA expression array data is the importance of feature selection. Feature selection is necessary in order to obtain optimal model results as well as a reduced, manageable number of features that can be translated into cost-effective diagnostics [7,12,27,28].

Many classifiers may meet these feature selection criteria; however, the 'no free lunch' theorem of optimization states that there is no one optimal algorithm which is appropriate in all cases [29,30]. One must consider the application domain in selecting an appropriate ML model. Uncovering novel TB pathogenesis biology is an important sub-objective of this transcriptional analysis; therefore, models with the greatest explanatory power are particularly relevant in a clinical setting [31]. Taken together, we employed partial least squares-discriminant analysis (PLS-DA), support vector machines (SVM), and random forests (RF) (Fig. 1). These models are highly interpretable, frequently used, and favored in the biomedical data community, which may improve adoption of our framework [18,25–27,32–39].

PLS-DA is a derivative of Partial Least Squares Regression methods specifically designed for supervised clustering of data.

Fig. 1. Visual representations of the three machine learning classifiers explored in this paper. (A) represents a PLS-DA algorithm, (B) represents an SVM algorithm with a
polynomial kernel and (C) represents an RF algorithm.
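All three classifiers in Fig. 1 are tuned with repeated k-fold cross-validation (Section 2.3). As an illustration only (the paper's analysis was done in R with 'caret'; this sketch and every name in it are ours), the repeated 5-fold splitting idea can be written as:

```python
import random

def repeated_kfold(n_samples, k=5, repeats=10, seed=0):
    """Yield (train, test) index lists: each repeat reshuffles the sample
    order and splits it into k roughly even folds, and every fold serves
    as the held-out test set exactly once per repeat."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(repeats):
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for fold in folds:
            held_out = set(fold)
            train = [i for i in indices if i not in held_out]
            yield train, fold

splits = list(repeated_kfold(20, k=5, repeats=10))
assert len(splits) == 50                          # 5 folds x 10 repeats
assert all(len(test) == 4 for _, test in splits)  # even-sized folds
assert set(splits[0][0]).isdisjoint(splits[0][1]) # train/test never overlap
```

Collecting the model's accuracy over all 50 test folds gives the accuracy distribution described above; the repeats reduce the fold-to-fold variance of a single k-fold pass.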

The primary objective of the algorithm is to maximize the covariance between each data point and a dependent variable (e.g., TB case status) in a high dimensional setting by identifying whether groups are sufficiently different, and determining which features contribute most significantly to this difference. To do this, the high dimensional data is mapped to a linear subspace of the explanatory variables [22,25,30]. PLS-DA is often thought of as a supervised version of PCA, as it is a dimensionality reduction of the variables that maintains the known classification labels. While PLS-DA is a useful method in many contexts, it can be sensitive to noisy data. As well, the PLS-DA components, much like PCA's principal components, rely on linear combinations of the features, which may not best describe the data [40].

In SVM algorithms, data is projected into a high dimensional subspace, and a hyperplane is identified which best separates the predefined groups. Depending on the selected kernel, SVM can transform the data into a higher dimensional linear, radial, or polynomial space in order to identify the hyperplane which minimizes the number of misclassifications and maximizes the margin between groups. As such, it is a flexible, non-parametric method for the classification of transcriptomic data [32,35,38,39]. However, SVM has a few drawbacks, mainly that it does not perform well if there is no margin between groups (i.e., the groups overlap). As well, SVMs do not directly calculate the probability of belonging to a particular group; this must be estimated in a separate process [41,42].

Random forests (RF) are based on a collection of many decision trees — wherein features are selected to be nodes along a tree, and samples walk along the tree until they are sorted into classes. Each tree is generated by taking a random sample of the data and a random sample of the features, and then selecting nodes which best classify the data according to its label at each level of the tree. The importance of each feature in the model can then be measured as a function of how frequently it is chosen to be a node, or based on the mean decrease in accuracy when the feature is removed from the model. In general, RF models perform well across various types of -omics data, although they can overfit the data [18,25,26,35,36].

In each of these models, there is a collection of parameters which must be tuned. Cross Validation (CV) can reduce the risk of overfitting our models to the training data while tuning these parameters. A popular CV method is k-fold cross-validation, where data is split into k even-sized pieces; k−1 of these pieces are used to train the model, and the model is tested on the last remaining piece [43,44]. This process iterates so that each piece of the data is used to test the model. An accuracy distribution is defined based on the model's performance on each of the test pieces. While k-fold cross validation is a useful technique for tuning model parameters, there is often still large variance in the results. Repeated CV, where the k-fold process is repeated with multiple different cuts of the data, can ameliorate this [44,45].

To our knowledge, the framework proposed in this paper is the first to directly integrate transcriptomic data across multiple, independent cohorts and evaluate three common machine learning classifiers for their performance as potential diagnostic tools. This data analysis pipeline can easily be applied to other disease contexts, utilizing the wealth of publicly available -omics data to address some of the issues surrounding reproducibility in preclinical research, drive scientific insight into pathogenesis, and facilitate diagnostic development.

3. Datasets and proposed approach

3.1. Data mining and collection from GEO

The National Institutes of Health's (NIH) Gene Expression Omnibus (GEO) is a publicly available database for functional genomics data and, at the time of writing, home to over 98,000 series records, each representing a genomic study [14]. The datasets presented in this paper were collected from the GEO database, where 'tuberculosis' and 'TB' were used as the key search terms among expression array datasets of whole blood from human subjects. Study-level eligibility criteria for inclusion in this particular analysis required that each dataset include controls for normalization purposes, have at least 100 distinct samples, and originate from a distinct institution. A brief summary of the sample distributions of these studies is presented in Table 1.

Study subject samples originated from the United Kingdom, France, South Africa, and The Gambia. Patients' ages ranged from 16 to 87 years. Each study excluded patients who were HIV-positive. While GSE19491, GSE28623, and GSE42834 were all focused on pulmonary TB, GSE83456 also included extrapulmonary cases. Qualification of TB-positive patients varied between datasets: GSE19491 and GSE42834 both used culture confirmation in either sputum or bronchial lavage [8,10], GSE28623 relied on patients who were both smear-positive and had chest X-rays indicative of TB [9], and GSE83456 used a combination of culture at the infection site, caseating granuloma on biopsy, and/or clinical/radiological features consistent with active TB [11]. These datasets include patients afflicted with the following under the Other Disease category: Streptococcal Pharyngitis, Staphylococcus infection, Still's disease, Systemic Lupus Erythematosus, Sarcoidosis, Pneumonia, and Lung Cancer [8–11]. In total, 1164 samples were included in this analysis. Each expression set was used as deposited, with the exception of checking to make sure values had been appropriately log2 transformed.

3.2. Proposed approach

Our proposed approach for identifying a transcriptional biomarker signature for active TB can be broken down into two primary problems. First, suitably merging and integrating the individual cohorts into one multi-cohort dataset. Second, identifying appropriate machine learning algorithms which not only accurately classify active TB, but are also able to identify key features which can be exploited for further diagnostic development.
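The first of these problems, merging cohorts measured on different platforms, is handled in Section 3.3 by mapping probes to gene symbols, taking the median over duplicate probes, and keeping only genes shared by all studies. A toy sketch of that merge (in Python rather than the R used in the paper; the probe IDs and expression values below are invented):

```python
from statistics import median

def collapse_probes(probe_values, probe_to_gene):
    """Collapse probe-level expression to gene level: where several
    probes map to one gene symbol, keep the median expression value."""
    by_gene = {}
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene.setdefault(gene, []).append(value)
    return {gene: median(values) for gene, values in by_gene.items()}

def merge_on_shared_genes(datasets):
    """Condense merged data to only the genes represented in every
    dataset (the paper retains 15,821 such genes across four studies)."""
    shared = set.intersection(*(set(d) for d in datasets))
    return [{g: d[g] for g in shared} for d in datasets]

# Toy example: one study with duplicate probes, one already gene-keyed.
study_a = collapse_probes(
    {"p1": 2.0, "p2": 4.0, "p3": 1.0},
    {"p1": "FCGR1A", "p2": "FCGR1A", "p3": "FCGR1B"})
study_b = {"FCGR1A": 3.5}
merged = merge_on_shared_genes([study_a, study_b])
assert study_a["FCGR1A"] == 3.0      # median of the two FCGR1A probes
assert set(merged[0]) == {"FCGR1A"}  # only shared genes survive the merge
```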

Fig. 2. (A) A summary of the dataset merging and integration process. (i) represents the mining process from NCBI's GEO database and the selection of 4 distinct datasets. (ii) represents merging each dataset based on gene symbol to obtain one multi-cohort dataset. (iii)–(vi) represent the 'COCONUT' conormalization process, where (iii) is splitting the healthy controls from the diseased samples, (iv) is obtaining the ComBat parameters from the healthy controls, (v) is applying the obtained parameters to the diseased samples, and (vi) is recombining the normalized healthy control and diseased samples into one dataset. (B) A summary of the model building and feature selection process used. Three machine learning classifiers were built on the training data: an RF model, an SVM model with a polynomial kernel, and a PLS-DA model. Each model was built using a repeated 5-fold CV process to first define the feature importance. Feature selection was conducted using a plus-L, minus-R selection (LRS) sequential selection method. After feature selection was completed, repeated 5-fold CV was used again to tune the final model parameters, and the final models were used to evaluate the class probabilities on the validation data.
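The control-based conormalization in steps (iii)–(vi) can be illustrated with a simplified location/scale stand-in for ComBat: per-study parameters are estimated from healthy controls only and then applied to every sample in that study, so diseased samples never influence the fit. (The published COCONUT uses ComBat's empirical Bayes estimates, not the plain mean and standard deviation used here; the data values are invented.)

```python
from statistics import mean, stdev

def conormalize(studies):
    """For each study (a list of (label, value) samples for one gene),
    estimate location/scale from its healthy controls only, then
    standardize ALL of that study's samples with those parameters."""
    normalized = []
    for samples in studies:
        controls = [v for label, v in samples if label == "healthy"]
        mu, sigma = mean(controls), stdev(controls)
        normalized.append([(label, (v - mu) / sigma) for label, v in samples])
    return normalized

# Two toy 'studies' of one gene with different batch offsets.
study1 = [("healthy", 5.0), ("healthy", 7.0), ("TB", 9.0)]
study2 = [("healthy", 1.0), ("healthy", 3.0), ("TB", 5.0)]
norm1, norm2 = conormalize([study1, study2])

# After conormalization the control distributions coincide across studies,
# and each TB sample lands in a comparable position relative to them.
assert norm1[2][0] == norm2[2][0] == "TB"
assert abs(norm1[2][1] - norm2[2][1]) < 1e-9
```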

Table 1
The number of samples in each class of interest by each selected dataset, as denoted by the unique GEO ID. 'TB' indicates active tuberculosis, and 'LTBI' indicates latent tuberculosis infection. The other diseases included in these datasets span Streptococcal Pharyngitis, Staphylococcus infection, Still's disease, Systemic Lupus Erythematosus, Sarcoidosis, Pneumonia, and Lung Cancer.

GSE          Healthy Control   Treated TB   LTBI   TB    Other Disease
19491 [8]    133               14           69     89    193
28623 [9]    37                –            25     46    –
42834 [10]   143               –            –      65    148
83456 [11]   61                –            –      92    49
Total        374               14           94     292   390

Within our data, there are many classes of interest: healthy controls, latent tuberculosis infection (LTBI), treated TB, active tuberculosis (TB), and other diseases. We are primarily interested in being able to identify active TB regardless of the comparison group, and hence are utilizing '1 vs. the rest' algorithms in this particular study. While multi-class algorithms are also an option, a binary classification model is simpler and the most relevant in a clinical context, which classically relies on positive or negative test results [18].

3.3. Dataset merging and integration

In order to directly merge datasets, some unique identifier, or link, is needed. While GSE19491, GSE42834, and GSE83456 were all originally analyzed using Illumina Systems technology, and hence share probe IDs, GSE28623 was analyzed using an Agilent platform [8–11]. Therefore, probe IDs were first matched to their gene symbol and, where multiple probe IDs matched one gene symbol, we took the median expression value across all probes. Each distinct expression array dataset was then merged using the unique gene symbols, and the final dataset was condensed to only those genes represented in all four expression sets. In total, the expression values of 15,821 genes were considered as initial features for the model. A summary of the dataset merging and integration process is shown in Fig. 2(A).

All statistical analyses were performed using R v3.4.3 (R Foundation for Statistical Computing, Vienna, Austria). Before any model building occurred, a principal components analysis (PCA) was used to visualize any batch effect in the merged dataset (see Supplementary Figure 1) [46]. It was immediately clear that severe batch effects by study are present in the merged dataset, with the exception of GSE42834 and GSE83456. Both of these studies used the Illumina HumanHT-12 V4.0 expression beadchip, and hence their results appear to be more comparable [6,9,11,23]. In order to adjust for batch effect without violating the assumption that our samples came from identical distributions despite a wide variety of diseases present, we used 'COCONUT' and specified non-parametric priors [6,23]. This conormalization was implemented using the 'COCONUT' package in R [23]. After conormalizing the datasets, another PCA was performed to assess the degree to which the data was adjusted to remove batch effect (Supplementary Figure 2). The conormalized data were unit-scaled and mean-centered. The dataset was split at random into two sets: a training portion (2/3 of the data) and a validation portion (1/3). In total, 775 samples were used in the training data, and 389 samples were used in the validation set. These are identical to the training/testing sets used in the previous study [18].

3.4. Model building and comparison

All the models presented here were built using the 'caret' package in R using only the training data (n = 775) [42].

In total, three different machine learning classifiers were used: Random Forest (RF), Support Vector Machine (SVM) with a polynomial kernel, and Partial Least Squares-Discriminant Analysis (PLS-DA). In order to define a feature importance ranking for each of these models, Mean Decrease in Accuracy (MDA), feature-specific Area Under the Receiver Operating Characteristic curve (AUROC), and the weighted sums of the absolute regression coefficients were used for RF, SVM, and PLS-DA, respectively [42]. A summary of the model building and feature selection process is shown in Fig. 2(B).

Feature selection was conducted using plus-L, minus-R sequential selection (LRS). This method is similar to a greedy forward search or recursive feature elimination, wherein features are added to the reduced model based on their importance ranking derived from the full model. The algorithm for this method is as follows:

1. Add the top L features to the reduced feature set, and check the performance statistic.
2. Iteratively remove up to R features from the feature set, and check the performance statistic; if performance does not improve, return to 1.
3. Iterate until overall performance does not improve.

LRS feature selection is more resistant to getting stuck in local maxima or local minima than basic sequential feature selection algorithms [47]. In our implementation of LRS, L was set to 5 features, and R was set to iterate up to 4 features.

In order to evaluate the performance of the reduced model, the Squared Error, Accuracy, and ROC (SAR) statistic was used, defined as:

SAR = (ACC + AUROC + (1 − RMSE)) / 3

SAR is a metric which has been demonstrated to be robust to model nuances and hence is ideal for comparing the performance of models which may be best optimized using different performance statistics [48]. For instance, RMSE is an appropriate measure of performance for PLS-DA but does not perform well in SVM contexts [48,49]. By using the SAR measure, we can construct our models in a way that is less biased toward any one classifier due to statistic-specific nuances. To fit each of these models to the training data, a 5-fold CV scheme was used with 10 repeats, as suggested in [50], to achieve reasonable precision and accuracy while maintaining a reasonable computational burden. The scheme was used to first identify the most important gene features for discriminating active TB cases, and then again to tune the parameters on the reduced model. For SVM, this requires tuning the scale, offset, and degree parameters of the polynomial kernel. In PLS-DA, the number of components to use must be tuned. And in RF, the number of variables which are sampled as candidates for each split (mtry) must be tuned [36–38,42].

The final models were then defined using the features selected from the LRS process with the optimal tuning of necessary parameters using 5-fold cross validation with 10 repeats. The 'caret' package includes the ability to use a grid search to optimize parameter tuning, and all parameters in all models were tuned according to this procedure [42]. A grid-based search is generalizable to any dataset, and thus no manual selection of parameter values is needed to extend this approach to different applications [51]. To evaluate the performance of the final models, each model was used to predict onto the previously unseen validation data (n = 389).

3.5. Feature selection validation and visualization

Beyond testing the performance of our classifiers on the held-out validation set, we also sought to add both biological and computational credibility to our selected features. In order to assess biological credibility, a Gene Ontology (GO) enrichment analysis was conducted. GO enrichment analysis seeks to add additional layers of biological understanding to gene sets. To do this, it identifies genes which are overrepresented among gene sets and links these to known functional annotations for those genes [19].

To computationally validate our selected features, two unsupervised methods were employed. First, Hierarchical Clustering Analysis (HCA) was used to visualize the Euclidean distance between every sample and every gene, and a heatmap showing the relative expression of the selected features was constructed [52]. As well, a t-distributed Stochastic Neighbor Embedding (t-SNE) was used to visualize whether or not active TB cases clustered together. For this particular study, t-SNE is superior to PCA as it is both non-linear and non-parametric, and better reflects the machine learning models used in the feature selection [53].

4. Results and discussion

The final PLS-DA model was built using 60 features and tuned to use 3 components. The reduced SVM model with a polynomial kernel was built using 24 features, where the parameters were set to degree = 2, offset = 0.5, and scale = 0.1 through cross-validation. The final RF model built using our framework used 25 features and had mtry = 13, where mtry is defined as the number of predictor variables which are randomly sampled as candidates at each split of a given tree. A complete list of the features used by these models, as well as HCA and t-SNE plots for each individual model, can be found in the Supplementary Tables and Figures.

Table 2 shows the results of each of the final models across the training sets, validation sets, and overall in the entire integrated multi-cohort. All three models had strong accuracy, specificity, SAR, MSE and AUC performance measures. However, the sensitivity of these models was low. The only model which achieved reasonable sensitivity was the RF, which had perfect sensitivity in the training data and 69% sensitivity in the validation data. The low sensitivity across models may be due in part to the unbalanced nature of the data, as only 292/1164 samples had the active class of interest. Resampling methods, such as up- or down-sampling, may ameliorate this issue in future iterations [54]. Another possibility potentially contributing to the low sensitivity measures is the inherent complexity of distinguishing active TB cases from other diseases. Arguably, in a clinical context, this is the most relevant problem, as most TB-suspect patients will be suffering from some kind of malady, likely with a similar phenotypic presentation to active TB. To address this issue, more datasets which have additional Other Disease samples and rule out TB may provide additional power to further elucidate a unique transcriptional signature for active TB [12].

Fig. 3 shows a comparison of the AUROC curves from each model in both the training and validation data. Similar to the results shown in Table 2, we observe that the three models performed comparably on the validation data according to AUC. The SVM model had the best performance on validation data overall, but the AUC of the RF model on the validation set was only lower by a margin of 0.0021. Given the RF model's substantially better sensitivity compared to the SVM model, we conclude that in this particular study the RF model has the best performance. In terms of disease diagnostics, it has previously been suggested that AUCs of 0.5–0.7 suggest little-to-no discrimination, 0.7–0.8 should be interpreted as acceptable, 0.8–0.9 should be interpreted as excellent, and above 0.9 are considered outstanding [55].

It is important to note that the AUC curves themselves represent a trade-off between test sensitivity and specificity, where different cutoffs of model scores may be selected to optimize for either sensitivity or specificity, depending on the disease context [55]. The WHO recommends that tuberculosis tests should aim for a target sensitivity of 90% (75%–91%) for pulmonary TB in adults, and a target specificity of 92% (77%–94%) [56].

Table 2
The performance metrics used to assess each of the final models on the training data, validation data, and overall among the entire multi-cohort dataset. The number of features used in the final model as selected by the LRS process is shown in brackets.

Statistic     PLS-DA (60)                 Polynomial SVM (24)         Random Forest (25)
              Train   Validate Overall    Train   Validate Overall    Train   Validate Overall
Accuracy      0.8658  0.8226   0.8524     0.9123  0.8380   0.8875     1       0.8612   0.9536
Sensitivity   0.5744  0.5464   0.5651     0.7077  0.5773   0.6644     1       0.6907   0.8973
Specificity   0.9638  0.9144   0.9472     0.9810  0.9247   0.9622     1       0.9178   0.9725
SAR           0.7927  0.7529   0.7789     0.8766  0.7924   0.8442     0.9650  0.7947   0.9031
MSE           0.1568  0.1655   0.1552     0.0677  0.1552   0.0846     0.0110  0.1168   0.0459
AUC           0.8971  0.8344   0.8772     0.9557  0.8677   0.9273     1       0.8646   0.9709
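As a consistency check, the SAR column of Table 2 combines the other metrics as defined in Section 3.4, SAR = (ACC + AUROC + (1 − RMSE))/3, taking RMSE as the square root of the reported MSE. A quick sketch (in Python rather than the paper's R) reproduces the Random Forest columns to roughly three decimal places; the PLS-DA and SVM columns agree less tightly (within about 0.01), presumably reflecting rounding or fold-wise aggregation in 'caret'.

```python
from math import sqrt

def sar(accuracy, auroc, mse):
    """Squared error, Accuracy, and ROC (SAR) summary statistic,
    with RMSE taken as the square root of the reported MSE."""
    return (accuracy + auroc + (1.0 - sqrt(mse))) / 3.0

# Random Forest columns of Table 2 (Validate and Overall).
assert abs(sar(0.8612, 0.8646, 0.1168) - 0.7947) < 0.001
assert abs(sar(0.9536, 0.9709, 0.0459) - 0.9031) < 0.001
```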

This further motivates why the RF model is the best selection in this circumstance, as it achieves the best trade-off between sensitivity and specificity and is the only model to have a sensitivity within the recommended range. Of note, the specificity of all three models meets the WHO threshold, and thus the models here may be positioned as a 'rule-in' test and would have utility as tools for faster diagnosis of some TB-suspect patients [56].

It is worth noting, as is common with machine learning techniques, that the RF model's perfect classification on the training data may be indicative of overfitting. While the AUC of 0.86 in the withheld data is strong, external validation on other datasets is necessary to further validate the random forest model as a potential tool for diagnostic development.

Fig. 3. The AUC curves from each of the three machine learning models examined in both the training and validation data.

There are a few potential reasons why the RF algorithm may outperform the SVM and PLS-DA algorithms in this case. Accurate SVM models rely on there being no overlap between cases and controls in highly dimensional space [41]. PLS-DA is both linear and parametric, and thus may not perform well on data with relationships which are non-linear and non-parametric [40]. RF models are more flexible and are not subject to these limitations, which may have led to their superior performance in this case.

Unsupervised clustering methods can be used as a way to add computational validity to the features selected. The idea behind this methodology is that if features are strongly associated with an outcome of interest, then when these values are clustered in an unsupervised way, we should see clusters which are also associated with that outcome. In a health application domain this is incredibly important to do, because it informs researchers not only that the model works, but that the input to the model is useful. Thus, unsupervised analyses can serve as a visual method for biomarker validation [57].

A heatmap showing the unsupervised HCA of each of the samples across the entire multicohort dataset, together with the relative gene expression values of the features selected by the RF model, is shown in Fig. 4. Features selected by the random forest model are shown along the vertical axis of the heatmap. The trees on the top and left side of the heatmap are unsupervised hierarchical clusterings of similar patients (top) and similar genes (left) using Euclidean distance. The annotation bar along the top of the heatmap denotes the class of each of our samples, where active TB is shown in red, LTBI in purple, healthy controls in dark blue, treated TB samples in turquoise, and other diseases in yellow. While perfect clustering of active TB from the other samples is not observed, we can clearly see a predominant active TB section in the heatmap. It is also clear from this annotation bar that active TB often clusters with other diseases, which supports our theory that the low sensitivity observed in the model may be in part due to the high degree of similarity between active TB and some of the other diseases included in the cohorts. The values within the heatmap represent the mean-centered expression value for each gene by each patient. We can see an obvious pattern in the cluster that is predominantly red in the annotation bar, demonstrating clear differences in the expression values amongst the random-forest-selected genes for this subgroup. While we do not have exact disease labels for the other-disease patients which are clustering with the active TB patients, it is likely that they have diseases with symptomology overlapping that of active TB. The heatmap further validates that there are observable differences in the selected genes between most active TB patients and everyone else.

Another point of interest from the RF features is that some of the selected features belong to the same gene family, such as FCGR1A and FCGR1B. This generates some confidence regarding the biological importance of the pathway(s) involved, and also indicates that a more parsimonious feature list may be obtained if we first condense the gene features of the full dataset by family, or apply a correlational cutoff to condense gene features. Given the importance of conserved feature sets for diagnostic development, this may be a critical improvement towards obtaining a smaller transcriptional signature for TB [7,12,27,28].

To add an additional layer of confidence to our transcriptional signature, we examined the selected features across all three models for any evidence of overlap. The results from this comparison are shown in a Venn diagram in Fig. 5(A). Only 4 features (ANKRD22, FCGR1A, FCGR1B, and GBP5) were selected by all three models. However, 19 features (ANKRD22, ASPHD2, BATF2, FCGR1A, EEF2K, ETV7, FCGR1B, FOXO1, GBP1, GBP2, GBP5, GBP6, GBP4, MEF2D, PSME2, SERTAD2, SOCS1, SCO2, and VAMP5) were selected by at least two models.

A t-SNE showing a dimensionality reduction of the set of 19 overlapping features is shown in Fig. 5(B). We can quickly observe a predominant active TB cluster shown in red, albeit with some mixing of other diseases indicated in yellow. In our previous analysis of this data, we found that unsupervised clustering demonstrated

Table 3
Comparing the global AUC performance for each comparison group from the current model, the previous signature proposed in [18], and the results from the meta-analysis framework [7].

Comparison group   Current results   Previous results [18]   Meta-analysis framework [7]
Control            0.978             0.91                    0.9
LTBI               0.982             0.93                    0.88
Other disease      0.962             0.8                     0.84

clear confounding based on dataset despite conormalizing. While we observe less impact of batch effect in this analysis, it is noteworthy that the active TB cases from GSE28623 are not clustering with the active TB cases from the other datasets, and instead appear to cluster with the control, LTBI, and some other-disease samples. GSE28623 was the only dataset analyzed on an Agilent platform, while the other three datasets were analyzed using Illumina platforms [8–11]. Improved conormalization to further remove these batch effects will likely improve overall model performance. Other figures showing the unsupervised HCA of the overlapping features, as well as the HCA and t-SNE for the entire collection of selected features, are shown in the Supplementary material.

Fig. 4. A heatmap and HCA of the entire multi-cohort dataset and the relative expression values of the selected genes from the RF model.

A GO enrichment analysis was conducted to identify biological process annotations from the set of genes spanning all the selected features from each model [19]. If reasonable features were selected during our modeling process, we would expect biological processes to be returned which are known to be associated with Tuberculosis. In total, 86 unique genes were selected across the PLS-DA, SVM, and RF models. The annotations returned for this 86-gene signature are all related to immune system responses. Fig. 6 shows a directed acyclic graph (DAG), developed using the online GOrilla tool, that demonstrates the significant biological process terms related to the full gene set [58]. This figure shows a hierarchical structure of the processes related to the 86-gene signature. The color represents the p-value with which the 86-gene signature was enriched for a particular process compared to what would be expected by random chance [19,58]. Of particular note is that many of the most significant GO terms are related to interferon-gamma. Interferon-gamma is thought to be a mediator of macrophage activation and has long been associated with TB; it has previously been implicated as a potential tool for TB resistance [59]. In fact, interferon-gamma release assays (IGRAs) are one tool available to identify patients with LTBI [1]. Thus, this gene set enrichment analysis adds a layer of biological credibility to the transcriptional signature produced by our multi-cohort framework being related to Tuberculosis.

Given the reduced number of datasets used here and the lack of HIV-positive patients, direct comparisons of this framework's performance to the previously proposed meta-analysis framework should be taken with a grain of salt. Table 3 shows such a comparison of the global AUC, broken down by comparison group, across the current model, the previously proposed framework, and the meta-analysis framework [7,18]. As evidenced by the notable increases in global AUC across all three sub-categories of classification, the expansion of this framework to include the 5-fold cross-validation process improved model performance, particularly in the category which seeks to discriminate between active TB infection and other diseases. The comparison shown in Table 3 is presented for the sole purpose of demonstrating that the results here appear promising given the current performance of other multi-cohort models in this landscape; however, further expansion of this framework to include additional data is necessary to fully show improvements relative to these other models.

Our approach of seeking a transcriptional signature using a multi-model, multi-cohort approach has a variety of strengths. First and foremost, by integrating diverse cohorts from different geographical areas with different comorbidities, we are able to examine a dataset that is far richer than could be achieved with any one cohort. While the data may be noisier, the results should better mirror the diversity of patients who may be seen in clinics world-wide. Moreover, we achieve this diversity through direct integration, and hence not at the cost of being able to use machine learning algorithms to gain additional insights at the patient level and to drive questions regarding TB pathogenesis.

Moreover, the variety of machine learning algorithms we selected to examine our data is a strength of this analysis. By not limiting ourselves to one type of model, we do not limit the proposed TB signature to those features which are more likely to be selected by particular model types, be they parametric vs non-parametric or linear vs non-linear. Using the comparison of the returned signatures to further reduce the number of false positives, by considering those features that overlap between at least two methods, is an additional strength of this analysis.

There are a number of limitations to the analysis presented here. Most critically, the exclusion of children and HIV-positive patients from the analysis needs to be addressed in future work in order to propose a globally applicable transcriptional signature for TB. While the proposed transcriptomic signature may have use as a rule-in test or companion diagnostic, the sensitivity of the three compared models ideally needs to be improved in order to translate into a rule-out test in a clinical setting. The inclusion of additional TB datasets, as well as datasets spanning other diseases with phenotypic presentations similar to TB, may assist with both of these shortcomings.

As well, there is no way to guarantee that the 'best' algorithms were selected in this approach. While the models selected encompass diverse ML characteristics (a tree-based model, a parametric model, and models based on organization in n-dimensional space), it is within the scope of possibility that some other model will achieve superior performance. It is not feasible to test every

Fig. 5. (A) A Venn diagram showing the overlap between selected features across the three machine learning classifiers. (B) A t-SNE with perplexity of 15 constructed using
any feature which was selected by at least two of the models.
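The overlap analysis summarized in Fig. 5(A) reduces to simple set operations over each classifier's selected features. A minimal sketch — the three selection sets below are truncated, hypothetical stand-ins, not the models' full feature lists:

```python
# Sketch of the Fig. 5(A) overlap analysis: given the feature set chosen by
# each classifier, find genes selected by all models and by at least two.
# The three sets below are toy illustrations, not the full signatures.
from itertools import combinations

selected = {
    "PLS-DA":         {"ANKRD22", "FCGR1A", "FCGR1B", "GBP5", "BATF2", "SOCS1"},
    "Polynomial SVM": {"ANKRD22", "FCGR1A", "FCGR1B", "GBP5", "GBP6", "ETV7"},
    "Random Forest":  {"ANKRD22", "FCGR1A", "FCGR1B", "GBP5", "BATF2", "GBP6"},
}

# Genes chosen by every model (the centre of the Venn diagram).
core = set.intersection(*selected.values())

# Genes chosen by at least two of the three models.
at_least_two = set()
for a, b in combinations(selected.values(), 2):
    at_least_two |= a & b

print(sorted(core))          # ['ANKRD22', 'FCGR1A', 'FCGR1B', 'GBP5']
print(sorted(at_least_two))  # adds BATF2 and GBP6 to the core in this toy case
```

Run on the actual feature lists, this procedure should recover the 4-gene core (ANKRD22, FCGR1A, FCGR1B, GBP5) and the 19-feature "at least two models" set reported above.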

Fig. 6. A directed acyclic graph (DAG) to visualize the GO enrichment analysis using all 86 genes selected across the three machine learning models. The color of each node of the DAG represents the p-value associated with the annotation, as shown in the legend along the top. Figure generated using GOrilla, with the p-value threshold set to 10^-4 [58].
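Enrichment p-values of the kind shown as node colors in Fig. 6 are conventionally derived from the hypergeometric distribution (GOrilla itself uses a threshold-free refinement of this test [58]). A generic sketch, with all counts invented for illustration:

```python
# Generic GO-style enrichment sketch: probability of seeing at least k
# genes annotated to a term when sampling n genes from a universe of N
# genes, of which K carry the annotation (hypergeometric upper tail).
from math import comb

def enrichment_p(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# e.g. an 86-gene signature drawn from a 20,000-gene background, where a
# hypothetical interferon-gamma-response term annotates 200 genes and 12
# of the 86 selected genes carry it (counts are illustrative only):
p = enrichment_p(N=20000, K=200, n=86, k=12)
```

Since only about 0.86 annotated genes would be expected by chance in this hypothetical, observing 12 gives a vanishingly small p-value, which is the kind of signal that makes a GO term stand out in the DAG.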

possible machine learning technique, but we argue that the three presented here give reasonable insight into possible performance on integrated gene expression array data.

While this analysis has focused on solving a '1 vs. the rest' problem to classify active TB against controls, LTBI, treated TB, and other diseases, a multi-class classifier may add additional insight into TB pathogenesis. Specifically, the problem of classifying active TB from other diseases may be improved by the use of a multiclass algorithm; Random Forest algorithms are easily extended to multi-class problems [26].

5. Conclusions

Here we show a framework for integrating diverse transcriptional datasets in the context of a clinical application where analyses of a single dataset have failed. We have shown that by directly integrating datasets and adjusting for batch effect, multicohort
data can be used with canonical machine learning tools to drive additional findings which accurately classify active TB infection in the clinical milieu of a variety of other similarly-presenting diseases, as well as LTBI, patients with TB on treatment, and healthy controls. Moreover, the biomarker selection procedure identified features for these models which are biologically relevant for TB infection and preserve potential information to drive further hypotheses regarding TB pathogenesis. Future work will incorporate additional datasets which include common co-morbidities, such as co-infection with HIV, as well as the utilization of multiclass machine learning classifiers.

Acknowledgments

Dartmouth College holds an Institutional Program Unifying Population and Laboratory Based Sciences award from the Burroughs Wellcome Fund, USA, and Carly A. Bobak was supported by this grant (Grant #1014106). Alexander J. Titus was supported by the Office of the U.S. Director of the National Institutes of Health under award number T32LM012204.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.asoc.2018.10.005.

References

[1] The World Health Organization, Global Tuberculosis Report 2017, 2017, http://www.who.int/tb/publications/global_report/gtbr2017_main_text.pdf. (Accessed 24 May 2018).
[2] J.G. Peter, G. Theron, T.E. Muchinga, U. Govender, K. Dheda, The diagnostic accuracy of urine-based Xpert MTB/RIF in HIV-infected hospitalized patients who are smear-negative or sputum scarce, PLoS One 7 (2012) e39966, http://dx.doi.org/10.1371/journal.pone.0039966.
[3] WHO, WHO End TB Strategy, WHO, 2015, http://www.who.int/tb/post2015_strategy/en/. (Accessed 24 May 2018).
[4] O. Ramilo, A. Mejías, Shifting the paradigm: Host gene signatures for diagnosis of infectious diseases, Cell Host Microbe 6 (2009) 199–200, http://dx.doi.org/10.1016/j.chom.2009.08.007.
[5] O. Ramilo, A. Mejías, Shifting the paradigm: Host gene signatures for diagnosis of infectious diseases, Cell Host Microbe 6 (2009) 199–200, http://dx.doi.org/10.1016/j.chom.2009.08.007.
[6] T.E. Sweeney, H.R. Wong, P. Khatri, Robust classification of bacterial and viral infections via integrated host gene expression diagnostics, Sci. Transl. Med. 8 (2016) 346ra91, http://dx.doi.org/10.1126/scitranslmed.aaf7165.
[7] T.E. Sweeney, L. Braviak, C.M. Tato, P. Khatri, Genome-wide expression for diagnosis of pulmonary tuberculosis: a multicohort analysis, Lancet Respir. Med. 4 (2016) 213–224, http://dx.doi.org/10.1016/s2213-2600(16)00048-5.
[8] M.P.R. Berry, C.M. Graham, F.W. McNab, Z. Xu, S.A.A. Bloch, T. Oni, K.A. Wilkinson, R. Banchereau, J. Skinner, R.J. Wilkinson, C. Quinn, D. Blankenship, R. Dhawan, J.J. Cush, A. Mejias, O. Ramilo, O.M. Kon, V. Pascual, J. Banchereau, D. Chaussabel, A. O'Garra, An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis, Nature 466 (2010) 973–977, http://dx.doi.org/10.1038/nature09247.
[9] J. Maertzdorf, M. Ota, D. Repsilber, H.J. Mollenkopf, J. Weiner, P.C. Hill, S.H.E. Kaufmann, Functional correlations of pathogenesis-driven gene expression signatures in tuberculosis, PLoS ONE 6 (2011) e26938, http://dx.doi.org/10.1371/journal.pone.0026938.
[10] C.I. Bloom, C.M. Graham, M.P.R. Berry, F. Rozakeas, P.S. Redford, Y. Wang, Z. Xu, K.A. Wilkinson, R.J. Wilkinson, Y. Kendrick, G. Devouassoux, T. Ferry, M. Miyara, D. Bouvry, V. Dominique, G. Gorochov, D. Blankenship, M. Saadatian, P. Vanhems, H. Beynon, R. Vancheeswaran, M. Wickremasinghe, D. Chaussabel, J. Banchereau, V. Pascual, L. Ho, M. Lipman, A. O'Garra, Transcriptional blood signatures distinguish pulmonary tuberculosis, pulmonary sarcoidosis, pneumonias and lung cancers, PLoS One 8 (2013) e70630, http://dx.doi.org/10.1371/journal.pone.0070630.
[11] S. Blankley, C.M. Graham, J. Turner, M.P.R. Berry, C.I. Bloom, Z. Xu, V. Pascual, J. Banchereau, D. Chaussabel, R. Breen, G. Santis, D.M. Blankenship, M. Lipman, A. O'Garra, The transcriptional signature of active tuberculosis reflects symptom status in extra-pulmonary and pulmonary tuberculosis, PLoS One 11 (2016) e0162220, http://dx.doi.org/10.1371/journal.pone.0162220.
[12] T.E. Sweeney, W.A. Haynes, F. Vallania, J.P. Ioannidis, P. Khatri, Methods to increase reproducibility in differential gene expression via meta-analysis, Nucl. Acids Res. 45 (2016) e1, http://dx.doi.org/10.1093/nar/gkw797.
[13] Wellcome Trust, Sharing data from large-scale biological research projects: A system of tripartite responsibility, Fort Lauderdale, 2003, https://www.genome.gov/pages/research/wellcomereport0303.pdf. (Accessed 20 September 2018).
[14] R. Edgar, M. Domrachev, A.E. Lash, Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucl. Acids Res. 30 (2002) 207–210, http://dx.doi.org/10.1093/nar/30.1.207.
[15] J. Allen, K.J. Inder, T.J. Lewin, J.R. Attia, F.J. Kay-Lambkin, A.L. Baker, T. Hazell, B.J. Kelly, Integrating and extending cohort studies: lessons from the eXtending Treatments, Education and Networks in Depression (xTEND) study, BMC Med. Res. Methodol. 13 (2013) 122, http://dx.doi.org/10.1186/1471-2288-13-122.
[16] V. Nygaard, E.A. Rødland, E. Hovig, Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses, Biostatistics 17 (2016) 29–39, http://dx.doi.org/10.1093/biostatistics/kxv027.
[17] J.A. Thompson, J. Tan, C.S. Greene, Cross-platform normalization of microarray and RNA-seq data for machine learning applications, PeerJ 4 (2016) e1621, http://dx.doi.org/10.7717/peerj.1621.
[18] C.A. Bobak, A.J. Titus, J.E. Hill, Investigating random forest classification on publicly available tuberculosis data to uncover robust transcriptional biomarkers, in: HEALTHINF, 2018, pp. 695–701.
[19] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G. Sherlock, Gene ontology: tool for the unification of biology, Nat. Genet. 25 (2000) 25–29, http://dx.doi.org/10.1038/75556.
[20] W.A. Haynes, F. Vallania, C. Liu, E. Bongen, A. Tomczak, M. Andres-Terre, S. Lofgren, A. Tam, C.A. Deisseroth, M.D. Li, T.E. Sweeney, P. Khatri, Empowering multi-cohort gene expression analysis to increase reproducibility, bioRxiv (2016), http://biorxiv.org/content/early/2016/08/25/071514.
[21] J.N. Taroni, C.S. Greene, Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously, bioRxiv (2017) 118349, http://dx.doi.org/10.1101/118349.
[22] T.E. Sweeney, COCONUT: COmbat CO-Normalization Using conTrols, 2016, https://cran.r-project.org/web/packages/COCONUT/COCONUT.pdf. (Accessed 24 May 2018).
[23] T.E. Sweeney, Package 'COCONUT', 2017, https://cran.r-project.org/web/packages/COCONUT/COCONUT.pdf. (Accessed 24 May 2018).
[24] E. Lin, H.-Y. Lane, Machine learning and systems genomics approaches for multi-omics data, Biomark. Res. 5 (2017) 2, http://dx.doi.org/10.1186/s40364-017-0082-y.
[25] A.V. Lebedev, E. Westman, G.J.P. Van Westen, M.G. Kramberger, A. Lundervold, D. Aarsland, H. Soininen, I. Kloszewska, P. Mecocci, M. Tsolaki, B. Vellas, S. Lovestone, A. Simmons, Random Forest ensembles for detection and prediction of Alzheimer's disease with a good between-cohort robustness, NeuroImage Clin. 6 (2014) 115–125, http://dx.doi.org/10.1016/j.nicl.2014.08.023.
[26] R. Diaz-Uriarte, S.A. de Andres, Gene selection and classification of microarray data using random forest, BMC Bioinformatics 7 (2006) 3, http://dx.doi.org/10.1186/1471-2105-7-3.
[27] P.S. Gromski, H. Muhamadali, D.I. Ellis, Y. Xu, E. Correa, M.L. Turner, R. Goodacre, A tutorial review: metabolomics and partial least squares-discriminant analysis – a marriage of convenience or a shotgun wedding, Anal. Chim. Acta 879 (2015) 10–23, http://dx.doi.org/10.1016/j.aca.2015.02.012.
[28] N.R. Pal, A fuzzy rule based approach to identify biomarkers for diagnostic classification of cancers, in: 2007 IEEE Int. Fuzzy Syst. Conf., IEEE, 2007, pp. 1–6, http://dx.doi.org/10.1109/FUZZY.2007.4295533.
[29] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1997) 67–82, http://dx.doi.org/10.1109/4235.585893.
[30] S.S. Verma, A. Lucas, X. Zhang, Y. Veturi, S. Dudek, B. Li, R. Li, R. Urbanowicz, J.H. Moore, D. Kim, M.D. Ritchie, Collective feature selection to identify crucial epistatic variants, BioData Min. 11 (2018) 5, http://dx.doi.org/10.1186/s13040-018-0168-6.
[31] E.W. Steyerberg, Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating, Springer, 2009, https://books.google.com/books?id=kHGK58cLsMIC&dq=statistical+parsimony+and+clinical+models&lr=&source=gbs_navlinks_s. (Accessed 18 September 2018).
[32] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (1999) 293–300, https://lirias.kuleuven.be/bitstream/123456789/218716/2/Suykens_NeurProcLett.pdf. (Accessed 24 May 2018).
[33] T.R. Mellors, C.A. Rees, W.F. Wieland-Alter, C.F. von Reyn, J.E. Hill, The volatile molecule signature of four mycobacteria species, J. Breath Res. 11 (2017) 31002, http://dx.doi.org/10.1088/1752-7163/aa6e06.
[34] M. Pérez-Enciso, M. Tenenhaus, Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach, Hum. Genet. 112 (2003) 581–592, http://dx.doi.org/10.1007/s00439-003-0921-9.
[35] A. Statnikov, L. Wang, C.F. Aliferis, A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification, BMC Bioinformatics 9 (2008) 319, http://dx.doi.org/10.1186/1471-2105-9-319.
[36] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32, https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf. (Accessed 24 May 2018).
[37] M. Barker, W. Rayens, Partial least squares for discrimination, J. Chemom. 17 (2003) 166–173, http://dx.doi.org/10.1002/cem.785.
[38] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297, https://link.springer.com/content/pdf/10.1007/BF00994018.pdf. (Accessed 24 May 2018).
[39] T. Hastie, R. Tibshirani, J. Friedman, Support vector machines and flexible discriminants, in: The Elements of Statistical Learning, 2009, pp. 417–458, http://dx.doi.org/10.1007/978-0-387-84858-7_12.
[40] D.R. Perez, G. Narasimhan, So you think you can PLS-DA? bioRxiv (2017) 207225, http://dx.doi.org/10.1101/207225.
[41] S. Hashemi, T. Trappenberg, Using SVM for classification in datasets with ambiguous data, SCI 2002 (2002), https://web.cs.dal.ca/~tt/papers/SCI2002color2.pdf. (Accessed 25 May 2018).
[42] M. Kuhn, J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, Z. Mayer, M. Benesty, R. Lescarbeau, A. Ziem, L. Scrucca, Y. Tang, C. Candan, T. Hunt, M. Max Kuhn.
[43] F. Mosteller, J.W. Tukey, Data Analysis, Including Statistics, Addison-Wesley, 1968, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.7042&rep=rep1&type=pdf. (Accessed 24 May 2018).
[44] P.F. Thall, R. Simon, D.A. Grier, Test-based variable selection via cross-validation, J. Comput. Graph. Stat. 1 (1992) 41–61, http://dx.doi.org/10.1080/10618600.1992.10474575.
[45] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, 1995, http://robotics.stanford.edu/~ronnyk. (Accessed 24 May 2018).
[46] H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ. Psychol. 24 (1933) 417–441, http://dx.doi.org/10.1037/h0071325.
[47] S. Khalid, T. Khalil, S. Nasreen, A survey of feature selection and feature extraction techniques in machine learning, in: 2014 Sci. Inf. Conf., IEEE, 2014, pp. 372–378, http://dx.doi.org/10.1109/SAI.2014.6918213.
[48] R. Caruana, A. Niculescu-Mizil, Data mining in metric space: An empirical analysis of supervised learning performance criteria, in: Proc. Tenth ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2004, pp. 69–78, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.60.6684. (Accessed 26 May 2018).
[49] S. Takahama, A.M. Dillner, Model selection for partial least squares calibration and implications for analysis of atmospheric organic aerosol samples with mid-infrared spectroscopy, J. Chemom. 29 (2015) 659–668, http://dx.doi.org/10.1002/cem.2761.
[50] M. Kuhn, K. Johnson, Applied Predictive Modeling, second ed., Springer, 2016, http://appliedpredictivemodeling.com/about/. (Accessed 18 September 2018).
[51] C.-W. Hsu, C.-C. Chang, C.-J. Lin, A Practical Guide to Support Vector Classification, Taipei, 2003.
[52] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed., Springer-Verlag, New York, NY, USA, 2009, p. 745, http://dx.doi.org/10.1111/j.1467-985X.2010.00646_6.x.
[53] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (2008) 2579–2605.
[54] Applied Predictive Modeling, Down-sampling using random forests, 2016, pp. 4–13, http://appliedpredictivemodeling.com/blog/2013/12/8/28rmc2lv96h8fw8700zm4nl50busep. (Accessed 26 May 2018).
[55] J.N. Mandrekar, Receiver operating characteristic curve in diagnostic test assessment, J. Thorac. Oncol. 5 (2010) 1315–1316, http://dx.doi.org/10.1097/JTO.0B013E3181EC173D.
[56] World Health Organization, High-Priority Target Product Profiles for New Tuberculosis Diagnostics, World Health Organization, 2014, http://www.who.int/tb/publications/tpp_report/en/. (Accessed 18 September 2018).
[57] J.E. McDermott, J. Wang, H. Mitchell, B.-J. Webb-Robertson, R. Hafen, J. Ramey, K.D. Rodland, Challenges in biomarker discovery: Combining expert insights with statistical analysis of complex omics data, Expert Opin. Med. Diagn. 7 (2013) 37–51, http://dx.doi.org/10.1517/17530059.2012.718329.
[58] E. Eden, R. Navon, I. Steinfeld, D. Lipson, Z. Yakhini, GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists, BMC Bioinformatics 10 (2009) 48, http://dx.doi.org/10.1186/1471-2105-10-48.
[59] J.L. Flynn, J. Chan, K.J. Triebold, D.K. Dalton, T.A. Stewart, B.R. Bloom, An essential role for interferon gamma in resistance to Mycobacterium tuberculosis infection, J. Exp. Med. 178 (1993) 2249–2254, http://dx.doi.org/10.1084/JEM.178.6.2249.
