Article info

Article history:
Received 19 June 2018
Received in revised form 21 September 2018
Accepted 3 October 2018
Available online 22 October 2018

Keywords:
Tuberculosis diagnostics
Transcriptional biomarkers
Data integration
Machine learning models
Expression array data

Abstract

Tuberculosis (TB) is a top-10 cause of death worldwide, and new diagnostics are a key element of the World Health Organization's End TB strategy. Significant research efforts have gone into trying to identify a transcriptional signature from patient blood in order to diagnose TB, but a consistent signature for heterogeneous populations has remained elusive. In this work, we propose a data analysis framework which directly integrates multiple publicly-available expression array datasets in order to identify a more reliable gene signature for the diagnosis of TB. The proposed method was built using 4 distinct datasets spanning a total of 1164 samples and 4 countries. The performance and selected gene features of three different machine learning classifiers were compared in the context of this multi-cohort framework. A Random Forest classifier provided the best classification results, with an AUC of 0.8646 in our validation data. Gene ontology enrichment analysis revealed that the selected gene features across the three models are all related to immunological processes.

✩ This paper is an extended, improved version of the paper 'Investigating Random Forest Classification on Publicly Available Tuberculosis Data to Uncover Robust Transcriptional Biomarkers' presented at the AI4Health 2018 workshop and published in: BIOSTEC 2018, Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies, Volume 5: HEALTHINF, Funchal, Madeira, Portugal, 19–21 January, 2018, pp. XX-YY, ISBN: 978-989-758-281-3, INSTICC, 2018.
∗ Corresponding author at: Thayer School of Engineering, Dartmouth School of Graduate and Advanced Studies, Hanover, NH, USA.
E-mail addresses: Carly.A.Bobak.GR@dartmouth.edu (C.A. Bobak), Alexander.J.Titus.GR@dartmouth.edu (A.J. Titus), Jane.E.Hill@dartmouth.edu (J.E. Hill).
https://doi.org/10.1016/j.asoc.2018.10.005
C.A. Bobak, A.J. Titus and J.E. Hill / Applied Soft Computing Journal 74 (2019) 264–273 265
Fig. 1. Visual representations of the three machine learning classifiers explored in this paper. (A) represents a PLS-DA algorithm, (B) represents an SVM algorithm with a
polynomial kernel and (C) represents an RF algorithm.
primary objective of the algorithm is to maximize the covariance between each data point and a dependent variable (e.g., TB case status) in a high dimensional setting by identifying if groups are sufficiently different, and determining which features are most significantly contributing to this difference. To do this, the high dimensional data is mapped to a linear subspace of the explanatory variables [22,25,30]. PLS-DA is often thought of as a supervised version of a PCA, as it is a dimensionality reduction of the variables but maintains the known classification labels. While PLS-DA is a useful method in many contexts, it can be sensitive to noisy data. As well, the PLS-DA components, much like PCA's principal components, rely on linear combinations of the features, which may not best describe the data [40].

In SVM algorithms, data is projected into a high dimensional subspace, and a hyperplane is identified which best separates the predefined groups. Depending on the selected kernel, SVM can transform the data into a higher dimensional linear, radial, or polynomial space in order to identify the best hyperplane to minimize the number of misclassifications and maximize the margin between groups. As such, it is a flexible, non-parametric method for the classification of transcriptomic data [32,35,38,39]. However, SVM can have a few drawbacks, mainly in that it does not perform well if there is no margin between groups (i.e., the groups overlap). As well, SVMs do not directly calculate the probability of belonging to a particular group; this must be estimated in a separate process [41,42].

Random forests (RF) are based on a collection of many decision trees, wherein features are selected to be nodes along a tree, and samples walk along the tree until they are sorted into classes. Each tree is generated by taking a random sample of the data and a random sample of the features, and then selecting nodes which best classify the data according to its label at each level of the tree. The importance of each feature in the model can then be measured as a function of how frequently it is chosen to be a node, or based on the mean decrease in accuracy when the feature is removed from the model. In general, RF models perform well across various types of -omics data, although they can overfit the data [18,25,26,35,36].

In each of these models, there is a collection of parameters which need to be tuned. Cross validation (CV) can reduce the risk of overfitting our models to the training data while tuning these parameters. A popular CV method is k-fold cross-validation, where data is split into k even-sized pieces; k − 1 of these pieces are used to train the model, and the model is tested on the last remaining piece [43,44]. This process iterates so that each piece of the data is used to test the model. An accuracy distribution is defined based on each model's performance on the test data. While k-fold cross validation is a useful technique for tuning model parameters, there is often still large variance in the results. Repeated CV can ameliorate this, where the k-fold process is repeated with multiple different cuts of the data [44,45].

To our knowledge, the framework proposed in this paper is the first to directly integrate transcriptomic data across multiple, independent cohorts and evaluate three common machine learning classifiers for their performance as potential diagnostic tools. This data analysis pipeline can easily be applied to other disease contexts, utilizing the wealth of publicly available -omics data to address some of the issues surrounding reproducibility in preclinical research, and can drive scientific insight into pathogenesis as well as facilitate diagnostic development.

3. Datasets and proposed approach

3.1. Data mining and collection from GEO

The National Institutes of Health's (NIH) Gene Expression Omnibus (GEO) is a publicly available database for functional genomics data and, at the time of writing, home to over 98,000 series records, each representing a genomic study [14]. The datasets presented in this paper were collected from the GEO database, where 'tuberculosis' and 'TB' were used as the key search terms among expression array datasets of whole blood from human subjects. Study-level eligibility criteria for inclusion in this particular analysis required that each dataset include controls for normalization purposes, have at least 100 distinct samples, and originate from a distinct institution. A brief summary of the sample distributions of these studies is presented in Table 1.

Study subject samples originated from the United Kingdom, France, South Africa, and The Gambia. Patients' ages ranged from 16 to 87 years. Each study excluded patients who were HIV-positive. While GSE19491, GSE28623, and GSE42834 were all focused on pulmonary TB, GSE83456 also included extrapulmonary cases. Qualification of TB-positive patients varied between datasets: GSE19491 and GSE42834 both used culture confirmation in either sputum or bronchial lavage [8,10], GSE28623 relied on patients who were both smear-positive and had chest X-rays indicative of TB [9], and GSE83456 used a combination of culture at the infection site, caseating granuloma on biopsy, and/or clinical/radiological features consistent with active TB [11]. These datasets include patients afflicted with the following under the Other Disease category: Streptococcal Pharyngitis, Staphylococcus infection, Still's disease, Systemic Lupus Erythematosus, Sarcoidosis, Pneumonia, and Lung Cancer [8–11]. In total, 1164 samples were included in this analysis. Each expression set was used as deposited, with the exception of checking to make sure values had been appropriately log2 transformed.

3.2. Proposed approach

Our proposed approach for identifying a transcriptional biomarker signature for active TB can be broken down into two primary problems. First, suitably merging and integrating the individual cohorts into one multi-cohort dataset. Second, identifying appropriate machine learning algorithms which not only accurately classify active TB, but are also able to identify key features which can be exploited for further diagnostic development.
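A minimal sketch of the first problem, collapsing probe-level measurements to gene level and then intersecting the datasets on shared gene symbols (the procedure Section 3.3 describes), is given below. This is an illustrative Python version with invented probe and gene identifiers; the paper's actual processing was done in R.

```python
from statistics import median

def collapse_probes(probe_values, probe_to_gene):
    """Collapse probe-level expression to gene level: where several
    probe IDs map to one gene symbol, take the median value across
    those probes (the rule described in Section 3.3)."""
    by_gene = {}
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene.setdefault(gene, []).append(value)
    return {gene: median(values) for gene, values in by_gene.items()}

def merge_on_shared_genes(gene_level_datasets):
    """Condense a list of gene-level datasets to the gene symbols
    present in every one of them (the intersection that yielded the
    paper's 15,821 initial features)."""
    shared = set.intersection(*(set(d) for d in gene_level_datasets))
    return [{gene: d[gene] for gene in shared} for d in gene_level_datasets]
```

For example, two probes mapping to the same symbol are reduced to their median before the per-dataset dictionaries are intersected.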
Fig. 2. (A) A summary of the dataset merging and integration process. (i) represents the mining process from NCBI’s GEO database and the selection of 4 distinct datasets.
(ii) represents merging each dataset based on gene symbol to obtain one multi-cohort dataset. (iii)-(vi) represent the ‘COCONUT’ conormalization process, where (iii) is splitting the healthy controls from the diseased samples, (iv) is obtaining the ComBat parameters from the healthy controls, (v) is applying the obtained parameters to the diseased samples, and (vi) is recombining the normalized healthy control and diseased samples into one dataset. (B) A summary of the model building and feature selection process used. Three machine learning classifiers were built on the training data: an RF model, an SVM model with a polynomial kernel, and a PLS-DA model. Each model was
built using a repeated 5-fold CV process to first define the feature importance. Feature selection was conducted using a plus-L, minus-R selection (LRS) sequential selection
method. After feature selection was completed, repeated 5-fold CV was used again to tune the final model parameters, and the final models were used to evaluate the class
probabilities on the validation data.
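The control-anchored conormalization in steps (iii)-(vi) of the caption can be sketched as follows. This is a deliberately simplified location-and-scale version, written in Python for illustration: the real COCONUT package applies a full empirical-Bayes ComBat adjustment in R, and the data here are invented.

```python
import statistics

def conormalize_on_controls(datasets):
    """Location/scale co-normalization anchored on healthy controls.

    `datasets` maps a study ID to (samples, labels), where `samples`
    is a list of per-sample gene expression vectors and `labels`
    marks healthy controls with 'control'. Per study and per gene, a
    mean and standard deviation are estimated from the controls ONLY
    (steps iii-iv) and then applied to every sample in that study
    (steps v-vi), so controls from all studies land on a common scale
    while the within-study case/control contrast is left untouched.
    """
    normalized = {}
    for study, (samples, labels) in datasets.items():
        controls = [s for s, lab in zip(samples, labels) if lab == "control"]
        n_genes = len(samples[0])
        means = [statistics.mean(c[g] for c in controls) for g in range(n_genes)]
        # guard against zero spread; a degenerate gene is left unscaled
        sds = [statistics.stdev(c[g] for c in controls) or 1.0
               for g in range(n_genes)]
        normalized[study] = [
            [(x - m) / sd for x, m, sd in zip(sample, means, sds)]
            for sample in samples
        ]
    return normalized
```

After this adjustment, the healthy controls of every study have per-gene mean zero on the shared scale, which is the property COCONUT exploits to make diseased samples comparable across cohorts.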
Within our data, there are many classes of interest: healthy controls, latent tuberculosis infection (LTBI), treated TB, active tuberculosis (TB), and other diseases. We are primarily interested in being able to identify active TB regardless of the comparison group, and hence utilize ‘1 vs. the rest’ algorithms in this particular study. While multi-class algorithms are also an option, a binary classification model is simpler and the most relevant in a clinical context, which classically relies on positive or negative test results [18].

3.3. Dataset merging and integration

In order to directly merge datasets, some unique identifier, or link, is needed. While GSE19491, GSE42834, and GSE83456 were all originally analyzed using Illumina Systems technology, and hence share probe IDs, GSE28623 was analyzed using an Agilent platform [8–11]. Therefore, probe IDs were first matched to their gene symbol, and where multiple probe IDs matched one gene symbol, we took the median expression value across all probes.

Table 1
The number of samples in each class of interest by each selected dataset, as denoted by the unique GEO ID. ‘TB’ indicates active tuberculosis, and ‘LTBI’ indicates latent tuberculosis infection. The other diseases included in these datasets span Streptococcal Pharyngitis, Staphylococcus infection, Still’s disease, Systemic Lupus Erythematosus, Sarcoidosis, Pneumonia, and Lung Cancer.

GSE          Healthy Control   Treated TB   LTBI   TB    Other Disease
19491 [8]    133               14           69     89    193
28623 [9]    37                –            25     46    –
42834 [10]   143               –            –      65    148
83456 [11]   61                –            –      92    49
Total        374               14           94     292   390

Each distinct expression array dataset was then merged using the unique gene symbols, and the final dataset was condensed to only those genes represented in all four expression sets. In total, the expression values of 15,821 genes were considered as initial features for the model. A summary of the dataset merging and integration process is shown in Fig. 2(A).

All statistical analyses were performed using R v3.4.3 (R Foundation for Statistical Computing, Vienna, Austria). Before any model building occurred, a principal components analysis (PCA) was used to visualize any batch effect in the merged dataset (see Supplementary Figure 1) [46]. It was immediately clear that severe batch effects by study are present in the merged dataset, with the exception of GSE42834 and GSE83456. Both of these studies used the Illumina HumanHT-12 V4.0 expression beadchip for RNA-Seq analysis, and hence their results appear to be more comparable [6,9,11,23]. In order to adjust for batch effect without violating the assumption that our samples came from identical distributions, despite the wide variety of diseases present, we used ‘COCONUT’ and specified non-parametric priors [6,23]. This conormalization was implemented using the ‘COCONUT’ package in R [23]. After conormalizing the datasets, another PCA was performed to assess the degree to which the data was adjusted to remove batch effect (Supplementary Figure 2). The conormalized data were unit-scaled and mean-centered. The dataset was split at random into two sets: a training portion (2/3 of the data) and a validation portion (1/3). In total, 775 samples were used in the training data, and 389 samples were used in the validation set. These are identical to the training/testing sets used in the previous study [18].

3.4. Model building and comparison

All the models presented here were built using the ‘caret’ package in R using only the training data (n = 775) [42]. In total, three
different machine learning classifiers were used: Random Forest (RF), Support Vector Machine (SVM) with a polynomial kernel, and Partial Least Squares-Discriminant Analysis (PLS-DA). In order to define a feature importance ranking for each of these models, Mean Decrease in Accuracy (MDA), feature-specific Area Under the Receiver Operating Characteristic curve (AUROC), and the weighted sums of the absolute regression coefficients were used for RF, SVM, and PLS-DA respectively [42]. A summary of the model building and feature selection process is shown in Fig. 2(B).

Feature selection was conducted using plus-L, minus-R sequential selection (LRS). This method is similar to a greedy forward search or recursive feature elimination, wherein features are added to the reduced model based on their importance ranking derived from the full model. The algorithm for this method is as follows:

1. Add the top L features to the reduced feature set, and check the performance statistic.
2. Iteratively remove up to R features from the feature set, and check the performance statistic; if performance does not improve, return to 1.
3. Iterate until overall performance does not improve.

LRS feature selection is more resistant to getting stuck in local maxima or local minima than basic sequential feature selection algorithms [47]. In our implementation of LRS, L was set to be 5 features, and R was set to iterate up to 4 features.

In order to evaluate the performance of the reduced model, the Squared Error, Accuracy, and ROC (SAR) statistic was used, defined as:

SAR = (ACC + AUROC + (1 − RMSE)) / 3

SAR is a metric which has been demonstrated to be robust to model nuances and hence is ideal for comparing the performance of models which may be best optimized using different performance statistics [48]. For instance, RMSE is an appropriate measure of performance for PLS-DA but does not perform well in SVM contexts [48,49]. By using the SAR measure, we can construct our models in a way that is less biased to any one classifier due to statistic-specific nuances. To fit each of these models to the training data, a 5-fold CV scheme was used with 10 repeats, as suggested in [50], to achieve reasonable precision and accuracy while maintaining a reasonable computational burden. The scheme was used to first identify the most important gene features for discriminating active TB cases, and then again to tune the parameters on the reduced model. For SVM, this requires the tuning of the scale, offset, and degree parameters for the polynomial kernel. In PLS-DA, the number of components to use must be tuned. And in RF, the number of variables which are sampled as candidates for each split (mtry) must be tuned [36–38,42].

The final models were then defined using the features selected from the LRS process, with the optimal tuning of necessary parameters using 5-fold cross validation with 10 repeats. The ‘caret’ package includes the ability to use a grid-search to optimize parameter tuning, and all parameters in all models were tuned according to this procedure [42]. A grid-based search is generalizable to any dataset, and thus no manual selection of parameter values is needed to extend this approach to different applications [51]. To evaluate the performance of the final models, each model was used to predict onto the previously unseen validation data (n = 389).

3.5. Feature selection validation and visualization

Beyond testing the performance of our classifier on the held-out validation set, we also sought to add both biological and computational credibility to our selected features. In order to assess biological credibility, a Gene Ontology (GO) enrichment analysis was conducted. GO enrichment analysis seeks to add additional layers of biological understanding to gene sets. To do this, it identifies genes which are overrepresented among gene sets and links these to known functional annotations for those genes [19].

To computationally validate our selected features, two unsupervised methods were employed. First, Hierarchical Clustering Analysis (HCA) was used to visualize the Euclidean distance between every sample and every gene, and a heatmap showing the relative expression of the selected features was constructed [52]. As well, a t-distributed Stochastic Neighbor Embedding (t-SNE) was used to visualize whether or not active TB cases clustered together. For this particular study, t-SNE is superior to PCA as it is both non-linear and non-parametric and better reflects the machine learning models used in the feature selection [53].

4. Results and discussion

The final PLS-DA model was built using 60 features and tuned to use 3 components. The reduced SVM model with a polynomial kernel was built using 24 features, where the parameters were set to degree = 2, offset = 0.5, and scale = 0.1 through cross-validation. The final RF model built using our framework used 25 features and had mtry = 13, where mtry is defined as the number of predictor variables which are randomly sampled as candidates at each split of a given tree. A complete list of the features used by these models, as well as HCA and t-SNE plots for each individual model, can be found in the Supplementary Tables and Figures.

Table 2 shows the results of each of the final models across the training sets, validation sets, and overall in the entire integrated multi-cohort. All three models had strong accuracy, specificity, SAR, MSE and AUC performance measures. However, the sensitivity of these models was low. The only model which achieved reasonable sensitivity was the RF, which had perfect sensitivity in the training data, and 69% sensitivity in the validation data. The low sensitivity across models may be due in part to the unbalanced nature of the data, as only 292/1164 samples had the active class of interest. Permutation methods, such as up- or down-sampling, may ameliorate this issue in future iterations [54]. Another possibility potentially contributing to the low sensitivity measures is the inherent complexity of identifying active TB cases from other diseases. Arguably, in a clinical context, this is the most relevant problem, as most TB-suspect patients will be suffering from some kind of malady, likely with a similar phenotypic presentation to active TB. To address this issue, more datasets which have additional Other Disease samples and rule out TB may provide additional power to further elucidate a unique transcriptional signature for active TB [12].

Fig. 3 shows a comparison of the AUROC curves from each model in both the training and validation data. Similar to the results shown in Table 2, we observe that the three models performed comparably on the validation data according to AUC. The SVM model had the best performance on validation data overall, but the AUC of the RF model on the validation set was only lower by a margin of 0.0021. Given the RF model's substantially better sensitivity compared to the SVM model, we conclude that in this particular study the RF model has the best performance. In terms of disease diagnostics, it has previously been suggested that AUCs of 0.5–0.7 suggest little-to-no discrimination, 0.7–0.8 should be interpreted as acceptable, 0.8–0.9 should be interpreted as excellent, and above 0.9 are considered outstanding [55].

It is important to note that the AUC curves themselves represent a trade-off between the test sensitivity and specificity, where different cutoffs of model scores may be selected to optimize for either sensitivity or specificity, depending on the disease context [55]. The WHO recommends that tuberculosis tests should aim for a target sensitivity of 90% (75%–91%) for pulmonary TB in adults, and a target specificity of 92% (77%–94%) [56].
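The cutoff trade-off described above can be made concrete with a small sketch. The Python below is illustrative, with hypothetical model scores; it walks the candidate cutoffs from strictest to most lenient and returns the first operating point that meets a target sensitivity such as the WHO's 90% goal.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Call a sample TB-positive when its score >= cutoff, then return
    the (sensitivity, specificity) pair for that operating point."""
    tp = sum(1 for s, pos in zip(scores, labels) if s >= cutoff and pos)
    fn = sum(1 for s, pos in zip(scores, labels) if s < cutoff and pos)
    tn = sum(1 for s, pos in zip(scores, labels) if s < cutoff and not pos)
    fp = sum(1 for s, pos in zip(scores, labels) if s >= cutoff and not pos)
    return tp / (tp + fn), tn / (tn + fp)

def cutoff_for_target_sensitivity(scores, labels, target=0.90):
    """Walk candidate cutoffs from most to least strict and return the
    first (highest-specificity) one whose sensitivity meets the target,
    together with the sensitivity and specificity achieved."""
    for cutoff in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        if sens >= target:
            return cutoff, sens, spec
    return None
```

Because a stricter cutoff trades sensitivity for specificity, this search makes explicit the price in specificity paid to reach a mandated sensitivity.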
Table 2
The performance metrics used to assess each of the final models on the training data, validation data, and overall among the entire multi-cohort dataset. The number of features used in the final model as selected by the LRS process is shown in brackets.

Statistic      PLS-DA (60)                   Polynomial SVM (24)           Random Forest (25)
               Train    Validate  Overall    Train    Validate  Overall    Train    Validate  Overall
Accuracy       0.8658   0.8226    0.8524     0.9123   0.838     0.8875     1        0.8612    0.9536
Sensitivity    0.5744   0.5464    0.5651     0.7077   0.5773    0.6644     1        0.6907    0.8973
Specificity    0.9638   0.9144    0.9472     0.981    0.9247    0.9622     1        0.9178    0.9725
SAR            0.7927   0.7529    0.7789     0.8766   0.7924    0.8442     0.9650   0.7947    0.9031
MSE            0.1568   0.1655    0.1552     0.0677   0.1552    0.0846     0.0110   0.1168    0.0459
AUC            0.8971   0.8344    0.8772     0.9557   0.8677    0.9273     1        0.8646    0.9709
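For reference, the count-based metrics reported in Table 2 and the SAR statistic of Section 3.4 combine as in the following illustrative Python sketch (the paper itself computed them through R's 'caret' package, and the labels here are invented):

```python
def confusion_metrics(y_true, y_pred, positive="TB"):
    """Accuracy, sensitivity and specificity of the kind reported in
    Table 2, computed from the confusion counts for the positive
    (active TB) class. With class imbalance, accuracy can stay high
    while sensitivity lags, which is the pattern seen above."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {"accuracy": (tp + tn) / len(pairs),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

def sar(accuracy, auroc, rmse):
    """The SAR statistic of Section 3.4: the mean of accuracy, AUROC
    and (1 - RMSE)."""
    return (accuracy + auroc + (1.0 - rmse)) / 3.0
```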
Table 3
Comparing the global AUC performance based on each comparison group from the current model, the previous signature proposed in [18], and the results from the meta-analysis framework [7].

Comparison Group   Current Results   Previous Results [18]   Meta-Analysis Framework [7]
Control            0.978             0.91                    0.9
LTBI               0.982             0.93                    0.88
Other Disease      0.962             0.8                     0.84
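The ‘1 vs. the rest’ setup behind these per-comparison-group results can be sketched as follows. The binary scorer here is a toy nearest-centroid rule standing in for the paper's RF, SVM, or PLS-DA models, and the samples are invented:

```python
class CentroidScorer:
    """Toy binary scorer: a sample scores higher when it sits closer
    to the positive-class centroid than to the 'rest' centroid. It
    stands in for any binary classifier with fit/score methods."""

    def fit(self, X, is_positive):
        pos = [x for x, flag in zip(X, is_positive) if flag]
        rest = [x for x, flag in zip(X, is_positive) if not flag]
        self.pos_centroid = [sum(col) / len(pos) for col in zip(*pos)]
        self.rest_centroid = [sum(col) / len(rest) for col in zip(*rest)]
        return self

    def score(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return sq_dist(self.rest_centroid) - sq_dist(self.pos_centroid)

def fit_one_vs_rest(X, y):
    """One binary 'class vs. the rest' scorer per label."""
    return {label: CentroidScorer().fit(X, [lab == label for lab in y])
            for label in set(y)}

def predict(models, x):
    """Assign the label whose one-vs-rest scorer is most confident."""
    return max(models, key=lambda label: models[label].score(x))
```

The design keeps the clinically relevant binary decision (active TB vs. everything else) while still permitting a label for every class when all scorers are consulted.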
Fig. 5. (A) A Venn diagram showing the overlap between selected features across the three machine learning classifiers. (B) A t-SNE with perplexity of 15 constructed using
any feature which was selected by at least two of the models.
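Counting the overlaps drawn in Fig. 5 is a simple set computation; the sketch below (illustrative Python, with invented gene names) returns the Venn region sizes and the "selected by at least two models" set used for Fig. 5(B):

```python
from collections import Counter

def venn_counts(rf, svm, plsda):
    """Region sizes for a three-set Venn diagram of selected features,
    as in Fig. 5(A)."""
    rf, svm, plsda = set(rf), set(svm), set(plsda)
    return {"rf_only": len(rf - svm - plsda),
            "svm_only": len(svm - rf - plsda),
            "plsda_only": len(plsda - rf - svm),
            "rf_and_svm": len((rf & svm) - plsda),
            "rf_and_plsda": len((rf & plsda) - svm),
            "svm_and_plsda": len((svm & plsda) - rf),
            "all_three": len(rf & svm & plsda),
            "union": len(rf | svm | plsda)}

def selected_by_at_least_two(rf, svm, plsda):
    """Features chosen by two or more of the models, the set used to
    build the t-SNE in Fig. 5(B)."""
    counts = Counter(f for s in (set(rf), set(svm), set(plsda)) for f in s)
    return {f for f, c in counts.items() if c >= 2}
```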
Fig. 6. A directed acyclic graph (DAG) to visualize the GO enrichment analysis using all 86 genes selected across the three machine learning models. The color of each node
of the DAG represents the p-value associated with the annotation as shown in the legend along the top. Figure generated using GOrilla, where the p-value threshold was set
to 10^−4 [58].
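The p-values in a GO enrichment analysis like Fig. 6 come from an over-representation test; the sketch below implements the standard one-sided hypergeometric tail (illustrative Python with made-up counts; GOrilla itself uses a minimum-hypergeometric variant of this idea):

```python
from math import comb

def enrichment_p_value(N, K, n, k):
    """One-sided hypergeometric over-representation p-value: the
    probability of seeing at least k annotated genes when n selected
    genes are drawn from a background of N genes, K of which carry
    the GO annotation in question."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)
```

A small p-value means the annotation appears among the selected genes far more often than a random draw of the same size would produce, which is how nodes in the Fig. 6 DAG are colored.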
possible machine learning technique, but we argue that the three presented here give reasonable insight into possible performance on integrated gene expression array data.

While this analysis has focused on solving a ‘1 vs. the rest’ problem to classify active TB from controls, LTBI, treated TB, and other diseases, a multi-class classifier may add additional insight into TB pathogenesis. Specifically, the problem of classifying active TB from other diseases may be improved by the use of a multiclass algorithm. Random Forest algorithms are easily extended to multi-class problems [26].

5. Conclusions

Here we show a framework for integrating diverse transcriptional datasets in the context of a clinical application where analyses of a single dataset have failed. We have shown that by directly integrating datasets and adjusting for batch effect, multicohort
data can be used with canonical machine learning tools to drive additional findings which accurately classify active TB infection in the clinical milieu of a variety of other similarly-presenting diseases, as well as LTBI, patients with TB on treatment, and healthy controls. Moreover, the biomarker selection procedure identified features for these models which are biologically-relevant for TB infection and preserve potential information to drive further hypotheses regarding TB pathogenesis. Future work will incorporate additional datasets which include common co-morbidities, such as co-infection with HIV, as well as the utilization of multiclass machine learning classifiers.

Acknowledgments

Dartmouth College holds an Institutional Program Unifying Population and Laboratory Based Sciences award from the Burroughs Wellcome Fund, USA, and Carly A. Bobak was supported by this grant (Grant #1014106). Alexander J. Titus was supported by the Office of the U.S. Director of the National Institutes of Health under award number T32LM012204.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.asoc.2018.10.005.

References

[1] The World Health Organization, Global Tuberculosis Report 2017, 2017, http://www.who.int/tb/publications/global_report/gtbr2017_main_text.pdf. (Accessed 24 May 2018).
[2] J.G. Peter, G. Theron, T.E. Muchinga, U. Govender, K. Dheda, The diagnostic accuracy of urine-based Xpert MTB/RIF in HIV-infected hospitalized patients who are smear-negative or sputum scarce, PLoS One 7 (2012) e39966, http://dx.doi.org/10.1371/journal.pone.0039966.
[3] WHO, WHO End TB Strategy, 2015, http://www.who.int/tb/post2015_strategy/en/. (Accessed 24 May 2018).
[4] O. Ramilo, A. Mejías, Shifting the paradigm: Host gene signatures for diagnosis of infectious diseases, Cell Host Microbe 6 (2009) 199–200, http://dx.doi.org/10.1016/j.chom.2009.08.007.
[5] O. Ramilo, A. Mejías, Shifting the paradigm: Host gene signatures for diagnosis of infectious diseases, Cell Host Microbe 6 (2009) 199–200, http://dx.doi.org/10.1016/j.chom.2009.08.007.
[6] T.E. Sweeney, H.R. Wong, P. Khatri, Robust classification of bacterial and viral infections via integrated host gene expression diagnostics, Sci. Transl. Med. 8 (2016) 346ra91, http://dx.doi.org/10.1126/scitranslmed.aaf7165.
[7] T.E. Sweeney, L. Braviak, C.M. Tato, P. Khatri, Genome-wide expression for diagnosis of pulmonary tuberculosis: a multicohort analysis, Lancet Respir. Med. 4 (2016) 213–224, http://dx.doi.org/10.1016/s2213-2600(16)00048-5.
[8] M.P.R. Berry, C.M. Graham, F.W. McNab, Z. Xu, S.A.A. Bloch, T. Oni, K.A. Wilkinson, R. Banchereau, J. Skinner, R.J. Wilkinson, C. Quinn, D. Blankenship, R. Dhawan, J.J. Cush, A. Mejias, O. Ramilo, O.M. Kon, V. Pascual, J. Banchereau, D. Chaussabel, A. O’Garra, An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis, Nature 466 (2010) 973–977, http://dx.doi.org/10.1038/nature09247.
[9] J. Maertzdorf, M. Ota, D. Repsilber, H.J. Mollenkopf, J. Weiner, P.C. Hill, S.H.E. Kaufmann, Functional correlations of pathogenesis-driven gene expression signatures in tuberculosis, PLoS ONE 6 (2011) e26938, http://dx.doi.org/10.1371/journal.pone.0026938.
[10] C.I. Bloom, C.M. Graham, M.P.R. Berry, F. Rozakeas, P.S. Redford, Y. Wang, Z. Xu, K.A. Wilkinson, R.J. Wilkinson, Y. Kendrick, G. Devouassoux, T. Ferry, M. Miyara, D. Bouvry, V. Dominique, G. Gorochov, D. Blankenship, M. Saadatian, P. Vanhems, H. Beynon, R. Vancheeswaran, M. Wickremasinghe, D. Chaussabel, J. Banchereau, V. Pascual, L. Ho, M. Lipman, A. O’Garra, Transcriptional blood signatures distinguish pulmonary tuberculosis, pulmonary sarcoidosis, pneumonias and lung cancers, PLoS One 8 (2013) e70630, http://dx.doi.org/10.1371/journal.pone.0070630.
[11] S. Blankley, C.M. Graham, J. Turner, M.P.R. Berry, C.I. Bloom, Z. Xu, V. Pascual, J. Banchereau, D. Chaussabel, R. Breen, G. Santis, D.M. Blankenship, M. Lipman, A. O’Garra, The transcriptional signature of active tuberculosis reflects symptom status in extra-pulmonary and pulmonary tuberculosis, PLoS One 11 (2016) e0162220, http://dx.doi.org/10.1371/journal.pone.0162220.
[12] T.E. Sweeney, W.A. Haynes, F. Vallania, J.P. Ioannidis, P. Khatri, Methods to increase reproducibility in differential gene expression via meta-analysis, Nucl. Acids Res. 45 (2016) e1, http://dx.doi.org/10.1093/nar/gkw797.
[13] Wellcome Trust, Sharing data from large-scale biological research projects: A system of tripartite responsibility, Fort Lauderdale, 2003, https://www.genome.gov/pages/research/wellcomereport0303.pdf. (Accessed September 20, 2018).
[14] R. Edgar, M. Domrachev, A.E. Lash, Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucl. Acids Res. 30 (2002) 207–210, http://dx.doi.org/10.1093/nar/30.1.207.
[15] J. Allen, K.J. Inder, T.J. Lewin, J.R. Attia, F.J. Kay-Lambkin, A.L. Baker, T. Hazell, B.J. Kelly, Integrating and extending cohort studies: lessons from the eXtending Treatments, Education and Networks in Depression (xTEND) study, BMC Med. Res. Methodol. 13 (2013) 122, http://dx.doi.org/10.1186/1471-2288-13-122.
[16] V. Nygaard, E.A. Rødland, E. Hovig, Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses, Biostatistics 17 (2016) 29–39, http://dx.doi.org/10.1093/biostatistics/kxv027.
[17] J.A. Thompson, J. Tan, C.S. Greene, Cross-platform normalization of microarray and RNA-seq data for machine learning applications, PeerJ 4 (2016) e1621, http://dx.doi.org/10.7717/peerj.1621.
[18] C.A. Bobak, A.J. Titus, J.E. Hill, Investigating Random Forest Classification on Publicly Available Tuberculosis Data to Uncover Robust Transcriptional Biomarkers, HEALTHINF, 2018, pp. 695–701.
[19] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G. Sherlock, Gene ontology: tool for the unification of biology, Nat. Genet. 25 (2000) 25–29, http://dx.doi.org/10.1038/75556.
[20] W.A. Haynes, F. Vallania, C. Liu, E. Bongen, A. Tomczak, M. Andres-Terre, S. Lofgren, A. Tam, C.A. Deisseroth, M.D. Li, T.E. Sweeney, P. Khatri, Empowering multi-cohort gene expression analysis to increase reproducibility, BioRxiv (2016), http://biorxiv.org/content/early/2016/08/25/071514.
[21] J.N. Taroni, C.S. Greene, Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously, BioRxiv (2017) 118349, http://dx.doi.org/10.1101/118349.
[22] T.E. Sweeney, COCONUT: COmbat CO-Normalization Using conTrols (COCONUT), 2016, https://cran.r-project.org/web/packages/COCONUT/COCONUT.pdf. (Accessed 24 May 2018).
[23] T.E. Sweeney, Package ‘COCONUT’, 2017, https://cran.r-project.org/web/packages/COCONUT/COCONUT.pdf. (Accessed 24 May 2018).
[24] E. Lin, H.-Y. Lane, Machine learning and systems genomics approaches for multi-omics data, Biomark. Res. 5 (2017) 2, http://dx.doi.org/10.1186/s40364-017-0082-y.
[25] A.V. Lebedev, E. Westman, G.J.P. Van Westen, M.G. Kramberger, A. Lundervold, D. Aarsland, H. Soininen, I. Kloszewska, P. Mecocci, M. Tsolaki, B. Vellas, S. Lovestone, A. Simmons, Random Forest ensembles for detection and prediction of Alzheimer’s disease with a good between-cohort robustness, NeuroImage Clin. 6 (2014) 115–125, http://dx.doi.org/10.1016/j.nicl.2014.08.023.
[26] R. Diaz-Uriarte, S.A. de Andres, Gene selection and classification of microarray data using random forest, BMC Bioinformatics 7 (2006) 3, http://dx.doi.org/10.1186/1471-2105-7-3.
[27] P.S. Gromski, H. Muhamadali, D.I. Ellis, Y. Xu, E. Correa, M.L. Turner, R. Goodacre, A tutorial review: metabolomics and partial least squares-discriminant analysis – a marriage of convenience or a shotgun wedding, Anal. Chim. Acta 879 (2015) 10–23, http://dx.doi.org/10.1016/j.aca.2015.02.012.
[28] N.R. Pal, A fuzzy rule based approach to identify biomarkers for diagnostic classification of cancers, in: 2007 IEEE Int. Fuzzy Syst. Conf., IEEE, 2007, pp. 1–6, http://dx.doi.org/10.1109/FUZZY.2007.4295533.
[29] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1997) 67–82, http://dx.doi.org/10.1109/4235.585893.
[30] S.S. Verma, A. Lucas, X. Zhang, Y. Veturi, S. Dudek, B. Li, R. Li, R. Urbanowicz, J.H. Moore, D. Kim, M.D. Ritchie, Collective feature selection to identify crucial epistatic variants, BioData Min. 11 (2018) 5, http://dx.doi.org/10.1186/s13040-018-0168-6.
[31] E.W. Steyerberg, Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating, Springer, 2009, https://books.google.com/books?id=kHGK58cLsMIC&dq=statistical+parsimony+and+clinical+models&lr=&source=gbs_navlinks_s. (Accessed 18 September 2018).
[32] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (1999) 293–300, https://lirias.kuleuven.be/bitstream/123456789/218716/2/Suykens_NeurProcLett.pdf. (Accessed 24 May 2018).
[33] T.R. Mellors, C.A. Rees, W.F. Wieland-Alter, C.F. von Reyn, J.E. Hill, The volatile molecule signature of four mycobacteria species, J. Breath Res. 11 (2017) 31002, http://dx.doi.org/10.1088/1752-7163/aa6e06.
[34] M. Pérez-Enciso, M. Tenenhaus, Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach, Hum. Genet. 112 (2003) 581–592, http://dx.doi.org/10.1007/s00439-003-0921-9.
C.A. Bobak, A.J. Titus and J.E. Hill / Applied Soft Computing Journal 74 (2019) 264–273 273
[35] A. Statnikov, L. Wang, C.F. Aliferis, A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification, BMC Bioinformatics 9 (2008) 319, http://dx.doi.org/10.1186/1471-2105-9-319.
[36] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32, https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf. (Accessed 24 May 2018).
[37] M. Barker, W. Rayens, Partial least squares for discrimination, J. Chemom. 17 (2003) 166–173, http://dx.doi.org/10.1002/cem.785.
[38] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297, https://link.springer.com/content/pdf/10.1007/BF00994018.pdf. (Accessed 24 May 2018).
[39] T. Hastie, R. Tibshirani, J. Friedman, Support vector machines and flexible discriminants, in: The Elements of Statistical Learning, Springer, 2009, pp. 417–458, http://dx.doi.org/10.1007/978-0-387-84858-7_12.
[40] D.R. Perez, G. Narasimhan, So you think you can PLS-DA? bioRxiv (2017) 207225, http://dx.doi.org/10.1101/207225.
[41] S. Hashemi, T. Trappenberg, Using SVM for classification in datasets with ambiguous data, SCI 2002 (2002), https://web.cs.dal.ca/~tt/papers/SCI2002color2.pdf. (Accessed 25 May 2018).
[42] M. Kuhn, J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, Z. Mayer, M. Benesty, R. Lescarbeau, A. Ziem, L. Scrucca, Y. Tang, C. Candan, T. Hunt, caret: Classification and Regression Training.
[43] F. Mosteller, J.W. Tukey, Data Analysis, Including Statistics, Addison-Wesley, 1968, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.7042&rep=rep1&type=pdf. (Accessed 24 May 2018).
[44] P.F. Thall, R. Simon, D.A. Grier, Test-based variable selection via cross-validation, J. Comput. Graph. Stat. 1 (1992) 41–61, http://dx.doi.org/10.1080/10618600.1992.10474575.
[45] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, 1995, http://robotics.stanford.edu/~ronnyk. (Accessed 24 May 2018).
[46] H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ. Psychol. 24 (1933) 417–441, http://dx.doi.org/10.1037/h0071325.
[47] S. Khalid, T. Khalil, S. Nasreen, A survey of feature selection and feature extraction techniques in machine learning, in: 2014 Sci. Inf. Conf., IEEE, 2014, pp. 372–378, http://dx.doi.org/10.1109/SAI.2014.6918213.
[48] R. Caruana, A. Niculescu-Mizil, Data mining in metric space: an empirical analysis of supervised learning performance criteria, in: Proc. Tenth ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2004, pp. 69–78, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.60.6684. (Accessed 26 May 2018).
[49] S. Takahama, A.M. Dillner, Model selection for partial least squares calibration and implications for analysis of atmospheric organic aerosol samples with mid-infrared spectroscopy, J. Chemom. 29 (2015) 659–668, http://dx.doi.org/10.1002/cem.2761.
[50] M. Kuhn, K. Johnson, Applied Predictive Modeling, second ed., Springer, 2016, http://appliedpredictivemodeling.com/about/. (Accessed 18 September 2018).
[51] C.-W. Hsu, C.-C. Chang, C.-J. Lin, A Practical Guide to Support Vector Classification, Taipei, 2003.
[52] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed., Springer-Verlag, New York, NY, USA, 2009, p. 745, http://dx.doi.org/10.1111/j.1467-985X.2010.00646_6.x.
[53] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (2008) 2579–2605.
[54] Applied Predictive Modeling, Down-sampling using random forests, 2013, http://appliedpredictivemodeling.com/blog/2013/12/8/28rmc2lv96h8fw8700zm4nl50busep. (Accessed 26 May 2018).
[55] J.N. Mandrekar, Receiver operating characteristic curve in diagnostic test assessment, J. Thorac. Oncol. 5 (2010) 1315–1316, http://dx.doi.org/10.1097/JTO.0B013E3181EC173D.
[56] World Health Organization, High-Priority Target Product Profiles for New Tuberculosis Diagnostics, World Health Organization, 2014, http://www.who.int/tb/publications/tpp_report/en/. (Accessed 18 September 2018).
[57] J.E. McDermott, J. Wang, H. Mitchell, B.-J. Webb-Robertson, R. Hafen, J. Ramey, K.D. Rodland, Challenges in biomarker discovery: combining expert insights with statistical analysis of complex omics data, Expert Opin. Med. Diagn. 7 (2013) 37–51, http://dx.doi.org/10.1517/17530059.2012.718329.
[58] E. Eden, R. Navon, I. Steinfeld, D. Lipson, Z. Yakhini, GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists, BMC Bioinformatics 10 (2009) 48, http://dx.doi.org/10.1186/1471-2105-10-48.
[59] J.L. Flynn, J. Chan, K.J. Triebold, D.K. Dalton, T.A. Stewart, B.R. Bloom, An essential role for interferon gamma in resistance to Mycobacterium tuberculosis infection, J. Exp. Med. 178 (1993) 2249–2254, http://dx.doi.org/10.1084/JEM.178.6.2249.