This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2019.2899218, IEEE Journal of Biomedical and Health Informatics
JOURNAL OF IEEE BIOMEDICAL AND HEALTH INFORMATICS
Abstract: The diagnosis of Type 2 Diabetes (T2D) at an early stage has a key role for an adequate T2D integrated management system and patient's follow-up. Recent years have witnessed an increasing amount of available Electronic Health Record (EHR) data, and Machine Learning (ML) techniques have been considerably evolving. However, managing and modeling this amount of information may lead to several challenges such as overfitting, model interpretability and computational cost. Starting from these motivations, we introduced a ML method called Sparse Balanced Support Vector Machine (SB-SVM) for discovering T2D in a novel collected EHR dataset (named FIMMG dataset). In particular, among all the EHR features related to exemptions, examination and drug prescriptions, we have selected only those collected before T2D diagnosis from a uniform age group of subjects. We demonstrated the reliability of the introduced approach with respect to other ML and Deep Learning approaches widely employed in the state-of-the-art for solving this task. Results show that the SB-SVM outperforms the other state-of-the-art competitors, providing the best compromise between predictive performance and computation time. Additionally, the induced sparsity increases the model interpretability, while implicitly managing high dimensional data and the usual unbalanced class distribution.

Index Terms: Type 2 Diabetes, Machine Learning, Electronic Health Record, Support Vector Machine, Decision Support System.

I. INTRODUCTION

The World Health Organization (WHO) reported that the global prevalence of diabetes is around 9% (more than 400 million people). Of the people with diabetes, 90% suffer from Type 2 Diabetes (T2D) [1], [2]. T2D is on the rise worldwide, and in 2012 alone diabetes caused an estimated 1.5 million deaths. The WHO anticipates that worldwide deaths will double by 2030 [3]. In developing nations, more than half of all diabetic cases go undiagnosed. This can be attributed to the fact that T2D symptoms may be less marked than those of other types of diabetes (e.g., Type 1). Nonetheless, the International Diabetes Federation (IDF) stated that early diagnosis and appropriate treatments can save lives while preventing or significantly delaying devastating complications [2]. Moreover, diabetes is a major cost on the economic balances of national health systems (the IDF indicates for the year 2015 a level of expenditure for the treatment of diabetic patients equal to 11.6% of the total world health expenditure).

A more efficient integrated management system, including General Practitioners (GPs) and specialists with multidisciplinary skills, could be a valid solution to alleviate the healthcare costs while preventing diabetes-related diseases (e.g., diabetic retinopathy, renal diabetes). Almost all GP outpatient clinics are now equipped with EHRs storing the health history of the patients as well as several kinds of heterogeneous information (i.e., demographic, monitoring, lifestyle, clinical). The reason for a strong investment in Health Information Technology (HIT) is that wider adoption of EHRs will reduce healthcare costs, medical errors [4], [5], patient complications and mortality [6], [7]. Moreover, HIT will decrease the use of healthcare services such as laboratory tests and outpatient visits [4], [8], while improving national healthcare quality and efficiency [9], [10].

The high number of patients' information recorded in EHRs results in a large amount of stored data. In this context, one of the aims of biomedical informatics is to convert EHR data into knowledge that could be exploited for designing a clinical Decision Support System (DSS). In this scenario, ML models are able to manage this enormous amount of data by predicting clinical outcomes and interpreting particular patterns sometimes overlooked by physicians [11].

The aim of this work is to exploit a ML methodology, named sparse balanced Support Vector Machine (SB-SVM), for discovering T2D using features extracted from a novel EHR dataset, namely the FIMMG dataset. The proposed SB-SVM is able to manage high dimensional data by increasing the model interpretability and finding the most relevant features while dealing with the usual unbalanced class distribution. In the data analysis, among all the EHR features related to exemptions, examination and drug prescriptions, we have considered only those collected before T2D diagnosis, while excluding all features that already reveal a T2D patient's follow-up. Additionally, we considered a subset of subjects in the 60-80 years age range, where the chronological age is not statistically relevant for discriminating the T2D condition. The employed FIMMG dataset is available at the following link¹.

¹ http://vrai.dii.univpm.it/content/fimmg-dataset

M. Bernardini and E. Frontoni are with the Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy (e-mail: m.bernardini@pm.univpm.it, e.frontoni@univpm.it).
L. Romeo is with the Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy and with Cognition, Motion and Neuroscience and Computational Statistics and Machine Learning, Fondazione Istituto Italiano di Tecnologia, Genova (e-mail: l.romeo@univpm.it).
P. Misericordia is with Federazione Italiana Medici di Medicina Generale (FIMMG) (e-mail: p.misericordia@mmg.org).
2168-2194 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
We formulate three different research questions in order to measure the reliability of our approach with respect to the state-of-the-art methodologies:
• Case I: Is the SB-SVM approach able to predict T2D using the whole set of EHR features? (Section III-B1).
• Case II: Is the SB-SVM approach able to predict T2D using only a subset of EHR features collected before T2D clinical diagnosis? (Section III-B2).
• Case III: Is the SB-SVM approach able to predict T2D using only a subset of EHR features collected before T2D clinical diagnosis from a uniform age group of subjects? (Section III-B3).

The paper is organized as follows: Section II gives an overview of the state-of-the-art ML approaches for T2D prediction using EHR features; Section III describes the FIMMG dataset and the data analysis procedure; Section IV describes the proposed SB-SVM approach; Section V shows the results for the three proposed research questions; Section VI and Section VII present respectively the discussion and the conclusions, as well as future work in that direction.

II. RELATED WORK

In the last decade, with the increasing amount of available data, ML techniques for discriminating the T2D condition have been considerably evolving. Accordingly, the widespread advances in the field of Deep Learning (DL) have encouraged the application of DL approaches to clinical tasks (including outcome prediction) based on EHR data [12]. However, the limitations of DL research involve topics such as data heterogeneity and model interpretability [12], [13]. Real-world datasets extracted from EHRs usually have high dimensionality and several noisy and/or redundant features. Managing and modeling this amount of information may lead to several challenges such as i) overfitting; ii) reduction of interpretability; iii) increased computational cost.

In this context, researchers tried to overcome these issues by performing feature selection [14], [15], [16], [17] or feature engineering techniques [18]. Zheng et al. introduced a feature engineering based framework able to improve the performance of traditional ML models (e.g., SVM, logistic regression (LR), decision tree (DT), k-nearest neighbor (KNN), random forest (RF), naive Bayes (NB)) for predicting the T2D condition [18]. Unlike our approach, the feature extraction stage implemented by [18] required a further computational effort as well as the supervision of the physician in order to define high-level features. Sheikhi et al. performed the analysis without considering some features directly related to T2D (i.e., glycated hemoglobin (HbA1c)) [14]. Then, a feature selection based on LASSO and ridge LR was executed before predicting the diabetic condition. Kamkar et al. exploited the typical EHR tree structure in order to perform a feature selection based on the Tree-LASSO algorithm [15]. Cho et al. predicted the onset of diabetic nephropathy from an unbalanced and irregular dataset. Feature selection was performed with baseline statistical methods, ReliefF [19], SVM sensitivity analysis and recursive feature elimination (RFE) [16]. In [17] a filter feature selection strategy based on ReliefF was adopted as well, to rank the important attributes for identifying the diabetic condition, showing that the main discriminative feature is the age marker. Concerning the classification stage, they employed standard supervised algorithms such as NB, DT, and Instance-Based learners [17].

The main difference with respect to the above-mentioned literature [14], [15], [16], [17] lies in the strategy used to manage high dimensional data while achieving reliable performance. Our embedded method interacts with the classifier structure and is able to manage the huge amount of features while discovering the most relevant ones without any supplementary feature selection stage. In contrast, the wrapper feature selection approaches [16] can result in an external module with an increment of computational effort (e.g., beam search, sequential forward selection, sequential backward elimination) for the training and validation stage, while the filter methodologies [14], [15], [16], [17] rely on the evaluation of data statistics (e.g., t-test, chi-square, information gain, ReliefF) which is not always linked with the learning algorithm.

The papers [20], [21], [22] are closer to our work: their aim is to develop an algorithm able to manage high dimensional and unbalanced data without performing a supplementary feature selection strategy. However, apart from the proposed methodology, our approach also differs in terms of the experimental procedure for testing the reliability of the ML model in T2D early prediction.

In particular, a DT model was proposed in [20] for discovering T2D in a uniform age group, while ensemble (i.e., bagging) and boosting (i.e., AdaBoost) methodologies were employed for decreasing the generalization error when dealing with high dimensional data. In Wang et al. an ensemble strategy based on RF was likewise employed in order to suggest antihyperglycemic medications for T2D patients [22]. The synthetic minority oversampling technique (SMOTE) algorithm [23] was used to solve the unbalanced class problem by oversampling the minority class. Unlike their approach, we induce sparsity, which leads to an increase in the interpretability of our ML model. Additionally, we deal with the highly unbalanced setting without drawing synthetic samples [22], which are not always consistent with and close to the real class data. As will be shown by the experimental results (Section V), our methodology is more effective than the DT and bagging classifiers (i.e., RF) employed by [20], [22], even in a uniform age group [20].

In Yu et al. an SVM model was proposed for two different classification problems related to the diabetes condition [21]. Their proposed SVM and LR models were able to discriminate between T2D and control subjects in high dimensional data with a large population not suffering from diabetes. The main difference of our approach with respect to [21] lies in the application of a 1-norm regularizer to the hinge loss function, which induces sparsity in the model coefficients.

In summary, the main contributions of the proposed work are:
• The collection and employment of the novel FIMMG dataset.
• The design of an algorithm core for treating and following chronic T2D patients.
There were existing work which proposed the solution of a Sparse SVM while addressing the imbalance dataset problem in different domain ranging from clinical data [24] to image categorization [25]. Differently from [24] we induced sparsity

[Figure: diagram with labels Blood pressure, Clinical Exams, Pathologies, Drugs, Exemptions, SB-SVM]
called the hinge loss while the 2-norm penalty is called the ridge penalty [30].

B. Sparse 1-norm SVM

In this work we have replaced the ridge penalty with the 1-norm of w, i.e., the LASSO penalty [31]. This penalty induces sparse solutions in the model coefficients. The 1-norm SVM has been widely used for solving high dimensional tasks while increasing the sparsity as well as the interpretability of the model [32]. The considered sparse 1-norm SVM has the following optimization problem [30]:

$$\min_{w_0, \mathbf{w}} \; \sum_{i=1}^{m} \Big[ 1 - y_i \Big( w_0 + \sum_{j=1}^{q} w_j h_j(x_i) \Big) \Big]_{+} + \lambda \|\mathbf{w}\|_1 \qquad (3)$$

This formulation combines the hinge loss with an l1-constraint, bounding the sum of the absolute values of the coefficients. Since we have solved the 1-norm SVM in the form of an error function (see (3)), we did not set the amount of misclassified subjects (i.e., ∑i ξi) in the form of a box constraint (i.e., C). Instead, the hyperparameter λ (which is inversely proportional to the box constraint) was tuned in order to control the trade-off between loss and penalty as well as between bias and variance. This hyperparameter was tuned with a grid-search according to the experimental procedure described in Section IV-D. The starting value of this hyperparameter for the performed grid-search is 10^−3. Note that a high regularization term λ encourages the sparsity of the model by driving some coefficients exactly to zero, so that irrelevant features are automatically removed from the model. This shrinkage has the effect of controlling the variance of the model coefficients, avoiding overfitting, and improving the generalization performance, especially when there are many highly correlated features. In this way it provides an automatic way of doing model selection in linear models [33]. Accordingly, the LASSO penalty corresponds to a double-exponential prior for the coefficients, while the ridge penalty corresponds to a Gaussian prior [30]. This reflects the greater tendency of the 1-norm SVM to produce some coefficients that are large in magnitude and leave others at 0.

In order to solve the optimization problem described in (3) we employed the Sparse Reconstruction by Separable Approximation (SpaRSA) solver [34]. The SpaRSA algorithmic framework [34] exploits the proximity operator used in [35] for solving large-scale optimization problems involving the sum of a smooth error term and a possibly nonsmooth regularizer. In this context, the proximal methods are specifically tailored to optimize an objective function of the form (3), gaining faster convergence rates and higher scalability to large nonsmooth convex problems [36]. In particular, the advantages of SpaRSA with respect to other competitors [37], [38] are summarized in [34].

It is worth noting here that when the predictor matrix X is not of full column rank, the LASSO solutions are not unique, because the criterion is not strictly convex [33]. The non-full-rank case can occur when n > m or when n ≤ m due to collinearity of the EHR features [33]. The numerical algorithms (e.g., SpaRSA, iterative shrinkage thresholding and coordinate descent-based algorithms) can therefore compute valid solutions in the non-full-rank case [33].

C. Sparse Balanced SVM

The margin of the predicted SB-SVM response was mapped into the [0, 1] interval by using a sigmoid function, without changing the SVM error function. The mapping was realized according to [39], adding a post-processing step where the sigmoid parameters were learned with regularized binomial maximum likelihood. The computed probabilistic outputs of SB-SVM determine the predicted response (y_p) based on the threshold th = 0.5:

$$y_p = \begin{cases} \mathrm{T2D}, & \text{if } P(y_p = \mathrm{T2D} \mid f) = \dfrac{1}{1 + \exp(A f + B)} > th \\ \mathrm{Control}, & \text{otherwise} \end{cases} \qquad (4)$$

Several solutions to the class-imbalance problem were previously proposed both at the data and algorithmic levels [40], [41]. Unlike the data-level solutions, which include many different forms of re-sampling (e.g., random re-sample, directed re-sample, oversample with informed generation), we have decided to work at the algorithm level. In particular, we adjust the decision threshold th in the validation set in order to maximize the macro-recall metric while alleviating the effect of highly unbalanced data. The prediction of the SB-SVM produces an uncalibrated value that is not a probability. The post-processing step transforms the output of the SB-SVM classifier (i.e., distance from the margin) into a posterior probability. Thus, the posterior probability represents salient information which can be integrated in a clinical DSS for supporting the early-stage diagnosis by revealing the confidence level of the performed prediction. The idea behind the employed methodology [39] lies in the use of a parametric model (i.e., sigmoid model, see (4)) to fit the posterior directly. Hence, the parameters A and B of the sigmoid function are adapted to give the best probability outputs [39]. Unlike [42], the employed sigmoid model has two parameters trained discriminatively, rather than one parameter.

The general idea behind the application of the proposed approach lies in the two main challenges that EHR data present: (i) high dimensional data with several irrelevant/noisy features and a high degree of redundancy and (ii) the natural unbalanced setting of this task. Starting from these motivations, the introduction of the 1-norm LASSO regularizer induces sparsity while increasing the interpretability of the linear model. This is salient information which may help to understand not only the predicted outcome but also why and how the prediction was made. Additionally, the LASSO regularizer may improve the prediction accuracy by reducing the variance of the predicted class by shrinking some coefficients to zero [33]. Accordingly, the shift of the decision threshold over the estimated posterior probability favours the minority class, increasing the recall rate over all the predicted testing set.
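The objective in (3) can be illustrated with a minimal Python sketch. It assumes the identity basis h_j(x) = x_j and uses a plain proximal subgradient iteration with soft-thresholding, which is a swapped-in simplification and not the SpaRSA solver employed in the paper. The function names, step size, iteration count and toy data are all hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def objective(w0, w, X, y, lam):
    """Objective (3): sum of hinge losses plus lam * ||w||_1 (w0 is not penalized)."""
    hinge = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (w0 + sum(wj * xj for wj, xj in zip(w, xi)))
        hinge += max(0.0, 1.0 - margin)
    return hinge + lam * sum(abs(wj) for wj in w)


def fit_l1_svm(X, y, lam, step=0.01, n_iter=2000):
    """Proximal subgradient descent (NOT SpaRSA): a subgradient step on the
    hinge term followed by soft-thresholding for the l1 penalty."""
    q = len(X[0])
    w0, w = 0.0, [0.0] * q
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * q
        for xi, yi in zip(X, y):
            margin = yi * (w0 + sum(wj * xj for wj, xj in zip(w, xi)))
            if margin < 1.0:  # this sample violates the margin
                g0 -= yi
                for j in range(q):
                    g[j] -= yi * xi[j]
        w0 -= step * g0
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, g)]
    return w0, w


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[2.0, 0.1], [1.5, -0.2], [-2.0, 0.2], [-1.5, -0.1]]
y = [1, 1, -1, -1]
w0, w = fit_l1_svm(X, y, lam=1.0)
print("w0 =", w0, "w =", w, "objective =", objective(w0, w, X, y, lam=1.0))
```

On this toy problem the coefficient of the noise feature is driven exactly to zero by the soft-thresholding step, which is the sparsity effect the text attributes to the LASSO penalty.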
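The sigmoid mapping in (4) and the validation-set threshold adjustment described for the Sparse Balanced SVM can be sketched as follows. This is a minimal illustration and not the authors' Matlab implementation: the sigmoid parameters A and B are fixed by hand here rather than learned by regularized maximum likelihood, and the margins, labels and function names are hypothetical.

```python
import math


def platt_probability(f, A, B):
    """Sigmoid mapping of (4): P(y = T2D | f) = 1 / (1 + exp(A*f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))


def macro_recall(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class recalls."""
    recalls = []
    for c in labels:
        n_c = sum(1 for t in y_true if t == c)
        hit = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hit / n_c if n_c else 0.0)
    return sum(recalls) / len(recalls)


def best_threshold(probs, y_true, grid=None):
    """Pick th in {0, 0.01, ..., 1} maximizing macro-recall on validation data.
    Ties are resolved in favour of the smallest threshold."""
    if grid is None:
        grid = [k / 100 for k in range(101)]
    best_th, best_score = 0.5, -1.0
    for th in grid:
        y_pred = [1 if p > th else 0 for p in probs]
        score = macro_recall(y_true, y_pred)
        if score > best_score:
            best_th, best_score = th, score
    return best_th, best_score


# Hypothetical validation margins: 8 controls (label 0) and 2 T2D subjects (label 1).
f_val = [-2.0, -1.5, -1.2, -1.0, -0.8, -0.6, -0.5, -0.3, -0.4, 0.6]
y_val = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
A, B = -2.0, 0.0  # assumed sigmoid parameters; A < 0 so larger margins give higher probability
probs = [platt_probability(f, A, B) for f in f_val]

default_score = macro_recall(y_val, [1 if p > 0.5 else 0 for p in probs])
th, tuned_score = best_threshold(probs, y_val)
print("macro-recall at th=0.5:", default_score)
print("tuned th:", th, "macro-recall:", tuned_score)
```

With unbalanced classes the tuned threshold falls below 0.5 in this example, trading a few false positives among the controls for a higher recall on the minority class.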
features selection and data level solution for dealing with the natural unbalanced setting of the task (Table IV).

The assessment of the introduced ML model was performed according to the following measures:
• macro-precision (precision): the precision is calculated for each class and then the unweighted mean is taken.
• macro-recall (recall): the recall is calculated for each class and then the unweighted mean is taken.
• Receiver Operating Characteristic (ROC): it is obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
• Area Under Receiver Operating Characteristic curve (AUC): it represents the probability that the classifier will rank a randomly chosen positive sample higher than a randomly chosen negative one.
• l0 measure: it is employed as a measure of model sparsity [45]. It calculates the number of zero coefficients of the model: #{j : coeff_j = 0}.

The overview of the SB-SVM model architecture is shown in Figure 2. In both experiments we performed a stratified Tenfold Cross-Validation (10-CV) over subjects. The optimization of the SB-SVM hyperparameters (i.e., λ, th) was performed with a grid-search optimizing the macro-recall score in a nested stratified Fivefold Cross-Validation. Hence, each split of the outer loop was trained with the optimal hyperparameters tuned in the inner loop. Although this procedure is computationally expensive, it yields an unbiased and robust performance evaluation [46].

The regularization factor λ was picked inside the subset {0.001, 0.01, 0.1, 1, 10}, while the threshold was picked inside the subset {0, 0.01, 0.02, . . . , 1}. Table III shows the different hyperparameters for all competitors' approaches, as well as the grid-search set. Since the aim is to find the best threshold value of the posterior probability/margin of the classifier, the threshold is a common hyperparameter for all tested methods except for SMOTE-based approaches and 1-norm SVM (i.e., the approaches which have been already formulated to be consistent with the unbalanced setting)².

The stability or reproducibility of the model is also an important aspect to be considered, especially in the case of sparse solutions, which can lead to unstable models [47]. We have studied and reported this aspect by measuring the variance of the selected SB-SVM hyperparameters (Section V).

² Matlab code to reproduce all results (and modify the setting) is available at the following link: https://codeocean.com/capsule/6677801

Fig. 2: Overview of the SB-SVM model architecture. A Tenfold Cross-Validation procedure was executed. The optimization of the SB-SVM hyperparameters was performed implementing a grid-search and optimizing the macro-recall score in a nested stratified Fivefold Cross-Validation. Hence, each split of the outer loop was trained with the optimal hyperparameters tuned in the inner loop.

V. RESULTS

We tested our approach using the FIMMG dataset described in Section III-A according to the data analysis reported in Section III-B, which aims to answer the three research questions described in Section I.

A. Case I

We report in Figure 3 the results of the SB-SVM for all cases according to the precision, recall, and AUC. In particular, the highest results were obtained in Case I (Figure 3a), considering all features as well as all subjects. Since recall was optimized in the validation set, it achieved better performance than precision.

The ROC curves for each fold are shown in Figure 4a as well as the averaged curve. All points of the ROC curves are above the chance level (i.e., red dotted line).

The recall and AUC of SB-SVM are compared with those of other state-of-the-art ML and DL approaches applied for T2D prediction (Table IV). The recall and AUC for the SB-SVM are significantly higher (p < .05) than all baseline models, except DT (recall: t18 = 0.514, p = .61; AUC: t18 = 1.652, p = .12) and RF (recall: t18 = 0.514, p = .09). The ROC curves also confirm that the SB-SVM is above the other baseline models (Figure 5a). The SB-SVM considerably outperforms (p < .05) the SCAD SVM, MLP, resampling and RFE-SVM based models. On the contrary, the recall and AUC for the SB-SVM are statistically comparable to those of the 1-norm SVM, DBN, Ttest and ReliefF based models.

Having evaluated and compared the performance of the proposed SB-SVM method in discriminating between healthy and T2D patients, the aim is to identify which features contribute to the decision.

The SB-SVM offers a natural way of addressing this goal: compared to other approaches (i.e., SVM Gauss, MLP, DBN) the model is linear and easily interpretable. The computed l0 measure is 0.39, while the magnitude of the SB-SVM coefficients is shown in Figure 6a. The stability of the
TABLE III: Range of Hyperparameters (Hyp) for the proposed Sparse Balanced-Support Vector Machine (SB-SVM) model and other tested
approaches: Linear SVM (SVM Lin), Gaussian SVM (SVM Gauss), K-nearest neighbor (KNN), decision tree (DT), random forest (RF),
logistic regression (LR) ridge, smoothly clipped absolute deviation (SCAD) SVM, 1-norm SVM, multi layer perceptron (MLP) and deep
belief network (DBN). The threshold hyperparameter th is optimized for each model except for SMOTE-based approaches and 1-norm SVM
(i.e., the approaches which have been already formulated to be consistent with the unbalanced setting).
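The nested cross-validation with grid-search over λ and th can be sketched in plain Python as below. This is an illustrative skeleton under stated assumptions, not the authors' Matlab code: `stratified_folds`, `nested_cv` and the `fit_and_score` callable are hypothetical names, and the scorer used in the demonstration is a dummy that ignores the data so that the selection logic can be followed in isolation.

```python
import random


def stratified_folds(y, k, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, t in enumerate(y) if t == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def nested_cv(y, grid, fit_and_score, outer_k=10, inner_k=5):
    """Outer loop estimates performance; inner loop selects hyperparameters.

    fit_and_score(train_idx, test_idx, params) is a user-supplied (hypothetical)
    callable that trains on train_idx and returns the macro-recall on test_idx.
    """
    outer_scores, chosen = [], []
    for held_out in stratified_folds(y, outer_k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        y_train = [y[i] for i in train]
        # Build the inner (fit, validation) splits once per outer fold.
        splits = []
        for val_pos in stratified_folds(y_train, inner_k, seed=1):
            val_set = set(val_pos)
            val = [train[p] for p in val_pos]
            fit = [train[p] for p in range(len(train)) if p not in val_set]
            splits.append((fit, val))
        best_params, best_score = None, -1.0
        for params in grid:
            mean = sum(fit_and_score(f, v, params) for f, v in splits) / len(splits)
            if mean > best_score:
                best_params, best_score = params, mean
        chosen.append(best_params)
        # Retrain on the whole outer training split with the selected parameters.
        outer_scores.append(fit_and_score(train, held_out, best_params))
    return outer_scores, chosen


# Grid taken from the text: lambda in {0.001, ..., 10}, th in {0, 0.01, ..., 1}.
LAMBDAS = [0.001, 0.01, 0.1, 1, 10]
THRESHOLDS = [k / 100 for k in range(101)]
GRID = [(lam, th) for lam in LAMBDAS for th in THRESHOLDS]


def dummy_scorer(train_idx, test_idx, params):
    """Placeholder score peaking at (lam, th) = (0.1, 0.3); it ignores the data."""
    lam, th = params
    return 1.0 - abs(lam - 0.1) - abs(th - 0.3)


y_demo = [0] * 90 + [1] * 10  # hypothetical unbalanced labels
outer_scores, chosen = nested_cv(y_demo, GRID, dummy_scorer)
print(chosen[0], sum(outer_scores) / len(outer_scores))
```

In practice `fit_and_score` would train the classifier on the given indices and return the macro-recall. The variance of the selected hyperparameters across outer folds can then be read from `chosen`, in the spirit of the stability analysis mentioned in the text.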
[Figure: three panels; x-axis: False Positive Rate]
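The ROC curve and the AUC used as evaluation measures can be computed as in the following sketch. It assumes binary labels (1 for T2D, 0 for control) and real-valued scores; the function names and the toy scores are hypothetical. The rank-based AUC follows the probabilistic definition given in the text, and the trapezoidal area over the ROC points is included as a cross-check.

```python
def roc_points(scores, y_true):
    """(FPR, TPR) pairs obtained by sweeping the threshold over the sorted scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, y_true), key=lambda s: -s[0])
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc_rank(scores, y_true):
    """AUC as the probability that a random positive outranks a random negative
    (ties count one half)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores for 4 positives and 4 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
points = roc_points(scores, labels)
print(points)
print(auc_rank(scores, labels), auc_trapezoid(points))
```

Both AUC computations agree on this example, and a classifier at chance level would give a value close to 0.5.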
[Figure: three panels; x-axis: False Positive Rate]
Fig. 6: Magnitude of the SB-SVM coefficients and l0 measure values. Top 10-rank features are pointed out with red spots. Panels: (a) Case I: l0 = 0.39; (b) Case II: l0 = 0.91; (c) Case III: l0 = 0.57. x-axis: # SB-SVM model coefficients; y-axis: Weights.
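A sparsity measure of this kind can be computed with a few lines of Python. The text defines the l0 measure as the number of zero coefficients, while the values reported in Fig. 6 lie between 0 and 1, so this sketch returns both the count and the fraction of zero coefficients. Treating the reported values as a fraction is an assumption, and the function name and coefficient vector are hypothetical.

```python
def l0_measure(coeffs, tol=0.0):
    """Return (count, fraction) of coefficients whose magnitude is <= tol."""
    zeros = sum(1 for c in coeffs if abs(c) <= tol)
    return zeros, zeros / len(coeffs)


# Hypothetical coefficient vector of a sparse linear model.
coeffs = [0.0, 1.2, 0.0, -0.4, 0.0]
count, fraction = l0_measure(coeffs)
print(count, fraction)
```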
and AUC decrease respectively by 5.48%, 7.25%, and 9.61%. However, all metrics are above (p < .05) the chance level (0.5).

The ROC curves for each fold are shown in Figure 4b as well as the averaged curve. All points of the ROC curves are above the chance level.

In Case II, once the task complexity has increased after discarding the features reported in Table II, the recall and AUC of SB-SVM are greater than those of all the state-of-the-art ML and DL approaches (Table IV). Regarding the baseline models, the DT (recall: t18 = 0.807, p = .43; AUC: t18 = 2.002, p = .06) and SVM Lin (recall: t18 = 1.811, p = .09; AUC: t18 = 1.369, p = .19) are the closest models to SB-SVM. Also the ROC [...] of the SB-SVM coefficients. The lower Var(λ) = 3.34 × 10^−36 [...] hyperparameters outlined the high stability of the model.

C. Case III

Figure 3c shows the results of SB-SVM for Case III, considering also a uniform distribution of subjects where age is no longer statistically significant for T2D prediction. The precision, recall and AUC remain above (p < .05) the chance level (0.5).

The ROC curves for each fold are shown in Figure 4c as well as the averaged curve. All points of the ROC curves are above the chance level.

The recall and AUC of SB-SVM continue to be greater than those of all the state-of-the-art ML and DL approaches (Table IV). The gain of the SB-SVM is increased especially in [...]

The motivation behind the selection of different top features lies in the substantial divergence between each case which represents a different performed task (Table V).

D. Computation time analysis

The computation time analysis for the training and validation stage is performed in Figure 7a, while the computation time for the testing stage is shown in Figure 7b.

[Figure 7, panel (a): Preprocessing, training and validation time; y-axis: log(time) [sec x fold]]
TABLE V: Top 10-rank features according to the SB-SVM magnitude coefficients: Blood Pressure (BP), Drugs (D), Exam prescriptions
(EP), Exemptions (E), Pathologies (P).
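A ranking of this kind, by magnitude of the linear model coefficients, can be sketched as follows. The function name, feature names and coefficient values are hypothetical placeholders and do not reproduce the entries of Table V.

```python
def top_k_features(names, coeffs, k=10):
    """Rank features by absolute coefficient and keep the k largest non-zero ones."""
    ranked = sorted(zip(names, coeffs), key=lambda nc: abs(nc[1]), reverse=True)
    return [(n, c) for n, c in ranked[:k] if c != 0.0]


# Hypothetical feature names and coefficients of a sparse linear model.
names = ["f0", "f1", "f2", "f3", "f4"]
coeffs = [0.0, 1.2, -2.5, 0.3, 0.0]
print(top_k_features(names, coeffs, k=2))
print(top_k_features(names, coeffs))
```

Because the model is linear, the sign of each retained coefficient also indicates whether the feature pushes the decision toward the T2D or the control class.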
VI. DISCUSSIONS
The proposed approach and the results shown above lead to the following main discussion points.

A. Clinical Perspective
Faced with an often difficult and complicated management of the diabetic patient, the gold-standard diagnosis of T2D remains uncertain and challenging. In this condition, a DSS solution able to provide a reliable prediction of T2D becomes essential in order to target prevention strategies in a timely and appropriate manner. This possibility becomes more interesting when the ML algorithm is able to identify a cluster of patients at high risk of developing the disease. Starting from these clinical motivations, we proposed an ML approach, named SB-SVM, able to predict T2D using the novel FIMMG dataset available at the following link³.

³ http://vrai.dii.univpm.it/content/fimmg-dataset

B. SB-SVM effectiveness and outcomes
The experimental results demonstrate that the information needed for early-stage T2D prediction is enclosed within EHR features. However, the discriminative power of each feature is not uniform, but sparsely distributed. In fact, the SB-SVM outperforms the other state-of-the-art approaches, particularly for the most challenging data analyses (i.e., Case II, Case III). The better recall and AUC suggest that the proposed SB-SVM methodology is a viable and natural solution to model the high-dimensional EHR data while discarding noisy and/or redundant features.
The LASSO regularizer has proven to be an effective strategy for dealing with a high level of noise while significantly reducing both model error and model complexity [26], [33]. The FIMMG dataset comprises several noise components: i) missing values; ii) outliers; iii) unreadable encrypted information for privacy preservation; iv) GPs' transcription typos (e.g., ICD-9 codes) and v) non-uniformity in the definition of certain medical examinations or pathologies.
The better effectiveness, as well as the higher interpretability of the SB-SVM model, are the key advantages of our methodology with respect to other ML and DL competitors. This interpretability is salient information that helps to understand not only the predicted outcome, but also why and how the prediction has been made. On the other hand, DL approaches often lack this sort of algorithmic transparency. While the heuristic optimization procedures for neural networks are demonstrably powerful, it is not easy to understand how they work, and what inputs are most relevant for the prediction [13]. Accordingly, the experimental results outline some peculiarities of SB-SVM with respect to other state-of-the-art work: (i) the algorithm is robust to the naturally highly unbalanced setting (i.e., several healthy subjects and few T2D subjects) without drawing synthetic examples or employing additional resampling strategies, such as in [22]; (ii) the model is linear and easily interpretable due to the induced sparsity; (iii) it is able to manage high-dimensional data, implicitly selecting the most relevant features without requiring a further feature selection [14], [15], [16], [17] or feature engineering techniques [18]; (iv) the sparse solution is stable across the considered dataset and (v) the proposed classifier is trained only from a subset of EHR features recorded before the T2D diagnosis (i.e., Case II, Case III). In particular, we have excluded all ATC code A10 drug prescriptions and exam prescriptions shown in Table II (i.e., Case II), and simultaneously, we have also selected a uniform age sample group where age no longer carries relevant information for discriminating the T2D condition (i.e., Case III).

C. Pattern discrimination and localization
The main research questions that we have answered previously are whether the SB-SVM is able to discover T2D according to the full set of EHR features (Case I), a subset of EHR features collected before the T2D diagnosis (Case II), even when considering a uniform age group of subjects (Case III). This work focuses on the introduction of an ML model for pattern discrimination in order to answer these research questions. However, once the reliability and the robustness of our approach have been established, we may go ahead and ask where the discriminatory information is encoded. The SB-SVM is able to implicitly localize the discriminative pattern while identifying the most relevant features for providing an accurate T2D prediction. These features are summarized in Table V and represent salient information which can be exploited in order to support early-stage diagnosis. It turns out that some of the top 10-rank features selected by the SB-SVM model for both experimental cases are consistent with those reported in the state-of-the-art regarding the T2D risk factors [2], [3]. In particular for Case III, recent studies [48], [49] are also confirming the possibility that exposure to antibiotics (e.g., Moxifloxacin, Netilmicin) could increase the T2D risk.
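The idea of a sparse, class-balanced linear SVM whose coefficient magnitudes rank the features can be illustrated with a short sketch. It is an approximation under stated assumptions and not the SB-SVM implementation: scikit-learn's `LinearSVC` is used as a stand-in (it optimizes a squared hinge loss with the liblinear solver), class imbalance is handled through `class_weight="balanced"` rather than resampling, the data are synthetic, and the helper names `fit_sparse_balanced_svm` and `top_k_features`, the feature names and the value of `C` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def fit_sparse_balanced_svm(X, y, C=0.01):
    """Fit an L1-penalized linear SVM with class weights (no resampling)."""
    scaler = StandardScaler().fit(X)
    clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                    class_weight="balanced", C=C, max_iter=10000)
    clf.fit(scaler.transform(X), y)
    return scaler, clf


def top_k_features(clf, names, k=10):
    """Rank features by absolute coefficient, keeping only non-zero ones."""
    w = clf.coef_.ravel()
    order = np.argsort(-np.abs(w))[:k]
    return [(names[i], float(w[i])) for i in order if w[i] != 0.0]


# Synthetic high-dimensional, unbalanced toy data (roughly 10% positives).
X, y = make_classification(n_samples=600, n_features=200, n_informative=8,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.9, 0.1], random_state=0)
names = [f"feature_{j}" for j in range(X.shape[1])]  # hypothetical names

scaler, clf = fit_sparse_balanced_svm(X, y)
ranking = top_k_features(clf, names, k=10)
n_selected = int(np.count_nonzero(clf.coef_))

print(f"{n_selected} of {X.shape[1]} features have non-zero coefficients")
for name, weight in ranking:
    print(f"{name}: {weight:+.4f}")
```

Smaller values of `C` correspond to stronger L1 regularization and therefore fewer non-zero coefficients; in practice this hyperparameter would be selected by cross-validation.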
[21] W. Yu, T. Liu, R. Valdez, M. Gwinn, and M. J. Khoury, “Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes,” BMC Medical Informatics and Decision Making, vol. 10, no. 1, p. 16, 2010.
[22] Y. Wang, P. F. Li, Y. Tian, J. J. Ren, and J. S. Li, “A shared decision-making system for diabetes medication choice utilizing electronic health record data,” IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 5, pp. 1280–1287, 2017.
[23] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[24] K.-M. Jung, “Support vector machines for unbalanced multicategory classification,” Mathematical Problems in Engineering, vol. 2015, 2015.
[25] J. Bi, Y. Chen, and J. Z. Wang, “A sparse support vector machine approach to region-based image categorization,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 1121–1128.
[26] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
[27] H. Masnadi-Shirazi, N. Vasconcelos, and A. Iranmehr, “Cost-sensitive support vector machines,” arXiv preprint arXiv:1212.0975, 2012.
[28] G. Lee and C. Scott, “Nested support vector machines,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1648–1660, 2010.
[29] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[30] J. Zhu, S. Rosset, R. Tibshirani, and T. J. Hastie, “1-norm support vector machines,” in Advances in Neural Information Processing Systems, 2004, pp. 49–56.
[31] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
[32] P. S. Bradley and O. L. Mangasarian, “Feature selection via concave minimization and support vector machines,” in International Conference on Machine Learning, vol. 98, 1998, pp. 82–90.
[33] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
[34] S. J. Wright, R. D. Nowak, and M. A. Figueiredo, “Sparse reconstruction by separable approximation,” IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2479–2493, 2009.
[35] P. L. Combettes and V. R. Wajs, “Signal recovery by proximal forward-backward splitting,” Multiscale Modeling & Simulation, vol. 4, no. 4, pp. 1168–1200, 2005.
[36] S. Sra, S. Nowozin, and S. J. Wright, Optimization for Machine Learning. The MIT Press, 2011.
[37] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, vol. 57, no. 11, pp. 1413–1457, 2004.
[38] M. A. Figueiredo, R. D. Nowak, and S. J. Wright, “Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 586–597, 2007.
[39] J. Platt, “Probabilistic outputs for support vector machines and comparison to regularized likelihood methods,” Advances in Large Margin Classifiers, pp. 61–74, 2000.
[40] N. V. Chawla, N. Japkowicz, and A. Kotcz, “Special issue on learning from imbalanced data sets,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 1–6, 2004.
[41] V. Ganganwar, “An overview of classification algorithms for imbalanced datasets,” International Journal of Emerging Technology and Advanced Engineering, vol. 2, pp. 42–47, 01 2012.
[42] T. Hastie and R. Tibshirani, “Classification by pairwise coupling,” in Advances in Neural Information Processing Systems, 1998, pp. 507–513.
[43] E. Choi, A. Schuetz, W. F. Stewart, and J. Sun, “Medical concept representation learning from electronic health records and its application on heart failure prediction,” arXiv preprint arXiv:1602.03686, 2016.
[44] H. Li, X. Li, M. Ramanathan, and A. Zhang, “Identifying informative risk factors and predicting bone disease progression via deep belief networks,” Methods, vol. 69, no. 3, pp. 257–265, 2014.
[45] N. Hurley and S. Rickard, “Comparing measures of sparsity,” IEEE Transactions on Information Theory, vol. 55, no. 10, pp. 4723–4741, 2009.
[46] G. C. Cawley and N. L. Talbot, “On over-fitting in model selection and subsequent selection bias in performance evaluation,” Journal of Machine Learning Research, vol. 11, no. Jul, pp. 2079–2107, 2010.
[47] L. Baldassarre, M. Pontil, and J. Mourão-Miranda, “Sparsity is better with stability: combining accuracy and stability for model selection in brain decoding,” Frontiers in Neuroscience, vol. 11, p. 62, 2017.
[48] K. H. Mikkelsen, F. K. Knop, M. Frost, J. Hallas, and A. Pottegård, “Use of antibiotics and risk of type 2 diabetes: a population-based case-control study,” The Journal of Clinical Endocrinology & Metabolism, vol. 100, no. 10, pp. 3633–3640, 2015.
[49] B. Boursi, R. Mamtani, K. Haynes, and Y.-X. Yang, “The effect of past antibiotic exposure on diabetes risk,” European Journal of Endocrinology, pp. EJE–14, 2015.
[50] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
[51] S. Tateishi, H. Matsui, and S. Konishi, “Nonlinear regression modeling via the lasso-type regularization,” Journal of Statistical Planning and Inference, vol. 140, no. 5, pp. 1125–1134, 2010.

Michele Bernardini received the M.Sc. degree in Electronic Engineering from Polytechnic University of Marche with a thesis entitled “Development of an automatic procedure to mechanically characterize soft tissue materials using visual sensors” in collaboration with Université Libre de Bruxelles. Since November 2017 he is in the PhD School at Polytechnic University of Marche, under the supervision of Prof. Emanuele Frontoni.

Luca Romeo received the Ph.D. degree in computer science in the Department of Information Engineering (DII), Università Politecnica delle Marche, in 2018. His Ph.D. thesis was on "applied machine learning for human motion analysis and affective computing". He is currently a PostDoc Researcher with DII and he is affiliated with the Unit of Cognition, Motion and Neuroscience and Computational Statistics and Machine Learning, Fondazione Istituto Italiano di Tecnologia Genova. His research topic includes Machine learning applied to biomedical applications, affective computing and motion analysis.

Paolo Misericordia is a General Practitioner from Sant’Elpidio a Mare, Fermo, Italy. He received a second level Master’s degree in e-Health at University of Camerino. He is the supervisor of the ICT area and the National study center of Italian Federation of General Practitioners (FIMMG). He designed and coordinated the development and implementation of Netmedica Italia (NMI).

Emanuele Frontoni is Professor of "Fondamenti di Informatica" and "Computer Vision" at the Università Politecnica delle Marche, Engineering Faculty. He received the doctoral degree in electronic engineering from the University of Ancona, Italy, in 2003. In the same year he joined the Dept. of Ingegneria Informatica, Gestionale e dell’Automazione (DIIGA) as a Ph.D. student in "Intelligent Artificial Systems". He obtained his PhD in 2006 discussing a thesis on Vision Based Robotics. His research focuses on applying computer science, artificial intelligence and computer vision techniques to mobile robots and innovative IT applications.