


Vijay Srinivas Agneeswaran, Joydeb Mukherjee, Ashutosh Gupta, Pranay Tonpay, Jayati Tiwari, and Nitin Agarwal
Impetus Infotech Private Limited, Bangalore, Karnataka, India

Abstract
It is time for the healthcare industry to move from the era of analyzing our health history to the age of managing the future of our health. In this article, we illustrate the importance of real-time analytics across the healthcare industry by providing a generic mechanism to reengineer traditional analytics expressed in the R programming language into Storm-based real-time analytics code. This is a powerful abstraction, since most data scientists use R to write the analytics and are not clear on how to make these analytics work in real time and on high-velocity data. Our paper focuses on the applications necessary to a healthcare analytics scenario, specifically on the importance of electrocardiogram (ECG) monitoring. A physician can use our framework to compare ECG reports by categorization and consequently detect arrhythmia. The framework reads the ECG signals and uses a machine learning-based categorizer that runs within a Storm environment to compare different ECG signals. The paper also presents some performance studies of the framework to illustrate the throughput and accuracy trade-off in real-time analytics.

Introduction
The healthcare industry is undergoing a major transformation. The old days of using paper records of patients' data are gone with the digitization of healthcare information, starting with the use of electronic health records (EHRs). The use of EHRs is becoming widespread, partly dictated by financial stimulus and partly by governmental regulations. The healthcare industry is now turning to the use of data analytics. The pace is likely to pick up with the advent of the Affordable Care Act (ACA), or Obamacare, which promises to transform the healthcare industry from fee-for-service to fee-for-value. Moreover, due to the widening of the eligibility requirements and affordability, more people will come into the system for healthcare. This implies the need for big-data analytics, especially for the mandated health exchanges. The Affordable Care Act has also spurred many innovations in healthcare; this is evident in the number of healthcare startups funded recently, such as the following (this list is only indicative, not intended to be complete, and is biased toward health analytics):
1. Health Catalyst, which provides an analytics suite to analyze EHRs.
2. xG Health Solutions, which provides analytics of population health as well as reporting and interpretation.

Editor's Note: Impetus supports multiple venues for dialogue in big data, providing thought leadership and services to create new ways to analyze data to gain key opportunities in business and industry across enterprises. The following is a description of one potential application of their expertise in machine learning within the healthcare space.



SEPTEMBER 2013  DOI: 10.1089/big.2013.0018

Agneeswaran et al.

3. Lumeris, which uses real-time analytics of healthcare data to improve patient care, essentially focused on making ACA work for all players, including health systems, payers, and providers.
4. Eviti, which provides physicians with actionable information using analytics for cancer-related decision making.
5. Humedica, which uses data from multiple sources, including EHRs, claims data, etc., to help healthcare providers analyze patient data as well as population data.
6. HealthTap, which provides a social platform for physicians and patients to share information as well as build a peer reputation.

A number of startup accelerators include Nanthealth, Rockhealth, Healthbox, and Blueprint Health Services, among others.

This article presents a different scenario requiring real-time analytics of big data, and as an example, applies cutting-edge big data technologies to historical data. The electrocardiogram (ECG) signal provides critical information about the heart activity of a patient. Continuous monitoring of ECG is important when a patient is ambulatory or at the bedside. It is very important to treat arrhythmic patients on time, as delays can lead to potentially fatal complications.1

Arrhythmia detection from ECG signals is a well-studied problem. For instance, Gao et al.1 solve it by using an artificial neural network approach based on a Bayesian framework. Rothman and colleagues2 make a comparison between the two common devices, the loop event monitoring and the mobile cardiac outpatient telemetry system, and their effectiveness in detecting arrhythmias.

Machine Learning-Based Classification of ECG Data

The classifier we have developed works in two modes: the training mode (or learning mode) and the operational mode (or advisory mode). In the training mode, we extract features (i.e., variables or transformed variables) in terms of which arrhythmia types, including the absence of arrhythmia, can be represented, and we learn the parameters of the inference mechanism about the occurrence or nonoccurrence of a type of arrhythmia. In this mode, the results cannot advise the doctor; rather, the input about the label (i.e., type of arrhythmia or absence of it) corresponding to each record provided is used for training (see Fig. 1).

FIG. 1. ML-Based Classification of ECG Data: Training Mode.

Once the training is complete, the classifier goes into operational mode, meaning it begins advising the doctor on new, unseen cases that are similar to those seen during training. The doctor arrives at an inference about the presence or absence of arrhythmia, taking the output of the classifier into consideration. Also, if arrhythmia is present, the classifier can suggest which type it is. The various types of arrhythmia classes (labels) are listed in a subsequent section. This mode of operation is depicted in Fig. 2.

The input to the machine learning algorithm is a set of historic patient records. Clinical measurements recorded in the past from ECG signals, namely QRS duration and the RR, P-R, and Q-T intervals, constitute such records, along with information such as gender, age, and weight. This data is padded with the categorical label a cardiologist had assigned to each record, such as normal or one of the 15 types of pathology categories. These make up a total of 279 features, as enumerated by Guvenir et al.3

Description of dataset
We analyzed a dataset containing 452 records belonging to patients of different ages, weights, heights, and genders (see Arrhythmia for more information). There are 280 variables in all in the database downloaded from the source,1 with the arrhythmia class type as the 280th column. Values for this column range from 1 to 16, each representing one of the class codes listed under "Class names and description." There are 5 categorical variables and 274 numeric variables. Five variables had missing values, as enumerated below. These variables occupy columns 11 to 15 in the original dataset.
Vector angles in degrees on front plane of:
11  T: 8 values missing
12  P: 22 values missing
13  QRST: 1 value missing
14  J: 376 values missing
Number of heart beats per minute:
15  Heart rate: 1 value missing
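Missing entries appear as "?" in the raw file, as noted in the processing steps later in the article. The following is a minimal Python sketch (not the authors' R code) of how such entries could be counted per column; the function name `count_missing` and the three-column sample are illustrative assumptions.

```python
import csv
import io

MISSING = "?"  # marker used for missing entries in the raw data file


def count_missing(rows):
    """Return {1-based column number: number of missing entries}."""
    counts = {}
    for row in rows:
        for col, value in enumerate(row, start=1):
            if value.strip() == MISSING:
                counts[col] = counts.get(col, 0) + 1
    return counts


# Illustrative three-column sample, not the real 280-column data.
sample = io.StringIO("75,?,80\n56,1,?\n54,?,63\n")
missing_per_column = count_missing(csv.reader(sample))
print(missing_per_column)
```

Run against the real file, a tally of this kind should reproduce the counts listed above for columns 11 to 15.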

Class names and description

Database: Arrhythmia
Class distribution:

Class code  Class                                          Number of instances
01          Normal                                         245
02          Ischemic changes (Coronary Artery Disease)     44
03          Old Anterior Myocardial Infarction             15
04          Old Inferior Myocardial Infarction             15
05          Sinus tachycardy                               13
06          Sinus bradycardy                               25
07          Ventricular Premature Contraction (PVC)        3
08          Supraventricular Premature Contraction         2
09          Left bundle branch block                       9
10          Right bundle branch block                      50
11          1. degree AtrioVentricular block               0
12          2. degree AV block                             0
13          3. degree AV block                             0
14          Left ventricule hypertrophy                    4
15          Atrial Fibrillation or Flutter                 5
16          Others                                         22

Some of the variables had 0 throughout the column (i.e., across all records). Those variables are enumerated below, each with its column number followed by the variable name:

20   DI S-prime Wave
68   AVL S-prime Wave
70   AVL Existence of ragged R wave
84   AVF Existence of ragged P wave
132  V4 Existence of ragged P wave
133  V4 Existence of diphasic derivation of P wave
140  V5 S-prime Wave
142  V5 Existence of ragged R wave
144  V5 Existence of ragged P wave
146  V5 Existence of ragged T wave
152  V6 S-prime Wave
157  V6 Existence of diphasic derivation of P wave
158  V6 Existence of ragged T wave
165  DI Amplitude S-prime Wave
205  AVL Amplitude S-prime Wave
265  V5 Amplitude S-prime Wave
275  V6 Amplitude R-prime Wave

FIG. 2. ML-Based Classification of ECG Data: Operational Mode.
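To make the class imbalance concrete, the short Python sketch below converts the instance counts from the class distribution into percentage shares. The names `CLASS_COUNTS` and `class_shares` are illustrative; only the counts themselves come from the article.

```python
# Instance counts per class code, as listed in the class distribution.
CLASS_COUNTS = {
    1: 245, 2: 44, 3: 15, 4: 15, 5: 13, 6: 25, 7: 3, 8: 2,
    9: 9, 10: 50, 11: 0, 12: 0, 13: 0, 14: 4, 15: 5, 16: 22,
}


def class_shares(counts):
    """Return each class's share of all records, in percent."""
    total = sum(counts.values())
    return {code: 100.0 * n / total for code, n in counts.items()}


shares = class_shares(CLASS_COUNTS)
for code, pct in shares.items():
    print(f"class {code:02d}: {CLASS_COUNTS[code]:3d} records ({pct:.1f}%)")
```

The counts sum to the 452 records of the dataset, and the Normal class alone accounts for more than half of them.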

The first four columns in the original data file held non-ECG variables, as follows:
1  Age: Age in years, linear
2  Sex: Sex (0 = male; 1 = female), nominal
3  Height: Height in centimeters, linear
4  Weight: Weight in kilograms, linear

Imputation of missing values

In the exploratory data analysis (EDA) phase, it was found that important variables, such as heart rate measured in number of heart beats per minute, had some missing values. A couple of imputation algorithms were tried out,5,6 and finally rfImpute from the randomForest package was chosen to impute those missing values. Amelia7 was not considered because it could produce imputed results only with a high value of prior information [with the empri parameter as high as 0.9*nrow(data), when usually 0.01*nrow(data) is used]. Such a high prior amounts to adding many artificial observations with the same mean and variance as the existing observations but with 0 covariance.
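As we understand the randomForest documentation, rfImpute begins from a rough fill of the missing values and then refines it using the forest's proximity matrix. The Python sketch below shows only a median-based rough fill for numeric columns, a much simpler stand-in for the starting point and not the proximity-based refinement itself; the function name and sample values are illustrative assumptions.

```python
import statistics


def median_fill(rows):
    """Replace None in each numeric column by that column's median."""
    n_cols = len(rows[0])
    medians = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not None]
        medians.append(statistics.median(observed))
    # Build new rows so the input list is left unchanged.
    return [
        [medians[col] if value is None else value
         for col, value in enumerate(row)]
        for row in rows
    ]


# Illustrative values only: the second column stands in for heart rate.
records = [[80.0, 63.0], [91.0, None], [85.0, 75.0], [None, 71.0]]
filled = median_fill(records)
print(filled)
```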

Imbalance of data with respect to classes

The gross imbalance in the dataset (Table 1) poses problems for selecting a subset of data to be used for training and testing. If the data are partitioned in the typical way (70% to 80% for training and 30% to 20% for testing), the measured classification performance will be misleading. There are several ways to partially address this problem. Generating artificial data for the minor classes (via SMOTE algorithms and associated packages)8 is one method. Another is to down-sample data from the major class. We have chosen the latter path [i.e., subsampling the major class (Normal class) in proportion to the minor class (those classes that had at least 10% of the data)]. While subsampling the major class, we made sure that its maximum number did not exceed 100% of that of the minor class. Furthermore, weights were used for the training examples supplied to the RF classifier. Classes that had single-digit representation, namely Left ventricule hypertrophy (0.9%), Atrial Fibrillation or Flutter (1.1%), Ventricular Premature Contraction (PVC) (0.7%), Supraventricular Premature Contraction (0.4%), and Left bundle branch block (1.9%), were not addressed.
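The down-sampling idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' R procedure: the function names are hypothetical, the toy labels only mimic the imbalance between the Normal class and class 10, and the target of 100 Normal records is one of the values (100, 90, 80, 70) the article reports trying. The 70:30 split uses a biased coin per record, as the article's step list describes.

```python
import random


def downsample_major(records, labels, major_label, n_major, seed=0):
    """Keep n_major randomly chosen major-class records and all others."""
    rng = random.Random(seed)
    major = [i for i, y in enumerate(labels) if y == major_label]
    keep = set(rng.sample(major, min(n_major, len(major))))
    kept = [i for i, y in enumerate(labels) if y != major_label or i in keep]
    return [records[i] for i in kept], [labels[i] for i in kept]


def split_indices(n, train_share=0.7, seed=0):
    """Assign each index to train with probability train_share."""
    rng = random.Random(seed)
    train, test = [], []
    for i in range(n):
        (train if rng.random() < train_share else test).append(i)
    return train, test


# Toy labels mimicking the imbalance: 245 Normal (1) versus 50 of class 10.
labels = [1] * 245 + [10] * 50
records = list(range(len(labels)))  # placeholder feature rows
sub_records, sub_labels = downsample_major(records, labels, 1, n_major=100)
train_idx, test_idx = split_indices(len(sub_labels))
print(sub_labels.count(1), sub_labels.count(10), len(train_idx), len(test_idx))
```

Because the split is a coin toss per record, the realized train and test sizes only approximate 70:30, which is one reason per-class test counts can vary from run to run.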

Classification algorithm

We chose the random forest (RF) classifier2 for several reasons: it is fast to train; its out-of-bag (OOB) error is a good estimate of generalization error; it can handle noisy data; it can suggest important variables, with which a parsimonious predictive model can be built; and it has an associated imputation method, which at times is a better choice than external imputation methods. Additionally, two or more separately trained RFs can be combined without incurring much computational expenditure. RF is an ensemble classifier (i.e., a collection of classifiers), which predicts by counting the votes cast by each constituent classifier for a class on a query record. The predictive performance of an ensemble classifier is typically better than that of any of its constituents.

The constituent classifiers of RF are classification trees. The advantage of using such classifiers is that individual trees may be barely accurate (slightly better than random guessing), but combining them may produce a classifier with much higher accuracy. Also, a great deal of variance may be present as we move from one tree to another, but the overall classifier's variance is reduced because of the averaging that takes place in the course of ensembling.

RF is trained by bagging (bootstrap aggregation) of the training data. Random samples are drawn with replacement from the training data, and classification trees are built using them. When a large number of trees is constructed, each tree's bootstrap sample contains on average about (1 - 1/e) ≈ 63% of the original records; the remaining roughly 37% (the out-of-bag records) are used for testing the trees to calculate OOB errors. It can be shown that this error is a fair indication of generalization error for the RF classifier. Generalization error measures the predictive performance of a classifier when tested with unseen data outside of the training set but supposedly generated from the same distribution as the training data. These are the kind of data encountered by the classifier in the operational mode.

The keys to the predictive performance of the RF classifier are the strength of the individual classifiers and the diversity (degree of uncorrelatedness) of the constituent classification trees in the forest, in terms of raw margin functions.4
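The 63% figure follows from sampling with replacement: a given record is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e for large n. The Python sketch below checks this by simulation; the function name `oob_fraction` and the sample size are illustrative choices.

```python
import math
import random


def oob_fraction(n_records, seed=0):
    """Draw one bootstrap sample; return the share of records left out."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_records) for _ in range(n_records)}
    return 1.0 - len(drawn) / n_records


expected_in_bag = 1.0 - 1.0 / math.e  # limit of 1 - (1 - 1/n)^n
simulated_oob = oob_fraction(100_000)
print(f"expected in-bag share: {expected_in_bag:.3f}")
print(f"simulated out-of-bag share: {simulated_oob:.3f}")
```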

Variable selection for model building

Variable selection plays a major role in the development of predictive models. In this study, one of the reasons for selecting the RF classifier over other alternatives was that it has a means of assessing the effectiveness of each variable occurring in the model, with which we can build a parsimonious model for deployment. The criteria by which RF ranks its important variables are Mean Decrease Accuracy and Mean Decrease Gini. We prefer the latter for selecting the important variables, because in some instances in the literature, the other measure has been reported to be unstable.9 All variables whose Mean Decrease Gini value exceeds the mean of that measure across variables are retained in the model; that is, the criterion threshold is set to the mean value (see Fig. 2). The complete variable list with descriptions is provided in the online reference (
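The thresholding rule described here (keep a variable if its Mean Decrease Gini exceeds the mean over all variables) can be illustrated with a small Python sketch. The variable names and scores below are hypothetical and do not come from the study.

```python
def select_important(importance):
    """Keep variables whose importance exceeds the mean importance."""
    threshold = sum(importance.values()) / len(importance)
    kept = [name for name, score in importance.items() if score > threshold]
    return kept, threshold


# Hypothetical Mean Decrease Gini scores; names and values are illustrative.
gini = {"heart_rate": 4.0, "qrs_duration": 3.0, "t_angle": 0.6, "p_angle": 0.4}
kept, threshold = select_important(gini)
print(kept, threshold)
```

Using the mean as the cutoff needs no tuning, but it also means the number of retained variables depends on how skewed the importance scores are.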

Experimental Results and Discussion

We performed experiments on the classifier we developed to assess its predictive performance. We enumerate the steps of the algorithm for classification using RF below:





1. Read comma-separated values of the Arrhythmia data from a text file as a table.
2. Identify and create a response variable showing which class each data point belongs to (the 280th column of the original data read as a table).
3. Make sure the data is complete:
   • Identify the columns with missing values.
   • Replace the missing values (occurring as "?") with NA (required for imputation).
4. Assign names to the variables (for ease of identification).
5. Remove variables with zero entries, as well as age, sex, height, and weight and the one specifying arrhythmia type (i.e., non-ECG values). (For imputation, we cannot afford to retain so many variables with so few records. One of the imputation methods used, Amelia, does not permit it.)
6. Perform imputation with rfImpute/Amelia.
7. Sample imputed data judiciously (as described previously) from the respective classes, up to the maximum number of records each contains, except for the Normal class (code 01). For this class, try out 100, 90, 80, and 70 records.
   • Toss a biased coin to generate indices between 1 and the number of records (rows) in the ratio 70:30.
   • Generate training and test sets using the above indices.
8. Call Random Forest with the imputed data, number of trees = 500, and other parameters.
9. Call the Predict function on the test set.
10. Identify the important variables according to the specified criterion (MeanDecreaseGini) at the specified threshold value (set equal to the mean of MeanDecreaseGini).
11. Call Random Forest with the important variables, the training set, number of trees = 500, and other parameters.
12. Call the Predict function on the test set.
13. Go back to step 7 until the list (100, 90, 80, 70) is exhausted.

Table 1 shows the computation of precision and recall, which are defined as follows.

Precision: the number of correctly classified examples of a particular class divided by the number of examples labeled by the system as belonging to that particular class.10

$$\text{Precision} = \frac{|\{\text{correct labels}\} \cap \{\text{predicted labels}\}|}{|\{\text{predicted labels}\}|}$$

Recall (sensitivity): the number of correctly classified examples of a particular class divided by the number of examples of that particular class in the data.

$$\text{Recall} = \frac{|\{\text{correct labels}\} \cap \{\text{predicted labels}\}|}{|\{\text{correct labels}\}|}$$

F-score: a combination of the above two measures in the form of their harmonic mean.

$$\text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
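The three measures can be computed per class directly from paired lists of correct and predicted labels. The Python sketch below is illustrative only: the toy labels are made up, and the function names are our own. As a consistency check, it also recomputes the F-score for the first precision/recall pair reported in Table 1 (96.43, 58.69), which should come out close to the 72.97 listed there.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def precision_recall_f(correct, predicted, label):
    """Per-class precision, recall, and F-score, each in percent."""
    true_pos = sum(1 for c, p in zip(correct, predicted) if c == p == label)
    n_predicted = sum(1 for p in predicted if p == label)
    n_correct = sum(1 for c in correct if c == label)
    precision = 100.0 * true_pos / n_predicted if n_predicted else 0.0
    recall = 100.0 * true_pos / n_correct if n_correct else 0.0
    return precision, recall, f_score(precision, recall)


# Toy labels, not data from the study.
correct = [1, 1, 1, 2, 2, 10]
predicted = [1, 1, 2, 2, 10, 10]
print(precision_recall_f(correct, predicted, 1))
# Consistency check on the first precision/recall pair reported in Table 1.
print(round(f_score(96.43, 58.69), 2))
```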

Table 1. Precision/Recall Computation
Each cell lists precision / recall / F-score (in percent, as defined in the text) for one class; columns give the number of major-class (Class 1) records sampled.

Class      100 records             90 records              80 records              70 records
Class 1    96.43 / 58.69 / 72.97   78.26 / 52.94 / 63.15   89.29 / 78.12 / 83.33   89.47 / 65.38 / 75.55
Class 2    66.67 / 71.40 / 68.97   72.73 / 53.33 / 61.54   71.43 / 55.55 / 62.53   75.00 / 70.59 / 72.73
Class 3    100.0 / 75.0 / 85.71    75.0 / 50.0 / 60.0      66.67 / 100 / 80.0      0.00 / 0.00 / 0.00
Class 4    50.0 / 100.0 / 66.7     33.33 / 50.00 / 39.98   50.0 / 50.0 / 50.0      83.33 / 83.33 / 83.33
Class 5    33.33 / 100.0 / 49.96   33.33 / 100.0 / 49.96   50.0 / 100.0 / 66.67    33.33 / 100.0 / 49.96
Class 6    85.71 / 75.00 / 79.99   87.5 / 87.5 / 87.5      75.00 / 90.00 / 81.82   80.00 / 88.80 / 84.17
Class 9    100.0 / 100.0 / 100.0   66.67 / 100.0 / 79.99   100.0 / 100.0 / 100.0   100 / 100 / 100
Class 10   62.50 / 66.66 / 64.53   80.00 / 54.54 / 64.86   55.56 / 83.33 / 66.71   86.67 / 68.42 / 76.47

100 - OOB error: 67.29 with all variables; 70.10 with important variables.

As the system keeps operating in the field, more records for the various cases will be collected, together with the cardiologists' decisions for the respective records. A new RF classifier may be trained with these data, and finally it can be combined incrementally with the one currently operating, using the combine() function of randomForest.
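Conceptually, combining two forests means pooling their trees into one larger ensemble that votes together. The Python sketch below illustrates that idea with stand-in "trees" written as plain functions; it is not the R combine() function, and the thresholds and record fields are made up for illustration.

```python
from collections import Counter


def combine(*forests):
    """Pool the trees of several separately trained ensembles."""
    return [tree for forest in forests for tree in forest]


def predict(forest, record):
    """Return the class with the most votes across all trees."""
    votes = Counter(tree(record) for tree in forest)
    return votes.most_common(1)[0][0]


# Stand-in "trees": plain functions of a record; thresholds are made up.
forest_old = [lambda r: 1 if r["heart_rate"] < 100 else 5,
              lambda r: 1 if r["qrs"] < 120 else 10]
forest_new = [lambda r: 5 if r["heart_rate"] >= 100 else 1,
              lambda r: 5 if r["heart_rate"] >= 110 else 1,
              lambda r: 1]
combined = combine(forest_old, forest_new)
print(len(combined), predict(combined, {"heart_rate": 120, "qrs": 90}))
```

Because pooling only concatenates already trained trees, no retraining of the existing forest is needed, which is what makes the incremental update cheap.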

Table 2. ECG Classification Performance Analysis: Time Taken (in Seconds)

Number of predictions       Sequential processing   Storm cluster with 2 nodes   Storm cluster with 3 nodes
(ECG categorizations)       (no Storm used)         (1 spout, 8 bolts)           (1 spout, 16 bolts)
20K                         3,600                   900                          450
40K                         7,200                   1,710                        900
0.1 million                 18,300                  4,440                        2,400
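The timings in Table 2 translate directly into speedup and throughput figures. The Python sketch below does this arithmetic; the dictionary and function names are our own, and only the timings come from the table.

```python
# Time taken in seconds, from Table 2, keyed by number of predictions.
SEQUENTIAL = {20_000: 3600, 40_000: 7200, 100_000: 18300}
STORM_2_NODES = {20_000: 900, 40_000: 1710, 100_000: 4440}
STORM_3_NODES = {20_000: 450, 40_000: 900, 100_000: 2400}


def speedup(baseline, parallel):
    """Ratio of sequential to parallel run time for each workload."""
    return {n: baseline[n] / parallel[n] for n in baseline}


def throughput(times):
    """Predictions per second for each workload."""
    return {n: n / seconds for n, seconds in times.items()}


for name, times in [("2 nodes", STORM_2_NODES), ("3 nodes", STORM_3_NODES)]:
    for n, ratio in speedup(SEQUENTIAL, times).items():
        print(f"{name}, {n} predictions: {ratio:.2f}x, "
              f"{throughput(times)[n]:.1f} predictions/s")
```

On these numbers, the 8-bolt configuration is roughly four times faster than sequential processing and the 16-bolt configuration roughly eight times faster, so the run time scales approximately with the number of bolts.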

Implementation of R-based Classifier for Real-Time Analysis

R code can be executed from within a bash script, which allows us to invoke it from within a Java program (or any other programming language or script, for that matter). Storm is an open-source real-time computation framework that allows us to process streams of data in parallel, making it a very good choice for classifying data on a cluster of nodes. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation. The model file created in the previous step is referenced in another R script, which is used for real-time classification. Data to be classified enters the Storm framework via a spout, which then emits it to the bolts. Each bolt runs the R script in parallel and emits the results of the classification (which can be captured and used as needed), as shown in Fig. 3. Note that for each result in Table 2, one node is a Nimbus node and the remaining nodes are supervisors. Each node has 8 quad-core CPUs, 32 GB of RAM, and 32 GB of swap space.

Note: We used only one spout for this proof of concept (POC). Depending on how data enters the Storm framework, it is possible to use multiple spouts, which would enhance performance further.
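The data flow described in this section (one source emitting records, many workers classifying them in parallel by calling an R script) can be mimicked locally. The Python sketch below is a toy stand-in, not Storm code: a thread pool plays the role of the bolts, `classify_stub` replaces the R classifier so the example runs without R, and the file names `classify.R` and `rf_model.RData` are placeholders we invented. In a real deployment, each worker could run the command built by `r_command`, for example via `subprocess.run`.

```python
from concurrent.futures import ThreadPoolExecutor


def r_command(script, model_file, record):
    """Command line a bolt could run; file names here are placeholders."""
    return ["Rscript", script, model_file, ",".join(str(v) for v in record)]


def spout(records):
    """Emit records one at a time, like a single data source."""
    yield from records


def classify_stub(record):
    """Stand-in for the R classifier so the sketch runs without R."""
    return "fast" if record[0] > 100 else "regular"


def run(records, n_bolts, classify=classify_stub):
    """Fan records out to n_bolts parallel workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=n_bolts) as pool:
        return list(pool.map(classify, spout(records)))


ecg_records = [(72, 0.08), (118, 0.09), (95, 0.12)]  # made-up values
print(run(ecg_records, n_bolts=8))
print(r_command("classify.R", "rf_model.RData", ecg_records[0]))
```

Starting a new R process per record is costly, so a practical design would more likely keep a long-lived R process per worker; the sketch leaves that choice open.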

Concluding Remarks
This article has presented a real-time machine-learning platform for the healthcare domain that allows ECG signals to be classified. It is an additional input for the physician, but a crucial one that facilitates care-for-value. The implication is that this work provides the basis for building a powerful analytical framework that can work in real time. This study could prove extremely useful, not only for ECG classification, but also for enabling physicians to get incremental analytics on the various kinds of patient data increasingly available in EHRs. Our study also enables incremental healthcare, where the focus can shift to analytics and, consequently, to customized real-time healthcare. The upcoming health exchanges may also benefit, as on-the-fly analytics on high-velocity data becomes essential for providers, physicians, and patients equally.

Author Disclosure Statement

All authors are employed by Impetus.

FIG. 3. Running R over Storm.

References
1. Gao D, Madden M, Chambers D, Lyons G. Bayesian ANN classifier for ECG arrhythmia diagnostic system: A comparison study. Proceedings of 2005 IEEE International Joint Conference on Neural Networks (IJCNN 05) 2005; 4:2383–2388.
2. Rothman SA, et al. The diagnosis of cardiac arrhythmias: A prospective multi-center randomized study comparing mobile cardiac outpatient telemetry versus standard loop event monitoring. J Cardiovasc Electrophysiol 2007; 8:17.
3. Guvenir HA, Acar S, Demiroz G, Cekin A. A supervised machine learning algorithm for arrhythmia analysis. Comput Cardiol 1997; 7:433–436.
4. Breiman L. Random forests. Mach Learn 2001; 45:5–32.
5. Liaw A. Missing value imputations by randomForest. R documentation. Available online at http://rss.acs.unt.edu/Rdoc/library/randomForest/html/rfImpute.html. (Last accessed on September 6, 2013).
6. Ishioka T. Imputation of missing values for unsupervised data using the proximity in random forests. In: Proceedings of The Fifth International Conference on Mobile, Hybrid, and On-line Learning. Nice, France, February 24–March 1, 2013.
7. Honaker J, King G, Blackwell M. Amelia II: A program for missing data. J Stat Softw 2011; 45:1–47.
8. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 2013; 14:106. Available online at 14/106. (Last accessed on September 6, 2013).
9. Calle ML, Urrea V. Letter to the editor: Stability of random forest importance measures. Briefings Bioinf 2011; 12:86–89.
10. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manag 2009; 45:427–437.

Address correspondence to:
Vijay Srinivas Agneeswaran, PhD
Innovation Labs
Impetus Infotech India Private Limited
Pritech Park SEZ, Bellandur
Outer Ring Road
Bangalore, Karnataka 560103
India
E-mail: