
J. of Cardiovasc. Trans. Res. (2017) 10:313–321
DOI 10.1007/s12265-017-9752-2

ORIGINAL ARTICLE

Text Mining of the Electronic Health Record: An Information Extraction Approach for Automated Identification and Subphenotyping of HFpEF Patients for Clinical Trials

Siddhartha R. Jonnalagadda 1,2 & Abhishek K. Adupa 1 & Ravi P. Garg 1 & Jessica Corona-Cox 3 & Sanjiv J. Shah 3

Received: 18 January 2017 / Accepted: 16 May 2017 / Published online: 5 June 2017
© Springer Science+Business Media New York 2017

Abstract  Precision medicine requires clinical trials that are able to efficiently enroll subtypes of patients in whom targeted therapies can be tested. To reduce the large amount of time spent screening, identifying, and recruiting patients with specific subtypes of heterogeneous clinical syndromes (such as heart failure with preserved ejection fraction [HFpEF]), we need prescreening systems that are able to automate data extraction and decision-making tasks. However, a major obstacle is the vast amount of unstructured free-form text in medical records. Here we describe an information extraction-based approach that automatically converts unstructured text into structured data, which is cross-referenced against eligibility criteria using a rule-based system to determine which patients qualify for a major HFpEF clinical trial (PARAGON). We show that we can achieve a sensitivity and positive predictive value of 0.95 and 0.86, respectively. Our open-source algorithm could be used to efficiently identify and subphenotype patients with HFpEF and other disorders.

Keywords  Information extraction · Natural language processing · Clinical trials · Precision medicine

* Siddhartha R. Jonnalagadda
  sidjreddy@gmail.com

1 Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
2 Present address: Microsoft Corporation, 555 110th Ave NE, Bellevue, WA 98004, USA
3 Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA

Introduction

The creation and acceptance of electronic health records (EHRs) has ignited widespread interest in the use of clinical data for secondary purposes and research [1]. One such application that can greatly benefit from an EHR-based approach is clinical trial screening and recruitment. In general, screening for clinical trial recruitment is currently done manually. Clinicians and study coordinators go through each of the eligibility criteria, determine data elements relevant to the clinical trial, extract the data elements from the structured and unstructured EHR of each patient, and match the data elements with the eligibility criteria to decide whether the patient qualifies for the trial. Not only is this process slow, it is also prone to errors. It typically takes approximately 15–20 min for a study coordinator to examine each patient's data, and this process can take even longer in the context of complex clinical syndromes such as heart failure with preserved ejection fraction (HFpEF). In addition, as we advance towards precision medicine, finding specific subtypes of patients will be increasingly important, making manual screening of patient records even more time-consuming and arduous.

Because of the subjectivity involved in human decision-making, domain knowledge, which patients are considered for the initial search, and other factors [2], there is always a possibility of type 1 and type 2 errors in the prescreening process and biases in the overall recruitment. Furthermore, clinicians and study coordinators typically rely on patients identified in their own specialty clinics or in certain defined patient care settings, thereby missing out on the advantage of screening an entire healthcare system. We hypothesized that it would be possible to create a highly sensitive automated process for prescreening with sufficient specificity to save study coordinator time overall, and potentially reduce recruitment bias because it would be possible to consider patients from a larger pool (i.e., the entire EHR). Thus, an algorithm that can prescreen eligible patients efficiently could provide a proficient and robust approach to clinical trial recruitment.

Therefore, we sought to develop a high recall (sensitivity) prescreening algorithm for recruiting patients into a multicenter, randomized, double-blind, parallel-group, active-controlled study to evaluate the efficacy and safety of LCZ696 (sacubitril/valsartan) compared to valsartan on morbidity and mortality in HFpEF (Prospective comparison of ARni with Angiotensin receptor blocker Global Outcomes in heart failure with preserved ejectioN fraction, PARAGON). Our approach involves the development of information extraction modules that use a combination of techniques, including natural language processing (NLP), which can be reused not only for other EHRs but also for other trials using similar data elements.

Methods

Patient Records and Eligibility Criteria—Data Description

The patient records used in this study came from the Epic EHR used by Northwestern Medicine (Northwestern Memorial Hospital and Northwestern University, Chicago, IL). The initial cohort of patients we considered for our experiments was very broad to ensure we were not missing any patients that could be included—patients that were documented to have HF with the ICD-9-CM diagnosis code 428.0 and had an echocardiogram within the past year. We randomly selected n = 198 of these patients for the development dataset and n = 3002 patients for the validation dataset.

Each patient's data consists of five types of reports: encounters, problem list, echocardiography reports, lab reports, and current medication list. Encounters contain two types of files: encounter diagnosis names and encounter progress notes. The characteristics of the patient records for both datasets are summarized in Table 1. There are a total of 40 eligibility criteria—7 for inclusion and 33 for exclusion—for the PARAGON clinical trial [3].

Table 1  Characteristics of the development and validation patient datasets

                               Development set    Validation set
Total number of patients       198                3002
Encounters                     54,173             393,482
Echocardiography reports       96,281             883,385
Lab reports                    52,393             371,879
Current medication entries     4490               41,947
Problem lists                  3521               33,089
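For orientation, the per-patient input described above can be represented as a simple container. The class and field names below are illustrative assumptions rather than the actual schema exported from the Epic EHR; they only mirror the five report types listed in the text (with encounters split into diagnosis names and progress notes).

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PatientRecord:
    """Sketch of the per-patient reports consumed by the extraction modules."""
    patient_id: str
    # Encounters comprise two kinds of files: diagnosis names and progress notes.
    encounter_diagnosis_names: List[str] = field(default_factory=list)
    encounter_progress_notes: List[str] = field(default_factory=list)
    problem_list: List[str] = field(default_factory=list)
    echo_reports: List[str] = field(default_factory=list)
    lab_reports: List[Dict[str, str]] = field(default_factory=list)
    current_medications: List[str] = field(default_factory=list)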
Algorithm

The information from the patient data is extracted by separate modules (Fig. 1). These modules are designed to extract the data elements relevant to PARAGON but are reusable individually for other clinical trials. After extraction, a rule-based system matches the eligibility criteria and excludes patients who (1) do not satisfy all of the inclusion criteria or (2) meet one or more of the exclusion criteria. Figure 2 describes the system's architecture in finer detail. We broadly categorize the modules as (1) structured data normalizer, (2) unstructured data extractor, and (3) unstructured data classifier.

A structured data normalizer is used for the extraction of data elements whose values are already present in structured form. This module is further divided into two submodules. In submodule 1, we extract the values for age, body mass index, hemoglobin, glomerular filtration rate, and blood pressure from structured fields. In submodule 2, we extract medication data, including types of medications (e.g., angiotensin-converting enzyme inhibitors, beta-blockers, and dihydropyridine and non-dihydropyridine calcium channel blockers). The reports are in structured form with a mapping of each medication to the patient. This submodule requires external information resources, which we provide as databases to our system.
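To make the role of the structured data normalizer concrete, the sketch below shows one way such a module could be implemented. The field names, the unit assumptions, and the small hard-coded medication lookup are illustrative stand-ins; the actual system draws medication classes from external databases, as described above.

from typing import Any, Dict, List, Optional

# Hypothetical medication-class lookup; a production system would load this
# from an external drug database rather than a literal dictionary.
MED_CLASSES = {
    "lisinopril": "ACE inhibitor",
    "metoprolol": "beta-blocker",
    "amlodipine": "dihydropyridine CCB",
    "diltiazem": "non-dihydropyridine CCB",
}

def normalize_structured_fields(record: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Submodule 1 (sketch): pull numeric values from structured EHR fields
    into a flat patient profile used later for eligibility matching."""
    def to_float(value):
        try:
            return float(value)
        except (TypeError, ValueError):
            return None  # missing or non-numeric entries are left for manual review
    return {
        "age_years": to_float(record.get("age")),
        "bmi_kg_m2": to_float(record.get("bmi")),
        "hemoglobin_g_dl": to_float(record.get("hemoglobin")),
        "gfr_ml_min": to_float(record.get("gfr")),
        "systolic_bp_mm_hg": to_float(record.get("systolic_bp")),
    }

def normalize_medications(med_list: List[str]) -> Dict[str, int]:
    """Submodule 2 (sketch): map each structured medication entry to a drug class
    and count how many entries of each class the patient is on."""
    classes: Dict[str, int] = {}
    for med in med_list:
        drug_class = MED_CLASSES.get(med.strip().lower())
        if drug_class:
            classes[drug_class] = classes.get(drug_class, 0) + 1
    return classes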
An unstructured data extractor is used for the extraction of values of data elements present in unstructured text. This module accepts input and provides output just as the previous module does, but it uses a complex set of regular expressions to extract the exact value. For example, we needed to extract the left ventricular ejection fraction (LVEF) value as part of the prescreening process for identifying HFpEF patients for the PARAGON clinical trial. Here we used one set of regular expressions to extract sentences where the LVEF value may be present and then another set of regular expressions to extract the definite values, as shown in Table 2. Regular expressions 1 through 4 extract the sentences that can contain LVEF values. Then, the sentences are parsed through regular expressions 5 and 6. Regular expression 5 extracts values present in range format (e.g., "40–45%" or "40% to 45%"). Regular expression 6 extracts freely available values: for example, "40%."
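The two-stage pattern behind Table 2 can be sketched as follows. The regular expressions below are deliberately simplified stand-ins for the published patterns, and the function name is an assumption, so treat this as an illustration of the approach rather than the system's actual code.

import re

# Stage 1 (simplified): find sentences likely to mention an ejection fraction.
SENTENCE_PATTERN = re.compile(
    r"[^.]*\b(left ventricular ejection fraction|lv ejection fraction|lvef|ejection fraction)\b[^.]*\.",
    re.IGNORECASE,
)

# Stage 2 (simplified): pull a range ("40-45%", "40% to 45%") or a single value ("40%").
RANGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*%?\s*(?:-|–|to)\s*(\d+(?:\.\d+)?)\s*%")
SINGLE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*%")

def extract_lvef_values(report_text: str):
    """Return (low, high) tuples for every LVEF mention found in a report.
    A single value is returned as (value, value)."""
    values = []
    for sentence_match in SENTENCE_PATTERN.finditer(report_text):
        sentence = sentence_match.group(0)
        range_match = RANGE_PATTERN.search(sentence)
        if range_match:
            values.append((float(range_match.group(1)), float(range_match.group(2))))
            continue
        single_match = SINGLE_PATTERN.search(sentence)
        if single_match:
            v = float(single_match.group(1))
            values.append((v, v))
    return values

# Example: extract_lvef_values("The LVEF is estimated at 40% to 45%.") -> [(40.0, 45.0)]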

Fig. 1  Overview of the clinical trial recruitment system architecture. We analyzed different heart failure-related patient medical reports and derived pattern-based information extraction modules that provide output of structured data to compare against eligibility criteria for clinical trial recruitment. (The figure shows patient data before processing—demographics and lab values as structured data; pathology reports, echo reports, encounter notes, and discharge summaries as unstructured data—flowing through the information extraction modules into an all-structured representation that is matched against the eligibility criteria to produce the output.)

Next, we used an unstructured data classifier to classify whether certain data elements are present or absent in relation to the context of the patient. In this module, we extract all of the instances of a given data element (diagnosis, medication, treatment, or test) and its synonyms in the input report(s). For this, the module first checks for synonyms of the input term using the UMLS Metathesaurus [4], automatically builds a set of regular expressions, and then applies them to the input report text to extract all the instances. For example, to extract HF-related terms, the module compiles a list of synonyms: "heart failure," "HF," "diastolic dysfunction," and "cardiomyopathy." Next, a set of regular expressions is automatically generated (Table 3) and used to extract all the instances of HF-related terms. For PARAGON, the other data elements processed in this category are "angioedema," "pancreatitis," "valvular heart disease," etc. We adapt existing rule-based systems to make sure that the data elements are not in their negated form (using a rule-based negation detection algorithm), that they refer to the patient's current status (as opposed to a historical condition or a hypothesis for conducting a test), and that they correspond to the patient (as opposed to a family member, i.e., as listed in the "family history" section of clinical notes) [5, 6]. Our open-source algorithm is available at https://github.com/sidkgp/IECTM.
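A minimal sketch of this classifier step is shown below. The synonym list is hard-coded here instead of being pulled from the UMLS Metathesaurus, and the negation check is a much simpler window heuristic than the rule-based ConText-style algorithms cited above [5, 6], so treat it purely as an illustration of term counting with context filtering.

import re
from typing import List

# In the actual system, synonyms come from the UMLS Metathesaurus; this short
# hard-coded list is only for illustration.
HF_SYNONYMS = ["heart failure", "HF", "diastolic dysfunction", "cardiomyopathy"]

# Simplified negation cues; the published system relies on rule-based negation
# detection rather than this single pattern.
NEGATION_CUES = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)

def build_term_patterns(synonyms: List[str]) -> List[re.Pattern]:
    """Automatically turn each synonym into a word-boundary regular expression,
    mirroring the generated patterns in Table 3."""
    return [re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
            for term in synonyms]

def count_affirmed_mentions(report_text: str, synonyms: List[str]) -> int:
    """Count mentions of a data element that are not preceded by a nearby negation cue."""
    count = 0
    for pattern in build_term_patterns(synonyms):
        for match in pattern.finditer(report_text):
            window = report_text[max(0, match.start() - 40):match.start()]
            window = window.split(".")[-1]  # stay within the current sentence
            if not NEGATION_CUES.search(window):
                count += 1
    return count

# Example: count_affirmed_mentions("Patient denies chest pain. History of heart failure.", HF_SYNONYMS) -> 1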
Testing and Validation of the Algorithm

We first evaluated our methods iteratively in patients previously manually screened for the PARAGON trial at Northwestern University. The initial iterative algorithm development used these cases of patients found to be eligible for PARAGON (and enrolled in the study) and those who were found to be ineligible.

Next, we performed formal testing using the development set of 198 patients identified from the Northwestern Medicine Epic EHR. An experienced clinical research study coordinator (J.C.-C.) read each patient record, extracted data elements of relevance to PARAGON, and matched them against the PARAGON eligibility criteria. Patients deemed potentially eligible for the PARAGON trial by the study coordinator were then reviewed in further detail by a cardiologist with expertise in HFpEF (S.J.S.). The automated prescreening algorithm was tested on the same 198 patients (blinded to the manual screening results). The manual screening was then compared to the automated algorithm. The time needed to complete the prescreening task was documented for both methods; in addition, a 2 × 2 contingency table was created comparing manual and automated prescreening, and sensitivity and specificity were calculated.

Once the development of our automated information extraction system was complete, we focused our attention on its validation. We ran our information extraction approach on EHR-derived reports from the validation dataset (n = 3002). The study coordinator then performed a manual review of the medical records for all patients deemed eligible for the PARAGON trial by the automated information extraction system. Results were again compared between the two methods.

Fig. 2  Patient identification algorithm for the PARAGON trial. Each patient's data is parsed through three types of extraction modules. The modules extract the appropriate information and create a patient profile. This profile is then checked against the clinical trial eligibility criteria to check whether or not the patient qualifies for the study. (The figure shows patient records—encounter progress notes, encounter diagnosis names, problem list, lab reports, echo reports, and medication list—feeding the structured data normalizer (age, BMI, Hb, GFR, BP), the structured data extractor (LVEF), and the unstructured data extractor (terms for heart failure, angioedema, pancreas, organ transplant, ICD, malignant cancer, valvular heart disease, and constrictive pericarditis, hypertrophic cardiomyopathy, or infiltrative cardiomyopathy), which are matched against criteria such as: IC: age ≥ 55 years; EC: BMI > 40 kg/m2; IC: number of HF terms ≥ 1; IC: LVEF value ≥ 45%; EC: number of angioedema, pancreas, or bilateral renal artery stenosis terms ≥ 1; EC: number of "transplant" or "ICD" terms ≥ 1; EC: sentences with a "malignant" term and absence of "basal" and "prostate" terms ≥ 1; EC: Hb < 10 g/dl, GFR < 30 ml/min; EC: SBP value > 150 mm Hg and number of antihypertensive drugs > 3; EC: number of "constrictive pericarditis," "genetic hypertrophic cardiomyopathy," or "infiltrative cardiomyopathy" terms ≥ 1.) PARAGON Prospective comparison of ARni with Angiotensin receptor blocker Global Outcomes in heart failure with preserved ejectioN fraction; BMI body mass index; Hb hemoglobin; GFR glomerular filtration rate; BP blood pressure; LVEF left ventricular ejection fraction; ICD implantable cardioverter-defibrillator; IC inclusion criteria; EC exclusion criteria
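The rule-based matching step in Fig. 2 reduces to threshold and term-count checks over the extracted patient profile. The sketch below encodes a subset of those rules; the profile keys and the exact rule set are illustrative assumptions, and the real system evaluates all 40 PARAGON eligibility criteria.

from typing import Dict, List

def prescreen(profile: Dict[str, float]) -> List[str]:
    """Return the list of reasons a patient fails prescreening; an empty list
    means the patient is flagged as potentially eligible for coordinator review."""
    reasons = []

    # Inclusion criteria (subset from Fig. 2)
    if profile.get("age_years", 0) < 55:
        reasons.append("IC failed: age < 55 years")
    if profile.get("hf_term_count", 0) < 1:
        reasons.append("IC failed: no heart failure-related terms")
    if profile.get("lvef_percent", 0) < 45:
        reasons.append("IC failed: LVEF < 45%")

    # Exclusion criteria (subset from Fig. 2)
    if profile.get("bmi_kg_m2", 0) > 40:
        reasons.append("EC met: BMI > 40 kg/m2")
    if profile.get("hemoglobin_g_dl", 99) < 10:
        reasons.append("EC met: hemoglobin < 10 g/dl")
    if profile.get("gfr_ml_min", 99) < 30:
        reasons.append("EC met: GFR < 30 ml/min")
    if profile.get("systolic_bp_mm_hg", 0) > 150 and profile.get("antihypertensive_count", 0) > 3:
        reasons.append("EC met: SBP > 150 mm Hg on > 3 antihypertensive drugs")
    if profile.get("transplant_or_icd_term_count", 0) >= 1:
        reasons.append("EC met: organ transplant or ICD terms present")

    return reasons

# A profile that trips no rule returns [], i.e., the patient moves on to manual review.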

Table 2  Regular expressions for extracting left ventricular ejection fraction values from clinical notes and reports

Step  Regular expression
1     (left ventricular ejection fraction|lvef|lv ejection fraction|left ventricle ejection fraction|ejection fraction|ef |ejection fraction)[^_%\\.]*?([\\d-\\.]+)\\s*'?%
2     (left ventricular systolic function|left ventricular function|systolic function of the left ventricle|lv systolic function|left ventricular ejection fraction|ejection fraction|left ventricle)(normal|normal global|low normal|well preserved|severely reduced|moderately decreased|moderately depressed|severely decreased|severe|markedly decreased|markedly reduced|severely globally reduced|mildly decreased|mildly depressed|severely depressed)
3     (normal|normal global|low normal|well preserved|severely reduced|moderately decreased|moderately depressed|severely decreased|severe|markedly decreased|markedly reduced|severely globally reduced|mildly decreased|mildly depressed|severely depressed)
4     .*(moderate|marked|severe) (lv systolic dysfunction|left ventricular dysfunction|left ventricular systolic dysfunction).*
5     ((\\d+\\s*(\\-|to)\\s*\\d+)|(\\d*\\.\\d*\\s*(\\-|to)\\s*\\d*\\.\\d*)|(\\d*\\.\\d+)|(\\d+))(?=(\\s*(\\%)))
6     \\d+(\\.\\d+)?

Table 3  Regular expressions to extract heart failure-related terms

Regular expression
[^\w]+(h|H)eart\s+(f|F)ailure[^\w]+
[^\w]+(d|D)iastolic\s+(d|D)ysfunction[^\w]+
[^\w]+(c|C)ardiomyopathy[^\w]+
[^\w]+HF[^\w]+

Results

Development Dataset

For the 198 patient records included in the development dataset, our experienced research coordinator took 2 weeks (∼80 h) to generate the gold standard data, which then still had to be verified by the clinical cardiologist. Our automated information extraction (prescreening) system took less than 2 min to parse and extract the required information from the different data reports for the entire 198-patient dataset.

Table 4 shows the large number of exclusions based on exclusion criteria for the PARAGON trial. Because PARAGON has a large number of stringent eligibility criteria, the typical number of patients that qualify for the trial is small. This is a challenge faced by many clinical trials for HFpEF and other heterogeneous clinical syndromes, and a problem that will only become increasingly prominent as we seek to study narrower subtypes of patients in the era of precision medicine. For these reasons, it is important for an automated prescreening system to give more importance to retrieving nearly all the qualifying patients; in other words, the recall of the system should be close to 100%. We therefore tuned our system to achieve a high recall (i.e., high sensitivity) so as not to have too many false negatives (which would result in missing potentially eligible patients). Table 5 shows the 2 × 2 contingency table for the automated prescreening algorithm compared to the gold standard (manual prescreening). Sensitivity (recall) of the algorithm was 95% and specificity was 96%. The precision (positive predictive value) was 86% (F score of 90%).

Validation Dataset

After the development of our PARAGON HFpEF clinical trial prescreening algorithm was complete, we tested it on the validation dataset (n = 3002). The algorithm took ∼20 min to go through all 3002 medical records, and it identified n = 113 (3.7%) patients who were potentially eligible for the PARAGON trial (see Table 4 for exclusions). Our clinical trial study coordinator went through these records (and verified the results with the clinical cardiologist) and found that 46 of the patients qualified for the clinical trial.
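The performance figures quoted above follow directly from the development-set contingency table (Table 5); the short check below, with the counts hard-coded for illustration, reproduces them.

# Counts from Table 5 (development dataset, automated vs. manual prescreening).
tp, fp = 38, 6    # automated "include" that manual review included / excluded
fn, tn = 2, 152   # automated "exclude" that manual review included / excluded

sensitivity = tp / (tp + fn)          # 38 / 40   = 0.95
specificity = tn / (tn + fp)          # 152 / 158 ≈ 0.96
ppv = tp / (tp + fp)                  # 38 / 44   ≈ 0.86
f_score = 2 * ppv * sensitivity / (ppv + sensitivity)  # ≈ 0.90

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"PPV={ppv:.2f}, F={f_score:.2f}")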

However, 67 of the patients did not qualify for the trial. In most cases, these instances were not because of errors in the prescreening system but rather due to certain other criteria that were either not included in the algorithm (e.g., certain specific allergies to medication, pregnancy, or a patient who did not live in the area), are beyond the scope of any system to check due to lack of data (e.g., the type of cancer, or whether a cancer is malignant or benign, when the details are not present), or involve other extraneous factors that are difficult to capture by text mining of clinic notes (e.g., patient non-compliance with medications).

Table 4  Numbers of patients excluded at each step for the development and validation patient datasets

Exclusion criteria                                  Report type                      Number of patients excluded, n (%)a
                                                                                     Development dataset (n = 198)   Validation dataset (n = 3002)
Age + BMI                                           Encounter report                 22 (11)                         1071 (36)
Heart failure-related term                          Encounter report/problem list    3 (2)                           1597 (53)
LVEF                                                Echo report                      90 (45)                         672 (22)
Angioedema, pancreatitis, or bilateral renal        Encounter report                 44 (22)                         218 (7)
  artery stenosis
Organ transplant or ICD                             Encounter report/problem list    50 (25)                         806 (27)
Malignancy                                          Encounter report/problem list    49 (25)                         600 (20)
Hemoglobin + GFR                                    Lab value report                 42 (21)                         507 (17)
Blood pressure and anti-hypertensive medications    Encounter report                 6 (3)                           522 (17)
Constrictive pericarditis, genetic hypertrophic     Echo report                      23 (12)                         314 (10)
  cardiomyopathy, or infiltrative cardiomyopathy

BMI body mass index, LVEF left ventricular ejection fraction, ICD implantable cardioverter-defibrillator, GFR glomerular filtration rate, Echo echocardiography
a Patients could be excluded due to multiple criteria (thus, the percentages listed in each column do not add up to 100%)

Table 5  Development dataset 2 × 2 contingency table (compared to manual prescreening gold standard)

                                                                   Prescreening gold standard (manual)
                                                                   Patients included    Patients excluded
Classification outcome-based screening    Patients included        38                   6
algorithm (automated)                     Patients excluded        2                    152

Enhancing Efficiency by Rapidly Excluding Large Numbers of Patients

Table 4 lists the number of patients excluded in both the development and validation datasets based on the PARAGON trial eligibility criteria. The information extraction module played a major role in screening out large proportions of patients without human involvement. For example, module 2, which extracts LVEF values, excluded 90 patients from the 198-patient development dataset and 672 patients from the 3002-patient validation dataset. This would not have been captured by any method that aims to prioritize patients using information retrieval approaches without first extracting the values of the relevant data elements from unstructured reports.

Discussion

Here we have presented the development and validation of an information extraction system based on text mining techniques, including NLP, that efficiently identified patients who were eligible for PARAGON, a major HFpEF clinical trial. Given the stringent inclusion/exclusion criteria for PARAGON, the information extraction approach we have created has applications to precision medicine initiatives, given the need to identify specific subtypes of patients with heterogeneous medical disorders. Importantly, we show the clear increase in efficiency with automated screening of the EHR for complex clinical trials. We showed in our development dataset that screening time was reduced from 80 h to 2 min. Extrapolating manual prescreening time requirements to the validation dataset, we see that manual prescreening of ∼3000 medical records would have taken several months. Instead, our automated system took only 20 min to perform the prescreening task. Nevertheless, from these results and observations, we also understand that the system can only be used for prescreening, and further validation by the clinical trial study coordinator or clinical investigator is still required.

We achieved high recall with reasonable precision on our development dataset and were able to replicate the performance on a larger dataset. As with any automated system, there are certain limitations to our proposed architecture, which can be broadly categorized into (1) data processing and (2) data handling issues. We briefly describe some of these issues. The precision of the system suffers from the complexity of text data. In some cases, our unstructured data extractor module was unable to extract terms correctly.

For example, the module fails to identify certain HF- or ICD-related terms. This is due to the large number of synonyms and spelling mistakes for the relevant data elements.

In addition, there are some cases where a patient has specific allergies or may show a certain adverse reaction to a medication, both of which are difficult to extract from unstructured notes because they are not always reported in a standard format within the EHR. There are also cases where the patient has moved out of the hospital's geographic area and therefore cannot provide consent for the clinical trial. These are details that are likely too patient-specific for automated extraction and can only be checked manually.

In some cases, the LVEF value (which is an important factor for inclusion in HF clinical trials) is present in the form of a range or a qualitative description. This created a problem while checking eligibility against the given criterion. For example, in our clinical trial, we set the lower limit of LVEF at 45% based on the inclusion criteria. This creates a problem when the threshold falls within the extracted range (40% to 45% or 30% to 50%, for example). An initial approach of taking the average value and comparing it with the LVEF threshold was deemed inappropriate given the need to exclude all patients with LVEF <45% at baseline. Therefore, we subsequently modified our algorithm to include these patients but with a warning regarding their LVEF value. This then served as an indication to the study coordinators to recheck the echocardiogram report (and review the echocardiographic images with the clinical investigator) in order to make further decisions about the patient's eligibility for the clinical trial.
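One way to encode this "include with a warning" policy is sketched below; the function name and the return convention are illustrative, but the decision logic follows the rule described above (exclude only when the whole extracted range is below the 45% cutoff, and flag ranges that straddle it for coordinator review).

from typing import Tuple

LVEF_CUTOFF = 45.0  # PARAGON inclusion threshold (%)

def classify_lvef(extracted: Tuple[float, float]) -> Tuple[bool, bool]:
    """Given an extracted LVEF as a (low, high) range (a single value is (v, v)),
    return (include, needs_review). Ranges that straddle the cutoff are included
    but flagged so the coordinator rechecks the echocardiogram report."""
    low, high = extracted
    if high < LVEF_CUTOFF:
        return False, False   # clearly below the cutoff: exclude
    if low >= LVEF_CUTOFF:
        return True, False    # clearly at or above the cutoff: include
    return True, True         # ambiguous range (e.g., 40-50%): include with warning

# Examples: classify_lvef((55, 55)) -> (True, False)
#           classify_lvef((40, 50)) -> (True, True)    # cutoff lies inside the range
#           classify_lvef((30, 40)) -> (False, False)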
There are also some cases in which clinicians are only screening the patient for a particular diagnosis but the patient may not actually have the disease, such as the "malignancy of organ system" check among the exclusion criteria. To handle this, we do not exclude those patients if we find the "screening" term in the sentences extracted for the eligibility check. For the terms "cancer" and "malignancy," we cannot exclude all patients with these terms, because the patient may have a cancer that is currently in remission or may have only had a presumptive diagnosis of cancer that was either never verified or found to be inaccurate. To mitigate these issues, we currently just display a warning in these cases, as we did for LVEF. The coordinator can then perform further checks and decide the classification. In other exclusion criteria where we have to check the B-type natriuretic peptide and glomerular filtration rate values, we face issues of non-availability and potential outliers in the data. For such cases too, we currently report them as a warning to coordinators for further checking.

We also had to deal with data handling issues in some cases. For example, in criteria where we have to perform a check for recent hemoglobin values, we found that the value may also be present in reports other than just blood reports. To mitigate this issue, we check for hemoglobin values in all reports and then extract the most recent one. Similarly, there were also cases where the "end date" of a medication or the "department name" for an encounter report was missing or misplaced. We handled such cases following discussions with the data warehouse coordinator. To summarize, we can deduce that the patient data records are noisy for various reasons and that a preprocessing module is required to handle these issues.

An additional limitation of our study to consider is the time saved using an automated approach. When comparing the automated algorithm to the time required for manual review, it is important to factor in the time needed to collect the EHR data and create/tune the screening modules for the study, as well as the time required to modify the modules for a new study. These metrics are difficult to quantify and can vary greatly from study to study and from institution to institution, but future studies would benefit from recording these data to get a sense of the true time saved using an automated approach (i.e., the time the study coordinator would not have to spend reviewing records for the patients that the prescreening tool rules out, minus the programmer time required to repurpose/retune the text mining tools for the clinical trial in which the system is implemented).

Clinical trials are the gold standard by which investigators can determine whether or not a particular treatment is effective for a disease or clinical syndrome. However, a study of clinical trial enrollment reported that 86% of all clinical trials are delayed in patient recruitment by 1 to 6 months, and 13% are delayed by longer than 6 months [2]. In the TOPCAT trial (spironolactone vs. placebo in HFpEF) [7], which required 270 enrolling sites in 6 countries, the average enrollment rate was a dismal 2.3 patients per site per year worldwide, and even worse in the USA (1.4 patients per site per year), highlighting significant problems with HFpEF patient identification and enrollment [8]. A major cause of delay in HF clinical trials is the inability to efficiently screen for and identify eligible patients. An automated system is therefore urgently needed to accelerate the process of prescreening patients for clinical trials.

The surge in the use of EHRs in the USA has created abundant opportunities for clinical and translational research. As Friedman et al. noted, the extensive use of clinical data provides great potential to transform our healthcare system into a "self-learning health system" [9, 10]. In addition to its primary purpose of supporting improved clinical practice, the use of EHRs offers a means for the identification of participants who satisfy predefined criteria. This can be used for a variety of applications, including clinical trial recruitment, outcome prediction, survival analysis, and other retrospective studies [11–14].

Fig. 3  Summary of techniques for patient cohort identification. (The figure shows the input patient record yielding features through pattern matching (regular expressions) or language modeling-based methods, which feed rule-based techniques, machine learning-based techniques, or information retrieval-based techniques.)

EHRs contain patient data in both structured and unstructured formats. The structured data generally encompass a patient's demographic data, physical characteristics (e.g., body mass index [BMI], blood pressure), laboratory data, and diagnoses. Structured data are not only the best representation of knowledge but also easier to process. However, a vast amount of medical knowledge is locked in the unstructured format. The unstructured data are typically free-text clinical narratives present in progress notes, imaging reports (e.g., echocardiographic reports), and discharge summaries, for example. Thus, a module that can automatically and efficiently extract information from unstructured clinical text and convert it into a structured form is needed.

The syndromic nature of HF presents unique challenges for the identification of patients from EHR data for research [15]. HFpEF is particularly challenging to identify during prescreening activities. The presence of large amounts of unstructured data in patient medical reports aggravates these challenges. Previous studies have shown that clinicians often prefer free-text entry to coded options in order to fully explain the health conditions of each patient [16–18]. It has also been noted that unstructured data are essential because of the information they contain [19]; therefore, unstructured data are likely to persist in the future. There is an immediate need for an automated data extraction system to transform unstructured clinical reports into a structured form, which is much easier to process and handle [20–22].

There has been considerable research on identifying patient cohorts from EHRs [23]. These approaches can be categorized into three general types (Fig. 3): (1) rule-based approaches [24–28], (2) machine learning-based approaches [29–32], and (3) information retrieval-based approaches [33–36]. All of these approaches use either pattern matching (regular expressions) or language modeling-based methods [37–40] to extract the features their systems work on. Rule-based systems are stringent and binary (either yes or no) in nature. On the other hand, machine learning- and information retrieval-based methods provide output as a probability or a score. Machine learning techniques, however, require a large amount of training data to give accurate results.

Our proposed system is different from these approaches in various ways. A majority of the reported systems aim to identify whether a patient shows a certain phenotype. Therefore, the number of criteria required is smaller than that necessary for clinical trial screening. For example, a majority of the systems only use a variation of disease names, medications used, or treatments taken as their eligibility criteria [25, 27]. Ours is a more diverse application. Our goal is to check whether a particular patient qualifies for a certain clinical trial. Clinical trials usually have a large number of eligibility criteria that need to be checked. Therefore, a large amount of information related to the eligibility criteria needs to be extracted.

Our study goal is similar to that of the plethora of approaches proposed in computer-based clinical trial recruitment systems [41]. However, a majority of these approaches either lack EHRs as a data source or are not equipped to handle unstructured data. We, on the other hand, obtain patient data from EHRs and handle unstructured data through information extraction methods, as opposed to the "bag of terms" or "bag of concepts" suggested in other methods [42]. The main contributions of our study are to (1) show that automated recruitment systems can only serve as prescreening tools and (2) develop and validate a clinical trial screening system based on information extracted from EHRs. Here, we demonstrate how our system processes a set of eligibility criteria, extracts information from patient records automatically into a structured format, and finally prescreens the patients who could qualify for the trial by matching the structured patient document with the eligibility criteria.

Conclusions

Using an information extraction approach to the EHR, we show here that automated identification of HFpEF patients who are appropriate candidates for a clinical trial is feasible and time-efficient. Our approach demonstrated excellent accuracy, and it reduced the time needed for prescreening patients from a few weeks to a few minutes. As we move towards a "precision medicine" approach for complex, heterogeneous disorders such as HFpEF, we will need to find novel ways to identify appropriate patients to augment clinical trial enrollment of specific disease subgroups, a task that is quite time-consuming and tedious when done manually. Given the ability of our approach to identify subtypes of HFpEF patients who meet specific inclusion/exclusion criteria, it could be very useful for targeting specific therapeutics to specific subtypes of HFpEF patients in future clinical trials without overburdening study staff. Finally, an approach such as the one we demonstrate here may decrease costs and expedite clinical trials, and it could enhance the reproducibility of trials across institutions and populations.

Compliance with Ethical Standards

Funding Sources  This work was funded by the National Library of Medicine: R00LM011389 and R01LM011416 (to S.R.J.), and an investigator-initiated study grant from Novartis. S.J.S. is also supported by grants from the National Institutes of Health (R01 HL107577 and R01 HL127028). The authors acknowledge Prasanth Nannapaneni for his valuable ideas on extracting information from the electronic health record.

Conflicts of Interest  Siddhartha R. Jonnalagadda is currently an employee of Microsoft Corporation. Abhishek K. Adupa declares that he has no conflict of interest. Ravi P. Garg declares that he has no conflict of interest. Jessica Corona-Cox declares that she has no conflict of interest. Sanjiv J. Shah reports receiving consulting fees from Novartis.

Ethical Approval  All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed Consent  Informed consent was waived for this study by the Northwestern University Institutional Review Board because the study only involved retrospective chart review.

References

1. Jensen, P. B., Jensen, L. J., & Brunak, S. (2012). Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6), 395–405.
2. Sullivan, J. (2004). Subject recruitment and retention: barriers to success. http://www.appliedclinicaltrialsonline.com/subject-recruitment-and-retention-barriers-success. Accessed 27 July 2015.
3. PARAGON Inclusion/Exclusion Criteria (2015). https://sjonnalagadda.files.wordpress.com/2015/08/paragon_ie-criteria_10-01-2014.pdf. Accessed 10 August 2015.
4. Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(Database issue), D267–D270.
5. Harkema, H., Dowling, J. N., Thornblade, T., & Chapman, W. W. (2009). ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of Biomedical Informatics, 42(5), 839–851.
6. Mitchell, K. J., Becich, M. J., Berman, J. J., Chapman, W. W., Gilbertson, J., Gupta, D., et al. (2004). Implementation and evaluation of a negation tagger in a pipeline-based system for information extraction from pathology reports. Studies in Health Technology and Informatics, 107(Pt 1), 663–667.
7. Shah, S. J., Heitner, J. F., Sweitzer, N. K., Anand, I. S., Kim, H. Y., Harty, B., et al. (2013). Baseline characteristics of patients in the treatment of preserved cardiac function heart failure with an aldosterone antagonist trial. Circulation: Heart Failure, 6(2), 184–192.
8. Shah, S. J., Cogswell, R., Ryan, J. J., & Sharma, K. (2016). How to develop and implement a specialized heart failure with preserved ejection fraction clinical program. Current Cardiology Reports, 18(12), 122.
9. Friedman, C. P., Wong, A. K., & Blumenthal, D. (2010). Achieving a nationwide learning health system. Science Translational Medicine, 2(57), 57cm29.
10. Friedman, C., & Rigby, M. (2013). Conceptualising and creating a global learning health system. International Journal of Medical Informatics, 82(4), e63–e71.
11. Ma, X.-J., Wang, Z., Ryan, P. D., Isakoff, S. J., Barmettler, A., Fuller, A., et al. (2004). A two-gene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen. Cancer Cell, 5(6), 607–616.
12. Strom, B. L., Schinnar, R., Jones, J., Bilker, W. B., Weiner, M. G., Hennessy, S., et al. (2011). Detecting pregnancy use of non-hormonal category X medications in electronic medical records. Journal of the American Medical Informatics Association, 18(Suppl 1), i81–i86.
13. Mathias, J. S., Gossett, D., & Baker, D. W. (2012). Use of electronic health record data to evaluate overuse of cervical cancer screening. Journal of the American Medical Informatics Association, 19(e1), e96–e101.
14. De Pauw, R., Kregel, J., De Blaiser, C., Van Akeleyen, J., Logghe, T., Danneels, L., et al. (2015). Identifying prognostic factors predicting outcome in patients with chronic neck pain after multimodal treatment: a retrospective study. Manual Therapy, 20(4), 592–597.
15. Onofrei, M., Hunt, J., Siemienczuk, J., Touchette, D. R., & Middleton, B. (2004). A first step towards translating evidence into practice: heart failure in a community practice-based research network. Informatics in Primary Care, 12(3), 139–145.
16. Johnson, S. B., Bakken, S., Dine, D., Hyun, S., Mendonça, E., Morrison, F., et al. (2008). An electronic health record based on structured narrative. Journal of the American Medical Informatics Association, 15(1), 54–64.
17. Zhou, L., Mahoney, L. M., Shakurova, A., Goss, F., Chang, F. Y., Bates, D. W., et al. (2012). How many medication orders are entered through free-text in EHRs?—a study on hypoglycemic agents. American Medical Informatics Association Annual Symposium Proceedings, 2012, 1079–1088.
18. Zheng, K., Hanauer, D. A., Padman, R., Johnson, M. P., Hussain, A. A., Ye, W., et al. (2011). Handling anticipated exceptions in clinical care: investigating clinician use of 'exit strategies' in an electronic health records system. Journal of the American Medical Informatics Association, 18(6), 883–889.
19. Raghavan, P., Chen, J. L., Fosler-Lussier, E., & Lai, A. M. (2014). How essential are unstructured clinical narratives and information fusion to clinical trial recruitment? AMIA Joint Summits on Translational Science Proceedings, 2014, 218–223.
20. Stanfill, M. H., Williams, M., Fenton, S. H., Jenders, R. A., & Hersh, W. R. (2010). A systematic literature review of automated clinical coding and classification systems. Journal of the American Medical Informatics Association, 17(6), 646–651.
21. Jha, A. K. (2011). The promise of electronic records: around the corner or down the road? JAMA, 306(8), 880–881.
22. Friedman, C., Rindflesch, T. C., & Corn, M. (2013). Natural language processing: state of the art and prospects for significant progress, a workshop sponsored by the National Library of Medicine. Journal of Biomedical Informatics, 46(5), 765–773.
23. Shivade, C., Raghavan, P., Fosler-Lussier, E., Embi, P. J., Elhadad, N., Johnson, S. B., et al. (2014). A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association, 21(2), 221–230.
24. Nguyen, A. N., Lawley, M. J., Hansen, D. P., Bowman, R. V., Clarke, B. E., Duhig, E. E., et al. (2010). Symbolic rule-based classification of lung cancer stages from free-text pathology reports. Journal of the American Medical Informatics Association, 17(4), 440–445.
25. Schmiedeskamp, M., Harpe, S., Polk, R., Oinonen, M., & Pakyz, A. (2009). Use of International Classification of Diseases, Ninth Revision, Clinical Modification codes and medication use data to identify nosocomial Clostridium difficile infection. Infection Control and Hospital Epidemiology, 30(11), 1070–1076.

26. Penberthy, L., Brown, R., Puma, F., & Dahman, B. (2010). Automated matching software for clinical trials eligibility: measuring efficiency and flexibility. Contemporary Clinical Trials, 31(3), 207–217.
27. Kho, A. N., Hayes, M. G., Rasmussen-Torvik, L., Pacheco, J. A., Thompson, W. K., Armstrong, L. L., et al. (2012). Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. Journal of the American Medical Informatics Association, 19(2), 212–218.
28. Klompas, M., Haney, G., Church, D., Lazarus, R., Hou, X., & Platt, R. (2008). Automated identification of acute hepatitis B using electronic medical record data to facilitate public health surveillance. PLoS One, 3(7), e2626.
29. Mani, S., Chen, Y., Arlinghaus, L. R., Li, X., Chakravarthy, A. B., Bhave, S. R., et al. (2011). Early prediction of the response of breast tumors to neoadjuvant chemotherapy using quantitative MRI and machine learning. American Medical Informatics Association Annual Symposium Proceedings, 2011, 868–877.
30. Van den Bulcke, T., Vanden Broucke, P., Van Hoof, V., Wouters, K., Vanden Broucke, S., Smits, G., et al. (2011). Data mining methods for classification of Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) using non-derivatized tandem MS neonatal screening data. Journal of Biomedical Informatics, 44(2), 319–325.
31. Zhao, D., & Weng, C. (2011). Combining PubMed knowledge and EHR data to develop a weighted Bayesian network for pancreatic cancer prediction. Journal of Biomedical Informatics, 44(5), 859–868.
32. Kawaler, E., Cobian, A., Peissig, P., Cross, D., Yale, S., & Craven, M. (2012). Learning to predict post-hospitalization VTE risk from EHR data. American Medical Informatics Association Annual Symposium Proceedings, 2012, 436–445.
33. Lowe, H. J., Ferris, T. A., Hernandez, P. M., & Weber, S. C. (2009). STRIDE—an integrated standards-based translational research informatics platform. American Medical Informatics Association Annual Symposium Proceedings, 2009, 391–395.
34. Gregg, W., Jirjis, J., Lorenzi, N. M., & Giuse, D. (2003). StarTracker: an integrated, web-based clinical search engine. AMIA Annual Symposium Proceedings, 855.
35. Hanauer, D. A., Mei, Q., Law, J., Khanna, R., & Zheng, K. (2015). Supporting information retrieval from electronic health records: a report of University of Michigan's nine-year experience in developing and using the Electronic Medical Record Search Engine (EMERSE). Journal of Biomedical Informatics, 55, 290–300.
36. Zalis, M., & Harris, M. (2010). Advanced search of the electronic medical record: augmenting safety and efficiency in radiology. Journal of the American College of Radiology, 7(8), 625–633.
37. Lehman, L. W., Saeed, M., Long, W., Lee, J., & Mark, R. (2012). Risk stratification of ICU patients using topic models inferred from unstructured progress notes. American Medical Informatics Association Annual Symposium Proceedings, 2012, 505–511.
38. Carroll, R. J., Eyler, A. E., & Denny, J. C. (2011). Naive electronic health record phenotype identification for rheumatoid arthritis. American Medical Informatics Association Annual Symposium Proceedings, 2011, 189–196.
39. Liao, K. P., Cai, T., Gainer, V., Goryachev, S., Zeng-Treitler, Q., Raychaudhuri, S., et al. (2010). Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care and Research, 62(8), 1120–1127.
40. Bejan, C. A., Xia, F., Vanderwende, L., Wurfel, M. M., & Yetisgen-Yildiz, M. (2012). Pneumonia identification using statistical feature selection. Journal of the American Medical Informatics Association, 19(5), 817–823.
41. Kopcke, F., & Prokosch, H. U. (2014). Employing computers for the recruitment into clinical trials: a comprehensive systematic review. Journal of Medical Internet Research, 16(7), e161.
42. Ni, Y., Kennebeck, S., Dexheimer, J. W., McAneney, C. M., Tang, H., Lingren, T., et al. (2015). Automated clinical trial eligibility prescreening: increasing the efficiency of patient identification for clinical trials in the emergency department. Journal of the American Medical Informatics Association, 22(1), 166–178.
