
Creating Data Science Workflows

A Healthcare Use Case

Wade L. Schulz, MD, PhD

Thomas JS Durant, MD, MPT

Getting Started

GitHub: Software Links/Instructions, Code

Hortonworks Sandbox
Enable/Install Hortonworks Data Flow (HDF/NiFi)

VirtualBox / Hortonworks Sandbox

Hortonworks Sandbox: HDP 2.6
Large download; flash drives with the VirtualBox image are available if you have not already downloaded it

VirtualBox Networking Issues?

1. Update VirtualBox if unable to start HDP image

2. If a message indicates there are network issues, make sure that the
network connection has its virtual cable connected

Code for Sandbox

Edit hosts file if you have admin privileges
In VirtualBox: add port 6667 to NAT port forwarding (Kafka)
From image: curl -L -o

Launch the Ambari dashboard (username/password: raj_ops)

Hadoop and Ambari

Ambari: Hadoop
management interface
HDFS: Storage
YARN: Resource manager
Hive: SQL-like
HBase: Non-relational DB
Oozie: Workflow
ZooKeeper: Coordinator
Storm: Stream processing
Kafka: Message queue
Ranger: Security
Spark: Compute/analytics
Zeppelin: Notebook
NiFi: Stream processing

SSH to Download Code

1. yum install python-pip

2. pip install kafka hdfs
3. curl -L -o .hdfscli.cfg

Code Repository

git clone

git clone

HDF / NiFi

** This may not install depending on the network, but we will demo NiFi
functionality and data can be loaded through an alternative method
Data Science Project Development

Step 1: Get Data

Step 2: ?

Step 3:
Healthcare No Exception

For the architect/developer:
What to do when there is no Step 2?

Obtaining sample data

Rapid data modeling/review
Any structured elements?
Capture and store the data
Create reusable workflows
Keep data analysis-ready when possible (at least at a basic level)

Three Healthcare Use Cases

Clinical Laboratory Informatics (Workshop)

Patient Monitoring
Image Analysis / Deep Learning

Workshop Goal: Create Reusable Workflow Framework

Data Review / Generation


Data Science Toolbox

Create a personalized toolbox for each step of the data lifecycle

Identify strengths/weaknesses of each tool
Test small, local implementations and at scale when possible

Data Flow in a Healthcare System

Use Case 1: Create a Laboratory Data Workflow

A use case with a step 2!

Problem: Enterprise data warehouse not real-time

Quality control: Moving averages of actual patient specimens
Business intelligence: Which? When? How many? Efficiency?

Data feed: HL7 interface

Data elements needed:
Lab test name
Time of specimen collection
Time of result
Order location

What We Will Build

HL7 Data → Kafka → NiFi → HDFS → Zeppelin

Problem #1: Sample/Test Data

Fortunately, we can often collect data from test implementations within our healthcare system
New companies and vendors often have restrictions on data access or may not have an early clinical partner
Even when data samples are provided, they may not be of sufficient size or scope for testing

Problem #2: Healthcare Data Standards

Standards are often extremely complex and extensible

Many vendor systems don't follow the pure standard
Even though there is a standard data format, many data elements
(ontology, semantics) are not standardized

Health Level 7 Order/Result Unit (ORU)

OBR|1|8642753100012^LIS|20809880170^LCS|008342^UPPER RESPIRATORY CULTURE^L|||19980727175800||||||
OBX|2|CE|997231^RESULT 1^L||M415|||||N|F|||19980729160500|BN
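The ORU fragment above can be pulled apart with plain string splitting. A minimal sketch (not the workshop's parser, and ignoring HL7 escape sequences and repetition delimiters): segments split on newlines, fields on the pipe, components on the caret.

```python
# Minimal HL7 v2 parsing sketch using the ORU sample above.
message = (
    "OBR|1|8642753100012^LIS|20809880170^LCS|"
    "008342^UPPER RESPIRATORY CULTURE^L|||19980727175800||||||\n"
    "OBX|2|CE|997231^RESULT 1^L||M415|||||N|F|||19980729160500|BN"
)

def parse_segment(line):
    """Return (segment_type, field_list) for one pipe-delimited HL7 segment."""
    fields = line.split("|")
    return fields[0], fields

for seg_type, fields in (parse_segment(seg) for seg in message.splitlines()):
    if seg_type == "OBR":
        # OBR-4 is the universal service identifier: code^name^coding system
        code, name = fields[4].split("^")[:2]
        print("Order:", code, name)
    elif seg_type == "OBX":
        # OBX-3 is the observation identifier; OBX-5 is the observation value
        print("Result:", fields[3].split("^")[0], "=", fields[5])
```

Because the segment type itself occupies index 0 of the split, HL7 field numbers line up directly with Python list indices here.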

Workshop Part #1: The Lab Data Generator

Follow along: if you have Python 2.7 and/or Jupyter installed, open
the Laboratory Data Generatory.ipynb notebook from the GitHub
repository (/odsceast17/1-generation/data-generator/Laboratory
Data Generatory.ipynb)

If you have conda:

conda create -n odscHealth python=3
activate odscHealth
conda install notebook ipykernel matplotlib pandas
pip install kafka
jupyter notebook --notebook-dir=/path/to/git/repo

A Note on Laboratory Data

Laboratory tests often include panels

Complete blood count (CBC)
Basic metabolic panel (BMP)
Comprehensive metabolic panel (CMP)
Each panel can include several individual components
CBC: Hemoglobin, White Blood Cell Count, Platelets, etc.
BMP: Sodium, Potassium, Chloride, Creatinine, etc.
Laboratory normal ranges vary by laboratory due to variations
in equipment and the physiologic range of the local population

Generating good test data may require more than a simple random
number generator
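A sketch of what "more than a simple random number generator" can mean: draw each panel component from its own normal distribution. The means and standard deviations below are illustrative assumptions for demonstration, not validated reference ranges.

```python
import random

# Illustrative means/standard deviations for a few BMP components
# (assumed values for demonstration, not clinical reference data).
BMP_COMPONENTS = {
    "Sodium":     (140.0, 2.5),  # mmol/L
    "Potassium":  (4.1, 0.4),    # mmol/L
    "Chloride":   (103.0, 3.0),  # mmol/L
    "Creatinine": (0.9, 0.2),    # mg/dL
}

def generate_panel(components=BMP_COMPONENTS):
    """Draw one simulated result per panel component from a normal distribution."""
    return {name: round(random.gauss(mean, sd), 2)
            for name, (mean, sd) in components.items()}

panel = generate_panel()
```

Per-component distributions keep simulated panels internally plausible, which matters once downstream analytics (like the QC moving averages later in this workshop) start consuming the data.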

Architect the Pipeline

Ingest → Process → Store → Analyze

Data Ingest - Kafka

Setting up the Kafka Queue

In the Sandbox container:
cd /usr/hdp/current/kafka-broker/bin
./kafka-topics.sh --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --create --topic test1
./kafka-topics.sh --zookeeper localhost:2181 --describe --topic test1

Start a Data/Kafka Producer

1. Less fancy: random values, CLI data loader (any Python; can run
inside the Sandbox)
1. /odsceast17/1-generation/

2. Fancy: normally distributed data (if you have Python >= 2.7 with the
previous dependencies and were able to configure the hosts file)
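Either producer ultimately makes the same kafka-python call. A minimal sketch, assuming the `kafka` package installed earlier, the `test1` topic, and the sandbox broker address (which depends on the hosts-file edit and the 6667 port forward):

```python
def make_obx(set_id, code, name, value, units):
    """Build a simplified pipe-delimited OBX (result) segment."""
    return "OBX|{0}|NM|{1}^{2}^LN||{3}|{4}||||F".format(
        set_id, code, name, value, units)

record = make_obx(1, "2951-2", "Sodium", 140, "mmol/L")

try:
    from kafka import KafkaProducer  # from `pip install kafka`
    # The broker address is an assumption for the sandbox setup above.
    producer = KafkaProducer(bootstrap_servers="sandbox.hortonworks.com:6667")
    producer.send("test1", record.encode("utf-8"))
    producer.flush()
except Exception:
    # No reachable broker (or no kafka package) outside the sandbox; the
    # record above still shows the payload that would be sent.
    pass
```

`flush()` matters in short-lived scripts: kafka-python batches sends asynchronously, so exiting without flushing can drop the final records.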

Architect the Pipeline: Data Transformation / Load

Not all data need to be transformed before load

Many data may benefit from pre-indexing, even if downstream
analytics are not yet known
Timestamps, identifiers, etc
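As one concrete pre-indexing step: HL7 v2 timestamps (like 19980729160500 in the ORU example) are plain digit strings, and normalizing them to ISO 8601 at load time keeps the data analysis-ready downstream. A minimal sketch:

```python
from datetime import datetime

def hl7_ts_to_iso(ts):
    """Convert an HL7 v2 YYYYMMDDHHMMSS timestamp to an ISO 8601 string."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").isoformat()

iso = hl7_ts_to_iso("19980729160500")  # -> "1998-07-29T16:05:00"
```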

Workshop Part #2: Python HL7 Parsing

NiFi if installed
Python to HDFS if NiFi not installed
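For the no-NiFi path, the `hdfs` package installed earlier can push parsed records into HDFS over WebHDFS. A hedged sketch where the endpoint, user, and target path are assumptions (in practice, the .hdfscli.cfg file downloaded earlier can supply the connection details instead):

```python
def store_records(client, hdfs_path, records):
    """Append newline-delimited records to an HDFS file via an hdfs-package client."""
    payload = "\n".join(records) + "\n"
    client.write(hdfs_path, data=payload, encoding="utf-8", append=True)

try:
    from hdfs import InsecureClient  # from `pip install hdfs`
    # Endpoint, user, and path below are assumptions for the sandbox.
    client = InsecureClient("http://sandbox.hortonworks.com:50070", user="raj_ops")
    store_records(client, "/tmp/labdata/results.txt",
                  ["OBX|1|NM|2951-2^Sodium^LN||140|mmol/L"])
except Exception:
    # Without a reachable sandbox, store_records above still documents the call.
    pass
```

Note that WebHDFS appends require the target file to already exist; for a first write, drop `append=True`.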

Workshop Part #3: Data Visualization

Spark / Zeppelin
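In the workshop, Zeppelin runs this analysis with Spark against HDFS; the core computation for the quality-control use case (moving averages of actual patient results) can be sketched locally with pandas, using made-up sodium values:

```python
import pandas as pd

# Illustrative (made-up) sodium results in collection order.
results = pd.DataFrame({"sodium": [139, 141, 140, 138, 142, 140, 139, 141]})

# Moving average of patient results, the QC signal described earlier;
# the window size of 4 is an arbitrary choice for illustration.
results["moving_avg"] = results["sodium"].rolling(window=4).mean()
```

A drifting moving average of patient results can flag instrument problems between formal QC runs, which is why this signal needs fresher data than a batch-loaded warehouse provides.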

Workflow Overview

Implementation Architecture

Extending Hadoop

Repeatable Architecture Patterns: Continuous Patient Monitoring

Generate approximately 6-9 billion data points per month

Capture monitoring data from EDs, ICUs, others

Highly scalable, fault-tolerant processing pipeline
Batch analytics of entire data set
Real-time visualization of more recent data

Architect the Pipeline: Patient Monitoring

Ingest → Process → Store → Analyze

Expanding to Other Use Cases: Patient Monitoring

Repeatable Architecture Patterns: Image Analysis / Deep Learning

Capture imaging data from a laboratory instrument for a machine learning pipeline

Ability to capture data from vendor instrument
Integrate Python-based deep learning libraries
Store features for batch and real-time analysis

Expanding to Other Use Cases: Image Analysis


Having an analytic plan in place before data capture is good, but not always possible
Identify key fields of the data stream that can be indexed in advance for later filtering
Data science software is complex and rapidly evolving
Find key applications to become comfortable with and use frequently
Data processing pipelines are also often complex
Repeat with standard architectural approaches when possible

Creating Data Science Workflows
A Healthcare Use Case

Wade L. Schulz, MD, PhD Thomas JS Durant, MD, MPT
LinkedIn: wadeschulz LinkedIn: thomas-durant
