
Creating Data Science Workflows

A Healthcare Use Case

Wade L. Schulz, MD, PhD


Thomas JS Durant, MD, MPT

SLIDE 0
Getting Started

GitHub: Software Links/Instructions, Code


VirtualBox
Hortonworks Sandbox
Enable/Install Hortonworks Data Flow (HDF/NiFi)

https://github.com/ComputationalHealth/odsceast17
https://github.com/ComputationalHealth/odsceast17-data

SLIDE 1
VirtualBox / Hortonworks Sandbox

VirtualBox: https://www.virtualbox.org/
Hortonworks Sandbox: HDP 2.6
Large download; flash drives with the VirtualBox image are available if you
have not already downloaded it

SLIDE 2
VirtualBox Networking Issues?

1. Update VirtualBox if unable to start HDP image


2. If a message reports network issues, make sure that the
network adapter has its virtual cable connected

SLIDE 3
Code for Sandbox

Add port 6667 to VirtualBox port forwarding


Edit hosts file if you have admin privileges
127.0.0.1 sandbox.hortonworks.com
In VirtualBox:
Add port 6667 to NAT port forwarding (Kafka)
From image: curl https://goo.gl/o6jQ7T -L -o start_sandbox.sh
./start_sandbox.sh

SLIDE 4
Success!

Launch Dashboard for Ambari (username/password: raj_ops)


(or http://127.0.0.1:8080)

SLIDE 5
Hadoop and Ambari

Ambari: Hadoop management interface
HDFS: Storage
YARN: Resource manager
Hive: SQL-like queries
HBase: Non-relational DB
Oozie: Workflow
ZooKeeper: Coordinator
Storm: Stream processing
Kafka: Message queue
Ranger: Security
Spark: Compute/analytics
Zeppelin: Notebook
NiFi: Stream processing

SLIDE 6
SSH to Download Code

SLIDE 7
SSH to Download Code

1. yum install python-pip


2. pip install kafka hdfs
3. curl https://goo.gl/22EVWn -L -o .hdfscli.cfg

SLIDE 8
Code Repository

git clone https://github.com/ComputationalHealth/odsceast17.git

git clone https://github.com/ComputationalHealth/odsceast17-data.git

SLIDE 9
HDF / NiFi

** HDF/NiFi may not install, depending on the network. If it does not, we will
demo the NiFi functionality and load the data through an alternative method.
SLIDE 10
SLIDE 11
Data Science Project Development

Step 1:
Get Data

Step 2: ?

Step 3:
Profit
SLIDE 12
Healthcare No Exception

SLIDE 13
For the architect/developer:
What to do when there is no Step 2?

Obtaining sample data


Rapid data modeling/review
Any structured elements?
Capture and store the data
Create reusable workflows
Keep data analysis-ready when possible (at least at a basic level)

SLIDE 14
Three Healthcare Use Cases

Clinical Laboratory Informatics (Workshop)

Patient Monitoring
Image Analysis / Deep Learning

SLIDE 15
Workshop Goal: Create Reusable Workflow Framework

Data Review / Generation


Ingestion
Storage
Transformation
Visualization

SLIDE 16
Data Science Toolbox

Create a personalized toolbox for each step of the data lifecycle


Identify strengths/weaknesses of each tool
Test small, local implementations and at scale when possible

SLIDE 17
Data Flow in a Healthcare System

SLIDE 18
Data Flow in a Healthcare System

SLIDE 19
Use Case 1: Create a Laboratory Data Workflow

A use case with a step 2!

Problem: Enterprise data warehouse not real-time


Quality control: Moving averages of actual patient specimens
Business intelligence: Which? When? How many? Efficiency?

Data feed: HL7 interface

Data elements needed:
Lab test name
Time of specimen collection
Time of result
Order location
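
The quality-control item on this slide, moving averages of actual patient specimens, can be sketched in a few lines of Python. This is a minimal illustration and not the workshop's code: the function name `moving_average`, the window size, and the sample values are all hypothetical.

```python
from collections import deque


def moving_average(values, window):
    """Yield the mean of the last `window` values once the window is full."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    running_total = 0.0
    for value in values:
        if len(recent) == window:
            running_total -= recent[0]  # oldest value is about to be evicted
        recent.append(value)
        running_total += value
        if len(recent) == window:
            yield running_total / window


# Hypothetical results for one test, ordered by result time.
results = [140, 138, 142, 141, 139, 145]
averages = list(moving_average(results, window=3))
print(averages)
```

Keeping a running total avoids re-summing the window for every new result, which matters when the feed is continuous rather than a short list.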

SLIDE 20
What We Will Build

HL7 Data Generator -> Kafka -> NiFi -> HDFS -> Zeppelin

SLIDE 21
Problem #1: Sample/Test Data

Fortunately, we can often collect data from test implementations
within our healthcare system
New companies and vendors often have restrictions on data access or
may not have an early clinical partner
Even when data samples are provided, they may not be of sufficient size or
scope for testing

SLIDE 22
Problem #2: Healthcare Data Standards

Standards are often extremely complex and extensible


Many vendor systems don't follow the pure standard
Even though there is a standard data format, many data elements
(ontology, semantics) are not standard

SLIDE 23
Health Level 7 Observation Result (ORU)

MSH|^~\&|LCS|LCA|LIS|TEST9999|199807311532||ORU^R01|3629|P|2.2
PID|2|2161348462|20809880170|1614614|20809880170^TESTPAT||19760924|M|||^^^^|||||||86427531^^^03|
ORC|NW|8642753100012^LIS|20809880170^LCS||||||19980727000000|||HAVILAND
OBR|1|8642753100012^LIS|20809880170^LCS|008342^UPPER RESPIRATORY CULTURE^L|||19980727175800||||||SRC:THROAT
OBX|1|ST|008342^UPPER RESPIRATORY CULTURE^L||FINALREPORT|||||N|F||| 19980729160500|BN
OBX|2|CE|997231^RESULT 1^L||M415|||||N|F|||19980729160500|BN
NTE|1|L|MORAXELLA (BRANHAMELLA) CATARRHALIS
NTE|2|L| HEAVY GROWTH
NTE|3|L| BETA LACTAMASE POSITIVE

https://corepointhealth.com/resource-center/hl7-resources/hl7-msh-message-header
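
The delimiters in the message above suggest a very small parser: segments are separated by line breaks, fields by `|`, and components by `^`. The sketch below is a simplified illustration, not the workshop's parser; the helper names `parse_hl7` and `components` are hypothetical, and it ignores repetition (`~`) and escape (`\`) handling that a full HL7 library would cover.

```python
def parse_hl7(message):
    """Split an HL7 v2 message into a list of (segment_id, fields) tuples."""
    segments = []
    for line in message.splitlines():  # also handles the '\r' separator HL7 uses
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        segments.append((fields[0], fields))
    return segments


def components(field):
    """Split one field into its '^'-separated components."""
    return field.split("^")


# Shortened copies of three segments from the sample message.
sample = "\n".join([
    "MSH|^~\\&|LCS|LCA|LIS|TEST9999|199807311532||ORU^R01|3629|P|2.2",
    "OBR|1|8642753100012^LIS|20809880170^LCS|008342^UPPER RESPIRATORY CULTURE^L|||19980727175800",
    "OBX|2|CE|997231^RESULT 1^L||M415|||||N|F|||19980729160500|BN",
])

for segment_id, fields in parse_hl7(sample):
    if segment_id == "OBR":
        # For non-MSH segments, fields[n] lines up with field n (OBR-4, OBR-7).
        test_name = components(fields[4])[1]
        collected = fields[7]
        print(test_name, collected)
```

One caveat when extending this: in the MSH segment the field separator itself counts as MSH-1, so after a plain split `fields[n]` corresponds to MSH-(n+1).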

SLIDE 24
Workshop Part #1: The Lab Data Generator

To follow along, if you have Python 2.7 and/or Jupyter installed, open
the Laboratory Data Generatory.ipynb notebook from the GitHub repository:
/odsceast17/1-generation/data-generator/Laboratory Data Generatory.ipynb

If you have conda:


conda create -n odscHealth python=3
activate odscHealth
conda install notebook ipykernel matplotlib pandas
pip install kafka
jupyter notebook --notebook-dir=/path/to/git/repo

SLIDE 25
A Note on Laboratory Data

Laboratory tests often include panels


Complete blood count (CBC)
Basic metabolic panel (BMP)
Comprehensive metabolic panel (CMP)
Each panel can include several individual components
CBC: Hemoglobin, White Blood Cell Count, Platelets, etc.
BMP: Sodium, Potassium, Chloride, Creatinine, etc.
Normal ranges vary by laboratory due to variations
in equipment and the physiologic range of the local population

Generating good test data may require more than a simple random
number generator
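
As one step beyond a uniform random number generator, each panel component can be drawn from its own normal distribution. The sketch below only illustrates the idea and is not the notebook's code: `generate_panel` is a hypothetical helper, and the means and standard deviations are placeholders, not clinical reference values.

```python
import random

# Hypothetical (mean, standard deviation) pairs for illustration only;
# these are NOT clinical reference values.
BMP_COMPONENTS = {
    "Sodium": (140.0, 2.5),
    "Potassium": (4.2, 0.4),
    "Chloride": (102.0, 2.5),
    "Creatinine": (0.9, 0.2),
}


def generate_panel(panel_components, rng):
    """Draw one normally distributed, non-negative value per component."""
    panel = {}
    for name, (mean, sd) in panel_components.items():
        value = rng.gauss(mean, sd)
        panel[name] = round(max(value, 0.0), 1)  # lab values cannot be negative
    return panel


rng = random.Random(17)  # fixed seed so test runs are repeatable
panel = generate_panel(BMP_COMPONENTS, rng)
print(panel)
```

Because ranges vary by laboratory, keeping the distribution parameters in a table like `BMP_COMPONENTS` makes it easy to swap in values that match a specific site.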

SLIDE 26
Architect the Pipeline

Ingest -> Process -> Store -> Analyze

SLIDE 27
Data Ingest - Kafka

Setting up the Kafka Queue


In the Sandbox container:
cd /usr/hdp/current/kafka-broker/bin
./kafka-topics.sh
./kafka-topics.sh --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --create --topic test1
./kafka-console-consumer.sh --zookeeper localhost:2181 --topic test1

SLIDE 28
Start a Data/Kafka Producer

1. Less Fancy: Random values, CLI data loader (any Python, can run
inside Sandbox if )
1. /odsceast17/1-generation/

2. Fancy: Normally distributed data (if you have Python >= 2.7 with the
previous dependencies and were able to configure the hosts file)
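
For orientation, a producer along these lines can be sketched with the `kafka` package installed earlier. This is an assumption-laden outline rather than the repository's loader: `send_messages` and `main` are hypothetical names, and the broker address assumes the hosts-file entry and the port 6667 forwarding from the setup slides.

```python
import time


def send_messages(producer, topic, messages, delay_seconds=0.0):
    """Send each message to `topic` as UTF-8 bytes and return the count sent."""
    sent = 0
    for message in messages:
        producer.send(topic, message.encode("utf-8"))
        sent += 1
        if delay_seconds:
            time.sleep(delay_seconds)  # optional pacing between messages
    producer.flush()  # block until buffered messages are delivered
    return sent


def main():
    # Imported here so send_messages can be tried without a broker.
    from kafka import KafkaProducer

    # Assumption: sandbox.hortonworks.com resolves to 127.0.0.1 and
    # port 6667 is forwarded to the Kafka broker in the sandbox.
    producer = KafkaProducer(bootstrap_servers="sandbox.hortonworks.com:6667")
    count = send_messages(producer, "test1", ["hello from the workshop"])
    print("sent", count, "message(s)")
```

Calling `main()` while the console consumer for `test1` is running should show the message arrive in that terminal.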

SLIDE 29
Architect the Pipeline: Data Transformation / Load

Not all data need to be transformed before load


Many data sets benefit from pre-indexing, even if downstream
analytics are not yet known
Timestamps, identifiers, etc.
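
A small sketch of what pre-indexing can look like: pull a timestamp and an identifier out of the raw record at load time and store them in a normalized, filterable form. The helper names, the field names, the location code, and the directory layout below are all hypothetical choices for illustration.

```python
from datetime import datetime


def index_fields(collected, order_location):
    """Derive filterable index fields from a raw timestamp and a location."""
    # Timestamps in the sample HL7 message have the form YYYYMMDDHHMMSS.
    moment = datetime.strptime(collected, "%Y%m%d%H%M%S")
    return {
        "collected_iso": moment.isoformat(),
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "order_location": order_location.strip().upper(),
    }


def partition_path(index):
    """Build a date-partitioned directory name from the index fields."""
    return "year={year:04d}/month={month:02d}/day={day:02d}".format(**index)


index = index_fields("19980727175800", "ed-01")  # "ed-01" is a made-up location
print(index["collected_iso"], partition_path(index))
```

Indexing by date and location at load time means later questions such as "which tests, when, and from where" can be answered by filtering rather than by re-parsing every raw message.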

SLIDE 30
Workshop Part #2: Python HL7 Parsing

NiFi if installed
Python to HDFS if NiFi not installed
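
For the Python-to-HDFS path, a rough outline using the `hdfs` package installed earlier might look like the following. Treat it as a sketch under assumptions, not the workshop script: the helper names, the output path, and the alias `dev` are hypothetical, and the alias must match whatever the downloaded `.hdfscli.cfg` actually declares.

```python
import json


def to_json_lines(records):
    """Serialize a list of dicts as newline-delimited JSON."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def write_records(client, hdfs_path, records):
    """Write records to `hdfs_path` through an hdfs-library style client."""
    payload = to_json_lines(records)
    client.write(hdfs_path, data=payload, encoding="utf-8", overwrite=True)
    return len(records)


def main():
    # Imported here so the helpers above can be tried without a cluster.
    from hdfs import Config

    # Assumption: ~/.hdfscli.cfg defines an alias named "dev".
    client = Config().get_client("dev")
    records = [{"test": "UPPER RESPIRATORY CULTURE", "collected": "19980727175800"}]
    # Hypothetical output location inside HDFS.
    write_records(client, "/tmp/odsc/lab_results.jsonl", records)
```

Newline-delimited JSON is used here only because Spark can read it line by line; any format the downstream notebook understands would serve.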

SLIDE 31
Workshop Part #3: Data Visualization

Spark / Zeppelin

SLIDE 32
Workflow Overview

SLIDE 33
Implementation Architecture

SLIDE 34
Extending Hadoop

SLIDE 35
Repeatable Architecture Patterns
- Continuous Patient Monitoring

Generate approximately 6-9 billion data points per month


Capture monitoring data from EDs, ICUs, others

Requirements:
Highly scalable, fault-tolerant processing pipeline
Batch analytics of entire data set
Real-time visualization of more recent data
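
To put the monthly volume in pipeline terms, it helps to convert it to an average ingest rate. The short calculation below assumes a 30-day month and a perfectly even arrival rate, both simplifications; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def points_per_second(points_per_month, days=DAYS_PER_MONTH):
    """Average ingest rate implied by a monthly data-point volume."""
    return points_per_month / float(days * SECONDS_PER_DAY)


low = points_per_second(6e9)
high = points_per_second(9e9)
print("%.0f to %.0f data points per second" % (low, high))
```

Under those assumptions the stated 6-9 billion points per month works out to roughly 2,300 to 3,500 points per second on average, before allowing for bursts.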

SLIDE 36
Architect the Pipeline: Patient Monitoring

Ingest -> Process -> Store -> Analyze

SLIDE 37
Expanding to Other Use Cases: Patient Monitoring

SLIDE 38
Repeatable Architecture Patterns
- Image Analysis / Deep Learning

Capture imaging data from a laboratory instrument for a machine
learning pipeline

Requirements:
Ability to capture data from vendor instrument
Integrate Python-based deep learning libraries
Store features for batch and real-time analysis

SLIDE 39
Expanding to Other Use Cases: Image Analysis

SLIDE 40
Conclusions

Having an analytic plan in place before data capture is good, but
not always possible
Identify key fields of the data stream that can be indexed in advance
for later filtering
Data science software is a complex and rapidly evolving
environment
Find key applications to become comfortable with and use frequently
Data processing pipelines are also often complex
Repeat with standard architectural approaches when possible

SLIDE 41
Creating Data Science Workflows
A Healthcare Use Case

Wade L. Schulz, MD, PhD
wade.schulz@yale.edu
LinkedIn: wadeschulz

Thomas JS Durant, MD, MPT
thomas.durant@yale.edu
LinkedIn: thomas-durant

SLIDE 42