
CSE 300

Data Mining and Its Application and Usage in Medicine
By Radhika

Data Mining and Medicine


History
Past 20 years: growth of relational databases
More dimensions added to database queries
One of the earliest and most successful areas of data mining

Mid-1800s: London hit by an infectious disease
Two competing theories:
Miasma theory: "bad air" propagated the disease
Germ theory: the disease was water-borne

Advantages
Discovers trends even when we don't understand the reasons behind them
Caveat: may also discover irrelevant patterns that confuse rather than enlighten
Protects against unaided human inference of patterns: provides quantifiable measures that aid human judgment

Data Mining

Data mining = the process of discovering interesting, meaningful, and actionable patterns hidden in large amounts of data
Patterns must be persistent and meaningful
Also known as Knowledge Discovery in Databases (KDD)

The future of data mining
The 10 biggest killers in the US


Major Issues in Medical Data Mining


Heterogeneity of medical data
Volume and complexity
Dependence on physicians' interpretation
Poor mathematical categorization; no canonical form
Solution: standard vocabularies, interfaces between different sources of data, integration, design of electronic patient records
Ethical, legal, and social issues
Data ownership
Lawsuits
Privacy and security of human data
Expected benefits
Administrative issues

Why Data Preprocessing?


Patient records consist of clinical and lab parameters and results of particular investigations, specific to given tasks
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or names
Temporal: chronic-disease parameters evolve over time
No quality data, no quality mining results!
A data warehouse needs consistent integration of quality data
In the medical domain, handling incomplete, inconsistent, or noisy data requires people with domain knowledge

What is Data Mining? The KDD Process

[Figure: the KDD pipeline]
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

From Tables and Spreadsheets to Data Cubes


A data warehouse is based on a multidimensional data model that views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
W. H. Inmon: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."
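As a rough illustration of the star-schema idea, here is a minimal pandas sketch: two dimension tables, one fact table, and a pivot playing the role of a tiny data cube. All table contents are made up for illustration.

```python
# Hypothetical star schema: dimension tables joined to a fact table,
# then pivoted into a small item-by-quarter "cube" of dollars_sold.
import pandas as pd

item = pd.DataFrame({"item_key": [1, 2],
                     "item_name": ["aspirin", "insulin"],
                     "type": ["otc", "rx"]})
time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
fact = pd.DataFrame({"item_key": [1, 1, 2, 2],
                     "time_key": [1, 2, 1, 2],
                     "dollars_sold": [100, 150, 400, 300]})

cube = (fact.merge(item, on="item_key")       # join fact to its dimensions
            .merge(time, on="time_key")
            .pivot_table(values="dollars_sold",
                         index="item_name", columns="quarter", aggfunc="sum"))
print(cube)   # rows: item dimension, columns: time dimension
```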

Data Warehouse vs. Heterogeneous DBMS


Data warehouse: update-driven, high performance
Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Does not contain the most current information
Query processing does not interfere with processing at local sources
Stores and integrates historical information
Supports complex multidimensional queries

Data Warehouse vs. Operational DBMS


OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory,
banking, manufacturing, payroll, registration,
accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse systems
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical,
consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex
queries


Why Separate Data Warehouse?


High performance for both systems
DBMS tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
Different functions and different data:
Missing data: Decision support requires historical
data which operational DBs do not typically
maintain
Data consolidation: decision support requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
Data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled


Typical OLAP Operations


Roll-up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
Drill-down (roll-down): reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes
Other operations
Drill-across: involving (across) more than one fact table
Drill-through: through the bottom level of the cube to its back-end relational tables (using SQL)
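A minimal sketch of roll-up, drill-down, and slice expressed as pandas operations; the sales table and its column names are hypothetical.

```python
# Roll-up = aggregate up a hierarchy; drill-down = go back to detail;
# slice = fix one dimension's value.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["east", "east", "west", "west"],
    "city":    ["nyc", "boston", "la", "sf"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "dollars_sold": [10, 20, 30, 40],
})

rollup = sales.groupby("region")["dollars_sold"].sum()               # city -> region
drilldown = sales.groupby(["region", "city"])["dollars_sold"].sum()  # back to detail
slice_q1 = sales[sales["quarter"] == "Q1"]       # slice: fix the time dimension
print(rollup, drilldown, slice_q1, sep="\n\n")
```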


Multi-Tiered Architecture

[Figure: three-tier warehouse architecture]
Data Sources (operational DBs, other sources) → Extract / Transform / Load / Refresh (Monitor & Integrator, Metadata) → Data Storage (Data Warehouse, Data Marts) → OLAP Engine (OLAP Server) → Front-End Tools (Analysis, Query, Reports, Data Mining)

Steps of a KDD Process


Learning the application domain: relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation: find useful features; dimensionality/variable reduction; invariant representation
Choosing the functions of data mining: summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge

Common Techniques in Data Mining


Predictive data mining
Most important
Classification: relate one set of variables in the data to response variables
Regression: estimate some continuous value
Descriptive data mining
Clustering: discovering groups of similar instances
Association rule extraction (over variables/observations)
Summarization of group descriptions

Leukemia


Different types of cells look very similar
Given a number of samples (patients):
Can we diagnose the disease accurately?
Predict the outcome of treatment?
Recommend the best treatment based on previous treatments?
Solution: data mining on micro-array data
38 training patients, 34 testing patients, ~7000 attributes per patient
2 classes: Acute Lymphoblastic Leukemia (ALL) vs. Acute Myeloid Leukemia (AML)

Clustering/Instance Based Learning


Uses specific instances to perform classification rather than general IF-THEN rules
Nearest-neighbor classifier
Among the most studied algorithms for medical purposes
Clustering: partitioning a data set into several groups (clusters) such that
Homogeneity: objects belonging to the same cluster are similar to each other
Separation: objects belonging to different clusters are dissimilar to each other
Three elements
The set of objects
The set of attributes
The distance measure

Measure the Dissimilarity of Objects



Find the best matching instance
Distance function
Measures the dissimilarity between a pair of data objects
Things to consider
Usually very different for interval-scaled, boolean, nominal, ordinal, and ratio-scaled variables
Weights should be associated with different variables based on the application and data semantics
Quality of a clustering result depends on both the distance measure adopted and its implementation

Minkowski Distance


Minkowski distance: a generalization

d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}, \quad q > 0

If q = 2, d is the Euclidean distance
If q = 1, d is the Manhattan distance

[Figure: points Xi = (1, 7) and Xj = (7, 1); the horizontal and vertical legs are 6 each, so the Manhattan distance (q = 1) is 6 + 6 = 12 and the Euclidean distance (q = 2) is 8.48]
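A small sketch of the formula; the call below reproduces the figure's example for q = 2 (Euclidean) and q = 1 (Manhattan).

```python
# Minkowski distance between two p-dimensional points.
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
print(minkowski(xi, xj, 2))   # Euclidean: sqrt(6**2 + 6**2) ~ 8.485
print(minkowski(xi, xj, 1))   # Manhattan: 6 + 6 = 12
```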

Binary Variables


A contingency table for binary data:

                    Object j
                    1       0       sum
Object i    1       a       b       a+b
            0       c       d       c+d
           sum     a+c     b+d       p

Simple matching coefficient:

d(i, j) = (b + c) / (a + b + c + d)

Dissimilarity between Binary Variables


Example: two objects described by seven binary attributes

            A1   A2   A3   A4   A5   A6   A7
Object 1     1    0    1    1    1    0    0

Contingency table (Object 1 vs. Object 2):

                    Object 2
                    1       0       sum
Object 1    1       2       2        4
            0       2       1        3
           sum      4       3        7

d(O1, O2) = (2 + 2) / (2 + 2 + 2 + 1) = 4/7
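A sketch of the simple matching dissimilarity. The slide gives only Object 1's attribute row explicitly; obj2 below is one hypothetical assignment consistent with the contingency table (a=2, b=2, c=2, d=1).

```python
# Simple matching dissimilarity for binary vectors: (b + c) / (a + b + c + d).
def simple_matching(x, y):
    b = sum(xi == 1 and yi == 0 for xi, yi in zip(x, y))
    c = sum(xi == 0 and yi == 1 for xi, yi in zip(x, y))
    return (b + c) / len(x)

obj1 = [1, 0, 1, 1, 1, 0, 0]
obj2 = [1, 1, 0, 1, 0, 1, 0]          # hypothetical second row
print(simple_matching(obj1, obj2))    # (2 + 2) / 7 = 4/7 ~ 0.571
```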

k-Means Algorithm


Initialization
Arbitrarily choose k objects as the initial cluster centers (centroids)
Iterate until no change
For each object Oi
Calculate the distances between Oi and the k centroids
(Re)assign Oi to the cluster whose centroid is closest to Oi
Update the cluster centroids based on the current assignment
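A minimal pure-Python sketch of these steps; the points and k below are illustrative, not from the Pima dataset.

```python
# k-means: assign each point to its nearest centroid, then recompute
# centroids, repeating until the assignment stops changing.
import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)              # arbitrary initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # (re)assign each object
            i = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]      # update the centroids
        if new == centroids:                          # stop when nothing changes
            break
        centroids = new
    return centroids, clusters

random.seed(0)
pts = [(1, 1), (1.5, 2), (0.5, 1.2), (8, 8), (9, 9)]
print(kmeans(pts, k=2))
```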

k-Means Clustering Method

[Figure: three scatter plots of the same data on a 10x10 grid: the current clusters with their cluster means, the objects relocated to the nearest mean, and the resulting new clusters]

Dataset


Data set from the UCI repository
http://kdd.ics.uci.edu/
768 female Pima Indians evaluated for diabetes
After data cleaning: 392 data entries

Hierarchical Clustering


Groups observations based on dissimilarity
Compacts the database into labels that represent the observations
Measures of similarity/dissimilarity
Euclidean distance
Manhattan distance
Types of clustering (sketched below)
Single link
Average link
Complete link
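A sketch of the three link types using SciPy's hierarchical clustering (assumes scipy is installed; the five 2-D points are toy data).

```python
# Agglomerative clustering with single, complete, and average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    print(method)
    print(Z)   # each row: the two clusters merged, their distance, new size
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree with matplotlib
```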

Hierarchical Clustering: Comparison

[Figure: the same six points (labeled 1-6) clustered four ways: single-link, complete-link, average-link, and centroid distance, each producing different groupings]

Compare Dendrograms

[Figure: dendrograms for single-link, complete-link, average-link, and centroid distance over the same points; the merge order differs across the four methods]

Which Distance Measure is Better?


Each method has both advantages and disadvantages; the choice is application-dependent
Single-link
Can find irregularly shaped clusters
Sensitive to outliers
Complete-link, average-link, and centroid distance
Robust to outliers
Tend to break large clusters
Prefer spherical clusters

Dendrogram from the dataset (single link)

Minimum spanning tree through the observations
The single observation that is last to join the cluster is a patient whose blood pressure is in the bottom quartile, skin thickness is in the bottom quartile, and BMI is in the bottom half
Her insulin, however, was the largest, and she is a 59-year-old diabetic

Dendrogram from the dataset (complete link)

Maximum dissimilarity between observations in one cluster when compared to another

Dendrogram from the dataset (average link)

Average dissimilarity between observations in one cluster when compared to another

Supervised versus Unsupervised Learning


Supervised learning (classification)
Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
Class labels of the training data are unknown
Given a set of measurements, observations, etc., the task is to establish the existence of classes or clusters in the data

Classification and Prediction


Derive models that can use patient-specific information to aid clinical decision making
A priori decision on the predictors and the variables to predict
No method can find predictors that are not present in the data
Numeric response
Least squares regression
Categorical response
Classification trees
Neural networks
Support vector machines
Decision models
Prognosis, diagnosis, and treatment planning
Embedded in clinical information systems

Least Squares Regression

Find a linear function of the predictor variables that minimizes the sum of squared differences with the response
Supervised learning technique
Example: predict insulin in our dataset from glucose and BMI
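A sketch with NumPy's least squares solver; the numbers below are toy values standing in for the Pima glucose, BMI, and insulin columns.

```python
# Fit insulin ~ intercept + glucose + BMI by minimizing the sum of squares.
import numpy as np

glucose = np.array([90., 120., 140., 160.])
bmi     = np.array([22., 28., 31., 35.])
insulin = np.array([60., 120., 180., 240.])

A = np.column_stack([np.ones_like(glucose), glucose, bmi])  # design matrix
coef, *_ = np.linalg.lstsq(A, insulin, rcond=None)          # least squares fit
print(coef)       # [intercept, glucose weight, BMI weight]
print(A @ coef)   # fitted insulin values
```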

Decision Trees


Decision tree
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
ID3 algorithm
Uses training objects with known class labels to classify testing objects
Ranks attributes with an information gain measure
Minimal height: the least number of tests needed to classify an object
Used in commercial tools, e.g., Clementine
ASSISTANT
Deals with medical datasets
Incomplete data
Discretizes continuous variables
Prunes unreliable parts of the tree
Classifies data


Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
Attributes are categorical (if continuous-valued, they are discretized in advance)
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all training examples are at the root
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Examples are partitioned recursively based on the selected attributes

Training Dataset

       Age     BMI      Hereditary  Vision     Risk of Condition X
P1     <=30    high     no          fair       no
P2     <=30    high     no          excellent  no
P3     >40     high     no          fair       yes
P4     31-40   medium   no          fair       yes
P5     31-40   low      yes         fair       yes
P6     31-40   low      yes         excellent  no
P7     >40     low      yes         excellent  yes
P8     <=30    medium   no          fair       no
P9     <=30    low      yes         fair       yes
P10    31-40   medium   yes         fair       yes
P11    <=30    medium   yes         excellent  yes
P12    >40     medium   no          excellent  yes
P13    >40     high     yes         fair       yes
P14    31-40   medium   no          excellent  no

Construction of a Decision Tree for Condition X

[P1..P14] Yes: 9, No: 5; split on Age

Age <= 30: [P1, P2, P8, P9, P11] Yes: 2, No: 3; split on Hereditary
    Hereditary = no:  [P1, P2, P8]    Yes: 0, No: 3 → NO
    Hereditary = yes: [P9, P11]       Yes: 2, No: 0 → YES
Age 31-40: [P4, P5, P6, P10, P14] Yes: 3, No: 2; split on Vision
    Vision = excellent: [P6, P14]     Yes: 0, No: 2 → NO
    Vision = fair:      [P4, P5, P10] Yes: 3, No: 0 → YES
Age > 40: [P3, P7, P12, P13] Yes: 4, No: 0 → YES

Entropy and Information Gain

S contains s_i tuples of class C_i for i = {1, ..., m}
Information measures the info required to classify any arbitrary tuple:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}

Entropy of attribute A with values {a_1, a_2, ..., a_v}:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})

Information gained by branching on attribute A:

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)

Entropy and Information Gain

Select the attribute with the highest information gain (or the greatest entropy reduction)
Such an attribute minimizes the information needed to classify the samples
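A small sketch applying these formulas to the Age attribute of the training dataset above; the class counts are read off the table and the tree-construction slide.

```python
# I(...), E(A), and Gain(A) for the Age split of the Condition X dataset.
from math import log2

def info(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

total = [9, 5]                            # yes/no over all 14 patients
age_branches = [[2, 3], [3, 2], [4, 0]]   # <=30, 31-40, >40

e_age = sum(sum(b) / sum(total) * info(b) for b in age_branches)
gain_age = info(total) - e_age
print(round(info(total), 3), round(e_age, 3), round(gain_age, 3))
# ~0.94, ~0.694, ~0.247 -> splitting on Age removes the most uncertainty
```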

Rule Induction

IF conditions THEN conclusion
E.g., CN2
Concept description:
Characterization: provides a concise and succinct summarization of a given collection of data
Comparison: provides descriptions comparing two or more collections of data
Training set, testing set
Rules can be imprecise
Predictive accuracy = P / (P + N)

Example Used in a Clinic

Hip arthroplasty: the trauma surgeon predicts a patient's long-term clinical status after surgery
Outcome evaluated during follow-ups for 2 years
2 modeling techniques
Naïve Bayesian classifier
Decision trees
Bayesian classifier
P(outcome = good) = 0.55 (11/20 good)
The probability gets updated as more attributes are considered
P(timing = good | outcome = good) = 9/11
P(outcome = bad) = 9/20; P(timing = good | outcome = bad) = 5/9

Nomogram

[Figure: nomogram]

Bayesian Classification

Bayesian classifier vs. decision tree
Decision tree: predicts the class label
Bayesian classifier: a statistical classifier; predicts class membership probabilities
Based on Bayes' theorem; estimates the posterior probability
Naïve Bayesian classifier:
A simple classifier that assumes attribute independence
High speed when applied to large databases
Comparable in performance to decision trees

Bayes' Theorem

Let X be a data sample whose class label is unknown
Let H_i be the hypothesis that X belongs to a particular class C_i
P(H_i) is the class prior probability that X belongs to C_i
Can be estimated by n_i / n from the training data samples
n is the total number of training data samples
n_i is the number of training data samples of class C_i

Bayes' theorem:

P(H_i \mid X) = \frac{P(X \mid H_i) \, P(H_i)}{P(X)}
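A sketch of Bayes' theorem applied to the clinic example's numbers: update the prior P(outcome = good) = 0.55 after observing timing = good.

```python
# Posterior = likelihood * prior / evidence, with the hip-arthroplasty counts.
p_good = 11 / 20                 # prior P(outcome = good)
p_bad = 9 / 20                   # prior P(outcome = bad)
p_t_good = 9 / 11                # P(timing = good | outcome = good)
p_t_bad = 5 / 9                  # P(timing = good | outcome = bad)

evidence = p_t_good * p_good + p_t_bad * p_bad   # P(X), the denominator
posterior = p_t_good * p_good / evidence         # P(good | timing = good)
print(round(posterior, 3))       # ~0.643: the attribute raised 0.55 to 0.64
```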

More Classification Techniques

Neural networks
Mimic the pattern-recognition properties of biological systems
Most frequently used
Multi-layer perceptrons
Inputs (with bias) connected by weights to hidden and output layers
Backpropagation neural networks
Support vector machines
Separate the database into mutually exclusive regions
Transform to another problem space
Kernel functions (dot product)
Output for new points predicted by their position
Comparison with classification trees
Not possible to know which features or combinations of features most influence a prediction

Multilayer Perceptrons

Non-linear transfer functions applied to weighted sums of inputs
Werbos's backpropagation algorithm
Random initial weights
Training set, testing set

Support Vector Machines

3 steps
Support vector creation
Maximal distance between points found
Perpendicular decision boundary
Allows some points to be misclassified (soft margin)
Example: Pima Indians data with X1 (glucose) and X2 (BMI)
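A hedged sketch using scikit-learn's SVC (assumed available); the feature values and labels below are toy stand-ins for X1 = glucose and X2 = BMI, not the real Pima data.

```python
# Soft-margin SVM: C trades off margin width against misclassified points.
from sklearn.svm import SVC

X = [[90, 22], [100, 24], [120, 30], [140, 26], [150, 33], [160, 35]]
y = [0, 0, 0, 1, 1, 1]             # 0 = non-diabetic, 1 = diabetic (toy labels)

clf = SVC(kernel="rbf", C=1.0)     # kernel maps the data to another space
clf.fit(X, y)
print(clf.support_vectors_)        # the boundary depends only on these points
print(clf.predict([[130, 29]]))    # a new point is classified by its position
```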

What is Association Rule Mining?

Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

PatientID   Conditions
1           High LDL, Low HDL, High BMI, Heart Failure
2           High LDL, Low HDL, Heart Failure, Diabetes
3           Diabetes, High LDL, Low HDL, Heart Failure
4           High BMI, High LDL, Low HDL, Heart Failure

Example of an association rule:
{High LDL, Low HDL} → {Heart Failure}
People who have high LDL (bad cholesterol) and low HDL (good cholesterol) are at higher risk of heart failure.

Association Rule Mining

Market basket analysis
Groups of items bought together are placed together
Healthcare
Understanding associations among patients with demands for similar treatments and services
Goal: find items for which the joint probability of occurrence is high
Basket of binary-valued variables
Results form association rules, augmented with support and confidence

Association Rule Mining

Association rule
An implication expression of the form X → Y, where X and Y are itemsets and X ∩ Y = ∅

Rule evaluation metrics
Support (s): the fraction of transactions that contain both X and Y
s = P(X ∪ Y) = (# transactions containing X ∪ Y) / (# transactions in D)
Confidence (c): measures how often items in Y appear in transactions that contain X
c = P(Y | X) = (# transactions containing X ∪ Y) / (# transactions containing X)
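A sketch computing support and confidence for {High LDL, Low HDL} → {Heart Failure} over the four patient baskets from the earlier table.

```python
# Support = P(X and Y); confidence = P(Y | X), counted over transactions.
transactions = [
    {"High LDL", "Low HDL", "High BMI", "Heart Failure"},
    {"High LDL", "Low HDL", "Heart Failure", "Diabetes"},
    {"Diabetes", "High LDL", "Low HDL", "Heart Failure"},
    {"High BMI", "High LDL", "Low HDL", "Heart Failure"},
]
X, Y = {"High LDL", "Low HDL"}, {"Heart Failure"}

both = sum((X | Y) <= t for t in transactions)         # baskets with X and Y
support = both / len(transactions)                     # P(X u Y)
confidence = both / sum(X <= t for t in transactions)  # P(Y | X)
print(support, confidence)   # 1.0 1.0: every basket in this toy table has the rule
```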

The Apriori Algorithm

Starts with the most frequent 1-itemsets
Include only those items that pass the support threshold
Use the frequent 1-itemsets to generate 2-itemsets, and so on
Stop when the threshold is not satisfied by any itemset

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Candidate generation: Ck+1 = candidates generated from Lk;
    Candidate counting: for each transaction t in the database,
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_sup;
return ∪k Lk;
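A runnable Python sketch of the loop above, with min_sup given as an absolute count; on the database of the next slide it reproduces the frequent itemsets listed there.

```python
# Apriori: generate candidates from the previous frequent level, count them,
# keep those meeting min_sup, and stop when a level comes up empty.
def apriori(db, min_sup):
    items = {i for t in db for i in t}
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in db) >= min_sup}]      # L1: frequent items
    while L[-1]:
        k = len(next(iter(L[-1]))) + 1
        # candidate generation: unions of frequent itemsets, one item larger
        cands = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}
        # candidate counting: keep candidates in >= min_sup transactions
        L.append({c for c in cands if sum(c <= t for t in db) >= min_sup})
    return [s for level in L for s in level]

db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
print(sorted(map(sorted, apriori(db, min_sup=2))))
# [['a'], ['a','c'], ['b'], ['b','c'], ['b','c','e'], ['b','e'], ['c'], ['c','e'], ['e']]
```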

Apriori-based Mining (min_sup = 0.5, i.e., 2 of 4 transactions)

Database D:
TID    Items
10     a, c, d
20     b, c, e
30     a, b, c, e
40     b, e

Scan D → 1-candidates: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a, b, c, e
2-candidates: ab, ac, ae, bc, be, ce
Scan D (counting): ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac, bc, be, ce
3-candidates: bce
Scan D (counting): bce:2
Frequent 3-itemsets: bce

Principal Component Analysis

Principal components
With a large number of variables, it is highly likely that some subsets of the variables are strongly correlated with each other; reduce the variables but retain the variability in the dataset
Linear combinations of the variables in the database
The variance of each PC is maximized
Displays as much of the spread of the original data as possible
PCs are orthogonal to each other
Minimizes the overlap between the variables
Each component's normalized sum of squares is unity
Easier for mathematical analysis
Number of PCs < number of variables
Associations found
A small number of PCs explains a large amount of the variance
Example: 768 female Pima Indians evaluated for diabetes
Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, diastolic blood pressure, triceps skin fold thickness, two-hour serum insulin, BMI, diabetes pedigree function, age, diabetes onset within the last 5 years
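A sketch of PCA with plain NumPy on standardized toy data; the slide's real example uses the eight Pima Indians variables instead.

```python
# PCA via the eigendecomposition of the correlation matrix: orthogonal
# components, each eigenvalue = the variance captured by its component.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=100)  # correlated column

Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize the variables
eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
order = eigval.argsort()[::-1]               # sort PCs by variance, descending
eigval, eigvec = eigval[order], eigvec[:, order]
print(eigval / eigval.sum())                 # a few PCs explain most variance
scores = Z @ eigvec                          # the data in the orthogonal PC basis
```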

PCA Example

[Figure: PCA example]

National Cancer Institute

CancerNet: http://www.nci.nih.gov
CancerNet for Patients and the Public
CancerNet for Health Professionals
CancerNet for Basic Researchers
CancerLit

Conclusion

About a billion people's medical records are electronically available
Data mining in medicine is distinct from other fields due to the nature of the data: heterogeneous, with ethical, legal, and social constraints
The most commonly used technique is classification and prediction, with different techniques applied to different cases
Association rules describe the data in the database
Medical data mining can be the most rewarding, despite its difficulty


Thank you !!!


