CSE 300
History
Past 20 years dominated by relational databases
Data mining adds more dimensions to database queries
Advantages
Discovers trends even when we do not understand the reasons behind them
May also discover irrelevant patterns that confuse rather than enlighten
Protects against unaided human inference of patterns by providing quantifiable measures that aid human judgment
Data Mining
[Figure: the knowledge discovery process: Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Multi-Tiered Architecture
[Figure: Data Sources (operational DBs and other sources) are fed through Extract / Transform / Load / Refresh by a Monitor & Integrator into Data Storage (the Data Warehouse with Metadata, plus Data Marts); an OLAP Engine (OLAP Server) serves Front-End Tools for query/reports, analysis, and data mining]
Variables/Observations
Leukemia
Minkowski Distance
d(xi, xj) = (sum_k |xik - xjk|^q)^(1/q)
If q = 2, d is the Euclidean distance
If q = 1, d is the Manhattan distance
Example: xi = (1, 7), xj = (7, 1)
q = 1 (Manhattan): d = |1 - 7| + |7 - 1| = 6 + 6 = 12
q = 2 (Euclidean): d = sqrt(6^2 + 6^2) = sqrt(72) ≈ 8.48
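A minimal Python sketch of the Minkowski distance, checked against the example points xi = (1, 7) and xj = (7, 1):

```python
def minkowski(x, y, q):
    """d(x, y) = (sum_k |x_k - y_k| ** q) ** (1/q)"""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
manhattan = minkowski(xi, xj, q=1)   # 6 + 6 = 12
euclidean = minkowski(xi, xj, q=2)   # sqrt(72), about 8.49
```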
Binary Variables
Contingency table for binary variables (rows: Object i, columns: Object j):

             Object j
              1     0     sum
Object i  1   a     b     a+b
          0   c     d     c+d
        sum   a+c   b+d   p

d(i, j) = (b + c) / (a + b + c + d)
Example

           A1  A2  A3  A4  A5  A6  A7
Object 1:   1   0   1   1   1   0   0

Contingency table (rows: Object 1, columns: Object 2):

            1    0    sum
        1   2    2    4
        0   2    1    3
      sum   4    3    7

d(O1, O2) = (2 + 2) / (2 + 2 + 2 + 1) = 4/7 ≈ 0.57
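A sketch of the dissimilarity computation. Object 1 is taken from the example; o2 is a hypothetical vector chosen so that the contingency counts come out a = 2, b = 2, c = 2, d = 1 as above:

```python
def binary_dissimilarity(x, y):
    """Simple matching dissimilarity d(i, j) = (b + c) / (a + b + c + d)."""
    a = sum(1 for u, v in zip(x, y) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    d = sum(1 for u, v in zip(x, y) if (u, v) == (0, 0))
    return (b + c) / (a + b + c + d)

o1 = [1, 0, 1, 1, 1, 0, 0]          # Object 1 from the example
o2 = [1, 1, 0, 0, 1, 1, 0]          # hypothetical Object 2 giving a=2, b=2, c=2, d=1
d12 = binary_dissimilarity(o1, o2)  # (2 + 2) / 7
```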
k-Means Clustering
Initialization
Arbitrarily choose k objects as the initial cluster centers (centroids)
Iteration until no change
For each object Oi
Calculate the distances between Oi and the k centroids
(Re)assign Oi to the cluster whose centroid is closest to Oi
Update each centroid to the mean of the objects in its cluster
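The loop above can be sketched in a few lines; the dataset and weights below are illustrative, not from the slides:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Choose k objects as initial centroids, then alternate between
    assigning each object to its nearest centroid and recomputing each
    centroid as the mean of its cluster, until assignments stop changing."""
    centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                        (p[1] - centroids[c][1]) ** 2)
                      for p in points]
        if new_labels == labels:
            break                    # no object was relocated
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels

# Two well-separated groups of objects
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, labels = kmeans(pts, k=2)
```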
[Figure: k-means iterations on a 2-D dataset: cluster means are computed, each object is relocated to its nearest mean, and new clusters form until no object moves]
Dataset
Hierarchical Clustering
[Figure: merging six points under average-link, complete-link, and centroid distance]
Compare Dendrograms
[Figure: dendrograms for the same six points under single-link, complete-link, average-link, and centroid distance]
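A small agglomerative clustering sketch with pluggable linkage (single-link uses the closest pair between clusters, complete-link the farthest pair); the sample points are illustrative:

```python
from itertools import combinations

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def single_link(A, B):
    return min(dist(p, q) for p in A for q in B)   # closest pair

def complete_link(A, B):
    return max(dist(p, q) for p in A for q in B)   # farthest pair

def agglomerative(points, linkage, k):
    """Start with singleton clusters and repeatedly merge the two clusters
    that are closest under the given linkage, until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (1, 0), (5, 0), (6, 0)]
by_single = agglomerative(pts, single_link, k=2)
by_complete = agglomerative(pts, complete_link, k=2)
```

On this tiny dataset both linkages produce the same two clusters; on less separated data the merge orders (and dendrograms) differ.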
Decision Trees
Decision tree
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
ID3 algorithm
Uses training objects with known class labels to classify testing objects
Ranks attributes with the information gain measure
Minimal height
The least number of tests needed to classify an object
Training Dataset

ID   Age    BMI     Hereditary  Vision     Risk of Condition X
P1   <=30   high    no          fair       no
P2   <=30   high    no          excellent  no
P3   >40    high    no          fair       yes
P4   31-40  medium  no          fair       yes
P5   31-40  low     yes         fair       yes
P6   31-40  low     yes         excellent  no
P7   >40    low     yes         excellent  yes
P8   <=30   medium  no          fair       no
P9   <=30   low     yes         fair       yes
P10  31-40  medium  yes         fair       yes
P11  <=30   medium  yes         excellent  yes
P12  >40    medium  no          excellent  yes
P13  >40    high    yes         fair       yes
P14  31-40  medium  no          excellent  no
Age?
<=30: [P1, P2, P8, P9, P11] (Yes: 2, No: 3), test Hereditary
  no:  [P1, P2, P8] (Yes: 0, No: 3) -> NO
  yes: [P9, P11] (Yes: 2, No: 0) -> YES
31-40: [P4, P5, P6, P10, P14] (Yes: 3, No: 2), test Vision
  excellent: [P6, P14] (Yes: 0, No: 2) -> NO
  fair: [P4, P5, P10] (Yes: 3, No: 0) -> YES
>40: [P3, P7, P12, P13] (Yes: 4, No: 0) -> YES
Expected information (entropy) needed to classify a sample:

I(s1, s2, ..., sm) = - sum_{i=1}^{m} (si / s) * log2(si / s)

where si is the number of samples in class i and s = s1 + ... + sm is the total number of samples.
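A sketch of entropy and information gain under this definition; the Age column and class labels are transcribed from the training dataset (P1..P14), where 9 objects are at risk and 5 are not:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """I(s1, ..., sm) = -sum_i (si/s) * log2(si/s) over the class counts."""
    s = len(labels)
    return -sum((si / s) * log2(si / s) for si in Counter(labels).values())

def info_gain(values, labels):
    """Entropy of the whole set minus the weighted entropy after
    partitioning the objects by their value of one attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

# Age column and class column from the training dataset (P1..P14)
age = ["<=30", "<=30", ">40", "31-40", "31-40", "31-40", ">40",
       "<=30", "<=30", "31-40", "<=30", ">40", ">40", "31-40"]
risk = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
gain_age = info_gain(age, risk)   # about 0.247 bits
```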
Rule Induction
Hip arthroplasty: trauma surgeons predict a patient's long-term clinical status after surgery
Outcome evaluated during follow-ups for 2 years
2 modeling techniques
Naïve Bayesian classifier
Decision trees
Bayesian classifier
P(outcome=good) = 0.55 (11/20 outcomes good)
The probability gets updated as more attributes are considered
P(timing=good | outcome=good) = 9/11 ≈ 0.818
P(outcome=bad) = 9/20
P(timing=good | outcome=bad) = 5/9
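Plugging these numbers into Bayes' theorem gives the updated probability of a good outcome once timing is observed (a sketch; the evidence term sums over both outcomes):

```python
# Priors and conditionals from the study above
p_good = 11 / 20            # prior P(outcome = good)
p_bad = 9 / 20              # prior P(outcome = bad)
p_t_good = 9 / 11           # P(timing = good | outcome = good)
p_t_bad = 5 / 9             # P(timing = good | outcome = bad)

# Bayes' theorem: P(outcome = good | timing = good)
evidence = p_t_good * p_good + p_t_bad * p_bad
posterior = p_t_good * p_good / evidence    # 9/14, about 0.643
```

Observing good timing raises the probability of a good outcome from 0.55 to about 0.64.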
Nomogram
Bayesian Classification
Bayes Theorem
Neural Networks
Modeled on the pattern-recognition properties of biological nervous systems
Most frequently used form
Multi-layer perceptrons
Inputs (plus a bias term) are connected by weights to hidden units, which in turn connect to the output units
Multilayer Perceptrons
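A forward pass through a one-hidden-layer perceptron can be sketched as follows; the weights, biases, and input are hypothetical, and sigmoid activations are assumed:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: weighted sum plus bias into each hidden unit,
    sigmoid squashing, then the same for the single output unit."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)) + b_out)

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output
y = mlp_forward([0.5, -1.0],
                w_hidden=[[0.1, 0.4], [-0.3, 0.2]],
                b_hidden=[0.0, 0.1],
                w_out=[0.7, -0.5],
                b_out=0.2)
```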
Support Vector Machines
3 steps
Support vectors are identified
The maximal margin between the two classes is found
A decision boundary is drawn perpendicular to, and halfway across, the maximal margin
Some points are allowed to be misclassified (soft margin)
Example: Pima Indian diabetes data with X1 (glucose) and X2 (BMI)
[Figure: example associations among conditions such as Diabetes, High LDL / Low HDL, and Heart Failure]
Association Rule
An implication expression of the form X → Y, where X and Y are itemsets and X ∩ Y = ∅
Rule Evaluation Metrics
Support: fraction of transactions containing both X and Y
Confidence: fraction of the transactions containing X that also contain Y
Apriori-based Mining
Database D (Min_sup = 0.5, i.e. an itemset must appear in at least 2 of the 4 transactions):

TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D, count 1-candidates:   a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets:          a:2, b:3, c:3, e:3
2-candidates:                 ab, ac, ae, bc, be, ce
Scan D, counting:             ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets:          ac:2, bc:2, be:3, ce:2
3-candidates:                 bce
Scan D:                       bce:2
Frequent 3-itemsets:          bce:2
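The level-wise trace above can be reproduced with a short sketch; candidate generation here is a simple pairwise union of frequent itemsets rather than the full join-and-prune step of Apriori, but it yields the same result on this database:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent-itemset mining: count candidates with a scan of D,
    keep those meeting min_sup, then build next-level candidates by joining."""
    n = len(transactions)
    freq = {}
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = [c for c in level if counts[c] / n >= min_sup]
        freq.update({c: counts[c] for c in survivors})
        k = len(level[0]) + 1
        # join step: unions of pairs of frequent k-itemsets of the right size
        level = list({a | b for a, b in combinations(survivors, 2) if len(a | b) == k})
    return freq

# The 4-transaction database from the trace, min_sup = 0.5
db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
freq = apriori(db, 0.5)   # 9 frequent itemsets, including {b, c, e} with count 2
```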
Principal Components
With a large number of variables, it is likely that some subsets of the variables are highly correlated with each other. The goal is to reduce the number of variables while retaining the variability in the dataset
The principal components are linear combinations of the variables in the database
The variance of each PC is maximized in turn
The first few PCs display as much of the spread of the original data as possible
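For the two-variable case, the first principal component can be computed in closed form from the 2x2 covariance matrix; a sketch (assumes nonzero covariance between the variables; the data points are illustrative):

```python
from math import sqrt

def first_pc_2d(points):
    """First principal component of 2-D data, via the closed-form
    eigendecomposition of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # eigenvalues of [[sxx, sxy], [sxy, syy]]
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    root = sqrt(tr * tr / 4 - det)
    lam1, lam2 = tr / 2 + root, tr / 2 - root
    v = (sxy, lam1 - sxx)              # eigenvector for lam1 (needs sxy != 0)
    norm = sqrt(v[0] ** 2 + v[1] ** 2)
    return (v[0] / norm, v[1] / norm), lam1 / (lam1 + lam2)

# Perfectly correlated data: the first PC points along (1, 1) / sqrt(2)
# and captures essentially all of the variance.
direction, explained = first_pc_2d([(1, 1), (2, 2), (3, 3), (4, 4)])
```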
PCA Example
Conclusion