Data Mining
An example: building a telecom customer retention model
Given a customer's telecom behavior, predict whether the customer will stay or leave
Data warehouse:
Repository for the data available for BI and Decision Support Systems
Internal data, external data, and personal data
Internal data:
Back office: transactional records, orders, invoices, etc.
Front office: call center, sales office, marketing campaigns
Web-based: sales transactions on e-commerce websites
External data:
Market surveys, GIS systems
Independent variables: Outlook, Temp, Humidity, Windy. Dependent variable: Play.

Outlook    Temp  Humidity  Windy  Play
sunny       85      85     FALSE   no
sunny       80      90     TRUE    no
overcast    83      86     FALSE   yes
rainy       70      96     FALSE   yes
rainy       68      80     FALSE   yes
rainy       65      70     TRUE    no
overcast    64      65     TRUE    yes
sunny       72      95     FALSE   no
sunny       69      70     FALSE   yes
rainy       75      80     FALSE   yes
sunny       75      70     TRUE    yes
overcast    72      90     TRUE    yes
overcast    81      75     FALSE   yes
rainy       71      91     TRUE    no
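The table above is small enough to explore directly in code; a quick sketch in Python (the tuple layout and variable names are mine, not from any particular library):

```python
from collections import Counter

# The weather table above as (outlook, temp, humidity, windy, play) tuples.
weather = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]

# Count the dependent variable (Play) for each value of one
# independent variable (Outlook).
by_outlook = {}
for outlook, temp, humidity, windy, play in weather:
    by_outlook.setdefault(outlook, Counter())[play] += 1

print(by_outlook["overcast"])  # Counter({'yes': 4})
```

Every overcast day is a "yes", which is why Outlook is such an informative split in this dataset.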
Measures of Dispersion
Variance:
$$ s^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2 $$
Standard deviation:
$$ s = \left[ \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2 \right]^{1/2} $$
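Both dispersion measures take a few lines of Python; `sample_variance` is an illustrative name:

```python
import math

def sample_variance(xs):
    """Sample variance with the m - 1 denominator, as in the formula above."""
    m = len(xs)
    mean = sum(xs) / m
    return sum((x - mean) ** 2 for x in xs) / (m - 1)

# Temp column of the weather table above.
temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
var = sample_variance(temps)
std = math.sqrt(var)  # standard deviation = variance ** (1/2)
print(round(var, 2), round(std, 2))
```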
Heterogeneity Measures
The Gini coefficient (also known as the Gini index or Gini ratio) is a measure of statistical dispersion developed by the Italian statistician and sociologist Corrado Gini and published in his 1912 paper "Variability and Mutability" (Italian: "Variabilità e mutabilità")
Gini index:
$$ G = 1 - \sum_{h=1}^{k} f_h^2 $$
Entropy:
$$ E = - \sum_{h=1}^{k} f_h \log_2 f_h $$
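Applied to class relative frequencies, both measures are one-liners; a sketch (function names are mine), using the Play column of the weather table (9 yes, 5 no):

```python
import math

def gini(freqs):
    """Gini index: G = 1 - sum of squared class relative frequencies."""
    return 1 - sum(f ** 2 for f in freqs)

def entropy(freqs):
    """Entropy: E = -sum of f_h * log2(f_h), skipping empty classes."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

# Play column of the weather table: 9 "yes" and 5 "no" out of 14 days.
freqs = [9 / 14, 5 / 14]
print(round(gini(freqs), 3))     # 0.459
print(round(entropy(freqs), 3))  # 0.94
```

Both measures are 0 for a pure node (a single class) and largest when the classes are evenly mixed.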
Test of Significance
Given two models:
Model M1: accuracy = 85%, tested on 30 instances
Model M2: accuracy = 75%, tested on 5000 instances
Can we say with confidence that M1 is better than M2?
Confidence Intervals
Suppose the observed frequency f is 25%. How close is this to the true probability p?
Prediction is just like tossing a biased coin: heads is a success, tails is an error
Confidence intervals
We can say: p lies within a certain specified interval with a certain specified confidence
Example: S = 750 successes in N = 1000 trials
Estimated success rate: f = 75%
How close is this to the true success rate p?
Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
$$ \Pr[-z \le X \le z] = c $$
$$ \Pr[-z \le X \le z] = 1 - 2\Pr[X \ge z] $$
Transforming f
Transformed value for f:
$$ \frac{f - p}{\sqrt{p(1-p)/N}} $$
Resulting equation:
$$ \Pr\left[-z \le \frac{f - p}{\sqrt{p(1-p)/N}} \le z\right] = c $$
Solving for p:
$$ p = \left( f + \frac{z^2}{2N} \pm z \sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}} \right) \Bigg/ \left( 1 + \frac{z^2}{N} \right) $$
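The solved form can be evaluated directly; a minimal sketch (the name `wilson_interval` is mine; this closed form is known as the Wilson score interval):

```python
import math

def wilson_interval(f, n, z):
    """Bounds on the true success rate p given observed frequency f over
    n trials, using the closed form above (the Wilson score interval)."""
    center = f + z * z / (2 * n)
    half = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - half) / denom, (center + half) / denom

# Worked example from these slides: f = 0.75, N = 1000, c = 80% (z = 1.28).
lo, hi = wilson_interval(0.75, 1000, 1.28)
print(round(lo, 3), round(hi, 3))  # 0.732 0.767
```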
Example (f = 80%, z = 1.96, i.e. c = 95%):

N          50     100    500    1000   5000
p(lower)   0.670  0.711  0.763  0.774  0.789
p(upper)   0.888  0.866  0.833  0.824  0.811

As N grows, the interval [p(lower), p(upper)] narrows.
Common confidence levels and the corresponding z values:

1 − α   0.99   0.98   0.95   0.90
z       2.58   2.33   1.96   1.65
Confidence limits
Confidence limits for the normal distribution with mean 0 and variance 1:

Pr[X ≥ z]   z
0.1%        3.09
0.5%        2.58
1%          2.33
5%          1.65
10%         1.28
20%         0.84
40%         0.25
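The z values in the table come from the inverse normal CDF; Python's standard library reproduces them (allowing for the slides' rounding, e.g. 1.6449 shown as 1.65):

```python
from statistics import NormalDist

def z_for_tail(tail):
    """z such that Pr[X >= z] = tail for the standard normal distribution."""
    return NormalDist().inv_cdf(1 - tail)

for tail in (0.001, 0.005, 0.01, 0.05, 0.10, 0.20, 0.40):
    print(f"Pr[X >= z] = {tail:5.1%}: z = {z_for_tail(tail):.2f}")
```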
Examples
f = 75%, N = 1000, c = 80% (so that z = 1.28):
p ∈ [0.732, 0.767]
Implications
First, the more test data the better: the larger N is, the narrower the confidence interval around the true success rate
Cross-validation:
Hold aside one group for testing and use the rest to build the model
Repeat, holding aside a different group as the test set in each iteration
Confidence
2% error in 100 tests vs. 2% error in 10,000 tests: the larger test set warrants far more confidence in the estimate
Tradeoff in choosing the number of folds:
# of folds = # of data points N (leave-one-out CV):
trained model very close to the final model, but the test data (a single instance per fold) is very biased
# of folds = 2:
trained model very unlike the final model, but the test data is close to the training distribution
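The fold bookkeeping described above can be sketched in a few lines (pure Python; in practice a library routine such as scikit-learn's `KFold` would be used):

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists: each of the k folds is held out
    once for testing while the rest are used to build the model."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        train = [i for i in range(n) if i not in test]
        yield train, test

# k = n gives leave-one-out CV; k = 2 holds out half the data each time.
for train, test in k_fold_indices(6, 3):
    print("test:", test, "train:", train)
```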
The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
The performance of each classifier is represented as a point on the ROC curve
Changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point
                   PREDICTED CLASS
                   Class=Yes   Class=No
ACTUAL  Class=Yes   a (TP)      b (FN)
CLASS   Class=No    c (FP)      d (TN)

Widely-used metric:
$$ \text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN} $$
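The accuracy formula in code, with hypothetical counts for illustration:

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), per the matrix above."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts, for illustration only.
print(accuracy(tp=40, fn=10, fp=5, tn=45))  # 0.85
```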
Constructing the ROC curve: sort the instances by the classifier's score P(+|A) in decreasing order; take each distinct score in turn as the threshold (predict + when P(+|A) ≥ threshold); at each threshold, count TP, FP, TN, FN and compute TPR and FPR; plotting the (FPR, TPR) pairs traces out the ROC curve.

[ROC curve plot]
Comparing two models: neither model consistently outperforms the other
M1 is better for small FPR
M2 is better for large FPR
Area under the ROC curve (AUC):
Ideal: area = 1
Random guess: area = 0.5
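The threshold sweep and the area computation can be sketched as follows (function names are mine; ties between scores are swept in arbitrary order in this sketch):

```python
def roc_points(scores, labels):
    """Sweep thresholds from high to low, collecting (FPR, TPR) points."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# A classifier that ranks all positives above all negatives has area 1.0.
pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(auc(pts))  # 1.0
```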