Where is Massachusetts?
Where in Massachusetts?
Williams College
Data = $$
Data Mining Is
"the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." --- Fayyad
"finding interesting structure (patterns, statistical models, relationships) in databases." --- Fayyad, Chaudhuri and Bradley
"a knowledge discovery process of extracting previously unknown, actionable information from very large databases" --- Zornes
"a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." --- Edelstein
JMP Visual Data Mining
Clustering and visualization
Neural networks
Decision trees
Naïve Bayes
K-nearest neighbor methods
K-means
Combining models: bagging and boosting
UPS
16 TB (roughly the U.S. Library of Congress), mostly tracking data
Facebook
200-300 TB of photos every month
Google
1 PB every 72 minutes
Marketing Experiments
Often, new hypotheses are generated by data mining a planned experiment.
Segmentation
Manufacturing Applications
Product reliability and quality control
Process control
What can I do to improve batch yields?
Warranty analysis
Product problems
Service assessment
Adverse experiences linked to production
Medical Applications
Medical procedure effectiveness
Who are good candidates for surgery?
Physician effectiveness
Which tests are ineffective?
E-commerce
Automatic web page design
Recommendations for new purchases
Cross-selling
Social network marketing
Pharmaceutical Applications
High throughput screening
Predict actions in assays
Predict results in animals or humans
Pharmaceutical Applications
Combine clinical trial results with extensive medical/demographic information
Non-traditional uses of the clinical trial data warehouse to explore:
Prediction of adverse experiences by combining more than one trial
Who is likely to be non-compliant or drop out?
What are alternative (i.e., non-approved) uses supported by the data?
Which stock trades are based on insider information?
Whose cell phone numbers have been stolen?
Which credit card transactions are from stolen cards?
Which documents are interesting?
When are changes in networks signs of potential illegal activity?
To optimize profit, who should receive the current solicitation? What is the most cost-effective strategy?
T-Code
More T-Code
Transformation?
Categories?
Shoppers

Person ID   Person Name   ZIP Code   Item Bought
135366      Lyle          19103      T35691
135366      Lyle          19103      C56621
259835      Dick          01267      RS5292
Metadata
The data survey describes the data set contents and characteristics
Table name
Description
Primary key/foreign key relationships
Collection information: how, where, conditions
Timeframe: daily, weekly, monthly
Cosynchronous: every Monday or Tuesday
Data Preparation
Build the data mining database
Explore the data
Prepare the data for modeling
60% to 95% of the time is spent preparing the data
Data Challenges
Data definitions
Types of variables
Data consolidation
Combine data from different sources (e.g., the NASA Mars lander)
Data heterogeneity
Homonyms
Synonyms
Data quality
Data Quality
Missing Values
Random missing values
Delete row?
Paralyzed Veterans
Substitute value
Imputation
Multiple Imputation (JMP 8!)
Associations?
Most associated variables in the census
Most associated variables in a supermarket
Association Rules
Why Models?
Beer and Diapers
In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated.
Conclusions? Actions?
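Association rules like beer → diapers are typically scored by support, confidence, and lift. A minimal Python sketch on hypothetical market baskets (the store data in the slide are not available):

```python
# Support, confidence, and lift for a single association rule A -> B,
# computed over a list of (hypothetical) market baskets.
def rule_stats(baskets, a, b):
    n = len(baskets)
    n_a = sum(1 for t in baskets if a in t)
    n_b = sum(1 for t in baskets if b in t)
    n_ab = sum(1 for t in baskets if a in t and b in t)
    support = n_ab / n              # P(A and B)
    confidence = n_ab / n_a         # P(B | A)
    lift = confidence / (n_b / n)   # P(B | A) / P(B); > 1 means positive association
    return support, confidence, lift

baskets = [{"beer", "diapers"}, {"beer", "diapers", "chips"},
           {"beer"}, {"diapers", "milk"}, {"milk", "chips"}]
s, c, l = rule_stats(baskets, "beer", "diapers")
print(s, c, l)  # support 0.4, confidence 2/3, lift 10/9
```

A lift above 1, as here, is exactly the "highly associated" claim on the slide; it still says nothing about why, which is the point of "Conclusions? Actions?".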
Models
Models are:
Powerful summaries for understanding
Used for exploration and prediction
Data Processing
Five months to consolidate process data
Three months to analyze and reduce the dimension of the water data
Eight months after starting the project, the statisticians received a flat file:
960 ingots (rows), 149 variables
[Figure: classification tree on survival data, splitting on Adult/Child and passenger class (1, 2, 3, Crew); leaf rates include 14%, 23%, 33%, 46%, and 93%]
MI: myocardial infarction (heart attack)
Ischemia: heart muscle not getting enough blood
Confusion Matrix
Rows = actual, columns = predicted:

Doctors in ER        Heart Attack   No Heart Attack
Heart Attack             0.89            0.11
No Heart Attack          0.75            0.25

Tree Algorithm (Goldman)   Heart Attack   No Heart Attack
Heart Attack                   0.96            0.04
No Heart Attack                0.08            0.92
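Sensitivity (true-positive rate) and specificity (true-negative rate) fall straight out of a confusion matrix. A minimal Python sketch, using counts that match the Goldman tree's proportions (the counts themselves are hypothetical):

```python
# Sensitivity and specificity from a 2x2 confusion matrix
# (rows = actual, columns = predicted).
def sens_spec(matrix):
    (tp, fn), (fp, tn) = matrix
    sensitivity = tp / (tp + fn)   # fraction of real events caught
    specificity = tn / (fp + tn)   # fraction of non-events correctly cleared
    return sensitivity, specificity

goldman_tree = [[96, 4], [8, 92]]   # hypothetical counts per 100 patients
print(sens_spec(goldman_tree))      # (0.96, 0.92)
```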
Regression Tree
[Figure: regression tree splitting on Price, Weight, Disp., HP, and Reliability; leaf predictions range from 18.60 to 34.00]
All Rows: Count 3935, Mean 0.2356, Std Dev 0.4244
Alloy (6045, 7348, 8234, 2345, 3234): Count 3005, Mean 0.159, Std Dev 0.3661
Alloy (5434, 5894, 2439): Count 930, Mean 0.4817, Std Dev 0.4999
"We know that some alloys are hard to make. That's why we gave you the data in the first place."
Second Tree
All Rows: Count 3935, Mean 0.2356, Std Dev 0.4244
MG < 3.9: Count 3055, Mean 0.1673, Std Dev 0.3723
MG >= 3.9: Count 880, Mean 0.4727, Std Dev 0.4999
All Rows: Count 94649, G² 37928.436; Prob(0) = 0.9494, Prob(1) = 0.0506
TARGET_D >= 1: Count 4792, G² 0; Prob(0) = 0.0000, Prob(1) = 1.0000
TARGET_D < 1: Count 89857, G² 0; Prob(0) = 1.0000, Prob(1) = 0.0000
Success depends more on the way you mine the data than on the specific tool
Types of Models
Descriptions
Classification (categorical or discrete values)
Regression (continuous values)
Time series (continuous values)
Clustering
Association
Model Building
Train
Test
Evaluate
Overfitting in Regression
Classical overfitting:
Fit 6th order polynomial to 6 data points
[Plot: polynomial passing exactly through all six data points]
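The effect is easy to reproduce. In this Python sketch (the noisy data are hypothetical; Newton's divided differences stand in for the slide's polynomial fit), a degree-5 polynomial hits all six training points exactly and then extrapolates wildly:

```python
# Interpolate 6 points with a degree-5 polynomial via Newton's divided
# differences: zero training error, wild behavior off the data -- the
# classic overfitting picture. Data are hypothetical.
def newton_coeffs(xs, ys):
    coeffs = list(ys)
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - j])
    return coeffs

def newton_eval(coeffs, xs, x):
    # Horner's scheme on the Newton form.
    result = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        result = result * (x - xs[i]) + coeffs[i]
    return result

xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.1, 1.9, 3.2, 2.8, 3.9]   # noisy, roughly increasing
c = newton_coeffs(xs, ys)
print([round(newton_eval(c, xs, x), 6) for x in xs])  # reproduces every y exactly
print(round(newton_eval(c, xs, 6), 2))  # ~28.5, far above the data's gentle trend
```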
Overfitting
Overfitting is the result of:
Fitting non-explanatory variables to data
Including too many predictor variables
Failing to regularize the model:
Neural net run too long
Decision tree grown too deep
Avoiding Overfitting
Avoiding overfitting is a balancing act (Occam's Razor)
Fit fewer variables rather than more
Have a reason for including a variable (other than "it is in the database")
Regularize (don't overtrain)
Know your field.
"All models should be as simple as possible, but no simpler than necessary." --- Albert Einstein
Toy Problem
[Figure: four scatterplots of train2$y against individual candidate predictors train2[, i]]
Tree Model
[Figure: full regression tree for the toy problem; first split at x4 < 0.512, further splits on x1-x5 (and one on x8); leaf predictions range from 10.640 to 25.320]
67.2% Test
[Plot: y versus the tree model's predictions on the test set]
Tree Advantages
The model explains its reasoning (it builds rules)
Builds the model quickly
Handles non-numeric data
No problems with missing data:
Missing data treated as a new value
Surrogate splits
Linear Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept     -0.900       0.482    -1.860      0.063
x1             4.658       0.292    15.950     <.0001
x2             4.685       0.294    15.920     <.0001
x3            -0.040       0.291    -0.140      0.892
x4             9.806       0.298    32.940     <.0001
x5             5.361       0.281    19.090     <.0001
x6             0.369       0.284     1.300      0.194
x7             0.001       0.291     0.000      0.998
x8            -0.110       0.295    -0.370      0.714
x9             0.467       0.301     1.550      0.122
x10           -0.200       0.289    -0.710      0.479
69.4% Test
Stepwise Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept     -0.625       0.309    -2.019     0.0439
x1             4.619       0.289    15.998     <.0001
x2             4.665       0.292    15.984     <.0001
x4             9.824       0.296    33.176     <.0001
x5             5.366       0.280    19.145     <.0001
69.8% Test
[Table: t ratios and Prob>|t| for a larger 19-term model; several terms significant at <.0001, many not]
88.8% Test
Next Steps
Higher-order terms? When to stop? Transformations?
Too simple: underfitting, bias
Too complex: inconsistent predictions, overfitting, high variance
Selecting models is Occam's razor
Keep goals of interpretation vs. prediction in mind
Logistic Regression
What happens if we use linear regression on 1-0 (yes/no) data?
[Plot: straight-line fit to a 0/1 response versus Income ($20,000-$80,000)]
Logistic Regression II
Points on the line can be interpreted as probabilities, but they don't stay within [0, 1]. Use a sigmoidal function instead of a linear function to fit the data.
f(I) = 1 / (1 + e^(-I))
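A quick Python sketch of this link function (the slide's data are not available, so only the function itself is shown): whatever real-valued score goes in, a probability in (0, 1) comes out.

```python
# The logistic (sigmoid) link: maps any real score to a value in (0, 1),
# which is what lets the fitted values be read as probabilities.
import math

def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

print(sigmoid(0))             # 0.5: a score of zero means a 50/50 call
print(round(sigmoid(4), 3))   # 0.982: large positive scores saturate toward 1
print(round(sigmoid(-4), 3))  # 0.018: large negative scores saturate toward 0
```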
[Plot: logistic (sigmoidal) fit of the 0/1 response versus Income]
Neural Nets
Don't resemble the brain
Are a statistical model
Closest relative is projection pursuit regression
A Single Neuron
[Diagram: a single neuron with inputs x1-x5 and constant input x0; weighted sum z1 = 0.8 + 0.3 x1 + 0.7 x2 - 0.2 x3 + 0.4 x4 - 0.5 x5; output h(z1)]
Single Node
Input to the output layer from hidden node l:
z_l = Σ_j w1_jl x_j + θ_l
Output:
y_l = h(z_l)
Layered Architecture
[Diagram: input layer (x1, x2), hidden layer (z1, z2, z3), output layer]
Neural Networks
Create lots of features: the hidden nodes
z_l = Σ_j w1_jl x_j + θ_l
y = w2_1 h(z_1) + w2_2 h(z_2) + … + θ
Put It Together
ŷ = h̃( Σ_l w2_l h( Σ_j w1_jl x_j + θ_l ) + θ )
The resulting model is just a flexible nonlinear regression of the response on a set of predictor variables.
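The composed formula above can be sketched directly as a forward pass in Python. All weights and inputs here are hypothetical, the sigmoid plays the role of h, and the output activation h̃ is taken as the identity (the usual choice for regression):

```python
# Forward pass of a one-hidden-layer neural network: each hidden node
# computes h(w1 . x + bias); the output is a weighted sum of the hidden
# activations -- a flexible nonlinear regression. Weights are hypothetical.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    hidden = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
              for w, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2   # identity output layer

x  = [0.5, -1.0, 2.0]
w1 = [[0.3, 0.7, -0.2], [0.4, -0.5, 0.8]]   # one weight vector per hidden node
b1 = [0.1, -0.3]
w2 = [1.5, -2.0]
b2 = 0.25
print(forward(x, w1, b1, w2, b2))
```

Training is just choosing the w's and biases to minimize prediction error; the model class itself is nothing more exotic than this nested function.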
R²
89.5% Train
87.7% Test
Interpretation?
K-Nearest Neighbors (KNN)
To predict y for an x:
Find the k most similar x's
Average their y's
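Those two steps can be sketched in a few lines of Python (Euclidean distance as the similarity measure, hypothetical training data):

```python
# K-nearest-neighbor prediction: find the k training points closest to x,
# then average their responses. Plain Euclidean distance; data hypothetical.
def knn_predict(train_x, train_y, x, k=3):
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(xi, x)), yi)
        for xi, yi in zip(train_x, train_y)
    )
    nearest = [y for _, y in dists[:k]]
    return sum(nearest) / k

train_x = [(1, 1), (1, 2), (5, 5), (6, 5), (9, 9)]
train_y = [10.0, 12.0, 30.0, 32.0, 50.0]
print(knn_predict(train_x, train_y, (1.5, 1.5), k=2))  # 11.0
```

The only modeling decisions are the distance measure and k; there is no training step at all, which is why KNN is the natural baseline for problems like collaborative filtering below.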
Collaborative Filtering
Goal: predict what movies people will like Data: list of movies each person has watched
Lyle:  André, Star Wars
Ellen: André, Star Wars, Destin d'Amélie
Fred:  Star Wars, Batman
Dean:  Star Wars, Batman, Rambo
Jason: Destin d'Amélie Poulain, Caché
Data Base
Data can be represented as a sparse matrix
         Star Wars   Rambo   Batman   My Dinner w/ André   Destin d'Amélie   Caché
Lyle         y                              y
Ellen        y                              y                     y
Fred         y                 y
Dean         y         y       y
Jason                                                             y             y
Karen        ?         ?       ?            y                     ?             ?
Karen likes André. What else might she like?
CDNow doubled e-mail responses
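One crude way to answer the Karen question is user overlap: pool the watch lists of everyone who shares a movie with her and rank the titles she hasn't seen. A Python sketch using the slide's watch lists (titles ASCII-spelled for convenience; real recommenders are far more elaborate):

```python
# A minimal user-overlap recommender: for a person, collect the watch lists
# of everyone who shares at least one movie with them, and rank unseen
# titles by how often they appear among those neighbors.
from collections import Counter

watched = {
    "Lyle":  {"Andre", "Star Wars"},
    "Ellen": {"Andre", "Star Wars", "Amelie"},
    "Fred":  {"Star Wars", "Batman"},
    "Dean":  {"Star Wars", "Batman", "Rambo"},
    "Jason": {"Amelie", "Cache"},
    "Karen": {"Andre"},
}

def recommend(person):
    mine = watched[person]
    counts = Counter()
    for other, movies in watched.items():
        if other != person and movies & mine:   # shares at least one movie
            counts.update(movies - mine)        # vote for what I haven't seen
    return [m for m, _ in counts.most_common()]

print(recommend("Karen"))  # Star Wars first: both Andre-watchers also saw it
```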
Newer equipment is prohibitively expensive for the developing world
Early detection of breast cancer is crucial
Cumulative Type I error over a decade is near 100%, leading to needless biopsies
The Data
1618 mammograms showing clustered microcalcifications
Biostatistics Dept Institut Curie
Variables
Response: malignant or not
Predictors: Age; Tissue type (light/dense); Size (mm); Number of microcalcifications; Number of suspicious clusters; Shape of microcalcifications (1-5); Polyshape? (y/n); Shape of cluster (1, 2, 3); Retro (cluster near nipple?); Deep? (y/n)
Tree model
Combining Models
In the 1950s, forecasters found that combining forecasting models worked better on average than any single forecast model
Reduces variance by averaging
Can reduce bias if the collection is broader than any single model
Boosting
Create repeated samples of weighted data
Weights based on misclassification
Combine by majority rule, or by a linear combination of predictions
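The heart of the scheme is the reweighting step. A single AdaBoost-style round sketched in Python (the correctness flags and error rate are hypothetical) shows how misclassified points gain weight so the next model concentrates on them:

```python
# One round of boosting-style reweighting: points the current model got
# wrong are up-weighted, points it got right are down-weighted, and the
# weights are renormalized (AdaBoost-style update; inputs hypothetical).
import math

def reweight(weights, correct, error_rate):
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)   # model's vote strength
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]   # the last point was misclassified
print(reweight(weights, correct, error_rate=0.25))
```

A tidy property of this update: after renormalizing, the misclassified points carry exactly half the total weight, so the next round's model cannot succeed by repeating the last one's mistakes.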
Results
Split data into train and test sets (62.5% / 37.5%)
Repeat random splits 1000 times
For each iteration, count false positives and false negatives on the 600 test-set cases
                 False Positives   False Negatives
Simple Tree           32.20%           33.70%
Neural Network        25.50%           31.70%
Boosted Trees         24.90%           32.50%
Bagged Trees          19.30%           28.80%
Radiologists          22.40%           35.80%
Where to Start
Three rules of data analysis
Draw a picture Draw a picture Draw a picture
Build model
Compare model on small subset with full predictive model
More Realistic
250 predictors
200 continuous
50 categorical
[Figure: tree for the 250-predictor problem; splits on x1, x2, x4, x5; leaf values include 2.000 and 4.570]
Brushing
Fast Fail
Not every modeling effort is a success
A model search can save lots of queries
Data took 8 months to get ready
The analyst spent 2 months exploring it
Tree models, stepwise regression (and a neural network run for several hours) found no out-of-sample predictive ability
A neural network shows that ZIP code is the most important predictor
Zip Code?
PVA data
Useful predictor increased sales 40%
Depression Study
Identified critical intervention point at 2 weeks
Ingots
Gave clues as to where to look
Experimental design followed
Churches
When to quit
Printers
When to experiment; which factors
KDNuggets
http://www.kdnuggets.com
deveaux@williams.edu
M. Berry and G. Linoff, Data Mining Techniques, Wiley, 1997
Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999
Hand, D.J., Mannila, H. and Smyth, P., Principles of Data Mining, MIT Press, 2001
Tan, P.-N., Steinbach, M. and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006
Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning, 2nd edition, Springer