Sie sind auf Seite 1von 111

Successful Data Mining in Practice

Richard D. De Veaux Williams College May 19, 2009 deveaux@williams.edu

Where is Massachusetts?

JMP Visual Data Mining

Where in Massachusetts?

JMP Visual Data Mining

Williams College

JMP Visual Data Mining

Williams College

JMP Visual Data Mining

Reason for Data Mining

Data = $$

JMP Visual Data Mining

Data Mining Is
the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. --- Fayyad finding interesting structure (patterns, statistical models, relationships) in data bases.--- Fayyad, Chaduri and Bradley a knowledge discovery process of extracting previously unknown, actionable information from very large data bases--- Zornes a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. ---Edelstein
7
JMP Visual Data Mining

What is Data Mining?

JMP Visual Data Mining

Data Mining Models a partial list


Traditional statistical models
Linear regression, logistic regression, splines, smoothers etc. Vendors are adding these to DM software

Clustering and Visualization Neural networks Decision trees Nave Bayes K Nearest Neighbor Methods K-means Combining Models Bagging and Boosting
9
JMP Visual Data Mining

What makes Data Mining Different?


Massive amounts of data
Number of rows (cases) Number of columns (variables)

Low signal to noise


Many irrelevant variables Subtle relationships Variation

UPS
16TB U.S. library of congress Mostly tracking

Facebook
200-300 TB of photos every month

Google
1 PB every 72 minutes

10

JMP Visual Data Mining

Why Is Data Mining Taking Off Now?


Because we can
Computer power The price of digital storage is near zero

Data warehouses already built


Companies want return on data investment

11

JMP Visual Data Mining

Users are Also Different


Users
Domain experts, not statisticians Have too much data Want automatic methods Want useful information without spending all their time doing statistical analysis

12

JMP Visual Data Mining

Data Mining Myths


Find answers to unasked questions Continuously monitor your data base for interesting patterns Eliminate the need to understand your business Eliminate the need to collect good data Eliminate the need to have good data analysis skills
13

JMP Visual Data Mining

Examples of Data Mining Applications Customer Relationship Management


Transactional Data
Customer retention Upselling opportunities Customer optimization across different areas

Marketing Experiments
Often, new hypotheses are generated by data mining a planned experiment. Segmentation
14
JMP Visual Data Mining

Manufacturing Applications
Product reliability and quality control Process control
What can I do to improve batch yields?

Warranty analysis
Product problems Service assessment Adverse experiences link to production
15
JMP Visual Data Mining

Medical Applications
Medical procedure effectiveness
Who are good candidates for surgery?

Physician effectiveness
Which tests are ineffective?

Which physicians are likely to overprescribe treatments?


What combinations of tests are most effective?

16

JMP Visual Data Mining

E-commerce
Automatic web page design Recommendations for new purchases Cross selling Social Network Marketing

17

JMP Visual Data Mining

Pharmaceutical Applications
High throughput screening
Predict actions in assays Predict results in animals or humans

Rational drug design


Relating chemical structure with chemical properties Inverse regression to predict chemical properties from desired structure

DNA snips Genomics


Associate genes with diseases Find relationships between genotype and drug response (e.g., dosage requirements, adverse effects) Find individuals most susceptible to placebo effect

18

JMP Visual Data Mining

Pharmaceutical Applications
Combine clinical trial results with extensive medical/demographic information Non traditional uses of clinical trial data warehouse to explore:
Prediction of adverse experiences combining more than one trial Who is likely to be non-compliant or drop out? What are alternative (I.E., Non-approved) uses supported by the data?
19
JMP Visual Data Mining

Fraud and Terrorist Detection


Identify false:
Medical insurance claims Accident insurance claims

Which stock trades are based on insider information? Whose cell phone numbers have been stolen? Which credit card transactions are from stolen cards? Which documents are interesting When are changes in networks signs of potential illegal activity?
20
JMP Visual Data Mining

Lesson 1: Learn to Make Friends


PVA is a philanthropic organization, Sanctioned by the US Govt to represent the disabled veterans They send out 4 million free gifts , every 6 weeks And hope for donations Data were used for the KDD 1998 cup
200,000 donors
(100,000 training, 100,000 test)

481 demographic variables


Past giving, income, age etc etc etc

Recent campaign (only for training set)


Did they give? (Target B) How much did they give (Target D)

To optimize profit, who should receive the current solicitation? What is the most cost effective strategy?

21

JMP Visual Data Mining

Whats Hard? --Example

22

JMP Visual Data Mining

T-Code

23

JMP Visual Data Mining

More Tcode

24

JMP Visual Data Mining

Transformation?

25

JMP Visual Data Mining

Categories?

26

JMP Visual Data Mining

What does it mean?


T -C o d e T itle
0 _ 1 M R. 1 0 0 1 M ES S RS . 1 0 0 2 M R. & M RS . 2 M RS . 2 0 0 2 M ES DAM ES 3 M IS S 3 0 0 3 M IS S ES 4 DR. 4 0 0 2 DR. & M RS . 4 0 0 4 DO CTO RS 5 M ADAM E 6 S ERGEANT 9 RABBI 1 0 P RO FES S O R 1 0 0 0 2 P RO FES S O R & M RS . 1 0 0 1 0 P RO FES S O RS 1 1 ADM IRAL 1 1 0 0 2 ADM IRAL & M RS . 1 2 GENERAL 1 2 0 0 2 GENERAL & M RS . 1 3 CO LO NEL 1 3 0 0 2 CO LO NEL & M RS . 1 4 CAP TAIN 1 4 0 0 2 CAP TAIN & M RS . 1 5 CO M M ANDER 1 5 0 0 2 CO M M ANDER & M RS . 1 6 DEAN 1 7 J UDGE 1 7 0 0 2 J UDGE & M RS . 1 8 M AJ O R 1 8 0 0 2 M AJ O R & M RS . 1 9 S ENATO R 2 0 GO V ERNO R 2 1 0 0 2 S ERGEANT & M RS . 2 2 0 0 2 CO LNEL & M RS . 2 4 LIEUTENANT 2 6 M O NS IGNO R 2 7 REV EREND 28 MS. 28028 MSS. 2 9 BIS HO P 3 1 AM BAS S ADO R 3 1 0 0 2 AM BAS S ADO R & M RS 3 3 CANTO R 3 6 BRO THER 3 7 S IR 3 8 CO M M O DO RE 4 0 FATHER 4 2 S IS TER 4 3 P RES IDENT 4 4 M AS TER 4 6 M O THER 4 7 CHAP LAIN 4 8 CO RP O RAL 5 0 ELDER 5 6 M AYO R 5 9 0 0 2 LIEUTENANT & M RS . 6 2 LO RD 6 3 CARDINAL 6 4 FRIEND 6 5 FRIENDS 6 8 ARCHDEACO N 6 9 CANO N 7 0 BIS HO P 7 2 0 0 2 REV EREND & M RS . 7 3 P AS TO R 7 5 ARCHBIS HO P 8 5 S P ECIALIS T 8 7 P RIV ATE 8 9 S EAM AN 9 0 AIRM AN 9 1 J US TICE 9 2 M R. J US TICE 100 M. 1 0 3 M LLE. 1 0 4 CHANCELLO R 1 0 6 REP RES ENTATIV E 1 0 7 S ECRETARY 1 0 8 LT. GO V ERNO R 1 0 9 LIC. 1 1 1 S A. 1 1 4 DA. 1 1 6 S R. 1 1 7 S RA. 1 1 8 S RTA. 1 2 0 YO UR M AJ ES TY 1 2 2 HIS HIGHNES S 1 2 3 HER HIGHNES S 1 2 4 CO UNT 1 2 5 LADY 1 2 6 P RINCE 1 2 7 P RINCES S 1 2 8 CHIEF 1 2 9 BARO N 1 3 0 S HEIK 1 3 1 P RINCE AND P RINCES S 1 3 2 YO UR IM P ERIAL M AJ ES T 1 3 5 M . ET M M E. 2 1 0 P RO F.

27

JMP Visual Data Mining

Relational Data Bases


Data are stored in tables
Items ItemID C56621 T35691 RS5292 ItemName top hat cane red shoes price 34.95 4.99 22.95

Shoppers Person ID 135366 135366 259835 person name Lyle Lyle Dick ZIPCODE 19103 19103 01267 item bought T35691 C56621 RS5292

28

JMP Visual Data Mining

Metadata
The data survey describes the data set contents and characteristics
Table name Description Primary key/foreign key relationships Collection information: how, where, conditions Timeframe: daily, weekly, monthly Cosynchronus: every Monday or Tuesday

29

JMP Visual Data Mining

Data Preparation
Build data mining database Explore data Prepare data for modeling

60% to 95% of the time is spent 60% to 95% of the time is spent preparing the data preparing the data

30

JMP Visual Data Mining

Data Challenges
Data definitions
Types of variables

Data consolidation
Combine data from different sources NASA mars lander

Data heterogeneity
Homonyms Synonyms

Data quality
31
JMP Visual Data Mining

Data Quality

32

JMP Visual Data Mining

Missing Values
Random missing values
Delete row?
Paralyzed Veterans

Substitute value
Imputation Multiple Imputation JMP 8 (!)

Systematic missing data


Now what?

33

JMP Visual Data Mining

Missing Values -- Systematic


Credit Card Bank finds that Income field is missing Wharton Ph.D. Student questionnaire on survey attitudes Bowdoin college applicants have mean SAT verbal score above 750 Clinical Trial of Depression Medication what does missing mean?
34
JMP Visual Data Mining

Results for PVA Data Set


If entire list (100,000 donors) are mailed, net donation is $10,500

Using data mining techniques, this was increased 41.37%

35

JMP Visual Data Mining

KDD CUP 98 Results

36

JMP Visual Data Mining

KDD CUP 98 Results 2

37

JMP Visual Data Mining

Students in Data Mining Class

Student #1 $15,024 Student #2 $14,695 Student #3 $14,345

38

JMP Visual Data Mining

Data Mining and OLAP


On-line analytical processing (OLAP): users deductively analyze data to verify hypothesis
Descriptive, not predictive

Data mining: software uses data to inductively find patterns models!


Predictive or descriptive

Associations?
Most associated variables in the census Most associated variables in a supermarket Assocation Rules

39

JMP Visual Data Mining

Why Models?
Beer and Diapers
In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated Conclusions? Actions?

40

JMP Visual Data Mining

Beer and Diapers

Picture from TandemTM ad


41
JMP Visual Data Mining

Models
Models are:
Powerful summaries for understanding Used for exploration and prediction

Of course, models are not reality George Box


All models are wrong, but some are useful Statisticians, like artists, have the bad habit of falling in love with their models.

42

JMP Visual Data Mining

Twymans Law and Corollaries


If it looks interesting, it must be wrong

De Veauxs Corollary 1 to Twymans Law


If its perfect, its wrong

De Veauxs Corollary 2 to Twymans Law


If it isnt wrong, you probably knew it already

43

JMP Visual Data Mining

Lesson 2 An Example of Twymans Law


Ingot cracking
3935 30,000 lb. Ingots Up to 25% cracking rate $30,000 per recast 90 potential explanatory variables
Water composition (reduced) Metal composition Process variables Other environmental variables

44

JMP Visual Data Mining

Data Processing
Five months to consolidate process data Three months to analyze and reduce dimension of water data Eight months after starting projects, statisticians received flat file:
960 ingots (rows) 149 variables

45

JMP Visual Data Mining

Decision Trees Mortgage Defaults


Household Income > $40000 No Yes Debt > $10000 No Yes

On Job > 5 Yr No Yes

0.11

0.06

0.01

0.05

46

JMP Visual Data Mining

Decision Tree -- Titanic


M F
|

Adult

Child

1,2,C

2 or 3

1 or Crew

1 or 2 46% 93%

Crew 14%

1st 27% 100%

23%

33%

47

JMP Visual Data Mining

Cook County Hospital -- ER


The ECG Tests:

The 3 Urgent Risk Factors:


1. 2. 3. Is the reported Pain unstable angina? Is there fluid in patients lungs? Is the patients systolic BP < 100?

MI: myocardial infarction (heart attack) Ischemia Heart muscle not getting enough blood

48

JMP Visual Data Mining

Confusion Matrix
Doctors in ER Actual Heart Attack No Heart Attack Tree Algorithm (Goldman) Actual Heart Attack No Heart Attack

Predict Heart Attack

0.89

0.75

Predict Heart Attack

0.96

0.08

Predict No Heart Attack

0.11

0.25

Predict No Heart Attack

0.04

0.92

49

JMP Visual Data Mining

Regression Tree
Price<9446.5 |

Weight<2280

Disp.<134

Weight<3637.5 34.00 30.17 26.22 Price<11522 18.60 Reliability:abde 24.17 HP<154 22.57

21.67

20.40

50

JMP Visual Data Mining

Ingots First Tree

All Rows
Count Mean Std Dev 3935 0.2356 0.4244

Alloy (6045,7348,8234,2345,3234)
Count Mean Std Dev 3005 0.159 0.3661

Alloy (5434,5894,2439)
Count Mean Std Dev 930 0.4817 0.4999

We know that some alloys are hard to make. Thats why we gave you the data in the first place.
51

JMP Visual Data Mining

Second Tree

All Rows
Count Mean Std Dev 3935

0.2356 0.4244

MG<3.9
Count Mean Std Dev 3055 0.1673 0.3723

MG>=3.9
Count Mean Std Dev 880 0.4727 0.4999

What do you think is in those alloys?

52

JMP Visual Data Mining

One More Time


Looks like Chrome OH! Did that solve it? No, but
Experimental design Enabled us to focus on important variables
Oh, thats funny! -Issac Asimov

53

JMP Visual Data Mining

What did we learn?


Data mining gave clues for generating hypotheses Followed up with DOE DOE led to substantial process improvement

54

JMP Visual Data Mining

Herbs Tree Twymans Law again

All Rows
Count G^2 Level 94649 37928.436 0 1 Prob 0.9494 0.0506

TARGET_D>=1
Count 4792 G^2 Level 00 1 Prob 0.0000 1.0000

TARGET_D<1
Count 89857 G^2 Level 00 1 Prob 1.0000 0.0000

55

JMP Visual Data Mining

Doing it Right Knowledge Discovery


Define business problem Build data mining database Explore data Prepare data for modeling Build model Evaluate model Deploy model and results Note: This process model borrows from CRISP-DM: CRoss Industry Standard Process for Data Mining
56
JMP Visual Data Mining

Successful Data Mining


The keys to success:
Formulating the problem Using the right data Flexibility in modeling Acting on results

Success depends more on the way you mine the data rather than the specific tool

57

JMP Visual Data Mining

Types of Models
Descriptions Classification (categorical or discrete values) Regression (continuous values)
Time series (continuous values)

Clustering Association

58

JMP Visual Data Mining

Model Building
Model building
Train Test

Evaluate

59

JMP Visual Data Mining

Overfitting in Regression
Classical overfitting:
Fit 6th order polynomial to 6 data points
3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 -1 0 1 2 3 4 5 6

60

JMP Visual Data Mining

Overfitting
Fitting non-explanatory variables to data Overfitting is the result of
Including too many predictor variables Lack of regularizing the model
Neural net run too long Decision tree too deep

61

JMP Visual Data Mining

Avoiding Overfitting
Avoiding overfitting is a balancing act Occams Razor
Fit fewer variables rather than more Have a reason for including a variable (other than it is in the database) Regularize (dont overtrain) Know your field.

All models should be as simple as possible All models should be as simple as possible but no simpler than necessary but no simpler than necessary Albert Einstein Albert Einstein
62
JMP Visual Data Mining

Toy Problem
25 20 train2$y train2$y 0.0 0.2 0.4 train2[, i] 0.6 0.8 1.0 15 10 5 5 10 15 20 25 0.0 0.2 0.4 train2[, i] 0.6 0.8 1.0

25

20

train2$y

train2$y 0.0 0.2 0.4 train2[, i] 0.6 0.8 1.0

15

10

10

15

20

25

0.0

0.2

0.4 train2[, i]

0.6

0.8

1.0

25

20

train2$y

train2$y 0.0 0.2 0.4 train2[, i] 0.6 0.8 1.0

15

10

10

15

20

25

0.0

0.2

0.4 train2[, i]

0.6

0.8

1.0

25

20

train2$y

train2$y 0.0 0.2 0.4 train2[, i] 0.6 0.8 1.0

15

10

10

15

20

25

0.0

0.2

0.4 train2[, i]

0.6

0.8

1.0

25

20

train2$y

train2$y 0.0 0.2 0.4 train2[, i] 0.6 0.8 1.0

15

10

10

15

20

25

0.0

0.2

0.4 train2[, i]

0.6

0.8

1.0

63

JMP Visual Data Mining

Tree Model
x4<0.512146

x1<0.359395

x1<0.209569

x4<0.140557

x2<0.299431

x5<0.260297

x2<0.27271

x3<0.215425

x5<0.232708

x2<0.129879

x5<0.412206 10.640

x3<0.885533

x5<0.621811

x5<0.588094

x3<0.490631 9.785 7.602

x2<0.54068 x4<0.336583 x4<0.148234

x4<0.223909

x4<0.264999

x4<0.768584

x2<0.20279 x3<0.248065

x4<0.916189

x2<0.414822

16.830

x4<0.283724 4.400 7.074

x3<0.177433 x3<0.114976 x3<0.0789249 x3<0.0777104 12.700 14.380 12.250 18.160 15.830 15.150

x1<0.328133 21.240 18.760

x3<0.821878

12.040 6.688 9.865 12.060 8.887

x3<0.784959 7.994 9.956 15.020

x3<0.728124 21.260 17.770 15.190

x8<0.933915

x4<0.941058 25.320

19.060 17.560 14.340

x4<0.700738 10.780 14.060 14.030 16.860 14.200 21.100 24.280

17.690 19.470

R squared 82.3% Train


64
JMP Visual Data Mining

67.2% Test

Predictions for Example


20

y
10 0 10 20

y Predictor

R squared 82.3% Train


65
JMP Visual Data Mining

67.2% Test

Tree Advantages
Model explains its reasoning -- builds rules Build model quickly Handles non-numeric data No problems with missing data
Missing data as a new value Surrogate splits

Works fine with many dimensions


66
JMP Visual Data Mining

Whats Wrong With Trees?


Output are step functions big errors near boundaries Greedy algorithms for splitting small changes change model Uses less data after every split Model has high order interactions -- all splits are dependent on previous splits Often non-interpretable
67
JMP Visual Data Mining

Linear Regression
Term Estimate Std Error t Ratio Prob>|t| -0.900 0.482 -1.860 0.063 Intercept 4.658 0.292 15.950 <.0001 x1 4.685 0.294 15.920 <.0001 x2 -0.040 0.291 -0.140 0.892 x3 9.806 0.298 32.940 <.0001 x4 5.361 0.281 19.090 <.0001 x5 0.369 0.284 1.300 0.194 x6 0.001 0.291 0.000 0.998 x7 -0.110 0.295 -0.370 0.714 x8 0.467 0.301 1.550 0.122 x9 -0.200 0.289 -0.710 0.479 x10

R-squared: 73.5% Train


68
JMP Visual Data Mining

69.4% Test

Stepwise Regression
Term Intercept x1 x2 x4 x5 Estimate Std Error t Ratio
-0.625 4.619 4.665 9.824 5.366 0.309 0.289 0.292 0.296 0.28 -2.019 15.998 15.984 33.176 19.145

Prob>|t|
0.0439 <.0001 <.0001 <.0001 <.0001

R-squared 73.4% on Train

69.8% Test

69

JMP Visual Data Mining

Stepwise 2ND Order Model


Term
Intercept x1 x2 x3 x4 x5 x8 x9 (x1-0.51811)*(x1-0.51811) (x2-0.48354)*(x1-0.51811) (x3-0.48517)*(x1-0.51811) (x3-0.48517)*(x2-0.48354) (x3-0.48517)*(x3-0.48517) (x4-0.49647)*(x1-0.51811) (x4-0.49647)*(x2-0.48354) (x5-0.50509)*(x2-0.48354) (x5-0.50509)*(x3-0.48517) (x5-0.50509)*(x4-0.49647) (x8-0.52029)*(x5-0.50509)

Estimate Std Error


-2.026 4.311 4.808 -0.506 10 5.212 -0.181 0.427 -0.932 8.972 -1.367 -0.8 20.515 1.014 -1.159 -0.794 1.105 0.127 1.065 0.264 0.184 0.185 0.181 0.186 0.176 0.186 0.188 0.711 0.634 0.65 0.639 0.69 0.651 0.65 0.62 0.619 0.635 0.63

t Ratio
-7.68 23.47 26.04 -2.79 53.79 29.67 -0.97 2.28 -1.31 14.14 -2.1 -1.25 29.71 1.56 -1.78 -1.28 1.78 0.2 1.69

Prob>|t|
<.0001 <.0001 <.0001 0.0054 <.0001 <.0001 0.3301 0.0232 0.1905 <.0001 0.0358 0.2111 <.0001 0.1197 0.075 0.2008 0.0748 0.8414 0.0914

R-squared 89.9% Train


70
JMP Visual Data Mining

88.8% Test

Next Steps
Higher order terms? When to stop? Transformations? Too simple: underfitting bias Too complex: inconsistent predictions, overfitting high variance Selecting models is Occams razor
Keep goals of interpretation vs. prediction in mind
71
JMP Visual Data Mining

Logistic Regression
What happens if we use linear regression on 1-0 (yes/no) data?
1.0 0.0 0.2 0.4 0.6 0.8

20000

40000 Income

60000

80000

72

JMP Visual Data Mining

Logistic Regression II
Points on the line can be interpreted as probability, but dont stay within [0,1] Use a sigmoidal function instead of linear function to fit the data

1 f (I ) = I 1+ e

1.2 1 0.8 0.6 0.4 0.2 0

-1 0

73

JMP Visual Data Mining

10

-6

-2

Logistic Regression III


1.0 Accept 0.0 0.2 0.4 0.6 0.8

20000

40000 Income

60000

80000

74

JMP Visual Data Mining

Neural Nets
Dont resemble the brain
Are a statistical model Closest relative is projection pursuit regression

75

JMP Visual Data Mining

A Single Neuron
x1 x2 x3 x4 x5 x0
76

0.3 0.7 -0.2 h(z1) 0.4 -0.5 0.8 z1 = 0.8 + .3x1 + .7x2 - .2x3 + .4x4 - .5x5 Input (z1) Output

JMP Visual Data Mining

Single Node
Input to outer layer from hidden node:

I = zl = w1 jk x j + l
j

Output:

yk = h( z kl )

77

JMP Visual Data Mining

Layered Architecture
z1 x1 z2 x2 z3 Output layer Input layer Hidden layer
78

JMP Visual Data Mining

Neural Networks
Create lots of features hidden nodes

zl = w1 jk x j + l
j

Use them in an additive model:

yk = w21 h( z1 ) + w22 h( z 2 ) + + j

79

JMP Visual Data Mining

Put It Together
~ yk = h ( w2 kl h( w1 jk x j + l ) + j )
l j

The resulting model is just a flexible nonlinear regression of the response on a set of predictor variables.

80

JMP Visual Data Mining

Predictions for Example

R2
81
JMP Visual Data Mining

89.5% Train

87.7% Test

What Does This Get Us?


Enormous flexibility Ability to fit anything
Including noise

Interpretation?

82

JMP Visual Data Mining

Neural Net Pro


Advantages
Handles continuous or discrete values Complex interactions In general, highly accurate for fitting due to flexibility of model Can incorporate known relationships
So called grey box models See De Veaux et al, Environmetrics 1999

83

JMP Visual Data Mining

Neural Net Con


Disadvantages
Model is not descriptive (black box) Difficult, complex architectures Slow model building Categorical data explosion Sensitive to input variable selection

84

JMP Visual Data Mining

K-Nearest Neighbors(KNN)
To predict y for an x:
Find the k most similar x's Average their y's

Find k by cross validation No training (estimation) required Works embarrassingly well


Friedman, KDDM 1996

85

JMP Visual Data Mining

Collaborative Filtering
Goal: predict what movies people will like Data: list of movies each person has watched
Lyle Ellen Fred Dean Jason
86

Andr, Starwars Andr, Starwars, Destin Starwars, Batman Starwars, Batman, Rambo Destin dAmlie Poulin, Cach

JMP Visual Data Mining

Data Base
Data can be represented as a sparse matrix
Starwars Rambo Lyle Ellen Fred Dean Jason Karen y y y y y ? Batman My Dinner w/Andr Destin D'Amilie y y y y y y ? ? ? ? y ? Cach

Karen likes Andr. What else might she like? CDNow doubled e-mail responses
87
JMP Visual Data Mining

Lesson 3: Know When to Hold em


Breast cancer data from mammograms
Error rates by trained radiologists are near 25% for both false positives and false negatives

Newer equipment is prohibitively expensive for the developing world Early detection of breast cancer is crucial Cumulative type I error over a decade is near 100% leading to needless biopsies
88
JMP Visual Data Mining

The Data
1618 mammograms showing clustered microcalcifications
Biostatistics Dept Institut Curie

Variables
Response: Malignant or not Predictors: Age, Tissue Type (light/dense) Size (mm), Number of microcalc, Number of suspicious clusters, Shape of microcalc (1-5), Polyshape?(y/n), Shape of cluster (1,2,3), Retro (cluster near nipple?), Deep? (y/n)

89

JMP Visual Data Mining

Tree model

90

JMP Visual Data Mining

Combining Models
In 1950s forecasters found that combining forecasting models worked better on average than any single forecast model
Reduces variance by averaging Can reduce bias if collection is broader than single model

91

JMP Visual Data Mining

Bagging and Boosting


Bagging (Bootstrap Aggregation)
Bootstrap a data set repeatedly Take many versions of same model (e.g. tree)
Random Forest Variation

Form a committee of models Take majority rule of predictions

Boosting
Create repeated samples of weighted data Weights based on misclassification Combine by majority rule, or linear combination of predictions
92
JMP Visual Data Mining

Results
Split data into train and test (62.5% 37.5%) Repeat random splits 1000 times
For each iteration, count false positives and false negatives on the 600 test set cases
Simple Tre e Ne ura l Ne tw ork Booste d Tre e s Ba gge d Tre e s Ra diologists Fa lse Positive s Fa lse Ne ga tive s 32.20% 33.70% 25.50% 31.70% 24.90% 32.50% 19.30% 28.80% 22.40% 35.80%

93

JMP Visual Data Mining

How Do We Really Start?


Life is not so kind
Categorical variables Missing data 500 variables, not 10

481 variables where to start?

94

JMP Visual Data Mining

Where to Start
Three rules of data analysis
Draw a picture Draw a picture Draw a picture

Ok, but how?


There are 90 histogram/bar charts and 4005 scatterplots to look at (or at least 90 if you look only at y vs. X)

95

JMP Visual Data Mining

Exploratory Data Models


Use a tree to find a smaller subset of variables to investigate Explore this set graphically
Start the modeling process over

Build model
Compare model on small subset with full predictive model

96

JMP Visual Data Mining

More Realistic
250 predictors
200 Continuous 50 Categorical

10,000 rows Why is this still easy?


No missing values Relatively high signal/noise

97

JMP Visual Data Mining

Start With a Simple Model


Tree?
x2<0.288579 x4<0.477873 |

x1<0.297806

x5<0.465905

x1<0.333728

x5<0.529173

x2<0.343653

x1<0.152683 -2.560 -0.265

x5<0.466843

x2<0.125849 2.540 5.120

x4<0.752766

x4<0.208211 -1.890 1.150 5.820

x5<0.644585 x5<0.49235 2.910 6.050

2.000 4.570

7.500 10.100 9.880 12.200

98

JMP Visual Data Mining

Brushing

99

JMP Visual Data Mining

Lesson 4: Know when to Fold em


Liability for churches
Some Predictors
Net Premium Value Property Value Coastal (yes/no) Inner100 (a.k.a., highly-urban) (yes/no) High property value Neighborhood (yes/no) Indicator Class
1 (Church/House of worship) 2 (Sexual Misconduct Church) 3 (Addl Sex. Misc. Covg Purchased) 4 (Not-for-profit daycare centers) 5 (Dwellings One family (Lessors risk)) 6 (Bldg or Premises Office Not for profit) 7 (Corporal Punishment each faculty member) 8 (Vacant land- not for profit) 9 (Private, not for profit, elementary, Kindergarten and Jr. High Schools) 10 (Stores no food or drink not for profit) 11 (Bldg or Premises Maintained by insured (lessors risk) not for profit) 12 (Sexual misconduct diocese)

100

JMP Visual Data Mining

Fast Fail
Not every modeling effort is a success
A model search can save lots of queries

Data took 8 months to get ready Analyst spent 2 months exploring it Tree models, stepwise regression (and a neural network running for several hours) found no out of sample predictive ability

101

JMP Visual Data Mining

Lesson 5: Machines are Smart You are Smarter


Why do statisticians like interpretability? Black boxes are not interpretable, but there may be important information

102

JMP Visual Data Mining

Case Study Warranty Data


A new backpack inkjet printer is showing higher than expected warranty claims
What are the important variables? Whats going on?

A neural networks shows that Zip code is the most important predictor
103
JMP Visual Data Mining

Zip Code?

104

JMP Visual Data Mining

Data Mining DOE Synergy


Data Mining is exploratory Efforts can go on simultaneously Learning cycle oscillates naturally between the two

105

JMP Visual Data Mining

What Did We Learn?


Toy problem
Functional form of model

PVA data
Useful predictor increased sales 40%

Depression Study
Identified critical intervention point at 2 weeks

Ingots
Gave clues as to where to look Experimental design followed

Churches
When to quit

Printers
When to experiment what factors
106
JMP Visual Data Mining

Challenges for data mining


Not algorithms Overfitting Finding an interpretable model that fits reasonably well

107

JMP Visual Data Mining

Recap Success in Data Mining


Problem formulation Data preparation
Data definitions Data cleaning Feature creation, transformations

EDM exploratory modeling


Reduce dimensions

108

JMP Visual Data Mining

Success in Data Mining II


Dont forget Graphics Second phase modeling Testing, validation, implementation Constant re-evaluation of models

109

JMP Visual Data Mining

Which Method(s) to Use?


No method is best Which methods work best when? Which method to use?
YES!

110

JMP Visual Data Mining

For More Information


Two Crows
http//www.twocrows.com

KDNuggets
http://www.kdnuggets.com

deveaux@williams.edu
M. Berry and G. Linoff, Data Mining Techniques, Wiley, 1997 Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999 Hand, D.J., Mannila, H. and Smyth, P., Principles of Data Mining, MIT Press 2001 Tan, P.N, Steinbach, and Kumar: Introduction to Data Mining, Addison-Wesley, 2006 Hastie, Tibshirani and Friedman, Statistical Learning 2nd edition, Springer
111
JMP Visual Data Mining

Das könnte Ihnen auch gefallen