
Data Preparation

Dr. Saed Sayad


University of Toronto
2010
saed.sayad@utoronto.ca
1 http://chem-eng.utoronto.ca/~datamining/
Data Mining Steps
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment
1. Problem Definition
Understanding the project objectives and
requirements from a business perspective,
and then converting this knowledge into a
data mining problem definition with a
preliminary plan designed to achieve the
objectives.
Source: http://www.crisp-dm.org/Process/index.htm
2. Data Preparation
The data preparation step covers all
activities to construct the final dataset
for modeling from the raw data. Tasks
include database, table, record, and field
selection as well as cleaning, aggregation
and transformation of data.
Data Preparation
[Diagram: data sources labeled Data, Text, and DSN pass through ETL to produce the Modeling Data.]
Data Sources
            Text Files    Relational Database             Multi-dimensional Database
Entities    File          Table                           Cube
Attributes  Row and Col   Record, Field, Index            Dimension, Level, Measurement
Methods     Read, Write   Select, Insert, Update, Delete  Drill down, Drill up, Drill through
Language    -             SQL                             MDX
Data Types
[Diagram: data types. Numerical data covers measurement (ratio, interval) and counting; categorical data covers ordinal and nominal.]
Denormalization
One Row per Subject
Transformation
[Diagram: the Customer table maps 1 to 1 to a transformed Customer table, and the Transaction table maps 1 to 1 to a transformed Transaction table. Customer relates to Transaction 1 to N, both before and after transformation.]
Copy and Aggregate
[Diagram: Customer fields are copied to the Transaction level (Copy); Transaction fields are summarized to the Customer level (Aggregate).]
Data Preparation - Aggregation
Aggregation functions by variable type:
Categorical  Count, Count%
Numeric      Count, Sum, Mean, Std, Min, Max
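The aggregation functions listed above can be sketched in plain Python. This is only an illustration, not the author's implementation: the function names and the sample values are hypothetical, and Std is assumed here to mean the sample standard deviation.

```python
from collections import Counter
from statistics import mean, stdev


def aggregate_categorical(values):
    """Return {category: (count, percent_of_total)} for a categorical list."""
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


def aggregate_numeric(values):
    """Return Count, Sum, Mean, Std, Min and Max for a non-empty numeric list."""
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": mean(values),
        # Assumption: sample standard deviation; it is undefined for one value.
        "std": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


# Hypothetical sample values, for illustration only.
married = ["N", "Y", "Y"]
amounts = [100, 85, 24]

cat_summary = aggregate_categorical(married)
num_summary = aggregate_numeric(amounts)
```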
One to Many Relationship
Customers (1)
Customer ID  Age  Married
1            25   N
2            38   Y
3            46   Y

Transactions (N)
Transaction ID  Customer ID  Purchased Amount
1               1            250
2               1            125
3               2            100
4               2            85
5               2            24
6               3            400
Data Preparation - Copy
Transaction ID  Customer ID  Purchased Amount  Age  Married
1               1            250               25   N
2               1            125               25   N
3               2            100               38   Y
4               2            85                38   Y
5               2            24                38   Y
6               3            400               46   Y
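A minimal sketch of the copy step, assuming the two tables are held as lists of Python dictionaries. The field names and the function name are hypothetical; the values are those shown on the slides. Each transaction row receives the Age and Married fields of its customer, so the result keeps one row per transaction.

```python
customers = [
    {"customer_id": 1, "age": 25, "married": "N"},
    {"customer_id": 2, "age": 38, "married": "Y"},
    {"customer_id": 3, "age": 46, "married": "Y"},
]

transactions = [
    {"transaction_id": 1, "customer_id": 1, "amount": 250},
    {"transaction_id": 2, "customer_id": 1, "amount": 125},
    {"transaction_id": 3, "customer_id": 2, "amount": 100},
    {"transaction_id": 4, "customer_id": 2, "amount": 85},
    {"transaction_id": 5, "customer_id": 2, "amount": 24},
    {"transaction_id": 6, "customer_id": 3, "amount": 400},
]


def copy_customer_fields(transactions, customers):
    """Copy each customer's age and married fields onto its transactions."""
    by_id = {c["customer_id"]: c for c in customers}
    rows = []
    for t in transactions:
        c = by_id[t["customer_id"]]
        rows.append({**t, "age": c["age"], "married": c["married"]})
    return rows


copied = copy_customer_fields(transactions, customers)
```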
Data Preparation - Aggregation
Customer ID  Age  Married  Purchased Count  Purchased Total
1            25   N        2                375
2            38   Y        3                209
3            46   Y        1                400
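The aggregation step can be sketched the same way, again assuming lists of dictionaries with hypothetical field and function names. Transactions are summarized per customer (count and total of the purchased amount), which yields one row per customer.

```python
customers = [
    {"customer_id": 1, "age": 25, "married": "N"},
    {"customer_id": 2, "age": 38, "married": "Y"},
    {"customer_id": 3, "age": 46, "married": "Y"},
]

transactions = [
    {"transaction_id": 1, "customer_id": 1, "amount": 250},
    {"transaction_id": 2, "customer_id": 1, "amount": 125},
    {"transaction_id": 3, "customer_id": 2, "amount": 100},
    {"transaction_id": 4, "customer_id": 2, "amount": 85},
    {"transaction_id": 5, "customer_id": 2, "amount": 24},
    {"transaction_id": 6, "customer_id": 3, "amount": 400},
]


def aggregate_per_customer(customers, transactions):
    """Add purchased_count and purchased_total to each customer row."""
    totals = {}
    for t in transactions:
        count, total = totals.get(t["customer_id"], (0, 0))
        totals[t["customer_id"]] = (count + 1, total + t["amount"])
    rows = []
    for c in customers:
        # Customers without transactions get a count and total of zero.
        count, total = totals.get(c["customer_id"], (0, 0))
        rows.append({**c, "purchased_count": count, "purchased_total": total})
    return rows


aggregated = aggregate_per_customer(customers, transactions)
```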
Data Transformation and Cleansing
Variable  Categorical     Numeric
          Missing Values  Missing Values
          Invalid Values  Invalid & Outliers
          Encoding        Binning
Missing Values
[Bar chart: frequency of Education values (BLANK, 1, 2, 3, 4). The BLANK bar is annotated "83% Missing Value".]
Invalid Values
[Bar chart: frequency of doc_type_id values (NULL, Z, X, 1, 2, 3), with an "Invalid" annotation.]
Missing and Invalid Values and Outliers
[Figure: distribution of Months in Business.]
Box Plot
[Figure: box plot with outliers marked by *.]
Missing Values
- Fill in missing values manually, based on domain knowledge.
- Ignore the records with missing data.
- Fill them in automatically using:
  - a global constant (e.g., ?)
  - the variable mean
  - inference-based methods such as Bayes' rule, a decision tree, or the EM algorithm
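Three of these options (ignoring records, a global constant, and the variable mean) can be sketched as follows. This is an illustration under stated assumptions: missing values are represented as None, the records and function names are hypothetical, and the inference-based methods are not shown.

```python
from statistics import mean


def drop_missing(records, field):
    """Ignore (drop) the records whose field is missing."""
    return [r for r in records if r.get(field) is not None]


def fill_constant(records, field, constant="?"):
    """Replace missing values of field with a global constant."""
    return [
        {**r, field: constant if r.get(field) is None else r[field]}
        for r in records
    ]


def fill_mean(records, field):
    """Replace missing values of field with the mean of the observed values."""
    observed = [r[field] for r in records if r.get(field) is not None]
    if not observed:
        # Nothing to average: return unchanged copies.
        return [dict(r) for r in records]
    field_mean = mean(observed)
    return [
        {**r, field: field_mean if r.get(field) is None else r[field]}
        for r in records
    ]


# Hypothetical records; None marks a missing value (an assumption).
records = [
    {"id": 1, "age": 25},
    {"id": 2, "age": None},
    {"id": 3, "age": 47},
]

dropped = drop_missing(records, "age")
with_constant = fill_constant(records, "age")
with_mean = fill_mean(records, "age")
```

Mean imputation keeps every record but shrinks the variance of the variable, which is one reason to compare it against simply dropping records.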
Managing Outliers
Outliers are data points inconsistent with the majority of the data.
Types of outliers:
- Valid: a CEO's salary
- Noisy: a person's age = 200, widely deviated points
Removal methods:
- Box plot
- Clustering
- Curve-fitting
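As a sketch of the box plot method, the function below flags points that fall outside the whiskers. The 1.5 x IQR fence is the usual box plot convention and is an assumption here, since the slides do not state a threshold; the function name and the sample ages are hypothetical.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages; 200 plays the role of the noisy value.
ages = [23, 25, 31, 38, 42, 46, 51, 200]
outliers = boxplot_outliers(ages)
cleaned = [a for a in ages if a not in outliers]
```

The rule only flags candidates. Whether a flagged point is valid (a CEO's salary) or noisy (an age of 200) still takes domain judgment before it is removed.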
Encoding Categorical Variables
Encoding is the process of transforming
categorical variables into numerical
counterparts.
Encoding methods:
- Binary method
- Ordinal method
- Target-based encoding
Encoding
Housing (for free, own, rent)
Binary method:
  for free: 1, 0, 0
  own: 0, 1, 0
  rent: 0, 0, 1
Ordinal method:
  own: 1
  for free: 3
  rent: 5
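The two encodings of the Housing variable can be reproduced with a short sketch. The function and constant names are hypothetical; the category order and the ordinal codes (1, 3, 5) are taken from the slide.

```python
HOUSING = ["for free", "own", "rent"]

# Ordinal codes as given on the slide.
ORDINAL_CODES = {"own": 1, "for free": 3, "rent": 5}


def binary_encode(value, categories):
    """Return one 0/1 indicator per category, with a 1 at the matching one."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def ordinal_encode(value, codes):
    """Return the numeric code assigned to the category."""
    return codes[value]


encoded_binary = {h: binary_encode(h, HOUSING) for h in HOUSING}
encoded_ordinal = {h: ordinal_encode(h, ORDINAL_CODES) for h in HOUSING}
```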
Binning Numerical Variables
Binning is the process of transforming
numerical variables into categorical
counterparts.
Binning methods:
- Equal width
- Equal frequency
- Entropy-based
Binning
Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28

Equi-width binning:
  Bin 1: 0, 4              [-, 10)
  Bin 2: 12, 16, 16, 18    [10, 20)
  Bin 3: 24, 26, 28        [20, +)

Equi-frequency binning:
  Bin 1: 0, 4, 12          [-, 14)
  Bin 2: 16, 16, 18        [14, 21)
  Bin 3: 24, 26, 28        [21, +)
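A minimal sketch that reproduces both binnings. The function names are hypothetical. Two assumptions go beyond the slide: the equal-width edges are built from a starting point of 0 and a width of 10, and the equal-frequency cut points are placed at the midpoint between the neighboring values on either side of each split.

```python
from bisect import bisect_right


def equal_width_edges(low, width, k):
    """Return the k-1 interior edges of k bins of the given width."""
    return [low + width * i for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Return k-1 cut points that split the sorted values into k equal groups."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for i in range(1, k):
        pos = i * n // k
        # Assumption: cut at the midpoint between the two neighboring values.
        edges.append((ordered[pos - 1] + ordered[pos]) / 2)
    return edges


def assign_bins(values, edges):
    """Group values into half-open bins [edge_i, edge_i+1)."""
    bins = [[] for _ in range(len(edges) + 1)]
    for v in values:
        bins[bisect_right(edges, v)].append(v)
    return bins


values = [0, 4, 12, 16, 16, 18, 24, 26, 28]

width_edges = equal_width_edges(0, 10, 3)
width_bins = assign_bins(values, width_edges)

freq_edges = equal_frequency_edges(values, 3)
freq_bins = assign_bins(values, freq_edges)
```

When tied values straddle a split position, the midpoint equals the tied value and all ties land in the upper bin, so the bins are then no longer exactly equal in size.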
Binning
[Figure: binning applied to Months in Business.]
Summary
- In the data preparation step, the final modeling dataset is constructed from the raw data.
- One Row per Subject is the heart of the data preparation activities for building the modeling dataset.
- Tasks include database, table, record, and field selection, as well as cleaning, aggregation, and transformation of data, including the handling of missing values, invalid values, and outliers.
