University of Toronto 2010
saed.sayad@utoronto.ca
http://chem-eng.utoronto.ca/~datamining/

Data Mining Steps
  1. Problem Definition
  2. Data Preparation
  3. Data Exploration
  4. Modeling
  5. Evaluation
  6. Deployment

1. Problem Definition
Understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition with a preliminary plan designed to achieve the objectives.
Source: http://www.crisp-dm.org/Process/index.htm

2. Data Preparation
The data preparation step covers all activities needed to construct the final modeling dataset from the raw data. Tasks include database, table, record, and field selection, as well as cleaning, aggregation, and transformation of data.

Data Preparation
[Diagram labels: Data, Text Data, DSN, ETL, Modeling Data]

Data Sources
              Text Files     Relational Database             Multi-dimensional Database
  Entities    File           Table                           Cube
  Attributes  Row and Col    Record, Field, Index            Dimension, Level, Measurement
  Methods     Read, Write    Select, Insert, Update, Delete  Drill down, Drill up, Drill through
  Language    -              SQL                             MDX

Data Types
[Diagram: data measurement scales Ratio, Interval, Counting, Ordinal, and Nominal, grouped under Numerical and Categorical]

Denormalization
One Row per Subject Transformation
[Diagram: the Customer and Transaction tables (1 to N) are each transformed 1 to 1 into Customer Transformed and Transaction Transformed (1 to N)]

Copy and Aggregate
[Diagram: Customer and Transaction tables, combined by Copy or by Aggregate]

Data Preparation - Aggregation
  Categorical: Count, Count%
  Numeric:     Count, Sum, Mean, Std, Min, Max

One to Many Relationship
Customers (1) to Transactions (N)

  Customer
  ID  Age  Married
  1   25   N
  2   38   Y
  3   46   Y

  Transaction
  ID  Customer ID  Purchased Amount
  1   1            250
  2   1            125
  3   2            100
  4   2            85
  5   2            24
  6   3            400

Data Preparation - Copy
  Transaction ID  Customer ID  Purchased Amount  Age  Married
  1               1            250               25   N
  2               1            125               25   N
  3               2            100               38   Y
  4               2            85                38   Y
  5               2            24                38   Y
  6               3            400               46   Y

Data Preparation - Aggregation
  Customer ID  Age  Married  Purchased Count  Purchased Total
  1            25   N        2                375
  2            38   Y        3                209
  3            46   Y        1                400

Data Transformation and Cleansing
  Variable
    Categorical: Missing Values, Invalid Values, Encoding
    Numeric:     Missing Values, Invalid & Outliers, Binning

Missing Values
[Bar chart: frequency of Education values (BLANK, 1, 2, 3, 4); 83% of values are missing]

Invalid Values
[Bar chart: frequency of doc_type_id values (NULL, Z, X, 1, 2, 3), with invalid values marked]

Missing and Invalid Values and Outliers
[Chart: Months in Business]

Box Plot
[Box plot with outliers marked by *]

Missing Values
  Fill in missing values manually, based on domain knowledge
  Ignore the records with missing data
  Fill them in automatically:
    A global constant (e.g., ?)
    The variable mean
    Inference-based methods such as Bayes rule, decision tree, or EM algorithm

Managing Outliers
  Outliers are data points inconsistent with the majority of the data
  Different kinds of outliers:
    Valid: a CEO's salary
    Noisy: a person's age = 200, widely deviating points
  Removal methods:
    Box plot
    Clustering
    Curve-fitting

Encoding Categorical Variables
Encoding is the process of transforming categorical variables into numerical counterparts.
Encoding methods:
  Binary method
  Ordinal method
  Target-based encoding

Encoding
Housing (for free, own, rent)
  Binary method:
    for free: 1, 0, 0
    own:      0, 1, 0
    rent:     0, 0, 1
  Ordinal method:
    own:      1
    for free: 3
    rent:     5

Binning Numerical Variables
Binning is the process of transforming numerical variables into categorical counterparts.
Binning methods:
  Equal Width
  Equal Frequency
  Entropy Based

Binning
Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28
  Equi-width binning:
    Bin 1 [-, 10):   0, 4
    Bin 2 [10, 20):  12, 16, 16, 18
    Bin 3 [20, +):   24, 26, 28
  Equi-frequency binning:
    Bin 1 [-, 14):   0, 4, 12
    Bin 2 [14, 21):  16, 16, 18
    Bin 3 [21, +):   24, 26, 28

Binning
[Chart: Months in Business]

Summary
  In the data preparation step, the final modeling dataset is constructed from the raw data.
  One Row per Subject is the heart of the data preparation activities for building the modeling dataset.
  Tasks include database, table, record, and field selection, as well as cleaning, aggregation, and transformation of data, along with handling missing values, invalid values, and outliers.
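The aggregation, binary encoding, and binning examples from the slides can be sketched in plain Python. This is a minimal illustration, not the author's implementation: the function names (aggregate_per_customer, binary_encode, equal_width_bins, equal_frequency_bins) and the dictionary field names are assumptions introduced here. The data values are copied from the slides.

```python
from bisect import bisect_right
from collections import defaultdict

# Data copied from the slides; field names are hypothetical.
customers = [
    {"customer_id": 1, "age": 25, "married": "N"},
    {"customer_id": 2, "age": 38, "married": "Y"},
    {"customer_id": 3, "age": 46, "married": "Y"},
]
transactions = [
    {"transaction_id": 1, "customer_id": 1, "amount": 250},
    {"transaction_id": 2, "customer_id": 1, "amount": 125},
    {"transaction_id": 3, "customer_id": 2, "amount": 100},
    {"transaction_id": 4, "customer_id": 2, "amount": 85},
    {"transaction_id": 5, "customer_id": 2, "amount": 24},
    {"transaction_id": 6, "customer_id": 3, "amount": 400},
]


def aggregate_per_customer(customers, transactions):
    """Roll the N-side table up to one row per customer (count and sum)."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for t in transactions:
        counts[t["customer_id"]] += 1
        totals[t["customer_id"]] += t["amount"]
    return [
        {
            **c,
            "purchased_count": counts[c["customer_id"]],
            "purchased_total": totals[c["customer_id"]],
        }
        for c in customers
    ]


def binary_encode(value, categories):
    """One 0/1 indicator per category, as in the slides' binary method."""
    return [1 if value == c else 0 for c in categories]


def equal_width_bins(values, edges):
    """Return a 0-based bin index per value; bins are [edge_i, edge_{i+1})
    and open-ended below the first edge and above the last."""
    return [bisect_right(edges, v) for v in values]


def equal_frequency_bins(values, k):
    """Split the sorted values into k groups of (nearly) equal size.
    Caveat: tied values can end up in different bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


values = [0, 4, 12, 16, 16, 18, 24, 26, 28]

if __name__ == "__main__":
    for row in aggregate_per_customer(customers, transactions):
        print(row)
    print(binary_encode("own", ["for free", "own", "rent"]))
    print(equal_width_bins(values, [10, 20]))
    print(equal_frequency_bins(values, 3))
```

With the slide data, the aggregation reproduces the Purchased Count and Purchased Total columns, and both binning functions reproduce the three bins shown on the Binning slide. The equi-frequency boundaries 14 and 21 on that slide are the midpoints between neighbouring groups (12 and 16, 18 and 24); the sketch returns the groups themselves rather than the boundaries.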