
Income Analysis

Ping Yin
11/10/2016

Contents
Executive Summary
Introduction
Purpose
Methodology
  Data Selection
  Exploration
  Preparation & Transformation
  Model Development & Assessment
  Model Comparison
Options and Recommendations
Summary
Appendix

Executive Summary
After data preparation and partitioning, three models are built:
one each in SAS Studio, Enterprise Miner (EM), and DataRobot
The same test dataset is scored by all three models
The model built in EM performs best

Introduction
Can we predict people's income level from their
age, gender, education, etc.?
What will my income level be after I graduate?

Purpose
Identify the best predictive model for the Income dataset
Predict my own Income level
Practice data preparation, model building, and model
assessment

Data Selection

The Income dataset was originally extracted from the 1994 Census Bureau
database
Downloaded from Kaggle.com
Reasons for choosing it:
The target variable, Income, is categorical
Medium size: 10+ columns and 30K+ rows
Already used in the Macro and DataRobot projects

Exploration
Data explored in SAS Studio
32,561 observations
15 variables: 6 numeric, 9 character
Numeric: Age, Capitalgain, Capitalloss, Weekhour, Edunum, Fnlwgt
Character: Income, Relationship, Education, Occupation, Sex,
Marital, Workclass, Race, Nativecountry
Target: Income (>50K, <=50K)

Exploration
Data issues:
Missing values: Workclass, Occupation, Nativecountry
Many levels: Education, Marital, Workclass, Nativecountry
Numeric variables needing transformation: Capitalgain, Capitalloss
Variable to screen out: Fnlwgt
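
The first two issues above (missing values and number of levels) can be checked per column with a short profile. This is only a minimal sketch in Python rather than the SAS Studio steps used in the project. It assumes missing values are coded as "?" in the raw file, and the sample values are made up for illustration.

```python
from collections import Counter

MISSING = "?"  # assumption: how missing values are coded in the raw file

def profile_column(values):
    """Return (number of distinct non-missing levels, number of missing entries)."""
    counts = Counter(values)
    missing = counts.pop(MISSING, 0)  # take the missing marker out of the level count
    return len(counts), missing

# Made-up sample, not real rows from the Income dataset.
workclass = ["Private", "?", "State-gov", "Private", "?"]
levels, missing = profile_column(workclass)
print(f"levels={levels}, missing={missing}")
```

Running the same function over every character variable would give a quick table of which ones need imputation or binning.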

Preparation & Transformations


Solutions:
Impute missing values using subject matter knowledge:
replace missing Workclass and Occupation values with
Unemployed
Impute missing values using the mode:
replace missing Nativecountry values with United-States
Preparation & Transformations


Solutions:
Convert Capitalgain and Capitalloss from numeric to character
Bin the variables with many levels: Education, Marital, Workclass
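
The slides do not say which categories the capital variables were converted to. One plausible scheme, purely an assumption here, is to flag whether the amount is zero. The sketch below shows that numeric-to-character conversion in Python; the labels "Zero" and "Nonzero" and the sample values are hypothetical.

```python
def to_category(amount):
    """Map a numeric capital amount to a character label (hypothetical scheme)."""
    return "Zero" if amount == 0 else "Nonzero"

capitalgain = [0, 2174, 0, 14084]  # made-up sample values
capitalgain_char = [to_category(a) for a in capitalgain]
print(capitalgain_char)
```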

Preparation & Transformations


Solutions:
Binning Nativecountry and creating a new variable: region
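
Reducing Nativecountry to a region amounts to a lookup from country to group. The sketch below illustrates the idea in Python. The country-to-region mapping is hypothetical, since the slides do not list the grouping actually used, and the fallback label "Other" is also an assumption.

```python
# Hypothetical grouping; the actual mapping used in the project is not given.
REGION_BY_COUNTRY = {
    "United-States": "North America",
    "Canada": "North America",
    "Mexico": "Latin America",
    "Germany": "Europe",
    "India": "Asia",
}

def to_region(country, default="Other"):
    """Look up the region for a country, falling back to a catch-all level."""
    return REGION_BY_COUNTRY.get(country, default)

# Made-up rows; each gets a new 'region' variable next to Nativecountry.
rows = [{"Nativecountry": "Canada"}, {"Nativecountry": "India"}, {"Nativecountry": "Peru"}]
for row in rows:
    row["region"] = to_region(row["Nativecountry"])
print(rows)
```

The same lookup pattern would apply to binning Education, Marital, or Workclass, with a different mapping table for each.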

Preparation & Transformations


Reasons for dropping the variable Fnlwgt:
It is the weight on the Current Population Survey files, not original
data from the Census
It showed near-zero importance in last week's DataRobot project

Preparation & Transformations


Reasons for leaving the variable Occupation unchanged:
15 levels
No sound criterion for grouping them
Reasons for leaving the variables Race and
Relationship unchanged:
5-6 levels
Each level is meaningful

Preparation & Transformations


After preparation:

Preparation & Transformations


Data partitioned with the Strata method (stratified sampling)
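
A stratified partition samples each Income class separately so that the training and test sets keep the same class mix. The sketch below is a minimal Python illustration, not the SAS procedure used in the project. The function name `stratified_split`, the seed, and the 0.6 training fraction are assumptions chosen for the example.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, train_fraction, seed=0):
    """Split rows into (train, test), sampling each stratum separately."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Made-up data: 20 rows with a 25% / 75% class mix.
rows = [{"Income": ">50K"} for _ in range(5)] + [{"Income": "<=50K"} for _ in range(15)]
train, test = stratified_split(rows, "Income", train_fraction=0.6)
print(len(train), len(test))
```

Because the cut is applied within each class, the share of >50K rows in the training set matches the share in the full sample, up to rounding.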

Now it is ready to go!

Training dataset: used to build models in SAS Studio, Enterprise Miner, and DataRobot
Test dataset: scored by all three models

Model Development & Assessment

Model Comparison
The best model in this project: EM
[Figure panels: EM, Studio, DataRobot]

Model Comparison: Predict my Income


Ping Dataset
[Figure panels: EM, Studio, DataRobot]

Options and Recommendations


Using 60% of the data to build a model

Using 70% of the data to build a model

Options and Recommendations


Macro Project

DataRobot Project

The overall best model

Options and Recommendations


Factors that may produce these differences:
Dropping the variable Fnlwgt
Reducing levels
Variable transformation: Capitalgain, Capitalloss

Result: increased speed, but decreased model performance

Options
Use DataRobot to build models without handling the data
issues
Keep trying in SAS Studio

Summary
We can predict people's income level from their
characteristics
For the Income dataset, DataRobot is the most robust tool for building
models
Be aware that data preparation can have unexpected outcomes
Go back and forth until reaching an ideal result

Appendix
Link to Data:
https://www.kaggle.com/uciml/adult-census-Income

Thanks!
