Muhammad Bilal
03-243181-008
Abstract:
The pronounced inequality of wealth and income is a major concern, especially in the United States. The prospect of reducing poverty is one compelling reason to curb the world's surging level of economic inequality, and the principle of universal moral equality supports sustainable development and improves the economic stability of a nation. Governments in different countries have been trying to address this problem and provide an optimal solution. This study shows how machine learning and data mining techniques can be applied to the income inequality problem. The UCI Adult dataset has been used for this purpose. Classification is performed to predict whether a person's yearly income in the US falls into the category of greater than 50K dollars or less than or equal to 50K dollars, based on a certain set of attributes. We applied different preprocessing techniques, i.e. discretization and Principal Component Analysis (PCA), and classification techniques, i.e. the Naïve Bayes algorithm, the J48 decision tree, and logistic regression. J48 and logistic regression achieved the highest accuracies of 93%–94%, while the Naïve Bayes algorithm gave an accuracy of 83%, which nevertheless serves as a minimum benchmark for any classification algorithm on this dataset.
Table of Contents
1. Introduction
1.1. Background
1.2. Scope
Naïve Bayes Algorithm
Decision Tree Algorithm
Logistic Regression
2. Literature Review
3. Problem Statement
4. Dataset Acquisition & Description
4.1. Training Set
4.2. Testing Set
5. Data Preprocessing
5.1. Data Preparation for Removing Outliers and Missing Values
5.2. Outliers
5.3. Missing Values
5.5. Discretization
5.6. Principal Component Analysis
7. Classification
7.1. Naïve Bayes Algorithm
7.2. J48 Algorithm
7.3. Logistic Regression
8. Performance Experiments & Post Processing
8.1. Training Set
8.2. K-fold Cross Validation
9. Conclusion
10. References
1. Introduction:
1.1. Background
Society produces vast amounts of raw data that record facts, and patterns can be discovered from that data. Without techniques to extract information from it, this raw data is useless. The process of mining previously unknown and potentially useful information from large amounts of data is called Data Mining. Classification, association rules and sequence analysis are major components of Data Mining.
Classification involves discovering rules that assign instances to predefined classes. In this procedure, a training data set is analyzed and a set of rules is generated to classify the testing data set.
An association rule involves finding rules that imply certain association relations among a set of attributes in the given data. In this process, a set of association rules is generated at multiple levels of abstraction from relevant sets of attributes in the data.
1.2. Scope
The scope of this research is limited to classification. The following data classification methods are
used in this research.
i. C4.5 Algorithm
This algorithm constructs a decision tree for the training data by recursively splitting that data. The decision tree grows using a depth-first approach. The algorithm considers all possible tests that can split the data and selects the test with the highest gain ratio; using the gain ratio rather than plain information gain removes the bias toward many-valued attributes found in the ID3 algorithm. C4.5 also allows the resulting tree to be pruned. Although pruning increases the error rate on the training data, it reduces the error on unseen data. The algorithm also deals with missing values and noisy data as well as numeric attributes.
Logistic Regression
Logistic regression is a technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems. Logistic regression is used in various fields, including machine learning, most medical fields, and the social sciences. For example, it may be used to predict the risk of developing a given disease (e.g. diabetes or coronary heart disease) based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).
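The disease-risk use case above can be sketched as follows; the patient attributes and labels are entirely hypothetical, and scikit-learn stands in for whichever statistics package would be used in practice.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical patient records: [age, body mass index]; 1 = disease present.
X = [[35, 22.0], [50, 31.5], [62, 29.0], [28, 24.5], [57, 33.0], [41, 26.0]]
y = [0, 1, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns P(class 0) and P(class 1); the second column
# can be read as the estimated risk for a new patient.
risk = model.predict_proba([[48, 30.0]])[0][1]
print(f"estimated risk: {risk:.2f}")
```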
2. Literature Review:
Several attempts using machine learning models have been made in the past by researchers to predict income levels.
Chockalingam et al. [1] explored and analyzed the Adult dataset and used several machine learning models, such as logistic regression, stepwise logistic regression, naive Bayes, decision trees, extra trees, k-nearest neighbors, SVM, gradient boosting and six configurations of an activated neural network. They also drew a comparative analysis of their predictive performances.
Bekena [2] implemented the random forest classifier algorithm to predict the income levels of individuals.
Topiwalla [3] made use of complex algorithms such as XGBoost, random forest and stacking of models for the prediction task, including a logistic stack on XGBoost and an SVM stack on logistic regression, to scale up the accuracy.
Lazar [4] implemented Principal Component Analysis (PCA) and support vector machine techniques to generate and evaluate income prediction data based on the Current Population Survey provided by the U.S. Census Bureau.
Deepajothi [5] attempted to replicate Bayesian networks, decision tree induction, lazy classifiers and rule-based learning techniques on the Adult dataset and presented a comparative analysis of their predictive performances.
Lemon [6] attempted to identify the important features in the data that could simplify the complexity of the different machine learning models used in classification tasks.
Haojun Zhu [7] applied logistic regression as a statistical modeling tool and four different machine learning methods, namely neural networks, classification and regression trees, random forests, and support vector machines, for predicting income levels.
3. Problem Statement
There are many arguments about how to become a member of the high-income social level in the US, but no conclusion. Some people believe education is the key, while others insist that capital gain is the only way to become richer. At the same time, the middle classes in emerging countries desperately want to know how the middle class in developed countries gained their fortune. As a group of international students, we want to know the crucial factors for reaching the higher income level in the US, based on the data from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Census+Income). We want to predict high-level income using different data mining techniques and thereby address the above problem.
This dataset is taken from the Data Extraction System (DES) of the US Census Bureau:
http://www.census.gov/ftp/pub/DES/www/welcome.html
It can also be downloaded from:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult/
The following table provides details and descriptions of the 14 attributes that will be used to train and test the results/outcome.
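As a sketch of how the dataset can be loaded for inspection: the attribute and class-label names below follow the adult.names file that accompanies the dataset, and the two sample rows mimic the raw file's format (the second row is modified here to show how "?" marks a missing value).

```python
import io
import pandas as pd

# Attribute names as documented in the UCI "adult.names" file.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",  # class label: ">50K" or "<=50K"
]

# Two rows in the raw file's comma-separated format.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, ?, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# "?" entries are parsed as missing values.
df = pd.read_csv(sample, names=columns, na_values="?", skipinitialspace=True)
print(df.shape)  # (2, 15): 14 attributes plus the class label
```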
5. Data Preprocessing
5.1. Data Preparation for Removing Outliers and Missing Values
The original dataset contained some missing values and outliers. We converted the dataset into a Microsoft Excel spreadsheet and separated it into 14 columns, each representing one attribute. We used WEKA to clean outliers and missing values.
5.2. Outliers
Data objects that vary considerably and/or fall outside the expected or accepted range can be considered outliers. Outliers can be caused by measurement or execution errors; they can worsen the performance of data mining algorithms and may lead them to inaccurate results. Mining outliers from a dataset involves two main sub-problems: defining which data can be considered inconsistent in a given dataset, and finding an efficient method to mine such outliers.
For this purpose we use the InterquartileRange filter from WEKA. This filter uses the IQR formula to designate some values as outliers or extreme values: any value outside the range [Q1 − k(Q3 − Q1), Q3 + k(Q3 − Q1)] is considered an outlier, where k is a constant and IQR = Q3 − Q1.
By default WEKA uses k = 3 to define a value as an outlier, and k = 2 × 3 = 6 to define it as an extreme value (extreme outlier). The formula guarantees that at least 50% of the values are considered non-outliers.
Note that the filter computes this range per attribute; when applied to the whole dataset, an instance is flagged as an outlier if at least one of its attribute values is considered an outlier for that attribute.
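The per-attribute test above can be sketched in a few lines; the hours-per-week values below are hypothetical, and WEKA's default factor k = 3 is used.

```python
import numpy as np

def iqr_outliers(values, k=3.0):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], mirroring WEKA's
    InterquartileRange filter with its default outlier factor k = 3."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [(v < lo) or (v > hi) for v in values]

# Hypothetical hours-per-week values with one obvious outlier.
hours = [40, 38, 45, 40, 42, 37, 40, 99]
flags = iqr_outliers(hours)
print([v for v, f in zip(hours, flags) if f])  # [99]
```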
Figure 1: Before Applying Outlier Detection
The following table shows the different methods used to fix missing values in the dataset for this research.
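A common convention for fixing missing values, and the one used by WEKA's ReplaceMissingValues filter, is to substitute the mean for numeric attributes and the mode for nominal ones. A minimal sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one nominal attribute.
df = pd.DataFrame({
    "age": [39, 50, np.nan, 28],
    "workclass": ["Private", None, "Private", "State-gov"],
})

# Mirror WEKA's ReplaceMissingValues filter: mean for numeric
# attributes, mode (most frequent value) for nominal ones.
df["age"] = df["age"].fillna(df["age"].mean())
df["workclass"] = df["workclass"].fillna(df["workclass"].mode()[0])
print(df)
```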
The following graphs, shown in Figure 5, were obtained by applying PCA.
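As a sketch of the dimensionality-reduction step, the snippet below applies scikit-learn's PCA to a hypothetical numeric matrix, keeping enough components to explain 95% of the variance; the report's actual run used WEKA's PCA filter on the Adult data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical numeric matrix (rows = instances, columns = attributes).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# A float n_components keeps the smallest number of principal
# components whose cumulative explained variance reaches 95%.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(X)
print(reduced.shape, pca.explained_variance_ratio_.sum())
```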
7. Classification
7.1. Naïve Bayes Algorithm
Using the Naïve Bayes algorithm, 13849 (83.39%) instances are classified correctly and 2757 (16.60%) instances are classified incorrectly.
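Naïve Bayes applies Bayes' theorem under the assumption that attributes are independent given the class. A minimal analogue of WEKA's NaiveBayes classifier, on hypothetical numeric data, using scikit-learn's Gaussian variant:

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical attributes [age, hours-per-week]; 1 = income >50K.
X = [[25, 40], [47, 60], [52, 50], [23, 20], [44, 55], [31, 38]]
y = [0, 1, 1, 0, 1, 0]

# GaussianNB fits one normal distribution per attribute and class,
# then combines the per-attribute likelihoods via Bayes' theorem.
nb = GaussianNB()
nb.fit(X, y)
print(nb.predict([[50, 58]]))
```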
7.2. J48 algorithm
J48 builds a decision tree model by analyzing training data, and uses this model to classify the testing data (user data). We use the default parameters provided in WEKA: for example, the confidence factor for pruning is set to 0.25, the minimum number of instances per leaf is 2, and reduced-error pruning is set to false by default.
Using the J48 algorithm, 15481 (93.22%) instances are classified correctly and 1125 (6.77%) instances are classified incorrectly. The confusion matrix for this algorithm is as follows.
7.3. Logistic Regression
Using logistic regression, 15668 (94.22%) instances are classified correctly and 938 (5.77%) instances are classified incorrectly.
8. Performance Experiments & Post Processing
The three different learning schemes used are Naïve Bayes, the J48 decision tree, and logistic regression.
We can see that J48 and logistic regression perform better than Naïve Bayes on the given dataset.
The data is divided randomly into ten parts. Each part is held out in turn and the learning scheme is trained on the remaining nine parts; its error rate is then calculated on the holdout set. The procedure is executed a total of ten times, and the error estimates from the ten portions are averaged to yield an overall error estimate.
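The procedure above can be sketched with scikit-learn's cross_val_score, which performs exactly this train-on-nine, test-on-one rotation; synthetic data stands in here for the Adult dataset used in the actual experiments.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; the report's runs use the Adult dataset.
X, y = make_classification(n_samples=200, random_state=0)

# Tenfold cross-validation: train on nine folds, evaluate on the
# held-out fold, repeat ten times, then average the accuracies.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())
```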
In order to analyze the performance of the different algorithms, we ran the tenfold cross-validation. The results are recorded below:
Number of runs J48 (%) Naïve Bayes (%) Logistic Regression (%)
1 93.1172 83.203 94.0123
2 92.9884 83.1518 94.0401
3 93.1254 83.2931 94.0021
4 93.1148 83.3331 94.0232
5 93.2141 83.3331 94.0999
6 92.9945 83.3421 94.1032
7 93.1358 83.3431 94.1066
8 93.1854 83.1518 94.1100
9 93.1175 83.1538 94.1323
10 93.2117 83.1395 94.1464
MEAN 93.1204 83.24444 94.07761
9. Conclusion
We obtained the raw data and wrote our own code to fix outliers and missing values in the training and testing sets. We then used the training dataset to train the machine to predict whether a person makes over 50K a year, and the testing dataset to test whether the machine can make that prediction. Based on the experimental results, we compared the accuracy and performance of several data mining algorithms. The algorithms used were the Naïve Bayes algorithm, the J48 decision tree and logistic regression (LR).
Our results show that Naïve Bayes did not reach the accuracy achieved by J48 and LR in determining whether an individual's income exceeds 50K US dollars. It is nevertheless useful for setting a benchmark performance before progressing toward more sophisticated learning algorithms.
10. References
[1] V. Chockalingam, S. Shah, and R. Shaw, “Income Classification using Adult Census Data.”
[2] S. M. Bekena, “Using decision tree classifier to predict income levels,” Munich Personal RePEc Archive (MPRA), 2017.
[3] M. Topiwalla, “Machine Learning on UCI Adult data Set Using Various Classifier Algorithms
And Scaling Up The Accuracy Using Extreme Gradient Boosting.”
[4] A. Lazar, “Income prediction via support vector machine,” in 2004 International Conference
on Machine Learning and Applications, 2004. Proceedings., pp. 143–149.
[5] S. Deepajothi and S. Selvarajan, “A Comparative Study of Classification Techniques On Adult
Data Set.”
[6] C. L. A, C. Z. A, and K. M. A, “No Title,” 1994.
[7] “A Comparative Study of Classification Techniques in Data Mining Algorithms,” Int. J. Mod.
Trends Eng. Res., vol. 4, no. 7, pp. 58–63, 2017.