ABSTRACT
INTRODUCTION
In this paper we study previous approaches to classifying spam and non-spam emails
with different classification algorithms. We also study the features extracted for
classifier training and the feature selection algorithms applied to discard irrelevant
features and retain the most contributing ones. After studying the current feature
selection and classification approaches, we apply two new classification techniques,
namely Random Forests and Partial Decision Trees, along with several feature selection
algorithms.
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-
6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME
R. Parimala et al. [1] present a new feature selection (FS) technique guided by the
FSelector package. They used nine feature selection techniques (correlation-based
feature selection, chi-square, entropy, information gain, gain ratio, mutual
information, symmetrical uncertainty, OneR and Relief) and five classification
algorithms (Linear Discriminant Analysis, Random Forest, Rpart, Naïve Bayes and
Support Vector Machine) on the spambase dataset. In their evaluation, the results show
that the filter methods CFS, chi-squared, GR, ReliefF, SU, IG and OneR enable the
classifiers to achieve the highest increase in classification accuracy. They conclude
that feature selection can improve the accuracy of support vector machine classifiers.
In the paper by R. Kishore Kumar et al. [2], the spam dataset is analyzed using the
Tanagra data mining tool. First, feature construction and feature selection are
performed to extract the relevant features using Fisher filtering, ReliefF, Runs
filtering and Stepdisc. Then classification algorithms such as C4.5, C-PLS, C-RT,
CS-CRT, CS-MC4, CS-SVC, ID, K-NN, LDA, Log Reg TRIRLS, Multilayer Perceptron,
Multinomial Logistic Regression, Naïve Bayes Continuous, PLS-DA, PLS-LDA, Rnd Tree and
SVM are applied to the spambase dataset, and cross-validation is performed for each of
these classifiers. They conclude that the Fisher filtering and Runs filtering feature
selection algorithms perform better for many classifiers. The Rnd Tree classification
algorithm, with the relevant features extracted by Fisher filtering, produces more than
99% accuracy in spam detection.
W.A. Awad et al. [3] review the machine learning methods Bayesian classification,
k-NN, ANNs, SVMs, artificial immune systems and rough sets on the SpamAssassin spam
corpus. They conclude that the Naïve Bayes method has the highest precision among the
six algorithms, while k-nearest neighbor has the lowest. The rough sets method also
achieves a very competitive precision.
V. Christina et al. [4] employ supervised machine learning techniques, namely the
C4.5 decision tree classifier, Multilayer Perceptron and Naïve Bayes classifier. Five
features of an e-mail, all (A), header (H), body (B), subject (S), and body with
subject (B+S), are used to evaluate the performance of four machine learning
algorithms. The training dataset, a corpus of spam and legitimate messages, is
generated from the mails received from their institute mail server over a period of
six months. They conclude that the Multilayer Perceptron classifier outperforms the
other classifiers and that its false positive rate is also very low compared to the
other algorithms.
Rafiqul Islam et al. [5] present an effective and efficient email classification
technique based on a data filtering method. They introduce a filtering technique that
uses the instance selection method (ISM) to remove uninformative data instances from
the training model before classifying the test data. In their model, tokenization and
domain-specific feature selection methods are used for feature extraction. Behavioral
features are also included to improve performance, especially to reduce false
positives (FP). The behavioral features include the frequency of sending/receiving
emails, email attachments, type of attachment, size of attachment and length of the
email. In their experiment, they tested five base classifiers (Naive Bayes, SVM, IB1,
Decision Table and Random Forest) on 6 different datasets. They also tested adaptive
boosting (AdaBoostM1) as a meta-classifier on top of the base classifiers. They
achieved an overall classification accuracy above 97%.
A comparative analysis is performed by D. Karthika Renuka et al. [6] of the
classification techniques MLP, J48 and Naïve Bayes for classifying spam messages from
e-mail using the WEKA tool. The dataset, gathered from the UCI repository, had 2788
legitimate and 1813 spam emails received over a period of several months. Using this
dataset as a training dataset, models are built for the classification algorithms. The
study reveals that the same classifier performed differently when run on the same
dataset with different software tools. MLP is the top performer in all cases and can
therefore be deemed consistent.
The following table summarizes the previous classification approaches listed above and
compares them in terms of the percentage accuracy achieved with a specific feature
selection algorithm.
PROPOSED WORK
After a detailed review of the existing techniques used for spam detection, in this
section we describe the methodology and techniques we used for spam mail detection.
Figure 1 shows the process we used for spam mail identification and how it is used in
conjunction with a machine learning scheme. Feature ranking techniques such as
Chisquare, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and
Correlation are applied to a copy of the training data. After feature selection, the
subset with the highest merit is used to reduce the dimensionality of both the
original training data and the testing data. Both reduced datasets may then be passed
to a machine learning scheme for training and testing. Results are obtained using the
Random Forest and PART classification techniques.
In the following subsections we discuss the basic concepts related to our work: a
brief background on feature ranking techniques, classification techniques and results.
Dataset
The dataset used for our experiment is spambase [13]. The last column of
'spambase.data' denotes whether the e-mail was considered spam (1) or not (0). Most of
the attributes indicate how frequently spam-related terms occur. The first 48
attributes (1-48) give the frequency of particular spam-related words as a percentage
of the words in the e-mail, whereas the next 6 attributes (49-54) give the
corresponding frequency of particular characters. The run-length attributes (55-57)
measure the length of sequences of consecutive capital letters:
capital_run_length_average, capital_run_length_longest and capital_run_length_total.
Thus, our dataset has 57 attributes in total serving as input features for spam
detection, and the last attribute represents the class (spam/non-spam).
We have also used one public dataset, Enron [20]. The “preprocessed” subdirectory
contains the messages in preprocessed format. Each message is in a separate text file.
The body of an email contains the actual information, which needs to be extracted by
preprocessing before running a filter process. The purpose of preprocessing is to
transform the mail messages into a uniform format that can be understood by the
learning algorithm. The steps involved in preprocessing are:
1. Feature extraction (tokenization): extracting features from the e-mail into a
vector space.
2. Stemming: removing the commoner morphological and inflexional endings from words
in English.
3. Stop word removal: removal of non-informative words.
4. Noise removal: removal of obscure text or symbols from features.
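The four preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list, the suffix list and the function names are hypothetical, and the crude suffix stripper merely stands in for a real stemming algorithm; it is not the preprocessing code used in our experiments.

```python
import re

# Hypothetical, deliberately tiny stop word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "you", "your"}

# Crude suffix list standing in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token):
    """Strip one common ending, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop noise and stop words, then stem the remaining tokens."""
    # Tokenization and noise removal: keep alphabetic runs only, so digits,
    # punctuation and other symbols are discarded.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming.
    return [naive_stem(t) for t in tokens]


print(preprocess("Claim your FREE prizes!!! Clicking is easy"))
```

The resulting token list would then be mapped to a feature vector, for example by counting how often each token occurs in the message.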
From the feature vector defined above, with 58 attributes in total (57 input features
and the class label), we use feature ranking and selection algorithms to select a
subset of features. We rank the given set of features using the following approaches.
1. Chi-square
2. Information gain
The entropy of a feature Y is
\[ H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y) \]
If the observed values of Y in the training data are partitioned according to the values of a
second feature X, and the entropy of Y with respect to the partitions induced by X is less than
the entropy of Y prior to partitioning, then there is a relationship between features Y and X.
The following equation gives the entropy of Y after observing X:
\[ H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \]
The amount by which the entropy of Y decreases reflects additional information about Y
provided by X and is called the information gain or, alternatively, mutual information [9].
Information gain is given by
\[ \text{gain} = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y) = H(Y) + H(X) - H(X, Y) \]
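As an illustration of the entropy and information gain definitions above, the following Python sketch computes them for discrete feature values. The function names are ours, and the sketch assumes that continuous attributes such as those in spambase have already been discretized; it is not the implementation used in our experiments.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """Entropy of y within each partition induced by x, weighted by p(x)."""
    n = len(y)
    partitions = {}
    for xi, yi in zip(x, y):
        partitions.setdefault(xi, []).append(yi)
    return sum(len(part) / n * entropy(part) for part in partitions.values())


def information_gain(y, x):
    """Reduction in the entropy of y after observing x."""
    return entropy(y) - conditional_entropy(y, x)


labels = [1, 1, 0, 0]
informative = ["a", "a", "b", "b"]    # determines the class completely
uninformative = ["a", "b", "a", "b"]  # independent of the class
print(information_gain(labels, informative), information_gain(labels, uninformative))
```

A feature that splits the classes perfectly recovers the full entropy of the class, while a feature independent of the class gains nothing.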
3. Gain ratio
\[ \text{gain ratio} = \frac{H(Y) - H(Y \mid X)}{H(X)} \]
4. Symmetrical uncertainty
Information gain is a symmetrical measure; that is, the amount of information gained
about Y after observing X is equal to the amount of information gained about X after
observing Y. Symmetry is a desirable property for a measure of feature-feature
intercorrelation to have. Unfortunately, information gain is biased in favour of features with
more values. Symmetrical uncertainty compensates for information gain’s bias toward
attributes with more values and normalizes its value to the range [0, 1] [9]:
\[ \text{symmetrical uncertainty coefficient} = 2.0 \times \left[ \frac{\text{gain}}{H(Y) + H(X)} \right] \]
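The bias of information gain toward many-valued features, and the way the normalized measures compensate for it, can be seen in a small Python sketch. The function names are ours, the gain ratio is taken in its usual form (information gain divided by the entropy of the feature), and discrete feature values are assumed; this is an illustration rather than the code used in our experiments.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy in bits of a sequence of discrete (hashable) values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(y, x):
    """gain = H(Y) + H(X) - H(X, Y)."""
    return entropy(y) + entropy(x) - entropy(list(zip(x, y)))


def gain_ratio(y, x):
    """Information gain normalized by the entropy of the feature x."""
    hx = entropy(x)
    return information_gain(y, x) / hx if hx > 0 else 0.0


def symmetrical_uncertainty(y, x):
    """2.0 * gain / (H(Y) + H(X)), which lies in the range [0, 1]."""
    denom = entropy(y) + entropy(x)
    return 2.0 * information_gain(y, x) / denom if denom > 0 else 0.0


labels = [1, 1, 0, 0]
binary_feature = ["a", "a", "b", "b"]  # two values, predicts the class
id_like_feature = [0, 1, 2, 3]         # one distinct value per instance
for feature in (binary_feature, id_like_feature):
    print(information_gain(labels, feature),
          gain_ratio(labels, feature),
          symmetrical_uncertainty(labels, feature))
```

Both features obtain the same information gain, but the identifier-like feature with four distinct values receives a lower gain ratio and a lower symmetrical uncertainty than the binary feature.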
5. Relief
\[ \text{Gini}' = \left[ \sum_{c \in C} p(c)\,(1 - p(c)) \right] - \sum_{x \in X} \left( \frac{p(x)^2}{\sum_{x \in X} p(x)^2} \sum_{c \in C} p(c \mid x)\,(1 - p(c \mid x)) \right) \]
6. OneR
Like other empirical learning methods, 1R [11] takes as input a set of examples, each
with several attributes and a class. The aim is to infer a rule that predicts the class given the
values of the attributes. The 1R algorithm chooses the most informative single attribute and
bases the rule on this attribute alone. The basic idea is:
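As a hedged illustration of this idea, the following Python sketch builds one rule per attribute by mapping each attribute value to its most frequent class, and keeps the attribute whose rule makes the fewest errors on the training examples. The function name and the toy data are ours, and discrete attribute values are assumed (continuous attributes would first have to be discretized).

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (attribute index, value -> class rule, training errors) of the best rule."""
    best = None
    for a in range(len(rows[0])):
        # Count the classes observed for each value of attribute a.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[a]][label] += 1
        # Each value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that do not belong to the majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (a, rule, errors)
    return best


rows = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
labels = ["spam", "spam", "ham", "ham"]
print(one_r(rows, labels))
```

When 1R is used for feature ranking, the error rate of each single-attribute rule can serve as the score of that attribute.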
7. Correlation
Feature selection for classification tasks in machine learning can be accomplished on
the basis of correlation between features, and such a feature selection procedure can be
beneficial to common machine learning algorithms [9]. Features are relevant if their values
vary systematically with category membership. In other words, a feature is useful if it is
correlated with or predictive of the class; otherwise it is irrelevant. A good feature subset is
one that contains features highly correlated with (predictive of) the class, yet uncorrelated
with (not predictive of) each other. The acceptance of a feature will depend on the extent to
which it predicts classes in areas of the instance space not already predicted by other features.
The correlation-based feature selection (CFS) feature subset evaluation function is [9]:
\[ \mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \]
where \( \mathrm{Merit}_S \) is the heuristic “merit” of a feature subset S containing k features,
\( \overline{r_{cf}} \) is the mean feature-class correlation, and \( \overline{r_{ff}} \) is the average
feature-feature inter-correlation.
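The behaviour of the CFS merit function can be checked numerically with a short Python sketch. The function and argument names are ours; the two correlation averages are simply passed in as numbers, since how they are estimated is outside the scope of this illustration.

```python
from math import sqrt


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Heuristic merit of a subset of k features, following the CFS formula [9]."""
    numerator = k * mean_feature_class_corr
    denominator = sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Four features, each moderately predictive of the class.
print(cfs_merit(4, 0.5, 0.0))  # mutually uncorrelated features
print(cfs_merit(4, 0.5, 1.0))  # fully redundant features
```

With uncorrelated features the merit grows with the subset size, whereas with fully redundant features the merit of the subset is no better than that of a single feature.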
1. Remove irrelevant features, which may mislead the classifier, reduce its
interpretability and hurt generalization by increasing overfitting.
2. Remove redundant features, which provide no information beyond that of the other
features and unnecessarily decrease the efficiency of the classifier.
3. Select high-ranked features. This may not do much to improve precision and recall,
but it reduces time complexity drastically. Selecting such high-ranked features
reduces the dimensionality of the feature space of the domain. It speeds up the
classifier, thereby improving performance and increasing the comprehensibility of
the classification result.
We have considered 87%, 77% and 70% of the features; performance improves when 70% of
the features are considered.
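Selecting the top-ranked fraction of the features and applying the same reduction to the training and testing data can be sketched as follows. The function names are ours, and rounding the fraction to the nearest whole number of features is an assumption made only for this illustration; the exact rounding rule is not specified here.

```python
def select_top_fraction(scores, fraction):
    """Return the indices of the highest-scoring features, in column order."""
    k = max(1, round(len(scores) * fraction))  # rounding rule is an assumption
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def project(rows, keep):
    """Keep only the selected feature columns of every instance."""
    return [[row[i] for i in keep] for row in rows]


scores = [0.1, 0.9, 0.5, 0.3]          # one ranking score per feature
keep = select_top_fraction(scores, 0.5)
train = project([[10, 20, 30, 40]], keep)
test = project([[11, 21, 31, 41]], keep)
print(keep, train, test)
```

The feature indices are computed once from the ranking obtained on the training data and then reused unchanged for the testing data, so both sets end up with the same columns.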
Classification Methods
Assuming that the given dataset has a sufficient number of training instances, we have
chosen the following two classification algorithms. These algorithms work well
provided that the dataset is of good quality.
1. Random Forest
Random Forests [14] are a combination of tree predictors such that each tree depends on
the values of a random vector sampled independently and with the same distribution for all
trees in the forest. The generalization error for forests converges almost surely to a limit
as the number of trees in the forest becomes large. The generalization error of a forest of
tree classifiers depends on the strength of the individual trees in the forest and the
correlation between them.
Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random - but with
replacement, from the original data. This sample will be the training set for growing
the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m
variables are selected at random out of the M and the best split on these m is used to
split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
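The two sources of randomness in steps 1 and 2 above can be sketched in Python as follows. The function names are ours, and the value m = 7 for M = 57 features is a hypothetical choice made only for this illustration; the sketch does not grow any tree.

```python
import random


def bootstrap_sample(n_cases, rng):
    """Step 1: draw N case indices at random, with replacement."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]


def candidate_features(n_features, m, rng):
    """Step 2: pick m distinct feature indices out of M for one node split."""
    return rng.sample(range(n_features), m)


rng = random.Random(42)                    # fixed seed only for repeatability
rows = bootstrap_sample(4601, rng)         # training set for one tree
features = candidate_features(57, 7, rng)  # candidates for one node
print(len(rows), len(set(rows)), sorted(features))
```

Because the sampling is done with replacement, some cases appear several times in the training set of a tree and others not at all, which is what makes the trees of the forest differ from one another.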
This is the case when 100% of the features are selected for training the model; the root
node of each tree changes accordingly.
purposes. Two well-known members of the family of rule-learners are C4.5 and RIPPER.
C4.5 [16], for instance, generates an unpruned decision tree and transforms this tree into a set
of rules. For each path from the root node to a leaf a rule is generated. Then, each rule is
simplified separately followed by a rule-ranking strategy. Finally, the algorithm deletes rules
from the rule set as long as the rule set’s error rate on the training instances decreases.
RIPPER [17] implements a divide and conquer strategy to rule induction. Only one rule is
generated at a time and the instances from a training set covered by this rule are removed. It
iteratively derives new rules for the remaining instances of the training set.
PART (Partial Decision Trees) adopts the divide-and-conquer strategy of RIPPER
[17] and combines it with the decision tree approach of C4.5 [16]. PART generates a set of
rules according to the divide-and-conquer strategy, removes all instances from the training
collection that are covered by this rule and proceeds recursively until no instance remains. To
generate a single rule, PART builds a partial decision tree for the current set of instances and
chooses the leaf with the largest coverage as the new rule. For example, some of the rules
formed by our implementation of PART are shown below:
Rule 1:
word_freq_remove > 0.0 AND
char_freq_! > 0.049 AND
word_freq_edu <= 0.06: 1 (Instances: 490 and Incorrect: 7)
After Rule 1, the next rules are formed on the remaining instances, excluding these 490
instances from the 4601 total instances of spambase.
Rule 2:
char_freq_$ > 0.058 AND
word_freq_hp <= 0.4 AND
capital_run_length_longest > 9.0 AND
word_freq_1999 <= 0.0 AND
word_freq_edu <= 0.08 AND
char_freq_! > 0.107: 1 (Instances: 334 and Incorrect: 2)
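To illustrate how such an ordered rule list is applied, the following Python sketch encodes Rule 1 and Rule 2 with the thresholds given above and classifies an instance with the first rule that covers it. The data structures and function names are ours, and the default class 0 (non-spam) used when no rule fires is an assumption made for this illustration; the sketch does not learn rules.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

# (conditions, predicted class); thresholds copied from Rule 1 and Rule 2.
RULES = [
    ([("word_freq_remove", ">", 0.0),
      ("char_freq_!", ">", 0.049),
      ("word_freq_edu", "<=", 0.06)], 1),
    ([("char_freq_$", ">", 0.058),
      ("word_freq_hp", "<=", 0.4),
      ("capital_run_length_longest", ">", 9.0),
      ("word_freq_1999", "<=", 0.0),
      ("word_freq_edu", "<=", 0.08),
      ("char_freq_!", ">", 0.107)], 1),
]


def covers(conditions, instance):
    """True if the instance satisfies every condition of the rule."""
    return all(OPS[op](instance[name], threshold)
               for name, op, threshold in conditions)


def classify(instance, rules=RULES, default=0):
    """Return the class of the first covering rule (default class is an assumption)."""
    for conditions, label in rules:
        if covers(conditions, instance):
            return label
    return default


def remove_covered(conditions, instances):
    """Drop the instances covered by a rule before the next rule is formed."""
    return [x for x in instances if not covers(conditions, x)]
```

For example, an instance with word_freq_remove = 0.5, char_freq_! = 0.1 and word_freq_edu = 0.0 is covered by Rule 1 and classified as spam, and remove_covered would exclude it from the instances used to form Rule 2.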
Results
Spambase Results
The spambase dataset was taken from the UCI machine learning repository [13]. It
contains 4601 instances and 58 attributes: 57 continuous attributes (1-57) and 1
nominal class label. The email spam classification has been implemented in Eclipse,
which is considered by many to be the best Java development tool available. Feature
ranking and feature selection are done using methods such as Chi-square, Information
gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation as a
preprocessing step, so as to select a feature subset for building the learning model.
The classification algorithms are from the decision tree family, viz. Random Forest and
Partial Decision Trees. Random forests are an effective tool in prediction. Because of
the law of large numbers they do not overfit. Random inputs and random features produce
good results in classification, less so in regression. For larger data sets, it seems
that significantly lower error rates are possible [14]. The feature space can be
reduced by an order of magnitude while achieving similar classification results. For
example, it takes about 2,000 χ² features to achieve accuracies similar to those
obtained with 149 PART features [15].
As a part of our implementation, we divided the dataset into two parts: 80% of the
dataset is used for training and 20% for testing. After the preprocessing step, the top
87%, 77% and 70% of the features are considered while building the training model and
testing, because these subsets give a significant performance improvement. Prediction
accuracy, correctly classified instances, incorrectly classified instances, the
confusion matrix and time complexity are used as performance measures of the system.
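The 80/20 split and the accuracy and confusion matrix measures can be sketched in Python as follows. The function names are ours, and shuffling the instances with a fixed seed before splitting is an assumption made for this illustration; it is not a statement about how the split was drawn in our experiments.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def confusion_matrix(actual, predicted, positive=1):
    """Return counts (tp, fp, fn, tn) with spam as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    """Fraction of correctly classified instances."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


train, test = train_test_split(range(4601))
print(len(train), len(test))
print(confusion_matrix([1, 1, 0, 0], [1, 0, 0, 0]), accuracy([1, 1, 0, 0], [1, 0, 0, 0]))
```

The correctly classified instances are tp + tn and the incorrectly classified instances are fp + fn, so all of the count-based measures listed above can be read off the confusion matrix.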
More than 99% prediction accuracy is achieved by Random Forest with all seven feature
selection methods under consideration, whereas 97% prediction accuracy is achieved by
PART with almost all seven feature selection methods while training the model. Training
and testing results when 100% of the features are considered are given in Table 2.
Both training and testing results on the spambase dataset after feature ranking and
subset selection are shown in Table 3 and Table 4.
From the results above, it can be observed that for Random Forest, after using 70% of
the feature set selected by the Infogain, Symmetrical Uncertainty and OneR feature
selection algorithms, the training accuracy remained the same (99.918%) whereas the
computation time was reduced by about 17% (from 1540 ms to 1276 ms). This suggests
that the remaining 30% of the features were not contributing to the classification.
It can also be observed that for PART, after using 70% of the feature set selected by
the Chisquare, Infogain and Symmetrical Uncertainty feature selection algorithms, the
training accuracy increased by 1.521% and the computation time was reduced by about
51% (from 4938 ms to 2409 ms). This suggests that the remaining 30% of the features
were not only redundant but were also misleading the classifier.
Enron Results
More than 96% prediction accuracy is achieved by Random Forest with all seven feature
selection methods under consideration, whereas more than 95% prediction accuracy is
achieved by PART with almost all seven feature selection methods while training the
model. Training and testing results when 100% of the features are considered are given
in Table 5.
Both training and testing results after feature ranking and subset selection are shown
in Table 6 and Table 7.
From the results above, it can be observed that for Random Forest, after using 87% of
the feature set, the training accuracy is 96.012% whereas the computation time was
reduced by 51.574% (from 9466 ms to 4584 ms). This suggests that the remaining 13% of
the features were not contributing to the classification.
It can also be observed that for PART, after using 87% and 77% of the feature set, the
training accuracy increased. With 87% of the features, the accuracy improved by 1% and
the computation time was reduced by 67.879% (from 18558 ms to 5961 ms). This suggests
that the remaining 13% of the features were not only redundant but were also
misleading the classifier.
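The percentage reductions quoted for the Enron dataset follow from the relative change in computation time. As a quick check, with \( t_{\text{before}} \) and \( t_{\text{after}} \) (symbols introduced here only for illustration) denoting the times measured with all features and with 87% of the features:

```latex
\[ \text{reduction} = \frac{t_{\text{before}} - t_{\text{after}}}{t_{\text{before}}} \]
\[ \text{Random Forest: } \frac{9466 - 4584}{9466} = \frac{4882}{9466} \approx 0.51574
   \qquad
   \text{PART: } \frac{18558 - 5961}{18558} = \frac{12597}{18558} \approx 0.67879 \]
```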
Further, we tested our Enron model on a dataset created from the emails we received in
our Gmail accounts over the last 3 months. The results are shown in Table 8. In this
experiment, the test dataset is completely non-overlapping with the training set,
allowing us to truly evaluate the performance of our system.
CONCLUSION
In this paper we have studied previous approaches of spam email detection using
machine learning methodologies. We have compared and evaluated the approaches based on
the factors such as dataset used; features extracted, ranked and selected; feature selection
algorithms used and the results received in terms of accuracy (precision, recall and error rate)
and performance (time required).
The datasets available for spam detection are large, and for such large datasets
Random Forest and PART tend to produce better results, with lower error rates and
higher precision. We therefore used these two classifiers for spam email detection.
For the spambase dataset, we obtained a best accuracy of 99.918% with Random Forest,
which is 9% better than previous spambase approaches, and 96.416% with PART. For the
Enron dataset, we obtained a best accuracy of 96.181% with Random Forest and 95.093%
with PART. The Enron dataset is used by [21] in an unsupervised spam learning and
detection scheme. For the dataset created from our personal emails, an accuracy of 96%
is achieved with Random Forest and 97.33% with PART. The feature selection algorithms
used also contributed to better accuracy with lower time complexity, owing to
dimensionality reduction. For Random Forest, after using 70% of the feature set on the
spambase dataset, the training accuracy remained the same (99.918%) whereas the
computation time was reduced by about 17% (from 1540 ms to 1276 ms); for PART, the
training accuracy increased by 1.521% and the computation time was reduced by about
51% (from 4938 ms to 2409 ms).
REFERENCES
1.“A Study of Spam E-mail classification using Feature Selection package”, R.Parimala,
Dr. R. Nallaswamy, National Institute of Technology, Global Journal of Computer Science
and Technology, Volume 11 Issue 7 Version 1.0 May 2011.
2.“Comparative Study on Email Spam Classifier using Data Mining Techniques”, R. Kishore
Kumar, G. Poonkuzhali, P. Sudhakar, Member, IAENG, Proceedings of the International
Multiconference of Engineers and Computer Scientists 2012 Vol I, IMECS 2012, March 14-
16, Hong Kong.
3.“Machine Learning Methods for Spam E-mail Classification”, W.A. Awad and S.M.
ELseuofi, International Journal of Computer Applications (0975 – 8887)Volume 16– No.1,
February 2011.
4.“Email Spam Filtering using Supervised Machine Learning Techniques”, V.Christina,
S.Karpagavalli, G.Suganya, (IJCSE) International Journal on Computer Science and
EngineeringVol. 02, No. 09, 2010, 3126-3129.
5.“Email Classification Using Data Reduction Method”, Rafiqul Islam and Yang Xiang,
member IEEE, School of Information Technology Deakin University, Burwood 3125,
Victoria, Australia.
6.“Spam Classification based on Supervised Learning using Machine Learning
Techniques”, Ms.D Karthika Renuka, Dr.T.Hamsapriya, Mr.M.Raja Chakkaravarthi,
Ms. P. Lakshmi Surya, 978-1-61284-764-1/11/$26.00 ©2011 IEEE.
7.“An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail
Categorization”, Chih-Chin Lai, Ming-Chi Tsai, Proceedings of the Fourth International
Conference on Hybrid Intelligent Systems (HIS’04) 0-7695-2291-2/04 $ 20.00 IEEE.
8.“Introductory Statistics: Concepts, Models, and Applications”, David W. Stockburger.
9.“Feature Subset Selection: A Correlation Based Filter Approach”, Hall, M. A., Smith, L.
A., 1997, International Conference on Neural Information Processing and Intelligent
Information Systems, Springer, p855-858.
10.“A practical approach to feature selection”, K. Kira and L. A. Rendell, Proceedings of the
Ninth International Conference, 1992.
11.“Very simple classification rules perform well on most commonly used datasets”, Holte,
R.C.(1993) Machine Learning, Vol. 11, 63–91.
12.“Induction of decision trees”, J.R. Quinlan, Machine Learning 1, 81-106, 1986.
13.“UCI repository of Machine learning Databases”, Department of Information and
Computer Science, University of California, Irvine, CA,
http://www.ics.uci.edu/~mlearn/MLRepository.html, Hettich, S., Blake, C. L., and Merz,
C. J.,1998.
14.“Random Forests”, Leo Breiman, Statistics Department University of California
Berkeley, CA 94720, January 2001.
15.“Exploiting Partial Decision Trees for Feature Subset Selection in eMail Categorization”,
Helmut Berger, Dieter Merkl, Michael Dittenbach, SAC’06, April 23-27, 2006, Dijon, France,
Copyright 2006 ACM 1-59593-108-2/06/0004.
16.“C4.5: Programs for Machine Learning”, J. R. Quinlan, Morgan Kaufmann Publishers
Inc., 1993.
17.“Fast effective rule induction”, W. W. Cohen, In Proc. of the Int’l Conf. on Machine
Learning, pages 115–123. Morgan Kaufmann, 1995.
18.“Toward optimal feature selection using Ranking methods and classification Algorithms”,
Jasmina Novaković, Perica Strbac, Dusan Bulatović, March 2011.
19.“SpamAssassin”, http://spamassassin.apache.org.
20. The enron spam dataset http://www.aueb.gr/users/ion/data/enron-spam/
21. “A Case for Unsupervised-Learning-based Spam Filtering”, Feng Qian, Abhinav Pathak,
Y. Charlie Hu, Z. Morley Mao, Yinglian Xie.
22. Jyoti Pruthi and Dr. Ela Kumar, “Data Set Selection in Anti-Spamming Algorithm -
Large or Small”, International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 2, 2012, pp. 206 - 212, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
23. C.R. Cyril Anthoni and Dr. A. Christy, “Integration of Feature Sets with Machine
Learning Techniques for Spam Filtering”, International Journal of Computer Engineering &
Technology (IJCET), Volume 2, Issue 1, 2011, pp. 47 - 52, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.