
International Journal of Computer Information Systems, Vol. 4, No. 1, 2012

Spam Email Classification based on Machine learning Algorithms


Suresh Subramanian
Research Scholar Karpagam University Coimbatore, Tamil Nadu, India sureshsmanian@gmail.com

Dr. Sivaprakasam
Department of Computer Science Sri Vasavi College, Erode, Tamil Nadu, India psperode@yahoo.com

Abstract: Data mining is a powerful technology for extracting hidden predictive information from large databases. Data mining tools help predict behaviours and future trends, allowing businesses to make proactive, knowledge-driven decisions. In recent years, email and related internet technologies have made communication much easier. However, because of spammers, spam mail has become an increasingly bothersome problem for internet users. In this paper, we compare different machine learning classification algorithms on spam email data.

Key words: Data Mining, Classification, MLP, J48, CART, IB1, Spam Email

I. INTRODUCTION

E-mail has become an essential communication tool in most businesses today because it is cheap and fast. As the use of email grows, spam grows with it, since email is an inexpensive way of advertising. The number of worldwide email users is projected to increase from over 1.4 billion in 2009 to over 1.9 billion by 2013 [1]. In 2009, about 81% of total email traffic was spam, a figure projected to rise steadily over the next two years, reaching 84% in 2013 [1]. Spam email is one of the major problems of today's internet: it invades individual users' private mailboxes, brings financial damage to companies and annoys users in many ways. Moreover, spam remains a viable source of income for spammers. Spam emails fill users' mailboxes without their consent, and they consume network capacity as well as the time spent checking and deleting them. While most users want to get rid of spam, they need clear and simple guidelines on how to do so. In spite of all the efforts taken to eliminate spam, it has not yet been removed entirely. Among the different approaches developed to stop spam, filtering is one of the most important techniques. Much research in spam filtering has centred on more sophisticated classifier-related issues, and machine learning for spam classification has become an important research topic. Data mining classification algorithms are used to categorize an email as spam or non-spam.

In this research we compare four classification algorithms, namely J48, Multi Layer Perceptron (MLP), Classification and Regression Trees (CART) and IB1, on a spam email dataset using the WEKA tool.

II. CLASSIFICATION ALGORITHMS

Classification is the process of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown [2]. Standard text classification techniques, including keyword, phrase and character based studies, have been used to filter spam emails. WEKA is an open source software tool: a collection of machine learning algorithms implemented in Java. The comparative study of different machine learning algorithms for classifying spam email reported here was carried out with this tool.

A. Multi Layer Perceptron (MLP)

A Multi Layer Perceptron is a feed-forward artificial neural network model that maps sets of input data onto a set of outputs. MLP networks are flexible, non-linear models consisting of a number of units organised into multiple layers, and the complexity of an MLP network can be changed by varying the number of layers and the number of units in each layer. An MLP contains three or more layers: an input layer, one or more hidden layers and an output layer. The number of neurons in the input and output layers depends on the problem itself, whereas the number of neurons in the hidden layer is arbitrary and is usually decided by trial and error. Each neuron in every layer receives the summation of weighted signals coming from the neurons of the previous layer, and an activation function determines the output strength of the neuron. The weights between the input-hidden and hidden-output layers, and the thresholds for the hidden and output layer neurons, are initially selected randomly. Using the back propagation delta learning rule, the weights and thresholds are then modified gradually; this procedure is called training. Once training is complete, the 'recall' procedure quickly computes outputs for corresponding inputs. Details regarding the modification of an MLP's weights and thresholds can be found in books such as [3].
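The training loop described above can be sketched in a few dozen lines. The following is a minimal illustration of a one-hidden-layer MLP trained with the back propagation delta rule; the tiny two-feature dataset and network sizes are illustrative assumptions, not the paper's spambase setup or WEKA's MultilayerPerceptron implementation.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy data: two input features, binary label (OR-like behaviour).
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

n_in, n_hid = 2, 2
# Weights and thresholds (biases) are selected randomly, as in the text.
w_ih = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
b_h  = [random.uniform(-1, 1) for _ in range(n_hid)]
w_ho = [random.uniform(-1, 1) for _ in range(n_hid)]
b_o  = random.uniform(-1, 1)

def forward(x):
    """Each neuron receives the weighted sum from the previous layer."""
    h = [sigmoid(sum(w * xi for w, xi in zip(w_ih[j], x)) + b_h[j])
         for j in range(n_hid)]
    o = sigmoid(sum(w * hj for w, hj in zip(w_ho, h)) + b_o)
    return h, o

def total_loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

loss_before = total_loss()
lr = 0.5
for _ in range(2000):                 # 'training': gradual weight updates
    for x, t in data:
        h, o = forward(x)
        d_o = (o - t) * o * (1 - o)   # output-layer delta
        for j in range(n_hid):
            d_h = d_o * w_ho[j] * h[j] * (1 - h[j])   # hidden-layer delta
            for i in range(n_in):
                w_ih[j][i] -= lr * d_h * x[i]
            b_h[j] -= lr * d_h
            w_ho[j] -= lr * d_o * h[j]
        b_o -= lr * d_o

loss_after = total_loss()             # 'recall' is just a forward pass
print(loss_before, loss_after)
```

After training, classifying a new input is only the forward pass, which is why recall is fast compared with training.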


ISSN 2229 5208

B. J48

J48 implements Quinlan's C4.5 algorithm for generating a pruned or unpruned C4.5 decision tree [4]. J48 is based on the concept of information gain ratio: it builds decision trees from a set of training data using information entropy, examining the normalized information gain that results from choosing an attribute to split the data into smaller subsets. The attribute with the highest normalized information gain is used to make the decision, and the algorithm then recurses on the smaller subsets until each leaf node is pure, meaning that the data has been categorized as close to perfectly as possible. The training data is a set S = s1, s2, s3, ... of already classified samples. Each sample si = x1, x2, x3, ... is a vector, where x1, x2, x3, ... are attributes or features of the sample. The training data is accompanied by a vector C = c1, c2, c3, ..., where c1, c2, c3, ... are the classes to which the samples belong. At each node of the tree, J48 chooses the attribute that most effectively splits its set of samples into subsets enriched in one class or the other, using the normalized information gain (difference in entropy) as the criterion.

C. Classification and Regression Trees (CART)

Classification and Regression Trees is a data exploration and prediction algorithm developed in the early 1980s by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone. CART is a classification method to which the Bayesian model is a precursor. The CART methodology addresses a number of performance, accuracy and operational problems. Regression-type CART is a tree-based model for continuous variables that uses the sum of squared errors as its splitting criterion. Classification-type CART is for discrete or categorical variables and uses the Gini, Entropy and Twoing measures to produce completely pure nodes. CART builds the tree by recursively splitting the variable space, using the impurity of the variables to determine each split, until the termination condition is met. The main features of the CART algorithm are: it applies binary, or two-way, splits that divide each parent node into exactly two child nodes; the hierarchical interaction of the variables can be seen in its visual display; it handles missing values automatically and intelligently through 'surrogate splitters'; and its rules and models are easy to grasp and easy to apply to new data.
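The splitting criteria used by these tree learners, entropy-based information gain for J48 and the Gini index for classification-type CART, can be sketched as follows. The toy spam/legitimate labels and the candidate split are illustrative assumptions, not values from the paper's dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity, one of CART's classification splitting measures."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# Toy node: 1 = spam, 0 = legitimate. A candidate attribute splits the
# parent into a mixed left subset and a pure right subset.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

print(entropy(parent))                       # mixed node: high entropy
print(information_gain(parent, [left, right]))
print(gini(right))                           # pure node: impurity 0
```

At each node, the learner would evaluate this gain for every candidate attribute and split on the one with the highest value; C4.5 additionally normalizes the gain by the split's own entropy to obtain the gain ratio.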

D. IB1

IB1, or IBL (Instance-Based Learning), is a classifier similar to the K-Nearest Neighbors (K-NN) algorithm [5][6]. IB1 uses a simple distance measure to find the training instance closest to a given test instance and predicts the same class as that training instance. If multiple instances are at the same (smallest) distance from the test instance, the first one found is used [7]. IB1 generates classification predictions using only specific instances. Unlike K-NN, IB1 normalizes its attribute ranges, processes instances incrementally and has a simple policy for tolerating missing values.
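The IB1 scheme just described, range normalization followed by nearest-instance prediction, can be sketched as below. The feature names and the tiny training set are illustrative assumptions, not WEKA's IB1 implementation or the paper's data.

```python
# Minimal IB1-style sketch: normalise attribute ranges to [0, 1] using
# the training data, then predict the class of the single nearest
# training instance, breaking distance ties by first occurrence.

def normalise(train, test):
    """Rescale each attribute to [0, 1] using the training ranges."""
    m = len(train[0])
    lo = [min(row[i] for row in train) for i in range(m)]
    hi = [max(row[i] for row in train) for i in range(m)]
    def scale(row):
        return [(v - l) / (h - l) if h > l else 0.0
                for v, l, h in zip(row, lo, hi)]
    return [scale(r) for r in train], scale(test)

def ib1_predict(train_X, train_y, test_x):
    X, x = normalise(train_X, test_x)
    # Squared Euclidean distance; min() keeps the first instance found
    # on ties, matching IB1's tie-breaking policy.
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X]
    return train_y[dists.index(min(dists))]

# Toy instances: [word frequency, capital run length] with a label.
train_X = [[0.1, 10], [0.2, 15], [2.5, 300], [3.0, 450]]
train_y = ['ham', 'ham', 'spam', 'spam']

print(ib1_predict(train_X, train_y, [2.8, 400]))   # near the spam cluster
```

Without the normalization step, the large-valued capital-run-length attribute would dominate the distance, which is why IB1 rescales attribute ranges before comparing instances.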

III. DATASET DESCRIPTION

The Spam email dataset was gathered from the UCI repository [8]. It has 4601 instances, of which 2788 are legitimate and 1813 are spam mails. Each instance has 58 attributes, of which 57 are continuous and 1 is a nominal class label. Most of the attributes represent the frequency of a given word or character in the email corresponding to the instance. Attribute information: 48 attributes of type word_freq_WORD give the frequency of a word w as a percentage of the words in the email; 6 attributes of type char_freq_CHAR give the frequency of a character c, defined in the same way as word frequency; 3 attributes describe the longest run length, total number and average run length of capital letters; and 1 nominal {0,1} class attribute of type spam indicates whether the e-mail was considered spam (1) or not (0).
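The record layout described above (57 continuous attributes followed by one {0,1} label per comma-separated line) can be parsed as sketched below. The two rows are synthetic stand-ins, not real instances from [8].

```python
import io

# Synthetic stand-in for the spambase file format: 57 comma-separated
# continuous attributes followed by the {0,1} class label.
fake_csv = io.StringIO(
    ",".join(["0.0"] * 57) + ",0\n" +   # a 'legitimate' row
    ",".join(["1.5"] * 57) + ",1\n"     # a 'spam' row
)

def load_spambase(fh):
    X, y = [], []
    for line in fh:
        values = line.strip().split(",")
        X.append([float(v) for v in values[:57]])  # continuous attributes
        y.append(int(values[57]))                  # nominal class label
    return X, y

X, y = load_spambase(fake_csv)
print(len(X[0]), y)   # 57 attributes per instance, labels [0, 1]
```

The same loader would apply to the real spambase.data file, which simply contains 4601 such lines.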

IV. EXPERIMENTAL RESULTS

In this paper, we applied 10-fold cross validation to evaluate the performance of the classifiers. The data mining classification models were developed using WEKA version 3.6. We used the spam mail dataset [8] for the comparative study. Table 1 shows the Accuracy, Error rate, Sensitivity, Precision and Specificity of the MLP, J48, CART and IB1 algorithms. Figure 1 shows a graphical comparison of Accuracy, Figure 2 of Error rate, and Figure 3 of Sensitivity.
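The evaluation procedure, k-fold splitting plus the confusion-matrix metrics reported in Table 1, can be sketched as follows. The label vectors are synthetic; WEKA computes these quantities internally for each classifier.

```python
# Sketch of 10-fold splitting and the metrics used in Table 1:
# accuracy, error rate, sensitivity (TPR), specificity (TNR), precision.

def k_folds(n, k):
    """Yield (train_indices, test_indices) for k-fold cross validation."""
    idx = list(range(n))
    size = n // k
    for f in range(k):
        test = idx[f * size:(f + 1) * size] if f < k - 1 else idx[(k - 1) * size:]
        train = [i for i in idx if i not in test]
        yield train, test

def metrics(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    n = len(y_true)
    return {
        "accuracy":    100.0 * (tp + tn) / n,
        "error_rate":  100.0 * (fp + fn) / n,
        "sensitivity": 100.0 * tp / (tp + fn),   # true positive rate
        "specificity": 100.0 * tn / (tn + fp),   # true negative rate
        "precision":   100.0 * tp / (tp + fp),
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]   # one miss, one false alarm
print(metrics(y_true, y_pred))
print(len(list(k_folds(100, 10))))         # 10 folds
```

In 10-fold cross validation the metrics are computed over the pooled test-fold predictions, so every instance is used exactly once as a test case.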



Figure 4 shows a graphical comparison of Precision, and Figure 5 of Specificity.

Table 1: Comparison of Data Mining Models

Algorithm   Accuracy   Error Rate   Sensitivity   Precision   Specificity
MLP         91.43      8.56         92.78         93.11       89.35
J48         92.97      7.02         94.03         94.40       91.34
CART        92.43      7.56         93.14         94.47       91.31
IB1         90.78      9.21         92.12         92.71       88.69

Figure 1: Comparison based on Accuracy
Figure 2: Comparison based on Error Rate
Figure 3: Comparison based on Sensitivity
Figure 4: Comparison based on Precision
Figure 5: Comparison based on Specificity


V. CONCLUSION AND SCOPE FOR FURTHER ENHANCEMENTS

In this paper, we analysed the classification accuracy of four algorithms to determine which best identifies whether a particular email is spam, using the WEKA data mining tool. The four algorithms, MLP, J48, Simple CART and IB1, were compared on the basis of accuracy, error rate, sensitivity, precision and specificity. Our results show that J48 turned out to be the best classifier. In future work, we plan to include more datasets, with increased or decreased numbers of instances and attributes, and to compare the accuracy of the algorithms on them.

VI. REFERENCES

1. The Radicati Group, "Email Statistics Report, 2009-2013", May 2009. Available online at: http://www.radicati.com/wp/wpontent/uploads/2009/05/emailstats-report-exec-summary.pdf
2. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, USA, 2006.
3. J.-S. R. Jang, S.-T. Sun and E. Mizutani, Neuro-Fuzzy and Soft Computing, Prentice Hall, 1997.
4. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
5. I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
6. D. W. Aha, D. Kibler and M. K. Albert, "Instance-based learning algorithms", Machine Learning, Vol. 6, ISSN 1573-0565, pp. 37-66, 1991. Available from: http://www.springerlink.com/content/kn127378pg361187/fulltext.pdf
7. D. Aha and D. Kibler, "Instance-based learning algorithms", Machine Learning, Vol. 6, pp. 37-66, 1991.
8. UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science. Accessed online at http://www.ics.uci.edu/~mlearn/MLRepository.html
9. M. Tari, B. Minaei, A. Farahi and M. Niknam Pirzadeh, "Prediction of Students' Educational Status Using CART Algorithm, Neural Network, and Increase in Prediction Precision Using Combinational Model", IJCSNS International Journal of Computer Science and Network Security, Vol. 11, No. 6, June 2011.
10. R. Agrawal, T. Imielinski and A. Swami, "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, pp. 914-925, December 1993.
11. S. Youn and D. McLeod, "A Comparative Study for Email Classification", Los Angeles, CA 90089, USA, 2006.
12. S. Ruggieri, "Efficient C4.5", IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 438-444, 2002.

AUTHORS PROFILE

Mr. Suresh Subramanian is working as an IS Analyst at Ahlia University, Bahrain. His research interests are in Data Mining, Web Mining and Internet Technology.

Dr. Sivaprakasam is working as a Professor in Sri Vasavi College, Erode, Tamil Nadu, India. His research interests include Data mining, Internet Technology, Web & Caching Technology, Communication Networks and Protocols, Content Distributing Networks.

