Spam Detection Using Randomized Forest Technique

Considering Behavior of Sender in Spam Mail Detection
S. Naksomboon1, C. Charnsripinyo2 and N. Wattanapongsakorn1,*

1 Computer Engineering Department, King Mongkut's University of Technology Thonburi, 126 Pracha-Utid, Tung-Kru, Bangkok 10140, Thailand
2 Network Technology Laboratory, National Electronics and Computer Technology Center, Klong Luang, Pathumthani 10120, Thailand
* Corresponding author: naruemon@cpe.kmutt.ac.th

Abstract- The number of spam mails has been growing exponentially. Spam increases costs for organizations and annoys email recipients. Spammers always try to find ways to avoid being filtered out by the email system. At the same time, email recipients and network/system administrators need an effective spam mail filtering technique to catch the spam mails. One problem of spam mail filtering is that each user has a different perspective on what counts as spam, so there are many types of spam mails; the challenge is how to detect the various types and forms of spam mails. In this paper, behaviors of spammers are used to customize the filtering rule. The information from the spam messages can also be used to filter spam mails, and it can give higher accuracy than the keyword-based method does. We propose a spam classification approach using the Random Forest algorithm. The Spam Assassin Corpus is selected as the database for classification. It consists of 6,047 email messages, of which 4,150 are legitimate messages and the other 1,897 are spam mails.

Key Words: Spam mail detection, Random Forest, Data classification, Spam Assassin dataset

I. INTRODUCTION
The number of users connected to the wide range of Internet applications grows rapidly every day. The Internet helps smooth business transactions and cooperation and makes communication more efficient. E-mail, or electronic mail, has become one of the popular methods of communication because it is convenient, extremely cheap and easy to obtain. It can carry text, photos and multimedia data. On the other hand, e-mail has become a tool of abuse. Some people use e-mail to attack or disturb systems and email users by sending a large number of large emails. This consumes most of the network bandwidth and may overload the mail server. Some business sectors use e-mail to advertise their goods, which may annoy most email users, who have to waste their time manually removing these emails from their inboxes. Such e-mail is called spam mail. Spam mail can be described as junk e-mail, unsolicited commercial e-mail or unsolicited bulk e-mail. Spam mail causes a great problem to computer users today, and it is increasing at a very fast rate. In 2002, an average American received 2,200 spam mails, and the amount was increasing by 2% per month, leading to up to 3,600 spam mails in 2007. It causes the affected companies an annual loss in profit. Spammers always create new techniques to evade the email filtering process, so it is not easy to catch them. As a result, the techniques to detect spam mail have to be kept up to date and effective. In general, the content of the email message is used to classify spam mail. In this paper, we combine behavior characteristics of the spammer with the body of the email message to determine spam mail. We utilize the spammer behavior as well as keywords from the email message to detect spam mail. This method focuses on behavior-based features, which are likely to be more effective than the keyword-based method alone because spammers always change the keywords or content of their emails. We select the Random Forest algorithm to classify spam mail in our approach.

This paper is organized as follows. In Section 2, we present the literature review. Section 3 describes the spam mail classification method. Section 4 describes the detection approach. Our experiment and simulation results are presented in Section 5. Finally, we conclude our research work in Section 6.

II. RELATED WORK
Based on the literature survey, there are many algorithms that can classify spam mails, such as Neural Network, LVQ Network, Genetic Algorithm and Bayesian Network. In 1998, Sahami et al. [1] proposed a Bayesian network approach to classify spam mail. A Bayesian network is a graphical and probabilistic method that can be learned automatically from the input data to filter or eliminate spam mail. They considered the content of the email messages and the domain-specific features that can be incorporated into the probabilistic models.

Then in 2002, Vinther [2] proposed an approach using an artificial neural network to classify spam mail. An artificial neural network, or ANN, is an automatic method for detecting spam mail in which the training and updating of the classification rules can be done automatically. The input data to the neural network is a list of words present in the emails. They used a training dataset consisting of 168 good mails and 186 spam mails, while the testing dataset contained 204 good mails and 337 spam mails.

Later in 2005, Chuan et al. [3] proposed an LVQ-network-based neural network approach to classify spam mail, which combined subclasses into a single class and formed complex class boundaries for the design of an anti-spam mail neural network model. They divided the spam mails into several subclasses according to spam mail content. The features of closely related words can then be used to classify spam mail easily. They considered 1,000 emails from the corpus, including 580 spam mails and 420 legitimate mails.

Then in 2006, Sanpakdee et al. [4] proposed a genetic algorithm approach to classify spam mail. A genetic algorithm, or GA, is a method for finding reasonable solutions to complex problems. They generated a spam mail prototype for filtering. Spam mails are encoded as chromosomes, subjected to genetic operations, i.e., crossover and mutation, and evaluated by the objective function. They used 1,097 spam mails and 300 ham/legitimate mails. Their system consisted of two main processes, keyword extraction and genetic algorithm processing. First, the input emails were passed into the keyword extraction process. Then they entered the genetic algorithm, which created the spam prototype.

Later in 2008, Chih-Hung et al. [5] proposed a back-propagation neural network (BPNN) to classify spam mail. The BPNN is a supervised learning algorithm for artificial neural networks. They considered spammers' behavior instead of keywords from the email content. They selected the header and syslog [10] to represent the spam behaviors. They collected the data from an MTA (mail transfer agent), which provides headers, message bodies, and syslog.

The previously mentioned research papers used different sets of input data and algorithms for spam mail classification. Most of them considered keywords from the email content and used different algorithms such as Bayesian network, ANN, and GA, while very few recent papers considered the behavior of users extracted from the email header and syslog.
Collecting header features from the MUAs/MTAs is a complex process. None of the papers compared the classification/detection results between using keywords and using the users' behavior. Thus, in this paper, we evaluate and compare the two approaches: using only keywords, and using both keywords and behavior features from the email header. We select the keyword features using the T-test statistic, and combine them with the spammers' behavior features such as hyperlinks to commercial web sites, non-office-hour mail sending, etc. Our behavior features are easy to collect from an actual email system, because they can be collected from the header and body of emails instead of from the syslog. We use the Spam Assassin corpus [7], which contains various emails including many different hams and spams. We apply the Random Forest algorithm to classify the emails.

III. CLASSIFICATION METHOD
A decision tree is a classification algorithm that is represented as a tree, where the structure of each classification rule is IF ... THEN ... A decision tree with its decision results can be constructed very fast, and it is easy to understand. First we must define the possible events and draw the tree from the root node at the top down to the branch nodes. Each node holds a value obtained from the gain function. When an input is given, evaluation begins at the root node and passes through the nodes until it reaches a leaf node. The C4.5 decision tree is a well-known data mining algorithm that classifies a dataset by using many tree nodes. It builds the tree with a divide-and-conquer algorithm. A decision tree tends to over-fit on a large dataset. The classification model of a decision tree is constructed by extracting rules from the training set. These rules are used to predict and classify the testing set, or records that do not have an answer class. The decision tree predicts the answer class by starting at the root and traversing to a leaf node. The result of the classification and prediction is found at the leaf node. Moreover, C4.5 is a well-known and efficient technique for training this algorithm.
Fig. 1. Decision tree
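As a minimal sketch of how an IF ... THEN rule tree is traversed, the following Python fragment assumes a tree shaped like Fig. 1 (root A, internal nodes B and C, leaves D to G). The features tested at each node, the thresholds and the leaf labels are hypothetical and chosen only for illustration; they are not the rules learned in this paper.

```python
# Hypothetical tree shaped like Fig. 1: A is the root, B and C are internal
# nodes, D, E, F and G are leaves. Features and thresholds are made up.
TREE = {
    "name": "A", "feature": "Http", "threshold": 3,
    "low": {"name": "B", "feature": "To", "threshold": 2,
            "low": {"name": "D", "label": "Legitimate"},
            "high": {"name": "E", "label": "Spam"}},
    "high": {"name": "C", "feature": "Time", "threshold": 6,
             "low": {"name": "F", "label": "Spam"},
             "high": {"name": "G", "label": "Legitimate"}},
}


def classify(node, sample):
    """Walk from the root to a leaf; return the leaf label and visited nodes."""
    path = [node["name"]]
    while "label" not in node:
        # IF feature <= threshold THEN follow the "low" branch, ELSE "high".
        branch = "low" if sample[node["feature"]] <= node["threshold"] else "high"
        node = node[branch]
        path.append(node["name"])
    return node["label"], path


print(classify(TREE, {"To": 1, "Http": 2, "Time": 14}))
```

Each root-to-leaf path corresponds to one IF ... THEN rule, which is why a trained tree is easy to read.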
Figure 1 shows a simple decision tree with node A as its root. Node B or C is reached by comparing the value from node A with the gain value of each node, and so on until one of the leaf nodes D, E, F or G is reached.

Random Forest is an effective data mining algorithm because it can fix the over-fitting problem on large datasets and can train and test quickly on large and complex datasets. Random Forest is robust to noise. Each tree is constructed from random data drawn with replacement from the training dataset; two-thirds of this data is used for training, and the rest is used to estimate the error. The model can evaluate the important factors used in classification, and it consists of un-pruned rules that are created and evaluated with the training dataset. The classification model of Random Forest contains many classification trees, each different from the others. Each classification tree votes for a class. Finally, the answer class is assigned based on the highest vote.

IV. DETECTION APPROACH
A. Our Design Process Framework
Our design process framework can be divided into two parts. The first part is the preprocess part. We use the Spam Assassin corpus for our experiment. It has 6,047 various email messages. First we split the words of each email message and record the frequency of occurrence of each word in the email. Then we select the words that are more likely to indicate whether an email is spam or legitimate, using the T-test statistical technique [9]. The formula of the T-test is as follows.

$$Z_0 = \frac{\left|\bar{X}_l - \bar{X}_s\right|}{\sqrt{\dfrac{\sigma_l^2}{n_l} + \dfrac{\sigma_s^2}{n_s}}} \qquad (1)$$
In the formula, $\bar{X}_l$ is the mean for legitimate mail, $\bar{X}_s$ is the mean for spam mail, $\sigma_l^2$ is the variance for legitimate mail, $\sigma_s^2$ is the variance for spam mail, $n_l$ is the number of legitimate mails and $n_s$ is the number of spam mails.

The seven spammer behavior features used for spam mail classification are taken from the email header: the number of receivers (To), the number of referred receivers (Carbon copy, Cc), the number of bypassed receivers (Blind carbon copy, Bcc), the number of hyperlinks, the number of HTML presentations, the number of transmission relays and the time of sending.
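The keyword selection step can be sketched in Python as follows. This is only an illustration of the Z0 statistic defined above, under several assumptions that the paper does not state: words are split with a simple regular expression, the sample variance (`statistics.variance`) is used, and the helper names `word_counts`, `z0` and `select_keywords` are hypothetical.

```python
import math
import re
from collections import Counter
from statistics import mean, variance


def word_counts(text):
    """Split an email body into lowercase words and count each one."""
    return Counter(re.findall(r"[a-z$]+", text.lower()))


def z0(legit, spam):
    """Z0 for one word, given its per-email frequencies in each class."""
    denom = math.sqrt(variance(legit) / len(legit) + variance(spam) / len(spam))
    diff = abs(mean(legit) - mean(spam))
    if denom == 0:
        # Assumption: a word with zero variance in both classes separates
        # perfectly if the means differ, and not at all otherwise.
        return math.inf if diff > 0 else 0.0
    return diff / denom


def select_keywords(legit_emails, spam_emails, k):
    """Return the k words with the highest Z0 (needs >= 2 emails per class)."""
    legit = [word_counts(e) for e in legit_emails]
    spam = [word_counts(e) for e in spam_emails]
    vocab = set().union(*legit, *spam)
    score = {w: z0([c[w] for c in legit], [c[w] for c in spam]) for w in vocab}
    return sorted(vocab, key=lambda w: -score[w])[:k]


legit_emails = ["please review the language report",
                "meeting language notes", "lunch today"]
spam_emails = ["order now order today", "order $ $ deal", "cheap deal order"]
print(select_keywords(legit_emails, spam_emails, k=1))
```

A word such as "today", which appears equally often in both classes, gets a Z0 of zero and is never selected, while a word that appears only in spam gets a high score.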
Fig. 2. T-Test statistics, z0

In Figure 2, the x-axis describes the frequency of each word in each email and the y-axis describes the frequency of emails that have this word. For example, if the frequencies of occurrence of the word "get" in spam and legitimate mail are 10 and 12 respectively, the word "get" is unlikely to be a main feature for characterizing spam mail and legitimate mail. On the other hand, if the frequencies are 2 and 10 for the word "order" in spam mail and legitimate mail respectively, the word "order" is likely to be significant in distinguishing spam from legitimate mail. Thus the word "order" should be selected as one of the features to classify spam mails. The T-test statistical technique [9] can separate legitimate mail and spam mail when the difference between the averages of spam and legitimate mail is high and the variances are low. In Figure 2, the left-hand side shows a low z0, where the graphs overlap, so we cannot separate spam from legitimate emails. The right-hand side of the figure shows a high z0, with high contrast between spam and legitimate mails.

Furthermore, we consider the spammer behavior in our process. Spammer behaviors are the stratagems that spammers use in delivering spam mails. These behaviors can help us identify spam mail effectively. The second part is the classification part; we classify spam mail by using the Weka program [8], with Random Forest as the classification algorithm. The diagram shown below describes our spam detection workflow.

Fig. 3. Our spam detection framework. 1. Preprocess part: from the Spam Assassin corpus, split words and keep behavior features, record the frequency of each word, and select words by using the T-test statistic (z0). 2. Classification part: classify with Weka as spam or legitimate.
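To make the classification part more concrete, the following Python sketch shows the two ideas behind Random Forest described in Section III: each tree is trained on a sample drawn with replacement, and the final class is chosen by majority vote. It is a deliberately simplified stand-in, not the Weka implementation used in this paper: every "tree" here is a single-split decision stump, and the function names and parameters are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Find the single (feature, threshold) split with the fewest errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]


def train_forest(rows, labels, n_trees=25, n_features=3, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Each stump votes for a class; the class with the most votes wins."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Rows of Table I: To, Cc, Bcc, Html, Http, Relay, Time.
X = [[1, 0, 0, 0, 2, 5, 14], [1, 1, 0, 0, 0, 5, 13], [1, 0, 0, 0, 2, 3, 8],
     [3, 0, 0, 0, 2, 6, 23], [3, 3, 0, 1, 2, 3, 9], [1, 0, 0, 1, 62, 2, 5]]
y = ["Legitimate"] * 3 + ["Spam"] * 3
forest = train_forest(X, y)
print(predict(forest, [3, 0, 0, 1, 40, 2, 4]))
```

Six rows are far too few to train a useful model; they are used here only to show the data layout.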
B. Email Message Features
Because spammers always try to avoid the spam mail filtering process by changing the keywords used in the email message, a keyword-only detection approach is not efficient at catching them. We therefore look for spam features that reflect spammer behavior. We select seven behaviors of the spammer for spam mail classification; they are recorded in the columns of Table I.

We consider the following spammer behaviors.
Quantity of recipients. A spammer may generate recipient addresses at random or use a mailing list to send to many recipients.
Long transmission relay. A spammer may use a long chain of transmission relays to avoid the spam filtering process.
Non-office hour. Spam mail is often sent during non-office hours, such as 02.00 AM-05.00 AM, because it is easier to consume the network bandwidth without the system administrator's supervision.
Hyperlink. Commercial spam mail often contains a hyperlink that leads to the sender's business website to advertise their products.
HTML. For advertisement, spammers send emails that contain their web page to attract the user's interest.

We keep the spammer features from each email message and record them as our input data, as shown in Table I. Each row in the table shows the spammer behavior features from one email. The keyword features that we select by using the T-test statistical technique are shown in Table II. We split the words in each email message and record their frequency of occurrence, as shown in each row.
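The behavior features above could be collected from a raw message roughly as follows. This is a sketch using Python's standard `email` package, not the extraction code used in this paper. The mapping of the Table I columns is an assumption: "Relay" is taken as the number of `Received` headers, "Http" as the number of `http://` or `https://` links in the body, "Html" as the number of `text/html` parts, and "Time" as the hour of the `Date` header. The function name `behavior_features` and the sample message are hypothetical.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def behavior_features(raw):
    """Extract the seven behavior features from one raw email message."""
    msg = message_from_string(raw)

    def recipients(header):
        # Count the addresses listed in every occurrence of the header.
        return len(getaddresses(msg.get_all(header, [])))

    text, html_parts = "", 0
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        text += payload.decode("utf-8", errors="replace")
        if part.get_content_type() == "text/html":
            html_parts += 1

    try:
        hour = parsedate_to_datetime(msg.get("Date", "")).hour
    except (TypeError, ValueError):
        hour = -1  # assumption: -1 marks a missing or unparsable date

    return {
        "To": recipients("To"), "Cc": recipients("Cc"), "Bcc": recipients("Bcc"),
        "Html": html_parts,
        "Http": len(re.findall(r"https?://", text, flags=re.IGNORECASE)),
        "Relay": len(msg.get_all("Received", [])),
        "Time": hour,
    }


raw = (
    "Received: from a.example by b.example; Mon, 2 Sep 2002 03:14:00 +0700\n"
    "Received: from c.example by a.example; Mon, 2 Sep 2002 03:13:00 +0700\n"
    "From: seller@example.com\n"
    "To: u1@example.org, u2@example.org\n"
    "Cc: u3@example.org\n"
    "Date: Mon, 2 Sep 2002 03:15:00 +0700\n"
    "Subject: offer\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><a href='http://shop.example/a'>buy</a> "
    "<a href='http://shop.example/b'>now</a></html>\n"
)
print(behavior_features(raw))
```

Each returned dictionary corresponds to one row of Table I, without the class label.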
TABLE I
FEATURES OF SPAMMERS' BEHAVIOR

To   Cc   Bcc   Html   Http   Relay   Time   Class
1    0    0     0      2      5       14     Legitimate
1    1    0     0      0      5       13     Legitimate
1    0    0     0      2      3       8      Legitimate
3    0    0     0      2      6       23     Spam
3    3    0     1      2      3       9      Spam
1    0    0     1      62     2       5      Spam
We separate the mails into three groups, one for each experimental case. In the first case, we select 750 legitimate mails and 750 spam mails for training and use the other 1,500 mails for testing. From the results shown in Figure 4, we obtain the highest detection accuracy of 97.4% when we use 300 keywords from the email messages together with the 7 spammer behavior features.
TABLE II
EMAIL KEYWORDS AND THEIR FREQUENCIES

$    language   order   Business   Mail   Class
0    4          0       0          0      legitimate
0    1          0       0          0      legitimate
0    0          0       0          0      legitimate
1    0          5       6          3      spam
8    0          1       2          4      spam
67   0          0       1          5      spam

Fig. 4. Results from case 1
V. EXPERIMENTAL RESULTS & DISCUSSION
In this paper, we compare and evaluate the results obtained from applying the Random Forest algorithm to classify spam mail. We consider the percentages of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), as shown in Table III.

TABLE III
METRICS FOR PERFORMANCE EVALUATION

Count                        Predicted class
                             Spam   Legitimate
Actual class   Spam          TP     FN
               Legitimate    FP     TN

TP (True Positive): the number of spam mails correctly predicted as spam.
TN (True Negative): the number of ham mails correctly predicted as ham.
FP (False Positive): the number of ham mails predicted as spam.
FN (False Negative): the number of spam mails predicted as ham.

$$\text{Spam Detection Rate (SDR)} = \frac{|TP|}{|TP| + |FN|}$$

$$\text{Legitimate Detection Rate (LDR)} = \frac{|TN|}{|TN| + |FP|}$$

$$\text{Total Detection Rate (TDR)} = \frac{|TP| + |TN|}{|TP| + |FN| + |TN| + |FP|}$$

In the second case, we select 1,125 legitimate mails and 750 spam mails for training and use the other 1,875 mails for testing. From the results shown in Figure 5, we obtain the highest detection accuracy of 97.3% when we use 1000 keywords together with the 7 spammer behavior features.

Fig. 5. Results from case 2
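The detection rates defined in this section can be computed from the actual and predicted classes as in the following Python sketch. The function names and the sample labels are hypothetical and are not taken from the experiments.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, FN, FP and TN, treating `positive` as the spam class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1  # spam correctly predicted as spam
            else:
                fn += 1  # spam predicted as ham
        elif p == positive:
            fp += 1      # ham predicted as spam
        else:
            tn += 1      # ham correctly predicted as ham
    return tp, fn, fp, tn


def detection_rates(tp, fn, fp, tn):
    """Return SDR, LDR and TDR as fractions between 0 and 1."""
    return {
        "SDR": tp / (tp + fn),
        "LDR": tn / (tn + fp),
        "TDR": (tp + tn) / (tp + fn + tn + fp),
    }


actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]
print(detection_rates(*confusion_counts(actual, predicted)))
```

Multiplying each value by 100 gives the percentages reported in the result tables.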
In the third case, we select 1,500 legitimate mails and 750 spam mails for training and use the other 2,250 mails for testing. From the results shown in Figure 6, the best detection accuracy obtained is 97.4% when we use 200 keywords together with the 7 spammer behavior features.
For our experiment, we use the Weka tool [8] and the Random Forest algorithm to classify the spam mail. We use the Spam Assassin corpus in our experiment, which contains 4,150 legitimate mails and 1,897 spam mail messages.
Fig. 6. Results from case 3

TABLE IV
RESULTS FROM CASE 3: USING ONLY KEYWORD FEATURES

Number of emails   SDR (%)   LDR (%)   TDR (%)
50                 93.7      96.1      95.3
100                94.5      96.6      95.9
200                96.8      97.1      97.0
300                97.2      96.3      96.6
500                96.1      96.9      96.7
1000               97.5      96.9      97.1
2500               96.8      96.1      96.4

TABLE V
RESULTS FROM CASE 3: USING KEYWORD AND BEHAVIOR FEATURES

Number of emails   SDR (%)   LDR (%)   TDR (%)
50                 96.8      95.9      96.2
100                96.5      97.4      97.1
200                97.6      97.3      97.4
300                96.8      96.7      96.8
500                97.2      97.1      97.1
1000               96.4      97.5      97.1
2500               96.8      97.0      96.9

Tables IV & V present the detection results from the third case, where we use a larger email testing dataset. The results are evaluated in terms of Spam Detection Rate (SDR), Legitimate Detection Rate (LDR) and Total Detection Rate (TDR). We consider various numbers of spam mails and legitimate mails, namely 50, 100, 200, 300, 500, 1000 and 2500 emails, as the training dataset, and the same amount of different data for the testing part. From the results shown in Table IV, where only keywords are considered, the highest detection rate is obtained when 1000 keywords are used. However, when the spam behaviors are considered, as shown in Table V, the highest detection rate is obtained when only 200 keywords are investigated. Essentially, considering the spammer behaviors shown in the email header can greatly improve the spam mail detection rate and efficiency.

VI. CONCLUSION
This paper presents a spam mail filtering approach using the Random Forest algorithm with spammer behavior features in combination with email keywords chosen by the T-test statistical technique. From the results, the T-test values can help choose efficient keywords for spam mail classification. Essentially, with a fixed number of keywords, adding the spam mail behavior features enhances the detection rates in all cases, with different numbers of keywords and different dataset sizes. In other words, using a few behavior features can reduce the number of keywords used for email classification without decreasing detection accuracy, hence giving high detection speed and low memory and CPU consumption. Lastly, we can conclude that the Random Forest algorithm is suitable for classifying spam mail considering behavior features and keywords chosen by the T-test statistic.

References
[1] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz, 1998, A Bayesian Approach to Filtering Junk E-mail, Computer Science Department, Stanford University.
[2] Michael Vinther, Intelligent Junk Mail Detection Using Neural Network [Online], Available: http://logicnet.dk/reports/JunkDetection/JunkDetection.htm
[3] Zhan Chuan, Lu Xianliang, Hou Mengshu, Zhou Xu, 2005, A LVQ-Based Neural Network Anti-Spam Email Approach, ACM SIGOPS Operating Systems Review, Volume 39, pp. 34-39.
[4] Usarat Sanpakdee, Aranya Walairacht, Somsak Walairacht, 2006, Adaptive Spam Mail Filtering Using Genetic Algorithm, Advanced Communication Technology, The 8th International Conference, Volume 1, pp. 441-445.
[5] Wu Chih-Hung, Tsai Chiung-Hui, 2008, Robust Classification for Spam Filtering by Back-propagation Neural Networks Using Behavior-based Features, Department of Electrical Engineering, National University of Kaohsiung.
[6] Ian H. Witten & Eibe Frank, Data Mining Practical Machine Learning Tools and Techniques, 2nd ed., Elsevier Inc., 2005, pp. 189-199.
[7] Spam assassin corpus [Online], Available: http://spamassassin.apache.org/publiccorpus/
[8] Weka 3.7.0 tools [Online], Available: http://www.cs.waikato.ac.nz/ml/weka/
[9] Douglas C. Montgomery, George C. Runger, Applied Statistic and Probability for Engineer, 3rd ed., John Wiley & Sons Inc., 2003, pp. 300-305.
[10] Syslog [Online], Available: http://en.wikipedia.org/wiki/Syslog
