
Spam Email Detection: A Comparative Study

Jincheng Zhang, Yan Liu


December 6, 2013
Supervised by Prof. Laiwan Chan.

Contents

1 Introduction
2 Data Set
  2.1 Data Set Source
  2.2 Data Set Description
3 Approaches
  3.1 Naïve Bayes Classification
  3.2 Support Vector Machine
  3.3 Artificial Neural Network
  3.4 Combined Method
4 Performance Evaluation
  4.1 Training Set and Testing Set
  4.2 Performance Metrics
  4.3 Experimental Results
    4.3.1 Different Kernel Functions of SVM
    4.3.2 Different Spam Detection Algorithms
5 Conclusion

Chapter 1
Introduction
With the rapid development of information technology, email has become an important way for people to communicate and share information. According to a recent email statistics report [1], the total number of worldwide email accounts is expected to increase from 3.9 billion in 2013 to over 4.9 billion by the end of 2017, and the number of daily emails sent and received is expected to grow from 182.9 billion in 2013 to over 200 billion by the end of 2017. However, according to a spam trends and statistics report by KASPERSKY, over 70% of global email traffic came from spam in Q2 2013 [2].
Spam emails cause considerable inconvenience, annoyance and potential danger to users. They waste limited mailbox space and network bandwidth, and users have to spend time reading and identifying spam. Moreover, users may suffer significant financial losses because of spam: in the US alone, spam emails have been reported to cause direct financial losses in excess of 1 billion dollars per year [3]. In addition, an email service provider that cannot detect spam may lose its users. Motivated by these observations, spam detection is important for both users and email service providers.
In this project, we address the spam detection problem (i.e., classifying emails as spam or non-spam) using three commonly used machine learning techniques: Naïve Bayes Classification (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN). For the SVM technique, we try four types of kernel functions: the linear, polynomial, radial basis and sigmoid kernels. Furthermore, we combine the outputs of NB, SVM and ANN to see how the combined method performs. Finally, we compare the performance of NB, SVM, ANN and the combined method, and give some guidelines on how to select machine learning algorithms when conducting spam detection.


The remainder of this report is structured as follows. Chapter 2 describes the data set used in this project. Chapter 3 presents the principles, settings and implementation details of the three machine learning techniques. Chapter 4 reports the main results and discusses how to select spam detection algorithms. Chapter 5 concludes the report.

Chapter 2
Data Set
2.1 Data Set Source

In this project, we use the UCI Spambase data set [4], a classical data set for the spam detection problem. Its spam emails came from the postmaster and from individuals who had filed spam, while its non-spam emails came from filed work and personal emails. General information about this data set is shown in Figure 2.1.

Figure 2.1: UCI Spambase Data Set

2.2 Data Set Description

The UCI Spambase data set has 4601 instances. Among them, 1813 instances are spam and 2788 instances are non-spam. For each instance, there
are 58 attributes.
The first 48 attributes are continuous real numbers ranging from 0 to 100, with the name word_freq_WORD. The value of each attribute is the percentage of words in the email that match WORD, as defined in equation (2.1):

\[
100 \times \frac{\text{number of times the WORD appears in the e-mail}}{\text{total number of words in the e-mail}} \tag{2.1}
\]

A word in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
Right after the first 48 attributes, there are 6 attributes which are continuous real numbers ranging from 0 to 100, with the name char_freq_CHAR. The value of each attribute is the percentage of characters in the email that match CHAR, as shown in equation (2.2):

\[
100 \times \frac{\text{number of CHAR occurrences}}{\text{total number of characters in the e-mail}} \tag{2.2}
\]

Then we have 1 continuous real attribute with the name capital_run_length_average. The value of this attribute is the average length of uninterrupted sequences of capital letters; this value is always at least 1.
Similarly, we have 1 continuous integer attribute with the name capital_run_length_longest. The value of this attribute is the length of the longest uninterrupted sequence of capital letters.
Next is a continuous integer attribute with the name capital_run_length_total. The value of this attribute is the sum of the lengths of the uninterrupted sequences of capital letters, i.e., the total number of capital letters in the email.
The last attribute is a nominal {0, 1} class attribute: 1 denotes that the email is considered spam, and 0 denotes that the email is non-spam.
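
As a rough illustration only, the sketch below shows how Spambase-style attributes could be derived from raw email text; the helper function, the chosen words and characters, and the tokenization rule are assumptions for this example, not the UCI extraction code.

```python
# Rough sketch of deriving Spambase-style attributes from raw email text.
# The word/char lists and the helper itself are illustrative assumptions.
import re

def spambase_style_features(text, words=("free", "money"), chars=("!", "$")):
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]   # alphanumeric "words"
    n_words = len(tokens) or 1
    n_chars = len(text) or 1

    word_freq = {w: 100.0 * sum(t.lower() == w for t in tokens) / n_words for w in words}
    char_freq = {c: 100.0 * text.count(c) / n_chars for c in chars}

    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]          # capital-letter runs
    cap_avg = sum(runs) / len(runs) if runs else 0
    cap_longest = max(runs) if runs else 0
    cap_total = sum(runs)
    return word_freq, char_freq, cap_avg, cap_longest, cap_total

print(spambase_style_features("FREE money!! Claim your FREE prize NOW"))
```
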
In the training set, the numbers of spam and non-spam emails should not differ too much. For example, if the number of spam emails is much larger than the number of non-spam emails in the training set, then the classification model might learn to always classify the instances in the testing set as 1, simply because that would be true most of the time for such a skewed training set. Notice that in the UCI Spambase data set, the first 1813 instances are all spam and the remaining instances are all non-spam. Therefore the UCI Spambase data set cannot be used directly as the input of the training algorithm; details on how to divide the training set and testing set are described in Chapter 4.

Chapter 3
Approaches
3.1 Naïve Bayes Classification

Naïve Bayes classification is one of the most frequently used supervised learning algorithms for text categorization [8], [9]. This approach begins with a statistical analysis of a set of emails that have already been labeled as spam or non-spam (i.e., the training set) to obtain the probability distribution of each attribute. When a new email arrives, we use the information gathered from the training set to compute the probability that the email is spam and the probability that it is non-spam, and we classify the email into the class with the higher probability.
Given an input feature vector $x = (x_1, x_2, \ldots, x_n)$ of an email, where $x_j$ is the value of the $j$th attribute and $n$ is the number of attributes in the data set, let $Y$ denote the class to be predicted, with $Y = y_i$, where $y_i$ is 0 or 1 (0 denotes non-spam and 1 denotes spam). According to Bayes' rule, the probability that the input vector $x$ belongs to class $y_i$ is as follows:

\[
P(Y = y_i \mid x) = \frac{P(Y = y_i)\, P(x \mid Y = y_i)}{P(x)} \tag{3.1}
\]

where $P(x)$ denotes the probability that a randomly picked email has vector $x$ as its representation, $P(Y = y_i)$ is the probability that a randomly picked email is from class $y_i$, and $P(x \mid Y = y_i)$ denotes the probability that a randomly picked email belonging to class $y_i$ has $x$ as its representation. In Naïve Bayes Classification, all features are assumed to be conditionally independent given the class $y_i$. Thus, $P(x \mid Y = y_i)$ can be decomposed into

\[
P(x \mid Y = y_i) = \prod_{j=1}^{n} P(x_j \mid Y = y_i) \tag{3.2}
\]

Therefore, determining the class of an input vector $x$ with the NB classifier can be formulated as follows:

\[
y = \arg\max_{y_i} P(Y = y_i \mid x) \tag{3.3}
\]

The input vector $x$ is classified as spam if $P(Y = 1) \prod_{j=1}^{n} P(x_j \mid Y = 1) > P(Y = 0) \prod_{j=1}^{n} P(x_j \mid Y = 0)$; otherwise, $x$ is classified as non-spam.
In this project, we implement Naïve Bayes Classification by modifying existing code from [13].
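
Since the project's implementation is a modified version of the code in [13] and is not reproduced here, the following is only a minimal sketch of the same idea using scikit-learn's Gaussian Naïve Bayes, which models each continuous attribute with a class-conditional Gaussian; the feature and label arrays are placeholders.

```python
# Minimal sketch of Naive Bayes spam classification with scikit-learn.
# This only illustrates equations (3.1)-(3.3) on placeholder data;
# it is not the project's actual code (which is based on [13]).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X_train = rng.random((100, 57))            # placeholder: 57 Spambase attributes
y_train = rng.integers(0, 2, 100)          # placeholder labels: 1 = spam, 0 = non-spam

nb = GaussianNB()                          # estimates P(Y) and Gaussian P(x_j | Y)
nb.fit(X_train, y_train)
print(nb.predict(X_train[:5]))             # argmax_y P(Y=y) * prod_j P(x_j | Y=y)
```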

3.2 Support Vector Machine

SVM is extensively used in spam detection [5], [6]. The basic idea behind SVM is to find the optimal separating hyperplane that gives the maximum margin between two different classes (e.g., {0, 1}). Based on this idea, spam filtering can be viewed as a simple SVM application, i.e., binary classification of emails as spam or non-spam.
Given a set of training samples $X = \{(x_i, y_i)\}$, where $x_i \in \mathbb{R}^m$ and $y_i \in \{0, 1\}$ is the corresponding output for the $i$th training sample (here 1 represents spam and 0 stands for non-spam; for the optimization below the labels are mapped to $\{-1, +1\}$), the SVM requires the solution of the following optimization problem:
\[
\min_{w, b, \xi} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \tag{3.4}
\]
\[
\text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0
\]
Here the training vectors $x_i$ are mapped into a higher dimensional space by the function $\phi$; $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$ is called the kernel function, and $C > 0$ is the penalty parameter of the error term. More and more researchers pay attention to SVM-based classifiers for spam detection, since their demonstrated robustness and ability to handle large feature spaces make them particularly attractive for this task.
In general, four types of kernel functions (linear, polynomial, RBF and sigmoid) are frequently used with SVM. They are defined as follows:
1. Linear: $K(x_i, x_j) = x_i^T x_j$
2. Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$
3. Radial Basis Function (RBF): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$
4. Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$
In this project, we have tried all four of the above kernels to evaluate their performance on the spam detection problem. We use the LIBSVM library [12] to implement the kernel-based SVMs that conduct spam classification.
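
As an illustration only, scikit-learn's SVC class wraps LIBSVM, so trying the four kernels could look like the sketch below; the data arrays are placeholders and the hyperparameter values (C, gamma, degree, coef0, corresponding to $C$, $\gamma$, $d$ and $r$ above) are arbitrary rather than the settings used in this project.

```python
# Sketch of fitting the four kernel types with scikit-learn's SVC (a LIBSVM wrapper).
# Placeholder data; C, gamma, degree and coef0 map to C, gamma, d and r above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((200, 57))
y_train = rng.integers(0, 2, 200)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale", degree=2, coef0=1.0)
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_train, y_train))   # training accuracy for each kernel
```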

3.3 Artificial Neural Network

The artificial neural network is also widely used for spam detection [10], [11]. The network we adopt in this project is a non-linear feed-forward neural network with the architecture shown in Figure 3.1.

Figure 3.1: Architecture of the Feed-forward Artificial Neural Network

There are three layers in the network: an input layer, a hidden layer and an output layer. The input layer has 57 inputs, the hidden layer has 10 neurons, and the output layer has one neuron. All neurons in the hidden layer and the output layer use the sigmoid activation function shown in equation (3.5), so the output value ranges from 0 to 1. In the data set, 1 represents spam and 0 represents non-spam; the email is classified as spam if the output value is above 0.5, and as non-spam otherwise.
\[
f(x) = \frac{1}{1 + e^{-x}} \tag{3.5}
\]

The Levenberg-Marquardt algorithm [7] is adopted as the learning algorithm because it is the fastest backpropagation algorithm in the MATLAB toolbox.
We use MATLAB as the development platform due to the ample functions
in the MATLAB Neural Network Toolbox.
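
The network in this project is built with the MATLAB Neural Network Toolbox; purely as a hedged Python sketch of the same 57-10-1 sigmoid architecture, one could use scikit-learn's MLPClassifier, noting that it does not offer Levenberg-Marquardt training, so an L-BFGS solver is substituted here and the data arrays are placeholders.

```python
# Sketch of a 57-10-1 feed-forward network with sigmoid (logistic) units.
# The project trains with Levenberg-Marquardt in MATLAB; scikit-learn does not
# provide that optimizer, so lbfgs is used instead purely for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.random((200, 57))            # placeholder features
y_train = rng.integers(0, 2, 200)          # placeholder labels

ann = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    solver="lbfgs", max_iter=1000, random_state=0)
ann.fit(X_train, y_train)
print(ann.predict(X_train[:5]))            # output above 0.5 is mapped to spam (1)
```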

3.4 Combined Method

Beyond the algorithms mentioned above, we propose a simple method that combines NB, SVM and ANN when conducting spam detection. Assume the predicted values of NB, SVM and ANN are denoted by $y_{NB}$, $y_{SVM}$ and $y_{ANN}$ respectively. Obviously, $y_{NB}, y_{SVM}, y_{ANN} \in \{0, 1\}$. The classification result of the combined method is then as follows:
\[
y_{\text{combined}} =
\begin{cases}
1 & \text{if } \frac{y_{NB} + y_{SVM} + y_{ANN}}{3} \ge 0.5 \\
0 & \text{if } \frac{y_{NB} + y_{SVM} + y_{ANN}}{3} < 0.5
\end{cases} \tag{3.6}
\]
The intuition behind this combined method is that if a majority of the three machine learning algorithms (i.e., NB, SVM and ANN) classify an email as spam, then the email is very likely to be spam. In our experiments, we show that this combined method indeed improves the performance of spam detection.
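
A minimal sketch of the majority vote in equation (3.6), assuming the three per-email predictions are already available as {0, 1} arrays:

```python
# Majority vote over the three {0,1} predictions, as in equation (3.6):
# an email is labeled spam when at least two of the three classifiers say spam.
import numpy as np

def combine(y_nb, y_svm, y_ann):
    votes = np.asarray(y_nb) + np.asarray(y_svm) + np.asarray(y_ann)
    return (votes / 3.0 >= 0.5).astype(int)

print(combine([1, 0, 1], [1, 0, 0], [0, 1, 1]))   # -> [1 0 1]
```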


Chapter 4
Performance Evaluation
4.1 Training Set and Testing Set

As discussed in Chapter 2, we cannot naively select the first K instances of the original UCI Spambase data set as the training set, because the spam and non-spam instances are grouped together rather than mixed. Thus, we have to re-order the instances in the original data set to obtain a new data set in which spam and non-spam instances are distributed relatively evenly. In the data set, 1813 instances are spam and the remaining 2788 instances are non-spam, so the ratio between the number of spam instances and the total number of instances is:
\[
\frac{1813}{1813 + 2788} \approx 0.4
\]
We keep the ratio between the number of spam instances and the total number of instances at around 0.4 in both the training set and the testing set, and let the ratio between the sizes of the training set and the testing set be around 2:1. Based on this division, we have 3000 instances in the training set, of which 1154 are spam, and 1601 instances in the testing set, of which 659 are spam. The spam and non-spam instances are distributed relatively evenly in both the training set and the testing set.
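
The report does not spell out exactly how the instances are re-ordered, so the sketch below simply assumes a stratified random split that reproduces the 3000/1601 sizes and keeps the spam ratio near 0.4 in both parts; the file spambase.data is the standard UCI download and is assumed to be available locally.

```python
# Sketch of a stratified 3000/1601 split of the Spambase data; the exact
# re-ordering used in this project is not specified, so random stratification
# is assumed here. "spambase.data" is the UCI file, assumed to be downloaded.
import numpy as np
from sklearn.model_selection import train_test_split

data = np.loadtxt("spambase.data", delimiter=",")    # 4601 rows x 58 columns
X, y = data[:, :57], data[:, 57]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=3000, stratify=y, random_state=0)
print(y_train.mean(), y_test.mean())                 # spam ratio close to 0.4 in both
```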

4.2 Performance Metrics

To evaluate the performance of different spam detection algorithms, we adopt four commonly used metrics, defined as follows (TP is short for True Positive, FP for False Positive, TN for True Negative and FN for False Negative):

1. Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$, which measures the fraction of emails that are correctly classified.

2. Precision: $\frac{TP}{TP + FP}$, which gives the ratio between the number of emails that are correctly classified as spam and the number of emails classified as spam.

3. Recall: $\frac{TP}{TP + FN}$, which measures the ratio between the number of emails that are correctly classified as spam and the number of spam emails in the testing set.

4. $F_1$-measure: $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, which is a measure of the test's accuracy. The optimal value of the $F_1$-measure is 1 and the worst value is 0.
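
Written out directly from these definitions, a small helper can compute all four metrics from the confusion counts; the counts in the usage line are illustrative only, not results from this project.

```python
# The four metrics computed directly from TP/FP/TN/FN; equivalent helpers
# exist in sklearn.metrics when full prediction arrays are available.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics(tp=590, fp=52, tn=890, fn=69))   # illustrative counts, not report results
```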

4.3 Experimental Results

4.3.1 Different Kernel Functions of SVM

Table 4.1: Performance Comparison of Different Kernels

Method             Accuracy (%)   Precision (%)   Recall (%)   F1-measure
Linear             91.75          91.10           88.61        0.8984
Polynomial (d=1)   91.38          92.77           85.73        0.8911
Polynomial (d=2)   69.76          90.32           29.74        0.4474
Polynomial (d=3)   39.78          39.94           91.95        0.5569
RBF                82.94          79.60           78.75        0.7917
Sigmoid            65.33          93.33           16.95        0.2875
In this section, we compare the spam detection performance of SVMs with different kernel functions. As shown in Table 4.1, the linear kernel achieves performance similar to that of the polynomial kernel with degree 1. By contrast, the RBF and sigmoid kernels perform much worse than the linear and degree-1 polynomial kernels. We also compare the performance of the polynomial kernel with different degree values and find that increasing the degree leads to lower detection performance.
Given how much the performance varies across kernels (e.g., the SVM with a degree-3 polynomial kernel achieves 39.78% accuracy versus 91.75% accuracy for the SVM with a linear kernel), we believe that in future data mining work with SVM tools, we should be careful about kernel selection in order to achieve good performance.

4.3.2 Different Spam Detection Algorithms

Table 4.2: Performance Comparison of Different Algorithms

Method         Accuracy (%)   Precision (%)   Recall (%)   F1-measure
ANN            93.44          93.82           89.98        0.9186
NB             58.27          49.64           95.59        0.6535
SVM (Linear)   92.50          91.98           94.08        0.9302
Combined       93.94          91.32           94.23        0.9275
In this section, we compare the spam detection performance of NB, SVM and ANN. For this experiment, we use the linear kernel SVM as the representative SVM.
As shown in Table 4.2, ANN achieves a slightly higher accuracy than SVM, and SVM outperforms NB in terms of accuracy, precision and F1-measure. The intuition behind this is that Naïve Bayes Classification assumes all attributes are independent, whereas in our data set some attributes are correlated to some extent. For example, the attributes capital_run_length_average and capital_run_length_total are closely correlated.
Besides, our proposed combined method brings a small improvement in overall accuracy over each of the three individual algorithms. This is because it takes a majority vote over NB, SVM and ANN: if a majority of them classify an email as spam, then the email is more likely to be spam, and vice versa.
Regarding time efficiency, the kernel-based SVMs and NB require only about 10 seconds to finish training, whereas ANN needs at least 2 minutes. SVM and NB are therefore much more time-efficient than ANN.


Chapter 5
Conclusion
In this project, we adopt three machine learning algorithms, namely Naïve Bayes Classification, Support Vector Machine and Artificial Neural Network, to tackle the spam detection problem, and conduct extensive experiments on a classical benchmark spam filtering corpus, the UCI Spambase data set, to evaluate the performance of these three classification algorithms. Experimental results show that for the kernel-based SVM, the linear kernel achieves performance similar to the degree-1 polynomial kernel and outperforms the RBF and sigmoid kernels, while polynomial kernels with higher degree lead to lower performance. ANN achieves a slightly higher accuracy than the linear kernel SVM, whereas Naïve Bayes Classification has the worst accuracy. This is because Naïve Bayes assumes all attributes are independent, while in our data set some attributes are closely correlated, so Naïve Bayes performs poorly. Furthermore, our proposed combined method achieves the highest accuracy of all the methods because of its majority voting. With regard to time efficiency, ANN requires much more training time than SVM and NB. Therefore, SVM is suitable for applications that require low complexity and high time efficiency, ANN is suitable for applications that require high accuracy and can tolerate long training times, and NB is not well suited to the spam detection task.
There are several interesting directions that could be explored. First, we could try deep learning algorithms for spam detection. Deep learning is currently a hot topic due to its large performance improvements in many applications in vision, audio, speech and natural language processing, so applying it to spam detection would be an interesting extension. Second, in this project we mainly focus on developing a generalized spam filter. As an extension, we could implement personalized spam filters for different users by introducing personalized features.

Bibliography
[1] Sara Radicati, Email Statistics Report 2013-2017, 2013.
[2] KASPERSKY, Spam Statistics Report for Q2-2013, 2013.
[3] Wombat Security Technologies, PhishPatrol White Paper, April, 2012.
[4] UCI Spambase Data Set, http://archive.ics.uci.edu/ml/datasets/Spambase.
[5] Seongwook Youn and Dennis McLeod, A comparative study for email classification, Advances and Innovations in Systems, Computing Sciences and Software Engineering, Springer Netherlands, 2007, pp. 387-391.
[6] Harris Drucker, Donghui Wu and Vladimir N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 1999.
[7] http://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm
[8] Karl-Michael Schneider, A comparison of event models for Naive Bayes anti-spam e-mail filtering,
In Proc. of Association for Computational Linguistics, 2003.
[9] Alexander K. Seewald, An evaluation of naive Bayes variants in content-based learning for spam
filtering, Intelligent Data Analysis, 2007.
[10] Chuan Zhan, Xianliang Lu, et al., A LVQ-based neural network anti-spam email approach, ACM SIGOPS Operating Systems Review, 2005.
[11] James Clark et al., A neural network based approach to automated e-mail classification, in Proc. of the IEEE/WIC International Conference on Web Intelligence, 2003.
[12] Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines, ACM
Transactions on Intelligent Systems and Technology, 2011.
[13] https://github.com/pranavgupta21/Spam-Detector

