Contents

1 Introduction 3

2 Data Set 5
  2.1 Data Set Source 5
  2.2 Data Set Description 5

3 Approaches 7
  3.1 Naïve Bayes Classification 7
  3.2 Support Vector Machine 8
  3.3 Artificial Neural Network 9
  3.4 Combined Method 10

4 Performance Evaluation 11
  4.1 Training Set and Testing Set 11
  4.2 Performance Metrics 11
  4.3 Experimental Results 12
    4.3.1 Different Kernel Functions of SVM 12
    4.3.2 Different Spam Detection Algorithms 13

5 Conclusion 14
Chapter 1
Introduction
With the rapid development of information technology, email has become an important way for people to communicate and share information. According to a recent email statistics report [1], the total number of worldwide email accounts is expected to increase from 3.9 billion in 2013 to over 4.9 billion by the end of 2017, and the number of emails sent and received daily is expected to grow from 182.9 billion in 2013 to over 200 billion by the end of 2017. However, according to a spam trends and statistics report by Kaspersky, over 70% of global email traffic in Q2 2013 came from spam [2].
Spam emails cause considerable inconvenience, annoyance and potential danger to users. They waste limited mailbox space and network bandwidth, and email users have to waste time reading and identifying them. Moreover, users may suffer significant financial losses due to spam: in the US alone, spam emails have been reported to result in direct financial losses in excess of 1 billion dollars per year [3]. An email service provider that cannot detect spam may also lose its users. Motivated by these observations, spam detection is important for both users and email service providers.
In this project, we address the spam detection problem (i.e., classifying an email as spam or non-spam) using three commonly used machine learning techniques: Naïve Bayes Classification (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN). For the SVM technique, we try four types of kernel functions: the linear function, the polynomial function, the radial basis function (RBF) and the sigmoid function. Furthermore, we combine the output results of NB, SVM and ANN to see how the combined method performs. Last, we compare the performance of NB, SVM, ANN and the combined method, and give some guidelines on how to select machine learning techniques for spam detection.
Chapter 2
Data Set
2.1 Data Set Source
In this project, we use the UCI Spambase data set [4], a classical data set for the spam detection problem. Its collection of spam emails came from the postmaster and from individuals who had filed spam, and its collection of non-spam emails came from filed work and personal emails. General information about this data set is shown in Figure 2.1.
2.2 Data Set Description
The UCI Spambase data set has 4601 instances. Among them, 1813 instances are spam and 2788 instances are non-spam. For each instance, there
are 58 attributes.
The first 48 attributes are continuous real numbers ranging from 0 to 100, named word_freq_WORD. The value of each attribute is the percentage of words in the email that match WORD, as defined in equation (2.1):

word_freq_WORD = 100 × (number of times WORD appears in the email) / (total number of words in the email)    (2.1)

The next 6 attributes are continuous real numbers ranging from 0 to 100, named char_freq_CHAR. Analogously, the value of each attribute is the percentage of characters in the email that match CHAR, as defined in equation (2.2):

char_freq_CHAR = 100 × (number of times CHAR appears in the email) / (total number of characters in the email)    (2.2)
Then we have one continuous real attribute named capital_run_length_average. Its value is the average length of uninterrupted sequences of capital letters; this value is always at least 1.

Similarly, we have one continuous integer attribute named capital_run_length_longest, whose value is the length of the longest uninterrupted sequence of capital letters.

Next is a continuous integer attribute named capital_run_length_total. Its value is the sum of the lengths of uninterrupted sequences of capital letters, i.e., the total number of capital letters in the email.
The last attribute is a nominal {0, 1} class attribute: 1 denotes that the email is considered spam, 0 denotes that it is non-spam.
In the training set, the number of spam emails and the number of non-spam emails should not differ too much. For example, if the training set contains far more spam than non-spam emails, the classification model might learn to always classify instances in the testing set as 1, simply because that would be true most of the time under such a skewed training set. Notice that in the UCI Spambase data set, the first 1813 instances are all spam and the remaining instances are all non-spam. Therefore the data set cannot be used directly as input to the training algorithm; details on how to divide it into a training set and a testing set are given in Chapter 4.
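As a rough illustration of the idea above (the actual split procedure is described in Chapter 4; all function and variable names here are our own), a stratified shuffle-and-split that preserves the spam/non-spam ratio in both sets might look like:

```python
import random

def stratified_split(features, labels, test_frac=0.3, seed=42):
    """Shuffle and split so that the spam/non-spam ratio is preserved
    in both the training set and the testing set.  This matters because
    the raw UCI Spambase file lists all 1813 spam instances first,
    followed by the 2788 non-spam instances."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)                      # break the class-ordered layout
        cut = int(len(xs) * test_frac)
        test += [(x, y) for x in xs[:cut]]
        train += [(x, y) for x in xs[cut:]]
    rng.shuffle(train)                       # avoid single-class blocks
    rng.shuffle(test)
    return train, test
```

The per-class shuffle guarantees both sets see the same class proportions, which is exactly the balance requirement discussed above.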
Chapter 3
Approaches
3.1 Naïve Bayes Classification
Let x = (x1, ..., xn) denote the feature vector of an email and yi a class label. By Bayes' theorem, the probability that the email belongs to class yi is

P(Y = yi | x) = P(Y = yi) P(x | Y = yi) / P(x)    (3.1)

where P(x) denotes the probability that a randomly picked email has vector x as its representation, P(Y = yi) is the probability that a randomly picked email is from class yi, and P(x | Y = yi) denotes the probability that a randomly picked email belonging to class yi has x as its representation. Besides, Naïve Bayes Classification assumes that all features are conditionally independent given the class yi. Thus, P(x | Y = yi) can be decomposed into

P(x | Y = yi) = ∏_{j=1}^{n} P(xj | Y = yi)    (3.2)

Since P(x) is identical for all classes, the email is assigned to the class with the largest posterior probability:

y = argmax_{yi} P(Y = yi) ∏_{j=1}^{n} P(xj | Y = yi)    (3.3)
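As an illustration of equations (3.1)-(3.3), the sketch below implements the decision rule for binary features with Laplace smoothing. This Bernoulli-style variant and its smoothing constant are our own illustrative choices; the report's attributes are actually continuous frequencies, so this is a sketch of the principle, not of the exact model used.

```python
import math

def train_nb(X, y, alpha=1.0):
    """Estimate class priors P(Y=yi) and Laplace-smoothed feature
    likelihoods P(xj = 1 | Y = yi) from binary feature vectors."""
    classes = sorted(set(y))
    model = {}
    for c in classes:
        Xc = [x for x, lab in zip(X, y) if lab == c]
        prior = len(Xc) / len(X)
        likel = [(sum(x[j] for x in Xc) + alpha) / (len(Xc) + 2 * alpha)
                 for j in range(len(X[0]))]
        model[c] = (prior, likel)
    return model

def predict_nb(model, x):
    """Decision rule of equation (3.3), computed in log space:
    argmax over classes of log P(Y=yi) + sum_j log P(xj | Y=yi)."""
    def score(c):
        prior, likel = model[c]
        s = math.log(prior)
        for xj, pj in zip(x, likel):
            s += math.log(pj if xj else 1.0 - pj)
        return s
    return max(model, key=score)
```

Working in log space avoids numerical underflow when the product in (3.3) runs over many features.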
3.2 Support Vector Machine
SVM is extensively used in spam detection [5], [6]. The basic idea behind SVM is to find the optimal separating hyperplane that gives the maximum margin between two different classes (e.g., {0, 1}). Based on this idea, spam filtering can be viewed as a simple SVM application, i.e., binary classification of emails as spam or non-spam.
Given a set of training samples X = {(xi, yi)}, i = 1, ..., l, where xi ∈ R^m and yi ∈ {0, 1} is the corresponding output for the i-th training sample (here 1 represents spam and 0 stands for non-spam), the SVM requires the solution of the following optimization problem:

min_{w, b, ξ}  (1/2) w^T w + C ∑_{i=1}^{l} ξi    (3.4)

subject to  yi (w^T φ(xi) + b) ≥ 1 − ξi,  ξi ≥ 0
Here the training vectors xi are mapped into a higher-dimensional space by the function φ; K(xi, xj) ≡ φ(xi)^T φ(xj) is called the kernel function, and C > 0 is the penalty parameter of the error term. More and more researchers pay attention to SVM-based classifiers for spam detection, since their demonstrated robustness and ability to handle large feature spaces make them particularly attractive for this task.
In general, four types of kernel functions (linear, polynomial, RBF and sigmoid) are frequently used with SVM. They are defined as follows:

1. Linear: K(xi, xj) = xi^T xj

2. Polynomial: K(xi, xj) = (γ xi^T xj + r)^d, γ > 0

3. RBF: K(xi, xj) = exp(−γ ||xi − xj||²), γ > 0

4. Sigmoid: K(xi, xj) = tanh(γ xi^T xj + r)

Here γ, r and d are kernel parameters.
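For illustration, the four kernel functions can be written directly as small functions. The parameter defaults (γ = 1, r = 0, d = 2) are our own illustrative choices, not values taken from the report's experiments:

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def linear(xi, xj):
    # K(xi, xj) = xi^T xj
    return dot(xi, xj)

def polynomial(xi, xj, gamma=1.0, r=0.0, d=2):
    # K(xi, xj) = (gamma * xi^T xj + r)^d
    return (gamma * dot(xi, xj) + r) ** d

def rbf(xi, xj, gamma=1.0):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2)
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(xi, xj)))

def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    # K(xi, xj) = tanh(gamma * xi^T xj + r)
    return math.tanh(gamma * dot(xi, xj) + r)
```

Note that the RBF kernel of a point with itself is always 1, which is a quick sanity check on any implementation.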
3.3 Artificial Neural Network
The artificial neural network is also widely used for spam detection [10], [11]. The artificial neural network we adopt in this project is a non-linear feed-forward neural network with the architecture shown in Figure 3.1: an input layer, a hidden layer and an output layer.
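As a minimal sketch of the forward pass through such a network (the weights, layer sizes and sigmoid activation are our own illustrative assumptions; training the weights, e.g. by backpropagation, is not shown here):

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def forward(x, W1, b1, W2, b2):
    """One forward pass: input layer -> sigmoid hidden layer -> single
    sigmoid output unit.  W1 holds one weight row per hidden unit,
    W2 the output unit's weights over the hidden activations."""
    hidden = [sigmoid(dot(row, x) + b) for row, b in zip(W1, b1)]
    out = sigmoid(dot(W2, hidden) + b2)
    return out  # an output > 0.5 is read as "spam"
```

With all weights zero, every unit outputs exactly 0.5, which makes the untrained network a coin flip, as expected.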
3.4 Combined Method
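Chapter 5 attributes the combined method's gains to crowd sourcing over the three classifiers' outputs. One plausible reading, sketched here as our own assumption rather than a detail confirmed by the report, is a simple majority vote over the {0, 1} outputs of NB, SVM and ANN:

```python
def combined_predict(nb_out, svm_out, ann_out):
    """Combine the {0,1} outputs of NB, SVM and ANN by majority vote:
    the email is labeled spam iff at least two of the three
    classifiers classify it as spam."""
    return 1 if nb_out + svm_out + ann_out >= 2 else 0
```

A majority vote can only flip a label when exactly one classifier disagrees, so it tends to suppress the uncorrelated errors of the individual models.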
Chapter 4
Performance Evaluation
4.1 Training Set and Testing Set
4.2 Performance Metrics
Let TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively, where spam is the positive class. We use three performance metrics:

1. Accuracy: (TP + TN) / (TP + TN + FP + FN), which measures the fraction of emails that are correctly classified.

2. Precision: TP / (TP + FP), which gives the ratio between the number of emails correctly classified as spam and the number of emails classified as spam.

3. Recall: TP / (TP + FN), which measures the ratio between the number of emails correctly classified as spam and the number of spam emails in the testing set.
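These three metrics can be computed directly from the {0, 1} labels and predictions on the testing set; a small helper (the function name is our own) could be:

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision and recall from {0,1} labels,
    where 1 denotes spam (the positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators on degenerate prediction sets.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```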
4.3 Experimental Results

4.3.1 Different Kernel Functions of SVM

4.3.2 Different Spam Detection Algorithms
Chapter 5
Conclusion
In this project, we adopt three machine learning algorithms, namely Naïve Bayes Classification, Support Vector Machine and Artificial Neural Network, to tackle the spam detection problem, and conduct extensive experiments on a classical benchmark spam filtering corpus, the UCI Spambase Data Set, to evaluate the performance of these three classification algorithms. Experimental results show that for the kernel-based SVM, the linear kernel achieves performance similar to the polynomial kernel and outperforms the RBF and sigmoid kernels. Moreover, a polynomial kernel with higher degree leads to lower performance. ANN achieves a slightly higher accuracy than the linear-kernel SVM. By contrast, Naïve Bayes Classification has the worst accuracy. This is because Naïve Bayes assumes all attributes are independent, whereas in our data set some attributes are closely correlated, so Naïve Bayes performs poorly. Furthermore, our proposed combined method improves performance in terms of accuracy, recall and F1-measure because of its crowd-sourcing effect. With regard to time efficiency, ANN requires much more training time than SVM and NB. Therefore, SVM is suitable for applications that require low complexity and time efficiency, ANN is suitable for applications that require high accuracy and can afford long training time, and NB is not suitable for the spam detection task.
There are several interesting directions that could be explored. First, we could try deep learning algorithms for spam detection. Deep learning is currently a hot topic due to its large performance improvements in many applications in vision, audio, speech and natural language processing, so applying it to spam detection should also be interesting. Second, in this project we mainly focus on developing a generalized spam filter. As an extension, we could implement personalized spam filters for different users by introducing personalized features.
Bibliography
[1] Sara Radicati, Email Statistics Report 2013-2017, 2013.
[2] Kaspersky, Spam Statistics Report for Q2 2013, 2013.
[3] Wombat Security Technologies, PhishPatrol White Paper, April 2012.
[4] UCI Spambase Data Set, http://archive.ics.uci.edu/ml/datasets/Spambase.
[5] Seongwook Youn and Dennis McLeod, A comparative study for email classification, in Advances and Innovations in Systems, Computing Sciences and Software Engineering, Springer Netherlands, 2007, pp. 387-391.
[6] Harris Drucker, Donghui Wu and Vladimir N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 1999.
[7] http://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm
[8] Karl-Michael Schneider, A comparison of event models for Naive Bayes anti-spam e-mail filtering, in Proc. of the Association for Computational Linguistics, 2003.
[9] Alexander K. Seewald, An evaluation of naive Bayes variants in content-based learning for spam filtering, Intelligent Data Analysis, 2007.
[10] Chuan Zhan, Xianliang Lu, et al., A LVQ-based neural network anti-spam email approach, ACM SIGOPS Operating Systems Review, 2005.
[11] James Clark et al., A neural network based approach to automated e-mail classification, in Proc. of the IEEE/WIC International Conference on Web Intelligence, 2003.
[12] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2011.
[13] https://github.com/pranavgupta21/Spam-Detector