CSE 390
By: Erik Emmanuel
Brian Lin
Jianneng (Jack) Wu
05/04/2017
computers to extract the meaning behind our language. This has given us
the proper tools to be able to build the spam email classifier. The classifier
sent by another computer. We chose this topic for our project because we
were building something that could in turn help save people a lot of time.
Although our project is restricted to a limited dataset, we can still
understand how filtering can be done. The project
consists of 3 parts. In part 1 we build the training data needed to make the
classifier. In part 2 we build the Naive Bayes algorithm and test
the classifier, measuring F-1 score, recall and precision. The last part will
For the data extraction, we used data from the Enron dataset.
From the dataset we used the preprocessed data, which was already categorized as
either spam or not spam (ham). The training set contained 5975 unique emails, with a
1:3 ratio of spam to ham emails. For the probability generation aspect
of the project we wrote a Python script that goes through a folder, opens each folder,
and for each line in each email tokenizes the words. It counts each word and
how many times it occurred in each of the classes (spam or ham). Every
new word that was observed was added to the vocab array. Since the dataset was
large, I omitted tokens from emails that were not words; e.g. 12 was not
considered part of the vocab. I also omitted words from the vocab that occurred fewer
than 2 times. Once the dataset was finalized, I had to attach probabilities that the word
Probability Extraction:
Once the vocab array was finalized, we had to go through each object in the
list and calculate the probability that the word led to either spam or ham. To determine
a readable format to an out.txt file, which is then passed to the algorithm to
predict the test set. The following diagram can be used as a visual tool to trace the flow
of the program.
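The counting and vocabulary steps described above could be sketched in Python roughly as follows. This is a minimal sketch and not the project's actual script: the function names are hypothetical, emails are passed in as (label, text) pairs instead of being read from folders, a token counts as a word only if it is purely alphabetic, and the probability of a word in a class is assumed to be its count divided by the total count of vocab words in that class.

```python
from collections import Counter


def tokenize(line):
    # Assumption: tokens are whitespace-separated; non-alphabetic tokens
    # such as "12" are dropped, as described in the text.
    return [tok for tok in line.lower().split() if tok.isalpha()]


def count_words(emails):
    """Count word occurrences per class from (label, text) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    for label, text in emails:
        for line in text.splitlines():
            counts[label].update(tokenize(line))
    return counts


def build_vocab(counts, min_count=2):
    """Keep words that occur at least min_count times across both classes."""
    totals = counts["spam"] + counts["ham"]
    return sorted(word for word, n in totals.items() if n >= min_count)


def word_probabilities(counts, vocab):
    """Assumed estimate: P(word | class) = count in class / total vocab count in class."""
    probs = {}
    for label in ("spam", "ham"):
        total = sum(counts[label][word] for word in vocab)
        probs[label] = {
            word: (counts[label][word] / total if total else 0.0)
            for word in vocab
        }
    return probs
```

Note that without smoothing, a word seen in only one class gets probability 0 in the other, which a later log-ratio step has to handle explicitly.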
Implementation of Naive Bayes Algorithm (Part 2: Brian)
At this point we had our training data from the training emails in the form of
out.txt. Now our task was to create a process that takes new, unknown
emails and determines from the training data whether each email is valid or spam.
The Naive Bayes algorithm takes a list of words as its input
and outputs a value that corresponds to whether that list of words is more likely to
the value we are computing. The first term is a constant that changes based on the size
of the training data we use. P(S) is the probability that an email picked from the
training set is spam, computed from the number of spam and valid emails. In our case we had
1500 spam emails and 3672 valid emails, so P(S) would be 1500/(3672 +
1500). The same applies to P(H), which denotes the probability that the email picked
is a valid email. The next part of the equation, the summation of log probabilities, takes
each word, divides the probability that the word is in spam by the probability that the word is in
a valid email, and takes the log of the quotient. The log is important
because it determines whether the value being computed is added or subtracted. This means
that the log of any ratio between 0 and 1 is going to be negative, whereas any log
value of log(P(w|S)/P(w|H)) is going to be positive. The only way that an email is classified
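As an illustration of the computation described in this section, the log prior ratio and the per-word log ratios could be combined as in the sketch below. The function names and the dictionary inputs are hypothetical and do not reflect the actual WordProb.py. One loud assumption: words with zero probability in either class are skipped, since the text does not say how such words were handled and no smoothing was used.

```python
import math


def log_prior_ratio(n_spam, n_ham):
    """log(P(S)/P(H)) computed from the class counts of the training set."""
    total = n_spam + n_ham
    return math.log((n_spam / total) / (n_ham / total))


def spam_score(words, p_spam, p_ham, n_spam, n_ham):
    """Log prior ratio plus the sum of per-word log likelihood ratios."""
    score = log_prior_ratio(n_spam, n_ham)
    for word in words:
        ps = p_spam.get(word, 0.0)
        ph = p_ham.get(word, 0.0)
        # Assumption: skip words unseen in either class to avoid log(0)
        # or division by zero.
        if ps > 0.0 and ph > 0.0:
            score += math.log(ps / ph)
    return score


def classify(words, p_spam, p_ham, n_spam, n_ham):
    """A positive score means spam is more likely than ham."""
    return "spam" if spam_score(words, p_spam, p_ham, n_spam, n_ham) > 0 else "ham"
```

With the training counts quoted above (1500 spam, 3672 valid), the prior term is negative, so an email needs enough spam-leaning words to push the total above zero before it is labeled spam.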
We tested this algorithm on the enron2 test set, which contained 4361 valid
emails and 1496 spam emails. Running through the emails with the implemented
The precision of the classifier for spam emails was very high: almost every email that the
classifier predicted as spam was actually spam. However, the recall for spam
emails was very poor: less than half of the spam emails were detected by the
classifier. I believe the reason behind this is the training data. Because
the number of spam emails was much smaller than the number of valid emails in the
training set, more of the words in the training set have a
probability of being in a valid email that is greater than their probability of
being in a spam email. This affects the equation, more precisely the fraction P(w|S)/P(w|H). The
fraction will then be a value between 0 and 1, and the log of that value will be
negative. Adding a negative term skews the equation toward returning that
the email is valid, which is why the recall is imperfect. This could be fixed if
we were to have more data for training where the ratio of spam and valid
Implementation
POS Tagging and the Naive Bayes Algorithm (Part 3: Jack)
Training:
Using the Enron1 data from the Enron dataset [1], I tagged the data using a POS tagger [2]
from Stanford. I chose to use this over the one that we made for two reasons: it's
much more accurate (97% precision), and it's much more efficient. With so much training
data, efficiently tagging the data is quite important. Passing the paths of the data to
train_tagger.py, and the path to the tagged ham/spam folders, we can calculate their
emissions_spam.txt.
Given the existing algorithm from a research paper [3] by Tianhao Sun, we were able to

$$\log\frac{P(S|E)}{P(H|E)} = \log\frac{P(S)}{P(H)} + \sum_{i=1}^{n}\log\frac{P(w_i|S)}{P(w_i|H)}$$

Where if the result is greater than zero, then we classify it as spam. For the POS
adjustment I will be integrating interpolation into the function for one main reason: I
generally understand the concept and feel this is a good and simple probability
distribution for combining two features. I will be modifying the equation as follows:
λ1=0.9 and λ2=0.1 because these weights provided generally the best results. The final

$$\log\frac{P(S|E)}{P(H|E)} = \log\frac{P(S)}{P(H)} + \sum_{i=1}^{n}\log\frac{P(w_i|S,t_i)}{P(w_i|H,t_i)}$$

*** This is the point where I'm not sure if I normalized correctly, as there are few to no
research papers detailing the addition of POS tags with respect to the Naive Bayes
Algorithm.
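One possible reading of the interpolation step is sketched below. The exact interpolation formula is not shown in the text, so this is an assumption: the tag-aware likelihood is taken to be a weighted sum of the plain word probability and a class-specific emission probability for the (word, tag) pair, with the weights 0.9 and 0.1. All function and variable names are hypothetical and are not taken from WordProbPOS.py.

```python
import math

# Weights reported in the text; which term each one multiplies is assumed.
LAMBDA_WORD = 0.9
LAMBDA_POS = 0.1


def interpolated(p_word, p_emission, lam_word=LAMBDA_WORD, lam_pos=LAMBDA_POS):
    """Assumed linear interpolation of word and emission probabilities."""
    return lam_word * p_word + lam_pos * p_emission


def pos_spam_score(tagged_words, word_probs, emissions, log_prior):
    """Log prior ratio plus per-token log ratios of interpolated likelihoods.

    tagged_words: list of (word, tag) pairs
    word_probs:   {"spam": {word: p}, "ham": {word: p}}
    emissions:    {"spam": {(word, tag): p}, "ham": {(word, tag): p}}
    """
    score = log_prior
    for word, tag in tagged_words:
        ps = interpolated(word_probs["spam"].get(word, 0.0),
                          emissions["spam"].get((word, tag), 0.0))
        ph = interpolated(word_probs["ham"].get(word, 0.0),
                          emissions["ham"].get((word, tag), 0.0))
        # Assumption: skip tokens whose interpolated probability is zero
        # in either class.
        if ps > 0.0 and ph > 0.0:
            score += math.log(ps / ph)
    return score
```

Because the two weights sum to 1, the interpolated value stays between 0 and 1 whenever both inputs do, but that alone does not guarantee a properly normalized distribution over words, which is the concern raised above.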
Testing:
To test the algorithm, I used the Stanford tagger to tag the next set of the Enron dataset
(enron2). I then modified the original code from Brian Lin (WordProb.py) to adjust for my
changed input data along with the additional emissions data, and created
WordProbPOS.py. After optimizing the code, I ran the algorithm a few times to test sets
of weights that would be appropriate and settled on 0.9 and 0.1 respectively.
From the results (specified below), we can see that with the addition of the POS tags,
we have decreased the precision of the algorithm by around 5% with respect to the
original algorithm, but have increased the recall by an astounding 45%. This is where
my concerns lie: I'm not 100% certain that I normalized the probabilities correctly, and
I'm also not certain that the original algorithm was accurate. Either of these could have
skewed the results, leading to an overestimate of the benefit of the POS addition.
Precision (Ham w/out POS):
Number of times right / Number of times predicted
2865/4361 = 0.773796791
Recall (Ham):
Number of times right / Number of ham in the set
3562/4361 = 0.816785141
F-1 measurement (Ham):
2(p*r)/(p+r) = 0.794710047093
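The three measurements above follow the usual definitions, which could be written as the short sketch below. The function names are hypothetical. The example recomputes recall from the counts given above and the F-1 score from the reported precision value together with that recall.

```python
def precision(times_right, times_predicted):
    """Fraction of emails predicted as a class that truly belong to it."""
    return times_right / times_predicted


def recall(times_right, actual_in_set):
    """Fraction of emails of a class that the classifier found."""
    return times_right / actual_in_set


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


ham_recall = recall(3562, 4361)   # counts taken from the text
ham_precision = 0.773796791       # precision value as reported in the text
ham_f1 = f1(ham_precision, ham_recall)
```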
The motivation for this project was to be able to detect spam emails given a set
of unclassified emails. To accomplish this goal we used the Naive Bayes model to
write the algorithm. We were able to obtain fairly large training and test sets from
csmining.org. Using these, our Naive Bayes model was able to predict the
class of a given email with high precision, but our initial model was not very good at recalling
spam emails. So we set out to add other mechanisms to the Bayes model to
improve our initial algorithm, and looked toward incorporating POS tags
into the formula. By incorporating the POS tags and tinkering with the weights we
were able to reach a happy medium where both recall and precision were fairly high. But
we are a little suspicious about the improvement we were able to achieve with the
properly. In the future, we can look toward adding smoothing techniques to account for
a broader variety of words that can appear in any given email regardless of the class it
may be associated with. Another direction we can move toward is to establish
confidence in the system we created using POS tags, which gave us better recall.
We can then focus on adding more metadata probabilities to the
current formula to improve the classifier's ability to detect spam even further.
Bibliography
http://www.cs.ubbcluj.ro/~gabis/DocDiplome/Bayesian/000539771r.pdf