POS Classifier

Spam Email Classifier
With the addition of POS tags
CSE 390
By: Erik Emmanuel
Brian Lin
Jianneng (Jack) Wu
05/04/2017
Professor Niranjan Balasubramanian
TA: Noah Weber

Spam Email Classifier
Natural Language Processing has allowed us to be able to allow
computers to extract the meaning behind our language. This has given us
the proper tools to be able to build the spam email classifier. The classifier
is designed to filter emails by removing the emails that were reported to be
sent by another computer. We choose this topic for our project because we
were building something that in turn could help save people a lot of time.
Although the limits of our project are extended to a limit dataset, we can
understand the workings of how filtering can be done. The project will
consist of 3 parts. In part 1 we build the training data needed to make the
classifier. Part 2 will then be to build the Naive Bayes algorithm and test it
on a new set of data. Afterwards we will also measure the performance of
the classifier measuring F-1 score, recall and precision. The last part will
then to be implement part of speech tagging to further improve the
performance of the classifier.

Probability generation for training dataset (Part 1: Erik)
Vocab set generation:
For the data extraction, We used the data we acquired from the Enron dataset.
From the dataset we used the preprocessed data which was already categorized as
either spam or not spam (ham). In the training set there was 5975 unique emails, and of
those there was 1:3 ratio of spam to ham emails. For the probability generation aspect
of the project we made a python script that goes through a folder, opens each folder
and and for each line in the email tokenize the word. Make counts of each word and
how many times they had occurred in each of the classes (spam or ham). For every
new word that was observed it was added to the vocab array. Since the dataset was
large, I omitted tokens from emails that were not words; ie. 12 was not considered as it
is a number while NBA16 would be an example of an alphanumeric token that was
considered part of the vocab. I also omitted words from the vocab that occurred less
than 2 times.Once the dataset was finalized, I had to attach probabilities that the word
could lead to the document being classified as a spam or ham email.
Probability Extraction:
Once the vocab array was finalized, now we had to go through each object in the
list and calculate the probability that word either lead to a spam or ham. To determine
the P(spam|word) and P(ham|word).

#(wi occurred in spam)
P (wi |spam) = #(wi occurred in total)
#(wi occurred in ham)

P (wi |ham) = #(wi occurred in total)
After finding these probabilities for each word in the vocabulary, we output these data in
a readable format onto the an out.txt file which is then passed onto the algorithm to
predict the test set. The following diagram can be used as a visual tool to trace the flow
of the program.
Implementation of Naive Bayes Algorithm (Part 2: Brian)
At this point we had our training data from the training emails in the form of
out.txt. Now our task was to create a process in which we can process new unknown
emails and determine from the training data whether or not the email is valid or spam.
This is where the Naive Bayes Algorithm comes into play.
The Naive Bayes Algorithm is an algorithm that takes a list of words as its input
and outputs a value that corresponds to whether or not that list of words is more likely to
be in a spam or valid email.
The left side of the equation is
the value we are computing. The first term is a constant that changes based on the size
of the training data we use. P(S) is determined by taking the number of spam and valid
emails and determining the probability of the email picked is spam. In our case we had
1500 spam emails and 3672 valid emails. Taking this P(S) would be 1500/(3672 +
1500). The same applies to P(H) where to denotes the probability that the email picked
is valid email. The next part of the equation with the summation of log probabilities takes
a word and its probability that the word is in spam over the probability that the word is in
a valid email and takes the log of the quotient. The log incorporation is important
because it determines if the value computing is being added or subtracted. This means
that any log probability between 0 and 1 is going to be negative whereas any log
probability above 1 is going to be positive. This is important because whenever the

probability of the word being in spam is greater than it being in a valid email then the
value of log((w|S)/(w|H)) is going to be positive. The only way that a email is classified
as spam is when the final value of the right side is above 0.
We tested this algorithm on the test set called enron2 which contained 4361 valid
emails and 1496 spam emails. Running through the emails with the implemented
algorithm gave us the results:
Spam: Recall = 0.4525 Precision = 0.9713 F-1 = 0.6174
Ham: Recall = 0.8167 Precision = 0.7737 F-1 = 0.7947
The precision in the classifier for spam emails was amazing, almost every email that the
classifier predicted as spam came out to actually be spam. However the recall of spam
emails was very poor. Less than half of the emails that were spam were detected by the
classifier. I believe the reason behind this is the training data. Because of the fact that
the number of spam emails was much less than the number of the valid emails in the
training set, this causes the words in the training set to have more frequent words where
the probability of the word being in a valid email is greater than the probability of the
word in a spam email. This affects the equation, more precisely . The
fraction will then be a value between 0 and 1 and any log of that value will produce a
negative value. Adding a negative to the equation will skew the equation to return that
the email is valid and for that the recall is imperfect. This could be fixed if
we were to have more data for training where the ratio of spam and valid
emails were closer to 1.

Figure 2.1 Flowchart of Algorithm
Implementation
POS Tagging and the Naive Bayes Algorithm (Part 3: Jack)
Training:
Using the Enron1 data from the Enron dataset1, I tagged the data using a POS tagger2
from Stanford. I chose to use this over using the one that we made for two reasons; Its
much more accurate (97% Precision), and its much for efficient. With so much training
data, efficiently tagging the data is quite important. Passing the paths of the data to
pos_tagger.java , outputs the results a directory named enron_tagged. Then using
train_tagger.py, and the path to the tagged ham/spam folders, we can calculate their
emissions probabilities: P (T |W ). This is outputted as emissions_ham.txt and
emissions_spam.txt.
Modifying the algorithm:
Given the existing algorithm from a research paper3 by Tianhao Sun, we were able to
implement the existing algorithm as follows:
n
P (S|E) P (S) P (w |S)
log P (H|E) = log P (H) + log P (w i|H)
i
i=1
Where if the result is greater than zero, then we classify if as spam. For the POS
adjustment I will be integrating interpolation into the function for one main reason: I
generally understand the concept and feel this is a good and simple probability
distribution for combining two features. I will be modifying the equation as follows:
1 P (wi S)+2 P (wi |S,ti ) 1 P (wi H)+2 P (wi |H,ti )

P (wi |S, ti ) = P (S) and P (wi |H, ti ) = P (H)
Where 1 and 2 are weights assigned to each feature. For our specific case we used
1=0.9 and 2=0.1 because these weights provided generally the best results. The final
equation for determining if the email is classified as a spam is then as follows:
n
P (S|E) P (S) P (w |S,t )
log P (H|E) = log P (H) + log P (w i|H,ti )
i i
i=1
*** At this point is where Im not sure if I normalized correctly as there are little to no
research papers detailing the addition of POS tags with respect to the Naive Bayes
Algorithm.
Testing:
To test the algorithm, I used the stanford tagger to tag the next set of the enron dataset
(enron2). I then modified the original code from Brian Lin (WordProb.py) to adjust for my
changed input data along with additional emissions data, and created the code
WordProbPOS.py . After optimizing the code, I ran the algorithm a few times to test sets
of weights that would be appropriate and settled on 0.9 and 0.1 respectively.
Results and Concerns With the POS Method:
From the results (specified below), we can see that with the addition of the POS tags,
we have decreased the precision of the algorithm by around 5% with respect to the
original algorithm, but have increased the recall by an astounding 45%. This is where
my concerns lie: Im not 100% certain that I normalized the probabilities correctly, and
also not certain if the original algorithm was accurate as well. Both these could have
skewed the results, leading to an over appreciation for the POS addition.
Results (With and without POS):

Precision (Spam w/out POS):
Number of times right/Number of times predicted
677/697 = 0.971305595
Recall (Spam):
Number of times right/Number of spam in the set
677/1496 = 0.452540107
F-1 measurement (Spam):
2((p*r)/(p+r) = 0.61741906061
2
Precision (Ham w/out POS):
2865/4361 = 0.773796791
Recall (Ham):
Number of times right/Number of spam in the set
3562/4361 = 0.816785141
F-1 measurement (Ham):
2((p*r)/(p+r) = 0.794710047093
Precision (Spam w/ POS):

1246/1345 = 0.926394052
Recall (Spam w/ POS):
Number of times right/ Number of spam in the set
1246/1496 = 0.832887701
F-1 measurement (Spam w/ POS):
2((p*r)/(p+r)) = 0.877155931
Precision (Ham w/ POS):

4261/4505 = 0.9458379578
Recall (Ham w/ POS):
Number of times right/ Number of spam in the set
4017/4361 = 0.9211190094
F-1 measurement (Ham w/ POS):
2((p*r)/(p+r))= 0.9333148412
General Flow of Classifier:
Conclusion:
The motivation for this project was to be able to detect spam emails given a set
of unclassified set of emails. To accomplish this goal we used the Naive Bayes model to
write the algorithm. We were able to obtain a fairly large training and test sets from
csmining.org. Using these our Naive Bayes model was able to precisely predict the
class of a given email, but our initial model was not very good at recalling the correct
amount of spam emails. So we set out to add other mechanisms to the Bayes model to
improve our initial algorithm so we looked toward using POS tags to be incorporated
into the formula. With the incorporating the POS tags and tinkering with the weights we
were able to reach a happy medium where both recall and precision was fairly high. But
we are a little suspicious about the improvement we were able to achieve with the
incorporation of POS tags because of our uncertainty on if we normalized the tags
properly. In the future, we can look toward adding smoothing techniques to account for
a broader variety of words that can appear in any given email irrelevant of its class it
may be associated with. Another direction we can move toward is prove the
normalizations we made on the POS classifier were correct, giving us a boost in
confidence on the system we created using POS tags, which gave us a better recall.
We can they focus on looking into adding more metadata probabilities to add onto the
current formula to improve the classifier's ability to detect spam even more easily.
Bibliography
[1] Enron Dataset: http://csmining.org/index.php/enron-spam-datasets.html
[2] Stanfords POS Tagger: https://nlp.stanford.edu/software/tagger.shtml
[3] Tianhao Suns paper on Naive Bayes Algorithm:
http://www.cs.ubbcluj.ro/~gabis/DocDiplome/Bayesian/000539771r.pdf

POS Classifier

Hochgeladen von

Dokumentinformationen

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

POS Classifier

Hochgeladen von

Copyright:

Verfügbare Formate

Spam Email Classifier

With the addition of POS tags

Professor Niranjan Balasubramanian

TA: Noah Weber

Natural Language Processing has allowed us to be able to allow

is designed to filter emails by removing the emails that were reported to be

on a new set of data. Afterwards we will also measure the performance of

then to be implement part of speech tagging to further improve the

performance of the classifier.

Vocab set generation:

is a number while NBA16 would be an example of an alphanumeric token that was

could lead to the document being classified as a spam or ham email.

the P(spam|word) and P(ham|word).

#(wi occurred in ham)

This is where the Naive Bayes Algorithm comes into play.

be in a spam or valid email.

The left side of the equation is

probability above 1 is going to be positive. This is important because whenever the

as spam is when the final value of the right side is above 0.

algorithm gave us the results:

Spam: Recall = 0.4525 Precision = 0.9713 F-1 = 0.6174

Ham: Recall = 0.8167 Precision = 0.7737 F-1 = 0.7947

emails were closer to 1.

pos_tagger.java , outputs the results a directory named enron_tagged. Then using

emissions probabilities: P (T |W ). This is outputted as emissions_ham.txt and

Modifying the algorithm:

implement the existing algorithm as follows:

1 P (wi S)+2 P (wi |S,ti ) 1 P (wi H)+2 P (wi |H,ti )

equation for determining if the email is classified as a spam is then as follows:

Results and Concerns With the POS Method:

Results (With and without POS):

Precision (Spam w/ POS):

Precision (Ham w/ POS):

incorporation of POS tags because of our uncertainty on if we normalized the tags

normalizations we made on the POS classifier were correct, giving us a boost in

[1] Enron Dataset: http://csmining.org/index.php/enron-spam-datasets.html

[2] Stanfords POS Tagger: https://nlp.stanford.edu/software/tagger.shtml

[3] Tianhao Suns paper on Naive Bayes Algorithm:

Das könnte Ihnen auch gefallen