Spam Classifier Report

SPAM CLASSIFIER
Using neural networks
MARCH 22, 2018

GURPREM SINGH
ME CSE, UIET
TABLE OF CONTENTS
Introduction
Project
Methodology
Results
Conclusion
References
Introduction
Spam refers to unwanted e-mails that clutter an inbox. These messages need to be
separated from legitimate mail for a better user experience. The key constraint is that the
classifier should not tag a relevant e-mail as spam. In this project, we use a neural
network for this task.
Project:
The classifier was built with MLPClassifier, the neural network classifier implemented in the
scikit-learn library. The data set "spambase.data" was taken from the UCI repository.
The data has 4601 instances with 58 attributes (57 continuous, 1 label). The attribute
information, as given in the documentation that accompanies the data set, is as follows:
The last column of 'spambase.data' denotes whether the e-mail was considered spam (1),
i.e. unsolicited commercial e-mail, or not (0).
Most of the attributes indicate whether a particular word or character occurred frequently
in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive
capital letters. The statistical measures of each attribute are listed in the data set
documentation. Here are the definitions of the attributes:
48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the
e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) /
total number of words in e-mail. A "word" in this case is any string of alphanumeric characters
bounded by non-alphanumeric characters or end-of-string.
6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the
e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail.
1 continuous real [1,...] attribute of type capital_run_length_average = average length of
uninterrupted sequences of capital letters.
1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest
uninterrupted sequence of capital letters.
1 continuous integer [1,...] attribute of type capital_run_length_total = sum of length of
uninterrupted sequences of capital letters = total number of capital letters in the e-mail.
1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam
(1), i.e. unsolicited commercial e-mail, or not (0).
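The attribute definitions above can be made concrete with a small sketch that computes the three kinds of features from raw text. The function names are hypothetical, case-insensitive word matching is an assumption, and returning zeros for a text without capital letters is a choice made for this sketch only; the data set itself ships the features already computed.

```python
import re


def word_freq(text, word):
    """100 * (occurrences of `word`) / (total words in `text`).

    A word is any run of alphanumeric characters. Matching is done
    case-insensitively here, which is an assumption of this sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)


def char_freq(text, char):
    """100 * (occurrences of `char`) / (total characters in `text`)."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_lengths(text):
    """Return (average, longest, total) length of runs of capital letters."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # No capital letters at all: zeros are a convention of this sketch.
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)
```

For the toy text "FREE money!! Get FREE cash now", the capital runs are "FREE", "G" and "FREE", so capital_run_lengths returns an average of 3.0, a longest run of 4 and a total of 9.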
Missing attribute values: None
Class distribution:
Spam 1813 (39.4%)
Non-spam 2788 (60.6%)
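The percentages follow directly from the two class counts, and they also give a useful reference point: a classifier that always predicts the majority class (non-spam) would already be right on 60.6% of the e-mails. A short check of this arithmetic, with a hypothetical helper name:

```python
def class_distribution(counts):
    """Map each class name to its percentage of the total, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Class counts as reported for the data set.
counts = {"spam": 1813, "non_spam": 2788}
percentages = class_distribution(counts)

# Accuracy (in percent) of always predicting the most frequent class.
majority_baseline = max(percentages.values())
```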
Methodology:
1. Load the data into memory.
2. Split the data into training and test samples.
3. Prepare the data for training and testing, that is, standardize the features.
4. Define the function space, that is, set the parameters of the classifier.
5. Train the classifier.
6. Plot the loss curve and the end result of the classification.
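The steps above could be sketched with scikit-learn roughly as follows. This is a minimal sketch, not the code used for the report: the function names, the 25% test split, the single hidden layer of 30 units and the iteration limit are all assumptions, since the report does not state its settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler


def load_spambase(path="spambase.data"):
    """Step 1: load the comma-separated file; the last column is the label."""
    data = np.loadtxt(path, delimiter=",")
    return data[:, :-1], data[:, -1].astype(int)


def train_spam_mlp(X, y, test_size=0.25, seed=0):
    """Steps 2-5; returns the classifier, the scaler and both accuracies."""
    # Step 2: split, keeping the class ratio the same in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Step 3: standardize using statistics of the training part only.
    scaler = StandardScaler().fit(X_train)
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)
    # Step 4: set the classifier parameters (assumed values, see lead-in).
    clf = MLPClassifier(hidden_layer_sizes=(30,), max_iter=300, random_state=seed)
    # Step 5: train.
    clf.fit(X_train_s, y_train)
    train_acc = clf.score(X_train_s, y_train)
    test_acc = clf.score(X_test_s, y_test)
    return clf, scaler, train_acc, test_acc
```

For step 6, the fitted classifier exposes the per-iteration training loss as clf.loss_curve_, which can be passed to any plotting library. With the data file in the working directory, a run would be X, y = load_spambase() followed by train_spam_mlp(X, y).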
Results:
1. The graph below plots the training loss against the number of epochs. It can be
seen that the loss converges after about 40 iterations, at which point the classifier has fit
the training data.
2. The figure below is the confusion matrix for the results. The color scale on the right
shows the number of e-mails in each cell. The e-mails correctly predicted as spam or
non-spam number more than 700, whereas the dark purple cells correspond to the wrong
predictions.
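The quantities read off a confusion matrix can be computed as in the sketch below. The labels are toy values chosen only to illustrate the arithmetic and are not the report's results; the helper names are hypothetical. The false positive rate is included because it measures the case the introduction warns about, a relevant e-mail tagged as spam.

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels, with 1 = spam, 0 = not spam."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1  # legitimate e-mail wrongly tagged as spam
        else:
            fn += 1  # spam that slipped through
    return tn, fp, fn, tp


def summarize(tn, fp, fn, tp):
    """Accuracy, error rate and false positive rate from the four counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        "false_positive_rate": false_positive_rate,
    }


# Toy labels for illustration only (not taken from the experiment).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
summary = summarize(*confusion_counts(y_true, y_pred))
```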
Conclusions:
A spam classifier is essential for managing spam, given the increasing number of spammers these days.
MLPClassifier gives an efficient and effective result. The loss is low: about 95% of the e-mails
are correctly classified as valid or spam during training, while the testing error is around 7%.
Other methods, such as SVMs, MLPRegressor and AdaBoost with decision trees, could be used to
compare the results and further improve the working of the spam classifier.
References:
1. http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html
2. https://archive.ics.uci.edu/ml/datasets/spambase
3. https://stackoverflow.com/questions/19233771/sklearn-plot-confusion-matrix-with-labels?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
