Content Page
Introduction
Project
Methodology
Results
Conclusion
References
SPAM CLASSIFIER
Introduction
Spam refers to unwanted emails that clutter an inbox. These messages need to be
filtered out for a better user experience. The key constraint is that the
classifier should not tag legitimate email as spam. In this project, we used a neural
network for the task.
Project:
To build the classifier, we used MLPClassifier, the neural-network classifier implemented in the
scikit-learn library. The data set, “spambase.data”, was taken from the UCI repository.
It has 4601 instances with 58 attributes (57 continuous features and 1 class label). The attribute
information is as follows:
The last column of 'spambase.data' is the class label, a nominal {0,1} attribute that denotes
whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0).
Most of the attributes indicate how frequently a particular word or character occurs
in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive
capital letters.
Class distribution:
Spam 1813 (39.4%)
Non-Spam 2788 (60.6%)
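As a quick sanity check on the class distribution, the short sketch below recomputes the percentages from the two counts quoted above. The helper name `class_shares` is hypothetical and used only for illustration. The majority-class share is a useful reference point: a classifier that always predicted "non-spam" would already be right about 60.6% of the time.

```python
# Class counts as reported in the data set description above.
counts = {"spam": 1813, "non-spam": 2788}
total = sum(counts.values())  # 4601 instances in total


def class_shares(counts):
    """Return each class's share of all instances as a percentage (one decimal)."""
    n_all = sum(counts.values())
    return {label: round(100.0 * n / n_all, 1) for label, n in counts.items()}


shares = class_shares(counts)

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(shares.values())
```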
Methodology:
1. Load the data into memory.
2. Split it into training and test samples.
3. Prepare the data for training and testing by standardizing it.
4. Define the function space by setting the parameters of the classifier.
5. Train the classifier.
6. Plot the loss curve and the final classification results.
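The steps above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the exact code behind the report: the function names `load_spambase` and `train_spam_classifier`, the 80/20 split, the single hidden layer of 30 units, and the random seed are all assumptions, since the report does not state them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler


def load_spambase(path="spambase.data"):
    """Step 1: load the comma-separated file; the last column is the label."""
    data = np.loadtxt(path, delimiter=",")
    return data[:, :-1], data[:, -1].astype(int)


def train_spam_classifier(X, y, test_size=0.2, seed=0):
    """Steps 2-5: split, standardize, configure and train an MLPClassifier."""
    # Step 2: hold out a stratified test set (the split ratio is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Step 3: fit the scaler on the training data only, then apply it to both sets.
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    # Step 4: illustrative hyperparameters, not the ones used in the report.
    clf = MLPClassifier(hidden_layer_sizes=(30,), max_iter=300, random_state=seed)
    # Step 5: train the network.
    clf.fit(X_train, y_train)
    return {
        "model": clf,
        "scaler": scaler,
        "train_accuracy": clf.score(X_train, y_train),
        "test_accuracy": clf.score(X_test, y_test),
        # Step 6: one loss value per epoch, ready to be plotted against the epoch index.
        "loss_curve": clf.loss_curve_,
        "y_test": y_test,
        "y_pred": clf.predict(X_test),
    }
```

With the data file in the working directory, `result = train_spam_classifier(*load_spambase())` would run the whole sequence. Fitting the scaler on the training split alone keeps information from the test set out of training.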
Results:
1. The graph below shows the loss plotted against the number of training epochs. The loss
converges after about 40 iterations, at which point the classifier has fit the training data.
2. The figure below is the confusion matrix for the results. The color scale on the right shows the
number of e-mails in each cell: the spam and non-spam e-mails predicted correctly number more
than 700, whereas the dark purple cells correspond to the wrong predictions.
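To make the reading of a confusion matrix concrete, the sketch below counts its four cells by hand and derives two summary figures from them. The labels are toy values chosen for illustration, not results from this project, and the helper names `confusion_counts` and `summarize` are hypothetical. The false-positive rate is the quantity tied to the requirement stated in the introduction, that legitimate mail should not be tagged as spam.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells, treating 1 as spam and 0 as non-spam."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),  # spam caught
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),  # legitimate mail kept
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),  # legitimate mail flagged
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),  # spam missed
    }


def summarize(cells):
    """Derive accuracy and the false-positive rate from the four cells."""
    total = sum(cells.values())
    legitimate = cells["fp"] + cells["tn"]
    return {
        "accuracy": (cells["tp"] + cells["tn"]) / total,
        # Share of legitimate mail that was wrongly tagged as spam.
        "false_positive_rate": cells["fp"] / legitimate if legitimate else 0.0,
    }


# Toy labels for illustration only (not the project's predictions).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

cells = confusion_counts(y_true, y_pred)
summary = summarize(cells)
```

In this toy example six of the eight predictions are correct, and one of the five legitimate e-mails is wrongly flagged.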
Conclusions:
A spam classifier is essential for managing spam, given the increasing number of spammers these days.
MLPClassifier gives an efficient and effective result on this task. The loss is low: about 95% of the
e-mails are correctly classified as valid or spam during training, while the testing
error is around 7%. Other methods such as SVMs, MLPRegressor, and AdaBoost with decision trees could be
used to compare the results and further improve the spam classifier.
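One way such a comparison might be set up is sketched below, using cross-validated accuracy in scikit-learn. The function name `compare_classifiers`, the fold count, and all hyperparameters are assumptions made for illustration. MLPRegressor is left out of the sketch because it is a regression model and would need its continuous outputs thresholded before it could be scored as a classifier.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_classifiers(X, y, folds=5, seed=0):
    """Return the mean cross-validated accuracy of each candidate model."""
    candidates = {
        # Scaling sits inside the pipeline so it is refit on every training fold.
        "MLPClassifier": make_pipeline(
            StandardScaler(), MLPClassifier(max_iter=300, random_state=seed)
        ),
        "SVC": make_pipeline(StandardScaler(), SVC()),
        # AdaBoost's default base learner is a depth-1 decision tree.
        "AdaBoostClassifier": AdaBoostClassifier(random_state=seed),
    }
    return {
        name: cross_val_score(model, X, y, cv=folds).mean()
        for name, model in candidates.items()
    }
```

Calling `compare_classifiers(X, y)` on the feature matrix and labels returns a dictionary mapping each model name to its mean accuracy across the folds.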
References:
1. http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html
2. https://archive.ics.uci.edu/ml/datasets/spambase
3. https://stackoverflow.com/questions/19233771/sklearn-plot-confusion-matrix-with-labels?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa