
International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 6, June 2013

ISSN: 2231-2803 http://www.ijcttjournal.org



New Filtering Approaches for Phishing Email
Mrs. P. Lalitha*1, Sumalatha Udutha*2
Assistant Professor, Dept of Computer Applications, SNIST, Ghatkesar, Hyderabad, AP, India
M.C.A Student, Dept of Computer Applications, SNIST, Ghatkesar, Hyderabad, AP, India
ABSTRACT:
Phishing aims at making web users believe that they are communicating with a trusted entity for the purpose of stealing account information, logon credentials, and identity information in general. The attack is most commonly initiated by sending out emails with links to spoofed websites that harvest information.
Phishing is a significant problem involving fraudulent email and websites that trick unsuspecting users into revealing private information. Phishing has become more and more complicated, and sophisticated attacks can bypass the filters set up by anti-phishing techniques. Most phishing emails aim at withdrawing money from financial institutions or at gaining access to private information, and phishing is therefore a serious threat to global security and the economy. Phishing filters are necessary and widely used to increase communication security.
In this paper we describe a number of
features that are particularly well-suited to identify
phishing emails. These include statistical models for
the low-dimensional descriptions of email topics,
sequential analysis of email text and external links,
the detection of embedded logos as well as indicators
for hidden salting.
I. INTRODUCTION
Phishing has increased enormously over the last few years and is a serious threat to global security and the economy. Criminals are trying to convince
unsuspecting online users to reveal sensitive
information, e.g., passwords, account numbers, social
security numbers or other personal information.
Unsolicited mail disguised as coming from reputable
online businesses such as financial institutions is
often the vehicle for luring individuals to these fake,
usually temporary sites.
In this paper we extend earlier work in several major ways. First, we incorporate a large number of new features, in particular graphical features such as hidden salting detection, image distortion, and logo detection. Second, we report on many more experiments, in particular on datasets captured from real life. Third, we also consider the question of spam, since in real life spam emails cannot be completely eliminated from an email stream before a dedicated phishing filter is applied. Fourth, we outline an active learning deployment strategy.

II. TYPES OF PHISHING ATTACKS

Two different types of phishing attacks may be distinguished: malware-based phishing and deceptive phishing. In malware-based phishing, malicious software is spread by deceptive emails or by exploiting security holes in the computer software and is installed on the user's machine. The focus of this paper is deceptive phishing, in which a phisher sends out deceptive emails pretending to come from a reputable institution, e.g., a bank, and lures recipients to a spoofed website where they enter confidential data. This information is exploited by the phisher, e.g., by withdrawing money from the user's bank account. A number of tricks are common in deceptive phishing:
Social engineering: The invention of plausible stories, scenarios, and methodologies to produce a convincing context, combined with the use of personalized information.
Mimicry: The email and the linked website closely resemble official emails and the official websites of the target. This includes the use of genuine design elements, trademarks, logos, and images.
Email spoofing: Phishers hide the actual sender's identity and display a fake sender address to the user.
URL hiding: Phishers attempt to make the URLs in the email and on the linked website appear official and legitimate while hiding the actual link addresses.
Invisible content: Phishers insert information into the phishing mail or website that is invisible to the user and aimed at fooling automatic filtering approaches.
Image content: Phishers create images that contain the text of the message only in graphical form.
Phishing causes large losses to the economy. As the targeted organizations want to avoid bad press, they are reluctant to provide precise information on losses.
The requirements for the Targeted Malicious Email (TME) attack tool were as follows:
User interface should be compact and easy-
to-use.

Harvesting function should be able to search
web pages for contact details by crawling
the target web site.
The tool should be able to extract useful
information (i.e. email addresses) from the
harvested pages.
Social engineering tricks should be applied
by using flexible email contents (txt/html)
and attachments.
Different relay mail servers should be supported for mass email delivery.
III. Countermeasures against Phishing

A variety of countermeasures against phishing have been proposed. In this section we give an overview of these measures. We distinguish network- and encryption-based measures, blacklisting and whitelisting, and content-based measures.
3.1 Network- and Encryption-Based
Countermeasures
Communication-oriented measures aim at
establishing a secure communication. A first
protection against malware-based phishing attacks is
the installation of virus scanners and regular software
updates. An additional measure is email authentication. There are several technical proposals, but currently the vast majority of email messages do not use any of them. Other approaches are password hashing (transmitting a hash of the password together with the domain name) and two-factor authentication, where two independent credentials are used for authentication, e.g., smartcards and passwords.
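To make the password-hashing idea concrete, the following minimal Python sketch derives a per-domain credential by hashing the password together with the domain name; the function name, the truncation length, and the example domains are illustrative assumptions and not part of this paper.

import hashlib
import hmac

def domain_password_hash(master_password, domain):
    # Derive a site-specific credential from the password and the domain name;
    # the site never sees the real password, and a look-alike phishing domain
    # receives a different value.  (Sketch only; a deployed scheme would use a
    # slow key-derivation function and careful encoding.)
    digest = hmac.new(master_password.encode("utf-8"),
                      domain.lower().encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return digest[:20]

print(domain_password_hash("s3cret", "www.mybank.com"))        # credential sent to the bank
print(domain_password_hash("s3cret", "www.mybank-login.net"))  # different value at a fake site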
3.2 Blacklisting and Whitelisting
One approach of phishing prevention
concentrates on checking web addresses when a page
is rendered in a web browser. In the Mozilla Firefox
browser, for instance, each web page requested by a
user is checked against a blacklist of known phishing
sites. This list is automatically downloaded to the local machine and updated at regular intervals. It is well known, however, that new phishing sites appear frequently; about 35 new phishing sites were detected per hour. As phishing attacks pertain to a finite number of target institutions, whitelist approaches have been proposed. Here a whitelist of good URLs is compared to the external links in incoming emails. This approach seems more promising, but maintaining a list of trustworthy sources can be a time-consuming and labor-intensive task. Another drawback of whitelists is that they can produce false positives, i.e., ham emails that are wrongly filtered out, whereas blacklists can only produce false negatives, i.e., missed phishing emails, a less severe type of error.
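As a small illustration of the whitelist idea, the following Python sketch flags every external link in an email whose host is not on a list of trusted hostnames; the hostnames and the helper name are made up for this example.

from urllib.parse import urlparse

# Illustrative whitelist of trusted hostnames; in practice such a list is
# maintained for the finite set of targeted institutions.
WHITELIST = {"www.mybank.com", "online.mybank.com"}

def suspicious_links(urls):
    # Return every external link whose host is not on the whitelist.
    return [u for u in urls if (urlparse(u).hostname or "").lower() not in WHITELIST]

print(suspicious_links(["https://www.mybank.com/login",
                        "http://198.51.100.7/mybank/login.php"]))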
3.3 Content-Based Filtering
In this paper we concentrate on content-
based filtering approaches to detect phishing attacks.
These evaluate the content of an email or the
associated website and try to identify different tricks
for producing a plausible phishing attack, e.g.,
The detection of typical formulations urging the
user to enter confidential information.
The detection of design elements, trademarks, and
logos for known brands. Note that only relatively few
brands are attacked, e.g., 144 different brands in
December 2007.
The identification of spoofed sender addresses and
URLs.
The detection of invisible content inserted to fool
automatic filtering approaches.
The analysis of images which may contain the
message text.
The filters statistically combine the evidence from
many features and classify a communication as
phishing or non-phishing. Common tricks of
spammers known as message salting are the inclusion
of random strings and diverse combinatorial
variations of spacing, word spelling, word order, etc.
Some salting techniques, called hidden salting, cause messages to appear visually the same to the human eye even if the machine-readable forms are very different. We use a computer vision technique to segment text into partitions that have the same reading order, and a document image understanding technique to find the reading order of these partitions. Specifically, given an email as input, a text production process, e.g. a web browser, creates a parsed, internal representation of the email text and drives the rendering of that representation onto some output medium, e.g. a browser window. Our proposed method of detecting text salting takes place during this rendering process and is composed of two steps: the rendering is monitored, and each rendered glyph is checked against the following visibility conditions:
1. Clipping: glyph drawn within the physical bounds
of the drawing clip, which is a type of spatial mask;
2. Concealment: glyph not concealed by other glyphs
or shapes;
3. Font color: the glyph's fill color contrasts well with the background color;
4. Glyph size: the glyph's size and shape are sufficiently large.
Failure to comply with any of these conditions results in an invisible glyph, which we use to identify hidden salting; e.g., zero-sized fonts violate the glyph size condition. To resolve salting detected through these conditions, we eliminate all invisible glyphs (i.e., retain only what is perceived).
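The visibility check can be sketched in Python as follows; the Glyph record, the contrast and size thresholds, and the rectangle-based clipping test are simplifying assumptions for illustration, not the exact rendering instrumentation described above.

from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float
    y: float
    width: float
    height: float
    fill: tuple        # RGB fill colour of the glyph
    background: tuple  # RGB colour behind the glyph
    concealed: bool    # covered by later shapes or glyphs

def luminance(rgb):
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def is_visible(g, clip, min_size=2.0, min_contrast=32.0):
    # Check the four glyph visibility conditions; failing any one of them
    # marks the glyph as hidden salting.
    cx, cy, cw, ch = clip
    clipping = cx <= g.x and cy <= g.y and g.x + g.width <= cx + cw and g.y + g.height <= cy + ch
    concealment = not g.concealed
    font_color = abs(luminance(g.fill) - luminance(g.background)) >= min_contrast
    glyph_size = g.width >= min_size and g.height >= min_size
    return clipping and concealment and font_color and glyph_size

def perceived_text(glyphs, clip):
    # Eliminate all invisible glyphs and keep only what the user perceives.
    return "".join(g.char for g in glyphs if is_visible(g, clip))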


IV. Classification Experiments

4.1 Phishing Classifiers
As discussed in Section 3.3, content-based phishing classifiers have to integrate a large number of features. It turns out that methods used for text classification also yield the best results for content-based phishing classification. Prominent methods are variants of tree classifiers, such as Random Forests, or Support Vector Machines. A key requirement is that they are able to limit the influence of unimportant features and capture only systematic dependencies. In the context of phishing classification, different setups may be used. Statistical phishing classifiers only
work if the features of the new emails correspond to
the composition of emails in the training set. If
emails with a completely new combination of feature
values occur the filter has to be adapted and
retrained. It turns out that the filters are quite robust
as the classification usually depends not on a single
feature value but on the cumulative effect of several
features.
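As a hedged illustration of such a setup, the following Python sketch trains a Random Forest with scikit-learn and evaluates it by 10-fold cross-validation; the feature matrix here is synthetic and only stands in for the real, post-processed email features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X holds one row of post-processed feature values per email,
# y holds the labels (1 = phishing, 0 = ham).  Random numbers stand in
# for the real feature matrix in this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 27))
y = rng.integers(0, 2, size=400)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print("mean F-measure over 10 folds:", scores.mean())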
4.2 Active Learning
Advanced classifiers not only predict a class for a new email but are also able to estimate the probability that this class is the true class. This effect
can be used to provide a self-diagnosis for the
reliability of classifications. If unusual combinations
of feature values occur which are dissimilar from all
examples seen in the training set then the classifier
usually will be undecided and predict low probability
values for all classes. This effect can be used to adapt
classifiers to new types of phishing emails. If an
email with high classification uncertainty is detected
it may contain a novel combination of features which
never occurred in the training set. Active learning strategies select such examples for annotation by experts. Active learning is well suited for the selection of examples in the experts' labs of spam and phishing filter providers.
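A minimal sketch of this uncertainty-based selection is given below, assuming a scikit-learn-style classifier that exposes predict_proba; the annotation budget and the function name are illustrative choices.

import numpy as np

def select_for_annotation(clf, X_unlabeled, budget=100):
    # Emails whose highest class probability is lowest are the ones the filter
    # is most undecided about, and therefore the most informative to annotate.
    proba = clf.predict_proba(X_unlabeled)
    uncertainty = 1.0 - proba.max(axis=1)
    return np.argsort(uncertainty)[::-1][:budget]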
4.3 Description of Data
We have compiled a real-life dataset of
20000 ham and phishing emails over a period of
almost seven months, from April 2007 to mid-
November 2007. These emails were provided by our
project partners. We assume a ratio of four to five
ham emails for one phishing email. Specifically, our
corpus consists of 16364 ham emails and 3636
phishing emails. For an additional experiment we
added 20000 spam emails from the same time period
to create an overall corpus of 40000 emails. We
chose such a large number of spam emails to
approximately reflect the fact that in real-life the
majority of emails are unsolicited.
4.4 Feature Processing and Feature Selection
The features used in our system come from a
variety of different sources. It is thus necessary to post-process them to supply the classifiers with unified inputs.
4.4.1 Feature Processing
It is not advisable to directly use the
unmodified feature values as input for a classifier.
We perform scaling and normalization. Scaling
guarantees that all features have values within the
same range. Many classifiers are based on distance
measures, such as the Euclidean distance, which
overemphasizes features with large values. We
perform a Z-transformation to ensure that all features
have an empirical mean of 0 and an empirical
standard deviation of 1. Additionally, we normalize
the length of the feature vectors to one, which is
adequate for inner-product based classifiers.
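The two post-processing steps can be sketched as follows; this is a simple illustration, and the handling of constant features and zero-length vectors is our own assumption.

import numpy as np

def preprocess(X):
    # Z-transformation: every feature column gets mean 0 and standard deviation 1.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0              # constant features would otherwise divide by zero
    Z = (X - mean) / std
    # Normalize every feature vector (row) to unit length for
    # inner-product based classifiers.
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return Z / norms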
4.4.2 Feature Selection
In practice, machine learning algorithms
tend to degrade in performance when faced with
many features that are not necessary for predicting
the correct label. The problem of selecting a subset of relevant features, while ignoring the rest, is a challenge that all learning schemes are faced with. It can be cast as a search through the space of feature subsets: in this space every state represents a feature subset, and operators that add and eliminate single features determine the connections between the states. Because there are 2^n − 1 different non-empty subsets of n features, a complete search through the state space is impractical and heuristics need to be employed. The
classifier operates on an independent validation set;
the search algorithm systematically adds features to and removes features from a current subset. In our
experiments, we apply the so-called best-first search
strategy, which expands the current node (i.e., the
current subset), evaluates its children and moves to
the child node with the highest estimated
performance. The underlying idea is to shorten the
search for each iteration by considering not only the
information of the best child node (as described in
the greedy approach above) but also the other
evaluated children. More formally, by ranking the operators with respect to the estimated performance of the children, a compound operator c_i can be defined as the combination of the best i + 1 operators.
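The subset search can be sketched as follows; the stopping criterion (max_stale), the toy evaluation function, and the names are our own illustrative choices, not the exact procedure used in our experiments.

import itertools
from heapq import heappush, heappop

def best_first_selection(evaluate, n_features, max_stale=5):
    # Best-first forward search over feature subsets.  'evaluate' scores a
    # subset, e.g. the classifier's F-measure on the independent validation set.
    # Children of a node are the subsets obtained by adding one feature; the
    # best unexpanded node is expanded next; the search stops after max_stale
    # expansions without improvement.
    counter = itertools.count()                  # tie-breaker for the heap
    best_subset = frozenset()
    best_score = evaluate(best_subset)
    open_list = [(-best_score, next(counter), best_subset)]
    visited = {best_subset}
    stale = 0
    while open_list and stale < max_stale:
        _, _, subset = heappop(open_list)
        improved = False
        for f in range(n_features):
            child = subset | {f}
            if child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heappush(open_list, (-score, next(counter), child))
            if score > best_score:
                best_subset, best_score, improved = child, score, True
        stale = 0 if improved else stale + 1
    return best_subset, best_score

# Toy usage: features 1 and 3 are the only useful ones.
print(best_first_selection(lambda s: len(s & {1, 3}) - 0.01 * len(s), n_features=5))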
4.5 Evaluation Criteria
We use 10-fold cross-validation as our
evaluation method and report a variety of evaluation
measures. For each email four different scenarios are
possible: true positive (TP, correctly classified
phishing email), true negative (TN, correctly
classified ham email), false positive (FP, ham email
wrongly classified as phishing), and false negative
(FN, phishing email wrongly classified as ham). We
do not report accuracy, i.e., the fraction of correctly
classified emails, as this measure is only of limited
interest in a scenario where the different classes are
very unevenly distributed. More importantly, we
report standard measures, such as precision, recall,
and F-measure as well as the false positive and the
false negative rate. These measures are defined as:

precision = |TP| / (|TP| + |FP|)
recall = |TP| / (|TP| + |FN|)
F = 2 · precision · recall / (precision + recall)
fpr = |FP| / (|FP| + |TN|)
fnr = |FN| / (|TP| + |FN|)
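For reference, a small Python helper that computes these measures from the four confusion-matrix counts; the example numbers are made up for illustration and do not represent experimental results.

def evaluation_measures(tp, tn, fp, fn):
    # Compute the measures above from the four confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    fnr = fn / (tp + fn)
    return precision, recall, f_measure, fpr, fnr

print(evaluation_measures(tp=450, tn=1980, fp=20, fn=50))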
Note that in email classification, errors are not of
equal importance. A false positive is much more
costly than a false negative. It is thus desirable to
have a classifier with a low false positive rate. For
most classifiers the false positive rate may be reduced
at the cost of a higher false negative rate by changing
a decision threshold. Therefore, content-based phishing filters are needed to fill the remaining security gap. The classifiers are robust and can identify most future phishing emails. Of course, they can be further improved by including blacklists and whitelists. As new phishing emails appear frequently, it is nevertheless necessary to update the filters at short time intervals. Emails can be selected for annotation using the active learning approach, which reduces the human effort while selecting the most informative emails for training. With the selected emails annotated by a small task force of human experts, the filters can be continually re-trained and updated. Using this scheme, the quality of the filters can be kept at a high level using the infrastructure that is already available at companies that constantly update spam filters and virus scanners.
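A hedged sketch of the decision-threshold adjustment mentioned above: given the classifier's phishing scores on a held-out sample of known ham emails, pick the threshold so that the estimated false positive rate stays below a target value. The function name and target value are illustrative assumptions.

import numpy as np

def threshold_for_target_fpr(ham_scores, target_fpr=0.001):
    # ham_scores: the classifier's phishing probabilities on known ham emails.
    # Choose the smallest threshold such that at most a fraction target_fpr of
    # these ham emails would score above it.
    scores = np.sort(np.asarray(ham_scores, dtype=float))
    k = int(np.ceil((1.0 - target_fpr) * len(scores))) - 1
    k = min(max(k, 0), len(scores) - 1)
    return scores[k]

# Toy usage with random scores; an email is classified as phishing only if
# its score exceeds the returned threshold.
rng = np.random.default_rng(1)
print(threshold_for_target_fpr(rng.uniform(size=10000), target_fpr=0.001))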

V. CONCLUSION

Phishing has become a serious threat to global security and the economy. Because of the fast rate
of emergence of new phishing websites and because
of distributed phishing attacks it is difficult to keep
blacklists up to date.
In this paper we have described a number of
features of phishing mails which are able to reduce
the rate of successful phishing attempts far below
1%. In contrast to many other approaches, most features are not handcrafted but are themselves statistical models, which capture different email aspects and are trained using annotated training emails. In the next phase of the AntiPhish project we will implement this approach for an email stream in a realistic environment.

