New Filtering Approaches for Phishing Email
Mrs.P.Lalitha*1, Sumalatha.Udutha*2
*1 Assistant Professor, Dept of Computer Applications, SNIST, Ghatkesar, Hyderabad, AP, India
*2 M.C.A Student, Dept of Computer Applications, SNIST, Ghatkesar, Hyderabad, AP, India
ABSTRACT: Phishing attacks aim to make web users believe that they are communicating with a trusted entity in order to steal account information, logon credentials, and identity information in general. This attack method, commonly known as "phishing," is most commonly initiated by sending out emails with links to spoofed websites that harvest information. Phishing is a significant problem involving fraudulent emails and web sites that trick unsuspecting users into revealing private information. Phishing has become more and more complicated, and sophisticated attacks can bypass the filters set up by anti-phishing techniques. Most phishing emails aim at withdrawing money from financial institutions or getting access to private information, which makes phishing a serious threat to global security and the economy. Phishing filters are therefore necessary and widely used to increase communication security. In this paper we describe a number of features that are particularly well suited to identifying phishing emails. These include statistical models for low-dimensional descriptions of email topics, sequential analysis of email text and external links, the detection of embedded logos, and indicators for hidden salting.
I. INTRODUCTION
Phishing has increased enormously over recent years and is a serious threat to global security and the economy. Criminals try to convince unsuspecting online users to reveal sensitive information, e.g., passwords, account numbers, social security numbers, or other personal information. Unsolicited mail disguised as coming from reputable online businesses such as financial institutions is often the vehicle for luring individuals to these fake, usually temporary sites.
In this paper we extend previous work in several major ways. First, we incorporate a large number of new features, in particular graphical features such as hidden salting detection, image distortion, and logo detection. Second, we report on many more experiments, in particular on real-life datasets. Third, we also consider spam, since in real life spam emails cannot be completely eliminated from an email stream before a dedicated phishing filter is applied. Fourth, we outline an active learning deployment strategy.
II. TYPES OF PHISHING ATTACKS
Two different types of phishing attacks may be distinguished: malware-based phishing and deceptive phishing. In malware-based phishing, malicious software is spread by deceptive emails or by exploiting security holes in computer software, and is installed on the user's machine. The focus of this paper is deceptive phishing, in which a phisher sends out deceptive emails pretending to come from a reputable institution, e.g., a bank, in order to lure the recipient into revealing confidential information. This information is then exploited by the phisher, e.g., by withdrawing money from the user's bank account. A number of tricks are common in deceptive phishing:
- Social engineering: The invention of plausible stories, scenarios, and methodologies to produce a convincing context, in addition to the use of personalized information.
- Mimicry: The email and the linked website closely resemble official emails and the official websites of the target. This includes the use of genuine design elements, trademarks, logos, and images.
- Email spoofing: Phishers hide the actual sender's identity and show a faked sender address to the user.
- URL hiding: Phishers attempt to make the URLs in the email and the linked website appear official and legitimate, and hide the actual link addresses.
- Invisible content: Phishers insert information into the phishing mail or website that is invisible to the user and aimed at fooling automatic filtering approaches.
- Image content: Phishers create images that contain the text of the message only in graphical form.
Phishing causes large losses to the economy. As the targeted organizations want to avoid bad press, they are reluctant to provide precise information on losses.
The requirements for the Targeted Malicious Email (TME) Attack tool were as follows:
- The user interface should be compact and easy to use.
- The harvesting function should be able to search web pages for contact details by crawling the target web site.
- The tool should be able to extract useful information (i.e., email addresses) from the harvested pages.
- Social engineering tricks should be applied by using flexible email contents (txt/html) and attachments.
- Different relay mail servers should be supported for mass email delivery.
International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 6, June 2013 ISSN: 2231-2803 http://www.ijcttjournal.org Page 1734
III. COUNTERMEASURES AGAINST PHISHING
A variety of countermeasures against phishing have been proposed. In this section we give an overview of these measures. We distinguish network- and encryption-based measures, blacklisting and white-listing, and content-based measures.
3.1 Network- and Encryption-Based Countermeasures
Communication-oriented measures aim at establishing secure communication. A first protection against malware-based phishing attacks is the installation of virus scanners and regular software updates. An additional measure is email authentication. There are several technical proposals, but currently the vast majority of email messages do not use any. Other approaches are password hashing (transmitting a hash computed from the password together with the domain name) and two-factor authentication, where two independent credentials are used for authentication, e.g., smartcards and passwords.
3.2 Blacklisting and White-listing
One approach to phishing prevention concentrates on checking web addresses when a page is rendered in a web browser. In the Mozilla Firefox browser, for instance, each web page requested by a user is checked against a blacklist of known phishing sites. This list is automatically downloaded to the local machine and updated at regular intervals. It is well known, however, that new phishing sites appear frequently: about 35 new phishing sites were detected per hour. As phishing attacks pertain to a finite number of target institutions, white-list approaches have been proposed. Here a white-list of good URLs is compared to external links in incoming emails. This approach seems more promising, but maintaining a list of trustworthy sources can be a time-consuming and labor-intensive task. Another drawback of white-lists is that they can produce false positives, i.e., filtered-out ham emails, whereas blacklists can only produce false negatives, i.e., missed spam emails, a less severe type of error.
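The comparison of external links against a blacklist and a white-list can be illustrated with a minimal Python sketch. The list contents, the URL pattern, and the function names (`host_of`, `matches`, `classify_links`) are assumptions made for illustration; they are not taken from the paper or from any browser implementation.

```python
import re
from urllib.parse import urlparse

# Hypothetical lists for illustration only; real lists are maintained externally.
WHITELIST = {"example-bank.com"}
BLACKLIST = {"phish.example.net"}

# Simplified pattern for http(s) links in plain or HTML email text (an assumption).
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def matches(host, domains):
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify_links(email_text):
    """Label each external link as 'blacklisted', 'whitelisted', or 'unknown'."""
    labels = {}
    for url in URL_PATTERN.findall(email_text):
        host = host_of(url)
        if matches(host, BLACKLIST):
            labels[url] = "blacklisted"
        elif matches(host, WHITELIST):
            labels[url] = "whitelisted"
        else:
            labels[url] = "unknown"
    return labels
```

Matching on the full host name rather than on a substring matters here: a link such as `http://example-bank.com.evil.org/` contains the trusted name but is not a subdomain of it, so the sketch labels it as unknown.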
3.3 Content-Based Filtering
In this paper we concentrate on content-based filtering approaches to detect phishing attacks. These evaluate the content of an email or the associated website and try to identify the different tricks used to produce a plausible phishing attack, e.g.:
- The detection of typical formulations urging the user to enter confidential information.
- The detection of design elements, trademarks, and logos of known brands. Note that only relatively few brands are attacked, e.g., 144 different brands in December 2007.
- The identification of spoofed sender addresses and URLs.
- The detection of invisible content inserted to fool automatic filtering approaches.
- The analysis of images which may contain the message text.
The filters statistically combine the evidence from many features and classify a communication as phishing or non-phishing.
Common tricks of spammers, known as message salting, are the inclusion of random strings and diverse combinatorial variations of spacing, word spelling, word order, etc. Some salting techniques, called hidden salting, cause messages to appear visually the same to the human eye even if the machine-readable forms are very different. We use a computer vision technique to segment text into partitions that have the same reading order, and a document image understanding technique to find the reading order of these partitions. Specifically, given an email as input, a text production process, e.g., a web browser, creates a parsed, internal representation of the email text and drives the rendering of that representation onto some output medium, e.g., a browser window. Our proposed method of detecting text salting takes place during this rendering process and is composed of two steps: detecting invisible glyphs and resolving the detected salting. Detection tests each glyph against these glyph visibility conditions:
1. Clipping: the glyph is drawn within the physical bounds of the drawing clip, which is a type of spatial mask;
2. Concealment: the glyph is not concealed by other glyphs or shapes;
3. Font color: the glyph's fill color contrasts well with the background color;
4. Glyph size: the glyph's size and shape are sufficiently large.
Failure to comply with any of these conditions results in an invisible glyph, which we use to identify hidden salting. For example, zero-sized fonts violate the glyph size condition. To resolve salting detected through these conditions, we eliminate all invisible glyphs (i.e., we retain only what is perceived).
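The four glyph visibility conditions can be sketched in Python as follows. This is a simplified illustration, not the rendering-level implementation described above: the `Glyph` record, the two thresholds, the use of Euclidean RGB distance as a contrast measure, and the `concealed` flag (assumed to be supplied by the renderer) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float                  # left edge of the glyph box
    y: float                  # top edge of the glyph box
    width: float
    height: float
    fill: tuple               # (r, g, b), each 0-255
    background: tuple         # (r, g, b), each 0-255
    concealed: bool = False   # assumed to be reported by the renderer


# Assumed thresholds, chosen for illustration only.
MIN_SIZE = 2.0             # minimum glyph width and height in pixels
MIN_COLOR_DISTANCE = 60.0  # minimum Euclidean RGB distance to the background


def inside_clip(g, clip):
    """Condition 1: the glyph box lies within the clip rectangle (x, y, w, h)."""
    cx, cy, cw, ch = clip
    return (g.x >= cx and g.y >= cy
            and g.x + g.width <= cx + cw and g.y + g.height <= cy + ch)


def contrasts(g):
    """Condition 3: the fill color is far enough from the background color."""
    dist = sum((a - b) ** 2 for a, b in zip(g.fill, g.background)) ** 0.5
    return dist >= MIN_COLOR_DISTANCE


def is_visible(g, clip):
    """A glyph is visible only if all four conditions hold."""
    large_enough = g.width >= MIN_SIZE and g.height >= MIN_SIZE  # condition 4
    return inside_clip(g, clip) and not g.concealed and contrasts(g) and large_enough


def has_hidden_salting(glyphs, clip):
    """Any invisible glyph is treated as an indicator of hidden salting."""
    return any(not is_visible(g, clip) for g in glyphs)


def perceived_text(glyphs, clip):
    """Resolve salting by keeping only the glyphs a reader would perceive."""
    return "".join(g.char for g in glyphs if is_visible(g, clip))
```

In this sketch, white-on-white characters, zero-sized glyphs, and glyphs positioned outside the clip region are all dropped by `perceived_text`, so the text passed on to later features is the text the reader actually sees.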
IV. CLASSIFICATION EXPERIMENTS
4.1 Phishing Classifiers
As discussed in Section 3.3, content-based phishing classifiers have to integrate a large number of features. It turns out that methods used for text classification also yield the best results for content-based phishing classification. Prominent methods are variants of tree classifiers such as Random Forests, and Support Vector Machines. A key requirement is that they are able to limit the influence of unimportant features and capture only systematic dependencies. In the context of phishing classification different setups may be used. Statistical phishing classifiers only work if the features of new emails correspond to the composition of the emails in the training set. If emails with a completely new combination of feature values occur, the filter has to be adapted and retrained. It turns out that the filters are quite robust, as the classification usually depends not on a single feature value but on the cumulative effect of several features.
4.2 Active Learning
Advanced classifiers not only predict a class for a new email but are also able to estimate the probability that this class is the true class. This can be used to provide a self-diagnosis of the reliability of classifications. If unusual combinations of feature values occur that are dissimilar from all examples seen in the training set, the classifier will usually be undecided and predict low probability values for all classes. This effect can be used to adapt classifiers to new types of phishing emails. If an email with high classification uncertainty is detected, it may contain a novel combination of features that never occurred in the training set. Active learning strategies select such emails as training examples for annotation by experts. Active learning is well suited for the selection of examples in the expert labs of spam and phishing filter providers.
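The selection of uncertain emails for annotation can be sketched as a simple form of uncertainty sampling. This is an illustrative Python sketch under assumptions: the function name `select_for_annotation`, the `budget` parameter, and the use of "one minus the highest class probability" as the uncertainty score are not taken from the paper.

```python
def select_for_annotation(emails, class_probs, budget):
    """Return the `budget` emails the classifier is least certain about.

    `class_probs[i]` holds the predicted class probabilities for `emails[i]`.
    The uncertainty of an email is 1 minus its highest class probability, so
    emails for which no class receives a high probability are ranked first.
    """
    if len(emails) != len(class_probs):
        raise ValueError("emails and class_probs must have the same length")
    scored = [(1.0 - max(probs), email) for email, probs in zip(emails, class_probs)]
    # Stable sort on the uncertainty score only, most uncertain first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [email for _, email in scored[:budget]]
```

A possible usage: with class probabilities `[0.98, 0.02]`, `[0.52, 0.48]`, `[0.03, 0.97]`, and `[0.40, 0.60]` for four emails and a budget of two, the second and fourth emails are passed to the experts, since the classifier is nearly undecided on them.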
4.3 Description of Data
We have compiled a real-life dataset of 20000 ham and phishing emails over a period of almost seven months, from April 2007 to mid-November 2007. These emails were provided by our project partners. We assume a ratio of four to five ham emails for each phishing email. Specifically, our corpus consists of 16364 ham emails and 3636 phishing emails. For an additional experiment we added 20000 spam emails from the same time period to create an overall corpus of 40000 emails. We chose such a large number of spam emails to approximately reflect the fact that in real life the majority of emails are unsolicited.
4.4 Feature Processing and Feature Selection
The features used in our system come from a variety of different sources. It is thus necessary to post-process them to supply the classifiers with unified inputs.
4.4.1 Feature Processing
It is not advisable to directly use the unmodified feature values as input for a classifier. We perform scaling and normalization. Scaling guarantees that all features have values within the same range. Many classifiers are based on distance measures, such as the Euclidean distance, which overemphasize features with large values. We perform a Z-transformation to ensure that all features have an empirical mean of 0 and an empirical standard deviation of 1. Additionally, we normalize the length of the feature vectors to one, which is adequate for inner-product-based classifiers.
4.4.2 Feature Selection
In practice, machine learning algorithms tend to degrade in performance when faced with many features that are not necessary for predicting the correct label. The problem of selecting a subset of relevant features, while ignoring the rest, is a challenge that all learning schemes face. Feature selection can be viewed as a search through a state space. In this space every state represents a feature subset. Operators that add and eliminate single features determine the connections between the states.
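The steps of Section 4.4 up to this point can be sketched in Python: the Z-transformation and length normalization of Section 4.4.1, and the add/eliminate operators that connect feature subsets in the state space. The function names are hypothetical, the population standard deviation is assumed for the "empirical standard deviation", and mapping constant features to zero and never generating the empty subset are illustrative choices rather than details from the paper.

```python
import math
import statistics


def z_transform(rows):
    """Standardize each feature column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has standard deviation 0; dividing by 1 maps it to 0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def unit_length(row):
    """Scale a feature vector to Euclidean length one (zero vectors are kept)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)


def neighbors(subset, all_features):
    """States reachable from `subset` by adding or eliminating one feature."""
    subset = frozenset(subset)
    added = [subset | {f} for f in all_features if f not in subset]
    # Do not eliminate the last remaining feature: the empty subset is excluded.
    removed = [subset - {f} for f in sorted(subset)] if len(subset) > 1 else []
    return added + removed
```

For example, with features `a`, `b`, and `c`, the state `{a, b}` is connected to `{a, b, c}` by an add operator and to `{a}` and `{b}` by eliminate operators.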
Because there are $2^n - 1$ different non-empty subsets of $n$ features, a complete search through the state space is impractical and heuristics need to be employed. The classifier operates on an independent validation set; the search algorithm systematically adds features to and removes features from a current subset. In our experiments, we apply the so-called best-first search strategy, which expands the current node (i.e., the current subset), evaluates its children, and moves to the child node with the highest estimated performance. The underlying idea is to shorten the search for each iteration by considering not only the information of the best child node (as in the greedy approach described above) but also that of the other evaluated children. More formally, by ranking the operators with respect to the estimated performance of the children, a compound operator $c_i$ can be defined as the combination of the best $i + 1$ operators.
4.5 Evaluation Criteria
We use 10-fold cross-validation as our evaluation method and report a variety of evaluation measures. For each email four different scenarios are possible: true positive (TP, correctly classified phishing email), true negative (TN, correctly classified ham email), false positive (FP, ham email wrongly classified as phishing), and false negative (FN, phishing email wrongly classified as ham). We do not report accuracy, i.e., the fraction of correctly classified emails, as this measure is of only limited interest in a scenario where the different classes are very unevenly distributed. More importantly, we report standard measures such as precision, recall, and F-measure as well as the false positive and the false negative rate. These measures are defined as:
$$\text{precision} = \frac{|TP|}{|TP| + |FP|}, \qquad \text{recall} = \frac{|TP|}{|TP| + |FN|}, \qquad F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
$$\text{fpr} = \frac{|FP|}{|FP| + |TN|}, \qquad \text{fnr} = \frac{|FN|}{|TP| + |FN|}$$
Note that in email classification, errors are not of equal importance. A false positive is much more costly than a false negative. It is thus desirable to have a classifier with a low false positive rate. For most classifiers the false positive rate may be reduced at the cost of a higher false negative rate by changing a decision threshold.
Therefore content-based phishing filters are needed to fill the remaining security gap. The classifiers are robust and can identify most future phishing emails. Of course they can be further improved by including blacklists and white-lists. As new phishing emails appear frequently, it is nevertheless necessary to update the filters at short time intervals. Emails are selected for annotation using the active learning approach. This reduces the human effort for annotation while selecting the most informative emails for training. After these emails have been annotated by a small task force of human experts, the filters are continually re-trained and updated. By using this scheme the quality of the filters can be kept at a high level using the infrastructure that is already available at companies constantly updating spam filters and virus scanners.
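The evaluation measures of Section 4.5, and the effect of moving the decision threshold, can be illustrated with a short Python sketch. The function name `evaluate`, the convention that a score at or above the threshold means "phishing", and returning 0.0 for an undefined ratio are assumptions made for this illustration.

```python
def evaluate(labels, scores, threshold=0.5):
    """Compute precision, recall, F-measure, fpr and fnr.

    `labels[i]` is True for a phishing email and False for a ham email;
    `scores[i]` is the classifier's phishing score for the same email.
    """
    tp = fp = tn = fn = 0
    for is_phishing, score in zip(labels, scores):
        predicted_phishing = score >= threshold
        if predicted_phishing and is_phishing:
            tp += 1
        elif predicted_phishing and not is_phishing:
            fp += 1
        elif not predicted_phishing and is_phishing:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as 0.0 by convention.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, tp + fn),
    }
```

Calling `evaluate` on the same scores with a higher `threshold` classifies fewer emails as phishing, which can lower the false positive rate while raising the false negative rate, as described above.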
V. CONCLUSION
Phishing has become a serious threat to global security and the economy. Because of the fast rate of emergence of new phishing websites and because of distributed phishing attacks, it is difficult to keep blacklists up to date. In this paper we have described a number of features of phishing mails with which filters are able to reduce the rate of successful phishing attempts far below 1%. In contrast to many other approaches, most features are not handcrafted but are themselves statistical models, which capture different email aspects and are trained using annotated training emails. In the next phase of the AntiPhish project we will implement this approach for an email stream in a realistic environment.