
IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2016), December 23-25, 2016, Jaipur, India

Efficient Malicious Domain Detection Using Word Segmentation and BM Pattern Matching
Sachin Gupta
Department of Computer Science & Engineering
APEX Institute of Engineering & Technology, Jaipur
ersachingupta11@gmail.com

Abstract: On the World Wide Web, malicious links are highly problematic as a dissemination channel for malware. These suspicious links give web attackers full access to web pages on the internet and serve them as an instrument of attack. A victim's system is easily compromised and is then used to perform cyber-attacks such as stealing financial credentials, phishing, spamming, hacking and many other web attacks. A detection system must be accurate and fast enough to detect such cyber-attacks, and it must be able to find newly developed malicious URLs or malicious source code. Detecting malicious content in the network of web pages on the World Wide Web is a critical task. Cyber-attacks such as spamming and phishing are mounted by using malicious URLs. Malicious web sites, reached through malicious URLs, are the cornerstone of unlawful activity on the internet. The main challenge is to identify these attacks so that suspicious URLs, along with the source code of their web pages, can be resolved as malicious.

In this paper, a method is proposed for detecting malicious URLs of domains on the World Wide Web by using the BM (Boyer-Moore) [1] string pattern matching algorithm together with a word segmentation approach [2]. The nature of an attack is identified from a malicious URL or from the source code of the web site in question. The proposed approach is based on a real-time system that obtains suspicious URLs from the DNS server and then performs detection on the basis of word segmentation of the source code. The discriminative features of this system are verified by using the proposed method, which draws on a variety of properties, including text and links in the source code, making it a powerful and novel approach to the detection of suspicious URLs.

Keywords: BM (Boyer-Moore) Pattern, Malicious Web Pages, Classification Module, Web-Based Attacks.

I. INTRODUCTION

Since the early days of malicious URL detection, users have had to learn to manage personal risk. Many perilous situations can be studied in order to identify malicious activity. However, the internet offers few heuristics for deciding whether a web site is safe or hosts malicious source code. A wide range of socially undesirable enterprises, including spam-advertised sites and malicious source code, are supported by unlawful web sites. The World Wide Web is a killer application, but it carries a huge risk of cyber-crimes such as the theft of financial credentials, which raises the risk of encountering malicious source code while users browse web content. Various approaches have been proposed to predict web-based malicious attacks and to separate clean URLs from the complete database of a DNS server. A common remedy is to declare malicious URLs in a blacklist. Such a blacklist can be constructed accurately from human feedback, which saves time and gives positive results. Unknown malicious URLs, however, cannot be found efficiently, because matching suspicious URLs against the URLs distributed from the DNS server database is highly problematic. To complement blacklisting, this research uses a classification model to address malicious URLs, in which URL detection plays the critical role and drives the filtering. The model is based on predictions from URL features that are built to support the detection of safe URLs.

A. Background

Just as filenames are used to locate files on a local machine, Uniform Resource Locators (URLs) are used to locate web pages and other web resources. A user visits a site by typing a URL into the address bar of the browser, or by finding and clicking a link in the source code of a web page, which redirects the user to another web page. URLs are generally easy to read and understand as text, and they provide links to other sites or appear in email messages shown by an email client. URLs have the following form:

<protocol>://<name of the host><complete path>

The <protocol> portion declares which network protocol must be used to fetch the requested resource. The most common protocols are the Hypertext Transfer Protocol (HTTP) and HTTP with Transport Layer Security (HTTPS).

[978-1-5090-2807-8/16/$31.00 2016 IEEE]
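The URL layout given in the Background subsection, <protocol>://<name of the host><complete path>, can be illustrated with a short Python sketch. This is only an illustration using the standard library's `urllib.parse`; the helper name `url_components` and the example URL are hypothetical and are not taken from the paper.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URL into the three parts named in the Background subsection."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # e.g. "http" or "https"
        "host": parts.hostname,     # name of the host, lower-cased by urlsplit
        "path": parts.path,         # complete path on that host
    }


# Hypothetical example URL, used only to show the three components.
print(url_components("http://www.example.com/docs/index.html"))
```

Only the host component is needed for the lexical domain features discussed later; the path and the page source are what a signature matcher would inspect.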



Fig 1. The Components and Example of a URL

Figure 1 breaks a URL into its components for the HTTP protocol. The protocol portion as a whole cannot be used as a basis for classifying URLs, because most URLs look alike in this respect. We have noticed that phishing URLs and the contents of phishing web pages succeed by playing tricks on real users of the World Wide Web.

B. Motivation

The BM algorithm is an exact string matching technique. It compares the pattern with the text from right to left and applies two heuristic rules, the bad character rule and the good suffix rule, to determine how far the pattern can jump to the right after a mismatch [3]. Static analysis techniques provide better source code analysis in the BM pattern matching technique. The source code of each and every URL is analyzed by using the BM Matching Source Code Analyzer.

In this paper, URL forwarding and exploitation [4] are explored for the feasibility of detecting malicious domains visited on a cellular network, based completely on lexical characteristics of the domain names. In addition to the traditional quantitative features of the BM pattern matching technique for domain names, we also use a word segmentation algorithm to segment the domain names into individual words, which greatly expands the size of the feature set.

A brief review of the background of the problem and associated work is presented in Section 2 as a literature survey. The proposed method for malicious URL detection and identification is presented in Section 3. The algorithmic and observational analyses of the proposed model are given in Section 4. Section 5 gives the conclusion and future work on the proposed model.

II. LITERATURE SURVEY

Passive detection of malicious Uniform Resource Locators is an analysis of the data available in the source of a web page. It checks whether the source code contains malicious objects so that the code can be classified as malicious, and this is done with a pattern matching algorithm [1]. A data spotter downloads the source code and compares it with the virus signatures available in the database. If a match is found, the data spotter declares the matched URL a suspicious URL. The malicious URL detection solution given by Fuqiang Yu [1] is based on the BM pattern matching procedure.

The specific steps are as follows:
Step 1: Apply the BM pattern matching algorithm to the URL source and the virus signatures.
Step 2: Store the malicious URL and code description in the database as a BM model matching substring.
Step 3: Define the web data structures for malicious web page source code and prepare the database for signature matching.
Step 4: Apply the malicious web detection algorithm to the gathered URL source code and perform the malicious signature matching.

The malicious URL detection based on the BM pattern matching algorithm has been implemented. This procedure identifies suspicious URLs by using pattern matching on the downloaded URL source code against the signatures that the virus finder provides from the virus URL database. The BM string pattern matching algorithm provides higher efficiency in the detection of malicious URLs.

The virus characteristics are matched against the source code of a URL by using the BM matching algorithm in order to verify the safety of the URL in the database. Using the BM pattern technique, Fuqiang Yu reports detecting 203 malicious URLs in web page image search results [1]. With another method, Kaspersky scanning, the same author confirmed 190 URLs as malicious, giving an overall estimated error of only 6.9% and a final correct-detection rate of 91.1%. He concluded that URLs cleared by the malicious detection algorithm are highly safe for search engines to use when finding safer web images. Wei Wang and Kenneth E. Shirley [2] have given a method for detecting malicious domains using a word segmentation approach.

They find that word segmentation, when applied to domain names, adds substantial predictive power to a logistic regression model that classifies domain names based on their lexical features. In their experiments on real-world data, models that used word segmentation decreased relative misclassification rates and increased relative AUC rates by roughly 10% compared to similar models that did not use word segmentation. Their method also provides interpretable results, which show the words that attract users to malicious sites. These words would commonly change over time as attackers change their methods, but a model such as the one presented there could be fit to a sliding window of new data to continuously monitor and detect new words that are being used in malicious domains. Their results, of course, depend on the data they used, i.e. fresh domains visited on a cellular network, with WoT reputation ratings as the outcome. Using different thresholds for the outcome variable,
different time lags for the query that collected the WoT reputation ratings, or a different outcome variable altogether (such as one that is specifically tailored to mobile traffic) are attractive directions for future work. They also plan to look further into the relative performance of their models on fresh domains vs. relatively low-traffic domains, since their data set of domains that had not been seen in the prior 30 days consists of a mixture of these two types of domains.

Alexander Moshchuk and Henry M. Levy [5] proposed a method for detecting malicious web content that applies an execution-based technique. Their approach faces challenges such as achieving good interactive performance while defending against malicious URLs in the database of the DNS server. They note that the analysis relies on an execution model of malicious URL detection and has some defined limitations.

Haotian Liu, Xiang Pan and Zhengyang Qu [6] have given a method for detecting malicious web sites using learning-based analysis of suspicious URLs. They proposed a learning-based approach for separating web sites into 3 classes: benign, phishing, and malware. The analysis is based only on the URL itself, without accessing the target website, which removes the run-time latency and safeguards the user from being exposed to browser-based vulnerabilities. By carefully selecting features and learning algorithms, their system achieves 97.53% accuracy in detecting malicious web sites.

Matthew G. Schultz and Eleazar Eskin [7] have given a data mining method for efficient and fast detection of malicious URLs and malicious executable code. Their method is presented as the first contribution towards finding previously undetectable malicious executables. Their Multi-Naive Bayes method compares favorably with popular signature-based methods, and the results show a detection efficiency of 97.76%.

III. PROBLEM STATEMENT AND PROPOSED APPROACH

Classifying URL reputation is a binary classification problem in which positive results are malicious URLs and negative results are URLs that are not malicious [1]. If the classification of URLs is based on a learning-based model, malicious detection can be highly efficient and accurate in comparison to other models [9]. Another aspect is to classify web sites containing malicious source code by using the lexical relationship among the various URLs in the database of the DNS server.

A. Word Segmentation Technique [2]

The features that we generated to predict maliciousness fall into five groups:
(a) Basic features; (b) Character indicator variables; (c) Log-likelihood; (d) Top-level domains; and (e) Words.

Basic Features: We measured four basic features:
i. Number of characters: the number of characters in the domain name, excluding the top-level domain and all periods (mean = 11.5).
ii. Number of hyphens: the number of hyphens in the domain name (mean = 0.12).
iii. Number of digits: the number of digits in the domain name (mean = 0.09).
iv. Number of numbers: the number of numbers in the domain name, where a number is defined as a string of consecutive digits of length > 0 (mean = 0.04).

To allow non-linear relationships between these features and the outcome, we binned the number of characters into deciles, and the other three basic features into the bins {0, 1, 2, 3}. The hypothetical football-related domain 4downs-10yards.com, for example, contains 14 characters, 3 digits, one hyphen, and two numbers. Rather than 4 feature vectors, the binned versions of these four basic features contain 10, 4, 4, and 4 feature vectors, respectively.

Character Indicator Variables: We created 36 binary features to indicate the presence of each character from 'a' to 'z' and each digit from 0 to 9 in each domain name. The most and least frequently appearing characters were 'e' and 'q', respectively (occurring at least once in 198,304 and 5,152 domain names, respectively), and the most and least frequently occurring digits were 1 and 7, respectively (occurring at least once in 3,980 and 1,193 domain names, respectively).

Log-likelihood: We computed the log-likelihood of the sequence of characters in each domain name using a first-order Markov model as the probability model for characters in the English language. To define the transition probabilities between characters, we computed the table of first-order transitions from a list of the top 1/3 million unigrams from the Google n-grams corpus [10]. For each domain, we removed digits and hyphens, and then computed the log-likelihood of the sequence of remaining characters, using the distribution of first characters in the unigram list as the probability distribution of first characters for each domain. We also computed a normalized version of this feature in which we divided the log-likelihood by the number of characters in the domain (omitting digits and hyphens) to account for the fact that longer domains on average have lower log-likelihoods. Last, we binned these values into deciles (as we did with the number of characters feature), and we added an additional, 11th bin for the 197 domain names in the data that only contained digits and hyphens (and thus had a missing value for the log-likelihood) [11].

Top-level domains: We observed 857 unique top-level domains (TLDs) in our data, where every domain belongs to exactly one TLD (defined as existing on the Mozilla Public Suffix List [14]). The most common TLD is .com (58% of domains in our sample), and the next four most common are .net, .org, .de, and .co.uk (5%, 11%, 3%, and 3% of domains, respectively). 337 of the 857 TLDs were only observed once.

Words: Another source of features in domain names is individual words. First, we define some notation. In a set of domain names, we refer to a token as a single occurrence of a word within a domain name, where each unique token type
in the data defines a word. The vocabulary is the list of all words observed across the collection of domain names. If the domain www.duckduckgo.com, for example, were segmented into three tokens, {duck, duck, go}, it would only contain two words: go and duck.

B. Proposed Approach

When a malicious web page is accessed, it takes advantage of system or browser vulnerabilities to install programs such as Trojans. Other malicious programs can also be downloaded to the host in this way, leaving the host in an unsafe state [12]. The risk from Internet Explorer vulnerabilities is particularly high when browsing the internet. Generally, there are two ways in which these vulnerabilities are exploited:
i. Shell code embedded in the web page source is executed through a loophole in the browser.
ii. A program is downloaded and run by exploiting a component or other vulnerability.

Generally, the CreateObject() and ActiveXObject() functions are the ones most commonly found in the object tags of such pages [13]. The specific steps of the proposed approach are as follows:
i. Apply the word segmentation technique to all the domain names found in the source code of web pages.
ii. Classify the domain segments using the BM pattern.
iii. Match the virus signatures against the URL source by using the BM pattern matching algorithm.
iv. Search the source code of the collected URLs against the malicious signature database by using the malicious web URL detection algorithm.
v. Store the BM-mode matching substring in the database as the description of the malicious URL or malicious code signature.
vi. Detect malicious web page source code and signature matches as malicious web data structures in the database, to prepare better malicious pattern matching.
vii. Match the substring against the newly collected malicious signatures of malicious URLs.

IV. ALGORITHMIC AND OBSERVATIONAL MODEL

The web data structures for detecting malicious web objects are as follows:

(1) The data structure for malicious URL code features is as follows:
{
private string Identification; // length 14 bytes
public string URL_signature; // length 200 bytes
private string URL_type; // length 4 bytes
}

(2) The public class of the malicious code data structure is as follows:
public class url_information
{
private string Identification; // length 14 bytes
public string name_of_www; // length 200 bytes
private string url_description; // length 4 bytes
private string description; // length 200 bytes
}

Fig 2. Flow Chart of Signature Matching

A. Algorithm

BM Pattern (* pattern, * buffer, * url)
Input: The buffer holding the source code of the web page, the pattern of the virus signatures from the database, and the name of the URL.
Output: 1 as true and 0 as false.
The steps of the algorithm are as follows:
1. Apply the word segmentation technique to all the domain names found in the source code of the web page.
2. Use ADO database connection technology to retrieve the virus signature patterns that characterize the virus items in the database.
3. Match the source code against the virus signatures and check whether the comparison has reached the end of the string. If the comparison returns false, perform step 4; otherwise perform step 5.
4. Calculate the jump length in bytes according to the bad character rule and the good suffix rule.
5. Return true if the virus signature matching is successful; otherwise return false and go to step 3.
6. Return the final result, true or false, to the calling module.
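The matching loop of steps 3 to 5 can be sketched in Python as follows. This is a minimal illustration of Boyer-Moore search with the bad character and good suffix rules, not the paper's PHP/VC implementation. The function names (`bm_search`, `contains_signature`) and the sample signature string are hypothetical, and the ADO database lookup of step 2 is replaced here by a plain list of signature strings.

```python
def _bad_character_table(pattern):
    """Rightmost index of each character in the pattern."""
    return {ch: i for i, ch in enumerate(pattern)}


def _good_suffix_table(pattern):
    """Shift for each mismatch position under the (strong) good suffix rule."""
    m = len(pattern)
    shift = [0] * (m + 1)
    border = [0] * (m + 1)
    i, j = m, m + 1
    border[i] = j
    while i > 0:
        # Extend borders to the left; record a shift where extension fails.
        while j <= m and pattern[i - 1] != pattern[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = border[j]
        i -= 1
        j -= 1
        border[i] = j
    # Fill remaining entries using the widest border of the whole pattern.
    j = border[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:
            j = border[j]
    return shift


def bm_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    bad = _bad_character_table(pattern)
    good = _good_suffix_table(pattern)
    s = 0  # current alignment of the pattern against the text
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # compare right to left
            j -= 1
        if j < 0:
            return s  # whole pattern matched
        bad_shift = j - bad.get(text[s + j], -1)
        s += max(good[j + 1], bad_shift)  # jump by the larger safe distance
    return -1


def contains_signature(buffer, signatures):
    """True if any signature occurs in the page source buffer."""
    return any(bm_search(buffer, sig) != -1 for sig in signatures)


# Hypothetical page source and signature, for illustration only.
page = "<script>eval(unescape('%u9090'))</script>"
print(contains_signature(page, ["unescape("]))
```

Taking the larger of the two shifts keeps every jump safe, because each rule on its own never skips a possible match. The good suffix shift is always at least one, so the loop always advances even when the bad character shift is zero or negative.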
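The basic features, character indicator variables and word tokens described under the Word Segmentation Technique in Section III can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the domain name is assumed to be passed in with its top-level domain already removed, and `segment` uses a simple fewest-words dictionary split over a toy vocabulary rather than the probabilistic word segmentation used in [2].

```python
import re
import string

# 26 letters + 10 digits = 36 character indicator variables.
ALPHABET = string.ascii_lowercase + string.digits


def basic_features(name):
    """The four basic counts for a domain name given without its TLD."""
    name = name.lower().replace(".", "")  # periods are excluded from the count
    return {
        "n_chars": len(name),
        "n_hyphens": name.count("-"),
        "n_digits": sum(ch in string.digits for ch in name),
        # A "number" is a maximal run of consecutive digits.
        "n_numbers": len(re.findall(r"[0-9]+", name)),
    }


def character_indicators(name):
    """One binary feature per letter a-z and digit 0-9."""
    present = set(name.lower())
    return {ch: int(ch in present) for ch in ALPHABET}


def segment(name, vocabulary):
    """Split name into the fewest vocabulary words, or None if impossible.

    Assumption: a toy stand-in for the segmentation algorithm of [2].
    """
    best = [None] * (len(name) + 1)  # best[i] = best split of name[:i]
    best[0] = []
    for end in range(1, len(name) + 1):
        for start in range(end):
            piece = name[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(name)]


print(basic_features("4downs-10yards"))
tokens = segment("duckduckgo", {"duck", "go"})
print(tokens, sorted(set(tokens)))
```

For 4downs-10yards the counts agree with the worked example in the text (14 characters, one hyphen, 3 digits, two numbers). For duckduckgo the three tokens reduce to the two words duck and go.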

B. Observational Model

The collected virus URLs and the entire set of URLs are part of the experiment in the comparison phase, and all virus detections are validated against the virus signature database by using the matching algorithm. PHP and VC languages are used in the implementation to validate and analyze the virus signature patterns [14]. The performance of malicious URL detection and the system performance of the algorithm are both measured by the rate of virus URL detection.

Fig 3. Detection Number Diagram of Malicious URL

Fig. 3 shows the number of malicious URLs visited daily, which reflects the uneven distribution of web pages [15]. The malicious URLs captured fall into categories such as music, pornography and others. Because the malicious page detection algorithm performs better than other methods, the malicious source codes as a whole are captured as virus URLs. Code obfuscation was seen in 27% of this malicious URL detection, and source code detection failed in 27%. URL exploits account for 33% of the complete detection of malicious code and malicious URLs. The other 40% are URL redirections found in the complete analysis of the detected malicious URLs of the malicious web pages.

V. CONCLUSION AND FUTURE WORK

We have reviewed and analyzed the existing malware detection techniques, compared their advantages and disadvantages, and discussed some problems that still remain. From this analysis, the research has focused on a new dynamic feature extraction technique for malware detection. The output of our lightweight method could likely be used to apply near-real-time detection of suspicious domains when a user attempts to visit a domain on a World Wide Web network.

In this paper, a new method is proposed for malicious URL detection using word segmentation and the BM pattern matching technique. Malicious source code and URLs are detected by using the proposed approach. BM pattern matching analyzes the complete source code, so that URLs whose source code is malicious are detected as malicious URLs. The detection of malicious URLs is demonstrated in this paper, and the accuracy of malicious URL detection is more than 85%, while other anti-virus software gives poor results in the same scenario. If a domain is estimated to have a high probability of being malicious based merely on its name, then a more costly analysis (such as web content-based analysis) could be used to determine further action, such as blocking the site or inserting a speed bump. In this way, the word segmentation and BM pattern matching techniques outlined here could improve existing systems that use machine learning to find malicious domains by generating thousands of further features with which to classify domains. Various aspects of malicious URL detection are presented in this paper, and we have introduced various associated techniques for the URL classification process, in which a target website is detected as malicious or not.

In future, more research can be done to find malicious URLs from web documents by exploiting the hyperlink architecture of the web. We have identified that the proposed method can be extended by combining it with Dynamically Mining Patterns without Pre-defined Elements.

REFERENCES
[1] Fuqiang Yu, "Malicious URL Detection Algorithm based on BM Pattern Matching", International Journal of Security and Its Applications, Vol. 9, No. 9, pp. 33-44, 2015.
[2] Wei Wang and Kenneth E. Shirley, "Breaking Bad: Detecting malicious domains using word segmentation", AT&T Labs Research, New York, NY 10007.
[3] F. Noriyuki and H. Pkenichi, "A personal system for web image retrieval", Proceedings of the 4th international symposium Information and communication technologies, (2005), pp. 209-216.
[4] E. Kirda, C. Kruegel and G. Vigna, "A Client-side Solution for Mitigating Cross-site Scripting Attacks", Proc. of 2006 ACM Symposium on Applied Computing, (2006), pp. 330-337.
[5] Alexander Moshchuk, Tanya Bragin, Damien Deville, Steven D. Gribble and Henry M. Levy, "SpyProxy: Execution-based Detection of Malicious Web Content", Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium, Article No. 3, ISBN: 111-333-5555-77-9, (2007).
[6] Haotian Liu, Xiang Pan and Zhengyang Qu, "Learning based Malicious Web Sites Detection using Suspicious URLs", Proc. of the 34th International Conference on Software Engineering, Northwestern University, IL, USA, (2009).
[7] Matthew G. Schultz, Eleazar Eskin and Erez Zadok, "Data Mining Methods for Detection of New Malicious Executables", SP '01 Proceedings of the 2001 IEEE Symposium on Security and Privacy, Page 38, IEEE Computer Society, Washington, DC, USA, 2001.
[8] A. Mori, T. Lzumida and T. Sawada, "A Tool for Analyzing and Detecting Malicious Mobile Code", Proc. of the 28th International Conference on Software Engineering, (2006), pp. 831-834.
[9] P. Euripides, V. Epimendes and M. Evangelos, "Searching for logo and trademark images on the web", Proceedings of the 6th ACM international conference Image and video retrieval, (2007), pp. 541-548.
[10] X F. He, D. Cai and J R. Wen, "Clustering and searching WWW images using link and page layout analysis", ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 3, no. 2, (2007), pp. 10-35.
[11] T. Dziubak and J. Matulewski, "An object-oriented implementation of a solver of the time-dependent Schrodinger equation using the CUDA technology", COMPUTER PHYSICS COMMUNICATIONS, vol. 183, no. 3, (2012), pp. 800-812.
[12] K. Wang and A R. Smith, "Efficient kinematical simulation of reflection high-energy electron diffraction streak patterns for crystal surfaces", COMPUTER PHYSICS COMMUNICATIONS, vol. 182, no. 10, (2011), pp. 2208-2212.
[13] S.-W. Hsiao, Y. S. Sun, F.-C. Ao and M. C. Chen, "A Secure Proxy-Based Cross-Domain Communication for Web Mashups", 2011 Ninth IEEE European Conference on Web Services (ECOWS), (2011), pp. 57-64.
[14] C.-M. Chen and Y.-H. Ou, "Secure Mechanism for Mobile Web Browsing", 2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), (2011), pp. 924-928.
[15] R. Bhandari and U. Suman, "Broker based secure web service composition using star topology", 2012 CSI Sixth International Conference on Software Engineering (CONSEG), (2012), pp. 1-7.
