
Feature

Deep Web Data Extraction Based on URL and Domain Classification

B. Aysha Banu is a research scholar at Mohamed Sathak Engineering College in Ramanathapuram, Tamil Nadu, India.

M. Chitra, Ph.D., is a professor in the department of information technology at Sona College of Technology, Salem, Tamil Nadu, India.

The rapid development of computer and networking technologies has increased the popularity of the web, which has led to the presence of more and more information on the web. However, the explosive increase of information online leads to some search problems; specifically, search engines usually return too many unrelated results on a given query.

The deep web is content that is dynamically generated from data sources, namely file systems or databases. Unlike surface web pages, which are collected by following the hyperlinks embedded within already collected pages, data from the deep web are guarded by search interfaces such as web services, HTML forms or the programmable web, and they can be retrieved only by database queries. Surface web content is defined as static, crawlable web pages. The surface web contains a large amount of unfiltered information, whereas the deep web includes high-quality, managed and subject-specific information.1 The deep web grows faster than the surface web because the surface web is limited to what is easily found by search engines. The deep web covers domains such as education, sports and the economy. It contains huge amounts of information and valuable content.2 Because deep web information can be found only by queries, it is necessary to design a special search engine to crawl deep web pages. Deep web data extraction is the process of extracting a set of data records, and the items that they contain, from a query result page. Such structured data can later be integrated into results from other data sources and given to the user in a single, cohesive view. Domain identification is used to identify the query interfaces related to the domain from the forms obtained in the search process. The domain classifications are done based on the number of matching results obtained for similar criteria between the query interface and the domain, using the database summary.

DWDE FRAMEWORK BASED ON URL AND DOMAIN CLASSIFICATION
The Deep Web Data Extraction (DWDE) framework seeks to provide accurate results to users based on their URL or domain search. The complete steps of the DWDE framework are shown in figure 1. Initially, the collected web sites are categorized into surface web or deep web repositories based on their content. The user gives a query to retrieve the relevant web pages. The user can search the query based on the following two criteria:
1. URL
2. Domain
If the user searches by URL, the proposed framework validates whether it is a live web site. If it is a valid web site, the necessary and important contents are extracted from the web site based on the tag information. If the keyword or domain information is given directly to the search engine, the contents are extracted by matching the given keyword against the web site content.
For both search criteria, a frequency calculation is applied to count the number of occurrences of the given query terms in the relevant web sites. The domain classification algorithm is designed to predict the classified domains, and it retrieves the accurate web pages for users.
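To make the frequency calculation and domain prediction steps concrete, the following Python sketch counts how often the query terms occur in the extracted text of each candidate site and picks the domain whose content matches most often. It is a simplified illustration, not the authors' algorithm; the sample pages, domain labels and scoring rule are hypothetical.

# Simplified sketch of the frequency calculation and domain prediction
# described above; the sample data and scoring rule are hypothetical.
import re
from collections import Counter

def term_frequency(query, text):
    """Count how many times the query terms occur in the page text."""
    terms = query.lower().split()
    words = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(words[term] for term in terms)

def predict_domain(query, pages):
    """pages maps a domain label to the extracted text of its web sites."""
    scores = {domain: term_frequency(query, text) for domain, text in pages.items()}
    return max(scores, key=scores.get), scores

pages = {                                      # hypothetical extracted site contents
    "education": "college courses admission exam results college",
    "sports": "match score team player tournament score",
}
print(predict_domain("college exam", pages))   # ('education', {'education': 3, 'sports': 0})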
Classifying the Web Site
The Internet contains a huge number of web pages and web content. Web pages can be categorized into two types, namely the surface web repository and the deep web repository. This classification is based on the static or dynamic nature of the web pages.
The web sites are classified based on the tag information. If a web site includes any form tags, it is classified as a deep web site. (It includes dynamic information.) If the web site does not include any form tags, it is classified as a surface web repository. (It includes only static information.) The DWDE framework helps process deep web pages to provide the accurate and necessary web pages to the users.
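The form-tag rule above is simple enough to sketch in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it fetches the page, reports whether the site is live and labels it as a deep or surface web site depending on whether a form tag is present. The example URL is only a placeholder.

# Minimal sketch of the web site classification step: a page with a <form>
# tag is treated as deep web, otherwise as surface web; unreachable URLs are
# reported as not found. The URL below is a placeholder for illustration.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.error import URLError

class FormDetector(HTMLParser):
    """Records whether any <form> tag appears in the document."""
    def __init__(self):
        super().__init__()
        self.has_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.has_form = True

def classify_site(url, timeout=10):
    """Return 'not found', 'deep web' or 'surface web' for the given URL."""
    try:
        html = urlopen(url, timeout=timeout).read().decode("utf-8", "ignore")
    except (URLError, ValueError):
        return "not found"                     # site is not live or the URL is invalid
    detector = FormDetector()
    detector.feed(html)
    return "deep web" if detector.has_form else "surface web"

print(classify_site("https://example.com"))    # placeholder URL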
Figure 1: Deep Web Data Extraction Framework Flow. The user query is routed to search by URL or search by domain; the URL classifier separates not-found web sites from live web sites; tag-based feature extraction and keyword matching with the relevant site contents feed a frequency calculation; and the domain classification algorithm produces the classified domain, which the query processor uses to return the resultant web pages. Source: Banu and Chitra. Reprinted with permission.
Tag-based Feature Extraction (TFE)
Tags such as the title, headers, anchors, metadata about the Hypertext Markup Language (HTML) document, paragraphs, elements that group in-line content and images are utilized to extract the features for domain classification. Most of the domain-specific necessary terms appear under these tags. The existing web-page classification mechanism uses the HTML tags for domain classification. In this classification method, stop word removal is applied to all the information from each of those tags to extract the essential features. Each stemmed term with its corresponding tag creates a feature. For example, the word mining in the title tag, the word mining in the <b> tag and the word mining in the <li> tag are all considered different features, even though they come from similar HTML tags.
The DWDE framework uses a limited set of tags to extract the most important features to recognize the domain of the given URL. These tags are used to avoid spending time extracting the less-important features.
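As a rough illustration of tag-based feature extraction, the sketch below collects terms only from a small set of tags, removes stop words, applies a crude suffix-stripping stemmer and keeps each (tag, term) pair as a separate feature. The tag list, stop-word list and stemmer are assumptions made for the example, not the authors' exact choices.

# Illustrative tag-based feature extraction: each (tag, stemmed term) pair is
# a distinct feature, so "mining" in <title> and "mining" in <p> differ.
import re
from html.parser import HTMLParser

IMPORTANT_TAGS = {"title", "h1", "h2", "a", "meta", "p", "b", "li", "img"}
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for"}

def stem(word):
    """Very crude suffix-stripping stemmer, used only for illustration."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

class TagFeatureExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None
        self.features = set()              # set of (tag, stemmed term) pairs

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current is None:
            return
        for word in re.findall(r"[a-z]+", data.lower()):
            if word not in STOP_WORDS:
                self.features.add((self.current, stem(word)))

extractor = TagFeatureExtractor()
extractor.feed("<title>Data Mining</title><p>mining of the deep web</p>")
print(sorted(extractor.features))
# [('p', 'deep'), ('p', 'min'), ('p', 'web'), ('title', 'data'), ('title', 'min')]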
Domain Classification
After extracting the source information, a visual block tree is constructed using the tags. Constructing a tag tree from the web site is an essential step for most web content extraction methods. In the DWDE framework, the HTML tag is used to formulate the corresponding tag tree. Based on the properties of the HTML tag and text, the tag node is defined by tag-name, type, parent, child-list, data, text-num and attribute. The tag-name denotes the name of the tag; type denotes the type of each node, where nodes are divided into branch nodes; parent represents the parent node; child-list is the set of successors; data stores the content of the node; text-num denotes the total number of punctuation marks and words in all the descendants of each node; and attribute denotes a mapping of the characteristics of the HTML tag. The root node represents the whole page, and each leaf block in the tree corresponds to a block that cannot be further segmented. For each tag in the tree, the keyword frequency is calculated for the extracted information.
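The tag node described above might be modeled as follows. This is a minimal Python sketch with a small hand-built tree; how the framework actually builds the tree from HTML and assigns node types is not shown, and the field names simply mirror those listed in the text.

# Sketch of the tag-node structure with the fields named in the article;
# the hand-built tree and the keyword count are illustrative only.
from dataclasses import dataclass, field

@dataclass
class TagNode:
    tag_name: str                          # name of the HTML tag
    node_type: str = "branch"              # e.g., branch or leaf node
    parent: "TagNode" = None
    child_list: list = field(default_factory=list)
    data: str = ""                         # text content stored at this node
    text_num: int = 0                      # words and punctuation in all descendants
    attribute: dict = field(default_factory=dict)

    def add_child(self, child):
        child.parent = self
        self.child_list.append(child)
        return child

def keyword_frequency(node, keyword):
    """Count occurrences of a keyword in this node and all its descendants."""
    count = node.data.lower().split().count(keyword.lower())
    for child in node.child_list:
        count += keyword_frequency(child, keyword)
    return count

root = TagNode("html")                     # root node represents the whole page
body = root.add_child(TagNode("body"))
body.add_child(TagNode("p", node_type="leaf", data="deep web data extraction"))
body.add_child(TagNode("p", node_type="leaf", data="surface web and deep web"))
print(keyword_frequency(root, "web"))      # 3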
PERFORMANCE ANALYSIS
The DWDE framework was compared with the existing Genetic Algorithm (GA)3 and Naive Bayes (NB)4 classifiers with respect to precision, recall and F-measure. The proposed system is able to search a query with multiple keywords, so it is also compared with the existing Multi-keyword Text Search (MTS)5 algorithm, and the execution time is investigated. The DWDE framework can be executed on any database collected from real-time online data sets.
Precision and Recall Analysis
Precision is the number of true positives divided by the total number of predicted positives (true positives plus false positives), providing the percentage of returned results that are correct. Recall is the number of true positives divided by the sum of true positives and false negatives, providing the percentage of relevant results that are found.

In this experiment, the DWDE framework was validated for diverse domains, including association, radio, travel, disaster and child games. The precision and recall values are noted, and the results are shown in figures 2 and 3. The proposed DWDE framework uses only a limited number of valuable tags to classify the domains, so it returns better and more relevant pages for the user. As a result, the framework naturally achieves better precision and recall values.
The results show that the DWDE framework performs better than the existing GA and NB methods.

Figure 2: Precision for DWDE Framework Compared to GA and NB Methods. Precision of DWDE-DC, GA and NB across the tested domains (association, radio, travel, disaster, child games and others). Source: Banu and Chitra. Reprinted with permission.

Figure 3: Recall Values for DWDE Framework Compared to GA and NB Methods. Recall of DWDE-DC, GA and NB across the same domains. Source: Banu and Chitra. Reprinted with permission.
F-measure
The F-measure summarizes a system's or method's classification performance and is based on its precision and recall scores. The greater the value of the F-measure, the better the performance of the system. The DWDE framework results in higher F-measure values than the GA and NB methods. Because the precision and recall values for the method are better, this is reflected in the F-measure calculation. The DWDE framework can therefore classify web pages better than the existing methods. The result of the F-measure analysis among various domains is depicted in figure 4.
Figure 4: F-measure Values of DWDE Compared to GA and NB Methods. F-measure of DWDE-DC, GA and NB across the same domains. Source: Banu and Chitra. Reprinted with permission.
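For reference, the three measures can be computed from the standard formulas, as in the generic Python sketch below; the counts used here are made up for illustration and are not the article's experimental data.

# Standard precision, recall and F-measure formulas with hypothetical counts.
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

tp, fp, fn = 40, 10, 15                    # hypothetical classification counts
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 2), round(r, 2), round(f_measure(p, r), 2))   # 0.8 0.73 0.76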



Execution Time Analysis
The execution time is evaluated based upon three types of processes:
• Time taken for the raw data set
• Time taken for HTML parsing
• Time taken for domain classification
The three criteria are investigated for the proposed method and the existing classifiers. The DWDE framework uses limited tags to extract the features, so it takes less time to execute the domain classification process and retrieve the resultant web pages.
The results (figure 5) show that the proposed framework takes less execution time to complete the task than the GA and NB methods.
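One plausible way to collect the three stage timings is sketched below, using Python's time.perf_counter. The three stage functions are hypothetical stand-ins for the framework's real loading, parsing and classification steps, so only the measurement pattern is meaningful.

# Timing sketch: the stage functions are placeholders, not the real framework.
import time

def load_raw_dataset():
    return ["<html><body><p>deep web page</p></body></html>"] * 1000

def parse_html(pages):
    return [page.lower() for page in pages]        # placeholder for real parsing

def classify_domains(parsed):
    return ["education" for _ in parsed]           # placeholder for classification

def timed(label, func, *args):
    """Run func, report the elapsed time in milliseconds and return its result."""
    start = time.perf_counter()
    result = func(*args)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.2f} ms")
    return result

pages = timed("raw data set", load_raw_dataset)
parsed = timed("HTML parsing", parse_html, pages)
labels = timed("domain classification", classify_domains, parsed)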

Figure 5: Execution Time for Raw Data Set, HTML Parsing and Domain Classification. Execution time (ms) of DWDE-DC, GA and NB for each type of process. Source: Banu and Chitra. Reprinted with permission.

Multi-keyword search is also possible in the DWDE framework. The execution time experiment was conducted for multiple keywords (up to five keywords). The DWDE method takes less execution time than the existing MTS algorithm, which is shown in figure 6.

Figure 6: Execution Time for Multi-keyword Search. Execution time (ms) of DWDE-DC and MTS for one to five keywords. Source: Banu and Chitra. Reprinted with permission.
CONCLUSION
The DWDE framework uses the tag-based feature extraction algorithm to retrieve the necessary data. The method can process a query in two ways, searching by URL and searching by domain. Hence, it provides a user-friendly search process to deliver the results. Most of the existing algorithms process either single-keyword or multiple-keyword queries, but the DWDE framework can process single-keyword queries, multiple-keyword queries and the appropriate domain classification.
The frequency calculation is applied to compare the matching frequencies between the user query and the relevant search web sites. Based on the frequency measures, the domain classification algorithm is introduced to retrieve the accurate resultant pages.
The experimental results are compared with the existing methods: the GA, NB and MTS algorithms. The proposed framework has better precision, recall and F-measure values than the existing GA and NB classifiers. Moreover, the time taken to execute the query search is less than that of the existing MTS algorithm. The DWDE framework is well suited to efficient user query retrieval.

ENDNOTES
1 Ferrara, E.; P. De Meo; G. Fiumara; R. Baumgartner; "Web Data Extraction, Applications and Techniques: A Survey," Knowledge-Based Systems, vol. 70, November 2014, p. 301-323
2 Liu, Z.; Y. Feng; H. Wang; "Automatic Deep Web Query Results User Satisfaction Evaluation With Click-through Data Analysis," International Journal of Smart Home, vol. 8, 2014
3 Ozel, S. A.; "A Web Page Classification System Based on a Genetic Algorithm Using Tagged-terms as Features," Expert Systems With Applications, vol. 38, 2011, p. 3407-3415
4 Ibid.
5 Sun, W.; B. Wang; N. Cao; M. Li; W. Lou; Y. Hou; "Verifiable Privacy-preserving Multi-keyword Text Search in the Cloud Supporting Similarity-based Ranking," IEEE Transactions on Parallel and Distributed Systems, 2013

