surface web because the surface web is limited to what is easily found by search engines. The deep web covers domains such as education, sports and the economy. It contains huge amounts of information and valuable content.2 Because deep web information can be found only by queries, it is necessary to design a special search engine to crawl deep web pages. Deep web data extraction is the process of extracting a set of data records and the items that they contain from a query result page. Such structured data can be later integrated into results from other data sources and given to the user in a single, cohesive view. Domain identification is used to identify the query interfaces related to the domain from the forms obtained in the search process. The domain classifications are done based on the number of matching results obtained in the similar criteria among the query information is directly given to the search engine, the contents are extracted based on the given keyword and web site content matching. For both searching criteria, a frequency calculation is applied to calculate the number of occurrences among the given query with the relevant web sites. The domain classification algorithm is designed to predict the classified domains, and it retrieves the accurate web pages for users.

Do you have something to say about this article? Visit the Journal pages of the ISACA web site (www.isaca.org/journal), find the article and choose the Comments tab to share your thoughts.

Classifying the Web Site
The Internet contains a huge number of web pages and web content. Web pages can be categorized into two types, namely surface web repository and deep web repository. This classification is based on the static and dynamic nature of the web pages. The web
Figure 2: Precision for DWDE Framework Compared to GA and NB Methods
[Chart: precision (y-axis) by domain (x-axis) for DWDE-DC, GA and NB.]
[Chart: recall (y-axis) by domain (x-axis) for DWDE-DC, GA and NB.]
Domains shown in both charts: Association, Blogs, Car, CD, Child Games, Colleges, Disaster, Education, Encyclopedia, Flights, Food, Hotels, Housing, Labor, Maps, Marketing, Music, Radio, Schools, Search, Shopping, Sports, Travel.
Source: Banu and Chitra. Reprinted with permission.

In this experiment, the DWDE framework was validated for diverse domains, including association, radio, travel, disaster and child games. The precision and recall values are noted and the results are shown in figures 2 and 3. The proposed DWDE

Figure 4: F-measure Values of DWDE Compared to GA and NB Methods
[Chart: F-measure (y-axis) by domain (x-axis) for DWDE-DC, GA and NB.]
Source: Banu and Chitra. Reprinted with permission.

F-measure
F-measure is the measure of a system's or method's performance by its classifiers and is based on precision and recall scores.
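
As an illustration of how precision, recall and F-measure relate, the following minimal sketch computes all three for a single query. It assumes the common balanced F-measure (the harmonic mean of precision and recall); the article does not state which variant the authors used, and the function name and sample page identifiers are hypothetical.

```python
def precision_recall_f_measure(retrieved, relevant):
    """Return (precision, recall, f_measure) for one query.

    retrieved: pages returned by the system (hypothetical identifiers)
    relevant:  pages that are actually relevant to the query
    Assumption: balanced F-measure, F = 2PR / (P + R).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)

    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0

    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Four pages retrieved, three pages relevant, two pages in common.
p, r, f = precision_recall_f_measure(
    ["page_a", "page_b", "page_c", "page_d"],
    ["page_a", "page_b", "page_e"],
)
print(f"precision={p:.3f} recall={r:.3f} f_measure={f:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two scores, a method cannot reach a high F-measure by maximizing precision alone or recall alone.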
and NB methods.

Figure 5: Execution Time for Raw Data Set, HTML Parsing and DWDE
[Chart: execution time (ms) for DWDE-DC, GA and NB.]

The frequency calculation is applied to compare the matching frequencies between the user query and the relevant search web sites. Based on the frequency measures, the domain classification algorithm is introduced to retrieve the accurate resultant pages.
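
The frequency-based matching described above can be sketched roughly as follows. This is not the authors' DWDE algorithm, whose details are not given here; it is a simplified stand-in that counts how often the query terms occur in each candidate site's text and picks the domain with the highest total. The function names, the tokenization rule and the sample data are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_frequency(query, page_text):
    """Count occurrences of the distinct query terms in one page's text."""
    counts = Counter(tokenize(page_text))
    return sum(counts[term] for term in set(tokenize(query)))


def classify_domain(query, pages_by_domain):
    """Return (best_domain, scores) using summed match frequencies.

    pages_by_domain maps a domain label to a list of page texts.
    best_domain is None when no page matches the query at all.
    """
    scores = {
        domain: sum(match_frequency(query, page) for page in pages)
        for domain, pages in pages_by_domain.items()
    }
    if not scores or max(scores.values()) == 0:
        return None, scores
    return max(scores, key=scores.get), scores


# Hypothetical sample data, two domains taken from the evaluated list.
sample_pages = {
    "travel": ["Cheap flights and hotel deals", "Book flights online"],
    "education": ["College courses and school admissions"],
}
best, scores = classify_domain("cheap flights", sample_pages)
print(best, scores)
```

A raw term count favors long pages, so a fuller implementation would likely normalize by page length or weight rare terms more heavily; the sketch leaves that out to keep the frequency idea visible.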