
2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies

Performance Optimization of Focused Web Crawling Using Content Block Segmentation


Bireshwar Ganguly
PG Student, Department of C.S.E, RCERT, Chandrapur Maharashtra, India bireshwar.ganguly@gmail.com

Devashri Raich
PG Student, Department of C.S.E, RCERT, Chandrapur Maharashtra, India devashriraich@gmail.com

Abstract -- The World Wide Web (WWW) is a collection of billions of documents formatted using HTML. Web search engines are used to find desired information on the World Wide Web. Whenever a user query is entered, the search is performed against the engine's database. The repository of a search engine is not large enough to accommodate every page available on the web, so only the most relevant pages should be stored, and a better approach is therefore needed to select those pages. The software that traverses the web to fetch pages is called a crawler or spider. A specialized crawler, called a focused crawler, traverses the web and selects only the pages relevant to a defined topic rather than exploring all regions of every web page. Since the crawler does not collect all web pages but retrieves only the relevant ones, the major problem is how to retrieve relevant, high-quality web pages. To address this problem, in this paper we design and implement an algorithm that partitions web pages into content blocks on the basis of headings and then calculates the relevancy of each partitioned block. The page relevancy is the sum of all block relevancy scores in that page. The algorithm also calculates the URL score and identifies whether a URL is relevant to the topic or not. Partitioning on the basis of headings yields an appropriate division of pages into blocks, because a complete block comprises the heading, content, images, links, tables and sub-tables of that block only.

Keywords- Web crawling algorithms; search engine; focused crawling algorithm; page rank; Information Retrieval.

I. INTRODUCTION

The World Wide Web (or the Web) is a collection of billions of interlinked documents formatted using HTML. The WWW is a network from which we can obtain a large amount of information. On the Web, a user views pages that contain text, images and other multimedia, and navigates between them using hyperlinks. By "search engine" we usually refer to the actual search performed through a database of HTML documents. When you ask a search engine for information, it actually searches through the index it has created and does not search the Web itself. Different search engines give different ranking results because not every search engine uses the same algorithm to search through its indices. The question is: what is going on behind these search engines, and why is it possible to get relevant data so fast?

The answer is web crawlers. Crawlers form the crucial component of a search engine; their primary job is traversing the web and retrieving web pages to populate the database for later indexing and ranking. More specifically, the crawler iteratively performs the following process (a minimal sketch of this loop is given below):
1. Download the Web page.
2. Parse the downloaded page and retrieve all its links.
3. For each link retrieved, repeat the process.

The information collected can be used to gather more related data by intelligently and efficiently choosing which links to follow and which pages to discard. This process is called focused crawling [1]. A focused crawler tries to identify the most promising links and ignores off-topic documents. If the crawler starts from a document that is i steps from a target document, it downloads only a small subset of all the documents that are up to i-1 steps from the starting document. If the search strategy is optimal, the crawler takes only i steps to discover the target. In order to achieve topic specialization of high quality, the logic behind focused crawling tries to imitate human behavior when searching for a specific topic. The crawler takes into account the following features of the web that can be used for topic discrimination: the relevance of the parent page to the subject; the importance of a page; the structure of the web; features of the link and the text around a link; and the experience of the crawler.
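As a concrete illustration of the three-step loop above, the following is a minimal sketch (not taken from the paper) of an unfocused crawler using only the Java standard library; the link-extraction regex, the seed URL and the visit limit are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the basic crawl loop: download a page, extract its links,
// and repeat for each link (breadth-first, with a small visit limit).
public class BasicCrawler {
    private static final Pattern LINK =
            Pattern.compile("(?i)<a\\s+[^>]*href\\s*=\\s*\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("http://example.com");                  // seed URL (illustrative)

        while (!frontier.isEmpty() && visited.size() < 10) { // small limit for the sketch
            String url = frontier.poll();
            if (!visited.add(url)) continue;                 // skip already-visited pages

            StringBuilder page = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) page.append(line).append('\n');
            } catch (Exception e) {
                continue;                                    // 1. download failed: skip
            }

            Matcher m = LINK.matcher(page);                  // 2. parse and retrieve links
            while (m.find()) frontier.add(m.group(1));       // 3. repeat for each link
            System.out.println("crawled " + url);
        }
    }
}
```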

Figure 1. Focused Crawler

The remainder of the paper is structured as follows. The next section surveys related work. Section III describes the methodology of the focused crawler we used, including the proposed algorithm and the pattern-based similarity measurement. Section IV describes the implementation, Section V presents the experimental results, and Section VI gives the conclusion and future work.

II. LITERATURE SURVEY

The first generation of crawlers [22], on which most web search engines are based, relies heavily on traditional graph algorithms, such as breadth-first or depth-first traversal, to index the web. A core set of URLs is used as a seed set, and the algorithm recursively follows hyperlinks down to other documents. Document content is paid little heed, since the ultimate goal of the crawl is to cover the whole web. However, at the time, the web was two to three orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's web. Depth-first crawling [22] follows each possible path to its conclusion before another path is tried. It works by finding the first link on the first page, crawling the page associated with that link, finding the first link on the new page, and so on, until the end of the path has been reached. The process continues until all branches of all links have been exhausted. Breadth-first crawling [2] checks each link on a page before proceeding to the next page. Thus, it crawls each link on the first page, then each link on the first page's first link, and so on, until each level of links has been exhausted.

In Fish-Search [3], the Web is crawled by a team of crawlers, which are viewed as a school of fish. If a fish finds a relevant page based on keywords specified in the query, it continues looking by following more links from that page. If the page is not relevant, its child links receive a low preferential value. Shark-Search [4] is a modification of Fish-Search which differs in two ways: a child inherits a discounted value of the score of its parent, and this score is combined with a value based on the anchor text that occurs around the link in the Web page. The Naive Best-First method proposed by Pant, G. [19][20] exploits the fact that relevant pages tend to link to other relevant pages. Therefore, the relevance of a page a to a topic t, pointed to by a page b, is estimated by the relevance of page b to the topic t.

The PageRank algorithm proposed by Brin, S. and Page, L. [5][6] determines the importance of web pages by counting citations, or back links, to a given page. The page rank of a given page A is calculated as:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where PR(A) is the PageRank of page A, d is the damping factor, T1...Tn are the pages that link to A, and C(Ti) is the number of outgoing links on page Ti. The HITS algorithm, proposed by Kleinberg [7], is another method for rating the quality of a page. It introduces the idea of authorities and hubs. An authority is a prominent page on a topic; authorities are the target of the crawling process since they have high quality on the topic. A hub is a page that points to many authorities; its characteristic is that its out-links are suggestive of high-quality pages. Hubs do not need to have high quality on the topic themselves, nor links from 'good' pages pointing to them. The idea of the hub is a solution to the problem of distinguishing the popular pages from the authoritative pages. Therefore, hubs and authorities are defined in terms of mutual recursion.
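To make the citation-based ranking concrete, the following is a minimal sketch of the PageRank iteration given by the formula above (not part of the original paper; the tiny link graph, damping factor and iteration count are illustrative assumptions).

```java
import java.util.*;

// Minimal PageRank sketch following PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)).
// The link graph, damping factor and iteration count are illustrative only.
public class PageRankSketch {
    public static void main(String[] args) {
        // Adjacency list: page -> pages it links to (hypothetical graph).
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("C"));
        links.put("C", Arrays.asList("A"));

        double d = 0.85;                                        // damping factor
        Map<String, Double> pr = new HashMap<>();
        for (String page : links.keySet()) pr.put(page, 1.0);   // initial scores

        for (int iter = 0; iter < 20; iter++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet()) next.put(page, 1.0 - d);
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                double share = pr.get(e.getKey()) / e.getValue().size(); // PR(Ti)/C(Ti)
                for (String target : e.getValue())
                    next.merge(target, d * share, Double::sum);
            }
            pr = next;
        }
        System.out.println(pr);   // converged scores for A, B, C
    }
}
```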

A neural-network extension of the reinforcement learning (RL) method for focused crawling is proposed by Grigoriadis and Paliouras [19]. In their approach, each web page is represented by a set of 500 binary values, and the state of each page is determined by temporal-difference learning in order to minimize the state space. The relevance of a page depends on the presence of a set of keywords within the page, and a neural network is used to estimate the values of the different states. Concept-graph focused crawling, proposed by Liu, Milios & Janssen [15], analyzes the link structure and constructs a concept graph of the semantic content of relevant pages with the help of Hidden Markov Models. Here, the user of the search engine has a more active role, since they are required to browse the web and train the system by providing the set of pages they consider interesting. The system aims at analyzing and detecting the semantic relationships that exist within the paths leading to a relevant page. The Reinforcement Learning (RL) crawler proposed by Rennie and McCallum [17][18] is trained on specified example web sites containing target documents. The web site or server on which the document appears is repeatedly crawled to learn how to construct optimized paths to the target documents. Ontology-based focused crawling, proposed by Ehrig and Maedche [9], utilizes the notion of ontologies in the crawling process. It consists of two main processes which interact with each other: the ontology cycle and the crawling cycle. In the ontology cycle, the crawling target is defined by ontologies (provided by the user), and the documents considered relevant, as well as proposals for enriching the ontology, are returned to the user. The crawling cycle retrieves documents from the web and interacts with the ontology to determine the relevance of the documents and the ranking of the links to be followed. Intelligent crawling with arbitrary predicates, proposed by Aggarwal et al., is described in [8]. The method involves looking for specific features in a page to rank the candidate links; these features include the page content, the URL names of referred web pages, and the nature of the parent and sibling pages. It is a generic framework in that it allows the user to specify the relevance criteria. The system also has the ability of self-learning, i.e. it collects statistical information during the crawl and adjusts the weights of these features to capture the dominant factors at that moment.

After studying the various approaches in the literature, we find that the major open problem in focused crawling is that of properly assigning credit to all pages along a crawl route that yields a highly relevant document. In the absence of a reliable credit-assignment strategy, focused crawlers suffer from a limited ability to sacrifice short-term document-retrieval gains in the interest of better overall crawl performance. In particular, existing crawlers still fall short in learning strategies where topically relevant documents are found by following off-topic pages. Because of these disadvantages, we propose a new technique to improve the overall crediting system of focused crawling in the following section.


III. PROPOSED METHODOLOGY

A. Crawler Architecture
As the information on the WWW keeps growing, there is a great demand for efficient methods to retrieve it. Search engines present information to the user quickly by using web crawlers. Crawling the entire Web quickly is an expensive and unrealistic goal, as it requires enormous amounts of hardware and network resources. A focused crawler is a technique that aims at a desired topic: it visits and gathers only web pages relevant to a given set of topics and does not waste time on irrelevant pages. The focused crawler does not collect all web pages; it selects and retrieves only the relevant pages and neglects those that are of no concern. However, a single web page often contains multiple URLs and topics. This increases the complexity of the page and negatively affects the performance of focused crawling, because the overall relevancy of the web page decreases. A highly relevant region of a web page may be obscured by the low overall relevance of that page. Apart from the main content blocks, pages contain blocks such as navigation panels, copyright and privacy notices, unnecessary images, extraneous links, and advertisements.

When the pages are segmented into content blocks, the relevant blocks are crawled further to extract the relevant links from them. The relevancy value of each block is calculated separately, and the values are summed to obtain the overall relevancy of the page. The relevancy of a web page may be calculated inappropriately if the page contains multiple, possibly unrelated topics, which is a negative factor. Instead of treating the whole web page as the unit of relevance calculation, we therefore evaluate each content block separately.
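As an illustration of this per-block evaluation, the following is a minimal sketch (not taken from the paper) that splits an HTML page into heading-delimited blocks and sums per-block keyword scores; the splitting regex, the keyword-count scoring and the class names are illustrative assumptions.

```java
import java.util.*;
import java.util.regex.*;

// Illustrative sketch: split a page into blocks at <h1>..<h6> headings and
// sum a simple keyword-count relevancy over the blocks. Not the paper's exact code.
public class BlockRelevanceSketch {

    // Split the HTML at heading tags so that each block starts with its heading.
    static List<String> segmentByHeadings(String html) {
        String[] parts = html.split("(?i)(?=<h[1-6][^>]*>)");
        return new ArrayList<>(Arrays.asList(parts));
    }

    // Count case-insensitive occurrences of the keyword inside one block.
    static int blockScore(String block, String keyword) {
        Matcher m = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE)
                           .matcher(block);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) {
        String html = "<h2>Focused crawling</h2><p>Crawling the web...</p>"
                    + "<h2>Advertisements</h2><p>Buy now!</p>";
        int pageScore = 0;
        for (String block : segmentByHeadings(html)) {
            int s = blockScore(block, "crawling");
            pageScore += s;                       // page relevancy = sum of block scores
            System.out.println("block score = " + s);
        }
        System.out.println("page score = " + pageScore);
    }
}
```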

Figure 2. Methodology of the focused crawler [22]

B. Content Block Segmentation

Segmenting web pages into small units improves performance. A content block is assumed to have a rectangular shape. Page segmentation transforms a multi-topic web page into several single-topic content blocks; this method is known as content block partitioning. In this paper, we present an algorithm to efficiently divide a web page into content blocks and then apply focused crawling to all the content blocks. A web page is partitioned into blocks on the basis of headings, hyperlinks and the content body. This yields an appropriate division of pages into blocks, because a complete block comprises the heading, content, images, links, tables and sub-tables of that particular block only. First we build the HTML tag tree of a block; each HTML page corresponds to a tree whose tags are the title, hyperlinks or the contents.

Figure 3. The tag tree of a block corresponding to an HTML source

C. The Proposed Algorithm

Our work mainly focuses on the assignment of credits to web pages according to their semantic contents. We also give emphasis to prioritizing the frontier queue so that URLs of pages with higher credit are crawled before those with lower credit. In this algorithm we present a method to divide web pages into content blocks. The method partitions a web page into content blocks with a hierarchical structure, based on the pages' pre-defined structure, i.e. the HTML tags. We extract the content from HTML web pages and build the HTML tag tree of each block.

1) Select a set of seed URLs (selected and prioritized manually) and insert them in the frontier queue.
2) If the frontier queue is non-empty and the number of crawled URLs is below the limit, download the web page New() pointed to by the topmost URL in the queue; else stop.
3) Initialize PageScore() = 0.
4) Enter the keyword(s) to be searched.
5) Assign the PageScore() as follows:
   i) If the keyword(s) is present in the <Head> of the New() webpage, increment PageScore() by 5; else PageScore() is unchanged.
   ii) If the keyword(s) is present in the <Href> (inside a hyperlink/URL) of the New() webpage, increment PageScore() by 2; else PageScore() is unchanged.
   iii) If the keyword(s) is present in the <Body/Text> of the New() webpage, increment PageScore() by 1, and repeat for every occurrence of the keyword(s).
6) The final PageScore() of the New() webpage is the cumulative score of Step 5.
7) If the final PageScore() < 1 (irrelevance threshold), reject New() and go to Step 2. Otherwise, if PageScore() >= 1: if the PageScore() of New() >= Previous(), extract all the hyperlinks of New() and insert them at the top of the frontier queue; else append them at the rear of the queue. Set Previous() = New() and go to Step 2.


Our crawler operates as follows. The frontier initially holds the seed URL. The DNS thread removes a URL from the frontier and tries to resolve its hostname to an Internet Protocol address. First, the DNS thread consults the DNS database to see whether the hostname has already been resolved; if so, the thread retrieves the IP from the database, otherwise it obtains the IP from a DNS server. Next, a read thread receives the resolved IP address, tries to open an HTTP socket connection, and requests the web page. After downloading the page, the crawler checks the page content to avoid duplicates. It then extracts and normalizes the URLs in the fetched page, verifies whether robots are allowed to crawl those URLs, and checks whether the crawler has previously visited them. Obviously we cannot keep the server busy for a long time, so we set a timestamp and crawl only until the timestamp expires. If no links are found within this time, the crawler displays "string not found"; otherwise it fetches the links, displays them in a table and stores them in a repository file. Here we consider only HTML web pages. After downloading the web page contents, we give maximum emphasis to the <Head> or <Title> of a document, as it depicts the most significant part of the web page; if a keyword is present in the <Head>, then the webpage is very likely relevant to our query. Second importance is given to the hyperlink contents <Href>, and finally to the frequency of the keyword appearing in the document. Last, we also focus on improving the frontier queue so that the order of crawling is prioritized, with more relevant links crawled before less relevant ones.
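The frontier ordering described in Step 7 can be sketched as follows (an illustrative sketch, not the paper's implementation; the deque-based frontier and the method names are assumptions).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the Step-7 frontier prioritization: links extracted from a
// page whose score is at least as high as the previous page's score are pushed to the
// front of the frontier; otherwise they are appended at the rear.
public class FrontierSketch {
    private final Deque<String> frontier = new ArrayDeque<>();
    private int previousScore = 0;                 // Previous() in the algorithm

    public void addSeed(String url) {
        frontier.addLast(url);
    }

    public String nextUrl() {
        return frontier.pollFirst();               // topmost URL in the queue, or null
    }

    // Called after New() has been scored and its hyperlinks extracted.
    public void enqueueLinks(int pageScore, Iterable<String> extractedLinks) {
        if (pageScore < 1) return;                 // irrelevance threshold: reject page
        for (String link : extractedLinks) {
            if (pageScore >= previousScore) frontier.addFirst(link);  // prioritize
            else frontier.addLast(link);                              // deprioritize
        }
        previousScore = pageScore;                 // Previous() = New()
    }
}
```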

D. Similarity Measurement Using Pattern Recognition

Here, by pattern we mean only text. Pattern matching is used for syntax analysis. Compared with regular expressions, patterns are more powerful but slower in matching. A pattern is a character string; all keywords can be written in both upper and lower case. A pattern expression consists of atoms bound by unary and binary operators, and spaces and tabs can be used to separate keywords. Text mining is an important step of the knowledge-discovery process. It is used to extract hidden information from unstructured or semi-structured data. This aspect is fundamental because much of the web's information is semi-structured due to the nested structure of HTML code, much of it is linked, and much of it is redundant. Web text mining supports the whole knowledge-mining process of extracting and integrating useful data, information and knowledge from web page content. In this paper, pattern recognition is applied to the crawler as follows: when the crawler is started, it returns the links related to the keyword. It then reads the web pages reached from those links and, while reading each page, extracts only the content. Here, content means only the text available on the web page; it should not include images and buttons. The extracted content is stored in a file. We use pattern recognition in the form of regular expressions, first to separate the <Head> part from the web page using the following pattern:

"(? i) (<Head.*?>) (. +?)(</Head>)"


When the corresponding <Head> block has been separated from the whole page, the search string is matched against it, and if a match occurs the algorithm increments PageScore() by 5. Similarly, to separate the hyperlink part from the page we use the following pattern:

"<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]"
When the corresponding <Href> block has been separated from the whole page, the search string is matched against it, and if a match occurs the algorithm increments PageScore() by 2. Similarly, to separate the rest of the body from the page we use the following pattern:

"< (body*)\b [^>]*> (.*?)</\body>


When the corresponding <Body> block has been separated from the whole page, the search string is matched against it, and if a match occurs the algorithm increments PageScore() by 1. Finally, the scores of all the blocks are added to obtain the final PageScore() of the webpage. In this approach we have kept the minimum PageScore() threshold at 1 simply to keep the process simple. If we wish to increase the relevancy of the crawling process, we can raise the threshold to any integer value. This provides flexibility in the approach and gives the user control over how strict the search process should be.
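The weighted scoring described above can be sketched as follows (an illustrative sketch rather than the paper's exact implementation; the 5/2/1 weights and the threshold follow the algorithm in Section III-C, while the method and variable names are assumptions).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the regex-based PageScore() computation:
// +5 if the keyword occurs in <head>, +2 if it occurs in a hyperlink,
// +1 per occurrence in the <body> text.
public class PageScoreSketch {
    private static final Pattern HEAD =
            Pattern.compile("(?i)<head.*?>(.+?)</head>", Pattern.DOTALL);
    private static final Pattern HREF =
            Pattern.compile("(?i)<a\\s+href\\s*=\\s*\"?(.*?)[\">]");
    private static final Pattern BODY =
            Pattern.compile("(?i)<body\\b[^>]*>(.*?)</body>", Pattern.DOTALL);

    static int pageScore(String html, String keyword) {
        int score = 0;
        String kw = keyword.toLowerCase();

        Matcher head = HEAD.matcher(html);
        if (head.find() && head.group(1).toLowerCase().contains(kw)) score += 5;

        Matcher href = HREF.matcher(html);
        while (href.find()) {
            if (href.group(1).toLowerCase().contains(kw)) { score += 2; break; }
        }

        Matcher body = BODY.matcher(html);
        if (body.find()) {
            Matcher occ = Pattern.compile(Pattern.quote(kw), Pattern.CASE_INSENSITIVE)
                                 .matcher(body.group(1));
            while (occ.find()) score += 1;          // one point per occurrence
        }
        return score;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Search tips</title></head>"
                    + "<body><a href=\"http://example.com/search\">search</a> search here</body></html>";
        int score = pageScore(html, "search");
        System.out.println("PageScore = " + score + (score < 1 ? " (rejected)" : " (kept)"));
    }
}
```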

IV. IMPLEMENTATION


In this paper we present the working and design of a focused web crawler. After compilation, to run the crawler (Figure 4) we provide one seed URL, the keyword to search for, and the path of a text file as input. Additionally, we can specify the maximum number of URLs to crawl and the case sensitivity of the search. When the search button is pressed, the crawler fetches from the Internet the URLs that match the keyword. Pressing the stop button interrupts the search (Figure 5).

Figure 4. Design of the software
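The inputs listed above can be wired together roughly as follows (a hypothetical entry point, not the paper's code; FrontierSketch and PageScoreSketch refer to the earlier sketches, and the download routine is a plain standard-library fetch).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Hypothetical entry point combining the crawler inputs described above:
// a seed URL, a keyword, a maximum number of URLs and a case-sensitivity flag.
public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        String seedUrl = "http://www.google.com";  // seed URL (example input)
        String keyword = "search";                 // keyword to search for
        int maxUrls = 50;                          // maximum number of URLs to crawl
        boolean caseSensitive = false;             // case-sensitivity option (unused in this sketch)

        FrontierSketch frontier = new FrontierSketch();
        frontier.addSeed(seedUrl);

        int crawled = 0;
        String url;
        while ((url = frontier.nextUrl()) != null && crawled < maxUrls) {
            String html = download(url);
            if (html == null) continue;                       // skip unreachable pages
            int score = PageScoreSketch.pageScore(html, keyword);
            System.out.println(score + "  " + url);
            crawled++;
            // Link extraction and frontier.enqueueLinks(score, links) would follow here.
        }
    }

    // Plain standard-library page download; returns null on failure.
    static String download(String url) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        } catch (Exception e) {
            return null;
        }
        return sb.toString();
    }
}
```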


As we can see, the crawler produces a table listing the keyword-matched URLs. When we click on the search button, it pops up the dissertation window.

Figure 5: Crawling in Process

Here we can see that the output appears in tabular form, containing the matched links, after the complete crawling process terminates successfully.

Figure 6. Output result

V. EXPERIMENTAL RESULTS

Here we compare our crawler with a blind crawler and a standard focused crawler on some standard parameters. The results clearly show that our approach is better than the other two on almost all parameters. In particular, although the number of links retrieved by our approach is the smallest, our main objective of optimizing the crawler is achieved: when we manually compare the retrieved links with those of the other two crawlers, relevancy is much higher. This is demonstrated in the next table, which displays the results obtained by dynamically executing our crawler.

TABLE I: COMPARISON OF RESULTS WITH TWO OTHER CRAWLERS

Parameter           | Blind Crawling  | Standard Focused Crawler | Focused Crawler using Content Block Partitioning
Seed URL            | www.google.com  | www.google.com           | www.google.com
Keyword             | -               | Search                   | Search
Max URLs to Crawl   | 50              | 50                       | 50
Time (Sec) approx.  | 15              | 165                      | 240
URLs to Crawl       | -               | 1133                     | 48
Links Retrieved     | 409             | 26                       | 17

On the basis of the above Table I parameters as input values to our crawler, we generate the following results with our proposed algorithm.

TABLE II: RELEVANCY URL PAGE SCORE DESCRIPTION USING PROPOSED CRAWLER

S.No | Links Retrieved | Score
1  | http://google.com | 4
2  | http://google.com/advanced_search?hl=enIN&authuser=0 | 19
3  | http://google.com/support/websearch/bin/answer.py?answer=510&p=adv_safesearch&hl=en | 6
4  | http://google.com/intl/en/insidesearch/index.html | 11
5  | http://google.com/intl/en/insidesearch/howsearchworks/thestory/ | 8
6  | http://google.com/intl/en/insidesearch/tipstricks/index.html | 14
7  | http://google.com/intl/en/insidesearch/features/ | 23
8  | http://google.com/intl/en/insidesearch/stories/index.html | 16
9  | http://google.com/intl/en/insidesearch/playground/ | 18
10 | http://support.google.com/websearch/?hl=en | 15
11 | http://google.com/webmasters/tools/safesearch | 6
12 | http://google.com/support/websearch/?p=highlights&hl=en | 7
13 | http://google.com/intl/en/insidesearch/features/search/knowledge.html | 13
14 | http://google.com/intl/en/insidesearch/features/voicesearch/ | 19
15 | http://google.com/intl/en/insidesearch/recipes.html | 10
16 | http://support.google.com/websearch/bin/answer.py?hl=en&answer=179386 | 24
17 | http://support.google.com/websearch/bin/answer.py?hl=en&answer=1710607&topic=2413802 | 28

TABLE III: OVERALL PERFORMANCE ANALYSIS

Crawler | Seed URL | Keyword | Time (Sec) approx. | Frontier Queue | Retrieved Links

Standard Focused Crawler | Rediff     | News      | 69  | 940  | 43
Standard Focused Crawler | Yahoo      | Bollywood | 160 | 1724 | 43
Standard Focused Crawler | Google     | Search    | 165 | 1133 | 27
Standard Focused Crawler | MSN        | Search    | 120 | 1525 | 46
Standard Focused Crawler | Indiatimes | Times     | 180 | 3834 | 49
Proposed Focused Crawler | Rediff     | News      | 360 | 1170 | 43
Proposed Focused Crawler | Yahoo      | Bollywood | 360 | 1754 | 48
Proposed Focused Crawler | Google     | Search    | 240 | 48   | 15
Proposed Focused Crawler | MSN        | Search    | 480 | 163  | 50
Proposed Focused Crawler | Indiatimes | Times     | 600 | 1978 | 49

VI. CONCLUSION

The motivation for focused crawling comes from the poor performance of general-purpose search engines, which depend on the results of generic web crawlers. A focused crawler is a system that learns the specialization from examples and then explores the web, guided by a relevance and popularity rating mechanism. It filters at the data-acquisition level, rather than as a post-processing step. In this paper we have briefly discussed the Internet, search engines, web crawlers, focused crawlers and block partitioning of web pages. Our approach is to partition web pages into content blocks. Using this approach we can partition pages on the basis of headings and preserve the relevant content blocks; therefore a highly relevant region in a page with low overall relevance will not be obscured. The advantage of our method of partitioning web pages into blocks on the basis of headings, compared with conventional block partitioning, is that each block contains a complete topic: the heading, hyperlinks and body of a particular topic are included in one complete block. After calculating the relevancy of the different regions, the crawler calculates the relevancy score of the web page from its block relevancy scores with respect to the topics, and calculates the URL score based on the blocks of the parent page in which the link occurs.

REFERENCES

[1] Chakrabarti, S., van den Berg, M. & Dom, B., 1999. Focused crawling: a new approach to topic-specific Web resource discovery. In Proceedings of the 8th International Conference on World Wide Web, Toronto, Canada, pp. 1623.
[2] Najork, M. and Wiener, J. L., 2001. Breadth-First Search Crawling Yields High-Quality Pages. In 10th International World Wide Web Conference, pp. 114-118.
[3] De Bra, P. and Post, R., 1994. Information Retrieval in the World-Wide Web: Making Client-based Searching Feasible. Journal on Computer Networks and ISDN Systems, 27, pp. 183-192, Elsevier Science BV.
[4] Hersovici, M., Jacovi, M., Maarek, Y. S., Pelleg, D., Shtalhaim, M. & Ur, S., 1998. The shark-search algorithm, an application: tailored Web site mapping. In Proceedings of the 7th International Conference on World Wide Web, Brisbane, Australia, pp. 317-326.
[5] Brin, S. and Page, L., 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the 7th International Conference on World Wide Web, Brisbane, Australia, pp. 107-117.
[6] Page, L., Brin, S., Motwani, R. & Winograd, T., 1998. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project.
[7] Kleinberg, J. M., 1997. Authoritative Sources in a Hyperlinked Environment. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms.
[8] Aggarwal, C., Al-Garawi, F. & Yu, P., 2001. Intelligent Crawling on the World Wide Web with Arbitrary Predicates. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, pp. 96-105.
[9] Ehrig, M. and Maedche, A., 2003. Ontology-Focused Crawling of Web Documents. In Proceedings of the Symposium on Applied Computing (SAC 2003), Melbourne, Florida, USA, pp. 1174.
[10] Zhuang, Z., Wagle, R. & Giles, C. L., 2005. What's There and What's Not? Focused Crawling for Missing Documents in Digital Libraries. In Joint Conference on Digital Libraries (JCDL 2005), pp. 301-310.
[11] Medelyan, O., Schulz, S., Paetzold, J., Poprat, M. & Mark, K., 2006. Language Specific and Topic Focused Web Crawling. In Proceedings of the Language Resources Conference (LREC 2006), Genoa, Italy.
[12] Pant, G. and Menczer, F., 2003. Topical Crawling for Business Intelligence. In Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries.
[13] Diligenti, M., Coetzee, F. M., Lawrence, S., Giles, C. L. & Gori, M., 2000. Focused Crawling Using Context Graphs. In 26th International Conference on Very Large Databases.
[14] Castillo, C., 2004. Effective Web Crawling. Ph.D. thesis, University of Chile.
[15] Liu, H., Milios, E. & Janssen, J., 2004. Focused Crawling by Learning HMM from User's Topic-specific Browsing. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, Beijing, China, pp. 732-735.
[16] Li, J., Furuse, K. & Yamaguchi, K., 2005. Focused Crawling by Exploiting Anchor Text Using Decision Tree. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, Chiba, Japan.
[17] Rennie, J. and McCallum, A., 1999. Using Reinforcement Learning to Spider the Web Efficiently. In Proceedings of the 16th International Conference on Machine Learning (ICML-99), pp. 335-343.
[18] Sutton, R. and Barto, A., 2002. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
[19] Grigoriadis, A. and Paliouras, G., 2004. Focused Crawling using Temporal Difference-Learning. In Proceedings of the Panhellenic Conference on Artificial Intelligence (SETN), Samos, Greece, pp. 142-153.
[20] Pant, G. and Menczer, F., 2003. Topical Crawling for Business Intelligence. In Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries.
[21] Pant, G., 2004. Learning to Crawl: Classifier-Guided Topical Crawlers. Ph.D. thesis, The University of Iowa.
[22] Pooja Gupta and Kalpana Johari, Implementation of Web Crawler. In Second International Conference on Emerging Trends in Engineering and Technology (ICETET-09).

