
PERFORMANCE EVALUATION OF WEB CRAWLER

Supervisor: Prof. M. Q. Rafiq, Dept. of Computer Engg.
Co-supervisor: Dr. Omar Farooq, Dept. of Electronics Engg.

Presented by: Ms. Sandhya (GD-6707)

OUTLINE

INTRODUCTION
Motivation
Crawling Issues
Related Work
SEARCH ENGINES
WEB CRAWLER
PROBLEMS WITH A TYPICAL WEB CRAWLER
ALGORITHM DESCRIPTION (Proposed)
APPROACH FOLLOWED
MODEL OF WORK
RESULTS
CONCLUSION AND FUTURE WORK

INTRODUCTION

Internet
A network of networks, consisting of millions of private, academic, business, and government networks. A vast source of information resources and services.

World Wide Web (WWW)
A system of interlinked hypertext documents. A web browser starts with the URL of a web page, DNS resolves the URL into an IP address, and an HTTP request fetches the document.

Motivation
Problem Statement
Most existing website ranking algorithms make use only of the website link graph; the content of websites is usually not taken into consideration, and neither is site popularity. This is not enough for a reliable ranking of websites.

CRAWLING ISSUES
Issue 1: Overlapping of web documents. The overlap problem occurs when multiple crawler instances/threads running in parallel download the same web document multiple times.

Issue 2: Quality of downloaded web documents. Quality can be ensured only when web pages of high relevance are downloaded by the crawlers.

Issue 3: Change of web documents. Web documents change continuously. This change must be reflected in the search engine repository, failing which a user may be served an obsolete web document.

Issue 4: Network bandwidth/traffic problem. To maintain quality, the crawling process is carried out using either of the following approaches:
- crawlers are generously allowed to communicate among themselves, or
- they are not allowed to communicate among themselves at all.
Both approaches put an extra burden on network traffic.

Proposed Solution

To address these issues, a new site ordering algorithm is introduced, based on all three web mining techniques, i.e., content relevance, site popularity, and site update frequency.

Related work

SEARCH ENGINE: A WWW search engine is defined as a retrieval service, consisting of one or more databases describing mainly the resources available on the WWW, search software, and a user interface, itself also available via the WWW. There are two types of search engines:

1. Directory-based search engines (Yahoo, The Open Directory): directories are search engines powered by human beings; human editors compile all the listings that directories have.

2. Crawler-based search engines (Google, AltaVista): these automatically visit web pages to compile their listings. By taking care in how you build your pages, you might rank well in crawler-produced results.

ARCHITECTURE OF A TYPICAL SEARCH ENGINE

Figure 1. Web search engine architecture

WEB CRAWLER

A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. For a given set of URLs supplied by the search engine, the crawler traverses the World Wide Web to download hypertext documents; the hyperlinks embedded in those documents are extracted and followed.
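As a minimal, illustrative sketch of this fetch-extract-follow loop (the seed URL, link regex, and page limit below are placeholder assumptions, not details of the presented crawler):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Minimal single-threaded crawler: fetch a page, extract hrefs, follow them.
    public class SimpleCrawler {
        private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            Queue<String> frontier = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            frontier.add("http://example.com");  // seed URL supplied by the search engine
            while (!frontier.isEmpty() && seen.size() < 100) {  // small crawl budget
                String url = frontier.poll();
                if (!seen.add(url)) continue;    // skip already-downloaded documents
                try {
                    HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
                    String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
                    Matcher m = HREF.matcher(body);             // extract embedded hyperlinks
                    while (m.find()) frontier.add(m.group(1));  // ... and follow them
                } catch (Exception e) {
                    // unreachable or malformed URL: skip it
                }
            }
        }
    }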

WEB CRAWLER ARCHITECTURE

Figure 2. Web Crawler architecture

1. Multi-threaded Downloader: downloads documents in parallel via several concurrently running threads.
2. Scheduler: selects the next URL to be downloaded.
3. URL Queue: a queue holding all page URLs.
4. Site-ordering Module: scores each site on the factors described below and orders the sites by that score.
5. New Ordered Queue: URLs sorted by their score.
6. World Wide Web: the collection of interlinked documents.
7. Storage: saves the downloaded documents.

Working of a web crawler

PROBLEMS WITH A TYPICAL WEB CRAWLER

Serial process - slow

Unable to handle the WWW, which is:
- growing rapidly (increasing exponentially)
- dynamic in nature
- changing at a high rate

Need to parallelize

My_web_crawler working

Parallelize the crawling process: several instances of the crawler work in parallel.

The main page is downloaded and the links on the main page are stored; the main page and these links are then crawled in parallel, as sketched below.

Crawled pages/URLs are stored in files, the proposed URL ordering algorithm is applied, and the URLs are then crawled in order of the rank obtained.
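One plausible way to realize this in Java (a sketch under assumptions: a fixed-size thread pool with a shared concurrent frontier and visited set, not the exact thread controller described later) is shown here; the shared visited set also mitigates the overlap problem of Issue 1:

    import java.util.Set;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch: several crawler instances share one frontier and one visited set.
    public class ParallelCrawler {
        static final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        static final Set<String> visited = ConcurrentHashMap.newKeySet();

        public static void main(String[] args) throws InterruptedException {
            frontier.add("http://example.com");  // main/seed page
            ExecutorService pool = Executors.newFixedThreadPool(8);  // 8 crawler instances
            for (int i = 0; i < 8; i++) {
                pool.submit(() -> {
                    try {
                        while (true) {
                            String url = frontier.take();
                            if (!visited.add(url)) continue;  // never download a URL twice
                            // download(url), save it to a file, and offer the links
                            // extracted from it back to the frontier (omitted here)
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();   // shut down this worker
                    }
                });
            }
            Thread.sleep(60_000);  // crawl budget
            pool.shutdownNow();
        }
    }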

PROPOSED RANK ALGORITHM

1 - Input/seed URL.
2 - Extract the whole site.
3 - Remove stop words and suffixes.
4 - Calculate term weights using TF-IDF.
5 - Calculate content similarity.
6 - Calculate the public popularity score using server logs.
7 - Obtain the site update frequency.
8 - The final site score is obtained by weighting the above factors according to their relevance, i.e.
Final Rank = 0.223 × (result of step 6) + 0.2387 × (result of step 7) + 0.35 × (result of step 5)

ALGORITHM EXPLANATION

1 - The web crawling process starts with an input/seed URL; the same is done here.
2 - The contents of that web page are retrieved.
3 - Stop words are removed, based on the stop-word list at http://www.cs.cmu.edu/~mccallum/bow/rainbow/
4 - Here the content mining concept is used: term weights are calculated with the TF-IDF (Term Frequency - Inverse Document Frequency) formula

    wij = (0.5 + 0.5 · tfij / max_tfj) · log(N / ni + 1)

where tfij is the frequency of term i in page j, max_tfj is the largest term frequency in page j, N is the total number of documents, and ni is the number of documents containing term i.
5 - The content similarity of two pages pa and pb is then the extended Jaccard coefficient over their term-weight vectors:

    sim(pa, pb) = Σ(wi,pa · wi,pb) / (Σwi,pa² + Σwi,pb² - Σ(wi,pa · wi,pb))

Hence the similarity is obtained.
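A direct transcription of these two formulas into Java (assuming the reconstructed augmented TF-IDF and extended Jaccard forms above; the class and method names are illustrative):

    // Sketch of steps 4-5: term weighting and content similarity.
    public class ContentScore {

        // Augmented TF-IDF weight: w = (0.5 + 0.5 * tf / maxTf) * log(N / df + 1)
        static double tfIdf(int tf, int maxTf, int totalDocs, int docFreq) {
            return (0.5 + 0.5 * tf / (double) maxTf)
                 * Math.log((double) totalDocs / docFreq + 1);
        }

        // Extended Jaccard (Tanimoto) similarity between two term-weight vectors.
        static double similarity(double[] wa, double[] wb) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < wa.length; i++) {
                dot += wa[i] * wb[i];  // sum of wi,pa * wi,pb
                na  += wa[i] * wa[i];  // sum of wi,pa squared
                nb  += wb[i] * wb[i];  // sum of wi,pb squared
            }
            return dot / (na + nb - dot);
        }
    }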

6 - Here the usage mining concept is used: page popularity is now calculated from the server logs as

    Total = TEC + UEC + TIC + UIC
    WS = x · (TEC / Total) + y · (UEC / Total) + z · (TIC / Total) + w · (UIC / Total)

7 - Here the link mining concept is used: the site update frequency is calculated as

    Freq(s) = x · (Na / D) + (1 - x) · (Nna / D)

where Na denotes the count of updated topic-related pages in a website, Nna denotes the count of updated non-topic pages, D is the update time interval over which updated pages are counted, and x is a damping factor, 0 < x < 1, usually set to 0.85.

8 - The final score is calculated as:
Final Rank = 0.223 × (result of step 6) + 0.2387 × (result of step 7) + 0.35 × (result of step 5)
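Steps 6-8 transcribe directly into code (a sketch; the popularity weights x, y, z, w are left as parameters since the slides do not fix their values):

    // Sketch of steps 6-8: popularity, update frequency, and the final rank.
    public class SiteScore {

        // Step 6: popularity as a weighted mix of the four normalized log counts.
        static double popularity(double tec, double uec, double tic, double uic,
                                 double x, double y, double z, double w) {
            double total = tec + uec + tic + uic;
            return (x * tec + y * uec + z * tic + w * uic) / total;
        }

        // Step 7: damped mix of topic-related (na) and non-topic (nna) page
        // updates over the interval d; the damping factor x is usually 0.85.
        static double updateFreq(int na, int nna, double d, double x) {
            return x * na / d + (1 - x) * nna / d;
        }

        // Step 8: final rank with the fixed weights given above.
        static double finalRank(double similarity, double popularity, double updateFreq) {
            return 0.35 * similarity + 0.223 * popularity + 0.2387 * updateFreq;
        }
    }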

PERFORMANCE EVALUATION
It will be evaluated as follows:

First, the top 100 URLs returned by the above-mentioned algorithms will be taken, and the number of distinct URLs present will be counted. Secondly, the number of overlapping/repeated URLs will be counted. Thirdly, the number of relevant pages will be counted. Lastly, the URLs falling into the different ranks will be counted.
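For instance, the first two counts (distinct vs. overlapping URLs) can be obtained with a single set-based pass over a returned top-100 list (a hypothetical helper, not the actual evaluation harness):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Count distinct and repeated URLs in a ranked result list.
    public class OverlapCount {
        static int[] distinctAndOverlap(List<String> top100) {
            Set<String> distinct = new HashSet<>();
            int overlaps = 0;
            for (String url : top100) {
                if (!distinct.add(url)) overlaps++;  // already seen => repeated URL
            }
            return new int[] { distinct.size(), overlaps };
        }
    }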

APPROACH FOLLOWED

Development language - Java

Work divided into two parts:
- Threads: working threads, threads controller
- Web crawler: crawling process, save URL, ranking algorithm

WORK ACHIEVED

- Threads: queue, thread controller, message receiver
- Web crawler: crawler process, save URL, ranking algorithm

All implemented successfully.

MODEL OF WORK

Figure 3: Web Crawler GUI

Figure 4: Ranking process intermediate result

Figure 5: Final site score obtained

Figure 6: View of final ranked URLs in the GUI

Top URLs by proposed algorithm

Top URLs by PageRank-5 (Parameter v1.3)

RESULTS

It is observed that, on average, the proposed algorithm yields two more ranking categories than traditional PageRank, and it places more URLs in the highest rank than PageRank does. It gives more accurate ranks when server logs are available, and it still provides ranks without server logs. Experiments will be continued on larger data sets, with variations, in order to draw conclusions about efficiency.

CONCLUSION AND FUTURE WORK

In this paper a new URL ordering algorithm is proposed, based on content similarity, popularity information from web logs, and site update frequency. It performs well, better than traditional PageRank, and also has anti-spamming ability. As future work, a database may be used to overcome the limitations of Java data structures, more data sets for page popularity may be tested, and the approach may be evaluated on larger data sets.

Papers Published/Accepted

Sandhya, M. Qasim Rafiq, "Performance Evaluation of Web Crawler," accepted and published at the International Conference on Emerging Trends in Technology (ICETT 2011), held at Baselious Mathews II, Kollam, Kerala, 25-26 March 2011.

Sandhya, M. Qasim Rafiq, "Performance Evaluation of Web Crawler," accepted in the International Journal of Emerging Technologies in Sciences and Engineering (IJETSE).

Sandhya, M. Qasim Rafiq, "Performance Evaluation of Web Crawler," accepted at a national conference at Kurukshetra University, 6-7 April 2011.


THANK YOU
