Supervisor co-supervisor
OUTLINE
INTRODUCTION (Motivation, Crawling Issues, Related Work)
SEARCH ENGINES
WEB CRAWLER
PROBLEMS WITH A TYPICAL WEB CRAWLER
ALGORITHM DESCRIPTION (PROPOSED)
APPROACH FOLLOWED
MODEL OF WORK
RESULTS
REFERENCES
INTRODUCTION
Internet
The Internet is a network of networks consisting of millions of private, academic, business, and government networks. It is a vast source of information resources and services.
Motivation
Problem Statement
Most existing website ranking algorithms make use only of the website link graph; the content of websites and their popularity are usually not taken into consideration. This is not enough for a reliable ranking of websites.
CRAWLING ISSUES
Issue 1: Overlapping of web documents. The overlap problem occurs when multiple crawler instances/threads running in parallel download the same web document multiple times.
Issue 2: Quality of downloaded web documents. The quality of downloaded documents can be ensured only when web pages of high relevance are downloaded by the crawlers.
Issue 3: Change of web documents. Web documents change continuously. This change must be reflected in the search engine repository; otherwise a user may be served an obsolete web document.
Issue 4: Network bandwidth/traffic. To maintain quality, the crawling process follows one of two approaches: crawlers are generously allowed to communicate among themselves, or they are not allowed to communicate at all. Both approaches put an extra burden on network traffic.
Proposed Solution
To address these issues, a new site-ordering algorithm is introduced based on all three web mining techniques: content relevance, site popularity, and site update frequency.
Related work
ARCHITECTURE OF A TYPICAL SEARCH ENGINE
WEB CRAWLER
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. For a given set of URLs supplied by the search engine, the crawler traverses the World Wide Web to download hypertext documents. The hyperlinks embedded in the documents are extracted and followed.
1. Multi-threaded Downloader: downloads documents in parallel using several concurrently running threads.
2. Scheduler: selects the next URL to be downloaded.
3. URL Queue: a queue holding all page URLs.
4. Site-ordering Module: scores each site based on the factors described below and orders the sites by score.
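A minimal Java sketch of how these components could interact (the class name, the placeholder scoring, and the sample URLs are assumptions for illustration, not the actual implementation):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Minimal sketch of the crawler components described above.
public class CrawlerSketch {

    // Site-ordering module stub: scores a URL. In the real design this
    // is the content/popularity/update-frequency score described later;
    // here URL length stands in as a placeholder.
    static double score(String url) {
        return url.length();
    }

    public static void main(String[] args) {
        // URL queue ordered by score, so the scheduler always
        // picks the highest-scoring URL next.
        PriorityQueue<String> urlQueue = new PriorityQueue<>(
                Comparator.comparingDouble(CrawlerSketch::score).reversed());
        urlQueue.add("http://example.com/a");
        urlQueue.add("http://example.com/longer-page");

        // Scheduler loop: select the next URL to be downloaded.
        List<String> crawlOrder = new ArrayList<>();
        while (!urlQueue.isEmpty()) {
            String next = urlQueue.poll();
            crawlOrder.add(next);
            // A multi-threaded downloader would fetch 'next' here,
            // extract its embedded hyperlinks, and add them to urlQueue.
        }
        System.out.println(crawlOrder);
    }
}
```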
WEB CRAWLER
- Dynamic in nature
- High rate of change
- Need to parallelize
Working of My_web_crawler
- Crawled pages/URLs are stored in files.
- The proposed URL-ordering algorithm is applied.
- URLs are crawled in the order of the rank obtained.
ALGORITHM EXPLANATION
1 - The web crawling process starts with an input (seed) URL; the same is done here.
2 - The contents of that web page are retrieved.
3 - Stop words are removed based on the list at http://www.cs.cmu.edu/~mccallum/bow/rainbow/
4 - Content mining: term weights are computed with the TF-IDF (Term Frequency-Inverse Document Frequency) formula
    w_ij = (0.5 + 0.5 * tf_ij) * log(N / n_j)
    where tf_ij is the normalized frequency of term j in page i, N is the total number of documents, and n_j is the number of documents containing term j.
5 - Content similarity between two pages p_a and p_b is the extended Jaccard (Tanimoto) coefficient over their weight vectors:
    sim(p_a, p_b) = Σ_i (w_i,pa * w_i,pb) / (Σ_i w_i,pa² + Σ_i w_i,pb² − Σ_i (w_i,pa * w_i,pb))
    Hence the similarity is obtained.
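Steps 4 and 5 can be sketched in Java (an illustrative sketch assuming the augmented TF-IDF weight and the extended-Jaccard similarity given above; method and variable names are invented):

```java
// Sketch of steps 4-5: TF-IDF term weights and content similarity.
public class ContentSimilarity {

    // Augmented TF-IDF weight: w = (0.5 + 0.5*tf) * log(N / n), where
    // tf = normalized term frequency, totalDocs = N, docsWithTerm = n.
    static double tfIdf(double tf, int totalDocs, int docsWithTerm) {
        return (0.5 + 0.5 * tf) * Math.log((double) totalDocs / docsWithTerm);
    }

    // Extended Jaccard (Tanimoto) similarity between two weight vectors:
    // sim = dot / (|a|^2 + |b|^2 - dot); identical vectors give 1.0.
    static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (normA + normB - dot);
    }
}
```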
6 - Usage mining: page popularity is calculated as
    Total = TEC + UEC + TIC + UIC
    WS = x * (TEC / Total) + y * (UEC / Total) + z * (TIC / Total) + w * (UIC / Total)
7 - Link mining: the update frequency is calculated as
    Freq(s) = x * (Na / D) + (1 − x) * (Nna / D)
where Na denotes the count of updated topic-related pages in a website, Nna denotes the count of updated non-topic pages, D is the update time interval used for counting updated pages, and x is a damping factor, 0 < x < 1, usually set to 0.85.
8 - The final score is calculated as:
    Final Rank = 0.223 * (result of step 6) + 0.2387 * (result of step 7) + 0.35 * (result of step 5)
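Steps 6 through 8 can be sketched in Java (illustrative only; the event counts TEC/UEC/TIC/UIC and the weights x, y, z, w are assumed inputs, and the method names are invented):

```java
// Sketch of steps 6-8: popularity, update frequency, and final rank.
public class SiteScore {

    // Step 6 (usage mining): weighted share of each event count from the
    // server logs; x, y, z, w are the weights of the four counts.
    static double popularity(double tec, double uec, double tic, double uic,
                             double x, double y, double z, double w) {
        double total = tec + uec + tic + uic;
        return x * (tec / total) + y * (uec / total)
             + z * (tic / total) + w * (uic / total);
    }

    // Step 7 (link mining): update frequency of a site.
    // na = updated topic-related pages, nna = updated non-topic pages,
    // d = update time interval, x = damping factor (usually 0.85).
    static double updateFreq(double na, double nna, double d, double x) {
        return x * (na / d) + (1 - x) * (nna / d);
    }

    // Step 8: final rank combining popularity (step 6), update
    // frequency (step 7), and content similarity (step 5).
    static double finalRank(double popularity, double freq, double similarity) {
        return 0.223 * popularity + 0.2387 * freq + 0.35 * similarity;
    }
}
```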
PERFORMANCE EVALUATION
It will be evaluated as follows:
1. Take the top 100 URLs returned by each of the above-mentioned algorithms.
2. Count the number of distinct URLs present.
3. Count the number of overlapping/repeated URLs.
4. Count the number of relevant pages.
5. Count the number of URLs falling in each rank category.
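The distinct/overlap counting can be sketched with Java sets (illustrative; the URL list is an assumed input):

```java
import java.util.HashSet;
import java.util.List;

// Sketch of the evaluation counts: distinct vs. overlapping URLs
// in a top-k result list.
public class EvalCounts {

    // Number of distinct URLs in the list.
    static int distinct(List<String> urls) {
        return new HashSet<>(urls).size();
    }

    // Number of overlapping (repeated) URLs.
    static int overlapping(List<String> urls) {
        return urls.size() - distinct(urls);
    }

    public static void main(String[] args) {
        List<String> top = List.of("a", "b", "a", "c");
        System.out.println(distinct(top) + " distinct, "
                + overlapping(top) + " repeated");
    }
}
```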
APPROACH FOLLOWED
Development language: Java
- Web crawler
- Save URL
- Ranking algorithm
WORK ACHIEVED
- Threads
- Message receiver
- Web crawler
Crawler process, Save URL, and Ranking algorithm implemented successfully.
MODEL OF WORK
RESULTS
It is observed that, on average, the number of rank categories obtained by the algorithm is two more than with traditional PageRank. It also places more URLs in the highest rank than PageRank does. It gives more accurate ranks when server logs are available, and it still provides ranks without server logs. Experiments will be continued on larger data sets, with variations, to draw conclusions about efficiency.
In this paper a new URL-ordering algorithm is proposed based on content similarity, popularity information from web logs, and site update frequency. It performs better than traditional PageRank and also has anti-spamming ability. A database may be used to overcome the limitations of Java data structures. More data sets for page popularity may be tested. Lastly, it may be tested on larger data sets.
Papers Published/Accepted
- Sandhya, M. Qasim Rafiq, "Performance Evaluation of Web Crawler", accepted and published at the International Conference on Emerging Trends in Technology (ICETT 2011), Baselious Mathews II, Kollam, Kerala, 25-26 March 2011.
- Sandhya, M. Qasim Rafiq, "Performance Evaluation of Web Crawler", accepted in the International Journal of Emerging Technologies in Sciences and Engineering (IJETSE).
- Sandhya, M. Qasim Rafiq, "Performance Evaluation of Web Crawler", accepted at a National Conference at Kurukshetra University, 6-7 April 2011.
REFERENCES
[1] Bhaskar Reddy, Kethi Reddy, "Improving Efficiency of Web Crawler Algorithm Using Parametric Variations", Ph.D. thesis, Thapar University, India, June 2010.
[2] Arvind Chandramouli, Susan Gauch, Josua Eno, "A Popularity-Based URL Ordering Algorithm for Crawlers", Rzesow, Poland, 13-15 May 2010, IEEE.
[3] Shaojie Qiao, Tianni Li, Jiangtao Qiu, "SimRank: A Page Rank Approach Based on Similarity Measure", 2010, IEEE.
[4] Dilip Kumar Sharma, A. K. Sharma, "A Comparative Analysis of Web Page Ranking Algorithms", International Journal on Computer Science and Engineering (IJCSE), Vol. 02, No. 08, 2010, pp. 2670-2676.
[5] Hongzhi Guo, Qingcai Chen, Xiaolong Wang, Zhiyong Wang, Yonghui Wu, "STRank: A SiteRank Algorithm Using Semantic Relevance and Time Frequency", Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics.
[6] http://palatnikfactor.com/2010/05/18/components-of-googles-rankingalgorithm-in-2010-linking-still-king/
[7] Yi Zhang, Lei Zhang, Yan Zhang, Xiaoming Li, "XRank: Learning More from Web User Behaviors", 2006, IEEE.
[8] Qiancheng Jiang, Yan Zhang, "SiteRank-Based Crawling Ordering Strategy for Search Engines", 2007, IEEE.
[9] Neelam Duhan, A. K. Sharma, Komal Kumar Bhatia, "Page Ranking Algorithms: A Survey", 2009 IEEE International Advance Computing Conference (IACC 2009).
[10] Apostolos Kritikopoulos, Martha Sideri, Iraklis Varlamis, "WordRank: A Method for Ranking Web Pages Based on Content Similarity", 2007, IEEE.
THANK YOU