
Jan 22, 2015


Temporal aware algorithms vs PageRank


PAPER

Bundit MANASKASEMSAK a), Student Member, Arnon RUNGSAWANG, Nonmember, and Hayato YAMANA, Member

assessing the authoritativeness of web pages. We present three different measures related to time: age, event, and trend factors that measure recentness, special event occurrence, and trend in revisions, respectively. An experimental dataset is created by crawling selected topics for a period of several months. This data is used to compare page rankings by human experts with rankings computed by the standard PageRank algorithm (which does not include temporal factors) and three algorithms that incorporate temporal factors, including the Time-Weighted PageRank (TWPR) algorithm introduced here. Analysis of the rankings indicates that all three temporal-aware algorithms produce rankings more like those of human experts than does the PageRank algorithm. Of these, the TWPR algorithm produced rankings most similar to human experts, indicating that all three temporal factors are relevant in page ranking.

key words: web ranking algorithm, time-weighted ranking, web authoritativeness, PageRank, link analysis, search engine

1. Introduction

the most relevant or useful resources for their specific interest; the goal being to include the most relevant pages in the top-10 results returned by a query. Most search engines include a ranking algorithm that computes a page's authoritativeness based either on the link structure of the Web (such as HITS [16] and PageRank [5], [20]) or by mining of users' browse histories (such as BrowseRank [17], Traffic-weighted Ranking [19], and BookRank [12]). While all of these algorithms produce useful rankings overall, they do not incorporate temporal information of pages, which can be important to human users' interest in a particular page. Indeed, these algorithms may be biased against more recent pages [1], [6], as such pages have had less time to accumulate in-links that contribute to their link-based ranking, and less chance to be incorporated into a browse history.

In many situations a web user may be more interested in recent or actively updated resources, rather than older ones, even though the latter may appear to be more authoritative. A web page for an upcoming conference is clearly more interesting than a page for a past conference; a link to the current version of an open-source software package is probably more useful than a link to an old version. Indeed, one can argue that search engines are most needed for finding newer, more dynamic content; such items are less likely to be findable via web directories (e.g., ODP [8]), Wikipedia [23], or other resources that a human user can turn to for help.

Manuscript received January 1, 2003.
Manuscript revised January 1, 2003.
Final manuscript received January 1, 2003.
The author is with the Department of Computer Engineering, Kasetsart University, Thailand.
The author is with the Graduate School of Fundamental Science and Engineering, Waseda University, Japan.
a) E-mail: un@mikelab.net

In this paper, we propose a method to incorporate temporal aspects into the authority ranking of web pages using a modified version of the PageRank algorithm. Three temporal factors are incorporated: age, event, and trend; these three capture the intuitive notion that a page with recent updates, special time occurrence, or high frequency of updates is potentially more interesting to users. These attributes are computed from page timestamps (directly or by inference) and a modification history that is stored during each web crawl; the event and trend factors require categorizing pages and comparing the temporal history of pages in the same category, as described in Sect. 3. We discuss other algorithms that use temporal information in Sect. 2, notably T-Rank [4] and TimedPR [24], which are included in our experimental study.

The remainder of this paper is organized as follows. In Sect. 2, we briefly review the PageRank algorithm and some related work. In Sect. 3, we first introduce the factors of temporal aspect, and then describe how to apply them to our new variant of PageRank. In Sect. 4, we report and discuss the results of experiments. We offer conclusions and future work in Sect. 5.

2.

follows. For each crawl of the Web, we can construct a web graph $G = (V, E)$, an arbitrary directed graph with pages and their hyperlinks corresponding to the set of nodes $V$ and edges $E \subseteq V^2$, respectively, without multiple edges. If page $u \in V$ has a link to page $v \in V$, it implies that the author of page $u$ implicitly confers some importance to page $v$. Let $r(u)$ be the PageRank score (i.e., importance) of page $u$ and $t(u, v)$ be the weight of rank propagation from page $u$ to page $v$; its value is normally set to $1/out(u)$, where $out(u)$ is the out-degree of page $u$ in $G$. Therefore, a link $(u, v) \in E$ confers $r(u)/out(u)$ units of rank to page $v$. This yields the following equation to calculate the PageRank score for all web pages:

$$\forall v \in V,\quad r^{(k+1)}(v) = \sum_{u \in B(v)} t(u, v)\, r^{(k)}(u) = \sum_{u \in B(v)} \frac{r^{(k)}(u)}{out(u)}, \tag{1}$$

where $B(v)$ is the set of pages that link to $v$. That is, the total amount of score conferred on $v$ is the summation of the score of each source page $u$ divided by its out-degree. This recursive definition requires an iterative computation, which repeats until the rank scores are stable.
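The iteration of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation (which the paper says was written in Java): the function name `pagerank_simple`, the fixed iteration count, and the three-page graph are assumptions made for the example, and every page is assumed to have at least one out-link.

```python
def pagerank_simple(out_links, iterations=50):
    """Iterate Eq. (1): r(v) = sum of r(u) / out(u) over all pages u linking to v.

    Assumes every page has at least one out-link and every link target is a key.
    """
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}  # uniform starting scores
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for u, targets in out_links.items():
            share = rank[u] / len(targets)  # t(u, v) = 1 / out(u)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph used only for illustration.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank_simple(toy_graph)
```

On this toy graph the scores settle near 0.4, 0.2, and 0.4 for pages a, b, and c, which is the fixed point of Eq. (1) for these links.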

This process is equivalent to a random walk on the directed web graph $G$. Consider a random surfer who visits a page $u$ at time $k$. At time $k + 1$ the surfer chooses uniformly at random among $u$'s out-links and follows one to a page $v$, with probability $1/out(u)$, which is $t(u, v)$ in Eq. (1). Consequently, the PageRank score of a page $v$ can be defined as the probability of a random surfer landing on $v$ at a sufficiently large time step. Let $n$ be the total number of pages contained in $G$, and $M$ be the $n \times n$ matrix describing the transition over $G$. The matrix entry $m_{ij}$ is thus set to $1/out(i)$ if there is a link from page $i$ to page $j$, and the remaining entries are set to 0. By the power iteration method for eigensystems [11], the PageRank vector $r$ can be considered as the principal eigenvector of $M^T$ corresponding to eigenvalue 1:

$$r = M^T r. \tag{2}$$

For a random walk on the web graph $G$, the transition probability matrix $M$ is said to be valid if it is a row-stochastic matrix, i.e., its row totals equal 1. Hence, $M$ must have no row consisting of all zeros, which is what a dangling page (a page without out-links) produces. A general approach for dealing with dangling pages is to add a set of virtual links from those pages to all pages. Let $M'$ be the $n \times n$ matrix in which all entries in each row corresponding to a dangling page are $1/n$, and all other entries are 0. Then, the modified PageRank is:

$$r = (M^T + M'^T)\, r. \tag{3}$$

leak problem [20].

The Ergodic Theorem [13], [21] guarantees that the PageRank formula will converge to a stationary probability distribution if $M$ is also aperiodic and irreducible. The former requirement means that the surfer can return to a visited page at any time step, while the latter means that every page can be reached from every other page. In fact, the web graph often contains sinks or isolated clusters. However, this problem can be solved by adding a complete teleportation term to every page. This solution is reasonable for the random surfer analogy: the random surfer, who follows links on the Web, at some time step gets bored and abandons link surfing and instead enters a new destination in the browser's URL line. Let $E$ be the $n \times n$ matrix representing the teleportation over the web graph $G$ in which all entries have value $1/n$. Then, with $M'$ denoting the dangling-page matrix of Eq. (3), the modification of PageRank can be expressed as:

$$\begin{aligned} r &= \left[\alpha (M^T + M'^T) + (1 - \alpha) E^T\right] r \\ &= \alpha (M^T + M'^T)\, r + (1 - \alpha) E^T r \\ &= \alpha (M^T + M'^T)\, r + (1 - \alpha) \left[\tfrac{1}{n}\right]_{n \times 1}, \end{aligned} \tag{4}$$

where the parameter $\alpha$ sets the proportion of rank propagation versus teleportation. Since the total probability $\|r\|_1 = 1$, the multiplication $E^T r$ then returns a uniform $n$-dimensional column vector, i.e., the total probability of random jumps to each page is equal to $1/n$.
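To make Eq. (4) concrete, the following Python sketch runs the power iteration with both corrections: rank held by dangling pages is spread uniformly over all pages, and a share of (1 - alpha) is teleported uniformly. It is an illustrative sketch under stated assumptions, not the authors' code; the function name `pagerank`, the stopping tolerance, and the toy graph are hypothetical.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration for Eq. (4) on a dict mapping page -> list of link targets.

    Assumes every link target also appears as a key of out_links.
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Dangling pages get virtual links to all pages (the matrix M' of Eq. (3)).
        dangling_mass = sum(rank[u] for u in pages if not out_links[u])
        # Teleportation contributes (1 - alpha) / n to every page.
        new_rank = {p: alpha * dangling_mass / n + (1.0 - alpha) / n for p in pages}
        for u in pages:
            for v in out_links[u]:
                new_rank[v] += alpha * rank[u] / len(out_links[u])
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:  # scores are stable
            break
    return rank


# Hypothetical graph in which page "c" is dangling.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": []}
scores = pagerank(toy_graph)
```

Because every row of the corrected transition matrix sums to 1, the scores remain a probability distribution at every step.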

Since PageRank was published, it has been widely applied in many commercial Web search engines and applications. In reality, however, the Web is dynamic; considering only the number of page referrers when assessing authoritativeness is incomplete and inaccurate. PageRank tends to prefer older to newer pages since the older ones tend to have many in-links accumulated over time. Cho and Roy [6] describe this rich-get-richer phenomenon, in which search engines usually report higher-PageRank pages at the top of the results. These pages are not only already well known but also receive more and more references from other Web authors over time, while the new-born pages are continually left behind. However, the old pages may be outdated; the new pages may be more valuable. Cho et al. [7] thus propose a model to estimate the quality of pages by analyzing the growth of their popularity, i.e., how many references they receive over time. A page with a fast-growing trend should be considered as potentially valuable, even though it is still unpopular.

There are also some studies that concentrate on integrating temporal aspects into the PageRank algorithm. Baeza-Yates et al. [3] assume that a page should be considered good if it has a recent modification time, and already has other pages linked to it. They thus propose a modified PageRank by weighting pages with an exponential decay function according to their age, to counteract bias against new pages. Yu et al. [24] propose another temporal aspect of PageRank in the context of scientific publications, called TimedPageRank (TimedPR). However, their proposed technique cannot be directly applied to the Web. Although the citation concept of scientific papers is quite similar, some other aspects are different from general web content. For instance, scientific papers have static information fixed at publication time. Since they cannot be deleted, their citation counts are also monotonically increasing. In contrast, both the content and the links of web pages can be modified and deleted over time. Scientific papers can only cite others further in the past; there are no citation loops. On the Web, target-linked pages can be subsequently modified, becoming newer than their referrers and providing higher quality information. This factor consequently affects the authoritativeness and should be considered.

to main community, causes the PageRank sink problem [20].

Berberich et al. [4] specify a temporal window of interest to analyze freshness and update frequency of both pages and links. They apply these two factors in PageRank computation, called Time-Aware Authority Ranking (T-Rank). However, their proposed technique is somewhat impractical because the modification time of links is hardly detectable and is usually approximated using the modification time of pages. In addition, the results are not representative since they were only reported for a dataset on a specific domain. In fact, each domain usually has its own change behavior. For example, news pages that are updated daily should not dominate the others. Therefore, in dealing with the Web, the Web's topics should also be considered.

3.

with temporal aspects. In Sect. 3.1, we first introduce three factors for time analysis. We then describe how new scores can be computed by integrating these temporal factors in Sect. 3.2.

3.1 Temporal Factors

Web pages are created, modified, or even deleted according to time as well as events. Based on the assumptions that (1) authors who update their web pages will usually provide more correct and higher quality information, and (2) users often need and are interested in up-to-date information, we infer the authoritativeness of pages by considering their change behavior. We investigate the age, event, and trend factors in analyzing modification times of pages.

3.1.1 Age Factor

The age considered here is not the lifetime of a page since it was created, but rather the time since its last modification. A recently updated page is assumed to be of higher quality than an older one.

We first let two constants $TS_{Start}$ and $TS_{End}$ be the starting- and ending-point timestamps set to observe changes in web pages, respectively. Let $ts(u)$ be a modification time of page $u$, and $ts_{Last}(u)$ be the most recent one. Then, the age factor of a page $u$, denoted by $AF(u)$, can be defined by the following function:

$$AF(u) = \begin{cases} \dfrac{TS_{End} - ts_{Last}(u)}{TS_{End} - TS_{Start}} & \text{if } TS_{Start} < ts_{Last}(u) \le TS_{End}, \\ 1 & \text{otherwise.} \end{cases} \tag{5}$$

Fig. 1 (axes: AF, up to 1, versus Time, from TSStart to TSEnd)

The age factor is bounded by $[0, 1]$. Given the observation period, the age of page $u$ is set to the maximal value 1 if it has never changed; otherwise, for $ts_{Last}(u) \in (TS_{Start}, TS_{End}]$, its age decreases linearly over the normalized interval until it reaches zero.

However, the more recently a web page was updated, the more important it is. So, instead of directly using the age value, we complement it to create the inverse-age factor of a page $u$, denoted $IAF(u)$, defined as:

$$IAF(u) = 1 - AF(u). \tag{6}$$

3.1.2 Event Factor

When an event occurs, related web pages are often modified to provide up-to-date information about the event. The more important the occurrence is, the more web pages will be changed. If we observe the amount of change, we can infer not only that an event has occurred but also how important it is. However, not every related web page will be changed for such an event. This depends on the topic that a page relates to; for instance, web pages updated for the Olympic Games are usually categorized as sport, and those updated for a movie are categorized as entertainment. In our approach, we are interested in the last event of each web page, i.e., the event factor is analyzed by considering the last-modified time of a page based on its topic.

Let $C$ be a set of topics $C_i$ containing non-overlapping pages, i.e., if a page $u \in C_i$ then $u \notin C_j$ for all $j \ne i$. Let $I = \{I_1, I_2, \ldots, I_q\}$ be a set of time intervals such that $I_i = (TS_{i-1}, TS_i]$ for $1 \le i \le q$, where $TS_i$ is a pre-defined timestamp with $TS_{Start} = TS_0 < TS_1 < \ldots < TS_{q-1} < TS_q = TS_{End}$. Considering members of the Cartesian product $C \times I$, we let $page(C_i, I_j)$ be the number of web pages that are categorized in $C_i$ and have their last-modified time within $I_j$. Then, the event factor of a page $u$, denoted by $EF(u)$, is defined as:

$$EF(u) = \frac{page(C_i, I_j)}{\sum_{k=1}^{q} page(C_i, I_k)}; \quad u \in C_i \text{ and } ts_{Last}(u) \in I_j. \tag{7}$$

That is, the event factor of $u$ is the number of pages in the same category as $u$ that also changed in the same time interval, relative to the total number of pages in that category. A large event value, i.e., a large fraction of web pages changed, can be used to infer that an important event occurred.

3.1.3 Trend Factor

The age and event factors only consider the last modification time of web pages. We believe how often a page is updated can also be exploited to estimate its importance. A frequently updated page reflects active maintenance. However, the changes usually depend on the kind of topic as well. Considering only the frequency of change may be biased and unfair to pages that already present correct and reliable information. For instance, a news page will likely be updated more frequently than a scientific one.

In our approach, a trend factor is defined to describe how similar the change behavior of a web page is to others within the same topic, based on the assumption that most important pages will behave in the same way. We first introduce an update profile of each page based on the changes over the observation period. Let $T = \{t_i \in (TS_{Start}, TS_{End}] \mid 1 \le i \le m \text{ and } t_1 < t_2 < \ldots < t_m\}$ be a set of pre-defined timestamps. Let $x(u)$ be the $m$-dimensional vector representing the update profile of page $u$:

$$x_i(u) = \begin{cases} 1 & \text{if } u \text{ was updated at timestamp } t_i, \\ 0 & \text{otherwise.} \end{cases}$$

However, using this profile to directly compare pages will produce great differences since there is little possibility that web pages will be updated at the same specific timestamps along the observation period. We thus let $I' = \{I'_1, I'_2, \ldots, I'_r\}$ be a set of time intervals such that $I'_i = (TS'_{i-1}, TS'_i]$ for $1 \le i \le r$, where $TS'_i$ is a pre-defined timestamp, $TS_{Start} = TS'_0 < TS'_1 < \ldots < TS'_{r-1} < TS'_r = TS_{End}$. Define $x'(u)$, an $r$-dimensional vector, to represent a new update profile of $u$, where

$$x'_i(u) = \sum_{j:\, t_j \in I'_i} x_j(u). \tag{8}$$

Similarly, we let $x'(C_j)$ be the $r$-dimensional vector representing the update profile of a category, where $C_j \in C$ and

$$x'_i(C_j) = \frac{\sum_{u \in C_j} x'_i(u)}{|C_j|}. \tag{9}$$

That is, $x'(C_j)$ refers to the average profile of all pages contained in the same category. We finally define the trend factor of a page $u$, denoted by $TF(u)$, to be:

$$TF(u) = Sim(x'(u), x'(C_j)); \quad u \in C_j, \tag{10}$$

where $Sim$ is a similarity measure from classical IR [2]. The trend factor of $u$ is thus considered as the similarity of its update profile to the profile of its category.

3.2 Time-Weighted Ranking

We now define a ranking score by incorporating the inverse-age, event, and trend factors in PageRank. We first let $D$ be the set of dangling pages of the web graph $G$. Let $t(u, v)$ be the weight of rank propagation from page $u$ to page $v$, and $s(v)$ be the weight for a random jump from any page to $v$. Then, the PageRank formula illustrated in Eq. (4) can be rewritten in a general form as follows:

$$\forall v \in V,\quad r(v) = \alpha \sum_{u \in (B(v) \cup D)} t(u, v)\, r(u) + (1 - \alpha)\, s(v). \tag{11}$$

In the original PageRank, $t(u, v)$ is either $1/out(u)$ if $u \in B(v)$ or $1/n$ if $u \in D$, and $s(v)$ is set to $1/n$. However, in this work we propose new transitions that incorporate temporal aspects.
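Before turning to the new transitions, the three temporal factors of Sect. 3.1 can be sketched in Python as follows. This is an illustration under explicit assumptions, not the authors' implementation: timestamps are plain day numbers, the same interval boundaries serve both the event and the trend factor, a page with no modification in the observation period gets AF = 1 and an event factor of 0, and the similarity function is taken to be cosine similarity (the text only says a similarity from classical IR). All function and variable names are hypothetical.

```python
import math
from collections import Counter


def age_factor(ts_last, ts_start, ts_end):
    """Age factor: 0 for a page modified at ts_end, rising linearly toward 1;
    1 for a page with no modification inside (ts_start, ts_end]."""
    if ts_last is not None and ts_start < ts_last <= ts_end:
        return (ts_end - ts_last) / (ts_end - ts_start)
    return 1.0


def interval_index(ts, bounds):
    """0-based index i with bounds[i] < ts <= bounds[i + 1], or None."""
    for i in range(len(bounds) - 1):
        if bounds[i] < ts <= bounds[i + 1]:
            return i
    return None


def temporal_factors(updates, category, bounds):
    """Return (IAF, EF, TF) dictionaries keyed by page.

    updates: page -> list of update timestamps; category: page -> topic;
    bounds: interval boundaries from the start to the end of the observation period.
    """
    ts_start, ts_end = bounds[0], bounds[-1]
    n_intervals = len(bounds) - 1
    iaf, last_slot, profile = {}, {}, {}
    for u, stamps in updates.items():
        inside = [t for t in stamps if interval_index(t, bounds) is not None]
        ts_last = max(inside) if inside else None
        iaf[u] = 1.0 - age_factor(ts_last, ts_start, ts_end)  # inverse-age factor
        last_slot[u] = interval_index(ts_last, bounds) if inside else None
        profile[u] = [0] * n_intervals  # updates counted per interval
        for t in inside:
            profile[u][interval_index(t, bounds)] += 1

    # Event factor: share of the category's pages last modified in the same interval.
    per_slot = Counter((category[u], s) for u, s in last_slot.items() if s is not None)
    per_topic = Counter(category[u] for u, s in last_slot.items() if s is not None)
    ef = {
        u: per_slot[(category[u], s)] / per_topic[category[u]] if s is not None else 0.0
        for u, s in last_slot.items()
    }

    # Trend factor: similarity of a page's profile to its category's average profile.
    members = {}
    for u, c in category.items():
        members.setdefault(c, []).append(u)
    average = {
        c: [sum(profile[u][i] for u in us) / len(us) for i in range(n_intervals)]
        for c, us in members.items()
    }

    def cosine(a, b):  # assumed similarity function
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

    tf = {u: cosine(profile[u], average[category[u]]) for u in updates}
    return iaf, ef, tf


# Hypothetical data: timestamps are day numbers, with two 5-day intervals.
updates = {"p1": [2, 7, 9], "p2": [8], "p3": []}
category = {"p1": "news", "p2": "news", "p3": "science"}
iaf, ef, tf = temporal_factors(updates, category, bounds=[0, 5, 10])
```

In this toy data, page p1 was last updated on day 9 of 10, so its inverse-age factor is 0.9, while the never-updated page p3 receives 0 for all three factors.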

3.2.1 Time-Weighted Transition

Let $F(u)$ be the set of pages that page $u$ links to by either its actual links or artificial links in case $u$ is a dangling page. For any $t(u, v)$, we define the time-weighted transition

$$t(u, v) = \frac{\beta_1 IAF(v) + \beta_2 EF(v) + \beta_3 TF(v)}{\sum_{w \in F(u)} \left(\beta_1 IAF(w) + \beta_2 EF(w) + \beta_3 TF(w)\right)} \tag{12}$$

with the coefficients $\beta_1, \beta_2, \beta_3 \in [0, 1]$. Equation (12) represents the probability that a random surfer selects a page for the next hop under some biases, i.e., the inverse-age, event, and trend values. In other words, the surfer will choose the target page $v$ by up-to-dateness, event importance, as well as activeness, in comparison to all targets.
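The time-weighted transition can be illustrated with a short Python sketch. It assumes the three factors are already available as dictionaries keyed by page; the function name, the `betas` argument, the factor values, and the uniform fallback used when all targets of a page have zero weight (a case the text does not address) are assumptions made for this example.

```python
def time_weighted_transitions(out_links, iaf, ef, tf, betas=(1 / 3, 1 / 3, 1 / 3)):
    """Build t(u, v) as each target's combined factor weight, normalized over F(u)."""
    b1, b2, b3 = betas
    weight = {v: b1 * iaf[v] + b2 * ef[v] + b3 * tf[v] for v in iaf}
    pages = list(out_links)
    transitions = {}
    for u in pages:
        # F(u): actual links, or artificial links to all pages if u is dangling.
        targets = out_links[u] or pages
        total = sum(weight[w] for w in targets)
        if total > 0:
            transitions[u] = {v: weight[v] / total for v in targets}
        else:
            # Assumption: fall back to a uniform choice when every weight is zero.
            transitions[u] = {v: 1.0 / len(targets) for v in targets}
    return transitions


# Hypothetical factor values for a three-page graph; "c" is dangling.
out_links = {"a": ["b", "c"], "b": ["c"], "c": []}
iaf = {"a": 0.0, "b": 0.9, "c": 0.3}
ef = {"a": 0.0, "b": 0.6, "c": 0.3}
tf = {"a": 0.0, "b": 0.9, "c": 0.3}
t = time_weighted_transitions(out_links, iaf, ef, tf)
```

Here page b carries a combined weight of 0.8 against 0.3 for page c, so a surfer on page a follows the link to b with probability 0.8 / 1.1 rather than the 1/2 that plain PageRank would assign.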


3.2.2 Age-Biased Vector

In the original PageRank, a set of transition edges is artificially added to cover all web pages with respect to the Ergodic Theorem [13], [21]. These artificial transitions, illustrated in Eq. (4), are given by a uniform probability distribution $\left[\frac{1}{n}\right]_{n \times 1}$. However, our key to reducing bias in the original PageRank against new pages is that we will compute the values by using a non-uniform vector. We let $s$ be an $n$-dimensional column vector where the entry for each page $v$ is its normalized inverse-age:

$$s(v) = \frac{IAF(v)}{\sum_{w \in V} IAF(w)}. \tag{13}$$

That is, the surfer randomly jumps to $v$ with a bias based on page age.
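Putting the pieces together, the sketch below normalizes the inverse-age values into the jump vector of Eq. (13) and then iterates the general form of Eq. (11) with given transition weights. It is a hedged illustration only: the function names, the convergence tolerance, the uniform fallback for an all-zero inverse-age vector, and the numeric transition rows are assumptions, and the transition rows are expected to sum to 1 (including the row of a dangling page).

```python
def age_biased_vector(iaf):
    """Eq. (13): inverse-age values normalized to sum to 1."""
    total = sum(iaf.values())
    if total == 0:
        # Assumption: use the uniform vector if no page was ever updated.
        return {v: 1.0 / len(iaf) for v in iaf}
    return {v: value / total for v, value in iaf.items()}


def twpr_scores(transitions, jump, alpha=0.85, tol=1e-12, max_iter=500):
    """Iterate Eq. (11): r(v) = alpha * sum_u t(u, v) r(u) + (1 - alpha) * s(v)."""
    pages = list(jump)
    rank = {v: 1.0 / len(pages) for v in pages}
    for _ in range(max_iter):
        new_rank = {v: (1.0 - alpha) * jump[v] for v in pages}
        for u, row in transitions.items():
            for v, t_uv in row.items():
                new_rank[v] += alpha * t_uv * rank[u]
        delta = sum(abs(new_rank[v] - rank[v]) for v in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical inputs: inverse-age values and time-weighted transition rows.
iaf = {"a": 0.0, "b": 0.9, "c": 0.3}
transitions = {
    "a": {"b": 0.7, "c": 0.3},
    "b": {"c": 1.0},
    "c": {"a": 0.0, "b": 0.7, "c": 0.3},  # dangling page: artificial links
}
jump = age_biased_vector(iaf)
scores = twpr_scores(transitions, jump)
```

In this toy setting page a has never been updated, so it receives neither jump probability nor propagated rank, which shows how strongly the temporal weights can penalize stale pages.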

4.

4.1 Dataset

The dataset was crawled from our e-Society Project [10], which also overlaps with the Open Directory Project (ODP) [8], and contains roughly 390,000 URLs. To study the evolution of these pages, we crawled them daily from August 15th, 2008 until January 31st, 2009. Modification of pages was detected by simply comparing consecutive versions. Each page was also categorized according to the topics in ODP, which are used in the computation of TWPR. Note that since the ODP provides a hierarchy of topics from the most general to more specific ones, a page can be assigned to more than one topic according to a defined level. For example, a page in Computers/Internet/Searching/Directories/... can be labeled as Computers, Computers/Internet, and Computers/Internet/Searching according to the 1st, 2nd, and 3rd level of topics, respectively.
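The level-based labeling in the example above amounts to truncating the topic path. A minimal Python sketch, with a hypothetical helper name `topic_at_level`:

```python
def topic_at_level(odp_path, level):
    """Return the first `level` components of a slash-separated ODP topic path."""
    parts = [part for part in odp_path.split("/") if part]
    return "/".join(parts[:level])


path = "Computers/Internet/Searching/Directories"
labels = [topic_at_level(path, level) for level in (1, 2, 3)]
```

For the path used here, `labels` holds the three topic labels named in the text, one per level.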

For the experiments, each crawled version of the web pages was translated into a graph structure for computing authoritative scores; however, only the last version was indexed and retrieved by our searching system. As an illustration, 30 sample queries are listed in Table 1, some of which come from [9], [15].

We implemented PageRank (PR) [20], our TWPR, T-Rank [4], and TimedPR [24], using the Java programming language. We also integrated them into our prototype Lucene-based searching system [18] for comparison of the quality of ranking results.

Table 1  Sample queries.

affirmative action      innovation
amusement park          lyme disease
architecture            marketing
astronomy               music
blues                   olympic
business                parallel computing
christmas               presidential election
cruise                  risk management
disaster                rock climbing
earthquake              science conference
fashion                 soccer
film festival           socialism
game                    sushi
graphic design          tattoo
hurricane               tournament

4.2 Evaluation Measures

Two measures are used to compare ranked lists. $OSim_k(\tau_1, \tau_2)$ determines the degree of overlap between the top-$k$ URLs of two rankings $\tau_1$ and $\tau_2$:

$$OSim_k(\tau_1, \tau_2) = \frac{|R_1 \cap R_2|}{k}, \tag{14}$$

where $R_1$ and $R_2$ are the lists of top-$k$ URLs in $\tau_1$ and $\tau_2$, respectively.

$KSim_k(\tau_1, \tau_2)$, a variant of Kendall's distance measure, determines the degree of agreement with which pairwise distinct URLs $u$ and $v$ within the top-$k$ have the same relative order in both rankings $\tau_1$ and $\tau_2$. Consider two lists $R_1$ and $R_2$ of top-$k$ URLs. Let $U$ be the union of URLs contained in both lists and define $R'_1$ as the extension of $R_1$ that adds the elements $U - R_1$ after all the URLs in $R_1$. Similarly, $R'_2$ is defined as the extension of $R_2$. KSim is then given as:

$$KSim_k(\tau_1, \tau_2) = \frac{|\{(u, v) : R'_1 \text{ and } R'_2 \text{ agree on the order of } (u, v) \text{ and } u \ne v\}|}{|U|\,(|U| - 1)}. \tag{15}$$

4.3 Experimental Results

In this section, we study the page authority assessment of the PR, TWPR, T-Rank, and TimedPR methods. In the following, we first present the results on the comparison of top authority web pages computed by those methods. Then, we discuss the time evolution of authoritativeness. Finally, we describe results of a user study to evaluate the quality of ranked results obtained by a pre-defined set of queries.

To individually assess the authoritativeness for each crawled version of web pages, we first set the parameters used in the computation of our TWPR method as follows: $TS_{Start}$ = the date of the first crawl (August 15th, 2008), $TS_{End}$ = the date of the crawled version in consideration, C@3 = a set of the 3rd level categories of ODP, I@5 = a set of every 5 consecutive days, and $\beta_1 = \beta_2 = \beta_3 = 1/3$ for the coefficients of the inverse-age, event, and trend factors. We also set the time intervals used for the trend factor to be the same as the intervals $I$ used for the event factor in all experiments. For all of the other methods, we used the default parameters previously defined by the authors [4], [20], [24]. In addition, we fix the value of the parameter $\alpha = 0.85$ [14], [20].

We now discuss the difference between the authoritativeness induced by the ranking methods. Considering scores of the last crawled web pages, the five most authoritative pages are depicted in Table 2. As shown in the table, PR ranks old pages as highly authoritative; for example, the first page has become obsolete and redirects to a new one, and the other pages contain outdated contents. Not surprisingly, TWPR, T-Rank, and TimedPR confer significant authority on up-to-date pages. In addition, the rankings contain several pages in common, including the first one about a recent film.

Table 2  Top-5 authority web pages produced by PR, TWPR, T-Rank, and TimedPR, respectively.

PR
1  PLOTEUS (Portail sur les opportunités d'études et de formation en Europe)
   http://europa.eu.int/ploteus/portal
2  CM Magazine: On Strike: The Winnipeg General Strike, 1919.
   http://umanitoba.ca/outreach/cm/vol7/no4/onstrike.html
3  DRIIE | Department of international relations and european integration |
   http://www.die.ro
4  Japanese Public Holidays for 2004
   http://www3.sympatico.ca/ccsr/j2004.html
5  Home Planet Release 3.3a
   http://www.fourmilab.ch/homeplanet/homeplanet.html

TWPR
1  Filmopplevelsen starter her - Filmweb
   http://www.filmweb.no
2  Census Bureau Home Page
   http://www.census.gov
3  ShopMania - Price comparison in US, Read reviews
   http://www.shopmania.com
4  National Institute of Neurological Disorders and Stroke (NINDS)
   http://www3.sympatico.ca/ccsr/j2004.html
5  Statistici, clasament si trafic web romanesc
   http://www.trafic.ro

T-Rank
1  Filmopplevelsen starter her - Filmweb
   http://www.filmweb.no
2  Stummfilm - ARTE
   http://www.arte.tv/de/film/stummfilm-auf-arte/690880.html
3  Census Bureau Home Page
   http://www.census.gov
4  ShopMania - Price comparison in US, Read reviews
   http://www.shopmania.com
5  Al-Ahram Weekly | Front Page
   http://weekly.ahram.org.eg

TimedPR
1  Filmopplevelsen starter her - Filmweb
   http://www.filmweb.no
2  ShopMania - Price comparison in US, Read reviews
   http://www.shopmania.com
3  HIT100.ro
   http://www.hit100.ro
4  Al-Ahram Weekly | Front Page
   http://weekly.ahram.org.eg
5  Regina Public Schools
   http://www.rbe.sk.ca

We further investigated the time evolution effect, i.e., how the authoritativeness changes over time. We compared the lists of pages obtained from the ranking methods for several consecutive fortnights. Figure 2 shows the evolution of top-1000 authorities in terms of OSim and KSim. Note that the observation period runs from August 15th, 2008 until January 31st, 2009. Hence, each point on the graph is the similarity between the rankings at that time and those of the previous fortnight.

Fig. 2 (panels: OSim and KSim; series: PR, TWPR, T-Rank, TimedPR; horizontal axis: fortnightly dates from 31-08-2008 to 31-01-2009) Similarity of the top-1000 authority pages produced by the same ranking methods for consecutive fortnights.

As shown in the figure, PR yields OSim values of 0.92 to 0.97 (average 0.95) and KSim values of 0.88 to 0.94 (average 0.91). This means PR produces nearly the same rankings over the entire time period. In contrast, the consecutive rankings of TWPR, T-Rank and TimedPR have lower similarities because they take temporal information into account. Among them, TWPR is less biased (i.e., has higher similarities) than T-Rank and TimedPR as well. In addition, all three temporal-aware ranking methods showed the lowest consecutive-ranking similarities at 31-12-2008, due to several factors such as the New Year festival event.
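The OSim and KSim measures of Sect. 4.2, which underlie these fortnight-to-fortnight comparisons, can be sketched in Python as follows. The sketch rests on two assumptions not stated in the text: each ranking contains at least k distinct URLs, and URLs appended to an extended list are treated as tied, so a pair that is ordered in one list and tied in the other counts as a disagreement. Function names and sample URLs are hypothetical.

```python
from itertools import permutations


def osim(ranking1, ranking2, k):
    """Overlap of the top-k URLs of two rankings, as a fraction of k."""
    return len(set(ranking1[:k]) & set(ranking2[:k])) / k


def ksim(ranking1, ranking2, k):
    """Fraction of ordered pairs (u, v), u != v, on whose relative order
    the two extended top-k lists agree."""
    top1, top2 = ranking1[:k], ranking2[:k]
    union = set(top1) | set(top2)
    if len(union) < 2:
        return 1.0

    def extended_positions(top):
        position = {url: i for i, url in enumerate(top)}
        for url in union - set(top):
            position[url] = len(top)  # appended after the list; treated as tied
        return position

    pos1, pos2 = extended_positions(top1), extended_positions(top2)
    agree = sum(
        1
        for u, v in permutations(union, 2)
        if (pos1[u] - pos1[v]) * (pos2[u] - pos2[v]) > 0
    )
    return agree / (len(union) * (len(union) - 1))


# Hypothetical rankings from two consecutive fortnights.
earlier = ["u1", "u2", "u3"]
later = ["u1", "u3", "u4"]
overlap = osim(earlier, later, k=3)
agreement = ksim(earlier, later, k=3)
```

In this toy case two of the three top URLs are shared, so OSim is 2/3, and the extended lists agree on four of the six unordered pairs, so KSim is also 2/3.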

To evaluate the quality of the page rankings of those methods with respect to human users' notion of page importance, we conducted a user study with twenty-seven users, referred to as experts. In this study, all ranking methods were integrated into our searching system. We employed the last crawled web pages and the 30 sample queries in Table 1 in the experiments. For each query, we first separated the top-thirty results. Since our interest is in the effect of temporal aspects on the ranking, we simply assume that a result is relevant to a query if it contains the query words. The set of top-thirty results is then re-ranked according to the authoritative scores produced by the four ranking methods. We also asked the human experts to grade each result, i.e., giving a score from 0 (worst) to 3 (best). Based on the experts' assessments, we aggregated those scores and sorted from highest to lowest, to create baseline rankings for comparisons. For every query and every ranking method, we summed up and normalized the similarities obtained from the comparisons between the baseline rankings and the rankings of that method. We illustrate here the average similarities at top-5, 10, and 20 results in Fig. 3.

Fig. 3 (panels: OSim and KSim; vertical axis: Similarity; groups: Top-5, Top-10, Top-20; series: PR, TWPR, T-Rank, TimedPR) Average similarities between the experts' rankings and the rankings produced by the methods.

The figure shows there is a high similarity between the experts' rankings and the rankings of the methods. All three temporal-aware methods show higher similarity to experts than does PageRank; further, the similarity increases as more results are included. This similarity increases rapidly until the top-10 for all temporal-aware methods, indicating that they can identify the pages most relevant to human experts within the top few results. This is an important result since, as reported in [22], most users usually browse only the top-ten results.

To further investigate how the ranking methods confer authority, we examined the in-link pages of each result and counted them. For each query, the in-degrees were summed and normalized. Table 3 shows the average in-degrees. It is not surprising that PR yields the largest average in-degrees within the top-5 and top-10, due to its definition. In contrast, TWPR, T-Rank, and TimedPR prefer up-to-date web pages in spite of their having fewer citations. Moreover, among the temporal-aware rankings, TWPR returns the average in-degrees most closely matching the experts within the top-5 and top-10. It can be inferred that TWPR is less biased toward recent pages and agrees with some old pages indeed being important.

Table 3  Average in-degrees per web page for top-5, 10, and 20 of results produced by the ranking methods and the experts.

Method     Top-5   Top-10   Top-20
PR         5343    3717     2272
TWPR       3256    2597     1884
T-Rank     2556    2192     2021
TimedPR    2675    2024     1909
Experts    3852    3208     2004

Table 4  Average similarities at top-10 between the experts' rankings and the rankings produced by TWPR setting $\beta_1 = \beta_2 = \beta_3 = 1/3$ but varying the values of C and I.

           C@1            C@2            C@3
        OSim  KSim     OSim  KSim     OSim  KSim
I@1     0.55  0.52     0.58  0.53     0.62  0.59
I@3     0.55  0.54     0.59  0.57     0.66  0.62
I@5     0.56  0.55     0.64  0.61     0.68  0.66
I@7     0.56  0.54     0.63  0.61     0.68  0.64

4.4 Sensitivity Analysis

We now examine the sensitivity of the parameters of the TWPR method. The parameters can be divided into two groups: external and internal parameters. The former are the levels of hierarchical topics in ODP [8] and the width of the time intervals, i.e., $C$ and $I$, respectively. The latter are the coefficients $\beta_1$, $\beta_2$, and $\beta_3$, which determine the weights of the inverse-age, event, and trend factors, respectively. The remaining parameter $\alpha$ in Eq. (11) is assigned a constant value of 0.85. We conducted the experiments by employing the last crawled web pages on January 31st, 2009 as well as the 30 queries. We also used the baseline rankings obtained from the experts for comparisons. In the following, we examine only the top-10 ranking results.

To study the effects of the external parameters, we set all internal parameters to the same constant value (i.e., 1/3) but vary $C$ and $I$. Let C@x be a set of categories corresponding to the x-th level in ODP, and I@y be a set of time intervals for every y consecutive days. For the experiments, we compared the baseline rankings with the rankings produced from twelve combinations of $C$ and $I$, as depicted in Table 4. The results show that at C@1, varying $I$ hardly affects the rankings. The reason is that categories are too general at the first level; there are many web pages contained in each category. Consequently, modifying just some web pages has little impact when compared to the large number remaining unchanged. However, varying $I$ has much more effect at C@2 and C@3. Moreover, when we fix $C$ at C@2 or C@3, varying $I$ from I@1 through I@5 results in better rankings, but they remain stable (or drop slightly) at I@7. The reason is that I@1 is too specific: there is little probability that given web pages are modified at the same time. The I@3 and I@5 intervals are more flexible; however, the interval becomes too wide at I@7.
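The interval sets I@y can be generated mechanically from the observation period. The Python sketch below splits the period into consecutive half-open intervals of y days; how the authors treated a final, shorter interval is not stated, so truncating it at the end date is an assumption, as is the function name `intervals_every`.

```python
from datetime import date, timedelta


def intervals_every(start, end, days):
    """Split (start, end] into consecutive intervals (lower, upper] of `days` days.

    Assumption: the last interval is cut short if the period does not divide evenly.
    """
    bounds = [start]
    while bounds[-1] < end:
        bounds.append(min(bounds[-1] + timedelta(days=days), end))
    return list(zip(bounds[:-1], bounds[1:]))


# I@5 over the crawl period reported in Sect. 4.1.
i_at_5 = intervals_every(date(2008, 8, 15), date(2009, 1, 31), days=5)

# The twelve external-parameter combinations compared in Table 4.
combinations = [(level, width) for level in (1, 2, 3) for width in (1, 3, 5, 7)]
```

With the crawl dates from the paper, the 169-day period yields 34 intervals for I@5 under this truncation rule, the last one covering only four days.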

8

Table 5 Average similarities at top-10 between the experts' rankings and the rankings produced by TWPR with C@3 and I@5 while varying the three coefficients (columns 1, 2, and 3 give the weights of the inverse-age, event, and trend factors, respectively).

Notation    1     2     3     OSim   KSim
TWPR A      1     0     0     0.59   0.54
TWPR E      0     1     0     0.52   0.44
TWPR T      0     0     1     0.55   0.48
TWPR AE     1/2   1/2   0     0.63   0.58
TWPR AT     1/2   0     1/2   0.66   0.63
TWPR ET     0     1/2   1/2   0.61   0.55
TWPR AET    1/3   1/3   1/3   0.68   0.66
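To make the OSim column concrete, the sketch below computes a top-n overlap score between two rankings. It assumes that OSim follows the overlap definition used by Haveliwala [15], i.e., the number of pages shared by the two top-n lists divided by n; the function name `osim` and the page identifiers are illustrative only, and KSim is not covered.

```python
def osim(ranking_a, ranking_b, n=10):
    """Fraction of pages shared by the top-n of two rankings (order ignored).

    Assumption: OSim is the top-n overlap measure; this is a sketch, not the
    evaluation code used in the experiments.
    """
    top_a = set(ranking_a[:n])
    top_b = set(ranking_b[:n])
    return len(top_a & top_b) / n


# Hypothetical rankings, best page first.
expert = ["p1", "p2", "p3", "p4", "p5"]
system = ["p2", "p1", "p6", "p3", "p7"]

score = osim(expert, system, n=5)  # three shared pages out of five
```

Because only set membership is compared, this measure does not penalize a ranking for placing the shared pages in a different order.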

Finally, we examine the effect of the internal parameters. For this, we fixed the external parameters at C@3 and I@5. The seven combinations of the three coefficients are shown in Table 5. From the results, we conclude that the inverse-age factor is the most important, followed by the trend factor and then the event factor. This means that the experts prefer newer and up-to-date web pages to those that were last updated long ago. However, combinations of more than one factor produce better rankings; in particular, TWPR AET gives the best result.
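The ordering stated above can be checked directly against the values in Table 5. The sketch below simply encodes the table and sorts it; the dictionary layout and the helper `rank_by` are illustrative choices, not part of the TWPR method.

```python
from fractions import Fraction as F

# name: ((inverse-age, event, trend coefficients), OSim, KSim), from Table 5
TABLE_5 = {
    "TWPR A":   ((F(1), F(0), F(0)), 0.59, 0.54),
    "TWPR E":   ((F(0), F(1), F(0)), 0.52, 0.44),
    "TWPR T":   ((F(0), F(0), F(1)), 0.55, 0.48),
    "TWPR AE":  ((F(1, 2), F(1, 2), F(0)), 0.63, 0.58),
    "TWPR AT":  ((F(1, 2), F(0), F(1, 2)), 0.66, 0.63),
    "TWPR ET":  ((F(0), F(1, 2), F(1, 2)), 0.61, 0.55),
    "TWPR AET": ((F(1, 3), F(1, 3), F(1, 3)), 0.68, 0.66),
}


def rank_by(metric_index, names=None):
    """Sort configurations by OSim (index 1) or KSim (index 2), best first."""
    names = list(TABLE_5) if names is None else names
    return sorted(names, key=lambda name: TABLE_5[name][metric_index],
                  reverse=True)


# Observation from the table: each row's coefficients sum to one.
weights_sum_to_one = all(sum(coeffs) == 1 for coeffs, _, _ in TABLE_5.values())

# Single-factor settings, ordered by OSim: inverse-age, then trend, then event.
single_factor_order = rank_by(1, ["TWPR A", "TWPR E", "TWPR T"])

best_by_osim = rank_by(1)[0]
best_by_ksim = rank_by(2)[0]
```

Both measures give the same picture: among the single-factor settings the inverse-age factor scores highest, and the equally weighted three-factor setting scores highest overall.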

5. Conclusion

For many web queries, human users seek current or actively maintained information; hence, one would expect the time-related factors of a web page to contribute to establishing its usefulness to human web searchers. Our experiments show that this is indeed the case. Age, event-related changes, and trend in revisions all appear to contribute to improving page rankings compared with the standard PageRank algorithm. All three temporal-aware algorithms outperformed PageRank; the TWPR algorithm, which incorporates all three factors, showed the greatest improvement.

Several issues need to be addressed in future work. First, our experiments employed a small subset of web pages that had been categorized by ODP. To deal with the complete Web, including unknown pages, an automatic system for Web categorization is needed. Second, a model for predicting change in the Web is essential to save crawling time as well as network bandwidth. Last, we plan to investigate the content information and metadata in web pages and integrate them into the computation to obtain better rankings.

Acknowledgements

This research is supported by the Thailand Research Fund through the Royal Golden Jubilee Ph.D. Program (Grant No. PHD/0122/2548). We thank the members of Yamana Laboratory for their help in data preparation and many helpful suggestions. We also thank James Brucker for his intensive polishing of the paper.

References

[1] L.A. Adamic and B.A. Huberman, "The web's hidden order," Communications of the ACM, vol.44, no.9, pp.55-59, 2001.

[2] R.A. Baeza-Yates and B.A. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, England, pp.27-30, 1998.

[3] R.A. Baeza-Yates, F. Saint-Jean, and C. Castillo, "Web structure, dynamics and page quality," Proc. 9th International Symposium on String Processing and Information Retrieval, pp.117-130, 2002.

[4] K. Berberich, M. Vazirgiannis, and G. Weikum, "Time-aware authority ranking," Internet Mathematics, vol.2, no.3, pp.301-332, 2006.

[5] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks and ISDN Systems, vol.30, no.1-7, pp.107-117, 1998.

[6] J. Cho and S. Roy, "Impact of search engines on page popularity," Proc. 13th International World Wide Web Conference, pp.20-29, 2004.

[7] J. Cho, S. Roy, and R.E. Adams, "Page quality: In search of an unbiased web ranking," Proc. ACM SIGMOD International Conference on Management of Data, pp.551-562, 2005.

[8] DMOZ, The Open Directory Project, http://www.dmoz.org/.

[9] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, "Rank aggregation methods for the web," Proc. 10th International World Wide Web Conference, pp.613-622, 2001.

[10] e-Society Project, http://www.yama.info.waseda.ac.jp/~yamana/e-society/index_eng.htm.

[11] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore and London, 1996.

[12] B. Gonçalves, M.R. Meiss, J.J. Ramasco, A. Flammini, and F. Menczer, "Remembering what we like: Toward an agent-based model of web traffic," Proc. 2nd ACM International Conference on Web Search and Data Mining, 2009.

[13] G.R. Grimmett and D.R. Stirzaker, Probability and Random Processes, Oxford University Press, USA, 2001.

[14] T.H. Haveliwala, "Efficient computation of PageRank," Stanford Digital Libraries, 1999.

[15] T.H. Haveliwala, "Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search," IEEE Trans. Knowledge and Data Engineering, vol.15, no.4, pp.784-796, 2003.

[16] J.M. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM, vol.46, no.5, pp.604-632, 1999.

[17] Y. Liu, B. Gao, T. Liu, Y. Zhang, Z. Ma, S. He, and H. Li, "BrowseRank: Letting web users vote for page importance," Proc. 31st ACM SIGIR Conference on Research and Development in Information Retrieval, pp.451-458, 2008.

[18] Lucene, The Apache Software Foundation, http://lucene.apache.org/.

[19] M.R. Meiss, F. Menczer, S. Fortunato, A. Flammini, and A. Vespignani, "Ranking web sites with real user traffic," Proc. 1st ACM International Conference on Web Search and Data Mining, pp.65-76, 2008.

[20] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Stanford Digital Libraries, 1998.

[21] S.M. Ross, Introduction to Probability Models, Academic Press, 2002.

[22] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz, "Analysis of a very large web search engine query log," ACM SIGIR Forum, vol.33, no.1, pp.6-12, 1999.

[24] P.S. Yu, X. Li, and B. Liu, "Adding the temporal dimension to search - A case study in publication search," Proc. IEEE/WIC/ACM International Conference on Web Intelligence, pp.543-549, 2005.

Bundit Manaskasemsak received the B.Eng. and M.Eng. degrees in Computer Engineering from Kasetsart University, Thailand, in 2003 and 2005, respectively. He has held the Royal Golden Jubilee scholarship for the Ph.D. program since 2005. He is a Ph.D. candidate in Computer Engineering, Kasetsart University. His current research interests include web search, information retrieval, and parallel and distributed computing.

Arnon Rungsawang received the B.Eng. degree in Electrical Engineering from King Mongkut Institute of Technology, Thailand, in 1986; the DEA-IARFA from Université de Pierre et Marie Curie (Paris VI), France, in 1993; and the Ph.D. degree in Informatique et Réseaux from ENST-Paris, France, in 1997. Since 1998, he has been a lecturer in the Department of Computer Engineering, Kasetsart University. His current research interests include web search, information retrieval, parallel and distributed computing, and artificial intelligence.

Hayato Yamana received the B.S., M.S., and Dr.Eng. degrees in Computer Science from Waseda University, Japan, in 1987, 1989, and 1993, respectively. From 1993 to 2000, he was a researcher at the Electrotechnical Laboratory, AIST, MITI. From 2000 to 2005, he was an associate professor at Waseda University. Since 2005, he has been a professor at Waseda University. His current research interests include data mining, distributed computing, and information retrieval. He has been the president of the Information Grand Voyage Project Consortium since 2007. He is a member of IPSJ, IEICE, ACM, and IEEE.
