
IEICE TRANS. INF. & SYST., VOL.Exx??, NO.xx XXXX 200x

PAPER

Time-Weighted Web Authoritative Ranking


Bundit MANASKASEMSAK a), Student Member, Arnon RUNGSAWANG, Nonmember, and Hayato YAMANA, Member

SUMMARY We investigate temporal aspects as factors in assessing the authoritativeness of web pages. We present three different measures related to time: age, event, and trend factors, which measure recentness, special event occurrence, and trend in revisions, respectively. An experimental dataset is created by crawling selected topics for a period of several months. This data is used to compare page rankings by human experts with rankings computed by the standard PageRank algorithm (which does not include temporal factors) and three algorithms that incorporate temporal factors, including the Time-Weighted PageRank (TWPR) algorithm introduced here. Analysis of the rankings indicates that all three temporal-aware algorithms produce rankings more like those of human experts than does the PageRank algorithm. Of these, the TWPR algorithm produced rankings most similar to human experts, indicating that all three temporal factors are relevant in page ranking.
key words: web ranking algorithm, time-weighted ranking, web authoritativeness, PageRank, link analysis, search engine

1. Introduction

Web search engines are critical to enabling users to find the most relevant or useful resources for their specific interest; the goal is to include the most relevant pages in the top-10 results returned by a query. Most search engines include a ranking algorithm that computes a page's authoritativeness based either on the link structure of the Web (such as HITS [16] and PageRank [5], [20]) or on mining of users' browse histories (such as BrowseRank [17], Traffic-weighted Ranking [19], and BookRank [12]). While all of these algorithms produce useful rankings overall, they do not incorporate temporal information about pages, which can be important to a human user's interest in a particular page. Indeed, these algorithms may be biased against more recent pages [1], [6], as such pages have had less time to accumulate in-links that contribute to their link-based ranking, and less chance to be incorporated into a browse history.

In many situations a web user may be more interested in recent or actively updated resources, rather than older ones, even though the latter may appear to be more authoritative. A web page for an upcoming conference is clearly more interesting than a page for a past conference; a link to the current version of an open-source software package is probably more useful than a link to an old version. Indeed, one can argue that search engines are most needed for finding newer, more dynamic content; such items are less likely to be findable via web directories (e.g., ODP [8]), Wikipedia [23], or other resources that a human user can turn to for help.

Manuscript received January 1, 2003.
Manuscript revised January 1, 2003.
Final manuscript received January 1, 2003.
The author is with the Department of Computer Engineering, Kasetsart University, Thailand.
The author is with the Graduate School of Fundamental Science and Engineering, Waseda University, Japan.
a) E-mail: un@mikelab.net
In this paper, we propose a method to incorporate temporal aspects into the authority ranking of web pages using a modified version of the PageRank algorithm. Three temporal factors are incorporated: age, event, and trend; these capture the intuitive notion that a page with recent updates, an update coinciding with a special event, or a high frequency of updates is potentially more interesting to users. These attributes are computed from page timestamps (directly or by inference) and a modification history that is stored during each web crawl; the event and trend factors require categorizing pages and comparing the temporal history of pages in the same category, as described in Sect. 3. We discuss other algorithms that use temporal information in Sect. 2, notably T-Rank [4] and TimedPR [24], which are included in our experimental study.

The remainder of this paper is organized as follows. In Sect. 2, we briefly review the PageRank algorithm and some related work. In Sect. 3, we first introduce the temporal factors, and then describe how to apply them to our new variant of PageRank. In Sect. 4, we report and discuss the results of experiments. We offer conclusions and future work in Sect. 5.
2. Review of PageRank and Related Work

The basic definition of PageRank [20] can be stated as follows. For each crawl of the Web, we can construct a web graph $G = (V, E)$, an arbitrary directed graph with pages and their hyperlinks corresponding to the set of nodes $V$ and edges $E \subseteq V^2$, respectively, without multiple edges. If page $u \in V$ has a link to page $v \in V$, it implies that the author of page $u$ implicitly confers some importance to page $v$. Let $r(u)$ be the PageRank score (i.e., importance) of page $u$ and $t(u, v)$ be the proportion of importance propagated from $u$ to $v$. This value is normally set to $1/\mathit{out}(u)$, where $\mathit{out}(u)$ is the out-degree of page $u$ in $G$. Therefore, a link $(u, v) \in E$ confers $r(u)/\mathit{out}(u)$ units of rank to page $v$. This yields the following equation to calculate the PageRank score for all web pages:

\[
\forall v \in V, \quad r^{(k+1)}(v) = \sum_{u \in B(v)} t(u, v)\, r^{(k)}(u) = \sum_{u \in B(v)} \frac{r^{(k)}(u)}{\mathit{out}(u)}, \tag{1}
\]

where $B(v)$ represents the set of pages linking to $v$. The total score conferred on $v$ is the sum of the scores of each source page $u$ divided by its out-degree. This recursive definition requires an iterative computation, which repeats until the rank scores are stable.

This process is equivalent to a random walk on the directed web graph $G$. Consider a random surfer who visits a page $u$ at time $k$. At time $k + 1$, the surfer chooses uniformly at random one of $u$'s out-links and follows it to a page $v$, i.e., with probability $1/\mathit{out}(u)$, which is the value of $t(u, v)$ in Eq. (1). Consequently, the PageRank score of a page $v$ can be defined as the probability of a random surfer landing on $v$ at a sufficiently large time step. Let $n$ be the total number of pages contained in $G$, and $M$ be the $(n \times n)$ square matrix describing the transition over $G$. The matrix entry $m_{ij}$ is thus set to $1/\mathit{out}(i)$ if there is a link from page $i$ to page $j$, and the remaining entries are set to 0. In terms of the power iteration for the matrix-vector multiplication of an eigensystem [11], the PageRank vector $r$ can be considered as the principal eigenvector of $M^{T}$ corresponding to eigenvalue 1:

\[
r = M^{T} r. \tag{2}
\]

Considering the Markov chain induced by a random walk on the web graph $G$, the transition probability matrix $M$ is said to be valid if it is a row-stochastic matrix, i.e., each of its rows sums to 1. Hence, $M$ must have no row consisting of all zeros. A general approach for dealing with dangling pages is to add a set of virtual links from those pages to all pages. Let $M'$ be the $(n \times n)$ square matrix in which all entries in each row corresponding to a dangling page are $1/n$, and all other entries are 0. Then, the modified PageRank is:

\[
r = (M^{T} + M'^{T})\, r. \tag{3}
\]

(Footnote: Pages with zero out-degree cause the PageRank leak problem [20].)

Moreover, the Ergodic Theorem for Markov chains [13], [21] guarantees that the PageRank formula will converge to a stationary probability distribution if $M$ is also aperiodic and irreducible. These requirements hold for a proper Markov model: the former means that the surfer can return to a visited page at any time step, while the latter means that every page can be reached from every other page. In fact, the web graph often contains "sinks" or isolated clusters. However, this problem can be solved by adding a complete teleportation term to every page. This solution is reasonable for the random surfer analogy: the random surfer, who follows links on the Web, at some time step gets bored, abandons link surfing, and instead enters a new destination in the browser's URL line. Let $E$ be the $(n \times n)$ square matrix representing the teleportation over the web graph $G$, in which all entries have value $1/n$. Then, the modified PageRank can be expressed as:


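\[
\begin{aligned}
r &= \left[ \alpha (M^{T} + M'^{T}) + (1 - \alpha) E^{T} \right] r \\
  &= \alpha (M^{T} + M'^{T})\, r + (1 - \alpha) E^{T} r \\
  &= \alpha (M^{T} + M'^{T})\, r + (1 - \alpha) \left[ \frac{1}{n} \right]_{n \times 1}.
\end{aligned} \tag{4}
\]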

Here, the coefficient $\alpha$ is used to determine the weights of rank propagation and teleportation. Since the total probability $\|r\|_1 = 1$, the multiplication $E^{T} r$ returns a uniform $n$-dimensional column vector, i.e., the total probability of a random jump to each page is equal to $1/n$.
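As an illustration of the iterative computation behind Eq. (4), the following is a minimal sketch in Python, not the authors' Java implementation. The function name `pagerank`, the adjacency-list input format, the convergence tolerance, and the toy graph are all assumptions made for this example; the damping coefficient defaults to 0.85.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration for Eq. (4) on a graph given as {page: [target pages]}.

    Assumes every target also appears as a key and that there are no
    duplicate links (the web graph has no multiple edges).
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # uniform start vector, sums to 1
    for _ in range(max_iter):
        # Rank held by dangling pages is spread uniformly over all pages
        # (the virtual links added for pages with zero out-degree).
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: alpha * dangling / n + (1.0 - alpha) / n for p in pages}
        for u in pages:
            targets = out_links[u]
            for v in targets:
                # Each link (u, v) confers r(u)/out(u) units of rank to v.
                new_rank[v] += alpha * rank[u] / len(targets)
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:  # scores are stable
            break
    return rank


# Hypothetical toy graph: page "d" is dangling and has no in-links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
scores = pagerank(graph)
```

In this toy graph the scores remain a probability distribution, and page "d", which receives rank only through teleportation and the dangling-page term, ends up with the lowest score.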
Since PageRank was published, it has been widely applied in many commercial Web search engines and applications. In reality, however, the Web is dynamic; considering only the number of page referrers when assessing authoritativeness is incomplete and inaccurate. PageRank tends to prefer older to newer pages, since the older ones tend to have many in-links accumulated over time. Cho and Roy [6] describe this "rich-get-richer" phenomenon, in which search engines usually report higher-PageRank pages at the top of the results. These pages not only remain well known but also receive more and more references from other Web authors over time, while newborn pages are continually left behind. However, the old pages may be outdated, and the new pages may be more valuable. Cho et al. [7] thus propose a model to estimate the quality of pages by analyzing the growth of their popularity, i.e., how many references they receive over time. A page with a fast-growing trend should be considered potentially valuable, even though it is still unpopular.
(Footnote to the sinks mentioned above: A small self-loop cluster, having no connection back to the main community, causes the PageRank sink problem [20].)

There are also some studies that concentrate on integrating temporal aspects into the PageRank algorithm. Baeza-Yates et al. [3] assume that a page should be considered good if it has a recent modification time and already has other pages linking to it. They thus propose a modified PageRank that weights pages with an exponential decay function according to their age, to counteract the bias against new pages. Yu et al. [24] propose another temporal variant of PageRank in the context of scientific publications, called TimedPageRank (TimedPR). However, their proposed technique cannot be directly applied to the Web. Although the citation concept of scientific papers is quite similar, some other aspects differ from general web content. For instance, scientific papers have static information fixed at publication time. Since they cannot be deleted, their citation counts are also monotonically increasing. In contrast, both the content and the links of web pages can be modified and deleted over time. Scientific papers can only cite others further in the past, so there are no citation loops. On the Web, linked target pages can be subsequently modified, becoming newer than their referrers and providing higher quality information. This factor consequently affects the authoritativeness and should be considered.
Berberich et al. [4] specify a temporal window of interest to analyze the freshness and update frequency of both pages and links. They apply these two factors in the PageRank computation, called Time-Aware Authority Ranking (T-Rank). However, their proposed technique is somewhat impractical because the modification time of links is hardly detectable and is usually approximated using the modification time of pages. In addition, the results are not representative since they were reported only for a dataset on a specific domain. In fact, each domain usually has its own change behavior. For example, news pages, which are updated daily, should not dominate the others. Therefore, in dealing with the Web, the topics of web pages should also be considered.
3. Proposed Temporal Aspects

In this section, we describe our variant of PageRank with temporal aspects. In Sect. 3.1, we first introduce three factors for time analysis. In Sect. 3.2, we then describe how new scores can be computed by integrating these temporal factors.

3.1 Temporal Factors

The Web is continually changing. Web pages are created, modified, or even deleted over time and in response to events. Based on the assumptions that (1) authors who update their web pages will usually provide more correct and higher quality information, and (2) users often need and are interested in up-to-date information, we infer the authoritativeness of pages by considering their change behavior. We investigate the age, event, and trend factors in analyzing the modification times of pages.

3.1.1 Age and Inverse-Age Factors

We define the age factor to describe how old a web page is. This factor does not measure the actual age of a web page since it was created, but rather the time since its last modification. A recently updated page is assumed to be of higher quality than an older one.

We first let two constants $TS_{Start}$ and $TS_{End}$ be the starting- and ending-point timestamps of the period over which changes in web pages are observed, respectively. Let $ts(u)$ be a modification time of page $u$, and $ts_{Last}(u)$ be the most recent one. Then, the age factor of a page $u$, denoted by $AF(u)$, can be defined by the following function:

\[
AF(u) =
\begin{cases}
\dfrac{TS_{End} - ts_{Last}(u)}{TS_{End} - TS_{Start}} & \text{if } TS_{Start} < ts_{Last}(u) \le TS_{End}, \\[2ex]
1 & \text{if } ts_{Last}(u) \le TS_{Start}, \\[1ex]
0 & \text{otherwise.}
\end{cases} \tag{5}
\]

Fig. 1 The age factor function. [Plot of AF (vertical axis, maximum 1) against time, from $TS_{Start}$ to $TS_{End}$.]

This age function, depicted in Fig. 1, is bounded by [0, 1]. Given the observation period, the age of page $u$ is set to the maximal value 1 if it has never changed; otherwise, for $ts_{Last}(u) \in (TS_{Start}, TS_{End}]$, its age decreases linearly over the normalized interval until it reaches zero.

However, the more recently a web page was updated, the more important it is. So, instead of directly using the age value, we complement it to create the inverse-age factor of a page $u$, denoted $IAF(u)$, defined as:

\[
IAF(u) = 1 - AF(u). \tag{6}
\]

3.1.2 Event Factor

In general, if an event occurs, some web pages will be modified to provide up-to-date information about the event. The more important the occurrence is, the more web pages will be changed. By observing the amount of change, we can infer not only that an event occurred but also how important it is. However, not every web page will be changed for a given event; this depends on the topic to which a page is related. For instance, web pages updated for the Olympic Games are usually categorized under sports, while those updated for a movie are categorized under entertainment. In our approach, we are interested in the last event of each web page, i.e., the event factor is analyzed by considering the last-modified time of a page based on its topic.

Let $C = \{C_1, C_2, \ldots, C_p\}$ be a set of categories or topics containing non-overlapping pages, i.e., if a page $u \in C_i$ then $u \notin C_j$ for all $j \neq i$. Let $I = \{I_1, I_2, \ldots, I_q\}$ be a set of time intervals such that $I_i = (TS_{i-1}, TS_i]$ for $1 \le i \le q$, where $TS_i$ is a pre-defined timestamp with $TS_{Start} = TS_0 < TS_1 < \ldots < TS_{q-1} < TS_q = TS_{End}$. Considering members of the Cartesian product $C \times I$, we let $page(C_i, I_j)$ be the number of web pages that are categorized in $C_i$ and have their last-modified time within $I_j$. Then, the event factor of a page $u$, denoted by $EF(u)$, is defined as:

\[
EF(u) = \frac{page(C_i, I_j)}{\sum_{k=1}^{q} page(C_i, I_k)}; \quad u \in C_i \text{ and } ts_{Last}(u) \in I_j. \tag{7}
\]

That is, the event factor of page $u$ is the proportion of pages in the same category as $u$ that also changed in the same time interval, relative to the total number of pages in that category. A large event value, i.e., a large fraction of web pages changed, can be used to infer that an important event occurred.

3.1.3 Trend Factor

The age and event factors only consider the last modification time of web pages. We believe that how often a page is updated can also be exploited to estimate its importance. A frequently updated page reflects its own activeness. However, the changes usually depend on the kind of topic as well. Considering only the frequency of change may be biased against pages that already present correct or reliable information. For instance, a news page will likely be updated more frequently than a scientific one.

In our approach, a trend factor is defined to describe how similar the change behavior of a web page is to that of others within the same topic, based on the assumption that the most important pages will behave in the same way. We first introduce an update profile of each page based on the changes over the observation period. Let $T = \{t_i \in (TS_{Start}, TS_{End}] \mid 1 \le i \le m \text{ and } t_1 < t_2 < \ldots < t_m\}$ be a set of pre-defined timestamps. Let $x(u)$ be the $m$-dimensional vector representing the update profile of page $u$:

\[
x_i(u) =
\begin{cases}
1 & \text{if } u \text{ was updated at timestamp } t_i, \\
0 & \text{otherwise.}
\end{cases}
\]

However, using this profile to directly compare pages will produce large differences, since there is little possibility that web pages will be updated at the same specific timestamps along the observation period. We thus let $I' = \{I'_1, I'_2, \ldots, I'_r\}$ be a set of time intervals such that $I'_i = (TS'_{i-1}, TS'_i]$ for $1 \le i \le r$, where $TS'_i$ is a pre-defined timestamp with $TS_{Start} = TS'_0 < TS'_1 < \ldots < TS'_{r-1} < TS'_r = TS_{End}$. Define $x'(u)$, an $r$-dimensional vector, to represent a new update profile of $u$, where

\[
x'_i(u) = \sum_{j : t_j \in I'_i} x_j(u). \tag{8}
\]

Similarly, we let $x'(C_j)$ be the $r$-dimensional vector representing the update profile of a category, where $C_j \in C$ and

\[
x'_i(C_j) = \frac{\sum_{u \in C_j} x'_i(u)}{|C_j|}. \tag{9}
\]

That is, $x'(C_j)$ refers to the average profile of all pages contained in the same category. We finally define the trend factor of a page $u$, denoted by $TF(u)$, to be:

\[
TF(u) = Sim(x'(u), x'(C_j)); \quad u \in C_j. \tag{10}
\]

Here, $Sim(\cdot, \cdot)$ is the cosine similarity function used in classical IR [2]. The trend factor of $u$ is thus considered as the similarity of its update profile to the profile of its category.

3.2 Time-Weighted Ranking

In this section, we describe how we compute a new ranking score by incorporating the inverse-age, event, and trend factors in PageRank. We first let $D$ be the set of dangling pages of the web graph $G$. Let $t(u, v)$ be the weight of rank propagation from page $u$ to page $v$, and $s(v)$ be the weight for a random jump from any page to $v$. Then, the PageRank formula illustrated in Eq. (4) can be rewritten in a general form as follows:

\[
\forall v \in V, \quad r(v) = \alpha \sum_{u \in (B(v) \cup D)} t(u, v)\, r(u) + (1 - \alpha)\, s(v). \tag{11}
\]

Note that in the original PageRank, $t(u, v)$ is set to either $1/\mathit{out}(u)$ if $u \in B(v)$ or $1/n$ if $u \in D$, and $s(v)$ is set to $1/n$. However, in this work we propose new transitions that incorporate temporal aspects.
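Before turning to the new transitions, the three temporal factors of Sect. 3.1 (Eqs. (5) to (10)) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: timestamps are integer day numbers, all intervals have the same width, a page that never changed is represented by a last-modified time equal to the start of the observation period, and a page unchanged during the period is given an event factor of 0 (a case the definitions leave open). All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def age_factor(ts_last, ts_start, ts_end):
    """Eq. (5): 1 if unchanged in the period, decreasing linearly to 0 at ts_end."""
    if ts_last <= ts_start:
        return 1.0
    if ts_last <= ts_end:
        return (ts_end - ts_last) / (ts_end - ts_start)
    return 0.0


def interval_index(ts, ts_start, width):
    """1-based index j of the interval (TS_{j-1}, TS_j] that contains ts."""
    return math.ceil((ts - ts_start) / width)


def event_factors(ts_last, category, ts_start, ts_end, width):
    """Eq. (7): share of a category's modified pages last changed in the same interval."""
    counts = Counter()  # page(C_i, I_j)
    totals = Counter()  # sum over k of page(C_i, I_k)
    for u, ts in ts_last.items():
        if ts_start < ts <= ts_end:
            counts[(category[u], interval_index(ts, ts_start, width))] += 1
            totals[category[u]] += 1
    ef = {}
    for u, ts in ts_last.items():
        if ts_start < ts <= ts_end:
            key = (category[u], interval_index(ts, ts_start, width))
            ef[u] = counts[key] / totals[category[u]]
        else:
            ef[u] = 0.0  # assumption: no event for pages unchanged in the period
    return ef


def cosine(x, y):
    """Cosine similarity; returns 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0


def trend_factors(updates, category, ts_start, ts_end, width):
    """Eqs. (8)-(10): similarity of each page's interval profile to its category mean."""
    r = math.ceil((ts_end - ts_start) / width)
    profile = {}
    for u, stamps in updates.items():
        x = [0.0] * r
        for ts in stamps:
            if ts_start < ts <= ts_end:
                x[interval_index(ts, ts_start, width) - 1] += 1.0  # Eq. (8)
        profile[u] = x
    members = defaultdict(list)
    for u in updates:
        members[category[u]].append(u)
    tf = {}
    for pages in members.values():
        mean = [sum(profile[u][i] for u in pages) / len(pages) for i in range(r)]  # Eq. (9)
        for u in pages:
            tf[u] = cosine(profile[u], mean)  # Eq. (10)
    return tf


# Hypothetical toy data: observation period (0, 10], intervals of 5 days.
ts_start, ts_end, width = 0, 10, 5
category = {"p1": "news", "p2": "news", "p3": "science"}
updates = {"p1": [2, 4, 9], "p2": [8], "p3": []}
ts_last = {u: max(s) if s else ts_start for u, s in updates.items()}

iaf = {u: 1.0 - age_factor(t, ts_start, ts_end) for u, t in ts_last.items()}  # Eq. (6)
ef = event_factors(ts_last, category, ts_start, ts_end, width)
tf = trend_factors(updates, category, ts_start, ts_end, width)
```

In this toy example, both news pages were last modified in the second interval, so each has an event factor of 1, while the never-updated science page receives zero for all three factors.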
3.2.1 Time-Weighted Transition

Consider a page $u \in V$; we let $F(u)$ be the set of web pages that page $u$ links to, either by its actual links or by artificial links in case $u$ is a dangling page. For any $t(u, v)$, we define the time-weighted transition

\[
t(u, v) = \frac{\omega_1 IAF(v) + \omega_2 EF(v) + \omega_3 TF(v)}{\sum_{w \in F(u)} \left( \omega_1 IAF(w) + \omega_2 EF(w) + \omega_3 TF(w) \right)} \tag{12}
\]

with the coefficients $\omega_1, \omega_2, \omega_3 \in [0, 1]$. Equation (12) represents the probability with which a random surfer selects a page for his next hop under some biases, i.e., the inverse-age, event, and trend values. In other words, the surfer will choose the target page $v$ by its up-to-dateness, event importance, and activeness, in comparison to all other targets.
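The normalization in Eq. (12) can be illustrated with a short Python sketch. The function name `time_weighted_transitions`, the input format, and the factor values are hypothetical. The uniform fallback used when all targets of a page have a zero combined score is an assumption of this sketch, since the text does not specify that case.

```python
def time_weighted_transitions(targets, iaf, ef, tf, w=(1 / 3, 1 / 3, 1 / 3)):
    """Eq. (12): t(u, v) is proportional to the weighted temporal score of target v."""
    score = {v: w[0] * iaf[v] + w[1] * ef[v] + w[2] * tf[v] for v in iaf}
    trans = {}
    for u, f_u in targets.items():  # f_u plays the role of F(u)
        total = sum(score[v] for v in f_u)
        if total > 0:
            trans[u] = {v: score[v] / total for v in f_u}
        else:
            # Assumption (not specified in the text): fall back to a uniform
            # choice when every target of u has a zero combined score.
            trans[u] = {v: 1.0 / len(f_u) for v in f_u}
    return trans


# Hypothetical factor values for three pages.
iaf = {"a": 0.9, "b": 0.3, "c": 0.0}
ef = {"a": 0.6, "b": 0.6, "c": 0.0}
tf = {"a": 0.9, "b": 0.3, "c": 0.0}
# Page "c" is treated as dangling, so it gets artificial links to all pages.
targets = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "c"]}
trans = time_weighted_transitions(targets, iaf, ef, tf)
```

With equal coefficients, the combined scores here are 0.8 for "a", 0.4 for "b", and 0 for "c", so a surfer leaving the dangling page "c" moves to "a" twice as often as to "b".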

3.2.2 Age-Biased Vector

In terms of the random-walk model, a complete set of transition edges is artificially added to cover all web pages, in accordance with the Ergodic Theorem [13], [21]. These artificial transitions, illustrated in Eq. (4), are given by a uniform probability distribution $\left[ \frac{1}{n} \right]_{n \times 1}$. However, our key to reducing the bias of the original PageRank against new pages is to compute these values using a non-uniform vector. We let $s$ be an $n$-dimensional column vector where the entry for each page $v$ is its normalized inverse-age:

\[
s(v) = \frac{IAF(v)}{\sum_{w \in V} IAF(w)}. \tag{13}
\]

Equation (13) describes the behavior of a surfer who randomly jumps to $v$ with a bias based on page age.
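Putting Eqs. (11) and (13) together, the time-weighted ranking can be sketched as a power iteration in Python. This is a minimal illustration, not the authors' implementation. The function name `twpr`, the dictionary-of-rows format for the transitions, the tolerance, and the input values are hypothetical, and the uniform jump vector used when every inverse-age value is zero is an assumption of this sketch.

```python
def twpr(trans, iaf, alpha=0.85, tol=1e-10, max_iter=200):
    """Iterate Eq. (11) using the age-biased jump vector s of Eq. (13).

    `trans[u][v]` holds t(u, v); each row is assumed to sum to 1.
    """
    pages = sorted(iaf)
    n = len(pages)
    total_iaf = sum(iaf.values())
    if total_iaf > 0:
        s = {v: iaf[v] / total_iaf for v in pages}  # Eq. (13)
    else:
        # Assumption (not specified in the text): uniform jumps if all IAF are zero.
        s = {v: 1.0 / n for v in pages}
    rank = {v: 1.0 / n for v in pages}
    for _ in range(max_iter):
        new_rank = {v: (1.0 - alpha) * s[v] for v in pages}
        for u, row in trans.items():
            for v, t_uv in row.items():
                new_rank[v] += alpha * t_uv * rank[u]
        delta = sum(abs(new_rank[v] - rank[v]) for v in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical inputs: page "c" has never been updated and is dangling.
iaf = {"a": 0.9, "b": 0.3, "c": 0.0}
trans = {
    "a": {"b": 1.0, "c": 0.0},
    "b": {"a": 1.0, "c": 0.0},
    "c": {"a": 2 / 3, "b": 1 / 3, "c": 0.0},
}
rank = twpr(trans, iaf)
```

In this example the never-updated page "c" receives no rank at all, because both its jump probability and every transition into it are zero, while the more recently updated page "a" outranks "b".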
4. Experimental Results

We implemented several page ranking methods, i.e., PR [20], our TWPR, T-Rank [4], and TimedPR [24], using the Java programming language. We also integrated them into our prototype Lucene-based search system [18] to compare the quality of the ranking results.

4.1 Dataset

As the source of web data, we used a partial Web crawled from our e-Society Project [10], which also overlaps with the Open Directory Project (ODP) [8], containing roughly 390,000 URLs. To study the evolution of these pages, we crawled them daily from August 15th, 2008 until January 31st, 2009. Modification of pages was detected by simply comparing consecutive versions. Each page was also categorized according to the topics in ODP; these categories are used in the computation of TWPR. Note that since the ODP provides a topic hierarchy from the most general to more specific topics, a page can be assigned to more than one topic, depending on the chosen level. For example, a page in Computers/Internet/Searching/Directories/... can be labeled as Computers, Computers/Internet, and Computers/Internet/Searching according to the 1st, 2nd, and 3rd level of topics, respectively.

For the experiments, each crawled version of the web pages was translated into a graph structure for computing authoritative scores; however, only the last version was indexed and retrieved by our search system. As illustration, 30 sample queries are listed in Table 1, some of which come from [9], [15].

Table 1 Queries used in the experiment.

affirmative action      innovation
amusement park          lyme disease
architecture            marketing
astronomy               music
blues                   olympic
business                parallel computing
christmas               presidential election
cruise                  risk management
disaster                rock climbing
earthquake              science conference
fashion                 soccer
film festival           socialism
game                    sushi
graphic design          tattoo
hurricane               tournament

4.2 Evaluation Measures

We used two metrics, OSim and KSim, proposed in [15], to measure the similarity of any two ranked lists. $OSim_k(\tau_1, \tau_2)$ determines the degree of overlap between the top-$k$ URLs of two rankings $\tau_1$ and $\tau_2$:

\[
OSim_k(\tau_1, \tau_2) = \frac{|R_1 \cap R_2|}{k}, \tag{14}
\]

where $R_1$ and $R_2$ are the sets of top-$k$ URLs contained in $\tau_1$ and $\tau_2$, respectively.

$KSim_k(\tau_1, \tau_2)$, a variant of Kendall's $\tau$ distance measure, determines the degree to which pairs of distinct URLs $u$ and $v$ within the top-$k$ ranks have the same relative order in both rankings $\tau_1$ and $\tau_2$. Consider two lists $R_1$ and $R_2$ of top-$k$ URL rankings. Let $U$ be the union of URLs contained in both lists and define $R'_1$ as the extension of $R_1$ that adds the elements $U - R_1$ after all the URLs in $R_1$. Similarly, $R'_2$ is defined as the extension of $R_2$. KSim is then given as:

\[
KSim_k(\tau_1, \tau_2) = \frac{\left| \{ (u, v) : R'_1, R'_2 \text{ agree on order of } (u, v) \text{ and } u \neq v \} \right|}{|U|\,(|U| - 1)}. \tag{15}
\]

4.3 Results and Discussions

We conducted experiments on our crawled web data to study the page authority assessment of the PR, TWPR, T-Rank, and TimedPR methods. In the following, we first present the results on the comparison of top authority web pages computed by those methods. Then, we discuss the time evolution of authoritativeness. Finally, we describe results of a user study to evaluate the quality of ranked results obtained by a pre-defined set of queries.
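Since the comparisons in this section rely on the OSim and KSim measures of Sect. 4.2, a minimal Python sketch of both is given here for illustration. The function names and the example rankings are hypothetical. The treatment of URLs appended to an extended list is an assumption: Eq. (15) does not fix their relative order, so this sketch gives them all the same (tied) position and counts a pair as agreeing only when its relative order matches in both extended lists.

```python
from itertools import permutations


def osim(ranking1, ranking2, k):
    """Eq. (14): overlap between the top-k URL sets of two rankings."""
    return len(set(ranking1[:k]) & set(ranking2[:k])) / k


def ksim(ranking1, ranking2, k):
    """Eq. (15): fraction of ordered URL pairs on which the extended lists agree."""
    top1, top2 = ranking1[:k], ranking2[:k]
    union = set(top1) | set(top2)
    if len(union) < 2:
        return 1.0  # assumption: fewer than two URLs are trivially in agreement

    def positions(top):
        pos = {url: i for i, url in enumerate(top)}
        # URLs missing from this list are appended after it; as an assumption,
        # they all share the same (tied) position.
        return {url: pos.get(url, len(top)) for url in union}

    pos1, pos2 = positions(top1), positions(top2)

    def sign(x):
        return (x > 0) - (x < 0)

    agree = sum(
        1
        for u, v in permutations(union, 2)
        if sign(pos1[u] - pos1[v]) == sign(pos2[u] - pos2[v])
    )
    return agree / (len(union) * (len(union) - 1))


# Hypothetical rankings of four URLs.
first = ["a", "b", "c"]
second = ["a", "c", "d"]
overlap = osim(first, second, 3)
agreement = ksim(first, second, 3)
```

For these two lists, two of the three top URLs overlap, and the extended lists agree on four of the six unordered pairs, so both measures equal 2/3.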
To individually assess the authoritativeness for each crawled version of web pages, we first set the parameters used in the computation of our TWPR method as follows: $TS_{Start}$ = the date of the first crawl (August 15th, 2008), $TS_{End}$ = the date of the crawled version under consideration, C@3 = the set of 3rd-level categories of ODP, I@5 = the set of intervals of 5 consecutive days starting from the first crawl, and the coefficients $\omega_1 = \omega_2 = \omega_3 = 1/3$. We also set the interval $I'$ to be the same as $I$ in all experiments. For all of the other methods, we used the default parameters previously defined by the authors [4], [20], [24]. In addition, we fix the value of the parameter $\alpha = 0.85$ [14], [20].

We now discuss the difference between the authoritativeness induced by the ranking methods. Considering the scores of the last crawled web pages, the five most authoritative pages are listed in Table 2. As shown in the table, PR ranks old pages as highly authoritative; for example, the first page has become obsolete and redirects to a new one, and the other pages contain outdated content. Not surprisingly, TWPR, T-Rank, and TimedPR confer significant authority on up-to-date pages. In addition, their rankings contain several pages in common, including the first one about a recent film.

Table 2 Top-5 authority web pages produced by PR, TWPR, T-Rank, and TimedPR, respectively.

Top-5 list obtained from PR
1  PLOTEUS (Portail sur les opportunites d'etudes et de formation en Europe)
   http://europa.eu.int/ploteus/portal
2  CM Magazine: On Strike: The Winnipeg General Strike, 1919.
   http://umanitoba.ca/outreach/cm/vol7/no4/onstrike.html
3  DRIIE | Department of international relations and european integration |
   http://www.die.ro
4  Japanese Public Holidays for 2004
   http://www3.sympatico.ca/ccsr/j2004.html
5  Home Planet Release 3.3a
   http://www.fourmilab.ch/homeplanet/homeplanet.html

Top-5 list obtained from TWPR
1  Filmopplevelsen starter her - Filmweb
   http://www.filmweb.no
2  Census Bureau Home Page
   http://www.census.gov
3  ShopMania - Price comparison in US, Read reviews
   http://www.shopmania.com
4  National Institute of Neurological Disorders and Stroke (NINDS)
   http://www3.sympatico.ca/ccsr/j2004.html
5  Statistici, clasament si trafic web romanesc
   http://www.trafic.ro

Top-5 list obtained from T-Rank
1  Filmopplevelsen starter her - Filmweb
   http://www.filmweb.no
2  Stummfilm - ARTE
   http://www.arte.tv/de/film/stummfilm-auf-arte/690880.html
3  Census Bureau Home Page
   http://www.census.gov
4  ShopMania - Price comparison in US, Read reviews
   http://www.shopmania.com
5  Al-Ahram Weekly | Front Page
   http://weekly.ahram.org.eg

Top-5 list obtained from TimedPR
1  Filmopplevelsen starter her - Filmweb
   http://www.filmweb.no
2  ShopMania - Price comparison in US, Read reviews
   http://www.shopmania.com
3  HIT100.ro
   http://www.hit100.ro
4  Al-Ahram Weekly | Front Page
   http://weekly.ahram.org.eg
5  Regina Public Schools
   http://www.rbe.sk.ca

We further investigated the time evolution effect, i.e., how the authoritativeness changes over time. We compared the lists of pages obtained from the ranking methods for several consecutive fortnights. Figure 2 shows the evolution of the top-1000 authorities in terms of OSim and KSim. Note that the observation period runs from August 15th, 2008 until January 31st, 2009. Hence, each point on the graph is the similarity between the rankings at that time and those of the previous fortnight.

Fig. 2 Similarities between lists of top-1000 authority web pages produced by the same ranking methods for consecutive fortnights. [Two panels, OSim and KSim, plotting the similarity for PR, TWPR, T-Rank, and TimedPR at fortnightly dates from 31-08-2008 to 31-01-2009.]

As shown in the figure, PR yields OSim values of 0.92 to 0.97 (average 0.95) and KSim values of 0.88 to 0.94 (average 0.91). This means PR produces nearly the same rankings over the entire time period. In contrast, the consecutive rankings of TWPR, T-Rank, and TimedPR have lower similarities because of their temporal approach. We can also see that our TWPR is less biased (i.e., has higher similarities) than T-Rank and TimedPR. In addition, all three temporal-aware ranking methods showed the lowest consecutive-ranking similarities at 31-12-2008, due to several factors such as the New Year festival event.
To evaluate the quality of the page rankings of those methods with respect to human users' notion of page importance, we conducted a user study with twenty-seven users, referred to as experts. In this study, all ranking methods were integrated into our search system. We employed the last crawled web pages and the 30 sample queries in Table 1. For each query, we first selected the top-thirty URLs resulting from the tf-idf weight [2]. Since our interest is in the effect of temporal aspects on the ranking, we simply assume that a result is relevant to a query if it contains the query words. The set of top-thirty results is then re-ranked according to the authoritative scores produced by the four ranking methods. We also asked the human experts to grade each result with a score from 0 (worst) to 3 (best). Based on the experts' assessments, we aggregated those scores and sorted the results from highest to lowest to create baseline rankings for comparison. For every query and every ranking method, we summed up and normalized the similarities obtained from the comparisons between the baseline rankings and the rankings of that method. Figure 3 shows the average similarities at the top-5, 10, and 20 results.

Fig. 3 Average similarities at top-5, 10, and 20 ranked results between the experts' rankings and the rankings produced by the methods. [Two panels, OSim and KSim, showing the similarity for PR, TWPR, T-Rank, and TimedPR at Top-5, Top-10, and Top-20.]

The figure shows there is a high similarity between the experts' rankings and the rankings of the methods. All three temporal-aware methods show higher similarity to the experts than does PageRank; further, the similarity increases as more results are included. This similarity increases rapidly until the top-10 for all temporal-aware methods, indicating that they can identify the pages most relevant to human experts within the top few results. This is an important result since, as reported in [22], most users usually browse only the top-ten results.

To further investigate how the ranking methods confer authority, we examined the in-link pages of each result and counted them. For each query, the in-degrees were summed and normalized. Table 3 shows the average in-degrees. It is not surprising that PR provides the largest values within the top-5 and 10 due to its definition. In contrast, TWPR, T-Rank, and TimedPR prefer up-to-date web pages in spite of their having fewer citations. Moreover, among the temporal-aware rankings, TWPR returns the average in-degrees most closely matching those of the experts within the top-5 and 10. It can be inferred that TWPR is less biased toward recent pages and agrees that some old pages are indeed important.

Table 3 Average in-degrees per web page for top-5, 10, and 20 of results produced by the ranking methods and the experts.

          PR      TWPR    T-Rank   TimedPR   Experts
Top-5     5343    3256    2556     2675      3852
Top-10    3717    2597    2192     2024      3208
Top-20    2272    1884    2021     1909      2004

Table 4 Average similarities at top-10 between the experts' rankings and the rankings produced by TWPR, setting $\omega_1 = \omega_2 = \omega_3 = 1/3$ but varying the values of C and I.

              I@1     I@3     I@5     I@7
C@1   OSim    0.55    0.55    0.56    0.56
      KSim    0.52    0.54    0.55    0.54
C@2   OSim    0.58    0.59    0.64    0.63
      KSim    0.53    0.57    0.61    0.61
C@3   OSim    0.62    0.66    0.68    0.68
      KSim    0.59    0.62    0.66    0.64

4.4 Sensitivity Analysis

We investigated the eect of parameters specied in the


TWPR method. The parameters can be divided into
two groups: external and internal parameters. The former regards to the levels of hierarchical topics in ODP
[8] and the width of the time intervals, i.e., C and I, respectively. The latter regards to the coecients 1 , 2 ,
and 3 , which determine the weights of the inverse-age,
event, and trend factors, respectively. The remaining
parameter in Eq. (11), we assign a constant value
0.85. We conducted the experiments by employing the
last crawled web pages on January 31st, 2009 as well
as the 30 queries. We also used the baseline rankings
obtained from the experts for comparisons. In the following, we will examine only the top-10 ranking results.
To study the effects of the external parameters, we
set all internal parameters to the same constant value
(i.e., 1/3) and varied C and I. Let C@x be the set of
categories at the x-th level in ODP, and
I@y be the set of time intervals of y consecutive
days each. We compared the baseline
rankings with the rankings produced from twelve combinations
of C and I, as shown in Table 4. The results show that at C@1,
varying I hardly affects the
rankings. The reason is that categories at the first level are too
general, so each category contains many web pages.
Consequently, modifications to a few
web pages have little impact relative to the large
number that remain unchanged. However, varying I has much more
effect at C@2 and C@3.
Moreover, when we fix C at C@2 or C@3, widening I
from I@1 through I@5 yields better rankings,
but the results level off (or drop slightly) at I@7. The
reason is that I@1 is too specific: certain
web pages are unlikely to be modified within the same one-day interval.
The I@3 and I@5 intervals are more flexible, whereas I@7 is
too wide.
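The comparison of the twelve settings can be reproduced from the reported values with a short sketch. The numbers are transcribed from Table 4; the assignment of the three OSim/KSim row pairs to C@1, C@2, and C@3 is an assumption inferred from the discussion above, and the function names are illustrative only.

```python
# (OSim, KSim) at top-10 for each (C, I) setting, transcribed from Table 4.
# Assumption: the three OSim/KSim row pairs correspond to C@1, C@2, C@3.
TABLE4 = {
    ("C@1", "I@1"): (0.55, 0.52), ("C@1", "I@3"): (0.55, 0.54),
    ("C@1", "I@5"): (0.56, 0.55), ("C@1", "I@7"): (0.56, 0.54),
    ("C@2", "I@1"): (0.58, 0.53), ("C@2", "I@3"): (0.59, 0.57),
    ("C@2", "I@5"): (0.64, 0.61), ("C@2", "I@7"): (0.63, 0.61),
    ("C@3", "I@1"): (0.62, 0.59), ("C@3", "I@3"): (0.66, 0.62),
    ("C@3", "I@5"): (0.68, 0.66), ("C@3", "I@7"): (0.68, 0.64),
}


def best_setting(table):
    """Setting with the highest OSim, ties broken by KSim."""
    return max(table, key=lambda setting: table[setting])


def osim_spread_by_level(table):
    """Range (max - min) of OSim across interval widths, per ODP level."""
    by_level = {}
    for (level, _interval), (osim, _ksim) in table.items():
        by_level.setdefault(level, []).append(osim)
    return {lvl: round(max(v) - min(v), 2) for lvl, v in by_level.items()}


print(best_setting(TABLE4))
print(osim_spread_by_level(TABLE4))
```

Under this reading, C@3 with I@5 is the best setting, and the OSim range across interval widths is 0.01 at C@1 against 0.06 at C@2 and C@3, which matches the observation that I matters little at the first level.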

Table 5 Average similarities at top-10 between the experts'
rankings and the rankings produced by TWPR, setting C@3 and I@5
but varying the values of coefficients 1, 2, and 3.

Notation    Coef. 1   Coef. 2   Coef. 3   OSim   KSim
TWPR A      1         0         0         0.59   0.54
TWPR E      0         1         0         0.52   0.44
TWPR T      0         0         1         0.55   0.48
TWPR AE     1/2       1/2       0         0.63   0.58
TWPR AT     1/2       0         1/2       0.66   0.63
TWPR ET     0         1/2       1/2       0.61   0.55
TWPR AET    1/3       1/3       1/3       0.68   0.66

Finally, we examine the effect of the internal parameters. For this, we fixed the external parameter values at C@3 and I@5. The seven combinations of the three coefficients are
shown in Table 5. From the results, we conclude that
the inverse-age factor is the most important,
followed by the trend factor and then the event factor. This means that the experts prefer
newer and more recently updated web pages to those
that were last updated long ago. Moreover, combinations of more than one factor produce better
rankings; in particular, TWPR AET provides the best result.
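The seven coefficient settings are exactly the non-empty subsets of the three factors, with the weight shared equally among the selected factors. The sketch below enumerates them and ranks the reported scores. The scores are transcribed from Table 5; the helper names and the single-letter keys (A for inverse-age, E for event, T for trend) are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

FACTORS = ("A", "E", "T")  # inverse-age, event, trend


def equal_weight_settings(factors=FACTORS):
    """Map each non-empty factor subset to its coefficient triple.

    Selected factors share the weight equally; the others get 0.
    """
    settings = {}
    for size in range(1, len(factors) + 1):
        for subset in combinations(factors, size):
            share = Fraction(1, size)
            settings["".join(subset)] = tuple(
                share if f in subset else Fraction(0) for f in factors
            )
    return settings


# (OSim, KSim) at top-10, transcribed from Table 5.
RESULTS = {
    "A": (0.59, 0.54), "E": (0.52, 0.44), "T": (0.55, 0.48),
    "AE": (0.63, 0.58), "AT": (0.66, 0.63), "ET": (0.61, 0.55),
    "AET": (0.68, 0.66),
}

settings = equal_weight_settings()
single_factor_order = sorted(FACTORS, key=lambda f: RESULTS[f], reverse=True)
best = max(RESULTS, key=RESULTS.get)

for name, coefficients in settings.items():
    print(name, [str(c) for c in coefficients], RESULTS[name])
print(single_factor_order, best)
```

Sorting the single-factor rows by (OSim, KSim) gives the order A, T, E, and the maximum over all seven rows is the AET setting, in line with the conclusions drawn from Table 5.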
5. Conclusion

For many web queries, human users seek current or actively maintained information; hence, one would expect
the time-related factors of a web page to contribute
to establishing its usefulness to human web searchers. Our
experiments show that this is indeed the case. Age,
event-related changes, and trend in revisions all appear
to contribute to improved page rankings, as compared
to the standard PageRank algorithm. All three temporal-aware
algorithms outperformed PageRank; the TWPR
algorithm, which incorporates all three factors, showed
the greatest improvement.
There are several issues which need to be addressed
in future work. First, our experiments employed only a small subset of
web pages that had been categorized
by ODP. To deal with the complete Web, including unknown pages, an automatic system for Web categorization is needed. Second, a model for predicting change
in the Web is essential to save crawling time as
well as network bandwidth. Last, we plan to investigate
the content information and metadata in web pages and
integrate these into the computation for better rankings.
Acknowledgements
This research is supported by the Thailand Research
Fund through the Royal Golden Jubilee Ph.D. Program
(Grant No. PHD/0122/2548). We thank the members
of Yamana Laboratory for their help in data preparation and many helpful suggestions. We also thank
James Brucker for his intensive polishing of the paper.

References
[1] L.A. Adamic and B.A. Huberman, "The web's hidden order," Communications of the ACM, vol.44, no.9, pp.55-59, 2001.
[2] R.A. Baeza-Yates and B.A. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, England, pp.27-30, 1998.
[3] R.A. Baeza-Yates, F. Saint-Jean, and C. Castillo, "Web structure, dynamics and page quality," Proc. 9th International Symposium on String Processing and Information Retrieval, pp.117-130, 2002.
[4] K. Berberich, M. Vazirgiannis, and G. Weikum, "Time-aware authority ranking," Internet Mathematics, vol.2, no.3, pp.301-332, 2006.
[5] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks and ISDN Systems, vol.30, no.1-7, pp.107-117, 1998.
[6] J. Cho and S. Roy, "Impact of search engines on page popularity," Proc. 13th International World Wide Web Conference, pp.20-29, 2004.
[7] J. Cho, S. Roy, and R.E. Adams, "Page quality: In search of an unbiased web ranking," Proc. ACM SIGMOD International Conference on Management of Data, pp.551-562, 2005.
[8] DMOZ, The Open Directory Project, http://www.dmoz.org/.
[9] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, "Rank aggregation methods for the web," Proc. 10th International World Wide Web Conference, pp.613-622, 2001.
[10] e-Society Project, http://www.yama.info.waseda.ac.jp/~yamana/e-society/index_eng.htm.
[11] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore and London, 1996.
[12] B. Goncalves, M.R. Meiss, J.J. Ramasco, A. Flammini, and F. Menczer, "Remembering what we like: Toward an agent-based model of web traffic," Proc. 2nd ACM International Conference on Web Search and Data Mining, 2009.
[13] G.R. Grimmett and D.R. Stirzaker, Probability and Random Processes, Oxford University Press, USA, 2001.
[14] T.H. Haveliwala, "Efficient computation of PageRank," Stanford Digital Libraries, 1999.
[15] T.H. Haveliwala, "Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search," IEEE Trans. Knowledge and Data Engineering, vol.15, no.4, pp.784-796, 2003.
[16] J.M. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM, vol.46, no.5, pp.604-632, 1999.
[17] Y. Liu, B. Gao, T. Liu, Y. Zhang, Z. Ma, S. He, and H. Li, "BrowseRank: Letting web users vote for page importance," Proc. 31st ACM SIGIR Conference on Research and Development in Information Retrieval, pp.451-458, 2008.
[18] Lucene, The Apache Software Foundation, http://lucene.apache.org/.
[19] M.R. Meiss, F. Menczer, S. Fortunato, A. Flammini, and A. Vespignani, "Ranking web sites with real user traffic," Proc. 1st ACM International Conference on Web Search and Data Mining, pp.65-76, 2008.
[20] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Stanford Digital Libraries, 1998.
[21] S.M. Ross, Introduction to Probability Models, Academic Press, 2002.
[22] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz, "Analysis of a very large web search engine query log," ACM SIGIR Forum, vol.33, no.1, pp.6-12, 1999.
[23] WIKIPEDIA, http://www.wikipedia.org/.
[24] P.S. Yu, X. Li, and B. Liu, "Adding the temporal dimension to search - A case study in publication search," Proc. IEEE/WIC/ACM International Conference on Web Intelligence, pp.543-549, 2005.

Bundit Manaskasemsak received the B.Eng. and M.Eng. degrees in Computer Engineering from Kasetsart University, Thailand, in 2003 and 2005, respectively. He has received the Royal Golden Jubilee scholarship for the Ph.D. program since 2005. He is a Ph.D. candidate in Computer Engineering, Kasetsart University. His current research interests include web search, information retrieval, and parallel and distributed computing.

Arnon Rungsawang received the B.Eng. degree in Electrical Engineering from King Mongkut Institute of Technology, Thailand, in 1986; the DEA-IARFA from Université de Pierre et Marie Curie (Paris VI), France, in 1993; and the Ph.D. degree in Informatique et Réseaux from l'ENST-Paris, France, in 1997. Since 1998, he has been a lecturer in the Department of Computer Engineering, Kasetsart University. His current research interests include web search, information retrieval, parallel and distributed computing, and artificial intelligence.

Hayato Yamana received the B.S., M.S., and Dr.Eng. degrees in Computer Science from Waseda University, Japan, in 1987, 1989, and 1993, respectively. From 1993 to 2000, he was a researcher in the Electrotechnical Laboratory, AIST, MITI. From 2000 to 2005, he was an associate professor of Waseda University. Since 2005, he has been a professor of Waseda University. His current research interests include data mining, distributed computing, and information retrieval. He has been the president of the Information Grand Voyage Project Consortium since 2007. He is a member of IPSJ, IEICE, ACM, and IEEE.