
Information Retrieval on the Web

MEI KOBAYASHI and KOICHI TAKEDA


IBM Research

In this paper we review studies of the growth of the Internet and technologies that
are useful for information search and retrieval on the Web. We present data on the
Internet from several different sources, e.g., current as well as projected number of
users, hosts, and Web sites. Although numerical figures vary, overall trends cited
by the sources are consistent and point to exponential growth in the past and in
the coming decade. Hence it is not surprising that about 85% of Internet users
surveyed claim using search engines and search services to find specific
information. The same surveys show, however, that users are not satisfied with the
performance of the current generation of search engines; the slow retrieval speed,
communication delays, and poor quality of retrieved results (e.g., noise and broken
links) are commonly cited problems. We discuss the development of new techniques
targeted to resolve some of the problems associated with Web-based information
retrieval, and speculate on future trends.

Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear
Algebra - Eigenvalues and eigenvectors (direct and iterative methods); Singular
value decomposition; Sparse, structured and very large systems (direct and
iterative methods); G.1.1 [Numerical Analysis]: Interpolation; H.3.1
[Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3
[Information Storage and Retrieval]: Information Search and Retrieval -
Clustering; Retrieval models; Search process; H.m [Information Systems]:
Miscellaneous
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Clustering, indexing, information retrieval,
Internet, knowledge management, search engine, World Wide Web

1. INTRODUCTION

We review some notable studies on the growth of the Internet and on technologies
useful for information search and retrieval on the Web. Writing about the Web is a
challenging task for several reasons, of which we mention three. First, its dynamic
nature guarantees that at least some portions of any manuscript on the subject will
be out-of-date before it reaches the intended audience, particularly URLs that are
referenced. Second, a comprehensive coverage of all of the important topics is
impossible, because so many new ideas are constantly being proposed and are either
quickly accepted into the Internet mainstream or rejected. Finally, as with any
review paper, there is a strong bias

Authors' address: Tokyo Research Laboratory, IBM Research, 1623-14 Shimotsuruma, Yamato-shi,
Kanagawa-ken, 242-8502, Japan; email: mei_kobayashi@jp.ibm.com; kohichi_takeda@jp.ibm.com.
Permission to make digital / hard copy of part or all of this work for personal or classroom use is granted
without fee provided that the copies are not made or distributed for profit or commercial advantage, the
copyright notice, the title of the publication, and its date appear, and notice is given that copying is by
permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to
lists, requires prior specific permission and / or a fee.
2001 ACM 0360-0300/00/0600-0144 $5.00

ACM Computing Surveys, Vol. 32, No. 2, June 2000



in presenting topics closely related to the authors' background, and giving only
cursory treatment to those of which they are relatively ignorant. In an attempt to
compensate for oversights and biases, references to relevant works that describe or
review concepts in depth will be given whenever possible. This being said, we begin
with references to several excellent books that cover a variety of topics in
information management and retrieval. They include Information Retrieval and
Hypertext [Agosti and Smeaton 1996]; Modern Information Retrieval [Baeza-Yates and
Ribeiro-Neto 1999]; Text Retrieval and Filtering: Analytic Models of Performance
[Losee 1998]; Natural Language Information Retrieval [Strzalkowski 1999]; and
Managing Gigabytes [Witten et al. 1994]. Some older, classic texts, which are
slightly outdated, include Information Retrieval [Frakes and Baeza-Yates 1992];
Information Storage and Retrieval [Korfhage 1997]; Intelligent Multimedia
Information Retrieval [Maybury 1997]; Introduction to Modern Information Retrieval
[Salton and McGill 1983]; and Readings in Information Retrieval [Jones and Willett
1977].

Additional references are to special journal issues on search engines on the
Internet [Scientific American 1997]; digital libraries [CACM 1998]; digital
libraries, representation and retrieval [IEEE 1996b]; the next generation graphical
user interfaces (GUIs) [CACM 1994]; Internet technologies [CACM 1994; IEEE 1999];
and knowledge discovery [CACM 1999]. Some notable survey papers are those by
Chakrabarti and Rajagopalan [1997]; Faloutsos and Oard [1995]; Feldman [1998];
Gudivada et al. [1997]; Leighton and Srivastava [1997]; Lawrence and Giles [1998b;
1999b]; and Raghavan [1997]. Extensive, up-to-date coverage of topics in Web-based
information retrieval and knowledge management can be found in the proceedings of
several conferences, such as the International World Wide Web Conferences [WWW
Conferences 2000] and the Association for Computing Machinery's Special Interest
Group on Computer-Human Interaction [ACM SIGCHI] and Special Interest Group on
Information Retrieval [ACM SIGIR] conferences <acm.org>. Lists of papers and Web
pages that review and compare Web search tools are maintained at several sites,
including Boutell's World Wide Web FAQ <boutell.com/faq/>; Hamline University's
<web.hamline.edu/administration/libraries/search/comparisons.html>; Kuhn's pages
(in German) <gwdg.de/hkuhn1/pagesuch.html#vl>; Maire's pages (in French)
<imaginet.fr/ime/search.htm>; Princeton University's
<cs.princeton.edu/html/search.html>; U.C. Berkeley's
<sunsite.berkeley.edu/help/searchdetails.html>; and Yahoo!'s pages on search
engines <yahoo.com/computers and internet/internet/world wide web>. The historical
development of information retrieval is documented in a number of sources:
Baeza-Yates and Ribeiro-Neto [1999]; Cleverdon [1970]; Faloutsos and Oard [1995];
Salton [1970]; and van Rijsbergen [1979]. Historical accounts of the Web and Web
search technologies are given in Berners-Lee et al. [1994] and Schatz [1997].

CONTENTS

1. Introduction
   1.1 Ratings of Search Engines and their Features
   1.2 Growth of the Internet and the Web
   1.3 Evaluation of Search Engines
2. Tools for Web-Based Retrieval and Ranking
   2.1 Indexing
   2.2 Clustering
   2.3 User Interfaces
   2.4 Ranking Algorithms for Web-Based Searches
3. Future Directions
   3.1 Intelligent and Adaptive Web Services
   3.2 Information Retrieval for Internet Shopping
   3.3 Multimedia Retrieval
   3.4 Conclusions

This paper is organized as follows. In the remainder of this section, we discuss
and point to references on ratings of search engines and their features, the growth
of information available on the Internet, and the growth in users. In the second
section we present tools for Web-based information retrieval. These


include classical retrieval tools (which can be used as is or with enhancements
specifically geared for Web-based applications), as well as a new generation of
tools which have developed alongside the Internet. Challenges that must be overcome
in developing and refining new and existing technologies for the Web environment
are discussed. In the concluding section, we speculate on future directions in
research related to Web-based information retrieval which may prove to be fruitful.

1.1 Ratings of Search Engines and their Features

About 85% of Web users surveyed claim to be using search engines or some kind of
search tool to find specific information of interest. The list of publicly
accessible search engines has grown enormously in the past few years (see, e.g.,
blueangels.net), and there are now lists of top-ranked query terms available online
(see, e.g., <searchterms.com>). Since advertising revenue for search and portal
sites is strongly linked to the volume of access by the public, increasing hits
(i.e., demand for a site) is an extremely serious business issue. Undoubtedly, this
financial incentive is serving as one of the major impetuses for the tremendous
amount of research on Web-based information retrieval.

One of the keys to becoming a popular and successful search engine lies in the
development of new algorithms specifically designed for fast and accurate retrieval
of valuable information. Other features that make a search or portal site highly
competitive are unusually attractive interfaces, free email addresses, and free
access time [Chandrasekaran 1998]. Quite often, these advantages last at most a few
weeks, since competitors keep track of new developments (see, e.g., <portalhub.com>
or <traffik.com>, which gives updates and comparisons on portals). And sometimes
success can lead to unexpected consequences:

  Lycos, one of the biggest and most popular search engines, is legendary for its
  unavailability during work hours. [Webster and Paul 1996]

There are many publicly available search engines, but users are not necessarily
satisfied with the different formats for inputting queries, speeds of retrieval,
presentation formats of the retrieval results, and quality of retrieved information
[Lawrence and Giles 1998b]. In particular, speed (i.e., search engine search and
retrieval time plus communication delays) has consistently been cited as the most
commonly experienced problem with the Web in the biannual WWW surveys conducted at
the Graphics, Visualization, and Usability Center of the Georgia Institute of
Technology.¹ In the past three surveys, conducted over a period of a year and a
half, 63% to 66% of Web users were dissatisfied with the speed of retrieval and
communication delay, and the problem appears to be growing worse. Even though 48%
of the respondents in the April 1998 survey had upgraded modems in the past year,
53% of the respondents left a Web site while searching for product information
because of "slow access." "Broken links" registered as the second most frequent
problem in the same survey. Other studies also cite the number one and number two
reasons for dissatisfaction as "slow access" and "the inability to find relevant
information," respectively [Huberman and Lukose 1997; Huberman et al. 1998]. In
this paper we elaborate on some of the causes of these problems and outline some
promising new approaches being developed to resolve them.

¹ GVU's user survey (available at <gvu.gatech.edu/user surveys/>) is one of the
more reliable sources on user data. Its reports have been endorsed by the World
Wide Web Consortium (W3C) and INRIA.

It is important to remember that problems related to speed and access time may not
be resolved by considering Web-based information access and retrieval as an
isolated scientific problem. An August 1998 survey by Alexa Internet


<alexa.com/company/inthenews/webfacts.html> indicates that 90% of all Web traffic
is spread over 100,000 different hosts, with 50% of all Web traffic headed towards
the top 900 most popular sites. Effective means of managing uneven concentration of
information packets on the Internet will be needed in addition to the development
of fast access and retrieval algorithms.

The volume of information on search engines has exploded in the past year. Some
valuable resources are cited below. The University of California at Berkeley has
extensive Web pages on "how to choose the search tools you need"
<lib.berkeley.edu/teachinglib/guides/internet/toolstables.html>. In addition to
general advice on conducting searches on the Internet, the pages compare features
of several popular search engines: size, case sensitivity, ability to search for
phrases and proper names, use of Boolean logic terms, ability to require or exclude
specified terms, inclusion of multilingual features, inclusion of special feature
buttons (e.g., "more like this," "top 10 most frequently visited sites on the
subject," and "refine"), and exclusion of pages updated prior to a user-specified
date. The engines compared include Alta Vista <altavista.com>; HotBot
<hotbot.com>; Lycos Pro Power Search <lycos.com>; Excite <excite.com>; Yahoo!
<yahoo.com>; Infoseek <infoseek.com>; Disinformation <disinfo.com>; and Northern
Light <nlsearch.com>.

The work of Lidsky and Kwon [1997] is an opinionated but informative resource on
search engines. It describes 36 different search engines and rates them on specific
details of their search capabilities. For instance, in one study, searches are
divided into five categories: (1) simple searches; (2) custom searches; (3)
directory searches; (4) current news searches; and (5) Web content. The five
categories of search are evaluated in terms of power and ease of use. Variations in
ratings sometimes differ substantially for a given search engine. Similarly, query
tests are conducted according to five criteria: (1) simple queries; (2) customized
queries; (3) news queries; (4) duplicate elimination; and (5) dead link
elimination. Once again, variations in the ratings sometimes differ substantially
for a given search engine. In addition to ratings, the authors give charts on
search indexes and directories associated with twelve of the search engines, and
rate them in terms of specific features for complex searches and content. The data
indicate that as the number of people using the Internet and Web has grown, user
types have diversified and search engine providers have begun to target more
specific types of users and queries with specialized and tailored search tools.

Web Search Engine Watch <searchenginewatch.com/webmasters/features.html> posts
extensive data and ratings of popular search engines according to features such as
size, pages crawled per day, freshness, and depth. Some other useful online sources
are home pages on search engines by Gray <mit.people.edu/mkgray/net>; Information
Today <infotoday.com/searcher/jun/story2.htm>; Kansas City Public Library
<kcpl.lib.mo.us/search/srchengines.htm>; Koch
<ub2.lu.se/desire/radar/lit-about-search-services.html>; Northwestern University
Library <library.nwu.edu/resources/internet/search/evaluate.html>; and Notes of
Search Engine Showdown <imtnet/notes/search/index.html>. Data on international use
of the Web and Internet is posted at the NUA Internet Survey home page
<nua.ie/surveys>.

A note of caution when digesting the data in the paragraphs above and below:
published data on the Internet and the Web are very difficult to measure and
verify. GVU offers a solid piece of advice on the matter:

  We suggest that those interested in these (i.e., Internet/WWW statistics and
  demographics) statistics should consult several sources; these numbers can be
  difficult to measure and results may vary between different sources. [GVU's WWW
  user survey]

Although details of data from different


popular sources vary, overall trends are fairly consistently documented. We present
some survey results from some of these sources below.

1.2 Growth of the Internet and the Web

Schatz [1997] of the National Center for Supercomputing Applications (NCSA)
estimates that the number of Internet users increased from 1 million to 25 million
in the five years leading up to January of 1997. Strategy Alley [1998] gives a
number of statistics on Internet users: Matrix Information and Directory Services
(MIDS), an Internet measurement organization, estimated there were 57 million users
on the consumer Internet worldwide in April of 1998, and that the number would
increase to 377 million by 2000; Morgan Stanley gives the estimate of 150 million
in 2000; and Killen and Associates give the estimate as 250 million in 2000. Nua's
surveys <nua.ie/surveys> estimate the figure as 201 million worldwide in September
of 1999, and more specifically by region: 1.72 million in Africa; 33.61 million in
the Asia/Pacific region; 47.15 million in Europe; 0.88 million in the Middle East;
112.4 million in Canada and the U.S.; and 5.29 million in Latin America. Most data
and projections support continued tremendous growth (mostly exponential) in
Internet users, although precise numerical values differ.

Most data on the amount of information on the Internet (i.e., volume, number of
publicly accessible Web pages and hosts) show tremendous growth, and the sizes and
numbers appear to be growing at an exponential rate. Lynch has documented the
explosive growth of Internet hosts; the number of hosts has been roughly doubling
every year. For example, he estimates that it was 1.3 million in January of 1993,
2.2 million in January of 1994, 4.9 million in January of 1995, and 9.5 million in
January of 1996. His last set of data is 12.9 million in July of 1996 [Lynch 1997].
Strategy Alley [1998] cites similar figures: "Since 1982, the number of hosts has
doubled every year." And an article by the editors of the IEEE Internet Computing
Magazine states that exponential growth of Internet hosts was observed in separate
studies by several experts [IEEE 1998a], such as Mark Lottor of Network Wizards
<nw.com>; Mirjan Kühne of the RIPE Network Control Center <ripe.net> for a period
of over ten years; Samarada Weerahandi of Bellcore on his home page on Internet
hosts <ripe.net> for a period of over five years in Europe; and John Quarterman of
Matrix Information and Directory Services <mids.org>.

The number of publicly accessible pages is also growing at an aggressive pace.
Smith [1973] estimates that in January of 1997 there were 80 million public Web
pages, and that the number would subsequently double annually. Bharat and Broder
[1998] estimated that in November of 1997 the total number of Web pages was over
200 million. If both of these estimates for number of Web pages are correct, then
the rate of increase is higher than Smith's prediction, i.e., it would be more than
double per year. In a separate estimate [Monier 1998], the chief technical officer
of AltaVista estimated that the volume of publicly accessible information on the
Web has grown from 50 million pages on 100,000 sites in 1995 to 100 to 150 million
pages on 600,000 sites in June of 1997. Lawrence and Giles summarize Web statistics
published by others: 80 million pages in January of 1997 by the Internet Archive
[Cunningham 1997], 75 million pages in September of 1997 by Forrester Research Inc.
[Guglielmo 1997], Monier's estimate (mentioned above), and 175 million pages in
December 1997 by Wired Digital. Then they conducted their own experiments to
estimate the size of the Web and concluded that:

  it appears that existing estimates significantly underestimate the size of the
  Web. [Lawrence and Giles 1998b]

Follow-up studies by Lawrence and Giles [1999a] estimate that the number of
publicly indexable pages on the Web at that time was about 800 million pages (with
a total of 6 terabytes of text data) on about 3 million servers (Lawrence's home
page: <neci.nec.cim/lawrence/papers.html>). On Aug. 31, 1998, Alexa Internet
announced its estimate of 3 terabytes or 3 million megabytes for the amount of
information on the Web, with 20 million Web content areas; a content area is
defined as top-level pages of sites, individual home pages, and significant
subsections of corporate Web sites. Furthermore, they estimate a doubling of volume
every eight months.
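
The growth figures quoted in this section can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming clean exponential growth between two data points; the helper names `doubling_time_months` and `project` are ours for illustration and do not come from any of the cited studies.

```python
import math


def doubling_time_months(n0, n1, months_elapsed):
    """Doubling time (months) implied by growth from n0 to n1,
    assuming exponential growth over months_elapsed."""
    return months_elapsed * math.log(2) / math.log(n1 / n0)


def project(n0, months_elapsed, doubling_months):
    """Size after months_elapsed, starting at n0 and doubling
    every doubling_months."""
    return n0 * 2 ** (months_elapsed / doubling_months)


# Smith: 80 million pages in January 1997, doubling annually.
# Projected size ten months later (November 1997), in millions of pages.
smith_nov_1997 = project(80, 10, 12)

# Bharat and Broder: over 200 million pages in November 1997.
# Doubling time implied by going from 80 to 200 million in ten months.
implied_doubling = doubling_time_months(80, 200, 10)

# Lynch's host counts (millions): January 1993 to January 1996 (36 months).
host_doubling = doubling_time_months(1.3, 9.5, 36)

print(f"Smith's projection for Nov 1997: {smith_nov_1997:.0f} million pages")
print(f"Doubling time implied by both page estimates: {implied_doubling:.1f} months")
print(f"Doubling time implied by Lynch's host data: {host_doubling:.1f} months")
```

Under these assumptions, annual doubling from 80 million pages would give only about 143 million pages by November 1997, so the 200 million figure implies a doubling time of roughly 7.6 months, which is close to the eight-month doubling estimated by Alexa Internet. Lynch's host counts imply a doubling time of about 12.5 months, consistent with "roughly doubling every year."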
Given the enormous volume of Web pages in existence, it comes as no surprise that
Internet users are increasingly using search engines and search services to find
specific information. According to Brin and Page, the World Wide Web Worm (home
pages: <cs.colorado.edu/wwww> and <guano.cs.colorado.edu/wwww>) claims to have
handled an average of 1,500 queries a day in April 1994, and AltaVista claims to
have handled 20 million queries in November 1997. They believe that

  it is likely that top search engines will handle hundreds of millions (of
  queries) per day by the year 2000. [Brin and Page 1998]

The results of GVU's April 1998 WWW user survey indicate that about 86% of people
now find a useful Web site through search engines, and 85% find them through
hyperlinks in other Web pages; people now use search engines as much as surfing the
Web to find information.

1.3 Evaluation of Search Engines

Several different measures have been proposed to quantitatively measure the
performance of classical information retrieval systems (see, e.g., Losee [1998];
Manning and Schutze [1999]), most of which can be straightforwardly extended to
evaluate Web search engines. However, Web users may have a tendency to favor some
performance issues more strongly than traditional users of information retrieval
systems. For example, interactive response times appear to be at the top of the
list of important issues for Web users (see Section 1.1), as well as the number of
valuable sites listed in the first page of retrieved results (i.e., ranked in the
top 8, 10, or 12), so that the scroll down or next page button does not have to be
invoked to view the most valuable results.

Figure 1. Three way trade-off in search engine performance: (1) speed of
retrieval, (2) precision, and (3) recall.

Some traditional measures of information retrieval system performance are
recognized in modified form by Web users. For example, a basic model from
traditional retrieval systems recognizes a three way trade-off between the speed of
information retrieval, precision, and recall (which is illustrated in Figure 1).
This trade-off becomes increasingly difficult to balance as the number of documents
and users of a database escalate. In the context of information retrieval,
precision is defined as the ratio of relevant, retrieved documents to the number of
retrieved documents:

\[
\text{precision} = \frac{\text{number of relevant, retrieved documents}}{\text{number of retrieved documents}},
\]

and recall is defined as the proportion of relevant documents that are retrieved:

\[
\text{recall} = \frac{\text{number of relevant, retrieved documents}}{\text{total number of relevant documents}}.
\]
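
The two definitions can be made concrete with a short sketch. The code below is illustrative only: the function names, the document identifiers, and the relevance judgments are hypothetical, and returning 0.0 for an empty result set is our own convention rather than part of the definitions. A first-page variant, `precision_at_k`, is included because Web users mostly look at the top 8, 10, or 12 results.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # assumed convention for an empty result list
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k=10):
    """Precision restricted to the top k ranked results,
    e.g., the first page of a result list."""
    return precision(ranked[:k], relevant)


# Hypothetical ranked result list and relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4", "d8"]
relevant = {"d1", "d3", "d5", "d9"}

print(precision(ranked, relevant))            # 3 of the 6 retrieved are relevant
print(recall(ranked, relevant))               # 3 of the 4 relevant were retrieved
print(precision_at_k(ranked, relevant, k=2))  # 1 of the top 2 is relevant
```

Note that computing recall requires knowing the full set of relevant documents, which is rarely possible on the Web; the top-k precision needs judgments only for the documents actually shown to the user.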


Most Web users who utilize search engines are not so much interested in the
traditional measure of precision as the precision of the results displayed in the
first page of the list of retrieved documents, before a "scroll" or "next page"
command is used. Since there is little hope of actually measuring the recall rate
for each Web search engine query and retrieval job (and in many cases there may be
too many relevant pages), a Web user would tend to be more concerned about
retrieving and being able to identify only very highly valuable pages. Kleinberg
[1998] recognizes the importance of finding the most information rich, or authority
pages. Hub pages, i.e., pages that have links to many authority pages, are also
recognized as being very valuable. A Web user might substitute recall with a
modified version in which the recall is computed with respect to the set of hub and
authority pages retrieved in the top 10 or 20 ranked documents (rather than all
related pages). Details of an algorithm for retrieving authorities and hubs by
Kleinberg [1998] are given in Section 2.4 of this paper.

Hearst [1999] notes that the user interface, i.e., the quality of human-computer
interaction, should be taken into account when evaluating an information retrieval
system. Nielsen [1993] advocates the use of qualitative (rather than quantitative)
measures to evaluate information retrieval systems. In particular, user
satisfaction with the system interface as well as satisfaction with retrieved
results as a whole (rather than statistical measures) is suggested. Westera [1996]
suggests some query formats for benchmarking search engines, such as: single
keyword search; plural search capability; phrase search; Boolean search (with
proper noun); and complex Boolean. In the next section we discuss some of the
differences and similarities in classical and Internet-based search, access and
retrieval of information.

Hawking et al. [1999] discuss evaluation studies of six text retrieval conferences
(TREC) U.S. National Institute of Standards and Technology (NIST) search engines
<trec.nist.gov>. In particular, they examine answers to questions such as "Can link
information result in better rankings?" and "Do longer queries result in better
answers?"

2. TOOLS FOR WEB-BASED RETRIEVAL AND RANKING

Classical retrieval and ranking algorithms developed for isolated (and sometimes
static) databases are not necessarily suitable for Internet applications. Two of
the major differences between classical and Web-based retrieval and ranking
problems and challenges in developing solutions are the number of simultaneous
users of popular search engines and the number of documents that can be accessed
and ranked. More specifically, the number of simultaneous users of a search engine
at a given moment cannot be predicted beforehand and may overload a system. And the
number of publicly accessible documents on the Internet exceeds those numbers
associated with classical databases by several orders of magnitude. Furthermore,
the number of Internet search engine providers, Web users, and Web pages is growing
at a tremendous pace, with each average page occupying more memory space and
containing different types of multimedia information such as images, graphics,
audio, and video.

There are other properties besides the number of users and size that set classical
and Web-based retrieval problems apart. If we consider the set of all Web pages as
a gigantic database, this set is very different from a classical database with
elements that can be organized, stored, and indexed in a manner that facilitates
fast and accurate retrieval using a well-defined format for input queries. In
Web-based retrieval, determining which pages are valuable enough to index, weight,
or cluster and carrying out the tasks efficiently, while maintaining a reasonable
degree of


accuracy considering the ephemeral nature of the Web, is an enormous challenge.
Further complicating the problem is the set of appropriate input queries; the best
format for inputting the queries is not fixed or known. In this section we examine
indexing, clustering, and ranking algorithms for documents available on the Web and
user interfaces for prototype IR systems for the Web.

2.1 Indexing

The American Heritage Dictionary (1976) defines index as follows:

  (in'dex) 1. Anything that serves to guide, point out or otherwise facilitate
  reference, as: a. An alphabetized listing of names, places, and subjects included
  in a printed work that gives for each item the page on which it may be found. b.
  A series of notches cut into the edge of a book for easy access to chapters or
  other divisions. c. Any table, file, or catalogue.

Although the term is used in the same spirit in the context of retrieval and
ranking, it has a specific meaning. Some definitions proposed by experts are "The
most important of the tools for information retrieval is the index - a collection
of terms with pointers to places where information about documents can be found"
[Manber 1999]; "indexing is building a data structure that will allow quick
searching of the text" [Baeza-Yates 1999]; or "the act of assigning index terms to
documents, which are the objects to be retrieved" [Korfhage 1997]; "An index term
is a (document) word whose semantics helps in remembering the document's main
themes" [Baeza-Yates and Ribeiro-Neto 1999]. Four approaches to indexing documents
on the Web are (1) human or manual indexing; (2) automatic indexing; (3)
intelligent or agent-based indexing; and (4) metadata, RDF, and annotation-based
indexing. The first two appear in many classical texts, while the latter two are
relatively new and promising areas of study. We first give an overview of Web-based
indexing, then describe or give references to the various approaches.

Indexing Web pages to facilitate retrieval is a much more complex and challenging
problem than the corresponding one associated with classical databases. The
enormous number of existing Web pages and their rapid increase and frequent
updating make straightforward indexing, whether by human or computer-assisted
means, a seemingly impossible, Sisyphean task. Indeed, most experts agree that, at
a given moment, a significant portion of the Web is not recorded by the indexer of
any search engine. Lawrence and Giles estimated that, in April 1997, the lower
bound on indexable Web pages was 320 million, and a given individual search engine
will have indexed between 3% and 34% of the possible total [Lawrence and Giles
1998b]. They also estimated that the extent of overlap among the top six search
engines is small and their collective coverage was only around 60%; the six search
engines are HotBot, AltaVista, Northern Light, Excite, Infoseek, and Lycos. A
follow-up study for the period February 2-28, 1999, involving the top 11 search
engines (the six above plus Snap <snap.com>; Microsoft <msn.com>; Google
<google.com>; Yahoo!; and Euroseek <euroseek.com>) indicates that we are losing the
indexing race. A far smaller proportion of the Web is now indexed, with no engine
covering more than 16% of the Web. Indexing appears to have become more important
than ever, since 83% of sites contained commercial content and 6% contained
scientific or educational content [Lawrence and Giles 1999a].

Bharat and Broder estimated in November 1997 that the numbers of pages indexed by
HotBot, AltaVista, Excite, and Infoseek were 77 million, 100 million, 32 million,
and 17 million, respectively. Furthermore, they believe that the union of these
pages is around 160 million pages, i.e., about 80% of the 200 million total
accessible pages they believe


existed at that time. Their studies indicate that there is little overlap in the
indexing coverage; more specifically, less than 1.4% (i.e., 2.2 million) of the 160
million indexed pages were covered by all four of the search engines. Melee's
Indexing Coverage Analysis (MICA) Reports <melee.com/mica/index.html> provide a
weekly update on indexing coverage and quality by a few, select, search engines
that claim to index "at least one fifth of the Web." Other studies on estimating
the extent of Web pages that have been indexed by popular search engines include
Baldonado and Winograd [1997]; Hernandez [1996]; Hernandez and Stolfo [1995];
Hylton [1996]; Monge and Elkan [1998]; Selberg and Etzioni [1995a]; and
Silberschatz et al. [1995].

In addition to the sheer volume of documents to be processed, indexers must take
into account other complex issues. For example, Web pages are not constructed in a
fixed format; the textual data is riddled with an unusually high percentage of
typos; the contents usually contain nontextual multimedia data; and updates to the
pages are made at different rates. For instance, preliminary studies documented in
Navarro [1998] indicate that on the average site 1 in 200 common words and 1 in 3
foreign surnames are misspelled. Brake [1997] estimates that the average page of
text remains unchanged on the Web for about 75 days, and Kahle estimates that 40%
of the Web changes every month. Multiple copies of identical or near-identical
pages are abundant; for example, FAQs² postings, mirror sites, old and updated
versions of news, and newspaper sites. Broder et al. [1997] and Shivakumar and
García-Molina [1998] estimate that 30% of Web pages are duplicates or
near-duplicates. Tools for removing redundant URLs or URLs of near and perfectly
identical sites have been investigated by Baldonado and Winograd [1997]; Hernandez
[1996]; Hernandez and Stolfo [1995]; Hylton [1996]; Monge and Elkan [1998]; Selberg
and Etzioni [1995a]; and Silberschatz et al. [1995].

Henzinger et al. [1999] suggested a method for evaluating the quality of pages in a
search engine's index. In the past, the volume of pages indexed was used as the
primary measurement of Web page indexers. Henzinger et al. suggest that the quality
of the pages in a search engine's index should also be considered, especially since
it has become clear that no search engine can index all documents on the Web, and
there is very little overlap between the indexed pages of major search engines. The
idea of Henzinger's method is to evaluate the quality of a Web page according to
its indegree (an evaluation measure based on how many other pages point to the Web
page under consideration [Carriere and Kazman 1997]) and PageRank (an evaluation
measure based on how many other pages point to the Web page under consideration, as
well as the value of the pages pointing to it [Brin and Page 1998; Cho et al.
1998]).

The development of effective indexing tools to aid in filtering is another major
class of problems associated with Web-based search and retrieval. Removal of
spurious information is a particularly challenging problem, since a popular
information site (e.g., newsgroup discussions, FAQ postings) will have little value
to users with no interest in the topic. Filtering to block pornographic materials
from children or for censorship of culturally offensive materials is another
important area for research and
2
FAQs, or frequently asked questions, are essays business devlopment. One of the prom-
on topics on a wide range of interests, with point- ising new approaches is the use of meta-
ers and references. For an extensive list of FAQs,
see data, i.e., summaries of Web page con-
,cis.ohio-state.edu/hypertext/faq/usenet/faq-list. tent or sites placed in the page for
html. and ,faq.org.. aiding automatic indexers.
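
The two page-quality measures mentioned above, indegree and PageRank, can be illustrated on a toy link graph. The following is only a minimal sketch and not the evaluation procedure of Henzinger et al.: the four-page graph `toy_web`, the function names, the damping factor of 0.85, the fixed number of iterations, and the even redistribution of rank from pages without outlinks are all assumptions made for the example.

```python
def indegree(links):
    """Count how many distinct pages point to each page.

    `links` maps a page name to the list of pages it links to
    (a hypothetical in-memory stand-in for a crawled link graph).
    """
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):          # count each source once per target
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Simple power-iteration PageRank; a page's score depends on the
    scores of the pages that point to it, not just on how many there are."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = set(links.get(page, ()))
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: three pages point to "c", nothing points to "d".
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
in_counts = indegree(toy_web)
scores = pagerank(toy_web)
```

In this toy graph, pages "a" and "b" have the same indegree, but "a" receives a higher PageRank because its single in-link comes from the highly ranked page "c".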


2.1.1 Classical Methods. Manual indexing is currently used by several commercial,
Web-based search engines, e.g., Galaxy <galaxy.einet.net>; GNN: Whole Internet
Catalog <e1c.gnn.com/gnn/wic/index.html>; Infomine <lib-www.ucr.edu>; KidsClick!
<sunsite.berkeley.edu/kidsclick!/>; LookSmart <looksmart.com>; Subject Tree
<bubl.bath.ac.uk/bubl/cattree.html>; Web Developer's Virtual Library <stars.com>;
World-Wide Web Virtual Library Series Subject Catalog
<w3.org/hypertext/datasources/bysubject/overview.html>; and Yahoo!. The practice is
unlikely to remain as successful over the next few years: as the volume of
information available over the Internet increases at an ever greater pace, manual
indexing is likely to become obsolete over the long term. Another major drawback
with manual indexing is the lack of consistency among different professional
indexers; as few as 20% of the terms to be indexed may be handled in the same
manner by different individuals [Korfhage 1997, p. 107], and there is noticeable
inconsistency, even by a given individual [Borko 1979; Cooper 1969; Jacoby and
Slamecka 1962; Macskassy et al. 1998; Preschel 1972; and Salton 1969].

Though not perfect, human indexing is currently more accurate than most automatic
indexing because experts on popular subjects organize and compile the directories
and indexes in a way which (they believe) facilitates the search process. Notable
references on conventional indexing methods, including automatic indexers, are
Part IV of Soergel [1985]; Jones and Willett [1977]; van Rijsbergen [1977]; and
Witten et al. [1994, Chap. 3]. Technological advances are expected to narrow the
gap in indexing quality between human and machine-generated indexes. In the future,
human indexing will only be applied to relatively small and static (or near static)
or highly specialized databases, e.g., internal corporate Web pages.

2.1.2 Crawlers/Robots. Scientists have recently been investigating the use of
intelligent agents for performing specific tasks, such as indexing on the Web [AI
Magazine 1997; Baeza-Yates and Ribeiro-Neto 1999]. There is some ambiguity
concerning proper terminology to describe these agents. They are most commonly
referred to as crawlers, but are also known as ants, automatic indexers, bots,
spiders, Web robots (Web robot FAQ
<info.webcrawler.com/mak/projects/robots/faq.html>), and worms. It appears that some
of the terms were proposed by the inventors of a specific tool, and their
subsequent use spread to more general applications of the same genre.

Many search engines rely on automatically generated indices, either by themselves
or in combination with other technologies, e.g., Aliweb
<nexor.co.uk/public/aliweb/aliweb.html>; AltaVista; Excite; Harvest
<harvest.transarc.com>; HotBot; Infoseek; Lycos; Magellan <magellan.com>; MerzScope
<merzcom.com>; Northern Light; Smart Spider <engsoftware.com>; Webcrawler
<webcrawler.com/>; and World Wide Web Worm. Although most of Yahoo!'s entries are
indexed by humans or acquired through submissions, it uses a robot to a limited
extent to look for new announcements. Examples of highly specialized crawlers
include Argos <argos.evansville.edu> for Web sites on the ancient and medieval
worlds; CACTVS Chemistry Spider
<schiele.organik.uni-erlangen.de/cactvs/spider.html> for chemical databases;
MathSearch <maths.usyd.edu.au:8000/mathsearch.html> for English mathematics and
statistics documents; NEC-MeshExplorer <netplaza.biglobe.or.jp/keyword.html> for
the NETPLAZA search service owned by the NEC Corporation; and Social Science
Information Gateway (SOSIG) <scout.cs.wisc.edu/scout/mirrors/sosig> for resources in
the social sciences. Crawlers that index documents in limited environments include
LookSmart <looksmart.com/> for a 300,000 site database of rated and reviewed sites;
Robbie


the Robot, funded by DARPA for education and training purposes; and UCSD Crawl
<www.mib.org/ ucsdcrawl> for UCSD pages. More extensive lists of intelligent agents
are available on The Web Robots Page
<info.webcrawler.com/mak/projects/robots/active/html/type.html>; and on Washington
State University's robot pages <wsulibs.wsu.edu/general/robots.htm>.

To date, there are three major problems associated with the use of robots: (1)
some people fear that these agents are too invasive; (2) robots can overload
system servers and cause systems to be virtually frozen; and (3) some sites are
updated at least several times per day, e.g., approximately every 20 minutes by
CNN <cnn.com> and Bloomberg <bloomberg.com>, and every few hours by many newspaper
sites [Carl 1995] (article home page
<info.webcrawler.com/mak/projects/robots/threat-or-treat.html>); [Koster 1995]. Some
Web sites deliberately keep out spiders; for example, the New York Times
<nytimes.com> requires users to pay and fill out a registration form; CNN used to
exclude search spiders to prevent distortion of data on the number of users who
visit the site; and the online catalogue of the British Library <portico.bl.uk>
only allows access to users who have filled out an online query form [Brake 1997].
System managers of these sites must keep up with the new spider and robot
technologies in order to develop their own tools to protect their sites from new
types of agents that intentionally or unintentionally could cause mayhem.

As a working compromise, Koster has proposed a robots exclusion standard (A
standard for robots exclusion, ver. 1:
<info.webcrawler.com/mak/projects/robots/exclusion.html>; ver. 2:
<info.webcrawler.com/mak/projects/robots/norobot.html>), which advocates blocking
certain types of searches to relieve overload problems. He has also proposed
guidelines for robot design (Guidelines for robot writers (1993)
<info.webcrawler.com/mak/projects/robots/guidelines.html>). It is important to note
that robots are not always the root cause of network overload; sometimes human
user overload causes problems, which is what happened at the CNN site just after
the announcement of the O.J. Simpson trial verdict [Carl 1995]. Use of the
exclusion standard is strictly voluntary, so that Web masters have no guarantee
that robots will not be able to enter computer systems and create havoc. Arguments
in support of the exclusion standard and discussion on its effectiveness are given
in Carl [1995] and Koster [1996].

2.1.3 Metadata, RDF, and Annotations.

    What is metadata? The Macquarie dictionary defines the prefix "meta-" as
    meaning "among," "together with," "after" or "behind." That suggests the idea
    of a "fellow traveller": that metadata is not fully fledged data, but it is a
    kind of fellow-traveller with data, supporting it from the sidelines. My
    definition is that "an element of metadata describes an information resource
    or helps provide access to an information resource." [Cathro 1997]

In the context of Web pages on the Internet, the term "metadata" usually refers to
an invisible file attached to a Web page that facilitates collection of
information by automatic indexers; the file is invisible in the sense that it has
no effect on the visual appearance of the page when viewed using a standard Web
browser.

The World Wide Web (W3) Consortium <w3.org> has compiled a list of resources on
information and standardization proposals for metadata (W3 metadata page
<w3.org/metadata>). A number of metadata standards have been proposed for Web
pages. Among them, two well-publicized, solid efforts are the Dublin Core Metadata
standard: home page <purl.oclc.org/metadata/dublin core> and the Warwick framework:
article home page <dlib.org/dlib/july96/lagoze/07lagoze.html> [Lagoze 1996]. The
Dublin Core is a 15-element metadata element set proposed to facilitate fast and
accurate information retrieval on the Internet. The elements are title; creator;
subject; description;


publisher; contributors; date; resource type; format; resource identifier; source;
language; relation; coverage; and rights. The group has also developed methods for
incorporating the metadata into a Web page file. Other resources on metadata
include Chapter 6 of Baeza-Yates and Ribeiro-Neto [1999] and Marchionini [1999]. If
the general public adopts and increases use of a simple metadata standard (such as
the Dublin Core), the precision of information retrieved by search engines is
expected to improve substantially. However, widespread adoption of a standard by
international users is dubious.

One of the major drawbacks of the simplest type of metadata for labeling HTML
documents, called metatags, is that they can only be used to describe the contents
of the document to which they are attached, so that managing collections of
documents (e.g., directories or those on similar topics) may be tedious when
updates to the entire collection are made. Since a single command cannot be used
to update the entire collection at once, documents must be updated one-by-one.
Another problem arises when documents from two or more different collections are
merged to form a new collection: inconsistent use of metatags may lead to
confusion, since a metatag might be used in different collections with entirely
different meanings. To resolve these issues, the W3 Consortium proposed in May 1999
that the Resource Description Framework (RDF) be used as the metadata coding scheme
for Web documents (W3 Consortium RDF homepage <w3.org/rdf>). An interesting
associated development is IBM's XCentral <ibm.com/developer/xml>, the first search
engine that indexes XML and RDF elements.

Metadata places the responsibility of aiding indexers on the Web page author,
which is reasonable if the author is a responsible person wishing to advertise the
presence of a page to increase legitimate traffic to a site. Unfortunately, not all
Web page authors are fair players. Many unfair players maintain sites that earn
more advertising revenue when the number of visitors is very high, or that charge
a fee per visit for access to pornographic, violent, and culturally offensive
materials. These sites can attract a large volume of visitors by attaching metadata
with many popular keywords. Development of reliable filtering services for parents
concerned about their children's surfing venues is a serious and challenging
problem.

Spamming, i.e., excessive, repeated use of key words or hidden text purposely
inserted into a Web page to promote retrieval by search engines, is related to,
but separate from, the unethical or deceptive use of metadata. Spamming is a new
phenomenon that appeared with the introduction of search engines, automatic
indexers, and filters on the Web [Flynn 1996; Liberatore 1997]. Its primary intent
is to outsmart these automated software systems for a variety of purposes; spamming
has been used as an advertising tool by entrepreneurs, cult recruiters, egocentric
Web page authors wanting attention, and technically well-versed, but unbalanced,
individuals who have the same sort of warped mentality as inventors of computer
viruses. A famous example of hidden text spamming, a technique known as font color
spamming [Liberatore 1997], is the embedding of words in a black background by the
Heaven's Gate cult. Although the cult no longer exists, its home page is archived
at the sunspot.net site <sunspot.net/news/special/heavensgatesite>. We note that
the term "spamming" has a broader meaning, related to receiving an excessive amount
of email or information. An excellent, broad overview of the subject is given in
Cranor and LaMacchia [1998]. In our context, the specialized terms spam-indexing,
spam-dexing, or keyword spamming are more precise.

Another tool related to metadata is annotation. Unlike metadata, which is created
and attached to Web documents by the author for the specific purpose of


aiding indexing, annotations include a much broader class of data to be attached
to a Web document [Nagao and Hasida 1998; Nagao et al. 1999]. Three examples of the
most common annotations are linguistic annotation, commentary (created by persons
other than the author), and multimedia annotation. Linguistic annotation is being
used for automatic summarization and content-based retrieval. Commentary annotation
is used to annotate nontextual multimedia data, such as image and sound data plus
some supplementary information. Multimedia annotation generally refers to text
data, which describes the contents of video data (which may be downloadable from
the Web page). An interesting example of annotation is the attachment of comments
on Web documents by people other than the document author. In addition to aiding
indexing and retrieval, this kind of annotation may be helpful for evaluating
documents.

Despite the promise that metadata and annotation could facilitate fast and
accurate search and retrieval, a recent study for the period February 2-28, 1999
indicates that metatags are only used on 34% of homepages, and only 0.3% of sites
use the Dublin Core metadata standard [Lawrence and Giles 1999a]. Unless a new
trend towards the use of metadata and annotations develops, their usefulness in
information retrieval may be limited to very large, closed data owned by large
corporations, public institutions, and governments that choose to use them.

2.2 Clustering

Grouping similar documents together to expedite information retrieval is known as
clustering [Anick and Vaithyanathan 1997; Rasmussen 1992; Sneath and Sokal 1973;
Willett 1988]. During the information retrieval and ranking process, two classes
of similarity measures must be considered: the similarity of a document and a
query and the similarity of two documents in a database. The similarity of two
documents is important for identifying groups of documents in a database that can
be retrieved and processed together for a given type of user input query.

Several important points should be considered in the development and
implementation of algorithms for clustering documents in very large databases.
These include identifying relevant attributes of documents and determining
appropriate weights for each attribute; selecting an appropriate clustering method
and similarity measure; estimating limitations on computational and memory
resources; evaluating the reliability and speed of the retrieved results;
facilitating changes or updates in the database, taking into account the rate and
extent of the changes; and selecting an appropriate search algorithm for retrieval
and ranking. This final point is of particularly great concern for Web-based
searches.

There are two main categories of clustering: hierarchical and nonhierarchical.
Hierarchical methods show greater promise for enhancing Internet search and
retrieval systems. Although details of clustering algorithms used by major search
engines are not publicly available, some general approaches are known. For
instance, Digital Equipment Corporation's Web search engine AltaVista is based on
clustering. Anick and Vaithyanathan [1997] explore how to combine results from
latent semantic indexing (see Section 2.4) and analysis of phrases for
context-based information retrieval on the Web.

Zamir et al. [1997] developed three clustering methods for Web documents. In the
word-intersection clustering method, words that are shared by documents are used
to produce clusters. The method runs in $O(n^2)$ time and produces good results for
Web documents. A second method, phrase-intersection clustering, runs in
$O(n \log n)$ time and is at least two orders of magnitude faster than methods that
produce comparable clusters. A third method, called suffix tree clustering, is
detailed in Zamir and Etzioni [1998].
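
The idea of forming clusters from words that documents share can be illustrated with a short sketch. This is a simplified, hypothetical illustration and not the word-intersection algorithm of Zamir et al.: the tokenizer, the `min_shared` threshold, and the single-link merging rule are assumptions chosen for brevity. It does, however, show where the quadratic cost comes from, since every pair of the n documents is compared once.

```python
import re
from itertools import combinations


def word_set(text):
    """Lowercase a text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def cluster_by_shared_words(docs, min_shared=2):
    """Group documents that share at least `min_shared` words.

    All n*(n-1)/2 pairs are compared, so the number of comparisons grows
    quadratically with the number of documents. Documents are merged with a
    small union-find structure (single-link merging, an assumption).
    """
    words = [word_set(doc) for doc in docs]
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if len(words[i] & words[j]) >= min_shared:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical snippets standing in for retrieved Web documents.
snippets = [
    "suffix tree clustering of web documents",
    "clustering web search results",
    "robots exclusion standard for crawlers",
    "crawlers and the robots standard",
]
groups = cluster_by_shared_words(snippets)
```

With these four snippets the first two documents share the words "clustering" and "web", and the last two share "robots", "standard", and "crawlers", so two clusters of two documents each are produced.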


Modha and Spangler [2000] developed a clustering method for hypertext documents,
which uses words contained in the document, outlinks from the document, and
in-links to the document. Clustering is based on six information nuggets, which the
authors dubbed summary, breakthrough, review, keywords, citation, and reference.
The first two are derived from the words in the document, the next two from the
outlinks, and the last two from the in-links.

Several new approaches to clustering documents in data mining applications have
recently been developed. Since these methods were specifically designed for
processing very large data sets, they may be applicable with some modifications to
Web-based information retrieval systems. Examples of some of these techniques are
given in Agrawal et al. [1998]; Dhillon and Modha [1999; 2000]; Ester et al.
[1995a; 1995b; 1995c]; Fisher [1995]; Guha et al. [1998]; Ng and Han [1994]; and
Zhang et al. [1996]. For very large databases, appropriate parallel algorithms can
speed up computation [Omiecinski and Scheuermann 1990].

Finally, we note that clustering is just one of several ways of organizing
documents to facilitate retrieval from large databases. Some alternative methods
are discussed in Frakes and Baeza-Yates [1992]. Specific examples of some methods
designed specifically for facilitating Web-based information retrieval are
evaluation of significance, reliability, and topics covered in a set of Web pages
based on analysis of the hyperlink structures connecting the pages (see Section
2.4); and identification of cyber communities with expertise in subject(s) based
on user access frequency and surfing patterns.

2.3 User Interfaces

Currently, most Web search engines are text-based. They display results from input
queries as long lists of pointers, sometimes with and sometimes without summaries
of retrieved pages. Future commercial systems are likely to take advantage of
small, powerful computers, and will probably have a variety of mechanisms for
querying nontextual data (e.g., hand-drawn sketches, textures and colors, and
speech) and better user interfaces to enable users to visually manipulate
retrieved information [Card et al. 1999; Hearst 1997; Maybury and Wahlster 1998;
Rao et al. 1993; Tufte 1983]. Hearst [1999] surveys visualization interfaces for
information retrieval systems, with particular emphasis on Web-based systems. A
sampling of exploratory work being conducted in this area is described below.
These interfaces and their display systems, which are known under several
different names (e.g., dynamic querying, information outlining, visual information
seeking), are being developed at universities, government, and private research
labs, and small venture companies worldwide.

2.3.1 Metasearch Navigators. A very simple tool developed to exploit the best
features of many search engines is the metasearch navigator. These navigators
allow simultaneous search of a set of other navigators. Two of the most extensive
are Search.com <search.com/>, which can utilize the power of over 250 search
engines, and INFOMINE <lib-www.ucr.edu/enbinfo.html>, which utilizes over 90.
Advanced metasearch navigators have a single input interface that sends queries to
all (or only user-selected) search engines, eliminates duplicates, and then
combines and ranks returned results from the different search engines. Some fairly
simple examples available on the Web are 2ask
<web.gazeta.pl/miki/search/2ask-anim.html>; ALL-IN-ONE <albany.net/allinone/>;
EZ-Find at The River <theriver.com/theRiver/explore/ezfind.html>; IBM InfoMarket
Service <infomkt.ibm.com/>; Inference Find <inference.com/infind/>; Internet Sleuth
<intbc.com/sleuth>; MetaCrawler <metacrawler.cs.washington.edu:8080/>; and
SavvySearch <cs.colostat.edu/dreiling/smartform.html> and
<guaraldi.cs.colostate.edu:2000/> [Howe and Dreilinger 1997].
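
The three steps performed by an advanced metasearch navigator (collect results from several engines, eliminate duplicates, then combine and rank) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any navigator named above: the engine names and result lists are hypothetical, no network requests are made, duplicates are detected only by a crude URL normalization, and the reciprocal-rank scoring rule is an assumption chosen for simplicity.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a crude canonical key (host without 'www.' plus path)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_results(results_by_engine):
    """Combine ranked result lists from several engines into one ranking.

    Each engine contributes 1/position for a page (assumed scoring rule), so
    pages returned near the top by several engines rise to the top overall.
    Ties are broken alphabetically to keep the output deterministic.
    """
    scores = defaultdict(float)
    for ranked_urls in results_by_engine.values():
        seen = set()
        for position, url in enumerate(ranked_urls, start=1):
            key = normalize(url)
            if key in seen:            # duplicate within one engine's list
                continue
            seen.add(key)
            scores[key] += 1.0 / position
    return sorted(scores, key=lambda key: (-scores[key], key))


# Hypothetical result lists; the engine names and URLs are placeholders.
engine_results = {
    "engine_a": ["http://www.example.org/ir/", "example.com/search"],
    "engine_b": ["example.org/ir", "http://example.net/web"],
}
merged = merge_results(engine_results)
```

Here the page returned first by both hypothetical engines is recognized as a single entry despite the different spellings of its URL, and it is ranked ahead of the pages returned by only one engine.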


2.3.2 Web-Based Information Outlining/Visualization. Visualization tools
specifically designed to help users understand websites (e.g., their directory
structures, types of information available) are being developed by many private
and public research centers [Nielsen 1997]. Overviews of some of these tools are
given in Ahlberg and Shneiderman [1994]; Beaudoin et al. [1996]; Bederson and
Hollan [1994]; Gloor and Dynes [1998]; Lamping et al. [1995]; Liechti et al.
[1998]; Maarek et al. [1997]; Munzner and Burchard [1995]; Robertson et al. [1991];
and Tetranet Software Inc. [1998] <tetranetsoftware.com>. Below we present some
examples of interfaces designed to facilitate general information retrieval
systems; we then present some that were specifically designed to aid retrieval on
the Web.

Shneiderman [1994] introduced the term dynamic queries to describe interactive
user control of visual query parameters that generate a rapid, updated, animated
visual display of database search results. Some applications of the dynamic query
concept are systems that allow real estate brokers and their clients to locate
homes based on price, number of bedrooms, distance from work, etc. [Williamson and
Shneiderman 1992]; locate geographical regions with cancer rates above the
national average [Plaisant 1994]; allow dynamic querying of a chemistry table
[Ahlberg and Shneiderman 1997]; and provide an interface that enables users to
explore UNIX directories through dynamic queries [Liao et al. 1992]. Visual
presentation of query components; visual presentation of results; rapid,
incremental, and reversible actions; selection by pointing (not typing); and
immediate and continuous feedback are features of these systems. Most graphics
hardware systems in the mid-1990s were still too weak to provide adequate
real-time interaction, but faster algorithms and advances in hardware should
increase system speed in the future.

Williams [1984] developed a user interface for information retrieval systems to
aid users in formulating a query. The system, RABBIT III, supports interactive
refinement of queries by allowing users to critique retrieved results with labels
such as "require" and "prohibit." Williams claims that this system is particularly
helpful to naive users "with only a vague idea of what they want and therefore
need to be guided in the formulation/reformulation of their queries . . . (or) who
have limited knowledge of a given database or who must deal with a multitude of
databases."

Hearst [1995] and Hearst and Pederson [1996] developed a visualization system for
displaying information about a document and its contents, e.g., its length,
frequency of term sets, and distribution of term sets within the document and to
each other. The system, called TileBars, displays information about a document in
the form of a two-dimensional rectangular bar with even-sized tiles lying next to
each other in an orderly fashion. Each tile represents some feature of the
document; the information is encoded as a number whose magnitude is represented in
grayscale.

Cutting et al. [1993] developed a system called Scatter/Gather to allow users to
cluster documents interactively, browse the results, select a subset of the
clusters, and cluster this subset of documents. This process allows users to
iteratively refine their search. BEAD [Chalmers and Chitson 1992]; Galaxy of News
[Rennison 1994]; and ThemeScapes [Wise et al. 1995] are some of the other systems
that show graphical displays of clustering results.

Baldonado [1997] and Baldonado and Winograd [1997] developed an interface for
exploring information on the Web across heterogeneous sources, e.g., search
services such as Alta Vista, bibliographic search services such as Dialog, a map
search service and a video search service. The system, called SenseMaker, can
"bundle" (i.e., cluster) similar types of retrieved data according to user
specified "bundling criteria" (the


criteria must be selected from a fixed menu provided by SenseMaker). Examples of
available bundling criteria for a URL type include (1) bundling results whose URLs
refer to the same site; (2) bundling results whose URLs refer to the same
collection at a site; and (3) not bundling at all. The system allows users to
select from several criteria to view retrieved results, e.g., according to the
URL, and also allows users to select from several criteria on how duplicates in
retrieved information will be eliminated. Efficient detection and elimination of
duplicate database records and duplicate retrievals by search engines, which are
very similar but not necessarily identical, have been investigated extensively by
many scientists, e.g., Hernandez [1996]; Hernandez and Stolfo [1995]; Hylton
[1996]; Monge and Elkan [1998]; and Silberschatz et al. [1995].

Card et al. [1996] developed two 3D virtual interface tools, WebBook and
WebForager, for browsing and recording Web pages. Kobayashi et al. [1999] developed
a system to compare how relevance rankings of documents differ when queries are
changed. The parallel ranking system can be used in a variety of applications,
e.g., query refinement and understanding the contents of a database from different
perspectives (each query represents a different user perspective). Manber et al.
[1997] developed WebGlimpse, a tool for simultaneous searching and browsing of Web
pages, which is based on the Glimpse search engine.

Morohashi et al. [1995] and Takeda and Nomiyama [1997] developed a system that
uses new technologies to organize and display, in an easily discernible form, a
massive set of data. The system, called "information outlining," extracts and
analyzes a variety of features of the data set and interactively visualizes these
features through corresponding multiple, graphical viewers. Interactions with
multiple viewers facilitate reducing candidate results, profiling information,
and discovering new facts. Sakairi [1999] developed a site map for visualizing a
Web site's structure and keywords.

2.3.3 Acoustical Interfaces. Web-based IR contributes to the acceleration of
studies on and development of more user-friendly, nonvisual, input-output
interfaces. Some examples of research projects are given in a special journal
issue on the topic "the next generation graphics user interfaces (GUIs)" [CACM
1993]. An article in Business Week [1977] discusses user preference for
speech-based interfaces, i.e., spoken input (which relies on speech recognition
technologies) and spoken output (which relies on text-to-speech and speech
synthesis technologies).

One response to this preference by Asakawa [1996] is a method to enable the
visually impaired to access and use the Web interactively, even when Japanese and
English appear on a page (IBM Homepage on Systems for the Disabled
<trl.ibm.co.jp/projects/s7260/sysde.htm>). The basic idea is to identify different
languages (e.g., English, Japanese) and different text types (e.g., title and
section headers, regular text, hot buttons) and then assign persons with easily
distinguishable voices (e.g., male, female) to read each of the different types of
text. More recently, the method has been extended to enable the visually impaired
to access tables in HTML [Oogane and Asakawa 1998].

Another solution, developed by Raman [1996], is a system that enables visually
impaired users to surf the Web interactively. The system, called Emacspeak, is much
more sophisticated than screen readers. It reveals the structure of a document
(e.g., tables or calendars) in addition to reading the text aloud.

A third acoustic-based approach for Web browsing is being investigated by Mereu
and Kazman [1996]. They examined how sound environments can be used for navigation
and found that sighted users prefer musical environments to enhance conventional
means of navigation, while the visually impaired prefer the use of tones. The
components


of all of the systems described above can be modified for more general
systems (i.e., not necessarily for the visually impaired) which require an
audio/speech-based interface.

2.4 Ranking Algorithms for Web-Based Searches

A variety of techniques have been developed for ranking retrieved documents
for a given input query. In this section we give references to some classical
techniques that can be modified for use by Web search engines [Baeza-Yates
and Ribeiro-Neto 1999; Berry and Browne 1999; Frakes and Baeza-Yates 1992].
Techniques developed specifically for the Web are also presented.

Detailed information regarding ranking algorithms used by major search
engines is not publicly available; however, it seems that most use term
weighting or variations thereof or vector space models [Baeza-Yates and
Ribeiro-Neto 1999]. In vector space models, each document (in the database
under consideration) is modeled by a vector, each coordinate of which
represents an attribute of the document [Salton 1971]. Ideally, only those
attributes that can help to distinguish documents are incorporated in the
attribute space. In a Boolean model, each coordinate of the vector is zero
(when the corresponding attribute is absent) or unity (when the corresponding
attribute is present). Many refinements of the Boolean model exist. The most
commonly used are term-weighting models, which take into account the
frequency of appearance of an attribute (e.g., keyword) or location of
appearance (e.g., keyword in the title, section header, or abstract). In the
simplest retrieval and ranking systems, each query is also modeled by a
vector in the same manner as the documents. The ranking of a document with
respect to a query is determined by its distance to the query vector. A
frequently used yardstick is the angle defined by a query and document vector
(footnote 3). Ranking a document is based on computation of the angle defined
by the query and document vector. It is impractical for very large databases.

3 The cosine of the angle between two vectors is determined by computing the
dot product and dividing by the product of the $l_2$-norms of the vectors.

One of the more widely used vector space model-based algorithms for reducing
the dimension of the document ranking problem is latent semantic indexing
(LSI) [Deerwester et al. 1990]. LSI reduces the retrieval and ranking problem
to one of significantly lower dimensions, so that retrieval from very large
databases can be performed in real time. Although a variety of algorithms
based on document vector models for clustering to expedite retrieval and
ranking have been proposed, LSI is one of the few that successfully takes
into account synonymy and polysemy. Synonymy refers to the existence of
equivalent or similar terms, which can be used to express an idea or object
in most languages, and polysemy refers to the fact that some words have
multiple, unrelated meanings. Absence of accounting for synonymy will lead to
many small, disjoint clusters, some of which should actually be clustered
together, while absence of accounting for polysemy can lead to clustering
together of unrelated documents.

In LSI, documents are modeled by vectors in the same way as Salton's vector
space model. We represent the relationship between the attributes and
documents by an $m$-by-$n$ (rectangular) matrix $A$, with $ij$-th entry
$a_{ij}$, i.e.,

$$A = [a_{ij}].$$

The column vectors of $A$ represent the documents in the database. Next, we
compute the singular value decomposition (SVD) of $A$, then construct a
modified matrix $A_k$, from the $k$ largest singular


values $\sigma_i$, $i = 1, 2, \ldots, k$, and their corresponding vectors,
i.e.,

$$A_k = U_k \Sigma_k V_k^T.$$

$\Sigma_k$ is a diagonal matrix with monotonically decreasing diagonal
elements $\sigma_i$. $U_k$ and $V_k$ are matrices whose columns are the left
and right singular vectors of the $k$ largest singular values of $A$
(footnote 4).

4 For details on implementation of the SVD algorithm, see Demmel [1997];
Golub and Loan [1996]; and Parlett [1998].

Processing the query takes place in two steps: projection followed by
matching. In the projection step, input queries are mapped to pseudodocuments
in the reduced query-document space by the matrix $U_k$, then weighted by the
corresponding singular values $\sigma_i$ from the reduced rank singular
matrix $\Sigma_k$. The process can be described mathematically as

$$q \rightarrow \hat{q} = q^T U_k \Sigma_k^{-1},$$

where $q$ represents the original query vector; $\hat{q}$ the pseudodocument;
$q^T$ the transpose of $q$; and $(\cdot)^{-1}$ the inverse operator. In the
second step, similarities between the pseudodocument $\hat{q}$ and documents
in the reduced term document space $V_k^T$ are computed using any one of many
similarity measures, such as angles defined by each document and query
vector; see Anderberg [1973] or Salton [1989]. Notable reviews of linear
algebra techniques, including LSI and its applications to information
retrieval, are Berry et al. [1995] and Letsche and Berry [1997].

Statistical approaches used in natural language modeling and IR can probably
be extended for use by Web search engines. These approaches are reviewed in
Crestani et al. [1998] and Manning and Schutze [1999].

Several scientists have proposed information retrieval algorithms based on
analysis of hyperlink structures for use on the Web [Botafogo et al. 1992;
Carriere and Kazman 1997; Chakrabarti et al. 1988; Chakrabarti et al. 1998;
Frisse 1988; Kleinberg 1998; Pirolli et al. 1996; and Rivlin et al. 1994].

A simple means to measure the quality of a Web page, proposed by Carriere and
Kazman [1997], is to count the number of pages with pointers to the page, and
is used in the WebQuery system and the Rankdex search engine
<rankdex.gari.com>. Google, which currently indexes about 85 million Web
pages, is another search engine that uses link information. Its rankings are
based, in part, on the number of other pages with pointers to the page. This
policy seems to slightly favor educational and government sites over
commercial ones. In November 1999, Northern Light introduced a new ranking
system, which is also based, in part, on link data (Search Engine Briefs
<searchenginewatch.com/sereport/99/11-briefs.html>).

The hyperlink structures are used to rank retrieved pages, and can also be
used for clustering relevant pages on different topics. This concept of
coreferencing as a means of discovering so-called communities of good works
was originally introduced in non-Internet-based studies on cocitations by
Small [1973] and White and McCain [1989]. Kleinberg [1998] developed an
algorithm to find the several most information-rich, or authority, pages for
a query. The algorithm also finds hub pages, i.e., pages with links to many
authority pages, and labels the two types of retrieved pages appropriately.

3. FUTURE DIRECTIONS

In this section we present some promising and imaginative research endeavors
that are likely to make an impact on Web use in some form or variation in the
future. Knowledge management [IEEE 1998b].
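
Returning briefly to the vector space ranking of Section 2.4, the angle-based
yardstick can be illustrated with a minimal sketch in Python. It assumes raw
term-frequency vectors built by whitespace tokenization; the function names
(term_vector, cosine, rank) and the three toy documents are hypothetical and
chosen only for illustration, not taken from any search engine discussed in
this paper.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector (assumption: whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors: dot / (|u| * |v|)."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no defined angle; treat as unrelated
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, name) pairs sorted from most to least similar."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), name)
              for name, text in documents.items()]
    return sorted(scored, reverse=True)


# Hypothetical toy collection, for illustration only.
documents = {
    "d1": "web search engines rank retrieved documents",
    "d2": "singular value decomposition of a matrix",
    "d3": "ranking documents for web search",
}
ranking = rank("web search ranking", documents)
for score, name in ranking:
    print(name, round(score, 3))
```

Every document is compared with the query here, which is one way to see why
the direct approach becomes impractical for very large databases and why
dimension-reduction methods such as LSI are of interest.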

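The link-based measures of Section 2.4 can be sketched in the same spirit.
The Python fragment below first counts in-links, in the manner of the simple
measure attributed to Carriere and Kazman [1997], and then runs a mutually
reinforcing hub and authority iteration of the kind associated with Kleinberg
[1998]. It is a sketch under stated assumptions rather than either published
implementation: the toy link graph, the function names, the fixed iteration
count, and the L2 normalization are all choices made here for illustration.

```python
import math


def inlink_counts(links):
    """Count, for each page, how many pages point to it."""
    counts = {}
    for source, targets in links.items():
        counts.setdefault(source, 0)
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def normalize(scores):
    """Scale scores to unit L2 norm (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hubs_and_authorities(links, iterations=50):
    """Iteratively score pages as hubs and as authorities."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages it links to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, []))
                   for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: three link-rich pages pointing at two targets.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}
counts = inlink_counts(links)
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get), max(counts, key=counts.get))
```

In this toy graph both measures single out the same page, but on larger
graphs the iteration can separate a page cited by good hubs from one that is
merely cited often.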

3.1 Intelligent and Adaptive Web Services

As mentioned earlier, research and development of intelligent agents (also
known as bots, robots, and aglets) for performing specific tasks on the Web
has become very active [Finin et al. 1998; IEEE 1996a]. These agents can
tackle problems including finding and filtering information; customizing
information; and automating completion of simple tasks [Gilbert 1997]. The
agents gather information or perform some other service without (the user's)
immediate presence and on some regular schedule (whatis?com home page
<whatis.com/intellig.htm>). The BotSpot home page <botspot.com> summarizes
and points to some historical information as well as current work on
intelligent agents. The Proceedings of the Association for Computing
Machinery (ACM), see Section 5.1 for the URL; the Conferences on Information
and Knowledge Management (CIKM); and the American Association for Artificial
Intelligence Workshops <www.aaai.org> are valuable information sources. The
Proceedings of the Practical Applications of Intelligent Agents and
Multi-Agents (PAAM) conference series <demon.co.uk/ar/paam96> and
<demon.co.uk/ar/paam97> gives a nice overview of application areas. The home
page of the IBM Intelligent Agent Center of Competence (IACC)
<networking.ibm.com/iag/iaghome.html> describes some of the company's
commercial agent products and technologies for the Web.

Adaptive Web services is one interesting area in intelligent Web robot
research, including, e.g., Ahoy! The Homepage Finder, which performs dynamic
reference sifting [Shakes et al. 1997]; Adaptive Web Sites, which
automatically improve their organization and presentation based on user
access data [Etzioni and Weld 1995; Perkowitz and Etzioni 1999]; Perkowitz's
home page <info.cs.vt.edu>; and Adaptive Web Page Recommendation Service
[Balabanovic 1997; Balabanovic and Shoham 1998; Balabanovic et al. 1995].
Discussion and ratings of some of these and other robots are available at
several Web sites, e.g., Felt and Scales <wsulibs.wsu.edu/general/robots.htm>
and Mitchell [1998].

Some scientists have studied prototype metasearchers, i.e., services that
combine the power of several search engines to search a broader range of
pages (since any given search engine covers less than 16% of the Web)
[Gravano 1997; Lawrence and Giles 1998a; Selberg and Etzioni 1995a; 1995b].
Some of the better known metasearch engines include MetaCrawler, SavvySearch,
and InfoSeek Express. After a query is issued, metasearchers work in three
main steps: first, they evaluate which search engines are likely to yield
valuable, fruitful responses to the query; next, they submit the query to
search engines with high ratings; and finally, they merge the retrieved
results from the different search engines used in the previous step. Since
different search engines use different algorithms, which may not be publicly
available, ranking of merged results may be a very difficult task.

Scientists have investigated a number of approaches to overcome this problem.
In one system, a result merging condition is used by a metasearcher to decide
how much data will be retrieved from each of the search engine results, so
that the top objects can be extracted from search engines without examining
the entire contents of each candidate object [Gravano 1997]. Inquirus
downloads and analyzes individual documents to take into account factors such
as query term context, identification of dead pages and links, and
identification of duplicate (and near duplicate) pages [Lawrence and Giles
1998a]. Document ranking is based on the downloaded document itself, instead
of rankings from individual search engines.

3.2 Information Retrieval for Internet Shopping

An intriguing application of Web robot technology is in simulation and
prediction


of pricing strategies for sales over the Internet. The 1999 Christmas and
holiday season marked the first time that shopping online was no longer a
prediction; online sales increased by 300 percent and the number of orders
increased by 270 percent compared to the previous year [Clark 2000]. To
underscore the point, Time magazine selected Jeff Bezos, founder of
Amazon.com, as 1999 Person of the Year. Exponential growth is predicted in
online shopping. Charts that illustrate projected growth in
Internet-generated revenue, Internet-related consumer spending, Web
advertising revenue, etc. from the present to 2002, 2003, and 2005 are given
in Nua's survey pages (see Section 1.2 for the URL).

Robots to help consumers shop, or shopbots, have become commonplace in
e-commerce sites and general-purpose Web portals. Shopbot technology has
taken enormous strides since its initial introduction in 1995 by Andersen
Consulting. This first bot, known as Bargain Finder, helped consumers find
the lowest priced CDs. Many current shopbots are capable of a host of other
tasks in addition to comparing prices, such as comparing product features,
user reviews, delivery options, and warranty information. Clark [2000]
reviews the state-of-the-art in bot technology and presents some predictions
for the future by experts in the field. For example, Kephart, manager of
IBM's Agents and Emergent Phenomena Group, predicts that "shopping bots may
soon be able to negotiate and otherwise work with vendor bots, interacting
via ontologies and distributed technologies... bots would then become
economic actors making decisions," and Guttman, chief technology officer for
Frictionless Commerce <frictionless.com>, footnotes that Frictionless's bot
engine is used by some famous portals, including Lycos, and mentions that his
company's technology will be used in a retailer bot that will negotiate
trade-offs between product price, performance, and delivery times with
shopbots on the basis of customer preferences. Price comparison robots and
their possible roles in Internet merchant price wars in the future are
discussed in Kephart et al. [1998a; 1998b].

The auction site is another successful technological off-shoot of the
Internet shopping business [Cohen 2000; Ferguson 2000]. Two of the more
famous general online auction sites are priceline.com <priceline.com> and
eBay <ebay.com>. Priceline.com pioneered and patented its business concept,
i.e., online bidding [Walker et al. 1997]. Patents related to that of
priceline.com include those owned by ADT Automotive, Inc. [Berent et al.
1998]; Walker Asset Management [Walker et al. 1996]; and two individuals
[Barzilai and Davidson 1997].

3.3 Multimedia Retrieval

IR from multimedia databases is a multidisciplinary research area, which
includes topics from a very diverse range, such as analysis of text, image
and video, speech, and nonspeech audio; graphics; animation; artificial
intelligence; human-computer interaction; and multimedia computing [Faloutsos
1996; Faloutsos and Lin 1995; Maybury 1997; and Schauble 1997]. Recently,
several commercial systems that integrate search capabilities from multiple
databases containing heterogeneous, multimedia data have become available.
Examples include PLS <pls.com>; Lexis-Nexis <lexis-nexis.com>; DIALOG
<dialog.com>; and Verity <verity.com>. In this section we point to some
recent developments in the field; but the discussion is by no means
comprehensive.

Query and retrieval of images is one of the more established fields of
research involving multimedia databases [IEEE ICIP: Proceedings of the IEEE
International Conference on Image Processing and IEEE ICASSP: Proceedings of
the IEEE International Conference on Acoustics, Speech and Signal Processing
and IFIP 1992]. So much work by so many has been conducted on this topic that
a comprehensive review is beyond the scope of this paper. But some


selected work in this area follows: search and retrieval from large image
archives [Castelli et al. 1998]; pictorial queries by image similarity
[Soffer and Samet]; image queries using Gabor wavelet features [Manjunath and
Ma 1996]; fast, multiresolution image queries using Haar wavelet transform
coefficients [Jacobs et al. 1995]; acquisition, storage, indexing, and
retrieval of map images [Samet and Soffer 1986]; real-time fingerprint
matching from a very large database [Ratha et al. 1992]; querying and
retrieval using partially decoded JPEG data and keys [Schneier and
Abdel-Mottaleb 1996]; and retrieval of faces from a database [Bach et al.
1993; Wu and Narasimhalu 1994].

Finding documents that have images of interest is a much more sophisticated
problem. Two well-known portals with a search interface for a database of
images are the Yahoo! Image Surfer <isurf.yahoo.com> and the Alta Vista
PhotoFinder <image.altavista.com>. Like Yahoo!'s text-based search engine,
the Image Surfer home pages are organized into categories. For a text-based
query, a maximum of six thumbnails of the top-ranked retrieved images are
displayed at a time, along with their titles. If more than six are retrieved,
then links to subsequent pages with lower relevance rankings appear at the
bottom of the page. The number of entries in the database seems to be small;
we attempted to retrieve photos of some famous movie stars and came up with
none (for Brad Pitt) or few retrievals (for Gwyneth Paltrow), some of which
were outdated or unrelated links. The input interface to Photofinder looks
very much like the interface for Alta Vista's text-based search engine. For a
text-based query, a maximum of twelve thumbnails of retrieved images are
displayed at a time. Only the name of the image file is displayed, e.g.,
image.jpg. To read the description of an image (if it is given), the mouse
must point to the corresponding thumbnail. The number of retrievals for
Photofinder was huge (4232 for Brad Pitt and 119 for Gwyneth Paltrow), but
there was a considerable amount of noise after the first page of retrievals
and there were many redundancies. Other search engines with an option for
searching for images in their advanced search page are Lycos, HotBot, and
AltaVista. All did somewhat better than Photofinder in retrieving many images
of Brad Pitt and Gwyneth Paltrow; most of the thumbnails were relevant for
the first several pages (each page contained 10 thumbnails).

NEC's Inquirus is an image search engine that uses results from several
search engines. It analyzes the text accompanying images to determine
relevance for ranking, and downloads the actual images to create thumbnails
that are displayed to the user [Lawrence and Giles 1999c].

Query and retrieval of images in a video frame or frames is a research area
closely related to retrieval of still images from a very large image database
[Bolle et al. 1998]. We mention a few to illustrate the potentially wide
scope of applications, e.g., content-based video indexing and retrieval
[Smoliar and Zhang 1994]; the Query-by-Image-Content (QBIC) system, which
helps users find still images in large image and video databases on the basis
of color, shape, texture, and sketches [Flickner et al. 1997; Niblack 1993];
Information Navigation System (INS) for multimedia data, a system for
archiving and searching huge volumes of video data via Web browsers
[Nomiyama et al. 1997]; and VisualSEEk, a tool for searching, browsing, and
retrieving images, which allows users to query for images using the visual
properties of regions and their spatial layout [Smith and Chang 1997a; 1996];
compressed domain image manipulation and feature extraction for compressed
domain image and video indexing and searching [Chang 1995; Zhong and Chang
1997]; a method for extracting visual events from relatively long videos
using objects (rather than keywords), with specific applications to sports
events [Iwai et al. 2000; Kurokawa et al. 1999]; retrieval and semantic


interpretation of video contents based on objects and their behavior [Echigo
et al. 2000]; shape-based retrieval and its application to identity checks on
fish [Schatz 1997]; and searching for images and videos on the Web [Smith and
Chang 1997b].

Multilingual communication on the Web [Miyahara et al. 2000] and
cross-language document retrieval is a timely research topic being
investigated by many [Ballesteros and Croft 1998; Eichmann et al. 1998;
Pirkola 1998]. An introduction to the subject is given in Oard [1997b], and
some surveys are found in CLIR [1999] (Cross-Language Information Retrieval
Project <clis.umd.edu/dlrg>); Oard [1997a] <glue.umd.edu/oard/research.html>
and in Oard and Door [1996]. Several search engines now feature multilingual
search, e.g., Open Text Web Index <index.opentext.net> searches in four
languages (English, Japanese, Spanish, and Portuguese). A number of
commercial Japanese-to-English and English-to-Japanese Web translation
software products have been developed by leading Japanese companies in
Japanese <bekkoame.ne.jp/oto3>. A typical example, which has a trial version
for downloading, is a product called Honyaku no Oosama
<ibm.co.jp/software/internet/king/index.html>, or Internet King of
Translation [Watanabe and Takeda 1998].

Other interesting research topics and applications in multimedia IR are
speech-based IR for digital libraries [Oard 1997c] and retrieval of songs
from a database when a user hums the first few bars of a tune [Kageyama and
Takashima 1994]. The melody retrieval technology has been incorporated as an
interface in a karaoke machine.

3.4 Conclusions

Potentially lucrative application of Internet-based IR is a widely studied
and hotly debated topic. Some pessimists believe that current rates of
increase in the use of the Internet, number of Web sites and hosts are not
sustainable, so that research and business opportunities in the area will
decline. They cite statistics such as the April 1998 GVU WWW survey, which
states that the use of better equipment (e.g., upgrades in modems by 48% of
people using the Web) has not resolved the problem of slow access, and an
August 1998 survey by Alexa Internet stating that 90% of all Web traffic is
spread over 100,000 different hosts, with 50% of all Web traffic headed
towards the top 900 most popular sites. In short, the pessimists maintain
that an effective means of managing the highly uneven concentration of
information packets on the Internet is not immediately available, nor will it
be in the near future. Furthermore, they note that the exponential increase
in Web sites and information on the Web is contributing to the second most
commonly cited problem, that is, users not being able to find the information
they seek in a simple and timely manner.

The vast majority of publications, however, support a very optimistic view.
The visions and research projects of many talented scientists point towards
finding concrete solutions and building more efficient and user-friendly
solutions. For example, McKnight and Boroumand [2000] maintain that flat-rate
Internet retail pricing, currently the predominant pricing model in the U.S.,
may be one of the major culprits in the traffic-congestion problem, and they
suggest that other pricing models are being proposed by researchers. It is
likely that the better proposals will be seriously considered by the business
community and governments to avoid the continuation of the current solution,
i.e., overprovisioning of bandwidth.

ACKNOWLEDGMENTS

The authors acknowledge helpful conversations with Stuart McDonald of
alphaWorks and our colleagues at IBM Research. Our manuscript has benefitted
greatly from the extensive and well-documented list of suggestions and
corrections


from the reviewers of the first draft. We appreciate their generosity,
patience, and thoughtfulness.

REFERENCES

ASSOCIATION FOR COMPUTING MACHINERY. 2000. SIGCHI: Special Interest Group on Computer-Human Interaction. Home page: www.acm.org/sigchi/
ASSOCIATION FOR COMPUTING MACHINERY. 2000. SIGIR: Special Interest Group on Information Retrieval. Home page: www.acm.org/sigir/
AGOSTI, M. AND SMEATON, A. 1996. Information Retrieval and Hypertext. Kluwer Academic Publishers, Hingham, MA.
AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., AND RAGHAVAN, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD, Seattle, WA, June). ACM Press, New York, NY, 94-105.
AHLBERG, C. AND SHNEIDERMAN, B. 1994. Visual information seeking: Tight coupling of dynamic query filters with starfield displays. In Proceedings of the ACM Conference on Human Factors in Computing Systems: Celebrating Interdependence (CHI '94, Boston, MA, Apr. 24-28). ACM Press, New York, NY, 313-317.
AHLBERG, C. AND SHNEIDERMAN, B. 1997. The alphaslider: A compact and rapid selector. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97, Atlanta, GA, Mar. 22-27), S. Pemberton, Ed. ACM Press, New York, NY.
AI MAG. 1997. Special issue on intelligent systems on the internet. AI Mag. 18, 4.
ANDERBERG, M. R. 1973. Cluster Analysis for Applications. Academic Press, Inc., New York, NY.
ANICK, P. G. AND VAITHYANATHAN, S. 1997. Exploiting clustering and phrases for context-based information retrieval. SIGIR Forum 31, 1, 314-323.
ASAKAWA, C. 1996. Enabling the visually disabled to use the www in a gui environment. IEICE Tech. Rep. HC96-29.
BACH, J., PAUL, S., AND JAIN, R. 1993. A visual information management system for the interactive retrieval of faces. IEEE Trans. Knowl. Data Eng. 5, 4, 619-628.
BAEZA-YATES, R. A. 1992. Introduction to data structures and algorithms related to information retrieval. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice-Hall, Inc., Upper Saddle River, NJ, 13-27.
BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, MA.
BALABANOVIC, M. 1997. An adaptive Web page recommendation service. In Proceedings of the First International Conference on Autonomous Agents (AGENTS '97, Marina del Rey, CA, Feb. 5-8), W. L. Johnson, Chair. ACM Press, New York, NY, 378-385.
BALABANOVIC, M. AND SHOHAM, Y. 1995. Learning information retrieval agents: Experiments with automated web browsing. In Proceedings of the 1995 AAAI Spring Symposium on Information Gathering from Heterogenous Distributed Environments (Stanford, CA, Mar.). AAAI Press, Menlo Park, CA.
BALABANOVIC, M., SHOHAM, Y., AND YUN, T. 1995. An adaptive agent for automated web browsing. Stanford Univ. Digital Libraries Project, working paper 1995-0023. Stanford University, Stanford, CA.
BALDONADO, M. 1997. An interactive, structure-mediated approach to exploring information in a heterogeneous, distributed environment. Ph.D. Dissertation. Computer Systems Laboratory, Stanford Univ., Stanford, CA.
BALDONADO, M. Q. W. AND WINOGRAD, T. 1997. SenseMarker: an information-exploration interface supporting the contextual evolution of a user's interests. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97, Atlanta, GA, Mar. 22-27), S. Pemberton, Ed. ACM Press, New York, NY, 11-18.
BALLESTEROS, L. AND CROFT, W. B. 1998. Resolving ambiguity for cross-language retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98, Melbourne, Australia, Aug. 24-28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 64-71.
BARZILAI AND DAVIDSON. 1997. Computer-based electronic bid, auction and sale system, and a system to teach new/non-registered customers how bidding, auction purchasing works: U.S. Patent no. 60112045.
BEAUDOIN, L., PARENT, M.-A., AND VROOMEN, L. C. 1996. Cheops: A compact explorer for complex hierarchies. In Proceedings of the IEEE Conference on Visualization (San Francisco, CA, Oct. 27-Nov. 1), R. Yagel and G. M. Nielson, Eds. IEEE Computer Society Press, Los Alamitos, CA, 87ff.
BEDERSON, B. B. AND HOLLAN, J. D. 1994. Pad++: A zooming graphical interface for exploring alternate interface physics. In Proceedings of the 7th Annual ACM Symposium on User Interface Software and Technology (UIST '94, Marina del Rey, CA, Nov. 2-4), P. Szekely, Chair. ACM Press, New York, NY, 17-26.
BERENT, T., HURST, D., PATTON, T., TABERNIK, T., REIG, J. W. D., AND WHITTLE, W. 1998. Electronic on-line motor vehicle auction and information system: U.S. Patent no. 5774873.
BERNERS-LEE, T., CAILLIAU, R., LUOTONEN, A., NIELSEN, H. F., AND SECRET, A. 1994. The


World-Wide Web. Commun. ACM 37, 8 (Aug.), 76-82.
BERRY, M. AND BROWN, M. 1999. Understanding Search Engines. SIAM, Philadelphia, PA.
BERRY, M. W., DUMAIS, S. T., AND O'BRIEN, G. W. 1995. Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 4 (Dec.), 573-595.
BHARAT, K. AND BRODER, A. 1998. A technique for measuring the relative size and overlap of public web search engines. In Proceedings of the Seventh International Conference on World Wide Web 7 (WWW7, Brisbane, Australia, Apr. 14-18), P. H. Enslow and A. Ellis, Eds. Elsevier Sci. Pub. B. V., Amsterdam, The Netherlands, 379-388.
BOLLE, R., YEO, B.-L., AND YEUNG, M. 1998. Video query: Research directions. IBM J. Res. Dev. 42, 2 (Mar.), 233-251.
BORKO, H. 1979. Inter-indexer consistency. In Proceedings of the Cranfield Conference.
BOTAFOGO, R. A., RIVLIN, E., AND SHNEIDERMAN, B. 1992. Structural analysis of hypertexts: Identifying hierarchies and useful metrics. ACM Trans. Inf. Syst. 10, 2 (Apr.), 142-180.
BRAKE, D. 1997. Lost in cyberspace. New Sci. Mag. www.newscientist.com/keysites/networld/lost.html
BRIN, S. AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 1-7, 107-117.
BRODER, A. Z., GLASSMAN, S. C., MANASSE, M. S., AND ZWEIG, G. 1997. Syntactic clustering of the Web. Comput. Netw. ISDN Syst. 29, 8-13, 1157-1166.
BUSINESS WEEK. 1997. Special report on speech technologies. Business Week.
COMMUNICATIONS OF THE ACM. 1993. Special issue on the next generation GUIs. Commun. ACM.
COMMUNICATIONS OF THE ACM. 1994. Special issue on internet technology. Commun. ACM.
COMMUNICATIONS OF THE ACM. 1995. Special issues on digital libraries. Commun. ACM.
COMMUNICATIONS OF THE ACM. 1999. Special issues on knowledge discovery. Commun. ACM.
CARD, S., MACKINLAY, J., AND SHNEIDERMAN, B. 1999. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers Inc., San Francisco, CA.
CARD, S. K., ROBERTSON, G. G., AND YORK, W. 1996. The WebBook and the Web Forager: an information workspace for the World-Wide Web. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '96, Vancouver, B.C., Apr. 13-18), M. J. Tauber, Ed. ACM Press, New York, NY, 111ff.
CARL, J. 1995. Protocol gives sites way to keep out the bots. Web Week 1, 7 (Nov.).
CARRIERE, J. AND KAZMAN, R. 1997. WebQuery: Searching and visualizing the Web through connectivity. In Proceedings of the Sixth International Conference on the World Wide Web (Santa Clara CA, Apr.).
CASTELLI, V., BERGMAN, L., KONTOYIANNINS, I., LI, C.-S., ROBINSON, J., AND TUREK, J. 1998. Progressive search and retrieval in large image archives. IBM J. Res. Dev. 42, 2 (Mar.), 253-268.
CATHRO, W. 1997. Matching discovery and recovery. In Proceedings of the Seminar on Standards Australia. www.nla.gov.au/staffpaper/cathro3.html
CHAKRABARTI, S., DOM, B., GIBSON, D., KUMAR, S., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1988. Experiments in topic distillation. In Proceedings of the ACM SIGIR Workshop on Hypertext Information Retrieval for the Web (Apr.). ACM Press, New York, NY.
CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., GIBSON, D., AND KLEINBERG, J. 1998. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International Conference on World Wide Web 7 (WWW7, Brisbane, Australia, Apr. 14-18), P. H. Enslow and A. Ellis, Eds. Elsevier Sci. Pub. B. V., Amsterdam, The Netherlands, 65-74.
CHAKRABARTI, S. AND RAJAGOPALAN, S. 1997. Survey of information retrieval research and products. Home page: w3.almaden.ibm.com/soumen/ir.html
CHALMERS, M. AND CHITSON, P. 1992. Bead: explorations in information visualization. In Proceedings of the 15th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR '92, Copenhagen, Denmark, June 21-24), N. Belkin, P. Ingwersen, and A. M. Pejtersen, Eds. ACM Press, New York, NY, 330-337.
CHANDRASEKARAN, R. 1998. Portals offer one-stop surfing on the net. Int. Herald Tribune 19/21.
CHANG, S.-F. 1995. Compressed domain techniques for image/video indexing and manipulation. In Proceedings of the Conference on Information Processing.
CHO, J., GARCIA-MOLINA, H., AND PAGE, L. 1998. Efficient crawling through URL ordering. Comput. Netw. ISDN Syst. 30, 1-7, 161-172.
CLARK, D. 2000. Shopbots become agents for business change. IEEE Computer.
CLEVERDON, C. 1970. Progress in documentation. J. Doc. 26, 55-67.
CLIR. 1999. Cross-language information retrieval project, resource page. Tech. Rep. University of Maryland at College Park, College Park, MD.
COHEN, A. 1999. The attic of e. Time Mag.
COMPUT. NETW. ISDN SYST. 2000. World Wide Web conferences. 1995-2000. Comput. Netw. ISDN Syst. www.w3.org/Conferences/Overview-WWW.html
COOPER, W. 1969. Is interindexer consistency a hobgoblin? Am. Doc. 20, 3, 268-278.
CRANOR, L. F. AND LA MACCHIA, B. A. 1998. Spam! Commun. ACM 41, 8, 74-83.
CRESTANI, F., LALMAS, M., VAN RIJSBERGEN, C. J., AND CAMPBELL, I. 1998. Is this document relevant?

ACM Computing Surveys, Vol. 32, No. 2, June 2000



Probably: A survey of probabilistic models in ETZIONI, O. AND WELD, D. 1995. Intelligent agents
information retrieval. ACM Comput. Surv. 30, on the Internet: Fact, fiction and forecast.
4, 528-552. Tech. Rep. University of Washington, Seattle,
CUNNINGHAM, M. 1997. Brewster's millions. Irish WA.
Times. www.irish-times.com/irish-times/paper/ FALOUTSOS, C. 1996. Searching Multimedia Data-
1997/0127/cmp1.html bases by Content. Kluwer Academic Publish-
CUTTING, D. R., KARGER, D. R., AND PEDERSEN, J. ers, Hingham, MA.
O. 1993. Constant interaction-time scatter/ FALOUTSOS, C. AND LIN, K. 1995. FastMap: A fast
gather browsing of very large document algorithm for indexing, data-mining and visu-
collections. In Proceedings of the 16th Annual alization of traditional and multimedia
International ACM Conference on Research datasets. In Proceedings of the ACM SIGMOD
and Development in Information Retrieval Conference on Management of Data (ACM-
(SIGIR 93, Pittsburgh, PA, June 27-July), R. SIGMOD, San Jose, CA, May). SIGMOD.
Korfhage, E. Rasmussen, and P. Willett, Eds. ACM Press, New York, NY, 163-174.
ACM Press, New York, NY, 126-134. FALOUTSOS, C. AND OARD, D. W. 1995. A survey of
DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., information retrieval and filtering methods.
LANDAUER, T. K., AND HARSHMAN, R. 1990. Univ. of Maryland Institute for Advanced
Indexing by latent semantic analysis. J. Am. Computer Studies Report. University of
Soc. Inf. Sci. 41, 6, 391-407. Maryland at College Park, College Park, MD.
DEMMEL, J. W. 1997. Applied Numerical Linear FELDMAN, S. 1998. Web search services in 1998:
Algebra. SIAM, Philadelphia, PA. Trends and challenges. Inf. Today 9.
DHILLON, I. AND MODHA, D. 1999. A data-cluster- FERGUSON, A. 1999. Auction nation. Time Mag.
ing algorithm on distributed memory FININ, T., NICHOLAS, C., AND MAYFIELD, J. 1998.
multiprocessors. In Proceedings of the Work- Software agents for information retrieval
shop on Large-Scale Parallel KDD Systems (short course notes). In Proceedings of the
(ACM SIGKDD, Aug. 15-18). ACM Press, Third ACM Conference on Digital Libraries
New York, NY. (DL 98, Pittsburgh, PA, June 23-26), I. Wit-
DHILLON, I. AND MODHA, D. 2000. Concept decom- ten, R. Akscyn, and F. M. Shipman, Eds. ACM
Press, New York, NY.
positions for large sparse text data using
FINKELSTEIN, A. AND SALESIN, D. 1995. Fast multi-
clustering. Mach. Learn.
resolution image querying. In Proceedings of
ECHIGO, T., KUROKAWA, M., TOMITA, A., TOMITA, A.,
the ACM SIGGRAPH Conference on Visual-
MIYAMORI, AND IISAKU, S. 2000. Video enrich-
ization: Art and Interdisciplinary Programs
ment: Retrieval and enhanced visualization
(SIGGRAPH 95, Los Angeles, CA, Aug.
based on behaviors of objects. In Proceedings
6-11), K. O'Connell, Ed. ACM Press, New
of the Fourth Asian Conference on Computer
York, NY, 277-286.
Vision (ACCV2000, Jan. 8-11). 364-369. FISHER, D. 1995. Iterative optimization and sim-
EICHMANN, D., RUIZ, M. E., AND SRINIVASAN, P. plification of hierarchical clusterings. Tech.
1998. Cross-language information retrieval Rep. Vanderbilt University, Nashville, TN.
with the UMLS metathesaurus. In Proceed- FLICKNER, M., SAWHNEY, H., NIBLACK, W., ASHLEY,
ings of the 21st Annual International ACM J., HUANG, Q., DOM, B., GORKANI, M., HAFNER,
SIGIR Conference on Research and Develop- J., LEE, D., PETKOVIC, D., STEELE, D., AND
ment in Information Retrieval (SIGIR 98, YANKER, P. 1997. Query by image and video
Melbourne, Australia, Aug. 24-28), W. B. content: the QBIC system. In Intelligent Multi-
Croft, A. Moffat, C. J. van Rijsbergen, R. media Information Retrieval, M. T. Maybury,
Wilkinson, and J. Zobel, Chairs. ACM Press, Ed. MIT Press, Cambridge, MA, 7-22.
New York, NY, 72-80. FLYNN, L. 1996. Desperately seeking surfers: Web
ESTER, M., KRIEGEL, H.-S., SANDER, J., AND XU, X. programmers try to alter search engines'
1995a. A density-based algorithm for discov- results. New York Times.
ering clusters in large spatial databases with FRAKES, W. B. AND BAEZA-YATES, R., EDS. 1992.
noise. In Proceedings of the First Interna- Information Retrieval: Data Structures and
tional Conference on Knowledge Discovery and Algorithms. Prentice-Hall, Inc., Upper Saddle
Data Mining (Montreal, Canada, Aug. 20-21). River, NJ.
ESTER, M., KRIEGEL, H.-S., AND XU, X. 1995b. A FRISSE, M. E. 1988. Searching for information in a
database interface for clustering in large spa- hypertext medical handbook. Commun. ACM
tial databases. In Proceedings of the First 31, 7 (July), 880-886.
International Conference on Knowledge Dis- GILBERT, D. 1997. Intelligent agents: The right
covery and Data Mining (Montreal, Canada, information at the right time. Tech. Rep. IBM
Aug. 20-21). Corp., Research Triangle Park, NC.
ESTER, M., KRIEGEL, H.-S., AND XU, X. 1995c. Fo- GLOOR, P. AND DYNES, S. 1998. Cybermap: Visually
cusing techniques for efficient class navigating the web. J. Visual Lang. Comput.
identification. In Proceedings of the Fourth 9, 3 (June), 319-336.
International Symposium on Large Spatial GOLUB, G. H. AND VAN LOAN, C. F. 1996. Matrix
Databases. Computations. 3rd ed. Johns Hopkins studies in


the mathematical sciences. Johns Hopkins IEEE. 1998a. News and trends section. IEEE In-
University Press, Baltimore, MD. ternet Comput.
GRAVANO, L. 1998. Querying multiple document IEEE. 1998b. Special issue on knowledge
collections across the Internet. Ph.D. management. IEEE Expert.
Dissertation. Stanford University, Stanford, IEEE. 1996a. Special issue on intelligent agents.
CA. IEEE Expert/Intelligent Systems and Their
GUDIVADA, V., RAGHAVAN, V., GROSKY, W., AND Applications.
KASAANAGOTTU, R. 1997. Information retrieval IEEE. 1996b. Special issue on digital libraries:
on the world wide web. IEEE Internet Com- representation and retrieval. IEEE Trans.
put. 1, 1 (May/June), 58-68. Pattern Anal. Mach. Intell. 18, 8, 771-859.
GUGLIELMO, C. 1997. Upside today (on-line). home IFIP. 1989. Visual Data Base Systems I and II.
page: inc.com/cgi-bin/tech link.cgi?url=http:// Elsevier North-Holland, Inc., Amsterdam,
www.upside.com. The Netherlands.
GUHA, S., RASTOGI, R., AND SHIM, K. 1998. Cure: IWAI, Y., MARUO, J., YACHIDA, M., ECHIGO, T., AND
An efficient clustering algorithm for large IISAKU, S. 2000. A framework for visual event
databases. In Proceedings of the ACM SIG- extraction from soccer games. In Proceedings
MOD Conference on Management of Data of the Fourth Asian Conference on Computer
(SIGMOD, Seattle, WA, June). ACM Press, Vision (ACCV2000, Jan. 8-11). 222-227.
New York, NY. JACOBY, J. AND SLAMECKA, V. 1962. Indexer consis-
HAWKING, D., CRASWELL, N., THISTLEWAITE, P., AND tency under minimal conditions. RADC TR
HARMAN, D. 1999. Results and Challenges in 62--426. Documentation, Inc., Bethesda, MD,
Web Search Evaluation. US.
HEARST, M. A. 1995. TileBars: Visualization of KAGEYAMA, T. AND TAKASHIMA, Y. 1994. A melody
term distribution information in full text in- retrieval method with hummed melody. IEICE
formation access. In Proceedings of the ACM Trans. Inf. Syst. J77, 8 (Aug.), 1543-1551.
Conference on Human Factors in Computing KAHLE, B. 1999. Archiving the Internet. home
Systems (CHI 95, Denver, CO, May 7-11), I. page: www.alexa.com/ brewster/essays/sciam
R. Katz, R. Mack, L. Marks, M. B. Rosson, article.html
and J. Nielsen, Eds. ACM Press/Addison-Wes- KEPHART, J., HANSON, J., LEVINE, D., GROSOF, B.,
ley Publ. Co., New York, NY, 59-66. SAIRAMESH, J., AND WHITE, R. S. 1998a. Emer-
HEARST, M. 1997. Interfaces for searching the gent behavior in information economies.
web. Sci. Am., 68-72. KEPHART, J. O., HANSON, J. E., AND SAIRAMESH, J.
HEARST, M. 1999. User interfaces and 1998b. Price-war dynamics in a free-market
visualization. In Modern Information Re- economy of software agents. In Proceedings of
trieval, R. Baeza-Yates and B. Ribeiro-Neto. the Sixth International Conference on Artifi-
Addison-Wesley, Reading, MA, 22573232. cial Life (ALIFE, Madison, WI, July 26-30),
HEARST, M. A. AND PEDERSEN, J. O. 1996. Visualiz- C. Adami, R. K. Belew, H. Kitano, and C. E.
ing information retrieval results: a demon- Taylor, Eds. MIT Press, Cambridge, MA, 53-
stration of the TileBar interface. In Proceed- 62.
ings of the CHI 96 Conference Companion on KLEINBERG, J. M. 1998. Authoritative sources in a
Human Factors in Computing Systems: Com- hyperlinked environment. In Proceedings of
mon Ground (CHI 96, Vancouver, British Co- the 1998 ACM-SIAM Symposium on Discrete
lumbia, Canada, Apr. 13-18), M. J. Tauber, Algorithms (San Francisco CA, Jan.). ACM
Ed. ACM Press, New York, NY, 394-395. Press, New York, NY.
HENZINGER, M., HEYDON, A., MITZENMACHER, M., KOBAYASHI, M., DUPRET, G., KING, O., SAMUKAWA,
AND NAJORK, M. 1999. Measuring index qual- H., AND TAKEDA, K. 1999. Multi-perspective
ity using random walks on the web. retrieval, ranking and visualization of web
HERNANDEZ, M. 1996. A generalization of band data. In Proceedings of the International Sym-
joins and the merge/purge problem. Ph.D. posium on Digital Libraries (ISDL99,
Dissertation. Columbia Univ., New York, NY. Tsukuba, Japan). 159-162.
HOWE, A. AND DREILINGER, D. 1997. Savvysearch: KORFHAGE, R. R. 1997. Information Storage and
A metasearch engine that learns which search Retrieval. John Wiley and Sons, Inc., New
engine to query. AI Mag. 18, 2, 19-25. York, NY.
HUBERMAN, B. AND LUKOSE, R. 1997. A metasearch KOSTER, M. 1995. Robots in the web: trick or
engine that learns which search engine to treat? ConneXions 9, 4 (Apr.).
query. Science 277, 535-537. KOSTER, M. 1996. Examination of the standard for
HUBERMAN, B., PIROLLI, P., PITKOW, J., AND LU- robots exclusion. home page: info.webcrawler-
KOSE, R. 1998. Strong regularities in world .com/mak/projects/robots/eval.html
wide web surfing. Science 280, 95-97. KUROKAWA, M., ECHIGO, T., TOMITA, T., MAEDA, J.,
HYLTON, J. 1996. Identifying and merging related MIYAMORI, H., AND IISAKU, S. 1999. Represen-
bibliographic records. Master's Thesis. tation and retrieval of video scene by using
IEEE. 1999. Special issue on intelligent informa- object actions and their spatio-temporal
tion retrieval. IEEE Expert. relationships. In Proceedings of the Interna-


tional Conference on ICIP-Image Processing. tering web pages: A preliminary study. In


IEEE Press, Piscataway, NJ. Proceedings of the Fourth International Con-
LAGOZE, C. 1996. The Warwick framework: A con- ference on Knowledge Discovery and Data
tainer architecture for diverse sets of Mining (Seattle, WA, June 98). 264-268.
metadata. D-Lib Mag. www.dlib.org MANBER, U. 1999. Foreword. In Modern Informa-
LAMPING, J., RAO, R., AND PIROLLI, P. 1995. A tion Retrieval, R. Baeza-Yates and B.
focus+context technique based on hyperbolic Ribeiro-Neto. Addison-Wesley, Reading, MA,
geometry for visualizing large hierarchies. In 5-8.
Proceedings of the ACM Conference on Human MANBER, U., SMITH, M., AND GOPAL, B. 1997. Web-
Factors in Computing Systems (CHI 95, Den- glimpse: Combining browsing and searching.
ver, CO, May 7-11), I. R. Katz, R. Mack, L. In Proceedings on USENIX 1997 Annual
Marks, M. B. Rosson, and J. Nielsen, Eds. Technical Conference (Jan.). 195-206.
ACM Press/Addison-Wesley Publ. Co., New MANJUNATH, B. S. AND MA, W. Y. 1996. Texture
York, NY, 401-408. features for browsing and retrieval of image
LAWRENCE, S. AND GILES, C. 1998a. Context and data. IEEE Trans. Pattern Anal. Mach. Intell.
page analysis for improved web search. IEEE 18, 8, 837-842.
Internet Comput. 2, 4, 38-46. MANNING, C. AND SCHUTZE, H. 1999. Foundations
LAWRENCE, S. AND GILES, C. 1998b. Searching the of Statistical Natural Language Processing.
world wide web. Science 280, 98-100. MIT Press, Cambridge, MA.
LAWRENCE, S. AND GILES, C. 1999a. Accessibility of MARCHIONINI, G. 1995. Information Seeking in
information on the web. Nature 400, 107-109. Electronic Environments. Cambridge Series
LAWRENCE, S. AND GILES, C. 1999b. Searching the on Human-Computer Interaction. Cambridge
web: General and scientific information University Press, New York, NY.
access. IEEE Commun. Mag. 37, 1, 116-122. MAYBURY, M. 1997. Intelligent Multimedia Infor-
LAWRENCE, S. AND GILES, C. 1999c. Text and image mation Retrieval. MIT Press, Cambridge, MA.
metasearch on the web. In Proceedings of the MAYBURY, M. T. AND WAHLSTER, W., EDS. 1998.
International Conference on Parallel and Dis- Readings in Intelligent User Interfaces. Mor-
tributed Processing Techniques and Applica- gan Kaufmann Publishers Inc., San Fran-
tions (PDPTA99). 829-835. cisco, CA.
LEIGHTON, H. AND SRIVASTAVA, J. 1997. Precision MCKNIGHT, L. 2000. Pricing internet services: Ap-
among world wide web search engines: Alta- proaches and challenges. IEEE Computer,
vista, excite, hotbot, infoseek, and lycos. home 128-129.
page: www.winona.msus.edu/library/webind2/ MEREU, S. W. AND KAZMAN, R. 1996. Audio en-
webind2.htm. hanced 3D interfaces for visually impaired
LETSCHE, T. AND BERRY, M. 1997. Large-scale in- users. In Proceedings of the ACM Conference
formation retrieval with latent semantic in- on Human Factors in Computing Systems
dexing: (submitted). Inf. Sci. Appl. (CHI 96, Vancouver, B.C., Apr. 13-18), M. J.
LIAO, H., OSADA, M., AND SHNEIDERMAN, B. 1992. A Tauber, Ed. ACM Press, New York, NY, 72-78.
formative evaluation of three interfaces for MITCHELL, S. 1998. General internet resource find-
browsing directories using dynamic queries. ing tools. home pages: library.ucr.edu/pubs/
Tech. Rep. CS-TR-2841. Department of Com- navigato.html
puter Science, University of Maryland, Col- MIYAHARA, T., WATANABE, H., TAZOE, E., KAMIYAMA,
lege Park, MD. Y., AND TAKEDA, K. 2000. Internet Machine
LIBERATORE, K. 1997. Getting to the source: Is it Translation. Mainichi Communications, Japan.
real or spam, ma'am? MacWorld. MODHA, S. AND SPANGLER, W. 2000. Clustering
LIDSKY, D. AND KWON, R. 1997. Searching the net. hypertext with applications to web searching.
PC Mag., 227-258. In Proceedings of the Conference on Hypertext
LIECHTI, O., SIFER, M. J., AND ICHIKAWA, T. 1998. (May 30-June 3).
Structured graph format: XML metadata for MONGE, A. AND ELKAN, C. 1998. An efficient do-
describing Web site structure. Comput. Netw. main-independent algorithm for detecting ap-
ISDN Syst. 30, 1-7, 11-21. proximately duplicate database records. Tech.
LOSEE, R. M. 1998. Text Retrieval and Filtering: Rep. University of California at San Diego, La
Analytic Models of Performance. Kluwer in- Jolla, CA.
ternational series on information retrieval. MONIER, L. 1998. Altavista cto responds.
Kluwer Academic Publishers, Hingham, MA. www4.zdnet.com/anchordesk/talkback/talkback
LYNCH, C. 1997. Searching the Internet. Sci. Am., 13066.html.
52-56. MOROHASHI, M., TAKEDA, K., NOMIYAMA, H., AND
MAAREK, Y. S., JACOVI, M., SHTALHAIM, M., UR, S., MARUYAMA, H. 1995. Information outlining. In
ZERNIK, D., AND BEN-SHAUL, I. Z. 1997. Web- Proceedings of International Symposium on
Cutter: A system for dynamic and tailorable Digital Libraries (Tsukuba, Japan).
site mapping. Comput. Netw. ISDN Syst. 29, MUNZNER, T. AND BURCHARD, P. 1995. Visualizing
8-13, 1269-1279. the structure of the World Wide Web in 3D
MACSKASSY, S., BANERJEE, A., DAVISON, B., AND hyperbolic space. In Proceedings of the Sympo-
HIRSH, H. 1998. Human performance on clus- sium on Virtual Reality Modeling Language


(VRML 95, San Diego, CA, Dec. 14-15), D. R. PERKOWITZ, M. AND ETZIONI, O. 1999. Adaptive
Nadeau and J. L. Moreland, Chairs. ACM web sites: An ai challenge. Tech. Rep. Univer-
Press, New York, NY, 33-38. sity of Washington, Seattle, WA.
NAGAO, K. AND HASIDA, K. 1998. Automatic text PIRKOLA, A. 1998. The effects of query structure
summarization based on the global document and dictionary setups in dictionary-based
annotation. In Proceedings of the Conference cross-language information retrieval. In Pro-
on COLING-ACL. ceedings of the 21st Annual International
NAGAO, K., HOSOYA, S., KAWAKITA, Y., ARIGA, S., ACM SIGIR Conference on Research and De-
SHIRAI, Y., AND YURA, J. 1999. Semantic velopment in Information Retrieval (SIGIR
transcoding: Making the world wide web more 98, Melbourne, Australia, Aug. 24-28), W. B.
understandable and reusable by external an- Croft, A. Moffat, C. J. van Rijsbergen, R.
notations. Wilkinson, and J. Zobel, Chairs. ACM Press,
NAVARRO, G. 1998. Approximate text searching. New York, NY, 55-63.
Ph.D. Dissertation. Univ. of Chile, Santiago, PIROLLI, P., PITKOW, J., AND RAO, R. 1996. Silk
Chile. from a sow's ear: extracting usable structures
NG, R. AND HAN, J. 1994. Efficient and effective from the Web. In Proceedings of the ACM
methods for spatial data mining. In Proceed- Conference on Human Factors in Computing
ings of the 20th International Conference on Systems (CHI 96, Vancouver, B.C., Apr. 13-
Very Large Data Bases (VLDB94, Santiago, 18), M. J. Tauber, Ed. ACM Press, New York,
Chile, Sept.). VLDB Endowment, Berkeley, NY, 118-125.
CA. PLAISANT, C. 1994. Dynamic queries on a health
NIBLACK, W. 1993. The qbic project: Query by statistics atlas. Tech. Rep. University of
image by content using color, texture, and Maryland at College Park, College Park, MD.
shape. In Proceedings of the Conference on PRESCHEL, B. 1972. Indexer consistency in percep-
Storage and Retrieval for Image and Video tion of concepts and choice of terminology.
Databases. SPIE Press, Bellingham, WA, Final Rep. Columbia Univ., New York, NY.
173-187. RAGHAVAN, P. 1997. Information retrieval algo-
NIELSEN, J. 1993. Usability Engineering. Academ- rithms: A survey. In Proceedings of the Sym-
ic Press Prof., Inc., San Diego, CA. posium on Discrete Algorithms. ACM Press,
NIELSEN, J. 1999. User interface directions for the New York, NY.
Web. Commun. ACM 42, 1, 65-72. RAMAN, T. V. 1996. Emacspeak - a speech
NOMIYAMA, H., KUSHIDA, T., URAMOTO, N., IOKA, interface. In Proceedings of the ACM Confer-
M., KUSABA, M., KUSABA, J.-K., CHIGONO, A., ence on Human Factors in Computing Systems
ITOH, T., AND TSUJI, M. 1997. Information (CHI 96, Vancouver, B.C., Apr. 13-18), M. J.
navigation system for multimedia data. Res. Tauber, Ed. ACM Press, New York, NY, 66-
Rep. RT-0227. Research Laboratory, IBM 71.
Tokyo, Tokyo, Japan. RAO, R., PEDERSEN, J. O., HEARST, M. A., MACKIN-
OARD, D. 1997a. Cross-language text retrieval re- LAY, J. D., CARD, S. K., MASINTER, L., HAL-
search in the USA. In Proceedings of the VORSEN, P.-K., AND ROBERTSON, G. G. 1995.
Third Delos Workshop on ERCIM (Mar.). Rich interaction in the digital library. Com-
OARD, D. 1997b. Serving users in many languages. mun. ACM 38, 4 (Apr.), 29-39.
D-Lib Mag. 3, 1 (Jan.). RASMUSSEN, E. 1992. Clustering algorithms. In
OARD, D. 1997c. Speech-based information re- Information Retrieval: Data Structures and
trieval for digital libraries. Tech. Rep. Algorithms, W. B. Frakes and R. Baeza-Yates,
CS-TR-3778. University of Maryland at Col- Eds. Prentice-Hall, Inc., Upper Saddle River,
lege Park, College Park, MD. NJ, 419-442.
OARD, D. W. AND DORR, B. J. 1996. A survey of RATHA, N. K., KARU, K., CHEN, S., AND JAIN, A. K.
multilingual text retrieval. Tech. Rep. 1996. A real-time matching system for large
UMIACS-TR-96-19. University of Maryland fingerprint databases. IEEE Trans. Pattern
at College Park, College Park, MD. Anal. Mach. Intell. 18, 8, 799-813.
OMIECINSKI, E. AND SCHEUERMANN, P. 1990. A par- RENNISON, E. 1994. Galaxy of news: an approach
allel algorithm for record clustering. ACM to visualizing and understanding expansive
Trans. Database Syst. 15, 4 (Dec.), 599-624. news landscapes. In Proceedings of the 7th
OOGANE, T. AND ASAKAWA, C. 1998. An interactive Annual ACM Symposium on User Interface
method for accessing tables in HTML. In Pro- Software and Technology (UIST 94, Marina
ceedings of the Third International ACM Con- del Rey, CA, Nov. 2-4), P. Szekely, Chair.
ference on Assistive Technologies (Assets 98, ACM Press, New York, NY, 3-12.
Marina del Rey, CA, Apr. 15-17), M. M. Blat- RIVLIN, E., BOTAFOGO, R., AND SHNEIDERMAN, B.
tner and A. I. Karshmer, Chairs. ACM Press, 1994. Navigating in hyperspace: designing a
New York, NY, 126-128. structure-based toolbox. Commun. ACM 37, 2
PARLETT, B. N. 1998. The Symmetric Eigenvalue (Feb.), 87-96.
Problem. Prentice-Hall SIAM Classics in Ap- ROBERTSON, G. G., MACKINLAY, J. D., AND CARD, S.
plied Mathematics Series. Prentice-Hall, Inc., K. 1991. Cone trees: Animated 3D visualiza-
Upper Saddle River, NJ. tions of hierarchical information. In Proceedings


of the Conference on Human Factors in Com- image retrieval. IEEE Trans. Pattern Anal.
puting Systems: Reaching through Technology Mach. Intell. 18, 8, 849-853.
(CHI 91, New Orleans, LA, Apr. 27-May 2), SILBERSCHATZ, A., STONEBRAKER, M., AND ULLMAN,
S. P. Robertson, G. M. Olson, and J. S. Olson, J. 1995. Database research: Achievements
Eds. ACM Press, New York, NY, 189-194. and opportunities into the 21st century. In
SAKAIRI, T. 1999. A site map for visualizing both a Proceedings of the NSF Workshop on The Fu-
web site's structure and keywords. In Pro- ture of Database Research (May).
ceedings of the IEEE Conference on System, SMALL, H. 1973. Co-citation in the scientific liter-
Man, and Cybernetics (SMC 99). IEEE Com- ature: a new measure of the relationship be-
puter Society Press, Los Alamitos, CA, 200- tween two documents. J. Am. Soc. Inf. Sci. 24,
205. 265-269.
SALTON, G. 1969. A comparison between manual SMITH, J. R. AND CHANG, S.-F. 1996. VisualSEEk:
and automatic indexing methods. Am. Doc. A fully automated content-based image query
20, 1, 61-71. system. In Proceedings of the Fourth ACM
SALTON, G. 1970. Automatic text analysis. Science International Conference on Multimedia (Mul-
168, 335-343. timedia 96, Boston, MA, Nov. 18-22), P.
SALTON, G., ED. 1971. The Smart Retrieval Sys- Aigrain, W. Hall, T. D. C. Little, and V. M.
tem: Experiments in Automatic Document Bove, Chairs. ACM Press, New York, NY,
Processing. Prentice-Hall, Englewood Cliffs, 87-98.
NJ. SMITH, J. R. AND CHANG, S.-F. 1997a. Querying by
SALTON, G., ED. 1988. Automatic Text Processing. color regions using VisualSEEk content-based
Addison-Wesley Series in Computer Science. visual query system. In Intelligent Multi-
Addison-Wesley Longman Publ. Co., Inc., media Information Retrieval, M. T. Maybury,
Reading, MA. Ed. MIT Press, Cambridge, MA, 23-41.
SALTON, G. AND MCGILL, M. J. 1983. Introduction SMITH, J. AND CHANG, S.-F. 1997b. Searching for
to Modern Information Retrieval. McGraw- images and videos on the world-wide web.
Hill, Inc., Hightstown, NJ. IEEE MultiMedia.
SAMET, H. AND SOFFER, A. 1996. MARCO: MAp SMITH, Z. 1997. The truth about the web: Crawling
Retrieval by COntent. IEEE Trans. Pattern towards eternity. Web Tech. Mag. www.webt-
Anal. Mach. Intell. 18, 8, 783-798. echniques.com/features/1997/05/burner/burn-
SCHATZ, B. 1997. Information retrieval in digital er.html
libraries: Bringing search to the net. Science SMOLIAR, S. W. AND ZHANG, H. 1994. Content-
275, 327-334. based video indexing and retrieval. IEEE
SCHAUBLE, P. 1997. Multimedia Information Re- MultiMedia 1, 2 (Summer), 62-72.
trieval: Content-Based Information Retrieval SNEATH, P. H. A. AND SOKAL, R. R. 1973. Numeri-
from Large Text and Audio Databases. Klu- cal Taxonomy. Freeman, London, UK.
wer Academic Publishers, Hingham, MA. SOERGEL, D. 1985. Organizing Information: Prin-
SCIENTIFIC AMERICAN. 1997. The Internet: Ful- ciples of Data Base and Retrieval Systems.
filling the promise: special report. Scientific Academic Press library and information sci-
American, Inc., New York, NY. ence series. Academic Press Prof., Inc., San
SELBERG, E. AND ETZIONI, O. 1995a. The Diego, CA.
metacrawler architecture for resource aggre- SOFFER, A. AND SAMET, H. 2000. Pictorial query
gation on the web. IEEE Expert. specification for browsing through spatially-
SELBERG, E. AND ETZIONI, O. 1995b. Multiple ser- referenced image databases. J. Visual Lang.
vice search and comparison using the Comput.
metacrawler. In Proceedings of the Fourth SPARCK JONES, K. AND WILLETT, P., EDS. 1997.
International Conference on The World Wide Readings In Information Retrieval. Morgan
Web (Boston, MA). Kaufmann multimedia information and sys-
SHAKES, J., LANGHEINRICH, M., AND ETZIONI, O. tems series. Morgan Kaufmann Publishers
1997. Dynamic reference sifting: A case study Inc., San Francisco, CA.
in the homepage domain. In Proceedings of STOLFO, S. AND HERNANDEZ, M. 1995. The merge/
the Conference on The World Wide Web. 189- purge problem for large databases. In Pro-
200. ceedings of the ACM SIGMOD Conference on
SHIVAKUMAR, N. AND GARCIA-MOLINA, H. 1998. Management of Data (San Jose, CA, May).
Finding near-replicas of documents on the 127-138.
web. In Proceedings of the Workshop on Web STRATEGYALLEY. 1998. White paper on the viabil-
Databases (Valencia, Spain, Mar.). ity of the internet for business. home page:
SHNEIDERMAN, B. 1994. Dynamic queries for visual www.strategyalley.com/articles/inet1.htm.
information seeking. Tech. Rep. CS-TR-3022. STRZALKOWSKI, T. 1999. Natural Language Infor-
University of Maryland at College Park, Col- mation Retrieval. Kluwer Academic Publish-
lege Park, MD. ers, Hingham, MA.
SHNEIER, M. AND ABDEL-MOTTALEB, M. 1996. Ex- TAKEDA, K. AND NOMIYAMA, H. 1997. Information
ploiting the JPEG compression scheme for outlining and site outlining. In Proceedings of


the International Symposium on Digital Li- June 21-24), N. Belkin, P. Ingwersen, and A.
braries (ISDL97, Tsukuba, Japan). M. Pejtersen, Eds. ACM Press, New York, NY,
TETRANET SOFTWARE INC. 1998. Wisebot. Home 339-346.
page for Wisebots: www.tetranetsoftware.com/ WISE, J., THOMAS, J., PENNOCK, K., LANTRIP, D.,
products/wisebot.htm POTTIER, M., AND SCHUR, A. 1995. Visualizing
TUFTE, E. R. 1986. The Visual Display of Quanti- the non-visual: spatial analysis and interac-
tative Information. Graphics Press, Cheshire, tion with information from text documents. In
CT. Proceedings of the IEEE Conference on Infor-
VAN RIJSBERGEN, C. 1977. A theoretical basis for mation Visualization. IEEE Computer Society
the use of cooccurrence data in information Press, Los Alamitos, CA, 51-58.
retrieval. J. Doc. 33, 2. WITTEN, I. H., MOFFAT, A., AND BELL, T. C. 1994.
VAN RIJSBERGEN, C. 1979. Information Retrieval. Managing Gigabytes: Compressing and Index-
2nd ed. Butterworths, London, UK. ing Documents and Images. Van Nostrand
WALKER, J., CASE, T., JORASCH, J., AND SPARICO, T. Reinhold Co., New York, NY.
1996. Method, apparatus, and program for WU, J. K. AND NARASIMHALU, A. D. 1994. Identify-
pricing, selling, and exercising options to pur- ing faces using multiple retrievals. IEEE Multi-
chase airline tickets: U.S. Patent no. 5797127. Media 1, 2 (Summer), 27-38.
WALKER, J., SPARICO, T., AND CASE, T. 1997. Meth- ZAMIR, O. AND ETZIONI, O. 1998. Web document
od and apparatus for the sale of airline-speci- clustering: A feasibility demonstration. In
fied flight tickets: U.S. Patent no. 5897620. Proceedings of the 21st Annual International
WATANABE, H. AND TAKEDA, K. 1998. A pattern-
ACM SIGIR Conference on Research and De-
based machine translation system extended
velopment in Information Retrieval (SIGIR
by example-based processing. In Proceedings
98, Melbourne, Australia, Aug. 24-28), W. B.
of the Conference on COLING-ACL. 1369-
Croft, A. Moffat, C. J. van Rijsbergen, R.
1373.
Wilkinson, and J. Zobel, Chairs. ACM Press,
WEBSTER, K. AND PAUL, K. 1996. Beyond surfing:
Tools and techniques for searching the web. New York, NY, 46-54.
home page: magi.com/mmelick/it96jan.htm. ZAMIR, O., ETZIONI, O., MADANI, O., AND KARP, R.
WESTERA, G. 1996. Robot-driven search engine 1997. Fast and intuitive clustering of web
evaluation overview. www.curtin.edu.au/curtin/ documents. In Proceedings of the ACM SIG-
library/staffpages/gwpersonal/senginestudy/. MOD International Workshop on Data Mining
WHITE, H. AND MCCAIN, K. 1989. Bibliometrics. and Knowledge Discovery (SIGMOD-96, Aug.),
Annual Review Information Science and Tech- R. Ng, Ed. ACM Press, New York, NY, 287-
nology. 290.
WILLETT, P. 1988. Recent trends in hierarchic ZHANG, T., RAMAKRISHNAN, R., AND LIVNY, M. 1996.
document clustering: a critical review. Inf. Birch: An efficient data clustering method for
Process. Manage. 24, 5, 577-597. large databases. In Proceedings of the ACM-
WILLIAMS, M. 1984. What makes rabbit run? Int. SIGMOD Conference on Management of Data
J. Man-Mach. Stud. 2a, 1, 333-352. (Montreal, Canada, June). ACM, New York,
WILLIAMSON, C. AND SHNEIDERMAN, B. 1992. The NY.
dynamic HomeFinder: Evaluating dynamic ZHONG, D. AND CHANG, S.-F. 1997. Video object
queries in a real-estate information explora- model and segmentation for content-based
tion system. In Proceedings of the 15th An- video indexing. In Proceedings of the Interna-
nual International ACM Conference on Re- tional Conference on Circuits and Systems.
search and Development in Information IEEE Computer Society Press, Los Alamitos,
Retrieval (SIGIR 92, Copenhagen, Denmark, CA.

Received: November 1998; revised: April 2000; accepted: July 2000

