
COPYRIGHT 2002 SCIENTIFIC AMERICAN, INC.

ScientificAmerican.com
THE FUTURE OF THE WEB
special online issue no. 2
The dotcom bubble may have finally burst, but there can be no doubt that the Internet has forever changed the way
we communicate, do business and find information of all kinds. Scientific American has regularly covered the
advances making this transformation possible. And during the past five years alone, many leading researchers
and computer scientists have aired their views on the Web in our pages.

In this collection, expert authors discuss a range of topics, from XML and hypersearching the Web to filtering
information and preserving the Internet in one vast archive. Other articles cover more recent ideas, including
ways to make Web content more meaningful to machines and plans to create an operating system that would
span the Internet as a whole. --the Editors

TABLE OF CONTENTS
Filtering Information on the Internet
BY PAUL RESNICK; SCIENTIFIC AMERICAN, MARCH 1997
Look for the labels to decide if unknown software and World Wide Web sites are safe and interesting.

Preserving the Internet
BY BREWSTER KAHLE; SCIENTIFIC AMERICAN, MARCH 1997
An archive of the Internet may prove to be a vital record for historians, businesses and governments.

Searching the Internet
BY CLIFFORD LYNCH; SCIENTIFIC AMERICAN, MARCH 1997
Combining the skills of the librarian and the computer scientist may help organize the anarchy of the Internet.

XML and the Second-Generation Web
BY JON BOSAK AND TIM BRAY; SCIENTIFIC AMERICAN, MAY 1999
The combination of hypertext and a global Internet started a revolution. A new ingredient, XML,
is poised to finish the job.

Hypersearching the Web
BY MEMBERS OF THE CLEVER PROJECT; SCIENTIFIC AMERICAN, JUNE 1999
With the volume of on-line information in cyberspace growing at a breakneck pace, more effective search tools
are desperately needed. A new technique analyzes how Web pages are linked together.

The Semantic Web
BY TIM BERNERS-LEE, JAMES HENDLER, AND ORA LASSILA; SCIENTIFIC AMERICAN, MAY 2001
A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities.

The Worldwide Computer
BY DAVID P. ANDERSON AND JOHN KUBIATOWICZ; SCIENTIFIC AMERICAN, MARCH 2002
An operating system spanning the Internet would bring the power of millions of the world's
Internet-connected PCs to everyone's fingertips.

1 SCIENTIFIC AMERICAN SPECIAL ONLINE ISSUE APRIL 2002


FILTERING INFORMATION
ON THE INTERNET
Look for the labels to decide if unknown
software and World Wide Web sites are safe and interesting

by Paul Resnick

The Internet is often called a "global village," suggesting a huge but close-knit community that shares common values and experiences. The metaphor is misleading. Many cultures coexist on the Internet and at times clash. In its public spaces, people interact commercially and socially with strangers as well as with acquaintances and friends. The city is a more apt metaphor, with its suggestion of unlimited opportunities and myriad dangers.

To steer clear of the most obviously offensive, dangerous or just boring neighborhoods, users can employ some mechanical filtering techniques that identify easily definable risks. One technique is to analyze the contents of on-line material. Thus, virus-detection software searches for code fragments that it knows are common in virus programs. Services such as AltaVista and Lycos can either highlight or exclude World Wide Web documents containing particular words. My colleagues and I have been at work on another filtering technique based on electronic labels that can be added to Web sites to describe digital works. These labels can convey characteristics that require human judgment (whether the Web page is funny or offensive) as well as information not readily apparent from the words and graphics, such as the Web site's policies about the use or resale of personal data.

The Massachusetts Institute of Technology's World Wide Web Consortium has developed a set of technical standards called PICS (Platform for Internet Content Selection) so that people can electronically distribute descriptions of digital works in a simple, computer-readable form. Computers can process these labels in the background, automatically shielding users from undesirable material or directing their attention to sites of particular interest. The original impetus for PICS was to allow parents and teachers to screen materials they felt were inappropriate for children using the Net. Rather than censoring what is distributed, as the Communications Decency Act and other legislative initiatives have tried to do, PICS enables users to control what they receive.

FILTERING SYSTEM for the World Wide Web allows individuals to decide for themselves what they want to see. Users specify safety and content requirements (a), which label-processing software (b) then consults to determine whether to block access to certain pages (marked with a stop sign). Labels can be affixed by the Web site's author (c), or a rating agency can store its labels in a separate database (d).

What's in a Label?

PICS labels can describe any aspect of a document or a Web site. The first labels identified items that might run afoul of local indecency laws. For example, the Recreational Software Advisory Council (RSAC) adapted its computer-game rating system for the Internet. Each RSACi (the "i" stands for "Internet") label has four numbers, indicating levels of violence, nudity, sex and potentially offensive language. Another organization, SafeSurf, has developed a vocabulary with nine separate scales. Labels can reflect other concerns beyond indecency, however. A privacy vocabulary, for example, could describe Web sites' information practices, such as what personal information they collect and whether they resell it. Similarly, an intellectual-property vocabulary could describe the conditions under which an item could be viewed or reproduced [see "Trusted Systems," by Mark Stefik, page 78]. And various Web-indexing organizations could develop labels that indicate the subject categories or the reliability of information from a site.

Labels could even help protect computers from exposure to viruses. It has become increasingly popular to download small fragments of computer code, bug fixes and even entire applications from Internet sites. People generally trust
that the software they download will not introduce a virus; they could add a margin of safety by checking for labels that vouch for the software's safety. The vocabulary for such labels might indicate which virus checks have been run on the software or the level of confidence in the code's safety.

In the physical world, labels can be attached to the things they describe, or they can be distributed separately. For example, the new cars in an automobile showroom display stickers describing features and prices, but potential customers can also consult independent listings such as consumer-interest magazines. Similarly, PICS labels can be attached or detached. An information provider that wishes to offer descriptions of its own materials can directly embed labels in Web documents or send them along with items retrieved from the Web. Independent third parties can describe materials as well. For instance, the Simon Wiesenthal Center, which tracks the activities of neo-Nazi groups, could publish PICS labels that identify Web pages containing neo-Nazi propaganda. These labels would be stored on a separate server; not everyone who visits the neo-Nazi pages would see the Wiesenthal Center labels, but those who were interested could instruct their software to check automatically for the labels.

Software can be configured not merely to make its users aware of labels but to act on them directly. Several Web software packages, including CyberPatrol and Microsoft's Internet Explorer, already use the PICS standard to control users' access to sites. Such software can make its decisions based on any PICS-compatible vocabulary. A user who plugs in the RSACi vocabulary can set the maximum acceptable levels of language, nudity, sex and violence. A user who plugs in a software-safety vocabulary can decide precisely which virus checks are required.

In addition to blocking unwanted materials, label processing can assist in finding desirable materials. If a user expresses a preference for works of high literary quality, a search engine might be able to suggest links to items labeled that way. Or if the user prefers that personal data not be collected or sold, a Web server can offer a version of its service that does not depend on collecting personal information.

Establishing Trust

Not every label is trustworthy. The creator of a virus can easily distribute a misleading label claiming that the software is safe. Checking for labels merely converts the question of whether to trust a piece of software to one of trusting the labels. One solution is to use cryptographic techniques that can determine whether a document has been changed since its label was created and to ensure that the label really is the work of its purported author.

That solution, however, simply changes the question again, from one of trusting a label to one of trusting the label's author. Alice may trust Bill's labels if she has worked with him for years or if he runs a major software company whose reputation is at stake. Or she might trust an auditing organization of some kind to vouch for Bill.

[Figure panels: (a) user requirements; (b) label-processing software; (c) author labels; (d) database of independent labels. Sample labels shown: "Contains educational material," "Suitable for young children," "Free of viruses," "Contains material of high literary quality," "Contains violence," "Contains Nazi Party literature." Illustration: Bryan Christie.]
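The cryptographic check described under "Establishing Trust" can be sketched in a few lines. This is only an illustration and not the PICS mechanism itself: the function names and the label layout are hypothetical, SHA-256 is a modern choice that the article does not mention, and a keyed hash (HMAC) with a shared secret stands in for the public-key signature that a real labeling service would more plausibly use.

```python
import hashlib
import hmac


def make_label(document: bytes, rating: str, author_key: bytes) -> dict:
    """Create a label bound to one exact version of a document.

    The label layout is hypothetical, not the PICS wire format.
    """
    digest = hashlib.sha256(document).hexdigest()
    payload = f"{rating}|{digest}".encode()
    # HMAC stands in for a real digital signature (an assumption of this sketch).
    signature = hmac.new(author_key, payload, hashlib.sha256).hexdigest()
    return {"rating": rating, "digest": digest, "signature": signature}


def check_label(document: bytes, label: dict, author_key: bytes) -> bool:
    """Return True only if the document is unchanged and the label is authentic."""
    # 1. Has the document changed since the label was created?
    if hashlib.sha256(document).hexdigest() != label["digest"]:
        return False
    # 2. Was the label really produced by the holder of author_key?
    payload = f"{label['rating']}|{label['digest']}".encode()
    expected = hmac.new(author_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, label["signature"])
```

A passing check says only that the holder of the key issued the label for this exact document. Whether that key holder deserves trust is the separate question the article raises next.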

COMPUTER CODE for a PICS standards label is typically read by label-processing software, not humans. This sample label rates both the literary quality and the violent content of the Web site http://www.w3.org/PICS (Illustration: Jennifer C. Christiansen)
Of course, some labels address matters of personal taste rather than points of fact. Users may find themselves not trusting certain labels, simply because they disagree with the opinions behind them. To get around this problem, systems such as GroupLens and Firefly recommend books, articles, videos or musical selections based on the ratings of like-minded people. People rate items with which they are familiar, and the software compares those ratings with opinions registered by other users. In making recommendations, the software assigns the highest priority to items approved by people who agreed with the user's evaluations of other materials. People need not know who agreed with them; they can participate anonymously, preserving the privacy of their evaluations and reading habits.

Widespread reliance on labeling raises a number of social concerns. The most obvious are the questions of who decides how to label sites and what labels are acceptable. Ideally, anyone could label a site, and everyone could establish individual filtering rules. But there is a concern that authorities could assign labels to sites or dictate criteria for sites to label themselves. In an example from a different medium, the television industry, under pressure from the U.S. government, has begun to rate its shows for age appropriateness.

Mandatory self-labeling need not lead to censorship, so long as individuals can decide which labels to ignore. But people may not always have this power. Improved individual control removes one rationale for central control but does not prevent its imposition. Singapore and China, for instance, are experimenting with national "firewalls": combinations of software and hardware that block their citizens' access to certain newsgroups and Web sites.

Another concern is that even without central censorship, any widely adopted vocabulary will encourage people to make lazy decisions that do not reflect their values. Today many parents who may not agree with the criteria used to assign movie ratings still forbid their children to see movies rated PG-13 or R; it is too hard for them to weigh the merits of each movie by themselves.

Labeling organizations must choose vocabularies carefully to match the criteria that most people care about, but even so, no single vocabulary can serve everyone's needs. Labels concerned only with rating the level of sexual content at a site will be of no use to someone concerned about hate speech. And no labeling system is a full substitute for a thorough and thoughtful evaluation: movie reviews in a newspaper can be far more enlightening than any set of predefined codes.

Perhaps most troubling is the suggestion that any labeling system, no matter how well conceived and executed, will tend to stifle noncommercial communication. Labeling requires human time and energy; many sites of limited interest will probably go unlabeled. Because of safety concerns, some people will block access to materials that are unlabeled or whose labels are untrusted. For such people, the Internet will function more like broadcasting, providing access only to sites with sufficient mass-market appeal to merit the cost of labeling.

While lamentable, this problem is an inherent one that is not caused by labeling. In any medium, people tend to avoid the unknown when there are risks involved, and it is far easier to get information about material that is of wide interest than about items that appeal to a small audience.

Although the Net nearly eliminates the technical barriers to communication with strangers, it does not remove the social costs. Labels can reduce those costs, by letting us control when we extend trust to potentially boring or dangerous software or Web sites. The challenge will be to let labels guide our exploration of the global city of the Internet and not limit our travels.

The Author

PAUL RESNICK joined AT&T Labs Research in 1995 as the founding member of the Public Policy Research group. He is also chairman of the PICS working group of the World Wide Web Consortium. Resnick received his Ph.D. in computer science in 1992 from the Massachusetts Institute of Technology and was an assistant professor at the M.I.T. Sloan School of Management before moving to AT&T.

Further Reading

Rating the Net. Jonathan Weinberg in Hastings Communications and Entertainment Law Journal, Vol. 19; March 1997 (in press). Available on the World Wide Web at http://www.msen.com/~weinberg/rating.htm
Recommender Systems. Special section in Communications of the ACM, Vol. 40, No. 3; March 1997 (in press).
The Platform for Internet Content Selection home page is available on the World Wide Web at http://www.w3.org/PICS
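The article's description of recommendations based on the ratings of like-minded people (as in GroupLens and Firefly) can be made concrete with a deliberately simplified sketch. It is not the algorithm either system actually uses. The +1/-1 rating scale, the agreement score and the function names are assumptions chosen to keep the idea visible.

```python
from collections import defaultdict
from typing import Dict, List

# user -> item -> rating, where +1 means "liked" and -1 means "disliked"
Ratings = Dict[str, Dict[str, int]]


def agreement(a: Dict[str, int], b: Dict[str, int]) -> int:
    """Score how much two people agree on the items both have rated."""
    shared = set(a) & set(b)
    # Matching opinions add 1, opposite opinions subtract 1.
    return sum(a[item] * b[item] for item in shared)


def recommend(user: str, ratings: Ratings) -> List[str]:
    """Suggest unseen items, giving priority to those approved by like-minded raters."""
    mine = ratings[user]
    scores: Dict[str, float] = defaultdict(float)
    for other, theirs in ratings.items():
        if other == user:
            continue
        weight = agreement(mine, theirs)
        if weight <= 0:
            continue  # ignore people who did not agree with this user
        for item, rating in theirs.items():
            if item not in mine:
                scores[item] += weight * rating
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked if score > 0]
```

Because the function needs only the ratings themselves, users never have to learn who agreed with them, which is the anonymity property the article points out.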

PRESERVING
THE INTERNET
An archive of the Internet may prove to be a vital record for
historians, businesses and governments

by Brewster Kahle

Manuscripts from the library of Alexandria in ancient Egypt disappeared in a fire. The early printed books decayed into unrecognizable shreds. Many of the oldest cinematic films were recycled for their silver content. Unfortunately, history may repeat itself in the evolution of the Internet and its World Wide Web.

No one has tried to capture a comprehensive record of the text and images contained in the documents that appear on the Web. The history of print and film is a story of loss and partial reconstruction. But this scenario need not be repeated for the Web, which has increasingly evolved into a storehouse of valuable scientific, cultural and historical information.

The dropping costs of digital storage mean that a permanent record of the Web and the rest of the Internet can be preserved by a small group of technical professionals equipped with a modest complement of computer workstations and data storage devices. A year ago I and a few others set out to realize this vision as part of a venture known as the Internet Archive.

By the time this article is published, we will have taken a snapshot of all parts of the Web freely and technically accessible to us. This collection of data will measure perhaps as much as two trillion bytes (two terabytes) of data, ranging from text to video to audio recording. In comparison, the Library of Congress contains about 20 terabytes of text information. In the coming months, our computers and storage media will make records of other areas of the Internet, including the Gopher information system and the Usenet bulletin boards. The material gathered so far has already proved a useful resource to historians. In the future, it may provide the raw material for a carefully indexed, searchable library.

The logistics of taking a snapshot of the Web are relatively simple. Our Internet Archive operates with a staff of 10 people from offices located in a converted military base, the Presidio, in downtown San Francisco; it also runs an information-gathering computer in the San Diego Supercomputer Center at the University of California at San Diego.

The software on our computers "crawls" the Net, downloading documents, called pages, from one site after another. Once a page is captured, the software looks for cross references, or links, to other pages. It uses the Web's hyperlinks (addresses embedded within a document page) to move to other pages. The software then makes copies again and seeks additional links contained in the new pages. The crawler avoids downloading duplicate copies of pages by checking the identification names, called uniform resource locators (URLs), against a database. Programs such as Digital Equipment Corporation's AltaVista also employ crawler software for indexing Web sites.

What makes this experiment possible is the dropping cost of data storage. The price of a gigabyte (a billion bytes) of hard-disk space is $200, whereas tape storage using an automated mounting device costs $20 a gigabyte. We chose hard-disk storage for a small amount of data that users of the archive are likely to access frequently and a robotic device that mounts and reads tapes automatically for less used information. A disk drive accesses data in an average of 15 milliseconds, whereas tapes require four minutes. Frequently accessed information might be historical documents or a set of URLs no longer in use.

We plan to update the information gathered at least every few months. The first full record required nearly a year to compile. In future passes through the Web, we will be able to update only the information that has changed since our last perusal.

The text, graphics, audio clips and other data collected from the Web will never be comprehensive, because the crawler software cannot gain access to many of the hundreds of thousands of sites. Publishers restrict access to data or store documents in a format inaccessible to simple crawler programs. Still, the archive gives a feel of what the Web looks like during a given period of time even though it does not constitute a full record.

After gathering and storing the public contents of the Internet, what services will the archive provide? We possess the capability of supplying documents that are no longer available from the original publisher, an important function if the Web's hypertext system is to become a medium for scholarly publishing. Such a service could also prove worthwhile for business research. And the archival data might serve as a "copy of record" for the government or other institutions with publicly available documents. So, over time, the archive would come to resemble a digital library.

Keeping Missing Links

Historians have already found the material useful. David Allison of the Smithsonian Institution has tapped into the archive for a presidential election Web site exhibit at the museum, a project he compares to saving videotapes of early television campaign advertisements. Many of the links for these Web sites, such as those for Texas Senator Phil Gramm's campaign, have already disappeared from the Internet.

Creating an archive touches on an array of issues, from privacy to copyright. What if a college student created a Web page that had pictures of her then current boyfriend? What if she later wanted to "tear them up," so to speak, yet they lived on in the archive? Should she have the right to remove them? In contrast, should a public figure (a U.S. senator, for instance) be able to erase data posted from his or her college years? Does collecting information made available to the public violate the "fair use" provisions of the copyright law? The issues are not easily resolved.

To address these worries, we let authors exclude their works from the archive. We are also considering allowing researchers to obtain broad censuses of the archive data instead of individual documents: one could count the total number of references to pachyderms on the Web, for instance, but not look at a specific elephant home page. These measures, we hope, will suffice to allay immediate concerns about privacy and intellectual-property rights. Over time, the issues addressed in setting up the Internet Archive might help resolve the larger policy debates on intellectual property and privacy by testing concepts such as fair use on the Internet.

The Internet Archive complements other projects intended to ensure the longevity of information on the Internet. The Commission on Preservation and Access in Washington, D.C., researches how to ensure that data are not lost as the standard formats for digital storage media change over the years. In another effort, the Internet Engineering Task Force and other groups have labored on technical standards that give a unique identification name to digital documents. These uniform resource names (URNs), as they are called, could supplement the URLs that currently access Web documents. Giving a document a URN attempts to ensure that it can be traced after a link disappears, because estimates put the average lifetime for a URL at 44 days. The URN would be able to locate other URLs that still provided access to the desired documents.

Other, more limited attempts to archive parts of the Internet have also begun. DejaNews keeps a record of messages on the Usenet bulletin boards, and InReference archives Internet mailing lists. Both support themselves with revenue from advertisers, a possible funding source for the Internet Archive as well. Until now, I have funded the project with money I received from the sale of an Internet software and services company. Major computer companies have also donated equipment.

It will take many years before an infrastructure that assures Internet preservation becomes well established, and for questions involving intellectual-property issues to resolve themselves. For our part, we feel that it is important to proceed with the collection of the archival material because it can never be recovered in the future. And the opportunity to capture a record of the birth of a new medium will then be lost.

The Author

BREWSTER KAHLE founded the Internet Archive in April 1996. He invented the Wide Area Information Servers (WAIS) system in 1989 and started a company, WAIS, Inc., in 1992 to commercialize this Internet publishing software. The company helped to bring commercial and government agencies onto the Internet by selling publishing tools and production services. Kahle also served as a principal designer of the Connection Machine, a supercomputer produced by Thinking Machines. He received a bachelor's degree from the Massachusetts Institute of Technology in 1982.
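The crawling procedure the article describes (capture a page, look for its links, follow them, and skip any URL already recorded) can be sketched as follows. This is a minimal illustration and not the Internet Archive's software. The `fetch` callable, the in-memory `seen` set standing in for the URL database, and the `max_pages` cap are all assumptions of the sketch.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags found in one page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(
    start_url: str,
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> Dict[str, str]:
    """Breadth-first crawl; `fetch` returns a page's HTML or None if unavailable."""
    archive: Dict[str, str] = {}   # captured pages, keyed by URL
    seen = {start_url}             # stand-in for the database of known URLs
    queue = deque([start_url])
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue               # inaccessible page: skip it
        archive[url] = page
        parser = LinkExtractor()
        parser.feed(page)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts before comparing.
            link = urldefrag(urljoin(url, href)).url
            if link not in seen:   # avoid downloading duplicate copies
                seen.add(link)
                queue.append(link)
    return archive
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; for instance, `crawl("http://example.org/", pages.get)` walks a dictionary of canned pages. A production crawler would also need politeness delays, robots exclusion handling and persistent storage, none of which is shown here.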

SEARCHING THE INTERNET
Combining the skills of the librarian and the computer scientist
may help organize the anarchy of the Internet

by Clifford Lynch

One sometimes hears the Internet characterized as the world's library for the digital age. This description does not stand up under even casual examination. The Internet, and particularly its collection of multimedia resources known as the World Wide Web, was not designed to support the organized publication and retrieval of information, as libraries are. It has evolved into what might be thought of as a chaotic repository for the collective output of the world's digital "printing presses." This storehouse of information contains not only books and papers but raw scientific data, menus, meeting minutes, advertisements, video and audio recordings, and transcripts of interactive conversations. The ephemeral mixes everywhere with works of lasting importance.

In short, the Net is not a digital library. But if it is to continue to grow and thrive as a new means of communication, something very much like traditional library services will be needed to organize, access and preserve networked information. Even then, the Net will not resemble a traditional library, because its contents are more widely dispersed than a standard collection. Consequently, the librarian's classification and selection skills must be complemented by the computer scientist's ability to automate the task of indexing and storing information. Only a synthesis of the differing perspectives brought by both professions will allow this new medium to remain viable.

SEARCH ENGINE operates by visiting, or "crawling" through, World Wide Web sites, pictured as blue globes. The yellow and blue lines represent the output from and input to the engine's server (red tower at center), where Web pages are downloaded. Software on the server computes an index (tan page) that can be accessed by users. (Illustration: Bryan Christie)

At the moment, computer technology bears most of the responsibility for organizing information on the Internet. In theory, software that automatically classifies and indexes collections of digital data can address the glut of information on the Net and the inability of human indexers and bibliographers to cope with it. Automating information access has the advantage of directly exploiting the rapidly dropping costs of computers and avoiding the high expense and delays of human indexing.

But, as anyone who has ever sought information on the Web knows, these automated tools categorize information differently than people do. In one sense, the job performed by the various indexing and cataloguing tools known as search engines is highly democratic. Machine-based approaches provide uniform
and equal access to all the information on the Net. In practice, this electronic egalitarianism can prove a mixed blessing. Web surfers who type in a search request are often overwhelmed by thousands of responses. The search results frequently contain references to irrelevant Web sites while leaving out others that hold important material.

GROWTH AND CHANGE on the Internet are reflected in the burgeoning number of Web sites, host computers and commercial, or ".com," sites.

DATE         APPROXIMATE NUMBER OF WEB SITES   .com SITES (PERCENT OF ALL SITES)
JUNE 1993    130                               2
DEC. 1993    620                               5
JUNE 1994    2,740                             14
DEC. 1994    10,000                            18
JUNE 1995    23,500                            31
JAN. 1996    100,000                           50
JUNE 1996    230,000                           68
JAN. 1997    650,000                           63

DATE         NUMBER OF HOST COMPUTERS (IN MILLIONS)
JAN. 1993    1.3
JAN. 1994    2.2
JAN. 1995    4.9
JAN. 1996    9.5
JULY 1996    12.9

SOURCE: MATTHEW K. GRAY; BRYAN CHRISTIE

Crawling the Web

The nature of electronic indexing can be understood by examining the way Web search engines, such as Lycos or Digital Equipment Corporation's AltaVista, construct indexes and find information requested by a user. Periodically, they dispatch programs (sometimes referred to as Web crawlers, spiders or indexing robots) to every site they can identify on the Web, each site being a set of documents, called pages, that can be accessed over the network. The Web crawlers download and then examine these pages and extract indexing information that can be used to describe them. This process, details of which vary among search engines, may include simply locating most of the words that appear in Web pages or performing sophisticated analyses to identify key words and phrases. These data are then stored in the search engine's database, along with an address, termed a uniform resource locator (URL), that represents where the file resides. A user then deploys a browser, such as the familiar Netscape, to submit queries to the search engine's database. The query produces a list of Web resources, the URLs that can be clicked on to connect to the sites identified by the search.

Existing search engines service millions of queries a day. Yet it has become clear that they are less than ideal for retrieving an ever growing body of information on the Web. In contrast to human indexers, automated programs have difficulty identifying characteristics of a document such as its overall theme or its genre: whether it is a poem or a play, or even an advertisement.

The Web, moreover, still lacks standards that would facilitate automated indexing. As a result, documents on the Web are not structured so that programs can reliably extract the routine information that a human indexer might find through a cursory inspection: author, date of publication, length of text and subject matter. (This information is known as metadata.) A Web crawler might turn up the desired article authored by Jane Doe. But it might also find thousands of other articles in which such a common name is mentioned in the text or in a bibliographic reference.

Publishers sometimes abuse the indiscriminate character of automated indexing. A Web site can bias the selection process to attract attention to itself by repeating within a document a word, such as "sex," that is known to be queried often. The reason: a search engine will display first the URLs for the documents that mention a search term most frequently. In contrast, humans can easily see around simpleminded tricks.

The professional indexer can describe the components of individual pages of all sorts (from text to video) and can clarify how those parts fit together into a database of information. Civil War photographs, for example, might form part of a collection that also includes period music and soldier diaries. A human indexer can describe a site's rules for the collection and retention of programs in, say, an archive that stores Macintosh software. Analyses of a site's purpose, history and policies are beyond the capabilities of a crawler program.

Another drawback of automated indexing is that most search engines recognize text only. The intense interest in the Web, though, has come about because of the medium's ability to display images, whether graphics or video clips. Some research has moved forward toward finding colors or patterns within images [see box on next two pages]. But no program can deduce the underlying meaning and cultural significance of an image (for example, that a group of men dining represents the Last Supper).

At the same time, the way information is structured on the Web is changing so that it often cannot be examined by Web crawlers. Many Web pages are no longer static files that can be analyzed and indexed by such programs. In many cases, the information displayed in a document is computed by the Web site during a search in response to the user's request. The site might assemble a map, a table and a text document from different areas of its database, a disparate collection of information that conforms to the user's query. A newspaper's Web site, for instance, might allow a reader to specify that only stories on the oil-equipment business be displayed in a personalized version of the paper. The database of stories from which this document is put together could not be searched by a Web crawler that visits the site.

A growing body of research has attempted to address some of the problems involved with automated classification methods. One approach seeks to attach metadata to files so that indexing systems can collect this information. The most advanced effort is the Dublin Core Metadata program and an affiliated endeavor, the Warwick Framework, the first named after a workshop in Dublin, Ohio, the other for a colloquy in Warwick, England. The workshops have defined a set of metadata elements that are simpler than those in traditional library cataloguing and have also created methods for incorporating them within pages on the Web.

AUTOMATED INDEXING, used by Web crawler software, analyzes a page (left panel) by designating most words as indexing terms (top center) or by grouping words into simple phrases (bottom center). Human indexing (right) gives additional context about the subject of a page. [Figure panels: PAGE; AUTOMATIC INDEXING; HUMAN INDEXING.]

Categorization of metadata might range from title or author to type of document (text or video, for instance). Either automated indexing software or humans may derive the metadata, which can then be attached to a Web page for retrieval by a crawler. Precise and detailed human annotations can provide a more in-depth characterization of a
M page than can an automated indexing
A program alone.
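The ranking behavior described in this part of the article, in which the documents that mention a search term most often are listed first, can be illustrated with a toy index. This is a minimal sketch under stated assumptions, not the method of any particular search engine: the URLs, the page texts and the function names are all invented for illustration.

```python
import re
from collections import Counter

def index_page(text):
    """Count every word on a page, the way a simple crawler might."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def rank(pages, term):
    """Return the URLs that mention the term, most frequent mentions first."""
    counts = {url: index_page(text)[term.lower()] for url, text in pages.items()}
    hits = [url for url in counts if counts[url] > 0]
    return sorted(hits, key=lambda url: counts[url], reverse=True)

# Hypothetical pages: the second one repeats a popular word to bias the ranking.
pages = {
    "http://example.org/essay": "An essay on sex differences in bird song.",
    "http://example.org/spam": "sex sex sex sex buy our oil equipment",
    "http://example.org/poem": "A poem about the sea.",
}

ranking = rank(pages, "sex")
print(ranking)
```

Because the score is nothing more than a word count, the page that repeats the word outranks the page that is actually about the subject, which is the abuse a human indexer would see through.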
BRYAN CHRISTIE

Where costs can be justified, human indexers have begun the laborious task of compiling bibliographies of some Web sites. The Yahoo database, a commercial venture, classifies sites by broad subject area. And a research project at the University of Michigan is one of

several efforts to develop more formal descriptions of sites that contain material of scholarly interest.

Not Just a Library

The extent to which either human classification skills or automated indexing and searching strategies are needed will depend on the people who use the Internet and on the business prospects for publishers. For many communities of scholars, the model of an organized collection (a digital library) still remains relevant. For other groups, an uncontrolled, democratic medium may provide the best vehicle for information dissemination. Some users, from financial analysts to spies, want comprehensive access to raw databases of information, free of any controls or editing. For them, standard search engines provide real benefits because they forgo any selective filtering of data.

The diversity of materials on the Net goes far beyond the scope of the traditional library. A library does not provide quality rankings of the works in a collection. Because of the greater volume of networked information, Net users want guidance about where to spend the limited amount of time they have to research a subject. They may need to know the three best documents for a given purpose. They want this information without paying the costs of employing humans to critique the myriad Web sites. One solution that again calls for human involvement is to share judgments about what is worthwhile. Software-based rating systems have begun to let users describe the quality of particular Web sites [see "Filtering Information on the Internet," by Paul Resnick, page 62].

Software tools search the Internet and also separate the good from the bad. New programs may be needed, though, to ease the burden of feeding the crawlers that repeatedly scan Web sites. Some Web site managers have reported that their computers are spending enormous amounts of time in providing crawlers with information to index, instead of servicing the people they hope to attract with their offerings.

To address this issue, Mike Schwartz

Finding Pictures on the Web

by Gary Stix, staff writer

The Internet came into its own a few years ago, when the World Wide Web arrived with its dazzling array of photography, animation, graphics, sound and video that ranged in subject matter from high art to the patently lewd. Despite the multimedia barrage, finding things on the hundreds of thousands of Web sites still mostly requires searching indexes for words and numbers.

Someone who types the words "French flag" into the popular search engine AltaVista might retrieve the requested graphic, as long as it were captioned by those two identifying words. But what if someone could visualize a blue, white and red banner but did not know its country of origin?

Ideally, a search engine should allow the user to draw or scan in a rectangle with vertical thirds that are colored blue, white and red, and then find any matching images stored on myriad Web sites. In the past few years, techniques that combine key-word indexing with image analysis have begun to pave the way for the first image search engines.

Although these prototypes suggest possibilities for the indexing of visual information, they also demonstrate the crudeness of existing tools and the continuing reliance on text to track down imagery. One project, called WebSEEk, based at Columbia University, illustrates the workings of an image search engine. WebSEEk begins by downloading files found by trolling the Web. It then attempts to locate file names containing acronyms, such as GIF or MPEG, that designate graphics or video content. It also looks for words in the names that might identify the subject of the files. When the software finds an image, it analyzes the prevalence of different colors and where they are located. Using this information, it can distinguish among photographs, graphics and black-and-white or gray images. The software also compresses each picture so that it can be represented as an icon, a miniature image for display alongside other icons. For a video, it will extract key frames from different scenes.

A user begins a search by selecting a category from a menu ("cats," for example). WebSEEk provides a sampling of icons for the "cats" category. To narrow the search, the user can click on any icons that show black cats. Using its previously generated color analysis, the search engine looks for matches of images that have a similar color profile. The presentation of the next set of icons may show black cats but also some marmalade cats sitting on black cushions. A visitor to WebSEEk can refine a search by adding or excluding certain colors from an image when initiating subsequent queries. Leaving out yellows or oranges might get rid of the odd marmalade. More simply, when presented with a series of icons, the user can also specify those images that do not contain black cats in order to guide the program away from mistaken choices. So far WebSEEk has downloaded and indexed more than 650,000 pictures from tens of thousands of Web sites.

Other image-searching projects include efforts at the University of Chicago, the University of California at San Diego, Carnegie Mellon University, the Massachusetts Institute of Technology's Media Lab and the University of California at Berkeley. A number of commercial companies, including IBM and Virage, have crafted software that can be used for searching corporate networks or databases. And two companies, Excalibur Technologies and Interpix Software, have collaborated to supply software to the Web-based indexing concerns Yahoo and Infoseek.

One of the oldest image searchers, IBM's Query by Image Content (QBIC), produces more sophisticated matching of image features than, say, WebSEEk can. It is able not only to pick out the colors in an image but also to gauge texture by several measures: contrast (the black and white of zebra stripes), coarseness (stones versus pebbles) and directionality (linear fence posts versus omnidirectional flower petals). QBIC also has a limited ability to search for shapes within an image. Specifying a pink dot on a green background turns up flowers and other photographs with similar shapes and colors, as shown above. Possible applications range from the selection of wallpaper patterns to enabling police to identify gang members by clothing type.

All these programs do nothing more than match one visual feature with another. They still require a human observer (or accompanying text) to confirm whether an object is a cat or a cushion. For more than a decade, the artificial-intelligence community has labored, with mixed success, on nudging computers to ascertain directly the identity of objects within an image, whether they are cats or national flags. This approach correlates the shapes in a picture with geometric models of real-world objects. The program can then deduce that a pink or brown cylinder, say, is a human arm.

One example is software that looks for naked people, a program that is the work of David A. Forsyth of Berkeley and Margaret M. Fleck of the University of Iowa. The software begins by analyzing the color and texture of a photograph. When it finds matches for flesh colors, it runs an algorithm that looks for cylindrical areas that might correspond to an arm or leg. It then seeks other flesh-colored cylinders, positioned at certain angles, which might confirm the presence of limbs. In a test last fall, the program picked out 43 percent of the 565 naked people among a group of 4,854 images, a high percentage for this type of complex image analysis. It registered, moreover, only a 4 percent false positive rate among the 4,289 images that did not contain naked bodies. The nudes were downloaded from the Web; the other photographs came primarily from commercial databases.

The challenges of computer vision will most likely remain for a decade or so to come. Searches capable of distinguishing clearly among nudes, marmalades and national flags are still an unrealized dream. As time goes on, though, researchers would like to give the programs that collect information from the Internet the ability to understand what they see.

IBM CORPORATION/ROMTECH/COREL
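Matching by color profile, as the sidebar describes for WebSEEk, can be sketched with a coarse color histogram. The code below is only an illustration and makes no claim about how WebSEEk or QBIC is implemented: the bin size, the histogram-intersection score and the tiny hand-made "images" are all assumptions chosen to keep the example short.

```python
from collections import Counter

def color_profile(pixels, levels=4):
    """Coarse color histogram: the fraction of pixels falling in each RGB bin."""
    step = 256 // levels
    bins = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in bins.items()}

def similarity(profile_a, profile_b):
    """Histogram intersection: 1.0 for identical profiles, 0.0 for disjoint ones."""
    return sum(min(share, profile_b.get(bin_, 0.0)) for bin_, share in profile_a.items())

BLACK, ORANGE, GREEN = (10, 10, 10), (230, 140, 30), (40, 160, 60)

# Hypothetical tiny "images", each a flat list of RGB pixels.
black_cat = [BLACK] * 80 + [GREEN] * 20
marmalade_on_black_cushion = [BLACK] * 60 + [ORANGE] * 40
lawn = [GREEN] * 100

query = color_profile(black_cat)
scores = {
    name: similarity(query, color_profile(pixels))
    for name, pixels in [("marmalade", marmalade_on_black_cushion), ("lawn", lawn)]
}
print(scores)
```

The marmalade cat on a black cushion scores well against the black-cat query because the score sees only the share of dark pixels, not what they depict, which is the kind of mistaken match the sidebar mentions.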

and his colleagues at the University of Colorado at Boulder developed software, called Harvest, that lets a Web site compile indexing data for the pages it holds and ship the information on request to the Web sites for the various search engines. In so doing, Harvest's automated indexing program, or gatherer, can avoid having a Web crawler export the entire contents of a given site across the network.

Crawler programs bring a copy of each page back to their home sites to extract the terms that make up an index, a process that consumes a great deal of network capacity (bandwidth). The gatherer, instead, sends only a file of indexing terms. Moreover, it exports only information about those pages that have been altered since they were last accessed, thus alleviating the load on the network and the computers tied to it.

Gatherers might also serve a different function. They may give publishers a framework to restrict the information that gets exported from their Web sites. This degree of control is needed because the Web has begun to evolve beyond a distribution medium for free information. Increasingly, it facilitates access to proprietary information that is furnished for a fee. This material may not be open for the perusal of Web crawlers. Gatherers, though, could distribute only the information that publishers wish to make available, such as links to summaries or samples of the information stored at a site.

As the Net matures, the decision to opt for a given information collection method will depend mostly on users. For which users will it then come to resemble a library, with a structured approach to building collections? And for whom will it remain anarchic, with access supplied by automated systems?

Users willing to pay a fee to underwrite the work of authors, publishers, indexers and reviewers can sustain the tradition of the library. In cases where information is furnished without charge or is advertiser supported, low-cost computer-based indexing will most likely dominate, the same unstructured environment that characterizes much of the contemporary Internet. Thus, social and economic issues, rather than technological ones, will exert the greatest influence in shaping the future of information retrieval on the Internet.

HARVEST, a new search-engine architecture, would derive indexing terms using software called gatherers that reside at Web sites (brown boxes near globes) or operate in a central computer (brown hexagon). By so doing, the search engine can avoid downloading all the documents from a Web site, an activity that burdens network traffic. The search engine's server (red structure at center) would simply ask the gatherers (dark blue arrows) for a file of key words (red arrows) that could be processed into an index (tan page) for querying by a user.

BRYAN CHRISTIE
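The gatherer idea, in which a site ships a small file of indexing terms and only for pages altered since the last request, can be sketched as follows. This is not the Harvest software or its interface; the class name, the `export` method, the integer timestamps and the crude term extraction are all hypothetical, chosen only to show the shape of the exchange.

```python
import re

def summarize(text, max_terms=5):
    """Reduce a page to a short, sorted list of distinct index terms."""
    words = set(re.findall(r"[a-z]{4,}", text.lower()))
    return sorted(words)[:max_terms]

class Gatherer:
    """Runs at the Web site and exports index terms instead of whole pages."""

    def __init__(self, pages):
        # pages maps URL -> (last_modified, text); both values are made up here.
        self.pages = pages

    def export(self, since):
        """Return terms only for pages changed after the `since` timestamp."""
        return {
            url: summarize(text)
            for url, (modified, text) in self.pages.items()
            if modified > since
        }

site = Gatherer({
    "/drilling.html": (100, "Offshore drilling equipment prices rise"),
    "/archive.html": (40, "Archive of older stories about pipelines"),
})

first = site.export(since=0)     # a first visit gets every page's terms
update = site.export(since=50)   # a later visit gets only the altered page
print(update)
```

A search engine that remembers when it last asked can pass that time as `since`, so unchanged pages cost nothing to re-index. The same filter is also where a publisher could withhold pages it does not wish to export.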

The Author

CLIFFORD LYNCH is director of library automation at the University of California's Office of the President, where he oversees MELVYL, one of the largest public-access information retrieval systems. Lynch, who received a doctorate in computer science from the University of California, Berkeley, also teaches at Berkeley's School of Information Management and Systems. He is a past president of the American Society for Information Science and a fellow of the American Association for the Advancement of Science. He leads the Architectures and Standards Working Group for the Coalition for Network Information.

Further Reading

The Harvest Information Discovery and Access System. C. M. Bowman et al. in Computer Networks and ISDN Systems, Vol. 28, Nos. 1-2, pages 119-125; December 1995.

The Harvest Information Discovery and Access System is available on the World Wide Web at http://harvest.transarc.com

The Warwick Metadata Workshop: A Framework for the Deployment of Resource Description. Lorcan Dempsey and Stuart L. Weibel in D-lib Magazine, July-August 1996. Available on the World Wide Web at http://www.dlib.org/dlib/july96/07contents.html

The Warwick Framework: A Container Architecture for Diverse Sets of Metadata. Carl Lagoze, ibid.

XML and the Second-Generation Web

by Jon Bosak and Tim Bray

The combination of hypertext and a global Internet started a revolution. A new ingredient, XML, is poised to finish the job

XML BRIDGES the incompatibilities of computer systems, allowing people to search for and exchange scientific data, commercial products and multilingual documents with greater ease and speed.

ILLUSTRATIONS BY BRUCIE ROSCH

Give people a few hints, and they can figure out the rest. They can look at this page, see some large type followed by blocks of small type and know that they are looking at the start of a magazine article. They can look at a list of groceries and see shopping instructions. They can look at some rows of numbers and understand the state of their bank account.

Computers, of course, are not that smart; they need to be told exactly what things are, how they are related and how to deal with them. Extensible Markup Language (XML for short) is a new language designed to do just that, to make information self-describing. This simple-sounding change in how computers communicate has the potential to extend the Internet beyond information delivery to many other kinds of human activity. Indeed, since XML was completed in early 1998 by the World Wide Web Consortium (usually called the W3C), the standard has spread like wildfire through science and into industries ranging from manufacturing to medicine.

The enthusiastic response is fueled by a hope that XML will solve some of the Web's biggest problems. These are widely known: the Internet is a speed-of-light network that often moves at a crawl; and although nearly every kind of information is available on-line, it can be maddeningly difficult to find the one piece you need.

Both problems arise in large part from the nature of the Web's main language,

HTML (shorthand for Hypertext Markup Language). Although HTML is the most successful electronic-publishing language ever invented, it is superficial: in essence, it describes how a Web browser should arrange text, images and pushbuttons on a page. HTML's concern with appearances makes it relatively easy to learn, but it also has its costs.

One is the difficulty in creating a Web site that functions as more than just a fancy fax machine that sends documents to anyone who asks. People and companies want Web sites that take orders from customers, transmit medical records, even run factories and scientific instruments from half a world away. HTML was never designed for such tasks.

So although your doctor may be able to pull up your drug reaction history on his Web browser, he cannot then e-mail it to a specialist and expect her to be able to paste the records directly into her hospital's database. Her computer would not know what to make of the information, which to its eyes would be no more intelligible than <H1>blah blah </H1> <BOLD>blah blah blah </BOLD>. As programming legend Brian Kernighan once noted, the problem with "What You See Is What You Get" is that what you see is all you've got.

Those angle-bracketed labels in the example just above are called tags. HTML has no tag for a drug reaction, which highlights another of its limitations: it is inflexible. Adding a new tag involves a bureaucratic process that can take so long that few attempt it. And yet every application, not just the interchange of medical records, needs its own tags.

Thus the slow pace of today's on-line bookstores, mail-order catalogues and other interactive Web sites. Change the quantity or shipping method of your order, and to see the handful of digits that have changed in the total, you must ask a distant, overburdened server to send you an entirely new page, graphics and all. Meanwhile your own high-powered machine sits waiting idly, because it has only been told about <H1>s and <BOLD>s, not about prices and shipping options.

Thus also the dissatisfying quality of Web searches. Because there is no way to mark something as a price, it is effectively impossible to use price information in your searches.

<Together XML and XSL allow publishers to pour a publication into myriad forms: write once and publish everywhere. />

Something Old, Something New

The solution, in theory, is very simple: use tags that say what the information is, not what it looks like. For example, label the parts of an order for a shirt not as boldface, paragraph, row and column (what HTML offers) but as price, size, quantity and color. A program can then recognize this document as a customer order and do whatever it needs to do: display it one way or display it a different way or put it through a bookkeeping system or make a new shirt show up on your doorstep tomorrow.

We, as members of a dozen-strong
W3C working group, began crafting such a solution in 1996. Our idea was powerful but not entirely original. For generations, printers scribbled notes on manuscripts to instruct the typesetters. This markup evolved on its own until 1986, when, after decades of work, the International Organization for Standardization (ISO) approved a system for the creation of new markup languages. Named Standard Generalized Markup Language, or SGML, this language for describing languages (a metalanguage) has since proved useful in many large publishing applications. Indeed, HTML was defined using SGML. The only problem with SGML is that it is too general, full of clever features designed to minimize keystrokes in an era when every byte had to be accounted for. It is more complex than Web browsers can cope with.

Our team created XML by removing frills from SGML to arrive at a more streamlined, digestible metalanguage. XML consists of rules that anyone can follow to create a markup language from scratch. The rules ensure that a single compact program, often called a parser, can process all these new languages.

MARKED UP WITH XML TAGS, one file containing, say, movie listings for an entire city can be displayed on a wide variety of devices. "Stylesheets" can filter, reorder and render the listings as a Web page with graphics for a desktop computer, as a text-only list for a handheld organizer and even as audible speech for a telephone.

Figure panels: AUDIBLE SPEECH STYLESHEET; CONVENTIONAL SCREEN STYLESHEET; HANDHELD DISPLAY STYLESHEET. The marked-up file shown in the figure reads:

    <movie>
      <title>Star Trek: Insurrection</title>
      <star>Patrick Stewart</star>
      <star>Brent Spiner</star>
      <theatre>
        <theatre-name>MondoPlex 2000</theatre-name>
        <showtime>1415</showtime>
        <showtime>1630</showtime>
        <showtime>1845</showtime>
        <showtime>2100</showtime>
        <showtime>2315</showtime>
        <price>
          <adult-price>8.50</adult-price>
          <child-price>5.00</child-price>
        </price>
      </theatre>
      <theatre>
        <theatre-name>Bigscreen 1</theatre-name>
        <showtime>1930</showtime>
        <price>
          <adult-price>6.00</adult-price>
        </price>
      </theatre>
    </movie>
    ...
    <movie>
      <title>Shakespeare in Love</title>
      <star>Gwyneth

LAURIE GRACE
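What a parser and a stylesheet do with such a file can be sketched in a few lines. A real stylesheet would be written in a stylesheet language such as XSL; here Python's standard `xml.etree.ElementTree` parser stands in, and the `text_only` function, its `latest` cutoff and the shortened listing are assumptions made for illustration. The element names follow the movie listing in the figure.

```python
import xml.etree.ElementTree as ET

LISTING = """
<movie>
  <title>Star Trek: Insurrection</title>
  <theatre>
    <theatre-name>MondoPlex 2000</theatre-name>
    <showtime>1415</showtime>
    <showtime>1630</showtime>
    <price><adult-price>8.50</adult-price><child-price>5.00</child-price></price>
  </theatre>
  <theatre>
    <theatre-name>Bigscreen 1</theatre-name>
    <showtime>1930</showtime>
    <price><adult-price>6.00</adult-price></price>
  </theatre>
</movie>
"""

def text_only(movie, latest="2400"):
    """Render one <movie> element as plain lines, keeping showtimes up to `latest`."""
    lines = [movie.findtext("title")]
    for theatre in movie.findall("theatre"):
        # Four-digit 24-hour times compare correctly as strings.
        times = [s.text for s in theatre.findall("showtime") if s.text <= latest]
        if times:
            name = theatre.findtext("theatre-name")
            adult = theatre.findtext("price/adult-price")
            lines.append(f"{name}: {', '.join(times)} (adult ${adult})")
    return lines

movie = ET.fromstring(LISTING)
lines = text_only(movie, latest="1700")
print("\n".join(lines))
```

Because the tags say what each value is, the same parsed tree could just as easily be filtered by price or handed to a different renderer, with no change to the file itself.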
Consider again the doctor who wants to e-mail your medical record to a specialist. If the medical profession uses XML to hammer out a markup language for encoding medical records (and in fact several groups have already started work on this) then your doctor's e-mail could contain <patient> <name> blah blah </name> <drug-allergy> blah blah blah </drug-allergy> </patient>. Programming any computer to recognize this standard medical notation and to add this vital statistic to its database becomes straightforward.

Just as HTML created a way for every computer user to read Internet documents, XML makes it possible, despite the Babel of incompatible computer systems, to create an Esperanto that all can read and write. Unlike most computer data formats, XML markup also makes sense to humans, because it consists of nothing more than ordinary text.

The unifying power of XML arises from a few well-chosen rules. One is that tags almost always come in pairs. Like parentheses, they surround the text to which they apply. And like quotation marks, tag pairs can be nested inside one another to multiple levels.

The nesting rule automatically forces a certain simplicity on every XML document, which takes on the structure known in computer science as a tree. As with a genealogical tree, each graphic and bit of text in the document represents a parent, child or sibling of some other element; relationships are unambiguous. Trees cannot represent every kind of information, but they can represent most kinds that we need computers to understand. Trees, moreover, are extraordinarily convenient for programmers. If your bank statement is in the form of a tree, it is a simple matter to write a bit of software that will reorder the transactions or display just the cleared checks.

Another source of XML's unifying strength is its reliance on a new standard called Unicode, a character-encoding system that supports intermingling of text in all the world's major languages. In HTML, as in most word processors, a document is generally in one particular language, whether that be English or Japanese or Arabic. If your software cannot read the characters of that language, then you cannot use the document. The situation can be even worse: software made for use in Taiwan often cannot read mainland-Chinese texts because of incompatible encodings. But software that reads XML properly can deal with any combination of any of these character sets. Thus, XML enables exchange of information not only between different computer systems but also across national and cultural boundaries.

An End to the World Wide Wait

As XML spreads, the Web should become noticeably more responsive. At present, computing devices connected to the Web, whether they are powerful desktop computers or tiny pocket planners, cannot do much more than get a form, fill it out and then swap it back and forth with a Web server until a job is completed. But the structural and semantic information that can be added with XML allows these devices to do a great deal of processing on the spot. That not only will take a big load off Web servers but also should reduce network traffic dramatically.

To understand why, imagine going to an on-line travel agency and asking for all the flights from London to New York on July 4. You would probably receive a list several times longer than your screen could display. You could shorten the list by fine-tuning the departure time, price or airline, but to do that, you would have to send a request across the Internet to the travel agency and wait for its answer. If, however, the long list of flights had been sent in XML, then the travel agency could have sent a small Java program along with the flight records that you could use to sort and winnow them in microseconds, without ever involving the server. Multiply this by a few million Web users, and the global efficiency gains become dramatic.

As more of the information on the Net is labeled with industry-specific XML tags, it will become easier to find exactly what you need. Today an Internet search for "stockbroker jobs" will inundate you with advertisements but probably turn up few job listings; most will be hidden inside the classified ad services of newspaper Web sites, out of a search robot's reach. But the Newspaper Association of America is even now building an XML-based markup language for classified ads that promises to make such searches much more effective.

Even that is just an intermediate step. Librarians figured out a long time ago that the way to find information in a hurry is to look not at the information itself but rather at much smaller, more focused sets of data that guide you to the useful sources: hence the library card catalogue. Such information about information is called metadata.

From the outset, part of the XML project has been to create a sister standard for metadata. The Resource Description Framework (RDF), finished this past February, should do for Web data what catalogue cards do for library books. Deployed across the Web, RDF metadata will make retrieval far faster and more accurate than it is now. Because the Web has no librarians and every Webmaster wants, above all else, to be found, we expect that RDF will achieve a typically astonishing Internet growth rate once its power becomes apparent.

There are of course other ways to find things besides searching. The Web is after all a "hypertext," its billions of pages connected by hyperlinks, those underlined words you click on to get whisked from one to the next. Hyperlinks, too, will do more when powered by XML. A standard for XML-based hypertext, named XLink and due later this year from the W3C, will allow you to choose from a list of multiple destinations. Other kinds of hyperlinks will insert text or images right where you click, instead of forcing you to leave the page.

Perhaps most useful, XLink will enable authors to use indirect links that point to entries in some central database rather than to the linked pages themselves. When a page's address changes, the author will be able to update all the links that point to it by editing just one database record. This should help eliminate the familiar "404 File Not Found" error that signals a broken hyperlink.

The combination of more efficient processing, more accurate searching and more flexible linking will revolutionize

the structure of the Web and make possible completely new ways of accessing information. Users will find this new Web faster, more powerful and more useful than the Web of today.

Some Assembly Required

Of course, it is not quite that simple. XML does allow anyone to design a new, custom-built language, but designing good languages is a challenge that should not be undertaken lightly. And the design is just the beginning: the meanings of your tags are not going to be obvious to other people unless you write some prose to explain them, nor to computers unless you write some software to process them.

A moment's thought reveals why. If all it took to teach a computer to handle a purchase order were to label it with <purchase-order> tags, we wouldn't need XML. We wouldn't even need programmers; the machines would be smart enough to take care of themselves.

What XML does is less magical but quite effective nonetheless. It lays down ground rules that clear away a layer of programming details so that people with similar interests can concentrate on the hard part: agreeing on how they want to represent the information they commonly exchange. This is not an easy problem to solve, but it is not a new one, either.

Such agreements will be made, because the proliferation of incompatible computer systems has imposed delays, costs and confusion on nearly every area of human activity. People want to share ideas and do business without all having to use the same computers; activity-specific interchange languages go a long way toward making that possible. Indeed, a shower of new acronyms ending in "ML" testifies to the inventiveness unleashed by XML in the sciences, in business and in the scholarly disciplines [see box on opposite page].

Before they can draft a new XML language, designers must agree on three things: which tags will be allowed, how tagged elements may nest within one another and how they should be processed. The first two (the language's vocabulary and structure) are typically codified in a Document Type Definition, or DTD. The XML standard does not compel language designers to use DTDs, but most new languages will probably have them, because they make it much easier for programmers to write software that understands the markup and does intelligent things with it.

Programmers will also need a set of guidelines that describe, in human lan-

[Figure: screen shots of an XML browser showing a list of Softland Airlines scheduled flights from London (LHR) to New York (JFK) on July 4, 1999; a seat-selection panel for Softland flight #118, Heathrow to NY Kennedy; a flight confirmation dialog; a fare-restrictions panel; and the Softland Airlines Flight Finder. LAURIE GRACE]
XML HYPERLINK can open a menu of several op- Leaving from Departing Time
tions. One option might insert an image, such as a 3/19/99 evening
Going to Returning Time
plane seating chart, into the current page (red arrow).
3/19/99 evening
Others could run a small program to book a flight
1 adult More search options
(yellow arrow) or reveal hidden text (green arrow).
The links can also connect to other pages (blue arrow). This search is limited to adult round trip coach fare
Click "Book a flight" to do a more detailed search.

15 SCIENTIFIC AMERICAN SPECIAL ONLINE ISSUE APRIL 2002


COPYRIGHT 2002 SCIENTIFIC AMERICAN, INC.
guage, what all the tags mean. HTML,
for instance, has a DTD but also hun- New Languages for Science
dreds of pages of descriptive prose that
programmers refer to when they write
browsers and other Web software. X ML offers a particularly convenient way for scien-
tists to exchange theories, calculations and ex-
perimental results. Mathematicians, among others, have
A Question of Style long been frustrated by Web browsers ablity to display
mathematical expressions only as pictures. MathML now
allows them to insert equations into their Web pages with a few lines of simple
F or users, it is what those programs
do, not what the descriptions say,
that is important. In many cases, people
text. Readers can then paste those expressions directly into algebra software for
calculation or graphing.
will want software to display XML-en- Chemists have gone a step further, developing new browser programs for their
coded information to human readers. XML-based Chemical Markup Language (CML) that graphically render the molec-
ular structure of compounds described in CML Web pages. Both CML and Astron-
But XML tags offer no inherent clues
omy Markup Language will help researchers sift quickly through reams of journal
about how the information should look
citations to find just the papers that apply to the object of their study. As-
on screen or on paper.
tronomers, for example, can enter the sky coordinates of a galaxy to pull up a list of
This is actually an advantage for pub-
images, research papers and instrument data about that heavenly body.
lishers, who would often like to write
XML will be helpful for running experiments as well as analyzing their results.
once and publish everywhereto dis-
National Aeronautics and Space Administration engineers began work last year on
till the substance of a publication and Astronomical Instrument ML (AIML) as a way to enable scientists on the ground
then pour it into myriad forms, both to control the SOFIA infrared telescope as it flies on a Boeing 747. AIML should
printed and electronic. XML lets them eventually allow astronomers all over the world to control telescopes and perhaps
do this by tagging content to describe even satellites through straightforward Internet browser software.
its meaning, independent of the display Geneticists may soon be using Biosequence ML (BSML) to exchange and ma-
medium. Publishers can then apply rules nipulate the flood of information produced by gene-mapping and gene-sequenc-
organized into stylesheets to reformat ing projects. A BSML browser built and distributed free by Visual Genomics in
the work automatically for various de- Columbus, Ohio, lets researchers search through vast databases of genetic code
vices. The standard now being devel- and display the resulting snippets as meaningful maps and charts rather than as
oped for XML stylesheets is called the obtuse strings of letters. The Editors
Extensible Stylesheet Language, or XSL.
The latest versions of several Web
browsers can read an XML document,
fetch the appropriate stylesheet, and use dardized documents: purchase orders, better place to do business. Web site de-
it to sort and format the information invoices, manifests, receipts and so on. signers, on the other hand, will find it
on the screen. The reader might never Documents work for commerce be- more demanding. Battalions of program-
know that he is looking at XML rather cause they do not require the parties in- mers will be needed to exploit new XML
than HTML except that XML-based volved to know about one anothers in- languages to their fullest. And although
sites run faster and are easier to use. ternal procedures. Each record exposes the day of the self-trained Web hacker
People with visual disabilities gain a exactly what its recipient needs to is not yet over, the species is endangered.
free benefit from this approach to pub- know and no more. The exchange of Tomorrows Web designers will need to
lishing. Stylesheets will let them render documents is probably the right way to be versed not just in the production of
XML into Braille or audible speech. The do business on-line, too. But this was words and graphics but also in the con-
advantages extend to others as well: not the job for which HTML was built. struction of multilayered, interdepen-
commuters who want to surf the Web XML, in contrast, was designed for dent systems of DTDs, data trees, hy-
in their cars may also find it handy to document exchange, and it is becoming perlink structures, metadata and style-
have pages read aloud. clear that universal electronic commerce sheetsa more robust infrastructure for
Although the Web has been a boon to will rely heavily on a flow of agree- the Webs second generation. SA

science and to scholarship, it is com- ments, expressed in millions of XML


merce (or rather the expectation of fu- documents pulsing around the Internet.
ture commercial gain) that has fueled its Thus, for its users, the XML-pow-
lightning growth. The recent surge in re- ered Web will be faster, friendlier and a
tail sales over the Web has drawn much
attention, but business-to-business com-
merce is moving on-line at least as quick- The Authors
ly. The flow of goods through the manu- JON BOSAK and TIM BRAY played crucial roles in the development of XML. Bosak, an
facturing process, for example, begs for on-line information technology architect at Sun Microsystems in Mountain View, Calif.,
automation. But schemes that rely on organized and led the World Wide Web Consortium working group that created XML. He
complex, direct program-to-program in- is currently chair of the W3C XML Coordination Group and a representative to the Orga-
teraction have not worked well in prac- nization for the Advancement of Structured Information Standards. Bray is co-editor of the
XML 1.0 specification and the related Namespaces in XML and serves as co-chair of the
tice, because they depend on a uniformi- W3C XML Syntax Working Group. He managed the New Oxford English Dictionary
ty of processing that does not exist. Project at the University of Waterloo in 1986, co-founded Open Text Corporation in 1989
For centuries, humans have success- and launched Textuality, a programming firm in Vancouver, B.C., in 1996.
fully done business by exchanging stan-
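As a rough illustration of the "write once and publish everywhere" idea from "A Question of Style," the sketch below tags a purchase order by meaning and then renders it two ways. It is not XSL: the two Python functions merely stand in for stylesheets, and the element names, attribute names and sample data are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tags describe meaning, not appearance.
ORDER_XML = """
<purchase-order>
  <buyer>Example Books</buyer>
  <item quantity="2">Atlas</item>
  <item quantity="1">Dictionary</item>
</purchase-order>
"""

def to_text(order):
    """Stand-in for a plain-text 'stylesheet'."""
    lines = ["Order for " + order.findtext("buyer")]
    for item in order.findall("item"):
        lines.append(f"  {item.get('quantity')} x {item.text}")
    return "\n".join(lines)

def to_html(order):
    """Stand-in for an HTML 'stylesheet' applied to the same content."""
    rows = "".join(
        f"<li>{item.text} ({item.get('quantity')})</li>"
        for item in order.findall("item")
    )
    return f"<h1>{order.findtext('buyer')}</h1><ul>{rows}</ul>"

order = ET.fromstring(ORDER_XML.strip())
print(to_text(order))
print(to_html(order))
```

The content is written once; only the rendering rules differ per output form, which is the separation the article attributes to stylesheets.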

Hypersearching the Web
With the volume of on-line information in cyberspace growing at a
breakneck pace, more effective search tools are desperately needed.
A new technique analyzes how Web pages are linked together

by Members of the Clever Project

Every day the World Wide Web grows by roughly a million electronic pages, adding to the hundreds of millions already on-line. This staggering volume of information is loosely held together by more than a billion annotated connections, called hyperlinks. For the first time in history, millions of people have virtually instant access from their homes and offices to the creative output of a significant and growing fraction of the planet's population.

But because of the Web's rapid, chaotic growth, the resulting network of information lacks organization and structure. In fact, the Web has evolved into a global mess of previously unimagined proportions. Web pages can be written in any language, dialect or style by individuals with any background, education, culture, interest and motivation. Each page might range from a few characters to a few hundred thousand, containing truth, falsehood, wisdom, propaganda or sheer nonsense. How, then, can one extract from this digital morass high-quality, relevant pages in response to a specific need for certain information?

In the past, people have relied on search engines that hunt for specific words or terms. But such text searches frequently retrieve tens of thousands of pages, many of them useless. How can people quickly locate only the information they need and trust that it is authentic and reliable?

We have developed a new kind of search engine that exploits one of the Web's most valuable resources: its myriad hyperlinks. By analyzing these interconnections, our system automatically locates two types of pages: authorities and hubs. The former are deemed to be the best sources of information on a particular topic; the latter are collections of links to those locations. Our methodology should enable users to locate much of the information they desire quickly and efficiently.

[Figure] WEB PAGES (white dots) are scattered over the Internet with little structure, making it difficult for a person in the center of this electronic clutter to find only the information desired. Although this diagram shows just hundreds of pages, the World Wide Web currently contains more than 300 million of them. Nevertheless, an analysis of the way in which certain pages are linked to one another can reveal a hidden order. (All illustrations by Bryan Christie.)

The Challenges of Search Engines

Computer disks have become increasingly inexpensive, enabling the storage of a large portion of the Web at a single site. At its most basic level, a search engine maintains a list, for every word, of all known Web pages containing that word. Such a collection of lists is known as an index. So if people are interested in learning about acupuncture, they can access the "acupuncture" list to find all Web pages containing that word.

Creating and maintaining this index is highly challenging [see "Searching the Internet," by Clifford Lynch; Scientific American, March 1997], and determining what information to return in response to user requests remains daunting. Consider the unambiguous query for information on Nepal Airways, the airline company. Of the roughly 100 (at the time of this writing) Web pages containing the phrase, how does a search engine decide which 20 or so are the best? One difficulty is that there is no exact and mathematically precise measure of "best"; indeed, it lies in the eye of the beholder.

Search engines such as AltaVista, Infoseek, HotBot, Lycos and Excite use heuristics to determine the way in which to order and thereby prioritize pages. These rules of thumb are collectively known as a ranking function, which must apply not only to relatively specific and straightforward queries ("Nepal Airways") but also to much more general requests, such as for "aircraft," a word that appears in more than a million Web pages. How should a search engine choose just 20 from such a staggering number?

Simple heuristics might rank pages by the number of times they contain the query term, or they may favor instances in which that text appears earlier. But such approaches can sometimes fail spectacularly. Tom Wolfe's book The Kandy-Kolored Tangerine-Flake Streamline Baby would, if ranked by such heuristics, be deemed very relevant to the query "hernia," because it begins by repeating that word dozens of times. Numerous extensions to these rules of thumb abound, including approaches that give more weight to words that appear in titles, in section headings or in a larger font.

Such strategies are routinely thwarted by many commercial Web sites that design their pages in certain ways specifically to elicit favorable rankings. Thus, one encounters pages whose titles are "cheap airfares cheap airfares cheap airfares." Some sites write other carefully chosen phrases many times over in colors and fonts that are invisible to human viewers. This practice, called spamming, has become one of the main reasons why it is currently so difficult to maintain an effective search engine.

Spamming aside, even the basic assumptions of conventional text searches are suspect. To wit, pages that are highly relevant will not always contain the query term, and others that do may be worthless. A major cause of this problem is that human language, in all its richness, is awash in synonymy (different words having the same meaning) and polysemy (the same word having multiple meanings). Because of the former, a query for "automobile" will miss a deluge of pages that lack that word but instead contain "car." The latter manifests itself in a simple query for "jaguar," which will retrieve thousands of pages about the automobile, the jungle cat and the National Football League team, among other topics.
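
The word index described above, and the way synonymy and polysemy defeat it, can be shown with a minimal sketch. The page names and page texts are invented for illustration, and real engines tokenize far more carefully than a whitespace split.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page names containing it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

# Hypothetical miniature "Web" of four pages.
pages = {
    "dealer": "used car prices and car reviews",
    "zoo": "the jaguar is a jungle cat",
    "football": "jaguar season tickets for the football team",
    "motors": "the new jaguar automobile",
}
index = build_index(pages)

# Synonymy: the "dealer" page is about cars but never uses the query word.
print(sorted(index["automobile"]))

# Polysemy: one query word pulls in three unrelated topics.
print(sorted(index["jaguar"]))
```

A pure word lookup has no way to know that "dealer" is relevant to the first query or that two of the three "jaguar" pages are probably irrelevant to any one user.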
One corrective strategy is to augment search techniques with stored information about semantic relations between words. Such compilations, typically constructed by a team of linguists, are sometimes known as semantic networks, following the seminal work on the WordNet project by George A. Miller and his colleagues at Princeton University. An index-based engine with access to a semantic network could, on receiving the query for "automobile," first determine that "car" is equivalent and then retrieve all Web pages containing either word. But this process is a double-edged sword: it helps with synonymy but can aggravate polysemy.

Even as a cure for synonymy, the solution is problematic. Constructing and maintaining a semantic network that is exhaustive and cross-cultural (after all, the Web knows no geographical boundaries) are formidable tasks. The process is especially difficult on the Internet, where a whole new language is evolving: words such as "FAQs," "zines" and "bots" have emerged, whereas other words such as "surf" and "browse" have taken on additional meanings.

Our work on the Clever project at IBM originated amid this perplexing array of issues. Early on, we realized that the current scheme of indexing and retrieving a page based solely on the text it contained ignores more than a billion carefully placed hyperlinks that reveal the relations between pages. But how exactly should this information be used?

When people perform a search for "Harvard," many of them want to learn more about the Ivy League school. But more than a million locations contain "Harvard," and the university's home page is not the one that uses it the most frequently, the earliest or in any other way deemed especially significant by traditional ranking functions. No entirely internal feature of that home page truly seems to reveal its importance.

Indeed, people design Web pages with all kinds of objectives in mind. For instance, large corporations want their sites to convey a certain feel and project a specific image, goals that might be very different from that of describing what the company does. Thus, IBM's home page does not contain the word "computer." For these types of situations, conventional search techniques are doomed from the start.

To address such concerns, human architects of search engines have been tempted to intervene. After all, they believe they know what the appropriate responses to certain queries should be, and developing a ranking function that will automatically produce those results has been a troublesome undertaking. So they could maintain a list of queries like "Harvard" for which they will override the judgment of the search engine with predetermined "right" answers.

This approach is being taken by a number of search engines. In fact, a service such as Yahoo! contains only human-selected pages. But there are countless possible queries. How, with a limited number of human experts, can one maintain all these lists of precomputed responses, keeping them reasonably complete and up-to-date, as the Web meanwhile grows by a million pages a day?

[Figure] FINDING authorities and hubs can be tricky because of the circular way in which they are defined: an authority is a page that is pointed to by many hubs; a hub is a site that links to many authorities. The process, however, can be performed mathematically. Clever, a prototype search engine, assigns initial scores to candidate Web pages on a particular topic. Clever then revises those numbers in repeated series of calculations, with each iteration dependent on the values of the previous round. The computations continue until the scores eventually settle on their final values, which can then be used to determine the best authorities and hubs.

[Figure] AUTHORITIES AND HUBS help to organize information on the Web, however informally and inadvertently. Authorities are sites that other Web pages happen to link to frequently on a particular topic. For the subject of human rights, for instance, the home page of Amnesty International might be one such location. Hubs are sites that tend to cite many of those authorities, perhaps in a resource list or in a "My Favorite Links" section on a personal home page.

Searching with Hyperlinks

In our work, we have been attacking the problem in a different way. We have developed an automatic technique for finding the most central, authoritative sites on broad search topics by making use of hyperlinks, one of the Web's most precious resources. It is the hyperlinks, after all, that pull together the hundreds of millions of pages into a web of knowledge. It is through these connections that users browse, serendipitously discovering valuable information through the pointers and recommendations of people they have never met.

The underlying assumption of our approach views each link as an implicit endorsement of the location to which it

points. Consider the Web site of a human-rights activist that directs people to the home page of Amnesty International. In this case, the reference clearly signifies approval.

Of course, a link may also exist purely for navigational purposes ("Click here to return to the main menu"), as a paid advertisement ("The vacation of your dreams is only a click away") or as a stamp of disapproval ("Surf to this site to see what this fool says"). We believe, however, that in aggregate (that is, when a large enough number is considered) Web links do confer authority.

In addition to expert sites that have garnered many recommendations, the Web is full of another type of page: hubs that link to those prestigious locations, tacitly radiating influence outward to them. Hubs appear in guises ranging from professionally assembled lists on commercial sites to inventories of "My Favorite Links" on personal home pages. So even if we find it difficult to define "authorities" and "hubs" in isolation, we can state this much: a respected authority is a page that is referred to by many good hubs; a useful hub is a location that points to many valuable authorities.

These definitions look hopelessly circular. How could they possibly lead to a computational method of identifying both authorities and hubs? Thinking of the problem intuitively, we devised the following algorithm. To start off, we look at a set of candidate pages about a particular topic, and for each one we make our best guess about how good a hub it is and how good an authority it is. We then use these initial estimates to jump-start a two-step iterative process.

First, we use the current guesses about the authorities to improve the estimates of hubs: we locate all the best authorities, see which pages point to them and call those locations good hubs. Second, we take the updated hub information to refine our guesses about the authorities: we determine where the best hubs point most heavily and call these the good authorities. Repeating these steps several times fine-tunes the results.

We have implemented this algorithm in Clever, a prototype search engine. For any query of a topic (say, acupuncture) Clever first obtains a list of 200 pages from a standard text index such as AltaVista. The system then augments these by adding all pages that link to and from that 200. In our experience, the resulting collection, called the root set, will typically contain between 1,000 and 5,000 pages.

For each of these, Clever assigns initial numerical hub and authority scores. The system then refines the values: the authority score of each page is updated to be the sum of the hub scores of other locations that point to it; a hub score is revised to be the sum of the authority scores of locations to which a page points. In other words, a page that has many high-scoring hubs pointing to it earns a higher authority score; a location that points to many high-scoring authorities garners a higher hub score. Clever repeats these calculations until the scores have more or less settled on their final values, from which the best authorities and hubs can be determined. (Note that the computations do not preclude a particular page from achieving a top rank in both categories, as sometimes occurs.)

The algorithm might best be understood in visual terms. Picture the Web as a vast network of innumerable sites, all interconnected in a seemingly random fashion. For a given set of pages containing a certain word or term, Clever zeroes in on the densest pattern of links between those pages.

As it turns out, the iterative summation of hub and authority scores can be analyzed with stringent mathematics. Using linear algebra, we can represent the process as the repeated multiplication of a vector (specifically, a row of numbers representing the hub or authority scores) by a matrix (a two-dimensional array of numbers representing the hyperlink structure of the root set). The final results of the process are hub and authority vectors that have equilibrated to certain numbers: values that reveal which pages are the best hubs and authorities, respectively. (In the world of linear algebra, such a stabilized row of numbers is called an eigenvector; it can be thought of as the solution to a system of equations defined by the matrix.)

With further linear algebraic analysis, we have shown that the iterative process will rapidly settle to a relatively steady set of hub and authority scores. For our purposes, a root set of 3,000 pages requires about five rounds of calculations. Furthermore, the results are generally independent of the initial estimates of scores used to start the process. The method will work even if the values are all initially set to be equal to 1. So the final hub and authority scores are intrinsic to the collection of pages in the root set.

A useful by-product of Clever's iterative processing is that the algorithm naturally separates Web sites into clusters. A search for information on abortion, for example, results in two types of locations, pro-life and pro-choice, because pages from one group are more likely to link to one another than to those from the other community.

From a larger perspective, Clever's algorithm reveals the underlying structure of the World Wide Web. Although the Internet has grown in a hectic, willy-nilly fashion, it does indeed have an inherent (albeit inchoate) order based on how pages are linked.

[Figure] CYBERCOMMUNITIES (shown in different colors) populate the Web. An exploration of this phenomenon has uncovered various groups on topics as arcane as oil spills off the coast of Japan, fire brigades in Australia and resources for Turks living in the U.S. The Web is filled with hundreds of thousands of such finely focused communities.

The Link to Citation Analysis

Methodologically, the Clever algorithm has close ties to citation analysis, the study of patterns of how scientific papers make reference to one another. Perhaps the field's best-known measure of a journal's importance is the "impact factor." Developed by Eugene Garfield, a noted information scientist and founder of Science Citation Index, the metric essentially judges a publication by the number of citations it receives. On the Web, the impact factor would correspond to the ranking of a page simply by a tally of the number of links that point to it. But this approach is typically not appropriate, because it can favor universally popular locations, such as the home page of the New York Times, regardless of the specific query topic.

Even in the area of citation analysis, researchers have attempted to improve Garfield's measure, which counts each reference equally. Would not a better strategy give additional weight to citations from a journal deemed more important? Of course, the difficulty with this approach is that it leads to a "circular" definition of importance, similar to the problem we encountered in specifying hubs and authorities. As early as

1976 Gabriel Pinski and Francis Narin tions pointing to it. So, when presented Our preliminary experiments suggest
of CHI Research in Haddon Heights, with a specific query, Google can re- that this refinement substantially in-
N.J., overcame this hurdle by develop- spond by quickly retrieving all pages con- creases the focus of the search results.
ing an iterated method for computing a taining the search text and listing them (A shortcoming of Clever has been that
stable set of adjusted scores, which they according to their preordained ranks. for a narrow topic, such as Frank Lloyd
termed influence weights. In contrast to Google and Clever have two main dif- Wrights house Fallingwater, the system
our work, Pinski and Narin did not in- ferences. First, the former assigns initial sometimes broadens its search and re-
voke a distinction between authorities rankings and retains them independently trieves information on a general subject,
and hubs. Their method essentially pass- of any queries, whereas the latter assem- such as American architecture.) We are
es weight directly from one good author- bles a different root set for each search investigating other improvements, and
ity to another. term and then prioritizes those pages in given the many styles of authorship on
This difference raises a fundamental the context of that particular query. Con- the Web, the weighting of links might
point about the Web versus traditional sequently, Googles approach enables incorporate page content in a variety of
printed scientific literature. In cyber- faster response. Second, Googles basic ways.
space, competing authorities (for exam- philosophy is to look only in the forward We have also begun to construct lists
ple, Netscape and Microsoft on the direction, from link to link. In contrast, of Web resources, similar to the guides
topic of browsers) frequently do not ac- Clever also looks backward from an au- put together manually by employees of
knowledge one anothers existence, so thoritative page to see what locations companies such as Yahoo! and Info-
they can be connected only by an inter- are pointing there. In this sense, Clever seek. Our early results indicate that au-
mediate layer of hubs. Rival prominent takes advantage of the sociological phe- tomatically compiled lists can be com-
scientific journals, on the other hand, nomenon that humans are innately moti- petitive with handcrafted ones. Further-
typically do a fair amount of cross-cita- vated to create hublike content express- more, through this work we have found
tion, making the role of hubs much less crucial.

A number of groups are also investigating the power of hyperlinks for searching the Web. Sergey Brin and Lawrence Page of Stanford University, for instance, have developed a search engine dubbed Google that implements a link-based ranking measure related to the influence weights of Pinski and Narin. The Stanford scientists base their approach on a model of a Web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others. Thus, Google finds a single type of universally important page: intuitively, locations that are heavily visited in a random traversal of the Web's link structure. In practice, for each Web page Google basically sums the scores of other loca-

ing their expertise on specific topics.

The Search Continues

We are exploring a number of ways to enhance Clever. A fundamental direction in our overall approach is the integration of text and hyperlinks. One strategy is to view certain links as carrying more weight than others, based on the relevance of the text in the referring Web location. Specifically, we can analyze the contents of the pages in the root set for the occurrences and relative positions of the query topic and use this information to assign numerical weights to some of the connections between those pages. If the query text appeared frequently and close to a link, for instance, the corresponding weight would be increased.

that the Web teems with tightly knit groups of people, many with offbeat common interests (such as weekend sumo enthusiasts who don bulky plastic outfits and wrestle each other for fun), and we are currently investigating efficient and automatic methods for uncovering these hidden communities.

The World Wide Web of today is dramatically different from that of just five years ago. Predicting what it will be like in another five years seems futile. Will even the basic act of indexing the Web soon become infeasible? And if so, will our notion of searching the Web undergo fundamental changes? For now, the one thing we feel certain in saying is that the Web's relentless growth will continue to generate computational challenges for wading through the ever increasing volume of on-line information.
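The random-surfer model mentioned in this section (a surfer who follows links and makes occasional haphazard jumps) can be illustrated with a small sketch. This is not Google's implementation: the function name, the toy link graph, the jump probability of 0.15 and the fixed iteration count are all assumptions chosen for illustration. The sketch also assumes that every link target appears as a key in the `links` dictionary.

```python
def random_surfer_scores(links, jump_probability=0.15, iterations=50):
    """Estimate how often a random surfer visits each page.

    links maps each page to the list of pages it links to.
    jump_probability is an assumed chance of a haphazard jump per step.
    """
    pages = sorted(links)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives an equal share of the haphazard jumps.
        new_scores = {page: jump_probability / n for page in pages}
        for page in pages:
            targets = links[page]
            if targets:
                # The surfer follows one of the page's links at random.
                share = (1.0 - jump_probability) * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # No outgoing links: the surfer jumps to any page.
                share = (1.0 - jump_probability) * scores[page] / n
                for target in pages:
                    new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical four-page Web used only to exercise the function.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = random_surfer_scores(toy_web)
most_visited = max(scores, key=scores.get)
```

In this toy graph, page "c" collects links from three pages and ends up as the most heavily visited location, while "d", which nothing links to, is reached only by haphazard jumps.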

The Authors

THE CLEVER PROJECT: Soumen Chakrabarti, Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins are research staff members at the IBM Almaden Research Center in San Jose, Calif. Jon M. Kleinberg is an assistant professor in the computer science department at Cornell University. David Gibson is completing his Ph.D. at the computer science division at the University of California, Berkeley.

The authors began their quest for exploiting the hyperlink structure of the World Wide Web three years ago, when they first sought to develop improved techniques for finding information in the clutter of cyberspace. Their work originated with the following question: If computation were not a bottleneck, what would be the most effective search algorithm? In other words, could they build a better search engine if the processing didn't have to be instantaneous? The result was the algorithm described in this article. Recently the research team has been investigating the Web phenomenon of cybercommunities.

Further Reading

Search Engine Watch (www.searchenginewatch.com) contains information on the latest progress in search engines. The WordNet project is described in WordNet: An Electronic Lexical Database (MIT Press, 1998), edited by Christiane Fellbaum. The iterative method for determining hubs and authorities first appeared in Jon M. Kleinberg's paper "Authoritative Sources in a Hyperlinked Environment" in Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, edited by Howard Karloff (SIAM/ACM-SIGACT, 1998). Improvements to the algorithm are described at the Web site of the IBM Almaden Research Center (www.almaden.ibm.com/cs/k53/clever.html). Introduction to Informetrics (Elsevier Science Publishers, 1990), by Leo Egghe and Ronald Rousseau, provides a good overview of citation analysis. Information on the Google project at Stanford University can be obtained from www.google.com on the World Wide Web.

THE SEMANTIC WEB
A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities

by TIM BERNERS-LEE, JAMES HENDLER and ORA LASSILA
PHOTOILLUSTRATIONS BY MIGUEL SALMERON

The entertainment system was belting out the Beatles' "We Can Work It Out" when the phone rang. When Pete answered, his phone turned the sound down by sending a message to all the other local devices that had a volume control. His sister, Lucy, was on the line from the doctor's office: "Mom needs to see a specialist and then has to have a series of physical therapy sessions. Biweekly or something. I'm going to have my agent set up the appointments." Pete immediately agreed to share the chauffeuring.

At the doctor's office, Lucy instructed her Semantic Web agent through her handheld Web browser. The agent promptly retrieved information about Mom's prescribed treatment from the doctor's agent, looked up several lists of providers, and checked for the ones in-plan for Mom's insurance within a 20-mile radius of her home and with a rating of excellent or very good on trusted rating services. It then began trying to find a match between available appointment times (supplied by the agents of individual providers through their Web sites) and Pete's and Lucy's busy schedules. (The emphasized keywords indicate terms whose semantics, or meaning, were defined for the agent through the Semantic Web.)

In a few minutes the agent presented them with a plan. Pete didn't like it: University Hospital was all the way across town from Mom's place, and he'd be driving back in the middle of rush hour. He set his own agent to redo the search with stricter preferences about location and time. Lucy's agent, having complete trust in Pete's agent in the context of the present task, automatically assisted by supplying access certificates and shortcuts to the data it had already sorted through.

Almost instantly the new plan was presented: a much closer clinic and earlier times, but there were two warning notes. First, Pete would have to reschedule a couple of his less important appointments. He checked what they were; not a problem. The other was something about the insurance company's list failing to include this provider under "physical therapists": "Service type and insurance plan status securely verified by other means," the agent reassured him. "(Details?)"

Lucy registered her assent at about the same moment Pete was muttering, "Spare me the details," and it was all set. (Of course, Pete couldn't resist the details and later that night had his agent explain how it had found that provider even though it wasn't on the proper list.)

Expressing Meaning

Pete and Lucy could use their agents to carry out all these tasks thanks not to the World Wide Web of today but rather the Semantic Web that it will evolve into tomorrow. Most of the Web's content today is designed for humans to read, not for computer programs to manipulate meaningfully. Computers can adeptly parse Web pages for layout and routine processing (here a header, there a link to another page), but in general, computers have no reliable way to process the semantics: this is the home page of the Hartman and Strauss Physio Clinic, this link goes to Dr. Hartman's curriculum vitae.

The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. Such an agent coming to the clinic's Web page will know not just that the page has keywords such as "treatment, medicine, physical, therapy" (as might be encoded today) but also that Dr. Hartman works at this clinic on Mondays, Wednesdays and Fridays and that the script takes a date range in yyyy-mm-dd format and returns appointment times. And it will "know" all this without needing artificial intelligence on the scale of 2001's Hal or Star Wars's C-3PO. Instead these semantics were encoded into the Web page when the clinic's office manager (who never took Comp Sci 101) massaged it into shape using off-the-shelf software for writing Semantic Web pages along with resources listed on the Physical Therapy Association's site.

The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The first steps in weaving the Semantic Web into the structure of the existing Web are already under way. In the near future, these developments will usher in significant new functionality as machines become much better able to process and "understand" the data that they merely display at present. The essential property of the World Wide Web is its universality. The

Overview / Semantic Web


To date, the World Wide Web has developed most rapidly as a medium of documents for people rather than of
information that can be manipulated automatically. By augmenting Web pages with data targeted at
computers and by adding documents solely for computers, we will transform the Web into the Semantic Web.
Computers will find the meaning of semantic data by following hyperlinks to definitions of key terms and rules
for reasoning about them logically. The resulting infrastructure will spur the development of automated Web
services such as highly functional agents.
Ordinary users will compose Semantic Web pages and add new definitions and rules using off-the-shelf
software that will assist with semantic markup.

power of a hypertext link is that "anything can link to anything." Web technology, therefore, must not discriminate between the scribbled draft and the polished performance, between commercial and academic information, or among cultures, languages, media and so on. Information varies along many axes. One of these is the difference between information produced primarily for human consumption and that produced mainly for machines. At one end of the scale we have everything from the five-second TV commercial to poetry. At the other end we have databases, programs and sensor output. To date, the Web has developed most rapidly as a medium of documents for people rather than for data and information that can be processed automatically. The Semantic Web aims to make up for this.

Like the Internet, the Semantic Web will be as decentralized as possible. Such Web-like systems generate a lot of excitement at every level, from major corporation to individual user, and provide benefits that are hard or impossible to predict in advance. Decentralization requires compromises: the Web had to throw away the ideal of total consistency of all of its interconnections, ushering in the infamous message "Error 404: Not Found" but allowing unchecked exponential growth.

Knowledge Representation

For the Semantic Web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning. Artificial-intelligence researchers have studied such systems since long before the Web was developed. Knowledge representation, as this technology is often called, is currently in a state comparable to that of hypertext before the advent of the Web: it is clearly a good idea, and some very nice demonstrations exist, but it has not yet changed the world. It contains the seeds of important applications, but to realize its full potential it must be linked into a single global system.

Traditional knowledge-representation systems typically have been centralized, requiring everyone to share exactly the same definition of common concepts such as "parent" or "vehicle." But central control is stifling, and increasing the size and scope of such a system rapidly becomes unmanageable.

Moreover, these systems usually carefully limit the questions that can be asked so that the computer can answer reliably, or answer at all. The problem is reminiscent of Gödel's theorem from mathematics: any system that is complex enough to be useful also encompasses unanswerable questions, much like sophisticated versions of the basic paradox "This sentence is false." To avoid such problems, traditional knowledge-representation systems generally each had their own narrow and idiosyncratic set of rules for making inferences about their data. For example, a genealogy system, acting on a database of family trees, might include the rule "a wife of an uncle is an aunt." Even if the data could be transferred from one system to another, the rules, existing in a completely different form, usually could not.

Semantic Web researchers, in contrast, accept that paradoxes and unanswerable questions are a price that must be paid to achieve versatility. We make the language for the rules as expressive as needed to allow the Web to reason as widely as desired. This philosophy is similar to that of the conventional Web: early in the Web's development, detractors pointed out that it could never be a well-organized library; without a central database and tree structure, one would never be sure of finding everything. They were right. But the expressive power of the system made vast amounts of information available, and search engines (which would have seemed quite impractical a decade ago) now produce remarkably complete indices of a lot of the material out there.

The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web.

Adding logic to the Web (the means to use rules to make inferences, choose courses of action and answer questions) is the task before the Semantic Web community at the moment. A mixture of mathematical and engineering decisions complicates this task. The logic must be powerful enough to describe complex properties of objects but not so powerful that agents can be tricked by being asked to consider a paradox. Fortunately, a large majority of the information we want to express is along the lines of "a hex-head bolt is a type of machine bolt," which is readily written in existing languages with a little extra vocabulary.

Two important technologies for developing the Semantic Web are already in place: eXtensible Markup Language (XML) and the Resource Description Framework (RDF). XML lets everyone create their own tags: hidden labels such as <zip code> or <alma mater> that annotate Web pages or sections of text on a page. Scripts, or programs, can make use of these tags in sophisticated ways, but the script writer has to know what the page writer uses each tag for. In short, XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean [see "XML and the Second-Generation Web," by Jon Bosak and Tim Bray; Scientific American, May 1999].

Meaning is expressed by RDF, which encodes it in sets of triples, each triple being rather like the subject, verb and object of an elementary sentence. These triples can be written using XML tags. In RDF, a document makes assertions that particular things (people, Web pages or whatever) have properties (such as "is a sister of," "is the author of") with certain values (another person, another Web page). This structure turns out to be a natural way to describe the vast majority of the data processed by machines. Subject and object are each identified by a Universal Resource Identifier (URI), just as used in a link on a Web page. (URLs, Uniform Resource Locators, are the most common type of URI.) The verbs are also identified by URIs, which enables anyone to define a new concept, a new verb, just by defining a URI for it somewhere on the Web.

Human language thrives when using the same term to mean somewhat differ-

ent things, but automation does not. Imagine that I hire a clown messenger service to deliver balloons to my customers on their birthdays. Unfortunately, the service transfers the addresses from my database to its database, not knowing that the "addresses" in mine are where bills are sent and that many of them are post office boxes. My hired clowns end up entertaining a number of postal workers (not necessarily a bad thing but certainly not the intended effect). Using a different URI for each specific concept solves that problem. An address that is a mailing address can be distinguished from one that is a street address, and both can be distinguished from an address that is a speech.

The triples of RDF form webs of information about related things. Because RDF uses URIs to encode this information in a document, the URIs ensure that concepts are not just words in a document but are tied to a unique definition that everyone can find on the Web. For example, imagine that we have access to a variety of databases with information about people, including their addresses. If we want to find people living in a specific zip code, we need to know which fields in each database represent names and which represent zip codes. RDF can specify that "(field 5 in database A) (is a field of type) (zip code)," using URIs rather than phrases for each term.

Ontologies

Of course, this is not the end of the story, because two databases may use different identifiers for what is in fact the same concept, such as zip code. A program that wants to compare or combine information across the two databases has to know that these two terms are being used to mean the same thing. Ideally, the program must have a way to discover such common meanings for whatever databases it encounters.

A solution to this problem is provided by the third basic component of the Semantic Web, collections of information called ontologies. In philosophy, an ontology is a theory about the nature of existence, of what types of things exist; ontology as a discipline studies such theories. Artificial-intelligence and Web researchers have co-opted the term for their own jargon, and for them an ontology is a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules.

The taxonomy defines classes of objects and relations among them. For example, an address may be defined as a type of location, and city codes may be defined to apply only to locations, and so on. Classes, subclasses and relations among entities are a very powerful tool for Web use. We can express a large number of relations among entities by assigning properties to classes and allowing subclasses to inherit such properties. If city codes must be of type city and cities generally have Web sites, we can discuss the Web site associated with a city code even if no database links a city code directly to a Web site.

Inference rules in ontologies supply further power. An ontology may express the rule "If a city code is associated with a state code, and an address uses that city code, then that address has the associated state code." A program could then readily deduce, for instance, that a Cornell University address, being in Ithaca, must be in New York State, which is in the U.S., and therefore should be formatted to U.S. standards. The computer doesn't truly "understand" any of this information, but it can now manipulate the terms much more effectively in ways that are useful and meaningful to the human user.

With ontology pages on the Web, solutions to terminology (and other) problems begin to emerge. The meaning of terms or XML codes used on a Web page can be defined by pointers from the page to an ontology. Of course, the same problems as before now arise if I point to an ontology that defines addresses as containing a zip code and you point to one that uses postal code. This kind of con-

Glossary

HTML: Hypertext Markup Language. The language used to encode formatting, links and other features on Web pages. Uses standardized "tags" such as <H1> and <BODY> whose meaning and interpretation is set universally by the World Wide Web Consortium.
XML: eXtensible Markup Language. A markup language like HTML that lets individuals define and use their own tags. XML has no built-in mechanism to convey the meaning of the user's new tags to other users.
RESOURCE: Web jargon for any entity. Includes Web pages, parts of a Web page, devices, people and more.
URL: Uniform Resource Locator. The familiar codes (such as http://www.sciam.com/index.html) that are used in hyperlinks.
URI: Universal Resource Identifier. URLs are the most familiar type of URI. A URI defines or specifies an entity, not necessarily by naming its location on the Web.
RDF: Resource Description Framework. A scheme for defining information on the Web. RDF provides the technology for expressing the meaning of terms and concepts in a form that computers can readily process. RDF can use XML for its syntax and URIs to specify entities, concepts, properties and relations.
ONTOLOGIES: Collections of statements written in a language such as RDF that define the relations between concepts and specify logical rules for reasoning about them. Computers will "understand" the meaning of semantic data on a Web page by following links to specified ontologies.
AGENT: A piece of software that runs without direct human control or constant supervision to accomplish goals provided by a user. Agents typically collect, filter and process information found on the Web, sometimes with the help of other agents.
SERVICE DISCOVERY: The process of locating an agent or automated Web-based service that will perform a required function. Semantics will enable agents to describe to one another precisely what function they carry out and what input data are needed.
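The triple structure of RDF and the city-code inference rule described in this part of the article can be illustrated with a minimal sketch. It does not use real RDF syntax or any RDF library: triples are plain Python tuples, and every URI under example.org, every constant name and the function `infer_state_codes` are hypothetical, invented for illustration.

```python
# All URIs below are hypothetical, invented for this sketch.
TERMS = "http://example.org/terms#"
USES_CITY_CODE = TERMS + "usesCityCode"
ASSOCIATED_STATE_CODE = TERMS + "associatedStateCode"
HAS_STATE_CODE = TERMS + "hasStateCode"

CORNELL_ADDRESS = "http://example.org/addresses/cornell"
ITHACA = "http://example.org/citycodes/ithaca"
NEW_YORK = "http://example.org/statecodes/NY"

# Each triple is (subject, verb, object); every part is identified by a URI.
triples = {
    (CORNELL_ADDRESS, USES_CITY_CODE, ITHACA),
    (ITHACA, ASSOCIATED_STATE_CODE, NEW_YORK),
}


def infer_state_codes(facts):
    """Apply one rule: if an address uses a city code, and that city code is
    associated with a state code, then the address has that state code."""
    derived = set()
    for address, verb, city in facts:
        if verb != USES_CITY_CODE:
            continue
        for subject, other_verb, state in facts:
            if subject == city and other_verb == ASSOCIATED_STATE_CODE:
                derived.add((address, HAS_STATE_CODE, state))
    return derived


inferred = infer_state_codes(triples)
```

Because the verbs are URIs rather than words, a second party could publish a different notion of "state code" under its own URI without colliding with this one.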

fusion can be resolved if ontologies (or other Web services) provide equivalence relations: one or both of our ontologies may contain the information that my zip code is equivalent to your postal code.

Our scheme for sending in the clowns to entertain my customers is partially solved when the two databases point to different definitions of address. The program, using distinct URIs for different concepts of address, will not confuse them and in fact will need to discover that the concepts are related at all. The program could then use a service that takes a list of postal addresses (defined in the first ontology) and converts it into a list of physical addresses (the second ontology) by recognizing and removing post office boxes and other unsuitable addresses. The structure and semantics provided by ontologies make it easier for an entrepreneur to provide such a service and can make its use completely transparent.

Ontologies can enhance the functioning of the Web in many ways. They can be used in a simple fashion to improve the accuracy of Web searches: the search program can look for only those pages that refer to a precise concept instead of all the ones using ambiguous keywords. More advanced applications will use ontologies to relate the information on a page to the associated knowledge structures and inference rules. An example of a page marked up for such use is online at http://www.cs.umd.edu/~hendler. If you send your Web browser to that page, you will see the normal Web page entitled "Dr. James A. Hendler." As a human, you can readily find the link to a short biographical note and read there that Hendler received his Ph.D. from Brown University. A computer program trying to find such information, however, would have to be very complex to guess that this information might be in a biography and to understand the English language used there.

For computers, the page is linked to an ontology page that defines information about computer science departments. For instance, professors work at universities and they generally have doctorates. Further markup on the page (not displayed by the typical Web browser) uses the ontology's concepts to specify that Hendler received his Ph.D. from the entity described at the URI http://www.brown.edu/ (the Web page for Brown). Computers can also find that Hendler is a member of a particular research project, has a particular e-mail address, and so on. All that information is readily processed by a computer and could be used to answer queries (such as where Dr. Hendler received his degree) that currently would require a human to sift through the content of various pages turned up by a search engine.

In addition, this markup makes it much easier to develop programs that can tackle complicated questions whose answers do not reside on a single Web page. Suppose you wish to find the Ms. Cook you met at a trade conference last year. You don't remember her first name, but you remember that she worked for one of your clients and that her son was a student at your alma mater. An intelligent search program can sift through all the pages of people whose name is "Cook" (sidestepping all the pages relating to cooks, cooking, the Cook Islands and so forth), find the ones that mention working for a company that's on your list of clients and follow links to Web pages of their children to track down if any are in school at the right place.

Agents

The real power of the Semantic Web will be realized when people create many programs that collect Web content from diverse sources, process the information and exchange the results with other programs. The effectiveness of such software agents will increase exponentially as more machine-readable Web content and automated services (including other agents) become available. The Semantic Web promotes this synergy: even agents that were not expressly designed to work together can transfer data among themselves when the data come with semantics.

An important facet of agents' functioning will be the exchange of "proofs" written in the Semantic Web's unifying language (the language that expresses logical inferences made using rules and information such as those specified by ontologies). For example, suppose Ms. Cook's contact information has been located by an online service, and to your great surprise it places her in Johannesburg. Naturally, you want to check this, so your computer asks the service for a proof of its answer, which it promptly provides by translating its internal reasoning into the Semantic Web's unifying language. An inference engine in your computer readily verifies that this Ms. Cook indeed matches the one you were seeking, and it can show you the relevant Web pages if you still have doubts. Although they are still far from plumbing the depths of the Semantic Web's potential, some programs can already exchange proofs in this way, using the current preliminary versions of the unifying language.

Another vital feature will be digital signatures, which are encrypted blocks of data that computers and agents can use to verify that the attached information has been provided by a specific trusted source. You want to be quite sure that a statement sent to your accounting program that you owe money to an online retailer is not a forgery generated by the

THE AUTHORS

TIM BERNERS-LEE, JAMES HENDLER and ORA LASSILA are individually and collectively obsessed with the potential of Semantic Web technology. Berners-Lee is director of the World Wide Web Consortium (W3C) and a researcher at the Laboratory for Computer Science at the Massachusetts Institute of Technology. When he invented the Web in 1989, he intended it to carry more semantics than became common practice. Hendler is professor of computer science at the University of Maryland at College Park, where he has been doing research on knowledge representation in a Web context for a number of years. He and his graduate research group developed SHOE, the first Web-based knowledge representation language to demonstrate many of the agent capabilities described in this article. Hendler is also responsible for agent-based computing research at the Defense Advanced Research Projects Agency (DARPA) in Arlington, Va. Lassila is a research fellow at the Nokia Research Center in Boston, chief scientist of Nokia Venture Partners and a member of the W3C Advisory Board. Frustrated with the difficulty of building agents and automating tasks on the Web, he co-authored W3C's RDF specification, which serves as the foundation for many current Semantic Web efforts.
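The equivalence relations discussed on this page (my zip code is equivalent to your postal code) can be sketched as follows. This is only an illustration under stated assumptions: the URIs, the sample records and the helper names `canonical_terms` and `merge_records` are hypothetical, and a real system would read the equivalence statements from published ontologies rather than from a hard-coded list.

```python
# Hypothetical URIs for the same concept in two independent ontologies.
MY_ZIP = "http://example.org/my-ontology#zipCode"
YOUR_POSTAL = "http://example.org/your-ontology#postalCode"
NAME = "http://example.org/shared#name"

# Equivalence statements, as one or both ontologies might publish them.
equivalences = [(MY_ZIP, YOUR_POSTAL)]


def canonical_terms(pairs):
    """Map every term to one representative of its equivalence class."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            term = parent[term]
        return term

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Pick the alphabetically smaller URI as the representative.
            parent[max(root_left, root_right)] = min(root_left, root_right)
    return {term: find(term) for term in list(parent)}


def merge_records(records_a, records_b, pairs):
    """Combine two databases, renaming equivalent fields to one shared term."""
    mapping = canonical_terms(pairs)
    return [
        {mapping.get(field, field): value for field, value in record.items()}
        for record in records_a + records_b
    ]


# Made-up records keyed by field URIs.
database_a = [{NAME: "Customer One", MY_ZIP: "14853"}]
database_b = [{NAME: "Customer Two", YOUR_POSTAL: "20742"}]
merged = merge_records(database_a, database_b, equivalences)
zip_values = [record[MY_ZIP] for record in merged]
```

After merging, a program can query one field URI and see the values from both databases, even though neither database was designed with the other in mind.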

computer-savvy teenager next door. Agents should be skeptical of assertions that they read on the Semantic Web until they have checked the sources of information. (We wish more people would learn to do this on the Web as it is!)

Many automated Web-based services already exist without semantics, but other programs such as agents have no way to locate one that will perform a specific function. This process, called service discovery, can happen only when there is a common language to describe a service in a way that lets other agents "understand" both the function offered and how to take advantage of it. Services and agents can advertise their function by, for example, depositing such descriptions in directories analogous to the Yellow Pages.

Some low-level service-discovery schemes are currently available, such as Microsoft's Universal Plug and Play, which focuses on connecting different types of devices, and Sun Microsystems's Jini, which aims to connect services. These initiatives, however, attack the problem at a structural or syntactic level and rely heavily on standardization of a predetermined set of functionality descriptions. Standardization can only go so far, because we can't anticipate all possible future needs.

The Semantic Web, in contrast, is more flexible. The consumer and producer agents can reach a shared understanding by exchanging ontologies, which provide the vocabulary needed for discussion. Agents can even "bootstrap" new reasoning capabilities when they discover new ontologies. Semantics also makes it easier to take advantage of a service that only partially matches a request.

A typical process will involve the creation of a "value chain" in which subassemblies of information are passed from one agent to another, each one "adding value," to construct the final product requested by the end user. Make no mistake: to create complicated value chains automatically on demand, some agents will exploit artificial-intelligence technologies in addition to the Semantic Web. But the Semantic Web will provide the foundations and the framework to make such technologies more feasible.

Putting all these features together results in the abilities exhibited by Pete's and Lucy's agents in the scenario that opened this article. Their agents would have delegated the task in piecemeal fashion to other services and agents discovered through service advertisements. For example, they could have used a trusted service to take a list of providers and determine which of them are in-plan for a specified insurance plan and course of treatment. The list of providers would have been supplied by another search service, et cetera. These activities formed chains in which a large amount of data distributed across the Web (and almost worthless in that form) was progressively reduced to the small amount of data of high value to Pete and Lucy: a plan of appointments to fit their schedules and other requirements.

In the next step, the Semantic Web will break out of the virtual realm and extend into our physical world. URIs can point to anything, including physical entities, which means we can use the RDF language to describe devices such as cell phones and TVs. Such devices can advertise their functionality (what they can do and how they are controlled) much like

[Figure: diagram of Lucy's agent at work, with elements labeled Dr.'s Office, Ontologies, Provider Finder Service, Insurance Co. Lists, Provider Sites and Individual Provider Site. Illustration credit: XPLANE. Numbered steps in the diagram:]
1. Lucy issues instructions.
2. Her agent follows hyperlinks in the request to ontologies where key terms are defined. Links to ontologies are used at every step.
3. After getting treatment info from the doctor's computer and schedule info from Lucy's and Pete's computers, the agent goes to a provider finder service.
4. Lucy's agent and the finder service negotiate using ontologies and agree on payment for its service.
5. The finder service sends out its own agents to look at semantics-enhanced insurance company lists and provider sites.
6. Lucy's agent interacts with the selected individual provider sites to find one with suitable open appointment times, which it tentatively reserves.
7. The agent sends the appointment plan to Lucy and Pete at Pete's home (per Lucy's request) for their approval.

SOFTWARE AGENTS will be greatly facilitated by semantic content on the Web. In the depicted scenario, Lucy's agent tracks down a physical therapy clinic for her mother that meets a combination of criteria and has open appointment times that mesh with her and her brother Pete's schedules. Ontologies that define the meaning of semantic data play a key role in enabling the agent to understand what is on the Semantic Web, interact with sites and employ other automated services.
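Service discovery through a directory "analogous to the Yellow Pages," as described on this page, might be sketched as below. This is a toy model under stated assumptions, not a real discovery protocol such as Universal Plug and Play or Jini: the `Advertisement` and `Directory` classes, the agent addresses and all URIs are hypothetical. The point it illustrates is that services describe both the function they offer and the input data they need using shared URIs.

```python
from dataclasses import dataclass

# Hypothetical URIs naming functions and data types in a shared ontology.
FIND_PROVIDERS = "http://example.org/services#findProviders"
CHECK_IN_PLAN = "http://example.org/services#checkInPlan"
TREATMENT = "http://example.org/terms#treatment"
INSURANCE_PLAN = "http://example.org/terms#insurancePlan"
PROVIDER_LIST = "http://example.org/terms#providerList"


@dataclass(frozen=True)
class Advertisement:
    agent: str          # where the advertising agent can be reached
    function: str       # URI of the function it performs
    inputs: frozenset   # URIs of the input data it needs


class Directory:
    """A registry of service descriptions, analogous to the Yellow Pages."""

    def __init__(self):
        self._advertisements = []

    def advertise(self, advertisement):
        self._advertisements.append(advertisement)

    def discover(self, function, available_inputs):
        """Return services that offer the function and whose required
        inputs the requesting agent is able to supply."""
        return [
            ad for ad in self._advertisements
            if ad.function == function and ad.inputs <= available_inputs
        ]


directory = Directory()
directory.advertise(Advertisement(
    "http://example.org/agents/finder", FIND_PROVIDERS, frozenset({TREATMENT})))
directory.advertise(Advertisement(
    "http://example.org/agents/plan-checker", CHECK_IN_PLAN,
    frozenset({PROVIDER_LIST, INSURANCE_PLAN})))

# An agent that knows the treatment and the insurance plan looks for a finder.
lucy_has = frozenset({TREATMENT, INSURANCE_PLAN})
finders = directory.discover(FIND_PROVIDERS, lucy_has)
checkers = directory.discover(CHECK_IN_PLAN, lucy_has)
```

Here the plan checker is not yet usable because the agent has no provider list; once a finder service supplies one, the same query would succeed, which is the kind of chaining the article describes.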

What Is the Killer App?
AFTER WE GIVE a presentation about the Semantic Web, we're often asked, "Okay, so what is the killer application of the Semantic Web?" The killer app of any technology, of course, is the application that brings a user to investigate the technology and start using it. The transistor radio was a killer app of transistors, and the cell phone is a killer app of wireless technology.

So what do we answer? "The Semantic Web is the killer app."

At this point we're likely to be told we're crazy, so we ask a question in turn: "Well, what's the killer app of the World Wide Web?" Now we're being stared at kind of fish-eyed, so we answer ourselves: "The Web is the killer app of the Internet. The Semantic Web is another killer app of that magnitude."

The point here is that the abilities of the Semantic Web are too general to be thought about in terms of solving one key problem or creating one essential gizmo. It will have uses we haven't dreamed of.

Nevertheless, we can foresee some disarming (if not actually killer) apps that will drive initial use. Online catalogs with semantic markup will benefit both buyers and sellers. Electronic commerce transactions will be easier for small businesses to set up securely with greater autonomy. And one final example: you make reservations for an extended trip abroad. The airlines, hotels, soccer stadiums and so on return confirmations with semantic markup. All the schedules load directly into your date book and all the expenses directly into your accounting program, no matter what semantics-enabled software you use. No more laborious cutting and pasting from e-mail. No need for all the businesses to supply the data in half a dozen different formats or to create and impose their own standard format.

software agents. Being much more flexible than low-level schemes such as Universal Plug and Play, such a semantic approach opens up a world of exciting possibilities.

For instance, what today is called home automation requires careful configuration for appliances to work together. Semantic descriptions of device capabilities and functionality will let us achieve such automation with minimal human intervention. A trivial example occurs when Pete answers his phone and the stereo sound is turned down. Instead of having to program each specific appliance, he could program such a function once and for all to cover every local device that advertises having a volume control: the TV, the DVD player and even the media players on the laptop that he brought home from work this one evening.

The first concrete steps have already been taken in this area, with work on developing a standard for describing functional capabilities of devices (such as screen sizes) and user preferences. Built on RDF, this standard is called Composite Capability/Preference Profile (CC/PP). Initially it will let cell phones and other nonstandard Web clients describe their characteristics so that Web content can be tailored for them on the fly. Later, when we add the full versatility of languages for handling ontologies and logic, devices could automatically seek out and employ services and other devices for added information or functionality. It is not hard to imagine your Web-enabled microwave oven consulting the frozen-food manufacturer's Web site for optimal cooking parameters.

Evolution of Knowledge

THE SEMANTIC WEB is not merely the tool for conducting individual tasks that we have discussed so far. In addition, if properly designed, the Semantic Web can assist the evolution of human knowledge as a whole.

Human endeavor is caught in an eternal tension between the effectiveness of small groups acting independently and the need to mesh with the wider community. A small group can innovate rapidly and efficiently, but this produces a subculture whose concepts are not understood by others. Coordinating actions across a large group, however, is painfully slow and takes an enormous amount of communication. The world works across the spectrum between these extremes, with a tendency to start small (from the personal idea) and move toward a wider understanding over time.

An essential process is the joining together of subcultures when a wider common language is needed. Often two groups independently develop very similar concepts, and describing the relation between them brings great benefits. Like a Finnish-English dictionary, or a weights-and-measures conversion table, the relations allow communication and collaboration even when the commonality of concept has not (yet) led to a commonality of terms.

The Semantic Web, in naming every concept simply by a URI, lets anyone express new concepts that they invent with minimal effort. Its unifying logical language will enable these concepts to be progressively linked into a universal Web. This structure will open up the knowledge and workings of humankind to meaningful analysis by software agents, providing a new class of tools by which we can live, work and learn together.

MORE TO EXPLORE

Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. Tim Berners-Lee, with Mark Fischetti. Harper San Francisco, 1999.
World Wide Web Consortium (W3C): www.w3.org/
W3C Semantic Web Activity: www.w3.org/2001/sw/
An introduction to ontologies: www.SemanticWeb.org/knowmarkup.html
Simple HTML Ontology Extensions Frequently Asked Questions (SHOE FAQ): www.cs.umd.edu/projects/plus/SHOE/faq.html
DARPA Agent Markup Language (DAML) home page: www.daml.org/

The Worldwide Computer
By David P. Anderson and John Kubiatowicz
An operating system spanning the Internet would bring the power of millions of the world's Internet-connected PCs to everyone's fingertips

When Mary gets home from work and goes to her PC to check e-mail, the PC isn't just sitting there. It's working for a biotech company, matching gene sequences to a library of protein molecules. Its DSL connection is busy downloading a block of radio telescope data to be analyzed later. Its disk contains, in addition to Mary's own files, encrypted fragments of thousands of other files. Occasionally one of these fragments is read and transmitted; it's part of a movie that someone is watching in Helsinki. Then Mary moves the mouse, and this activity abruptly stops. Now the PC and its network connection are all hers.

This sharing of resources doesn't stop at her desktop computer. The laptop computer in her satchel is turned off, but its disk is filled with bits and pieces of other people's files, as part of a distributed backup system. Mary's critical files are backed up in the same way, saved on dozens of disks around the world.

Later, Mary watches an independent film on her Internet-connected digital television, using a pay-per-view system. The movie is assembled on the fly from fragments on several hundred computers belonging to people like her.

Mary's computers are moonlighting for other people. But they're not giving anything away for free. As her PC works, pennies trickle into her virtual bank account. The payments come from the biotech company, the movie system and the backup service. Instead of buying expensive server farms, these companies are renting time and space, not just on Mary's two computers but on millions of others as well. It's a win-win situation. The companies save money on hardware, which enables, for instance, the movie-viewing service to offer obscure movies. Mary earns a little cash, her files are backed up, and she gets to watch an indie film. All this could happen with an Internet-scale operating system (ISOS) to provide the necessary glue to link the processing and storage capabilities of millions of independent computers.

Internet-Scale Applications

ALTHOUGH MARY'S WORLD is fictional, and an Internet-scale operating system does not yet exist, developers have already produced a number of Internet-scale, or peer-to-peer, applications that attempt to tap the vast array of underutilized machines available through the Internet [see box on page 42].

These applications accomplish goals that would be difficult, unaffordable or impossible to attain using dedicated computers. Further, today's systems are just the beginning: we can easily conceive of archival services that could be relied on for hundreds of years and intelligent search engines for tomorrow's Semantic Web [see "The Semantic Web," by Tim Berners-Lee, James Hendler and Ora Lassila; Scientific American, May 2001].

Unfortunately, the creation of Internet-scale applications remains an imposing challenge. Developers must build each new application from the ground up, with much effort spent on technical matters, such as maintaining a database of users, that have little to do with the application itself. If Internet-scale applications are to become mainstream, these infrastructure issues must be dealt with once and for all.

We can gain inspiration for eliminating this duplicate effort from operating systems such as Unix and Microsoft Windows. An operating system provides a virtual computing environment in which programs operate as if they were in sole possession of the computer. It shields programmers from the painful details of memory and disk allocation, communication protocols, scheduling of myriad processes, and interfaces to devices for data input and output. An operating system greatly simplifies the development of new computer programs. Similarly, an Internet-scale operating system would simplify the development of new distributed applications.

More than 150 MILLION hosts are connected to the Internet, and the number is GROWING exponentially.

Existing Distributed Systems


COMPUTING

GIMPS (Great Internet Mersenne Prime Search): www.mersenne.org/
Searches for large prime numbers. About 130,000 people are signed up, and five new primes have been found, including the largest prime known, which has four million digits.

distributed.net: www.distributed.net/
Has decrypted several messages by using brute-force searches through the space of possible encryption keys. More than 100 billion keys are tried each second on its current decryption project. Also searches for sets of numbers called optimal Golomb rulers, which have applications in coding and communications.

SETI@home (Search for Extraterrestrial Intelligence): http://setiathome.berkeley.edu/
Analyzes radio telescope data, searching for signals of extraterrestrial origin. A total of 3.4 million users have devoted more than 800,000 years of processor time to the task.

folding@home: http://folding.stanford.edu/
Run by Vijay Pande's group in the chemistry department at Stanford University, this project has about 20,000 computers performing molecular-dynamics simulations of how proteins fold, including the folding of Alzheimer amyloid-beta protein.

Intel/United Devices cancer research project: http://members.ud.com/projects/cancer/
Searches for possible cancer drugs by testing which of 3.5 billion molecules are best shaped to bind to any one of eight proteins that cancers need to grow.

STORAGE

Napster: www.napster.com/
Allowed users to share digital music. A central database stored the locations of all files, but data were transferred directly between user systems. Songwriters and music publishers brought a class-action lawsuit against Napster. The parties reached an agreement whereby rights to the music would be licensed to Napster and artists would be paid, but the new fee-based service had not started as of January 2002.

Gnutella: www.gnutella.com/
Provides a private, secure shared file system. There is no central server; instead a request for a file is passed from each computer to all its neighbors.

Freenet: http://freenetproject.org/
Offers a similar service to Gnutella but uses a better file-location protocol. Designed to keep file requesters and suppliers anonymous and to make it difficult for a host owner to determine or be held responsible for the Freenet files stored on his computer.

Mojo Nation: www.mojonation.net/
Also similar to Gnutella, but files are broken into small pieces that are stored on different computers to improve the rate at which data can be uploaded to the network. A virtual payment system encourages users to provide resources.

Fasttrack P2P Stack: www.fasttrack.nu/
A peer-to-peer system in which more powerful computers become search hubs as needed. This software underlies the Grokster, MusicCity (Morpheus) and KaZaA file-sharing services.
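
The box describes Gnutella's serverless lookup only in outline: a request is passed from each computer to all its neighbors. A minimal sketch of that flooding idea follows. The hop limit `ttl`, the dictionary layout, and the function name `flood_search` are assumptions of this sketch rather than details taken from the box, and no real network traffic is involved.

```python
from collections import deque


def flood_search(neighbors, files, origin, wanted, ttl=3):
    """Flood a file request outward from `origin`; return the hosts holding `wanted`.

    neighbors: dict mapping each host to the hosts it is directly connected to
    files:     dict mapping each host to the set of file names it stores
    ttl:       maximum number of hops a request may travel (assumed here)
    """
    seen = {origin}
    queue = deque([(origin, 0)])
    hits = []
    while queue:
        host, hops = queue.popleft()
        if wanted in files.get(host, set()):
            hits.append(host)
        if hops == ttl:
            continue  # the request has used up its hops; do not forward it
        for peer in neighbors.get(host, []):
            if peer not in seen:  # each host handles a given request only once
                seen.add(peer)
                queue.append((peer, hops + 1))
    return hits


if __name__ == "__main__":
    # A simple chain of five hosts: a - b - c - d - e
    links = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
    stored = {"d": {"song.mp3"}, "e": {"song.mp3"}}
    print(flood_search(links, stored, "a", "song.mp3", ttl=3))
    print(flood_search(links, stored, "a", "song.mp3", ttl=4))
```

With no central index, the hop limit trades completeness for traffic: a copy that sits one hop beyond the limit is simply never found.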

An ISOS consists of a thin layer of software (an ISOS agent) that runs on each host computer (such as Mary's) and a central coordinating system that runs on one or more ISOS server complexes. This veneer of software would provide only the core functions of allocating and scheduling resources for each task, handling communication among host computers and determining the reimbursement required for each machine. This type of operating system, called a microkernel, relegates higher-level functions to programs that make use of the operating system but are not a part of it. For instance, Mary would not use the ISOS directly to save her files as pieces distributed across the Internet. She might run a backup application that used ISOS functions to do that for her. The ISOS would use principles borrowed from economics to apportion computing resources to different users efficiently and fairly and to compensate the owners of the resources.

Two broad types of applications might benefit from an ISOS. The first is distributed data processing, such as physical simulations, radio signal analysis, genetic analysis, computer graphics rendering and financial modeling. The second is distributed online services, such as file storage systems, databases, hosting of Web sites, streaming media (such as online video) and advanced Web search engines.

MOONLIGHTING COMPUTERS

With Internet-scale applications, PCs around the world can work during times when they would otherwise sit idle. Here's how it works:

1. An Internet-scale operating system (ISOS) coordinates all the participating computers and pays them for their work.
2. Mary's home computer works while she's away. It's one of millions of PCs that are crunching data and delivering file fragments for the network.
3. Her laptop stores backup copies of encrypted fragments of other users' files. The laptop is connected only occasionally, but that suffices.
4. When Mary gets back on her PC, the work for the network is automatically suspended.
5. Later, Mary watches an obscure indie movie that is consolidated from file fragments delivered by the network.

THE AUTHORS

DAVID P. ANDERSON and JOHN KUBIATOWICZ are both associated with the University of California, Berkeley. Anderson was on the faculty of the computer science department from 1985 to 1991. He is now director of the SETI@home project and chief science officer of United Devices, a provider of distributed computing software that is allied with the distributed.net project. Kubiatowicz is an assistant professor of computer science at Berkeley and is chief architect of OceanStore, a distributed storage system under development with many of the properties required for an ISOS.

What's Mine Is Yours

COMPUTING TODAY operates predominantly as a private resource; organizations and individuals own the systems that they use. An ISOS would facilitate a new paradigm in which it would be routine to make use of resources all across the Internet. The resource pool (hosts able to compute or store data and networks able to transfer data between hosts) would still be individually owned, but they could work for anyone. Hosts would include desktops, laptops, server computers, network-attached storage devices and maybe handheld devices.

The Internet resource pool differs from private resource pools in several important ways. More than 150 million hosts are connected to the Internet, and the number is growing exponentially. Consequently, an ISOS could provide a virtual computer with potentially 150 million times the processing speed and storage capacity of a typical single computer. Even when this virtual computer is divided up among many users, and after one allows for the overhead of running the network, the result is a bigger, faster and cheaper computer than the users could own privately. Continual upgrading of the resource pool's hardware causes the total speed and capacity of this über-computer to increase even faster than the number of connected hosts. Also, the pool is self-maintaining: when

a computer breaks down, its owner eventually fixes or replaces it.

HOW A DISTRIBUTED SERVICE WOULD OPERATE

1. Acme Movie Service wants to distribute movies to viewers. Acme requests hosts from the ISOS server, the system's traffic cop. The ISOS server sends a host list, for which Acme pays.
2. ISOS tells its agent programs on the host computers to perform tasks for Acme. ISOS pays the hosts for the use of their resources.
3. Acme sends its movie service agent program to the hosts. Acme splits its movie into fragments and also sends them to hosts.
4. Mary orders an Acme movie and pays Acme directly.
5. Movie service instructs its agents to send Mary the movie. ISOS pays the hosts for their work.
6. Hundreds of hosts send small pieces of the movie file to Mary's Internet-enabled TV.
7. The movie is assembled, and Mary is free to enjoy her Acme movie.

Extraordinary parallel data transmission is possible with the Internet resource pool. Consider Mary's movie, being uploaded in fragments from perhaps 200 hosts. Each host may be a PC connected to the Internet by an antiquated 56k modem, far too slow to show a high-quality video, but combined they could deliver 10 megabits a second, better than a cable modem. Data stored in a distributed system are available from any location (with appropriate security safeguards) and can survive disasters that knock out sections of the resource pool. Great security is also possible, with systems that could not be compromised without breaking into, say, 10,000 computers.

In this way, the Internet-resource paradigm can increase the bounds of what is possible (such as higher speeds or larger data sets) for some applications, whereas for others it can lower the cost. For certain applications it may do neither: it's a paradigm, not a panacea. And designing an ISOS also presents a number of obstacles.

Some characteristics of the resource pool create difficulties that an ISOS must deal with. The resource pool is heterogeneous: Hosts have different processor types and operating systems. They have varying amounts of memory and disk space and a wide range of Internet connection speeds. Some hosts are behind firewalls or other similar layers of software that prohibit or hinder incoming connections. Many hosts in the pool are available only sporadically; desktop PCs are turned off at night, and laptops and systems using modems are frequently not connected. Hosts disappear unpredictably, sometimes permanently, and new hosts appear.

The ISOS must also take care not to antagonize the owners of hosts. It must have a minimal impact on the non-ISOS uses of the hosts, and it must respect limitations that owners may impose, such as allowing a host to be used only at night or only for specific types of applications. Yet the ISOS cannot trust every host to play by the rules in return for its own good behavior. Owners can inspect and modify the activities of their hosts. Curious and malicious users may attempt to disrupt, cheat or spoof the system. All these problems have a major influence on the design of an ISOS.

Who Gets What?

AN INTERNET-SCALE operating system must address two fundamental issues: how to allocate resources and how to compensate resource suppliers. A model based on economic principles in which suppliers lease resources to consumers can deal with both issues at once. In the 1980s researchers at Xerox PARC proposed and analyzed economic approaches to apportioning computer resources. More recently, Mojo Nation developed a file-sharing system in which users are paid in a virtual currency (mojo) for use of their resources and they in turn must pay mojo to use the system. Such economic models encourage owners to allow their resources to be used by other organizations, and theory shows that they lead to optimal allocation of resources.

Even with 150 million hosts at its disposal, the ISOS will be dealing in scarce

resources, because some tasks will request and be capable of using essentially unlimited resources. As it constantly decides where to run data-processing jobs and how to allocate storage space, the ISOS must try to perform tasks as cheaply as possible. It must also be fair, not allowing one task to run efficiently at the expense of another. Making these criteria precise, and devising scheduling algorithms to achieve them even approximately, are areas of active research.

The economic system for a shared network must define the basic units of a resource, such as the use of a megabyte of disk space for a day, and assign values that take into account properties such as the rate, or bandwidth, at which the storage can be accessed and how frequently it is available to the network. The system must also define how resources are bought and sold (whether they are paid for in advance, for instance) and how prices are determined (by auction or by a price-setting middleman).

Within this framework, the ISOS must accurately and securely keep track of resource usage. The ISOS would have an internal bank with accounts for suppliers and consumers that it must credit or debit according to resource usage. Participants can convert between ISOS currency and real money. The ISOS must also ensure that any guarantees of resource availability can be met: Mary doesn't want her movie to grind to a halt partway through. The economic system lets resource suppliers control how their resources are used. For example, a PC owner might specify that her computer's processor can't be used between 9 A.M. and 5 P.M. unless a very high price is paid.

Money, of course, encourages fraud, and ISOS participants have many ways to try to defraud one another. For instance, resource sellers, by modifying or fooling the ISOS agent program running on their computer, may return fictitious results without doing any computation. Researchers have explored statistical methods for detecting malicious or malfunctioning hosts. A recent idea for preventing unearned computation credit is to ensure that each work unit has a number of intermediate results that the server can quickly check and that can be obtained only by performing the entire computation. Other approaches are needed to prevent fraud in data storage and service provision.

The cost of ISOS resources to end users will converge to a fraction of the cost of owning the hardware. Ideally, this fraction will be large enough to encourage owners to participate and small enough to make many Internet-scale applications economically feasible. A typical PC owner might see the system as a barter economy in which he gets free services, such as file backup and Web hosting, in exchange for the use of his otherwise idle processor time and disk space.

A Basic Architecture

WE ADVOCATE two basic principles in our ISOS design: a minimal core operating system and control by central servers.

Curious and malicious USERS may attempt to DISRUPT, CHEAT or spoof the system.

Primes and Crimes
By Graham P. Collins

NO ONE HAS SEEN signs of extraterrestrials using a distributed computation project (yet), but people have found the largest-known prime numbers, five-figure reward money, and legal trouble.

The Great Internet Mersenne Prime Search (GIMPS), operating since 1996, has turned up five extremely large prime numbers so far. The fifth and largest was discovered in November 2001 by 20-year-old Michael Cameron of Owen Sound, Ontario. Mersenne primes can be expressed as 2^P - 1, where P is itself a prime number. Cameron's is 2^13,466,917 - 1, which would take four million digits to write out. His computer spent 45 days discovering that his number is a prime; altogether the GIMPS network expended 13,000 years of computer time eliminating other numbers that could have been the 39th Mersenne.

The 38th Mersenne prime, a mere two million digits long, earned its discoverer (Nayan Hajratwala of Plymouth, Mich.) a $50,000 reward for being the first prime with more than a million digits. A prime with 10 million digits will win someone $100,000.

A Georgia computer technician, on the other hand, has found nothing but trouble through distributed computation. In 1999 David McOwen installed the client program for the distributed.net decryption project on computers in seven offices of the DeKalb Technical Institute, along with Y2K upgrades. During the Christmas holidays, the computers' activity was noticed, including small data uploads and downloads each day. In January 2000 McOwen was suspended, and he resigned soon thereafter.

Case closed? Case barely opened: The Georgia Bureau of Investigation spent 18 months investigating McOwen as a computer criminal, and in October 2001 he was charged with eight felonies under Georgia's computer crime law. The one count of computer theft and seven counts of computer trespass each carry a $50,000 fine and up to 15 years in prison. On January 17, a deal was announced whereby McOwen will serve one year of probation, pay $2,100 in restitution and perform 80 hours of community service unrelated to computers or technology.

Graham P. Collins is a staff writer and editor.
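
The Mersenne-prime figures in the "Primes and Crimes" sidebar can be checked with a short sketch. The digit count follows directly from the exponent 13,466,917 given there. The sidebar does not say how GIMPS tests candidates; the Lucas-Lehmer test shown here is the standard primality test for Mersenne numbers and is included as an assumption, run only on small exponents. The function names are illustrative.

```python
import math


def mersenne_digits(p: int) -> int:
    """Decimal digits of 2**p - 1 (the same as for 2**p, which is never a power of 10)."""
    return math.floor(p * math.log10(2)) + 1


def lucas_lehmer(p: int) -> bool:
    """Lucas-Lehmer test: for a prime exponent p, report whether 2**p - 1 is prime."""
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the recurrence below needs an odd p
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    # Length of the prime described in the sidebar, about four million digits
    print(mersenne_digits(13_466_917))
    # Which small prime exponents give Mersenne primes
    print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31) if lucas_lehmer(p)])
```

Each step of the test squares a number as long as the candidate itself, which gives some sense of why a single four-million-digit candidate occupied one PC for weeks.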

WHAT AN INTERNET-SCALE OPERATING SYSTEM COULD DO

By harnessing the massive unused computing resources of the global network, an ISOS would make short work of daunting number-crunching tasks and data storage. Here are just a few of the possibilities:

Financial modeling
SETI: Analysis of celestial radio signals
3-D rendering and animation
Matching gene sequences
Streaming media pay-per-view service
File backup and archiving for hundreds of years

A computer operating system that provides only core functions is called a microkernel. Higher-level functions are built on top of it as user programs, allowing them to be debugged and replaced more easily. This approach was pioneered in academic research systems and has influenced some commercial systems, such as Windows NT. Most well-known operating systems, however, are not microkernels.

The core facilities of an ISOS include resource allocation (long-term assignment of hosts' processing power and storage), scheduling (putting jobs into queues, both across the system and within individual hosts), accounting of resource usage, and the basic mechanisms for distributing and executing application programs. The ISOS should not duplicate features of local operating systems running on hosts.

The system should be coordinated by servers operated by the ISOS provider, which could be a government-funded organization or a consortium of companies that are major resource sellers and buyers. (One can imagine competing ISOS providers, but we will keep things simple and assume a unique provider.) Centralization runs against the egalitarian approach popular in some peer-to-peer systems, but central servers are needed to ensure privacy of sensitive data, such as accounting data and other information about the resource hosts. Centralization might seem to require a control system that will become excessively large and unwieldy as the number of ISOS-connected hosts increases, and it appears to introduce a bottleneck that will choke the system anytime it is unavailable. These fears are unfounded: a reasonable number of servers can easily store information about every Internet-connected host and communicate with them regularly. Napster, for example, handled almost 60 million clients using a central server. Redundancy can be built into the server complex, and most ISOS online services can continue operating even with the servers temporarily unavailable.

The ISOS server complex would maintain databases of resource descriptions, usage policies and task descriptions. The resource descriptions include, for example, the host's operating system, processor type and speed, total and free disk space, memory space, performance statistics of its network connections, and statistical descriptions of when it is powered on and connected to the network. Usage policies spell out the rules an owner has dictated for using her resources. Task descriptions include the resources assigned to an online service and the queued jobs of a data-processing task.

To make their computers available to the network, resource sellers contact the server complex (for instance, through a Web site) to download and install an ISOS agent program, to link resources to their ISOS account, and so on. The ISOS agent manages the host's resource usage. Periodically it obtains from the ISOS server complex a list of tasks to perform.

Resource buyers send the servers task requests and application agent programs (to be run on hosts). An online service provider can ask the ISOS for a set of hosts on which to run, specifying its resource requirements (for example, a distributed backup service could use sporadically connected resource hosts, such as Mary's laptop, which would cost less than constantly connected hosts). The ISOS supplies the service with addresses and descriptions of the granted hosts and allows the application agent program to communicate directly between hosts on which it is running. The service can re-

quest new hosts when some become un- data facility aids in this task with mech- them are trying to lead the process astray.
available. The ISOS does not dictate how anisms for encoding, reconstructing and Other facilities. The toolkit also assists
clients make use of an online service, how repairing data. For maximum survivabil- by providing additional facilities, such as
the service responds or how clients are ity, data are encoded with an m-of-n format conversion (to handle the hetero-
charged by the service (unlike the ISOS- code. An m-of-n code is similar in princi- geneous nature of hosts) and synchro-
controlled payments flowing from re- ple to a hologram, from which a small nization libraries (to aid in cooperation
source users to host owners). piece suffices for reconstructing the among hosts).
whole image. The encoding spreads in- An ISOS suffers from a familiar
An Application Toolkit formation over n fragments (on n re- catch-22 that slows the adoption of
I N P R I N C I P L E , the basic facilities of the source hosts), any m of which are suffi- many new technologies: Until a wide user
ISOS resource allocation, scheduling cient to reconstruct the data. For instance, base exists, only a limited set of applica-
and communication are sufficient to the facility might encode a document into tions will be feasible on the ISOS. Con-
construct a wide variety of applications. 64 fragments, any 16 of which suffice to versely, as long as the applications are
Most applications, however, will have reconstruct it. Continuous repair is also few, the user base will remain small. But
important subcomponents in common. It important. As fragments fail, the repair if a critical mass can be achieved by con-
is useful, therefore, to have a software facility would regenerate them. If prop- vincing enough developers and users of

A typical PC owner might see the system as a


BARTER ECONOMY that provides free services in
exchange for PROCESSOR TIME and DISK SPACE.
toolkit to further assist programmers in erly constructed, a persistent data facili- the intrinsic usefulness of an ISOS, the
building new applications. Code for these ty could preserve information for hun- system should grow rapidly.
facilities will be incorporated into appli- dreds of years. The Internet remains an immense un-
cations on resource hosts. Examples of Secure update. New problems arise tapped resource. The revolutionary rise in
these facilities include: when applications need to update stored popularity of the World Wide Web has
Location independent routing. Applica- information. For example, all copies of the not changed that it has made the re-
tions running with the ISOS can spread information must be updated, and the ob- source pool all the larger. An Internet-
copies of information and instances of jects GUID must point to its latest copy. scale operating system would free pro-
computation among millions of resource An access control mechanism must pre- grammers to create applications that
hosts. They have to be able to access vent unauthorized persons from updat- could run on this World Wide Computer
them again. To facilitate this, applica- ing information. The secure update facil- without worrying about the underlying
tions name objects under their purview ity relies on Byzantine agreement proto- hardware. Who knows what will result?
with Globally Unique Identifiers (GUIDs). cols, in which a set of resource hosts come Mary and her computers will be doing
These names enable location indepen- to a correct decision, even if a third of things we havent even imagined.
dent routing, which is the ability to send
queries to objects without knowing their MORE TO E XPLORE
location. A simplistic approach to loca- The Ecology of Computation. B. A. Huberman. North-Holland, 1988.
tion independent routing could involve a The Grid: Blueprint for a New Computing Infrastructure. Edited by Ian Foster and Carl Kesselman.
database of GUIDs on a single machine, Morgan Kaufmann Publishers, 1998.
but that system is not amenable to han- Peer-to-Peer: Harnessing the Power of Disruptive Technologies. Edited by Andy Oram.
dling queries from millions of hosts. In- OReilly & Associates, 2001.
stead the ISOS toolkit distributes the Many research projects are working toward an Internet-scale operating system, including:
database of GUIDs among resource Chord: www.pdos.lcs.mit.edu/chord/
hosts. This kind of distributed system is Cosm: www.mithral.com/projects/cosm/
being explored in research projects such Eurogrid: www.eurogrid.org/
as the OceanStore persistent data storage Farsite: http://research.microsoft.com/sn/farsite/
project at the University of California at Grid Physics Network (Griphyn): www.griphyn.org/
Berkeley. OceanStore: http://oceanstore.cs.berkeley.edu/
Persistent data storage. Information Particle Physics Data Grid: www.ppdg.net/
stored by the ISOS must be able to sur- Pastry: www.research.microsoft.com/~antr/pastry/
vive a variety of mishaps. The persistent Tapestry: www.cs.berkeley.edu/~ravenben/tapestry/
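
As an illustration of the distributed GUID database described under location independent routing, here is a minimal Python sketch. The article does not say how the ISOS toolkit would partition GUIDs among hosts; this sketch assumes a consistent-hashing ring, the general idea behind lookup systems such as Chord. The names `ring_position`, `GuidDirectory` and `host_for` are hypothetical, and a real system would route a query in a few hops between hosts rather than consult one global host list as this toy does.

```python
import hashlib
from bisect import bisect_left


def ring_position(name: str) -> int:
    """Map any name (a host or a GUID) to a point on a 160-bit ring."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


class GuidDirectory:
    """Toy directory: each GUID is owned by the first host at or after it on the ring."""

    def __init__(self, hosts):
        # Assumption: every participant knows the full host list.
        self._ring = sorted((ring_position(h), h) for h in hosts)
        self._keys = [pos for pos, _ in self._ring]

    def host_for(self, guid: str) -> str:
        """Return the host responsible for this GUID, wrapping around the ring."""
        i = bisect_left(self._keys, ring_position(guid)) % len(self._ring)
        return self._ring[i][1]


hosts = [f"host-{i}.example" for i in range(8)]  # hypothetical host names
directory = GuidDirectory(hosts)
owner = directory.host_for("doc-42")  # hypothetical GUID
```

The point of this design is that no single machine holds the whole database, and removing a host changes the owner only for the GUIDs that host was responsible for.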

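
The m-of-n encoding described under persistent data storage can be sketched with polynomial interpolation over a small prime field. This is one standard way to build such a code, not necessarily the one an ISOS would use; the function names and the choice of the prime 257 are assumptions made for illustration. A block of m bytes defines a polynomial of degree below m, each fragment is one point on that polynomial, and any m points determine the polynomial again.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, n: int):
    """Spread m = len(data) bytes over n fragments; fragment k is the pair (k, value)."""
    m = len(data)
    assert m <= n < P
    points = list(enumerate(data, start=1))  # byte i is the polynomial's value at x = i
    return [(k, _interpolate(points, k)) for k in range(1, n + 1)]


def decode(fragments, m: int) -> bytes:
    """Rebuild the original m bytes from any m distinct fragments."""
    pts = list(fragments)[:m]
    assert len(pts) == m
    return bytes(_interpolate(pts, k) for k in range(1, m + 1))


block = bytes(range(100, 116))          # a 16-byte block of sample data
fragments = encode(block, 64)           # 64 fragments, as in the article's example
recovered = decode(fragments[48:], 16)  # the last 16 fragments alone suffice
```

With m = 16 and n = 64 the stored fragments take roughly four times the space of the original block, which is the price paid for surviving the loss of any 48 fragments.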

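
The fault tolerance mentioned under secure update can be made concrete with a little arithmetic. The article does not name a specific protocol; the sketch below assumes the classical bound for Byzantine agreement, under which n hosts can tolerate f misbehaving hosts only if n is at least 3f + 1. The quorum rule is likewise an assumption, chosen so that any two quorums overlap in at least one honest host. Both function names are hypothetical.

```python
def max_faulty(n: int) -> int:
    """Largest number of misbehaving hosts f that n hosts can tolerate (n >= 3f + 1)."""
    return (n - 1) // 3


def quorum(n: int) -> int:
    """Smallest group size such that any two groups share at least f + 1 hosts."""
    return (n + max_faulty(n)) // 2 + 1


tolerated = max_faulty(64)  # hosts that may misbehave in a group of 64
needed = quorum(64)         # matching answers required before a decision counts
```

Under these assumptions a group of 4 hosts tolerates 1 misbehaving member, and 3 hosts tolerate none, which is why the tolerable fraction stays just under one third.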
