
Know Thy Search Engine

“It's all out there, it's just up to you to find it.”

KNOW THY SEARCH ENGINE!!

A Research work on the techniques helpful for better searches on the net.
A Guide to the surfers for “Knowing what they want” and “Getting what they want”

Research work by Amit Sharma Page 1


There is nothing more satisfying than finding what we
are looking hard for…

CONTENTS

Preface

What You Shall Find

PART I : An Introduction To Search Engines

Chapter 1 : Search Engines – A Brief Introduction
Chapter 2 : What They Call Them
Chapter 3 : Popular Search Engines

PART II : How Search Engines Work

Chapter 4 : A Glance At The Beginning
Chapter 5 : How Search Engines Work

PART III : Better Techniques To Use Search Engines

Chapter 6 : How To Get Fewer Hits
Chapter 7 : How To Get More Hits
Chapter 8 : Scorecard : Search Sites
Chapter 9 : Guessing The Uniform Resource Locator
Chapter 10 : The Complete Listing of Country Codes
Chapter 11 : Combining Words When Searching
Chapter 12 : Search Engine Mathematics
Chapter 13 : Power Searching
    Section 13.1 Match Any
    Section 13.2 Match All
    Section 13.3 Exclude
    Section 13.4 Site Search
    Section 13.5 Search Engine Specific Issues
    Section 13.6 URL Search
    Section 13.7 Link Search
    Section 13.8 Title Searching
    Section 13.9 Wild Card (*)
    Section 13.10 Anchor Search
    Section 13.11 Proximity
    Section 13.12 More Power Search Commands
    Section 13.13 Related Searches
    Section 13.14 Clustering
    Section 13.15 Find Similar
    Section 13.16 Stemming
    Section 13.17 Search Within
    Section 13.18 Spidered Version
    Section 13.19 Search By Language
    Section 13.20 Page Translation
    Section 13.21 Porn Filter
    Section 13.22 Customize Results
Chapter 14 : Boolean Searching
Chapter 15 : Search Engine Features For Searchers
    Section 15.1 Search Engine Math Commands
    Section 15.2 Power Searching Commands
    Section 15.3 Search Engine Assistance Features
    Section 15.4 Customization & Display Features
    Section 15.5 Boolean Commands
Chapter 16 : The Three Contenders
Chapter 17 : Inktomi Powered Searches
Chapter 18 : Seven Blunders of the Search Engine World
Chapter 19 : Interesting… Huh?

Search Engine Glossary


PREFACE

There are three Post-its stuck on the monitor in front of me:

“What exactly are you looking for?”
“Do not waver, do not rest till the job is done!”
&
“Think beyond the 9 points!”

As a researcher I found these three notes to myself quite helpful. From a
researcher’s point of view it is critically important to be clear about the objective.
Asking the first question again and again keeps one’s efforts going in the right
direction.

The second note is about the persistence of a researcher. It is the never-quenched
thirst for knowledge that drives a researcher. And as the Google guys found,
“There’s always more information out there”, so a researcher must go on till he gets
what he’s looking for. Surprisingly, the best catch often comes when you are worn out
and disappointed, and you give it one last try for the day.

The third note is about a game I once played: nine points arranged in three rows of
three. We were asked to connect them using four straight lines without lifting our
pens, and the only hint was “Think beyond the 9 points!!”
So often we go on rampaging within the boundaries set by our own minds. Thinking
from different points of view and stepping into different roles helped a lot in pushing
past my own limitations, and hence in broadening the search.

Sometimes I wonder how we tend to complicate simple things and set our
imagination on the complexity of a fairly simple problem.

My words for search engines: Think simple, specific and clear. You never really have
to make efforts to reach the information from people who really want it to reach you.

Whenever I finish a piece of research I look back at my silly mistakes and say to
myself, “Pray you don’t waste time on them again!” And then there are the people who
always keep reminding me, “I can do it!” Each one of them has taught me something
new, in their own way, every time I met them.

My thanks go to my family, my professors and my friends. My special thanks go to
Arifa, Ahmad, Liana and Mira, who were the only ones allowed on my mind during the
solitary hours of this research.

And how can I forget you, my Lord, who has always been there within and beside.

Lastly, heartfelt thanks to my company SCPL, my boss Saurabh and the Search
Engine community!

Any guesses for which search engines I used for my research? ;-)

- Amit Sharma


WHAT YOU SHALL FIND…

The reader of these notes is expected to have a basic knowledge of using the
Internet. I have assumed that the reader is comfortable with everyday computer use,
knows how to connect to the Internet and open a web browser, and knows where to key
in a website address.

This report is split into four parts.

Part I gives you the brief introduction to the search engines. In Part II we go deeper
into the working of a Search Engine. Part III deals purely with the techniques that
would help you use a search engine effectively.

This report begins with a brief introduction to search engines in Part I. We meet the
popular search engines after learning the difference between search and meta search
utilities. You will also learn why Yahoo behaves differently from AltaVista.

Those who would like to jump straight to the techniques for using a search engine
effectively can skip Part II and go directly to Part III.

In Part II we begin with how search engines came into the web world. Then we discuss
what goes on behind the scenes when you use a search engine, that is, how a search
engine works. We meet the ‘spiders’ of the web, which help the search engines
accumulate information. We talk about how the databases are indexed, and then we
discuss the ranking of the information retrieved. We also talk about the search
engines powered by Inktomi’s popular search technology.

In Part III we focus on the use of search engines. To get our hands on what we are
looking for, we begin with the basic techniques for increasing or decreasing our hits.
We then sharpen our search skills by guessing uniform resource locators (URLs). And
then we head for the big time: ‘power searching’. We learn about stemming,
clustering, Boolean searches and also a little bit of mathematics called search engine
mathematics.

Then we talk about the most common mistakes we all make while searching on the
net. We interview AskJeeves and look at the lighter side of search engines.


PART I
AN INTRODUCTION TO SEARCH ENGINES


CHAPTER 1. Search Engines – A Brief Introduction

We use ‘search engines’ to search for information across the Internet. With the
Internet being an ever-expanding ocean of data, their importance grows with every
passing day. I always used to imagine a search engine as a trained police dog (I
wonder if the Dogpile guys thought the same) that would go and fetch anything you
sent it for. The sheer diversity of the information made it necessary to have a tool to
cut down on the time spent searching.

Today businesses and organizations understand the necessity of being listed on a
good and popular search engine. For those who really want their products and
services to reach the world, getting listed on a search engine is a necessity of the
time.

To name a few well-known search engines: AltaVista, Google, HotBot, About,
Excite, Northern Light, LookSmart, Lycos, GoTo, WebCrawler, Yahoo, DogPile,
Highway61, DirectHit and Teoma are on the popularity list.

There are more than 500 search engines all over the world by now. As native search
engines are popping up in their own regions, it’s interesting to witness the war of
superiority among the international SE’s.

SE’s are basically database programs. Their job is to obtain data from websites in
order to identify, organize and list websites of possible interest to people who are
seeking them.
An information seeker can visit a search engine and enter a word or phrase for it to
look up. The search engine then presents the results in a form that the seeker can
investigate further.

Search engines are powerful and can draw up many more references than is usually
necessary. It is useful for the seeker to learn how to refine their search to shorten the
list they will ultimately be presented with. All search engines provide ample resources
to help a seeker refine their search.

How to get registered with search engines is now a subject in itself: Search Engine
Optimization. It is beyond the scope of this research, though the information in this
report would definitely help.

Before we move further I would like to mention meta tags.

META tags are inserted by Web page designers and developers into Web pages so
that search engines can identify and categorize the Web page's content. META tags,
which are invisible to the reader, assist search engines and Web browsers.


Your real search begins when you think you have reached the horizon…
…You think it’s probably your last step…
But then…
“There is always more information out there”, says Google’s 10 Things.
…The next step you take and there’s a whole new endless world that awaits you!!


CHAPTER 2. What They Call Them

After a brief introduction to search engines, let us take a closer look at the search
engines we have today. Before we get to know them, I would like to clear up some
common doubts that we usually have but do not bother much about.

First I would like to address the difference between a web search utility and a web
meta search utility.

The web search utilities (AltaVista, HotBot, etc.) index in their locally held databases
the text of pages whose creators have notified them that the pages exist, and given
them the URL. That's one of the main reasons why you often get different results from
the same search on AltaVista and Goto for instance. The size, and hence the
coverage in each index varies. In addition two search engines indexing the same
page may "weigh" the words in the page differently, leading to the same page
appearing higher or lower in their respective search results list. Many searchers often
execute the same search against several different indexes to get the results they
want.

The meta search utilities try to counteract that by sending the search you create to
several of the web indexes at once, thereby sparing you the effort of searching one
after the other manually. However, meta search engines don't offer universal
coverage.
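
As a rough illustration of what a meta search utility does, the sketch below sends one query to several indexes and merges the answers. The engine names come from this chapter, but the index contents, the example URLs and the function name `meta_search` are made up for the illustration; a real meta search engine queries live services instead of in-memory dictionaries.

```python
# Hypothetical miniature indexes: each engine knows a different set of pages.
INDEXES = {
    "AltaVista": {"porsche": ["a.example/road-test", "b.example/dealers"]},
    "HotBot": {"porsche": ["b.example/dealers", "c.example/history"]},
    "Goto": {"porsche": ["a.example/road-test"]},
}


def meta_search(query):
    """Send the query to every index and merge the hits.

    Returns (url, engines) pairs, with pages found by more engines first.
    """
    found = {}
    for engine, index in INDEXES.items():
        for url in index.get(query.lower(), []):
            found.setdefault(url, []).append(engine)
    # Sort by number of engines (descending), then by URL for a stable order.
    return sorted(found.items(), key=lambda item: (-len(item[1]), item[0]))


results = meta_search("Porsche")
for url, engines in results:
    print(url, "found by", ", ".join(engines))
```

Note that the merged list can only contain pages that at least one of the underlying indexes already knows, which is why meta search does not give universal coverage either.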

Now, having understood the difference between search and meta search, let us turn
to another common question: “How are AltaVista and Yahoo different from each
other?” I am sure whoever has used both of them will have noticed the difference,
even if they have not understood what lies beneath it.

Let me put it this way: AltaVista is a web search engine, while Yahoo is a web
search directory.

They're actually different ways of approaching the same problem! When someone
tells Yahoo about their website, an actual human being looks at the site, decides
what it's about, and then places it within a topical hierarchy that Yahoo has
established. The Yahoo hierarchy is like a tree with many thousands of branches,
and the "editors" who place sites within the hierarchy do so very carefully. Because of
the hands on work involved in placing a site within the Yahoo (or LookSmart)
hierarchy, the total number of websites represented tends to be much smaller than
those of the automated web search engines.

The Yahoo searcher who is looking for information navigates the hierarchy until they
find a topic (a branch on the tree) that represents what they're looking for. Once that
topic is located, they'll click on it to review the list of sites placed under that topic. If
none of those sites are "just right" they may choose to find another topic that
resembles what they're looking for, and review the sites placed under it.

Here's an example. Let's say you'd like to buy a new car. You're interested in a
particular Porsche model, and would like to read a road test of that model. If you
search Northern Light for the word "Porsche", among its first few screens of results
will be pages about Porsche coverage, Special Collection Porsche links, Porsche
launches, etc. All are related to Porsche, but none is what you wanted.


While you could refine your search, you could also jump into Yahoo and browse their
hierarchy looking for something like:

http://dir.yahoo.com/Recreation/Automotive/Makes_and_Models/Porsche/Magazines/

Underneath that heading will be the URLs of automobile publications that have web
sites. Somewhere amongst them, you may find a magazine that has recently road
tested the car that you're interested in.

Web search utilities and web directories are stylistically very different. Some people
enjoy browsing the Yahoo topic hierarchy, while others like to make the computer do
more of the work. In either case, doing a good job takes time. The Yahoo hierarchy is
huge, and it's easy to go down a few wrong paths before you find the topic you're
most interested in. However, it's also true that creating an effective search statement
that retrieves just what you're looking for takes time and effort too!


CHAPTER 3. Popular Search Engines

There are many contenders in the sprawling search engine community. Here is a
quick glance at the most popular ones.

Google (http://www.google.com)
Fast, relevant results are a hallmark of Google due to its extensive use of
popularity ranking of web sites. An added strength is Google's inclusion of
more file formats than other search engines index—formats such as PDF,
Microsoft Word, Excel, and PowerPoint.

All the Web (Fast) (http://www.alltheweb.com/)
A very large engine that offers links to specialized searches such as for MP3s,
FTP items, pictures, etc. The advanced search employs drop down menus for
specifying word and domain filters.

Teoma (http://www.teoma.com/)
A medium-sized search engine that produces excellent retrieval. Besides
listing search results, a portion of the screen provides topical groupings
(folders) of results by keyword. On the right are experts' links leading to sites
that list pages on related general subjects.

MSN Search (http://search.msn.com/)
Features a large web pages database that includes information from Encarta,
MSNBC and other news sources, popular sites taken from DirectHit, as well
as directory listings.

Subject Directories

Yahoo (http://www.yahoo.com)
Yahoo is actually both a subject guide and a search engine. Searches include
not only Yahoo's web site subject directory listings that are selected and
indexed by people, but also a Google-powered database of web sites. Search
results are organized into helpful subject categories. Useful, friendly, a
favorite among web searchers.

Looksmart (http://www.looksmart.com)
Looksmart, Yahoo's main rival, also employs human editors to select and
categorize web sites. LookSmart has partnered with AltaVista to be the
extensive search engine that engages after a query of LookSmart's database.

Open Directory Project (http://www.dmoz.org)
The stated goal of Open Directory is "to produce the most comprehensive
directory of the web by relying on a vast army of volunteer editors." Critics
note that the volunteer-nature of this type of service can lead to uneven
subject coverage and potential bias. Use the subject listings or search by
keyword or phrase to navigate the site.

Librarians' Index to the Internet (http://lii.org/)
"A searchable, annotated subject directory of more than 8500 Internet
resources selected and evaluated by librarians for their usefulness to users of
public libraries" (from the "About" page). Easy to use, very browsable.
Annotations are a plus.

About.com (http://home.about.com/)
Excellent source for web guides on popular topics.

Metacrawlers:
Search multiple search engines from the same web site

SurfWax (http://www.surfwax.com/)
Taps the major search engines including Google. Results are merged and
ranked by relevancy. Results with a magnifying glass icon beside them have
quick summaries (SiteSnaps) that may be viewed before deciding to summon
the page. Options for sorting, number of results displayed are available. Nice
customization features. An excellent metasearch engine.

ixquick (http://www.ixquick.com/)
Interesting is ixquick's use of a star rating system. One star is given for each
search engine that placed a site in its top ten. The theory is that a site
appearing in multiple top ten lists is likely to be relevant. Like Dogpile, ixquick
tries to translate a search query into the syntax of each search engine. Search
options include web, news, MP3, and pictures.

Vivisimo (http://vivisimo.com/)
Performs document clustering (based on titles, URLs, and short descriptions)
so that users may browse the results by hierarchical categories. Very
interesting and effective approach.

Dogpile (http://www.dogpile.com)
Once a favorite among searchers, Dogpile now uses paid listings. Keep this in
mind when evaluating search results. Results are listed by search engine.


The following table gives you a good view of search engines under different
categories:

Search…                 Meta Search…
www.4websearch.com      www.Ixquick.com
www.altavista.com       www.mamma.com
www.alltheweb.com       www.metacrawler.com
www.google.com          www.redesearch.com
www.hotbot.com          www.surfwax.com
www.lycos.com           www.turbo10.com
www.msn.com             www.vivisimo.com
www.searchhippo.com
www.teoma.com
www.wisenut.com

News Search…            Shopping search…
www.rocketnews.com      www.amazon.com
www.bpubs.com           www.mysimon.com
www.daypop.com          www.shop.com
www.findarticles.com
www.moreover.com
www.newsBlip.com

People Search…          Pay Per Click…
www.genealogy.com       www.goclick.com
www.biographies.com     www.7search.com
www.switchboard.com     www.bigwhat.com
                        www.brainfox.com
                        www.epilot.com
                        www.espotting.com
                        www.findit-quick.com
                        www.findwhat.com
                        www.overture.com
                        www.sprinks.com

Directories…            Computing…
www.about.com           www.allwhois.com
www.galaxy.com          www.hostindex.com
www.goguides.org        www.tucows.com
www.looksmart.com
www.dmoz.org (OpenDirectory)
www.yahoo.com
www.zeal.com

Business…               Research Directories…
www.allbusiness.com     www.refdesk.com
www.business.com        www.lii.org
www.b2bpages.com        www.digital-librarian.com
                        www.statisticalresources.com
                        www.informationplease.com
                        www.researchindex.com


PART II
HOW SEARCH ENGINES WORK


CHAPTER 4. A Glance At The Beginning

The Internet has evolved not by becoming a graphically rich multi-media work but by
the evolution of the tools which made it possible to find and access this richness.

One of the earliest search engines like those today, Lycos, began in the spring of
1994 when John Leavitt’s spider was linked to an indexing program by Michael
Mauldin. Yahoo!, a catalog, became available the same year.

Today there are more of what are called “web location services.” A search engine in
the proper sense is a database plus the tools to generate that database and search it,
while a catalog is an organizational method and a related database plus the tools for
generating it.

Yahoo! emphasizes cataloging, while others such as AltaVista or Excite emphasize
providing the largest search database. Some web location services do not own any of
their search engine technology, because other services are their main thrust;
companies such as Inktomi (named after a Native American word for spider) provide
the search technology. These web location services have put amazing power into
every user’s hands, making life much better for all of us, and free of cost at that.

You might be wondering (maybe it’s a rumour, maybe not) whether these information
companies might increase their revenues by selling information: information about
you. After you use a search engine and find a page with mutual fund quotes, you
might find yourself receiving e-mail advertising investments. Think this is a
coincidence? Think again. The investment company could have paid a search engine
for your e-mail address. There is an existing protocol for servers to ask a user’s
browser for such information, routinely entered during set-up.


CHAPTER 5. How Search Engines Work

There are basically three elements to search engines that might be important.

1) Information Discovery & the database,

2) The user search, and,

3) The presentation and ranking of results.

Discovery and Database

A search engine finds information for its database by accepting listings sent in by
authors wanting exposure, or by getting the information from its "Web crawlers,"
"spiders," or "robots": programs that roam the Internet storing links to and
information about each page they visit.

Web crawler programs are a subset of "software agents," programs with an unusual
degree of autonomy which perform tasks for the user. How do these really work? Do
they go across the net by IP number one by one? Do they store all or most of
everything on the Web?

These agents normally start with a historical list of links, such as server lists, and lists
of the most popular or best sites, and follow the links on these pages to find more
links to add to the database. This makes most engines, without a doubt, biased
toward more popular sites. A Web crawler could send back just the title and URL of
each page it visits, or just parse some HTML tags, or it could send back the entire
text of each page.
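
The link-following behaviour described above can be sketched in a few lines. This is only a toy model, not how any particular engine is built: the `WEB` dictionary stands in for the real Web, the example addresses are invented, and `crawl` is a hypothetical name. It starts from a seed list and follows links breadth-first, recording each page once.

```python
from collections import deque

# Hypothetical miniature web: each page maps to the links found on it.
WEB = {
    "popular.example/": ["popular.example/news", "shop.example/"],
    "popular.example/news": ["popular.example/", "blog.example/"],
    "shop.example/": ["shop.example/cart"],
    "shop.example/cart": [],
    "blog.example/": ["popular.example/"],
    "lonely.example/": [],  # no page links here, so it is never found
}


def crawl(seed_urls):
    """Visit pages breadth-first, starting from the seed list."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in WEB.get(url, []):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl(["popular.example/"])
print(pages)
```

The page that nobody links to is never reached from the seed, which is a small picture of why crawlers that start from popular sites lean toward popular sites.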

Alta Vista is clearly hell-bent on indexing anything and everything, with over 30
million pages indexed (7/96). Excite actually claims more pages. OpenText, on the
other hand, indexes the full text of less than a million pages (5/96), but stores many
more URLs. Inktomi has implemented HotBot as a distributed computing solution,
which they claim can grow with the Web and index it in entirety no matter how many
users or how many pages are on the Web. Normally, "good" robots can be excluded
by a bit of exclusion standard code on your site.

It seems unfair, but developers aren't rewarded much by location services for sending
in the URLs of their pages for indexing. The typical time from sending your URL in to
getting it into the database seems to be 6-8 weeks. Most search engines check their
databases to see if URLs still exist and to see if they are recently updated.


"Spiders" take a Web page's content and create key search words that enable online
users to find pages they're looking for.

Google began as an academic search engine. In the paper that describes how the
system was built, Sergey Brin and Lawrence Page give an example of how quickly
their spiders can work. They built their initial system to use multiple spiders, usually
three at one time. Each spider could keep about 300 connections to Web pages open
at a time. At its peak performance, using four spiders, their system could crawl over
100 pages per second, generating around 600 kilobytes of data each second.

Keeping everything running quickly meant building a system to feed necessary
information to the spiders. The early Google system had a server dedicated to
providing URLs to the spiders. Rather than depending on an Internet service provider
for the domain name server (DNS) that translates a server's name into an address,
Google had its own DNS, in order to keep delays to a minimum.

When the Google spider looked at an HTML page, it took note of two things:

1) The words within the page
2) Where the words were found

Words occurring in the title, subtitles, meta tags and other positions of relative
importance were noted for special consideration during a subsequent user search.
The Google spider was built to index every significant word on a page, leaving out the
articles "a," "an" and "the." Other spiders take different approaches.
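
To make the "words plus where they were found" idea concrete, here is a minimal sketch using Python's standard `html.parser` module. It is not Google's code; the class name `WordLocator`, the sample page and the choice to use the enclosing tag name as the "location" are all assumptions made for the example. It records every word except the three articles, together with the tag it appeared in.

```python
from html.parser import HTMLParser

ARTICLES = {"a", "an", "the"}  # the words the text says were left out


class WordLocator(HTMLParser):
    """Collect (word, location) pairs, where location is the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.hits = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed HTML: the closing tag matches the last opened one.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        location = self.open_tags[-1] if self.open_tags else "body"
        for raw in data.split():
            word = raw.strip(".,!?;:\"'()").lower()
            if word and word not in ARTICLES:
                self.hits.append((word, location))


page = (
    "<html><head><title>The Porsche Road Test</title></head>"
    "<body><h1>A Review</h1><p>The car is fast.</p></body></html>"
)
locator = WordLocator()
locator.feed(page)
locator.close()
print(locator.hits)
```

A later ranking step could then treat a hit in `title` or `h1` as more important than one in `p`.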

These different approaches usually attempt to make the spider operate faster, allow
users to search more efficiently, or both. For example, some spiders will keep track of
the words in the title, sub-headings and links, along with the 100 most frequently
used words on the page and each word in the first 20 lines of text. Lycos is said to
use this approach to spidering the Web.
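
A rough sketch of the selective approach follows. It covers only two of the ingredients mentioned above (the most frequent words and the words in the opening lines) and ignores titles, sub-headings and links. The function name `selective_keywords` and the sample text are hypothetical, and the demonstration uses tiny limits (2 words, 1 line) instead of 100 and 20 so the result is easy to follow.

```python
from collections import Counter


def selective_keywords(text, top_n=100, first_lines=20):
    """Keep the most frequent words plus every word in the opening lines."""
    words = text.lower().split()
    frequent = {word for word, _ in Counter(words).most_common(top_n)}
    early = set()
    for line in text.lower().splitlines()[:first_lines]:
        early.update(line.split())
    return frequent | early


sample = "porsche road test\nthe porsche is fast\nthe porsche is loud\nthe end"
keywords = selective_keywords(sample, top_n=2, first_lines=1)
print(sorted(keywords))
```

Words such as "fast" and "loud" are dropped here, which is the trade-off of this approach: a smaller index in exchange for less complete coverage of the page.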

Other systems, such as AltaVista, go in the other direction, indexing every single
word on a page, including "a," "an," "the" and other "insignificant" words. The push to
completeness in this approach is matched by other systems in the attention given to
the unseen portion of the Web page, the meta tags.

Meta tags allow the owner of a page to specify key words and concepts under which
the page will be indexed. This can be helpful, especially in cases in which the words
on the page might have double or triple meanings -- the meta tags can guide the
search engine in choosing which of the several possible meanings for these words is
correct. There is, however, a danger in over-reliance on meta tags, because a
careless or unscrupulous page owner might add meta tags that fit very popular topics
but have nothing to do with the actual contents of the page. To protect against this,
spiders will correlate Meta tags with page content, rejecting the meta tags that don't
match the words on the page.
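
The cross-check between meta tags and page content might look something like the sketch below. The function name `trusted_meta_keywords`, the keyword list and the page text are invented for the example, and real spiders certainly use more elaborate matching than a simple word lookup.

```python
def trusted_meta_keywords(meta_keywords, page_text):
    """Keep only the meta keywords that also occur in the visible text."""
    page_words = {w.strip(".,!?;:").lower() for w in page_text.split()}
    return [kw for kw in meta_keywords if kw.lower() in page_words]


keywords = ["Porsche", "lottery", "road", "celebrity"]
text = "Our road test of the new Porsche, with photos."
kept = trusted_meta_keywords(keywords, text)
print(kept)
```

The keywords "lottery" and "celebrity" never appear on the page, so they are rejected.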

All of this assumes that the owner of a page actually wants it to be included in the
results of a search engine's activities. Many times, the page's owner doesn't want it
showing up on a major search engine, or doesn't want the activity of a spider
accessing the page. Consider, for example, a game that builds new, active pages
each time sections of the page are displayed or new links are followed. If a Web
spider accesses one of these pages, and begins following all of the links for new
pages, the game could mistake the activity for a high-speed human player and spin
out of control. To avoid situations like this, the robot exclusion protocol was
developed. This protocol, implemented either as a robots.txt file at the root of a site
or as a robots meta tag in the head section of a Web page, tells a spider to leave the
page alone -- to neither index the words on the page nor try to follow its links.
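
Exclusion rules are most often published in a file named robots.txt at the root of a site, and a well-behaved spider checks them before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module to do that check. The rules, the spider name `ExampleSpider` and the example.com addresses are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that keeps all spiders out of the /game/ pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /game/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_home = parser.can_fetch(
    "ExampleSpider", "http://www.example.com/index.html"
)
allowed_game = parser.can_fetch(
    "ExampleSpider", "http://www.example.com/game/level1"
)
print(allowed_home, allowed_game)
```

The same wish can be expressed for a single page with a tag such as `<meta name="robots" content="noindex, nofollow">` in its head section.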

Building the Index

Once the spiders have completed the task of finding information on Web pages (and
we should note that this is a task that is never actually completed -- the constantly
changing nature of the Web means that the spiders are always crawling), the search
engine must store the information in a way that makes it useful.

There are two key components involved in making the gathered data accessible to
users:

• The information stored with the data
• The method by which the information is indexed

In the simplest case, a search engine could just store the word and the URL where it
was found. In reality, this would make for an engine of limited use, since there would
be no way of telling whether the word was used in an important or a trivial way on the
page, whether the word was used once or many times or whether the page contained
links to other pages containing the word. In other words, there would be no way of
building the ranking list that tries to present the most useful pages at the top of the list
of search results.

To make for more useful results, most search engines store more than just the word
and URL. An engine might store the number of times that the word appears on a
page. The engine might assign a weight to each entry, with increasing values
assigned to words as they appear near the top of the document, in sub-headings, in
links, in the meta tags or in the title of the page. Each commercial search engine has
a different formula for assigning weight to the words in its index. This is one of the
reasons that a search for the same word on different search engines will produce
different lists, with the pages presented in different orders.
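
Since every engine keeps its formula to itself, the weights below are pure invention; the sketch only shows the general shape of the idea. Each place a word occurs contributes a weight, the weights are added up per page, and pages are listed from highest to lowest score. The names `LOCATION_WEIGHTS` and `score` and the two example pages are hypothetical.

```python
# Assumed weights: a word counts more in the title than in body text.
LOCATION_WEIGHTS = {"title": 5, "heading": 3, "link": 2, "body": 1}


def score(occurrences):
    """Sum the weights of every place a word occurs on one page."""
    return sum(LOCATION_WEIGHTS[location] for location in occurrences)


# Where the word "porsche" occurs on two hypothetical pages.
pages = {
    "a.example/road-test": ["title", "heading", "body", "body"],
    "b.example/dealers": ["body", "body", "body"],
}
ranked = sorted(pages, key=lambda url: score(pages[url]), reverse=True)
print([(url, score(pages[url])) for url in ranked])
```

Change the numbers in `LOCATION_WEIGHTS` and the order of results can change, which is the point made above about different engines returning different lists.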

Regardless of the precise combination of additional pieces of information stored by a
search engine, the data will be encoded to save storage space. For example, the
original Google paper describes using 2 bytes, of 8 bits each, to store information on
weighting -- whether the word was capitalized, its font size, position, and other
information to help in ranking the hit. Each factor might take up 2 or 3 bits within the
2-byte grouping (8 bits = 1 byte). As a result, a great deal of information can be stored
in a very compact form. After the information is compacted, it's ready for indexing.
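
Here is a minimal sketch of packing several small facts about a hit into 2 bytes. The field widths used here (1 bit for capitalization, 3 bits for font size, 12 bits for position) are chosen so that they add up to 16 bits for the illustration and should be read as an assumption, and `pack_hit` and `unpack_hit` are hypothetical names.

```python
def pack_hit(capitalized, font_size, position):
    """Pack three fields into 16 bits: 1 + 3 + 12."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if not 0 <= position < 4096:
        raise ValueError("position must fit in 12 bits")
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_hit(True, 5, 300)
print(hit, hit.to_bytes(2, "big"), unpack_hit(hit))
```

Three separate values that would normally take several bytes each are stored in exactly two, and nothing is lost when they are unpacked again.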

An index has a single purpose: It allows information to be found as quickly as
possible. There are quite a few ways for an index to be built, but one of the most
effective ways is to build a hash table. In hashing, a formula is applied to attach a
numerical value to each word. The formula is designed to evenly distribute the entries
across a predetermined number of divisions. This numerical distribution is different
from the distribution of words across the alphabet, and that is the key to a hash
table's effectiveness.

In English, there are some letters that begin many words, while others begin fewer.
You'll find, for example, that the "M" section of the dictionary is much thicker than the
"X" section. This inequity means that finding a word beginning with a very "popular"
letter could take much longer than finding a word that begins with a less popular one.
Hashing evens out the difference, and reduces the average time it takes to find an
entry. It also separates the index from the actual entry. The hash table contains the
hashed number along with a pointer to the actual data, which can be sorted in
whichever way allows it to be stored most efficiently. The combination of efficient
indexing and effective storage makes it possible to get results quickly, even when the
user creates a complicated search.
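
The idea can be sketched in a few lines of Python. The hashing formula, the number of buckets and the sample words below are all assumptions made for illustration; a real engine would use a far larger table and a more carefully chosen formula.

```python
NUM_BUCKETS = 8  # the "predetermined number of divisions" (assumed value)


def bucket_for(word: str) -> int:
    """Toy hashing formula: fold the character codes together, then take a remainder."""
    value = 0
    for ch in word.lower():
        value = (value * 31 + ord(ch)) % 1_000_003
    return value % NUM_BUCKETS


# The actual data lives in its own list, stored in whatever order is convenient.
postings = []                               # entries: (word, [page ids])
buckets = [[] for _ in range(NUM_BUCKETS)]  # entries: (word, pointer into postings)


def add_word(word: str, page_ids) -> None:
    word = word.lower()
    postings.append((word, list(page_ids)))
    buckets[bucket_for(word)].append((word, len(postings) - 1))


def lookup(word: str):
    """Jump straight to one bucket instead of scanning the whole index."""
    word = word.lower()
    for stored_word, pointer in buckets[bucket_for(word)]:
        if stored_word == word:
            return postings[pointer][1]
    return []


for w, pages in [("mars", [1, 4]), ("moon", [2]), ("xenon", [3]), ("mercury", [1, 5])]:
    add_word(w, pages)

print(lookup("mars"), lookup("xenon"), lookup("venus"))
```

Note that the hash table holds only the word and a pointer; the page lists themselves sit in a separate structure, mirroring the separation of index and entry described above.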

Hash This!

The key in public-key encryption is based on a hash value. This is a value that is
computed from a base input number using a hashing algorithm. Essentially, the hash
value is a summary of the original value. The important thing about a hash value is
that it is nearly impossible to derive the original input number without knowing the
data used to create the hash value. Here's a simple example:
Input number    Hashing algorithm    Hash value
10,667          Input # x 143        1,525,381

You can see how hard it would be to determine that the value 1,525,381 came from
the multiplication of 10,667 and 143. But if you knew that the multiplier was 143, then
it would be very easy to calculate the value 10,667. Public-key encryption is actually
much more complex than this example, but that is the basic idea.

Public keys generally use complex algorithms and very large hash values for
encrypting, including 40-bit or even 128-bit numbers. A 128-bit number has a
possible 2^128, or
340,282,366,920,938,463,463,374,607,431,768,211,456
different combinations! This would be like trying to find one particular grain of sand in
the Sahara Desert.
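
The toy multiplication example and the size of a 128-bit number are easy to check in Python. This is only a sketch of the arithmetic in the example; real hashing algorithms are not simple multiplications, and the function names here are made up for illustration.

```python
MULTIPLIER = 143  # the value you would need to know to work backwards


def toy_hash(number: int) -> int:
    """The 'Input # x 143' rule from the example table."""
    return number * MULTIPLIER


def recover(hash_value: int) -> int:
    """Undo the toy hash, which is trivial once the multiplier is known."""
    original, remainder = divmod(hash_value, MULTIPLIER)
    if remainder:
        raise ValueError("value was not produced with this multiplier")
    return original


hash_value = toy_hash(10_667)
keyspace = 2 ** 128  # number of distinct 128-bit values

print(f"{hash_value:,}")
print(recover(hash_value))
print(f"{keyspace:,}")
```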

Building a Search

Searching through an index involves a user building a query and submitting it through
the search engine. The query can be quite simple, a single word at minimum.
Building a more complex query requires the use of Boolean operators that allow you
to refine and extend the terms of the search.

The Boolean operators most often seen are:

• AND - All the terms joined by "AND" must appear in the pages or documents.
Some search engines substitute the operator "+" for the word AND.
• OR - At least one of the terms joined by "OR" must appear in the pages or
documents.
• NOT - The term or terms following "NOT" must not appear in the pages or
documents. Some search engines substitute the operator "-" for the word
NOT.
• FOLLOWED BY - One of the terms must be directly followed by the other.
• NEAR - One of the terms must be within a specified number of words of the
other.
• Quotation Marks - The words between the quotation marks are treated as a
phrase, and that phrase must be found within the document or file.
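
To make the operators concrete, here is a minimal Python sketch that evaluates them against a tiny index. The four sample pages, the function names and the use of word positions are assumptions for illustration; real engines differ in syntax and in how they store positions.

```python
import re
from collections import defaultdict

# Four made-up pages, keyed by a page id.
docs = {
    1: "garden bed ideas for planting flowers",
    2: "truck bed liners and covers",
    3: "buy a bed and mattress for better sleep",
    4: "flowers for the truck show",
}

# Positional index: word -> {page id -> [positions of the word on that page]}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(re.findall(r"\w+", text.lower())):
        index[word].setdefault(doc_id, []).append(pos)


def pages(word):
    """Set of page ids that contain the word."""
    return set(index.get(word.lower(), {}))


def AND(a, b):
    return a & b          # pages in both sets


def OR(a, b):
    return a | b          # pages in either set


def NOT(a, b):
    return a - b          # pages in a that are not in b


def near(w1, w2, distance):
    """Pages where w1 and w2 occur within `distance` words of each other."""
    w1, w2 = w1.lower(), w2.lower()
    return {d for d in pages(w1) & pages(w2)
            if any(abs(p - q) <= distance for p in index[w1][d] for q in index[w2][d])}


def followed_by(w1, w2):
    """Pages where w2 comes directly after w1 (a two-word phrase works the same way)."""
    w1, w2 = w1.lower(), w2.lower()
    return {d for d in pages(w1) & pages(w2)
            if any(p + 1 in index[w2][d] for p in index[w1][d])}


print(AND(pages("bed"), pages("flowers")))   # bed AND flowers
print(NOT(pages("bed"), pages("truck")))     # bed NOT truck
print(followed_by("truck", "bed"))           # the phrase "truck bed"
```

AND, OR and NOT only need to know which pages contain a word; FOLLOWED BY, NEAR and phrases also need to know where on the page each word appears.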

The searches defined by Boolean operators are literal searches -- the engine looks
for the words or phrases exactly as they are entered. This can be a problem when the
entered words have multiple meanings. "Bed," for example, can be a place to sleep, a
place where flowers are planted, the storage space of a truck or a place where fish
lay their eggs. If you're interested in only one of these meanings, you might not want
to see pages featuring all of the others. You can build a literal search that tries to
eliminate unwanted meanings, but it's nice if the search engine itself can help out.

One of the areas of search engine research is concept-based searching. Some of this
research involves using statistical analysis on pages containing the words or phrases
you search for, in order to find other pages you might be interested in. Obviously, the
information stored about each page is greater for a concept-based search engine,
and far more processing is required for each search. Still, many groups are working
to improve both results and performance of this type of search engine. Others have
moved on to another area of research, called natural-language queries.

The idea behind natural-language queries is that you can type a question in the same
way you would ask it to a human sitting beside you -- no need to keep track of
Boolean operators or complex query structures. The most popular natural language
query site today is AskJeeves.com, which parses the query for keywords that it then
applies to the index of sites it has built. It only works with simple queries; but
competition is heavy to develop a natural-language query engine that can accept a
query of great complexity.

User Search

What can the user do besides typing a few relevant words into the search form? Can
they specify that words must be in the title of a page? What about specifying that
words must be in a URL, or perhaps in a special HTML tag? Can they use all logical
operators between words like AND, OR, and NOT?

Most engines allow you to type in a few words, and then search for occurrences of
these words in their database. Each one has its own way of deciding what to do
about approximate spellings, plural variations, and truncation. If you just type words
into the "basic search" interface you get from the search engine's main page, the
engine decides which logical expression binds the words together, and engines differ
on this. Excite actually uses a kind of "fuzzy" logic, searching for the AND of multiple
words as well as the OR of the words. Most engines have separate advanced search forms where
you can be more specific, and form complex Boolean searches (every one mentioned
in this article except HotBot). Some search tools parse HTML tags, allowing you to
look for things specifically as links, or as a title or URL without consideration of the
text on the page.

By searching only in titles, you can eliminate pages with only brief mentions of a
concept, and retrieve only pages that really focus on it.

By searching links, you can determine how many and which pages point at your site.

Understanding what each engine does with pluralization, truncation, and similar
variations can be quite important to how successful your searches will be. For example, if
you search for "bikes" you won't get "bicycle," "bicycles," or "bike." In this case, I
would use a search engine that allowed "truncation," that is, one that allowed the
search word "bike" to match "bikes" as well, and I would search for "bicycle OR bike
OR cycle" ("bicycle* OR bike* OR cycle*" in AltaVista).
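
The effect of truncation can be sketched in Python as follows. The trailing * syntax follows the AltaVista example above; the function names and the sample page texts are made up for illustration.

```python
import re


def matches(pattern: str, word: str) -> bool:
    """True if `word` matches `pattern`; a trailing * means 'any ending'."""
    pattern, word = pattern.lower(), word.lower()
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


def search(query: str, text: str) -> bool:
    """Evaluate a flat 'a OR b OR c' query against the words of one page."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    patterns = [p.strip() for p in re.split(r"\s+OR\s+", query)]
    return any(matches(p, w) for p in patterns for w in words)


print(search("bikes", "Rent a bike today"))                                # exact match only
print(search("bicycle OR bike OR cycle", "Bicycles repaired here"))        # plural is missed
print(search("bicycle* OR bike* OR cycle*", "Bicycles repaired here"))     # truncation catches it
```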

Presentation & Ranking

With databases that can keep the entire Web at the fingertips of the search engines,
there will always be relevant pages, but how do you get rid of the less relevant and
emphasize the more relevant?

Most engines find more sites from a typical search query than you could ever wade
through. Search engines give each document they find some measure of the quality
of the match to your search query, a relevance score. Relevance scores reflect the
number of times a search term appears, if it appears in the title, if it appears at the
beginning of the document, and if all the search terms are near each other; some
details are given in engine help pages. Some engines allow the user to control the
relevance score by giving different weights to each search word. One thing that all
engines do, however, is to use alphabetical order at some point in their display
algorithm. If relevance scores are not very different for various matches, then you end
up with this sorry default. For most uses, a good summary is more useful than a
ranking. The summary is usually composed of the title of a document and some text
from the beginning of the document, but can include an author-specified summary
given in a meta-tag. Scanning summaries really saves you time if your search returns
more than a few items.
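
A relevance score built from the factors listed above might look like the following Python sketch. The weights (5 for a title match, 2 for an early mention, 3 for proximity), the window sizes and the sample pages are arbitrary assumptions; no engine publishes its exact formula.

```python
import re


def tokenize(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(query: str, title: str, body: str) -> float:
    """Score one page using frequency, title, early-position and proximity signals."""
    terms = tokenize(query)
    title_words = tokenize(title)
    body_words = tokenize(body)
    score = 0.0
    first_positions = []
    for term in terms:
        occurrences = [i for i, w in enumerate(body_words) if w == term]
        score += len(occurrences)                  # how often the term appears
        if term in title_words:
            score += 5                             # term appears in the title
        if occurrences and occurrences[0] < 10:
            score += 2                             # term appears near the beginning
        if occurrences:
            first_positions.append(occurrences[0])
    # Proximity bonus: every term found, first mentions within five words of each other.
    if terms and len(first_positions) == len(terms):
        if max(first_positions) - min(first_positions) <= 5:
            score += 3
    return score


pages = [
    ("Dog Nutrition Guide", "A dog needs balanced nutrition. This guide covers dog food."),
    ("Cat Care", "Some notes on cats. A dog is mentioned once."),
    ("Bird Care", "Some notes on birds. A dog is mentioned once."),
]

# Highest score first; equal scores fall back to alphabetical order by title.
ranked = sorted(pages, key=lambda p: (-relevance("dog nutrition", p[0], p[1]), p[0]))
for title, body in ranked:
    print(relevance("dog nutrition", title, body), title)
```

The last two pages tie, so their order is decided alphabetically, which is the fallback behaviour described above.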

Let's have a quick revision of what we have discussed…

• Search engines use software called spiders, which comb the Internet looking
for documents and their Web addresses.

• The documents and Web addresses are collected and sent to the search
engine’s indexing software.

• The indexing software extracts information from the documents, storing it in a
database. The kind of information indexed depends on the particular search
engine. Some index every word in a document; others index the document
title only.

• When you perform a search by entering keywords, the database is searched
for documents that match.

• The search engine assembles a web page that lists the results as hypertext
links.


PART III
BETTER TECHNIQUES TO USE SEARCH ENGINES


CHAPTER 6. How To Get Fewer Hits

One of the most common maladies afflicting people searching for information on the
Internet is that they get too many "hits" from their searches. Nobody is going to try
thousands of links looking for the information that they want. The trick is to refine your
search so that you get fewer "hits" that are of greater quality. Making better use of the
AND operator and phrase searching can help you quite a bit, so they will be the focus
of our discussion here.

This AND That

Take a look at your work area and count the number of pens. Now, take another look
and count the number of pens that are red. Unless all you use are red pens, the
second number is likely to be lower than the first. Like the brief exercise above, the
purpose of the AND operator in searching is to add additional conditions that an
Internet document must match to become a "hit". By searching for multiple words,
linked by AND operators, you can dramatically reduce the number of documents that
match your search criteria.

For example, say you were searching Google for information about proper nutrition
for your dog. Searching for the word "dog" (Google syntax: dog) returned 13,100,000
matching documents. Searching for documents containing both the word "dog" AND
the word "nutrition" (Google syntax: +dog +nutrition) shrank the number of hits
to 285,000. Adding the word "guide" to the previous search (Google syntax: +dog
+nutrition +guide) reduced the number of hits to 77,100. Finally, to favor
information sourced from veterinary schools, the domain suffix ".edu" was added to
the search (Google syntax: +dog +nutrition +guide +.edu). This reduced the number
of matching documents to a quite manageable 9,260.

It's How You Phrase It

Like the AND operator, phrase searching requires that all the terms in the phrase be
present in the document for it to be a hit. However, when using the AND operator, the
terms don't have to be anywhere near each other, they just both have to be present.
In phrase searching, the terms have to be adjacent to each other, just as they are in
the phrase. For example, searching Google for documents containing the words time
AND management AND seminar (Google syntax: +time +management +seminar)
yielded 1,060,000 hits. However, searching for the phrase "time management
seminar" (Google syntax: "time management seminar") returned 1,140 matching
documents. While this is a substantial reduction in the number of hits received,
consider that it may be too drastic. Searching for the phrase "time management
seminar" skips documents containing the phrases "time management class" or
"seminar on time management", since neither phrase exactly matches your search
phrase.
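
The difference between the two kinds of match can be sketched in Python. The three sample page texts and the function names are made up for illustration.

```python
import re


def words(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all(terms: str, text: str) -> bool:
    """AND search: every term appears somewhere on the page, in any order."""
    page = set(words(text))
    return all(term in page for term in words(terms))


def matches_phrase(phrase: str, text: str) -> bool:
    """Phrase search: the terms appear next to each other, in the given order."""
    target, page = words(phrase), words(text)
    n = len(target)
    return n > 0 and any(page[i:i + n] == target for i in range(len(page) - n + 1))


sample_pages = [
    "Join our time management seminar in May",
    "A seminar on time management for managers",
    "Time management class schedule",
]

for text in sample_pages:
    print(matches_all("time management seminar", text),
          matches_phrase("time management seminar", text),
          text)
```

The second page passes the AND search but fails the phrase search, and the third fails both, which is exactly the trade-off described above.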


In Conclusion

The AND operator and phrase searching can both play an important role in
making your searches more effective. You may want to experiment with using
them together where appropriate ("New York" AND tour). If one phrase or word
combination doesn't give you the results that you desire, try using slightly different
words that convey the same meaning. It's all out there, it's just up to you to find it.


CHAPTER 7. How To Get More Hits

Getting too many matches from a search is a common problem. Yet, odd as it
sounds, searchers often miss important information because they get
too few matches from their searches. This can be true even if the total hits number in
the thousands. Using the OR operator can help you find all of the documents you're
looking for.

Use Synonyms to Cover All the Bases

One of the best ways that the OR operator can help you is when you are searching
for something that is commonly referred to by more than one name. For instance, do
you refer to small, portable computers as "laptops" or "notebooks"? Within the
computer industry, those two terms are used interchangeably. If you're looking for
information on this type of computer, you should search on both terms. This is good
advice even if you're searching the Yahoo category hierarchy. For example, a search
for the word laptop in Yahoo yielded 435 hits (Google: 926,000). Searching for the
word notebook in Yahoo yielded 547 hits (Google: 1,990,000). Searching for laptop or
notebook returned 173 hits (Google: 113,000). Note that a true OR search cannot
return fewer hits than either term alone: its count equals 435 + 547 = 982 minus the
number of pages that contain both terms. Counts this far below the single-term totals
suggest that the engines did not treat the lowercase "or" as the OR operator, so
check each engine's OR syntax before relying on it.
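
The arithmetic behind OR counts can be sketched with Python sets. The page ids below are made up; only the counts 435 and 547 come from the Yahoo example above.

```python
def or_count(count_a: int, count_b: int, count_both: int) -> int:
    """Hits for 'A OR B': add both counts, then subtract the pages counted twice."""
    return count_a + count_b - count_both


def or_bounds(count_a: int, count_b: int):
    """Smallest and largest possible hit counts for 'A OR B'."""
    return max(count_a, count_b), count_a + count_b


# Hypothetical page ids for two synonyms.
laptop_pages = {1, 2, 3, 4, 5}
notebook_pages = {4, 5, 6, 7}

either = laptop_pages | notebook_pages   # OR: pages with at least one term
both = laptop_pages & notebook_pages     # overlap: pages with both terms

print(len(either), or_count(len(laptop_pages), len(notebook_pages), len(both)))
print(or_bounds(435, 547))
```

Whatever the overlap turns out to be, an OR search on terms with 435 and 547 hits should return somewhere between 547 and 982 pages.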

Don't Forget Synonymous Phrases

In the previous chapter, we discussed the importance and value of phrase searching. It's important
to remember the number of ways that a concept can be phrased. For instance, if you
search Northern Light for "ocean liner", you'll get 6,882 matches. Substituting the
phrase "cruise ship" will return 51,417 hits. To nautical professionals, "cruise ship"
and "ocean liner" may not be the same thing, but there's no arguing that the terms are
often used interchangeably. You can cover a lot of ground by searching for two or
more phrases at once, separated by the OR operator. Searching for "ocean liner" OR
"cruise ship" gave 56,866 hits.


None of the search engines out there today are perfect, but using the
right one at the right time can make all the difference.


CHAPTER 8. Score Card : Search Sites

From PC Magazine :

                Relevant hits  Customizes   Eliminates  Eliminates  Effective
                from initial   query        duplicate   dead links  anticipatory
Site            inquiry        effectively  links                   results
About.com       Good           Good         Fair        Fair        Fair
AltaVista       Fair           Good         Fair        Good        Fair
Ask Jeeves      Fair           Fair         Good        Good        Good
Excite          Good           Good         Good        Fair        Good
FAST Search     Good           Good         Fair        Fair        Fair
GO Network      Good           Good         Fair        Fair        Good
Google          Excellent      Fair         Good        Fair        Fair
GoTo.com        Fair           Fair         Good        Good        Good
HotBot          Excellent      Good         Good        Good        Good
Looksmart       Good           Fair         Fair        Fair        Fair
Lycos           Good           Good         Fair        Fair        Fair
Northern Light  Good           Excellent    Good        Fair        Excellent
WebCrawler      Fair           Fair         Good        Good        Fair
Yahoo!          Excellent      Good         Good        Good        Good

To test Web search sites, PC Magazine devised more than 50 queries spanning a variety of
subjects and of varying style, from single words to natural-language questions.
Initially they tried all of their queries using the default settings for each site. Then they
examined the first ten hits, evaluating each page's relevance and recording the
number of duplicate and dead links.

Their ratings reflect both the number of hits that technically matched the query and
the quality of the information returned. They also tested the tools that each site
offered to customize a query and improve the overall quality of its return set.

They rechecked any dead links (defined as genuine HTTP 404 errors and not
temporary glitches that prevented access to the pages) over the course of the month
of their testing before delivering their final rating in that category.

Many search engines today try to anticipate the exact nature of your query, returning
stock data when you search for a company, for example, in addition to the sites that
its search engine itself has catalogued. They evaluated the usefulness of these
efforts to determine how well you can rely on these to improve your search
experience.


CHAPTER 9. Guessing The Uniform Resource Locator

Think of the organization most likely to provide an answer to your question. Then try
to go directly to their Web site.

Try guessing the central URL for the organization.

1. Leave off http://
2. Try the common www to start the machine address
3. Use the name, acronym, or brief name of the organization (nra, honda, uwyo)
in the middle
4. Add the appropriate top level domain, most often .com.

com for commercial
net for networks, but can be used by anyone
edu for U.S. higher education
org for other organizations, but can be used by anyone
mil for U.S. military
gov for U.S. federal government
int for international organizations established by treaties
.state.XX.us for U.S. state governments
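
The guessing steps above can be sketched as a small Python helper. It only builds candidate addresses to try in a browser; it does not check whether any of them exist. The function name and the default list of domains are assumptions for illustration.

```python
COMMON_TLDS = ["com", "org", "net", "edu", "gov"]  # tried in this order (assumed)


def guess_urls(name: str, tlds=COMMON_TLDS):
    """Build candidate addresses such as www.honda.com from a short organization name."""
    # Keep only letters, digits and hyphens, since host names cannot contain spaces.
    label = "".join(ch for ch in name.lower() if ch.isalnum() or ch == "-")
    return [f"www.{label}.{tld}" for tld in tlds]


print(guess_urls("Honda"))
print(guess_urls("uwyo", ["edu"]))
```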

The next chapter lists the country codes.


CHAPTER 10. Complete List of Country Codes

.ac – Ascension Island   .ad – Andorra   .ae – United Arab Emirates   .af – Afghanistan
.ag – Antigua and Barbuda   .ai – Anguilla   .al – Albania   .am – Armenia
.an – Netherlands Antilles   .ao – Angola   .aq – Antarctica   .ar – Argentina
.as – American Samoa   .at – Austria   .au – Australia   .aw – Aruba
.az – Azerbaijan   .ba – Bosnia and Herzegovina   .bb – Barbados   .bd – Bangladesh
.be – Belgium   .bf – Burkina Faso   .bg – Bulgaria   .bh – Bahrain
.bi – Burundi   .bj – Benin   .bm – Bermuda   .bn – Brunei Darussalam
.bo – Bolivia   .br – Brazil   .bs – Bahamas   .bt – Bhutan
.bv – Bouvet Island   .bw – Botswana   .by – Belarus   .bz – Belize
.ca – Canada   .cc – Cocos (Keeling) Islands   .cd – Congo, Democratic Republic of the   .cf – Central African Republic
.cg – Congo, Republic of   .ch – Switzerland   .ci – Cote d'Ivoire   .ck – Cook Islands
.cl – Chile   .cm – Cameroon   .cn – China   .co – Colombia
.cr – Costa Rica   .cu – Cuba   .cv – Cape Verde   .cx – Christmas Island
.cy – Cyprus   .cz – Czech Republic   .de – Germany   .dj – Djibouti
.dk – Denmark   .dm – Dominica   .do – Dominican Republic   .dz – Algeria
.ec – Ecuador   .ee – Estonia   .eg – Egypt   .eh – Western Sahara
.er – Eritrea   .es – Spain   .et – Ethiopia   .fi – Finland
.fj – Fiji   .fk – Falkland Islands (Malvinas)   .fm – Micronesia, Federated States of   .fo – Faroe Islands
.fr – France   .ga – Gabon   .gd – Grenada   .ge – Georgia
.gf – French Guiana   .gg – Guernsey   .gh – Ghana   .gi – Gibraltar
.gl – Greenland   .gm – Gambia   .gn – Guinea   .gp – Guadeloupe
.gq – Equatorial Guinea   .gr – Greece   .gs – South Georgia and the South Sandwich Islands   .gt – Guatemala
.gu – Guam   .gw – Guinea-Bissau   .gy – Guyana   .hk – Hong Kong
.hm – Heard and McDonald Islands   .hn – Honduras   .hr – Croatia/Hrvatska   .ht – Haiti


.hu – Hungary   .id – Indonesia   .ie – Ireland   .il – Israel
.im – Isle of Man   .in – India   .io – British Indian Ocean Territory   .iq – Iraq
.ir – Iran (Islamic Republic of)   .is – Iceland   .it – Italy   .je – Jersey
.jm – Jamaica   .jo – Jordan   .jp – Japan   .ke – Kenya
.kg – Kyrgyzstan   .kh – Cambodia   .ki – Kiribati   .km – Comoros
.kn – Saint Kitts and Nevis   .kp – Korea, Democratic People's Republic   .kr – Korea, Republic of   .kw – Kuwait
.ky – Cayman Islands   .kz – Kazakhstan   .la – Lao People's Democratic Republic   .lb – Lebanon
.lc – Saint Lucia   .li – Liechtenstein   .lk – Sri Lanka   .lr – Liberia
.ls – Lesotho   .lt – Lithuania   .lu – Luxembourg   .lv – Latvia
.ly – Libyan Arab Jamahiriya   .ma – Morocco   .mc – Monaco   .md – Moldova, Republic of
.mg – Madagascar   .mh – Marshall Islands   .mk – Macedonia, Former Yugoslav Republic   .ml – Mali
.mm – Myanmar   .mn – Mongolia   .mo – Macau   .mp – Northern Mariana Islands
.mq – Martinique   .mr – Mauritania   .ms – Montserrat   .mt – Malta
.mu – Mauritius   .mv – Maldives   .mw – Malawi   .mx – Mexico
.my – Malaysia   .mz – Mozambique   .na – Namibia   .nc – New Caledonia
.ne – Niger   .nf – Norfolk Island   .ng – Nigeria   .ni – Nicaragua
.nl – Netherlands   .no – Norway   .np – Nepal   .nr – Nauru
.nu – Niue   .nz – New Zealand   .om – Oman   .pa – Panama
.pe – Peru   .pf – French Polynesia   .pg – Papua New Guinea   .ph – Philippines
.pk – Pakistan   .pl – Poland   .pm – St. Pierre and Miquelon   .pn – Pitcairn Island
.pr – Puerto Rico   .ps – Palestinian Territories   .pt – Portugal   .pw – Palau
.py – Paraguay   .qa – Qatar   .re – Reunion Island   .ro – Romania
.ru – Russian Federation   .rw – Rwanda   .sa – Saudi Arabia   .sb – Solomon Islands
.sc – Seychelles   .sd – Sudan   .se – Sweden   .sg – Singapore
.sh – St. Helena   .si – Slovenia   .sj – Svalbard and Jan Mayen Islands   .sk – Slovak Republic
.sl – Sierra Leone   .sm – San Marino   .sn – Senegal   .so – Somalia
.sr – Suriname   .st – Sao Tome and Principe   .sv – El Salvador   .sy – Syrian Arab Republic


.sz – Swaziland   .tc – Turks and Caicos Islands   .td – Chad   .tf – French Southern Territories
.tg – Togo   .th – Thailand   .tj – Tajikistan   .tk – Tokelau
.tm – Turkmenistan   .tn – Tunisia   .to – Tonga   .tp – East Timor
.tr – Turkey   .tt – Trinidad and Tobago   .tv – Tuvalu   .tw – Taiwan
.tz – Tanzania   .ua – Ukraine   .ug – Uganda   .uk – United Kingdom
.um – US Minor Outlying Islands   .us – United States   .uy – Uruguay   .uz – Uzbekistan
.va – Holy See (Vatican City State)   .vc – Saint Vincent and the Grenadines   .ve – Venezuela   .vg – Virgin Islands (British)
.vi – Virgin Islands (USA)   .vn – Vietnam   .vu – Vanuatu   .wf – Wallis and Futuna Islands
.ws – Western Samoa   .ye – Yemen   .yt – Mayotte   .yu – Yugoslavia
.za – South Africa   .zm – Zambia   .zw – Zimbabwe


CHAPTER 11. Combining Words When Searching

The portals, directories, and search engines have different ways of combining words
when searching. If you enter chocolate strawberries, it might find hits with either the
word chocolate or the word strawberries. It might find pages with both chocolate and the
word strawberries. Or it may retrieve records with the phrase chocolate strawberries.
Which do you want and how do you know?

Use phrase searching whenever possible. Almost all the portals and search engines
can do phrase searching -- searching for the words entered adjacent to each other
and exactly in the order submitted. Most use double quotes to identify a phrase:
"this is a phrase"

Other techniques used by many portals and search engines involve using Boolean
operators (AND, OR, NOT) or the plus and minus symbols. Most search engines now
default to an AND operation. Try using a plus (+) directly in front of required words
and a minus (-) directly in front of words or phrases to exclude from search results.

+apples +strawberries -kiwi


(apples AND strawberries) NOT kiwi
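
The correspondence between the two notations can be sketched as a small Python translator. Treating unmarked words as required is an assumption here, based on the AND default mentioned above; the function name is made up for illustration.

```python
def to_boolean(query: str) -> str:
    """Rewrite a +word / -word query as an AND / NOT expression."""
    required, excluded = [], []
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        elif token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        else:
            required.append(token)      # assumption: unmarked words default to AND
    if len(required) > 1:
        expression = "(" + " AND ".join(required) + ")"
    else:
        expression = "".join(required)
    for word in excluded:
        expression += " NOT " + word
    return expression.strip()


print(to_boolean("+apples +strawberries -kiwi"))
print(to_boolean("clinton -lewinsky"))
```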


CHAPTER 12. Search Engine Mathematics

All that most people need to know is a little basic "search engine math" in order to
improve their results. You can easily learn to add, subtract and multiply your
way into better searches at your favorite search engine. The information below works
for nearly all of the major search engines.

Be Specific

Before learning math, it's a helpful reminder that the more specific your search is, the
more likely you will find what you want. Don't be afraid to tell a search engine exactly
what you are looking for.

For example, if you want information about Windows 98 bugs, search for "Windows
98 bugs," not "Windows." Or even better, search for exactly what the problem is: "I
can't install a USB device in Windows 98," for example. You'll be surprised at how
often this works.

Using The + Symbol to Add

Sometimes, you want to make sure that a search engine finds pages that have all the
words you enter, not just some of them. The + symbol lets you do this.

For example, imagine you want to find pages that have references to both President
Clinton and Kenneth Starr on the same page. You could search this way:
+clinton +starr

Only pages that contain both words would appear in your results. Here are some
other examples:

+windows +98 +bugs

That would find pages that have all three of the words on them, helpful if you wanted
to narrow down a search to Windows 98 bugs, rather than on Windows 98 in general.

+star +trek +insurrection

That would get you pages about Star Trek that also specifically mention
"Insurrection," the title of a Star Trek film.

The + symbol is especially helpful when you do a search and then find yourself
overwhelmed with information. Imagine that you wanted to reserve a camping space
in California's Yosemite National Park. You might start out simply searching like this:

yosemite

If so, chances are, you'll probably get too many off-target results. Instead, try
searching for all the words you know must appear on the type of page you're looking
for:

+yosemite +camping +reservations


Using The - Symbol to Subtract

Sometimes, you want a search engine to find pages that have one word on them but
not another word. The - symbol lets you do this.

For example, imagine you want information about President Clinton but don't want to
be overwhelmed by pages relating to the Monica Lewinsky scandal. You could
search this way:

clinton -lewinsky

That tells the search engine to find pages that mention "clinton" and then to remove
any of them that also mention "lewinsky."

Similarly, perhaps you are looking for information specifically about Windows 95 but
keep getting pages about Windows 98 or Windows 3.1. You could eliminate them
with a search like this:

windows -98 -3.1

Perhaps you are a fan of the original Star Trek series but instead keep finding pages
about Voyager, Deep Space Nine or Star Trek: The Next Generation. Try a search
like this:

star trek -voyager -deep -space -nine -next -generation

In general, the - symbol is helpful for focusing results when you get too many that are
unrelated to your topic. Simply begin subtracting terms you know are not of interest,
and you should get better results.

Using Quotation Marks To Multiply

Now that you know how to add and subtract terms, we can move on to multiplication.
As in normal math, multiplying terms through a "phrase search" can be a much better
way to get the answers you are looking for.

For example, remember above when we wanted pages about reserving a campsite in
Yosemite? We entered all the terms like this:

+yosemite +camping +reservations

That brings back pages that have all those words on them, but there's no guarantee
that the words will be near each other. You could get a page that
mentions Yosemite in the opening paragraph but then later talks about getting
camping reservations in the Grand Canyon. All the words you added together would
appear on this page, but it still might not be what you are looking for.

Doing a phrase search avoids this problem. This is where you tell a search engine to
give you pages where the terms appear in exactly the order you specify. You do this
by putting quotation marks around the phrase, like this:

"yosemite camping reservations"


Now, only pages that have all the words and in the exact order shown above will be
listed. The answers should be much more on target than with simple addition.
Likewise, remember this addition example?

+windows +98 +bugs

As you can imagine, multiplying the terms together within a phrase search would
work better, because that exact phrase probably appears on good pages dealing with
Windows 98 bugs. So try this:

"windows 98 bugs"

Remember the search for information about the latest Star Trek movie? We could
transform that into a phrase search like this:

"star trek insurrection"

But the movie's title actually has a colon after the word "trek," and many pages might
also follow this format. Thus, a better phrase search might be:

"star trek: insurrection"

Combining Symbols

Once you've mastered adding, subtracting and multiplying, you can combine symbols
to easily create targeted searches.
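
How a combined query is checked against a page can be sketched in Python. The parsing rule (an optional + or - followed by a word or a quoted phrase), the sample query and the two sample pages are assumptions for illustration; unmarked terms are treated as required.

```python
import re

# An optional sign, then either a quoted phrase or a single word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]+)"|(\S+))')


def words(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())


def parse(query: str):
    """Split a query into (sign, term) pairs; a quoted phrase stays together."""
    return [(sign, phrase or word) for sign, phrase, word in TOKEN.findall(query)]


def contains(page_words, term: str) -> bool:
    """True if the term's words appear on the page side by side, in order."""
    target = words(term)
    n = len(target)
    return n > 0 and any(page_words[i:i + n] == target
                         for i in range(len(page_words) - n + 1))


def page_matches(query: str, text: str) -> bool:
    page_words = words(text)
    for sign, term in parse(query):
        found = contains(page_words, term)
        if sign == "-" and found:
            return False        # subtracted term is present
        if sign != "-" and not found:
            return False        # added (or unmarked) term is missing
    return True


query = '+yosemite "camping reservations" -"grand canyon"'
print(parse(query))
print(page_matches(query, "Yosemite camping reservations open in March"))
print(page_matches(query, "Yosemite is lovely. We made camping reservations at the Grand Canyon"))
```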

For example, remember the person who wanted pages only about Star Trek's original
series? We searched this way:

star trek -voyager -deep -space -nine -next -generation

A better search might use subtraction and multiplication:

"star trek" -voyager -"deep space nine" -"next generation"


CHAPTER 13. Power Searching

13.1 Match Any


Sometimes you want pages that contain any of your search terms. For example, you
may want to find pages that say either Ireland or Eire.

At some search engines, you can do a Match Any search by using a menu next to the
search box or on the advanced search page.

Keep in mind that most search engines will automatically first list pages that have all
your terms, then some of your terms, when you perform a Match Any search.

Some search engine specific notes are below:

AltaVista

At AltaVista, testing shows that Match Any is most likely what will happen in response
to a default search. Earlier in 2001, AltaVista had said that Match Any would only
occur if you searched for five words or more. This no longer seems to be the case.

13.2 Match All


This is a search for pages containing all of your search terms, rather than any of
them. For example, you may want to find pages with references to both Clinton and
Dole on the same page.

Practically all major search engines support the + symbol as a means of doing a
Match All search. Many also perform a Match All search by default, even if you don't
use the + symbol.

See Chapter 12, Search Engine Mathematics, for more specific help on using the + symbol.
Some search engine specific notes are below:

AOL Search

By default, AOL Search will look for any sites in its Open Directory information that
contain all the words you enter. It will check both the words in the Open Directory
listing and the words on the page that the listing leads to.

AOL Search will not check for matches in its Inktomi listings UNLESS there are
absolutely no Open Directory listings that match all words. However, if you use AOL
Search's advanced search and choose the "On the Web Only" option, then your
search will be conducted against only Inktomi's listings.


13.3 Exclude
Most major search engines allow you to exclude documents that contain certain
words. This is a helpful way to narrow a search.

For example, you may want a page about the philosopher Calvin, not the cartoon
character Calvin. By excluding pages that mention Hobbes, the cartoon character's
sidekick, you will get better results.

The best way to do this is by using the - command, which is supported by practically
all major search engines.

See Chapter 12, Search Engine Mathematics, for more specific help on using the - symbol.

13.4 Site Search


One of the most powerful features available is the ability to control what sites are
included or excluded from a search. For example, imagine you wanted to see all the
pages from the Mars Exploration web site run by NASA's Jet Propulsion
Laboratory. At AltaVista, you could use this command:

host:mars.jpl.nasa.gov

In response, AltaVista would display all the pages it has indexed from the
mars.jpl.nasa.gov domain.

Now imagine you wanted to find all the pages from the Mars Exploration web site
that also mention Venus and Jupiter. You could do that this way:

host:mars.jpl.nasa.gov venus jupiter

That tells AltaVista to list pages with the words "venus" and "jupiter" that are within
the Mars Exploration web site.

You can even combine other commands, such as those described in Chapter 12,
Search Engine Mathematics. For instance, look at this example:

host:mars.jpl.nasa.gov -"mars pathfinder"

Here, we are telling AltaVista to list all pages within the Mars Exploration web site
that do not contain the exact phrase "mars pathfinder."

Now, imagine you are looking for information about Mars landings but are getting
overwhelmed by results from NASA. You can get rid of NASA pages by doing this:

"mars landings" -host:nasa.gov

In that example, we are looking for the phrase "mars landings" but excluding any
pages from sites that end in nasa.gov. That means we will NOT get pages from sites
like these:

• mars.jpl.nasa.gov
• spacekids.hq.nasa.gov
• cmex.arc.nasa.gov

We could even decide to see all pages about Mars landings from US educational sites,
which end in .edu, like this:

"mars landings" +host:edu

Finally, imagine you live outside the US and want to see results that are
predominantly from your country. Here's how someone in the UK might search for
football (soccer) information:

"football scores" +host:uk

This finds pages that say "football scores" and which are from sites that end in the .uk
domain, which is used for UK-based sites.
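
The host: examples in this section can be pictured as a suffix test on the host name
of each page. The Python sketch below shows that idea only. It is not how AltaVista
works internally; the function names host_matches and filter_by_host and the sample
URLs are assumptions made up for illustration.

```python
from urllib.parse import urlparse


def host_matches(url, suffix):
    """Return True if the URL's host equals `suffix` or ends with "." + suffix."""
    host = (urlparse(url).hostname or "").lower()
    suffix = suffix.lower().lstrip(".")
    return host == suffix or host.endswith("." + suffix)


def filter_by_host(urls, suffix, exclude=False):
    """Keep URLs whose host matches `suffix`, or drop them when exclude=True."""
    return [u for u in urls if host_matches(u, suffix) != exclude]


# Hypothetical result URLs, for illustration only.
results = [
    "http://mars.jpl.nasa.gov/missions/",
    "http://spacekids.hq.nasa.gov/",
    "http://www.example.edu/mars-landings.html",
    "http://www.example.co.uk/football/",
]

only_nasa = filter_by_host(results, "nasa.gov")           # like  host:nasa.gov
without_nasa = filter_by_host(results, "nasa.gov", True)  # like -host:nasa.gov
only_edu = filter_by_host(results, "edu")                 # like +host:edu
```

Matching on a leading dot keeps a host such as notnasa.gov.example.com from being
mistaken for a nasa.gov site.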

13.5 Search Engine Specific Issues


The examples shown above all use the command that works at AltaVista. The same
examples work at Google, FAST Search and some Inktomi-powered search engines,
if you substitute the corresponding site search command that each of them offers.

Often, search engines that support a site search command also let you do the same
thing from their advanced search pages. In addition, I'd highly recommend
downloading the Google Toolbar. Once you've done this, when visiting any web site,
you can use the toolbar's "Search site" button to search within just that web site.

Finally, for search engines that don't offer a site search command, you may find that
there is a URL Search command that provides a similar ability.

13.6 URL Search


Several search engines offer the ability to search within the text of a URL. This is very
similar to performing a site search.

Excite

Excite has a "site" command as explained in the Site Search section, but this
command cannot be combined with search terms in an attempt to locate pages on a
particular topic from a particular web site or to filter out pages from a particular web
site. For example, this query wouldn't work:

mars exploration -site:mars.jpl.nasa.gov

However, you can use the URL command to get a similar result. For instance:

mars exploration -url:mars.jpl.nasa.gov

would work to list pages about "mars exploration" but would remove any that came
from the mars.jpl.nasa.gov site. Be aware that when using the URL command in this
way, only the exact site listed will be removed. For example, this query:

mars exploration -url:nasa.gov


would remove pages from nasa.gov but still allow pages from mars.jpl.nasa.gov to
appear, since that is a different web site.

However, when using the + command, then any sites containing the core domain will
be included. In other words, this command:

mars exploration +url:nasa.gov

would bring up pages from any site that has nasa.gov in the URL, such as

• mars.jpl.nasa.gov
• spacekids.hq.nasa.gov
• cmex.arc.nasa.gov.
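
The difference between -url: and +url: at Excite, as reported above, comes down to an
exact-site test versus a "contains" test. The short Python sketch below models just
that difference. It is not Excite's code, and both function names are hypothetical.

```python
from urllib.parse import urlparse


def excluded_by_minus_url(url, site):
    """-url:site removes a page only when its host is exactly `site`."""
    return (urlparse(url).hostname or "").lower() == site.lower()


def included_by_plus_url(url, site):
    """+url:site keeps any page whose URL contains `site` anywhere."""
    return site.lower() in url.lower()


# Hypothetical page URL, for illustration only.
page = "http://mars.jpl.nasa.gov/missions/"
```

With this model, -url:nasa.gov leaves the mars.jpl.nasa.gov page in the results, while
+url:nasa.gov keeps it.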

Google

Google's advanced search page uses the allinurl command for finding URLs that
contain certain words, as described more on the Checking Your Listing page.
However, it is the undocumented "inurl" command that you should use, if you want to
find both web pages with words in the URL and within the pages themselves.

For example, let's say you want to find PDF files about mars exploration. Entering
"mars exploration" isn't enough, because that could bring back both HTML and PDF
pages. To solve this, you can use the inurl command to specify that URLs must have
the word "pdf" in them, which will increase the chances of getting PDF files. Here are
both, combined:

mars exploration inurl:pdf

If you used the "allinurl" command rather than the "inurl" command, this search
wouldn't work.

By the way, the "allinurl" command takes its name because when using it, you are
requiring that ALL the words appear IN the URL. In contrast, the inurl command
applies only to the word that immediately follows it, so the remaining words in the
query can match anywhere on the page.
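
To make the difference concrete, here is a rough Python sketch of the two commands as
used in the example above: the word attached to inurl: is tested against the URL and
the other words against the page, while allinurl: tests every word against the URL.
The function names, the sample URL and the simple substring matching are my own
assumptions, not Google's implementation.

```python
def matches_inurl_query(url, page_text, query):
    """Model of a query such as 'mars exploration inurl:pdf'.

    Words prefixed with 'inurl:' must appear in the URL; all other
    words must appear in the page text (plain substring matching).
    """
    url, text = url.lower(), page_text.lower()
    for word in query.lower().split():
        if word.startswith("inurl:"):
            if word[len("inurl:"):] not in url:
                return False
        elif word not in text:
            return False
    return True


def matches_allinurl_query(url, words):
    """Model of allinurl: every listed word must appear in the URL."""
    url = url.lower()
    return all(word in url for word in words.lower().split())


# Hypothetical document, for illustration only.
sample_url = "http://www.example.org/docs/report.pdf"
sample_text = "Mars exploration program overview"
```

Here "mars exploration inurl:pdf" matches the sample document, but the same three
words under allinurl: do not, because "mars" and "exploration" are not in the URL.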

Google also has a command that lets you narrow your search to find documents in
particular formats, which works better than forcing the URL command into this role.
The command is filetype:, and you follow it with the extension you want to search for.
For instance:

california power crisis filetype:pdf

brings back PDF files that contain the words "california power crisis." In contrast:

california power crisis filetype:asp

brings back Microsoft Active Server Pages (ASP) files, while

california power crisis filetype:html

brings back ordinary HTML files that end in .html and contain the words. It will not
bring back HTML files that end in .htm, however. Technically, Google considers those
to be a different file type, simply because the ending is different.
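
The .html versus .htm point is easy to see if you think of filetype: as an exact
comparison on the ending of the URL. The Python sketch below illustrates that view
only; how Google actually decides a file type is not stated here, and the function
names are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def file_type(url):
    """Return the lowercase extension of the URL path, without the dot."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lstrip(".").lower()


def matches_filetype(url, wanted):
    """True when the URL's extension is exactly `wanted`, so 'htm' is not 'html'."""
    return file_type(url) == wanted.lower()
```

Because the comparison is exact, a page ending in .htm is not returned for
filetype:html, even though both are ordinary HTML files.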


13.7 Link Search


Several search engines offer the ability to search for all the pages linking to a
particular page or domain.

13.8 Title Searching


Many of the major search engines allow you to search within the HTML title of a web
page. This is the text that appears within the title tag of a document. For example,
suppose a page has an HTML title like this:

<title>Power Searching For Anyone</title>

If someone were to do a title search for "power searching," then this page might
appear, because those words appear in the HTML title.

Some search engines support title searching using the "title" command, which looks
like this:

title:terms

where "terms" are the words you wish to search for in the title. Here are some
examples:

title:mars
title:mars landings
title:"mars landings"

In the first example, we're looking for the word "mars" in page titles. In the second
example, we're looking for both "mars" and "landings" in titles. In the last example,
we're looking for the exact phrase "mars landings" in titles.
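
The three examples above can be sketched in Python as follows. This is a simplified
model for illustration, not any search engine's method: the function names are made
up, and the title is pulled out with a basic regular expression rather than a real
HTML parser.

```python
import re


def html_title(html):
    """Return the text inside the first <title> tag, or '' if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def title_has_all_words(html, terms):
    """Like title:mars landings: every word must appear somewhere in the title."""
    words = re.findall(r"\w+", html_title(html).lower())
    return all(term in words for term in terms.lower().split())


def title_has_phrase(html, phrase):
    """Like title:"mars landings": the words must appear together, in order."""
    words = re.findall(r"\w+", html_title(html).lower())
    wanted = phrase.lower().split()
    return any(words[i:i + len(wanted)] == wanted
               for i in range(len(words) - len(wanted) + 1))


page = "<html><head><title>Power Searching For Anyone</title></head></html>"
```

For this page, a search for the words "searching power" succeeds as a match-all title
search but fails as a phrase search, because the words are in the wrong order.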

Some additional search engine specific notes are below:

Direct Hit

A search within title option is available on the advanced search page but was found
not to be working when tested on August 6, 2001.

Google

Google uses the allintitle: command rather than title:, and the command means that
documents will have ALL the words you specify in their titles. The intitle: command
applies only to the word that immediately follows it.

HotBot


Title searching, either by using the advanced search page or via the title: command,
only brings back results from Inktomi (as described below), not from the Open
Directory or Direct Hit.

Inktomi

The title command works for single or multiple words within the Inktomi-listings of
these services:

• HotBot
• iWon
• MSN Search

The command also does not work with quotation marks around multiple words or the
+ or - symbol. Instead, to perform operations such as a phrase search within the title,
you'll need to go to the advanced search pages of these search engines (iWon
doesn't have one). Enter the words you want to find including the search operators
that you wish to use. Then find the option to search for words in the page title.

Others

The title command does not work reliably with GoTo. At AOL Search, it only works if
you use the advanced search page with the On the Web Only option.

At Yahoo, you must use the t: command instead of title: to search through
titles.

Several of the search engines which support the title command also allow you to
specify a title search using their advanced search pages.

13.9 Wildcards (*)


You can search for plurals or variations of words using a wildcard character. It is also
a great way to search if you don’t know the spelling of a word.

The format looks like this:

• sing* finds singing and sings
• theat* finds theater and theatre

Some of the search engines offering wildcard search also support what is called
"stemming." That means they will find terms like "singing" even if you only enter
"sing." This also means you may not need to use a wildcard symbol.

Below are some important additional details about wildcard searching at specific
search engines.

AOL Search

At AOL Search, the ? symbol serves as a wildcard and will replace any single
character, such as:

s?ng matches sing, sang, song


This only works to find matches in AOL Search's Open Directory information. It does
not work to bring back matches from Inktomi-powered listings, as explained further
below.

Inktomi

Inktomi has two wildcard commands. The * symbol will match one or more
characters, such as:

sing* matches sings, singers, singing

The ? symbol matches any single character, and you can use it more than once. For
instance:

s?ng matches sing, sang, song

??ng matches ring, rang, sing, sang and any other four-letter word ending in ng

Both commands only work reliably at iWon, at the time of this writing. They fail to
function properly at AOL Search, HotBot, MSN Search or LookSmart to bring up
matches from within the Inktomi listings that they use.

They also do not appear to bring up matches in wildcard fashion from any of the other
data sets these services use, with the exception of AOL Search (see AOL Search
section, above).

Northern Light

Like Inktomi above, Northern Light has two wildcard commands. The * symbol will
match one or more characters, while % is used to match just a single character.
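
If you want to try these patterns on your own word list, the wildcard rules described
above translate naturally into regular expressions. The Python sketch below is a local
illustration only; the function names and the sample vocabulary are assumptions, and
it follows the Inktomi description in which * stands for one or more characters.

```python
import re


def wildcard_to_regex(pattern, single="?"):
    """Translate a wildcard pattern into a compiled regular expression.

    '*' matches one or more characters; `single` ('?' or '%') matches
    exactly one character. Everything else is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".+")
        elif ch == single:
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z", re.IGNORECASE)


def wildcard_matches(pattern, words, single="?"):
    """Return the words that match the whole pattern, in their original order."""
    regex = wildcard_to_regex(pattern, single)
    return [w for w in words if regex.match(w)]


# Hypothetical vocabulary, for illustration only.
vocabulary = ["sing", "sings", "singers", "singing", "sang", "song",
              "ring", "theater", "theatre"]
```

Passing single="%" gives the Northern Light style, so "s%ng" there behaves like "s?ng"
in the Inktomi and AOL Search examples.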

13.10 Anchor Search


Some search engines allow you to search specifically within the "anchor" or "link" text
that appears on a web page. For example, consider this link:

Click Here For The Mars Exploration Web Site

Notice the words "Mars Exploration Web Site" are all contained within the hyperlink?
This is the anchor text or the link text.

13.11 Proximity


Some search engines let you indicate how close words should appear to each other.
Most people do not need this type of control. Usually, phrase searching is all you
need.

13.12 More Power Search Commands


Several of the major search engines offer additional commands that allow you to
search by media type, to search within ALT text or link text, and other types of
queries. In particular, these services are notable in offering expanded features:

• AltaVista
• Inktomi-powered services
• Google
• Northern Light

Explore the help pages and advanced search forms at each service to better
understand the additional options that are available.

13.13 Related Searches


A related searches feature is designed to help users narrow in on what they are
looking for. For example, let's say you searched for "mars." When the results
appeared, you might also be shown some related searches links, such as "mission to
mars" or "life on mars." If you selected one of these links, a new search would be
conducted, using the words you clicked on. This can help you be more specific in
your query, which often leads to better results.

AltaVista

Displays related searches near the top of the results page, next to the words "Others
searched for."

AllTheWeb.com

Displays related searches near the top of the results page, next to the words "Narrow
your search."

Direct Hit

Displays related searches near the top of the results page, under the "Related
Searches" heading.

Excite

After performing a search, click on the "Zoom In" button near the search box to see
related terms. These will appear in a separate window. Select the related term you
want and choose the Search button within the new window. Your search will then be
sent to Excite.

HotBot


Displays related searches near the top of the results page, under the "People who did
this search also searched for" heading.

MSN Search

Displays related searches in the "Popular Topics" area below the search box, on the
results page.

Yahoo

At Yahoo, related searches appear at the bottom of its results page.

13.14 Clustering
Have you ever done a search and found the top results all seem to come from one
site? Clustering prevents this. Clustering generally allows only one or two pages per
site to be represented in the top results. This means that you get more variety and a
better chance of quickly finding something of interest. The section below highlights
how this feature works at the major services that offer it.
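
The basic idea of clustering can be sketched in a few lines of Python: walk down the
ranked list and keep no more than two pages from any one site. This is only a model of
the behaviour described above. The function name cluster_by_site, the per_site
parameter and the sample URLs are hypothetical, and real services also decide how to
display the extra pages.

```python
from urllib.parse import urlparse


def cluster_by_site(urls, per_site=2):
    """Keep at most `per_site` results per host, preserving ranking order."""
    seen = {}
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= per_site:
            kept.append(url)
    return kept


# Hypothetical ranked results, for illustration only.
ranked = [
    "http://a.example.com/1",
    "http://a.example.com/2",
    "http://a.example.com/3",
    "http://b.example.com/1",
]
clustered = cluster_by_site(ranked)
```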

AltaVista

AltaVista clusters listings so that no more than two pages per site appear in its
results. If a second page from a particular web site is listed, it will be indented under
the first page. To see more results from a site, select the "Additional relevant pages
from this site" link, if it appears for a particular listing.

AllTheWeb.com

Clustering is on by default and will prevent more than two pages from the same web
site from being displayed. It can be overridden by changing the Site Collapsing option
on the Search Customization page (see the Customize Results section below). You
can also view more pages from any particular site listed by selecting the "more hits
from" link that follows the listing.

Excite

There is no way to disable clustering at Excite. However, you can see more pages
from any particular site listed by selecting the "more from this site" link that follows
the listing.

Google

Google clusters so that no more than two pages per site appear in its results. If a
second page is listed, it will be "indented" under the first page. To see more results
from a site, select the "More results from" link that will appear below the second page
listed.

HotBot


At HotBot, clustering is on by default. However, it only works within the listings
provided by Inktomi. To turn off clustering, go to the advanced search page, then in
the "Best Page Only" section, check the "Disable Best Page Only Filter" box. You can
also view more pages from any particular site listed by selecting the "See results from
this site only" link that follows the listing.

MSN Search

Clustering at MSN Search has to be enabled from its advanced search page. Look for
the "Show one result per domain" option and select it to start clustering.

Northern Light

Northern Light has clustering, and there is no way to turn this off. To see more pages
from a site, click on the "More Results" link below the page listing, if it appears.

13.15 Find Similar


Did you find a web page in the search results that seemed perfect -- it was exactly
what you were looking for? A "Find Similar" feature tells the search engine to seek
out other pages that seem similar to those you like. The section below highlights how
this feature works at the major services that offer it.

AltaVista

Click on the "Related pages" link that appears at the bottom of each listing.

AOL Search

Click on the "Show me more like this" option that appears at the bottom of each page
listed. This takes you to where that page is categorized within the version of the Open
Directory that AOL uses. That can help you find similar web sites.

Google

Click on the "Similar pages" link that appears at the end of each listing.

13.16 Stemming
Stemming is the ability for a search engine to search for variations of a word based
on its stem. For example, entering "swim" might also find "swims" and maybe
"swimming," depending on the search engine.

The Search Features Chart shows which search engines will do stemming by default
and those that allow it to be switched on as an option. Some search engine specific
notes are also below.
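
To get a feel for what stemming does, here is a deliberately crude Python sketch that
strips a few common English endings. It is a toy for illustration and not the method
any of these search engines uses; the function name crude_stem and its rules are my
own assumptions, and it will mangle many real words.

```python
def crude_stem(word):
    """Strip a few common English endings. Deliberately simplistic."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "swimming" -> "swimm" -> "swim".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


def same_stem(first, second):
    """True when both words reduce to the same crude stem."""
    return crude_stem(first) == crude_stem(second)
```

With these rules, "swim", "swims" and "swimming" all reduce to the same stem, which is
why a stemming search engine can treat them as one query.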

Direct Hit


Singular and plural forms of words should generally provide the same results (cook
versus cooks), and so should other endings such as -ing (cook versus cooking).

HotBot

Stemming should be on automatically for Direct Hit-powered listings (see Direct Hit,
above). For stemming in Inktomi-powered listings, see the Inktomi section, below.

Inktomi

The Inktomi-powered services (HotBot, iWon, MSN Search) provide stemming as an
option. To enable it, go to the advanced search page of each search engine, then:

• At HotBot, check the "Enable Word Stemming" box.
• At iWon, enable the "Use word stemming" box.
• At MSN Search, see below.

MSN Search

This appears to be on permanently, at least for some queries. For example, a search
for "run," "runs" and "running" in Oct. 2001 found the same results. Oddly, using the
"Enable Stemming" box on MSN Search's advanced search page actually causes no
results to appear.

13.17 Search Within


Ever do a search and still feel like you have too many results? Instead of trying a new
search, you might have more luck narrowing down the set of matches you've already
generated. Some search engines make this easy through a "Search Within" feature.
The section below highlights how this feature works at the major services that offer it.

AltaVista

After performing a search, check the "Search within these results" box under the
search box, on the results page.

Google

After performing a search, click on the "Search within results" link that appears at the
bottom of the results page, next to the search box.

HotBot

After performing a search, check the "Search within these results" box that appears
next to the search box, on the results page.

LookSmart


You cannot search within results generated from a keyword search on the LookSmart
home page. However, if you navigate to any particular category, you can then search
for matching sites that appear only within that category and its subcategories. To do
this, when in a category, change the drop down box at the top of the category page
from "the Web" to the second option, which will be the name of the category you are
in.

Lycos

At Lycos, choose the "Search these results" option which appears next to the search
box, at the top of the results page.

Yahoo

At Yahoo, you can't run a search and then search within it. But you can go to any
category and then choose to search just within that section. Just look for the
appropriate options near the search boxes that appear within the categories.

13.18 Spidered Version


It can be helpful to see the exact version of a web page that was presented to a
search engine's spider. This is good for those times when a page no longer exists,
allowing you to still find the information. It's also essential if you want to determine if a
search engine spider was shown something different than what a human user sees.
In fact, some webmasters may "pagejack" someone else's web page, feeding it to a
search engine in hopes of attaining a good ranking.

Only Google allows you to see the actual page it spidered, through its "Cached"
feature. When you search, a "Cached" link may appear below some pages that are
listed. Click on this, and you'll be shown the page that was indexed, and any of your
search terms will be highlighted.

You can also bring up the spidered version using Google's cache command. Simply
enter the URL of a page after cache:, omitting the http:// prefix. For instance, to
see the cached version of this page, you would enter this into Google:

cache:searchenginewatch.com/facts/assistance.html
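
If you build such queries often, the rule is simple enough to automate. The Python
helper below is a small sketch of the step just described; the function name
cache_query is hypothetical and it does nothing more than drop the http:// prefix and
add cache: in front.

```python
def cache_query(url):
    """Build a 'cache:' query from a page URL by dropping the http:// prefix."""
    prefix = "http://"
    if url.lower().startswith(prefix):
        url = url[len(prefix):]
    return "cache:" + url
```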

13.19 Search By Language


Sometimes you may want to find pages written in a particular language. For example,
you might want travel advice about Paris written in French. If you search for "paris,"
you'll probably get many pages written in English, since the city is spelled the same
way in English and French. However, with a search by language option, you can
specify that only pages written in French should be returned.

Searching by language isn't perfect. Search engines generally use dictionaries of
terms specific to different languages to identify a page's language when spidering it.
That means pages with content written in several different languages may not be
categorized properly. Additionally, because this is an automated process, it can suffer
from the mistakes that any automated system may have.
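
The dictionary approach, and why it can go wrong, is easy to demonstrate. The Python
sketch below counts how many words of a page appear in a tiny word list for each
language and picks the best score. The word lists, the function name guess_language
and the scoring rule are all assumptions for illustration; real systems use far larger
dictionaries and more careful methods.

```python
import re

# Tiny illustrative word lists; a real system would use far larger dictionaries.
LANGUAGE_WORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "french": {"le", "la", "les", "et", "est", "des", "une"},
}


def guess_language(text):
    """Return the language whose word list matches the most words, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in LANGUAGE_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A page that mixes languages gets a single label from a scheme like this, which is
exactly the kind of miscategorization mentioned above.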

Below is how to search by language, at search engines that offer this feature.


AltaVista & AllTheWeb.com

Use the drop-down box that appears next to the search box on the home page and
results page, to search in a particular language that's offered.

Excite, Google, Lycos

Use the advanced search page to search by language at these services.

HotBot

Use the drop-down box that appears on the left-hand side on the home page, to
search in a particular language that's offered.

MSN Search

Use the "Language" drop-down box on the advanced search page to search by
language through Inktomi's crawler-based results.

Northern Light

Use the "Documents written in: Language" drop-down box on the advanced search
page.

13.20 Page Translation


Some search engines allow you to translate web pages they list into different
languages. That's helpful if you see a page you are interested in but it is written in a
language you don't understand. Below is how to do translation at search engines that
offer this.

AltaVista & Lycos

Click on the "Translate" link that appears at the bottom of each listing.

Google

Click on the "Translate this page" link that appears next to the title of pages that are
not in English, when using the main Google.com web site.

13.21 Porn Filter


Some search engines allow you to filter out pages that may lead to pornographic web
sites or sites with content that might be considered offensive to some people. They
generally do this by scanning pages for pornographic terms at the time they are
indexed. "Block" lists and human review are also used.

Porn filters are not perfect, but they can be especially helpful if you are working with
children and want to minimize the risk of them seeing sexually explicit or offensive
terms in the results that appear.


Some search engine specific notes are below:

AltaVista

Use this page to enable the porn filter:

AltaVista Family Filter Setup
http://www.altavista.com/sites/search/ffset

AllTheWeb

The porn filter is on by default. It can be overridden by changing the Offensive
Content Reduction option on the Search Customization page.

Google

Enable the porn filter by using the SafeSearch option on Google's customize page.

LookSmart

LookSmart does not list porn sites. However, searches on porn terms will bring up
listings from the Inktomi results that supplement LookSmart's own listings. Therefore,
do not consider a search on LookSmart to be child-safe.

Lycos

Use this page to enable the porn filter:

Lycos SearchGuard
http://searchguard.lycos.com/

HotBot, MSN Search & Northern Light

These services will warn "You have entered a search term that is likely to return adult
content" if you enter porn terms. That prevents you from immediately seeing possibly
objectionable content. However, results are still offered, if you choose to go beyond
this warning. These results come from across the web, from the service itself, or from
a partner search engine that specializes in listing porn sites and content.

13.22 Customize Results


Wouldn't it be nice to see more than the 10 results at a time that are usually displayed
at most search engines? Perhaps you might want to see just the titles of matching
web pages. Some search engines allow you to customize your results in this way,
usually via advanced search pages or from menu options.

Sort By Date


Sort by date sounds like a great idea, but there are big problems with dates on the
web. Some web servers report incorrect dates or no dates at all.

For instance, Go's engineers estimated in 1998 (back when the search engine still
existed) that only 70 percent of web servers returned the correct date, while 20
percent reported the current date, regardless of when the page was created or
changed. The remaining 10 percent of the time, the web servers reported no date at
all.

Still, date sorting is a nice feature to have, and one that many professionals want.
When you choose the option, search engines list pages with newer dates first.
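
As a small illustration of date sorting with unreliable dates, the Python sketch below
orders listings newest first and puts pages that report no date at the end. Placing
undated pages last is my own assumption, as are the function name and the sample
listings; the services themselves may handle missing dates differently.

```python
from datetime import date


def sort_newest_first(pages):
    """Sort (url, reported_date) pairs newest first; pages without a date go last."""
    dated = [p for p in pages if p[1] is not None]
    undated = [p for p in pages if p[1] is None]
    return sorted(dated, key=lambda p: p[1], reverse=True) + undated


# Hypothetical listings: (URL, date reported by the web server, or None).
listings = [
    ("http://www.example.com/old.html", date(1999, 5, 1)),
    ("http://www.example.com/undated.html", None),
    ("http://www.example.com/new.html", date(2001, 8, 6)),
]
```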

At MSN Search, you'll find this option on the advanced search page. Use the "Sort
equally relevant results by" box.

At Northern Light, you enable date sorting by going to the advanced search page
and checking the "Sort results by" option.

Keep in mind that when people want to sort by date, they are often trying to get
the latest information on a news topic. In these cases, it is better to use a news search
engine.

Date Range

Some search engines let you restrict a search so that only pages within a particular
date range are displayed. This feature can suffer from the fact that web page dates
can be unreliable, as described above. However, it can also be useful, especially as a
means of determining how fresh a search engine's listings are.

For example, if you restrict a search to find pages less than a month old and don't get
any matches, you have a pretty good idea that the search engine's listings are out of
date.

Date Display

Along with the page description, some search engines show the date when a web
page was created or modified. As noted above, these dates may not always be
reliable. However, they do provide a useful clue as to how fresh or stale a search
engine's listings are. Thus, search engines that show a date deserve praise for doing
so.

When no date is reported, these search engines will instead display the date the
page was spidered.

Northern Light is an exception. In these cases, it won't report a date at all.

Directories don't spider pages, but they can display when a listing was manually
added or updated, if desired.


CHAPTER 14. Boolean Searching

Boolean search commands have been used by professionals for searching through
traditional databases for years. Despite this, they are overkill for the average web
user. The commands described on the Search Engine Math page provide the same
basic functionality as Boolean commands and are also supported by all the major
search services. If you are new to searching, start off learning how to search better by
first reading the Search Engine Math page, rather than trying to learn Boolean
commands. I'm certain you'll find it easier.

In fact, many professionals might benefit by
abandoning Boolean commands when using web search engines. But since there is
a comfort level in using what is already familiar, this page covers how Boolean
commands are implemented at the major search services. It assumes you are
already familiar with Boolean searching, although some resources that provide
further help appear at the end of the page.

OR
The Boolean OR command is used in order to allow any of the specified search terms
to be present on the web pages listed in results. It can also be described as a Match
Any search. You use the command like this:

ireland OR eire

Search engines that support OR are shown on the Search Features Chart. For those
that don't, see their advanced search pages, where an option to search for any of your
terms is often available.


Also be aware that some search engines perform an OR search by default, as shown
in the Match Any section of the Power Searching For Anyone page. Search engine
specific notes are below:

AOL Search

OR failed to work correctly at the time this page was written. For instance, a search
for "ireland OR eire" failed to yield the much larger set of results that should have
appeared when compared to "ireland AND eire".

Google

OR will not work to find different phrases, such as "bill clinton" OR "hillary clinton".

AND
The Boolean AND command is used in order to require that all search terms be
present on the web pages listed in results. It can also be described as a Match All
search. You use the command like this:

clinton AND dole

Search engines that support AND are shown on the Search Features Chart. For
those that don't, using the + symbol is generally a good alternative.

Also be aware that some search engines perform an AND search by default, as
shown in the Match All section of the Power Searching For Anyone page. Search
engine specific notes are below:

AOL Search

When using AND, you may find a slightly different number of documents will be
retrieved when compared to using the + symbol. This appears to be because AOL
Search will check both its own listings and Inktomi listings when using AND but only
Inktomi listings when using the + symbol.

NOT
The Boolean NOT command is used in order to require that a particular search term
NOT be present on web pages listed in results. It can also be described as an
Exclude search. You use the command like this:

clinton NOT dole

Search engines that support NOT are shown on the Search Features Chart. For
those that don't, using the - symbol is generally a good alternative. Search engine
specific notes are below:

AOL Search

When using NOT, you may find a slightly larger number of documents will be
retrieved when compared to using the + symbol or no commands at all. This shouldn't
happen, but it did at the time this page was written.

NEAR
The NEAR command is used in order to specify how close terms should appear to
each other. You use the command like this:


moon NEAR river

Please consider whether you really need to control proximity within your searches.
Most search engines will try to find the terms you indicate next to each other, or within
close proximity to each other, by default. Also, all of the search engines support
phrase searching through use of quotation marks. See the Search Engine Math page for
more information about phrase searching.

Search engines that support NEAR are shown on the Search Features Chart. Search
engine specific notes are below.

AltaVista

NEAR means that terms will appear within 10 words of each other.

AOL Search

You can control the exact number of words apart by using NEAR/#. For instance,
NEAR/5 would mean the terms should be five words apart. If you don't specify a
number, then the terms must appear right next to each other.

Lycos

NEAR means that terms will appear within 25 words of each other. Lycos also
supports an extensive range of other adjacency commands. See the site's help pages
for Boolean searches for further details.
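
As a rough picture of what NEAR does, the Python sketch below checks whether two words
occur within a given number of words of each other. The function name near and the way
distance is counted (the difference in word positions) are my assumptions. The default
of 10 simply mirrors the AltaVista figure quoted above, and you could pass 25 to mimic
the Lycos description.

```python
import re


def near(text, first, second, max_distance=10):
    """True if `first` and `second` occur within `max_distance` words of each other."""
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_distance
               for i in first_positions for j in second_positions)


# Hypothetical page text, for illustration only.
sample = "The moon rose slowly over the wide river"
```

In the sample text, "moon" and "river" are six word positions apart, so they match
with the default distance but not with a limit of five.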

Nesting ( )
Nesting allows you to build complex queries. You nest queries using parentheses,
like this:

impeachment AND (clinton OR johnson)
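
To see what the parentheses change, here is a small Python sketch that evaluates a
nested query against the words found on a single page. It is an illustration only.
The function names are hypothetical, it assumes AND and NOT bind more tightly than OR,
it treats NOT as "and not", it expects a well-formed query with uppercase commands,
and it says nothing about how any real search engine parses queries.

```python
import re


def tokenize(query):
    """Split a query into parentheses, operators and words."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def evaluate(query, page_words):
    """Evaluate a Boolean query against the set of lowercase words on one page."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        token = take()
        if token == "(":
            value = expr()
            take()  # consume the closing ")"
            return value
        return token.lower() in page_words

    def term():
        value = factor()
        while peek() in ("AND", "NOT"):
            operator = take()
            right = factor()
            value = (value and right) if operator == "AND" else (value and not right)
        return value

    def expr():
        value = term()
        while peek() == "OR":
            take()
            right = term()
            value = value or right
        return value

    return expr()
```

A page containing "impeachment" and "johnson" satisfies the query above, while a page
containing only "impeachment" and some other name does not.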

Search engines that expressly say that they support nesting are shown on the Search
Features Chart. I have not tried to verify this information. Be aware that the major
search engines may process nested queries differently than each other.

Other Notes

AltaVista

Boolean searching can only be done from the advanced search page.

Excite, Google & MSN

Boolean commands must be in uppercase. That's why I show them that way on this
page. If you always use uppercase, you won't have problems when going between
services.

Inktomi-powered services (HotBot, MSN Search)

You must set the menu option on the home page or advanced search page to
"Boolean phrase" when using Boolean commands.

Lycos


Lycos says it supports many Boolean commands, but I haven't verified these,
because of the difficulty of determining exactly which datasets might be processed. In
addition, AllTheWeb -- which powers many of the search results at Lycos -- doesn't
support Boolean. This makes it unclear how Lycos itself might then do this.


CHAPTER 15. Search Engine Features for Searchers

The search engine features chart below is designed primarily for users of search
engines. It summarizes key search commands and search assistance features.
These are described more fully on the Search Engine Math and Power Searching For
Anyone pages.

Note: This chart is being updated

15.1 Search Engine Math Commands

Command          How                Supported By

Include Term     +                  All but LookSmart
                                    (Does work for LookSmart's Inktomi results)

Exclude Term     -                  All but LookSmart
                                    (Does work for LookSmart's Inktomi results.
                                    Also, will not work for preprogrammed results
                                    to popular queries at MSN Search)

Phrase           ""                 All but Direct Hit, LookSmart, MSN Search
                                    (Does work for LookSmart's Inktomi results. At
                                    MSN Search, unpredictable about when it works)

Match Any Term   Auto               AltaVista, Direct Hit, Excite, LookSmart
                                    Not yet updated, but may be still correct:
                                    Netscape, Yahoo, GoTo
                 adv. search page   AllTheWeb, AOL Search, Google,
                                    HotBot, Lycos, MSN Search
                 Other              Northern Light (use OR)

Match All Terms  Auto               AllTheWeb, AOL Search, Google, HotBot, Lycos,
                                    MSN, Northern Light
                 Other              Can usually be done with + symbol or adv. search
                                    page

NOTE: Math commands at Lycos tend to bring back results from FAST's crawler-based
results, rather than the human-powered results Lycos uses from the Open Directory. If you
want to search just Open Directory results, then use the Lycos advanced search page.

15.2 Power Searching Commands

| Command | How | Supported By |
|---|---|---|
| Title Search | title: | AltaVista, Inktomi (HotBot, iWon, MSN), Northern Light |
| Title Search | normal.title: | AllTheWeb, Lycos (for AllTheWeb results only) |
| Title Search | allintitle: intitle: | Google |
| Title Search | adv. search page | Direct Hit |
| Title Search | none | AOL, Excite, HotBot, MSN, LookSmart, Lycos. Not yet updated, but may be still correct: Netscape |
| Title Search | other | Not yet updated, but may be still correct: Yahoo (t:) |
| Site Search | host: | AltaVista |
| Site Search | site: | Excite, Google (Netscape, Yahoo) |
| Site Search | url.host: | AllTheWeb, Lycos (for AllTheWeb results only) |
| Site Search | domain: | Inktomi (HotBot, iWon, LookSmart) |
| Site Search | none | AOL, Direct Hit, HotBot, LookSmart, Lycos, MSN, Netscape, Northern Light, Open Directory, Yahoo |
| URL Search | url: | AltaVista, Excite, Northern Light |
| URL Search | url.all: | AllTheWeb, Lycos (for AllTheWeb results only) |
| URL Search | allinurl: inurl: | Google |
| URL Search | originurl: | Inktomi (AOL, GoTo, HotBot) |
| URL Search | u: | Yahoo |
| URL Search | none | AOL, Direct Hit, HotBot, LookSmart, MSN. Not yet updated, but may be still correct: Open Directory |
| Link Search | link: | AltaVista, Google, Northern Light |
| Link Search | linkdomain: | Inktomi (AOL, HotBot, iWon, MSN) (NOTE: measures links to entire domains) |
| Link Search | link.all: | AllTheWeb, Lycos (for AllTheWeb results only) |
| Link Search | none | AOL, Direct Hit, Excite, HotBot, LookSmart, Northern Light. Not yet updated, but may be still correct: Netscape, Yahoo (n/a) |
| Wildcard | * | AltaVista, Inktomi (iWon), Northern Light. Not yet updated, but may be still correct: Yahoo |
| Wildcard | ? | AOL Search, Inktomi (iWon) |
| Wildcard | % | Northern Light |
| Wildcard | none | AllTheWeb, Direct Hit, Excite, Google, HotBot, LookSmart, Lycos, MSN (MSN's help says it offers wildcard, but it failed to during testing) |
| Anchor Search | anchor: | AltaVista |
| Anchor Search | None | AllTheWeb, AOL Search, Direct Hit, Excite, Google, Inktomi, HotBot, Lycos |

NOTE: The commands above are primarily useful when dealing with crawler-based search engines.
"None" indicates any crawler-based or human-powered search engine that creates its own listings but
which does not provide a particular command for searching within those listings. It may also indicate a
portal that outsources its listings and which lacks a single command to work across the multiple
datasets it uses.

15.3 Search Assistance Features

| Feature | Offered By |
|---|---|
| Related Searches | AltaVista, AllTheWeb, Excite, HotBot, Lycos, MSN, Yahoo. Not yet updated, but may be still correct: iWon |
| Clustering | AltaVista, AllTheWeb, Excite, Google, HotBot, MSN, Northern Light |
| Find Similar | AltaVista, AOL Search, Google |
| Stemming | AOL Search, Direct Hit, HotBot, Inktomi (HotBot, MSN) |
| Search Within | AltaVista, Google, HotBot, Lycos |
| Spidered Version | Google |
| Search By Language | AltaVista, AllTheWeb, Excite, Google, HotBot, Lycos, MSN, Northern Light |
| Page Translation | AltaVista, Google, Lycos |
| Porn Filter | AltaVista, AllTheWeb, Google |
| Porn Warning | HotBot, MSN, Northern Light |

15.4 Customization & Display Features

| Feature | Supported By |
|---|---|
| Number Of Listings Shown (10 unless noted) | AltaVista, AllTheWeb, AOL Search (5), Direct Hit, Excite, Google, HotBot, LookSmart (15), Lycos, MSN (15), Northern Light. Not yet updated, but may be still correct: iWon, Netscape, Yahoo (20) |
| Ability To Increase Number Of Listings? | AltaVista, AllTheWeb, Excite, Google, HotBot, MSN. Not yet updated, but may be still correct: Yahoo |
| See 20 Results | AltaVista, AllTheWeb, Excite, Google, HotBot, MSN. Not yet updated, but may be still correct: Yahoo |
| See 50 Results | AltaVista, AllTheWeb, Excite, Google, HotBot, MSN. Not yet updated, but may be still correct: Yahoo |
| See 100 Results | AllTheWeb, Google, HotBot. Not yet updated, but may be still correct: Yahoo |
| Sort By Date | MSN Search, Northern Light |
| Date Range | AltaVista, Google, HotBot, MSN, Northern Light. Not yet updated, but may be still correct: iWon, Yahoo |
| Date Displayed? | AltaVista, HotBot (for Inktomi results), Northern Light |
| Display Titles Only? | AltaVista, Excite, HotBot (URLs only option), MSN |
| Other Major Customize Options | AltaVista, AllTheWeb, Google |

15.5 Boolean Commands

| Command | How | Supported By |
|---|---|---|
| Or | OR | AltaVista, AOL Search, Excite, Google, Inktomi (HotBot, MSN), Lycos, Northern Light |
| Or | None | AllTheWeb, Direct Hit, LookSmart. Not yet updated, but may be still correct: Yahoo |
| And | AND | AltaVista, AOL Search, Excite, Inktomi (HotBot, MSN), Lycos, Northern Light |
| And | None | AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may be still correct: Yahoo |
| Not | NOT | AOL Search, Excite, Inktomi (HotBot), Lycos, Northern Light |
| Not | AND NOT | AltaVista, Inktomi (MSN). Not yet updated, but may be still correct: Netscape |
| Not | None | AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may be still correct: Yahoo |
| Nesting | () | AltaVista, AOL Search, Excite, Inktomi (MSN), Northern Light |
| Nesting | None | AllTheWeb, Direct Hit, Google, Inktomi (HotBot), LookSmart, Lycos. Not yet updated, but may be still correct: Yahoo |
| Near | NEAR | AltaVista (10 words), AOL Search (specify number), Lycos (25 words) |
| Near | None | AllTheWeb, Direct Hit, Google, Inktomi (HotBot, MSN), LookSmart |

Notes

At AltaVista, Boolean only works on the advanced search page.
At Excite, Google & MSN, Boolean commands must be in UPPERCASE.
At Inktomi-powered services, set the menu to "Boolean".

CHAPTER 16. The Three Contenders

| Search Engine | Google | AlltheWeb | AltaVista |
|---|---|---|---|
| Size, type (size varies frequently and widely) | HUGE. Claims over 1.5 billion pages, but may be counting pages not fully indexed. General Web database with often useful ranking by popularity. Far from comprehensive, but often finds "the best" pages. | HUGE. Claims will reach a billion pages soon. General Web database. Excellent ranking. Advanced search worth mastering. | LARGE. Claims to be the biggest also. General Web database. Use the Advanced Search with Boolean operators. |
| Phrase searching | Yes. Use " ". Searches common "stop words" if in phrases in quotes. | Yes. Use " ". In advanced search, terms in lower "filter" boxes always searched as phrases. | Yes. Use " ". |
| Proper name searching | Enclose in " " or try without. Automatically looks for terms in close proximity. | Enclose in " ". Enclose each variant form in " " and all forms in ( ) in top search box. | Enclose in " " or use NEAR operator. Case sensitive. |
| Boolean logic | Partial. AND assumed between words. Capitalize OR. - excludes. No ( ) or nesting. In Advanced Search, limited Boolean: ALL = and, ANY = or, WITHOUT = and not. | In top box, AND default. For OR, enclose terms or phrases in ( ) without typing "or". Advanced search "filter" boxes offer partial equivalent of Boolean logic: Must include = AND, Must not include = NOT, Should include prioritizes higher in ranking. No OR. | AND (default), OR, AND NOT, NEAR (within 10 words). Use only in Advanced Search larger box. Do NOT use in A-V simple Search. |
| +Requires/-Excludes | DEFAULT=AND. - excludes. + will allow you to retrieve "stop words" (e.g., +in). | DEFAULT is AND. In top box, - excludes. | Available only in Simple Search (which defaults to OR). We recommend Boolean logic in Advanced Search -- much more powerful and specific. |
| Sub-Searching | Yes. At bottom of results page, click "Search within results" and enter more terms. | No. Add terms. | No. Add terms. |
| Results Ranking | Based on page popularity measured in links to it from other pages: high rank if a lot of other pages link to it. Fuzzy AND also invoked. Matching and ranking based on "cached" version of pages that may not be the most recent version. | Automatic Fuzzy AND. Also seems to use "importance" and links to pages. In Advanced Search, SHOULD INCLUDE gives higher priority to word or phrase in box. Each box read as a phrase. | Automatic Fuzzy AND. Some of the top results have purchased the right to be there (not based on your terms). |
| Field limiting | link: site: allintitle: allinurl: Easy to use Advanced Search boxes for these. | In Advanced Search, can search within: text, title, link name, url, link to the url and filter by: domain terms. | title: url: link: host: domain: anchor: text: image: applet: |
| Truncation | No, and no stemming. Search variant endings and synonyms separately, separating with OR (capitalized): airline OR airlines | No. Enclose variants in ( ) in top box to create OR. | Use *. |
| Case sensitivity | No. | No. | Yes. Upper case retrieves only matching upper case. Lower case retrieves either lower or upper case. Also accent and character sensitive. |
| Language | Yes. Major Romanized and non-Romanized languages in Advanced Search. | Yes, extensive list includes major Romanized and non-Romanized languages. Allows you to specify matching character sets. | Yes, extensive list includes major Romanized and non-Romanized languages. |
| Limit by age of documents | In Advanced Search. | In Advanced Search. | Yes, in Advanced search. |
| Translation | Yes, in Translate this page link following some pages. To English from major European languages. | No. | Yes, to and from English and other major languages. Click on Translate following result. |
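
The "Results Ranking" entry for Google above describes ranking by link popularity: a page ranks high if a lot of other pages link to it. As a rough illustration of that idea only (this is a plain inbound-link count, not Google's actual algorithm, and the link graph and page names are invented), consider:

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "a.example": ["c.example"],
    "b.example": ["c.example", "a.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

# Count how many links point at each page.
inbound = Counter(target for targets in links.values() for target in targets)

# Pages with the most inbound links come first; pages nobody links to are absent.
ranking = [page for page, _ in inbound.most_common()]
print(ranking)
```

Here c.example comes first because three other pages link to it, while b.example and d.example receive no links and do not appear at all.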

CHAPTER 17. Inktomi Powered Searches

Inktomi's popularity grew several years ago, when it powered the secondary search
database behind Yahoo!. Yahoo has since switched to Google for its secondary search
and backend database; however, Inktomi is just as popular now as it was several
years ago, if not more so. Its spiders are named "Slurp", and different versions of
Slurp crawl the web many times throughout the month, as Inktomi powers the search
results of many sites. There isn't much more to Inktomi than that. Slurp puts heavy
weight on title and description tags, and will rarely deep crawl a site. Slurp usually
spiders only pages that are submitted to its index. Here is a list of some of the sites
that Inktomi provides results for:

America Online
MSN ( The Microsoft Network )
iWon
HotBot
Looksmart
Goto.com
About
Goo
Anzwers
eoexchange
Powerize
NBCi
Canada.com
Chello
CNET
Swiss Search
Geocities
eHOLA
iAtlas
GoProfit
ICQ.com
Mobilecom
n2h2
quepasa.com
RadarUOL
Starmedia
4Anything.com

As you can see, that's quite a list. Inktomi has many different versions of its Slurp
spiders.

CHAPTER 18. Seven Blunders of Search Engine World

In the lighthearted spirit of the popular books for "idiots" and "dummies," here's a look
at seven common blunders that are virtually guaranteed to deliver useless,
nonsensical, or completely worthless search results.

Some of these gaffes might surprise you. But once you recognize them, it's easy to
banish these little gremlins forever from your Web search tool kit.

Sputtering on "Stop Words"

Some search engines simply ignore certain words. They are never used to find a
matching document, despite what amounts to a direct command when you type them
into a search form.

These are called "stop words" because the search engine doesn't "stop" when the
words are found in the index (if they are even indexed at all). Why not? Because stop
words are either too common to generate meaningful results, or are parts of speech
like adverbs, conjunctions, prepositions, or forms of "be" that mean nothing unless
they're part of a phrase with more "important" nouns and verbs.

If you use a stop word in a query you may get wildly irrelevant results. For example,
the phrase "searching the web" contains two stop words: "the" and "web." Though it's
not a particularly common word, web is used so frequently on the Internet that it's
virtually worthless as a finding aid.

Stripping out the stop words, "searching the web" becomes "searching," which will
naturally lead to results describing everything from criminal manhunts to quests for
enlightenment—and if you're lucky, maybe even something about searching the web.
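
To see the effect for yourself, here is a small sketch of what stripping stop words does to a query. The stop-word list is an assumption made up for the example (real engines keep their own lists, and they differ), as is the `strip_stop_words` helper.

```python
from typing import List

# Assumed stop-word list for illustration; each engine has its own.
STOP_WORDS = {"the", "web", "a", "an", "and", "of", "to"}


def strip_stop_words(query: str) -> List[str]:
    """Return the query terms that survive after stop words are dropped."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


remaining = strip_stop_words("searching the web")
print(remaining)
```

With this list, only "searching" is left, which is exactly why the results drift so far from what you meant.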

How can you identify stop words? Google tells you when it's ignoring a stop word, at
the very top of a results page. You can force Google to include a stop word in a query
by putting a plus sign in front of it. AlltheWeb takes a different approach -- it often
automatically rewrites your query to include a stop word as part of a quoted phrase
with other query terms. Many of the 300 most common words in English are stop
words.

Bungling with Boolean

Boolean operators, like "and," "or," and "not," can help narrow search results—when
used properly. The problem is that Boolean operators, because of their apparent
simplicity, appear to be easy to use. Maybe, and/or not really.

According to Ran Hock, author of The Extreme Searcher's Guide to Web Search
Engines, search engines implement Boolean features in different ways. For example,
while some accept a simple "not," others require "and not" for the same effect.
Additionally, some engines require that Boolean operators be capitalized, while
others do not (or and do not?).

Being Ever So Vulgar

Vulgar comes from the Latin vulgus, meaning common. Like some educated
sophisticates, search engines have a problem with common words. It's not that
they're being snotty or pretentious. It's that some words are so common that they
appear in literally millions of documents, making them virtually useless as a finding
aid.

Take weather, for example. There are thousands of sites providing weather
information, from local forecasts to elaborate treatises on meteorology. Tighten your
query by using focusing words to narrow the scope of your search. Rather than
merely searching for "weather," construct a query like "Cicely Alaska annual
snowfall," or something equally specific.

Looking for a Rose, By Any Other Name

Be careful when a word has multiple meanings. Think of the word "bond" as an
example. If you enter just the single word "bond" as a query, the search engine has to
figure out if you're looking for information about financial bonds, chemical bonds, or
even James Bond.

Make it easier for the engine to help you. Ask yourself the question before the search
engine does for you, and phrase your query accordingly.

Search engines are also easily confused by heteronyms, words that are spelled
identically but have different meanings when pronounced differently. For example,
"lead," pronounced LEED, means to guide. Pronounced as LED, though, the word
refers to the metal element. When you can, use concrete synonyms instead of
heteronyms.

Committing Capital Offenses

Yet another problem for the searcher is whether to use capital letters in a query.
Some engines are case sensitive, while others are not. As a rule of thumb, it's a good
idea to always use lower case letters when you search. This will typically return
results that contain both upper and lower case letters.

If you use uppercase letters in a query to a case sensitive engine, results will only
include documents that also use upper case letters. This is usually a good thing for
proper nouns like names or places, which use initial upper case letters anyway. But it
might cause you to miss other documents where case-sensitivity is less important.
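
The rule of thumb above can be written down as a tiny matching function. This is a sketch of the behavior as described here for case-sensitive engines, not the code of any real engine; `case_match` is a hypothetical name.

```python
def case_match(query_term: str, document_word: str) -> bool:
    """Lower-case queries match any case; queries with capitals match exactly."""
    if query_term.islower():
        # All-lowercase query: ignore case in the document.
        return query_term == document_word.lower()
    # Query contains upper-case letters: require an exact match.
    return query_term == document_word


print(case_match("turkey", "Turkey"))   # lower-case query finds the capitalized word
print(case_match("Turkey", "turkey"))   # capitalized query skips the lower-case word
```

The first call matches and the second does not, which is why lower case is the safer default when you are unsure.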

Close, But No Cigar

Most search engines do a good job at matching simple phrases, like "Afghan
refugees," or "space shuttle missions." You run into problems, though, with a phrase
like this section's title. Searching for "close but no cigar" on one major engine (which
shall remain mercifully unnamed) provided this link as its number two pick: The
Common Cold: Relief But No Cure. Definitely no cigar!

The distance between one word and another in a document is referred to as
proximity. Some search engines will give a positive result if your query words appear
anywhere on a page, whether or not they are near each other, or are used together in
a phrase.

If you're searching for something where your keywords must be near each other to
get good results, your only option is to use AltaVista's advanced search and the
NEAR operator in your query. This finds documents containing both specified words
or phrases within 10 words of each other.
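
A proximity test of this kind is easy to picture in code. The sketch below checks whether two words fall within a given number of words of each other; the `near` function, its punctuation handling, and the sample text are my own assumptions, and the default window of 10 simply mirrors the figure quoted above.

```python
def near(text: str, first: str, second: str, window: int = 10) -> bool:
    """Return True if both words occur within `window` words of each other."""
    # Lower-case the text and trim simple punctuation from each word.
    words = [w.strip('.,;:!?"()') for w in text.lower().split()]
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= window
               for i in first_positions
               for j in second_positions)


print(near("Close, but no cigar.", "close", "cigar"))
```

A page where "close" and "cigar" sit twenty words apart would fail the same test, which is how NEAR weeds out pages that merely contain both words somewhere.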

And now for the number one most common searching mistake:

Searching for Hits in all the Wrong Places

If you're determined to find what you're looking for on the Web, be sure you're using
the right tools for the job. Search engines vary widely in scope, function, and quality.
You'll waste a lot of time if you don't choose the best search engine for each specific
searching task.

Should you use a crawler-based search engine, or a human compiled web directory?
How about a specialized search site, a database, or an invisible web resource? By
analyzing your needs and comparing them with the strengths and weaknesses of
each search service *before* you search, you'll likely get better results.

If you're relatively new to searching and get stuck, don't be hard on yourself. One of
the most ridiculous misconceptions I've ever heard is that "you can find anything on
the Internet." This is about as true as saying that there are diamonds in every coal
mine.

And though it may sound like heresy, sometimes your best bet for finding
information is to log off and take a trip to your local library. Libraries have tons of
resources that aren't available on the Web. And librarians are trained experts who are
usually more than willing to help you find what you're looking for. When you're getting
nowhere on the Web, take advantage of these (usually very nice) "human search
engines."

Begone, Mistakes!

As you gain experience searching the Web, avoiding these seven searching mistakes
will become second nature. Whenever you get weird or unexpected results, take a
close look at your query and try to figure out what happened. You'll likely discover yet
another mistake to avoid.

CHAPTER 19. Interesting… huh?

As random as they are relevant, enigmatic as they are enlightening, search engines
have earned a slightly sullied reputation as a necessary evil. But it is a one-sided
assessment. The search engines have not been able to explain themselves. Until
now.

Thanks to its sophisticated program, which answers questions with phrases or
sentences, Jeeves of AskJeeves.com granted SatireWire Editor Treat Warland the
opportunity to actually interview a search engine. There were many important
questions to ask. Unfortunately, he never got to most of them.

NOTE: These are real screen captures of actual responses. Advertisements
appearing with results have been edited out, and the query boxes have been
enlarged to allow readers to view entire questions. This does not in any way alter the
responses.

[Screen captures of the interview are not reproduced here.]

Copyright © 2000, SatireWire.

Search Engine Glossary

Boolean search: A search allowing the inclusion or exclusion of documents
containing certain words through the use of operators such as AND, NOT and OR.

Concept search: A search for documents related conceptually to a word, rather
than specifically containing the word itself.

Full-text index: An index containing every word of every document cataloged,
including stop words (defined below).

Fuzzy search: A search that will find matches even when words are only partially
spelled or misspelled.

Index: The searchable catalog of documents created by search engine software.
Also called "catalog." Index is often used as a synonym for search engine. Index is
commonly pluralized as "indices." However, Search Engine Watch instead uses the
alternative plural form "indexes."

Keyword search: A search for documents containing one or more words that are
specified by a user.

Phrase search: A search for documents containing an exact sentence or phrase
specified by a user.

Precision: The degree to which the documents a search engine lists actually match
the query. The higher the proportion of listed documents that match, the higher the
precision. For example, if a search engine lists 80 documents found to match a query
but only 20 of them contain the search words, then the precision would be 25%.

Proximity search: A search where users can specify that documents returned
should have the words near each other.

Query-By-Example: A search where a user instructs an engine to find more
documents that are similar to a particular document. Also called "find similar."

Recall: Related to precision, this is the degree to which a search engine returns all
the matching documents in a collection. There may be 100 matching documents, but
a search engine may only find 80 of them. It would then list these 80 and have a
recall of 80%.

Relevancy: How well a document provides the information a user is looking for, as
measured by the user.

Search Engine: The software that searches an index and returns matches. Search
engine is often used synonymously with spider and index, although these are
separate components that work with the engine.

Spider: The software that scans documents and adds them to an index by following
links. Spider is often used as a synonym for search engine.

Stemming: The ability for a search to include the "stem" of words. For example,
stemming allows a user to enter "swimming" and get back results also for the stem
word "swim."

Stop words: Conjunctions, prepositions and articles and other words such as AND,
TO and A that appear often in documents yet alone may contain little meaning.

Thesaurus: A list of synonyms a search engine can use to find matches for
particular words if the words themselves don't appear in documents.
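
The Precision and Recall entries above both come down to simple ratios. As a quick check of the figures quoted in those entries, the sketch below recomputes them; the function and parameter names are chosen just for this example.

```python
def precision(matching_listed: int, total_listed: int) -> float:
    """Share of the listed documents that really match the query."""
    return matching_listed / total_listed


def recall(matching_listed: int, total_matching: int) -> float:
    """Share of all matching documents in the collection that were listed."""
    return matching_listed / total_matching


p = precision(20, 80)    # 20 of the 80 listed documents contain the search words
r = recall(80, 100)      # 80 of the 100 matching documents were found
print(f"precision = {p:.0%}, recall = {r:.0%}")
```

This prints a precision of 25% and a recall of 80%, matching the two glossary examples.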
