KNOW THY SEARCH ENGINE!!
CONTENTS
Preface
PREFACE
As a researcher I found these three notes to myself quite helpful. The first is about
being clear on the objective, which is critically important from a researcher’s point
of view. As one asks oneself about the objective again and again, one keeps one’s
efforts going in the right direction.
The second note is about the persistence of a researcher. It’s the never-satisfied
thirst for knowledge that drives a researcher. And as the Google guys found true,
“There’s always more information out there”… a researcher must go on till he gets
what he’s looking for. And surprisingly, the best catch comes when you are worn out
and disappointed, and you give it that one last try of the day.
The third note is about a game I once played: nine points in three rows of three points
each. We were asked to connect them using four straight lines without lifting our
pens. And the only hint was “Think beyond the 9 points!!”
So often we go on rampaging within the boundaries set by our own minds. Thinking
from different points of view and stepping into different roles helped a lot in pushing
past my own limitations and hence in broadening the search.
Sometimes I wonder how we tend to complicate simple things and set our
imagination on the complexity of a fairly simple problem.
My words for search engines: think simple, specific and clear. You never really have
to make an effort to reach information from people who really want it to reach you.
Whenever I finish a piece of research I look back at my silly mistakes and say to
myself, “Pray you don’t waste time on them again!” And then there are these people who
always keep reminding me, “I can do it!” Each one of them has taught me, in their own
way, something new every time I met them.
And how can I forget you my Lord who’s always been there within and beside.
Lastly, a heartiest thanks to my company SCPL, my boss Saurabh and the Search
Engine community!
Any guesses for which search engines I used for my research? ;-)
- Amit Sharma
The reader of these notes is expected to have a basic knowledge of using the Internet. I
have assumed here that the reader has a fair knowledge of everyday computer use,
knows how to connect to the Internet and open a web browser, and knows where to key
in a website address.
Part I gives you the brief introduction to the search engines. In Part II we go deeper
into the working of a Search Engine. Part III deals purely with the techniques that
would help you use a search engine effectively.
This report begins with a brief introduction to search engines in Part I. We get to
meet the popular search engines after learning the difference between search and
meta search utilities. You will also get to know why Yahoo behaves differently
from AltaVista.
Those who would like to jump straight to the techniques for using a search engine
effectively can skip Part II and go directly to Part III.
In Part II we begin with how search engines came into the web world. Then we discuss
what goes on behind the scenes when you use a search engine, that is, how a search
engine works. We get to meet the ‘spiders’ of the web, which help the search engines
accumulate information. We talk about how the databases are indexed, and then
we discuss how the retrieved information is ranked. We also talk about the search
engines powered by the search technology of the popular Inktomi.
In Part III we focus on the use of search engines. To get our hands on what we are
looking for, we begin with the basic techniques for increasing or decreasing our hits.
We move on to sharpen our search skills by guessing uniform resource
locators (URLs). And then we head for big-time ‘power searching’. We learn about
stemming, clustering, Boolean searches and also a little bit of mathematics called
search engine mathematics.
Then we talk about the most common mistakes we all make while searching on the
net. We interview AskJeeves and look at the lighter side of search engines.
PART I
AN INTRODUCTION TO SEARCH ENGINES
We use ‘search engines’ to search for information across the Internet. With the Internet
being an ever-expanding ocean of data, their importance grows with every passing day. I
always used to imagine a search engine as a trained police dog (I wonder if the Dogpile
guys thought the same) that would go and fetch anything you sent it for.
The diversity of the information itself made it necessary to have a tool to cut down on
the time spent searching.
There are more than 500 search engines all over the world by now. As native search
engines pop up in their own regions, it’s interesting to witness the war for
superiority among the international SEs.
SEs are basically database programs. Their job is to obtain data from websites in
order to identify, organize and list websites of possible interest to people who are
seeking them.
An information seeker can visit a search engine and enter a word or phrase for the
search engine to seek out for them. The search engine will present the seeker with
the results of their search in a manner which the seeker can then investigate further.
Search engines are powerful and can draw up many more references than is usually
necessary. It is useful for the seeker to learn how to refine their search to shorten the
list they will ultimately be presented with. All search engines provide ample resources
to help a seeker refine their search.
How to get registered with search engines is now a subject in itself: Search Engine
Optimization. It’s beyond the scope of this research, though the information in this
report would definitely help.
Before we move further I would like to mention meta tags.
META tags are inserted by Web page designers and developers into Web pages so
that search engines can identify and categorize the Web page's content. META tags,
which are invisible to the reader, assist search engines and Web browsers.
After this brief introduction, let us take a closer look at the
search engines we have today. Before we get to know them, I would like to clear up
some common doubts that we usually have but do not bother much about.
First I would like to address the difference between a web search utility and a web
meta search utility.
The web search utilities (AltaVista, HotBot, etc.) index in their locally held databases
the text of pages whose creators have notified them that the pages exist, and given
them the URL. That's one of the main reasons why you often get different results from
the same search on AltaVista and Goto for instance. The size, and hence the
coverage in each index varies. In addition two search engines indexing the same
page may "weigh" the words in the page differently, leading to the same page
appearing higher or lower in their respective search results list. Many searchers often
execute the same search against several different indexes to get the results they
want.
The meta search utilities try to counteract that by sending the search you create to
several of the web indexes at once, thereby sparing you the effort of searching one
after the other manually. However, meta search engines don't offer universal
coverage.
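The fan-out idea behind a meta search utility can be sketched in a few lines. This is only an illustration under stated assumptions: engine_a and engine_b are hypothetical stand-ins that return fixed lists, not calls to any real search engine, and the merge rule (keep the first occurrence, drop duplicates) is just one simple choice.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for real engines: each maps a query to a ranked URL list.
def engine_a(query: str) -> List[str]:
    return ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

def engine_b(query: str) -> List[str]:
    return ["http://example.com/b", "http://example.com/d"]

def meta_search(query: str, engines: Dict[str, Callable[[str], List[str]]]) -> List[str]:
    """Send one query to every engine and merge the results without duplicates."""
    merged: List[str] = []
    seen = set()
    for search in engines.values():
        for url in search(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

results = meta_search("porsche road test", {"A": engine_a, "B": engine_b})
print(results)
```

The merged list still covers only what the underlying indexes hold, which is why a meta search cannot offer universal coverage either.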
Now, having understood the difference between search and meta search, let us also
address the common question, “How are AltaVista and Yahoo different from each
other?” I am sure whoever has used both of them will have noticed the
difference, even if they did not understand the underlying reason.
Let me put it this way: AltaVista is a web search engine, while Yahoo is a web
search directory.
They're actually different ways of approaching the same problem! When someone
tells Yahoo about their website, an actual human being looks at the site, decides
what it's about, and then places it within a topical hierarchy that Yahoo has
established. The Yahoo hierarchy is like a tree with many thousands of branches,
and the "editors" who place sites within the hierarchy do so very carefully. Because of
the hands on work involved in placing a site within the Yahoo (or LookSmart)
hierarchy, the total number of websites represented tends to be much smaller than
those of the automated web search engines.
The Yahoo searcher who is looking for information navigates the hierarchy until they
find a topic (a branch on the tree) that represents what they're looking for. Once that
topic is located, they'll click on it to review the list of sites placed under that topic. If
none of those sites are "just right" they may choose to find another topic that
resembles what they're looking for, and review the sites placed under it.
Here's an example. Let's say you'd like to buy a new car. You're interested in a
particular Porsche model, and would like to read a road test of that model. If you
search Northernlight for the word "Porsche", among its first few screens of results
will be pages about Porsche coverage, Special Collection Porsche links, Porsche
launches, etc. All are related to Porsche, but none is what you wanted.
While you could refine your search, you could also jump into Yahoo. Browse their
hierarchy looking for something like:
http://dir.yahoo.com/Recreation/Automotive/Makes_and_Models/Porsche/Magazines/
Underneath that heading will be the URLs for automobile publications that have web
sites. Somewhere amongst them, you may find a magazine that has recently road
tested the car that you're interested in.
Web search utilities and web directories are stylistically very different. Some people
enjoy browsing the Yahoo topic hierarchy, while others like to make the computer do
more of the work. In either case, doing a good job takes time. The Yahoo hierarchy is
huge, and it's easy to go down a few wrong paths before you find the topic you're
most interested in. However, it's also true that creating an effective search statement
that retrieves just what you're looking for takes time and effort too!
There are many contenders in the sprawling search engine community. Here is a
quick glance at the most popular ones.
Google (http://www.google.com)
Fast, relevant results are a hallmark of Google due to its extensive use of
popularity ranking of web sites. An added strength is Google's inclusion of
more file formats than other search engines index—formats such as PDF,
Microsoft Word, Excel, and PowerPoint.
Teoma (http://www.teoma.com/)
A medium-sized search engine that produces excellent retrieval. Besides
listing search results, a portion of the screen provides topical groupings
(folders) of results by keyword. On the right are experts' links leading to sites
that list pages on related general subjects.
Subject Directories
Yahoo (http://www.yahoo.com)
Yahoo is actually both a subject guide and a search engine. Searches include
not only Yahoo's web site subject directory listings that are selected and
indexed by people, but also a Google-powered database of web sites. Search
results are organized into helpful subject categories. Useful, friendly, a
favorite among web searchers.
Looksmart (http://www.looksmart.com)
Looksmart, Yahoo's main rival, also employs human editors to select and
categorize web sites. LookSmart has partnered with AltaVista to be the
extensive search engine that engages after a query of LookSmart's database.
About.com (http://home.about.com/)
Excellent source for web guides on popular topics.
Metacrawlers:
Search multiple search engines from the same web site
SurfWax (http://www.surfwax.com/)
Taps the major search engines including Google. Results are merged and
ranked by relevancy. Results with a magnifying glass icon beside them have
quick summaries (SiteSnaps) that may be viewed before deciding to summon
the page. Options for sorting, number of results displayed are available. Nice
customization features. An excellent metasearch engine.
ixquick (http://www.ixquick.com/)
Interesting is ixquick's use of a star rating system. One star is given for each
search engine that placed a site in its top ten. The theory is that a site
appearing in multiple top ten lists is likely to be relevant. Like Dogpile, ixquick
tries to translate a search query into the syntax of each search engine. Search
options include web, news, MP3, and pictures.
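The star idea described above is easy to sketch. This is my own illustration of the counting rule only, not ixquick's actual code; the engine names and URLs are made up.

```python
from collections import Counter

def star_ratings(top_tens):
    """Give each URL one star per engine that lists it in its top ten."""
    stars = Counter()
    for urls in top_tens.values():
        for url in set(urls[:10]):  # count each engine at most once per URL
            stars[url] += 1
    # Most stars first; ties broken alphabetically for a stable order.
    return sorted(stars.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical top-ten lists (shortened) from three engines.
top_tens = {
    "EngineA": ["x.example", "y.example"],
    "EngineB": ["y.example", "z.example"],
    "EngineC": ["y.example", "x.example"],
}
ranking = star_ratings(top_tens)
print(ranking)
```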
Vivisimo (http://vivisimo.com/)
Performs document clustering (based on titles, URLs, and short descriptions)
so that users may browse the results by hierarchical categories. Very
interesting and effective approach.
Dogpile (http://www.dogpile.com)
Once a favorite among searchers, Dogpile now uses paid listings. Keep this in
mind when evaluating search results. Results are listed by search engine.
The following table gives you a good view of search engines under different
categories:
Directories…                    Computing…
www.about.com                   www.allwhois.com
www.galaxy.com                  www.hostindex.com
www.goguides.org                www.tucows.com
www.looksmart.com
www.dmoz.org (OpenDirectory)
www.yahoo.com
www.zeal.com
PART II
HOW SEARCH ENGINES WORK
The Internet has evolved not by becoming a graphically rich multi-media work but by
the evolution of the tools which made it possible to find and access this richness.
One of the earliest search engines like those today, Lycos, began in the spring of
1994 when John Leavitt’s spider was linked to an indexing program by Michael
Mauldin. Yahoo!, a catalog, became available the same year.
Today these are better described as “web location services.” A search engine in the
proper sense is a database together with the tools to generate and search that database,
while a catalog is an organizational method and a related database, plus the tools for
generating it.
Yahoo! emphasizes cataloging, while others such as Alta Vista or Excite emphasize
providing the largest search database. Some web location services do not own any of
their search engine technology because other services are their main thrust; companies
such as Inktomi (named after a Native American word for spider) provide the search
technology.
These web location services have put amazing power into every user’s hands,
making life much better for all of us, and free of cost at that.
You might be wondering, and maybe it’s a rumour, maybe not, whether these information
companies might increase their revenues by selling information: information about
you. After you use a search engine and find a page with mutual fund quotes, you
might find yourself receiving e-mail advertising investments. Think this is a
coincidence? Think again. The investment company could have paid a search engine
for your e-mail address. There is an existing protocol for servers to ask a user’s
browser for such information, routinely entered during set-up.
There are basically three elements to search engines that might be important.
A search engine finds information for its database by accepting listings sent in by
authors wanting exposure, or by getting the information from their "Web crawlers,"
"spiders," or "robots," programs that roam the Internet storing links to and information
about each page they visit.
Web crawler programs are a subset of "software agents," programs with an unusual
degree of autonomy which perform tasks for the user. How do these really work? Do
they go across the net by IP number one by one? Do they store all or most of
everything on the Web?
These agents normally start with a historical list of links, such as server lists, and lists
of the most popular or best sites, and follow the links on these pages to find more
links to add to the database. This makes most engines, without a doubt, biased
toward more popular sites. A Web crawler could send back just the title and URL of
each page it visits, or just parse some HTML tags, or it could send back the entire
text of each page.
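The link-following behaviour described above can be sketched as a breadth-first walk. This is a toy under stated assumptions: TOY_WEB is a made-up in-memory "web" so that the sketch runs without a network, and a real spider would fetch and parse pages instead of looking them up in a dictionary.

```python
from collections import deque

# Hypothetical miniature web: each page maps to the links found on it.
TOY_WEB = {
    "seed.example/top-sites": ["a.example", "b.example"],
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example"],
    "c.example": ["d.example"],
    "d.example": [],
    "island.example": ["a.example"],  # no page links here
}

def crawl(seeds, web, max_pages=100):
    """Start from a seed list and follow links, visiting each page once."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

found = crawl(["seed.example/top-sites"], TOY_WEB)
print(found)
```

Note that island.example is never reached because nothing links to it, which is one way to see why crawling from lists of popular sites biases an engine toward well-linked pages.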
Alta Vista is clearly hell-bent on indexing anything and everything, with over 30
million pages indexed (7/96). Excite actually claims more pages. OpenText, on the
other hand, indexes the full text of less than a million pages (5/96), but stores many
more URLs. Inktomi has implemented HotBot as a distributed computing solution,
which they claim can grow with the Web and index it in its entirety no matter how many
users or how many pages are on the Web. Normally, "good" robots can be excluded
by a bit of exclusion standard code on your site.
It seems unfair, but developers aren't rewarded much by location services for sending
in the URLs of their pages for indexing. The typical time from sending your URL in to
getting it into the database seems to be 6-8 weeks. Most search engines check their
databases to see if URLs still exist and to see if they are recently updated.
"Spiders" take a Web page's content and create key search words that enable online
users to find pages they're looking for.
Google began as an academic search engine. In the paper that describes how the
system was built, Sergey Brin and Lawrence Page give an example of how quickly
their spiders can work. They built their initial system to use multiple spiders, usually
three at one time. Each spider could keep about 300 connections to Web pages open
at a time. At its peak performance, using four spiders, their system could crawl over
100 pages per second, generating around 600 kilobytes of data each second.
When the Google spider looked at an HTML page, it took note of two things: the words
within the page and where those words were found.
Words occurring in the title, subtitles, meta tags and other positions of relative
importance were noted for special consideration during a subsequent user search.
The Google spider was built to index every significant word on a page, leaving out the
articles "a," "an" and "the." Other spiders take different approaches.
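A minimal sketch of this kind of word collection, using Python's standard html.parser module. It is my own illustration, not Google's code: it records each word together with where it was found (title or body) and skips the three articles. The sample page is made up.

```python
import re
from html.parser import HTMLParser

ARTICLES = {"a", "an", "the"}

class WordCollector(HTMLParser):
    """Collect (word, location) pairs, where location is 'title' or 'body'."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        location = "title" if self.in_title else "body"
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            if word not in ARTICLES:
                self.words.append((word, location))

page = ("<html><head><title>The Dog Guide</title></head>"
        "<body><p>A guide to feeding the dog.</p></body></html>")
collector = WordCollector()
collector.feed(page)
collector.close()
print(collector.words)
```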
These different approaches usually attempt to make the spider operate faster, allow
users to search more efficiently, or both. For example, some spiders will keep track of
the words in the title, sub-headings and links, along with the 100 most frequently
used words on the page and each word in the first 20 lines of text. Lycos is said to
use this approach to spidering the Web.
Other systems, such as AltaVista, go in the other direction, indexing every single
word on a page, including "a," "an," "the" and other "insignificant" words. The push to
completeness in this approach is matched by other systems in the attention given to
the unseen portion of the Web page, the meta tags.
Meta tags allow the owner of a page to specify key words and concepts under which
the page will be indexed. This can be helpful, especially in cases in which the words
on the page might have double or triple meanings -- the meta tags can guide the
search engine in choosing which of the several possible meanings for these words is
correct. There is, however, a danger in over-reliance on meta tags, because a
careless or unscrupulous page owner might add meta tags that fit very popular topics
but have nothing to do with the actual contents of the page. To protect against this,
spiders will correlate Meta tags with page content, rejecting the meta tags that don't
match the words on the page.
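The correlation check can be sketched very simply. This is an assumption-laden toy: it treats every meta keyword as a single word and keeps it only if that word also appears in the visible text. Real engines presumably use looser matching.

```python
import re

def trusted_meta_keywords(meta_keywords, page_text):
    """Keep only the meta keywords that also occur in the page's visible text."""
    page_words = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    return [kw for kw in meta_keywords if kw.lower() in page_words]

kept = trusted_meta_keywords(
    ["porsche", "lottery", "roadtest"],
    "Our Porsche roadtest covers handling and braking.",
)
print(kept)
```

Here "lottery" is dropped because it fits a popular topic but has nothing to do with the page.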
All of this assumes that the owner of a page actually wants it to be included in the
results of a search engine's activities. Many times, the page's owner doesn't want it
showing up on a major search engine, or doesn't want the activity of a spider
accessing the page. Consider, for example, a game that builds new, active pages
each time sections of the page are displayed or new links are followed. If a Web
spider accesses one of these pages, and begins following all of the links for new
pages, the game could mistake the activity for a high-speed human player and spin
out of control. To avoid situations like this, the robot exclusion protocol was
developed. This protocol, implemented as a robots.txt file at the root of a site or as
a robots meta tag in the meta-tag section at the beginning of a Web page, tells a spider
to leave the page alone -- to neither index the words on the page nor try to follow
its links.
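Here is a small sketch of how a well-behaved spider might honour such exclusions, using the robots.txt parser that ships with Python. The rules, the spider name "ExampleSpider" and the URLs are made up for illustration. The per-page variant is a tag such as <meta name="robots" content="noindex, nofollow"> in the page head.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: keep all spiders out of the /game/ pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /game/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_home = parser.can_fetch("ExampleSpider", "http://www.example.com/index.html")
allowed_game = parser.can_fetch("ExampleSpider", "http://www.example.com/game/level1.html")
print(allowed_home, allowed_game)
```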
Once the spiders have completed the task of finding information on Web pages (and
we should note that this is a task that is never actually completed -- the constantly
changing nature of the Web means that the spiders are always crawling), the search
engine must store the information in a way that makes it useful.
There are two key components involved in making the gathered data accessible to
users: the information stored with the data, and the method by which the information
is indexed.
In the simplest case, a search engine could just store the word and the URL where it
was found. In reality, this would make for an engine of limited use, since there would
be no way of telling whether the word was used in an important or a trivial way on the
page, whether the word was used once or many times or whether the page contained
links to other pages containing the word. In other words, there would be no way of
building the ranking list that tries to present the most useful pages at the top of the list
of search results.
To make for more useful results, most search engines store more than just the word
and URL. An engine might store the number of times that the word appears on a
page. The engine might assign a weight to each entry, with increasing values
assigned to words as they appear near the top of the document, in sub-headings, in
links, in the meta tags or in the title of the page. Each commercial search engine has
a different formula for assigning weight to the words in its index. This is one of the
reasons that a search for the same word on different search engines will produce
different lists, with the pages presented in different orders.
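A weighted index of this kind might look like the sketch below. The weights (5 for a title word, 1 for a body word) are numbers I picked for illustration; as noted above, every commercial engine has its own formula.

```python
from collections import defaultdict

# Hypothetical weights: a word in the title counts more than one in the body.
LOCATION_WEIGHT = {"title": 5, "body": 1}

def build_index(pages):
    """pages: {url: {"title": str, "body": str}} -> {word: {url: weight}}"""
    index = defaultdict(lambda: defaultdict(int))
    for url, fields in pages.items():
        for location, text in fields.items():
            for word in text.lower().split():
                index[word][url] += LOCATION_WEIGHT[location]
    return index

pages = {
    "a.example": {"title": "dog nutrition", "body": "feeding your dog well"},
    "b.example": {"title": "cat care", "body": "a dog is not a cat"},
}
index = build_index(pages)
ranked = sorted(index["dog"].items(), key=lambda item: -item[1])
print(ranked)
```

Changing the numbers in LOCATION_WEIGHT changes the order of the results, which is the effect the paragraph above describes.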
In English, there are some letters that begin many words, while others begin fewer.
You'll find, for example, that the "M" section of the dictionary is much thicker than the
"X" section. This inequity means that finding a word beginning with a very "popular"
letter could take much longer than finding a word that begins with a less popular one.
Hashing evens out the difference, and reduces the average time it takes to find an
entry. It also separates the index from the actual entry. The hash table contains the
hashed number along with a pointer to the actual data, which can be sorted in
whichever way allows it to be stored most efficiently. The combination of efficient
indexing and effective storage makes it possible to get results quickly, even when the
user creates a complicated search.
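A hash-based index can be sketched as follows. I am assuming CRC32 (from Python's zlib module) as the hashing formula and a table of 8 buckets purely for illustration. The point is only that the bucket depends on the whole word rather than its first letter, and that each bucket holds a pointer to the actual entry.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical table size

def bucket_for(word):
    """Turn a word into a bucket number using the whole word, not its first letter."""
    return zlib.crc32(word.encode("utf-8")) % NUM_BUCKETS

class HashIndex:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def add(self, word, entry):
        # Store the word next to a pointer to the actual data (here, a URL).
        self.buckets[bucket_for(word)].append((word, entry))

    def lookup(self, word):
        # Only one bucket has to be scanned, whatever letter the word starts with.
        return [entry for w, entry in self.buckets[bucket_for(word)] if w == word]

index = HashIndex()
index.add("mango", "a.example")
index.add("xylophone", "b.example")
index.add("mango", "c.example")
print(index.lookup("mango"))
```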
Hash This!
The key in public-key encryption is based on a hash value. This is a value that is
computed from a base input number using a hashing algorithm. Essentially, the hash
value is a summary of the original value. The important thing about a hash value is
that it is nearly impossible to derive the original input number without knowing the
data used to create the hash value. Here's a simple example:
Input number Hashing algorithm Hash value
10,667 Input # x 143 1,525,381
You can see how hard it would be to determine that the value 1,525,381 came from
the multiplication of 10,667 and 143. But if you knew that the multiplier was 143, then
it would be very easy to calculate the value 10,667. Public-key encryption is actually
much more complex than this example, but that is the basic idea.
Public keys generally use complex algorithms and very large hash values for
encrypting, including 40-bit or even 128-bit numbers. A 128-bit number has a
possible 2^128, or
340,282,366,920,938,463,463,374,607,431,768,211,456,
different combinations! This would be like trying to find one particular grain of sand in
the Sahara Desert.
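The arithmetic in this sidebar can be checked in a few lines. The function names are my own, and the multiplication "hash" is only the toy from the table above: unlike a real hashing algorithm, it can be undone by anyone who knows the multiplier.

```python
MULTIPLIER = 143  # the multiplier from the example above

def toy_hash(number):
    return number * MULTIPLIER

def toy_unhash(hash_value):
    # Knowing the multiplier makes recovering the input trivial.
    return hash_value // MULTIPLIER

hash_value = toy_hash(10_667)
recovered = toy_unhash(hash_value)
combinations = 2 ** 128  # number of distinct 128-bit values

print(hash_value, recovered)
print(f"{combinations:,}")
```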
Building a Search
Searching through an index involves a user building a query and submitting it through
the search engine. The query can be quite simple, a single word at minimum.
Building a more complex query requires the use of Boolean operators that allow you
to refine and extend the terms of the search.
• AND - All the terms joined by "AND" must appear in the pages or documents.
Some search engines substitute the operator "+" for the word AND.
• OR - At least one of the terms joined by "OR" must appear in the pages or
documents.
• NOT - The term or terms following "NOT" must not appear in the pages or
documents. Some search engines substitute the operator "-" for the word
NOT.
• FOLLOWED BY - One of the terms must be directly followed by the other.
• NEAR - One of the terms must be within a specified number of words of the
other.
• Quotation Marks - The words between the quotation marks are treated as a
phrase, and that phrase must be found within the document or file.
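The AND, OR and NOT operators map neatly onto set operations over an index. The sketch below uses three made-up documents and is meant only to show the logic, not how any particular engine implements it.

```python
# Three hypothetical documents, keyed by an ID.
docs = {
    1: "dog nutrition guide",
    2: "dog training tips",
    3: "cat nutrition guide",
}

# Build a simple index: word -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def term(word):
    return index.get(word, set())

and_hits = term("dog") & term("nutrition")   # dog AND nutrition
or_hits = term("dog") | term("cat")          # dog OR cat
not_hits = term("dog") - term("training")    # dog NOT training

print(and_hits, or_hits, not_hits)
```

AND can only shrink the result, OR can only grow it, and NOT removes documents from what is already matched.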
The searches defined by Boolean operators are literal searches -- the engine looks
for the words or phrases exactly as they are entered. This can be a problem when the
entered words have multiple meanings. "Bed," for example, can be a place to sleep, a
place where flowers are planted, the storage space of a truck or a place where fish
lay their eggs. If you're interested in only one of these meanings, you might not want
to see pages featuring all of the others. You can build a literal search that tries to
eliminate unwanted meanings, but it's nice if the search engine itself can help out.
One of the areas of search engine research is concept-based searching. Some of this
research involves using statistical analysis on pages containing the words or phrases
you search for, in order to find other pages you might be interested in. Obviously, the
information stored about each page is greater for a concept-based search engine,
and far more processing is required for each search. Still, many groups are working
to improve both results and performance of this type of search engine. Others have
moved on to another area of research, called natural-language queries.
The idea behind natural-language queries is that you can type a question in the same
way you would ask it to a human sitting beside you -- no need to keep track of
Boolean operators or complex query structures. The most popular natural language
query site today is AskJeeves.com, which parses the query for keywords that it then
applies to the index of sites it has built. It only works with simple queries; but
competition is heavy to develop a natural-language query engine that can accept a
query of great complexity.
User Search
What can the user do besides typing a few relevant words into the search form? Can
they specify that words must be in the title of a page? What about specifying that
words must be in an URL, or perhaps in a special HTML tag? Can they use all logical
operators between words like AND, OR, and NOT?
Most engines allow you to type in a few words, and then search for occurrences of
these words in their database. Each one has its own way of deciding what to do
about approximate spellings, plural variations, and truncation. If you just type words
into the "basic search" interface you get from the search engine's main page, you
also can get different logical expressions binding the different words together. Excite!
actually uses a kind of "fuzzy" logic, searching for the AND of multiple words as well
as the OR of the words. Most engines have separate advanced search forms where
you can be more specific, and form complex Boolean searches (every one mentioned
in this article except Hotbot). Some search tools parse HTML tags, allowing you to
look for things specifically as links, or as a title or URL without consideration of the
text on the page.
By searching only in titles, one can eliminate pages with only brief mentions of a
concept, and only retrieve pages that really focus on your concept.
By searching links, one can determine how many and which pages point at your site.
Understanding what each engine does with non-standard pluralization, truncation,
etc. can be quite important to how successful your searches will be. For example, if
you search for "bikes" you won't get "bicycle," "bicycles," or "bike." In this case, I
would use a search engine that allowed "truncation," that is, one that allowed the
search word "bike" to match "bikes" as well, and I would search for "bicycle OR bike
OR cycle" ("bicycle* OR bike* OR cycle*" in Alta Vista).
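Truncation can be sketched as simple prefix matching against the words an engine has indexed. The vocabulary below is made up, and the trailing * follows the Alta Vista style quoted above; other engines may use different wildcard rules.

```python
def expand(pattern, vocabulary):
    """Return the indexed words a search term matches; a trailing * means 'starts with'."""
    if pattern.endswith("*"):
        stem = pattern[:-1]
        return sorted(word for word in vocabulary if word.startswith(stem))
    return [pattern] if pattern in vocabulary else []

# Hypothetical set of words found in an index.
vocabulary = {"bike", "bikes", "biker", "bicycle", "bicycles", "cycle", "cycling"}

exact = expand("bikes", vocabulary)
truncated = expand("bike*", vocabulary)
print(exact, truncated)
```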
With databases that can keep the entire Web at the fingertips of the search engines,
there will always be relevant pages, but how do you get rid of the less relevant and
emphasize the more relevant?
Most engines find more sites from a typical search query than you could ever wade
through. Search engines give each document they find some measure of the quality
of the match to your search query, a relevance score. Relevance scores reflect the
number of times a search term appears, if it appears in the title, if it appears at the
beginning of the document, and if all the search terms are near each other; some
details are given in engine help pages. Some engines allow the user to control the
relevance score by giving different weights to each search word. One thing that all
engines do, however, is to use alphabetical order at some point in their display
algorithm. If relevance scores are not very different for various matches, then you end
up with this sorry default. For most uses, a good summary is more useful than a
ranking. The summary is usually composed of the title of a document and some text
from the beginning of the document, but can include an author-specified summary
given in a meta-tag. Scanning summaries really saves you time if your search returns
more than a few items.
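A relevance score built from the four signals mentioned above (frequency, title, position near the beginning, and nearness of the terms) might be sketched like this. All the point values and window sizes are assumptions of mine; real engines keep their formulas to themselves.

```python
# Hypothetical point values for each signal.
FREQUENCY_POINTS = 1   # per occurrence in the body
TITLE_POINTS = 5       # term appears in the title
EARLY_POINTS = 2       # term appears among the first EARLY_WINDOW words
EARLY_WINDOW = 20
NEAR_POINTS = 3        # all terms first appear within NEAR_WINDOW words of each other
NEAR_WINDOW = 3

def relevance(query_terms, title, body):
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0
    first_positions = []
    for term in query_terms:
        term = term.lower()
        count = body_words.count(term)
        score += count * FREQUENCY_POINTS
        if term in title_words:
            score += TITLE_POINTS
        if term in body_words[:EARLY_WINDOW]:
            score += EARLY_POINTS
        if count:
            first_positions.append(body_words.index(term))
    all_found = len(first_positions) == len(query_terms)
    if all_found and len(first_positions) > 1:
        if max(first_positions) - min(first_positions) <= NEAR_WINDOW:
            score += NEAR_POINTS
    return score

query = ["dog", "nutrition"]
focused = relevance(query, "Dog nutrition guide", "dog nutrition basics for every owner")
passing = relevance(query, "Pet news", "my dog ran away and later I read about nutrition")
print(focused, passing)
```

The page that is really about the topic scores higher than the one that merely mentions both words.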
• Search engines use software called spiders, which comb the Internet looking
for documents and their Web addresses.
• The documents and Web addresses are collected and sent to the search
engine’s indexing software.
• The search engine assembles a web page that lists the results as hypertext
links.
PART III
BETTER TECHNIQUES TO USE SEARCH ENGINES
One of the most common maladies afflicting people searching for information on the
Internet is that they get too many "hits" from their searches. Nobody is going to try
thousands of links looking for the information that they want. The trick is to refine your
search so that you get fewer "hits" that are of greater quality. Making better use of the
AND operator and phrase searching can help you quite a bit, so they will be the focus
of our discussion here.
Take a look at your work area and count the number of pens. Now, take another look
and count the number of pens that are red. Unless all you use are red pens, the
second number is likely to be lower than the first. Like the brief exercise above, the
purpose of the AND operator in searching is to add additional conditions that an
Internet document must match to become a "hit". By searching for multiple words,
linked by AND operators, you can dramatically reduce the number of documents that
match your search criteria.
For example, say you were searching Google for information about proper nutrition
for your dog. Searching for the word "dog" (Google syntax: dog) returned 13,100,000
matching documents. Searching for documents containing both the word "dog" AND
the word "nutrition" (Google syntax: +dog +nutrition) shrank the number of hits
to 285,000. Adding the word "guide" to the previous search (Google syntax: +dog
+nutrition +guide) reduced the number of hits to 77,100. Finally, hoping to find
information sourced from veterinary schools, the domain suffix ".edu" was added to
the search (InfoSeek syntax: +dog +nutrition +guide +.edu). This reduced the number
of matching documents to a quite manageable 9,260.
Like the AND operator, phrase searching requires that all the terms in the phrase be
present in the document for it to be a hit. However, when using the AND operator, the
terms don't have to be anywhere near each other, they just both have to be present.
In phrase searching, the terms have to be adjacent to each other, just as they are in
the phrase. For example, searching Google for documents containing the words time
AND management AND seminar (Google syntax: +time +management +seminar)
yielded 1,060,000 hits. However searching for the phrase "time management
seminar" (Google syntax: "time management seminar") returned 1,140 matching
documents. While this is a substantial reduction in the number of hits received,
consider that it may be too drastic. Searching for the phrase "time management
seminar" skips documents containing the phrases "time management class" or
"seminar on time management", since neither phrase exactly matches your search
phrase.
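The difference between the two kinds of matching can be sketched in a few lines of Python. This is only an illustration on three made-up documents, not how any search engine is actually built; the function names matches_all and matches_phrase are invented for this example.

```python
def tokens(text):
    """Lowercase a document and split it into words."""
    return text.lower().split()

def matches_all(doc, terms):
    """AND search: every term must appear somewhere in the document."""
    words = set(tokens(doc))
    return all(t.lower() in words for t in terms)

def matches_phrase(doc, phrase):
    """Phrase search: the words must appear adjacent and in the same order."""
    words = tokens(doc)
    target = tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

# Made-up documents for illustration only.
docs = [
    "join our time management seminar this spring",
    "a seminar on time management for students",
    "time management class schedule",
]

and_hits = [d for d in docs if matches_all(d, ["time", "management", "seminar"])]
phrase_hits = [d for d in docs if matches_phrase(d, "time management seminar")]
print(len(and_hits), len(phrase_hits))
```

The AND search keeps the "seminar on time management" document, while the phrase search drops it, which is exactly the trade-off described above.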
In Conclusion
The AND operator and phrase searching can both play important roles in making your
searches more effective. You may want to experiment with using
them together where appropriate ("New York" AND tour). If one phrase or word
combination doesn't give you the results that you desire, try using slightly different
words that convey the same meaning. It's all out there; it's just up to you to find it.
Getting too many matches from a search is a common problem.
Although it sounds odd, searchers also often miss important information because they get
too few matches from their searches. This can be true even if the total hits number in
the thousands. Using the OR operator can help you find all of the documents you're
looking for.
One of the best ways that the OR operator can help you is when you are searching
for something that is commonly referred to by more than one name. For instance, do
you refer to small, portable computers as "laptops" or "notebooks"? Within the
computer industry, those two terms are used interchangeably. If you're looking for
information on this type of computer, you should search on both terms. This is good
advice even if you're searching the Yahoo category hierarchy. For example, a search
for the word laptop in Yahoo yielded 435 hits (Google: 926,000). Searching for the
word notebook in Yahoo yielded 547 hits (Google: 1,990,000). Searching for laptop OR
notebook yielded 173 hits (Google: 113,000). It's important to note that 435+547
equals 982, not 173. The 809-hit overlap represents those matches that contain both
search terms.
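In general, the relationship between the two individual counts, the combined OR count and the overlap can be sketched with Python sets. The page numbers below are made up for illustration and are not taken from any search engine.

```python
# Hypothetical page IDs returned for each single-word search.
laptop_pages = {1, 2, 3, 4, 5}
notebook_pages = {4, 5, 6, 7}

either = laptop_pages | notebook_pages  # pages matching laptop OR notebook
both = laptop_pages & notebook_pages    # the overlap: pages containing both words

# The OR count is the sum of the two counts minus the overlap.
combined = len(laptop_pages) + len(notebook_pages) - len(both)
print(len(either), len(both), combined)
```

Pages in the overlap are counted once in an OR search, which is why simply adding the two single-word counts overstates the total.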
Earlier, we discussed the importance and value of phrase searching. It's also important
to remember the number of ways that a concept can be phrased. For instance, if you
search Northern Light for "ocean liner", you'll get 6,882 matches. Substituting the
phrase "cruise ship" will return 51,417 hits. To nautical professionals, "cruise ship"
and "ocean liner" may not be the same thing, but there's no arguing that the terms are
often used interchangeably. You can cover a lot of ground by searching for two or
more phrases at once, separated by the OR operator ("ocean liner" OR "cruise ship" gave
56,866 hits).
None of the search engines out there today are perfect, but using the
right one at the right time can make all the difference.
From PC Magazine:

                Relevant hits   Customizes    Eliminates   Eliminates   Effective
                from initial    query         duplicate    dead links   anticipatory
Search engine   inquiry         effectively   links                     results
About.com       Good            Good          Fair         Fair         Fair
AltaVista       Fair            Good          Fair         Good         Fair
Ask Jeeves      Fair            Fair          Good         Good         Good
Excite          Good            Good          Good         Fair         Good
FAST Search     Good            Good          Fair         Fair         Fair
GO Network      Good            Good          Fair         Fair         Good
Google          Excellent       Fair          Good         Fair         Fair
GoTo.com        Fair            Fair          Good         Good         Good
HotBot          Excellent       Good          Good         Good         Good
LookSmart       Good            Fair          Fair         Fair         Fair
Lycos           Good            Good          Fair         Fair         Fair
Northern Light  Good            Excellent     Good         Fair         Excellent
WebCrawler      Fair            Fair          Good         Good         Fair
Yahoo!          Excellent       Good          Good         Good         Good

To test Web search sites, they devised more than 50 queries spanning a variety of
subjects and of varying style, from single words to natural-language questions.
Initially they tried all of their queries using the default settings for each site. Then they
examined the first ten hits, evaluating each page's relevance and recording the
number of duplicate and dead links.
Their ratings reflect both the number of hits that technically matched the query and
the quality of the information returned. They also tested the tools that each site
offered to customize a query and improve the overall quality of their return set.
They rechecked any dead links (defined as genuine HTML 404 errors and not
temporary glitches that prevented access to the pages) over the course of the month
of their testing before delivering their final rating in that category.
Many search engines today try to anticipate the exact nature of your query, returning
stock data when you search for a company, for example, in addition to the sites that
its search engine itself has catalogued. They evaluated the usefulness of these
efforts to determine how well you can rely on these to improve your search
experience.
Think of the organization most likely to provide an answer to your question. Then try
to go directly to their Web site.
The portals, directories, and search engines have different ways of combining words
when searching. If you enter chocolate strawberries, a service might find hits with either the
word chocolate or the word strawberries. It might find pages with both the word chocolate
and the word strawberries. Or it may retrieve records with the phrase chocolate strawberries.
Which do you want and how do you know?
Use phrase searching whenever possible. Almost all the portals and search engines
can do phrase searching -- searching for the words entered adjacent to each other
and exactly in the order submitted. Most use double quotes to identify a phrase:
"this is a phrase"
Other techniques used by many portals and search engines involve using Boolean
operators (AND, OR, NOT) or the plus and minus symbols. Most search engines now
default to an AND operation. Try using a plus (+) directly in front of required words
and a minus (-) directly in front of words or phrases to exclude from search results.
All that most people need to know is a little basic "search engine math" in order to
improve their results. You can easily learn how to add, subtract and multiply your
way into better searches at your favorite search engine. The information below works
for nearly all of the major search engines.
Be Specific
Before learning the math, it helps to remember that the more specific your search is, the
more likely you will find what you want. Don't be afraid to tell a search engine exactly
what you are looking for.
For example, if you want information about Windows 98 bugs, search for "Windows
98 bugs," not "Windows." Or even better, search for exactly what the problem is: "I
can't install a USB device in Windows 98," for example. You'll be surprised at how
often this works.
Sometimes, you want to make sure that a search engine finds pages that have all the
words you enter, not just some of them. The + symbol lets you do this.
For example, imagine you want to find pages that have references to both President
Clinton and Kenneth Starr on the same page. You could search this way:
+clinton +starr
Only pages that contain both words would appear in your results. Here are some
other examples:
+windows +98 +bugs
That would find pages that have all three of the words on them, helpful if you wanted
to narrow down a search to Windows 98 bugs, rather than on Windows 98 in general.
+star +trek +insurrection
That would get you pages about Star Trek that also specifically mention
"Insurrection," the title of a Star Trek film.
The + symbol is especially helpful when you do a search and then find yourself
overwhelmed with information. Imagine that you wanted to reserve a camping space
in California's Yosemite National Park. You might start out simply searching like this:
yosemite
If so, chances are you'll get too many off-target results. Instead, try
searching for all the words you know must appear on the type of page you're looking
for:
Sometimes, you want a search engine to find pages that have one word on them but
not another word. The - symbol lets you do this.
For example, imagine you want information about President Clinton but don't want to
be overwhelmed by pages relating to the Monica Lewinsky scandal. You could
search this way:
clinton -lewinsky
That tells the search engine to find pages that mention "clinton" and then to remove
any of them that also mention "lewinsky."
Similarly, perhaps you are looking for information specifically about Windows 95 but
keep getting pages about Windows 98 or Windows 3.1. You could eliminate them
with a search like this:
Perhaps you are a fan of the original Star Trek series but instead keep finding pages
about Voyager, Deep Space Nine or Star Trek: The Next Generation. Try a search
like this:
In general, the - symbol is helpful for focusing results when you get too many that are
unrelated to your topic. Simply begin subtracting terms you know are not of interest,
and you should get better results.
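The add and subtract rules above can be sketched as a small filter in Python. This is a minimal illustration, not how any particular search engine is implemented. The function names parse_query and page_matches are made up for this example, and treating unmarked words as "at least one must appear" is an assumption, since engines differ on that point.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-) and unmarked terms."""
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.append(term[1:])
        elif term.startswith("-") and len(term) > 1:
            excluded.append(term[1:])
        else:
            optional.append(term)
    return required, excluded, optional

def page_matches(page_text, query):
    """Apply the + and - rules to the words found on one page."""
    words = set(page_text.lower().split())
    required, excluded, optional = parse_query(query)
    if any(t in words for t in excluded):
        return False  # a subtracted term is present
    if not all(t in words for t in required):
        return False  # an added term is missing
    # Assumption: with unmarked terms, at least one of them must appear.
    return not optional or any(t in words for t in optional)

print(page_matches("clinton visits ireland", "clinton -lewinsky"))
print(page_matches("clinton lewinsky scandal", "clinton -lewinsky"))
```

The first page passes and the second is removed, because it mentions the subtracted term.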
Now that you know how to add and subtract terms, we can move on to multiplication.
As in normal math, multiplying terms through a "phrase search" can be a much better
way to get the answers you are looking for.
For example, remember above when we wanted pages about reserving a campsite in
Yosemite? We entered all the terms like this:
That brings back pages that have all those words on them, but there's no guarantee
that the words will be near each other. You could get a page that
mentions Yosemite in the opening paragraph but then later talks about getting
camping reservations in the Grand Canyon. All the words you added together would
appear on this page, but it still might not be what you are looking for.
Doing a phrase search avoids this problem. This is where you tell a search engine to
give you pages where the terms appear in exactly the order you specify. You do this
by putting quotation marks around the phrase, like this:
Now, only pages that have all the words and in the exact order shown above will be
listed. The answers should be much more on target than with simple addition.
Likewise, remember this addition example?
+windows +98 +bugs
As you can imagine, multiplying the terms together within a phrase search would
work better, because that exact phrase probably appears on good pages dealing with
Windows 98 bugs. So try this:
"windows 98 bugs"
Remember the search for information about the latest Star Trek movie? We could
transform that into a phrase search like this:
"star trek insurrection"
But the movie's title actually has a colon after the word "trek," and many pages might
also follow this format. Thus, a better phrase search might be:
"star trek: insurrection"
Combining Symbols
Once you've mastered adding, subtracting and multiplying, you can combine symbols
to easily create targeted searches.
For example, remember the person who wanted pages only about Star Trek's original
series? We searched this way:
At some search engines, you can do a Match Any search by using a menu next to the
search box or on the advanced search page.
Keep in mind that most search engines will automatically first list pages that have all
your terms, then some of your terms, when you perform a Match Any search.
AltaVista
At AltaVista, testing shows that Match Any is most likely what will happen in response
to a default search. Earlier in 2001, AltaVista had said that Match Any would only
occur if you searched for five words or more. This no longer seems to be the case.
The article below explains what AltaVista previously said would happen:
Practically all major search engines support the + symbol as a means of doing a
Match All search. Most also perform a Match All search by default, even if you don't
use the + symbol.
See the Search Engine Math page for more specific help on using the + symbol.
Some search engine specific notes are below:
AOL Search
By default, AOL Search will look for any sites in its Open Directory information that
contain all the words you enter. It will check both the words in the Open Directory
listing and the words on the page that the listing leads to.
AOL Search will not check for matches in its Inktomi listings UNLESS there are
absolutely no Open Directory listings that match all words. However, if you use AOL
Search's advanced search and choose the "On the Web Only" option, then your
search will be conducted against only Inktomi's listings.
13.3 Exclude
Most major search engines allow you to exclude documents that contain certain
words. This is a helpful way to narrow a search.
For example, you may want a page about the philosopher Calvin, not the cartoon
character Calvin. By excluding pages that mention Hobbes, the cartoon character's
sidekick, you will get better results.
The best way to do this is by using the - command, which is supported by practically
all major search engines.
See the Search Engine Math page for more specific help on using the - symbol.
host:mars.jpl.nasa.gov
In response, AltaVista would display all the pages it has indexed from the
mars.jpl.nasa.gov domain.
Now imagine you wanted to find all the pages from the Mars Exploration web sites
that also mention Venus and Jupiter. You could do that this way:
That tells AltaVista to list pages with the words "venus" and "jupiter" that are within
the Mars Exploration web site.
You can even combine other commands, such as those described on the Search
Engine Math page. For instance, look at this example:
Here, we are telling AltaVista to list all pages within the Mars Exploration web site
that do not contain the exact phrase "mars pathfinder."
Now, imagine you are looking for information about Mars landings but are getting
overwhelmed by results from NASA. You can get rid of NASA pages by doing this:
In that example, we are looking for the phrase "mars landings" but excluding any
pages from sites that end in nasa.gov. That means we will NOT get pages from sites
like these
• mars.jpl.nasa.gov
• spacekids.hq.nasa.gov
• cmex.arc.nasa.gov.
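The "ends in nasa.gov" rule can be sketched in Python using the standard urllib.parse module. This only illustrates the matching rule, not how AltaVista implements it. The function name host_ends_with and the example.edu address are made up for this example.

```python
from urllib.parse import urlparse

def host_ends_with(url, domain):
    """True if the URL's host is `domain` itself or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)

results = [
    "http://mars.jpl.nasa.gov/landings.html",
    "http://spacekids.hq.nasa.gov/mars/",
    "http://www.example.edu/mars-landings.html",  # hypothetical page
]

# Drop every result whose site ends in nasa.gov.
kept = [url for url in results if not host_ends_with(url, "nasa.gov")]
print(kept)
```

Checking for a leading dot before the domain keeps an unrelated host such as notnasa.gov from being treated as part of nasa.gov.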
We could even decide to see all pages about Mars landing from US educational sites,
which end in .edu, like this:
Finally, imagine you live outside the US and want to see results that are
predominately from your country. Here's how someone in the UK might search for
football (soccer) information:
This finds pages that say "football scores" and which are from sites that end in the .uk
domain, which is used for UK-based sites.
Often, search engines that support a site search command also let you do this
from their advanced search pages. In addition, I'd highly recommend
downloading the Google Toolbar. Once you've done this, when visiting any web site,
you can use the toolbar's "Search site" button to search within just that web site.
Finally, for search engines that don't offer a site search command, you may find that
there is a URL Search command that provides a similar ability.
Excite
Excite has a "site" command as explained in the Site Search section, but this
command cannot be combined with search terms in an attempt to locate pages on a
particular topic from a particular web site or to filter out pages from a particular web
site. For example, this query wouldn't work:
However, you can use the URL command to get a similar result. For instance:
would work to list pages about "mars exploration" but would remove any that came
from the mars.jpl.nasa.gov site. Be aware that when using the URL command in this
way, only the exact site listed will be removed. For example, this query:
would remove pages from nasa.gov but still allow pages from mars.jpl.nasa.gov to
appear, since that is a different web site.
However, when using the + command, then any sites containing the core domain will
be included. In other words, this command:
would bring up pages from any site that has nasa.gov in the URL, such as
• mars.jpl.nasa.gov
• spacekids.hq.nasa.gov
• cmex.arc.nasa.gov.
Google's advanced search page uses the allinurl command for finding URLs that
contain certain words, as described more on the Checking Your Listing page.
However, it is the undocumented "inurl" command that you should use, if you want to
find both web pages with words in the URL and within the pages themselves.
For example, let's say you want to find PDF files about mars exploration. Entering
"mars exploration" isn't enough, because that could bring back both HTML and PDF
pages. To solve this, you can use the inurl command to specify that URLs must have
the word "pdf" in them, which will increase the chances of getting PDF files. Here are
both commands, combined:
If you used the "allinurl" command rather than the "inurl" command, this search
wouldn't work.
By the way, the "allinurl" command takes its name because when using it, you are
requiring that ALL the words appear IN the URL. In contrast, the inurl command
means that ANY of the words you specify should appear.
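The ALL versus ANY distinction, as described above, can be sketched in Python. The two functions simply borrow the command names for illustration; they follow the description in this section and are not Google's actual implementation. The URL is made up.

```python
def allinurl(url, words):
    """ALL of the words must appear somewhere in the URL."""
    u = url.lower()
    return all(w.lower() in u for w in words)

def inurl(url, words):
    """ANY of the words appearing in the URL is enough."""
    u = url.lower()
    return any(w.lower() in u for w in words)

url = "http://www.example.com/mars/report.pdf"  # hypothetical URL

print(allinurl(url, ["mars", "exploration"]))
print(inurl(url, ["mars", "exploration"]))
```

The URL contains "mars" but not "exploration", so it fails the ALL test and passes the ANY test.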
Google also has a command that lets you narrow your search to find documents in
particular formats, which works better than forcing the URL command into this role.
The command is filetype:, and you follow it with the extension you want to search for.
For instance:
brings back PDF files that contain the words "california power crisis." In contrast:
brings back ordinary HTML files that end in .html and contain the words. It will not
bring back HTML files that end in .htm, however. Technically, Google considers those
to be a different file type, simply because the ending is different.
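Why .htm and .html count as different types becomes clear if the type is taken purely from the ending of the file name. Here is a minimal Python sketch of that idea. The function name file_type and the URLs are made up, and this is not a description of how Google classifies files internally.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def file_type(url):
    """Return the lowercase extension of the file named in the URL path."""
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lstrip(".").lower()

# Hypothetical URLs for illustration.
print(file_type("http://www.example.com/power/crisis.html"))
print(file_type("http://www.example.com/power/crisis.htm"))
print(file_type("http://www.example.com/power/crisis.PDF"))
```

A filter that compares extensions exactly will treat the first two pages as different types, even though both are ordinary HTML.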
If someone were to do a title search for "power searching," then this page might
appear, because those words appear in the HTML title.
Some search engines support title searching using the "title" command, which looks
like this:
title:terms
where "terms" are the words you wish to search for in the title. Here are some
examples:
title:mars
title:mars landings
title:"mars landings"
In the first example, we're looking for the word "mars" in page titles. In the second
example, we're looking for both "mars" and "landings" in titles. In the last example,
we're looking for the exact phrase "mars landings" in titles.
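To make the idea concrete, here is a rough Python sketch that pulls the HTML title out of a page and tests it against words or a phrase. It uses the standard html.parser module. The names TitleParser, title_has_all and title_has_phrase are made up for this example, and the phrase test is a simple substring check, which is a simplification.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text found inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

def title_has_all(html, terms):
    """Like title:mars landings, every word must be in the title."""
    words = page_title(html).lower().split()
    return all(t.lower() in words for t in terms)

def title_has_phrase(html, phrase):
    """Like title:"mars landings", the exact phrase must be in the title."""
    return phrase.lower() in page_title(html).lower()

page = ("<html><head><title>Landings on Mars</title></head>"
        "<body>All about mars landings.</body></html>")
print(title_has_all(page, ["mars", "landings"]))
print(title_has_phrase(page, "mars landings"))
```

Both words are in the title, but not as the exact phrase, and the phrase in the body text does not count.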
Direct Hit
A search within title option is available on the advanced search page but was found
not to be working when tested on August 6, 2001.
Google
Google uses the allintitle: command rather than title:, and the command means that
documents will have ALL the words you specify in their titles. The intitle: command
means that ANY of the words may be present.
HotBot
Title searching, either by using the advanced search page or via the title: command
only brings back results from Inktomi (as described below), not from the Open
Directory or Direct Hit.
Inktomi
The title command works for single or multiple words within the Inktomi-listings of
these services:
• HotBot
• iWon
• MSN Search
The command also does not work with quotation marks around multiple words or the
+ or - symbol. Instead, to perform operations such as a phrase search within the title,
you'll need to go to the advanced search pages of these search engines (iWon
doesn't have one). Enter the words you want to find including the search operators
that you wish to use. Then find the option to search for words in the page title.
Others
The title command does not work reliably with GoTo. At AOL Search, it only works if
you use the advanced search page with the On the Web Only option.
At Yahoo, you must use the t: command instead of title: to search through
titles.
Several of the search engines which support the title command also allow you to
specify a title search using their advanced search pages.
Some of the search engines offering wildcard search also support what is called
"stemming." That means they will find terms like "singing" even if you only enter
"sing." This also means you may not need to use a wildcard symbol.
Below are some important additional details about wildcard searching at specific
search engines.
AOL Search
At AOL Search, the ? symbol serves as a wildcard and will replace any single
character, such as:
This only works to find matches in AOL Search's Open Directory information. It does
not work to bring back matches from Inktomi-powered listings, as explained further
below.
Inktomi
Inktomi has two wildcard commands. The * symbol will match one or more
characters, such as:
The ? symbol matches any single character, and you can use it more than once. For
instance:
Both commands only work reliably at iWon, at the time of this writing. They fail to
function properly at AOL Search, HotBot, MSN Search or LookSmart to bring up
matches from within the Inktomi listings that they use.
They also do not appear to bring up matches in wildcard fashion from any of the other
data sets these services use, with the exception of AOL Search (see AOL Search
section, above).
Northern Light
Like Inktomi above, Northern Light has two wildcard commands. The * symbol will
match one or more characters, while % is used to match just a single character.
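The wildcard rules described for these services can be sketched by translating a pattern into a regular expression with Python's re module. This is only an illustration. The function names are made up, and treating "characters" as letters, digits and underscores is an assumption.

```python
import re

def wildcard_to_regex(pattern, single="?"):
    """Turn a wildcard pattern into a compiled regular expression.

    * matches one or more characters; `single` (? or %) matches exactly one.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w+")
        elif ch == single:
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def wildcard_match(pattern, word, single="?"):
    """True if the whole word fits the wildcard pattern."""
    return wildcard_to_regex(pattern, single).fullmatch(word) is not None

print(wildcard_match("sing*", "singing"))
print(wildcard_match("wom?n", "women"))
print(wildcard_match("wom%n", "woman", single="%"))
```

Because * stands for one or more characters in this sketch, sing* matches "singing" and "singer" but not "sing" itself.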
Notice the words "Mars Exploration Web Site" are all contained within the hyperlink?
This is the anchor text or the link text.
13.11 Proximity
Some search engines let you indicate how close words should appear to each other.
Most people do not need this type of control. Usually, phrase searching is all you
need.
• AltaVista
• Inktomi-powered services
• Google
• Northern Light
Explore the help pages and advanced search forms at each service to better
understand the additional options that are available.
AltaVista
Displays related searches near the top of the results page, next to the words "Others
searched for."
AllTheWeb.com
Displays related searches near the top of the results page, next to the words "Narrow
your search."
Direct Hit
Displays related searches near the top of the results page, under the "Related
Searches" heading.
Excite
After performing a search, click on the "Zoom In" button near the search box to see
related terms. These will appear in a separate window. Select the related term you
want and choose the Search button within the new window. Your search will then be
sent to Excite.
HotBot
Displays related searches near the top of the results page, under the "People who did
this search also searched for" heading.
MSN Search
Displays related searches in the "Popular Topics" area below the search box, on the
results page.
Yahoo
13.14 Clustering
Have you ever done a search and found the top results all seem to come from one
site? Clustering prevents this. Clustering generally allows only one or two pages per
site to be represented in the top results. This means that you get more variety and a
better chance of quickly finding something of interest. The section below highlights
how this feature works at the major services that offer it.
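The basic idea of clustering can be sketched in Python as a pass over the ranked results that keeps only the first few pages from each site. This is a simplified illustration, not any engine's actual method. The function name cluster and the example.com addresses are made up.

```python
from urllib.parse import urlparse

def cluster(results, per_site=2):
    """Keep at most `per_site` URLs per host, preserving rank order."""
    seen = {}
    kept = []
    for url in results:
        host = urlparse(url).hostname
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= per_site:
            kept.append(url)
    return kept

# Hypothetical ranked results, best match first.
ranked = [
    "http://a.example.com/page1.html",
    "http://a.example.com/page2.html",
    "http://a.example.com/page3.html",
    "http://b.example.com/index.html",
]
print(cluster(ranked))
```

The third page from the first site is dropped, so the second site moves up into view.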
AltaVista
AltaVista clusters listings so that no more than two pages per site appear in its
results. If a second page from a particular web site is listed, it will be indented under
the first page. To see more results from a site, select the "Additional relevant pages
from this site" link, if it appears for a particular listing.
AllTheWeb.com
Clustering is on by default and will prevent more than two pages from the same web
site from being displayed. It can be overridden by changing the Site Collapsing option
on the Search Customization page (see the Customizing Results section below). You
can also view more pages from any particular site listed by selecting the "more hits
from" link that follows the listing.
Excite
There is no way to disable clustering at Excite. However, you can see more pages
from any particular site listed by selecting the "more from this site" link that follows
the listing.
Google
Google clusters so that no more than two pages per site appear in its results. If a
second page is listed, it will be "indented" under the first page. To see more results
from a site, select the "More results from" link that will appear below the second page
listed.
HotBot
MSN Search
Clustering at MSN Search has to be enabled from its advanced search page. Look for
the "Show one result per domain" option and select it to start clustering.
Northern Light
Northern Light has clustering, and there is no way to turn this off. To see more pages
from a site, click on the "More Results" link below the page listing, if it appears.
AltaVista
Click on the "Related pages" link that appears at the bottom of each listing.
AOL Search
Click on the "Show me more like this" option that appears at the bottom of each page
listed. This takes you to where that page is categorized within the version of the Open
Directory that AOL uses. That can help you find similar web sites.
Google
Click on the "Similar pages" link that appears at the end of each listing.
13.16 Stemming
Stemming is the ability for a search engine to search for variations of a word based
on its stem. For example, entering "swim" might also find "swims" and maybe
"swimming," depending on the search engine.
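A very naive form of stemming can be sketched in Python by stripping a couple of common endings. Real search engines use far more careful rules. The function naive_stem is made up for this example and only handles the "s" and "ing" endings mentioned here.

```python
def naive_stem(word):
    """Reduce a word to a rough stem by removing 'ing' or 's'."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo a doubled final consonant, e.g. "swimm" becomes "swim".
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

for w in ("swim", "swims", "swimming", "cook", "cooks", "cooking"):
    print(w, naive_stem(w))
```

A search engine that stems this way would index and look up the stem, so a search for any of the three forms finds pages containing the others.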
The Search Features Chart shows which search engines will do stemming by default
and those that allow it to be switched on as an option. Some search engine specific
notes are also below.
Direct Hit
Singular and plural forms of words should generally provide the same results (cook
versus cooks), as should other endings such as "ing" (cook versus cooking).
HotBot
Stemming should be on automatically for Direct Hit-powered listings (see Direct Hit,
above). For stemming in Inktomi-powered listings, see the Inktomi section, below.
Inktomi
MSN Search
This appears to be on permanently, at least for some queries. For example, a search
for "run," "runs" and "running" in Oct. 2001 found the same results. Oddly, using the
"Enable Stemming" box on MSN Search's advanced search page actually causes no
results to appear.
AltaVista
After performing a search, check the "Search within these results" box under the
search box, on the results page.
After performing a search, click on the "Search within results" link that appears
next to the search box, at the bottom of the results page.
HotBot
After performing a search, check the "Search within these results" box that appears
next to the search box, on the results page.
LookSmart
You cannot search within results generated from a keyword search on the LookSmart
home page. However, if you navigate to any particular category, you can then search
for matching sites that appear only within that category and its subcategories. To do
this, when in a category, change the drop down box at the top of the category page
from "the Web" to the second option, which will be the name of the category you are
in.
Lycos
At Lycos, choose the "Search these results" option which appears next to the search
box, at the top of the results page.
Yahoo
At Yahoo, you can't run a search and then search within it. But you can go to any
category and then choose to search just within that section. Just look for the
appropriate options near the search boxes that appear within the categories.
Only Google allows you to see the actual page it spidered, through its "Cached"
feature. When you search, a "Cached" link may appear below some pages that are
listed. Click on this, and you'll be shown the page that was indexed, and any of your
search terms will be highlighted.
You can also bring up the spidered version using Google's cache command. Simply
enter the URL of a page after cache:, omitting the http:// prefix. For instance, to
see the cached version of this page, you would enter this into Google:
cache:searchenginewatch.com/facts/assistance.html
Below is how to search by language, at search engines that offer this feature.
Use the drop-down box that appears next to the search box on the home page and
results page, to search in a particular language that's offered.
HotBot
Use the drop-down box that appears on the left-hand side on the home page, to
search in a particular language that's offered.
MSN Search
Use the "Language" drop-down box on the advanced search page to search by
language through Inktomi's crawler-based results.
Northern Light
Use the "Documents written in: Language" drop-down box on the advanced search
page.
Click on the "Translate" link that appears at the bottom of each listing.
Click on the "Translate this page" link that appears next to the title of pages that are
not in English, when using the main Google.com web site.
Porn filters are not perfect, but they can be especially helpful if you are working with
children and want to minimize the risk of them seeing sexually explicit or offensive
terms in the results that appear.
AltaVista
AllTheWeb
Google
Enable the porn filter by using the SafeSearch option on Google's customize page.
LookSmart
LookSmart does not list porn sites. However, searches on porn terms will bring up
listings from the Inktomi results that supplement LookSmart's own listings. Therefore,
do not consider a search on LookSmart to be child-safe.
Lycos
Lycos SearchGuard
http://searchguard.lycos.com/
These services will warn "You have entered a search term that is likely to return adult
content" if you enter porn terms. That prevents you from immediately seeing possibly
objectionable content. However, results are still offered, if you choose to go beyond
this warning. These results come from across the web, from the service itself, or from
a partner search engine that specializes in listing porn sites and content.
Sort By Date
Sort by date sounds like a great idea, but there are big problems with dates on the
web. Some web servers report incorrect dates or no dates at all.
For instance, Go's engineers estimated in 1998 (back when the search engine still
existed) that only 70 percent of web servers returned the correct date, while 20
percent reported the current date, regardless of when the page was created or
changed. The remaining 10 percent of the time, the web servers reported no date at
all.
Still, date sorting is a nice feature to have, and one that many professionals want.
When you choose the option, the search engines that offer it list pages with newer
dates first.
At MSN Search, you'll find this option on the advanced search page. Use the "Sort
equally relevant results by" box.
At Northern Light, you enable date sorting by going to the advanced search page
and checking the "Sort results by" option.
Keep in mind that when people want to sort by date, they are often trying to get
the latest information on a news topic. In these cases, it is better to use a news search
engine.
Date Range
Some search engines let you restrict a search so that only pages within a particular
date range are displayed. This feature can suffer from the fact that web page dates
can be unreliable, as described above. However, it can also be useful, especially as a
means of determining how fresh a search engine's listings are.
For example, if you restrict a search to find pages less than a month old and don't get
any matches, you have a pretty good idea that the search engine's listings are out of
date.
Date Display
Along with the page description, some search engines show the date when a web
page was created or modified. As noted above, these dates may not always be
reliable. However, they do provide a useful clue as to how fresh or stale a search
engine's listings are. Thus, search engines that show a date deserve praise for doing
so.
When no date is reported, these search engines will instead display the date the
page was spidered.
Directories don't spider pages, but they can display when a listing was manually
added or updated, if desired.
Boolean search commands have been used by professionals for searching through
traditional databases for years. Despite this, they are overkill for the average web
user. The commands described on the Search Engine Math page provide the same
basic functionality as Boolean commands and are also supported by all the major
search services. If you are new to searching, start off learning how to search better by
first reading the Search Engine Math page, rather than trying to learn Boolean
commands. I'm certain you'll find it easier. In fact, many professionals might benefit by
abandoning Boolean commands when using web search engines. But since there is
a comfort level in using what is already familiar, this page covers how Boolean
commands are implemented at the major search services. It assumes you are
already familiar with Boolean searching, although some resources that provide
further help appear at the end of the page.
OR
The Boolean OR command is used in order to allow any of the specified search terms
to be present on the web pages listed in results. It can also be described as a Match
Any search. You use the command like this:
ireland OR eire
Search engines that support OR are shown on the Search Features Chart. For those
that don't, see their advanced search pages, where an option to search for any of your
terms is often available.
Also be aware that some search engines perform an OR search by default, as shown
in the Match Any section of the Power Searching For Anyone page. Search engine
specific notes are below.
AOL Search
OR failed to work correctly at the time this page was written. For instance, a search
for "ireland OR eire" failed to yield the much larger set of results that should have
appeared when compared to "ireland AND eire".
OR will not work to find different phrases, such as "bill clinton" OR "hillary clinton".
AND
The Boolean AND command is used in order to require that all search terms be
present on the web pages listed in results. It can also be described as a Match All
search. You use the command like this:
Search engines that support AND are shown on the Search Features Chart. For
those that don't, using the + symbol is generally a good alternative.
Also be aware that some search engines perform an AND search by default, as
shown in the Match All section of the Power Searching For Anyone page. Search
engine specific notes are below.
AOL Search
When using AND, you may find a slightly different number of documents will be
retrieved when compared to using the + symbol. This appears to be because AOL
Search will check both its own listings and Inktomi listings when using AND but only
Inktomi listings when using the + symbol.
NOT
The Boolean NOT command is used in order to require that a particular search term
NOT be present on web pages listed in results. It can also be described as an
Exclude search. You use the command like this:
Search engines that support NOT are shown on the Search Features Chart. For
those that don't, using the - symbol is generally a good alternative. Search engine
specific notes are below.
AOL Search
When using NOT, you may find a slightly larger number of documents will be
retrieved when compared to using the - symbol or no commands at all. This shouldn't
happen, but it did at the time this page was written.
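The three commands above can be pictured as set operations. The following is a minimal sketch, not how any particular search engine works internally: the tiny index, the page numbers and the function names are all made up for illustration.

```python
# Toy inverted index: each term maps to the set of page ids containing it.
# The terms and page ids are invented for this example.
INDEX = {
    "ireland": {1, 2, 3},
    "eire": {3, 4},
    "travel": {2, 4, 5},
}


def pages(term):
    """Return the set of pages containing the term (empty if unknown)."""
    return INDEX.get(term.lower(), set())


def search_or(a, b):
    """Match Any: pages containing either term."""
    return pages(a) | pages(b)


def search_and(a, b):
    """Match All: pages containing both terms."""
    return pages(a) & pages(b)


def search_not(a, b):
    """Exclude: pages containing the first term but not the second."""
    return pages(a) - pages(b)


print(sorted(search_or("ireland", "eire")))   # [1, 2, 3, 4]
print(sorted(search_and("ireland", "eire")))  # [3]
print(sorted(search_not("ireland", "eire")))  # [1, 2]
```

In this sketch an OR search can never return fewer pages than the matching AND search, which is the behavior the AOL Search note under OR expected to see.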
NEAR
The NEAR command is used in order to specify how close terms should appear to
each other. You use the command like this:
Please consider whether you really need to control proximity within your searches.
Most search engines will try to find the terms you indicate next to each other, or within
close proximity to each other, by default. Also, all of the search engines support
phrase searching through use of quotation marks. See Search Engine Math page for
more information about phrase searching.
Search engines that support NEAR are shown on the Search Features Chart. Search
engine specific notes are below.
AltaVista
NEAR means that terms will appear within 10 words of each other.
AOL Search
You can control the exact number of words apart by using NEAR/#. For instance,
NEAR/5 would mean the terms should be five words apart. If you don't specify a
number, then the terms must appear right next to each other.
Lycos
NEAR means that terms will appear within 25 words of each other. Lycos also
supports an extensive range of other adjacency commands. See the site's help pages
for Boolean searches for further details.
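As a rough illustration of what a NEAR check involves, here is a minimal sketch. It assumes that "within N words" means the word positions differ by at most N, and the function name and sample sentence are made up. Real engines may count distance differently.

```python
import re


def near(text, a, b, max_distance=10):
    """Return True if words a and b occur within max_distance positions.

    Assumption: distance is the difference between word positions,
    so adjacent words have a distance of 1.
    """
    words = re.findall(r"\w+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)


sentence = "The space shuttle completed several missions."
print(near(sentence, "space", "missions"))                  # True
print(near(sentence, "space", "missions", max_distance=2))  # False
```

The default of 10 mirrors the AltaVista note above. Passing max_distance=25 would mirror the Lycos note.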
Nesting ( )
Nesting allows you to build complex queries. You nest queries using parentheses,
like this:
Search engines that expressly say that they support nesting are shown on the Search
Features Chart. I have not tried to verify this information. Be aware that the major
search engines may process nested queries differently from one another.
Other Notes
AltaVista
Boolean searching can only be done from the advanced search page.
Boolean commands must be in uppercase. That's why I show them that way on this
page. If you always use uppercase, you won't have problems when going between
services.
You must set the menu option on the home page or advanced search page to
"Boolean phrase" when using Boolean commands.
Lycos
Lycos says it supports many Boolean commands, and I haven't verified these,
because of the difficulty of determining exactly which datasets might be processed. In
addition, AllTheWeb -- which powers many of the search results at Lycos -- doesn't
support Boolean. This makes it unclear how Lycos itself might then do this.
The search engine features chart below is designed primarily for users of search
engines. It summarizes key search commands and search assistance features.
These are described more fully on the Search Engine Math, Power Searching For
Anyone
NOTE: Math commands at Lycos tend to bring back results from FAST's crawler-based
results, rather than the human-powered results Lycos uses from the Open Directory. If you
want to search just Open Directory results, then use the Lycos advanced search page.
Feature         Command                  Search Engines
Title Search    allintitle:, intitle:    Google
                adv. search page         Direct Hit
                none                     AOL, Excite, HotBot, MSN, LookSmart, Lycos
                                         Not yet updated, but may be still correct: Netscape
                other                    Not yet updated, but may be still correct: Yahoo (t:)
Site Search     host:                    AltaVista
                site:                    Excite, Google (Netscape, Yahoo)
                url.host:                AllTheWeb, Lycos (for AllTheWeb results only)
                domain:                  Inktomi (HotBot, iWon, LookSmart)
                none                     AOL, Direct Hit, HotBot, LookSmart, Lycos, MSN,
                                         Netscape, Northern Light, Open Directory, Yahoo
URL Search      url:                     AltaVista, Excite, Northern Light
                url.all:                 AllTheWeb, Lycos (for AllTheWeb results only)
                allinurl:, inurl:        Google
                originurl:               Inktomi (AOL, GoTo, HotBot)
                u:                       Yahoo
                none                     AOL, Direct Hit, HotBot, LookSmart, MSN
                                         Not yet updated, but may be still correct: Open Directory
Link Search     link:                    AltaVista, Google, Northern Light
                linkdomain:              Inktomi (AOL, HotBot, iWon, MSN)
                                         (NOTE: measures links to entire domains)
                link.all:                AllTheWeb, Lycos (for AllTheWeb results only)
                none                     AOL, Direct Hit, Excite, HotBot, LookSmart, Northern Light
                                         Not yet updated, but may be still correct: Netscape, Yahoo (n/a)
Wildcard        *                        AltaVista, Inktomi (iWon), Northern Light
                                         Not yet updated, but may be still correct: Yahoo
                ?                        AOL Search, Inktomi (iWon)
                %                        Northern Light
                none                     AllTheWeb, Direct Hit, Excite, Google, HotBot, LookSmart,
                                         Lycos, MSN (MSN's help says it offers wildcard, but it
                                         failed to during testing)
Anchor Search   anchor:                  AltaVista
                none                     AllTheWeb, AOL Search, Direct Hit, Excite, Google,
                                         Inktomi, HotBot, Lycos
NOTE: The commands above are primarily useful when dealing with crawler-based search engines.
"None" indicates any crawler-based or human-powered search engine that creates its own listings but
which does not provide a particular command for searching within those listings. It may also indicate a
portal that outsources for its listings and which lacks a single command to work across the multiple
datasets it uses.
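Because each service uses a different command for the same job, a small lookup can help when moving between them. The sketch below takes the site search commands from the chart above. The function name, the example host and the way the command is joined to the search terms are assumptions made for illustration, so check each service's help pages for the exact syntax.

```python
# Site search commands as listed in the chart above.
SITE_COMMAND = {
    "AltaVista": "host:",
    "Excite": "site:",
    "Google": "site:",
    "AllTheWeb": "url.host:",
    "Inktomi": "domain:",
}


def site_query(engine, host, terms):
    """Build a query restricted to one host (hypothetical helper).

    Assumption: the command is written directly before the host name
    and the search terms follow after a space.
    """
    command = SITE_COMMAND.get(engine)
    if command is None:
        raise ValueError(f"no site search command listed for {engine}")
    return f"{command}{host} {terms}"


print(site_query("Google", "example.com", "ireland"))     # site:example.com ireland
print(site_query("AltaVista", "example.com", "ireland"))  # host:example.com ireland
```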
Feature             Offered By
Related Searches    AltaVista, AllTheWeb, Excite, HotBot, Lycos, MSN, Yahoo
                    Not yet updated, but may be still correct: iWon
Clustering          AltaVista, AllTheWeb, Excite, Google, HotBot, MSN, Northern Light
Find Similar        AltaVista, AOL Search, Google
Stemming            AOL Search, Direct Hit, HotBot, Inktomi (HotBot, MSN)
Search Within       AltaVista, Google, HotBot, Lycos
Spidered Version    Google
Search By Language  AltaVista, AllTheWeb, Excite, Google, HotBot, Lycos, MSN, Northern Light
Page Translation    AltaVista, Google, Lycos
Porn Filter         AltaVista, AllTheWeb, Google
Porn Warning        HotBot, MSN, Northern Light
Feature                         Supported By
Number Of Listings Shown        AltaVista, AllTheWeb, AOL Search (5), Direct Hit, Excite, Google,
(10 unless noted)               HotBot, LookSmart (15), Lycos, MSN (15), Northern Light
                                Not yet updated, but may be still correct: iWon, Netscape, Yahoo (20)
Ability To Increase Number      AltaVista, AllTheWeb, Excite, Google, HotBot, MSN
Of Listings?                    Not yet updated, but may be still correct: Yahoo
See 20 Results                  AltaVista, AllTheWeb, Excite, Google, HotBot, MSN
                                Not yet updated, but may be still correct: Yahoo
See 50 Results                  AltaVista, AllTheWeb, Excite, Google, HotBot, MSN
                                Not yet updated, but may be still correct: Yahoo
See 100 Results                 AllTheWeb, Google, HotBot
                                Not yet updated, but may be still correct: Yahoo
Sort By Date                    MSN Search, Northern Light
Date Range                      AltaVista, Google, HotBot, MSN, Northern Light
                                Not yet updated, but may be still correct: iWon, Yahoo
Date Displayed?                 AltaVista, HotBot (for Inktomi results), Northern Light
Display Titles Only?            AltaVista, Excite, HotBot (URLs only option), MSN
Other Major Customize Options   AltaVista, AllTheWeb, Google
Search Engine: Google, AlltheWeb, AltaVista

Size, type (size varies frequently and widely)
  Google: HUGE. Claims over 1.5 billion pages, but may be counting pages not fully
  indexed. General Web database with often useful ranking by popularity. Far from
  comprehensive, but often finds "the best" pages.
  AlltheWeb: HUGE. Claims will reach a billion pages soon. General Web database.
  Excellent ranking. Advanced search worth mastering.
  AltaVista: LARGE. Claims to be the biggest also. General Web database. Use the
  Advanced Search with Boolean operators.

Phrase searching
  Google: Yes. Use " ". Searches common "stop words" if in phrases in quotes.
  AlltheWeb: Yes. Use " ". In advanced search, terms in lower "filter" boxes always
  searched as phrases.
  AltaVista: Yes. Use " ".

Proper name searching
  Google: Enclose in " " or try without. Automatically looks for terms in close proximity.
  AlltheWeb: Enclose in " ". Enclose each variant form in " " and all forms in ( ) in
  top search box.
  AltaVista: Enclose in " " or use NEAR operator. Case sensitive.

Boolean logic
  Google: Partial. AND assumed between words. Capitalize OR. - excludes. No ( ) or
  nesting. In Advanced Search, limited Boolean: ALL = and, ANY = or, WITHOUT = and not.
  AlltheWeb: In top box, AND default. For OR, enclose terms or phrases in ( ) without
  typing "or". Advanced search "filter" boxes offer partial equivalent of Boolean logic:
  Must include = AND, Must not include = NOT, Should include prioritizes higher in
  ranking. No OR.
  AltaVista: AND (default), OR, AND NOT, NEAR (within 10 words). Use only in Advanced
  Search larger box. Do NOT use in A-V simple Search.

+Requires/-Excludes
  Google: DEFAULT=AND. - excludes. + will allow you to retrieve "stop words" (e.g., +in).
  AlltheWeb: DEFAULT is AND. In top box, - excludes.
  AltaVista: Available only in Simple Search (which defaults to OR). We recommend
  Boolean logic in Advanced Search -- much more powerful and specific.

Sub-Searching
  Google: Yes. At bottom of results page, click "Search within results" and enter
  more terms.
  AlltheWeb: No. Add terms.
  AltaVista: No. Add terms.

Language
  Google: Yes. Major Romanized and non-Romanized languages in Advanced Search.
  AlltheWeb: Yes, extensive list includes major Romanized and non-Romanized languages.
  Allows you to specify matching character sets.
  AltaVista: Yes, extensive list includes major Romanized and non-Romanized languages.
Inktomi's popularity grew several years ago, when it powered the secondary search
database behind Yahoo!. Yahoo has since switched to Google for its secondary search
and backend database, but Inktomi is just as popular now as it was several years ago,
if not more so. Its spiders are named "Slurp", and different versions of Slurp crawl the
web many times throughout the month, because Inktomi powers many sites' search
results. There isn't much more to Inktomi than that. Slurp puts heavy weight on title
and description tags and will rarely deep crawl a site. Slurp usually only spiders pages
that are submitted to its index. Here is a list of some of the sites that Inktomi provides
results for:
America Online
MSN ( The Microsoft Network )
iWon
HotBot
Looksmart
Goto.com
About
Goo
Anzwers
eoexchange
Powerize
NBCi
Canada.com
Chello
CNET
Swiss Search
Geocities
eHOLA
iAtlas
GoProfit
ICQ.com
Mobilecom
n2h2
quepasa.com
RadarUOL
Starmedia
4Anything.com
As you can see, that's quite a list. Inktomi has many different versions of its Slurp
spiders.
In the lighthearted spirit of the popular books for "idiots" and "dummies," here's a look
at seven common blunders that are virtually guaranteed to deliver useless,
nonsensical, or completely worthless search results.
Some of these gaffes might surprise you. But once you recognize them, it's easy to
banish these little gremlins forever from your Web search tool kit.
Some search engines simply ignore certain words. They are never used to find a
matching document, despite what amounts to a direct command when you type them
into a search form.
These are called "stop words" because the search engine doesn't "stop" when the
words are found in the index (if they are even indexed at all). Why not? Because stop
words are either too common to generate meaningful results, or are parts of speech
like adverbs, conjunctions, prepositions, or forms of "be" that mean nothing unless
they're part of a phrase with more "important" nouns and verbs.
If you use a stop word in a query you may get wildly irrelevant results. For example,
the phrase "searching the web" contains two stop words: "the" and "web." Though it's
not a particularly common word, web is used so frequently on the Internet that it's
virtually worthless as a finding aid.
Stripping out the stop words, "searching the web" becomes "searching," which will
naturally lead to results describing everything from criminal manhunts to quests for
enlightenment—and if you're lucky, maybe even something about searching the web.
How can you identify stop words? Google tells you when it's ignoring a stop word, at
the very top of a results page. You can force Google to include a stop word in a query
by putting a plus sign in front of it. AlltheWeb takes a different approach -- it often
automatically rewrites your query to include a stop word as part of a quoted phrase
with other query terms. Check out the link below to the 300 most common words in
English, many of which are stop words.
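To see why stop words matter, here is a minimal sketch of what an engine effectively does to a query before searching. The stop word list holds only the two words named in the example above, and the handling of the plus sign is a simplified stand-in for the Google behavior just described. The function name is made up.

```python
# Assumption: only the two stop words from the example above.
# Real engines use much longer lists.
STOP_WORDS = {"the", "web"}


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Drop stop words from a query unless they are forced with a plus sign."""
    kept = []
    for word in query.split():
        if word.startswith("+"):
            kept.append(word[1:])  # plus sign forces the word to be included
        elif word.lower() not in stop_words:
            kept.append(word)
    return " ".join(kept)


print(strip_stop_words("searching the web"))   # searching
print(strip_stop_words("searching the +web"))  # searching web
```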
Boolean operators, like "and," "or," and "not," can help narrow search results—when
used properly. The problem is that Boolean operators, because of their apparent
simplicity, appear to be easy to use. Maybe, and/or not really.
According to Ran Hock, author of The Extreme Searcher's Guide to Web Search
Engines, search engines implement Boolean features in different ways. For example,
while some accept a simple "not," others require "and not" for the same effect.
Additionally, some engines require that Boolean operators be capitalized, while
others do not (or and do not?).
Vulgar comes from the Latin vulgus, meaning common. Like some educated
sophisticates, search engines have a problem with common words. It's not that
they're being snotty or pretentious. It's that some words are so common that they
appear in literally millions of documents, making them virtually useless as a finding
aid.
Take weather, for example. There are thousands of sites providing weather
information, from local forecasts to elaborate treatises on meteorology. Tighten your
query by using focusing words to narrow the scope of your search. Rather than
merely searching for "weather," construct a query like "Cicely Alaska annual
snowfall," or something equally specific.
Be careful when a word has multiple meanings. Think of the word "bond" as an
example. If you just use the single word "bond" as a query, the search engine has to
figure out if you're looking for information about financial bonds, chemical bonds, or
even James Bond.
Make it easier for the engine to help you. Ask yourself the question before the search
engine does for you, and phrase your query accordingly.
Search engines are also easily confused by heteronyms, words that are spelled
identically but have different meanings when pronounced differently. For example,
"lead," pronounced LEED, means to guide. Pronounced as LED, though, the word
refers to the metal element. When you can, use concrete synonyms instead of
heteronyms.
Yet another problem for the searcher is whether to use capital letters in a query.
Some engines are case sensitive, while others are not. As a rule of thumb, it's a good
idea to always use lower case letters when you search. This will typically return
results that contain both upper and lower case letters.
If you use uppercase letters in a query to a case sensitive engine, results will only
include documents that also use upper case letters. This is usually a good thing for
proper nouns like names or places, which use initial upper case letters anyway. But it
might cause you to miss other documents where case-sensitivity is less important.
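The rule of thumb above can be written down in a few lines. This is a minimal sketch of how a case sensitive engine might compare one query term with one word in a document, based only on the behavior described here. The function name is made up, and real engines may differ.

```python
def term_matches(query_term, document_word):
    """Compare a query term with a document word, case sensitively.

    An all-lowercase query term matches any capitalization.
    A query term containing uppercase letters must match exactly.
    """
    if query_term.islower():
        return query_term == document_word.lower()
    return query_term == document_word


print(term_matches("bond", "Bond"))  # True
print(term_matches("Bond", "bond"))  # False
print(term_matches("Bond", "Bond"))  # True
```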
Most search engines do a good job at matching simple phrases, like "Afghan
refugees," or "space shuttle missions." You run into problems, though, with a phrase
like this section's title. Searching for "close but no cigar" on one major engine (which
shall remain mercifully unnamed) provided this link as its number two pick: The
Common Cold: Relief But No Cure. Definitely no cigar!
If you're searching for something where your keywords must be near each other to
get good results, your only option is to use AltaVista's advanced search and the
NEAR operator in your query. This finds documents containing both specified words
or phrases within 10 words of each other.
And now for the number one most common searching mistake:
If you're determined to find what you're looking for on the Web, be sure you're using
the right tools for the job. Search engines vary widely in scope, function, and quality.
You'll waste a lot of time if you don't choose the best search engine for each specific
searching task.
Should you use a crawler-based search engine, or a human compiled web directory?
How about a specialized search site, a database, or an invisible web resource? By
analyzing your needs and comparing them with the strengths and weaknesses of
each search service *before* you search, you'll likely get better results.
If you're relatively new to searching and get stuck, don't be hard on yourself. One of
the most ridiculous misconceptions I've ever heard is that "you can find anything on
the Internet." This is about as true as saying that there are diamonds in every coal
mine.
And though it may sound like heresy, sometimes your best bet for finding
information is to log off and take a trip to your local library. Libraries have tons of
resources that aren't available on the Web. And librarians are trained experts who are
usually more than willing to help you find what you're looking for. When you're getting
nowhere on the Web, take advantage of these (usually very nice) "human search
engines."
Begone, Mistakes!
As you gain experience searching the Web, avoiding these seven searching mistakes
will become second nature. Whenever you get weird or unexpected results, take a
close look at your query and try to figure out what happened. You'll likely discover yet
another mistake to avoid.
As random as they are relevant, enigmatic as they are enlightening, search engines
have earned a slightly sullied reputation as a necessary evil. But it is a one-sided
assessment. The search engines have not been able to explain themselves. Until
now.
Fuzzy search: A search that will find matches even when words are only partially
spelled or misspelled.
Keyword search: A search for documents containing one or more words that are
specified by a user.
Precision: The degree to which the documents a search engine lists actually match a
query. The larger the share of listed documents that match, the higher the precision.
For example, if a search engine lists 80 documents found to match a query but only 20
of them contain the search words, then the precision would be 25%.
Recall: Related to precision, this is the degree to which a search engine returns all
the matching documents in a collection. There may be 100 matching documents, but
a search engine may only find 80 of them. It would then list these 80 and have a
recall of 80%.
Relevancy: How well a document provides the information a user is looking for, as
measured by the user.
Search Engine: The software that searches an index and returns matches. Search
engine is often used synonymously with spider and index, although these are
separate components that work with the engine.
Spider: The software that scans documents and adds them to an index by following
links. Spider is often used as a synonym for search engine.
Stemming: The ability for a search to include the "stem" of words. For example,
stemming allows a user to enter "swimming" and get back results also for the stem
word "swim."
Stop words: Conjunctions, prepositions and articles and other words such as AND,
TO and A that appear often in documents yet alone may contain little meaning.
Thesaurus: A list of synonyms a search engine can use to find matches for
particular words if the words themselves don't appear in documents.
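The precision and recall figures given in the definitions above can be checked with a short calculation. This is a minimal sketch, and the function and parameter names are made up for illustration.

```python
def precision(listed, listed_that_match):
    """Share of the listed documents that really match the query."""
    return listed_that_match / listed


def recall(matching_in_collection, matching_found):
    """Share of all matching documents in the collection that were found."""
    return matching_found / matching_in_collection


# Precision example: 80 documents listed, only 20 contain the search words.
print(f"{precision(80, 20):.0%}")  # 25%

# Recall example: 100 matching documents exist, the engine finds 80.
print(f"{recall(100, 80):.0%}")    # 80%
```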