
Working of Web Search Engines

Abstract
The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. True search engines crawl the web, and then automatically generate their listings. If you change your web pages, search engine crawlers will eventually find these changes, and that can affect your listing. Page titles, body copy, meta tags (sometimes) and other elements all play a role in how each search engine evaluates the relevancy of your page (and hence its ranking). There are plenty of ways to cater to a search engine's crawlers and change a site to help improve its rankings. One such engine is the Google Web Search Engine. This report goes through the different generations of web search engines, the simplified algorithm used for Page Ranking and an overview of the Google architecture. It is important to know how search engines work, so as to get the best out of them.

1. Introduction
An Internet search engine is a specialized tool that helps us find information on the World Wide Web. A technical encyclopedia, WhatIs.com, provides an accurate definition of a search engine: "A search engine is a coordinated set of programs that includes:
• A spider (also called a "crawler" or a "bot") that goes to every page or representative pages on every Web site that wants to be searchable and reads it, using hypertext links on each page to discover and read a site's other pages
• A program that creates a huge index (sometimes called a "catalog") from the pages that have been read
• A program that receives your search request, compares it to the entries in the index, and returns results to you." (WhatIs.com, 2001)
In essence, the search engine bots crawl web pages and use links to help them navigate to other pages. The search engine then indexes those pages into its database. When a searcher sends a search query, the search engine compares the web pages in the index to find documents that are relevant to the search query. Based on its algorithm, the search engine returns results to the searcher in the search engine result page (SERP).
The search engine algorithm is a set of rules that a search engine follows in order to return the most relevant results. Search engines sometimes fail to return relevant results, and that is why they need to improve their algorithms constantly. The algorithms determine the placement of web documents in the organic or natural search results, which are typically displayed on the left side of the screen in the SERPs, as illustrated in Figure 1.

Figure 1

Search engine algorithms are very closely kept industry secrets, because of the fierce competition in the field. Another reason for search engines to keep their algorithms private is search engine spam. If webmasters knew the exact algorithm of a search engine, they could manipulate the results in their favor quite easily. By testing different tactics, website owners sometimes find out elements of the algorithms and act accordingly to boost their ranking in the SERPs. Therefore, changes in the algorithms are often due to increased search engine spam.
There are dozens of search engines which are used by billions of people every day. Amongst these are popular ones like Google, Yahoo, and Bing.
The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as people are likely to surf the web using its link graph, often starting with high
quality human maintained indices such as Yahoo! or with search engines like Lycos, AltaVista etc. Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics.
Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines, and there are also spammers who want to influence the web search results.

2. A Brief History of Search Engines
The history of Internet search engines dates back to 1990, when Alan Emtage, a student at McGill University in Montreal, developed a search engine called Archie. As there was no World Wide Web at that time, Archie operated on a system called File Transfer Protocol (FTP). In June 1993, Matthew Gray developed the first robot on the Web, called the Wanderer. Referred to as the mother of search engines, the World Wide Web Wanderer captured URLs on the web and stored them in the first ever web database, Wandex. Other improved web robots soon followed, and search engines began categorizing web pages in databases instead of just crawling and listing them. In 1994 Galaxy, Lycos and WebCrawler were launched, bringing search engine indexing to a more advanced state. A small directory project by two Stanford University Ph.D. candidates, David Filo and Jerry Yang, was also introduced in 1994, which the creators called Yahoo! This small directory has since turned into a multi-billion dollar company and is currently one of the biggest online search providers.
Many search engines that are still major players in the search arena were established in the following years, including AltaVista, Excite, Inktomi, HotBot and Ask Jeeves.
Excite was introduced in 1993 by six Stanford University students. It used statistical analysis of word relationships to aid in the search process. Today it's a part of the AskJeeves company.
EINet Galaxy (Galaxy) was established in 1994 as part of the MCC Research Consortium at the University of Texas, in Austin. It was eventually purchased from the University and, after being transferred through several companies, is a separate corporation today. It was created as a directory, containing Gopher and telnet search features in addition to its Web search feature.
Jerry Yang and David Filo created Yahoo in 1994. It started out as a listing of their favorite Web sites. What made it different was that each entry, in addition to the URL, also had a description of the page. Within a year the two received funding and Yahoo, the corporation, was created.
Later in 1994, WebCrawler was introduced. It was the first full-text search engine on the Internet; the entire text of each page was indexed for the first time.
Lycos introduced relevance retrieval, prefix matching, and word proximity in 1994. It was a large search engine, indexing over 60 million documents in 1996, the largest of any search engine at the time. Like many of the other search engines, Lycos was created in a university atmosphere, at Carnegie Mellon University, by Dr. Michael Mauldin.
Infoseek went online in 1995. It didn't really bring anything new to the search engine scene. It is now owned by the Walt Disney Internet Group and the domain forwards to Go.com.
AltaVista also began in 1995. It was the first search engine to allow natural language inquiries and advanced searching techniques. It also provides a multimedia search for photos, music, and videos.
Inktomi started in 1996 at UC Berkeley. In June of 1999 Inktomi introduced a directory search engine powered by "concept induction" technology. "Concept induction," according to the company, "takes the experience of human analysis and applies the same habits to a computerized analysis of links, usage, and other patterns to determine which sites are most popular and the most productive." Inktomi was purchased by Yahoo in 2003.
AskJeeves and Northern Light were both launched in 1997.
Google was launched in 1997 by Sergey Brin and Larry Page as part of a research project at Stanford University. It uses inbound links to rank sites. In 1998 MSN Search and the Open
Directory were also started. Today MSN Search (Live) is also known as Bing.

3. Three Types of Search Engines
The term "search engine" is often used generically to describe crawler-based search engines, human-powered directories, and hybrid search engines. These types of search engines gather their listings in different ways: through crawler-based searches, human-powered directories, and hybrid searches.
3.1. Crawler-based search engines
Crawler-based search engines, such as Google (http://www.google.com), create their listings automatically. They "crawl" or "spider" the web, then people search through what they have found. If web pages are changed, crawler-based search engines eventually find these changes, and that can affect how those pages are listed. Page titles, body copy and other elements all play a role.
3.2. Human-powered directories
A human-powered directory, such as the Open Directory Project, depends on humans for its listings. (Yahoo!, which used to be a directory, now gets its information from the use of crawlers.) A directory gets its information from submissions, which include a short description to the directory for the entire site, or from editors who write one for sites they review. A search looks for matches only in the descriptions submitted. Changing web pages, therefore, has no effect on how they are listed. Techniques that are useful for improving a listing with a search engine have nothing to do with improving a listing in a directory.
3.3. Hybrid search engines
Today, it is extremely common for crawler-type and human-powered results to be combined when conducting a search. Usually, a hybrid search engine will favor one type of listings over another. For example, MSN (now Bing) is more likely to present human-powered listings from LookSmart.

4. How do Search Engines Work?
Many Internet nomads are confounded when they enter a search query and get back a set of over 10,000 "relevant" hits, viewable in batches of 10. There are occasions when the searcher will plow through the list hoping to find the perfect link, but will sometimes come across other factors at work that cause inappropriate results to rise to the top of the list. One of the factors that can lead to this type of misinformation may be erroneous assumptions by searchers as to what's really going on "behind the curtain." "Search engine" is the popular term for an Information Retrieval (IR) system.
Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
4.1. High-level Design Architecture of a WebCrawler

Figure 2

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots etc.
The behavior of a Web crawler is the outcome of a combination of policies:
• a selection policy that states which pages to download,
• a re-visit policy that states when to check for changes to the pages,
• a politeness policy that states how to avoid overloading Web sites, and
• a parallelization policy that states how to coordinate distributed Web crawlers.
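To make these policies concrete, the sketch below shows a minimal, single-threaded crawler in Python using only the standard library. The seed URLs, the page limit and the fixed one-second delay are illustrative assumptions; a real crawler would also honor robots.txt, apply a genuine re-visit policy and parallelize its work across many machines.

```python
import re
import time
import urllib.parse
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=50, delay_seconds=1.0):
    """Breadth-first crawl starting from a few seed pages.

    The FIFO frontier is a crude selection policy, the per-request delay is
    a crude politeness policy, and the visited set keeps the crawler from
    fetching the same URL twice. Re-visit and parallelization policies are
    omitted for brevity.
    """
    frontier = deque(seed_urls)
    visited = set()
    pages = {}                                   # url -> raw HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                             # skip pages that cannot be fetched
        pages[url] = html

        # Selection policy: follow the hypertext links found on the page.
        for href in re.findall(r'href="([^"#]+)"', html):
            absolute = urllib.parse.urljoin(url, href)
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)

        time.sleep(delay_seconds)                # politeness: do not hammer servers
    return pages

# Example (hypothetical seed): pages = crawl(["http://example.com/"], max_pages=10)
```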
4.2. How does any spider start its travels over the Web?
The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.

Figure 3

When the Google spider looked at an HTML page, it took note of two things:
• The words within the page
• Where the words were found
Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.
These different approaches usually attempt to make the spider operate faster, allow users to search more efficiently, or both. For example, some spiders will keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web.
Other systems, such as AltaVista, go in the other direction, indexing every single word on a page, including "a," "an," "the" and other "insignificant" words. The push to completeness in this approach is matched by other systems in the attention given to the unseen portion of the Web page, the meta tags.
Consumers would actually prefer a finding engine, rather than a search engine.
Search engines match queries against an index that they create. The index consists of the words in each document, plus pointers to their locations within the documents. This is called an inverted file. A search engine or IR system comprises four essential modules:
• A document processor
• A query processor
• A search and matching function
• A ranking capability
While users focus on "search," the search and matching function is only one of the four modules. Each of these four modules may cause the expected or unexpected results that consumers get when they use a search engine.
4.2.1. Document Processor
The Document Processor prepares, processes, and inputs the documents, pages, or sites that users search against. The Document Processor should perform some or all of the following steps:
1. Normalize the document stream to a predefined format
2. Break the document stream into desired retrievable units
3. Isolate and metatag sub-document pieces
4. Identify potential indexable elements in documents
5. Delete stop words
6. Stem terms
7. Extract index entries
8. Compute weights
9. Create and update the main inverted file against which the search engine searches in order to match queries to documents.
Steps 1-3: Preprocessing
While essential and potentially important in affecting the outcome of a search, these first three steps simply standardize the multiple formats encountered when deriving documents from various providers or handling various Web sites. They serve to merge all the data into a single
consistent data structure that all the downstream processes can handle. The need for a well-formed, consistent format is of relative importance in direct proportion to the sophistication of later steps of document processing. Step 2 is important because the pointers stored in the inverted file will enable a system to retrieve various sized units: site, page, document, section, paragraph, or sentence.
Step 4: Identify potential indexable elements in documents
Identifying potential indexable elements in documents dramatically affects the nature and quality of the document representation that the engine will search against. In designing the system, we must define the following: What is a term? Is it the alphanumeric characters between blank spaces or punctuation? If so, what about noncompositional phrases (phrases where the separate words do not convey the meaning of the phrase, like skunk works or hot dog), multiword proper names, or interword symbols such as hyphens or apostrophes that can denote the difference between "small business men" vs. "small-business men"? Each search engine depends on a set of rules that its document processor must execute to determine what action is to be taken by the "tokenizer," i.e., the software used to define a 'term' suitable for indexing.
Step 5: Delete stop words
This step helps save system resources by eliminating from further processing, as well as potential matching, those terms that have little value in finding useful documents in response to a customer's query. This step used to matter much more than it does now, when memory has become so much cheaper and systems so much faster, but since stop words may comprise up to 40 percent of text words in a document, it still has some significance.
A stop word list typically consists of those word classes known to convey little substantive meaning, such as articles (a, the), conjunctions (and, but), interjections (oh, but), prepositions (in, over), pronouns (he, it), and forms of the "to be" verb (is, are). To delete stop words, an algorithm compares index term candidates in the documents against a stop word list and eliminates certain terms from inclusion in the index for searching.
Step 6: Stem terms
Stemming removes word suffixes, perhaps recursively in layer after layer of processing. The process has two goals. In terms of efficiency, stemming reduces the number of unique words in the index, which in turn reduces the storage space required for the index and speeds up the search process.
In terms of effectiveness, stemming improves recall by reducing all forms of a word to a base or stemmed form. For example, if a user asks for analyze, he or she may also want documents which contain analysis, analyzing, analyzer, analyzes, and analyzed. Therefore, the document processor stems document terms to analy- so that documents which include various forms of analy- will have equal likelihood of being retrieved, which would not occur if the engine only indexed variant forms separately and required the user to enter all. Of course, stemming does have a downside. It may negatively affect precision in that all forms of a stem will match, when, in fact, a successful query for the user would have come from matching only the word form actually used in the query.
Systems may implement either a strong stemming algorithm or a weak stemming algorithm. A strong stemming algorithm will strip off both inflectional suffixes (-s, -es, -ed) and derivational suffixes (-able, -aciousness, -ability), while a weak stemming algorithm will strip off only the inflectional suffixes (-s, -es, -ed).
Step 7: Extract index entries
Having completed steps 1 through 6, the document processor extracts the remaining entries from the original document. For example, the following paragraph shows the full text as sent to a search engine for processing: "Milosevic's comments, carried by the official news agency Tanjug, cast doubt over the governments at the talks, which the international community has called to try to prevent an all-out war in the Serbian province. President Milosevic said it was well known that Serbia and Yugoslavia were firmly committed to resolving problems in Kosovo, which is an integral part of Serbia, peacefully in Serbia with the participation of the representatives of all ethnic communities, Tanjug said. Milosevic was speaking during a meeting with British Foreign Secretary Robin Cook, who delivered an ultimatum to attend negotiations in a week's time on an autonomy proposal for Kosovo with ethnic Albanian leaders from the province. Cook earlier told a conference that Milosevic had agreed to study the proposal."
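As an isolated illustration of steps 5 and 6, the sketch below shows how a stop word list and a weak, suffix-stripping stemmer might be applied to text like the paragraph above. The stop word list and the stemming rules are toy assumptions for illustration only; production engines use far larger lists and more careful stemmers.

```python
import re

# A tiny, illustrative stop word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "the", "and", "but", "oh", "in", "over",
              "he", "it", "is", "are", "was", "were", "to", "of"}

def weak_stem(token):
    """Strip only inflectional suffixes (-s, -es, -ed), as a weak stemmer would."""
    for suffix in ("es", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def delete_stop_words_and_stem(text):
    """Steps 5 and 6: drop stop words, then stem the surviving terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [weak_stem(t) for t in tokens if t not in STOP_WORDS and len(t) > 1]

# Example:
# delete_stop_words_and_stem("Milosevic's comments, carried by the official news agency")
# -> ['milosevic', 'comment', 'carri', 'by', 'official', 'new', 'agency']
```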
Steps 1 through 6 reduce this text for searching to the following text: "Milosevic comm carri offic new agen Tanjug cast doubt govern talk interna commun call try prevent all-out war Serb province President Milosevic said well known Serbia Yugoslavia firm commit resolv problem Kosovo integr part Serbia peace Serbia particip representa ethnic commun Tanjug said Milosevic speak meeti British Foreign Secretary Robin Cook deliver ultimat attend negoti week time autonomy propos Kosovo ethnic Alban lead province Cook earl told conference Milosevic agree study propos"
The output of step 7 is then inserted and stored in an inverted file that lists the index entries and an indication of their position and frequency of occurrence. The specific nature of the index entries, however, will vary based on the decision in Step 4 concerning what constitutes an "indexable term." More sophisticated Document Processors will have phrase recognizers, as well as Named Entity recognizers and Categorizers, to ensure index entries such as Milosevic are tagged as a person and entries such as Yugoslavia and Serbia as countries.
Step 8: Compute weights
Weights are assigned to terms in the index file. The simplest of search engines simply assign a binary weight: 1 for presence and 0 for absence. The more sophisticated the search engine, the more complex the weighting scheme. Measuring the frequency of occurrence of a term in the document creates more sophisticated weighting, with length-normalization of frequencies still more sophisticated.
Extensive experience in Information Retrieval research over many years has clearly demonstrated that the optimal weighting comes from use of term frequency/inverse document frequency (tf/idf). This algorithm measures the frequency of occurrence of each term within a document. Then it compares that frequency against the frequency of occurrence in the entire database.
Not all terms are good discriminators; that is, they don't all single out one document from another very well. A simple example would be the word "THE." This word appears in too many documents to help distinguish one from another. A less obvious example would be the word "antibiotic." In a sports database, when we compare each document to the database as a whole, the term "antibiotic" would probably be a good discriminator among documents, and therefore would be assigned a high weight. Conversely, in a database devoted to health or medicine, "antibiotic" would probably be a poor discriminator, since it occurs very often. The tf/idf weighting scheme assigns higher weights to those terms that really distinguish one document from the others.
Step 9: Create index
The index or inverted file is the internal data structure that stores the index information and that will be searched for each query. Inverted files range from a simple listing of every alphanumeric sequence in a set of documents/pages being indexed, along with the overall identifying numbers of the documents in which that sequence occurs, to a more linguistically complex list of entries, their tf/idf weights, and pointers to where inside each document the term occurs. The more complete the information in the index, the better the search results.
4.2.2. Query Processor
Query processing has seven possible steps, though a system can cut these steps short and proceed to match the query to the inverted file at any of a number of places during the processing. Document processing shares many steps with query processing. More steps and more documents make the process more expensive in terms of computational resources and responsiveness. However, the longer the wait for results, the higher the quality of results. Thus, search system designers must choose what is most important to their users: time or quality. Publicly available search engines usually choose time over very high quality because they have too many documents to search against. The steps in query processing are as follows (with the option to stop processing and start matching indicated as "Matcher"):
1. Tokenize query terms
2. Recognize query terms vs. special operators
3. ---------------------------> Matcher
4. Delete stop words
5. Stem words
6. Create query representation
7. ---------------------------> Matcher
8. Expand query terms
9. Compute weights
10. ---------------------------> Matcher
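Before walking through these query-processing steps in detail, the sketch below illustrates the document-processing side just described: computing tf/idf weights (step 8) and storing them in a simple inverted file (step 9). It assumes the documents have already been tokenized, stop-listed and stemmed, and the weighting shown is one common tf/idf variant rather than the formula of any particular engine.

```python
import math
from collections import Counter

def build_inverted_file(documents):
    """Create a simple inverted file with tf/idf weights (steps 8 and 9).

    `documents` is assumed to map doc_id -> list of index terms, i.e. the
    output of the earlier document-processing steps.
    """
    doc_count = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))

    # Inverted file: term -> {doc_id: tf/idf weight}
    inverted_file = {}
    for doc_id, terms in documents.items():
        term_freq = Counter(terms)
        for term, tf in term_freq.items():
            idf = math.log(doc_count / doc_freq[term])
            weight = (tf / len(terms)) * idf     # length-normalized tf times idf
            inverted_file.setdefault(term, {})[doc_id] = weight
    return inverted_file

# Example:
# docs = {"d1": ["antibiotic", "antibiotic", "match"], "d2": ["match", "goal"]}
# index = build_inverted_file(docs)
# "antibiotic" gets a nonzero weight only in d1, while "match", which occurs
# in every document of this tiny collection, gets an idf (and weight) of zero.
```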
Figure 4

Step 1: Tokenize query terms
As soon as a user inputs a query, the search engine, whether a keyword-based system or a full Natural Language Processing (NLP) system, must tokenize the query stream, i.e., break it down into understandable segments. Usually a token is defined as an alphanumeric string that occurs between white space and/or punctuation.
Step 2: Recognize query terms vs. special operators
Since users may employ special operators in their query, including Boolean, adjacency, or proximity operators, the system needs to parse the query first into query terms and operators. These operators may occur in the form of reserved punctuation (e.g., quotation marks) or reserved terms in specialized format (e.g., AND, OR). In the case of an NLP system, the query processor will recognize the operators implicitly in the language used, no matter how they might be expressed (e.g., prepositions, conjunctions, ordering).
At this point, a search engine may take the list of query terms and search them against the inverted file. In fact, this is the point at which the majority of publicly available search engines perform their search.
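A minimal sketch of steps 1 and 2 for a keyword-style engine follows. The convention that double quotes mark an adjacency phrase and that AND, OR and NOT are reserved terms is an assumption about one common query syntax, not a description of any specific engine.

```python
import re

RESERVED_TERMS = {"AND", "OR", "NOT"}

def tokenize_query(query):
    """Steps 1 and 2: break the query into tokens, then separate plain terms
    from special operators (quoted phrases and Boolean keywords).

    Returns (terms, phrases, operators).
    """
    phrases = re.findall(r'"([^"]+)"', query)            # reserved punctuation
    remainder = re.sub(r'"[^"]*"', " ", query)
    tokens = re.findall(r"[A-Za-z0-9']+", remainder)

    operators = [t.upper() for t in tokens if t.upper() in RESERVED_TERMS]
    terms = [t.lower() for t in tokens if t.upper() not in RESERVED_TERMS]
    return terms, phrases, operators

# Example:
# tokenize_query('web crawler AND "search engine" NOT spam')
# -> (['web', 'crawler', 'spam'], ['search engine'], ['AND', 'NOT'])
```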
Steps 3 and 4: Delete stop words and stem words
Some search engines will go further and stop-list and stem the query, similar to the processes described in the Document Processor section. The stop list might also contain words from commonly occurring querying phrases, such as "I'd like information about…." However, since most publicly available search engines encourage very short queries, as evidenced in the size of query window they provide, they may drop these two steps.
Step 5: Creating the query representation
How each particular search engine creates a query representation depends on how the system does its matching. If a statistically based matcher is used, then the query must match the statistical representations of the documents in the system. Good statistical queries should contain many synonyms and other terms in order to create a full representation. If a Boolean matcher is utilized, then the system must create logical sets of the terms connected by AND, OR, or NOT.
The NLP system will recognize single terms, phrases, and Named Entities. If it uses any Boolean logic, it will also recognize the logical operators from Step 2 and create a representation containing logical sets of the terms to be AND'd, OR'd, or NOT'd. At this point, a search engine may take the query representation and perform the search against the inverted file. More advanced search engines may take two further steps.
Step 6: Expand query terms
Since users of search engines usually include only a single statement of their information needs in a query, it becomes highly probable that the information they need may be expressed using synonyms, rather than the exact query terms, in the documents that the search engine searches against. Therefore, more sophisticated systems may expand the query into all possible synonymous terms and perhaps even broader and narrower terms.
This process approaches what search intermediaries did for end-users in the earlier days of commercial search systems. Then intermediaries might have used the same controlled vocabulary or thesaurus used by the indexers who assigned subject descriptors to documents.
Today, resources such as WordNet are generally available, or specialized expansion facilities may
take the initial query and enlarge it by adding associated vocabulary.
Step 7: Compute query term weight (assuming more than one query term)
The final step in query processing involves computing weights for the terms in the query. Sometimes the user controls this step by indicating either how much to weight each term or simply which term or concept in the query matters most and must appear in each retrieved document to ensure relevance.
Leaving the weighting up to the user is uncommon, because research has shown that users are not particularly good at determining the relative importance of terms in their queries. They can't make this determination for several reasons. First, they don't know what else exists in the database, and document terms are weighted by being compared to the database as a whole. Second, most users seek information about an unfamiliar subject, so they may not know the correct terminology. Few search engines implement system-based query weighting, but some do an implicit weighting by treating the first term(s) in a query as having higher significance. They use this information to provide a list of documents/pages to the user. After this final step, the expanded, weighted query is searched against the inverted file of documents.
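The sketch below illustrates this step under simple assumptions: an implicit weighting that boosts the first query term, followed by scoring each document as the weighted sum of the tf/idf weights stored in the inverted file. The boost value and the scoring rule are illustrative choices, not those of any particular engine.

```python
def weight_query_terms(terms, first_term_boost=2.0):
    """Implicit query weighting: treat the first term as more significant."""
    weights = {}
    for position, term in enumerate(terms):
        weights[term] = max(weights.get(term, 0.0),
                            first_term_boost if position == 0 else 1.0)
    return weights

def score_documents(query_weights, inverted_file):
    """Search the weighted query against an inverted file.

    `inverted_file` is assumed to map term -> {doc_id: term_weight}, e.g. the
    tf/idf weights produced by the document processor.  Each document's score
    is the weighted sum of the weights of the query terms it contains, and the
    result is returned best-first.
    """
    scores = {}
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in inverted_file.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example:
# ranked = score_documents(weight_query_terms(["web", "crawler"]), inverted_file)
```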
4.2.3. Search and Matching Functions
How systems carry out their search and matching functions differs according to which theoretical model of IR underlies the system's design philosophy.
Searching the inverted file for documents which meet the query requirements, referred to simply as "matching," is typically a standard binary search, no matter whether the search ends after the first two, five, or all seven steps of query processing. While the computational processing required for simple, un-weighted, non-Boolean query matching is far simpler than when the model is an NLP-based query within a weighted, Boolean model, it also follows that the simpler the document representation, the query representation, and the matching algorithm, the less relevant the results, except for very simple queries, such as one-word, non-ambiguous queries seeking the most generally known information.
Having determined which subset of documents or pages match the query requirements to some degree, a similarity score is computed between the query and each document/page based on the scoring algorithm used by the system. Scoring algorithms base their rankings on the presence/absence of query term(s), term frequency, tf/idf, Boolean logic fulfillment, or query term weights. Some search engines use scoring algorithms not based on document contents, but rather on relations among documents or the past retrieval history of documents/pages.
After computing the similarity of each document in the subset of documents, the system presents an ordered list to the user. The sophistication of the ordering of the documents again depends on the model the system uses, as well as the richness of the document and query weighting mechanisms. For example, search engines that only require the presence of any alphanumeric string from the query, occurring anywhere, in any order, in a document, would produce a very different ranking from a search engine that performed linguistically correct phrasing for document and query representation and that utilized the proven tf/idf weighting scheme.
However, the search engine determines rank, and the ranked results list goes to the user, who can then simply click and follow the system's internal pointers to the selected document/page. More sophisticated systems will go even further at this stage and allow the user to provide some relevance feedback or to modify their query based on the results they have seen. If either of these is available, the system will then adjust its query representation to reflect this value-added feedback and rerun the search with the improved query to produce either a new set of documents or a simple re-ranking of documents from the initial search.
4.3. Ranking
Google's rise to success was in large part due to a patented algorithm called PageRank that helps rank web pages that match a given search string. When Google was a Stanford research project, it was nicknamed BackRub because the technology checks backlinks to determine a site's importance.
Previous keyword-based methods of ranking search results, used by many search engines that were once more popular than Google, would rank pages by how often the search terms occurred in the page, or how strongly associated the search terms were within each resulting page. The PageRank algorithm instead analyzes human-
generated links, assuming that web pages linked from many important pages are themselves likely to be important. The algorithm computes a recursive score for pages, based on the weighted sum of the PageRanks of the pages linking to them. PageRank is thought to correlate well with human concepts of importance. In addition to PageRank, Google over the years has added many other secret criteria for determining the ranking of pages on result lists, reported to be over 200 different indicators. The details are kept secret due to spammers and in order to maintain an advantage over Google's competitors.
PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is referred to as the PageRank of E and denoted by PR(E).
Google describes PageRank: "PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page's importance."
The name "PageRank" is a trademark of Google, and the PageRank process has been patented (U.S. Patent 6,285,999). However, the patent is assigned to Stanford University and not to Google. Google has exclusive license rights on the patent from Stanford University. The university received 1.8 million shares of Google in exchange for use of the patent; the shares were sold in 2005 for $336 million.
A PageRank results from a mathematical algorithm based on the graph created by all World Wide Web pages as nodes and hyperlinks, taking into consideration authority hubs like Wikipedia (however, Wikipedia is actually a sink rather than a hub because it uses nofollow on external links). The rank value indicates the importance of a particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page, there is no support for that page.
Numerous academic papers concerning PageRank have been published since Page and Brin's original paper. In practice, the PageRank concept has proven to be vulnerable to manipulation, and extensive research has been devoted to identifying falsely inflated PageRank and ways to ignore links from documents with falsely inflated PageRank.
Other link-based ranking algorithms for Web pages include the HITS algorithm invented by Jon Kleinberg (used by Teoma and now Ask.com), the IBM CLEVER project, and the TrustRank algorithm.
PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called "iterations," through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.
Assume a small universe of four web pages: A, B, C and D. The initial approximation of PageRank would be evenly divided between these four documents. Hence, each document would begin with an estimated PageRank of 0.25.
In the original form of PageRank, initial values were simply 1. This meant that the sum of all pages was the total number of pages on the web. Later versions of PageRank (see the formulas
below) would assume a probability distribution between 0 and 1. Here a simple probability distribution will be used, hence the initial value of 0.25.
If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A. All PageRank PR( ) in this simplistic system would thus gather to A, because all links would be pointing to A:
PR(A) = PR(B) + PR(C) + PR(D)
This is 0.75.
Suppose that page B has a link to page C as well as to page A, while page D has links to all three pages. The value of the link-votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D's PageRank is counted for A's PageRank (approximately 0.083):
PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the normalized number of outbound links L( ) (it is assumed that links to specific URLs only count once per document):
PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)
In the general case, the PageRank value for any page u can be expressed as:
PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)
i.e. the PageRank value for a page u is dependent on the PageRank values for each page v out of the set B_u (this set contains all pages linking to page u), divided by the number L(v) of links from page v.
Figure 5 below explains this in a simpler manner:

Figure 5: Mathematical PageRanks (out of 100) for a simple network (PageRanks reported by Google are rescaled logarithmically). Page C has a higher PageRank than Page E, even though it has fewer links to it; the link it has is of a much higher value. A web surfer who chooses a random link on every page (but with 15% likelihood jumps to a random page on the whole web) is going to be on Page E for 8.1% of the time. (The 15% likelihood of jumping to an arbitrary page corresponds to a damping factor of 85%.) Without damping, all web surfers would eventually end up on Pages A, B, or C, and all other pages would have PageRank zero. Page A is assumed to link to all pages in the web, because it has no outgoing links.
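A minimal sketch of the iterative computation described above follows, written in Python. It uses the damping-factor formulation from the Figure 5 caption and treats a page with no outgoing links as linking to every page, as Page A is treated there; the four-page link graph in the usage comment is a toy assumption.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank: repeatedly apply PR(u) = sum of PR(v)/L(v) with damping.

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}        # start from an even distribution

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            targets = outgoing if outgoing else pages   # dangling page links to all pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share               # the PR(v)/L(v) contribution
        rank = new_rank
    return rank

# Example: a tiny four-page web like the one in the worked example above.
# ranks = pagerank({"A": [], "B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]})
```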

The Panda Update
Google's recent Panda (a.k.a. "Farmer") page ranking algorithm update is intended to provide higher page rankings for quality, rather than quantity, of content. The biggest sites hurt in the change seem to be the "content farms".
Wikipedia defines a "content farm" as "a company that employs large numbers of often freelance writers to generate large amounts of textual content which is specifically designed to satisfy algorithms for maximal retrieval by automated search engines. Their main goal is to generate advertising revenue through attracting reader page views." In other words, spammy content designed to fool Google into ranking it higher.
To prevent spammers from gaming the system, Google does not divulge what specific changes they've made to their algorithm.
Google formed its definition of low quality by asking outside testers to rate sites by answering questions such as:
• Would you be comfortable giving this site your credit card?
• Would you be comfortable giving medicine prescribed by this site to your kids?
• Do you consider this site to be authoritative?
• Would it be okay if this was in a magazine?
• Does this site have excessive ads?
Sites that these answers marked as low quality were to see their rankings decrease.

5. Conclusion
A search engine plays an important role in accessing content over the internet; it fetches the pages requested by the user. It has made the internet, and accessing information, just a click away. The need for better search engines only increases, and search engine sites are among the most popular websites. Search engines are not the place to answer questions but to find information. So it is equally important for one to know how search engines work and how to get the best out of them.

6. References
[1] Wikipedia, "Web search engine," http://en.wikipedia.org/wiki/Web_search_engine
[2] How Stuff Works, http://www.howstuffworks.com
[3] WebReference.com
[4] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
[5] E. Liddy, "How a Search Engine Works," http://www.cnlp.org/publications/02HowASearchEngineWorks.pdf