
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
Computer Science Department,
Stanford University, Stanford, CA 94305, USA
sergey@cs.stanford.edu and page@cs.stanford.edu
Introduction
• This paper marks the birth of the Google search engine. It describes the
first Google prototype designed by Sergey Brin and Lawrence Page.
• The paper also covers the challenges of building a search engine that
could handle the largest document collection ever seen and serve an even
larger stream of queries against it.
• The main goal of Google was to improve the quality of web searches.
To do this, Google employed many methods, one of which was to make use
of the hyperlink (anchor) text available on the Web.
Before Google
Generally, people used to type their query into one of two kinds of systems:
• A search engine like Yahoo!, which kept high-quality, human-maintained
indices.
• But these indices were prone to error, expensive to maintain, slow to update,
and did not cover all topics.
• Or one of the other automated search engines available at that time.
• These search engines used keyword matching to return matches. The matches
returned were of low quality and prone to manipulation.
A brief history of search engines (in numbers)
• In 1994, one of the first web search engines, the World Wide Web Worm
(WWWW) had an index of 110,000 web pages and web accessible
documents and received an average of about 1500 queries per day.
• As of November 1997, the top search engines claimed to index from 2
million to 100 million web documents. AltaVista claimed it handled roughly
20 million queries per day.
• The authors predicted, rather correctly, that by the year 2000, a
comprehensive index of the Web would contain over a billion documents and
would handle hundreds of millions of queries per day.
• Thus, Google focused on being a search engine that excelled in both
quality and scalability.
Google: Scalability
• What needed to be done?
• Fast crawling technology
• Efficient storage
• Quick query retrieval
• How was it achieved?
• Parallelization of time consuming tasks
• Efficient use of storage with the help of various data structures
• Kept in mind the growth rate of hardware
Design Goals
• Improved search quality
• Very high precision
• Academic search engine research
• An architecture that supports research activities
System features
PageRank
Bringing order to the Web
Citation Graph
• A citation graph is a directed
graph in which
each vertex represents a
document and in which each
edge represents a citation from
the current publication to
another.
• The citation graph of the Web, or
webgraph, describes the
directed links between pages of
the World Wide Web.
PageRank Calculations
• We assume page A has pages
T1...Tn which point to it (i.e., are
citations). The parameter d is a
damping factor which can be set
between 0 and 1. C(A) is defined
as the number of links going out
of page A. The PageRank of
page A is given as follows:
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
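To make the formula concrete, here is a minimal sketch (in Python, not
Google's actual implementation) of iterating the PageRank formula above on a
tiny, invented link graph until the scores settle:

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}                      # initial guess
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over all pages T that link to this page.
            incoming = sum(pr[t] / len(links[t])
                           for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * incoming     # PR(A) = (1-d) + d(...)
        pr = new_pr
    return pr

# Toy graph invented for illustration:
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))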
Why PageRank?
• The probability that the random • A page can have a high
surfer visits a page is its PageRank if there are many
PageRank. And, the d damping pages that point to it, or if there
factor is the probability at each are some pages that point to it
page the "random surfer" will and have a high PageRank.
get bored and request another • Recommendation of buying a
random page. phone
Anchor text
Changing perspectives
What is Anchor Text?
• Anchor text is the visible, clickable text in a hyperlink. In modern
browsers, it is often blue and underlined, like this: Tiny dancing Horse.
How and Why Anchor Text?
• Most search engines associate the text of a link with the page that the
link is on.
• Google also associates it with the page the link points to.
• Advantages
• Anchors often provide more accurate descriptions of web pages than the
pages themselves.
• Anchors may exist for documents which cannot be indexed by a text-based
search engine, such as images, programs, and databases.
• Provide better quality results.
Other Features of Google
• It has location information for all hits and so it makes extensive use of
proximity in search.
• Google keeps track of some visual presentation details such as font
size of words. Words in a larger or bolder font are weighted higher
than other words.
System Anatomy
Architecture Overview
Methodology
• Distributed web crawlers.
• URLserver sends the list of URLs to be fetched.
• Fetched pages are stored in a compressed form
• Done by storeserver which stores the compressed web pages into a repository
• Each page is given a docID corresponding to the URL of the page
• An indexer reads the data from storage, uncompresses the documents, and parses
them.
• The parsed documents are divided into word occurrences called hits.
• The hits record the word, position in document, an approximation of font size, and
capitalization.
• Also, all links are parsed out of the page
• Important link information is stored in an anchors file, which contains enough information to determine
where each link points from and to, and the text of the link
• This database of links is used to calculate the PageRank of all documents
Methodology (Continued)
• The indexer distributes these hits into a
set of "barrels", creating a partially sorted
forward index.
• The sorter takes the barrels and re-sorts
them by wordID to generate the inverted
index.
• A program called DumpLexicon takes this
list together with the lexicon produced by
the indexer and generates a new lexicon
to be used by the searcher.
• The searcher is run by a web server and
uses the lexicon built by DumpLexicon
together with the inverted index and the
PageRanks to answer queries.
Major Data Structures
BigFiles
• BigFiles are virtual files spanning multiple file systems and are
addressable by 64 bit integers.
Repository
• The repository contains the full
HTML of every web page.
• Each page is compressed using
zlib.
• The choice of compression
technique is a tradeoff between
speed and compression ratio.
• The repository requires no other
data structures to be used in
order to access it.
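As a rough illustration of such a repository record, here is a sketch of
compressing one fetched page with zlib and packing it alongside its docID and
URL; the exact record layout is an assumption, not Google's actual format:

import struct
import zlib

def pack_record(doc_id: int, url: str, html: str) -> bytes:
    """Compress a page with zlib and prepend a small header."""
    compressed = zlib.compress(html.encode("utf-8"))
    header = struct.pack("<QII", doc_id, len(url.encode("utf-8")), len(compressed))
    return header + url.encode("utf-8") + compressed

def unpack_record(record: bytes):
    doc_id, url_len, comp_len = struct.unpack_from("<QII", record)
    offset = struct.calcsize("<QII")
    url = record[offset:offset + url_len].decode("utf-8")
    html = zlib.decompress(record[offset + url_len:offset + url_len + comp_len])
    return doc_id, url, html.decode("utf-8")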
Document Index
• The document index stores a pointer to each document in the
repository, a checksum of the data, and various statistics.
• It is a fixed-width ISAM (Indexed Sequential Access Method) index,
ordered by docID.
• If the document has been crawled, it also contains a pointer into a
variable-width file called docinfo which contains its URL and title.
• Otherwise the pointer points into the URLlist, which contains just the
URL.
• Additionally, there is a file which is used to convert URLs into
docIDs. It is a list of URL checksums with their corresponding docIDs,
sorted by checksum.
• To find the docID of a particular URL, the URL's checksum is computed
and a binary search is performed on the checksums file to find its
docID.
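A minimal sketch of that URL-to-docID lookup: compute a checksum of the URL
and binary-search a checksum-sorted list. Using zlib.crc32 as the checksum
function is an illustrative assumption:

import bisect
import zlib

def build_checksum_index(url_to_docid):
    """Return parallel lists (checksums, docIDs), sorted by checksum."""
    pairs = sorted((zlib.crc32(url.encode("utf-8")), doc_id)
                   for url, doc_id in url_to_docid.items())
    return [c for c, _ in pairs], [d for _, d in pairs]

def lookup_docid(url, checksums, docids):
    c = zlib.crc32(url.encode("utf-8"))
    i = bisect.bisect_left(checksums, c)
    if i < len(checksums) and checksums[i] == c:
        return docids[i]
    return None                                # URL not yet seen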
Lexicon
• The lexicon tracks the different words that make up the corpus of
documents. (Vocabulary)
• It is implemented in two parts
• A list of the words (concatenated together but separated by nulls)
• A hash table of pointers for fast lookup
• One important change from earlier systems is that the lexicon can fit
in memory for a reasonable price.
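A small sketch of how such a two-part lexicon could look: one null-separated
block of words plus an in-memory hash table from each word to its wordID and
offset (the exact layout here is an assumption):

class Lexicon:
    """Sketch: null-separated word block plus a hash table of offsets."""
    def __init__(self, words):
        self.blob = "\0".join(words)          # words concatenated, null-separated
        self.table = {}                       # word -> (wordID, offset into blob)
        pos = 0
        for word_id, w in enumerate(words):
            self.table[w] = (word_id, pos)
            pos += len(w) + 1                 # +1 for the null separator

    def word_id(self, word):
        entry = self.table.get(word)
        return entry[0] if entry is not None else None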
Hit Lists
• A hit list corresponds to the list of occurrences of a particular word
in a particular document.
• The hit list encodes the font, position in the document, and
capitalization of the word.
• The authors use a hand-optimized encoding scheme (2 bytes for every
hit) to minimize the space required to store the list.
• There are two types of hits
• Fancy hits – hits occurring in a URL, title, anchor text, or meta tag
• Plain hits – hits occurring everywhere else
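As an illustration of a compact 2-byte hit, here is a sketch of packing a
plain hit into 16 bits (1 capitalization bit, 3 bits of font size, 12 bits of
word position); the bit widths are illustrative, and fancy hits would use a
slightly different layout:

def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack one plain hit into 16 bits: cap bit | 3-bit font | 12-bit position."""
    position = min(position, 0xFFF)            # positions beyond 4095 saturate
    return (int(capitalized) << 15) | ((font_size & 0x7) << 12) | position

def unpack_plain_hit(hit: int):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF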
Forward Index
• The forward index stores a
mapping between document id,
word ids, and the hit list
corresponding to these words.
Inverted Index
• The inverted index maps
between word ids and document
ids.
• This index provides the
representation of the
occurrences of a word across all
documents.
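The relationship between the two indexes can be sketched in a few lines:
inverting a forward index of docID-to-(wordID, hit list) entries yields
wordID-to-(docID, hit list) postings. The data shapes here are assumptions
for illustration:

from collections import defaultdict

def invert(forward_index):
    """forward_index: {doc_id: {word_id: [hit, ...]}} -> {word_id: [(doc_id, hits)]}"""
    inverted = defaultdict(list)
    for doc_id, postings in forward_index.items():
        for word_id, hits in postings.items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)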
Crawling the Web
What is a Web crawler?
• A web crawler (also known as a
web spider or web robot) is a
program or automated script which
browses the World Wide Web in a
methodical, automated manner.
• They help in the process of Web
indexing.
• A large portion of search engine
development is crawling the web
and downloading pages to be
added to the index.
Web crawling in Google
• The main challenge is dealing with millions of different web servers and
pages over which Google has no control.
• In the Google prototype, a single URL server forwards lists of URLs to
distributed web crawlers that download pages for indexing.
• Google's web crawlers deal with large parts of the Internet and are thus
designed to be very robust and carefully tested.
Indexing the Web
Parsing
• Must handle a huge array of possible errors.
• Typos in HTML tags
• Kilobytes of zeros in the middle of a tag
• Non-ASCII characters
• HTML tags nested hundreds deep
Indexing Documents into Barrels
• After each document is parsed, it is encoded into a number of barrels.
• Every word is converted into a wordID by using an in-memory hash
table -- the lexicon.
• New additions to the lexicon hash table are logged to a file.
• Once the words are converted into wordIDs, their occurrences in the
current document are translated into hit lists and are written into the
forward barrels.
• Multiple indexers are run in parallel
• Thus, a log file is created to keep a record of extra words that were not in the base
lexicon, to be processed after all the indexers have run
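A rough sketch of this step, assuming barrels are keyed by wordID range and
that unknown words are simply appended to a log; the barrel sizing and log
format are invented for illustration:

from collections import defaultdict

def index_into_barrels(doc_id, word_hits, lexicon, new_words_log,
                       barrels=None, words_per_barrel=1000):
    """word_hits: {word: [hit, ...]} for one parsed document;
    lexicon: {word: wordID}."""
    if barrels is None:
        barrels = defaultdict(list)            # barrel number -> forward entries
    for word, hits in word_hits.items():
        word_id = lexicon.get(word)
        if word_id is None:
            # Word missing from the base lexicon: log it for later processing.
            new_words_log.append((doc_id, word))
            continue
        barrels[word_id // words_per_barrel].append((doc_id, word_id, hits))
    return barrels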
Sorting
• In order to generate the inverted index, the sorter takes each of the
forward barrels and sorts it by wordID to produce an inverted barrel
for title and anchor hits and a full text inverted barrel.
• This process happens one barrel at a time, thus requiring little
temporary storage.
• Since the barrels don’t fit into main memory, the sorter further
subdivides them into baskets which do fit into memory based on
wordID and docID.
• Then the sorter loads each basket into memory, sorts it, and writes its
contents into the short inverted barrel and the full inverted barrel.
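A minimal sketch of sorting one forward barrel, assuming it is held as a list
of (docID, wordID, hits) tuples: subdivide by wordID range into baskets that
fit in memory, sort each basket, and concatenate the results into an inverted
barrel:

from collections import defaultdict

def sort_barrel(forward_barrel, words_per_basket=100):
    """forward_barrel: list of (doc_id, word_id, hits) tuples."""
    baskets = defaultdict(list)
    for doc_id, word_id, hits in forward_barrel:
        baskets[word_id // words_per_basket].append((word_id, doc_id, hits))
    inverted_barrel = []
    # Baskets are small enough to sort in memory, one at a time.
    for basket_no in sorted(baskets):
        inverted_barrel.extend(sorted(baskets[basket_no],
                                      key=lambda e: (e[0], e[1])))
    return inverted_barrel                     # ordered by (wordID, docID)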
Searching
The Ranking System
• The authors mention that Google maintained much more information
about web documents than typical search engines at that time.
• Additionally, factors such as hits from anchor text and the
PageRank of the document are taken into consideration.
• The rank of a web page is calculated by combining all of this
information.
• The ranking function defined in the paper is such that no particular
factor has too much influence.
• Rather, a weight is assigned to each factor according to its influence
on the rank.
A Single Word Query
• Google looks at that document’s hit list for that word.
• Google considers each hit to be one of several different types (title, anchor,
URL, plain text large font, plain text small font, ...), each of which has its
own type-weight.
• Google counts the number of hits of each type in the hit list.
• Then every count is converted into a count-weight.
• Count-weights increase linearly with counts at first but quickly taper off so that more
than a certain count will not help.
• We take the dot product of the vector of count-weights with the vector of
type-weights to compute an IR score for the document.
• Finally, the IR score is combined with PageRank to give a final rank to the
document.
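A sketch of this scoring scheme, with invented type-weights, an invented
tapering count-weight function, and an assumed way of combining the IR score
with PageRank (the paper does not give the exact values or the combination
function):

import math
from collections import Counter

# Assumed type-weights; the paper does not publish the actual values.
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "plain_large": 3.0, "plain_small": 1.0}

def count_weight(count, cap=8):
    # Grows linearly at first, then tapers off once the count exceeds `cap`.
    return min(count, cap) + math.log1p(max(count - cap, 0))

def ir_score(hit_types):
    """hit_types: list of hit-type labels for one word in one document."""
    counts = Counter(hit_types)
    return sum(count_weight(counts[t]) * w for t, w in TYPE_WEIGHTS.items())

def final_rank(hit_types, pagerank, alpha=0.5):
    # How the IR score and PageRank are combined is not specified in the
    # paper; a weighted sum is just one plausible placeholder.
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank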
A Multi-word Query
• Google scans multiple hitlists at once so that hits occurring close together in a
document are weighted higher than hits occurring far apart.
• The hits from the multiple hit lists are matched up so that nearby hits are
matched together.
• For every matched set of hits, a proximity is computed.
• The proximity is based on how far apart the hits are in the document (or anchor) but is
classified into 10 different value "bins" ranging from a phrase match to "not even close".
• Counts are computed not only for every type of hit but for every type and
proximity.
• Every type and proximity pair has a type-prox-weight.
• The counts are converted into count-weights and we take the dot product of the
count-weights and the type-prox-weights.
• This dot product along with the PageRank gives the final IR score.
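A sketch of the proximity binning, with invented bin boundaries: nearby hit
positions from two hit lists are paired up and each pair is mapped to one of
10 bins, from phrase match to "not even close":

def proximity_bin(pos_a, pos_b, num_bins=10, max_distance=64):
    """Map the distance between two hit positions to a bin: 0 = phrase match,
    num_bins - 1 = "not even close". Boundaries are invented for illustration."""
    distance = abs(pos_a - pos_b)
    if distance <= 1:
        return 0
    if distance >= max_distance:
        return num_bins - 1
    return 1 + (distance - 2) * (num_bins - 2) // (max_distance - 2)

def match_nearby_hits(hits_a, hits_b):
    """Pair each position in hits_a with the nearest position in hits_b."""
    return [(a, min(hits_b, key=lambda b: abs(a - b))) for a in hits_a]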
Feedback
• The ranking function has many parameters like the type-weights and
the type-prox-weights.
• To figure out the right values for these parameters, Google has a user
feedback mechanism.
• A trusted user may optionally evaluate all of the results that are returned.
• This feedback is saved and used to modify the ranking function.
Google Query Evaluation
• Given the data crawled and indexed, we can start running search queries
on it.
• The Google search algorithm runs the following set of steps:
1. Parse the query.
2. Convert words into word ids.
3. Seek to the start of the doclist for every word.
4. Scan through the doclist until there is a document matching all the search terms.
5. Compute the rank of that document for the query.
6. If we are at the end of a doclist, seek to the start of the next doclist and repeat at
step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that match by rank, and return the top k.
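A compressed sketch of these steps over in-memory data structures (real
doclists live on disk and are scanned incrementally; the ranking function is
a placeholder assumption):

import heapq

def evaluate_query(words, lexicon, inverted_index, rank_fn, k=10):
    # Steps 1-2: parse the query and convert words into wordIDs.
    word_ids = [lexicon[w] for w in words if w in lexicon]
    # Step 3: "seek" to the doclist of every word (here: look it up).
    doclists = [sorted(d for d, _ in inverted_index.get(wid, []))
                for wid in word_ids]
    if not doclists or any(not dl for dl in doclists):
        return []
    # Steps 4-7: find the documents that appear in every doclist.
    matches = set(doclists[0]).intersection(*doclists[1:])
    # Steps 5 and 8: rank each match and return the top k by rank.
    scored = [(rank_fn(doc_id, word_ids), doc_id) for doc_id in matches]
    return heapq.nlargest(k, scored)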
References
• https://moz.com/learn/seo/anchor-text
• https://searchdatacenter.techtarget.com/definition/ISAM
• https://www.sciencedaily.com/terms/web_crawler.htm
