Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000
Last Time
Web Search
Directories vs. Search engines
How web search differs from other search
Type of data searched over
Type of searches done
Type of searchers doing search
This probably means people are often using search engines to find starting points
Once at a useful site, they must follow links or use site search
Pretty messy in many cases
Details usually proprietary and fluctuating
Term frequencies
Term proximities
Term position (title, top of page, etc.)
Term characteristics (boldface, capitalized, etc.)
Link analysis information
Category information
Popularity information
Most use a variant of vector space ranking to combine these
Here's how it might work:
Make a vector of weights for each feature
Multiply this by the counts for each feature
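A minimal sketch of how such a weighted combination might work; the feature names, weights, and counts below are illustrative assumptions, not values from any real engine:

# Hypothetical feature weights (these numbers are made up for illustration).
WEIGHTS = {
    "term_frequency": 1.0,   # how often query terms occur in the page
    "term_in_title": 5.0,    # a query term appears in the title
    "term_proximity": 2.0,   # query terms occur near each other
    "inlink_count": 0.5,     # link analysis: pages pointing at this one
}

def score(feature_counts):
    """Rank score = dot product of the weight vector and the page's feature counts."""
    return sum(w * feature_counts.get(name, 0) for name, w in WEIGHTS.items())

# A page with 4 term occurrences, 1 title hit, 2 adjacent query terms, 10 inlinks:
print(score({"term_frequency": 4, "term_in_title": 1, "term_proximity": 2, "inlink_count": 10}))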
High-Precision Ranking
Proximity search can help get high-precision results if the query has more than one term
Hearst 96 paper:
Combine Boolean and passage-level proximity
Shows significant improvements when retrieving the top 5, 10, 20, or 30 documents
Results reproduced by Mitra et al. 98
Google uses something similar
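A rough sketch of this high-precision filter; the tokenization, window size, and example document are assumptions for illustration, not the exact method from the paper:

def passage_match(doc_tokens, query_terms, window=50):
    """Boolean AND plus passage-level proximity: every query term must appear,
    and all of them must co-occur within a window of `window` tokens."""
    positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t] for t in query_terms}
    if any(not p for p in positions.values()):       # Boolean AND: all terms required
        return False
    for start in range(len(doc_tokens)):             # slide a window across the document
        end = start + window
        if all(any(start <= i < end for i in positions[t]) for t in query_terms):
            return True
    return False

doc = "cheap used car prices and other used car listings".split()
print(passage_match(doc, ["used", "prices"], window=5))   # True: both terms within 5 tokens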
Results
Spam
Email Spam:
Undesired content
Web Spam:
Content is disguised as something it is not, in order to
Be retrieved more often than it otherwise would
Be retrieved in contexts in which it otherwise would not be retrieved
Web Spam
Add extra terms to get a higher ranking
Add irrelevant terms to get more hits
  Put a dictionary in the comments field
  Put extra terms in the same color as the background of the web page
Add irrelevant terms to get different types of hits
  Put "sex" in the title field in sites that are selling cars
Add irrelevant links to boost your link analysis ranking
There is a constant arms race between web search companies and spammers
Commercial Issues
General internet search is often commercially driven
The commercial sector sometimes hides things, so it is harder to track than research
On the other hand, most CTOs of search engine companies used to be researchers, and so help us out
Commercial search engine information changes monthly
Sometimes motivations are commercial rather than technical
Goto.com uses payments to determine ranking order
iwon.com gives out prizes
Preprocessing
Collection gathering phase
Web crawling
Online
Query servers
This part is not discussed in the readings
(Diagram: a user query goes to the search engine servers, which look up DocIds in an inverted index)
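The query path in the diagram runs a user query against an inverted index to get back DocIds. A minimal sketch of that lookup, with made-up pages (not the actual engine data structures):

from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the sorted list of DocIds of pages containing it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

pages = {1: "berkeley web search", 2: "web crawler design", 3: "search engine spam"}
index = build_inverted_index(pages)

# Query serving: intersect the posting lists of the query terms to get matching DocIds.
query = ["web", "search"]
print(set(index[query[0]]).intersection(*(index[t] for t in query[1:])))   # {1}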
More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.
Inverted indexes are still used, even though the web is so huge
Some systems partition the indexes across different machines; each machine handles different parts of the data
Other systems duplicate the data across many machines; queries are distributed among the machines
Most do a combination of these
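A toy sketch of combining the two approaches, partitioning by document and replicating each partition; the routing scheme here is an illustrative assumption:

import random

NUM_PARTITIONS = 4             # each partition indexes a different slice of the pages
REPLICAS_PER_PARTITION = 3     # each slice is duplicated so queries can be spread out

def partition_for(doc_id):
    """Partitioning: each document lives in exactly one index partition."""
    return hash(doc_id) % NUM_PARTITIONS

def replicas_to_query():
    """Replication: a query goes to one (randomly chosen) replica of every partition,
    and the per-partition results are merged afterwards."""
    return [(p, random.randrange(REPLICAS_PER_PARTITION)) for p in range(NUM_PARTITIONS)]

print(partition_for("http://www.sims.berkeley.edu/"))
print(replicas_to_query())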
In the example pictured on this slide, the data for the pages is partitioned across machines
Additionally, each partition is allocated multiple machines to handle the queries
Each row can handle 120 queries per second
Each column can handle 7M pages
To handle more queries, add another row
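Back-of-the-envelope arithmetic with the numbers on this slide (120 queries/second per row, 7M pages per column; the target workload below is made up):

import math

QPS_PER_ROW = 120               # each row of machines handles ~120 queries per second
PAGES_PER_COLUMN = 7_000_000    # each column of machines holds ~7M pages

def grid_size(target_qps, total_pages):
    """Rows give query throughput, columns give index capacity; machines = rows * columns."""
    rows = math.ceil(target_qps / QPS_PER_ROW)
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)
    return rows, columns, rows * columns

# e.g. 1,000 queries/second over a 100M-page collection:
print(grid_size(1_000, 100_000_000))   # (9 rows, 15 columns, 135 machines)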
Put high-quality/common pages on many machines
Put lower-quality/less common pages on fewer machines
Query goes to high-quality machines first
If no hits found there, go to other machines
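A toy sketch of this tiered fallback; the tier contents and matching function are assumptions for illustration:

def tiered_search(query, tiers, search_tier):
    """Try the heavily replicated, high-quality tier first; fall back on a miss."""
    for tier in tiers:                      # tiers ordered from highest to lowest quality
        hits = search_tier(tier, query)
        if hits:
            return hits
    return []

tiers = [
    {"popular.html": "web search engines"},   # high-quality/common pages, many machines
    {"obscure.html": "rare topic pages"},      # lower-quality/less common pages, fewer machines
]
find = lambda tier, q: [url for url, text in tier.items() if q in text]
print(tiered_search("rare", tiers, find))      # no hit in tier 1, so the second tier answers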
Web Crawlers
How do the web search engines get all of the items they index? Main idea:
Start with known sites
Record information for these sites
Follow the links from each site
Record information found at new sites
Repeat
Web Crawlers
How do the web search engines get all of the items they index? More precisely:
Put a set of known sites on a queue
Repeat the following until the queue is empty:
  Take the first page off of the queue
  If this page has not yet been processed:
    Record the information found on this page
      Positions of words, links going out, etc.
    Add each link on the current page to the queue
    Record that this page has been processed
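A minimal sketch of this loop; the fetch and link-extraction functions are stand-ins, and a real crawler also needs politeness, robots.txt checks, and error handling:

from collections import deque

def crawl(seed_urls, fetch, extract_links):
    """Breadth-first crawl: fetch(url) returns page text, extract_links(text) returns URLs."""
    queue = deque(seed_urls)    # FIFO queue gives breadth-first; a LIFO stack would give depth-first
    processed = set()
    pages = {}
    while queue:
        url = queue.popleft()               # take the first page off of the queue
        if url in processed:                # skip pages that have already been processed
            continue
        text = fetch(url)
        pages[url] = text                   # record the information found on this page
        queue.extend(extract_links(text))   # add each link on the current page to the queue
        processed.add(url)                  # record that this page has been processed
    return pages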
Structure to be traversed
Breadth-first search
(must be in presentation mode to see this animation)
Depth-first search
(must be in presentation mode to see this animation)
Depth-First Crawling
(Animation frames: example pages on several sites, visited one link at a time in depth-first order)
A file called robots.txt tells the crawler which directories are off limits
Figure out which pages change often; recrawl these often
Convert page contents with a hash function; compare new pages to the hash table to catch duplicates
Problem pages: server unavailable, incorrect HTML, missing links, infinite loops
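A small sketch of two of these mechanisms, checking robots.txt and hashing page contents to spot changed or duplicate pages; the URLs and module choices are mine, not from the lecture:

import hashlib
from urllib import robotparser

# Respect the robots exclusion standard before fetching anything from a site.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()
allowed = rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html")

# Hash page contents so a recrawl can cheaply tell whether a page changed (or is a duplicate).
seen_hashes = {}

def page_changed(url, content):
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    changed = seen_hashes.get(url) != digest
    seen_hashes[url] = digest
    return changed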
Cha-Cha
Start with a list of servers to crawl
Restrict crawl to certain domain(s)
Obey No Robots standard
Follow hyperlinks only
Links are placed on a queue
Traversal is breadth-first
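A tiny sketch of the domain restriction step; the allowed-domain list is an illustrative assumption:

from urllib.parse import urlparse

ALLOWED_DOMAINS = ("berkeley.edu",)    # e.g. restrict the crawl to the campus intranet

def in_scope(url):
    """Keep a link only if its host falls inside one of the allowed domains."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

print(in_scope("http://www.sims.berkeley.edu/courses/"))   # True
print(in_scope("http://www.example.com/"))                 # False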
Summary
Link analysis and proximity of terms seem especially important
This is in contrast to the term-frequency orientation of standard search
Why?
Summary (cont.)
Web crawling
Used to create the collection
Can be guided by quality metrics
Is very difficult to do robustly
Web Coverage
Directory sizes