
SIMS 202 Information Organization and Retrieval

Prof. Marti Hearst and Prof. Ray Larson, UC Berkeley SIMS, Tues/Thurs 9:30-11:00am, Fall 2000


Last Time

Web Search

- Directories vs. search engines
- How web search differs from other search:
  - Type of data searched over
  - Type of searches done
  - Type of searchers doing the search
- Web queries are short
  - This probably means people are often using search engines to find starting points
  - Once at a useful site, they must follow links or use site search
- Web search ranking combines many features

What about Ranking?

- Lots of variation here
  - Pretty messy in many cases
  - Details usually proprietary and fluctuating
- Combining subsets of:
  - Term frequencies
  - Term proximities
  - Term position (title, top of page, etc.)
  - Term characteristics (boldface, capitalized, etc.)
  - Link analysis information
  - Category information
  - Popularity information
- Most use a variant of vector space ranking to combine these
- Here's how it might work:
  - Make a vector of weights for each feature
  - Multiply this by the counts for each feature

From description of the NorthernLight search engine, by Mark Krellenstein http://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm
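A minimal sketch of this weighted combination, with hypothetical feature names and hand-picked weights (nothing here comes from any actual engine):

```python
# Minimal sketch of combining ranking features with a weight vector.
# Feature names and weights are hypothetical, for illustration only.

WEIGHTS = {
    "term_frequency": 1.0,
    "term_in_title": 3.0,
    "term_proximity": 2.0,
    "inbound_links": 1.5,
    "popularity": 0.5,
}

def score(feature_counts: dict[str, float]) -> float:
    """Dot product of the weight vector with a page's feature counts."""
    return sum(WEIGHTS[f] * feature_counts.get(f, 0.0) for f in WEIGHTS)

# Example: a page where the query term appears 4 times, once in the title
print(score({"term_frequency": 4, "term_in_title": 1, "inbound_links": 10}))
# -> 22.0
```

A real engine would tune such weights empirically and combine far more signals; the point is only that ranking reduces to a dot product of weights with per-page feature counts.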

High-Precision Ranking
Proximity search can help get high-precision results if the query has more than one term

Hearst 96 paper:
- Combine Boolean and passage-level proximity
- Shows significant improvements when retrieving the top 5, 10, 20, or 30 documents
- Results reproduced by Mitra et al. 98
- Google uses something similar
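A sketch of the general idea, not the actual algorithm from the paper: first require every query term to appear (the Boolean part), then rank documents by the smallest window of text containing all the terms (the proximity part). The windowing function and toy documents are my own illustration:

```python
# Sketch: Boolean AND filter plus a passage-level proximity score.
# Illustrates the general idea, not the actual 1996 algorithm.

def smallest_window(positions: list[list[int]]) -> float:
    """Smallest span (in words) covering at least one occurrence of
    every query term, via a sliding window over merged positions."""
    events = sorted((p, t) for t, plist in enumerate(positions) for p in plist)
    need = len(positions)
    count: dict[int, int] = {}   # term index -> occurrences inside window
    best = float("inf")
    left = 0
    for right, (pos_r, term_r) in enumerate(events):
        count[term_r] = count.get(term_r, 0) + 1
        while len(count) == need:          # window currently covers all terms
            pos_l, term_l = events[left]
            best = min(best, pos_r - pos_l + 1)
            count[term_l] -= 1
            if count[term_l] == 0:
                del count[term_l]
            left += 1
    return best

def rank(docs: dict[str, list[str]], terms: list[str]) -> list[str]:
    scored = []
    for doc_id, words in docs.items():
        pos = [[i for i, w in enumerate(words) if w == t] for t in terms]
        if all(pos):                       # Boolean AND: every term present
            scored.append((smallest_window(pos), doc_id))
    return [d for _, d in sorted(scored)]  # tighter window ranks higher

docs = {"a": "cheap cars for sale".split(),
        "b": "cars are not cheap anywhere".split()}
print(rank(docs, ["cheap", "cars"]))       # ['a', 'b']: terms adjacent in 'a'
```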

Boolean Formulations, Hearst 96

Results

Spam

Email spam:
- Undesired content

Web spam:
- Content is disguised as something it is not, in order to:
  - Be retrieved more often than it otherwise would be
  - Be retrieved in contexts in which it otherwise would not be retrieved

Web Spam

What are the types of Web spam?

- Add extra terms to get a higher ranking
  - e.g., repeat "cars" thousands of times
- Add irrelevant terms to get more hits
  - Put a dictionary in the comments field
  - Put extra terms in the same color as the background of the web page
- Add irrelevant terms to get different types of hits
  - e.g., put "sex" in the title field of sites that are selling cars
- Add irrelevant links to boost your link analysis ranking

There is a constant arms race between web search companies and spammers

Commercial Issues
- General internet search is often commercially driven
  - The commercial sector sometimes hides things; this is harder to track than research
  - On the other hand, most CTOs of search engine companies used to be researchers, and so help us out
  - Commercial search engine information changes monthly
- Sometimes motivations are commercial rather than technical
  - Goto.com uses payments to determine ranking order
  - iwon.com gives out prizes

Web Search Architecture

Preprocessing:
- Collection gathering phase
  - Web crawling
- Collection indexing phase

Online:
- Query servers
- (This part is not covered in the readings)

From description of the FAST search engine, by Knut Risvik http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

Standard Web Search Engine Architecture


[Diagram: crawl the web → check for duplicates, store the documents (assigning DocIds) → create an inverted index; at query time, a user query goes to the search engine servers, which consult the inverted index and show results to the user]

More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.
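The "create an inverted index" step above can be sketched in a few lines. This toy version (my illustration, not the Brin & Page data structures) maps each word to the list of (DocId, position) postings where it occurs:

```python
# Toy inverted index: word -> list of (doc_id, position) postings.
# Illustration only; real engines compress and partition these lists.
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

index = build_index({1: "cheap cars for sale", 2: "cars cars cars"})
print(index["cars"])   # [(1, 1), (2, 0), (2, 1), (2, 2)]
```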

Inverted Indexes for Web Search Engines


- Inverted indexes are still used, even though the web is so huge
- Some systems partition the index across different machines; each machine handles a different part of the data
- Other systems duplicate the data across many machines; queries are distributed among the machines
- Most do a combination of these

In this example, the data for the pages is partitioned across machines, and each partition is allocated multiple machines to handle the queries:
- Each row can handle 120 queries per second
- Each column can handle 7M pages
- To handle more queries, add another row

From description of the FAST search engine, by Knut Risvik http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
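A sketch of that grid layout with made-up machine names and sizes: columns partition the pages, rows replicate each partition, and each query is broadcast to every column of one replica row:

```python
# Sketch of a partitioned + replicated index grid (hypothetical numbers).
# Columns partition the document space; rows are replicas for throughput.
import itertools

N_ROWS, N_COLS = 3, 4            # 3 replicas of a 4-way partition
grid = [[f"machine-r{r}c{c}" for c in range(N_COLS)] for r in range(N_ROWS)]
rr = itertools.cycle(range(N_ROWS))   # round-robin across replica rows

def route(query: str) -> list[str]:
    """Pick one replica row, broadcast the query to all its columns."""
    row = next(rr)
    return [grid[row][c] for c in range(N_COLS)]  # each column searches its own pages

print(route("cheap cars"))
# e.g. ['machine-r0c0', 'machine-r0c1', 'machine-r0c2', 'machine-r0c3']
```

Adding a row multiplies query throughput; adding a column grows the number of pages that can be indexed.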

Cascading Allocation of CPUs

A variation on this that produces a cost savings (see the sketch below):

- Put high-quality/common pages on many machines
- Put lower-quality/less common pages on fewer machines
- The query goes to the high-quality machines first
- If no hits are found there, go to the other machines
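A minimal sketch of that fallback, assuming two hypothetical tiers and a stub standing in for a real distributed search:

```python
# Sketch: cascading a query across machine tiers (hypothetical setup).
# Tier 0 holds high-quality/common pages on many machines;
# tier 1 holds the rest on fewer machines.

TIERS = [
    {"docs": {"popular-page": "cheap cars for sale"}},    # many machines
    {"docs": {"obscure-page": "rare vintage car parts"}}  # fewer machines
]

def search_tier(tier: dict, term: str) -> list[str]:
    """Stand-in for a real distributed search over one tier."""
    return [doc for doc, text in tier["docs"].items() if term in text]

def cascading_search(term: str) -> list[str]:
    for tier in TIERS:            # try the high-quality tier first
        hits = search_tier(tier, term)
        if hits:                  # only fall through when nothing is found
            return hits
    return []

print(cascading_search("vintage"))  # falls through to the second tier
```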

Web Crawlers
How do the web search engines get all of the items they index?

Main idea:
- Start with known sites
- Record information for these sites
- Follow the links from each site
- Record information found at new sites
- Repeat

Web Crawlers

How do the web search engines get all of the items they index?

More precisely (a sketch of this loop follows below):
- Put a set of known sites on a queue
- Repeat the following until the queue is empty:
  - Take the first page off of the queue
  - If this page has not yet been processed:
    - Record the information found on this page
      - Positions of words, links going out, etc.
    - Add each link on the current page to the queue
    - Record that this page has been processed

In what order should the links be followed?
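A sketch of the loop just described, with a fake link graph standing in for real page fetching and link extraction:

```python
# Sketch of the crawl loop above. `fetch_links` is a stand-in for
# fetching a page and extracting its outgoing links.
from collections import deque

LINKS = {  # a tiny fake web for the example
    "site-a": ["site-b", "site-c"],
    "site-b": ["site-a"],
    "site-c": [],
}

def fetch_links(url: str) -> list[str]:
    return LINKS.get(url, [])

def crawl(seeds: list[str]) -> list[str]:
    queue = deque(seeds)          # known sites go on the queue
    processed = set()             # records which pages are done
    order = []
    while queue:                  # repeat until the queue is empty
        page = queue.popleft()    # take the first page off the queue
        if page in processed:
            continue
        order.append(page)        # "record the information" (just the URL here)
        queue.extend(fetch_links(page))  # add each outgoing link
        processed.add(page)
    return order

print(crawl(["site-a"]))  # ['site-a', 'site-b', 'site-c']
```

Because pages come off the front of the queue, this particular sketch follows links breadth-first.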

Page Visit Order

Animated examples of breadth-first vs. depth-first search on trees (animations play in presentation mode):
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

[Animation frames: the structure to be traversed, breadth-first search, depth-first search]


Depth-First Crawling (more complex graphs & sites)

[Diagram: several sites (Site 1, 2, 3, 5, 6), each with a few pages, linked within and across sites. A depth-first crawler visits the pages in this order:]

Site  Page
1     1
1     2
1     4
1     6
1     3
1     5
3     1
5     1
6     1
5     2
2     1
2     2
2     3

Breadth-First Crawling (more complex graphs & sites)

[Diagram: the same sites and pages. A breadth-first crawler visits them in this order:]

Site  Page
1     1
2     1
1     2
1     6
1     3
2     2
2     3
1     4
3     1
1     5
5     1
5     2
6     1
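The two visit orders above come from the same loop; the only difference is the data structure holding the unvisited links. A FIFO queue gives breadth-first order and a LIFO stack gives depth-first, as this sketch over a toy graph (my example, not the sites in the figures) shows:

```python
# Breadth-first vs. depth-first page visit order: same loop,
# different data structure for the unvisited links. Toy graph only.
from collections import deque

LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}

def visit_order(start: str, breadth_first: bool) -> list[str]:
    frontier = deque([start])
    seen, order = {start}, []
    while frontier:
        # FIFO queue -> breadth-first; LIFO stack -> depth-first
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(visit_order("a", breadth_first=True))   # ['a', 'b', 'c', 'd', 'e']
print(visit_order("a", breadth_first=False))  # ['a', 'c', 'e', 'b', 'd']
```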

Web Crawling Issues

- "Keep out" signs
  - A file called robots.txt tells the crawler which directories are off limits
- Freshness
  - Figure out which pages change often
  - Recrawl these often
- Duplicates, virtual hosts, etc.
  - Convert page contents with a hash function
  - Compare new pages to the hash table (sketched below)
- Lots of problems
  - Server unavailable
  - Incorrect HTML
  - Missing links
  - Infinite loops

Web crawling is difficult to do robustly!
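A sketch of the hashing idea for catching duplicates, using MD5 purely for illustration:

```python
# Sketch: detect duplicate pages by hashing their contents.
# MD5 is used here purely for illustration.
import hashlib

seen_hashes: dict[str, str] = {}   # content hash -> first URL seen with it

def fingerprint(content: str) -> str:
    return hashlib.md5(content.encode("utf-8")).hexdigest()

def check_page(url: str, content: str) -> str:
    h = fingerprint(content)
    if h in seen_hashes and seen_hashes[h] != url:
        return f"duplicate of {seen_hashes[h]}"
    seen_hashes.setdefault(h, url)
    return "new or changed content"

print(check_page("http://a.example/", "hello web"))
print(check_page("http://mirror.example/", "hello web"))  # flagged as duplicate
```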

Cha-Cha

Cha-Cha searches an intranet
- Sites associated with an organization

Instead of hand-edited categories:
- Computes the shortest path from the root for each hit
- Organizes search results according to which subdomain the pages are found in
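The shortest path from the root can be computed with a breadth-first search over the crawled link graph. A sketch with a toy graph (the graph and function are my illustration, not Cha-Cha's code):

```python
# Sketch: shortest hyperlink path from the root page to each hit,
# via breadth-first search over a toy crawled link graph.
from collections import deque

LINKS = {
    "www.berkeley.edu": ["cs.berkeley.edu", "sims.berkeley.edu"],
    "sims.berkeley.edu": ["sims.berkeley.edu/courses"],
    "cs.berkeley.edu": [],
    "sims.berkeley.edu/courses": [],
}

def shortest_paths(root: str) -> dict[str, list[str]]:
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for link in LINKS.get(page, []):
            if link not in paths:     # first visit = shortest path under BFS
                paths[link] = paths[page] + [link]
                queue.append(link)
    return paths

print(shortest_paths("www.berkeley.edu")["sims.berkeley.edu/courses"])
# ['www.berkeley.edu', 'sims.berkeley.edu', 'sims.berkeley.edu/courses']
```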

Cha-Cha Web Crawling Algorithm


- Start with a list of servers to crawl
  - For UCB, simply start with www.berkeley.edu
- Restrict the crawl to certain domain(s)
  - *.berkeley.edu
- Obey the No Robots standard (see the sketch below)
- Follow hyperlinks only
  - Links are placed on a queue
  - Traversal is breadth-first
  - Do not read local filesystems

See the first lecture or the technical papers for more information
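Python's standard library includes a parser for the No Robots standard; a minimal sketch of how a crawler might consult it (the URLs and user-agent name are placeholders):

```python
# Sketch: obeying the No Robots standard with Python's stdlib parser.
# The URLs and user-agent name below are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.berkeley.edu/robots.txt")
rp.read()   # fetch and parse the server's robots.txt file

# Only enqueue URLs the crawler is allowed to fetch
if rp.can_fetch("MyCrawler", "http://www.berkeley.edu/some/page.html"):
    print("allowed to crawl")
else:
    print("directory is off limits")
```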

Summary

Web search differs from traditional IR systems:
- Different kind of collection
- Different kinds of users/queries
- Different economic motivations

Ranking combines many features in a difficult-to-specify manner:
- Link analysis and proximity of terms seem especially important
- This is in contrast to the term-frequency orientation of standard search
- Why?

Summary (cont.)

Web search engine architecture:
- Similar in many ways to standard IR
- Indexes usually duplicated across machines to handle many queries quickly

Web crawling:
- Used to create the collection
- Can be guided by quality metrics
- Is very difficult to do robustly

Web Search Statistics

Searches per Day

Data missing for fast.com, Excite, Northern Light, etc.

Information from searchenginewatch.com

Web Search Engine Visits

[Chart: percentage of web users who visit the site shown]

Information from searchenginewatch.com

Search Engine Size (July 2000)

Information from searchenginewatch.com

Does size matter? You can't access many hits anyhow.

Information from searchenginewatch.com

Increasing numbers of indexed pages, self-reported

Information from searchenginewatch.com

Increasing numbers of indexed pages (more recent), self-reported

Information from searchenginewatch.com

Web Coverage

Information from searchenginewatch.com

From description of the FAST search engine, by Knut Risvik http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

Directory sizes

Information from searchenginewatch.com
