Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000
Last Time
Web Search
Directories vs. Search engines
How web search differs from other search
Type of data searched over
Type of searches done
Type of searchers doing search
This probably means people are often using search engines to find starting points
Once at a useful site, they must follow links or use site search
Pretty messy in many cases
Details usually proprietary and fluctuating
Term frequencies
Term proximities
Term position (title, top of page, etc.)
Term characteristics (boldface, capitalized, etc.)
Link analysis information
Category information
Popularity information
Most use a variant of vector space ranking to combine these
Here's how it might work:
Make a vector of weights for each feature
Multiply this by the counts for each feature
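A minimal sketch of how such a weighted combination might work; the feature names, weights, and counts below are illustrative assumptions, not values from any real engine:

# Hypothetical feature weights (these numbers are made up for illustration).
WEIGHTS = {
    "term_frequency": 1.0,   # how often query terms occur in the page
    "term_in_title": 5.0,    # a query term appears in the title
    "term_proximity": 2.0,   # query terms occur near each other
    "inlink_count": 0.5,     # link analysis: pages pointing at this one
}

def score(feature_counts):
    """Rank score = dot product of the weight vector and the page's feature counts."""
    return sum(w * feature_counts.get(name, 0) for name, w in WEIGHTS.items())

# A page with 4 term occurrences, 1 title hit, 2 adjacent query terms, 10 inlinks:
print(score({"term_frequency": 4, "term_in_title": 1, "term_proximity": 2, "inlink_count": 10}))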
High-Precision Ranking
Proximity search can help get high-precision results if the query has more than one term
Hearst 96 paper:
Combine Boolean and passage-level proximity
Shows significant improvements when retrieving the top 5, 10, 20, or 30 documents
Results reproduced by Mitra et al. 98
Google uses something similar
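A rough sketch of this high-precision filter; the tokenization, window size, and example document are assumptions for illustration, not the exact method from the paper:

def passage_match(doc_tokens, query_terms, window=50):
    """Boolean AND plus passage-level proximity: every query term must appear,
    and all of them must co-occur within a window of `window` tokens."""
    positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t] for t in query_terms}
    if any(not p for p in positions.values()):       # Boolean AND: all terms required
        return False
    for start in range(len(doc_tokens)):             # slide a window across the document
        end = start + window
        if all(any(start <= i < end for i in positions[t]) for t in query_terms):
            return True
    return False

doc = "cheap used car prices and other used car listings".split()
print(passage_match(doc, ["used", "prices"], window=5))   # True: both terms within 5 tokens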
Results
Spam
Email Spam:
Undesired content
Web Spam:
Content is disguised as something it is not, in order to
Be retrieved more often than it otherwise would
Be retrieved in contexts in which it otherwise would not be retrieved
Web Spam
Add extra terms to get a higher ranking
Add irrelevant terms to get more hits
  Put a dictionary in the comments field
  Put extra terms in the same color as the background of the web page
Add irrelevant terms to get different types of hits
  Put "sex" in the title field in sites that are selling cars
Add irrelevant links to boost your link analysis ranking
There is a constant arms race between web search companies and spammers
Commercial Issues
General internet search is often commercially driven
The commercial sector sometimes hides things, so it is harder to track than research
On the other hand, most CTOs of search engine companies used to be researchers, and so help us out
Commercial search engine information changes monthly
Sometimes motivations are commercial rather than technical
Goto.com uses payments to determine ranking order
iwon.com gives out prizes
Preprocessing
Collection gathering phase
Web crawling
Online
Query servers
This part is not discussed in the readings
(Diagram: a user query goes to the search engine servers, which look up DocIds in an inverted index)
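The query path in the diagram runs a user query against an inverted index to get back DocIds. A minimal sketch of that lookup, with made-up pages (not the actual engine data structures):

from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the sorted list of DocIds of pages containing it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

pages = {1: "berkeley web search", 2: "web crawler design", 3: "search engine spam"}
index = build_inverted_index(pages)

# Query serving: intersect the posting lists of the query terms to get matching DocIds.
query = ["web", "search"]
print(set(index[query[0]]).intersection(*(index[t] for t in query[1:])))   # {1}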
More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.
Inverted indexes are still used, even though the web is so huge
Some systems partition the indexes across different machines; each machine handles different parts of the data
Other systems duplicate the data across many machines; queries are distributed among the machines
Most do a combination of these
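A toy sketch of combining the two approaches, partitioning by document and replicating each partition; the routing scheme here is an illustrative assumption:

import random

NUM_PARTITIONS = 4             # each partition indexes a different slice of the pages
REPLICAS_PER_PARTITION = 3     # each slice is duplicated so queries can be spread out

def partition_for(doc_id):
    """Partitioning: each document lives in exactly one index partition."""
    return hash(doc_id) % NUM_PARTITIONS

def replicas_to_query():
    """Replication: a query goes to one (randomly chosen) replica of every partition,
    and the per-partition results are merged afterwards."""
    return [(p, random.randrange(REPLICAS_PER_PARTITION)) for p in range(NUM_PARTITIONS)]

print(partition_for("http://www.sims.berkeley.edu/"))
print(replicas_to_query())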
In the example pictured on this slide, the data for the pages is partitioned across machines
Additionally, each partition is allocated multiple machines to handle the queries
Each row can handle 120 queries per second
Each column can handle 7M pages
To handle more queries, add another row
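Back-of-the-envelope arithmetic with the numbers on this slide (120 queries/second per row, 7M pages per column; the target workload below is made up):

import math

QPS_PER_ROW = 120               # each row of machines handles ~120 queries per second
PAGES_PER_COLUMN = 7_000_000    # each column of machines holds ~7M pages

def grid_size(target_qps, total_pages):
    """Rows give query throughput, columns give index capacity; machines = rows * columns."""
    rows = math.ceil(target_qps / QPS_PER_ROW)
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)
    return rows, columns, rows * columns

# e.g. 1,000 queries/second over a 100M-page collection:
print(grid_size(1_000, 100_000_000))   # (9 rows, 15 columns, 135 machines)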
Put high-quality/common pages on many machines
Put lower-quality/less common pages on fewer machines
Query goes to high-quality machines first
If no hits found there, go to other machines
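A toy sketch of this tiered fallback; the tier contents and matching function are assumptions for illustration:

def tiered_search(query, tiers, search_tier):
    """Try the heavily replicated, high-quality tier first; fall back on a miss."""
    for tier in tiers:                      # tiers ordered from highest to lowest quality
        hits = search_tier(tier, query)
        if hits:
            return hits
    return []

tiers = [
    {"popular.html": "web search engines"},   # high-quality/common pages, many machines
    {"obscure.html": "rare topic pages"},      # lower-quality/less common pages, fewer machines
]
find = lambda tier, q: [url for url, text in tier.items() if q in text]
print(tiered_search("rare", tiers, find))      # no hit in tier 1, so the second tier answers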
Web Crawlers
How do the web search engines get all of the items they index? Main idea:
Start with known sites
Record information for these sites
Follow the links from each site
Record information found at new sites
Repeat
Web Crawlers
How do the web search engines get all of the items they index? More precisely:
Put a set of known sites on a queue
Repeat the following until the queue is empty:
  Take the first page off of the queue
  If this page has not yet been processed:
    Record the information found on this page
      Positions of words, links going out, etc.
    Add each link on the current page to the queue
    Record that this page has been processed
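A minimal sketch of this loop; the fetch and link-extraction functions are stand-ins, and a real crawler also needs politeness, robots.txt checks, and error handling:

from collections import deque

def crawl(seed_urls, fetch, extract_links):
    """Breadth-first crawl: fetch(url) returns page text, extract_links(text) returns URLs."""
    queue = deque(seed_urls)    # FIFO queue gives breadth-first; a LIFO stack would give depth-first
    processed = set()
    pages = {}
    while queue:
        url = queue.popleft()               # take the first page off of the queue
        if url in processed:                # skip pages that have already been processed
            continue
        text = fetch(url)
        pages[url] = text                   # record the information found on this page
        queue.extend(extract_links(text))   # add each link on the current page to the queue
        processed.add(url)                  # record that this page has been processed
    return pages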
Structure to be traversed
Breadth-first search
(must be in presentation mode to see this animation)
Depth-first search
(must be in presentation mode to see this animation)
Depth-First Crawling
(Animation frames: example pages on several sites, visited one link at a time in depth-first order)
A file called robots.txt tells the crawler which directories are off limits
Figure out which pages change often; recrawl these often
Convert page contents with a hash function; compare new pages to the hash table to catch duplicates
Problem pages: server unavailable, incorrect HTML, missing links, infinite loops
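A small sketch of two of these mechanisms, checking robots.txt and hashing page contents to spot changed or duplicate pages; the URLs and module choices are mine, not from the lecture:

import hashlib
from urllib import robotparser

# Respect the robots exclusion standard before fetching anything from a site.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()
allowed = rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html")

# Hash page contents so a recrawl can cheaply tell whether a page changed (or is a duplicate).
seen_hashes = {}

def page_changed(url, content):
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    changed = seen_hashes.get(url) != digest
    seen_hashes[url] = digest
    return changed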
Cha-Cha
Start with a list of servers to crawl
Restrict crawl to certain domain(s)
Obey No Robots standard
Follow hyperlinks only
Links are placed on a queue
Traversal is breadth-first
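A tiny sketch of the domain restriction step; the allowed-domain list is an illustrative assumption:

from urllib.parse import urlparse

ALLOWED_DOMAINS = ("berkeley.edu",)    # e.g. restrict the crawl to the campus intranet

def in_scope(url):
    """Keep a link only if its host falls inside one of the allowed domains."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

print(in_scope("http://www.sims.berkeley.edu/courses/"))   # True
print(in_scope("http://www.example.com/"))                 # False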
Summary
Link analysis and proximity of terms seem especially important
This is in contrast to the term-frequency orientation of standard search
Why?
Summary (cont.)
Web crawling
Used to create the collection
Can be guided by quality metrics
Is very difficult to do robustly
Web Coverage
Directory sizes