Databases
@Carnegie Mellon
Web Crawling

One important application (our focus): search
- Topic-specific search engines
- General-purpose search engines
[Figure: a crawler fetches pages from the WWW into a repository; an index built over the repository serves user search queries]
Out-of-date Repository

Many Web pages change frequently:
- 23% of Web pages change daily
- 40% of commercial Web pages change daily
- Others: objective function = avg. freshness or age
- Our goal: focus directly on the impact on users
Quality of the result list for query q:

  Quality_q = Σ_{documents D} ViewProb(D, q) × Relevance(D, q)

Given a workload W of user queries, with freq_q the frequency of query q and K the total query frequency:

  Average quality = (1/K) × Σ_{queries q ∈ W} freq_q × Quality_q
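The metric above can be written as a short sketch; the function and variable names here are illustrative stand-ins, not code from the paper:

```python
def quality(q, docs, view_prob, relevance):
    # Quality_q = sum over documents D of ViewProb(D, q) * Relevance(D, q)
    return sum(view_prob(d, q) * relevance(d, q) for d in docs)

def average_quality(workload, docs, view_prob, relevance):
    # workload: dict mapping query -> freq_q; K normalizes by total frequency
    K = sum(workload.values())
    return sum(f * quality(q, docs, view_prob, relevance)
               for q, f in workload.items()) / K
```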
Viewing Likelihood

- Depends primarily on rank in the result list [Joachims KDD02]
- From AltaVista data [Lempel et al. WWW03]:

  ViewProbability(r) ∝ r^(-1.5)

[Figure: probability of viewing vs. rank, falling off as a power law]
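The power-law fit on this slide can be turned into a concrete probability. Normalizing over the top-N ranks is my assumption; the slide only gives the proportionality:

```python
def view_probability(r, n=100, gamma=1.5):
    # ViewProbability(r) ∝ r^(-gamma), normalized over ranks 1..n (assumption)
    z = sum(rank ** -gamma for rank in range(1, n + 1))
    return (r ** -gamma) / z
```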
Relevance Scoring

- The search engine's internal notion of how well a document matches a query
- Each document/query pair is assigned a numerical score in [0, 1]
- A combination of many factors, including:
  - Vector-space similarity (e.g., TF.IDF cosine metric)
  - Link-based factors (e.g., PageRank)
  - Anchortext of referring pages
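A minimal sketch of the vector-space factor named on the slide; combining it with a link-based factor as a weighted sum is my assumption, not the paper's exact formula:

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    # term frequency weighted by inverse document frequency
    tf = Counter(tokens)
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def score(doc_tokens, query_tokens, idf, pagerank, w_text=0.7, w_link=0.3):
    # hypothetical weighted combination; both components assumed in [0, 1]
    sim = cosine(tfidf_vector(doc_tokens, idf), tfidf_vector(query_tokens, idf))
    return w_text * sim + w_link * pagerank
```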
(Caveat)
- Scores are normally only used for relative ranking
- Need to craft the scoring function carefully
Measuring Quality

  Avg. Quality = (1/K) × Σ_{q ∈ W} freq_q × Σ_D ViewProb(D, q) × Relevance(D, q)

- freq_q and ViewProb: from usage logs
- Relevance: scoring function over the live copy of D
- ViewProb(r) is monotonically nonincreasing in rank r
- Quality is maximized when the ranking function orders documents in descending order of relevance
- An out-of-date repository scrambles the ranking and lowers quality

Let Q_D = the loss in quality due to inaccurate information about D
(equivalently, the improvement in quality if we (re)download D)
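For a single query, the quality change can be sketched as the shift in view probability between the document's stale rank and its correct (live) rank; the names are mine, and the paper aggregates this over the whole workload:

```python
def quality_loss(relevance_d, live_rank, stale_rank, view_prob):
    # Q_D contribution for one query: D is viewed with the probability of
    # its stale rank instead of its correct (live) rank
    return relevance_d * (view_prob(live_rank) - view_prob(stale_rank))
```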
Download Prioritization

Idea: given Q_D for each document, prioritize (re)downloading accordingly.

Q: How to measure Q_D? Two difficulties:
1. The live copy is unavailable
2. Even given both the live and repository copies of D, measuring Q_D may require computing the ranks of all documents for all queries

Approach: (1) estimate Q_D for past versions, (2) forecast current Q_D
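The prioritization step could be realized with a max-heap keyed by forecasted Q_D; this is a sketch, and heapq with negated keys is an implementation detail I'm assuming, not the paper's system:

```python
import heapq

def prioritize(forecasted_qd):
    """Yield documents in descending order of forecasted Q_D.

    forecasted_qd: dict mapping url -> forecasted quality gain Q_D.
    heapq is a min-heap, so gains are negated to pop the largest first.
    """
    heap = [(-qd, url) for url, qd in forecasted_qd.items()]
    heapq.heapify(heap)
    while heap:
        _, url = heapq.heappop(heap)
        yield url
```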
Overhead of Estimating Q_D

- Estimate while updating the inverted index
Forecast Future Q_D

- Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
- Queries: AltaVista query log

[Figure: avg. weekly Q_D over the first 24 weeks vs. the second 24 weeks]
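The slide forecasts current Q_D from past weekly estimates; since the exact forecasting model isn't shown here, an exponentially weighted moving average serves as a simple stand-in:

```python
def forecast(past_qd, alpha=0.5):
    """Forecast the next Q_D from past weekly estimates (EWMA stand-in).

    past_qd: chronological list of past weekly Q_D estimates.
    alpha: weight on the most recent observation (assumed parameter).
    """
    est = past_qd[0]
    for x in past_qd[1:]:
        est = alpha * x + (1 - alpha) * est
    return est
```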
Summary

- Estimate
- Forecast
- Prioritize
Overall Effectiveness

Compared metrics:
- Staleness = fraction of out-of-date documents [Cho et al. 2000]
- Embarrassment = probability that a user visits an irrelevant result [Wolf et al. 2002]

[Figure: effectiveness vs. resource requirement]
Example (boston.com): tagged as important by the shingling measure, although it did not match many queries in the workload
(washingtonpost.com)
Focused crawling [Chakrabarti, many others]: select a subset of pages that match user interests.

Our work: given a set of pages, decide when to (re)download each, based on predicted content shifts + user interests.
Summary

- Crawling: an optimization problem
- Objective: maximize quality as perceived by users
- Approach:
  - Measure Q_D using the query workload and usage logs
  - Prioritize downloading based on forecasted Q_D
- Various reasons for improvement:
  - Accounts for false positives and negatives
  - Does not rely on the size of a text change to estimate importance
  - Does not always ignore frequently updated pages
THE END
Paper
- [Wolf et al., WWW02]: maximize weighted avg. freshness for a fixed set of docs; document weights determined by the probability of embarrassment
- User-centric crawling: which queries are affected by a change, and by how much?
  - Change A: significantly alters relevance to several common queries
  - Change B: only affects relevance to infrequent queries, and not by much
- Example: a doc ranked #1000 for a popular query that should be ranked #2 causes small embarrassment but a big loss in quality
Inverted Index

Doc1: "Seminar: Cancer Symptoms"

Word     | Postings (doc, term frequency)
---------|-------------------------------
Cancer   | Doc1 (1), Doc5 (1), Doc6 (1)
Seminar  | Doc1 (1)
Symptoms | Doc1 (1), Doc4 (3), Doc8 (2)
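The slide's table maps directly onto a dictionary of postings lists; a toy sketch, where the tokenization details are mine:

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: word -> list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = {}
        for word in text.lower().replace(":", " ").split():
            counts[word] = counts.get(word, 0) + 1
        for word, tf in counts.items():
            index[word].append((doc_id, tf))
    return index

# Matches the slide's example document
index = build_index({"Doc1": "Seminar: Cancer Symptoms"})
```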
[Slide fragment: postings for "Cancer" — Doc7 (2), Doc9 (1)]
Estimating Q_D during index maintenance:
1. Compute the previous and new scores of the downloaded document while updating its postings
2. Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our experiments)
3. Compute the previous and new ranks (approximately) using the computed scores and the score-to-rank mapping
4. Measure Q_D using the previous and new ranks (by applying an approximate function derived in the paper)
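The score-to-rank mapping in step 2 could be kept as a few sampled (score, rank) points per query term; the sampling and lookup scheme below is my assumption, not the paper's exact data structure:

```python
import bisect

class ScoreToRank:
    """Approximate score-to-rank mapping for one query term."""

    def __init__(self, samples):
        # samples: (score, rank) pairs; a higher score implies a smaller rank
        pts = sorted(samples)                  # ascending by score
        self._scores = [s for s, _ in pts]
        self._ranks = [r for _, r in pts]

    def rank(self, score):
        # snap to the nearest sampled score at or above the given score
        i = bisect.bisect_left(self._scores, score)
        if i == len(self._scores):
            return self._ranks[-1]             # above the top sample: best rank
        return self._ranks[i]
```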
Out-of-date Repository