
User-Centric Web Crawling

Sandeep Pandey & Christopher Olston, Carnegie Mellon University


Web Crawling

One important application (our focus): search
  Topic-specific search engines + general-purpose ones

[Diagram: user issues search queries against the index/repository; crawler fetches pages from the WWW]


Out-of-date Repository

The Web is always changing [Arasu et al., TOIT 2001]:
  23% of Web pages change daily
  40% of commercial Web pages change daily

Many problems may arise due to an out-of-date repository:
  Hurts both precision and recall


Web Crawling Optimization Problem

Not enough resources to (re)download every web document every day/hour

Must pick and choose → an optimization problem
  Others: objective function = avg. freshness, age
  Our goal: focus directly on the impact on users

[Diagram: user issues search queries against the index/repository; crawler fetches pages from the WWW]


Web Search User Interface

1. User enters keywords
2. Search engine returns a ranked list of result documents
3. User visits a subset of the results


Objective: Maximize Repository Quality (as perceived by users)

Suppose a user issues search query q:

  Quality_q = Σ_{documents D} (likelihood of viewing D) × (relevance of D to q)

Given a workload W of user queries:

  Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality_q)
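Below is a minimal sketch (not part of the original slides) of how this metric could be computed for a query workload; the r^-1.5 viewing model anticipates a later slide, and all names and data layouts are illustrative assumptions.

```python
def view_probability(rank, gamma=1.5):
    """Assumed rank-based viewing model: likelihood of viewing decays as rank^-gamma."""
    return rank ** -gamma

def quality_of_query(ranked_docs, relevance):
    """Quality_q = sum over documents D of (likelihood of viewing D) x (relevance of D to q)."""
    return sum(view_probability(r) * relevance[d]
               for r, d in enumerate(ranked_docs, start=1))

def average_quality(workload):
    """workload: list of (freq_q, ranked_docs, relevance_for_q) tuples, one per query q."""
    total_freq = sum(freq for freq, _, _ in workload)   # plays the role of the normalizer K
    return sum(freq * quality_of_query(docs, rel)
               for freq, docs, rel in workload) / total_freq
```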


Viewing Likelihood

Depends primarily on rank in the result list [Joachims, KDD 2002]

From AltaVista data [Lempel et al., WWW 2003]:
  ViewProbability(r) ∝ r^-1.5

[Plot: probability of viewing vs. rank, dropping off sharply over ranks 0-150]


Relevance Scoring Function

Search engine's internal notion of how well a document matches a query

Each document/query pair → numerical score in [0, 1]

Combination of many factors, including:
  Vector-space similarity (e.g., TF.IDF cosine metric)
  Link-based factors (e.g., PageRank)
  Anchortext of referring pages
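As an illustration only (this is not the engine's actual formula), a toy scoring function might mix TF.IDF cosine similarity with a link-based prior such as PageRank; the helper names and the mixing weight alpha are assumptions.

```python
import math

def tfidf_cosine(query_terms, doc_tf, idf):
    """Cosine similarity between a query and a document in TF.IDF vector space."""
    doc_vec = {t: tf * idf.get(t, 0.0) for t, tf in doc_tf.items()}
    query_vec = {t: idf.get(t, 0.0) for t in query_terms}
    dot = sum(doc_vec.get(t, 0.0) * w for t, w in query_vec.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    return dot / (doc_norm * query_norm) if doc_norm and query_norm else 0.0

def relevance_score(query_terms, doc_tf, idf, pagerank, alpha=0.7):
    """Toy D/Q score in [0, 1]: a weighted mix of text similarity and a link-based prior.
    The weight alpha is an arbitrary illustrative choice."""
    return min(1.0, alpha * tfidf_cosine(query_terms, doc_tf, idf) + (1 - alpha) * pagerank)
```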


(Caveat)

We use the scoring function for absolute relevance:
  Normally it is only used for relative ranking
  Need to craft the scoring function carefully


Measuring Quality

Avg. Quality = Σ_q (freq_q × Σ_D (likelihood of viewing D) × (relevance of D to q))

  freq_q: taken from query logs
  likelihood of viewing D: ViewProb( Rank(D, q) ), estimated from usage logs, with ranks computed by the scoring function over the (possibly stale) repository
  relevance of D to q: scoring function over the live copy of D


Lessons from Quality Metric

Avg. Quality = Σ_q (freq_q × Σ_D ViewProb( Rank(D, q) ) × (relevance of D to q))

ViewProb(r) is monotonically nonincreasing
Quality is maximized when the ranking function orders documents in descending order of relevance
An out-of-date repository scrambles the ranking → lowers quality

Let QD = the loss in quality due to inaccurate information about D
  (alternatively: the improvement in quality if we (re)download D)


QD: Improvement in Quality

[Diagram: Web copy of D (fresh) --REDOWNLOAD--> Repository copy of D (stale); after the re-download, Repository Quality += QD]


Download Prioritization

Idea: given QD for each document, prioritize (re)downloading accordingly (sketch below)

Q: How to measure QD? Two difficulties:
  1. The live copy is unavailable
  2. Even given both the live and repository copies of D, measuring QD may require computing the ranks of all documents for all queries

Approach: (1) estimate QD for past versions, (2) forecast the current QD
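A minimal sketch of the prioritization step, assuming forecasted QD values are already available; the top-k policy and names are illustrative, not the paper's exact scheduler.

```python
import heapq

def pick_downloads(forecasted_gain, budget):
    """Given forecasted QD values (dict: doc -> predicted quality gain), choose the
    `budget` documents whose (re)download is expected to improve quality the most."""
    heap = [(-gain, doc) for doc, gain in forecasted_gain.items()]
    heapq.heapify(heap)                  # min-heap over negated gains = max-heap over gains
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]

# Example with a per-cycle budget of 2 documents:
# pick_downloads({"docA": 0.7, "docB": 0.1, "docC": 0.4}, budget=2) -> ["docA", "docC"]
```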


Overhead of Estimating QD

Estimate QD while updating the inverted index


Forecast Future QD

[Plot: avg. weekly QD during the first 24 weeks vs. the second 24 weeks, shown for the top 50% / 80% / 90% of documents]

Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
Queries: AltaVista query log
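One simple way to turn past weekly estimates into a forecast is an exponentially weighted average; this weighting scheme is an assumption for illustration, not the paper's exact forecasting model.

```python
def forecast_gain(weekly_gains, decay=0.5):
    """Forecast a document's future QD from its past per-week estimates, weighting
    recent weeks more heavily (the decay value and the scheme itself are illustrative)."""
    forecast, weight_sum, w = 0.0, 0.0, 1.0
    for gain in reversed(weekly_gains):   # most recent week first
        forecast += w * gain
        weight_sum += w
        w *= decay
    return forecast / weight_sum if weight_sum else 0.0

# Example: forecast_gain([0.02, 0.05, 0.04]) weights the most recent value (0.04) most.
```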


Summary

Estimate QD at index time
Forecast future QD
Prioritize downloading according to forecasted QD


Overall Effectiveness

Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]

[Plot: repository quality (fraction of ideal) vs. resource requirement, comparing Min. Staleness, Min. Embarrassment, and User-Centric crawling]

* Used shingling to filter out trivial changes
Scoring function: PageRank (similar results for TF.IDF)


Reasons for Improvement

Does not rely on the size of the text change to estimate importance

Example (boston.com): tagged as important by the shingling measure, although it did not match many queries in the workload


Reasons for Improvement

Accounts for false negatives
Does not always ignore frequently-updated pages

Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page


Related Work (1/2)

General-purpose Web crawling:

Min. Staleness [Cho, Garcia-Molina, SIGMOD 2000]
  Maximize average freshness or age for a fixed set of docs.

Min. Embarrassment [Wolf et al., WWW 2002]
  Maximize weighted avg. freshness for a fixed set of docs.
  Document weights determined by probability of embarrassment

[Edwards et al., WWW 2001]
  Maximize average freshness for a growing set of docs.
  How to balance new downloads vs. re-downloading old docs.


Related Work (2/2)

Focused / topic-specific crawling [Chakrabarti, many others]
  Select a subset of pages that match user interests

Our work: given a set of pages, decide when to (re)download each based on predicted content shifts + user interests


Summary

Crawling: an optimization problem
Objective: maximize quality as perceived by users
Approach:
  Measure QD using the query workload and usage logs
  Prioritize downloading based on forecasted QD
Various reasons for improvement:
  Accounts for false positives and negatives
  Does not rely on the size of the text change to estimate importance
  Does not always ignore frequently-updated pages


THE END

Paper available at: www.cs.cmu.edu/~olston


Most Closely Related Work

[Wolf et al., WWW 2002]: Maximize weighted avg. freshness for a fixed set of docs.; document weights determined by probability of embarrassment

User-Centric Crawling: which queries are affected by a change, and by how much?
  Change A: significantly alters relevance to several common queries
  Change B: only affects relevance to infrequent queries, and not by much

Example: a doc. ranked #1000 for a popular query that should be ranked #2
  Small embarrassment, but a big loss in quality
  Our quality metric penalizes such false negatives (worked numbers below)
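A quick worked illustration of this case (the numbers are hypothetical), using the ViewProbability(r) ∝ r^-1.5 model from the earlier slide:

```python
view_prob = lambda r: r ** -1.5

print(view_prob(2))      # ~0.354   : likelihood of viewing the doc at its correct rank (#2)
print(view_prob(1000))   # ~0.00003 : likelihood of viewing it at its stale rank (#1000)
# Embarrassment stays small (users almost never reach rank #1000, so they rarely *visit*
# the misranked result), but the quality loss, roughly relevance x (0.354 - 0.00003),
# is large: a false negative that an embarrassment-based metric does not penalize.
```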


Inverted Index

Doc1: "Seminar: Cancer Symptoms"

Word        Posting list: DocID (freq)
Cancer      Doc7 (2)   Doc1 (1)   Doc9 (1)
Seminar     Doc5 (1)   Doc6 (1)   Doc1 (1)
Symptoms    Doc1 (1)   Doc4 (3)   Doc8 (2)
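A small sketch of the structure pictured above (the tokenizer and data layout are simplified assumptions):

```python
import re
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Inverted index: each word maps to a posting list of (DocID, term frequency) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        terms = re.findall(r"\w+", text.lower())   # crude tokenizer, for the sketch only
        for word, freq in Counter(terms).items():
            index[word].append((doc_id, freq))
    return index

# Mirroring Doc1 from the slide:
index = build_inverted_index({"Doc1": "Seminar: Cancer Symptoms"})
# index["cancer"] -> [("Doc1", 1)]; postings from other documents would be appended similarly.
```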


Updating Inverted Index

Stale Doc1: "Seminar: Cancer Symptoms"
Live Doc1:  "Cancer management: how to detect breast cancer"

Posting list for "Cancer": Doc7 (2)   Doc1 (1) → (2)   Doc9 (1)


Measure QD While Updating Index

Compute the previous and new scores of the downloaded document while updating its postings
Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our experiments)
Compute the previous and new ranks (approximately) using the computed scores and the score-to-rank mapping
Measure QD using the previous and new ranks (by applying an approximate function derived in the paper; see the sketch below)
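A rough sketch of these steps, with all helper names assumed and without reproducing the paper's exact approximation function:

```python
def estimate_quality_gain(term_hits, score_to_rank, term_freq):
    """term_hits:     dict term -> (old_score, new_score) for the downloaded document,
                      computed while its postings are updated
    score_to_rank: dict term -> callable mapping a score to an approximate rank
                      (the compact per-term score-to-rank mapping)
    term_freq:     dict term -> frequency of the term in the query workload"""
    view_prob = lambda r: r ** -1.5        # assumed rank-based viewing model
    gain = 0.0
    for term, (old_score, new_score) in term_hits.items():
        old_rank = score_to_rank[term](old_score)
        new_rank = score_to_rank[term](new_score)
        # contribution: query frequency x relevance (approximated by the new score)
        # x change in expected viewing probability implied by the rank change
        gain += term_freq.get(term, 0) * new_score * (view_prob(new_rank) - view_prob(old_rank))
    return gain
```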


Out-of-date Repository

[Diagram: Repository copy of D (stale) vs. Web copy of D (fresh)]
