
User-Centric Web Crawling

Sandeep Pandey & Christopher Olston, Carnegie Mellon University


Web Crawling

One important application (our focus): search
  Topic-specific search engines + general-purpose ones

[Diagram: user issues search queries against the index/repository; crawler fetches pages from the WWW]


Out-of-date Repository

The Web is always changing [Arasu et al., TOIT 2001]:
  23% of Web pages change daily
  40% of commercial Web pages change daily

Many problems may arise due to an out-of-date repository:
  Hurts both precision and recall


Web Crawling Optimization Problem

Not enough resources to (re)download every web document every day/hour

Must pick and choose → an optimization problem
  Others: objective function = avg. freshness, age
  Our goal: focus directly on the impact on users

[Diagram: user issues search queries against the index/repository; crawler fetches pages from the WWW]


Web Search User Interface

1. User enters keywords
2. Search engine returns a ranked list of result documents
3. User visits a subset of the results


Objective: Maximize Repository Quality (as perceived by users)

Suppose a user issues search query q:

  Quality_q = Σ_{documents D} (likelihood of viewing D) × (relevance of D to q)

Given a workload W of user queries:

  Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality_q)
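Below is a minimal sketch (not part of the original slides) of how this metric could be computed for a query workload; the r^-1.5 viewing model anticipates a later slide, and all names and data layouts are illustrative assumptions.

```python
def view_probability(rank, gamma=1.5):
    """Assumed rank-based viewing model: likelihood of viewing decays as rank^-gamma."""
    return rank ** -gamma

def quality_of_query(ranked_docs, relevance):
    """Quality_q = sum over documents D of (likelihood of viewing D) x (relevance of D to q)."""
    return sum(view_probability(r) * relevance[d]
               for r, d in enumerate(ranked_docs, start=1))

def average_quality(workload):
    """workload: list of (freq_q, ranked_docs, relevance_for_q) tuples, one per query q."""
    total_freq = sum(freq for freq, _, _ in workload)   # plays the role of the normalizer K
    return sum(freq * quality_of_query(docs, rel)
               for freq, docs, rel in workload) / total_freq
```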


Viewing Likelihood

Depends primarily on rank in the result list [Joachims, KDD 2002]

From AltaVista data [Lempel et al., WWW 2003]:
  ViewProbability(r) ∝ r^-1.5

[Plot: probability of viewing vs. rank, dropping off sharply over ranks 0-150]


Relevance Scoring Function

Search engine's internal notion of how well a document matches a query

Each document/query pair → numerical score in [0, 1]

Combination of many factors, including:
  Vector-space similarity (e.g., TF.IDF cosine metric)
  Link-based factors (e.g., PageRank)
  Anchortext of referring pages
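As an illustration only (this is not the engine's actual formula), a toy scoring function might mix TF.IDF cosine similarity with a link-based prior such as PageRank; the helper names and the mixing weight alpha are assumptions.

```python
import math

def tfidf_cosine(query_terms, doc_tf, idf):
    """Cosine similarity between a query and a document in TF.IDF vector space."""
    doc_vec = {t: tf * idf.get(t, 0.0) for t, tf in doc_tf.items()}
    query_vec = {t: idf.get(t, 0.0) for t in query_terms}
    dot = sum(doc_vec.get(t, 0.0) * w for t, w in query_vec.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    return dot / (doc_norm * query_norm) if doc_norm and query_norm else 0.0

def relevance_score(query_terms, doc_tf, idf, pagerank, alpha=0.7):
    """Toy D/Q score in [0, 1]: a weighted mix of text similarity and a link-based prior.
    The weight alpha is an arbitrary illustrative choice."""
    return min(1.0, alpha * tfidf_cosine(query_terms, doc_tf, idf) + (1 - alpha) * pagerank)
```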


(Caveat)

We use the scoring function for absolute relevance:
  Normally it is only used for relative ranking
  Need to craft the scoring function carefully


Measuring Quality

Avg. Quality = Σ_q (freq_q × Σ_D (likelihood of viewing D) × (relevance of D to q))

  freq_q: taken from query logs
  likelihood of viewing D: ViewProb( Rank(D, q) ), estimated from usage logs, with ranks computed by the scoring function over the (possibly stale) repository
  relevance of D to q: scoring function over the live copy of D


Lessons from Quality Metric

Avg. Quality = Σ_q (freq_q × Σ_D ViewProb( Rank(D, q) ) × (relevance of D to q))

ViewProb(r) is monotonically nonincreasing
Quality is maximized when the ranking function orders documents in descending order of relevance
An out-of-date repository scrambles the ranking → lowers quality

Let QD = the loss in quality due to inaccurate information about D
  (alternatively: the improvement in quality if we (re)download D)


QD: Improvement in Quality

[Diagram: Web copy of D (fresh) --REDOWNLOAD--> Repository copy of D (stale); after the re-download, Repository Quality += QD]


Download Prioritization

Idea: given QD for each document, prioritize (re)downloading accordingly (sketch below)

Q: How to measure QD? Two difficulties:
  1. The live copy is unavailable
  2. Even given both the live and repository copies of D, measuring QD may require computing the ranks of all documents for all queries

Approach: (1) estimate QD for past versions, (2) forecast the current QD
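A minimal sketch of the prioritization step, assuming forecasted QD values are already available; the top-k policy and names are illustrative, not the paper's exact scheduler.

```python
import heapq

def pick_downloads(forecasted_gain, budget):
    """Given forecasted QD values (dict: doc -> predicted quality gain), choose the
    `budget` documents whose (re)download is expected to improve quality the most."""
    heap = [(-gain, doc) for doc, gain in forecasted_gain.items()]
    heapq.heapify(heap)                  # min-heap over negated gains = max-heap over gains
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]

# Example with a per-cycle budget of 2 documents:
# pick_downloads({"docA": 0.7, "docB": 0.1, "docC": 0.4}, budget=2) -> ["docA", "docC"]
```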


Overhead of Estimating QD

Estimate QD while updating the inverted index


Forecast Future QD

[Plot: avg. weekly QD during the first 24 weeks vs. the second 24 weeks, shown for the top 50% / 80% / 90% of documents]

Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
Queries: AltaVista query log
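One simple way to turn past weekly estimates into a forecast is an exponentially weighted average; this weighting scheme is an assumption for illustration, not the paper's exact forecasting model.

```python
def forecast_gain(weekly_gains, decay=0.5):
    """Forecast a document's future QD from its past per-week estimates, weighting
    recent weeks more heavily (the decay value and the scheme itself are illustrative)."""
    forecast, weight_sum, w = 0.0, 0.0, 1.0
    for gain in reversed(weekly_gains):   # most recent week first
        forecast += w * gain
        weight_sum += w
        w *= decay
    return forecast / weight_sum if weight_sum else 0.0

# Example: forecast_gain([0.02, 0.05, 0.04]) weights the most recent value (0.04) most.
```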


Summary

Estimate QD at index time
Forecast future QD
Prioritize downloading according to forecasted QD


Overall Effectiveness

Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]

[Plot: repository quality (fraction of ideal) vs. resource requirement, comparing Min. Staleness, Min. Embarrassment, and User-Centric crawling]

* Used shingling to filter out trivial changes
Scoring function: PageRank (similar results for TF.IDF)


Reasons for Improvement

Does not rely on the size of the text change to estimate importance

Example (boston.com): tagged as important by the shingling measure, although it did not match many queries in the workload


Reasons for Improvement

Accounts for false negatives
Does not always ignore frequently-updated pages

Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page


Related Work (1/2)

General-purpose Web crawling:

Min. Staleness [Cho, Garcia-Molina, SIGMOD 2000]
  Maximize average freshness or age for a fixed set of docs.

Min. Embarrassment [Wolf et al., WWW 2002]
  Maximize weighted avg. freshness for a fixed set of docs.
  Document weights determined by probability of embarrassment

[Edwards et al., WWW 2001]
  Maximize average freshness for a growing set of docs.
  How to balance new downloads vs. re-downloading old docs.


Related Work (2/2)

Focused / topic-specific crawling [Chakrabarti, many others]
  Select a subset of pages that match user interests

Our work: given a set of pages, decide when to (re)download each based on predicted content shifts + user interests


Summary

Crawling: an optimization problem
Objective: maximize quality as perceived by users
Approach:
  Measure QD using the query workload and usage logs
  Prioritize downloading based on forecasted QD
Various reasons for improvement:
  Accounts for false positives and negatives
  Does not rely on the size of the text change to estimate importance
  Does not always ignore frequently-updated pages


THE END

Paper available at: www.cs.cmu.edu/~olston


Most Closely Related Work

[Wolf et al., WWW 2002]: Maximize weighted avg. freshness for a fixed set of docs.; document weights determined by probability of embarrassment

User-Centric Crawling: which queries are affected by a change, and by how much?
  Change A: significantly alters relevance to several common queries
  Change B: only affects relevance to infrequent queries, and not by much

Example: a doc. ranked #1000 for a popular query that should be ranked #2
  Small embarrassment, but a big loss in quality
  Our quality metric penalizes such false negatives (worked numbers below)
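A quick worked illustration of this case (the numbers are hypothetical), using the ViewProbability(r) ∝ r^-1.5 model from the earlier slide:

```python
view_prob = lambda r: r ** -1.5

print(view_prob(2))      # ~0.354   : likelihood of viewing the doc at its correct rank (#2)
print(view_prob(1000))   # ~0.00003 : likelihood of viewing it at its stale rank (#1000)
# Embarrassment stays small (users almost never reach rank #1000, so they rarely *visit*
# the misranked result), but the quality loss, roughly relevance x (0.354 - 0.00003),
# is large: a false negative that an embarrassment-based metric does not penalize.
```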


Inverted Index

Doc1: "Seminar: Cancer Symptoms"

Word        Posting list: DocID (freq)
Cancer      Doc7 (2)   Doc1 (1)   Doc9 (1)
Seminar     Doc5 (1)   Doc6 (1)   Doc1 (1)
Symptoms    Doc1 (1)   Doc4 (3)   Doc8 (2)
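A small sketch of the structure pictured above (the tokenizer and data layout are simplified assumptions):

```python
import re
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Inverted index: each word maps to a posting list of (DocID, term frequency) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        terms = re.findall(r"\w+", text.lower())   # crude tokenizer, for the sketch only
        for word, freq in Counter(terms).items():
            index[word].append((doc_id, freq))
    return index

# Mirroring Doc1 from the slide:
index = build_inverted_index({"Doc1": "Seminar: Cancer Symptoms"})
# index["cancer"] -> [("Doc1", 1)]; postings from other documents would be appended similarly.
```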


Updating Inverted Index

Stale Doc1: "Seminar: Cancer Symptoms"
Live Doc1:  "Cancer management: how to detect breast cancer"

Posting list for "Cancer": Doc7 (2)   Doc1 (1) → (2)   Doc9 (1)


Measure QD While Updating Index

Compute the previous and new scores of the downloaded document while updating its postings
Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our experiments)
Compute the previous and new ranks (approximately) using the computed scores and the score-to-rank mapping
Measure QD using the previous and new ranks (by applying an approximate function derived in the paper; see the sketch below)
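A rough sketch of these steps, with all helper names assumed and without reproducing the paper's exact approximation function:

```python
def estimate_quality_gain(term_hits, score_to_rank, term_freq):
    """term_hits:     dict term -> (old_score, new_score) for the downloaded document,
                      computed while its postings are updated
    score_to_rank: dict term -> callable mapping a score to an approximate rank
                      (the compact per-term score-to-rank mapping)
    term_freq:     dict term -> frequency of the term in the query workload"""
    view_prob = lambda r: r ** -1.5        # assumed rank-based viewing model
    gain = 0.0
    for term, (old_score, new_score) in term_hits.items():
        old_rank = score_to_rank[term](old_score)
        new_rank = score_to_rank[term](new_score)
        # contribution: query frequency x relevance (approximated by the new score)
        # x change in expected viewing probability implied by the rank change
        gain += term_freq.get(term, 0) * new_score * (view_prob(new_rank) - view_prob(old_rank))
    return gain
```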


Out-of-date Repository

[Diagram: Repository copy of D (stale) vs. Web copy of D (fresh)]
