
Scalable Techniques for Clustering the Web

Taher H. Haveliwala Aristides Gionis Piotr Indyk

Stanford University
{taherh,gionis,indyk}@cs.stanford.edu

Project Goals
- Generate a fine-grained clustering of the web based on topic
- Similarity search ("What's Related?")
- Two major issues:
  - Develop an appropriate notion of similarity
  - Scale up to millions of documents

Prior Work
- Offline: detecting replicas
  [Broder-Glassman-Manasse-Zweig '97], [Shivakumar-Garcia-Molina '98]
- Online: finding/grouping related pages
  [Zamir-Etzioni '98], [Manjara]
- Link-based methods
  [Dean-Henzinger '99], [Clever]

Prior Work: Online, Link
- Online: cluster the results of search queries
  - Does not work for clustering the entire web offline
- Link-based approaches are limited
  - What about relatively new pages?
  - What about less popular pages?

Prior Work: Copy detection
- Designed to detect duplicates/near-replicas
- Does not scale when the notion of similarity is broadened to topical similarity
- Creation of the document-document similarity matrix is the core challenge: the join bottleneck

Pairwise similarity
Consider the relation Docs(id, sentence). Must compute:

  SELECT   D1.id, D2.id
  FROM     Docs D1, Docs D2
  WHERE    D1.sentence = D2.sentence
  GROUP BY D1.id, D2.id
  HAVING   count(*) > ...

What if we change sentence to word?

Pairwise similarity
Relation Docs(id, word). Compute:

  SELECT   D1.id, D2.id
  FROM     Docs D1, Docs D2
  WHERE    D1.word = D2.word
  GROUP BY D1.id, D2.id
  HAVING   count(*) > ...

For 25M urls, this join could take months to compute!

Overview
- Choose document representation
- Choose similarity metric
- Compute pairwise document similarities
- Generate clusters

Document representation
- Bag of words model
- Bag for each page p consists of:
  - Title of p
  - Anchor text of all pages pointing to p
    (also include a window of words around the anchors)
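A minimal sketch of this bag construction, assuming the page title and the anchor-window strings (hypothetical inputs here) have already been extracted from a crawl:

```python
from collections import Counter

def make_bag(title, anchor_windows):
    """Bag for a page p: the words of p's title plus the words in the
    anchor windows of every page that links to p."""
    bag = Counter(title.lower().split())
    for window in anchor_windows:
        bag.update(window.lower().split())
    return bag

# Hypothetical example: two pages link to the "MusicWorld" page.
bag = make_bag("MusicWorld",
               ["click here for a great music page",
                "this music is great"])
```

Stopword removal and TF-IDF scaling (later slides) are applied afterwards, so common words such as "for" are still present at this stage.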

Bag Generation
(Figure: bag generation for http://www.music.com/, a page titled "MusicWorld" with the text "Enter our site". http://www.foobar.com/ links to it within "...click here for a great music page..." (a second anchor, "...click here for great sports page...", points elsewhere); http://www.baz.com/ links to it within "...this music is great...", amid unrelated text such as "...what I had for lunch...".)

Bag Generation
- The union of anchor windows is a concise description of a page
- Note that using anchor windows, we can cluster more documents than we've crawled:
  - In general, a set of N documents refers to c·N urls, for some c > 1

Standard IR
- Remove stopwords (~750)
- Remove high-frequency & low-frequency terms
- Use stemming
- Apply TF-IDF scaling
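A rough sketch of the stopword and TF-IDF steps (stemming and high/low-frequency pruning are omitted; the tiny stopword list and the log-based weighting are illustrative assumptions, not the exact scheme used):

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "for", "of", "and", "is"}  # ~750 words in the real system

def tfidf_bags(raw_bags):
    """Drop stopwords, then scale each remaining term count by its
    inverse document frequency, log(N / df)."""
    n = len(raw_bags)
    df = Counter()
    for bag in raw_bags:
        df.update(set(bag) - STOPWORDS)      # document frequency per term
    return [{w: c * math.log(n / df[w])
             for w, c in bag.items() if w in df}
            for bag in raw_bags]

bags = [Counter({"music": 2, "the": 5}),
        Counter({"music": 1, "opera": 3})]
scaled = tfidf_bags(bags)
```

A term that appears in every document gets weight log(N/N) = 0, which is how high-frequency terms lose influence under this weighting.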

Overview
- Choose document representation
- Choose similarity metric
- Compute pairwise document similarities
- Generate clusters

Similarity
- Similarity metric for pages U1, U2 that were assigned bags B1, B2, respectively:

  sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2|

- Threshold is set to 20%
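This metric is the Jaccard coefficient of the two bags' word sets (treating the bags as sets for simplicity); a direct sketch:

```python
def sim(b1, b2):
    """Jaccard similarity of the word sets of two bags."""
    b1, b2 = set(b1), set(b2)
    return len(b1 & b2) / len(b1 | b2)

THRESHOLD = 0.20  # pairs below 20% similarity are discarded

# Two of four distinct words are shared -> similarity 0.5
s = sim({"great", "music", "page"}, {"music", "page", "world"})
```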

Reality Check
Pages most similar to www.foodchannel.com:

  www.epicurious.com/a_home/a00_home/home.html       .37
  www.gourmetworld.com                               .36
  www.foodwine.com                                   .325
  www.cuisinenet.com                                 .3125
  www.kitchenlink.com                                .3125
  www.yumyum.com                                     .3
  www.menusonline.com                                .3
  www.snap.com/directory/category/0,16,-324,00.html  .2875
  www.ichef.com                                      .2875
  www.home-canning.com                               .275

Overview
- Choose document representation
- Choose similarity metric
- Compute pairwise document similarities
- Generate clusters

Pair Generation
- Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20%
- Ignore all url pairs with sim < 20%
- How do we avoid the join bottleneck?

Locality Sensitive Hashing
- Idea: use a special kind of hashing
- Locality Sensitive Hashing (LSH) provides a solution:
  - Min-wise hash functions [Broder '98]
  - LSH [Indyk-Motwani '98], [Cohen et al. '00]
- Properties:
  - Similar urls are hashed together w.h.p.
  - Dissimilar urls are not hashed together

Locality Sensitive Hashing

(Figure: urls hashed into buckets, e.g. {sports.com, golf.com} in one bucket and {music.com, opera.com, sing.com} in another.)

Hashing
Two steps:
- Min-hash (MH): a way to consistently sample words from bags
- Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not

Step 1: Min-hash
Generate m min-hash signatures for each url (m = 80):

  For i = 1...m:
    Generate a random ordering h_i on words
    mh_i(u) = argmin { h_i(w) | w ∈ B_u }

Key property: Pr(mh_i(u) = mh_i(v)) = sim(u, v)

Step 1: Min-hash
Round 1: ordering = [cat, dog, mouse, banana]

Set A: {mouse, dog} MH-signature = dog

Set B: {cat, mouse} MH-signature = cat

Step 1: Min-hash
Round 2: ordering = [banana, mouse, cat, dog]

Set A: {mouse, dog} MH-signature = mouse

Set B: {cat, mouse} MH-signature = mouse
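The rounds above can be sketched as follows, simulating each random ordering h_i by assigning every vocabulary word a random priority (m = 80 as on the slides):

```python
import random

def minhash_signatures(bags, m=80, seed=0):
    """For each of m rounds, draw a random ordering h_i of the
    vocabulary and record, per bag, the bag's first word under h_i."""
    rng = random.Random(seed)
    vocab = sorted(set().union(*bags))
    sigs = [[] for _ in bags]
    for _ in range(m):
        order = {w: rng.random() for w in vocab}   # random ordering h_i
        for sig, bag in zip(sigs, bags):
            sig.append(min(bag, key=order.__getitem__))
    return sigs

# Sets A and B from the slides: sim(A, B) = 1/3, so roughly a third
# of the m rounds should produce matching signatures.
sigs = minhash_signatures([{"mouse", "dog"}, {"cat", "mouse"}], m=80)
```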

Step 2: LSH
Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3):

  For i = 1...l:
    Randomly select k min-hash indices and concatenate
    the corresponding values to form the ith LSH signature

Step 2: LSH
- Generate a candidate pair if u and v have an LSH signature in common in any round
- Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k = sim(u, v)^k
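A sketch of this candidate-generation step, assuming the per-url min-hash signatures (lists of m values each) have already been computed:

```python
import random
from itertools import combinations

def candidate_pairs(sigs, l=125, k=3, seed=0):
    """l rounds; in each round, k randomly chosen min-hash positions
    are concatenated into an LSH signature, and urls falling into the
    same bucket become candidate pairs."""
    rng = random.Random(seed)
    m = len(sigs[0])
    pairs = set()
    for _ in range(l):
        idx = rng.sample(range(m), k)              # k min-hash indices
        buckets = {}
        for u, sig in enumerate(sigs):
            key = tuple(sig[i] for i in idx)       # concatenated LSH signature
            buckets.setdefault(key, []).append(u)
        for urls in buckets.values():
            pairs.update(combinations(urls, 2))
    return pairs
```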

Step 2: LSH

  Set A: {mouse, dog, horse, ant}
    MH1 = horse, MH2 = mouse, MH3 = ant, MH4 = dog
    LSH134 = horse-ant-dog
    LSH234 = mouse-ant-dog

  Set B: {cat, ice, shoe, mouse}
    MH1 = cat, MH2 = mouse, MH3 = ice, MH4 = shoe
    LSH134 = cat-ice-shoe
    LSH234 = mouse-ice-shoe

Step 2: LSH
Bottom line - probability that a pair collides on a given LSH signature (k = 3):
  10% similarity → 0.1^3 = 0.1%
  1% similarity  → 0.01^3 = 0.0001%

Step 2: LSH
Round 1 buckets:
  sport-team-win:   sports.com, golf.com, party.com
  music-sound-play: music.com, opera.com
  sing-music-ear:   sing.com
  ...

Step 2: LSH
Round 2 buckets:
  game-team-score:      sports.com, golf.com
  audio-music-note:     music.com, sing.com
  theater-luciano-sing: opera.com
  ...

Sort & Filter
- Using all buckets from all LSH rounds, generate candidate pairs
- Sort candidate pairs on the first field
- Filter candidate pairs: keep pair (u, v) only if u and v agree on ≥ 20% of MH-signatures
- Ready for "What's Related?" queries...
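A sketch of the filter, assuming the candidate pairs and the per-url min-hash signatures are available; the fraction of agreeing min-hash values serves as an estimate of sim(u, v):

```python
def filter_pairs(pairs, sigs, threshold=0.20):
    """Keep (u, v) only if u and v agree on at least `threshold` of
    their min-hash values; sorting groups the pairs by first field."""
    kept = []
    for u, v in sorted(pairs):
        agree = sum(a == b for a, b in zip(sigs[u], sigs[v]))
        if agree / len(sigs[u]) >= threshold:
            kept.append((u, v))
    return kept
```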

Overview
- Choose document representation
- Choose similarity metric
- Compute pairwise document similarities
- Generate clusters

Clustering
- The set of document pairs represents the document-document similarity matrix with a 20% similarity threshold
- Clustering algorithms:
  - S-Link: connected components
  - C-Link: maximal cliques
  - Center: approximation to C-Link

Center
- Scan through the pairs (they are sorted on the first component)
- For each run [(u, v1), ..., (u, vn)]:
    if u is not marked:
      cluster = u + unmarked neighbors of u
      mark u and all neighbors of u
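The one-pass scan above can be sketched as follows, assuming the input pairs are sorted on the first component (as produced by the Sort & Filter step):

```python
from itertools import groupby

def center_clusters(sorted_pairs):
    """One pass: each unmarked url u becomes a cluster center, taking
    its still-unmarked neighbors with it; all are then marked."""
    marked = set()
    clusters = []
    for u, run in groupby(sorted_pairs, key=lambda p: p[0]):
        if u in marked:
            continue                      # skip runs whose center is marked
        neighbors = [v for _, v in run if v not in marked]
        clusters.append([u] + neighbors)
        marked.add(u)
        marked.update(neighbors)
    return clusters
```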


Results
20 Million urls on a Pentium-II 450:

  Algorithm Step   Running Time (hours)
  Bag generation    23
  Bag sorting        4.7
  Min-hash          26
  LSH               16
  Filtering         83
  Sorting          107
  CENTER            18

Sample Cluster
  feynman.princeton.edu/~sondhi/205main.html
  hep.physics.wisc.edu/wsmith/p202/p202syl.html
  hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html
  pdg.lbl.gov/mc_particle_id_contents.html
  physics.ucsc.edu/courses/10.html
  town.hall.org/places/SciTech/qmachine
  www.as.ua.edu/physics/hetheory.html
  www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html
  www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html
  www.phy.duke.edu/Courses/271/Synopsis.html

. . . (total of 27 urls) . . .

Ongoing/Future Work
- Tune anchor-window length
- Develop a system to measure quality:
  - What is ground truth?
  - How do you judge a clustering of millions of pages?