Massive Crawling for the Masses
Paolo Boldi, Andrea Marino,
Massimo Santini, Sebastiano Vigna
Dipartimento di Informatica
Università degli Studi di Milano
Italy
Once upon a time
UbiCrawler
UbiCrawler was a scalable, fault-tolerant and fully distributed web crawler (Software: Practice & Experience, 34(8):711-726, 2004).
Fill the available bandwidth in spite of politeness constraints (at both host and IP level) => the crawl covers a coherent time frame.
Completely configurable.
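As a rough illustration of what politeness "at both host and IP level" can mean in practice, the sketch below keeps, for every host and every IP address, the earliest time at which it may be contacted again, and allows a fetch only when both limits are satisfied. The class name `PolitenessGate`, its methods, and the delay values are hypothetical and chosen for illustration only; this is not the crawler's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PolitenessGate:
    """Hypothetical helper: tracks when each host and each IP may be hit again."""

    host_delay: float  # minimum seconds between two requests to the same host
    ip_delay: float    # minimum seconds between two requests to the same IP
    _next_host: dict = field(default_factory=dict, init=False)
    _next_ip: dict = field(default_factory=dict, init=False)

    def next_allowed(self, host: str, ip: str) -> float:
        """Earliest time at which a request to (host, ip) respects both delays."""
        return max(self._next_host.get(host, 0.0), self._next_ip.get(ip, 0.0))

    def can_fetch(self, host: str, ip: str, now: float) -> bool:
        """True if neither the host limit nor the IP limit would be violated."""
        return now >= self.next_allowed(host, ip)

    def record_fetch(self, host: str, ip: str, now: float) -> None:
        """Register a request made at time `now` and push both deadlines forward."""
        self._next_host[host] = now + self.host_delay
        self._next_ip[ip] = now + self.ip_delay


# Example with assumed delays: 4 s per host, 1 s per IP.
gate = PolitenessGate(host_delay=4.0, ip_delay=1.0)
gate.record_fetch("a.example", "10.0.0.1", now=0.0)

# Another host sharing the same IP must still wait for the IP delay.
same_ip_early = gate.can_fetch("b.example", "10.0.0.1", now=0.5)
same_ip_later = gate.can_fetch("b.example", "10.0.0.1", now=1.0)
# The original host stays blocked until its own, longer delay has elapsed.
same_host = gate.can_fetch("a.example", "10.0.0.1", now=1.0)
```

Time is passed in explicitly so the sketch stays deterministic. With many hosts and IPs ready at once, a crawler can keep its threads busy even though each individual host and IP is contacted rarely, which is how politeness and bandwidth saturation can coexist.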
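[Figure: architecture diagram. Components: (1) Frontier, Workbench, Sieve, Distributor, Virtualizer (disk queues). Labels: host visit state, URL in other agents, URL in new host, URL in known host; markers (2) and (3) also appear in the diagram.]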
[Figure: Workbench data-flow diagram. Components: Workbench, Guava cache (in memory), DNSThread, WorkbenchThread, ParsingThread, Store. Labels: workbench entry, IP workbench entry, visit state (acquire), visit state (put back), URLs found, page, headers etc., parsed.]
[Figure: Pages/s (0 to 12000) versus number of threads (0 to 300).]
Front size
[Figure: average front size in IPs (0 to 70000) versus IP delay in ms (0 to 4000), for 125, 500 and 2000 threads.]
Average speed
[Figure: average speed in requests/s (0 to 14000) versus IP delay in ms (0 to 4000), for 125, 500 and 2000 threads.]
Comparisons
                        Machines    Speed/agent (MB/s)
Heritrix (ClueWeb12)    5           4
IRLBot                  1           40
Download at http://law.di.unimi.it/