
Streaming

L. Becchetti
What is
streaming?
Count-Min
sketch
Inner product
and join
Heavy hitters
Compact data structures: data streaming
Luca Becchetti
Sapienza Università di Roma, Rome, Italy
April 26, 2008
1. What is streaming?
2. Count-Min sketch
3. Inner product and join
4. Heavy hitters
Informally...
Streaming involves ([Muthukrishnan, 2005a]):
Small number of passes over data. (Typically 1?)
Sublinear space (sublinear in the universe or number of
stream items?)
A model of computation...
Similar to dynamic, online, approximation or randomized
algorithms, but with more constraints
Constraints impose limitations that make many easy
problems hard (further in this lecture)
Being poly-time/poly-space is no longer sufficient
Motivations
[Figure: traffic on US backbones [Odlyzko, 2003] [Ipoque GMBH, 2007]]
Traffic explosion in past years [Muthukrishnan, 2005a]
30 billion emails, 1 billion SMS and IMs daily (2005)
1 billion packets per router per hour
Motivations
[Figure: network diagram with labels LAN, LAN, Internet, Router, SNMP log, Flow log, Packet log]
Logs
SNMP: (Router ID, Interface ID, Timestamp, Bytes sent since last observation)
Flow: (Source IP, Dest IP, Start Time, Duration, No. Packets, No. Bytes)
Packet: (Source IP, Dest IP, Src/Dest Port Numbers, Time, No. Bytes)
Challenges [Muthukrishnan, 2005a]
One link at 2 Gb/s; say the average packet size is 50 bytes
Number of packets per second = 5 million
Time per packet = 0.2 µs (time available for processing)
If we capture header data for each packet (src/dest IP, time,
number of bytes, etc.), that is at least 10 bytes per packet
Space per second is 50 MB. Space per day is about 4.3 TB per
link
ISPs have hundreds of links.
Focus is on solutions for real applications
Note: we seek solutions that work in practice: easy to
implement, requiring small space, allowing fast updates and queries
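The back-of-the-envelope figures above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative, and decimal units (1 Gb = 10^9 bits, 1 TB = 10^12 bytes) are an assumption. Computed this way, the daily log volume comes out at about 4.3 TB per link.

```python
# Back-of-the-envelope load on one link, using the figures quoted above.
# Assumption: decimal units (1 Gb = 1e9 bits, 1 MB = 1e6 bytes).
LINK_RATE_BITS_PER_S = 2e9    # 2 Gb/s link
AVG_PACKET_BYTES = 50         # average packet size
HEADER_BYTES = 10             # bytes logged per packet (lower bound)
SECONDS_PER_DAY = 24 * 60 * 60

packets_per_s = LINK_RATE_BITS_PER_S / 8 / AVG_PACKET_BYTES
time_per_packet_us = 1e6 / packets_per_s          # microseconds per packet
log_bytes_per_s = packets_per_s * HEADER_BYTES
log_bytes_per_day = log_bytes_per_s * SECONDS_PER_DAY

print(f"packets per second : {packets_per_s:,.0f}")
print(f"time per packet    : {time_per_packet_us} microseconds")
print(f"log volume per sec : {log_bytes_per_s / 1e6:.0f} MB")
print(f"log volume per day : {log_bytes_per_day / 1e12:.2f} TB")
```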
Some typical queries [Muthukrishnan, 2005a]
How many distinct IP addresses use a given link currently
or at any time during the day?
What are the top k most voluminous flows currently in
progress on a link?
How many distinct flows were observed?
Are traffic patterns in two routers correlated? What are
(un)usual trends?
Network monitoring is just one possible application
On-line statistics on search engine query logs
On-line statistics on server logs
...
The streaming model [Muthukrishnan, 2005b]
[Figure: Streaming model. Data flow: updates $U_j, U_{j+1}, U_{j+2}, U_{j+3}$ arrive one after another at the processing unit]
Data items arrive over time
$U_j$ is processed before $U_{j+1}$ arrives
Only one pass (or a few passes, at most logarithmically many)
The streaming model [Muthukrishnan, 2005b]
Underlying data (signal): an n-dimensional array $A$, with n
typically large (e.g., the size of the IP address space)
Updates arrive over time. The j-th update is a pair
$U_j = (i, x)$, where i is an index: $A_i = A_i + x$
In general, x can be any value (positive or negative)
Initially: $A_i = 0$, $i = 1, \ldots, n$
$A(t)$: the state of the array after the first t updates
Goal
Compute and maintain functions over $A$ in small space,
with fast updates and computation
Typically, space $\ll n$
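As a point of reference, the update model can be replayed on an explicit array. The sketch below is a naive baseline, not a streaming algorithm: it uses space proportional to n, which is exactly what the goal above rules out. The function name `apply_updates` and the sample stream are made up for illustration.

```python
def apply_updates(n, updates):
    """Replay a stream of updates (i, x) on an explicit array A of size n.

    Indices run from 1 to n, as in the slides. Returns [A_1, ..., A_n].
    """
    A = [0] * (n + 1)          # slot 0 is unused so that indices start at 1
    for i, x in updates:
        A[i] += x              # A_i = A_i + x
    return A[1:]

# Hypothetical stream of updates U_1, ..., U_4 over an array with n = 5.
stream = [(3, 2), (1, 5), (3, -1), (4, 7)]

state_after_2 = apply_updates(5, stream[:2])   # A(2): state after 2 updates
final_state = apply_updates(5, stream)         # A(4): state after all updates

print(state_after_2)   # [5, 0, 2, 0, 0]
print(final_state)     # [5, 0, 1, 7, 0]
```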
Special cases of the model
[Muthukrishnan, 2005b]
Time series model: the j-th update changes $A_j$
Cash register model: $x \ge 0$
Turnstile model: the most general model
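To make the distinction concrete, the small helper below reports which of the three models a given finite stream fits. It is a sketch under two assumptions: the time series model is read as "the j-th update touches entry j", and the cash register condition is taken to be x >= 0. The function name `stream_model` is hypothetical.

```python
def stream_model(updates):
    """Name the first model (in the order checked) that a stream of (i, x) fits.

    Checks, in order: time series (j-th update has index j), cash register
    (all x >= 0), and finally turnstile (no restriction).
    """
    if all(i == j for j, (i, _) in enumerate(updates, start=1)):
        return "time series"
    if all(x >= 0 for _, x in updates):
        return "cash register"
    return "turnstile"

print(stream_model([(1, 4), (2, -1), (3, 0)]))   # time series
print(stream_model([(3, 2), (1, 5), (3, 1)]))    # cash register
print(stream_model([(3, 2), (3, -1)]))           # turnstile
```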
In this lecture
Compact summaries of data streams (Count-Min
sketches [Cormode and Muthukrishnan, 2005])
Applications: point queries, range sums, heavy hitters,
quantiles
Only a drop in a sea of results...
The CM sketch
[Cormode and Muthukrishnan, 2005]
In the sequel: cash-register model
2-dimensional array whose size is determined by design
parameters $\varepsilon$ and $\delta$ (their meaning explained further)
Array is $C[j, l]$, where $j = 1, \ldots, d$ and $l = 1, \ldots, w$
$d = \lceil \ln \frac{1}{\delta} \rceil$ (depth)
$w = \lceil \frac{e}{\varepsilon} \rceil$ (width)
Every entry initially 0
d hash functions $h_1, \ldots, h_d$ chosen uniformly at random
from a pairwise-independent family (see first lecture)
$h_r : \{1, \ldots, n\} \rightarrow \{1, \ldots, w\}$
Update
Pair $(i, c)$ is observed, meaning that, ideally, $A_i = A_i + c$
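The sizing rule and the choice of hash functions can be sketched as follows. Everything beyond the two formulas for d and w is an assumption made for illustration: the hash family h(x) = ((a x + b) mod P) mod w is a common Carter-Wegman style choice (only approximately uniform after the final mod w), the prime P is picked arbitrarily, and columns are numbered from 0 rather than 1.

```python
import math
import random

def cm_dimensions(eps, delta):
    """Depth d = ceil(ln(1/delta)) and width w = ceil(e/eps)."""
    d = math.ceil(math.log(1.0 / delta))
    w = math.ceil(math.e / eps)
    return d, w

P = 2**61 - 1   # a prime assumed to be larger than any array index

def random_hash(w, rng):
    """Draw one function h(x) = ((a*x + b) mod P) mod w at random."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % w

d, w = cm_dimensions(eps=0.01, delta=0.01)
rng = random.Random(0)                       # fixed seed for repeatability
hashes = [random_hash(w, rng) for _ in range(d)]

print(d, w)                                  # 5 272
print([h(12345) for h in hashes])            # one column per row
```

With these parameters the sketch holds d x w = 1360 counters, independently of n.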
CM sketch: Update procedure
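[Figure: CM sketch update. Each hash function $h_1, \ldots, h_d$ maps item i to one cell in its row of the $d \times w$ array, and c is added to each of these cells]
update(i, c)
Require: i: array index, c: value
for j = 1, ..., d do
    $C[j, h_j(i)] = C[j, h_j(i)] + c$
end for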
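A minimal Python sketch of the update procedure follows. The class name `CountMin`, the seed parameter, and the hash family ((a i + b) mod P) mod w are illustrative assumptions and are not taken from the slides; rows and columns are 0-based here.

```python
import math
import random

P = 2**61 - 1   # prime modulus for the hash family (an assumption)

class CountMin:
    def __init__(self, eps, delta, seed=0):
        rng = random.Random(seed)
        self.d = math.ceil(math.log(1.0 / delta))   # depth
        self.w = math.ceil(math.e / eps)            # width
        self.C = [[0] * self.w for _ in range(self.d)]
        # One pair (a, b) per row defines the hash function of that row.
        self.ab = [(rng.randrange(1, P), rng.randrange(P))
                   for _ in range(self.d)]

    def h(self, j, i):
        """Column that row j assigns to index i."""
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c):
        """Add c to one counter in each of the d rows."""
        for j in range(self.d):
            self.C[j][self.h(j, i)] += c

cm = CountMin(eps=0.1, delta=0.05)
cm.update(42, 3)
cm.update(7, 1)
row_sums = [sum(row) for row in cm.C]
print(cm.d, cm.w, row_sums)   # 3 28 [4, 4, 4]
```

Each update touches exactly d counters, so every row always sums to the total amount inserted so far (here 4).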
Point query
Basic query, building block for the others
Q(i): estimate $A_i$
Point query estimate
PQ(i)
Require: i: array index
return $\hat{A}_i = \min_j C[j, h_j(i)]$
Theorem ([Cormode and Muthukrishnan, 2005])
$\hat{A}_i \ge A_i$. Furthermore, $P[\hat{A}_i > A_i + \varepsilon \|A\|_1] \le \delta$, where
$\|A\|_1 = \sum_{i=1}^{n} |A_i|$ is the 1-norm of $A$.
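The point query and its one-sided error can be tried out on synthetic data. The sketch below is illustrative only: the `CountMin` class, its hash family, the seeds, and the synthetic stream are assumptions. It checks that no estimate falls below the true count and reports the fraction of items whose error exceeds $\varepsilon \|A\|_1$.

```python
import math
import random

P = 2**61 - 1   # prime modulus for the hash family (an assumption)

class CountMin:
    def __init__(self, eps, delta, seed=0):
        rng = random.Random(seed)
        self.d = math.ceil(math.log(1.0 / delta))
        self.w = math.ceil(math.e / eps)
        self.C = [[0] * self.w for _ in range(self.d)]
        self.ab = [(rng.randrange(1, P), rng.randrange(P))
                   for _ in range(self.d)]

    def _col(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c):
        for j in range(self.d):
            self.C[j][self._col(j, i)] += c

    def point_query(self, i):
        """Estimate A_i as the minimum of the d counters that i maps to."""
        return min(self.C[j][self._col(j, i)] for j in range(self.d))

eps, delta = 0.01, 0.01
cm = CountMin(eps, delta)
rng = random.Random(1)
truth = {}                                 # exact counts, kept for comparison
for _ in range(10_000):
    i = rng.randrange(1, 1001)             # synthetic items in 1..1000
    cm.update(i, 1)
    truth[i] = truth.get(i, 0) + 1

l1 = sum(truth.values())                   # ||A||_1
errors = {i: cm.point_query(i) - a for i, a in truth.items()}
too_large = sum(1 for e in errors.values() if e > eps * l1)

print("never underestimates:", min(errors.values()) >= 0)
print("fraction with error above eps*||A||_1:", too_large / len(errors))
```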
Proof of theorem
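Define $I_{ijk} = 1$ if $(i \ne k) \wedge (h_j(i) = h_j(k))$, 0 otherwise
$P[h_j(i) = h_j(k)] \le \frac{1}{w} \le \frac{\varepsilon}{e}$ by pairwise independence
Define $X_{ij} = \sum_{k=1}^{n} I_{ijk} A_k$
$X_{ij} \ge 0$ and $C[j, h_j(i)] = A_i + X_{ij}$ $\Rightarrow$ $\hat{A}_i \ge A_i$
$X_{ij}$ measures the error introduced by collisions
$E[X_{ij}] = E[\sum_{k=1}^{n} I_{ijk} A_k] = \sum_{k=1}^{n} A_k E[I_{ijk}] \le \frac{\varepsilon}{e} \|A\|_1$
Notice that the only random variables are the $I_{ijk}$'s and
the $X_{ij}$'s
Furthermore,
$P[\hat{A}_i > A_i + \varepsilon \|A\|_1] = P[\forall j : C[j, h_j(i)] > A_i + \varepsilon \|A\|_1]$
$= P[\forall j : A_i + X_{ij} > A_i + \varepsilon \|A\|_1] \le P[\forall j : X_{ij} > e E[X_{ij}]] < e^{-d} \le \delta$,
where the strict inequality follows from Markov's inequality applied to each row, together with the independence of the d hash functions, and the last one from $d = \lceil \ln \frac{1}{\delta} \rceil$.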
Join size
[Figure: tuples of relation R (attributes R.A, R.B, R.C) and of relation S (attribute S.A), with the number of tuples taking each value 1, ..., 4 of attribute A]
Join of two relations R and S on the same attribute A
We want to compute COUNT($R \bowtie_A S$)
Space $\Omega(n)$ required (in the picture: n = 4)
In the example: COUNT($R \bowtie_A S$) = 3 x 2 + 2 x 2 = 10
Underlying operation: compute the scalar product of two
non-negative vectors of the same size
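The reduction from join size to a scalar product of frequency vectors can be written out directly. The attribute values below are hypothetical: they were chosen only so that the result reproduces the sum 3 x 2 + 2 x 2 = 10 from the example, and they are not read off the original figure. This exact computation keeps one counter per distinct value, so it does not save space.

```python
from collections import Counter

def join_size(r_values, s_values):
    """Exact COUNT of the equi-join, as a scalar product of frequency vectors."""
    freq_r = Counter(r_values)      # how often each value occurs in R.A
    freq_s = Counter(s_values)      # how often each value occurs in S.A
    # Values missing from S contribute 0 (Counter returns 0 for them).
    return sum(freq_r[v] * freq_s[v] for v in freq_r)

# Hypothetical columns over the domain {1, 2, 3, 4}.
r_a = [2, 2, 2, 4, 4, 1]    # value 2 three times, value 4 twice, value 1 once
s_a = [2, 2, 4, 4, 3]       # value 2 twice, value 4 twice, value 3 once

print(join_size(r_a, s_a))  # 10
```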
Inner product
Given two vectors $A$ and $B$, we want to keep a compact
summary of $A \cdot B$
We use two CM sketches $C_A$ and $C_B$ for $A$ and $B$, built with the same hash functions
Estimation
Let $S_j = \sum_{k=1}^{w} C_A[j, k] \, C_B[j, k]$
Estimation: return $\hat{S} = \min_j S_j$
Theorem ([Cormode and Muthukrishnan, 2005])
$\hat{S} \ge A \cdot B$. Furthermore:
$P[\hat{S} > A \cdot B + \varepsilon \|A\|_1 \|B\|_1] \le \delta$.
Q1: Prove the theorem above. The proof follows the same lines as for
the point query
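A small sketch of the inner product estimate is given below. The `CountMin` class, the hash family, and the sample vectors are assumptions made for illustration. The one requirement that matters is that both sketches use the same hash functions, which is obtained here by giving them the same seed.

```python
import math
import random

P = 2**61 - 1   # prime modulus for the hash family (an assumption)

class CountMin:
    def __init__(self, eps, delta, seed=0):
        rng = random.Random(seed)
        self.d = math.ceil(math.log(1.0 / delta))
        self.w = math.ceil(math.e / eps)
        self.C = [[0] * self.w for _ in range(self.d)]
        self.ab = [(rng.randrange(1, P), rng.randrange(P))
                   for _ in range(self.d)]

    def update(self, i, c):
        for j, (a, b) in enumerate(self.ab):
            self.C[j][((a * i + b) % P) % self.w] += c

def inner_product_estimate(ca, cb):
    """min over rows j of S_j = sum_k C_A[j][k] * C_B[j][k]."""
    return min(sum(x * y for x, y in zip(row_a, row_b))
               for row_a, row_b in zip(ca.C, cb.C))

# Hypothetical sparse non-negative vectors, given as {index: value}.
A = {1: 3, 2: 2, 5: 4}
B = {2: 2, 5: 1, 9: 7}

ca = CountMin(eps=0.05, delta=0.05, seed=7)
cb = CountMin(eps=0.05, delta=0.05, seed=7)   # same seed: same hash functions
for i, c in A.items():
    ca.update(i, c)
for i, c in B.items():
    cb.update(i, c)

exact = sum(c * B.get(i, 0) for i, c in A.items())   # 2*2 + 4*1 = 8
estimate = inner_product_estimate(ca, cb)
print(exact, estimate)
```

Since all entries are non-negative, the estimate can only exceed the exact value when two different indices collide in every row.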
Heavy hitters
[Cormode and Muthukrishnan, 2005]
$\phi$-heavy hitters of $A$: $\{i : A_i > \phi \|A\|_1\}$
Number of $\phi$-heavy hitters: between 0 and $1/\phi$
Approximate heavy hitters: accept i such that
$A_i \ge (\phi - \varepsilon) \|A\|_1$ for some specified $\varepsilon < \phi$
We consider the cash-register model
Heavy hitters algorithm: ingredients
CM sketch and point query are the basic building blocks
Return items whose estimate exceeds $\phi \|A\|_1$
Assume $c_s$ is the s-th update $\Rightarrow \|A\|_1 = \sum_{s=1}^{t} c_s$
$\Rightarrow \|A\|_1$ can be easily maintained and updated in small
space
Heavy hitters: update and query
update(i, c, H) {Update CM sketch}
Require: i: array index, $c \ge 0$, H: heap
$\hat{A}_i$ = PQ(i)
if $\hat{A}_i > \phi \|A\|_1$ then
    HeapUpdate(i, $\hat{A}_i$, H) {Insert if $i \notin H$}
end if
s = HeapMin(H)
while PQ(s) $\le \phi \|A\|_1$ do
    HeapDelete(s)
    s = HeapMin(H)
end while
Heap and query
Generic heap element: pair $(i, \hat{A}_i)$ ordered by $\hat{A}_i$
Heavy hitters: return all elements i in H such that $\hat{A}_i > \phi \|A\|_1$
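The update and query steps can be sketched in Python as below. This is a simplified variant, not the procedure exactly as stated: for brevity it keeps the candidates in a plain set and re-checks all of them after each update, instead of using a heap ordered by estimate. The class names, the hash family, the seeds, and the synthetic stream are all assumptions.

```python
import math
import random

P = 2**61 - 1   # prime modulus for the hash family (an assumption)

class CountMin:
    def __init__(self, eps, delta, seed=0):
        rng = random.Random(seed)
        self.d = math.ceil(math.log(1.0 / delta))
        self.w = math.ceil(math.e / eps)
        self.C = [[0] * self.w for _ in range(self.d)]
        self.ab = [(rng.randrange(1, P), rng.randrange(P))
                   for _ in range(self.d)]

    def _col(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c):
        for j in range(self.d):
            self.C[j][self._col(j, i)] += c

    def point_query(self, i):
        return min(self.C[j][self._col(j, i)] for j in range(self.d))

class HeavyHitters:
    def __init__(self, phi, eps, delta, seed=0):
        self.phi = phi
        self.cm = CountMin(eps, delta, seed)
        self.l1 = 0                  # ||A||_1: running sum of all c
        self.candidates = set()      # stands in for the heap H

    def _is_heavy(self, i):
        return self.cm.point_query(i) > self.phi * self.l1

    def update(self, i, c):
        self.cm.update(i, c)
        self.l1 += c
        if self._is_heavy(i):
            self.candidates.add(i)
        # Drop every candidate whose estimate no longer exceeds phi*||A||_1.
        self.candidates = {s for s in self.candidates if self._is_heavy(s)}

    def query(self):
        return {i for i in self.candidates if self._is_heavy(i)}

# Synthetic cash-register stream: items 1 and 2 dominate, 200 items occur once.
stream = [1] * 500 + [2] * 300 + list(range(1000, 1200))
random.Random(3).shuffle(stream)

hh = HeavyHitters(phi=0.2, eps=0.01, delta=0.01)
for item in stream:
    hh.update(item, 1)
result = hh.query()
print(sorted(result))
```

Because estimates never fall below the true counts and $\|A\|_1$ only grows, items 1 and 2 (true counts 500 and 300 out of 1000) are guaranteed to be in the result; any other item would be a false positive caused by collisions.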
Heavy hitters: performance
Theorem ([Cormode and Muthukrishnan, 2005])
Assume inserts only (cash register model). With CM sketches
using space $O(\frac{1}{\varepsilon} \log \frac{n}{\delta})$ and update time $O(\log \frac{n}{\delta})$ per item:
Every heavy hitter is output
With probability at least $1 - \delta$: i) no item whose real
count is $\le (\phi - \varepsilon) \|A\|_1$ is output and ii) the number of
items in the heap is $O(\frac{1}{\phi})$
Question
Assume $w = \lceil \frac{e}{\varepsilon} \rceil$ and $d = \lceil \ln \frac{n}{\delta} \rceil$ and let T be the estimated
set of heavy hitters. Recall that $\hat{A}_i \ge A_i$. Assume that, after
any number t of insertions: $A_i < (\phi - \varepsilon) \|A\|_1$. Prove that
$P[i \in T] \le \delta$.
Solution
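Since $i \in T$, we have
$\hat{A}_i > \phi \|A\|_1 \Rightarrow \forall j : C[j, h_j(i)] > \phi \|A\|_1$.
Hence:
$A_i < (\phi - \varepsilon) \|A\|_1 \Rightarrow C[j, h_j(i)] - A_i > \varepsilon \|A\|_1, \forall j$.
This implies:
$P[i \in T] \le P[\hat{A}_i > A_i + \varepsilon \|A\|_1] = P[\forall j : C[j, h_j(i)] > A_i + \varepsilon \|A\|_1] < e^{-d} \le \frac{\delta}{n}$,
where the strict inequality follows from the general result seen
for PQ(i). Finally, there are at most n items, so by the union bound over all items i satisfying the assumption:
$P[\exists i : i \in T] \le \delta$.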
Cormode, G. and Muthukrishnan, S. (2005).
An improved data stream summary: the count-min sketch and
its applications.
J. Algorithms, 55(1):58-75.
Ipoque GMBH, G. (2007).
Internet study 2007. URL: http://www.ipoque.com/.
Muthukrishnan, S. (2005a).
Data stream algorithms. URL:
http://www.cs.rutgers.edu/muthu/str05.html.
Muthukrishnan, S. (2005b).
Data streams: Algorithms and applications.
In Foundations and Trends in Theoretical Computer Science,
Now Publishers or World Scientific, volume 1.
Draft at author's homepage:
http://www.cs.rutgers.edu/muthu/stream-1-1.ps.
Odlyzko, A. M. (2003).
Internet traffic growth: sources and implications.
In Proc. of SPIE conference on Optical Transmission Systems
and Equipment for WDM Networking, pages 1-15.
