Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Event Data
Finance
Gaming
Monitoring
Advertisement
Sensor Networks
Social Media
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
Online Learning
converges, e.g. if
http://leon.bottou.org/research/stochastic
100 events per second
360k events per hour
8.6M events per day
260M events per month
3.2B events per year
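The figures above follow from simple multiplication. A small check, assuming a 30-day month and a 365-day year (the slide does not say which it uses; the variable names are illustrative):

```python
# Back-of-the-envelope event volumes for a constant rate of 100 events/second.
EVENTS_PER_SECOND = 100

per_hour = EVENTS_PER_SECOND * 60 * 60
per_day = per_hour * 24
per_month = per_day * 30   # assumption: 30-day month
per_year = per_day * 365   # assumption: 365-day year

for label, count in [
    ("hour", per_hour),
    ("day", per_day),
    ("month", per_month),
    ("year", per_year),
]:
    print(f"events per {label}: {count:,}")
```

The exact values round to the numbers on the slide: about 8.6M per day, 260M per month, and 3.2B per year.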
http://wordle.net
http://www.flickr.com/photos/arenamontanus/269158554/
Continuous Stream of Data
Stream Queries
Answer stream queries with finite resources.
Typical examples:
How often does an item appear in a stream?
How many distinct elements are in the stream?
What are the top-k most frequent items?
The Trade-Off
Big Data
Stream Mining
Map Reduce and friends
Fast
Exact
First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra
Count activities over large item sets (millions or more, e.g. IP addresses, Twitter users).
Only the most active elements are of interest.
[Figure: table of elements and their counts, showing two update cases]
Case 1: element already in database
Case 2: new element
Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
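The two cases can be sketched in Python along the lines of the Space-Saving scheme from the cited paper. This is a minimal illustration, not the talk's implementation: the class and method names are hypothetical, and the minimum is found with a linear scan for brevity, whereas the paper describes a dedicated summary structure for that step.

```python
class SpaceSaving:
    """Track approximate counts for at most `capacity` items."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts = {}  # item -> (possibly overestimated) count

    def update(self, item, weight=1):
        if item in self.counts:
            # Case 1: element already tracked, just increment it.
            self.counts[item] += weight
        elif len(self.counts) < self.capacity:
            # Table not full yet: start tracking the new element.
            self.counts[item] = weight
        else:
            # Case 2: new element and the table is full. Evict the element
            # with the smallest count; the newcomer inherits that count.
            victim = min(self.counts, key=self.counts.get)
            inherited = self.counts.pop(victim)
            self.counts[item] = inherited + weight

    def top(self, k):
        """Return the k items with the largest counts, largest first."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


summary = SpaceSaving(capacity=3)
for event in ["a"] * 5 + ["b"] * 3 + ["c", "d"]:
    summary.update(event)
print(summary.top(2))
```

Memory stays bounded by `capacity` no matter how many distinct items arrive. The price is that counts of recently admitted items may be overestimated by the count they inherited.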
Count-Min Sketches
Summarize histograms over large feature sets
Like bloom filters, but better
[Figure: array of counters with m bins and n different hash functions]
Query result: 1
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004; J. Algorithms 55(1): 58-75 (2005).
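A count-min sketch can be illustrated in a few lines of Python. This is a sketch under stated assumptions: the class name, the `width` and `depth` parameters, and the use of salted BLAKE2 digests in place of the hash family from the paper are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate counts in depth * width counters; estimates never undercount."""

    def __init__(self, width, depth):
        self.width = width    # number of bins per row ("m bins")
        self.depth = depth    # number of rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _bins(self, item):
        # One bin per row, derived from a row-specific salted hash.
        data = item.encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        for row, col in self._bins(item):
            self.table[row][col] += count

    def query(self, item):
        # Collisions only ever add to a counter, so the minimum over the
        # rows is the least contaminated estimate.
        return min(self.table[row][col] for row, col in self._bins(item))


cms = CountMinSketch(width=64, depth=4)
for word in ["a", "b", "a", "c", "a"]:
    cms.update(word)
print(cms.query("a"), cms.query("b"))
```

Taking the minimum is what gives the sketch its name. Wider rows reduce the size of the overestimate, and more rows reduce the chance of a bad estimate.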
Online clustering
[Figure: count-min sketch counter table]
Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering, 2009
Time
Keep quite a big log (a month?)
Constant write/erase in database
Alternative: Exponential decay
DB
Exponential Decay
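The formulas on these slides did not survive extraction, so the following is only one common way to realize an exponentially decayed counter, not necessarily the formulation used in the talk. Each key stores a score and the time of its last update; the score is decayed lazily when the key is touched. The class name, the half-life parameter, and the treatment of late events are assumptions made for this example.

```python
class DecayedCounters:
    """Counters whose values halve every `half_life` time units."""

    def __init__(self, half_life):
        self.half_life = float(half_life)
        self.state = {}  # key -> (score, timestamp of that score)

    def _decay(self, elapsed):
        return 0.5 ** (elapsed / self.half_life)

    def update(self, key, t, value=1.0):
        score, last = self.state.get(key, (0.0, t))
        if t >= last:
            # Age the stored score up to time t, then add the new event.
            score = score * self._decay(t - last) + value
            last = t
        else:
            # Late event: discount it as if it had been aging since t.
            score += value * self._decay(last - t)
        self.state[key] = (score, last)

    def score(self, key, now):
        if key not in self.state:
            return 0.0
        score, last = self.state[key]
        return score * self._decay(max(0.0, now - last))


counters = DecayedCounters(half_life=10)
counters.update("x", t=0)
counters.update("x", t=10)
print(counters.score("x", now=20))
```

No event log is needed: each key costs two numbers, and old activity fades out on its own instead of having to be erased from a database.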
do
Then, reconstruct
As a reminder:
More: Maximum-Likelihood
based on
Outlier detection
Once you have a model, you can compute p-values (based on recent time frames!)
TF-IDF
for each word: update(word, t, 1.0)
for each document: update(#docs, t, 1.0)
query: score(word) / score(#docs)
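The recipe above can be sketched in Python with a small decayed-score store. Several details are assumptions for this example: the `DecayedScores` class, the half-life parameter, non-decreasing timestamps, and counting each distinct word once per document so that the ratio reads as a decayed document frequency.

```python
import math

DOCS_KEY = "#docs"  # reserved key for the document counter, as in the recipe


class DecayedScores:
    """Exponentially decayed scores; assumes timestamps never go backwards."""

    def __init__(self, half_life):
        self.rate = math.log(2.0) / half_life
        self.state = {}  # key -> (score, time of last update)

    def update(self, key, t, value):
        score, last = self.state.get(key, (0.0, t))
        self.state[key] = (score * math.exp(-self.rate * (t - last)) + value, t)

    def score(self, key, now):
        score, last = self.state.get(key, (0.0, now))
        return score * math.exp(-self.rate * (now - last))


def observe(scores, text, t):
    """Feed one document into the store."""
    for word in set(text.lower().split()):  # assumption: once per document
        scores.update(word, t, 1.0)
    scores.update(DOCS_KEY, t, 1.0)


def document_frequency(scores, word, now):
    """The query from the recipe: score(word) / score(#docs)."""
    total = scores.score(DOCS_KEY, now)
    return scores.score(word, now) / total if total > 0 else 0.0


scores = DecayedScores(half_life=10.0)
observe(scores, "a b", t=0.0)
observe(scores, "a c", t=0.0)
observe(scores, "b", t=10.0)
print(document_frequency(scores, "a", now=10.0))
print(document_frequency(scores, "b", now=10.0))
```

Because numerator and denominator decay at the same rate, the ratio reflects how common a word is among recent documents. An IDF-style weight could then be taken as the negative logarithm of that ratio.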
class priors
Priors
ICML 2003
transform TF to log( . + 1)
IDF-style normalization
square length normalization
use complement probability
another log
normalize those weights again
Predict linearly using those weights
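These steps resemble the transformed, weight-normalized complement naive Bayes recipe of Rennie et al. (ICML 2003); that match is an assumption, since the slide names only the venue. The following batch sketch walks through the steps in order. Function names, the smoothing parameter `alpha`, and the toy data are hypothetical, and a streaming variant would replace the sums with decayed counters.

```python
import math
from collections import Counter, defaultdict


def train_cnb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class labels."""
    n_docs = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    doc_freq = Counter(w for doc in docs for w in set(doc))

    transformed = []
    for doc in docs:
        # Steps 1 and 2: log(TF + 1), then IDF-style weighting.
        vec = {w: math.log(c + 1.0) * math.log(n_docs / doc_freq[w])
               for w, c in Counter(doc).items()}
        # Step 3: square (L2) length normalization.
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {w: v / norm for w, v in vec.items()}
        transformed.append(vec)

    weights = {}
    for c in set(labels):
        # Step 4: smoothed word probabilities over all documents NOT in c.
        comp = defaultdict(float)
        for vec, y in zip(transformed, labels):
            if y != c:
                for w, v in vec.items():
                    comp[w] += v
        total = sum(comp.values()) + alpha * len(vocab)
        # Step 5: another log.
        logs = {w: math.log((comp[w] + alpha) / total) for w in vocab}
        # Step 6: normalize the weights again (by their absolute sum).
        scale = sum(abs(v) for v in logs.values()) or 1.0
        weights[c] = {w: v / scale for w, v in logs.items()}
    return weights


def predict_cnb(weights, tokens):
    """Step 7: linear score per class; the lowest complement score wins."""
    tf = Counter(tokens)

    def score(c):
        return sum(n * weights[c].get(w, 0.0) for w, n in tf.items())

    return min(weights, key=score)


docs = [["ball", "goal", "goal"], ["vote", "election"],
        ["goal", "team"], ["election", "vote", "party"]]
labels = ["sports", "politics", "sports", "politics"]
model = train_cnb(docs, labels)
print(predict_cnb(model, ["goal", "ball"]))
```

The classifier picks the class whose complement explains the document worst, which is why prediction uses `min` rather than `max`.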
elements!
Streamdrill
Heavy Hitters counting + exponential decay
Instant counts & top-k results over time windows.
Indices!
Snapshots for historical analysis
Beta demo available at http://streamdrill.com, launch imminent
Architecture Overview
REST API
Create a trend
/1/create/plays/user:song:location?size=1000&timescales=day,hour
Update a word
/1/update/plays/frank:123123:San+Francisco
/1/update/plays/paul:145323:Berlin?ts=131341354135
Get most played songs for SF or Paul
Get score for a word
/1/query/score/hello
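The update paths shown above follow a visible pattern: a trend name, then colon-separated keys, then an optional `ts` parameter. The helper below only reproduces that pattern as a string. It is hypothetical, not a streamdrill client, and host, authentication, and HTTP method are not covered by the slide.

```python
from urllib.parse import quote_plus, urlencode

API_PREFIX = "/1"  # version prefix as it appears in the example paths


def update_path(trend, keys, ts=None):
    """Build an update path such as /1/update/plays/frank:123123:San+Francisco."""
    # quote_plus turns spaces into "+", matching the "San+Francisco" example.
    entry = ":".join(quote_plus(str(key)) for key in keys)
    path = f"{API_PREFIX}/update/{trend}/{entry}"
    if ts is not None:
        # Optional explicit event timestamp, as in the "?ts=..." example.
        path += "?" + urlencode({"ts": ts})
    return path


print(update_path("plays", ["frank", 123123, "San Francisco"]))
print(update_path("plays", ["paul", 145323, "Berlin"], ts=131341354135))
```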
http://play.streamdrill.com/vis/
Trends:
$FB:http://on.wsj.com/15fHaZW
Summary
Doesn't always have to be scaling!
Stream mining: Approximate results with finite resources.
streamdrill: stream analysis engine