
Online Learning with Stream Mining

Mikio L. Braun, @mikiobraun http://blog.mikiobraun.de TWIMPACT http://twimpact.com

Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

Event Data
Finance
Gaming
Monitoring
Advertisement
Sensor Networks
Social Media

Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie

Online Learning

Isn't all learning online?



Isn't Machine Learning easily Online?

Stochastic gradient descent

converges, e.g. if

http://leon.bottou.org/research/stochastic
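A minimal sketch of stochastic gradient descent on a squared-error loss. The step-size schedule `eta0 / (1 + decay * t)` is an assumption chosen because it satisfies the usual Robbins-Monro conditions (the step sizes sum to infinity, their squares do not); the function and parameter names are illustrative, not from the talk.

```python
from itertools import cycle, islice


def sgd_least_squares(stream, dim, eta0=0.1, decay=0.01):
    """Fit w so that dot(w, x) ~ y, seeing each (x, y) pair exactly once."""
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream):
        # Decaying step size: sum(eta_t) diverges, sum(eta_t ** 2) converges.
        eta = eta0 / (1.0 + decay * t)
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        # The gradient of 0.5 * error ** 2 with respect to w is error * x.
        w = [wi - eta * error * xi for wi, xi in zip(w, x)]
    return w


# Toy stream: y = 2 * x, cycling over a few inputs.
inputs = [1.0, 2.0, -1.0, 0.5]
stream = (([x], 2.0 * x) for x in islice(cycle(inputs), 1000))
w = sgd_least_squares(stream, dim=1)
```

On this stationary toy stream the estimate settles close to 2. The shrinking step size that makes this work is also what makes plain SGD slow to react once the data distribution changes.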

Online vs Batch: Non-stationarities

Very good tutorial by Albert Bifet et al. on these issues at http://sites.google.com/site/advancedstreamingtutorial



Time horizons vs. Learning rate

You can't just do online learning on event data!



Also, Event Data is huge

The problem: You easily get A LOT OF DATA!


100 events per second
360k events per hour
8.6M events per day
260M events per month
3.2B events per year
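The arithmetic behind these figures, as a small sketch. The 30-day month and 365-day year are assumptions; the function name is illustrative.

```python
def events_per_period(rate_per_second):
    """Scale a per-second event rate up to longer periods."""
    hour = 3600
    day = 24 * hour
    return {
        "hour": rate_per_second * hour,
        "day": rate_per_second * day,
        "month": rate_per_second * 30 * day,   # assumes a 30-day month
        "year": rate_per_second * 365 * day,   # assumes a 365-day year
    }


volumes = events_per_period(100)
```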


So, online learning challenges:


So much data!
Concept Drift
Online (as in not batch) is not the whole story.


Digging into Least Squares


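Idea: Batch method like least squares on recent portion of the data.

The data matrix itself (one row per event): this could be huge!

The quantities the solution needs, a vector with d entries and a d x d matrix, are probably ok.

But: It's just a sum!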
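Written out in the usual notation, which is an assumption here since the symbols are not in the slide text: with data matrix X (n rows x_i, d columns) and targets y, the least-squares solution and the two sums it depends on are

```latex
\hat{w} = (X^\top X)^{-1} X^\top y, \qquad
X^\top X = \sum_{i=1}^{n} x_i x_i^\top \in \mathbb{R}^{d \times d}, \qquad
X^\top y = \sum_{i=1}^{n} x_i y_i \in \mathbb{R}^{d}.
```

Both sums can be accumulated one event at a time, so the n x d matrix X never has to be stored.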


Another Problem: High-dimensional Spaces

Potentially large spaces:


distinct words: >100k
IP addresses: >100M
users in a social network: >10M

http://wordle.net

http://www.flickr.com/photos/arenamontanus/269158554/


Stream Mining to the rescue

Stream mining algorithms: answer stream queries with finite resources

Typical examples:

how often does an item appear in a stream?
how many distinct elements are in the stream?
what are the top-k most frequent items?

Diagram: Continuous Stream of Data -> Bounded Resource Analyzer -> Stream Queries


The Trade-Off
Diagram: a triangle with the corners Big Data, Fast, and Exact.

Stream Mining: Big Data and Fast
Map Reduce and friends: Big Data and Exact

First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra

Heavy Hitters (a.k.a. Top-k)

Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users).
Interested in most active elements only.

Fixed tables of counts

Case 1 (element already in database): increment its count.
Case 2 (new element): it replaces the entry with the smallest count. In the slide's example, the new element 713 replaces 023 (count 2) and enters with count 3.

Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
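The two cases can be sketched as follows, in the spirit of the Space-Saving algorithm from the cited paper. This is a simplified illustration: the class and method names are made up, and the linear scan for the smallest entry stands in for the paper's more efficient data structure.

```python
class SpaceSaving:
    """Approximate top-k counter with a fixed number of table entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> (possibly overestimated) count

    def update(self, item, count=1):
        if item in self.counts:
            # Case 1: element already in the table.
            self.counts[item] += count
        elif len(self.counts) < self.capacity:
            # Table not full yet: just start counting.
            self.counts[item] = count
        else:
            # Case 2: new element evicts the entry with the smallest count
            # and inherits that count, so counts can only overestimate.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + count

    def top(self, k):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]


hh = SpaceSaving(capacity=3)
for item in ["a", "b", "a", "c", "a", "b", "d", "a"]:
    hh.update(item)
```

Memory stays bounded by `capacity` no matter how many distinct items appear. The price is that rarely seen items can carry inflated counts inherited from the entries they replaced.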


Count-Min Sketches

Summarize histograms over large feature sets.
Like bloom filters, but better.

Figure: a table of counters with m bins per row and one row for each of n different hash functions.

Update for a new entry: increment one bin in each row, chosen by that row's hash function.

Query: take the minimum over all hash functions (query result in the figure's example: 1).



G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
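A minimal count-min sketch, to make the update and query rules concrete. The salted SHA-256 hashing is an assumption that stands in for the hash family used in the paper, and the default sizes are arbitrary.

```python
import hashlib


class CountMinSketch:
    def __init__(self, num_hashes=4, num_bins=1024):
        self.num_hashes = num_hashes
        self.num_bins = num_bins
        self.table = [[0] * num_bins for _ in range(num_hashes)]

    def _bins(self, item):
        # One bin per row; the row index salts the hash so rows differ.
        for row in range(self.num_hashes):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.num_bins

    def update(self, item, count=1):
        for row, col in self._bins(item):
            self.table[row][col] += count

    def query(self, item):
        # Collisions only ever add to a bin, so the minimum over the rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][col] for row, col in self._bins(item))


cms = CountMinSketch()
for ip in ["10.0.0.1"] * 5 + ["10.0.0.2"] * 2:
    cms.update(ip)
```

`cms.query("10.0.0.1")` is never below the true count of 5, and with non-negative updates it can only be above it when the item collides with others in every row.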

Clustering with count-min Sketches

Online clustering

For each data point:


Map to closest centroid (compute distances)
Update centroid

Use count-min sketches to represent the sum over all vectors in a class.

Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009
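A simplified sketch of this idea, not the algorithm from the cited paper: each cluster keeps the sum of its sparse vectors in a count-min sketch, and a point goes to the cluster whose estimated centroid has the largest dot product with it. Dot-product similarity instead of a distance, non-negative feature values, the salted SHA-256 hashing, and all names are assumptions.

```python
import hashlib


class VectorSketch:
    """Count-min sketch holding the running sum of sparse vectors."""

    def __init__(self, num_hashes=4, num_bins=256):
        self.num_hashes, self.num_bins = num_hashes, num_bins
        self.table = [[0.0] * num_bins for _ in range(num_hashes)]
        self.num_points = 0

    def _bin(self, row, feature):
        digest = hashlib.sha256(f"{row}:{feature}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bins

    def add(self, point):
        # point is a dict {feature: non-negative value}.
        for feature, value in point.items():
            for row in range(self.num_hashes):
                self.table[row][self._bin(row, feature)] += value
        self.num_points += 1

    def centroid_dot(self, point):
        # Estimate <centroid, point>: per-feature minimum over the rows,
        # then divide the summed vector by the number of points.
        total = 0.0
        for feature, value in point.items():
            estimate = min(self.table[row][self._bin(row, feature)]
                           for row in range(self.num_hashes))
            total += value * estimate
        return total / max(self.num_points, 1)


def assign_and_update(clusters, point):
    """Map the point to the most similar cluster, then update that cluster."""
    best = max(range(len(clusters)),
               key=lambda i: clusters[i].centroid_dot(point))
    clusters[best].add(point)
    return best


clusters = [VectorSketch(), VectorSketch()]
clusters[0].add({"goal": 1.0, "match": 1.0})    # seed point for cluster 0
clusters[1].add({"stock": 1.0, "market": 1.0})  # seed point for cluster 1
label = assign_and_update(clusters, {"stock": 1.0, "price": 1.0})
```

The memory per cluster is fixed by the sketch size, regardless of how many distinct features the stream contains.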


Heavy Hitters over Time-Window

Keep quite a big log (a month?)
Constant write/erase in database
Alternative: Exponential decay


Exponential Decay

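Instead of a fixed window, use exponential decay based on timestamps: an event's contribution to the score halves every time one halftime passes.

The beauty: updates are recursive. The new score is the new count plus the old score multiplied by a time shift term.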
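One way to write this down, assuming events with counts c_i at times t_i and a halftime h (the symbols are illustrative):

```latex
s(t) = \sum_{i:\, t_i \le t} c_i \, 2^{-(t - t_i)/h},
\qquad
s(t') = c' + \underbrace{2^{-(t' - t)/h}}_{\text{time shift term}} \, s(t)
```

Here s(t) is the score right after the last update at time t, and c' is the count of a new event at time t' >= t. Only the last score and its timestamp need to be stored.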



Exponential Decay

Collect stats by a table of expdecay counters

    counters[item]   # counters
    ts[item]         # last timestamp

update(C, item, timestamp, count): update counts

    C.counters[item] = count + weight(timestamp, C.ts[item]) * C.counters[item]
    C.ts[item] = timestamp
    C.lastupdate = timestamp

score(C, item): return score

    return weight(C.lastupdate, C.ts[item]) * C.counters[item]
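A runnable version of this pseudocode, as a sketch. The form of `weight` (a power of two with a `halftime` parameter) and the handling of unseen items are assumptions; timestamps are assumed to arrive in non-decreasing order.

```python
class ExpDecayCounters:
    def __init__(self, halftime):
        self.halftime = halftime
        self.counters = {}      # item -> decayed count as of ts[item]
        self.ts = {}            # item -> timestamp of the item's last update
        self.lastupdate = None  # most recent timestamp seen for any item

    def weight(self, now, then):
        # Halves for every halftime that passed between then and now.
        return 2.0 ** (-(now - then) / self.halftime)

    def update(self, item, timestamp, count=1.0):
        old = self.counters.get(item, 0.0)
        then = self.ts.get(item, timestamp)
        self.counters[item] = count + self.weight(timestamp, then) * old
        self.ts[item] = timestamp
        self.lastupdate = timestamp

    def score(self, item):
        # Decay the stored count forward to the latest timestamp seen.
        if item not in self.counters:
            return 0.0
        return self.weight(self.lastupdate, self.ts[item]) * self.counters[item]


c = ExpDecayCounters(halftime=10.0)
c.update("a", 0.0)
c.update("a", 10.0)  # stored count for "a": 1 + 0.5 * 1 = 1.5
c.update("b", 20.0)
```

After these updates `c.score("a")` is 0.75, because one more halftime has passed since the last update of "a", while `c.score("b")` is 1.0.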


Least Squares Revisited


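Need to compute the sums that least squares depends on (the d x d matrix and the vector with d entries), but over the recent data only.

For each data point, do an update of the decayed counters for every entry of these sums.

Then, reconstruct the matrix and the vector from the scores and solve as in the batch case.

As a reminder: the least-squares solution depends on the data only through these two sums.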
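A sketch of least squares on exponentially decayed sums. It assumes one shared timestamp for all entries, so a single decay factor covers the whole matrix, and it uses a small hand-written Gaussian elimination to stay dependency-free. All names are illustrative.

```python
def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


class DecayedLeastSquares:
    def __init__(self, dim, halftime):
        self.dim, self.halftime = dim, halftime
        self.xtx = [[0.0] * dim for _ in range(dim)]  # decayed sum of x x^T
        self.xty = [0.0] * dim                        # decayed sum of x * y
        self.last = None

    def update(self, x, y, timestamp):
        decay = 1.0
        if self.last is not None:
            decay = 2.0 ** (-(timestamp - self.last) / self.halftime)
        for j in range(self.dim):
            self.xty[j] = x[j] * y + decay * self.xty[j]
            for k in range(self.dim):
                self.xtx[j][k] = x[j] * x[k] + decay * self.xtx[j][k]
        self.last = timestamp

    def weights(self):
        # Needs at least dim linearly independent points, else xtx is singular.
        return solve(self.xtx, self.xty)


model = DecayedLeastSquares(dim=2, halftime=5.0)
data = [([1.0, 0.0], 3.0), ([0.0, 1.0], -2.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 4.0)]
for t, (x, y) in enumerate(data):
    model.update(x, y, float(t))  # toy data from y = 3 * x1 - 2 * x2
w_hat = model.weights()
```

Memory is d x d + d numbers regardless of stream length, and older events fade out at the rate set by `halftime`.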


More: Maximum-Likelihood

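Estimate probabilistic models based on running averages of the data, for example a Gaussian from the mean and the variance estimate that divides by n, which is slightly biased, but simpler.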

But wait, how do I compute 1/n with randomly spaced events?
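One possible answer, sketched under assumptions: keep a decayed event count next to the decayed sums and use it in place of n. The class below estimates a Gaussian this way; the names and the shared-timestamp design are illustrative.

```python
class DecayedGaussian:
    def __init__(self, halftime):
        self.halftime = halftime
        self.n = 0.0     # decayed number of events, stands in for n
        self.sx = 0.0    # decayed sum of x
        self.sxx = 0.0   # decayed sum of x ** 2
        self.last = None

    def update(self, x, timestamp):
        decay = 1.0
        if self.last is not None:
            decay = 2.0 ** (-(timestamp - self.last) / self.halftime)
        self.n = 1.0 + decay * self.n
        self.sx = x + decay * self.sx
        self.sxx = x * x + decay * self.sxx
        self.last = timestamp

    def mean(self):
        return self.sx / self.n

    def variance(self):
        # E[x^2] - E[x]^2: the slightly biased but simpler estimate.
        return self.sxx / self.n - self.mean() ** 2


g = DecayedGaussian(halftime=1.0)
g.update(2.0, 0.0)
g.update(4.0, 1.0)  # the first event now counts half as much
```

Here the weights are 0.5 and 1, so the decayed count is 1.5 and the mean is 10/3 rather than 3. Irregular spacing is handled by the timestamps alone.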


Outlier detection

Once you have a model, you can compute p-values (based on recent time frames!)
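As an illustration, assuming the model is a Gaussian whose mean and variance were estimated from recent data, a two-sided p-value can be computed with the standard library. The function names and the threshold `alpha` are hypothetical.

```python
import math


def gaussian_p_value(x, mean, variance):
    """Two-sided p-value of x under a Gaussian with the given mean and variance."""
    z = abs(x - mean) / math.sqrt(variance)
    # P(|Z| >= z) for a standard normal Z.
    return math.erfc(z / math.sqrt(2.0))


def is_outlier(x, mean, variance, alpha=0.01):
    # Flag values that would be this extreme with probability below alpha.
    return gaussian_p_value(x, mean, variance) < alpha
```

For example, `is_outlier(10.0, 0.0, 1.0)` flags a value ten standard deviations from the mean, while `is_outlier(0.5, 0.0, 1.0)` does not.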


TF-IDF

estimate word document frequencies

for each word: update(word, t, 1.0)
for each document: update(#docs, t, 1.0)
query: score(word) / score(#docs)
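These three steps can be sketched with a small decayed counter table. Counting each word once per document, turning the frequency into `log(1 / df)`, and the class and function names are assumptions; the `#docs` key follows the slide.

```python
import math


class DecayedCounts:
    def __init__(self, halftime):
        self.halftime = halftime
        self.state = {}  # key -> (decayed count, timestamp of last update)
        self.now = 0.0   # latest timestamp seen

    def update(self, key, t, count=1.0):
        value, then = self.state.get(key, (0.0, t))
        decayed = value * 2.0 ** (-(t - then) / self.halftime)
        self.state[key] = (count + decayed, t)
        self.now = t

    def score(self, key):
        value, then = self.state.get(key, (0.0, self.now))
        return value * 2.0 ** (-(self.now - then) / self.halftime)


DOCS = "#docs"  # reserved key that counts documents


def observe(counts, words, t):
    for word in set(words):  # document frequency: once per document
        counts.update(word, t, 1.0)
    counts.update(DOCS, t, 1.0)


def idf(counts, word):
    df = counts.score(word) / counts.score(DOCS)  # the query from the slide
    return math.log(1.0 / df) if df > 0 else float("inf")


counts = DecayedCounts(halftime=1.0)
observe(counts, ["stream", "mining"], 0.0)
observe(counts, ["stream", "stream", "drill"], 0.0)
```

With both documents at the same timestamp no decay applies yet: "stream" appears in both documents (IDF 0) and "mining" in one of two (IDF log 2).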


Extracting a relevant subset


Classification with Naive Bayes

Naive Bayes is also just counting, right?

Multinomial naive Bayes needs only these quantities:

class priors
frequency of word in document
number of times word appears in class
total number of words in class
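A minimal sketch of multinomial naive Bayes built from exactly these counts. Laplace smoothing and all names are assumptions added for the example; plain counters are used here, and decayed counters could take their place.

```python
import math
from collections import defaultdict


class CountingNaiveBayes:
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing                # Laplace smoothing (assumption)
        self.docs_in_class = defaultdict(float)   # for the class priors
        self.word_in_class = defaultdict(float)   # (class, word) -> count
        self.words_in_class = defaultdict(float)  # class -> total number of words
        self.vocabulary = set()

    def update(self, words, label):
        self.docs_in_class[label] += 1.0
        for word in words:
            self.word_in_class[(label, word)] += 1.0
            self.words_in_class[label] += 1.0
            self.vocabulary.add(word)

    def log_score(self, words, label):
        total_docs = sum(self.docs_in_class.values())
        score = math.log(self.docs_in_class[label] / total_docs)
        denom = self.words_in_class[label] + self.smoothing * len(self.vocabulary)
        for word in words:  # a repeated word contributes once per occurrence
            count = self.word_in_class.get((label, word), 0.0)
            score += math.log((count + self.smoothing) / denom)
        return score

    def predict(self, words):
        return max(self.docs_in_class, key=lambda c: self.log_score(words, c))


nb = CountingNaiveBayes()
nb.update(["goal", "match", "goal"], "sports")
nb.update(["stock", "market"], "finance")
```

Training is a handful of counter increments per document, which is what makes the model a natural fit for streams.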


Classification with Naive Bayes

ICML 2003

Classification with Naive Bayes

7 Steps to improve NB:


1. transform TF to log( . + 1)
2. IDF-style normalization
3. square length normalization
4. use complement probability
5. another log
6. normalize those weights again
7. Predict linearly using those weights


What about non-parametric methods and Kernel Methods?

Problem here: there is no real accumulation of information in statistics. With SVMs, for example, the predictor is a sum over all elements!

Could still use streamdrill to extract a representative subset.


Streamdrill

Heavy Hitters counting + exponential decay
Instant counts & top-k results over time windows.
Indices!
Snapshots for historical analysis
Beta demo available at http://streamdrill.com, launch imminent


Architecture Overview


REST API

Create a trend

/1/create/plays/user:song:location?size=1000&timescales=day,hour

Update a word

/1/update/plays/frank:123123:San+Francisco

Another word (with timestamp)

/1/update/plays/paul:145323:Berlin?ts=131341354135

Get most played songs for SF or Paul

/1/query/plays?city=San+Francisco
/1/query/plays?user=paul

Get score for a word

/1/query/score/hello
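A small sketch of how a client might assemble these URLs with the Python standard library. Only the paths and parameters shown on the slide are used; the base URL is a placeholder and the helper functions are hypothetical, not part of any streamdrill client.

```python
from urllib.parse import quote_plus, urlencode

BASE = "http://streamdrill.example"  # placeholder host, not from the talk


def create_url(trend, entities, size, timescales):
    # safe="," keeps the comma-separated timescales readable.
    params = {"size": size, "timescales": ",".join(timescales)}
    query = urlencode(params, safe=",")
    return f"{BASE}/1/create/{trend}/{':'.join(entities)}?{query}"


def update_url(trend, keys, ts=None):
    # Keys are joined with ":" and spaces become "+", as on the slide.
    path = ":".join(quote_plus(str(key)) for key in keys)
    suffix = "" if ts is None else "?" + urlencode({"ts": ts})
    return f"{BASE}/1/update/{trend}/{path}{suffix}"


def query_url(trend, **filters):
    return f"{BASE}/1/query/{trend}?{urlencode(filters)}"


print(create_url("plays", ["user", "song", "location"], 1000, ["day", "hour"]))
print(update_url("plays", ["frank", 123123, "San Francisco"]))
print(update_url("plays", ["paul", 145323, "Berlin"], ts=131341354135))
print(query_url("plays", city="San Francisco"))
```

The helpers only build strings; sending the requests and the HTTP method to use are left open, since the slide does not specify them.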

Example: Twitter Stock Analysis

http://play.streamdrill.com/vis/

Example: Twitter Stock Analysis

Trends:

symbol:combinations    $AAPL:$GOOG
symbol:hashtag         $AAPL:#trading
symbol:keywords        $GOOG:disruption
symbol:mentions        $GOOG:WallStreetCom
symbol trend           $AAPL
symbol:url             $FB:http://on.wsj.com/15fHaZW


Example: Twitter Stock Analysis


Diagram: Twitter sends tweets to the Tweet Analyzer, which updates streamdrill; the JavaScript front end reads the results via REST.


Summary

Doesn't always have to be scaling!
Stream mining: Approximate results with finite resources.
streamdrill: stream analysis engine
