
IoT Big Data

Stream Mining
Tutorial
KDD 2016

Gianmarco De Francisci Morales, Albert Bifet, Latifur Khan,


Joao Gama, and Wei Fan
Organizers (1/3)

Gianmarco De Francisci Morales is a Scientist at QCRI. His research focuses on large-scale data mining and big data, with a particular emphasis on web mining and Data-Intensive Scalable Computing systems. He is an active member of the open-source community of the Apache Software Foundation, working on the Hadoop ecosystem, and a committer for the Apache Pig project. He is the co-leader of the SAMOA project, an open-source platform for mining big data streams.

http://gdfm.me

Albert Bifet is Associate Professor at Telecom ParisTech and Honorary Research Associate at the WEKA Machine Learning Group at the University of Waikato. He is the author of the book Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. He is one of the leaders of the MOA and SAMOA software environments for implementing algorithms and running experiments for online learning from evolving data streams.

http://albertbifet.com

2
Organizers (2/3)
Latifur Khan is a full Professor (tenured) in the Computer Science Department at the University of Texas at Dallas, where he has been teaching and conducting research since September 2000. He has received prestigious awards including the IEEE Technical Achievement Award for Intelligence and Security Informatics. Dr. Khan is an ACM Distinguished Scientist and a Senior Member of IEEE.

https://www.utdallas.edu/~lkhan/
João Gama is an Associate Professor at the University of Porto and a senior researcher at LIAAD, INESC TEC. He received his Ph.D. in Computer Science from the University of Porto, Portugal. His main interests are machine learning and data mining, mainly in the context of time-evolving data streams. He authored a recent book on Knowledge Discovery from Data Streams.

http://www.liaad.up.pt/~jgama

3
Organizers (3/3)

Wei Fan is the Deputy Head of Baidu Research Big Data Lab. His co-authored paper received the ICDM '06 Best Application Paper Award, and he led the team that used his Random Decision Tree method to win the 2008 ICDM Data Mining Cup Championship. He received the 2010 IBM Outstanding Technical Achievement Award for his contribution to IBM InfoSphere Streams. At Huawei, he led his colleagues to develop Huawei StreamSMART, a streaming platform for online and real-time processing.

http://www.weifan.info

4
https://sites.google.com/site/iotminingtutorial

Outline
Part I: IoT Fundamentals of Stream Mining
• IoT Setting
• Classification
• Concept Drift
• Regression
• Clustering
• Frequent Itemset Mining
• Concept Evolution
• Limited Labeled Learning

Part II: IoT Distributed Stream Mining
• Distributed Stream Processing Engines
• Classification
• Regression
• Open Source Tools
• Applications

• Conclusions

5
IoT Fundamentals of
Stream Mining
Part I
IoT Setting

7
INTERNET OF THINGS

IoT: sensors and actuators connected by networks to computing systems.

• Gartner predicts 20.8 billion IoT devices by 2020.

• IDC projects 32 billion IoT devices by 2020.
INTERNET OF THINGS

[Figure 3: EMC Digital Universe, 2014 — growth of the digital universe]

IOT (MCKINSEY)
IOT AND INDUSTRY 4.0

• Interoperability: IoT
• Information transparency: virtual copy of the physical world
• Technical assistance: support human decisions
• Decentralized decisions: make decisions on their own

Standard Approach
• Gather → Clean → Model → Deploy cycle
• Finite training sets
• Static models

13
Pain Points

• Things change over time (e.g., as spam trends change, retrain the model)

• Need to retrain!

• How often?

• Data unused until next update!

• Value of data wasted

14
Value of Data
15
IoT Stream Mining

• Maintain models online

• Incorporate data on the fly

• Unbounded training sets

• Detect changes and adapt

• Dynamic models

16
IoT Big Data Streams
• Volume + Velocity (+ Variety)

• Too large for single commodity server main memory

• Too fast for single commodity server CPU

• A solution needs to be:

• Distributed

• Scalable

17
Approximation Algorithms

• General idea, good for streaming algorithms

• Small error ε with high probability 1-δ

• True hypothesis H, and learned hypothesis Ĥ

• Pr[ |H - Ĥ| < ε|H| ] > 1-δ

18
Approximation Algorithms

• What is the largest number that we can store in 8


bits?

19
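The puzzle on the previous slide motivates approximate algorithms: exactly, 8 bits can store at most 255, but with a small error and high probability we can count far higher. A minimal sketch of one classic answer, Morris's approximate counter, which keeps roughly log2 of the count in the register (the code and class name are ours, not from the tutorial):

```python
import random

class MorrisCounter:
    """Approximate counting in few bits: the register holds roughly
    log2(count), so an 8-bit register can represent counts up to ~2^255."""

    def __init__(self):
        self.register = 0  # would fit in 8 bits

    def increment(self):
        # increment the register with probability 2^-register
        if random.random() < 2.0 ** -self.register:
            self.register += 1

    def estimate(self):
        # unbiased estimate of the true count
        return 2 ** self.register - 1
```

After n increments the register sits near log2(n), trading exactness for a tiny memory footprint, in the same (ε, δ) spirit as the bound on the previous slide.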
Predictive Learning
• Classification
• Regression
• Concept Drift

25
Classification

26
Definition
Given a set of training examples belonging to nC different classes, a classifier algorithm builds a model that predicts for every unlabeled instance x the class C to which it belongs.

Examples
• Email spam filter
• Twitter sentiment analyzer

27 Photo: Stephen Merity http://smerity.com


Process
• One example at a time, used at most once

• Limited memory
• Limited time
• Anytime prediction

28
Naïve Bayes
• Based on Bayes' theorem:

P(C|x) = P(x|C) P(C) / P(x)

posterior = (likelihood × prior) / evidence

• Probability of observing feature xi given class C

• Prior class probability P(C)

• Just counting!

P(C|x) ∝ ∏_{xi ∈ x} P(xi|C) P(C)

C = arg max_C P(C|x)
29
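The "just counting" remark can be made concrete with a small streaming classifier. A hypothetical sketch (class name and Laplace smoothing are our choices, not from the slides); x is a set of discrete feature values:

```python
import math
from collections import defaultdict

class StreamingNaiveBayes:
    """Streaming Naive Bayes: maintain class and feature counts online."""

    def __init__(self):
        self.n = 0
        self.class_count = defaultdict(int)    # counts N(C)
        self.feature_count = defaultdict(int)  # counts N(xi, C)

    def learn_one(self, x, c):
        # "just counting": one pass, constant work per example
        self.n += 1
        self.class_count[c] += 1
        for f in x:
            self.feature_count[(f, c)] += 1

    def predict_one(self, x):
        best_class, best_score = None, float("-inf")
        for c, nc in self.class_count.items():
            # log P(C) + sum_i log P(xi|C), Laplace-smoothed
            score = math.log(nc / self.n)
            for f in x:
                score += math.log((self.feature_count[(f, c)] + 1) / (nc + 2))
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Because the model is only counters, it satisfies the stream constraints from the Process slide: one example at a time, limited memory, anytime prediction.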
Perceptron
• Linear classifier

• Data stream: ⟨x⃗i, yi⟩

• ỹi = hw⃗(x⃗i) = σ(w⃗ᵀx⃗i), with σ(x) = 1/(1+e⁻ˣ) and σʹ = σ(x)(1−σ(x))

• Minimize MSE: J(w⃗) = ½ ∑ (yi − ỹi)²

• SGD: w⃗i+1 = w⃗i − η ∇J

• ∇J = −(yi − ỹi) ỹi (1 − ỹi) x⃗i

• Update rule: w⃗i+1 = w⃗i + η (yi − ỹi) ỹi (1 − ỹi) x⃗i

[Figure: perceptron diagram — inputs Attribute 1–5 with weights w1–w5, output hw⃗(x⃗i)]

30
Perceptron Learning

PerceptronLearning(Stream, η)
1  for each class
2    do PerceptronLearning(Stream, class, η)

PerceptronLearning(Stream, class, η)
1  ▷ Let w0 and w⃗ be randomly initialized
2  for each example (x⃗, y) in Stream
3    do if class = y
4      then δ = (1 − hw⃗(x⃗)) · hw⃗(x⃗) · (1 − hw⃗(x⃗))
5      else δ = (0 − hw⃗(x⃗)) · hw⃗(x⃗) · (1 − hw⃗(x⃗))
6    w⃗ = w⃗ + η · δ · x⃗

PerceptronPrediction(x⃗)
1  return arg max_class hw⃗class(x⃗)
31
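The pseudocode above translates almost line by line into Python. A sketch assuming dense numeric features and one sigmoid perceptron per class, trained one-vs-rest as on the slide (class and method names are ours):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class StreamingPerceptron:
    """One sigmoid perceptron per class; SGD update
    delta = (target - h(x)) * h(x) * (1 - h(x)) as in the pseudocode."""

    def __init__(self, n_features, n_classes, eta=0.5, seed=1):
        rng = random.Random(seed)
        self.eta = eta
        # one weight vector (bias w0 plus n_features weights) per class
        self.w = [[rng.uniform(-0.05, 0.05) for _ in range(n_features + 1)]
                  for _ in range(n_classes)]

    def _h(self, w, x):
        return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

    def learn_one(self, x, y):
        for c, w in enumerate(self.w):
            target = 1.0 if c == y else 0.0
            h = self._h(w, x)
            delta = (target - h) * h * (1.0 - h)
            w[0] += self.eta * delta            # bias input is 1
            for i, xi in enumerate(x):
                w[i + 1] += self.eta * delta * xi

    def predict_one(self, x):
        # arg max over the per-class perceptron outputs
        return max(range(len(self.w)), key=lambda c: self._h(self.w[c], x))
```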
Decision Tree
• Each node tests a feature
• Each branch represents a value
• Each leaf assigns a class

• Greedy recursive induction
• Sort all examples through tree
• xi = most discriminative attribute
• New node for xi, new branch for each value, leaf assigns majority class
• Stop if no error | limit on #instances

[Figure: example tree for "Car deal?" — Road Tested? (Yes/No), Mileage? (High/Low), Age? (Recent/Old), leaves ✅/❌]

32
Very Fast Decision Tree
Pedro Domingos, Geoff Hulten: “Mining high-speed data streams”. KDD ’00

• AKA, Hoeffding Tree

• A small sample can often be enough to choose a near-optimal decision

• Collect sufficient statistics from a small set of examples

• Estimate the merit of each alternative attribute

• Choose the sample size that allows us to differentiate between the alternatives

33
Leaf Expansion
• When should we expand a leaf?

• Let x1 be the most informative attribute,



x2 the second most informative one

• Is x1 a stable option?

• Hoeffding bound: split if G(x1) − G(x2) > ε = √( R² ln(1/δ) / 2n )
34
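The split test can be written directly from the bound. A small sketch (function names are ours; the tie-breaking parameter τ follows the later Properties slide):

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon = sqrt(R^2 ln(1/delta) / (2n)); shrinks as n grows."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n, tau=0.05):
    """Split when the observed merit gap exceeds epsilon, or break a
    tie when epsilon itself has shrunk below tau."""
    eps = hoeffding_bound(R, delta, n)
    return (g_best - g_second) > eps or eps < tau
```

For information gain with nC classes, R = log2(nC); the leaf keeps accumulating examples until either condition fires.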
Hoeffding Tree or VFDT

HT Induction

HT(Stream, δ)
1  ▷ Let HT be a tree with a single leaf (root)
2  ▷ Init counts nijk at root
3  for each example (x, y) in Stream
4    do HTGrow((x, y), HT, δ)

HTGrow((x, y), HT, δ)
1  ▷ Sort (x, y) to leaf l using HT
2  ▷ Update counts nijk at leaf l
3  if examples seen so far at l are not all of the same class
4    then ▷ Compute G for each attribute
5    if G(best attr.) − G(2nd best) > √( R² ln(1/δ) / 2n )
6      then ▷ Split leaf on best attribute
7      for each branch
8        do ▷ Start new leaf and initialize counts

35
Properties
• Number of examples to expand node depends only on
Hoeffding bound (ε decreases with √n)

• Low variance model (stable decisions with statistical support)

• Low overfitting (examples processed only once, no need for


pruning)

• Theoretical guarantees on error rate with high probability

• Hoeffding algorithms are asymptotically close to the batch learner.

Expected disagreement δ/p (p = probability an instance falls into a leaf)

• Ties: broken when ε < τ even if ΔG < ε

36
Regression

37
Definition
Given a set of training examples with a numeric label, a regression algorithm builds a model y = ƒ(x) that predicts for every unlabeled instance x the value with high accuracy.

Examples
• Stock price
• Airplane delay

38 Photo: Stephen Merity http://smerity.com


Perceptron
• Linear regressor

• Data stream: ⟨x⃗i, yi⟩

• ỹi = hw⃗(x⃗i) = w⃗ᵀx⃗i

• Minimize MSE: J(w⃗) = ½ ∑ (yi − ỹi)²

• SGD: w⃗′ = w⃗ − η ∇J

• ∇J = −(yi − ỹi) x⃗i

• Update rule: w⃗′ = w⃗ + η (yi − ỹi) x⃗i

[Figure: perceptron diagram — inputs Attribute 1–5 with weights w1–w5, output hw⃗(x⃗i)]

39
Regression Tree

• Same structure as decision tree

• Predict = average target value or



linear model at leaf (vs majority)

• Gain = reduction in standard deviation (vs entropy)

σ = √( ∑ (ỹi − yi)² / (N − 1) )

40
AMRules
• Problem: very large decision trees have context that is complex and hard to understand

• Rules: self-contained, modular, easier to interpret, no need to cover the universe

• 𝓛 keeps sufficient statistics to:

• make predictions

• expand the rule

• detect changes and anomalies

41
AMRules: Rule Sets
E. Almeida, C. Ferreira, J. Gama: "Adaptive Model Rules from Data Streams". ECML-PKDD '13

• Ruleset: ensemble of rules

• Rule prediction: mean, linear model

• Ruleset prediction

• Weighted avg. of predictions of rules covering instance x

• Weights inversely proportional to error

• Default rule covers uncovered instances (e.g., x = [4, 1, 1, 2])

f̂(x) = ∑_{Rl ∈ S(xi)} θl ŷl

42
AMRules Induction

• Rule creation: default rule expansion

• Rule expansion: split on attribute maximizing σ reduction

• Hoeffding bound ε = √( R² ln(1/δ) / 2n )

• Expand when σ1st/σ2nd < 1 − ε

• Evict rule when Page-Hinckley test error is large

• Detect and explain local anomalies

Algorithm 1: Training AMRules
Input: S: stream of examples
begin
  R ← {}, D ← rule0
  foreach (x, y) ∈ S do
    foreach rule r ∈ S(x) do
      if ¬IsAnomaly(x, r) then
        if PHTest(error_r, λ) then
          remove the rule r from R
        else
          update sufficient statistics L_r
          ExpandRule(r)
    if S(x) = ∅ then
      update L_D
      ExpandRule(D)
      if D expanded then
        R ← R ∪ {D}
        D ← rule0
  return (R, L_D)

43
Concept Drift

44
Definition
Given an input sequence ⟨x1, x2, …, xt⟩, output at instant t an alarm signal if there is a distribution change, and a prediction x̂t+1 minimizing the error |x̂t+1 − xt+1|.

Outputs
• Alarm indicating change
• Estimate of parameter

45 Photo: http://www.logsearch.io
Application
• Change detection on evaluation of model

• Training error should decrease with more examples

• Change in distribution of training error

• Input = stream of real/binary numbers

• Trade-off between detecting true changes and avoiding false alarms

[Figure 1: Change Detector and Estimator System — input xt feeds an Estimator, a Change Detector raising the alarm, and a Memory module]

46
Cumulative Sum
• Alarm when mean of input data differs from zero

• Memoryless heuristic (no statistical guarantee)

• Parameters: threshold h, drift speed v

• g0 = 0, gt = max(0, gt-1 + εt - v)

• if gt > h then alarm; gt = 0

47
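The update rule above fits in a few lines. A sketch of the CUSUM detector exactly as given on the slide (class name ours):

```python
class Cusum:
    """CUSUM test: g_t = max(0, g_{t-1} + eps_t - v);
    alarm and reset when g_t > h. Parameters follow the slide:
    threshold h, drift speed v."""

    def __init__(self, h, v):
        self.h, self.v = h, v
        self.g = 0.0

    def update(self, value):
        self.g = max(0.0, self.g + value - self.v)
        if self.g > self.h:
            self.g = 0.0
            return True   # alarm: mean of input drifted from zero
        return False
```

As the slide notes, this is a memoryless heuristic: v absorbs normal fluctuation, and only a sustained positive mean pushes g over h.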
Page-Hinckley Test

• Similar structure to Cumulative Sum

• g0 = 0, gt = gt-1 + (εt - v)

• Gt = mint(gt)

• if gt - Gt > h then alarm; gt = 0

48
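A matching sketch of the Page-Hinckley test; resetting the running minimum together with g after an alarm is our implementation choice, not stated on the slide:

```python
class PageHinkley:
    """Page-Hinckley test: g_t = g_{t-1} + (eps_t - v),
    G_t = min_t g_t, alarm when g_t - G_t > h."""

    def __init__(self, h, v):
        self.h, self.v = h, v
        self.g = 0.0
        self.g_min = 0.0  # running minimum G_t

    def update(self, value):
        self.g += value - self.v
        self.g_min = min(self.g_min, self.g)
        if self.g - self.g_min > self.h:
            self.g = 0.0
            self.g_min = 0.0
            return True
        return False
```

Compared with CUSUM, g is not clipped at zero; the distance from its running minimum plays the same role as the clipped statistic.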
Statistical Process Control
J. Gama, P. Medas, G. Castillo, P. Rodrigues: “Learning with Drift Detection”. SBIA '04

• Monitor error in sliding window

• Null hypothesis: no change between windows

• If error > warning level: learn new model in parallel on the current window

• If error > drift level: substitute new model for old

[Figure: error rate vs number of examples processed, with warning level (pmin + smin) and drift level marking a concept drift — Statistical Drift Detection Method (João Gama et al. 2004)]
49
Concept-adapting VFDT
G. Hulten, L. Spencer, P. Domingos: “Mining Time-Changing Data Streams”. KDD ‘01

• Model consistent with sliding window on stream

• Keep sufficient statistics also at internal nodes

• Recheck periodically if splits pass Hoeffding test

• If test fails, grow alternate subtree and swap-in



when accuracy of alternate is better

• Processing updates O(1) time, +O(W) memory

• Increase counters for the incoming instance, decrease counters for the instance going out of the window

50
VFDTc: Adapting to Change
J. Gama, R. Fernandes, R. Rocha: “Decision Trees for Mining Data Streams”. IDA (2006)

• Monitor error rate

• When drift is detected

• Start learning alternative subtree in parallel

• When accuracy of alternative is better

• Swap subtree

• No need for window of instances

51
Hoeffding Adaptive Tree
A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams” IDA (2009)

• Replace frequency counters by estimators

• No need for window of instances

• Sufficient statistics kept by estimators separately

• Parameter-free change detector + estimator with


theoretical guarantees for subtree swap (ADWIN)

• Keeps sliding window consistent with 



“no-change hypothesis”
A. Bifet, R. Gavaldà: “Learning from Time-Changing Data with Adaptive Windowing”. SDM ‘07
52
Clustering

53
Definition
Given a set of unlabeled instances, distribute them into homogeneous groups according to some common relations or affinities.

Examples
• Market segmentation
• Social network communities

54 Photo: W. Kandinsky - Several Circles (edited)


Approaches
• Distance based (CluStream)

• Density based (DenStream)

• Kernel based, Coreset based, much more…

• Most approaches combine online + offline phase

• Formally: minimize cost function 



over a partitioning of the data

55
Micro-Clusters
Tian Zhang, Raghu Ramakrishnan, Miron Livny: “BIRCH: An Efficient Data Clustering Method for Very Large Databases”. SIGMOD ’96

• AKA Cluster Features (CF): a statistical summary structure

• Maintained in the online phase, input for the offline phase

• Data stream ⟨x⃗i⟩, d dimensions

• Cluster feature vector:
  N: number of points
  LSj: sum of values (for dim. j)
  SSj: sum of squared values (for dim. j)

• Easy to update, easy to merge

• Constant space irrespective of the number of examples!

56
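The (N, LSj, SSj) summary is easy to implement; both update and merge are O(d). A sketch (class and method names ours; the radius formula is the standard BIRCH RMS deviation from the centroid):

```python
class ClusterFeature:
    """BIRCH-style cluster feature (N, LS, SS) for d dimensions:
    constant space, updatable and mergeable in O(d)."""

    def __init__(self, d):
        self.n = 0
        self.ls = [0.0] * d   # per-dimension linear sum
        self.ss = [0.0] * d   # per-dimension squared sum

    def add(self, x):
        self.n += 1
        for j, v in enumerate(x):
            self.ls[j] += v
            self.ss[j] += v * v

    def merge(self, other):
        # merging two CFs is just component-wise addition
        self.n += other.n
        for j in range(len(self.ls)):
            self.ls[j] += other.ls[j]
            self.ss[j] += other.ss[j]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # root-mean-square deviation of the points from the centroid
        c = self.centroid()
        var = sum(self.ss[j] / self.n - c[j] ** 2 for j in range(len(c)))
        return max(0.0, var) ** 0.5
```

These are exactly the quantities CluStream and DenStream maintain per micro-cluster in their online phases.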
CluStream
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ‘03

• Timestamped data stream ⟨ti, x⃗i⟩, represented in d+1 dimensions

• Seed algorithm with q micro-clusters (k-means on initial data)

• Online phase. For each new point, either:

• Update one micro-cluster (point within maximum boundary)

• Create a new micro-cluster (delete/merge other micro-clusters)

• Offline phase. Determine k macro-clusters on demand:

• K-means on micro-clusters (weighted pseudo-points)

• Time-horizon queries via pyramidal snapshot mechanism

57
DBSCAN
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: “A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise”. KDD ‘96

• ε-n(p) = set of points at distance ≤ ε

• Core object q = ε-n(q) has weight ≥ μ

• p is directly density-reachable from q

• p ∈ ε-n(q) ∧ q is a core object

• pn is density-reachable from p1

• chain of points p1,…,pn such that pi+1 is directly d-r from pi

• Cluster = set of points that are


mutually density-connected

58
DenStream
Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ‘06

• Based on DBSCAN

• Core-micro-cluster: CMC(w,c,r) 

weight w > μ, center c, radius r < ε

• Potential/outlier micro-clusters

• Online: merge point into p- (or o-) micro-cluster if new radius r′ < ε

• Promote outlier to potential if w > βμ

• Else create new o-micro-cluster

• Offline: DBSCAN on the micro-clusters

59
Static Evaluation
• Internal (validation)

• Sum of squared distance (point to centroid)

• Dunn index (on distance d)



D = min(inter-cluster d) / max(intra-cluster d)

• External (ground truth)

• Rand = #agreements / #choices = 2(TP+TN)/(N(N-1))

• Purity = #majority class per cluster / N


60
Streaming Evaluation
H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer:

“An effective evaluation measure for clustering on evolving data streams”. KDD ’11

• Clusters may: appear, fade, move, merge

• Missed points (unassigned)

• Misplaced points (assigned to different cluster)

• Noise

• Cluster Mapping Measure CMM

• External (ground truth)

• Normalized sum of penalties of these errors

61
Frequent Itemset
Mining

62
Definition
Given a collection of sets of items, find all the subsets that occur frequently, i.e., more than a minimum support number of times.

Examples
• Market basket mining
• Item recommendation

63
Fundamentals
• Dataset D, set of items t ∈ D,
constant s (minimum support)

• Support(t) = number of sets



in D that contain t

• Itemset t is frequent if
support(t) ≥ s

• Frequent Itemset problem:

• Given D and s, find all


frequent itemsets

64
Itemset Mining Example

Dataset:
Document  Items
d1        abce
d2        cde
d3        abce
d4        acde
d5        abcde
d6        bcd

Frequent itemsets (minimal support = 3):
Support  Documents            Frequent
6        d1,d2,d3,d4,d5,d6    c
5        d1,d2,d3,d4,d5       e, ce
4        d1,d3,d4,d5          a, ac, ae, ace
4        d1,d3,d5,d6          b, bc
4        d2,d4,d5,d6          d, cd
3        d1,d3,d5             ab, abc, abe, be, bce, abce
3        d2,d4,d5             de, cde

65
Variations
• A priori property: t ⊆ t' ➝ support(t) ≥ support(t’)

• Closed: none of its supersets has the same support

• Can generate all freq. itemsets and their support

• Maximal: none of its supersets is frequent

• Can generate all freq. itemsets (without support)

• Maximal ⊆ Closed ⊆ Frequent


66
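The frequent/closed/maximal relationships can be checked by brute force on the running example dataset. A small sketch (variable names ours):

```python
from itertools import combinations

# the running example dataset from the slides
docs = {"d1": "abce", "d2": "cde", "d3": "abce",
        "d4": "acde", "d5": "abcde", "d6": "bcd"}
s = 3  # minimal support

def support(itemset):
    return sum(1 for t in docs.values() if set(itemset) <= set(t))

items = sorted(set("".join(docs.values())))
all_itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

frequent = {t for t in all_itemsets if support(t) >= s}
# closed: no proper superset has the same support
closed = {t for t in frequent
          if not any(t < u and support(u) == support(t) for u in frequent)}
# maximal: no proper superset is frequent
maximal = {t for t in frequent if not any(t < u for u in frequent)}
```

Running this reproduces the table: e.g. ce is closed (support 5, all supersets drop to ≤ 4), a is frequent but not closed (ac has the same support 4), and abce is maximal.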
Example

Dataset:
Document  Items
d1        abce
d2        cde
d3        abce
d4        acde
d5        abcde
d6        bcd

67
Itemset Streams

• Support as fraction of stream length

• Exact vs approximate

• Incremental, sliding window, adaptive

• Frequent, closed, maximal

68
Pattern mining in streams
Key data structure: lattice of patterns, with counts

[Figure: lattice over items A, B, C, D with counts — {A},20 {B},18 {C},18 {D},25; {A,B},15 {A,C},12 {B,C},10 {A,D},5 {B,D},12 {C,D},12; {A,B,C},4 {A,B,D},3 {A,C,D},3 {B,C,D},8; {A,B,C,D},2 — with a boundary separating count ≤ 7 from count > 7]

69
Pattern mining in streams
The vast majority of stream pattern mining algorithms
(implicitly or explicitly) build and update the pattern
lattice.

General scheme:

let L be initial, empty lattice;


forever do {
collect a batch of items of size B;
build a summary S of the batch;
merge S into L;
}

70
Lossy Counting
G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB '02

• Keep data structure D with tuples (x, freq(x), error(x))

• Imagine dividing the stream into buckets of size 1/ε

• For each itemset x in the stream, let Bid = the current sequential bucket id, starting from 1

• if x ∈ D, freq(x)++

• else D ← D ∪ (x, 1, Bid - 1)

• Prune D at bucket boundaries: evict x if freq(x) + error(x) ≤ Bid

71
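The bucket bookkeeping above fits in a small class, sketched here for single items (the itemset version keeps the same (freq, error) structure per itemset; names ours):

```python
class LossyCounter:
    """Lossy Counting: (x, freq, error) entries, pruned at bucket
    boundaries; counts underestimate the truth by at most eps * n."""

    def __init__(self, epsilon):
        self.eps = epsilon
        self.bucket_width = int(1.0 / epsilon)
        self.entries = {}  # x -> (freq, error)
        self.n = 0

    def add(self, x):
        self.n += 1
        bucket_id = (self.n - 1) // self.bucket_width + 1
        if x in self.entries:
            freq, err = self.entries[x]
            self.entries[x] = (freq + 1, err)
        else:
            self.entries[x] = (1, bucket_id - 1)
        # prune at bucket boundaries: evict if freq + error <= bucket_id
        if self.n % self.bucket_width == 0:
            self.entries = {k: (f, e) for k, (f, e) in self.entries.items()
                            if f + e > bucket_id}

    def frequent(self, s):
        # every x with true frequency >= s * n is guaranteed to be here
        return {k for k, (f, e) in self.entries.items()
                if f >= (s - self.eps) * self.n}
```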
Moment
Y. Chi , H. Wang, P. Yu , R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window”. ICDM ‘04

• Keeps track of boundary below frequent itemsets in a window

• Closed Enumeration Tree (CET) (~ prefix tree)

• Infrequent gateway nodes (infrequent)

• Unpromising gateway nodes (frequent non-closed, child non-closed)

• Intermediate nodes (frequent non-closed, child closed)

• Closed nodes (frequent)

• By adding/removing transactions closed/infreq. do not change

72
FP-Stream
C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time granularities”. NGDM (2003)

• Multiple time granularities

• Based on FP-Growth (depth-first search over itemset lattice)

• Pattern-tree + Tilted-time window

• Time sensitive queries, emphasis on recent history

• High time and memory complexity

73
Itemset mining

• CLOSTREAM (Yen+ 09): sliding window, all closed, exact

• MFI (Li+ 09): transaction-sensitive window, frequent closed, exact

• IncMine (Cheng+ 08): sliding window, frequent closed, approximate; faster for moderate approximation ratios

74
Sequence, tree, graph mining

• MILE (Chen+ 05), SMDS (Marascu-Masseglia 06), SSBE (Koper-Nguyen 11): frequent subsequence (aka sequential pattern) mining

• Bifet+ 08: frequent closed unlabeled subtree mining

• Bifet+ 11: frequent closed labeled subtree mining

• Bifet+ 11: frequent closed subgraph mining

75
Concept Evolution

76
Challenge: Concept Evolution

[Figure: a 2-D feature space split by thresholds x1, y1, y2 into regions A, B, C, D of existing classes + and −; a novel class (X) appears inside region C]

Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = −

Existing classification models misclassify novel class instances.

FEARLESS engineering 77
Existing Techniques: Ensemble based Approaches
Masud et al. [1][2]

[Figure: an unlabeled input x,? is classified by models M1 (+), M2 (+), M3 (−); the individual classifier outputs are combined by voting into the ensemble output +]

[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data.
ICDM 2008: 929-934
[2] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza: Facing the reality of data stream classification: coping with scarcity of labeled data.
Knowl. Inf. Syst. 33(1): 213-244 (2011)

FEARLESS engineering 78


Existing Techniques: Ensemble Techniques

➢ Divide the data stream into equal sized chunks

– Train a classifier from each data chunk
– Keep the best t such classifiers as an ensemble
– Example: for t = 3

Note: Di may contain data points from different classes

[Figure: labeled chunks D1, D2, …, with the latest chunk unlabeled; a model Mi is trained per chunk, and the best t = 3 models form the ensemble used for prediction]

Addresses infinite length and concept-drift

FEARLESS engineering 79
Novel Class Detection
Masud et al. [1][2], Khateeb et al. [3]

➢ Non-parametric
– does not assume any underlying model of existing classes
➢ Steps:
1. Creating and saving decision boundary during training
2. Detecting and filtering outliers
3. Measuring cohesion and separation among test and training
instances

[1] Mohammad M. Masud, Qing Chen, Latifur Khan, Charu C. Aggarwal, Jing Gao, Jiawei Han, Ashok N. Srivastava, Nikunj C. Oza: Classification and Adaptive
Novel Class Detection of Feature-Evolving Data Streams. IEEE Trans. Knowl. Data Eng. 25(7): 1484-1497 (2013)
[2] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: Classification and Novel Class Detection in Concept-Drifting Data
Streams under Time Constraints. IEEE Trans. Knowl. Data Eng. 23(6): 859-874 (2011)
[3] Tahseen Al-Khateeb, Mohammad M. Masud, Latifur Khan, Charu C. Aggarwal, Jiawei Han, Bhavani M. Thuraisingham: Stream Classification with Recurring
and Novel Class Detection Using Class-Based Ensemble. ICDM 2012: 31-40

FEARLESS engineering 80
Training with Semi-Supervised Clustering

Impurity based Clustering

Legend:
Black dots: unlabeled instances
Colored dots: labeled instances

FEARLESS engineering 81
Semi Supervised Clustering
Masud et al. [1][2]

➢ Objective function (dual minimization problem)

Intra-cluster dispersion Cluster impurity

Impi = Aggregated dissimilarity counti × Entropyi = ADCi × Enti


Aggregated dissimilarity count (ADC):

Entropy (Ent):

The minimization problem is solved using the Expectation-Maximization (E-M) framework.
[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data.
ICDM 2008: 929-934
[2] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza: Facing the reality of data stream classification: coping with scarcity of labeled data.
Knowl. Inf. Syst. 33(1): 213-244 (2011)

FEARLESS engineering 82
Outlier Detection and Filtering

[Figure: a test instance inside the decision boundary is not an outlier; one outside is a raw outlier (Routlier). Each model M1, M2, …, Mt of the ensemble votes; if all models flag x as Routlier (AND = True), X is a filtered outlier (Foutlier), a potential novel class instance, otherwise X is an existing class instance]

Foutliers may appear as a result of novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.

FEARLESS engineering 83
Novel Class Detection

[Flowchart: (Step 1) a test instance x is passed to the ensemble of models M1, M2, …, Mt; (Step 2) if every model flags x as Routlier (AND = True), X is a filtered outlier (Foutlier), a potential novel class instance — otherwise X is an existing class instance; (Step 3) compute q-NSC with all models and the other Foutliers; (Step 4) if q-NSC > 0 for q′ > q Foutliers with all models, a novel class is found, otherwise treat x as an existing class]

FEARLESS engineering 84
Computing Cohesion & Separation

[Figure: an Foutlier x with its q nearest Foutliers λo,q(x) at mean distance a(x), and its q nearest neighbors in each existing class, λ+,q(x) and λ−,q(x), at mean distances b+(x) and b−(x)]

➢ λc(x) is the set of nearest neighbors of x belonging to class c
➢ λo(x) is the set of nearest Foutliers of x

▪ a(x) = mean distance from an Foutlier x to the instances in λo,q(x)
▪ bmin(x) = minimum among all bc(x) (e.g. b+(x) in figure)
▪ q-Neighborhood Silhouette Coefficient (q-NSC):

q-NSC(x) = (bmin(x) − a(x)) / max(bmin(x), a(x))

▪ If q-NSC(x) is positive, it means x is closer to the Foutliers than to any existing class.

FEARLESS engineering 85
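The q-NSC formula can be computed directly from distances. A hypothetical sketch (the helper name and data layout are ours):

```python
import math

def q_nsc(x, foutliers, classes, q):
    """q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x)).
    foutliers: list of other Foutlier points;
    classes: dict mapping class label -> list of training points."""

    def mean_q_nearest(points):
        # mean distance from x to its q nearest points in the given set
        dists = sorted(math.dist(x, p) for p in points)
        k = min(q, len(dists))
        return sum(dists[:k]) / k

    a = mean_q_nearest(foutliers)                                 # cohesion
    b_min = min(mean_q_nearest(pts) for pts in classes.values())  # separation
    return (b_min - a) / max(b_min, a)
```

A positive value means x sits closer to the other Foutliers than to any existing class, which is the per-instance evidence pooled in Step 4 above.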
Detection of Concurrent Novel Classes
Masud et al. [1], Faria et al. [2]

• Challenges
– High false positive (FP) (existing classes detected as novel) and false negative (FN) (missed
novel classes) rates
– Two or more novel classes arrive at a time

• Solutions
– Dynamic decision boundary – based on previous mistakes
• Inflate the decision boundary if high FP, deflate if high FN
– Build statistical model to filter out noise data and concept drift from the outliers.
– Multiple novel classes are detected by
• Constructing a graph where outlier cluster is a vertex
• Merging the vertices based on silhouette coefficient
• Counting the number of connected components in the resultant (i.e., merged) graph

[1] Mohammad M. Masud, Qing Chen, Latifur Khan, Charu C. Aggarwal, Jing Gao, Jiawei Han, Bhavani M. Thuraisingham: Addressing Concept-Evolution in Concept-Drifting
Data Streams. ICDM 2010: 929-934
[2] Elaine R. Faria, João Gama, André C. P. L. F. Carvalho: Novelty detection algorithm for data streams multi-class problems. SAC 2013: 795-800

FEARLESS engineering 86
Novel and Recurrence
Khateeb et al. [1]

[Figure: a stream of chunks chunk0 … chunk150; a class appears as Novel around chunk50 and as Recurrence around chunk100 and chunk150]


[1] Tahseen Al-Khateeb, Mohammad M. Masud, Latifur Khan, Charu C. Aggarwal, Jiawei Han, Bhavani M. Thuraisingham: Stream Classification with Recurring and Novel
Class Detection Using Class-Based Ensemble. ICDM 2012: 31-40

FEARLESS engineering 87
Challenges: Fixed Chunk Size/ Decay Rate
Masud et al. [1], Parker et al. [2], Aggarwal et al. [3], Klinkenberg[4], Cohen et al. [5]

[Figure: instances 1 … k form Chunk 1, instances k+1 … 2k form Chunk 2, and so on]

➢ Fixed chunk size


– requires a priori knowledge about the time-scale of change.
– delayed reaction if the chunk size is too large.
– unnecessary frequent training during stable period if chunk size is too small.

➢ Fixed decay rate


– assigns weight to data instances based on their age.
– decay constant must match the unknown rate of change.

[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: Classification and Novel Class Detection in Concept-Drifting Data Streams under
Time Constraints. IEEE Trans. Knowl. Data Eng. 23(6): 859-874 (2011)
[2] Brandon Shane Parker, Latifur Khan: Detecting and Tracking Concept Class Drift and Emergence in Non-Stationary Fast Data Streams. AAAI 2015: 2908-2913
[3] Charu C. Aggarwal, Philip S. Yu: On Classification of High-Cardinality Data Streams. SDM 2010: 802-813
[4] Ralf Klinkenberg: Learning drifting concepts: Example selection vs. example weighting. Intell. Data Anal. 8(3): 281-300 (2004)
[5] Edith Cohen, Martin J. Strauss: Maintaining time-decaying stream aggregates. J. Algorithms 59(1): 19-36 (2006)

FEARLESS engineering 88
Challenges: Fixed Chunk Size

[Figure: concept drifts over time; with a chunk size too large there is delayed reaction, with a chunk size too small there are performance issues (legend: correct vs wrong predictions)]
FEARLESS engineering 89
Solution: Adaptive Chunk Size

[Figure: with an adaptive chunk size, chunk boundaries align with the concept drifts, yielding correct predictions (legend: correct vs wrong)]

FEARLESS engineering 90
Adaptive Chunk - Sliding Window
Gama et al. [1], Bifet et al. [2], Harel et al. [3]

[Figure: a sliding window over instances 1 … n+3 whose size adapts over time]

➢ Existing dynamic sliding window techniques

– monitor the error rate of the classifier.
– update the classifier if it starts to show bad performance.
– are fully supervised, which is not feasible for real-world data streams.

[1] João Gama, Gladys Castillo: Learning with Local Drift Detection. ADMA 2006: 42-55
[2] Albert Bifet, Ricard Gavaldà: Learning from Time-Changing Data with Adaptive Windowing. SDM 2007: 443-448
[3] Maayan Harel, Shie Mannor, Ran El-Yaniv, Koby Crammer: Concept Drift Detection Through Resampling. ICML 2014: 1009-1017

FEARLESS engineering 91
Adaptive Chunk - Unsupervised
Haque et al. [1][2]

(Flowchart: each input instance is classified by the ensemble, yielding a predicted class and a classifier confidence. Confidence scores are stored in a sliding window over chunks C1 … Cn, and the confidence distributions before and after a candidate change point are compared. Change detected? Yes: update the classifier and shrink the window. No: grow the window.)

[1] Ahsanul Haque, Latifur Khan, Michael Baron, Bhavani M. Thuraisingham, Charu C. Aggarwal: Efficient handling of concept drift and concept evolution over Stream Data. ICDE 2016: 481-492.
[2] Ahsanul Haque, Latifur Khan, Michael Baron: SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream. AAAI 2016: 1652-1658.

FEARLESS engineering 92
Adaptive Chunk - Unsupervised
Haque et al. [1][2]

(Flowchart: as before, but the classifier confidence is now built per model: for each model M1 … Mt, an association estimator and a purity estimator are combined into that model's confidence, and the model confidences are combined into the ensemble's classifier confidence. Change detected? Yes: update the classifier and shrink the window. No: grow the window.)

[1] Ahsanul Haque, Latifur Khan, Michael Baron, Bhavani M. Thuraisingham, Charu C. Aggarwal: Efficient handling of concept drift and concept evolution over Stream Data. ICDE 2016: 481-492
[2] Ahsanul Haque, Latifur Khan, Michael Baron: SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream. AAAI 2016: 1652-1658.

FEARLESS engineering 93
Confidence of a model

FEARLESS engineering 94
Confidence Estimators
➢ Let h be the cluster closest to a data instance x in model Mi. The confidence of Mi in classifying x is calculated from the following estimators:

▪ association aix: based on how far x lies inside the boundary of h, i.e., the distance D(i,x) of x from the centroid relative to the cluster radius Rh.
▪ purity pix: the fraction of instances in h that belong to its majority class, e.g., with Ns = 15 instances of which Nm = 14 are in the majority class, pix = 14/15.
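Under these definitions, the two estimators can be sketched as follows; the exact association formula is an assumed simplification (distance of x from the cluster boundary), while the purity value reproduces the slide's Ns = 15, Nm = 14 example:

```python
import math

def association(x, centroid, radius):
    """Assumed association estimator: how far inside the cluster boundary
    x falls (radius minus distance from the centroid); larger = more confident."""
    return radius - math.dist(x, centroid)

def purity(n_majority, n_total):
    """Fraction of the cluster's instances that belong to its majority class."""
    return n_majority / n_total

p = purity(14, 15)                               # the slide's example: 14/15
a = association((1.0, 1.0), (0.0, 0.0), 2.0)     # hypothetical cluster
```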

FEARLESS engineering 95
How good are the estimators?

(Figure: for each model Mi, the ground-truth indicators gi1 … gil are paired with the association values ai1 … ail to compute zia, and with the purity values pi1 … pil to compute zip.)

FEARLESS engineering 96
Example: Confidence Calculation
❖ An example with t = 3 (3 models) and C = 3 (3 classes)
Classification Model (M1 ) Model (M2) Model (M3)

Test instance (x)



The nearest micro-clusters to x in M1, M2, and M3 are a, b, and c, respectively.

Training:
Z1 = (0.52, 0.48)
Z2 = (0.41, 0.59)
Z3 = (0.36, 0.64)

FEARLESS engineering 97
Confidence Value Distribution

(Figure: a Beta distribution fitted to the confidence values, and the change in the confidence-value distribution after a drift.)

➢ The Beta distribution has two parameters, α and β:

– the distribution is symmetric if α = β, and unimodal if α, β > 1.
– it approaches infinity at 0 if α < 1.
– it approaches infinity at 1 if β < 1.

FEARLESS engineering 98
Change Detection

(Figure: the sliding window of confidence scores over chunks C1 … Cn is split at a candidate change point: scores before it are modeled as Beta(α0, β0), scores after it as Beta(α1, β1).)

Xi = the ith confidence score stored in the sliding window
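One way to sketch the comparison of the two Beta distributions in pure Python: fit each side by the method of moments and score the split with a likelihood ratio. The cited papers use a different (sequential) change-point test, so treat this only as an illustration:

```python
import math

def fit_beta(xs):
    """Method-of-moments fit of Beta(alpha, beta) to values in (0, 1)."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

def log_lik(xs, a, b):
    """Log-likelihood of xs under Beta(a, b)."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return sum((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_B
               for x in xs)

def change_score(before, after):
    """Log-likelihood ratio: two separate Beta fits vs. one pooled fit."""
    a0, b0 = fit_beta(before)
    a1, b1 = fit_beta(after)
    ap, bp = fit_beta(before + after)
    separate = log_lik(before, a0, b0) + log_lik(after, a1, b1)
    pooled = log_lik(before + after, ap, bp)
    return separate - pooled

high = [0.9, 0.85, 0.92, 0.88, 0.91, 0.87]   # confident period
low = [0.55, 0.6, 0.5, 0.58, 0.52, 0.62]     # confidence drop after a drift
```

A large score suggests the confidence values before and after the candidate point come from different Beta distributions, i.e., a change.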

FEARLESS engineering 99
Limited Labeling

100
Limited Labeled Learning using Active Learning
Masud et al. [1][2], Fan et al. [3], Zhu et al. [4]

Traditional Learning
Training Data Training Data Training Data

Test Data Test Data Test Data

Chunk 1 Chunk 2 Chunk N

Learning with Limited Labeled Data


Training Data Training Data Training Data

Test Data Test Data Test Data

Chunk 1 Chunk 2 Chunk N

Unlabeled

[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data.
ICDM 2008: 929-934
[2] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza: Facing the reality of data stream classification: coping with scarcity of labeled data.
Knowl. Inf. Syst. 33(1): 213-244 (2011)
[3] Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu: Active Mining of Data Streams. SDM 2004: 457-461
[4] Xingquan Zhu, Peng Zhang, Xiaodong Lin, Yong Shi: Active Learning from Data Streams. ICDM 2007: 757-762

FEARLESS engineering 101


Limited Labeled Learning

➢ Semi-supervised training:
– a label is requested for an instance only if the classifier's confidence in classifying that instance is below a confidence threshold (τ).
– otherwise, the predicted label is used as the final label.
➢ A new model is trained on the training data.
➢ The oldest model in the ensemble is replaced.

(Figure: the new classifier M′ enters the ensemble M, replacing the oldest classifier.)
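The labeling loop can be sketched as below; `predict`, `true_label`, and the threshold `tau` are hypothetical stand-ins for the ensemble, the labeling oracle, and τ:

```python
def limited_label_stream(instances, predict, true_label, tau=0.8):
    """Illustrative sketch: request a true label only when the classifier's
    confidence falls below tau; otherwise trust the predicted label."""
    training, requested = [], 0
    for x in instances:
        label, conf = predict(x)
        if conf < tau:
            label = true_label(x)   # costly oracle call
            requested += 1
        training.append((x, label))
    return training, requested

xs = list(range(100))
# Hypothetical classifier: confident on even inputs, unsure on odd ones.
pred = lambda x: (("even" if x % 2 == 0 else "odd"),
                  0.95 if x % 2 == 0 else 0.5)
oracle = lambda x: "even" if x % 2 == 0 else "odd"
data, n_requested = limited_label_stream(xs, pred, oracle)
```

Here only the low-confidence half of the stream triggers an oracle call, which is the labeling saving the slide describes.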

FEARLESS engineering 102


IoT Distributed 

Stream Mining
Part II
https://sites.google.com/site/iotminingtutorial

Outline
• IoT Fundamentals of Stream Mining
  • IoT Setting
  • Classification
  • Concept Drift
  • Regression
  • Clustering
  • Frequent Itemset Mining
  • Concept Evolution
  • Limited Labeled Learning
• IoT Distributed Stream Mining
  • Distributed Stream Processing Engines
  • Classification
  • Regression
  • Open Source Tools
  • Applications
• Conclusions
104
Distributed Stream
Processing Engines

105
A Tale of two Tribes
M. Stonebraker U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05

(Figure: two scaling directions. Databases scale toward larger data: many apps over a single DB, or data partitioned over many DBs. Stream processing engines instead aim to be faster.)

106
SPE Evolution
1st generation
—2003 Aurora: Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
—2004 STREAM: Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004

2nd generation
—2005 Borealis: Abadi et al., “The Design of the Borealis Stream Processing Engine,” CIDR ’05
—2006 SPC: Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” DMSSP ’06
—2008 SPADE: Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” SIGMOD ’08

3rd generation
—2010 S4: Neumeyer et al., “S4: Distributed Stream Computing Platform,” ICDMW ’10
—2011 Storm: http://storm.apache.org
—2013 Samza: http://samza.incubator.apache.org

107
Actors Model
(Figure: live streams 1–3 are routed as events to processing elements (PEs); the PEs emit outputs 1–2 and feed an external persister.)
108
S4 Example
status.text: "Introducing #S4: a distributed #stream processing system"

– A RawStatus event (key: null, value: text="Int...") reaches TopicExtractorPE (PE1), which extracts hashtags from status.text and emits Topic events (key: topic="stream", value: count=1; key: topic="S4", value: count=1).
– TopicCountAndReportPE (PE2–PE3) keeps counts for each topic across all tweets, and regularly emits a report event if a topic count is above a configured threshold.
– TopicNTopicPE (PE4) receives the report events (key: reportKey="1", value: topic="S4", count=4), keeps counts for the top topics, and outputs the top-N topics to an external persister.

109
Groupings

• Key Grouping (hashing)
• Shuffle Grouping (round-robin)
• All Grouping (broadcast)

(Figure: events flow from the instances (PEI) of one PE to the instances of the next PE according to the grouping.)

110
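The three groupings amount to three routing rules, sketched below; the CRC32 hash and the round-robin counter are illustrative implementation choices, not what any particular engine uses:

```python
import zlib
from itertools import count

def key_grouping(event_key, n_instances):
    """Key grouping: hash the key so every event with the same key
    reaches the same PE instance."""
    return zlib.crc32(event_key.encode()) % n_instances

def shuffle_grouping(counter, n_instances):
    """Shuffle grouping: spread events round-robin over PE instances."""
    return next(counter) % n_instances

def all_grouping(n_instances):
    """All grouping: broadcast the event to every PE instance."""
    return list(range(n_instances))

rr = count()
a = key_grouping("topic=S4", 4)
b = key_grouping("topic=S4", 4)   # same key, same target instance
```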
Real-time computation: streaming computation. How to compute in real time (latency less than 1 second), e.g.:
1 predictions
2 frequent items, such as Twitter hashtags
3 sentiment analysis

• Apache Storm: low latency, built for real-time data processing workloads. Characteristics:
1 fast
2 scalable
3 fault-tolerant
4 reliable
5 easy to operate

• Apache Samza (from LinkedIn). Storm and Samza are fairly similar; both systems provide:
1 a partitioned stream model
2 a distributed execution environment
3 an API for stream processing
4 fault tolerance
5 Kafka integration

• Apache Spark Streaming: higher latency (micro-batches, not strictly real time)

114
Kappa Architecture

• Apache Kafka is a fast, scalable, durable, and


fault-tolerant publish-subscribe messaging system.

115
Apache Spark

• Spark Streaming is an extension of Spark that


allows processing data stream using micro-batches
of data.

116
Apache Spark

• A Discretized Stream (DStream) represents a continuous stream of data:

• either the input data stream received from a source, or

• the processed data stream generated by transforming the input stream.

• Internally, a DStream is represented by a continuous series of RDDs.

117
Apache Spark

• Any operation applied on a DStream translates to operations on the underlying RDDs
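To illustrate the micro-batch model without a Spark installation, here is a pure-Python sketch; the per-batch word-count job is a hypothetical stand-in for DStream operations, which in Spark would be written against the DStream API itself:

```python
def micro_batches(stream, batch_size):
    """Cut an unbounded stream into small batches, as Spark Streaming cuts
    time into intervals and builds one RDD per interval."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def word_count(batch):
    """The per-batch job: every DStream operation runs on each batch's RDD."""
    counts = {}
    for line in batch:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

stream = ["to be or", "not to be", "to stream or", "not to stream"]
results = [word_count(b) for b in micro_batches(stream, batch_size=2)]
```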

118
Apache Flink

• Streaming engine for real-time computation (latency less than 1 second), e.g., for predictions, frequent items such as Twitter hashtags, and sentiment analysis
119
Apache Beam

121
Apache Beam
• Apache Beam code can run in:

• Apache Flink

• Apache Spark

• Google Cloud Dataflow

• Google Cloud Dataflow replaced MapReduce:

• It is based on FlumeJava and MillWheel, a stream engine similar to Storm and Samza

• It writes to and reads from Google Pub/Sub, a service similar to Kafka

122
Apache Beam

123
Classification

124
Hadoop AllReduce
A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”. JMLR (2014)

• MPI AllReduce on MapReduce

• Parallel SGD + L-BFGS

• Aggregate + Redistribute

• Each node computes partial gradient

• Aggregate (sum) complete gradient

• Each node gets updated model

• Hadoop for data locality (map-only job)

125
(Figure: AllReduce over a reduction tree. Initially, each node holds its own value, e.g., 7, 5, 3, 4. Values are passed up the tree and summed until the global sum, 37, is obtained at the root (reduce phase). The global sum is then passed back down to all other nodes (broadcast phase). At the end, each node contains the global sum. Upward = reduce; downward = broadcast (all). Hadoop-compatible AllReduce.)

126
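A toy simulation of the two phases; real implementations work on gradient vectors and overlap communication with computation, so this is only a sketch of the data flow:

```python
def tree_allreduce(values):
    """Simulate AllReduce on a binary reduction tree: partial sums flow up
    to the root (reduce), then the global sum is broadcast back down, so
    every node ends up holding the same total."""
    # Reduce phase: pairwise-sum each level of the tree until one value remains.
    level = list(values)
    while len(level) > 1:
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
    global_sum = level[0]
    # Broadcast phase: every node receives the global sum.
    return [global_sum] * len(values)

# Each worker holds a partial gradient; after AllReduce all workers hold
# the aggregated gradient and can apply the same model update.
partial_gradients = [7, 5, 3, 4, 9, 1, 8]
synced = tree_allreduce(partial_gradients)
```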
Parallel Decision Trees
• Which kind of parallelism?

• Task

• Data

• Horizontal (partition by instances)
• Vertical (partition by attributes)

(Figure: the data matrix, with instances as rows and attributes plus the class as columns; horizontal parallelism splits rows, vertical parallelism splits columns.)
127
Horizontal Partitioning
Y. Ben-Haim, E. Tom-Tov: “A Streaming Parallel Decision Tree Algorithm”. JMLR (2010)

(Figure: the instance stream is split across multiple nodes; each node maintains local statistics as histograms, and model updates are aggregated to compute splits. Each single attribute is tracked in multiple nodes.)

128
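The per-node histogram statistics can be sketched as follows; this is a simplified variant of the Ben-Haim & Tom-Tov merge step, with an assumed bin budget and without the paper's interpolation machinery:

```python
def compress(hist, max_bins=8):
    """hist is a list of (centroid, count) bins. Repeatedly merge the two
    closest adjacent bins until at most max_bins remain."""
    hist = sorted(hist)
    while len(hist) > max_bins:
        i = min(range(len(hist) - 1), key=lambda j: hist[j + 1][0] - hist[j][0])
        (c1, n1), (c2, n2) = hist[i], hist[i + 1]
        hist[i:i + 2] = [((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)]
    return hist

def add_point(hist, x, max_bins=8):
    """Insert a new observation as its own bin, then re-compress."""
    return compress(hist + [(x, 1)], max_bins)

def merge_hists(h1, h2, max_bins=8):
    """Workers' per-attribute histograms merge by concatenation plus
    compression, which makes horizontal partitioning cheap to aggregate."""
    return compress(h1 + h2, max_bins)

h1, h2 = [], []
for x in [1.0, 1.1, 5.0, 5.2, 9.0]:
    h1 = add_point(h1, x, max_bins=3)
for x in [1.05, 5.1, 8.8, 9.2]:
    h2 = add_point(h2, x, max_bins=3)
merged = merge_hists(h1, h2, max_bins=3)
```

Merging never loses counts, only resolution, which is why the aggregator can still compute approximate splits from the merged histograms.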
Hoeffding Tree Profiling
(Figure: training-time breakdown for 100 nominal + 100 numeric attributes: Learn 70%, Split 24%, Other 6%.)

129
Vertical Partitioning
N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo: “VHT: Vertical Hoeffding Tree”, 2016 https://arxiv.org/abs/1607.08325

(Figure: the attributes of the instance stream are split across nodes; each node keeps the statistics for its own subset of attributes and computes the corresponding splits. Each single attribute is tracked in a single node.)

130
Vertical Hoeffding Tree

(Topology: Source (n) sends instances via shuffle grouping to Model (n); Model sends control messages via key grouping to Stats (n); Stats send split messages via all grouping back to Model; results flow to a single Evaluator.)

131
Advantages of 

Vertical Parallelism
• High number of attributes => high level of parallelism

(e.g., documents)

• vs. task parallelism

• Parallelism observed immediately

• vs. horizontal parallelism

• Reduced memory usage (no model replication)

• Parallelized split computation


132
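A sketch of why vertical parallelism parallelizes split computation: each worker owns a disjoint subset of attributes and proposes one local candidate, so the model aggregator compares only one candidate per worker. The per-attribute gains and the modulo assignment below are hypothetical:

```python
def partition_attributes(n_attributes, n_workers):
    """Vertical partitioning sketch: each attribute's statistics live on
    exactly one worker (assigned here by simple modulo hashing)."""
    return {a: a % n_workers for a in range(n_attributes)}

def local_best_split(stats):
    """Each worker scores only its own attributes and proposes its best
    candidate; stats maps attribute -> split gain (hypothetical values)."""
    return max(stats.items(), key=lambda kv: kv[1])

# Hypothetical per-attribute split gains, spread over 3 workers.
gains = {0: 0.10, 1: 0.42, 2: 0.05, 3: 0.31, 4: 0.27, 5: 0.12}
assignment = partition_attributes(6, 3)
workers = {w: {} for w in range(3)}
for a, w in assignment.items():
    workers[w][a] = gains[a]
# The model aggregator only compares one candidate per worker.
candidates = [local_best_split(s) for s in workers.values()]
best_attr, best_gain = max(candidates, key=lambda kv: kv[1])
```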
Regression

133
VAMR
A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ‘14

• Vertical AMRules

• Model: rule body + head
– the target mean is updated continuously with the covered instances, and used for predictions
– a default rule creates new rules

• Learner: statistics
– vertical: each Learner tracks the statistics of an independent subset of rules
– one rule is tracked by only one Learner

• Model -> Learner: key grouping on rule ID

(Topology: instances reach the Model Aggregator, which routes them to Learner1 … Learnerp; Learners send rule updates and new rules back to the Model Aggregator, which emits predictions.)

134
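A minimal rule-based regression sketch in the spirit of AMRules (conditions, covering, incremental target means); it omits the rule-expansion statistics and is only illustrative:

```python
class Rule:
    """Illustrative rule: a conjunction of attribute conditions plus a
    running mean of the targets it has covered."""
    def __init__(self, conditions):
        self.conditions = conditions      # list of (attr_index, op, value)
        self.count, self.mean = 0, 0.0

    def covers(self, x):
        ops = {"<=": lambda a, b: a <= b, ">": lambda a, b: a > b}
        return all(ops[op](x[i], v) for i, op, v in self.conditions)

    def update(self, y):
        # Incremental mean of the covered targets.
        self.count += 1
        self.mean += (y - self.mean) / self.count

def predict(rules, default, x):
    """Prediction: mean of the first covering rule, else the default rule."""
    for r in rules:
        if r.covers(x):
            return r.mean
    return default.mean

default = Rule([])                 # the default rule covers everything
r1 = Rule([(0, "<=", 5.0)])
for x, y in [((3.0,), 10.0), ((4.0,), 12.0), ((8.0,), 100.0)]:
    (r1 if r1.covers(x) else default).update(y)
p_low = predict([r1], default, (2.0,))
p_high = predict([r1], default, (9.0,))
```

Because rules are independent, routing each rule's updates to one Learner by rule ID (key grouping) needs no coordination between Learners.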
HAMR
A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ‘14

• VAMR's single Model Aggregator is a bottleneck

• Hybrid AMRules (vertical + horizontal)
– shuffle instances among multiple Model Aggregators for parallelism

• Problem: a distributed default rule decreases performance
– solution: a separate, dedicated Learner for the default rule

(Topology: instances are shuffled across Model Aggregator1 … Model Aggregatorr; each aggregator exchanges rule updates and new rules with the Learners; a dedicated default-rule Learner creates the new rules; the aggregators emit predictions.)

135
Open Source Tools

136
http://moa.cms.waikato.ac.nz/

MOA
• {M}assive {O}nline {A}nalysis is a framework for online learning from data streams.

• It is closely related to WEKA.

• It includes a collection of offline and online algorithms, as well as tools for evaluation:

• classification, regression, clustering

• outlier detection, frequent pattern mining

• Easy to extend, and to design and run experiments

137
http://huawei-noah.github.io/streamDM-Cpp/

streamDM C++

138
Vision

Streaming Distributed

IoT Big Data Stream Mining


139
http://samoa-project.net

SAMOA
G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)

(Figure: taxonomy of data mining tools. Non-distributed batch: R, WEKA, …; non-distributed stream: MOA. Distributed batch: Hadoop, Mahout; distributed stream: Storm, S4, Samza, with SAMOA on top.)

140
http://huawei-noah.github.io/streamDM

StreamDM

141
Applications

142
Application: Encrypted Traffic Fingerprinting

• Traffic Fingerprinting (TFP) is a Traffic Analysis


(TA) attack that threatens web/app navigation
privacy.
• TFP allows attackers to learn information about a
website/app accessed by the user, by recognizing
patterns in traffic.
• Examples:
– Website Fingerprinting
– App Fingerprinting

http://www.clickz.com/tag/fingerprinting

FEARLESS engineering 143


Website Fingerprinting

FEARLESS engineering 144


App Fingerprinting

FEARLESS engineering 145


Application: Duplicate Detection over Textual News Stream

(Figure: news reports arriving on Day 1, Day 2, and Day 3; later reports are duplicates of earlier ones.)

FEARLESS engineering 146


Duplicate Detection over Textual News Stream

FEARLESS engineering 147


Future Direction

• Multi-stream classification
– Some streams are labeled, some are not.
– Distributions are related but not the same (e.g., covariate shift).
– Requires bias correction, domain adaptation, etc.
• Adversarial active learning
– Traditional algorithms are vulnerable to adversarial
manipulation.
– Instances should be selected carefully.
• Efficient online change detection

FEARLESS engineering 148


Conclusions

149
Summary

• IoT streaming is useful for finding approximate solutions within a reasonable amount of time and with limited resources

• Algorithms for classification, regression, clustering,


frequent itemset mining

• Distributed systems for very large streams

150
Open Challenges
• Structured output

• Multi-target learning

• Millions of classes

• Representation learning

• Ease of use

151
References

152
• IDC’s Digital Universe Study. EMC (2011)

• P. Domingos, G. Hulten: “Mining high-speed data streams”. KDD ’00

• J Gama, P. Medas, G. Castillo, P. Rodrigues: “Learning with drift detection”. SBIA’04

• G. Hulten, L. Spencer, P. Domingos: “Mining Time-Changing Data Streams”. KDD ‘01

• J. Gama, R. Fernandes, R. Rocha: “Decision trees for mining data streams”. IDA (2006)

• A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams”. IDA (2009)

• A. Bifet, R. Gavaldà: “Learning from Time-Changing Data with Adaptive Windowing”. SDM ’07

• E. Almeida, C. Ferreira, J. Gama. "Adaptive Model Rules from Data Streams”. ECML-PKDD ‘13

• H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer: “An effective evaluation
measure for clustering on evolving data streams”. KDD ’11

• T. Zhang, R. Ramakrishnan, M. Livny: “BIRCH: An Efficient Data Clustering Method for Very Large
Databases”. SIGMOD ’96

• C. C. Aggarwal, J. Han, J. Wang, P. S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ‘03

• M. Ester, H. Kriegel, J. Sander, X. Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise”. KDD ‘96

153
• F. Cao, M. Ester, W. Qian, A. Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”.
SDM ‘06

• G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB '02

• Y. Chi , H. Wang, P. Yu , R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding
Window”. ICDM ’04

• C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time
granularities”. NGDM (2003)

• M. Stonebraker U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05

• A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”.
JMLR (2014)

• Y. Ben-Haim, E. Tom-Tov: “A Streaming Parallel Decision Tree Algorithm”. JMLR (2010)

• A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data
Streams”. BigData ’14

• G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)

• J. Gama: “Knowledge Discovery from Data Streams”. Chapman and Hall (2010)

• J. Gama: “Data Stream Mining: the Bounded Rationality”. Informatica 37(1): 21-25 (2013)

154
Contacts
• https://sites.google.com/site/iotminingtutorial

• Gianmarco De Francisci Morales



gdfm@acm.org @gdfm7

• Albert Bifet

abifet@telecom-paristech.fr @abifet

• Latifur Khan

lkhan@utdallas.edu

• João Gama

jgama@fep.up.pt @JoaoMPGama

• Wei Fan

fanwei03@baidu.com @fanwei

155
