
IoT Big Data

Stream Mining
Tutorial
KDD 2016

Gianmarco De Francisci Morales, Albert Bifet, Latifur Khan,


Joao Gama, and Wei Fan
Organizers (1/3)

Gianmarco De Francisci Morales is a Scientist at QCRI. His research focuses on large-scale data mining and big data, with a particular emphasis on web mining and Data-Intensive Scalable Computing systems. He is an active member of the open-source community of the Apache Software Foundation, working on the Hadoop ecosystem, and a committer for the Apache Pig project. He is the co-leader of the SAMOA project, an open-source platform for mining big data streams.

http://gdfm.me

Albert Bifet is Associate Professor at Telecom ParisTech and Honorary Research Associate at the WEKA Machine Learning Group at the University of Waikato. He is the author of the book Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. He is one of the leaders of the MOA and SAMOA software environments for implementing algorithms and running experiments for online learning from evolving data streams.

http://albertbifet.com

2
Organizers (2/3)
Latifur Khan is a full Professor (tenured) in the Computer Science Department at the University of Texas at Dallas, where he has been teaching and conducting research since September 2000. He has received prestigious awards including the IEEE Technical Achievement Award for Intelligence and Security Informatics. Dr. Khan is an ACM Distinguished Scientist and a Senior Member of IEEE.

https://www.utdallas.edu/~lkhan/
João Gama is an Associate Professor at the University of Porto and a senior researcher at LIAAD, INESC TEC. He received his Ph.D. in Computer Science from the University of Porto, Portugal. His main interests are machine learning and data mining, mainly in the context of time-evolving data streams. He authored a recent book on Knowledge Discovery from Data Streams.

http://www.liaad.up.pt/~jgama

3
Organizers (3/3)

Wei Fan is the Deputy Head of Baidu Research Big Data Lab. His co-authored paper received the ICDM '06 Best Application Paper Award, and he led the team that used his Random Decision Tree method to win the 2008 ICDM Data Mining Cup Championship. He received the 2010 IBM Outstanding Technical Achievement Award for his contribution to IBM InfoSphere Streams. At Huawei, he led his colleagues to develop Huawei StreamSMART, a streaming platform for online and real-time processing.

http://www.weifan.info

4
https://sites.google.com/site/iotminingtutorial

Outline
Part I: IoT Fundamentals of Stream Mining
• IoT Setting
• Classification
• Concept Drift
• Regression
• Clustering
• Frequent Itemset Mining
• Concept Evolution
• Limited Labeled Learning

Part II: IoT Distributed Stream Mining
• Distributed Stream Processing Engines
• Classification
• Regression
• Open Source Tools
• Applications

• Conclusions

5
IoT Fundamentals of
Stream Mining
Part I
IoT Setting

7
INTERNET OF THINGS

IoT: sensors and actuators connected by networks to computing systems.

• Gartner predicts 20.8 billion IoT devices by 2020.

• IDC projects 32 billion IoT devices by 2020.
INTERNET OF THINGS

[Figure 3: EMC Digital Universe, 2014 — growth of the digital universe]

IOT (MCKINSEY)
IOT AND INDUSTRY 4.0

• Interoperability: IoT
• Information transparency: virtual copy of the physical world
• Technical assistance: support human decisions
• Decentralized decisions: make decisions on their own

Standard Approach
• Gather → Clean → Model → Deploy cycle
• Finite training sets
• Static models

13
Pain Points

• Things change over time (e.g., as spam trends change, retrain the model)

• Need to retrain!

• How often?

• Data unused until next update!

• Value of data wasted

14
Value of Data
15
IoT Stream Mining

• Maintain models online

• Incorporate data on the fly

• Unbounded training sets

• Detect changes and adapt

• Dynamic models

16
IoT Big Data Streams
• Volume + Velocity (+ Variety)

• Too large for single commodity server main memory

• Too fast for single commodity server CPU

• A solution needs to be:

• Distributed

• Scalable

17
Approximation Algorithms

• General idea, good for streaming algorithms

• Small error ε with high probability 1-δ

• True hypothesis H, and learned hypothesis Ĥ

• Pr[ |H - Ĥ| < ε|H| ] > 1-δ

18
Approximation Algorithms

• What is the largest number that we can store in 8


bits?

19
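The puzzle on the previous slide motivates approximate algorithms: exactly, 8 bits can store at most 255, but with a small error and high probability we can count far higher. A minimal sketch of one classic answer, Morris's approximate counter, which keeps roughly log2 of the count in the register (the code and class name are ours, not from the tutorial):

```python
import random

class MorrisCounter:
    """Approximate counting in few bits: the register holds roughly
    log2(count), so an 8-bit register can represent counts up to ~2^255."""

    def __init__(self):
        self.register = 0  # would fit in 8 bits

    def increment(self):
        # increment the register with probability 2^-register
        if random.random() < 2.0 ** -self.register:
            self.register += 1

    def estimate(self):
        # unbiased estimate of the true count
        return 2 ** self.register - 1
```

After n increments the register sits near log2(n), trading exactness for a tiny memory footprint, in the same (ε, δ) spirit as the bound on the previous slide.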
Predictive Learning
• Classification
• Regression
• Concept Drift

25
Classification

26
Definition
Given a set of training examples belonging to nC different classes, a classifier algorithm builds a model that predicts for every unlabeled instance x the class C to which it belongs.

Examples
• Email spam filter
• Twitter sentiment analyzer

27 Photo: Stephen Merity http://smerity.com


Process
• One example at a time, used at most once

• Limited memory
• Limited time
• Anytime prediction

28
Naïve Bayes
• Based on Bayes' theorem:

P(C|x) = P(x|C) P(C) / P(x)

posterior = (likelihood × prior) / evidence

• Probability of observing feature xi given class C

• Prior class probability P(C)

• Just counting!

P(C|x) ∝ ∏_{xi ∈ x} P(xi|C) P(C)

C = arg max_C P(C|x)
29
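The "just counting" remark can be made concrete with a small streaming classifier. A hypothetical sketch (class name and Laplace smoothing are our choices, not from the slides); x is a set of discrete feature values:

```python
import math
from collections import defaultdict

class StreamingNaiveBayes:
    """Streaming Naive Bayes: maintain class and feature counts online."""

    def __init__(self):
        self.n = 0
        self.class_count = defaultdict(int)    # counts N(C)
        self.feature_count = defaultdict(int)  # counts N(xi, C)

    def learn_one(self, x, c):
        # "just counting": one pass, constant work per example
        self.n += 1
        self.class_count[c] += 1
        for f in x:
            self.feature_count[(f, c)] += 1

    def predict_one(self, x):
        best_class, best_score = None, float("-inf")
        for c, nc in self.class_count.items():
            # log P(C) + sum_i log P(xi|C), Laplace-smoothed
            score = math.log(nc / self.n)
            for f in x:
                score += math.log((self.feature_count[(f, c)] + 1) / (nc + 2))
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Because the model is only counters, it satisfies the stream constraints from the Process slide: one example at a time, limited memory, anytime prediction.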
Perceptron
• Linear classifier

• Data stream: ⟨x⃗i, yi⟩

• ỹi = hw⃗(x⃗i) = σ(w⃗ᵀx⃗i), with σ(x) = 1/(1+e⁻ˣ) and σʹ = σ(x)(1−σ(x))

• Minimize MSE: J(w⃗) = ½ ∑ (yi − ỹi)²

• SGD: w⃗i+1 = w⃗i − η ∇J

• ∇J = −(yi − ỹi) ỹi (1 − ỹi) x⃗i

• Update rule: w⃗i+1 = w⃗i + η (yi − ỹi) ỹi (1 − ỹi) x⃗i

[Figure: perceptron diagram — inputs Attribute 1–5 with weights w1–w5, output hw⃗(x⃗i)]

30
Perceptron Learning

PerceptronLearning(Stream, η)
1  for each class
2    do PerceptronLearning(Stream, class, η)

PerceptronLearning(Stream, class, η)
1  ▷ Let w0 and w⃗ be randomly initialized
2  for each example (x⃗, y) in Stream
3    do if class = y
4      then δ = (1 − hw⃗(x⃗)) · hw⃗(x⃗) · (1 − hw⃗(x⃗))
5      else δ = (0 − hw⃗(x⃗)) · hw⃗(x⃗) · (1 − hw⃗(x⃗))
6    w⃗ = w⃗ + η · δ · x⃗

PerceptronPrediction(x⃗)
1  return arg max_class hw⃗class(x⃗)
31
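The pseudocode above translates almost line by line into Python. A sketch assuming dense numeric features and one sigmoid perceptron per class, trained one-vs-rest as on the slide (class and method names are ours):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class StreamingPerceptron:
    """One sigmoid perceptron per class; SGD update
    delta = (target - h(x)) * h(x) * (1 - h(x)) as in the pseudocode."""

    def __init__(self, n_features, n_classes, eta=0.5, seed=1):
        rng = random.Random(seed)
        self.eta = eta
        # one weight vector (bias w0 plus n_features weights) per class
        self.w = [[rng.uniform(-0.05, 0.05) for _ in range(n_features + 1)]
                  for _ in range(n_classes)]

    def _h(self, w, x):
        return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

    def learn_one(self, x, y):
        for c, w in enumerate(self.w):
            target = 1.0 if c == y else 0.0
            h = self._h(w, x)
            delta = (target - h) * h * (1.0 - h)
            w[0] += self.eta * delta            # bias input is 1
            for i, xi in enumerate(x):
                w[i + 1] += self.eta * delta * xi

    def predict_one(self, x):
        # arg max over the per-class perceptron outputs
        return max(range(len(self.w)), key=lambda c: self._h(self.w[c], x))
```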
Decision Tree
• Each node tests a feature
• Each branch represents a value
• Each leaf assigns a class

• Greedy recursive induction
• Sort all examples through tree
• xi = most discriminative attribute
• New node for xi, new branch for each value, leaf assigns majority class
• Stop if no error | limit on #instances

[Figure: example tree for "Car deal?" — Road Tested? (Yes/No), Mileage? (High/Low), Age? (Recent/Old), leaves ✅/❌]

32
Very Fast Decision Tree
Pedro Domingos, Geoff Hulten: “Mining high-speed data streams”. KDD ’00

• AKA, Hoeffding Tree

• A small sample can often be enough to choose a near-optimal decision

• Collect sufficient statistics from a small set of examples

• Estimate the merit of each alternative attribute

• Choose the sample size that allows us to differentiate between the alternatives

33
Leaf Expansion
• When should we expand a leaf?

• Let x1 be the most informative attribute,



x2 the second most informative one

• Is x1 a stable option?

• Hoeffding bound: split if G(x1) − G(x2) > ε = √( R² ln(1/δ) / 2n )
34
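The split test can be written directly from the bound. A small sketch (function names are ours; the tie-breaking parameter τ follows the later Properties slide):

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon = sqrt(R^2 ln(1/delta) / (2n)); shrinks as n grows."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n, tau=0.05):
    """Split when the observed merit gap exceeds epsilon, or break a
    tie when epsilon itself has shrunk below tau."""
    eps = hoeffding_bound(R, delta, n)
    return (g_best - g_second) > eps or eps < tau
```

For information gain with nC classes, R = log2(nC); the leaf keeps accumulating examples until either condition fires.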
Hoeffding Tree or VFDT

HT Induction

HT(Stream, δ)
1  ▷ Let HT be a tree with a single leaf (root)
2  ▷ Init counts nijk at root
3  for each example (x, y) in Stream
4    do HTGrow((x, y), HT, δ)

HTGrow((x, y), HT, δ)
1  ▷ Sort (x, y) to leaf l using HT
2  ▷ Update counts nijk at leaf l
3  if examples seen so far at l are not all of the same class
4    then ▷ Compute G for each attribute
5    if G(best attr.) − G(2nd best) > √( R² ln(1/δ) / 2n )
6      then ▷ Split leaf on best attribute
7      for each branch
8        do ▷ Start new leaf and initialize counts

35
Properties
• Number of examples to expand node depends only on
Hoeffding bound (ε decreases with √n)

• Low variance model (stable decisions with statistical support)

• Low overfitting (examples processed only once, no need for


pruning)

• Theoretical guarantees on error rate with high probability

• Hoeffding algorithms are asymptotically close to the batch learner.

Expected disagreement δ/p (p = probability an instance falls into a leaf)

• Ties: broken when ε < τ even if ΔG < ε

36
Regression

37
Definition
Given a set of training examples with a numeric label, a regression algorithm builds a model y = ƒ(x) that predicts for every unlabeled instance x the value with high accuracy.

Examples
• Stock price
• Airplane delay

38 Photo: Stephen Merity http://smerity.com


Perceptron
• Linear regressor

• Data stream: ⟨x⃗i, yi⟩

• ỹi = hw⃗(x⃗i) = w⃗ᵀx⃗i

• Minimize MSE: J(w⃗) = ½ ∑ (yi − ỹi)²

• SGD: w⃗′ = w⃗ − η ∇J

• ∇J = −(yi − ỹi) x⃗i

• Update rule: w⃗′ = w⃗ + η (yi − ỹi) x⃗i

[Figure: perceptron diagram — inputs Attribute 1–5 with weights w1–w5, output hw⃗(x⃗i)]

39
Regression Tree

• Same structure as decision tree

• Predict = average target value or



linear model at leaf (vs majority)

• Gain = reduction in standard deviation (vs entropy)

σ = √( ∑ (ỹi − yi)² / (N − 1) )

40
AMRules
• Problem: very large decision trees have context that is complex and hard to understand

• Rules: self-contained, modular, easier to interpret, no need to cover the universe

• 𝓛 keeps sufficient statistics to:

• make predictions

• expand the rule

• detect changes and anomalies

41
AMRules: Rule Sets
E. Almeida, C. Ferreira, J. Gama: "Adaptive Model Rules from Data Streams". ECML-PKDD '13

• Ruleset: ensemble of rules

• Rule prediction: mean, linear model

• Ruleset prediction

• Weighted avg. of predictions of rules covering instance x

• Weights inversely proportional to error

• Default rule covers uncovered instances (e.g., x = [4, 1, 1, 2])

f̂(x) = ∑_{Rl ∈ S(xi)} θl ŷl

42
AMRules Induction

• Rule creation: default rule expansion

• Rule expansion: split on attribute maximizing σ reduction

• Hoeffding bound ε = √( R² ln(1/δ) / 2n )

• Expand when σ1st/σ2nd < 1 − ε

• Evict rule when Page-Hinckley test error is large

• Detect and explain local anomalies

Algorithm 1: Training AMRules
Input: S: stream of examples
begin
  R ← {}, D ← rule0
  foreach (x, y) ∈ S do
    foreach rule r ∈ S(x) do
      if ¬IsAnomaly(x, r) then
        if PHTest(error_r, λ) then
          remove the rule r from R
        else
          update sufficient statistics L_r
          ExpandRule(r)
    if S(x) = ∅ then
      update L_D
      ExpandRule(D)
      if D expanded then
        R ← R ∪ {D}
        D ← rule0
  return (R, L_D)

43
Concept Drift

44
Definition
Given an input sequence ⟨x1, x2, …, xt⟩, output at instant t an alarm signal if there is a distribution change, and a prediction x̂t+1 minimizing the error |x̂t+1 − xt+1|.

Outputs
• Alarm indicating change
• Estimate of parameter

45 Photo: http://www.logsearch.io
Application
• Change detection on evaluation of model

• Training error should decrease with more examples

• Change in distribution of training error

• Input = stream of real/binary numbers

• Trade-off between detecting true changes and avoiding false alarms

[Figure 1: Change Detector and Estimator System — input xt feeds an Estimator, a Change Detector raising the alarm, and a Memory module]

46
Cumulative Sum
• Alarm when mean of input data differs from zero

• Memoryless heuristic (no statistical guarantee)

• Parameters: threshold h, drift speed v

• g0 = 0, gt = max(0, gt-1 + εt - v)

• if gt > h then alarm; gt = 0

47
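The update rule above fits in a few lines. A sketch of the CUSUM detector exactly as given on the slide (class name ours):

```python
class Cusum:
    """CUSUM test: g_t = max(0, g_{t-1} + eps_t - v);
    alarm and reset when g_t > h. Parameters follow the slide:
    threshold h, drift speed v."""

    def __init__(self, h, v):
        self.h, self.v = h, v
        self.g = 0.0

    def update(self, value):
        self.g = max(0.0, self.g + value - self.v)
        if self.g > self.h:
            self.g = 0.0
            return True   # alarm: mean of input drifted from zero
        return False
```

As the slide notes, this is a memoryless heuristic: v absorbs normal fluctuation, and only a sustained positive mean pushes g over h.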
Page-Hinckley Test

• Similar structure to Cumulative Sum

• g0 = 0, gt = gt-1 + (εt - v)

• Gt = mint(gt)

• if gt - Gt > h then alarm; gt = 0

48
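A matching sketch of the Page-Hinckley test; resetting the running minimum together with g after an alarm is our implementation choice, not stated on the slide:

```python
class PageHinkley:
    """Page-Hinckley test: g_t = g_{t-1} + (eps_t - v),
    G_t = min_t g_t, alarm when g_t - G_t > h."""

    def __init__(self, h, v):
        self.h, self.v = h, v
        self.g = 0.0
        self.g_min = 0.0  # running minimum G_t

    def update(self, value):
        self.g += value - self.v
        self.g_min = min(self.g_min, self.g)
        if self.g - self.g_min > self.h:
            self.g = 0.0
            self.g_min = 0.0
            return True
        return False
```

Compared with CUSUM, g is not clipped at zero; the distance from its running minimum plays the same role as the clipped statistic.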
Statistical Process Control
J. Gama, P. Medas, G. Castillo, P. Rodrigues: “Learning with Drift Detection”. SBIA '04

• Monitor error in sliding window

• Null hypothesis: no change between windows

• If error > warning level: learn new model in parallel on the current window

• If error > drift level: substitute new model for old

[Figure: error rate vs number of examples processed, with warning level (pmin + smin) and drift level marking a concept drift — Statistical Drift Detection Method (João Gama et al. 2004)]
49
Concept-adapting VFDT
G. Hulten, L. Spencer, P. Domingos: “Mining Time-Changing Data Streams”. KDD ‘01

• Model consistent with sliding window on stream

• Keep sufficient statistics also at internal nodes

• Recheck periodically if splits pass Hoeffding test

• If test fails, grow alternate subtree and swap-in



when accuracy of alternate is better

• Processing updates O(1) time, +O(W) memory

• Increase counters for the incoming instance, decrease counters for the instance going out of the window

50
VFDTc: Adapting to Change
J. Gama, R. Fernandes, R. Rocha: “Decision Trees for Mining Data Streams”. IDA (2006)

• Monitor error rate

• When drift is detected

• Start learning alternative subtree in parallel

• When accuracy of alternative is better

• Swap subtree

• No need for window of instances

51
Hoeffding Adaptive Tree
A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams” IDA (2009)

• Replace frequency counters by estimators

• No need for window of instances

• Sufficient statistics kept by estimators separately

• Parameter-free change detector + estimator with


theoretical guarantees for subtree swap (ADWIN)

• Keeps sliding window consistent with 



“no-change hypothesis”
A. Bifet, R. Gavaldà: “Learning from Time-Changing Data with Adaptive Windowing”. SDM ‘07
52
Clustering

53
Definition
Given a set of unlabeled instances, distribute them into homogeneous groups according to some common relations or affinities.

Examples
• Market segmentation
• Social network communities

54 Photo: W. Kandinsky - Several Circles (edited)


Approaches
• Distance based (CluStream)

• Density based (DenStream)

• Kernel based, Coreset based, much more…

• Most approaches combine online + offline phase

• Formally: minimize cost function 



over a partitioning of the data

55
Micro-Clusters
Tian Zhang, Raghu Ramakrishnan, Miron Livny: “BIRCH: An Efficient Data Clustering Method for Very Large Databases”. SIGMOD ’96

• AKA Cluster Features (CF): a statistical summary structure

• Maintained in the online phase, input for the offline phase

• Data stream ⟨x⃗i⟩, d dimensions

• Cluster feature vector:
  N: number of points
  LSj: sum of values (for dim. j)
  SSj: sum of squared values (for dim. j)

• Easy to update, easy to merge

• Constant space irrespective of the number of examples!

56
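The (N, LSj, SSj) summary is easy to implement; both update and merge are O(d). A sketch (class and method names ours; the radius formula is the standard BIRCH RMS deviation from the centroid):

```python
class ClusterFeature:
    """BIRCH-style cluster feature (N, LS, SS) for d dimensions:
    constant space, updatable and mergeable in O(d)."""

    def __init__(self, d):
        self.n = 0
        self.ls = [0.0] * d   # per-dimension linear sum
        self.ss = [0.0] * d   # per-dimension squared sum

    def add(self, x):
        self.n += 1
        for j, v in enumerate(x):
            self.ls[j] += v
            self.ss[j] += v * v

    def merge(self, other):
        # merging two CFs is just component-wise addition
        self.n += other.n
        for j in range(len(self.ls)):
            self.ls[j] += other.ls[j]
            self.ss[j] += other.ss[j]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # root-mean-square deviation of the points from the centroid
        c = self.centroid()
        var = sum(self.ss[j] / self.n - c[j] ** 2 for j in range(len(c)))
        return max(0.0, var) ** 0.5
```

These are exactly the quantities CluStream and DenStream maintain per micro-cluster in their online phases.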
CluStream
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ‘03

• Timestamped data stream ⟨ti, x⃗i⟩, represented in d+1 dimensions

• Seed algorithm with q micro-clusters (k-means on initial data)

• Online phase. For each new point, either:

• Update one micro-cluster (point within maximum boundary)

• Create a new micro-cluster (delete/merge other micro-clusters)

• Offline phase. Determine k macro-clusters on demand:

• K-means on micro-clusters (weighted pseudo-points)

• Time-horizon queries via pyramidal snapshot mechanism

57
DBSCAN
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: “A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise”. KDD ‘96

• ε-n(p) = set of points at distance ≤ ε

• Core object q = ε-n(q) has weight ≥ μ

• p is directly density-reachable from q

• p ∈ ε-n(q) ∧ q is a core object

• pn is density-reachable from p1

• chain of points p1,…,pn such that pi+1 is directly d-r from pi

• Cluster = set of points that are


mutually density-connected

58
DenStream
Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ‘06

• Based on DBSCAN

• Core-micro-cluster: CMC(w,c,r) 

weight w > μ, center c, radius r < ε

• Potential/outlier micro-clusters

• Online: merge point into p- (or o-) micro-cluster if new radius r′ < ε

• Promote outlier to potential if w > βμ

• Else create new o-micro-cluster

• Offline: DBSCAN on the micro-clusters

59
Static Evaluation
• Internal (validation)

• Sum of squared distance (point to centroid)

• Dunn index (on distance d)



D = min(inter-cluster d) / max(intra-cluster d)

• External (ground truth)

• Rand = #agreements / #choices = 2(TP+TN)/(N(N-1))

• Purity = #majority class per cluster / N


60
Streaming Evaluation
H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer:

“An effective evaluation measure for clustering on evolving data streams”. KDD ’11

• Clusters may: appear, fade, move, merge

• Missed points (unassigned)

• Misplaced points (assigned to different cluster)

• Noise

• Cluster Mapping Measure CMM

• External (ground truth)

• Normalized sum of penalties of these errors

61
Frequent Itemset
Mining

62
Definition
Given a collection of sets of items, find all the subsets that occur frequently, i.e., more than a minimum support number of times.

Examples
• Market basket mining
• Item recommendation

63
Fundamentals
• Dataset D, set of items t ∈ D,
constant s (minimum support)

• Support(t) = number of sets



in D that contain t

• Itemset t is frequent if
support(t) ≥ s

• Frequent Itemset problem:

• Given D and s, find all


frequent itemsets

64
Itemset Mining Example

Dataset:
Document  Items
d1        abce
d2        cde
d3        abce
d4        acde
d5        abcde
d6        bcd

Frequent itemsets (minimal support = 3):
Support  Documents            Frequent
6        d1,d2,d3,d4,d5,d6    c
5        d1,d2,d3,d4,d5       e, ce
4        d1,d3,d4,d5          a, ac, ae, ace
4        d1,d3,d5,d6          b, bc
4        d2,d4,d5,d6          d, cd
3        d1,d3,d5             ab, abc, abe, be, bce, abce
3        d2,d4,d5             de, cde

65
Variations
• A priori property: t ⊆ t' ➝ support(t) ≥ support(t’)

• Closed: none of its supersets has the same support

• Can generate all freq. itemsets and their support

• Maximal: none of its supersets is frequent

• Can generate all freq. itemsets (without support)

• Maximal ⊆ Closed ⊆ Frequent


66
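The frequent/closed/maximal relationships can be checked by brute force on the running example dataset. A small sketch (variable names ours):

```python
from itertools import combinations

# the running example dataset from the slides
docs = {"d1": "abce", "d2": "cde", "d3": "abce",
        "d4": "acde", "d5": "abcde", "d6": "bcd"}
s = 3  # minimal support

def support(itemset):
    return sum(1 for t in docs.values() if set(itemset) <= set(t))

items = sorted(set("".join(docs.values())))
all_itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

frequent = {t for t in all_itemsets if support(t) >= s}
# closed: no proper superset has the same support
closed = {t for t in frequent
          if not any(t < u and support(u) == support(t) for u in frequent)}
# maximal: no proper superset is frequent
maximal = {t for t in frequent if not any(t < u for u in frequent)}
```

Running this reproduces the table: e.g. ce is closed (support 5, all supersets drop to ≤ 4), a is frequent but not closed (ac has the same support 4), and abce is maximal.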
Example

Dataset:
Document  Items
d1        abce
d2        cde
d3        abce
d4        acde
d5        abcde
d6        bcd

67
Itemset Streams

• Support as fraction of stream length

• Exact vs approximate

• Incremental, sliding window, adaptive

• Frequent, closed, maximal

68
Pattern mining in streams
Key data structure: lattice of patterns, with counts

[Figure: lattice over items A, B, C, D with counts — {A},20 {B},18 {C},18 {D},25; {A,B},15 {A,C},12 {B,C},10 {A,D},5 {B,D},12 {C,D},12; {A,B,C},4 {A,B,D},3 {A,C,D},3 {B,C,D},8; {A,B,C,D},2 — with a boundary separating count ≤ 7 from count > 7]

69
Pattern mining in streams
The vast majority of stream pattern mining algorithms
(implicitly or explicitly) build and update the pattern
lattice.

General scheme:

let L be initial, empty lattice;


forever do {
collect a batch of items of size B;
build a summary S of the batch;
merge S into L;
}

70
Lossy Counting
G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB '02

• Keep data structure D with tuples (x, freq(x), error(x))

• Imagine dividing the stream into buckets of size 1/ε

• For each itemset x in the stream, let Bid = the current sequential bucket id, starting from 1

• if x ∈ D, freq(x)++

• else D ← D ∪ (x, 1, Bid - 1)

• Prune D at bucket boundaries: evict x if freq(x) + error(x) ≤ Bid

71
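The bucket bookkeeping above fits in a small class, sketched here for single items (the itemset version keeps the same (freq, error) structure per itemset; names ours):

```python
class LossyCounter:
    """Lossy Counting: (x, freq, error) entries, pruned at bucket
    boundaries; counts underestimate the truth by at most eps * n."""

    def __init__(self, epsilon):
        self.eps = epsilon
        self.bucket_width = int(1.0 / epsilon)
        self.entries = {}  # x -> (freq, error)
        self.n = 0

    def add(self, x):
        self.n += 1
        bucket_id = (self.n - 1) // self.bucket_width + 1
        if x in self.entries:
            freq, err = self.entries[x]
            self.entries[x] = (freq + 1, err)
        else:
            self.entries[x] = (1, bucket_id - 1)
        # prune at bucket boundaries: evict if freq + error <= bucket_id
        if self.n % self.bucket_width == 0:
            self.entries = {k: (f, e) for k, (f, e) in self.entries.items()
                            if f + e > bucket_id}

    def frequent(self, s):
        # every x with true frequency >= s * n is guaranteed to be here
        return {k for k, (f, e) in self.entries.items()
                if f >= (s - self.eps) * self.n}
```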
Moment
Y. Chi , H. Wang, P. Yu , R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window”. ICDM ‘04

• Keeps track of boundary below frequent itemsets in a window

• Closed Enumeration Tree (CET) (~ prefix tree)

• Infrequent gateway nodes (infrequent)

• Unpromising gateway nodes (frequent non-closed, child non-closed)

• Intermediate nodes (frequent non-closed, child closed)

• Closed nodes (frequent)

• By adding/removing transactions closed/infreq. do not change

72
FP-Stream
C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time granularities”. NGDM (2003)

• Multiple time granularities

• Based on FP-Growth (depth-first search over itemset lattice)

• Pattern-tree + Tilted-time window

• Time sensitive queries, emphasis on recent history

• High time and memory complexity

73
Itemset mining

• CLOSTREAM (Yen+ 09): sliding window, all closed, exact

• MFI (Li+ 09): transaction-sensitive window, frequent closed, exact

• IncMine (Cheng+ 08): sliding window, frequent closed, approximate; faster for moderate approximation ratios

74
Sequence, tree, graph mining

• MILE (Chen+ 05), SMDS (Marascu-Masseglia 06), SSBE (Koper-Nguyen 11): frequent subsequence (aka sequential pattern) mining

• Bifet+ 08: frequent closed unlabeled subtree mining

• Bifet+ 11: frequent closed labeled subtree mining

• Bifet+ 11: frequent closed subgraph mining

75
Concept Evolution

76
Challenge: Concept Evolution

[Figure: a 2-D feature space split by thresholds x1, y1, y2 into regions A, B, C, D of existing classes + and −; a novel class (X) appears inside region C]

Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = −

Existing classification models misclassify novel class instances.

FEARLESS engineering 77
Existing Techniques: Ensemble based Approaches
Masud et al. [1][2]

[Figure: an unlabeled input x,? is classified by models M1 (+), M2 (+), M3 (−); the individual classifier outputs are combined by voting into the ensemble output +]

[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data.
ICDM 2008: 929-934
[2] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza: Facing the reality of data stream classification: coping with scarcity of labeled data.
Knowl. Inf. Syst. 33(1): 213-244 (2011)

FEARLESS engineering 78


Existing Techniques: Ensemble Techniques

➢ Divide the data stream into equal sized chunks

– Train a classifier from each data chunk
– Keep the best t such classifiers as an ensemble
– Example: for t = 3

Note: Di may contain data points from different classes

[Figure: labeled chunks D1, D2, …, with the latest chunk unlabeled; a model Mi is trained per chunk, and the best t = 3 models form the ensemble used for prediction]

Addresses infinite length and concept-drift

FEARLESS engineering 79
Novel Class Detection
Masud et al. [1][2], Khateeb et al. [3]

➢ Non-parametric
– does not assume any underlying model of existing classes
➢ Steps:
1. Creating and saving decision boundary during training
2. Detecting and filtering outliers
3. Measuring cohesion and separation among test and training
instances

[1] Mohammad M. Masud, Qing Chen, Latifur Khan, Charu C. Aggarwal, Jing Gao, Jiawei Han, Ashok N. Srivastava, Nikunj C. Oza: Classification and Adaptive
Novel Class Detection of Feature-Evolving Data Streams. IEEE Trans. Knowl. Data Eng. 25(7): 1484-1497 (2013)
[2] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: Classification and Novel Class Detection in Concept-Drifting Data
Streams under Time Constraints. IEEE Trans. Knowl. Data Eng. 23(6): 859-874 (2011)
[3] Tahseen Al-Khateeb, Mohammad M. Masud, Latifur Khan, Charu C. Aggarwal, Jiawei Han, Bhavani M. Thuraisingham: Stream Classification with Recurring
and Novel Class Detection Using Class-Based Ensemble. ICDM 2012: 31-40

FEARLESS engineering 80
Training with Semi-Supervised Clustering

Impurity based Clustering

Legend:
Black dots: unlabeled instances
Colored dots: labeled instances

FEARLESS engineering 81
Semi Supervised Clustering
Masud et al. [1][2]

➢ Objective function (dual minimization problem)

Intra-cluster dispersion Cluster impurity

Impi = Aggregated dissimilarity counti × Entropyi = ADCi × Enti


Aggregated dissimilarity count (ADC):

Entropy (Ent):

The minimization problem is solved using the Expectation-Maximization (E-M) framework.
[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data.
ICDM 2008: 929-934
[2] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza: Facing the reality of data stream classification: coping with scarcity of labeled data.
Knowl. Inf. Syst. 33(1): 213-244 (2011)

FEARLESS engineering 82
Outlier Detection and Filtering

[Figure: a test instance inside the decision boundary is not an outlier; one outside is a raw outlier (Routlier). Each model M1, M2, …, Mt of the ensemble votes; if all models flag x as Routlier (AND = True), X is a filtered outlier (Foutlier), a potential novel class instance, otherwise X is an existing class instance]

Foutliers may appear as a result of novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.

FEARLESS engineering 83
Novel Class Detection

[Flowchart: (Step 1) a test instance x is passed to the ensemble of models M1, M2, …, Mt; (Step 2) if every model flags x as Routlier (AND = True), X is a filtered outlier (Foutlier), a potential novel class instance — otherwise X is an existing class instance; (Step 3) compute q-NSC with all models and the other Foutliers; (Step 4) if q-NSC > 0 for q′ > q Foutliers with all models, a novel class is found, otherwise treat x as an existing class]

FEARLESS engineering 84
Computing Cohesion & Separation

[Figure: an Foutlier x with its q nearest Foutliers λo,q(x) at mean distance a(x), and its q nearest neighbors in each existing class, λ+,q(x) and λ−,q(x), at mean distances b+(x) and b−(x)]

➢ λc(x) is the set of nearest neighbors of x belonging to class c
➢ λo(x) is the set of nearest Foutliers of x

▪ a(x) = mean distance from an Foutlier x to the instances in λo,q(x)
▪ bmin(x) = minimum among all bc(x) (e.g. b+(x) in figure)
▪ q-Neighborhood Silhouette Coefficient (q-NSC):

q-NSC(x) = (bmin(x) − a(x)) / max(bmin(x), a(x))

▪ If q-NSC(x) is positive, it means x is closer to the Foutliers than to any existing class.

FEARLESS engineering 85
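The q-NSC formula can be computed directly from distances. A hypothetical sketch (the helper name and data layout are ours):

```python
import math

def q_nsc(x, foutliers, classes, q):
    """q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x)).
    foutliers: list of other Foutlier points;
    classes: dict mapping class label -> list of training points."""

    def mean_q_nearest(points):
        # mean distance from x to its q nearest points in the given set
        dists = sorted(math.dist(x, p) for p in points)
        k = min(q, len(dists))
        return sum(dists[:k]) / k

    a = mean_q_nearest(foutliers)                                 # cohesion
    b_min = min(mean_q_nearest(pts) for pts in classes.values())  # separation
    return (b_min - a) / max(b_min, a)
```

A positive value means x sits closer to the other Foutliers than to any existing class, which is the per-instance evidence pooled in Step 4 above.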
Detection of Concurrent Novel Classes
Masud et al. [1], Faria et al. [2]

• Challenges
– High false positive (FP) (existing classes detected as novel) and false negative (FN) (missed
novel classes) rates
– Two or more novel classes arrive at a time

• Solutions
– Dynamic decision boundary – based on previous mistakes
• Inflate the decision boundary if high FP, deflate if high FN
– Build statistical model to filter out noise data and concept drift from the outliers.
– Multiple novel classes are detected by
• Constructing a graph where outlier cluster is a vertex
• Merging the vertices based on silhouette coefficient
• Counting the number of connected components in the resultant (i.e., merged) graph

[1] Mohammad M. Masud, Qing Chen, Latifur Khan, Charu C. Aggarwal, Jing Gao, Jiawei Han, Bhavani M. Thuraisingham: Addressing Concept-Evolution in Concept-Drifting
Data Streams. ICDM 2010: 929-934
[2] Elaine R. Faria, João Gama, André C. P. L. F. Carvalho: Novelty detection algorithm for data streams multi-class problems. SAC 2013: 795-800

FEARLESS engineering 86
Novel and Recurrence
Khateeb et al. [1]

[Figure: a stream of chunks chunk0 … chunk150; a class appears as Novel around chunk50 and as Recurrence around chunk100 and chunk150]


[1] Tahseen Al-Khateeb, Mohammad M. Masud, Latifur Khan, Charu C. Aggarwal, Jiawei Han, Bhavani M. Thuraisingham: Stream Classification with Recurring and Novel
Class Detection Using Class-Based Ensemble. ICDM 2012: 31-40

FEARLESS engineering 87
Challenges: Fixed Chunk Size/ Decay Rate
Masud et al. [1], Parker et al. [2], Aggarwal et al. [3], Klinkenberg[4], Cohen et al. [5]

[Figure: instances 1 … k form Chunk 1, instances k+1 … 2k form Chunk 2, and so on]

➢ Fixed chunk size


– requires a priori knowledge about the time-scale of change.
– delayed reaction if the chunk size is too large.
– unnecessary frequent training during stable period if chunk size is too small.

➢ Fixed decay rate


– assigns weight to data instances based on their age.
– decay constant must match the unknown rate of change.

[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: Classification and Novel Class Detection in Concept-Drifting Data Streams under
Time Constraints. IEEE Trans. Knowl. Data Eng. 23(6): 859-874 (2011)
[2] Brandon Shane Parker, Latifur Khan: Detecting and Tracking Concept Class Drift and Emergence in Non-Stationary Fast Data Streams. AAAI 2015: 2908-2913
[3] Charu C. Aggarwal, Philip S. Yu: On Classification of High-Cardinality Data Streams. SDM 2010: 802-813
[4] Ralf Klinkenberg: Learning drifting concepts: Example selection vs. example weighting. Intell. Data Anal. 8(3): 281-300 (2004)
[5] Edith Cohen, Martin J. Strauss: Maintaining time-decaying stream aggregates. J. Algorithms 59(1): 19-36 (2006)

FEARLESS engineering 88
Challenges: Fixed Chunk Size

[Figure: concept drifts over time; with a chunk size too large there is delayed reaction, with a chunk size too small there are performance issues (legend: correct vs wrong predictions)]
FEARLESS engineering 89
Solution: Adaptive Chunk Size

[Figure: with an adaptive chunk size, chunk boundaries align with the concept drifts, yielding correct predictions (legend: correct vs wrong)]

FEARLESS engineering 90
Adaptive Chunk - Sliding Window
Gama et al. [1], Bifet et al. [2], Harel et al. [3]

[Figure: a sliding window over instances 1 … n+3 whose size adapts over time]

➢ Existing dynamic sliding window techniques

– monitor the error rate of the classifier.
– update the classifier if it starts to show bad performance.
– are fully supervised, which is not feasible for real-world data streams.

[1] João Gama, Gladys Castillo: Learning with Local Drift Detection. ADMA 2006: 42-55
[2] Albert Bifet, Ricard Gavaldà: Learning from Time-Changing Data with Adaptive Windowing. SDM 2007: 443-448
[3] Maayan Harel, Shie Mannor, Ran El-Yaniv, Koby Crammer: Concept Drift Detection Through Resampling. ICML 2014: 1009-1017

FEARLESS engineering 91
Adaptive Chunk - Unsupervised
Haque et al. [1][2]

(Flowchart: each input instance is classified by the ensemble, yielding a predicted class and a classifier confidence. Confidence scores are stored in a sliding window over chunks C1 … Cn, and the confidence distributions before and after a candidate change point are compared. Change detected? Yes: update the classifier and shrink the window. No: grow the window.)

[1] Ahsanul Haque, Latifur Khan, Michael Baron, Bhavani M. Thuraisingham, Charu C. Aggarwal: Efficient handling of concept drift and concept evolution over Stream Data. ICDE 2016: 481-492.
[2] Ahsanul Haque, Latifur Khan, Michael Baron: SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream. AAAI 2016: 1652-1658.

FEARLESS engineering 92
Adaptive Chunk - Unsupervised
Haque et al. [1][2]

(Flowchart: as before, but the classifier confidence is now built per model: for each model M1 … Mt, an association estimator and a purity estimator are combined into that model's confidence, and the model confidences are combined into the ensemble's classifier confidence. Change detected? Yes: update the classifier and shrink the window. No: grow the window.)

[1] Ahsanul Haque, Latifur Khan, Michael Baron, Bhavani M. Thuraisingham, Charu C. Aggarwal: Efficient handling of concept drift and concept evolution over Stream Data. ICDE 2016: 481-492
[2] Ahsanul Haque, Latifur Khan, Michael Baron: SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream. AAAI 2016: 1652-1658.

FEARLESS engineering 93
Confidence of a model

FEARLESS engineering 94
Confidence Estimators
➢ Let h be the cluster closest to a data instance x in model Mi. The confidence of Mi in classifying x is calculated from the following estimators:

▪ association aix: based on how far x lies inside the boundary of h, i.e., the distance D(i,x) of x from the centroid relative to the cluster radius Rh.
▪ purity pix: the fraction of instances in h that belong to its majority class, e.g., with Ns = 15 instances of which Nm = 14 are in the majority class, pix = 14/15.
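Under these definitions, the two estimators can be sketched as follows; the exact association formula is an assumed simplification (distance of x from the cluster boundary), while the purity value reproduces the slide's Ns = 15, Nm = 14 example:

```python
import math

def association(x, centroid, radius):
    """Assumed association estimator: how far inside the cluster boundary
    x falls (radius minus distance from the centroid); larger = more confident."""
    return radius - math.dist(x, centroid)

def purity(n_majority, n_total):
    """Fraction of the cluster's instances that belong to its majority class."""
    return n_majority / n_total

p = purity(14, 15)                               # the slide's example: 14/15
a = association((1.0, 1.0), (0.0, 0.0), 2.0)     # hypothetical cluster
```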

FEARLESS engineering 95
How good are the estimators?

(Figure: for each model Mi, the ground-truth indicators gi1 … gil are paired with the association values ai1 … ail to compute zia, and with the purity values pi1 … pil to compute zip.)

FEARLESS engineering 96
Example: Confidence Calculation
❖ An example with t = 3 (3 models) and C = 3 (3 classes)
Classification Model (M1 ) Model (M2) Model (M3)

Test instance (x)



The nearest micro-clusters to x in M1, M2, and M3 are a, b, and c, respectively.

Training:
Z1 = (0.52, 0.48)
Z2 = (0.41, 0.59)
Z3 = (0.36, 0.64)

FEARLESS engineering 97
Confidence Value Distribution

(Figure: a Beta distribution fitted to the confidence values, and the change in the confidence-value distribution after a drift.)

➢ The Beta distribution has two parameters, α and β:

– the distribution is symmetric if α = β, and unimodal if α, β > 1.
– it approaches infinity at 0 if α < 1.
– it approaches infinity at 1 if β < 1.

FEARLESS engineering 98
Change Detection

(Figure: the sliding window of confidence scores over chunks C1 … Cn is split at a candidate change point: scores before it are modeled as Beta(α0, β0), scores after it as Beta(α1, β1).)

Xi = the ith confidence score stored in the sliding window
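One way to sketch the comparison of the two Beta distributions in pure Python: fit each side by the method of moments and score the split with a likelihood ratio. The cited papers use a different (sequential) change-point test, so treat this only as an illustration:

```python
import math

def fit_beta(xs):
    """Method-of-moments fit of Beta(alpha, beta) to values in (0, 1)."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

def log_lik(xs, a, b):
    """Log-likelihood of xs under Beta(a, b)."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return sum((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_B
               for x in xs)

def change_score(before, after):
    """Log-likelihood ratio: two separate Beta fits vs. one pooled fit."""
    a0, b0 = fit_beta(before)
    a1, b1 = fit_beta(after)
    ap, bp = fit_beta(before + after)
    separate = log_lik(before, a0, b0) + log_lik(after, a1, b1)
    pooled = log_lik(before + after, ap, bp)
    return separate - pooled

high = [0.9, 0.85, 0.92, 0.88, 0.91, 0.87]   # confident period
low = [0.55, 0.6, 0.5, 0.58, 0.52, 0.62]     # confidence drop after a drift
```

A large score suggests the confidence values before and after the candidate point come from different Beta distributions, i.e., a change.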

FEARLESS engineering 99
Limited Labeling

100
Limited Labeled Learning using Active Learning
Masud et al. [1][2], Fan et al. [3], Zhu et al. [4]

Traditional Learning
Training Data Training Data Training Data

Test Data Test Data Test Data

Chunk 1 Chunk 2 Chunk N

Learning with Limited Labeled Data


Training Data Training Data Training Data

Test Data Test Data Test Data

Chunk 1 Chunk 2 Chunk N

Unlabeled

[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data.
ICDM 2008: 929-934
[2] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza: Facing the reality of data stream classification: coping with scarcity of labeled data.
Knowl. Inf. Syst. 33(1): 213-244 (2011)
[3] Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu: Active Mining of Data Streams. SDM 2004: 457-461
[4] Xingquan Zhu, Peng Zhang, Xiaodong Lin, Yong Shi: Active Learning from Data Streams. ICDM 2007: 757-762

FEARLESS engineering 101


Limited Labeled Learning

➢ Semi-supervised training:
– a label is requested for an instance only if the classifier's confidence in classifying that instance is below a confidence threshold (τ).
– otherwise, the predicted label is used as the final label.
➢ A new model is trained on the training data.
➢ The oldest model in the ensemble is replaced.

(Figure: the new classifier M′ enters the ensemble M, replacing the oldest classifier.)
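The labeling loop can be sketched as below; `predict`, `true_label`, and the threshold `tau` are hypothetical stand-ins for the ensemble, the labeling oracle, and τ:

```python
def limited_label_stream(instances, predict, true_label, tau=0.8):
    """Illustrative sketch: request a true label only when the classifier's
    confidence falls below tau; otherwise trust the predicted label."""
    training, requested = [], 0
    for x in instances:
        label, conf = predict(x)
        if conf < tau:
            label = true_label(x)   # costly oracle call
            requested += 1
        training.append((x, label))
    return training, requested

xs = list(range(100))
# Hypothetical classifier: confident on even inputs, unsure on odd ones.
pred = lambda x: (("even" if x % 2 == 0 else "odd"),
                  0.95 if x % 2 == 0 else 0.5)
oracle = lambda x: "even" if x % 2 == 0 else "odd"
data, n_requested = limited_label_stream(xs, pred, oracle)
```

Here only the low-confidence half of the stream triggers an oracle call, which is the labeling saving the slide describes.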

FEARLESS engineering 102


IoT Distributed 

Stream Mining
Part II
https://sites.google.com/site/iotminingtutorial

Outline
• IoT Fundamentals of Stream Mining
  • IoT Setting
  • Classification
  • Concept Drift
  • Regression
  • Clustering
  • Frequent Itemset Mining
  • Concept Evolution
  • Limited Labeled Learning
• IoT Distributed Stream Mining
  • Distributed Stream Processing Engines
  • Classification
  • Regression
  • Open Source Tools
  • Applications
• Conclusions
104
Distributed Stream
Processing Engines

105
A Tale of two Tribes
M. Stonebraker U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05

(Figure: two scaling directions. Databases scale toward larger data: many apps over a single DB, or data partitioned over many DBs. Stream processing engines instead aim to be faster.)

106
SPE Evolution
1st generation
—2003 Aurora: Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
—2004 STREAM: Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004

2nd generation
—2005 Borealis: Abadi et al., “The Design of the Borealis Stream Processing Engine,” CIDR ’05
—2006 SPC: Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” DMSSP ’06
—2008 SPADE: Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” SIGMOD ’08

3rd generation
—2010 S4: Neumeyer et al., “S4: Distributed Stream Computing Platform,” ICDMW ’10
—2011 Storm: http://storm.apache.org
—2013 Samza: http://samza.incubator.apache.org

107
Actors Model
(Figure: live streams 1–3 are routed as events to processing elements (PEs); the PEs emit outputs 1–2 and feed an external persister.)
108
S4 Example
status.text: "Introducing #S4: a distributed #stream processing system"

– A RawStatus event (key: null, value: text="Int...") reaches TopicExtractorPE (PE1), which extracts hashtags from status.text and emits Topic events (key: topic="stream", value: count=1; key: topic="S4", value: count=1).
– TopicCountAndReportPE (PE2–PE3) keeps counts for each topic across all tweets, and regularly emits a report event if a topic count is above a configured threshold.
– TopicNTopicPE (PE4) receives the report events (key: reportKey="1", value: topic="S4", count=4), keeps counts for the top topics, and outputs the top-N topics to an external persister.

109
Groupings

• Key Grouping (hashing)
• Shuffle Grouping (round-robin)
• All Grouping (broadcast)

(Figure: events flow from the instances (PEI) of one PE to the instances of the next PE according to the grouping.)

110
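The three groupings amount to three routing rules, sketched below; the CRC32 hash and the round-robin counter are illustrative implementation choices, not what any particular engine uses:

```python
import zlib
from itertools import count

def key_grouping(event_key, n_instances):
    """Key grouping: hash the key so every event with the same key
    reaches the same PE instance."""
    return zlib.crc32(event_key.encode()) % n_instances

def shuffle_grouping(counter, n_instances):
    """Shuffle grouping: spread events round-robin over PE instances."""
    return next(counter) % n_instances

def all_grouping(n_instances):
    """All grouping: broadcast the event to every PE instance."""
    return list(range(n_instances))

rr = count()
a = key_grouping("topic=S4", 4)
b = key_grouping("topic=S4", 4)   # same key, same target instance
```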
Real-time computation: streaming computation. How to compute in real time (latency less than 1 second), e.g.:
1 predictions
2 frequent items, such as Twitter hashtags
3 sentiment analysis

• Apache Storm: low latency, built for real-time data processing workloads. Characteristics:
1 fast
2 scalable
3 fault-tolerant
4 reliable
5 easy to operate

• Apache Samza (from LinkedIn). Storm and Samza are fairly similar; both systems provide:
1 a partitioned stream model
2 a distributed execution environment
3 an API for stream processing
4 fault tolerance
5 Kafka integration

• Apache Spark Streaming: higher latency (micro-batches, not strictly real time)

114
Kappa Architecture

• Apache Kafka is a fast, scalable, durable, and


fault-tolerant publish-subscribe messaging system.

115
Apache Spark

• Spark Streaming is an extension of Spark that


allows processing data stream using micro-batches
of data.

116
Apache Spark

• A Discretized Stream (DStream) represents a continuous stream of data:

• either the input data stream received from a source, or

• the processed data stream generated by transforming the input stream.

• Internally, a DStream is represented by a continuous series of RDDs.

117
Apache Spark

• Any operation applied on a DStream translates to operations on the underlying RDDs
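To illustrate the micro-batch model without a Spark installation, here is a pure-Python sketch; the per-batch word-count job is a hypothetical stand-in for DStream operations, which in Spark would be written against the DStream API itself:

```python
def micro_batches(stream, batch_size):
    """Cut an unbounded stream into small batches, as Spark Streaming cuts
    time into intervals and builds one RDD per interval."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def word_count(batch):
    """The per-batch job: every DStream operation runs on each batch's RDD."""
    counts = {}
    for line in batch:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

stream = ["to be or", "not to be", "to stream or", "not to stream"]
results = [word_count(b) for b in micro_batches(stream, batch_size=2)]
```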

118
Apache Flink

• Streaming engine for real-time computation (latency less than 1 second), e.g., for predictions, frequent items such as Twitter hashtags, and sentiment analysis
119
Apache Beam

121
Apache Beam
• Apache Beam code can run in:

• Apache Flink

• Apache Spark

• Google Cloud Dataflow

• Google Cloud Dataflow replaced MapReduce:

• It is based on FlumeJava and MillWheel, a stream engine similar to Storm and Samza

• It writes to and reads from Google Pub/Sub, a service similar to Kafka

122
Apache Beam

123
Classification

124
Hadoop AllReduce
A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”. JMLR (2014)

• MPI AllReduce on MapReduce

• Parallel SGD + L-BFGS

• Aggregate + Redistribute

• Each node computes partial gradient

• Aggregate (sum) complete gradient

• Each node gets updated model

• Hadoop for data locality (map-only job)

125
(Figure: AllReduce over a reduction tree. Initially, each node holds its own value, e.g., 7, 5, 3, 4. Values are passed up the tree and summed until the global sum, 37, is obtained at the root (reduce phase). The global sum is then passed back down to all other nodes (broadcast phase). At the end, each node contains the global sum. Upward = reduce; downward = broadcast (all). Hadoop-compatible AllReduce.)

126
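A toy simulation of the two phases; real implementations work on gradient vectors and overlap communication with computation, so this is only a sketch of the data flow:

```python
def tree_allreduce(values):
    """Simulate AllReduce on a binary reduction tree: partial sums flow up
    to the root (reduce), then the global sum is broadcast back down, so
    every node ends up holding the same total."""
    # Reduce phase: pairwise-sum each level of the tree until one value remains.
    level = list(values)
    while len(level) > 1:
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
    global_sum = level[0]
    # Broadcast phase: every node receives the global sum.
    return [global_sum] * len(values)

# Each worker holds a partial gradient; after AllReduce all workers hold
# the aggregated gradient and can apply the same model update.
partial_gradients = [7, 5, 3, 4, 9, 1, 8]
synced = tree_allreduce(partial_gradients)
```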
Parallel Decision Trees
• Which kind of parallelism?

• Task

• Data

• Horizontal (partition by instances)
• Vertical (partition by attributes)

(Figure: the data matrix, with instances as rows and attributes plus the class as columns; horizontal parallelism splits rows, vertical parallelism splits columns.)
127
Horizontal Partitioning
Y. Ben-Haim, E. Tom-Tov: “A Streaming Parallel Decision Tree Algorithm”. JMLR (2010)

(Figure: the instance stream is split across multiple nodes; each node maintains local statistics as histograms, and model updates are aggregated to compute splits. Each single attribute is tracked in multiple nodes.)

128
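The per-node histogram statistics can be sketched as follows; this is a simplified variant of the Ben-Haim & Tom-Tov merge step, with an assumed bin budget and without the paper's interpolation machinery:

```python
def compress(hist, max_bins=8):
    """hist is a list of (centroid, count) bins. Repeatedly merge the two
    closest adjacent bins until at most max_bins remain."""
    hist = sorted(hist)
    while len(hist) > max_bins:
        i = min(range(len(hist) - 1), key=lambda j: hist[j + 1][0] - hist[j][0])
        (c1, n1), (c2, n2) = hist[i], hist[i + 1]
        hist[i:i + 2] = [((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)]
    return hist

def add_point(hist, x, max_bins=8):
    """Insert a new observation as its own bin, then re-compress."""
    return compress(hist + [(x, 1)], max_bins)

def merge_hists(h1, h2, max_bins=8):
    """Workers' per-attribute histograms merge by concatenation plus
    compression, which makes horizontal partitioning cheap to aggregate."""
    return compress(h1 + h2, max_bins)

h1, h2 = [], []
for x in [1.0, 1.1, 5.0, 5.2, 9.0]:
    h1 = add_point(h1, x, max_bins=3)
for x in [1.05, 5.1, 8.8, 9.2]:
    h2 = add_point(h2, x, max_bins=3)
merged = merge_hists(h1, h2, max_bins=3)
```

Merging never loses counts, only resolution, which is why the aggregator can still compute approximate splits from the merged histograms.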
Hoeffding Tree Profiling
(Figure: training-time breakdown for 100 nominal + 100 numeric attributes: Learn 70%, Split 24%, Other 6%.)

129
Vertical Partitioning
N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo: “VHT: Vertical Hoeffding Tree”, 2016 https://arxiv.org/abs/1607.08325

(Figure: the attributes of the instance stream are split across nodes; each node keeps the statistics for its own subset of attributes and computes the corresponding splits. Each single attribute is tracked in a single node.)

130
Vertical Hoeffding Tree

(Topology: Source (n) sends instances via shuffle grouping to Model (n); Model sends control messages via key grouping to Stats (n); Stats send split messages via all grouping back to Model; results flow to a single Evaluator.)

131
Advantages of 

Vertical Parallelism
• High number of attributes => high level of parallelism

(e.g., documents)

• vs. task parallelism

• Parallelism observed immediately

• vs. horizontal parallelism

• Reduced memory usage (no model replication)

• Parallelized split computation


132
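A sketch of why vertical parallelism parallelizes split computation: each worker owns a disjoint subset of attributes and proposes one local candidate, so the model aggregator compares only one candidate per worker. The per-attribute gains and the modulo assignment below are hypothetical:

```python
def partition_attributes(n_attributes, n_workers):
    """Vertical partitioning sketch: each attribute's statistics live on
    exactly one worker (assigned here by simple modulo hashing)."""
    return {a: a % n_workers for a in range(n_attributes)}

def local_best_split(stats):
    """Each worker scores only its own attributes and proposes its best
    candidate; stats maps attribute -> split gain (hypothetical values)."""
    return max(stats.items(), key=lambda kv: kv[1])

# Hypothetical per-attribute split gains, spread over 3 workers.
gains = {0: 0.10, 1: 0.42, 2: 0.05, 3: 0.31, 4: 0.27, 5: 0.12}
assignment = partition_attributes(6, 3)
workers = {w: {} for w in range(3)}
for a, w in assignment.items():
    workers[w][a] = gains[a]
# The model aggregator only compares one candidate per worker.
candidates = [local_best_split(s) for s in workers.values()]
best_attr, best_gain = max(candidates, key=lambda kv: kv[1])
```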
Regression

133
VAMR
A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ‘14

• Vertical AMRules

• Model: rule body + head
– the target mean is updated continuously with the covered instances, and used for predictions
– a default rule creates new rules

• Learner: statistics
– vertical: each Learner tracks the statistics of an independent subset of rules
– one rule is tracked by only one Learner

• Model -> Learner: key grouping on rule ID

(Topology: instances reach the Model Aggregator, which routes them to Learner1 … Learnerp; Learners send rule updates and new rules back to the Model Aggregator, which emits predictions.)

134
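A minimal rule-based regression sketch in the spirit of AMRules (conditions, covering, incremental target means); it omits the rule-expansion statistics and is only illustrative:

```python
class Rule:
    """Illustrative rule: a conjunction of attribute conditions plus a
    running mean of the targets it has covered."""
    def __init__(self, conditions):
        self.conditions = conditions      # list of (attr_index, op, value)
        self.count, self.mean = 0, 0.0

    def covers(self, x):
        ops = {"<=": lambda a, b: a <= b, ">": lambda a, b: a > b}
        return all(ops[op](x[i], v) for i, op, v in self.conditions)

    def update(self, y):
        # Incremental mean of the covered targets.
        self.count += 1
        self.mean += (y - self.mean) / self.count

def predict(rules, default, x):
    """Prediction: mean of the first covering rule, else the default rule."""
    for r in rules:
        if r.covers(x):
            return r.mean
    return default.mean

default = Rule([])                 # the default rule covers everything
r1 = Rule([(0, "<=", 5.0)])
for x, y in [((3.0,), 10.0), ((4.0,), 12.0), ((8.0,), 100.0)]:
    (r1 if r1.covers(x) else default).update(y)
p_low = predict([r1], default, (2.0,))
p_high = predict([r1], default, (9.0,))
```

Because rules are independent, routing each rule's updates to one Learner by rule ID (key grouping) needs no coordination between Learners.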
HAMR
A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ‘14

• VAMR's single Model Aggregator is a bottleneck

• Hybrid AMRules (vertical + horizontal)
– shuffle instances among multiple Model Aggregators for parallelism

• Problem: a distributed default rule decreases performance
– solution: a separate, dedicated Learner for the default rule

(Topology: instances are shuffled across Model Aggregator1 … Model Aggregatorr; each aggregator exchanges rule updates and new rules with the Learners; a dedicated default-rule Learner creates the new rules; the aggregators emit predictions.)

135
Open Source Tools

136
http://moa.cms.waikato.ac.nz/

MOA
• {M}assive {O}nline {A}nalysis is a framework for online learning from data streams.

• It is closely related to WEKA.

• It includes a collection of offline and online algorithms, as well as tools for evaluation:

• classification, regression, clustering

• outlier detection, frequent pattern mining

• Easy to extend, and to design and run experiments

137
http://huawei-noah.github.io/streamDM-Cpp/

streamDM C++

138
Vision

Streaming Distributed

IoT Big Data Stream Mining


139
http://samoa-project.net

SAMOA
G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)

(Figure: taxonomy of data mining tools. Non-distributed batch: R, WEKA, …; non-distributed stream: MOA. Distributed batch: Hadoop, Mahout; distributed stream: Storm, S4, Samza, with SAMOA on top.)

140
http://huawei-noah.github.io/streamDM

StreamDM

141
Applications

142
Application: Encrypted Traffic Fingerprinting

• Traffic Fingerprinting (TFP) is a Traffic Analysis


(TA) attack that threatens web/app navigation
privacy.
• TFP allows attackers to learn information about a
website/app accessed by the user, by recognizing
patterns in traffic.
• Examples:
– Website Fingerprinting
– App Fingerprinting

http://www.clickz.com/tag/fingerprinting

FEARLESS engineering 143


Website Fingerprinting

FEARLESS engineering 144


App Fingerprinting

FEARLESS engineering 145


Application: Duplicate Detection over Textual News Stream

(Figure: news reports arriving on Day 1, Day 2, and Day 3; later reports are duplicates of earlier ones.)

FEARLESS engineering 146


Duplicate Detection over Textual News Stream

FEARLESS engineering 147


Future Direction

• Multi-stream classification
– Some streams are labeled, some are not.
– Distributions are related but not the same (e.g., covariate shift).
– Requires bias correction, domain adaptation, etc.
• Adversarial active learning
– Traditional algorithms are vulnerable to adversarial
manipulation.
– Instances should be selected carefully.
• Efficient online change detection

FEARLESS engineering 148


Conclusions

149
Summary

• IoT streaming is useful for finding approximate solutions within a reasonable amount of time and with limited resources

• Algorithms for classification, regression, clustering,


frequent itemset mining

• Distributed systems for very large streams

150
Open Challenges
• Structured output

• Multi-target learning

• Millions of classes

• Representation learning

• Ease of use

151
References

152
• IDC’s Digital Universe Study. EMC (2011)

• P. Domingos, G. Hulten: “Mining high-speed data streams”. KDD ’00

• J Gama, P. Medas, G. Castillo, P. Rodrigues: “Learning with drift detection”. SBIA’04

• G. Hulten, L. Spencer, P. Domingos: “Mining Time-Changing Data Streams”. KDD ‘01

• J. Gama, R. Fernandes, R. Rocha: “Decision trees for mining data streams”. IDA (2006)

• A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams”. IDA (2009)

• A. Bifet, R. Gavaldà: “Learning from Time-Changing Data with Adaptive Windowing”. SDM ’07

• E. Almeida, C. Ferreira, J. Gama. "Adaptive Model Rules from Data Streams”. ECML-PKDD ‘13

• H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer: “An effective evaluation
measure for clustering on evolving data streams”. KDD ’11

• T. Zhang, R. Ramakrishnan, M. Livny: “BIRCH: An Efficient Data Clustering Method for Very Large
Databases”. SIGMOD ’96

• C. C. Aggarwal, J. Han, J. Wang, P. S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ‘03

• M. Ester, H. Kriegel, J. Sander, X. Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise”. KDD ‘96

153
• F. Cao, M. Ester, W. Qian, A. Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”.
SDM ‘06

• G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB '02

• Y. Chi , H. Wang, P. Yu , R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding
Window”. ICDM ’04

• C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time
granularities”. NGDM (2003)

• M. Stonebraker U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05

• A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”.
JMLR (2014)

• Y. Ben-Haim, E. Tom-Tov: “A Streaming Parallel Decision Tree Algorithm”. JMLR (2010)

• A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data
Streams”. BigData ’14

• G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)

• J. Gama: “Knowledge Discovery from Data Streams”. Chapman and Hall (2010)

• J. Gama: “Data Stream Mining: the Bounded Rationality”. Informatica 37(1): 21-25 (2013)

154
Contacts
• https://sites.google.com/site/iotminingtutorial

• Gianmarco De Francisci Morales



gdfm@acm.org @gdfm7

• Albert Bifet

abifet@telecom-paristech.fr @abifet

• Latifur Khan

lkhan@utdallas.edu

• João Gama

jgama@fep.up.pt @JoaoMPGama

• Wei Fan

fanwei03@baidu.com @fanwei

155
