Stream Mining
Tutorial
KDD 2016
http://albertbifet.com
2
Organizers (2/3)
Latifur Khan is a full Professor (tenured) in the Computer Science department at the University of Texas at Dallas, where he has been teaching and conducting research since September 2000.
He has received prestigious awards including the IEEE
Technical Achievement Award for Intelligence and Security
Informatics. Dr. Khan is an ACM Distinguished Scientist and a
Senior Member of IEEE.
https://www.utdallas.edu/~lkhan/
João Gama is an Associate Professor at the University of Porto and a senior researcher at LIAAD, INESC TEC. He received his Ph.D. degree in Computer Science from the University of Porto, Portugal. His main interests are machine learning and data mining, mainly in the context of time-evolving data streams. He authored the recent book "Knowledge Discovery from Data Streams".
http://www.liaad.up.pt/~jgama
3
Organizers (3/3)
Wei Fan is the Deputy Head at Baidu Research Big Data Lab. His co-authored paper received the ICDM '06 Best Application Paper Award, and he led the team that used his Random Decision Tree method to win the 2008 ICDM Data Mining Cup Championship. He received the 2010 IBM Outstanding Technical Achievement Award for his contribution to IBM InfoSphere Streams. At Huawei, he led his colleagues in developing Huawei StreamSMART, a streaming platform for online and real-time processing.
http://www.weifan.info
4
https://sites.google.com/site/iotminingtutorial
Outline
• IoT Fundamentals of Stream Mining
  • IoT Setting
  • Classification
  • Concept Drift
  • Regression
  • Clustering
  • Frequent Itemset Mining
  • Concept Evolution
• IoT Distributed Stream Mining
  • Distributed Stream Processing Engines
  • Classification
  • Regression
  • Open Source Tools
  • Applications
5
IoT Fundamentals of
Stream Mining
Part I
IoT Setting
7
INTERNET OF THINGS
• Interoperability: IoT
• Information transparency: virtual copy of the physical world
• Technical assistance: support human decisions
• Decentralized decisions: make decisions on their own
Standard Approach (cycle: Gather → Clean → Model → Deploy)
• Finite training sets
• Static models
13
Importance of Online Learning
• As spam trends change, retrain the model with new data
Pain Points
• Need to retrain!
• How often?
14
Value of Data
15
IoT Stream Mining
• Dynamic models
16
IoT Big Data Streams
• Volume + Velocity (+ Variety)
• Distributed
• Scalable
17
Approximation Algorithms
18
Approximation Algorithms
• What is the largest number that we can store in 8 bits?
20
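An exact 8-bit counter tops out at 255 (2⁸ − 1). A classic answer to the slide's question, though not named here, is Morris's probabilistic counter: store only the exponent of the count, so 8 bits can represent counts up to roughly 2^255 at the price of approximation. A minimal sketch (names are illustrative):

```python
import random

class MorrisCounter:
    """Approximate counter that stores only ~log2(count)."""
    def __init__(self, seed=None):
        self.c = 0                      # the exponent is the only state
        self.rng = random.Random(seed)

    def increment(self):
        # Increment the exponent with probability 2^-c
        if self.rng.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        # Unbiased estimate of the true count
        return 2 ** self.c - 1

counter = MorrisCounter(seed=42)
for _ in range(100_000):
    counter.increment()
print(counter.c, counter.estimate())
```

After 100,000 increments the exponent sits near log₂(100,000) ≈ 17, comfortably inside 8 bits, while the estimate is correct in expectation (with high variance; real uses average several counters).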
Predictive Learning
• Classification
• Regression
• Concept Drift
25
Classification
26
Definition
Given a set of training examples belonging to nC different classes, a classifier algorithm builds a model that predicts for every unlabeled instance x the class C to which it belongs.
Examples
• Email spam filter
• Twitter sentiment analyzer
Process
• One example at a time, using a limited amount of memory and time
• Predict at any time
28
Naïve Bayes
• Based on Bayes' theorem: P(C|x) = P(x|C) P(C) / P(x)

Perceptron
• Data stream: ⟨xᵢ, yᵢ⟩
• ỹᵢ = h_w(xᵢ) = σ(wᵀxᵢ), with σ(x) = 1/(1+e⁻ˣ) and σ′ = σ(x)(1-σ(x))
• Minimize MSE: J(w) = ½ ∑(yᵢ - ỹᵢ)²
• SGD: wᵢ₊₁ = wᵢ - η ∇J xᵢ
(Figure: attributes 1…5 weighted by w1…w5 produce the output h_w(xᵢ))
30
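Naïve Bayes is naturally incremental: the counts n_ijk (attribute i, value j, class k) are all the state the model needs, so learning is one counter update per example. A minimal sketch for categorical attributes (method names are illustrative, not MOA's API):

```python
import math
from collections import defaultdict

class StreamingNaiveBayes:
    def __init__(self):
        self.class_counts = defaultdict(int)   # n_k
        self.attr_counts = defaultdict(int)    # n_ijk keyed by (i, value, class)
        self.n = 0

    def learn_one(self, x, y):
        # x is a list of categorical attribute values; one pass, O(#attributes)
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, v, y)] += 1

    def predict_one(self, x):
        best, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            # log P(C) + sum_i log P(x_i | C), with Laplace smoothing
            score = math.log(nc / self.n)
            for i, v in enumerate(x):
                score += math.log((self.attr_counts[(i, v, c)] + 1) / (nc + 2))
            if score > best_score:
                best, best_score = c, score
        return best

nb = StreamingNaiveBayes()
stream = [(["sunny", "hot"], "no"), (["rainy", "cool"], "yes"),
          (["sunny", "mild"], "no"), (["rainy", "mild"], "yes")]
for x, y in stream:
    nb.learn_one(x, y)
print(nb.predict_one(["rainy", "cool"]))  # "yes"
```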
Perceptron Learning

PerceptronLearning(Stream, η)
1  for each class
2    do PerceptronLearning(Stream, class, η)

PerceptronPrediction(x)
1  return arg max_class h_w_class(x)
31
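The sigmoid-perceptron update from the previous slide can be sketched in a few lines; this is a toy sketch of the SGD rule w ← w + η(y - ŷ)ŷ(1 - ŷ)x (the MSE gradient through the sigmoid), with the OR function as an assumed example stream:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class OnlinePerceptron:
    """Sigmoid perceptron trained by SGD on the MSE, one example at a time."""
    def __init__(self, dim, eta=0.5, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        self.eta = eta

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def learn_one(self, x, y):
        y_hat = self.predict(x)
        # Gradient of 1/2 (y - y_hat)^2 through the sigmoid: sigma' = y_hat (1 - y_hat)
        g = (y - y_hat) * y_hat * (1 - y_hat)
        self.w = [wi + self.eta * g * xi for wi, xi in zip(self.w, x)]

# Stream: learn the OR function (first input is a constant 1 acting as bias)
p = OnlinePerceptron(dim=3)
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
for _ in range(5000):
    for x, y in data:
        p.learn_one(x, y)
print([round(p.predict(x)) for x, _ in data])  # converges to [0, 1, 1, 1]
```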
Decision Tree
• Each node tests a feature
• Each branch represents a value
(Example: a "Car deal?" tree with a "Road Tested?" test)
32
Very Fast Decision Tree
Pedro Domingos, Geoff Hulten: “Mining high-speed data streams”. KDD ’00
33
Leaf Expansion
• When should we expand a leaf?
• Is x1 a stable option?
• Hoeffding bound:
  Split if G(x1) - G(x2) > ε = √( R² ln(1/δ) / (2n) )
34
Hoeffding Tree or VFDT

HT(Stream, δ)
1  ▷ Let HT be a tree with a single leaf (root)
2  ▷ Init counts nijk at root
3  for each example (x, y) in Stream
4    do HTGrow((x, y), HT, δ)

HTGrow((x, y), HT, δ)
1  ▷ Sort (x, y) to leaf l using HT
2  ▷ Update counts nijk at leaf l
3  if examples seen so far at l are not all of the same class
4    then ▷ Compute G for each attribute
5      if G(Best Attr.) - G(2nd best) > √( R² ln(1/δ) / (2n) )
6        then ▷ Split leaf on best attribute
7          for each branch
8            do ▷ Start new leaf and initialize counts
35
Properties
• Number of examples to expand node depends only on
Hoeffding bound (ε decreases with √n)
36
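The bound is cheap to evaluate. A sketch of the split decision, with δ and the gain function G kept abstract (only the comparison from the pseudocode is implemented):

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon = sqrt(R^2 ln(1/delta) / (2n))"""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n):
    # Split when the observed gain gap exceeds the Hoeffding bound
    return g_best - g_second > hoeffding_bound(R, delta, n)

# For information gain with 2 classes, R = log2(2) = 1
eps = hoeffding_bound(R=1.0, delta=1e-7, n=1000)
print(round(eps, 4))                                   # 0.0898
print(should_split(0.3, 0.1, R=1.0, delta=1e-7, n=1000))  # True
```

With only n = 10 examples the same gap of 0.2 is below ε ≈ 0.9, so the leaf waits for more data, which is exactly the property stated above.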
Regression
37
Definition
Given a set of training examples with a numeric label, a regression algorithm builds a model y = ƒ(x) that predicts for every unlabeled instance x the value with high accuracy.
Examples
• Stock price
• Airplane delay

Perceptron regression:
• Data stream: ⟨xᵢ, yᵢ⟩
• ỹᵢ = h_w(xᵢ) = wᵀxᵢ
• Minimize MSE: J(w) = ½ ∑(yᵢ - ỹᵢ)²
• ∇J = -(yᵢ - ỹᵢ) xᵢ
• SGD: w′ = w - η ∇J = w + η (yᵢ - ỹᵢ) xᵢ
(Figure: attributes weighted by w2…w5 produce the output h_w(xᵢ))
39
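The update rule above in code, applied to a hypothetical noiseless stream drawn from y = 3x + 1 (the target function is an assumption for illustration):

```python
import random

def sgd_regression(stream, dim, eta=0.01):
    """Online linear regression: w' = w + eta * (y - y_hat) * x."""
    w = [0.0] * dim
    for x, y in stream:
        y_hat = sum(wi * xi for wi, xi in zip(w, x))
        err = y - y_hat
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

# Stream from y = 3x + 1; the first input is a constant 1 acting as bias
rng = random.Random(7)
stream = []
for _ in range(5000):
    x = rng.uniform(-1, 1)
    stream.append(([1.0, x], 3.0 * x + 1.0))
w = sgd_regression(stream, dim=2)
print([round(wi, 2) for wi in w])  # close to [1.0, 3.0]
```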
Regression Tree
40
AMRules
Rules
• Problem: very large decision trees have context that is complex and hard to understand
• Rules make predictions
41
AMRules
Rule sets
• Ruleset prediction
42
Ensembles of Adaptive Model Rules from High-Speed Data Streams
43
Concept Drift
44
Definition
Given an input sequence ⟨x1, x2, …, xt⟩, output at instant t an alarm signal if there is a distribution change, and a prediction x̂t+1 minimizing the error |x̂t+1 - xt+1|.
Outputs
• Alarm indicating change
• Estimate of parameter
45
Application
• Change detection on the evaluation error of the model
• CUSUM test: g0 = 0, gt = max(0, gt-1 + εt - v)
47
Page-Hinckley Test
• g0 = 0, gt = gt-1 + (εt - v)
• Gt = min_t(gt)
48
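A compact sketch of the test above. The alarm threshold h and the running-mean estimate of εt are assumptions not spelled out on the slide; the statistics gt and Gt follow the formulas given:

```python
class PageHinkley:
    """Page-Hinkley test: alarm when g_t - min_t(g_t) exceeds a threshold h."""
    def __init__(self, v=0.005, h=0.5):
        self.v, self.h = v, h
        self.mean, self.n = 0.0, 0
        self.g, self.g_min = 0.0, 0.0

    def update(self, x):
        # Track the running mean; the deviation from it plays the role of eps_t
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.g += x - self.mean - self.v
        self.g_min = min(self.g_min, self.g)
        return self.g - self.g_min > self.h   # True signals a change

ph = PageHinkley(v=0.01, h=0.5)
alarms = []
stream = [0.1] * 200 + [0.9] * 50   # the mean jumps at t = 200
for t, x in enumerate(stream):
    if ph.update(x):
        alarms.append(t)
        break
print(alarms)  # alarm fires right at the shift
```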
Statistical Process Control
J Gama, P. Medas, G. Castillo, P. Rodrigues: “Learning with Drift Detection”. SBIA '04
Concept Drift
• Monitor the error in a sliding window
• If error > warning level, learn a new model in parallel on the current window
(Figure: error rate vs. number of examples processed, with the warning level and the pmin + smin threshold marked)
49
Concept-adapting VFDT
G. Hulten, L. Spencer, P. Domingos: “Mining Time-Changing Data Streams”. KDD ‘01
50
VFDTc: Adapting to Change
J. Gama, R. Fernandes, R. Rocha: “Decision Trees for Mining Data Streams”. IDA (2006)
• Swap subtree
51
Hoeffding Adaptive Tree
A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams” IDA (2009)
53
Definition
Given a set of unlabeled
instances, distribute them
into homogeneous groups
according to some common
relations or affinities Examples
• Market segmentation
• Social network communities
55
Micro-Clusters
Tian Zhang, Raghu Ramakrishnan, Miron Livny: “BIRCH: An Efficient Data Clustering Method for Very Large Databases”. SIGMOD ’96
• Data stream ⟨xᵢ⟩, d dimensions
56
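BIRCH summarizes points with a clustering feature CF = (N, LS, SS): the count, the per-dimension linear sum, and the per-dimension squared sum. CFs are additive, which is what makes merging micro-clusters cheap. A minimal sketch:

```python
class ClusteringFeature:
    """CF = (N, LS, SS): count, linear sum, squared sum per dimension."""
    def __init__(self, d):
        self.n = 0
        self.ls = [0.0] * d
        self.ss = [0.0] * d

    def insert(self, x):
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def centroid(self):
        return [s / self.n for s in self.ls]

    def merge(self, other):
        # CF vectors are additive, so merging two clusters is O(d)
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

cf = ClusteringFeature(d=2)
for x in [[1.0, 2.0], [3.0, 4.0]]:
    cf.insert(x)
print(cf.centroid())            # [2.0, 3.0]

cf2 = ClusteringFeature(d=2)
cf2.insert([5.0, 6.0])
cf.merge(cf2)
print(cf.centroid())            # [3.0, 4.0]
```

The SS component (not used above) gives the cluster radius, which DenStream's r < ε test relies on.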
CluStream
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ‘03
57
DBSCAN
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: “A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise”. KDD ‘96
• pn is density-reachable from p1
58
DenStream
Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ‘06
• Based on DBSCAN
• Core-micro-cluster: CMC(w, c, r) with weight w > μ, center c, radius r < ε
• Potential/outlier micro-clusters
• Promote outlier to potential if w > βμ
• Noise
61
Frequent Itemset
Mining
62
Definition
Given a collection of sets of items, find all the subsets that occur frequently, i.e., more than a minimum support number of times.
Examples
• Market basket mining
• Item recommendation
63
Fundamentals
• Dataset D, set of items t ∈ D,
constant s (minimum support)
• Itemset t is frequent if
support(t) ≥ s
64
Itemset Mining Example

Dataset:
Document  Items
d1        abce
d2        cde
d3        abce
d4        acde
d5        abcde
d6        bcd

Frequent patterns (minimal support = 3):
Support  Documents           Frequent patterns
6        d1,d2,d3,d4,d5,d6   c
5        d1,d2,d3,d4,d5      e, ce
4        d1,d3,d4,d5         a, ac, ae, ace
4        d1,d3,d5,d6         b, bc
4        d2,d4,d5,d6         d, cd
3        d1,d3,d5            ab, abc, abe, be, bce, abce
3        d2,d4,d5            de, cde
65
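At this scale the supports in the table can be checked by brute-force enumeration over the example documents (real miners use Apriori-style pruning instead of enumerating every subset):

```python
from itertools import combinations
from collections import Counter

docs = {"d1": "abce", "d2": "cde", "d3": "abce",
        "d4": "acde", "d5": "abcde", "d6": "bcd"}

support = Counter()
for items in docs.values():
    # Count every non-empty subset of the document's items
    for k in range(1, len(items) + 1):
        for subset in combinations(sorted(items), k):
            support["".join(subset)] += 1

min_support = 3
frequent = {p: s for p, s in support.items() if s >= min_support}
print(frequent["c"], frequent["ce"], frequent["abce"])  # 6 5 3
```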
Variations
• A priori property: t ⊆ t' ➝ support(t) ≥ support(t’)
67
Itemset Streams
• Exact vs approximate
68
Pattern mining in streams
Key data structure: lattice of patterns, with counts
(Figure: a lattice node {A,B,C,D} with count 2, with the lattice divided into regions count ≤ 7 and count > 7)
69
Pattern mining in streams
The vast majority of stream pattern mining algorithms
(implicitly or explicitly) build and update the pattern
lattice.
General scheme:
70
Lossy Counting
G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB '02
• if x ∈ D, freq(x)++
71
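A compact sketch of the Manku-Motwani algorithm cited above, for single items (helper names are illustrative): each entry carries a count and a maximum-error term Δ, and entries that cannot reach the current bucket id are pruned at bucket boundaries.

```python
import math

class LossyCounting:
    """Lossy counting: frequency undercounted by at most eps * N."""
    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1 / eps)      # bucket width
        self.d = {}                      # item -> (freq, delta)
        self.n = 0

    def add(self, x):
        self.n += 1
        b = math.ceil(self.n / self.w)   # current bucket id
        if x in self.d:
            f, delta = self.d[x]
            self.d[x] = (f + 1, delta)
        else:
            self.d[x] = (1, b - 1)
        if self.n % self.w == 0:
            # Prune items whose possible true count can't reach the bucket id
            self.d = {k: (f, dl) for k, (f, dl) in self.d.items() if f + dl > b}

    def frequent(self, s):
        # Items with estimated frequency at least (s - eps) * N
        return {k for k, (f, _) in self.d.items() if f >= (s - self.eps) * self.n}

lc = LossyCounting(eps=0.01)
stream = ["a"] * 60 + ["b"] * 30 + list("cdefghij")
for x in stream:
    lc.add(x)
print(sorted(lc.frequent(s=0.2)))  # ['a', 'b']
```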
Moment
Y. Chi , H. Wang, P. Yu , R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window”. ICDM ‘04
72
FP-Stream
C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time granularities”. NGDM (2003)
73
Itemset mining
74
Sequence, tree, graph
mining
MILE (Chen+ 05), SMDS (Marascu-Masseglia 06),
SSBE (Koper-Nguyen 11): Frequent subsequence
(aka sequential pattern) mining
75
Concept Evolution
76
Challenge: Concept Evolution
(Figure: a feature space partitioned by x1, y1, y2 into regions A, B, C of + and - instances; a novel class D of X instances later appears inside the existing regions)
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances.
FEARLESS engineering 77
Existing Techniques: Ensemble based Approaches
Masud et al. [1][2]
(Figure: input x,? is classified by an ensemble of models M1, M2, M3; individual votes such as +, +, - are combined into the final prediction +)
[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data.
ICDM 2008: 929-934
[2] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza: Facing the reality of data stream classification: coping with scarcity of labeled data.
Knowl. Inf. Syst. 33(1): 213-244 (2011)
79
Novel Class Detection
Masud et al. [1][2], Khateeb et al. [3]
➢ Non-parametric
– does not assume any underlying model of existing classes
➢ Steps:
1. Creating and saving decision boundary during training
2. Detecting and filtering outliers
3. Measuring cohesion and separation among test and training
instances
[1] Mohammad M. Masud, Qing Chen, Latifur Khan, Charu C. Aggarwal, Jing Gao, Jiawei Han, Ashok N. Srivastava, Nikunj C. Oza: Classification and Adaptive
Novel Class Detection of Feature-Evolving Data Streams. IEEE Trans. Knowl. Data Eng. 25(7): 1484-1497 (2013)
[2] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: Classification and Novel Class Detection in Concept-Drifting Data
Streams under Time Constraints. IEEE Trans. Knowl. Data Eng. 23(6): 859-874 (2011)
[3] Tahseen Al-Khateeb, Mohammad M. Masud, Latifur Khan, Charu C. Aggarwal, Jiawei Han, Bhavani M. Thuraisingham: Stream Classification with Recurring
and Novel Class Detection Using Class-Based Ensemble. ICDM 2012: 31-40
80
Training with Semi-Supervised Clustering
Legend:
Black dots: unlabeled instances
Colored dots: labeled instances
81
Semi-Supervised Clustering
Masud et al. [1][2]
Entropy (Ent):
82
Outlier Detection and Filtering
• Test instance inside the decision boundary: not an outlier
• Test instance outside the decision boundary: raw outlier, or Routlier
• An ensemble of L models M1, M2, …, Mt each votes Routlier or not
• If x is a Routlier for ALL models (AND = true): x is a filtered outlier (Foutlier), a potential novel class instance
• Otherwise: x is an existing class instance
83
Novel Class Detection
• Step 1: each model M1, M2, …, Mt in the ensemble of L models checks whether the test instance x is a Routlier
• Step 2: if x is not a Routlier for all models (AND = false), x is an existing class instance
• Step 3: otherwise, compute q-NSC over the Foutliers
• Step 4: if q-NSC > 0 for q′ > q Foutliers with all models, declare a novel class; otherwise treat x as an existing class instance
84
Computing Cohesion & Separation
85
Detection of Concurrent Novel Classes
Masud et al. [1], Faria et al. [2]
• Challenges
– High false positive (FP) (existing classes detected as novel) and false negative (FN) (missed
novel classes) rates
– Two or more novel classes arrive at the same time
• Solutions
– Dynamic decision boundary – based on previous mistakes
• Inflate the decision boundary if high FP, deflate if high FN
– Build statistical model to filter out noise data and concept drift from the outliers.
– Multiple novel classes are detected by
• Constructing a graph where outlier cluster is a vertex
• Merging the vertices based on silhouette coefficient
• Counting the number of connected components in the resultant (i.e., merged) graph
[1] Mohammad M. Masud, Qing Chen, Latifur Khan, Charu C. Aggarwal, Jing Gao, Jiawei Han, Bhavani M. Thuraisingham: Addressing Concept-Evolution in Concept-Drifting
Data Streams. ICDM 2010: 929-934
[2] Elaine R. Faria, João Gama, André C. P. L. F. Carvalho: Novelty detection algorithm for data streams multi-class problems. SAC 2013: 795-800
86
Novel and Recurrence
Khateeb et al. [1]
(Figure: classes in the stream appear as novel and later recur)
87
Challenges: Fixed Chunk Size/ Decay Rate
Masud et al. [1], Parker et al. [2], Aggarwal et al. [3], Klinkenberg[4], Cohen et al. [5]
(Figure: the stream divided into fixed-size chunks, Chunk 1 and Chunk 2)
[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: Classification and Novel Class Detection in Concept-Drifting Data Streams under
Time Constraints. IEEE Trans. Knowl. Data Eng. 23(6): 859-874 (2011)
[2] Brandon Shane Parker, Latifur Khan: Detecting and Tracking Concept Class Drift and Emergence in Non-Stationary Fast Data Streams. AAAI 2015: 2908-2913
[3] Charu C. Aggarwal, Philip S. Yu: On Classification of High-Cardinality Data Streams. SDM 2010: 802-813
[4] Ralf Klinkenberg: Learning drifting concepts: Example selection vs. example weighting. Intell. Data Anal. 8(3): 281-300 (2004)
[5] Edith Cohen, Martin J. Strauss: Maintaining time-decaying stream aggregates. J. Algorithms 59(1): 19-36 (2006)
88
Challenges: Fixed Chunk Size
(Figure: concept drifts over time; with a fixed chunk size, some detections are correct and others wrong)
89
Solution: Adaptive Chunk Size
(Figure: with adaptive chunk sizes, detections align with the concept drifts)
90
Adaptive Chunk - Sliding Window
Gama et al. [1], Bifet et al. [2], Harel et al. [3]
[1] João Gama, Gladys Castillo: Learning with Local Drift Detection. ADMA 2006: 42-55
[2] Albert Bifet, Ricard Gavaldà: Learning from Time-Changing Data with Adaptive Windowing. SDM 2007: 443-448
[3] Maayan Harel, Shie Mannor, Ran El-Yaniv, Koby Crammer: Concept Drift Detection Through Resampling. ICML 2014: 1009-1017
91
Adaptive Chunk - Unsupervised
Haque et al. [1][2]
• Input is classified by the ensemble, producing a predicted class and a classifier confidence
• If a change in confidence is detected (Yes): update the classifier and shrink the window
• If not (No): grow the window
[1] Ahsanul Haque, Latifur Khan, Michael Baron, Bhavani M. Thuraisingham, Charu C. Aggarwal: Efficient handling of concept drift and concept evolution over Stream Data. ICDE 2016: 481-492.
[2] Ahsanul Haque, Latifur Khan, Michael Baron: SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream. AAAI 2016: 1652-1658.
92
Adaptive Chunk - Unsupervised
Haque et al. [1][2]
• Input is classified by the ensemble, producing a predicted class
• Each model 1, 2, …, t contributes an association and a confidence, combined into the classifier confidence
• If a change in confidence is detected (Yes): update the classifier and shrink the window
• If not (No): grow the window
[1] Ahsanul Haque, Latifur Khan, Michael Baron, Bhavani M. Thuraisingham, Charu C. Aggarwal: Efficient handling of concept drift and concept evolution over Stream Data. ICDE 2016: 481-492
[2] Ahsanul Haque, Latifur Khan, Michael Baron: SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream. AAAI 2016: 1652-1658.
93
Confidence of a model
94
Confidence Estimators
➢ Let h be the closest cluster from data instance x in model Mi; the confidence of Mi in classifying instance x is calculated based on the following estimators:
▪ aᵢˣ: based on the distance of x from cluster h, with radius Rh
▪ pᵢˣ = Nm / Ns, e.g., Ns = 15 and Nm = 14 give pᵢˣ = 14/15
95
How good are the estimators?
(Figure: estimator values aᵢₗ and pᵢₗ compared against the ground truth gᵢₗ for each model i)
96
Example: Confidence Calculation
❖ An example with t = 3 (3 models) and C = 3 (3 classes)
(Table: classification outputs of models M1, M2, M3 during training)
Z1 = (0.52, 0.48)
Z2 = (0.41, 0.59)
Z3 = (0.36, 0.64)
97
Confidence Value Distribution
98
Change Detection
99
Limited Labelling
100
Limited Labeled Learning using Active Learning
Masud et al. [1][2], Fan et al. [3], Zhu et al. [4]
Traditional Learning
(Figure: successive chunks of training data, most of it unlabeled)
[1] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data.
ICDM 2008: 929-934
[2] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza: Facing the reality of data stream classification: coping with scarcity of labeled data.
Knowl. Inf. Syst. 33(1): 213-244 (2011)
[3] Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu: Active Mining of Data Streams. SDM 2004: 457-461
[4] Xingquan Zhu, Peng Zhang, Xiaodong Lin, Yong Shi: Active Learning from Data Streams. ICDM 2007: 757-762
➢ Semi-supervised training:
– Label is requested for an instance only if the classifier's confidence in classifying that instance is below a confidence threshold (τ).
– Otherwise, the predicted label is used as the final label.
➢ A new model is trained on the training data.
➢ The oldest model in the ensemble is replaced.
Outline
• IoT Fundamentals of Stream Mining
  • IoT Setting
  • Classification
  • Concept Drift
  • Regression
  • Clustering
  • Frequent Itemset Mining
  • Concept Evolution
• IoT Distributed Stream Mining
  • Distributed Stream Processing Engines
  • Classification
  • Regression
  • Open Source Tools
  • Applications
104
Distributed Stream
Processing Engines
105
A Tale of two Tribes
M. Stonebraker U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05
(Figure: a spectrum from faster (stream processing engines) to larger (databases))
106
SPE Evolution
• 1st generation (2003): Aurora. Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
• 2nd generation (2005): Borealis. Abadi et al., “The Design of the Borealis Stream Processing Engine,” CIDR ’05
107
Actors Model
(Figure: live streams 1-3 enter PEs, which route events to other PEs and to an external persister, producing Output 1 and Output 2)
108
S4 Example
status.text:"Introducing #S4: a distributed #stream processing system"
TopicNTopicPE (PE4) keeps counts for top topics and outputs top-N topics to an external persister
Event: Topic; KEY: reportKey="1"; VAL: topic="S4", count=4
109
Groupings
• Key Grouping (hashing)
• Shuffle Grouping (round-robin)
• All Grouping (broadcast)
(Figure: a PE routes events to PE instances, PEIs)
110
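The first two groupings can be sketched in a few lines (function names are illustrative, not the API of S4 or Storm); all grouping simply sends every event to every instance:

```python
import itertools

def key_grouping(key, n_partitions):
    # Hashing: the same key always reaches the same PE instance,
    # so per-key state stays local to one instance
    return hash(key) % n_partitions

def shuffle_grouping(n_partitions):
    # Round-robin over instances, ignoring the key: balances load
    return itertools.cycle(range(n_partitions))

rr = shuffle_grouping(3)
print(key_grouping("user-42", 4) == key_grouping("user-42", 4))  # True
print([next(rr) for _ in range(5)])  # [0, 1, 2, 0, 1]
```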
Apache Storm
MapReduce limitation: high latency (not real time). How to compute in real time (latency less than 1 second)? Examples:
1 predictions
2 frequent items, such as Twitter hashtags
3 sentiment analysis

Storm characteristics for real-time data processing workloads:
1 Fast
2 Scalable
3 Fault-tolerant
4 Reliable
5 Easy to operate

Storm and Samza are fairly similar. Both systems provide:
1 a partitioned stream model,
2 a distributed execution environment,
3 an API for stream processing,
4 fault tolerance,
5 Kafka integration
114
Kappa Architecture
115
Apache Spark
116
Apache Spark
117
Apache Spark
118
Apache Flink
• Streaming engine
119
Apache Beam
121
Apache Beam
• Apache Beam code can run in:
• Apache Flink
• Apache Spark
122
Apache Beam
123
Classification
124
Hadoop AllReduce
A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”. JMLR (2014)
• Aggregate + Redistribute
125
(Figure: AllReduce on a binary tree with node values 9 at the root, 1 and 8 internal, and leaves 7, 5, 3, 4; after AllReduce every node holds the global sum 37)
Figure 1: AllReduce operation. Initially, each node holds its own value. Values are passed up the tree and summed, until the global sum is obtained in the root node (reduce phase). The global sum is then passed back down to all other nodes (broadcast phase). At the end, each node contains the global sum.
Reduction Tree: Upward = Reduce, Downward = Broadcast (All)
Hadoop-compatible AllReduce 126
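The reduce-then-broadcast pattern can be simulated in a few lines; this is a toy sequential model of the operation (an implicit binary heap stands in for the tree, with the node values from the figure in an assumed order), not Hadoop or Vowpal Wabbit code:

```python
def allreduce_tree(values):
    """Simulate AllReduce on a binary tree: reduce up, broadcast down.
    values[i]'s children are values[2i+1] and values[2i+2]."""
    n = len(values)
    partial = list(values)
    # Reduce phase: each node pushes its partial sum to its parent
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] += partial[i]
    # Broadcast phase: the root's total flows back down to every node
    total = partial[0]
    return [total] * n

print(allreduce_tree([7, 5, 3, 4, 9, 8, 1]))  # every node ends with 37
```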
Parallel Decision Trees
(Figure: the data matrix of instances × attributes, plus a class column)
• Which kind of parallelism?
• Task
• Data
  • Horizontal (partition by instances)
  • Vertical (partition by attributes)
127
Horizontal Partitioning
Y. Ben-Haim, E. Tom-Tov: “A Streaming Parallel Decision Tree Algorithm”. JMLR (2010)
(Figure: the instance stream is partitioned across workers; each worker keeps local histograms/stats and sends model updates for aggregation)
Single attribute tracked in multiple nodes; aggregation needed to compute splits
128
Hoeffding Tree Profiling
Training time for 100 nominal + 100 numeric attributes:
• Learn: 70 %
• Split: 24 %
• Other: 6 %
129
Vertical Partitioning
N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo: “VHT: Vertical Hoeffding Tree”, 2016 https://arxiv.org/abs/1607.08325
(Figure: attributes are partitioned across workers; each worker keeps the stats for its attributes and reports splits to the model)
Single attribute tracked in a single node
130
Vertical Hoeffding Tree
(Figure: VHT topology, instances stream in and results stream out)
131
Advantages of
Vertical Parallelism
• High number of attributes => high level of parallelism
(e.g., documents)
133
VAMR
A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ‘14
• Vertical AMRules
• Learners maintain the rule statistics
134
HAMR
A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ‘14
(Figure: instances are shuffled among multiple learners; model aggregators collect new rules from the learners and combine predictions; a second layer of model aggregators adds parallelism)
• Shuffle instances among multiple learners for parallelism
• A single point of aggregation decreases performance
135
Open Source Tools
136
http://moa.cms.waikato.ac.nz/
MOA
• {M}assive {O}nline {A}nalysis is a framework for online learning from data streams
streamDM C++
138
Vision
Streaming + Distributed
SAMOA
G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)
(Figure: data mining systems on two axes, non-distributed vs. distributed and batch vs. streaming; distributed batch: Hadoop; distributed streaming: Storm, S4, Samza)
140
http://huawei-noah.github.io/streamDM
StreamDM
141
Applications
142
Application: Encrypted Traffic Fingerprinting
• Multi-stream classification
– Some streams are labeled, some are not.
– Distributions are related but not same (e.g., covariate
shift).
– Requires bias correction, domain adaptation, etc.
• Adversarial active learning
– Traditional algorithms are vulnerable to adversarial
manipulation.
– Instances should be selected carefully.
• Efficient online change detection
149
Summary
150
Open Challenges
• Structured output
• Multi-target learning
• Millions of classes
• Representation learning
• Ease of use
151
References
152
• IDC’s Digital Universe Study. EMC (2011)
• J. Gama, R. Fernandes, R. Rocha: “Decision trees for mining data streams”. IDA (2006)
• A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams”. IDA (2009)
• A. Bifet, R. Gavaldà: “Learning from Time-Changing Data with Adaptive Windowing”. SDM ’07
• E. Almeida, C. Ferreira, J. Gama. "Adaptive Model Rules from Data Streams”. ECML-PKDD ‘13
• H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer: “An effective evaluation
measure for clustering on evolving data streams”. KDD ’11
• T. Zhang, R. Ramakrishnan, M. Livny: “BIRCH: An Efficient Data Clustering Method for Very Large
Databases”. SIGMOD ’96
• C. C. Aggarwal, J. Han, J. Wang, P. S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ‘03
• M. Ester, H. Kriegel, J. Sander, X. Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise”. KDD ‘96
153
• F. Cao, M. Ester, W. Qian, A. Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”.
SDM ‘06
• G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB '02
• Y. Chi , H. Wang, P. Yu , R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding
Window”. ICDM ’04
• C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time
granularities”. NGDM (2003)
• M. Stonebraker U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05
• A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”.
JMLR (2014)
• A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data
Streams”. BigData ’14
• G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)
• J. Gama: “Knowledge Discovery from Data Streams”. Chapman and Hall (2010)
• J. Gama: “Data Stream Mining: the Bounded Rationality”. Informatica 37(1): 21-25 (2013)
154
Contacts
• https://sites.google.com/site/iotminingtutorial
• Albert Bifet
abifet@telecom-paristech.fr @abifet
• Latifur Khan
lkhan@utdallas.edu
• João Gama
jgama@fep.up.pt @JoaoMPGama
• Wei Fan
fanwei03@baidu.com @fanwei
155