Advertisement
Sensor Networks
Social Media
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
Turn event stream into some statistics (events/second, trends, counting)
Churn prediction (but really just training on data)
Clustering of news stories (ok)
Outlier detection for monitoring (ok, yeah)
Many Events
Many Objects
100 events / second
360k per hour
8.6M per day
260M per month
3.2B per year
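The figures above follow from simple multiplication. A minimal sketch, assuming a constant rate, 30-day months and 365-day years (the function and constant names are illustrative, not from the talk):

```python
# Back-of-the-envelope event volumes at a constant rate.
# Assumption: 30-day months and 365-day years.
RATE_PER_SECOND = 100

SECONDS_PER_PERIOD = {
    "hour": 3600,
    "day": 86_400,
    "month": 30 * 86_400,
    "year": 365 * 86_400,
}


def events_per(period: str, rate: int = RATE_PER_SECOND) -> int:
    """Return the number of events accumulated over one period at a fixed rate."""
    return rate * SECONDS_PER_PERIOD[period]


for period in SECONDS_PER_PERIOD:
    print(f"{events_per(period):>15,d} per {period}")
```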
http://www.flickr.com/photos/arenamontanus/
Weekly reports: a day
Recommendations: a few hours
Web Analytics: seconds to a few minutes
Ad Optimization: milliseconds
can react in
converges, e.g. if
Just like you would with static data, in a BIG FOR LOOP
data management
huge latency!
Covers Locally Weighted Linear Regression, Naive Bayes, Gaussian Discriminative Analysis, k-Means, Logistic Regression, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, Support Vector Machines
Big Data Beers, April 15, 2014, Berlin
Apache Hadoop
Industry Standard for Map Reduce
Originally developed at Yahoo
A faithful embodiment of the Java Enterprise Mindset, bindings & tools for everything
Core components: HDFS + MapReduce
Hadoop Facts
You shouldn't be afraid of Java
HDFS like a Poor Man's FS (read/write, no seek) to distribute data
Map splits essentially on file level (some file formats are understood)
Job startup takes quite some time (minutes)
Map/Reduce jobs can be scripts => interface to every language in principle
Very extensible (in Java): User Defined Functions, etc.
Split work into small pieces of code handling a single event
Parallelize
Good latency
No Query Layer!
"Classical" multi-threaded programming considered harmful (locking, concurrent access)
Interacting actors (each running single-threaded)
Emphasis on functional programming & immutable data structures
Apache Storm
Parallelized Micro-Batches
Apache Spark
Stream processing
not just map/reduce but more complete (functional collection style API), also for streams
in memory
hyped Hadoop competitor
developed with support by UC Berkeley
Data is noisy
Not every data point is important
Methods are noisy, too
Absolute numbers are often not important, too
Probabilistic Algorithm: With n space, perform task with error e = f(n) with e → 0 as n → ∞
Mother of all algorithms: Bloom filter
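To make the space/error trade-off concrete, here is a minimal Bloom filter sketch. The class name, the default sizes and the use of salted SHA-256 as a family of hash functions are assumptions chosen for readability, not something the talk prescribes. More bits (larger n) means fewer false positives, and there are never false negatives.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: false positives possible, false negatives not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the code simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Assumption: salting SHA-256 with the hash index stands in for
        # num_hashes independent hash functions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # The item may be present only if every one of its positions is set.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("192.168.0.1")
print("192.168.0.1" in seen)
```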
Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users)
Interested in most active elements only.
Case 1: element already in data base
Case 2: new element
Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
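The two cases can be sketched as a fixed-capacity counter table in the spirit of the cited paper. This is a simplified illustration, not the authors' implementation: it uses a plain dictionary with a linear scan for the minimum, and the class and method names are assumptions.

```python
class SpaceSaving:
    """Track approximate counts of the most active items in bounded memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.counts = {}  # item -> estimated count (never an underestimate)
        self.errors = {}  # item -> maximum possible overestimation

    def add(self, item) -> None:
        if item in self.counts:
            # Case 1: element already in the table, just increment.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not full yet: start an exact count.
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Case 2: new element and table full. Evict the smallest entry
            # and let the newcomer inherit its count as an upper bound.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k: int):
        """Return the k items with the largest estimated counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


hitters = SpaceSaving(capacity=2)
for user in ["a", "a", "a", "b", "c"]:
    hitters.add(user)
print(hitters.top(2))
```

Because an evicted count is inherited rather than reset, rarely seen items can be overestimated, while genuinely frequent items stay near the top; the `errors` table records how large that overestimation can be.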
Summarize histograms over large feature sets
Like Bloom filters, but better
Figure: table of counters with m bins and n different hash functions
Query result: 1
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005).
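A minimal count-min sketch along these lines, with n hash functions each owning a row of m bins. As with any such sketch, the salted SHA-256 hashing, the default sizes and all names are assumptions for illustration. Every update increments one bin per row, and a query returns the minimum over the rows, which can overestimate but never underestimate.

```python
import hashlib


class CountMinSketch:
    """Approximate per-item counts using num_hashes rows of num_bins counters."""

    def __init__(self, num_hashes: int = 4, num_bins: int = 256):
        self.num_hashes = num_hashes
        self.num_bins = num_bins
        self.table = [[0] * num_bins for _ in range(num_hashes)]

    def _bin(self, row: int, item: str) -> int:
        # Assumption: salting SHA-256 with the row index stands in for
        # a family of independent hash functions.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bins

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.num_hashes):
            self.table[row][self._bin(row, item)] += count

    def query(self, item: str) -> int:
        # Collisions only ever add to a bin, so the smallest bin is the best estimate.
        return min(
            self.table[row][self._bin(row, item)] for row in range(self.num_hashes)
        )


sketch = CountMinSketch()
for word in ["storm", "spark", "storm"]:
    sketch.add(word)
print(sketch.query("storm"))
```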
2014 by Mikio L. Braun
Online clustering
Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering, 2009
Counting is statistics!
Empirical mean:
Correlations:
based on
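As a sketch of the idea that counting is statistics, both the empirical mean and a correlation can be derived from a handful of running sums that are updated once per event. The class and method names are illustrative assumptions, and the sum-of-squares form is chosen for brevity; it can lose precision on data with a large mean and small variance.

```python
import math


class RunningStats:
    """Mean and Pearson correlation of (x, y) pairs from running sums only."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_yy = 0.0
        self.sum_xy = 0.0

    def add(self, x: float, y: float) -> None:
        # Each event only touches six counters; no raw data is stored.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_yy += y * y
        self.sum_xy += x * y

    def mean_x(self) -> float:
        return self.sum_x / self.n

    def correlation(self) -> float:
        mean_x = self.sum_x / self.n
        mean_y = self.sum_y / self.n
        cov = self.sum_xy / self.n - mean_x * mean_y
        var_x = self.sum_xx / self.n - mean_x * mean_x
        var_y = self.sum_yy / self.n - mean_y * mean_y
        return cov / math.sqrt(var_x * var_y)


stats = RunningStats()
for x, y in [(1, 2), (2, 4), (3, 6)]:
    stats.add(x, y)
print(stats.mean_x(), stats.correlation())
```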
Once you have a model, you can compute p-values (based on recent time frames!)
Online TF-IDF
class priors
priors
ICML 2003
transform TF to log( . + 1)
IDF-style normalization
square length normalization
use complement probability
another log
normalize those weights again
Predict linearly using those weights
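One possible reading of the steps above, written as a small batch sketch. The smoothing constant `alpha`, the function names and the choice to score with the lowest complement weight are assumptions made to get a runnable example; the slide itself only lists the steps.

```python
import math
from collections import Counter, defaultdict


def transform(docs):
    """Token lists -> {term: weight} after log TF, IDF and length normalization."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # log(tf + 1) damps term frequency, log(N / df) is the IDF-style factor.
        vec = {t: math.log(c + 1) * math.log(n_docs / doc_freq[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors


def train(vectors, labels, alpha=1.0):
    """Per-class weights from the complement (all other classes), then log and normalize."""
    vocab = sorted({t for vec in vectors for t in vec})
    weights = {}
    for c in set(labels):
        comp = defaultdict(float)
        for vec, y in zip(vectors, labels):
            if y != c:  # complement: accumulate mass from every other class
                for t, v in vec.items():
                    comp[t] += v
        total = sum(comp.values()) + alpha * len(vocab)
        w = {t: math.log((comp[t] + alpha) / total) for t in vocab}
        scale = sum(abs(v) for v in w.values())
        weights[c] = {t: v / scale for t, v in w.items()}
    return weights


def predict(weights, tokens):
    """Linear score per class; the class whose complement fits worst wins."""
    tf = Counter(tokens)
    return min(weights, key=lambda c: sum(n * weights[c].get(t, 0.0) for t, n in tf.items()))


docs = [["ball", "goal", "ball"], ["vote", "law"], ["goal", "team"], ["law", "court"]]
labels = ["sport", "politics", "sport", "politics"]
model = train(transform(docs), labels)
print(predict(model, ["goal", "ball"]))
```

An online variant would keep the document frequencies and per-class sums as running counters instead of recomputing them from a stored corpus.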
Least Recently Used Caches
Sparse Vectors
Sparse Matrices
Conditional Probabilities (Histograms)
Accumulators
...
Streamdrill
Heavy Hitters counting + exponential decay
Instant counts & top-k results over time windows
In-Memory
Profiling and Trending
Recommendations
Count Distinct
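To illustrate what combining counting with exponential decay can look like, here is a generic decayed counter. This is not Streamdrill's implementation; the class name, the half-life parameter and the assumption that timestamps arrive in non-decreasing order are all illustrative.

```python
import math


class DecayingCounter:
    """A count whose past contributions halve every half_life_seconds."""

    def __init__(self, half_life_seconds: float):
        self.rate = math.log(2) / half_life_seconds
        self.value = 0.0
        self.last_update = None  # timestamp of the most recent add()

    def add(self, now: float, amount: float = 1.0) -> None:
        # Assumption: 'now' never goes backwards between calls.
        if self.last_update is not None:
            self.value *= math.exp(-self.rate * (now - self.last_update))
        self.value += amount
        self.last_update = now

    def read(self, now: float) -> float:
        """Return the decayed count at time 'now' without modifying state."""
        if self.last_update is None:
            return 0.0
        return self.value * math.exp(-self.rate * (now - self.last_update))


counter = DecayingCounter(half_life_seconds=10.0)
counter.add(now=0.0)
counter.add(now=10.0)
print(counter.read(now=20.0))
```

Storing only the value and the last update time means an idle counter costs nothing until it is touched again, which is what makes this style of decay practical for very large item sets.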
Modules
Architecture Overview
http://play.streamdrill.com/vis/
Trends:
SAB: http://on.wsj.com/15,GaU3
ML on Streams
Constantly deal with new data
It's often not that complex, really
High data rate & event range
Batch: High latency → Hadoop
Stream: Low latency → Storm / Spark
Approximate: Good enough → Streamdrill