DataPhanatik
For the love of Code, Data and Analytics
Quick Introduction to Apache Spark
Posted on March 20, 2016
What is Spark
Spark is a fast, general-purpose framework for cluster computing.
It is written in Scala and offers APIs in Scala, Java and Python.
The main data abstraction in Spark is the RDD, or Resilient Distributed Dataset.
With this abstraction, Spark enables data to be distributed and processed across the many nodes of a computing cluster.
It provides both interactive and programmatic interfaces. It consists of the following components:
Spark Core - the foundational classes on which Spark is built. It provides the RDD.
Spark Streaming - a library for processing streaming data
Spark SQL - an API for handling structured data
MLlib - a machine learning library
GraphX - a graph library
Cluster Manager
In order to operate in production mode, Spark needs a cluster manager that manages data distribution, task scheduling and fault tolerance among the various nodes.
There are 3 types supported: Apache Mesos, Hadoop YARN and Spark standalone.
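As a rough sketch of where this choice shows up in code: the master URL given to SparkConf selects the cluster manager. The host names and ports below are hypothetical placeholders, and the plain `yarn` value assumes Spark 2.x or later (older releases used `yarn-client` / `yarn-cluster`).

```scala
import org.apache.spark.SparkConf

// Hypothetical master URLs, one per cluster manager; hosts and ports are placeholders.
val masters = Map(
  "standalone" -> "spark://master-host:7077",
  "mesos"      -> "mesos://mesos-host:5050",
  "yarn"       -> "yarn",      // assumes Spark 2.x+; the cluster is located via the Hadoop config
  "local"      -> "local[2]"   // no cluster manager: run in-process with 2 threads
)

// Build a configuration for the chosen manager.
// Nothing connects to a cluster until a SparkContext is created from it.
def confFor(manager: String): SparkConf =
  new SparkConf().setAppName("cluster-manager-sketch").setMaster(masters(manager))

val localConf = confFor("local")
println(localConf.get("spark.master"))
```

In practice the master is often supplied on the command line through the `--master` option of spark-submit or spark-shell rather than hard-coded.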
Spark Fundamentals
Spark, as previously defined, is a framework for fast and general-purpose cluster computing. Its fundamental abstraction is the RDD, the resilient distributed dataset, which means that it is inherently parallelizable among the nodes and cores of a computing cluster.
Spark processes RDDs in memory and can cache them in RAM, although an RDD is not guaranteed to fit entirely in memory.
http://searchdatascience.com/brief-introduction-to-apache-spark/
3/30/2016
When data is read into Spark, it is read into an RDD. Once it is read into an RDD, it can be operated on. There are 2 distinct types of operations on RDDs:
1. Transformations
Transformations are used to convert the data in the RDD to another form. The result of a transformation is another RDD. Examples of transformations are:
map() - takes a function as argument and applies the function to each item/element of the RDD
flatMap() - takes a function as argument and applies the function to each element while flattening the results into a single-level collection.
filter() - takes a boolean expression and returns an RDD with the rows for which the boolean expression is true, e.g. lines of a file which contain the string "Obama"
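reduceByKey() - given a pair RDD, i.e. one with key-value pairs, takes a function and returns another pair RDD with the values combined per key, e.g. counts by key. (The similarly named countByKey() is an action: it returns the counts to the driver as a local Map, not as an RDD.)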
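The transformations above can be sketched as follows. This is a minimal illustration, assuming a local-mode context and made-up sample lines; it uses reduceByKey() to obtain a pair RDD of counts per key, since countByKey() returns a local Map rather than an RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumption: local mode).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("transformations-sketch").setMaster("local[2]"))

// Made-up sample data standing in for the lines of a file.
val lines = sc.parallelize(Seq("Obama spoke today", "weather is mild", "Obama travels"))

val lengths   = lines.map(line => line.length)                // one output element per input element
val words     = lines.flatMap(line => line.split(" "))        // each line yields several words, flattened
val obamaOnly = lines.filter(line => line.contains("Obama"))  // keep only the matching lines
val counts    = words.map(w => (w, 1)).reduceByKey(_ + _)     // pair RDD of (word, count)

// Every value above is still an RDD; collect() is the action that brings results to the driver.
println(obamaOnly.collect().mkString("; "))
println(counts.collect().toMap.apply("Obama"))
```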
2. Actions
Actions are operations on an RDD which result in some sort of output that is not an RDD, e.g. a list, DataFrame, or output to the screen. Examples of action operations are:
collect() - applies the various transformations to an RDD, then returns the result as a collection.
count() - returns a count of the number of elements in an RDD
reduce() - takes a function and repeatedly applies it to the elements of the RDD to produce a single output value
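A small sketch of these actions, assuming a local-mode context and reusing the numbers from the shell session in this post. It also shows countByKey(), which is an action that returns a local Map to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumption: local mode).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("actions-sketch").setMaster("local[2]"))

val squares = sc.parallelize(Seq(1, 3, 4, 5, 9)).map(x => x * x)

val all   = squares.collect()      // Array[Int] materialised on the driver
val n     = squares.count()        // Long: number of elements
val total = squares.reduce(_ + _)  // Int: sum of the squares

// countByKey() needs key-value pairs; the result is a local Map, not an RDD.
val byParity = sc.parallelize(Seq(1, 3, 4, 5, 9))
  .map(x => (if (x % 2 == 0) "even" else "odd", x))
  .countByKey()

println(s"${all.mkString(",")} n=$n total=$total $byParity")
```

Note that collect() pulls the whole dataset into driver memory, so it is best kept for small results.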
RDDs and Lazy Evaluation
A fundamental idea in Spark's implementation is the application of lazy evaluation, and this is implemented for all Spark transformation operations.
Thus an RDD is fundamentally a data abstraction, so when we call, say:
scala> val rdd = sc.parallelize(Seq(1,3,4,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val mappedRDD = rdd.map(x => x*x)
mappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23
what we're getting as mappedRDD is just an expression that hasn't been evaluated. This expression is essentially a record of a sequence of operations that need to be evaluated, i.e. parallelize -> map
This expression amounts to what is known as a lineage graph and can be seen as follows:
scala> mappedRDD.toDebugString
res4: String =
(4) MapPartitionsRDD[1] at map at <console>:23 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
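One way to observe the laziness directly is to define a transformation that is bound to fail and note that nothing happens until an action runs. The following is only a sketch, assuming a local-mode context; the failing function is contrived for the demonstration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Reuse the shell's context if one exists; otherwise start a local one (assumption: local mode).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]"))

val rdd = sc.parallelize(Seq(1, 3, 4, 5, 9))

// Dividing by (x - 3) must fail for the element 3, yet defining the RDD raises nothing:
// map() only appends a step to the lineage.
val risky = rdd.map(x => 10 / (x - 3))
println(risky.toDebugString)  // the lineage can be inspected without running a job

// Only the action triggers the computation, and with it the failure.
val outcome = Try(risky.collect())
println(outcome.isFailure)
```

The division by zero surfaces only inside collect(), wrapped in the exception Spark raises when the job is aborted, which is why the call is guarded with Try.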
This entry was posted in Big Data and Distributed Systems and tagged apache spark by femibyte. Bookmark the permalink [http://searchdatascience.com/brief-introduction-to-apache-spark/].