
3/30/2016

Quick Introduction to Apache Spark | Data Phanatik

Data Phanatik
For the love of Code, Data and Analytics

Quick Introduction to Apache Spark
Posted on March 20, 2016
What is Spark
Spark is a fast, general-purpose framework for cluster computing. It is written in Scala but offers APIs in Scala, Java and Python. The main data abstraction in Spark is the RDD, or Resilient Distributed Dataset. With this abstraction, Spark enables data to be distributed and processed among the many nodes of a computing cluster. It provides both interactive and programmatic interfaces. It consists of the following components:
Spark Core - the foundational classes on which Spark is built; it provides the RDD
Spark Streaming - a library for processing live data streams
Spark SQL - an API for handling structured data
MLlib - a machine learning library
GraphX - a graph-processing library
Cluster Manager
In order to operate in production mode, Spark needs a cluster manager that handles data distribution, task scheduling and fault tolerance among the various nodes.
There are 3 types supported: Apache Mesos, Hadoop YARN and Spark Standalone.
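As a rough sketch, the choice of cluster manager shows up in the master URL passed to spark-submit. The host names and the application jar below are placeholders, not values from the post:

```shell
# Spark Standalone cluster
spark-submit --master spark://master-host:7077 my-app.jar

# Hadoop YARN (cluster location is read from the Hadoop configuration)
spark-submit --master yarn my-app.jar

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 my-app.jar

# Local mode with 4 worker threads (no cluster manager; useful for development)
spark-submit --master local[4] my-app.jar
```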
Spark Fundamentals
Spark, as previously defined, is a framework for fast, general-purpose cluster computing. Its fundamental abstraction is the RDD, the resilient distributed dataset, which is inherently parallelizable among the nodes and cores of a computing cluster.
The RDD is held in RAM wherever possible.
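This parallelism comes from splitting the dataset into partitions that can be processed independently. The following is not Spark code, just a plain-Python sketch of the idea; the `partition` helper is a hypothetical stand-in for what `sc.parallelize` does internally:

```python
def partition(data, num_partitions):
    """Split a list into roughly equal chunks, one per partition."""
    size, rem = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # Spread any remainder over the first few partitions.
        end = start + size + (1 if i < rem else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

data = list(range(10))
parts = partition(data, 4)   # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]

# Each partition can be processed independently (on a cluster, in parallel);
# here we square each partition's elements and combine the results.
result = [x * x for part in parts for x in part]
```

On a real cluster each chunk would live on a different node; the driver only coordinates.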


When data is read into Spark, it is read into an RDD. Once it is in an RDD it can be operated on. There are 2 distinct types of operations on RDDs:
1. Transformations
Transformations are used to convert data in the RDD to another form. The result of a transformation is another RDD. Examples of transformations are:
map() - takes a function as argument and applies the function to each item/element of the RDD
flatMap() - takes a function as argument and applies the function to each element while flattening the results into a single-level collection
filter() - takes a boolean expression and returns an RDD containing only the rows for which the boolean expression is true, e.g. lines of a file which contain the string "Obama"
countByKey() - given a Pair RDD, i.e. one with key-value pairs, returns the counts by key. (Strictly speaking this is an action rather than a transformation, since it returns a local map of counts to the driver instead of another RDD.)
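The semantics of these transformations can be sketched in plain Python (this is not Spark code; PySpark's map/flatMap/filter behave analogously on distributed data):

```python
from collections import Counter

data = ["hello world", "hello spark"]

# map: apply a function to each element -> exactly one output per input
lengths = [len(line) for line in data]               # [11, 11]

# flatMap: apply a function to each element, then flatten the results one level
words = [w for line in data for w in line.split()]   # ['hello', 'world', 'hello', 'spark']

# filter: keep only the elements for which the predicate is true
hellos = [line for line in data if "hello" in line]  # both lines match

# countByKey: given (key, value) pairs, count occurrences of each key
pairs = [(w, 1) for w in words]
counts = Counter(k for k, _ in pairs)                # hello: 2, world: 1, spark: 1
```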
2. Actions
Actions are operations on an RDD which result in some sort of output that is not an RDD, e.g. a list, a DataFrame, or output to the screen. Examples of action operations are:
collect() - applies the pending transformations to an RDD and returns the result as a collection
count() - returns a count of the number of elements in an RDD
reduce() - takes a function and repeatedly applies it to the elements of the RDD to produce a single output value
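Again as a plain-Python sketch of the semantics (not actual Spark code), using the same sample numbers as the REPL session below:

```python
from functools import reduce

data = [1, 3, 4, 5, 9]

# collect: materialize the pipeline into an ordinary in-memory collection
squared = list(map(lambda x: x * x, data))   # [1, 9, 16, 25, 81]

# count: the number of elements
n = len(data)                                # 5

# reduce: repeatedly combine pairs of elements into a single value
total = reduce(lambda a, b: a + b, data)     # 22
```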
RDDs and Lazy Evaluation
A fundamental idea in Spark's implementation is the application of lazy evaluation, and this is implemented for all Spark transformation operations.
An RDD is fundamentally a data abstraction, so when we call, say:

scala> val rdd = sc.parallelize(Seq(1, 3, 4, 5, 9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val mappedRDD = rdd.map(x => x * x)
mappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23

what we're getting as mappedRDD is just an expression that hasn't been evaluated. This expression is essentially a record of the sequence of operations that need to be evaluated, i.e.


parallelize -> map
This expression amounts to what is known as a lineage graph, and it can be seen as follows:

scala> mappedRDD.toDebugString
res4: String =
(4) MapPartitionsRDD[1] at map at <console>:23 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
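Lazy evaluation has a rough analogue in plain Python: a generator pipeline records what to do but does no work until a result is demanded, just as Spark's transformations only run when an action is called. This is only an analogy, not Spark code:

```python
data = [1, 3, 4, 5, 9]

log = []
def square(x):
    log.append(x)      # record when work actually happens
    return x * x

# Building the pipeline is like calling a transformation: nothing runs yet.
mapped = (square(x) for x in data)
assert log == []       # no element has been processed so far

# Demanding the results is like calling an action: now the work happens.
result = list(mapped)
assert result == [1, 9, 16, 25, 81]
assert log == data     # every element was processed exactly once
```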

This entry was posted in Big Data and Distributed Systems and tagged apache spark by femibyte. Bookmark the permalink [http://searchdatascience.com/brief-introduction-to-apache-spark/].

