Intro to Hadoop and MapReduce

Lesson 1 Notes

Introduction

Hi! Welcome to Fundamentals of Hadoop and MapReduce. My name's Sarah
Sproehnle, and I'm the Vice President of Educational Services at Cloudera, a
company which helps develop, support, and manage Hadoop.

And I'm Ian Wrigley, Cloudera's Senior Curriculum Manager. Between us, Sarah
and I have been responsible for bringing Hadoop training to over 20,000 people,
and we're excited to reach a much bigger audience here at Udacity. During this
course we're going to discuss what big data is, what Hadoop is, why it's useful,
and how to write MapReduce code.

By the end of the course, you'll be able to describe the kinds of problems Hadoop addresses,
and you'll have written MapReduce programs to efficiently analyze very large Web server log
files. In fact, you'll have had hands-on experience running a Hadoop job by the end of lesson two.

So, let's start. In this lesson, we're going to define 'big data', the sort of problems it introduces,
and how to address those problems.

Sources of Data

Organizations have been generating data since way back, but as time goes on, more and more
data is being generated. IBM estimates that as much as 90% of the data in the world today
has been created in the last two years alone.

Just as a simple example, think about your cell phone. Whenever it's turned on, it's
connecting to cell towers to get reception. As you move around, it will connect to
different towers, and at different signal strengths depending on how far away from them
you are. All of that connection data is collected by the phone company, and it's logged.
Copyright 2014 Udacity, Inc. All Rights Reserved.

They can use it to find dead spots in their coverage, to work out which towers are the busiest
and need increased capacity... they can even trace you if you make an emergency call but don't
give your exact location. That's an enormous amount of data right there.

Another example is when you visit a Web site like Amazon or Netflix. Everything you do there
is logged: what pages you viewed, what products you looked at, how long you spent on each
page... even things like what Web browser you were using and what sort of computer you were
connecting from. Again, huge amounts of data.

And that's just in the corporate world. In medicine, for example, each X-ray creates huge
amounts of potentially incredibly valuable information, and comparing large numbers of them can
help us to detect similarities in tumors.

This increase in the amount of data we're generating opens up huge possibilities. But it comes
with problems too. We have to store all that data, and we have to be able to process it in a
sensible amount of time.

Quiz: What is a Big Data problem?


This course is about Hadoop, and how it helps to deal with Big Data. But not everything is
actually a big data problem. There are lots of cases where you can use traditional systems to
store, manage, and process your data. So the first thing you need to do is decide if what you
have really does fall under the heading of big data in the first place. And to make that call, we
have to create some kind of definition for what big data is.

Let's start with a quick question. Which of these would you consider to be big data? You are not
going to be graded on this answer, but give it your best guess.

[ ] order details for a purchase at a store
[ ] all orders across hundreds of branches nationwide
[ ] information about a person's stock portfolio
[ ] all stock transactions made on the New York Stock Exchange during the year
Answer:
For most people, the answers are going to be 2 and 4. A list of purchases at a single store is
almost certainly small enough to be easily handled by a traditional relational database system
or even just a spreadsheet. Orders from hundreds of stores nationwide, though, could start to
overwhelm traditional systems. Likewise, information about a single person's stock portfolio is a
small and easily managed chunk of data. But data on trades across the entire NYSE for a year
will run into tens or hundreds of terabytes, and that's where traditional systems really do start to
struggle.

Definition of Big Data


There's no one definition for big data; it's a very subjective term. Most people would consider a
data set of terabytes or more to be big data, but there are certainly people using Hadoop with
great success on smaller chunks of data than that. One reasonable definition is that it's data
which can't comfortably be processed on a single machine.

Quiz: Challenges
But Big Data is more than just the size of the data. What additional problems can you see in this
field?

[ ] most data is worthless and it's hard to find the useful parts
[ ] it's hard to gather data
[ ] data is created very fast
[ ] data from different sources is in different formats

Answer:
A potential challenge with big data is that it is created very fast and does come from different
sources, which could come in a variety of formats. In my experience, most data is not worthless
but actually does have a lot of value.

The 3 Vs of Big Data:


When you read or talk about Big Data, you'll often hear people refer to the three Vs. Volume
refers to the size of data that you're dealing with, Variety refers to the fact that the data is often
coming from lots of different sources and in many different formats, and Velocity refers to the
speed at which the data is being generated, and the speed at which it needs to be made
available for processing. So let's look in more detail at each of them.

Volume
The price to store data has dropped incredibly over the last 60 years. In 1980, the cost per
gigabyte was several hundred thousand dollars. In 2013, it's well under 10 cents.

Although it's worth saying that if you actually want to store the data reliably, you're going to end
up paying rather more than that: probably several dollars per gigabyte, maybe even more.

That's particularly the case with more traditional data storage devices such as storage area
networks, or SANs, which can be extremely expensive. The high cost of reliable storage puts a
cap on the amount of data companies can practically store. At some point, they'd say, "OK, it's
too expensive to store all that data that I'm not doing anything with. Let's just store the critical
stuff: my actual sales, for example, rather than all that stuff about how long people spent on
each page of my Web site." But it turns out, as we'll see, that the data they're currently
throwing away can be incredibly useful. What we need is a cheaper way to store it reliably.

And of course storing the data is only one part of the equation; you also need to be able to read
and process it efficiently. Storing a terabyte of data on a SAN isn't so hard, but streaming the
data from the SAN across the network to some central processor can take a long time, and
processing it can be extremely slow.
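To put rough numbers on that, here is a back-of-the-envelope sketch. The 100 MB/s throughput and the 100 parallel readers are illustrative assumptions chosen for round numbers, not measurements from the course or from any particular SAN.

```python
# Back-of-the-envelope estimate: how long does it take just to read 1 TB?
# All figures below are illustrative assumptions, not benchmarks.

TERABYTE = 10**12  # bytes (decimal terabyte)


def read_time_seconds(total_bytes, bytes_per_second, readers=1):
    """Time to read total_bytes when `readers` machines each stream an
    equal share of the data in parallel at bytes_per_second."""
    return total_bytes / (bytes_per_second * readers)


# Assumed sustained throughput of a single link or disk: 100 MB/s.
THROUGHPUT = 100 * 10**6

single = read_time_seconds(TERABYTE, THROUGHPUT)
parallel = read_time_seconds(TERABYTE, THROUGHPUT, readers=100)

print(f"one reader:  {single / 3600:.1f} hours")
print(f"100 readers: {parallel / 60:.1f} minutes")
```

Under these assumptions a single reader needs almost three hours just to move the data, while 100 readers that each handle one hundredth of it finish in under two minutes.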

Quiz: Volume

Which of the following data do you think is worth storing and analyzing?

[ ] transactions (financial, government related)
[ ] logs (records of activity, location)
[ ] business data (product catalogs, prices, customers)
[ ] user data (images, documents, video)
[ ] sensor data (temperature, pollution)
[ ] medical data (X-rays, brain activity records)
[ ] social (email, Twitter, etc.)

Answer:
And the answer is that all of these can provide useful information. But in order to store it, you'll
need a way to scale your storage capacity up to massive volume. Hadoop, which stores data in
a distributed way across multiple machines, does that. You'll see just how in the next lesson.

Variety

The second V is data variety. For a long time, people have used databases to store and
process their data: either smaller databases like MySQL, or big data warehouses based on
software from companies like Oracle and IBM. But for a data warehouse to effectively process
information, all that information has to fit nicely into a predefined set of tables. The problem is
that these days, lots of the data you want to store is what we tend to call unstructured data, or
semi-structured data. Sarah can give us some examples.

By unstructured, we mean the data arrives in lots of different formats. For example, a bank
might have a list of your credit card and account transactions, but they may also have scans of
your checks, records of your interactions with customer service representatives on the Web and
over the phone, perhaps even recordings of those phone calls. All of that data is in a variety of
different formats, and it can be hard to store and reconcile it all using traditional systems.

And this also ties back to volume. You want to store that data in its original format so you're not
throwing any information away. That way you can then process the data later in different ways
you might not even have thought of originally.

For instance, if we just transcribe call center conversations into text, we have what people said
to our customer service representatives. But if we have the actual recordings, then later on we
might develop software which can interpret the tone of voice the customer uses, and that might
lead to a very different interpretation of the data. And the nice thing about Hadoop is that it
doesn't care what format the data comes in. Unlike a traditional database, you can just store the
data in its raw format, and manipulate and reformat it later.

Quiz: Data Variety

Sometimes the most unlikely data can be extremely useful and lead to savings due to better
planning. For example, a conventional system for coordinating logistics might send the
closest truck to the warehouse to pick up the package. However, it might be that the closest
truck is not the best solution: perhaps there are traffic jams, or the most direct route is on small
roads that would take longer to drive. Maybe the truck doesn't have enough free space for the
new load. So what kind of data would be helpful in making a better plan that could save money
and time for the company?

[ ] Current GPS location from all trucks
[ ] Current itineraries for all trucks
[ ] Current traffic speed in related areas as reported by services such as Waze
[ ] Current load of trucks by volume and weight
[ ] Fuel efficiency of the different vehicles

Answer:
And again all of these answers are correct. You can save a lot of money, and time, by making
better decisions, driven by more varied data. The world we live in is extremely complex, and
there are a lot of variables to consider that you can tweak to get large benefits.
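As a rough illustration of how combining several of these data sources changes the decision, here is a minimal sketch. The `Truck` fields, the `pick_truck` helper, and all the sample values are hypothetical; a real dispatch system would also weigh itineraries and fuel efficiency. It assumes every truck reports a positive average speed.

```python
from dataclasses import dataclass


@dataclass
class Truck:
    name: str
    distance_km: float      # derived from the truck's GPS location
    avg_speed_kmh: float    # derived from live traffic data on its route
    free_volume_m3: float   # derived from the truck's current load
    free_weight_kg: float


def pick_truck(trucks, load_volume_m3, load_weight_kg):
    """Return the truck with the shortest estimated travel time among
    those with enough free capacity, or None if none can take the load."""
    candidates = [
        t for t in trucks
        if t.free_volume_m3 >= load_volume_m3
        and t.free_weight_kg >= load_weight_kg
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.distance_km / t.avg_speed_kmh)


trucks = [
    # Close to the warehouse, but stuck in slow traffic.
    Truck("A", distance_km=5, avg_speed_kmh=10, free_volume_m3=8, free_weight_kg=900),
    # Farther away, but on a fast road.
    Truck("B", distance_km=12, avg_speed_kmh=60, free_volume_m3=6, free_weight_kg=700),
    # Closest of all, but nearly full.
    Truck("C", distance_km=3, avg_speed_kmh=40, free_volume_m3=1, free_weight_kg=100),
]

best = pick_truck(trucks, load_volume_m3=4, load_weight_kg=500)
print(best.name)
```

With these made-up numbers the closest truck (C) is ruled out by its load, and truck B beats truck A because its estimated travel time is shorter despite the longer distance.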

Velocity

Velocity, the third V, is about the speed at which the data arrives, ready to be processed. We
need to be able to accept and store that data even when it's coming in at a rate of terabytes or
more a day, which is often the case. If we can't store it as it arrives, we'll end up discarding
some of it, and that's what we absolutely want to avoid.
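To get a feel for what "terabytes a day" means for a storage system, the following sketch converts a daily volume into the average write rate needed to keep up. The daily volumes are illustrative assumptions, and real traffic is bursty, so peak rates would be higher than these averages.

```python
# Convert a daily data volume into the average write rate needed to keep up.
# The daily volumes used below are illustrative assumptions.

SECONDS_PER_DAY = 24 * 60 * 60
TERABYTE = 10**12  # bytes (decimal terabyte)


def required_write_rate(bytes_per_day):
    """Average bytes per second that must be stored to avoid falling behind."""
    return bytes_per_day / SECONDS_PER_DAY


for tb in (1, 10, 100):
    rate_mb = required_write_rate(tb * TERABYTE) / 10**6
    print(f"{tb:>3} TB/day -> {rate_mb:,.1f} MB/s sustained")
```

One terabyte a day works out to roughly 11.6 MB/s around the clock, and the requirement grows linearly from there.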

What problems can we solve?

Think about an e-commerce Web site. If we know what products you've looked at in the past, we
could recommend similar products the next time you visit our site. If you spent five minutes
looking at a particular item, we could maybe send you an email informing you when that item is
on sale. If we know that you typically browse our site using a first-generation iPad, we could
suggest the latest model.

This is a huge difference to what we would do before, when we only stored records of actual
purchases. If we can store and process all of our Web server log files, along with the purchase
data that's in our traditional data warehouse, we can give the customer a much better shopping
experience, which should directly translate into bigger profits.

Yet another example is a movie site like Netflix. Based on what they know about your viewing
habits, they can recommend movies to you. As you can see here, because of what Ian's
rated highly before, the movie on the left is recommended for him, and they can even predict
what rating he'll give the movie.

History of solving data problems

So there are plenty of things we can do with big data. But first we have to solve a couple of
problems. We need to be able to store the data in a cost-effective way, and we need to be able
to process it efficiently. And it turns out that these are not easy problems to solve when we're
talking about massive amounts of data. Fortunately, though, some extremely smart people at
Google were working on them in the late 1990s and released the results of their work as
research papers in 2003 and 2004. Let's see what Doug Cutting, one of the founders of Hadoop,
has to say.

DOUG CUTTING about History of Hadoop:

So, let me tell you how Hadoop came to be. About ten years ago, in around 2003, I was
working on an open source web search engine called Nutch, and we knew it needed to be
something very scalable, because the Web was, you know, billions of pages, terabytes,
petabytes of data that we needed to be able to process. We set about doing the best job we
could, and it was tough. We got things up and running on four or five machines, not very well,
and around that time Google published some papers about how they were doing things
internally. They published a paper about their distributed file system, GFS, and about their
processing framework, MapReduce. So my partner in this project at the time, Mike Cafarella,
and I set about trying to reimplement these in open source, so that more people could use
them than just folks at Google. It took us a couple of years, and we had Nutch up and running
on, instead of four or five machines, 20 to 40 machines. It wasn't perfect, it wasn't totally
reliable, but it worked. And we realized that to get it to the point where it scaled to thousands
of machines, and was as bulletproof as it needed to be, would take more than just the two of
us working part time.

Around that time, Yahoo approached me and said they were interested in investing in
this. So I went to work for Yahoo in January of 2006. The first thing we did there was take
the parts of Nutch that were a distributed computing platform and put them into a
separate project, a new project christened Hadoop. Over the next couple of years, with
Yahoo's help and the help of others, we took Hadoop and really got it to the point where
it did scale to petabytes, running on thousands of processors, and doing so quite
reliably.

It spread to lots of companies, mostly in the Internet sector, and became quite a
success. After that, we started to see a bunch of other projects grow up around it,
and Hadoop's grown to be the kernel of what is pretty much an operating system for big
data. We've got tools that allow you to more easily do MapReduce programming, so
you can develop using SQL or a data flow language called Pig. And we've also got the
beginnings of higher-level tools. We've got interactive SQL with Impala. We've got
Search. And so we're really seeing this develop into a general-purpose platform for data
processing that scales much better and is much more flexible than anything else that's
out there.

That's the story of the genesis of Hadoop: it's based on work done by the folks at Google, and it's
grown from small beginnings to the point now where hundreds of people contribute to the
project, and where it's being used by thousands and thousands of companies worldwide. The
Hadoop logo is actually a little yellow elephant, but do you know where the name came from?
There's a funny story attached to that. Here's Doug again.

DOUG about Name of Hadoop

So the name Hadoop comes from my son's toy elephant. When he was about
two, a friend gave him a little stuffed elephant which he played with
incessantly. And we overheard him calling it something, this strange word that
he invented: Hadoop. So I immediately wrote it down, because I was
in the software business, and we're always looking for good names. And this
one came with a mascot, even. And a few years later, when I needed a project
name, I pulled it out. Now, I wrote it down as HADOOP, and figured that everyone
would say Hadoop. Now it turns out everyone says Hadoop instead, but I persist in saying
Hadoop. Now my son, of course, is 13, and expects royalties for the name. He wants
more credit. He also accuses me of stealing the toy. At some point, he was using it in
some kind of rocket ship experiment, and I had to rescue it. And now it lives in my sock
drawer, for safety.

Hadoop Cluster
The core Hadoop project consists of a way to store data, known as the Hadoop
Distributed File System, or HDFS, and a way to process the data, called
MapReduce. The key concept is that we split the data up and store it across a
collection of machines, known as a cluster. Then, when we want to process the data,
we process it where it's actually stored. Rather than retrieving the data from a
central server, instead it's already on the cluster, and we can process it in place.
You can add more machines to the cluster (make the cluster bigger) as the amount of
data you're storing grows and, indeed, many people start with just a few machines and
add more as they're needed. The machines in the cluster don't need to be particularly
high-end; although most clusters are built using rack-mount servers, they are
typically mid-range servers rather than top-of-the-range equipment.
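The "split the data up, then process it where it's stored" idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how HDFS works internally: the node names, the two-record block size, and the round-robin placement in `place_blocks` are all made up for illustration, and it leaves out things a real cluster does, such as keeping replicas of each block.

```python
from collections import Counter


def split_into_blocks(records, block_size):
    """Cut the data set into fixed-size blocks."""
    return [records[i:i + block_size]
            for i in range(0, len(records), block_size)]


def place_blocks(blocks, node_names):
    """Assign blocks to nodes round-robin (a stand-in for real placement)."""
    cluster = {name: [] for name in node_names}
    for i, block in enumerate(blocks):
        cluster[node_names[i % len(node_names)]].append(block)
    return cluster


def count_locally(blocks):
    """Work done on one node, using only the blocks stored on that node."""
    counts = Counter()
    for block in blocks:
        counts.update(block)
    return counts


records = ["GET", "POST", "GET", "GET", "PUT", "POST", "GET"]
cluster = place_blocks(split_into_blocks(records, block_size=2),
                       ["node1", "node2", "node3"])

# Each node processes its own blocks; only the small per-node results
# are brought together at the end.
total = Counter()
for node, blocks in cluster.items():
    total += count_locally(blocks)

print(dict(total))
```

The point of the design is that the bulky input never moves: only the per-node counts travel to wherever the final result is assembled.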

Hadoop Ecosystem

Core Hadoop consists of HDFS and MapReduce.
But since the project was first started, an awful lot of other software has grown up around it. And
that's what we call the Hadoop Ecosystem. Some of the software is intended to make it easy to
load data into the Hadoop cluster, while lots of it is designed to make Hadoop easier to use. For
example, as you'll see in the next lesson, writing MapReduce code isn't completely simple. You
need to know a programming language like Java, or Python, or Ruby, or Perl. But there are lots
of folks out there who aren't programmers but who can write SQL queries to access data in a
traditional relational database like SQL Server. And of course a lot of business intelligence tools
also want to hook into Hadoop.

For that reason, other open source projects have been created to make it easier for people to
query their data without knowing how to code. Two key ones are Hive and Pig. Instead of
having to write Mappers and Reducers, in Hive you just write statements, which look very much
like standard SQL. The Hive interpreter turns that SQL into MapReduce code, which it then runs
on the cluster. And an alternative is Pig, which allows you to write code to analyze your data in
a fairly simple scripting language rather than MapReduce. Again, the code is turned
into actual Java MapReduce and run on the cluster.
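To make the relationship between a SQL-style query and Mappers and Reducers more concrete, here is a rough single-machine sketch of the work behind a query such as `SELECT page, COUNT(*) FROM hits GROUP BY page`. It is plain Python written for illustration, not the code Hive generates; the `hits` table, the two-field log format, and the `run_job` helper are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (page, 1) pair for each log line of the form '<ip> <page>'."""
    ip, page = line.split()
    yield page, 1


def reducer(page, counts):
    """Sum all the counts that were emitted for one page."""
    return page, sum(counts)


def run_job(lines):
    # Map phase: apply the mapper to every input record.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle and sort: bring pairs with the same key next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(key, (count for _, count in group))
            for key, group in groupby(pairs, key=itemgetter(0))]


log = [
    "10.0.0.1 /index.html",
    "10.0.0.2 /cart",
    "10.0.0.1 /cart",
    "10.0.0.3 /index.html",
    "10.0.0.2 /cart",
]

result = run_job(log)
print(result)
```

On a real cluster the map and reduce calls would run on many machines at once; here they run in one process so the three phases are easy to follow.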

Hive and Pig are great, but they're still running MapReduce jobs, which means they will take a
reasonable amount of time, especially when running on really large amounts of data. So another
open source project called Impala was developed, which again allows you to query your data
using SQL, but which directly accesses that data, rather than accessing it via MapReduce.
Impala is optimized for low-latency queries. In other words, Impala queries run very quickly,
typically many times faster than Hive queries, while Hive is optimized for long-running batch
processing jobs.

Another project used by many people is Sqoop. That takes data from a traditional
relational database server such as Microsoft SQL Server and puts it in HDFS
as delimited files so it can be processed along with the other data on the cluster.
Then there's Flume, which ingests data as it's generated by external systems. HBase
is a real-time database built on top of HDFS. Hue is a graphical front end to the cluster. Oozie is
a workflow management tool. Mahout is a machine learning library.

In fact, there are so many different ecosystem projects that making them all talk to each other,
and work well with each other, can be tricky. To make installing and maintaining a cluster easier,
Cloudera, the company we work for, has put together a distribution of Hadoop called CDH. This
takes all the key ecosystem projects, along with Hadoop itself, and packages them together so
that installation is a really simple process. And the components are all tested together, so you
can be sure that there are no incompatibilities between them. Of course it's completely free and
open source, just like Hadoop itself. You could install everything from scratch yourself, but it's far
easier to use CDH, and that's certainly what we'd recommend. In the next lesson, in fact, you'll
be downloading and running a virtual machine which has CDH installed.

Conclusion
So in this lesson you learned what big data is, and how Hadoop can help with big data
problems. In the next lesson, we'll take a deeper look at the two key parts of Hadoop: that's
HDFS, the Hadoop Distributed File System, and MapReduce, the way you can process that
data.
