Cloudera Data Analyst Training:

Using Pig, Hive, and Impala with Hadoop

© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
201410

Introduction
Chapter 1

Course Chapters

Course Introduction
• Introduction
• Hadoop Fundamentals

Data ETL and Analysis With Pig
• Introduction to Pig
• Basic Data Analysis with Pig
• Processing Complex Data with Pig
• Multi-Dataset Operations with Pig
• Pig Troubleshooting and Optimization

Introduction to Impala and Hive
• Introduction to Impala and Hive
• Querying With Impala and Hive
• Impala and Hive Data Management
• Data Storage and Performance

Data Analysis With Impala and Hive
• Relational Data Analysis With Impala and Hive
• Working with Impala
• Analyzing Text and Complex Data with Hive
• Hive Optimization
• Extending Hive

Course Conclusion
• Choosing the Best Tool for the Job
• Conclusion

Chapter Topics

Introduction | Course Introduction

• About This Course
• About Cloudera
• Course Logistics
• Introductions

Course Objectives (1)

During this course, you will learn
• The purpose of Hadoop and its related tools
• The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
• How to identify typical use cases for large-scale data analysis
• How to load data from relational databases and other sources
• How to manage data in HDFS and export it for use with other systems
• How Pig, Hive, and Impala improve productivity for typical analysis tasks
• The language syntax and data formats supported by these tools

Course Objectives (2)

• How to design and execute queries on data stored in HDFS
• How to join diverse datasets to gain valuable business insight
• How Hive and Impala can be extended with custom functions and scripts
• How to analyze structured, semi-structured, and unstructured data
• How to store and query data for better performance
• How to determine which tool is the best choice for a given task

About Cloudera (1)

• The leader in Apache Hadoop-based software and services
• Founded by leading experts on Hadoop from Facebook, Yahoo, Google, and Oracle
• Provides support, consulting, training, and certification for Hadoop users
• Staff includes committers to virtually all Hadoop projects
• Many authors of industry standard books on Apache Hadoop projects
  – Tom White, Lars George, Kathleen Ting, etc.

About Cloudera (2)

• Customers include many key users of Hadoop
  – Allstate, AOL Advertising, Box, BT, CBS Interactive, eBay, Experian, FICO, Groupon, MasterCard, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, …
• Cloudera public training:
  – Cloudera Developer Training for Apache Hadoop
  – Cloudera Developer Training for Apache Spark
  – Designing and Building Big Data Applications
  – Cloudera Administrator Training for Apache Hadoop
  – Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  – Cloudera Training for Apache HBase
  – Introduction to Data Science: Building Recommender Systems
  – Cloudera Essentials for Apache Hadoop
• Onsite and custom training is also available

CDH

• CDH (Cloudera's Distribution, including Apache Hadoop)
  – 100% open source, enterprise-ready distribution of Hadoop and related projects
  – The most complete, tested, and widely-deployed distribution of Hadoop
  – Integrates all key Hadoop ecosystem projects
  – Available as RPMs and Ubuntu/Debian/SuSE packages or as a tarball

Cloudera Express

• Cloudera Express
  – Free download
• The best way to get started with Hadoop
• Includes CDH
• Includes Cloudera Manager
  – End-to-end administration for Hadoop
  – Deploy, manage, and monitor your cluster

Cloudera Enterprise

• Cloudera Enterprise
  – Subscription product including CDH and Cloudera Manager
• Includes support
• Includes extra Cloudera Manager features
  – Configuration history and rollbacks
  – Rolling updates
  – LDAP integration
  – SNMP support
  – Automated disaster recovery
  – Etc.

Logistics

• Class start and finish times
• Lunch
• Breaks
• Restrooms
• Wi-Fi access
• Virtual machines
• Can I come in early/stay late?

Your instructor will give you details on how to access the course materials and exercise instructions for the class

Introductions

• About your instructor
• About you
  – Where do you work and what do you do there?
  – Which database(s) and platform(s) do you use?
  – Have you worked with Apache Hadoop or related tools?
  – Any experience as a developer?
  – What programming languages do you use?
  – What are your expectations for this course?

Hadoop Fundamentals
Chapter 2

Hadoop Fundamentals

In this chapter, you will learn
• Which factors led to the era of Big Data
• What Hadoop is and what significant features it offers
• How Hadoop offers reliable storage for massive amounts of data with HDFS
• How Hadoop supports large-scale data processing through MapReduce
• How ‘Hadoop Ecosystem’ tools can boost an analyst’s productivity
• Several ways to integrate Hadoop into the modern data center

Chapter Topics

Hadoop Fundamentals | Course Introduction

• The Motivation for Hadoop
• Hadoop Overview
• Data Storage: HDFS
• Distributed Data Processing: YARN, MapReduce, and Spark
• Data Processing and Analysis: Pig, Hive, and Impala
• Database Integration: Sqoop
• Other Hadoop Data Tools
• Exercise Scenario Explanation
• Conclusion
• Hands-On Exercise: Data Ingest with Hadoop Tools

Velocity

• We are generating data faster than ever
  – Processes are increasingly automated
  – Systems are increasingly interconnected
  – People are increasingly interacting online

Variety

• We are producing a wide variety of data
  – Social network connections
  – Server and application log files
  – Electronic medical records
  – Images, audio, and video
  – RFID and wireless sensor network events
  – Product ratings on shopping and review Web sites
  – And much more…
• Not all of this maps cleanly to the relational model

Volume

• Every day…
  – More than 1.5 billion shares are traded on the New York Stock Exchange
  – Facebook stores 2.7 billion comments and ‘Likes’
  – Google processes about 24 petabytes of data
• Every minute…
  – Foursquare handles more than 2,000 check-ins
  – TransUnion makes nearly 70,000 updates to credit files
• And every second…
  – Banks process more than 10,000 credit card transactions

Data Has Value

• This data has many valuable applications
  – Product recommendations
  – Predicting demand
  – Marketing analysis
  – Fraud detection
  – And many, many more…
• We must process it to extract that value
  – And processing all the data can yield more accurate results

We Need a System that Scales

• We’re generating too much data to process with traditional tools
• Two key problems to address
  – How can we reliably store large amounts of data at a reasonable cost?
  – How can we analyze all the data we have stored?

What is Apache Hadoop?

• Scalable and economical data storage and processing
  – Distributed and fault-tolerant
  – Harnesses the power of industry standard hardware
• Heavily inspired by technical documents published by Google

[Diagram: the Hadoop stack, top to bottom.
 Processing: Batch Processing (MapReduce, Hive, Pig); Analytic SQL (Impala); Search Engine (Cloudera Search); Machine Learning (Spark, Mahout); Stream Processing (Spark); Other Applications.
 Workload Management (YARN).
 Data Storage: Filesystem (HDFS); Online NoSQL (HBase).
 Data Integration (Sqoop, Flume).]

Scalability

• Hadoop is a distributed system
  – A collection of servers running Hadoop software is called a cluster
• Individual servers within a cluster are called nodes
  – Typically standard rackmount servers running Linux
  – Each node both stores and processes data
• Add more nodes to the cluster to increase scalability
  – A cluster may contain up to several thousand nodes
  – You can scale out incrementally as required

Fault Tolerance

• Paradox: Adding nodes increases the chance that any one of them will fail
  – Solution: build redundancy into the system and handle it automatically
• Files loaded into HDFS are replicated across nodes in the cluster
  – If a node fails, its data is re-replicated using one of the other copies
• Data processing jobs are broken into individual tasks
  – Each task takes a small amount of data as input
  – Thousands of tasks (or more) often run in parallel
  – If a node fails during processing, its tasks are rescheduled elsewhere
• Routine failures are handled automatically without any loss of data
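
The re-replication idea above can be illustrated with a toy simulation. This is a minimal sketch in plain Python, not HDFS code: the function names `place_replicas` and `re_replicate`, the node and block names, and the replication factor of 3 are all assumptions made for illustration (the slide does not state a replication factor or a placement policy).

```python
import random


def place_replicas(blocks, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (toy placement policy)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in blocks}


def re_replicate(placement, failed_node, nodes, replication=3, seed=1):
    """Drop the failed node, then copy under-replicated blocks to other live nodes."""
    rng = random.Random(seed)
    live = [n for n in nodes if n != failed_node]
    for holders in placement.values():
        holders.discard(failed_node)
        candidates = [n for n in live if n not in holders]
        while len(holders) < replication and candidates:
            # A surviving copy acts as the source; pick any live node lacking the block.
            target = rng.choice(candidates)
            candidates.remove(target)
            holders.add(target)
    return placement


# Hypothetical five-node cluster holding three blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(["blk_a", "blk_b", "blk_c"], nodes)
placement = re_replicate(placement, "node1", nodes)
```

After the simulated failure of `node1`, every block is back to three copies on the surviving nodes, which is the "no loss of data" property the slide describes.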

HDFS: Hadoop Distributed File System

• HDFS provides the storage layer for Hadoop data processing
• Provides inexpensive and reliable storage for massive amounts of data
• Other Hadoop components work with data in HDFS
  – MapReduce, Impala, Hive, Pig, Spark, etc.

[Diagram: the Hadoop stack. Processing applications (MapReduce, Hive, Pig, Impala, Cloudera Search, Spark, Mahout) sit on Workload Management (YARN), which sits on Data Storage (HDFS filesystem, HBase online NoSQL), with Data Integration (Sqoop, Flume) below.]

HDFS Features

• Optimized for sequential access to a relatively small number of large files
  – Each file is likely to be 100MB or larger
  – Multi-gigabyte files are typical
• In some ways, HDFS is similar to a UNIX filesystem
  – Hierarchical, with UNIX-style paths (e.g., /sales/rpt/asia.txt)
  – UNIX-style file ownership and permissions
• There are also some major deviations from UNIX
  – No concept of a current directory
  – Cannot modify files once written
  – Must use Hadoop-specific utilities or custom code to access HDFS

HDFS Architecture

• Hadoop has a master/slave architecture
• HDFS master daemon: NameNode
  – Manages namespace and metadata
  – Monitors slave nodes
• HDFS slave daemon: DataNode
  – Reads and writes the actual data

[Diagram: A Small Hadoop Cluster. The master node runs the HDFS master daemon; the slave nodes run the HDFS slave daemons.]

Accessing HDFS via the Command Line

• HDFS is not a general purpose filesystem
  – Not built into the OS, so only specialized tools can access it
  – End users typically access HDFS via the hdfs dfs command
• Example: display the contents of the /user/fred/sales.txt file

$ hdfs dfs -cat /user/fred/sales.txt

• Example: create a directory (below the root) called reports

$ hdfs dfs -mkdir /reports

Copying Local Data To and From HDFS

• Remember that HDFS is distinct from your local filesystem
  – Use hdfs dfs -put to copy local files to HDFS
  – Use hdfs dfs -get to fetch a local copy of a file from HDFS

[Diagram: a client machine and a Hadoop cluster. "$ hdfs dfs -put file" copies a file from the client to the cluster; "$ hdfs dfs -get file" copies a file from the cluster to the client.]

More hdfs dfs Command Examples

• Copy file input.txt from local disk to the user’s directory in HDFS

$ hdfs dfs -put input.txt input.txt

  – This will copy the file to /user/username/input.txt
• Get a directory listing of the HDFS root directory

$ hdfs dfs -ls /

• Delete the file /reports/sales.txt

$ hdfs dfs -rm /reports/sales.txt
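
The first example above relies on a relative HDFS path ending up under the user's home directory. The following is a minimal sketch of that resolution rule in plain Python; `resolve_hdfs_path` is a hypothetical helper written for illustration, not part of any Hadoop tool, and the username `fred` is borrowed from the earlier /user/fred/sales.txt example.

```python
import posixpath


def resolve_hdfs_path(path, username):
    """Resolve a path as described on the slide: relative paths land under /user/<username>."""
    if path.startswith("/"):
        # Absolute paths are used as given.
        return posixpath.normpath(path)
    # There is no current directory in HDFS, so relative paths start at the home directory.
    return posixpath.normpath(posixpath.join("/user", username, path))


home_relative = resolve_hdfs_path("input.txt", "fred")
absolute = resolve_hdfs_path("/reports/sales.txt", "fred")
```

Here `home_relative` corresponds to the /user/username/input.txt destination mentioned above, while the absolute path is left unchanged.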

Using the Hue HDFS File Manager

• Hue is a Web interface for Hadoop
  – Hadoop User Experience
• Hue includes an application for browsing and managing files in HDFS
  – To use Hue, browse to http://hue_server:8888/

[Screenshot: the Hue file browser, with callouts for Manage Files, Upload Files, and Browse Files.]

Workload Management: YARN

• Many Hadoop tools work with data in a Hadoop cluster
• Requires workload management to distribute and monitor work across the cluster

[Diagram: the Hadoop stack. Processing applications (MapReduce, Hive, Pig, Impala, Cloudera Search, Spark, Mahout) sit on Workload Management (YARN or MapReduce 1), which sits on Data Storage (HDFS filesystem, HBase online NoSQL), with Data Integration (Sqoop, Flume) below.]

Hadoop Cluster Architecture

• Master/Slave Architecture
  – YARN or MapReduce version 1
  – Details differ slightly
• Master nodes
  – Run master daemons to accept jobs, and monitor and distribute work
• Slave nodes
  – Run slave daemons to start tasks
  – Do the actual work
  – Report status back to master daemons
• HDFS and YARN/MRv1 are collocated
  – Slave nodes run both HDFS and YARN/MRv1 slave daemons on the same machines

[Diagram: A Small Hadoop Cluster. The master node runs the YARN master daemon and the HDFS master daemon; the slave nodes run the YARN slave daemons and the HDFS slave daemons.]

General Data Processing

• Hadoop includes two general data processing engines
  – MapReduce
  – Spark
• Both are programming libraries (Java, Scala, Python…)

[Diagram: the Hadoop stack. Processing applications (MapReduce, Hive, Pig, Impala, Cloudera Search, Spark, Mahout) sit on Workload Management (YARN or MapReduce), which sits on Data Storage (HDFS filesystem, HBase online NoSQL), with Data Integration (Sqoop, Flume) below.]

Hadoop MapReduce

• Hadoop MapReduce was the original processing engine for Hadoop
  – Still the most commonly used general data processing engine
• Based on the ‘map-reduce’ programming model
  – A style of processing data popularized by Google
• Provides a set of programming libraries
  – Primarily supports Java
  – Streaming MapReduce provides (limited) support for scripting languages such as Python
• Benefits of Hadoop MapReduce
  – Simplicity
  – Flexibility
  – Scalability
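
The ‘map-reduce’ programming model mentioned above can be sketched without a cluster. The following is a minimal single-process illustration in plain Python, not Hadoop MapReduce code: the function names and the word-count task are assumptions chosen for illustration. It shows the three conceptual steps: map each record to key/value pairs, group the values by key, and reduce each group to one result.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["pig hive impala", "hive impala", "impala"])
```

On a real cluster the map and reduce calls would run as separate tasks on different nodes; here they run in one process purely to show the data flow.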

Apache Spark

• The next generation general data processing engine
• Builds on the same ‘map-reduce’ programming model as Hadoop MapReduce
• Originally developed at AMP Lab at UC Berkeley
• Spark supports Scala, Java, and Python
• Spark has the same benefits as MapReduce, plus…
  – Improved performance using in-memory processing
  – Higher level programming model to speed up development

Data Processing and Analysis with Hadoop (1)

• Hadoop MapReduce and Spark are powerful data processing engines but…
  – Hard to master
  – Require programming skills
  – Slow to develop, hard to maintain
• Hadoop includes several other tools for data processing and analysis
  – Tools for data analysts, not programmers

Data Processing and Analysis with Hadoop (2)

• Higher level abstractions for general data processing
  – Pig, Hive
• Specialized processing engines for interactive analysis
  – Impala, Search

[Diagram: layers, top to bottom.
 Language: Pig Latin; Impala/HiveQL; Natural Language.
 Data Platform: Pig; Hive; Impala; Search.
 Data Processing Engine: MapReduce, Spark, etc.
 Data Storage: HDFS.]

Apache Pig

• Apache Pig builds on Hadoop to offer high-level data processing
  – This is an alternative to writing low-level MapReduce code
  – Pig is especially good at joining and transforming data

people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;

• The Pig interpreter runs on the client machine
  – Turns Pig Latin scripts into MapReduce jobs
  – Submits those jobs to the cluster
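
To make the data flow of the Pig Latin script above concrete, here is a rough single-machine equivalent in plain Python. It is a sketch for illustration only: the function name `customer_totals` and the sample rows are hypothetical, customer IDs are assumed to be unique, and the output is sorted by customer ID for determinism (Pig itself makes no such ordering promise for this script).

```python
from collections import defaultdict


def customer_totals(customers, orders):
    """Mirror the script: GROUP orders BY cust_id, SUM(cost), then JOIN to customers."""
    totals = defaultdict(float)
    for _ord_id, cust_id, cost in orders:
        # Equivalent of the GROUP ... BY cust_id and SUM(orders.cost) steps.
        totals[cust_id] += cost
    names = dict(customers)  # cust_id -> name, assuming unique customer IDs
    # Inner join: keep only customer IDs present on both sides.
    return [(cust_id, total, cust_id, names[cust_id])
            for cust_id, total in sorted(totals.items())
            if cust_id in names]


# Hypothetical sample data shaped like the (cust_id, name) and (ord_id, cust_id, cost) schemas.
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
orders = [(100, 1, 25.0), (101, 2, 10.0), (102, 1, 5.0)]
result = customer_totals(customers, orders)
```

Each output tuple has the shape (group, t, cust_id, name), matching the fields the JOIN would produce; Carol has no orders, so the inner join drops her.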

Apache Hive

• Hive is another abstraction on top of Hadoop
  – Like Pig, it also reduces development time
  – Hive uses a SQL-like language called HiveQL

SELECT customers.cust_id, SUM(cost) AS total
FROM customers
JOIN orders
ON (customers.cust_id = orders.cust_id)
GROUP BY customers.cust_id
ORDER BY total DESC;

• A Hive Server runs on a master node
  – Turns HiveQL queries into MapReduce jobs
  – Submits those jobs to the cluster
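
Because the query above uses only common SQL constructs, its logic can be tried on a small sample without a cluster. The sketch below runs the same statement against an in-memory SQLite database using Python's standard `sqlite3` module. SQLite is only a stand-in here, not Hive; the table definitions and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schemas and rows; the slide does not define the tables.
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, name TEXT);
    CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER, cost INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (100, 1, 25), (101, 2, 40), (102, 1, 5);
""")

# The query text matches the slide, minus the trailing semicolon.
rows = conn.execute("""
    SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON (customers.cust_id = orders.cust_id)
    GROUP BY customers.cust_id
    ORDER BY total DESC
""").fetchall()
conn.close()
```

The result lists each customer ID with its order total, largest total first. On a Hadoop cluster the same statement would instead be turned into MapReduce jobs, as the slide notes.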

Cloudera Impala

• Massively parallel SQL engine which runs on a Hadoop cluster
  – Inspired by Google’s Dremel project
  – Can query data stored in HDFS or HBase tables
• Uses Impala SQL
  – Very similar to HiveQL
• High performance
  – Typically at least 10 times faster than Hive or MapReduce
  – High-level query language (subset of SQL-92)
• Impala is 100% Apache-licensed open source

Apache Sqoop

• Sqoop exchanges data between an RDBMS and Hadoop
• It can import all tables, a single table, or a portion of a table into HDFS
  – Does this very efficiently via a Map-only MapReduce job
  – Result is a directory in HDFS containing comma-delimited text files
• Sqoop can also export data from HDFS back to the database

[Diagram: Sqoop moves data in both directions between a database and a Hadoop cluster.]

Importing Tables with Sqoop

• This example imports the customers table from a MySQL database
  – Will create /mydata/customers directory in HDFS
  – Directory will contain comma-delimited text files

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table customers

• Adding the --direct option may offer better performance
  – Uses database-specific tools instead of JDBC
  – This option is not compatible with all databases
• High-performance custom connectors for some databases
  – Netezza, Teradata, MySQL…

Importing An Entire Database with Sqoop

• Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --fields-terminated-by '\t' \
    --warehouse-dir /mydata

Importing Partial Tables with Sqoop

• Import only specified columns from products table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table products \
    --columns "prod_id,name,price"

• Import only matching rows from products table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table products \
    --where "price >= 1000"

Incremental Imports with Sqoop

• What if new records are added to the database?
  – Could re-import all records, but this is inefficient
• Sqoop’s incremental append mode imports only new records
  – Based on value of last record in specified column

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table orders \
    --incremental append \
    --check-column order_id \
    --last-value 6713821
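
The selection rule behind incremental append can be sketched in a few lines of Python. This is an illustration of the idea only, not Sqoop's implementation: `incremental_append` is a hypothetical helper and the sample rows (including the `cost` field) are made up. It keeps the rows whose check column is greater than the last imported value and reports the value to use for the next run.

```python
def incremental_append(source_rows, check_column, last_value):
    """Return rows whose check column exceeds last_value, plus the new high-water mark."""
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    # If nothing is new, the high-water mark stays where it was.
    new_last = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last


# Hypothetical contents of the orders table at import time.
orders = [
    {"order_id": 6713820, "cost": 1999},
    {"order_id": 6713821, "cost": 4500},
    {"order_id": 6713822, "cost": 1250},
    {"order_id": 6713823, "cost": 799},
]
new_rows, next_last_value = incremental_append(orders, "order_id", 6713821)
```

Only the two orders above 6713821 are selected, and `next_last_value` is what a following run would pass as its last value.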

Handling Modifications with Incremental Imports

• What if existing records are also modified in the database?
  – Incremental append mode doesn’t handle this
• In CDH 5.2 and later, Sqoop’s lastmodified append mode adds and updates records
  – Caveat: You must maintain a timestamp column in your table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table shipments \
    --incremental lastmodified \
    --check-column last_update_date \
    --last-value "2013-06-12 03:15:59"
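
The "adds and updates" behavior can be pictured as an upsert driven by the timestamp column. The sketch below is a conceptual illustration in plain Python, not Sqoop's implementation: `import_last_modified`, the `shipment_id` key, and the `status` field are hypothetical names introduced here, and how Sqoop actually reconciles updated rows is not described on the slide.

```python
def import_last_modified(existing, source_rows, key, check_column, last_value):
    """Upsert rows modified after last_value into a copy of `existing` (a dict keyed by `key`)."""
    merged = dict(existing)
    for row in source_rows:
        # "YYYY-MM-DD HH:MM:SS" strings sort chronologically, so plain comparison works.
        if row[check_column] > last_value:
            merged[row[key]] = row  # adds a new record or replaces the older version
    return merged


# Hypothetical state of the previously imported data, keyed by shipment_id.
existing = {
    1: {"shipment_id": 1, "status": "delivered", "last_update_date": "2013-06-10 09:00:00"},
    2: {"shipment_id": 2, "status": "in transit", "last_update_date": "2013-06-11 14:30:00"},
}
# Hypothetical current contents of the shipments table.
source = [
    {"shipment_id": 1, "status": "delivered", "last_update_date": "2013-06-10 09:00:00"},
    {"shipment_id": 2, "status": "delivered", "last_update_date": "2013-06-13 08:45:00"},
    {"shipment_id": 3, "status": "in transit", "last_update_date": "2013-06-14 16:20:00"},
]
merged = import_last_modified(existing, source, "shipment_id",
                              "last_update_date", "2013-06-12 03:15:59")
```

Shipment 1 is untouched, shipment 2 is replaced by its newer version, and shipment 3 is added. This also shows why the timestamp column is required: without it there is no way to tell which existing rows changed.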

Exporting Data from Hadoop to RDBMS with Sqoop

• We have seen several ways to pull records from an RDBMS into Hadoop
  – It is sometimes also helpful to push data in Hadoop back to an RDBMS
• Sqoop supports this via export

$ sqoop export \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --export-dir /mydata/recommender_output \
    --table product_recommendations

Apache HBase

• HBase is “the Hadoop database”
• Can store massive amounts of data
  – Gigabytes, terabytes, and even petabytes of data in a table
  – Tables can have many thousands of columns
• Scales to provide very high write throughput
  – Hundreds of thousands of inserts per second
• Fairly primitive when compared to an RDBMS
  – NoSQL: There is no high-level query language
  – Use API to scan / get / put values based on keys
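
To give a feel for key-based access in place of a query language, here is a toy in-memory table in plain Python. It is emphatically not the HBase client API: the class `ToyTable`, its method signatures, and the sample keys are all assumptions made for illustration. It only mimics the three operations named above, with rows kept in key order so that a scan covers a contiguous key range.

```python
import bisect


class ToyTable:
    """In-memory stand-in for key-based access; not the HBase client API."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, key, column, value):
        """Store one column value under a row key, creating the row if needed."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key][column] = value

    def get(self, key):
        """Return the row stored under an exact key, or None if absent."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Yield (key, row) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ToyTable()
table.put("cust0002", "name", "Bob")
table.put("cust0001", "name", "Alice")
table.put("order0001", "cost", "25")
scanned = [key for key, _row in table.scan("cust", "cust~")]
```

There is no way to ask this table "which customers are named Bob" without scanning; everything is driven by the row key, which is the trade-off the slide calls "fairly primitive" next to an RDBMS.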

Apache Flume

• Flume imports data into HDFS as it is being generated by various sources

[Diagram: log files, UNIX syslog, program output, custom sources, and many more feed through Flume into the Hadoop cluster.]

Recap: Data Center Integration

Hands-On Exercises: Scenario Explanation

• Hands-On Exercises throughout the course will reinforce the topics being discussed
  – Exercises simulate the kind of tasks often performed using the tools you will learn about in class
  – Most exercises depend on data generated in earlier exercises
• Scenario: Dualcore Inc. is a leading electronics retailer
  – More than 1,000 brick-and-mortar stores
  – Dualcore also has a thriving e-commerce Web site
• Dualcore has hired you to help find value in its data
  – You will process and analyze data from internal and external sources
  – Identify opportunities to increase revenue
  – Find new ways to reduce costs
  – Help other departments achieve their goals

EssenDal"Points"

! We%are%genera7ng%more%data%–%and%faster%–%than%ever%before%
! Most%of%this%data%maps%poorly%to%structured%rela7onal%tables%
! The%ability%to%store%and%process%this%data%can%yield%valuable%insight%
! Hadoop%offers%scalable%data%storage%and%processing%%
! There%are%lots%of%tools%in%the%Hadoop%ecosystem%that%help%you%to%integrate%
Hadoop%with%other%systems,%manage%complex%jobs,%and%ease%analysis%

Bibliography

The following offer more information on topics discussed in this chapter
• 10 Hadoopable Problems (recorded presentation)
  – http://tiny.cloudera.com/dac02a
• Guide to HDFS Commands
  – http://tiny.cloudera.com/hdfscommands
• Hadoop: The Definitive Guide, 3rd Edition (O’Reilly book)
  – http://tiny.cloudera.com/hadooptdg
• Sqoop User Guide
  – http://tiny.cloudera.com/sqoopuser
• Spark Documentation
  – http://tiny.cloudera.com/sparkdoc


About the Training Virtual Machine

• During this course, you will perform numerous hands-on exercises using the Cloudera Training Virtual Machine (VM)
• The VM has Hadoop installed in pseudo-distributed mode
  – A cluster consisting of a single node
  – Typically used for testing code before deploying to a large cluster

Hands-On Exercise: Data Ingest with Hadoop Tools

• In this Hands-On Exercise, you will gain practice adding data from the local filesystem and a relational database server to HDFS
  – You will analyze this data in subsequent exercises
• Please refer to the Hands-On Exercise Manual for instructions

Introduction to Pig
Chapter 3


Introduction to Pig

In this chapter, you will learn
• The key features Pig offers
• How organizations use Pig for data processing and analysis
• How to use Pig interactively and in batch mode

Chapter Topics

Introduction to Pig (part of Data ETL and Analysis With Pig)

• What is Pig?
• Pig’s Features
• Pig Use Cases
• Interacting with Pig
• Conclusion

Apache Pig Overview

• Apache Pig is a platform for data analysis and processing on Hadoop
  – It offers an alternative to writing MapReduce code directly
• Originally developed as a research project at Yahoo
  – Goals: flexibility, productivity, and maintainability
  – Now an open-source Apache project

The Anatomy of Pig

• Main components of Pig
  – The data flow language (Pig Latin)
  – The interactive shell where you can type Pig Latin statements (Grunt)
  – The Pig interpreter and execution engine

[Diagram: Pig Latin Script → Pig Interpreter / Execution Engine → MapReduce Jobs]

Pig Latin Script:

AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';

Pig Interpreter / Execution Engine:
  – Preprocess and parse Pig Latin
  – Check data types
  – Make optimizations
  – Plan execution
  – Generate MapReduce jobs
  – Submit job(s) to Hadoop
  – Monitor progress

Where to Get Pig

• CDH is the easiest way to install Hadoop and Pig
  – A Hadoop distribution which includes HDFS, MapReduce, Spark, Pig, Hive, Impala, Sqoop, HBase, and other Hadoop ecosystem components
  – Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  – Simple installation
  – 100% free and open source
• Installation is outside the scope of this course
  – Cloudera offers a training course for System Administrators, Cloudera Administrator Training for Apache Hadoop


Pig Features

• Pig is an alternative to writing low-level MapReduce code
• Many features enable sophisticated analysis and processing
  – HDFS manipulation
  – UNIX shell commands
  – Relational operations
  – Positional references for fields
  – Common mathematical functions
  – Support for custom functions and data formats
  – Complex data structures


How Are Organizations Using Pig?

• Many organizations use Pig for data analysis
  – Finding relevant records in a massive data set
  – Querying multiple data sets
  – Calculating values from input data
• Pig is also frequently used for data processing
  – Reorganizing an existing data set
  – Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization

• Pig can help you extract valuable information from Web server log files

Web Server Log Data

...


10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"
...

[Diagram: Web Server Log Data → Process Logs → Clickstream Data for User Sessions]

Recent Activity for John Smith
  May 3, 2013: Search for 'Widget', Widget Results, Details for Widget X, Order Widget X
  May 12, 2013: Track Order, Contact Us, Send Complaint
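The idea of turning raw log lines into per-visitor sessions can be sketched in plain Python. This is not Pig and not the course's own solution; it only illustrates the grouping logic. The choice to identify a visitor by IP address, the 30-minute gap that starts a new session, and the names `LINE_RE`, `SESSION_GAP`, and `sessionize` are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Three of the sample log lines shown on the slide.
LOG_LINES = [
    '10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"',
    '10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"',
    '10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"',
]

# Captures the client IP, the bracketed timestamp, and the requested path.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) ')

# Assumption: a pause longer than 30 minutes starts a new session.
SESSION_GAP = timedelta(minutes=30)


def sessionize(lines):
    """Group requested paths by client IP, splitting on long pauses."""
    by_client = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip anything that does not look like a log record
        ip, stamp, path = m.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        by_client[ip].append((when, path))

    sessions = {}
    for ip, hits in by_client.items():
        hits.sort()  # order each client's requests by time
        grouped = [[hits[0]]]
        for hit in hits[1:]:
            if hit[0] - grouped[-1][-1][0] > SESSION_GAP:
                grouped.append([hit])      # long pause: new session
            else:
                grouped[-1].append(hit)    # same session continues
        sessions[ip] = [[path for _, path in s] for s in grouped]
    return sessions


sessions = sessionize(LOG_LINES)
```

A real job would more likely key on the user cookie (the `U=129` field) than on the IP address, and would run as a distributed job over the full log rather than in a single process.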

Use Case: Data Sampling

• Sampling can help you explore a representative portion of a large data set
  – Allows you to examine this portion with tools that do not scale well
  – Supports faster iterations during development of analysis jobs

[Diagram: 100 TB → Random Sampling → 50 MB]
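Random sampling of the kind described above can be sketched in a few lines of Python. This is only an illustration of the idea (keep each record with a fixed probability), not Pig code; the function name `sample` and the `seed` parameter are assumptions introduced here.

```python
import random


def sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # a private generator keeps runs repeatable
    return [r for r in records if rng.random() < fraction]


records = list(range(100_000))
subset = sample(records, 0.01, seed=42)  # roughly 1% of the input
```

Because each record is kept or dropped independently, the size of the result is only approximately `fraction` times the input size.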

Use Case: ETL Processing

• Pig is also widely used for Extract, Transform, and Load (ETL) processing

[Diagram: Operations, Accounting, and Call Center data → Pig Jobs Running on Hadoop Cluster (Validate data, Fix errors, Remove duplicates, Encode values) → Data Warehouse]


Using Pig Interactively

• You can use Pig interactively, via the Grunt shell
  – Pig interprets each Pig Latin statement as you type it
  – Execution is delayed until output is required
  – Very useful for ad hoc data inspection
• Example of how to start, use, and exit Grunt

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;

• You can also execute a Pig Latin statement from the UNIX shell via the -e option
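The point that execution is delayed until output is required can be illustrated with Python generators, which also do no work until something consumes them. This is an analogy only, not a description of how Pig is implemented; the helper names `load` and `keep_big` and the `log` list are hypothetical.

```python
log = []  # records each time a row is actually read


def load(rows):
    """Yield rows one at a time, noting when each read really happens."""
    for row in rows:
        log.append("read")
        yield row


def keep_big(rows, minimum):
    """Pass through only rows whose price exceeds `minimum`."""
    for name, price in rows:
        if price > minimum:
            yield (name, price)


# Defining the pipeline does no work yet, much like typing LOAD and FILTER.
allsales = load([("Alice", 2999), ("Bob", 50)])
bigsales = keep_big(allsales, 100)
work_before_output = len(log)

# Asking for output (compare STORE or DUMP) is what triggers the work.
result = list(bigsales)
work_after_output = len(log)
```

Here `work_before_output` is 0 and `work_after_output` is 2: nothing is read until the result is requested.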

Interacting with HDFS

• You can manipulate HDFS with Pig, via the fs command

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> fs -getmerge myreport/ bigsales.txt;

Interacting with UNIX

• The sh command lets you run UNIX programs from Pig

grunt> sh date;
Wed Nov 12 06:39:13 PST 2014
grunt> fs -ls; -- lists HDFS files
grunt> sh ls; -- lists local files

Running Pig Scripts

• A Pig script is simply Pig Latin code stored in a text file
  – By convention, these files have the .pig extension
• You can run a Pig script from within the Grunt shell via the run command
  – This is useful for automation and batch execution

grunt> run salesreport.pig;

• It is common to run a Pig script directly from the UNIX shell

$ pig salesreport.pig

MapReduce and Local Modes

• As described earlier, Pig turns Pig Latin into MapReduce jobs
  – Pig submits those jobs for execution on the Hadoop cluster
• It is also possible to run Pig in ‘local mode’ using the -x flag
  – This runs jobs on the local machine instead of the cluster
  – Local mode uses the local filesystem instead of HDFS
  – Can be helpful for testing before deploying a job to production

$ pig -x local                   # interactive

$ pig -x local salesreport.pig   # batch

Client-Side Log Files

• If a job fails, Pig may produce a log file to explain why
  – These log files are typically produced in your current working directory
  – On the local (client) machine


Essential Points

• Pig offers an alternative to writing MapReduce code directly
  – Pig interprets Pig Latin code in order to create MapReduce jobs
  – It then submits these jobs to the Hadoop cluster
• You can execute Pig Latin code interactively through Grunt
  – Pig delays job execution until output is required
• It is also common to store Pig Latin code in a script for batch execution
  – Allows for automation and code reuse

Bibliography

The following offer more information on topics discussed in this chapter
• Apache Pig Web Site
  – http://pig.apache.org/
• Process a Million Songs with Apache Pig
  – http://tiny.cloudera.com/dac03a
• Powered By Pig
  – http://tiny.cloudera.com/poweredbypig
• LinkedIn: User Engagement Powered By Apache Pig and Hadoop
  – http://tiny.cloudera.com/dac03c
• Programming Pig (O’Reilly book)
  – http://tiny.cloudera.com/programmingpig

Basic Data Analysis with Pig
Chapter 4


Basic Data Analysis with Pig

In this chapter, you will learn
• The basic syntax of Pig Latin
• How to load and store data using Pig
• Which simple data types Pig uses to represent data
• How to sort and filter data in Pig
• How to use many of Pig’s built-in functions for data processing

Chapter Topics

Basic Data Analysis with Pig (part of Data ETL and Analysis With Pig)

• Pig Latin Syntax
• Loading Data
• Simple Data Types
• Field Definitions
• Data Output
• Viewing the Schema
• Filtering and Sorting Data
• Commonly-used Functions
• Conclusion
• Hands-On Exercise: Using Pig for ETL Processing

Pig Latin Overview

• Pig Latin is a data flow language
  – The flow of data is expressed as a sequence of statements
• The following is a simple Pig Latin script to load, filter, and store data

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
* Save the filtered results into a new
* directory, below my home directory.
*/
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Keywords

• The Pig Latin keywords in this example are LOAD, AS, FILTER, BY, STORE, and INTO
  – Keywords are reserved: you cannot use them to name things

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
* Save the filtered results into a new
* directory, below my home directory.
*/
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Identifiers (1)

• Identifiers are the names assigned to fields and other data structures
  – In this example: allsales, bigsales, name, and price

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
* Save the filtered results into a new
* directory, below my home directory.
*/
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Identifiers (2)

• Identifiers must conform to Pig’s naming rules
• An identifier must always begin with a letter
  – This may only be followed by letters, numbers, or underscores

Valid:    x    q1      q1_2013  MyData
Invalid:  4    price$  profit%  _sale
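The naming rule above (a letter first, then letters, numbers, or underscores) can be expressed as a regular expression. The sketch below is plain Python, not part of Pig; `is_valid_identifier` is a hypothetical helper, and it deliberately does not check the separate rule that reserved keywords cannot be used as names.

```python
import re

# A letter, followed by any number of letters, digits, or underscores.
IDENTIFIER_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_identifier(name):
    """Return True if `name` follows the naming rule stated on the slide."""
    return IDENTIFIER_RE.fullmatch(name) is not None


valid = [n for n in ["x", "q1", "q1_2013", "MyData"] if is_valid_identifier(n)]
invalid = [n for n in ["4", "price$", "profit%", "_sale"] if not is_valid_identifier(n)]
```

All four names in the slide's "Valid" row pass the check, and all four in the "Invalid" row fail it.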

Pig Latin Grammar: Comments

• Pig Latin supports two types of comments
  – Single line comments begin with --
  – Multi-line comments begin with /* and end with */

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
* Save the filtered results into a new
* directory, below my home directory.
*/
STORE bigsales INTO 'myreport';

Case-Sensitivity in Pig Latin

• Whether case is significant in Pig Latin depends on context
• Keywords (here LOAD, AS, FILTER, BY, STORE, and INTO) are not case-sensitive
  – Neither are operators (such as AND, OR, or IS NULL)
• Identifiers and paths (here allsales, bigsales, name, price, 'sales', and 'myreport') are case-sensitive
  – So are function names (such as SUM or COUNT) and constants

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999;

STORE bigsales INTO 'myreport';

Common Operators in Pig Latin

• Many commonly-used operators in Pig Latin are familiar to SQL users
  – Notable difference: Pig Latin uses == and != for comparison

Arithmetic  Comparison  Null         Boolean
+           ==          IS NULL      AND
-           !=          IS NOT NULL  OR
*           <                        NOT
/           >
%           <=
            >=


Basic Data Loading in Pig

• Pig’s default loading function is called PigStorage
  – The name of the function is implicit when calling LOAD
  – PigStorage assumes text format with tab-separated columns
• Consider the following file in HDFS called sales
  – The two fields are separated by tab characters

Alice   2999
Bob     3625
Carlos  2764

• This example loads data from the above file

allsales = LOAD 'sales' AS (name, price);
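What a delimited-text loader does with such a file can be sketched in plain Python. This is an illustration of the idea only, not how PigStorage is implemented; the function `load` and its `delimiter` and `names` parameters are hypothetical names chosen for this sketch.

```python
def load(text, delimiter="\t", names=None):
    """Split each line on the delimiter; attach field names if given."""
    rows = []
    for line in text.splitlines():
        fields = tuple(line.split(delimiter))
        # With names, build name -> value pairs; without, keep positions only.
        rows.append(dict(zip(names, fields)) if names else fields)
    return rows


# The sample file from the slide, with real tab characters between fields.
SALES = "Alice\t2999\nBob\t3625\nCarlos\t2764\n"

allsales = load(SALES, names=("name", "price"))
unnamed = load(SALES)  # fields reachable only by position
```

Every value stays a string here because no types were declared, which loosely mirrors how untyped fields are left as raw bytes until a type is specified.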

Data Sources: File and Directories

• The previous example loads data from a file named sales

allsales = LOAD 'sales' AS (name, price);

• Since this is not an absolute path, it is relative to your home directory
  – Your home directory in HDFS is typically /user/youruserid/
  – Can also specify an absolute path (e.g., /dept/sales/2012/q4)
• The path can also refer to a directory
  – In this case, Pig will recursively load all files in that directory
  – File patterns (“globs”) are also supported

allsales = LOAD 'sales_200[5-9]' AS (name, price);

Specifying Column Names During Load

• The previous example also assigns names to each column

allsales = LOAD 'sales' AS (name, price);

• Assigning column names is not required
  – Omitting them can be useful when exploring a new dataset
  – Refer to fields by position ($0 is first, $1 is second, $53 is 54th, etc.)

allsales = LOAD 'sales';

Using Alternate Column Delimiters

• You can specify an alternate delimiter as an argument to PigStorage
• This example shows how to load comma-delimited data
  – Note that this is a single statement

allsales = LOAD 'sales.csv' USING PigStorage(',') AS
    (name, price);

! Or%to%load%pipe#delimited%data%without%specifying%column%names%

allsales = LOAD 'sales.txt' USING PigStorage('|');


Simple Data Types in Pig

• Pig supports several basic data types
  – Similar to those in most databases and programming languages
• Pig treats fields of unspecified type as an array of bytes
  – Called the bytearray type in Pig

allsales = LOAD 'sales' AS (name, price);

List of Simple Data Types

• There are eight data types in Pig for simple values

Name       Description               Example Value
int        Whole numbers             2013
long       Large whole numbers       5365214142L
float      Decimals                  3.14159F
double     Very precise decimals     3.14159265358979323846
boolean*   True or false values      true
datetime*  Date and time             2013-05-30T14:52:39.000-04:00
chararray  Text strings              Alice
bytearray  Raw bytes (e.g. any data) N/A

* Not available in older versions of Pig

Specifying Data Types in Pig

• Pig will do its best to determine data types based on context
  – For example, you can calculate sales commission as price * 0.1
  – In this case, Pig will assume that this value is of type double
• However, it is better to specify data types explicitly when possible
  – Helps with error checking and optimizations
  – Easiest to do this upon load using the format fieldname:type

allsales = LOAD 'sales' AS (name:chararray, price:int);

• Choosing the right data type is important to avoid loss of precision
• Important: Avoid using floating point numbers to represent money!
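The warning about money and floating point can be checked directly. The snippet below is plain Python rather than Pig, but the underlying binary floating-point behavior is the same in most languages; the variable names are chosen for illustration.

```python
# Binary floating point cannot represent most decimal fractions exactly,
# so ten cents plus twenty cents does not compare equal to thirty cents.
float_total = 0.10 + 0.20
float_is_exact = (float_total == 0.30)

# Storing prices as whole cents in an integer field avoids the problem.
cents_total = 10 + 20
cents_is_exact = (cents_total == 30)
```

Here `float_is_exact` is False while `cents_is_exact` is True, which is why the examples in this chapter store prices as integer cents (for example, 2999 for $29.99).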

How Pig Handles Invalid Data

• When encountering invalid data, Pig substitutes NULL for the value
  – For example, an int field containing the value Q4
• The IS NULL and IS NOT NULL operators test for null values
  – Note that NULL is not the same as the empty string ''
• You can use these operators to filter out bad records

hasprices = FILTER Records BY price IS NOT NULL;
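The substitute-then-filter pattern can be sketched in plain Python, with `None` standing in for NULL. This only illustrates the behavior described above and is not Pig's implementation; `to_int` and the sample records are hypothetical.

```python
def to_int(raw):
    """Convert text to an int, or return None when the text is not a number."""
    try:
        return int(raw)
    except ValueError:
        return None  # plays the role of NULL


raw_records = [("Alice", "2999"), ("Bob", "Q4"), ("Carlos", "")]

# Invalid values become None instead of stopping the whole job.
records = [(name, to_int(price)) for name, price in raw_records]

# Equivalent in spirit to: FILTER records BY price IS NOT NULL
hasprices = [r for r in records if r[1] is not None]
```

Only Alice's record survives the filter, since "Q4" and the empty string cannot be read as integers.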


Key Data Concepts in Pig

• Relational databases have tables, rows, columns, and fields
• We will use the following data to illustrate Pig’s equivalents
  – Assume this data was loaded from a tab-delimited text file as before

name     price  country
Alice    2999   us
Bob      3625   ca
Carlos   2764   mx
Dieter   1749   de
Étienne  2368   fr
Fredo    5637   it

Pig Data Concepts: Fields

• A single element of data is called a field
  – It corresponds to one of the eight data types seen earlier
  – In the table below, each individual value (such as 2999) is a field

name     price  country
Alice    2999   us
Bob      3625   ca
Carlos   2764   mx
Dieter   1749   de
Étienne  2368   fr
Fredo    5637   it

Pig Data Concepts: Tuples

• A collection of values is called a tuple
  – Fields within a tuple are ordered, but need not all be of the same type
  – In the table below, each row is a tuple

name     price  country
Alice    2999   us
Bob      3625   ca
Carlos   2764   mx
Dieter   1749   de
Étienne  2368   fr
Fredo    5637   it

Pig Data Concepts: Bags

• A collection of tuples is called a bag
• Tuples within a bag are unordered by default
  – The field count and types may vary between tuples in a bag
  – The table below, taken as a whole, is a bag

name     price  country
Alice    2999   us
Bob      3625   ca
Carlos   2764   mx
Dieter   1749   de
Étienne  2368   fr
Fredo    5637   it

Pig Data Concepts: Relations

• A relation is simply a bag with an assigned name (alias)
  – Most Pig Latin statements create a new relation
• A typical script loads one or more datasets into relations
  – Processing creates new relations instead of modifying existing ones
  – The final result is usually also a relation, stored as output

allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 999;
STORE bigsales INTO 'myreport';


Data Output in Pig

• The command used to handle output depends on its destination
  – DUMP: sends output to the screen
  – STORE: sends output to disk (HDFS)
• Example of DUMP output, using data from the file shown earlier
  – The parentheses and commas indicate tuples with multiple fields

(Alice,2999,us)
(Bob,3625,ca)
(Carlos,2764,mx)
(Dieter,1749,de)
(Étienne,2368,fr)
(Fredo,5637,it)

Storing Data with Pig

• The STORE command is used to store data to HDFS
  – Similar to LOAD, but writes data instead of reading it
  – The output path is the name of a directory
  – The directory must not yet exist
• As with LOAD, the use of PigStorage is implicit
  – The field delimiter also has a default value (tab)

STORE bigsales INTO 'myreport';

  – You may also specify an alternate delimiter

STORE bigsales INTO 'myreport' USING PigStorage(',');


Viewing the Schema with DESCRIBE

• The DESCRIBE command shows the structure of the data, including names and types
• The following Grunt session shows an example

grunt> allsales = LOAD 'sales' AS (name:chararray,
    price:int);
grunt> DESCRIBE allsales;
allsales: {name: chararray,price: int}


Filtering in Pig Latin

• The FILTER keyword extracts tuples matching the specified criteria

bigsales = FILTER allsales BY price > 3000;

allsales (input):

name     price  country
Alice    2999   us
Bob      3625   ca
Carlos   2764   mx
Dieter   1749   de
Étienne  2368   fr
Fredo    5637   it

bigsales (price > 3000):

name     price  country
Bob      3625   ca
Fredo    5637   it
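The effect of this FILTER statement can be reproduced in plain Python to check the result table. This is a sketch of the semantics only, not Pig code; the list-of-tuples representation is an assumption made for illustration.

```python
# The allsales relation from the slide, as (name, price, country) tuples.
allsales = [
    ("Alice", 2999, "us"), ("Bob", 3625, "ca"), ("Carlos", 2764, "mx"),
    ("Dieter", 1749, "de"), ("Étienne", 2368, "fr"), ("Fredo", 5637, "it"),
]

# Keep only the tuples whose price field exceeds 3000.
# The input is left unchanged; a new relation is produced.
bigsales = [row for row in allsales if row[1] > 3000]
```

As in Pig, filtering produces a new relation and leaves `allsales` untouched.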

Filtering by Multiple Criteria

• You can combine criteria with AND and OR

somesales = FILTER allsales BY name == 'Dieter' OR (price >
    3500 AND price < 4000);

allsales (input):

name     price  country
Alice    2999   us
Bob      3625   ca
Carlos   2764   mx
Dieter   1749   de
Étienne  2368   fr
Fredo    5637   it

somesales (name is Dieter, or price is greater than 3500 and less than 4000):

name     price  country
Bob      3625   ca
Dieter   1749   de
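The combined condition can be checked the same way in plain Python. This is a sketch of the logic rather than Pig code; the parentheses mirror those in the Pig Latin statement so that the AND part is evaluated as one unit.

```python
allsales = [
    ("Alice", 2999, "us"), ("Bob", 3625, "ca"), ("Carlos", 2764, "mx"),
    ("Dieter", 1749, "de"), ("Étienne", 2368, "fr"), ("Fredo", 5637, "it"),
]

# name == 'Dieter' OR (price > 3500 AND price < 4000)
somesales = [
    (name, price, country)
    for name, price, country in allsales
    if name == "Dieter" or (price > 3500 and price < 4000)
]
```

Bob qualifies through the price range and Dieter through the name test; Fredo's price of 5637 is above the upper bound, so he is excluded.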

Aside: String Comparisons in Pig Latin

• The == operator is supported for any type in Pig Latin
  – This operator is used for exact comparisons

alices = FILTER allsales BY name == 'Alice';

• Pig Latin supports pattern matching through Java’s regular expressions
  – This is done with the MATCHES operator

a_names = FILTER allsales BY name MATCHES 'A.*';

spammers = FILTER senders BY email_addr
    MATCHES '.*@example\\.com$';
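Java's regular expression matching, which MATCHES relies on, requires the pattern to match the entire string, which is why the examples above begin or end with `.*`. Python's `re.fullmatch` behaves the same way, so it can be used to sketch the idea. The helper name `matches` is hypothetical, and the Java and Python regex dialects differ in some details not exercised here.

```python
import re


def matches(value, pattern):
    """Return True only if the pattern matches the whole string."""
    return re.fullmatch(pattern, value) is not None


names = ["Alice", "Bob", "Anna", "Carla"]
a_names = [n for n in names if matches(n, "A.*")]

# In Pig Latin the backslash is doubled ('\\.') because it sits inside a
# quoted string; a Python raw string needs only one.
is_spammer = matches("bob@example.com", r".*@example\.com$")
not_spammer = matches("bob@example.community", r".*@example\.com$")

# A pattern that matches only part of the value does not count.
partial = matches("Carla", "a")
```

The last line shows the whole-string rule: "Carla" contains an "a", but the pattern `a` alone does not match the full value.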

Field Selection in Pig Latin

• Filtering extracts rows, but sometimes we need to extract columns
  – This is done in Pig Latin using the FOREACH and GENERATE keywords

twofields = FOREACH allsales GENERATE amount, trans_id;

allsales (input):

salesperson  amount  trans_id
Alice        2999    107546
Bob          3625    107547
Carlos       2764    107548
Dieter       1749    107549
Étienne      2368    107550
Fredo        5637    107550

twofields:

amount  trans_id
2999    107546
3625    107547
2764    107548
1749    107549
2368    107550
5637    107550

Generating New Fields in Pig Latin

• The FOREACH and GENERATE keywords can also be used to create fields
  – For example, you could create a new field based on price

t = FOREACH allsales GENERATE price * 0.07;

• It is possible to name such fields

t = FOREACH allsales GENERATE price * 0.07 AS tax;

• And you can also specify the data type

t = FOREACH allsales GENERATE price * 0.07 AS tax:float;
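FOREACH ... GENERATE maps each input tuple to exactly one output tuple, whether it selects existing fields or computes new ones. The plain-Python sketch below illustrates that one-in, one-out behavior; it is not Pig code, and the dictionary representation of named fields is an assumption made for illustration.

```python
allsales = [("Alice", 2999), ("Bob", 3625)]

# Selecting a field: one output value per input tuple.
names = [name for name, _price in allsales]

# Computing a named field, in the spirit of: GENERATE price * 0.07 AS tax
t = [{"tax": price * 0.07} for _name, price in allsales]
```

The output always has the same number of records as the input; only the fields change.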

Eliminating Duplicates

• DISTINCT eliminates duplicate records in a bag
  – All fields must be equal to be considered a duplicate

    unique_records = DISTINCT all_alices;

all_alices
    firstname   lastname   country
    Alice       Smith      us
    Alice       Jones      us
    Alice       Brown      us
    Alice       Brown      us
    Alice       Brown      ca

unique_records
    firstname   lastname   country
    Alice       Smith      us
    Alice       Jones      us
    Alice       Brown      us
    Alice       Brown      ca
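The rule that all fields must be equal can be modeled in Python by treating each record as a tuple and removing repeated tuples. This is a sketch of the semantics only. It happens to keep first-occurrence order, which should not be read as a statement about the order Pig produces.

```python
# The all_alices rows from the slide as (firstname, lastname, country).
all_alices = [
    ("Alice", "Smith", "us"),
    ("Alice", "Jones", "us"),
    ("Alice", "Brown", "us"),
    ("Alice", "Brown", "us"),
    ("Alice", "Brown", "ca"),
]

# Two records are duplicates only if every field is equal, which is
# exactly how tuple equality works. dict.fromkeys drops repeated keys.
unique_records = list(dict.fromkeys(all_alices))
print(unique_records)
```

The two Brown records with different countries both survive, because they differ in one field.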

Controlling Sort Order

• Use ORDER...BY to sort the records in a bag in ascending order
  – Add DESC to sort in descending order instead
  – Take care to specify a schema: data type affects how data is sorted!

    sortedsales = ORDER allsales BY country DESC;

allsales
    name      price   country
    Alice     29.99   us
    Bob       36.25   ca
    Carlos    27.64   mx
    Dieter    17.49   de
    Étienne   23.68   fr
    Fredo     56.37   it

sortedsales
    name      price   country
    Alice     29.99   us
    Carlos    27.64   mx
    Fredo     56.37   it
    Étienne   23.68   fr
    Dieter    17.49   de
    Bob       36.25   ca
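The warning about data types is easiest to appreciate with a concrete case. The Python sketch below first reproduces the descending sort by country, then sorts the same three hypothetical price values once as text and once as numbers. It assumes that values without a declared type compare like text, which is how this model stands in for an untyped field.

```python
# The allsales rows from the slide as (name, price, country).
allsales = [
    ("Alice", 29.99, "us"), ("Bob", 36.25, "ca"), ("Carlos", 27.64, "mx"),
    ("Dieter", 17.49, "de"), ("Étienne", 23.68, "fr"), ("Fredo", 56.37, "it"),
]

# ORDER allsales BY country DESC
sortedsales = sorted(allsales, key=lambda row: row[2], reverse=True)
countries = [row[2] for row in sortedsales]
print(countries)

# Hypothetical prices read without a numeric type (assumption: text order).
untyped = ["29.99", "5.49", "100.00"]
as_text = sorted(untyped)                       # character by character
as_numbers = sorted(float(v) for v in untyped)  # by numeric value
print(as_text)
print(as_numbers)
```

Sorted as text, "100.00" lands before "29.99" because the character "1" precedes "2", which is rarely what an analyst wants.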

Limiting Results

• As in SQL, you can use LIMIT to reduce the number of output records

    somesales = LIMIT allsales 10;

• Beware! Record ordering is random unless specified with ORDER BY
  – Use ORDER BY and LIMIT together to find top-N results

    sortedsales = ORDER allsales BY price DESC;
    top_five = LIMIT sortedsales 5;
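The top-N pattern is a sort followed by a cut. A Python model of the two statements, using the six rows from the earlier filtering slide as hypothetical input, looks like this:

```python
# Rows reused from the filtering example: (name, price, country).
allsales = [
    ("Alice", 2999, "us"), ("Bob", 3625, "ca"), ("Carlos", 2764, "mx"),
    ("Dieter", 1749, "de"), ("Étienne", 2368, "fr"), ("Fredo", 5637, "it"),
]

# ORDER allsales BY price DESC
sortedsales = sorted(allsales, key=lambda row: row[1], reverse=True)

# LIMIT sortedsales 5
top_five = sortedsales[:5]

top_names = [row[0] for row in top_five]
print(top_names)
```

Without the sort step, the slice would return whichever five records happened to come first, which is the pitfall the slide warns about.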

Chapter Topics

Basic Data Analysis with Pig (Data ETL and Analysis With Pig)

• Pig Latin Syntax
• Loading Data
• Simple Data Types
• Field Definitions
• Data Output
• Viewing the Schema
• Filtering and Sorting Data
• Commonly-used Functions
• Conclusion
• Hands-On Exercise: Using Pig for ETL processing

Built-in Functions

• These are just a sampling of Pig's many built-in functions

    Function Description                 Example Invocation      Input    Output
    Convert to uppercase                 UPPER(country)          uk       UK
    Remove leading/trailing spaces       TRIM(name)              _Bob_    Bob
    Return a random number               RANDOM()                (none)   0.48161326652569
    Round to closest whole number        ROUND(price)            37.19    37
    Return chars between two positions   SUBSTRING(name, 0, 2)   Alice    Al

• You can use these with the FOREACH...GENERATE keywords

    rounded = FOREACH allsales GENERATE ROUND(price);
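For readers who know Python, the rows of the table map onto familiar operations. The sketch below is an approximate correspondence, not Pig's implementation. The helpers `pig_round` and `pig_substring` are hypothetical names, and `pig_round` assumes half-up rounding in the style of Java's `Math.round`, which differs from Python's built-in `round` for values ending in .5.

```python
import math
import random


def pig_round(x):
    # Assumption: round half up, as Java's Math.round does.
    return math.floor(x + 0.5)


def pig_substring(s, start, stop):
    # start is inclusive and stop is exclusive, like a Python slice.
    return s[start:stop]


upper_out = "uk".upper()                      # UPPER(country)
trim_out = " Bob ".strip()                    # TRIM(name)
random_out = random.random()                  # RANDOM(), a value in [0, 1)
round_out = pig_round(37.19)                  # ROUND(price)
substring_out = pig_substring("Alice", 0, 2)  # SUBSTRING(name, 0, 2)

print(upper_out, trim_out, round_out, substring_out)
```

The SUBSTRING row is the one most often misread: the second index is the position after the last character returned, so (0, 2) yields two characters.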

Essential Points

• Pig Latin supports many of the same operations as SQL
  – Though Pig's approach is quite different
  – Pig Latin loads, transforms, and stores data in a series of steps
• The default delimiter for both input and output is the tab character
  – You can specify an alternate delimiter as an argument to PigStorage
• Specifying the names and types of fields is not required
  – But it can improve performance and readability of your code
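The load, transform, store sequence and the role of the delimiter can be sketched in a few lines of Python. This is a model of the workflow rather than of Pig: the input text, the price threshold, and the choice of a comma for output are all hypothetical, and in-memory strings stand in for files.

```python
import csv
import io

# Hypothetical tab-delimited input, standing in for a file.
raw = "Alice\t2999\tus\nBob\t3625\tca\n"

# Step 1 (load): split each line on tabs and apply names and types.
rows = [
    (name, int(price), country)
    for name, price, country in csv.reader(io.StringIO(raw), delimiter="\t")
]

# Step 2 (transform): keep only the rows of interest.
expensive = [row for row in rows if row[1] > 3000]

# Step 3 (store): write the result with an alternate delimiter.
out = io.StringIO()
csv.writer(out, delimiter=",", lineterminator="\n").writerows(expensive)
result = out.getvalue()
print(result)
```

Each step produces a named intermediate result that the next step consumes, which is the main stylistic difference from writing a single SQL query.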

Bibliography

The following offer more information on topics discussed in this chapter
• Pig Latin Basics
  – http://tiny.cloudera.com/piglatinbasics
• Pig Latin Built-In Functions
  – http://tiny.cloudera.com/piglatinbuiltin
• Documentation for Java Regular Expression Patterns
  – http://tiny.cloudera.com/javaregex

Hands-On Exercise: Using Pig for ETL processing

• In this Hands-On Exercise, you will write Pig Latin code to perform basic ETL processing tasks on data related to Dualcore's online advertising campaigns
• Please refer to the Hands-On Exercise Manual for instructions
