Using Pig, Hive, and Impala with Hadoop
© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
201410
Introduction
Chapter 1
Course Chapters
Course Introduction
• Introduction
• Hadoop Fundamentals
Data ETL and Analysis With Pig
• Introduction to Pig
• Basic Data Analysis with Pig
• Processing Complex Data with Pig
• Multi-Dataset Operations with Pig
• Pig Troubleshooting and Optimization
Introduction to Impala and Hive
• Introduction to Impala and Hive
• Querying With Impala and Hive
• Impala and Hive Data Management
• Data Storage and Performance
Data Analysis With Impala and Hive
• Relational Data Analysis With Impala and Hive
• Working with Impala
• Analyzing Text and Complex Data with Hive
• Hive Optimization
• Extending Hive
Course Conclusion
• Choosing the Best Tool for the Job
• Conclusion
Chapter Topics
Introduction (Course Introduction)
• About This Course
• About Cloudera
• Course Logistics
• Introductions
Course Objectives (1)
During this course, you will learn
• The purpose of Hadoop and its related tools
• The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
• How to identify typical use cases for large-scale data analysis
• How to load data from relational databases and other sources
• How to manage data in HDFS and export it for use with other systems
• How Pig, Hive, and Impala improve productivity for typical analysis tasks
• The language syntax and data formats supported by these tools
Course Objectives (2)
• How to design and execute queries on data stored in HDFS
• How to join diverse datasets to gain valuable business insight
• How Hive and Impala can be extended with custom functions and scripts
• How to analyze structured, semi-structured, and unstructured data
• How to store and query data for better performance
• How to determine which tool is the best choice for a given task
About Cloudera (1)
• The leader in Apache Hadoop-based software and services
• Founded by leading experts on Hadoop from Facebook, Yahoo, Google, and Oracle
• Provides support, consulting, training, and certification for Hadoop users
• Staff includes committers to virtually all Hadoop projects
• Many authors of industry standard books on Apache Hadoop projects
  – Tom White, Lars George, Kathleen Ting, etc.
About Cloudera (2)
• Customers include many key users of Hadoop
  – Allstate, AOL Advertising, Box, BT, CBS Interactive, eBay, Experian, FICO, Groupon, MasterCard, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, …
• Cloudera public training:
  – Cloudera Developer Training for Apache Hadoop
  – Cloudera Developer Training for Apache Spark
  – Designing and Building Big Data Applications
  – Cloudera Administrator Training for Apache Hadoop
  – Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  – Cloudera Training for Apache HBase
  – Introduction to Data Science: Building Recommender Systems
  – Cloudera Essentials for Apache Hadoop
• Onsite and custom training is also available
CDH
• CDH (Cloudera’s Distribution, including Apache Hadoop)
  – 100% open source, enterprise-ready distribution of Hadoop and related projects
  – The most complete, tested, and widely-deployed distribution of Hadoop
  – Integrates all key Hadoop ecosystem projects
  – Available as RPMs and Ubuntu/Debian/SuSE packages or as a tarball
Cloudera Express
• Cloudera Express
  – Free download
• The best way to get started with Hadoop
• Includes CDH
• Includes Cloudera Manager
  – End-to-end administration for Hadoop
  – Deploy, manage, and monitor your cluster
Cloudera Enterprise
• Cloudera Enterprise
  – Subscription product including CDH and Cloudera Manager
• Includes support
• Includes extra Cloudera Manager features
  – Configuration history and rollbacks
  – Rolling updates
  – LDAP integration
  – SNMP support
  – Automated disaster recovery
  – Etc.
Logistics
• Class start and finish times
• Lunch
• Breaks
• Restrooms
• Wi-Fi access
• Virtual machines
• Can I come in early/stay late?
Your instructor will give you details on how to access the course materials and exercise instructions for the class
Introductions
• About your instructor
• About you
  – Where do you work and what do you do there?
  – Which database(s) and platform(s) do you use?
  – Have you worked with Apache Hadoop or related tools?
  – Any experience as a developer?
  – What programming languages do you use?
  – What are your expectations for this course?
Hadoop Fundamentals
Chapter 2
Hadoop Fundamentals
In this chapter, you will learn
• Which factors led to the era of Big Data
• What Hadoop is and what significant features it offers
• How Hadoop offers reliable storage for massive amounts of data with HDFS
• How Hadoop supports large-scale data processing through MapReduce
• How ‘Hadoop Ecosystem’ tools can boost an analyst’s productivity
• Several ways to integrate Hadoop into the modern data center
Chapter Topics
Hadoop Fundamentals (Course Introduction)
• The Motivation for Hadoop
• Hadoop Overview
• Data Storage: HDFS
• Distributed Data Processing: YARN, MapReduce, and Spark
• Data Processing and Analysis: Pig, Hive, and Impala
• Database Integration: Sqoop
• Other Hadoop Data Tools
• Exercise Scenario Explanation
• Conclusion
• Hands-On Exercise: Data Ingest with Hadoop Tools
Velocity
• We are generating data faster than ever
  – Processes are increasingly automated
  – Systems are increasingly interconnected
  – People are increasingly interacting online
Variety
• We are producing a wide variety of data
  – Social network connections
  – Server and application log files
  – Electronic medical records
  – Images, audio, and video
  – RFID and wireless sensor network events
  – Product ratings on shopping and review Web sites
  – And much more…
• Not all of this maps cleanly to the relational model
Volume
• Every day…
  – More than 1.5 billion shares are traded on the New York Stock Exchange
  – Facebook stores 2.7 billion comments and ‘Likes’
  – Google processes about 24 petabytes of data
• Every minute…
  – Foursquare handles more than 2,000 check-ins
  – TransUnion makes nearly 70,000 updates to credit files
• And every second…
  – Banks process more than 10,000 credit card transactions
Data Has Value
• This data has many valuable applications
  – Product recommendations
  – Predicting demand
  – Marketing analysis
  – Fraud detection
  – And many, many more…
• We must process it to extract that value
  – And processing all the data can yield more accurate results
We Need a System that Scales
• We’re generating too much data to process with traditional tools
• Two key problems to address
  – How can we reliably store large amounts of data at a reasonable cost?
  – How can we analyze all the data we have stored?
What is Apache Hadoop?
• Scalable and economical data storage and processing
  – Distributed and fault-tolerant
  – Harnesses the power of industry standard hardware
• Heavily inspired by technical documents published by Google
[Diagram: the Hadoop stack, top to bottom]
  Batch Processing (MapReduce, Hive, Pig) | Analytic SQL (Impala) | Search Engine (Cloudera Search) | Machine Learning (Spark, Mahout) | Stream Processing (Spark) | Other Applications
  Workload Management (YARN)
  Data Storage: Filesystem (HDFS) | Online NoSQL (HBase)
  Data Integration (Sqoop, Flume)
Scalability
• Hadoop is a distributed system
  – A collection of servers running Hadoop software is called a cluster
• Individual servers within a cluster are called nodes
  – Typically standard rackmount servers running Linux
  – Each node both stores and processes data
• Add more nodes to the cluster to increase scalability
  – A cluster may contain up to several thousand nodes
  – You can scale out incrementally as required
Fault Tolerance
• Paradox: Adding nodes increases the chance that any one of them will fail
  – Solution: build redundancy into the system and handle it automatically
• Files loaded into HDFS are replicated across nodes in the cluster
  – If a node fails, its data is re-replicated using one of the other copies
• Data processing jobs are broken into individual tasks
  – Each task takes a small amount of data as input
  – Thousands of tasks (or more) often run in parallel
  – If a node fails during processing, its tasks are rescheduled elsewhere
• Routine failures are handled automatically without any loss of data
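The replication described above can be observed from the command line. The following is a minimal sketch, assuming an hdfs client configured for your cluster and the example path /reports/sales.txt used elsewhere in this chapter. The run helper is an illustrative convenience (not part of Hadoop) that prints each command and executes it only when the program is installed.

```shell
# Illustrative helper: print a command, then run it only if its program exists.
run() {
  printf '+ %s\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

# %r prints the file's replication factor, %n its name.
run hdfs dfs -stat "%r %n" /reports/sales.txt

# fsck lists the file's blocks and the DataNodes that hold each copy.
run hdfs fsck /reports/sales.txt -files -blocks -locations
```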
HDFS: Hadoop Distributed File System
• HDFS provides the storage layer for Hadoop data processing
• Provides inexpensive and reliable storage for massive amounts of data
• Other Hadoop components work with data in HDFS
  – MapReduce, Impala, Hive, Pig, Spark, etc.
[Diagram: the Hadoop stack, in which Filesystem (HDFS) sits in the Data Storage layer beneath Workload Management (YARN) and the processing tools]
HDFS Features
• Optimized for sequential access to a relatively small number of large files
  – Each file is likely to be 100MB or larger
  – Multi-gigabyte files are typical
• In some ways, HDFS is similar to a UNIX filesystem
  – Hierarchical, with UNIX-style paths (e.g., /sales/rpt/asia.txt)
  – UNIX-style file ownership and permissions
• There are also some major deviations from UNIX
  – No concept of a current directory
  – Cannot modify files once written
  – Must use Hadoop-specific utilities or custom code to access HDFS
HDFS Architecture
• Hadoop has a master/slave architecture
• HDFS master daemon: NameNode
  – Manages namespace and metadata
  – Monitors slave nodes
• HDFS slave daemon: DataNode
  – Reads and writes the actual data
[Diagram: a small Hadoop cluster, with a master node running the HDFS master daemon and slave nodes running the HDFS slave daemons]
Accessing HDFS via the Command Line
• HDFS is not a general purpose filesystem
  – Not built into the OS, so only specialized tools can access it
  – End users typically access HDFS via the hdfs dfs command
• Example: display the contents of the /user/fred/sales.txt file
• Example: Create a directory (below the root) called reports
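The two examples above might be run as follows. This is a sketch rather than the slide's original listing, and it assumes an hdfs client configured for your cluster. The run helper is an illustrative convenience (not part of Hadoop) that prints each command and executes it only when the program is installed.

```shell
# Illustrative helper: print a command, then run it only if its program exists.
run() {
  printf '+ %s\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

# Display the contents of the /user/fred/sales.txt file.
run hdfs dfs -cat /user/fred/sales.txt

# Create a directory called reports below the HDFS root.
run hdfs dfs -mkdir /reports
```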
Copying Local Data To and From HDFS
• Remember that HDFS is distinct from your local filesystem
  – Use hdfs dfs -put to copy local files to HDFS
  – Use hdfs dfs -get to fetch a local copy of a file from HDFS
[Diagram: a client machine and a Hadoop cluster; "$ hdfs dfs -put file" copies a file from the client to the cluster, and "$ hdfs dfs -get file" copies it back to the client]
More hdfs dfs Command Examples
• Copy file input.txt from local disk to the user’s directory in HDFS
  – This will copy the file to /user/username/input.txt
• Get a directory listing of the HDFS root directory
• Delete the file /reports/sales.txt
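The three operations above might look like this. It is a sketch under the assumption that an hdfs client is configured for your cluster and that input.txt exists in the local working directory. The run helper is an illustrative convenience (not part of Hadoop) that prints each command and executes it only when the program is installed.

```shell
# Illustrative helper: print a command, then run it only if its program exists.
run() {
  printf '+ %s\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

# A relative HDFS path resolves to the user's home directory,
# so this lands in /user/username/input.txt.
run hdfs dfs -put input.txt input.txt

# List the HDFS root directory.
run hdfs dfs -ls /

# Delete a single file.
run hdfs dfs -rm /reports/sales.txt
```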
Using the Hue HDFS File Manager
• Hue is a Web interface for Hadoop
  – Hadoop User Experience
• Hue includes an application for browsing and managing files in HDFS
  – To use Hue, browse to http://hue_server:8888/
[Screenshot: the Hue file manager, with callouts for Manage Files, Upload Files, and Browse Files]
Workload Management: YARN
• Many Hadoop tools work with data in a Hadoop cluster
• Requires workload management to distribute and monitor work across the cluster
[Diagram: the Hadoop stack, in which the Workload Management layer (YARN or MapReduce 1) sits between the processing tools and the Data Storage layer]
Hadoop Cluster Architecture
• Master/Slave Architecture
  – YARN or MapReduce version 1
  – Details differ slightly
• Master nodes
  – Run master daemons to accept jobs, and monitor and distribute work
• Slave nodes
  – Run slave daemons to start tasks
  – Do the actual work
  – Report status back to master daemons
• HDFS and YARN/MRv1 are collocated
  – Slave nodes run both the HDFS and the YARN/MRv1 slave daemons on the same machines
[Diagram: a small Hadoop cluster, with a master node running the YARN and HDFS master daemons and slave nodes running the YARN and HDFS slave daemons]
General Data Processing
• Hadoop includes two general data processing engines
  – MapReduce
  – Spark
• Both are programming libraries (Java, Scala, Python…)
[Diagram: the Hadoop stack, in which MapReduce and Spark appear among the processing tools above Workload Management (YARN or MapReduce)]
Hadoop MapReduce
• Hadoop MapReduce was the original processing engine for Hadoop
  – Still the most commonly used general data processing engine
• Based on the ‘map-reduce’ programming model
  – A style of processing data popularized by Google
• Provides a set of programming libraries
  – Primarily supports Java
  – Streaming MapReduce provides (limited) support for scripting languages such as Python
• Benefits of Hadoop MapReduce
  – Simplicity
  – Flexibility
  – Scalability
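The ‘map-reduce’ model can be illustrated on a single machine with an ordinary UNIX pipeline. The sketch below is an analogy only: it does not use Hadoop or its libraries, and the wordcount function name and the sample input are made up for the illustration. Each stage mirrors one phase of a word-count job.

```shell
# wordcount reads text on stdin and prints "word count" pairs.
wordcount() {
  tr -s ' ' '\n' |        # "map": emit one word per line
    sort |                # "shuffle": bring identical words together
    uniq -c |             # "reduce": count each group of identical words
    awk '{print $2, $1}'  # print the word followed by its count
}

# Hypothetical input: three short lines of text.
printf 'pig hive\nhive impala\npig hive\n' | wordcount
```

For this input the pipeline prints hive 3, impala 1, and pig 2, one pair per line. In Hadoop MapReduce the same three phases run as many parallel tasks spread across the nodes of the cluster.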
Apache Spark
• The next generation general data processing engine
• Builds on the same ‘map-reduce’ programming model as Hadoop MapReduce
• Originally developed at AMP Lab at UC Berkeley
• Spark supports Scala, Java, and Python
• Spark has the same benefits as MapReduce, plus…
  – Improved performance using in-memory processing
  – Higher level programming model to speed up development
Data Processing and Analysis with Hadoop (1)
• Hadoop MapReduce and Spark are powerful data processing engines but…
  – Hard to master
  – Require programming skills
  – Slow to develop, hard to maintain
• Hadoop includes several other tools for data processing and analysis
  – Tools for data analysts, not programmers
Data Processing and Analysis with Hadoop (2)
• Higher level abstractions for general data processing
  – Pig, Hive
• Specialized processing engines for interactive analysis
  – Impala, Search
[Diagram: the analysis stack, top to bottom]
  Pig Latin | Impala/HiveQL | Natural Language
  Data Platform: Pig | Hive | Impala | Search
  Data Processing Engine: MapReduce, Spark, etc.
  Data Storage: HDFS
Apache Pig
• Apache Pig builds on Hadoop to offer high-level data processing
  – This is an alternative to writing low-level MapReduce code
  – Pig is especially good at joining and transforming data
Apache Hive
• Hive is another abstraction on top of Hadoop
  – Like Pig, it also reduces development time
  – Hive uses a SQL-like language called HiveQL
• A Hive Server runs on a master node
  – Turns HiveQL queries into MapReduce jobs
  – Submits those jobs to the cluster
Cloudera Impala
• Massively parallel SQL engine which runs on a Hadoop cluster
  – Inspired by Google’s Dremel project
  – Can query data stored in HDFS or HBase tables
• Uses Impala SQL
  – Very similar to HiveQL
• High performance
  – Typically at least 10 times faster than Hive or MapReduce
  – High-level query language (subset of SQL-92)
• Impala is 100% Apache-licensed open source
Apache Sqoop
• Sqoop exchanges data between an RDBMS and Hadoop
• It can import all tables, a single table, or a portion of a table into HDFS
  – Does this very efficiently via a Map-only MapReduce job
  – Result is a directory in HDFS containing comma-delimited text files
• Sqoop can also export data from HDFS back to the database
Importing Tables with Sqoop
• This example imports the customers table from a MySQL database
  – Will create /mydata/customers directory in HDFS
  – Directory will contain comma-delimited text files
$ sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table customers
• Adding the --direct option may offer better performance
  – Uses database-specific tools instead of JDBC
  – This option is not compatible with all databases
• High-performance custom connectors for some databases
  – Netezza, Teradata, MySQL…
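As a sketch of the --direct option, the import shown above can be repeated with the flag added and the result listed afterwards. It assumes the same hypothetical MySQL database and credentials as the slide, plus sqoop and hdfs clients on the PATH. The run helper is an illustrative convenience (not part of Sqoop) that prints each command and executes it only when the program is installed.

```shell
# Illustrative helper: print a command, then run it only if its program exists.
run() {
  printf '+ %s\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

# Same import as the slide's example, with --direct added.
run sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table customers \
  --direct

# List the text files that the import wrote for the table.
run hdfs dfs -ls /mydata/customers
```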
Importing An Entire Database with Sqoop
• Import all tables from the database (fields will be tab-delimited)
$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --fields-terminated-by '\t' \
  --warehouse-dir /mydata
Importing Partial Tables with Sqoop
• Import only specified columns from products table
$ sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table products \
  --columns "prod_id,name,price"
• Import only matching rows from products table
$ sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table products \
  --where "price >= 1000"
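The two options can also be combined in one import, restricting both the columns and the rows. This is a sketch that reuses the slide's hypothetical database, table, and column names. The run helper is an illustrative convenience (not part of Sqoop) that prints the command and executes it only when sqoop is installed.

```shell
# Illustrative helper: print a command, then run it only if its program exists.
run() {
  printf '+ %s\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

# Import three columns, and only the rows priced at 1000 or more.
run sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table products \
  --columns "prod_id,name,price" \
  --where "price >= 1000"
```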
Incremental Imports with Sqoop
• What if new records are added to the database?
  – Could re-import all records, but this is inefficient
• Sqoop’s incremental append mode imports only new records
  – Based on value of last record in specified column
$ sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 6713821
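Tracking --last-value by hand becomes tedious for recurring imports. One option is a Sqoop saved job, which records the last imported value after each run. The sketch below assumes the slide's hypothetical database and uses a made-up job name, orders_incremental. Depending on its configuration, Sqoop may prompt for the password when the job runs rather than store it. The run helper is an illustrative convenience (not part of Sqoop) that prints each command and executes it only when sqoop is installed.

```shell
# Illustrative helper: print a command, then run it only if its program exists.
run() {
  printf '+ %s\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

# Define the incremental import once as a saved job
# (orders_incremental is a hypothetical job name).
run sqoop job --create orders_incremental -- import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 6713821

# Each execution imports only the rows added since the previous run.
run sqoop job --exec orders_incremental

# Inspect the saved parameters, including the stored last value.
run sqoop job --show orders_incremental
```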
Handling Modifications with Incremental Imports
• What if existing records are also modified in the database?
  – Incremental append mode doesn’t handle this
• In CDH 5.2 and later, Sqoop’s lastmodified append mode adds and updates records
  – Caveat: You must maintain a timestamp column in your table
$ sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table shipments \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2013-06-12 03:15:59"
Exporting Data from Hadoop to RDBMS with Sqoop
• We have seen several ways to pull records from an RDBMS into Hadoop
  – It is sometimes also helpful to push data in Hadoop back to an RDBMS
• Sqoop supports this via export
$ sqoop export \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --export-dir /mydata/recommender_output \
  --table product_recommendations
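An export has to parse the files in HDFS, so the field delimiter matters. The sketch below assumes, purely for illustration, that the files in /mydata/recommender_output are tab-delimited (as they would be if written with the tab setting shown earlier for import-all-tables) and that the target table already exists in the database. The run helper is an illustrative convenience (not part of Sqoop) that prints the command and executes it only when sqoop is installed.

```shell
# Illustrative helper: print a command, then run it only if its program exists.
run() {
  printf '+ %s\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

# Tell Sqoop that the input fields are separated by tabs, not commas.
run sqoop export \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --export-dir /mydata/recommender_output \
  --table product_recommendations \
  --input-fields-terminated-by '\t'
```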
Apache HBase
• HBase is “the Hadoop database”
• Can store massive amounts of data
  – Gigabytes, terabytes, and even petabytes of data in a table
  – Tables can have many thousands of columns
• Scales to provide very high write throughput
  – Hundreds of thousands of inserts per second
• Fairly primitive when compared to an RDBMS
  – NoSQL: There is no high-level query language
  – Use API to scan / get / put values based on keys
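Key-based access can be tried out in the HBase shell before writing any API code. The sketch below feeds a few shell commands to hbase shell; the orders table, the info column family, and the row key are all hypothetical names, and running it against a real cluster creates that table. The run helper is an illustrative convenience (not part of HBase) that prints the command and executes it only when hbase is installed.

```shell
# Illustrative helper: print a command, then run it only if its program exists.
run() {
  printf '+ %s\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  fi
}

# Create a table with one column family, write one cell,
# read it back by row key, then scan the whole table.
run hbase shell <<'EOF'
create 'orders', 'info'
put 'orders', 'order-1001', 'info:status', 'shipped'
get 'orders', 'order-1001'
scan 'orders'
EOF
```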
Apache Flume
• Flume imports data into HDFS as it is being generated by various sources
[Diagram: log files, UNIX syslog, and custom sources feeding data into the Hadoop cluster]
Recap: Data Center Integration
Hands-On Exercises: Scenario Explanation
• Hands-On Exercises throughout the course will reinforce the topics being discussed
  – Exercises simulate the kind of tasks often performed using the tools you will learn about in class
  – Most exercises depend on data generated in earlier exercises
• Scenario: Dualcore Inc. is a leading electronics retailer
  – More than 1,000 brick-and-mortar stores
  – Dualcore also has a thriving e-commerce Web site
• Dualcore has hired you to help find value in its data
  – You will process and analyze data from internal and external sources
  – Identify opportunities to increase revenue
  – Find new ways to reduce costs
  – Help other departments achieve their goals
Essential Points
• We are generating more data, and faster, than ever before
• Most of this data maps poorly to structured relational tables
• The ability to store and process this data can yield valuable insight
• Hadoop offers scalable data storage and processing
• There are lots of tools in the Hadoop ecosystem that help you to integrate Hadoop with other systems, manage complex jobs, and ease analysis
Bibliography
The following offer more information on topics discussed in this chapter
• 10 Hadoopable Problems (recorded presentation)
  – http://tiny.cloudera.com/dac02a
• Guide to HDFS Commands
  – http://tiny.cloudera.com/hdfscommands
• Hadoop: The Definitive Guide, 3rd Edition (O’Reilly book)
  – http://tiny.cloudera.com/hadooptdg
• Sqoop User Guide
  – http://tiny.cloudera.com/sqoopuser
• Spark Documentation
  – http://tiny.cloudera.com/sparkdoc
©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 02#50%
About the Training Virtual Machine
• During this course, you will perform numerous hands-on exercises using the Cloudera Training Virtual Machine (VM)
• The VM has Hadoop installed in pseudo-distributed mode
– A cluster comprised of a single node
– Typically used for testing code before deploying to a large cluster

Hands-On Exercise: Data Ingest with Hadoop Tools
• In this Hands-On Exercise, you will gain practice adding data from the local filesystem and a relational database server to HDFS
– You will analyze this data in subsequent exercises
• Please refer to the Hands-On Exercise Manual for instructions

Introduction to Pig
Chapter 3

Introduction to Pig
In this chapter, you will learn
• The key features Pig offers
• How organizations use Pig for data processing and analysis
• How to use Pig interactively and in batch mode

Chapter Topics
Introduction to Pig (Data ETL and Analysis With Pig)
• What is Pig?
• Pig’s Features
• Pig Use Cases
• Interacting with Pig
• Conclusion

Apache Pig Overview
• Apache Pig is a platform for data analysis and processing on Hadoop
– It offers an alternative to writing MapReduce code directly
• Originally developed as a research project at Yahoo
– Goals: flexibility, productivity, and maintainability
– Now an open-source Apache project

The Anatomy of Pig
• Main components of Pig
– The data flow language (Pig Latin)
– The interactive shell where you can type Pig Latin statements (Grunt)
– The Pig interpreter and execution engine
• Example Pig Latin script
AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';
• Steps performed by the Pig interpreter and execution engine
– Preprocess and parse Pig Latin
– Check data types
– Make optimizations
– Plan execution
– Generate MapReduce jobs
– Submit job(s) to Hadoop
– Monitor progress

Where to Get Pig
• CDH is the easiest way to install Hadoop and Pig
– A Hadoop distribution which includes HDFS, MapReduce, Spark, Pig, Hive, Impala, Sqoop, HBase, and other Hadoop ecosystem components
– Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
– Simple installation
– 100% free and open source
• Installation is outside the scope of this course
– Cloudera offers a training course for System Administrators, Cloudera Administrator Training for Apache Hadoop

Pig Features
• Pig is an alternative to writing low-level MapReduce code
• Many features enable sophisticated analysis and processing
– HDFS manipulation
– UNIX shell commands
– Relational operations
– Positional references for fields
– Common mathematical functions
– Support for custom functions and data formats
– Complex data structures

How Are Organizations Using Pig?
• Many organizations use Pig for data analysis
– Finding relevant records in a massive data set
– Querying multiple data sets
– Calculating values from input data
• Pig is also frequently used for data processing
– Reorganizing an existing data set
– Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization
• Pig can help you extract valuable information from Web server log files
[Figure: Web log sessionization example; the only surviving label is "Order Widget X"]

Use Case: Data Sampling
• Sampling can help you explore a representative portion of a large data set
– Allows you to examine this portion with tools that do not scale well
– Supports faster iterations during development of analysis jobs
[Figure: random sampling reduces a 100 TB data set to a 50 MB sample]

Use Case: ETL Processing
• Pig is also widely used for Extract, Transform, and Load (ETL) processing
[Figure: ETL data flow; the only surviving label is "Call Center"]

Using Pig Interactively
• You can use Pig interactively, via the Grunt shell
– Pig interprets each Pig Latin statement as you type it
– Execution is delayed until output is required
– Very useful for ad hoc data inspection
• Example of how to start, use, and exit Grunt
$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;
• Can also execute a Pig Latin statement from the UNIX shell via the -e option

Interacting with HDFS
• You can manipulate HDFS with Pig, via the fs command

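The example that accompanied this slide did not survive. As a minimal sketch, the fs command accepts the same arguments as the hadoop fs command line tool; the directory name sales_data and the file name europe.txt below are hypothetical.

```pig
-- Run from Grunt or from a Pig script; all names here are illustrative only.
fs -mkdir sales_data;                 -- create a directory under your HDFS home directory
fs -put europe.txt sales_data/;       -- copy a local file into that HDFS directory
fs -ls sales_data;                    -- list the directory's contents
fs -cat sales_data/europe.txt;        -- print the file's contents to the screen
```

Because fs operates on HDFS, relative paths are resolved against your HDFS home directory rather than the local working directory.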
Interacting with UNIX
• The sh command lets you run UNIX programs from Pig
grunt> sh date;
Wed Nov 12 06:39:13 PST 2014
grunt> fs -ls; -- lists HDFS files
grunt> sh ls; -- lists local files

Running Pig Scripts
• A Pig script is simply Pig Latin code stored in a text file
– By convention, these files have the .pig extension
• You can run a Pig script from within the Grunt shell via the run command
– This is useful for automation and batch execution
• It is common to run a Pig script directly from the UNIX shell
$ pig salesreport.pig

MapReduce and Local Modes
• As described earlier, Pig turns Pig Latin into MapReduce jobs
– Pig submits those jobs for execution on the Hadoop cluster
• It is also possible to run Pig in ‘local mode’ using the -x flag (pig -x local)
– This runs jobs on the local machine instead of the cluster
– Local mode uses the local filesystem instead of HDFS
– Can be helpful for testing before deploying a job to production

Client-Side Log Files
• If a job fails, Pig may produce a log file to explain why
– These log files are typically produced in your current working directory
– On the local (client) machine

Essential Points
• Pig offers an alternative to writing MapReduce code directly
– Pig interprets Pig Latin code in order to create MapReduce jobs
– It then submits these jobs to the Hadoop cluster
• You can execute Pig Latin code interactively through Grunt
– Pig delays job execution until output is required
• It is also common to store Pig Latin code in a script for batch execution
– Allows for automation and code reuse

Bibliography
The following offer more information on topics discussed in this chapter
• Apache Pig Web Site
– http://pig.apache.org/
• Process a Million Songs with Apache Pig
– http://tiny.cloudera.com/dac03a
• Powered By Pig
– http://tiny.cloudera.com/poweredbypig
• LinkedIn: User Engagement Powered By Apache Pig and Hadoop
– http://tiny.cloudera.com/dac03c
• Programming Pig (O’Reilly book)
– http://tiny.cloudera.com/programmingpig

Basic Data Analysis with Pig
Chapter 4

Basic Data Analysis with Pig
In this chapter, you will learn
• The basic syntax of Pig Latin
• How to load and store data using Pig
• Which simple data types Pig uses to represent data
• How to sort and filter data in Pig
• How to use many of Pig’s built-in functions for data processing

Chapter Topics
Basic Data Analysis with Pig (Data ETL and Analysis With Pig)
• Pig Latin Syntax
• Loading Data
• Simple Data Types
• Field Definitions
• Data Output
• Viewing the Schema
• Filtering and Sorting Data
• Commonly-used Functions
• Conclusion
• Hands-On Exercise: Using Pig for ETL Processing

Pig Latin Overview
• Pig Latin is a data flow language
– The flow of data is expressed as a sequence of statements
• The following is a simple Pig Latin script to load, filter, and store data
allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 100;
/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Keywords
• The Pig Latin keywords in the script below are LOAD, AS, FILTER, BY, STORE, and INTO
– Keywords are reserved: you cannot use them to name things
allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 100;
/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Identifiers (1)
• Identifiers are the names assigned to fields and other data structures
– In the script below, the identifiers are allsales, name, price, and bigsales
allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 100;
/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Identifiers (2)
• Identifiers must conform to Pig’s naming rules
• An identifier must always begin with a letter
– This may only be followed by letters, numbers, or underscores

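The naming rules above can be sketched as follows. The aliases are hypothetical, and the invalid names are left in comments because, under the rules as stated on this slide, they would not parse.

```pig
-- Valid identifiers: a letter first, then letters, digits, or underscores.
x = LOAD 'sales' AS (name, price);
q1 = FILTER x BY price > 100;
q1_2013 = FOREACH q1 GENERATE name;
MyData = FOREACH x GENERATE price;

-- Invalid identifiers under the rules above (commented out on purpose):
-- 4 = LOAD 'sales';        starts with a digit
-- price$ = LOAD 'sales';   contains a character other than a letter, digit, or underscore
-- _data = LOAD 'sales';    starts with an underscore instead of a letter
```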
Pig Latin Grammar: Comments
• Pig Latin supports two types of comments
– Single line comments begin with --
– Multi-line comments begin with /* and end with */
allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 100; -- keep only large sales
/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

Case-Sensitivity in Pig Latin
• Whether case is significant in Pig Latin depends on context
• Keywords (such as LOAD, FILTER, and STORE) are not case-sensitive
– Neither are operators (such as AND, OR, or IS NULL)
• Identifiers and paths (such as bigsales and 'myreport') are case-sensitive
– So are function names (such as SUM or COUNT) and constants

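A short sketch of these rules, assuming the sales file from the earlier examples; the alias shouted is hypothetical.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Keywords and operators are not case-sensitive, so lowercase filter, by,
-- and, and "is not null" are accepted here.
bigsales = filter allsales by price > 100 and name is not null;

-- Identifiers are case-sensitive: bigsales and BigSales would be two different aliases.
-- Function names are case-sensitive too: the built-in is UPPER, not upper.
shouted = FOREACH bigsales GENERATE UPPER(name);
```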
Common Operators in Pig Latin
• Many commonly-used operators in Pig Latin are familiar to SQL users
– Notable difference: Pig Latin uses == and != for comparison

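The operator table from this slide did not survive. The following sketch shows a few comparison, Boolean, and arithmetic operators in use; the aliases and the 0.07 rate are illustrative assumptions.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Equality and inequality use == and != rather than the SQL forms.
alices = FILTER allsales BY name == 'Alice';
others = FILTER allsales BY name != 'Alice';

-- Range comparisons and Boolean operators look the same as in SQL.
midrange = FILTER allsales BY price >= 2000 AND price < 3000;

-- Arithmetic operators can be used when generating fields.
with_tax = FOREACH allsales GENERATE name, price + (price * 0.07);
```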
Basic Data Loading in Pig
• Pig’s default loading function is called PigStorage
– The name of the function is implicit when calling LOAD
– PigStorage assumes text format with tab-separated columns
• Consider the following file in HDFS called sales
– The two fields are separated by tab characters
Alice   2999
Bob     3625
Carlos  2764
• This example loads data from the above file

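The load statement itself did not survive on this slide. A minimal sketch, reusing the statement from the Grunt example earlier in the course:

```pig
-- Load the tab-delimited file 'sales' with the implicit PigStorage loader
-- and name its two fields.
allsales = LOAD 'sales' AS (name, price);
```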
Data Sources: Files and Directories
• The previous example loads data from a file named sales
• Since this is not an absolute path, it is relative to your home directory
– Your home directory in HDFS is typically /user/youruserid/
– Can also specify an absolute path (e.g., /dept/sales/2012/q4)
• The path can also refer to a directory
– In this case, Pig will recursively load all files in that directory
– File patterns (“globs”) are also supported

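The path forms described above can be sketched as follows. The directory name sales_archive and the file pattern are hypothetical; the absolute path is the one mentioned on the slide.

```pig
-- Relative path: resolved against your HDFS home directory.
from_file = LOAD 'sales';

-- Absolute path.
from_abs = LOAD '/dept/sales/2012/q4';

-- Directory: the files inside it are loaded together.
from_dir = LOAD 'sales_archive/';

-- Glob: only files whose names match the pattern are loaded.
from_glob = LOAD 'sales_archive/2012-*.txt';
```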
Specifying Column Names During Load
• The previous example also assigns names to each column
• Assigning column names is not required
– Omitting them can be useful when exploring a new dataset
– Refer to fields by position ($0 is first, $1 is second, $53 is 54th, etc.)

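A brief sketch of positional references, assuming the two-field sales file shown earlier; the aliases are hypothetical.

```pig
-- No AS clause, so the fields have no names.
allsales = LOAD 'sales';

-- $1 is the second field (the price column in the sample file).
bigsales = FILTER allsales BY $1 > 100;

-- $0 is the first field (the name column in the sample file).
names = FOREACH bigsales GENERATE $0;
```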
Using Alternate Column Delimiters
• You can specify an alternate delimiter as an argument to PigStorage
• This example shows how to load comma-delimited data
– Note that this is a single statement
• Or to load pipe-delimited data without specifying column names

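The two examples referred to above did not survive. A minimal sketch of each, with hypothetical file names:

```pig
-- Comma-delimited data with named fields.
-- This is a single statement spread over two lines; it ends at the semicolon.
allsales = LOAD 'sales.csv' USING PigStorage(',')
           AS (name, price);

-- Pipe-delimited data without field names.
allsales_pipe = LOAD 'sales.txt' USING PigStorage('|');
```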
Simple Data Types in Pig
• Pig supports several basic data types
– Similar to those in most databases and programming languages
• Pig treats fields of unspecified type as an array of bytes
– Called the bytearray type in Pig

List of Simple Data Types
• There are eight data types in Pig for simple values
* Not available in older versions of Pig

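The table of types did not survive on this slide. As a hedged sketch, the statement below declares one field for each simple type, assuming the eight are int, long, float, double, boolean, datetime, chararray, and bytearray. The file name and field names are hypothetical.

```pig
-- int        32-bit signed integer
-- long       64-bit signed integer
-- float      32-bit floating point
-- double     64-bit floating point
-- boolean    true or false
-- datetime   date and time
-- chararray  text string
-- bytearray  raw bytes (used when no type is given)
records = LOAD 'all_types' AS (
    quantity:int,
    views:long,
    rating:float,
    distance:double,
    in_stock:boolean,
    sold_at:datetime,
    name:chararray,
    payload:bytearray
);
```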
Specifying Data Types in Pig
• Pig will do its best to determine data types based on context
– For example, you can calculate sales commission as price * 0.1
– In this case, Pig will assume that this value is of type double
• However, it is better to specify data types explicitly when possible
– Helps with error checking and optimizations
– Easiest to do this upon load using the format fieldname:type
• Choosing the right data type is important to avoid loss of precision
• Important: Avoid using floating point numbers to represent money!

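A minimal sketch of the fieldname:type format, using the sales file from the earlier examples. Treating the prices as whole cents is an assumption made here to illustrate the advice about money.

```pig
-- Declare a type for each field at load time.
-- Storing price as an integer number of cents (an assumption about the data)
-- avoids the rounding problems of floating point arithmetic.
allsales = LOAD 'sales' AS (name:chararray, price:int);
```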
How Pig Handles Invalid Data
• When encountering invalid data, Pig substitutes NULL for the value
– For example, an int field containing the value Q4
• The IS NULL and IS NOT NULL operators test for null values
– Note that NULL is not the same as the empty string ''
• You can use these operators to filter out bad records

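Filtering out bad records might look like this sketch; the aliases goodsales and badsales are hypothetical.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- A price such as Q4 cannot be converted to int, so it is loaded as null.
goodsales = FILTER allsales BY price IS NOT NULL;

-- Keep the rejected records separately for later inspection.
badsales = FILTER allsales BY price IS NULL;
```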
Key Data Concepts in Pig
• Relational databases have tables, rows, columns, and fields
• We will use the following data to illustrate Pig’s equivalents
– Assume this data was loaded from a tab-delimited text file as before

Pig Data Concepts: Fields
• A single element of data is called a field
– It corresponds to one of the eight data types seen earlier

Pig Data Concepts: Tuples
• A collection of values is called a tuple
– Fields within a tuple are ordered, but need not all be of the same type

Pig Data Concepts: Bags
• A collection of tuples is called a bag
• Tuples within a bag are unordered by default
– The field count and types may vary between tuples in a bag

Pig Data Concepts: Relations
• A relation is simply a bag with an assigned name (alias)
– Most Pig Latin statements create a new relation
• A typical script loads one or more datasets into relations
– Processing creates new relations instead of modifying existing ones
– The final result is usually also a relation, stored as output

Data Output in Pig
• The command used to handle output depends on its destination
– DUMP: sends output to the screen
– STORE: sends output to disk (HDFS)
• Example of DUMP output, using data from the file shown earlier
– The parentheses and commas indicate tuples with multiple fields
(Alice,2999,us)
(Bob,3625,ca)
(Carlos,2764,mx)
(Dieter,1749,de)
(Étienne,2368,fr)
(Fredo,5637,it)

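Output like the above could be produced by a script along these lines. This is a sketch that assumes the sales file has three tab-separated fields; the field names are taken from the tables used elsewhere in this chapter.

```pig
allsales = LOAD 'sales' AS (name, price, country);

-- DUMP triggers execution and prints one tuple per line to the screen.
DUMP allsales;
```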
Storing Data with Pig
• The STORE command is used to store data to HDFS
– Similar to LOAD, but writes data instead of reading it
– The output path is the name of a directory
– The directory must not yet exist
• As with LOAD, the use of PigStorage is implicit
– The field delimiter also has a default value (tab)
– You may also specify an alternate delimiter

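Both forms of STORE can be sketched as follows. The output directory myreport_csv is hypothetical, and neither output directory may exist before the script runs.

```pig
allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 100;

-- Default: tab-delimited output written to a new HDFS directory.
STORE bigsales INTO 'myreport';

-- Alternate delimiter: comma-separated output.
STORE bigsales INTO 'myreport_csv' USING PigStorage(',');
```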
Viewing the Schema with DESCRIBE
• The DESCRIBE command shows the structure of the data, including names and types
• The following Grunt session shows an example

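The Grunt session did not survive on this slide. A minimal sketch of what such a session could contain, written without the grunt> prompt so it also works as a script:

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);
DESCRIBE allsales;
-- Typical output: allsales: {name: chararray,price: int}
```

DESCRIBE only inspects the schema, so it returns quickly and does not start a job on the cluster.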
Filtering in Pig Latin
• The FILTER keyword extracts tuples matching the specified criteria
bigsales = FILTER allsales BY price > 3000;
[Figure: the allsales relation before filtering and the resulting bigsales relation]

Filtering by Multiple Criteria
• You can combine criteria with AND and OR
[Figure: the allsales relation and the filtered somesales relation]

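The statement that produced somesales did not survive. The following is an illustrative sketch only; the specific name and price thresholds are assumptions.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Parentheses make the intended precedence of AND and OR explicit.
somesales = FILTER allsales BY name == 'Dieter' OR (price > 3500 AND price < 4000);
```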
Aside: String Comparisons in Pig Latin
• The == operator is supported for any type in Pig Latin
– This operator is used for exact comparisons
alices = FILTER allsales BY name == 'Alice';
• Pig Latin supports pattern matching through Java’s regular expressions
– This is done with the MATCHES operator

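A short sketch of MATCHES, using hypothetical aliases. The pattern has to match the entire value, which is why the examples use .* on one side.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Names that start with A, such as Alice.
a_names = FILTER allsales BY name MATCHES 'A.*';

-- Names that end with o, such as Fredo.
o_names = FILTER allsales BY name MATCHES '.*o';
```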
Field Selection in Pig Latin
• Filtering extracts rows, but sometimes we need to extract columns
– This is done in Pig Latin using the FOREACH and GENERATE keywords
allsales
salesperson  amount  trans_id
Alice        2999    107546
Bob          3625    107547
Carlos       2764    107548
Dieter       1749    107549
Étienne      2368    107550
Fredo        5637    107550
twofields
amount  trans_id
2999    107546
3625    107547
2764    107548
1749    107549
2368    107550
5637    107550

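The statement that derives twofields from allsales is not shown above. A minimal sketch, assuming the three fields are loaded with the names in the table header:

```pig
allsales = LOAD 'sales' AS (salesperson:chararray, amount:int, trans_id:int);

-- Keep only the amount and trans_id columns of every record.
twofields = FOREACH allsales GENERATE amount, trans_id;
```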
Generating New Fields in Pig Latin
• The FOREACH and GENERATE keywords can also be used to create fields
– For example, you could create a new field based on price
• It is possible to name such fields
• And you can also specify the data type

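The three variations described above can be sketched as follows. The 0.07 rate, the field name tax, and the aliases are illustrative assumptions.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- A computed field with no name; it can only be referenced by position ($0).
t1 = FOREACH allsales GENERATE price * 0.07;

-- The same field with a name.
t2 = FOREACH allsales GENERATE price * 0.07 AS tax;

-- The same field with a name and a data type.
t3 = FOREACH allsales GENERATE price * 0.07 AS tax:double;
```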
Eliminating Duplicates
• DISTINCT eliminates duplicate records in a bag
– All fields must be equal to be considered a duplicate
[Figure: the all_alices relation and the resulting unique_records relation]

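A minimal sketch of DISTINCT using the relation names from the figure; the input file name and field names are hypothetical.

```pig
all_alices = LOAD 'alices' AS (first:chararray, last:chararray, country:chararray);

-- Records are dropped only when every field matches another record.
unique_records = DISTINCT all_alices;
```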
Controlling Sort Order
• Use ORDER...BY to sort the records in a bag in ascending order
– Add DESC to sort in descending order instead
– Take care to specify a schema: data type affects how data is sorted!
allsales
name     price  country
Alice    29.99  us
Bob      36.25  ca
Carlos   27.64  mx
Dieter   17.49  de
Étienne  23.68  fr
Fredo    56.37  it
sortedsales
name     price  country
Alice    29.99  us
Carlos   27.64  mx
Fredo    56.37  it
Étienne  23.68  fr
Dieter   17.49  de
Bob      36.25  ca

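The sortedsales table above is ordered by country in descending order, which is consistent with a statement like the one in this sketch. The declared types are an assumption.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:double, country:chararray);

-- Sort by country from z to a.
sortedsales = ORDER allsales BY country DESC;
```

Declaring price as a number matters when sorting on it: if the values were compared as text instead, 100.00 would sort before 29.99.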
Limiting Results
• As in SQL, you can use LIMIT to reduce the number of output records
• Beware! Record ordering is arbitrary unless specified with ORDER BY
– Use ORDER BY and LIMIT together to find top-N results

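A top-N query can be sketched as follows; the aliases and the choice of three records are illustrative.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:double, country:chararray);

-- Sort first so that LIMIT keeps a well-defined set of records.
by_price = ORDER allsales BY price DESC;

-- The three records with the highest price.
top_three = LIMIT by_price 3;
```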
Built-in Functions
• These are just a sampling of Pig’s many built-in functions
Function Description   Example Invocation   Input   Output
Convert to uppercase   UPPER(country)       uk      UK
• You can use these with the FOREACH..GENERATE keywords

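Only one row of the function table survived. The sketch below applies UPPER together with a few other built-in functions (TRIM, ROUND, and SUBSTRING) inside FOREACH..GENERATE; the output field names are hypothetical.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:double, country:chararray);

cleaned = FOREACH allsales GENERATE
    UPPER(country) AS country_code,
    TRIM(name) AS trimmed_name,
    ROUND(price) AS rounded_price,
    SUBSTRING(name, 0, 2) AS name_prefix;
```

UPPER converts to uppercase, TRIM removes leading and trailing whitespace, ROUND rounds to the nearest whole number, and SUBSTRING(name, 0, 2) returns the first two characters.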
Essential Points
• Pig Latin supports many of the same operations as SQL
– Though Pig’s approach is quite different
– Pig Latin loads, transforms, and stores data in a series of steps
• The default delimiter for both input and output is the tab character
– You can specify an alternate delimiter as an argument to PigStorage
• Specifying the names and types of fields is not required
– But it can improve performance and readability of your code

Bibliography
The following offer more information on topics discussed in this chapter
• Pig Latin Basics
– http://tiny.cloudera.com/piglatinbasics
• Pig Latin Built-In Functions
– http://tiny.cloudera.com/piglatinbuiltin
• Documentation for Java Regular Expression Patterns
– http://tiny.cloudera.com/javaregex

Hands-On Exercise: Using Pig for ETL Processing
• In this Hands-On Exercise, you will write Pig Latin code to perform basic ETL processing tasks on data related to Dualcore’s online advertising campaigns
• Please refer to the Hands-On Exercise Manual for instructions