Beruflich Dokumente
Kultur Dokumente
201612
Introduc9on
Chapter 1
Course Chapters
§ Introduc.on
§ Apache Hadoop Fundamentals
§ Introduc9on to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ Mul9-Dataset Opera9ons with Apache Pig
§ Apache Pig Troubleshoo9ng and Op9miza9on
§ Introduc9on to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ Rela9onal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive Op9miza9on
§ Apache Impala Op9miza9on
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-3
Trademark Informa9on
§ The names and logos of Apache products men.oned in Cloudera training courses,
including those listed below, are trademarks of the Apache SoDware Founda.on
– Apache Accumulo – Apache Lucene
– Apache Avro – Apache Mahout
– Apache Bigtop – Apache Oozie
– Apache Crunch – Apache Parquet
– Apache Flume – Apache Pig
– Apache Hadoop – Apache Sentry
– Apache HBase – Apache Solr
– Apache HCatalog – Apache Spark
– Apache Hive – Apache Sqoop
– Apache Impala (incuba9ng) – Apache Tika
– Apache KaVa – Apache Whirr
– Apache Kudu – Apache ZooKeeper
§ All other product names, logos, and brands cited herein are the property of their
respec.ve owners
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-4
Chapter Topics
Introduc.on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-5
Course Objec9ves (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-6
Course Objec9ves (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-7
Chapter Topics
Introduc.on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-8
About Cloudera
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-9
Cloudera Training
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-10
Cloudera Cer9fica9on
§ In addi.on to our public training courses, Cloudera offers two levels of
cer.fica.ons
§ Cloudera Cer.fied Professional (CCP)
– The industry’s most demanding performance-based cer9fica9on, CCP
evaluates and recognizes a candidate’s mastery of the technical skills
most sought-a`er by employers
– CCP Data Engineer
– CCP Data Scien9st
§ Cloudera Cer.fied Associate (CCA)
– CCA exams validate founda9onal skills and provide the groundwork for
a candidate to achieve mastery under the CCP program
– CCA Data Analyst
– CCA Spark and Hadoop Developer
– Cloudera Cer9fied Administrator for Apache Hadoop (CCAH)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-11
Cloudera Cer9fied Associate (CCA) Data Analyst
§ The CCA Data Analyst exam tests founda.onal data analyst skills
– ETL processes
– Data defini9on language
– Query language
§ This course is excellent prepara.on for the exam
§ Remote-proctored exam available anywhere at any .me
§ Register at http://tiny.cloudera.com/cca_data_analyst
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-12
CDH
INTEGRATE
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-13
Cloudera Express
§ Cloudera Express
– Completely free to
download and use
§ The best way to get started
with Hadoop
§ Includes CDH
§ Includes Cloudera Manager
– End-to-end
administra9on for
Hadoop
– Deploy, manage, and
monitor your cluster
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-14
Cloudera Enterprise
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-15
Chapter Topics
Introduc.on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-16
Logis9cs
Your instructor will give you details on how to access the course materials
and exercise instruc.ons for the course
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-17
Chapter Topics
Introduc.on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-18
Introduc9ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 01-19
Apache Hadoop Fundamentals
Chapter 2
Course Chapters
§ IntroducEon
§ Apache Hadoop Fundamentals
§ IntroducEon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulE-Dataset OperaEons with Apache Pig
§ Apache Pig TroubleshooEng and OpEmizaEon
§ IntroducEon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaEonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpEmizaEon
§ Apache Impala OpEmizaEon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-2
Apache Hadoop Fundamentals
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-4
Volume
§ Every day…
– Over 2.25 billion shares are traded on the New York Stock Exchange
– Facebook stores 4.5 billion “Likes”
– Google processes about 24 petabytes of data
§ Every minute…
– Facebook users share nearly 2.5 million pieces of content
– Email users send 204,000,000 messages
§ Every second…
– Financial insEtuEons process more than 10,000 credit card transacEons
– Amazon Web Services fields more than 650,000 requests
§ And it’s not stopping…
– They can’t collect it all yet, but the Large Hadron Collider produces 572
terabytes of data per second
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-5
Velocity
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-6
Variety
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-7
Value
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-8
We Need a System that Scales
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-9
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-10
What is Apache Hadoop?
tolerant
UNIFIED SERVICES
– Harnesses the power RESOURCE MANAGEMENT SECURITY
of industry-standard YARN Sentry, RecordService
hardware
FILESYSTEM RELATIONAL NoSQL
§ Heavily inspired by HDFS Kudu HBase
INTEGRATE
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-11
Scalability
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-12
Fault Tolerance
§ Paradox: Adding nodes increases the chance that any one of them will fail
– SoluEon: Build redundancy into the system and handle it automaEcally
§ Files loaded into HDFS are replicated across nodes in the cluster
– If a node fails, its data is replicated using one of the other copies
§ Data processing jobs are broken into individual tasks
– Each task takes a small amount of data as input
– Thousands of tasks (or more) oeen run in parallel
– If a node fails during processing, its tasks are rescheduled elsewhere
§ RouJne failures are handled automaJcally without any loss of data
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-13
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-14
HDFS: Hadoop Distributed File System
Spark INTEGRATE
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-15
HDFS CharacterisEcs
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-16
HDFS Architecture
architecture Master
HDFS NameNode
op § HDFS master daemon: NameNode
fs -put sales.txt /reports
– Manages namespace and metadata
– Monitors worker nodes
§ HDFS worker daemon: DataNode
– Reads and writes the actual data
Workers
HDFS DataNodes
op fs -get /reports/sales.txt
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-17
Accessing HDFS Using the Command Line
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-18
Copying Local Data to and from HDFS
Hadoop Cluster
Hadoop Cluster
Hadoop Cluster
$ hadoop
$ hadoop fs -getfs/reports/sales.txt
-get /reports/sales.txt
$ hdfs dfs -get file
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-19
More hdfs dfs Command Examples
§ Copy file input.txt from local disk to the user’s directory in HDFS
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-20
Using the Hue HDFS File Manager
File Browser
Manage Files
Upload Files
Browse Files
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-21
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-22
Workload Management: YARN
usage INTEGRATE
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-23
Hadoop Cluster Architecture
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-24
General Data Processing
– MapReduce
Spark, Hive, Pig Spark Impala Solr
MapReduce
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-25
Hadoop MapReduce
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-26
Apache Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-27
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-28
Data Processing and Analysis with Hadoop (1)
§ Hadoop MapReduce and Spark are powerful data processing engines but…
– Hard to master
– Require programming skills
– Slow to develop, hard to maintain
§ Hadoop includes several other tools for data processing and analysis
– Tools for data analysts, not programmers
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-29
Data Processing and Analysis with Hadoop (2)
§ Higher-level abstracJons
PROCESS, ANALYZE, SERVE
for general data
processing BATCH STREAM SQL SEARCH
INTEGRATE
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-30
Apache Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-31
Apache Hive
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-32
Apache Impala (IncubaEng)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-33
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-34
Apache Sqoop
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-35
ImporEng Tables with Sqoop
$ sqoop import \
--connect jdbc:mysql://localhost/company \
--username jdoe --password bigsecret \
--table customers
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-36
Specifying Import Directory with Sqoop
$ sqoop import \
--connect jdbc:mysql://localhost/company \
--username jdoe --password bigsecret \
--table customers \
--target-dir /mydata/customers
$ sqoop import \
--connect jdbc:mysql://localhost/company \
--username jdoe --password bigsecret \
--table customers \
--warehouse-dir /mydata
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/company \
--username jdoe --password bigsecret \
--warehouse-dir /mydata \
--fields-terminated-by '\t'
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-38
ImporEng ParEal Tables with Sqoop
$ sqoop import \
--connect jdbc:mysql://localhost/company \
--username jdoe --password bigsecret \
--table products \
--warehouse-dir /mydata \
--columns "prod_id,name,price"
$ sqoop import \
--connect jdbc:mysql://localhost/company \
--username jdoe --password bigsecret \
--table products \
--warehouse-dir /mydata \
--where "price >= 1000"
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-39
Incremental Imports with Sqoop
$ sqoop import \
--connect jdbc:mysql://localhost/company \
--username jdoe --password bigsecret \
--table orders \
--warehouse-dir /mydata \
--incremental append \
--check-column order_id \
--last-value 6713821
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-40
Handling ModificaEons with Incremental Imports
$ sqoop import \
--connect jdbc:mysql://localhost/company \
--username jdoe --password bigsecret \
--table shipments \
--warehouse-dir /mydata \
--incremental lastmodified \
--check-column last_update_date \
--last-value "2016-12-21 05:36:59"
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-41
ExporEng Data from Hadoop to RDBMS with Sqoop
§ We have seen several ways to pull records from an RDBMS into Hadoop
– It is someEmes also helpful to push data in Hadoop back to an RDBMS
§ Sqoop supports this with export
– The target table must already exist in the RDBMS
$ sqoop export \
--connect jdbc:mysql://localhost/company \
--username jdoe --password bigsecret \
--table product_recommendations \
--export-dir /mydata/recommender_output
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-42
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-43
Apache HBase
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-44
Apache Flume
§ Flume imports data into HDFS as it is being generated by various sources
Log Files
UNIX Custom
syslog Sources
Hadoop Cluster
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-45
Recap: Data Center IntegraEon
Flu F S
me HD
Sq
oop Hadoop Cluster oo
p
Sq
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-46
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-47
Hands-On Exercises: Scenario ExplanaEon
§ Exercises throughout the course will reinforce the topics being discussed
– Exercises simulate the kind of tasks oeen performed using the tools you
will learn about in class
– Most exercises depend on data generated in earlier exercises
§ Scenario: Dualcore Inc. is a leading electronics retailer
– More than 1,000 brick-and-mortar stores
– Dualcore also has a thriving e-commerce website
§ Dualcore has hired you to help find value in its data; you will
– Process and analyze data from internal and external sources
– IdenEfy opportuniEes to increase revenue
– Find new ways to reduce costs
– Help other departments achieve their goals
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-48
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-49
EssenEal Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-50
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-51
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-52
Hands-On Exercise: Data Ingest with Hadoop Tools
§ In this Hands-On Exercise, you will gain pracJce adding data from the local
filesystem and from a relaJonal database server to HDFS
– You will analyze this data in subsequent exercises
– Please refer to the Hands-On Exercise Manual for instrucEons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-53
About the Training Virtual Machine
§ During this course, you will perform numerous hands-on exercises using the
provided hands-on environment
– A virtual machine (VM) running Linux
§ This environment has Hadoop installed in pseudo-distributed mode
– A cluster comprised of a single node
– Typically used for tesEng code before deploying to a large cluster
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 02-54
Introduc)on to Apache Pig
Chapter 3
Course Chapters
§ Introduc)on
§ Apache Hadoop Fundamentals
§ Introduc.on to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ Mul)-Dataset Opera)ons with Apache Pig
§ Apache Pig Troubleshoo)ng and Op)miza)on
§ Introduc)on to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ Rela)onal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive Op)miza)on
§ Apache Impala Op)miza)on
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-2
Introduc)on to Apache Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-3
Chapter Topics
§ What Is Pig?
§ Pig Features
§ Pig Use Cases
§ Interac)ng with Pig
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-4
Apache Pig Overview
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-5
The Anatomy of Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-6
Chapter Topics
§ What Is Pig?
§ Pig Features
§ Pig Use Cases
§ Interac)ng with Pig
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-7
Pig Features
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-8
Chapter Topics
§ What Is Pig?
§ Pig Features
§ Pig Use Cases
§ Interac)ng with Pig
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-9
How Are Organiza)ons Using Pig?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-10
Use Case: Web Log Sessioniza)on
§ Pig can help you extract valuable informa.on from web server log files
Order Widget X
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-11
Use Case: Data Sampling
100 TB 50 MB
Random
Sampling
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-12
Use Case: ETL Processing
§ Pig is also widely used for Extract, Transform, and Load (ETL) processing
Accounting
Validate Fix Remove Encode Data Warehouse
data errors duplicates values
Call Center
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-13
Chapter Topics
§ What Is Pig?
§ Pig Features
§ Pig Use Cases
§ Interac.ng with Pig
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-14
Using Pig Interac)vely
$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 999;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;
§ Use pig -e to execute a Pig La.n statement from the UNIX shell
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-15
Interac)ng with HDFS
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-16
Interac)ng with UNIX
grunt> sh date;
Wed Nov 12 06:39:13 PST 2016
grunt> sh ls; -- lists local files
grunt> fs -ls; -- lists HDFS files
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-17
Running Pig Scripts
§ A Pig script is simply Pig La.n code stored in a text file
– By conven)on, these files have the .pig extension
§ You can run a Pig script from within the Grunt shell using run
– This is useful for automa)on and batch execu)on
§ It is common to run a Pig script directly from the UNIX shell
$ pig salesreport.pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-18
MapReduce and Local Modes
§ As described earlier, Pig turns Pig La.n into MapReduce jobs
– Pig submits those jobs for execu)on on the Hadoop cluster
§ It is also possible to run Pig in “local mode” using the -x flag
– This runs jobs on the local machine instead of the cluster
– Local mode uses the local filesystem instead of HDFS
– Can be helpful for tes)ng before deploying a job to produc)on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-19
Client-Side Log Files
§ If a job fails, Pig may produce a log file to explain why
– These log files are typically produced in your current working directory
on the local (client) machine
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-20
Chapter Topics
§ What Is Pig?
§ Pig Features
§ Pig Use Cases
§ Interac)ng with Pig
§ Essen.al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-21
Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-22
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 03-23
Basic Data Analysis with Apache Pig
Chapter 4
Course Chapters
§ IntroducGon
§ Apache Hadoop Fundamentals
§ IntroducGon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulG-Dataset OperaGons with Apache Pig
§ Apache Pig TroubleshooGng and OpGmizaGon
§ IntroducGon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaGonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpGmizaGon
§ Apache Impala OpGmizaGon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-2
Basic Data Analysis with Apache Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-4
Pig LaGn Overview
/*
* Save the filtered results into a new
* directory, below my home directory.
*/
STORE bigsales INTO 'myreport';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-5
Pig LaGn Grammar: Keywords
/*
* Save the filtered results into a new
* directory, below my home directory.
*/
STORE bigsales INTO 'myreport';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-6
Pig LaGn Grammar: IdenGfiers (1)
§ Iden+fiers are the names assigned to fields and other data structures
/*
* Save the filtered results into a new
* directory, below my home directory.
*/
STORE bigsales INTO 'myreport';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-7
Pig LaGn Grammar: IdenGfiers (2)
Valid Invalid
x 4
Q1 price$
q1_2016 profit%
myData _sale
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-8
Pig LaGn Grammar: Comments
/*
* Save the filtered results into a new
* directory, below my home directory.
*/
STORE bigsales INTO 'myreport';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-9
Case SensiGvity in Pig LaGn
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-10
Common Operators in Pig LaGn
§ Many commonly used operators in Pig LaBn are familiar to SQL users
– Notable difference: Pig LaGn uses == and != for comparison
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-11
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-12
Basic Data Loading in Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-13
Data Sources: Files and Directories
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-14
Specifying Column Names During Load
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-15
Using AlternaGve Column Delimiters
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-16
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-17
Simple Data Types in Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-18
Specifying Data Types in Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-19
How Pig Handles Invalid Data
§ When encountering invalid data, Pig subsBtutes NULL for the value
– For example, an int field containing the value Q4
§ The IS NULL and IS NOT NULL operators test for null values
– Note that NULL is not the same as the empty string ''
§ You can use these operators to filter out bad records
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-20
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-21
Key Data Concepts in Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-22
Pig Data Concepts: Fields
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-23
Pig Data Concepts: Tuples
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-24
Pig Data Concepts: Bags
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-25
Pig Data Concepts: RelaGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-26
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-27
Data Output in Pig
(Alice,2999,us)
(Bob,3625,ca)
(Carlos,2764,mx)
(Dieter,1749,de)
(Étienne,2368,fr)
(Fredo,5637,it)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-28
Storing Data with Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-29
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-30
Viewing the Schema with DESCRIBE
§ DESCRIBE shows the structure of the data, including names and types
§ The following Grunt session shows an example
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-31
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-32
Filtering in Pig
allsales bigsales
name price country name price country
Alice 2999 us Bob 3625 ca
Bob 3625 ca Fredo 5637 it
Carlos 2764 mx
Dieter 1749 de Price is greater than 3000
Étienne 2368 fr
Fredo 5637 it
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-33
Filtering by MulGple Criteria
allsales somesales
name price country name price country
Alice 2999 us Bob 3625 ca
Bob 3625 ca Dieter 1749 de
Carlos 2764 mx
Dieter 1749 de Name is Dieter, or price is greater
Étienne 2368 fr than 3500 and less than 4000
Fredo 5637 it
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-34
Aside: String Comparisons in Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-35
Field SelecGon in Pig
allsales twofields
salesperson amount trans_id amount trans_id
Alice 2999 107546 2999 107546
Bob 3625 107547 3625 107547
Carlos 2764 107548 2764 107548
Dieter 1749 107549 1749 107549
Étienne 2368 107550 2368 107550
Fredo 5637 107550 5637 107550
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-36
GeneraGng New Fields in Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-37
Changing Data Types in Pig
§ Convert fields from one data type to another using cast operators
– Useful if schema is unspecified when data is loaded
– Conversion to some data types is not supported
– Data may be lost or truncated
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-38
EliminaGng Duplicates
all_alices unique_records
firstname lastname country firstname lastname country
Alice Smith us Alice Smith us
Alice Jones us Alice Jones us
Alice Brown us Alice Brown us
Alice Brown us Alice Brown ca
Alice Brown ca
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-39
Controlling Sort Order
allsales sortedsales
name price country name price country
Alice 29.99 us Alice 29.99 us
Bob 36.25 ca Carlos 27.64 mx
Carlos 27.64 mx Fredo 56.37 it
Dieter 17.49 de Étienne 23.68 fr
Étienne 23.68 fr Dieter 17.49 de
Fredo 56.37 it Bob 36.25 ca
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-40
LimiGng Results
§ As in SQL, you can use LIMIT to reduce the number of output records
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-41
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-42
Built-In FuncGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-43
Built-In Math FuncGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-44
Built-In String FuncGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-45
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-46
EssenGal Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-47
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-48
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-49
Hands-On Exercise: Using Pig for ETL processing
§ In this Hands-On Exercise, you will write Pig LaBn code to perform basic ETL
processing tasks on data related to Dualcore’s online adverBsing campaigns
– Please refer to the Hands-On Exercise Manual for instrucGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriCen consent from Cloudera. 04-50
Processing Complex Data with Apache
Pig
Chapter 5
Course Chapters
§ IntroducGon
§ Apache Hadoop Fundamentals
§ IntroducGon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulG-Dataset OperaGons with Apache Pig
§ Apache Pig TroubleshooGng and OpGmizaGon
§ IntroducGon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaGonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpGmizaGon
§ Apache Impala OpGmizaGon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-2
Processing Complex Data with Apache Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-3
Chapter Topics
§ Storage Formats
§ Complex/Nested Data Types
§ Grouping
§ Built-In FuncGons for Complex Data
§ IteraGng Grouped Data
§ EssenGal Points
§ Hands-On Exercise: Analyzing Ad Campaign Data with Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-4
Storage Formats
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-5
Other Supported Formats
§ Here are a few of Pig’s built-in funcFons for loading data in other formats
– It is also possible to implement a custom loader by wriGng Java code
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-6
Load and Store FuncGons
§ Some funcFons load data, some store data, and some do both
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-7
Chapter Topics
§ Storage Formats
§ Complex/Nested Data Types
§ Grouping
§ Built-In FuncGons for Complex Data
§ IteraGng Grouped Data
§ EssenGal Points
§ Hands-On Exercise: Analyzing Ad Campaign Data with Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-8
Pig’s Complex Data Types: Tuple and Bag
§ We have already seen two of Pig’s three complex data types
– A tuple is a collecGon of fields
– A bag is a collecGon of tuples
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-9
Pig’s Complex Data Types: Map
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-10
RepresenGng Complex Types in Pig
§ It is important to know how to define and recognize these types in Pig
Type DefiniFon
Tuple Comma-delimited list inside parentheses:
('107546', 2999, 'Alice')
Map Brackets surround comma-delimited list of pairs; keys and values separated by #:
['store'#'MIA01','location'#'Coral Gables']
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-11
Loading and Using Complex Types (1)
TAB TAB
Field 1 Field 2 Field 3
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-12
Loading and Using Complex Types (2)
TAB TAB
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-13
Referencing Map Data
Bob [salary#52000,age#52]
§ And loaded with the following schema
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-14
Chapter Topics
§ Storage Formats
§ Complex/Nested Data Types
§ Grouping
§ Built-In FuncGons for Complex Data
§ IteraGng Grouped Data
§ EssenGal Points
§ Hands-On Exercise: Analyzing Ad Campaign Data with Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-15
Grouping Records by a Field (1)
Alice 729
Bob 3999
Alice 27999
Carol 32999
Carol 4999
§ Use GROUP BY to do this in Pig LaFn
– The new relaGon has one record per unique value in the specified field
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-16
Grouping Records by a Field (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-17
Grouping Records by a Field (3)
group sales
field field
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-18
Using GROUP BY to Aggregate Data
§ Aggregate func1ons create one output value from mulFple input values
– For example, to calculate total sales by employee
– Usually applied to grouped data
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-19
Grouping Everything into a Single Record
§ We just saw that GROUP BY creates one record for each unique value
§ GROUP ALL puts all data into one record
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-20
Using GROUP ALL to Aggregate Data
§ Use GROUP ALL when you need to aggregate one or more columns
– For example, to calculate total sales for all employees
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-21
Removing NesGng in Data
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-22
An Example of FLATTEN
§ The following shows the nested data and what FLATTEN does to it
grunt> byname = GROUP sales BY name;
grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-23
Chapter Topics
§ Storage Formats
§ Complex/Nested Data Types
§ Grouping
§ Built-In FuncFons for Complex Data
§ IteraGng Grouped Data
§ EssenGal Points
§ Hands-On Exercise: Analyzing Ad Campaign Data with Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-24
Pig’s Built-In Aggregate FuncGons
§ Pig has built-in support for other aggregate funcFons besides SUM
§ Examples:
– AVG: Calculates the average (mean) of all values in the bag
– MIN: Returns the smallest value in the bag
– MAX: Returns the largest value in the bag
§ Pig has two built-in funcFons for counFng records
– COUNT: Returns the number of non-null elements in the bag
– COUNT_STAR: Returns the number of all elements in the bag
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-25
Other Notable Built-In FuncGons
FuncFon DescripFon
BagToString Concatenates the elements of a bag into a string (chararray)
BagToTuple Un-nests the elements of a bag into a tuple
DIFF Finds tuples that appear in only one of two supplied bags
IsEmpty Used with FILTER to match bags or maps that contain no data
SIZE Returns the size of the field (definiGon of size varies by data type)
TOKENIZE Splits a text string (chararray) into a bag of individual words
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-26
Chapter Topics
§ Storage Formats
§ Complex/Nested Data Types
§ Grouping
§ Built-In FuncGons for Complex Data
§ IteraFng Grouped Data
§ EssenGal Points
§ Hands-On Exercise: Analyzing Ad Campaign Data with Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-27
Record IteraGon
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-28
Nested FOREACH
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-29
Nested FOREACH Example (1)
Input Data
§ Our input data contains a list of employee
job Ftles and corresponding salaries President 192000
Director 152500
§ Goal: idenFfy the three highest salaries Director 161000
within each Ftle Director 167000
Director 165000
Director 147000
Engineer 92300
Engineer 85000
Engineer 83000
Engineer 81650
Engineer 82100
Engineer 87300
Engineer 76000
Manager 87000
Manager 81000
Manager 75000
Manager 79000
Manager 67500
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-30
Nested FOREACH Example (2)
§ First, load the data from the file Input Data (excerpt)
President 192000
§ Next, group employees by Ftle
Director 152500
– Assigned to new relaGon title_group Director 161000
...
Engineer 92300
...
Manager 67500
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-31
Nested FOREACH Example (3)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-32
Nested FOREACH Example (4)
Code (LOAD statement removed for brevity) Input Data (excerpt)
title_group = GROUP employees BY title; President 192000
Director 152500
top_salaries = FOREACH title_group { Director 161000
sorted = ORDER employees BY salary DESC; ...
Engineer 92300
highest_paid = LIMIT sorted 3;
...
GENERATE group, highest_paid.salary;
Manager 67500
};
(Director,{(167000),(165000),(161000)})
(Engineer,{(92300),(87300),(85000)})
(Manager,{(87000),(81000),(79000)})
(President,{(192000)})
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-33
Chapter Topics
§ Storage Formats
§ Complex/Nested Data Types
§ Grouping
§ Built-In FuncGons for Complex Data
§ IteraGng Grouped Data
§ EssenFal Points
§ Hands-On Exercise: Analyzing Ad Campaign Data with Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-34
EssenGal Points
§ Pig has three complex data types: tuple, bag, and map
– A map is simply a collecGon of key/value pairs
§ These structures can contain simple types like int or chararray
– But they can also contain complex data types
– Nested data structures are common in Pig
§ Pig provides methods for grouping and ungrouping data
– You can remove a level of nesGng using the FLATTEN operator
§ Pig offers several built-in aggregate funcFons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-35
Chapter Topics
§ Storage Formats
§ Complex/Nested Data Types
§ Grouping
§ Built-In FuncGons for Complex Data
§ IteraGng Grouped Data
§ EssenGal Points
§ Hands-On Exercise: Analyzing Ad Campaign Data with Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-36
Hands-On Exercise: Analyzing Ad Campaign Data with Pig
§ In this Hands-On Exercise, you will analyze data from Dualcore’s online ad
campaign
– Please refer to the Hands-On Exercise Manual for instrucGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 05-37
Mul$-Dataset Opera$ons with Apache
Pig
Chapter 6
Course Chapters
§ Introduc$on
§ Apache Hadoop Fundamentals
§ Introduc$on to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ Mul)-Dataset Opera)ons with Apache Pig
§ Apache Pig Troubleshoo$ng and Op$miza$on
§ Introduc$on to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ Rela$onal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive Op$miza$on
§ Apache Impala Op$miza$on
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-2
Mul$-Dataset Opera$ons with Apache Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-4
Overview of Combining Datasets
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-5
Example Datasets (1)
stores
§ Most examples in this chapter will involve the
same two datasets A Anchorage
B Boston
§ The first is a file containing informa)on about C Chicago
Dualcore’s stores D Dallas
E Edmonton
§ There are two fields in this rela)on F Fargo
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-6
Example Datasets (2)
stores
§ Our other dataset is a file containing
informa)on about Dualcore’s salespeople A Anchorage
B Boston
§ This rela)on contains three fields C Chicago
D Dallas
1. person_id:int (unique key) E Edmonton
2. name:chararray (salesperson name) F Fargo
3. store_id:chararray (refers to store) salespeople
1 Alice B
2 Bob D
3 Carlos F
4 Dieter A
5 Étienne F
6 Fredo C
7 George D
8 Hannah B
9 Irina C
10 Jack
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-7
Grouping Mul$ple Rela$ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-8
Example of COGROUP
stores
A Anchorage
grunt> grouped = COGROUP stores BY store_id,
B Boston
salespeople BY store_id;
C Chicago
D Dallas
E Edmonton grunt> DUMP grouped;
F Fargo (A,{(A,Anchorage)},{(4,Dieter,A)})
(B,{(B,Boston)},{(1,Alice,B),(8,Hannah,B)})
salespeople (C,{(C,Chicago)},{(6,Fredo,C),(9,Irina,C)})
(D,{(D,Dallas)},{(2,Bob,D),(7,George,D)})
1 Alice B (E,{(E,Edmonton)},{})
2 Bob D (F,{(F,Fargo)},{(3,Carlos,F),(5,Étienne,F)})
3 Carlos F
(,{},{(10,Jack,)})
4 Dieter A
5 Étienne F
6 Fredo C
7 George D
8 Hannah B
9 Irina C
10 Jack
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-9
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-10
Join Overview
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-11
Key Fields
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-12
Inner Joins
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-13
Inner Join Example
stores
A Anchorage
grunt> joined = JOIN stores BY store_id,
B Boston
salespeople BY store_id;
C Chicago
D Dallas
E Edmonton grunt> DUMP joined;
F Fargo (A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
salespeople (B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
1 Alice B (C,Chicago,9,Irina,C)
2 Bob D (D,Dallas,2,Bob,D)
3 Carlos F
(D,Dallas,7,George,D)
4 Dieter A
5 Étienne F (F,Fargo,3,Carlos,F)
6 Fredo C (F,Fargo,5,Étienne,F)
7 George D
8 Hannah B
9 Irina C
10 Jack
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-14
Elimina$ng Duplicate Fields (1)
§ As with COGROUP, the new rela)on s)ll contains duplicate fields
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-15
Elimina$ng Duplicate Fields (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-16
Outer Joins
§ Pig La)n allows you to specify the type of join following the field name
– Inner joins do not specify a join type
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-17
LeV Outer Join Example
stores
§ Le[ Outer Join: Result contains all records from the
A Anchorage rela)on specified on the le[, but only matching
B Boston
C Chicago records from the one specified on the right
D Dallas
E Edmonton grunt> joined = JOIN stores BY store_id
F Fargo LEFT OUTER, salespeople BY store_id;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-18
Right Outer Join Example
stores
§ Right Outer Join: Result contains all records from
A Anchorage the rela)on specified on the right, but only
B Boston
C Chicago matching records from the one specified on the le[
D Dallas
E Edmonton grunt> joined = JOIN stores BY store_id
F Fargo RIGHT OUTER, salespeople BY store_id;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-19
Full Outer Join Example
stores
§ Full Outer Join: Result contains all matching and
A Anchorage non-matching records from both rela)ons
B Boston
C Chicago
D Dallas grunt> joined = JOIN stores BY store_id
E Edmonton FULL OUTER, salespeople BY store_id;
F Fargo
grunt> DUMP joined;
salespeople (A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
1 Alice B
(B,Boston,8,Hannah,B)
2 Bob D
3 Carlos F (C,Chicago,6,Fredo,C)
4 Dieter A (C,Chicago,9,Irina,C)
5 Étienne F (D,Dallas,2,Bob,D)
6 Fredo C (D,Dallas,7,George,D)
7 George D (E,Edmonton,,,)
8 Hannah B (F,Fargo,3,Carlos,F)
9 Irina C (F,Fargo,5,Étienne,F)
10 Jack (,,10,Jack,)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-20
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-21
Crossing Datasets
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-22
Cross Product Example
sizes
§ Generates every possible combina)on of records in
S 3.5 ml the sizes and colors rela)ons
M 7.5 ml
L 12.0 ml
grunt> crossed = CROSS sizes, colors;
colors
grunt> DUMP crossed;
C Cyan (S,3.5 ml,C,Cyan)
M Magenta (S,3.5 ml,M,Magenta)
Y Yellow (S,3.5 ml,Y,Yellow)
K Black (S,3.5 ml,K,Black)
(M,7.5 ml,C,Cyan)
(M,7.5 ml,M,Magenta)
(M,7.5 ml,Y,Yellow)
(M,7.5 ml,K,Black)
(L,12.0 ml,C,Cyan)
(L,12.0 ml,M,Magenta)
(L,12.0 ml,Y,Yellow)
(L,12.0 ml,K,Black)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-23
Concatena$ng Datasets
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-24
UNION Example
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-25
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-26
SpliSng Datasets
§ You have learned several ways to combine datasets into a single rela)on
§ Some)mes you need to split a dataset into mul)ple rela)ons
– Server logs by date range
– Customer lists by region
– Product lists by vendor
§ Pig La)n supports this with the SPLIT operator
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-27
SPLIT Example
§ Split customers into groups for a rewards program, based on life)me value
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-28
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-29
Essen$al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-30
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-31
Hands-On Exercise: Analyzing Disparate Datasets with Pig
§ In this Hands-On Exercise, you will analyze mul)ple datasets with Pig
– Please refer to the Hands-On Exercise Manual for instruc$ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 06-32
Apache Pig Troubleshoo2ng and Op2miza2on
Chapter 7
Course Chapters
§ Introduc2on
§ Apache Hadoop Fundamentals
§ Introduc2on to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ Mul2-Dataset Opera2ons with Apache Pig
§ Apache Pig Troubleshoo6ng and Op6miza6on
§ Introduc2on to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ Rela2onal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive Op2miza2on
§ Apache Impala Op2miza2on
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-2
Apache Pig Troubleshoo2ng and Op2miza2on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-3
Chapter Topics
§ Troubleshoo6ng Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execu2on Plan
§ Tips for Improving the Performance of Pig Jobs
§ Essen2al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-4
Troubleshoo2ng Overview
§ We have now covered how to use Pig for data analysis
– Unfortunately, some2mes your code may not work as you expect
– It is important to remember that Pig and Hadoop are intertwined
§ Here we will cover some techniques for isola6ng and resolving problems
– We will start with a few op2ons to the pig command
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-5
Helping Yourself
§ We will discuss some op6ons for the pig command in this chapter
– You can view all of them by using the -h (help) op2on
– Keep in mind that many op2ons are advanced or rarely used
§ One useful op6on is -c (check), which validates the syntax of your code
$ pig -c myscript.pig
myscript.pig syntax OK
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-6
Ge_ng Help from Others
$ pig -version
Apache Pig version 0.12.0-cdh5.8.0
$ hadoop version
Hadoop 2.6.0-cdh5.8.0
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-7
Chapter Topics
§ Troubleshoo2ng Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execu2on Plan
§ Tips for Improving the Performance of Pig Jobs
§ Essen2al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-8
Customizing Log Messages
log4j.logger.org.apache.pig=ERROR
log4j.logger.org.apache.hadoop.conf.Configuration=ERROR
log4jconf=/etc/pig/conf/log4j.properties
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-9
Customizing Log Messages on a Per-Job Basis
log4j.logger.org.apache.pig=DEBUG
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-10
Controlling Client-Side Log Files
§ When a job fails, Pig may produce a log file to explain why
– These are typically produced in your current directory
§ To use a different loca6on, use the -l (log) op6on when star6ng Pig
$ pig -l /tmp
log.file=/tmp
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-11
Chapter Topics
§ Troubleshoo2ng Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execu2on Plan
§ Tips for Improving the Performance of Pig Jobs
§ Essen2al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-12
The Hadoop Web UI
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-13
The ResourceManager Web UI (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-14
The ResourceManager Web UI (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-15
Naming Your Job
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-16
Killing a Job
§ A job that processes a lot of data can take hours to complete
– Some2mes you spot an error in your code just ader submi_ng a job
– Rather than wait for the job to complete, you can kill it
§ First, find the job’s ID on the main page of the ResourceManager web UI
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-17
Chapter Topics
§ Troubleshoo2ng Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execu2on Plan
§ Tips for Improving the Performance of Pig Jobs
§ Essen2al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-18
Using SAMPLE to Create a Smaller Dataset
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-19
Intelligent Sampling with ILLUSTRATE
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-20
Using ILLUSTRATE Helps You to Understand Data Flow
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-21
General Debugging Strategies
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-22
Chapter Topics
§ Troubleshoo2ng Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execu2on Plan
§ Tips for Improving the Performance of Pig Jobs
§ Essen2al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-23
Performance Overview
§ We have discussed several techniques for finding errors in Pig La6n code
– Once you get your code working, you’ll oden want it to work faster
§ Performance tuning is a broad and complex subject
– Requires a deep understanding of Pig, Hadoop, Java, and Linux
– Typically the domain of engineers and system administrators
§ Most of these topics are beyond the scope of this course
– We will cover the basics here, and offer several performance
improvement 2ps
– See Programming Pig for detailed coverage
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-24
Chapter Topics
§ Troubleshoo2ng Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execu6on Plan
§ Tips for Improving the Performance of Pig Jobs
§ Essen2al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-25
How Pig La2n Becomes a MapReduce Job
§ Pig La6n code ul6mately runs as MapReduce jobs on the Hadoop cluster
§ However, Pig does not translate your code into Java MapReduce code
– Much like rela2onal databases don’t translate SQL to C language code
– Like a database, Pig interprets the Pig La2n to develop execu2on plans
– Pig’s execu2on engine uses these to submit MapReduce jobs to Hadoop
§ EXPLAIN describes Pig’s three execu6on plans
– Logical
– Physical
– MapReduce
§ Seeing an example job will help us beeer understand EXPLAIN’s output
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-26
Descrip2on of Example Code and Data
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-27
Using EXPLAIN
§ Using EXPLAIN rather than DUMP will show the execu6on plans
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-28
Chapter Topics
§ Troubleshoo2ng Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execu2on Plan
§ Tips for Improving the Performance of Pig Jobs
§ Essen2al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-29
Pig’s Runtime Optimizations
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-30
Op2miza2ons You Can Make in Pig La2n Code
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-31
Don’t Produce Output You Don’t Really Need
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-32
Specify Schema Whenever Possible
§ Specifying schema when loading data eliminates the need for Pig to guess
– It may choose a “bigger” type than you need (like long instead of int)
§ The postcode and phone fields in the stores dataset were never used
– Elimina2ng them in the schema ensures they’ll be omiFed in joined
stores =
LOAD 'stores' AS (store_id:chararray, name:chararray);
sales =
LOAD 'sales' AS (store_id:chararray, price:int);
joined =
JOIN sales BY store_id, stores BY store_id;
groups =
GROUP joined BY sales::store_id;
totals =
FOREACH groups GENERATE
FLATTEN(joined.stores::name) AS name,
SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-33
Filter Unwanted Data as Early as Possible
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-34
Consider Adjus2ng the Paralleliza2on
set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
FLATTEN(joined.stores::name) AS name,
SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-35
Specify the Largest Dataset Last in a Join
set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
FLATTEN(joined.stores::name) AS name,
SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-36
Try Using Compression on Intermediate Data
§ Pig scripts o]en yield jobs with both a map and a reduce phase
– Output of the map phase becomes input to the reduce phase
– Compressing this intermediate data is easy and can boost performance
– Your system administrator may need to install a compression library
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-37
A Few More Tips for Improving Performance
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-38
Chapter Topics
§ Troubleshoo2ng Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execu2on Plan
§ Tips for Improving the Performance of Pig Jobs
§ Essen6al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-39
Essen2al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-40
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 07-41
Introduc)on to Apache Hive and
Impala
Chapter 8
Course Chapters
§ Introduc)on
§ Apache Hadoop Fundamentals
§ Introduc)on to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ Mul)-Dataset Opera)ons with Apache Pig
§ Apache Pig Troubleshoo)ng and Op)miza)on
§ Introduc.on to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ Rela)onal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive Op)miza)on
§ Apache Impala Op)miza)on
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-2
Introduc)on to Apache Hive and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-3
Chapter Topics
§ What Is Hive?
§ What Is Impala?
§ Why Use Hive and Impala?
§ Schema and Data Storage
§ Comparing Hive and Impala to Tradi)onal Databases
§ Use Cases
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-4
Review: Hadoop Data Processing and Analysis
UNIFIED SERVICES
RESOURCE MANAGEMENT SECURITY
YARN Sentry, RecordService
STORE
BATCH REAL-TIME
Sqoop Kafka, Flume
INTEGRATE
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-5
What Is Apache Hive?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-6
HiveQL
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-7
Hive High-Level Overview
Hadoop
Cluster
HDFS
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-8
Chapter Topics
§ What Is Hive?
§ What Is Impala?
§ Why Use Hive and Impala?
§ Schema and Data Storage
§ Comparing Hive and Impala to Tradi)onal Databases
§ Use Cases
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-9
What Is Apache Impala (incuba)ng)?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-10
Impala High-Level Overview
HDFS
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-11
Impala Query Language
Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-12
Chapter Topics
§ What Is Hive?
§ What Is Impala?
§ Why Use Hive and Impala?
§ Schema and Data Storage
§ Comparing Hive and Impala to Tradi)onal Databases
§ Use Cases
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-13
Why Use Hive and Impala?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-14
Comparing Hive and Impala
§ Hive and Impala are different tools, but are closely related
– They use very similar variants of SQL
– They share the same data warehouse and metadata storage
– They are oben used together
§ This course first focuses on areas in common between Hive and Impala
– HiveQL and Impala SQL are very similar
– Differences will be noted
§ Later, the course emphasizes some areas in which they differ
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-15
Chapter Topics
§ What Is Hive?
§ What Is Impala?
§ Why Use Hive and Impala?
§ Schema and Data Storage
§ Comparing Hive and Impala to Tradi)onal Databases
§ Use Cases
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-16
How Hive and Impala Load and Store Data (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-17
How Hive and Impala Load and Store Data (2)
§ Hive and Impala use the metastore to determine data format and loca.on
– The query itself operates on data stored in a filesystem (typically HDFS)
Metastore
Query
(metadata in RDBMS)
Impala or
Hive Server
Tables
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-18
Chapter Topics
§ What Is Hive?
§ What Is Impala?
§ Why Use Hive and Impala?
§ Schema and Data Storage
§ Comparing Hive and Impala to Tradi.onal Databases
§ Use Cases
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-19
Your Cluster Is Not a Database Server
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-20
Comparing Hive and Impala to a Rela)onal Database
* Hive now has limited, experimental support for UPDATE, DELETE, and transac)ons.
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-21
Chapter Topics
§ What Is Hive?
§ What Is Impala?
§ Why Use Hive and Impala?
§ Schema and Data Storage
§ Comparing Hive and Impala to Tradi)onal Databases
§ Use Cases
§ Essen)al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-22
Use Case: Log File Analy)cs
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-23
Use Case: Sen)ment Analysis
Negative
Neutral
Positive
07 08 09 10 11 12 13 14 15 16 17 18
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-24
Use Case: Business Intelligence
Suppliers by Region
Japan: 31 suppliers
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-25
Chapter Topics
§ What Is Hive?
§ What Is Impala?
§ Why Use Hive and Impala?
§ Schema and Data Storage
§ Comparing Hive and Impala to Tradi)onal Databases
§ Use Cases
§ Essen.al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-26
Essen)al Points
§ Hive and Impala are tools for performing SQL queries on data in Hadoop
§ HiveQL and Impala SQL are very similar to SQL-92
– Easy to learn for those with rela)onal database experience
– However, does not replace your RDBMS
§ Hive generates jobs that run on the Hadoop cluster data processing engine
– Runs MapReduce or Spark jobs based on HiveQL statements
§ Impala executes queries directly on the Hadoop cluster
– Uses a very fast specialized SQL engine, not MapReduce or Spark
§ Tables are typically directories of files in HDFS
– Informa)on about those tables is kept in the metastore (in an RDBMS)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-27
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 08-28
Querying with Apache Hive and Impala
Chapter 9
Course Chapters
§ IntroducFon
§ Apache Hadoop Fundamentals
§ IntroducFon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulF-Dataset OperaFons with Apache Pig
§ Apache Pig TroubleshooFng and OpFmizaFon
§ IntroducFon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaFonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpFmizaFon
§ Apache Impala OpFmizaFon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-2
Querying with Apache Hive and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-4
Tables
§ Data for Hive and Impala tables is stored on the filesystem (typically HDFS)
– Each table maps to a single directory
§ A table’s directory may contain mulQple files
– Typically delimited text files, but many formats are supported
– Subdirectories are not allowed
– Unless the table is parFFoned
§ The metastore gives context to this data
– Helps map raw data in HDFS to named columns of specific types
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-5
Databases
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-6
Exploring Databases and Tables (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-7
Exploring Databases and Tables (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-8
Exploring Databases and Tables (3)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-9
Exploring Databases and Tables (4)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-10
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-11
An IntroducFon to HiveQL and Impala SQL
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-12
Syntax Basics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-13
SelecFng Data from Tables
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-14
SorFng Query Results
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-15
LimiFng Query Results
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-16
Using a WHERE Clause to Filter Results
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-17
Common Operators in HiveQL and Impala SQL
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-18
Table Aliases
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-19
Combining Query Results with a Union
§ UNION ALL unifies output from mulQple SELECTs into a single result set
– The name, order, and types of columns in each query must match
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-20
Subqueries in the FROM Clause
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-21
Subqueries in the WHERE Clause
§ Recent versions of Hive and Impala allow subqueries in the WHERE clause
– Use to filter one table based on criteria in another table
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-22
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-23
Data Types
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-24
Integer Types
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-25
Decimal Types
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-26
Character Types
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-27
Other Simple Types
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-28
Data Type Conversion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-29
Handling of Out-of-Range Values
age age
CREATE TABLE names_and_ages 29 29
(name STRING, 17 17
age TINYINT)
NULL NULL
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'; NULL 127
NULL -128
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-30
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-31
InteracFng with Hive and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-32
Using Hue with Hive and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-33
The Hue Query Editor
Show logs
Choose a
database
Visualize or
save query View results
results and history
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-34
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-35
StarFng Beeline (Hive’s Shell)
$ beeline -u jdbc:hive2://host:10000 \
-n username -p password
Connecting to jdbc:hive2://host:10000
Connected to: Apache Hive (version 1.1.0-cdh5.8.0)
Beeline version 1.1.0-cdh5.8.0 by Apache Hive
0: jdbc:hive2://localhost:10000>
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-36
ExecuFng Queries in Beeline
+-------------+-------------+
| lname | fname |
+-------------+-------------+
| Ham | Marilyn |
| Franks | Gerard |
...
| Falgoust | Jennifer |
+-------------+-------------+
50 rows selected (15.829 seconds)
0: url>
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-37
Using Beeline
0: jdbc:hive2://localhost:10000> !exit
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-38
ExecuFng Hive Queries from the Command Line
§ You can execute a file containing HiveQL code using the -f opQon
$ beeline -u ... -f myquery.hql
§ Or use HiveQL directly from the command line using the -e opQon
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-39
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-40
StarFng the Impala Shell
$ impala-shell -i myserver.example.com:21000
[myserver.example.com:21000] >
Note: Some log messages have been truncated to beDer fit the slide
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-41
Using the Impala Shell
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-42
ExecuFng Queries in the Impala Shell
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-43
InteracFng with the OperaFng System
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-44
Running Impala Queries from the Command Line
§ Run queries directly from the command line with the -q opQon
$ impala-shell -f myquery.hql \
--delimited \
--output_delimiter='\t' \
-o results.txt
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-45
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-46
EssenFal Points
§ HiveQL and Impala SQL syntax are familiar to those who know SQL
– A subset of SQL-92, plus extensions
§ Every table belongs to exactly one database
– The SHOW DATABASES command lists databases
– The USE command switches the acFve database
– The SHOW TABLES command lists all tables in a database
§ Every column in a table has an associated data type
– Most simple column types are similar to SQL data types
§ Hive and Impala both have query editors in Hue and command line shells
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-47
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-48
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-49
Hands-on Exercise: Running Queries from the Shell, Scripts, and
Hue
§ In this Hands-On Exercise, you will run queries from the Impala and Beeline
shells, scripts, and Hue
– Please refer to the Hands-On Exercise Manual for instrucFons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 09-50
Apache Hive and Impala Data
Management
Chapter 10
Course Chapters
§ IntroducFon
§ Apache Hadoop Fundamentals
§ IntroducFon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulF-Dataset OperaFons with Apache Pig
§ Apache Pig TroubleshooFng and OpFmizaFon
§ IntroducFon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaFonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpFmizaFon
§ Apache Impala OpFmizaFon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-2
Apache Hive and Impala Data Management
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-3
Chapter Topics
§ Data Storage
§ CreaFng Databases and Tables
§ Loading Data
§ Altering Databases and Tables
§ Simplifying Queries with Views
§ Storing Query Results
§ EssenFal Points
§ Hands-On Exercise: Data Management
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-4
Recap: How Hive and Impala Load and Store Data
§ Hive and Impala use the metastore to determine data format and locaGon
– The query itself operates on data stored in a filesystem (typically HDFS)
Metastore
Query
(metadata in RDBMS)
Impala or
Hive Server
Tables
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-5
The Warehouse Directory
§ By default, Hive and Impala store data in the HDFS directory
/user/hive/warehouse
§ Each table is a subdirectory containing any number of files
customers table
cust_id name country /user/hive/warehouse/customers
001 Alice us
002 Bob ca 001 Alice us
file1 002 Bob ca
003 Carlos mx 003 Carlos mx
004 Dieter de
… … … …
392 Maria it
393 Nigel uk 392 Maria it
file2 393 Nigel uk
394 Ophelia dk 394 Ophelia dk
395 Peter us
… … … …
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-6
Impala Metadata Cache InvalidaFon
INVALIDATE METADATA;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-7
Using the Hue Metastore Manager
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-8
Accessing the Metastore with HCatalog
§ Hive and Impala have built-in capability to directly access the metastore
§ Other tools like Pig lack built-in direct metastore access
§ HCatalog is a Hive sub-project that provides access to the metastore
– Accessible through command line and REST API
– Allows you to define and describe tables using Hive syntax
– Enables integraFon with applicaFons that lack direct metastore access
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-9
Chapter Topics
§ Data Storage
§ CreaGng Databases and Tables
§ Loading Data
§ Altering Databases and Tables
§ Simplifying Queries with Views
§ Storing Query Results
§ EssenFal Points
§ Hands-On Exercise: Data Management
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-10
CreaFng a Database
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-11
CreaFng a Table (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-12
CreaFng a Table (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-13
CreaFng a Table (3)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-14
CreaFng a Table (4)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-15
CreaFng a Table (5)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-16
Example Table DefiniFon
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-17
CreaFng Tables Based on ExisFng Schema
§ Use LIKE to create a new table using the definiGon of an exisGng table
CREATE TABLE jobs_archived LIKE jobs;
§ Column definiGons and names are derived from the exisGng table
– New table will contain no data
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-18
Controlling Table Data LocaFon
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-19
Externally Managed Tables
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-20
CreaFng Tables Using the Metastore Manager
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-21
Chapter Topics
§ Data Storage
§ CreaFng Databases and Tables
§ Loading Data
§ Altering Databases and Tables
§ Simplifying Queries with Views
§ Storing Query Results
§ EssenFal Points
§ Hands-On Exercise: Data Management
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-22
Data ValidaFon
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-23
Loading Data from HDFS Files
§ To load data, simply add files to the table’s directory in HDFS
– Can be done directly using the hdfs dfs commands
– This example loads data from HDFS into the sales table
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-24
OverwriFng Data from Files
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-25
Loading Data Using the Metastore Manager
§ The Metastore Manager provides two wizards for loading data into tables
– Create a new table from a file
– Import data into an exisFng table
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-26
Loading Data from a RelaFonal Database
§ Sqoop has built-in support for imporGng data into Hive and Impala
§ Add the --hive-import opGon to your Sqoop command
– Creates the table in the metastore
– Imports data from the RDBMS to the table’s directory in HDFS
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training \
--password training \
--fields-terminated-by '\t' \
--table employees \
--hive-import \
--hive-database default \
--hive-table employees
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-27
Chapter Topics
§ Data Storage
§ CreaFng Databases and Tables
§ Loading Data
§ Altering Databases and Tables
§ Simplifying Queries with Views
§ Storing Query Results
§ EssenFal Points
§ Hands-On Exercise: Data Management
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-28
Removing a Database
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-29
Removing a Table
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-31
Renaming and Modifying Columns
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-32
Reordering Columns in Hive
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-33
Adding and Removing Columns
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-34
Replacing Columns and Reproducing Tables
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-35
Chapter Topics
§ Data Storage
§ CreaFng Databases and Tables
§ Loading Data
§ Altering Databases and Tables
§ Simplifying Queries with Views
§ Storing Query Results
§ EssenFal Points
§ Hands-On Exercise: Data Management
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-36
Simplifying Complex Queries
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-37
CreaFng Views
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-38
Exploring Views
SHOW TABLES;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-39
Modifying and Removing Views
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-40
Chapter Topics
§ Data Storage
§ CreaFng Databases and Tables
§ Loading Data
§ Altering Databases and Tables
§ Simplifying Queries with Views
§ Storing Query Results
§ EssenFal Points
§ Hands-On Exercise: Data Management
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-41
Saving Query Output to a Table
§ INSERT INTO TABLE adds records without first deleGng exisGng data
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-42
CreaFng Tables Based on ExisFng Data
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-43
WriFng Output to HDFS in Hive
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-44
Saving to MulFple Directories in Hive (1)
§ We just saw that you can save output to an HDFS file
FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_customers'
SELECT cust_id, fname, lname WHERE state='NY';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-45
Saving to MulFple Directories in Hive (2)
FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_names'
SELECT fname, lname
WHERE state = 'NY'
INSERT OVERWRITE DIRECTORY 'ny_count'
SELECT COUNT(DISTINCT cust_id)
WHERE state = 'NY';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-46
Saving to MulFple Directories in Hive (3)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-47
Chapter Topics
§ Data Storage
§ CreaFng Databases and Tables
§ Loading Data
§ Altering Databases and Tables
§ Simplifying Queries with Views
§ Storing Query Results
§ EssenGal Points
§ Hands-On Exercise: Data Management
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-48
EssenFal Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-49
Chapter Topics
§ Data Storage
§ CreaFng Databases and Tables
§ Loading Data
§ Altering Databases and Tables
§ Simplifying Queries with Views
§ Storing Query Results
§ EssenFal Points
§ Hands-On Exercise: Data Management
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-50
Hands-On Exercise: Data Management
§ In this Hands-On Exercise, you will create and load tables using the Hue
Metastore Manager, Impala SQL, and Sqoop
– Please refer to the Hands-On Exercise Manual for instrucFons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 10-51
Data Storage and Performance
Chapter 11
Course Chapters
§ IntroducFon
§ Apache Hadoop Fundamentals
§ IntroducFon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulF-Dataset OperaFons with Apache Pig
§ Apache Pig TroubleshooFng and OpFmizaFon
§ IntroducFon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaFonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpFmizaFon
§ Apache Impala OpFmizaFon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-2
Data Storage and Performance
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-3
Chapter Topics
§ Par>>oning Tables
§ Loading Data into ParFFoned Tables
§ When to Use ParFFoning
§ Choosing a File Format
§ Using Avro and Parquet File Formats
§ EssenFal Points
§ Hands-On Exercise: Data Storage and Performance
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-4
Table ParFFoning
§ By default, all data files for a table are stored in a single directory
– All files in the directory are read during a query
§ Par>>oning subdivides the data
– Data is physically divided during loading, based on values from one or
more columns
§ Speeds up queries that filter on par>>on columns
– Only the files containing the selected data need to be read
§ Does not prevent you from running queries that span mul>ple par>>ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-5
Example: ParFFoning Customers by State (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-6
Example: ParFFoning Customers by State (2)
/dualcore/customers
file1
1000000 Quentin Shepard 32092 West 10th Street Prairie City SD 57649
1000001 Brandon Louis 1311 North 2nd Street Clearfield IA 50840
1000002 Marilyn Ham 25831 North 25th Street Concord CA 94522
…
file2
1050344 Denise Carey 1819 North Willow Parkway Phoenix AZ 85042
1050345 Donna Pettigrew 1725 Patterson Street Garberville CA 95542
1050346 Hans Swann 1148 North Hornbeam Avenue Sacramento CA 94230
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-7
Example: ParFFoning Customers by State (3)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-8
ParFFoning File Structure
/dualcore/
customers_by_state
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-9
CreaFng a ParFFoned Table
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-10
ParFFon Columns
DESCRIBE customers_by_state;
+---------+--------+---------+
| name | type | comment |
+---------+--------+---------+
| cust_id | int | |
| fname | string | |
| lname | string | |
| address | string | |
| city | string | |
| zipcode | string | |
| state | string | |
+---------+--------+---------+
A parFFon column is a virtual column;
column values are not stored in the files
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-11
Nested ParFFons
customers_by_state
state=AK
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-12
Chapter Topics
§ ParFFoning Tables
§ Loading Data into Par>>oned Tables
§ When to Use ParFFoning
§ Choosing a File Format
§ Using Avro and Parquet File Formats
§ EssenFal Points
§ Hands-On Exercise: Data Storage and Performance
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-13
Loading Data into a ParFFoned Table
§ Dynamic par>>oning
– Hive/Impala automaFcally creates parFFons
– Inserted data is stored in the correct parFFons based on column values
§ Sta>c par>>oning
– You manually create new parFFons using ADD PARTITION
– When loading data, you specify which parFFon to store it in
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-14
Dynamic ParFFoning
§ Hive or Impala automa>cally creates par>>ons and inserts data into them
based on the values of the specified par>>on column
– The values of the parFFon column(s) are not included in the files
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-15
StaFc ParFFoning
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-16
StaFc ParFFoning Example: ParFFon Calls by Day (1)
call-20161002.log
03:45:01,505-555-2345,CALL_RECEIVED
03:45:09,505-555-2345,OPTION_SELECTED,Billing
03:56:21,505-555-2345,AGENT_ANSWER,Agent ID A1503
03:57:01,505-555-2345,QUESTION
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-17
StaFc ParFFoning Example: ParFFon Calls by Day (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-18
Static Partitioning Example: Partition Calls by Day (3)
§ This command
1. Adds the partition to the table’s metadata
2. Creates subdirectory call_date=2016-10-01 in
/user/hive/warehouse/call_logs/
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-19
StaFc ParFFoning Example: ParFFon Calls by Day (4)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-20
Hive Only: Shortcut for Loading Data into ParFFons
§ Hive will create a new par>>on if the one specified doesn’t exist
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-21
Viewing, Adding, and Removing ParFFons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-22
Chapter Topics
§ ParFFoning Tables
§ Loading Data into ParFFoned Tables
§ When to Use Par>>oning
§ Choosing a File Format
§ Using Avro and Parquet File Formats
§ EssenFal Points
§ Hands-On Exercise: Data Storage and Performance
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-23
When to Use ParFFoning
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-24
When Not to Use ParFFoning
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-25
ParFFoning in Hive (1)
SET hive.exec.dynamic.partition.mode=nonstrict;
§ Note: Hive proper>es set in Beeline are for the current session only.
– Your system administrator can configure properFes permanently
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-26
ParFFoning in Hive (2)
§ Cau>on: If the par>>on column has many unique values, many par>>ons
will be created
§ Three Hive configura>on proper>es exist to limit this
– hive.exec.max.dynamic.partitions.pernode
– Maximum number of dynamic parFFons that can be created by any
given node involved in a query
– Default 100
– hive.exec.max.dynamic.partitions
– Total number of dynamic parFFons that can be created by one
HiveQL statement
– Default 1000
– hive.exec.max.created.files
– Maximum total files (on all nodes) created by a query
– Default 100000
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-27
Chapter Topics
§ ParFFoning Tables
§ Loading Data into ParFFoned Tables
§ When to Use ParFFoning
§ Choosing a File Format
§ Using Avro and Parquet File Formats
§ EssenFal Points
§ Hands-On Exercise: Data Storage and Performance
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-28
Choosing a File Format
§ Hive and Impala support many different file formats for data storage
§ Format op>ons
– TEXTFILE
– SEQUENCEFILE
– AVRO
– PARQUET
– RCFILE
– ORCFILE (Hive only)*
* Supported, but not recommended, in CDH
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-29
ConsideraFons for Choosing a File Format
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-30
Text File Format
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-31
SequenceFile Format
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-32
Apache Avro File Format
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-33
Columnar Formats
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-34
Columnar File Formats: RCFile and ORCFile
§ RCFile
– A column-oriented format originally created for Hive tables
– All data stored as strings (inefficient)
– Verdict: Poor performance and limited interoperability
§ ORCFile
– An improved version of RCFile
– Currently supported only in Hive, not Impala
– More efficient than RCFile, but not well supported outside of Hive
– Verdict: Improved performance but limited interoperability
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-35
Columnar File Formats: Apache Parquet
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-36
Chapter Topics
§ ParFFoning Tables
§ Loading Data into ParFFoned Tables
§ When to Use ParFFoning
§ Choosing a File Format
§ Using Avro and Parquet File Formats
§ EssenFal Points
§ Hands-On Exercise: Data Storage and Performance
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-37
Using Avro File Format (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-38
Using Avro File Format (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-39
Using Parquet File Format (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-40
Using Parquet File Format (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-41
Chapter Topics
§ ParFFoning Tables
§ Loading Data into ParFFoned Tables
§ When to Use ParFFoning
§ Choosing a File Format
§ Using Avro and Parquet File Formats
§ Essen>al Points
§ Hands-On Exercise: Data Storage and Performance
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-42
EssenFal Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-43
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-44
Chapter Topics
§ ParFFoning Tables
§ Loading Data into ParFFoned Tables
§ When to Use ParFFoning
§ Choosing a File Format
§ Using Avro and Parquet File Formats
§ EssenFal Points
§ Hands-On Exercise: Data Storage and Performance
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-45
Hands-On Exercise: Data Storage and Performance
§ In this Hands-On Exercise, you will create a table for ad click data,
par>>oned by the network on which the ad was displayed
– Please refer to the Hands-On Exercise Manual for instrucFons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 11-46
Rela%onal Data Analysis
with Apache Hive and Impala
Chapter 12
Course Chapters
§ Introduc%on
§ Apache Hadoop Fundamentals
§ Introduc%on to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ Mul%-Dataset Opera%ons with Apache Pig
§ Apache Pig Troubleshoo%ng and Op%miza%on
§ Introduc%on to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ Rela)onal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive Op%miza%on
§ Apache Impala Op%miza%on
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-2
Rela%onal Data Analysis with Apache Hive and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-3
Chapter Topics
§ Joining Datasets
§ Common Built-In Func%ons
§ Aggrega%on and Windowing
§ Essen%al Points
§ Hands-On Exercise: Rela%onal Data Analysis
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-4
Joins
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-5
Join Syntax
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-6
Inner Join Example
customers table
cust_id name country SELECT c.cust_id, name, total
a Alice us FROM customers c
JOIN orders o
b Bob ca
ON (c.cust_id = o.cust_id);
c Carlos mx
d Dieter de
Result of query
4 b 1456
5 z 2137
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-7
LeW Outer Join Example
customers table
cust_id name country SELECT c.cust_id, name, total
a Alice us FROM customers c
LEFT OUTER JOIN orders o
b Bob ca
ON (c.cust_id = o.cust_id);
c Carlos mx
d Dieter de
Result of query
5 z 2137
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-8
Right Outer Join Example
customers table
cust_id name country SELECT c.cust_id, name, total
a Alice us FROM customers c
RIGHT OUTER JOIN orders o
b Bob ca
ON (c.cust_id = o.cust_id);
c Carlos mx
d Dieter de
Result of query
5 z 2137
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-9
Full Outer Join Example
customers table
cust_id name country SELECT c.cust_id, name, total
a Alice us FROM customers c
FULL OUTER JOIN orders o
b Bob ca
ON (c.cust_id = o.cust_id);
c Carlos mx
d Dieter de
Result of query
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-10
Using an Outer Join to Find Unmatched Entries
customers table
cust_id name country SELECT c.cust_id, name, total
a Alice us FROM customers c
FULL OUTER JOIN orders o
b Bob ca
ON (c.cust_id = o.cust_id)
c Carlos mx
WHERE c.cust_id IS NULL
d Dieter de OR o.total IS NULL;
orders table
Result of query
order_id cust_id total
cust_id name total
1 a 1539
d Dieter NULL
2 c 1871
NULL NULL 2137
3 a 6352
4 b 1456
5 z 2137
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-11
NULL Values in Join Key Columns
customers_with_null table
cust_id name country SELECT c.cust_id, name, total
a Alice us FROM customers_with_null c
b Bob ca JOIN orders_with_null o
c Carlos mx ON (c.cust_id = o.cust_id);
d Dieter de
NULL Unknown NULL Result of query
cust_id name total
orders_with_null table
order_id cust_id total a Alice 1539
employees table
fname lname salary SELECT fname, lname, grade
Sean Baca 19554 FROM employees e
JOIN salary_grades g
Ana Bobo 26596
ON (e.salary >= g.min_amt
Mary Beal 53485
AND e.salary <= g.max_amt);
Gary Burt 21191
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-14
Cross Join Example
disks table
name SELECT * FROM disks
Internal hard disk
CROSS JOIN sizes;
External hard disk
Result of query
name size
Internal hard disk 1.0 terabytes
sizes table Internal hard disk 2.0 terabytes
size Internal hard disk 3.0 terabytes
1.0 terabytes Internal hard disk 4.0 terabytes
2.0 terabytes External hard disk 1.0 terabytes
3.0 terabytes External hard disk 2.0 terabytes
4.0 terabytes External hard disk 3.0 terabytes
External hard disk 4.0 terabytes
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-15
Left Semi-Joins (1)
SELECT c.cust_id
FROM customers c
LEFT SEMI JOIN orders o
ON (c.cust_id = o.cust_id AND o.total > 1500);
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-16
Left Semi-Joins (2)
SELECT c.cust_id
FROM customers c
LEFT SEMI JOIN orders o
ON (c.cust_id = o.cust_id AND o.total > 1500);
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-17
Chapter Topics
§ Joining Datasets
§ Common Built-In Func)ons
§ Aggrega%on and Windowing
§ Essen%al Points
§ Hands-On Exercise: Rela%onal Data Analysis
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-18
Built-In Func%ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-19
Informa%on about Built-In Func%ons in Hive
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-20
Trying Built-In Func%ons
§ To test a built-in func)on, use a SELECT statement with no FROM clause
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-21
Built-In Mathema%cal Func%ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-22
Built-In Date and Time Func%ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-23
Conver%ng Date and Time Formats
to_date(from_unixtime(
unix_timestamp(order_dt, 'MM/dd/yyyy')))
from_unixtime(unix_timestamp(current_timestamp()),
'MMM d, yyyy HH:mm')
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-24
Built-In String Func%ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-25
String Concatena%on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-26
Parsing URLs
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-27
Other Built-In Func%ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-28
Chapter Topics
§ Joining Datasets
§ Common Built-In Func%ons
§ Aggrega)on and Windowing
§ Essen%al Points
§ Hands-On Exercise: Rela%onal Data Analysis
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-29
Aggrega%on and Windowing
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-30
Example: Record Grouping and Aggregate Func%ons
products table
prod_id brand name price
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-31
Built-in Aggregate Func%ons
Counts all rows where field is unique and not NULL COUNT(DISTINCT fname)
§ You can also define custom User Defined Aggregate Func)ons (UDAFs)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-32
Window Aggrega%on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-33
Windows
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-34
Example: Aggrega%on over a Window (1)
§ Ques)on: What is the price of the least expensive product in each brand?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-35
Example: Aggrega%on over a Window (2)
§ Ques)on: For each product, how does the price compare to the minimum
price for that brand?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-36
Example: RANK and ROW_NUMBER
products table
query results
prod_id brand name price prod_id brand price rank n
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-37
Example: Using Windowing for Subqueries
subquery results
prod_id brand price rank
query results
3 Dualcore 1.99 1 prod_id brand price
2 Dualcore 11.99 2
3 Dualcore 1.99
1 Dualcore 18.39 3
6 Gigabux 20.00
6 Gigabux 20.00 1
7 Gigabux 20.00
7 Gigabux 20.00 1
4 Gigabux 40.50 3
5 Gigabux 50.50 4
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-38
Other Windowing Func%ons
Func)on Descrip)on
DENSE_RANK Rank of current value within the window (with consecu%ve
rankings)
NTILE(n) The n-%le (of n) within the window that the current value is in
LEAD(column, n, default) The value in the specified column in the nth following row
LAG(column, n, default) The value in the specified column in the nth preceding row
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-39
Example: RANK and DENSE_RANK
6 Gigabux 20.00 1 1
7 Gigabux 20.00 1 1
4 Gigabux 40.50 3 2
5 Gigabux 50.50 4 3
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-40
Time-Based Windows
ads table
campaign_id display_date display_site cost
A1 2016-05-01 audiophile 76
A1 2016-05-01 photosite 64
A3 2016-05-01 photosite 68
A2 2016-05-02 audiophile 82
A3 2016-05-02 dvdreview 66
A3 2016-05-02 audiophile 82
A1 2016-05-02 audiophile 70
A3 2016-05-03 audiophile 66
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-41
Example: Time-Based Windows
§ Ques)on: How many ads were displayed per site per day?
§ A non-windowed aggrega)on:
query results
display_date display_site n
2016-05-01 audiophile 1
2016-05-01 photosite 2
2016-05-02 audiophile 3
2016-05-02 dvdreview 1
2016-05-03 audiophile 1
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-42
Example: Rank over Time-Based Windows
2016-05-01 audiophile 1 2
2016-05-01 photosite 2 1
2016-05-02 audiophile 3 1
2016-05-02 dvdreview 1 2
2016-05-03 audiophile 1 1
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-43
Example: Using LAG with Time-Based Windows
§ Ques)on: How did each site’s daily count compare to the previous day?
§ Ques)on:
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-44
Sliding Windows
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-45
Example: Time-Based Sliding Window
§ Ques)on: What is the per-site average count for the week ending today?
2016-05-01 audiophile 1 1
2016-05-01 photosite 2 2
2016-05-02 audiophile 3 2
2016-05-02 dvdreview 1 1
2016-05-03 audiophile 1 1.66
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-46
Chapter Topics
§ Joining Datasets
§ Common Built-In Func%ons
§ Aggrega%on and Windowing
§ Essen)al Points
§ Hands-On Exercise: Rela%onal Data Analysis
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-47
Essen%al Points
§ Hive and Impala offer many ways to join data from disparate datasets
§ Many standard SQL func)ons are available
– Perform opera%ons on strings, dates, URLs, and numeric values
§ Aggrega)on func)ons group values in mul)ple rows into single values
§ Windowing func)ons calculate values based on sets of rows without
combining the rows
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-48
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-49
Chapter Topics
§ Joining Datasets
§ Common Built-In Func%ons
§ Aggrega%on and Windowing
§ Essen%al Points
§ Hands-On Exercise: Rela)onal Data Analysis
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-50
Hands-On Exercise: Rela%onal Data Analysis
§ In this Hands-On Exercise, you will analyze sales and product data
– Please refer to the Hands-On Exercise Manual for instruc%ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 12-51
Complex Data with Apache Hive and
Impala
Chapter 13
Course Chapters
§ IntroducGon
§ Apache Hadoop Fundamentals
§ IntroducGon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulG-Dataset OperaGons with Apache Pig
§ Apache Pig TroubleshooGng and OpGmizaGon
§ IntroducGon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaGonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpGmizaGon
§ Apache Impala OpGmizaGon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-2
Complex Data with Apache Hive and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-4
Complex Data Types
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-5
Why Use Complex Values?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-6
Example: Customer Phone Numbers
§ Using the ARRAY type allows us to store the data in one table
customers_phones table
SELECT name,
cust_id name phones phones[0],
a Alice [555-1111, phones[1]
555-2222, FROM customers_phones;
555-3333]
b Bob [555-4444]
c Carlos [555-5555,
555-6666] query results
name phones[0] phones[1]
Alice 555-1111 555-2222
Bob 555-4444 NULL
Carlos 555-5555 555-6666
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-8
Using ARRAY Columns with Hive (2)
Data File
a,Alice,555-1111|555-2222|555-3333
b,Bob,555-4444
c,Carlos,555-5555|555-6666
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-9
Using MAP Columns with Hive (1)
SELECT name,
phones['home'] AS home
customers_phones table
FROM customers_phones;
cust_id name phones
a Alice {home:555-1111,
work:555-2222, query results
mobile:555-3333}
name home
b Bob {mobile:555-4444}
Alice 555-1111
c Carlos {work:555-5555,
Bob NULL
home:555-6666}
Carlos 555-6666
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-10
Using MAP Columns with Hive (2)
§ MAP keys must all be one data type, and values must all be one data type
– MAP<KEY-TYPE,VALUE-TYPE>
§ You can specify the key-value separator (default is control+C)
CREATE TABLE customers_phones
(cust_id STRING,
name STRING,
phones MAP<STRING,STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
Data File
a,Alice,home:555-1111|work:555-2222|mobile:555-3333
b,Bob,mobile:555-4444
c,Carlos,work:555-5555|home:555-6666
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-11
Using STRUCT Columns with Hive (1)
SELECT name,
customers_addr table address.state,
address.zipcode
cust_id name address FROM customers_addr;
a Alice {street:742 Evergreen Terrace,
city:Springfield,
state:OR,
zipcode:97477}
query results
b Bob {street:1600 Pennsylvania Ave NW,
city:Washington, name state zipcode
state:DC, Alice OR 97477
zipcode:20500}
c Carlos {street:342 Gravelpit Terrace,
Bob DC 20500
city:Bedrock, Carlos NULL NULL
state:null,
zipcode:null}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-12
Using STRUCT Columns with Hive (2)
Data File
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-13
Complex Columns in the SELECT List in Hive Queries
§ You can include complex columns in the SELECT list in Hive queries
– Hive returns full ARRAY, MAP, or STRUCT columns
SELECT phones
FROM customers_phones;
customers_phones table
cust_id name phones
a Alice [555-1111,
555-2222, query results
555-3333]
phones
b Bob [555-4444]
[555-1111,
c Carlos [555-5555, 555-2222,
555-6666] 555-3333]
[555-4444]
[555-5555,
555-6666]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-14
Returning the Number of Items in a CollecGon with Hive
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-15
ConverGng ARRAY to Records with explode
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-16
Using explode with a Lateral View
§ No other columns can be included in the SELECT list with explode
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-17
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-18
Complex Data with Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-19
Querying ARRAY Columns with Impala (1)
cust_phones_parquet table
query results
cust_id name phones
item pos
a Alice [555-1111,
555-1111 0
555-2222,
555-3333] 555-2222 1
b Bob [555-4444] 555-3333 2
555-4444 0
c Carlos [555-5555,
555-6666] 555-5555 0
555-6666 1
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-20
Querying ARRAY Columns with Impala (2)
§ Use implicit join notaIon to join ARRAY elements with rows of the table
– Returns ARRAY elements with scalar column values from same rows
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-21
Querying MAP Columns with Impala (1)
cust_phones_parquet table
query results
cust_id name phones
key value
a Alice {home:555-1111,
home 555-1111
work:555-2222,
mobile:555-3333} work 555-2222
b Bob {mobile:555-4444} mobile 555-3333
mobile 555-4444
c Carlos {work:555-5555,
home:555-6666} work 555-5555
home 555-6666
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-22
Querying MAP Columns with Impala (2)
§ Use implicit join notaIon to join MAP elements with rows of the table
– Can use key and value in the SELECT list and WHERE clause
cust_phones_parquet table
cust_id name phones query results
a Alice {home:555-1111, name home
work:555-2222,
Alice 555-1111
mobile:555-3333}
Carlos 555-6666
b Bob {mobile:555-4444}
c Carlos {work:555-5555,
home:555-6666}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-23
Querying STRUCT Columns with Impala
§ The Impala query syntax for STRUCT columns is the same as with Hive
SELECT name,
cust_addr_parquet table address.state,
address.zipcode
cust_id name address FROM
a Alice {street:742 Evergreen Terrace, cust_addr_parquet;
city:Springfield,
state:OR,
zipcode:97477}
b Bob {street:1600 Pennsylvania Ave NW,
query results
city:Washington,
state:DC, name state zipcode
zipcode:20500}
Alice OR 97477
c Carlos {street:342 Gravelpit Terrace,
city:Bedrock, Bob DC 20500
state:null,
Carlos NULL NULL
zipcode:null}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-24
Complex Columns in the SELECT List in Impala Queries
§ In Impala queries, complex columns are not allowed in the SELECT list
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-25
Returning the Number of Items in a CollecGon with Impala
§ Impala does not support the size funcIon or other collecIon funcIons
– Use COUNT and GROUP BY to count the items in an ARRAY or MAP
cust_phones_parquet table
cust_id name phones query results
a Alice [555-1111, name num
555-2222,
555-3333] Alice 3
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-26
Loading Data Containing Complex Types
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-27
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-28
EssenGal Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-29
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-30
Hands-On Exercise: Complex Data with Hive and Impala
§ In this exercise, you will load and query data with complex columns related
to a customer loyalty program
– Please refer to the Hands-On Exercise Manual for instrucGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 13-31
Analyzing Text with Apache Hive and
Impala
Chapter 14
Course Chapters
§ IntroducHon
§ Apache Hadoop Fundamentals
§ IntroducHon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulH-Dataset OperaHons with Apache Pig
§ Apache Pig TroubleshooHng and OpHmizaHon
§ IntroducHon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaHonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpHmizaHon
§ Apache Impala OpHmizaHon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-2
Analyzing Text with Apache Hive and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-4
Matching PaFerns in Text
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-5
Regular Expression Matching
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-6
Regular Expression Matching Operator
– Or use the Impala-only operator IREGEXP in CDH 5.7 and higher
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-7
ExtracHng and Replacing Text
§ Hive and Impala include basic funcEons for extracEng or replacing text
– Including substring and translate
§ These examples assume that txt has the following value
– It's on Oak St. or Maple St in 90210
§ These are not flexible enough for working with free-form text fields
– Instead use regular expressions
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-8
Regular Expression Extract and Replace
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-9
Regular Expression Extract and Replace FuncHons
§ Hive and Impala have funcEons for regular expression extract and replace
– regexp_extract returns matched text
– regexp_replace subsHtutes another value for matched text
§ These examples assume that txt has the following value
– It's on Oak St. or Maple St in 90210
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-10
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-11
Text Processing Overview
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-12
Hive Record Formats
§ So far we have used only ROW FORMAT DELIMITED when creaEng
tables stored as text files
– Requires data in rows and columns with consistent delimiters
§ But Hive supports other record formats
§ A SerDe is an interface Hive uses to read and write data
– SerDe stands for serializer/deserializer
§ SerDes enable Hive to access data that is not in structured tabular format
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-13
Hive SerDes
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-14
Specifying a Hive SerDe
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-15
Log Files
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-16
CreaHng a Table with Regex SerDe (1)
Log excerpt
05/23/2016 19:45:19 312-555-7834 CALL_RECEIVED ""
05/23/2016 19:48:37 312-555-7834 COMPLAINT "Item damaged"
Regex SerDe
CREATE TABLE calls (
event_date STRING,
event_time STRING,
phone_num STRING,
event_type STRING,
details STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");
Log excerpt
05/23/2016 19:45:19 312-555-7834 CALL_RECEIVED ""
05/23/2016 19:48:37 312-555-7834 COMPLAINT "Item damaged"
Regex SerDe
CREATE TABLE calls (
event_date STRING,
event_time STRING,
phone_num STRING,
event_type STRING,
details STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-18
Fixed-Width Formats
1030929610759620160829012215Oakland CA94618
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-19
Fixed-Width Format Example
Input data
1030929610759620160829012215Oakland CA94618
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-20
CSV Format
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-21
CSV SerDe Example
Input Data
1,Gigabux,gigabux@example.com
2,"ACME Distribution Co.",acme@example.com
3,"Bitmonkey, Inc.",bmi@example.com
CSV SerDe
CREATE TABLE vendors
(id INT,
name STRING,
email STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-22
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-23
SenHment Analysis
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-24
Splicng a String into Records
peopleFROM
SELECT names FROMpeople;
example;
Amy,Sam,Ted
Amy,Sam,Ted
SELECT split(names,
SELECT SPLIT(people,',')
',')FROM
FROMpeople;
example;
["Amy","Sam","Ted"]
["Amy","Sam","Ted"]
SELECT EXPLODE(SPLIT(people, ',')) AS x FROM example;
SELECT
Amy explode(split(names, ',')) FROM people;
Amy
Sam
Sam
Ted
Ted
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-25
Parsing Sentences into Words
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-26
n-grams
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-27
CalculaHng n-grams in Hive (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-28
CalculaHng n-grams in Hive (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-29
Finding Specific n-grams in Text
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-30
Histograms
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-31
CalculaHng Data for Histograms
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-32
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-33
EssenHal Points
§ Hive and Impala support regular expressions for working with strings
– Compare
– Extract
– Replace
§ Hive uses SerDes to read and write data
– Specified (or defaulted) when creaHng a table
– Enable Hive to process unstructured or semi-structured text data
– RegexSerDe supports data lacking consistent delimiters
– OpenCSVSerDe supports data in CSV format
§ An n-gram is a sequence of words
– Use ngrams and context_ngrams in Hive to find their frequency
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-34
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-35
Hands-On Exercise: Analyzing Text With Hive
§ In this exercise, you will use Hive to process and analyze web server log data
and to analyze product raEngs
– Please refer to the Hands-On Exercise Manual for instrucHons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriFen consent from Cloudera. 14-36
Apache Hive Op,miza,on
Chapter 15
Course Chapters
§ Introduc,on
§ Apache Hadoop Fundamentals
§ Introduc,on to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ Mul,-Dataset Opera,ons with Apache Pig
§ Apache Pig Troubleshoo,ng and Op,miza,on
§ Introduc,on to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ Rela,onal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive Op0miza0on
§ Apache Impala Op,miza,on
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-2
Apache Hive Op,miza,on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-4
Hive Query Processing
– MapReduce or Spark
• Parse HiveQL
§ To op0mize Hive queries, you need to • Make op,miza,ons
understand how queries are • Plan execu,on
• Submit job(s) to cluster
processed • Monitor progress
Hadoop
Cluster
HDFS
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-5
How Hive on MapReduce Processes Data
SELECT COUNT(cust_id)
FROM customers
WHERE zipcode=94305;
Distribute Tasks to Nodes
Map Phase
Build the Execu,on Plan
Reduce Phase
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-6
Understanding Map and Reduce
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-7
MapReduce Example
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-8
Map Phase
Map Output
Input Data Alice
Bob
3625
5174 Alice 3625
0 Alice 3625 Alice 893
1 Bob 5174 Alice 893 Alice 2139
2 Alice 893 Alice 2139 Alice 5834
3 Alice 2139 Carlos 1039 Alice 12491
4 Diana 3581 Diana 3581 Carlos 392 Carlos 1431
5 Carlos 1039 Carlos 1039
6 Bob 4823 Bob 9997
7 Alice 5834 Bob 4823 Bob 5174 Diana 5385
8 Carlos 392 Alice 5834 Bob 4823
9 Diana 1804 Diana 3581
Diana 1804
Carlos 392
Diana 1804
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-9
Shuffle and Sort
§ Hadoop automa0cally sorts and merges output from all map tasks
– This intermediate process is known as shuffle and sort
– The result is the input to the reduce phase
– In this example, the data is sorted by sales rep name, because that is
how our query aggregates the data
Map Output
Reduce Input
Input Data Alice
Bob
3625
5174 Alice 3625
0 Alice 3625 Alice 893
1 Bob 5174 Alice 893 Alice 2139
2 Alice 893 Alice 2139 Alice 5834
3 Alice 2139 Carlos 1039 Alice 12491
4 Diana 3581 Diana 3581 Carlos 392 Carlos 1431
5 Carlos 1039 Carlos 1039
6 Bob 4823 Bob 9997
7 Alice 5834 Bob 4823 Bob 5174 Diana 5385
8 Carlos 392 Alice 5834 Bob 4823
9 Diana 1804 Diana 3581
Diana 1804
Carlos 392
Diana 1804
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-10
Reduce Phase
§ Input to the reduce phase comes from the shuffle and sort process
– Reduce tasks process mul,ple records
– In this example, each reduce task processes all the records with the
same sales rep name
– The reduce task aggregates by compu,ng the sum of all the order totals
for the grouped rows
Map Output
Reduce Input
Input Data Alice
Bob
3625
5174 Alice 3625
0 Alice 3625 Alice 893
1 Bob 5174 Alice
Alice
893
2139
Alice 2139 Reduce Output
2 Alice 893 Alice 5834
3 Alice 2139 Carlos 1039 Alice 12491
4 Diana 3581 Diana 3581 Carlos 392 Carlos 1431
5 Carlos 1039 Carlos 1039
6 Bob 4823 Bob 9997
7 Alice 5834 Bob 4823 Bob 5174 Diana 5385
8 Carlos 392 Alice 5834 Bob 4823
9 Diana 1804 Diana 3581
Diana 1804
Carlos 392
Diana 1804
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-11
Hive Fetch Task
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-12
Hive Query Performance PaEerns (1)
DESCRIBE customers;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-13
Hive Query Performance PaEerns (2)
§ The next slowest type of query requires both map and reduce phases
SELECT COUNT(cust_id)
FROM customers
WHERE zipcode=94305;
§ The slowest type of query requires mul0ple map and reduce phases
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-14
Viewing the Execu,on Plan
EXPLAIN SELECT *
FROM customers;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-15
Viewing a Query Plan with EXPLAIN (1)
STAGE DEPENDENCIES:
... (excerpt shown on next slide)
STAGE PLANS:
... (excerpt shown on upcoming slide)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-16
Viewing a Query Plan with EXPLAIN (2)
STAGE DEPENDENCIES:
§ Our query has four stages Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
§ Dependencies define order Stage-3 depends on stages: Stage-0
1. Stage-1 (first) Stage-2 depends on stages: Stage-3
2. Stage-0
3. Stage-3 STAGE PLANS:
... (shown on next slide)
4. Stage-2 (last)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-17
Viewing a Query Plan with EXPLAIN (3)
STAGE PLANS:
§ Stage-1: MapReduce job Stage: Stage-1
Map Reduce
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-18
Viewing a Query Plan with EXPLAIN (4)
STAGE PLANS:
Stage: Stage-1 (covered earlier)…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-19
Viewing a Query Plan with EXPLAIN (5)
STAGE PLANS:
Stage: Stage-1 (covered earlier) ...
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-20
Viewing a Job in Hue (1)
Job Browser
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-21
Viewing a Job in Hue (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-22
Hive Server Web UI
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-23
Parallel Execu,on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-24
Using Hive with Hadoop Standalone Mode
SET mapreduce.framework.name=local;
§ Runs the job in a single Java Virtual Machine (JVM) on the Hive server
– Significantly reduces query latency
– But only appropriate to use with small amounts of data
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-25
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-26
Bucke,ng Data in Hive
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-27
Crea,ng a Bucketed Table
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-28
Inser,ng Data into a Bucketed Table
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE orders_bucketed
SELECT * FROM orders;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-29
Sampling Data from a Bucketed Table
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-30
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-31
Indexes in Hive
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-32
Viewing and Building Indexes in Hive
§ This command lists the indexes associated with the orders table
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-33
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-34
MapReduce and Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-35
Hive on Spark
SET hive.execution.engine=spark;
§ Expect a long delay as Spark initializes after you submit the first query
– Subsequent queries run without a delay
§ Hive on Spark requires more memory on cluster than Hive on MapReduce
– And uses memory even when queries are not running
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-36
When to Use Hive on Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-37
Viewing a Hive on Spark Job
§ The Hive on Spark web UI displays running and recent jobs and job details
– Access from the Hue Job Browser or the Hadoop web UI
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-38
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-39
Essen,al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-40
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-41
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-42
Hands-on Exercise: Hive Op,miza,on
§ In this exercise, you will prac0ce techniques to improve Hive query
performance
– Please refer to the Hands-On Exercise Manual for instruc,ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 15-43
Apache Impala Op,miza,on
Chapter 16
Course Chapters
§ Introduc,on
§ Apache Hadoop Fundamentals
§ Introduc,on to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ Mul,-Dataset Opera,ons with Apache Pig
§ Apache Pig Troubleshoo,ng and Op,miza,on
§ Introduc,on to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ Rela,onal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive Op,miza,on
§ Apache Impala Op0miza0on
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-2
Apache Impala Op,miza,on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-4
Impala in the Cluster
– The statestore
– Provides lookup service for Impala DataNode
Daemon (HDFS)
Impala daemons Worker
– Periodically checks status of Nodes
Impala DataNode
Impala daemons Daemon (HDFS)
– The catalog service
– Relays metadata changes to all Impala DataNode
the Impala daemons in a cluster Daemon (HDFS)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-5
How Impala Executes a Query
Impala
HDFS
Daemon
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-6
Metadata Caching (1)
Impala Metadata
HDFS
Daemon cache
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-7
Metadata Caching (2)
Metastore
CREATE TABLE Impala Metadata
suppliers (…) Daemon cache
metadata
Impala Metadata in RDBMS
Daemon cache
Impala Metadata
Daemon cache
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-8
External Changes and Metadata Caching
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-9
Upda,ng the Impala Metadata Cache
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-10
Query Fault Tolerance in Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-11
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-12
Impala Performance Overview
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-13
Impala Memory Usage
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-14
Query Performance Op,miza,on (1)
§ Impala uses sta0s0cs about tables to op0mize joins and similar func0ons
§ You should compute sta0s0cs for tables with COMPUTE STATS
– Ader you load a table ini,ally
– When the amount of data in a table changes substan,ally
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-15
Query Performance Op,miza,on (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-16
Viewing the Query Execu,on Plan
§ To view Impala’s execu0on plan, prefix your query with EXPLAIN or use
the Explain buaon in Hue
+---------------------------------+
| Explain String |
+---------------------------------+
| Estimated Per-Host Requirements:
Memory=48.00MB VCores=1
...
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-17
Example: Execu,on Plan (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-18
Example: Execu,on Plan (2)
06:AGGREGATE [FINALIZE]
| output: count:merge(o.order_id)
|
05:EXCHANGE [UNPARTITIONED]
|
03:AGGREGATE
Read query | output: count(o.order_id)
|
stages from the 02:HASH JOIN [INNER JOIN, BROADCAST]
boDom up | hash predicates: d.order_id = o.order_id
| runtime filters: RF000 <- o.order_id
|
|--04:EXCHANGE [BROADCAST]
| |
| 00:SCAN HDFS [default.orders o]
| partitions=1/1 files=4 size=60.26MB
| predicates: year(o.order_date) = 2008
|
01:SCAN HDFS [default.order_details d]
partitions=1/1 files=4 size=50.86MB
runtime filters: RF000 -> d.order_id
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-19
Example: Execu,on Plan (3)
…
|
02:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: d.order_id = o.order_id
| runtime filters: RF000 <- o.order_id
|
|--04:EXCHANGE [BROADCAST]
| |
| 00:SCAN HDFS [default.orders o]
| partitions=1/1 files=4 size=60.26MB
Joins require | predicates: year(o.order_date) = 2008
scans of both |
tables 01:SCAN HDFS [default.order_details d]
partitions=1/1 files=4 size=50.86MB
runtime filters: RF000 -> d.order_id
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-20
Segng the Explain Level
06:AGGREGATE [FINALIZE]
05:EXCHANGE [UNPARTITIONED]
03:AGGREGATE
02:HASH JOIN [INNER JOIN, BROADCAST]
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-21
Query Details ader Execu,on
§ The Impala shell provides commands to show details acer running a query
(not available in Hue)
– SUMMARY: Overview of ,mings for query phases
– PROFILE: Detailed report of query execu,on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-22
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-23
Essen,al Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-24
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-25
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-26
Hands-On Exercise: Impala Op,miza,on
§ In this Hands-On Exercise, you will explore query execu0on plans for various
types of queries
– Please refer to the Hands-On Exercise Manual for instruc,ons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 16-27
Extending Apache Hive and Impala
Chapter 17
Course Chapters
§ IntroducGon
§ Apache Hadoop Fundamentals
§ IntroducGon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulG-Dataset OperaGons with Apache Pig
§ Apache Pig TroubleshooGng and OpGmizaGon
§ IntroducGon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaGonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpGmizaGon
§ Apache Impala OpGmizaGon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-2
Extending Apache Hive and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-4
Recap: Hive File Formats and SerDes
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-5
File Format Java Classes
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-6
Specifying a File Format in Hive
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-7
How Are File Formats Related to SerDes?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-8
Custom File Formats and SerDes in Hive
§ Hive allows creaDon of custom SerDes and file formats using its Java APIs
– You can find open source Hive SerDes and file formats on the web
– WriGng your own is seldom necessary
§ Next we show how to use a custom SerDe and InputFormat in Hive
– To read data in XML format
– Using JAR file from http://tiny.cloudera.com/xml_serde
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-9
Adding a JAR File to Hive
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-10
Example: Custom XML SerDe and InputFormat (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-11
Example: Custom XML SerDe and InputFormat (2)
Input Data
<records>
<record customer_id="1">
<firstname>Seo-yeon</firstname>
<lastname>Lee</lastname>
</record>
<record customer_id="2">
<firstname>Xavier</firstname>
<lastname>Gray</lastname>
</record>
</records>
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-12
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-13
Using TRANSFORM to Process Data Using External Scripts
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-14
Data Input and Output with TRANSFORM
§ Your external program will receive one record per line on standard input
– Each field in the supplied record will be a tab-separated string
– NULL values are converted to the literal string \N
§ You may need to convert values to appropriate types within your program
– For example, converGng to numeric types for calculaGons
§ Your program must return tab-delimited fields on standard output
– Output fields can opGonally be named and cast using the syntax below
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-15
Hive TRANSFORM Example (1)
employees table
fname email
Antoine antoin@example.fr
Kai kai@example.de
Pedro pedro@example.mx
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-16
Hive TRANSFORM Example (2)
#!/usr/bin/env perl
while (<STDIN>) {
($name, $email) = split /\t/;
($suffix) = $email =~ /\.([a-z]+)$/;
$greeting = $greetings{$suffix};
$greeting = 'Hello' unless defined($greeting);
print "$greeting $name\n";
}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-17
Hive TRANSFORM Example (3)
#!/usr/bin/env perl
while (<STDIN>) {
The first line tells the system to use the Perl interpreter
($name, $email) = split /\t/;
when=running
($suffix) $emailthis =~script.
/\.([a-z]+)$/;
$greeting = $greetings{$suffix};
$greeting = 'Hello'
The next unless
line defines defined($greeting);
the greetings using an associative
print "$greeting $name\n";
} array keyed by the country codes from the email
addresses.
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-18
Hive TRANSFORM Example (4)
#!/usr/bin/env perl
%greetings = ('de'
Read => 'Hallo',
each record from standard input within the loop,
'fr' => 'Bonjour',
and then split=>them
'mx' into fields based on tab characters.
'Hola');
while (<STDIN>) {
($name, $email) = split /\t/;
($suffix) = $email =~ /\.([a-z]+)$/;
$greeting = $greetings{$suffix};
$greeting = 'Hello' unless defined($greeting);
print "$greeting $name\n";
}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-19
Hive TRANSFORM Example (5)
#!/usr/bin/env perl
while (<STDIN>) {
($name, $email) = split /\t/;
($suffix) = $email =~ /\.([a-z]+)$/;
$greeting = $greetings{$suffix};
$greeting = 'Hello' unless defined($greeting);
print "$greeting $name\n";
} Extract the country code from the email address (the pattern
matches any letters following the final dot). Use that to look up
a greeting, but default to ‘Hello’ if none is found.
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-20
Hive TRANSFORM Example (6)
#!/usr/bin/env perl
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-21
Hive TRANSFORM Example (7)
Bonjour Antoine
Hallo Kai
Hola Pedro
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-22
Using Scripts in a Secure Cluster
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-23
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-24
Overview of User-Defined FuncGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-25
Developing Hive User-Defined FuncGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-26
Example: Using a UDF in Hive (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-27
Example: Using a UDF in Hive (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-29
Overview of Impala User-Defined FuncGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-30
Using a Java UDF in Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-31
Using a C++ UDF in Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-32
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-33
Parameterized Queries
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-34
Hive Variables (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-35
Hive Variables (2)
§ You can also set variables when you start Beeline from the command line
§ For example, the following query counts the unique customers in a state
– This HiveQL is saved in the file state.hql
state.hql
SELECT COUNT(DISTINCT cust_id) FROM customers
WHERE state = '${hivevar:mystate}';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-36
Impala Variables (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-37
Impala Variables (2)
§ You can set variables when you start Impala shell from the command line
– The syntax differs slightly from Beeline
state.sql
SELECT COUNT(DISTINCT cust_id) FROM customers
WHERE state = '${var:mystate}';
Command line
$ impala-shell --var=mystate="CA" -f state.sql
$ impala-shell --var=mystate="NY" -f state.sql
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-38
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-39
EssenGal Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-40
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-41
Hands-On Exercise: Data TransformaGon with Hive
§ In this Hands-On Exercise, you will use a Hive transform script and a UDF to
esDmate shipping costs on abandoned orders
– Please refer to the Hands-On Exercise Manual for instrucGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriEen consent from Cloudera. 17-42
Choosing the Best Tool for the Job
Chapter 18
Course Chapters
§ IntroducGon
§ Apache Hadoop Fundamentals
§ IntroducGon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulG-Dataset OperaGons with Apache Pig
§ Apache Pig TroubleshooGng and OpGmizaGon
§ IntroducGon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaGonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpGmizaGon
§ Apache Impala OpGmizaGon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-2
Choosing the Best Tool for the Job
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-4
Recap of Data Analysis and Processing Tools
§ Pig
– Procedural data flow language executed using MapReduce
§ Hive
– SQL-based queries executed using MapReduce or Spark
§ Impala
– High-performance SQL-based queries using a custom execuGon engine
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-5
Comparing Pig, Hive, and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-6
Do These Replace an RDBMS?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-7
Comparing an RDBMS to Hive and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-8
Hive Features Currently Unsupported in Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-9
Apache Sqoop
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-10
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-11
Which to Choose?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-12
Analysis Workflow Example
Hadoop Cluster
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-13
When to Use MapReduce or Spark?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-14
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-15
EssenGal Points
§ You have learned about several tools for data processing and analysis
– Each is beDer at some tasks than others
– Choose the best one for a given job
– Workflows may involve exchanging data between them
§ SelecNon criteria include scale, speed, control, and producNvity
– MapReduce and Spark offer control at the cost of producGvity
– Pig and Hive offer producGvity but not necessarily speed
– RelaGonal databases offer speed but not scalability
– Impala offers scalability and speed but has some limitaGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-16
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-17
OpGonal Hands-On Exercise: Analyzing Abandoned Carts
§ In this opNonal Hands-On Exercise, you will use your preferred tool to
analyze data about abandoned orders to determine if a free shipping
promoNon would be profitable
– Please refer to the Hands-On Exercise Manual for instrucGons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriDen consent from Cloudera. 18-18
Conclusion
Chapter 19
Course Chapters
§ IntroducDon
§ Apache Hadoop Fundamentals
§ IntroducDon to Apache Pig
§ Basic Data Analysis with Apache Pig
§ Processing Complex Data with Apache Pig
§ MulD-Dataset OperaDons with Apache Pig
§ Apache Pig TroubleshooDng and OpDmizaDon
§ IntroducDon to Apache Hive and Impala
§ Querying with Apache Hive and Impala
§ Apache Hive and Impala Data Management
§ Data Storage and Performance
§ RelaDonal Data Analysis with Apache Hive and Impala
§ Complex Data with Apache Hive and Impala
§ Analyzing Text with Apache Hive and Impala
§ Apache Hive OpDmizaDon
§ Apache Impala OpDmizaDon
§ Extending Apache Hive and Impala
§ Choosing the Best Tool for the Job
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wri@en consent from Cloudera. 19-2
Conclusion (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wri@en consent from Cloudera. 19-3
Conclusion (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wri@en consent from Cloudera. 19-4
Which Course to Take Next?
Cloudera offers a range of training courses for you and your team
§ For developers
– Cloudera Search Training
– Cloudera Training for Apache HBase
– Developer Training for Apache Spark and Hadoop
§ For system administrators
– Cloudera Administrator Training for Apache Hadoop
§ For data scienDsts
– Data Science at Scale using Spark and Hadoop
§ For architects, managers, CIOs, and CTOs
– Cloudera Essen=als for Apache Hadoop
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wri@en consent from Cloudera. 19-5
Cloudera CerDfied Associate (CCA) Data Analyst
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wri@en consent from Cloudera. 19-6
Extending Apache Pig
Appendix A
Extending Apache Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-2
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-3
The Need for Parameters (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-4
The Need for Parameters (2)
§ You may need to change the script slightly for each run
– For example, to modify the paths or filter criteria
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-5
Making the Script More Flexible with Parameters
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-6
Two Tricks for Specifying Parameter Values
INPUT=sales
MINPRICE=999
# comments look like this
NAME='Alice'
– Use -m filename opLon to tell Pig which file contains the values
§ Parameter values can be defined with the output of a shell command
– For example, to set MONTH to the current month
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-7
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-8
The Need for Macros
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-9
Defining a Macro in Pig LaLn
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-10
Invoking Macros
§ To invoke a macro, call it by name and supply values in the correct order
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-11
Reusing Code with Imports
import 'commission_calc.pig';
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-12
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-13
User-Defined FuncLons (UDFs)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-14
Using UDFs WriBen in Java
REGISTER '/path/to/myudf.jar';
...
data = FOREACH allsales GENERATE com.example.MYFUNC(name);
REGISTER '/path/to/myudf.jar';
DEFINE FOO com.example.MYFUNC;
...
data = FOREACH allsales GENERATE FOO(name);
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-15
WriLng UDFs in Python (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-16
WriLng UDFs in Python (2)
if len(phone) == 12:
# XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
areacode = phone[0:3]
elif len(phone) == 14:
# (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
areacode = phone[1:4]
return areacode
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-17
Invoking Python UDFs from Pig
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-18
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-19
Open Source UDFs
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-20
Piggy Bank
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-21
Apache DataFu (incubating)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. A1-22
Using a Contributed UDF
Pig LaLn
REGISTER '/usr/lib/pig/datafu-*.jar';
DEFINE DIST datafu.pig.geo.HaversineDistInMiles;
Output data
(2564.207116295711)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-23
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-24
Recap: Load and Store FuncLons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-25
Sharing Data Between Pig, Hive, and Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-26
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-27
Processing Data with an External Script
§ While Pig La=n is powerful, some tasks are easier in another language
§ Pig allows you to stream data through another language for processing
– This is done using STREAM
§ Similar concept to Hadoop Streaming and HiveQL’s TRANSFORM
– Data is supplied to the script on standard input as tab-delimited fields
– Script writes results to standard output as tab-delimited fields
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-28
STREAM Example in Python (1)
§ This example will calculate a user’s age given that user’s birthdate
– This calculaLon is done in a Python script named agecalc.py
§ Here is the corresponding Pig La=n code
– BackLcks used to quote script name following the alias
– Single quotes used for quoLng script name within SHIP
– The schema for the data produced by the script follows the AS keyword
DUMP out;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-29
STREAM Example in Python (2)
#!/usr/bin/env python
import sys
from datetime import datetime
d1 = datetime.strptime(birthdate, '%Y-%m-%d')
d2 = datetime.now()
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-30
STREAM Example in Python (3)
§ The Pig script again, and the data it reads and writes
DUMP out;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-31
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-32
EssenLal Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-33
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-34
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-35
Hands-On Exercise: Extending Pig with Streaming and UDFs
§ In this Hands-On Exercise, you will process data with an external script and a
user-defined func=on.
– Please refer to the Hands-On Exercise Manual for instrucLons
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. A1-36