
Big Data Hadoop and Spark Developer

Lesson 9—Pig



Learning Objectives

Discuss the basics of Pig

Explain Pig Architecture and Operations


Pig
Topic 1—Introduction to Pig
Why Pig?

Prior to 2006, programs were written only as MapReduce jobs in Java, and developers faced several challenges:

Challenges faced by the developers:
• Handling low-level fundamentals while creating a program
• Need for common operations
• Coding difficulty
• Rigid dataflow

Pig was developed to overcome these challenges.


What Is Pig?

Pig is a scripting platform designed to process and analyze large data sets, and it runs on Hadoop clusters. Pig is extensible, self-optimizing, and easily programmed.
Features of Pig

• Schemas can be assigned dynamically
• Provides step-by-step procedural control
• Supports UDFs and data types
Pig: Example

Yahoo has scientists who use grid tools to scan through petabytes of data.

• Write scripts to test a theory
• In the data factory, data may not be in a standardized state
• Pig supports data with partial or unknown schemas and handles semi-structured or unstructured data
Pig
Topic 2—Pig Architecture and Operations
Pig Architecture
Components of Pig Architecture

• Parser: Does type checking and checks the syntax of the script
• Optimizer: Performs activities like split, merge, transform, reorder, etc.
• Compiler: Compiles the optimized code into a series of MapReduce jobs
• Execution engine: The MapReduce jobs are executed on Hadoop to produce the desired results
Stages of Pig Operations

Stage 1: Load data and write the Pig script

A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';

Stage 2: Pig operations
• Parses and checks the script
• Optimizes the script
• Plans the execution
• Submits to Hadoop
• Monitors job progress

Stage 3: Execution of the plan
• Results are dumped on screen or stored in HDFS
Data Model Supported by Pig

• Atom: A simple atomic value. Example: 'Mike'
• Tuple: A sequence of fields that can be of any data type. Example: ('Mike', 43)
• Bag: A collection of tuples of potentially varying structures that can contain duplicates. Example: {('Mike'), ('Doug', (43, 45))}
• Map: An associative array; the key must be a chararray, but the value can be of any type. Example: [name#Mike, phone#5551212]
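
For illustration, these complex types can be declared directly in a LOAD schema. The sketch below assumes a text file serialized in Pig's default format; the file and field names are hypothetical.

people = LOAD 'people.txt'
         AS (name:chararray,                           -- atom
             details:tuple(age:int, city:chararray),   -- tuple
             kids:bag{t:(kid_name:chararray)},         -- bag of tuples
             contacts:map[chararray]);                 -- map with chararray values
DUMP people;
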
Basics of Pig Latin

By default, Pig treats undeclared fields as bytearrays.

Pig can infer a field's type based on:
1. Schema information provided by a LOAD function or explicitly declared using an AS clause
2. User Defined Functions (UDFs) with a known or explicitly set return type
3. Use of operators that expect a certain type of field

Type conversion is lazy; the data type is enforced only at execution time.
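
As a minimal sketch of these cases (file names are hypothetical), compare an explicit AS schema with an undeclared one whose field types are inferred later:

-- Case 1: schema declared explicitly with an AS clause
A = LOAD 'sales.dat' AS (c_id:int, amount:double);

-- No schema declared: every field defaults to bytearray
B = LOAD 'sales_raw.dat';

-- Case 2: a UDF with a known return type (the built-in UPPER returns chararray)
D = FOREACH B GENERATE UPPER($0);

-- Case 3: the comparison operator lets Pig infer that $1 is numeric;
-- the cast is applied lazily, only when the script actually runs
C = FILTER B BY $1 > 100;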


Pig Execution Modes

Pig has two primary execution modes:

• Local mode: Pig runs against the local OS file system
• MapReduce mode: Pig runs against HDFS

In addition, four further execution modes are supported: Tez, Tez-local, Spark, and Spark-local.
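
For example, the execution mode is chosen with the -x flag when launching Pig (the script name below is hypothetical):

pig -x local myscript.pig        # local mode, uses the local file system
pig -x mapreduce myscript.pig    # MapReduce mode (the default), uses HDFS
pig -x tez myscript.pig          # Tez mode; tez_local, spark, and spark_local work the same way
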
Pig Interactive Modes

• Interactive mode: Code is written and executed line by line in the Grunt shell
• Batch mode: Files containing Pig scripts are created and executed as a batch
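
As a quick sketch, the same statements can be typed interactively at the Grunt prompt or saved into a script file and run as a batch (file names are hypothetical):

# Interactive mode: start the Grunt shell and enter statements line by line
pig -x local
grunt> A = LOAD 'myfile' AS (x, y, z);
grunt> DUMP A;

# Batch mode: run a whole script file in one go
pig -x mapreduce myscript.pig
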
Pig vs. SQL: Example
Track customers in Texas who spend more than $2,000.

SQL:

SELECT c.c_id, SUM(s.amount) AS CTotal
FROM customers c
JOIN sales s ON c.c_id = s.c_id
WHERE c.city = 'Texas'
GROUP BY c.c_id
HAVING SUM(s.amount) > 2000
ORDER BY CTotal DESC;

Pig:

customer = LOAD '/data/customer.dat' AS (c_id, name, city);
sales = LOAD '/data/sales.dat' AS (s_id, c_id, date, amount);
customerTX = FILTER customer BY city == 'Texas';
joined = JOIN customerTX BY c_id, sales BY c_id;
grouped = GROUP joined BY customerTX::c_id;
summed = FOREACH grouped GENERATE group AS c_id, SUM(joined.sales::amount) AS total;
spenders = FILTER summed BY total > 2000;
sorted = ORDER spenders BY total DESC;
DUMP sorted;
Loading and Storing Methods

Loading refers to reading data from the file system into a Pig relation.

Storing refers to writing the output of a relation back to the file system.

• Load data with the keyword LOAD
• Store data with the keyword STORE
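
A minimal sketch of both keywords, assuming a comma-delimited input file (the paths are hypothetical):

sales = LOAD '/data/sales.dat' USING PigStorage(',') AS (s_id:int, c_id:int, date:chararray, amount:double);
big   = FILTER sales BY amount > 1000;
STORE big INTO '/output/big_sales' USING PigStorage(',');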


Functions/Relations on Data

Some of the relations commonly performed by Big Data and Hadoop developers are as follows (a short Pig Latin sketch follows the list):

• Filtering: Selecting data based on a conditional clause, for example on a grade or pay field
• Transforming: Making data presentable for the extraction of logical data
• Grouping: Generating a group of meaningful data
• Sorting: Arranging the data in either ascending or descending order
• Combining: Performing a union operation on the data stored in two or more relations
• Splitting: Logically separating data into two or more relations
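
The sketch below illustrates each of these relations in Pig Latin; the input files and field names are hypothetical:

emp       = LOAD 'emp_2019.dat' AS (name:chararray, dept:chararray, pay:double);
emp_old   = LOAD 'emp_2018.dat' AS (name:chararray, dept:chararray, pay:double);

filtered  = FILTER emp BY pay > 50000;                         -- filtering
projected = FOREACH filtered GENERATE name, dept;              -- transforming
by_dept   = GROUP emp BY dept;                                 -- grouping
sorted    = ORDER emp BY pay DESC;                             -- sorting
all_emp   = UNION emp, emp_old;                                -- combining
SPLIT all_emp INTO high IF pay > 50000, low IF pay <= 50000;   -- splitting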


Pig Commands

Command         Function
load            Reads data from the file system
store           Writes data to the file system
foreach         Applies expressions to each record and outputs one or more records
filter          Applies a predicate and removes records that do not return true
group/cogroup   Collects records with the same key from one or more inputs
join            Joins two or more inputs based on a key
order           Sorts records based on a key
distinct        Removes duplicate records
union           Merges data sets
split           Splits data into two or more sets based on filter conditions
stream          Sends all records through a user-provided binary
dump            Writes output to stdout
limit           Limits the number of records
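
For instance, a few of these commands chained together (the input path is hypothetical):

visits = LOAD '/data/visits.dat' AS (user:chararray, url:chararray);
uniq   = DISTINCT visits;   -- remove duplicate records
top10  = LIMIT uniq 10;     -- keep at most 10 records
DUMP top10;                 -- write the result to stdout
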
Key Takeaways

Pig is a high-level data flow scripting language with two major components: the
runtime engine and the Pig Latin language.

Pig runs in two execution modes: local and MapReduce.

The Pig engine can be installed by downloading it from a mirror linked on
pig.apache.org.

Three prerequisites must be met before setting up the environment for Pig Latin:
all Hadoop services are running properly, Pig is completely installed and
configured, and all required datasets are uploaded to HDFS.
QUIZ
1. Which of the following commands is used to start Pig in MapReduce mode?

a. Pig

b. Pig -x MapReduce

c. Pig -x local

d. Both Pig and Pig -x MapReduce

The correct answer is d.

Either Pig or Pig -x MapReduce can be used to run Pig in MapReduce mode, since MapReduce is the default execution mode.

QUIZ
2. Which of the following keywords in Pig scripting is used for displaying the output on the screen?

a. DUMP

b. STORE

c. LOAD

d. TOKENIZE

The correct answer is a.

DUMP is used to display the output on the screen.

QUIZ
3. Which of the following keywords in Pig scripting is used for accepting input files?

a. LOAD

b. STORE

c. FLATTEN

d. TOKENIZE

The correct answer is a.

LOAD in Pig scripting is used for accepting input files.

QUIZ
4. Which of the following keywords is used to perform combining in Pig?

a. LOAD

b. STORE

c. UNION

d. FOREACH

The correct answer is c.

UNION in Pig scripting is used for combining data.
This concludes “Pig.”
The next lesson is “Basics of Apache Spark.”

©Simplilearn. All rights reserved