Pig: Building High-Level Dataflows Over Map-Reduce


Data Processing Renaissance
• Internet companies swimming in data
  – E.g. TBs/day at Yahoo!
• Data analysis is “inner loop” of product innovation
• Data analysts are skilled programmers
Data Warehousing …?
Scale: Often not scalable enough
$$$$: Prohibitively expensive at web scale
• Up to $200K/TB
SQL:
• Little control over execution method
• Query optimization is hard
  – Parallel environment
  – Little or no statistics
  – Lots of UDFs
New Systems For Data Analysis
• Map-Reduce
• Apache Hadoop ...
• Dryad
Map-Reduce
[Figure: input records → map → intermediate key-value pairs (k1 v1, k2 v2, k1 v3, k2 v4, k1 v5) → grouped by key → reduce → output records]
Just a group-by-aggregate?
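A minimal single-process sketch of the map, shuffle, and reduce steps may make the "group-by-aggregate" reading concrete. The function name `map_reduce` and the sample records are assumptions for illustration, not part of Hadoop or of the original slides.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Imitate one map-reduce job in a single process."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            # Shuffle: values that share a key end up in the same group.
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: (url, category) records.
visits = [("cnn.com", "News"), ("bbc.com", "News"), ("espn.com", "Sports")]

counts = map_reduce(
    visits,
    map_fn=lambda rec: [(rec[1], 1)],            # key = category, value = 1
    reduce_fn=lambda key, values: sum(values),   # aggregate per key
)
print(counts)
```

Seen this way, a single job is indeed a group-by followed by an aggregate; anything else has to be expressed inside `map_fn` and `reduce_fn`.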
The Map-Reduce Appeal
Scale: Scalable due to simpler design
• Only parallelizable operations
• No transactions
$: Runs on cheap commodity hardware
SQL: Procedural control, a processing “pipe”
Disadvantages
1. Extremely rigid data flow (M → R)
• Other flows constantly hacked in: Join, Union, Split, Chains
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
Pros And Cons
Need a high-level, general data flow language

Enter Pig Latin
Need a high-level, general data flow language

Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
Pig Latin Example 1
Suppose we have a table
urls: (url, category, pagerank)
Simple SQL query that finds,
For each sufficiently large category, the average
pagerank of high-pagerank urls in that category
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000
Equivalent Pig Latin program
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Data Flow
1. Filter good_urls by pagerank > 0.2
2. Group by category
3. Filter category by count > 10^6
4. Foreach category generate avg. pagerank
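The same four steps can be sketched in plain Python to show what each one produces. This is only an illustration of the semantics, not how Pig executes the program; the function name, the sample rows, and the lowered count threshold are assumptions.

```python
from collections import defaultdict
from statistics import mean

def avg_pagerank_by_category(urls, min_pagerank=0.2, min_count=10**6):
    # Step 1: keep only high-pagerank urls.
    good_urls = [row for row in urls if row[2] > min_pagerank]
    # Step 2: group by category (each group is a bag of pageranks).
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)
    # Step 3: keep only sufficiently large categories.
    big_groups = {c: prs for c, prs in groups.items() if len(prs) > min_count}
    # Step 4: average pagerank per remaining category.
    return {c: mean(prs) for c, prs in big_groups.items()}

# Hypothetical rows: (url, category, pagerank).
urls = [
    ("a.com", "News", 0.9),
    ("b.com", "News", 0.5),
    ("c.com", "News", 0.1),
    ("d.com", "Sports", 0.8),
]
# A threshold of 1 stands in for 10^6 so that the toy data yields a result.
result = avg_pagerank_by_category(urls, min_count=1)
print(result)
```

Each intermediate variable corresponds to one named relation in the Pig Latin program, which is what makes the step-by-step form easy to inspect.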
Pig
• Consists of a scripting language known as Pig Latin and the Pig Latin compiler.
– Pig Latin is a high-level scripting language used to write data analysis code.
– The compiler converts the code into equivalent MapReduce code.
– Writing code in Pig is easier than programming in MapReduce directly.
– Pig has an optimizer that chooses an efficient way to execute the program.
Benefits
• Ease of coding: Complex tasks involving interrelated data transformations are written explicitly as data flow sequences.
• Optimization: Tasks are encoded in a way that lets the system optimize their execution, so the user can concentrate on the data processing rather than on efficiency.
• Extensibility: Users can create their own custom (user-defined) functions.
Why use Pig?
[Figure: the same task coded in Map-Reduce and in Pig Latin]
Example
• Pig Latin is procedural
Pig Latin is procedural and fits very naturally in the pipeline paradigm. SQL, on the other hand, is declarative.
• Consider, for example, a simple pipeline where data from sources users and clicks is joined and filtered, then joined to data from a third source, geoinfo, aggregated, and finally stored into a table ValuableClicksPerDMA.
SQL Query
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;
• SQL is declarative rather than step-by-step
The Pig Latin for this will look like:
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
• Pig Latin is procedural (dataflow programming model)

– Step-by-step query style is much cleaner and easier to
write
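To see what each named step of the Pig Latin pipeline holds, the same flow can be imitated in Python with one variable per relation. The sample rows are invented for illustration, and the nested-loop joins are a simplification, not Pig's join strategy.

```python
from collections import Counter

# Hypothetical sample data, using the schemas from the load statements.
users = [("amy", 30, "1.1.1.1"), ("fred", 41, "2.2.2.2")]    # (name, age, ipaddr)
clicks = [("amy", "cnn.com", 3), ("amy", "bbc.com", 0),
          ("fred", "espn.com", 5)]                           # (user, url, value)
geoinfo = [("1.1.1.1", "SF"), ("2.2.2.2", "NYC")]            # (ipaddr, dma)

# ValuableClicks = filter Clicks by value > 0
valuable_clicks = [c for c in clicks if c[2] > 0]
# UserClicks = join Users by name, ValuableClicks by user
user_clicks = [u + c for u in users for c in valuable_clicks if u[0] == c[0]]
# UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr
user_geo = [uc + g for uc in user_clicks for g in geoinfo if uc[2] == g[0]]
# ByDMA = group UserGeo by dma, then count the rows in each group
valuable_clicks_per_dma = Counter(row[-1] for row in user_geo)
print(valuable_clicks_per_dma)
```

Any intermediate list can be printed on its own, which is the debugging advantage the step-by-step style is claimed to have over a single nested query.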
Example Data Analysis Task
Find the top 10 most visited pages in each category
Visits
User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9
Data Flow
1. Load Visits
2. Group by url
3. Foreach url generate count
4. Load Url Info
5. Join on url
6. Group by category
7. Foreach category generate top10 urls
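As a rough check of what this flow computes, here is a Python sketch run on the Visits and Url Info rows from the example task. The function name and the use of `heapq.nlargest` for the top-N step are illustrative choices, not Pig internals.

```python
from collections import Counter, defaultdict
import heapq

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]

def top_urls_per_category(visits, url_info, n=10):
    # Group by url, then count the visits in each group.
    visit_counts = Counter(url for _user, url, _time in visits)
    # Join on url (inner join: urls without visits or without info drop out).
    category_of = {url: category for url, category, _rank in url_info}
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in category_of:
            by_category[category_of[url]].append((url, count))
    # Foreach category, keep the n most visited urls.
    return {category: heapq.nlargest(n, rows, key=lambda row: row[1])
            for category, rows in by_category.items()}

top = top_urls_per_category(visits, url_info)
print(top)
```

Note that espn.com has no visits, so the Sports category does not appear in the result of the inner join.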
Dataflow Language
User specifies a sequence of steps where each step specifies only a single high-level data transformation
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
Jasmine Novak, Engineer, Yahoo!
Quick Start and Interoperability
Operates directly over files
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

Quick Start and Interoperability
Schemas optional; can be assigned dynamically
gVisits = group visits by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
User-Code as a First-Class Citizen
User-defined functions (UDFs) can be used in every construct
• Load, Store
• Group, Filter, Foreach
gVisits = group visits by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
UDFs as First-Class Citizens
• User-Defined Functions (UDFs) can be used in every construct
– Load, Store, Group, Filter, Foreach
• Example 2
Suppose we want to find, for each category, the top 10 urls according to pagerank
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
Data Model
• Tuple: A tuple is an ordered set of fields.
Example: (Raju, 30)
• Bag: A bag is a collection of tuples.
Example: {(Raju, 30), (Mohammad, 45)}
• Map: A map is a set of key-value pairs.
Example: ['name'#'Raju', 'age'#30]
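For readers more used to a general-purpose language, the three types map loosely onto Python built-ins. The correspondence is an analogy only (a Pig bag is unordered and may hold duplicates, which a Python list only approximates), and the variable names are made up.

```python
# Tuple: an ordered set of fields.
person = ("Raju", 30)

# Bag: a collection of tuples; duplicates are allowed.
people = [("Raju", 30), ("Mohammad", 45)]

# Map: a set of key-value pairs.
profile = {"name": "Raju", "age": 30}

# Types nest freely: here a tuple whose second field is a bag.
nested = ("yahoo", [("finance",), ("email",), ("news",)])

print(person[1], len(people), profile["name"], len(nested[1]))
```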
Nested Data Model
• Pig Latin has a fully-nestable data model with:
– Atomic values, tuples, bags (lists), and maps
– Example: (yahoo, {finance, email, news})
• More natural to programmers than flat tuples
• Avoids expensive joins
Pig Latin – Relational Operations
Nested Data Model
Decouples grouping as an independent operation

Visits
User   Url       Time
Amy    cnn.com   8:00
Amy    bbc.com   10:00
Amy    bbc.com   10:05
Fred   cnn.com   12:00

group by url

group     Visits
cnn.com   {(Amy, cnn.com, 8:00), (Fred, cnn.com, 12:00)}
bbc.com   {(Amy, bbc.com, 10:00), (Amy, bbc.com, 10:05)}
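A short Python sketch of grouping as a standalone step, producing one (key, bag) pair per distinct url. The helper name `group_by` is an assumption; the rows are the Visits rows from the slide.

```python
from collections import defaultdict

def group_by(relation, key_index):
    """Return one (key, bag) pair per distinct key; no aggregation is applied."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return list(groups.items())

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "bbc.com", "10:05"), ("Fred", "cnn.com", "12:00")]

grouped = group_by(visits, key_index=1)
for key, bag in grouped:
    print(key, bag)
```

Because the bags are kept intact, any later step (a count, a top-N UDF, a filter on bag size) can be applied to them, which is the point of decoupling grouping from aggregation.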
CoGroup
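results
query    url        rank
Lakers   nba.com    1
Lakers   espn.com   2
Kings    nhl.com    1
Kings    nba.com    2

revenue
query    adSlot   amount
Lakers   top      50
Lakers   side     20
Kings    top      30
Kings    side     10

cogroup by query

group    results                                          revenue
Lakers   {(Lakers, nba.com, 1), (Lakers, espn.com, 2)}    {(Lakers, top, 50), (Lakers, side, 20)}
Kings    {(Kings, nhl.com, 1), (Kings, nba.com, 2)}       {(Kings, top, 30), (Kings, side, 10)}

Cross-product of the 2 bags would give natural join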
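The relationship between cogroup and join can be sketched in Python: cogroup builds a pair of bags per key, and taking the cross product of each pair yields the join rows. The `cogroup` helper is hypothetical, and the joined rows here keep the key column from both inputs rather than merging it.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, left_key=0, right_key=0):
    """Map each key to a pair of bags: (rows from left, rows from right)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    return dict(groups)

results = [("Lakers", "nba.com", 1), ("Lakers", "espn.com", 2),
           ("Kings", "nhl.com", 1), ("Kings", "nba.com", 2)]
revenue = [("Lakers", "top", 50), ("Lakers", "side", 20),
           ("Kings", "top", 30), ("Kings", "side", 10)]

cogrouped = cogroup(results, revenue)

# Cross product of the two bags of each group gives the join on query.
joined = [l + r
          for left_bag, right_bag in cogrouped.values()
          for l, r in product(left_bag, right_bag)]
print(len(joined))
```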

Outline
• Pig Latin
• Future Work
Implementation
[Figure: a user submits Pig Latin, or SQL via automatic rewrite + optimize, to Pig, which runs on a Hadoop Map-Reduce cluster]
Pig is open-source.
http://hadoop.apache.org/pig
• ~50% of Hadoop jobs at Yahoo! are Pig
• 1000s of jobs per day
Building a Logical Plan
• The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags it refers to are valid
• Builds a logical plan for every bag that the user defines
• Processing is triggered only when the user invokes a STORE command on a bag (at that point, the logical plan for that bag is compiled into a physical plan and executed)
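The deferred execution described above can be imitated with a small Python sketch: each step only records what to do, and nothing runs until `store` is called. The `Relation` class and `load` helper are invented for illustration and are not Pig's actual plan classes.

```python
class Relation:
    """One step of a logical plan; nothing executes until store() is called."""

    def __init__(self, op, parent=None):
        self.op = op            # function: list of rows -> list of rows
        self.parent = parent    # the step this one reads from, if any

    def filter(self, predicate):
        return Relation(lambda rows: [r for r in rows if predicate(r)], self)

    def foreach(self, fn):
        return Relation(lambda rows: [fn(r) for r in rows], self)

    def _execute(self):
        rows = self.parent._execute() if self.parent else []
        return self.op(rows)

    def store(self, sink):
        # Only now is the whole chain of steps actually run.
        sink.extend(self._execute())


def load(rows, log):
    def read(_unused):
        log.append("read")      # record that the input was really touched
        return list(rows)
    return Relation(read)


log = []
urls = load([("a.com", 0.9), ("b.com", 0.1)], log)
good_urls = urls.filter(lambda row: row[1] > 0.2)
names = good_urls.foreach(lambda row: row[0])
reads_before_store = len(log)   # only a plan exists so far

output = []
names.store(output)             # the plan for `names` runs here
print(reads_before_store, output, log)
```

Deferring work this way gives the system a complete plan to inspect before anything runs, which is what makes whole-program optimization possible.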
Compilation into Map-Reduce
Every group or join operation forms a map-reduce boundary
Other operations are pipelined into map and reduce phases
Map1: Load Visits
Map1 / Reduce1 boundary: Group by url
Reduce1: Foreach url generate count
Map2: Load Url Info
Map2 / Reduce2 boundary: Join on url
Map3 / Reduce3 boundary: Group by category
Reduce3: Foreach category generate top10(urls)
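The boundary rule can be sketched for a simple linear chain of operators. This is a deliberately simplified model (real plans are graphs with multiple inputs, and Pig's compiler does more than this); the function name and the operator strings are assumptions.

```python
BOUNDARY_OPS = {"GROUP", "JOIN", "COGROUP"}

def split_into_jobs(ops):
    """Split a linear operator list into (map_ops, reduce_ops) pairs, one per job."""
    jobs = []
    map_ops, reduce_ops, in_reduce = [], [], False
    for op in ops:
        name = op.split()[0].upper()
        if name in BOUNDARY_OPS:
            if in_reduce:
                # Another shuffle is needed, so the current job ends here.
                jobs.append((map_ops, reduce_ops))
                map_ops, reduce_ops = [], []
            # The grouping key is emitted at the end of the map phase.
            map_ops.append(op)
            in_reduce = True
        elif in_reduce:
            reduce_ops.append(op)   # pipelined into the reduce phase
        else:
            map_ops.append(op)      # pipelined into the map phase
    jobs.append((map_ops, reduce_ops))
    return jobs

ops = ["LOAD visits", "GROUP by url", "FOREACH url generate count",
       "JOIN on url", "GROUP by category",
       "FOREACH category generate top10(urls)"]
jobs = split_into_jobs(ops)
for i, (m, r) in enumerate(jobs, start=1):
    print(f"Map{i}: {m}  Reduce{i}: {r}")
```

On the example flow this yields three jobs, matching the Map1/Reduce1 through Map3/Reduce3 phases on the slide.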
Optimizations: Skew Join
• Default join method is symmetric hash join
– Cross product of the grouped bags (e.g. results and revenue for Lakers, for Kings) carried out on 1 reducer
• Problem if too many values with same key
• Skew join further splits them among reducers
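One common way to split the work for a heavily repeated key is to partition the rows of one input across several reducers and send a full copy of the other input's rows to each. The sketch below illustrates that idea only; it is not claimed to be Pig's exact algorithm, and the function name and data are invented.

```python
from itertools import product

def skewed_key_partitions(left_rows, right_rows, num_reducers):
    """Split the cross product for one hot key across several reducers."""
    partitions = []
    for i in range(num_reducers):
        # Each reducer receives a round-robin slice of the left rows...
        left_slice = left_rows[i::num_reducers]
        # ...and a full copy of the right rows for the same key.
        partitions.append(list(product(left_slice, right_rows)))
    return partitions

# Hypothetical hot key "Lakers" with many rows on the left side.
left = [("Lakers", f"url{i}") for i in range(6)]
right = [("Lakers", "top", 50), ("Lakers", "side", 20)]

parts = skewed_key_partitions(left, right, num_reducers=3)
combined = [pair for part in parts for pair in part]
print([len(part) for part in parts])
```

The union of the per-reducer outputs equals the full cross product, so the join result is unchanged while no single reducer has to hold all rows for the key.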
Optimizations: Multiple Data Flows
Map1: Load Users → Filter bots
Map1 / Reduce1 boundary: Group by state | Group by demographic
Reduce1: Apply udfs | Apply udfs
Store into 'bystate' | Store into 'bydemo'
Optimizations: Multiple Data Flows
Map1: Load Users → Filter bots → Split
Map1 / Reduce1 boundary: Group by state | Group by demographic
Reduce1: Demultiplex → Apply udfs | Apply udfs
Store into 'bystate' | Store into 'bydemo'
Other Optimizations
• Carry data as byte arrays as far as possible
• Using binary comparator for sorting
• “Streaming” data through external executables
Performance
Outline
• Pig Latin
• Future Work
Example Dataflow Program
Find users that tend to visit high-pagerank pages
1. LOAD (user, url)
2. FOREACH user, canonicalize(url)
3. LOAD (url, pagerank)
4. JOIN on url
5. GROUP on user
6. FOREACH user, AVG(pagerank)
7. FILTER avgPR > 0.5
Iterative Process
1. LOAD, LOAD
2. FOREACH user, canonicalize(url): Bug in UDF canonicalize?
3. JOIN on url: Joining on right attribute?
4. GROUP on user
5. FOREACH user, AVG(pagerank)
6. FILTER avgPR > 0.5: Everything being filtered out?
No output
How to do test runs?
• Run with real data
– Too inefficient (TBs of data)
• Create smaller data sets (e.g., by sampling)
– Empty results due to joins [Chaudhuri et al. 99] and selective filters
• Biased sampling for joins
– Indexes not always present
Examples to Illustrate Program
LOAD (user, url):
(Amy, cnn.com)
(Amy, http://www.frogs.com)
(Fred, www.snails.com/index.html)

FOREACH user, canonicalize(url):
(Amy, www.cnn.com)
(Amy, www.frogs.com)
(Fred, www.snails.com)

LOAD (url, pagerank):
(www.cnn.com, 0.9)
(www.frogs.com, 0.3)
(www.snails.com, 0.4)

JOIN on url:
(Amy, www.cnn.com, 0.9)
(Amy, www.frogs.com, 0.3)
(Fred, www.snails.com, 0.4)

GROUP on user:
(Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})
(Fred, {(Fred, www.snails.com, 0.4)})

FOREACH user, AVG(pagerank):
(Amy, 0.6)
(Fred, 0.4)

FILTER avgPR > 0.5:
(Amy, 0.6)
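The example tuples can be reproduced with a small Python sketch of the same dataflow. The body of `canonicalize` is a guess at what such a UDF might do (drop the scheme and path, add a www. prefix); the slides do not show its implementation.

```python
from collections import defaultdict
from statistics import mean

def canonicalize(url):
    # Hypothetical stand-in for the UDF named on the slide.
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host

visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = {"www.cnn.com": 0.9, "www.frogs.com": 0.3, "www.snails.com": 0.4}

# FOREACH user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]
# JOIN on url
joined = [(user, url, pageranks[url]) for user, url in canonical if url in pageranks]
# GROUP on user
grouped = defaultdict(list)
for user, _url, pagerank in joined:
    grouped[user].append(pagerank)
# FOREACH user, AVG(pagerank)
averages = {user: mean(prs) for user, prs in grouped.items()}
# FILTER avgPR > 0.5
result = {user: avg for user, avg in averages.items() if avg > 0.5}
print(result)
```

If `canonicalize` had a bug, the join would silently produce nothing; printing `canonical` and `joined` at each step is the manual equivalent of the example tuples shown for each operator.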
Value Addition From Examples
• Examples can be used for

– Debugging
– Understanding a program written by someone else
– Learning a new operator, or language
Good Examples: Consistency
0. Consistency
output example = operator applied on input example
[Figure: the example dataflow program with sample tuples shown at LOAD and FOREACH]
Good Examples: Realism
1. Realism
[Figure: the example dataflow program with sample tuple (Amy, cnn.com) shown at LOAD]
Good Examples: Completeness
2. Completeness
Demonstrate the salient properties of each operator, e.g., FILTER
FOREACH user, AVG(pagerank): (Amy, 0.6), (Fred, 0.4)
FILTER avgPR > 0.5: (Amy, 0.6)
Good Examples: Conciseness
3. Conciseness
[Figure: the example dataflow program with sample tuple (Amy, cnn.com) shown at LOAD]
Implementation Status
• Available as ILLUSTRATE command in open-source release of Pig
• Available as Eclipse Plugin (PigPen)
• See SIGMOD09 paper for algorithm and experiments

Related Work
• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
• Hive
– SQL-like language on top of Map-Reduce
• DryadLINQ
– SQL-like language on top of Dryad
• Nested data models
– Object-oriented databases
Future / In-Progress Tasks
• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
– Use loops, conditionals, functions of host language
• Memory Management
• Project Suggestions at:
http://wiki.apache.org/pig/ProposedProjects
Credits
Summary
• Big demand for parallel data processing

– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files
• Hence the excitement about Map-Reduce

• But, Map-Reduce is too low-level and rigid
Pig Latin
Sweet spot between map-reduce and SQL
