
Pig: Building High-Level Dataflows over Map-Reduce


Data Processing Renaissance

• Internet companies swimming in data
– E.g. TBs/day at Yahoo!
• Data analysis is “inner loop” of product innovation
• Data analysts are skilled programmers
Data Warehousing …?

• Scale: often not scalable enough
• $$$$: prohibitively expensive at web scale
– Up to $200K/TB
• SQL: little control over execution method
– Query optimization is hard
– Parallel environment
– Little or no statistics
– Lots of UDFs
New Systems For Data Analysis

• Map-Reduce
• Apache Hadoop
• Dryad
• ...
Map-Reduce

[Figure: input records → map → intermediate (key, value) pairs such as (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5) → pairs grouped by key → reduce → output records]

Just a group-by-aggregate?
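
The group-by-aggregate reading of the figure can be sketched in a few lines of Python. This is a toy, single-process model rather than how Hadoop is implemented; `map_reduce`, `map_fn`, `reduce_fn` and the sample records are names and data assumed purely for illustration.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy, single-process model of one map-reduce job."""
    groups = defaultdict(list)
    # Map phase: each input record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # "Shuffle": collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: (category, pagerank) records.
records = [("News", 0.9), ("News", 0.8), ("Photos", 0.7)]

counts = map_reduce(
    records,
    map_fn=lambda rec: [(rec[0], 1)],           # emit (category, 1)
    reduce_fn=lambda key, values: sum(values),  # count per category
)
print(counts)  # {'News': 2, 'Photos': 1}
```

Seen this way, the map function chooses the grouping key and the reduce function is the aggregate, which is why the slide asks whether map-reduce is "just" a group-by-aggregate.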
The Map-Reduce Appeal

• Scale: scalable due to simpler design
– Only parallelizable operations
– No transactions
• $: runs on cheap commodity hardware
• Procedural control instead of SQL: a processing “pipe”


Disadvantages

1. Extremely rigid data flow (M → R)
• Other flows constantly hacked in: join, union, split, chains
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
Pros And Cons

Need a high-level, general data flow language


Enter Pig Latin

Need a high-level, general data flow language


Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin

• Compilation into Map-Reduce

• Example Generation

• Future Work
Pig Latin Example 1
Suppose we have a table
urls: (url, category, pagerank)
A simple SQL query that finds, for each sufficiently large category, the average pagerank of the high-pagerank urls in that category:

SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000
Equivalent Pig Latin program

good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Data Flow

1. Filter good_urls by pagerank > 0.2
2. Group by category
3. Filter category by count > 10^6
4. Foreach category generate avg. pagerank
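
The four steps can be mirrored one-for-one in plain Python. This is a sketch of the semantics only, not how Pig runs the program; the function name, the sample rows and the lowered `min_count` are assumptions made so the example runs on a handful of records.

```python
from collections import defaultdict

def avg_pagerank_of_big_categories(urls, min_pagerank=0.2, min_count=10**6):
    # Step 1: FILTER urls BY pagerank > min_pagerank
    good_urls = [u for u in urls if u[2] > min_pagerank]
    # Step 2: GROUP good_urls BY category
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)
    # Step 3: FILTER groups BY COUNT(good_urls) > min_count
    big_groups = {c: ranks for c, ranks in groups.items() if len(ranks) > min_count}
    # Step 4: FOREACH big_groups GENERATE category, AVG(pagerank)
    return {c: sum(ranks) / len(ranks) for c, ranks in big_groups.items()}

# Hypothetical sample rows: (url, category, pagerank)
urls = [
    ("a.com", "News", 0.5),
    ("b.com", "News", 1.0),
    ("c.com", "News", 0.1),     # dropped by the pagerank filter
    ("d.com", "Photos", 0.75),  # its category is too small for min_count=1
]
result = avg_pagerank_of_big_categories(urls, min_count=1)
print(result)  # {'News': 0.75}
```

Each intermediate variable corresponds to one named relation in the Pig Latin program, which is the step-by-step style the deck contrasts with a single SQL block.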

Pig
• Consists of a scripting language known as Pig Latin and the Pig Latin compiler.
– Pig Latin is a high-level scripting language used to write data analysis code.
– The compiler converts the code into equivalent MapReduce code.
– It is easier to write code in Pig than to program directly in MapReduce.
– Pig has an optimizer that decides how to get the data quickly.
Benefits
• Ease of coding: complex tasks involving inter-related data transformations are explicitly encoded as data flow sequences.
• Optimization: tasks are encoded in a way that lets the system optimize their execution, so the user can concentrate on the data processing rather than on efficiency.
• Extensibility: users can create their own custom (user-defined) functions.
Why use Pig?
In MAP REDUCE
In Pig Latin
Example
• Pig Latin is Procedural
Pig Latin is procedural, so it fits very naturally in the pipeline paradigm. SQL, on the other hand, is declarative.
• Consider, for example, a simple pipeline where data from the sources users and clicks is joined and filtered, then joined to data from a third source geoinfo, aggregated, and finally stored into a table ValuableClicksPerDMA.
SQL Query

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;

• SQL is declarative, not step-by-step

The Pig Latin for this will look like:
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

• Pig Latin is procedural (dataflow programming model)
– Step-by-step query style is much cleaner and easier to write
Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits
User  Url         Time
Amy   cnn.com     8:00
Amy   bbc.com     10:00
Amy   flickr.com  10:05
Fred  cnn.com     12:00

Url Info
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9
Data Flow

1. Load Visits
2. Group by url
3. Foreach url generate count
4. Load Url Info
5. Join on url
6. Group by category
7. Foreach category generate top10 urls
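
As a rough illustration of what this flow computes, the same steps can be written in plain Python over the two sample tables. The function name `top_urls_per_category` and the choice of an inner join are assumptions of this sketch, not something the slides specify.

```python
from collections import Counter, defaultdict

visits = [  # (user, url, time), from the Visits table
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [  # (url, category, pagerank), from the Url Info table
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]

def top_urls_per_category(visits, url_info, n=10):
    # Group by url, then count the visits in each group.
    visit_counts = Counter(url for _user, url, _time in visits)
    # Join on url (inner join: urls with no visits drop out).
    category_of = {url: category for url, category, _rank in url_info}
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in category_of:
            by_category[category_of[url]].append((url, count))
    # Foreach category, keep the n most visited urls.
    return {
        category: sorted(pairs, key=lambda p: p[1], reverse=True)[:n]
        for category, pairs in by_category.items()
    }

top = top_urls_per_category(visits, url_info)
print(top)
# {'News': [('cnn.com', 2), ('bbc.com', 1)], 'Photos': [('flickr.com', 1)]}
```

Sports is absent from the result because espn.com has no rows in Visits.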
Dataflow Language

User specifies a sequence of steps where each step specifies only a single high-level data transformation

“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”

Jasmine Novak
Engineer, Yahoo!
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate group as url, COUNT(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

Operates directly over files

Schemas optional; can be assigned dynamically


User-Code as a First-Class Citizen
User-defined functions (UDFs) can be used in every construct
• Load, Store
• Group, Filter, Foreach


UDFs as First-Class Citizens

• User-Defined Functions (UDFs) can be used in every construct
– Load, Store, Group, Filter, Foreach
• Example 2
Suppose we want to find, for each category, the top 10 urls according to pagerank

groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
Data Model

• Tuple: A tuple is an ordered set of fields.
Example: (raja, 30)
• Bag: A bag is a collection of tuples.
Example: {(raju,30),(Mohhammad,45)}
• Map: A map is a set of key-value pairs.
Example: ['name'#'Raju', 'age'#30]
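
For readers more familiar with general-purpose languages, a loose Python analogy may help. The mapping below (bag as list, map as dict) is an assumption made for illustration and is not how Pig represents these types internally.

```python
# Tuple: an ordered sequence of fields.
person = ("raja", 30)

# Bag: a collection of tuples (duplicates allowed), modelled here as a list.
people = [("raju", 30), ("Mohhammad", 45)]

# Map: key-value pairs, modelled here as a dict.
attrs = {"name": "Raju", "age": 30}

# Nesting: a tuple whose second field is itself a bag of tuples.
nested = ("yahoo", [("finance",), ("email",), ("news",)])

name, age = person             # fields are accessed by position
ages = [t[1] for t in people]  # project the second field of every tuple
print(name, ages, attrs["name"], len(nested[1]))  # raja [30, 45] Raju 3
```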
Nested Data Model
• Pig Latin has a fully-nestable data model with:
– Atomic values, tuples, bags (lists), and maps
– Example: (yahoo, {finance, email, news})
• More natural to programmers than flat tuples
• Avoids expensive joins
Pig Latin – Relational Operations
Nested Data Model
Decouples grouping as an independent operation

Visits
User  Url      Time
Amy   cnn.com  8:00
Amy   bbc.com  10:00
Amy   bbc.com  10:05
Fred  cnn.com  12:00

group by url gives

group    Visits
cnn.com  {(Amy, cnn.com, 8:00), (Fred, cnn.com, 12:00)}
bbc.com  {(Amy, bbc.com, 10:00), (Amy, bbc.com, 10:05)}
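
A small Python sketch can make the "grouping without aggregating" point concrete. The helper `group_by` is hypothetical; it returns each key with a nested bag of whole rows, and any aggregate is applied afterwards as a separate step.

```python
from collections import defaultdict

def group_by(rows, key_index):
    """Return {key: bag of full rows}; no aggregation happens here."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)

visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "bbc.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
by_url = group_by(visits, key_index=1)

# Aggregation is a separate, later step applied to each nested bag.
visit_counts = {url: len(bag) for url, bag in by_url.items()}
print(visit_counts)  # {'cnn.com': 2, 'bbc.com': 2}
```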

CoGroup

results
query   url       rank
Lakers  nba.com   1
Lakers  espn.com  2
Kings   nhl.com   1
Kings   nba.com   2

revenue
query   adSlot  amount
Lakers  top     50
Lakers  side    20
Kings   top     30
Kings   side    10

Cogrouping results and revenue on query gives

group   results                                        revenue
Lakers  {(Lakers, nba.com, 1), (Lakers, espn.com, 2)}  {(Lakers, top, 50), (Lakers, side, 20)}
Kings   {(Kings, nhl.com, 1), (Kings, nba.com, 2)}     {(Kings, top, 30), (Kings, side, 10)}

Cross-product of the 2 bags would give natural join
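
The relationship between cogroup and join can be checked with a short Python sketch. The helper `cogroup` and its `key` parameter are names assumed for this example; the data is the results/revenue sample from the slide.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, key=lambda row: row[0]):
    """Return {key: (bag of left rows, bag of right rows)}."""
    out = defaultdict(lambda: ([], []))
    for row in left:
        out[key(row)][0].append(row)
    for row in right:
        out[key(row)][1].append(row)
    return dict(out)

results = [("Lakers", "nba.com", 1), ("Lakers", "espn.com", 2),
           ("Kings", "nhl.com", 1), ("Kings", "nba.com", 2)]
revenue = [("Lakers", "top", 50), ("Lakers", "side", 20),
           ("Kings", "top", 30), ("Kings", "side", 10)]

grouped = cogroup(results, revenue)

# Cross-product of the two bags within each group yields the join.
joined = [lrow + rrow[1:]
          for left_bag, right_bag in grouped.values()
          for lrow, rrow in product(left_bag, right_bag)]
print(len(joined))  # 8
print(joined[0])    # ('Lakers', 'nba.com', 1, 'top', 50)
```

Keeping the two bags separate until the last step is what lets a program do something other than a join with them, for example applying a UDF to each pair of bags.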


Implementation

[Figure: layered stack: user; SQL or Pig (automatic rewrite + optimize); Hadoop Map-Reduce; cluster]

• Pig is open-source: http://hadoop.apache.org/pig
• ~50% of Hadoop jobs at Yahoo! are Pig
• 1000s of jobs per day
Building a Logical Plan

• The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags it refers to are valid
• It builds a logical plan for every bag that the user defines
• Processing is triggered only when the user invokes a STORE command on a bag (at that point, the logical plan for that bag is compiled into a physical plan and executed)
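
The deferred-execution behaviour described above can be imitated with a very small Python sketch. Everything here (`Bag`, `load`, `filter_by`, `store`, the `executed` list) is hypothetical scaffolding for illustration; it records plans instead of running them and says nothing about Pig's real internals.

```python
class Bag:
    """A handle to a logical plan; holds no data until stored."""
    def __init__(self, op, source=None):
        self.op = op          # description of this step
        self.source = source  # upstream Bag, or None for a load

    def plan(self):
        """Return the chain of steps from the load down to this bag."""
        steps = [] if self.source is None else self.source.plan()
        return steps + [self.op]

executed = []  # records what would have been run, for illustration only

def load(path):
    return Bag(f"LOAD {path}")

def filter_by(bag, condition):
    return Bag(f"FILTER BY {condition}", bag)

def store(bag, path):
    # Only here is the plan "compiled" and "run" (simulated by recording it).
    executed.append(bag.plan() + [f"STORE INTO {path}"])

urls = load("urls")
good_urls = filter_by(urls, "pagerank > 0.2")
assert executed == []  # defining bags triggers no processing
store(good_urls, "good_urls")
print(executed[0])
# ['LOAD urls', 'FILTER BY pagerank > 0.2', 'STORE INTO good_urls']
```

Deferring work until the store gives the system the whole plan to look at before anything runs.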
Compilation into Map-Reduce
• Every group or join operation forms a map-reduce boundary
• Other operations are pipelined into map and reduce phases

Load Visits                            (Map1)
Group by url                           (Map1 / Reduce1 boundary)
Foreach url generate count             (Reduce1)
Load Url Info                          (Map2)
Join on url                            (Map2 / Reduce2 boundary)
Group by category                      (Map3 / Reduce3 boundary)
Foreach category generate top10(urls)  (Reduce3)
Optimizations: Skew Join
• Default join method is symmetric hash join
– The cross product for each key is carried out on 1 reducer

group   results                                        revenue
Lakers  {(Lakers, nba.com, 1), (Lakers, espn.com, 2)}  {(Lakers, top, 50), (Lakers, side, 20)}
Kings   {(Kings, nhl.com, 1), (Kings, nba.com, 2)}     {(Kings, top, 30), (Kings, side, 10)}

• Problem if too many values with same key
• Further splits them among reducers
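
One generic way to split a heavy key among several reducers is key salting: rows of one input are scattered over several partitions of the key, and the matching rows of the other input are copied to each of them. The sketch below illustrates that general idea only; it is not taken from the slides and should not be read as Pig's actual skew-join algorithm. `salted_join`, `hot_keys` and `num_splits` are hypothetical names.

```python
import random
from collections import defaultdict

def salted_join(left, right, hot_keys, num_splits=2, seed=0):
    """Join on field 0; rows for keys in hot_keys are spread over partitions."""
    rng = random.Random(seed)
    # (key, salt) -> (left rows, right rows); one entry models one reducer's input.
    partitions = defaultdict(lambda: ([], []))
    for row in left:
        # A hot key's left rows are scattered across num_splits partitions.
        salt = rng.randrange(num_splits) if row[0] in hot_keys else 0
        partitions[(row[0], salt)][0].append(row)
    for row in right:
        # The matching right rows are replicated to every such partition.
        salts = range(num_splits) if row[0] in hot_keys else [0]
        for salt in salts:
            partitions[(row[0], salt)][1].append(row)
    # Each partition computes its own, smaller cross product.
    return [lrow + rrow[1:]
            for left_rows, right_rows in partitions.values()
            for lrow in left_rows for rrow in right_rows]

results = [("Lakers", "nba.com", 1), ("Lakers", "espn.com", 2),
           ("Kings", "nhl.com", 1), ("Kings", "nba.com", 2)]
revenue = [("Lakers", "top", 50), ("Lakers", "side", 20),
           ("Kings", "top", 30), ("Kings", "side", 10)]

joined = salted_join(results, revenue, hot_keys={"Lakers"})
print(len(joined))  # 8
```

Every left row lands in exactly one partition that holds all matching right rows, so the join result is unchanged; the cost is the extra copies of the replicated side.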
Optimizations: Multiple Data Flows

Map1: Load Users → Filter bots
Reduce1: two flows
• Group by state → Apply udfs → Store into 'bystate'
• Group by demographic → Apply udfs → Store into 'bydemo'
Optimizations: Multiple Data Flows

Map1: Load Users → Filter bots → Split → Group by state / Group by demographic
Reduce1: Demultiplex, then for each flow
• Apply udfs → Store into 'bystate'
• Apply udfs → Store into 'bydemo'
Other Optimizations

• Carry data as byte arrays as far as possible
• Using binary comparator for sorting
• “Streaming” data through external executables
Performance
Example Dataflow Program
Find users that tend to visit high-pagerank pages

1. LOAD (user, url)
2. FOREACH user, canonicalize(url)
3. LOAD (url, pagerank)
4. JOIN on url
5. GROUP on user
6. FOREACH user, AVG(pagerank)
7. FILTER avgPR > 0.5
Iterative Process
The same dataflow produces no output. Possible causes:
• JOIN on url: joining on right attribute?
• FOREACH user, canonicalize(url): bug in UDF canonicalize?
• FILTER avgPR > 0.5: everything being filtered out?
How to do test runs?

• Run with real data
– Too inefficient (TBs of data)
• Create smaller data sets (e.g., by sampling)
– Empty results due to joins [Chaudhuri et al. 99], and selective filters
• Biased sampling for joins
– Indexes not always present
Examples to Illustrate Program

LOAD (user, url)
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url)
  (Amy, www.cnn.com)
  (Amy, www.frogs.com)
  (Fred, www.snails.com)
LOAD (url, pagerank)
  (www.cnn.com, 0.9)
  (www.frogs.com, 0.3)
  (www.snails.com, 0.4)
JOIN on url
  (Amy, www.cnn.com, 0.9)
  (Amy, www.frogs.com, 0.3)
  (Fred, www.snails.com, 0.4)
GROUP on user
  (Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})
  (Fred, {(Fred, www.snails.com, 0.4)})
FOREACH user, AVG(pagerank)
  (Amy, 0.6)
  (Fred, 0.4)
FILTER avgPR > 0.5
  (Amy, 0.6)
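
The example tuples can be reproduced with a short Python walk-through of the same operators. The body of `canonicalize` below is a guess at what such a UDF might do (strip the scheme and path, add a `www.` prefix); the slides do not define it, so treat it as a hypothetical stand-in.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical stand-in for the UDF: strip scheme and path, add www."""
    host = url.split("://")[-1].split("/")[0]
    return host if host.startswith("www.") else "www." + host

visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

# FOREACH user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]
# JOIN on url
rank_of = dict(pageranks)
joined = [(user, url, rank_of[url]) for user, url in canonical if url in rank_of]
# GROUP on user
groups = defaultdict(list)
for user, _url, rank in joined:
    groups[user].append(rank)
# FOREACH user, AVG(pagerank)
averages = {user: sum(ranks) / len(ranks) for user, ranks in groups.items()}
# FILTER avgPR > 0.5
heavy = {user: avg for user, avg in averages.items() if avg > 0.5}
print(sorted(heavy))  # ['Amy']
```

Printing `canonical`, `joined` and `averages` after each step gives the kind of per-operator example tuples shown on the slide, which is what makes a wrong join attribute or a buggy UDF visible.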
Value Addition From Examples

• Examples can be used for
– Debugging
– Understanding a program written by someone else
– Learning a new operator, or language
Good Examples: Consistency
0. Consistency: output example = operator applied on input example
[Figure: the example dataflow from “Examples to Illustrate Program”, annotated with example tuples]
Good Examples: Realism
1. Realism
[Figure: the example dataflow from “Examples to Illustrate Program”, annotated with example tuples]
Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator, e.g., FILTER
[Figure: the example dataflow; FOREACH user, AVG(pagerank) yields (Amy, 0.6) and (Fred, 0.4), and FILTER avgPR > 0.5 keeps only (Amy, 0.6)]
Good Examples: Conciseness
3. Conciseness
[Figure: the example dataflow from “Examples to Illustrate Program”, annotated with example tuples]
Implementation Status

• Available as ILLUSTRATE command in open-source release of Pig
• Available as Eclipse Plugin (PigPen)
• See SIGMOD09 paper for algorithm and experiments


Related Work
• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
• Hive
– SQL-like language on top of Map-Reduce
• DryadLINQ
– SQL-like language on top of Dryad
• Nested data models
– Object-oriented databases
Future / In-Progress Tasks

• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
– Use loops, conditionals, functions of host language
• Memory Management
• Project Suggestions at:
http://wiki.apache.org/pig/ProposedProjects
Credits
Summary

• Big demand for parallel data processing
– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files
• Hence the excitement about Map-Reduce
• But, Map-Reduce is too low-level and rigid

Pig Latin: sweet spot between map-reduce and SQL
