
Pig: Building High-Level Dataflows over Map-Reduce

Utkarsh Srivastava

Research &
Cloud Computing
Data Processing Renaissance

Internet companies swimming in data


• E.g. TBs/day at Yahoo!

Data analysis is “inner loop” of product innovation


Data analysts are skilled programmers
Data Warehousing …?

Scale: Often not scalable enough

$$$$: Prohibitively expensive at web scale
• Up to $200K/TB

SQL:
• Little control over execution method
• Query optimization is hard
  – Parallel environment
  – Little or no statistics
  – Lots of UDFs
New Systems For Data Analysis

Map-Reduce

Apache Hadoop ...

Dryad
Map-Reduce

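[Figure: input records are fed to map tasks, which emit key/value pairs (k1 v1, k2 v2, k1 v3, k2 v4, k1 v5); the pairs are regrouped by key (k1: v1, v3, v5; k2: v2, v4) and passed to reduce tasks, which write the output records]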

Just a group-by-aggregate?
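
The question above can be made concrete with a small, single-process sketch. It is only an illustration of the programming model, not how Hadoop is implemented; the function name `map_reduce` and the sample records are assumptions introduced here.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map phase: emit (key, value) pairs
            groups[key].append(value)       # shuffle: collect the values of each key
    # reduce phase: one call per distinct key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: (user, url) visit records.
visits = [("Amy", "cnn.com"), ("Amy", "bbc.com"), ("Fred", "cnn.com")]

visit_counts = map_reduce(
    visits,
    map_fn=lambda rec: [(rec[1], 1)],        # key = url, value = 1
    reduce_fn=lambda url, ones: sum(ones),   # aggregate = count
)
print(visit_counts)
```

Seen this way, the map function chooses the grouping key and the reduce function is the aggregate, which is why a single job resembles a group-by-aggregate.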
The Map-Reduce Appeal

Scale: Scalable due to simpler design
• Only parallelizable operations
• No transactions

$: Runs on cheap commodity hardware

SQL: Procedural control, a processing “pipe”


Disadvantages

1. Extremely rigid data flow: one map (M) followed by one reduce (R)
• Other flows constantly hacked in: join, union, split, chains (e.g. M → M → R → M)
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
Pros And Cons

Need a high-level, general data flow language


Enter Pig Latin

Need a high-level, general data flow language


Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin

• Compilation into Map-Reduce

• Example Generation

• Future Work
Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits
User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9


Data Flow
Branch 1: Load Visits → Group by url → Foreach url generate count
Branch 2: Load Url Info
Both branches: Join on url → Group by category → Foreach category generate top10 urls
In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;

topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
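
For readers who want to trace the program on the sample tables, here is a rough in-memory Python equivalent of each step. It is a sketch for checking the logic on toy data, not a description of how Pig executes the script; the variable names are assumptions chosen to mirror the Pig Latin aliases.

```python
from collections import Counter, defaultdict

# Sample rows from the Visits and Url Info tables of the example task.
visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]

# group visits by url; foreach url generate count
visit_counts = Counter(url for _user, url, _time in visits)

# join on url (inner join: urls without visits, such as espn.com, drop out)
category_of = {url: category for url, category, _rank in url_info}
joined = [(url, n, category_of[url])
          for url, n in visit_counts.items() if url in category_of]

# group by category
by_category = defaultdict(list)
for url, n, category in joined:
    by_category[category].append((url, n))

# foreach category generate the 10 most visited urls
top_urls = {category: sorted(rows, key=lambda r: r[1], reverse=True)[:10]
            for category, rows in by_category.items()}
print(top_urls)
```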


Step-by-step Procedural Control
Target users are entrenched procedural programmers
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
Jasmine Novak
Engineer, Yahoo!

“With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.”
David Ciemiewicz
Search Excellence, Yahoo!

• Automatic query optimization is hard


• Pig Latin does not preclude optimization
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

• Operates directly over files

Quick Start and Interoperability
• Schemas optional; can be assigned dynamically
User-Code as a First-Class Citizen
• User-defined functions (UDFs) can be used in every construct
  – Load, Store
  – Group, Filter, Foreach



Nested Data Model
• Pig Latin has a fully-nestable data model with:
  – Atomic values, tuples, bags (lists), and maps
  – Example: (yahoo, {finance, email, news})
• More natural to programmers than flat tuples
• Avoids expensive joins
Nested Data Model
Decouples grouping as an independent operation

Visits
User   Url       Time
Amy    cnn.com   8:00
Amy    bbc.com   10:00
Amy    bbc.com   10:05
Fred   cnn.com   12:00

group by url

group     Visits
cnn.com   {(Amy, cnn.com, 8:00), (Fred, cnn.com, 12:00)}
bbc.com   {(Amy, bbc.com, 10:00), (Amy, bbc.com, 10:05)}

• Common case: aggregation on these nested sets
• Power users: sophisticated UDFs, e.g., sequence analysis
• Efficient Implementation (see paper)

“I frankly like pig much better than SQL in some respects (group + optional flatten works better for me, I love nested data structures).”
Ted Dunning
Chief Scientist, Veoh
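
The idea that grouping is its own step, separate from aggregation, can be sketched in a few lines of Python. The helper `group_by` is an assumption made for illustration; it stands in for Pig's group operator only in the sense that it returns each key together with the nested bag of matching rows.

```python
from collections import defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "bbc.com", "10:05"), ("Fred", "cnn.com", "12:00")]

def group_by(rows, key_index):
    """Return (key, bag) pairs; each bag holds the complete original rows."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return list(bags.items())

# Grouping alone: the result is nested and nothing has been aggregated yet.
grouped = group_by(visits, key_index=1)

# Common case: an aggregate over each nested bag.
counts = [(url, len(bag)) for url, bag in grouped]

# Optional flatten: un-nest the bags back into flat rows.
flattened = [row for _url, bag in grouped for row in bag]

print(grouped)
print(counts)
```

Because the bags keep the full rows, a later step is free to apply any function to them, not just a built-in aggregate.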
CoGroup

results
query    url        rank
Lakers   nba.com    1
Lakers   espn.com   2
Kings    nhl.com    1
Kings    nba.com    2

revenue
query    adSlot   amount
Lakers   top      50
Lakers   side     20
Kings    top      30
Kings    side     10

cogroup output
group    results                                         revenue
Lakers   {(Lakers, nba.com, 1), (Lakers, espn.com, 2)}   {(Lakers, top, 50), (Lakers, side, 20)}
Kings    {(Kings, nhl.com, 1), (Kings, nba.com, 2)}      {(Kings, top, 30), (Kings, side, 10)}

Cross-product of the 2 bags would give natural join
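
A minimal Python sketch of that last remark, using the results and revenue tables: `cogroup` here is a hypothetical helper that keeps one bag per input for every key, and the join is then recovered as the per-key cross product of the two bags.

```python
from collections import defaultdict
from itertools import product

results = [("Lakers", "nba.com", 1), ("Lakers", "espn.com", 2),
           ("Kings", "nhl.com", 1), ("Kings", "nba.com", 2)]
revenue = [("Lakers", "top", 50), ("Lakers", "side", 20),
           ("Kings", "top", 30), ("Kings", "side", 10)]

def cogroup(left, right, key=lambda row: row[0]):
    """Group both inputs by key, keeping a separate bag for each input."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key(row)][0].append(row)
    for row in right:
        groups[key(row)][1].append(row)
    return dict(groups)

cogrouped = cogroup(results, revenue)

# Natural join = cross product of the two bags of each key
# (the repeated key column of the right row is dropped).
joined = [l + r[1:]
          for left_bag, right_bag in cogrouped.values()
          for l, r in product(left_bag, right_bag)]

for row in joined:
    print(row)
```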


Implementation

[Figure: system stack with the labels user, SQL, automatic rewrite + optimize, Pig, Hadoop Map-Reduce, cluster]

Pig is open-source.
http://hadoop.apache.org/pig

• ~50% of Hadoop jobs at Yahoo! are Pig
• 1000s of jobs per day
Compilation into Map-Reduce
• Every group or join operation forms a map-reduce boundary
• Other operations pipelined into map and reduce phases

Map1: Load Visits
Reduce1: Group by url; Foreach url generate count
Map2: Load Url Info
Reduce2: Join on url
Map3 / Reduce3: Group by category; Foreach category generate top10(urls)
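
The boundary rule can be sketched as a small planner over a simplified, linear list of operators. This is an assumption-laden toy, not Pig's compiler: the function `compile_to_jobs`, the operator kinds, and the choice to place the group or join itself in the reduce phase are all made up for illustration.

```python
BOUNDARY_OPS = {"group", "join"}   # operators that need a shuffle

def compile_to_jobs(plan):
    """Split a linear plan into map-reduce jobs at every group/join."""
    jobs = [{"map": [], "reduce": []}]
    phase = "map"
    for kind, description in plan:
        if kind in BOUNDARY_OPS:
            if phase == "reduce":              # already reducing: open a new job
                jobs.append({"map": [], "reduce": []})
            phase = "reduce"
        elif kind == "load" and phase == "reduce":
            jobs.append({"map": [], "reduce": []})   # loads always run map-side
            phase = "map"
        jobs[-1][phase].append(description)    # everything else is pipelined
    return jobs

plan = [
    ("load", "Load Visits"),
    ("group", "Group by url"),
    ("foreach", "Foreach url generate count"),
    ("load", "Load Url Info"),
    ("join", "Join on url"),
    ("group", "Group by category"),
    ("foreach", "Foreach category generate top10(urls)"),
]

jobs = compile_to_jobs(plan)
for number, job in enumerate(jobs, start=1):
    print(f"Map{number}: {job['map']}  Reduce{number}: {job['reduce']}")
```

On this plan the sketch yields three jobs, with an empty third map phase that only passes the previous output through to the shuffle.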
Optimizations: Using the Combiner

[Figure: the Map-Reduce data flow, with map tasks emitting key/value pairs that are regrouped by key before the reduce tasks]

Can pre-process data on the map-side to reduce data shipped


• Algebraic Aggregation Functions
• Distinct processing
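
As an illustration of an algebraic aggregate, the sketch below computes an average by shipping partial (sum, count) pairs instead of raw values. The function names and the two simulated map tasks are assumptions; a real combiner runs inside the Hadoop framework rather than as plain function calls.

```python
from collections import defaultdict

def map_side_combine(pairs):
    """Collapse one map task's (key, value) pairs into (key, (sum, count))."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in pairs:
        partial[key][0] += value
        partial[key][1] += 1
    return [(key, (total, count)) for key, (total, count) in partial.items()]

def reduce_average(partials):
    """Merge the partial (sum, count) pairs of each key and finish the average."""
    totals = defaultdict(lambda: [0.0, 0])
    for key, (total, count) in partials:
        totals[key][0] += total
        totals[key][1] += count
    return {key: total / count for key, (total, count) in totals.items()}

# Two simulated map tasks, using the keys of the slide's diagram.
map_task_1 = [("k1", 1.0), ("k2", 2.0), ("k1", 3.0)]
map_task_2 = [("k2", 4.0), ("k1", 5.0)]

shipped = map_side_combine(map_task_1) + map_side_combine(map_task_2)
averages = reduce_average(shipped)
print(len(shipped), averages)
```

The average itself cannot be combined directly, but its (sum, count) decomposition can, which is what makes the function algebraic.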
Optimizations: Skew Join
• Default join method is symmetric hash join
  – Cross product for each key is carried out on 1 reducer

[Figure: the cogrouped results and revenue bags for the keys Lakers and Kings, as in the CoGroup example]

• Problem if too many values with same key
• Skew join samples data to find frequent values
• Further splits them among reducers
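
One common way to realize these three bullets is sketched below; it is not taken from the Pig source, and the names `find_hot_keys`, `partition`, and `skew_join`, the sampling threshold, and the round-robin placement are all assumptions. Rows of a frequent key on the skewed side are spread over all reducers, and the matching rows of the other side are replicated to each of them.

```python
import random
from collections import Counter

def find_hot_keys(rows, sample_size, threshold, seed=0):
    """Sample the input and return keys that make up >= threshold of the sample."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    counts = Counter(row[0] for row in sample)
    return {key for key, n in counts.items() if n / len(sample) >= threshold}

def partition(left, right, hot_keys, num_reducers):
    """Spread hot left rows round-robin; replicate matching right rows everywhere."""
    reducers = [([], []) for _ in range(num_reducers)]
    next_slot = 0
    for row in left:
        if row[0] in hot_keys:
            target, next_slot = next_slot % num_reducers, next_slot + 1
        else:
            target = hash(row[0]) % num_reducers
        reducers[target][0].append(row)
    for row in right:
        hot = row[0] in hot_keys
        targets = range(num_reducers) if hot else [hash(row[0]) % num_reducers]
        for target in targets:
            reducers[target][1].append(row)
    return reducers

def skew_join(left, right, hot_keys, num_reducers):
    """Join each reducer's share independently and concatenate the outputs."""
    return [l + r[1:]
            for left_rows, right_rows in partition(left, right, hot_keys, num_reducers)
            for l in left_rows for r in right_rows if l[0] == r[0]]

# Hypothetical skewed input: "Lakers" dominates the left side.
left = [("Lakers", i) for i in range(8)] + [("Kings", 100), ("Kings", 101)]
right = [("Lakers", "top"), ("Lakers", "side"), ("Kings", "top")]

hot_keys = find_hot_keys(left, sample_size=10, threshold=0.5)
joined = skew_join(left, right, hot_keys, num_reducers=4)
print(hot_keys, len(joined))
```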
Optimizations: Fragment-Replicate Join

• Symmetric-hash join repartitions both inputs

• If size(data set 1) >> size(data set 2)


– Just replicate data set 2 to all partitions of data set 1

• Translates to map-only job


– Open data set 2 as “side file”
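
A rough Python sketch of the idea: the small data set is loaded into an in-memory lookup table, and every partition of the large data set is joined against it without any shuffle. The function name and the sample data are assumptions, and the lookup table stands in for the "side file".

```python
def fragment_replicate_join(large_partitions, small):
    """Join each partition of the large input against a replicated small input."""
    # Build the lookup table once from the small data set (the "side file").
    lookup = {}
    for row in small:
        lookup.setdefault(row[0], []).append(row)
    # Each partition is processed on its own, as a map task would do.
    return [
        [big + match[1:] for big in part for match in lookup.get(big[0], [])]
        for part in large_partitions
    ]

# Hypothetical data: a large (url, user) input split in two, and a small url table.
page_views = [
    [("cnn.com", "Amy"), ("bbc.com", "Amy")],
    [("cnn.com", "Fred"), ("unknown.org", "Eve")],
]
url_info = [("cnn.com", "News"), ("bbc.com", "News"), ("espn.com", "Sports")]

joined_partitions = fragment_replicate_join(page_views, url_info)
print(joined_partitions)
```

The trade-off is memory: the approach only works when the replicated data set fits comfortably on every map task.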
Optimizations: Merge Join

• Exploits data sets that are already sorted

• Again, a map-only job


– Open other data set as “side file”
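
The core of a merge join is a single synchronized pass over two inputs sorted on the join key. The sketch below shows that pass on in-memory lists; it is a textbook version written for illustration, and the function name and sample rows are assumptions.

```python
def merge_join(left, right):
    """Join two lists sorted by their first field, allowing duplicate keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1                      # left is behind: advance it
        elif left_key > right_key:
            j += 1                      # right is behind: advance it
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row of this key with the whole right run.
            while i < len(left) and left[i][0] == left_key:
                out.extend(left[i] + r[1:] for r in right[j:j_end])
                i += 1
            j = j_end
    return out

left = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
right = [("b", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
merged = merge_join(left, right)
print(merged)
```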
Optimizations: Multiple Data Flows
Map1: Load Users → Filter bots
Flow 1: Group by state → Apply udfs → Store into 'bystate'
Flow 2: Group by demographic → Apply udfs → Store into 'bydemo'
Reduce1: the grouping, udfs, and stores of both flows


Optimizations: Multiple Data Flows
Map1: Load Users → Filter bots → Split
Flow 1: Group by state → Apply udfs → Store into 'bystate'
Flow 2: Group by demographic → Apply udfs → Store into 'bydemo'
Reduce1: Demultiplex, then the udfs and stores of both flows
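
The split and demultiplex steps can be sketched as follows: one scan of the input feeds both group-bys by tagging each emitted key with the flow it belongs to, and the reduce side routes each group back to its flow. The record layout, the flow tags, and the use of `len` as the applied udf are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical user records: (name, state, demographic, is_bot).
users = [
    ("Amy", "CA", "18-25", False),
    ("Fred", "NY", "26-35", False),
    ("crawler7", "CA", "unknown", True),
    ("Eve", "CA", "26-35", False),
]

def map_phase(rows):
    """One scan: filter bots, then split each row into both flows with a tag."""
    for name, state, demographic, is_bot in rows:
        if is_bot:
            continue
        yield ("bystate", state), name
        yield ("bydemo", demographic), name

def reduce_phase(tagged_pairs, udfs):
    """Group by (flow, key), then demultiplex each group to its flow's udf."""
    groups = defaultdict(list)
    for key, value in tagged_pairs:
        groups[key].append(value)
    outputs = defaultdict(dict)
    for (flow, group_key), values in groups.items():
        outputs[flow][group_key] = udfs[flow](values)
    return dict(outputs)

outputs = reduce_phase(map_phase(users), udfs={"bystate": len, "bydemo": len})
print(outputs)
```

The point of the optimization is that the load and the bot filter run once, rather than once per stored output.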


Other Optimizations

• Carry data as byte arrays as far as possible

• Using binary comparator for sorting

• “Streaming” data through external executables


Performance
Example Dataflow Program
Find users that tend to visit high-pagerank pages

LOAD (user, url)
FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5
Iterative Process
LOAD (user, url)
FOREACH user, canonicalize(url)   (Bug in UDF canonicalize?)
LOAD (url, pagerank)
JOIN on url   (Joining on right attribute?)
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5   (Everything being filtered out?)

No Output ☹
How to do test runs?

• Run with real data


– Too inefficient (TBs of data)

• Create smaller data sets (e.g., by sampling)


– Empty results due to joins [Chaudhuri et al. 99] and selective filters

• Biased sampling for joins


– Indexes not always present
Examples to Illustrate Program

LOAD (user, url)
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)

FOREACH user, canonicalize(url)
  (Amy, www.cnn.com)
  (Amy, www.frogs.com)
  (Fred, www.snails.com)

LOAD (url, pagerank)
  (www.cnn.com, 0.9)
  (www.frogs.com, 0.3)
  (www.snails.com, 0.4)

JOIN on url
  (Amy, www.cnn.com, 0.9)
  (Amy, www.frogs.com, 0.3)
  (Fred, www.snails.com, 0.4)

GROUP on user
  (Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})
  (Fred, {(Fred, www.snails.com, 0.4)})

FOREACH user, AVG(pagerank)
  (Amy, 0.6)
  (Fred, 0.4)

FILTER avgPR > 0.5
  (Amy, 0.6)
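
The same walk-through can be reproduced in Python, printing the intermediate tuples after every operator. This only mimics what an example-driven view of the program looks like; it is not the example-generation algorithm, and the body of `canonicalize` is a hypothetical stand-in for the UDF named in the slides.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical UDF: drop scheme and path, ensure the host starts with 'www.'."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host

visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pages = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

# FOREACH user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
pagerank_of = dict(pages)
joined = [(user, url, pagerank_of[url])
          for user, url in canonical if url in pagerank_of]

# GROUP on user
grouped = defaultdict(list)
for user, url, pagerank in joined:
    grouped[user].append((user, url, pagerank))

# FOREACH user, AVG(pagerank)
averages = [(user, sum(row[2] for row in bag) / len(bag))
            for user, bag in grouped.items()]

# FILTER avgPR > 0.5
result = [(user, avg) for user, avg in averages if avg > 0.5]

for step, rows in [("canonicalize", canonical), ("join", joined),
                   ("group", list(grouped.items())), ("avg", averages),
                   ("filter", result)]:
    print(step, rows)
```

Printing every intermediate relation is what makes an empty final result traceable to the operator that caused it.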
Value Addition From Examples

• Examples can be used for


– Debugging
– Understanding a program written by someone else
– Learning a new operator, or language
Good Examples: Consistency
0. Consistency: output example = operator applied on input example

LOAD (user, url)
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)

FOREACH user, canonicalize(url)
  (Amy, www.cnn.com)
  (Amy, www.frogs.com)
  (Fred, www.snails.com)
Good Examples: Realism
1. Realism

LOAD (user, url)
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)

FOREACH user, canonicalize(url)
  (Amy, www.cnn.com)
  (Amy, www.frogs.com)
  (Fred, www.snails.com)
Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator, e.g., FILTER

FOREACH user, AVG(pagerank)
  (Amy, 0.6)
  (Fred, 0.4)

FILTER avgPR > 0.5
  (Amy, 0.6)
Good Examples: Conciseness
3. Conciseness

LOAD (user, url)
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)

FOREACH user, canonicalize(url)
  (Amy, www.cnn.com)
  (Amy, www.frogs.com)
  (Fred, www.snails.com)
Implementation Status

• Available as ILLUSTRATE command in open-source release of Pig

• Available as Eclipse Plugin (PigPen)

• See SIGMOD09 paper for algorithm and experiments


Related Work
• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
• Hive
– SQL-like language on top of Map-Reduce
• DryadLINQ
– SQL-like language on top of Dryad
• Nested data models
– Object-oriented databases
Future / In-Progress Tasks

• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
–Use loops, conditionals, functions of host language
• Memory Management
• Project Suggestions at:
http://wiki.apache.org/pig/ProposedProjects
Credits
Summary

• Big demand for parallel data processing


– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files

• Hence the excitement about Map-Reduce

• But, Map-Reduce is too low-level and rigid

Pig Latin
Sweet spot between map-reduce and SQL
