Pig: Building High-Level Dataflows Over Map-Reduce: Utkarsh Srivastava

Research & Cloud Computing
Data Processing Renaissance
Internet companies swimming in data

• E.g. TBs/day at Yahoo!
Data analysis is “inner loop” of product innovation

Data analysts are skilled programmers
Data Warehousing …?
Scale: often not scalable enough
Cost: prohibitively expensive at web scale
• Up to $200K/TB
SQL:
• Little control over execution method
• Query optimization is hard
– Parallel environment
– Little or no statistics
– Lots of UDFs
New Systems For Data Analysis
Map-Reduce
Apache Hadoop ...
Dryad
Map-Reduce
[Figure: input records → map → key-value pairs (k1 v1, k2 v2, k1 v3, k2 v4, k1 v5) → grouped by key (k1: v1, v3, v5; k2: v2, v4) → reduce → output records]
Just a group-by-aggregate?
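The "group-by-aggregate" reading of the diagram can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop code; the names map_reduce, map_fn, and reduce_fn and the sample visits are assumptions made for this sketch.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)          # the "shuffle": group values by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical job: count visits per URL.
visits = [("Amy", "cnn.com"), ("Amy", "bbc.com"), ("Fred", "cnn.com")]
counts = map_reduce(
    visits,
    map_fn=lambda rec: [(rec[1], 1)],          # emit (url, 1)
    reduce_fn=lambda key, values: sum(values), # aggregate per url
)
```

In this form the map function picks the grouping key and the reduce function is the aggregate, which is why the model resembles a single group-by-aggregate step.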
The Map-Reduce Appeal
Scale: scalable due to simpler design
• Only parallelizable operations
• No transactions
Cost: runs on cheap commodity hardware
SQL: procedural control, a processing “pipe”

Disadvantages
1. Extremely rigid data flow (M → R)
• Other flows constantly hacked in: join, union, split, chains
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
Pros And Cons
Need a high-level, general data flow language

Enter Pig Latin
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
Example Data Analysis Task
Find the top 10 most visited pages in each category
Visits
User  Url         Time
Amy   cnn.com     8:00
Amy   bbc.com     10:00
Amy   flickr.com  10:05
Fred  cnn.com     12:00

Url Info
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9

Data Flow
Load Visits → Group by url → Foreach url generate count
Load Url Info
Join on url (visit counts with Url Info) → Group by category → Foreach category generate top10 urls
In Pig Latin
-- load the visit log and count visits per url
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

-- attach the category of each url
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

-- keep the 10 most visited urls in each category
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
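For readers who do not know Pig Latin, the same data flow can be traced in plain Python over the sample Visits and Url Info rows from the earlier slide. This is only an in-memory sketch of what each step computes; the function name top_urls_by_category is an assumption of this example, and the join is assumed to be an inner join.

```python
from collections import Counter, defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]

def top_urls_by_category(visits, url_info, n=10):
    # group visits by url; foreach url generate count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join on url (inner join: urls without visits are dropped)
    joined = [(url, visit_counts[url], category)
              for url, category, _rank in url_info if url in visit_counts]
    # group by category
    by_category = defaultdict(list)
    for url, count, category in joined:
        by_category[category].append((url, count))
    # foreach category generate the n most visited urls
    return {category: sorted(rows, key=lambda r: r[1], reverse=True)[:n]
            for category, rows in by_category.items()}

top_urls = top_urls_by_category(visits, url_info)
```

Each Python statement corresponds to one named step of the Pig Latin program, which is the step-by-step style the language is designed around.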

Step-by-step Procedural Control
Target users are entrenched procedural programmers
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
Jasmine Novak
Engineer, Yahoo!
“With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.”
David Ciemiewicz
Search Excellence, Yahoo!
• Automatic query optimization is hard

• Pig Latin does not preclude optimization
Quick Start and Interoperability
• Operates directly over files
• Schemas optional; can be assigned dynamically
User-Code as a First-Class Citizen
• User-defined functions (UDFs) can be used in every construct
– Load, Store
– Group, Filter, Foreach

Nested Data Model
• Pig Latin has a fully-nestable data model with:
–Atomic values, tuples, bags (lists), and maps
[Figure: nested value pairing yahoo with the set finance, email, news]
• More natural to programmers than flat tuples

• Avoids expensive joins
Nested Data Model
Decouples grouping as an independent operation
Visits
User  Url      Time
Amy   cnn.com  8:00
Amy   bbc.com  10:00
Amy   bbc.com  10:05
Fred  cnn.com  12:00

group by url

group    Visits
cnn.com  {(Amy, cnn.com, 8:00), (Fred, cnn.com, 12:00)}
bbc.com  {(Amy, bbc.com, 10:00), (Amy, bbc.com, 10:05)}

• Common case: aggregation on these nested sets
• Power users: sophisticated UDFs, e.g., sequence analysis
• Efficient Implementation (see paper)
“I frankly like pig much better than SQL in some respects (group + optional flatten works better for me, I love nested data structures).”
Ted Dunning
Chief Scientist, Veoh
CoGroup
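results
query   url       rank
Lakers  nba.com   1
Lakers  espn.com  2
Kings   nhl.com   1
Kings   nba.com   2

revenue
query   adSlot  amount
Lakers  top     50
Lakers  side    20
Kings   top     30
Kings   side    10

group   results                                          revenue
Lakers  {(Lakers, nba.com, 1), (Lakers, espn.com, 2)}    {(Lakers, top, 50), (Lakers, side, 20)}
Kings   {(Kings, nhl.com, 1), (Kings, nba.com, 2)}       {(Kings, top, 30), (Kings, side, 10)}

Cross-product of the 2 bags would give natural join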
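The relationship between cogroup and join can be checked on the results and revenue tables with a short Python sketch. The helper name cogroup and the tuple layout (key in the first field) are assumptions of this example, not Pig's implementation.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, key=lambda row: row[0]):
    """Return {key: (bag of left rows, bag of right rows)}."""
    grouped = defaultdict(lambda: ([], []))
    for row in left:
        grouped[key(row)][0].append(row)
    for row in right:
        grouped[key(row)][1].append(row)
    return dict(grouped)

results = [("Lakers", "nba.com", 1), ("Lakers", "espn.com", 2),
           ("Kings", "nhl.com", 1), ("Kings", "nba.com", 2)]
revenue = [("Lakers", "top", 50), ("Lakers", "side", 20),
           ("Kings", "top", 30), ("Kings", "side", 10)]

grouped = cogroup(results, revenue)

# Taking the cross-product of the two bags in every group yields the join.
joined = [l + r[1:]
          for bag_l, bag_r in grouped.values()
          for l, r in product(bag_l, bag_r)]
```

Keeping the two bags separate lets a UDF process them together before (or instead of) flattening them into join rows.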

Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
Implementation
[Figure: a SQL query (via automatic rewrite + optimize) or a user's Pig program is run by Pig on a Hadoop Map-Reduce cluster]
• Pig is open-source: http://hadoop.apache.org/pig
• ~50% of Hadoop jobs at Yahoo! are Pig
• 1000s of jobs per day
Compilation into Map-Reduce
• Every group or join operation forms a map-reduce boundary
• Other operations pipelined into map and reduce phases
[Figure: the example data flow split into three map-reduce jobs]
Map1: Load Visits
Map1 / Reduce1 boundary: Group by url
Reduce1: Foreach url generate count
Map2: Load Url Info
Map2 / Reduce2 boundary: Join on url
Map3 / Reduce3 boundary: Group by category
Reduce3: Foreach category generate top10(urls)
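The boundary rule can be illustrated with a toy planner in Python. It is a simplified sketch that assumes a linear list of operator names (so the second input, Load Url Info, is left out) and is not Pig's actual compiler; split_into_jobs and BLOCKING are names invented for this example.

```python
BLOCKING = ("group", "join")  # operators that require a shuffle

def split_into_jobs(operators):
    """Assign each operator of a linear plan to a (job number, phase, operator) slot."""
    plan, job, phase = [], 1, "map"
    for op in operators:
        if op.split()[0] in BLOCKING:
            if phase == "reduce":
                job += 1            # the current job already shuffled: start a new job
            plan.append((job, "shuffle", op))
            phase = "reduce"        # operators that follow run in this job's reduce
        else:
            plan.append((job, phase, op))
    return plan

plan = split_into_jobs([
    "load visits",
    "group by url",
    "foreach url generate count",
    "join on url",
    "group by category",
    "foreach category generate top10",
])
```

On the example data flow this yields three jobs, with the foreach steps pipelined into the reduce phase of the job whose shuffle precedes them.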
Optimizations: Using the Combiner
[Figure: the map-reduce data flow from the earlier Map-Reduce slide; map outputs are grouped by key (k1, k2) and shipped to the reducers]
Can pre-process data on the map-side to reduce data shipped

• Algebraic Aggregation Functions
• Distinct processing
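A minimal Python sketch of an algebraic aggregate, using average as the example: each map task collapses its values into a small (sum, count) partial, and only the partials are shipped to the reducer. The three function names are assumptions of this sketch rather than Pig's UDF interface.

```python
def initial(value):
    """Turn one input value into a partial result."""
    return (value, 1)

def combine(partials):
    """Merge partial results; safe to apply map-side and again reduce-side."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)

def final(partial):
    """Produce the aggregate from the merged partial."""
    total, count = partial
    return total / count

# Values for one key, as seen by two hypothetical map tasks.
map_outputs = [[4.0, 8.0], [6.0]]

# Map side (combiner): one partial per map task instead of one record per value.
combined = [combine([initial(v) for v in task]) for task in map_outputs]

# Reduce side: merge the shipped partials and finish the computation.
average = final(combine(combined))
```

The average itself cannot be merged directly, which is why the partial carries both the sum and the count.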
Optimizations: Skew Join
• Default join method is symmetric hash join
– Cross product of the two bags for a key (e.g., results and revenue for Lakers) carried out on 1 reducer
• Problem if too many values with same key
• Skew join samples data to find frequent values
• Further splits them among reducers
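The idea can be sketched in Python as follows. This is a simplified, single-process illustration under stated assumptions, not Pig's implementation: the sampling threshold, the round-robin spreading of frequent keys, and the name skew_join are all choices made for this example.

```python
import random
from collections import Counter, defaultdict

def skew_join(left, right, num_reducers=4, sample_size=100, threshold=0.2, seed=0):
    """Join on the first field, spreading frequent left keys over all reducers."""
    # Sample the left input to find frequent ("hot") keys.
    rng = random.Random(seed)
    sample = rng.sample(left, min(sample_size, len(left)))
    freq = Counter(row[0] for row in sample)
    hot = {key for key, n in freq.items() if n / len(sample) >= threshold}

    reducers = defaultdict(lambda: ([], []))
    for i, row in enumerate(left):
        # Hot keys are split round-robin; other keys hash to a single reducer.
        target = i % num_reducers if row[0] in hot else hash(row[0]) % num_reducers
        reducers[target][0].append(row)
    for row in right:
        # Right rows for hot keys are replicated to every reducer.
        targets = range(num_reducers) if row[0] in hot else [hash(row[0]) % num_reducers]
        for target in targets:
            reducers[target][1].append(row)

    # Each reducer joins only the rows it received.
    return [l + r[1:]
            for left_rows, right_rows in reducers.values()
            for l in left_rows for r in right_rows if l[0] == r[0]]

left = [("Lakers", i) for i in range(50)] + [("Kings", 0)]
right = [("Lakers", "top"), ("Lakers", "side"), ("Kings", "top")]
joined = skew_join(left, right)
```

Every left row goes to exactly one reducer and meets every matching right row there, so the output equals the ordinary join while the work for a frequent key is shared.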
Optimizations: Fragment-Replicate Join
• Symmetric-hash join repartitions both inputs
• If size(data set 1) >> size(data set 2)

– Just replicate data set 2 to all partitions of data set 1
• Translates to map-only job

– Open data set 2 as “side file”
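A minimal Python sketch of the map-side logic, assuming the small data set fits in memory and that both inputs are tuples keyed on their first field; fragment_replicate_join and the sample rows are names and data invented for this example.

```python
from collections import defaultdict

def fragment_replicate_join(large_partition, small_data):
    """Join one partition of the large input against the fully replicated small input."""
    # Build an in-memory index of the small data set (the "side file").
    index = defaultdict(list)
    for row in small_data:
        index[row[0]].append(row)
    # Stream the partition and probe the index; no shuffle or reduce phase is needed.
    return [big + small[1:]
            for big in large_partition
            for small in index.get(big[0], [])]

url_info = [("cnn.com", "News"), ("bbc.com", "News")]
partitions = [
    [("cnn.com", "Amy"), ("bbc.com", "Amy")],      # handled by one map task
    [("cnn.com", "Fred"), ("espn.com", "Fred")],   # handled by another map task
]
joined = [row for part in partitions
          for row in fragment_replicate_join(part, url_info)]
```

Because every map task holds a full copy of the small input, each partition of the large input can be joined independently.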
Optimizations: Merge Join
• Exploit the fact that the data sets are already sorted
• Again, a map-only job
– Open other data set as “side file”
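The core of a merge join can be sketched in Python as a single pass over two inputs that are already sorted on the join key (assumed here to be the first field of each tuple). This shows the general sort-merge technique, not Pig's code; merge_join is a name chosen for the example.

```python
def merge_join(left, right):
    """Join two lists of tuples that are both sorted by their first field."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows with this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross-product of the two runs, then skip past them.
            out.extend(l + r[1:] for l in left[i:i_end] for r in right[j:j_end])
            i, j = i_end, j_end
    return out

left = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
right = [("b", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
joined = merge_join(left, right)
```

Neither input has to be repartitioned or re-sorted, which is what allows the join to run without a reduce phase.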
Optimizations: Multiple Data Flows
[Figure: one map-reduce job serving two data flows]
Map1: Load Users → Filter bots → Split
Shuffle: Group by state, Group by demographic
Reduce1: Demultiplex → Apply udfs (per flow) → Store into ‘bystate’, Store into ‘bydemo’
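The split and demultiplex steps can be sketched in Python: the input is loaded and filtered once, each record is tagged with the flow it belongs to, and the reduce side routes groups to the right output by that tag. This is a single-process illustration under stated assumptions; run_shared_job, the record fields, and the use of a count as a stand-in for "apply udfs" are all invented for the example.

```python
from collections import defaultdict

def run_shared_job(users, is_bot, flows):
    """flows maps an output name to a function that extracts the grouping key."""
    # Map phase: load and filter once, then split each record into every flow.
    shuffled = defaultdict(list)
    for user in users:
        if is_bot(user):
            continue
        for name, key_fn in flows.items():
            shuffled[(name, key_fn(user))].append(user)   # key is tagged with its flow
    # Reduce phase: demultiplex on the flow tag and write to that flow's output.
    outputs = {name: {} for name in flows}
    for (name, key), group in shuffled.items():
        outputs[name][key] = len(group)                   # stand-in for "apply udfs"
    return outputs

users = [
    {"name": "amy", "state": "CA", "demographic": "18-25", "bot": False},
    {"name": "fred", "state": "NY", "demographic": "18-25", "bot": False},
    {"name": "crawler", "state": "CA", "demographic": "n/a", "bot": True},
]
outputs = run_shared_job(
    users,
    is_bot=lambda u: u["bot"],
    flows={"bystate": lambda u: u["state"], "bydemo": lambda u: u["demographic"]},
)
```

Sharing the load and filter steps means the input is scanned once rather than once per output.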

Other Optimizations
• Carry data as byte arrays as far as possible
• Using binary comparator for sorting
• “Streaming” data through external executables

Performance
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
Example Dataflow Program
Goal: find users that tend to visit high-pagerank pages
[Figure: dataflow program]
LOAD (user, url) → FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url → GROUP on user → FOREACH user, AVG(pagerank) → FILTER avgPR > 0.5
Iterative Process
[Figure: the same dataflow program, annotated with questions]
• JOIN on url: joining on right attribute?
• FOREACH user, canonicalize(url): bug in UDF canonicalize?
• FILTER avgPR > 0.5: everything being filtered out?
No Output ☹
How to do test runs?
• Run with real data
– Too inefficient (TBs of data)
• Create smaller data sets (e.g., by sampling)
– Empty results due to joins [Chaudhuri et al. 99], and selective filters
• Biased sampling for joins
– Indexes not always present
Examples to Illustrate Program
LOAD (user, url):
(Amy, cnn.com)
(Amy, http://www.frogs.com)
(Fred, www.snails.com/index.html)

LOAD (url, pagerank):
(www.cnn.com, 0.9)
(www.frogs.com, 0.3)
(www.snails.com, 0.4)

FOREACH user, canonicalize(url):
(Amy, www.cnn.com)
(Amy, www.frogs.com)
(Fred, www.snails.com)

JOIN on url:
(Amy, www.cnn.com, 0.9)
(Amy, www.frogs.com, 0.3)
(Fred, www.snails.com, 0.4)

GROUP on user:
(Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})
(Fred, {(Fred, www.snails.com, 0.4)})

FOREACH user, AVG(pagerank):
(Amy, 0.6)
(Fred, 0.4)

FILTER avgPR > 0.5:
(Amy, 0.6)
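The example tuples above can be reproduced by running the dataflow in Python and keeping the output of every operator. This is a sketch of the program being illustrated, not of the example-generation algorithm; the body of canonicalize is a guess at what such a UDF might do, and the function name run is an assumption.

```python
from collections import defaultdict

def canonicalize(url):
    # Hypothetical UDF: drop the scheme and path, and ensure a "www." prefix.
    host = url.split("://")[-1].split("/")[0]
    return host if host.startswith("www.") else "www." + host

visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pages = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

def run(visits, pages, threshold=0.5):
    """Run the dataflow and return the output of every operator by name."""
    steps = {}
    steps["foreach"] = [(user, canonicalize(url)) for user, url in visits]
    rank = dict(pages)  # assumes each url appears once in pages
    steps["join"] = [(user, url, rank[url])
                     for user, url in steps["foreach"] if url in rank]
    groups = defaultdict(list)
    for row in steps["join"]:
        groups[row[0]].append(row)
    steps["group"] = dict(groups)
    steps["avg"] = [(user, sum(r[2] for r in rows) / len(rows))
                    for user, rows in groups.items()]
    steps["filter"] = [(user, avg) for user, avg in steps["avg"] if avg > threshold]
    return steps

steps = run(visits, pages)
```

Inspecting steps operator by operator answers the debugging questions from the previous slide: whether the join key matched, whether canonicalize behaved, and what the filter removed.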
Value Addition From Examples
• Examples can be used for

– Debugging
– Understanding a program written by someone else
– Learning a new operator, or language
Good Examples: Consistency
0. Consistency
• output example = operator applied on input example
• E.g., FOREACH user, canonicalize(url) turns (Amy, cnn.com) into (Amy, www.cnn.com)
Good Examples: Realism
1. Realism
Good Examples: Completeness
2. Completeness
• Demonstrate the salient properties of each operator, e.g., FILTER
• E.g., FILTER avgPR > 0.5 applied to (Amy, 0.6), (Fred, 0.4) keeps (Amy, 0.6)
Good Examples: Conciseness
3. Conciseness
Implementation Status
• Available as ILLUSTRATE command in open-source release of Pig
• Available as Eclipse Plugin (PigPen)
• See SIGMOD09 paper for algorithm and experiments

Related Work
• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
• Hive
– SQL-like language on top of Map-Reduce
• DryadLINQ
– SQL-like language on top of Dryad
• Nested data models
– Object-oriented databases
Future / In-Progress Tasks
• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
–Use loops, conditionals, functions of host language
• Memory Management
• Project Suggestions at:
http://wiki.apache.org/pig/ProposedProjects
Credits
Summary
• Big demand for parallel data processing

– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files
• Hence the excitement about Map-Reduce
• But, Map-Reduce is too low-level and rigid
Pig Latin
Sweet spot between map-reduce and SQL
