Map-Reduce
Dryad
Map-Reduce
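[Figure: Map-Reduce dataflow. Input records go to map tasks, which emit (key, value) pairs: (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5). The pairs are grouped by key, so one reduce task receives (k1, v1), (k1, v3), (k1, v5) and another receives (k2, v2), (k2, v4). The reduce tasks write the output records.]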
Just a group-by-aggregate?
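The group-by-aggregate reading of Map-Reduce can be sketched in a few lines of Python. This is an in-memory illustration only, not Hadoop code; the names map_reduce, emit_by_category and average, and the sample records, are assumptions made for the example.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map, group the emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "group by" step
    # The "aggregate" step: one reduce call per key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: (category, pagerank) records.
records = [("sports", 0.9), ("news", 0.4), ("sports", 0.5)]

def emit_by_category(record):
    category, pagerank = record
    yield category, pagerank

def average(key, values):
    return sum(values) / len(values)

averages = map_reduce(records, emit_by_category, average)
```

Here the map function chooses the grouping key and the reduce function is the aggregate, which is why a single job looks like a group-by-aggregate.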
The Map-Reduce Appeal
• Pig Latin
• Example Generation
• Future Work
Pig Latin Example 1
Suppose we have a table
urls: (url, category, pagerank)
A simple SQL query finds, for each sufficiently large category, the average
pagerank of high-pagerank urls in that category.
Equivalent Pig Latin program
Filter good_urls by pagerank > 0.2
Group by category
Filter category by count > 10^6
Foreach category generate avg. pagerank
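The four steps of this program can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not what Pig executes: the function name, the sample rows, and the small min_count used in the call are hypothetical (the slide's threshold is 10^6).

```python
from collections import defaultdict

def avg_pagerank_of_large_categories(urls, min_pagerank=0.2, min_count=10**6):
    # Step 1: keep only urls with pagerank above the threshold.
    good_urls = [row for row in urls if row[2] > min_pagerank]
    # Step 2: group the surviving rows by category.
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)
    # Steps 3 and 4: keep large categories, then average their pageranks.
    return {
        category: sum(ranks) / len(ranks)
        for category, ranks in groups.items()
        if len(ranks) > min_count
    }

# Hypothetical rows of the urls table: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.1),
    ("d.com", "sports", 0.8),
]
result = avg_pagerank_of_large_categories(urls, min_count=1)
```

Each line of the dataflow corresponds to one statement, which is the step-by-step style the Pig Latin program is meant to convey.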
Pig
• Consists of a scripting language, Pig Latin, and the Pig Latin compiler.
– Pig Latin is a high-level scripting language for writing data-analysis code.
– The compiler converts the code into equivalent MapReduce code.
– Writing code in Pig is easier than programming directly in MapReduce.
– Pig has an optimizer that chooses how to execute the program efficiently.
BENEFITS
• Ease of coding: complex tasks involving interrelated data
transformations are encoded explicitly as data flow sequences.
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
Dataflow Language
Jasmine Novak
Engineer, Yahoo!
Quick Start and Interoperability
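-- Load visit records with three named fields.
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
-- After grouping, the key is named "group" and the bag is named "visits".
visitCounts = foreach gVisits generate group as url, COUNT(visits);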
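For readers more familiar with Python, the group-and-count step above corresponds roughly to the following sketch. The sample visit tuples are hypothetical, and the real script reads them from /data/visits rather than from a list.

```python
from collections import Counter

# Hypothetical visit records: (user, url, time).
visits = [
    ("Amy", "cnn.com", 8),
    ("Fred", "cnn.com", 9),
    ("Amy", "frogs.com", 10),
]

# Group by url and count the visits in each group.
visit_counts = Counter(url for _user, url, _time in visits)
```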
CoGroup
results:
query   url       rank
Lakers  nba.com   1
Lakers  espn.com  2
Kings   nhl.com   1
Kings   nba.com   2

revenue:
query   adSlot  amount
Lakers  top     50
Lakers  side    20
Kings   top     30
Kings   side    10
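A co-group of these two tables on query can be sketched in Python as below. This is an illustration of the idea only: the function name cogroup and the choice of a (left rows, right rows) tuple per key are assumptions, not Pig's internal representation.

```python
from collections import defaultdict

# Rows copied from the results and revenue tables on the slide.
results = [
    ("Lakers", "nba.com", 1),
    ("Lakers", "espn.com", 2),
    ("Kings", "nhl.com", 1),
    ("Kings", "nba.com", 2),
]
revenue = [
    ("Lakers", "top", 50),
    ("Lakers", "side", 20),
    ("Kings", "top", 30),
    ("Kings", "side", 10),
]

def cogroup(left, right, key=lambda row: row[0]):
    """For each key, collect the matching rows of both inputs side by side."""
    grouped = defaultdict(lambda: ([], []))
    for row in left:
        grouped[key(row)][0].append(row)
    for row in right:
        grouped[key(row)][1].append(row)
    return dict(grouped)

grouped = cogroup(results, revenue)
```

Unlike a join, the two sets of rows for a key stay separate, so later steps can decide how to combine them.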
Implementation
[Figure: implementation stack. User programs (SQL or Pig) pass through automatic rewriting and optimization and run on Hadoop Map-Reduce.]
Pig is open-source.
http://hadoop.apache.org/pig
[Figure: the tail of the plan mapped onto Map-Reduce jobs. Join on url ends at Reduce2; Group by category spans Map3 and Reduce3, followed by Foreach category generate top10(urls).]
Other operations are pipelined into map and reduce phases.
Optimizations: Skew Join
• Default join method is symmetric hash join.
• The cross product for each join key is carried out on 1 reducer, which becomes a bottleneck when one key has many values.
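The effect of a skewed key on a per-key cross product can be sketched in Python. This is a simplified simulation, not Pig's join implementation: the function partitioned_join, the reducer count, and the "hot"/"cold" input are all hypothetical.

```python
from collections import defaultdict
from itertools import product

def partitioned_join(left, right, num_reducers):
    """Join two lists of (key, value) rows; return (pairs, per-reducer load)."""
    # Each simulated reducer holds, per key, the left rows and the right rows.
    reducers = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    for side, rows in enumerate((left, right)):
        for row in rows:
            key = row[0]
            reducers[hash(key) % num_reducers][key][side].append(row)
    pairs, load = [], [0] * num_reducers
    for i, groups in enumerate(reducers):
        for left_rows, right_rows in groups.values():
            # All rows for one key meet on one reducer: a cross product.
            matched = list(product(left_rows, right_rows))
            pairs.extend(matched)
            load[i] += len(matched)
    return pairs, load

# Hypothetical skewed input: key "hot" appears 100 times on each side.
left = [("hot", i) for i in range(100)] + [("cold", 0)]
right = [("hot", j) for j in range(100)] + [("cold", 1)]
pairs, load = partitioned_join(left, right, num_reducers=4)
```

In this run one reducer produces at least 10,000 of the 10,001 output pairs, which is the imbalance a skew join is meant to address.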
[Figure: a Filter bots step feeds two branches, Group by state and Group by demographic. In the Map-Reduce plan, Filter bots is followed by Split in the map phase, and Reduce1 demultiplexes the two groupings.]
Example Dataflow Program
Find users that tend to visit high-pagerank pages:
LOAD (user, url)    LOAD (url, pagerank)
FOREACH user, canonicalize(url)
JOIN on url
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5
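The whole dataflow can be sketched in Python as a chain of small steps. This is a hedged illustration: the body of canonicalize is an assumption (the deck only names the function), the sample tuples are the ones used in the deck's example slides, and the join assumes each url appears once in the pagerank input.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical canonical form: no scheme, no path, leading 'www.'."""
    host = url.split("://", 1)[-1]  # drop a scheme such as http://
    host = host.split("/", 1)[0]    # drop any path
    if not host.startswith("www."):
        host = "www." + host
    return host

visits = [
    ("Amy", "cnn.com"),
    ("Amy", "http://www.frogs.com"),
    ("Fred", "www.snails.com/index.html"),
]
pages = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

def high_pagerank_users(visits, pages, threshold=0.5):
    # FOREACH user, canonicalize(url)
    canonical = [(user, canonicalize(url)) for user, url in visits]
    # JOIN on url
    pagerank_of = dict(pages)
    joined = [(u, url, pagerank_of[url]) for u, url in canonical if url in pagerank_of]
    # GROUP on user
    by_user = defaultdict(list)
    for user, _url, pagerank in joined:
        by_user[user].append(pagerank)
    # FOREACH user, AVG(pagerank)
    averages = {user: sum(p) / len(p) for user, p in by_user.items()}
    # FILTER avgPR > 0.5
    return {user: avg for user, avg in averages.items() if avg > threshold}

result = high_pagerank_users(visits, pages)
```

With these tuples, Amy averages about 0.6 and Fred 0.4, so only Amy passes the final filter.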
Iterative Process
LOAD (user, url)    LOAD (url, pagerank)
FOREACH user, canonicalize(url)
GROUP on user
No Output
How to do test runs?
LOAD (user, url):
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url)
JOIN on url:
  (Amy, www.cnn.com, 0.9)
  (Amy, www.frogs.com, 0.3)
  (Fred, www.snails.com, 0.4)
GROUP on user:
  (Amy, { (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3) })
0. Consistency: output example = operator applied on input example

LOAD (user, url):
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url):
  (Amy, www.cnn.com)
  (Amy, www.frogs.com)
  (Fred, www.snails.com)
JOIN on url
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5
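The consistency property can be written as a simple check: the output example must equal the operator applied to the input example. The Python sketch below is illustrative only; is_consistent, foreach_canonicalize and the body of canonicalize are hypothetical names and logic, and tuple order is ignored in the comparison.

```python
def canonicalize(url):
    """Hypothetical canonical form: no scheme, no path, leading 'www.'."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host

def foreach_canonicalize(rows):
    # The FOREACH user, canonicalize(url) operator on a list of tuples.
    return [(user, canonicalize(url)) for user, url in rows]

def is_consistent(operator, input_example, output_example):
    """True when the output example is exactly operator(input example)."""
    return sorted(operator(input_example)) == sorted(output_example)

input_example = [
    ("Amy", "cnn.com"),
    ("Amy", "http://www.frogs.com"),
    ("Fred", "www.snails.com/index.html"),
]
output_example = [
    ("Amy", "www.cnn.com"),
    ("Amy", "www.frogs.com"),
    ("Fred", "www.snails.com"),
]
consistent = is_consistent(foreach_canonicalize, input_example, output_example)
```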
Good Examples: Realism
1. Realism

LOAD (user, url):
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)
LOAD (url, pagerank)
FOREACH user, canonicalize(url):
  (Amy, www.cnn.com)
  (Amy, www.frogs.com)
  (Fred, www.snails.com)
JOIN on url
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5
Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator, e.g., FILTER

LOAD (user, url)    LOAD (url, pagerank)
FOREACH user, canonicalize(url)
JOIN on url
GROUP on user
FOREACH user, AVG(pagerank):
  (Amy, 0.6)
  (Fred, 0.4)
FILTER avgPR > 0.5:
  (Amy, 0.6)
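For FILTER, one reasonable reading of completeness is that the example input contains at least one tuple that passes and at least one that is dropped. The Python sketch below checks that reading; the function filter_is_complete and the predicate name are hypothetical, and other operators would need their own criteria.

```python
def filter_is_complete(predicate, input_example):
    """True when the examples show both a passing and a failing tuple."""
    outcomes = {bool(predicate(row)) for row in input_example}
    return outcomes == {True, False}

def avg_above_half(row):
    # The FILTER avgPR > 0.5 condition on (user, avgPR) tuples.
    return row[1] > 0.5

averages = [("Amy", 0.6), ("Fred", 0.4)]  # example tuples from the slide
kept = [row for row in averages if avg_above_half(row)]
complete = filter_is_complete(avg_above_half, averages)
```

With only (Amy, 0.6) as input, the filter would appear to do nothing, which is why the example also carries (Fred, 0.4).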
Good Examples: Conciseness
3. Conciseness

LOAD (user, url):
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)
LOAD (url, pagerank)
FOREACH user, canonicalize(url):
  (Amy, www.cnn.com)
  (Amy, www.frogs.com)
  (Fred, www.snails.com)
JOIN on url
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5
Implementation Status
• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
– Use loops, conditionals, functions of host language
• Memory Management
• Project Suggestions at:
http://wiki.apache.org/pig/ProposedProjects
Credits
Summary
Pig Latin
Sweet spot between map-reduce and SQL