Utkarsh Srivastava
Research &
Cloud Computing
Data Processing Renaissance
Map-Reduce
Dryad
Map-Reduce
[Figure: Input records → map → key-value pairs → reduce → Output records]
map output: (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5)
reduce input for k1: v1, v3, v5
reduce input for k2: v2, v4
Just a group-by-aggregate?
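The "group-by-aggregate" reading of Map-Reduce can be sketched in a few lines of Python. This is only an illustration, not Hadoop's API: the names map_reduce, identity_map and sum_reduce are hypothetical, and small integers stand in for the values v1..v5 from the diagram.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group the emitted pairs by key,
    then apply reduce_fn to each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Assumption: integers 1..5 stand in for v1..v5 from the diagram.
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4), ("k1", 5)]

def identity_map(record):
    # The records are already (key, value) pairs, so map just passes them on.
    yield record

def sum_reduce(key, values):
    # The "aggregate" half of group-by-aggregate.
    return sum(values)

totals = map_reduce(records, identity_map, sum_reduce)
print(totals)
```

The grouping step is the same for every job; only map_fn and reduce_fn change from job to job.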
The Map-Reduce Appeal
• Pig Latin
• Example Generation
• Future Work
Example Data Analysis Task
Find the top 10 most visited pages in each category
• Load Visits
• Group by url
• Foreach url generate count
• Load Url Info
• Join on url
• Group by category
• Foreach category generate top10 urls
In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
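To make the task concrete, here is a rough Python sketch of the whole dataflow, with the three Pig Latin statements above as its first step. The sample tuples, the category names and the helper top_n are assumptions for illustration; the slides do not show the contents of either input.

```python
from collections import Counter, defaultdict

# Hypothetical sample data (user, url, time) and (url, category).
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Fred", "cnn.com", "12:00"),
    ("Fred", "flickr.com", "12:30"),
]
url_info = [("cnn.com", "News"), ("bbc.com", "News"), ("flickr.com", "Photos")]

# group visits by url; foreach group generate url, count(visits)
visit_counts = Counter(url for _user, url, _time in visits)

# join on url, then group by category
by_category = defaultdict(list)
for url, category in url_info:
    if url in visit_counts:
        by_category[category].append((url, visit_counts[url]))

def top_n(pairs, n=10):
    """Return the n (url, count) pairs with the highest counts."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]

# foreach category generate top10 urls
top_urls = {category: top_n(pairs) for category, pairs in by_category.items()}
print(top_urls)
```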
[Figure: example values finance, yahoo, email, news]
Ted Dunning
Chief Scientist, Veoh
CoGroup
results:
query    url        rank
Lakers   nba.com    1
Lakers   espn.com   2
Kings    nhl.com    1
Kings    nba.com    2

revenue:
query    adSlot   amount
Lakers   top      50
Lakers   side     20
Kings    top      30
Kings    side     10
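A minimal Python sketch of what CoGroup produces for the two tables above, assuming both are grouped on query. The function cogroup is a hypothetical stand-in for the operator, not Pig's implementation: each key maps to one bag of results tuples and one bag of revenue tuples, kept separate rather than crossed.

```python
from collections import defaultdict

results = [
    ("Lakers", "nba.com", 1),
    ("Lakers", "espn.com", 2),
    ("Kings", "nhl.com", 1),
    ("Kings", "nba.com", 2),
]
revenue = [
    ("Lakers", "top", 50),
    ("Lakers", "side", 20),
    ("Kings", "top", 30),
    ("Kings", "side", 10),
]

def cogroup(left, right, key=lambda t: t[0]):
    """Group two relations by a shared key, keeping one bag per relation."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[key(t)][0].append(t)
    for t in right:
        groups[key(t)][1].append(t)
    return dict(groups)

grouped = cogroup(results, revenue)

# Crossing the two bags of each group yields the equi-join on query.
joined = [r + v[1:] for bags in grouped.values() for r in bags[0] for v in bags[1]]
print(grouped["Lakers"])
```

Keeping the bags separate lets a later step process them however it needs to; taking their cross product is just one choice.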
Implementation
[Figure: user → SQL → automatic rewrite + optimize → Pig → Hadoop Map-Reduce]
Pig is open-source.
http://hadoop.apache.org/pig
[Figure: the example task split across map and reduce phases]
Join on url (Reduce2)
Group by category (Map3, Reduce3)
Foreach category generate top10(urls)
Other operations are pipelined into map and reduce phases.
Optimizations: Using the Combiner
[Figure: Input records → map → key-value pairs → reduce → Output records]
map output: (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5)
reduce input for k1: v1, v3, v5
reduce input for k2: v2, v4
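A rough sketch of what a combiner buys, assuming the aggregate is a sum and assuming a particular split of the five pairs across two map tasks (the diagram does not make the split clear). The names combine and reduce_sum are hypothetical; integers stand in for v1..v5.

```python
from collections import defaultdict

# Assumption: two map tasks, with the five pairs split between them like this.
map_outputs = [
    [("k1", 1), ("k2", 2), ("k1", 3)],
    [("k2", 4), ("k1", 5)],
]

def combine(pairs):
    """Pre-aggregate one map task's output before it is sent to the reducers."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def reduce_sum(pairs):
    """Final aggregation over everything that reaches the reduce side."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

without_combiner = [pair for output in map_outputs for pair in output]
with_combiner = [pair for output in map_outputs for pair in combine(output)]

print(len(without_combiner), len(with_combiner))
print(reduce_sum(with_combiner))
```

Fewer pairs cross the network, and the final result is unchanged. That only holds because sum can be applied to partial results and then combined; an aggregate without that property would need different handling.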
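[Figure: Filter bots feeds both Group by state and Group by demographic. In the combined plan, Filter bots is followed by Split in the map phase, and Demultiplex in Reduce1 routes records to Group by state and Group by demographic.]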
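One way to read the Split and Demultiplex diagram is sketched below. This is an interpretation, not Pig's actual implementation: the record fields, the tags and the function names are all hypothetical. The map side filters bots once and emits each surviving record under two tagged keys; the reduce side uses the tag to route each pair to the right group-by.

```python
from collections import defaultdict

# Hypothetical input records.
records = [
    {"user": "u1", "state": "CA", "demographic": "18-24", "is_bot": False},
    {"user": "u2", "state": "CA", "demographic": "25-34", "is_bot": False},
    {"user": "u3", "state": "NY", "demographic": "18-24", "is_bot": True},
    {"user": "u4", "state": "NY", "demographic": "18-24", "is_bot": False},
]

def shared_map(record):
    """Filter bots once, then split: emit one tagged pair per group-by."""
    if record["is_bot"]:
        return
    yield ("state", record["state"]), 1
    yield ("demographic", record["demographic"]), 1

def shared_reduce(pairs):
    """Demultiplex on the tag and count separately for each group-by."""
    outputs = {"state": defaultdict(int), "demographic": defaultdict(int)}
    for (tag, key), value in pairs:
        outputs[tag][key] += value
    return {tag: dict(counts) for tag, counts in outputs.items()}

pairs = [pair for record in records for pair in shared_map(record)]
counts = shared_reduce(pairs)
print(counts)
```

Both group-bys are answered from a single scan of the input, so the filter runs once instead of twice.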
Example Dataflow Program
Find users that tend to visit high-pagerank pages
LOAD (user, url)
FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5
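The dataflow above, written out as a small Python sketch over the example tuples that appear later in the deck. The canonicalize function here is a hypothetical stand-in (strip the scheme and path, add a www. prefix); the deck does not show the real UDF.

```python
from collections import defaultdict
from urllib.parse import urlparse

def canonicalize(url):
    """Hypothetical canonicalizer: keep only the host, with a www. prefix."""
    host = urlparse(url if "://" in url else "//" + url).netloc
    return host if host.startswith("www.") else "www." + host

# LOAD (user, url) and LOAD (url, pagerank)
visits = [
    ("Amy", "cnn.com"),
    ("Amy", "http://www.frogs.com"),
    ("Fred", "www.snails.com/index.html"),
]
pagerank = {"www.cnn.com": 0.9, "www.frogs.com": 0.3, "www.snails.com": 0.4}

# FOREACH user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
joined = [(user, url, pagerank[url]) for user, url in canonical if url in pagerank]

# GROUP on user
by_user = defaultdict(list)
for user, _url, pr in joined:
    by_user[user].append(pr)

# FOREACH user, AVG(pagerank)
avg_pr = {user: sum(prs) / len(prs) for user, prs in by_user.items()}

# FILTER avgPR > 0.5
high_pr_users = [user for user, avg in avg_pr.items() if avg > 0.5]
print(high_pr_users)
```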
Iterative Process
[Figure: the dataflow program (LOAD (user, url), LOAD (url, pagerank), FOREACH user, canonicalize(url), GROUP on user) produces No Output ☹]
How to do test runs?
[Figure: the dataflow annotated with example tuples at each step]
LOAD (user, url):
(Amy, cnn.com)
(Amy, http://www.frogs.com)
(Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url), then JOIN on url:
(Amy, www.cnn.com, 0.9)
(Amy, www.frogs.com, 0.3)
(Fred, www.snails.com, 0.4)
GROUP on user:
(Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})
0. Consistency
output example = operator applied on input example
[Figure: dataflow LOAD → FOREACH → JOIN → GROUP → FOREACH → FILTER, annotated with the example tuples below]
LOAD (user, url):
(Amy, cnn.com)
(Amy, http://www.frogs.com)
(Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url):
(Amy, www.cnn.com)
(Amy, www.frogs.com)
(Fred, www.snails.com)
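The consistency rule can be stated as a one-line check: the output example must equal the operator applied to the input example. A minimal sketch, using the FOREACH step and a hypothetical canonicalize (the deck does not show the real UDF); is_consistent is an illustrative name, not part of Pig.

```python
from urllib.parse import urlparse

def canonicalize(url):
    """Hypothetical canonicalizer: keep only the host, with a www. prefix."""
    host = urlparse(url if "://" in url else "//" + url).netloc
    return host if host.startswith("www.") else "www." + host

def foreach_canonicalize(tuples):
    """The operator under test: FOREACH user, canonicalize(url)."""
    return [(user, canonicalize(url)) for user, url in tuples]

def is_consistent(operator, input_example, output_example):
    """Consistency: output example == operator applied on input example."""
    return operator(input_example) == output_example

input_example = [
    ("Amy", "cnn.com"),
    ("Amy", "http://www.frogs.com"),
    ("Fred", "www.snails.com/index.html"),
]
output_example = [
    ("Amy", "www.cnn.com"),
    ("Amy", "www.frogs.com"),
    ("Fred", "www.snails.com"),
]

consistent = is_consistent(foreach_canonicalize, input_example, output_example)
print(consistent)
```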
Good Examples: Realism
1. Realism
[Figure: dataflow LOAD → FOREACH → JOIN → GROUP → FOREACH → FILTER, annotated with the example tuples below]
LOAD (user, url):
(Amy, cnn.com)
(Amy, http://www.frogs.com)
(Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url):
(Amy, www.cnn.com)
(Amy, www.frogs.com)
(Fred, www.snails.com)
Good Examples: Completeness
2. Completeness
Demonstrate the salient properties of each operator, e.g., FILTER
[Figure: dataflow LOAD → FOREACH → JOIN → GROUP → FOREACH → FILTER, annotated with the example tuples below]
FOREACH user, AVG(pagerank):
(Amy, 0.6)
(Fred, 0.4)
FILTER avgPR > 0.5:
(Amy, 0.6)
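For FILTER, one reasonable reading of completeness is that the examples should contain at least one tuple that passes and at least one that is dropped. The sketch below checks exactly that; filter_is_illustrated is a hypothetical helper, and this is an interpretation of the slide rather than the actual example-generation algorithm.

```python
def filter_is_illustrated(predicate, input_example):
    """True if the example shows both behaviours of FILTER:
    some tuple passes the predicate and some tuple is dropped."""
    passed = [t for t in input_example if predicate(t)]
    dropped = [t for t in input_example if not predicate(t)]
    return bool(passed) and bool(dropped)

def avg_pr_above_half(t):
    # FILTER avgPR > 0.5 on (user, avgPR) tuples
    return t[1] > 0.5

input_example = [("Amy", 0.6), ("Fred", 0.4)]
output_example = [t for t in input_example if avg_pr_above_half(t)]

print(output_example)
print(filter_is_illustrated(avg_pr_above_half, input_example))
```

With only (Amy, 0.6) as input, the filter would look like a no-op, which is why (Fred, 0.4) matters in the example.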
Good Examples: Conciseness
3. Conciseness
[Figure: dataflow LOAD → FOREACH → JOIN → GROUP → FOREACH → FILTER, annotated with the example tuples below]
LOAD (user, url):
(Amy, cnn.com)
(Amy, http://www.frogs.com)
(Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url):
(Amy, www.cnn.com)
(Amy, www.frogs.com)
(Fred, www.snails.com)
Implementation Status
• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
– Use loops, conditionals, functions of host language
• Memory Management
• Project Suggestions at:
http://wiki.apache.org/pig/ProposedProjects
Credits
Summary
Pig Latin
Sweet spot between map-reduce and SQL