Topic 1
Designing MapReduce Implementations, Part 1
AGENDA
MapReduce
Thinking MapReduce!!!
- Are the input files independent of each other?
- Can the problem be broken into smaller tasks?
- Can the partial results of processing the smaller tasks be aggregated or consolidated?
- Identify the individual entity on which processing happens.
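The checklist can be tried out on a toy problem before touching Hadoop. The sketch below is a minimal plain-Java illustration, assuming a word count as the problem; the class and method names (`ChecklistSketch`, `countChunk`, `mergePartials`) are hypothetical and not part of any Hadoop API. Each chunk is processed on its own, the entity being processed is a single line, and the partial results are merged by key.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java, no Hadoop classes involved.
class ChecklistSketch {

    // Smaller task: count the words in one independent chunk of input.
    static Map<String, Integer> countChunk(List<String> lines) {
        Map<String, Integer> partial = new HashMap<>();
        for (String line : lines) {                 // the individual entity is one line
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    partial.merge(word, 1, Integer::sum);
                }
            }
        }
        return partial;
    }

    // Aggregation: the partial results of all chunks are consolidated by key.
    static Map<String, Integer> mergePartials(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> chunk1 = List.of("map reduce", "map");
        List<String> chunk2 = List.of("reduce reduce");
        Map<String, Integer> total =
                mergePartials(List.of(countChunk(chunk1), countChunk(chunk2)));
        System.out.println("map=" + total.get("map") + ", reduce=" + total.get("reduce"));
    }
}
```

If the chunks can be counted in any order and the merge still gives the same totals, the problem passes the first three questions of the checklist.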
MapReduce Design Patterns
A design pattern is a template for solving a common, general data manipulation problem with MapReduce.
- Summarization Patterns
- Filtering Patterns
- Join Patterns
- Job Chaining Patterns
MapReduce Design Patterns
- Description of the Design Pattern
- Examples of the Design Pattern
- Structure of the Design Pattern
- Detailed explanation of one example
- MapReduce feasibility
- High-level MapReduce design
- Detailed design
- Java MapReduce code
Summarization Pattern
Numerical Summarization
Description
Numerical Summarization Pattern
Examples
Digital Marketing
- Summarize ads by type and time of day
- Plot a histogram
- Analyze ad effectiveness
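The digital marketing example can be sketched in plain Java to show what a numerical summary per group looks like. This is a minimal sketch under stated assumptions: the `AdEvent` type, its fields, and the `adType@hour` grouping key are hypothetical, and the JDK class `LongSummaryStatistics` stands in for the count, sum, min, and max that a reducer would compute.

```java
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: plain Java, no Hadoop classes involved.
class AdSummarySketch {

    // Hypothetical ad event; the field names are assumptions made for this sketch.
    static final class AdEvent {
        final String adType;
        final int hourOfDay;
        final long clicks;

        AdEvent(String adType, int hourOfDay, long clicks) {
            this.adType = adType;
            this.hourOfDay = hourOfDay;
            this.clicks = clicks;
        }
    }

    // Groups events by "adType@hour" and keeps count, sum, min, and max of clicks per group.
    static Map<String, LongSummaryStatistics> summarize(List<AdEvent> events) {
        Map<String, LongSummaryStatistics> byGroup = new TreeMap<>();
        for (AdEvent e : events) {
            String group = e.adType + "@" + e.hourOfDay;
            byGroup.computeIfAbsent(group, g -> new LongSummaryStatistics()).accept(e.clicks);
        }
        return byGroup;
    }

    public static void main(String[] args) {
        List<AdEvent> events = List.of(
                new AdEvent("banner", 9, 3),
                new AdEvent("banner", 9, 5),
                new AdEvent("video", 21, 2));
        summarize(events).forEach((group, stats) ->
                System.out.println(group + " count=" + stats.getCount()
                        + " sum=" + stats.getSum() + " max=" + stats.getMax()));
    }
}
```

The per-group count is the bar height of the histogram; the sum and max feed an effectiveness comparison between ad types.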
Summarization Pattern General Structure
[Figure: end-to-end data flow of the summarization pattern]
- HDFS Input: the input file is stored as Block 1 to Block 4.
- InputFormat: each block corresponds to an input split (I/P Split 1 to I/P Split 4).
- RecordReader: each split is read as <K, V> records (<K1, V1>, <K2, V2>, ...).
- Map(): emits Map Output pairs of the form <K, Summary Fld>.
- Partitioner: routes the map output to the reducers, where the values are grouped by key as <K, (...)>.
- Reduce(): emits Reduce Output pairs of the form <group, Summary>.
- OutputFormat and RecordWriter: write the reduce output to HDFS Output as part-r-00000 and part-r-00001.
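The flow of the general structure can be walked through in memory. The sketch below is a plain-Java simulation, not Hadoop code: the class `SummarizationFlowSketch`, its methods, and the sample records are hypothetical, and the summary chosen here is a sum. The `partition` method uses the same rule as Hadoop's default hash partitioner (non-negative hash modulo the number of reducers), applied to a Java `String` rather than a Hadoop key type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through of the flow; not Hadoop code.
class SummarizationFlowSketch {

    // Partitioner step: non-negative hash of the key modulo the reducer count.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Each record is {groupKey, numericField}; returns one output map per reducer.
    static List<Map<String, Long>> run(List<String[]> records, int numReducers) {
        // Shuffle buffers: one sorted key -> values map per reducer.
        List<Map<String, List<Long>>> shuffled = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            shuffled.add(new TreeMap<>());
        }

        // Map step: emit <K, Summary Fld> and route it with the partitioner.
        for (String[] record : records) {
            String key = record[0];
            long summaryField = Long.parseLong(record[1]);
            shuffled.get(partition(key, numReducers))
                    .computeIfAbsent(key, k -> new ArrayList<>())
                    .add(summaryField);
        }

        // Reduce step: collapse each <K, (values)> group into <group, Summary>.
        List<Map<String, Long>> outputs = new ArrayList<>();
        for (Map<String, List<Long>> groups : shuffled) {
            Map<String, Long> out = new TreeMap<>();
            groups.forEach((key, values) ->
                    out.put(key, values.stream().mapToLong(Long::longValue).sum()));
            outputs.add(out);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[] {"K1", "3"},
                new String[] {"K2", "5"},
                new String[] {"K1", "4"});
        List<Map<String, Long>> outputs = run(records, 2);
        for (int i = 0; i < outputs.size(); i++) {
            // One output file per reducer, named in the style of Hadoop part files.
            System.out.println(String.format("part-r-%05d", i) + " " + outputs.get(i));
        }
    }
}
```

Because every value for a given key lands in the same partition, each reducer can compute its summaries without seeing the other reducer's data, which is why the job produces one part file per reducer.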
Example Dataset
stackexchange.com
https://archive.org/details/stackexchange
posts.xml
comments.xml
users.xml
posts.xml
<row FavoriteCount="4" CommentCount="0" AnswerCount="5"
     Tags="&lt;tells&gt;&lt;online&gt;" Title="How to detect tells online?"
     LastActivityDate="2012-07-02T19:20:08.690" LastEditDate="2012-01-10T19:40:36.453"
     LastEditorUserId="39" OwnerUserId="36"
     Body="&lt;p&gt;I know there ara some tips to detect tells while playing live, but how can you do it while playing poker online? What should I look for?&lt;/p&gt; "
     ViewCount="411" Score="13" CreationDate="2012-01-10T19:35:18.413"
     AcceptedAnswerId="50" PostTypeId="1" Id="18"/>
Users:
- post questions on the site.
- post answers to the questions.
- up-vote or down-vote questions and answers.
- specify tags that categorize the post.
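Before a post can be summarized, its attributes have to be pulled out of the `<row .../>` element. The sketch below shows one way to do that with the XML parser that ships with the JDK, assuming each row arrives as one complete, well-formed XML element; the class `RowAttributeSketch` and the method `parseRow` are hypothetical names, and the sample row is a shortened version of the post shown above.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical helper: turns one <row .../> element into an attribute map.
class RowAttributeSketch {

    static Map<String, String> parseRow(String xmlRow) {
        try {
            NamedNodeMap attrs = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmlRow)))
                    .getDocumentElement()
                    .getAttributes();
            Map<String, String> fields = new HashMap<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                fields.put(attrs.item(i).getNodeName(), attrs.item(i).getNodeValue());
            }
            return fields;
        } catch (Exception e) {
            // Malformed rows are reported to the caller instead of being silently dropped.
            throw new IllegalArgumentException("Not a well-formed row: " + xmlRow, e);
        }
    }

    public static void main(String[] args) {
        String row = "<row AnswerCount=\"5\" Tags=\"&lt;tells&gt;&lt;online&gt;\" "
                + "OwnerUserId=\"36\" ViewCount=\"411\" Score=\"13\" PostTypeId=\"1\" Id=\"18\"/>";
        Map<String, String> post = parseRow(row);
        System.out.println("owner=" + post.get("OwnerUserId")
                + " score=" + post.get("Score")
                + " tags=" + post.get("Tags"));
    }
}
```

In a summarization job, one of these attributes (for example `OwnerUserId`) would typically become the grouping key and a numeric attribute (for example `Score`) the summary field.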
posts.xml - Metadata
Attribute          Description
FavoriteCount
CommentCount
AnswerCount
Tags
Title
LastActivityDate
comments.xml
<row UserId="35" CreationDate="2012-01-10T19:33:35.373"
     Text="What were the suits of the cards? It matters in particular because each player had a hole card that might play, and they can't both be of the same suit."
     Score="1" PostId="6" Id="6"/>
Comments are follow-up questions or suggestions on posts (questions or answers).
comments.xml - Metadata
Attribute
Description
users.xml
<row AccountId="407388" Age="26" DownVotes="0" UpVotes="1" Views="1"
     AboutMe="&lt;p&gt;Symfony 2 developer&lt;/p&gt; &lt;p&gt;Fancy front-end HTML5/JS enthusiast&lt;/p&gt; &lt;p&gt;Amateur Poker Player&lt;/p&gt; "
     Location="Slovakia" LastAccessDate="2012-01-24T21:41:56.360"
     DisplayName="Teo.sk" CreationDate="2012-01-10T19:14:15.500"
     Reputation="101" Id="15" WebsiteUrl="http://teo.sk"/>
Account details provided on the site.
users.xml - Metadata
Attribute
Description
RECAP
- How to approach a MapReduce problem
- Numerical Summarization Pattern, Part 1