MapReduce Examples

CSE532: Theory of Database Systems


Fusheng Wang
Department of Biomedical Informatics
Department of Computer Science

Word Count Execution


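Input (three splits, one per Map task):
  Split 1: the quick brown fox
  Split 2: the fox ate the mouse
  Split 3: how now brown cow

Map output:
  Map 1: (the, 1) (quick, 1) (brown, 1) (fox, 1)
  Map 2: (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
  Map 3: (how, 1) (now, 1) (brown, 1) (cow, 1)

Shuffle & Sort (pairs are routed to two Reduce tasks):
  Reduce 1 receives the keys: brown, fox, how, now, the
  Reduce 2 receives the keys: ate, cow, mouse, quick

Output:
  Reduce 1: brown, 2   fox, 2   how, 1   now, 1   the, 3
  Reduce 2: ate, 1   cow, 1   mouse, 1   quick, 1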
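
The word count flow on this slide can be sketched in a few lines of plain Python. This is only a single-process simulation of the map, shuffle-and-sort, and reduce steps, not Hadoop code; the function names (map_words, shuffle, reduce_counts, word_count) are made up for illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle & sort step: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(word, counts):
    """Reduce step: sum the 1s collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


counts = word_count([
    "the quick brown fox",
    "the fox ate the mouse",
    "how now brown cow",
])
```

In a real job the three input lines would be processed by separate map tasks and the groups spread over several reduce tasks; the simulation keeps everything in one process so the data flow is easy to follow.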

Total Order Sorting by Mapper Output Keys


The output key-value pairs from the Mapper are sorted
by key before they reach the reducers
The sort order for keys is controlled by a RawComparator, set with
mapreduce.job.output.key.comparator.class
By default, keys implement WritableComparable and are ordered
by their compareTo method
Or a RawComparator can compare records read from a
stream without deserializing them into objects

A Partitioner (a customizable hashing function) decides
how the keys are split among reducers
Each reducer merges the sorted keys from multiple mappers
and preserves the order
Across reducers, there is no total order
A single reducer would generate a total order of keys but would
be too slow
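
A small plain-Python sketch of why hash partitioning gives sorted output per reducer but no total order across reducers. The partition function below is a deterministic stand-in chosen for this illustration (it is not Hadoop's HashPartitioner), and all names are hypothetical.

```python
from collections import defaultdict


def hash_partition(key, num_reducers):
    """Hypothetical stand-in for a hash partitioner: byte sum modulo reducers."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(keys, num_reducers):
    """Route each key to a reducer, then sort the keys inside each reducer."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[hash_partition(key, num_reducers)].append(key)
    return [sorted(buckets[r]) for r in range(num_reducers)]


keys = ["the", "quick", "brown", "fox", "ate", "mouse", "how", "now", "cow"]
partitions = shuffle_and_sort(keys, num_reducers=2)

# Concatenating the reducer outputs in reducer order.
concatenated = [key for part in partitions for key in part]
```

Each list in partitions is sorted, but concatenated is not, because the hash function ignores key order when it assigns keys to reducers.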
Sorting

Idea
Idea: produce a set of sorted files that, if
concatenated, would form a globally sorted file
The secret: use a partitioner that respects the total
order of the output

e.g.: sort the weather dataset by temperature, one reducer per range

Reducer ranges:
< -10C
-10C to 0C
0C to 10C
>= 10C
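
The idea can be sketched in plain Python with the temperature boundaries from the slide. The sample temperatures and the function names are invented for illustration; the point is only that a partitioner that respects key order makes the concatenated reducer outputs globally sorted.

```python
from bisect import bisect_right

# Boundaries taken from the slide: < -10, -10 to 0, 0 to 10, >= 10 (degrees C).
BOUNDARIES = [-10, 0, 10]


def range_partition(temperature):
    """Return the index of the reducer whose range contains the temperature."""
    return bisect_right(BOUNDARIES, temperature)


def sort_by_range(temperatures):
    """Partition by range, then sort inside each partition (one per reducer)."""
    partitions = [[] for _ in range(len(BOUNDARIES) + 1)]
    for t in temperatures:
        partitions[range_partition(t)].append(t)
    return [sorted(p) for p in partitions]


temps = [12, -15, 3, -10, 0, 25, -2, 9, 10, -30]  # made-up readings
parts = sort_by_range(temps)
merged = [t for part in parts for t in part]
```

Because every key in reducer i is smaller than every key in reducer i+1, merged equals sorted(temps) without any cross-reducer merge step.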

Total Order Partitioner


HashPartitioner (default) hashes a record's key to
determine which partition/reducer the record belongs in
Goals of a total order partitioner:
The number of partitions equals the number of reducers
The size of each partition should be balanced

Sample the key space to estimate the key distribution
and generate the boundaries used for partitioning
The InputSampler runs on the client and limits the number of splits it samples

The InputSampler writes a partition file that is shared with the tasks
running on the cluster through the Distributed Cache
Distributed Cache is a facility provided by the MapReduce
framework to cache files (text, archives, jars, etc.) needed by
applications
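
A rough plain-Python sketch of the sampling step, assuming a simple uniform random sample and evenly spaced split points. It imitates the role the slide gives to the InputSampler and the partition file, but it is not Hadoop's implementation; the function names, sample size, and data are assumptions made for this example.

```python
import random
from bisect import bisect_right


def sample_boundaries(keys, num_reducers, sample_size, seed=0):
    """Estimate num_reducers - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced positions in the sorted sample approximate the quantiles.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]


def partition_for(key, boundaries):
    """Reducer index for a key, given the sorted boundary list."""
    return bisect_right(boundaries, key)


data_rng = random.Random(42)
keys = [data_rng.randint(-400, 400) for _ in range(1000)]  # made-up key space

boundaries = sample_boundaries(keys, num_reducers=4, sample_size=100)
partitions = [[] for _ in range(4)]
for k in keys:
    partitions[partition_for(k, boundaries)].append(k)

merged = [k for part in partitions for k in sorted(part)]
```

The boundaries list plays the part of the partition file: it is computed once on the client side and then only read by every task. How balanced the partitions turn out depends on how well the sample reflects the real key distribution.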

Overview of Total Order Sorting

Example Code

Joins
Repartition join: a reduce-side join for situations
where you are joining two or more large datasets
Replication join: a map-side join that works in
situations where one of the datasets is small enough
to cache
Semi-join: another map-side join where one dataset is initially
too large to fit into memory, but after some filtering can be
reduced to a size that fits in memory

Repartition Join
A repartition join is a reduce-side join implemented as
a single MapReduce job, and supports multi-way joins

The map phase reads the data from multiple datasets,
determines the join value for each record, and
emits that join value as the output key
Inputs A (key, value) and B (key, value) become
(key, (value, tag)), where the tag annotates the table name
The output value contains the data needed for combining datasets in the
reducer to produce the job output

A reducer receives all of the values for a join key
emitted by the map function, and partitions them based
on data source
The reducer performs a Cartesian product across all
partitions and emits the results of each join
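
The steps above can be sketched as a single-process Python simulation. The tables A and B, their contents, and all function names are hypothetical; the shuffle is replaced by an in-memory dictionary keyed on the join value.

```python
from collections import defaultdict
from itertools import product


def map_tagged(records, tag):
    """Map step: emit (join_key, (tag, value)) so the reducer knows the source."""
    for key, value in records:
        yield key, (tag, value)


def reduce_join(key, tagged_values, tags):
    """Reduce step: split values by source, then emit their Cartesian product."""
    by_source = {tag: [] for tag in tags}
    for tag, value in tagged_values:
        by_source[tag].append(value)
    for combo in product(*(by_source[tag] for tag in tags)):
        yield (key,) + combo


def repartition_join(datasets):
    """datasets maps a tag (table name) to a list of (join_key, value) records."""
    tags = list(datasets)
    groups = defaultdict(list)  # stands in for the shuffle
    for tag, records in datasets.items():
        for key, tagged in map_tagged(records, tag):
            groups[key].append(tagged)
    results = []
    for key in sorted(groups):
        results.extend(reduce_join(key, groups[key], tags))
    return results


A = [(1, "a1"), (2, "a2"), (2, "a3")]
B = [(2, "b1"), (2, "b2"), (3, "b3")]
joined = repartition_join({"A": A, "B": B})
```

Keys 1 and 3 appear in only one table, so one side of the product is empty and nothing is emitted for them; key 2 yields 2 x 2 = 4 joined rows. Passing more than two tagged tables gives a multi-way join with the same reducer.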

Example
Join Customers (CID, Name, Phone) with Orders (CID,
OrderID, Price, Date):
Find orders for each customer

Mapper: same key (CID) for both inputs; value is customer
info for Customers, order info for Orders, PLUS a tag for the
data source

The Reducer Side of Repartitioned Join


For a given join key,
the reduce task
performs a full cross product of values
from different sources
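
As an illustration for the Customers and Orders example, the reducer for one CID might look like the sketch below. The tagged values are what the shuffle would deliver for that key; the sample records and the function name are invented, and the values follow the slide's schemas with the CID moved into the key.

```python
def reduce_customer_orders(cid, tagged_values):
    """Separate values by source tag, then cross customers with orders."""
    customers, orders = [], []
    for tag, record in tagged_values:
        if tag == "Customers":
            customers.append(record)   # (Name, Phone)
        elif tag == "Orders":
            orders.append(record)      # (OrderID, Price, Date)
    return [(cid,) + c + o for c in customers for o in orders]


# Values arrive in no particular source order; the tag tells them apart.
values_for_cid_7 = [
    ("Orders", (101, 25.0, "2014-03-01")),
    ("Customers", ("Alice", "555-0100")),
    ("Orders", (102, 40.0, "2014-03-05")),
]
rows = reduce_customer_orders(7, values_for_cid_7)

# A CID with orders but no customer record produces nothing in an inner join.
orphan_rows = reduce_customer_orders(9, [("Orders", (103, 5.0, "2014-03-09"))])
```

Note that the reducer buffers the values of each source in memory before emitting, which is one reason a key with very many records on both sides is expensive in this kind of join.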

Replicated Joins
A repartition join happens late, in the reduce phase, with major
overhead from moving data to the reducer nodes
Replicated join: a join between one large data set and
many small data sets that can be performed on the map
side
Completely eliminates the need to shuffle any data to the
reduce phase
All the data sets except the very large one are
read into memory during the setup phase of each map task
The join is done entirely in the map phase, with the very large
data set being the input for the MapReduce job
Restriction: a replicated join is useful only for an
inner join or a left outer join where the large data set is the
left data set
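
A minimal plain-Python sketch of a replicated (map-side) join, assuming the small table fits in a dictionary that each map task builds during setup. The table contents and function names are hypothetical, and the dictionary stands in for the cached file a real job would read.

```python
def build_lookup(small_table):
    """Setup step: load the small data set into an in-memory hash table."""
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    return lookup


def map_join(large_split, lookup, left_outer=False):
    """Map step: join each record of the large data set against the lookup."""
    for key, value in large_split:
        matches = lookup.get(key)
        if matches:
            for match in matches:
                yield key, value, match
        elif left_outer:
            # Large data set is the left side, so unmatched rows can be kept.
            yield key, value, None


customers = [(1, "Alice"), (2, "Bob")]                                  # small
orders = [(1, "order-101"), (3, "order-102"), (1, "order-103")]         # large

lookup = build_lookup(customers)
inner = list(map_join(orders, lookup))
outer = list(map_join(orders, lookup, left_outer=True))
```

The restriction on join types follows from the data flow: each map task sees only its own split of the large data set, so it cannot know whether a small-table row (Bob here) is unmatched everywhere, which a right or full outer join would require.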

Replicated Joins

// Read cached table into Hashtable

