
MapReduce Examples

CSE532: Theory of Database Systems

Fusheng Wang
Department of Biomedical Informatics
Department of Computer Science

Word Count Execution

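Figure: word count data flow (Input -> Map -> Shuffle & Sort -> Reduce -> Output), with three mappers and two reducers

Map output sent to reducer 1:
Mapper 1: brown, 1; fox, 1; the, 1
Mapper 2: fox, 1; the, 1; the, 1
Mapper 3: brown, 1; how, 1; now, 1

Map output sent to reducer 2:
Mapper 1: quick, 1
Mapper 2: ate, 1; mouse, 1
Mapper 3: cow, 1

Reducer 1 output: brown, 2; fox, 2; how, 1; now, 1; the, 3
Reducer 2 output: ate, 1; cow, 1; mouse, 1; quick, 1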
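
The word count data flow can be sketched in plain Python, without Hadoop. This is a minimal single-process simulation, not the Hadoop API: the function names are hypothetical, and the three input lines are an assumption chosen to be consistent with the counts in the figure.

```python
from collections import defaultdict

def map_words(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle & sort: group values by key and return groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_counts(word, counts):
    """Reduce: sum the ones emitted for a single word."""
    return word, sum(counts)

# Assumed input lines (not given in full in the slides).
lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

mapped = [pair for line in lines for pair in map_words(line)]
word_counts = dict(reduce_counts(word, counts) for word, counts in shuffle(mapped))
```

In a real job the shuffle is done by the framework and the groups are spread over several reducers; here a single sorted list stands in for that step.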

Total Order Sorting by Mapper Output Keys

The output key-value pairs from the Mapper are sorted by key before they reach the reducers
The sort order for keys is controlled by a RawComparator
By default, keys implement WritableComparable, which defines their order
Alternatively, a custom RawComparator compares records read from a stream without deserializing them into objects

The Partitioner (a customizable hashing function) decides how the keys are split among reducers
Each reducer merges the sorted keys from multiple mappers and preserves the order
Across reducers, there is no total order
A single reducer would generate a total order of keys but would be too slow

Idea: produce a set of sorted files that, if
concatenated, would form a globally sorted file
The secret: use a partitioner that respects the total
order of the output

e.g.: sort the weather dataset by temperature, one partition per range

-10C - 0C
0C - 10C
>= 10C
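
A range partitioner of this kind can be sketched as follows. This is an illustration in plain Python, not Hadoop code: the cut points 0 and 10 and the sample readings are assumptions, and `range_partition` is a hypothetical name. Each partition is sorted independently, as a reducer would do, and the concatenation turns out globally sorted.

```python
from bisect import bisect_right

# Assumed cut points in degrees Celsius: partition 0 holds t < 0,
# partition 1 holds 0 <= t < 10, partition 2 holds t >= 10.
CUT_POINTS = [0, 10]

def range_partition(temperature, cut_points=CUT_POINTS):
    """Return the index of the partition whose range contains the key."""
    return bisect_right(cut_points, temperature)

readings = [12, -3, 7, 0, 25, -8, 10, 4]  # made-up sample data

partitions = [[] for _ in range(len(CUT_POINTS) + 1)]
for t in readings:
    partitions[range_partition(t)].append(t)

# Each "reducer" sorts only its own partition.
sorted_parts = [sorted(part) for part in partitions]
concatenated = [t for part in sorted_parts for t in part]
```

Because every key in partition i is smaller than every key in partition i+1, no merge step across reducers is needed.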


Total Order Partitioner

HashPartitioner (default) hashes a record's key to determine which partition/reducer the record belongs in
Goals of a total order partitioner:
The number of partitions equals the number of reducers
The size of each partition should be balanced

Sample the key space to estimate the key distribution and generate the partition boundaries
The InputSampler runs on the client and limits the number of splits used for sampling
The InputSampler writes a partition file that is shared with the tasks running on the cluster through the Distributed Cache
Distributed Cache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications
Overview of Total Order Sorting


Example Code


Repartition join: a reduce-side join for situations where you are joining two or more large datasets
Replication join: a map-side join that works in situations where one of the datasets is small enough to cache
Semi-join: another map-side join where one dataset is initially too large to fit into memory, but after some filtering can be reduced down to a size that can fit in memory
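
The semi-join idea can be sketched in three steps: collect the join keys that occur in one dataset, filter the other dataset down to those keys, then join in memory. The sketch below is plain Python under stated assumptions: `semi_join` and the sample data are hypothetical, the set of join keys is assumed to fit in memory, and each key is assumed to occur at most once in the filtered dataset.

```python
def semi_join(logs, users):
    """Join (user_id, event) log records with (user_id, name) user records."""
    # Step 1: distinct join keys that actually occur in the logs.
    wanted = {user_id for user_id, _ in logs}
    # Step 2: filter the users dataset so only referenced users remain.
    small_users = {user_id: name for user_id, name in users if user_id in wanted}
    # Step 3: map-side join against the filtered, in-memory table.
    return [(user_id, event, small_users[user_id])
            for user_id, event in logs if user_id in small_users]

logs = [(1, "login"), (1, "click"), (4, "login")]          # made-up data
users = [(1, "Ann"), (2, "Bob"), (3, "Cid"), (4, "Dee")]

joined_logs = semi_join(logs, users)
```

Only users 1 and 4 survive the filter, so the table held in memory is half the size of the original users dataset.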

Repartition Join
A repartition join is a reduce-side join implemented as a single MapReduce job; it supports multi-way joins

The map phase reads the data from multiple datasets, determines the join value for each record, and emits that join value as the output key
Input: A (key, value), B (key, value)
Map output: (key, (value, tag)), where the tag annotates the table name
The output value contains the data needed for combining datasets in the reducer to produce the job output

A reducer receives all of the values for a join key emitted by the map function, and partitions them based on data source
The reducer performs a Cartesian product across all partitions and emits the results of each join
Repartition Join

Join Customers (CID, Name, Phone) with Orders (CID, OrderID, Price, Date): find orders for each customer

Mapper: same key (CID) for both inputs; the value is customer info for Customers, order info for Orders, PLUS a tag on the data source
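
The Customers/Orders join can be sketched end to end in plain Python. This is a single-process simulation, not Hadoop code: the function names, the tags "C" and "O", and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

def map_customer(record):
    """Emit (CID, (tag, value)) for a Customers row."""
    cid, name, phone = record
    return cid, ("C", (name, phone))

def map_order(record):
    """Emit (CID, (tag, value)) for an Orders row."""
    cid, order_id, price, date = record
    return cid, ("O", (order_id, price, date))

def reduce_join(cid, tagged_values):
    """Split values by source tag, then emit the cross product."""
    customers = [value for tag, value in tagged_values if tag == "C"]
    orders = [value for tag, value in tagged_values if tag == "O"]
    return [(cid, *c, *o) for c, o in product(customers, orders)]

customers = [(1, "Ann", "555-0100"), (2, "Bob", "555-0101")]   # made-up rows
orders = [(1, "A17", 25.0, "2014-03-01"),
          (1, "A18", 40.0, "2014-03-05"),
          (3, "A19", 10.0, "2014-03-07")]

# Shuffle: group all tagged values by join key.
groups = defaultdict(list)
for key, value in [map_customer(r) for r in customers] + [map_order(r) for r in orders]:
    groups[key].append(value)

joined = [row for cid in sorted(groups) for row in reduce_join(cid, groups[cid])]
```

Customer 2 has no orders and order A19 has no customer, so both drop out: the cross product with an empty side is empty, which gives inner-join behavior.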


The Reducer Side of Repartition Join

For a given join key, the reduce task performs a full cross-product of values from different sources

Replicated Joins
A repartition join happens late, in the reduce phase, and incurs major overhead moving data to the reducer nodes
Replicated join: a join between one large data set and many small data sets that can be performed on the map side
Completely eliminates the need to shuffle any data to the reduce phase
All the data sets except the very large one are read into memory during the setup phase of each map task
The join is done entirely in the map phase, with the very large data set being the input for the MapReduce job
Restriction: a replicated join is really useful only for an inner or a left outer join where the large data set is the left data set
Replicated Joins

// Read cached table into Hashtable

Join References

[1] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51(1):107-113, 2008.
[2] Fuhui Wu et al. Comparison & Performance Analysis of Join Approach in MapReduce. ISCTCS 2012, CCIS 320, pp. 629-636.
[3] Marko Lali et al. Comparison of a Sequential & a MapReduce Approach to Joining Large Datasets. MIPRO 2013, pp. 1289-1291.
[4] Spyros Blanas et al. A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD 2010, June 6-11, ACM, Indianapolis, 2010.
[5] Foto N. Afrati et al. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Transactions on Knowledge and Data Engineering, pp. 1282-1298, 2011.
[6] Alper Okcan et al. Processing Theta-Joins using MapReduce. SIGMOD '11, June 12-16, 2011, pp. 949-951.
[7] Xiaofei Zhang et al. Efficient Multiway Theta Join Processing Using MapReduce. Proceedings of the VLDB Endowment, Vol. 5, No. 11, pp. 1184-1196.
[8] Anwar Shaikh et al. Join Query Processing in MapReduce Environment. CNC 2012, LNICST, pp. 275-281.