[Figure: word-count data flow. Each Map task emits a (word, 1) pair per word; each Reduce task sums the pairs for its keys. Output of one Reduce task: brown 2, fox 2, how 1, now 1, the 3. Output of the other Reduce task: ate 1, cow 1, mouse 1, quick 1.]
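The word-count flow in the figure can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop code: the function names and the three input lines are assumptions, chosen so that the totals match the counts shown in the figure.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]

def reduce_counts(word, ones):
    """Reduce step: sum all the 1s emitted for a single word."""
    return word, sum(ones)

def word_count(lines):
    # Shuffle: group the mapped pairs by key, as the framework would do.
    groups = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            groups[word].append(one)
    # Reduce each key group independently.
    return dict(reduce_counts(w, ones) for w, ones in sorted(groups.items()))

# Assumed input splits (not given in the slides), one per Map task.
splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = word_count(splits)
```

In a real cluster the grouping step is done by the shuffle, and each Reduce task receives only the keys assigned to it by the partitioner.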
Idea: produce a set of sorted files that, if concatenated, would form a globally sorted file
The secret: use a partitioner that respects the total order of the output
[Figure: range partitions by temperature; surviving labels: "-0C - 10C", ">= 10C"]
Sorting
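The total-order idea can be illustrated with a small Python sketch. It is a single-process simulation under stated assumptions: the split points, the sample readings, and the function names are hypothetical, and a real job would use the framework's own partitioner and shuffle rather than these helpers.

```python
from bisect import bisect_right

def range_partition(key, boundaries):
    """Return the partition index for key, given sorted split points."""
    return bisect_right(boundaries, key)

def total_order_sort(records, boundaries):
    # One bucket per partition: len(boundaries) + 1 ranges in key order.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        partitions[range_partition(key, boundaries)].append((key, value))
    # Each "reducer" sorts only its own partition.
    return [sorted(part) for part in partitions]

# Hypothetical temperature readings (degrees C) and split points.
readings = [(12, "a"), (-3, "b"), (7, "c"), (25, "d"), (0, "e"), (10, "f")]
boundaries = [0, 10]  # partitions: below 0, 0 up to (not including) 10, 10 and above
parts = total_order_sort(readings, boundaries)

# Concatenating the per-partition outputs in partition order
# yields a globally sorted sequence.
merged = [record for part in parts for record in part]
```

Because every key in partition i is smaller than every key in partition i + 1, no merge step is needed across the output files. Choosing split points that balance the partitions is a separate problem, usually handled by sampling the input.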
Joins
Repartition join: a reduce-side join for situations where you are joining two or more large datasets
Replication join: a map-side join that works in situations where one of the datasets is small enough to cache
Semi-join: another map-side join, where one dataset is initially too large to fit into memory but, after some filtering, can be reduced to a size that fits in memory
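The semi-join is the only one of the three strategies without its own slides, so a minimal Python sketch of the filtering idea may help. The datasets (`logs`, `users`) and the function name are hypothetical, and the three steps are run in one process here, whereas on a cluster each would typically be a separate job or phase.

```python
def semi_join(logs, users):
    """Join logs (user_id, action) with users (user_id, name)."""
    # Step 1: collect the distinct join keys that occur in the large dataset.
    active_ids = {user_id for user_id, _ in logs}
    # Step 2: filter the other dataset down to records with a matching key,
    # so that the result is small enough to hold in memory.
    filtered_users = {uid: name for uid, name in users if uid in active_ids}
    # Step 3: map-side join of the large dataset against the filtered table.
    return [(uid, filtered_users[uid], action)
            for uid, action in logs if uid in filtered_users]

# Hypothetical sample data.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
logs = [(1, "login"), (3, "view"), (1, "logout")]
joined = semi_join(logs, users)
```

User 2 never appears in the logs, so it is dropped in step 2 and never has to be cached.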
Repartition Join
A repartition join is a reduce-side join implemented as
a single MapReduce job; it also supports multi-way joins
Example
Join Customers (CID, Name, Phone) with Orders (CID, OrderID, Price, Date):
find the orders for each customer
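A repartition join for the Customers and Orders example can be sketched as follows. This is a plain-Python simulation under stated assumptions: the tags `"C"` and `"O"`, the function names, and the sample records are illustrative and not taken from the slides; the grouping dictionary stands in for the shuffle.

```python
from collections import defaultdict

def map_customer(record):
    """Map step for Customers: key by CID and tag the value with its source."""
    cid, name, phone = record
    return cid, ("C", (name, phone))

def map_order(record):
    """Map step for Orders: key by CID and tag the value with its source."""
    cid, order_id, price, date = record
    return cid, ("O", (order_id, price, date))

def reduce_join(cid, tagged_values):
    """Reduce step: separate the two sources, then pair them up (inner join)."""
    customers = [v for tag, v in tagged_values if tag == "C"]
    orders = [v for tag, v in tagged_values if tag == "O"]
    return [(cid, *c, *o) for c in customers for o in orders]

def repartition_join(customers, orders):
    # Shuffle: all values sharing a CID land in the same reduce group.
    groups = defaultdict(list)
    for key, value in map(map_customer, customers):
        groups[key].append(value)
    for key, value in map(map_order, orders):
        groups[key].append(value)
    result = []
    for cid in sorted(groups):
        result.extend(reduce_join(cid, groups[cid]))
    return result

# Hypothetical sample data.
customers = [(1, "Ann", "555-0100"), (2, "Bob", "555-0101")]
orders = [(1, "o1", 9.5, "2013-01-02"), (1, "o2", 20.0, "2013-01-05"),
          (3, "o3", 5.0, "2013-01-07")]
joined_orders = repartition_join(customers, orders)
```

Customer 2 has no orders and order `o3` has no customer, so neither appears in the inner-join result. Every record of both inputs still crosses the shuffle, which is the cost the map-side joins try to avoid.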
Replicated Joins
A repartition join happens late, in the reduce phase, and incurs major
overhead moving data to the reducer nodes
Replicated join: a join between one large data set and
many small data sets that can be performed on the map
side
It completely eliminates the need to shuffle any data to the
reduce phase
All the data sets except the very large one are
read into memory during the setup phase of each map task
The join is done entirely in the map phase, with the very large
data set as the input to the MapReduce job
Restriction: a replicated join is really useful only for an
inner join or a left outer join where the large data set is the
left data set (a map task sees only its own split of the large data set,
so it cannot tell whether a record from a small data set is unmatched everywhere)