Hadoop Workflow
MapReduce
MapReduce is a processing technique and programming model that distributes tasks to the respective
nodes (the commodity machines where the data resides) to process vast amounts of distributed data (multi-
terabyte datasets) in parallel, in a reliable, fault-tolerant manner, and then finally return the
required information.
The following concepts/programs are involved in a MapReduce job across its different
phases, which are mentioned in the section above:
• Driver/Client Program (Configured and Tool).
• InputFormat (Input Split and Record Reader).
• Custom Key (WritableComparable).
• Mapper (Mapper).
• Combiner (Reducer).
• Partitioner (Partitioner).
• Secondary Sort (WritableComparator).
• Group by Comparator (WritableComparator).
• Reducer.
• OutputFormat.
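To show roughly how these pieces fit together, the following plain-Java sketch simulates the map → shuffle/sort → reduce flow of a word-count job in memory. It uses no Hadoop classes; the names and structure are illustrative only, not the Hadoop API:

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Reduce phase: sum all counts emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Shuffle/sort: group mapper output by key (a TreeMap keeps keys
    // sorted, mimicking the sorted input a reducer receives).
    static Map<String, Integer> run(List<String> lines) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        return result;
    }
}
```

In a real job each of these steps is pluggable through the classes listed above (InputFormat, Mapper, Partitioner, Reducer, OutputFormat); here they are collapsed into one class for readability.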
Map Reduce
InputFormat / RecordReader
InputFormat: To read the data to be processed, Hadoop provides
InputFormat, which has the following responsibilities:
• Compute the input splits of the data.
• Provide the logic to read each input split.
• That means, based on the InputFormat configuration, the JobTracker validates
the input file path and the input format class.
• RecordReader: It reads an individual input split (chunk) line by line and
calls the Mapper's map() method, passing the key (byte offset of the line
in the file) and the value (the complete line text) as a pair.
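The key/value pairs that a line-oriented RecordReader hands to map() can be mimicked in plain Java: the key is the byte offset of each line in the split, the value is the line text. This is a simplified sketch (assuming '\n' separators and single-byte characters), not Hadoop's LineRecordReader itself:

```java
import java.util.*;

public class LineRecords {
    // Produce (byteOffset, lineText) pairs the way a line-oriented
    // record reader would pass them to map().
    static List<Map.Entry<Long, String>> records(String splitContents) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long offset = 0;
        for (String line : splitContents.split("\n", -1)) {
            if (!line.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(offset, line));
            offset += line.length() + 1; // +1 for the newline separator
        }
        return out;
    }
}
```

For the input `"1979 23\n1980 26\n"`, this yields the pairs (0, "1979 23") and (8, "1980 26"), which is the shape of input the Map phase below operates on.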
Map Phase
• Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or
multiple output key/value pairs for each input key/value pair.
• When the map function starts producing output, it is not simply written straight to
disk; the writes are buffered and some presorting is performed. Each map writes its
output to a circular memory buffer (default size 100 MB) assigned to it. When the contents of the buffer
reach a certain threshold size, a background thread will start to spill the contents to
disk. Map outputs will continue to be written to the buffer while the spill takes place,
but if the buffer fills up during this time, the map will block until the spill is complete.
Before it writes to disk, the thread first divides the data into partitions corresponding
to the reducers to which they will ultimately be sent.
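The default partitioning used for that division assigns each key to a reducer by hashing. The logic, shown here as plain Java mirroring the behavior of Hadoop's default HashPartitioner, masks off the sign bit so the result is non-negative, then takes the modulo by the number of reducers:

```java
public class HashPartition {
    // Mirrors Hadoop's default HashPartitioner logic:
    // (hashCode & Integer.MAX_VALUE) clears the sign bit, and the
    // modulo maps the key to one of numReduceTasks partitions.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key's hash, every record with the same key lands on the same reducer, which is what makes the grouping in the Reduce phase possible.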
Example: parsing a TSV dataset using a Mapper:
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
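Assuming each row holds a year followed by periodic readings (an interpretation of the columns above, not stated in the dataset itself), the map-side parsing could look like this plain-Java sketch. A real Hadoop version would extend Mapper and emit via context.write; here the parsing logic is isolated so it is easy to test:

```java
public class TsvParse {
    // Parse one whitespace/tab-separated row: the first field is the
    // year, the remaining fields are readings; return {year, maxReading}.
    static int[] yearAndMax(String line) {
        String[] fields = line.trim().split("\\s+");
        int year = Integer.parseInt(fields[0]);
        int max = Integer.MIN_VALUE;
        for (int i = 1; i < fields.length; i++)
            max = Math.max(max, Integer.parseInt(fields[i]));
        return new int[] { year, max };
    }
}
```

A mapper built on this would emit (year, maxReading) pairs, and a reducer could then aggregate per-year results across all splits.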