
MapReduce Introduction

Hadoop Workflow
MapReduce
Map Reduce
MapReduce is a processing technique and programming model that distributes tasks to the respective
nodes (the commodity machines where the data resides) to process vast amounts of distributed data
(multi-terabyte datasets) in parallel, in a reliable and fault-tolerant manner, and then finally return the
required information.
The following are the concepts/programs involved in a MapReduce job across its phases; the driver
sketch after this list shows where each one is wired into a job.
• Driver/Client Program (Configured and Tool).
• InputFormat (Input Split and Record Reader).
• Custom Key (WritableComparable).
• Mapper (Mapper).
• Combiner (Reducer).
• Partitioner (Partitioner).
• Secondary Sort (WritableComparator).
• Group by Comparator (WritableComparator).
• Reducer.
• OutputFormat.
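To see where each of these components plugs into a job, here is a minimal driver sketch using the new (mapreduce) API. The stock TextInputFormat, TextOutputFormat, and HashPartitioner are real library classes; the identity Mapper/Reducer and the commented-out comparator calls are placeholders to be replaced with your own implementations, not classes taken from these slides.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "example job");
    job.setJarByClass(ExampleDriver.class);

    job.setInputFormatClass(TextInputFormat.class);   // InputFormat / RecordReader
    job.setMapperClass(Mapper.class);                 // replace with your Mapper
    job.setCombinerClass(Reducer.class);              // Combiner (itself a Reducer)
    job.setPartitionerClass(HashPartitioner.class);   // Partitioner
    // job.setSortComparatorClass(...);               // secondary sort (WritableComparator)
    // job.setGroupingComparatorClass(...);           // group-by comparator (WritableComparator)
    job.setReducerClass(Reducer.class);               // replace with your Reducer
    job.setOutputFormatClass(TextOutputFormat.class); // OutputFormat

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ExampleDriver(), args));
  }
}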
Map Reduce
Input Format/ Record Reader
InputFormat: To read the data to be processed, Hadoop provides the
InputFormat abstraction, which has the following responsibilities:
• Compute the input splits of the data.
• Provide the logic to read each input split.
• Based on the InputFormat configuration, the JobTracker validates
the file input path and the input format class.
• Record Reader: It reads an individual input split (chunk) record by record (for
text input, line by line) and calls the Mapper's map() method, passing the key (the
byte offset of the line in the file) and the value (the complete line text) as a pair.
A sketch of this read loop follows below.
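This behaviour can be seen in the new API, where Mapper.run() drives the per-record loop. The following is an approximate sketch of its shape, not the exact framework source; the Context wraps the RecordReader for the current input split.

// Approximate shape of org.apache.hadoop.mapreduce.Mapper.run()
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {      // RecordReader advances to the next record
    map(context.getCurrentKey(),        // key: byte offset of the line in the file
        context.getCurrentValue(),      // value: the complete line of text
        context);
  }
  cleanup(context);
}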
Map Phase
• Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or
multiple output key/value pairs for each input key/value pair.
• When the map function starts producing output, it is not simply written straight to
disk; the output is buffered in memory and presorted. Each map task writes its output
to a circular memory buffer (default size 100 MB) assigned to it. When the contents of
the buffer reach a certain threshold size, a background thread starts to spill the contents
to disk. Map outputs continue to be written to the buffer while the spill takes place,
but if the buffer fills up during this time, the map blocks until the spill is complete.
Before writing to disk, the thread first divides the data into partitions corresponding
to the reducers that they will ultimately be sent to.
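The buffer size and spill threshold are configurable. A minimal sketch, assuming Hadoop 2.x property names (older releases used io.sort.mb and io.sort.spill.percent); the values shown are the usual defaults.

import org.apache.hadoop.conf.Configuration;

// Tune the circular sort buffer described above.
Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 100);            // in-memory buffer size, in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // spill when the buffer is 80% full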
Map Phase
Example of parsing a TSV dataset with a Mapper:
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40

(Key, Value) -> (14, "1979 23 23 2 43 24 25 26 26 26 26 25 26 25")
Mapper
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Old (mapred) API mapper; the class name here is illustrative.
public class AvgPriceMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String lasttoken = null;
    // First tab-separated token is the year; keep only the last column.
    StringTokenizer s = new StringTokenizer(line, "\t");
    String year = s.nextToken();
    while (s.hasMoreTokens()) {
      lasttoken = s.nextToken();
    }
    int avgprice = Integer.parseInt(lasttoken);
    output.collect(new Text(year), new IntWritable(avgprice));
  }
}
Mapper
• 0067011990999991950051507004...9999999N9+00001+99999999999...
• 0043011990999991950051512004...9999999N9+00221+99999999999...
• 0043011990999991950051518004...9999999N9-00111+99999999999...
• 0043012650999991949032412004...0500001N9+01111+99999999999...
• 0043012650999991949032418004...0500001N9+00781+99999999999...

• (Key, Value) -> (63, “0067011990999991950051507004...9999999N9+00001+99999999999...”)


• (Key, Value) -> (126, “0043011990999991950051512004...9999999N9+00221+99999999999...”)
Mapper
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

// New (mapreduce) API mapper for the weather records above.
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999; // sentinel for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
MapReduce Terminologies
■ PayLoad - Applications implement the Map and Reduce functions and form the
core of the job.
■ Mapper - Maps the input key/value pairs to a set of intermediate key/value
pairs.
■ NameNode - Node that manages the Hadoop Distributed File System (HDFS).
■ DataNode - Node where the data resides before any processing takes
place.
■ MasterNode - Node where the JobTracker runs and which accepts job requests from
clients.
■ SlaveNode - Node where the Map and Reduce programs run.
■ JobTracker - Schedules jobs and tracks the assigned tasks on the TaskTrackers.
■ TaskTracker - Tracks its tasks and reports status to the JobTracker.
■ Job - An execution of a Mapper and Reducer across a dataset.
■ Task - An execution of a Mapper or a Reducer on a slice of data.
■ Task Attempt - A particular instance of an attempt to execute a task on a
SlaveNode.
