Table of Contents
1. Introduction
2. Key Ideas Behind MapReduce
3. What is MapReduce?
4. Hadoop implementation of MapReduce
5. Anatomy of a MapReduce Job Run
5.1. Job Submission
5.2. Job Initialization
5.3. Task Assignment
5.4. Task Execution
5.5. Progress and Status Updates
5.6. Job Completion
6. Shuffle and Sort in Hadoop
7. MapReduce example: Weather Dataset


1. Introduction
Many scientific applications require processes for handling data that no longer fits on a single cost-effective computer. Besides scientific data, experiments such as simulations are creating vast data stores that require new scientific methods to analyze and organize the data.
Parallel/distributed processing of data-intensive applications typically involves partitioning or
subdividing the data into multiple segments which can be processed independently using the same
executable application program in parallel on an appropriate computing platform, then reassembling
the results to produce the completed output data.
A MapReduce programmer is able to focus on the problem that needs to be solved, since only the map and reduce functions need to be implemented; the framework takes care of the lower-level mechanisms that control the data flow, a burden the programmer would otherwise have to deal with.
2. Key Ideas Behind MapReduce
Assume failures are common. A well-designed, fault-tolerant service must cope with failures up to a point without impacting the quality of service; failures should not result in inconsistencies or indeterminism from the user's perspective. As servers go down, other cluster nodes should seamlessly step in to handle the load, and overall performance should gracefully degrade as server failures pile up. Just as important, a broken server that has been repaired should be able to seamlessly rejoin the service without manual reconfiguration by the administrator. Mature implementations of the MapReduce programming model are able to robustly cope with failures through a number of mechanisms such as automatic task restarts on different cluster nodes.
Move processing to the data. In traditional high-performance computing (HPC) applications (e.g., for
climate or nuclear simulations), it is commonplace for a supercomputer to have processing nodes and
storage nodes linked together by a high-capacity interconnect. Many data-intensive workloads are not
very processor-demanding, which means that the separation of compute and storage creates a bottleneck
in the network.
As an alternative to moving data around, it is more efficient to move the processing around. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data we need. The distributed file system is responsible for managing the data over which MapReduce operates.
Process data sequentially and avoid random access. Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access, and instead organize computations so that data is processed sequentially. A simple scenario poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1 terabyte database containing 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine. On the other hand, if one simply reads the entire database and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine. Sequential data access is, literally, orders of magnitude faster than random data access. The development of solid-state drives is unlikely to change this balance, for at least two reasons. First, the cost difference between traditional magnetic disks and solid-state disks remains substantial: large datasets will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain.
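To make the back-of-the-envelope comparison concrete, here is a rough version of the calculation, assuming an average seek time of about 10 ms and a sequential throughput of about 100 MB/s (illustrative figures, not taken from the original scenario). A 1 terabyte database of 100-byte records holds 10^10 records, so 1% is 10^8 records. Updating each record in place requires at least a read seek and a write seek: 2 × 10^8 seeks × 10 ms ≈ 2 × 10^6 seconds, roughly 23 days. Reading and rewriting the whole terabyte sequentially at 100 MB/s takes about 2 × 10^4 seconds, roughly 5.5 hours.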
MapReduce is primarily designed for batch processing over large datasets. To the extent possible,
all computations are organized into long streaming operations that take advantage of the aggregate
bandwidth of many disks in a cluster. Many aspects of MapReduce's design explicitly trade latency for throughput.
Hide system-level details from the application developer. According to many guides on the practice of
software engineering written by experienced industry professionals, one of the key reasons why writing
code is difficult is that the programmer must simultaneously keep track of many details in short-term memory, ranging from the mundane (e.g., variable names) to the sophisticated (e.g., a corner case of an algorithm that requires special treatment). This imposes a high cognitive load and requires intense concentration, which leads to a number of recommendations about a programmer's environment (e.g., quiet office, comfortable furniture, large monitors, etc.). The challenges in writing distributed software are greatly compounded: the programmer must manage details across several threads, processes, or machines. Of course, the biggest headache in distributed programming is that code runs concurrently in unpredictable orders, accessing data in unpredictable patterns. This gives rise to race conditions, deadlocks, and other well-known problems. Programmers are taught to use low-level devices such as mutexes and to apply high-level design patterns such as producer-consumer queues to tackle these
challenges, but the truth remains: concurrent programs are notoriously difficult to reason about and even
harder to debug.
MapReduce addresses the challenges of distributed programming by providing an abstraction that
isolates the developer from system-level details (e.g., locking of data structures, data starvation issues in
the processing pipeline, etc.). The programming model specifies simple and well-defined interfaces between a small number of components, and is therefore easy for the programmer to reason about.
MapReduce maintains a separation of what computations are to be performed and how those
computations are actually carried out on a cluster of machines. The first is under the control of the
programmer, while the second is exclusively the responsibility of the execution framework or runtime.
The advantage is that the execution framework only needs to be designed once and verified for correctness; thereafter, as long as the developer expresses computations in the programming model, code is guaranteed to behave as expected. The upshot is that the developer is freed from having to worry about
system-level details (e.g., no more debugging race conditions and addressing lock contention) and can
instead focus on algorithm or application design.
Seamless scalability. For data-intensive processing, it goes without saying that scalable algorithms are
highly desirable. As an aspiration, let us sketch the behavior of an ideal algorithm. We can define
scalability along at least two dimensions. First, in terms of data: given twice the amount of data, the same
algorithm should take at most twice as long to run, all else being equal. Second, in terms of resources:
given a cluster twice the size, the same algorithm should take no more than half as long to run.
Furthermore, an ideal algorithm would maintain these desirable scaling characteristics across a wide
range of settings: on data ranging from gigabytes to terabytes, on clusters consisting of a few to a few
thousand machines. Finally, the ideal algorithm would exhibit these desired behaviors without requiring
any modifications whatsoever, not even tuning of parameters. The truth is that most current algorithms
are far from the ideal. In the domain of text processing, for example, most algorithms today assume that
data fits in memory on a single machine. For the most part, this is a fair assumption. But what happens
when the amount of data doubles in the near future, and then doubles again shortly thereafter? Simply
buying more memory is not a viable solution, as the amount of data is growing faster than the price of
memory is falling. Furthermore, the price of a machine does not scale linearly with the amount of
available memory beyond a certain point (once again, the scaling up vs. scaling out argument). Quite simply, algorithms that require holding intermediate data in memory on a single machine will break on sufficiently large datasets; moving from a single machine to a cluster architecture requires fundamentally different algorithms.
Perhaps the most exciting aspect of MapReduce is that it represents a small step toward
algorithms that behave in the ideal manner discussed above. Recall that the programming model
maintains a clear separation between what computations need to occur and how those computations are actually orchestrated on a cluster. As a result, a MapReduce algorithm remains fixed, and it is the
responsibility of the execution framework to execute the algorithm. Amazingly, the MapReduce
programming model is simple enough that it is actually possible, in many circumstances, to approach the
ideal scaling characteristics discussed above. If running an algorithm on a particular dataset takes 100
machine hours, then we should be able to finish in an hour on a cluster of 100 machines, or use a cluster
of 10 machines to complete the same task in ten hours. With MapReduce, this isn't so far from the truth,
at least for some applications.
Data/code co-location. The phrase "data distribution" is misleading, since one of the key ideas behind MapReduce is to move the code, not the data. However, the more general point remains: in order for computation to occur, we need to somehow feed data to the code. In MapReduce, this issue is inextricably intertwined with scheduling and relies heavily on the design of the underlying distributed file
system. To achieve data locality, the scheduler starts tasks on the node that holds a particular block of
data (i.e., on its local drive) needed by the task. This has the effect of moving code to the data. If this is
not possible (e.g., a node is already running too many tasks), new tasks will be started elsewhere, and the
necessary data will be streamed over the network. An important optimization here is to prefer nodes that
are on the same rack in the datacenter as the node holding the relevant data block, since inter-rack
bandwidth is significantly less than intra-rack bandwidth.
Synchronization. In general, synchronization refers to the mechanisms by which multiple concurrently
running processes join up, for example, to share intermediate results or otherwise exchange state
information. In MapReduce, synchronization is accomplished by a barrier between the map and reduce
phases of processing. Intermediate key-value pairs must be grouped by key, which is accomplished by a
large distributed sort involving all the nodes that executed map tasks and all the nodes that will execute
reduce tasks. This necessarily involves copying intermediate data over the network, and therefore the
process is commonly known as shuffle and sort. A MapReduce job with m mappers and r reducers
involves up to m × r distinct copy operations, since each mapper may have intermediate output going to every reducer. Note that the reduce computation cannot start until all the mappers have finished emitting
key-value pairs and all intermediate key-value pairs have been shuffled and sorted, since the execution
framework cannot otherwise guarantee that all values associated with the same key have been gathered.
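For instance, a job with m = 1,000 mappers and r = 100 reducers may involve up to 1,000 × 100 = 100,000 distinct copy operations during the shuffle.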
Error and fault handling. The MapReduce execution framework must accomplish all the tasks above in
an environment where errors and faults are the norm, not the exception. Since MapReduce was explicitly
designed around low-end commodity servers, the runtime must be especially resilient. In large clusters,
disk failures are common and RAM experiences more errors than one might expect. Datacenters suffer from both planned outages (e.g., system maintenance and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.). And that's just hardware. No software is bug-free: exceptions must be appropriately trapped, logged, and recovered from. Large-data problems have a penchant for uncovering obscure corner cases in code that is otherwise thought to be bug-free. Furthermore, any sufficiently large dataset will contain corrupted data or records that are mangled beyond a programmer's imagination, resulting in errors that one would never think to check for or trap. The MapReduce execution
framework must thrive in this hostile environment.

3. What is MapReduce?

MapReduce is an emerging programming model for data-intensive applications, proposed by Google. MapReduce is utilized by Google and Yahoo to power their web search. MapReduce was first described in a research paper from Google. More than ten thousand distinct programs have been implemented using MapReduce at Google. MapReduce is designed to run jobs that last minutes or hours
on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth
interconnects. It works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
One of the most significant advantages of MapReduce is that it provides an abstraction that hides
many system-level details from the programmer. Therefore, a developer can focus on what computations
need to be performed, as opposed to how those computations are actually carried out or how to get the
data to the processes that depend on them. Like OpenMP and MPI, MapReduce provides a means to
distribute computation without burdening the programmer with the details of distributed computing (but
at a different level of granularity).
MapReduce borrows ideas from functional programming: the programmer defines Map and Reduce tasks to process large sets of distributed data. The key strengths of the MapReduce programming model are the high degree of parallelism combined with the simplicity of the programming model and its applicability to a large variety of application domains. This requires dividing the workload across a large number of machines. The degree of parallelism depends on the input data size. The map function processes the input pairs (key1, value1), returning some other intermediary pairs (key2, value2). Then the intermediary pairs are grouped together according to their key. Afterwards, each group is processed by the reduce function, which outputs new pairs of the form (key3, value3). The approach assumes that there are no dependencies between the input data. This makes it easy to parallelize the problem. The number of parallel reduce tasks is limited by the number of distinct "key" values which are emitted by the map function.
MapReduce usually also incorporates a framework which supports MapReduce operations. A master controls the whole MapReduce process. The MapReduce framework is responsible for load balancing, re-issuing tasks if a worker has failed or is too slow, etc. The master divides the input data into separate units, sends individual chunks of data to the mapper machines, and collects the information once a mapper is finished. Once the mappers are finished, the reducer machines are assigned work. All key/value pairs with the same key will be sent to the same reducer.

Fig 1. MapReduce Computational Model
MapReduce can refer to three distinct but related concepts. First, MapReduce is a programming
model, which is the sense discussed above. Second, MapReduce can refer to the execution framework
(i.e., the runtime) that coordinates the execution of programs written in this particular style. Finally,
MapReduce can refer to the software implementation of the programming model and the execution
framework: for example, Google's proprietary implementation vs. the open-source Hadoop
implementation in Java.
Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary
datasets. For a collection of web pages, keys may be URLs and values may be the actual HTML content.
For a graph, keys may represent node ids and values may contain the adjacency lists of those nodes.
In MapReduce, the programmer defines a mapper and a reducer with the following signatures:
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]
When we start a map/reduce workflow, the framework will split the input into segments, passing each
segment to a different machine. Each machine then runs the mapper on the portion of data attributed to it.
The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs. The reducer is applied to all values
associated with the same intermediate key to generate output key-value pairs. Implicit between the map
and reduce phases is a distributed group by operation on intermediate keys. Intermediate data arrive at
each reducer in order, sorted by the key. However, no ordering relationship is guaranteed for keys across
different reducers. Output key-value pairs from each reducer are written persistently back onto the
distributed file system (whereas intermediate key-value pairs are transient and not preserved). The output
ends up in r files on the distributed file system, where r is the number of reducers.

The diagram below illustrates the overall MapReduce word count process.


Fig 2. The overall MapReduce word count process

A simple word count algorithm in MapReduce is shown in Figure 2. This algorithm counts the
number of occurrences of every word in a text collection. The mapper takes an input key-value pair,
tokenizes the document, and emits an intermediate key-value pair for every word: the word itself serves
as the key, and the integer one serves as the value (denoting that we've seen the word once).


Fig 3. Pseudo-code for the word count algorithm in MapReduce

The MapReduce execution framework guarantees that all values associated with the same key are
brought together in the reducer. Therefore, in our word count algorithm, we simply need to sum up all
counts (ones) associated with each word. The reducer does exactly this, and emits final key-value pairs
with the word as the key, and the count as the value. Final output is written to the distributed file system,
one file per reducer. Words within each file will be sorted alphabetically, and each file will contain
roughly the same number of words.
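Since Figure 3 is reproduced only as an image, the following is a minimal Hadoop-style sketch of the same word count algorithm; the class names and tokenization details are illustrative rather than taken from the figure.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every token in the input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: sum the ones associated with each word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    context.write(key, new IntWritable(sum));
  }
}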

Fig 4. Execution overview

Figure 4 shows the overall flow of a MapReduce operation in Google's implementation. When the user
program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in
Figure 4 correspond to the numbers in the list below):

1. The MapReduce library in the user program first splits the input files into M pieces of typically
16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional
parameter). It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned
work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle
workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses
key/value pairs out of the input data and passes each pair to the user-defined Map function. The
intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the
partitioning function. The locations of these buffered pairs on the local disk are passed back to the
master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure
calls to read the buffered data from the local disks of the map workers. When a reduce worker has
read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same
key are grouped together. The sorting is needed because typically many different keys map to the
same reduce task. If the amount of intermediate data is too large to fit in memory, an external
sort is used.
6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key
encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this
reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user
program. At this point, the MapReduce call in the user program returns back to the user code.

After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these R output files into one file; they often pass these files as input to another MapReduce call, or use
them from another distributed application that is able to deal with input that is partitioned into multiple
files.
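As a rough illustration, splitting a 10 GB input into 64 MB pieces yields M = 10,240 / 64 = 160 map tasks; if the user requests R = 4 reduce tasks, the run produces 4 output files.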


4. Hadoop implementation of MapReduce

Hadoop is an open-source MapReduce framework from Apache for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides the Hadoop Distributed File System (HDFS), which stores data on the compute nodes, providing a very high aggregate bandwidth across the cluster. HDFS is the primary storage system used by Hadoop applications. The MapReduce framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
Hadoop commonly refers to the main component of the platform, the one on top of which the others offer high-level services. This is the storage framework together with the processing framework, formed by the Hadoop Distributed Filesystem library, the MapReduce library, and a core library, all working together. This was the first project, and it paved the way for the others to work.
Those are: HBase (a columnar database), Hive (a data mining tool), Pig (scripting), and Chukwa (log analysis); they are all subject to the availability of the platform. Then we have ZooKeeper (a coordination service), which is independent of Hadoop availability and is used by HBase, and Avro (serialization/deserialization), designed to support the main service components' requirements.





Figure 5. A global view of the framework's subproject dependencies

Avro: A serialization system for efficient, cross-language RPC, and persistent data storage.
Pig: A data flow language and execution environment for exploring very large datasets. Pig runs
on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query
language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for
querying the data.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage
and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications.
Sqoop: A tool for efficiently moving data between relational databases and HDFS.
Mahout: A scalable machine learning and data mining library.

The Google File System (GFS) supports Google's proprietary implementation of MapReduce; in the open-source world, HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop. Although MapReduce doesn't necessarily require a distributed file system, it is difficult to realize many of the advantages of the programming model without a storage substrate that behaves much like the DFS.
Hadoop Distributed File System is designed to reliably store very large files across machines in
a large cluster. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
The distributed file system adopts a master-slave architecture in which the master maintains the file namespace (metadata, directory structure, file-to-block mapping, location of blocks, and access permissions) and the slaves manage the actual data blocks. In GFS, the master is called the GFS master, and the slaves are called GFS chunkservers. In Hadoop, the same roles are filled by the namenode and datanodes, respectively.
In HDFS, an application client wishing to read a file (or a portion thereof) must first contact the namenode to determine where the actual data is stored. In response to the client request, the namenode returns the relevant block id and the location where the block is held (i.e., which datanode). The client then contacts the datanode to retrieve the data. Blocks are themselves stored on standard single-machine file systems, so HDFS lies on top of the standard OS stack (e.g., Linux). An important feature of the design is that data is never moved through the namenode. Instead, all data transfer occurs directly between clients and datanodes; communication with the namenode only involves the transfer of metadata. By default, HDFS stores three separate copies of each data block to ensure reliability, availability, and performance. To create a new file and write data to HDFS, the application client first contacts the namenode, which updates the file namespace after checking permissions and making sure the file doesn't already exist. The namenode allocates a new block on a suitable datanode, and the application is directed to stream data directly to it. From the initial datanode, data is further propagated to additional replicas.
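As an illustration of the read path just described, a minimal client sketch using the Hadoop FileSystem API is shown below; the path is a placeholder, and the namenode lookup and the direct transfer from the datanodes happen behind the open() call.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up the cluster settings (core-site.xml)
    FileSystem fs = FileSystem.get(conf);       // client handle to HDFS
    InputStream in = null;
    try {
      // open() asks the namenode for block locations,
      // then streams the data directly from the datanodes.
      in = fs.open(new Path("/user/example/input.txt"));  // placeholder path
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}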
The architecture of a complete Hadoop cluster is shown in Figure 6.

Figure 6. Architecture of a complete Hadoop cluster
The NameNode coordinates almost all read/write and access operations between clients and the DataNodes in the cluster; the DataNodes store, read, and write the information, while the BackupNode is in charge of accelerating some heavy operations like boot-up and ensuring failover data recovery, among others. In MapReduce, the JobTracker coordinates everything about deploying application tasks over the DataNodes, as well as summarizing their results, and the TaskTracker processes running on them receive these tasks and execute them.
There are some differences between the Hadoop implementation of MapReduce and Google's implementation. In Hadoop, the reducer is presented with a key and an iterator over all values associated with the particular key. The values are arbitrarily ordered. Google's implementation allows the programmer to specify a secondary sort key for ordering the values (if desired), in which case values associated with each key would be presented to the developer's reduce code in sorted order. Another difference: in Google's implementation the programmer is not allowed to change the key in the reducer. That is, the reducer output key must be exactly the same as the reducer input key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary number of output key-value pairs (with different keys).
In Hadoop, a mapper object is initialized for each map task (associated with a particular sequence
of key-value pairs called an input split) and the Map method is called on each key-value pair by the
execution framework. In configuring a MapReduce job, the programmer provides a hint on the number of
map tasks to run, but the execution framework makes the final determination based on the physical layout
of the data .The situation is similar for the reduce phase :a reducer object is initialized for each reduce
task, and the Reduce method is called once per intermediate key. In contrast with the number of map
tasks, the programmer can precisely specify the number of reduce tasks.
The reducer in MapReduce receives all values associated with the same key at once. However, it
is possible to start copying intermediate key-value pairs over the network to the nodes running the
reducers as soon as each mapper finishes; this is a common optimization and is implemented in Hadoop.

A Hadoop MapReduce job is divided up into a number of map tasks and reduce tasks.
Tasktrackers periodically send heartbeat messages to the jobtracker, which also double as a vehicle for task allocation. If a tasktracker is available to run tasks (in Hadoop parlance, has empty task slots), the return acknowledgment of the tasktracker heartbeat contains task allocation information. The number of reduce tasks is equal to the number of reducers specified by the programmer. The number of map tasks, on the other hand, depends on many factors: the number of mappers specified by the programmer serves as a hint to the execution framework, but the actual number of tasks depends on both the number of input files and the number of HDFS data blocks occupied by those files. Each map task is assigned a sequence of input key-value pairs, called an input split in Hadoop. Input splits are computed automatically and the execution framework strives to align them to HDFS block boundaries so that each map task is associated with a single data block. In scheduling map tasks, the jobtracker tries to take advantage of data locality: if possible, map tasks are scheduled on the slave node that holds the input split, so that the mapper will be processing local data.
In Hadoop, mappers are Java objects with a Map method (among others). A mapper object is
instantiated for every map task by the tasktracker. The life-cycle of this object begins with instantiation,
where a hook is provided in the API to run programmer-specified code. This means that mappers can read
in side data, providing an opportunity to load state, static data sources, dictionaries, etc. After
initialization, the Map method is called (by the execution framework) on all key-value pairs in the input
split. Since these method calls occur in the context of the same Java object, it is possible to preserve state
across multiple input key-value pairs within the same map task. After all key-value pairs in the input split
have been processed, the mapper object provides an opportunity to run programmer-specified termination
code.
The actual execution of reducers is similar to that of the mappers. Each reducer object is
instantiated for every reduce task. The Hadoop API provides hooks for programmer-specified initialization and termination code. After initialization, for each intermediate key in the partition (defined by the partitioner), the execution framework repeatedly calls the Reduce method with an intermediate key and an iterator over all values associated with that key. The programming model also guarantees that intermediate keys will be presented to the Reduce method in sorted order. The process is transactional: map or reduce tasks that are not executed (for example, because of data availability issues) will be reattempted a number of times, and then redistributed to other nodes.


Figure 7. Hadoop MapReduce lifecycle



5. Anatomy of a MapReduce Job Run
This section uncovers the steps Hadoop takes to run a job. The whole process is illustrated in Figure 8. At the highest level, there are four independent entities:
The client, which submits the MapReduce job.
The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main
class is JobTracker.
The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java
applications whose main class is TaskTracker.
The distributed filesystem, which is used for sharing job files between the other entities.

Figure 8. Anatomy of a MapReduce Job Run

5.1. Job Submission

The runJob() method on JobClient is a convenience method that creates a new JobClient instance and calls submitJob() on it (step 1 in Figure 8). Having submitted the job, runJob() polls the job's progress once a second, and reports the progress to the console if it has changed since the last report.
When the job is complete, if it was successful, the job counters are displayed. Otherwise, the error that
caused the job to fail is logged to the console.
The job submission process implemented by JobClient's submitJob() method does the following:
Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
Checks the output specification of the job. For example, if the output directory has not been
specified or it already exists, the job is not submitted and an error is thrown to the MapReduce
program.
Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), then the job is not submitted and an error is thrown to the MapReduce program.
Copies the resources needed to run the job, including the job JAR file, the configuration file and
the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The
job JAR is copied with a high replication factor (controlled by the mapred.submit.replication
property, which defaults to 10) so that there are lots of copies across the cluster for the
tasktrackers to access when they run tasks for the job (step 3).
Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker)
(step 4).
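Putting these steps together, a client driver that triggers this submission path through runJob() with the old org.apache.hadoop.mapred API might look roughly as follows. The sketch uses the real IdentityMapper and IdentityReducer classes so that it is self-contained; the job name and paths are illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class OldApiDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(OldApiDriver.class);
    conf.setJobName("example job");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(IdentityMapper.class);   // pass-through map, for illustration
    conf.setReducerClass(IdentityReducer.class); // pass-through reduce, for illustration
    conf.setOutputKeyClass(LongWritable.class);  // TextInputFormat keys are byte offsets
    conf.setOutputValueClass(Text.class);        // values are the input lines

    conf.setNumReduceTasks(2);                   // optional: fixes mapred.reduce.tasks

    // runJob() creates a JobClient, calls submitJob(), and then polls the
    // job's progress once a second until it completes.
    JobClient.runJob(conf);
  }
}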

5.2. Job Initialization

When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue
from where the job scheduler will pick it up and initialize it. Initialization involves creating an object to
represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of the tasks' status and progress (step 5). To create the list of tasks to run, the job scheduler first retrieves the
input splits computed by the JobClient from the shared filesystem (step 6). It then creates one map task
for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks
property in the JobConf, which is set by the setNumReduceTasks() method, and the scheduler
simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
5.3. Task Assignment

Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker.
Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages. As
a part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is, the
jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return value
(step 7). Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task
from.
Having chosen a job, the jobtracker now chooses a task for the job. Tasktrackers have a fixed
number of slots for map tasks and for reduce tasks: for example, a tasktracker may be able to run two map
tasks and two reduce tasks simultaneously. (The precise number depends on the number of cores and the amount of memory on the tasktracker.) The default scheduler fills empty map task slots before reduce
task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a map task;
otherwise, it will select a reduce task.
To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks, since there are no data locality considerations. For a map task, however, it takes account of the tasktracker's network location and picks a task whose input split is as close as possible to the tasktracker.
In the optimal case, the task is data-local, that is, running on the same node that the split resides on.
Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split. Some
tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they
are running on.
5.4. Task Execution

Now that the tasktracker has been assigned a task, the next step is for it to run the task. First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed by the application from the distributed cache to the local disk. Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this directory. Third, it creates an instance of TaskRunner to run the task.
TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don't affect the tasktracker (by causing it to crash or hang, for example). The child process communicates with its parent through the umbilical interface. This way it informs the parent of the task's progress every few seconds until the task is complete.
5.5. Progress and Status Updates

MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run. Because this is a significant length of time, it's important for the user to get feedback on how the job is progressing. A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code).
When a task is running, it keeps track of its progress, that is, the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex, but the system can still estimate the proportion of the reduce input processed. It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle. For example, if the task has run the reducer on half its input, then the task's progress is 5/6, since it has completed the copy and sort phases (1/3 each) and is halfway through the reduce phase (1/6).
If a task reports progress, it sets a flag to indicate that the status change should be sent to the
tasktracker. The flag is checked in a separate thread every three seconds, and if set it notifies the
tasktracker of the current task status. Meanwhile, the tasktracker is sending heartbeats to the jobtracker
every five seconds (this is a minimum, as the heartbeat interval is actually dependent on the size of the
cluster: for larger clusters, the interval is longer), and the status of all the tasks being run by the
tasktracker is sent in the call.
The jobtracker combines these updates to produce a global view of the status of all the jobs being
run and their constituent tasks. Finally, as mentioned earlier, the JobClient receives the latest status by polling the jobtracker every second. Clients can also use JobClient's getJob() method to obtain a RunningJob instance, which contains all of the status information for the job.
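As a sketch of the status polling described here, a client can ask the jobtracker about a running job along the following lines; the job ID string is a placeholder, and the cluster configuration is assumed to be available on the classpath.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class JobStatusProbe {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    // Placeholder job ID; a real one looks like job_200904110811_0002.
    RunningJob job = client.getJob("job_200904110811_0002");
    System.out.println("map progress:    " + job.mapProgress());
    System.out.println("reduce progress: " + job.reduceProgress());
    System.out.println("complete:        " + job.isComplete());
  }
}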

5.6. Job Completion

When the jobtracker receives a notification that the last task for a job is complete, it changes the
status for the job to successful. Then, when the JobClient polls for status, it learns that the job has
completed successfully, so it prints a message to tell the user, and then returns from the runJob() method.
The jobtracker also sends an HTTP job notification if it is configured to do so. This can be configured by clients wishing to receive callbacks, via the job.end.notification.url property. Last, the
jobtracker cleans up its working state for the job, and instructs tasktrackers to do the same (so
intermediate output is deleted, for example).
6. Shuffle and Sort in Hadoop
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by
which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the
shuffle. The shuffle is an area of the codebase where refinements and improvements are continually
being made. In many ways, the shuffle is the heart of MapReduce, and is where the magic happens.
The following figure illustrates the shuffle and sort phase:

Figure 9. Shuffle and sort in MapReduce
Map side
Map outputs are buffered in memory in a circular buffer.
When the buffer reaches a threshold, its contents are spilled to disk.
Spills are merged into a single, partitioned file (sorted within each partition); the combiner runs here (see the sketch after this list).
Reduce side
First, map outputs are copied over to the reducer machine.
The sort is a multi-pass merge of map outputs (happening in memory and on disk); the combiner runs here too.
The final merge pass goes directly into the reducer.
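Because the combiner is invoked by the framework during these spill and merge steps, it only runs if the job registers one. Below is a minimal sketch of a driver that reuses the word-count reducer sketched earlier as the combiner, a common pattern when the reduce function is commutative and associative; the class names are the illustrative ones from that sketch.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(WordCountWithCombiner.class);
    job.setJobName("word count with combiner");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordCountMapper.class);
    // The reducer doubles as the combiner: partial sums are computed
    // during the map-side spill and the reduce-side merge.
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}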

















7. MapReduce example: Weather Dataset

Create a program that mines weather data
Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data. Source: NCDC
The data is stored using a line-oriented ASCII format, in which each line is a record
Mission - calculate the maximum temperature each year around the world
Problem - millions of temperature measurement records

Figure 10. NCDC raw data
For our example, we will write a program that mines weather data. Weather sensors collecting data
every hour at many locations across the globe gather a large volume of log data, which is a good
candidate for analysis with MapReduce, since it is semistructured and record-oriented.
The data we will use is from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/).
The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports
a rich set of meteorological elements, many of which are optional or have variable data lengths. For simplicity, we shall focus on the basic elements, such as temperature, which are always present and are of fixed width. Figure 11 shows a sample line with some of the salient fields highlighted. The line has been split into multiple lines to show each field: in the real file, fields are packed into one line with no delimiters.

Figure 11. NCDC raw data
Data files are organized by date and weather station. There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year. The whole dataset is made up of a large number of relatively small files since there are tens of thousands of weather stations. The data was preprocessed so that each year's readings were concatenated into a single file.
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Both phases have key-value pairs as input and output. Programmers have to specify two functions: the map function and the reduce function.
The input to the map phase is the raw NCDC data. Here, the key is the offset of the beginning of the
line and the value is each line of the data set. The map function pulls out the year and the air temperature
from each input value.
The reduce function takes <year, temperature> pairs as input and produces the maximum
temperature for each year as the result.
To visualize the way the map works, consider the following sample lines of input data.
Original NCDC Format

Input file for the map function, stored in HDFS

Output of the map function, running in parallel for each block

The output from the map function is processed by the MapReduce framework before being sent
to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the
example, our reduce function sees the following input:

Each year appears with a list of all its air temperature readings. All the reduce function has to do now is
iterate through the list and pick up the maximum reading. This is the final output: the maximum global
temperature recorded in each year.
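To make this flow concrete, here is a small worked example with hypothetical readings (the values are invented for illustration, not taken from the NCDC files).
Map output, emitted in parallel by the mappers as (year, temperature) pairs:
(1949, 111), (1949, 78), (1950, 0), (1950, 22), (1950, -11)
The shuffle and sort groups the values by key, so the reduce function sees:
(1949, [111, 78])
(1950, [0, 22, -11])
Picking the maximum of each list gives the final output:
(1949, 111)
(1950, 22)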


The whole data flow





Start the local hadoop cluster
Open five Cygwin windows and arrange them in a similar fashion as shown below.
1. Start the namenode in the first window by executing:
cd hadoop-0.19.1
bin/hadoop namenode
2. Start the secondary namenode in the second window by executing:
cd hadoop-0.19.1
bin/hadoop secondarynamenode
3. Start the job tracker in the third window by executing:
cd hadoop-0.19.1
bin/hadoop jobtracker
4. Start the data node in the fourth window by executing:
cd hadoop-0.19.1
bin/hadoop datanode
5. Start the task tracker in the fifth window by executing:
cd hadoop-0.19.1
bin/hadoop tasktracker


Figure 12. Start the local hadoop cluster



Having run through how the MapReduce program works, the next step is to express it in code.
We need three things: a map function, a reduce function, and some code to run the job. The map function
is represented by the Mapper class, which declares an abstract map() method. Figure 13 shows the implementation of our map method.
Map function

Fig 13. Mapper for maximum temperature example
The Mapper class is a generic type, with four formal type parameters that specify the input key,
input value, output key, and output value types of the map function. For the present example, the input
key is a long integer offset, the input value is a line of text, the output key is a year, and the output value
is an air temperature (an integer). Rather than use built-in Java types, Hadoop provides its own set of
basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io
package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and
IntWritable (like Java Integer).
The map() method is passed a key and a value. We convert the Text value containing the line of
input into a Java String, then use its substring() method to extract the columns we are interested in. The
map() method also provides an instance of Context to write the output to. In this case, we write the year as
a Text object (since we are just using it as a key), and the temperature is wrapped in an IntWritable.
We write an output record only if the temperature is present and the quality code indicates the
temperature reading is OK.
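Figure 13 is reproduced only as an image, so the sketch below shows roughly what such a mapper looks like. The column offsets (15-19 for the year, 87-92 for the temperature, 92-93 for the quality code) and the missing-value sentinel 9999 are assumptions about the NCDC fixed-width format rather than values confirmed by the text.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;  // assumed NCDC sentinel for "no reading"

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);              // assumed year columns
    int airTemperature;
    if (line.charAt(87) == '+') {                      // parseInt does not accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);           // assumed quality-code column
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}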
The reduce function is similarly defined using a Reducer, as illustrated in Figure 14.
Reduce function

Fig 14. Reducer for maximum temperature example
Again, four formal type parameters are used to specify the input and output types, this time for
the reduce function. The input types of the reduce function must match the output types of the map
function: Text and IntWritable. And in this case, the output types of the reduce function are Text and
IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures
and comparing each with a record of the highest found so far.
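Again, since Figure 14 is an image, here is a sketch of the reducer the text describes; the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    // Iterate through all temperatures for this year and keep the largest.
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}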
The third piece of code runs the MapReduce job (see Figure 15).
Main function for running the MapReduce job
A Job object forms the specification of the job. It gives you control over how the job is run. When we
run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop will distribute
around the cluster). Rather than explicitly specify the name of the JAR file, we can pass a class to the Job's setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.
Having constructed a Job object, we specify the input and output paths. An input path is specified by
calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory (in
which case, the input forms all the files in that directory), or a file pattern. As the name suggests,
addInputPath() can be called more than once to use input from multiple paths. The output path (of
which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It
specifies a directory where the output files from the reducer functions are written. The directory shouldn't exist before running the job, as Hadoop will complain and not run the job. This precaution is to prevent
data loss.
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods. The setOutputKeyClass() and setOutputValueClass() methods control the
output types for the map and the reduce functions, which are often the same, as they are in our case. If
they are different, then the map output types can be set using the methods
setMapOutputKeyClass() and setMapOutputValueClass(). The input types are controlled via the input format, which we have not explicitly set since we are using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to run the job. The waitForCompletion() method on Job submits the job and waits for it to finish. The method's boolean argument is a verbose flag, so in this case the job writes information about its progress to the console.

Fig 15. Application to find the maximum temperature in the weather dataset
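Since Figure 15 is also reproduced as an image, the sketch below shows roughly what such a driver looks like, following the calls described above; the class name, job name, and the mapper and reducer classes are the illustrative ones used earlier.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();                          // newer releases use Job.getInstance()
    job.setJarByClass(MaxTemperature.class);      // locate the JAR containing this class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}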
A test run


The output from running the job provides some useful information. For example, we can
see that the job was given an ID of job_local_0009, and it ran one map task and one reduce task.
Knowing the job and task IDs can be very useful when debugging MapReduce jobs.
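A hypothetical invocation of such a job (the JAR name, class name, and paths are placeholders, not taken from the test run above) would look something like:
bin/hadoop jar max-temperature.jar MaxTemperature input/sample.txt output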

Output in HDFS




Map Reduce Chart
