Table of Contents
1. Introduction
2. Key Ideas Behind MapReduce
3. What is MapReduce?
4. Hadoop implementation of MapReduce
5. Anatomy of a MapReduce Job Run
5.1. Job Submission
5.2. Job Initialization
5.3. Task Assignment
5.4. Task Execution
5.5. Progress and Status Updates
5.6. Job Completion
6. Shuffle and Sort in Hadoop
7. MapReduce example: Weather Dataset
1. Introduction
Many scientific applications must handle data that no longer fits on a single
cost-effective computer. Besides observational scientific data, experiments such as simulations are
creating vast data stores that require new scientific methods to analyze and organize the data.
Parallel/distributed processing of data-intensive applications typically involves partitioning the
data into multiple segments that can be processed independently, in parallel, by the same
executable application program on an appropriate computing platform, and then reassembling
the partial results to produce the complete output data.
With MapReduce, a programmer is able to focus on the problem that needs to be solved, since only the
map and reduce functions have to be implemented; the framework takes on the burden of the
lower-level mechanisms that control the data flow.
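To make this concrete, the following is a minimal word-count sketch in the style of the Hadoop Java MapReduce API; the class names WordCountMapper and WordCountReducer are illustrative, not taken from the text above. The programmer supplies only these two functions, and the framework handles partitioning the input, scheduling the tasks, and grouping the intermediate values by key.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            // Sum the per-mapper counts for this word.
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }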
2. Key Ideas Behind MapReduce
Assume failures are common. A well-designed, fault-tolerant service must cope with failures up to a
point without impacting the quality of service; failures should not result in inconsistencies or
indeterminism from the user's perspective. As servers go down, other cluster nodes should seamlessly step
in to handle the load, and overall performance should degrade gracefully as server failures pile up. Just as
important, a broken server that has been repaired should be able to seamlessly rejoin the service without
manual reconfiguration by the administrator. Mature implementations of the MapReduce programming
model are able to cope robustly with failures through a number of mechanisms, such as automatic task
restarts on different cluster nodes.
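As one concrete illustration of such a mechanism, a Hadoop job can bound how many times a failed task is re-attempted before the whole job is declared failed. The sketch below uses the standard mapreduce.*.maxattempts configuration properties from the Hadoop 2.x API; the values and the job name are illustrative.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RetryConfigSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Re-run a failed map or reduce task up to 4 times; each new
            // attempt is normally scheduled on a different cluster node.
            conf.setInt("mapreduce.map.maxattempts", 4);
            conf.setInt("mapreduce.reduce.maxattempts", 4);
            Job job = Job.getInstance(conf, "fault-tolerant-job");
            // ... set mapper, reducer, input and output paths as usual ...
        }
    }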
Move processing to the data. In traditional high-performance computing (HPC) applications (e.g., for
climate or nuclear simulations), it is commonplace for a supercomputer to have processing nodes and
storage nodes linked together by a high-capacity interconnect. Many data-intensive workloads are not
very processor-demanding, which means that the separation of compute and storage creates a bottleneck
in the network.
As an alternative to moving data around, it is more efficient to move the processing around. That is,
MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup,
we can take advantage of data locality by running code on the processor directly attached to the block of
data we need. The distributed file system is responsible for managing the data over which MapReduce
operates.
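A scheduler can exploit this co-location because each input split reports the hosts that store its data. The following is a minimal sketch of locality-aware task placement under that assumption: InputSplit.getLocations() is part of the real Hadoop API, while the LocalityScheduler class itself is hypothetical.

    import java.io.IOException;
    import java.util.Set;

    import org.apache.hadoop.mapreduce.InputSplit;

    public class LocalityScheduler {
        // Prefer a node that already holds a replica of the split's block.
        static String chooseNode(InputSplit split, Set<String> idleNodes)
                throws IOException, InterruptedException {
            for (String host : split.getLocations()) {
                if (idleNodes.contains(host)) {
                    return host;  // data-local: run the code next to the block
                }
            }
            // No idle node holds the data: fall back to any idle node and
            // accept the cost of moving the block over the network.
            return idleNodes.iterator().next();
        }
    }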
Process data sequentially and avoid random access. Data-intensive processing by definition means that
the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk
access are fundamentally limited by the mechanical nature of the devices: read heads can only move so
fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access and
instead organize computations so that data is processed sequentially. A simple scenario poignantly
illustrates the large performance gap between sequential operations and random seeks: assume a 1
terabyte database containing