
(Hadoop) MapReduce: More Power, Less Code

This is an introduction to MapReduce as a framework for beginners in distributed computing, and a short tutorial on Hadoop, an open source MapReduce implementation. It is intended for anyone with significant experience in programming and a flair for distributed systems.
MapReduce is a framework, a pattern, and a programming paradigm that allows us to carry out computations over several terabytes of data in remarkably little time. When it comes to massive-scale architecture and huge amounts of data, with built-in fault tolerance, there's nothing better than this. But when we come to define MapReduce programming, it is basically just a combination of two functions: a map function and a reduce function. This shows not just the simplicity the framework exposes in terms of the effort required of the programmer, but also the sheer power and flexibility of the code that runs under the hood.
It was Google that first introduced MapReduce as a framework. It is used to build the indexes for Google Web searches, and handles many petabytes of data every day, with programs executed on large-scale clusters. You might not notice much of a performance gain when processing a limited amount of data on a limited number of machines, but if you do dream of becoming that big one day (and of course you do!), then this is the way to go. However, this should not push you into thinking that MapReduce is effective only for large datasets and data-intensive computations; what can be more important here is that programmers without any experience of parallel and distributed systems can easily use distributed resources with the help of MapReduce programming.
Here I would like to point out that there is a difference between distributed computing and parallel computing. Although parallel computing can be seen as a modified form of distributed computing, parallel computing generally refers to a large number of processors sharing the same memory or the same disks, while distributed computing increasingly refers to a cluster of nodes, where each node is an independent unit with its own memory and disk. These nodes could be any computers or laptops you find lying around. This is why the days of dedicated high-performance supercomputers are gone. These days, the computational power of a carefully tuned and configured cluster of such machines can easily match, or even exceed, that of several supercomputers you might have heard of. Another advantage of such clusters is that they scale almost linearly, so you don't have to go out and buy a bigger and better server to get more performance.
Use cases
MapReduce is a good fit for problems that can easily be divided into a number of smaller pieces, which can then be solved independently. The data is ideally (but not necessarily) in the form of lists, or just a huge chunk of raw information waiting to be processed, be it log files, geospatial data, genetic data used in biochemistry, or (of course!) Web pages to be indexed by search engines. Whatever it might be, a necessary condition is that the final output must not depend on the order in which the data is processed.
The use of MapReduce is on the rise in Web analytics, data mining, and various other housekeeping functions, in combination with other forms of databases. It is also being used in complex fields ranging from graphics processing on Nvidia's GPUs to animation and machine learning algorithms. I hope all of you have seen the movie Shrek. Guess what? A distributed MapReduce framework was used to render the output frames of the animated movie!
Also, Amazon Elastic MapReduce has given this paradigm much more exposure than ever before, and made it a one-click process. Amazon hosts the Hadoop framework running on Amazon EC2 and Amazon S3, allowing MapReduce computations without the headache of setting up servers and clusters, and making them available as a resource that can easily be rented and charged for by the hour.
Figure 1: A representational working of a MapReduce program
How to think in terms of MapReduce
MapReduce borrows heavily from functional programming languages like Lisp, which are focused on processing lists. So I suggest you get your functional programming concepts clear before trying your hand at MapReduce.
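If you need a quick refresher, here is a minimal sketch in Python (the idea is language-agnostic, and the names are purely illustrative) of the map/reduce idiom that gives MapReduce its name: map() transforms each element of a list independently, while reduce() folds the mapped values into a single result.

from functools import reduce

words = ["hadoop", "mapreduce", "hadoop", "hdfs"]

# map: transform every element of the list independently
lengths = list(map(len, words))                  # [6, 9, 6, 4]

# reduce: fold the mapped values into a single aggregate
total_length = reduce(lambda a, b: a + b, lengths, 0)
print(lengths, total_length)                     # [6, 9, 6, 4] 25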
Although MapReduce gives programmers with no experience in distributed systems an easy interface, the programmer does have to keep in mind the bandwidth available in a cluster, and the amount of data that is being passed around. Carefully implemented MapReduce algorithms can go a long way in improving the performance of a particular cluster. Also, every computation performed in a MapReduce operation is a batch process, as opposed to SQL, which has an interactive, query-like interface. While solving a problem using MapReduce, the problem has to be divided into two functions, i.e., map and reduce:
The map function takes a series of input data streams and processes all the values that follow in the sequence. It takes the initial set of key-value pairs and, in turn, produces intermediate pairs to be passed on to the reducer.
The reduce function typically combines all the elements of processed data generated by the mappers. Its job is mainly to take a set of intermediate key-value pairs and output a key-value pair that is basically an aggregate of all the values it received from the mappers (a small sketch of this flow follows below).
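To make the key-value flow concrete, here is a small, purely local Python sketch of the classic word-count job (no Hadoop involved, and the function names are just illustrative): the mapper emits a (word, 1) pair per word, an intermediate grouping step, which the framework would normally do for you, collects the values per key, and the reducer sums them.

from collections import defaultdict

def mapper(line):
    # emit an intermediate (key, value) pair for every word in the line
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # aggregate all values received for one key
    return (word, sum(counts))

lines = ["hadoop is a mapreduce framework", "mapreduce is simple"]

# map phase
intermediate = [pair for line in lines for pair in mapper(line)]

# shuffle/sort phase: group values by key (done by the framework in Hadoop)
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# reduce phase
results = [reducer(word, counts) for word, counts in groups.items()]
print(results)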
Combiner functions are sometimes used to combine data on the mapper node before it goes to the reducer. In most cases, the code used for the combiner and the reducer function is the same. This saves a lot of data-transfer bandwidth, and can improve efficiency noticeably. But this does not mean that we should implement combiners in every case; if there is not much data to combine, they can take up processing power that could be put to better use.
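As a rough illustration of the bandwidth saving, the following hypothetical Python snippet compares how many intermediate pairs a single mapper would ship to the reducers with and without local aggregation; in the word-count case the combiner's logic is the same summation the reducer performs.

from collections import Counter

# intermediate pairs produced by one mapper for one input split
mapper_output = [("the", 1), ("quick", 1), ("the", 1), ("fox", 1), ("the", 1)]

# without a combiner, every pair crosses the network
print("pairs sent without combiner:", len(mapper_output))

# with a combiner, counts are summed locally on the mapper node first
combined = Counter()
for word, count in mapper_output:
    combined[word] += count
print("pairs sent with combiner:", len(combined))    # 3 instead of 5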
There are no limitations on the variety of content that can be passed into a map or a reduce function. It can be anything, ranging from simple strings to complex data to which you can, in turn, apply MapReduce again. MapReduce is based upon the familiar map/reduce construct from functional programming, but applies the methodology in a much more parallelised fashion.
Also, a lot of work goes into matters such as the various methods of distributing the data among multiple nodes. So the most difficult part is not implementing the functions and writing code; it's how you carve out and shape your data.
A (very) short tutorial
Hadoop is one of the most popular implementations of this framework introduced by Google (and it's open source, too). There is also a project called Disco, which looks interesting to me, since its core has been implemented in Erlang, along with some bits and pieces in Python; this may help it perform better than Hadoop, which is written in Java. Also, Hadoop tends to seem a bit more complex (and too massive for a framework, at a little more than 70 MB for the newer version!) than it should be for most users, so I suggest you take a look at the Disco project if you are a Python developer.
Even so, we will still examine Hadoop, since it is one of the most actively developed implementations right now. I will assume that you have already set up Hadoop, preferably version 0.21.0 in pseudo-distributed mode if you don't have access to a cluster. Since the Hadoop website already has in-depth documentation on installing it, there is no point in repeating that here. Alternatively, I would recommend using Cloudera's distribution, which provides Deb and RPM packages for easy installation on major Linux distributions. It even provides preconfigured VMware images for download. Hadoop provides three basic ways of developing MapReduce applications:
Java API
Hadoop streaming
Hadoop pipes
Java API
This is the native Java API that Hadoop provides, available as the org.apache.hadoop.mapreduce package in the new implementation, and as org.apache.hadoop.mapred in the older version. Instead of devoting a full-fledged, step-by-step tutorial to developing a Java program with this API, let's just try to run some of the excellent examples already bundled with Hadoop.
The example we are going to take up is the distributed version of the grep command, which takes a text file as the input, and returns the number of occurrences of the specified word in the input file. You can take a look at the source code of the example, which can be found in the HADOOP_HOME/mapred/src/examples/org/apache/hadoop/examples/Grep.java file, where HADOOP_HOME is a variable specifying the directory where you have installed Hadoop on your system.
First of all, let's copy a text file to be searched, from the local filesystem into HDFS, using the following command in Hadoop's home directory:
bin/hadoop fs -copyFromLocal path/on/local/filesystem /HDFS/path/to/text/file
Now that we have our inputs ready, we can run the MapReduce example using the following
command:
bin/hadoop jar hadoop-mapred-examples-0.21.0.jar grep /HDFS/path/to/input/file /HDFS/path/to/output/Directory keyword
Here, *e"+ord is any word or regular expression that you are trying to fnd in the text fle.
After a lot of verbose text scrolling by, when your job is complete, you can take a look at the
output:
bin/hadoop fs -cat /HDFS/path/to/output/file
The output should look something like what's shown below, i.e., the keyword accompanied by the number of occurrences:
( *e"+ord
There are a number of other examples available, including the classic word count, which is almost synonymous with MapReduce tutorials on the Web. HBase also provides a Java API for easier access to HBase tables; these classes are available in the org.apache.hadoop.hbase.mapreduce package. An excellent tutorial on HBase and MapReduce can be found here; go through it if you're interested.
Hadoop streaming
The streaming utility allows you to write MapReduce programs with any application that receives input on standard input and writes output to standard output. Here, any executable can be specified as the mapper and the reducer, as long as it is able to receive and send data through the standard interface. For example, the following command, run in Hadoop's home directory, will submit a MapReduce job using the streaming utility:
bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-input HDFSpathtoInputDirectory \
-output HDFSpathtoOutputDirectory \
-mapper mapperScript \
-reducer reducerScript \
-file AdditionalFile \
-file MoreAdditionalFiles
It is assumed that the mapper and reducer scripts are already present on all the nodes if you are running a cluster, while the -file argument allows you to specify any additional files that need to be sent to those nodes (these can even be the mapper and reducer script files themselves, if they are not yet present on all nodes) that are used by the mapper or reducer program.
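To give you an idea of what such mapper and reducer scripts could look like, here is a minimal, hypothetical word-count pair written in Python. The mapper emits a tab-separated (word, count) pair for every word it reads from standard input; the reducer relies on the streaming framework sorting the intermediate pairs by key, so all counts for a given word arrive together.

#!/usr/bin/env python
# mapper.py: emit a (word, 1) pair for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py: the input is sorted by key, so all counts for a word are contiguous
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

These scripts would then be passed as the -mapper and -reducer arguments (and shipped to the nodes with -file) in the streaming command shown above.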
Hadoop pipes
With Hadoop pipes, you can write MapReduce applications in C++, which is useful when higher performance is required in numerical calculations. The pipes utility works by establishing a persistent socket connection on a port, with the Java pipes task at one end and the external C++ process at the other.
Other dedicated alternatives and implementations are also available, such as Pydoop for Python and libhdfs for C. These are mostly built as JNI-based wrappers. It is worth noting, however, that individual MapReduce tasks are often just a small component of a larger pipeline that chains, redirects and recurses MapReduce jobs. This is usually done with the help of higher-level languages or APIs like Pig, Hive and Cascading, which can be used to express such data extraction and transformation problems.
Under the hood
We just saw how easily a few lines of code can perform computations on a cluster, and distribute them in a manner that would take days for us to code manually. Now, let's take a brief look at Hadoop MapReduce's job flow, and some background on how it all happens.
Figure 2: A simplified view of Hadoop's MapReduce job flow
The story starts when the driver program submits the job configuration to the JobTracker node. As the name suggests, the JobTracker controls the overall progress of the job, and resides on the namenode of the distributed filesystem, monitoring the success or failure of each individual task. The JobTracker splits the job into individual tasks and submits them to the respective TaskTrackers, which mostly reside on the DataNodes themselves, or at least on the same rack as the data. This placement is possible because the HDFS filesystem is rack-aware. Each TaskTracker then receives its share of the input data and starts running the map function specified by the configuration. When all the Map tasks are completed, the JobTracker asks the TaskTrackers to start running the reduce function. The JobTracker deals with failed and unresponsive tasks by running backup tasks, i.e., multiple copies of the same task on different nodes; whichever node completes first gets its output accepted. When both the Map and Reduce tasks are completed, the JobTracker notifies the client program, and dumps the output into the specified output directory.
The job flow, as a result, is made up of four main components:
The Driver Program
The JobTracker
The TaskTracker
The Distributed Filesystem
Limitations
I have been trumpeting the virtues of MapReduce for quite a while now, so it is only fair to spend some time on its shortcomings. MapReduce, apart from being touted as the best thing to happen to cloud computing, has also been criticised a lot:
Data-transfer bandwidth between different nodes puts a cap on the amount of data that can be passed around in the cluster.
Not all types of problems can be handled by MapReduce, since a fundamental requirement is the internal independence of the data, and we cannot exactly determine or control the number or order of Map and Reduce tasks.
A major (arguably the most important) problem in this paradigm is the barrier between the Map phase and the Reduce phase, since processing cannot move into the Reduce phase until all the Map tasks have completed.
Last-minute pointers
Before leaving you with your own thoughts, here are a few pointers:
The HDFS filesystem can also be accessed using the NameNode's Web interface, at http://localhost:50070/.
File data can also be streamed from Datanodes using HTTP, on port 50075.
Job progress can be determined using the MapReduce Web administration console of the JobTracker node, which can be accessed at http://localhost:50030/.
The jps command can be used to check which Hadoop daemons are up and running.
The Thrift interface is also an alternative for developing bindings in languages other than Java.
Further reading
1. Apache Hadoop official website
2. Disco Project
3. Cloudera's free Hadoop training lectures
4. Google's lectures and slides on Cluster Computing and MapReduce
5. Google's MapReduce research paper
6. MapReduce: A major step backwards [PDF]
Feature image courtesy: Udi h Bauman. Reused under the terms of CC-BY-SA 2.0 License.
