
Proceeding of the IEEE International Conference on Automation and Logistics
Zhengzhou, China, August 2012

Deploying and Researching Hadoop in Virtual Machines


Guanghui Xu, Feng Xu*, Hongxu Ma
College of Computer and Information
Hohai University
Hohai, Nanjing 211100, China
jerryxgh@gmail.com {njxufeng & mdahagg }@163.com
Abstract: The emergence of Hadoop and the maturity of virtualization
make it feasible to combine them to process immense data sets.
Research on Hadoop in a virtual environment requires an experimental
environment. This paper first introduces the technologies used,
namely CloudStack, MapReduce, and Hadoop, and then gives a method to
deploy CloudStack. Next, we discuss how to deploy Hadoop in virtual
machines obtained from CloudStack, and we present an algorithm that
solves the problem that all virtual machines created by CloudStack
from the same template have the same hostname. We then run some
Hadoop programs on the virtual cluster, which shows that deploying
Hadoop in this way is feasible. Finally, some methods to optimize
Hadoop in virtual machines are discussed. Readers can follow this
paper to set up their own Hadoop experimental environment and to
grasp the current status and trends of optimizing Hadoop in a
virtual environment.

Index Terms: CloudStack; MapReduce; Hadoop; Virtualization

978-1-4673-0364-4/12/$31.00 ©2012 IEEE

I. INTRODUCTION

When the computer was first invented, data and compute resources
were centralized, and users accessed them through terminals. With
the development of hardware, the personal computer came into our
lives. Now the trend is that data and compute resources are being
centralized again, which is called cloud computing.
Nowadays, the most frequently used programs are Internet-based
services, such as search engines, social network services, and
e-commerce sites, which have millions of users. These services
continuously produce large amounts of data, which raises a problem:
how to deal with such immense data sets. Google, the search engine
leader, uses a programming model called MapReduce that can process
20 PB of data per day [5]. Hadoop is an open source implementation
of MapReduce, sponsored by Yahoo. As free and open source software,
Hadoop is developing fast; its first stable version was released
recently. Many research results have been integrated into it, and
both researchers and enterprises are using Hadoop.
Meanwhile, with the maturity of virtual machine technology, VM-based
computing infrastructure has emerged, such as Amazon EC2 (Elastic
Compute Cloud).
This raises a question: can we use Hadoop on virtual clusters
instead of physical clusters? If we can, we obtain not only the
powerful data processing ability provided by the parallel
programming model of Hadoop but also all the advantages of virtual
machines, such as fully utilizing system resources, easing system
management, improving reliability, and saving power.

II. BACKGROUND

In this section we discuss CloudStack, the MapReduce programming
model, and its open source implementation Hadoop. Two widely used
virtualization technologies are also introduced. Finally, we discuss
the advantages and disadvantages of deploying Hadoop in a virtual
environment.
A. CloudStack
CloudStack is an open source software platform that pools computing
resources to build public, private, and hybrid Infrastructure as a
Service (IaaS) clouds. CloudStack manages the network, storage, and
compute nodes that make up a cloud infrastructure. It can be used to
deploy, manage, and configure cloud computing environments [12].
With CloudStack, you can:
• Set up an on-demand, elastic cloud computing service. Service
  providers can sell self-service virtual machine instances, storage
  volumes, and networking configurations over the Internet [12].
• Set up an on-premise private cloud for use by employees. Rather
  than managing virtual machines in the same way as physical
  machines, with CloudStack an enterprise can offer self-service
  virtual machines to users without involving IT departments [12].

B. Virtualization Technology
Virtualization is a family of technologies that let computing
elements run on virtual machines rather than on physical ones. There
are many virtualization technologies, but we focus on two that are
free and open source software and have been widely used.
1) Xen
Xen is a virtual-machine monitor providing services that allow
multiple computer operating systems to execute on the same computer
hardware concurrently [10]. It was originally developed by the
University of Cambridge Computer Laboratory. Xen is free software
licensed under the GNU General Public License (GPLv2). At the time
of writing, the latest release version is 4.1. Amazon EC2 (Elastic
Compute Cloud) uses Xen.
2) KVM
Kernel-based Virtual Machine (KVM) is a virtualization
infrastructure for the Linux kernel. KVM supports native
virtualization on processors with hardware virtualization extensions
[11]. KVM is also free and open source software.

of the master node of MapReduce. Besides that, virtualization can
help to fully utilize the system resources. By using EC2-like
services, customers can easily and cost-effectively process vast
amounts of data.
2) Disadvantages
The only disadvantage is the potential for poor performance and
heavy load, which is the problem to be solved.

C. MapReduce and Hadoop
The name MapReduce comes from two kinds of operations in functional
programming languages: Map and Reduce. In a functional programming
language, functions have no side effects, which means that programs
written in such languages are easier to optimize for parallel
execution. Map and Reduce take functions as parameters, a property
that MapReduce fully exploits.
The MapReduce programming model divides the problem to be solved
into two kinds of operations, Map and Reduce. When it receives a
request, its processing flow is as shown in Fig. 1.
[Figure 1: input splits (Split 0 to Split 4) from HDFS; Map; Copy; Shuffle & Sort; output parts (Part0, Part1) to HDFS]

III. DEPLOYING CLOUDSTACK AND HADOOP

A. CloudStack Deployment
A CloudStack installation consists of two parts: the Management
Server and the cloud infrastructure that it manages. When you set up
and manage a CloudStack cloud, you provision resources such as
hosts, storage devices, and IP addresses into the Management Server,
and the Management Server manages those resources. Fig. 2 gives an
overview:

[Figure 2: Management Server; Hypervisor; Machine 1; Machine 2]
Fig 2. CloudStack overview

1) Management Server
CloudStack uses the Management Server to manage the resources. Users
can manage their cloud infrastructure through the Management Server
UI or API.
2) Cloud Infrastructure
The Management Server manages one or more zones (typically,
datacenters) containing host computers where guest virtual machines
will run. The cloud infrastructure is organized as follows [12]:
• Zone: Typically, a zone is equivalent to a single datacenter. A
  zone consists of one or more pods and secondary storage.
• Pod: A pod is usually one rack of hardware that includes a layer-2
  switch and one or more clusters.
• Cluster: A cluster consists of one or more hosts and primary
  storage.
• Host: A single compute node within a cluster. The hosts are where
  the actual cloud services run in the form of guest virtual
  machines.
• Primary storage is associated with a cluster, and it stores the
  disk volumes for all the VMs running on hosts in that cluster.
• Secondary storage is associated with a zone, and it stores
  templates, ISO images, and disk volume snapshots.
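The containment hierarchy listed above can be pictured as a small data model. The following Python sketch is only an illustration: the class names, field names, and storage locations are our own assumptions and are not part of the CloudStack API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    name: str  # a single compute node that runs guest VMs


@dataclass
class Cluster:
    primary_storage: str  # stores disk volumes of the VMs in this cluster
    hosts: List[Host] = field(default_factory=list)


@dataclass
class Pod:
    clusters: List[Cluster] = field(default_factory=list)


@dataclass
class Zone:
    secondary_storage: str  # stores templates, ISO images, snapshots
    pods: List[Pod] = field(default_factory=list)

    def all_hosts(self) -> List[Host]:
        """Collect every host in every cluster of every pod."""
        return [h for p in self.pods for c in p.clusters for h in c.hosts]


# Hypothetical zone with one pod, one cluster, and two hosts.
zone = Zone(
    secondary_storage="nfs://secondary",
    pods=[Pod(clusters=[Cluster(primary_storage="nfs://primary",
                                hosts=[Host("host-1"), Host("host-2")])])],
)
host_names = [h.name for h in zone.all_hosts()]
```

Walking the zone from the top down visits pods, then clusters, then hosts, which mirrors the order in which the resources are described in the list.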
Fig. 3 shows the organization of the cloud infrastructure in
CloudStack.

Fig 1. MapReduce programming model

MapReduce is only a programming model; at Google, it runs on GFS
(Google File System) [6]. Hadoop is an open source implementation of
MapReduce, and it has its own distributed file system, called HDFS
(Hadoop Distributed File System). At the time of writing, its latest
release consists of three parts: Common, HDFS, and Hadoop MapReduce.
Common is the set of common utilities that support the other Hadoop
subprojects; HDFS is the Hadoop Distributed File System; and Hadoop
MapReduce, as its name says, implements the MapReduce model
introduced above.
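The map, shuffle and sort, and reduce steps of the programming model can be illustrated with a word count. The sketch below is a single-process Python illustration under our own naming; it does not use Hadoop or HDFS and only shows how the phases fit together.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(split: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle and sort: group the values by key, with keys in sorted order."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: sum all counts collected for one word."""
    return key, sum(values)


def word_count(splits: List[str]) -> Dict[str, int]:
    # Each split is mapped independently, which is what allows parallelism.
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["hadoop runs map", "map then reduce", "reduce runs"])
```

Because `map_phase` and `reduce_phase` have no side effects, each call could run on a different machine; in Hadoop the splits would come from HDFS and the result would be written back to it.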
D. Advantages and Disadvantages of Hadoop in Virtual Environment
1) Advantages
MapReduce is designed for clusters of commodity PCs, and managing
thousands of commodity PCs is a big job. The reliability of
commodity PCs is also a concern. Perhaps the biggest problem is
power consumption. So anyone who wants to build their own compute
center will pay quite a lot. This is where EC2-like services are
useful. Deploying Hadoop applications on virtual machines brings all
the advantages of virtualization: it makes the management of the
cluster easier and improves reliability, because virtual machines
can be recovered from a crash more easily than physical ones. Thus,
it can improve the reliability


$ sudo update-alternatives --install /usr/bin/java java /usr/lib/java/jdk1.6.0_20/bin/java 300
$ sudo update-alternatives --install /usr/bin/javac javac /usr/lib/java/jdk1.6.0_20/bin/javac 300
$ sudo update-alternatives --config java


4) Run Hadoop
With the Sun JDK installed, Hadoop is easy to run, but a problem
arises: CloudStack uses a template to create virtual machines, so
all the virtual machines have the same hostname, which causes
conflicts in Hadoop. To solve this, we introduce the Auto Change
Hostname Service (ACHS). When a virtual machine starts, it first
runs a program that we call the Auto Change Hostname Client (ACHC).
ACHC asks ACHS whether this machine is registered. If not, ACHC
registers the machine and requests a hostname, then changes the
hostname, writes it into the OS configuration, and runs the Hadoop
services. If ACHC finds that the machine has already been
registered, it runs the Hadoop services immediately. Fig. 4 shows
the procedure of the algorithm.
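The registration procedure described above can be sketched in Python as follows. This is a minimal in-memory illustration, not the authors' implementation: the class and function names, the `hadoop-node-` naming scheme, and the use of a plain string as machine identifier are all assumptions, and the client only reports the actions it would take instead of changing the real hostname.

```python
from typing import Dict, List


class AutoChangeHostnameService:
    """In-memory stand-in for ACHS: maps a machine ID to its hostname."""

    def __init__(self, prefix: str = "hadoop-node-") -> None:
        self.prefix = prefix  # hypothetical naming scheme
        self.registry: Dict[str, str] = {}

    def is_registered(self, machine_id: str) -> bool:
        return machine_id in self.registry

    def register(self, machine_id: str) -> str:
        """Register the machine and hand out a fresh, unique hostname."""
        hostname = f"{self.prefix}{len(self.registry) + 1}"
        self.registry[machine_id] = hostname
        return hostname


def achc_on_boot(service: AutoChangeHostnameService, machine_id: str) -> List[str]:
    """Client-side steps at VM start; returns the actions taken, in order."""
    actions: List[str] = []
    if not service.is_registered(machine_id):
        hostname = service.register(machine_id)
        actions.append(
            f"set hostname to {hostname} and save it to the OS configuration"
        )
    actions.append("run Hadoop services")
    return actions


achs = AutoChangeHostnameService()
first_boot = achc_on_boot(achs, "vm-a")   # unregistered: rename, then start
second_boot = achc_on_boot(achs, "vm-a")  # registered: start immediately
other_vm = achc_on_boot(achs, "vm-b")     # second VM gets a different name
```

Keeping the registry on the service side is what guarantees that two virtual machines cloned from the same template never receive the same hostname.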

[Figure 3: Zone; Secondary Storage; Pod; Cluster; Host; Primary Storage]
Fig 3. Organization of a zone in CloudStack

3) CloudStack installation
a) Prepare
• The operating system should be one of RHEL 5.4-5.x 64-bit or 6.2+
  64-bit, CentOS 5.4-5.x 64-bit or 6.2+ 64-bit, or Ubuntu 10.04 LTS.
• 64-bit x86 CPU (more cores result in better performance)
• 4 GB of memory
• 250 GB of local disk (more results in better capability; 500 GB
  recommended)
• At least 1 NIC
• Statically allocated IP address
• Fully qualified domain name as returned by the hostname command
• XenServer 6.0 (for CloudStack 3.0.0) or XenServer 6.0.2
• KVM
b) Management Server Installation
Download the CloudStack Management Server. You should have a file in
the form of CloudStack-VERSION-N-OSVERSION.tar.gz. Untar the file
and then run the install.sh script inside it.

[Figure 4 flowchart: Start -> Registered? If NO: request hostname and register -> change hostname and save it to configuration -> run Hadoop services. If YES: run Hadoop services. -> Quit]
Fig 4. Procedure of Auto Change Hostname Algorithm

# tar xzf CloudStack-VERSION-N-OSVERSION.tar.gz
# cd CloudStack-VERSION-N-OSVERSION
# ./install.sh

IV. RESEARCH ON HADOOP IN VIRTUAL MACHINES

A. Task Scheduling
Hadoop's performance is closely tied to its task scheduler, which
implicitly assumes that cluster nodes are homogeneous and tasks make
progress linearly, and uses these assumptions to decide when to
speculatively re-execute tasks that appear to be stragglers [1].
These are the implicit assumptions of Hadoop's scheduler [1]:
• Nodes can perform work at roughly the same rate.
• Tasks progress at a constant rate throughout time.
• There is no cost to launching a speculative task on a node that
  would otherwise have an idle slot.
• A task's progress score is roughly equal to the fraction of its
  total work that it has done. Specifically, in a reduce task, the
  copy, sort and reduce phases each take 1/3 of the total time.
• Tasks tend to finish in waves, so a task with a low progress score
  is likely a slow task.
• Different tasks of the same category (map or reduce) require
  roughly the same amount of work.
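How these assumptions turn into a straggler decision can be sketched in a few lines of Python. The sketch assumes the rule described in [1]: each of the three reduce phases contributes 1/3 of the progress score, and a task becomes a candidate for speculative execution when its score falls more than a fixed margin (0.2 here) below the average of its category. The function names and sample values are ours and are not Hadoop APIs.

```python
from statistics import mean
from typing import List

REDUCE_PHASES = 3  # each phase is assumed to account for 1/3 of the work


def reduce_progress(phases_completed: int, fraction_of_current: float) -> float:
    """Progress score of a reduce task, in the range 0.0 to 1.0."""
    return (phases_completed + fraction_of_current) / REDUCE_PHASES


def is_straggler(score: float, category_scores: List[float],
                 margin: float = 0.2) -> bool:
    """True when the score is more than `margin` below the category average."""
    return score < mean(category_scores) - margin


# Two tasks are in their last phase; one is still in its first phase.
scores = [
    reduce_progress(2, 0.5),
    reduce_progress(2, 0.4),
    reduce_progress(0, 0.9),
]
flags = [is_straggler(s, scores) for s in scores]
```

The rule only compares scores against the average, so it works well only if all nodes really do progress at the same rate, which is exactly the homogeneity assumption listed above.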

Then choose M to install the Management Server software. For more
details, see [12].
B. Hadoop deployment
Hadoop is written in Java. We deploy Hadoop under Ubuntu 12.04, but
the Sun JDK has been removed from the official package source. To
obtain a JDK to run Hadoop, follow these steps:
1) Download the latest version of the JDK for Ubuntu from
http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u4-downloads-1591156.html;
we chose jdk-7u4-linux-i586.tar.gz.
2) Set environment variables
Untar the file, set the environment variable JAVA_HOME to the path
of the JDK, and add JAVA_HOME/bin to PATH and JAVA_HOME/lib to
CLASSPATH.
3) Make the Sun JDK the default JDK


But if Hadoop is running on virtual machines and knows whether any
two virtual machines are on the same physical host, this will help
Hadoop decide which virtual machine should run which map or reduce
task.
A physical cluster may be homogeneous because all of its machines
may have the same hardware, but a virtual environment is more
complicated: even if the virtual machines have the same virtual
hardware, some of them may run on the same physical host while
others run on different physical hosts. Although the virtual machine
monitor can isolate CPU and memory, virtual machines have to compete
for network bandwidth and disk, which may cause Hadoop's implicit
assumption that the cluster is homogeneous to fail. If the
homogeneity assumption fails, the efficiency of Hadoop's scheduler
will be seriously affected.
Some scheduling algorithms have been proposed, for example LATE
(Longest Approximate Time to End) [1], but LATE is designed to help
Hadoop cope with heterogeneous environments in general, not
specifically virtual environments. We need to find a scheduling
algorithm designed specifically for virtual clusters that can
further improve Hadoop's efficiency.
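For reference, the core estimate behind LATE can be sketched as follows. The sketch assumes the heuristic described in [1]: a task's progress rate is its progress score divided by its running time, and its estimated time to end is the remaining progress divided by that rate; the task with the longest estimate is the preferred candidate for speculation. The caps and thresholds that [1] adds on top are omitted, and the task names and numbers are made up.

```python
from typing import Dict, Tuple


def estimated_time_left(progress_score: float, elapsed_seconds: float) -> float:
    """Estimate remaining seconds from the observed progress rate."""
    if progress_score <= 0.0 or elapsed_seconds <= 0.0:
        raise ValueError("progress and elapsed time must be positive")
    progress_rate = progress_score / elapsed_seconds
    return (1.0 - progress_score) / progress_rate


def pick_speculation_candidate(tasks: Dict[str, Tuple[float, float]]) -> str:
    """Return the task ID with the longest approximate time to end."""
    return max(tasks, key=lambda task_id: estimated_time_left(*tasks[task_id]))


# task ID -> (progress score, elapsed seconds); hypothetical values
running = {
    "task-a": (0.9, 90.0),
    "task-b": (0.5, 100.0),
    "task-c": (0.2, 30.0),
}
candidate = pick_speculation_candidate(running)
```

Note that `task-c` is chosen even though it started most recently, because the estimate looks at remaining time and not at the raw progress score.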

B. I/O Scheduling
Efficiency in a virtual machine may be much lower than in a physical
machine; the reasons include task scheduling and I/O scheduling.
MapReduce is designed to run on physical machines. When a MapReduce
task is running, a lot of data is transferred between machines, so
efficient I/O scheduling is very important for shortening the
response time.
V. DIFFERENCES OF HADOOP IN VIRTUAL AND PHYSICAL MACHINES

There is no doubt that a virtual environment is different from a
physical environment. The key is which differences affect
efficiency. The biggest difference is the I/O environment. For
example, suppose machine A and machine B are running map and reduce
jobs, A needs some data on B, and B needs some data on A. In a
virtual environment there are two cases. In the first, A and B are
on different host machines, and they transfer data as in Fig. 6.
But if A and B are on the same physical machine, they transfer data
as in Fig. 5.

[Figure 5: VM A and VM B on one physical machine, each using 1/2 of the shared network card and hard disk]
Fig 5. VMs are in same host

[Figure 6: VM A on Physical Machine 1 and VM B on Physical Machine 2, each using its own network card and hard disk (share 1)]
Fig 6. VMs are in different hosts

The two cases differ greatly in efficiency. In case 1, VM A and VM B
use different hard disks and different network cards, which is the
same as on physical machines. In case 2, VM A and VM B share the
same hard disk and the same network card, so the data transfer
efficiency is half that of case 1. That makes the response time much
longer. Long response times cannot be tolerated in short jobs, which
are the main kind of job that MapReduce processes.

VI. CONCLUSION AND FUTURE WORK

We have discussed CloudStack, the MapReduce programming model, and
Hadoop. CloudStack can be used to create a virtual cluster;
MapReduce uses two operations from functional programming languages,
map and reduce, which allow distributed parallel execution. We then
discussed how to deploy Hadoop in virtual machines obtained from
CloudStack and presented an algorithm that solves the problem that
all virtual machines created by CloudStack from the same template
have the same hostname. After that, we ran some Hadoop programs on
the virtual cluster, which shows that deploying Hadoop in this way
is feasible. We answered the question of why it is feasible to
deploy Hadoop in a virtualized data center by discussing the
advantages and disadvantages of Hadoop in a virtual environment. The
advantages are that it eases management, fully utilizes the
computing resources, makes Hadoop more reliable, and saves power.
But before enjoying these advantages, we have to face the lower
performance of virtual machines, so some methods to optimize Hadoop
in virtual machines were discussed.
Finally, we discussed the differences between Hadoop in virtual and
physical machines, and from them we pointed out two ways to optimize
Hadoop in a virtual environment. Our future work is to follow these
two ways and design algorithms to solve the problem.
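The halving argument of Section V (two VMs on one host versus two VMs on separate hosts) can be checked with a small calculation. The Python sketch below uses an idealized model in which a link's bandwidth is divided equally among the VMs sharing it; the function name and the figures of 1000 MB and 100 MB/s are hypothetical and are not measurements.

```python
def transfer_seconds(data_mb: float, link_mb_per_s: float,
                     vms_sharing_link: int) -> float:
    """Time to move data_mb when the link is split evenly among the VMs."""
    per_vm_bandwidth = link_mb_per_s / vms_sharing_link
    return data_mb / per_vm_bandwidth


# Case 1: VM A and VM B are on different hosts, each with its own card.
separate_hosts = transfer_seconds(1000.0, 100.0, vms_sharing_link=1)

# Case 2: VM A and VM B are on the same host and share one card.
same_host = transfer_seconds(1000.0, 100.0, vms_sharing_link=2)

slowdown = same_host / separate_hosts
```

Under this model the shared case takes twice as long, which matches the statement that the data transfer efficiency is half that of the separate-host case.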

REFERENCES
[1] M. Zaharia, et al., "Improving MapReduce performance in heterogeneous environments," Proc. 8th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, 2008, pp. 29-42.
[2] S. Ibrahim, et al., "Evaluating MapReduce on Virtual Machines: The Hadoop Case," vol. 5931, Springer Berlin / Heidelberg, 2009, pp. 519-528.
[3] F. Jun, et al., "Evaluating I/O Scheduler in Virtual Machines for Mapreduce Application," Proc. 2010 9th International Conference on Grid and Cooperative Computing (GCC), 2010, pp. 64-69.
[4] A. Matsunaga, et al., "CloudBLAST: combining MapReduce and virtualization on distributed resources for bioinformatics applications," Proc. 4th IEEE International Conference on eScience, eScience 2008, December 7-12, 2008, IEEE Computer Society, 2008, pp. 222-229.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, 2008, pp. 107-113.
[6] S. Ghemawat, et al., "The Google file system," Proc. SOSP'03: 19th ACM Symposium on Operating Systems Principles, October 19-22, 2003, Association for Computing Machinery, 2003, pp. 29-43.
[7] (authors, title, and journal not recoverable), no. 05, 2009, pp. 1337-1348.
[8] (authors, title, and journal not recoverable), no. 09, 2009, pp. 2562-2567.
[9] Fair Scheduler for Hadoop [EB/OL], 2010-04-15.
[10] Xen. http://en.wikipedia.org/wiki/Xen
[11] KVM. http://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
[12] CloudStack 3.0 Install Guide.