MapReduce; Hadoop; CloudStack
I. INTRODUCTION
II. BACKGROUND
B. Virtualization Technology
Virtualization is a class of technologies that lets
computing elements run on virtual machines rather than
on physical ones. There are many virtualization
technologies, but we focus on two that are free and
open source software and have been widely used.
1) Xen
Xen is a virtual-machine monitor providing services that
allow multiple computer operating systems to execute on the
same computer hardware concurrently [10]. It was originally
developed by the University of Cambridge Computer Laboratory.
Xen is free software licensed under the GNU General
Public License (GPLv2). At the time of writing, the
III.
A. CloudStack Deployment
A CloudStack installation consists of two parts: the
Management Server and the cloud infrastructure that it
manages. When you set up and manage a CloudStack cloud,
you provision resources such as hosts, storage devices, and
IP addresses into the Management Server, and the
Management Server manages those resources. Figure 2
below gives an overview of this arrangement:
[Figure 2. CloudStack deployment: the Management Server manages hypervisor hosts (Machine 1, Machine 2)]
1) Management Server
CloudStack uses the Management Server to manage
resources. Users can manage their cloud infrastructure
through the Management Server UI or API.
2) Cloud Infrastructure
The Management Server manages one or more zones
(typically, datacenters) containing host computers where
guest virtual machines will run. The cloud infrastructure is
organized as follows [12]:
Zone: Typically, a zone is equivalent to a single
datacenter. A zone consists of one or more pods and
secondary storage.
Pod: A pod is usually one rack of hardware that
includes a layer-2 switch and one or more clusters.
Cluster: A cluster consists of one or more hosts and
primary storage.
Host: A single compute node within a cluster. The
hosts are where the actual cloud services run in the
form of guest virtual machines.
Primary storage is associated with a cluster, and it
stores the disk volumes for all the VMs running on
hosts in that cluster.
Secondary storage is associated with a zone, and it
stores templates, ISO images, and disk volume
snapshots.
The figure below shows the cloud infrastructure in
CloudStack:
[Figure: CloudStack cloud infrastructure (Zone, Secondary Storage, Pod, Cluster, Host, Primary Storage)]
4) Run Hadoop
With the Sun JDK, Hadoop is easy to run, but one problem
arises: CloudStack uses templates to create virtual machines,
so all virtual machines get the same hostname, which causes
conflicts in Hadoop. To solve this, we introduce the Auto
Change Hostname Service (ACHS). When a virtual machine
starts, it first runs a program we call the Auto Change
Hostname Client (ACHC). ACHC asks ACHS whether this
machine is registered. If not, ACHC registers the machine and
requests a hostname, then changes the hostname, writes it into
the OS configuration, and runs the Hadoop services. If ACHC
finds that the machine has already been registered, it runs the
Hadoop services immediately. The figure below shows the
procedure:
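The same procedure can also be sketched in code. The following is an illustrative, in-memory sketch under several assumptions, not the implementation used in this work: the class `HostnameService`, the function `run_client`, the use of a machine identifier such as a MAC address, and the `hadoop-node-N` naming scheme are all hypothetical. In a real deployment ACHS would be a network service, and the two callbacks would change the OS hostname and start the Hadoop daemons.

```python
from typing import Callable, Optional


class HostnameService:
    """In-memory stand-in for ACHS: maps machine ids to hostnames."""

    def __init__(self, prefix: str = "hadoop-node") -> None:
        self._prefix = prefix                 # assumed naming scheme
        self._registry: dict[str, str] = {}   # machine id -> hostname

    def lookup(self, machine_id: str) -> Optional[str]:
        """Return the registered hostname, or None if unregistered."""
        return self._registry.get(machine_id)

    def register(self, machine_id: str) -> str:
        """Register the machine and hand out a fresh, unique hostname."""
        hostname = f"{self._prefix}-{len(self._registry) + 1}"
        self._registry[machine_id] = hostname
        return hostname


def run_client(
    service: HostnameService,
    machine_id: str,
    set_hostname: Callable[[str], None],
    start_hadoop: Callable[[], None],
) -> str:
    """Stand-in for ACHC: run once at VM start."""
    hostname = service.lookup(machine_id)
    if hostname is None:
        # Not registered: register, obtain a hostname, and persist it.
        hostname = service.register(machine_id)
        set_hostname(hostname)
    # Registered (now or earlier): start the Hadoop services.
    start_hadoop()
    return hostname
```

On the first boot of a VM the client registers and sets the hostname before starting Hadoop. On later boots the lookup succeeds and only the Hadoop services are started.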
3) CloudStack installation
a) Prepare
Operating system should be one of RHEL 5.4-5.x
64-bit or 6.2+ 64-bit, CentOS 5.4-5.x 64-bit or 6.2+
64-bit, or Ubuntu 10.04 LTS.
64-bit x86 CPU (more cores result in better
performance)
4 GB of memory
250 GB of local disk (more results in better
capability; 500 GB recommended)
At least 1 NIC
Statically allocated IP address
Fully qualified domain name as returned by the
hostname command
XenServer 6.0 (for CloudStack 3.0.0) or XenServer
6.0.2
KVM
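The hardware and network items in this list lend themselves to a simple pre-installation check. The sketch below is only an illustration under stated assumptions: `HostFacts` and `check_prerequisites` are hypothetical names, the facts are supplied by the caller rather than probed from the system, and testing for a dot in the name is a crude stand-in for a real fully-qualified-domain-name check. The thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class HostFacts:
    """Facts about a candidate machine, gathered by the caller."""
    arch: str          # e.g. "x86_64"
    memory_gb: float
    disk_gb: float
    nic_count: int
    static_ip: bool
    fqdn: str          # value returned by the hostname command


def check_prerequisites(facts: HostFacts) -> list[str]:
    """Return a list of unmet requirements (empty if all are met)."""
    problems = []
    if facts.arch != "x86_64":
        problems.append("CPU must be 64-bit x86")
    if facts.memory_gb < 4:
        problems.append("at least 4 GB of memory required")
    if facts.disk_gb < 250:
        problems.append("at least 250 GB of local disk required")
    if facts.nic_count < 1:
        problems.append("at least 1 NIC required")
    if not facts.static_ip:
        problems.append("IP address must be statically allocated")
    if "." not in facts.fqdn:
        # Heuristic only: a bare short name is not fully qualified.
        problems.append("hostname must be a fully qualified domain name")
    return problems
```

Returning a list of problems rather than a single boolean lets an operator see every unmet requirement in one pass.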
b) Management Server Installation
Download the CloudStack Management Server. You
should have a file in the form of CloudStack-VERSION-NOSVERSION.tar.gz. Untar the file and then run the
install.sh script inside it.
[Figure: ACHC procedure flowchart (Start; Registered?; NO: Request Hostname and Register, then Change hostname and save it to configuration; YES)]
IV.
A. Task Scheduling
Hadoop's performance is closely tied to its task scheduler,
which implicitly assumes that cluster nodes are
homogeneous and tasks make progress linearly, and uses
these assumptions to decide when to speculatively re-execute
tasks that appear to be stragglers [1]. These are the implicit
assumptions of Hadoop's scheduler [1]:
Nodes can perform work at roughly the same rate.
Tasks progress at a constant rate throughout time.
There is no cost to launching a speculative task on a
node that would otherwise have an idle slot.
A task's progress score is roughly equal to the
fraction of its total work that it has done. Specifically,
in a reduce task, the copy, reduce and merge phases
each take 1/3 of the total time.
Tasks tend to finish in waves, so a task with a low
progress score is likely a slow task.
Different tasks of the same category (map or reduce)
require roughly the same amount of work.
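To make the progress-score assumption concrete, the sketch below computes a reduce task's score from its current phase, weighting each of the three phases named above as 1/3, and flags tasks whose score falls well below the average for their category. It is an illustration, not Hadoop's scheduler code: the function names are hypothetical, and the `gap` threshold of 0.2 is an assumed parameter chosen for the example.

```python
from statistics import mean

# Phase names and order as given in the text; each counts for 1/3.
PHASES = ("copy", "reduce", "merge")


def reduce_progress(phase: str, fraction_done: float) -> float:
    """Progress score of a reduce task in [0, 1].

    Completed phases contribute 1/3 each; the current phase
    contributes its completed fraction times 1/3.
    """
    completed = PHASES.index(phase)
    return (completed + fraction_done) / len(PHASES)


def find_stragglers(scores: dict[str, float], gap: float = 0.2) -> list[str]:
    """Tasks whose score is more than `gap` below the category average.

    `gap` is an assumed threshold for illustration.
    """
    average = mean(scores.values())
    return sorted(task for task, s in scores.items() if s < average - gap)
```

For example, with scores of 0.9, 0.85 and 0.3 the average is about 0.68, so only the task at 0.3 is flagged. On heterogeneous or virtualized nodes the assumptions listed above may not hold, and a fixed threshold like this can then flag tasks that are not actually slow.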
B. I/O Scheduling
Efficiency on virtual machines may be much lower than on
physical machines, for reasons that include task scheduling
and I/O scheduling. MapReduce is designed to run on physical
machines. When a MapReduce task is running, a lot of data
is transferred between machines, so the efficiency of I/O
scheduling is very important for shortening the response
time.
[Fig 5. VMs are in same host (VM A and VM B on one physical machine, sharing one hard disk)]
[Fig 6. (VM A and VM B on Physical Machine 1 and Physical Machine 2, each with its own network card and hard disk)]
V.
VI.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]