Multi-Node Installation Manual
venkat@cloudwick.com
Installation Steps
Step 1
Operating System (OS) platform selection and installation.
Why Linux?
Hadoop Supported Platforms
The Hadoop native library is supported on *nix platforms only. Unfortunately it is known not to work on Cygwin and Mac OS X, and it has mainly
been used on the GNU/Linux platform.
It has been tested on the following GNU/Linux distributions:
RHEL4/Fedora/SUSE Linux
Ubuntu
Gentoo
On all the above platforms a 32/64-bit Hadoop native library will work with the
respective 32/64-bit JVM. Install any of the Linux flavors listed above.
Step 2
Designing the cluster network.
Create two virtual machines - a Master and a Slave - using one of the Linux images.
While creating the VMs, provide hadoop as the user name. Once the VMs are ready,
click on VM from the file menu and choose Settings. Select Network Adapter and
choose Bridged instead of NAT. Do this on both the master and slave VMs.
Step 3
Change your Hostname without Rebooting in Linux
Make sure you are logged in as root and move to /etc/sysconfig and open the
network file in vi.
[root@localhost /]# vi /etc/sysconfig/network
Look for the HOSTNAME line and replace it with the new hostname you want to
use. Change localhost.localdomain to the required hostname. Here I
am going to change it to master.
When you are done, save your changes and exit vi.
Next edit the /etc/hosts file and set the new hostname.
[root@localhost /]# vi /etc/hosts
In hosts, edit the line that has the old hostname and replace it with your new one.
If it is not there, add the IP address and the new hostname at the end of the
existing entries. Replace localhost (or whatever hostname is associated with the
host IP) with the desired one, as follows.
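For example, assuming the master's IP is 192.168.208.217 and the slave's is 192.168.208.216 (the addresses used later in this manual), the entries would be:
192.168.208.217 master
192.168.208.216 slave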
Save your changes and exit vi. The changes to /etc/hosts and
/etc/sysconfig/network are necessary to make your changes persistent
(in the event of an unscheduled reboot).
Now we use the hostname program to change the hostname that is currently set.
[root@localhost ~]# hostname
localhost.localdomain
[root@localhost ~]# hostname master
[root@localhost ~]#
And run it again without any parameters to see if the hostname changed.
[root@localhost ~]# hostname
master
To verify the hostname has been fully changed, exit from terminal and you should
see your new hostname being used at the login prompt and after you've logged
back in.
Verify network connectivity between the master and the slave by pinging one from the other; in this setup the replies came back with round-trip times of roughly 0.3-1.4 ms.
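A quick way to check from the master, assuming the slave's hostname resolves through the /etc/hosts entries added earlier:
[root@master ~]# ping -c 3 slave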
Step 4
Firewall Configuration
Hadoop uses a lot of ports for its internal and external communications. In this
manual we simply allow all traffic between the servers in the cluster and the
clients, but if you do not want to do that you can selectively open only the
required ports instead. By default in Linux the following firewall mechanisms are enabled:
SELinux (Security-Enhanced Linux (SELinux) is a Linux feature that provides a mechanism for
supporting access control security policies, including United States Department of Defense-style
mandatory access controls, through the use of Linux Security Modules (LSM) in the Linux kernel. It is not a
Linux distribution, but rather a set of Kernel modifications and user-space tools that can be added to
various Linux distributions. Its architecture strives to separate enforcement of security decisions from the
security policy itself and streamlines the volume of software charged with security policy enforcement. The
key concepts underlying SELinux can be traced to several earlier projects by the United States National
Security Agency.)
IPTables (iptables is a user space application program that allows a system administrator to configure
the tables provided by the Linux kernel firewall (implemented as different Netfilter modules) and the
chains and rules it stores. Different kernel modules and programs are currently used for different
protocols; iptables applies to IPv4, ip6tables to IPv6, arptables to ARP, and ebtables to Ethernet frames.
iptables requires elevated privileges to operate and must be executed by user root, otherwise it fails to
function. On most Linux systems, iptables is installed as /usr/sbin/iptables and documented in its man
page, which can be opened using man iptables when installed. It may also be found in /sbin/iptables, but
since iptables is more like a service rather than an "essential binary", the preferred location remains
/usr/sbin.)
TCP Wrapper(TCP Wrapper is a host-based networking ACL system, used to filter network access to
Internet Protocol servers on (Unix-like) operating systems such as Linux or BSD. It allows host or
subnetwork IP addresses, names and/or ident query replies, to be used as tokens on which to filter for
access control purposes).
Disable iptables (IPv4 and IPv6) and disable SELinux. For now we are not
concerned with TCP Wrappers.
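On a Red Hat-style system the usual way to do this is with the service and chkconfig tools (a sketch):
# service iptables stop
# service ip6tables stop
# chkconfig iptables off
# chkconfig ip6tables off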
You can also use the setenforce command, as shown below, to switch SELinux off
temporarily (until the next reboot). Possible parameters to setenforce are:
Enforcing, Permissive, 1 (enforcing) or 0 (permissive).
# setenforce 0
To disable SELinux permanently, modify /etc/selinux/config and set
SELINUX=disabled as shown below. Once you make any changes to
/etc/selinux/config, reboot the server for the changes to take effect.
# cat /etc/selinux/config
SELINUX=disabled
SELINUXTYPE=targeted
The following are the possible values for the SELINUX variable in the
/etc/selinux/config file: enforcing (the security policy is enforced), permissive
(SELinux prints warnings instead of enforcing), and disabled (no SELinux policy is loaded).
Step 5
Install Java on all the machines.
Download and copy the Java bin file into the /usr/local directory.
Requirements:
jdk1.6.0 x64 (bin) or jre1.6.0 x64 (bin).
Installation Instructions
This procedure installs the Java Development Kit (JDK) for 64-bit Linux, using a
self-extracting binary file. The JDK download includes the Java SE Runtime
Environment (JRE), so you do not have to download the JRE separately.
The name of the downloaded file has the following format:
jdk-6u<version>-linux-x64.bin
For example, for <version> 18 the file is jdk-6u18-linux-x64.bin.
Change directory to the location where you would like the files to be installed;
the installer extracts the JDK into the current directory. In this manual we use
/usr/local/. Run the self-extracting binary there, as shown below.
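A typical sequence, assuming the downloaded file is jdk-6u18-linux-x64.bin and has been copied to /usr/local (run as root):
# cd /usr/local
# chmod +x jdk-6u18-linux-x64.bin
# ./jdk-6u18-linux-x64.bin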
The binary code license is displayed, and you are prompted to agree to its terms.
The Java Development Kit files are installed in a directory called jdk1.6.0_<version>
in the current directory.
Delete the bin file if you want to save disk space.
Note about Root Access: Installing the software automatically creates a directory
called jdk1.6.0_<version>. Note that if you choose to install the JDK into a
system-wide location such as /usr/local, you must first become root to gain the
necessary permissions. If you do not have root access, simply install the JDK into
your home directory, or a subdirectory that you have permission to write to.
Note about Overwriting Files: If you install the software in a directory that
contains a subdirectory named jdk1.6.0_<version>, the new software overwrites
files of the same name in that jdk1.6.0_<version> directory. Please be careful to
rename the old directory if it contains files you would like to keep.
Note about System Preferences: By default, the installation script configures the
system such that the backing store for system preferences is created inside the
JDK's installation directory. If the JDK is installed on a network-mounted drive, it
and the system preferences can be exported for sharing with Java runtime
environments on other machines.
Finally, add the JDK's bin directory to the PATH (adjust /usr/local/java to the
directory the JDK was actually extracted into, for example /usr/local/jdk1.6.0_30):
export PATH=$PATH:/usr/local/java/bin
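To confirm that Java is on the PATH, a quick check:
# java -version
It should report the 1.6.0 JDK installed above.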
Step 6
Cloudera CDH3 Installation (Hadoop Installation)
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems; however,
the differences from other distributed file systems are significant. HDFS is highly fault-tolerant
and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access
to application data and is suitable for applications that have large data sets. HDFS relaxes a few
POSIX requirements to enable streaming access to file system data. HDFS was originally built as
infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache
Hadoop project, which began as a subproject of Apache Lucene. The following picture gives an
overview of the most important HDFS components.
Click the following link to go to the installation package, or copy the link
and open it in a browser.
https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation
Download and open cdh3-repository-1.0.1-noarch.rpm with the package installer and install it.
Click Continue Anyway to initiate the installation of the package, then click Install to start the
installation; it will then download and install the package.
(To install a different version of CDH on a Red Hat system, open the repo file (for example,
cloudera-cdh3.repo) and change the 3 in the repo file to the version number you want. For
example, change the 3 to 3u0 to install CDH3 Update 0.)
b) To install CDH3 on a Red Hat system:
Before installing, optionally add a repository key. Add the Cloudera Public GPG Key to your
repository by executing the following command:
# sudo rpm --import http://archive.cloudera.com/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
c) Find and install the Hadoop core package. For example:
#yum search hadoop
#sudo yum install hadoop-0.20
Install each type of daemon package on the appropriate machine. For example, install the
NameNode package on your Master machine:
#sudo yum install hadoop-0.20-<daemon type>
where <daemon type> is one of the following:
namenode
datanode
secondarynamenode
jobtracker
tasktracker
After installing the daemon packages, check whether the daemons were installed correctly.
Their init scripts are placed in /etc/init.d.
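For example, listing the init scripts (the exact set depends on which daemon packages you installed on that machine):
[root@master /]# ls /etc/init.d/ | grep hadoop
hadoop-0.20-datanode
hadoop-0.20-jobtracker
hadoop-0.20-namenode
hadoop-0.20-secondarynamenode
hadoop-0.20-tasktracker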
Step 7
Configuration
The only required environment variable we have to configure for Hadoop in this
tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (with
the CDH packages installed above, the full path is
/usr/lib/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment
variable to the Sun JDK/JRE 6 directory.
Update the following lines in the hadoop-env.sh file, on both the master and the slave:
[root@slave /]# cd /usr/lib/hadoop/conf/
[root@slave conf]# vim hadoop-env.sh
Change
# The java implementation to use. Required.
# export JAVA_HOME=/usr/local/jdk1.6.0_30/
to
# The java implementation to use. Required.
export JAVA_HOME=/usr/local/jdk1.6.0_30/
Step 8
Adding the dedicated users to the hadoop group
Now we will add the dedicated users (hdfs and mapred) to the hadoop
group on the master and the slave.
In master
[root@master /]# sudo gpasswd -a hdfs hadoop
[root@master /]# sudo gpasswd -a mapred hadoop
In slave
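[root@slave /]# sudo gpasswd -a hdfs hadoop
[root@slave /]# sudo gpasswd -a mapred hadoop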
conf/*-site.xml
Note: As of Hadoop 0.20.0, the configuration settings previously found in hadoop-site.xml
were moved to core-site.xml (hadoop.tmp.dir, fs.default.name),
mapred-site.xml (mapred.job.tracker) and hdfs-site.xml (dfs.replication).
In this section, we will configure the directory where Hadoop will store its data
files, the network ports it listens on, and so on.
In file conf/core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/lib/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:8020</value>
</property>
Create the tmp directory in some location and give it the proper ownership and
permissions, then map the location in the core-site.xml file. This will be the
temporary folder in which data headed for HDFS is first saved.
[root@master /]# mkdir /usr/lib/hadoop/tmp
[root@master hadoop]# chmod 750 tmp/
[root@master hadoop]# chown hdfs:hadoop tmp/
In file conf/hdfs-site.xml:
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/storage/name </value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/storage/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
On the master, give the ownership of the storage directory to the hdfs user.
[root@master /]# chown hdfs:hadoop /storage/
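If the name and data directories referenced in hdfs-site.xml do not exist yet, create them first; a sketch matching the values above (the slave needs /storage/data as well, since the same hdfs-site.xml is synced to it later):
[root@master /]# mkdir -p /storage/name /storage/data
[root@master /]# chown -R hdfs:hadoop /storage/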
In file conf/mapred-site.xml:
[root@slave conf]$ gedit mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>hdfs://master:8021</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/hadoop/mapred/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/mapred/local</value>
</property>
<property>
<name>mapred.temp.dir</name>
<value>/home/hadoop/mapred/temp</value>
</property>
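mapred.local.dir must be writable by the mapred user on every node; if you create the directories ahead of time, a minimal sketch using the paths above:
[root@master /]# mkdir -p /home/hadoop/mapred/local
[root@master /]# chown -R mapred:hadoop /home/hadoop/mapred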
To synchronize the core-site.xml, hdfs-site.xml and mapred-site.xml files between the master and the slave:
For core-site.xml
[root@master /]# rsync -avrt /usr/lib/hadoop/conf/core-site.xml
root@<slave IP add>:/usr/lib/hadoop/conf/core-site.xml
The authenticity of host '192.168.208.216 (192.168.208.216)' can't be
established.
RSA key fingerprint is 6c:2a:38:e6:b3:e0:0c:00:88:56:55:df:f6:b9:a3:68.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.208.216' (RSA) to the list of known
hosts.
root@192.168.208.216's password:
sending incremental file list
core-site.xml
For hdfs-site.xml
[root@master /]# rsync -avrt /usr/lib/hadoop/conf/hdfs-site.xml root@<slave IP add>:/usr/lib/hadoop/conf/hdfs-site.xml
For mapred-site.xml
[root@master /]# rsync -avrt /usr/lib/hadoop/conf/mapred-site.xml root@<slave IP add>:/usr/lib/hadoop/conf/mapred-site.xml
In master
[root@master /]# export HADOOP_NAMENODE_USER=hdfs
[root@master /]# export HADOOP_SECONDARYNAMENODE_USER=hdfs
[root@master /]# export HADOOP_DATANODE_USER=hdfs
[root@master /]# export HADOOP_JOBTRACKER_USER=mapred
[root@master /]# export HADOOP_TASKTRACKER_USER=mapred
In slave
[root@slave /]# export HADOOP_NAMENODE_USER=hdfs
[root@slave /]# export HADOOP_DATANODE_USER=hdfs
[root@slave /]# export HADOOP_JOBTRACKER_USER=mapred
[root@slave /]# export HADOOP_TASKTRACKER_USER=mapred
Step 9
Formatting the HDFS filesystem via the NameNode
Now move to the bin directory of Hadoop; we are about to format our HDFS
filesystem via the namenode.
[root@master init.d]# cd /usr/lib/hadoop/bin/
[root@master bin]# ls
hadoop            hadoop-daemons.sh  start-all.sh       start-mapred.sh    stop-dfs.sh
hadoop-config.sh  rcc                start-balancer.sh  stop-all.sh        stop-mapred.sh
hadoop-daemon.sh  slaves.sh          start-dfs.sh       stop-balancer.sh
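Format the namenode as the hdfs user (a minimal sketch of the standard command, assuming the hdfs user owns /storage/name as set up above):
[root@master bin]# sudo -u hdfs hadoop namenode -format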
12/03/21 13:29:17 INFO common.Storage: Storage directory /storage/name has been successfully formatted.
Now we will start the daemons one by one on the appropriate nodes if
we need a multi-node cluster. Otherwise start all the daemons on the single
node.
bin/start-all.sh
starting namenode, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hdfs-namenode-master.out
hdfs@master's password:
master: starting datanode, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hdfs-datanode-master.out
hdfs@master's password:
master: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hdfs-secondarynamenode-master.out
starting jobtracker, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hdfs-jobtracker-master.out
hdfs@master's password:
master: starting tasktracker, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hdfs-tasktracker-master.out
[hdfs@master hadoop]$ jps
7977 SecondaryNameNode
7744 NameNode
7869 DataNode
8173 TaskTracker
7897 JobTracker
8188 Jps
[hdfs@master hadoop]$
Here the number of live nodes is 1 because we have started all the daemons on a single
node.
Secondarynamenode Daemon:
[root@master /]# /etc/init.d/hadoop-0.20-secondarynamenode start
Starting Hadoop secondarynamenode daemon (hadoop-secondarynamenode): starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-master.out
secondarynamenode (pid 30575) is running...
[ OK ]
[root@master /]#
Jobtracker Daemon:
[root@master /]# /etc/init.d/hadoop-0.20-jobtracker start
Starting Hadoop jobtracker daemon (hadoop-jobtracker): starting
jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-master.out
jobtracker (pid 30178) is running...
[ OK ]
[root@master /]#
Tasktracker Daemon:
[root@master /]# /etc/init.d/hadoop-0.20-tasktracker start
Starting Hadoop tasktracker daemon (hadoop-tasktracker): starting
tasktracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-master.out
tasktracker (pid 28193) is running...
[ OK ]
[root@master /]#
As of now we have installed the datanode and tasktracker on both the master and
the slave, so we need to start them on both machines. For example, on the slave:
[root@slave /]# /etc/init.d/hadoop-0.20-tasktracker start
Starting Hadoop tasktracker daemon (hadoop-tasktracker): starting
tasktracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-slave.out
tasktracker (pid 28193) is running...
[ OK ]
[root@slave /]#
Datanode Daemon:
[root@master /]# /etc/init.d/hadoop-0.20-datanode start
Starting Hadoop datanode daemon (hadoop-datanode): starting
datanode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-datanode-master.out
datanode (pid 30655) is running...
[ OK ]
[root@master /]#
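Since the datanode package is installed on both machines, start it on the slave as well:
[root@slave /]# /etc/init.d/hadoop-0.20-datanode start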
If you want the full details of the listening ports, run:
[root@master /]# netstat -ptlen
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address   State    User   Inode   PID/Program name
tcp        0      0 0.0.0.0:22                  0.0.0.0:*         LISTEN   0      9970    1615/sshd
tcp        0      0 127.0.0.1:25                0.0.0.0:*         LISTEN   0      10170   1691/master
tcp        0      0 :::50090                    :::*              LISTEN   494    66492   30575/java
tcp        0      0 ::ffff:192.168.208.217:8020 :::*              LISTEN   494    66244   30415/java
tcp        0      0 :::59284                    :::*              LISTEN   494    66099   30415/java
tcp        0      0 :::47701                    :::*              LISTEN   494    66469   30575/java
tcp        0      0 :::50070                    :::*              LISTEN   494    66255   30415/java
tcp        0      0 :::22                       :::*              LISTEN   0      9972    1615/sshd
tcp        0      0 :::56184                    :::*              LISTEN   494    66596   30655/java
tcp        0      0 :::50010                    :::*              LISTEN   494    66818   30655/java
tcp        0      0 :::50075                    :::*              LISTEN   494    66820   30655/java
tcp        0      0 :::50020                    :::*              LISTEN   494    66836   30655/java
[root@master /]#
Now open the browser and enter master:50070 in the address bar. It should show
the NameNode web UI reporting 2 live datanodes.
Let us check the status of the running nodes: click on Live Nodes and we will get
the following details.
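The same information is also available from the command line with the standard dfsadmin report (run as the hdfs user):
[root@master /]# sudo -u hdfs hadoop dfsadmin -report
It prints the configured capacity of the cluster and a status entry for each live datanode.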