
https://code.renci.org/gf/project/drops/wiki/?action=&pagename=Hadoop+Installation+Notes&wikiverssort_by=wiki_version.WIKI_VERSION_ID&wikiverssort_order=asc
http://www.ibm.com/developerworks/linux/library/l-hadoop2/index.html?ca=dgr-lnxw01HadoopP2dth-LX

Hadoop on Eucalyptus cloud


The goal was to develop the capability to launch a Hadoop cluster on demand on a Eucalyptus cloud. The overall steps are (1) create a Hadoop-enabled vm image and (2) modify the Hadoop src/contrib/ec2 scripts to work with the prepared image. There are existing tools that can deploy Hadoop on demand on Amazon EC2 clouds - Cloudera scripts, Whirr, jclouds and other tools that use the web services interface to an EC2 cloud. Unfortunately, these tools (as they stand now) could not be used with a Eucalyptus cloud endpoint. <Describe why these tools could not be used>. Although not the most user-friendly option, the modified contrib/ec2 scripts provide the most flexibility.

Notes on Eucalyptus image operations

Resize and upload image

To resize an image, so that you can install new software on it, e.g.:

$ cd /images/scratch/hadoop-images
$ sudo /sbin/fsck.ext3 -f centos-neuca.5-3.x86-64-2GB.img
$ sudo /sbin/resize2fs centos-neuca.5-3.x86-64-2GB.img 8G

Bundle, upload and register images

To bundle, upload and register an image, do:

$ euca-bundle-image -i centos-neuca.5-3.x86-64-8GB.img --kernel eki-43241251 --ramdisk eri-7BB1133A
$ euca-upload-bundle -b hadoop-images -m /tmp/centos-neuca.5-3.x86-64-8GB.img.manifest.xml
$ euca-register hadoop-images/centos-neuca.5-3.x86-64-8GB.img.manifest.xml

This should give you the new image id.

Running an instance with a keypair

$ source eucarc
$ euca-add-keypair hadoop-key > hadoop-key   # save the private key that euca-add-keypair prints
$ chmod 600 hadoop-key
$ euca-run-instances -k hadoop-key -t m1.xlarge emi-6A24167F
$ ssh -i hadoop-key root@192.168.201.24

Adding new software to an image and registering the modified image

Log in as root to a running instance instantiated from a neuca-enabled base image. You have to get the euca credentials and install the euca tools on the running instance in /mnt. Install all the new software, then bundle the volume from /mnt, upload the bundle and register the image. Here is an example:

$ yum update
$ cd /mnt
$ export VERSION=1.3.1
$ export ARCH=x86_64

$ yum install vim-X11 vim-common vim-enhanced vim-minimal
$ vi /etc/yum.repos.d/euca.repo
## Put the following in this file (without the comment tags) ##
# [euca2ools]
# name=Euca2ools
# baseurl=http://www.eucalyptussoftware.com/downloads/repo/euca2ools/1.3.1/yum/centos/
# enabled=1
##
$ yum install euca2ools.$ARCH --nogpgcheck
$ yum install python-dev libssl-dev swig help2man unzip rsync make wget curl
# (on CentOS the -devel package names, e.g. python-devel and openssl-devel, may be needed instead)
## Copy the Euca credentials into /mnt ##
$ cd /mnt
$ unzip euca2-Hadoop-x509.zip
## Install any packages you want to be included in the image ##
$ yum install emacs
##
## Bundle image, upload bundle and register image
# The kernel and ramdisk have to be the same as those of the current instance; -d is the destination dir for storing the parts;
# -p is the name of the image - the prefix to manifest.xml; -s is the size of the image in MB
##
$ source eucarc
$ euca-bundle-vol -c ${EC2_CERT} -k ${EC2_PRIVATE_KEY} -u ${EC2_USER_ID} --ec2cert ${EUCALYPTUS_CERT} --no-inherit --kernel eki-43241251 --ramdisk eri-7BB1133A -d /mnt -r x86_64 -p myImage -s 5120
## For some strange reason, euca-upload-bundle was failing with a permissions error;
# installing ntp solves the problem mysteriously
##
$ yum install -y ntp
$ ntpdate pool.ntp.org
$ euca-upload-bundle -b test-images -m /mnt/myImage.manifest.xml
$ euca-register test-images/myImage.manifest.xml

Delete registered images

To delete an image, do:

$ euca-deregister emi-6A761691
$ euca-delete-bundle -a $EC2_ACCESS_KEY -s $EC2_SECRET_KEY --url $S3_URL -b hadoop-images -p centos-neuca.5-3.x86-64-8GB.img --clear
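The bundle/upload/register cycle above is easy to get wrong by hand. Below is a minimal wrapper sketch, assuming eucarc has already been sourced; the BUCKET and IMAGE_NAME values are illustrative, and the eki/eri ids are the ones used on this cluster.

#!/bin/bash
# Sketch: bundle the running instance's volume, upload it and register it.
# Assumes eucarc has been sourced. BUCKET and IMAGE_NAME are example values.
set -e

BUCKET=test-images
IMAGE_NAME=myImage
KERNEL=eki-43241251      # this cluster's kernel id
RAMDISK=eri-7BB1133A     # this cluster's ramdisk id
SIZE_MB=5120

euca-bundle-vol -c "${EC2_CERT}" -k "${EC2_PRIVATE_KEY}" -u "${EC2_USER_ID}" \
    --ec2cert "${EUCALYPTUS_CERT}" --no-inherit \
    --kernel "${KERNEL}" --ramdisk "${RAMDISK}" \
    -d /mnt -r x86_64 -p "${IMAGE_NAME}" -s "${SIZE_MB}"

euca-upload-bundle -b "${BUCKET}" -m "/mnt/${IMAGE_NAME}.manifest.xml"

# euca-register prints the new emi-* id on success
euca-register "${BUCKET}/${IMAGE_NAME}.manifest.xml"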

Preparing a Hadoop Enabled Image


Start with a base neuca-enabled image with enough disk space. The emi-6A24167F image is an 8GB base neuca-enabled image:

$ euca-run-instances -k hadoop-key -t m1.xlarge emi-6A24167F

Log in to the vm as root.

Install Java

$ yum install java-sdk

Install rsync, sudo and ssh (if not already there)

$ yum install rsync
$ yum install sudo

Install apache httpd (for apachectl), php, mysql

$ yum install httpd
$ yum install php
$ yum install mysql-server mysql

Install ganglia for gmond, gmetad

$ rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
$ yum install rrdtool ganglia ganglia-gmetad ganglia-gmond ganglia-web httpd php apr apr-util

Create user hadoop

$ adduser -d /home/hadoop -s /bin/bash -m hadoop
$ passwd hadoop

Create HADOOP_HOME and copy the hadoop source code

# The scripts assume that HADOOP_HOME is /usr/local/hadoop-*
$ cd /usr/local
# Copy hadoop-0.20.2.tar.gz to /usr/local
$ scp geni-orca@geni-build.renci.org:/home/orca/hadoop/hadoop-0.20.2.tar.gz .
$ tar zxvf hadoop-0.20.2.tar.gz
# Edit hadoop-env.sh to modify JAVA_HOME to point to /usr/lib/jvm/java-1.6.0
$ vi hadoop-0.20.2/conf/hadoop-env.sh
# Remove hadoop-0.20.2.tar.gz so that 'ls -d /usr/local/hadoop-*' returns the correct HADOOP_HOME
$ rm hadoop-0.20.2.tar.gz
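If you would rather script the JAVA_HOME edit than open vi, a one-line sketch is below; it assumes hadoop-env.sh still contains the stock commented-out "# export JAVA_HOME=..." line shipped in the 0.20.2 tarball.

# Sketch: set JAVA_HOME in hadoop-env.sh non-interactively.
# Assumes the stock commented-out "# export JAVA_HOME=..." line is present.
sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-1.6.0|' \
    /usr/local/hadoop-0.20.2/conf/hadoop-env.sh

# Verify the change took effect
grep JAVA_HOME /usr/local/hadoop-0.20.2/conf/hadoop-env.sh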

Registering the Hadoop Enabled Image


Install the euca tools on the vm and copy the euca credentials to /mnt, following the same steps as in "Adding new software to an image and registering the modified image" above (yum update, create /etc/yum.repos.d/euca.repo, install euca2ools and its dependencies, then unzip euca2-Hadoop-x509.zip in /mnt).

Upload and register the new image

$ cd /mnt
$ source eucarc
$ euca-bundle-vol -c ${EC2_CERT} -k ${EC2_PRIVATE_KEY} -u ${EC2_USER_ID} --ec2cert ${EUCALYPTUS_CERT} --no-inherit --kernel eki-43241251 --ramdisk eri-7BB1133A -d /mnt -r x86_64 -p centos-neuca.5-3.x86-64-Hadoop-5GB.img -s 5120
$ euca-upload-bundle -b hadoop-images -m /mnt/centos-neuca.5-3.x86-64-Hadoop-5GB.img.manifest.xml
$ euca-register hadoop-images/centos-neuca.5-3.x86-64-Hadoop-5GB.img.manifest.xml
$ rm centos-neuca.5-3.x86-64-Hadoop-5GB.img.*

Note: euca-bundle-vol also creates the .img file. This can be saved and used later to upload the image at other eucalyptus sites. Another roundabout way to upload the same image at a different euca cluster is to download and source the eucarc for that cluster and redo the process from euca-bundle-vol onwards, keeping in mind to change the eki-* and eri-* arguments to euca-bundle-vol.
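If the .img produced by euca-bundle-vol was saved, a sketch like the one below can register it at another Eucalyptus site without re-bundling the live volume. The eucarc path and the target cluster's eki/eri ids are placeholders you must fill in for your site.

#!/bin/bash
# Sketch: register a previously saved .img at a different Eucalyptus cluster.
# OTHER_EUCARC, TARGET_KERNEL and TARGET_RAMDISK are placeholders for the
# target cluster's credentials file and its kernel/ramdisk image ids.
set -e

OTHER_EUCARC=/path/to/other-cluster/eucarc   # hypothetical path
TARGET_KERNEL=eki-XXXXXXXX                   # kernel id at the target cluster
TARGET_RAMDISK=eri-XXXXXXXX                  # ramdisk id at the target cluster
IMG=centos-neuca.5-3.x86-64-Hadoop-5GB.img

source "${OTHER_EUCARC}"

# Bundle the saved image file (not the live volume) for the target cluster
euca-bundle-image -i "${IMG}" --kernel "${TARGET_KERNEL}" --ramdisk "${TARGET_RAMDISK}"
euca-upload-bundle -b hadoop-images -m "/tmp/${IMG}.manifest.xml"
euca-register "hadoop-images/${IMG}.manifest.xml"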

Modify contrib/ec2 scripts

Go to the hadoop source tree on euca-m.renci.ben, the eucalyptus head node. Navigate to /home/orca/hadoop/hadoop-0.20.2/src/contrib/ec2/bin. The scripts in there work with Amazon EC2 by default. To make them work with a Eucalyptus cloud, the following scripts were modified: (1) hadoop-ec2-env.sh, (2) launch-hadoop-master, (3) launch-hadoop-slaves and (4) hadoop-ec2-init-remote.sh. A new script, 'start_hadoop_on_slaves.sh', was added. 'hadoop-ec2-env.sh' sets up the environment variables, including the emi-id of the Hadoop-enabled image (a sketch of the Eucalyptus-specific settings appears below). 'launch-hadoop-master' instantiates a vm that hosts the Hadoop master node, copies the remote init scripts over and invokes them to start the hadoop master daemons. 'launch-hadoop-slaves' starts the vms for each slave. When the vms are up, it copies the remote init script and the 'start_hadoop_on_slaves.sh' script to the hadoop master. The hadoop master then ssh'es to the slave nodes and launches the init scripts for the slaves, which start the hadoop slave daemons.
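For orientation, here is a sketch of the kind of settings in hadoop-ec2-env.sh that have to change for a Eucalyptus endpoint. The variable names follow the stock 0.20.2 contrib/ec2 script where known; treat the exact names, hostnames and values below as assumptions and check the actual modified script on euca-m.renci.ben.

# Sketch of Eucalyptus-specific settings in hadoop-ec2-env.sh (illustrative;
# verify variable names against the actual modified script).

# Point the ec2 tools at the Eucalyptus endpoints instead of Amazon's
# (eucarc normally exports these; hostnames below are assumed).
export EC2_URL=http://euca-m.renci.ben:8773/services/Eucalyptus
export S3_URL=http://euca-m.renci.ben:8773/services/Walrus

# Keypair used when instances are launched
KEY_NAME=hadoop-key

# The Hadoop-enabled image registered earlier, instead of a public AMI
AMI_IMAGE=emi-XXXXXXXX            # emi id of the Hadoop-enabled image

# Instance size used for master and slaves
INSTANCE_TYPE=m1.xlarge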

Testing A Hadoop Installation

Let HADOOP_HOME be the directory where the hadoop tarball has been unpacked, containing the bin and conf directories. The hadoop configuration files, like hadoop-site.xml (old style, with everything in one config file), reside in $HADOOP_HOME/conf on both the master machine and the slave machines. Remember to populate the MASTER_HOST entries in the configuration file to point to the IP address of the hadoop master (see the sketch after the master commands below). If the master and slave daemons are not started, start them. For the master, do the following:

$ cd $HADOOP_HOME
$ bin/hadoop namenode -format
$ bin/hadoop-daemon.sh start namenode
$ bin/hadoop-daemon.sh start jobtracker
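As a concrete example, here is a minimal old-style hadoop-site.xml written via a heredoc. The property names are the standard 0.20-era ones; the ports (9000/9001) and the example MASTER_HOST value are assumptions to adjust for your cluster.

# Sketch: write a minimal old-style hadoop-site.xml on master and slaves.
# Replace MASTER_HOST with the hadoop master's IP address; ports 9000/9001
# are conventional choices, not requirements.
MASTER_HOST=192.168.201.24   # example IP from the instance launched earlier

cat > $HADOOP_HOME/conf/hadoop-site.xml <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${MASTER_HOST}:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>${MASTER_HOST}:9001</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/hadoop</value>
  </property>
</configuration>
EOF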

On the slave, do the following:

$ cd $HADOOP_HOME
$ bin/hadoop-daemon.sh start datanode
$ bin/hadoop-daemon.sh start tasktracker

The logs are in the 'logs' directory under HADOOP_HOME. Note that before any slaves join the master, the master log will have exceptions like "error: java.io.IOException: File /mnt/hadoop/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1". This is normal, as I see in the mailing lists.

On the Hadoop master, go to HADOOP_HOME and run dfsadmin -report. This should show how many slaves have connected to the master, plus other dfs data.

$ cd $HADOOP_HOME
$ bin/hadoop dfsadmin -report

Then try out basic filesystem operations - making a dir, ls, putting files into dfs, etc. - and run a simple 'wordcount' mapreduce job.

$ bin/hadoop fs -mkdir /foodir
$ bin/hadoop fs -ls /
$ bin/hadoop fs -mkdir input
$ bin/hadoop fs -ls input
$ bin/hadoop fs -put /home/orca/foo/wordcount-input.txt input
$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output
$ bin/hadoop fs -ls output
$ bin/hadoop fs -cat output/part-r-00000 | head -10

Note that you can't see text output if 'compress' is set to true in hadoop-site.xml.
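To script the "wait for slaves" step instead of re-running dfsadmin -report by hand, a small sketch like this can poll until at least one datanode has checked in; the 'Datanodes available' string it greps for matches the 0.20-era report format, which is an assumption worth verifying against your own output.

#!/bin/bash
# Sketch: poll dfsadmin -report until at least one datanode has joined.
# Assumes it is run on the master; the grep pattern matches the 0.20-era
# report format and may need adjusting for other versions.
cd $HADOOP_HOME

while true; do
    LIVE=$(bin/hadoop dfsadmin -report 2>/dev/null \
           | grep 'Datanodes available' | awk '{print $3}')
    if [ -n "$LIVE" ] && [ "$LIVE" -ge 1 ] 2>/dev/null; then
        echo "HDFS reports $LIVE live datanode(s); cluster is usable."
        break
    fi
    echo "No datanodes yet; retrying in 10s..."
    sleep 10
done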

Useful end-to-end Hadoop cluster setup instructions at: http://www.ibm.com/developerworks/linux/library/l-hadoop2/index.html?ca=dgr-lnxw01HadoopP2dth-LX
