
docForCloudera

IN PROGRESS: Expect incomplete content.

Project
Use Case
Scope
Detailed Installation Process
Analyze Recommended Hardware/Software
Choose an Amazon Machine Image (AMI) Instance
Connect to bigData
Configure ec2 Security Group
Configure RHEL Firewall Rules
Install the Correct Oracle Java Development Kit
Install CDH4 on a Single Linux Node in Pseudo-distributed Mode
Install Hadoop
Hello World MapReduce Application
Starting and Stopping Hadoop
Our POC
Conclusion and Lessons Learned
Miscellaneous Notes
Helpful Links

Project
Use Case
Installing Hadoop
Scope
This is an initial attempt at producing a big data sandbox, so scope is important. The following items are NOT in scope:
- Security
- Working at a client data center
- Deploying in distributed mode (i.e. a multi-node cluster)2
All work was completed using the EC2 admin account.

Detailed Installation Process

I decided to use Cloudera's Distribution Including Apache Hadoop (CDH4). Cloudera provides a Quick Start Guide, an Installation Guide, a Security Guide and a High Availability Guide. All of the documentation for CDH4 links from the main CDH4 documentation page.

Analyze Recommended Hardware/Software

I want to use a Red Hat Enterprise Linux (RHEL) distribution because the client's data center uses Red Hat distributions. CDH4 is available for:
- 64-bit packages for Red Hat Enterprise Linux 5.7
- 32-bit and 64-bit packages for Red Hat Enterprise Linux 6.2
Cloudera recommends 64-bit packages for production environments.

Choose an Amazon Machine Image (AMI) Instance


I am going to use a Standard Medium Instance3:
- 2 ECUs
- 1 core
- 3.7 GB memory
- Availability zone = us-east-1a
I am going to use RHEL 6.3, 64-bit4:
- Default kernel ID
- Default RAM disk ID
- Key pair name = bigData
- Security group = default
This will cost $0.190 per hour.

Connect to bigData
Now, connect to the machine using the secure shell (SSH) to verify that it is up and running:
Establish Connection
ssh -i bigData.pem foo@bar.amazonaws.com

I can connect to the machine successfully.5

Configure ec2 Security Group


Hadoop will require the following security rules:

Connection method   Protocol   From Port   To Port   Source
HTTP                TCP        80          80        0.0.0.0/0
SSH                 TCP        22          22        0.0.0.0/0
Custom...           TCP        50030       50030     0.0.0.0/0
Custom...           TCP        50070       50070     0.0.0.0/0

The default security group in our EC2 Admin account has these settings baked in, so you should be able to just use that.
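If these rules ever need to be recreated by hand, the following is a minimal sketch using the AWS CLI (this was not part of the original setup; the group name, ports, and CIDR come from the table above):

Authorize Hadoop Console Ports
## Open the JobTracker (50030) and NameNode (50070) web console ports on the default group.
## HTTP (80) and SSH (22) follow the same pattern.
aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 50030 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 50070 --cidr 0.0.0.0/0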

Configure RHEL Firewall Rules

RHEL firewall rules will prevent you from accessing Hadoop services. You need to configure the iptables service so you can access Hadoop services. One way to do this:
Configure iptables
## save:
service iptables save
## output:
iptables: Saving firewall rules to /etc/sysconfig/iptables: [  OK  ]
## stop:
service iptables stop
## output:
iptables: Flushing firewall rules:                          [  OK  ]
iptables: Setting chains to policy ACCEPT: filter           [  OK  ]
iptables: Unloading modules:                                [  OK  ]

You can perform a more specific and probably more secure configuration by adding rules to /etc/sysconfig/iptables. I chose not to take this approach because security is out of scope and this seems more complicated.
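For reference, a minimal sketch of that more targeted approach on a stock RHEL 6 iptables configuration (these exact rules were not tested as part of this install):

Open Hadoop Ports in iptables
## Add rules for the Hadoop web consoles to /etc/sysconfig/iptables, above the final REJECT rule:
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT
## Then reload the rules:
service iptables restart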

Install the Correct Oracle Java Development Kit

CDH4 requires:
- JDK 1.6.0_8 = minimum required version
- JDK 1.6.0_31 = recommended version
Check the installed Java version:
Java Version
java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (rhel-1.45.1.11.1.el6-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

We need to install Oracle's JDK:
1. Download jdk-6u31-linux-x64.bin to my machine
2. FTP the file to EC2 (see the scp sketch below)
3. Install

Be Careful: If you don't configure JAVA_HOME correctly, you will not be able to start the MapReduce service.
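Step 2 ("FTP to EC2") can be done with scp, reusing the same key pair as the SSH step; a minimal sketch (the hostname is the same placeholder as before):

Copy JDK to EC2
## Copy the JDK installer to the instance's home directory
scp -i bigData.pem jdk-6u31-linux-x64.bin foo@bar.amazonaws.com:~/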

Install JDK
## This example assumes you installed your JDK to: /usr/lib/java/jdk1.6.0_31/
## Make the file executable:
chmod +x jdk-6u31-linux-x64.bin
## Run the file:
./jdk-6u31-linux-x64.bin
## Check to see if the default changed:
java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (rhel-1.45.1.11.1.el6-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
## Edit /etc/profile, somewhere include:
export JAVA_HOME=/usr/lib/java/jdk1.6.0_31
export PATH=$JAVA_HOME/bin:$PATH
## Edit /etc/sudoers, somewhere include:
Defaults env_keep+=JAVA_HOME
## Check JAVA_HOME:
sudo env | grep JAVA_HOME
## Should see:
JAVA_HOME=/usr/lib/java/jdk1.6.0_31
## Don't move on until this works, or else nothing else will work!
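An alternative I did not use here, sketched under the assumption that the JDK lives at the same path as above, is to register the Oracle JDK with RHEL's alternatives system so the default java on the PATH changes without editing /etc/profile:

Register JDK with alternatives
## Register the Oracle JDK with a high priority so it can become the default
sudo alternatives --install /usr/bin/java java /usr/lib/java/jdk1.6.0_31/bin/java 20000
## Pick it interactively if more than one JDK is registered
sudo alternatives --config java
## Confirm the default changed
java -version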

Install CDH4 on a Single Linux Node in Pseudo-distributed Mode


Download the CDH4 distribution for a 64-bit Red Hat machine from Cloudera, FTP it to EC2, and install it:
Install CDH4 Package
sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.noarch.rpm

Add the Cloudera Public GPG Key to your repository:

Add Public Key


sudo rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

Install Hadoop in pseudo-distributed mode: Install Hadoop


sudo yum install hadoop-0.20-conf-pseudo
## Verify install:
hadoop version
Hadoop 2.0.0-cdh4.0.1
Subversion file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hadoop-2.0.0-cdh4.0.1/src/hadoop-common-project/hadoop-common -r 4d98eb718ec0cce78a00f292928c5ab6e1b84695
Compiled by jenkins on Thu Jun 28 17:39:22 PDT 2012
From source with checksum 04eb9f6c19c85f3d085358fcfed36767
## View files installed by the package:
rpm -ql hadoop-0.20-conf-pseudo
/etc/hadoop/conf.pseudo.mr1
/etc/hadoop/conf.pseudo.mr1/README
/etc/hadoop/conf.pseudo.mr1/core-site.xml
/etc/hadoop/conf.pseudo.mr1/hadoop-metrics.properties
/etc/hadoop/conf.pseudo.mr1/hdfs-site.xml
/etc/hadoop/conf.pseudo.mr1/log4j.properties
/etc/hadoop/conf.pseudo.mr1/mapred-site.xml
/var/lib/hadoop
/var/lib/hadoop/cache
/var/lib/hdfs
/var/lib/hdfs/cache
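As a sanity check (my addition, assuming CDH's usual use of the alternatives system), you can confirm that /etc/hadoop/conf now resolves to the pseudo-distributed MR1 configuration listed above:

Check Active Hadoop Configuration
## Show which configuration directory the hadoop-conf alternative points to
alternatives --display hadoop-conf
## /etc/hadoop/conf should link (via /etc/alternatives) into conf.pseudo.mr1
ls -l /etc/hadoop/conf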

Install Hadoop
1. Format the Name Node:

Format Name Node


sudo -u hdfs hdfs namenode -format
## Output:
12/08/15 08:59:00 WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
12/08/15 08:59:00 WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-3619e2da-781e-4bc6-9d1b-f86f1ae20f9b
12/08/15 08:59:01 INFO util.HostsFileReader: Refreshing hosts (include/exclude) list
12/08/15 08:59:01 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
12/08/15 08:59:01 INFO util.GSet: VM type = 64-bit
12/08/15 08:59:01 INFO util.GSet: 2% max memory = 19.33375 MB
12/08/15 08:59:01 INFO util.GSet: capacity = 2^21 = 2097152 entries
12/08/15 08:59:01 INFO util.GSet: recommended=2097152, actual=2097152
12/08/15 08:59:01 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
12/08/15 08:59:01 INFO blockmanagement.BlockManager: defaultReplication = 1
12/08/15 08:59:01 INFO blockmanagement.BlockManager: maxReplication = 512
12/08/15 08:59:01 INFO blockmanagement.BlockManager: minReplication = 1
12/08/15 08:59:01 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
12/08/15 08:59:01 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
12/08/15 08:59:01 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
12/08/15 08:59:01 INFO namenode.FSNamesystem: fsOwner = hdfs (auth:SIMPLE)
12/08/15 08:59:01 INFO namenode.FSNamesystem: supergroup = supergroup
12/08/15 08:59:01 INFO namenode.FSNamesystem: isPermissionEnabled = true
12/08/15 08:59:01 INFO namenode.FSNamesystem: HA Enabled: false
12/08/15 08:59:01 INFO namenode.FSNamesystem: Append Enabled: true
12/08/15 08:59:01 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/08/15 08:59:02 INFO namenode.NNStorage: Storage directory /var/lib/hadoop-hdfs/cache/hdfs/dfs/name has been successfully formatted.
12/08/15 08:59:02 INFO namenode.FSImage: Saving image file /var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
12/08/15 08:59:02 INFO namenode.FSImage: Image file of size 119 saved in 0 seconds.
12/08/15 08:59:02 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
12/08/15 08:59:02 INFO namenode.FileJournalManager: Purging logs older than 0
12/08/15 08:59:02 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-10-90-255-217.ec2.internal/10.90.255.217
************************************************************/
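A quick way to confirm the format actually wrote metadata to local disk (my addition; the path is the one reported in the log above):

Check Formatted Name Directory
## Expect a VERSION file and an fsimage checkpoint in the freshly formatted name directory
sudo ls -l /var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current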

2. Start HDFS

Start HDFS
## Start:
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service start
done
## Output:
Starting Hadoop datanode:                                  [  OK  ]
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-90-255-217.out
Starting Hadoop namenode:                                  [  OK  ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-10-90-255-217.out
Starting Hadoop secondarynamenode:                         [  OK  ]
starting secondarynamenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-ip-10-90-255-217.out
## Stop:
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service stop
done
## Output:
Stopping Hadoop datanode:                                  [  OK  ]
stopping datanode
Stopping Hadoop namenode:                                  [  OK  ]
stopping namenode
Stopping Hadoop secondarynamenode:                         [  OK  ]
stopping secondarynamenode

The NameNode provides a web console on port 50070. Go to http://localhost:50070/dfshealth.jsp to view it:

If you are having trouble getting to the admin page, it could be a firewall issue. To check:
Debug Firewall Issues
## Download index.html
wget 127.0.0.1:50070
## Output:
Connecting to 127.0.0.1:50070... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1045 (1.0K) [text/html]
Saving to: index.html
## Print index.html to standard out
cat index.html
<meta HTTP-EQUIV="REFRESH" content="0;url=dfshealth.jsp"/>
<html>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
-->

3. Create the /tmp Directory
Create /tmp Directory


## Make the directory:
sudo -u hdfs hadoop fs -mkdir /tmp
## Everyone gets read, write, and execute permission on this folder, but the sticky bit keeps anyone except the owner from deleting it or changing its permissions:
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

4. Create the MapReduce System Directories

Create System Directories


## Create a bunch of directories:
sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoophdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoophdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
## Check the output:
sudo -u hdfs hadoop fs -ls -R /
## Returns:
drwxrwxrwt - hdfs   supergroup 0 /tmp
drwxr-xr-x - hdfs   supergroup 0 /var
drwxr-xr-x - hdfs   supergroup 0 /var/lib
drwxr-xr-x - hdfs   supergroup 0 /var/lib/hadoop-hdfs
drwxr-xr-x - hdfs   supergroup 0 /var/lib/hadoop-hdfs/cache
drwxr-xr-x - mapred supergroup 0 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x - mapred supergroup 0 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxr-xr-x - hdfs   supergroup 0 /var/lib/hadoophdfs
drwxr-xr-x - hdfs   supergroup 0 /var/lib/hadoophdfs/cache
drwxr-xr-x - hdfs   supergroup 0 /var/lib/hadoophdfs/cache/mapred
drwxr-xr-x - hdfs   supergroup 0 /var/lib/hadoophdfs/cache/mapred/mapred
drwxrwxrwt - hdfs   supergroup 0 /var/lib/hadoophdfs/cache/mapred/mapred/staging
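Before starting MapReduce it can be worth a quick look (my addition) at the installed pseudo-distributed configuration to see which HDFS paths and ports the services expect:

Inspect Pseudo Config
## Review the MR1 and HDFS settings installed by hadoop-0.20-conf-pseudo above
cat /etc/hadoop/conf/mapred-site.xml
cat /etc/hadoop/conf/hdfs-site.xml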

5. Start MapReduce
Start Map Reduce


## Start Service:
for service in /etc/init.d/hadoop-0.20-mapreduce-*
do
  sudo $service start
done
## Stop Service:
for service in /etc/init.d/hadoop-0.20-mapreduce-*
do
  sudo $service stop
done

The JobTracker provides a web console on port 50030. Go to http://localhost:50030/jobtracker.jsp to view it:

6. Create User Directories
Create a home directory for the user on the NameNode and update file permissions:
Add Joe User
## Create group if it does not exist:
groupadd hdusers
## Create user and add to group:
useradd -G hdusers joe
## TODO: Add SSH access and sudoers
## Check that the user was created successfully:
awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd

Pay Attention: If you do not set up permissions correctly, this won't work. You may see errors like:
Caused by: org.apache.hadoop.security.AccessControlException

Create Directory on NameNode


## Home directory:
sudo -u hdfs hadoop fs -mkdir /user/joe
sudo -u hdfs hadoop fs -chown joe /user/joe
## File permissions:
sudo -u hdfs hadoop fs -chmod -R 777 /var/lib/hadoop-hdfs/cache/mapred/mapred
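A quick check (my addition; the test file name is arbitrary) that the new user can actually write to its home directory:

Verify Joe Can Write
## Create an empty test file as joe, list it, then clean it up
sudo -u joe hadoop fs -touchz /user/joe/permissions_test
sudo -u joe hadoop fs -ls /user/joe
sudo -u joe hadoop fs -rm /user/joe/permissions_test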

Hello World MapReduce Application


Set up data for processing:

Copy XML
su joe
hadoop fs -mkdir input
hadoop fs -put /etc/hadoop/conf/*.xml input
hadoop fs -ls input
## Output:
-rw-r--r-- 1 joe supergroup 1461 2012-08-20 14:42 input/core-site.xml
-rw-r--r-- 1 joe supergroup 1854 2012-08-20 14:42 input/hdfs-site.xml
-rw-r--r-- 1 joe supergroup 1001 2012-08-20 14:42 input/mapred-site.xml
## Run example MapReduce:
/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
## Check output:
hadoop fs -ls
Found 2 items
drwxr-xr-x - hdfs supergroup 0 2012-08-20 17:00 input
drwxr-xr-x - hdfs supergroup 0 2012-08-21 08:46 output
## List output files:
hadoop fs -ls output
Found 3 items
-rw-r--r-- 1 hdfs supergroup   0 2012-08-21 08:46 output/_SUCCESS
drwxr-xr-x - hdfs supergroup   0 2012-08-21 08:46 output/_logs
-rw-r--r-- 1 hdfs supergroup 150 2012-08-21 08:46 output/part-00000
## View content:
hadoop fs -cat output/part-00000 | head
1 dfs.datanode.data.dir
1 dfs.namenode.checkpoint.dir
1 dfs.namenode.name.dir
1 dfs.replication
1 dfs.safemode.extension
1 dfs.safemode.min.datanodes
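MapReduce refuses to write to an output directory that already exists, so rerunning the example first requires removing it; a minimal sketch:

Rerun Example Job
## Remove the previous output directory, then rerun the same example
hadoop fs -rm -r output
/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'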

Starting and Stopping Hadoop


Starting Hadoop (TODO: this should be scripted; a sketch follows the stop block below):

Starting Hadoop
## 1. Start the machine using https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances
## 2. Connect:
ssh -i bigData.pem foo@bar.amazonaws.com
## 3. Turn off firewalls:
service iptables save
service iptables stop
## 4. Start HDFS:
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service start
done
## 5. Check NameNode UI: ec2-50-17-179-120.compute-1.amazonaws.com:50070
## 6. Start MapReduce:
for service in /etc/init.d/hadoop-0.20-mapreduce-*
do
  sudo $service start
done
## 7. Check MapReduce UI: ec2-50-17-179-120.compute-1.amazonaws.com:50030
## 8. Or check for running Java processes:
jps

Stopping Hadoop: Stopping Hadoop


## 1. Stop HDFS:
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service stop
done
## 2. Stop MapReduce:
for service in /etc/init.d/hadoop-0.20-mapreduce-*
do
  sudo $service stop
done
## Turn off the machine using https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances
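The scripting TODO above could be covered by something like this sketch (my addition, untested; it assumes the same init.d service layout and a user with sudo rights):

hadoop-ctl.sh
#!/bin/bash
## Start or stop the pseudo-distributed HDFS and MR1 services in one go.
## Usage: ./hadoop-ctl.sh start|stop
ACTION="$1"
if [ "$ACTION" != "start" ] && [ "$ACTION" != "stop" ]; then
  echo "Usage: $0 start|stop" >&2
  exit 1
fi
## Same wildcards as the manual loops above
for service in /etc/init.d/hadoop-hdfs-* /etc/init.d/hadoop-0.20-mapreduce-*
do
  sudo "$service" "$ACTION"
done
## Show which Hadoop Java processes are running afterwards
sudo jps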

Our POC
- Data load
- Map Reduce
- Data Pipeline
- ETL
- Testing the Data Sandbox

Conclusion and Lessons Learned

Miscellaneous
Notes
1 The term 'big data stack' is purposefully vague at this point. Currently, the plan for this project is to use Cloudera's CDH4 Linux package to get started. This package includes the Hadoop Distributed File System (HDFS), MapReduce, Flume, HBase, Hive, Hue, and Oozie. As detailed requirements for this project emerge, the specifics of the big data stack may change.

2 CDH4 can run in pseudo-distributed mode (i.e. a single-node cluster) vs. distributed mode (i.e. a multi-node cluster). In pseudo-distributed mode, Hadoop processing is distributed over all of the cores/processors on a single machine. All Hadoop services communicate over local transmission control protocol (TCP) sockets for inter-process communication.

3 A few minutes of research did not yield any helpful suggestions as to which instance type would work best. From a cost and performance perspective, a medium instance seems like a reasonable place to start.

4 Amazon does not offer RHEL 6.2 from the quick start wizard for creating a new instance. The CDH4 release notes reference RHEL 6.2 and 6.3, so I am assuming RHEL 6.3's exclusion from the CDH4 quick start guide was an error, and I will be using RHEL 6.3.

5 I did not add anyone's public keys to this machine. I am connecting to the machine as root and using the private key. I recognize I am not following best practices for administering and securing a Linux machine. The private key is not shared, and as this is a development/small-scale project this seems fine for now. If other people need to work on this machine, we can address setting up other users, etc. when it is a project requirement. I did not set up an Elastic IP for this address, so it is subject to change (I am stopping this machine when not in use, and you get a new IP every time you start the machine).

Helpful Links
Learning Hadoop:
- Understanding Hadoop Clusters and the Network
- Hadoop 101
- NameNode
- Secondary NameNode (Checkpoint Node)
- DataNode
- JobTracker
- TaskTracker
- Project Name Explainer
- Hadoop Shell
- Map Reduce
- Map Reduce 101
- Authorization and Authentication
- Cluster Architecture Explained
- Demystifying Hadoop

ETL Tools:
- Hadoop Ecosystem: Oozie, Flume, Sqoop
- General: Kettle, Spring Batch
