Contents:
- Project
- Use Case: Installing Hadoop
- Scope
- Detailed Installation Process
- Analyze Recommended Hardware/Software
- Choose an Amazon Machine Image (AMI) Instance
- Connect to bigDataa
- Configure EC2 Security Group
- Configure RHEL Firewall Rules
- Install the Correct Oracle Java Development Kit
- Install CDH4 on a Single Linux Node in Pseudo-distributed Mode
- Install Hadoop
- Hello World MapReduce Application
- Starting and Stopping Hadoop
- Our POC
- Conclusion and Lessons Learned
- Miscellaneous
- Notes
- Helpful Links
Project
Use Case Installing Hadoop
Scope
This is an initial attempt at producing a big data sandbox, so scope is important. The following items are NOT in scope:

- Security
- Working at a client data center
- Deploying in distributed mode (i.e. a multi-node cluster)2

All work was completed using the EC2 admin account.
I decided to use Cloudera's Distribution Including Apache Hadoop (CDH4). Cloudera provides a Quick Start Guide, an Installation Guide, a Security Guide and a High Availability Guide. All of the documentation for CDH4 links from the main CDH4 documentation page.
I want to use a Red Hat Enterprise Linux (RHEL) distribution because the client's data center uses Red Hat distributions. CDH4 is available for:
- 64-bit packages for Red Hat Enterprise Linux 5.7
- 32-bit and 64-bit packages for Red Hat Enterprise Linux 6.2

Cloudera recommends 64-bit packages for production environments.
Connect to bigDataa
Now, connect to the machine using the secure shell (SSH) to verify that it is up and running:

Establish Connection
ssh -i bigData.pem foo@bar.amazonaws.com
Type      Protocol   Port Range       Source
Custom    TCP        50030 - 50030    0.0.0.0/0
Custom    TCP        50070 - 50070    0.0.0.0/0
The default security group in our EC2 Admin account has these settings baked in, so you should be able to just use that.
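If you need to recreate these rules in a fresh security group, they can also be added from the command line. A sketch using the AWS CLI; the group name `default` is an assumption, and the CLI must already be configured with credentials for the account:

```shell
# Open the JobTracker (50030) and NameNode (50070) web consoles to the world,
# matching the two rules in the table above.
aws ec2 authorize-security-group-ingress --group-name default \
    --protocol tcp --port 50030 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name default \
    --protocol tcp --port 50070 --cidr 0.0.0.0/0
```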
RHEL firewall rules will block access to the Hadoop web consoles, so you need to configure the iptables service before you can reach them. One way to do this:
Configure iptables
## Save the current rules:
service iptables save
## Output:
iptables: Saving firewall rules to /etc/sysconfig/iptables: [  OK  ]

## Stop the firewall:
service iptables stop
## Output:
iptables: Flushing firewall rules:                          [  OK  ]
iptables: Setting chains to policy ACCEPT: filter           [  OK  ]
iptables: Unloading modules:                                [  OK  ]
You can perform a more specific and probably more secure configuration by adding a rule to /etc/sysconfig/iptables. I chose not to take this approach because security is out of scope and this seemed more complicated.
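For reference, the more targeted alternative would look something like the following sketch: insert ACCEPT rules for just the two Hadoop console ports, then persist them. The exact rule position relative to the default REJECT rule would still need checking on a real box:

```shell
# Allow inbound TCP to the NameNode (50070) and JobTracker (50030) consoles,
# inserting at the head of the INPUT chain so they match before any REJECT rule:
iptables -I INPUT -p tcp -m tcp --dport 50070 -j ACCEPT
iptables -I INPUT -p tcp -m tcp --dport 50030 -j ACCEPT
# Persist the running rules to /etc/sysconfig/iptables so they survive a restart:
service iptables save
```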
CDH4 requires:
- JDK 1.6.0_8 = minimum required
- JDK 1.6.0_31 = recommended version

Check the installed Java version:

Java Version

java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (rhel-1.45.1.11.1.el6-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
FTP the JDK installer to the EC2 instance and install it.

Be Careful: If you don't configure JAVA_HOME correctly, you will not be able to start the MapReduce service.
Install JDK
## This example assumes you installed your JDK to: /usr/lib/java/jdk1.6.0_31/
## Make the file executable:
chmod +x jdk-6u31-linux-x64.bin
## Run the file:
./jdk-6u31-linux-x64.bin
## Check whether the default changed:
java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (rhel-1.45.1.11.1.el6-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
## Edit /etc/profile; somewhere include:
export JAVA_HOME=/usr/lib/java/jdk1.6.0_31
export PATH=$JAVA_HOME/bin:$PATH
## Edit /etc/sudoers; somewhere include:
Defaults env_keep+=JAVA_HOME
## Check JAVA_HOME:
sudo env | grep JAVA_HOME
## Should see:
JAVA_HOME=/usr/lib/java/jdk1.6.0_31
## Don't move on until this works, or else nothing else will work!
Install Hadoop
1. Format the Name Node:
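Per the CDH4 pseudo-distributed setup, formatting is done as the hdfs user; a sketch (run this once, before the first start — reformatting later would wipe the HDFS metadata):

```shell
# Format the NameNode's on-disk metadata as the hdfs superuser:
sudo -u hdfs hdfs namenode -format
```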
2. Start HDFS
Start HDFS
## Start:
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service start
done
## Output:
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-90-255-217.out
Starting Hadoop namenode: [ OK ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-10-90-255-217.out
Starting Hadoop secondarynamenode: [ OK ]
starting secondarynamenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-ip-10-90-255-217.out

## Stop:
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service stop
done
## Output:
Stopping Hadoop datanode: [ OK ]
stopping datanode
Stopping Hadoop namenode: [ OK ]
stopping namenode
Stopping Hadoop secondarynamenode: [ OK ]
stopping secondarynamenode
The NameNode provides a web console on port 50070. Go to http://localhost:50070/dfshealth.jsp to view it:
If you are having trouble getting to the admin page, it could be a firewall issue. To check:
Debug Firewall Issues
## Download index.html
wget 127.0.0.1:50070
## Output:
Connecting to 127.0.0.1:50070... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1045 (1.0K) [text/html]
Saving to: index.html
## Print index to standard out
cat index.html
<meta HTTP-EQUIV="REFRESH" content="0;url=dfshealth.jsp"/>
<html>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for
additional information regarding copyright ownership. The ASF licenses this
file to You under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
The JobTracker provides a web console on port 50030. Go to http://localhost:50030/jobtracker.jsp to view it:
6. Create User Directories

Create a home directory for the user on the NameNode and update file permissions:

Add Joe User
## Create group if it does not exist (groupadd, not addgroup, on RHEL):
groupadd hdusers
## Create user and add to group:
useradd -G hdusers joe
## TODO: Add SSH access and sudoers
## Check that the user was created successfully:
awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd
Pay Attention

If you do not set up permissions correctly, this won't work. You may see errors like:

Caused by: org.apache.hadoop.security.AccessControlException
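Note that useradd only creates the OS-level account; joe also needs a home directory inside HDFS, or jobs run as joe fail with exactly this AccessControlException. A sketch, assuming the hdfs superuser and the /user/<name> convention from the CDH4 docs:

```shell
# Create joe's HDFS home directory as the hdfs superuser:
sudo -u hdfs hadoop fs -mkdir /user/joe
# Hand ownership to joe so he can read and write under it:
sudo -u hdfs hadoop fs -chown joe /user/joe
```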
Copy XML
su joe
hadoop fs -mkdir input
hadoop fs -put /etc/hadoop/conf/*.xml input
hadoop fs -ls input
## Output:
-rw-r--r--   1 joe supergroup  1461 2012-08-20 14:42 input/core-site.xml
-rw-r--r--   1 joe supergroup  1854 2012-08-20 14:42 input/hdfs-site.xml
-rw-r--r--   1 joe supergroup  1001 2012-08-20 14:42 input/mapred-site.xml

## Run example MapReduce:
/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'

## Check output:
hadoop fs -ls
Found 2 items
drwxr-xr-x   - hdfs supergroup  0 2012-08-20 17:00 input
drwxr-xr-x   - hdfs supergroup  0 2012-08-21 08:46 output

## List output files:
hadoop fs -ls output
Found 3 items
-rw-r--r--   1 hdfs supergroup    0 2012-08-21 08:46 output/_SUCCESS
drwxr-xr-x   - hdfs supergroup    0 2012-08-21 08:46 output/_logs
-rw-r--r--   1 hdfs supergroup  150 2012-08-21 08:46 output/part-00000

## View content:
hadoop fs -cat output/part-00000 | head
1 dfs.datanode.data.dir
1 dfs.namenode.checkpoint.dir
1 dfs.namenode.name.dir
1 dfs.replication
1 dfs.safemode.extension
1 dfs.safemode.min.datanodes
Starting Hadoop
## 1. Start the machine using https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances
## 2. Connect:
ssh -i bigData.pem foo@bar.amazonaws.com
## 3. Turn off firewalls:
service iptables save
service iptables stop
## 4. Start HDFS:
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service start
done
## 5. Check NameNode UI:
ec2-50-17-179-120.compute-1.amazonaws.com:50070
## 6. Start MapReduce:
for service in /etc/init.d/hadoop-0.20-mapreduce-*
do
  sudo $service start
done
## 7. Check MapReduce UI:
ec2-50-17-179-120.compute-1.amazonaws.com:50030
## 8. Or check for running Java processes:
jps
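Shutting down is the mirror image of the checklist above: stop MapReduce first, then HDFS, then stop the instance from the EC2 console. A sketch:

```shell
# Stop MapReduce services first, so no jobs are writing to HDFS:
for service in /etc/init.d/hadoop-0.20-mapreduce-*
do
  sudo $service stop
done
# Then stop HDFS services:
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service stop
done
```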
Our POC
- Data load
- Map Reduce
- Data Pipeline
- ETL
- Testing the Data Sandbox
Miscellaneous
Notes
1 The term 'big data stack' is purposefully vague at this point. Currently, the plan for this project is to use Cloudera's CDH4 Linux package to get started. This package includes the Hadoop Distributed File System (HDFS), MapReduce, Flume, HBase, Hive, Hue, and Oozie. As detailed requirements for this project emerge, the specifics of the big data stack may change.
2 CDH4 can run in pseudo-distributed mode (i.e. a single-node cluster) vs. distributed mode (i.e. a multi-node cluster). In pseudo-distributed mode, Hadoop processing is distributed over all of the cores/processors on a single machine. All Hadoop services communicate over local Transmission Control Protocol (TCP) sockets for inter-process communication.
3 A few minutes of research did not yield any helpful suggestions as to which instance type would work best. From a cost and performance perspective, a medium instance seems like a reasonable place to start.
4 Amazon does not offer RHEL 6.2 in the quick start wizard for creating a new instance. The CDH4 release notes reference both RHEL 6.2 and 6.3, so I am assuming RHEL 6.3's exclusion from the CDH4 quick start guide was in error, and I will be using RHEL 6.3.
5 I did not add anyone's public keys to this machine. I am connecting to the machine as root using the private key. I recognize I am not following best practices for administering and securing a Linux machine, but the private key is not shared, and as this is a development/small-scale project, this seems fine for now. If other people need to work on this machine, we can address setting up other users, etc. when it becomes a project requirement. I did not set up an Elastic IP for this address, so it is subject to change (I am stopping this machine when not in use, and you get a new IP every time you start the machine).
Helpful Links
Learning Hadoop:
- Understanding Hadoop Clusters and the Network
- Hadoop 101
- NameNode
- Secondary NameNode (Checkpoint Node)
- DataNode
- JobTracker
- TaskTracker
- Project Name Explainer
- Hadoop Shell
- Map Reduce
- Map Reduce 101
- Authorization and Authentication
- Cluster Architecture Explained
- Demystifying Hadoop
ETL Tools:
- Hadoop Ecosystem: Oozie, Flume, Sqoop
- General: Kettle, Spring Batch