Zahid Ul Haq
Student ID: 200377925
Preface
The objective of this document is to make the installation of Hadoop and HIVE easy for
novices in the field of Big Data. Step-by-step instructions with illustrations will hopefully make the execution
of commands and the verification of the installation easy. Furthermore, basic manipulation
commands are also described at an introductory level.
Contents
Install Ubuntu .............................................................................................................................................................................. 1
Create Virtual Machine ........................................................................................................................................................ 1
Customize hardware of Virtual Machine ............................................................................................................................. 3
Customization of Ubuntu screen ......................................................................................................................................... 6
Sharing folder ...................................................................................................................................................................... 7
VMware Tools ...................................................................................................................................................................... 8
Installing JAVA on Virtual Machine............................................................................................................................................ 10
Checking Java installed version ......................................................................................................................................... 10
Updating debian packages on the server .......................................................................................................................... 10
Install Java Development Kit.............................................................................................................................................. 10
Updating bash file for Java variable................................................................................................................................... 11
Verifying $JAVA_HOME creation ....................................................................................................................................... 11
Reboot the system ............................................................................................................................................................. 12
Check Java Version ............................................................................................................................................................ 12
Check Java Variable path ................................................................................................................................................... 12
Add a Hadoop User and Group.............................................................................................................................................. 13
Add a Group ....................................................................................................................................................................... 13
Add a Common User “hduser” to the Group..................................................................................................................... 13
Add the hduser to the sudo User Group ........................................................................................................................... 13
Reboot the system ............................................................................................................................................................. 14
Add the Shared folder to Desktop ..................................................................................................................................... 14
Install SSH .................................................................................................................................................................................. 14
Check SSH Installation ....................................................................................................................................................... 14
Install openssh-server ........................................................................................................................................................ 15
Verify SSH installation........................................................................................................................................................ 15
SSH Key .................................................................................................................................................................................. 16
Generate SSH key .............................................................................................................................................................. 16
Distribute SSH Public Key................................................................................................................................................... 16
Verify SSH public and private key generation and copy of public key to authorized_keys............................................... 17
Set the permission on the .ssh directory ........................................................................................................................... 18
Confirm the installation of SSH and reboot ....................................................................................................................... 18
Rsync ...................................................................................................................................................................................... 18
Install Rsync ....................................................................................................................................................... 18
Reboot system and take a snapshot. ................................................................................................................................ 18
Install and Configure Hadoop on a Single Node Cluster............................................................................................................ 19
Download the Hadoop....................................................................................................................................................... 19
Un-compress the Hadoop tar files ..................................................................................................................................... 19
Move all Hadoop related files to hadoop directory .......................................................................................... 19
Set Hadoop Environment Variable .................................................................................................................................... 19
Verify Hadoop Environment Variable ................................................................................................................................ 20
Reboot the system and take a snapshot. .......................................................................................................................... 21
Check Version of Hadoop .................................................................................................................................................. 21
Build the Hadoop data directories......................................................................................................................................... 21
Modifying Hadoop Configuration File ....................................................................................................................................... 21
Check Hadoop Configuration File ...................................................................................................................................... 21
Modify permission on the HADOOP_CONF_DIR ............................................................................................................... 22
Check all file available in HADOOP_CONF_DIR ................................................................................................................. 22
Add JAVA_HOME variable to $HADOOP_CONF_DIR/hadoop-env.sh ............................................................................... 22
Reconfirm your JAVA_HOME variable ................................................................................................................................ 23
Know your hostname (Computer name or Domain Name Server) ................................................................................... 23
Modify core-site.xml .......................................................................................................................................................... 23
Modify mapred-site.xml .................................................................................................................................................... 25
Modify hdfs-site.xml .......................................................................................................................................................... 27
Return ownership of the $HADOOP_HOME folder to root ............................................................................................... 28
Format the HDFS................................................................................................................................................................ 29
Startup Hadoop Cluster ............................................................................................................................................................. 29
Start All Hadoop Services................................................................................................................................................... 29
Shutdown all Hadoop service ............................................................................................................................................ 29
Start DFS Service ................................................................................................................................................................ 30
Accessing NameNode via UI .............................................................................................................................................. 30
Start Yarn Services ............................................................................................................................................................. 31
Start Job History Services .................................................................................................................................................. 31
Browsing Resource Manager ............................................................................................................................................. 32
Confirm Running of Hadoop Service ................................................................................................................................. 32
Testing the HDFS........................................................................................................................................................................ 32
Hadoop Interface commands ............................................................................................................................................ 32
List folders in HDFS ............................................................................................................................................................ 33
Make a folder / directory in HDFS ..................................................................................................................................... 33
Removing files with wild card character ........................................................................................................................... 33
Verification of files on local Computer .............................................................................................................................. 34
Copy a file from HDFS to the local computer .................................................................................................................... 35
Merging files ...................................................................................................................................................................... 35
Display last lines of the file ................................................................................................................................................ 36
Exploring properties of HDFS Files. ....................................................................................................................................... 36
Change group association of files ...................................................................................................................................... 36
Change ownership ............................................................................................................................................................. 36
Change permission on files ................................................................................................................................................ 37
Copy file to another folder ................................................................................................................................................ 37
Display size of files ............................................................................................................................................................. 38
Moving files ....................................................................................................................................................................... 38
Statistics about file / directory .......................................................................................................................................... 38
HIVE ........................................................................................................................................................................................... 38
Download Hive................................................................................................................................................................... 38
Extract the tar file ................................................................................................................................................. 39
Move the extracted files to /usr/local/hive ...................................................................................................................... 39
Add Environment variables ............................................................................................................................................... 39
Instantiate environment variables. ................................................................................................................................... 40
Reboot ............................................................................................................................................................................... 40
Set Ownership and Permission .......................................................................................................................................... 40
Add the HADOOP_HOME Variable .................................................................................................................................... 40
Start the HDFS Environment .............................................................................................................................................. 41
Build the directory structure for HIVE on the HDFS .......................................................................................................... 41
Setup the initial scheme for the database......................................................................................................................... 41
Launch HIVE ....................................................................................................................................................................... 41
Error handling during Hive Start ........................................................................................................................................ 41
Re-run schema for the database ...................................................................................................................................... 42
Re-launch Hive ................................................................................................................................................................... 42
HiveQL Data Definition Language .............................................................................................................................................. 42
Show Databases................................................................................................................................................................. 42
Create Database ................................................................................................................................................................ 43
Verify creation of Database ............................................................................................................................................... 43
Use Database ..................................................................................................................................................................... 43
Using a Non-existent Database ......................................................................................................................................... 43
Drop Database ................................................................................................................................................................... 43
Create Table....................................................................................................................................................................... 44
Show Tables ....................................................................................................................................................................... 44
Describe Table ................................................................................................................................................................... 44
Drop Table ......................................................................................................................................................................... 44
Creating Tables in Specified Databases ................................................................................................................................. 44
Create table in specified database .................................................................................................................................... 45
Managed Hive Tables ............................................................................................................................................................ 45
Created a managed table .................................................................................................................................................. 45
Insert single record in a table ............................................................................................................................................ 45
Insert multiple records in a table ...................................................................................................................................... 46
Checking the table file in the Hive warehouse ............................................................................................................ 46
Checking contents of the table .............................................................................................................................................. 46
Check content of temporary table ........................................................................................................................................ 46
Install Ubuntu
Hadoop is supported on the Linux platform and its flavours. If your computer's operating system is other than
Linux, then you need to create a Virtual Machine for installing Linux without interfering with the local computer's
operating system. Ubuntu is one of the Linux distributions that can be used for this purpose.
Create Virtual Machine
1. Open VMware Workstation
2. Click on "Create a New Virtual Machine", which will prompt a Wizard for installation.
3. Choose "Typical (recommended)" unless you desire to install a customized version. Click "Next".
4. Browse location where “iso” disc image for Ubuntu is saved on your computer.
5. It is a best practice to use a stable version of Ubuntu – LTS: Long Term Support
6. Enter a desirable name for the Linux installation and the user. The user name must be in lower case. Provide a desirable
password.
7. Enter a desirable name for the virtual machine. Click the "Browse" button to change the default location of the virtual
machine, which can also be changed from Edit > Preferences.
8. Change the default maximum virtual disk size of 20 GB to something that suits your computer system.
9. Keep the default "Split virtual disk into multiple files", which provides easy portability of data, less chance of
losing all data, and better response for operations like snapshots, shrink, and extend.
Customize hardware of Virtual Machine
Ubuntu will rely on the local machine's resources for its operation. You can change the allocation of those resources if
you don't want to go with the default options, which in many cases may not be the best for your computer system.
10. Click "Customize hardware" to customize the virtual machine's hardware. The default memory of 1 GB will
result in delayed execution of heavier tasks.
11. 1 GB of memory equates to 1024 MB, and it is better practice to choose a number that is a multiple of it.
12. For a standard computer with 16 GB of memory, allocate at least 8 GB of RAM. Always leave enough memory
for the local computer so that it does not hang.
13. Enter the desired memory for the virtual machine by either dragging the scale or entering the number in MB.
14. Leave "Network Adapter" at the default NAT (Network Address Translation). NAT is useful when you are
connected to a network through a non-Ethernet network adapter.
a. The local computer and the virtual machine will each have a separate IP address
b. The virtual machine remains inaccessible to other machines on the host network
c. If virtual machine needs to be available to other machines or virtual machines, then change
Network Adapter setting to “Bridged”
15. Click Close and then click Finish to close configuration of virtual machine.
16. The Ubuntu screen will come up like below, and the recommended base packages installation will start
automatically.
17. Installation may take from 10 to 15 minutes depending on internet connectivity and the memory allocated.
23. Click the search icon and type "Terminal". Drag the Terminal icon to the launcher for quick access.
24. Execution of the above steps will result in a screen like the one below.
Sharing folder
A shared folder facilitates sharing of files between the virtual machine and the local machine.
25. Click on VM in the menu bar and select "Settings". In "Virtual Machine Settings", select Options and then
"Shared Folders". Under the "Folder sharing" tab, select "Always enabled", then add a location for the shared
folder and name it.
26. Click “Next” and then “OK” to close configuration.
VMware Tools
VMware Tools contains updated mouse, display, and network drivers and enables easier sharing of files
between the local machine and virtual machines.
27. If VMware tools were previously installed on the VMware Workstation, the below steps will update them.
28. In the Library tab, select desired virtual machine and power it on. If Library tab is not visible, click View
tab in menu bar, select Customize and then choose Library, or press F9.
29. Right click on virtual machine and select Install / Reinstall VMware tools
30. Selecting Install/Reinstall VMware Tools mounts a DVD drive in Ubuntu's list of drives.
31. Open a Terminal window either by pressing CTRL+ALT+T or selecting from launcher or search box.
32. The following commands will extract the VMware Tools and install files to Desktop.
cd "/media/ubuntu/VMware Tools"
mkdir ~/Desktop/vmwaretools
tar -zxf ./VMwareTools-*.tar.gz -C ~/Desktop/vmwaretools
cd ~/Desktop
a. Remember:
i. cd stands for change directory,
ii. cd "/media/ubuntu/VMware Tools" changes into the folder where the VMware Tools DVD is mounted for user "ubuntu".
iii. tar -zxf will untar the tar file, where the dash "-" introduces the options, z means unzip, x means
extract files, and f means the following argument is a filename. Sometimes you may add v to
the command, which means print the filenames verbosely.
33. Once the VMware tools install files are extracted, the following command will install VMware Tools. Leave
the default settings for each option.
sudo stands for "superuser do" (or "substitute user, do"); it runs a command as the root user, the superuser having all administrative
rights.
sudo ~/Desktop/vmwaretools/vmware-tools-distrib/vmware-install.pl -d default
34. If VMware tools are already installed, then installation will not proceed further.
35. The following command will display the "shared" folder(s) for the virtual machine that you have previously defined.
vmware-hgfsclient
36. A symbolic link to the shared folder can be created with the following command (replace Shared with the share
name reported by vmware-hgfsclient, quoting the path if the name contains spaces). After execution, verify the
existence of the link in ~/Desktop, where ~ denotes the home folder.
ln -s /mnt/hgfs/Shared ~/Desktop/Shared
Installing JAVA on Virtual Machine
Hadoop is written in the Java language, so we need Java to be installed in order to run the Hadoop packages.
Checking Java installed version
java -version
The following commands will append JAVA_HOME and its path to the system-wide bash configuration file (/etc/bash.bashrc).
echo "export JAVA_HOME=/usr/lib/jvm/default-java" | sudo tee --append /etc/bash.bashrc > /dev/null
echo "PATH=\$PATH:\$JAVA_HOME/bin" | sudo tee --append /etc/bash.bashrc > /dev/null
Install SSH
Secure Shell (SSH) is a network protocol that provides administrators with a secure way to access a
remote computer. SSH also refers to the suite of applications that implement the protocol. It provides a strongly
encrypted way to transfer data over a network, including the internet, which is an insecure network. SSH runs on port
22.
Check SSH Installation
If an SSH server is installed, port 22 will be open. The following command will display all ports that are
open.
netstat -tulpn
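On newer Ubuntu releases netstat may not be installed by default; the ss utility from the iproute2 package (an alternative, not part of the original steps) gives an equivalent listing.
ss -tulpn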
Install openssh-server
sudo apt-get -y install openssh-server
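The .ssh listing below assumes that an SSH key pair has already been generated for hduser and its public key copied to authorized_keys (the "Generate SSH key" and "Distribute SSH Public Key" steps). A typical sequence, assuming an RSA key with an empty passphrase for passwordless login, is:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys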
The following command will display all files in the .ssh folder, where we can see the presence of the private and public
keys.
ls -all .ssh
The following command will show hduser as a valid authorized user with an authorized key.
cat .ssh/authorized_keys
The following command will show whether any other machine is a known host to this machine "Ubuntu".
Right now, it does not list any known hosts.
cat .ssh/known_hosts
Set the permission on the .ssh directory
sudo chmod -R 0600 ~/.ssh/authorized_keys
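The command above restricts the authorized_keys file itself; it is also common to restrict the .ssh directory as a whole (an additional step, not part of the original instructions).
chmod 700 ~/.ssh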
Confirm the installation of SSH and reboot
ssh ubuntu
Rsync
rsync speeds up copies when the destination already has an older copy of the file(s) by sending only the
changed parts, which means it allows you to synchronize remote folders. It is most useful when you have to copy a
large number of files or very large files.
Install Rysnc
The following command will install rsync.
sudo apt-get install rsync
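As an illustration of what rsync does (a hypothetical example with made-up paths, not part of the installation steps), the following copies a folder to another location and, when run again later, transfers only the files that changed:
rsync -avh ~/Downloads/example_folder/ ~/Desktop/example_backup/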
After untarring, navigate to the /usr/local folder and verify the download with the following commands.
cd /usr/local and press Enter. Then type ls.
ls -lh will list all contents of the directory and the permissions allowed on the directories.
The above (incomplete) list of files in the hadoop directory indicates that only the user "ubuntu" has permissions on them.
Modify permission on the HADOOP_CONF_DIR
The following commands will allow the user "root" to edit the configuration files.
sudo chown root -R $HADOOP_HOME will change ownership to root
sudo chmod 777 -R $HADOOP_HOME will recursively grant read/write/execute permission to all users
Change the variable to look like the statement below; this location should be the same as the JAVA_HOME
environment variable:
export JAVA_HOME=/usr/lib/jvm/default-java
echo $JAVA_HOME will display the location of the JAVA_HOME variable if you are not sure about its path.
Modify core-site.xml
Hadoop's core-site.xml contains information about where the NameNode runs, which is typically on port 9000 or
8020.
NameNode: fs.defaultFS is the property holding the default file system (NameNode) address, which you may change if you don't want
to go with the default.
The address of the NameNode server used here is hdfs://localhost:9000. Port 8020 is also a common default
for the NameNode.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>localhost may be replaced with a DNS that points to the NameNode.</description>
</property>
</configuration>
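Once Hadoop is installed and on the PATH, the configured value can be read back as a quick sanity check (an optional step, not part of the original instructions).
hdfs getconf -confKey fs.defaultFS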
Modify yarn-site.xml
Open yarn-site.xml (as with the other configuration files, sudo gedit $HADOOP_CONF_DIR/yarn-site.xml will open it) and place the following contents at the end of the file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<!--<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>-->
</configuration>
Modify mapred-site.xml
In HADOOP_CONF_DIR, we don’t have a mapred-site.xml. So we will make a copy of mapred-
site.xml.template as mapred-site.xml by the following command.
sudo cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
sudo gedit $HADOOP_CONF_DIR/mapred-site.xml will open the file.
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Modify hdfs-site.xml
hdfs-site.xml contains information about the directories where the NameNode and DataNode store their data.
You can customize the location of these directories.
dfs.namenode.name.dir specifies the local directory where the NameNode stores its metadata (namespace, FSImage and edit logs).
The NameNode works from its configured heap memory during operation, while the DataNode
persistently stores HDFS blocks under dfs.datanode.data.dir.
We need to specify the replication factor (on how many DataNodes each HDFS block needs to be stored), which by default is
3. Since we are using a single-node cluster, we will assign a value of 1.
sudo gedit $HADOOP_CONF_DIR/hdfs-site.xml will open xml file.
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hduser/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hduser/hadoop_data/hdfs/datanode</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
<description>If "true", enable permission checking in HDFS. If "false", permission checking is turned off, but all
other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner
or group of files or directories.</description>
</property>
</configuration>
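The namenode and datanode directories referenced above must exist on the local file system before formatting (the "Build the Hadoop data directories" step). A minimal sketch, assuming you are working as hduser and keep the paths used above:
mkdir -p ~/hadoop_data/hdfs/namenode
mkdir -p ~/hadoop_data/hdfs/datanode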
Return ownership of the $HADOOP_HOME folder to root
sudo chown root -R $HADOOP_HOME will change ownership to the root user.
sudo chmod 777 -R $HADOOP_HOME will grant read/write/execute permission to all users.
Format the HDFS
We need to format the directories created for HDFS before using them, so that they are initialized with the default
structure and files.
hdfs namenode -format will create and then format the directory structure for the NameNode.
1. Reboot the system and take a snapshot.
sudo reboot
Scroll down; if you find logging text and the nodes running, it means the configuration was successful.
$HADOOP_HOME/sbin/start-dfs.sh
start-dfs.sh
The output in the browser will look like the below screen.
Start Yarn Services
Either of the following commands will start YARN.
$HADOOP_HOME/sbin/start-yarn.sh
start-yarn.sh
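To confirm that the daemons are running (the "Confirm Running of Hadoop Service" step; the jps tool ships with the JDK), the following command should list processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager:
jps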
hadoop fs {arg} relates to a generic file system, which can point to any file system such as the local file system, HDFS, etc.
hadoop dfs {arg} is specific to HDFS. It has been deprecated, and hdfs dfs {arg} should be used instead.
hdfs dfs {arg} relates to HDFS operations.
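For example, when the default file system is HDFS, the two current forms produce the same listing:
hadoop fs -ls /
hdfs dfs -ls /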
Copy the created file “my_file.txt” a few times more to another folder.
Let's create another folder named "use" with the command
hdfs dfs -mkdir /use
Repeat the above step, but this time copy the files to the folder "use":
hdfs dfs -copyFromLocal ~/my_file.txt /use
hdfs dfs -copyFromLocal ~/my_file.txt /use/my_file2.txt
hdfs dfs -copyFromLocal ~/my_file.txt /use/my_file3.txt
If the folder is not empty, it cannot be deleted with the -rmdir command.
hdfs dfs -ls /use will reveal the contents of the "use" folder.
Merging files
The getmerge command concatenates the files in an HDFS source directory into a single file on the local file system. The following
command will merge the three files from Test and make a new file in the local user directory.
hdfs dfs -getmerge /Test ~/user/my_merged_file.txt
Display last lines of the file
The tail command displays the last lines (last kilobyte) of a file. The following command will display the last lines of
my_file.txt in the Test folder.
hdfs dfs -tail /Test/my_file.txt
Change ownership
The chown command changes the user ownership of a file. The "-R" option will also change the ownership of all files
within a directory if the target is a directory. The following commands will change the ownership of "my_file.txt" in the
"use" folder to the user ubuntu.
hdfs dfs -ls /use will show the current permissions, ownership, size of file, etc. on the file/directory
hdfs dfs -chown -R ubuntu /use/my_file.txt will change ownership of the file to ubuntu.
hdfs dfs -ls /use will show the new ownership.
Change permission on files
The chmod command changes the permissions of files. The "-R" option will change the permissions of all files within a
directory if the target is a directory. The following commands will change the permissions on the file "my_file.txt" to read,
write and execute for user, group and others.
hdfs dfs -ls /use will show the current permissions on the file/directory
hdfs dfs -chmod 777 /use/my_file.txt will grant all permissions for all users on the file
hdfs dfs -ls /use will show the updated permissions on the file.
Moving files
The mv command moves files from a source to a destination.
hdfs dfs -mv /use/my_file* /Test/ will move all my_file* files to the Test folder
Verify the move by checking the Test folder via hadoop fs -ls /Test
HIVE
Hive is a data warehouse system for Hadoop. Hadoop jobs normally require Java, but Hive runs SQL-like queries written in its own language,
HQL, which are compiled and run as MapReduce jobs. Data in Hadoop, even though generally unstructured, usually has some loose
structure associated with it.
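For example (an illustration using the "things" table created later in this document), a query such as the following is translated by Hive into a MapReduce job rather than being executed row by row:
select count(*) from things;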
Download Hive
The following command will download “hive-2.1.1” from mirror.csclub.uwaterloo.ca
wget http://mirror.csclub.uwaterloo.ca/apache/hive/hive-2.1.1/apache-hive-2.1.1-bin.tar.gz -P
~/Downloads/Hive
Reboot
sudo reboot
Set Ownership and Permission
Log in as hduser and use the following commands to give "root" the ownership of $HIVE_HOME and grant
permissions.
sudo chown root -R $HIVE_HOME
sudo chmod 777 -R $HIVE_HOME
Launch HIVE
The command hive will launch hive.
Re-launch Hive
hive command will start Hive.
Create Database
The following command will create a database named hduserdb.
create database hduserdb;
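To verify the creation of the database (the "Verify creation of Database" step), listing the databases should now include hduserdb:
show databases;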
Use Database
The following command will switch to the target database for reading tables or creating tables in it.
use hduserdb;
Drop Database
The following command will delete / drop the database even if it has tables inside it. If the database does not have
tables, then drop database name_of_database; will also work.
drop database if exists hduserdb cascade;
Create Table
The following command will create a table named "things" with fields delimited by commas.
CREATE TABLE things (thingId INT, name STRING) row format delimited fields terminated by ',';
Show Tables
The following command will show the newly created table “things” in the list.
show tables;
Describe Table
The describe command followed by the target table name will provide the structure of the table.
Describe things;
Drop Table
Drop table followed by the target table name will delete all the data and the table structure as well. Delete (delete from
followed by the table name) will delete the data but the structure will remain the same, so you can still reload
data. The delete command can also be used with a where clause to specify a condition (see the example after the commands below).
drop table things; for dropping the table
show tables; to verify that the table has been deleted.
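As an example of the delete behaviour mentioned above (a sketch only; in Hive, row-level deletes require a transactional/ACID table, which a plain managed table like "things" is not by default):
delete from things where thingId = 1;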
We have the student_profile table, but since we inserted values into it, Hive created temporary tables: tmp_table_1
shows the insertion of one record and tmp_table_2 the insertion of two records.
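To check the contents of the table itself (the "Checking contents of the table" step), a simple query can be run from the Hive prompt:
select * from student_profile;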