
HADOOP AND HIVE MADE EASY

Zahid Ul Haq
Student ID: 200377925
Preface

The objective of this document is to make the installation of Hadoop and Hive easy for novices in the field of Big Data. Step-by-step instructions with illustrations will hopefully make the execution of commands and the verification of the installation straightforward. Furthermore, basic manipulation commands are also described at an introductory level.
Contents
Install Ubuntu .............................................................................................................................................................................. 1
Create Virtual Machine ........................................................................................................................................................ 1
Customize hardware of Virtual Machine ............................................................................................................................. 3
Customization of Ubuntu screen ......................................................................................................................................... 6
Sharing folder ...................................................................................................................................................................... 7
VMware Tools ...................................................................................................................................................................... 8
Installing JAVA on Virtual Machine............................................................................................................................................ 10
Checking Java installed version ......................................................................................................................................... 10
Updating debian packages on the server .......................................................................................................................... 10
Install Java Development Kit.............................................................................................................................................. 10
Updating bash file for Java variable................................................................................................................................... 11
Verifying $JAVA_HOME creation ....................................................................................................................................... 11
Reboot the system ............................................................................................................................................................. 12
Check Java Version ............................................................................................................................................................ 12
Check Java Variable path ................................................................................................................................................... 12
Add a Hadoop User and Group.............................................................................................................................................. 13
Add a Group ....................................................................................................................................................................... 13
Add a Common User “hduser” to the Group..................................................................................................................... 13
Add the hduser to the sudo User Group ........................................................................................................................... 13
Reboot the system ............................................................................................................................................................. 14
Add the Shared folder to Desktop ..................................................................................................................................... 14
Install SSH .................................................................................................................................................................................. 14
Check SSH Installation ....................................................................................................................................................... 14
Install openssh-server ........................................................................................................................................................ 15
Verify SSH installation........................................................................................................................................................ 15
SSH Key .................................................................................................................................................................................. 16
Generate SSH key .............................................................................................................................................................. 16
Distribute SSH Public Key................................................................................................................................................... 16
Verify SSH public and private key generation and copy of public key to authorized_keys............................................... 17
Set the permission on the .ssh directory ........................................................................................................................... 18
Confirm the installation of SSH and reboot ....................................................................................................................... 18
Rsync ...................................................................................................................................................................................... 18
Install Rsync ....................................................................................................................................... 18
Reboot system and take a snap shot. ................................................................................................................................ 18
Install and Configure Hadoop on a Single Node Cluster............................................................................................................ 19
Download the Hadoop....................................................................................................................................................... 19
Un-compress the Hadoop tar files ..................................................................................................................................... 19
Move all Hadoop related files to the hadoop directory .......................................................................... 19
Set Hadoop Environment Variable .................................................................................................................................... 19
Verify Hadoop Environment Variable ................................................................................................................................ 20
Reboot the system and take a snapshot. .......................................................................................................................... 21
Check Version of Hadoop .................................................................................................................................................. 21
Build the Hadoop data directories......................................................................................................................................... 21
Modifying Hadoop Configuration File ....................................................................................................................................... 21
Check Hadoop Configuration File ...................................................................................................................................... 21
Modify permission on the HADOOP_CONF_DIR ............................................................................................................... 22
Check all file available in HADOOP_CONF_DIR ................................................................................................................. 22
Add JAVA_HOME variable to $HADOOP_CONF_DIR/hadoop-env.sh ............................................................................... 22
Reconfirm your JAVA_HOME variable ................................................................................................ 23
Know your hostname (Computer name or Domain Name Server) ................................................................................... 23
Modify core-site.xml .......................................................................................................................................................... 23
Modify mapred-site.xml .................................................................................................................................................... 25
Modify hdfs-site.xml .......................................................................................................................................................... 27
Return ownership of the $HADOOP_HOME folder to root ............................................................................................... 28
Format the HDFS................................................................................................................................................................ 29
Startup Hadoop Cluster ............................................................................................................................................................. 29
Start All Hadoop Services................................................................................................................................................... 29
Shutdown all Hadoop service ............................................................................................................................................ 29
Start DFS Service ................................................................................................................................................................ 30
Accessing NameNode via UI .............................................................................................................................................. 30
Start Yarn Services ............................................................................................................................................................. 31
Start Job History Services .................................................................................................................................................. 31
Browsing Resource Manager ............................................................................................................................................. 32
Confirm Running of Hadoop Service ................................................................................................................................. 32
Testing the HDFS........................................................................................................................................................................ 32
Hadoop Interface commands ............................................................................................................................................ 32
List folders in HDFS ............................................................................................................................................................ 33
Make a folder / directory in HDFS ..................................................................................................................................... 33
Removing files with wild card character ........................................................................................................................... 33
Verification of files on local Computer .............................................................................................................................. 34
Copy a file from HDFS to the local computer .................................................................................................................... 35
Merging files ...................................................................................................................................................................... 35
Display last lines of the file ................................................................................................................................................ 36
Exploring properties of HDFS Files. ....................................................................................................................................... 36
Change group association of files ...................................................................................................................................... 36
Change ownership ............................................................................................................................................................. 36
Change permission on files ................................................................................................................................................ 37
Copy file to another folder ................................................................................................................................................ 37
Display size of files ............................................................................................................................................................. 38
Moving files ....................................................................................................................................................................... 38
Statistics about file / directory .......................................................................................................................................... 38
HIVE ........................................................................................................................................................................................... 38
Download Hive................................................................................................................................................................... 38
Extract the tar file ................................................................................................................................. 39
Move the extracted files to /usr/local/hive ...................................................................................................................... 39
Add Environment variables ............................................................................................................................................... 39
Instantiate environment variables. ................................................................................................................................... 40
Reboot ............................................................................................................................................................................... 40
Set Ownership and Permission .......................................................................................................................................... 40
Add the HADOOP_HOME Variable .................................................................................................................................... 40
Start the HDFS Environment .............................................................................................................................................. 41
Build the directory structure for HIVE on the HDFS .......................................................................................................... 41
Setup the initial scheme for the database......................................................................................................................... 41
Launch HIVE ....................................................................................................................................................................... 41
Error handling during Hive Start ........................................................................................................................................ 41
Re-run schema for the database ...................................................................................................... 42
Re-launch Hive ................................................................................................................................................................... 42
HiveQL Data Definition Language .............................................................................................................................................. 42
Show Databases................................................................................................................................................................. 42
Create Database ................................................................................................................................................................ 43
Verify creation of Database ............................................................................................................................................... 43
Use Database ..................................................................................................................................................................... 43
Using a Non-existent Database ......................................................................................................................................... 43
Drop Database ................................................................................................................................................................... 43
Create Table....................................................................................................................................................................... 44
Show Tables ....................................................................................................................................................................... 44
Describe Table ................................................................................................................................................................... 44
Drop Table ......................................................................................................................................................................... 44
Creating Tables in Specified Databases ................................................................................................................................. 44
Create table in specified database .................................................................................................................................... 45
Managed Hive Tables ............................................................................................................................................................ 45
Create a managed table .................................................................................................................. 45
Insert single record in a table ............................................................................................................................................ 45
Insert multiple records in a table ...................................................................................................................................... 46
Checking the table file in the Hive warehouse ............................................................................................ 46
Checking contents of the table .............................................................................................................................................. 46
Check content of temporary table ........................................................................................................................................ 46
Install Ubuntu
Hadoop runs on the Linux platform and its flavours. If your computer's operating system is not Linux, you need to create a Virtual Machine on which Linux can be installed without interfering with the local computer's operating system. Ubuntu is one of the Linux distributions that can be used for this purpose.
Create Virtual Machine
1. Open VMware Workstation
2. Click on "Create a New Virtual Machine", which will open the installation wizard.
3. Choose "Typical (recommended)" unless you want a customized installation. Click "Next".
4. Browse to the location where the Ubuntu "iso" disc image is saved on your computer.
5. It is best practice to use a stable version of Ubuntu – LTS (Long Term Support).

6. Enter a name for the Linux machine and for the user. The user name must be in lower case. Provide a password of your choice.

7. Enter a name for the virtual machine. Click the "Browse" tab to change the default location of the virtual machine, which can also be changed from Edit > Preferences.

8. Change the default maximum virtual disk size of 20 GB to something that suits your computer.
9. Keep the default "Split virtual disk into multiple files", which provides easy portability of data, a lower risk of losing all data at once, and better response for operations like snapshots, shrink, and extend.
Customize hardware of Virtual Machine
Ubuntu relies on the local machine's resources for its operation. You can change the allocation of those resources if you do not want to accept the defaults, which in most cases are not the best option for your computer.
10. Click "Customize hardware" to customize the virtual machine's hardware. The default memory of 1 GB will slow down heavier tasks.
11. 1 GB of memory equals 1024 MB, and it is good practice to choose a multiple of that number; for example, 8 GB is entered as 8192 MB.
12. For a standard computer with 16 GB of memory, allocate at least 8 GB of RAM. Always leave enough memory for the local computer so that it does not hang.

13. Enter the desired memory for the virtual machine either by dragging the scale or by entering the number of MB.
14. Leave "Network Adapter" at the default NAT – Network Address Translation. NAT is useful when you are connected to a network through a non-Ethernet network adapter.
a. The local computer and the virtual machine will each have a separate IP address.
b. The virtual machine remains inaccessible to other machines on the host network.
c. If the virtual machine needs to be reachable by other machines or virtual machines, change the Network Adapter setting to "Bridged".
15. Click Close and then click Finish to complete the configuration of the virtual machine.
16. The Ubuntu screen will appear as shown below, and installation of the recommended base packages will start automatically.
17. Installation may take 10 to 15 minutes depending on internet connectivity and the memory allocated.

18. Upon successful installation, a screen like below will pop-up.


19. Enter your password and log on to Ubuntu; the initial screen will look like the one below.
Customization of Ubuntu screen
Time, date, location, and the apps on the launcher can be customized as required. They remain re-configurable at any time, so you can experiment.
20. You can remove items you do not need, such as the LibreOffice and Amazon icons, by right-clicking them and selecting "Unlock from Launcher" to free up some space on the screen.
21. You can always add them, or other items, back to the launcher by clicking the "gear" icon on the right side and selecting "System Settings".
22. You can localize the location, time, and date settings by double-clicking the clock and selecting the appropriate configuration.

23. Click the search icon and type "Terminal". Drag the Terminal icon to the launcher for quick access.
24. Executing the above steps will result in a screen like the one below.
Sharing folder
A shared folder facilitates sharing of files between the virtual machine and the local (host) machine.
25. Click on VM in the menu bar and select "Settings". In "Virtual Machine Settings", select Options and then "Shared Folders". Under the "Folder sharing" tab, select "Always enabled", then add the location of the shared folder and name it.
26. Click “Next” and then “OK” to close configuration.
VMware Tools
VMware Tools contains the drivers and updates needed for better mouse and display handling, network drivers, and sharing of files between the local machine and the virtual machine.
27. If VMware Tools were previously installed in VMware Workstation, the steps below will update them.
28. In the Library tab, select the desired virtual machine and power it on. If the Library tab is not visible, click the View tab in the menu bar, select Customize and then choose Library, or press F9.
29. Right-click on the virtual machine and select Install / Reinstall VMware Tools.

30. Selecting Install / Reinstall VMware Tools mounts a DVD drive in Ubuntu's list of drives.

31. Open a Terminal window either by pressing CTRL+ALT+T or selecting from launcher or search box.
32. The following commands will extract the VMware Tools install files to the Desktop.
cd "/media/ubuntu/VMware Tools"
mkdir ~/Desktop/vmwaretools
tar -zxf ./VMwareTools-*.tar.gz -C ~/Desktop/vmwaretools
cd ~/Desktop
a. Remember:
i. cd stands for change directory.
ii. cd "/media/ubuntu/VMware Tools" changes into the VMware Tools DVD mounted under /media for the user "ubuntu".
iii. tar -zxf extracts the tar archive, where the dash "-" introduces the options, z means decompress with gzip, x means extract files, and f means the following argument is a filename. You may also add v to the command to print the filenames verbosely.

33. Once the VMware Tools install files are extracted, the following command will install VMware Tools. Leave the default settings for each option.
sudo stands for "superuser do"; it runs the command as the root user, the super user who has all administrative rights.
sudo ~/Desktop/vmwaretools/vmware-tools-distrib/vmware-install.pl -d default
34. If VMware tools are already installed, then installation will not proceed further.

35. The following command will display the "shared" folder(s) that you previously defined for the virtual machine.
vmware-hgfsclient

36. A symbolic link to the shared folder can be created with the following command. After execution, verify the existence of the link in ~/Desktop, where ~ denotes the home folder. Quote the path if your shared-folder name contains spaces.
ln -s /mnt/hgfs/Shared ~/Desktop/Shared
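To confirm, list the Desktop contents; the Shared entry should appear as a symbolic link pointing to /mnt/hgfs/Shared.
ls -l ~/Desktop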
Installing JAVA on Virtual Machine
Hadoop is written in Java, so Java must be installed in order to run the Hadoop packages.
Checking Java installed version
java -version

Updating debian packages on the server


a. apt-get (part of APT, the Advanced Package Tool) is used for installing Debian packages; Ubuntu is a Debian-based distribution.
b. -y means "assume yes" to all prompts.
sudo apt-get -y update

Install Java Development Kit


The following command will install Java into /usr/lib/jvm/java-x-openjdk-amd64, which can be verified by navigating to Files > Computer > usr > lib > jvm. The symbolic link folder "default-java" will be used to set up the JAVA_HOME variable.

sudo apt-get -y install default-jdk


Updating bash file for Java variable
The Apache Hadoop distribution expects certain default environment variables, and JAVA_HOME is one of the values used in its scripts and configuration files. To match our installation to those expectations, we set the same variables on our system.

The following command will push the JAVA_HOME and its path to bashrc file.
echo "export JAVA_HOME=/usr/lib/jvm/default-java" | sudo tee --append /etc/bash.bashrc > /dev/null
echo "PATH=\$PATH:\$JAVA_HOME/bin" | sudo tee --append /etc/bash.bashrc > /dev/null

Verifying $JAVA_HOME creation


The following command will open the bash.bashrc file with the text editor gedit; scroll down to the bottom to verify the existence of the JAVA_HOME variable and its path.
sudo gedit /etc/bash.bashrc
Reboot the system
To instantiate the environment variable, reboot the system with the following command.
sudo reboot
Check Java Version
java -version

Check Java Variable path


The following command will print the JAVA_HOME variable to the screen. Note that echo prints information to the screen and the $ sign denotes a variable.
echo $JAVA_HOME
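If the variable was set correctly, the output is the path that was appended to bash.bashrc earlier:
/usr/lib/jvm/default-java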
Add a Hadoop User and Group
In a fully distributed Hadoop cluster, HDFS needs to communicate with several machines to store its data. We need a common Hadoop user and Hadoop group to access those HDFS files, and we need to grant them appropriate permissions.
Add a Group
The following command will create group “hadoop”.
sudo addgroup hadoop

Add a Common User “hduser” to the Group


The following command will create the user "hduser" and add it to the group "hadoop". Provide a password for the new user as required; here I am using hduser for simplicity.
You may skip adding further details about the new user.
sudo adduser --ingroup hadoop hduser

Add the hduser to the sudo User Group


The sudo user group has super-user (administrative) privileges. Since hduser will need to access all datanodes and the namenode, we need to add it to the sudo group.

The following command will add hduser to sudo.


sudo adduser hduser sudo

List all users


The following command will display all users; scroll down to the end of the output to find the new user names.
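A simple way to list all accounts is to read the passwd database; the new hduser entry appears near the end.
cat /etc/passwd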
Reboot the system
sudo reboot
Log on to the virtual machine as hduser and enter the password, which was set to 'hduser'.
Add the Shared folder to Desktop
The following command will add the shared folder to Desktop of hduser.
ln -s /mnt/hgfs/Shared ~/Desktop/Shared

Install SSH
Secure Shell (SSH) is a network protocol that gives administrators a secure way to access a remote computer. SSH also refers to the suite of applications that implement the protocol. It provides a strong, encrypted way to transfer data over a network, including the internet, which is an insecure network. SSH runs on port 22.
Check SSH Installation
If SSH is installed, port 22 will be open. The following command will display all ports that are open.
netstat -tulpn
Install openssh-server
sudo apt-get -y install openssh-server

Verify SSH installation


Check if port 22 is open after installation of the ssh server.
netstat -tulpn
SSH Key
HDFS stores data across multiple datanodes. Without an SSH key, the common user would have to be authorized on every node, which would require creating the user and a password on each of them. With an SSH key pair, we generate a private key that is stored securely and a public key that is distributed across the network. A valid combination of both keys grants the user access on all nodes.
Generate SSH key
The following command will generate an RSA ("Rivest-Shamir-Adleman") key, which is the recommended choice for new keys. In the command, -f specifies the output filename, -t rsa selects the key type, and -P "" sets an empty passphrase.
ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""

The ssh public key has been saved in /home/hduser/.ssh/id_rsa.pub

Distribute SSH Public Key


The following command will copy the public key to the machine's authorized_keys file.
ssh-copy-id -i hduser@ubuntu
Verify SSH public and private key generation and copy of public key to authorized_keys

The following command will display all files in .ssh folder where we can see presence of private and public
keys.
ls -all .ssh

The following command will show hduser as a valid authorized user with authorized key.
cat .ssh/authorized_keys

The following command will show which machines are known hosts to this machine "ubuntu". Right now, the known_hosts file does not exist yet.
cat .ssh/known_hosts
Set the permission on the .ssh directory
sudo chmod -R 0600 ~/.ssh/authorized_keys
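The command above restricts the authorized_keys file itself; the .ssh directory named in the heading is usually restricted as well with:
chmod 700 ~/.ssh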
Confirm the installation of SSH and reboot
ssh ubuntu

Rsync
Rsync speeds up copies when the destination already has an older copy of the file(s) by sending only the changed parts; in other words, it lets you synchronize remote folders. It is most useful when you have to copy a large number of files or very large files.
Install Rsync
The following command will install rsync.
sudo apt-get install rsync

Reboot system and take a snap shot.


Install and Configure Hadoop on a Single Node Cluster
Download the Hadoop
The following command will fetch the hadoop-2.9.0 tar file and download it to Hadoop, a sub-folder of the Downloads folder.
wget http://apache.forsale.plus/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz -P ~/Downloads/Hadoop

Un-compress the Hadoop tar files


The following command will decompress and extract "~/Downloads/Hadoop/hadoop-*.tar.gz", where * matches any Hadoop archive present in the Hadoop folder. If there is more than one Hadoop download, you have to give the exact file name.
sudo tar -zxf ~/Downloads/Hadoop/hadoop-*.tar.gz -C /usr/local

After untarring, navigate to the /usr/local folder and verify the extraction with the following commands.
cd /usr/local and press Enter. Then type ls

Move all Hadoop related files to the hadoop directory


The following command will move the extracted hadoop-* folder in /usr/local to /usr/local/hadoop.
sudo mv /usr/local/hadoop-* /usr/local/hadoop

Set Hadoop Environment Variable


The following commands will push Hadoop environment variable to .bashrc file.
echo -e '# HADOOP Variables START' >> ~/.bashrc
echo -e 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc
echo -e 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> ~/.bashrc
echo -e 'export HADOOP_DATA_HOME=~/hadoop_data/hdfs' >> ~/.bashrc
echo -e 'PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
echo -e '# HADOOP Variables END' >> ~/.bashrc

Verify Hadoop Environment Variable


The following command opens the .bashrc file with the text editor gedit; scroll down to the bottom of the file to check for the environment variables added above.
gedit ~/.bashrc

Instantiate Environment variables and print path of variables.


source ~/.bashrc will reload the .bashrc file and instantiate the variables.
echo $HADOOP_HOME will print the path of the Hadoop variable to the screen.
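If everything is in place, the output matches the value exported above:
/usr/local/hadoop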
Reboot the system and take a snapshot.
sudo reboot
Check Version of Hadoop
Login as hduser and use the following command to check version of hadoop.
hadoop version

Build the Hadoop data directories


The minimum directories needed for the Hadoop ecosystem are namenode – to store metadata, datanode – to store HDFS data, and tmp – to store temporary files. The following commands will create the directories in "~/hadoop_data/hdfs", which is set in the .bashrc file as HADOOP_DATA_HOME.
mkdir -p $HADOOP_DATA_HOME/namenode
mkdir -p $HADOOP_DATA_HOME/datanode
mkdir -p $HADOOP_DATA_HOME/tmp

Then navigate to the directory and check the files, as shown below.


cd ~/hadoop_data/hdfs
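ls -l will list the three directories (datanode, namenode, tmp) that were just created.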

Modifying Hadoop Configuration File


Check Hadoop Configuration File
Before modifying permissions on HADOOP_CONF_DIR, let's check what is inside it and who has permissions on it.
echo $HADOOP_CONF_DIR will print the path of the Hadoop configuration directory.
cd $HADOOP_CONF_DIR will take you to the Hadoop configuration directory.

ls -lh will list all contents of the directory and the permissions allowed on them.
The (partial) listing of the hadoop directory above shows that only the user "ubuntu" has write permissions on these files.
Modify permission on the HADOOP_CONF_DIR
The following commands will allow the configuration to be edited.
sudo chown root -R $HADOOP_HOME will change ownership to root.
sudo chmod 777 -R $HADOOP_HOME will recursively grant read/write/execute permission to everyone.

Check all file available in HADOOP_CONF_DIR


We will need to configure some files in HADOOP_CONF_DIR, so let's have a look at its contents first.
ls -l will display the contents of the directory.

Add JAVA_HOME variable to $HADOOP_CONF_DIR/hadoop-env.sh


In hadoop-env.sh, you need to set the path of the JAVA_HOME variable, and you may also change the heap size – the memory the namenode needs to come up – if the default 1000 MB is not suitable for you.
sudo gedit $HADOOP_CONF_DIR/hadoop-env.sh will open the hadoop-env.sh file. Locate the line that sets the current JAVA_HOME variable, which should look something like this.
export JAVA_HOME=${JAVA_HOME}

Change the variable to look like the statement below; this location should be the same as the JAVA_HOME environment variable:
export JAVA_HOME=/usr/lib/jvm/default-java
echo $JAVA_HOME will display the location of the JAVA_HOME variable if you are not sure about its path.

Reconfirm your JAVA_HOME variable


echo $JAVA_HOME – the output should not be blank

Know your hostname (Computer name or Domain Name Server)


On a single node, you may refer to your system as localhost, but it is preferable to refer to it by its DNS hostname, especially in larger set-ups.
echo $(hostname) will display your local hostname.

Modify core-site.xml
Hadoop's core-site.xml contains information about where the namenode runs; it typically listens on port 9000 or 8020.
Namenode: fs.defaultFS is the property that sets the default filesystem (the namenode address), which you may change if you do not want to keep the default.
The default address of the namenode server is hdfs://localhost:9000; port 8020 is also commonly used.

Furthermore, core-site.xml specifies a base for temporary directories (hadoop.tmp.dir).

sudo gedit $HADOOP_CONF_DIR/core-site.xml will open the file.


Add the following command lines to the configuration of the core-site.xml file
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/${user.name}/hadoop_data/hdfs/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>localhost may be replaced with a DNS that points to the NameNode.</description>
</property>
</configuration>
Modify yarn-site.xml
The following command will open yarn-site.xml, place the contents at the end of file.

sudo gedit $HADOOP_CONF_DIR/yarn-site.xml

<configuration>

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>

<!--<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>-->
</configuration>

Note: <!-- ... --> delimits a comment.

Modify mapred-site.xml
In HADOOP_CONF_DIR, we don’t have a mapred-site.xml. So we will make a copy of mapred-
site.xml.template as mapred-site.xml by the following command.
sudo cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
sudo gedit $HADOOP_CONF_DIR/mapred-site.xml will open the file.

Add the following lines to the file and save it.


<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>local</value>
</property>

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

</configuration>
Modify hdfs-site.xml
hdfs-site.xml contains information about the directories where the namenode and datanode store their data. You can customize the location of these directories.
dfs.namenode.name.dir is the local directory where the namenode stores its metadata – the namespace, FSImage, and edit logs.
During operation the namenode keeps the namespace in memory (the heap size set in hadoop-env.sh), while the datanode persistently stores the HDFS blocks in dfs.datanode.data.dir.
We also need to specify the replication factor – on how many datanodes each HDFS block is stored – which is 3 by default. Since we are using a single-node cluster, we assign a value of 1.
sudo gedit $HADOOP_CONF_DIR/hdfs-site.xml will open xml file.

Add the following to the file and save it.


<configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hduser/hadoop_data/hdfs/namenode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hduser/hadoop_data/hdfs/datanode</value>
</property>

<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
<description>If "true", enable permission checking in HDFS. If "false", permission checking is turned off, but all
other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner
or group of files or directories.</description>
</property>

</configuration>
Return ownership of the $HADOOP_HOME folder to root
sudo chown root -R $HADOOP_HOME will change ownership for root user.
sudo chmod 777 -R $HADOOP_HOME will grant read/write/execute permission.
Format the HDFS
We need to format the directories created for HDFS before using them, so that they contain the default files and values.
hdfs namenode -format will create and then format the directory structure for the namenode.
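On success, the log output ends with a message similar to "Storage directory /home/hduser/hadoop_data/hdfs/namenode has been successfully formatted" (the exact wording may vary between Hadoop versions).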
Reboot the system and take a snapshot.
sudo reboot

Startup Hadoop Cluster


Start All Hadoop Services
start-all.sh will start all Hadoop services. You will get a warning message about its deprecation, but it will still work.

Scroll down; if you see the logging output and the nodes starting, the configuration is successful.

Shutdown all Hadoop service


stop-all.sh will stop all hadoop services.
Start DFS Service
Either of the following commands will start the DFS services.

$HADOOP_HOME/sbin/start-dfs.sh
start-dfs.sh

Accessing NameNode via UI


The default address of the NameNode web UI (user interface) is http://localhost:50070. Type http://localhost:50070 into the web browser of your virtual machine to access the NameNode; change localhost to your NameNode's DNS name or IP address if required.

The output in the browser will look like the below screen.
Start Yarn Services
Either of the following commands will start the YARN services.

$HADOOP_HOME/sbin/start-yarn.sh
start-yarn.sh

Start Job History Services


Either of the following commands will start the job history server.

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver


mr-jobhistory-daemon.sh start historyserver
Browsing Resource Manager
Browse the ResourceManager in your web browser via http://localhost:8088; change localhost to your NameNode's DNS name or IP address if required.

Confirm Running of Hadoop Service


The jps command is used to check all Java processes, namely the Hadoop daemons such as NameNode, DataNode, ResourceManager, NodeManager, etc., that are running on the local machine.
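Running the command on this single-node cluster should list the daemons started above, each with its process id:
jps
The expected entries are NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, JobHistoryServer, and Jps itself.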

Testing the HDFS


For testing HDFS, we will use commands that are designed to work with Hadoop's file system.

Hadoop Interface commands


All these Hadoop commands must begin with one of the following (an example follows the list):

• hadoop fs {arg} relates to a generic file system which can point to any file system, such as the local one or HDFS.
• hadoop dfs {arg} is specific to HDFS. It is now deprecated, and hdfs dfs {arg} should be used instead.
• hdfs dfs {arg} relates to HDFS operations.
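For example, on this cluster hadoop fs -ls / and hdfs dfs -ls / return the same listing, because fs.defaultFS in core-site.xml points the generic file system at HDFS.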

Create a text file in home folder


In the following command, echo prints the specified text, which is piped to cat to append it to the file "my_file.txt" in the home folder.
echo "Hello this will be my first distributed and fault-tolerant data set\!" | cat >> my_file.txt
Remember:
• echo prints out the specified text on the screen
• cat, which stands for concatenate, is used to merge text files into one and to copy files.
List folders in HDFS
The following command will list the hdfs folders
hdfs dfs -ls /

Make a folder / directory in HDFS


The following command will create a folder named “user” in dfs
hdfs dfs -mkdir /user
Verify creation of the folder by using command hdfs dfs -ls /

Copy from Local Computer to HDFS


The following command will copy “my_file.txt” from local machine to user folder in hadoop.
hdfs dfs -copyFromLocal ~/my_file.txt /user
hdfs dfs -copyFromLocal ~/my_file.txt /user/my_file2.txt
hdfs dfs -copyFromLocal ~/my_file.txt /user/my_file3.txt

Verify the creation of the files in the folder "user" by


hdfs dfs -ls /user

Copy the created file "my_file.txt" a few more times, this time into another folder.
Let's create another folder named "use" with the command
hdfs dfs -mkdir /use

Repeat the above step, but this time copy the files to another folder “use”,
hdfs dfs -copyFromLocal ~/my_file.txt /use
hdfs dfs -copyFromLocal ~/my_file.txt /use/my_file2.txt
hdfs dfs -copyFromLocal ~/my_file.txt /use/my_file3.txt

Check the contents of the folder by the following command


hdfs dfs -ls /user

Removing files with wild card character


The wild card character * matches multiple files at once. The following command will remove all files in /user whose names start with "my_file".
hdfs dfs -rm /user/my_file*
Removing the folder in HDFS
The following command will remove the folder if it is empty.
hdfs dfs -rmdir /user

If the folder is not empty, it cannot be deleted with the -rmdir command.

hdfs dfs -ls /use will reveal the contents of the "use" folder.

Copying file to Local Computer from the HDFS


The following command will copy “my_file3.txt” from HDFS to local computer.
hdfs dfs -copyToLocal /use/my_file3.txt ~/my_file3_HDFSImport.txt

Verification of files on local Computer


The command ll will reveal all files in the current directory on the local computer. Note that ll is an alias of ls -alF, where

• the -a option shows hidden files
• the -l option shows the output as a long list with file attributes
• the -F option will append one of */=>@| to the entries; it is basically used to differentiate files from directories, as it appends / to directory entries.
Copy a file from HDFS to the local computer
The get command may also be used to copy a specified file from HDFS to the local filesystem. The following command will copy my_file.txt from HDFS to the local computer.
hdfs dfs -get /Test/my_file.txt my_file_hadoopt.txt

Copy a file from local compute to HDFS


The put command will copy a file, or files, from the local filesystem to the Hadoop file system. The following command will copy my_file.txt to the "/user" folder in HDFS.
hdfs dfs -put ~/my_file.txt /user

Merging files
The getmerge command concatenates the files in a source HDFS directory into a single file on the local filesystem. The following command will merge the three files in /Test and create a new file in the local user directory.
hdfs dfs -getmerge /Test ~/user/my_merged_file.txt
Display last lines of the file
The tail command displays the last kilobyte of a file. The following command will display the tail of my_file.txt in the Test folder.
hdfs dfs -tail /Test/my_file.txt

Exploring properties of HDFS Files.


Change group association of files
The chgrp command changes the group association of files; the "-R" option applies the operation recursively. The following command will change the group association of the file "my_file.txt" to hadoop. Verify the outcome with hdfs dfs -ls /use.
hdfs dfs -chgrp hadoop /use/my_file.txt

Change ownership
The chown command changes the user ownership of a file. The "-R" option will also change the ownership of all files within a directory if the target is a directory. The following commands will change the ownership of "my_file.txt" in the "use" folder to ubuntu.
hdfs dfs -ls /use shows the current permissions, ownership, file size, etc. of the files/directories.
hdfs dfs -chown -R ubuntu /use/my_file.txt changes the ownership of the file to ubuntu.
hdfs dfs -ls /use shows the new ownership.
Change permission on files
The chmod command changes the permissions of files. The "-R" option will change the permissions of all files within a directory if the target is a directory. The following commands will change the permissions on the file "my_file.txt" to read, write, and execute for user, group, and others.
hdfs dfs -ls /use shows the current permissions on the files/directories.
hdfs dfs -chmod 777 /use/my_file.txt grants all permissions to everyone on the file.
hdfs dfs -ls /use shows the updated permissions on the file.

Copy file to another folder


In HDFS, the cp command can copy file(s) only to an existing folder. In Linux, cp can copy to a non-existing folder, but in HDFS this is not allowed. The following command copies all matching files from the /use/ folder to the /user/ folder.
hdfs dfs -cp /use/my_file* /user/
Display size of files
The du command will display the sizes of files in the target folder.
hdfs dfs -du /user will display the file sizes in the "/user" folder.
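Adding the -h flag prints the sizes in human-readable units, e.g. hdfs dfs -du -h /user.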

Moving files
The mv command moves files from a source to a destination.
hdfs dfs -mv /user/my_file* /Test/ will move all my_file* files to the Test folder.
Verify the move by checking the Test folder via hadoop fs -ls /Test.

Statistics about file / directory


The stat command prints statistics about the targeted file / directory. By providing additional format options, you can get information such as (see the example after this list):
• name (%n)
• type (%F)
• user name of owner (%u)
• group name of owner (%g)
• blocks (%b)
• block size (%o)
• replication (%r)
• modification date (%y, %Y); %y shows the UTC date as "yyyy-MM-dd HH:mm:ss"
The command hdfs dfs -stat /use/my_file.txt will show when the file was last modified, and the option "%F" will display the type of the file.
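The format options can also be combined in a single call; for example, the following prints the name, type, replication factor, and modification time of the file on one line.
hdfs dfs -stat "%n %F %r %y" /use/my_file.txt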

HIVE
Hive is a data warehouse system for Hadoop. While Hadoop itself is programmed in Java, Hive runs SQL-like queries written in its own language, HQL (HiveQL), which are compiled and executed as MapReduce jobs. Data in Hadoop, even though generally unstructured, usually has some loose structure associated with it.

Download Hive
The following command will download "hive-2.1.1" from mirror.csclub.uwaterloo.ca to the Hive sub-folder of the Downloads folder.
wget http://mirror.csclub.uwaterloo.ca/apache/hive/hive-2.1.1/apache-hive-2.1.1-bin.tar.gz -P ~/Downloads/Hive

Extract the tar file


The following command will extract the tar file from the Downloads folder to the /usr/local folder.
sudo tar -zxf ~/Downloads/Hive/apache-hive-* -C /usr/local

Move the extracted files to /usr/local/hive


The following command will move apache-hive-2.1.1-bin from /usr/local to /usr/local/hive.
sudo mv /usr/local/apache-hive-* /usr/local/hive

Add Environment variables


The following commands will insert the environment variables at the end of .bashrc file.
echo -e "# HIVE Variables START" | cat >> ~/.bashrc
echo -e "export HIVE_HOME=/usr/local/hive" | cat >> ~/.bashrc
echo -e "export HIVE_CONF_HOME=\$HIVE_HOME/conf" | cat >> ~/.bashrc
echo -e "export PATH=\$PATH:\$HIVE_HOME/bin" | cat >> ~/.bashrc
echo -e "export CLASSPATH=\$CLASSPATH:\$HIVE_HOME/lib/*:." | cat >> ~/.bashrc
echo -e "# HIVE Variables END" | cat >> ~/.bashrc
Use the following command to verify the successful execution.
gedit ~/.bashrc

Instantiate environment variables.


source ~/.bashrc will instantiate the environment variables.

Verify Environment variables


echo $HIVE_HOME should give the location /usr/local/hive

Reboot
sudo reboot
Set Ownership and Permission
Log in as hduser and use the following commands to give "root" the ownership of $HIVE_HOME and grant permissions.
sudo chown root -R $HIVE_HOME
sudo chmod 777 -R $HIVE_HOME

Add the HADOOP_HOME Variable


Execute the following commands to append the HADOOP_HOME environment variable to bin/hive-config.sh.
echo -e "# HIVE Variables START" | cat >> $HIVE_HOME/bin/hive-config.sh
echo -e "export HADOOP_HOME=${HADOOP_HOME}" | cat >> $HIVE_HOME/bin/hive-config.sh
echo -e "# HIVE Variables END" | cat >> $HIVE_HOME/bin/hive-config.sh
Start the HDFS Environment
Start HDFS if it is not running.
start-dfs.sh
Build the directory structure for HIVE on the HDFS
The following commands will make directories and skip those that already exist.
$HADOOP_HOME/bin/hdfs dfs -mkdir /tmp
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /user/hive/warehouse
$HADOOP_HOME/bin/hdfs dfs -chmod -R 0777 /tmp
$HADOOP_HOME/bin/hdfs dfs -chmod -R 0777 /user/hive/warehouse

Setup the initial scheme for the database


schematool -initSchema -dbType derby

Launch HIVE
The command hive will launch hive.

Error handling during Hive Start


To handle the error "SLF4J: Class path contains multiple SLF4J bindings", execute the following commands.
sudo chown -R hduser $HIVE_HOME
sudo mv /usr/local/hive/lib/log4j-slf4j-impl-2.4.1.jar /usr/local/hive/lib/log4j-slf4j-impl-2.4.1.jar.ignore
sudo chown -R root $HIVE_HOME

Re-run schema for the database


The following command will create the metastore_db directory where Hive's metadata will be stored.
schematool -initSchema -dbType derby

Re-launch Hive
hive command will start Hive.

HiveQL Data Definition Language


HiveQL Data Definition Language (DDL) is similar to SQL. HQL statements must end with a semicolon, like statements in Java and C#.
Show Databases
The following command will list all databases in Hive. Since no user databases have been created yet, only the built-in default database will appear.
show databases;

Create Database
The following command will create a database named hduserdb.
create database hduserdb;

Verify creation of Database


Use the show databases command to see the newly created database "hduserdb".
show databases;

Use Database
The following command switches to the target database so that tables can be read or created in it.
use hduserdb;

Using a Non-existent Database


If you try to use a non-existent database, the system will give an error

Drop Database
The following command will delete / drop the database even if it has tables inside it. If the database does not contain any tables, then drop database name_of_database; will also work.
drop database if exists hduserdb cascade;
Create Table
The following command will create a table named "things" whose fields are delimited by commas.
CREATE TABLE things (thingId INT, name STRING) row format delimited fields terminated by ',';

Show Tables
The following command will show the newly created table “things” in the list.
show tables;

Describe Table
Describe command followed by target table will provide structure info of the table.
Describe things;

Drop Table
drop table followed by the target table deletes all data as well as the table structure. delete from followed by the target table (supported only on transactional tables in Hive) removes the data but keeps the table structure, so the table can be reloaded; the delete command can also take a where clause to remove only matching rows.
drop table things; for droping of table
show tables; to verify that table has been deleted.

Creating Tables in Specified Databases


1. Create specified databases
The following commands will create two databases bdat and bdat1.
create database bdat;
create database bdat1;
Verify their creation by using command show databases;

Create table in specified database


The following commands will create tables in the specified databases; verify them as shown after the statements.
create table bdat.student (id int, name string, age int);

create table bdat1.student (id int, name string, course string);
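To verify, switch to one of the databases and list its tables:
use bdat;
show tables;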

Managed Hive Tables


These tables live inside the Hive warehouse. All tables created in Hive are managed unless a specific location in HDFS is indicated.

Create a managed table


The following command will create student_profile table inside hive warehouse.

create table student_profile ( id int, name string, age int);

Insert single record in a table


The following command will insert a single record into the student_profile table.

insert into student_profile values

> ('001', 'Tom', '32');


Insert multiple records in a table
The following command will insert multiple records into the student_profile table.

insert into student_profile values


> ('002', 'Abey', '33'),
> ('003', 'Jay', '34');

We now have the student_profile table; because we inserted values into it, Hive also created temporary tables – tmp_table_1 reflects the insertion of 1 record and tmp_table_2 the insertion of 2 records.

Checking the table file in the Hive warehouse


The following command will display the files created in the Hive warehouse.
hadoop fs -ls /user/hive/warehouse

Checking contents of the table


The following command will list the files of the student_profile table.
hadoop fs -ls /user/hive/warehouse/student_profile

Check content of temporary table


The following command will display the content of the table's first data file.
hadoop fs -cat /user/hive/warehouse/student_profile/000000_0
