Hadoop Manual

Hadoop
TABLE OF CONTENTS
1. Basics of Big Data and Hadoop
2. Installing and Configuring Hadoop on Ubuntu
3. Running Hadoop at Localhost on Ubuntu
4. Installing and Configuring Hive on Ubuntu
5. Basics of Hive
6. Performing CRUD Operations in Hive
Basics of Big Data And Hadoop

What is Big Data?
Big data is a collection of datasets so large that they cannot be processed using traditional
computing techniques. It is no longer just data; it has become a complete subject that involves
various tools, techniques and frameworks.
What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are some
of the fields that come under the umbrella of Big Data.
Black Box Data : A black box is a component of helicopters, airplanes, and jets. It captures
the voices of the flight crew, recordings from microphones and earphones, and the
performance information of the aircraft.
Social Media Data : Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
Stock Exchange Data : Stock exchange data holds information about the buy and
sell decisions that customers make on shares of different companies.
Power Grid Data : Power grid data holds information about the power consumed by a
particular node with respect to a base station.
Transport Data : Transport data includes the model, capacity, distance and availability of a
vehicle.
Search Engine Data : Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it
is of three types.
Structured data : Relational data.
Semi Structured data : XML data.
Unstructured data : Word, PDF, Text, Media Logs.
Benefits of Big Data

Big data is critical to modern life and is emerging as one of the most important technologies
in the modern world. The following are just a few of its well-known benefits:
Using the information kept in social networks like Facebook, marketing agencies
are learning about the response to their campaigns, promotions, and other advertising
mediums.
Using information from social media, such as the preferences and product perception of
their consumers, product companies and retail organizations are planning their
production.
Using data on the previous medical history of patients, hospitals are
providing better and quicker service.
Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to more
concrete decision-making resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.
To harness the power of big data, you would require an infrastructure that can manage and
process huge volumes of structured and unstructured data in realtime and can protect data
privacy and security.
There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we
examine the following two classes of technology:
Operational Big Data

This includes systems like MongoDB that provide operational capabilities for real-time,
interactive workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud computing architectures
that have emerged over the past decade to allow massive computations to be run inexpensively
and efficiently. This makes operational big data workloads much easier to manage, cheaper, and
faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time data
with minimal coding and without the need for data scientists and additional infrastructure.
Analytical Big Data

This includes systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis that may
touch most or all of the data.
MapReduce provides a new method of analyzing data that is complementary to the capabilities
provided by SQL, and a system based on MapReduce can be scaled up from single servers
to thousands of high and low end machines.
These two classes of technology are complementary and frequently deployed together.
Operational vs. Analytical Systems

                 Operational        Analytical
Latency          1 ms - 100 ms      1 min - 100 min
Concurrency      1000 - 100,000     1 - 10
Access Pattern   Writes and Reads   Reads
Queries          Selective          Unselective
Data Scope       Operational        Retrospective
End User         Customer           Data Scientist
Technology       NoSQL              MapReduce, MPP Database
Big Data Challenges

The major challenges associated with big data are as follows:
Capturing data
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
To meet the above challenges, organizations normally take the help of enterprise servers.
Traditional Approach
In this approach, an enterprise has a computer to store and process big data. The data
is stored in an RDBMS such as Oracle Database, MS SQL Server or DB2, and sophisticated
software can be written to interact with the database, process the required data and present
it to the users for analysis.
Limitation
This approach works well when the volume of data is small enough to be accommodated by
standard database servers, or stays within the limit of the processor that is processing the data.
But when it comes to dealing with huge amounts of data, processing it through a traditional
database server becomes a tedious task.
Google's Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the
task into small parts and assigns those parts to many computers connected over the network,
and collects the results to form the final result dataset.
Hadoop
Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an
open source project called Hadoop in 2005. Doug named it after his son's toy elephant.
Apache Hadoop is now a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop
applications that run on clusters of computers and perform complete statistical analysis on
huge amounts of data.
Hadoop is an Apache open source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models. An application
built on the Hadoop framework works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from a single server to
thousands of machines, each offering local computation and storage.
Hadoop Architecture
The Hadoop framework includes the following four modules:
Hadoop Common: These are Java libraries and utilities required by other Hadoop
modules. These libraries provide filesystem and OS level abstractions and contain the
necessary Java files and scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster resource
management.
Hadoop Distributed File System (HDFS): A distributed file system that provides
high-throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing of large data
sets.
We can use following diagram to depict these four components available in Hadoop framework.
Since 2012, the term "Hadoop" often refers not just to the base modules mentioned above but
also to the collection of additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark etc.
Map Reduce
Hadoop MapReduce is a software framework for easily writing applications that process large
amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
The Map Task: This is the first task, which takes input data and converts it into a set
of data, where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines
those data tuples into a smaller set of tuples. The reduce task is always performed
after the map task.
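The two tasks above can be sketched in plain Python with a word count. This is only an illustration of the idea, not Hadoop's Java API: the function names are made up for this example, and the grouping step stands in for the shuffle that the framework performs between map and reduce.

```python
from collections import defaultdict


def map_task(line):
    """Map: break one input line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single (key, total) tuple."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, group by key, then reduce each group."""
    mapped = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data nodes"]))
```

In a real cluster each call to map_task would run on the node that holds the input split, and each key group would be sent to one reducer.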
Typically both the input and the output are stored in a file system. The framework takes care
of scheduling tasks, monitoring them and re-executing the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per
cluster node. The master is responsible for resource management, tracking resource
consumption/availability and scheduling the job's component tasks on the slaves, monitoring
them and re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by
the master and provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if
the JobTracker goes down, all running jobs are halted.
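The re-execution of failed tasks can be pictured with a small scheduling loop. The sketch below is a toy model under stated assumptions: run_with_retries and make_flaky_executor are hypothetical names, failures are simulated with RuntimeError, and the retry limit of 3 is an arbitrary value chosen for the example rather than a Hadoop setting.

```python
def run_with_retries(tasks, execute, max_attempts=3):
    """Run every task; put a failed task back in the queue until max_attempts is reached."""
    results, attempts = {}, {}
    pending = list(tasks)
    while pending:
        task = pending.pop(0)
        attempts[task] = attempts.get(task, 0) + 1
        try:
            results[task] = execute(task)
        except RuntimeError:
            if attempts[task] >= max_attempts:
                raise  # give up: the task failed too many times
            pending.append(task)  # schedule the failed task again
    return results, attempts


def make_flaky_executor(fail_once):
    """Hypothetical worker: each task named in fail_once fails on its first attempt only."""
    remaining = set(fail_once)

    def execute(task):
        if task in remaining:
            remaining.discard(task)
            raise RuntimeError(f"simulated failure of {task}")
        return f"{task}: done"

    return execute


if __name__ == "__main__":
    results, attempts = run_with_retries(
        ["map-0", "map-1", "reduce-0"], make_flaky_executor({"map-1"}))
    print(results)
    print(attempts)
```

The master in this model is a single loop, which also illustrates why it is a single point of failure: if the loop stops, no task is scheduled or retried.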
Hadoop Distributed File System

Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP
FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop
Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on large clusters (thousands of
computers) of small machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where the master consists of a single NameNode that
manages the file system metadata and one or more slave DataNodes that store the actual data.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of
DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The
DataNodes take care of read and write operations with the file system. They also take care of
block creation, deletion and replication based on instructions given by the NameNode.
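A rough way to picture the block mapping is the following Python sketch. It is a toy model only: the function names are invented for this example, the 128 MB block size is used as a sample value, and the round-robin placement is an assumption for illustration (real HDFS placement is rack-aware and decided by the NameNode).

```python
def split_into_blocks(file_size, block_size):
    """Return an (offset, length) pair for each block of a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_blocks(num_blocks, datanodes, replication):
    """Toy NameNode mapping: block index -> list of DataNodes holding a replica.

    Round-robin placement is an assumption made for this sketch.
    """
    if replication > len(datanodes):
        raise ValueError("replication cannot exceed the number of DataNodes")
    return {block: [datanodes[(block + r) % len(datanodes)]
                    for r in range(replication)]
            for block in range(num_blocks)}


if __name__ == "__main__":
    MB = 1024 * 1024
    blocks = split_into_blocks(300 * MB, 128 * MB)
    mapping = place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
    for index, (offset, length) in enumerate(blocks):
        print(index, offset, length, mapping[index])
```

With a replication factor of 1, as used in the single-node setup later in this manual, every block has exactly one copy.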
HDFS provides a shell like any other file system, and a list of commands is available to
interact with the file system. These shell commands will be covered in a separate chapter along
with appropriate examples.
How Does Hadoop Work?

Stage 1
A user or application can submit a job to Hadoop (through a Hadoop job client) for the required
processing by specifying the following items:
1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a jar file, containing the implementation of the map and reduce
functions.
3. The job configuration, set through different parameters specific to the job.
Stage 2
The Hadoop job client then submits the job (jar/executable etc) and configuration to the
JobTracker which then assumes the responsibility of distributing the software/configuration
to the slaves, scheduling tasks and monitoring them, providing status and diagnostic
information to the job-client.
Stage 3
The TaskTrackers on different nodes execute the task as per MapReduce implementation and
output of the reduce function is stored into the output files on the file system.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA); rather, the Hadoop library itself has been designed to detect and handle failures
at the application layer.
Servers can be added to or removed from the cluster dynamically and Hadoop continues
to operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java based.
Installing And Configuring Hadoop In Ubuntu

Installing Java
k@laptop:~$ cd ~
# Update the source list
k@laptop:~$ sudo apt-get update
# The OpenJDK project is the default version of Java
# that is provided from a supported Ubuntu repository.
k@laptop:~$ sudo apt-get install default-jdk
k@laptop:~$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
Adding a dedicated Hadoop user

k@laptop:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1002) ...
Done.
k@laptop:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
Full Name []:
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n]
Installing SSH
ssh has two main components:
1. ssh : The command we use to connect to remote machines - the client.
2. sshd : The daemon that runs on the server and allows clients to connect to the server.
The ssh client is pre-enabled on Linux, but in order to start the sshd daemon, we need to install ssh
first. Use this command to do that:
k@laptop:~$ sudo apt-get install ssh
This will install ssh on our machine. If we get output similar to the following, it is set up
properly:
k@laptop:~$ which ssh
/usr/bin/ssh
k@laptop:~$ which sshd

/usr/sbin/sshd
Create and Setup SSH Certificates

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For our
single-node setup of Hadoop, we therefore need to configure SSH access to localhost.
So, we need to have SSH up and running on our machine and configure it to allow SSH public key
authentication.
Hadoop uses SSH to access its nodes, which would normally require the user to enter a password.
However, this requirement can be eliminated by creating an SSH key pair with an empty passphrase and
authorizing it, using the following commands. If asked for a filename, just leave it blank and press
the Enter key to continue.
k@laptop:~$ su hduser
Password:
hduser@laptop:/home/k$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
50:6b:f3:fc:0f:32:bf:30:79:c2:41:71:26:cc:7d:e3 hduser@laptop
The key's randomart image is:
(randomart image omitted)
hduser@laptop:/home/k$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The cat command adds the newly created public key to the list of authorized keys so that Hadoop can
use ssh without prompting for a password. We can check that ssh works:
hduser@laptop:/home/k$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)
Install Hadoop
hduser@laptop:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
hduser@laptop:~$ tar xvzf hadoop-2.6.0.tar.gz
We want to move the Hadoop installation to the /usr/local/hadoop directory using the following command:
hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
[sudo] password for hduser:
hduser is not in the sudoers file. This incident will be reported.
We got the following error:
"hduser is not in the sudoers file. This incident will be reported."
This error can be resolved by switching to a user that already has sudo rights (here, k) and then
adding hduser to the sudo group:
hduser@laptop:~/hadoop-2.6.0$ su k
Password:
k@laptop:/home/hduser$ sudo adduser hduser sudo
[sudo] password for k:

Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.
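Now that hduser has sudo privileges, we can move the Hadoop installation to
the /usr/local/hadoop directory without any problem. The target directory must exist before mv can
move several files into it, so we create it first:
k@laptop:/home/hduser$ sudo su hduser
hduser@laptop:~/hadoop-2.6.0$ sudo mkdir -p /usr/local/hadoop
hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
hduser@laptop:~/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop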
Setup Configuration Files

The following files will have to be modified to complete the Hadoop setup:
1. ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
1. ~/.bashrc:
Before editing the .bashrc file in our home directory, we need to find the path where Java has been
installed so that we can set the JAVA_HOME environment variable. Use the following command:
hduser@laptop:~$ update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.
Now we can append the following to the end of ~/.bashrc:
hduser@laptop:~$ vi ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
hduser@laptop:~$ source ~/.bashrc
Note that JAVA_HOME should be set to the path just before the '.../bin/' component:
hduser@ubuntu-VirtualBox:~$ javac -version
javac 1.7.0_75
hduser@ubuntu-VirtualBox:~$ which javac
/usr/bin/javac
hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac
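The rule for deriving JAVA_HOME from the resolved binary path can be written as a small Python helper. This is a sketch under assumptions: java_home_from_binary is a name invented for this example, and it only handles the two layouts seen in this manual (.../bin/javac and .../jre/bin/java).

```python
import posixpath


def java_home_from_binary(resolved_binary):
    """Strip the trailing bin/<tool> (and an optional jre level) from a resolved path."""
    bin_dir = posixpath.dirname(resolved_binary)   # e.g. .../bin
    home = posixpath.dirname(bin_dir)              # parent directory of bin
    if posixpath.basename(home) == "jre":
        home = posixpath.dirname(home)             # step out of the bundled JRE
    return home


if __name__ == "__main__":
    print(java_home_from_binary("/usr/lib/jvm/java-7-openjdk-amd64/bin/javac"))
    print(java_home_from_binary("/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java"))
```

Both sample paths yield /usr/lib/jvm/java-7-openjdk-amd64, the value exported as JAVA_HOME in ~/.bashrc. On a live system the input could come from os.path.realpath(shutil.which("javac")), provided javac is installed.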
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying the hadoop-env.sh file.
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME variable will
be available to Hadoop whenever it is started up.
3. /usr/local/hadoop/etc/hadoop/core-site.xml:
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop uses
when starting up.
This file can be used to override the default settings that Hadoop starts with.
hduser@laptop:~$ sudo mkdir -p /app/hadoop/tmp
hduser@laptop:~$ sudo chown hduser:hadoop /app/hadoop/tmp
Open the file and enter the following in between the <configuration></configuration> tag:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
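After editing, it can help to confirm that the file is well-formed XML and holds the intended values. The following Python sketch reads the name/value pairs from a Hadoop-style configuration document using only the standard library. read_hadoop_properties is a hypothetical helper, not part of Hadoop, and the sample text is a shortened copy of the properties shown above.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return {name: value} for every <property> element under <configuration>."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name").strip(): prop.findtext("value").strip()
            for prop in root.findall("property")}


# Shortened copy of the core-site.xml content from this section.
SAMPLE = """<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for name, value in read_hadoop_properties(SAMPLE).items():
        print(name, "=", value)
```

For a file on disk, ET.parse(path).getroot() could take the place of ET.fromstring. A parse error at this stage usually points to an unclosed tag in the edited file.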
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains the mapred-site.xml.template
file, which has to be copied to a file named mapred-site.xml:
hduser@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tag:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster
that is being used.
It is used to specify the directories which will be used as the namenode and the datanode on that host.
Before editing this file, we need to create two directories which will contain the namenode and the
datanode for this Hadoop installation.
This can be done using the following commands:
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@laptop:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
Open the file and enter the following content in between the <configuration></configuration> tag:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
Format the New Hadoop Filesystem

Now the Hadoop file system needs to be formatted so that we can start to use it. The format command
must be run by a user with write permission, since it creates a current directory
under the /usr/local/hadoop_store/hdfs/namenode folder:
hduser@laptop:~$ hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/04/18 14:43:03 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = laptop/192.168.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.0
STARTUP_MSG: classpath = /usr/local/hadoop/etc/hadoop
...
STARTUP_MSG: java = 1.7.0_65
************************************************************/
15/04/18 14:43:03 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
15/04/18 14:43:03 INFO namenode.NameNode: createNameNode [-format]
15/04/18 14:43:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-e2f515ac-33da-45bc-8466-5b1100a2bf7f
15/04/18 14:43:09 INFO namenode.FSNamesystem: No KeyProvider found.
15/04/18 14:43:09 INFO namenode.FSNamesystem: fsLock is fair:true
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
15/04/18 14:43:10 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: The block deletion will start around 2015 Apr 18 14:43:10
15/04/18 14:43:10 INFO util.GSet: Computing capacity for map BlocksMap
15/04/18 14:43:10 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:10 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
15/04/18 14:43:10 INFO util.GSet: capacity = 2^21 = 2097152 entries
15/04/18 14:43:10 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: defaultReplication = 1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplication = 512
15/04/18 14:43:10 INFO blockmanagement.BlockManager: minReplication = 1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
15/04/18 14:43:10 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: encryptDataTransfer = false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
15/04/18 14:43:10 INFO namenode.FSNamesystem: fsOwner = hduser (auth:SIMPLE)
15/04/18 14:43:10 INFO namenode.FSNamesystem: supergroup = supergroup
15/04/18 14:43:10 INFO namenode.FSNamesystem: isPermissionEnabled = true

15/04/18 14:43:10 INFO namenode.FSNamesystem: HA Enabled: false
15/04/18 14:43:10 INFO namenode.FSNamesystem: Append Enabled: true
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map INodeMap
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: capacity = 2^20 = 1048576 entries
15/04/18 14:43:11 INFO namenode.NameNode: Caching file names occuring more than 10 times
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map cachedBlocks
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: capacity = 2^18 = 262144 entries
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
15/04/18 14:43:11 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
15/04/18 14:43:11 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
15/04/18 14:43:11 INFO util.GSet: capacity = 2^15 = 32768 entries
15/04/18 14:43:11 INFO namenode.NNConf: ACLs enabled? false
15/04/18 14:43:11 INFO namenode.NNConf: XAttrs enabled? true
15/04/18 14:43:11 INFO namenode.NNConf: Maximum size of an xattr: 16384
15/04/18 14:43:12 INFO namenode.FSImage: Allocated new BlockPoolId: BP-130729900-192.168.1.1-1429393391595
15/04/18 14:43:12 INFO common.Storage: Storage directory /usr/local/hadoop_store/hdfs/namenode has been successfully formatted.
15/04/18 14:43:12 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid>= 0
15/04/18 14:43:12 INFO util.ExitUtil: Exiting with status 0
15/04/18 14:43:12 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at laptop/192.168.1.1
************************************************************/
Running Hadoop on Ubuntu at Local host

LOG INTO ADMINISTRATOR HADOOP USER ACCOUNT:
$ sudo su hduser
CHANGE DIRECTORY TO HADOOP INSTALLATION:
$ cd /usr/local/hadoop/sbin/
START ALL HADOOP SERVICES:
$ start-all.sh
CHECK IF ALL THE NODES ARE RUNNING
$ jps
CHECK IF NAME NODE IS RUNNING IN LOCAL HOST AT PORT 50070

http://localhost:50070
CHECK IF SECONDARY NODE IS RUNNING IN LOCAL HOST AT PORT 50090

http://localhost:50090/status.jsp
CHECK IF DATA NODE IS RUNNING IN LOCAL HOST AT PORT 50070

http://localhost:50070/dfshealth.html#tab-datanode
STOPPING ALL HADOOP SERVICES

$ stop-all.sh
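The jps check above can be automated by comparing its output against the daemons that a single-node Hadoop 2.x setup is expected to run after start-all.sh. The Python sketch below parses text in the "<pid> <name>" form that jps prints. The helper name and the process IDs in the sample are made up for illustration.

```python
# Daemons expected on a single-node Hadoop 2.x setup after start-all.sh.
EXPECTED_DAEMONS = frozenset({"NameNode", "DataNode", "SecondaryNameNode",
                              "ResourceManager", "NodeManager"})


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the sorted names of expected daemons that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:          # "<pid> <name>"
            running.add(parts[1])
    return sorted(set(expected) - running)


# Hypothetical jps output; the PIDs are invented and DataNode is absent on purpose.
SAMPLE_JPS = """9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7812 SecondaryNameNode"""

if __name__ == "__main__":
    print(missing_daemons(SAMPLE_JPS))
```

On a live machine the text could be obtained with subprocess.run(["jps"], capture_output=True, text=True).stdout. An empty result means all five daemons were found.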
Basics of hive
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook; later the Apache Software Foundation took it up and
developed it further as open source software under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes each unit:
Unit Name : Operation
User Interface : Hive is data warehouse infrastructure software that enables interaction
between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive
command line, and Hive HD Insight (on Windows Server).
Meta Store : Hive chooses respective database servers to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine : HiveQL is similar to SQL for querying schema information in the
Metastore. It is one of the replacements for the traditional approach of writing a MapReduce
program. Instead of writing a MapReduce program in Java, we can write a query for the
MapReduce job and process it.
Execution Engine : The Hive execution engine is the link between the HiveQL process engine
and MapReduce. It processes the query and generates the same results as MapReduce would.
It uses a flavor of MapReduce.
HDFS or HBASE : The Hadoop Distributed File System or HBase is the data storage technique
used to store data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with the Hadoop framework:
Step No. : Operation
1. Execute Query : The Hive interface, such as the command line or Web UI, sends the query to
the driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan : The driver takes the help of the query compiler, which parses the query to
check the syntax and the query plan or the requirements of the query.
3. Get Metadata : The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata : The Metastore sends the metadata as a response to the compiler.
5. Send Plan : The compiler checks the requirements and resends the plan to the
driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan : The driver sends the execution plan to the execution engine.
7. Execute Job : Internally, the execution of the job is a MapReduce job. The
execution engine sends the job to the JobTracker, which is in the Name node,
and it assigns this job to a TaskTracker, which is in a Data node. Here,
the query executes the MapReduce job.
7.1 Metadata Ops : Meanwhile, during execution, the execution engine can execute metadata
operations with the Metastore.
8. Fetch Result : The execution engine receives the results from the Data nodes.
9. Send Results : The execution engine sends those resultant values to the driver.
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer data can be specified using the integral data type INT. When the data range
exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than that of
INT, you can use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts the various INT data types:
Type       Postfix   Example
TINYINT    Y         10Y
SMALLINT   S         10S
INT        -         10
BIGINT     L         10L
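To make the size relationship concrete, the Python sketch below computes the signed range of each integral type from its width (1, 2, 4 and 8 bytes for TINYINT, SMALLINT, INT and BIGINT) and picks the smallest type that can hold a given value. The helper names are invented for this example.

```python
# Width in bits of each Hive integral type, smallest first.
INTEGRAL_BITS = [("TINYINT", 8), ("SMALLINT", 16), ("INT", 32), ("BIGINT", 64)]


def signed_range(bits):
    """Return (min, max) for a signed two's-complement integer of the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_type_for(value):
    """Return the name of the narrowest integral type whose range contains value."""
    for name, bits in INTEGRAL_BITS:
        low, high = signed_range(bits)
        if low <= value <= high:
            return name
    raise OverflowError(f"{value} does not fit in BIGINT")


if __name__ == "__main__":
    for name, bits in INTEGRAL_BITS:
        print(name, signed_range(bits))
    print(smallest_type_for(40000))
```

For example, 40000 is outside the SMALLINT range of -32768 to 32767, so the helper reports INT.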
String Types
String literals can be specified using single quotes (' ') or double quotes (" "). The string
category contains two data types: VARCHAR and CHAR. Hive follows C-style escape characters.
The following table depicts the CHAR data types:
Data Type   Length
VARCHAR     1 to 65535
CHAR        255
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision. It supports the
java.sql.Timestamp format YYYY-MM-DD HH:MM:SS.fffffffff and the format yyyy-mm-dd
hh:mm:ss.fffffffff.
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for
representing immutable arbitrary-precision decimal numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
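In DECIMAL(precision, scale), precision is the total number of digits and scale is the number of digits after the decimal point, which leaves precision minus scale digits for the integer part. The Python sketch below checks whether a value fits a given declaration without rounding. fits_decimal is a hypothetical helper for illustration and does not reproduce how Hive itself handles values that do not fit.

```python
from decimal import Decimal


def fits_decimal(value, precision, scale):
    """True if value can be held exactly by DECIMAL(precision, scale)."""
    # normalize() drops trailing zeros so that "1.50" is treated like "1.5".
    _sign, digits, exponent = Decimal(value).normalize().as_tuple()
    fraction_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fraction_digits <= scale and integer_digits <= precision - scale


if __name__ == "__main__":
    print(fits_decimal("123.45", 5, 2))       # 3 integer digits, 2 fraction digits
    print(fits_decimal("123.45", 4, 2))       # only 2 integer digits available
    print(fits_decimal("1234567890", 10, 0))  # the decimal(10,0) example above
```

With decimal(10,0), a value can have up to ten integer digits and no fractional digits.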
Union Types
Union is a collection of heterogeneous data types. You can create an instance using
create_union. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
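In each sample value above, the number before the colon is a tag that selects one of the member types declared in the UNIONTYPE, counting from 0. The Python sketch below only illustrates that lookup. The names MEMBER_TYPES and member_type are hypothetical and do not correspond to any Hive API.

```python
# Member types in the order they are declared in the UNIONTYPE above.
MEMBER_TYPES = ["int", "double", "array<string>", "struct<a:int,b:string>"]


def member_type(tag):
    """Return the declared type selected by a 0-based union tag."""
    if not 0 <= tag < len(MEMBER_TYPES):
        raise ValueError(f"tag {tag} is outside the declared member types")
    return MEMBER_TYPES[tag]


if __name__ == "__main__":
    # (tag, value) pairs mirroring the first four sample rows.
    samples = [(0, 1), (1, 2.0), (2, ["three", "four"]), (3, {"a": 5, "b": "five"})]
    for tag, value in samples:
        print(f"tag {tag} selects {member_type(tag)}: {value!r}")
```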
Literals
The following literals are used in Hive:
Floating Point Types

Floating point types are numbers with decimal points. Generally, this type of data is
represented by the DOUBLE data type.
Decimal Type
Decimal type data is a floating point value with a higher range than the DOUBLE data
type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Installing And Configuring Hive in Ubuntu

INTRODUCTION
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization,
query, and analysis. Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and
compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called
HiveQL (Hive Query Language) while maintaining full support for map/reduce.
Hive Installation
Installing HIVE:
1. Browse to the link: http://apache.claz.org/hive/stable/
2. Click apache-hive-0.13.0-bin.tar.gz
3. Save and extract it
Commands:
user@ubuntu:~$ cd /usr/lib/
user@ubuntu:~$ sudo mkdir hive
user@ubuntu:~$ cd ~/Downloads
user@ubuntu:~$ sudo mv apache-hive-0.13.0-bin /usr/lib/hive
Setting Hive environment variable:

Commands:
user@ubuntu:~$ cd
user@ubuntu:~$ sudo gedit ~/.bashrc
Copy and paste the following lines at the end of the file:
# Set HIVE_HOME
export HIVE_HOME="/usr/lib/hive/apache-hive-0.13.0-bin"
PATH=$PATH:$HIVE_HOME/bin
export PATH
Setting HADOOP_HOME in hive-config.sh

Commands:
user@ubuntu:~$ cd /usr/lib/hive/apache-hive-0.13.0-bin/bin
user@ubuntu:~$ sudo gedit hive-config.sh
Go to the line where the following statements are written:
# Allow alternate conf dir location.

HIVE_CONF_DIR="${HIVE_CONF_DIR:-$HIVE_HOME/conf}"
export HIVE_CONF_DIR=$HIVE_CONF_DIR
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH
Below this write the following:
export HADOOP_HOME=/usr/local/hadoop
(use the path where Hadoop is installed)
Create Hive directories within HDFS

Command:
user@ubuntu:~$ hadoop fs -mkdir -p /user/hive/warehouse
Setting group write permission on the warehouse directory:

Command:
user@ubuntu:~$ hadoop fs -chmod g+w /user/hive/warehouse
Performing C.R.U.D Operations in Hive

Launching HIVE
Command:
user@ubuntu:~$ hive
Creating a database:
Command:
hive> create database mydb;
Show all databases:

Command:
hive> show databases;
Create a Table:
Command:
hive> create table john(id varchar(30),number varchar(30));
Insert into a Table:

Command:
hive> insert into table john values ('20','40');
(INSERT ... VALUES requires Hive 0.14 or later.)
Display a Table:
Command:
hive> Select * from john;
Alter the Table:

Command:
hive> ALTER TABLE john RENAME TO jerry;
Drop the Table:

Command:
hive> DROP TABLE jerry;
