TABLE OF CONTENTS
S.NO
CONTENTS
Basics of Hive
Black Box Data: The black box is a component of helicopters, airplanes, and jets. It captures
the voices of the flight crew, recordings from microphones and earphones, and the
performance information of the aircraft.
Social Media Data: Social media such as Facebook and Twitter hold information and
views posted by millions of people across the globe.
Stock Exchange Data: Stock exchange data holds information about the buy and
sell decisions that customers make on shares of different companies.
Power Grid Data: Power grid data holds information about the power consumed by a
particular node with respect to a base station.
Transport Data: Transport data includes the model, capacity, distance and availability of a
vehicle.
Search Engine Data: Search engines retrieve large amounts of data from different databases.
Thus Big Data includes huge volume, high velocity, and extensible variety of data. The data in it
will be of three types.
Using the information kept in social networks such as Facebook, marketing agencies
learn about the response to their campaigns, promotions, and other advertising
media.
Using social media information such as the preferences and product perception of
their consumers, product companies and retail organizations plan their
production.
Using data about the previous medical history of patients, hospitals
provide better and quicker service.
Some NoSQL systems can provide insights into patterns and trends based on real-time data
with minimal coding and without the need for data scientists and additional infrastructure.
                  Operational         Analytical
Latency           1 ms - 100 ms
Concurrency       1000 - 100,000      1 - 10
Access Pattern                        Reads
Queries           Selective           Unselective
Data Scope        Operational         Retrospective
End User          Customer            Data Scientist
Technology        NoSQL
Capturing data
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
To meet the above challenges, organizations normally rely on enterprise servers.
Traditional Approach
In this approach, an enterprise has a single computer to store and process big data. The data
is stored in an RDBMS such as Oracle Database, MS SQL Server or DB2, and sophisticated
software can be written to interact with the database, process the required data and present
it to the users for analysis.
Limitation
This approach works well when the volume of data is small enough to be accommodated by
standard database servers, or stays within the limit of the processor that is processing the data.
But when it comes to dealing with huge amounts of data, processing such data through a
traditional database server is a tedious task.
Google's Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the
task into small parts and assigns those parts to many computers connected over the network,
and collects the results to form the final result dataset.
Hadoop
Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an
Open Source Project called HADOOP in 2005 and Doug named it after his son's toy elephant.
Now Apache Hadoop is a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different CPU nodes. In short, the Hadoop framework is capable of developing
applications that run on clusters of computers and perform complete
statistical analysis of huge amounts of data.
Hadoop is an Apache open source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models.
Hadoop Architecture
The Hadoop framework includes the following four modules:
Hadoop Common: These are Java libraries and utilities required by other Hadoop
modules. These libraries provide filesystem and OS level abstractions and contain the
necessary Java files and scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster resource
management.
Hadoop Distributed File System (HDFS): A distributed file system that provides
high-throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing of large data
sets.
We can use the following diagram to depict these four components available in the Hadoop framework.
Since 2012, the term "Hadoop" often refers not just to the base modules mentioned above but
also to the collection of additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark etc.
Map Reduce
Hadoop MapReduce is a software framework for easily writing applications that process large
amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
The Map Task: This is the first task, which takes input data and converts it into a set
of data, where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines
those data tuples into a smaller set of tuples. The reduce task is always performed
after the map task.
Typically both the input and the output are stored in a file system. The framework takes care
of scheduling tasks, monitoring them and re-executing the failed tasks.
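The two tasks described above can be sketched in plain Python. This is only an illustration of the map, group, and reduce idea on a single machine, not Hadoop's API: the function names (map_task, shuffle, reduce_task, run_job) and the word-count job are assumptions chosen for the example.

```python
from collections import defaultdict


def map_task(line):
    """Map: break one input line into (key, value) tuples, here (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as happens between the map and reduce tasks."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine the values of one key into a single, smaller tuple."""
    return key, sum(values)


def run_job(lines):
    """Run the map task on every line, then the reduce task on every key."""
    # In a cluster each line (or file split) could be mapped on a different node.
    mapped = [pair for line in lines for pair in map_task(line)]
    # The reduce task only starts once the map output is available.
    return dict(reduce_task(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["big data big clusters", "data nodes"]))
```

Here each input line plays the role of one map input; in a real job the framework, not the program, decides where each task runs.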
The MapReduce framework consists of a single master JobTracker and one
slave TaskTracker per cluster-node. The master is responsible for resource management,
tracking resource consumption/availability and scheduling the jobs' component tasks on the
slaves, monitoring them and re-executing the failed tasks. The slave TaskTrackers execute the
tasks as directed by the master and provide task-status information to the master
periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service which means if
JobTracker goes down, all running jobs are halted.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of
DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The
DataNodes take care of read and write operations with the file system. They also take care of
block creation, deletion and replication based on instructions given by the NameNode.
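The block splitting and block-to-DataNode mapping described above can be sketched as follows. Everything specific here is an assumption made for illustration: the 128 MB block size, the DataNode names, and the round-robin placement, which is a simple stand-in and not the NameNode's actual placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes, for illustration


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of file_size bytes is split into."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    # Every block is full except possibly the last one.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks, datanodes, replication=1):
    """Map each block index to `replication` distinct DataNodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("replication cannot exceed the number of DataNodes")
    mapping = {}
    for block in range(num_blocks):
        mapping[block] = [
            datanodes[(block + copy) % len(datanodes)] for copy in range(replication)
        ]
    return mapping


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)
    print(len(blocks), "blocks")
    print(place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2))
```

The dictionary returned by place_blocks plays the role of the mapping that the NameNode keeps; the DataNodes themselves only store and serve the blocks.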
HDFS provides a shell like any other file system, and a list of commands is available to
interact with the file system. These shell commands will be covered in a separate chapter along
with appropriate examples.
1. The location of the input and output files in the distributed file system.
2. The java classes in the form of jar file containing the implementation of map and reduce
functions.
3. The job configuration by setting different parameters specific to the job.
Stage 2
The Hadoop job client then submits the job (jar/executable etc) and configuration to the
JobTracker which then assumes the responsibility of distributing the software/configuration
to the slaves, scheduling tasks and monitoring them, providing status and diagnostic
information to the job-client.
Stage 3
The TaskTrackers on different nodes execute the task as per MapReduce implementation and
output of the reduce function is stored into the output files on the file system.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA); rather, the Hadoop library itself has been designed to detect and handle failures
at the application layer.
Servers can be added to or removed from the cluster dynamically and Hadoop continues
to operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java based.
Is the information correct? [Y/n]
Installing SSH
ssh has two main components:
1. ssh : The client command used to connect to remote machines.
2. sshd : The daemon that runs on the server and allows clients to connect to the server.
The ssh client is pre-enabled on Linux, but in order to start the sshd daemon, we need to install ssh first. Use this
command to do that:
k@laptop:~$ sudo apt-get install ssh
This will install ssh on our machine. If we see output similar to the following, it is set up
properly:
k@laptop:~$ which ssh
/usr/bin/ssh
So, we need to have SSH up and running on our machine and configured to allow SSH public key
authentication.
Hadoop uses SSH to access its nodes, which would normally require the user to enter a password.
However, this requirement can be eliminated by creating and setting up SSH keys using the
following commands. If asked for a filename, just leave it blank and press the Enter key to continue.
k@laptop:~$ su hduser
Password:
k@laptop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
50:6b:f3:fc:0f:32:bf:30:79:c2:41:71:26:cc:7d:e3 hduser@laptop
The key's randomart image is:
[randomart image]
hduser@laptop:/home/k$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The second command adds the newly created key to the list of authorized keys so that Hadoop can use ssh
without prompting for a password. We can check if ssh works:
hduser@laptop:/home/k$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)
Install Hadoop
hduser@laptop:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
We want to move the Hadoop installation to the /usr/local/hadoop directory using the following command:
hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
[sudo] password for hduser:
hduser is not in the sudoers file. This incident will be reported.
Oops! We got:
"hduser is not in the sudoers file. This incident will be reported."
This error can be resolved by switching to a user with sudo rights and then adding hduser to the sudo group:
hduser@laptop:~/hadoop-2.6.0$ su k
Password:
k@laptop:/home/hduser$ sudo adduser hduser sudo
Now that hduser has root privileges, we can move the Hadoop installation to
the /usr/local/hadoop directory without any problem:
k@laptop:/home/hduser$ sudo su hduser
1. ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
1. ~/.bashrc:
Before editing the .bashrc file in our home directory, we need to find the path where Java has been
installed so that we can set the JAVA_HOME environment variable. Then open the file and append the following:
hduser@laptop:~$ vi ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
Note that JAVA_HOME should be set to the path just before the '.../bin/':
hduser@ubuntu-VirtualBox:~$ javac -version
javac 1.7.0_75
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME variable will
be available to Hadoop whenever it is started up.
3. /usr/local/hadoop/etc/hadoop/core-site.xml:
Open the file and enter the following in between the <configuration></configuration> tag:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
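To double-check what a configuration file like the one above actually sets, the name/value pairs can be read back with a few lines of Python. This is a minimal sketch using only the standard library; the helper name read_properties and the inline SAMPLE text are assumptions for the example, and Hadoop itself does not need this step.

```python
import xml.etree.ElementTree as ET

# Sample text mirroring the two properties entered in core-site.xml above.
SAMPLE = """<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a dict of {name: value} for every <property> in the configuration."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            properties[name.strip()] = (value or "").strip()
    return properties


if __name__ == "__main__":
    for name, value in read_properties(SAMPLE).items():
        print(name, "=", value)
```

To check a real file, read its contents and pass the text to read_properties; a malformed file raises xml.etree.ElementTree.ParseError, which is a quick way to catch an unclosed tag.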
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
hduser@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tag:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster
that is being used.
It is used to specify the directories which will be used as the namenode and the datanode on that host.
Before editing this file, we need to create two directories which will contain the namenode and the
datanode for this Hadoop installation.
This can be done using the following commands:
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@laptop:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
Open the file and enter the following content in between the <configuration></configuration> tag:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
************************************************************/
15/04/18 14:43:03 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
15/04/18 14:43:03 INFO namenode.NameNode: createNameNode [-format]
15/04/18 14:43:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-e2f515ac-33da-45bc-8466-5b1100a2bf7f
15/04/18 14:43:09 INFO namenode.FSNamesystem: No KeyProvider found.
15/04/18 14:43:09 INFO namenode.FSNamesystem: fsLock is fair:true
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.iphostname-check=true
15/04/18 14:43:10 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: The block deletion will start around 2015 Apr 18 14:43:10
15/04/18 14:43:10 INFO util.GSet: Computing capacity for map BlocksMap
15/04/18 14:43:10 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO namenode.NameNode: Caching file names occuring more than 10 times
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map cachedBlocks
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
$ cd /usr/local/hadoop/hadoop-2.6.0/sbin/
$ start-all.sh
$ jps
Basics of Hive
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook; later the Apache Software Foundation took it up and
developed it further as an open source project under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
Features of Hive
Architecture of Hive
The following component diagram depicts the architecture of Hive:
Operation
User Interface
Meta Store
Execution Engine
HDFS or HBASE
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step No.  Operation
1         Execute Query
          The Hive interface, such as the command line or Web UI, sends the query to the
          driver (any database driver such as JDBC, ODBC, etc.) to execute.
2         Get Plan
          The driver takes the help of the query compiler, which parses the query to
          check the syntax and the query plan or the requirements of the query.
3         Get Metadata
          The compiler sends a metadata request to the Metastore (any database).
4         Send Metadata
          The Metastore sends the metadata as a response to the compiler.
5         Send Plan
          The compiler checks the requirements and resends the plan to the
          driver. Up to here, the parsing and compiling of the query is complete.
6         Execute Plan
          The driver sends the execution plan to the execution engine.
7         Execute Job
          Internally, the execution of the job is a MapReduce job. The
          execution engine sends the job to the JobTracker, which is in the Name node,
          and it assigns this job to the TaskTracker, which is in the Data node. Here,
          the query executes the MapReduce job.
7.1       Metadata Ops
          Meanwhile, during execution, the execution engine can execute metadata
          operations with the Metastore.
8         Fetch Result
          The execution engine receives the results from the Data nodes.
9         Send Results
          The execution engine sends those resultant values to the driver.
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range
exceeds the range of INT, you need to use BIGINT and if the data range is smaller than the
INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:
Type        Postfix   Example
TINYINT     Y         10Y
SMALLINT    S         10S
INT         -         10
BIGINT      L         10L
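The rule for choosing between the integral types can be sketched in Python. The byte sizes used below (1, 2, 4 and 8 bytes for TINYINT, SMALLINT, INT and BIGINT) are the standard Hive sizes rather than something stated in this text, and the helper name smallest_integral_type is an assumption for the example.

```python
# (type name, literal postfix, size in bytes); sizes are the standard Hive sizes.
INTEGRAL_TYPES = [
    ("TINYINT", "Y", 1),
    ("SMALLINT", "S", 2),
    ("INT", "", 4),
    ("BIGINT", "L", 8),
]


def smallest_integral_type(value):
    """Return (type name, literal) for the smallest signed type that holds value."""
    for name, postfix, size in INTEGRAL_TYPES:
        bits = size * 8
        low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        if low <= value <= high:
            return name, f"{value}{postfix}"
    raise OverflowError(f"{value} does not fit in BIGINT")


if __name__ == "__main__":
    for number in (10, 200, 40000, 2 ** 31):
        print(number, "->", smallest_integral_type(number))
```

The literal in the second position uses the postfix from the table, so a value written as 10Y is read as a TINYINT while a plain 10 is read as an INT.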
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). It
contains two data types: VARCHAR and CHAR. Hive follows C-style escape characters.
The following table depicts various CHAR data types:
Data Type   Length
VARCHAR     1 to 65535
CHAR        255
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision. It supports the
java.sql.Timestamp format YYYY-MM-DD HH:MM:SS.fffffffff and the format yyyy-mm-dd
hh:mm:ss.fffffffff.
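To make the format concrete, the sketch below splits such a timestamp string into its whole-second part and its optional fraction of up to nine digits. It is only an illustration of the text format, not Hive's own parser; the function name parse_hive_timestamp is hypothetical, and the nanoseconds are returned separately because Python's datetime only stores microseconds.

```python
from datetime import datetime


def parse_hive_timestamp(text):
    """Split 'YYYY-MM-DD HH:MM:SS[.f...]' into (datetime, nanoseconds)."""
    base, _, fraction = text.partition(".")
    moment = datetime.strptime(base, "%Y-%m-%d %H:%M:%S")
    if fraction and (len(fraction) > 9 or not fraction.isdigit()):
        raise ValueError("fraction must be 1 to 9 digits")
    # Pad on the right so that '.5' means 500,000,000 nanoseconds.
    nanoseconds = int(fraction.ljust(9, "0")) if fraction else 0
    return moment, nanoseconds


if __name__ == "__main__":
    print(parse_hive_timestamp("2015-04-18 14:43:03.123456789"))
    print(parse_hive_timestamp("2015-04-18 14:43:03"))
```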
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for
representing immutable arbitrary-precision numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
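The meaning of precision (total number of digits) and scale (digits after the decimal point) can be illustrated with Python's decimal module. This is a sketch of the arithmetic only, with a hypothetical helper name fits_decimal; it does not reproduce how Hive itself rounds or rejects values.

```python
from decimal import Decimal


def fits_decimal(value, precision, scale):
    """True if value fits DECIMAL(precision, scale) without rounding or overflow.

    value should be a string such as "123.45" so that no binary float error
    is introduced before the check.
    """
    sign, digits, exponent = Decimal(value).as_tuple()
    if not isinstance(exponent, int):
        return False  # NaN or Infinity
    if not any(digits):
        return True  # zero fits any precision and scale
    fraction_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fraction_digits <= scale and integer_digits <= precision - scale


if __name__ == "__main__":
    print(fits_decimal("1234567890", 10, 0))
    print(fits_decimal("123.45", 10, 0))
    print(fits_decimal("123.45", 5, 2))
```

With decimal(10,0) as in the example above, a ten-digit whole number fits, while 123.45 does not because it needs two digits after the decimal point.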
Union Types
Union is a collection of heterogeneous data types. You can create an instance using
create_union. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
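In the sample values above, the number before the colon is a tag that says which member of the union the value belongs to, counting from 0 in the order the types are declared. A small Python sketch of that reading follows; the names UNION_TYPES and describe are assumptions for the example, and the one-entry dictionaries only imitate how the values are printed.

```python
# Member types in declaration order; the tag is the index into this list.
UNION_TYPES = ["int", "double", "array<string>", "struct<a:int,b:string>"]


def describe(tagged_value):
    """Given a one-entry dict {tag: value}, return (member type name, value)."""
    if len(tagged_value) != 1:
        raise ValueError("a union value holds exactly one tagged member")
    (tag, value), = tagged_value.items()
    if not 0 <= tag < len(UNION_TYPES):
        raise ValueError(f"tag {tag} is not a member of the union")
    return UNION_TYPES[tag], value


if __name__ == "__main__":
    for row in ({0: 1}, {1: 2.0}, {2: ["three", "four"]}, {3: {"a": 5, "b": "five"}}):
        print(describe(row))
```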
Literals
The following literals are used in Hive:
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data
type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Hive Installation
Installing HIVE:
Commands:
user@ubuntu:~$ cd /usr/lib/
user@ubuntu:~$ sudo mkdir hive
user@ubuntu:~$ cd ~/Downloads
user@ubuntu:~$ sudo mv apache-hive-0.13.0-bin /usr/lib/hive
# Set HIVE_HOME
export HIVE_HOME="/usr/lib/hive/apache-hive-0.13.0-bin"
PATH=$PATH:$HIVE_HOME/bin
export PATH
user@ubuntu:~$ cd /usr/lib/hive/apache-hive-0.13.0-bin/bin
user@ubuntu:~$ sudo gedit hive-config.sh
export HADOOP_HOME=/usr/local/hadoop
Command:
user@ubuntu:~$ hive
Creating a database:
Command:
hive> create database mydb;
Create a Table:
Command:
hive> create table john(id varchar(30),number varchar(30));
Display a Table:
Command:
hive> Select * from john;
hive>