
Apache Hive
Prashant Gupta

HIVE

Hive is a data warehousing package built on top of Hadoop.
It is used for data analysis on structured data.
It is targeted at users who are comfortable with SQL.
Its query language, HiveQL, is similar to SQL.
It abstracts the complexity of Hadoop; no Java is required.

Features of Hive
How is it different from SQL?
The major difference is that a Hive query executes on a Hadoop infrastructure rather than on a traditional database.
This allows Hive to handle huge data sets: data sets so large that high-end, expensive, traditional databases would fail.
Internally, a Hive query is executed as a series of automatically generated MapReduce jobs.

When not to use Hive


Semi-structured or completely unstructured data.
Hive is not designed for online transaction processing; it is best for batch jobs over large data sets.
Latency for Hive queries is generally very high (minutes), even when data sets are very small (say, a few hundred megabytes).
It cannot be compared with systems such as Oracle, where analyses are conducted on significantly smaller amounts of data.

Install Hive
To install Hive, untar the .gz file:
tar xvzf hive-0.13.0-bin.tar.gz

Then initialize the environment variables by exporting the following:
export HADOOP_HOME=/home/usr/hadoop-0.20.2
(Specifies the location of the Hadoop installation directory.)
export HIVE_HOME=/home/usr/hive-0.13.0-bin
(Specifies the location of the Hive installation.)
export PATH=$PATH:$HIVE_HOME/bin

Hive configurations

Hive's default configuration is stored in the hive-default.xml file in the conf directory.

Hive comes configured to use Derby as the metastore.

Hive Modes
To start the Hive shell, type hive and press Enter.
Hive in local mode
No HDFS is required; all files run on the local file system.
hive> SET mapred.job.tracker=local;

Hive in MapReduce (Hadoop) mode
hive> SET mapred.job.tracker=master:9001;

Introducing data types


The primitive data types in Hive include integers, Boolean, floating point, Date, Timestamp, and Strings.
The table below lists the sizes of the data types:

Type      Size
--------  -------------------------------------------
TINYINT   1 byte
SMALLINT  2 bytes
INT       4 bytes
BIGINT    8 bytes
FLOAT     4 bytes (single-precision floating point)
DOUBLE    8 bytes (double-precision floating point)
BOOLEAN   TRUE/FALSE value
STRING    Max size is 2 GB

Complex data types: ARRAY, MAP, STRUCT
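As a minimal sketch of how the complex types appear in a table definition (the table name, columns, and delimiters below are illustrative, not from the original slides):

CREATE TABLE employee_complex (
  name    STRING,
  skills  ARRAY<STRING>,
  scores  MAP<STRING, INT>,
  address STRUCT<city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';
-- elements are accessed as skills[0], scores['math'], address.city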

Configuring Hive
Hive is configured using an XML configuration file called hive-site.xml, located in Hive's conf directory.
Execution engines
Hive was originally written to use MapReduce as its execution engine, and that is still the default.
We can use Apache Tez as the execution engine, and work is also underway to support Spark. Both Tez and Spark are general directed acyclic graph (DAG) engines that offer more flexibility and higher performance than MapReduce.
It is easy to switch the execution engine on a per-query basis, so you can see the effect of a different engine on a particular query.
Set Hive to use Tez: hive> SET hive.execution.engine=tez;
The execution engine is controlled by the hive.execution.engine property, which defaults to mr (for MapReduce).

Hive Architecture

Components
Thrift Client
It is possible to interact with Hive from any programming language that uses the Thrift server, for example:
Python
Ruby

JDBC Driver
Hive provides a pure Java JDBC driver for Java applications to connect to Hive, defined in the class org.hadoop.hive.jdbc.HiveDriver.

ODBC Driver
An ODBC driver allows applications that support the ODBC protocol to connect to Hive.

Components

Metastore

This is the central repository for Hive metadata.


By default, Hive is configured to use Derby as the metastore.
As a result of this configuration, a metastore_db directory is created in each working folder.
What are the problems with the default metastore?
Users cannot see the tables created by others if they do not use the same metastore_db.
Only one embedded Derby database can access the database files at any given point of time.
This results in only one open Hive session with a metastore; it is not possible to have multiple sessions with Derby as the metastore.
Solution
We can use a standalone database, either on the same machine or on a remote machine, as the metastore; any JDBC-compliant database can be used.

Configuring MySQL as metastore
Install MySQL Admin/Client.
Create a Hadoop user and grant permissions to the user:
mysql -u root -p
mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';
mysql> GRANT ALL ON *.* TO 'hadoop'@'localhost' WITH GRANT OPTION;

Modify the following properties in hive-site.xml to use MySQL instead of Derby. This creates a database in MySQL by the name Hive:
name  : javax.jdo.option.ConnectionURL
value : jdbc:mysql://localhost:3306/Hive?createDatabaseIfNotExist=true
name  : javax.jdo.option.ConnectionDriverName
value : com.mysql.jdbc.Driver
name  : javax.jdo.option.ConnectionUserName
value : hadoop
name  : javax.jdo.option.ConnectionPassword
value : hadoop

Hive Program Structure


The Hive Shell
The shell is the primary way we interact with Hive, by issuing commands in HiveQL.
HiveQL is heavily influenced by MySQL, so if you are familiar with MySQL, you should feel at home using Hive.
A command must be terminated with a semicolon to tell Hive to execute it.
HiveQL is generally case insensitive.
The Tab key will autocomplete Hive keywords and functions.

Hive can also run in non-interactive mode.

Use the -f option to run the commands in a specified file:
hive -f script.hql

For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:
hive -e 'SELECT * FROM dummy'

Ser-De
A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De).
The Serializer takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system.
The Serializer is used when writing data, such as through an INSERT-SELECT statement.
The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate.
The Deserializer is used at query time to execute SELECT statements.
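As a sketch of how a SerDe is attached to a table, the example below uses Hive's built-in RegexSerDe to parse a hypothetical two-field log file; the table name, columns, and regular expression are illustrative only:

CREATE TABLE access_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE;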

Hive Tables
A Hive table is logically made up of the data stored in HDFS and the associated metadata describing the layout of that data, which is kept in the metastore (for example, MySQL).
Managed Table
When you create a managed table in Hive and load data into it, the data is moved into Hive's warehouse directory.
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;

External Table
Alternatively, you may create an external table, which tells Hive to refer to data at an existing location outside the warehouse directory.
The location of the external data is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;

When you drop an external table, Hive leaves the data untouched and only deletes the metadata.
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
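To make the drop semantics concrete, a small sketch using the two tables created above:

-- Managed table: dropping deletes both the metadata and the data in the warehouse directory
DROP TABLE managed_table;
-- External table: dropping deletes only the metadata; the files under /user/tom/external_table remain
DROP TABLE external_table;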

Storage Format
Text File
When you create a table with no ROW FORMAT or STORED AS clauses, the default format is delimited text with one row per line.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

Storage Format
RC: Record Columnar File
The RC format was designed for clusters with MapReduce in mind. It is a huge step up over standard text files. It is a mature format with ways to ingest data into the cluster without ETL. It is supported in several Hadoop system components.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'

Storage Format
ORC: Optimized Row Columnar File
The ORC format is available from Hive 0.11 onwards. As the name implies, it is more optimized than the RC format. If you want to retain speed and compress the data as much as possible, then ORC is best.
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

Practice Session

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
or
hive> CREATE SCHEMA testdb;
SHOW DATABASES;
DROP SCHEMA userdb;

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Loading data
LOAD DATA [LOCAL] INPATH 'hdfs_file_or_directory_path'

Create Table
Managed Table
CREATE TABLE Student (sno int, sname string, year int) row format delimited fields terminated by ',';

External Table
CREATE EXTERNAL TABLE Student (sno int, sname string, year int) row format delimited fields terminated by ',' LOCATION '/user/external_table';

Load Data to table

To load local files into the Hive table location:
LOAD DATA LOCAL INPATH '/home/cloudera/SampleDataFile/student_marks.csv' INTO TABLE Student;
To load a file located in the HDFS file system into the Hive table location:
LOAD DATA INPATH '/user/cloudera/Student_Year.csv' INTO TABLE Student;

Table Commands

Insert Data
INSERT OVERWRITE TABLE targettable
SELECT col1, col2 FROM source; (to overwrite data)
INSERT INTO TABLE targettbl
SELECT col1, col2 FROM source; (to append data)

Multitable insert
FROM sourcetable
INSERT OVERWRITE TABLE table1
SELECT col1, col2 WHERE condition1
INSERT OVERWRITE TABLE table2
SELECT col1, col2 WHERE condition2;

Create table ... as select

CREATE TABLE table1 AS SELECT col1, col2 FROM source;

Create a new table with the same schema as an existing table:
CREATE TABLE newtable LIKE existingtable;

Database Commands
Display all created databases:
SHOW DATABASES;

Create a new database with default properties:
CREATE DATABASE DBName;

Create a database with a comment:
CREATE DATABASE DBName COMMENT 'holds backup data';

Use a database:
USE DBName;

View the database details:
DESCRIBE DATABASE EXTENDED DBName;

Table Commands
List all tables:
SHOW TABLES;

Display all contents of a table:
SELECT * FROM <table-name>;
SELECT * FROM Student_Year WHERE year = 2011;

Display header information along with the data:
SET hive.cli.print.header=true;

Using GROUP BY:
SELECT year, COUNT(sno) FROM Student_Year GROUP BY year;

Table Commands
Subqueries
A subquery is a SELECT statement that is embedded in another SQL statement.
Hive has limited support for subqueries, permitting a subquery in the FROM clause of a SELECT statement, or in the WHERE clause in certain cases.
The following query finds the average maximum temperature for every year:
SELECT year, AVG(max_temperature)
FROM (
  SELECT year, MAX(temperature) AS max_temperature
  FROM records2
  GROUP BY year
) mt
GROUP BY year;

Table Commands
Alter table
To add a column:
ALTER TABLE student ADD COLUMNS (Year string);

To modify a column:
ALTER TABLE table_name CHANGE old_col_name new_col_name new_data_type;

To rename a table:
ALTER TABLE Employee RENAME TO emp;

To drop a partition:
ALTER TABLE MyTable DROP PARTITION (age=17);

Drop a table:
DROP TABLE operatordetails;

Describe a table schema:
DESC Employee;
DESCRIBE EXTENDED Employee; -- displays detailed information

View
A view is a sort of virtual table that is defined by a SELECT statement.
Views may also be used to restrict users' access to particular subsets of tables that they are authorized to see.
In Hive, a view is not materialized to disk when it is created; rather, the view's SELECT statement is executed when a statement that refers to the view is run.
Views are included in the output of the SHOW TABLES command, and you can see more details about a particular view, including the query used to define it, by issuing the DESCRIBE EXTENDED view_name command.
Create a view:
CREATE VIEW view_name (id, name) AS SELECT * FROM users;
Drop a view:
DROP VIEW viewName;

Joins
Only equality joins, outer joins, and left semi joins are supported in Hive.
Hive does not support join conditions that are not equality conditions, as it is very difficult to express such conditions as a MapReduce job.
Also, more than two tables can be joined in Hive.
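Left semi joins are listed above but not shown in the later examples; as a minimal sketch using the sales and items tables from the next slide, a left semi join returns the rows of the left table that have a match in the right table (the right table may only appear in the ON clause, not in the SELECT list):

hive> SELECT sales.* FROM sales LEFT SEMI JOIN items ON (sales.id = items.id);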

Example-Join
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM items;
2 Tie
4 Coat
3 Hat
1 Scarf

Table Commands
Using Join
One of the nice things about using Hive, rather than raw MapReduce, is that Hive makes performing commonly used operations very simple.
We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, items.* FROM sales JOIN items ON (sales.id = items.id);

hive> SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

You can see how many MapReduce jobs Hive will use for any particular query by prefixing it with the EXPLAIN keyword.
For even more detail, prefix the query with EXPLAIN EXTENDED.
EXPLAIN SELECT sales.*, items.* FROM sales JOIN items ON (sales.id = items.id);

Table Commands
Outer joins
Outer joins allow you to find non-matches in the tables being joined.
hive> SELECT sales.*, items.* FROM sales LEFT OUTER JOIN items ON (sales.id = items.id);
hive> SELECT sales.*, items.* FROM sales RIGHT OUTER JOIN items ON (sales.id = items.id);
hive> SELECT sales.*, items.* FROM sales FULL OUTER JOIN items ON (sales.id = items.id);

Map Side Join

If all but one of the tables being joined are small, the join can be performed as a map-only job.
The query does not need a reducer. For every mapper of a, b is read completely. A restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;

Partitioning in Hive
Using partitions, you can make it faster to execute queries on slices of the data.
A table can have one or more partition columns.
A separate data directory is created for each distinct value combination in the partition columns.

Partitioning in Hive
Partitions are defined at the time of creating a table; the PARTITIONED BY clause is used to create partitions.
Static Partition (Example 1)
CREATE TABLE student_partnew (name STRING, id int, marks String) PARTITIONED BY (pyear STRING) row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/home/notroot/std_2011.csv' INTO TABLE student_partnew PARTITION (pyear='2011');
LOAD DATA LOCAL INPATH '/home/notroot/std_2012.csv' INTO TABLE student_partnew PARTITION (pyear='2012');
LOAD DATA LOCAL INPATH '/home/notroot/std_2013.csv' INTO TABLE student_partnew PARTITION (pyear='2013');

Partitioning in Hive
Static Partition (Example 2)
CREATE TABLE student_New (id int, name string, marks int, year int) row format delimited fields terminated by ',';
LOAD DATA local INPATH '/home/notroot/Sandeep/DataSamples/Student_new.csv' INTO TABLE Student_New;
CREATE TABLE student_part (id int, name string, marks int) PARTITIONED BY (pyear STRING);
INSERT INTO TABLE student_part PARTITION (pyear='2012') SELECT id, name, marks FROM student_new WHERE year=2012;
Show partitions:
SHOW PARTITIONS student_part;

Partitioning in Hive
Dynamic Partition
To enable dynamic partitions:
set hive.exec.dynamic.partition=true;
(Enables dynamic partitions; by default it is false.)

set hive.exec.dynamic.partition.mode=nonstrict;
(To allow all partition columns to be determined dynamically, without any static partition column, we have to enable nonstrict mode.)

set hive.exec.max.dynamic.partitions.pernode=300;
(The default value is 100; we have to adjust it according to the number of partitions expected in your case.)

hive.exec.max.created.files=150000
(The default value is 100000, but for larger tables the job can exceed the default, so we may have to increase it.)

Partitioning in Hive
CREATE TABLE Stage_oper_Month (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status string, EYEAR STRING, EMONTH STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA local INPATH '/home/notroot/Sandeep/DataSamples/user_info.csv' INTO TABLE Stage_oper_Month;
CREATE TABLE Fact_oper_Month (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int) PARTITIONED BY (opr_status string, eyear STRING, eMONTH STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
FROM Stage_oper_Month INSERT OVERWRITE TABLE Fact_oper_Month PARTITION (opr_status, eyear, eMONTH) SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH DISTRIBUTE BY opr_status, eyear, eMONTH;
(Select from the partitioned table)
SELECT oper_id, oper_name, oper_dept FROM Fact_oper_Month WHERE eyear=2010 AND emonth=1;

Bucketing Features

Partitioning gives effective results when there are a limited number of partitions of comparatively equal size.
To overcome the problems of partitioning, Hive provides the bucketing concept, another technique for decomposing table data sets into more manageable parts.
The bucketing concept is based on (hash function on the bucketed column) mod (total number of buckets).
Use the CLUSTERED BY clause to divide the table into buckets.
Bucketing can be done along with partitioning on Hive tables, and even without partitioning.
Bucketed tables create almost equally distributed data file parts.
To populate a bucketed table, we need to set the property:
set hive.enforce.bucketing = true;

Bucketing Advantages
Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a fraction of the data for testing and debugging purposes when the original data sets are very huge.
As the data files are equal-sized parts, map-side joins are faster on bucketed tables than on non-bucketed tables. In a map-side join, a mapper processing a bucket of the left table knows that the matching rows in the right table will be in its corresponding bucket, so it only retrieves that bucket (which is a small fraction of all the data stored in the right table).
Similar to partitioning, bucketed tables provide faster query responses than non-bucketed tables.

Bucketing Example
We can create bucketed tables with the help of the CLUSTERED BY clause and an optional SORTED BY clause in the CREATE TABLE statement, and the DISTRIBUTE BY clause in the insert statement.
CREATE TABLE Month_bucketed (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status string, eyear string, emonth string) CLUSTERED BY (oper_id) SORTED BY (oper_id, Creation_Date) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Similar to partitioned tables, we cannot directly load bucketed tables with the LOAD DATA (LOCAL) INPATH command; rather, we need to use an INSERT OVERWRITE TABLE ... SELECT ... FROM clause from another table to populate the bucketed table.
INSERT OVERWRITE TABLE Month_bucketed SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH FROM stage_oper_month DISTRIBUTE BY oper_id SORT BY oper_id, Creation_Date;

Partitioning with Bucketing

CREATE TABLE Month_Part_bucketed (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int) PARTITIONED BY (opr_status string, eyear STRING, eMONTH STRING) CLUSTERED BY (oper_id) SORTED BY (oper_id, Creation_Date) INTO 12 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
FROM Stage_oper_Month stg INSERT OVERWRITE TABLE Month_Part_bucketed PARTITION (opr_status, eyear, eMONTH) SELECT stg.oper_id, stg.Creation_Date, stg.oper_name, stg.oper_age, stg.oper_dept, stg.oper_dept_id, stg.opr_status, stg.EYEAR, stg.EMONTH DISTRIBUTE BY opr_status, eyear, eMONTH;
Note: Unlike partition columns (which are not included in the table column definitions), bucketed columns are included in the table definition, as shown in the code above for the oper_id and creation_date columns.

Table Sampling in Hive

Table sampling in Hive is nothing but extracting a small fraction of data from the original large data set. It is similar to the LIMIT operator in Hive.
Difference between LIMIT and TABLESAMPLE in Hive:
In many cases a LIMIT clause executes the entire query and then only returns limited results.
But sampling will only select a portion of the data to perform the query on.

To see the performance difference between bucketed and non-bucketed tables:
Query 1: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM month_bucketed TABLESAMPLE (BUCKET 12 OUT OF 12 ON oper_id);
Query 2: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM stage_oper_month LIMIT 18;

Note: Query 1 should always perform faster than Query 2.

To perform random sampling with Hive:
SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM month_bucketed TABLESAMPLE (1 PERCENT);

Hive UDF
A UDF is Java code that must satisfy the following two properties:
A UDF must implement at least one evaluate() method.
A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
Sample UDF:
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) {
      return null;
    }
    return new Text(s.toString().toLowerCase());
  }
}

hive> add jar my_jar.jar;
hive> create temporary function my_lower as 'com.example.hive.udf.Lower';
hive> select empid, my_lower(empname) from employee;

Hive UDAF
A UDAF works on multiple input rows and creates a single output row. Aggregate functions include functions such as COUNT and MAX.
An aggregate function is more difficult to write than a regular UDF.
A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF and contain one or more nested static classes implementing org.apache.hadoop.hive.ql.exec.UDAFEvaluator.
An evaluator must implement five methods:
init()
The init() method initializes the evaluator and resets its internal state.
In MaximumIntUDAFEvaluator, we set the IntWritable object holding the final result to null.

Hive UDAF
iterate()
The iterate() method is called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of performing the aggregation. The arguments that iterate() takes correspond to those in the Hive function from which it was called.
In this example, there is only one argument. The value is first checked to see whether it is null, and if it is, it is ignored. Otherwise, the result instance variable is set either to the value's integer value (if this is the first value that has been seen) or to the larger of the current result and the value (if one or more values have already been seen). We return true to indicate that the input value was valid.
terminatePartial()
The terminatePartial() method is called when Hive wants a result for the partial aggregation. The method must return an object that encapsulates the state of the aggregation.
In this case, an IntWritable suffices, because it encapsulates either the maximum value seen or null if no values have been processed.

Hive UDAF
merge()
The merge() method is called when Hive decides to combine one partial aggregation with another. The method takes a single object whose type must correspond to the return type of the terminatePartial() method.
In this example, the merge() method can simply delegate to the iterate() method, because the partial aggregation is represented in the same way as a value being aggregated. This is not generally the case, and the method should implement the logic to combine the evaluator's state with the state of the partial aggregation.

terminate()
The terminate() method is called when the final result of the aggregation is needed. The evaluator should return its state as a value.
In this case, we return the result instance variable.

Hive UDAF
package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class HiveUDAFSample extends UDAF {
  public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
    private IntWritable result;

    public void init() {
      result = null;
    }

    public boolean iterate(IntWritable value) {
      if (value == null) {
        return true;
      }
      if (result == null) {
        result = new IntWritable(value.get());
      } else {
        result.set(Math.max(result.get(), value.get()));
      }
      return true;
    }

    public IntWritable terminatePartial() {
      return result;
    }

    public boolean merge(IntWritable other) {
      return iterate(other);
    }

    public IntWritable terminate() {
      return result;
    }
  }
}

Hive UDAF
To use the UDAF in Hive:
hive> add jar my_jar.jar;
hive> CREATE TEMPORARY FUNCTION maximum AS 'com.hadoopbook.hive.HiveUDAFSample';
hive> SELECT maximum(salary) FROM employee;

Performance Tuning
Partitioning Tables:
Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows you to store data in separate sub-directories under the table location. It greatly helps queries that filter on the partition key(s). Although the selection of the partition key is always a sensitive decision, it should always be a low-cardinality attribute; e.g. if your data is associated with the time dimension, then date could be a good partition key. Similarly, if data has an association with location, like a country or state, then it is a good idea to have hierarchical partitions like country/state, as sketched below.
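A minimal sketch of such a hierarchical partition layout; the table and column names below are hypothetical:

CREATE TABLE customer_events (event_id STRING, event_time STRING, amount DOUBLE)
PARTITIONED BY (country STRING, state STRING)  -- directories like .../country=IN/state=KA/
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/tmp/events_in_ka.csv'
INTO TABLE customer_events PARTITION (country='IN', state='KA');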

Performance Tuning
De-normalizing data:
Normalization is a standard process used to model your data tables with certain rules to deal with redundancy of data and anomalies. In simpler words, if you normalize your data sets, you end up creating multiple relational tables which can be joined at run time to produce the results. Joins are expensive and difficult operations to perform and are one of the common reasons for performance issues. Because of that, it is a good idea to avoid highly normalized table structures, because they require join queries to derive the desired metrics.

Performance Tuning
Compress map/reduce output:
Compression techniques significantly reduce the intermediate data volume, which internally reduces the amount of data transferred between mappers and reducers. All this generally occurs over the network. Compression can be applied to the mapper and reducer output individually. Keep in mind that gzip-compressed files are not splittable, so this should be applied with caution; a compressed file should not be larger than a few hundred megabytes, otherwise it can potentially lead to an imbalanced job. Other compression codec options could be Snappy, LZO, bzip2, etc.
For map output compression, set mapred.compress.map.output to true.
For job output compression, set mapred.output.compress to true.
A sketch of these settings follows.
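As a sketch in the Hive shell (the Snappy codec is chosen only as an illustration; any installed codec class can be substituted):

-- compress intermediate map output
set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- compress final job output
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;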

Performance Tuning
Map join:
Map joins are really efficient if a table on the other side of a join is small enough to fit in memory. Hive supports a parameter, hive.auto.convert.join, which when set to true suggests that Hive try to perform a map join automatically. When using this parameter, be sure auto convert is enabled in the Hive environment, as shown below.
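A minimal sketch of enabling automatic map joins; the small-table threshold property shown alongside is an assumption about a commonly used companion setting, not something stated on the slide:

set hive.auto.convert.join=true;
-- assumed companion setting: tables below this size (in bytes) are candidates for map join
set hive.mapjoin.smalltable.filesize=25000000;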

Performance Tuning
Bucketing:
Bucketing improves join performance if the bucket key and join keys are common. Bucketing in Hive distributes the data into different buckets based on the hash of the bucket key. It also reduces the I/O scans during the join process if the join is happening on the same keys (columns). Additionally, it is important to ensure the bucketing flag is set (SET hive.enforce.bucketing=true;) every time before writing data to the bucketed table. To leverage bucketing in the join operation, we should SET hive.optimize.bucketmapjoin=true. This setting hints to Hive to do a bucket-level join during the map-stage join. It also reduces the scan cycles to find a particular key, because bucketing ensures that the key is present in a certain bucket. A sketch follows.
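A minimal sketch of a bucket map join, assuming a second hypothetical table dept_bucketed that is bucketed on the same key (oper_id) into a compatible number of buckets:

set hive.enforce.bucketing=true;
set hive.optimize.bucketmapjoin=true;
SELECT /*+ MAPJOIN(d) */ m.oper_id, m.oper_name, d.oper_dept
FROM Month_bucketed m JOIN dept_bucketed d ON (m.oper_id = d.oper_id);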

Performance Tuning
Parallel execution:
Hive queries are internally translated into a number of MapReduce jobs, but having multiple MapReduce jobs is not enough; the real advantage comes from their parallel execution, and simply writing a query does not achieve this.
SELECT table1.a FROM table1
JOIN table2 ON (table1.a = table2.a)
JOIN table3 ON (table3.a = table1.a)
JOIN table4 ON (table4.b = table3.b);
Output: Execution time: 800 sec
Checking the execution plan for this query, we observe:
Total MapReduce jobs: 2.
Serially launched and run.

Performance Tuning
Parallel execution:

To achieve this, we rewrote the query to segregate it into independent units which Hive could run as independent MapReduce jobs in parallel. The following is what we did to our query:

SELECT r1.a FROM
(SELECT table1.a FROM table1 JOIN table2 ON table1.a = table2.a) r1
JOIN
(SELECT table3.a FROM table3 JOIN table4 ON table3.b = table4.b) r2
ON (r1.a = r2.a);

Output: Same results, but execution time: 464 sec.
Observations:
Total MapReduce jobs: 5.
Jobs are launched and run in parallel.
Decrease in query execution time (around 50% in our case).

Thank You
Question?
Feedback?

explorehadoop@gmail.com
