
Hive Performance Practical Guide

November 2013




TABLE OF CONTENTS

Abstract
Challenge for Developers
Test Environment
Prerequisite Knowledge
Performance Parameters
Query Optimization
Data Optimization
References
Author Info


Abstract
Hive is a data warehouse system and query language for Hadoop, an essential tool in the Hadoop ecosystem that provides a SQL dialect for querying data stored in the Hadoop Distributed Filesystem (HDFS). It is well suited to batch processing.
Most data warehouse applications are implemented using relational databases that use SQL as the query language. Hive lowers the barrier for moving these applications to Hadoop.

Challenge for Developers

Although Hive provides a SQL dialect for Hadoop, there are aspects that differ from other SQL-based environments, and documentation for Hive users and Hadoop developers has been sparse.
While working on a project for a client in media analytics, where Hive was used to process complex media logs, I read blogs, tutorials, and web posts on optimizing Hive queries for maximum performance. I found many options and recommendations scattered all over, with so many ifs and buts: some useful, some confusing, and some baseless.
Here I am trying to put all the information and knowledge gained from practical experience in improving Hive query performance in a single place. My intent is also to highlight good practices and parameter tuning that should be followed, and to clearly explain the considerations behind the trickier tuning parameters. This will help me, and developers like me, to refer to this document in the future while working with Hive.

Test Environment
This section describes the environments where the Hive performance tuning parameters and other query optimization techniques were tested. We tested our solution on two different platforms. (Yes, we were lucky to be able to do that.)
Test Platform 1
Microsoft HDInsight 1.6 (beta) with a 40-node cluster, where each node has 2 cores and 4 GB RAM. HDInsight 1.6 uses the Hortonworks distribution.
Test Platform 2
Amazon EC2: a 10-node cluster (Ubuntu 12.04 LTS 64-bit Server, m1.large: 2 cores, 7.5 GB RAM), running


Cloudera CDH4 (4.1.2) with Hadoop and Hive, set up using Cloudera Manager.

Prerequisite Knowledge
1. A good understanding of the MapReduce framework (what the output of a map task is, how map output is transferred to the reduce task, and the significance of the Partitioner).
2. An understanding of the Hadoop distributed cache.
3. An understanding of Hadoop performance tuning.
4. A moderate understanding of compression and storage techniques such as Snappy, LZO, and sequence files.

Performance Parameters
I have categorized the parameters into two groups:

DEFAULT PARAM: Should be used as part of good practice, without any special consideration.

TRICKY PARAM: Should be used on a case-by-case basis, and requires analysing your requirements, data size, data categorization, data manipulation, data transfer behavior, query complexity, and cluster size.

1. Map Join (DEFAULT PARAM)

set hive.auto.convert.join = true;

Enabling map join can significantly decrease your Hive query processing time, and you will experience a noticeable performance boost; I have seen one query that took 25 minutes to process brought down to 5 minutes. To enable map join, use this setting in your Hive shell or Hive script: "set hive.auto.convert.join = true". There is one catch with the map join: if all the tables are large enough to exceed the configured limit, the regular reduce-side join is used. Currently, the size of the small table participating in the join must be less than or equal to 25 MB. 25 MB is a very conservative number, and you can change it with "set hive.smalltable.filesize = 33554432" (please see https://cwiki.apache.org/Hive/joinoptimization.html). You should always use this setting in queries that perform a join; it has no negative impact, because Hive uses a map join only when the condition applies and otherwise falls back to the regular reduce-side join. A minimal sketch follows.
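As a hedged sketch (the table and column names page_view_log and dim_country are hypothetical, and hive.smalltable.filesize is the property name used by the Hive versions this paper targets), a join that Hive can auto-convert into a map join when the dimension table fits under the small-table threshold:

-- Let Hive convert eligible joins into map joins.
set hive.auto.convert.join = true;
-- Optionally raise the small-table threshold to 32 MB.
set hive.smalltable.filesize = 33554432;

-- dim_country is assumed small enough to be loaded into each mapper's
-- memory, so the join completes map-side with no shuffle of the big table.
SELECT l.page_url, c.country_name, COUNT(*) AS hits
FROM page_view_log l
JOIN dim_country c ON (l.country_code = c.country_code)
GROUP BY l.page_url, c.country_name;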


2. Bucketed Map Join (DEFAULT PARAM)

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

By default this optimization is disabled, and I suggest enabling it in your hive-site.xml like the other default optimization parameters. This setting improves join performance by joining individual buckets between tables in the map phase, because Hive does not need to fetch the entire contents of one table to match against each bucket of the other table. So, if the tables participating in the join are bucketed on the join fields, it will definitely improve performance. You will understand this better once you read about bucketing in the following section. I also suggest reading the Map Join topic on page 284 of Hadoop: The Definitive Guide, 3rd Edition.
The hive.optimize.bucketmapjoin.sortedmerge setting takes advantage of bucketed/clustered fields that are also sorted. (Sorting using the Cluster By clause, and the combination of Distribute By and Sort By, is explained in the sorting section below.)
(Please see also https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf?version=1&modificationDate=1309986642000)
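As a hedged sketch (table names orders and users are illustrative, and depending on your Hive version the sorted-merge variant may need additional input-format settings), two tables bucketed and sorted on the join key so that matching buckets can be joined in the map phase:

-- Both tables are bucketed (and sorted) on the join key; bucket counts
-- should be equal or multiples of each other for a bucketed map join.
CREATE TABLE orders (order_id BIGINT, user_id BIGINT, amount DOUBLE)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;

CREATE TABLE users (user_id BIGINT, user_type STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT u.user_type, SUM(o.amount) AS total_amount
FROM orders o
JOIN users u ON (o.user_id = u.user_id)
GROUP BY u.user_type;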

3. Intermediate Compression (DEFAULT PARAM)

set hive.exec.compress.intermediate = true;

Intermediate compression shrinks the data shuffled between the map and reduce tasks of a job. For it, you should select a codec with a low CPU cost rather than the greatest compression ratio. Most Hive jobs are I/O bound, and even for the CPU-bound ones, with the right choice of codec you will still benefit. I highly recommend that you enable this property; you can skip it for special cases in your dev environment where you generally test with a small data set.
Note: For a heavily CPU-bound job, disable this setting in that job's script only if you measure better performance with it off.
Tip: Use the Snappy compression codec. Some people prefer LZO because it is splittable, but for intermediate compression we do not have to worry about whether the codec supports splitting.

set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.intermediate = true;


For setting up Snappy: http://code.google.com/p/hadoop-snappy/
For using it with Hive: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Installation-Guide/cdh4ig_topic_23_5.html

4. Local Mode (DEFAULT PARAM)

set hive.exec.mode.local.auto = true;

This parameter should also be used in every Hive deployment. Launching an MR job across all nodes for a very small data set consumes a significant share of the overall job execution time. Hive can automatically leverage the lighter-weight local mode to perform all the tasks for the job on a single machine, and sometimes in the same process.
To achieve this, set the property below in your hive-site.xml:
<property>
  <name>hive.exec.mode.local.auto</name>
  <value>true</value>
  <description>
    Let Hive determine whether to run in local mode automatically
  </description>
</property>
(Please refer to Programming Hive, page 135, for details.)

5. Strict Mode (TRICKY PARAM)

set hive.mapred.mode = strict;

I have categorized this parameter as TRICKY for only one reason: it should not be used in production, only during the development and testing phases. It will help you write Hive queries in an optimized way that gives the best performance, and it prevents unintended and undesirable results.
Using this property restricts three types of queries:

Queries on partitioned tables are not permitted unless they include a partition filter in the WHERE clause. This is very useful: if you are not using a partition filter on partitioned tables, you are neglecting a huge performance boost, and if you think there could be a use case where you don't want to use a partition filter, I suggest you rethink your partitioning plan.

The second restriction is on queries that use an ORDER BY clause but no LIMIT clause. Generally, ORDER BY itself should be avoided as far as possible, because it sends all results to a single reducer to perform the ordering. A LIMIT clause prevents that reducer from running for an extended period of time.
Note: This restriction is mainly useful in the development environment; it is recommended that you turn this property off and remove the LIMIT from your query before you promote it to testing or production.

The third restriction prevents Cartesian products. This is very useful, because people coming from the RDBMS world may expect that a query that performs a JOIN not with an ON clause but with a WHERE clause will be optimized by the query planner, effectively converting the WHERE clause into an ON clause. Unfortunately, Hive does not perform this optimization, so a runaway query will occur if the tables are large.
(Please read http://stackoverflow.com/questions/587965/what-is-runaway-query to understand runaway queries.)
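As an illustrative sketch (the table page_view_log and its dt partition column are hypothetical), the kinds of queries strict mode rejects, alongside accepted forms:

set hive.mapred.mode = strict;

-- Rejected: no partition filter on a table partitioned by dt.
-- SELECT * FROM page_view_log;
-- Accepted: the partition filter prunes input directories.
SELECT * FROM page_view_log WHERE dt = '2013-11-01';

-- Rejected: ORDER BY without LIMIT funnels all rows to one reducer.
-- SELECT user_id FROM page_view_log WHERE dt = '2013-11-01' ORDER BY user_id;
-- Accepted:
SELECT user_id FROM page_view_log WHERE dt = '2013-11-01'
ORDER BY user_id LIMIT 100;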

6. Parallel Execution (TRICKY PARAM)

set hive.exec.parallel = true;

Hive converts a query into one or more stages. A stage could be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or another task Hive needs to perform. By default, Hive executes these stages one at a time. However, a particular job may consist of stages that are not dependent on each other and could be executed in parallel, possibly allowing the overall job to complete more quickly. (See Programming Hive, page 136, for the hive-site.xml property settings.)
The reason to categorize this as TRICKY is that it requires observation on a shared cluster, since running more stages in parallel increases cluster utilization. However, if you are sure you can use the full bandwidth of the cluster, I recommend that you enable parallel execution, for example as below.
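A small sketch; hive.exec.parallel.thread.number (how many stages may run concurrently) is an assumption you should verify against your Hive version:

set hive.exec.parallel = true;
-- Cap the number of stages executed concurrently so that one job
-- cannot monopolize a shared cluster.
set hive.exec.parallel.thread.number = 8;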

7. JVM Reuse (TRICKY PARAM)

This is quite a tricky parameter that requires careful, case-by-case observation of the following items.

Map/reduce slots available on the tasktracker.
There is no point in considering this parameter if you cannot have more than 2 map/reduce slots per node. Slots are generally decided based on the availability of cores or processors and RAM on the tasktracker node.
Note: YARN (Hadoop 2) does not support JVM reuse.

Execution time of map/reduce tasks.
It is useful to set this parameter so that more than one mapper and reducer task can reuse the same JVM when per-task execution time is less than about 40 seconds.

Careful monitoring of total job execution time before and after setting this parameter (keep a log of each execution to compare), and monitoring of the JVM heap.

Although this is a Hadoop tuning parameter, it is very relevant to Hive performance, especially where it is hard to avoid small files, and in scenarios with lots of tasks, most of which have short execution times as defined above.
As this is a Hadoop tuning parameter I am not going to cover it in detail, but since Hive depends on Hadoop, make sure your Hadoop cluster is fine-tuned first. A minimal sketch of the setting follows.
(Please see Hadoop: The Definitive Guide, 3rd Edition, page 219, for a better understanding of JVM reuse.)
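As a hedged sketch for MR1 clusters (this is a Hadoop 1 property passed through from the Hive session; as noted above, it has no effect on YARN):

-- Reuse each launched JVM for up to 10 tasks of the same job;
-- -1 would mean unlimited reuse. Worth it only when per-task
-- execution time is short (roughly under 40 seconds, as above).
set mapred.job.reuse.jvm.num.tasks = 10;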

8. Mapper and Reducer Number (TRICKY PARAM)

Tuning the number of map and reduce tasks launched for your job can play a significant role in performance improvement. This tricky setting requires careful observation of the following items.

Overall cluster map/reduce slot information.
This is your cluster bandwidth; make sure to utilize it completely, and always try to avoid resource underutilization.
Cluster map slots = max map tasks per node * number of tasktracker nodes in the cluster.
Cluster reduce slots = max reduce tasks per node * number of tasktracker nodes in the cluster.
The mapred-site.xml properties mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum can be used to set the slots available per node.
Example: Suppose a node has 4 cores and 6 GB RAM, one core is capable of handling 2 processes, and one map-reduce task needs 512 MB RAM. Reserving 1 core and 2 GB RAM for the datanode and tasktracker daemons (1 GB is set for each daemon by default) and 1 core and 1 GB RAM for other system processes leaves 2 cores and 3 GB RAM to determine the ideal slot count for that node: 4 task slots (2 cores * 2 processes = 4, and 4 * 512 MB < 3 GB).

Average number of queries that can run in parallel on your cluster.
This should be considered only for a shared cluster. If on a shared cluster you assume that the average number of queries launched simultaneously is 4, the formula below can be used to determine the ideal maximum number of reduce tasks allowed for one job, so as to avoid cluster underutilization and give each job fair execution time:
total cluster reduce slots * 1.5 / average number of parallel queries
e.g., 12 * 1.5 / 4 = 4.5, rounded to 5.

Number of map-reduce tasks launched for each map-reduce stage.
You can tweak the number of map-reduce tasks launched in order to achieve the best performance by utilizing maximum cluster resources.

Here I will explain how to manipulate the number of map-reduce tasks for a job while considering the items above.
Map Tasks
set mapred.max.split.size=<number>
The number of map tasks that can be launched simultaneously depends on the tasktracker map slots available within the cluster. The number of map tasks for one Hive execution stage is determined by the input data, the input splits, and the identified block size. To understand this better, suppose the data identified for a stage-1 query is 10 GB and no input split is explicitly defined; the block size (say 256 MB) then determines the total map tasks required: 40 (i.e., 10240/256 = 40). But if a maximum input split size (64 MB) is also defined, the minimum of the input split size and the block size is picked, and in that case the number of map tasks would be 160 (i.e., 10240/64 = 160).
Tips:
- Never modify the block size to tweak the number of map tasks; adjust the input split size instead, using the mapred.max.split.size property.
- Always try to avoid the situation where, after most mappers and reducers have been scheduled, one or two tasks remain running all alone; prevent it by increasing or decreasing the number of map-reduce tasks.


- Try increasing the number of map tasks if slots are underutilized, and monitor the result; if performance improves, add the parameter setting to your Hive job script.
I have seen a reduction of around 10 minutes in query execution time when we increased the number of map tasks to utilize the full cluster slots.
Reduce Tasks
set hive.exec.reducers.bytes.per.reducer=<number> - to change the average load per reducer.
set hive.exec.reducers.max=<number> - to limit the maximum number of reducers.
The number of reduce tasks that can be launched simultaneously again depends on the cluster reducer slots available. While monitoring your job's reducer count, if you see that cluster slots are underutilized, or that at the end very few reducers are running alone, use the properties defined above to control the number of reducers determined for your job. Also remember that on a shared cluster you can set the maximum number of reducers (hive.exec.reducers.max) for optimal cluster bandwidth utilization.
Tips:
- Always balance the number of map-reduce tasks against the total cluster slots; don't make your numbers too high or too low.
- In the urge to increase the number of map-reduce tasks, don't make your splits too small.
- Prepare a sheet and record the performance improvement per job as you tweak the number of map-reduce tasks, and put those settings in the job query itself. A short sketch of the relevant settings follows.
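As a hedged sketch (the numbers are illustrative; derive yours from the slot math above):

-- Target ~128 MB per input split, so 10 GB of input yields ~80 map tasks.
set mapred.max.split.size = 134217728;
-- Aim for ~1 GB of input per reducer on average.
set hive.exec.reducers.bytes.per.reducer = 1073741824;
-- Cap reducers per the shared-cluster formula: 12 * 1.5 / 4, rounded to 5.
set hive.exec.reducers.max = 5;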

Query Optimization
In this section we will see how to optimize Hive queries to get the best performance.

1. Partitioning
Partitions are directories.
Partitioning is not a new term and has quite obvious benefits. I highly recommend that you analyse your input data to identify any opportunity to partition it, so as to gain the following benefits:

Process only the relevant data, not the whole data set.
Hive puts partitioned data into separate directories. When you use a partition filter in your WHERE clause, input data is read from the specific directories instead of scanning the whole data set. This can significantly improve query performance, as less data to process means reduced query processing time.

The local mode setting can shine.
If your partitioned data is so small that it is completely available on one datanode, your local mode setting will shine.

Map join can shine.
If your partitioned data is small enough to fit within the size defined by hive.smalltable.filesize, your join operation will be optimized to use a map join.
(Please see page 58 of Programming Hive for more detail. You can also refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL and http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/)
Drawback: the only drawback of partitioning is fundamental: HDFS was designed for many millions of large files, not for billions of small files, so make sure to choose your partitioning scheme carefully. (Please see the Over Partitioning section of Programming Hive, page 122.)
Tips: In our use case, analysts wanted media logs analysed by state, year, month, day, user type, etc. User type had very few values (only 2), so partitioning on it was of no use, and partitioning on day would create lots of small files, so we created partitions on state, year, and month. As user type was also a very important filter for our use case, we used it for bucketing (explained later).
- An ideal partitioning scheme should not result in too many partitions and directories, and the files in each directory should be large, some multiple of the filesystem block size.
- A good strategy for time-range partitioning, for example, is to determine the approximate size of your data accumulation over different granularities of time, and start with the granularity that results in modest growth in the number of partitions over time.
- Use partitions for filters that select large, coherent subsets of the data. It is a good idea to partition on parameters that divide the overall data into subsets that require separate analysis most of the time.
- Consider these columns for partitioning: region, country, state, IP-address geolocation, department, etc. A short sketch follows.
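As a minimal sketch of the scheme from our use case (table and column names are illustrative), a table partitioned by state, year, and month, and a query whose partition filter prunes input directories:

CREATE TABLE media_log (user_id BIGINT, user_type STRING, event STRING)
PARTITIONED BY (state STRING, year INT, month INT);

-- Only the directories under state=CA/year=2013/month=11 are scanned;
-- strict mode would reject this query without the partition filter.
SELECT event, COUNT(*) AS events
FROM media_log
WHERE state = 'CA' AND year = 2013 AND month = 11
GROUP BY event;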


2. Bucketing
Buckets are files.
Partitions offer a convenient way to segregate data and to optimize queries. However, not all data sets lead to sensible partitioning, especially given the concerns raised earlier about appropriate sizing. Bucketing is another technique for decomposing data sets into more manageable parts, and bucketed queries can significantly improve query performance. (Please see Programming Hive, page 125, and http://archive.cloudera.com/cdh/3/hive/language_manual/working_with_bucketed_tables.html for usage information.)

Processing only the relevant data.
Hive puts bucketed data into separate files, determined by a hash function. When you run a bucketed query, input data is picked from the specific buckets rather than scanning the entire data set.

As noted in the partitioning section, the local mode and map join settings also shine with bucketed queries.

Drawback: Bucketing has the same potential problem described in the partitioning section: it multiplies the number of files managed by the namenode.
Tips: In the use case described in the partitioning section, we created buckets on user type (individual and household). As analysis was done mainly for these two user types, and partitioning was not suitable for the user type column (explained in the partitioning section), we bucketed on that column instead; see the sketch below.
- Remember that data is divided based on a hash function, unlike the simple value match used in partitioning.
- Know your hash function and figure out the specific bucket range required for your lookup query.
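As a hedged sketch continuing the same hypothetical table, bucketed on user_type (two buckets for individual and household), with a query that reads a single bucket:

CREATE TABLE media_log_b (user_id BIGINT, user_type STRING, event STRING)
PARTITIONED BY (state STRING, year INT, month INT)
CLUSTERED BY (user_type) INTO 2 BUCKETS;

-- Ensure inserts actually produce the declared number of bucket files.
set hive.enforce.bucketing = true;

-- TABLESAMPLE reads one bucket file instead of the whole partition.
SELECT COUNT(*)
FROM media_log_b TABLESAMPLE (BUCKET 1 OUT OF 2 ON user_type)
WHERE state = 'CA' AND year = 2013 AND month = 11;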

3. Indexing
The purpose of indexing in Hive is to improve the speed of query lookups on certain columns of a table, which is no different from the goal of partitioning and bucketing. Without an index, a query with a WHERE clause like WHERE col1 = 10 loads the entire table or partition and processes all the rows. But if an index exists for col1, only a portion of the file needs to be loaded and processed.
(Please see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing, https://cwiki.apache.org/confluence/display/Hive/IndexDev, and Programming Hive, page 117, for implementation details.)


Drawback: Hive has limited indexing capabilities, though you can provide a custom implementation. There are no keys in the usual RDBMS sense. The improvement in query speed that an index can provide comes at the cost of additional processing to create the index and disk space to store it.
Tip: Create indexes on column(s) with few distinct values.
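As a minimal sketch using Hive's built-in compact index handler (the index name is illustrative, reusing the hypothetical media_log table from above):

CREATE INDEX media_log_utype_idx
ON TABLE media_log (user_type)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Populate the index; rerun after loading new data.
ALTER INDEX media_log_utype_idx ON media_log REBUILD;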

4. Distribute By & Cluster By

The Distribute By and Cluster By clauses distribute your map output to the reducers based on the columns specified with them; think of a hash partitioner. Used carefully, these clauses can decrease your query processing time and give the following benefits:

They are very effective when you use the clustered fields in your Group By clause.

As the clustered fields ensure that data blocks are organized based on the hash values, a hash join becomes far more efficient in disk I/O and network bandwidth, because it can operate on large co-located blocks of data, improving JOIN operations.

Consider a case where an outer query depends on an inner query and further filters on column(s) from the inner query's output; if those column(s) are clustered, the query is optimized, as in the sketch below.
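As an illustrative sketch (again on the hypothetical media_log table), Cluster By hash-distributes rows on user_id, so each reducer receives all rows for its keys, sorted on user_id:

-- Rows with the same user_id go to the same reducer (hash partitioning)
-- and arrive sorted on user_id within that reducer.
SELECT user_id, event
FROM media_log
WHERE state = 'CA' AND year = 2013 AND month = 11
CLUSTER BY user_id;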

5. Sorting
Cluster By is a shortcut for Distribute By plus Sort By.
Sorting can be done using the Sort By clause, but you won't get a total ordering using Sort By alone (please see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy for detail).
I am not going to cover how sorting is used in HiveQL, but I would like to highlight one performance point. When the column(s) you distribute on and the column(s) you sort on are not exactly the same, prefer Distribute By together with Sort By rather than Cluster By, because Cluster By applies both Distribute By and Sort By to all the column(s) defined with it, and it is quite possible that you want to distribute and sort on different sets of column(s), as in the sketch below.
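As a hedged sketch (media_log_views and view_time are hypothetical names), distributing on one column while sorting on another, which Cluster By cannot express:

-- All rows for a user land on one reducer, but within each reducer
-- the rows are ordered by view_time rather than by the distribution key.
SELECT user_id, view_time
FROM media_log_views
DISTRIBUTE BY user_id
SORT BY view_time DESC;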


Data Optimization
Compression
Using Snappy compression as the default compression for my Hive data has always produced good results for me; I recommend that you use sequence files as your Hive table storage format.
Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
  page_url STRING, referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Drawback: One of Hive's unique features is that it does not force data to be converted to a specific format, and applying compression to all of your data will restrict interoperability.
Tip: Use overall compression with Hive internal tables, and be careful when applying compression to external tables.
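As a hedged sketch, the companion settings that make inserts into such a table write block-compressed Snappy sequence files (page_view_staging is a hypothetical source table):

set hive.exec.compress.output = true;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
-- BLOCK compression compresses runs of records together and gives a
-- much better ratio than per-record compression for sequence files.
set mapred.output.compression.type = BLOCK;

INSERT OVERWRITE TABLE page_view PARTITION (dt = '2013-11-01', country = 'US')
SELECT viewTime, userid, page_url, referrer_url, ip
FROM page_view_staging;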


References
Join Optimization
https://cwiki.apache.org/Hive/joinoptimization.html
https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf?version=1&modificationDate=1309986642000
Snappy Compression
http://code.google.com/p/hadoop-snappy/
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Installation-Guide/cdh4ig_topic_23_5.html
Runaway Query
http://stackoverflow.com/questions/587965/what-is-runaway-query
Hive Partitioning
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/
Hive Bucketing
http://archive.cloudera.com/cdh/3/hive/language_manual/working_with_bucketed_tables.html
Hive Indexing
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing
https://cwiki.apache.org/confluence/display/Hive/IndexDev
Hive Sorting
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

Programming Hive
by Edward Capriolo, Dean Wampler, and Jason Rutherglen
Chapter 10 (Tuning): Over Partitioning, page 122; Bucketing, page 125; Indexing, page 117

Hadoop: The Definitive Guide, Third Edition
by Tom White


Author Info
Sabir Hussain
Sr. Technical Architect, TFG-Analytics
Has 10+ years of experience in architectural design, development, and implementation of multi-tier web-based enterprise applications using Java/J2EE. Has been involved for the past 2+ years in big data analytics using Hadoop, Hive, HBase, Storm, Talend, Actuate BIRT, Tableau, Pentaho, MS HDInsight, and others.


Hello, I'm from HCL's Engineering and R&D Services. We enable technology-led organizations to go to market with innovative products and solutions. We partner with our customers in building world-class products and creating associated solution delivery ecosystems to help bring market leadership. We develop engineering products, solutions and platforms across Aerospace and Defense, Automotive, Consumer Electronics, Software, Online, Industrial Manufacturing, Medical Devices, Networking & Telecom, Office Automation, Semiconductor and Servers & Storage for our customers.
For more details contact eootb@hcl.com
Follow us on Twitter: http://twitter.com/hclers
Visit our blog: http://ers.hclblogs.com/
Visit our website: http://www.hcltech.com/engineering-services/

About HCL Technologies
HCL Technologies is a leading global IT services company, working with clients in the areas that impact and redefine the core of their businesses. Since its inception into the global landscape after its IPO in 1999, HCL has focused on transformational outsourcing, underlined by innovation and value creation, and offers an integrated portfolio of services including software-led IT solutions, remote infrastructure management, engineering and R&D services, and BPO. HCL leverages its extensive global offshore infrastructure and network of offices in 26 countries to provide holistic, multi-service delivery in key industry verticals including Financial Services, Manufacturing, Consumer Services, Public Services and Healthcare. HCL takes pride in its philosophy of 'Employees First, Customers Second', which empowers our 85,335 transformers to create real value for customers. HCL Technologies, along with its subsidiaries, reported consolidated revenues of US$ 4.3 billion (Rs. 22,417 crores), as of TTM ended Sep 30 '12.
For more information, please visit www.hcltech.com
About HCL Enterprise
HCL is a $6.2 billion leading global technology and IT enterprise comprising two companies listed in India: HCL Technologies and HCL Infosystems. Founded in 1976, HCL is one of India's original IT garage start-ups. A pioneer of modern computing, HCL is a global transformational enterprise today. Its range of offerings includes product engineering, custom & package applications, BPO, IT infrastructure services, IT hardware, systems integration, and distribution of information and communications technology (ICT) products across a wide range of focused industry verticals. The HCL team consists of over 90,000 professionals of diverse nationalities, who operate from 31 countries, including over 500 points of presence in India. HCL has partnerships with several leading Global 1000 firms, including leading IT and technology firms.
For more information, please visit www.hcl.com
