Hive
Prashant Gupta
A data warehousing package built on
top of Hadoop.
Used for data analysis on structured
data.
Targeted towards users comfortable
with SQL.
Its query language is similar to SQL and is called HiveQL.
Abstracts the complexity of Hadoop.
No Java is required.
Features of Hive
How is it Different from SQL
The major difference is that a Hive query
executes on a Hadoop infrastructure rather
than a traditional database.
This allows Hive to handle huge data sets: data sets so large that high-end,
expensive, traditional databases would fail.
Internally, a Hive query executes as a series of
automatically generated MapReduce jobs.
Install Hive
To install Hive,
untar the release tarball: tar xvzf hive-0.13.0-bin.tar.gz
Hive configurations
Hive Modes
To start the Hive shell, type hive and
press Enter.
Hive in Local mode
No HDFS is required; all files are read from the local
file system.
hive> SET mapred.job.tracker=local;
Configuring Hive
Hive is configured using an XML configuration file called
hive-site.xml, located in Hive's conf directory.
Execution engines
Hive was originally written to use MapReduce as its execution
engine, and that is still the default.
We can use Apache Tez as its execution engine, and also work is
underway to support Spark, too. Both Tez and Spark are general
directed acyclic graph (DAG) engines that offer more flexibility
and higher performance than MapReduce.
It's easy to switch the execution engine on a per-query basis, so
you can see the effect of a different engine on a particular query.
Set Hive to use Tez: hive> SET hive.execution.engine=tez;
The execution engine is controlled by the hive.execution.engine
property, which defaults to mr (for MapReduce).
Hive Architecture
Components
Thrift Client
It is possible to interact with Hive from any
programming language that can use the Thrift server, e.g.
Python
Ruby
JDBC Driver
Hive provides a pure Java JDBC driver for Java applications
to connect to Hive, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver
ODBC Driver
An ODBC driver allows applications that support the ODBC
protocol to connect to Hive.
Components
Metastore
Configuring MySQL as the
metastore
Install MySQL Admin/Client
Create a hadoop user and grant permissions to the user:
mysql -u root -p
mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';
mysql> GRANT ALL ON *.* TO 'hadoop'@'localhost' WITH GRANT OPTION;
For short scripts, you can use the -e option to specify the commands
inline, in which case the final semicolon is not required.
hive -e 'SELECT * FROM dummy'
Ser-de
A SerDe is a combination of a Serializer and a
Deserializer (hence, Ser-De).
The Serializer takes a Java object that Hive
has been working with and turns it into something that
Hive can write to HDFS or another supported system.
Serializer is used when writing data, such as through an
INSERT-SELECT statement.
The Deserializer interface takes a string or binary
representation of a record, and translates it into a Java
object that Hive can manipulate.
Deserializer is used at query time to execute SELECT
statements.
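As an illustrative sketch, a SerDe can be specified at table creation time; this example uses the built-in OpenCSVSerde (available from Hive 0.14 onwards), and the table and column names are hypothetical:

```sql
-- Parse quoted CSV input with a built-in SerDe instead of plain delimited text
CREATE TABLE csv_table (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
```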
Hive Tables
A Hive table is logically made up of the data being stored in HDFS and the
associated metadata describing the layout of that data, held in the metastore (e.g. MySQL).
Managed Table
When you create a managed table in Hive and load data into it, the data is moved into
Hive's warehouse directory.
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
External Table
Alternatively, you may create an external table, which tells Hive to refer to the data
that is at an existing location outside the warehouse directory.
The location of the external data is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
When you drop an external table, Hive will leave the data untouched and
only delete the metadata.
Hive does not do any transformation while loading data into tables. Load
operations are currently pure copy/move operations that move data files
into locations corresponding to Hive tables.
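The managed/external difference shows up when a table is dropped; a sketch using the two tables created above:

```sql
-- Managed table: DROP removes both the metadata and the data
-- in the warehouse directory
DROP TABLE managed_table;

-- External table: DROP removes only the metadata;
-- the files under /user/tom/external_table are left untouched
DROP TABLE external_table;
```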
Storage Format
Text File
When you create a table with no ROW FORMAT or
STORED AS clauses, the default format is delimited
text with one row per line.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Storage Format
RC: Record Columnar File
The RC format was designed for clusters with
MapReduce in mind. It is a huge step up from
standard text files. It's a mature format with ways
to ingest data into the cluster without ETL. It is supported
in several Hadoop system components.
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
Storage Format
ORC: Optimized Row Columnar
File
The ORC format was introduced in Hive 0.11. As
the name implies, it is more optimized than the RC
format. If you want to maximize speed and compress
the data as much as possible, then ORC is best.
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
Practice Session
Create Table
Managed Table
CREATE TABLE Student (sno int, sname
string, year int) row format delimited fields
terminated by ',';
External Table
CREATE EXTERNAL TABLE Student (sno int,
sname string, year int) row format
delimited fields terminated by ',' LOCATION
'/user/external_table';
Table Commands
Insert Data
INSERT OVERWRITE TABLE target_table
SELECT col1, col2 FROM source; (to overwrite data)
INSERT INTO TABLE target_table
SELECT col1, col2 FROM source; (to append data)
Multitable insert
FROM source_table
INSERT OVERWRITE TABLE table1
SELECT col1, col2 WHERE condition1
INSERT OVERWRITE TABLE table2
SELECT col1, col2 WHERE condition2;
Database Commands
To list all databases:
SHOW DATABASES;
To use a database:
USE dbname;
Table Commands
To list all tables:
SHOW TABLES;
Using Group by
select year,count(sno) from Student_Year group by
year;
Table Commands
SubQueries
A subquery is a SELECT statement that is embedded in another
SQL statement.
Hive has limited support for subqueries, permitting a subquery in
the FROM clause of a SELECT statement, or in the WHERE clause
in certain cases.
The following query finds the average maximum temperature for
every year and weather station:
SELECT year, AVG(max_temperature)
FROM (
SELECT year, MAX(temperature) AS max_temperature
FROM records2
GROUP BY year
) mt
GROUP BY year;
Table Commands
Alter table
To add a column:
ALTER TABLE student ADD COLUMNS (year STRING);
To modify a column:
ALTER TABLE table_name CHANGE old_col_name new_col_name
new_data_type;
To drop a partition:
ALTER TABLE MyTable DROP PARTITION (age=17);
Drop table
DROP TABLE operatordetails;
View
A view is a sort of virtual table that is defined by a SELECT
statement.
Views may also be used to restrict users' access to particular
subsets of tables that they are authorized to see.
In Hive, a view is not materialized to disk when it is created;
rather, the view's SELECT statement is executed when a
statement that refers to the view is run.
Views are included in the output of the SHOW TABLES
command, and you can see more details about a particular
view, including the query used to define it, by issuing the
DESCRIBE EXTENDED view_name command.
Create Views
CREATE VIEW view_name (id,name) AS SELECT * from users;
Drop a view
Drop view viewName;
Joins
Only equality joins, outer joins, and
left semi joins are supported in Hive.
Hive does not support join conditions
that are not equality conditions, as it
is very difficult to express such
conditions as a MapReduce job. Also,
more than two tables can be joined
in Hive.
Example-Join
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM items;
2 Tie
4 Coat
3 Hat
1 Scarf
Table Commands
Using Join
One of the nice things about using Hive, rather than raw
MapReduce, is that Hive makes performing commonly used
operations very simple.
We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, items.* FROM sales JOIN items ON
(sales.id = items.id);
Table Commands
Outer joins
Outer joins allow you to find non-matches in
the tables being joined.
hive> SELECT sales.*, items.* FROM sales LEFT
OUTER JOIN items ON (sales.id = items.id);
hive> SELECT sales.*, items.* FROM sales
RIGHT OUTER JOIN items ON (sales.id =
items.id);
hive>SELECT sales.*, items.* FROM sales FULL
OUTER JOIN items ON (sales.id = items.id);
If all but one of the tables being joined are small, the join
can be performed as a map-only job.
The query does not need a reducer. For every mapper of a, b is
read completely. A restriction is that a FULL/RIGHT OUTER
JOIN b cannot be performed.
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN
b ON a.key = b.key;
Partitioning in Hive
Using partitions, you can make it faster to execute
queries on slices of the data.
A table can have one or more partition columns.
A separate data directory is created for each distinct
value combination in the partition columns.
Partitioning in Hive
Partitions are defined at table creation time
using the PARTITIONED BY clause.
Static Partition (Example-1)
CREATE TABLE student_partnew (name STRING,id int,marks
String) PARTITIONED BY (pyear STRING) row format
delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/home/notroot/std_2011.csv'
INTO TABLE student_partnew PARTITION (pyear='2011');
LOAD DATA LOCAL INPATH '/home/notroot/std_2012.csv'
INTO TABLE student_partnew PARTITION (pyear='2012');
LOAD DATA LOCAL INPATH '/home/notroot/std_2013.csv'
INTO TABLE student_partnew PARTITION (pyear='2013');
Partitioning in Hive
Static Partition (Example-2)
CREATE TABLE student_New (id int,name string,marks
int,year int) row format delimited fields terminated by ',';
LOAD DATA local INPATH
'/home/notroot/Sandeep/DataSamples/Student_new.csv'
INTO table Student_New;
CREATE TABLE student_part (id int, name string, marks int)
PARTITIONED BY (pyear STRING);
INSERT INTO TABLE student_part PARTITION (pyear='2012')
SELECT id, name, marks FROM student_new WHERE
year=2012;
SHOW Partition
SHOW PARTITIONS month_part;
Partitioning in Hive
Dynamic Partition
To enable dynamic partitions
set hive.exec.dynamic.partition=true;
(To enable dynamic partitions, by default it is false)
set hive.exec.dynamic.partition.mode=nonstrict;
(To allow a table to be partitioned on multiple columns
in Hive, we have to enable nonstrict mode)
set hive.exec.max.dynamic.partitions.pernode=300;
(The default value is 100; modify it according to the
number of partitions expected in your case)
set hive.exec.max.created.files=150000;
(The default value is 100000, but for larger tables the number of
files can exceed the default, so we may have to update it.)
Partitioning in Hive
CREATE TABLE Stage_oper_Month (oper_id string, Creation_Date string,
oper_name String, oper_age int, oper_dept String, oper_dept_id int,
opr_status string, EYEAR STRING, EMONTH STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA local INPATH
'/home/notroot/Sandeep/DataSamples/user_info.csv' INTO TABLE
Stage_oper_Month;
CREATE TABLE Fact_oper_Month (oper_id string, Creation_Date string,
oper_name String, oper_age int, oper_dept String, oper_dept_id int)
PARTITIONED BY (opr_status string, eyear STRING, eMONTH STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',';
FROM Stage_oper_Month INSERT OVERWRITE TABLE Fact_oper_Month
PARTITION (opr_status, eyear, eMONTH) SELECT oper_id, Creation_Date,
oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR,
EMONTH DISTRIBUTE BY opr_status, eyear, eMONTH;
(Select from partition table)
Select oper_id, oper_name, oper_dept from Fact_oper_Month where
eyear=2010 and emonth=1;
Bucketing
Bucketing Advantages
Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a
fraction of the data for testing and debugging purposes when the
original data sets are very huge.
As the data files are equal-sized parts, map-side joins will be
faster on bucketed tables than on non-bucketed tables. In a map-side join, a mapper processing a bucket of the left table
knows that the matching rows in the right table will be in its
corresponding bucket, so it only retrieves that bucket (which
is a small fraction of all the data stored in the right table).
Similar to partitioning, bucketed tables provide faster query
responses than non-bucketed tables.
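The sampling advantage can be sketched with the TABLESAMPLE clause; the table name here is illustrative and assumes a table clustered by oper_id into 10 buckets:

```sql
-- Read only bucket 1 of 10, i.e. roughly a tenth of the data,
-- instead of scanning the whole table
SELECT * FROM bucketed_tbl TABLESAMPLE(BUCKET 1 OUT OF 10 ON oper_id);
```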
Bucketing Example
We can create bucketed tables with the help of the CLUSTERED BY clause
and the optional SORTED BY clause in the CREATE TABLE statement, and the
DISTRIBUTE BY clause in the load statement.
CREATE TABLE Month_bucketed (oper_id string, Creation_Date string,
oper_name String, oper_age int, oper_dept String, oper_dept_id int,
opr_status string, eyear string, emonth string) CLUSTERED BY (oper_id)
SORTED BY (oper_id, Creation_Date) INTO 10 BUCKETS ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
Similar to partitioned tables, we cannot directly load bucketed
tables with the LOAD DATA (LOCAL) INPATH command; rather, we
need to use an INSERT OVERWRITE TABLE ... SELECT ... FROM clause
from another table to populate the bucketed tables.
INSERT OVERWRITE TABLE Month_bucketed SELECT oper_id,
Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id,
opr_status, EYEAR, EMONTH FROM stage_oper_month DISTRIBUTE BY
oper_id SORT BY oper_id, Creation_Date;
Hive UDF
A UDF is Java code that must satisfy the following two properties:
The UDF must implement at least one evaluate() method
The UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF
Sample UDF
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) {
      return null;
    }
    return new Text(s.toString().toLowerCase());
  }
}
hive> add jar my_jar.jar;
hive> create temporary function my_lower as 'com.example.hive.udf.Lower';
hive> select empid , my_lower(empname) from employee;
Hive UDAF
A UDAF works on multiple input rows and creates a single output
row. Aggregate functions include such functions as COUNT and
MAX.
An aggregate function is more difficult to write than a regular UDF.
A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF and
contain one or more nested static classes implementing
org.apache.hadoop.hive.ql.exec.UDAFEvaluator.
An evaluator must implement five methods:
init()
The init() method initializes the evaluator and resets its internal
state.
In MaximumIntUDAFEvaluator, we set the IntWritable object
holding the final result to null.
Hive UDAF
iterate()
The iterate() method is called every time there is a new value to be
aggregated. The evaluator should update its internal state with the
result of performing the aggregation. The arguments that iterate()
takes correspond to those in the Hive function from which it was called.
In this example, there is only one argument. The value is first checked
to see whether it is null, and if it is, it is ignored. Otherwise, the result
instance variable is set either to the value's integer value (if this is the first
value that has been seen) or to the larger of the current result and
the value (if one or more values have already been seen). We return true to
indicate that the input value was valid.
terminatePartial()
The terminatePartial() method is called when Hive wants a result for the
partial aggregation. The method must return an object that
encapsulates the state of the aggregation.
In this case, an IntWritable suffices because it encapsulates either the
maximum value seen or null if no values have been processed.
Hive UDAF
merge()
The merge() method is called when Hive decides to combine one
partial aggregation with another. The method takes a single object,
whose type must correspond to the return type of the
terminatePartial() method.
In this example, the merge() method can simply delegate to the
iterate() method because the partial aggregation is represented in the
same way as a value being aggregated. This is not generally the
case (we'll see a more general example later), and the method should
implement the logic to combine the evaluator's state with the state of
the partial aggregation.
terminate()
The terminate() method is called when the final result of the
aggregation is needed. The evaluator should return its state as a
value.
In this case, we return the result instance variable.
Hive UDAF
package com.hadoopbook.hive;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;
public class HiveUDAFSample extends UDAF {
  public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
    private IntWritable result;

    public void init() {
      result = null;
    }

    public boolean iterate(IntWritable value) {
      if (value == null) {
        return true;
      }
      if (result == null) {
        result = new IntWritable(value.get());
      } else {
        result.set(Math.max(result.get(), value.get()));
      }
      return true;
    }

    public IntWritable terminatePartial() {
      return result;
    }

    public boolean merge(IntWritable other) {
      return iterate(other);
    }

    public IntWritable terminate() {
      return result;
    }
  }
}
Hive UDAF
To use the UDAF in Hive:
hive> add jar my_jar.jar;
hive> CREATE TEMPORARY FUNCTION maximum
AS 'com.hadoopbook.hive.HiveUDAFSample';
hive>SELECT maximum(salary) FROM employee;
Performance Tuning
Partitioning Tables:
Hive partitioning is an effective method to improve
query performance on larger tables. Partitioning allows
you to store data in separate sub-directories under the
table location. It greatly helps queries that filter
on the partition key(s). Although the
selection of the partition key is always a sensitive decision,
it should always be a low-cardinality attribute; e.g. if your
data is associated with the time dimension, then date could
be a good partition key. Similarly, if the data has an
association with location, like a country or state, then
it's a good idea to have hierarchical partitions like
country/state.
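The country/state idea above can be sketched as follows; the table and column names are illustrative:

```sql
-- Hypothetical table with hierarchical partitioning by country and state
CREATE TABLE user_events (event_id STRING, event_time STRING)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- A query that filters on the partition keys reads only the
-- matching sub-directories instead of the whole table
SELECT event_id FROM user_events WHERE country='IN' AND state='KA';
```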
Performance Tuning
De-normalizing data:
Normalization is a standard process used to model
your data tables with certain rules to deal with
redundancy of data and anomalies. In simpler
words, if you normalize your data sets, you end up
creating multiple relational tables which can be
joined at run time to produce the results. Joins
are expensive and difficult operations to perform
and are one of the common reasons for performance
issues. Because of that, it's a good idea to avoid
highly normalized table structures, because they
require join queries to derive the desired metrics.
Performance Tuning
Compress map/reduce output:
Compression techniques significantly reduce the intermediate data
volume, which internally reduces the amount of data transfers
between mappers and reducers. All this generally occurs over the
network. Compression can be applied on the mapper and reducer
output individually. Keep in mind that gzip compressed files are not
splittable. That means this should be applied with caution. A
compressed file size should not be larger than a few hundred
megabytes. Otherwise it can potentially lead to an imbalanced job.
Other options of compression codec could be snappy, lzo, bzip, etc.
For map output compression,
set mapred.compress.map.output to true
For job output compression, set mapred.output.compress to true
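A sketch of these settings in a Hive session, using the pre-Hadoop-2 property names referenced in this deck and Snappy as an example codec:

```sql
-- Compress intermediate map output
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress final job output
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```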
Performance Tuning
Map join:
Map joins are really efficient if a table on
one side of the join is small enough to
fit in memory. Hive supports a
parameter, hive.auto.convert.join,
which, when set to true, tells Hive to
try to convert joins to map joins automatically. When
relying on this behavior, be sure auto
conversion is enabled in the Hive
environment.
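A minimal sketch of enabling automatic map-join conversion; the size threshold shown is the commonly cited default and should be verified against your Hive version:

```sql
-- Let Hive convert joins with a small table into map joins automatically
SET hive.auto.convert.join=true;
-- Tables smaller than this threshold (bytes) are treated as the small side
SET hive.mapjoin.smalltable.filesize=25000000;
```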
Performance Tuning
Bucketing:
Bucketing improves join performance if the bucket key and
join keys are common. Bucketing in Hive distributes the data
into different buckets based on the hash of the bucket
key. It also reduces the I/O scans during the join process if the
join is happening on the same keys (columns).
Additionally, it's important to ensure the bucketing flag is set
(SET hive.enforce.bucketing=true;) every time before
writing data to the bucketed table. To leverage bucketing
in the join operation, we should SET
hive.optimize.bucketmapjoin=true;. This setting hints to
Hive to do a bucket-level join during the map-stage join. It also
reduces the scan cycles to find a particular key, because
bucketing ensures that the key is present in a certain bucket.
Performance Tuning
Parallel execution:
Hive queries are internally translated into a
number of MapReduce jobs, but having
multiple MapReduce jobs is not enough;
the real advantage comes from their parallel
execution, and simply
writing a query does not achieve this.
SELECT table1.a FROM
table1 JOIN table2 ON (table1.a = table2.a)
JOIN table3 ON (table3.a = table1.a)
JOIN table4 ON (table4.b = table3.b);
Output: Execution time: 800 sec
Examining the execution plan for this query shows:
Total MapReduce jobs: 2,
launched and run serially.
Performance Tuning
Parallel execution:
To achieve this, the query can be rewritten to segregate it into
independent units which Hive can work upon
as independent MapReduce jobs running in
parallel.
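Hive also exposes settings that let independent stages of a single query run concurrently; a minimal sketch (the thread count shown is the commonly documented default):

```sql
-- Allow independent stages of a query to run concurrently
SET hive.exec.parallel=true;
-- Maximum number of stages to run in parallel
SET hive.exec.parallel.thread.number=8;
```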
Thank You
Questions?
Feedback?
explorehadoop@gmail.com