
Hadoop Ecosystem: HIVE

What is Hive?
Hive provides a SQL dialect, called Hive Query Language (abbreviated HiveQL or just
HQL), for querying data stored in a Hadoop cluster. Hive translates most
queries to MapReduce jobs, thereby exploiting the scalability of Hadoop while
presenting a familiar SQL abstraction.
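
As a minimal sketch (the table and columns here are hypothetical), an ordinary-looking
aggregation query like the following is compiled into one or more MapReduce jobs behind the scenes:

-- Hypothetical table; Hive compiles this GROUP BY into a MapReduce job
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;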

Where is Hive well suited?


Hive is best suited for data warehouse applications, where relatively static
data is analysed, fast response times are not required, and the data is
not changing rapidly.
It works well where a large data set is maintained and mined for insights, reports, etc.
Hive makes it easier for developers to port SQL-based applications to Hadoop.

Limitations of Hive
Hive is not a full database. The biggest limitation is that Hive does not provide
record-level updates, inserts, or deletes.
Because Hadoop is a batch-oriented system, Hive queries have higher latency,
due to the start-up overhead of MapReduce jobs. Finally, Hive does not
provide transactions.
Hive doesn't provide the crucial features required for OLTP. If you need OLTP
features for large-scale data, you should consider using a NoSQL database.
Examples include HBase, a NoSQL database integrated with Hadoop.

Does HiveQL conform to the ANSI SQL standard?

HiveQL does not conform to the ANSI SQL standard, and it differs in various
ways from the familiar SQL dialects provided by Oracle, MySQL, and SQL
Server. (However, it is closest to MySQL's dialect of SQL.)

Hive architecture

CLI -> command-line interface for interacting with Hive

GUI -> Karmasphere / Cloudera Hue / Qubole / others
HWI -> Hive web interface, provides remote access to Hive:
export ANT_LIB=/opt/ant/lib
bin/hive --service hwi
Programmatic access -> JDBC, ODBC

Thrift server -> provides remote access from other processes.
If you want to run Hive queries programmatically, Thrift is the solution.

Driver -> compiles the input, optimizes the computation required, and executes
the required steps, usually with MapReduce jobs.

Metastore -> a separate relational database (usually a MySQL instance) where
Hive persists table schemas and other system metadata (such as partition
information). By default, Hive uses the built-in Derby SQL server, which
provides limited, single-process storage (you cannot run two simultaneous
instances of the Hive CLI).
How does internal processing happen in Hive?
When MapReduce jobs are required, Hive doesn't generate Java MapReduce
programs. Instead, it uses built-in, generic Mapper and Reducer modules that
are driven by an XML file representing the job plan. In other words, these
generic modules function like mini language interpreters, and the language used
to drive the computation is encoded in XML.

How is Pig different from Hive?

Pig was developed at Yahoo! at about the same time Facebook was developing
Hive.
In scenarios where a sequence of intermediate (temporary) tables would be
required to manage the complexity of a query, Pig is useful.
Pig is described as a data flow language, rather than a query language.
Pig is often used as part of ETL processes to ingest external data into a
Hadoop cluster and transform it into a more desirable form.
Pig is less suitable for porting over SQL applications, and experienced SQL users
will have a larger learning curve with Pig.
Hadoop teams often use a combination of Hive and Pig, selecting the appropriate
tool for particular jobs.
How is HBase different from Hive?
What if you need the database features that Hive doesn't provide, like row-
level updates, rapid query response times, and transactions?
HBase is inspired by Google's Bigtable, although it doesn't implement all
Bigtable features.
HBase doesn't provide a query language like SQL, but Hive is now integrated
with HBase.
HBase uses column-oriented storage, where columns can be organized
into column families.
HBase also keeps a configurable number of versions of each column's values
(marked by timestamps).
HBase also uses in-memory caching of data and local files for the append log of
updates. Periodically, the durable files are updated with all the append log
updates, etc.
What is the metastore service?
All Hive installations require a metastore service, which Hive uses to store
table schemas and other metadata. It is typically implemented using tables in a
relational database. By default, Hive uses a built-in Derby SQL server, which
provides limited, single-process storage. For example, when using Derby, you
can't run two simultaneous instances of the Hive CLI. However, this is fine for
learning Hive on a personal machine and for some developer tasks. For clusters,
MySQL or a similar relational database is required.

Derby vs. MySQL

If you are running with the default Derby database for the metastore, you'll
notice that your current working directory now contains a new subdirectory
called metastore_db, created by Derby during the short Hive session you just
executed. Derby supports only one instance; for clusters, MySQL or a
similar relational database is required.
What is the use of HCatalog?
HCatalog can be used to share data structures with external systems. HCatalog
gives users of other tools on Hadoop access to the Hive metastore, so that
they can read and write data in Hive's data warehouse.

HIVE Practical
Hive location of table data and metastore
vi $HIVE_HOME/conf/hive-site.xml
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.metastore.local=true
javax.jdo.option.ConnectionURL= jdbc:derby:;databaseName=
/home/adminuser/vikas/hive_test/hive/metastore_db;create=true
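
Note that hive-site.xml is an XML file; the key=value lines above are shorthand.
As a minimal sketch, each setting actually appears as a <property> element, for example:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>

The same element form applies to the javax.jdo.option.* connection properties shown later.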

Define a user-specific warehouse directory


vi $HOME/.hiverc

ADD JAR /path/to/custom_hive_extensions.jar;


set hive.metastore.warehouse.dir=/home/adminuser/vikas/hive_test/hive/warehouse;
set hive.cli.print.current.db=true;
set hive.exec.mode.local.auto=true;
set hive.cli.print.header=true;

Hive stores the last 10,000 lines of command history in $HOME/.hivehistory.

If you want to change the warehouse path, set it in hive-site.xml and then grant
permissions on that directory if it is on the local filesystem:

sudo chown -R user /Path
sudo chmod -R 777 /Path

If the given path is on HDFS, then stop and start the Hadoop services:

stop-all.sh
start-all.sh

MySQL configuration for the metastore

vi $HIVE_HOME/conf/hive-site.xml
javax.jdo.option.ConnectionURL=
jdbc:mysql://db1.mydomain.pvt/hive_db?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName= com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName= database_user
javax.jdo.option.ConnectionPassword= database_pass

Download the MySQL Connector/J driver and place it in $HIVE_HOME/lib

Hive CLI option


$ hive --help --service cli
usage: hive
-d,--define <key=value> Variable substitution to apply to hive
commands. e.g. -d A=B or --define A=B
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
-h <hostname> connecting to Hive Server on remote host
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable substitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-p <port> connecting to Hive Server on port number
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)

Hive namespaces for variables and properties

Namespace   Access       Description
hivevar     Read/Write   (v0.8.0 and later) User-defined custom variables.
hiveconf    Read/Write   Hive-specific configuration properties.
system      Read/Write   Configuration properties defined by Java.
env         Read only    Environment variables.

$ hive --define foo=bar


hive> set foo;
foo=bar
hive> set hivevar:foo;
hivevar:foo=bar
hive> set hivevar:foo=bar2;
hive> set foo;
foo=bar2
hive> create table toss1(i int, ${hivevar:foo} string);
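
Variable references are substituted textually before the statement runs, so
given the assignments above, the CREATE TABLE statement expands to:

-- What Hive actually executes after substitution
create table toss1(i int, bar2 string);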

$ hive --hiveconf hive.cli.print.current.db=true


hive> set hiveconf:hive.cli.print.current.db;
hive> set hiveconf:hive.cli.print.current.db=true;
$ hive --hiveconf y=5
hive> set y;

hive> set system:user.name;


system:user.name=myusername
hive> set system:user.name=yourusername;
hive> set system:user.name;
system:user.name=yourusername

hive> set env:HOME;


hive> set;
$ YEAR=2012 hive -e "SELECT * FROM mytable WHERE year = ${env:YEAR}";

Hive one-line commands

$ hive -e "SELECT * FROM mytable LIMIT 3";


$ hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery
$ hive -S -e "set" | grep warehouse

$ hive -e "CREATE TABLE src(s STRING)";


$ echo "one row" > /tmp/myfile
$ hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src;

Executing Hive Queries from Files

$ hive -f /path/to/file/withqueries.hql

hive> source /path/to/file/withqueries.hql;

Word count example using Hive


CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Or,

-- 1) Hive queries for Word Count


drop table if exists doc;

-- 2) create table to load whole file


create table doc(
text string
) row format delimited fields terminated by '\n' stored as textfile;

--3) loads plain text file


load data local inpath '/home/adminuser/vikas/new.txt' overwrite into table doc;

-- 4) wordCount in single line


SELECT word, COUNT(*) FROM doc LATERAL VIEW explode(split(lower(text), '\\W+'))
lTable as word GROUP BY word;
SELECT word, COUNT(*) FROM doc LATERAL VIEW explode(split(text, ' ')) abc as word
GROUP BY word;
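
To see what the LATERAL VIEW is doing: split() turns each line into an array of
words, and explode() emits one row per array element. A quick sketch against the
doc table created above:

-- One output row per word in each line of doc
SELECT explode(split(text, ' ')) AS word FROM doc;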

-- 5) quit; or exit; to exit the Hive CLI

Lists the primitive and collection types supported by Hive

Primitive types:

Type       Size / Description                Literal syntax examples
TINYINT    1-byte signed integer             20
SMALLINT   2-byte signed integer             20
INT        4-byte signed integer             20
BIGINT     8-byte signed integer             20
BOOLEAN    Boolean true or false             TRUE
FLOAT      Single-precision floating point   3.14159
DOUBLE     Double-precision floating point   3.14159
STRING     Sequence of characters            'Now is the time', "for all good men"
TIMESTAMP  Integer, float, or string         1327882394, 1327882394.123456789, '2012-02-03 12:34:56.123456789'
BINARY     Array of bytes                    -

Collection types:

Type     Description                                Example                                          Access
ARRAY    Ordered sequence of elements of one type   name ARRAY<STRING>, e.g. array('John')           name[0]
MAP      Collection of key-value pairs              name MAP<STRING, STRING>, e.g. map('first','a')  name['first']
STRUCT   Analogous to an object with named fields   name STRUCT<first:STRING>                        name.first

JavaScript Object Notation (JSON)


{
"name": "John Doe",
"salary": 100000.0,
"subordinates": ["Mary Smith", "Todd Jones"],
"deductions": {
"Federal Taxes": .2,
"State Taxes": .05,
"Insurance": .1
},
"address": {
"street": "1 Michigan Ave.",
"city": "Chicago",
"state": "IL",
"zip": 60600
}
}
Input file -->

John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600

Table creation-->

CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Data Loading-->
load data local inpath '/home/cloudera/my/person.txt' into table employees;

Data selection-->
select subordinates[0],deductions['Federal Taxes'],address.city from employees;

Hive's default record and field delimiters

Delimiter     Description
\n            For text files, each line is a record, so the line feed character separates records.
^A (Ctrl+A)   Separates all fields (columns). Octal code \001.
^B (Ctrl+B)   Separates the elements in an ARRAY or STRUCT, or the key-value pairs in a MAP. Octal code \002.
^C (Ctrl+C)   Separates keys from values in MAP key-value pairs. Octal code \003.

Hive database related commands

The Hive concept of a database is essentially just a catalog or namespace of tables.
If you don't specify a database, the default database is used.

The simplest syntax for creating a database


hive> CREATE DATABASE financials; -- CREATE SCHEMA is a synonym
hive> CREATE DATABASE financials COMMENT 'Holds all financial tables';
hive> CREATE DATABASE financials WITH DBPROPERTIES ('creator' = 'Mark
Moneybags', 'date' = '2012-01-02');

Hive will throw an error if financials already exists. You can suppress these warnings with
this variation:
hive> CREATE DATABASE IF NOT EXISTS financials;

You can see the databases that already exist


hive> SHOW DATABASES;
hive> SHOW DATABASES LIKE 'h.*';

You can override the default database location, /user/hive/warehouse/financials.db.


hive> CREATE DATABASE financials LOCATION '/my/preferred/directory';

To see a description of the database


hive> DESCRIBE DATABASE financials;
hive> DESCRIBE DATABASE EXTENDED financials;

Set working database


hive> USE financials;
hive> SHOW TABLES;

Setting a property to print the current database


hive> set hive.cli.print.current.db=true;
hive (financials)> USE default;

Drop an empty database, or a database containing tables (CASCADE)


hive> DROP DATABASE IF EXISTS financials;
hive> DROP DATABASE IF EXISTS financials CASCADE;

Alter database
hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba');

Table creation in Hive

Creating a managed/internal table (Hive controls the lifecycle of its data)


CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names,
values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT
'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';

You can also copy the schema of an existing table:


CREATE TABLE IF NOT EXISTS mydb.employees2 LIKE mydb.employees;

Listing tables


hive> USE mydb;
hive> SHOW TABLES;
hive> SHOW TABLES IN mydb;
hive> SHOW TABLES 'empl.*';

Show details about the table


hive> DESCRIBE EXTENDED mydb.employees; -- or DESCRIBE FORMATTED for friendlier output

Creating external table


With an external table, Hive does not assume it owns the data.
If the data is shared between tools, create an external table.
Dropping the table does not delete the data.
Some HiveQL constructs are not permitted for external tables.

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Partitioned table
Partitions have important performance benefits, and they can help organize data in a
logical fashion, such as hierarchically.

CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=CA/state=AB
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=CA/state=BC
...
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=US/state=AL

However, a query across all partitions could trigger an enormous MapReduce job if the table
data and number of partitions are large. A highly suggested safety measure is putting Hive
into strict mode:

hive> set hive.mapred.mode=strict;
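
In strict mode, a query against a partitioned table must include a partition filter;
a minimal sketch using the employees table above:

-- Rejected in strict mode: no partition predicate
SELECT * FROM employees;
-- Allowed: the WHERE clause prunes to a single partition
SELECT * FROM employees WHERE country = 'US' AND state = 'CA';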

You can see the partitions


hive> SHOW PARTITIONS employees;
hive> SHOW PARTITIONS employees PARTITION(country='US');
hive> SHOW PARTITIONS employees PARTITION(country='US', state='AK');

Loading data
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
External partitioned table
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
hms INT,
severity STRING,
server STRING,
process_id INT,
message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';

Archiving old data on inexpensive storage (Amazon S3)

Copy the data for the partition being moved to S3:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02

Alter the table to point the partition to the S3 location:

ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';

Remove the HDFS copy of the partition using the hadoop fs -rmr command:
hadoop fs -rmr /data/log_messages/2011/12/02

Example of a clustered (bucketed) table


CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
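
One payoff of bucketing is sampling; as a sketch (assuming the stocks table was
populated with set hive.enforce.bucketing=true, so rows really land in their
buckets), TABLESAMPLE reads only the requested bucket instead of the full table:

-- Sample roughly 1/96 of the rows
SELECT * FROM stocks TABLESAMPLE(BUCKET 1 OUT OF 96 ON rand()) s;

Sampling ON a clustering column such as symbol lets Hive read only the matching bucket files.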

Table-related operations

Dropping Tables
DROP TABLE IF EXISTS employees;

Renaming a Table
ALTER TABLE log_messages RENAME TO logmsgs;
Adding, Modifying, and Dropping a Table Partition
ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01';

ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';

ALTER TABLE log_messages DROP IF EXISTS PARTITION(year = 2011, month = 12, day =
2);

You can rename a column, change its position, type, or comment:


ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

Adding Columns
ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'The current session id');

Deleting or Replacing Columns


ALTER TABLE log_messages REPLACE COLUMNS (
hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity',
message STRING COMMENT 'The rest of the message');

Alter Table Properties


ALTER TABLE log_messages SET TBLPROPERTIES (
'notes' = 'The process id is no longer captured; this column is always NULL');

Alter Storage Properties


ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;

Other operations

hive -e 'ALTER TABLE log_messages TOUCH PARTITION(year = 2012, month = 1, day = 1);'

ALTER TABLE log_messages ARCHIVE
PARTITION(year = 2012, month = 1, day = 1);

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;
Dynamic insert on a partitioned table

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE t3 PARTITION (region) SELECT id, name, region FROM t2;
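
For context, a minimal sketch of schemas that fit this insert (both tables
hypothetical); note that the dynamic partition column must come last in the SELECT list:

-- Hypothetical source and target for the dynamic-partition insert above
CREATE TABLE t2 (id INT, name STRING, region STRING);
CREATE TABLE t3 (id INT, name STRING) PARTITIONED BY (region STRING);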

Find the player with the maximum runs in each year


create table temp_batting (col_value STRING);

LOAD DATA INPATH '/user/adminuser/input/Batting.csv' OVERWRITE INTO TABLE temp_batting;

create table batting (player_id STRING, year INT, runs INT);

insert overwrite table batting
SELECT
regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;

SELECT year, max(runs) FROM batting GROUP BY year;

SELECT a.year, a.player_id, a.runs from batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);

Hive UDF

Suppose we have to trim some values.

1) Our class must extend the UDF abstract class.
2) Our class must have at least one evaluate() method (evaluate() is not defined in the UDF base class).
3) Compile the Java file.
4) Create a jar file.
5) Add the jar file to the Hive classpath.
6) Create a temporary function.

In Eclipse: New Project -> add the jars from the Hadoop and Hive lib directories.

Package: com.hadoop.hive; Ctrl+Shift+O imports the required packages.

package com.hadoop.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class TestUDF extends UDF {

    Text t = new Text();

    // Trims leading and trailing whitespace
    public Text evaluate(Text str) {
        if (str == null) { return str; }
        t.set(StringUtils.strip(str.toString()));
        return t;
    }

    // Trims the given characters instead of whitespace
    public Text evaluate(Text str, String splchar) {
        if (str == null) { return str; }
        t.set(StringUtils.strip(str.toString(), splchar));
        return t;
    }
}

hive> add jar /home/esiavir/trimUDF.jar;
hive> create temporary function vardhan as 'com.hadoop.hive.TestUDF';
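
Once registered, the function can be called like any built-in; a usage sketch
(table and column names hypothetical):

hive> SELECT vardhan(name) FROM employees;        -- trims whitespace
hive> SELECT vardhan(name, '#') FROM employees;   -- strips '#' characters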
