
Hadoop Ecosystem: HIVE

What is Hive?
Hive provides a SQL dialect, called Hive Query Language (abbreviated HiveQL or just
HQL), for querying data stored in a Hadoop cluster. Hive translates most
queries to MapReduce jobs, thereby exploiting the scalability of Hadoop while
presenting a familiar SQL abstraction.
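
As a minimal sketch (the table and columns here are hypothetical), an ordinary-looking
aggregation query like the following is compiled into one or more MapReduce jobs behind the scenes:

-- Hypothetical table; Hive compiles this GROUP BY into a MapReduce job
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;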

Where is Hive well suited?


Hive is best suited for data warehouse applications, where relatively static
data is analysed, fast response times are not required, and the data is
not changing rapidly.
It works well where a large data set is maintained and mined for insights, reports, etc.
Hive makes it easier for developers to port SQL-based applications to Hadoop.

Limitations of Hive
Hive is not a full database. The biggest limitation is that Hive does not provide
record-level updates, inserts, or deletes.
Because Hadoop is a batch-oriented system, Hive queries have higher latency,
due to the start-up overhead of MapReduce jobs. Finally, Hive does not
provide transactions.
Hive doesn't provide the crucial features required for OLTP. If you need OLTP
features for large-scale data, you should consider using a NoSQL database.
Examples include HBase, a NoSQL database integrated with Hadoop.

Does HiveQL conform to the ANSI SQL standard?

HiveQL does not conform to the ANSI SQL standard, and it differs in various
ways from the familiar SQL dialects provided by Oracle, MySQL, and SQL
Server. (However, it is closest to MySQL's dialect of SQL.)

Hive architecture

CLI -> command-line interface for interacting with Hive

GUI -> Karmasphere / Cloudera Hue / Qubole / others
HWI -> Hive web interface, provides remote access to Hive:
export ANT_LIB=/opt/ant/lib
bin/hive --service hwi
Programmatic access -> JDBC, ODBC

Thrift server -> provides remote access from other processes.
If you want to run Hive queries programmatically, Thrift is the solution.

Driver -> compiles the input, optimizes the computation required, and executes
the required steps, usually with MapReduce jobs.

Metastore -> a separate relational database (usually a MySQL instance) where
Hive persists table schemas and other system metadata (such as partition
information). By default, Hive uses the built-in Derby SQL server, which
provides limited, single-process storage (you cannot run two simultaneous
instances of the Hive CLI).
How does internal processing happen in Hive?
When MapReduce jobs are required, Hive doesn't generate Java MapReduce
programs. Instead, it uses built-in, generic Mapper and Reducer modules that
are driven by an XML file representing the job plan. In other words, these
generic modules function like mini language interpreters, and the language used
to drive the computation is encoded in XML.

How is Pig different from Hive?

Pig was developed at Yahoo! at about the same time Facebook was developing
Hive.
In scenarios where a sequence of intermediate (temporary) tables would be
required to manage the complexity of a query, Pig is useful.
Pig is described as a data flow language, rather than a query language.
Pig is often used as part of ETL processes to ingest external data into a
Hadoop cluster and transform it into a more desirable form.
Pig is less suitable for porting over SQL applications, and experienced SQL users
will have a larger learning curve with Pig.
Hadoop teams often use a combination of Hive and Pig, selecting the appropriate
tool for particular jobs.
How is HBase different from Hive?
What if you need the database features that Hive doesn't provide, like row-
level updates, rapid query response times, and transactions?
HBase is inspired by Google's Bigtable, although it doesn't implement all
Bigtable features.
HBase doesn't provide a query language like SQL, but Hive is now integrated
with HBase.
HBase uses column-oriented storage, where columns can be organized
into column families.
HBase also keeps a configurable number of versions of each column's values
(marked by timestamps).
HBase also uses in-memory caching of data and local files for the append log of
updates. Periodically, the durable files are updated with all the append log
updates, etc.
What is the metastore service?
All Hive installations require a metastore service, which Hive uses to store
table schemas and other metadata. It is typically implemented using tables in a
relational database. By default, Hive uses a built-in Derby SQL server, which
provides limited, single-process storage. For example, when using Derby, you
can't run two simultaneous instances of the Hive CLI. However, this is fine for
learning Hive on a personal machine and for some developer tasks. For clusters,
MySQL or a similar relational database is required.

Derby vs. MySQL

If you are running with the default Derby database for the metastore, you'll
notice that your current working directory now contains a new subdirectory
called metastore_db, created by Derby during the short Hive session you just
executed. Derby supports only one instance; for clusters, MySQL or a
similar relational database is required.
What is the use of HCatalog?
HCatalog can be used to share data structures with external systems. HCatalog
gives users of other tools on Hadoop access to the Hive metastore, so that
they can read and write data in Hive's data warehouse.

HIVE Practical
Hive location of table data and metastore
vi $HIVE_HOME/conf/hive-site.xml
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.metastore.local=true
javax.jdo.option.ConnectionURL= jdbc:derby:;databaseName=
/home/adminuser/vikas/hive_test/hive/metastore_db;create=true
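
Note that hive-site.xml is an XML file; the key=value lines above are shorthand.
As a minimal sketch, each setting actually appears as a <property> element, for example:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>

The same element form applies to the javax.jdo.option.* connection properties shown later.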

Define a user-specific warehouse directory


vi $HOME/.hiverc

ADD JAR /path/to/custom_hive_extensions.jar;


set hive.metastore.warehouse.dir=/home/adminuser/vikas/hive_test/hive/warehouse;
set hive.cli.print.current.db=true;
set hive.exec.mode.local.auto=true;
set hive.cli.print.header=true;

Hive stores the last 10,000 lines of command history in $HOME/.hivehistory.

If you want to change the warehouse path, set it in hive-site.xml and then grant
permissions on that directory if it is on the local filesystem:

sudo chown -R user /Path
sudo chmod -R 777 /Path

If the given path is on HDFS, then stop and start the Hadoop services:

stop-all.sh
start-all.sh

MySQL configuration for the metastore

vi $HIVE_HOME/conf/hive-site.xml
javax.jdo.option.ConnectionURL=
jdbc:mysql://db1.mydomain.pvt/hive_db?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName= com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName= database_user
javax.jdo.option.ConnectionPassword= database_pass

Download the MySQL Connector/J driver and place it in $HIVE_HOME/lib

Hive CLI option


$ hive --help --service cli
usage: hive
-d,--define <key=value> Variable substitution to apply to hive
commands. e.g. -d A=B or --define A=B
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
-h <hostname> connecting to Hive Server on remote host
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable substitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-p <port> connecting to Hive Server on port number
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)

Hive namespaces for variables and properties

Namespace   Access       Description
hivevar     Read/Write   (v0.8.0 and later) User-defined custom variables.
hiveconf    Read/Write   Hive-specific configuration properties.
system      Read/Write   Configuration properties defined by Java.
env         Read only    Environment variables.

$ hive --define foo=bar


hive> set foo;
foo=bar
hive> set hivevar:foo;
hivevar:foo=bar
hive> set hivevar:foo=bar2;
hive> set foo;
foo=bar2
hive> create table toss1(i int, ${hivevar:foo} string);
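
Variable references are substituted textually before the statement runs, so
given the assignments above, the CREATE TABLE statement expands to:

-- What Hive actually executes after substitution
create table toss1(i int, bar2 string);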

$ hive --hiveconf hive.cli.print.current.db=true


hive> set hiveconf:hive.cli.print.current.db;
hive> set hiveconf:hive.cli.print.current.db=true;
$ hive --hiveconf y=5
hive> set y;

hive> set system:user.name;


system:user.name=myusername
hive> set system:user.name=yourusername;
hive> set system:user.name;
system:user.name=yourusername

hive> set env:HOME;


hive> set;
$ YEAR=2012 hive -e "SELECT * FROM mytable WHERE year = ${env:YEAR}";

Hive one-line commands

$ hive -e "SELECT * FROM mytable LIMIT 3";


$ hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery
$ hive -S -e "set" | grep warehouse

$ hive -e "CREATE TABLE src(s STRING)";


$ echo "one row" > /tmp/myfile
$ hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src;

Executing Hive Queries from Files

$ hive -f /path/to/file/withqueries.hql

hive> source /path/to/file/withqueries.hql;

Word count example using Hive


CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Or,

-- 1) Hive queries for Word Count


drop table if exists doc;

-- 2) create table to load whole file


create table doc(
text string
) row format delimited fields terminated by '\n' stored as textfile;

--3) loads plain text file


load data local inpath '/home/adminuser/vikas/new.txt' overwrite into table doc;

-- 4) wordCount in single line


SELECT word, COUNT(*) FROM doc LATERAL VIEW explode(split(lower(text), '\\W+'))
lTable as word GROUP BY word;
SELECT word, COUNT(*) FROM doc LATERAL VIEW explode(split(text, ' ')) abc as word
GROUP BY word;
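
To see what the LATERAL VIEW is doing: split() turns each line into an array of
words, and explode() emits one row per array element. A quick sketch against the
doc table created above:

-- One output row per word in each line of doc
SELECT explode(split(text, ' ')) AS word FROM doc;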

-- 5) quit; or exit; to exit the Hive CLI

Lists the primitive and collection types supported by Hive

Primitive types:

Type       Size / Description                Literal syntax examples
TINYINT    1-byte signed integer             20
SMALLINT   2-byte signed integer             20
INT        4-byte signed integer             20
BIGINT     8-byte signed integer             20
BOOLEAN    Boolean true or false             TRUE
FLOAT      Single-precision floating point   3.14159
DOUBLE     Double-precision floating point   3.14159
STRING     Sequence of characters            'Now is the time', "for all good men"
TIMESTAMP  Integer, float, or string         1327882394, 1327882394.123456789, '2012-02-03 12:34:56.123456789'
BINARY     Array of bytes                    -

Collection types:

Type     Description                                Example                                          Access
ARRAY    Ordered sequence of elements of one type   name ARRAY<STRING>, e.g. array('John')           name[0]
MAP      Collection of key-value pairs              name MAP<STRING, STRING>, e.g. map('first','a')  name['first']
STRUCT   Analogous to an object with named fields   name STRUCT<first:STRING>                        name.first

JavaScript Object Notation (JSON)


{
"name": "John Doe",
"salary": 100000.0,
"subordinates": ["Mary Smith", "Todd Jones"],
"deductions": {
"Federal Taxes": .2,
"State Taxes": .05,
"Insurance": .1
},
"address": {
"street": "1 Michigan Ave.",
"city": "Chicago",
"state": "IL",
"zip": 60600
}
}
Input file -->

John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600

Table creation-->

CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Data Loading-->
load data local inpath '/home/cloudera/my/person.txt' into table employees;

Data selection-->
select subordinates[0],deductions['Federal Taxes'],address.city from employees;

Hive's default record and field delimiters

Delimiter     Description
\n            For text files, each line is a record, so the line feed character separates records.
^A (Ctrl+A)   Separates all fields (columns). Octal code \001.
^B (Ctrl+B)   Separates the elements in an ARRAY or STRUCT, or the key-value pairs in a MAP. Octal code \002.
^C (Ctrl+C)   Separates keys from values in MAP key-value pairs. Octal code \003.

Hive database related commands

The Hive concept of a database is essentially just a catalog or namespace of tables.
If you don't specify a database, the default database is used.

The simplest syntax for creating a database


hive> CREATE DATABASE financials; -- CREATE SCHEMA is a synonym
hive> CREATE DATABASE financials COMMENT 'Holds all financial tables';
hive> CREATE DATABASE financials WITH DBPROPERTIES ('creator' = 'Mark
Moneybags', 'date' = '2012-01-02');

Hive will throw an error if financials already exists. You can suppress these warnings with
this variation:
hive> CREATE DATABASE IF NOT EXISTS financials;

You can see the databases that already exist


hive> SHOW DATABASES;
hive> SHOW DATABASES LIKE 'h.*';

You can override the default database location, /user/hive/warehouse/financials.db.


hive> CREATE DATABASE financials LOCATION '/my/preferred/directory';

To see a description of the database


hive> DESCRIBE DATABASE financials;
hive> DESCRIBE DATABASE EXTENDED financials;

Set working database


hive> USE financials;
hive> SHOW TABLES;

Setting a property to print the current database


hive> set hive.cli.print.current.db=true;
hive (financials)> USE default;

Drop an empty database, or a database containing tables (CASCADE)


hive> DROP DATABASE IF EXISTS financials;
hive> DROP DATABASE IF EXISTS financials CASCADE;

Alter database
hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba');

Table creation in Hive

Creating a managed/internal table (Hive controls the lifecycle of its data)


CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names,
values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT
'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';

You can also copy the schema of an existing table:


CREATE TABLE IF NOT EXISTS mydb.employees2 LIKE mydb.employees;

Listing tables


hive> USE mydb;
hive> SHOW TABLES;
hive> SHOW TABLES IN mydb;
hive> SHOW TABLES 'empl.*';

Show details about the table


hive> DESCRIBE EXTENDED mydb.employees; -- or DESCRIBE FORMATTED for friendlier output

Creating external table


With an external table, Hive does not assume it owns the data.
If the data is shared between tools, create an external table.
Dropping the table does not delete the data.
Some HiveQL constructs are not permitted for external tables.

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Partitioned table
Partitions have important performance benefits, and they can help organize data in a
logical fashion, such as hierarchically.

CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=CA/state=AB
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=CA/state=BC
...
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=US/state=AL

However, a query across all partitions could trigger an enormous MapReduce job if the table
data and number of partitions are large. A highly suggested safety measure is putting Hive
into strict mode:

hive> set hive.mapred.mode=strict;
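
In strict mode, a query against a partitioned table must include a partition filter;
a minimal sketch using the employees table above:

-- Rejected in strict mode: no partition predicate
SELECT * FROM employees;
-- Allowed: the WHERE clause prunes to a single partition
SELECT * FROM employees WHERE country = 'US' AND state = 'CA';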

You can see the partitions


hive> SHOW PARTITIONS employees;
hive> SHOW PARTITIONS employees PARTITION(country='US');
hive> SHOW PARTITIONS employees PARTITION(country='US', state='AK');

Loading data
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
External partitioned table
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
hms INT,
severity STRING,
server STRING,
process_id INT,
message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';

Archiving old data on inexpensive storage (Amazon S3)

Copy the data for the partition being moved to S3:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02

Alter the table to point the partition to the S3 location:

ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';

Remove the HDFS copy of the partition using the hadoop fs -rmr command:
hadoop fs -rmr /data/log_messages/2011/12/02

Example of a clustered (bucketed) table


CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
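
One payoff of bucketing is sampling; as a sketch (assuming the stocks table was
populated with set hive.enforce.bucketing=true, so rows really land in their
buckets), TABLESAMPLE reads only the requested bucket instead of the full table:

-- Sample roughly 1/96 of the rows
SELECT * FROM stocks TABLESAMPLE(BUCKET 1 OUT OF 96 ON rand()) s;

Sampling ON a clustering column such as symbol lets Hive read only the matching bucket files.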

Table-related operations

Dropping Tables
DROP TABLE IF EXISTS employees;

Renaming a Table
ALTER TABLE log_messages RENAME TO logmsgs;
Adding, Modifying, and Dropping a Table Partition
ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01';

ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';

ALTER TABLE log_messages DROP IF EXISTS PARTITION(year = 2011, month = 12, day =
2);

You can rename a column, change its position, type, or comment:


ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

Adding Columns
ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'The current session id');

Deleting or Replacing Columns


ALTER TABLE log_messages REPLACE COLUMNS (
hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity',
message STRING COMMENT 'The rest of the message');

Alter Table Properties


ALTER TABLE log_messages SET TBLPROPERTIES (
'notes' = 'The process id is no longer captured; this column is always NULL');

Alter Storage Properties


ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;

Other operations

hive -e 'ALTER TABLE log_messages TOUCH PARTITION(year = 2012, month = 1, day = 1);'

ALTER TABLE log_messages ARCHIVE
PARTITION(year = 2012, month = 1, day = 1);

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;
Dynamic insert on a partitioned table

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE t3 PARTITION (region) SELECT id, name, region FROM t2;
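
For context, a minimal sketch of schemas that fit this insert (both tables
hypothetical); note that the dynamic partition column must come last in the SELECT list:

-- Hypothetical source and target for the dynamic-partition insert above
CREATE TABLE t2 (id INT, name STRING, region STRING);
CREATE TABLE t3 (id INT, name STRING) PARTITIONED BY (region STRING);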

Find the player with the maximum runs in each year


create table temp_batting (col_value STRING);

LOAD DATA INPATH '/user/adminuser/input/Batting.csv' OVERWRITE INTO TABLE temp_batting;

create table batting (player_id STRING, year INT, runs INT);

insert overwrite table batting
SELECT
regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;

SELECT year, max(runs) FROM batting GROUP BY year;

SELECT a.year, a.player_id, a.runs from batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);

Hive UDF

Suppose we have to trim some values.

1) Our class must extend the UDF abstract class.
2) Our class must have at least one evaluate() method (evaluate() is not defined in the UDF base class).
3) Compile the Java file.
4) Create a jar file.
5) Add the jar file to the Hive classpath.
6) Create a temporary function.

In Eclipse: New Project -> add the jars from the Hadoop and Hive lib directories.

Package: com.hadoop.hive; Ctrl+Shift+O imports the required packages.

package com.hadoop.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class TestUDF extends UDF {

    Text t = new Text();

    // Trims leading and trailing whitespace
    public Text evaluate(Text str) {
        if (str == null) { return str; }
        t.set(StringUtils.strip(str.toString()));
        return t;
    }

    // Trims the given characters instead of whitespace
    public Text evaluate(Text str, String splchar) {
        if (str == null) { return str; }
        t.set(StringUtils.strip(str.toString(), splchar));
        return t;
    }
}

hive> add jar /home/esiavir/trimUDF.jar;
hive> create temporary function vardhan as 'com.hadoop.hive.TestUDF';
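
Once registered, the function can be called like any built-in; a usage sketch
(table and column names hypothetical):

hive> SELECT vardhan(name) FROM employees;        -- trims whitespace
hive> SELECT vardhan(name, '#') FROM employees;   -- strips '#' characters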
