Beruflich Dokumente
Kultur Dokumente
What is Hive?
Hive is a SQL dialect, called Hive Query Language (abbreviated HiveQL or just
HQL) for querying data stored in a Hadoop cluster. Hive translates most
queries to MapReduce jobs, thereby exploiting the scalability of Hadoop, while
presenting a familiar SQL abstraction.
Limitations of Hive
Hive is not a full database. The biggest limitation is that Hive does not provide
record-level update, insert, nor delete.
Because Hadoop is a batch-oriented system, Hive queries have higher latency;
due to the start-up overhead for MapReduce jobs Finally, Hive does not
provide transactions.
Hive doesnt provide crucial features required for OLTP. If you need OLTP
features for large-scale data, you should consider using a NoSQL database.
Examples include HBase, a NoSQL database integrated with Hadoop
Hive architecture
HIVE Practical
Hive location of table data and metastore
Vi $HIVE_HOME/conf/hive-site.xml
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.metastore.local=true
javax.jdo.option.ConnectionURL= jdbc:derby:;databaseName=
/home/adminuser/vikas/hive_test/hive/metastore_db;create=true
if you want to change the warehouse path then, write in hive-site.xml then give the
permission to that directory if it is on local system
stop-all.sh
start-all.sh
$ hive -f /path/to/file/withqueries.hql
Or,
Table creation-->
Data Loading-->
load data local inpath '/home/cloudera/my/person.txt' into table employees;
Data selection-->
select subordinates[0],deductions['Federal Taxes'],address.city from employees;
Delimiter Description
\n For text files, each line is a record, so the line feed character separates records.
^A (ctrl+A) Separates all fields (columns). octal code \001
^B Separate the elements in an ARRAY or STRUCT, or the key-value pairs in a
MAP. octal code \002
^C Separate key-value pairs. octal code \003
Hive will throw an error if financials already exists. You can suppress these warnings with
this variation:
hive> CREATE DATABASE IF NOT EXISTS financials;
Alter database
hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba');
Partitioned table
they have important performance benefits, and they can help organize data in a logical
fashion, such as hierarchically.
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=CA/state=AB
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=CA/state=BC
...
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=US/state=AL
However, a query across all partitions could trigger an enormous MapReduce job if the table
data and number of partitions are large. A highly suggested safety measure is putting Hive
into strict mode,
Loading data
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
External partition table
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
hms INT,
severity STRING,
server STRING,
process_id INT,
message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Remove the HDFS copy of the partition using the hadoop fs -rmr command:
hadoop fs -rmr /data/log_messages/2011/01/02
Dropping Tables
DROP TABLE IF EXISTS employees;
Renaming a Table
ALTER TABLE log_messages RENAME TO logmsgs;
Adding, Modifying, and Dropping a Table Partition
ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
ALTER TABLE log_messages DROP IF EXISTS PARTITION(year = 2011, month = 12, day =
2);
Adding Columns
ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id LONG COMMENT 'The current session id');
Other operation
set hive.exec.dynamoc.partition=true
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT into t3 PARTITION (region) select id,name,region from t2 ;
Hive UDF