
Copyright 2011-2012, Think Big Analytics, All Rights Reserved

Hive: SQL for Hadoop


An Essential Tool for
Hadoop-based
Data Warehouses
I'll argue that Hive is indispensable to people creating "data warehouses" with Hadoop, because it gives them a "similar" SQL interface to their data,
making it easier to migrate skills and even apps from existing relational tools to Hadoop.
Adapted from our
Hadoop 3-Day Developer Course
info@thinkbiganalytics.com
thinkbiganalytics.com
These notes are (heavily) adapted from our Hadoop 3-day developer course. Of course, we won't do exercises in this talk ;) The second day focuses on
Hive. We also offer Hive training by itself, admin courses, etc.
Programming Hive
Summer, 2012
O'Reilly
I'm collaborating with two other colleagues to write "Programming Hive", which will be published by O'Reilly this summer. Until it's ready, the best source
of documentation is on the Hive site, http://hive.apache.org.
4
programmingscala.com polyglotprogramming.com/fpjava
Dean Wampler
Functional
Programming
for Java Developers
Dean Wampler
20+ years exp.
Think Big Academy
5
A True Story...
Once upon an Internet
time, in the land of
Enterprise, the King's
Warehouse of Data was
bursting at the seams.
And lo, the scribes
decided they could cut
costs and store more
data by moving to
Hadoop.
But then the scribes
complained amongst
themselves, "The
language of the land of
Java hurts our tongues!"
"It takes forever to say
anything!"
There was a scribe who
walked through the
Kingdom-wide Labyrinth
(kwl), where he found
The Book of Faces.
In that book, he found
the secret language the
Bee Keepers use to
query the god Hadoop!
And there was much
rejoicing, for the scribes
could work with their
data again!
The End.
14
Enter Hive
15
Since your team knows SQL
and all your Data Warehouse
apps are written in SQL,
Hive minimizes the effort of
migrating to Hadoop.

Ideal for data warehousing.

Ad-hoc queries of data.

Familiar SQL dialect.

Analysis of large data sets.

Hadoop MapReduce jobs.


16
Hive
Hive is a killer app, in our opinion, for data warehouse teams migrating to Hadoop, because it
gives them a familiar SQL language that hides the complexity of MR programming.

Invented at Facebook.

Open sourced to Apache in 2008.

http://hive.apache.org
17
Hive
18
A Scenario:
Mining Daily
Click Stream
Logs
19
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

As we copy the daily click stream log over to a local staging location, we transform it into the
Hive table format we want.
20
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

Timestamp
21
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

The server
22
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

The process
(movies search)
and the process id.
23
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

Customer id
24
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

The log message
25
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]: 1234: search for vampires in love.

To: /staging/2012-01-09.log
09:02:17^Aserver1^Amovies^A18^A1234^Asearch for vampires in love.

26
Ingest & Transform:
To: /staging/2012-01-09.log
09:02:17^Aserver1^Amovies^A18^A1234^Asearch for vampires in love.

Removed month (Jan) and day (09).

Added ^A as field separators (Hive convention).

Separated process id from process name.


The transformations we made. (You could use many different Linux, scripting, code, or
Hadoop-related ingestion tools to do this.)
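For concreteness, here is one hypothetical way to do that transformation with standard Unix tools. This is a minimal sketch only: it assumes every input line looks exactly like the sample above, and it uses the staging path from the slides.

# Rewrite each syslog-style line as a ^A-delimited record for Hive.
SEP=$'\001'   # Hive's default field separator, Control-A (^A)
sed -E "s/^[A-Z][a-z]{2} +[0-9]+ +([0-9:]+) +([^ ]+) +([a-z]+)\[([0-9]+)\]: +([0-9]+): +(.*)/\1${SEP}\2${SEP}\3${SEP}\4${SEP}\5${SEP}\6/" \
  /var/log/clicks.log > /staging/2012-01-09.log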
27
Ingest & Transform:
hadoop fs -put /staging/2012-01-09.log \
/clicks/2012/01/09/log.txt

Put in HDFS:

(The final file name doesn't matter)


Here we use the hadoop shell command to put the file where we want it in the file system.
Note that the name of the target file doesn't matter; we'll just tell Hive to read all files in the
directory, so there could be many files there!
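If you want to confirm what landed in that directory (a small usage sketch; there could be any number of log files in the listing):

hadoop fs -ls /clicks/2012/01/09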
28
Back to Hive...
CREATE EXTERNAL TABLE clicks (
hms STRING,
hostname STRING,
process STRING,
pid INT,
uid INT,
message STRING)
PARTITIONED BY (
year INT,
month INT,
day INT);

Create an external Hive table:


You don't have to
use EXTERNAL
and PARTITIONED
together.
Now let's create an external table that will read those files as the backing store. Also, we
make it partitioned to accelerate queries that limit by year, month or day. (You don't have to
use external and partitioned together.)
29
Back to Hive...
ALTER TABLE clicks ADD IF NOT EXISTS
PARTITION (
year=2012,
month=01,
day=09)
LOCATION '/clicks/2012/01/09';

Add a partition for 2012-01-09:

A directory in HDFS.
We add a partition for each day. Note the LOCATION path, which is the directory where we
wrote our file.
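A quick way to see which partitions the table now knows about (a small usage sketch):

SHOW PARTITIONS clicks;
-- lists the partitions added so far, one per row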
30
Now, Analyze!!
SELECT hms, uid, message FROM clicks
WHERE message LIKE '%vampire%' AND
year = 2012 AND
month = 01 AND
day = 09;

What's with the kids and vampires??

After some MapReduce crunching...



09:02:29 1234 search for twilight of the vampires
09:02:35 1234 add to cart vampires want their genre back

And we can run SQL queries!!

SQL analysis with Hive.

Other tools can use the data, too.

Massive scalability with Hadoop.


31
Recap
32
(Architecture diagram: the CLI, HWI, and Thrift Server, with JDBC/ODBC clients such as Karmasphere and others, sit on top of the Hive Driver, which compiles, optimizes, and executes queries using the Metastore, and which submits jobs to the Hadoop master (Job Tracker, Name Node) and HDFS.)
Hive queries generate MR jobs. (Some operations don't invoke Hadoop processes, e.g., some
very simple queries and commands that just write updates to the metastore.)
CLI = Command Line Interface.
HWI = Hive Web Interface.
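For reference, a few common ways to reach the CLI (a small usage sketch; the script name is made up):

hive                                    # interactive shell
hive -e "SELECT COUNT(*) FROM clicks"   # run a single query and exit
hive -f daily_report.hql                # run a file of HiveQL statements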

HDFS

MapR

S3

HBase (new)

Others...
33
Tables
There is early support for using Hive with HBase. Other databases and distributed file
systems will no doubt follow.

Table metadata
stored in a
relational DB.
34
Tables
For production, you need to set up a MySQL or PostgreSQL database for Hive's metadata. Out
of the box, Hive uses a Derby DB, but it can only be used by a single user and a single process
at a time, so it's fine for personal development only.

Most queries
use MapReduce
jobs.
35
Queries
Hive generates MapReduce jobs to implement all but the simplest queries.

Benefits

Horizontal scalability.

Drawbacks

Latency!
36
MapReduce Queries
The high latency makes Hive unsuitable for online database use. (Hive also doesn't support
transactions and has other limitations that are relevant here.) So, these limitations make
Hive best for offline (batch mode) use, such as data warehouse apps.

Benefits

Horizontal scalability.

Data redundancy.

Drawbacks

No insert, update, and delete!
37
HDFS Storage
You can generate new tables or write to local files. Forthcoming versions of HDFS will support
appending data.
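In practice you work around the lack of row-level operations by rewriting a table (or a partition) wholesale, or by exporting query results. A sketch, assuming a clicks_by_process table has already been created:

-- Rebuild a derived table from scratch; no row-level updates needed.
INSERT OVERWRITE TABLE clicks_by_process
SELECT process, COUNT(*) FROM clicks GROUP BY process;

-- Or dump query results to the local file system.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/clicks_report'
SELECT hms, uid, message FROM clicks WHERE year = 2012;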

Schema on Read

Schema enforcement at
query time, not write
time.
38
HDFS Storage
Especially for external tables, but even for internal ones since the files are HDFS files, Hive
can't enforce that records written to table files have the specified schema, so it does these
checks at query time.

No Transactions.

Some SQL features not implemented (yet).
39
Other Limitations
40

More on Tables
and Schemas
41
Data Types

The usual scalar types:

TINYINT, ..., BIGINT.

FLOAT, DOUBLE.

BOOLEAN.

STRING.
Like most databases...
42
Data Types

The unusual complex types:

STRUCT.

MAP.

ARRAY.
Structs are like objects or c-style structs. Maps are key-value pairs, and you know what
arrays are ;)
43
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING,FLOAT>,
address STRUCT<
street:STRING,
city:STRING,
state:STRING,
zip:INT>
);
subordinates references other records by the employee name. (Hive doesn't have indexes, in
the usual sense, but an indexing feature was recently added.) Deductions is a key-value list
of the name of the deduction and a float indicating the amount (e.g., %). Address is like a
class, object, or C-style struct, whatever you prefer.
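A query sketch showing how those complex types are referenced (the 'Federal Taxes' key is just an assumed deduction name):

SELECT name,
  subordinates[0],              -- first element of the ARRAY
  deductions['Federal Taxes'],  -- MAP lookup by key
  address.city                  -- STRUCT field access
FROM employees;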
44
File & Record Formats
CREATE TABLE employees (...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
All the
defaults for
text files!
Suppose our employees table has a custom format and field delimiters. We can change them,
although here I'm showing all the default values used by Hive!
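For contrast, a hypothetical variant that actually overrides the defaults, using comma-separated fields:

CREATE TABLE employees_csv (
  name STRING,
  salary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;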
45

Select, Where,
Group By, Join,...
46
Common SQL...

You get most of the usual suspects for SELECT, WHERE, GROUP BY, and JOIN.
We'll just highlight a few unique features.
47
ADD JAR MyUDFs.jar;
CREATE TEMPORARY FUNCTION
net_salary
AS 'com.example.NetCalcUDF';
SELECT name,
net_salary(salary, deductions)
FROM employees;
User Defined
Functions
Following a Hive-defined API, implement your own functions, build, put in a jar, and then use
them in your queries. Here we (pretend to) implement a function that takes the employee's
salary and deductions, then computes the net salary.
48
SELECT name, salary
FROM employees
ORDER BY salary ASC;
ORDER BY vs.
SORT BY

A total ordering - one reducer.


SELECT name, salary
FROM employees
SORT BY salary ASC;

A local ordering - sorts within each reducer.


For a giant data set, piping everything through one reducer might take a very long time. A
compromise is to sort locally, so each reducer sorts its output. However, if you structure
your jobs right, you might achieve a total order depending on how data gets to the reducers.
(E.g., each reducer handles a year's worth of data, so joining the files together would be
totally sorted.)
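A sketch of that pattern: DISTRIBUTE BY routes all rows for a given year to the same reducer, and SORT BY orders the rows within each reducer, so each reducer's output is sorted:

SELECT year, hms, uid, message
FROM clicks
DISTRIBUTE BY year
SORT BY year, hms;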

Only equality (x = y).


49
Inner Joins
SELECT ...
FROM clicks a JOIN clicks b ON (
a.uid = b.uid AND a.day = b.day)
WHERE a.process = 'movies'
AND b.process = 'books'
AND a.year > 2012;
Note that the a.year > is in the WHERE clause, not the ON clause for the JOIN. (I'm doing a
correlation query; which users searched for movies and books on the same day?)
Some outer and semi join constructs are supported, as well as some Hadoop-specific
optimization constructs.
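For reference, sketches of the constructs mentioned in those notes (the small_dim table is made up for the map-join example):

-- Outer join: keep rows from the left side even when there is no match.
SELECT a.uid, b.uid
FROM clicks a LEFT OUTER JOIN clicks b
  ON (a.uid = b.uid AND a.day = b.day);

-- Semi join: like IN/EXISTS; only columns from the left side may be selected.
SELECT a.uid, a.message
FROM clicks a LEFT SEMI JOIN clicks b
  ON (a.uid = b.uid AND a.day = b.day);

-- Hadoop-specific optimization hint: load the small table into memory on each mapper.
SELECT /*+ MAPJOIN(d) */ a.uid, d.description
FROM clicks a JOIN small_dim d ON (a.process = d.process);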
50
A Final Example
of Controlling
MapReduce...

Calling out to external programs to perform map and reduce operations.
51
Specify Map & Reduce
Processes
52
Example
FROM (
FROM clicks
MAP message
USING '/tmp/vampire_extractor'
AS item_title, count
CLUSTER BY item_title) it
INSERT OVERWRITE TABLE vampire_stuff
REDUCE it.item_title, it.count
USING '/tmp/thing_counter.py'
AS item_title, counts;
Note the MAP USING and REDUCE USING. We're also using CLUSTER BY (distributing and
sorting on item_title).
53
Example
FROM (
FROM clicks
MAP message
USING '/tmp/vampire_extractor'
AS item_title, count
CLUSTER BY item_title) it
INSERT OVERWRITE TABLE vampire_stuff
REDUCE it.item_title, it.count
USING '/tmp/thing_counter.py'
AS item_title, counts;
Call specific
map and
reduce
processes.
54
And Also:
FROM (
FROM clicks
MAP message
USING '/tmp/vampire_extractor'
AS item_title, count
CLUSTER BY item_title) it
INSERT OVERWRITE TABLE vampire_stuff
REDUCE it.item_title, it.count
USING '/tmp/thing_counter.py'
AS item_title, counts;
Like GROUP
BY, but
directs
output to
specific
reducers.
55
And Also:
FROM (
FROM clicks
MAP message
USING '/tmp/vampire_extractor'
AS item_title, count
CLUSTER BY item_title) it
INSERT OVERWRITE TABLE vampire_stuff
REDUCE it.item_title, it.count
USING '/tmp/thing_counter.py'
AS item_title, counts;
How to
populate an
internal
table.
56
Hive:
Conclusions

Not a real SQL Database.

Transactions, updates, etc.

but features will grow.

High latency queries.

Documentation poor.
57
Hive Disadvantages

Indispensable for SQL users.

Easier than Java MR API.

Makes porting data warehouse apps to Hadoop much easier.
58
Hive Advantages
59
Questions?
dean.wampler@thinkbiganalytics.com
thinkbiganalytics.com
This presentation: github.com/deanwampler/Presentations