
Copyright 2011-2012, Think Big Analytics, All Rights Reserved

Hive: SQL for Hadoop


An Essential Tool for
Hadoop-based
Data Warehouses
I'll argue that Hive is indispensable to people creating "data warehouses" with Hadoop, because it gives them a "similar" SQL interface to their data,
making it easier to migrate skills and even apps from existing relational tools to Hadoop.
Adapted from our
Hadoop 3-Day Developer Course
info@thinkbiganalytics.com
thinkbiganalytics.com
These notes are (heavily) adapted from our Hadoop 3-day developer course. Of course, we won't do exercises in this talk ;) The second day focuses on
Hive. We also offer Hive training by itself, admin courses, etc.
Programming Hive
Summer, 2012
O'Reilly
I'm collaborating with two other colleagues to write "Programming Hive", which will be published by O'Reilly this summer. Until it's ready, the best source
of documentation is on the Hive site, http://hive.apache.org.
4
programmingscala.com polyglotprogramming.com/fpjava
Dean Wampler
Functional
Programming
for Java Developers
Dean Wampler
20+ years exp.
Think Big Academy
5
A True Story...
Once upon an Internet
time, in the land of
Enterprise, the King's
Warehouse of Data was
bursting at the seams.
And lo, the scribes
decided they could cut
costs and store more
data by moving to
Hadoop.
But then the scribes
complained amongst
themselves, "The
language of the land of
Java hurts our tongues!"
"It takes forever to say
anything!"
There was a scribe who
walked through the
Kingdom-wide Labyrinth
(kwl), where he found
The Book of Faces.
In that book, he found
the secret language the
Bee Keepers use to
query the god Hadoop!
And there was much
rejoicing, for the scribes
could work with their
data again!
The End.
14
Enter Hive
15
Since your team knows SQL
and all your Data Warehouse
apps are written in SQL,
Hive minimizes the effort of
migrating to Hadoop.

Ideal for data warehousing.

Ad-hoc queries of data.

Familiar SQL dialect.

Analysis of large data sets.

Hadoop MapReduce jobs.


16
Hive
Hive is a killer app, in our opinion, for data warehouse teams migrating to Hadoop, because it
gives them a familiar SQL language that hides the complexity of MR programming.

Invented at Facebook.

Open sourced to Apache in 2008.

http://hive.apache.org
17
Hive
18
A Scenario:
Mining Daily
Click Stream
Logs
19
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

As we copy the daily click stream log over to a local staging location, we transform it into the
Hive table format we want.
20
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

Timestamp
21
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

The server
22
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

The process
(movies search)
and the process id.
23
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

Customer id
24
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]:
1234: search for vampires in love.

The log message
25
Ingest & Transform:

From: file://server1/var/log/clicks.log
Jan 9 09:02:17 server1 movies[18]: 1234: search for vampires in love.

To: /staging/2012-01-09.log
09:02:17^Aserver1^Amovies^A18^A1234^Asearch for vampires in love.

26
Ingest & Transform:
To: /staging/2012-01-09.log
09:02:17^Aserver1^Amovies^A18^A1234^Asearch for vampires in love.

Removed month (Jan) and day (09).

Added ^A as field separators (Hive convention).

Separated process id from process name.


The transformations we made. (You could use many different Linux, scripting, code, or
Hadoop-related ingestion tools to do this.)
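For concreteness, here is one hypothetical way to do that transformation with standard Unix tools. This is a minimal sketch only: it assumes every input line looks exactly like the sample above, and it uses the staging path from the slides.

# Rewrite each syslog-style line as a ^A-delimited record for Hive.
SEP=$'\001'   # Hive's default field separator, Control-A (^A)
sed -E "s/^[A-Z][a-z]{2} +[0-9]+ +([0-9:]+) +([^ ]+) +([a-z]+)\[([0-9]+)\]: +([0-9]+): +(.*)/\1${SEP}\2${SEP}\3${SEP}\4${SEP}\5${SEP}\6/" \
  /var/log/clicks.log > /staging/2012-01-09.log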
27
Ingest & Transform:
hadoop fs -put /staging/2012-01-09.log \
/clicks/2012/01/09/log.txt

Put in HDFS:

(The final file name doesn't matter)


Here we use the hadoop shell command to put the file where we want it in the file system.
Note that the name of the target file doesn't matter; we'll just tell Hive to read all files in the
directory, so there could be many files there!
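If you want to confirm what landed in that directory (a small usage sketch; there could be any number of log files in the listing):

hadoop fs -ls /clicks/2012/01/09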
28
Back to Hive...
CREATE EXTERNAL TABLE clicks (
hms STRING,
hostname STRING,
process STRING,
pid INT,
uid INT,
message STRING)
PARTITIONED BY (
year INT,
month INT,
day INT);

Create an external Hive table:


You don't have to
use EXTERNAL
and PARTITIONED
together.
Now let's create an external table that will read those files as the backing store. Also, we
make it partitioned to accelerate queries that limit by year, month or day. (You don't have to
use external and partitioned together.)
29
Back to Hive...
ALTER TABLE clicks ADD IF NOT EXISTS
PARTITION (
year=2012,
month=01,
day=09)
LOCATION '/clicks/2012/01/09';

Add a partition for 2012-01-09:

A directory in HDFS.
We add a partition for each day. Note the LOCATION path, which is the directory where we
wrote our file.
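A quick way to see which partitions the table now knows about (a small usage sketch):

SHOW PARTITIONS clicks;
-- lists the partitions added so far, one per row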
30
Now, Analyze!!
SELECT hms, uid, message FROM clicks
WHERE message LIKE '%vampire%' AND
year = 2012 AND
month = 01 AND
day = 09;

What's with the kids and vampires??

After some MapReduce crunching...



09:02:29 1234 search for twilight of the vampires
09:02:35 1234 add to cart vampires want their genre back

And we can run SQL queries!!

SQL analysis with Hive.

Other tools can use the data, too.

Massive scalability with Hadoop.


31
Recap
32
(Architecture diagram: the CLI, HWI, and Thrift Server, with JDBC/ODBC clients such as Karmasphere and others, sit on top of the Hive Driver, which compiles, optimizes, and executes queries using the Metastore, and which submits jobs to the Hadoop master (Job Tracker, Name Node) and HDFS.)
Hive queries generate MR jobs. (Some operations don't invoke Hadoop processes, e.g., some
very simple queries and commands that just write updates to the metastore.)
CLI = Command Line Interface.
HWI = Hive Web Interface.
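For reference, a few common ways to reach the CLI (a small usage sketch; the script name is made up):

hive                                    # interactive shell
hive -e "SELECT COUNT(*) FROM clicks"   # run a single query and exit
hive -f daily_report.hql                # run a file of HiveQL statements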

HDFS

MapR

S3

HBase (new)

Others...
33
Tables
There is early support for using Hive with HBase. Other databases and distributed file
systems will no doubt follow.

Table metadata
stored in a
relational DB.
34
Tables
For production, you need to set up a MySQL or PostgreSQL database for Hive's metadata. Out
of the box, Hive uses a Derby DB, but it can only be used by a single user and a single process
at a time, so it's fine for personal development only.

Most queries
use MapReduce
jobs.
35
Queries
Hive generates MapReduce jobs to implement all but the simplest queries.

Benefits

Horizontal scalability.

Drawbacks

Latency!
36
MapReduce Queries
The high latency makes Hive unsuitable for online database use. (Hive also doesn't support
transactions and has other limitations that are relevant here.) So, these limitations make
Hive best for offline (batch mode) use, such as data warehouse apps.

Benefits

Horizontal scalability.

Data redundancy.

Drawbacks

No insert, update, and delete!
37
HDFS Storage
You can generate new tables or write to local files. Forthcoming versions of HDFS will support
appending data.
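In practice you work around the lack of row-level operations by rewriting a table (or a partition) wholesale, or by exporting query results. A sketch, assuming a clicks_by_process table has already been created:

-- Rebuild a derived table from scratch; no row-level updates needed.
INSERT OVERWRITE TABLE clicks_by_process
SELECT process, COUNT(*) FROM clicks GROUP BY process;

-- Or dump query results to the local file system.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/clicks_report'
SELECT hms, uid, message FROM clicks WHERE year = 2012;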

Schema on Read

Schema enforcement at
query time, not write
time.
38
HDFS Storage
Especially for external tables, but even for internal ones since the files are HDFS files, Hive
can't enforce that records written to table files have the specified schema, so it does these
checks at query time.

No Transactions.

Some SQL features not implemented (yet).
39
Other Limitations
40

More on Tables
and Schemas
41
Data Types

The usual scalar types:

TINYINT, ..., BIGINT.

FLOAT, DOUBLE.

BOOLEAN.

STRING.
Like most databases...
42
Data Types

The unusual complex types:

STRUCT.

MAP.

ARRAY.
Structs are like objects or c-style structs. Maps are key-value pairs, and you know what
arrays are ;)
43
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING,FLOAT>,
address STRUCT<
street:STRING,
city:STRING,
state:STRING,
zip:INT>
);
subordinates references other records by the employee name. (Hive doesn't have indexes, in
the usual sense, but an indexing feature was recently added.) Deductions is a key-value list
of the name of the deduction and a float indicating the amount (e.g., %). Address is like a
class, object, or C-style struct, whatever you prefer.
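A query sketch showing how those complex types are referenced (the 'Federal Taxes' key is just an assumed deduction name):

SELECT name,
  subordinates[0],              -- first element of the ARRAY
  deductions['Federal Taxes'],  -- MAP lookup by key
  address.city                  -- STRUCT field access
FROM employees;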
44
File & Record Formats
CREATE TABLE employees (...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
All the
defaults for
text files!
Suppose our employees table has a custom format and field delimiters. We can change them,
although here I'm showing all the default values used by Hive!
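For contrast, a hypothetical variant that actually overrides the defaults, using comma-separated fields:

CREATE TABLE employees_csv (
  name STRING,
  salary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;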
45

Select, Where,
Group By, Join,...
46
Common SQL...

You get most of the usual suspects for SELECT, WHERE, GROUP BY, and JOIN.
We'll just highlight a few unique features.
47
ADD JAR MyUDFs.jar;
CREATE TEMPORARY FUNCTION
net_salary
AS 'com.example.NetCalcUDF';
SELECT name,
net_salary(salary, deductions)
FROM employees;
User Defined
Functions
Following a Hive-defined API, implement your own functions, build, put in a jar, and then use
them in your queries. Here we (pretend to) implement a function that takes the employee's
salary and deductions, then computes the net salary.
48
SELECT name, salary
FROM employees
ORDER BY salary ASC;
ORDER BY vs.
SORT BY

A total ordering - one reducer.


SELECT name, salary
FROM employees
SORT BY salary ASC;

A local ordering - sorts within each reducer.


For a giant data set, piping everything through one reducer might take a very long time. A
compromise is to sort locally, so each reducer sorts its output. However, if you structure
your jobs right, you might achieve a total order depending on how data gets to the reducers.
(E.g., each reducer handles a year's worth of data, so joining the files together would be
totally sorted.)
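A sketch of that pattern: DISTRIBUTE BY routes all rows for a given year to the same reducer, and SORT BY orders the rows within each reducer, so each reducer's output is sorted:

SELECT year, hms, uid, message
FROM clicks
DISTRIBUTE BY year
SORT BY year, hms;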

Only equality (x = y).


49
Inner Joins
SELECT ...
FROM clicks a JOIN clicks b ON (
a.uid = b.uid AND a.day = b.day)
WHERE a.process = 'movies'
AND b.process = 'books'
AND a.year > 2012;
Note that the a.year > is in the WHERE clause, not the ON clause for the JOIN. (I'm doing a
correlation query; which users searched for movies and books on the same day?)
Some outer and semi join constructs are supported, as well as some Hadoop-specific
optimization constructs.
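For reference, sketches of the constructs mentioned in those notes (the small_dim table is made up for the map-join example):

-- Outer join: keep rows from the left side even when there is no match.
SELECT a.uid, b.uid
FROM clicks a LEFT OUTER JOIN clicks b
  ON (a.uid = b.uid AND a.day = b.day);

-- Semi join: like IN/EXISTS; only columns from the left side may be selected.
SELECT a.uid, a.message
FROM clicks a LEFT SEMI JOIN clicks b
  ON (a.uid = b.uid AND a.day = b.day);

-- Hadoop-specific optimization hint: load the small table into memory on each mapper.
SELECT /*+ MAPJOIN(d) */ a.uid, d.description
FROM clicks a JOIN small_dim d ON (a.process = d.process);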
50
A Final Example
of Controlling
MapReduce...

Calling out to external programs to perform map and reduce operations.
51
Specify Map & Reduce
Processes
52
Example
FROM (
FROM clicks
MAP message
USING '/tmp/vampire_extractor'
AS item_title, count
CLUSTER BY item_title) it
INSERT OVERWRITE TABLE vampire_stuff
REDUCE it.item_title, it.count
USING '/tmp/thing_counter.py'
AS item_title, counts;
Note the MAP USING and REDUCE USING. We're also using CLUSTER BY (distributing and
sorting on item_title).
53
Example
FROM (
FROM clicks
MAP message
USING '/tmp/vampire_extractor'
AS item_title, count
CLUSTER BY item_title) it
INSERT OVERWRITE TABLE vampire_stuff
REDUCE it.item_title, it.count
USING '/tmp/thing_counter.py'
AS item_title, counts;
Call specific
map and
reduce
processes.
54
And Also:
FROM (
FROM clicks
MAP message
USING '/tmp/vampire_extractor'
AS item_title, count
CLUSTER BY item_title) it
INSERT OVERWRITE TABLE vampire_stuff
REDUCE it.item_title, it.count
USING '/tmp/thing_counter.py'
AS item_title, counts;
Like GROUP
BY, but
directs
output to
specific
reducers.
55
And Also:
FROM (
FROM clicks
MAP message
USING '/tmp/vampire_extractor'
AS item_title, count
CLUSTER BY item_title) it
INSERT OVERWRITE TABLE vampire_stuff
REDUCE it.item_title, it.count
USING '/tmp/thing_counter.py'
AS item_title, counts;
How to
populate an
internal
table.
56
Hive:
Conclusions

Not a real SQL Database.

Transactions, updates, etc.

but features will grow.

High latency queries.

Documentation poor.
57
Hive Disadvantages

Indispensable for SQL users.

Easier than Java MR API.

Makes porting data warehouse apps to Hadoop much easier.
58
Hive Advantages
59
Questions?
dean.wampler@thinkbiganalytics.com
thinkbiganalytics.com
This presentation: github.com/deanwampler/Presentations