Sie sind auf Seite 1von 46

Big Data Hadoop and Spark Developer

Lesson 4—Basics of Hive and Impala

© Simplilearn. All rights reserved.


Learning Objectives

Identify the features of Hive and Impala

Understand the methods to interact with Hive and Impala


Basics of Hive and Impala
Topic 1: Features of Hive and Impala
Introduction to Hive and Impala

Batch Interactive
Processing SQL

SELECT t1.a1 as c1, t2.b1 as c2

FROM t1 JOIN t2 ON (t1.a2=t2.b2);


• Hive and Impala provide an SQL-like interface
for users to extract data from the Hadoop
system.
Resource Management • They reside on top of Hadoop and can be used
to query data from the underlying storage
Storage components.
HDFS HBase
Hive and Impala: Similarities

• Hive is very similar to Impala in the following ways:


Hive and Impala: Differences

Hive was developed by Facebook around 2007. Impala was developed by Cloudera around
2012.

It is an Open source Apache project. It is an incubating Apache project.

It has a high level abstraction layer on top of It has a high performance dedicated SQL
MapReduce and Apache Spark.
engine.
It uses HiveQL to query the structured data in a
metastore. It uses Impala SQL for ad hoc queries.

It is suitable for structured data. It is designed for high concurrency and ad


z
hoc queries.
Hive and Impala: Comparison

Hive Impala

• Comprises a specialized SQL


• Provides more features than engine that offers five to fifty
Impala times faster performance
than Hive

• Is highly extensible
• Used mainly for interactive
queries and data analysis
• Used mostly for batch
processing • Accommodates many
concurrent users
Relational Databases vs. Hive vs. Impala
Use Case: Hive and Impala

Hive and Impala are commonly used to analyze social media coverage.
Basics of Hive and Impala
Topic 2: Interacting with Hive and Impala
Executing a Query in Hive and Impala

Receive SQL query Receive SQL query

Parse Hive QL 1 Parse Impala SQL

Make optimizations 2 Make optimizations

Plan execution 3 Plan execution

Submit job(s) to cluster 4 Execute query on cluster

Monitor progress 5 Store the data in HDFS

Process data—
6
MapReduce or Apache Spark

Store the data in HDFS 7


Hive Query Editor
Interfaces to Run Hive and Impala Queries

Hive and Impala offer numerous interfaces to run queries:

• Command-line shell:
– Impala: Impala shell
– Hive: Beeline
Impala Query Editor
• Hue Web UI:
– Hive Query Editor
– Impala Query Editor

• Metastore Manager:
– ODBC/JDBC
Impala Lab Access Details

• The steps to start Impala in lab are as follows:

Step 1 Step 2

• Log in to cloud lab • Connect to any


web console with daemon server with
your credentials the help of the
command below:
•impala-shell -i
cloudera-
slavenode3.cloudlab.
com
Demo
Starting Impala Lab

Demonstrate the method to start and connect to the Impala lab from command.
Impala Lab Access Details
Connecting with Hive and Impala Shell

• To execute Impala commands from Impala shell:

• To run Hive using Beeline:


Running Impala Queries from Command Line

To check all options of Impala using the help option: Impala-shell –help

Impala-shell –q ‘select *
To run direct queries from shell using the –q option: from simple’

Impala-shell –d
To issue a use database on startup using the –d option: Simplilearn
Demo
Connecting with Hive and Impala Shell

Demonstrate the method to connect with Hive and Impala shell, along with some basic
operations.
Sample Queries

SELECT version();
To explore a new Impala instance:
SELECT current_database();

CREATE DATABASE IF NOT EXISTS


To create a database: sample;

To verify a database: SHOW databases;

To specify the location where the database is CREATE DATABASE IF NOT EXISTS
to be created: database_name LOCATION hdfs_path;
Sample Queries

To switch the current session to another


USE db_name;
database:

CREATE TABLE stockprice


(stock_id INT,
date STRING,
open_price FLOAT,
high_price FLOAT,
low_price FLOAT,
close_price FLOAT,
To create a table in Parquet format: stock_volume INT,
adjclose_price FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION
'/home/singh25nov_gmail/input'
stored as parquet ;
Sample Queries

CREATE EXTERNAL TABLE stop_loss


(
stock_id INT,
stock_volume FLOAT,
stock_current_rate DOUBLE,
To load csv data from local files: stock_trigger_price DOUBLE
)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','
LOCATION
'/user/cloudera/sample_data/tab1';

To list all tables in the current database in SHOW tables;


Impala:
Sample Queries

INSERT INTO stockprice


(date,open_price,high_price,low_price,c
To insert a single row: lose_price,stock_volume,adjclose_price
) VALUES ('15112017',102
,105,98.6,100154711,100);

impala-shell -i <impala-daemon-uri> -f
To migrate from SQL:
<filename>.sql;
Sample Queries

SELECT stockprice.open_price,
MAX(stockprice.stock_volume),
MIN(stop_loss.stock_current_rate)
To aggregate and join: FROM stop_loss JOIN stockprice USING
(stock_id)
GROUP BY high_price ORDER BY 1
LIMIT 5;

DROP (DATABASE|SCHEMA) [IF EXISTS]

To drop a database: database_name [RESTRICT | CASCADE]


[LOCATION hdfs_path];
Sample Queries

• Interactive mode:
SELECT count(*) FROM stockprice;

To query the Impala table: • Set of commands contained in a file:


impala-shell-i impala-host -f <filename>.sql;
• Single command to the impala-shell:
impala-shell-i impala-host-q 'select count(*) from stockprice‘;
Executing Queries in the Impala Shell

localhost.localdomain:21000] > select * from webpage where page_id > 40


> LIMIT 5;
Query: select * from webpage where page_id > 40
LIMIT 5
+---------+--------------------------+--------------------------------------+
| page_id | name | assoc_files |
+---------+--------------------------+--------------------------------------+
| 41 | sorrento_f31l_sales.html | theme2.css,code.js,sorrento_f31l.jpg |
| 42 | sorrento_f40l_sales.html | theme1.css,code.js,sorrento_f40l.jpg |
| 43 | sorrento_f01l_sales.html | theme1.css,code.js,sorrento_f01l.jpg |
| 44 | sorrento_f11l_sales.html | theme1.css,code.js,sorrento_f11l.jpg |
| 45 | titanic_2400_sales.html | theme1.css,code.js,titanic_2400.jpg |
+---------+--------------------------+--------------------------------------+
Fetched 5 row(s) in 1.32s
Demo
Impala Queries

Demonstrate the sample Impala queries.


Running Hive Queries Using Beeline

• The character “!” is used to execute Beeline


commands.

The commands used to run Beeline:


• !exit: Used to exit the shell
• !help: Shows list of all commands
• !verbose: Shows added details of queries
Demo
Running Hive Queries Using Beeline

Demonstrate the method to connect with Beeline and execute basic queries.
Running Beeline from Command Line

beeline –u … -f
To execute file using the –u option: simplilearn.hql

To use HiveQL directly from the command line using the -e beeline –u ... -e 'SELECT *
option: FROM users‘

To continue running script even after an error: beeline –u … -force=TRUE


Running Hive Query

Hive> select * from device


> LIMIT 5;
OK • All SQL commands are terminated
1 2008-10-21 00:00:00 Sorrento F00L phone with a semicolon “;”
2 2010-04-19 00:00:00 Titanic 2100 phone
3 2011-02-18 00:00:00 MeeToo 3.0 phone
4 2011-09-21 00:00:00 MeeToo 3.1 phone
5 2008-10-21 00:00:00 iFruit 1 phone
Time taken: 0.296 seconds, Fetched: 5 row(s)
Connecting Hive and Impala Shell with Hue

• Hue can be used to write Hive and Impala


queries from the User Interface.
Demo
Connecting Hive and Impala Shell with Hue

Demonstrate the method to connect Hive and Impala shell using Hue.
Hive and Impala Editors in Hue

Diagram 1 Diagram 2
Key Takeaways

Hive and Impala are tools to perform SQL queries on data residing on HDFS
or HBase.

Hive and Impala are easy to learn for experienced SQL developers.

Hive and Impala solve the Big Data problem but cannot replace a traditional
RDBMS.

Hive runs MapReduce or Spark jobs on Hadoop based on HiveQL statements.

Impala uses a very fast specialized SQL engine that is faster than MapReduce.
Quiz
QUIZ
Which of the following components can be used to accept command inputs from users?
1

a. Command Line Interface

b. Query compiler

c. Execution engine

d. Thrift server
QUIZ
Which of the following components can be used to accept command inputs from users?
1

a. Command Line Interface

b. Query compiler

c. Execution engine

d. Thrift server

The correct answer is a.


The Command Line Interface is used as an input medium to accept command input from users.
QUIZ
Hive can be accessed from Hue using ________.
2

a. Impala editor

b. Hive Editor

c. File browser

d. YARN UI
QUIZ
Hive can be accessed from Hue using ________.
2

a. Impala editor

b. Hive Editor

c. File browser

d. YARN UI

The correct answer is b.


Hive can be accessed through the Hive editor in Hue.
QUIZ
Impala can be accessed from Hue using ________.
3

a. Impala editor

b. Hive Editor

c. File browser

d. YARN UI
QUIZ
Impala can be accessed from Hue using ________.
3

a. Impala editor

b. Hive Editor

c. File browser

d. YARN UI

The correct answer is a.


Impala can be accessed through the Impala editor in Hue.
QUIZ
Updating an individual record is possible in______.
4

a. Impala

b. Hive

c. RDBMS

d. All of the above


QUIZ
Updating an individual record is possible in______.
4

a. Impala

b. Hive

c. RDBMS

d. All of the above

The correct answer is c.


Hive and Impala cannot update individual records, but an RDBMS can.
QUIZ
Deleting an individual record is possible in_______.
5

a. RDBMS

b. Hive

c. Impala

d. All of the above


QUIZ
Deleting an individual record is possible in_______.
5

a. RDBMS

b. Hive

c. Impala

d. All of the above

The correct answer is a.


Hive and Impala cannot delete individual records, but an RDBMS can.
This concludes “Basics of Hive and Impala.”
The next lesson is “Working with Hive and Impala.”

©Simplilearn. All rights reserved