
Apache Sqoop

BY DAVIN.J.ABRAHAM
What is Sqoop

 Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
 Sqoop imports data from external structured datastores into HDFS or
related systems like Hive and HBase.
 Sqoop can also be used to export data from Hadoop to external structured datastores such as relational databases and enterprise data warehouses.
 Sqoop works with relational databases such as: Teradata, Netezza,
Oracle, MySQL, Postgres, and HSQLDB.
Why Sqoop?

 As more organizations deploy Hadoop to analyse vast streams of information, they may find they need to transfer large amounts of data between Hadoop and their existing databases, data warehouses and other data sources.
 Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on a large cluster, is challenging: transferring data with ad-hoc scripts is inefficient and time-consuming.
Hadoop-Sqoop?

 Hadoop is great for storing massive volumes of data in HDFS
 It provides a scalable processing environment for structured and unstructured data
 But it is batch-oriented, and thus not suitable for low-latency, interactive query operations
 Sqoop is essentially an ETL tool used to copy data between HDFS and SQL databases:
 Import SQL data into HDFS for archival or analysis
 Export HDFS data to SQL (e.g. summarized data used in a DW fact table), as sketched below
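A minimal sketch of the two directions, assuming a purely illustrative MySQL host, database, table and HDFS path (the real syntax is walked through in the following slides):

$ sqoop import \
--connect jdbc:mysql://mysql-server/sales \
--username myUID \
--password myPWD \
--table orders \
--target-dir /user/me/sqoop-mysql/orders

$ sqoop export \
--connect jdbc:mysql://mysql-server/sales \
--username myUID \
--password myPWD \
--table orders_summary \
--export-dir /user/me/sqoop-mysql/orders_summary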
What Sqoop Does

 Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:
 Allows data imports from external datastores and enterprise data
warehouses into Hadoop
 Parallelizes data transfer for fast performance and optimal system utilization (see the sketch after this list)
 Copies data quickly from external systems to Hadoop
 Makes data analysis more efficient
 Mitigates excessive loads to external systems.
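That parallelism is controlled from the command line: -m sets the number of map tasks and --split-by names the column used to partition the work between them. A minimal sketch, reusing the employees database that appears later in this deck; the credentials and target directory are placeholders:

$ sqoop import \
--connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--table employees \
--split-by emp_no \
-m 4 \
--target-dir /user/airawat/sqoop-mysql/employees-parallel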
How Sqoop Works

 Sqoop provides a pluggable connector mechanism for optimal connectivity to external systems.
 The Sqoop extension API provides a convenient framework for
building new connectors which can be dropped into Sqoop
installations to provide connectivity to various systems.
 Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems (see the sketch below).
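By default Sqoop picks a connector based on the JDBC connect string, but the generic --driver (and --connection-manager) arguments let you name one explicitly. A minimal sketch, with a deliberately made-up JDBC URL and driver class standing in for a system Sqoop has no specialized connector for:

$ sqoop import \
--connect jdbc:somedb://db-server/sales \
--driver com.somedb.jdbc.Driver \
--username myUID \
--password myPWD \
--table orders \
--target-dir /user/me/sqoop-mysql/orders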
Who Uses Sqoop?

 Online marketer Coupons.com uses Sqoop to exchange data between Hadoop and its IBM Netezza data warehouse appliance. The organization can query its structured databases and pipe the results into Hadoop using Sqoop.
 Education company the Apollo Group also uses the software, not only to extract data from databases but also to inject the results of Hadoop jobs back into relational databases
 And countless other Hadoop users use Sqoop to move their data efficiently
Importing Data - Listing the databases on your MySQL server
$ sqoop list-databases \
--connect jdbc:mysql://<<mysql-server>>/employees \
--username airawat \
--password myPassword
.
.
.
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL
streaming resultset.
information_schema
employees
test
Listing the tables in your MySQL database
$ sqoop list-tables \
--connect jdbc:mysql://<<mysql-server>>/employees \
--username airawat \
--password myPassword
.
.
.
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL
streaming resultset.
departments
dept_emp
dept_manager
employees
employees_exp_stg
employees_export
salaries
titles
Importing data from MySQL into HDFS

 Replace "airawat-mySqlServer-node" with the host name of the node running the MySQL server, and adjust the login credentials and target directory as needed.
Importing a table into HDFS - basic import
$ sqoop import \
--connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--table employees \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/employees
.
.
.
.9139 KB/sec)
13/05/31 22:32:25 INFO mapreduce.ImportJobBase: Retrieved 300024
records
Executing imports with an options
file for static information
 Rather than repeating the import command along with the connection-related input each time, you can pass an options file as an argument to sqoop.
 Create a text file as follows and save it locally on the node you are running the Sqoop client on.
Sample options file:
___________________________________________________________________________
$ vi SqoopImportOptions.txt
#
#Options file for sqoop import
#

import
--connect
jdbc:mysql://airawat-mySqlServer-node/employees
--username
myUID
--password
myPwd

#
#All other commands should be specified in the command line
Options File - Command

The command

$ sqoop --options-file SqoopImportOptions.txt \
--table departments \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/departments
.
.
.
13/05/31 22:48:55 INFO mapreduce.ImportJobBase: Transferred 153 bytes
in 26.2453 seconds (5.8296 bytes/sec)
13/05/31 22:48:55 INFO mapreduce.ImportJobBase: Retrieved 9 records.

The -m argument specifies the number of mappers. The departments table has only a handful of records, so I am setting it to 1.
The Files Created in HDFS
Files created in HDFS:

$ hadoop fs -ls -R sqoop-mysql/


drwxr-xr-x - airawat airawat 0 2013-05-31 22:48 sqoop-mysql/departments
-rw-r--r-- 3 airawat airawat 0 2013-05-31 22:48 sqoop-mysql/departments/_SUCCESS
drwxr-xr-x - airawat airawat 0 2013-05-31 22:48 sqoop-mysql/departments/_logs
drwxr-xr-x - airawat airawat 0 2013-05-31 22:48 sqoop-mysql/departments/_logs/history
-rw-r--r-- 3 airawat airawat 79467 2013-05-31 22:48 sqoop-mysql/departments/_logs/history/cdh-jt01_1369839495962_job_201305290958_0062_conf.xml
-rw-r--r-- 3 airawat airawat 12441 2013-05-31 22:48 sqoop-mysql/departments/_logs/history/job_201305290958_0062_1370058514473_airawat_departments.jar
-rw-r--r-- 3 airawat airawat 153 2013-05-31 22:48 sqoop-mysql/departments/part-m-00000
To View the contents of a table

Data file contents:

$ hadoop fs -cat sqoop-mysql/departments/part-m-00000 | more

d009,Customer Service
d005,Development
d002,Finance
d003,Human Resources
d001,Marketing
d004,Production
d006,Quality Management
d008,Research
d007,Sales
Import All Rows but Specific Columns

$ sqoop --options-file SqoopImportOptions.txt \
--table dept_emp \
--columns "EMP_NO,DEPT_NO,FROM_DATE,TO_DATE" \
--as-textfile \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/DeptEmp
Import All Columns but Specific Rows, Using a Where Clause

Import all columns, filtering rows with a where clause


$ sqoop --options-file SqoopImportOptions.txt \
--table employees \
--where "emp_no > 499948" \
--as-textfile \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/employeeGtTest
Import - Free Form Query

Import with a free-form query and a where clause (see the note on parallel execution below)


$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where EMP_NO < 20000 AND $CONDITIONS' \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/employeeFrfrmQry1
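Sqoop substitutes its own condition expressions for the $CONDITIONS token at run time; when a free-form query is imported with more than one mapper, --split-by must also be given so that each map task receives a different range of rows. A minimal sketch reusing the options file above (the target directory is just an example):

$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
--split-by EMP_NO \
-m 4 \
--target-dir /user/airawat/sqoop-mysql/employeeFrfrmQryParallel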
Import Without a Where Clause

Import with a free-form query without a where clause


$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/employeeFrfrmQrySmpl2
Export: Create Sample Table employees_export

Create a table in MySQL:

mysql> CREATE TABLE employees_export (
emp_no int(11) NOT NULL,
birth_date date NOT NULL,
first_name varchar(14) NOT NULL,
last_name varchar(16) NOT NULL,
gender enum('M','F') NOT NULL,
hire_date date NOT NULL,
PRIMARY KEY (emp_no)
);
Import Employees into HDFS to Demonstrate Export
Import some data into HDFS:

$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,birth_date,first_name,last_name,gender,hire_date from employees where $CONDITIONS' \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/Employees
Export – Create a Staging Table

Create a staging table in MySQL:

mysql> CREATE TABLE employees_exp_stg (
emp_no int(11) NOT NULL,
birth_date date NOT NULL,
first_name varchar(14) NOT NULL,
last_name varchar(16) NOT NULL,
gender enum('M','F') NOT NULL,
hire_date date NOT NULL,
PRIMARY KEY (emp_no)
);
The Export Command
$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export \
--staging-table employees_exp_stg \
--clear-staging-table \
-m 4 \
--export-dir /user/airawat/sqoop-mysql/Employees
.
.
.
13/06/04 09:54:18 INFO manager.SqlManager: Migrated 300024
records from `employees_exp_stg` to `employees_export`
Results of Export

Results

mysql> select * from employees_export limit 1;


+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-----------+--------+------------+
| 200000 | 1960-01-11 | Selwyn | Koshiba | M | 1987-06-05 |
+--------+------------+------------+-----------+--------+------------+

mysql> select count(*) from employees_export;


+----------+
| count(*) |
+----------+
| 300024 |
+----------+

mysql> select * from employees_exp_stg;


Empty set (0.00 sec)
Export – Update Mode

Export in update mode

Prep:
I am going to set hire_date to null for some records, to try this functionality out.

mysql> update employees_export set hire_date = null where emp_no > 400000;
Query OK, 99999 rows affected, 65535 warnings (1.26 sec)
Rows matched: 99999 Changed: 99999 Warnings: 99999
Now to see if the update worked

Sqoop command:
Next, we will export the same data to the same table, and see if the hire date is
updated.

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export \
--direct \
--update-key emp_no \
--update-mode updateonly \
--export-dir /user/airawat/sqoop-mysql/Employees
It Worked!

Results:
mysql> select count(*) from employees_export where hire_date
is null;
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row in set (0.22 sec)
Export in upsert (Update+Insert)
mode
Upsert = insert if the row does not exist, update if it exists.
Upsert Command

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export \
--update-key emp_no \
--update-mode allowinsert \
--export-dir /user/airawat/sqoop-mysql/Employees
Exports may Fail due to

 Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
 Attempting to INSERT a row which violates a consistency constraint
(for example, inserting a duplicate primary key value)
 Attempting to parse an incomplete or malformed record from the
HDFS source data
 Attempting to parse records using incorrect delimiters
 Capacity issues (such as insufficient RAM or disk space)
Sqoop up Healthcare?

 Most hospitals today store patient information in relational databases
 In order to analyse this data and gain some insight from it, we need to get it into Hadoop.
 Sqoop makes that process very efficient, as sketched below.
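A minimal sketch of what such an import could look like; the host, database, table and column names here (hospital-db-server, hospitaldb, patients, patient_id, admission_date) are purely hypothetical:

$ sqoop import \
--connect jdbc:mysql://hospital-db-server/hospitaldb \
--username myUID \
--password myPWD \
--table patients \
--where "admission_date >= '2013-01-01'" \
--split-by patient_id \
-m 4 \
--target-dir /user/airawat/sqoop-mysql/patients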
Thank You For Your Time
