
Flume



Put in HDFS – Problem Statement
The hdfs put command can be used to transfer files to HDFS, but it has some disadvantages:

● It cannot transfer data in real time; a file must be completely written before it can be copied.

● Web servers generate data continuously, so waiting for a file to finish being generated before transferring it is impractical.

We need a solution that can transfer the "streaming data" from the servers to HDFS
with little delay.

● In HDFS, the length of a file is reported as zero until the file is closed.
So if a source is writing data into HDFS and the network is interrupted in the
middle of the operation (without closing the file), the data written to
the file is lost.
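For reference, a typical invocation of the put command mentioned above looks like the following; the local file and HDFS paths are illustrative, not taken from the slides:

# Copy an already-completed log file from the local disk into HDFS
hdfs dfs -put /var/log/webserver/access.log /user/data/logs/

Note that the file must already be fully written locally, which is exactly the limitation this slide describes.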



What is Flume?
● Flume is a service that lets you ingest streaming data into HDFS
● A distributed, reliable, and available service for moving large amounts of data as it is
produced
● Open-source service, originally developed by Cloudera (now an Apache project)

● Goals
Reliability
Scalability
Manageability
Extensibility



How Flume Works



Flume Components

Agent: An independent process that hosts Flume components such as sources,
channels, and sinks, and thus has the ability to receive, store, and forward events to
their next-hop destination.

Source: An interface implementation that can consume events delivered to it via a
specific mechanism. For example, an Avro source is a source implementation that can
be used to receive Avro events from clients or other agents in the flow. When a
source receives an event, it hands it over to one or more channels.



Flume Components
Channel: A transient store for events, where events are delivered to the channel via
sources operating within the agent. An event put in a channel stays in that channel
until a sink removes it for further transport. An example is the JDBC channel,
which uses a file-system-backed embedded database to persist events until they are
removed by a sink. Channels play an important role in ensuring the durability of the flows.

Sink: An interface implementation that can remove events from a channel and transmit
them to the next agent in the flow or to the event's final destination. Sinks that
transmit an event to its final destination are also known as terminal sinks; the Flume
HDFS sink is one example. The Flume Avro sink is an example of a regular sink that can
transmit messages to other agents that are running an Avro source.
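To make the wiring between agent, source, channel, and sink concrete, here is a minimal sketch of an agent configuration; the agent name a1, the spool directory, and the HDFS path are illustrative assumptions, not values from the slides:

# flume-conf.properties (illustrative): one agent "a1" with a spooling-directory
# source, a memory channel, and an HDFS sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: watch a local directory for completed files
a1.sources.r1.type     = spooldir
a1.sources.r1.spoolDir = /var/flume/incoming
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

# Sink: drain the channel into HDFS
a1.sinks.k1.type          = hdfs
a1.sinks.k1.hdfs.path     = hdfs://namenode:8020/user/flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel       = c1

Such an agent would then be started with a command like:
flume-ng agent --conf conf --conf-file flume-conf.properties --name a1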



Collector



Conf Directory
The conf directory contains the following files:
flume-conf.properties.template
flume-env.sh.template
flume-env.ps1.template
log4j.properties

Rename
● flume-conf.properties.template to flume-conf.properties
● flume-env.sh.template to flume-env.sh, and set JAVA_HOME in it
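A minimal sketch of these setup steps on the command line, assuming FLUME_HOME points at the Flume installation directory and a typical JDK path (both assumptions, not values from the slides):

cd $FLUME_HOME/conf

# Create working copies of the template files
cp flume-conf.properties.template flume-conf.properties
cp flume-env.sh.template flume-env.sh

# Point Flume at the JDK (path is illustrative)
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk' >> flume-env.sh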



Various Sources/Channels/Sinks

Sources:
Avro Source, Thrift Source, Exec Source, JMS Source, Spooling Directory Source,
Twitter 1% firehose Source, Kafka Source, Syslog Sources

Channels:
Memory Channel, JDBC Channel, Kafka Channel, File Channel,
Spillable Memory Channel, Pseudo Transaction Channel

Sinks:
HDFS Sink, Hive Sink, Logger Sink, Avro Sink, Thrift Sink, File Roll Sink,
Null Sink, HBaseSink, AsyncHBaseSink, ElasticSearchSink, Kafka Sink
Sqoop



What is Sqoop?

● Sqoop is a tool to import data from an RDBMS into HDFS
● Open-source tool, originally developed by Cloudera
● Can move data both ways: SQL -> Hadoop and Hadoop -> SQL
● It uses JDBC to talk to databases
● It can import a single table or all tables in a database
● It allows a SELECT query to be used for selective import (see the example below)
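As an illustration of the SELECT-based import in the last bullet, a free-form query import might look like this; the database, table, and target directory reuse the illustrative names from the later slides, and the query itself is an assumption:

sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --query 'SELECT user_id, country FROM user_log WHERE country = "JP" AND $CONDITIONS' \
  -m 1 \
  --target-dir /tmp/user_log_jp

When --query is used instead of --table, Sqoop requires the literal $CONDITIONS token in the WHERE clause and an explicit --target-dir.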



How Sqoop Works

● Sqoop takes a command from the user, analyzes the table, and
generates Java code that takes care of the data import into HDFS
● After the Java code generation, a map-only MapReduce job is run
to import the data
● By default, 4 mappers are run, with each mapper importing roughly 25% of
the data (see the sketch below).
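As a sketch of how this parallelism is controlled, the command below asks for the default four mappers and tells Sqoop which column to split the table on; the column name user_id and the target directory are illustrative assumptions:

sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --split-by user_id \
  -m 4 \
  --target-dir /tmp/user_log_parallel

Sqoop computes the minimum and maximum of the split column and hands each mapper roughly one quarter of that range, which is what "each mapper importing 25% of the data" refers to.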



Sqoop import

Syntax:
sqoop import \
  --connect jdbc:mysql://localhost/<db-name> \
  --username <username> \
  --table <table-name> \
  --m 1 \
  --target-dir <path>

Example:
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --m 1 \
  --target-dir /tmp/user_log



Sqoop import Only Specific Columns

Syntax:
sqoop import \
  --connect jdbc:mysql://localhost/<db-name> \
  --username <username> \
  --table <table-name> \
  --m 1 \
  --target-dir <path> \
  --columns column1,column2

Example:
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --m 1 \
  --target-dir /tmp/user_log \
  --columns country



Sqoop import Only Where Condition

Syntax:
sqoop import \
  --connect jdbc:mysql://localhost/<db-name> \
  --username <username> \
  --table <table-name> \
  --m 1 \
  --target-dir <path> \
  --where "condition"

Example:
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --m 1 \
  --target-dir /tmp/user_log \
  --where "country='JP'"



Sqoop export

Syntax:
sqoop export \
  --connect jdbc:mysql://localhost/<db-name> \
  --username <user> \
  --table <table> \
  --export-dir <path>

Example:
sqoop export \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --export-dir /tmp/user_log
Impala



What is Impala?

● Provides fast, interactive SQL queries directly on data stored in HDFS or
HBase.
● Uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user
interface (the Cloudera Impala query UI in Hue) as Apache Hive. This provides
a familiar and unified platform for real-time or batch-oriented queries.
● Is an addition to the tools available for querying big data. Impala does not
replace the batch processing frameworks built on MapReduce, such as
Hive.
● Hive and other frameworks built on MapReduce are best suited for
long-running batch jobs, such as Extract, Transform, and Load (ETL) workloads.



Impala vs Hive
● Impala is "SQL on HDFS", while Hive is "SQL on Hadoop"
● Impala uses an MPP engine to distribute query processing, while Hive uses
MapReduce
● There is no MapReduce overhead in Impala, while Hive, since it uses MapReduce, carries
that overhead.
● Impala is not fault tolerant, while Hive has the usual fault tolerance of
MapReduce.
● MapReduce materializes all intermediate results, which enables better scalability and fault
tolerance (while slowing down data processing). Impala streams intermediate results
between executors (trading off scalability for speed).
● Impala is recommended for real-time SQL queries, while Hive is
recommended for large batch jobs (see the example below).
● Hive generates query expressions at compile time, whereas Impala does
runtime code generation for "big loops".
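To make the comparison concrete, here is a sketch of the same query issued through each engine's CLI; the table name user_log is reused from the Sqoop examples for illustration only, and it would first have to be registered as a table in the shared metastore:

# Batch-oriented: the query is compiled into MapReduce jobs
hive -e "SELECT country, COUNT(*) FROM user_log GROUP BY country"

# Interactive: the same SQL runs on Impala's MPP daemons, with no MapReduce involved
impala-shell -q "SELECT country, COUNT(*) FROM user_log GROUP BY country"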



Avro Data Formats



What is Avro?

● Apache Avro is a language-independent data serialization system.

● It was developed by Doug Cutting, the creator of Hadoop.

● Hadoop's Writable classes lack language portability.

● Avro is a preferred tool for serializing data in Hadoop.

● It uses JSON for defining data types and protocols (see the sample schema below).
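Since Avro schemas are plain JSON, a minimal illustrative .avsc file might look like the following; the record and field names are assumptions, not taken from the slides:

{
  "type": "record",
  "name": "UserLog",
  "namespace": "com.example.logs",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "country", "type": "string"},
    {"name": "event_time", "type": ["null", "long"], "default": null}
  ]
}

A file like this is what the avro.schema.url property points to in the Hive example on the next slide.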



Avro in Hive

Syntax:
CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/<path>/'
TBLPROPERTIES (
  'avro.schema.url'='hdfs://<path>/<file.avsc>'
);



Avro in Pig

Load syntax:
alias = LOAD '<input-path>'
  USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    'no_schema_check',
    'schema_file', 'hdfs://<schema-path>/<file.avsc>');

Store syntax:
STORE alias INTO '<output-path>'
  USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    'no_schema_check',
    'data', '<path-to-data>');



Thank You

© Copyright, Intellipaat Software Solutions Pvt. Ltd. All rights reserved.
