
Flume



Put in HDFS – Problem Statement
The hdfs put command can be used to transfer files to HDFS, but it has some disadvantages:

● It cannot transfer data in real time; a file must be completely written before it can be copied.

● Web servers generate data continuously, so waiting for a file to finish being generated before transferring it is impractical.

We need a solution that can transfer the "streaming data" from the servers to HDFS
with little delay.

● In HDFS, the length of a file is reported as zero until the file is closed.
So if a source is writing data into HDFS and the network is interrupted in the
middle of the operation (without closing the file), the data written to
the file is lost.
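For reference, a typical invocation of the put command mentioned above looks like the following; the local file and HDFS paths are illustrative, not taken from the slides:

# Copy an already-completed log file from the local disk into HDFS
hdfs dfs -put /var/log/webserver/access.log /user/data/logs/

Note that the file must already be fully written locally, which is exactly the limitation this slide describes.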



What is Flume?
● Flume is a service that lets you ingest streaming data into HDFS
● A distributed, reliable, and available service for moving large amounts of data as it is
produced
● Open-source service, originally developed by Cloudera (now an Apache project)

● Goals
Reliability
Scalability
Manageability
Extensibility



How Flume Works



Flume Components

Agent: An independent process that hosts Flume components such as sources,
channels, and sinks, and thus has the ability to receive, store, and forward events to
their next-hop destination.

Source: An interface implementation that can consume events delivered to it via a
specific mechanism. For example, an Avro source is a source implementation that can
be used to receive Avro events from clients or other agents in the flow. When a
source receives an event, it hands it over to one or more channels.



Flume Components
Channel: A transient store for events, where events are delivered to the channel via
sources operating within the agent. An event put in a channel stays in that channel
until a sink removes it for further transport. An example is the JDBC channel,
which uses a file-system-backed embedded database to persist events until they are
removed by a sink. Channels play an important role in ensuring the durability of the flows.

Sink: An interface implementation that can remove events from a channel and transmit
them to the next agent in the flow or to the event's final destination. Sinks that
transmit an event to its final destination are also known as terminal sinks; the Flume
HDFS sink is one example. The Flume Avro sink is an example of a regular sink that can
transmit messages to other agents that are running an Avro source.
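To make the wiring between agent, source, channel, and sink concrete, here is a minimal sketch of an agent configuration; the agent name a1, the spool directory, and the HDFS path are illustrative assumptions, not values from the slides:

# flume-conf.properties (illustrative): one agent "a1" with a spooling-directory
# source, a memory channel, and an HDFS sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: watch a local directory for completed files
a1.sources.r1.type     = spooldir
a1.sources.r1.spoolDir = /var/flume/incoming
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

# Sink: drain the channel into HDFS
a1.sinks.k1.type          = hdfs
a1.sinks.k1.hdfs.path     = hdfs://namenode:8020/user/flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel       = c1

Such an agent would then be started with a command like:
flume-ng agent --conf conf --conf-file flume-conf.properties --name a1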



Collector



Conf Directory
The conf directory contains the following files:
flume-conf.properties.template
flume-env.sh.template
flume-env.ps1.template
log4j.properties

Rename
● flume-conf.properties.template to flume-conf.properties
● flume-env.sh.template to flume-env.sh, and set JAVA_HOME in it
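A minimal sketch of these setup steps on the command line, assuming FLUME_HOME points at the Flume installation directory and a typical JDK path (both assumptions, not values from the slides):

cd $FLUME_HOME/conf

# Create working copies of the template files
cp flume-conf.properties.template flume-conf.properties
cp flume-env.sh.template flume-env.sh

# Point Flume at the JDK (path is illustrative)
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk' >> flume-env.sh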



Various Sources/Channels/Sinks

Sources:
Avro Source, Thrift Source, Exec Source, JMS Source, Spooling Directory Source,
Twitter 1% firehose Source, Kafka Source, Syslog Sources

Channels:
Memory Channel, JDBC Channel, Kafka Channel, File Channel,
Spillable Memory Channel, Pseudo Transaction Channel

Sinks:
HDFS Sink, Hive Sink, Logger Sink, Avro Sink, Thrift Sink, File Roll Sink,
Null Sink, HBaseSink, AsyncHBaseSink, ElasticSearchSink, Kafka Sink
Sqoop



What is Sqoop?

● Sqoop is a tool to import data from an RDBMS into HDFS
● Open-source tool, originally developed by Cloudera
● Can move data both ways: SQL -> Hadoop and Hadoop -> SQL
● It uses JDBC to talk to databases
● It can import a single table or all tables in a database
● It allows a SELECT query to be used for selective import (see the example below)
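As an illustration of the SELECT-based import in the last bullet, a free-form query import might look like this; the database, table, and target directory reuse the illustrative names from the later slides, and the query itself is an assumption:

sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --query 'SELECT user_id, country FROM user_log WHERE country = "JP" AND $CONDITIONS' \
  -m 1 \
  --target-dir /tmp/user_log_jp

When --query is used instead of --table, Sqoop requires the literal $CONDITIONS token in the WHERE clause and an explicit --target-dir.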



How Sqoop Works

● Sqoop takes a command from the user, analyzes the table, and
generates Java code that takes care of the data import into HDFS
● After the Java code generation, a map-only MapReduce job is run
to import the data
● By default, 4 mappers are run, with each mapper importing roughly 25% of
the data (see the sketch below).
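As a sketch of how this parallelism is controlled, the command below asks for the default four mappers and tells Sqoop which column to split the table on; the column name user_id and the target directory are illustrative assumptions:

sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --split-by user_id \
  -m 4 \
  --target-dir /tmp/user_log_parallel

Sqoop computes the minimum and maximum of the split column and hands each mapper roughly one quarter of that range, which is what "each mapper importing 25% of the data" refers to.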



Sqoop import

Syntax:
sqoop import \
  --connect jdbc:mysql://localhost/<db-name> \
  --username <username> \
  --table <table-name> \
  --m 1 \
  --target-dir <path>

Example:
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --m 1 \
  --target-dir /tmp/user_log



Sqoop import Only Specific Columns

Syntax:
sqoop import \
  --connect jdbc:mysql://localhost/<db-name> \
  --username <username> \
  --table <table-name> \
  --m 1 \
  --target-dir <path> \
  --columns column1,column2

Example:
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --m 1 \
  --target-dir /tmp/user_log \
  --columns country



Sqoop import Only Where Condition

Syntax:
sqoop import \
  --connect jdbc:mysql://localhost/<db-name> \
  --username <username> \
  --table <table-name> \
  --m 1 \
  --target-dir <path> \
  --where "condition"

Example:
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --m 1 \
  --target-dir /tmp/user_log \
  --where "country='JP'"



Sqoop export

Syntax:
sqoop export \
  --connect jdbc:mysql://localhost/<db-name> \
  --username <user> \
  --table <table> \
  --export-dir <path>

Example:
sqoop export \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --export-dir /tmp/user_log
Impala



What is Impala?

● Provides fast, interactive SQL queries directly on data stored in HDFS or
HBase.
● Uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user
interface (the Cloudera Impala query UI in Hue) as Apache Hive. This provides
a familiar and unified platform for real-time or batch-oriented queries.
● Is an addition to the tools available for querying big data. Impala does not
replace the batch processing frameworks built on MapReduce, such as
Hive.
● Hive and other frameworks built on MapReduce are best suited for
long-running batch jobs, such as Extract, Transform, and Load (ETL) workloads.



Impala vs Hive
● Impala is "SQL on HDFS", while Hive is "SQL on Hadoop"
● Impala uses an MPP engine to distribute query processing, while Hive uses
MapReduce
● There is no MapReduce overhead in Impala, while Hive, since it uses MapReduce, carries
that overhead.
● Impala is not fault tolerant, while Hive has the usual fault tolerance of
MapReduce.
● MapReduce materializes all intermediate results, which enables better scalability and fault
tolerance (while slowing down data processing). Impala streams intermediate results
between executors (trading off scalability for speed).
● Impala is recommended for real-time SQL queries, while Hive is
recommended for large batch jobs (see the example below).
● Hive generates query expressions at compile time, whereas Impala does
runtime code generation for "big loops".
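To make the comparison concrete, here is a sketch of the same query issued through each engine's CLI; the table name user_log is reused from the Sqoop examples for illustration only, and it would first have to be registered as a table in the shared metastore:

# Batch-oriented: the query is compiled into MapReduce jobs
hive -e "SELECT country, COUNT(*) FROM user_log GROUP BY country"

# Interactive: the same SQL runs on Impala's MPP daemons, with no MapReduce involved
impala-shell -q "SELECT country, COUNT(*) FROM user_log GROUP BY country"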



Avro Data Formats



What is Avro?

● Apache Avro is a language-independent data serialization system.

● It was developed by Doug Cutting, the creator of Hadoop.

● Hadoop's Writable classes lack language portability.

● Avro is a preferred tool for serializing data in Hadoop.

● It uses JSON for defining data types and protocols (see the sample schema below).
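Since Avro schemas are plain JSON, a minimal illustrative .avsc file might look like the following; the record and field names are assumptions, not taken from the slides:

{
  "type": "record",
  "name": "UserLog",
  "namespace": "com.example.logs",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "country", "type": "string"},
    {"name": "event_time", "type": ["null", "long"], "default": null}
  ]
}

A file like this is what the avro.schema.url property points to in the Hive example on the next slide.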



Avro in Hive

Syntax:
CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/<path>/'
TBLPROPERTIES (
  'avro.schema.url'='hdfs://<path>/<file.avsc>'
);



Avro in Pig

Load syntax:
alias = LOAD '<input-path>'
  USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    'no_schema_check',
    'schema_file', 'hdfs://<schema-path>/<file.avsc>');

Store syntax:
STORE alias INTO '<output-path>'
  USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    'no_schema_check',
    'data', '<path-to-data>');



Thank You

© Copyright, Intellipaat Software Solutions Pvt. Ltd. All rights reserved.
