Need a solution that can transfer streaming data from servers to HDFS
with minimal delay.
● In HDFS, until a file is closed, its length is reported as zero.
So if a source is writing data into HDFS and the network is interrupted in the
middle of the operation (without the file being closed), the data written to
the file will be lost.
● Goals
Reliability
Scalability
Manageability
Extensibility
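The data-loss risk noted above (data sitting in an unclosed HDFS file) is commonly bounded by telling Flume's HDFS sink to roll (close) files frequently. A minimal sketch, assuming a hypothetical agent named a1 with a sink named k1 (the roll properties themselves are standard HDFS-sink settings):

```properties
# Hypothetical agent "a1", sink "k1".
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
# Roll (close) the current file every 30 seconds, at ~128 MB,
# or after 10000 events -- whichever comes first. Once a file is
# closed its length is visible, so at most one roll interval of
# data is exposed to the "unclosed file" problem.
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 10000
```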
Sink: An interface implementation that can remove events from a channel and
transmit them to the next agent in the flow or to the event's final destination. Sinks
that transmit the event to its final destination are also known as terminal sinks. The
Flume HDFS sink is an example of a terminal sink, while the Flume Avro sink is an
example of a regular sink that can transmit messages to other agents that are
running an Avro source.
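The regular-vs-terminal distinction can be sketched as a config fragment (agent and component names here are hypothetical): an Avro sink on one agent forwards events to a second agent's Avro source, whose HDFS sink is terminal.

```properties
# Agent "a1": Avro sink is a regular sink -- it forwards to another agent.
a1.sinks.forward.type = avro
a1.sinks.forward.hostname = agent2.example.com
a1.sinks.forward.port = 4141

# Agent "a2": Avro source receives what a1 sends; the HDFS sink is terminal.
a2.sources.in.type = avro
a2.sources.in.bind = 0.0.0.0
a2.sources.in.port = 4141
a2.sinks.store.type = hdfs
a2.sinks.store.hdfs.path = /flume/events
```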
Rename
● flume-conf.properties.template to flume-conf.properties
● flume-env.sh.template to flume-env.sh, and set JAVA_HOME in it
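A minimal flume-conf.properties might then look like this — a sketch assuming a hypothetical agent named a1 wired as Netcat source → memory channel → logger sink:

```properties
# Hypothetical single-agent config: netcat -> memory channel -> logger.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger

# Wire source and sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

The agent is then started with something like
flume-ng agent --conf conf --conf-file flume-conf.properties --name a1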
© Copyright, Intellipaat Software Solutions Pvt. Ltd. All rights reserved.
Various Sources/Channels/Sinks
Flume ships with many built-in sources, channels, and sinks; on the sink side,
for example, ElasticSearchSink and Kafka Sink.
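As one example, a Kafka sink can be wired in with a fragment like this (a sketch; the agent/component names and the broker address are hypothetical, while the property keys are the standard Kafka-sink settings):

```properties
# Hypothetical agent "a1": publish channel events to a Kafka topic.
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = broker1:9092
a1.sinks.k1.kafka.topic = flume-events
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.channel = c1
```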
Sqoop
Example:
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  -m 1 \
  --target-dir /tmp/user_log
Example (import only selected columns):
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  -m 1 \
  --target-dir /tmp/user_log \
  --columns "<col1,col2,...>"
Example (import only matching rows):
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  -m 1 \
  --target-dir /tmp/user_log \
  --where "country='JP'"
Syntax:
sqoop export \
  --connect jdbc:mysql://localhost/<db-name> \
  --username <user> \
  --table <table> \
  --export-dir <path>
Example:
sqoop export \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --table user_log \
  --export-dir /tmp/user_log
Impala
• Impala uses an MPP (massively parallel processing) engine to distribute query
processing, while Hive uses MapReduce.
• There is no MapReduce overhead in Impala, while Hive, since it uses
MapReduce, carries that overhead.
• Impala is not fault tolerant, while Hive inherits the usual fault tolerance of
MapReduce.
• MapReduce materializes all intermediate results, which enables better
scalability and fault tolerance (while slowing down data processing); Impala
streams intermediate results between executors (trading off scalability and
fault tolerance for speed).