Declaration:
Q1: In the Big Data world, one of the main jobs is to collect, aggregate, and move data from a
single source or many sources to a centralized data store or multiple destinations. Based on
this, you are given one scenario where you are supposed to replicate data from a terminal with
four channels and two sinks. Support your answer with a screenshot of the CLI fetching data from
HDFS out of four separate directories. The input for your problem is given below:
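The input screenshots themselves are not reproduced here. A minimal sketch of such a replicating agent is shown below; the agent name agent1, the netcat port 44444 and the HDFS paths /user/flume/dir1 to /user/flume/dir4 are assumptions rather than values taken from the assignment, and the sketch attaches one HDFS sink per channel so that data can be read back from four separate directories (with only two sinks, the same pattern applies with sink3 and sink4 removed).

agent1.sources = src1
agent1.channels = ch1 ch2 ch3 ch4
agent1.sinks = sink1 sink2 sink3 sink4

# Netcat source listening on a local port; the replicating selector
# copies every event into all four channels
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.selector.type = replicating
agent1.sources.src1.channels = ch1 ch2 ch3 ch4

# Four in-memory channels
agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory
agent1.channels.ch3.type = memory
agent1.channels.ch4.type = memory

# One HDFS sink per channel, each writing to its own directory as plain text
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = /user/flume/dir1
agent1.sinks.sink1.hdfs.fileType = DataStream

agent1.sinks.sink2.type = hdfs
agent1.sinks.sink2.channel = ch2
agent1.sinks.sink2.hdfs.path = /user/flume/dir2
agent1.sinks.sink2.hdfs.fileType = DataStream

agent1.sinks.sink3.type = hdfs
agent1.sinks.sink3.channel = ch3
agent1.sinks.sink3.hdfs.path = /user/flume/dir3
agent1.sinks.sink3.hdfs.fileType = DataStream

agent1.sinks.sink4.type = hdfs
agent1.sinks.sink4.channel = ch4
agent1.sinks.sink4.hdfs.path = /user/flume/dir4
agent1.sinks.sink4.hdfs.fileType = DataStream

The agent would then be started with flume-ng agent -n agent1 -f <conf file> --conf $FLUME_HOME/conf before the steps below.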
2. After that we will connect to the localhost.
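The connection command is not shown in the text; for a netcat source it would typically be a telnet session against the port the source listens on (44444 is the assumed port from the sketch above):

telnet localhost 44444

Everything typed into this session is delivered to the Flume source as events and replicated to all channels.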
4. By using this command we can see the data inside the directory:
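Assuming the HDFS directories from the sketch above, the command would be along these lines (and likewise for dir2, dir3 and dir4):

hadoop fs -ls /user/flume/dir1
hadoop fs -cat /user/flume/dir1/FlumeData.*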
Q2: Apache Flume is used for data ingestion; it helps fetch data from various data
generators and store it on the Hadoop Distributed File System. You are required
to create one directory on the Ubuntu local file system and then treat this directory as the source for
three files named temp1.txt, temp2.txt and temp3.txt. Use Apache Flume to spool this data
from the source to an HDFS sink. Support your answer with a screenshot of the CLI fetching the same
text files on HDFS.
First we need to create the configuration file. To create it we go to the Flume directory and open the
conf directory, or use a command like the one below:
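The exact command from the screenshot is not reproduced here; a typical way to create and edit the file is (the file name spool.conf and the editor are assumptions):

cd $FLUME_HOME/conf
gedit spool.conf

A minimal spooling-directory configuration matching the steps described below could look like this; the agent name agent1 and the HDFS path /user/flume/spool are assumptions, and the spool directory points at the myflume directory created in the next step:

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Spooling directory source: watches the local myflume directory for new files
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /home/<your-user>/myflume
agent1.sources.src1.channels = ch1

# In-memory channel
agent1.channels.ch1.type = memory

# HDFS sink: writes the spooled events into HDFS as plain text
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = /user/flume/spool
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text

The agent is started with flume-ng agent -n agent1 -f spool.conf --conf $FLUME_HOME/conf before files are dropped into myflume.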
After creating the configuration file we need to create one directory named myflume (as
written in the conf file) in the home directory.
After creating the directory (myflume) we will move one file named temp1.txt into it, for example with the commands below.
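The exact commands from the screenshot are not reproduced here; they would be along these lines (the user name in the path is a placeholder, and temp1.txt is assumed to be in the current directory):

mkdir /home/<your-user>/myflume
mv temp1.txt /home/<your-user>/myflume/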
Now we can see that temp1.txt is shown as completed: the spooling directory source renames a file with a .COMPLETED suffix once it has been fully ingested.
Now we will look into the Hadoop directory by using the following command:
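The command itself is not reproduced in the screenshot; assuming the HDFS path from the configuration sketch above, it would be along the lines of:

hadoop fs -ls /user/flume/spool
hadoop fs -cat /user/flume/spool/FlumeData.*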
As we can see, the spooling process is running. The benefit of spooling is that it fetches all the files in
the local filesystem directory acting as the source and aggregates them as a single file inside HDFS.
Now we can see all three text files, named temp1.txt, temp2.txt and temp3.txt, fetched from the
local file system to HDFS using the spooling source.
Q3: Explain the steps required for the installation of Apache Flume and fetch Twitter data
to store it in HDFS.
Apache Flume:
• Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates and transports
large amounts of streaming data, such as log files and events, from various sources like network
traffic, social media and email messages to HDFS.
• The main idea behind Flume's design is to capture streaming data from various web
servers and deliver it to HDFS.
Steps to install Apache Flume and perform Twitter sentiment analysis using Apache Flume
Download the custom flume-sources-1.0-SNAPSHOT.jar from the link below and copy it into Flume's lib directory:
https://drive.google.com/file/d/0B_t6uqPmWadsdWJNQ0NjaXBUYUk/view?usp=sharing
cp flume-sources-1.0-SNAPSHOT.jar $FLUME_HOME/lib
4. Add it to the Flume classpath in the conf/flume-env.sh file, as shown below. Add the Java
home path too:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
FLUME_CLASSPATH=/home/mukul/apache-flume-1.9.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar
6. Create a flume-twitter.conf file in the conf folder and paste the configuration lines given below:
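The exact lines from the screenshot are not reproduced here; the widely used configuration for the Cloudera TwitterSource provided by flume-sources-1.0-SNAPSHOT.jar looks like the following (the four credential values are placeholders you must fill in from your Twitter developer account, and the keywords and HDFS path are assumptions):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Custom Twitter source from flume-sources-1.0-SNAPSHOT.jar
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, bigdata, flume

# HDFS sink where the raw tweets are written as plain text
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# In-memory channel between the source and the sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100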
8. Rename these three files in the lib folder of Flume (all you need to do is change the
extension of these files from .jar to .org):
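The screenshot listing the three files is not reproduced here; in the standard version of this tutorial they are the twitter4j jars that ship with Flume, renamed so they do not clash with the classes bundled inside flume-sources-1.0-SNAPSHOT.jar (the exact version number may differ in your installation):

cd $FLUME_HOME/lib
mv twitter4j-core-3.0.3.jar twitter4j-core-3.0.3.org
mv twitter4j-media-support-3.0.3.jar twitter4j-media-support-3.0.3.org
mv twitter4j-stream-3.0.3.jar twitter4j-stream-3.0.3.org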
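The step that starts the agent is not shown in the text; assuming the configuration file and agent name used above, it would look like:

$FLUME_HOME/bin/flume-ng agent --conf $FLUME_HOME/conf -f $FLUME_HOME/conf/flume-twitter.conf -n TwitterAgent -Dflume.root.logger=INFO,console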
After a couple of minutes the tweets should appear in HDFS.
So we can see that the tweets were fetched from Twitter into HDFS.