
Cover Page for Academic Tasks

Course Code: INT56 Course Title: BIG DATA-HADOOP

Course Instructor: Dr. MAMOON RASHID

Academic Task No.: CA3 Academic Task Title: Apache Flume

Date of Submission: 19-10-2020 Section-Q1E27

Student’s Roll no: A12 Student’s Reg. no: 11909639


Evaluation Parameters:

Declaration:

I declare that this Assignment is my individual work. I have not copied it from any other student’s work or from any other source except where due acknowledgement is made explicitly in the text, nor has any part been written for me by any other person.

Evaluator’s comments (For Instructor’s use only)

General Observations

Suggestions for Improvement

Best part of assignment

Evaluator’s Signature and Date:

Marks Obtained: Max. Marks: …………………

Q1: In the Big Data world, one of the main jobs is to collect, aggregate, and move data from a single source or many sources to a centralized data store or multiple destinations. Based on this, you are given one scenario where you are supposed to replicate data from the terminal with four channels and two sinks. Support your answer with a screenshot of the CLI fetching data from HDFS out of four separate directories. The input for your problem is given below:

Steps for Multiplexing

1. First, we need to create the configuration file (a possible layout is sketched below).
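A minimal sketch of what multiplexing.conf might contain is given below. This is an assumption for illustration, not the exact file from the screenshot: the agent name a1 and port 44444 match the run and telnet commands in the next steps, but the channel/sink names and HDFS directory names are placeholders. The question mentions two sinks; the sketch wires one HDFS sink per channel so that data lands in four separate directories, as the screenshot requirement asks.

# Agent a1: one netcat source replicated across four channels,
# each channel drained by an HDFS sink into its own directory.
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sinks = k1 k2 k3 k4

# Netcat source listening on the port we telnet to in step 2
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3 c4

# Four in-memory channels
a1.channels.c1.type = memory
a1.channels.c2.type = memory
a1.channels.c3.type = memory
a1.channels.c4.type = memory

# Each sink reads one channel and writes to a separate HDFS directory (placeholder paths)
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume_multiplexing/dir1
a1.sinks.k1.hdfs.fileType = DataStream

a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c2
a1.sinks.k2.hdfs.path = /flume_multiplexing/dir2
a1.sinks.k2.hdfs.fileType = DataStream

a1.sinks.k3.type = hdfs
a1.sinks.k3.channel = c3
a1.sinks.k3.hdfs.path = /flume_multiplexing/dir3
a1.sinks.k3.hdfs.fileType = DataStream

a1.sinks.k4.type = hdfs
a1.sinks.k4.channel = c4
a1.sinks.k4.hdfs.path = /flume_multiplexing/dir4
a1.sinks.k4.hdfs.fileType = DataStream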

After creating the configuration file, we will run this command:

mukul:~/apache-flume-1.9.0-bin$ bin/flume-ng agent -n a1 --conf ./conf/ --conf-file conf/multiplexing.conf -Dflume.root.logger=DEBUG,console

2. After that, we will connect to the local host using telnet:

mukul:~$ telnet localhost 44444
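For example, once the agent is running, a few test lines can be typed into the telnet session; the netcat source acknowledges each event with OK (the lines typed here are only illustrative):

mukul:~$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello flume
OK
second test event
OK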

3. We will create a directory with the following command:

mukul:~$ hdfs dfs -mkdir -p /flume_multiplexing/manager/developer
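If four separate HDFS directories are needed (one per sink), they can be created the same way; the names below only mirror the placeholder paths in the sketch under step 1:

mukul:~$ hdfs dfs -mkdir -p /flume_multiplexing/dir1
mukul:~$ hdfs dfs -mkdir -p /flume_multiplexing/dir2
mukul:~$ hdfs dfs -mkdir -p /flume_multiplexing/dir3
mukul:~$ hdfs dfs -mkdir -p /flume_multiplexing/dir4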

4. Using this command, we can see the data inside the directory:

mukul:~$ hdfs dfs -cat /user/flume/FlumeData.1603009045377.tmp

Q2: Apache Flume is used for data ingestion; it helps to get data from various data generators and store it on the Hadoop Distributed File System. You are required to create one directory on the Ubuntu local file system and then treat this directory as the source for three files named temp1.txt, temp2.txt and temp3.txt. Use Apache Flume to spool this data from the source to an HDFS sink. Support your answer with a screenshot of the CLI fetching the same text files on HDFS.

Steps for using spooling:

First we need to create the configuration file. To do this, go to the conf directory inside the Flume directory (or open it in the file manager), or view the file with the command below; a possible configuration is sketched after the command:

: ~/apache-flume-1.9.1-cdh5.3.2-bin/conf$ cat flume1.conf
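A minimal sketch of what flume1.conf could contain is shown below. This is only an assumption for illustration: the agent name agent1 matches the run command used later, while the spool directory path, the channel/sink names and the HDFS target path are placeholders.

# agent1: spooling-directory source -> memory channel -> HDFS sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Watch the local myflume directory; any file dropped here is ingested
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /home/mukul/myflume
agent1.sources.src1.channels = ch1

# Buffer events in memory
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Write the events into HDFS as plain text
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = /user/flume
agent1.sinks.sink1.hdfs.fileType = DataStream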

After creating the configuration file, we need to create one directory named myflume (as written in the conf file) in the home directory.

After creating the directory (myflume), we will move one file named temp1.txt into it.
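For example, assuming temp1.txt currently sits in the home directory:

mukul:~$ mv ~/temp1.txt ~/myflume/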

After that, we will run the command:

mukul:~/apache-flume-1.9.0-bin$ bin/flume-ng agent -n agent1 --conf ./conf/ --conf-file conf/flume1.conf -Dflume.root.logger=DEBUG,console

Now we can see that temp1.txt is shown as completed (the spooling directory source renames each ingested file with a .COMPLETED suffix by default).
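This can be checked by listing the spool directory (the listing below is illustrative):

mukul:~$ ls ~/myflume
temp1.txt.COMPLETED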

Now we will look at the Hadoop directory using the following command:

mukul:~$ hdfs dfs -cat /user/flume/FlumeData.1603009045377.tmp

As we can see, the spooling process is running. The benefit of spooling is that it fetches every file placed in the local file system directory acting as the source and aggregates them into one file inside HDFS.

Now we can see that all three text files, temp1.txt, temp2.txt and temp3.txt, have been fetched from the local file system to HDFS using the spooling source.

Q3: Explain the steps required for the installation of Apache Flume and to fetch Twitter data and store it in HDFS.

Apache Flume:

• Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates and transports large amounts of streaming data, such as log files and events, from various sources like network traffic, social media and email messages to HDFS.

• Flume is highly reliable and distributed.

• The main idea behind Flume’s design is to capture streaming data from various web servers into HDFS.

• It has a simple and flexible architecture based on streaming data flows.


• It is fault-tolerant and provides reliability mechanisms for failure recovery.

Steps to install Apache Flume and perform Twitter sentiment analysis using Apache Flume

1. Download Flume and extract it with the following command:

$ tar -xvzf apache-flume-1.9.0-bin.tar.gz

2. Download the flume-sources-1.0-SNAPSHOT.jar from the following link:

https://drive.google.com/file/d/0B_t6uqPmWadsdWJNQ0NjaXBUYUk/view?usp=sharing

3. Paste flume-sources-1.0-SNAPSHOT.jar into Flume’s lib folder:

cp flume-sources-1.0-SNAPSHOT.jar $FLUME_HOME/lib

4. Add it to the Flume classpath in the conf/flume-env.sh file as shown below. Add the Java home path too:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

FLUME_CLASSPATH=/home/mukul/apache-flume-1.9.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar

5. Obtain the consumer key, consumer secret, access token and access token secret from https://apps.twitter.com/, which can be accessed from your Twitter developer account by creating a simple app.

6. Create the flume-twitter.conf file in the conf folder and paste the given lines (a sketch of such a configuration is shown below):
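The exact lines come from the screenshot; a typical configuration for the Cloudera TwitterSource shipped in flume-sources-1.0-SNAPSHOT.jar looks roughly like the sketch below. The keys, keywords and HDFS path are placeholders that must be replaced with your own values, and the NameNode address (localhost:9000) is an assumption.

# TwitterAgent: Twitter source -> memory channel -> HDFS sink
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Source class provided by flume-sources-1.0-SNAPSHOT.jar
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume

# In-memory channel buffering the tweets
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# HDFS sink writing tweets under /user/flume/tweets
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000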

7. Add the following to the .bashrc file (a typical example is sketched below):
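The exact lines are in the screenshot; typically this is just the Flume home and PATH, for example (the install path is an assumption):

export FLUME_HOME=/home/mukul/apache-flume-1.9.0-bin
export PATH=$PATH:$FLUME_HOME/bin

Reload the shell afterwards with source ~/.bashrc.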

8. Rename these 3 files in the lib folder of Flume (all you need to do is change the extension of these files from .jar to .org; see the sketch below).
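The three files are not listed here, but with this flume-sources jar they are usually the twitter4j jars bundled in Flume's lib folder, which would otherwise clash with the classes inside the SNAPSHOT jar. A possible way to rename them, assuming that is the case:

cd $FLUME_HOME/lib
# rename every bundled twitter4j jar from .jar to .org so Flume ignores it
for f in twitter4j-*.jar; do mv "$f" "${f%.jar}.org"; done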

9. Make directories in HDFS using the following command:

$ hdfs dfs -mkdir /user/flume/tweets

10. Run Flume and collect data into HDFS.


Start Hadoop first:

$ start-all.sh

Start Flume using the command below:

mukul:~/apache-flume-1.9.0-bin$ bin/flume-ng agent -n TwitterAgent --conf ./conf/ --conf-file conf/flume-twitter.conf -Dflume.root.logger=DEBUG,console

After a couple of minutes, the tweets should appear in HDFS.
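They can also be checked from the command line, for example:

mukul:~$ hdfs dfs -ls /user/flume/tweets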

11. Visit localhost:9870 in your browser and navigate to localhost:9870/explorer.html#/user/flume/tweets/

So we can see that the tweets were fetched from Twitter into HDFS.

