Declaration:
Q1: In the Big Data world, one of the main jobs is to collect, aggregate, and move data from a
single source or many sources to a centralized data store or multiple destinations. Based on
this, you are given one scenario where you are supposed to replicate data from a terminal with
four channels and two sinks. Support your answer with a screenshot of the CLI fetching data from
HDFS out of four separate directories. The input for your problem is given below:
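The input screenshots themselves are not reproduced here. A minimal sketch of such a replicating agent is shown below; the agent name agent1, the netcat port 44444 and the HDFS paths /user/flume/dir1 to /user/flume/dir4 are assumptions rather than values taken from the assignment, and the sketch attaches one HDFS sink per channel so that data can be read back from four separate directories (with only two sinks, the same pattern applies with sink3 and sink4 removed).

agent1.sources = src1
agent1.channels = ch1 ch2 ch3 ch4
agent1.sinks = sink1 sink2 sink3 sink4

# Netcat source listening on a local port; the replicating selector
# copies every event into all four channels
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.selector.type = replicating
agent1.sources.src1.channels = ch1 ch2 ch3 ch4

# Four in-memory channels
agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory
agent1.channels.ch3.type = memory
agent1.channels.ch4.type = memory

# One HDFS sink per channel, each writing to its own directory as plain text
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = /user/flume/dir1
agent1.sinks.sink1.hdfs.fileType = DataStream

agent1.sinks.sink2.type = hdfs
agent1.sinks.sink2.channel = ch2
agent1.sinks.sink2.hdfs.path = /user/flume/dir2
agent1.sinks.sink2.hdfs.fileType = DataStream

agent1.sinks.sink3.type = hdfs
agent1.sinks.sink3.channel = ch3
agent1.sinks.sink3.hdfs.path = /user/flume/dir3
agent1.sinks.sink3.hdfs.fileType = DataStream

agent1.sinks.sink4.type = hdfs
agent1.sinks.sink4.channel = ch4
agent1.sinks.sink4.hdfs.path = /user/flume/dir4
agent1.sinks.sink4.hdfs.fileType = DataStream

The agent would then be started with flume-ng agent -n agent1 -f <conf file> --conf $FLUME_HOME/conf before the steps below.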
2. After that we will connect to the localhost.
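The connection command is not shown in the text; for a netcat source it would typically be a telnet session against the port the source listens on (44444 is the assumed port from the sketch above):

telnet localhost 44444

Everything typed into this session is delivered to the Flume source as events and replicated to all channels.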
4. By using this command we can see the data inside the directory:
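Assuming the HDFS directories from the sketch above, the command would be along these lines (and likewise for dir2, dir3 and dir4):

hadoop fs -ls /user/flume/dir1
hadoop fs -cat /user/flume/dir1/FlumeData.*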
Q2: Apache Flume is used for data ingestion; it helps fetch data from various data
generators and store it on the Hadoop Distributed File System. You are required
to create one directory on the Ubuntu local file system and then treat this directory as the source for
three files named temp1.txt, temp2.txt and temp3.txt. Use Apache Flume to spool this data
from the source to an HDFS sink. Support your answer with a screenshot of the CLI fetching the same
text files on HDFS.
First we need to create the configuration file. To create it we go to the Flume directory and open the
conf directory, or use a command like the one below:
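The exact command from the screenshot is not reproduced here; a typical way to create and edit the file is (the file name spool.conf and the editor are assumptions):

cd $FLUME_HOME/conf
gedit spool.conf

A minimal spooling-directory configuration matching the steps described below could look like this; the agent name agent1 and the HDFS path /user/flume/spool are assumptions, and the spool directory points at the myflume directory created in the next step:

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Spooling directory source: watches the local myflume directory for new files
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /home/<your-user>/myflume
agent1.sources.src1.channels = ch1

# In-memory channel
agent1.channels.ch1.type = memory

# HDFS sink: writes the spooled events into HDFS as plain text
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = /user/flume/spool
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text

The agent is started with flume-ng agent -n agent1 -f spool.conf --conf $FLUME_HOME/conf before files are dropped into myflume.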
After creating the configuration file we need to create one directory named myflume (as
written in the conf file) in the home directory.
After creating the directory (myflume) we will move one file named temp1.txt into it, for example with the commands below.
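The exact commands from the screenshot are not reproduced here; they would be along these lines (the user name in the path is a placeholder, and temp1.txt is assumed to be in the current directory):

mkdir /home/<your-user>/myflume
mv temp1.txt /home/<your-user>/myflume/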
Now we can see that temp1.txt is shown as completed: the spooling directory source renames a file with a .COMPLETED suffix once it has been fully ingested.
Now we will look into the Hadoop directory by using the following command:
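The command itself is not reproduced in the screenshot; assuming the HDFS path from the configuration sketch above, it would be along the lines of:

hadoop fs -ls /user/flume/spool
hadoop fs -cat /user/flume/spool/FlumeData.*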
As we can see, the spooling process is running. The benefit of spooling is that it fetches all the files in
the local filesystem directory acting as the source and aggregates them as a single file inside HDFS.
Now we can see all three text files, named temp1.txt, temp2.txt and temp3.txt, fetched from the
local file system to HDFS using the spooling source.
Q3: Explain the steps required for the installation of Apache Flume and fetch Twitter data
to store it in HDFS.
Apache Flume:
• Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates and transports
large amounts of streaming data, such as log files and events, from various sources like network
traffic, social media and email messages to HDFS.
• The main idea behind Flume's design is to capture streaming data from various web
servers and deliver it to HDFS.
Steps to install Apache Flume and perform Twitter sentiment analysis using Apache Flume
Download the custom flume-sources-1.0-SNAPSHOT.jar from the link below and copy it into Flume's lib directory:
https://drive.google.com/file/d/0B_t6uqPmWadsdWJNQ0NjaXBUYUk/view?usp=sharing
cp flume-sources-1.0-SNAPSHOT.jar $FLUME_HOME/lib
4. Add it to the Flume classpath in the conf/flume-env.sh file, as shown below. Add the Java
home path too:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
FLUME_CLASSPATH=/home/mukul/apache-flume-1.9.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar
6. Create a flume-twitter.conf file in the conf folder and paste the configuration lines given below:
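The exact lines from the screenshot are not reproduced here; the widely used configuration for the Cloudera TwitterSource provided by flume-sources-1.0-SNAPSHOT.jar looks like the following (the four credential values are placeholders you must fill in from your Twitter developer account, and the keywords and HDFS path are assumptions):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Custom Twitter source from flume-sources-1.0-SNAPSHOT.jar
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, bigdata, flume

# HDFS sink where the raw tweets are written as plain text
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# In-memory channel between the source and the sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100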
8. Rename these three files in the lib folder of Flume (all you need to do is change the
extension of these files from .jar to .org):
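The screenshot listing the three files is not reproduced here; in the standard version of this tutorial they are the twitter4j jars that ship with Flume, renamed so they do not clash with the classes bundled inside flume-sources-1.0-SNAPSHOT.jar (the exact version number may differ in your installation):

cd $FLUME_HOME/lib
mv twitter4j-core-3.0.3.jar twitter4j-core-3.0.3.org
mv twitter4j-media-support-3.0.3.jar twitter4j-media-support-3.0.3.org
mv twitter4j-stream-3.0.3.jar twitter4j-stream-3.0.3.org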
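The step that starts the agent is not shown in the text; assuming the configuration file and agent name used above, it would look like:

$FLUME_HOME/bin/flume-ng agent --conf $FLUME_HOME/conf -f $FLUME_HOME/conf/flume-twitter.conf -n TwitterAgent -Dflume.root.logger=INFO,console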
After a couple of minutes the tweets should appear in HDFS.
So we can see that the tweets were fetched from Twitter into HDFS.