Flume Case Study: Twitter Data Extraction

Problem Statement
In this case study, a Flume agent is configured to retrieve data from Twitter.
Twitter is a large source of data about people's opinions and preferences. This
data can be used to analyse public opinion or reviews of a specific topic or
product, and various types of analysis can be done based on tweet content and
location. The data collected by Flume can be processed in real time with Apache
Spark through its Streaming API. Spark Streaming processes live data from sources
such as Kafka, Flume or TCP sockets, and it also supports the Twitter streaming
API. By using Flume, we can build a fault-tolerant system that delivers data in
real time and also saves a copy of the data in the required location. Spark also
has built-in machine learning algorithms, which make the analysis faster, more
reliable and fault tolerant.

In this way we can obtain the required results in real time using Spark and store
the data in a database for deeper analysis using Hadoop. We now build a simple
Flume agent that has a Twitter source and a sink that Spark accesses to retrieve
the data. To prevent data loss, we build the Flume agent with a custom sink. Even
if the Spark machine goes down, the data remains in the channel, because Flume
transfers data in transactions and removes events from the channel only after the
sink has delivered them.
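The transactional hand-off described above can be illustrated with a small toy model. This is only a sketch of the idea, not Flume's implementation: the class ToyChannel, its methods and the spark_is_down function are hypothetical names invented for the illustration. It shows a bounded queue from which events are removed only when delivery succeeds.

```python
from collections import deque


class ToyChannel:
    """Toy stand-in for a Flume channel (hypothetical, not Flume's own code)."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        # A full channel rejects new events instead of silently dropping old ones.
        if len(self.events) >= self.capacity:
            raise OverflowError("channel is full")
        self.events.append(event)

    def drain(self, deliver, batch_size=100):
        """Take up to batch_size events and pass them to deliver().

        The events leave the channel only if deliver() returns normally
        (commit). If deliver() raises, they are put back in their original
        order (rollback).
        """
        count = min(batch_size, len(self.events))
        batch = [self.events.popleft() for _ in range(count)]
        try:
            deliver(batch)
        except Exception:
            self.events.extendleft(reversed(batch))  # rollback
            return 0
        return len(batch)  # commit


def spark_is_down(batch):
    # Simulates a sink that cannot reach the Spark machine.
    raise ConnectionError("Spark receiver unreachable")


channel = ToyChannel()
for i in range(5):
    channel.put(f"tweet-{i}")

delivered = []
failed = channel.drain(spark_is_down)          # rollback: nothing is lost
remaining_after_failure = len(channel.events)
succeeded = channel.drain(delivered.extend)    # commit: events are handed over
```

In this model a failed delivery leaves all five events in the channel, and a later successful delivery hands them over in their original order. Note that a memory-based channel keeps events only while the agent process itself is running.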

Proposed Solution
The configuration now has to be set for the main components of our architecture.
One such component is the source, which is configured to read data from Twitter.
The Flume agent "agent1" has a source "source_read" that is configured with a
custom source type. To access the data, the application must be registered with
Twitter, which then provides the credentials used to retrieve the data. We can
also set keywords if we only need tweets containing specific words. This is useful
for analysis of a specific topic or product.

Cloudera provides the jar file, which has to be included in the Flume classpath so
that the source classes can be loaded. This can be done by adding the path of the
jar in the "flume-env.sh" configuration file. If additional parameters such as a
proxy need to be set, the jar has to be rebuilt from the source code.

agent1.sources = source_read
agent1.sources.source_read.type = com.cloudera.flume.source.TwitterSource
agent1.sources.source_read.channels = memory1
agent1.sources.source_read.consumerKey =
agent1.sources.source_read.consumerSecret =
agent1.sources.source_read.accessToken =
agent1.sources.source_read.accessTokenSecret =
agent1.sources.source_read.keywords = hadoop

The channel configuration is simple. We use a memory-based channel:

agent1.channels = memory1
agent1.channels.memory1.type = memory
agent1.channels.memory1.capacity = 1000
agent1.channels.memory1.transactionCapacity = 100

The class of the custom sink is specified with the "type" parameter. The jar file
containing the sink has to be added to the Flume classpath. The hostname and port
define the address on which the sink listens. Spark Streaming connects to this
address to pull the data.

agent1.sinks = spark_dump
agent1.sinks.spark_dump.type = org.apache.spark.streaming.flume.sink.SparkSink
agent1.sinks.spark_dump.hostname =
agent1.sinks.spark_dump.port =
agent1.sinks.spark_dump.channel = memory1
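
A source or sink that refers to a channel name that is not declared (for example MemChannel where the channel is called memory1) is an easy mistake to make in a configuration like this. The following is a minimal sketch of a consistency check, assuming the usual Flume convention that sources list their channels under "channels" and sinks name one under "channel". The functions parse_properties and undeclared_channels are hypothetical helpers, not part of Flume, and the sample text contains a deliberately wrong channel name.

```python
def parse_properties(text):
    """Parse simple key = value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def undeclared_channels(props, agent):
    """Return (key, channel) pairs that refer to a channel the agent never declares."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for key, value in props.items():
        parts = key.split(".")
        is_component = (
            len(parts) == 4
            and parts[0] == agent
            and parts[1] in ("sources", "sinks")
            and parts[3] in ("channels", "channel")
        )
        if is_component:
            for name in value.split():
                if name not in declared:
                    problems.append((key, name))
    return problems


# Sample configuration with a deliberate mismatch in the source's channel name.
sample = """
agent1.sources = source_read
agent1.channels = memory1
agent1.sinks = spark_dump
agent1.sources.source_read.channels = MemChannel
agent1.sinks.spark_dump.channel = memory1
"""

problems = undeclared_channels(parse_properties(sample), "agent1")
```

With the sample text, the check reports the source's reference to MemChannel and accepts the sink's reference to memory1.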
Executing Solution
Start the Flume agent on the server with the following command, where $agent_name
is the name of the agent (agent1 in this configuration):

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

The data from the source will be in the following format (the excerpt is truncated
at both ends):

Aug 25 18:26:22 +0000 2015","favorite_count":0,"place":null,"coordinates":null,
"text":"RT @ARsoftCo: Hortonworks buys better Hadoop data flow management http://t.co/w41vWX0vKW",
"contributors":null,"retweeted_status":{"filter_level":"low","contributors":null,
"text":"Hortonworks buys better Hadoop data flow management http://t.co/w41vWX0vKW",
"geo":null,"retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,
"truncated":false,"lang":"en","entities":{"trends":[],"symbols":[],
"urls":[{"expanded_url":"http://arsoft-company.biz/hortonworks-buys-better-hadoop-data-flow-management/",
"indices":[52,74],"display_url":"arsoft-company.biz/hortonworks-bu\u2026",
"url":"http://t.co/w41vWX0vKW"}],"hashtags":[],"user_mentions":[]},
"in_reply_to_status_id_str":null,"id":636233138667130880,"source":"WordPress.com<\/a>",
"in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":1
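
Since each event body is a JSON document, the fields needed for analysis can be pulled out with a standard JSON parser. The sketch below assumes a complete event. Because the excerpt above is truncated, raw_event is a hypothetical, shortened event built only from fields visible in the excerpt, and summarize is an illustrative helper, not part of Flume or Spark.

```python
import json

# Hypothetical, shortened event assembled from fields shown in the excerpt.
raw_event = (
    '{"favorite_count":0,"place":null,"coordinates":null,'
    '"text":"RT @ARsoftCo: Hortonworks buys better Hadoop data flow management '
    'http://t.co/w41vWX0vKW",'
    '"retweeted_status":{"text":"Hortonworks buys better Hadoop data flow management '
    'http://t.co/w41vWX0vKW",'
    '"lang":"en","id":636233138667130880,"retweet_count":1}}'
)


def summarize(event_body):
    """Extract a few fields from one tweet, preferring the original of a retweet."""
    tweet = json.loads(event_body)
    original = tweet.get("retweeted_status") or tweet
    return {
        "text": original["text"],
        "lang": original.get("lang"),
        "retweet_count": original.get("retweet_count", 0),
        "is_retweet": "retweeted_status" in tweet,
        # Location-based analysis is only possible when one of these is set.
        "has_location": tweet.get("place") is not None
        or tweet.get("coordinates") is not None,
    }


summary = summarize(raw_event)
```

For a retweet, the sketch reads the text and counters from the nested retweeted_status object, so that the "RT @user:" prefix does not end up in the analysed text.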
