IJSRD - International Journal for Scientific Research & Development | Vol. 4, Issue 02, 2016 | ISSN (online): 2321-0613

Clustering and Summarization of Tweet Streams
Pratik Sadawarte1 Amreen Khan2
Department of Computer Science & Engineering
G.H. Raisoni College of Engineering Nagpur, India

Abstract— Tweets are short messages posted on social
networking sites such as Twitter, Facebook, Tumblr and Weibo,
as well as on blogs. They carry a wide range of real-time,
real-life information, and their topics may or may not change
over time. Because many events take place at different times
and in very large numbers, detecting an event is a difficult
task. Four operations take place on tweet clusters: creating,
splitting, absorbing and merging. These are useful for
detecting currently occurring events. To detect the current
event, the tweet stream clustering algorithm first clusters the
large volume of tweets, the tweet stream summarization
algorithm then summarizes the clustered tweets, and finally
the current event detection method identifies the current
event from the recent events in the tweet streams. Efficiency
plays an important role here; it is shown by a graph relating
the number of tweets to the number of days.
Key words: Tweet, Clustering, Summarization, Sumblr,
Social networking sites consist of a number of nodes, each of
which may be a single computer, a set of computers, a single
user or a group of users. Every node, that is every user, has
its own viewpoints, ideas, opinions and beliefs, and these can
be seen in the user's tweets. A tweet may be identical to, or
slightly different from, the tweets of another user; in
general, every tweet is different in its own way. Tweets vary
according to the mood of the user, the surroundings, and any
current event that is going on.
Consider, for example, a world cup, a cricket tournament or
the release of a film: the tweets about such current topics
will far outnumber those about other topics. Similarly, a
doctor may post tweets about medical matters, while a film
actor may tweet about films or about current events in the
film industry, such as an award show. In this way the tweets
and their topics vary from user to user. The information must
also be kept up to date. The process used for updating the
information is shown in the following figure.

Fig. 1: The information update process
One important method of combining data or tweets is the data
or tweet stream integration method. The following block
diagram shows the data or tweet stream integration.

Fig. 2: Merging/ Integration of data
In the early stages users were able to share only textual
messages, but nowadays multimedia such as sound, video and
graphics can also be shared via social networking sites.
Users may post tweets that are senseless, useless or of no
concern; such tweets may be discarded. Identical or nearly
identical tweets are likewise of no use and may also be
discarded. Two issues are particularly important in tweet
stream operations: efficiency and flexibility. Tweet streams
are very large, so the processing must be efficient, and
tweets are posted at different times, so it must be flexible.
This section presents the work related to this paper. The
clustering algorithm, the summarization algorithm, the tweet
stream merging algorithm and the current event detection
algorithm are discussed in the following sub-sections.
A. Tweet Stream Clustering Algorithm:
The work related to clustering is discussed below. BIRCH [2]
compresses large data streams into an in-memory tree
structure, which reduces the amount of data to be held. [3]
uses a framework that retains the useful data and discards
the data or tweets that are useless. [1] uses the traditional
data stream clustering method and also maintains a time frame
that records each tweet together with its timestamp, so the
user can find the date and time at which a tweet was posted.
[4], [5] and [6] describe various clustering methods as well
as topic detection algorithms. [8] deals with text, data and
tweets on an online basis. Event detection and tracking is
discussed in [10].

All rights reserved by www.ijsrd.com


Clustering and Summarization of Tweet Streams
(IJSRD/Vol. 4/Issue 02/2016/414)

There are four main modules in the framework, namely the
Tweet Stream Clustering module, the High Level Summarization
module, the Current Event Detection module and the Statistic
Generation module.

Fig. 3: Clustering of the tweet streams
B. Tweet Stream Summarization:
The two main categories of tweet summarization are tweet
extraction and tweet abstraction. [12] uses an algorithm that
selects tweets for summarization while discarding old tweets.
[15] summarizes tweets by using simple tweets. [7] deals with
coreferent objects or topics, that is, cases where two tweets
refer to the same thing. In [3] the two main steps are content
selection and content presentation. [15] proposed using a
single tweet to summarize a tweet stream by means of the
Reinforcement algorithm.
C. Current Event Detection:
Timeline detection, or current event detection, is the final
output of this paper. A topic that is very popular at present
will appear in the current event detection. [8] gives an idea
of current events based on conversations. [7] proposed
time-based summaries. Some algorithms do not consider
scalability and efficiency, which limits their usefulness.
Slope-based techniques have also been used for summarization.
The current event is the event that is most popular at the
present time. The following figure shows the current event
detection.

Fig. 4: Current Event Detection

A. Tweet Stream Clustering module:
The tweet stream clustering module maintains statistical data
about the tweet streams. It clusters the abundant tweets into
different groups or clusters.
1) Initialization:
The k-means clustering algorithm is used to create the initial
clusters at the start of the stream, after which the tweet
cluster vectors (TCVs) are initialized. Whenever a new tweet
arrives, the clustering process updates the TCVs.
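The initialization step can be sketched as follows. This is a
minimal illustration, not the implementation used in this
paper: the TCV layout (tweet count, summed term counts, latest
timestamp), the names vectorize, cosine, TCV and kmeans_init,
the whitespace tokenisation, the cosine-based k-means variant
and the seeding with the first k tweets are all assumptions
made for the example.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def vectorize(text):
    """Bag-of-words term counts for one tweet (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class TCV:
    """Hypothetical tweet cluster vector: summary statistics only."""
    n: int = 0                                   # tweets absorbed
    term_sum: Counter = field(default_factory=Counter)
    last_ts: float = 0.0                         # latest timestamp

    def add(self, vec, ts):
        self.n += 1
        self.term_sum.update(vec)
        self.last_ts = max(self.last_ts, ts)


def kmeans_init(tweets, k, iters=10):
    """tweets: list of (text, timestamp). Returns k initial TCVs."""
    vecs = [vectorize(text) for text, _ in tweets]
    centroids = [Counter(v) for v in vecs[:k]]   # assumed seeding
    assign = None
    for _ in range(iters):
        new_assign = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vecs
        ]
        if new_assign == assign:                 # converged
            break
        assign = new_assign
        # Summed vectors serve as centroids (cosine ignores scale).
        centroids = [Counter() for _ in range(k)]
        for v, c in zip(vecs, assign):
            centroids[c].update(v)
    tcvs = [TCV() for _ in range(k)]
    for (_, ts), v, c in zip(tweets, vecs, assign):
        tcvs[c].add(v, ts)
    return tcvs
```

Keeping only summary statistics per cluster means that later
tweets can be absorbed without storing every tweet in memory.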
2) Incremental Clustering:
Consider a number of tweets arriving at a particular time. The
question is whether to add each tweet to an existing cluster
or to create a new cluster. If a tweet belongs to an existing
cluster, it is added to that cluster; if not, a new cluster is
formed under a different name and the tweet is added to it.
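The decision described above can be sketched as a similarity
test against the existing clusters. The function name
assign_tweet, the cosine similarity measure and the threshold
value of 0.3 are assumptions chosen for illustration; the
paper does not specify them.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign_tweet(clusters, text, threshold=0.3):
    """Add a tweet to the closest cluster or open a new one.

    clusters: list of Counter objects holding summed term counts.
    Returns the index of the cluster that received the tweet.
    """
    vec = Counter(text.lower().split())
    best, best_sim = None, 0.0
    for i, centroid in enumerate(clusters):
        sim = cosine(vec, centroid)
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= threshold:
        clusters[best].update(vec)     # absorb into existing cluster
        return best
    clusters.append(vec)               # no close cluster: create one
    return len(clusters) - 1
```

A higher threshold produces more, tighter clusters; a lower
one absorbs more tweets into existing clusters.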
3) Deleting Old Clusters:
Many tweets do not stay relevant for long, such as tweets
about news, sports and football matches. These tweets are
meaningful only for a particular period of time and do not
play an important role for very long. It is therefore safe to
delete such tweets once they are rarely discussed. In this
way the outdated tweets are deleted.
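One simple way to realise this step is to drop every cluster
that has not received a tweet for longer than a chosen age
limit. The Cluster record, the drop_outdated function and the
max_age parameter below are hypothetical names introduced for
this sketch, and the age-based rule is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    topic: str
    size: int            # number of tweets absorbed so far
    last_update: float   # time of the most recent tweet (e.g. day)


def drop_outdated(clusters, now, max_age):
    """Keep only clusters updated within the last max_age units."""
    return [c for c in clusters if now - c.last_update <= max_age]


clusters = [
    Cluster("election", 40, 1.0),
    Cluster("world cup", 120, 9.0),
    Cluster("film", 30, 10.0),
]
kept = drop_outdated(clusters, now=10.0, max_age=3.0)
print([c.topic for c in kept])
```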
4) Merging Clusters:
The memory of the system can easily be exhausted if the
number of clusters keeps growing. To avoid this problem, an
upper limit on the number of clusters is specified. When this
upper limit is reached, the merging process starts: clusters
that are similar in nature are merged. This reduces the number
of clusters, and quality clusters can be obtained.
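The merging step can be sketched as repeatedly merging the
most similar pair of clusters until the limit is respected.
The function name merge_until, the max_clusters parameter and
the use of cosine similarity over summed term counts are
assumptions made for this example.

```python
import itertools
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def merge_until(clusters, max_clusters):
    """Merge the most similar pair until the upper limit holds."""
    clusters = [Counter(c) for c in clusters]   # work on copies
    limit = max(1, max_clusters)
    while len(clusters) > limit:
        i, j = max(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i].update(clusters[j])         # fold j into i
        del clusters[j]                         # i < j, so i is safe
    return clusters
```

The pairwise search is quadratic in the number of clusters,
which is acceptable only because the upper limit keeps that
number small.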

Fig. 5: Merging Clusters
B. Summarization Module:
The summarization module provides two types of summaries:
online and historical. Currently discussed topics go into the
online summaries, whereas topics discussed during a specific
period in the past go into the historical summaries. We
mainly deal with the online summaries, that is, the current
or recent tweets.
1) TCV-Rank Summarization Algorithm:
The TCV-Rank summarization algorithm is used to summarize the
tweet stream. Both the online summaries and the historical
summaries are created using this algorithm.
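The details of TCV-Rank are not given here, so the following
is only a generic extractive stand-in, not TCV-Rank itself. It
ranks tweets by similarity to the centroid of all tweets and
skips tweets that are too similar to ones already chosen. The
names summarize, k and max_overlap are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def summarize(tweets, k=2, max_overlap=0.5):
    """Pick up to k representative, mutually dissimilar tweets."""
    vecs = [Counter(t.lower().split()) for t in tweets]
    centroid = Counter()
    for v in vecs:
        centroid.update(v)
    # Most central tweets first.
    ranked = sorted(
        range(len(tweets)),
        key=lambda i: cosine(vecs[i], centroid),
        reverse=True,
    )
    chosen = []
    for i in ranked:
        if len(chosen) == k:
            break
        # Skip near-duplicates of tweets already in the summary.
        if all(cosine(vecs[i], vecs[j]) <= max_overlap for j in chosen):
            chosen.append(i)
    return [tweets[i] for i in chosen]
```

The redundancy check reflects the earlier observation that
identical or nearly identical tweets add nothing to a summary.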


C. Current Event Detection module:
Many tweets are posted on different topics and at different
times, and the number of tweets on Twitter may be very large.
Within this huge volume, however, some tweets belong to a
particular topic that is discussed many times. This topic is
the current event or recent trend.
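As a minimal sketch of this idea, the current event can be
taken as the topic with the most tweets inside a recent time
window. The function name current_event, the window parameter
and the assumption that each tweet already carries a topic
label (for example its cluster) are illustrative only.

```python
from collections import Counter


def current_event(tweets, now, window):
    """Return (topic, count) of the most discussed recent topic.

    tweets: iterable of (topic, timestamp) pairs.
    Only tweets with now - window <= timestamp <= now are counted.
    Returns None when no tweet falls inside the window.
    """
    counts = Counter(
        topic for topic, ts in tweets if 0 <= now - ts <= window
    )
    return counts.most_common(1)[0] if counts else None


tweets = [
    ("world cup", 9), ("film", 8), ("world cup", 10),
    ("election", 2), ("world cup", 8),
]
print(current_event(tweets, now=10, window=3))
```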
D. Statistic Generation Module:
Once the current event has been identified by the Current
Event Detection module, the Statistic Generation module
generates the statistics or graph, which is easy for any user
to view and understand. The event or topic that is discussed
on the largest scale takes the highest position, and the
remaining topics are ranked likewise in the graph.
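The ranking described above can be sketched with a simple
text chart, in which the most discussed topic appears first.
The names topic_ranking and text_chart and the plain-text
rendering are assumptions for this example; the module in this
paper produces a graphical output.

```python
from collections import Counter


def topic_ranking(topics):
    """Return (topic, tweet_count) pairs, most discussed first."""
    return Counter(topics).most_common()


def text_chart(ranking, width=20):
    """Render the ranking as one bar of '#' characters per topic."""
    top = ranking[0][1] if ranking else 0
    lines = []
    for topic, n in ranking:
        bar = "#" * max(1, round(width * n / top))
        lines.append(f"{topic:<12} {bar} {n}")
    return "\n".join(lines)


ranking = topic_ranking(["cup", "cup", "film", "cup", "film", "news"])
print(text_chart(ranking, width=6))
```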

Fig. 6: Graphical Representation
Sumblr is the combination of summarization, clustering and
current event detection. The tweet stream summarization
algorithm summarizes the tweet streams by deleting old or
senseless tweets. The tweet stream clustering algorithm
clusters the tweets into different groups. Finally, the
current event detection algorithm detects the current event.
The effectiveness and efficiency of these algorithms are
greater than those of the previously used algorithms. The
comparison between the older algorithms and the algorithms
used here is shown by a pie chart. For future work, videos
and images may be analysed.

Fig. 7: Comparison between traditional and current

[1] Zhenhua Wang, Lidan Shou, Ke Chen, Gang Chen, and
Sharad Mehrotra, "On summarization and timeline
generation for evolutionary tweet streams," IEEE
Transactions on Knowledge and Data Engineering, Vol.
27, No. 5, May 2015.
[2] Daswin
Damminda Alahakoon, and Grahame Holmes, "A data
mining framework for electricity consumption analysis
from meter data," IEEE Transactions on Industrial
Informatics, Vol. 7, No. 3, August 2011.
[3] Xue-Qi Cheng, Pan Du, Jiafeng Guo, Xiaofei Zhu, and
Yixin Chen, "Ranking on data manifold with sink
points," IEEE Transactions on Knowledge and Data
Engineering, Vol. 25, No. 1, January 2013.
[4] Jose F. Rodrigues Jr., Hanghang Tong, Jia-Yu Pan,
Agma J. M. Traina, Caetano Traina Jr., and Christos
Faloutsos, "Large graph analysis in the GMine system,"
IEEE Transactions on Knowledge and Data Engineering,
Vol. 25, No. 1, January 2013.
[5] Xiaoyan Cai and Wenjie Li, "Mutually reinforced
manifold-ranking based relevance propagation model
for query-focused multi-document summarization,"
IEEE Transactions on Audio, Speech, and Language
Processing, Vol. 20, No. 5, July 2012.
[6] Binxing Jiao, Linjun Yang, Jizheng Xu, Qi Tian, and
Feng Wu, "Visually summarizing web pages through
internal and external images," IEEE Transactions on
Multimedia, Vol. 14, No. 6, December
[7] Jianchao Yang, Jiebo Luo, Jie Yu, and Thomas S. Huang,
"Photo stream alignment and summarization for
collaborative photo collection and sharing," IEEE
[8] Bo Liu, Yanshan Xiao, Philip S. Yu, Longbing Cao,
Yun Zhang, and Zhifeng Hao, "Uncertain one-class
learning and concept summarization learning on
uncertain data streams," IEEE Transactions on
Knowledge and Data Engineering, Vol. 26, No. 2,
February 2014.
[9] György J. Simon, Pedro J. Caraballo, Terry M. Therneau,
Steven S. Cha, M. Regina Castro, and Peter W. Li,
"Extending association rule summarization techniques
to assess risk of diabetes mellitus," IEEE Transactions
on Knowledge and Data Engineering, Vol. 27, No. 1,
January 2015.
[10] Yizhou Sun, Jie Tang, Jiawei Han, Cheng Chen, and
Manish Gupta, "Co-evolution of multi-typed objects in
dynamic star networks," IEEE Transactions on
Knowledge and Data Engineering, Vol. 26, No. 12,
December 2014.
[11] Jiajun Liu, Yi Yang, Zi Huang, Yang Yang, and
Heng Tao Shen, "On the influence propagation of web
videos," IEEE Transactions on Knowledge and Data
Engineering, Vol. 26, No. 8, August 2014.
[12] Dongsheng Duan, Yuhua Li, Ruixuan Li, Rui Zhang,
Xiwu Gu, and Kunmei Wen, "LIMTopic: a framework of
incorporating link based importance into topic
modeling," IEEE Transactions on Knowledge and Data
Engineering, Vol. 26, No. 10, October 2014.


[13] Vasanthan Raghavan, Greg Ver Steeg, Aram Galstyan, and
Alexander G. Tartakovsky, "Modeling temporal activity
patterns in dynamic social networks," IEEE Transactions
on Computational Social Systems, Vol. 1, No. 1, March
[14] Daan Van Britsom, Antoon Bronselaer, and Guy De Tre,
"Using data merging techniques for generating
multidocument summarizations," IEEE Transactions on
Fuzzy Systems, Vol. 23, No. 3, June 2015.
[15] Haiying Shen, Ze Li, Jinwei Liu, and Joseph Edward
Grant, "Knowledge sharing in the online social network
of Yahoo! Answers and its implications," IEEE
Transactions on Computers, Vol. 64, No. 6, June 2015.
