
Indian J.Sci.Res. 14 (1): 102-108, 2017
ISSN: 0976-2876 (Print)
ISSN: 2250-0138 (Online)

TWEET SEGMENTATION AND CLASSIFICATION FOR RUMOR IDENTIFICATION USING KNN APPROACH

V. GAYATHRIa1 AND A.E. NARAYANANb

abPeriyar Maniammai University, Vallam, India

ABSTRACT

Big data analytics is the process of examining large data sets containing a variety of data types, i.e., big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics, data mining, text analytics and statistical analysis. In this project, we analyze social media data. Social media analytics is the practice of gathering data from blogs and social media websites and analyzing that data to make business decisions; its most common use is to mine customer sentiment in order to support marketing and customer service activities. Twitter big data is then used to predict named entities. Considering the wide use of Twitter as a source of information, reaching an interesting tweet among a bunch of tweets is challenging for a user. This work aims to reduce the Twitter user's effort to access the tweets carrying the information of interest. To this end, a tweet recommendation method under a user interest model generated via named entities is presented: HybridSeg is built from named entities extracted from the user's followers' posts and the user's own posts. The approach is further extended to analyze short texts in tweets and rumor-based tweets, and a k-nearest neighbor (k-NN) classifier is applied to eliminate rumor-based tweets with an improved accuracy rate. The approach can also be deployed in real-time tweet environments to identify rumors with a high level of security.
KEYWORDS: Social Network, HybridSeg, NER, POS, Data
BIG DATA

Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on." The limitations also affect search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data were created. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers". What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. Big Data is a moving target; what is considered to be "Big" today will not be so years ahead. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.

In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, a new V, "Veracity", is added by some organizations to describe it.
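The doubling figure quoted above implies enormous cumulative growth. As a rough worked example (the 1986 baseline year is an assumption chosen purely for illustration), one doubling per 40 months multiplies per-capita storage capacity as follows:

```python
# Per-capita storage capacity roughly doubles every 40 months.
# Taking 1986 as an illustrative baseline, the factor by 2012 is:
months = (2012 - 1986) * 12          # 312 months
growth = 2 ** (months / 40)          # one doubling per 40 months
print(f"{growth:.0f}x")              # a growth factor in the low hundreds
```

Even over a single generation, exponential doubling of this kind turns gigabyte-scale holdings into terabyte- and petabyte-scale ones, which is what pushes organizations past the limits of traditional tooling.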

1Corresponding author
GAYATHRI AND NARAYANAN: TWEET SEGMENTATION AND CLASSIFICATION FOR RUMOR IDENTIFICATION...

While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a sounder distinction between big data and Business Intelligence, regarding data and their use:

• Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.;

• Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density, in order to reveal relationships and dependencies and to perform predictions of outcomes and behaviors.

Big data can also be defined as "a large volume of unstructured data which cannot be handled by standard database management systems like DBMS, RDBMS or ORDBMS".

Big data can be described by the following characteristics:

Volume – The quantity of data that is generated is very important in this context. It is the size of the data which determines the value and potential of the data under consideration, and whether it can actually be considered Big Data or not. The name 'Big Data' itself contains a term related to size, hence the characteristic.

Variety – The next aspect of Big Data is its variety. This means that the category to which Big Data belongs is also a very essential fact that needs to be known by the data analysts. This helps the people who closely analyze the data, and are associated with it, to use the data effectively to their advantage, thus upholding the importance of Big Data.

Velocity – The term 'velocity' in this context refers to the speed of generation of data, or how fast the data is generated and processed to meet the demands and challenges which lie ahead in the path of growth and development.

Variability – This is a factor which can be a problem for those who analyze the data. It refers to the inconsistency which can be shown by the data at times, hampering the process of handling and managing the data effectively.

Veracity – The quality of the data being captured can vary greatly. Accuracy of analysis depends on the veracity of the source data.

Complexity – Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information that is supposed to be conveyed by them. This situation is therefore termed the 'complexity' of Big Data.

Big data analytics enables organizations to analyze a mix of structured, semi-structured and unstructured data in search of valuable business information and insights.

BIG DATA ADVANTAGES

Big data analytics is the process of examining large data sets containing a variety of data types, i.e., big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits.

The primary goal of big data analytics is to help companies make more informed business decisions by enabling data scientists, predictive modelers and other analytics professionals to analyze large volumes of transaction data, as well as other forms of data that may be untapped by conventional business intelligence (BI) programs. That could include Web server logs and Internet clickstream data, social media content and social network activity reports, text from customer emails and survey responses, mobile-phone call detail records, and machine data captured by sensors connected to the Internet of Things. Some people exclusively associate big data with semi-structured and unstructured data of that sort, but consulting firms like Gartner Inc. and Forrester Research Inc. also consider transactions and other structured data to be valid components of big data analytics applications.

Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics, data mining, text analytics and statistical analysis. Mainstream BI software and data visualization tools can also play a role in the analysis process. But the semi-structured and unstructured data may not fit well in traditional data warehouses based on relational databases. Furthermore, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated frequently or even continually, for example real-time data on the performance of mobile applications or of oil and gas pipelines. As a result, many organizations looking to collect, process and analyze big data have turned to a


newer class of technologies that includes Hadoop and related tools such as YARN, MapReduce, Spark, Hive and Pig, as well as NoSQL databases. Those technologies form the core of an open source software framework that supports the processing of large and diverse data sets across clustered systems. In some cases, Hadoop clusters and NoSQL systems are being used as landing pads and staging areas for data before it gets loaded into a data warehouse for analysis, often in a summarized form that is more conducive to relational structures. Increasingly though, big data vendors are pushing the concept of a Hadoop data lake that serves as the central repository for an organization's incoming streams of raw data. In such architectures, subsets of the data can be filtered for analysis in data warehouses and analytical databases, or analyzed directly in Hadoop using batch query tools, stream processing software, and SQL-on-Hadoop technologies that run interactive, ad hoc queries written in SQL. Potential pitfalls that can trip up organizations on big data analytics initiatives include a lack of internal analytics skills and the high cost of hiring experienced analytics professionals. The amount of information that is typically involved, and its variety, can also cause data management headaches, including data quality and consistency issues. In addition, integrating Hadoop systems and data warehouses can be a challenge, although various vendors now offer software connectors between Hadoop and relational databases, as well as other data integration tools with big data capabilities. With big data analytics, scientists and others can analyze huge volumes of data that old analytics and business intelligence solutions cannot reach. Consider that an enterprise could accumulate billions of rows of data, with hundreds of millions of data combinations, in multiple data stores and abundant formats. The value of Big Data Analytics can be demonstrated by plotting cumulative cash flow against time: with old analytics techniques, as in any data warehousing application, you have to wait hours or days to get information, as compared to Big Data Analytics. Information has timeliness value when it is processed at the right time; otherwise it is of no use and might not return its value at a proper cost.

EXISTING SYSTEM

Description

Named entity recognition (NER) is the task of finding and classifying names of things, such as persons, locations, and organizations, given a sequence of words. NER is a very important subtask of information extraction (IE). With the development of the Internet, a huge amount of information has been generated by users. The information generated on the Internet, particularly on social media (e.g., Twitter and Facebook), includes very diverse and noisy texts. Word segmentation is one of the important steps in natural language processing. Essentially, segmentation tries to determine the boundaries of words. As a fundamental natural language analysis task, word segmentation plays a key role in many natural language processing applications. Different from traditional word segmentation, many new words exist when segmenting Twitter text, and traditional methods cannot deal with this problem well. The need to segment and label sequences arises in many different problems in several scientific fields. Hidden Markov models (HMMs) and stochastic grammars are well understood and widely used probabilistic models for such problems, but these approaches find it difficult to segment the entities in tweets. Twitter is a type of microblogging service in which users are allowed to post content such as small messages, individual images, or videos.

Base features include the gazetteer features and orthographic features. In the NER task, a huge amount of unlabeled data is often used for identifying unseen entities. There are already 53 gazetteers in the baseline system. The maximum window size for gazetteer features is 6, and the model will learn the named entity type associated with a specific phrase if it is in one or more of the gazetteer lexicons. Orthographic features can be divided into five types. The orthographic feature templates are as follows:

• n-gram: wi for i in {-1, 0, 1}, and the conjunction of the previous word and the current word, wi−1|wi, for i in {-1, 0}.
• Affixes: prefixes and suffixes of xi, i.e., the first and last n characters, with n ranging from 1 to 3.
• Capitalization: there are two patterns of capitalization: one is an indicator of capitalization for the first character, and the other is an indicator of capitalization for all characters.
• Digit: there are three patterns for numbers: i) whether the current word has a digit, ii) whether the current word is a single digit, and iii) whether the current word has two digits.
• Non-alphabet: whether the current word contains a hyphen or other punctuation marks. Among the other punctuation marks is the colon (:); in general, what follows right after a colon mark represents a feature weight, so to make the model learn correctly we normalize only the colon mark.

In the NER task, POS tags and chunks contain very useful information for finding and classifying named entities. We predict POS tags and chunks by using a model trained with Twitter data. Commonly used NER methods on formal texts such as newspaper articles are built upon linguistic features extracted locally. However, considering the short and noisy nature of tweets, the performance of these methods is inadequate on tweets, and new approaches have to be developed to deal with this type of data. Recently, tweet representation based on segments in order to extract named entities has


proven its validity in the NER field. Along with named entities extracted from tweets via tweet segmentation, the user's retweet and mention history and followed users are also considered strong indicators of interest, and a model representing user interest is generated. Reducing Twitter users' effort to access tweets carrying the information of interest is the main goal of the study, and a tweet recommendation approach under a user interest model generated via named entities is presented.

Disadvantages of the existing system

• Downstream applications such as event detection and summarization, opinion mining, and sentiment analysis are affected by these limitations.
• Because of the limited length of a tweet (i.e., 140 characters) and the lack of restrictions on its writing style, tweets often contain grammatical errors, misspellings, and informal abbreviations.

On the other hand, despite the noisy nature of tweets, the core semantic information is well preserved in tweets in the form of named entities or semantic phrases.

PROPOSED SYSTEM

System architecture

Explanation

A rumor is defined as a statement whose truth value is unverifiable. Rumors may spread misinformation through social media. Identifying rumors is critical in online social media, where huge amounts of information are easily spread across a large network by sources with unverified authority. In this paper, we address the problem of rumor detection on Twitter, a social medium. This paper focuses on the development of the HybridSeg and KNN approach to the classification of tweets (posts on Twitter). HybridSeg learns from both global and local contexts, and has the ability to learn from pseudo feedback. In order to analyze the textual content of the tweets, we give an overview of the top terms occurring in each type of topic, i.e., the most frequent vocabulary used in each type of topic. To do so, we first performed a filtering process to remove irrelevant words. The filtering process removed all the stop words contained in the tweets; the stop word removal process covers Twitter-specific words and words in stop word lists for the main languages in the dataset. After that, we computed the TF (term frequency) of each word for each type of trending topic. This process produced a list of words for each type of trending topic, ranked in descending order by TF value. These steps are implemented in both the global and local contexts. Before we extract pseudo feedback, a POS tagger is applied to identify feature categories, such as adverbs and adjectives, in natural language. The KNN approach is then applied to classify rumors using three groups of features: content features, network features, and blog features.

Implementation

• Tweets acquisition
• Preprocessing
• Hybrid segmentation
• Named entity recognition
• Performance evaluation

Modules Description

Tweets acquisition

Twitter is an online social networking service that enables users to send and read short 140-character messages called "tweets". Registered users can read and post tweets, but unregistered users can only read them. Users access Twitter through the website interface, SMS, or a mobile device app. In order to have an opinion about a user, his posts have to be examined; therefore, using the Twitter API, all tweets posted by the user are crawled first. In this study, we tried to examine the user through not only his own posts but also his friends' posts. However, crawling all friends' posts is a huge overload, and misleading, since the Twitter following mechanism does not reflect an actual interest every time. People sometimes tend to follow some users for a temporary occasion and then forget to unfollow; sometimes they follow users just to stay informed, although they are not actually interested; and there are also friends who do not post a tweet for a long time but are still followed by the user. In this module, we can upload the tweet datasets as a CSV file. It contains following id, followers id, time stamp, user following, user followers and tweets.

Preprocessing

For named entities to be extracted successfully, the informal writing style in tweets has to be handled. Before real data entered our lives, studies in this area were conducted on formal texts such as news


articles. Generally, named entities are assumed to be words written in uppercase, or mixed-case phrases with uppercase letters at the beginning and end, and almost all of the studies are based on this assumption. However, capitalization is not a strong indicator in tweet-like informal texts, and is sometimes even misleading. As the capitalization example shows, the approaches have to change. To extract named entities in tweets, the effect of the informality of the tweets has to be minimized as much as possible. To obtain this, the following tasks are applied to the data:

• Links, hashtags, and mentions are removed, since they cannot be part of a named entity.
• Conjunctives, stop words, vocatives, slang words, etc. are removed.
• Although punctuation is not taken as an indicator, since tweets are informal, elimination of punctuation is still needed; so smileys are also removed.
• Repeated characters used to express feelings are removed.
• Informal writing style issues such as mistyping are corrected.
• ASCIIfication-related problems are resolved, since users connecting from mobile devices tend to omit Turkish characters.

It can be seen that the preprocessing tasks can be divided into two logical groups: pre-segmenting and correcting. Removal of links, hashtags, mentions, conjunctives, stop words, vocatives and slang words, together with the elimination of punctuation, is considered pre-segmentation. It is accepted that the parts of a text before and after a redundant word or a punctuation mark cannot form a named entity together; therefore every removal of a word segments the tweet, just as punctuation does naturally. Since tweets are pre-segmented before they are handled in the tweet segmentation process, the pre-segmentation tasks reduce the complexity of the text.

Hybrid segmentation

HybridSeg learns from both global and local contexts, and has the ability to learn from pseudo feedback. HybridSeg is also designed to iteratively learn from confident segments as pseudo feedback. Tweets are posted for information sharing and communication, so named entities and semantic phrases are well preserved in them. The global context derived from Web pages therefore helps identify the meaningful segments in tweets. The well-preserved linguistic features in these tweets facilitate named entity recognition with high accuracy, and each named entity is a valid segment. The method utilizing local linguistic features is denoted HybridSegNER; it obtains confident segments based on the voting results of multiple off-the-shelf NER tools. Another method, utilizing local collocation knowledge, is denoted HybridSegNGram and is proposed based on the observation that many tweets published within a short time period are about the same topic; HybridSegNGram segments tweets by estimating the term dependency within a batch of tweets. The segments recognized from local context with high confidence serve as good feedback for extracting more meaningful segments. The learning from pseudo feedback is conducted iteratively, and the method implementing the iterative learning is named HybridSegIter.

Named entity recognition

Named Entity Recognition can be basically defined as identifying and categorizing certain types of data (i.e., person, location and organization names, and date-time and numeric expressions) in a certain type of text. On the other hand, tweets are characteristically short and noisy. Given the limited length of a tweet and its restriction-free writing style, named entity recognition on this type of data becomes challenging. After basic segmentation, a great number of named entities in the text, such as personal names, location names and organization names, are not yet segmented and recognized properly. Part-of-speech tagging is applicable to a wide range of NLP tasks, including named entity segmentation and information extraction. Named Entity Recognition strategies vary on basically three factors: language, textual genre and domain, and entity type. Language is very important because language characteristics affect the approaches; a simple baseline is to assign each word its most frequent tag, and to assign each Out-of-Vocabulary (OOV) word the most common POS tag. Textual genre is another concept whose effects cannot be neglected.

Performance evaluation

In this module, we evaluate the system using the accuracy rate and normalized utility. Our proposed system provides an improved accuracy rate and normalized utility.

Design/Algorithm/Pseudo code: KNN classification

In pattern recognition, the k-Nearest Neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

• In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
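The classification rule just described can be sketched in Python. This is a minimal illustrative sketch, not the paper's implementation: the 2-D feature vectors and labels are invented, standing in for the content, network and blog features that would describe a tweet in our setting.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Majority-vote k-NN: `train` is a list of (vector, label) pairs."""
    # Sort all training examples by Euclidean distance to the query.
    dists = sorted((math.dist(vec, query), label) for vec, label in train)
    # Vote among the labels of the k nearest examples.
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Invented toy data: two small clusters of 2-D feature vectors.
train = [((0.0, 0.0), "non-rumor"), ((0.1, 0.2), "non-rumor"),
         ((0.9, 1.0), "rumor"), ((1.0, 0.8), "rumor")]
print(knn_classify(train, (0.95, 0.9)))  # nearest neighbors are rumors
```

With k = 1 the query simply takes the label of its single nearest neighbor, matching the special case noted above.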


• In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. For both classification and regression, it can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data. The algorithm has nothing to do with, and is not to be confused with, k-means, another popular machine learning technique. The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.

A commonly used distance metric for continuous variables is the Euclidean distance. For discrete variables, such as in text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has also been employed with correlation coefficients such as Pearson and Spearman. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighborhood Components Analysis.

Mathematical model

The KNN algorithm is derived as follows:

BEGIN
Input: D = {(x1, c1), …, (xN, cN)}
       x = new instance to be classified
For each labeled instance (xi, ci), calculate d(xi, x)
Order d(xi, x) from lowest to highest (i = 1, …, N)
Select the K nearest instances to x (call this set D_K)
Assign to x the most frequent class in D_K
END

The basic concept (3-NN): [figure]

CONCLUSION

We designed novel features for use in the classification of tweets, in order to develop a system through which informational data may be filtered out of the conversations that are not of much value in the context of searching for immediate information for relief efforts, or for bystanders to utilize in order to minimize damage. The results of our experiments show that classifying tweets as "rumor" vs. "non-rumor" can rely solely on the proposed features when computing resources are a concern, since the computing power required to process data into features is immensely decreased in comparison to a BOW feature set, which contains a substantially larger number of features. However, if the computing power and time necessary to process incoming Twitter data are not a concern, a combined feature set of the proposed features and the BOW-presence approach will maximize overall accuracy.

FUTURE WORK

In future work, we can extend our approach by implementing various classification algorithms to predict attackers and eliminate them from Twitter datasets, and by applying the approach to tweets in various languages.

REFERENCES

Li C., Weng J., He Q., Yao Y., Datta A., Sun A. and Lee B.S., "TwiNER: Named entity recognition in targeted Twitter stream," in SIGIR, 2012, pp. 721–730.

Li C., Weng J., He Q. and Sun A., "Exploiting hybrid contexts for tweet segmentation," in SIGIR, 2013, pp. 523–532.


Liu X., Zhang S., Wei F. and Zhou M., "Recognizing named entities in tweets," in ACL, 2011, pp. 359–367.

Liu X., Zhou X., Fu Z., Wei F. and Zhou M., "Exacting social events for tweets using a factor graph," in AAAI, 2012.

Ritter A., Mausam, Etzioni O. and Clark S., "Open domain event extraction from Twitter," in KDD, 2012, pp. 1104–1112.

Ritter A., Clark S., Mausam and Etzioni O., "Named entity recognition in tweets: An experimental study," in EMNLP, 2011, pp. 1524–1534.

Cui A., Zhang M., Liu Y., Ma S. and Zhang K., "Discover breaking events with popular hashtags in Twitter," in CIKM, 2012, pp. 1794–1798.

Meng X., Wei F., Liu X., Zhou M., Li S. and Wang H., "Entity centric topic-oriented opinion summarization in Twitter," in KDD, 2012, pp. 379–387.

Luo Z., Osborne M. and Wang T., "Opinion retrieval in Twitter," in ICWSM, 2012.

Wang X., Wei F., Liu X., Zhou M. and Zhang M., "Topic sentiment analysis in Twitter: a graph-based hashtag sentiment classification approach," in CIKM, 2011, pp. 1031–1040.

