
Breaking News Detection and Tracking in Twitter

WI: IWI
August 31st, 2010

Swit Phuvipadawat, Tsuyoshi Murata


Dept. of Computer Science
Tokyo Institute of Technology, Japan
Outline

• Introduction
• Analysis
• Methodology
• Results and Application
• Challenges and Future Work
• Conclusion
Introduction
Twitter as a news channel

In June 2009, during the Iranian election, Twitter transformed the way people convey news.
http://blog.marsdencartoons.com/2009/06/18/cartoon-iranian-election-demonstrations-and-twitter/marsden-iran-twitter72/
Twitter as a news channel

[Figure: example tweets clustered into news stories]
• Earthquakes around the world: "Earthquake with 6.4 magnitude hits Taiwan!", "Earthquake in Taiwan", "Earthquake in Chile", "Tsunami alert after Chilean earthquake.", "Earthquake in Haiti explained."
• Obama Health Reform: "Health care"
• Iraq Election: "Early voting begin March, 7", "US and UN hope Sunni participation help heal the wound."
• Apple announced iPad: "Apple to launch iPad on March 26", "The Apple iPad starting $499", "Steve Jobs demoed iPad"
Research Topic

“Breaking News Detection and Tracking in Twitter”
➡ Topic Detection and Tracking (TDT)
➡ Information Retrieval
➡ Social Network Analysis
Topic Detection and Tracking (TDT)
• To monitor broadcast news and alert an analyst to new and
interesting events happening in the world. [Allan 2001]
• To search, organize and structure multilingual, news
oriented textual materials from a variety of broadcast news
media. [Fiscus & Doddington 2002]
• Focuses on 5 tasks:
❖ Story segmentation
❖ First story detection
❖ Cluster detection
❖ Tracking
❖ Story link detection
Recent Studies
• Topological characteristics of Twitter
What is Twitter, a Social Network or a News Media? H. Kwak, C. Lee, H. Park, S. Moon [WWW2010]
➡ 85% of trending topics in Twitter appear in headline news
• Using Twitter data to improve web ranking
Time is of the Essence: Improving Recency Ranking Using Twitter Data. A. Dong, R. Zhang et al. [WWW2010]
➡ Micro-blogging data reveals fresh URLs not yet indexed by search engines
• Event detection
Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
T. Sakaki, M. Okazaki, Y. Matsuo [WWW2010]
Recent Studies
• Detection of influential topics and users
❖ Characterizing Microblogs with Topic Models. D. Ramage, S. Dumais, D. Liebling [ICWSM2010]
➡ Uses Labeled LDA, a supervised learning model, to characterize the content of messages along substance, style, status and social dimensions.
❖ TwitterRank: Finding Topic-sensitive Influential Twitterers. J. Weng, E.-P. Lim, J. Jiang, Q. He [WSDM2010]
➡ Uses PageRank with a topic model (LDA) to measure the influence of users.
Analysis
Message Analysis
Findings from a dataset of 154,000 msg.
Single Message Aspect with 33,000 msg. from news engaging users

Symbol    Msg Attribute    Count     %
@         Tag a user       79,469    51.6%
http://   Embed a link     50,404    32.7%
RT        Retweet          29,935    19.4%
#         Use a hashtag    20,348    13.2%

Text Characteristics          Examples
Sensational adjectives (E)    terrible, horrible, terrifying, shocking, terrific, amazing, ...
Sensational phrases (E)       wow! oh my god! ...
Significant nouns (F)         US. President, Obama, Michael Jackson, Japan, Toyota, ...
Impactful verbs (F)           kill, die, crash, reveal, discover, rescue, ...
Data of March 2009
Network Analysis
Timeline Aspect

RT (retweet): taking someone's Twitter message and rebroadcasting that same message.

Similar to retelling a story to your friends.

Example timeline:
John, 12:15 AM: Earthquake in Tokyo!
Lisa, 12:30 AM: RT @John Earthquake in Tokyo!
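The retweet convention described here can be recognized with a simple pattern. The following is a minimal sketch, assuming the plain "RT @user message" form shown in the example; the function names are illustrative and are not taken from the slides.

```python
import re

# Matches the conventional "RT @user message" prefix (optional colon after the name).
RT_PATTERN = re.compile(r"^RT\s+@(\w+):?\s*(.*)$", re.DOTALL)

def parse_retweet(text):
    """Return (original_author, original_text) for a retweet, else None."""
    match = RT_PATTERN.match(text.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

def count_retweets(messages):
    """Count how many times each original text was rebroadcast."""
    counts = {}
    for text in messages:
        parsed = parse_retweet(text)
        if parsed is not None:
            _, original = parsed
            counts[original] = counts.get(original, 0) + 1
    return counts

timeline = ["Earthquake in Tokyo!", "RT @John Earthquake in Tokyo!"]
retweet_counts = count_retweets(timeline)
```

Counting retweets per original text gives the kind of popularity signal that the ranking step later relies on.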
Methodology
Method for Collecting, Indexing and Grouping

‣ Collecting
• Fetch messages using pre-defined search queries for breaking-news-related keywords and hashtags
‣ Indexing
• An index based on term vectors is constructed.
• Apache Lucene is used as the information retrieval library
‣ Grouping
• Similar messages are grouped together to form a news story
• Similarity comparison is based on the vector space model using TF-IDF with term boosting for proper nouns
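The collecting and indexing steps can be illustrated with a small in-memory sketch. This is not Lucene and not the authors' code: the query terms, the tokenizer and the `TermIndex` class are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical query terms; the slides do not list the actual search queries.
QUERY_TERMS = {"breaking", "#breakingnews"}

TOKEN = re.compile(r"[#@]?\w+")

def tokenize(text):
    """Lowercase a message and split it into words, hashtags and mentions."""
    return [token.lower() for token in TOKEN.findall(text)]

def matches_query(text):
    """Collecting: keep a message only if it contains a query term."""
    return any(token in QUERY_TERMS for token in tokenize(text))

class TermIndex:
    """Indexing: per-message term vectors plus postings for document frequency."""

    def __init__(self):
        self.term_vectors = {}             # message id -> Counter of terms
        self.postings = defaultdict(set)   # term -> ids of messages containing it

    def add(self, msg_id, text):
        vector = Counter(tokenize(text))
        self.term_vectors[msg_id] = vector
        for term in vector:
            self.postings[term].add(msg_id)

    def doc_freq(self, term):
        """Number of indexed messages that contain the term."""
        return len(self.postings.get(term, ()))

index = TermIndex()
index.add(0, "Earthquake in Tokyo! #earthquake")
index.add(1, "RT @John Earthquake in Tokyo!")
```

The term vectors and document frequencies stored here are the inputs a TF-IDF comparison needs.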
Grouping Method Explained

Conditions
• Messages in a group must be related to the first story
• Further messages can develop upon previous messages

A message is compared with the first message in a group and the top k terms in that group.

$$\mathrm{sim}(m_1, m_2) = \sum_{t \in m_1} \left[ \mathrm{tf}(t, m_2) \times \mathrm{idf}(t) \times \mathrm{boost}(t) \right]$$

$$\mathrm{tf}(t, m) = \frac{\mathrm{count}(t \text{ in } m)}{\mathrm{size}(m)}$$

$$\mathrm{idf}(t) = 1 + \log\left( \frac{N}{\mathrm{count}(m \text{ has } t)} \right)$$

Boost is raised for proper nouns (e.g. China, Obama, Toyota) and hashtags. NER is used to detect proper nouns.
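The similarity formula and the first-story grouping rule can be sketched as follows. Several details are assumptions rather than the authors' implementation: messages are pre-tokenized lists, the sum runs over the distinct terms of the first argument, the threshold of 0.5 is arbitrary, and the comparison against the top k terms of a group is omitted for brevity.

```python
import math

def tf(term, message):
    """count(t in m) / size(m) for a tokenized message."""
    return message.count(term) / len(message)

def idf(term, corpus):
    """1 + log(N / count(m has t)); returns 0.0 if no message has the term."""
    containing = sum(1 for message in corpus if term in message)
    if containing == 0:
        return 0.0
    return 1.0 + math.log(len(corpus) / containing)

def sim(m1, m2, corpus, boosted, boost=1.5):
    """Sum of tf(t, m2) * idf(t) * boost(t) over the distinct terms t of m1."""
    total = 0.0
    for term in set(m1):
        weight = boost if term in boosted else 1.0
        total += tf(term, m2) * idf(term, corpus) * weight
    return total

def group_messages(messages, boosted, threshold=0.5, boost=1.5):
    """Attach each message to the best-matching group's first story, else start a group."""
    groups = []  # lists of message indices; element 0 is the first story
    for i, message in enumerate(messages):
        best, best_score = None, threshold
        for group in groups:
            score = sim(message, messages[group[0]], messages, boosted, boost)
            if score > best_score:
                best, best_score = group, score
        if best is None:
            groups.append([i])
        else:
            best.append(i)
    return groups

messages = [
    ["earthquake", "in", "tokyo"],
    ["apple", "ipad", "launch"],
    ["tokyo", "earthquake", "alert"],
]
proper_nouns = {"tokyo", "apple"}  # would come from NER in the described system
groups = group_messages(messages, proper_nouns)
```

Raising `boost` increases the weight of shared proper nouns, which makes messages about the same named entity more likely to land in the same group.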
Named Entity Recognizer
• The Stanford Named Entity Recognizer (NER) is used for the following purposes:
➡ To detect proper nouns used in the grouping algorithm
➡ To classify messages based on named entities (Person, Organization, Location, Misc.)
• NER is based on a linear-chain Conditional Random Field (CRF)

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-37
Method for Group Ranking
• A group score is based on reliability, popularity and freshness factors.
‣ Reliability comes from the number of followers of the user who posted a message.
‣ Popularity comes from the number of retweets.
‣ Freshness is computed from the difference between the current time and the time a message was posted.
The score for each group is computed as follows:
[Equation: group score formula, not preserved in the extracted text]
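To make the three factors concrete, here is a rough sketch of how they might be combined. The scoring formula from the slide is not available in the text, so the combination below (log-scaled followers plus retweets, damped by age in hours) is purely an assumption for illustration and should not be read as the authors' formula.

```python
import math
from dataclasses import dataclass

@dataclass
class Message:
    followers: int    # followers of the posting user (reliability)
    retweets: int     # times the message was retweeted (popularity)
    posted_at: float  # posting time in seconds since the epoch

def message_score(msg, now):
    """Assumed combination: (log(1 + followers) + retweets) / (1 + age in hours)."""
    age_hours = max(now - msg.posted_at, 0.0) / 3600.0
    freshness = 1.0 / (1.0 + age_hours)
    return (math.log1p(msg.followers) + msg.retweets) * freshness

def group_score(messages, now):
    """Assumed aggregation: sum of the member message scores."""
    return sum(message_score(msg, now) for msg in messages)

now = 10 * 3600.0
fresh = Message(followers=1000, retweets=5, posted_at=now)
stale = Message(followers=1000, retweets=5, posted_at=0.0)
```

Under this assumed form, a group of recent, widely retweeted messages from well-followed users ranks above an older group with the same counts.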
Results and Application
Detection Effectiveness
Method: Search query

Rate                                 Value
Precision                            90.0% (45/50)
Recall                               -
Spam                                 8% (4/50)
User generated                       11.1% (5/45)
Avg. time to collect 100 new msg.    72 sec

Based on an experiment conducted in June 2010
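The reported rates follow directly from the counts in parentheses. The short check below reproduces the arithmetic; note that the user-generated rate is taken over the 45 relevant messages rather than all 50. The helper name is illustrative.

```python
def rate(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

precision = rate(45, 50)       # relevant messages among the 50 sampled
spam = rate(4, 50)             # spam messages among the 50 sampled
user_generated = rate(5, 45)   # user-generated messages among the 45 relevant ones
```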
Example Result of Grouping
Message topics: Toyota, MJ., Airline, US. Japan, Prisoner

(a) No boost
G0: M3 M4 M5
G1: M7 M8
G2: M0 M1
G3: M2
G4: M6
G5: M9

(b) b=1.5
G0: M2 M3 M4 M5
G1: M7 M8
G2: M0 M1
G4: M6
G5: M9

(c) b=1.7
G0: M0 M1 M7 M8
G1: M2 M3 M4 M5
G2: M6
G3: M9

(d) b=2
G0: M2 M3 M4 M5 M9
G1: M0 M1 M7 M8
G2: M6

Boosting improves the grouping result
Application

• A prototype application called Hotstream has been developed.

• The goal is to create an automatic news portal based on Twitter data.
Challenges and Future Work
Challenges
• Messages are short

• Two similar stories may be expressed using different vocabulary terms

• The style of writing is unconventional, with slang and many spelling variants
Future Work
• Explore the community structures of named entities to find relationships among groups of messages

[Figure: message groups, grouped by TF-IDF with proper noun term boosting]
Example Dataset
Messages-Named Entities

Top 18 stories and their keywords from Hotstream as of July 21st, 2010
Red nodes = keywords, Yellow nodes = message groups
Community Detection Experiment
Network Characteristics
Network Type                  Edge betweenness
No. Vertices                  453 (254,200)
No. Edges                     1280
Mean Degree                   5.639
No. Clusters                  40
Largest Component Fraction    0.781

Community Detection Results
Method                        Edge betweenness
No. Communities               68
Modularity                    0.71
Purity                        0.67

[Figure: detected communities, with labeled examples: Australian Prime Minister, BP Oil leak, US. Military in Middle East]
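For readers unfamiliar with the two quality measures reported above, the sketch below shows one standard way to compute them for an undirected graph without self-loops. It uses the usual Newman modularity and a majority-label purity; whether the slides used exactly these definitions, and what reference labels purity was measured against, is an assumption. The toy graph is illustrative and is not the Hotstream data.

```python
from collections import Counter, defaultdict

def modularity(edges, community):
    """Q = sum over communities of (internal edges / m) - (total degree / 2m)^2."""
    m = len(edges)
    internal = Counter()
    degree = Counter()
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def purity(community, labels):
    """Fraction of nodes whose reference label is the majority label of their community."""
    by_community = defaultdict(Counter)
    for node, c in community.items():
        by_community[c][labels[node]] += 1
    majority = sum(max(counts.values()) for counts in by_community.values())
    return majority / len(community)

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
labels = {0: "x", 1: "x", 2: "y", 3: "y", 4: "y", 5: "y"}

q = modularity(edges, community)
p = purity(community, labels)
```

Higher modularity indicates denser connections inside communities than expected by chance, while purity measures agreement with the reference labels.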
Conclusion
• Introduced Twitter as a means of conveying news

• Described message and network characteristics of Twitter

• Described the method to collect, index, group and rank messages

• Introduced Hotstream, an automatic news portal

• Proposed an extension study on the group-keyword network to improve the grouping result
Thank You
