Breaking News Detection and Tracking in Twitter (WI:IW'10)

WI: IWI
August 31st, 2010
Swit Phuvipadawat, Tsuyoshi Murata

Dept. of Computer Science
Tokyo Institute of Technology, Japan
Outline
• Introduction
• Analysis
• Methodology
• Results and Application
• Challenges and Future Works
• Conclusion
Introduction
Twitter as a news channel
In June 2009, during the Iranian election, Twitter transformed the way people convey news.
http://blog.marsdencartoons.com/2009/06/18/cartoon-iranian-election-demonstrations-and-twitter/marsden-iran-twitter72/
Twitter as a news channel
Example messages, grouped by story:
• Earthquakes around the world
❖ Earthquake with 6.4 magnitude hits Taiwan!
❖ Earthquake in Taiwan
❖ Earthquake in Chile
❖ Tsunami alert after Chilean earthquake.
❖ Earthquake in Haiti explained.
• Obama Health Reform
❖ Health care
• Iraq Election
❖ Early voting begin March, 7
❖ US and UN hope Sunni participation help heal the wound.
• Apple announced iPad
❖ Apple to launch iPad on March 26
❖ The Apple iPad starting $499
❖ Steve Jobs demoed iPad
Research Topic
“Breaking News Detection and Tracking in Twitter”
➡ Topic Detection and Tracking (TDT)
➡ Information Retrieval
➡ Social Network Analysis
Topic Detection and Tracking (TDT)
• To monitor broadcast news and alert an analyst to new and
interesting events happening in the world. [Allan 2001]
• To search, organize and structure multilingual, news
oriented textual materials from a variety of broadcast news
media. [Fiscus & Doddington 2002]
• Focuses on 5 tasks:
❖ Story segmentation
❖ First story detection
❖ Cluster detection
❖ Tracking
❖ Story link detection
Recent Studies
• Topological characteristics of Twitter
What is Twitter, a Social Network or a News Media? H. Kwak, C. Lee, H. Park, S. Moon [WWW2010]
➡ 85% of trending topics in Twitter appear in headline news
• Using Twitter data to improve web ranking
Time is of the Essence: Improving Recency Ranking Using Twitter Data
A. Dong, R. Zhang et al. [WWW2010]
➡ Micro-blogging data reveals fresh URLs not yet indexed by search engines
• Event detection
Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
T. Sakaki, M. Okazaki, Y. Matsuo [WWW2010]
Recent Studies
• Influential topic and user detection
❖ Characterizing Microblogs with Topic Models. D. Ramage, S. Dumais, D. Liebling [ICWSM2010]
➡ Uses Labeled LDA, a supervised learning model, to characterize the content of messages into substance, style, status and social characteristics.
❖ TwitterRank: Finding Topic-sensitive Influential Twitterers. J. Weng, E.-P. Lim, J. Jiang, Q. He [WSDM2010]
➡ Uses PageRank with a topic model (LDA) to measure the influence of users.
Analysis
Message Analysis
Single Message Aspect
Findings from a dataset of 154,000 messages, with 33,000 messages from news-engaging users.

Symbol    Msg Attribute    Count     %
@         Tag a user       79,469    51.6%
http://   Embed a link     50,404    32.7%
RT        Retweet          29,935    19.4%
#         Use a hashtag    20,348    13.2%

Text Characteristics      Label   Examples
Sensational adjectives    E       terrible, horrible, terrifying, shocking, terrific, amazing, ...
Sensational phrases       E       wow! oh my god! ...
Significant nouns         F       US. President, Obama, Michael Jackson, Japan, Toyota, ...
Impactful verbs           F       kill, die, crash, reveal, discover, rescue, ...

Data of March 2009
Network Analysis
Timeline Aspect
• RT (retweet) means taking someone's Twitter message and rebroadcasting that same message.
• It is like retelling a story to your friends.
Example:
John, 12:15 AM [3]: Earthquake in Tokyo!
Lisa, 12:30 AM [6]: RT @John Earthquake in Tokyo!
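The retweet convention above can be recognized mechanically. The following is a minimal sketch, assuming a retweet always starts with "RT @user" followed by the original text; the names RT_PATTERN, parse_retweet and count_retweets are illustrative and do not come from the slides.

```python
import re

# Assumed convention: "RT @<user>" (optionally followed by ":") and then the original text.
RT_PATTERN = re.compile(r"^RT\s+@(\w+):?\s*(.*)$", re.DOTALL)


def parse_retweet(text):
    """Return (original_author, original_text) if text looks like a retweet, else None."""
    match = RT_PATTERN.match(text.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


def count_retweets(messages):
    """Count how often each (author, text) pair was rebroadcast."""
    counts = {}
    for text in messages:
        parsed = parse_retweet(text)
        if parsed is not None:
            counts[parsed] = counts.get(parsed, 0) + 1
    return counts
```

Counting rebroadcasts per original message in this way is one possible input to a popularity measure.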
Methodology
Method for Collecting, Indexing and Grouping
‣ Collecting
• Fetch messages using pre-defined search queries for breaking-news-related keywords and hashtags
‣ Indexing
• An index based on term vectors is constructed.
• Apache Lucene is used as the information retrieval library
‣ Grouping
• Similar messages are grouped together to form a news story
• Similarity comparison is based on the vector space model using TF-IDF with term boosting for proper nouns
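To make the indexing step concrete, here is a small in-memory sketch of a term-vector index. It is a stand-in for illustration only and does not use or imitate the Apache Lucene API; the tokenizer and the class name TermVectorIndex are assumptions.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into word, #hashtag and @mention tokens."""
    return re.findall(r"[#@]?\w+", text.lower())


class TermVectorIndex:
    """Keeps one term vector per message plus an inverted index of terms."""

    def __init__(self):
        self.vectors = {}                  # message id -> Counter of term counts
        self.postings = defaultdict(set)   # term -> ids of messages containing it

    def add(self, msg_id, text):
        vector = Counter(tokenize(text))
        self.vectors[msg_id] = vector
        for term in vector:
            self.postings[term].add(msg_id)

    def document_frequency(self, term):
        """Number of indexed messages that contain the term."""
        return len(self.postings.get(term, ()))
```

The per-message counts and the document frequencies are the two quantities a TF-IDF comparison needs.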
Grouping Method Explained
Conditions
• Messages in a group must be related to the first story
• Further messages can develop upon previous messages
A message is compared with the first message in a group and the top k terms in that group.

sim(m_1, m_2) = \sum_{t \in m_1} [ tf(t, m_2) \times idf(t) \times boost(t) ]
tf(t, m) = \frac{count(t \text{ in } m)}{size(m)}
idf(t) = 1 + \log\left( \frac{N}{count(m \text{ has } t)} \right)

The boost is raised for proper nouns (e.g. China, Obama, Toyota) and hashtags. NER is used to detect proper nouns.
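The similarity and the grouping conditions can be sketched as follows. This is a hedged illustration, not the authors' implementation: the threshold value, the default k, the way the first message and the top k terms are combined into one reference, and computing idf over the whole batch are all assumptions. Proper nouns are passed in as a lowercase set instead of being detected by NER.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word, #hashtag and @mention tokens."""
    return re.findall(r"[#@]?\w+", text.lower())


def tf(term, tokens):
    # count(t in m) / size(m)
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(term, corpus):
    # 1 + log(N / count(m has t)); 0 when no message contains the term
    containing = sum(1 for tokens in corpus if term in tokens)
    return 1.0 + math.log(len(corpus) / containing) if containing else 0.0


def boost(term, proper_nouns, b):
    # Hashtags and known proper nouns get weight b, everything else 1
    return b if term.startswith("#") or term in proper_nouns else 1.0


def similarity(m1, m2, corpus, proper_nouns, b=1.5):
    """Sum over terms t in m1 of tf(t, m2) * idf(t) * boost(t)."""
    return sum(tf(t, m2) * idf(t, corpus) * boost(t, proper_nouns, b)
               for t in set(m1))


def group_messages(texts, proper_nouns, b=1.5, threshold=0.5, k=5):
    """Greedy grouping; returns lists of message indices, first story first."""
    corpus = [tokenize(t) for t in texts]
    groups = []
    for i, tokens in enumerate(corpus):
        best, best_score = None, threshold
        for group in groups:
            counts = Counter(t for j in group for t in corpus[j])
            top_terms = [t for t, _ in counts.most_common(k)]
            # Assumption: the reference is the first message plus the top-k terms
            reference = corpus[group[0]] + top_terms
            score = similarity(tokens, reference, corpus, proper_nouns, b)
            if score > best_score:
                best, best_score = group, score
        if best is None:
            groups.append([i])   # no group is similar enough: start a new story
        else:
            best.append(i)
    return groups
```

Raising b makes messages that share a proper noun or hashtag more likely to pass the threshold and join the same group.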
Named Entity Recognizer
• Stanford Named Entity Recognizer (NER) has been adopted for the following uses:
➡ To detect proper nouns used in the grouping algorithm
➡ To classify messages based on named entities (Person, Organization, Location, Misc.)
• NER is based on a linear-chain Conditional Random Field (CRF)
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-37
Method for Group Ranking
• A group score is based on reliability, popularity and freshness factors. The score for each group is computed from these factors:
‣ Reliability comes from the number of followers of the user who posted a message.
‣ Popularity comes from the number of retweets.
‣ Freshness is computed from the difference between the current time and the time a message was posted.
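As an illustration only, one way the three factors could be combined is shown below. The combination itself (summed followers plus summed retweets, damped by the logarithm of the age of the newest message) is an assumption made for this sketch and is not taken from the slides; the names Message and group_score are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Message:
    followers: int     # followers of the posting user (reliability)
    retweets: int      # times the message was retweeted (popularity)
    posted_at: float   # POSIX timestamp of posting (freshness)


def group_score(messages, now):
    """Assumed score: (followers + retweets), decayed by the age of the newest message."""
    if not messages:
        return 0.0
    reliability = sum(m.followers for m in messages)
    popularity = sum(m.retweets for m in messages)
    age_hours = max(0.0, (now - max(m.posted_at for m in messages)) / 3600.0)
    freshness = 1.0 / math.log2(age_hours + 2.0)   # equals 1.0 for a brand-new message
    return (reliability + popularity) * freshness
```

With any formula of this shape, a group with the same followers and retweets ranks lower as it ages.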
Results and Application
Detection Effectiveness
Method: Search query

Rates                                Value
Precision                            90.0% (45/50)
Recall                               -
Spam                                 8% (4/50)
User generated                       11.1% (5/45)
Avg. time to collect 100 new msg.    72 sec

Based on an experiment conducted in June 2010
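The percentages follow directly from the counts in parentheses. A tiny sketch of the arithmetic, where the helper name rate is an assumption:

```python
def rate(part, total):
    """Percentage of part in total, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


precision = rate(45, 50)       # relevant messages among the 50 sampled
spam = rate(4, 50)             # spam messages among the 50 sampled
user_generated = rate(5, 45)   # share within the 45 relevant messages
```

The denominators suggest that precision and spam are measured over the 50 sampled messages, while the user-generated share is measured over the 45 relevant ones.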
Example Result of Grouping
Stories in the example: Toyota, MJ., Airline, US. Japan, Prisoner

(a) No boost
G0: M3 M4 M5
G1: M7 M8
G2: M0 M1
G3: M2
G4: M6
G5: M9

(b) b=1.5
G0: M2 M3 M4 M5
G1: M7 M8
G2: M0 M1
G4: M6
G5: M9

(c) b=1.7
G0: M0 M1 M7 M8
G1: M2 M3 M4 M5
G2: M6
G3: M9

(d) b=2
G0: M2 M3 M4 M5 M9
G1: M0 M1 M7 M8
G2: M6

Boosting improves the grouping result
Application
• A prototype application called Hotstream was developed.
• The goal is to create an automatic news portal based on Twitter data.
Challenges and Future Works
Challenges
• The length of messages is short
• Two similar stories may be expressed using different vocabulary terms
• The style of writing is unconventional, with slang and many spelling variants
Future Works
• Explore the community structures of named entities to find relationships among groups of messages
Figure label: Grouped by TF-IDF with proper noun term boosting
Example Dataset
Messages-Named Entities
Top 18 stories and their keywords from Hotstream as of July 21st, 2010
Red nodes = keywords, Yellow nodes = message groups
Community Detection Experiment
Network Characteristics
Network Type                  Edge betweenness
No. Vertices                  453 (254,200)
No. Edges                     1280
Mean Degree                   5.639
No. Clusters                  40
Largest Component Fraction    0.781

Community Detection Results
Method                        Edge betweenness
No. Communities               68
Modularity                    0.71
Purity                        0.67

Labeled communities in the figure: Australian Prime Minister, BP Oil leak, US. Military in Middle East
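The slides do not define purity. A common definition, assumed here, is the fraction of nodes that belong to the majority true class of their detected community. The sketch below uses that definition; the function name and the toy labels are illustrative.

```python
from collections import Counter


def purity(communities, true_label):
    """Fraction of nodes whose community's majority label matches their own class."""
    total = sum(len(c) for c in communities)
    if total == 0:
        return 0.0
    majority = sum(
        Counter(true_label[v] for v in c).most_common(1)[0][1]
        for c in communities if c
    )
    return majority / total
```

For example, with communities {a, b, c} and {d, e} where a and b belong to story x and c, d, e to story y, the majority counts are 2 and 2, so the purity is 4/5.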
Conclusion
• Introduced Twitter as a means to convey news
• Described message and network characteristics of Twitter
• Described the method to collect, index, group and rank messages
• Introduced Hotstream, an automatic news portal
• Proposed an extension study on the group-keyword network to improve the grouping result
Thank You
