Willkommen bei Scribd!

Etola: A News Headline Clustering Tool

Hochgeladen von

0% fanden dieses Dokument nützlich (0 Abstimmungen)

48 Ansichten15 Seiten

Nibir Bora & Bhabani Mishra. In 1st Project Innovation Contest (PIC 2012), poster session of 8th International Conference on Distributed Computing and Internet Technology (ICDCIT 2012), Bhubaneswar, 1-4 February, 2012.

Copyright

Verfügbare Formate

PDF, TXT oder online auf Scribd lesen

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Dieses Dokument melden

Copyright:

Attribution Non-Commercial (BY-NC)

Verfügbare Formate

Als PDF, TXT herunterladen oder online auf Scribd lesen

Markieren Sie unangemessene Inhalte

0% fanden dieses Dokument nützlich (0 Abstimmungen)

48 Ansichten15 Seiten

Etola: A News Headline Clustering Tool

Hochgeladen von

Nibir Bora

Copyright:

Attribution Non-Commercial (BY-NC)

Verfügbare Formate

Als PDF, TXT herunterladen oder online auf Scribd lesen

Markieren Sie unangemessene Inhalte

Zu Seite

Sie sind auf Seite 1von 15

Im Dokument suchen

ICDCIT-2012

PROJECT INNOVATION CONTEST

Etola: A News Headline Clustering Tool

(PIC No: 11)
Nibir Nayan Bora Project Guide: Dr. Bhabani Shankar Prasad Mishra KIIT University, Bhubaneswar, Odisha

Outline
Perform Cluster Analysis on news headlines.
Create a dataset Propose a new technique Compare with traditional method Evaluate quality

Cluster Analysis
Assigning objects to groups (called clusters) in accordance with the general clustering rule, high intra-cluster similarity and low inter-cluster similarity

Document clustering / Text categorization Unsupervised learning

Why News Headlines?

How are news headlines different? Small size Less dimensions Grammatical inconsistent Benefit: Traditional clustering methods can be modified accordingly.

The Dataset
Extracted from the Reuters corpus (90 category split) No. of documents: 343 No. of classes (topics): 26 e.g. gold, coffee, jobs, retail, housing. Each headline relates to a single topic.

The Dataset (examples)

Headline U.S. cotton certificate expiration date extended China trying to increase cotton output, paper says January housing sales drop, realty group says Quebec February housing starts fall Pakistan not seen as major wheat exporter
Weather hurting Yugoslav wheat - USDA Report

Topic Cotton Cotton Housing Housing Wheat

wheat

Preprocessing
Unwanted characters Case normalization Tokenization Stopwords Stemming

Vector Space Model

k-means
Randomly select k initial centroids repeat
Assign each document to the closest centroid. Recalculate cluster centroids.

until clusters do not change

Frequent term
Create a list of all terms in corpus and arrange them in order of their frequencies. for each term
if the term is present more often in unclustered documents, then Form a cluster by collecting all documents containing the term.

Apply refinement

Refinement
For clusters with single document, mark document as unclustered. Assign each unclustered document to its closest cluster. Apply k-means clustering to these clusters.

Frequent term plus

Assumption: The lesser the number of terms in a document, the easier it is to identify the topic term. A document contains only one topic term.
What do we do? Arrange the documents in reverse order of number of terms. Try to find a topic term in each.

Frequent term plus (algorithm)

Create a list of all terms in corpus and their frequencies. Arrange the documents in ascending order of number of terms. for each document
Starting with term with highest frequency in corpus, try to find a topic term. if the term is present more often in unclustered documents, then Form a cluster by collecting all documents containing the term.

Apply refinement

Frequent nouns
Similar to Frequent term clustering. In preprocessing, all non-noun features are removed.

Frequent nouns plus

Similar to Frequent term plus clustering. In preprocessing, all non-noun features are removed.

Results
Legend: FF Frequent feature FN Frequent nouns
k-means FF FF plus FN FN plus

No. of clusters generated

Correct clusters

26
17

29
19

34
22

28
20

33
22

Incorrect clusters
No. of documents correctly clustered

9
238

10
226

12
255

8
241

11
272

Applications
Create news categories by clustering news from various sources.
Like Google News

Future Work
Cluster descriptor Use of synonyms

Das könnte Ihnen auch gefallen

Shoe Dog: A Memoir by the Creator of Nike
Von Everand
Shoe Dog: A Memoir by the Creator of Nike
Phil Knight
Bewertung: 4.5 von 5 Sternen
4.5/5 (537)
Grit: The Power of Passion and Perseverance
Von Everand
Grit: The Power of Passion and Perseverance
Angela Duckworth
Bewertung: 4 von 5 Sternen
4/5 (587)
Hidden Figures: The American Dream and the Untold Story of the Black Women Mathematicians Who Helped Win the Space Race
Von Everand
Hidden Figures: The American Dream and the Untold Story of the Black Women Mathematicians Who Helped Win the Space Race
Margot Lee Shetterly
Bewertung: 4 von 5 Sternen
4/5 (894)
The Yellow House: A Memoir (2019 National Book Award Winner)
Von Everand
The Yellow House: A Memoir (2019 National Book Award Winner)
Sarah M. Broom
Bewertung: 4 von 5 Sternen
4/5 (98)
The Little Book of Hygge: Danish Secrets to Happy Living
Von Everand
The Little Book of Hygge: Danish Secrets to Happy Living
Meik Wiking
Bewertung: 3.5 von 5 Sternen
3.5/5 (399)
On Fire: The (Burning) Case for a Green New Deal
Von Everand
On Fire: The (Burning) Case for a Green New Deal
Naomi Klein
Bewertung: 4 von 5 Sternen
4/5 (73)
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
Von Everand
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
Mark Manson
Bewertung: 4 von 5 Sternen
4/5 (5794)
Never Split the Difference: Negotiating As If Your Life Depended On It
Von Everand
Never Split the Difference: Negotiating As If Your Life Depended On It
Chris Voss
Bewertung: 4.5 von 5 Sternen
4.5/5 (838)
Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
Von Everand
Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
Ashlee Vance
Bewertung: 4.5 von 5 Sternen
4.5/5 (474)
Yes Please
Von Everand
Yes Please
Amy Poehler
Bewertung: 4 von 5 Sternen
4/5 (1891)
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
Von Everand
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
Dave Eggers
Bewertung: 3.5 von 5 Sternen
3.5/5 (231)
Principles: Life and Work
Von Everand
Principles: Life and Work
Ray Dalio
Bewertung: 4 von 5 Sternen
4/5 (599)
The Emperor of All Maladies: A Biography of Cancer
Von Everand
The Emperor of All Maladies: A Biography of Cancer
Siddhartha Mukherjee
Bewertung: 4.5 von 5 Sternen
4.5/5 (271)
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
Von Everand
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
Brené Brown
Bewertung: 4 von 5 Sternen
4/5 (1090)
The World Is Flat 3.0: A Brief History of the Twenty-first Century
Von Everand
The World Is Flat 3.0: A Brief History of the Twenty-first Century
Thomas L. Friedman
Bewertung: 3.5 von 5 Sternen
3.5/5 (2219)
Team of Rivals: The Political Genius of Abraham Lincoln
Von Everand
Team of Rivals: The Political Genius of Abraham Lincoln
Doris Kearns Goodwin
Bewertung: 4.5 von 5 Sternen
4.5/5 (234)
The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
Von Everand
The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
Ben Horowitz
Bewertung: 4.5 von 5 Sternen
4.5/5 (344)
Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
Von Everand
Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
Gilbert King
Bewertung: 4.5 von 5 Sternen
4.5/5 (265)
Fear: Trump in the White House
Von Everand
Fear: Trump in the White House
Bob Woodward
Bewertung: 3.5 von 5 Sternen
3.5/5 (738)
Angela's Ashes: A Memoir
Von Everand
Angela's Ashes: A Memoir
Frank McCourt
Bewertung: 4.5 von 5 Sternen
4.5/5 (440)
Rise of ISIS: A Threat We Can't Ignore
Von Everand
Rise of ISIS: A Threat We Can't Ignore
Jay Sekulow
Bewertung: 3.5 von 5 Sternen
3.5/5 (137)
Steve Jobs
Von Everand
Steve Jobs
Walter Isaacson
Bewertung: 4.5 von 5 Sternen
4.5/5 (806)
John Adams
Von Everand
John Adams
David McCullough
Bewertung: 4.5 von 5 Sternen
4.5/5 (2409)
The Unwinding: An Inner History of the New America
Von Everand
The Unwinding: An Inner History of the New America
George Packer
Bewertung: 4 von 5 Sternen
4/5 (45)
Bad Feminist: Essays
Von Everand
Bad Feminist: Essays
Roxane Gay
Bewertung: 4 von 5 Sternen
4/5 (1015)
The Glass Castle: A Memoir
Von Everand
The Glass Castle: A Memoir
Jeannette Walls
Bewertung: 4.5 von 5 Sternen
4.5/5 (1712)
Wolf Hall: A Novel
Von Everand
Wolf Hall: A Novel
Hilary Mantel
Bewertung: 4 von 5 Sternen
4/5 (3811)
The Outsider: A Novel
Von Everand
The Outsider: A Novel
Stephen King
Bewertung: 4 von 5 Sternen
4/5 (1839)
The Perks of Being a Wallflower
Von Everand
The Perks of Being a Wallflower
Stephen Chbosky
Bewertung: 4.5 von 5 Sternen
4.5/5 (2099)
The Woman in Cabin 10
Von Everand
The Woman in Cabin 10
Ruth Ware
Bewertung: 3.5 von 5 Sternen
3.5/5 (2322)
The Light Between Oceans: A Novel
Von Everand
The Light Between Oceans: A Novel
M.L. Stedman
Bewertung: 4.5 von 5 Sternen
4.5/5 (789)
The Sympathizer: A Novel (Pulitzer Prize for Fiction)
Von Everand
The Sympathizer: A Novel (Pulitzer Prize for Fiction)
Viet Thanh Nguyen
Bewertung: 4.5 von 5 Sternen
4.5/5 (119)
Little Women
Von Everand
Little Women
Louisa May Alcott
Bewertung: 4 von 5 Sternen
4/5 (104)
Brooklyn: A Novel
Von Everand
Brooklyn: A Novel
Colm Toibin
Bewertung: 3.5 von 5 Sternen
3.5/5 (1937)
A Man Called Ove: A Novel
Von Everand
A Man Called Ove: A Novel
Fredrik Backman
Bewertung: 4.5 von 5 Sternen
4.5/5 (4609)
The Art of Racing in the Rain: A Novel
Von Everand
The Art of Racing in the Rain: A Novel
Garth Stein
Bewertung: 4 von 5 Sternen
4/5 (4200)
Manhattan Beach: A Novel
Von Everand
Manhattan Beach: A Novel
Jennifer Egan
Bewertung: 3.5 von 5 Sternen
3.5/5 (792)
A Tree Grows in Brooklyn
Von Everand
A Tree Grows in Brooklyn
Betty Smith
Bewertung: 4.5 von 5 Sternen
4.5/5 (1929)
Sing, Unburied, Sing: A Novel
Von Everand
Sing, Unburied, Sing: A Novel
Jesmyn Ward
Bewertung: 4 von 5 Sternen
4/5 (1103)
The Constant Gardener: A Novel
Von Everand
The Constant Gardener: A Novel
John le Carre
Bewertung: 3.5 von 5 Sternen
3.5/5 (104)
Her Body and Other Parties: Stories
Von Everand
Her Body and Other Parties: Stories
Carmen Maria Machado
Bewertung: 4 von 5 Sternen
4/5 (821)
Paperchina
Dokument10 Seiten
Paperchina
RAM KUMAR
Noch keine Bewertungen
Splunk Quick Reference Guide
Dokument6 Seiten
Splunk Quick Reference Guide
Lsniper
Noch keine Bewertungen
Notes On The Site That Takes Calculations
Dokument7 Seiten
Notes On The Site That Takes Calculations
Nurdin Šabić
Noch keine Bewertungen
Kindergarten 2D and 3D Shapes PDF
Dokument33 Seiten
Kindergarten 2D and 3D Shapes PDF
Aibegim Abdyldabekova
100% (1)
Risk, Return, and The Capital Asset Pricing Model
Dokument52 Seiten
Risk, Return, and The Capital Asset Pricing Model
Faryal Shahid
Noch keine Bewertungen
Phy1 11 - 12 Q1 0603 PF FD
Dokument68 Seiten
Phy1 11 - 12 Q1 0603 PF FD
hiro
Noch keine Bewertungen
Magnetic Field Splitting of Spectral Lines
Dokument2 Seiten
Magnetic Field Splitting of Spectral Lines
Sio Mo
Noch keine Bewertungen
Subsea Control Systems SXGSSC PDF
Dokument6 Seiten
Subsea Control Systems SXGSSC PDF
Limuel Espiritu
Noch keine Bewertungen
Stratified Random Sampling Precision
Dokument10 Seiten
Stratified Random Sampling Precision
EPAH SIRENGO
Noch keine Bewertungen
Calculate area under curve
Dokument6 Seiten
Calculate area under curve
Punit Singh Sahni
Noch keine Bewertungen
Understandable Statistics Concepts and Methods 12th Edition
Dokument61 Seiten
Understandable Statistics Concepts and Methods 12th Edition
herbert.porter697
100% (36)
Jyothi Swarup's SAS Programming Resume
Dokument4 Seiten
Jyothi Swarup's SAS Programming Resume
thiru_lageshetti368
Noch keine Bewertungen
Operation Management
Dokument15 Seiten
Operation Management
Rabia Rasheed
Noch keine Bewertungen
Gpelab A Matlab Toolbox For Computing Stationary Solutions and Dynamics of Gross-Pitaevskii Equations (Gpe)
Dokument122 Seiten
Gpelab A Matlab Toolbox For Computing Stationary Solutions and Dynamics of Gross-Pitaevskii Equations (Gpe)
Pol Mestres
Noch keine Bewertungen
Unit 2. Error and The Treatment of Analytical Data: (1) Systematic Error or Determinate Error
Dokument6 Seiten
Unit 2. Error and The Treatment of Analytical Data: (1) Systematic Error or Determinate Error
Kamran Jalil
Noch keine Bewertungen
Kirigami Pop Ups
Dokument9 Seiten
Kirigami Pop Ups
TamaraJovanovic
100% (2)
Bs 8666 of 2005 Bas Shape Codes
Dokument5 Seiten
Bs 8666 of 2005 Bas Shape Codes
opulithe
Noch keine Bewertungen
Mangaldan Distric Ii Pogo-Palua Elementary School: ND RD
Dokument3 Seiten
Mangaldan Distric Ii Pogo-Palua Elementary School: ND RD
Flordeliza Manaois Ramos
Noch keine Bewertungen
The 3-Rainbow Index of Graphs
Dokument28 Seiten
The 3-Rainbow Index of Graphs
Dinda Kartika
Noch keine Bewertungen
World-GAN - A Generative Model For Minecraft
Dokument8 Seiten
World-GAN - A Generative Model For Minecraft
Gordon
Noch keine Bewertungen
Csat Test 1 Answers
Dokument11 Seiten
Csat Test 1 Answers
rahul kumar
Noch keine Bewertungen
Aiken 1980
Dokument31 Seiten
Aiken 1980
Michael Sanches
Noch keine Bewertungen
ND Mechatronics
Dokument260 Seiten
ND Mechatronics
aniaffiah
Noch keine Bewertungen
Spillways
Dokument26 Seiten
Spillways
ogul
Noch keine Bewertungen
29a ΙΡΙΔΟΛΟΓΙΑ EG
Dokument28 Seiten
29a ΙΡΙΔΟΛΟΓΙΑ EG
O TΣΑΡΟΣ ΤΗΣ ΑΝΤΙΒΑΡΥΤΗΤΑΣ ΛΙΑΠΗΣ ΠΑΝΑΓΙΩΤΗΣ
Noch keine Bewertungen
ECON 233-Introduction To Game Theory - Husnain Fateh Ahmad
Dokument7 Seiten
ECON 233-Introduction To Game Theory - Husnain Fateh Ahmad
Adeel Shaikh
Noch keine Bewertungen
Shell Balance Flow Thro Circular Pipes
Dokument23 Seiten
Shell Balance Flow Thro Circular Pipes
Raja Selvaraj
Noch keine Bewertungen
Correlation and Regression
Dokument9 Seiten
Correlation and Regression
Md Ibrahim Molla
Noch keine Bewertungen
Math 9
Dokument8 Seiten
Math 9
Erich Gallardo
Noch keine Bewertungen
Divergence and Curl
Dokument8 Seiten
Divergence and Curl
ali
Noch keine Bewertungen