
Frontiers of

Computational Journalism
Columbia Journalism School
Week 2: Text Analysis
September 23, 2016

This class

Why text analysis?
Text analysis in journalism
The Document Vector Space model
Topic modeling

Why Text Analysis?

Stories from counting

When Hu Jintao came to power in 2002, China was already experiencing a worsening social crisis. In 2004,
President Hu offered a rhetorical response to growing internal instability, trumpeting what he called a
"harmonious society." For some time, this new watchword burgeoned, becoming visible everywhere in
the Party's propaganda.
- Qian Gang, Watchwords: Reading China through its Party Vocabulary

But by 2007 it was already on the decline, as "stability preservation" made its
rapid ascent. ... Together, these contrasting pictures of the "harmonious society"
and "stability preservation" form a portrait of the real predicament facing
President Hu Jintao. A harmonious society may be a pleasing idea, but it's the
iron will behind stability preservation that packs the real punch.
- Qian Gang, Watchwords: Reading China through its Party Vocabulary

Google ngram viewer


12% of all English books

Data can give a wider view


Let me talk about Downton Abbey for a minute. The show's
popularity has led many nitpickers to draft up lists of mistakes. ...
But all of these have relied, so far as I can tell, on finding a phrase
or two that sounds a bit off, and checking the online sources for
earliest use.
I lack such social graces. So I thought: why not just check every
single line in the show for historical accuracy? ... So I found some
copies of the Downton Abbey scripts online, and fed every single
two-word phrase through the Google Ngram database to see
how characteristic of the English Language, c. 1917, Downton
Abbey really is.
- Ben Schmidt, Making Downton more traditional

Bigrams that do not appear in English books between 1912 and 1921.

Bigrams that are at least 100 times more common today than they were in 1912-1921.

Text Analysis in Journalism

USA Today/Twitter Political Issues Index

Twitter sentiment index


Post-match analysis of public attitudes on Twitter, University of Reading, 2015

Politico analysis of GOP primary, 2012

CNN State of the Union Twitter analysis, 2010

The Post obtained draft versions of 12 audits by the inspector general's office,
covering projects from the Caribbean to Pakistan to the Republic of Georgia
between 2011 and 2013. The drafts are confidential and rarely become public.
The Post compared the drafts with the final reports published by the
inspector general's office and interviewed former and current employees. Emails
and other internal records also were reviewed.
The Post tracked changes in the language that auditors used to describe
USAID and its mission offices. The analysis found that more than 400
negative references were removed from the audits between the draft and
final versions.
Sentiment analysis used by Washington Post, 2014

LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years


Los Angeles Times

The Times analyzed Los Angeles Police Department violent crime data from
2005 to 2012. Our analysis found that the Los Angeles Police Department
misclassified an estimated 14,000 serious assaults as minor offenses,
artificially lowering the city's crime levels. To conduct the analysis,
The Times used an algorithm that combined two machine learning classifiers.
Each classifier read in a brief description of the crime, which it used to
determine if it was a minor or serious assault.
An example of a minor assault reads: "VICTS AND SUSPS BECAME INV IN VERBA
ARGUMENT SUSP THEN BEGAN HITTING VICTS IN THE FACE."

We used a machine-learning method known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions and to then categorize the briefs.
This enabled us to identify which lawyers did which kind of work for which
sorts of petitioners. For example, in cases where workers sue their employers,
the lawyers most successful getting cases before the court were far more
likely to represent the employers rather than the employees.
- "The Echo Chamber," Reuters

Document Vector Space Model

Documents, not words


We can use clustering and classification techniques if
we can convert documents into vectors.
As before, we want to find numerical features that
describe the document.
How do we capture the meaning of a document in
numbers?

What is this document "about"?


Most commonly occurring words are a pretty good indicator:

30  the
23  to
19  and
19  a
18  animal
17  cruelty
15  of
15  crimes
14  in
14  for
11  that
 8  crime
 7  we
Features = words works fine


Encode each document as the list of words it contains.
Dimensions = vocabulary of document set.
Value on each dimension = # of times word appears in
document

Example
D1 = I like databases
D2 = I hate hate databases

Each row = document vector


All rows = term-document matrix
Individual entry = tf(t,d) = term frequency
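For example, a minimal Python sketch (names are illustrative) that builds this term-document matrix for D1 and D2 using plain word counts:

from collections import Counter

docs = ["I like databases", "I hate hate databases"]

# tokenize: lowercase and split on spaces (no punctuation to strip here)
tokenized = [d.lower().split() for d in docs]

# vocabulary = every word that appears anywhere in the document set
vocab = sorted(set(w for doc in tokenized for w in doc))

# tf(t, d) = number of times term t appears in document d
matrix = [[Counter(doc)[t] for t in vocab] for doc in tokenized]

print(vocab)    # ['databases', 'hate', 'i', 'like']
print(matrix)   # [[1, 0, 1, 1], [1, 2, 1, 0]]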

Aka Bag of words model


Throws out word order.
e.g. "soldiers shot civilians" and "civilians shot soldiers" are encoded identically.

Tokenization
The documents come to us as long strings, not individual
words. Tokenization is the process of converting the string
into individual words, or "tokens."
For this course, we will assume a very simple strategy:
o convert all letters to lowercase
o remove all punctuation characters
o separate words based on spaces

Note that this won't work at all for Chinese. It will fail in
some ways even for English. How?
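A minimal Python sketch of this simple strategy (the function name is just for illustration):

import string

def tokenize(text):
    # lowercase, remove punctuation, split on spaces
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(tokenize("Soldiers shot civilians, and civilians shot soldiers."))
# ['soldiers', 'shot', 'civilians', 'and', 'civilians', 'shot', 'soldiers']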

Distance function
Useful for:
clustering documents
finding docs similar to example
matching a search query
Basic idea: look for overlapping terms

Cosine similarity
Given document vectors a,b define

similarity(a, b) = a · b
If each word occurs exactly once in each document,
equivalent to counting overlapping words.
Note: not a distance function, as similarity increases when
documents are similar. (What part of the definition of a
distance function is violated here?)

Problem: long documents always win


Let a = "This car runs fast."
Let b = "My car is old. I want a new car, a shiny car"
Let query q = "fast car"

[Term-count vectors over the vocabulary: this, car, runs, fast, my, is, old, want, new, shiny]
Problem: long documents always win


similarity(a,q) = 1*1 [car] + 1*1 [fast] = 2
similarity(b,q) = 3*1 [car] + 0*1 [fast] = 3
Longer document more similar, by virtue of repeating
words.

Normalize document vectors


similarity(a, b) = (a · b) / (|a| |b|) = cos(θ)

returns result in [0, 1]

Normalized query example


[Term-count vectors over the same vocabulary: this, car, runs, fast, my, is, old, want, new, shiny]

similarity(a, q) = 2 / (√4 · √2) = 1/√2 ≈ 0.707

similarity(b, q) = 3 / (√17 · √2) ≈ 0.514

Cosine similarity

cos(θ) = similarity(a, b) = (a · b) / (|a| |b|)

Cosine distance (finally)


dist(a, b) = 1 - (a · b) / (|a| |b|)
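A minimal Python sketch of both definitions, checked against the "fast car" query and document a from the example above:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1 - cosine_similarity(a, b)

# vocabulary: this, car, runs, fast
a = [1, 1, 1, 1]   # "This car runs fast."
q = [0, 1, 0, 1]   # query "fast car"

print(round(cosine_similarity(a, q), 3))   # 0.707
print(round(cosine_distance(a, q), 3))     # 0.293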

Problem: common words


We want to look at words that discriminate among
documents.
Stopwords: if all documents contain "the", are all documents similar?
Common words: if most documents contain "car" then "car" doesn't tell us much
about (contextual) similarity.

Context matters
General News

Car Reviews

[Figure: documents in each context, with a legend marking which contain "car" and which do not]

Document Frequency
Idea: de-weight common words
Common = appears in many documents

df(t, D) = |{d ∈ D : t ∈ d}| / |D|

document frequency = fraction of docs containing term

Inverse Document Frequency


Invert (so more common = smaller weight) and take
log

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

TF-IDF
Multiply term frequency by inverse document frequency

tfidf(t, d, D) = tf(t, d) · idf(t, D)

= n(t, d) · log( |D| / n(t, D) )

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t
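A minimal Python sketch of these formulas, run on the earlier two-document example (names are illustrative):

import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # n(t, D) = number of documents containing term t
    doc_freq = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)   # n(t, d)
        scores.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return scores

docs = [["i", "like", "databases"], ["i", "hate", "hate", "databases"]]
for s in tfidf(docs):
    print(s)
# terms that appear in every document ("i", "databases") get score 0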

TF-IDF depends on entire corpus


The TF-IDF vector for a document changes if we
add another document to the corpus.

tfidf(t, d, D) = tf(t, d) · idf(t, D)

if we add a document, D changes!

TF-IDF is sensitive to context. The context is all other documents.

What is this document "about"?


Each document is now a vector of TF-IDF scores for every word in
the document. We can look at which words have the top scores.
crimes        0.0676
cruelty       0.0586
crime         0.0258
reporting     0.0209
animals       0.0179
michael       0.0157
category      0.0155
commit        0.0137
criminal      0.0134
societal      0.0124
trends        0.0120
conviction    0.0116
patterns      0.0112

Salton's description of tf-idf

- from Salton et al, A Vector Space Model for Automatic Indexing, 1975

TF

TF-IDF

nj-senator-menendez corpus, Overview sample files


color = human tags generated from TF-IDF clusters

Cluster Hypothesis
documents in the same cluster behave similarly with respect to
relevance to information needs
- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement, but the crucial link between human semantics
and mathematical properties.
Articulated as early as 1971; it has been shown to hold at web scale and is
widely assumed.

Bag of words + TF-IDF hard to beat


Practical win: good precision-recall metrics in tests with human-tagged document sets.
Still the dominant text indexing scheme used today (Lucene, FAST, Google).
Many variants and extensions.
Some, but not much, theory to explain why this works. (E.g. why that
particular IDF formula? Why doesn't indexing bigrams improve performance?)

Collectively:
the vector space document model

Topic Modeling

Problem Statement
Can the computer tell us the topics in a document
set? Can the computer organize the documents by
topic?
Note: TF-IDF tells us the topics of a single document,
but here we want topics of an entire document set.

Simplest possible technique


Sum TF-IDF scores for each word across entire
document set, choose top ranking words.

This is how Overview generates cluster descriptions.
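A minimal sketch of that idea in Python (not Overview's actual code), reusing per-document TF-IDF dictionaries like those produced by the earlier sketch:

from collections import Counter

def describe(tfidf_docs, top_n=10):
    """tfidf_docs: list of {term: tf-idf score} dicts, one per document."""
    totals = Counter()
    for doc_scores in tfidf_docs:
        totals.update(doc_scores)   # adds scores term by term
    return [term for term, score in totals.most_common(top_n)]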

Topic Modeling Algorithms


Basic idea: reduce dimensionality of document vector
space, so each dimension is a topic.
Each document is then a vector of topic weights. We
want to figure out what dimensions and weights give
a good approximation of the full set of words in each
document.
Many variants: LSI, PLSI, LDA, NMF

Matrix Factorization
Approximate the term-document matrix V as the product of two lower-rank matrices:

V ≈ W H

V: m docs by n terms
W: m docs by r "topics"
H: r "topics" by n terms

Matrix Factorization
A "topic" is a group of words that occur together.

[Figure: a row of H gives the words in one topic; a row of W gives the topics in one document]

Non-negative Matrix Factorization


All elements of document coordinate matrix W and topic
matrix H must be >= 0
Simple iterative algorithm to compute.

Still have to choose number of topics r
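A minimal sketch using scikit-learn (an assumption; the lecture does not prescribe a library), factoring a small TF-IDF matrix into W and H:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "council cuts police budget",
    "police chief defends budget cuts",
    "team wins championship game",
    "fans celebrate championship win",
]
r = 2                                      # number of topics, chosen by hand

vectorizer = TfidfVectorizer()
V = vectorizer.fit_transform(docs)         # m docs x n terms

model = NMF(n_components=r, random_state=0)
W = model.fit_transform(V)                 # m docs x r topics
H = model.components_                      # r topics x n terms

# describe each topic by its highest-weighted words
terms = vectorizer.get_feature_names_out()
for topic in H:
    print([terms[i] for i in topic.argsort()[-4:][::-1]])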

Latent Dirichlet Allocation


Imagine that each document is written by someone going through the following process:
1. For each doc d, choose a mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose the word from p(w|z)

A document has a distribution of topics.
Each topic is a distribution of words.
LDA tries to find these two sets of distributions.
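A toy Python sketch of this generative story, with made-up topics and weights:

import random

# p(w|z): each topic is a distribution over words
topics = {
    "sports":   {"game": 0.5, "team": 0.3, "budget": 0.2},
    "politics": {"budget": 0.5, "council": 0.3, "game": 0.2},
}
# p(z|d): one document's mixture of topics
doc_topic_mix = {"sports": 0.7, "politics": 0.3}

def generate_word():
    # choose a topic z from p(z|d), then a word from p(w|z)
    z = random.choices(list(doc_topic_mix), weights=list(doc_topic_mix.values()))[0]
    words = topics[z]
    return random.choices(list(words), weights=list(words.values()))[0]

print([generate_word() for _ in range(10)])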

"Documents"

LDA models each document as a distribution over topics. Each word belongs to a single topic.

"Topics"

LDA models a topic as a distribution over all the words in the corpus. In each topic,
some words are more likely, some are less likely.

LDA Plate Notation


topics in doc
topic
topic for word
concentration
parameter

word in doc

N words
in doc

words in topics

D docs

word
concentration
parameter

K topics

Computing LDA
Inputs:
word[d][i]   document words
k            # topics
a            doc topic concentration
b            topic word concentration
Also:
n            # docs
len[d]       # words in document d
v            vocabulary size

Computing LDA
Outputs:
topics[n][i]        doc/word topic assignments
topic_words[k][v]   topic words dist
doc_topics[n][k]    document topics dist

topics -> topic_words


topic_words[*][*] = b
for d = 1..n
    for i = 1..len[d]
        topic_words[topics[d][i]][word[d][i]] += 1
for j = 1..k
    normalize topic_words[j]

topics -> doc_topics


doc_topics[*][*] = a
for d = 1..n
    for i = 1..len[d]
        doc_topics[d][topics[d][i]] += 1
for d = 1..n
    normalize doc_topics[d]

Update topics
// for each word in each document, sample a new topic
for d = 1..n
    for i = 1..len[d]
        w = word[d][i]
        for t = 1..k
            p[t] = doc_topics[d][t] * topic_words[t][w]
        topics[d][i] = sample from p
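Putting the three steps above together, a minimal Python transcription of this simplified sampler (a sketch, not a production LDA implementation):

import random

def lda(word, k, a, b, v, iters=100):
    """word[d][i]: integer token ids in 0..v-1; k topics; a, b concentrations; v vocab size."""
    n = len(word)
    # start from a random topic assignment for every word
    topics = [[random.randrange(k) for _ in doc] for doc in word]

    for _ in range(iters):
        # topics -> topic_words
        topic_words = [[b] * v for _ in range(k)]
        for d in range(n):
            for i, w in enumerate(word[d]):
                topic_words[topics[d][i]][w] += 1
        topic_words = [[x / sum(row) for x in row] for row in topic_words]

        # topics -> doc_topics
        doc_topics = [[a] * k for _ in range(n)]
        for d in range(n):
            for i in range(len(word[d])):
                doc_topics[d][topics[d][i]] += 1
        doc_topics = [[x / sum(row) for x in row] for row in doc_topics]

        # update topics: resample a topic for each word in each document
        for d in range(n):
            for i, w in enumerate(word[d]):
                p = [doc_topics[d][t] * topic_words[t][w] for t in range(k)]
                topics[d][i] = random.choices(range(k), weights=p)[0]

    return topics, topic_words, doc_topics

# e.g. two tiny documents over a 4-word vocabulary
docs = [[0, 1, 1, 2], [2, 3, 3, 0]]
topics, topic_words, doc_topics = lda(docs, k=2, a=0.1, b=0.01, v=4)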

Dimensionality reduction
Output of NMF and LDA is a vector of much lower
dimension for each document. ("Document
coordinates in topic space.")
Dimensions are concepts or topics instead of
words.
Can measure cosine distance, cluster, etc. in this new
space.
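For example (a sketch with made-up topic weights, using SciPy's cosine distance as one of many possible implementations):

from scipy.spatial.distance import cosine

doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15]]   # made-up topic weights
print(cosine(doc_topics[0], doc_topics[2]))   # small: similar topic mix
print(cosine(doc_topics[0], doc_topics[1]))   # larger: different topic mix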
