
Frontiers of

Computational Journalism
Columbia Journalism School
Week 2: Text Analysis
September 23, 2016

This class

Why text analysis?
Text analysis in journalism
The Document Vector Space model
Topic modeling

Why Text Analysis?

Stories from counting

When Hu Jintao came to power in 2002, China was already experiencing a worsening social crisis. In 2004,
President Hu offered a rhetorical response to growing internal instability, trumpeting what he called a
"harmonious society." For some time, this new watchword burgeoned, becoming visible everywhere in
the Party's propaganda.
- Qian Gang, Watchwords: Reading China through its Party Vocabulary

But by 2007 it was already on the decline, as "stability preservation" made its
rapid ascent. ... Together, these contrasting pictures of the "harmonious society"
and "stability preservation" form a portrait of the real predicament facing
President Hu Jintao. A harmonious society may be a pleasing idea, but it's the
iron will behind stability preservation that packs the real punch.
- Qian Gang, Watchwords: Reading China through its Party Vocabulary

Google ngram viewer


12% of all English books

Data can give a wider view


Let me talk about Downton Abbey for a minute. The show's
popularity has led many nitpickers to draft up lists of mistakes. ...
But all of these have relied, so far as I can tell, on finding a phrase
or two that sounds a bit off, and checking the online sources for
earliest use.
I lack such social graces. So I thought: why not just check every
single line in the show for historical accuracy? ... So I found some
copies of the Downton Abbey scripts online, and fed every single
two-word phrase through the Google Ngram database to see
how characteristic of the English Language, c. 1917, Downton
Abbey really is.
- Ben Schmidt, Making Downton more traditional

Bigrams that do not appear in English books between 1912 and 1921.

Bigrams that are at least 100 times more common today than they were in 1912-1921.

Text Analysis in Journalism

USA Today/Twitter Political Issues Index

Twitter sentiment index


Post-match analysis of public attitudes on Twitter, University of Reading, 2015

Politico analysis of GOP primary, 2012

CNN State of the Union Twitter analysis, 2010

The Post obtained draft versions of 12 audits by the inspector general's office,
covering projects from the Caribbean to Pakistan to the Republic of Georgia
between 2011 and 2013. The drafts are confidential and rarely become public.
The Post compared the drafts with the final reports published by the
inspector general's office and interviewed former and current employees. Emails
and other internal records also were reviewed.
The Post tracked changes in the language that auditors used to describe
USAID and its mission offices. The analysis found that more than 400
negative references were removed from the audits between the draft and
final versions.
Sentiment analysis used by Washington Post, 2014

LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years


Los Angeles Times

The Times analyzed Los Angeles Police Department violent crime data from
2005 to 2012. Our analysis found that the Los Angeles Police Department
misclassified an estimated 14,000 serious assaults as minor offenses,
artificially lowering the city's crime levels. To conduct the analysis,
The Times used an algorithm that combined two machine learning classifiers.
Each classifier read in a brief description of the crime, which it used to
determine if it was a minor or serious assault.
An example of a minor assault reads: "VICTS AND SUSPS BECAME INV IN VERBA
ARGUMENT SUSP THEN BEGAN HITTING VICTS IN THE FACE."

We used a machine-learning method known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions and to then categorize the briefs.
This enabled us to identify which lawyers did which kind of work for which
sorts of petitioners. For example, in cases where workers sue their employers,
the lawyers most successful getting cases before the court were far more
likely to represent the employers rather than the employees.
- "The Echo Chamber," Reuters

Document Vector Space Model

Documents, not words


We can use clustering and classification techniques if
we can convert documents into vectors.
As before, we want to find numerical features that
describe the document.
How do we capture the meaning of a document in
numbers?

What is this document "about"?


Most commonly occurring words are a pretty good indicator:

30  the
23  to
19  and
19  a
18  animal
17  cruelty
15  of
15  crimes
14  in
14  for
11  that
 8  crime
 7  we
Features = words works fine


Encode each document as the list of words it contains.
Dimensions = vocabulary of document set.
Value on each dimension = # of times word appears in
document

Example
D1 = I like databases
D2 = I hate hate databases

Each row = document vector


All rows = term-document matrix
Individual entry = tf(t,d) = term frequency
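For example, a minimal Python sketch (names are illustrative) that builds this term-document matrix for D1 and D2 using plain word counts:

from collections import Counter

docs = ["I like databases", "I hate hate databases"]

# tokenize: lowercase and split on spaces (no punctuation to strip here)
tokenized = [d.lower().split() for d in docs]

# vocabulary = every word that appears anywhere in the document set
vocab = sorted(set(w for doc in tokenized for w in doc))

# tf(t, d) = number of times term t appears in document d
matrix = [[Counter(doc)[t] for t in vocab] for doc in tokenized]

print(vocab)    # ['databases', 'hate', 'i', 'like']
print(matrix)   # [[1, 0, 1, 1], [1, 2, 1, 0]]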

Aka Bag of words model


Throws out word order.
e.g. "soldiers shot civilians" and "civilians shot soldiers" are encoded identically.

Tokenization
The documents come to us as long strings, not individual
words. Tokenization is the process of converting the string
into individual words, or "tokens."
For this course, we will assume a very simple strategy:
o convert all letters to lowercase
o remove all punctuation characters
o separate words based on spaces

Note that this won't work at all for Chinese. It will fail in
some ways even for English. How?
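A minimal Python sketch of this simple strategy (the function name is just for illustration):

import string

def tokenize(text):
    # lowercase, remove punctuation, split on spaces
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(tokenize("Soldiers shot civilians, and civilians shot soldiers."))
# ['soldiers', 'shot', 'civilians', 'and', 'civilians', 'shot', 'soldiers']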

Distance function
Useful for:
clustering documents
finding docs similar to example
matching a search query
Basic idea: look for overlapping terms

Cosine similarity
Given document vectors a,b define

similarity(a, b) = a · b
If each word occurs exactly once in each document,
equivalent to counting overlapping words.
Note: not a distance function, as similarity increases when
documents are similar. (What part of the definition of a
distance function is violated here?)

Problem: long documents always win


Let a = "This car runs fast."
Let b = "My car is old. I want a new car, a shiny car"
Let query q = "fast car"

[Term-count vectors over the vocabulary: this, car, runs, fast, my, is, old, want, new, shiny]
Problem: long documents always win


similarity(a,q) = 1*1 [car] + 1*1 [fast] = 2
similarity(b,q) = 3*1 [car] + 0*1 [fast] = 3
Longer document more similar, by virtue of repeating
words.

Normalize document vectors


similarity(a, b) = (a · b) / (|a| |b|) = cos(θ)

returns result in [0, 1]

Normalized query example


[Term-count vectors over the same vocabulary: this, car, runs, fast, my, is, old, want, new, shiny]

similarity(a, q) = 2 / (√4 · √2) = 1/√2 ≈ 0.707

similarity(b, q) = 3 / (√17 · √2) ≈ 0.514

Cosine similarity

cos(θ) = similarity(a, b) = (a · b) / (|a| |b|)

Cosine distance (finally)


dist(a, b) = 1 - (a · b) / (|a| |b|)
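A minimal Python sketch of both definitions, checked against the "fast car" query and document a from the example above:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1 - cosine_similarity(a, b)

# vocabulary: this, car, runs, fast
a = [1, 1, 1, 1]   # "This car runs fast."
q = [0, 1, 0, 1]   # query "fast car"

print(round(cosine_similarity(a, q), 3))   # 0.707
print(round(cosine_distance(a, q), 3))     # 0.293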

Problem: common words


We want to look at words that discriminate among
documents.
Stopwords: if all documents contain "the", are all documents similar?
Common words: if most documents contain "car" then "car" doesn't tell us much
about (contextual) similarity.

Context matters
General News

Car Reviews

[Figure: documents in each context, with a legend marking which contain "car" and which do not]

Document Frequency
Idea: de-weight common words
Common = appears in many documents

df(t, D) = |{d ∈ D : t ∈ d}| / |D|

document frequency = fraction of docs containing term

Inverse Document Frequency


Invert (so more common = smaller weight) and take
log

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

TF-IDF
Multiply term frequency by inverse document frequency

tfidf(t, d, D) = tf(t, d) · idf(t, D)

= n(t, d) · log( |D| / n(t, D) )

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t
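A minimal Python sketch of these formulas, run on the earlier two-document example (names are illustrative):

import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # n(t, D) = number of documents containing term t
    doc_freq = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)   # n(t, d)
        scores.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return scores

docs = [["i", "like", "databases"], ["i", "hate", "hate", "databases"]]
for s in tfidf(docs):
    print(s)
# terms that appear in every document ("i", "databases") get score 0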

TF-IDF depends on entire corpus


The TF-IDF vector for a document changes if we
add another document to the corpus.

tfidf(t, d, D) = tf(t, d) · idf(t, D)

if we add a document, D changes!

TF-IDF is sensitive to context. The context is all other documents.

What is this document "about"?


Each document is now a vector of TF-IDF scores for every word in
the document. We can look at which words have the top scores.
crimes        0.0676
cruelty       0.0586
crime         0.0258
reporting     0.0209
animals       0.0179
michael       0.0157
category      0.0155
commit        0.0137
criminal      0.0134
societal      0.0124
trends        0.0120
conviction    0.0116
patterns      0.0112

Salton's description of tf-idf

- from Salton et al, A Vector Space Model for Automatic Indexing, 1975

TF

TF-IDF

nj-senator-menendez corpus, Overview sample files


color = human tags generated from TF-IDF clusters

Cluster Hypothesis
documents in the same cluster behave similarly with respect to
relevance to information needs
- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement, but the crucial link between human semantics
and mathematical properties.
Articulated as early as 1971; it has been shown to hold at web scale and is
widely assumed.

Bag of words + TF-IDF hard to beat


Practical win: good precision-recall metrics in tests with human-tagged document sets.
Still the dominant text indexing scheme used today (Lucene, FAST, Google).
Many variants and extensions.
Some, but not much, theory to explain why this works. (E.g. why that
particular IDF formula? Why doesn't indexing bigrams improve performance?)

Collectively:
the vector space document model

Topic Modeling

Problem Statement
Can the computer tell us the topics in a document
set? Can the computer organize the documents by
topic?
Note: TF-IDF tells us the topics of a single document,
but here we want topics of an entire document set.

Simplest possible technique


Sum TF-IDF scores for each word across entire
document set, choose top ranking words.

This is how Overview generates cluster descriptions.
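A minimal sketch of that idea in Python (not Overview's actual code), reusing per-document TF-IDF dictionaries like those produced by the earlier sketch:

from collections import Counter

def describe(tfidf_docs, top_n=10):
    """tfidf_docs: list of {term: tf-idf score} dicts, one per document."""
    totals = Counter()
    for doc_scores in tfidf_docs:
        totals.update(doc_scores)   # adds scores term by term
    return [term for term, score in totals.most_common(top_n)]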

Topic Modeling Algorithms


Basic idea: reduce dimensionality of document vector
space, so each dimension is a topic.
Each document is then a vector of topic weights. We
want to figure out what dimensions and weights give
a good approximation of the full set of words in each
document.
Many variants: LSI, PLSI, LDA, NMF

Matrix Factorization
Approximate the term-document matrix V as the product of two lower-rank matrices:

V ≈ W H

V: m docs by n terms
W: m docs by r "topics"
H: r "topics" by n terms

Matrix Factorization
A "topic" is a group of words that occur together.

[Figure: a row of H gives the words in one topic; a row of W gives the topics in one document]

Non-negative Matrix Factorization


All elements of document coordinate matrix W and topic
matrix H must be >= 0
Simple iterative algorithm to compute.

Still have to choose number of topics r
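A minimal sketch using scikit-learn (an assumption; the lecture does not prescribe a library), factoring a small TF-IDF matrix into W and H:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "council cuts police budget",
    "police chief defends budget cuts",
    "team wins championship game",
    "fans celebrate championship win",
]
r = 2                                      # number of topics, chosen by hand

vectorizer = TfidfVectorizer()
V = vectorizer.fit_transform(docs)         # m docs x n terms

model = NMF(n_components=r, random_state=0)
W = model.fit_transform(V)                 # m docs x r topics
H = model.components_                      # r topics x n terms

# describe each topic by its highest-weighted words
terms = vectorizer.get_feature_names_out()
for topic in H:
    print([terms[i] for i in topic.argsort()[-4:][::-1]])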

Latent Dirichlet Allocation


Imagine that each document is written by someone going through the following process:
1. For each doc d, choose a mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose the word from p(w|z)

A document has a distribution of topics.
Each topic is a distribution of words.
LDA tries to find these two sets of distributions.
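A toy Python sketch of this generative story, with made-up topics and weights:

import random

# p(w|z): each topic is a distribution over words
topics = {
    "sports":   {"game": 0.5, "team": 0.3, "budget": 0.2},
    "politics": {"budget": 0.5, "council": 0.3, "game": 0.2},
}
# p(z|d): one document's mixture of topics
doc_topic_mix = {"sports": 0.7, "politics": 0.3}

def generate_word():
    # choose a topic z from p(z|d), then a word from p(w|z)
    z = random.choices(list(doc_topic_mix), weights=list(doc_topic_mix.values()))[0]
    words = topics[z]
    return random.choices(list(words), weights=list(words.values()))[0]

print([generate_word() for _ in range(10)])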

"Documents"

LDA models each document as a distribution over topics. Each word belongs to a single topic.

"Topics"

LDA models a topic as a distribution over all the words in the corpus. In each topic,
some words are more likely, some are less likely.

LDA Plate Notation


topics in doc
topic
topic for word
concentration
parameter

word in doc

N words
in doc

words in topics

D docs

word
concentration
parameter

K topics

Computing LDA
Inputs:
word[d][i]   document words
k            # topics
a            doc topic concentration
b            topic word concentration
Also:
n            # docs
len[d]       # words in document d
v            vocabulary size

Computing LDA
Outputs:
topics[n][i]        doc/word topic assignments
topic_words[k][v]   topic words dist
doc_topics[n][k]    document topics dist

topics -> topic_words


topic_words[*][*] = b
for d = 1..n
    for i = 1..len[d]
        topic_words[topics[d][i]][word[d][i]] += 1
for j = 1..k
    normalize topic_words[j]

topics -> doc_topics


doc_topics[*][*] = a
for d = 1..n
    for i = 1..len[d]
        doc_topics[d][topics[d][i]] += 1
for d = 1..n
    normalize doc_topics[d]

Update topics
// for each word in each document, sample a new topic
for d = 1..n
    for i = 1..len[d]
        w = word[d][i]
        for t = 1..k
            p[t] = doc_topics[d][t] * topic_words[t][w]
        topics[d][i] = sample from p
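Putting the three steps above together, a minimal Python transcription of this simplified sampler (a sketch, not a production LDA implementation):

import random

def lda(word, k, a, b, v, iters=100):
    """word[d][i]: integer token ids in 0..v-1; k topics; a, b concentrations; v vocab size."""
    n = len(word)
    # start from a random topic assignment for every word
    topics = [[random.randrange(k) for _ in doc] for doc in word]

    for _ in range(iters):
        # topics -> topic_words
        topic_words = [[b] * v for _ in range(k)]
        for d in range(n):
            for i, w in enumerate(word[d]):
                topic_words[topics[d][i]][w] += 1
        topic_words = [[x / sum(row) for x in row] for row in topic_words]

        # topics -> doc_topics
        doc_topics = [[a] * k for _ in range(n)]
        for d in range(n):
            for i in range(len(word[d])):
                doc_topics[d][topics[d][i]] += 1
        doc_topics = [[x / sum(row) for x in row] for row in doc_topics]

        # update topics: resample a topic for each word in each document
        for d in range(n):
            for i, w in enumerate(word[d]):
                p = [doc_topics[d][t] * topic_words[t][w] for t in range(k)]
                topics[d][i] = random.choices(range(k), weights=p)[0]

    return topics, topic_words, doc_topics

# e.g. two tiny documents over a 4-word vocabulary
docs = [[0, 1, 1, 2], [2, 3, 3, 0]]
topics, topic_words, doc_topics = lda(docs, k=2, a=0.1, b=0.01, v=4)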

Dimensionality reduction
Output of NMF and LDA is a vector of much lower
dimension for each document. ("Document
coordinates in topic space.")
Dimensions are concepts or topics instead of
words.
Can measure cosine distance, cluster, etc. in this new
space.
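For example (a sketch with made-up topic weights, using SciPy's cosine distance as one of many possible implementations):

from scipy.spatial.distance import cosine

doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15]]   # made-up topic weights
print(cosine(doc_topics[0], doc_topics[2]))   # small: similar topic mix
print(cosine(doc_topics[0], doc_topics[1]))   # larger: different topic mix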
