Professional Documents
Culture Documents
Computational Journalism
Columbia Journalism School
Week 2: Text Analysis
September 23, 2016
This class
When Hu Jintao came to power in 2002, China was already experiencing a worsening social crisis. In 2004,
President Hu offered a rhetorical response to growing internal instability, trumpeting what he called a
harmonious society. For some time, this new watchword burgeoned, becoming visible everywhere in
the Party's propaganda.
- Qian Gang, Watchwords: Reading China through its Party Vocabulary
But by 2007 it was already on the decline, as stability preservation made its
rapid ascent. ... Together, these contrasting pictures of the harmonious society
and stability preservation form a portrait of the real predicament facing
President Hu Jintao. A harmonious society may be a pleasing idea, but it's the
iron will behind stability preservation that packs the real punch.
- Qian Gang, Watchwords: Reading China through its Party Vocabulary
The Post obtained draft versions of 12 audits by the inspector general's office,
covering projects from the Caribbean to Pakistan to the Republic of Georgia
between 2011 and 2013. The drafts are confidential and rarely become public.
The Post compared the drafts with the final reports published by the
inspector general's office and interviewed former and current employees. Emails and other internal records also were reviewed.
The Post tracked changes in the language that auditors used to describe
USAID and its mission offices. The analysis found that more than 400
negative references were removed from the audits between the draft and
final versions.
- Sentiment analysis used by the Washington Post, 2014
[Figure: bag-of-words term list: to, and, a, animal, cruelty, of, crimes, in, for, that, crime, we]
Example
D1 = I like databases
D2 = I hate hate databases
Tokenization
The documents come to us as long strings, not individual
words. Tokenization is the process of converting the string
into individual words, or "tokens."
For this course, we will assume a very simple strategy:
o convert all letters to lowercase
o remove all punctuation characters
o separate words based on spaces
Note that this won't work at all for Chinese. It will fail in
some ways even for English. How?
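A minimal sketch of this strategy in Python (the function name `tokenize` is my own):

```python
import string

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(tokenize("I like databases!"))  # ['i', 'like', 'databases']
```

This already shows one English failure mode: "can't" becomes "cant", and hyphenated compounds like "state-of-the-art" collapse into a single token.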
Distance function
Useful for:
o clustering documents
o finding docs similar to an example
o matching a search query
Basic idea: look for overlapping terms
Cosine similarity
Given document vectors a,b define
similarity(a, b) = a · b
If each word occurs exactly once in each document,
equivalent to counting overlapping words.
Note: not a distance function, as similarity increases when
documents are similar. (What part of the definition of a
distance function is violated here?)
[Figure: two document vectors plotted over the terms car, runs, fast, my, is, old, want, new, shiny]
similarity = cos(θ), which returns a result in [0, 1]
[Figure: query vector q and document vectors a, b over the terms car, runs, fast, my, is, old, want, new, shiny]
similarity(a, q) = 2 / (√4 · √2) ≈ 0.707
similarity(b, q) = 3 / (√17 · √2) ≈ 0.514
Cosine similarity
cos θ = similarity(a, b) = (a · b) / (|a| |b|)
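A sketch in Python over sparse term-count vectors, applied to the earlier example documents D1 and D2 (the helper name `cosine_similarity` is mine):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """cos θ = (a · b) / (|a| |b|) for term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

d1 = Counter("i like databases".split())   # D1 = I like databases
d2 = Counter("i hate hate databases".split())  # D2 = I hate hate databases
print(cosine_similarity(d1, d2))  # 2 / (√3 · √6) ≈ 0.471
```

Note the repeated "hate" in D2 contributes to |b| but not to the dot product, so repetition of a non-shared word lowers the similarity.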
Context matters
[Figure: documents from two collections, General News and Car Reviews, marked by whether or not they contain the word "car"]
Document Frequency
Idea: de-weight common words
Common = appears in many documents
df(t, D) = |{d ∈ D : t ∈ d}| / |D|
document frequency = fraction of docs containing the term
TF-IDF
Multiply term frequency by inverse document frequency:
tf-idf(t, d, D) = tf(t, d) · log(1 / df(t, D))
[Table: a column of example term weights comparing TF and TF-IDF weighting; the term labels were lost in extraction]
- from Salton et al, A Vector Space Model for Automatic Indexing, 1975
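A sketch of both pieces in Python, using the log(1/df) variant of inverse document frequency (one common choice; the toy corpus and function names are mine):

```python
import math
from collections import Counter

docs = [
    "i like databases".split(),
    "i hate hate databases".split(),
    "my old car runs fast".split(),
]

def df(term, docs):
    """Document frequency: fraction of docs containing the term."""
    return sum(1 for d in docs if term in d) / len(docs)

def tf_idf(term, doc, docs):
    """Term frequency times inverse document frequency."""
    tf = Counter(doc)[term]
    return tf * math.log(1 / df(term, docs))

doc = docs[1]
print(tf_idf("hate", doc, docs))  # rarer term, weighted up
print(tf_idf("i", doc, docs))     # common term, weighted down
```

"hate" appears in one document of three, "i" in two of three, so "hate" ends up with the larger weight even before its higher term frequency is counted.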
Cluster Hypothesis
documents in the same cluster behave similarly with respect to
relevance to information needs
- Manning, Raghavan, Schütze, Introduction to Information Retrieval
Collectively:
the vector space document model
Topic Modeling
Problem Statement
Can the computer tell us the topics in a document
set? Can the computer organize the documents by
topic?
Note: TF-IDF tells us the topics of a single document,
but here we want topics of an entire document set.
Matrix Factorization
Approximate the term-document matrix V as the product of two lower-rank matrices:
V ≈ W H
V: m docs by n terms
W: m docs by r "topics"
H: r "topics" by n terms
Matrix Factorization
A "topic" is a group of words that occur together.
[Figure: LDA plate diagram: D docs, N words per doc, K topics, with per-document topic mixtures, per-topic word distributions, and concentration parameters]
Computing LDA
Inputs:
o word[d][i]  document words
o k           # topics
o a           doc topic concentration
o b           topic word concentration
Also:
o n           # docs
o len[d]      # words in document d
o v           vocabulary size
Computing LDA
Outputs:
o topics[d][i]       topic assigned to each word
o topic_words[k][v]  word distribution for each topic
o doc_topics[n][k]   topic distribution for each document
Update topics
// for each word in document, sample a new topic
for d=1..n
for i=1..len[d]
w = word[d][i]
for t=1..k
p[t] = doc_topics[d][t] * topic_words[t][w]
topics[d][i] = sample from p
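The loop above can be sketched as runnable Python. Note this samples directly from doc_topics × topic_words, as on the slide; a full collapsed Gibbs sampler would instead use counts adjusted by the concentration parameters a and b. Helper names and the toy data are mine:

```python
import random

def sample_from(p):
    """Draw index t with probability proportional to p[t]."""
    r = random.uniform(0, sum(p))
    acc = 0.0
    for t, pt in enumerate(p):
        acc += pt
        if r <= acc:
            return t
    return len(p) - 1

def update_topics(word, doc_topics, topic_words, topics, k):
    """For each word in each document, sample a new topic assignment."""
    for d in range(len(word)):
        for i in range(len(word[d])):
            w = word[d][i]
            p = [doc_topics[d][t] * topic_words[t][w] for t in range(k)]
            topics[d][i] = sample_from(p)

random.seed(0)
word = [[0, 1, 1], [2, 3]]             # 2 docs over a 4-word vocabulary
doc_topics = [[0.9, 0.1], [0.1, 0.9]]  # per-doc topic weights
topic_words = [[0.4, 0.4, 0.1, 0.1],   # per-topic word weights
               [0.1, 0.1, 0.4, 0.4]]
topics = [[0, 0, 0], [0, 0]]
update_topics(word, doc_topics, topic_words, topics, 2)
print(topics)  # each entry is a topic index in 0..k-1
```

In a full sampler this update runs many times, with doc_topics and topic_words re-estimated from the counts of the sampled assignments between sweeps.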
Dimensionality reduction
Output of NMF and LDA is a vector of much lower
dimension for each document. ("Document
coordinates in topic space.")
Dimensions are concepts or topics instead of
words.
Can measure cosine distance, cluster, etc. in this new
space.