
0.1 Notation

There are d words in the dictionary. d is large. (Think 20000.) There are k topics. (Think $k \approx 100$.) Each topic is a d-vector with non-negative components summing to 1. The i-th component is the probability that a random word in a document (purely) on that topic is word i. We let M be the $d \times k$ matrix with one column for each topic vector.

0.2 The Model

The Pure Model: Each document is purely on a single topic.


[This is really a cluster model. More general models where a doc is allowed
to be on multiple topics are more difficult to tackle.]
Topic Weights $w_1, w_2, \ldots, w_k$: positive reals summing to 1.
Documents are picked in i.i.d. trials. Let's say each document has m words in it. To pick the m words of one document:
1. Pick a topic l ($l \in \{1, 2, \ldots, k\}$), with
$$\Pr(l = 1) = w_1;\quad \Pr(l = 2) = w_2;\ \ldots;\ \Pr(l = k) = w_k.$$
2. In m i.i.d. trials pick the words of the document: in each of the m trials, pick a random word with (l is from step 1)
$$\text{Prob that word } i \text{ is picked} = M_{il}, \qquad \text{for } i = 1, 2, \ldots, d.$$

[This is the multinomial probability distribution.]
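To make the generative process concrete, here is a minimal sketch in Python/NumPy. It is not part of the notes: the sizes d, k, m and the construction of M are illustrative assumptions, chosen so that each topic has a disjoint block of primary words (the $\epsilon = 0$ situation discussed below).

import numpy as np

rng = np.random.default_rng(0)
d, k, m = 2000, 5, 300             # dictionary size, number of topics, words per document (illustrative)

# Topic matrix M (d x k): column l is a probability vector over the d words.
# Illustrative choice: give each topic a disjoint block of d // k primary words
# and put all of its probability mass there.
M = np.zeros((d, k))
block = d // k
for l in range(k):
    probs = rng.random(block)
    M[l * block:(l + 1) * block, l] = probs / probs.sum()

# Topic weights w_1, ..., w_k: positive reals summing to 1.
w = rng.random(k)
w /= w.sum()

def sample_document():
    """Step 1: pick a topic l with Pr(l) = w_l.  Step 2: draw m i.i.d. words from M[:, l]."""
    l = rng.choice(k, p=w)
    counts = rng.multinomial(m, M[:, l])    # multinomial word counts for one document
    return l, counts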


Define the word-document matrix A by
$$A_{ij} = \frac{\text{Number of occurrences of word } i \text{ in document } j}{m}.$$

Each column of A is a document. Each column sums to 1.
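Continuing the sketch above, the word-document matrix A can be formed as follows (the number of documents n is again an illustrative choice):

n = 500                                    # number of documents (illustrative)
topics = np.empty(n, dtype=int)            # true topic of each document, kept for checking later
A = np.zeros((d, n))
for j in range(n):
    topics[j], counts = sample_document()
    A[:, j] = counts / m                   # A_ij = (# occurrences of word i in document j) / m

assert np.allclose(A.sum(axis=0), 1.0)     # every column of A sums to 1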


Inference Problem: Given A, find the topic vectors and the topic weights.
[Schematic diagram of the model on the board.]
Primary Words Assumption: Each topic has a set of primary words; the total of their components (in the topic vector) is at least $1 - \epsilon$. The sets of primary words of different topics are disjoint.
So most words in the document vector for a document on topic l are primary words for that topic.

Question: What can you say about the dot product of two document vectors if they are on different topics? First think of the $\epsilon = 0$ case, then small $\epsilon$.
Question: Is the above a giveaway? I.e., can you solve the inference problem just based on this?
Hint: What can you say about the dot product of two document vectors on the same topic (even when $\epsilon = 0$)? Think of the case when the components of the topic vector are smaller than 1/m, so a single word is unlikely to occur in a document.
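These questions can be probed numerically with the illustrative A and topics above. With disjoint primary words and $\epsilon = 0$, two documents on different topics share no words at all, so their dot product is exactly 0; two documents on the same topic can also have a small dot product, since with topic components of order 1/m or less they share few words.

G = A.T @ A                                   # all pairwise dot products of document vectors
same = topics[:, None] == topics[None, :]     # True where the two documents share a topic
off_diag = ~np.eye(n, dtype=bool)

print("average dot product, different topics:", G[~same].mean())            # exactly 0 here
print("average dot product, same topic      :", G[same & off_diag].mean())  # small but positive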

0.3 The Solution

First consider the case $\epsilon = 0$. In that case A is a block matrix (after ordering the documents by topic and the words so that each topic's primary words are contiguous):
$$A = \begin{pmatrix} B_1 & 0 & \cdots & 0 \\ 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_k \end{pmatrix}.$$

Theorem: Under the Primary Words and Pure Topics Assumptions, the top k singular vectors of A are close to the indicator vectors of the k clusters of documents, provided m is large enough.
[The clusters are: cluster l consists of the documents on topic l. The indicator vector of a cluster is the vector with 1s on the cluster and 0s elsewhere, normalized to length 1.]
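A small numerical illustration of the theorem, using A and topics from above (a sketch, not the proof): since documents are the columns of A, the cluster-indicator vectors are n-dimensional, so we look at the top k right singular vectors and assign each document to the vector with the largest entry at its position.

from collections import Counter

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_top = np.abs(Vt[:k, :])                  # top k right singular vectors (k x n); abs handles sign flips

# Each top singular vector should be close to the normalized indicator of one cluster,
# so document j goes to the vector whose entry at position j is largest.
assignment = np.argmax(V_top, axis=0)

# Count how many documents land with the majority topic of their cluster.
agree = sum(Counter(topics[assignment == c]).most_common(1)[0][1]
            for c in range(k) if np.any(assignment == c))
print("correctly clustered:", agree, "out of", n)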
Idea of the Proof: First, the case $\epsilon = 0$. Notation: $n_l$ is the number of documents on topic l and $d_l$ is the number of primary words of topic l.
$$E(B_l) = M_{\cdot,l}\,\mathbf{1}^T.$$
$$\sigma_1(E(B_l)) = |M_{\cdot,l}|\,\sqrt{n_l} \;\ge\; \sqrt{n_l}\,p, \qquad (1)$$
where $p = \max_i M_{il}$. $B_l - E(B_l)$ is a random matrix with mean 0 and independent columns. Since we are picking m words in each document, the variance of each entry of $B_l$ is at most $p/m$ (entry i of a column is a Binomial($m, M_{il}$) count divided by m, so its variance is $M_{il}(1 - M_{il})/m \le p/m$).
We now pull out (without proof) a fundamental (hard) theorem from Random Matrix Theory to assert that
$$\sigma_1\big(B_l - E(B_l)\big) \;\le\; \text{Max length of any column} + \sqrt{n_l}\cdot\big(\text{Max S.D. of any entry}\big) \;\le\; 1 + \sqrt{\frac{n_l\,p}{m}}.$$

We see that this quantity is much smaller than $\sigma_1(E(B_l))$ for m large enough.
Now, assume that $|M_{\cdot,1}|, |M_{\cdot,2}|, \ldots, |M_{\cdot,k}|$ are all distinct, so that the $\sigma_1(B_l)$ are all distinct. Also assume that $|M_{\cdot,1}| > |M_{\cdot,2}| > \cdots > |M_{\cdot,k}|$.
We claim that the top singular vector of A will be close to the indicator vector of the first cluster. First, prove that it does not have any component on clusters other than the first. Then, suppose it has a component perpendicular to the indicator vector on the first cluster. The contribution of this component will be at most about $\sigma_1(B_l - E(B_l)) \ll \sigma_1(B_l)$....
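A quick numerical look at the two quantities compared in the proof idea, for one topic block of the illustrative data above. At these small sizes the separation between $\sigma_1(E(B_l))$ and the noise term is modest; it grows as m and $n_l$ grow.

l = 0                                         # examine one topic block (illustrative)
cols = np.where(topics == l)[0]
B_l = A[:, cols]                              # documents on topic l
n_l = len(cols)
EB_l = np.outer(M[:, l], np.ones(n_l))        # E(B_l) = M_{.,l} 1^T

p = M[:, l].max()                             # p = max_i M_il
signal = np.linalg.svd(EB_l, compute_uv=False)[0]         # sigma_1(E(B_l)) = |M_{.,l}| sqrt(n_l)
noise = np.linalg.svd(B_l - EB_l, compute_uv=False)[0]    # sigma_1(B_l - E(B_l))
bound = np.linalg.norm(B_l - EB_l, axis=0).max() + np.sqrt(n_l) * np.sqrt(p / m)

print("sigma_1(E(B_l))                      :", signal)
print("sigma_1(B_l - E(B_l))                :", noise)
print("max column length + sqrt(n_l) * s.d. :", bound)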
Ref: Latent Semantic Indexing: A Probabilistic Analysis, by Papadimitriou, Raghavan, Tamaki, and Vempala.
