
0.1 Notation

There are d words in the dictionary. d is large. (Think 20000.) There are k topics. (Think $k \approx 100$.) Each topic is a d-vector with non-negative components summing to 1. The i-th component is the probability that a random word in a document (purely) on that topic is word i. We let M be the $d \times k$ matrix with one column for each topic vector.

0.2 The Model

The Pure Model: Each document is purely on a single topic.


[This is really a cluster model. More general models where a doc is allowed
to be on multiple topics are more difficult to tackle.]
Topic Weights $w_1, w_2, \ldots, w_k$: positive reals summing to 1.
Documents are picked in i.i.d. trials. Let's say each document has m words in it. To pick the m words of one document:
1. Pick a topic l ($l \in \{1, 2, \ldots, k\}$), with
$$\Pr(l = 1) = w_1;\quad \Pr(l = 2) = w_2;\ \ldots;\ \Pr(l = k) = w_k.$$
2. In m i.i.d. trials pick the words of the document: in each of the m trials, pick a random word with (l is from step 1)
$$\text{Prob that word } i \text{ is picked} = M_{il}, \qquad \text{for } i = 1, 2, \ldots, d.$$

[This is the multinomial probability distribution.]
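To make the generative process concrete, here is a minimal sketch in Python/NumPy. It is not part of the notes: the sizes d, k, m and the construction of M are illustrative assumptions, chosen so that each topic has a disjoint block of primary words (the $\epsilon = 0$ situation discussed below).

import numpy as np

rng = np.random.default_rng(0)
d, k, m = 2000, 5, 300             # dictionary size, number of topics, words per document (illustrative)

# Topic matrix M (d x k): column l is a probability vector over the d words.
# Illustrative choice: give each topic a disjoint block of d // k primary words
# and put all of its probability mass there.
M = np.zeros((d, k))
block = d // k
for l in range(k):
    probs = rng.random(block)
    M[l * block:(l + 1) * block, l] = probs / probs.sum()

# Topic weights w_1, ..., w_k: positive reals summing to 1.
w = rng.random(k)
w /= w.sum()

def sample_document():
    """Step 1: pick a topic l with Pr(l) = w_l.  Step 2: draw m i.i.d. words from M[:, l]."""
    l = rng.choice(k, p=w)
    counts = rng.multinomial(m, M[:, l])    # multinomial word counts for one document
    return l, counts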


Define the word-document matrix A by
$$A_{ij} = \frac{\text{Number of occurrences of word } i \text{ in document } j}{m}.$$

Each column of A is a document. Each column sums to 1.
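Continuing the sketch above, the word-document matrix A can be formed as follows (the number of documents n is again an illustrative choice):

n = 500                                    # number of documents (illustrative)
topics = np.empty(n, dtype=int)            # true topic of each document, kept for checking later
A = np.zeros((d, n))
for j in range(n):
    topics[j], counts = sample_document()
    A[:, j] = counts / m                   # A_ij = (# occurrences of word i in document j) / m

assert np.allclose(A.sum(axis=0), 1.0)     # every column of A sums to 1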


Inference Problem: Given A, find the topic vectors and the topic weights.
[Schematic diagram of the model on the board.]
Primary Words Assumption: Each topic has a set of primary words; the total of their components (in the topic vector) is at least $1 - \epsilon$. The sets of primary words of different topics are disjoint.
So most words in the document vector for a document on topic l are primary words for that topic.

Question: What can you say about the dot product of two document vectors if they are on different topics? First think of the $\epsilon = 0$ case, then small $\epsilon$.
Question: Is the above a giveaway? I.e., can you solve the inference problem just based on this?
Hint: What can you say about the dot product of two document vectors on the same topic (even when $\epsilon = 0$)? Think of the case when the components of the topic vector are smaller than 1/m, so a single word is unlikely to occur in a document.
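These questions can be probed numerically with the illustrative A and topics above. With disjoint primary words and $\epsilon = 0$, two documents on different topics share no words at all, so their dot product is exactly 0; two documents on the same topic can also have a small dot product, since with topic components of order 1/m or less they share few words.

G = A.T @ A                                   # all pairwise dot products of document vectors
same = topics[:, None] == topics[None, :]     # True where the two documents share a topic
off_diag = ~np.eye(n, dtype=bool)

print("average dot product, different topics:", G[~same].mean())            # exactly 0 here
print("average dot product, same topic      :", G[same & off_diag].mean())  # small but positive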

0.3 The Solution

First consider the case $\epsilon = 0$. In that case A is a block matrix (after ordering the documents by topic and the words so that each topic's primary words are contiguous):
$$A = \begin{pmatrix} B_1 & 0 & \cdots & 0 \\ 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_k \end{pmatrix}.$$

Theorem: Under the Primary Words and Pure Topics Assumptions, the top k singular vectors of A are close to the indicator vectors of the k clusters of documents, provided m is large enough.
[The clusters are: cluster l consists of the documents on topic l. The indicator vector of a cluster is the vector with 1s on the cluster and 0s elsewhere, normalized to length 1.]
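A small numerical illustration of the theorem, using A and topics from above (a sketch, not the proof): since documents are the columns of A, the cluster-indicator vectors are n-dimensional, so we look at the top k right singular vectors and assign each document to the vector with the largest entry at its position.

from collections import Counter

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_top = np.abs(Vt[:k, :])                  # top k right singular vectors (k x n); abs handles sign flips

# Each top singular vector should be close to the normalized indicator of one cluster,
# so document j goes to the vector whose entry at position j is largest.
assignment = np.argmax(V_top, axis=0)

# Count how many documents land with the majority topic of their cluster.
agree = sum(Counter(topics[assignment == c]).most_common(1)[0][1]
            for c in range(k) if np.any(assignment == c))
print("correctly clustered:", agree, "out of", n)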
Idea of the Proof: First, the case $\epsilon = 0$. Notation: $n_l$ is the number of documents on topic l and $d_l$ is the number of primary words of topic l.
$$E(B_l) = M_{\cdot,l}\,\mathbf{1}^T.$$
$$\sigma_1(E(B_l)) = |M_{\cdot,l}|\,\sqrt{n_l} \;\ge\; \sqrt{n_l}\,p, \qquad (1)$$
where $p = \max_i M_{il}$. $B_l - E(B_l)$ is a random matrix with mean 0 and independent columns. Since we are picking m words in each document, the variance of each entry of $B_l$ is at most $p/m$ (entry i of a column is a Binomial($m, M_{il}$) count divided by m, so its variance is $M_{il}(1 - M_{il})/m \le p/m$).
We now pull out (without proof) a fundamental (hard) theorem from Random Matrix Theory to assert that
$$\sigma_1\big(B_l - E(B_l)\big) \;\le\; \text{Max length of any column} + \sqrt{n_l}\cdot\big(\text{Max S.D. of any entry}\big) \;\le\; 1 + \sqrt{\frac{n_l\,p}{m}}.$$

We see that this quantity is much smaller than $\sigma_1(E(B_l))$ for m large enough.
Now, assume that $|M_{\cdot,1}|, |M_{\cdot,2}|, \ldots, |M_{\cdot,k}|$ are all distinct, so that the $\sigma_1(B_l)$ are all distinct. Also assume that $|M_{\cdot,1}| > |M_{\cdot,2}| > \cdots > |M_{\cdot,k}|$.
We claim that the top singular vector of A will be close to the indicator vector of the first cluster. First, prove that it does not have any component on clusters other than the first. Then, suppose it has a component perpendicular to the indicator vector on the first cluster. The contribution of this component will be at most about $\sigma_1(B_l - E(B_l)) \ll \sigma_1(B_l)$....
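A quick numerical look at the two quantities compared in the proof idea, for one topic block of the illustrative data above. At these small sizes the separation between $\sigma_1(E(B_l))$ and the noise term is modest; it grows as m and $n_l$ grow.

l = 0                                         # examine one topic block (illustrative)
cols = np.where(topics == l)[0]
B_l = A[:, cols]                              # documents on topic l
n_l = len(cols)
EB_l = np.outer(M[:, l], np.ones(n_l))        # E(B_l) = M_{.,l} 1^T

p = M[:, l].max()                             # p = max_i M_il
signal = np.linalg.svd(EB_l, compute_uv=False)[0]         # sigma_1(E(B_l)) = |M_{.,l}| sqrt(n_l)
noise = np.linalg.svd(B_l - EB_l, compute_uv=False)[0]    # sigma_1(B_l - E(B_l))
bound = np.linalg.norm(B_l - EB_l, axis=0).max() + np.sqrt(n_l) * np.sqrt(p / m)

print("sigma_1(E(B_l))                      :", signal)
print("sigma_1(B_l - E(B_l))                :", noise)
print("max column length + sqrt(n_l) * s.d. :", bound)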
Ref: Latent Semantic Indexing: A Probabilistic Analysis, by Papadimitriou, Raghavan, Tamaki, and Vempala.
