Ankur Moitra
Institute for Advanced Study
and Princeton University
INFORMATION OVERLOAD!
Challenge: develop tools for automatic comprehension of data
OUTLINE
Are there efficient algorithms to find the topics?
Challenge: We cannot rigorously analyze the algorithms used
in practice! (When do they work? Do they run quickly?)
Part I: An Optimization Perspective
WORD-BY-DOCUMENT MATRIX
M: words (m) × documents (n), with M(i, j) = relative frequency of word i in document j
Nonnegative matrix factorization: M = A W
A: words (m) × topics (r), nonnegative; its columns are the topics
W: topics (r) × documents (n), nonnegative; its columns give each document's representation as a mixture of topics
WLOG we can assume the columns of A and W sum to one
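To make the setup concrete, here is a minimal sketch in numpy (the matrix sizes and random entries are made up purely for illustration) of a column-stochastic factorization M = A W:

import numpy as np

rng = np.random.default_rng(0)
m, n, r = 5000, 2000, 50    # words, documents, topics (illustrative sizes)

# A: words x topics, W: topics x documents, both nonnegative
A = rng.random((m, r))
W = rng.random((r, n))

# WLOG rescale so that every column of A and of W sums to one
A /= A.sum(axis=0, keepdims=True)
W /= W.sum(axis=0, keepdims=True)

# Then every column of M = A W is itself a distribution over words
M = A @ W
assert np.allclose(M.sum(axis=0), 1.0) and (M >= 0).all()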
AN ABRIDGED HISTORY
Machine Learning and Statistics:
Introduced by [Lee, Seung, 99]
Goal: extract latent relationships in the data
Applications to text classification, information retrieval,
collaborative filtering, etc. [Hofmann 99], [Kumar et al 98],
[Xu et al 03], [Kleinberg, Sandler 04]
Finding such a factorization can be written as a system of polynomial inequalities whose variables are the entries of A and W (nr + mr variables in total).
Can we reduce the number of variables from nr + mr to O(r²)?
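For reference, the factorization itself is already such a system; writing it out (this only restates the definition, it does not show how the reduction to O(r²) variables is achieved):

\exists\, A \in \mathbb{R}^{m \times r},\ W \in \mathbb{R}^{r \times n}:\qquad
M_{ij} = \sum_{k=1}^{r} A_{ik} W_{kj} \ \ \forall i, j,\qquad
A_{ik} \ge 0,\quad W_{kj} \ge 0 .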
A: words (m) × topics (r)
If an anchor word occurs, then the document is at least partially about the topic (an anchor word has nonzero probability in exactly one topic).
e.g. "401k" is an anchor word for personal finance, "bunt" for baseball, "oscar-winning" for movie reviews
A is p-separable if each topic has an anchor word that occurs with probability p
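The separability condition is easy to state as a direct check on A. A minimal sketch (the function name, the tolerance, and the reading of "probability p" as "probability at least p" are my own):

import numpy as np

def has_anchor_words(A, p, tol=1e-12):
    # A: words x topics, columns sum to one.
    # p-separability: every topic has a word that appears (essentially)
    # only in that topic, with probability at least p there.
    m, r = A.shape
    for k in range(r):
        others = np.delete(A, k, axis=1)
        only_here = others.max(axis=1, initial=0.0) <= tol   # word unused in other topics
        big_enough = A[:, k] >= p
        if not np.any(only_here & big_enough):
            return False
    return True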
ANCHOR WORDS ↔ VERTICES
In M = A W, each row of M is a nonnegative combination of the rows of W; the row of an anchor word is a scaled copy of a single row of W, so (after normalizing the rows) the anchor word rows are exactly the vertices of the convex hull of all the rows.
Deleting a word changes the convex hull if and only if it is an anchor word.
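Since the anchor words are exactly the vertices of this convex hull, they can be found greedily. Here is a minimal sketch in the spirit of the successive-projection / FastAnchorWords idea (my own illustration of one standard procedure, not necessarily the exact algorithm from the talk; it assumes every row has a nonzero sum):

import numpy as np

def find_anchor_rows(Q, r):
    # Normalize rows so every word is a point on the probability simplex;
    # under separability the anchor rows are the vertices of their convex hull.
    P = Q / Q.sum(axis=1, keepdims=True)
    R = P.copy()
    anchors = []
    for _ in range(r):
        j = int(np.argmax((R ** 2).sum(axis=1)))   # farthest remaining point
        anchors.append(j)
        u = R[j] / np.linalg.norm(R[j])
        R = R - np.outer(R @ u, u)                 # project out the chosen direction
    return anchors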
TOPIC MODELS
M = A W, where A is fixed and each column of W is stochastic (drawn from some distribution over topic mixtures)
e.g. document #1: (1.0, personal finance)
e.g. document #2: (0.5, baseball); (0.5, movie review)
Latent Dirichlet Allocation (Blei, Ng, Jordan): A fixed, columns of W drawn from a Dirichlet distribution
Correlated Topic Model (Blei, Lafferty): A fixed, columns of W drawn from a logistic normal distribution
Pachinko Allocation Model (Li, McCallum): A fixed, columns of W generated by a multilevel DAG
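To make the generative picture concrete, here is a minimal sketch of sampling documents when A is fixed and each column of W is drawn from a Dirichlet distribution, as in LDA (the vocabulary size, hyperparameters, and document lengths are made-up illustrative values):

import numpy as np

rng = np.random.default_rng(1)
m, r = 5000, 50                                   # vocabulary size, topics (illustrative)
A = rng.dirichlet(np.full(m, 0.05), size=r).T     # columns of A: Pr[word | topic]

def sample_document(doc_len=300, alpha=0.03):
    # Draw a topic mixture (a column of W) from a Dirichlet, then draw words.
    w = rng.dirichlet(np.full(r, alpha))
    word_dist = A @ w                              # this document's distribution over words
    counts = rng.multinomial(doc_len, word_dist)
    return counts / doc_len                        # a column of the empirical matrix

M_hat = np.column_stack([sample_document() for _ in range(1000)])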
The crucial observation is that we can work with the Gram matrix E[M Mᵀ]:
E[M Mᵀ] = A E[W Wᵀ] Aᵀ = A R Aᵀ, where R = E[W Wᵀ]
A is nonnegative and separable, so the anchor words are the extreme rows of the Gram matrix!
Given enough documents, we can still find the anchor words!
How can we use the anchor words to find the rest of A?
For an anchor word, the posterior distribution Pr[topic|word] is supported on just one topic.
We can use the anchor words to find Pr[topic|word] for all the other words.
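Concretely, after row-normalizing the Gram matrix, every word's row is a convex combination of the anchor rows, and the combination weights are exactly Pr[topic|word]. A minimal sketch of that recovery step, using a simple exponentiated-gradient solver for the simplex-constrained least-squares problem (the solver and its parameters are my own illustration, in the spirit of the RecoverL2 variant):

import numpy as np

def simplex_least_squares(B, q, iters=2000, eta=1.0):
    # Minimize ||c B - q||^2 over the probability simplex using
    # exponentiated-gradient updates (step size and iterations may need tuning).
    r = B.shape[0]
    c = np.full(r, 1.0 / r)
    for _ in range(iters):
        grad = 2.0 * (c @ B - q) @ B.T
        c = c * np.exp(-eta * grad)
        c /= c.sum()
    return c

def topic_given_word(Q_bar, anchors):
    # Q_bar: Gram matrix with each row normalized to sum to one.
    # Each row is a convex combination of the anchor rows, and the
    # combination weights are Pr[topic|word] for that word.
    B = Q_bar[anchors]
    return np.vstack([simplex_least_squares(B, q) for q in Q_bar])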
What we have: Pr[topic|word] for every word.
What we want: the entries of A, i.e. Pr[word|topic].
Bayes' Rule:
Pr[word|topic] = Pr[topic|word] Pr[word] / Σ_word' Pr[topic|word'] Pr[word']
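The last step is just the Bayes' rule computation above, applied to every word at once; Pr[word] can be estimated from empirical word frequencies (or from suitably normalized row sums of the Gram matrix). A minimal sketch:

import numpy as np

def recover_A(C, p_word):
    # C[i, k] = Pr[topic k | word i];  p_word[i] = Pr[word i]
    # Bayes' rule: Pr[word i | topic k] = C[i, k] * p_word[i] / sum_i' C[i', k] * p_word[i']
    unnorm = C * p_word[:, None]
    return unnorm / unnorm.sum(axis=0, keepdims=True)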
METHODOLOGY:
We ran our algorithm on real and synthetic data:
synthetic data: train an LDA model on 1100 NIPS abstracts, and use this model to run experiments
Our algorithm is fifty times faster and performs nearly the same on all the metrics we tried (ℓ1 error, log-likelihood, coherence, ...) when compared to MALLET
EXPERIMENTAL RESULTS
[Arora, Ge, Halpern, Mimno, Moitra, Sontag, Wu, Zhu, ICML13]:
[Plot: running time in seconds (0 to 3000) vs. number of documents (0 to 100,000) for the algorithms Gibbs, Recover, RecoverL2, and RecoverKL]
[Plot: SynthNIPS, L1 error (0.0 to 2.0) vs. number of documents (2,000 to 100,000 and infinite) for Gibbs, Recover, RecoverL2, and RecoverKL]
[Plot: held-out log probability per token vs. number of documents (2,000 to 100,000 and infinite) for Gibbs, Recover, RecoverL2, and RecoverKL]
MY WORK ON LEARNING
[Diagram: "Is Learning Computationally Easy?" at the center, linked to Nonnegative Matrix Factorization, Topic Models, New Models, Mixtures of Gaussians, Method of Moments, Population Recovery, DNF, Deep Learning, Linear Regression, and Robustness, via tools including computational geometry, algebraic geometry, complex analysis, functional analysis, experiments, and local search]
MY WORK ON ALGORITHMS
Approximation Algorithms, Metric Embeddings
Information Theory, Communication Complexity
Combinatorics, Smoothed Analysis
[Diagram: does f(x+y) = f(x) + f(y)?]
Summary:
Often, optimization problems abstracted from learning are intractable!
Are there new models that better capture the instances we actually want to solve in practice?
These new models can lead to interesting theory questions and highly practical new algorithms.
There are many exciting questions left to explore at the intersection of algorithms and learning.
Any Questions?