Tutorial: Deep Latent NLP
Yoon Kim, Sam Wiseman, Alexander Rush
Tutorial 2018
https://github.com/harvardnlp/DeepLatentNLP
(bit.do/lvnlp)
1 Introduction
  Goals
  Background
2 Models
3 Variational Objective
4 Inference Strategies
5 Advanced Topics
6 Case Studies
Latent-Variable Modeling in NLP

Probabilistic models provide a declarative language for specifying prior knowledge
and structural relationships in the context of unknown variables:
• Known interactions in the data
• Uncertainty about unknown factors
• Constraints on model properties
Deep Learning in NLP

Toolbox of methods for learning rich, non-linear data representations through
numerical optimization:
• Highly-flexible predictive models
• Transferable feature representations
• Structurally-aligned network architectures
• What unique challenges come from modeling text with latent variables?
• What techniques have been explored and shown to be effective in recent papers?

We explore these through the lens of variational inference.
Tutorial Take-Aways

1 A collection of deep latent-variable models for NLP
5 A survey of example applications
6 Code samples and techniques for practical use
Tutorial Non-Objectives

Not covered (for time, not relevance):
• Many classical latent-variable approaches
• Undirected graphical models such as MRFs
• Non-likelihood-based models such as GANs
• Sampling-based inference such as MCMC
What are deep networks?

Deep networks are parameterized non-linear functions that transform an input z
into features h using parameters π. For example, the multi-layer perceptron (MLP):

    h = MLP(z; π) = V σ(W z + b) + a,    π = {V, W, a, b}

The recurrent neural network (RNN) maps a sequence of inputs z_{1:T} into a
sequence of features h_{1:T}:

    h_t = RNN(h_{t−1}, z_t; π) = σ(U z_t + V h_{t−1} + b),    π = {V, U, b}
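As a concrete sketch of the two parameterizations above, here is a minimal numpy version; all dimensions, the initial state, and the choice of tanh for σ are illustrative assumptions, not the tutorial's code:

```python
import numpy as np

def mlp(z, V, W, a, b):
    # h = V σ(W z + b) + a, with σ = tanh (one hidden layer)
    return V @ np.tanh(W @ z + b) + a

def rnn(z_seq, U, V, b, h0):
    # h_t = σ(U z_t + V h_{t-1} + b), applied left to right
    hs, h = [], h0
    for z_t in z_seq:
        h = np.tanh(U @ z_t + V @ h + b)
        hs.append(h)
    return hs

# Toy dimensions: input dim 3, hidden dim 4, output dim 2.
rng = np.random.default_rng(0)
V1, W1 = rng.normal(size=(2, 4)), rng.normal(size=(4, 3))
a1, b1 = np.zeros(2), np.zeros(4)
h = mlp(rng.normal(size=3), V1, W1, a1, b1)

U2, V2, b2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
hs = rnn([rng.normal(size=3) for _ in range(5)], U2, V2, b2, np.zeros(4))
```

In both cases π is just the set of weight matrices and bias vectors passed in as arguments.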
What are latent variable models?

• Data consists of N i.i.d. samples:

    p(x^(1:N), z^(1:N); θ) = ∏_{n=1}^N p(x^(n) | z^(n); θ) p(z^(n); θ)
Probabilistic Graphical Models

[Plate diagram: z^(n) → x^(n), repeated over N]

    p(x^(1:N), z^(1:N); θ) = ∏_{n=1}^N p(x^(n) | z^(n); θ) p(z^(n); θ)
Posterior Inference

For models p(x, z; θ), we'll be interested in the posterior over latent variables z:

    p(z | x; θ) = p(x, z; θ) / p(x; θ)

Why?
• z will often represent interesting information about our data (e.g., the
  cluster x^(n) lives in, or how similar x^(n) and x^(n+1) are).
• Learning the parameters θ of the model often requires calculating posteriors
  as a subroutine.
• Intuition: if I know likely z^(n) for x^(n), I can learn by maximizing
  p(x^(n) | z^(n); θ).
• Latent variable objectives complicate backpropagation.
A Language Model

Context: RNN language models are remarkable at modeling text,

    x_{1:T} ∼ RNNLM(x_{1:T}; θ)

[Plate diagram: θ → x_1^(n) . . . x_T^(n), repeated over N]
Model 1: Discrete Clustering

Discrete latent variable models induce a clustering over sentences x^(n).

Generative process:
1 Draw cluster z ∈ {1, . . . , K} from a categorical with param µ.
2 Draw each of T words x_t from a categorical with word distribution π_z.

Parameters: θ = {µ ∈ ∆^{K−1}, K × V stochastic matrix π}

Gives rise to the "Naive Bayes" distribution:

    p(x, z; θ) = p(z; µ) × p(x | z; π) = µ_z × ∏_{t=1}^T Cat(x_t; π_z)
               = µ_z × ∏_{t=1}^T π_{z,x_t}
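A small numpy sketch of this Naive Bayes generative process and its joint probability; K, V, T and the parameter values are toy assumptions, not values from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, T = 3, 5, 4                      # clusters, vocab size, sentence length

mu = np.full(K, 1.0 / K)               # cluster prior, mu ∈ Δ^{K-1}
pi = rng.dirichlet(np.ones(V), size=K) # K × V stochastic matrix

def sample_sentence():
    # 1. Draw cluster z ~ Cat(mu); 2. draw T words x_t ~ Cat(pi_z).
    z = rng.choice(K, p=mu)
    x = rng.choice(V, size=T, p=pi[z])
    return z, x

def log_joint(x, z):
    # log p(x, z; θ) = log µ_z + Σ_t log π_{z, x_t}
    return np.log(mu[z]) + np.log(pi[z, x]).sum()

z, x = sample_sentence()
```

Working in log space avoids underflow when T grows, which matters once this joint feeds into posterior computations.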
Model 1: Graphical Model View

[Plate diagram: µ → z^(n) → x_1^(n) . . . x_T^(n) ← π, repeated over N]

    p(x^(1:N), z^(1:N); µ, π) = ∏_{n=1}^N p(z^(n); µ) × p(x^(n) | z^(n); π)
                              = ∏_{n=1}^N µ_{z^(n)} × ∏_{t=1}^T π_{z^(n), x_t^(n)}
Deep Model 1: Discrete - Mixture of RNNs

[Plate diagram: z^(n) → x_1^(n) . . . x_T^(n) ← π, repeated over N]
Interesting question: how will this affect the learned latent space?
Posterior Inference

For both discrete models, we can apply Bayes' rule:

    p(z | x; θ) = p(z) × p(x | z) / p(x)
                = p(z) × p(x | z) / Σ_{k=1}^K p(z=k) × p(x | z=k)

• For the mixture of categoricals, the posterior uses word counts under each π_k.
• For the mixture of RNNs, the posterior requires running the RNN over x for each k.
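Sketching this Bayes-rule posterior for the mixture-of-categoricals model in numpy; the parameter values are toy assumptions, with `mu` and `pi` playing the roles of µ and π above:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 5
mu = np.full(K, 1.0 / K)                # p(z)
pi = rng.dirichlet(np.ones(V), size=K)  # p(word | z), K × V

def posterior(x):
    # Numerator of Bayes' rule in log space: log p(z=k) + Σ_t log p(x_t | z=k);
    # the normalization over k implements the denominator Σ_k p(z=k) p(x | z=k).
    log_joint = np.log(mu) + np.log(pi[:, x]).sum(axis=1)
    log_joint -= log_joint.max()        # shift for numerical stability
    p = np.exp(log_joint)
    return p / p.sum()

post = posterior(np.array([0, 2, 2, 4]))
```

Note the K evaluations of the likelihood: for the mixture of RNNs, each of those rows would instead cost a full RNN pass over x.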
Model 2: Continuous / Dimensionality Reduction

Find a lower-dimensional, well-behaved continuous representation of a sentence.
Latent variables in R^d make distance/similarity easy. Examples:

• Recent work in text generation assumes a latent vector per sentence [Bowman
  et al. 2016; Yang et al. 2017; Hu et al. 2017].
• Certain sentence embeddings (e.g., Skip-Thought vectors [Kiros et al. 2015])
  can be interpreted in this way.
Generative process:
1 Draw continuous latent variable z from a Normal with param µ.
2 For each t, draw word x_t from a categorical with param softmax(W z).

Parameters: θ = {µ ∈ R^d, π}, where π = {W ∈ R^{V×d}}

Intuition: µ is a global distribution; z captures the local word distribution of
the sentence.
Graphical Model View

[Plate diagram: µ → z^(n) → x_1^(n) . . . x_T^(n) ← π, repeated over N]

Gives rise to the joint distribution:

    p(x^(1:N), z^(1:N); θ) = ∏_{n=1}^N p(z^(n); µ) × p(x^(n) | z^(n); π)
Deep Model 2: Continuous "Mixture" of RNNs

Generative process:
1 Draw latent variable z ∼ N(µ, I).
2 Draw each token x_t from a conditional RNNLM.

The RNN is also conditioned on the latent z,

    p(x, z; π, µ, I) = p(z; µ, I) × p(x | z; π)
                     = N(z; µ, I) × CRNNLM(x_{1:T}; π, z)

where

    CRNNLM(x_{1:T}; π, z) = ∏_{t=1}^T softmax(W h_t)_{x_t}
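One way to sketch this generative process in numpy: feed z into the RNN input at every step alongside the previous token. The dimensions, the tanh cell, the zero start token, and the way z enters the RNN are all illustrative assumptions, since the tutorial does not pin down the conditioning architecture here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, dh, V, T = 2, 4, 5, 6   # latent dim, hidden dim, vocab, length (toy sizes)
mu = np.zeros(d)
U = rng.normal(scale=0.5, size=(dh, V + d))  # input: one-hot token ++ z
Vh = rng.normal(scale=0.5, size=(dh, dh))
W = rng.normal(scale=0.5, size=(V, dh))      # softmax(W h_t) over the vocab

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sample_crnnlm():
    # 1. z ~ N(mu, I); 2. x_t ~ Cat(softmax(W h_t)), with h_t conditioned on z.
    z = rng.normal(mu, 1.0)
    h, x, prev = np.zeros(dh), [], np.zeros(V)   # zero vector as start token
    for _ in range(T):
        h = np.tanh(U @ np.concatenate([prev, z]) + Vh @ h)
        x_t = rng.choice(V, p=softmax(W @ h))
        prev = np.eye(V)[x_t]
        x.append(x_t)
    return z, x

z, x = sample_crnnlm()
```

Concatenating z to every input is only one design choice; initializing h_0 from z is another common option.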
Posterior Inference

• Shallow and deep Model 2 variants mirror the Model 1 variants exactly, but
  with continuous z.
• The integral p(x; θ) = ∫ p(x, z; θ) dz is intractable (in general) for both
  shallow and deep variants.
Model 3: Structure Learning

Structured latent variable models are used to infer unannotated structure:
• Unsupervised POS tagging [Brown et al. 1992; Merialdo 1994; Smith and Eisner 2005]
• Unsupervised dependency parsing [Klein and Manning 2004; Headden III et al. 2009]

Or when structure is useful for interpreting our data:
• Segmentation of documents into topical passages [Hearst 1997]
• Alignment [Vogel et al. 1996]
Model 3: Structured - Hidden Markov Model

Generative process:
1 For each t, draw z_t ∈ {1, . . . , K} from a categorical with param µ_{z_{t−1}}.
2 Draw observed token x_t from a categorical with param π_{z_t}.

Parameters: θ = {K × K stochastic matrix µ, K × V stochastic matrix π}

Gives rise to the joint distribution:

    p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t−1}; µ_{z_{t−1}}) × ∏_{t=1}^T p(x_t | z_t; π_{z_t})
               = ∏_{t=1}^T µ_{z_{t−1}, z_t} × ∏_{t=1}^T π_{z_t, x_t}
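A numpy sketch of this HMM's generative process and joint probability; the toy parameter values and the fixed initial state z_0 = 0 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, T = 3, 5, 6                        # states, vocab, length (toy sizes)
mu = rng.dirichlet(np.ones(K), size=K)   # K × K transition matrix
pi = rng.dirichlet(np.ones(V), size=K)   # K × V emission matrix
z0 = 0                                   # assumed fixed initial state

def sample_hmm():
    zs, xs, z = [], [], z0
    for _ in range(T):
        z = rng.choice(K, p=mu[z])          # z_t ~ Cat(mu_{z_{t-1}})
        zs.append(z)
        xs.append(rng.choice(V, p=pi[z]))   # x_t ~ Cat(pi_{z_t})
    return np.array(zs), np.array(xs)

def log_joint(zs, xs):
    # Σ_t log µ_{z_{t-1}, z_t} + Σ_t log π_{z_t, x_t}
    zprev = np.concatenate([[z0], zs[:-1]])
    return np.log(mu[zprev, zs]).sum() + np.log(pi[zs, xs]).sum()

zs, xs = sample_hmm()
```

Unlike Models 1 and 2, evaluating the joint here requires pairing each state with its predecessor, which is what makes the latent "parts" interdependent.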
Graphical Model View

[Diagram: µ → z_1 → z_2 → z_3 → z_4; each z_t → x_t ← π; repeated over N]

    p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t−1}; µ_{z_{t−1}}) × ∏_{t=1}^T p(x_t | z_t; π_{z_t})
               = ∏_{t=1}^T µ_{z_{t−1}, z_t} × ∏_{t=1}^T π_{z_t, x_t}
Further Extension: Factorial HMM

[Diagram: L parallel state chains z_{l,1:T}, all emitting into x_1 . . . x_T; repeated over N]

    p(x, z; θ) = ∏_{l=1}^L ∏_{t=1}^T p(z_{l,t} | z_{l,t−1}) × ∏_{t=1}^T p(x_t | z_{1:L,t})
Deep Model 3: Deep HMM

Parameterize the transition and emission distributions with neural networks (cf.
Tran et al. [2016]):

• Model the transition distribution as

    p(z_t | z_{t−1}) = softmax(MLP(z_{t−1}; µ))

• Model the emission distribution as

    p(x_t | z_t) = softmax(MLP(z_t; π))

Note: K × K transition parameters for the standard HMM vs. O(K × d + d²) for
the deep version.
Posterior Inference

For structured models, Bayes' rule may be tractable,

    p(z | x; θ) = p(z; µ) × p(x | z; π) / Σ_{z′} p(z′; µ) × p(x | z′; π)

• Unlike in the previous models, z contains interdependent "parts."
• For both shallow and deep Model 3 variants, it's possible to calculate
  p(x; θ) exactly, with a dynamic program.
• For some structured models, like the Factorial HMM, the dynamic program may
  still be intractable.
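For the HMM the dynamic program is the forward algorithm. A numpy sketch, with toy parameters and an assumed uniform initial distribution, checked against brute-force enumeration of all K^T state sequences:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K, V, T = 3, 5, 6
mu = rng.dirichlet(np.ones(K), size=K)   # K × K transitions
pi = rng.dirichlet(np.ones(V), size=K)   # K × V emissions
init = np.full(K, 1.0 / K)               # assumed uniform initial distribution

def forward_marginal(xs):
    # alpha_t[k] = p(x_{1:t}, z_t = k); sums out z in O(T K^2).
    alpha = init * pi[:, xs[0]]
    for x_t in xs[1:]:
        alpha = (alpha @ mu) * pi[:, x_t]
    return alpha.sum()                   # p(x; θ)

def brute_force_marginal(xs):
    # O(K^T) enumeration over all state sequences, for checking correctness.
    total = 0.0
    for zs in product(range(K), repeat=len(xs)):
        p = init[zs[0]] * pi[zs[0], xs[0]]
        for t in range(1, len(xs)):
            p *= mu[zs[t - 1], zs[t]] * pi[zs[t], xs[t]]
        total += p
    return total

xs = rng.choice(V, size=T)
```

The same recursion with an extra normalization step yields the posterior marginals p(z_t | x; θ) when combined with a backward pass.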
Learning with Maximum Likelihood

Maximum likelihood objective: find model parameters θ that maximize the
likelihood of the data,

    θ* = arg max_θ Σ_{n=1}^N log p(x^(n); θ)
Learning Deep Models

    L(θ) = Σ_{n=1}^N log p(x^(n); θ)

• The dominant framework is gradient-based optimization.
Learning Deep Latent-Variable Models: Marginalization

    log p(x; θ) = log Σ_z p(x, z; θ)
Variational Inference

High-level: decompose the objective into a lower bound and a gap,

    L(θ) = LB(θ, λ) + GAP(θ, λ)    for some λ
Marginal Likelihood: Variational Decomposition

    log p(x; θ) = E_q [log p(x)]                                  (expectation over z)
                = E_q [log p(x, z) / p(z | x)]                    (mult/div by p(z|x), combine numerator)
                = E_q [log (p(x, z) / q(z | x)) (q(z | x) / p(z | x))]   (mult/div by q(z|x))
                = E_q [log p(x, z) / q(z | x)] + E_q [log q(z | x) / p(z | x)]   (split log)
                = E_q [log p(x, z; θ) / q(z | x; λ)] + KL[q(z | x; λ) ‖ p(z | x; θ)]
Evidence Lower Bound over Observations

    ELBO(θ, λ; x) = E_{q(z)} [log p(x, z; θ) / q(z | x; λ)]

• The ELBO is a function of the generative model parameters, θ, and the
  variational parameters, λ.

    Σ_{n=1}^N log p(x^(n); θ) ≥ Σ_{n=1}^N ELBO(θ, λ; x^(n))
                               = Σ_{n=1}^N E_{q(z | x^(n); λ)} [log p(x^(n), z; θ) / q(z | x^(n); λ)]
• Just as with p and θ, we can select any form of q and λ that satisfies the
  ELBO conditions.
• Different choices of q will lead to different algorithms.
• We will explore several forms of q:
  • Posterior
  • Point estimate / MAP
  • Amortized
  • Mean field (later)
λ = [λ^(1), . . . , λ^(N)] is a concatenation of local variational parameters λ^(n).
Example Family: Amortized Parameterization [Kingma and Welling 2014]

[Diagram: λ parameterizes q(z^(n) | x^(n)), trained to minimize
KL(q(z | x) ‖ p(z | x)); θ parameterizes the generative model; repeated over N]
Maximizing the Evidence Lower Bound

Central quantity of interest: almost all methods maximize the ELBO,

    arg max_{θ,λ} ELBO(θ, λ)

Aggregate ELBO objective:

    arg max_{θ,λ} ELBO(θ, λ) = arg max_{θ,λ} Σ_{n=1}^N ELBO(θ, λ; x^(n))
                             = arg max_{θ,λ} Σ_{n=1}^N E_q [log p(x^(n), z^(n); θ) / q(z^(n) | x^(n); λ)]
Maximizing ELBO: Model Parameters

    arg max_θ E_q [log p(x, z; θ) / q(z | x; λ)] = arg max_θ E_q [log p(x, z; θ)]

Intuition: a maximum likelihood problem under variables drawn from q(z | x; λ).
Model Estimation: Gradient Ascent on Model Parameters

Easy: gradient with respect to θ,

    ∇_θ ELBO(θ, λ; x) = ∇_θ E_q [log p(x, z; θ)]
                      = E_q [∇_θ log p(x, z; θ)]
Model Inference: Gradient Ascent on λ?

Hard: gradient with respect to λ,

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_q [log p(x, z; θ) / q(z | x; λ)]
                      ≠ E_q [∇_λ log p(x, z; θ) / q(z | x; λ)]

• Cannot naively move ∇ inside the expectation, since q depends on λ.
• This section covers inference in practice:
  1 Exact gradient
  2 Sampling: score function, reparameterization
  3 Conjugacy: closed-form, coordinate ascent
Strategy 1: Exact Gradient

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q(z | x; λ)} [log p(x, z; θ) / q(z | x; λ)]
                      = ∇_λ Σ_{z∈Z} q(z | x; λ) log [p(x, z; θ) / q(z | x; λ)]

• Naive enumeration: linear in |Z|.
• Depending on the structure of q and p, potentially faster with dynamic
  programming.
• Applicable mainly to Models 1 and 3 (discrete and structured), or Model 2
  with a point estimate.
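A numpy sketch of the exact-gradient strategy for a single discrete z with q = softmax(λ). The target log p(x, z) values are random stand-ins, and the closed-form gradient comes from differentiating the enumerated sum (using Σ_z ∇q_z = 0); both are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
log_p = np.log(rng.dirichlet(np.ones(K)))  # stand-in for log p(x, z) over z ∈ Z

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def elbo(lam):
    # Exact ELBO by enumerating z: Σ_z q(z) [log p(x, z) − log q(z)]
    q = softmax(lam)
    return np.dot(q, log_p - np.log(q))

def elbo_grad(lam):
    # Exact gradient: ∇_k = q_k (r_k − E_q[r]), with r = log p − log q.
    q = softmax(lam)
    r = log_p - np.log(q)
    return q * (r - np.dot(q, r))

lam = rng.normal(size=K)
g = elbo_grad(lam)
```

The cost is linear in |Z| = K here; for Model 3 the same sum would be computed with a dynamic program instead of explicit enumeration.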
Example: Model 1 - Naive Bayes

[Diagram: λ parameterizes q(z); the model is z → x_1 . . . x_T]
Strategy 2a: Sampling — Score Function Gradient Estimator

First term. Use the basic identity ∇ log q = ∇q / q, i.e. ∇q = q ∇ log q.
Policy-gradient style training [Williams 1992]:

∇λ Eq[log p(x, z; θ)] = Σ_z ∇λ q(z | x; λ) log p(x, z; θ)
                      = Σ_z q(z | x; λ) ∇λ log q(z | x; λ) log p(x, z; θ)
                      = Eq[log p(x, z; θ) ∇λ log q(z | x; λ)]
Strategy 2a: Sampling — Score Function Gradient Estimator

Second term. Need the additional identity Σ_z ∇q = ∇ Σ_z q = ∇1 = 0.

∇λ Eq[log q(z | x; λ)] = Σ_z ∇λ [q(z | x; λ) log q(z | x; λ)]
                       = Σ_z log q(z | x; λ) q(z | x; λ) ∇λ log q(z | x; λ) + Σ_z ∇λ q(z | x; λ)
                       = Eq[log q(z | x; λ) ∇λ log q(z | x; λ)]

(the last step drops the second sum, which is zero by the identity above).
Strategy 2a: Sampling — Score Function Gradient Estimator

∇λ ELBO(θ, λ; x) = ∇λ Eq[log p(x, z; θ) / q(z | x; λ)]
                 = Eq[log (p(x, z; θ) / q(z | x; λ)) ∇λ log q(z | x; λ)]
                 = Eq[Rθ,λ(z) ∇λ log q(z | x; λ)]
Estimate with samples z(1), . . . , z(J) ∼ q(z | x; λ):

Eq[Rθ,λ(z) ∇λ log q(z | x; λ)] ≈ (1/J) Σ_{j=1}^J Rθ,λ(z(j)) ∇λ log q(z(j) | x; λ)

Intuition: if a sample z(j) has high reward Rθ,λ(z(j)), increase the probability of z(j) by moving along the gradient ∇λ log q(z(j) | x; λ).
• Essentially reinforcement learning with reward Rθ,λ(z).
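The estimator above can be sketched in a few lines. This is a minimal toy implementation, not from the tutorial: all names are mine, and the example uses a Bernoulli q over z ∈ {0, 1} so the exact gradient can be checked by enumeration.

```python
import math
import random

def score_function_grad(sample_q, log_p, log_q, grad_log_q, num_samples):
    """Monte Carlo score-function estimate of d/dlambda ELBO.

    Averages R(z) * grad_lambda log q(z | x; lambda) over samples z ~ q,
    where R(z) = log p(x, z; theta) - log q(z | x; lambda).
    """
    total = 0.0
    for _ in range(num_samples):
        z = sample_q()
        reward = log_p(z) - log_q(z)
        total += reward * grad_log_q(z)
    return total / num_samples

# Toy check: z in {0, 1}, q(z=1) = lam, fixed joint probabilities p(x, z).
lam = 0.5
p_joint = {0: 0.2, 1: 0.3}  # p(x, z) for the observed x
sample_q = lambda: 1 if random.random() < lam else 0
log_p = lambda z: math.log(p_joint[z])
log_q = lambda z: math.log(lam if z == 1 else 1 - lam)
grad_log_q = lambda z: (1 / lam) if z == 1 else (-1 / (1 - lam))

# Exact gradient by enumeration: d/dlam of sum_z q(z) * (log p(x,z) - log q(z)).
exact = sum((1 if z == 1 else -1) * (math.log(p_joint[z]) - log_q(z) - 1)
            for z in (0, 1))

random.seed(0)
estimate = score_function_grad(sample_q, log_p, log_q, grad_log_q, 200_000)
```

With many samples the estimate converges to the enumerated gradient, but its per-sample variance is large, which is the usual motivation for baselines and for reparameterization.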
Strategy 2b: Sampling — Reparameterization

Gradient calculation (first term):

∇λ Ez∼q(z | x; λ)[log p(x, z; θ)] = ∇λ Eε∼U[log p(x, g(ε, λ); θ)]
                                  = Eε∼U[∇λ log p(x, g(ε, λ); θ)]
                                  ≈ (1/J) Σ_{j=1}^J ∇λ log p(x, g(ε(j), λ); θ)

where ε(1), . . . , ε(J) ∼ U.
Strategy 2b: Sampling — Reparameterization

• Unbiased, like the score function gradient estimator, but empirically lower variance.
• In practice, a single sample is often sufficient.
Strategy 2: Continuous Latent Variable RNN

[Figure: graphical model with variational parameter λ feeding z, which generates x1 . . . xT.]

Choose the variational family to be an amortized diagonal Gaussian:

q(z | x; λ) = N(µ, σ²)        µ, σ² = enc(x; λ)

Then we can sample from q(z | x; λ) by

ε ∼ N(0, I)        z = µ + σ ⊙ ε
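The reparameterized gradient can be sketched for a one-dimensional Gaussian. This is a toy check with names of my choosing: the gradient with respect to µ of E_{z∼N(µ,σ²)}[log p(x, z)] is estimated by pushing ∇ through z = µ + σε.

```python
import math
import random

def reparam_grad_mu(mu, sigma, dlogp_dz, num_samples):
    """Reparameterized estimate of d/dmu E_{z ~ N(mu, sigma^2)}[log p(x, z)].

    Writes z = mu + sigma * eps with eps ~ N(0, 1); since dz/dmu = 1,
    the gradient is E_eps[ dlogp_dz(mu + sigma * eps) ].
    """
    total = 0.0
    for _ in range(num_samples):
        eps = random.gauss(0.0, 1.0)
        z = mu + sigma * eps
        total += dlogp_dz(z)
    return total / num_samples

# Toy check: log p(x, z) = -(z - 3)^2 / 2 + const, so d log p / dz = -(z - 3)
# and the exact gradient of the expectation is -(mu - 3).
random.seed(0)
mu, sigma = 1.0, 0.5
grad = reparam_grad_mu(mu, sigma, lambda z: -(z - 3.0), 100_000)
```

Note how the estimator uses the derivative of log p itself, which is the "more knowledge about the objective" point made on the next slide.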
Strategy 2b: Sampling — Reparameterization

(Recall Rθ,λ(z) = log p(x, z; θ) / q(z | x; λ))

• Score function:

∇λ ELBO(θ, λ; x) = Ez∼q[Rθ,λ(z) ∇λ log q(z | x; λ)]

• Reparameterization:

∇λ ELBO(θ, λ; x) = Eε∼N(0,I)[∇λ Rθ,λ(g(ε, λ; x))]

where g(ε, λ; x) = µ + σ ⊙ ε.

Informally, reparameterization gradients differentiate through Rθ,λ(·) and thus have "more knowledge" about the structure of the objective function.
Strategy 3: Conjugacy

For certain choices of p and q, we can compute parts of

arg max_λ ELBO(θ, λ; x)

exactly in closed form.

Recall that

arg max_λ ELBO(θ, λ; x) = arg min_λ KL[q(z | x; λ) ∥ p(z | x; θ)]
Strategy 3a: Conjugacy — Tractable Posterior Inference

Suppose we can tractably calculate p(z | x; θ). Then KL[q(z | x; λ) ∥ p(z | x; θ)] is minimized when

q(z | x; λ) = p(z | x; θ)

• The E-step in the Expectation Maximization algorithm [Dempster et al. 1977]

[Figure: log marginal likelihood L vs. ELBO as a function of λ, with the posterior gap closed at the optimum.]
Example: Model 1 — Naive Bayes

[Figure: graphical model with variational parameter λ feeding z, which generates x1 . . . xT.]

p(z | x; θ) = p(x, z; θ) / Σ_{z′=1}^K p(x, z′; θ)
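For Naive Bayes the sum over z′ is just over K classes, so the exact posterior is a few lines. A minimal sketch (helper names are mine), working in log space for stability:

```python
import math

def nb_posterior(log_prior, log_emit, xs):
    """Exact posterior p(z | x1:T) for Naive Bayes by enumerating classes.

    log_prior[k]   : log p(z = k)
    log_emit[k][v] : log p(x = v | z = k)
    xs             : observed tokens
    """
    log_joint = [lp + sum(log_emit[k][x] for x in xs)
                 for k, lp in enumerate(log_prior)]
    # log-sum-exp over classes gives the evidence log p(x).
    m = max(log_joint)
    log_evidence = m + math.log(sum(math.exp(lj - m) for lj in log_joint))
    return [math.exp(lj - log_evidence) for lj in log_joint]

log_prior = [math.log(0.5), math.log(0.5)]
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.2), math.log(0.8)]]
posterior = nb_posterior(log_prior, log_emit, [0, 0])
```

Here the joint masses are 0.5·0.9·0.9 = 0.405 and 0.5·0.2·0.2 = 0.02, so the posterior on class 0 is 0.405 / 0.425.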
Example: Model 3 — HMM

[Figure: HMM graphical model with transition parameters µ and emission parameters π; states z1 . . . z4 emit x1 . . . x4, repeated over N sequences.]

p(x, z; θ) = p(z0) Π_{t=1}^T p(zt | zt−1; µ) p(xt | zt; π)
Run forward/backward dynamic programming to calculate the posterior marginals p(zt | x; θ).
Connection: Gradient Ascent on Log Marginal Likelihood

[Figure: the log marginal likelihood L and the ELBO.]
• Practically, this means we don't have to manually perform posterior inference in the E-step. Can just calculate log p(x; θ) and call backpropagation.
• Example: in a deep HMM, just implement the forward algorithm to calculate log p(x; θ) and backpropagate using autodiff. No need to implement the backward algorithm. (Or vice versa.)

(See Eisner [2016]: "Inside-Outside and Forward-Backward Algorithms Are Just Backprop")
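The forward algorithm mentioned above can be sketched directly. This is a minimal pure-Python version (names mine; an autodiff framework would give the posterior marginals by backpropagating through it), checked against brute-force enumeration:

```python
import math
from itertools import product

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_marginal(log_init, log_trans, log_emit, obs):
    """log p(x; theta) for an HMM via the forward algorithm, in log space.

    log_init[k]     : log p(z1 = k)
    log_trans[j][k] : log p(z_t = k | z_{t-1} = j)
    log_emit[k][v]  : log p(x = v | z = k)
    """
    K = len(log_init)
    alpha = [log_init[k] + log_emit[k][obs[0]] for k in range(K)]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[j] + log_trans[j][k] for j in range(K)])
                 + log_emit[k][x] for k in range(K)]
    return logsumexp(alpha)

def brute_force_log_marginal(log_init, log_trans, log_emit, obs):
    """Same quantity by enumerating all K^T state sequences (check only)."""
    K, T = len(log_init), len(obs)
    terms = []
    for zs in product(range(K), repeat=T):
        lp = log_init[zs[0]] + log_emit[zs[0]][obs[0]]
        for t in range(1, T):
            lp += log_trans[zs[t - 1]][zs[t]] + log_emit[zs[t]][obs[t]]
        terms.append(lp)
    return logsumexp(terms)
```

The forward recursion is O(T K²) versus O(K^T) for enumeration, while computing the identical log marginal.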
• Let p(z | x; θ) be intractable, but suppose p(x, z; θ) is conditionally conjugate, meaning p(zt | x, z−t; θ) is exponential family.
• Restrict the family of distributions q so that it factorizes over zt, i.e.

q(z; λ) = Π_{t=1}^T q(zt; λt)

(mean field family)
• Further choose q(zt; λt) so that it is in the same family as p(zt | x, z−t; θ).
[Figure: mean field graphical model — each q(zt; λt) has its own variational parameter λt(n) per example n, chosen to minimize KL(q(z) ∥ p(z | x)).]

q(z; λ) = Π_{t=1}^T q(zt; λt)
Mean Field Family

• Optimize the ELBO via coordinate ascent, i.e. iterate over λ1, . . . , λT:

arg min_{λt} KL[Π_{t=1}^T q(zt; λt) ∥ p(z | x; θ)]

Example: Model 3 — Factorial HMM

[Figure: factorial HMM with layers of states zl,1 . . . zl,4 jointly emitting x1 . . . x4.]

p(x, z; θ) = Π_{l=1}^L Π_{t=1}^T p(zl,t | zl,t−1; θ) p(xt | zl,t; θ)
Example: Model 3 — Factorial HMM

[Figure: factorial HMM with layers of states zl,1 . . . zl,4 jointly emitting x1 . . . x4.]

Coordinate-wise updates, e.g. for z1,1, then z2,1, and so on:

q(z1,1; λ1,1) ∝ exp Eq(z−(1,1); λ−(1,1))[log p(x, z; θ)]
q(z2,1; λ2,1) ∝ exp Eq(z−(2,1); λ−(2,1))[log p(x, z; θ)]
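The update q(z_i) ∝ exp E_{q(z_−i)}[log p(x, z)] can be sketched on a toy two-variable posterior. A minimal coordinate-ascent (CAVI) loop, with all names mine; the factorial HMM case iterates the same update over every zl,t:

```python
import math

def _normalize_exp(scores):
    """softmax over raw log-scores."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    Z = sum(w)
    return [x / Z for x in w]

def cavi_two_variable(log_joint, K1, K2, iters=50):
    """Coordinate ascent on a fully factorized q(z1)q(z2), mirroring
    q(z_i) ∝ exp E_{q(z_-i)}[log p(x, z)].

    log_joint[a][b] : log p(x, z1 = a, z2 = b) for the fixed x
    """
    q1 = [1.0 / K1] * K1
    q2 = [1.0 / K2] * K2
    for _ in range(iters):
        # Update q(z1) holding q(z2) fixed.
        q1 = _normalize_exp([sum(q2[b] * log_joint[a][b] for b in range(K2))
                             for a in range(K1)])
        # Update q(z2) holding q(z1) fixed.
        q2 = _normalize_exp([sum(q1[a] * log_joint[a][b] for a in range(K1))
                             for b in range(K2)])
    return q1, q2

def mean_field_elbo(log_joint, q1, q2):
    """E_q[log p(x, z)] + H[q], for monitoring the coordinate ascent."""
    return sum(q1[a] * q2[b] * (log_joint[a][b] - math.log(q1[a]) - math.log(q2[b]))
               for a in range(len(q1)) for b in range(len(q2)))
```

Each coordinate update is the exact minimizer of the KL with the other factor held fixed, so the ELBO never decreases across iterations.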
Advanced Topics

1 Gumbel-Softmax: Extend reparameterization to discrete variables.
2 Flows: Optimize a tighter bound by making the variational family q more flexible.
3 Importance Weighting: Optimize a tighter bound through importance sampling.
Challenges of Discrete Variables

• Score function estimators for discrete z have high variance; a common fix is an input-dependent baseline B(x; ψ), trained by minimizing

(B(x; ψ) − log p(x, z; θ) / q(z | x; λ))²
Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]

Sampling z ∼ Cat(α) via the Gumbel-Max trick:
1 Drawing εk ∼ Gumbel(0, 1) for k = 1, . . . , K
2 Adding εk to log αk, finding the argmax, i.e.

i = arg max_k [log αk + εk]        zi = 1
Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]

Reparameterization:

z = g(ε, α) = arg max_{s∈∆^{K−1}} (log α + ε)⊤ s

But this won't work, because of zero gradients (almost everywhere):

∇λ Rθ,λ(z) = 0

Gumbel-Softmax trick: replace arg max with softmax,

z = softmax((log α + ε) / τ)        zk = exp((log αk + εk)/τ) / Σ_{j=1}^K exp((log αj + εj)/τ)
• Approaches a discrete distribution as τ → 0 (anneal τ during training).
• Reparameterizable by construction.
• Differentiable with non-zero gradients.
Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]

• See Maddison et al. [2017] on whether we can use the original categorical densities p(z), q(z), or need to use the relaxed densities pGS(z), qGS(z).
• Requires that p(x | z; θ) "makes sense" for non-discrete z (e.g. attention).
• Lower-variance, but biased gradient estimator. Variance → ∞ as τ → 0.
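The sampling step above can be sketched directly. A minimal pure-Python version (names mine): Gumbel noise via ε = −log(−log u), then a tempered softmax. A useful sanity property is that the argmax of a relaxed sample follows Cat(α) exactly, for any τ, by the Gumbel-Max trick.

```python
import math
import random

def gumbel_softmax(log_alpha, tau):
    """One relaxed sample z = softmax((log alpha + eps) / tau),
    with eps_k ~ Gumbel(0, 1) drawn as -log(-log u), u ~ Uniform(0, 1)."""
    eps = [-math.log(-math.log(random.random())) for _ in log_alpha]
    scores = [(la + e) / tau for la, e in zip(log_alpha, eps)]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    Z = sum(w)
    return [x / Z for x in w]

random.seed(0)
log_alpha = [math.log(0.2), math.log(0.3), math.log(0.5)]
samples = [gumbel_softmax(log_alpha, tau=0.1) for _ in range(20_000)]
# Argmax frequencies should approximate (0.2, 0.3, 0.5) regardless of tau.
freq_last = sum(max(range(3), key=lambda k: z[k]) == 2
                for z in samples) / len(samples)
```

With small τ each sample is close to one-hot; larger τ gives smoother simplex points with lower-variance (but more biased) gradients.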
Recall

log p(x; θ) = ELBO(θ, λ; x) + KL[q(z | x; λ) ∥ p(z | x; θ)]

Bound is tight when the variational posterior equals the true posterior:

q(z | x; λ) = p(z | x; θ) =⇒ log p(x; θ) = ELBO(θ, λ; x)

We want to make q(z | x; λ) as flexible as possible: can we do better than just a Gaussian?
Idea: transform a sample from a simple initial variational distribution,

z0 ∼ q(z | x; λ) = N(µ, σ²)        µ, σ² = enc(x; λ)

into a more complex one,

zK = fK ◦ · · · ◦ f2 ◦ f1(z0; λ)

where the fk(zk−1; λ) are invertible transformations (whose parameters are absorbed by λ).
Flows [Rezende and Mohamed 2015; Kingma et al. 2016]

A sample from the final variational posterior is given by zK. Its density is given by the change of variables formula:

log qK(zK | x; λ) = log q(z0 | x; λ) + Σ_{k=1}^K log |det ∂f_k^{−1}/∂zk|
                  = log q(z0 | x; λ) − Σ_{k=1}^K log |det ∂fk/∂zk−1|

(log density of the initial Gaussian, minus the log determinants of the Jacobians)

Determinant calculation is O(N³) in general, but can be made faster depending on the parameterization of fk.
Can still use reparameterization to obtain gradients. Letting F(z) = fK ◦ · · · ◦ f1(z),

∇λ ELBO(θ, λ; x) = ∇λ EqK(zK | x; λ)[log p(x, zK; θ) / qK(zK | x; λ)]
                 = ∇λ Eq(z0 | x; λ)[log (p(x, F(z0); θ) / q(z0 | x; λ)) − log |det ∂F/∂z0|]
                 = Eε∼N(0,I)[∇λ (log (p(x, F(z0); θ) / q(z0 | x; λ)) − log |det ∂F/∂z0|)]
Flows [Rezende and Mohamed 2015; Kingma et al. 2016]

Examples of fk(zk−1; λ):

• Normalizing Flows [Rezende and Mohamed 2015]

fk(zk−1) = zk−1 + uk h(wk⊤ zk−1 + bk)

• Inverse Autoregressive Flows [Kingma et al. 2016]

fk(zk−1) = zk−1 ⊙ σk + µk
σk,d = sigmoid(NN(zk−1,<d))        µk,d = NN(zk−1,<d)

(In this case the Jacobian is upper triangular, so the determinant is just the product of the diagonals.)
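The IAF step and its cheap log-determinant can be sketched in a few lines. A minimal version with toy stand-in "networks" (all names mine), verified against a finite-difference Jacobian:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def iaf_step(z, nn_mu, nn_sigma):
    """One inverse-autoregressive-flow-style step z'_d = z_d * sigma_d + mu_d,
    where sigma_d, mu_d depend only on z_{<d}. The Jacobian is triangular,
    so log|det| is just the sum of log sigma_d."""
    out, log_det = [], 0.0
    for d in range(len(z)):
        s = sigmoid(nn_sigma(z[:d]))
        m = nn_mu(z[:d])
        out.append(z[d] * s + m)
        log_det += math.log(s)
    return out, log_det

# Toy "networks" over the prefix z_{<d} (stand-ins for real NNs).
nn_sigma = lambda prefix: sum(prefix)
nn_mu = lambda prefix: 0.1 * len(prefix)

z = [0.3, -0.7, 1.2]
out, log_det = iaf_step(z, nn_mu, nn_sigma)

# Finite-difference Jacobian as an independent check of log|det|.
h = 1e-6
jac = [[(iaf_step([z[j] + (h if j == c else 0.0) for j in range(3)],
                  nn_mu, nn_sigma)[0][r] - out[r]) / h
        for c in range(3)] for r in range(3)]
# Triangular Jacobian: det = product of the diagonal entries.
fd_log_det = sum(math.log(jac[d][d]) for d in range(3))
```

The point of the autoregressive structure is exactly this: the N×N determinant collapses to a product of N diagonal terms, so the density correction is O(N) instead of O(N³).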
Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]

• Flows are a way of tightening the ELBO by making the variational family more flexible.
• Not the only way: can obtain a tighter lower bound on log p(x; θ) by using multiple importance samples.

Consider:

IK = (1/K) Σ_{k=1}^K p(x, z(k); θ) / q(z(k) | x; λ)

where z(1:K) ∼ Π_{k=1}^K q(z(k) | x; λ).

Note that IK is an unbiased estimator of p(x; θ).
Any unbiased estimator of p(x; θ) can be used to obtain a lower bound, using Jensen's inequality:

p(x; θ) = Eq(z(1:K) | x; λ)[IK]
=⇒ log p(x; θ) ≥ Eq(z(1:K) | x; λ)[log IK]
             = Eq(z(1:K) | x; λ)[log (1/K) Σ_{k=1}^K p(x, z(k); θ) / q(z(k) | x; λ)]

However, can also show [Burda et al. 2015]:
• log p(x; θ) ≥ E[log IK] ≥ E[log IK−1]
• limK→∞ E[log IK] = log p(x; θ) under mild conditions
The IWAE objective:

Eq(z(1:K) | x; λ)[log (1/K) Σ_{k=1}^K p(x, z(k); θ) / q(z(k) | x; λ)]

• If z is reparameterizable, can optimize E[log IK] directly.
• Otherwise, need score function gradient estimators [Mnih and Rezende 2016].
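The bound E[log I_K] can be sketched on a toy discrete model where log p(x) is known, which makes the "tighter as K grows, never exceeds log p(x)" property checkable. A minimal estimator (names mine):

```python
import math
import random

def iwae_bound(sample_q, log_p_joint, log_q, K, num_outer):
    """Monte Carlo estimate of E[log I_K], the K-sample IWAE bound."""
    total = 0.0
    for _ in range(num_outer):
        log_w = [log_p_joint(z) - log_q(z)
                 for z in (sample_q() for _ in range(K))]
        # log of the average importance weight, via log-sum-exp.
        m = max(log_w)
        total += m + math.log(sum(math.exp(w - m) for w in log_w) / K)
    return total / num_outer

# Toy: z in {0, 1}, p(x, z) fixed, q(z=1) = 0.5, so log p(x) = log(0.1 + 0.3).
p_joint = {0: 0.1, 1: 0.3}
sample_q = lambda: 1 if random.random() < 0.5 else 0
log_p_joint = lambda z: math.log(p_joint[z])
log_q = lambda z: math.log(0.5)

random.seed(0)
b1 = iwae_bound(sample_q, log_p_joint, log_q, K=1, num_outer=20_000)
b10 = iwae_bound(sample_q, log_p_joint, log_q, K=10, num_outer=20_000)
```

With K = 1 the bound is exactly the ELBO, 0.5 log 0.2 + 0.5 log 0.6; with K = 10 it sits strictly between that and log 0.4.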
Sentence VAE

Amortized variational posterior q(z | x; λ) = N(µ, σ²), computed from the encoder's final state h̃T = RNN(x; ψ).
Sentence VAE Example [Bowman et al. 2016]

[Figure: Sentence VAE model diagram (from Bowman et al. [2016]); the inference network λ produces z, which conditions the generation of x1 . . . xT.]
Issue 1: Posterior Collapse

ELBO(θ, λ) = Eq(z | x; λ)[log p(x, z; θ) / q(z | x; λ)]
           = Eq(z | x; λ)[log p(x | z; θ)] − KL[q(z | x; λ) ∥ p(z)]
             (reconstruction likelihood)     (regularizer)

Model      LL/ELBO    Reconstruction    KL
RNN LM     -329.10    -                 -
RNN VAE    -330.20    -330.19           0.01
• x and z become independent, and p(x, z; θ) reduces to a non-latent-variable language model.
• Chen et al. [2017]: If it's possible to model p⋆(x) without making use of z, then the ELBO optimum is at:

p⋆(x) = p(x | z; θ) = p(x; θ)        q(z | x; λ) = p(z)

KL[q(z | x; λ) ∥ p(z)] = 0
Mitigating Posterior Collapse

Use less powerful likelihood models [Miao et al. 2016; Yang et al. 2017], or "word dropout" [Bowman et al. 2016].

Model         LL/ELBO    Reconstruction    KL
RNN LM        -329.1     -                 -
RNN VAE       -330.2     -330.2            0.01
+ Word Drop   -334.2     -332.8            1.44
CNN VAE       -332.1     -322.1            10.0
Mitigating Posterior Collapse

In practice, often necessary to combine various methods.
Issue 2: Evaluation

• The ELBO always lower bounds log p(x; θ), so we can calculate an upper bound on PPL efficiently.
• When reporting the ELBO, should also separately report

KL[q(z | x; λ) ∥ p(z)]

to give an indication of how much the latent variable is being "used".
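The KL term to report has a closed form for the diagonal-Gaussian posterior used above. A minimal sketch (function names mine) that returns the ELBO together with the KL so collapse (KL ≈ 0) is visible:

```python
import math

def diag_gaussian_kl(mu, logvar):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)], summed over dims:
    0.5 * (sigma^2 + mu^2 - 1 - log sigma^2) per dimension."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, logvar))

def elbo_report(reconstruction_ll, mu, logvar):
    """Returns (ELBO, KL) so the KL can be reported alongside the ELBO."""
    kl = diag_gaussian_kl(mu, logvar)
    return reconstruction_ll - kl, kl
```

When q matches the prior (µ = 0, log σ² = 0) the KL is exactly zero, which is the posterior-collapse signature in the tables above.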
Evaluation
Introduction
Qualitative evaluation
Models
• Evaluate samples from prior/variational posterior.
Variational
Objective • Interpolation in latent space.
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Conclusion
References
(from Bowman et al. [2016])
129/153
Encoder/Decoder with Latent Variables

Given: Source information s = s1, . . . , sM.

Generative process:
• Draw x1:T | s ∼ CRNNLM(θ, enc(s)).
Latent, Per-token Experts [Yang et al. 2018]

Generative process: For t = 1, . . . , T,
• Draw zt | x<t, s ∼ softmax(U ht).
• Draw xt | zt, x<t, s ∼ softmax(W tanh(Q_{zt} ht); θ).

[Figure: per-token latent variables z1(n), . . . , zT(n) each conditioning the corresponding token x1(n), . . . , xT(n), for n = 1, . . . , N.]
Learning: the zt are independent given x<t, so we can marginalize at each time-step (Method 3: Conjugacy):

arg max_θ log p(x | s; θ) = arg max_θ log Π_{t=1}^T Σ_{k=1}^K p(zt = k | s, x<t; θ) p(xt | zt = k, x<t, s; θ)

Test-time:

arg max_{x1:T} Π_{t=1}^T Σ_{k=1}^K p(zt = k | s, x<t; θ) p(xt | zt = k, x<t, s; θ)
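The per-time-step marginalization above can be sketched directly: for each token, sum the expert probabilities weighted by the gating distribution, then sum the logs over time. A minimal version with names of my choosing:

```python
import math

def sequence_log_marginal(expert_weights, expert_token_probs):
    """log p(x | s) for a per-token mixture of softmax experts:
    sum_t log sum_k p(z_t = k | x_<t, s) p(x_t | z_t = k, x_<t, s).

    expert_weights[t][k]     : p(z_t = k | x_<t, s)
    expert_token_probs[t][k] : p(x_t | z_t = k, x_<t, s) for the observed x_t
    """
    total = 0.0
    for weights, probs in zip(expert_weights, expert_token_probs):
        total += math.log(sum(w * p for w, p in zip(weights, probs)))
    return total

# Two time-steps, two experts.
weights = [[0.6, 0.4], [0.5, 0.5]]
probs = [[0.1, 0.3], [0.2, 0.2]]
ll = sequence_log_marginal(weights, probs)
```

Because the sum over k is exact at each step, this objective needs no sampling; in a real model both inputs would come from the RNN state ht.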
Case-Study: Latent, Per-token Experts [Yang et al. 2018]

Dialogue generation results (s is context):

[Table: BLEU precision and recall by model.]
Decoding with an attention mechanism:

xt | x<t, s ∼ softmax(W [ht, Σ_{m=1}^M αt,m enc(s)m])
Copy Attention

Learning: Can maximize the log per-token marginal [Gu et al. 2016], as with per-token experts:

max_θ log p(x1, . . . , xT | s; θ) = max_θ log Π_{t=1}^T Σ_{z′∈{0,1}} p(zt = z′ | s, x<t; θ) p(xt | z′, x<t, s; θ)

Test-time:

arg max_{x1:T} Π_{t=1}^T Σ_{z′∈{0,1}} p(zt = z′ | s, x<t; θ) p(xt | z′, x<t, s; θ)
Attention as a Latent Variable [Deng et al. 2018]

Generative process: For t = 1, . . . , T,
• Set αt to be the attention weights.
• Draw zt | x<t, s ∼ Cat(αt).
• Draw xt | zt, x<t, s ∼ softmax(W [ht, enc(s_{zt})]; θ).
Marginal likelihood under the latent attention model:

p(x1:T | s; θ) = Π_{t=1}^T Σ_{m=1}^M αt,m softmax(W [ht, enc(sm)]; θ)_{xt}

Standard attention likelihood:

p(x1:T | s; θ) = Π_{t=1}^T softmax(W [ht, Σ_{m=1}^M αt,m enc(sm)]; θ)_{xt}
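The difference between the two likelihoods is "average the softmaxes" versus "softmax of the average". A minimal numeric sketch (toy contexts and scoring function of my choosing, standing in for W[ht, ·]):

```python
import math

def softmax(xs):
    m = max(xs)
    w = [math.exp(x - m) for x in xs]
    Z = sum(w)
    return [x / Z for x in w]

def logits(context, vocab_vecs):
    # Stand-in for W[h_t, context]: score each word against the context.
    return [sum(c * v for c, v in zip(context, vec)) for vec in vocab_vecs]

def latent_attention_prob(alpha, contexts, vocab_vecs, word):
    """sum_m alpha_m * softmax(logits(c_m))[word] — marginalize latent z_t."""
    return sum(a * softmax(logits(c, vocab_vecs))[word]
               for a, c in zip(alpha, contexts))

def soft_attention_prob(alpha, contexts, vocab_vecs, word):
    """softmax(logits(sum_m alpha_m c_m))[word] — average contexts first."""
    mixed = [sum(a * c[d] for a, c in zip(alpha, contexts))
             for d in range(len(contexts[0]))]
    return softmax(logits(mixed, vocab_vecs))[word]

contexts = [[2.0, 0.0], [0.0, 2.0]]
vocab_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
alpha = [0.5, 0.5]
p_latent = latent_attention_prob(alpha, contexts, vocab_vecs, 0)
p_soft = soft_attention_prob(alpha, contexts, vocab_vecs, 0)
```

The two quantities coincide exactly when α is one-hot (deterministic attention) and generally differ otherwise, since softmax is nonlinear in the context.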
Learning Strategy #1: Maximize the log marginal via enumeration, as above.

Learning Strategy #2: Maximize the ELBO with AVI:

max_{λ,θ} Eq(zt; λ)[log p(xt | x<t, zt, s)] − KL[q(zt; λ) ∥ p(zt | x<t, s)]

• q(zt | x; λ) approximates p(zt | x1:T, s; θ); implemented with a BLSTM.
• q isn't reparameterizable, so gradients are obtained using REINFORCE + a baseline.
Test-time: Calculate p(xt | x<t, s; θ) by summing out zt.

MT results on IWSLT-2014:

[Table: MT results; see Deng et al. [2018].]
Encoder/Decoder with Structured Latent Variables

At least two EMNLP 2018 papers augment encoder/decoder text generation models with structured latent variables:

1 Lee et al. [2018] generate x1:T by iteratively refining sequences of words z1:T.
2 Wiseman et al. [2018] generate x1:T conditioned on a latent template or plan z1:S.
Summary as a Latent Variable [Miao and Blunsom 2016]

Generative process for a document x = x1, . . . , xT:
• Draw a latent summary z1, . . . , zM ∼ RNNLM(θ)
• Draw x1, . . . , xT | z1:M ∼ CRNNLM(θ, z)

Posterior inference:

p(z1:M | x1:T; θ) = p(summary | document; θ)
Summary as a Latent Variable [Miao and Blunsom 2016]

max_{λ,θ} Eq(z1:M; λ)[log p(x1:T | z1:M; θ)] − KL[q(z1:M; λ) ∥ p(z1:M; θ)]

• q(z1:M; λ) approximates p(z1:M | x1:T; θ); also implemented with encoder/decoder RNNs.
• q(z1:M; λ) is not reparameterizable, so gradients use REINFORCE + baselines.
Summary as a Latent Variable [Miao and Blunsom 2016]

Semi-supervised training: Can also use documents without corresponding summaries in training.
• Train q(z1:M; λ) ≈ p(z1:M | x1:T; θ) with labeled examples.
• Infer the summary z for an unlabeled document with q.
• Use the inferred z to improve the model p(x1:T | z1:M; θ).
• Allows for outperforming strictly supervised models!
Topic Models [Blei et al. 2003]

Generative process: for each document x^(n) = x^(n)_1, ..., x^(n)_T:
• Draw a topic distribution z^(n)_top ∼ Dir(α)
• For t = 1, ..., T:
  • Draw a topic z^(n)_t ∼ Cat(z^(n)_top)
  • Draw a word x^(n)_t ∼ Cat(β_{z^(n)_t})
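The generative process above can be sketched directly as ancestral sampling. A minimal NumPy version, with hypothetical sizes (K topics, V word types, T tokens) and a randomly initialized β:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes: K topics, V word types, T tokens per document
K, V, T = 3, 10, 8
alpha = np.ones(K)                        # Dirichlet prior over topics
beta = rng.dirichlet(np.ones(V), size=K)  # one word distribution per topic

def generate_document():
    """Sample one document from the LDA generative process."""
    z_top = rng.dirichlet(alpha)          # document-level topic distribution
    doc = []
    for _ in range(T):
        z_t = rng.choice(K, p=z_top)      # topic for token t
        x_t = rng.choice(V, p=beta[z_t])  # word drawn from that topic's distribution
        doc.append(x_t)
    return z_top, doc

z_top, doc = generate_document()
```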
Simple, Deep Topic Models [Miao et al. 2017]

Motivation: deep topic models are easy to learn with VI if q(z^(n)_top; λ) is reparameterizable.

Idea: draw z^(n)_top from a transformation of a Gaussian.
• Draw z^(n)_0 ∼ N(µ_0, σ_0²)
• Set z^(n)_top = softmax(W z^(n)_0).
• Use the analogous transformation when drawing from q(z^(n)_top; λ).
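The Gaussian-then-softmax construction is reparameterizable because the randomness enters only through noise that does not depend on the parameters. A minimal NumPy sketch (dimensions and the matrix W are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# hypothetical sizes: Gaussian dimension D, number of topics K
D, K = 5, 3
mu0, sigma0 = np.zeros(D), np.ones(D)
W = rng.normal(size=(K, D))

def sample_z_top(mu, sigma):
    """Reparameterized draw: z0 = mu + sigma * eps, z_top = softmax(W z0).
    eps is independent of (mu, sigma), so gradients flow through both."""
    eps = rng.normal(size=D)
    z0 = mu + sigma * eps
    return softmax(W @ z0)

z_top = sample_z_top(mu0, sigma0)
```

Sampling from q(z^(n)_top; λ) works the same way, with (µ, σ) produced by the inference network instead of the prior parameters.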
Learning Step #2: Use AVI to optimize the resulting ELBO.

max_{λ,θ}  E_{q(z^(n)_top; λ)}[ log p(x^(n) | z^(n)_top; θ) ] − KL[ N(z^(n)_0; λ) ‖ N(z^(n)_0; µ_0, σ_0²) ]
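The KL term above is between two diagonal Gaussians, so it has a closed form; only the expected likelihood term needs Monte Carlo samples. A sketch of the closed-form KL in NumPy (variable names are illustrative):

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL[N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2)]."""
    var_q, var_p = sigma_q**2, sigma_p**2
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p)**2) / var_p - 1.0
    )

mu0, sigma0 = np.zeros(4), np.ones(4)
# KL of a distribution against itself is zero
kl_same = kl_diag_gaussians(mu0, sigma0, mu0, sigma0)
# shifting the variational mean away from the prior increases the KL
kl_shifted = kl_diag_gaussians(np.ones(4), sigma0, mu0, sigma0)
```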
Simple, Deep Topic Models [Miao et al. 2017]

Perplexities on held-out documents, for three datasets:

Model                             MXM    20News   RCV1
OnlineLDA [Hoffman et al. 2010]   342    1015     1058
AVI-LDA [Miao et al. 2017]        272     830      602
1 Introduction
2 Models
3 Variational Objective
4 Inference Strategies
5 Advanced Topics
6 Case Studies
7 Conclusion
Implementation

• Modern toolkits make it easy to implement these models.
• Combine the flexibility of auto-differentiation for optimization (PyTorch) with distribution and VI libraries (Pyro).

In fact, we have implemented this entire tutorial. See the website link:
http://bit.do/lvnlp
References

Charu C. Aggarwal and ChengXiang Zhai. 2012. A Survey of Text Clustering Algorithms. In Mining Text Data, pages 77–128. Springer.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating Sentences from a Continuous Space. In Proceedings of CoNLL.

Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based N-gram Models of Natural Language. Computational Linguistics, 18(4):467–479.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. 2015. Importance Weighted Autoencoders. In Proceedings of ICLR.

Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2017. Variational Lossy Autoencoder. In Proceedings of ICLR.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander M. Rush. 2018. Latent Alignment and Variational Attention. In Proceedings of NIPS.

Adji B. Dieng, Yoon Kim, Alexander M. Rush, and David M. Blei. 2018. Avoiding Latent Variable Collapse with Generative Skip Models. In Proceedings of the ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models.

Adji B. Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2017. TopicRNN: A Recurrent Neural Network With Long-Range Semantic Dependency. In Proceedings of ICLR.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12.

Jason Eisner. 2016. Inside-Outside and Forward-Backward Algorithms Are Just Backprop (Tutorial Paper). In Proceedings of the Workshop on Structured Prediction for NLP.

Ekaterina Garmash and Christof Monz. 2016. Ensemble Learning for Multi-source Neural Machine Translation. In Proceedings of COLING.

Zoubin Ghahramani and Michael I. Jordan. 1996. Factorial Hidden Markov Models. In Proceedings of NIPS.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of ACL.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the Unknown Words. In Proceedings of ACL.

Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2017. Generating Sentences by Editing Prototypes. arXiv:1709.08878.

William P. Headden III, Mark Johnson, and David McClosky. 2009. Improving Unsupervised Dependency Parsing with Richer Contexts and Smoothing. In Proceedings of NAACL.

Marti A. Hearst. 1997. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics, 23(1):33–64.

Matthew Hoffman, Francis R. Bach, and David M. Blei. 2010. Online Learning for Latent Dirichlet Allocation. In Proceedings of NIPS.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward Controlled Generation of Text. In Proceedings of ICML.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive Mixtures of Local Experts. Neural Computation, 3(1):79–87.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of ICLR.

Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, and Alexander M. Rush. 2018. Semi-Amortized Variational Autoencoders. In Proceedings of ICML.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR.

Diederik P. Kingma, Tim Salimans, and Max Welling. 2016. Improving Variational Inference with Inverse Autoregressive Flow. arXiv:1606.04934.

Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proceedings of ICLR.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought Vectors. In Proceedings of NIPS.

Dan Klein and Christopher D. Manning. 2004. Corpus-based Induction of Syntactic Structure: Models of Dependency and Constituency. In Proceedings of ACL.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. In Proceedings of EMNLP.

Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David Crandall, and Dhruv Batra. 2016. Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles. In Proceedings of NIPS.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR.

Bernard Merialdo. 1994. Tagging English Text with a Probabilistic Model. Computational Linguistics, 20(2):155–171.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and Optimizing LSTM Language Models. In Proceedings of ICLR.

Yishu Miao and Phil Blunsom. 2016. Language as a Latent Variable: Discrete Generative Models for Sentence Compression. In Proceedings of EMNLP.

Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering Discrete Latent Topics with Neural Variational Inference. In Proceedings of ICML.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural Variational Inference for Text Processing. In Proceedings of ICML.

Andriy Mnih and Danilo J. Rezende. 2016. Variational Inference for Monte Carlo Objectives. In Proceedings of ICML.

Andriy Mnih and Karol Gregor. 2014. Neural Variational Inference and Learning in Belief Networks. In Proceedings of ICML.

Anjan Nepal and Alexander Yates. 2013. Factorial Hidden Markov Models for Learning Representations of Natural Language. arXiv:1312.6168.

George Papandreou and Alan L. Yuille. 2011. Perturb-and-MAP Random Fields: Using Discrete Optimization to Learn and Sample from Energy Models. In Proceedings of ICCV.

Danilo J. Rezende and Shakir Mohamed. 2015. Variational Inference with Normalizing Flows. In Proceedings of ICML.

Noah A. Smith and Jason Eisner. 2005. Contrastive Estimation: Training Log-Linear Models on Unlabeled Data. In Proceedings of ACL.

Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS.

Ke Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised Neural Hidden Markov Models. In Proceedings of the Workshop on Structured Prediction for NLP.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based Word Alignment in Statistical Translation. In Proceedings of COLING.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2018. Topic Compositional Neural Language Model. In Proceedings of AISTATS.

Peter Willett. 1988. Recent Trends in Hierarchic Document Clustering: A Critical Review. Information Processing & Management, 24(5):577–597.

Ronald J. Williams. 1992. Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8.

Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning Neural Templates for Text Generation. In Proceedings of EMNLP.

Jiacheng Xu and Greg Durrett. 2018. Spherical Latent Spaces for Stable Variational Autoencoders. In Proceedings of EMNLP.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. In Proceedings of ICLR.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved Variational Autoencoders for Text Modeling using Dilated Convolutions. In Proceedings of ICML.