
Tutorial: Deep Latent NLP (bit.do/lvnlp)

Deep Latent-Variable Models of Natural Language

Yoon Kim, Sam Wiseman, Alexander Rush

Tutorial 2018
https://github.com/harvardnlp/DeepLatentNLP
1 Introduction
    Goals
    Background
2 Models
3 Variational Objective
4 Inference Strategies
5 Advanced Topics
6 Case Studies
Goal of Latent-Variable Modeling

Probabilistic models provide a declarative language for specifying prior knowledge
and structural relationships in the context of unknown variables.

Makes it easy to specify:

• Known interactions in the data
• Uncertainty about unknown factors
• Constraints on model properties
Latent-Variable Modeling in NLP

Long and rich history of latent-variable models of natural language.

Major successes include, among many others:

• Statistical alignment for translation
• Document clustering and topic modeling
• Unsupervised part-of-speech tagging and parsing
Goals of Deep Learning

Toolbox of methods for learning rich, non-linear data representations through
numerical optimization.

Makes it easy to fit:

• Highly-flexible predictive models
• Transferable feature representations
• Structurally-aligned network architectures
Deep Learning in NLP

Current dominant paradigm for NLP.

Major successes include, among many others:

• Text classification
• Neural machine translation
• NLU tasks (QA, NLI, etc.)
Tutorial: Deep Latent-Variable Models for NLP

• How should a contemporary ML/NLP researcher reason about latent variables?
• What unique challenges come from modeling text with latent variables?
• What techniques have been explored and shown to be effective in recent papers?

We explore these through the lens of variational inference.
Tutorial Take-Aways

1 A collection of deep latent-variable models for NLP
2 An understanding of a variational objective
3 A toolkit of algorithms for optimization
4 A formal guide to advanced techniques
5 A survey of example applications
6 Code samples and techniques for practical use
Tutorial Non-Objectives

Not covered (for time, not relevance):

• Many classical latent-variable approaches
• Undirected graphical models such as MRFs
• Non-likelihood-based models such as GANs
• Sampling-based inference such as MCMC
• Details of deep learning architectures
What are deep networks?

Deep networks are parameterized non-linear functions; they transform an input z
into features h using parameters π.

Important examples: the multilayer perceptron,

    h = MLP(z; π) = V σ(W z + b) + a,    π = {V, W, a, b},

and the recurrent neural network, which maps a sequence of inputs z_{1:T} into a
sequence of features h_{1:T},

    h_t = RNN(h_{t−1}, z_t; π) = σ(U z_t + V h_{t−1} + b),    π = {V, U, b}.
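As a concrete reference, the following is a minimal PyTorch sketch of these two parameterizations (the layer sizes and the use of PyTorch are illustrative assumptions, not part of the tutorial):

    import torch
    import torch.nn as nn

    class MLP(nn.Module):
        """h = V sigma(W z + b) + a, with parameters pi = {V, W, a, b}."""
        def __init__(self, z_dim, hid_dim, out_dim):
            super().__init__()
            self.Wb = nn.Linear(z_dim, hid_dim)      # computes W z + b
            self.Va = nn.Linear(hid_dim, out_dim)    # computes V (.) + a
        def forward(self, z):
            return self.Va(torch.sigmoid(self.Wb(z)))

    class RNNCell(nn.Module):
        """h_t = sigma(U z_t + V h_{t-1} + b), with parameters pi = {U, V, b}."""
        def __init__(self, z_dim, hid_dim):
            super().__init__()
            self.U = nn.Linear(z_dim, hid_dim)                # U z_t + b
            self.V = nn.Linear(hid_dim, hid_dim, bias=False)  # V h_{t-1}
        def forward(self, z_t, h_prev):
            return torch.sigmoid(self.U(z_t) + self.V(h_prev))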
What are latent variable models?

Latent variable models give us a joint distribution

    p(x, z; θ).

• x is our observed data
• z is a collection of latent variables
• θ are the deterministic parameters of the model, such as the neural network
  parameters
• Data consists of N i.i.d. samples,

    p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(x^{(n)} | z^{(n)}; θ) p(z^{(n)}; θ).
Probabilistic Graphical Models

• A directed PGM shows the conditional independence structure.
• By the chain rule, a latent variable model over observations can be represented as:

  [Graphical model: latent z^{(n)} → observed x^{(n)}, with parameters θ, over a plate of size N]

    p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(x^{(n)} | z^{(n)}; θ) p(z^{(n)}; θ)

• Specific models may factor further.
Posterior Inference

For models p(x, z; θ), we'll be interested in the posterior over latent variables z:

    p(z | x; θ) = p(x, z; θ) / p(x; θ).

Why?

• z will often represent interesting information about our data (e.g., the
  cluster x^{(n)} lives in, or how similar x^{(n)} and x^{(n+1)} are).
• Learning the parameters θ of the model often requires calculating posteriors
  as a subroutine.
• Intuition: if I know a likely z^{(n)} for x^{(n)}, I can learn by maximizing
  p(x^{(n)} | z^{(n)}; θ).
Problem Statement: Two Views

Deep models & LV models are naturally complementary:

• Rich function approximators with modular parts.
• Declarative methods for specifying model constraints.

Deep models & LV models are frustratingly incompatible:

• Deep networks make posterior inference intractable.
• Latent variable objectives complicate backpropagation.
1 Introduction
2 Models
    Discrete Models
    Continuous Models
    Structured Models
3 Variational Objective
4 Inference Strategies
5 Advanced Topics
6 Case Studies
A Language Model

Our goal is to model a sentence, x_1 . . . x_T.

Context: RNN language models are remarkable at this task,

    x_{1:T} ∼ RNNLM(x_{1:T}; θ).

Defined as,

    p(x_{1:T}) = ∏_{t=1}^T p(x_t | x_{<t}) = ∏_{t=1}^T softmax(W h_t)_{x_t}
    where h_t = RNN(h_{t−1}, x_{t−1}; θ)

[Graphical model: observed x_1^{(n)}, . . . , x_T^{(n)} with parameters θ, over a plate of size N]
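Below is a minimal sketch of such an RNN language model in PyTorch (an LSTM stands in for the RNN; the hyperparameters are placeholder assumptions):

    import torch
    import torch.nn as nn

    class RNNLM(nn.Module):
        """p(x_{1:T}) = prod_t softmax(W h_t)_{x_t}, h_t = RNN(h_{t-1}, x_{t-1})."""
        def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.proj = nn.Linear(hid_dim, vocab_size)   # the matrix W

        def log_prob(self, x):
            # x: [batch, T] token ids; predict each x_t from the prefix x_{<t}
            inp, tgt = x[:, :-1], x[:, 1:]
            h, _ = self.rnn(self.emb(inp))
            logp = torch.log_softmax(self.proj(h), dim=-1)
            return logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum(-1)   # [batch]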
A Collection of Model Archetypes

Focus: semi-supervised or unsupervised learning, i.e. don't just learn the
probabilities, but the process. Range of choices in selecting z:

1 Discrete LVs z (clustering)
2 Continuous LVs z (dimensionality reduction)
3 Structured LVs z (structured learning)
Model 1: Discrete Clustering

Inference process:

    "In an old house in Paris that was covered with vines lived twelve little
    girls in two straight lines."  →  Cluster 23

Discrete latent variable models induce a clustering over sentences x^{(n)}.

Example uses:

• Document/sentence clustering [Willett 1988; Aggarwal and Zhai 2012].
• Mixture-of-experts text generation models [Jacobs et al. 1991; Garmash and Monz
  2016; Lee et al. 2016].
Model 1: Discrete - Mixture of Categoricals

Generative process:

1 Draw cluster z ∈ {1, . . . , K} from a categorical with param µ.
2 Draw T words x_t from a categorical with word distribution π_z.

Parameters: θ = {µ ∈ ∆^{K−1}, K × V stochastic matrix π}

Gives rise to the "Naive Bayes" distribution:

    p(x, z; θ) = p(z; µ) × p(x | z; π) = µ_z × ∏_{t=1}^T Cat(x_t; π_z)
               = µ_z × ∏_{t=1}^T π_{z, x_t}
Model 1: Graphical Model View

[Graphical model: µ → z^{(n)} → x_1^{(n)}, . . . , x_T^{(n)} ← π, over a plate of size N]

    p(x^{(1:N)}, z^{(1:N)}; µ, π) = ∏_{n=1}^N p(z^{(n)}; µ) × p(x^{(n)} | z^{(n)}; π)
                                  = ∏_{n=1}^N µ_{z^{(n)}} × ∏_{t=1}^T π_{z^{(n)}, x_t^{(n)}}
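For concreteness, a small sketch of this Naive Bayes log joint for a single sentence in PyTorch (K and V are placeholder values; the parameters are randomly initialized here only for illustration):

    import torch

    K, V = 10, 5000   # number of clusters, vocabulary size (placeholders)
    log_mu = torch.log_softmax(torch.randn(K), dim=-1)      # log mu,  shape K
    log_pi = torch.log_softmax(torch.randn(K, V), dim=-1)   # log pi,  shape K x V

    def nb_log_joint(x, z):
        """log p(x, z; theta) = log mu_z + sum_t log pi_{z, x_t}, for token ids x."""
        x = torch.as_tensor(x)
        return log_mu[z] + log_pi[z, x].sum()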
Deep Model 1: Discrete - Mixture of RNNs

Generative process:

1 Draw cluster z ∈ {1, . . . , K} from a categorical.
2 Draw words x_{1:T} from an RNNLM with parameters π_z.

    p(x, z; θ) = µ_z × RNNLM(x_{1:T}; π_z)

[Graphical model: µ → z^{(n)} → x_1^{(n)}, . . . , x_T^{(n)} ← π, over a plate of size N]
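A corresponding sketch of the mixture-of-RNNs log joint, reusing the RNNLM sketch above with one RNNLM per cluster (a hypothetical arrangement chosen only for illustration):

    # rnnlms: a list of K RNNLM modules (one per cluster); log_mu as in the sketch above.
    def mixture_rnnlm_log_joint(x, z, rnnlms, log_mu):
        """log p(x, z; theta) = log mu_z + log RNNLM(x_{1:T}; pi_z)."""
        return log_mu[z] + rnnlms[z].log_prob(x)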
Difference Between Models

• Dependence structure:
    • Mixture of Categoricals: x_t independent of other x_j given z.
    • Mixture of RNNs: x_t fully dependent.
  Interesting question: how will this affect the learned latent space?

• Number of parameters:
    • Mixture of Categoricals: K × V.
    • Mixture of RNNs: K × d² + V × d, for an RNN with d hidden dims.
Posterior Inference

For both discrete models, can apply Bayes' rule:

    p(z | x; θ) = p(z) × p(x | z) / p(x)
                = p(z) × p(x | z) / Σ_{k=1}^K p(z=k) × p(x | z=k)

• For mixture of categoricals, posterior uses word counts under each π_k.
• For mixture of RNNs, posterior requires running the RNN over x for each k.
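Since z takes only K values, the normalizer is a sum of K terms, and the posterior can be computed by brute-force enumeration. A sketch (the log_joint argument can be either mixture's log p(x, z; θ), e.g. the nb_log_joint sketch above):

    import torch

    def discrete_log_posterior(x, log_joint, K):
        """log p(z | x; theta) for a discrete mixture, by enumerating all K clusters."""
        log_pxz = torch.stack([log_joint(x, k) for k in range(K)])   # log p(x, z=k)
        return log_pxz - torch.logsumexp(log_pxz, dim=0)             # subtract log p(x)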
Model 2: Continuous / Dimensionality Reduction

Inference process:

    "In an old house in Paris that was covered with vines lived twelve little
    girls in two straight lines."  →  [continuous vector representation]

Find a lower-dimensional, well-behaved continuous representation of a sentence.
Latent variables in R^d make distance/similarity easy. Examples:

• Recent work in text generation assumes a latent vector per sentence [Bowman
  et al. 2016; Yang et al. 2017; Hu et al. 2017].
• Certain sentence embeddings (e.g., Skip-Thought vectors [Kiros et al. 2015])
  can be interpreted in this way.
Model 2: Continuous "Mixture"

Generative process:

1 Draw continuous latent variable z from a Normal with param µ.
2 For each t, draw word x_t from a categorical with param softmax(W z).

Parameters: θ = {µ ∈ R^d, π},  π = {W ∈ R^{V×d}}

Intuition: µ is a global distribution; z captures the local word distribution of the
sentence.
Graphical Model View

[Graphical model: µ → z^{(n)} → x_1^{(n)}, . . . , x_T^{(n)} ← π, over a plate of size N]

Gives rise to the joint distribution:

    p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(z^{(n)}; µ) × p(x^{(n)} | z^{(n)}; π)
Deep Model 2: Continuous "Mixture" of RNNs

Generative process:

1 Draw latent variable z ∼ N(µ, I).
2 Draw each token x_t from a conditional RNNLM.

The RNN is also conditioned on the latent z,

    p(x, z; π, µ, I) = p(z; µ, I) × p(x | z; π)
                     = N(z; µ, I) × CRNNLM(x_{1:T}; π, z)

where

    CRNNLM(x_{1:T}; π, z) = ∏_{t=1}^T softmax(W h_t)_{x_t}
    h_t = RNN(h_{t−1}, [x_{t−1}; z]; π)
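A sketch of the conditional decoder CRNNLM in PyTorch, which simply concatenates z to each word embedding before the RNN (dimensions and the LSTM choice are placeholder assumptions):

    import torch
    import torch.nn as nn

    class CRNNLM(nn.Module):
        """p(x | z) = prod_t softmax(W h_t)_{x_t}, h_t = RNN(h_{t-1}, [x_{t-1}; z])."""
        def __init__(self, vocab_size, z_dim=32, emb_dim=128, hid_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim + z_dim, hid_dim, batch_first=True)
            self.proj = nn.Linear(hid_dim, vocab_size)

        def log_prob(self, x, z):
            # x: [batch, T] token ids, z: [batch, z_dim]
            inp, tgt = x[:, :-1], x[:, 1:]
            zrep = z.unsqueeze(1).expand(-1, inp.size(1), -1)   # tile z over time
            h, _ = self.rnn(torch.cat([self.emb(inp), zrep], dim=-1))
            logp = torch.log_softmax(self.proj(h), dim=-1)
            return logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum(-1)   # log p(x | z)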
Graphical Model View

[Graphical model: µ → z^{(n)} → x_1^{(n)}, . . . , x_T^{(n)} ← π, over a plate of size N]
Posterior Inference

For continuous models, Bayes' rule is harder to compute,

    p(z | x; θ) = p(z; µ) × p(x | z; π) / ∫_z p(z; µ) × p(x | z; π) dz

• Shallow and deep Model 2 variants mirror the Model 1 variants exactly, but
  with continuous z.
• The integral is intractable (in general) for both shallow and deep variants.
Model 3: Structure Learning

Inference process:

    "In an old house in Paris that was covered with vines lived twelve little
    girls in two straight lines."  →  [tag / parse / alignment structure]

Structured latent variable models are used to infer unannotated structure:

• Unsupervised POS tagging [Brown et al. 1992; Merialdo 1994; Smith and Eisner 2005]
• Unsupervised dependency parsing [Klein and Manning 2004; Headden III et al. 2009]

Or when structure is useful for interpreting our data:

• Segmentation of documents into topical passages [Hearst 1997]
• Alignment [Vogel et al. 1996]
Model 3: Structured - Hidden Markov Model

Generative process:

1 For each t, draw z_t ∈ {1, . . . , K} from a categorical with param µ_{z_{t−1}}.
2 Draw observed token x_t from a categorical with param π_{z_t}.

Parameters: θ = {K × K stochastic matrix µ, K × V stochastic matrix π}

Gives rise to the joint distribution:

    p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t−1}; µ_{z_{t−1}}) × ∏_{t=1}^T p(x_t | z_t; π_{z_t})
               = ∏_{t=1}^T µ_{z_{t−1}, z_t} × ∏_{t=1}^T π_{z_t, x_t}
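A sketch of the HMM log joint for one sequence (K and V are placeholders; for brevity the initial-state term is absorbed into the first emission rather than modeled separately, which is an assumption of this sketch):

    import torch

    K, V = 5, 1000   # number of states, vocabulary size (placeholders)
    log_mu = torch.log_softmax(torch.randn(K, K), dim=-1)   # transitions, rows sum to 1
    log_pi = torch.log_softmax(torch.randn(K, V), dim=-1)   # emissions,  rows sum to 1

    def hmm_log_joint(x, z):
        """log p(x, z) = sum_t log mu_{z_{t-1}, z_t} + sum_t log pi_{z_t, x_t}."""
        total = log_pi[z[0], x[0]]                  # emission at t = 1
        for t in range(1, len(x)):
            total = total + log_mu[z[t - 1], z[t]] + log_pi[z[t], x[t]]
        return total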
Graphical Model View

[Graphical model: chain z_1 → z_2 → z_3 → z_4 with emissions x_1, . . . , x_4; parameters µ (transitions) and π (emissions); over a plate of size N]

    p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t−1}; µ_{z_{t−1}}) × ∏_{t=1}^T p(x_t | z_t; π_{z_t})
               = ∏_{t=1}^T µ_{z_{t−1}, z_t} × ∏_{t=1}^T π_{z_t, x_t}
Further Extension: Factorial HMM

[Graphical model: L parallel chains z_{l,1} → . . . → z_{l,4} (l = 1, . . . , L), all emitting into x_1, . . . , x_4; over a plate of size N]

    p(x, z; θ) = ∏_{l=1}^L ∏_{t=1}^T p(z_{l,t} | z_{l,t−1}) × ∏_{t=1}^T p(x_t | z_{1:L,t})
Deep Model 3: Deep HMM

Parameterize transition and emission distributions with neural networks (cf. Tran
et al. [2016]):

• Model the transition distribution as

    p(z_t | z_{t−1}) = softmax(MLP(z_{t−1}; µ))

• Model the emission distribution as

    p(x_t | z_t) = softmax(MLP(z_t; π))

Note: K × K transition parameters for the standard HMM vs. O(K × d + d²) for the
deep version.
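A sketch of this neural parameterization, representing each discrete state by a learned embedding (the embedding and MLP sizes are assumptions made for illustration):

    import torch
    import torch.nn as nn

    K, V, d = 5, 1000, 64   # states, vocabulary, hidden size (placeholders)
    state_emb = nn.Embedding(K, d)   # one vector per latent state
    trans_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, K))
    emit_mlp  = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, V))

    def log_transition(z_prev):
        """log p(z_t | z_{t-1}) = log softmax(MLP(z_{t-1}; mu)); z_prev is a state index tensor."""
        return torch.log_softmax(trans_mlp(state_emb(z_prev)), dim=-1)

    def log_emission(z_t):
        """log p(x_t | z_t) = log softmax(MLP(z_t; pi)); returns a length-V log distribution."""
        return torch.log_softmax(emit_mlp(state_emb(z_t)), dim=-1)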
Posterior Inference

For structured models, Bayes' rule may be tractable,

    p(z | x; θ) = p(z; µ) × p(x | z; π) / Σ_{z′} p(z′; µ) × p(x | z′; π)

• Unlike previous models, z contains interdependent "parts."
• For both shallow and deep Model 3 variants, it's possible to calculate
  p(x; θ) exactly, with a dynamic program.
• For some structured models, like the Factorial HMM, the dynamic program may
  still be intractable.
1 Introduction
2 Models
3 Variational Objective
    Maximum Likelihood
    ELBO
4 Inference Strategies
5 Advanced Topics
6 Case Studies
Learning with Maximum Likelihood

Objective: Find model parameters θ that maximize the likelihood of the data,

    θ* = arg max_θ Σ_{n=1}^N log p(x^{(n)}; θ)
Learning Deep Models

    L(θ) = Σ_{n=1}^N log p(x^{(n)}; θ)

[Graphical model: fully observed x with parameters θ, over a plate of size N]

• The dominant framework is gradient-based optimization:

    θ^{(i)} = θ^{(i−1)} + η ∇_θ L(θ)

• ∇_θ L(θ) calculated with backpropagation.
• Tactics: mini-batch based training, adaptive learning rates [Duchi et al. 2011;
  Kingma and Ba 2015].
Learning Deep Latent-Variable Models: Marginalization

Likelihood requires summing out the latent variables,

    p(x; θ) = Σ_{z∈Z} p(x, z; θ)    (= ∫ p(x, z; θ) dz if continuous z)

In general, hard to optimize the log-likelihood for the training set,

    L(θ) = Σ_{n=1}^N log Σ_{z∈Z} p(x^{(n)}, z; θ)

[Graphical model: latent z^{(n)} → observed x^{(n)}, with parameters θ, over a plate of size N]
Variational Inference

High-level: decompose the objective into a lower bound and a gap,

    L(θ) = LB(θ, λ) + GAP(θ, λ)    for some λ

[Figure: L(θ) split into a lower bound LB(θ, λ) and a gap GAP(θ, λ)]

Provides a framework for deriving a rich set of optimization algorithms.
Marginal Likelihood: Variational Decomposition

For any¹ distribution q(z | x; λ) over z,

    L(θ) = E_q[log (p(x, z; θ) / q(z | x; λ))] + KL[q(z | x; λ) ‖ p(z | x; θ)]

where the first term is the ELBO (evidence lower bound) and the second term is the
posterior gap.

Since the KL is always non-negative, L(θ) ≥ ELBO(θ, λ).

¹ Technical condition: supp(q(z)) ⊂ supp(p(z | x; θ))
Evidence Lower Bound: Proof

    log p(x; θ) = E_q[log p(x)]                                      (expectation over z)
                = E_q[log (p(x, z) / p(z | x))]                      (mult/div by p(z | x), combine numerator)
                = E_q[log ((p(x, z) / q(z | x)) · (q(z | x) / p(z | x)))]    (mult/div by q(z | x))
                = E_q[log (p(x, z) / q(z | x))] + E_q[log (q(z | x) / p(z | x))]    (split log)
                = E_q[log (p(x, z; θ) / q(z | x; λ))] + KL[q(z | x; λ) ‖ p(z | x; θ)]
Evidence Lower Bound over Observations

    ELBO(θ, λ; x) = E_{q(z)}[log (p(x, z; θ) / q(z | x; λ))]

• The ELBO is a function of the generative model parameters, θ, and the
  variational parameters, λ.

    Σ_{n=1}^N log p(x^{(n)}; θ) ≥ Σ_{n=1}^N ELBO(θ, λ; x^{(n)})
                                = Σ_{n=1}^N E_{q(z | x^{(n)}; λ)}[log (p(x^{(n)}, z; θ) / q(z | x^{(n)}; λ))]
                                = ELBO(θ, λ; x^{(1:N)}) = ELBO(θ, λ)
Setup: Selecting Variational Family

• Just as with p and θ, we can select any form of q and λ that satisfies the ELBO
  conditions.
• Different choices of q will lead to different algorithms.
• We will explore several forms of q:
    • Posterior
    • Point estimate / MAP
    • Amortized
    • Mean field (later)
Example Family: Full Posterior Form

[Figure: for each data point, a local variational distribution q(z^{(n)} | x^{(n)}; λ^{(n)}) is fit to the model posterior by minimizing KL(q(z | x) ‖ p(z | x))]

λ = [λ^{(1)}, . . . , λ^{(N)}] is a concatenation of local variational parameters λ^{(n)}, e.g.

    q(z^{(n)} | x^{(n)}; λ) = q(z^{(n)} | x^{(n)}; λ^{(n)}) = N(λ^{(n)}, 1)
Example Family: Amortized Parameterization [Kingma and Welling 2014]

[Figure: a single inference network with parameters λ maps each x^{(n)} to its local variational distribution, which is fit to the model posterior by minimizing KL(q(z | x) ‖ p(z | x))]

λ parameterizes a global network (encoder/inference network) that is run over
x^{(n)} to produce the local variational distribution, e.g.

    q(z^{(n)} | x^{(n)}; λ) = N(µ^{(n)}, 1),    µ^{(n)} = enc(x^{(n)}; λ)
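A sketch of such an encoder/inference network for sentences, producing a diagonal Gaussian q(z | x; λ) (the LSTM encoder and all sizes are illustrative assumptions):

    import torch
    import torch.nn as nn

    class GaussianEncoder(nn.Module):
        """Amortized q(z | x; lambda) = N(mu(x), diag(sigma(x)^2))."""
        def __init__(self, vocab_size, z_dim=32, emb_dim=128, hid_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.to_mu = nn.Linear(hid_dim, z_dim)
            self.to_logvar = nn.Linear(hid_dim, z_dim)

        def forward(self, x):
            # x: [batch, T]; summarize the sentence with the final hidden state
            _, (h, _) = self.rnn(self.emb(x))
            h = h[-1]                               # [batch, hid_dim]
            return self.to_mu(h), self.to_logvar(h)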
1 Introduction
2 Models
3 Variational Objective
4 Inference Strategies
    Exact Gradient
    Sampling
    Conjugacy
5 Advanced Topics
6 Case Studies
Maximizing the Evidence Lower Bound

Central quantity of interest: almost all methods are maximizing the ELBO,

    arg max_{θ,λ} ELBO(θ, λ)

Aggregate ELBO objective,

    arg max_{θ,λ} ELBO(θ, λ) = arg max_{θ,λ} Σ_{n=1}^N ELBO(θ, λ; x^{(n)})
                             = arg max_{θ,λ} Σ_{n=1}^N E_q[log (p(x^{(n)}, z^{(n)}; θ) / q(z^{(n)} | x^{(n)}; λ))]
Maximizing ELBO: Model Parameters

    arg max_θ E_q[log (p(x, z; θ) / q(z | x; λ))] = arg max_θ E_q[log p(x, z; θ)]

[Figure: gradient steps on θ increase the ELBO]

Intuition: a maximum likelihood problem under variables drawn from q(z | x; λ).
Model Estimation: Gradient Ascent on Model Parameters

Easy: gradient with respect to θ,

    ∇_θ ELBO(θ, λ; x) = ∇_θ E_q[log p(x, z; θ)]
                      = E_q[∇_θ log p(x, z; θ)]

• Since q does not depend on θ, ∇ moves inside the expectation.
• Estimate with samples from q. The term log p(x, z; θ) is easy to evaluate. (In
  practice a single sample is often sufficient.)
• In special cases, can exactly evaluate the expectation.
Maximizing ELBO: Variational Distribution

    arg max_λ ELBO(θ, λ) = arg max_λ log p(x; θ) − KL[q(z | x; λ) ‖ p(z | x; θ)]
                         = arg min_λ KL[q(z | x; λ) ‖ p(z | x; θ)]

[Figure: optimizing λ shrinks the posterior gap between the ELBO and L(θ)]

Intuition: q should approximate the posterior p(z | x). However, this may be difficult if
q or p is a deep model.
Model Inference: Gradient Ascent on λ?

Hard: gradient with respect to λ,

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log (p(x, z; θ) / q(z | x; λ))]
                      ≠ E_q[∇_λ log (p(x, z; θ) / q(z | x; λ))]

• Cannot naively move ∇ inside the expectation, since q depends on λ.
• This section: inference in practice:
    1 Exact gradient
    2 Sampling: score function, reparameterization
    3 Conjugacy: closed-form, coordinate ascent
Strategy 1: Exact Gradient

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q(z | x; λ)}[log (p(x, z; θ) / q(z | x; λ))]
                      = Σ_{z∈Z} ∇_λ (q(z | x; λ) log (p(x, z; θ) / q(z | x; λ)))

• Naive enumeration: linear in |Z|.
• Depending on the structure of q and p, potentially faster with dynamic
  programming.
• Applicable mainly to Models 1 and 3 (Discrete and Structured), or Model 2
  with a point estimate.
Example: Model 1 - Naive Bayes

[Figure: an inference network enc(x; λ) producing q(z | x), alongside the Naive Bayes model of x given z]

Let q(z | x; λ) = Cat(ν) where ν = enc(x; λ)

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q(z | x; λ)}[log (p(x, z; θ) / q(z | x; λ))]
                      = ∇_λ Σ_{z∈Z} q(z | x; λ) log (p(x, z; θ) / q(z | x; λ))
                      = ∇_λ Σ_{z∈Z} ν_z log (p(x, z; θ) / ν_z)
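A sketch of this exact-gradient computation: enumerate the K values of z, form the ELBO, and let automatic differentiation produce exact gradients with respect to both θ (through log_joint) and λ (through ν). The encoder enc and the nb_log_joint below refer to the earlier sketches and are assumptions, not the tutorial's own code:

    import torch

    def exact_elbo(x, log_joint, nu, K):
        """ELBO = sum_z nu_z (log p(x, z; theta) - log nu_z), by enumeration over z."""
        log_pxz = torch.stack([log_joint(x, k) for k in range(K)])   # log p(x, z=k)
        return (nu * (log_pxz - torch.log(nu))).sum()

    # e.g. nu = torch.softmax(enc(x), dim=-1) for an encoder with parameters lambda;
    # calling (-exact_elbo(x, nb_log_joint, nu, K)).backward() gives exact gradients.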
Strategy 2: Sampling

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log (p(x, z; θ) / q(z | x; λ))]
                      = ∇_λ E_q[log p(x, z; θ)] − ∇_λ E_q[log q(z | x; λ)]

• How can we approximate this gradient with sampling? The naive algorithm fails
  to provide a non-zero gradient:

    z^{(1)}, . . . , z^{(J)} ∼ q(z | x; λ)

    (1/J) Σ_{j=1}^J ∇_λ log p(x, z^{(j)}; θ) = 0

• Manipulate the expression so we can move ∇_λ inside E_q before sampling.
Strategy 2a: Sampling — Score Function Gradient Estimator

First term. Use the basic identity:

    ∇ log q = ∇q / q    ⇒    ∇q = q ∇ log q

Policy-gradient style training [Williams 1992]:

    ∇_λ E_q[log p(x, z; θ)] = Σ_z ∇_λ q(z | x; λ) log p(x, z; θ)
                            = Σ_z q(z | x; λ) ∇_λ log q(z | x; λ) log p(x, z; θ)
                            = E_q[log p(x, z; θ) ∇_λ log q(z | x; λ)]
Strategy 2a: Sampling — Score Function Gradient Estimator

Second term. Need an additional identity:

    Σ ∇q = ∇ Σ q = ∇1 = 0

    ∇_λ E_q[log q(z | x; λ)] = Σ_z ∇_λ (q(z | x; λ) log q(z | x; λ))
                             = Σ_z log q(z | x; λ) q(z | x; λ) ∇_λ log q(z | x; λ) + Σ_z ∇_λ q(z | x; λ)
                             = E_q[log q(z | x; λ) ∇_λ log q(z | x; λ)]

(the product rule plus ∇q = q ∇ log q gives the second line, and the last sum
vanishes since Σ_z ∇_λ q(z | x; λ) = ∇_λ 1 = 0).
Strategy 2a: Sampling — Score Function Gradient Estimator

Putting these together,

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log (p(x, z; θ) / q(z | x; λ))]
                      = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)]
                      = E_q[R_{θ,λ}(z) ∇_λ log q(z | x; λ)]
Strategy 2a: Sampling — Score Function Gradient Estimator

Estimate with samples,

    z^{(1)}, . . . , z^{(J)} ∼ q(z | x; λ)

    E_q[R_{θ,λ}(z) ∇_λ log q(z | x; λ)] ≈ (1/J) Σ_{j=1}^J R_{θ,λ}(z^{(j)}) ∇_λ log q(z^{(j)} | x; λ)

Intuition: if a sample z^{(j)} has high reward R_{θ,λ}(z^{(j)}), increase the probability
of z^{(j)} by moving along the gradient ∇_λ log q(z^{(j)} | x; λ).
Strategy 2a: Sampling — Score Function Gradient Estimator

• Essentially reinforcement learning with reward R_{θ,λ}(z).
• The score function gradient is generally applicable regardless of what
  distribution q takes (only need to evaluate ∇_λ log q).
• This generality comes at a cost, since the reward is "black-box": unbiased
  estimator, but high variance.
• In practice, need a variance-reducing control variate B. (More on this later.)
Example: Model 1 - Naive Bayes

[Figure: an inference network enc(x; λ) producing q(z | x), alongside the Naive Bayes model of x given z]

Let q(z | x; λ) = Cat(ν) where ν = enc(x; λ)

Sample z^{(1)}, . . . , z^{(J)} ∼ q(z | x; λ)

    ∇_λ ELBO(θ, λ; x) = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)]
                      ≈ (1/J) Σ_{j=1}^J log (p(x, z^{(j)}; θ) / ν_{z^{(j)}}) ∇_λ log ν_{z^{(j)}}

Computational complexity: O(J) vs O(|Z|)
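A sketch of this score function estimator: sample z^{(j)} from q, treat the reward as a constant (detach), and accumulate a surrogate whose gradient is the estimator above. The log_joint and ν arguments follow the earlier sketches and are assumptions:

    import torch

    def score_function_surrogate(x, log_joint, nu, J=5):
        """Surrogate loss whose gradient w.r.t. lambda is the score function estimator."""
        q = torch.distributions.Categorical(probs=nu)
        surrogate = 0.0
        for _ in range(J):
            z = q.sample()                                               # z^(j) ~ q(z | x; lambda)
            reward = (log_joint(x, int(z)) - torch.log(nu[z])).detach()  # R(z^(j)), held constant
            surrogate = surrogate + reward * q.log_prob(z) / J           # R(z^(j)) log q(z^(j) | x)
        return surrogate   # call (-surrogate).backward() for a gradient-ascent step on lambda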


Strategy 2b: Sampling — Reparameterization

Suppose we can sample from q by applying a deterministic, differentiable
transformation g to a base noise density,

    ε ∼ U    z = g(ε, λ)

Gradient calculation (first term):

    ∇_λ E_{z∼q(z | x; λ)}[log p(x, z; θ)] = ∇_λ E_{ε∼U}[log p(x, g(ε, λ); θ)]
                                          = E_{ε∼U}[∇_λ log p(x, g(ε, λ); θ)]
                                          ≈ (1/J) Σ_{j=1}^J ∇_λ log p(x, g(ε^{(j)}, λ); θ)

where ε^{(1)}, . . . , ε^{(J)} ∼ U
Strategy 2b: Sampling — Reparameterization

• Unbiased, like the score function gradient estimator, but empirically lower
  variance.
• In practice, a single sample is often sufficient.
• Cannot be used out-of-the-box for discrete z.
Tutorial:
Deep Latent NLP Strategy 2: Continuous Latent Variable RNN
(bit.do/lvnlp)


Introduction

Models
λ z z
Variational
Objective
x1 ... xT
Inference
Strategies
x
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References
Choose variational family to be an amortized diagonal Gaussian

q(z | x; λ) = N (µ, σ 2 )        µ, σ 2 = enc(x; λ)

Then we can sample from q(z | x; λ) by

ε ∼ N (0, I)        z = µ + σ ⊙ ε
81/153
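A minimal sketch of this reparameterized sampler in PyTorch (enc is assumed to return µ and log σ; log_joint returning log p(x, z; θ) is a hypothetical helper):

import torch

def reparameterized_elbo_term(x, enc, log_joint):
    mu, log_sigma = enc(x)
    eps = torch.randn_like(mu)            # eps ~ N(0, I)
    z = mu + log_sigma.exp() * eps        # z = mu + sigma * eps, differentiable in lambda
    return log_joint(x, z)                # backward() propagates through z into mu and sigma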
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Strategy 2b: Sampling — Reparameterization

Introduction
Models
Variational
Objective
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References

(Recall Rθ,λ (z) = log (p(x, z; θ) / q(z | x; λ)))

• Score function:

∇λ ELBO(θ, λ; x) = E_{z∼q} [Rθ,λ (z) ∇λ log q(z | x; λ)]

• Reparameterization:

∇λ ELBO(θ, λ; x) = E_{ε∼N (0,I)} [∇λ Rθ,λ (g(ε, λ; x))]

where g(ε, λ; x) = µ + σ ⊙ ε.

Informally, reparameterization gradients differentiate through Rθ,λ (·) and thus
have "more knowledge" about the structure of the objective function.
82/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Exact Gradient
Sampling Exact Gradient
Conjugacy
Sampling
Advanced Topics
Conjugacy
Case Studies

Conclusion
5 Advanced Topics
References

6 Case Studies
83/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Strategy 3: Conjugacy
Introduction

Models
For certain choices for p and q, we can compute parts of
Variational
Objective
arg max ELBO(θ, λ; x)
Inference λ
Strategies
Exact Gradient
exactly in closed-form.
Sampling
Conjugacy

Advanced Topics
Recall that
Case Studies
arg max ELBO(θ, λ; x) = arg min KL[q(z | x; λ)kp(z | x; θ)]
Conclusion λ λ
References

84/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Strategy 3a: Conjugacy — Tractable Posterior Inference

Introduction Suppose we can tractably calculate p(z | x; θ). Then KL[q(z | x; λ)kp(z | x; θ)]
Models is minimized when,
Variational q(z | x; λ) = p(z | x; θ)
Objective

Inference
Strategies • The E-step in Expectation Maximization algorithm [Dempster et al. 1977]
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
[Figure: log marginal likelihood L vs. ELBO as a function of λ; the gap between them is the posterior gap]
Conclusion

References

85/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Example: Model 1 - Naive Bayes

Introduction
λ z z
Models

Variational x1 ... xT
Objective x
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References

p(z | x; θ) = p(x, z; θ) / Σ_{z′=1..K} p(x, z′; θ)

So λ is given by the parameters of the categorical distribution, i.e.

λ = [p(z = 1 | x; θ), . . . , p(z = K | x; θ)]

86/153
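In code, the exact posterior is just a normalization of the joint. A sketch (log_joint_all is a hypothetical helper returning the length-K vector of log p(x, z = k; θ)):

import torch

def exact_posterior(x, log_joint_all):
    log_p_xz = log_joint_all(x)             # shape (K,): log p(x, z = k; theta)
    return torch.softmax(log_p_xz, dim=-1)  # p(z = k | x; theta), computed in a numerically stable way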
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Example: Model 3 — HMM

Introduction
µ
Models

Variational
z1 z2 z3 z4
Objective

Inference
Strategies
Exact Gradient
Sampling x1 x2 x3 x4
Conjugacy

Advanced Topics
Case Studies
Conclusion
References
[Graphical model: HMM with transition parameters µ, emission parameters π, plate over N sequences]

p(x, z; θ) = p(z0) Π_{t=1..T} p(zt | zt−1 ; µ) p(xt | zt ; π)

87/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Example: Model 3 — HMM


Models

Variational
Run forward/backward dynamic programming to calculate posterior marginals,
Objective

Inference p(zt , zt+1 | x; θ)


Strategies
Exact Gradient
Sampling
variational parameters λ ∈ R^{T·K²} store edge marginals. These are enough to
Conjugacy
calculate
Advanced Topics
q(z; λ) = p(z | x; θ)
Case Studies

Conclusion (i.e. the exact posterior) over any sequence z.


References

88/153
Tutorial:
Deep Latent NLP Connection: Gradient Ascent on Log Marginal Likelihood
(bit.do/lvnlp)

Why not perform gradient ascent directly on log marginal likelihood?


Introduction
Models
Variational
Objective
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References

log p(x; θ) = log Σ_z p(x, z; θ)

Same as optimizing ELBO with posterior inference (i.e. EM). Gradients of model
parameters given by (where q(z | x; λ) = p(z | x; θ)):

∇θ log p(x; θ) = E_{q(z | x; λ)} [∇θ log p(x, z; θ)]

[Figure: log marginal likelihood L vs. ELBO; the posterior gap closes when q equals the true posterior]

89/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Connection: Gradient Ascent on Log Marginal Likelihood


Introduction

Models
• Practically, this means we don’t have to manually perform posterior
Variational
Objective inference in the E-step. Can just calculate log p(x; θ) and call
Inference backpropagation.
Strategies
Exact Gradient • Example: in deep HMM, just implement forward algorithm to calculate
Sampling
Conjugacy log p(x; θ) and backpropagate using autodiff. No need to implement
Advanced Topics backward algorithm. (Or vice versa).
Case Studies

Conclusion (See Eisner [2016]: “Inside-Outside and Forward-Backward Algorithms Are Just
References Backprop”)

90/153
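A sketch of the log-space forward algorithm with autograd (the log initial, transition, and emission tensors are assumed to be built from θ, so calling backward() on the result gives ∇θ log p(x; θ) without a hand-written backward pass):

import torch

def hmm_log_marginal(log_pi, log_trans, log_emit):
    # log_pi: (K,); log_trans: (K, K) with [i, j] = log p(z_t = j | z_{t-1} = i);
    # log_emit: (T, K) with [t, k] = log p(x_t | z_t = k)
    T, K = log_emit.shape
    alpha = log_pi + log_emit[0]
    for t in range(1, T):
        # alpha_t(j) = logsumexp_i( alpha_{t-1}(i) + log_trans[i, j] ) + log_emit[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + log_trans, dim=0) + log_emit[t]
    return torch.logsumexp(alpha, dim=0)   # log p(x; theta)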
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Strategy 3b: Conditional Conjugacy


Introduction

Models
• Let p(z | x; θ) be intractable, but suppose p(x, z; θ) is
Variational
Objective conditionally conjugate, meaning p(zt | x, z−t ; θ) is exponential family.
Inference
Strategies
• Restrict the family of distributions q so that it factorizes over zt , i.e.
Exact Gradient
Sampling
T
Y
Conjugacy q(z; λ) = q(zt ; λt )
Advanced Topics t=1

Case Studies
(mean field family)
Conclusion
• Further choose q(zt ; λt ) so that it is in the same family as p(zt | x, z−t ; θ) .
References

91/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Strategy 3b: Conditional Conjugacy


Introduction

Models
(n) (n) KL(q(z)||p(z|x))
Variational
z1 zT z (n)
Objective

Inference
Strategies
(n) (n)
Exact Gradient λ1 λT x(n)
Sampling
N
Conjugacy N θ
Advanced Topics
T
Case Studies Y
q(z; λ) = q(zt ; λt )
Conclusion
t=1
References

92/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Mean Field Family

Introduction
• Optimize ELBO via coordinate ascent, i.e. iterate for λ1 , . . . , λT

Models
Variational
Objective
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References

arg min_{λt} KL[ Π_{t=1..T} q(zt ; λt) ‖ p(z | x; θ) ]

• Coordinate ascent updates will take the form

q(zt ; λt ) ∝ exp( E_{q(z−t ; λ−t)} [log p(x, z; θ)] )

where

E_{q(z−t ; λ−t)} [log p(x, z; θ)] = Σ_{z−t} Π_{j≠t} q(zj ; λj) log p(x, z; θ)

• Since p(zt | x, z−t ) was assumed to be in the exponential family, the above
updates can be derived in closed form.
93/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Example: Model 3 — Factorial HMM

Introduction z3,1 z3,2 z3,3 z3,4


Models

Variational
Objective
z2,1 z2,2 z2,3 z2,4

Inference
Strategies
z1,1 z1,2 z1,3 z1,4
Exact Gradient
Sampling
Conjugacy

Advanced Topics x1 x2 x3 x4
Case Studies
Conclusion
References
N

p(x, z; θ) = Π_{l=1..L} Π_{t=1..T} p(zl,t | zl,t−1 ; θ) p(xt | zl,t ; θ)
94/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Example: Model 3 — Factorial HMM

Introduction
z3,1 z3,2 z3,3 z3,4
Models

Variational
Objective z2,1 z2,2 z2,3 z2,4
Inference
Strategies
Exact Gradient z1,1 z1,2 z1,3 z1,4
Sampling
Conjugacy

Advanced Topics x1 x2 x3 x4
Case Studies
N
Conclusion

References
 
q(z1,1 ; λ1,1 ) ∝ exp Eq(z−(1,1) ; λ−(1,1) ) [log p(x, z; θ)]

95/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Example: Model 3 — Factorial HMM

Introduction
z3,1 z3,2 z3,3 z3,4
Models

Variational
Objective z2,1 z2,2 z2,3 z2,4
Inference
Strategies
Exact Gradient z1,1 z1,2 z1,3 z1,4
Sampling
Conjugacy

Advanced Topics x1 x2 x3 x4
Case Studies
N
Conclusion

References
 
q(z2,1 ; λ2,1 ) ∝ exp Eq(z−(2,1) ; λ−(2,1) ) [log p(x, z; θ)]

96/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Example: Model 3 — Factorial HMM


Introduction

Models Exact Inference:


Variational
Objective
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
• Naive: K states, L levels =⇒ HMM with K^L states =⇒ O(T K^{2L})
• Smarter: O(T L K^{L+1})

Mean Field:
• Gaussian emissions: O(T L K²) [Ghahramani and Jordan 1996].
• Categorical emission: need more variational approximations, but ultimately
O(L K V T) [Nepal and Yates 2013].
References

97/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics
Gumbel-Softmax
Flows 5 Advanced Topics
IWAE
Gumbel-Softmax
Case Studies
Flows
Conclusion
IWAE
References

6 Case Studies
98/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Advanced Topics
Models

Variational
Objective
1 Gumbel-Softmax: Extend reparameterization to discrete variables.
Inference
Strategies 2 Flows: Optimize a tighter bound by making the variational family q more
Advanced Topics flexible.
Gumbel-Softmax
Flows 3 Importance Weighting: Optimize a tighter bound through importance
IWAE
sampling.
Case Studies

Conclusion

References

99/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics
Gumbel-Softmax
Flows 5 Advanced Topics
IWAE
Gumbel-Softmax
Case Studies
Flows
Conclusion
IWAE
References

6 Case Studies
100/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Challenges of Discrete Variables

Introduction Review: we can always use score function estimator


Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

∇λ ELBO(x, θ, λ) = Eq[ log (p(x, z; θ) / q(z | x; λ)) ∇λ log q(z | x; λ) ]
                 = Eq[ ( log (p(x, z; θ) / q(z | x; λ)) − B ) ∇λ log q(z | x; λ) ]

• Eq [B ∇λ log q(z | x; λ)] = 0 (since E[∇ log q] = Σ q ∇ log q = Σ ∇q = 0)
• Control variate B (not dependent on z, but can depend on x).
• Estimate this quantity with another neural net [Mnih and Gregor 2014]

( B(x; ψ) − log (p(x, z; θ) / q(z | x; λ)) )²
101/153
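A sketch of the baseline-corrected surrogate (baseline_net is a hypothetical network mapping x to a scalar; log_q_z is log q(z | x; λ) for the sampled z, and log_joint returns log p(x, z; θ)):

import torch

def surrogate_with_baseline(x, z, log_q_z, log_joint, baseline_net):
    reward = (log_joint(x, z) - log_q_z).detach()   # R_{theta,lambda}(z), treated as a constant
    b = baseline_net(x)
    surrogate = (reward - b.detach()) * log_q_z     # score-function term with control variate
    baseline_loss = (b - reward).pow(2).mean()      # train psi by regressing B(x; psi) onto R
    return surrogate, baseline_loss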
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]

The “Gumbel-Max” trick [Papandreou and Yuille 2011]


Introduction
Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

p(zk = 1; α) = αk / Σ_{j=1..K} αj

where z = [0, 0, . . . , 1, . . . , 0] is a one-hot vector.
Can sample from p(z; α) by
1 Drawing independent Gumbel noise ε = ε1 , . . . , εK

εk = − log(− log uk )        uk ∼ U(0, 1)

2 Adding εk to log αk , finding argmax, i.e.

i = arg max_k [log αk + εk ]        zi = 1

102/153
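A sketch of the Gumbel-max sampler (log_alpha holds the unnormalized log αk):

import torch

def gumbel_max_sample(log_alpha):
    u = torch.rand_like(log_alpha)
    eps = -torch.log(-torch.log(u))        # eps_k ~ Gumbel(0, 1)
    return torch.argmax(log_alpha + eps)   # index i of the one-hot sample (z_i = 1)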
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]

Reparameterization:
Introduction
Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

z = arg max_{s∈∆^{K−1}} (log α + ε)ᵀ s = g(ε, α)

z = g(ε, α) is a deterministic function applied to stochastic noise.
Let's try applying this:

q(zk = 1 | x; λ) = αk / Σ_{j=1..K} αj        α = enc(x; λ)

(Recalling Rθ,λ (z) = log (p(x, z; θ) / q(z | x; λ))),

∇λ E_{q(z | x; λ)} [Rθ,λ (z)] = ∇λ E_{ε∼Gumbel} [Rθ,λ (g(ε, α))]
                             = E_{ε∼Gumbel} [∇λ Rθ,λ (g(ε, α))]
103/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]

Introduction
Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

But this won't work, because zero gradients (almost everywhere)

z = g(ε, α) = arg max_{s∈∆^{K−1}} (log α + ε)ᵀ s =⇒ ∇λ Rθ,λ (z) = 0

Gumbel-Softmax trick: replace arg max with softmax

z = softmax((log α + ε)/τ)        zk = exp((log αk + εk)/τ) / Σ_{j=1..K} exp((log αj + εj)/τ)

(where τ is a temperature term.)

∇λ E_{q(z | x; λ)} [Rθ,λ (z)] ≈ E_{ε∼Gumbel} [ ∇λ Rθ,λ (softmax((log α + ε)/τ)) ]
104/153
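A sketch of the relaxed sample (the function name is illustrative):

import torch

def gumbel_softmax_sample(log_alpha, tau=1.0):
    u = torch.rand_like(log_alpha)
    eps = -torch.log(-torch.log(u))                          # Gumbel(0, 1) noise
    return torch.softmax((log_alpha + eps) / tau, dim=-1)    # approaches one-hot as tau -> 0

Recent PyTorch versions also provide torch.nn.functional.gumbel_softmax, which implements the same relaxation with an optional hard (straight-through) variant.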
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]


Introduction

Models
• Approaches a discrete distribution as τ → 0 (anneal τ during training).
Variational
Objective • Reparameterizable by construction
Inference
Strategies
• Differentiable and has non-zero gradients

Advanced Topics
Gumbel-Softmax
Flows
IWAE

Case Studies

Conclusion

References (from Maddison et al. [2017])

105/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction

Models Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]
Variational
Objective
• See Maddison et al. [2017] on whether we can use the original categorical
Inference
Strategies
densities p(z), q(z), or need to use relaxed densities pGS (z), qGS (z).
Advanced Topics
Gumbel-Softmax
• Requires that p(x | z; θ) “makes sense” for non-discrete z (e.g. attention).
Flows
IWAE • Lower-variance, but biased gradient estimator. Variance → ∞ as τ → 0.
Case Studies

Conclusion

References

106/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics
Gumbel-Softmax
Flows 5 Advanced Topics
IWAE
Gumbel-Softmax
Case Studies
Flows
Conclusion
IWAE
References

6 Case Studies
107/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Flows [Rezende and Mohamed 2015; Kingma et al. 2016]


Models

Variational
Recall
Objective
log p(x; θ) = ELBO(θ, λ; x) − KL[q(z | x; λ) k p(z | x; θ)]
Inference
Strategies
Bound is tight when variational posterior equals true posterior
Advanced Topics
Gumbel-Softmax
Flows
q(z | x; λ) = p(z | x; θ) =⇒ log p(x; θ) = ELBO(θ, λ; x)
IWAE

Case Studies We want to make q(z | x; λ) as flexible as possible: can we do better than just
Conclusion Gaussian?
References

108/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Flows [Rezende and Mohamed 2015; Kingma et al. 2016]


Introduction

Models
Idea: transform a sample from a simple initial variational distribution,
Variational
Objective
z0 ∼ q(z | x; λ) = N (µ, σ 2 ) µ, σ 2 = enc(x; λ)
Inference
Strategies

Advanced Topics
into a more complex one
Gumbel-Softmax
Flows
IWAE
zK = fK ◦ · · · ◦ f2 ◦ f1 (z0 ; λ)
Case Studies
where fi (zi−1 ; λ)’s are invertible transformations (whose parameters are
Conclusion
absorbed by λ).
References

109/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Flows [Rezende and Mohamed 2015; Kingma et al. 2016]

Introduction
Sample from final variational posterior is given by zK . Density is given by the
Models
change of variables formula:
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

log qK (zK | x; λ) = log q(z0 | x; λ) + Σ_{k=1..K} log |det ∂f_k^{−1}/∂zk|
                   = log q(z0 | x; λ) − Σ_{k=1..K} log |det ∂f_k/∂zk−1|

(first term: log density of Gaussian; sum: log determinant of Jacobian)

Determinant calculation is O(N³) in general, but can be made faster depending
on parameterization of fk
110/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Flows [Rezende and Mohamed 2015; Kingma et al. 2016]


Introduction

Models
Can still use reparameterization to obtain gradients. Letting
Variational
Objective F (z) = fK ◦ · · · ◦ f1 (z),
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies

∇λ ELBO(θ, λ; x) = ∇λ E_{qK (zK | x; λ)} [ log (p(x, z; θ) / qK (zK | x; λ)) ]
                 = ∇λ E_{q(z0 | x; λ)} [ log (p(x, F (z0); θ) / q(z0 | x; λ)) − log |det ∂F/∂z0| ]
                 = E_{ε∼N (0,I)} [ ∇λ ( log (p(x, F (z0); θ) / q(z0 | x; λ)) − log |det ∂F/∂z0| ) ]

Conclusion

References

111/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Flows [Rezende and Mohamed 2015; Kingma et al. 2016]

Introduction
Examples of fk (zk−1 ; λ)
Models
• Normalizing Flows [Rezende and Mohamed 2015]
Variational
Objective
fk (zk−1 ) = zk−1 + uk h(wkᵀ zk−1 + bk )
Inference
Strategies

Advanced Topics
• Inverse Autoregressive Flows [Kingma et al. 2016]
Gumbel-Softmax
Flows
IWAE
fk (zk−1 ) = zk−1 ⊙ σk + µk
Case Studies
σk,d = sigmoid(NN(zk−1,<d )) µk,d = NN(zk−1,<d )
Conclusion

References (In this case the Jacobian is upper triangular, so determinant is just the
product of diagonals)
112/153
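A sketch of a single IAF-style affine step (autoregressive_nn is a hypothetical network whose d-th outputs depend only on zk−1,<d); the returned log-determinant is the term subtracted in the flow density above:

import torch

def iaf_step(z, autoregressive_nn):
    mu, s = autoregressive_nn(z)             # each of shape (d,), autoregressive in z
    sigma = torch.sigmoid(s)
    z_new = z * sigma + mu                   # f_k(z_{k-1}) = z_{k-1} * sigma_k + mu_k
    log_det = torch.log(sigma).sum(dim=-1)   # log |det Jacobian| = sum_d log sigma_d
    return z_new, log_det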
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Flows [Rezende and Mohamed 2015; Kingma et al. 2016]

Models

Variational
Objective

Inference
Strategies

Advanced Topics
Gumbel-Softmax
Flows
IWAE

Case Studies

Conclusion (from Rezende and Mohamed [2015])


References

113/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics
Gumbel-Softmax
Flows 5 Advanced Topics
IWAE
Gumbel-Softmax
Case Studies
Flows
Conclusion
IWAE
References

6 Case Studies
114/153
Tutorial:
Deep Latent NLP Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]
(bit.do/lvnlp)

• Flows are a way of tightening the ELBO by making the variational family
Introduction
more flexible.
Models
• Not the only way: can obtain a tighter lower bound on log p(x; θ) by using
Variational
Objective multiple importance samples.
Inference
Strategies
Consider:
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies

IK = (1/K) Σ_{k=1..K} p(x, z (k); θ) / q(z (k) | x; λ),

where z (1:K) ∼ Π_{k=1..K} q(z (k) | x; λ).
Conclusion

References
Note that IK is an unbiased estimator of p(x; θ):

Eq(z (1:K) | x; λ) [IK ] = p(x; θ).


115/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]

Introduction
Any unbiased estimator of p(x; θ) can be used to obtain a lower bound, using
Models
Jensen’s inequality:
Variational
Objective
p(x; θ) = Eq(z (1:K) | x; λ) [IK ]
Inference
Strategies
=⇒ log p(x; θ) ≥ Eq(z (1:K) | x; λ) [log IK ]
Advanced Topics
Gumbel-Softmax
Flows
IWAE

= Eq(z (1:K) | x; λ) [ log (1/K) Σ_{k=1..K} p(x, z (k); θ) / q(z (k) | x; λ) ]
Case Studies
However, can also show [Burda et al. 2015]:
Conclusion

References
• log p(x; θ) ≥ E [log IK ] ≥ E [log IK−1 ]
• limK→∞ E [log IK ] = log p(x; θ) under mild conditions
116/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]


Introduction

Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows

Eq(z (1:K) | x; λ) [ log (1/K) Σ_{k=1..K} p(x, z (k); θ) / q(z (k) | x; λ) ]

• Note that with K = 1, we recover the ELBO.
• Can interpret p(x, z (k); θ) / q(z (k) | x; λ) as importance weights.
IWAE
• If q(z | x; λ) is reparameterizable, we can use the reparameterization trick to
Case Studies

Conclusion
optimize E [log IK ] directly.
References • Otherwise, need score function gradient estimators [Mnih and Rezende 2016].

117/153
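A sketch of the K-sample bound (log_w is assumed to hold log p(x, z (k); θ) − log q(z (k) | x; λ) for k = 1..K along the last dimension):

import math
import torch

def iwae_bound(log_w):
    # log (1/K sum_k w_k) = logsumexp(log w) - log K; with K = 1 this is the usual single-sample ELBO estimate
    return torch.logsumexp(log_w, dim=-1) - math.log(log_w.shape[-1])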
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics

Case Studies 5 Advanced Topics


Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries 6 Case Studies
and Topics

Conclusion Sentence VAE


References Encoder/Decoder with Latent Variables
Latent Summaries and Topics
118/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics

Case Studies 5 Advanced Topics


Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries 6 Case Studies
and Topics

Conclusion Sentence VAE


References Encoder/Decoder with Latent Variables
Latent Summaries and Topics
119/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Sentence VAE Example [Bowman et al. 2016]


Introduction

Models Generative Model (Model 2):


Variational • Draw z ∼ N (0, I)
Objective

Inference • Draw xt | z ∼ CRNNLM(θ, z)


Strategies
Variational Model (Amortized): Deep Diagonal Gaussians,
Advanced Topics

Case Studies
q(z | x; λ) = N (µ, σ 2 )
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries h̃T = RNN(x; ψ)
and Topics

Conclusion µ = W1 h̃T σ 2 = exp(W2 h̃T ) λ = {W1 , W2 , ψ}


References

120/153
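A single-sample ELBO sketch for this model, with the Gaussian KL in closed form (encoder_rnn, W1, W2, and decoder_log_likelihood are hypothetical modules matching the parameterization above):

import torch

def sentence_vae_elbo(x, encoder_rnn, W1, W2, decoder_log_likelihood):
    h = encoder_rnn(x)                                   # h_tilde_T
    mu, logvar = W1(h), W2(h)                            # sigma^2 = exp(W2 h_tilde_T)
    eps = torch.randn_like(mu)
    z = mu + (0.5 * logvar).exp() * eps                  # reparameterized sample from q(z | x; lambda)
    rec = decoder_log_likelihood(x, z)                   # sum_t log p(x_t | x_<t, z; theta)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()   # KL[q(z | x) || N(0, I)]
    return rec - kl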
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Sentence VAE Example [Bowman et al. 2016]

Introduction

Models

Variational
Objective

Inference
Strategies
(from Bowman et al. [2016])
Advanced Topics

Case Studies 
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries λ z z
and Topics

Conclusion
x1 ... xT
References
x
121/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Issue 1: Posterior Collapse

Introduction
Models
Variational
Objective
Inference
Strategies

ELBO(θ, λ) = Eq(z | x; λ) [ log (p(x, z; θ) / q(z | x; λ)) ]
           = Eq(z | x; λ) [log p(x | z; θ)] − KL[q(z | x; λ) ‖ p(z)]
             (reconstruction likelihood)        (regularizer)
Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
Model L/ELBO Reconstruction KL
with Latent Variables
Latent Summaries
and Topics RNN LM -329.10 - -
Conclusion RNN VAE -330.20 -330.19 0.01
References

(On Yahoo Corpus from Yang et al. [2017])


122/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Issue 1: Posterior Collapse


Introduction

Models
• x and z become independent, and p(x, z; θ) reduces to a non-LV language
Variational
Objective
model.
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
• Chen et al. [2017]: If it's possible to model p⋆(x) without making use of z, then
ELBO optimum is at:

p⋆(x) = p(x | z; θ) = p(x; θ)        q(z | x; λ) = p(z)

KL[q(z | x; λ) ‖ p(z)] = 0
Conclusion

References

123/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Mitigating Posterior Collapse
Introduction
Use less powerful likelihood models [Miao et al. 2016; Yang et al. 2017], or “word
Models

Variational
dropout” [Bowman et al. 2016].
Objective

Inference
Strategies Model LL/ELBO Reconstruction KL
Advanced Topics
RNN LM -329.1 - -
Case Studies
RNN VAE -330.2 -330.2 0.01
Sentence VAE
Encoder/Decoder
with Latent Variables
+ Word Drop -334.2 -332.8 1.44
Latent Summaries
and Topics CNN VAE -332.1 -322.1 10.0
Conclusion

References (On Yahoo Corpus from Yang et al. [2017])

124/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Mitigating Posterior Collapse

Introduction Gradually anneal multiplier on KL term, i.e.


Models
Eq(z | x; λ) [log p(x | z; θ)] − β KL[q(z | x; λ)kp(z)]
Variational
Objective

Inference β goes from 0 to 1 as training progresses


Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

Conclusion

References

(from Bowman et al. [2016])


125/153
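A minimal annealing schedule (the warm-up length is illustrative):

def kl_weight(step, warmup=10000):
    # beta rises linearly from 0 to 1 over the first `warmup` updates, then stays at 1
    return min(1.0, step / warmup)

# training loss at each step: -(reconstruction - kl_weight(step) * kl)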
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Mitigating Posterior Collapse


Introduction

Models Other approaches:


Variational • Use auxiliary losses (e.g. train z as part of a topic model) [Dieng et al. 2017;
Objective
Wang et al. 2018]
Inference
Strategies
• Use von Mises–Fisher distribution with a fixed concentration parameter [Guu
Advanced Topics
et al. 2017; Xu and Durrett 2018]
Case Studies
Sentence VAE • Combine stochastic/amortized variational inference [Kim et al. 2018]
Encoder/Decoder
with Latent Variables
• Add skip connections [Dieng et al. 2018]
Latent Summaries
and Topics

Conclusion
In practice, often necessary to combine various methods.
References

126/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Issue 2: Evaluation


Models

Variational
Objective
• ELBO always lower bounds log p(x; θ), so can calculate an upper bound on
Inference PPL efficiently.
Strategies
• When reporting ELBO, should also separately report,
Advanced Topics

Case Studies
KL[q(z | x; λ)kp(z)]
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
to give an indication of how much the latent variable is being “used”.
and Topics

Conclusion

References

127/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Issue 2: Evaluation
Introduction

Models Also can evaluate log p(x; θ) with importance sampling


Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

p(x; θ) = Eq(z | x; λ) [ p(x | z; θ) p(z) / q(z | x; λ) ]
        ≈ (1/K) Σ_{k=1..K} p(x | z (k); θ) p(z (k)) / q(z (k) | x; λ)

So

=⇒ log p(x; θ) ≈ log (1/K) Σ_{k=1..K} p(x | z (k); θ) p(z (k)) / q(z (k) | x; λ)
Conclusion

References

128/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Evaluation

Introduction
Qualitative evaluation
Models
• Evaluate samples from prior/variational posterior.
Variational
Objective • Interpolation in latent space.
Inference
Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

Conclusion

References
(from Bowman et al. [2016])
129/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics

Case Studies 5 Advanced Topics


Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries 6 Case Studies
and Topics

Conclusion Sentence VAE


References Encoder/Decoder with Latent Variables
Latent Summaries and Topics
130/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Encoder/Decoder [Sutskever et al. 2014; Cho et al. 2014]

Introduction

Models

Variational
Objective

Inference
Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables Given: Source information s = s1 , . . . , sM .
Latent Summaries
and Topics

Conclusion
Generative process:
References • Draw x1:T | s ∼ CRNNLM(θ, enc(s)).

131/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Latent, Per-token Experts [Yang et al. 2018]

Introduction
Generative process: For t = 1, . . . , T ,
Models • Draw zt | x<t , s ∼ softmax(U ht ).
Variational • Draw xt | zt , x<t , s ∼ softmax(W tanh(Qzt ht ); θ)
Objective

Inference
Strategies (n) (n) (n)
z1 z2 zT
Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
(n)
x1
(n)
x2 ... (n)
xT
Latent Summaries
and Topics
N
Conclusion

References

If U ∈ R^{K×d}, we use K experts; this increases the flexibility of the per-token distribution.


132/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Case-Study: Latent Per-token Experts [Yang et al. 2018]

Introduction
Learning: zt are independent given x<t , so we can marginalize at each time-step
Models
(Method 3: Conjugacy).
Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Conclusion

arg max_θ log p(x | s; θ) =
arg max_θ log Π_{t=1..T} Σ_{k=1..K} p(zt =k | s, x<t ; θ) p(xt | zt =k, x<t , s; θ).

Test-time:

arg max_{x1:T} Π_{t=1..T} Σ_{k=1..K} p(zt =k | s, x<t ; θ) p(xt | zt =k, x<t , s; θ).
References

133/153
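A sketch of the per-token marginal in log space (the shapes U ∈ R^{K×d}, Q stacked as (K, d, d), output matrix W ∈ R^{V×d}, and the target token index x_t are illustrative assumptions):

import torch

def mixture_of_experts_log_prob(h_t, U, Q, W, x_t):
    log_prior = torch.log_softmax(U @ h_t, dim=-1)                 # log p(z_t = k | x_<t, s)
    expert_h = torch.tanh(Q @ h_t)                                 # (K, d): tanh(Q_k h_t) per expert
    log_emit = torch.log_softmax(expert_h @ W.t(), dim=-1)         # (K, V): per-expert word distributions
    return torch.logsumexp(log_prior + log_emit[:, x_t], dim=-1)   # log sum_k p(z_t=k) p(x_t | z_t=k)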
Tutorial:
Deep Latent NLP Case-Study: Latent, Per-token Experts [Yang et al. 2018]
(bit.do/lvnlp)

PTB language modeling results (s is constant):


Introduction

Models Model PPL


Variational
Objective Merity et al. [2018] 57.30
Inference Softmax-mixture [Yang et al. 2018] 54.44
Strategies

Advanced Topics

Case Studies
Dialogue generation results (s is context):
Sentence VAE
Encoder/Decoder
with Latent Variables
Model BLEU
Latent Summaries
and Topics
Prec Rec
Conclusion

References No mixture 14.1 11.1


Softmax-mixture [Yang et al. 2018] 15.7 12.3
134/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Attention [Bahdanau et al. 2015]

Introduction

Models

Variational
Objective

Inference
Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries Decoding with an attention mechanism:
and Topics

Conclusion
References

xt | x<t , s ∼ softmax(W [ht , Σ_{m=1..M} αt,m enc(s)m ]).
135/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Copy Attention [Gu et al. 2016; Gulcehre et al. 2016]


Introduction

Models Copy attention models copying words directly from s.


Variational
Objective Generative process: For t = 1, . . . , T ,
Inference
Strategies
• Set αt to be attention weights.
Advanced Topics • Draw zt | x<t , s ∼ Bern(MLP([ht , enc(s)])).
Case Studies • If zt = 0
Sentence VAE
Encoder/Decoder • Draw xt | zt , x<t , s ∼ softmax(W ht ).
with Latent Variables
Latent Summaries
and Topics
• Else
Conclusion • Draw xt ∈ {s1 , . . . , sM } | zt , x<t , s ∼ Cat(αt ).
References

136/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Copy Attention

Introduction Learning: Can maximize the log per-token marginal [Gu et al. 2016], as with
Models per-token experts:
Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Conclusion
References

max_θ log p(x1 , . . . , xT | s; θ)
= max_θ log Π_{t=1..T} Σ_{z′∈{0,1}} p(zt = z′ | s, x<t ; θ) p(xt | z′, x<t , s; θ).

Test-time:

arg max_{x1:T} Π_{t=1..T} Σ_{z′∈{0,1}} p(zt = z′ | s, x<t ; θ) p(xt | z′, x<t , s; θ).

137/153
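A sketch of the per-token copy marginal (p_gen is the length-V generation distribution, alpha the attention weights over source positions, s_ids the source token ids, and p_copy the gate probability p(z_t = 1); all names are illustrative):

import torch

def copy_marginal(p_gen, alpha, s_ids, p_copy, x_t):
    copy_mass = alpha[s_ids == x_t].sum()                  # attention mass on source copies of x_t
    return (1 - p_copy) * p_gen[x_t] + p_copy * copy_mass  # p(x_t | x_<t, s), summing out z_t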
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction

Models
Attention as a Latent Variable [Deng et al. 2018]
Variational
Objective
Generative process: For t = 1, . . . , T ,
Inference
Strategies • Set αt to be attention weights.
Advanced Topics
• Draw zt | x<t , s ∼ Cat(αt ).
Case Studies
Sentence VAE • Draw xt | zt , x<t , s ∼ softmax(W [ht , enc(szt )]; θ).
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

Conclusion

References

138/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Attention as a Latent Variable [Deng et al. 2018]


Introduction

Models
Marginal likelihood under latent attention model:
Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Conclusion

p(x1:T | s; θ) = Π_{t=1..T} Σ_{m=1..M} αt,m softmax(W [ht , enc(sm )]; θ)_{xt} .

Standard attention likelihood:

p(x1:T | s; θ) = Π_{t=1..T} softmax(W [ht , Σ_{m=1..M} αt,m enc(sm )]; θ)_{xt} .
References

139/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Attention as a Latent Variable [Deng et al. 2018]

Introduction

Models
Learning Strategy #1: Maximize the log marginal via enumeration as above.
Variational
Objective
Learning Strategy #2: Maximize the ELBO with AVI:
Inference
Strategies

Advanced Topics max Eq(zt ; λ) [log p(xt | x<t , zt , s)] − KL[q(zt ; λ)kp(zt | x<t , s)].
λ,θ
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables • q(zt | x; λ) approximates p(zt | x1:T , s; θ); implemented with a BLSTM.
Latent Summaries
and Topics
• q isn’t reparameterizable, so gradients obtained using REINFORCE +
Conclusion
baseline.
References

140/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Attention as a Latent Variable [Deng et al. 2018]


Introduction

Models
Test-time: Calculate p(xt | x<t , s; θ) by summing out zt .
Variational
Objective
MT Results on IWSLT-2014:
Inference
Strategies

Advanced Topics Model PPL BLEU


Case Studies
Standard Attn 7.03 32.31
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Attn (marginal) 6.33 33.08
Latent Summaries
and Topics
Latent Attn (ELBO) 6.13 33.09
Conclusion

References

141/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Encoder/Decoder with Structured Latent Variables
Models

Variational At least two EMNLP 2018 papers augment encoder/decoder text generation
Objective

Inference
models with structured latent variables:
Strategies
1 Lee et al. [2018] generate x1:T by iteratively refining sequences of words z1:T .
Advanced Topics

Case Studies
Sentence VAE 2 Wiseman et al. [2018] generate x1:T conditioned on a latent template or plan
Encoder/Decoder
with Latent Variables z1:S .
Latent Summaries
and Topics

Conclusion

References

142/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics

Case Studies 5 Advanced Topics


Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries 6 Case Studies
and Topics

Conclusion Sentence VAE


References Encoder/Decoder with Latent Variables
Latent Summaries and Topics
143/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Summary as a Latent Variable [Miao and Blunsom 2016]

Models
Generative process for a document x = x1 , . . . , xT :
Variational
Objective • Draw a latent summary z1 , . . . , zM ∼ RNNLM(θ)
Inference
Strategies • Draw x1 , . . . , xT | z1:M ∼ CRNNLM(θ, z)
Advanced Topics

Case Studies
Sentence VAE
Posterior Inference:
Encoder/Decoder
with Latent Variables
Latent Summaries p(z1:M | x1:T ; θ) = p(summary | document; θ).
and Topics

Conclusion

References

144/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Summary as a Latent Variable [Miao and Blunsom 2016]


Models

Variational Learning: Maximize the ELBO with amortized family:


Objective

Inference max Eq(z1:M ; λ) [log p(x1:T | z1:M ; θ)] − KL[q(z1:M ; λ)kp(z1:M ; θ)]
Strategies λ,θ

Advanced Topics

Case Studies
• q(z1:M ; λ) approximates p(z1:M | x1:T ; θ); also implemented with
Sentence VAE
encoder/decoder RNNs.
Encoder/Decoder
with Latent Variables
Latent Summaries • q(z1:M ; λ) not reparameterizable, so gradients use REINFORCE + baselines.
and Topics

Conclusion

References

145/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Summary as a Latent Variable [Miao and Blunsom 2016]

Models
Semi-supervised Training: Can also use documents without corresponding
Variational
Objective summaries in training.
Inference
Strategies
• Train q(z1:M ; λ) ≈ p(z1:M | x1:T ; θ) with labeled examples.

Advanced Topics
• Infer summary z for an unlabeled document with q.
Case Studies
Sentence VAE
Encoder/Decoder • Use inferred z to improve model p(x1:T | z1:M ; θ).
with Latent Variables
Latent Summaries
and Topics
• Allows for outperforming strictly supervised models!
Conclusion

References

146/153
Tutorial:
Deep Latent NLP Topic Models [Blei et al. 2003]
(bit.do/lvnlp)

Introduction

Models

Variational
Objective

Inference
Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Conclusion
References
Generative process: for each document x(n) = x1(n) , . . . , xT(n) ,
• Draw topic distribution ztop(n) ∼ Dir(α)
• For t = 1, . . . , T :
  • Draw topic zt(n) ∼ Cat(ztop(n))
  • Draw xt(n) ∼ Cat(β_{zt(n)})
147/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Simple, Deep Topic Models [Miao et al. 2017]

Models
Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Motivation: easy to learn deep topic models with VI if q(ztop(n) ; λ) is
reparameterizable.

Idea: draw ztop(n) from a transformation of a Gaussian.
• Draw z0(n) ∼ N (µ0 , σ0²)
• Set ztop(n) = softmax(W z0(n)).
• Use analogous transformation when drawing from q(ztop(n) ; λ).
Conclusion

References

148/153
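A sketch of the reparameterized topic-proportion draw (mu0 and logvar0 parameterizing the Gaussian and the projection W are illustrative names; the same transformation is applied when sampling from q):

import torch

def sample_topic_proportions(mu0, logvar0, W):
    eps = torch.randn_like(mu0)
    z0 = mu0 + (0.5 * logvar0).exp() * eps      # z_0 ~ N(mu_0, sigma_0^2), reparameterized
    return torch.softmax(W @ z0, dim=-1)        # z_top lies on the topic simplex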
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Simple, Deep Topic Models [Miao et al. 2017]


Introduction
(n)
Models Learning Step #1: Marginalize out per-word latents zt .
Variational
Objective
N T X
K
Inference (n)
Y (n)
Y (n)
Strategies p({x(n) }N N
n=1 , {ztop }n=1 ; θ) = p(ztop | θ) ztop,k βk,x(n)
t
Advanced Topics
n=1 t=1 k=1

Case Studies
Sentence VAE
Encoder/Decoder
Learning Step #2: Use AVI to optimize resulting ELBO.
with Latent Variables
Latent Summaries
h i
(n) (n) (n)
and Topics max Eq(z(n) ; λ) log p(x(n) | ztop ; θ) − KL[N (z0 ; λ)kN (z0 ; µ0 , σ 20 )]
λ,θ top
Conclusion

References

149/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Simple, Deep Topic Models [Miao et al. 2017]
Models

Variational
Objective Perplexities on held-out documents, for three datasets:
Inference
Strategies
Model MXM 20News RCV1
Advanced Topics
OnlineLDA [Hoffman et al. 2010] 342 1015 1058
Case Studies
Sentence VAE AVI-LDA [Miao et al. 2017] 272 830 602
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

Conclusion

References

150/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction
2 Models
Models

Variational
Objective 3 Variational Objective
Inference
Strategies
4 Inference Strategies
Advanced Topics

Case Studies
5 Advanced Topics
Conclusion

References
6 Case Studies

7 Conclusion
151/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Deep Latent-Variable NLP: Two Views


Introduction

Models Deep Models & LV Models are naturally complementary:


Variational
Objective
• Rich set of model choices: discrete, continuous, and structured.
Inference • Real applications across NLP including some state-of-the-art models.
Strategies

Advanced Topics

Case Studies Deep Models & LV Models are frustratingly incompatible:


Conclusion
• Many interesting approaches to the problem: reparameterization,
References
score-function, and more.
• Lots of area for research into improved approaches.

152/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction

Models
Implementation
Variational
Objective
• Modern toolkits make it easy to implement these models.
Inference
Strategies • Combine the flexibility of auto-differentiation for optimization (PyTorch)
Advanced Topics with distribution and VI libraries (Pyro).
Case Studies
In fact, we have implemented this entire tutorial. See website link:
Conclusion
http://bit.do/lvnlp
References

153/153
Tutorial:
Charu C. Aggarwal and ChengXiang Zhai. 2012. A Survey of Text Clustering Algorithms. In
Deep Latent NLP Mining Text Data, pages 77–128. Springer.
(bit.do/lvnlp)
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by
Jointly Learning to Align and Translate. In Proceedings of ICLR.
Introduction
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of
Models
machine Learning research, 3(Jan):993–1022.
Variational
Samuel R. Bowman, Luke Vilnis, Oriol Vinyal, Andrew M. Dai, Rafal Jozefowicz, and Samy
Objective
Bengio. 2016. Generating Sentences from a Continuous Space. In Proceedings of CoNLL.
Inference
Strategies Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai.
Advanced Topics
1992. Class-based N-gram Models of Natural Language. Computational Linguistics,
18(4):467–479.
Case Studies
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. 2015. Importance Weighted
Conclusion
Autoencoders. In Proceedings of ICLR.
References
Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya
Sutskever, and Pieter Abbeel. 2017. Variational Lossy Autoencoder. In Proceedings of ICLR.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the
Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of
the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. 153/153
Tutorial:
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum Likelihood from
Deep Latent NLP Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B,
(bit.do/lvnlp)
39(1):1–38.
Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander M. Rush. 2018. Latent
Introduction
Alignment and Variational Attention. In Proceedings of NIPS.
Models
Adji B. Dieng, Yoon Kim, Alexander M. Rush, and David M. Blei. 2018. Avoiding Latent
Variational Variable Collapse with Generative Skip Models. In Proceedings of the ICML Workshop on
Objective
Theoretical Foundations and Applications of Deep Generative Models.
Inference
Strategies Adji B. Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2017. TopicRNN: A Recurrent
Neural Network With Long-Range Semantic Dependency. In Proceedings of ICLR.
Advanced Topics
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online
Case Studies
Learning and Stochastic Optimization. Journal of Machine Learning Research, 12.
Conclusion
Jason Eisner. 2016. Inside-Outside and Forward-Backward Algorithms Are Just Backprop
References
(Tutorial Paper). In Proceedings of the Workshop on Structured Prediction for NLP.
Ekaterina Garmash and Christof Monz. 2016. Ensemble Learning for Multi-source Neural
Machine Translation. In Proceedings of COLING.
Zoubin Ghahramani and Michael I. Jordan. 1996. Factorial Hidden Markov Models. In
Proceedings of NIPS. 153/153
Tutorial:
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism
Deep Latent NLP in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the
(bit.do/lvnlp)
Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages
1631–1640.
Introduction
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016.
Models
Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association
Variational
for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 140–149.
Objective

Inference
Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2017. Generating
Strategies Sentences by Editing Prototypes. arXiv:1709.08878.
Advanced Topics William P Headden III, Mark Johnson, and David McClosky. 2009. Improving Unsupervised
Case Studies Dependency Parsing with Richer Contexts and Smoothing. In Proceedings of NAACL.
Conclusion Marti A Hearst. 1997. Texttiling: Segmenting text into multi-paragraph subtopic passages.
References Computational linguistics, 23(1):33–64.
Matthew Hoffman, Francis R Bach, and David M Blei. 2010. Online learning for latent dirichlet
allocation. In advances in neural information processing systems, pages 856–864.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward
Controlled Generation of Text. In Proceedings of ICML. 153/153
Tutorial:
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive
Deep Latent NLP Mixtures of Local Experts. Neural Computation, 3(1):79–87.
(bit.do/lvnlp)
Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with
Gumbel-Softmax. In Proceedings of ICLR.
Introduction
Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, and Alexander M. Rush. 2018.
Models
Semi-Amortized Variational Autoencoders. In Proceedings of ICML.
Variational Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In
Objective
Proceedings of ICLR.
Inference
Strategies Diederik P. Kingma, Tim Salimans, and Max Welling. 2016. Improving Variational Inference
with Autoregressive Flow. arXiv:1606.04934.
Advanced Topics
Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proceedings
Case Studies
of ICLR.
Conclusion
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio
References
Torralba, and Sanja Fidler. 2015. Skip-thought Vectors. In Proceedings of NIPS.
Dan Klein and Christopher D Manning. 2004. Corpus-based Induction of Syntactic Structure:
Models of Dependency and Constituency. In Proceedings of ACL.
Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic Non-Autoregressive
Neural Sequence Modeling by Iterative Refinement. In Proceedings of EMNLP. 153/153
Tutorial:
Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David
Deep Latent NLP Crandall, and Dhruv Batra. 2016. Stochastic Multiple Choice Learning for Training Diverse
(bit.do/lvnlp)
Deep Ensembles. In Proceedings of NIPS.
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A
Introduction
Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR.
Models
Bernard Merialdo. 1994. Tagging English Text with a Probabilistic Model. Computational
Variational
Objective Linguistics, 20(2):155–171.

Inference Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing
Strategies
LSTM language models. In International Conference on Learning Representations.
Advanced Topics
Yishu Miao and Phil Blunsom. 2016. Language as a Latent Variable: Discrete Generative
Case Studies Models for Sentence Compression. In Proceedings of EMNLP.
Conclusion Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering Discrete Latent Topics
References with Neural Variational Inference. In Proceedings of ICML.
Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural Variational Inference for Text Processing.
In Proceedings of ICML.
Andriy Mnih and Danilo J. Rezende. 2016. Variational Inference for Monte Carlo Objectives. In
Proceedings of ICML. 153/153
Tutorial:
Andryi Mnih and Karol Gregor. 2014. Neural Variational Inference and Learning in Belief
Deep Latent NLP Networks. In Proceedings of ICML.
(bit.do/lvnlp)
Anjan Nepal and Alexander Yates. 2013. Factorial Hidden Markov Models for Learning
Representations of Natural Language. arXiv:1312.6168.
Introduction
George Papandreou and Alan L. Yuille. 2011. Perturb-and-Map Random Fields: Using Discrete
Models
Optimization to Learn and Sample from Energy Models. In Proceedings of ICCV.
Variational
Danilo J. Rezende and Shakir Mohamed. 2015. Variational Inference with Normalizing Flows.
Objective
In Proceedings of ICML.
Inference
Strategies Noah A. Smith and Jason Eisner. 2005. Contrastive Estimation: Training Log-Linear Models on
Advanced Topics
Unlabeled Data. In Proceedings of ACL.
Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to Sequence Learning with Neural
Case Studies
Networks. In Proceedings of NIPS.
Conclusion
Ke Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised
References
Neural Hidden Markov Models. In Proceedings of the Workshop on Structured Prediction for
NLP.
Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. Hmm-based word alignment in
statistical translation. In Proceedings of the 16th conference on Computational
linguistics-Volume 2, pages 836–841. Association for Computational Linguistics. 153/153
Tutorial:
Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh,
Deep Latent NLP and Lawrence Carin. 2018. Topic Compositional Neural Language Model. In Proceedings of
(bit.do/lvnlp)
AISTATS.
Peter Willett. 1988. Recent Trends in Hierarchic Document Clustering: A Critical Review.
Introduction
Information Processing & Management, 24(5):577–597.
Models
Ronald J. Williams. 1992. Simple Statistical Gradient-following Algorithms for Connectionist
Variational
Objective Reinforcement Learning. Machine Learning, 8.

Inference Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning Neural Templates
Strategies
for Text Generation. In Proceedings of EMNLP.
Advanced Topics
Jiacheng Xu and Greg Durrett. 2018. Spherical Latent Spaces for Stable Variational
Case Studies Autoencoders. In Proceedings of EMNLP.
Conclusion Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the
References Softmax Bottleneck: A High-Rank RNN Language Model. In Proceedings of ICLR.
Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved
Variational Autoencoders for Text Modeling using Dilated Convolutions. In Proceedings of
ICML.
153/153
