
Tutorial: Deep Latent NLP (bit.do/lvnlp)

Deep Latent-Variable Models of Natural Language

Yoon Kim, Sam Wiseman, Alexander Rush

Tutorial 2018
https://github.com/harvardnlp/DeepLatentNLP
1 Introduction
    Goals
    Background
2 Models
3 Variational Objective
4 Inference Strategies
5 Advanced Topics
6 Case Studies
Goal of Latent-Variable Modeling

Probabilistic models provide a declarative language for specifying prior knowledge
and structural relationships in the context of unknown variables.

Makes it easy to specify:

• Known interactions in the data
• Uncertainty about unknown factors
• Constraints on model properties
Latent-Variable Modeling in NLP

Long and rich history of latent-variable models of natural language.

Major successes include, among many others:

• Statistical alignment for translation
• Document clustering and topic modeling
• Unsupervised part-of-speech tagging and parsing
Goals of Deep Learning

Toolbox of methods for learning rich, non-linear data representations through
numerical optimization.

Makes it easy to fit:

• Highly-flexible predictive models
• Transferable feature representations
• Structurally-aligned network architectures
Deep Learning in NLP

Current dominant paradigm for NLP.

Major successes include, among many others:

• Text classification
• Neural machine translation
• NLU tasks (QA, NLI, etc.)
Tutorial: Deep Latent-Variable Models for NLP

• How should a contemporary ML/NLP researcher reason about latent variables?
• What unique challenges come from modeling text with latent variables?
• What techniques have been explored and shown to be effective in recent papers?

We explore these through the lens of variational inference.
Tutorial Take-Aways

1 A collection of deep latent-variable models for NLP
2 An understanding of a variational objective
3 A toolkit of algorithms for optimization
4 A formal guide to advanced techniques
5 A survey of example applications
6 Code samples and techniques for practical use
Tutorial Non-Objectives

Not covered (for time, not relevance):

• Many classical latent-variable approaches
• Undirected graphical models such as MRFs
• Non-likelihood-based models such as GANs
• Sampling-based inference such as MCMC
• Details of deep learning architectures
What are deep networks?

Deep networks are parameterized non-linear functions; they transform an input z
into features h using parameters π.

Important examples: the multilayer perceptron,

    h = MLP(z; π) = V σ(W z + b) + a,    π = {V, W, a, b},

and the recurrent neural network, which maps a sequence of inputs z_{1:T} into a
sequence of features h_{1:T},

    h_t = RNN(h_{t−1}, z_t; π) = σ(U z_t + V h_{t−1} + b),    π = {V, U, b}.
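As a concrete reference, the following is a minimal PyTorch sketch of these two parameterizations (the layer sizes and the use of PyTorch are illustrative assumptions, not part of the tutorial):

    import torch
    import torch.nn as nn

    class MLP(nn.Module):
        """h = V sigma(W z + b) + a, with parameters pi = {V, W, a, b}."""
        def __init__(self, z_dim, hid_dim, out_dim):
            super().__init__()
            self.Wb = nn.Linear(z_dim, hid_dim)      # computes W z + b
            self.Va = nn.Linear(hid_dim, out_dim)    # computes V (.) + a
        def forward(self, z):
            return self.Va(torch.sigmoid(self.Wb(z)))

    class RNNCell(nn.Module):
        """h_t = sigma(U z_t + V h_{t-1} + b), with parameters pi = {U, V, b}."""
        def __init__(self, z_dim, hid_dim):
            super().__init__()
            self.U = nn.Linear(z_dim, hid_dim)                # U z_t + b
            self.V = nn.Linear(hid_dim, hid_dim, bias=False)  # V h_{t-1}
        def forward(self, z_t, h_prev):
            return torch.sigmoid(self.U(z_t) + self.V(h_prev))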
What are latent variable models?

Latent variable models give us a joint distribution

    p(x, z; θ).

• x is our observed data
• z is a collection of latent variables
• θ are the deterministic parameters of the model, such as the neural network
  parameters
• Data consists of N i.i.d. samples,

    p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(x^{(n)} | z^{(n)}; θ) p(z^{(n)}; θ).
Probabilistic Graphical Models

• A directed PGM shows the conditional independence structure.
• By the chain rule, a latent variable model over observations can be represented as:

  [Graphical model: latent z^{(n)} → observed x^{(n)}, with parameters θ, over a plate of size N]

    p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(x^{(n)} | z^{(n)}; θ) p(z^{(n)}; θ)

• Specific models may factor further.
Posterior Inference

For models p(x, z; θ), we'll be interested in the posterior over latent variables z:

    p(z | x; θ) = p(x, z; θ) / p(x; θ).

Why?

• z will often represent interesting information about our data (e.g., the
  cluster x^{(n)} lives in, or how similar x^{(n)} and x^{(n+1)} are).
• Learning the parameters θ of the model often requires calculating posteriors
  as a subroutine.
• Intuition: if I know a likely z^{(n)} for x^{(n)}, I can learn by maximizing
  p(x^{(n)} | z^{(n)}; θ).
Problem Statement: Two Views

Deep models & LV models are naturally complementary:

• Rich function approximators with modular parts.
• Declarative methods for specifying model constraints.

Deep models & LV models are frustratingly incompatible:

• Deep networks make posterior inference intractable.
• Latent variable objectives complicate backpropagation.
1 Introduction
2 Models
    Discrete Models
    Continuous Models
    Structured Models
3 Variational Objective
4 Inference Strategies
5 Advanced Topics
6 Case Studies
A Language Model

Our goal is to model a sentence, x_1 . . . x_T.

Context: RNN language models are remarkable at this task,

    x_{1:T} ∼ RNNLM(x_{1:T}; θ).

Defined as,

    p(x_{1:T}) = ∏_{t=1}^T p(x_t | x_{<t}) = ∏_{t=1}^T softmax(W h_t)_{x_t}
    where h_t = RNN(h_{t−1}, x_{t−1}; θ)

[Graphical model: observed x_1^{(n)}, . . . , x_T^{(n)} with parameters θ, over a plate of size N]
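Below is a minimal sketch of such an RNN language model in PyTorch (an LSTM stands in for the RNN; the hyperparameters are placeholder assumptions):

    import torch
    import torch.nn as nn

    class RNNLM(nn.Module):
        """p(x_{1:T}) = prod_t softmax(W h_t)_{x_t}, h_t = RNN(h_{t-1}, x_{t-1})."""
        def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.proj = nn.Linear(hid_dim, vocab_size)   # the matrix W

        def log_prob(self, x):
            # x: [batch, T] token ids; predict each x_t from the prefix x_{<t}
            inp, tgt = x[:, :-1], x[:, 1:]
            h, _ = self.rnn(self.emb(inp))
            logp = torch.log_softmax(self.proj(h), dim=-1)
            return logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum(-1)   # [batch]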
A Collection of Model Archetypes

Focus: semi-supervised or unsupervised learning, i.e. don't just learn the
probabilities, but the process. Range of choices in selecting z:

1 Discrete LVs z (clustering)
2 Continuous LVs z (dimensionality reduction)
3 Structured LVs z (structured learning)
Model 1: Discrete Clustering

Inference process:

    "In an old house in Paris that was covered with vines lived twelve little
    girls in two straight lines."  →  Cluster 23

Discrete latent variable models induce a clustering over sentences x^{(n)}.

Example uses:

• Document/sentence clustering [Willett 1988; Aggarwal and Zhai 2012].
• Mixture-of-experts text generation models [Jacobs et al. 1991; Garmash and Monz
  2016; Lee et al. 2016].
Model 1: Discrete - Mixture of Categoricals

Generative process:

1 Draw cluster z ∈ {1, . . . , K} from a categorical with param µ.
2 Draw T words x_t from a categorical with word distribution π_z.

Parameters: θ = {µ ∈ ∆^{K−1}, K × V stochastic matrix π}

Gives rise to the "Naive Bayes" distribution:

    p(x, z; θ) = p(z; µ) × p(x | z; π) = µ_z × ∏_{t=1}^T Cat(x_t; π_z)
               = µ_z × ∏_{t=1}^T π_{z, x_t}
Model 1: Graphical Model View

[Graphical model: µ → z^{(n)} → x_1^{(n)}, . . . , x_T^{(n)} ← π, over a plate of size N]

    p(x^{(1:N)}, z^{(1:N)}; µ, π) = ∏_{n=1}^N p(z^{(n)}; µ) × p(x^{(n)} | z^{(n)}; π)
                                  = ∏_{n=1}^N µ_{z^{(n)}} × ∏_{t=1}^T π_{z^{(n)}, x_t^{(n)}}
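For concreteness, a small sketch of this Naive Bayes log joint for a single sentence in PyTorch (K and V are placeholder values; the parameters are randomly initialized here only for illustration):

    import torch

    K, V = 10, 5000   # number of clusters, vocabulary size (placeholders)
    log_mu = torch.log_softmax(torch.randn(K), dim=-1)      # log mu,  shape K
    log_pi = torch.log_softmax(torch.randn(K, V), dim=-1)   # log pi,  shape K x V

    def nb_log_joint(x, z):
        """log p(x, z; theta) = log mu_z + sum_t log pi_{z, x_t}, for token ids x."""
        x = torch.as_tensor(x)
        return log_mu[z] + log_pi[z, x].sum()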
Deep Model 1: Discrete - Mixture of RNNs

Generative process:

1 Draw cluster z ∈ {1, . . . , K} from a categorical.
2 Draw words x_{1:T} from an RNNLM with parameters π_z.

    p(x, z; θ) = µ_z × RNNLM(x_{1:T}; π_z)

[Graphical model: µ → z^{(n)} → x_1^{(n)}, . . . , x_T^{(n)} ← π, over a plate of size N]
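A corresponding sketch of the mixture-of-RNNs log joint, reusing the RNNLM sketch above with one RNNLM per cluster (a hypothetical arrangement chosen only for illustration):

    # rnnlms: a list of K RNNLM modules (one per cluster); log_mu as in the sketch above.
    def mixture_rnnlm_log_joint(x, z, rnnlms, log_mu):
        """log p(x, z; theta) = log mu_z + log RNNLM(x_{1:T}; pi_z)."""
        return log_mu[z] + rnnlms[z].log_prob(x)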
Difference Between Models

• Dependence structure:
    • Mixture of Categoricals: x_t independent of other x_j given z.
    • Mixture of RNNs: x_t fully dependent.
  Interesting question: how will this affect the learned latent space?

• Number of parameters:
    • Mixture of Categoricals: K × V.
    • Mixture of RNNs: K × d² + V × d, for an RNN with d hidden dims.
Posterior Inference

For both discrete models, can apply Bayes' rule:

    p(z | x; θ) = p(z) × p(x | z) / p(x)
                = p(z) × p(x | z) / Σ_{k=1}^K p(z=k) × p(x | z=k)

• For mixture of categoricals, posterior uses word counts under each π_k.
• For mixture of RNNs, posterior requires running the RNN over x for each k.
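Since z takes only K values, the normalizer is a sum of K terms, and the posterior can be computed by brute-force enumeration. A sketch (the log_joint argument can be either mixture's log p(x, z; θ), e.g. the nb_log_joint sketch above):

    import torch

    def discrete_log_posterior(x, log_joint, K):
        """log p(z | x; theta) for a discrete mixture, by enumerating all K clusters."""
        log_pxz = torch.stack([log_joint(x, k) for k in range(K)])   # log p(x, z=k)
        return log_pxz - torch.logsumexp(log_pxz, dim=0)             # subtract log p(x)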
Model 2: Continuous / Dimensionality Reduction

Inference process:

    "In an old house in Paris that was covered with vines lived twelve little
    girls in two straight lines."  →  [continuous vector representation]

Find a lower-dimensional, well-behaved continuous representation of a sentence.
Latent variables in R^d make distance/similarity easy. Examples:

• Recent work in text generation assumes a latent vector per sentence [Bowman
  et al. 2016; Yang et al. 2017; Hu et al. 2017].
• Certain sentence embeddings (e.g., Skip-Thought vectors [Kiros et al. 2015])
  can be interpreted in this way.
Model 2: Continuous "Mixture"

Generative process:

1 Draw continuous latent variable z from a Normal with param µ.
2 For each t, draw word x_t from a categorical with param softmax(W z).

Parameters: θ = {µ ∈ R^d, π},  π = {W ∈ R^{V×d}}

Intuition: µ is a global distribution; z captures the local word distribution of the
sentence.
Graphical Model View

[Graphical model: µ → z^{(n)} → x_1^{(n)}, . . . , x_T^{(n)} ← π, over a plate of size N]

Gives rise to the joint distribution:

    p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(z^{(n)}; µ) × p(x^{(n)} | z^{(n)}; π)
Deep Model 2: Continuous "Mixture" of RNNs

Generative process:

1 Draw latent variable z ∼ N(µ, I).
2 Draw each token x_t from a conditional RNNLM.

The RNN is also conditioned on the latent z,

    p(x, z; π, µ, I) = p(z; µ, I) × p(x | z; π)
                     = N(z; µ, I) × CRNNLM(x_{1:T}; π, z)

where

    CRNNLM(x_{1:T}; π, z) = ∏_{t=1}^T softmax(W h_t)_{x_t}
    h_t = RNN(h_{t−1}, [x_{t−1}; z]; π)
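A sketch of the conditional decoder CRNNLM in PyTorch, which simply concatenates z to each word embedding before the RNN (dimensions and the LSTM choice are placeholder assumptions):

    import torch
    import torch.nn as nn

    class CRNNLM(nn.Module):
        """p(x | z) = prod_t softmax(W h_t)_{x_t}, h_t = RNN(h_{t-1}, [x_{t-1}; z])."""
        def __init__(self, vocab_size, z_dim=32, emb_dim=128, hid_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim + z_dim, hid_dim, batch_first=True)
            self.proj = nn.Linear(hid_dim, vocab_size)

        def log_prob(self, x, z):
            # x: [batch, T] token ids, z: [batch, z_dim]
            inp, tgt = x[:, :-1], x[:, 1:]
            zrep = z.unsqueeze(1).expand(-1, inp.size(1), -1)   # tile z over time
            h, _ = self.rnn(torch.cat([self.emb(inp), zrep], dim=-1))
            logp = torch.log_softmax(self.proj(h), dim=-1)
            return logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum(-1)   # log p(x | z)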
Graphical Model View

[Graphical model: µ → z^{(n)} → x_1^{(n)}, . . . , x_T^{(n)} ← π, over a plate of size N]
Posterior Inference

For continuous models, Bayes' rule is harder to compute,

    p(z | x; θ) = p(z; µ) × p(x | z; π) / ∫_z p(z; µ) × p(x | z; π) dz

• Shallow and deep Model 2 variants mirror the Model 1 variants exactly, but
  with continuous z.
• The integral is intractable (in general) for both shallow and deep variants.
Model 3: Structure Learning

Inference process:

    "In an old house in Paris that was covered with vines lived twelve little
    girls in two straight lines."  →  [tag / parse / alignment structure]

Structured latent variable models are used to infer unannotated structure:

• Unsupervised POS tagging [Brown et al. 1992; Merialdo 1994; Smith and Eisner 2005]
• Unsupervised dependency parsing [Klein and Manning 2004; Headden III et al. 2009]

Or when structure is useful for interpreting our data:

• Segmentation of documents into topical passages [Hearst 1997]
• Alignment [Vogel et al. 1996]
Model 3: Structured - Hidden Markov Model

Generative process:

1 For each t, draw z_t ∈ {1, . . . , K} from a categorical with param µ_{z_{t−1}}.
2 Draw observed token x_t from a categorical with param π_{z_t}.

Parameters: θ = {K × K stochastic matrix µ, K × V stochastic matrix π}

Gives rise to the joint distribution:

    p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t−1}; µ_{z_{t−1}}) × ∏_{t=1}^T p(x_t | z_t; π_{z_t})
               = ∏_{t=1}^T µ_{z_{t−1}, z_t} × ∏_{t=1}^T π_{z_t, x_t}
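A sketch of the HMM log joint for one sequence (K and V are placeholders; for brevity the initial-state term is absorbed into the first emission rather than modeled separately, which is an assumption of this sketch):

    import torch

    K, V = 5, 1000   # number of states, vocabulary size (placeholders)
    log_mu = torch.log_softmax(torch.randn(K, K), dim=-1)   # transitions, rows sum to 1
    log_pi = torch.log_softmax(torch.randn(K, V), dim=-1)   # emissions,  rows sum to 1

    def hmm_log_joint(x, z):
        """log p(x, z) = sum_t log mu_{z_{t-1}, z_t} + sum_t log pi_{z_t, x_t}."""
        total = log_pi[z[0], x[0]]                  # emission at t = 1
        for t in range(1, len(x)):
            total = total + log_mu[z[t - 1], z[t]] + log_pi[z[t], x[t]]
        return total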
Graphical Model View

[Graphical model: chain z_1 → z_2 → z_3 → z_4 with emissions x_1, . . . , x_4; parameters µ (transitions) and π (emissions); over a plate of size N]

    p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t−1}; µ_{z_{t−1}}) × ∏_{t=1}^T p(x_t | z_t; π_{z_t})
               = ∏_{t=1}^T µ_{z_{t−1}, z_t} × ∏_{t=1}^T π_{z_t, x_t}
Further Extension: Factorial HMM

[Graphical model: L parallel chains z_{l,1} → . . . → z_{l,4} (l = 1, . . . , L), all emitting into x_1, . . . , x_4; over a plate of size N]

    p(x, z; θ) = ∏_{l=1}^L ∏_{t=1}^T p(z_{l,t} | z_{l,t−1}) × ∏_{t=1}^T p(x_t | z_{1:L,t})
Deep Model 3: Deep HMM

Parameterize transition and emission distributions with neural networks (cf. Tran
et al. [2016]):

• Model the transition distribution as

    p(z_t | z_{t−1}) = softmax(MLP(z_{t−1}; µ))

• Model the emission distribution as

    p(x_t | z_t) = softmax(MLP(z_t; π))

Note: K × K transition parameters for the standard HMM vs. O(K × d + d²) for the
deep version.
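A sketch of this neural parameterization, representing each discrete state by a learned embedding (the embedding and MLP sizes are assumptions made for illustration):

    import torch
    import torch.nn as nn

    K, V, d = 5, 1000, 64   # states, vocabulary, hidden size (placeholders)
    state_emb = nn.Embedding(K, d)   # one vector per latent state
    trans_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, K))
    emit_mlp  = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, V))

    def log_transition(z_prev):
        """log p(z_t | z_{t-1}) = log softmax(MLP(z_{t-1}; mu)); z_prev is a state index tensor."""
        return torch.log_softmax(trans_mlp(state_emb(z_prev)), dim=-1)

    def log_emission(z_t):
        """log p(x_t | z_t) = log softmax(MLP(z_t; pi)); returns a length-V log distribution."""
        return torch.log_softmax(emit_mlp(state_emb(z_t)), dim=-1)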
Posterior Inference

For structured models, Bayes' rule may be tractable,

    p(z | x; θ) = p(z; µ) × p(x | z; π) / Σ_{z′} p(z′; µ) × p(x | z′; π)

• Unlike previous models, z contains interdependent "parts."
• For both shallow and deep Model 3 variants, it's possible to calculate
  p(x; θ) exactly, with a dynamic program.
• For some structured models, like the Factorial HMM, the dynamic program may
  still be intractable.
1 Introduction
2 Models
3 Variational Objective
    Maximum Likelihood
    ELBO
4 Inference Strategies
5 Advanced Topics
6 Case Studies
Learning with Maximum Likelihood

Objective: Find model parameters θ that maximize the likelihood of the data,

    θ* = arg max_θ Σ_{n=1}^N log p(x^{(n)}; θ)
Learning Deep Models

    L(θ) = Σ_{n=1}^N log p(x^{(n)}; θ)

[Graphical model: fully observed x with parameters θ, over a plate of size N]

• The dominant framework is gradient-based optimization:

    θ^{(i)} = θ^{(i−1)} + η ∇_θ L(θ)

• ∇_θ L(θ) calculated with backpropagation.
• Tactics: mini-batch based training, adaptive learning rates [Duchi et al. 2011;
  Kingma and Ba 2015].
Learning Deep Latent-Variable Models: Marginalization

Likelihood requires summing out the latent variables,

    p(x; θ) = Σ_{z∈Z} p(x, z; θ)    (= ∫ p(x, z; θ) dz if continuous z)

In general, hard to optimize the log-likelihood for the training set,

    L(θ) = Σ_{n=1}^N log Σ_{z∈Z} p(x^{(n)}, z; θ)

[Graphical model: latent z^{(n)} → observed x^{(n)}, with parameters θ, over a plate of size N]
Variational Inference

High-level: decompose the objective into a lower bound and a gap,

    L(θ) = LB(θ, λ) + GAP(θ, λ)    for some λ

[Figure: L(θ) split into a lower bound LB(θ, λ) and a gap GAP(θ, λ)]

Provides a framework for deriving a rich set of optimization algorithms.
Marginal Likelihood: Variational Decomposition

For any¹ distribution q(z | x; λ) over z,

    L(θ) = E_q[log (p(x, z; θ) / q(z | x; λ))] + KL[q(z | x; λ) ‖ p(z | x; θ)]

where the first term is the ELBO (evidence lower bound) and the second term is the
posterior gap.

Since the KL is always non-negative, L(θ) ≥ ELBO(θ, λ).

¹ Technical condition: supp(q(z)) ⊂ supp(p(z | x; θ))
Evidence Lower Bound: Proof

    log p(x; θ) = E_q[log p(x)]                                      (expectation over z)
                = E_q[log (p(x, z) / p(z | x))]                      (mult/div by p(z | x), combine numerator)
                = E_q[log ((p(x, z) / q(z | x)) · (q(z | x) / p(z | x)))]    (mult/div by q(z | x))
                = E_q[log (p(x, z) / q(z | x))] + E_q[log (q(z | x) / p(z | x))]    (split log)
                = E_q[log (p(x, z; θ) / q(z | x; λ))] + KL[q(z | x; λ) ‖ p(z | x; θ)]
Evidence Lower Bound over Observations

    ELBO(θ, λ; x) = E_{q(z)}[log (p(x, z; θ) / q(z | x; λ))]

• The ELBO is a function of the generative model parameters, θ, and the
  variational parameters, λ.

    Σ_{n=1}^N log p(x^{(n)}; θ) ≥ Σ_{n=1}^N ELBO(θ, λ; x^{(n)})
                                = Σ_{n=1}^N E_{q(z | x^{(n)}; λ)}[log (p(x^{(n)}, z; θ) / q(z | x^{(n)}; λ))]
                                = ELBO(θ, λ; x^{(1:N)}) = ELBO(θ, λ)
Setup: Selecting Variational Family

• Just as with p and θ, we can select any form of q and λ that satisfies the ELBO
  conditions.
• Different choices of q will lead to different algorithms.
• We will explore several forms of q:
    • Posterior
    • Point estimate / MAP
    • Amortized
    • Mean field (later)
Example Family: Full Posterior Form

[Figure: for each data point, a local variational distribution q(z^{(n)} | x^{(n)}; λ^{(n)}) is fit to the model posterior by minimizing KL(q(z | x) ‖ p(z | x))]

λ = [λ^{(1)}, . . . , λ^{(N)}] is a concatenation of local variational parameters λ^{(n)}, e.g.

    q(z^{(n)} | x^{(n)}; λ) = q(z^{(n)} | x^{(n)}; λ^{(n)}) = N(λ^{(n)}, 1)
Example Family: Amortized Parameterization [Kingma and Welling 2014]

[Figure: a single inference network with parameters λ maps each x^{(n)} to its local variational distribution, which is fit to the model posterior by minimizing KL(q(z | x) ‖ p(z | x))]

λ parameterizes a global network (encoder/inference network) that is run over
x^{(n)} to produce the local variational distribution, e.g.

    q(z^{(n)} | x^{(n)}; λ) = N(µ^{(n)}, 1),    µ^{(n)} = enc(x^{(n)}; λ)
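A sketch of such an encoder/inference network for sentences, producing a diagonal Gaussian q(z | x; λ) (the LSTM encoder and all sizes are illustrative assumptions):

    import torch
    import torch.nn as nn

    class GaussianEncoder(nn.Module):
        """Amortized q(z | x; lambda) = N(mu(x), diag(sigma(x)^2))."""
        def __init__(self, vocab_size, z_dim=32, emb_dim=128, hid_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.to_mu = nn.Linear(hid_dim, z_dim)
            self.to_logvar = nn.Linear(hid_dim, z_dim)

        def forward(self, x):
            # x: [batch, T]; summarize the sentence with the final hidden state
            _, (h, _) = self.rnn(self.emb(x))
            h = h[-1]                               # [batch, hid_dim]
            return self.to_mu(h), self.to_logvar(h)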
1 Introduction
2 Models
3 Variational Objective
4 Inference Strategies
    Exact Gradient
    Sampling
    Conjugacy
5 Advanced Topics
6 Case Studies
Maximizing the Evidence Lower Bound

Central quantity of interest: almost all methods are maximizing the ELBO,

    arg max_{θ,λ} ELBO(θ, λ)

Aggregate ELBO objective,

    arg max_{θ,λ} ELBO(θ, λ) = arg max_{θ,λ} Σ_{n=1}^N ELBO(θ, λ; x^{(n)})
                             = arg max_{θ,λ} Σ_{n=1}^N E_q[log (p(x^{(n)}, z^{(n)}; θ) / q(z^{(n)} | x^{(n)}; λ))]
Maximizing ELBO: Model Parameters

    arg max_θ E_q[log (p(x, z; θ) / q(z | x; λ))] = arg max_θ E_q[log p(x, z; θ)]

[Figure: gradient steps on θ increase the ELBO]

Intuition: a maximum likelihood problem under variables drawn from q(z | x; λ).
Model Estimation: Gradient Ascent on Model Parameters

Easy: gradient with respect to θ,

    ∇_θ ELBO(θ, λ; x) = ∇_θ E_q[log p(x, z; θ)]
                      = E_q[∇_θ log p(x, z; θ)]

• Since q does not depend on θ, ∇ moves inside the expectation.
• Estimate with samples from q. The term log p(x, z; θ) is easy to evaluate. (In
  practice a single sample is often sufficient.)
• In special cases, can exactly evaluate the expectation.
Maximizing ELBO: Variational Distribution

    arg max_λ ELBO(θ, λ) = arg max_λ log p(x; θ) − KL[q(z | x; λ) ‖ p(z | x; θ)]
                         = arg min_λ KL[q(z | x; λ) ‖ p(z | x; θ)]

[Figure: optimizing λ shrinks the posterior gap between the ELBO and L(θ)]

Intuition: q should approximate the posterior p(z | x). However, this may be difficult if
q or p is a deep model.
Model Inference: Gradient Ascent on λ?

Hard: gradient with respect to λ,

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log (p(x, z; θ) / q(z | x; λ))]
                      ≠ E_q[∇_λ log (p(x, z; θ) / q(z | x; λ))]

• Cannot naively move ∇ inside the expectation, since q depends on λ.
• This section: inference in practice:
    1 Exact gradient
    2 Sampling: score function, reparameterization
    3 Conjugacy: closed-form, coordinate ascent
Strategy 1: Exact Gradient

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q(z | x; λ)}[log (p(x, z; θ) / q(z | x; λ))]
                      = Σ_{z∈Z} ∇_λ (q(z | x; λ) log (p(x, z; θ) / q(z | x; λ)))

• Naive enumeration: linear in |Z|.
• Depending on the structure of q and p, potentially faster with dynamic
  programming.
• Applicable mainly to Models 1 and 3 (Discrete and Structured), or Model 2
  with a point estimate.
Example: Model 1 - Naive Bayes

[Figure: an inference network enc(x; λ) producing q(z | x), alongside the Naive Bayes model of x given z]

Let q(z | x; λ) = Cat(ν) where ν = enc(x; λ)

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q(z | x; λ)}[log (p(x, z; θ) / q(z | x; λ))]
                      = ∇_λ Σ_{z∈Z} q(z | x; λ) log (p(x, z; θ) / q(z | x; λ))
                      = ∇_λ Σ_{z∈Z} ν_z log (p(x, z; θ) / ν_z)
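A sketch of this exact-gradient computation: enumerate the K values of z, form the ELBO, and let automatic differentiation produce exact gradients with respect to both θ (through log_joint) and λ (through ν). The encoder enc and the nb_log_joint below refer to the earlier sketches and are assumptions, not the tutorial's own code:

    import torch

    def exact_elbo(x, log_joint, nu, K):
        """ELBO = sum_z nu_z (log p(x, z; theta) - log nu_z), by enumeration over z."""
        log_pxz = torch.stack([log_joint(x, k) for k in range(K)])   # log p(x, z=k)
        return (nu * (log_pxz - torch.log(nu))).sum()

    # e.g. nu = torch.softmax(enc(x), dim=-1) for an encoder with parameters lambda;
    # calling (-exact_elbo(x, nb_log_joint, nu, K)).backward() gives exact gradients.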
Strategy 2: Sampling

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log (p(x, z; θ) / q(z | x; λ))]
                      = ∇_λ E_q[log p(x, z; θ)] − ∇_λ E_q[log q(z | x; λ)]

• How can we approximate this gradient with sampling? The naive algorithm fails
  to provide a non-zero gradient:

    z^{(1)}, . . . , z^{(J)} ∼ q(z | x; λ)

    (1/J) Σ_{j=1}^J ∇_λ log p(x, z^{(j)}; θ) = 0

• Manipulate the expression so we can move ∇_λ inside E_q before sampling.
Strategy 2a: Sampling — Score Function Gradient Estimator

First term. Use the basic identity:

    ∇ log q = ∇q / q    ⇒    ∇q = q ∇ log q

Policy-gradient style training [Williams 1992]:

    ∇_λ E_q[log p(x, z; θ)] = Σ_z ∇_λ q(z | x; λ) log p(x, z; θ)
                            = Σ_z q(z | x; λ) ∇_λ log q(z | x; λ) log p(x, z; θ)
                            = E_q[log p(x, z; θ) ∇_λ log q(z | x; λ)]
Strategy 2a: Sampling — Score Function Gradient Estimator

Second term. Need an additional identity:

    Σ ∇q = ∇ Σ q = ∇1 = 0

    ∇_λ E_q[log q(z | x; λ)] = Σ_z ∇_λ (q(z | x; λ) log q(z | x; λ))
                             = Σ_z log q(z | x; λ) q(z | x; λ) ∇_λ log q(z | x; λ) + Σ_z ∇_λ q(z | x; λ)
                             = E_q[log q(z | x; λ) ∇_λ log q(z | x; λ)]

(the product rule plus ∇q = q ∇ log q gives the second line, and the last sum
vanishes since Σ_z ∇_λ q(z | x; λ) = ∇_λ 1 = 0).
Strategy 2a: Sampling — Score Function Gradient Estimator

Putting these together,

    ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log (p(x, z; θ) / q(z | x; λ))]
                      = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)]
                      = E_q[R_{θ,λ}(z) ∇_λ log q(z | x; λ)]
Strategy 2a: Sampling — Score Function Gradient Estimator

Estimate with samples,

    z^{(1)}, . . . , z^{(J)} ∼ q(z | x; λ)

    E_q[R_{θ,λ}(z) ∇_λ log q(z | x; λ)] ≈ (1/J) Σ_{j=1}^J R_{θ,λ}(z^{(j)}) ∇_λ log q(z^{(j)} | x; λ)

Intuition: if a sample z^{(j)} has high reward R_{θ,λ}(z^{(j)}), increase the probability
of z^{(j)} by moving along the gradient ∇_λ log q(z^{(j)} | x; λ).
Strategy 2a: Sampling — Score Function Gradient Estimator

• Essentially reinforcement learning with reward R_{θ,λ}(z).
• The score function gradient is generally applicable regardless of what
  distribution q takes (only need to evaluate ∇_λ log q).
• This generality comes at a cost, since the reward is "black-box": unbiased
  estimator, but high variance.
• In practice, need a variance-reducing control variate B. (More on this later.)
Example: Model 1 - Naive Bayes

[Figure: an inference network enc(x; λ) producing q(z | x), alongside the Naive Bayes model of x given z]

Let q(z | x; λ) = Cat(ν) where ν = enc(x; λ)

Sample z^{(1)}, . . . , z^{(J)} ∼ q(z | x; λ)

    ∇_λ ELBO(θ, λ; x) = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)]
                      ≈ (1/J) Σ_{j=1}^J log (p(x, z^{(j)}; θ) / ν_{z^{(j)}}) ∇_λ log ν_{z^{(j)}}

Computational complexity: O(J) vs O(|Z|)
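A sketch of this score function estimator: sample z^{(j)} from q, treat the reward as a constant (detach), and accumulate a surrogate whose gradient is the estimator above. The log_joint and ν arguments follow the earlier sketches and are assumptions:

    import torch

    def score_function_surrogate(x, log_joint, nu, J=5):
        """Surrogate loss whose gradient w.r.t. lambda is the score function estimator."""
        q = torch.distributions.Categorical(probs=nu)
        surrogate = 0.0
        for _ in range(J):
            z = q.sample()                                               # z^(j) ~ q(z | x; lambda)
            reward = (log_joint(x, int(z)) - torch.log(nu[z])).detach()  # R(z^(j)), held constant
            surrogate = surrogate + reward * q.log_prob(z) / J           # R(z^(j)) log q(z^(j) | x)
        return surrogate   # call (-surrogate).backward() for a gradient-ascent step on lambda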


Strategy 2b: Sampling — Reparameterization

Suppose we can sample from q by applying a deterministic, differentiable
transformation g to a base noise density,

    ε ∼ U    z = g(ε, λ)

Gradient calculation (first term):

    ∇_λ E_{z∼q(z | x; λ)}[log p(x, z; θ)] = ∇_λ E_{ε∼U}[log p(x, g(ε, λ); θ)]
                                          = E_{ε∼U}[∇_λ log p(x, g(ε, λ); θ)]
                                          ≈ (1/J) Σ_{j=1}^J ∇_λ log p(x, g(ε^{(j)}, λ); θ)

where ε^{(1)}, . . . , ε^{(J)} ∼ U
Strategy 2b: Sampling — Reparameterization

• Unbiased, like the score function gradient estimator, but empirically lower
  variance.
• In practice, a single sample is often sufficient.
• Cannot be used out-of-the-box for discrete z.
Tutorial:
Deep Latent NLP Strategy 2: Continuous Latent Variable RNN
(bit.do/lvnlp)


Introduction

Models
λ z z
Variational
Objective
x1 ... xT
Inference
Strategies
x
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References
Choose variational family to be an amortized diagonal Gaussian

q(z | x; λ) = N (µ, σ 2 )        µ, σ 2 = enc(x; λ)

Then we can sample from q(z | x; λ) by

ε ∼ N (0, I)        z = µ + σ ⊙ ε
81/153
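A minimal sketch of this reparameterized sampler in PyTorch (enc is assumed to return µ and log σ; log_joint returning log p(x, z; θ) is a hypothetical helper):

import torch

def reparameterized_elbo_term(x, enc, log_joint):
    mu, log_sigma = enc(x)
    eps = torch.randn_like(mu)            # eps ~ N(0, I)
    z = mu + log_sigma.exp() * eps        # z = mu + sigma * eps, differentiable in lambda
    return log_joint(x, z)                # backward() propagates through z into mu and sigma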
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Strategy 2b: Sampling — Reparameterization

Introduction
Models
Variational
Objective
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References

(Recall Rθ,λ (z) = log (p(x, z; θ) / q(z | x; λ)))

• Score function:

∇λ ELBO(θ, λ; x) = E_{z∼q} [Rθ,λ (z) ∇λ log q(z | x; λ)]

• Reparameterization:

∇λ ELBO(θ, λ; x) = E_{ε∼N (0,I)} [∇λ Rθ,λ (g(ε, λ; x))]

where g(ε, λ; x) = µ + σ ⊙ ε.

Informally, reparameterization gradients differentiate through Rθ,λ (·) and thus
have "more knowledge" about the structure of the objective function.
82/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Exact Gradient
Sampling Exact Gradient
Conjugacy
Sampling
Advanced Topics
Conjugacy
Case Studies

Conclusion
5 Advanced Topics
References

6 Case Studies
83/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Strategy 3: Conjugacy
Introduction

Models
For certain choices for p and q, we can compute parts of
Variational
Objective
arg max ELBO(θ, λ; x)
Inference λ
Strategies
Exact Gradient
exactly in closed-form.
Sampling
Conjugacy

Advanced Topics
Recall that
Case Studies
arg max ELBO(θ, λ; x) = arg min KL[q(z | x; λ)kp(z | x; θ)]
Conclusion λ λ
References

84/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Strategy 3a: Conjugacy — Tractable Posterior Inference

Introduction Suppose we can tractably calculate p(z | x; θ). Then KL[q(z | x; λ)kp(z | x; θ)]
Models is minimized when,
Variational q(z | x; λ) = p(z | x; θ)
Objective

Inference
Strategies • The E-step in Expectation Maximization algorithm [Dempster et al. 1977]
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
[Figure: log marginal likelihood L vs. ELBO as a function of λ; the gap between them is the posterior gap]
Conclusion

References

85/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Example: Model 1 - Naive Bayes

Introduction
λ z z
Models

Variational x1 ... xT
Objective x
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References

p(z | x; θ) = p(x, z; θ) / Σ_{z′=1..K} p(x, z′; θ)

So λ is given by the parameters of the categorical distribution, i.e.

λ = [p(z = 1 | x; θ), . . . , p(z = K | x; θ)]

86/153
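In code, the exact posterior is just a normalization of the joint. A sketch (log_joint_all is a hypothetical helper returning the length-K vector of log p(x, z = k; θ)):

import torch

def exact_posterior(x, log_joint_all):
    log_p_xz = log_joint_all(x)             # shape (K,): log p(x, z = k; theta)
    return torch.softmax(log_p_xz, dim=-1)  # p(z = k | x; theta), computed in a numerically stable way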
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Example: Model 3 — HMM

Introduction
µ
Models

Variational
z1 z2 z3 z4
Objective

Inference
Strategies
Exact Gradient
Sampling x1 x2 x3 x4
Conjugacy

Advanced Topics
Case Studies
Conclusion
References
[Graphical model: HMM with transition parameters µ, emission parameters π, plate over N sequences]

p(x, z; θ) = p(z0) Π_{t=1..T} p(zt | zt−1 ; µ) p(xt | zt ; π)

87/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Example: Model 3 — HMM


Models

Variational
Run forward/backward dynamic programming to calculate posterior marginals,
Objective

Inference p(zt , zt+1 | x; θ)


Strategies
Exact Gradient
Sampling
variational parameters λ ∈ R^{T·K²} store edge marginals. These are enough to
Conjugacy
calculate
Advanced Topics
q(z; λ) = p(z | x; θ)
Case Studies

Conclusion (i.e. the exact posterior) over any sequence z.


References

88/153
Tutorial:
Deep Latent NLP Connection: Gradient Ascent on Log Marginal Likelihood
(bit.do/lvnlp)

Why not perform gradient ascent directly on log marginal likelihood?


Introduction
Models
Variational
Objective
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References

log p(x; θ) = log Σ_z p(x, z; θ)

Same as optimizing ELBO with posterior inference (i.e. EM). Gradients of model
parameters given by (where q(z | x; λ) = p(z | x; θ)):

∇θ log p(x; θ) = E_{q(z | x; λ)} [∇θ log p(x, z; θ)]

[Figure: log marginal likelihood L vs. ELBO; the posterior gap closes when q equals the true posterior]

89/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Connection: Gradient Ascent on Log Marginal Likelihood


Introduction

Models
• Practically, this means we don’t have to manually perform posterior
Variational
Objective inference in the E-step. Can just calculate log p(x; θ) and call
Inference backpropagation.
Strategies
Exact Gradient • Example: in deep HMM, just implement forward algorithm to calculate
Sampling
Conjugacy log p(x; θ) and backpropagate using autodiff. No need to implement
Advanced Topics backward algorithm. (Or vice versa).
Case Studies

Conclusion (See Eisner [2016]: “Inside-Outside and Forward-Backward Algorithms Are Just
References Backprop”)

90/153
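A sketch of the log-space forward algorithm with autograd (the log initial, transition, and emission tensors are assumed to be built from θ, so calling backward() on the result gives ∇θ log p(x; θ) without a hand-written backward pass):

import torch

def hmm_log_marginal(log_pi, log_trans, log_emit):
    # log_pi: (K,); log_trans: (K, K) with [i, j] = log p(z_t = j | z_{t-1} = i);
    # log_emit: (T, K) with [t, k] = log p(x_t | z_t = k)
    T, K = log_emit.shape
    alpha = log_pi + log_emit[0]
    for t in range(1, T):
        # alpha_t(j) = logsumexp_i( alpha_{t-1}(i) + log_trans[i, j] ) + log_emit[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + log_trans, dim=0) + log_emit[t]
    return torch.logsumexp(alpha, dim=0)   # log p(x; theta)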
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Strategy 3b: Conditional Conjugacy


Introduction

Models
• Let p(z | x; θ) be intractable, but suppose p(x, z; θ) is
Variational
Objective conditionally conjugate, meaning p(zt | x, z−t ; θ) is exponential family.
Inference
Strategies
• Restrict the family of distributions q so that it factorizes over zt , i.e.
Exact Gradient
Sampling
T
Y
Conjugacy q(z; λ) = q(zt ; λt )
Advanced Topics t=1

Case Studies
(mean field family)
Conclusion
• Further choose q(zt ; λt ) so that it is in the same family as p(zt | x, z−t ; θ) .
References

91/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Strategy 3b: Conditional Conjugacy


Introduction

Models
(n) (n) KL(q(z)||p(z|x))
Variational
z1 zT z (n)
Objective

Inference
Strategies
(n) (n)
Exact Gradient λ1 λT x(n)
Sampling
N
Conjugacy N θ
Advanced Topics
T
Case Studies Y
q(z; λ) = q(zt ; λt )
Conclusion
t=1
References

92/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Mean Field Family

Introduction
• Optimize ELBO via coordinate ascent, i.e. iterate for λ1 , . . . , λT

Models
Variational
Objective
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
References

arg min_{λt} KL[ Π_{t=1..T} q(zt ; λt) ‖ p(z | x; θ) ]

• Coordinate ascent updates will take the form

q(zt ; λt ) ∝ exp( E_{q(z−t ; λ−t)} [log p(x, z; θ)] )

where

E_{q(z−t ; λ−t)} [log p(x, z; θ)] = Σ_{z−t} Π_{j≠t} q(zj ; λj) log p(x, z; θ)

• Since p(zt | x, z−t ) was assumed to be in the exponential family, the above
updates can be derived in closed form.
93/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Example: Model 3 — Factorial HMM

Introduction z3,1 z3,2 z3,3 z3,4


Models

Variational
Objective
z2,1 z2,2 z2,3 z2,4

Inference
Strategies
z1,1 z1,2 z1,3 z1,4
Exact Gradient
Sampling
Conjugacy

Advanced Topics x1 x2 x3 x4
Case Studies
Conclusion
References
N

p(x, z; θ) = Π_{l=1..L} Π_{t=1..T} p(zl,t | zl,t−1 ; θ) p(xt | zl,t ; θ)
94/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Example: Model 3 — Factorial HMM

Introduction
z3,1 z3,2 z3,3 z3,4
Models

Variational
Objective z2,1 z2,2 z2,3 z2,4
Inference
Strategies
Exact Gradient z1,1 z1,2 z1,3 z1,4
Sampling
Conjugacy

Advanced Topics x1 x2 x3 x4
Case Studies
N
Conclusion

References
 
q(z1,1 ; λ1,1 ) ∝ exp Eq(z−(1,1) ; λ−(1,1) ) [log p(x, z; θ)]

95/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Example: Model 3 — Factorial HMM

Introduction
z3,1 z3,2 z3,3 z3,4
Models

Variational
Objective z2,1 z2,2 z2,3 z2,4
Inference
Strategies
Exact Gradient z1,1 z1,2 z1,3 z1,4
Sampling
Conjugacy

Advanced Topics x1 x2 x3 x4
Case Studies
N
Conclusion

References
 
q(z2,1 ; λ2,1 ) ∝ exp Eq(z−(2,1) ; λ−(2,1) ) [log p(x, z; θ)]

96/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Example: Model 3 — Factorial HMM


Introduction

Models Exact Inference:


Variational
Objective
Inference
Strategies
Exact Gradient
Sampling
Conjugacy
Advanced Topics
Case Studies
Conclusion
• Naive: K states, L levels =⇒ HMM with K^L states =⇒ O(T K^{2L})
• Smarter: O(T L K^{L+1})

Mean Field:
• Gaussian emissions: O(T L K²) [Ghahramani and Jordan 1996].
• Categorical emission: need more variational approximations, but ultimately
O(L K V T) [Nepal and Yates 2013].
References

97/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics
Gumbel-Softmax
Flows 5 Advanced Topics
IWAE
Gumbel-Softmax
Case Studies
Flows
Conclusion
IWAE
References

6 Case Studies
98/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Advanced Topics
Models

Variational
Objective
1 Gumbel-Softmax: Extend reparameterization to discrete variables.
Inference
Strategies 2 Flows: Optimize a tighter bound by making the variational family q more
Advanced Topics flexible.
Gumbel-Softmax
Flows 3 Importance Weighting: Optimize a tighter bound through importance
IWAE
sampling.
Case Studies

Conclusion

References

99/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics
Gumbel-Softmax
Flows 5 Advanced Topics
IWAE
Gumbel-Softmax
Case Studies
Flows
Conclusion
IWAE
References

6 Case Studies
100/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Challenges of Discrete Variables

Introduction Review: we can always use score function estimator


Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

∇λ ELBO(x, θ, λ) = Eq[ log (p(x, z; θ) / q(z | x; λ)) ∇λ log q(z | x; λ) ]
                 = Eq[ ( log (p(x, z; θ) / q(z | x; λ)) − B ) ∇λ log q(z | x; λ) ]

• Eq [B ∇λ log q(z | x; λ)] = 0 (since E[∇ log q] = Σ q ∇ log q = Σ ∇q = 0)
• Control variate B (not dependent on z, but can depend on x).
• Estimate this quantity with another neural net [Mnih and Gregor 2014]

( B(x; ψ) − log (p(x, z; θ) / q(z | x; λ)) )²
101/153
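A sketch of the baseline-corrected surrogate (baseline_net is a hypothetical network mapping x to a scalar; log_q_z is log q(z | x; λ) for the sampled z, and log_joint returns log p(x, z; θ)):

import torch

def surrogate_with_baseline(x, z, log_q_z, log_joint, baseline_net):
    reward = (log_joint(x, z) - log_q_z).detach()   # R_{theta,lambda}(z), treated as a constant
    b = baseline_net(x)
    surrogate = (reward - b.detach()) * log_q_z     # score-function term with control variate
    baseline_loss = (b - reward).pow(2).mean()      # train psi by regressing B(x; psi) onto R
    return surrogate, baseline_loss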
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]

The “Gumbel-Max” trick [Papandreou and Yuille 2011]


Introduction
Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

p(zk = 1; α) = αk / Σ_{j=1..K} αj

where z = [0, 0, . . . , 1, . . . , 0] is a one-hot vector.
Can sample from p(z; α) by
1 Drawing independent Gumbel noise ε = ε1 , . . . , εK

εk = − log(− log uk )        uk ∼ U(0, 1)

2 Adding εk to log αk , finding argmax, i.e.

i = arg max_k [log αk + εk ]        zi = 1

102/153
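A sketch of the Gumbel-max sampler (log_alpha holds the unnormalized log αk):

import torch

def gumbel_max_sample(log_alpha):
    u = torch.rand_like(log_alpha)
    eps = -torch.log(-torch.log(u))        # eps_k ~ Gumbel(0, 1)
    return torch.argmax(log_alpha + eps)   # index i of the one-hot sample (z_i = 1)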
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]

Reparameterization:
Introduction
Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

z = arg max_{s∈∆^{K−1}} (log α + ε)ᵀ s = g(ε, α)

z = g(ε, α) is a deterministic function applied to stochastic noise.
Let's try applying this:

q(zk = 1 | x; λ) = αk / Σ_{j=1..K} αj        α = enc(x; λ)

(Recalling Rθ,λ (z) = log (p(x, z; θ) / q(z | x; λ))),

∇λ E_{q(z | x; λ)} [Rθ,λ (z)] = ∇λ E_{ε∼Gumbel} [Rθ,λ (g(ε, α))]
                             = E_{ε∼Gumbel} [∇λ Rθ,λ (g(ε, α))]
103/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]

Introduction
Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

But this won't work, because zero gradients (almost everywhere)

z = g(ε, α) = arg max_{s∈∆^{K−1}} (log α + ε)ᵀ s =⇒ ∇λ Rθ,λ (z) = 0

Gumbel-Softmax trick: replace arg max with softmax

z = softmax((log α + ε)/τ)        zk = exp((log αk + εk)/τ) / Σ_{j=1..K} exp((log αj + εj)/τ)

(where τ is a temperature term.)

∇λ E_{q(z | x; λ)} [Rθ,λ (z)] ≈ E_{ε∼Gumbel} [ ∇λ Rθ,λ (softmax((log α + ε)/τ)) ]
104/153
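A sketch of the relaxed sample (the function name is illustrative):

import torch

def gumbel_softmax_sample(log_alpha, tau=1.0):
    u = torch.rand_like(log_alpha)
    eps = -torch.log(-torch.log(u))                          # Gumbel(0, 1) noise
    return torch.softmax((log_alpha + eps) / tau, dim=-1)    # approaches one-hot as tau -> 0

Recent PyTorch versions also provide torch.nn.functional.gumbel_softmax, which implements the same relaxation with an optional hard (straight-through) variant.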
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]


Introduction

Models
• Approaches a discrete distribution as τ → 0 (anneal τ during training).
Variational
Objective • Reparameterizable by construction
Inference
Strategies
• Differentiable and has non-zero gradients

Advanced Topics
Gumbel-Softmax
Flows
IWAE

Case Studies

Conclusion

References (from Maddison et al. [2017])

105/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction

Models Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]
Variational
Objective
• See Maddison et al. [2017] on whether we can use the original categorical
Inference
Strategies
densities p(z), q(z), or need to use relaxed densities pGS (z), qGS (z).
Advanced Topics
Gumbel-Softmax
• Requires that p(x | z; θ) “makes sense” for non-discrete z (e.g. attention).
Flows
IWAE • Lower-variance, but biased gradient estimator. Variance → ∞ as τ → 0.
Case Studies

Conclusion

References

106/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics
Gumbel-Softmax
Flows 5 Advanced Topics
IWAE
Gumbel-Softmax
Case Studies
Flows
Conclusion
IWAE
References

6 Case Studies
107/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Flows [Rezende and Mohamed 2015; Kingma et al. 2016]


Models

Variational
Recall
Objective
log p(x; θ) = ELBO(θ, λ; x) − KL[q(z | x; λ) k p(z | x; θ)]
Inference
Strategies
Bound is tight when variational posterior equals true posterior
Advanced Topics
Gumbel-Softmax
Flows
q(z | x; λ) = p(z | x; θ) =⇒ log p(x; θ) = ELBO(θ, λ; x)
IWAE

Case Studies We want to make q(z | x; λ) as flexible as possible: can we do better than just
Conclusion Gaussian?
References

108/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Flows [Rezende and Mohamed 2015; Kingma et al. 2016]


Introduction

Models
Idea: transform a sample from a simple initial variational distribution,
Variational
Objective
z0 ∼ q(z | x; λ) = N (µ, σ 2 ) µ, σ 2 = enc(x; λ)
Inference
Strategies

Advanced Topics
into a more complex one
Gumbel-Softmax
Flows
IWAE
zK = fK ◦ · · · ◦ f2 ◦ f1 (z0 ; λ)
Case Studies
where fi (zi−1 ; λ)’s are invertible transformations (whose parameters are
Conclusion
absorbed by λ).
References

109/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Flows [Rezende and Mohamed 2015; Kingma et al. 2016]

Introduction
Sample from final variational posterior is given by zK . Density is given by the
Models
change of variables formula:
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies
Conclusion
References

log qK (zK | x; λ) = log q(z0 | x; λ) + Σ_{k=1..K} log |det ∂f_k^{−1}/∂zk|
                   = log q(z0 | x; λ) − Σ_{k=1..K} log |det ∂f_k/∂zk−1|

(first term: log density of Gaussian; sum: log determinant of Jacobian)

Determinant calculation is O(N³) in general, but can be made faster depending
on parameterization of fk
110/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Flows [Rezende and Mohamed 2015; Kingma et al. 2016]


Introduction

Models
Can still use reparameterization to obtain gradients. Letting
Variational
Objective F (z) = fK ◦ · · · ◦ f1 (z),
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies

∇λ ELBO(θ, λ; x) = ∇λ E_{qK (zK | x; λ)} [ log (p(x, z; θ) / qK (zK | x; λ)) ]
                 = ∇λ E_{q(z0 | x; λ)} [ log (p(x, F (z0); θ) / q(z0 | x; λ)) − log |det ∂F/∂z0| ]
                 = E_{ε∼N (0,I)} [ ∇λ ( log (p(x, F (z0); θ) / q(z0 | x; λ)) − log |det ∂F/∂z0| ) ]

Conclusion

References

111/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Flows [Rezende and Mohamed 2015; Kingma et al. 2016]

Introduction
Examples of fk (zk−1 ; λ)
Models
• Normalizing Flows [Rezende and Mohamed 2015]
Variational
Objective
fk (zk−1 ) = zk−1 + uk h(wkᵀ zk−1 + bk )
Inference
Strategies

Advanced Topics
• Inverse Autoregressive Flows [Kingma et al. 2016]
Gumbel-Softmax
Flows
IWAE
fk (zk−1 ) = zk−1 ⊙ σk + µk
Case Studies
σk,d = sigmoid(NN(zk−1,<d )) µk,d = NN(zk−1,<d )
Conclusion

References (In this case the Jacobian is upper triangular, so determinant is just the
product of diagonals)
112/153
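A sketch of a single IAF-style affine step (autoregressive_nn is a hypothetical network whose d-th outputs depend only on zk−1,<d); the returned log-determinant is the term subtracted in the flow density above:

import torch

def iaf_step(z, autoregressive_nn):
    mu, s = autoregressive_nn(z)             # each of shape (d,), autoregressive in z
    sigma = torch.sigmoid(s)
    z_new = z * sigma + mu                   # f_k(z_{k-1}) = z_{k-1} * sigma_k + mu_k
    log_det = torch.log(sigma).sum(dim=-1)   # log |det Jacobian| = sum_d log sigma_d
    return z_new, log_det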
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Flows [Rezende and Mohamed 2015; Kingma et al. 2016]

Models

Variational
Objective

Inference
Strategies

Advanced Topics
Gumbel-Softmax
Flows
IWAE

Case Studies

Conclusion (from Rezende and Mohamed [2015])


References

113/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics
Gumbel-Softmax
Flows 5 Advanced Topics
IWAE
Gumbel-Softmax
Case Studies
Flows
Conclusion
IWAE
References

6 Case Studies
114/153
Tutorial:
Deep Latent NLP Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]
(bit.do/lvnlp)

• Flows are a way of tightening the ELBO by making the variational family
Introduction
more flexible.
Models
• Not the only way: can obtain a tighter lower bound on log p(x; θ) by using
Variational
Objective multiple importance samples.
Inference
Strategies
Consider:
Advanced Topics
Gumbel-Softmax
Flows
IWAE
Case Studies

IK = (1/K) Σ_{k=1..K} p(x, z (k); θ) / q(z (k) | x; λ),

where z (1:K) ∼ Π_{k=1..K} q(z (k) | x; λ).
Conclusion

References
Note that IK is an unbiased estimator of p(x; θ):

Eq(z (1:K) | x; λ) [IK ] = p(x; θ).


115/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]

Introduction
Any unbiased estimator of p(x; θ) can be used to obtain a lower bound, using
Models
Jensen’s inequality:
Variational
Objective
p(x; θ) = Eq(z (1:K) | x; λ) [IK ]
Inference
Strategies
=⇒ log p(x; θ) ≥ Eq(z (1:K) | x; λ) [log IK ]
Advanced Topics
Gumbel-Softmax
Flows
IWAE

= Eq(z (1:K) | x; λ) [ log (1/K) Σ_{k=1..K} p(x, z (k); θ) / q(z (k) | x; λ) ]
Case Studies
However, can also show [Burda et al. 2015]:
Conclusion

References
• log p(x; θ) ≥ E [log IK ] ≥ E [log IK−1 ]
• limK→∞ E [log IK ] = log p(x; θ) under mild conditions
116/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]


Introduction

Models
Variational
Objective
Inference
Strategies
Advanced Topics
Gumbel-Softmax
Flows

Eq(z (1:K) | x; λ) [ log (1/K) Σ_{k=1..K} p(x, z (k); θ) / q(z (k) | x; λ) ]

• Note that with K = 1, we recover the ELBO.
• Can interpret p(x, z (k); θ) / q(z (k) | x; λ) as importance weights.
IWAE
• If q(z | x; λ) is reparameterizable, we can use the reparameterization trick to
Case Studies

Conclusion
optimize E [log IK ] directly.
References • Otherwise, need score function gradient estimators [Mnih and Rezende 2016].

117/153
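A sketch of the K-sample bound (log_w is assumed to hold log p(x, z (k); θ) − log q(z (k) | x; λ) for k = 1..K along the last dimension):

import math
import torch

def iwae_bound(log_w):
    # log (1/K sum_k w_k) = logsumexp(log w) - log K; with K = 1 this is the usual single-sample ELBO estimate
    return torch.logsumexp(log_w, dim=-1) - math.log(log_w.shape[-1])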
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics

Case Studies 5 Advanced Topics


Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries 6 Case Studies
and Topics

Conclusion Sentence VAE


References Encoder/Decoder with Latent Variables
Latent Summaries and Topics
118/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics

Case Studies 5 Advanced Topics


Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries 6 Case Studies
and Topics

Conclusion Sentence VAE


References Encoder/Decoder with Latent Variables
Latent Summaries and Topics
119/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Sentence VAE Example [Bowman et al. 2016]


Introduction

Models Generative Model (Model 2):


Variational • Draw z ∼ N (0, I)
Objective

Inference • Draw xt | z ∼ CRNNLM(θ, z)


Strategies
Variational Model (Amortized): Deep Diagonal Gaussians,
Advanced Topics

Case Studies
q(z | x; λ) = N (µ, σ 2 )
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries h̃T = RNN(x; ψ)
and Topics

Conclusion µ = W1 h̃T σ 2 = exp(W2 h̃T ) λ = {W1 , W2 , ψ}


References

120/153
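A single-sample ELBO sketch for this model, with the Gaussian KL in closed form (encoder_rnn, W1, W2, and decoder_log_likelihood are hypothetical modules matching the parameterization above):

import torch

def sentence_vae_elbo(x, encoder_rnn, W1, W2, decoder_log_likelihood):
    h = encoder_rnn(x)                                   # h_tilde_T
    mu, logvar = W1(h), W2(h)                            # sigma^2 = exp(W2 h_tilde_T)
    eps = torch.randn_like(mu)
    z = mu + (0.5 * logvar).exp() * eps                  # reparameterized sample from q(z | x; lambda)
    rec = decoder_log_likelihood(x, z)                   # sum_t log p(x_t | x_<t, z; theta)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()   # KL[q(z | x) || N(0, I)]
    return rec - kl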
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Sentence VAE Example [Bowman et al. 2016]

Introduction

Models

Variational
Objective

Inference
Strategies
(from Bowman et al. [2016])
Advanced Topics

Case Studies 
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries λ z z
and Topics

Conclusion
x1 ... xT
References
x
121/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Issue 1: Posterior Collapse

Introduction
Models
Variational
Objective
Inference
Strategies

ELBO(θ, λ) = Eq(z | x; λ) [ log (p(x, z; θ) / q(z | x; λ)) ]
           = Eq(z | x; λ) [log p(x | z; θ)] − KL[q(z | x; λ) ‖ p(z)]
             (reconstruction likelihood)        (regularizer)
Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
Model L/ELBO Reconstruction KL
with Latent Variables
Latent Summaries
and Topics RNN LM -329.10 - -
Conclusion RNN VAE -330.20 -330.19 0.01
References

(On Yahoo Corpus from Yang et al. [2017])


122/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Issue 1: Posterior Collapse


Introduction

Models
• x and z become independent, and p(x, z; θ) reduces to a non-LV language
Variational
Objective
model.
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
• Chen et al. [2017]: If it's possible to model p⋆(x) without making use of z, then
ELBO optimum is at:

p⋆(x) = p(x | z; θ) = p(x; θ)        q(z | x; λ) = p(z)

KL[q(z | x; λ) ‖ p(z)] = 0
Conclusion

References

123/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Mitigating Posterior Collapse
Introduction
Use less powerful likelihood models [Miao et al. 2016; Yang et al. 2017], or “word
Models

Variational
dropout” [Bowman et al. 2016].
Objective

Inference
Strategies Model LL/ELBO Reconstruction KL
Advanced Topics
RNN LM -329.1 - -
Case Studies
RNN VAE -330.2 -330.2 0.01
Sentence VAE
Encoder/Decoder
with Latent Variables
+ Word Drop -334.2 -332.8 1.44
Latent Summaries
and Topics CNN VAE -332.1 -322.1 10.0
Conclusion

References (On Yahoo Corpus from Yang et al. [2017])

124/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Mitigating Posterior Collapse

Introduction Gradually anneal multiplier on KL term, i.e.


Models
Eq(z | x; λ) [log p(x | z; θ)] − β KL[q(z | x; λ)kp(z)]
Variational
Objective

Inference β goes from 0 to 1 as training progresses


Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

Conclusion

References

(from Bowman et al. [2016])


125/153
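A minimal annealing schedule (the warm-up length is illustrative):

def kl_weight(step, warmup=10000):
    # beta rises linearly from 0 to 1 over the first `warmup` updates, then stays at 1
    return min(1.0, step / warmup)

# training loss at each step: -(reconstruction - kl_weight(step) * kl)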
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Mitigating Posterior Collapse


Introduction

Models Other approaches:


Variational • Use auxiliary losses (e.g. train z as part of a topic model) [Dieng et al. 2017;
Objective
Wang et al. 2018]
Inference
Strategies
• Use von Mises–Fisher distribution with a fixed concentration parameter [Guu
Advanced Topics
et al. 2017; Xu and Durrett 2018]
Case Studies
Sentence VAE • Combine stochastic/amortized variational inference [Kim et al. 2018]
Encoder/Decoder
with Latent Variables
• Add skip connections [Dieng et al. 2018]
Latent Summaries
and Topics

Conclusion
In practice, often necessary to combine various methods.
References

126/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Issue 2: Evaluation


Models

Variational
Objective
• ELBO always lower bounds log p(x; θ), so can calculate an upper bound on
Inference PPL efficiently.
Strategies
• When reporting ELBO, should also separately report,
Advanced Topics

Case Studies
KL[q(z | x; λ)kp(z)]
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
to give an indication of how much the latent variable is being “used”.
and Topics

Conclusion

References

127/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Issue 2: Evaluation
Introduction

Models Also can evaluate log p(x; θ) with importance sampling


Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

p(x; θ) = Eq(z | x; λ) [ p(x | z; θ) p(z) / q(z | x; λ) ]
        ≈ (1/K) Σ_{k=1..K} p(x | z (k); θ) p(z (k)) / q(z (k) | x; λ)

So

=⇒ log p(x; θ) ≈ log (1/K) Σ_{k=1..K} p(x | z (k); θ) p(z (k)) / q(z (k) | x; λ)
Conclusion

References

128/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Evaluation

Introduction
Qualitative evaluation
Models
• Evaluate samples from prior/variational posterior.
Variational
Objective • Interpolation in latent space.
Inference
Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

Conclusion

References
(from Bowman et al. [2016])
129/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics

Case Studies 5 Advanced Topics


Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries 6 Case Studies
and Topics

Conclusion Sentence VAE


References Encoder/Decoder with Latent Variables
Latent Summaries and Topics
130/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Encoder/Decoder [Sutskever et al. 2014; Cho et al. 2014]

Introduction

Models

Variational
Objective

Inference
Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables Given: Source information s = s1 , . . . , sM .
Latent Summaries
and Topics

Conclusion
Generative process:
References • Draw x1:T | s ∼ CRNNLM(θ, enc(s)).

131/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Latent, Per-token Experts [Yang et al. 2018]

Introduction
Generative process: For t = 1, . . . , T ,
Models • Draw zt | x<t , s ∼ softmax(U ht ).
Variational • Draw xt | zt , x<t , s ∼ softmax(W tanh(Qzt ht ); θ)
Objective

Inference
Strategies (n) (n) (n)
z1 z2 zT
Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
(n)
x1
(n)
x2 ... (n)
xT
Latent Summaries
and Topics
N
Conclusion

References

If U ∈ R^{K×d}, we use K experts; this increases the flexibility of the per-token distribution.


132/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Case-Study: Latent Per-token Experts [Yang et al. 2018]

Introduction
Learning: zt are independent given x<t , so we can marginalize at each time-step
Models
(Method 3: Conjugacy).
Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Conclusion

arg max_θ log p(x | s; θ) =
arg max_θ log Π_{t=1..T} Σ_{k=1..K} p(zt =k | s, x<t ; θ) p(xt | zt =k, x<t , s; θ).

Test-time:

arg max_{x1:T} Π_{t=1..T} Σ_{k=1..K} p(zt =k | s, x<t ; θ) p(xt | zt =k, x<t , s; θ).
References

133/153
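A sketch of the per-token marginal in log space (the shapes U ∈ R^{K×d}, Q stacked as (K, d, d), output matrix W ∈ R^{V×d}, and the target token index x_t are illustrative assumptions):

import torch

def mixture_of_experts_log_prob(h_t, U, Q, W, x_t):
    log_prior = torch.log_softmax(U @ h_t, dim=-1)                 # log p(z_t = k | x_<t, s)
    expert_h = torch.tanh(Q @ h_t)                                 # (K, d): tanh(Q_k h_t) per expert
    log_emit = torch.log_softmax(expert_h @ W.t(), dim=-1)         # (K, V): per-expert word distributions
    return torch.logsumexp(log_prior + log_emit[:, x_t], dim=-1)   # log sum_k p(z_t=k) p(x_t | z_t=k)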
Tutorial:
Deep Latent NLP Case-Study: Latent, Per-token Experts [Yang et al. 2018]
(bit.do/lvnlp)

PTB language modeling results (s is constant):


Introduction

Models Model PPL


Variational
Objective Merity et al. [2018] 57.30
Inference Softmax-mixture [Yang et al. 2018] 54.44
Strategies

Advanced Topics

Case Studies
Dialogue generation results (s is context):
Sentence VAE
Encoder/Decoder
with Latent Variables
Model BLEU
Latent Summaries
and Topics
Prec Rec
Conclusion

References No mixture 14.1 11.1


Softmax-mixture [Yang et al. 2018] 15.7 12.3
134/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
Attention [Bahdanau et al. 2015]

Introduction

Models

Variational
Objective

Inference
Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries Decoding with an attention mechanism:
and Topics

Conclusion
References

xt | x<t , s ∼ softmax(W [ht , Σ_{m=1..M} αt,m enc(s)m ]).
135/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Copy Attention [Gu et al. 2016; Gulcehre et al. 2016]


Introduction

Models Copy attention models copying words directly from s.


Variational
Objective Generative process: For t = 1, . . . , T ,
Inference
Strategies
• Set αt to be attention weights.
Advanced Topics • Draw zt | x<t , s ∼ Bern(MLP([ht , enc(s)])).
Case Studies • If zt = 0
Sentence VAE
Encoder/Decoder • Draw xt | zt , x<t , s ∼ softmax(W ht ).
with Latent Variables
Latent Summaries
and Topics
• Else
Conclusion • Draw xt ∈ {s1 , . . . , sM } | zt , x<t , s ∼ Cat(αt ).
References

136/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp) Copy Attention

Introduction Learning: Can maximize the log per-token marginal [Gu et al. 2016], as with
Models per-token experts:
Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Conclusion
References

max_θ log p(x1 , . . . , xT | s; θ)
= max_θ log Π_{t=1..T} Σ_{z′∈{0,1}} p(zt = z′ | s, x<t ; θ) p(xt | z′, x<t , s; θ).

Test-time:

arg max_{x1:T} Π_{t=1..T} Σ_{z′∈{0,1}} p(zt = z′ | s, x<t ; θ) p(xt | z′, x<t , s; θ).

137/153
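A sketch of the per-token copy marginal (p_gen is the length-V generation distribution, alpha the attention weights over source positions, s_ids the source token ids, and p_copy the gate probability p(z_t = 1); all names are illustrative):

import torch

def copy_marginal(p_gen, alpha, s_ids, p_copy, x_t):
    copy_mass = alpha[s_ids == x_t].sum()                  # attention mass on source copies of x_t
    return (1 - p_copy) * p_gen[x_t] + p_copy * copy_mass  # p(x_t | x_<t, s), summing out z_t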
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction

Models
Attention as a Latent Variable [Deng et al. 2018]
Variational
Objective
Generative process: For t = 1, . . . , T ,
Inference
Strategies • Set αt to be attention weights.
Advanced Topics
• Draw zt | x<t , s ∼ Cat(αt ).
Case Studies
Sentence VAE • Draw xt | zt , x<t , s ∼ softmax(W [ht , enc(szt )]; θ).
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

Conclusion

References

138/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Attention as a Latent Variable [Deng et al. 2018]


Introduction

Models
Marginal likelihood under latent attention model:
Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Conclusion

p(x1:T | s; θ) = Π_{t=1..T} Σ_{m=1..M} αt,m softmax(W [ht , enc(sm )]; θ)_{xt} .

Standard attention likelihood:

p(x1:T | s; θ) = Π_{t=1..T} softmax(W [ht , Σ_{m=1..M} αt,m enc(sm )]; θ)_{xt} .
References

139/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Attention as a Latent Variable [Deng et al. 2018]

Introduction

Models
Learning Strategy #1: Maximize the log marginal via enumeration as above.
Variational
Objective
Learning Strategy #2: Maximize the ELBO with AVI:
Inference
Strategies

Advanced Topics max Eq(zt ; λ) [log p(xt | x<t , zt , s)] − KL[q(zt ; λ)kp(zt | x<t , s)].
λ,θ
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables • q(zt | x; λ) approximates p(zt | x1:T , s; θ); implemented with a BLSTM.
Latent Summaries
and Topics
• q isn’t reparameterizable, so gradients obtained using REINFORCE +
Conclusion
baseline.
References

140/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Attention as a Latent Variable [Deng et al. 2018]


Introduction

Models
Test-time: Calculate p(xt | x<t , s; θ) by summing out zt .
Variational
Objective
MT Results on IWSLT-2014:
Inference
Strategies

Advanced Topics Model PPL BLEU


Case Studies
Standard Attn 7.03 32.31
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Attn (marginal) 6.33 33.08
Latent Summaries
and Topics
Latent Attn (ELBO) 6.13 33.09
Conclusion

References

141/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Encoder/Decoder with Structured Latent Variables
Models

Variational At least two EMNLP 2018 papers augment encoder/decoder text generation
Objective

Inference
models with structured latent variables:
Strategies
1 Lee et al. [2018] generate x1:T by iteratively refining sequences of words z1:T .
Advanced Topics

Case Studies
Sentence VAE 2 Wiseman et al. [2018] generate x1:T conditioned on a latent template or plan
Encoder/Decoder
with Latent Variables z1:S .
Latent Summaries
and Topics

Conclusion

References

142/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction 2 Models
Models

Variational 3 Variational Objective


Objective

Inference
Strategies 4 Inference Strategies
Advanced Topics

Case Studies 5 Advanced Topics


Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries 6 Case Studies
and Topics

Conclusion Sentence VAE


References Encoder/Decoder with Latent Variables
Latent Summaries and Topics
143/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Summary as a Latent Variable [Miao and Blunsom 2016]

Models
Generative process for a document x = x1 , . . . , xT :
Variational
Objective • Draw a latent summary z1 , . . . , zM ∼ RNNLM(θ)
Inference
Strategies • Draw x1 , . . . , xT | z1:M ∼ CRNNLM(θ, z)
Advanced Topics

Case Studies
Sentence VAE
Posterior Inference:
Encoder/Decoder
with Latent Variables
Latent Summaries p(z1:M | x1:T ; θ) = p(summary | document; θ).
and Topics

Conclusion

References

144/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction Summary as a Latent Variable [Miao and Blunsom 2016]


Models

Variational Learning: Maximize the ELBO with amortized family:


Objective

Inference max Eq(z1:M ; λ) [log p(x1:T | z1:M ; θ)] − KL[q(z1:M ; λ)kp(z1:M ; θ)]
Strategies λ,θ

Advanced Topics

Case Studies
• q(z1:M ; λ) approximates p(z1:M | x1:T ; θ); also implemented with
Sentence VAE
encoder/decoder RNNs.
Encoder/Decoder
with Latent Variables
Latent Summaries • q(z1:M ; λ) not reparameterizable, so gradients use REINFORCE + baselines.
and Topics

Conclusion

References

145/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Summary as a Latent Variable [Miao and Blunsom 2016]

Models
Semi-supervised Training: Can also use documents without corresponding
Variational
Objective summaries in training.
Inference
Strategies
• Train q(z1:M ; λ) ≈ p(z1:M | x1:T ; θ) with labeled examples.

Advanced Topics
• Infer summary z for an unlabeled document with q.
Case Studies
Sentence VAE
Encoder/Decoder • Use inferred z to improve model p(x1:T | z1:M ; θ).
with Latent Variables
Latent Summaries
and Topics
• Allows for outperforming strictly supervised models!
Conclusion

References

146/153
Tutorial:
Deep Latent NLP Topic Models [Blei et al. 2003]
(bit.do/lvnlp)

Introduction

Models

Variational
Objective

Inference
Strategies

Advanced Topics

Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Conclusion
References
Generative process: for each document x(n) = x1(n) , . . . , xT(n) ,
• Draw topic distribution ztop(n) ∼ Dir(α)
• For t = 1, . . . , T :
  • Draw topic zt(n) ∼ Cat(ztop(n))
  • Draw xt(n) ∼ Cat(β_{zt(n)})
147/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Simple, Deep Topic Models [Miao et al. 2017]

Models
Variational
Objective
Inference
Strategies
Advanced Topics
Case Studies
Sentence VAE
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics
Motivation: easy to learn deep topic models with VI if q(ztop(n) ; λ) is
reparameterizable.

Idea: draw ztop(n) from a transformation of a Gaussian.
• Draw z0(n) ∼ N (µ0 , σ0²)
• Set ztop(n) = softmax(W z0(n)).
• Use analogous transformation when drawing from q(ztop(n) ; λ).
Conclusion

References

148/153
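A sketch of the reparameterized topic-proportion draw (mu0 and logvar0 parameterizing the Gaussian and the projection W are illustrative names; the same transformation is applied when sampling from q):

import torch

def sample_topic_proportions(mu0, logvar0, W):
    eps = torch.randn_like(mu0)
    z0 = mu0 + (0.5 * logvar0).exp() * eps      # z_0 ~ N(mu_0, sigma_0^2), reparameterized
    return torch.softmax(W @ z0, dim=-1)        # z_top lies on the topic simplex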
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Simple, Deep Topic Models [Miao et al. 2017]


Introduction
(n)
Models Learning Step #1: Marginalize out per-word latents zt .
Variational
Objective
N T X
K
Inference (n)
Y (n)
Y (n)
Strategies p({x(n) }N N
n=1 , {ztop }n=1 ; θ) = p(ztop | θ) ztop,k βk,x(n)
t
Advanced Topics
n=1 t=1 k=1

Case Studies
Sentence VAE
Encoder/Decoder
Learning Step #2: Use AVI to optimize resulting ELBO.
with Latent Variables
Latent Summaries
h i
(n) (n) (n)
and Topics max Eq(z(n) ; λ) log p(x(n) | ztop ; θ) − KL[N (z0 ; λ)kN (z0 ; µ0 , σ 20 )]
λ,θ top
Conclusion

References

149/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction
Simple, Deep Topic Models [Miao et al. 2017]
Models

Variational
Objective Perplexities on held-out documents, for three datasets:
Inference
Strategies
Model MXM 20News RCV1
Advanced Topics
OnlineLDA [Hoffman et al. 2010] 342 1015 1058
Case Studies
Sentence VAE AVI-LDA [Miao et al. 2017] 272 830 602
Encoder/Decoder
with Latent Variables
Latent Summaries
and Topics

Conclusion

References

150/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)
1 Introduction

Introduction
2 Models
Models

Variational
Objective 3 Variational Objective
Inference
Strategies
4 Inference Strategies
Advanced Topics

Case Studies
5 Advanced Topics
Conclusion

References
6 Case Studies

7 Conclusion
151/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Deep Latent-Variable NLP: Two Views


Introduction

Models Deep Models & LV Models are naturally complementary:


Variational
Objective
• Rich set of model choices: discrete, continuous, and structured.
Inference • Real applications across NLP including some state-of-the-art models.
Strategies

Advanced Topics

Case Studies Deep Models & LV Models are frustratingly incompatible:


Conclusion
• Many interesting approaches to the problem: reparameterization,
References
score-function, and more.
• Lots of area for research into improved approaches.

152/153
Tutorial:
Deep Latent NLP
(bit.do/lvnlp)

Introduction

Models
Implementation
Variational
Objective
• Modern toolkits make it easy to implement these models.
Inference
Strategies • Combine the flexibility of auto-differentiation for optimization (PyTorch)
Advanced Topics with distribution and VI libraries (Pyro).
Case Studies
In fact, we have implemented this entire tutorial. See website link:
Conclusion
http://bit.do/lvnlp
References

153/153
Tutorial:
Charu C. Aggarwal and ChengXiang Zhai. 2012. A Survey of Text Clustering Algorithms. In
Deep Latent NLP Mining Text Data, pages 77–128. Springer.
(bit.do/lvnlp)
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by
Jointly Learning to Align and Translate. In Proceedings of ICLR.
Introduction
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of
Models
machine Learning research, 3(Jan):993–1022.
Variational
Samuel R. Bowman, Luke Vilnis, Oriol Vinyal, Andrew M. Dai, Rafal Jozefowicz, and Samy
Objective
Bengio. 2016. Generating Sentences from a Continuous Space. In Proceedings of CoNLL.
Inference
Strategies Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai.
Advanced Topics
1992. Class-based N-gram Models of Natural Language. Computational Linguistics,
18(4):467–479.
Case Studies
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. 2015. Importance Weighted
Conclusion
Autoencoders. In Proceedings of ICLR.
References
Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya
Sutskever, and Pieter Abbeel. 2017. Variational Lossy Autoencoder. In Proceedings of ICLR.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the
Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of
the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. 153/153
Tutorial:
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum Likelihood from
Deep Latent NLP Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B,
(bit.do/lvnlp)
39(1):1–38.
Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander M. Rush. 2018. Latent
Introduction
Alignment and Variational Attention. In Proceedings of NIPS.
Models
Adji B. Dieng, Yoon Kim, Alexander M. Rush, and David M. Blei. 2018. Avoiding Latent
Variational Variable Collapse with Generative Skip Models. In Proceedings of the ICML Workshop on
Objective
Theoretical Foundations and Applications of Deep Generative Models.
Inference
Strategies Adji B. Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2017. TopicRNN: A Recurrent
Neural Network With Long-Range Semantic Dependency. In Proceedings of ICLR.
Advanced Topics
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online
Case Studies
Learning and Stochastic Optimization. Journal of Machine Learning Research, 12.
Conclusion
Jason Eisner. 2016. Inside-Outside and Forward-Backward Algorithms Are Just Backprop
References
(Tutorial Paper). In Proceedings of the Workshop on Structured Prediction for NLP.
Ekaterina Garmash and Christof Monz. 2016. Ensemble Learning for Multi-source Neural
Machine Translation. In Proceedings of COLING.
Zoubin Ghahramani and Michael I. Jordan. 1996. Factorial Hidden Markov Models. In
Proceedings of NIPS. 153/153
Tutorial:
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism
Deep Latent NLP in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the
(bit.do/lvnlp)
Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages
1631–1640.
Introduction
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016.
Models
Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association
Variational
for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 140–149.
Objective

Inference
Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2017. Generating
Strategies Sentences by Editing Prototypes. arXiv:1709.08878.
Advanced Topics William P Headden III, Mark Johnson, and David McClosky. 2009. Improving Unsupervised
Case Studies Dependency Parsing with Richer Contexts and Smoothing. In Proceedings of NAACL.
Conclusion Marti A Hearst. 1997. Texttiling: Segmenting text into multi-paragraph subtopic passages.
References Computational linguistics, 23(1):33–64.
Matthew Hoffman, Francis R Bach, and David M Blei. 2010. Online learning for latent dirichlet
allocation. In advances in neural information processing systems, pages 856–864.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward
Controlled Generation of Text. In Proceedings of ICML. 153/153
Tutorial:
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive
Deep Latent NLP Mixtures of Local Experts. Neural Computation, 3(1):79–87.
(bit.do/lvnlp)
Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with
Gumbel-Softmax. In Proceedings of ICLR.
Introduction
Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, and Alexander M. Rush. 2018.
Models
Semi-Amortized Variational Autoencoders. In Proceedings of ICML.
Variational Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In
Objective
Proceedings of ICLR.
Inference
Strategies Diederik P. Kingma, Tim Salimans, and Max Welling. 2016. Improving Variational Inference
with Autoregressive Flow. arXiv:1606.04934.
Advanced Topics
Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proceedings
Case Studies
of ICLR.
Conclusion
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio
References
Torralba, and Sanja Fidler. 2015. Skip-thought Vectors. In Proceedings of NIPS.
Dan Klein and Christopher D Manning. 2004. Corpus-based Induction of Syntactic Structure:
Models of Dependency and Constituency. In Proceedings of ACL.
Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic Non-Autoregressive
Neural Sequence Modeling by Iterative Refinement. In Proceedings of EMNLP. 153/153
Tutorial:
Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David
Deep Latent NLP Crandall, and Dhruv Batra. 2016. Stochastic Multiple Choice Learning for Training Diverse
(bit.do/lvnlp)
Deep Ensembles. In Proceedings of NIPS.
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A
Introduction
Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR.
Models
Bernard Merialdo. 1994. Tagging English Text with a Probabilistic Model. Computational
Variational
Objective Linguistics, 20(2):155–171.

Inference Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing
Strategies
LSTM language models. In International Conference on Learning Representations.
Advanced Topics
Yishu Miao and Phil Blunsom. 2016. Language as a Latent Variable: Discrete Generative
Case Studies Models for Sentence Compression. In Proceedings of EMNLP.
Conclusion Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering Discrete Latent Topics
References with Neural Variational Inference. In Proceedings of ICML.
Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural Variational Inference for Text Processing.
In Proceedings of ICML.
Andriy Mnih and Danilo J. Rezende. 2016. Variational Inference for Monte Carlo Objectives. In
Proceedings of ICML. 153/153
Tutorial:
Andryi Mnih and Karol Gregor. 2014. Neural Variational Inference and Learning in Belief
Deep Latent NLP Networks. In Proceedings of ICML.
(bit.do/lvnlp)
Anjan Nepal and Alexander Yates. 2013. Factorial Hidden Markov Models for Learning
Representations of Natural Language. arXiv:1312.6168.
Introduction
George Papandreou and Alan L. Yuille. 2011. Perturb-and-Map Random Fields: Using Discrete
Models
Optimization to Learn and Sample from Energy Models. In Proceedings of ICCV.
Variational
Danilo J. Rezende and Shakir Mohamed. 2015. Variational Inference with Normalizing Flows.
Objective
In Proceedings of ICML.
Inference
Strategies Noah A. Smith and Jason Eisner. 2005. Contrastive Estimation: Training Log-Linear Models on
Advanced Topics
Unlabeled Data. In Proceedings of ACL.
Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to Sequence Learning with Neural
Case Studies
Networks. In Proceedings of NIPS.
Conclusion
Ke Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised
References
Neural Hidden Markov Models. In Proceedings of the Workshop on Structured Prediction for
NLP.
Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. Hmm-based word alignment in
statistical translation. In Proceedings of the 16th conference on Computational
linguistics-Volume 2, pages 836–841. Association for Computational Linguistics. 153/153
Tutorial:
Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh,
Deep Latent NLP and Lawrence Carin. 2018. Topic Compositional Neural Language Model. In Proceedings of
(bit.do/lvnlp)
AISTATS.
Peter Willett. 1988. Recent Trends in Hierarchic Document Clustering: A Critical Review.
Introduction
Information Processing & Management, 24(5):577–597.
Models
Ronald J. Williams. 1992. Simple Statistical Gradient-following Algorithms for Connectionist
Variational
Objective Reinforcement Learning. Machine Learning, 8.

Inference Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning Neural Templates
Strategies
for Text Generation. In Proceedings of EMNLP.
Advanced Topics
Jiacheng Xu and Greg Durrett. 2018. Spherical Latent Spaces for Stable Variational
Case Studies Autoencoders. In Proceedings of EMNLP.
Conclusion Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the
References Softmax Bottleneck: A High-Rank RNN Language Model. In Proceedings of ICLR.
Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved
Variational Autoencoders for Text Modeling using Dilated Convolutions. In Proceedings of
ICML.
153/153
