
Variational Autoencoders - An Introduction

Devon Graham

University of British Columbia


drgraham@cs.ubc.ca

Oct 31st, 2017


Table of contents

Introduction

Deep Learning Perspective

Probabilistic Model Perspective

Applications

Conclusion
Introduction

- Auto-Encoding Variational Bayes, Diederik P. Kingma and Max Welling, ICLR 2014
- Generative model
- Running example: want to generate realistic-looking MNIST digits (or celebrity faces, video game plants, cat pictures, etc.)
- https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
- Deep Learning perspective and Probabilistic Model perspective
Introduction - Autoencoders

- Attempt to learn the identity function
- Constrained in some way (e.g., a small latent vector representation)
- Can generate new images by giving different latent vectors to the trained network
- Variational: use a probabilistic latent encoding
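The bottleneck idea can be sketched in a few lines of NumPy. This is purely illustrative of the shapes involved: the weights are random stand-ins, not a trained autoencoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Untrained, random weights -- these only illustrate the bottleneck shapes.
W_enc = rng.normal(0.0, 0.01, size=(2, 784))   # encoder: 784 pixels -> 2-d latent
W_dec = rng.normal(0.0, 0.01, size=(784, 2))   # decoder: 2-d latent -> 784 pixels

x = rng.random(784)                            # stand-in flattened MNIST image
z = np.tanh(W_enc @ x)                         # small latent representation
x_tilde = 1.0 / (1.0 + np.exp(-(W_dec @ z)))   # reconstruction, pixels in (0, 1)

print(x.shape, z.shape, x_tilde.shape)         # (784,) (2,) (784,)
```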
Deep Learning Perspective
Deep Learning Perspective

- Goal: build a neural network that generates MNIST digits from random (Gaussian) noise
- Define two sub-networks: Encoder and Decoder
- Define a loss function
Encoder

- A neural network qθ(z|x)
- Input: datapoint x (e.g. a 28 × 28-pixel MNIST digit)
- Output: encoding z, drawn from a Gaussian density with parameters θ
- |z| ≪ |x|
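As a sketch (with random stand-in weights; the real qθ would be a trained network), the encoder maps x to the mean and standard deviation of a diagonal Gaussian over z. Predicting the log-variance is a common trick to keep σ positive by construction:

```python
import numpy as np

rng = np.random.default_rng(1)

def encoder(x, W_mu, W_logvar):
    """Map x to the parameters (mu, sigma) of a diagonal Gaussian q_theta(z|x).
    Predicting log-variance keeps sigma positive by construction."""
    mu = W_mu @ x
    sigma = np.exp(0.5 * (W_logvar @ x))
    return mu, sigma

x = rng.random(784)                           # stand-in 28x28 MNIST digit, flattened
W_mu = rng.normal(0.0, 0.01, size=(2, 784))   # random stand-ins for trained weights
W_logvar = rng.normal(0.0, 0.01, size=(2, 784))

mu, sigma = encoder(x, W_mu, W_logvar)
print(mu.shape, sigma.shape)                  # (2,) (2,): |z| is far smaller than |x|
```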
Decoder

- A neural network pφ(x|z), parameterized by φ
- Input: encoding z, the output of the encoder
- Output: reconstruction x̃, drawn from the distribution of the data
- E.g., output parameters for 28 × 28 Bernoulli variables
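A matching sketch of the Bernoulli case (W is again a random stand-in for trained weights): the decoder emits one pixel-on probability per pixel via a sigmoid.

```python
import numpy as np

rng = np.random.default_rng(2)

def decoder(z, W):
    """Map a latent code z to 784 Bernoulli parameters (per-pixel 'on'
    probabilities) via a sigmoid. W is a random stand-in for p_phi."""
    return 1.0 / (1.0 + np.exp(-(W @ z)))

z = rng.standard_normal(2)               # an encoding, as produced by the encoder
W = rng.normal(0.0, 0.5, size=(784, 2))  # stand-in for trained decoder weights

p = decoder(z, W)
print(p.shape)                           # (784,): one Bernoulli mean per pixel
```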
Loss Function

- x̃ is reconstructed from z, where |z| ≪ |x̃|
- How much information is lost when we go from x to z to x̃?
- Measure this with the reconstruction log-likelihood: log pφ(x|z)
- This measures how effectively the decoder has learned to reconstruct x given the latent representation z
Loss Function

- The loss function is the negative reconstruction log-likelihood plus a regularizer
- The loss decomposes into a term for each datapoint:

  L(θ, φ) = Σ_{i=1}^N l_i(θ, φ)

- Loss for datapoint x_i:

  l_i(θ, φ) = −E_{z∼qθ(z|x_i)}[ log pφ(x_i|z) ] + KL( qθ(z|x_i) || p(z) )
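The per-datapoint loss can be sketched numerically. This estimates the expectation with a single Monte Carlo sample of z and uses the closed-form Gaussian KL; the binary image and the encoder/decoder outputs below are toy stand-ins, not trained quantities.

```python
import numpy as np

rng = np.random.default_rng(3)

def l_i(x, mu, sigma, decode):
    """Single-sample Monte Carlo estimate of l_i(theta, phi): negative
    Bernoulli reconstruction log-likelihood plus the closed-form
    KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    z = mu + sigma * rng.standard_normal(mu.shape)  # one sample z ~ q_theta(z|x_i)
    p = decode(z)                                   # Bernoulli means per pixel
    recon_ll = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
    return -recon_ll + kl

# Toy inputs: a random binary 'image' and stand-in encoder/decoder outputs.
x = (rng.random(784) < 0.5).astype(float)
mu, sigma = np.array([0.2, -0.1]), np.array([0.9, 1.1])
W = rng.normal(0.0, 0.5, size=(784, 2))
loss = l_i(x, mu, sigma, lambda z: 1.0 / (1.0 + np.exp(-(W @ z))))
print(loss > 0.0)  # True: both terms are non-negative here
```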
Loss Function

- Negative reconstruction log-likelihood:

  −E_{z∼qθ(z|x_i)}[ log pφ(x_i|z) ]

- Encourages the decoder to learn to reconstruct the data
- The expectation is taken over the distribution of latent representations
Loss Function

- KL divergence as regularizer:

  KL( qθ(z|x_i) || p(z) ) = E_{z∼qθ(z|x_i)}[ log qθ(z|x_i) − log p(z) ]

- Measures the information lost when using qθ to represent p
- We will use p(z) = N(0, I)
- Encourages the encoder to produce z's that are close to a standard normal distribution
- The encoder learns a meaningful representation of MNIST digits
- Representations of images of the same digit are close together in latent space
- Otherwise the encoder could "memorize" the data and map each observed datapoint to a distinct region of space
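For these Gaussian choices the KL term has a well-known closed form (a standard result, not derived on the slide). It is zero exactly when q already equals the prior and grows as the encoder pushes codes away from it:

```python
import numpy as np

def kl_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

# Zero exactly when q already is the standard normal prior...
print(kl_to_std_normal(np.zeros(2), np.ones(2)))      # 0.0
# ...and it grows as codes drift away from the prior.
print(kl_to_std_normal(np.full(2, 3.0), np.ones(2)))  # 9.0
```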
MNIST latent variable space
Reparameterization trick

- We want to use gradient descent to learn the model's parameters
- Given z drawn from qθ(z|x), how do we take derivatives of (a function of) z w.r.t. θ?
- We can reparameterize: z = µ + σ ⊙ ε
- ε ∼ N(0, I), and ⊙ is the element-wise product
- We can then take derivatives of (functions of) z w.r.t. µ and σ
- The output of qθ(z|x) is a vector of µ's and a vector of σ's
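The trick can be checked numerically: once ε is drawn and held fixed, z is a deterministic function of (µ, σ), and finite differences recover the analytic gradients ∂z/∂µ = 1 and ∂z/∂σ = ε.

```python
import numpy as np

rng = np.random.default_rng(4)

eps = rng.standard_normal(3)   # noise drawn once, outside the network

def z(mu, sigma):
    # With eps fixed, z is a deterministic, differentiable function of (mu, sigma)
    return mu + sigma * eps

mu = np.array([0.0, 1.0, -1.0])
sigma = np.array([1.0, 0.5, 2.0])

# Finite differences recover the analytic gradients dz/dmu = 1, dz/dsigma = eps:
h = 1e-6
dz_dmu = (z(mu + h, sigma) - z(mu, sigma)) / h
dz_dsigma = (z(mu, sigma + h) - z(mu, sigma)) / h
print(np.allclose(dz_dmu, 1.0), np.allclose(dz_dsigma, eps))  # True True
```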
Summary

- The Deep Learning objective is to minimize the loss function:

  L(θ, φ) = Σ_{i=1}^N ( −E_{z∼qθ(z|x_i)}[ log pφ(x_i|z) ] + KL( qθ(z|x_i) || p(z) ) )
Probabilistic Model Perspective
Probabilistic Model Perspective

- Data x and latent variables z
- Joint pdf of the model: p(x, z) = p(x|z)p(z)
- Decomposes into the likelihood p(x|z) and the prior p(z)
- Generative process:
  1. Draw latent variables z_i ∼ p(z)
  2. Draw datapoint x_i ∼ p(x|z)
- Graphical model:
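The two-step generative process above can be sketched as ancestral sampling in NumPy; the decoder weights W are a random stand-in for a trained likelihood model, so the sampled "image" is noise with the right shape.

```python
import numpy as np

rng = np.random.default_rng(5)

# Ancestral sampling from p(x, z) = p(x|z) p(z). The decoder weights W are
# a random stand-in for a trained likelihood model.
W = rng.normal(0.0, 0.5, size=(784, 2))

z_i = rng.standard_normal(2)                # z_i ~ p(z) = N(0, I)
p = 1.0 / (1.0 + np.exp(-(W @ z_i)))        # Bernoulli means for each pixel
x_i = (rng.random(784) < p).astype(float)   # x_i ~ p(x|z): one binary 'image'
print(x_i.shape)                            # (784,)
```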
Probabilistic Model Perspective

- Suppose we want to do inference in this model
- We would like to infer good values of z, given observed data
- Then we could use them to generate real-looking MNIST digits
- We want to calculate the posterior:

  p(z|x) = p(x|z)p(z) / p(x)

- Need to calculate the evidence: p(x) = ∫ p(x|z)p(z) dz
- An integral over all configurations of latent variables
- Intractable
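The intractability is easiest to feel numerically. In a 1-D toy model (my choice, not from the slides) the evidence integral can be brute-forced by averaging the likelihood over prior samples; with hundreds of latent dimensions this naive estimator would need astronomically many samples.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy 1-D model: z ~ N(0, 1), x|z ~ N(z, 0.5^2), so exactly x ~ N(0, 1.25).
def likelihood(x, z, s=0.5):
    return np.exp(-0.5 * ((x - z) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

x = 0.8
zs = rng.standard_normal(100_000)      # samples z ~ p(z)
estimate = likelihood(x, zs).mean()    # p(x) ~= E_{z~p(z)}[ p(x|z) ]
exact = np.exp(-0.5 * x**2 / 1.25) / np.sqrt(2.0 * np.pi * 1.25)
print(abs(estimate - exact) < 0.01)    # True in 1-D; hopeless in high dimensions
```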
Probabilistic Model Perspective

- Variational inference to the rescue!
- Let's approximate the true posterior p(z|x) with the 'best' distribution from some family qλ(z|x)
- Which choice of λ gives the 'best' qλ(z|x)?
- KL divergence measures the information lost when using qλ to approximate p
- Choose λ to minimize KL( qλ(z|x) || p(z|x) ), written KL( qλ || p ) for short
Probabilistic Model Perspective

  KL( qλ || p ) := E_{z∼qλ}[ log qλ(z|x) − log p(z|x) ]
                = E_{z∼qλ}[ log qλ(z|x) ] − E_{z∼qλ}[ log p(x, z) ] + log p(x)

- Still contains a p(x) term! So we cannot compute it directly
- But p(x) does not depend on λ, so there is still hope
Probabilistic Model Perspective

- Define the Evidence Lower BOund:

  ELBO(λ) := E_{z∼qλ}[ log p(x, z) ] − E_{z∼qλ}[ log qλ(z|x) ]

- Then

  KL( qλ || p ) = E_{z∼qλ}[ log qλ(z|x) ] − E_{z∼qλ}[ log p(x, z) ] + log p(x)
                = −ELBO(λ) + log p(x)

- So minimizing KL( qλ || p ) w.r.t. λ is equivalent to maximizing ELBO(λ)
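The identity KL( qλ || p ) = −ELBO(λ) + log p(x) can be verified end to end on a fully tractable 1-D toy model (my choice, not from the slides), where the ELBO, the evidence, and the exact Gaussian posterior are all available in closed form:

```python
import numpy as np

# Tractable 1-D toy: p(z) = N(0, 1), p(x|z) = N(z, s^2), q_lambda(z|x) = N(m, v).
s2 = 0.25
x, m, v = 0.8, 0.3, 0.4

# ELBO(lambda) = E_q[log p(x|z)] + E_q[log p(z)] + entropy of q
E_log_lik = -0.5 * np.log(2 * np.pi * s2) - ((x - m) ** 2 + v) / (2 * s2)
E_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + v)
entropy_q = 0.5 * np.log(2 * np.pi * np.e * v)
elbo = E_log_lik + E_log_prior + entropy_q

# log p(x): marginally x ~ N(0, 1 + s^2)
log_px = -0.5 * np.log(2 * np.pi * (1 + s2)) - 0.5 * x ** 2 / (1 + s2)

# Exact KL between q and the Gaussian true posterior p(z|x)
post_v = 1.0 / (1.0 + 1.0 / s2)
post_m = post_v * x / s2
kl = 0.5 * (v / post_v + (m - post_m) ** 2 / post_v - 1.0 + np.log(post_v / v))

print(np.isclose(kl, -elbo + log_px))  # True: KL(q||p) = -ELBO + log p(x)
```

Since KL ≥ 0, the same numbers also show ELBO(λ) ≤ log p(x), which is what makes it a lower bound on the evidence.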
Probabilistic Model Perspective

- Since no two datapoints share latent variables, we can write:

  ELBO(λ) = Σ_{i=1}^N ELBO_i(λ)

- where

  ELBO_i(λ) = E_{z∼qλ(z|x_i)}[ log p(x_i, z) ] − E_{z∼qλ(z|x_i)}[ log qλ(z|x_i) ]
Probabilistic Model Perspective

- We can rewrite the term ELBO_i(λ):

  ELBO_i(λ) = E_{z∼qλ(z|x_i)}[ log p(x_i, z) ] − E_{z∼qλ(z|x_i)}[ log qλ(z|x_i) ]
            = E_{z∼qλ(z|x_i)}[ log p(x_i|z) + log p(z) ] − E_{z∼qλ(z|x_i)}[ log qλ(z|x_i) ]
            = E_{z∼qλ(z|x_i)}[ log p(x_i|z) ] − E_{z∼qλ(z|x_i)}[ log qλ(z|x_i) − log p(z) ]
            = E_{z∼qλ(z|x_i)}[ log p(x_i|z) ] − KL( qλ(z|x_i) || p(z) )
Probabilistic Model Perspective

- How do we relate λ to the φ and θ seen earlier?
- We can parameterize the approximate posterior qθ(z|x, λ) by a network that takes data x and outputs parameters λ
- Parameterize the likelihood p(x|z) with a network that takes latent variables and outputs the parameters of the data distribution pφ(x|z)
- So we can rewrite

  ELBO_i(θ, φ) = E_{z∼qθ(z|x_i)}[ log pφ(x_i|z) ] − KL( qθ(z|x_i) || p(z) )
Probabilistic Model Objective

- Recall the Deep Learning objective derived earlier. We want to minimize:

  L(θ, φ) = Σ_{i=1}^N ( −E_{z∼qθ(z|x_i)}[ log pφ(x_i|z) ] + KL( qθ(z|x_i) || p(z) ) )

- The objective just derived for the Probabilistic Model is to maximize:

  ELBO(θ, φ) = Σ_{i=1}^N ( E_{z∼qθ(z|x_i)}[ log pφ(x_i|z) ] − KL( qθ(z|x_i) || p(z) ) )

- They are equivalent!
Applications - Image generation

A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.
Applications - Caption generation

Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of
images, labels and captions. In NIPS, 2016.
Applications - Semi-/Un-supervised document classification

Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Applications - Pixel art videogame characters

https://mlexplained.wordpress.com/category/generative-models/vae/.
Conclusion

- We derived the same objective from
  1) a deep learning point of view, and
  2) a probabilistic models point of view
- Showed they are equivalent
- Saw some applications
- Thank you. Questions?