
Bayesian Reasoning

and Deep Learning


Shakir Mohamed
DeepMind

shakirm.com

@shakir_za

9 October 2015

Abstract
Deep learning and Bayesian machine learning are currently two of the most
active areas of machine learning research. Deep learning provides a powerful
class of models and an easy framework for learning that now provides state-of-the-art methods for applications ranging from image classification to speech
recognition. Bayesian reasoning provides a powerful approach for information
integration, inference and decision making that has established it as the key
tool for data-efficient learning, uncertainty quantification and robust model
composition that is widely used in applications ranging from information
retrieval to large-scale ranking. Each of these research areas has shortcomings
that can be effectively addressed by the other, pointing towards a needed
convergence of these two areas of machine learning; the complementary
aspects of these two research areas are the focus of this talk. Using the tools of
auto-encoders and latent variable models, we shall discuss some of the ways in
which our machine learning practice is enhanced by combining deep learning
with Bayesian reasoning. This is an essential, and ongoing, convergence that
will only continue to accelerate and provides some of the most exciting
prospects, some of which we shall discuss, for contemporary machine learning
research.

Deep Learning

Bayesian Reasoning

Better ML


Deep Learning

A framework for constructing flexible models

+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximations and conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do model selection and complexity penalisation.



Bayesian Reasoning

A framework for inference and decision making

+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models.
- Potentially intractable inference leading to expensive computation or long simulation times.

Two Streams of Machine Learning


Deep Learning

+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximation and conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do selection and complexity penalisation.


Bayesian Reasoning

- Mainly conjugate and linear models.
- Potentially intractable inference, computationally expensive or long simulation time.
+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.

Outline
Bayesian Reasoning

Deep Learning

Complementary strengths that we should expect to be successfully combined.
1

Why is this a good idea?


Review of deep learning
Limitations of maximum likelihood and MAP estimation

How can we achieve this convergence?


Case study using auto-encoders and latent variable models
Approximate Bayesian inference

What else can we do?


Semi-supervised learning, classification, better inference
and more.


A (Statistical) Review of Deep Learning


Generalised Linear Regression

$\eta = w^\top x + b$
$p(y \mid x) = p(y \mid g(\eta); \theta)$

The basic function can be any linear function, e.g., affine, convolution.
$g(\cdot)$ is an inverse link function that we'll refer to as an activation function.

Table 1: Correspondence between link and activation functions in generalised regression.

Target | Regression | Link | Inverse link | Activation
Real | Linear | Identity | Identity |
Binary | Logistic | Logit, $\log\frac{\mu}{1-\mu}$ | Sigmoid, $\frac{1}{1+\exp(-\eta)}$ | Sigmoid
Binary | Probit | Inv. Gauss CDF, $\Phi^{-1}(\mu)$ | Gauss CDF, $\Phi(\eta)$ | Probit
Binary | Gumbel | Compl. log-log, $\log(-\log(\mu))$ | Gumbel CDF, $e^{-e^{-\eta}}$ |
Binary | Logistic | | Hyperbolic tangent, $\tanh(\eta)$ | Tanh
Categorical | Multinomial | | Multin. logit, $\frac{\exp(\eta_i)}{\sum_j \exp(\eta_j)}$ | Softmax
Counts | Poisson | $\log(\mu)$ | $\exp(\eta)$ |
Counts | Poisson | $\sqrt{\mu}$ | $\eta^2$ |
Non-neg. | Gamma | Reciprocal, $\frac{1}{\mu}$ | $\frac{1}{\eta}$ |
Sparse | Tobit | | max, $\max(0; \eta)$ | ReLU
Ordered | Ordinal | | Cum. logit, $\sigma(\phi_k - \eta)$ |


There are many link functions that allow us to make other distributional assumptions for the target (response) y. In deep learning, the
link function is referred to as the activation function, and Table 1 lists the names used for these functions in the two fields. From
this table we can see that many of the popular approaches for specifying neural networks have counterparts in statistics and related
literatures under (sometimes) very different names, such as multinomial
regression in statistics and softmax classification in deep learning, or
rectifier in deep learning and tobit models in statistics.
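
To make the correspondence in Table 1 concrete, here is a minimal sketch in plain Python of a few inverse link (activation) functions applied to a linear predictor. The function names and the example weights are illustrative assumptions, not part of the talk.

```python
import math

def linear_predictor(w, x, b):
    """eta = w^T x + b, the basic linear building block."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(eta):
    """Inverse of the logit link (logistic regression)."""
    return 1.0 / (1.0 + math.exp(-eta))

def gauss_cdf(eta):
    """Inverse of the probit link: the standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def softmax(etas):
    """Inverse of the multinomial logit link, computed stably."""
    top = max(etas)
    exps = [math.exp(e - top) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

def relu(eta):
    """Rectifier, the activation paired with tobit models in Table 1."""
    return max(0.0, eta)

def glm_mean(w, x, b, inverse_link):
    """Mean of the response: g(eta) for a chosen inverse link g."""
    return inverse_link(linear_predictor(w, x, b))

# Hypothetical weights and input chosen so that eta = 0.
p = glm_mean([1.0, -1.0], [2.0, 2.0], 0.0, sigmoid)
```

Swapping `sigmoid` for `gauss_cdf` or `relu` changes the distributional assumption on the target without touching the linear predictor.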

Maximum likelihood estimation

Optimise the negative log-likelihood

$\mathcal{L} = -\log p(y \mid g(\eta); \theta)$

[Figure: constructing a recursive GLM, or deep feed-forward neural network, using the linear predictor as the basic building block.]

A (Statistical) Review of Deep Learning

Recursive Generalised Linear Regression


Recursively compose the basic linear functions.
Gives a deep neural network.

$\mathbb{E}[y] = h_L \circ \dots \circ h_1 \circ h_0(x)$
A general framework for building non-linear, parametric models
Problem: Overfitting of MLE leading to limited generalisation.
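
The recursive composition can be sketched in a few lines of plain Python. This is only an illustration under assumed layer sizes, weights and activations (none of which come from the talk): each layer is a linear function followed by an activation, and the network is their composition.

```python
import math

def affine(W, b, x):
    """Linear predictor for one layer: W x + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def layer(W, b, activation):
    """One GLM-style layer h(x) = activation(W x + b)."""
    return lambda x: [activation(v) for v in affine(W, b, x)]

def compose(layers):
    """Apply h_0 first and h_L last, i.e. h_L o ... o h_0."""
    def network(x):
        for h in layers:
            x = h(x)
        return x
    return network

relu = lambda v: max(0.0, v)
sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

# Hypothetical two-layer network with hand-picked weights.
h0 = layer([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu)
h1 = layer([[1.0, 1.0]], [-1.5], sigmoid)
net = compose([h0, h1])
mean_y = net([1.0, 2.0])
```

With the weights above the hidden layer evaluates to [0.0, 1.5], so the output pre-activation is 0 and the predicted mean is sigmoid(0).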

A (Statistical) Review of Deep Learning


Regularisation Strategies for Deep Networks
Regularisation is essential to overcome the limitations of maximum
likelihood estimation.
Regularisation, penalised regression, shrinkage.
A wide range of available regularisation techniques:
Large data sets
Input noise/jittering and data augmentation/expansion.
L2 /L1 regularisation (Weight decay, Gaussian prior)
Binary or Gaussian Dropout
Batch normalisation

More robust loss function using MAP estimation instead.
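
As a small illustration of the link between weight decay and a Gaussian prior, the sketch below compares a MAP objective with an L2-penalised negative log-likelihood for a logistic regression. The data, the prior variance and the function names are assumptions made for the example; the point is that the two objectives differ only by a constant when the weight decay is 1 / (2 * prior variance).

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_nll(w, b, data):
    """Negative log-likelihood of a logistic regression (a one-layer GLM)."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def gaussian_log_prior(w, prior_var):
    """log N(w | 0, prior_var * I)."""
    return sum(-0.5 * wi * wi / prior_var
               - 0.5 * math.log(2.0 * math.pi * prior_var) for wi in w)

def map_objective(w, b, data, prior_var):
    """Negative log-posterior up to the evidence term."""
    return bernoulli_nll(w, b, data) - gaussian_log_prior(w, prior_var)

def l2_objective(w, b, data, weight_decay):
    """Negative log-likelihood plus an L2 penalty on the weights."""
    return bernoulli_nll(w, b, data) + weight_decay * sum(wi * wi for wi in w)

# Hypothetical toy data: (input, binary label) pairs.
data = [([1.0, 2.0], 1), ([-1.0, 0.5], 0), ([0.3, -0.7], 1)]
prior_var = 4.0
weight_decay = 1.0 / (2.0 * prior_var)

# The gap between the two objectives does not depend on the parameters.
gap_a = (map_objective([0.2, -0.1], 0.0, data, prior_var)
         - l2_objective([0.2, -0.1], 0.0, data, weight_decay))
gap_b = (map_objective([1.5, 0.7], 0.3, data, prior_var)
         - l2_objective([1.5, 0.7], 0.3, data, weight_decay))
```

Because the gap is constant, both objectives have the same minimiser, which is why L2 regularisation can be read as MAP estimation under a Gaussian prior.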


More Robust Learning


MAP estimators and limitations
Power of MAP estimators is that they provide
some robustness to overfitting.
Creates sensitivities to parameterisation.
1. Sensitivities affect gradients and can make learning hard.
Invariant MAP estimators and exploiting natural
gradients, trust region methods and other
improved optimisation.
2. Still no way to measure confidence of our model.
Can generate frequentist confidence intervals
and bootstrap estimates.

Towards Bayesian Reasoning


Proposed solutions have not fully dealt with the underlying issues.
Issues arise as a consequence of:
Reasoning only about the most likely solution and
Not maintaining knowledge of the underlying variability (and
averaging over this).

Given this powerful model class and invaluable tools for regularisation and optimisation, let us develop a
Pragmatic Bayesian Approach for Probabilistic Reasoning in Deep Networks.

Bayesian reasoning over some, but not all parts of our models (yet).

Outline
Bayesian Reasoning

Deep Learning

Complementary strengths that we should expect to be successfully combined.
1

Why is this a good idea?


Review of deep learning
Limitations of maximum likelihood and MAP estimation

How can we achieve this convergence?


Case study using auto-encoders and latent variable models
Approximate Bayesian inference

What else can we do?


Semi-supervised learning, classification, better inference
and more.


Dimensionality Reduction and Auto-encoders


Unsupervised learning and auto-encoders
A generic tool for dimensionality
reduction and feature extraction.
Minimise reconstruction error using an
encoder and a decoder.
+ Non-linear dimensionality reduction using deep networks for encoder and decoder.
+ Easy to implement as a single computational graph and train using SGD.
- No natural handling of missing data.
- No representation of variability of the representation space.


[Figure: Data y → Encoder f(.) → z = f(y) → Decoder g(.) → y* = g(z)]

$\mathcal{L} = -\log p(y \mid g(z))$

$\mathcal{L} = \| y - g(f(y)) \|_2^2$
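
The two loss expressions on this slide are linked: under a unit-variance Gaussian likelihood, the negative log-likelihood is half the squared reconstruction error plus a constant. The sketch below checks this for a hypothetical linear encoder and decoder; the matrices `A` and `B` and the data point are made up for illustration.

```python
import math

def encoder(y, A):
    """Hypothetical linear encoder f: z = A y."""
    return [sum(a * v for a, v in zip(row, y)) for row in A]

def decoder(z, B):
    """Hypothetical linear decoder g: y* = B z."""
    return [sum(b * v for b, v in zip(row, z)) for row in B]

def squared_error(y, y_star):
    """|| y - y* ||_2^2"""
    return sum((a - b) ** 2 for a, b in zip(y, y_star))

def gaussian_nll(y, y_star):
    """-log N(y | y*, I): half the squared error plus a constant."""
    d = len(y)
    return 0.5 * squared_error(y, y_star) + 0.5 * d * math.log(2.0 * math.pi)

A = [[0.5, 0.5, 0.0]]        # 3-D data to a 1-D code
B = [[1.0], [1.0], [0.0]]    # 1-D code back to 3-D
y = [1.0, 3.0, 0.5]

z = encoder(y, A)
y_star = decoder(z, B)
recon = squared_error(y, y_star)
nll = gaussian_nll(y, y_star)
```

Minimising either quantity over the encoder and decoder parameters gives the same solution, since they differ by a scale factor and an additive constant.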

Dimensionality Reduction and Auto-encoders

Some questions about auto-encoders:


What is the model we are interested in?
Why use an encoder?
How do we regularise?

[Figure: Data y → Encoder f(.) → z = f(y) → Decoder g(.) → y* = g(z)]

Best to be explicit about the:


Probabilistic model of interest and
Mechanism we use for inference.


Density Estimation and Latent Variable Models


Latent variable models:
Generic and flexible model class for density estimation.
Specifies a generative process that gives rise to the data.

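Latent Gaussian Models:
Probabilistic PCA, Factor analysis (FA), Bayesian Exponential Family PCA (BXPCA).

Latent Variable
$z \sim \mathcal{N}(z \mid \mu, \Sigma)$

Observation Model
$\eta = Wz + b$
$y \sim \text{Expon}(y \mid \eta)$

Exponential family with natural parameters $\eta$.

[Figure: BXPCA graphical model with weights W and observations y, for n = 1, ..., N]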

Use our knowledge of deep learning to design even richer models.



Deep Generative Models

Rich extension of previous model using deep neural networks.
E.g., non-linear factor analysis, non-linear Gaussian belief networks, deep latent Gaussian models (DLGM).

Latent Variables (Stochastic layers)
$z_l \sim \mathcal{N}(z_l \mid f_l(z_{l+1}), \Sigma_l)$
$f_l(z) = \sigma(W h(z) + b)$

Deterministic layers
$h_i(x) = \sigma(Ax + c)$

Observation Model
$\eta = W h_1 + b$
$y \sim \text{Expon}(y \mid \eta)$

Can also use non-exponential family.

[Figure: DLGM graphical model with stochastic layers z_1, z_2, deterministic layers h_1 to h_4, weights W_1 and observations y, for n = 1, ..., N]

Deep Latent Gaussian Models


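Our inferential tasks are:

1. Explain this data:
$p(z \mid y, W) \propto p(y \mid z, W)\, p(z)$

2. Make predictions:
$p(y^* \mid y) = \int p(y^* \mid z, W)\, p(z \mid y, W)\, dz$

3. Choose the best model:
$p(y \mid W) = \int p(y \mid z, W)\, p(z)\, dz$

[Figure: graphical model with latent variable z_1, deterministic layers h_1, h_2 and observations y, for n = 1, ..., N]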

Variational Inference
Use tools from approximate inference to handle intractable integrals.

[Figure: the approximation class containing q(z), the true posterior, and the divergence $KL[q(z \mid y) \,\|\, p(z \mid y)]$ between them]

$\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y \mid z)] - KL[q(z) \,\|\, p(z)]$

Reconstruction cost: the expected log-likelihood measures how well samples from q(z) are able to explain the data y.
Penalty: the explanation of the data q(z) doesn't deviate too far from your beliefs p(z) (Occam's razor).

Penalty is derived from your model and does not need to be designed.
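
A worked toy case may help here. The sketch below assumes a one-dimensional model that is not from the talk, p(z) = N(0, 1) and p(y|z) = N(y | z, 1), with a Gaussian q(z) = N(m, s^2). For this model both terms of the free energy have closed forms, and the marginal likelihood is log N(y | 0, 2), so the bound can be checked directly.

```python
import math

def expected_log_lik(y, m, s):
    """Reconstruction term E_q[log N(y | z, 1)] for q(z) = N(m, s^2)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * ((y - m) ** 2 + s ** 2)

def kl_to_standard_normal(m, s):
    """Penalty term KL[N(m, s^2) || N(0, 1)]."""
    return 0.5 * (s ** 2 + m ** 2 - 1.0 - math.log(s ** 2))

def free_energy(y, m, s):
    """F(y, q) = reconstruction - penalty, a lower bound on log p(y)."""
    return expected_log_lik(y, m, s) - kl_to_standard_normal(m, s)

def log_marginal(y):
    """Exact log p(y) = log N(y | 0, 2) for this toy model."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - y ** 2 / 4.0

y = 1.0
exact = log_marginal(y)
loose = free_energy(y, 0.0, 1.0)                 # q set to the prior
tight = free_energy(y, y / 2.0, math.sqrt(0.5))  # q set to the true posterior
```

The bound is loose when q is the prior and becomes tight when q equals the true posterior N(y/2, 1/2), which is the sense in which maximising the free energy fits q to the posterior.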

Amortised Variational Inference


$\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y \mid z)] - KL[q(z) \,\|\, p(z)]$

Approximate posterior distribution q(z): best match to the true posterior p(z|y), one of the unknown inferential quantities of interest to us.
Inference network: q is an encoder or inverse model.
Parameters of q are now a set of global parameters used for inference of all data points, test and train.
Amortise (spread) the cost of inference over all data.

[Figure: Data y → Inference/Encoder q(z|y) → z ~ q(z|y)]

Encoders provide an efficient mechanism for amortised posterior inference.

Auto-encoders and Inference in DGMs


$\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y \mid z)] - KL[q(z) \,\|\, p(z)]$

Model (Decoder): likelihood p(y|z).
Inference (Encoder): variational distribution q(z|y).

Stochastic encoder-decoder systems implement variational inference.

[Figure: Data y → Inference Network q(z|y) → z ~ q(z|y) → Model p(y|z) → y ~ p(y|z)]

Specific combination of variational inference in latent variable models using inference networks: Variational Auto-encoder.
But don't forget what your model is, and what inference you use.

What Have we Gained


+ Transformed auto-encoders into more interesting deep generative models.
+ Rich new class of density estimators built with non-linear models.
+ Used a principled approach for deriving loss functions that automatically include appropriate penalty functions.
+ Explained how an encoder enters into our models and why this is a good idea.
+ Able to answer all our desired inferential questions.
+ Knowledge of the uncertainty associated with our latent variables.

$\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y \mid z)] - KL[q(z) \,\|\, p(z)]$

[Figure: Data y → Inference Network q(z|y) → z ~ q(z|y) → Model p(y|z) → y ~ p(y|z)]

What Have we Gained


$\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y \mid z)] - KL[q(z) \,\|\, p(z)]$

Able to score our models and do model selection using the free energy.
Can impute missing data under any missingness assumption.
Can still combine with natural gradient and improved optimisation tools.
Easy implementation: have a single computational graph and simple Monte Carlo gradient estimators.
Computational complexity the same as any large-scale deep learning system.

[Figure: Data y → Inference Network q(z|y) → z ~ q(z|y) → Model p(y|z) → y ~ p(y|z)]

A true marriage of Bayesian Reasoning and Deep Learning



Data Visualisation
MNIST Handwritten digits

[Figure: DLGM on MNIST. Panels: Samples from 2D latent model; Labels in 2D latent space]

DLGM

Visualising MNIST in 3D


DLGM

Data Simulation

[Figure panels: Data; Samples]

Missing Data Imputation


Original Data

unobserved pixels

Inferred Image

DLGM

10%
observed

50%
observed


Outline
Bayesian Reasoning

Deep Learning

Complementary strengths that we should expect to be successfully combined.
1

Why is this a good idea?


Review of deep learning
Limitations of maximum likelihood and MAP estimation

How can we achieve this convergence?


Auto-encoders and latent variable models
Approximate and variational inference

What else can we do?


Semi-supervised learning, recurrent networks, classification,
better inference and more.


Semi-supervised Learning

Semi-supervised DLGM

Can extend the marriage of Bayesian reasoning and deep learning to the
problem of semi-supervised classification.

[Figure: graphical model with latent variable z, weights W and data x, for n = 1, ..., N]

Semi-supervised DLGM

Analogical Reasoning


Generative Models with Attention


We can also combine other tools from deep learning to design even more powerful generative models: recurrent networks and attention.

DRAW

[Figure 7: MNIST generation sequences for DRAW without attention. Notice how the network first generates a very blurry image that is subsequently refined.]
[Figure 8: Generated MNIST images with two digits.]
[Figure 9: Generated SVHN images. The rightmost column shows the training images closest (in L2 distance) to the generated images beside them. Note that the two columns are visually similar, but the numbers are generally different.]
[Figure 2, left: Conventional Variational Auto-Encoder. Diagram labels: encoder Q(z|x), decoder P(x|z), recurrent encoder Q(z_t | x, z_{1:t-1}) with read, recurrent decoder P(x | z_{1:T}) with write, canvas c_t, ..., c_T.]

Uncertainty on Model Parameters

Bayesian Neural Networks

We can also combine other tools from deep learning to design even more
powerful generative models: recurrent networks and attention.


[Figure 1, from Blundell et al. (2015): Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.]

In Review
Deep learning as a framework for building highly
flexible non-linear parametric models, but
regularisation and accounting for uncertainty
and lack of knowledge is still needed.

Bayesian reasoning as a general framework for


inference that allows us to account for
uncertainty and a principled approach for
regularisation and model scoring.
Combined Bayesian reasoning with auto-encoders and
showed just how much can be gained by a marriage of these
two streams of machine learning research.

z ~ q(z | y)

Model
p(y |z)

Inference
Network
q(z |y)

y ~ p(y | z)
Data y


Thanks to many people:


Danilo Rezende, Ivo Danihelka, Karol Gregor, Charles Blundell,
Theophane Weber, Andriy Mnih, Daan Wierstra (Google DeepMind),
Durk Kingma, Max Welling (U. Amsterdam)

Thank You.


Some References
Probabilistic Deep Learning

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.

Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. ICLR.

Mnih, A., & Gregor, K. (2014). Neural variational inference and learning in belief networks. ICML.

Gregor, K., et al. (2014). Deep autoregressive networks. ICML.

Kingma, D. P., Mohamed, S., Rezende, D. J., & Welling, M. (2014). Semi-supervised learning with deep generative models. NIPS (pp. 3581-3589).

Gregor, K., Danihelka, I., Graves, A., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

Rezende, D. J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. arXiv preprint arXiv:1505.05770.

Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight Uncertainty in Neural Networks. arXiv preprint arXiv:1505.05424.

Hernández-Lobato, J. M., & Adams, R. P. (2015). Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. arXiv preprint arXiv:1502.05336.

Gal, Y., & Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142.


What is a Variational Method?


Variational Principle
General family of methods for approximating
complicated densities by a simpler class of densities.

[Figure: the approximation class containing q(z), the true posterior, and the divergence $KL[q(z \mid y) \,\|\, p(z \mid y)]$ between them]
Deterministic approximation procedures
with bounds on probabilities of interest.
Fit the variational parameters.

From IS to Variational Inference


Integral problem
$\log p(y) = \log \int p(y \mid z)\, p(z)\, dz$

Proposal
$\log p(y) = \log \int p(y \mid z)\, p(z)\, \frac{q(z)}{q(z)}\, dz$

Importance Weight
$\log p(y) = \log \int p(y \mid z)\, \frac{p(z)}{q(z)}\, q(z)\, dz$

Jensen's inequality
$\log \int p(x)\, g(x)\, dx \geq \int p(x) \log g(x)\, dx$

$\log p(y) \geq \int q(z) \log\left( p(y \mid z)\, \frac{p(z)}{q(z)} \right) dz$
$= \int q(z) \log p(y \mid z)\, dz - \int q(z) \log \frac{q(z)}{p(z)}\, dz$

Variational lower bound
$= \mathbb{E}_{q(z)}[\log p(y \mid z)] - KL[q(z) \,\|\, p(z)]$
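
The step from importance sampling to the variational bound can be checked numerically. The sketch below assumes a toy model that is not from the talk, p(z) = N(0, 1) and p(y|z) = N(y | z, 1), uses the prior as the proposal q, and compares the log of the average importance weight with the average of the log weights.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    """log N(x | mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_weight(y, z, m, s):
    """log of p(y|z) p(z) / q(z) for the toy model, with q(z) = N(m, s^2)."""
    return (log_normal_pdf(y, z, 1.0) + log_normal_pdf(z, 0.0, 1.0)
            - log_normal_pdf(z, m, s * s))

rng = random.Random(0)          # fixed seed so the run is repeatable
y, m, s = 1.0, 0.0, 1.0         # proposal q set to the prior
log_w = [log_weight(y, rng.gauss(m, s), m, s) for _ in range(20000)]

# Importance sampling: log of the mean weight (via log-sum-exp).
top = max(log_w)
log_mean_w = top + math.log(sum(math.exp(v - top) for v in log_w) / len(log_w))

# Jensen's inequality moves the log inside: mean of the log weights.
mean_log_w = sum(log_w) / len(log_w)

exact = log_normal_pdf(y, 0.0, 2.0)   # log p(y) for this toy model
```

The first quantity estimates log p(y) itself, while the second estimates the variational lower bound; by Jensen's inequality the second can never exceed the first for any set of samples.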

Minimum Description Length (MDL)


$\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y \mid z)] - KL[q(z) \,\|\, p(z)]$
Stochastic encoder: q(z). Data code-length: the expected log-likelihood term. Hypothesis code: the KL term.

Stochastic encoder-decoder systems implement variational inference.

Regularity in our data that can be explained with


latent variables, implies that the data is compressible.
MDL: inference seen as a problem of compression;
we must find the ideal shortest message of our data y:
the marginal likelihood.
Must introduce an approximation to the ideal
message.
Encoder: variational distribution q(z|y),
Decoder: likelihood p(y|z).

[Figure: Data y → Encoder q(z|y) → z ~ q(z|y) → Decoder p(y|z) → y ~ p(y|z)]

Denoising Auto-encoders (DAE)


F(y, q) = Eq(z) [log p(y|z)]
Stochastic encoder

Reconstruction

(z, y)
Penalty

Stochastic encoder-decoder systems implement variational inference.

DAE: A mechanism for finding representations or


features of data (i.e. latent variable explanations).

Encoder: variational distribution q(z|y),

Decoder: likelihood p(y|z).


The variational approach requires you to be explicit
about your assumptions. Penalty is derived from your
model and does not need to be designed.

[Figure: Data y → Encoder q(z|y) → z ~ q(z|y) → Decoder p(y|z) → y ~ p(y|z)]

Amortising the Cost of Inference


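Repeat:
E-step
For n = 1, ..., N:
$\phi_n \propto \nabla_\phi \mathbb{E}_{q_\phi(z)}[\log p_\theta(y_n \mid z_n)] - \nabla_\phi KL[q(z_n) \,\|\, p(z_n)]$

M-step
$\theta \propto \frac{1}{N} \sum_n \nabla_\theta \log p_\theta(y_n \mid z_n)$

Instead of solving this optimisation for every data point n, we can instead use a model.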

z ~ q(z | y)

Model
p(y |z)

Inference
Network
q(z |y)

y ~ p(y | z)
Data y

Inference network: q is an encoder or inverse model.


Parameters of q are now a set of global parameters
used for inference of all data points - test and train.
Share the cost of inference (amortise) over all data.
Combines easily with mini-batches and Monte Carlo
expectations.
Can jointly optimise variational and model
parameters: no need for alternating optimisation.
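
The idea of sharing one set of inference parameters across all data points can be illustrated on a toy model. The sketch assumes p(z) = N(0, 1) and p(y|z) = N(y | z, 1), for which the exact posterior is N(y/2, 1/2), and a hypothetical linear inference network q(z|y) = N(a*y + c, s^2). None of these choices come from the talk.

```python
import math

def free_energy(y, m, s):
    """Closed-form F(y, q) for the toy model with q(z) = N(m, s^2)."""
    recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((y - m) ** 2 + s ** 2)
    penalty = 0.5 * (s ** 2 + m ** 2 - 1.0 - math.log(s ** 2))
    return recon - penalty

def log_marginal(y):
    """Exact log p(y) = log N(y | 0, 2) for the toy model."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - y ** 2 / 4.0

def inference_network(y, params):
    """Hypothetical encoder: maps any y to the mean and std of q(z|y)."""
    a, c, s = params
    return a * y + c, s

# One global parameter set, here fixed to the values that recover
# the exact posterior of this toy model.
shared_params = (0.5, 0.0, math.sqrt(0.5))

data = [-2.0, 0.3, 1.0, 4.5]    # stand-in for train and test points
gaps = [log_marginal(y) - free_energy(y, *inference_network(y, shared_params))
        for y in data]
```

The same three numbers give a tight bound for every data point, so no per-point E-step optimisation is needed; in a deep model the encoder would be a neural network and the fit would be approximate.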


Implementing your Variational Algorithm


Avoid deriving pages of gradient updates for variational inference.

Variational inference turns integration into optimisation:
$\mathbb{E}_q[-\log p(y \mid z) + \log q(z) - \log p(z)]$

Automated Tools:
Differentiation: Theano, Torch7, Stan
Message passing: infer.NET

Stochastic gradient descent and other preconditioned optimisation.
Same code can run on both GPUs or on distributed clusters.
Probabilistic models are modular, can easily be combined.

[Figure: computational graph. Forward pass: Data x → Inference q(z|x) → z → Model p(x|z), with prior p(z), evaluating log p(x|z), log p(z) and the entropy H[q(z)]. Backward pass: gradients flow back through the same graph.]

Ideally want probabilistic programming using variational inference.

Stochastic Backpropagation
A Monte Carlo method that works with continuous latent variables.
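Original problem
$\nabla_\xi \mathbb{E}_{q(z)}[f(z)]$

Reparameterisation
$z \sim \mathcal{N}(\mu, \sigma^2)$
$z = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$

Backpropagation with Monte Carlo
$\nabla_\xi \mathbb{E}_{\mathcal{N}(0,1)}[f(\mu + \sigma \epsilon)]$
$\mathbb{E}_{\mathcal{N}(0,1)}[\nabla_{\xi = \{\mu, \sigma\}} f(\mu + \sigma \epsilon)]$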

Can use any likelihood function, avoids the need for additional lower bounds.
Low-variance, unbiased estimator of the gradient.
Can use just one sample from the base distribution.
Possible for many distributions with location-scale or other known
transformations, such as the CDF.
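
A minimal sketch of the estimator, assuming the test function f(z) = z^2 (chosen here only because its expectation, mu^2 + sigma^2, gives exact gradients 2*mu and 2*sigma to compare against). The function names, seed and sample count are illustrative.

```python
import random

def f(z):
    return z * z

def df(z):
    """Derivative of f with respect to z."""
    return 2.0 * z

def reparam_gradients(mu, sigma, num_samples, rng):
    """Monte Carlo gradients of E_{N(mu, sigma^2)}[f(z)] w.r.t. mu and sigma."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)     # sample from the base distribution
        z = mu + sigma * eps          # reparameterised sample
        grad_mu += df(z) * 1.0        # chain rule: dz/dmu = 1
        grad_sigma += df(z) * eps     # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

rng = random.Random(0)                # fixed seed so the run is repeatable
mu, sigma = 1.0, 0.5
g_mu, g_sigma = reparam_gradients(mu, sigma, 100000, rng)
```

In a deep model the derivative `df` would come from backpropagation through the network rather than being written by hand.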


Monte Carlo Control Variate Estimators


More general Monte Carlo approach that can be used with both discrete
or continuous latent variables.
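Property of the score function:
$\nabla_\phi \log q_\phi(z \mid x) = \frac{\nabla_\phi q_\phi(z \mid x)}{q_\phi(z \mid x)}$

Original problem
$\nabla_\phi \mathbb{E}_{q_\phi(z)}[\log p_\theta(y \mid z)]$

Score ratio
$\mathbb{E}_{q_\phi(z)}[\log p_\theta(y \mid z)\, \nabla_\phi \log q(z \mid y)]$

MCCV Estimate
$\mathbb{E}_{q_\phi(z)}[(\log p_\theta(y \mid z) - c)\, \nabla_\phi \log q(z \mid y)]$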

c is known as a control variate and is used


to control the variance of the estimator.
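
The effect of the control variate can be seen exactly in a tiny discrete case. The sketch assumes a single Bernoulli latent variable with logit phi and two made-up values standing in for log p(y|z); since z takes only two values, the mean and variance of the estimator are computed by enumeration rather than sampling.

```python
import math

def sigmoid(phi):
    return 1.0 / (1.0 + math.exp(-phi))

def estimator_moments(phi, f, c):
    """Exact mean and variance of (f(z) - c) * d/dphi log q(z), z ~ Bernoulli."""
    pi = sigmoid(phi)
    probs = {0: 1.0 - pi, 1: pi}
    # For a Bernoulli with logit phi, the score is d/dphi log q(z) = z - pi.
    values = {z: (f[z] - c) * (z - pi) for z in (0, 1)}
    mean = sum(probs[z] * values[z] for z in (0, 1))
    var = sum(probs[z] * (values[z] - mean) ** 2 for z in (0, 1))
    return mean, var

f = {0: -3.0, 1: -1.0}      # hypothetical stand-ins for log p(y|z)
phi = 0.0
pi = sigmoid(phi)

# Exact gradient of E_q[f(z)] = pi * f(1) + (1 - pi) * f(0) w.r.t. phi.
exact_grad = (f[1] - f[0]) * pi * (1.0 - pi)

mean_plain, var_plain = estimator_moments(phi, f, c=0.0)
baseline = pi * f[1] + (1.0 - pi) * f[0]     # c set to the mean of f
mean_cv, var_cv = estimator_moments(phi, f, c=baseline)
```

Both choices of c give the same expected gradient, but subtracting the baseline lowers the variance; in this two-valued example it removes it entirely, which will not generally be the case.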
