
Advances in Markov chain Monte Carlo methods

Iain Murray
M.A., M.Sci. Natural Sciences (Physics), University of Cambridge, UK

Gatsby Computational Neuroscience Unit
University College London
Queen Square, London WC1N 3AR, United Kingdom

Thesis submitted for the degree of Doctor of Philosophy, University of London.

I, Iain Murray, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.
Abstract

Probability distributions over many variables occur frequently in Bayesian inference, statistical physics and simulation studies. Samples from distributions give insight into their typical behavior, and can allow approximation of any quantity of interest, such as expectations or normalizing constants. Markov chain Monte Carlo (MCMC), introduced by Metropolis et al., allows sampling from distributions with intractable normalization, and remains one of the most important tools for approximate computation with probability distributions.

While not needed by MCMC, normalizers are key quantities: in Bayesian statistics, marginal likelihoods are needed for model comparison; in statistical physics, many physical quantities relate to the partition function.

In this thesis we propose and investigate several new Monte Carlo algorithms, both for evaluating normalizing constants and for improved sampling of distributions. Many MCMC correctness proofs rely on using reversible transition operators; often these operators lead to slow, diffusive motion resembling a random walk. After reviewing existing MCMC algorithms, we develop a new framework for constructing non-reversible transition operators that allow more persistent motion.

Next we explore and extend MCMC-based algorithms for computing normalizing constants. We compare annealing, multicanonical and nested sampling, giving recommendations for their use. We also develop a new MCMC operator and nested sampling approach for the Potts model. This demonstrates that nested sampling is sometimes better than annealing methods at computing normalizing constants and drawing posterior samples.

Finally, we consider doubly-intractable distributions, with extra unknown normalizer terms that do not cancel in standard MCMC algorithms. We propose using several deterministic approximations for the unknown terms, and investigate their interaction with sampling algorithms. We then develop novel exact-sampling-based MCMC methods, the exchange algorithm and latent histories. For the first time, these algorithms do not require a separate approximation before sampling begins. Moreover, the exchange algorithm outperforms the only alternative sampling
algorithm for doubly-intractable distributions.

Acknowledgments

I feel very fortunate to have been supervised by Zoubin Ghahramani. My training has benefited greatly from his expertise, enthusiasm and encouragement. I have been equally fortunate to receive regular advice and ideas from David MacKay. Much of this research would never have happened without these mentors and friends.

This work was carried out at the Gatsby Computational Neuroscience Unit at University College London. This is a first-class environment in which to conduct research, both intellectually and socially. Peter Dayan, the director, is owed a lot of credit for this, as are all of Gatsby's members and visitors, past and present. Several individuals have offered me their constant support, advice and good humor, in particular Angela Yu, Ed Snelson, Katherine Heller and members of the inference group in Cambridge. I'd also like to extend a special thanks to Alex Boss, whose administrative help often extended beyond the call of duty.

Part of one chapter is a review of John Skilling's nested sampling algorithm. John has been generous with his advice and encouragement, which enabled me to pursue a study based on his work. Thanks also to Radford Neal for comments on aspects of some chapters, Hyun-Chul Kim for mean-field code used in one chapter, and Matthew Stephens for encouraging me to work out the non-infinite explanation of the exchange algorithm. Thanks to my examiners, Peter Green and Lorenz Wernisch, for reviewing this work and catching some mistakes. Of course, I am solely responsible for any errors that remain.

I am enormously grateful to the Gatsby Charitable Foundation for funding my research and providing travel money. Further travel monies were received from the PASCAL network of excellence, AUAI, the NIPS foundation and the Valencia Bayesian meeting.

Various items of free software have played a vital role in conducting this research: GCC, a GNU/Linux system, gnuplot, LaTeX, MetaPost (Hobby), Octave, Valgrind (Seward and Nethercote), Vim, and many more. Projects in the public interest such as these deserve considerable support.

Finally, I would like to thank my family and friends for all their support and patience.

Contents

Front matter: abstract; acknowledgments; contents; list of algorithms; list of figures; list of tables; notes on notation.

Introduction: graphical models; directed graphical models; undirected graphical models; the Potts model; computations with graphs; the role of summation; simple Monte Carlo; sampling from distributions; importance sampling; Markov chain Monte Carlo (MCMC); choice of method.

Markov chain Monte Carlo: Metropolis methods; generality of Metropolis-Hastings; Gibbs sampling; a two-stage acceptance rule; conditional estimators: "Rao-Blackwellization"; waste recycling; construction of estimators; convergence; auxiliary variable methods; Swendsen-Wang; slice sampling; Hamiltonian Monte Carlo; simulated tempering; expanded ensembles; parallel tempering; annealed importance sampling (AIS); tempered transitions; generalization to only forward transitions; generalization to a single pass; annealing methods; multicanonical ensemble; exact sampling; exact sampling example: the Ising model; discussion and outlook.

Multiple proposals and non-reversible Markov chains: population Monte Carlo; multiple-try Metropolis; efficiency of MTM; multiple-importance-try; waste-recycled MTM; adapting k automatically; ordered overrelaxation with pivot-based transitions; persistence with pivot states; ordered overrelaxation; pivot-based transitions; pivot-based Metropolis; summary; related and future work.

Normalizing constants and nested sampling: starting at the prior; bridging to the posterior; aside on the prior factorization; thermodynamic integration; multicanonical sampling; nested sampling; a change of variables; computations in the new representation; nested sampling algorithms; MCMC approximations; integrating out x; degenerate likelihoods; efficiency of the algorithms: nested sampling, multicanonical sampling, importance sampling; constructing annealing schedules; Markov chains for normalizing constants; randomize operator orderings; changes in length-scale and energy; a new version of Swendsen-Wang; description; slice sampling experiments; discussion of slice sampling results; the Potts model; summary; related work; philosophy; experiments; discussion and conclusions.

Doubly-intractable distributions: Bayesian learning of undirected models; do we need Z for MCMC?; targets for MCMC approximation; approximation algorithms; extension to hidden variables; experiments involving fully-observed models; experiment involving hidden variables; discussion; product space interpretation; bridging; exchange algorithm; details for proof of correctness; reinterpreting SAVM; approximation schemes; the exchange algorithm; the single auxiliary variable method; MAVM: a tempered-transitions refinement; comparison of the exchange algorithm and MAVM; Ising model comparison; discussion; Metropolis-Hastings algorithm performance; latent history methods; slice sampling doubly-intractable distributions; latent histories; MAVM; discussion; summary and future work.

References.

List of algorithms

- Rejection sampling
- Metropolis-Hastings
- A two-stage acceptance rule
- Population Monte Carlo, as in Cappé et al.
- A single step of the multiple-try Metropolis Markov chain
- Self size-adjusting population for ordered overrelaxation
- The pivot-based transition
- The pivot-based Metropolis operator
- Single-particle nested sampling
- Multiple-particle nested sampling
- Swendsen-Wang for weighted bond configurations
- Construction of an annealing schedule from nested sampling
- Standard but infeasible Metropolis-Hastings
- Exchange algorithm
- Exchange algorithm with bridging
- Multiple auxiliary variable method (MAVM)
- Simple rejection sampling algorithm for p(y)
- Template for a latent history sampler
- Metropolis-Hastings latent history sampler

List of figures

- A selection of directed graphical models
- Some graphical models over three variables
- A grid of binary variables
- An intractable undirected graphical model
- Drawing samples from low-dimensional distributions
- Challenges for Markov chain exploration
- The Swendsen-Wang algorithm
- Slice sampling
- The effect of annealing
- Parallel tempering
- Annealed importance sampling (AIS)
- Tempered transitions
- Multicanonical ensemble example
- Coupling from the past (CFTP) overview
- Metropolis and population Monte Carlo on the funnel distribution
- Multiple-try Metropolis (MTM)
- Performance of multiple-try Metropolis
- The idea behind successive overrelaxation
- Illustration of the pivot-based transitions operator
- Ordered overrelaxation schemes applied to a bivariate Gaussian
- Using pivot states for persistent motion
- Example of persistent motion
- Reflect move for discrete distributions
- Views of the integral Z
- Nested sampling illustrations
- The arithmetic and geometric means of nested sampling's progress
- Errors in nested sampling approximations
- Empirical average behavior of AIS and nested sampling
- Potts model states accessible by MCMC and nested sampling
- A simple illustration of the global nature of Z
- Histograms of approximate samples for heart disease risk model
- Quality of marginals from approximate MCMC methods
- Toy semi-supervised problem with results from approximate samplers
- The exchange algorithm's augmented model
- A product space model motivating the exchange algorithm
- The proposed change in joint distribution under a bridged exchange
- Augmented model used by SAVM
- Joint model for MAVM
- Comparison of MAVM and the exchange algorithm learning a precision
- Performance comparison of exchange and MAVM on an Ising model
- The latent history representation

List of tables

- Equilibrium efficiency of MTM and Metropolis
- Accuracy on t-distribution after a given number of proposals
- A rough interpretation of the evidence scale
- Nested sampling, AIS and multicanonical behavior with slice sampling
- Estimates of the "deceptive" distribution's partition function
- Results for Potts systems
Notes on notation

Probability distributions. We have chosen to follow a fairly loose but commonly-used notation for probabilities. Occasionally we use P(X = x) to denote the probability that a random variable X takes on the value x, but as long as the meaning can be inferred from context we simply write p(x). Often there is more than one probability distribution over the same variable; we simply write q(x) for the probability under distribution q, and p(x) for the probability under p. We never mention the space in which x lives, nor any measure theory, unless it is actually used. We rarely need to distinguish between probability densities and discrete probabilities. This loose notation is imprecise, but hopefully its simplicity will be appreciated by some readers.

Probability of a given b. We use several notations for distributions over variables that depend on other quantities.

p(a|b): the conditional probability of a given b. Bayes' rule can be applied to infer b from a: p(b|a) = p(a|b) p(b) / p(a).

p(a; b): the probability of a is a function depending on some parameter b. One should not necessarily assume that Bayes' rule holds for a and b.

T(a ← b): T is a transition operator, which gives a probability distribution over a new position a given a starting position b. One could also write T(a; b); the arrow is to provide a more obvious distinction from authors that use T(a, b) for the probability of the transition a → b. Transition operators do not necessarily satisfy the Bayes'-rule-like symmetry known as detailed balance, so the notation T(a|b) is avoided.

T(a ← b; c): this specifies parameters c in addition to starting location b.

Expectations. We use

    E_{p(x)}[f(x)] = Σ_x p(x) f(x)

for the expectation, or average, of f(x) under the distribution p(x). A sum with no specified range should be taken to mean over all values. The variance of a quantity is given by var_p[f] = E_{p(x)}[f(x)²] − (E_{p(x)}[f(x)])².

Introduction

Probability
distributions over many variables occur frequently in Bayesian inference, statistical physics and simulation studies. Computational methods for dealing with these large and complex distributions remain an active area of research. Graphical models, reviewed in the first section below, provide a powerful framework for representing these distributions; we use these to explain challenging probability distributions and, sometimes, the algorithms to deal with them. A surprising number of statistical problems result in the computation of averages, which we explain next. Monte Carlo techniques approximate these summations using random samples; many of these methods rely on the use of Markov chains. Extending Markov chain Monte Carlo (MCMC) techniques is the subject of this thesis.

Graphical models

Compact representations of high-dimensional probability distributions are essential for their interpretation, for feasibility of learning, and for computational reasons. As a concrete example,
consider a distribution over a vector of D binary variables, x. For small D, e.g. two or three, an explicit table of counts for all possible joint settings could be maintained and used for frequency-based estimates of the settings' probabilities. Explicitly storing such a table becomes exponentially more costly as D grows. Even if the table is stored in a sparse data structure, the representation is not useful for learning probabilities: most cells will contain zero observations. Enforcing some structure is essential when learning from large multivariate distributions.

The simplest multivariate distributions assume that all of their component variables x_d are independent:

    p(x) = Π_{d=1}^{D} p(x_d).

[Figure: A selection of directed graphical models. (a) naïve Bayes; (b) hidden causes; (c) simple parametric model.]

This assumption is generally inappropriate, as it means that no relationships between any of the variables can ever be learned. The graphical model for the equation above has a node for each x_d, with no edges between any of the nodes. Graphs with more structure provide a convenient representation for different factorizations of p(x). Two classes of graphical model are used in this thesis. Directed graphs offer a natural representation of causal relationships, or the result of a sequence of operations; while useful in modeling, we mainly use directed graphs for explanations of algorithms. Undirected graphical models are a more natural representation for constraints and mutual interactions; these are an important motivation for a later chapter. This section offers a brief review of the required concepts.

Directed graphical models

A directed graphical model is represented by a directed acyclic graph (DAG). Generally the joint distribution factorizes into a product of terms for each node conditioned on all of its parents. Figure (a) above is a directed graphical model for the joint distribution found in the naïve Bayes model with class variable c and feature vector x = (x_1, …, x_D):

    p(c, x) = p(c) p(x|c) = p(c) Π_{d=1}^{D} p(x_d|c).

The figure and the equation contain exactly the same information: from either form it is readily apparent, to those familiar with the representations, that the features x are dependent, but independent conditioned on the class variable c. As more variables and more complicated structure are introduced, the graphical model is often easier to
interpret at a glance than an equation giving the factorization of the joint distribution. Several graphical model figures have been included throughout this thesis, for those that might find them useful.

[Figure: Some graphical models over three variables.]

Figure (c) demonstrates another piece of directed graphical model notation: sometimes shading or a double outline is used to indicate that a node represents an observed variable. Here y are the data being modeled, which are assumed to come from a distribution parameterized by θ. Strictly, this graph says nothing else about the distribution: any joint distribution over two variables can be written as p(y, θ) = p(y) p(θ|y) = p(θ) p(y|θ), so the arrow could equally be drawn the other way. The implication is that generating y from θ is a natural operation. The arrow directions make a big difference in figure (b): the x variables are independent in the generative process, but after observing y our knowledge about x forms a potentially complex joint distribution. We do not expect direct sampling from p(x|y) to be easy.
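The contrast between the two sampling directions can be sketched in code. The model below is an illustrative assumption, not one from the thesis: D independent binary causes x_d, and one observed child y with a noisy-OR style likelihood. Ancestral sampling (parents before children) is trivial; sampling x given y is not, because conditioning on y couples all of the x_d.

```python
import random

# Hypothetical "hidden causes" network: independent binary causes x_1..x_D,
# one observed child y. All numbers here are made up for illustration.
p_x = [0.2, 0.5, 0.3, 0.7, 0.4]   # p(x_d = 1)

def ancestral_sample(rng):
    # Sample parents first, then the child given its parents.
    x = [1 if rng.random() < p else 0 for p in p_x]
    p_y = 1.0 - 0.9 ** sum(x)      # noisy-OR style p(y=1 | x), an assumed form
    y = 1 if rng.random() < p_y else 0
    return x, y

rng = random.Random(0)
draws = [ancestral_sample(rng) for _ in range(20000)]
freq_x1 = sum(x[0] for x, _ in draws) / len(draws)
# freq_x1 is close to p_x[0] = 0.2. By contrast, there is no comparably direct
# recipe for drawing x from p(x | y): the posterior couples all of the x_d.
```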
Methods based on Markov chains will offer a solution to this problem.

Undirected graphical models

A probability distribution must be normalized, so one representation, used frequently in a later chapter, is

    p(x) = f(x)/Z,    Z = Σ_x f(x),

where f(x) is any positive function for which the normalization sum or integral exists. However, without detailed analysis of f, this representation says nothing about the structure of the distribution. Undirected graphical models represent f as a product of terms, but do not identify some nodes as children and others as parents. This is appropriate for representing mutual interactions. Modern undirected models use bipartite factor graphs: figure (b) above shows three variables which all interact through a central factor node. Traditionally this distribution would be represented as in figure (c), where the three nodes all take part in a common interaction, as they form a fully-connected clique. The factor representation is more powerful: the traditional representation cannot represent the distribution in figure (d), which contains three factors,

    p(x_1, x_2, x_3) = f_1(x_1, x_2) f_2(x_2, x_3) f_3(x_1, x_3) / Z.

The Potts model

The Potts model is a widespread example of an undirected graphical model. Its probability distribution is defined over discrete variables s, each taking on one of q distinct settings or "colors":

    p(s | J, q) = (1/Z_P(J, q)) exp( Σ_{(i,j)∈E} J_ij δ(s_i, s_j) ).

The variables exist as nodes on a graph, where (i, j) ∈ E means that nodes i and j are linked by an edge. The Kronecker delta δ(s_i, s_j) is one when s_i and s_j are the same color, and zero otherwise. Neighboring nodes pay an energy penalty of J_ij when they are different colors. Often a single common coupling, J_ij = J, is used. A common extension allows biases to be put on the colors: an additional term Σ_i h_i(s_i) is put inside the exponential. The Potts model with bias terms and q = 2 has several different names: the Boltzmann machine, the Ising model, binary Markov random fields, or autologistic models.
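Read literally, the Potts definition above is easy to evaluate up to the normalizer Z_P; it is Z_P itself that is hard. A minimal sketch of the unnormalized log-probability (the function name, graph and coupling values are my own illustrative choices):

```python
# p(s | J, q) is proportional to exp( sum over edges (i,j) of J_ij * delta(s_i, s_j) ),
# so the log of the unnormalized probability is just the sum of matching-edge couplings.

def potts_log_prob_unnorm(s, edges, J):
    """s: list of color indices per node; edges: list of (i, j); J: (i, j) -> J_ij."""
    return sum(J[(i, j)] for (i, j) in edges if s[i] == s[j])

# A 2x2 grid with uniform coupling J = 1:
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
J = {e: 1.0 for e in edges}
print(potts_log_prob_unnorm([0, 0, 0, 0], edges, J))  # all same color: 4.0
print(potts_log_prob_unnorm([0, 1, 1, 0], edges, J))  # no matching neighbors: 0.0
```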
Computations with graphs

Imagine a problem that involves summing over the settings of a binary Potts model with D variables. In general this seems impossible: there are 2^D possible states. For comparison, the age of the universe is usually estimated to be about 10^29 to 10^30 picoseconds, while most current processors take on the order of 10^2 to 10^3 picoseconds to perform a single instruction.

If the model's graph forms a chain, s_1 − s_2 − ⋯ − s_N, the distribution can be factored into a product of functions involving each adjacent pair:

    p(s | J) = (1/Z(J)) Π_{n=2}^{N} f_n(s_n, s_{n−1}).

Now certain sums, such as the normalizer Z(J), can be decomposed, and expectations of functions of the variables, E_p[g(s_i)], are tractable. An example shows how the sums decompose:

    Σ_s g(s_i) Π_{n=2}^{N} f_n(s_n, s_{n−1})
      = Σ_{s_N} f_N(s_N, s_{N−1}) ⋯ Σ_{s_i} g(s_i) f_i(s_i, s_{i−1}) ⋯ Σ_{s_1} f_2(s_2, s_1).

Performing the sums right-to-left makes the computation O(qN) rather than O(q^N), where q = 2 for binary variables.
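The right-to-left summation is a small dynamic program. The sketch below (function names and the exp(J·δ) pairwise factor are my own illustrative choices) computes the chain normalizer Z in O(qN) and checks it against O(q^N) brute-force enumeration on a short chain:

```python
import itertools
import math

def chain_Z_dp(N, q, J):
    """O(q*N): Z = sum over all colorings of prod_n f_n(s_n, s_{n-1})."""
    f = lambda a, b: math.exp(J if a == b else 0.0)
    # message[a] = sum over s_1..s_{n-1} of the partial product, with s_n = a
    message = [1.0] * q
    for _ in range(N - 1):
        message = [sum(f(a, b) * message[b] for b in range(q)) for a in range(q)]
    return sum(message)

def chain_Z_brute(N, q, J):
    """O(q^N) check by explicit enumeration of every coloring."""
    f = lambda a, b: math.exp(J if a == b else 0.0)
    total = 0.0
    for s in itertools.product(range(q), repeat=N):
        p = 1.0
        for n in range(1, N):
            p *= f(s[n], s[n - 1])
        total += p
    return total

z_fast = chain_Z_dp(8, 2, 1.0)
z_slow = chain_Z_brute(8, 2, 1.0)
# The two agree; only the first scales to long chains.
```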
[Figure: A small state space of binary variables, represented as (a) pixels and (b) an undirected graphical model.]

The above technique easily generalizes to tree-structured graphs, but other topologies are common in applications. The figure shows a grid of binary variables; arrays like this are common in computer vision, spatial statistics and statistical physics. Treating each row of pixels as a single variable makes a chain-structured graph with one variable per row, each taking on a number of states exponential in the row width. Summing has changed from an operation that takes orders of magnitude longer than the age of the universe to being almost instantaneous.

Advanced versions of this procedure exist for general graphical models, notably the junction tree algorithm (Lauritzen and Spiegelhalter). The cost of the algorithm is determined by the treewidth of a graph, which indicates the largest number of variables that need to be joined in order to form a tree structure.

The next figure shows a genuinely intractable graphical model. Or does it? Even if the variables are only binary, summing over all states with a graph-based exact inference algorithm is infeasible: forming a tractable tree will require making at least one node with joint settings of many variables. The topology of the graph isn't everything, however. If the variables were continuous and Gaussian-distributed, then the model is quite tractable: almost everything one might want to know about a Gaussian distribution can be found easily from a Cholesky decomposition of the covariance matrix. This matrix factorization is flexible, numerically stable, and costs O(n³) for an n×n covariance matrix (Seeger). For moderate n, a current mid-range workstation can perform the required computation in a couple of seconds.
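For the Gaussian case, quantities like the normalizing constant follow directly from the Cholesky factor. A pure-Python sketch (helper names are mine; a real implementation would use a linear algebra library, at the O(n³) cost quoted above):

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L L^T (pure Python, for small matrices)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def gaussian_log_normalizer(Sigma):
    """log of Z = (2*pi)^{n/2} |Sigma|^{1/2} for a zero-mean Gaussian."""
    L = cholesky(Sigma)
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))
    return 0.5 * len(L) * math.log(2.0 * math.pi) + 0.5 * logdet

Sigma = [[2.0, 0.5], [0.5, 1.0]]
logZ = gaussian_log_normalizer(Sigma)
# For this 2x2 example the determinant is 1.75, so logZ = log(2*pi) + 0.5*log(1.75).
```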
[Figure: An intractable undirected graphical model: a grid of variables with pairwise interactions.]

(A footnote to the grid example: summing in diagonal stripes means this need only happen once, rather than many times.)

The role of summation

Summing over all the configurations of a multivariate distribution turns out to be the dominant computational task in many fields. One of the goals of statistical physics is to capture the collective behavior of systems that, like the Potts model, involve enormous numbers of interacting parts. Many physical quantities relate to simple statistics of these parts, averaged over the entire system. Other key quantities can be derived from Z, the Zustandssumme or "sum over states".

Bayesian inference, the use of probability theory to deal with uncertainty, is also dominated by the computation of averages. As a canonical example, we consider a statistical model p(x|θ) for data x generated using parameters θ. The predictive distribution over new data x_{N+1}, given observations of N previous settings x_{1:N} = {x_n}_{n=1}^{N}, is an average under the posterior distribution p(θ | x_{1:N}):

    p(x_{N+1} | x_{1:N}) = ∫ p(x_{N+1} | θ) p(θ | x_{1:N}) dθ = E_{p(θ|x_{1:N})}[ p(x_{N+1} | θ) ].

The posterior distribution is given by Bayes' rule,

    p(θ | x_{1:N}) = p(x_{1:N} | θ) p(θ) / p(x_{1:N}),    p(x_{1:N}) = ∫ p(x_{1:N} | θ) p(θ) dθ,

which involves another sum, over θ. In general, any time a quantity is unknown we must consider all of its possible settings, which tends to involve an average under a probability distribution.

This thesis concentrates on computational methods rather than their application. However, modeling and learning from data are a key motivation for this work, so we briefly mention some more general references. Bayesian inference is named after Bayes, although many of the ideas that currently fall under this banner came much later. For more on the philosophy, a good start is a beautifully written paper by Cox. Recent textbooks offer a broader view of probabilistic modeling, and are available from the viewpoints of various communities, including statistics (Gelman et al.), machine learning (MacKay; Bishop) and the physical sciences (Sivia and Skilling).
Simple Monte Carlo

Many of the alleged difficulties with finding averages are unduly pessimistic. Imagine we were interested in the average salary x_p of people p in the set of statisticians S. Formally this is a large, intractable sum over all of the |S| statisticians in the world:

    E_{p(S)}[x_p] = (1/|S|) Σ_{p∈S} x_p.

But to claim that the question is unanswerable is absurd: we can clearly get a reasonable estimate of the answer by measuring just some statisticians,

    E_{p(S)}[x_p] ≈ (1/|S′|) Σ_{p∈S′} x_p,    for a random survey of people S′ ⊂ S, |S′| ≪ |S|.

No reasonable application needs the exact answer, which in any case is constantly fluctuating as individual statisticians are hired, promoted and retire. So for all practical purposes the problem is solved. Nearly: conducting surveys that obtain a fair sample from a target population is difficult, but the technique is so useful that we are prepared to invest effort into good sampling methods.

This statistical sampling technique is directly relevant to solving difficult integrals in statistical inference. For example, what is the distribution over an unknown quantity x after observing data D from a distribution with unknown parameters θ?

    p(x | D) = ∫ p(x | θ) p(θ | D) dθ = E_{p(θ|D)}[ p(x | θ) ] ≈ (1/S) Σ_{s=1}^{S} p(x | θ^(s)),    θ^(s) ~ p(θ | D).

We draw samples from the target distribution, which, rather than the set of statisticians, is now p(θ|D). Approximating general averages by statistical sampling is known as simple Monte Carlo. The estimates are unbiased, which, if not clear now, is shown more generally in a later section. (A footnote on the Cox citation above: Cox's work has been subject to some technical objections, which are countered by Horn.)

As long as variances are bounded appropriately, the sum of independent terms will obey a central limit theorem: the estimator's variance will scale like 1/S. Having an estimator's variance shrink only linearly with computational effort is often considered poor. Sokal begins a lecture course on Monte Carlo with the warning that Monte Carlo is an extremely bad method, to be used only when all alternative methods are worse. However, as Sokal goes on to acknowledge, Monte Carlo is also an extremely important method: on some problems it might be the only way to obtain accurate answers.

Metropolis describes how the physicist Enrico Fermi used Monte Carlo before the development of electronic computers. When insomnia struck, he would spend his nights predicting the results of complex neutron scattering experiments by performing statistical sampling calculations with a mechanical adding machine, as analytically deriving the behavior of many neutrons seemed intractable. Fermi's ability to make accurate predictions astonished his colleagues.

Many problems in physics and statistics are complex and involve many variables. Numerical methods that scale exponentially with the dimensionality of the problem will not work at all. In contrast, Monte Carlo is usually simple, and its 1/√S scaling of error bars, independent of dimensionality, may be good enough. Even when more advanced deterministic methods are available, Monte Carlo can be appropriate if today's computers can return useful answers with less implementation effort than more complex methods.
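The surveying argument above can be run directly. A sketch with a synthetic population (the salary numbers and population size are invented for illustration):

```python
import random

# Simple Monte Carlo reading of the survey: estimate an intractable average
# over a whole population by averaging a random sample from it.
rng = random.Random(0)
population = [rng.gauss(40000.0, 5000.0) for _ in range(100000)]  # assumed "salaries"

exact = sum(population) / len(population)   # the full, "intractable" sum
survey = rng.sample(population, 1000)       # a random survey of S = 1000 people
estimate = sum(survey) / len(survey)

# The error shrinks like 1/sqrt(S): here the estimate lands within a few
# hundred of the exact answer while touching only 1% of the population.
```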
Sampling from distributions

Just as finding a fair sample from a population is difficult in surveys, sampling correctly from arbitrary probability distributions is also hard. Simple Monte Carlo is only as easy to implement as the random variate generator for the entire joint distribution involved.

A graphical description of sampling from a probability distribution is given by the figure below: points are drawn uniformly from the unit area underneath the probability density function, and their corresponding x-value is recorded. This correctly assigns each element of the input space dx the probability p(x) dx. The probability mass to the left of each point, u(x), is distributed as Uniform[0, 1]. To implement a sampler we can first draw u and then compute x(u) from the inverse cumulative distribution function; that is, x(u) = c⁻¹(u), where c(x) = ∫_{-∞}^{x} p(x′) dx′. Now the difficulty of sampling from distributions with this general method becomes apparent: it is often infeasible even to normalize p, a single integral over the entire state space, whereas x(u) is a whole continuum of such integrals.
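For standard distributions the inverse CDF is available in closed form, and the method above is immediate. A sketch for the exponential distribution (my choice of example; the function names are mine):

```python
import math
import random

# Inverse-CDF sampling: for p(x) = lam * exp(-lam * x) on x >= 0, the CDF
# c(x) = 1 - exp(-lam * x) inverts to x(u) = -log(1 - u) / lam, so a uniform
# draw u maps to an exponential draw x.

def sample_exponential(lam, rng):
    u = rng.random()                  # u ~ Uniform[0, 1)
    return -math.log(1.0 - u) / lam  # x(u) = c^{-1}(u)

rng = random.Random(0)
draws = [sample_exponential(2.0, rng) for _ in range(100000)]
mean = sum(draws) / len(draws)
# mean should be close to 1/lam = 0.5
```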
[Figure: Drawing samples from low-dimensional distributions. (a) Direct sampling: sample locations can be taken from points drawn uniformly under the density curve. (b) Rejection sampling provides a way to implement this: samples are drawn uniformly underneath a simpler curve c·q*(x), as in the algorithm below. Points above p*(x) ∝ p(x), such as (x_i, h_i), are rejected; the remaining points, such as (x_j, h_j), come uniformly from underneath p* and are valid samples from p. The constant c is chosen such that c·q* is always above p*; the best setting is shown as c_opt, but finding this value is not always possible.]

While the inverse cumulative distribution is tractable for some standard distributions, random number generators must generally use less direct techniques. Rejection sampling is a method for drawing samples from a distribution p(x) by using another distribution q(x) from which we can already sample. The method requires an ability to evaluate the probability density of points under both distributions, up to a constant: p*(x) ∝ p(x) and q*(x) ∝ q(x). Further, we must be able to find a constant c such that c·q*(x) ≥ p*(x). Sampling underneath a p*(x) ∝ p(x) curve gives the correct probability distribution, and we can sample under p* by drawing samples from the tractable q(x) distribution and applying the algorithm below (figure (b)).

The more closely q matches p, the lower the rejection rate can be made, providing that c is adjusted to maintain a valid bound. We can improve q after each step of the algorithm; ideally q would be updated automatically as points from p* are evaluated. In general this could be difficult, but for log-concave distributions there are at least two techniques (Gilks and Wild; Gilks). For applications of rejection sampling to standard distributions, and several more techniques for non-uniform random variate generation, see Devroye.

Algorithm: Rejection sampling
  Inputs: target distribution p(x) ∝ p*(x); simple distribution q(x) ∝ q*(x); number of samples S.
  Outputs: {x^(s)}, S samples from p(x).
  1. find a constant c such that c·q*(x) ≥ p*(x) for all x (ideally, minimize c within the constraint)
  2. for s = 1 … S:
  3.     draw candidate x ~ q
  4.     draw height h ~ Uniform[0, c·q*(x)]
  5.     if (x, h) is below p*, i.e. h ≤ p*(x), then x^(s) ← x; else go to step 3
  6. end for
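The algorithm above runs unchanged in code. A deliberately crude sketch (my choices: a standard-normal target known only up to a constant, a Uniform[-5, 5] proposal with q*(x) = 1 and bound c = 1; the negligible tail mass beyond ±5 is ignored):

```python
import math
import random

def p_star(x):
    # Unnormalized target: standard normal, p*(x) = exp(-x^2 / 2).
    return math.exp(-0.5 * x * x)

def rejection_sample(S, rng):
    samples = []
    while len(samples) < S:
        x = rng.uniform(-5.0, 5.0)   # candidate x ~ q  (q*(x) = 1, c = 1)
        h = rng.uniform(0.0, 1.0)    # height h ~ Uniform[0, c * q*(x)]
        if h <= p_star(x):           # keep only points under the p* curve
            samples.append(x)
    return samples

rng = random.Random(0)
xs = rejection_sample(20000, rng)
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs) - mean ** 2
# mean should be near 0 and variance near 1 for a standard normal
```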
ong with all the associated computations seems wasteful yet discarding samples i
s a key part of rejection samplings correct operation importance sampling avoids
such rejections by rewriting the integral as an expectation under any distribut
ion q that has the same support as p f xpx dx s f x px qx dx qx if qx wherever p
x wx pxqx xs qx f xwx qx dx s f xs wxs s one interpretation is that the weights
wxs make some samples from q more important than others rather than assigning ha
rsh weights of zero and one as in rejection sampling the simplest view is that e
quation is is just simple monte carlo integration of an expectation under qx we
immediately know that the estimator is unbiased and might obey a central limit t
heorem as before if qx is much smaller than px in some regions of the state spac
e then the effective function f xpxqx will have very high or even infinite varia
nce under qx states with extreme importance weights are rare events under q and
might not be observed within a moderate number of samples this means that error
bars based on the empirical variance of the importance weights can be very misle
ading for a practical example of this problem along with the associated recommen
dation to use broad distributions with heavy tails see mackay section interestin
gly the ideal q distribution is not equal to p if f x is a positive function the
n qx f xpx would give a zero variance estimator this optimal distribution is mar
kov chain monte carlo mcmc unobtainable as evaluating qx requires the normalizat
ion zq f xpx dx the target integral but sometimes deliberate matches between q a
nd p are useful in practice for example when interested in gathering statistics
about the tails of p nothing about the importance sampling trick in equation act
ually requires the original integral to be an expectation we can divide and mult
iply any integrand by a convenient distribution to make it an expectation thus i
mportance sampling allows any integral to be approximated by monte carlo as most
of the integrals in dominant applications are expectations we still maintain px
in the equations throughout.

Evaluating the importance weights w(x) requires evaluating p(x) = p*(x)/Z_p, but often Z_p is unknown. In these cases the normalizing constant must be approximated separately as follows:

    Z_p/Z_q = (1/Z_q) ∫ p*(x) dx = ∫ [p*(x)/q*(x)] q(x) dx ≈ (1/S) Σ_s w*(x^(s)),   x^(s) ~ q(x),   w*(x) ≡ p*(x)/q*(x).

This gives an approximation for the importance weights:

    w(x^(s)) = p(x^(s))/q(x^(s)) = (Z_q/Z_p) w*(x^(s)) ≈ w*(x^(s)) / [(1/S) Σ_{s'} w*(x^(s'))],

which can be used within the estimator above. The resulting estimator is biased, but consistent when both estimators have bounded variance.
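The self-normalized estimator above can be sketched in a few lines. This is a minimal illustration of ours, not an example from the thesis: the target is a Gaussian carrying a deliberately "unknown" scale factor playing the role of Z_p, and the proposal is a broader Gaussian so that the weights have finite variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target p*(x): a standard normal times an 'unknown' constant.
Zp_true = 7.3                    # pretend we do not know this value
def p_star(x):
    return Zp_true * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# Normalized proposal q(x): a broader Gaussian (heavier tails than the target).
sigma_q = 2.0
def q(x):
    return np.exp(-0.5 * (x / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))

S = 200_000
xs = rng.normal(0.0, sigma_q, size=S)            # x^(s) ~ q
w_star = p_star(xs) / q(xs)                      # unnormalized weights w*(x)

Z_hat = w_star.mean()                            # estimate of Z_p (here Z_q = 1)
f_hat = (w_star * xs**2).sum() / w_star.sum()    # self-normalized E_p[x^2]

print(Z_hat, f_hat)   # close to 7.3 and 1.0 respectively
```

Dividing by the summed weights is exactly the bias-inducing but consistent normalization described above; with a proposal narrower than the target, the same code would give wildly misleading answers and error bars.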
Markov chain Monte Carlo (MCMC)

Both rejection sampling and importance sampling require a tractable surrogate distribution q(x). Neither method will perform well if max_x p(x)/q(x) is large: rejection sampling will rarely return samples, and importance sampling will have large variance. Markov chain Monte Carlo methods can be used to sample from distributions p(x) that are complex and have unknown normalization. This is achieved by relaxing the requirement that the samples be independent: a Markov chain generates a correlated sequence of states. Each step in the sequence is drawn from a transition operator T(x′ ← x), which gives the probability of moving from state x to state x′. According to the Markov property, the transition probabilities depend only on the current state x. In particular, any free parameters (e.g. step sizes) in a family of transition operators T(x′ ← x) cannot be chosen based on the history of the chain.

A basic requirement for T is that, given a sample from p(x), the marginal distribution over the next state in the chain is also the target distribution of interest, p:

    p(x′) = Σ_x T(x′ ← x) p(x)   for all x′.

By induction, all subsequent steps of the chain will have the same marginal distribution; the transition operator is said to leave the target distribution p stationary. MCMC algorithms often require operators that ensure the marginal distribution over a state of the chain tends to p(x) regardless of starting state. This requires irreducibility, the ability to reach any x with p(x) > 0 in a finite number of steps, and aperiodicity, no states being accessible only at certain regularly spaced times; for more details see Tierney. For now we note that, as long as a T satisfies the stationarity condition above, it can be useful, as the other conditions can be met through combinations with other operators.
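The stationarity condition is easy to verify numerically on a toy chain. Here is a sketch with a made-up three-state target: we build a Metropolis-style transition matrix from a uniform proposal and check both stationarity and (for this construction) detailed balance exactly.

```python
import numpy as np

# A small discrete target distribution p over three states (made up for illustration).
p = np.array([0.2, 0.5, 0.3])
n = len(p)

# Build a Metropolis transition matrix from a uniform (symmetric) proposal:
# T[j, i] = probability of moving to state j given current state i.
q = np.full((n, n), 1.0 / n)
T = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            T[j, i] = q[j, i] * min(1.0, p[j] / p[i])
    T[i, i] = 1.0 - T[:, i].sum()          # rejected mass stays put

# Stationarity: sum_x T(x' <- x) p(x) = p(x') for all x'.
assert np.allclose(T @ p, p)

# Detailed balance: T(x' <- x) p(x) = T(x <- x') p(x') for all pairs.
flow = T * p[np.newaxis, :]                # flow[j, i] = T[j, i] * p[i]
assert np.allclose(flow, flow.T)
```

The matrix-vector check `T @ p == p` is exactly the stationarity equation above, summed over the current state; the symmetry of the `flow` matrix anticipates the detailed balance condition introduced next.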
Given that p(x) is a complicated distribution, it might seem unreasonable to expect that we could find a transition operator T leaving it stationary. However, it is often easy to construct a transition operator satisfying detailed balance:

    T(x′ ← x) p(x) = T(x ← x′) p(x′)   for all x, x′.

This states that a step starting at equilibrium and transitioning under T has the same probability forwards (x → x′) as backwards (x′ → x). Proving detailed balance only requires considering each pair of states in isolation; there is no sum over all states as in the stationarity condition. Having shown detailed balance, summing over x on both sides immediately recovers the stationarity requirement. Thus detailed balance is a useful property for deriving many MCMC methods. However, it is not always required, or even desirable; a later chapter introduces some MCMC transition operators that do not satisfy it.

Given any transition operator T satisfying the stationarity condition, we can construct a reverse operator T̃, defined by

    T̃(x ← x′) = T(x′ ← x) p(x) / p(x′) = T(x′ ← x) p(x) / [Σ_{x″} T(x′ ← x″) p(x″)].

A symmetric form of this relationship shows that an operator satisfying detailed balance is its own reverse transition operator:

    T(x′ ← x) p(x) = T̃(x ← x′) p(x′)   for all x, x′.

Summing over x or x′ reveals that this mutual reversibility condition implies that the stationarity condition holds for both T and T̃. Therefore constructing a pair of mutually reversible transition distributions is an alternative strategy for constructing MCMC operators; detailed balance is the restricted case where a transition operator must be its own reverse operator.

Choice of method
In this introduction we described the importance of high-dimensional probability distributions. Samples from these distributions capture their typical properties, which is the basis of the Monte Carlo estimation of expectations such as ∫ f(x) p(x) dx. We assume, or hope, that extreme values under p(x) do not form a significant contribution to the integral; for expectations in many physical and statistical applications this is a reasonable assumption.

In high-dimensional problems, sampling from an interesting target distribution p(x) is often intractable. Methods such as rejection sampling fail because finding a close-matching simple distribution q(x) is not possible; for the same reason, importance sampling estimators tend to have very high variance. We described rejection sampling as it remains an important method when i.i.d. samples from low-dimensional distributions are required; this occurs in some simulation work and as part of some MCMC methods. Similarly, while simple importance sampling has problems in high dimensions, it provides the basis of more advanced methods that are useful on some high-dimensional problems. We neglect much of the importance sampling literature, but will study some methods relating to Markov chains.

The focus of this thesis is methods that use Markov chains. These allow us to draw correlated samples from complex distributions and to perform Monte Carlo integration in high-dimensional problems. The next chapter reviews several important algorithms based on Markov chains, together with some new contributions. After this we will be in a position to outline the remainder of the thesis.

Markov chain Monte Carlo
This chapter reviews some important Markov chain Monte Carlo (MCMC) algorithms. Much of this material is standard and could be skipped by those already familiar with the literature. However, some of the material in this chapter is, to the best of our knowledge, novel. These contributions include generalizations of tempered transitions, and the introduction of a slice-sampling version of the two-stage acceptance rule (see the relevant sections below). We also present some results in an unconventional way, which is designed to help with reading later chapters.

Metropolis methods
The algorithm below, due to Hastings, gives a procedure for simulating a Markov chain with stationary distribution p(x).

Algorithm (Metropolis–Hastings)
  Input: initial setting x, number of iterations S
  for s = 1 .. S
    1. Propose x′ ~ q(x′; x).
    2. Compute a = [p(x′) q(x; x′)] / [p(x) q(x′; x)].
    3. Set x ← x′ with probability min(1, a), e.g.:
       (a) Draw r ~ Uniform[0, 1].
       (b) If r < a, then set x ← x′.
  end for

The setting of x at the end of each iteration is considered as a sample from the target probability distribution p(x). Adjacent samples are identical when the state is not updated in step (b), but every iteration must be recorded as part of the Markov chain.

It is straightforward to show that the Metropolis–Hastings algorithm satisfies detailed balance:

    T(x′ ← x) p(x) = min(1, [p(x′) q(x; x′)] / [p(x) q(x′; x)]) q(x′; x) p(x)
                   = min(p(x) q(x′; x), p(x′) q(x; x′))
                   = min([p(x) q(x′; x)] / [p(x′) q(x; x′)], 1) q(x; x′) p(x′)
                   = T(x ← x′) p(x′),

as required. Here the probability for a forwards transition x → x′ was manipulated into the probability of the reverse transition x′ → x; however, it would have been sufficient to stop after the second line and note that the expression is symmetric in x and x′.

The algorithm is valid for any proposal distribution q. Ideal proposals would be constructed for rapid exploration of the distribution of interest, p. We could write q(x′; x, D) to emphasize choices of proposal that are based on observed data D; using any fixed D is valid, but q cannot be based on the past history of the sampler, as such a chain would not be Markov. Often proposals are not complicated data-based distributions: simple perturbations, such as a Gaussian with mean x, are commonly chosen. The accept/reject step in the algorithm corrects for the mismatch between the proposal and target distributions.
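A minimal random-walk implementation of the algorithm above can be written in a few lines. This is a sketch of ours, not the thesis's code: it works with log densities for numerical stability, and the bimodal target is an arbitrary illustrative choice.

```python
import numpy as np

# Unnormalized 1-D target: an equal mixture of Gaussians at -2 and +2.
def log_p_star(x):
    return np.logaddexp(-0.5 * (x - 2.0)**2, -0.5 * (x + 2.0)**2)

def metropolis(log_p, x0, step, S, rng):
    """Random-walk Metropolis: symmetric Gaussian proposal, so q cancels in a."""
    x, lp = x0, log_p(x0)
    samples = np.empty(S)
    for s in range(S):
        x_prop = x + step * rng.standard_normal()
        lp_prop = log_p(x_prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept with prob min(1, a)
            x, lp = x_prop, lp_prop
        samples[s] = x                             # record every iteration
    return samples

rng = np.random.default_rng(1)
chain = metropolis(log_p_star, 0.0, step=2.0, S=50_000, rng=rng)
print(chain.mean())   # near 0 by the symmetry of the two modes
```

Note that the current state is recorded again after a rejection, exactly as the algorithm requires; dropping rejected iterations would bias the estimators.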
When the proposal distribution is symmetric, i.e. q(x′; x) = q(x; x′) for all x, x′, the acceptance ratio in step 2 simplifies to a ratio under the distribution of interest, a = p(x′)/p(x). This is the original Metropolis algorithm (Metropolis et al.). Some authors, e.g. MacKay, prefer to drop such distinctions and simply refer to all Metropolis–Hastings algorithms as Metropolis methods; the next section suggests another justification of this view.

Generality of Metropolis–Hastings
Consider an MCMC algorithm that proposes a state from a distribution q(x′; x) and accepts with probability P_a(x′; x). Restricting attention to transition operators satisfying detailed balance,

    P_a(x′; x) q(x′; x) p(x) = P_a(x; x′) q(x; x′) p(x′)   for all x, x′,

gives an equality constraint. We also have the inequalities that must hold for probabilities: P_a(x′; x) ≤ 1 and P_a(x; x′) ≤ 1. Optimizing the average acceptance probability, p(x) P_a(x′; x) + p(x′) P_a(x; x′), with respect to the pair of acceptance probabilities must saturate one or more of the inequalities, a well-known property of linear programming problems. Therefore either P_a(x′; x) = 1 or P_a(x; x′) = 1, and the other probability is given by the equality constraint. This gives the Metropolis–Hastings acceptance rule:

    P_a(x′; x) = min(1, [p(x′) q(x; x′)] / [p(x) q(x′; x)]).

According to the equality constraint, a pair of valid acceptance probabilities must have the same ratio P_a(x′; x)/P_a(x; x′) as any other valid pair; therefore they can be obtained by multiplying the Metropolis–Hastings probabilities by a constant less than one. This corresponds to mixing in some fraction of the "do nothing" transition operator, which leaves the chain in the current state; at those two sites it also corresponds to adjusting the proposal distribution q to suggest staying still more often. Staying still more often harms the asymptotic variance of the chain; in this sense, although not by all measures, using the Metropolis–Hastings rule is optimal (Peskun).
Given this result, it is unsurprising that Metropolis–Hastings has become almost synonymous with MCMC. It is tempting to conclude that the only way to improve Markov chains for Monte Carlo is by researching domain-specific proposal distributions, such as the vision community's data-driven MCMC (Tu and Zhu). In fact, a rich variety of more generic MCMC-based algorithms exist and continue to be developed. Many of these do satisfy detailed balance, but the corresponding M–H q(x′; x) distribution is often defined implicitly, and would not be a natural description of the method.

To illustrate the limitations of claiming "all reversible MCMC is just Metropolis–Hastings", we show that "all reversible MCMC is just Metropolis". This also demonstrates a methodology used throughout the thesis: we construct a new target distribution

    p(x, x′) = q(x′; x) p(x),

i.e. the joint distribution over a point x ~ p and a point x′ proposed from that location. The marginal distribution over x is the original target, p. Now consider the symmetric Metropolis proposal that swaps the values of x and x′, such that, putatively, x′ comes from p and x was proposed from it. The Metropolis acceptance probability for this swap proposal is

    min(1, p(x′, x) / p(x, x′)) = min(1, [p(x′) q(x; x′)] / [p(x) q(x′; x)]),

i.e. the Metropolis–Hastings acceptance probability. In this sense, only an algorithm with symmetric proposals is needed; but this is not a very natural description of the Metropolis–Hastings algorithm. Similarly, while it is possible to describe all algorithms satisfying detailed balance as Metropolis or Metropolis–Hastings algorithms, other descriptions may be more natural. However, we will find constructing joint distributions like p(x, x′) a useful theoretical tool, and one that suggests new Markov chain operators.

Gibbs sampling
Gibbs sampling (Geman and Geman) resamples each dimension x_i of a multivariate quantity x from its conditional distribution p(x_i | x_{j≠i}). Any individual update maintains detailed balance, which is easily checked directly. Alternatively, we can write the Gibbs sampling update as a Metropolis–Hastings proposal,

    q(x′; x) = p(x′_i | x_{j≠i}) 𝕀(x′_{j≠i} = x_{j≠i}),

where 𝕀 is an indicator function ensuring that all components other than x_i stay fixed. The acceptance probability for this proposal is identically one, so it need not be checked. Gibbs sampling is often easy to implement. If the target distribution is discrete and each variable takes on a small number of settings, then the conditional distributions can be computed explicitly:

    p(x_i | x_{j≠i}) = p*(x_i, x_{j≠i}) / Σ_{x′_i} p*(x′_i, x_{j≠i}).

On continuous problems, the one-dimensional conditional distributions are usually amenable to standard sampling methods, as mentioned in an earlier subsection. A desirable feature of Gibbs sampling is that it has no free parameters, and so can be applied fairly automatically; indeed, the BUGS packages can create Gibbs samplers for a large variety of models specified in a simple description language (Spiegelhalter et al.). There are actually some free choices regarding how the variables are updated: block-Gibbs sampling chooses groups of variables to update at once; also, continuous distributions can be reparameterized before Gibbs sampling. A basis in which the dimensions are independent would obviously be particularly effective.
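As an illustration of Gibbs updates from closed-form conditionals, here is a sketch for a correlated bivariate Gaussian (our example, not the thesis's). For a zero-mean Gaussian with unit variances and correlation rho, each conditional is itself Gaussian: x1 | x2 ~ N(rho·x2, 1 − rho²), and symmetrically for x2 | x1.

```python
import numpy as np

rho = 0.9
rng = np.random.default_rng(2)

S = 50_000
x1, x2 = 0.0, 0.0
samples = np.empty((S, 2))
for s in range(S):
    # Resample each coordinate from its exact conditional distribution.
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
    samples[s] = x1, x2

print(np.corrcoef(samples.T)[0, 1])   # close to rho = 0.9
```

With rho near one the chain creeps slowly along the correlated ridge, which illustrates the remark above: in a reparameterized basis where the dimensions are independent, a single sweep would produce an independent sample.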
Clifford et al. contains several interesting discussions regarding the nature and history of the Gibbs sampler. The method has often been regarded with a rather undeserved special status amongst MCMC methods: it is just one of many ways to construct a Markov chain with a target stationary distribution. In particular, there is no need to approximate Gibbs sampling when sampling from the conditionals is not tractable, as in Ritter and Tanner: Metropolis–Hastings updates of each dimension can be used instead, and there is no need to call this method "Metropolis-within-Gibbs".

A two-stage acceptance rule
If the target distribution is factored into two terms, for example a prior and a likelihood, p(x) ∝ π(x) L(x), a proposal can be accepted with probability

    P_a = min(1, [q(x; x′) π(x′)] / [q(x′; x) π(x)]) · min(1, L(x′)/L(x)).

Straightforward checking shows that this satisfies detailed balance. We know from previous discussions that it will accept less often than standard Metropolis–Hastings, and is inferior according to Peskun, but computationally it can be more efficient. The algorithm below is an acceptance rule that accepts proposals with probability P_a: if π is able to veto an a priori unreasonable proposal, then L(x′) need not be computed. Factoring out a cheap sanity-check distribution π(x) may be worth a fall in acceptance rate in problems where likelihood evaluations are expensive.

Algorithm (two-stage acceptance rule)
  Input: initial setting x, proposed setting x′
  1. Draw r1 ~ Uniform[0, 1].
  2. If r1 < [q(x; x′) π(x′)] / [q(x′; x) π(x)]:
       3. Draw r2 ~ Uniform[0, 1].
       4. If r2 < L(x′)/L(x): accept; else: reject.
     else: reject.

As with standard Metropolis–Hastings, none of the probabilities need to be computed exactly: bounding them sufficiently to make the accept/reject decision in step 2 or 4 is all that is required. In practice, few Monte Carlo codes implement the required interval arithmetic to make these savings; in contrast, the two-stage procedure is generally applicable and easy to implement.
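The two-stage rule can be sketched as follows. The cheap factor π and "expensive" likelihood L here are illustrative stand-ins of ours (a Gaussian prior and a Gaussian likelihood), chosen only so the exact posterior, N(0.9, 0.9), is known for checking.

```python
import numpy as np

def log_pi(x):             # cheap sanity-check factor: N(0, 3^2) shape
    return -0.5 * (x / 3.0)**2

def log_L(x):              # stand-in for an expensive likelihood: N(1, 1) shape
    return -0.5 * (x - 1.0)**2

def two_stage_step(x, step, rng):
    """One step; returns (new state, whether the expensive factor was evaluated)."""
    x_prop = x + step * rng.standard_normal()        # symmetric q, q-ratio is 1
    # Stage 1: cheap test against pi only.
    if np.log(rng.uniform()) >= log_pi(x_prop) - log_pi(x):
        return x, False                              # vetoed; L never computed
    # Stage 2: the expensive factor.
    if np.log(rng.uniform()) < log_L(x_prop) - log_L(x):
        return x_prop, True
    return x, True                                   # L evaluated, but rejected

rng = np.random.default_rng(3)
x, S = 0.0, 50_000
samples = np.empty(S)
n_L_evals = 0
for s in range(S):
    x, evaluated = two_stage_step(x, step=4.0, rng=rng)
    n_L_evals += evaluated
    samples[s] = x

print(samples.mean())      # exact posterior mean is 0.9
print(n_L_evals / S)       # fraction of steps paying for the expensive factor
```

Every stage-1 veto saves one evaluation of L, which is the entire point of the construction; the acceptance rate is lower than plain Metropolis–Hastings, as noted above.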
The algorithm above is a special case of the surrogate transitions method (Liu), and also of the algorithm of Christen and Fox. The two-stage method here is also equivalent to Dostert et al.'s Algorithm II, which presented it in the context of a particular choice for q(x′; x). This literature offers a rich variety of possible approximate distributions that can be used for π in particular applications. In a statistical application where p(x) depends on a data set, a general choice would be to use a subset of the data to define π. An obvious generalization is splitting the acceptance rule into more than two terms; factoring the distribution into many terms could make the acceptance rate fall dramatically, so there would need to be a specific computational benefit.

Construction of estimators
The output of an MCMC algorithm is a set of correlated samples drawn from a joint distribution p({x^(s)}). At equilibrium, every marginal p(x^(s)) is the correct distribution of interest. Using these equilibrium samples, the straightforward Monte Carlo estimator would be unbiased:

    E_{p({x^(s)})} [ (1/S) Σ_s f(x^(s)) ] = (1/S) Σ_s Σ_{x^(s)} p(x^(s)) f(x^(s)) = E_p[f].

The states drawn from a Markov chain started at an arbitrary position will not have the correct marginal distribution. Asymptotically, the estimator is still unbiased under weak conditions, but in practice it is advisable to discard some initial states, allowing the chain to "burn in". After this, there is no need to remove intermediate samples in an attempt to obtain a set of approximately independent samples: such thinning will only make the variance of the estimator worse (Geyer). However, thinning can be justified when adjacent steps are highly correlated or when computing f(x) is costly; then discarding some samples can save computer time that is better spent running the Markov chain for longer.

A related issue is whether it is better to run one long Markov chain, or perhaps to obtain independent points more quickly by trying many initializations. Geyer also shows that, theoretically, a longer chain is to be preferred. Intuitively, an arbitrary initialization is bad as it introduces a bias which takes time to remove; it turns out that this time is better spent evolving an existing chain to new locations. Despite the theory favoring a single chain, we tend to prefer running a few, rather than one, when more than one computer processor is available: this is the easiest way to parallelize our code. Although agreement amongst multiple chains does not provide a guarantee that the Markov chains are mixing, differing results would reveal a failure to mix that might go unnoticed from a single chain. Finally, multiple chains are useful for adapting step sizes in Metropolis methods, and turn up naturally in the population methods of a later chapter.

Conditional estimators ("Rao-Blackwellization")
Using a sample average of function values is not the only way to construct MCMC estimators. It is an attractive default procedure, as the estimator only needs to know f(x) evaluated at the samples. However, it is sometimes possible to perform some computations involving f(x) analytically. A simple identity lets us use this analytical knowledge:

    Σ_x f(x) p(x) = Σ_h p(h) [ Σ_x f(x) p(x|h) ] = E_{p(h)} [ E_{p(x|h)}[f(x)] ],

where h is an arbitrary statistic. If we can sum over the distribution conditioned on h, then the bracketed term, evaluated under samples of h, can be used as an estimator. A special case provides more concrete motivation:

    Σ_x f(x_i) p(x) = Σ_{x_{j≠i}} p(x_{j≠i}) [ Σ_{x_i} f(x_i) p(x_i | x_{j≠i}) ] = E_{p(x_{j≠i})} [ E_{p(x_i | x_{j≠i})}[f(x_i)] ].

Here we are interested in a function of a particular variable, x_i; sums over a single variable are often tractable. The identity allows averaging E_{p(x_i | x_{j≠i})}[f(x_i)] under Monte Carlo samples, rather than f(x) itself. As an example, consider finding the marginal of a binary variable, f(x_i) = x_i. The standard estimator throws away x_{j≠i} and averages the x_i samples, a sequence of zeros and ones. The new estimator throws away the observed x_i values, but does use x_{j≠i}. The resulting average of real numbers between zero and one can be much better behaved: if x_i is nearly independent of the other variables, the new estimator gives nearly the correct answer from one sample, whereas the standard estimator is always noisy.
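The binary-marginal example can be made concrete with a toy two-variable model (made up for illustration; the samples here are i.i.d., which matters for the variance guarantee discussed next). We estimate P(x_i = 1) both by averaging the sampled x_i values and by averaging the known conditional p(x_i = 1 | x_j).

```python
import numpy as np

# Toy model: x_j ~ Bernoulli(0.5); x_i | x_j ~ Bernoulli(0.55 if x_j else 0.45).
# The true marginal is P(x_i = 1) = 0.5.
rng = np.random.default_rng(4)
S = 2_000

x_j = rng.random(S) < 0.5
p_cond = np.where(x_j, 0.55, 0.45)     # p(x_i = 1 | x_j), known analytically
x_i = rng.random(S) < p_cond

standard = x_i.mean()                  # averages zeros and ones
conditional = p_cond.mean()            # averages numbers in [0.45, 0.55]

print(standard, conditional)           # both near 0.5; the second is far less noisy
```

Because x_i is nearly independent of x_j, the conditional estimator's per-sample values all lie in [0.45, 0.55], so its variance is tiny compared with averaging raw zeros and ones, exactly the behavior described above.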
Does the conditioning trick improve variances in general? Without loss of generality, we assume that the function of interest has zero mean. Under this assumption, the variance of the estimator that conditions on a statistic h becomes

    Var_{p(h)} [ E_{p(x|h)}[f(x)] ] = Σ_h p(h) ( Σ_x p(x|h) f(x) )²
                                    ≤ Σ_h p(h) Σ_x p(x|h) f(x)²
                                    = Σ_x p(x) f(x)² = Var_{p(x)}[f(x)],

where the bound is an application of Jensen's inequality, using the convexity of the square function, and the final equality comes from summing out h. This means that the variance of the conditional estimator is never worse than that of the standard Monte Carlo estimator. This result applies under independent sampling from p(x); unfortunately, it does not hold in general for correlated MCMC samples (Liu et al.).

Even if the variance is improved, we don't generally know when a conditional estimator will be computationally more efficient than the straightforward Monte Carlo estimator. As an extreme example, a constant statistic h gives an estimator with zero variance, because the target expectation is computed exactly; clearly the cost of computing the conditional expectation must be considered. The use of this identity was suggested by Gelfand and Smith, an influential paper that popularized Gibbs sampling in the statistics community. Their motivation was essentially the variance bound above, cited as a version of the Rao–Blackwell theorem. The bound's requirement of independent samples was satisfied because the paper proposed running many independent Gibbs sampling runs. As this practice has fallen out of favor, it seems misleading to continue to call the method "Rao-Blackwellization", although some of the literature continues to do so. Despite the lack of Rao–Blackwell guarantees, the estimator can often be justified empirically, as was done earlier by at least Pearl.

Waste recycling
The Metropolis–Hastings algorithm has the undesirable property that, for many proposal distributions, a large fraction of proposed states are rejected from the final set of samples. As computations involving these rejected points were performed, it seems a shame not to use this information in Monte Carlo estimators. The same observation followed our description of rejection sampling, for which the solution was to use the same proposal distribution in importance sampling; M–H proposal distributions are generally local in nature and unsuitable as importance sampling proposals. Waste recycling is a framework for constructing better estimators using the importance ratios computed while running MCMC.

A simple way to understand how to perform waste recycling is to augment the stationary distribution over the current state x and the next proposal x′ with an auxiliary variable x̃. We declare that x̃ was generated by copying one of x and x′:

    p(x̃ | x, x′) = P_a δ(x̃ − x′) + (1 − P_a) δ(x̃ − x),   P_a = min(1, [p(x′) q(x; x′)] / [p(x) q(x′; x)]).

Alternatively, any other rule that gives x̃ the same stationary distribution as x can be used. Now we could use draws of x̃ for estimators instead of x. We can also average estimators under the conditional distribution of x̃ given x and x′:

    E_{p(x̃ | x, x′)}[f(x̃)] = P_a f(x′) + (1 − P_a) f(x).

This is just the conditional estimator from the previous section, applied to the auxiliary distribution p(x̃, x, x′) = p(x̃ | x, x′) q(x′; x) p(x). We hesitate to assign credit for this estimator to any particular author; it, or closely related estimators, can be found in various independent sources, including at least Kalos and Whitlock, Tjelmeland, and Frenkel.
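The waste-recycled estimator can be sketched for a random-walk Metropolis chain on a standard normal target (a toy setup of ours): at each step both the proposal and the current point contribute, weighted by the acceptance probability, regardless of whether the move is actually taken.

```python
import numpy as np

def log_p(x):                       # standard normal target, E[x^2] = 1
    return -0.5 * x * x

def f(y):                           # statistic of interest
    return y * y

rng = np.random.default_rng(5)
x, step, S = 0.0, 1.0, 50_000
plain, recycled = [], []
for s in range(S):
    x_prop = x + step * rng.standard_normal()
    a = min(1.0, np.exp(log_p(x_prop) - log_p(x)))      # acceptance probability
    # Waste-recycled term: E[f(x_tilde) | x, x'] = a f(x') + (1 - a) f(x).
    recycled.append(a * f(x_prop) + (1.0 - a) * f(x))
    if rng.uniform() < a:
        x = x_prop
    plain.append(f(x))                                  # standard estimator term

print(np.mean(plain), np.mean(recycled))                # both near 1.0
```

Both averages target the same expectation; as the text cautions, the recycled version is not guaranteed to have lower variance on correlated MCMC output, though it often helps in practice.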
Waste recycling is not necessarily better than the straightforward estimator, let alone worth the computational expense. However, there is empirical evidence that it can be useful, dramatically so in the case of Frenkel. Casella and Robert also derived a "Rao-Blackwellized" estimator for using all the points drawn by the Metropolis algorithm; however, their algorithm has an O(S²) cost in the number of steps of the Markov chain, and unsurprisingly has not been widely adopted.

[Figure: challenges for Markov chain exploration. (a) Slow diffusion; (b) burn-in and mode-finding; (c) balancing density and volume.]

Convergence
No review of Markov chains would be complete without discussing their convergence properties. If the simulation is initialized at an arbitrary starting location, how long must we wait before we can treat the chain's states as samples from the stationary distribution?

Often Metropolis proposals have a local nature: the typical step size is limited by a need to maintain a reasonable acceptance rate. The amount of time it takes to explore a distance of length L by a diffusive random walk scales like L² (in units of the step size; figure, panel a). This offers a rule of thumb for understanding sampling within a mode.
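The L² rule of thumb is easy to check with a toy simulation (ours, and only a rough sketch): the mean first time for a unit-step random walk to travel a distance L roughly quadruples each time L doubles.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma = 1.0                         # step size of the random walk

means = {}
for L in (8.0, 16.0, 32.0):
    times = []
    for trial in range(200):
        x, t = 0.0, 0
        while abs(x) < L:           # first time the walk is a distance L away
            x += sigma * rng.standard_normal()
            t += 1
        times.append(t)
    means[L] = np.mean(times)
    print(L, means[L])              # grows roughly fourfold as L doubles
```

The same scaling is why halving a Metropolis step size (to fix a poor acceptance rate) can quadruple the time needed to traverse a mode.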
An arbitrarily chosen initial state is usually very improbable under the target distribution. Reaching a mode in high dimensions can take a long time with sophisticated optimizers, let alone a Markov chain simulation. Analyzing the chain can identify when the statistics of the sampler settle down, allowing the initial exploratory phase to be discarded. Such diagnostics could be severely misleading if there are multiple modes, as in panel b of the figure; in general applications there is no way of knowing whether a Markov chain has yet to find the most important regions of a probability distribution.

There are also more subtle problems with convergence that do not relate to finding modes or step sizes. Panel c of the figure illustrates a distribution with a "pillar" at |x| < a. Points within this mode have higher probability density than any state within the shaded plateau, a < |x| < A; but given the large size of the plateau, few proposals will suggest moving to the pillar. Initializing the chain within the pillar, using an optimizer, would not help: here p(|x| < a) and p(a < |x| < A) are equal, so the pillar and plateau have equal probability mass, yet only a small fraction of proposals from the pillar to the plateau can be accepted; the acceptance rate is a/(A − a) for symmetric proposals. A proposal distribution that knew the probability masses involved, e.g. i.i.d. sampling with q(x′; x) = p(x′), would move rapidly between the modes, but this would involve already having the knowledge that MCMC is being used to discover.

There is a theoretical literature on Markov chain convergence, but for general statistical problems it is difficult to prove much about the validity of any particular MCMC result. In later chapters we make occasional use of a standard diagnostic tool to compare the convergence properties of chains. In general, we would also run the sampling code on a cut-down version of the problem, where exact results are available or simple importance sampling will work. In inference problems, it is a good idea to check that a sampler is consistent with ground truth when learning from synthetic data generated from the prior model; Geweke's "getting it right" method is a more advanced way to check consistency between samplers for the prior and posterior. Such diagnostics are also a guard against errors in the software implementation, a possibility considered seriously even by experts (Kass et al.).

Auxiliary variable methods
As emphasized in the introduction, Monte Carlo is usually best avoided when possible: in tractable problems, computations based on analytical solutions will usually be far more efficient. One would think this means we should analytically marginalize out variables wherever possible. Auxiliary variable methods turn this thinking on its head. Given a target distribution p(x), we introduce auxiliary variables z such that p(x) = ∫ p(x, z) dz. We could just sample from p(x), corresponding to analytically integrating out z from p(x, z); instead, auxiliary variable methods instantiate the auxiliary variables in a Markov chain that explores the joint distribution. Surprisingly, this can result in a better Monte Carlo method.

An alternative way of introducing extra variables is to make the target distribution a conditional of a joint distribution: the whole joint distribution is explored, but only samples where the conditioning variable takes its target setting are retained. This seemingly wasteful procedure can also yield better Monte Carlo estimates; an example is simulated tempering, discussed in a later subsection.

Swendsen–Wang
The Swendsen–Wang algorithm (Swendsen and Wang) is a highly effective algorithm for sampling Potts models with a small number of colors q. The algorithm is important in its own right (we use it in later chapters) and as a significant development in auxiliary variable methods: Edwards and Sokal provided a scheme for constructing similar auxiliary variable methods for a wider class of models. They also identified the Fortuin–Kasteleyn–Swendsen–Wang (FKSW) auxiliary variable joint distribution that underlies the algorithm. The FKSW joint distribution is over the original Potts color variables s = {s_i} on the nodes of a graph and binary bond variables d = {d_ij} present on each edge:

    p(s, d) = (1/Z_P) ∏_{(i,j)∈E} [ (1 − p_ij)(1 − d_ij) + p_ij d_ij δ(s_i, s_j) ],   p_ij = 1 − e^{−J_ij}.

As long as all couplings J_ij are positive, the marginal distribution over s is the Potts distribution. The marginal distribution over the bonds is the random cluster model of Fortuin and Kasteleyn:

    p(d) = (1/Z_P) [ ∏_{(i,j)∈E} p_ij^{d_ij} (1 − p_ij)^{1 − d_ij} ] q^{C(d)},

where C(d) is the number of connected components in a graph with edges wherever d_ij = 1. The algorithm of Swendsen and Wang performs block Gibbs sampling on the joint model by alternately sampling from p(d|s) and p(s|d). This also allows a sample from any of the three distributions, Potts p(s), random cluster p(d), or FKSW p(s, d), to be converted into a sample from one of the others. The algorithm is illustrated in the figure below. While implementing Swendsen–Wang, we were grateful for the efficient percolation code available in Newman and Ziff.

It is commonly stated that Swendsen–Wang only allows positive J_ij couplings, as presented here. In fact, Swendsen and Wang described the extension to negative bonds, which also follows easily from Edwards and Sokal's generalization. The resulting algorithm only forms bonds on edges with negative J_ij if the adjacent sites have different colors; colors joined by these negative bonds must remain different when choosing a new coloring. For binary systems, two colorings of a cluster are possible: the previous coloring and its inverse. When there are many strong negative interactions, the entire system gets locked into one of two configurations, whereas many configurations are probable. Swendsen–Wang can still be very useful when only some connections have negative weights; Stern et al. provide a nice example in the context of modeling the game of Go. For a wider view of related methods in the context of spatial statistics, see Besag and Green.
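A compact sketch of one Swendsen–Wang sweep for a two-color Potts model with uniform positive coupling is given below. This is our own minimal implementation, using a simple union-find for cluster identification rather than the Newman–Ziff percolation code mentioned above; the lattice size and coupling are arbitrary illustrative choices.

```python
import numpy as np

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def sw_sweep(s, J, rng):
    """One Swendsen-Wang sweep on an N x N periodic lattice, q = 2 colors."""
    N = s.shape[0]
    p_bond = 1.0 - np.exp(-J)            # bond probability p = 1 - exp(-J)
    parent = np.arange(N * N)
    # p(d|s): bond each equal-colored neighbor pair with probability p_bond.
    # Right and down neighbors cover every undirected edge exactly once.
    for i in range(N):
        for j in range(N):
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = (i + di) % N, (j + dj) % N
                if s[i, j] == s[ni, nj] and rng.uniform() < p_bond:
                    parent[find(parent, i * N + j)] = find(parent, ni * N + nj)
    # p(s|d): assign every connected cluster a new color uniformly at random.
    new_color = rng.integers(0, 2, size=N * N)
    for idx in range(N * N):
        s.flat[idx] = new_color[find(parent, idx)]
    return s

rng = np.random.default_rng(7)
N = 16
s = rng.integers(0, 2, size=(N, N))
ms = []
for sweep in range(100):
    s = sw_sweep(s, J=0.4, rng=rng)
    ms.append(s.mean())
print(np.mean(ms))    # fraction of '1' sites, fluctuating around 0.5 here
```

Each sweep can recolor large regions at once, in contrast to the site-by-site diffusion of single-site Gibbs sampling described in the figure caption below.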
[Figure: the Swendsen–Wang algorithm. (a) A Potts state: as an example, we run the algorithm on a binary Potts model on a square lattice. (b) An FKSW state: under the FKSW conditional p(d|s), bonds are placed down with probability p_ij = 1 − e^{−J_ij} wherever adjacent sites have the same color. (c) A random cluster state: discarding the colors gives a sample from the random cluster model. (d) A new FKSW state: sampling from p(s|d) involves assigning each connected component, or "cluster", a new color uniformly at random. Discarding the bonds gives a new setting of the Potts model. This coloring is dramatically different from the previous one; in contrast, a sweep of single-site Gibbs sampling can only diffuse the boundaries of the red and blue regions by roughly the width of a single site.]
[Figure: slice sampling. (a)–(c) show a procedure for unimodal distributions: after sampling h from its conditional distribution, an interval is found enclosing the region where h(x) > h; samples are drawn uniformly from this interval until one underneath the curve is found. Points outside the curve can be used to shrink the interval: this is an adaptive rejection sampling method. (d) Sampling uniformly from the slice is difficult for multimodal distributions. (e) One valid procedure uses an initial bracket of fixed width that encloses x but is centered uniformly at random; this bracket is extended in increments of the same width until it sticks outside the curve at both ends. This can cut off regions of the slice, but applying the previous adaptive rejection sampling procedure still leaves the uniform distribution on the slice invariant.]

Slice sampling
All Metropolis methods make use of a proposal distribution q(x′; x). In continuous spaces, such distributions have step-size parameters that have to be set well: almost all proposals are rejected if step sizes are too large, while overly small step sizes lead to slow, diffusive exploration of the target distribution. In contrast, a self-tuning method, which has less-important or even no step-size parameters, would be preferable. Slice sampling (Neal) is a method with one auxiliary variable that fits into the framework of Edwards and Sokal. The auxiliary variable does not directly help mixing, but allows the use of relatively easy-to-implement algorithms that allow self-tuning while generating a valid Markov chain. Like Gibbs sampling, a slice sampling chain has no rejections: it always moves when sampling a continuous distribution.

Slice sampling works by stepping back to the earlier view of sampling as drawing points uniformly underneath a curve h(x) ∝ p*(x), rather than drawing points i.i.d. as in rejection sampling. A Markov chain explores this uniform distribution over the original variables x and a height variable h < h(x). Slice sampling algorithms alternate between updating h and one or more of the original variables. The height has a simple conditional distribution from which it can be resampled:

    h ~ Uniform(0, h(x)).

Conditioned on h, the stationary distribution over x is uniform over the "slice" that lies under the curve, {x : h(x) > h}. When the distribution is unimodal, this conditional can be sampled by rejection sampling (figure, panels a–c). More generally, we use transition operators that leave a uniform distribution on the slice stationary. One of the operators introduced by Neal is briefly explained in panel e of the figure. Like most Metropolis methods, it uses a step-size parameter, which would ideally match the length scale of the problem; a poorly matched step size means the slice is found and explored slowly, by a random walk. Neal also provides a slice sampling operator that can expand its search region exponentially quickly; Skilling and MacKay provide a version with no step size, which starts with a large search region that shrinks with a fast exponent. Both of these operators are slightly harder to implement than the simple version sketched in the figure.
opose a simple we believe novel method that can shrink slice samplings search re
gion more efficiently the method is a generalization of subsection where we fact
ored the target density into px xlx here we also introduce two auxiliary variabl
es ph x uniform x and ph x uniform lx the new method follows any of the standard
slice sampling algorithms but defines the slice by requiring both x h and lx h
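As a concrete sketch, one step of this factored slice sampler, using the bracket-and-shrink search of figure (e), might look as follows. The function names, the bracketing scheme and the log-space arithmetic are our own illustrative choices, not code from the thesis; the cheap factor π is tested first, so L need only be evaluated when that test passes.

```python
import random


def factored_slice_step(x, log_pi, log_l, w=1.0):
    """One slice sampling update for a scalar target p(x) ∝ pi(x) * L(x),
    with two auxiliary heights: the slice is {z : pi(z) >= h1 and L(z) >= h2}."""
    # h ~ Uniform(0, f(x)) is log f(x) - Exp(1) in log space.
    log_h1 = log_pi(x) - random.expovariate(1.0)
    log_h2 = log_l(x) - random.expovariate(1.0)

    def on_slice(z):
        # Check the cheap factor first; only evaluate L when necessary.
        return log_pi(z) > log_h1 and log_l(z) > log_h2

    # Bracket of width w positioned uniformly at random around x,
    # stepped out until both ends are off the slice.
    left = x - w * random.random()
    right = left + w
    while on_slice(left):
        left -= w
    while on_slice(right):
        right += w
    # Draw uniformly from the bracket, shrinking it towards x on rejections.
    while True:
        z = left + (right - left) * random.random()
        if on_slice(z):
            return z
        if z < x:
            left = z
        else:
            right = z
```

For example, with log π(x) = log L(x) = −x²/4 the chain samples a standard normal distribution.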
This version of slice sampling still satisfies detailed balance with respect to the stationary distribution p(x). We first identify all of the quantities involved in the slice sampling procedure. The algorithm starts with a point x ~ p(x), then two auxiliary variables (h1, h2) are generated. The search procedure (e.g. figure (e)) will create an interval around x, and possibly some unacceptable points and adaptations of the interval; we collectively refer to all of the intermediate intervals and rejected points as S. Finally, an acceptable point x′ is found and adopted. The joint probability of all of these quantities is

    T(x′, S, h1, h2; x) p(x) = p(S, x′ | h1, h2, x) p(h1 | x) p(h2 | x) p(x)
                             = p(S, x′ | h1, h2, x) · (1/π(x)) · (1/L(x)) · (π(x)L(x)/Z)
                             = p(S, x′ | h1, h2, x) / Z.

All of the standard slice sampling procedures are designed to ensure that x′ is only accepted as the final point if this expression is symmetric in x and x′. Therefore

    T(x′, S, h1, h2; x) p(x) = T(x, S, h1, h2; x′) p(x′),

and summing this expression over S, h1 and h2 shows that the operator satisfies detailed balance.

Multiple auxiliary variables sometimes allow very fast mixing, as in Swendsen–Wang. However, in this context the slice defined by the two auxiliary variables
h1 and h2 will typically be smaller than under standard slice sampling; this is likely to increase the mixing time of the chain, and Neal warns about this problem with reference to Damien et al. Our motivation is computational: h1 can be checked first, and only if π(x) satisfies the slice constraint need L(x) ≥ h2 be checked. Thus schemes that start with a large range and rapidly shrink a bracket around the slice can do so with less computation, provided π(x) can reject some unreasonable points while being cheap to compute.

Hamiltonian Monte Carlo

Physical simulations often work by discretizing time and approximately computing the result of following the Hamiltonian dynamics of the system being modeled. These dynamics exploit gradients of a target stationary probability density, not just point-wise evaluations as in the Metropolis algorithm. Theoretically, gradients only ever cost a constant multiple of the computer time needed to evaluate a function (Bischof and Bücker), and are a much richer source of information. The difficulty with such simulation work is that inaccuracies accumulate rapidly unless time is discretized very finely, which has a large computational cost.

The Metropolis algorithm targets only an equilibrium distribution over states; the method doesn't need this distribution to be the result of an underlying dynamical system. Instead, the algorithm evolves according to a proposal distribution, and its progress often resembles a diffusive random walk. In contrast, a particle evolving under Hamiltonian dynamics will naturally move in persistent trajectories across its state-space. These two areas of physical simulation research were combined
in hybrid Monte Carlo (Duane et al.): Hamiltonian dynamics are simulated approximately as a Metropolis proposal, which is accepted or rejected. This allows persistent, rapid motion across a state space without damaging the equilibrium distribution of the chain through discretization errors. If the target probability distribution does not correspond to a Hamiltonian system, then auxiliary variables are introduced to create a fictitious system that can be simulated. While Hamiltonian dynamics is time-reversible, some care is required to ensure the discretized version is also reversible. Neal reviews a variety of Hamiltonian-based Monte Carlo methods; see MacKay for another review with simple example code. Hybrid Monte Carlo crossed into a statistical setting through work on Bayesian neural networks (Neal), where it is extremely successful. Hamiltonian Monte Carlo methods continue to appear in recent work, especially in Gaussian process modeling, although they could be adopted more widely.

Annealing methods

Simulated annealing

Simulated annealing (Kirkpatrick et al.) is a heuristic for minimizing a cost function E(x) with isolated
local optima. The Metropolis algorithm is run on a probability distribution derived from the energy E:

    p_k(x) = π(x) e^{−β_k E(x)} / Z_k,

where a base measure π(x) is often omitted (set to uniform), but including this term ensures p_k is defined when β_k = 0. Initially the inverse temperature β_k is set very low, so that the distribution is diffuse. The distribution is then slowly "cooled" by increasing β_k towards infinity; eventually all of the probability mass will be concentrated on the global optimum of E(x). There can be no guarantee that a particular sampling procedure will actually find the global optimum; in general it is not possible to do better than an exhaustive search. However, the heuristic has proved useful in a variety of applications.

Monte Carlo is not interested in optima, but if p(x) ∝ L(x)π(x) and E(x) = −log L(x), then β = 1 returns the target distribution of interest, and distributions with β < 1 are more diffuse. The figure below shows an example of a sequence of distributions. It might be that isolated modes at β = 1 are difficult to explore by available Markov chain operators; these modes join and disappear at lower β, and these more diffuse distributions will typically be easier to explore. This section reviews algorithms designed to exploit this observation to provide better samples from multimodal target distributions.
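As a concrete illustration of the optimization heuristic above, a minimal Metropolis-based annealing loop can be written in a few lines. This is a sketch under our own assumptions: the geometric temperature schedule, Gaussian proposal, and the double-well energy in the usage below are illustrative choices, not part of the thesis.

```python
import math
import random


def anneal(energy, x0, betas, step=1.0, sweeps=100):
    """Simulated annealing: run Metropolis updates at each inverse temperature
    in `betas` (increasing), targeting p_beta(x) ∝ exp(-beta * E(x)).
    Returns the final state and its energy."""
    x, ex = x0, energy(x0)
    for beta in betas:
        for _ in range(sweeps):
            x_prop = x + random.gauss(0.0, step)
            e_prop = energy(x_prop)
            # Metropolis accept/reject for the current tempered distribution.
            if math.log(random.random() + 1e-300) < -beta * (e_prop - ex):
                x, ex = x_prop, e_prop
    return x, ex
```

For example, with the double-well energy E(x) = (x² − 1)² and a geometric schedule of β values, the returned state ends near one of the two minima at x = ±1.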
It turns out that all of these methods can also provide estimates of the target distribution's normalizing constant; a full discussion of normalizing constants is deferred to a later chapter.

Notation

All of the algorithms in this section, and several later in the thesis, use a sequence of distributions. We have adopted the following conventions throughout. p_0 is a base distribution, usually easy to explore or amenable to direct sampling; if the p_k are defined by inverse temperatures then β_0 = 0. There are K − 1 intermediate distributions p_k, 0 < k < K, between the base and target distributions. p_K ≡ p is the target distribution; if the p_k are defined by inverse temperatures then β_K = 1. The sequences are usually defined using temperatures, as in the equation above, but other methods for creating intermediate distributions between the base and target distributions may be used.

Figure: the effect of annealing. At β = 0 the distribution is equal to a convenient base distribution; as the inverse temperature increases, the distribution morphs into the target distribution at β = 1.

Simulated tempering and expanded ensembles

A coherent way of dealing with annealing in MCMC simulations was independently discovered under the names "expanded ensembles" (Lyubartsev et al.) and "simulated tempering" (Marinari and Parisi). The idea is to bring the choice of intermediate distribution p_k into the joint distribution under exploration:
    p(x, k) ∝ w_ST(k) p̃_k(x),

where the w_ST(k) are user-chosen weights attached to each temperature level. The conditional distribution at temperature K returns the original distribution of interest, p(x | k = K) = p(x). A Markov chain at equilibrium returns correlated samples from the correct distribution by only recording states when k = K. The hope is that the extra computation is justified by the decrease in autocorrelation of the Markov chain.

The extent to which simulated tempering improves a chain's mixing time depends heavily on the target distribution and the Markov chains used. One general characteristic of the algorithm is the amount of time it takes to move the temperature index k between its extreme values. Usually Metropolis proposals k → k ± 1 are used, because large changes in temperature are unlikely to be accepted. The adjacent levels must be similar enough for these changes in k to be accepted frequently; this sets a minimum number of temperature levels, K. Even if all moves were accepted, k would be undergoing a slow random walk: it will take at least O(K²) steps to bring information from the rapidly mixing k = 0 distribution to the target k = K distribution.

The extra cost of the algorithm depends on the fraction of time spent at the other temperature levels. At equilibrium each level has marginal probability

    p(k) = Σ_x w_ST(k) p̃_k(x) ∝ w_ST(k) Z_k.

The partition function Z_k can easily vary by orders of magnitude across temperature levels. In order for simulated tempering to spend a reasonable amount of time at each temperature level we could set w_ST(k) = 1/Z_k, except of course that this ideal weighting is unknown; e.g. Z_K is the unknown normalization of the target distribution. A practical implementation of simulated tempering must be an iterative process, where the w_ST(k) are improved over a series of preliminary runs.

Figure: an illustration of the idea behind a family of parallel tempering algorithms. Each column represents the state of a Markov chain. States in the first row have marginal distribution p(x); the other rows have stationary distributions bridging between this distribution and a vague distribution for which exploration is easy. MCMC proceeds by traditional simulation of each of the independent states, and (dotted arrows) Metropolis proposals to exchange the states between adjacent levels.

Parallel tempering

Parallel tempering is a popular alternative to simulated tempering, based on a different joint distribution. While simulated tempering uses a union space consisting of the temperature and the original state space,
parallel tempering simulates on a product space: replicas of the original state space exist concurrently, one for each of the stationary distributions p_k,

    p({x^(k)}) = Π_{k=0}^{K} p_k(x^(k)).

As the states at each level are independent under the target distribution, independent transition operators T_k, with corresponding stationary distributions p_k, can be applied. For the higher temperature chains to communicate the results of their freer motion to the lower temperature chains, interactions are also required. The figure above illustrates one possible way of evolving a product-space ensemble.
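Concretely, one sweep of exchange proposals between adjacent levels can be sketched as follows, for tempered distributions p_k(x) ∝ e^{−β_k E(x)}. The function name and pairing scheme are our own illustrative choices.

```python
import math
import random


def swap_sweep(xs, betas, energy):
    """Propose exchanges between adjacent temperature levels of a parallel
    tempering ensemble. xs[k] is the state at inverse temperature betas[k].
    Each swap is accepted with the Metropolis probability, which leaves the
    product distribution prod_k exp(-betas[k] * E(xs[k])) invariant."""
    for k in range(len(xs) - 1):
        # Exchanging x_k and x_{k+1} changes the joint log probability by
        # (beta_{k+1} - beta_k) * (E(x_{k+1}) - E(x_k)).
        delta = (betas[k + 1] - betas[k]) * (energy(xs[k + 1]) - energy(xs[k]))
        if math.log(random.random() + 1e-300) < delta:
            xs[k], xs[k + 1] = xs[k + 1], xs[k]
    return xs
```

A swap that moves the lower-energy state to the colder level always raises the joint probability and is always accepted; the reverse exchange is accepted only with the corresponding Boltzmann probability.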
The state of the Markov chain consists of an entire column of variables. These evolve using standard MCMC operators, or by swaps between states that are accepted or rejected according to the Metropolis rule. The swap proposals can be scheduled in a variety of ways, but as with simulated tempering, information will propagate amongst levels by a random walk, or slower. By construction, the amount of computer time spent on each level is completely under the user's control; in particular, no weights w_ST(k) are required, which is one source of the method's relative popularity. (This method has several names and independent authors; its history is documented by Iba.) Another feature of coexisting chains is the possibility of using proposals based on sophisticated interactions between states, e.g. Wang and Swendsen. These advantages are at the expense of an obvious increase in memory requirements, and a harder-to-gauge cost for bringing a larger system to equilibrium.

Annealed importance sampling (AIS)

Annealed importance sampling (Neal) looks more like the original simulated annealing for optimization than the above
ensemble methods. An initial point x_0 is drawn from p_0, a vague, high-temperature base distribution. Further points x_1, x_2, …, x_K result from "cooling" by evolving through a sequence of K Markov chain transition operators T_1, …, T_K, where T_k leaves p_{k−1} stationary; these stationary distributions become sharper at each iteration. The ensemble of K + 1 points generated in the algorithm is denoted X = (x_0, x_1, …, x_K). We treat the procedure that generated them as a proposal, which has the following joint distribution:

    q(X) = p_0(x_0) Π_{k=1}^{K} T_k(x_k ← x_{k−1}).

But what is this a proposal for? The marginal distribution over the final point, q(x_K), is not the target distribution and is probably intractable; we cannot directly compare q(x_K) with p(x_K). Instead we must create an ensemble target distribution p*(X) that can be compared to q(X). Under this target model, x_K is drawn from the distribution of interest, p ≡ p_K; the remaining auxiliary quantities are drawn using a sequence of reverse Markov chain transitions, such that

    p*(X) = p_K(x_K) Π_{k=1}^{K} T̃_k(x_{k−1} ← x_k),

where the T̃_k are reverse operators as described earlier. The figure below illustrates both the ensembles p*(X) and q(X). Samples of X from q can be assigned importance weights with respect to p* in the usual way:

    w(X) = p̃*(X) / q(X) = Π_{k=1}^{K} p̃_k(x_k) / p̃_{k−1}(x_k).

(In Neal's papers the indexes of the points started at n and ran down; the presentation here is designed to be consistent with similar algorithms found in later chapters.)
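One AIS run, accumulating the importance weight above in log space, can be sketched as follows. This is an illustrative sketch under our own assumptions: the function names are ours, `transition` stands for any MCMC operator with the stated stationary distribution, and the random-walk Metropolis operator below is just one convenient choice.

```python
import math
import random


def ais_weight(log_fs, transition, x0):
    """One annealed importance sampling run. log_fs[k] is the log of the
    unnormalized density of p_k (k = 0 base, k = K target); transition(x, log_f)
    is any MCMC operator leaving the density exp(log_f) invariant. Returns the
    final point and log w(X); exp(log w) is an unbiased estimate of Z_K / Z_0."""
    x, log_w = x0, 0.0
    for k in range(1, len(log_fs)):
        x = transition(x, log_fs[k - 1])          # T_k leaves p_{k-1} invariant
        log_w += log_fs[k](x) - log_fs[k - 1](x)  # weight factor at x_k
    return x, log_w


def metropolis(x, log_f, step=0.5, n=10):
    """A few random-walk Metropolis updates targeting exp(log_f)."""
    for _ in range(n):
        xp = x + random.gauss(0.0, step)
        if math.log(random.random() + 1e-300) < log_f(xp) - log_f(x):
            x = xp
    return x
```

Averaging exp(log w) over independent runs estimates Z_K/Z_0; for instance, annealing between two zero-mean Gaussians recovers the ratio of their normalizers.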
Figure: annealed importance sampling (AIS) is standard importance sampling applied to an ensemble. The generative process p*(X) starts with a draw from the target distribution, x_K ~ p_K(x_K), which is intractable. Importance sampling uses proposals drawn from q(X), which starts at a simple distribution, x_0 ~ p_0(x_0).

Samples weighted in proportion to w(X) can be used as though they came from p*. Typically only x_K is of interest, which under p* comes from the target, p_K ≡ p. The weights themselves are an unbiased estimate of Z_{p*}/Z_q = Z_K/Z_0, as in standard importance sampling.

A valid choice for all of the T_k is the "do nothing" operator, which attaches all of its probability mass to staying still, T(x′ ← x) = δ(x′ − x). Then q(X) just corresponds to drawing from p_0 and copying the results; similarly, p*(X) only draws from p_K. As one might expect, the importance weights collapse down to those of standard importance sampling of a target distribution p ≡ p_K with proposals from the base distribution q ≡ p_0. This highlights that the Markov chains do not need to mix well for AIS to be valid, but poor mixing operators will not improve over standard importance sampling.

A feature of AIS is that the importance weights do not require details of the transition operators T_k. This means that a given ensemble X will always be given the same importance, regardless of whether the T_k mix immediately, by drawing from their stationary distributions, or in fact do nothing. The downside to this behavior is that, even with ideal transition operators, there is always a mismatch between p*(X) and q(X): under perfect mixing the final point is proposed from p_{K−1}, not the target distribution p_K.

The annealed importance sampling technique was first described in the physics literature by Jarzynski; however, earlier still was the method of tempered transitions, which contains the same core idea.

Tempered transitions

Tempered transitions (Neal) defines a reversible Markov chain transition operator leaving a distribution of interest, p_K, stationary; thus its use is in MCMC rather than importance sampling. It uses pairs of mutually reversible base transitions T̂_k and Ť_k with corresponding stationary distributions p_k. Given
a current state x̂_K, a candidate state x̌_K is proposed through a sequence of states (x̂, x̌) drawn as follows:

    generate x̂_{K−1} from x̂_K using T̂_{K−1}
    generate x̂_{K−2} from x̂_{K−1} using T̂_{K−2}
    ⋮
    generate x̂_0 from x̂_1 using T̂_0
    generate x̌_1 from x̂_0 using Ť_0
    ⋮
    generate x̌_K from x̌_{K−1} using Ť_{K−1}

This generating sequence is illustrated in figure (a). The candidate state is accepted with probability

    a(x̂, x̌) = min[1, R(x̂, x̌)],    R(x̂, x̌) = Π_{k=1}^{K} (p̃_{k−1}(x̂_k)/p̃_k(x̂_k)) · Π_{k=1}^{K} (p̃_k(x̌_k)/p̃_{k−1}(x̌_k)).

Note that the normalizations of the p_k cancel, so are not needed. One way to derive this acceptance rule is to identify the whole joint distribution of figure (a) as the target of the sampler: if the order of the states were reversed, figure (b), x̂_K would be the new state generated at equilibrium, and a few lines of manipulation result in the expression above for the Metropolis acceptance probability of this proposal. The intermediate states can be discarded, because they are resampled at the next iteration.

Typically p_0 is much more vague than p_K, which allows easier movement around the middle of the proposal sequence. If there are many intermediate distributions, then every p_k can be made close to its neighbors. This suggests each state visited during the proposal should be close to equilibrium under each of the transition operators, and that the final state should nearly be drawn from p_{K−1}, which is close to p_K; thus the acceptance rate can be high for large K. One might expect that deterministically raising and lowering the temperature gives better performance than the Markov chain employed by simulated tempering; in fact the advantage is roughly cancelled by the need for a larger number of intermediate distributions, see Neal for details. The main advantages of tempered transitions are: it doesn't need a set of weights like simulated tempering, and it doesn't have parallel tempering's large memory requirements, because the acceptance ratio can be accumulated as the states are generated.

Figure: tempered transitions. (a) Current state: the algorithm starts with a state x̂_K and generates the ensemble (x̂, x̌). (b) Proposed state: the acceptance probability is for the proposal x̌_K, which, by reversing all the states, would at equilibrium make x̂_K the new state generated under p_K.

Generalization to only forward transitions

In standard tempered transitions, T̂_k and Ť_k are mutually reversible. Here we consider two modified algorithms:

- a forwards tempered transition operator T→ uses the sequence (T_{K−1}, …, T_0, T_0, …, T_{K−1}) to form the proposal distribution q→, where each T_k is any transition operator leaving p_k stationary;
- a reverse tempered transition operator T← uses (T̃_{K−1}, …, T̃_0, T̃_0, …, T̃_{K−1}) to form the proposal distribution q←, where each T̃_k is the reverse transition operator corresponding to T_k.

Both algorithms use the same acceptance rule as before. Neal shows that the operator defined by standard tempered transitions satisfies detailed balance; here we present a similar proof to show that T→ and T← are mutually reversible with respect to p_K. As in standard tempered transitions, we start at x̂_K and generate a sequence of states (x̂_{K−1}, …, x̂_0, x̌_1, …, x̌_K), with x̌_0 ≡ x̂_0; to simplify notation, x̂ and x̌ are taken to mean these whole sequences throughout. The probability of generating a particular sequence of states given the initial state, using only forward transitions, is

    q→(x̂, x̌ | x̂_K) = Π_{k=1}^{K} T_{k−1}(x̂_{k−1} ← x̂_k) · Π_{k=1}^{K} T_{k−1}(x̌_k ← x̌_{k−1}).

Applying the definition of a reverse operator, p_k(x) T_k(x′ ← x) = p_k(x′) T̃_k(x ← x′), to every factor gives

    q→(x̂, x̌ | x̂_K) = q←(x̌, x̂ | x̌_K) · Π_{k=1}^{K} [p_{k−1}(x̂_{k−1}) p_{k−1}(x̌_k)] / [p_{k−1}(x̂_k) p_{k−1}(x̌_{k−1})],

where q←(x̌, x̂ | x̌_K) is the probability of generating the reversed sequence from x̌_K under the reverse operators. Multiplying by p_K(x̂_K) and by the ratio R(x̂, x̌) appearing in the acceptance rule, the products telescope (using x̌_0 = x̂_0), leaving

    p_K(x̂_K) q→(x̂, x̌ | x̂_K) R(x̂, x̌) = p_K(x̌_K) q←(x̌, x̂ | x̌_K).

Therefore the equilibrium probability of starting at x̂_K, generating the sequence, and accepting the move with operator T→ is

    p_K(x̂_K) q→(x̂, x̌ | x̂_K) a(x̂, x̌) = min[ p_K(x̂_K) q→(x̂, x̌ | x̂_K), p_K(x̌_K) q←(x̌, x̂ | x̌_K) ]
                                        = p_K(x̌_K) q←(x̌, x̂ | x̌_K) a(x̌, x̂).

Summing both sides over all intermediate quantities x̂_{K−1}, …, x̂_0, x̌_1, …, x̌_{K−1} gives

    T→(x̌_K ← x̂_K) p_K(x̂_K) = T←(x̂_K ← x̌_K) p_K(x̌_K).

This is the equilibrium probability of observing the corresponding reverse move under T←, which shows that both T→ and T← leave p_K invariant. Therefore we only need one of them, T→ say. This means that the standard tempered transitions algorithm still leaves the target distribution invariant if forward operators T_k are used for both T̂_k and Ť_k; implementing reverse operators is not required.

Generalization to a single pass

The second half of tempered transitions is
identical to annealed importance sampling. Neal states that the major difference between the two methods is the requirement for an upward pass in tempered transitions, making the updates twice as expensive. In fact this upward pass is not necessary: looking back at the AIS figure, there is no reason why the AIS ensemble X ~ q(X) cannot be used as a Metropolis–Hastings proposal for updating a state X drawn at equilibrium from p*(X). This is a Metropolis independence sampler, where the proposal does not depend on the current state; this method of turning an importance sampler into a Markov chain operator dates back at least to Hastings.

Importance sampling seems a more natural way to construct estimators than the Metropolis independence sampler. Notice that an independence sampler sometimes rejects states when the current state is more "important". The states kept by the algorithm will depend on the order in which the proposals are made, which is arbitrary, as they are independent. It seems likely that treating the proposals identically through importance weights will be better, and theoretically this is usually the case (Liu).

Tempered transitions and annealed importance sampling are really two applications of the same joint distribution; the main difference is whether an MCMC operator or an importance sampler is preferable. In isolation the importance sampler may yield better estimators, but the MCMC operator can be combined with other MCMC operators. Hastings argued that unweighted MCMC samples are more interpretable, although one could also obtain approximate unweighted samples by resampling from a distribution over points proportional to their importance weights.

The multicanonical ensemble

The multicanonical ensemble (Berg and Neuhaus) is another method for allowing easier movement around a state space. Rather than introducing extra distributions, the original probability distribution is reweighted:

    p_mc(x) ∝ w_mc(E(x)) e^{−E(x)},    E(x) = −log p̃(x).

The weight function w_mc only depends on a state's energy E, which
is the negative log of the state's original unnormalized probability. The weights are chosen such that the marginal distribution over energies, p(E), is uniform, as far as possible; if the energy is unbounded, a cutoff is introduced.

Why set the distribution over energies uniform? As with the temperature-based algorithms, more time is spent in regions of low probability, which helps join isolated modes. But the multicanonical ensemble is just a heuristic; there is nothing fundamental about this distribution. In fact, visualizing such a distribution can be somewhat surprising. The figure below shows the result of reweighting a simple one-dimensional distribution; for simplicity we created a set of energy bands and used a procedure that only ensured that the distribution over bands was uniform. The most probable states in panel (c) are in the extreme tail of the original distribution: few states occupy the lowest energy bands, so any particular one must be visited frequently. In many high-dimensional problems the lowest probability states will be plentiful, and the tails will be given much less prominence. However, when there are only a relatively small number of states at a particular energy, the very high probabilities assigned by the multicanonical ensemble can be very useful.

Figure: an example of a multicanonical ensemble. (a) The original probability: the target distribution of interest. (b) The energy function that defines this distribution; the horizontal lines mark bands of energy used to construct the new ensemble. (c) The multicanonical probability: the final panel results from setting the probability of each state proportional to the reciprocal of the number of states in its energy range. Drawing a state from this multicanonical distribution and reading off its energy band gives a uniform distribution over the bands. Some artifacts in the plot are due to the precise way bands were chosen, but note that the sharp peaks in the middle of the multicanonical plot are due in part to the truncation of the original distribution. Setting the distribution over energies is a global operation and can have unexpected consequences.

Berg and Neuhaus invented the multicanonical ensemble to deal with distributions containing bottlenecks: regions with small numbers of states that separate regions with significant probability mass. Strangely, bottleneck states can be problematic even when they are individually more probable than many of the states typical under the distribution: a Markov chain is not able to spend much time exploring the states between important regions if there are many more states elsewhere with much higher total probability. This phenomenon is known as an entropic barrier, and can be overcome by
a multicanonical ensemble, which makes states with abnormally low or high energies much more probable.

As in simulated tempering, finding the weights or weighting function for the multicanonical method is difficult. The original application of the multicanonical ensemble used prior knowledge to construct a parametric approximation that would give a near-uniform distribution over energy. General applications often use histogram methods that bin ranges of energy together, and iteratively fit weights for each bin on the basis of preliminary runs. For more advanced methodologies there are whole theses largely dedicated to these methods, e.g. Smith, Ferkinghoff-Borg. Unlike annealing methods, the multicanonical method never simulates the target distribution; however, correlated samples from p_mc can be used to construct importance sampling estimators of any quantity of interest.
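For instance, a self-normalized importance sampling estimate of an expectation under the original p(x) ∝ e^{−E(x)}, computed from multicanonical samples, can be sketched as follows (an illustrative sketch; the names are ours, and the weight function is assumed known from the run that generated the samples):

```python
import math


def reweighted_mean(xs, energies, log_w, f):
    """Self-normalized importance sampling estimate of E_p[f(x)] from samples
    of a multicanonical distribution p_mc(x) ∝ w(E(x)) exp(-E(x)).
    log_w(e) is the log of the weight function used to generate the samples;
    the importance weight of each sample back to p is 1 / w(E(x))."""
    log_u = [-log_w(e) for e in energies]     # log importance weights
    m = max(log_u)
    u = [math.exp(l - m) for l in log_u]      # stabilized weights
    return sum(ui * f(x) for ui, x in zip(u, xs)) / sum(u)
```

Subtracting the maximum log weight before exponentiating avoids overflow when the weight function varies over many orders of magnitude.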
This is explored further in a later chapter.

Exact sampling

Exact, or perfect, sampling simply means drawing a sample from a target distribution. Markov chain Monte Carlo uses approximate samples: the distribution over the position at the s-th step of a Markov chain depends, perhaps weakly, on its starting position at s = 0. Surprisingly, it is sometimes possible to obtain exact samples from Markov chains. A sketch of the technique known as coupling from the past (CFTP) (Propp and Wilson) is given in the figure below. The core idea is to find the location of a Markov chain that has run for an infinitely long time, without having to simulate its entire length. Typically exact sampling algorithms require the ability to track sets of states through a random, possibly large, number of Markov chain steps. This is technically challenging, but provides a means to draw samples from some distributions, such as large Potts models and some spatial point processes, where traditional means fail. See Wilson for reviews, examples and more algorithms for exact sampling with Markov chains.

Is exact sampling actually useful? Some users like to see samples, which can give insight into a problem, but in many statistical applications there is no interest in the samples themselves, only estimates of expectations. Given some exact samples, the variance of an estimator could be improved by running Markov chains from the samples to produce more correlated samples. If it is possible to start with exact samples, some comfort is derived from removing any bias associated with the burn-in of the chain, but it is still better to run a small number of long Markov chains than investing in equilibrating many independent ones (as discussed in an earlier section). Several algorithms in a later chapter rely heavily on exact sampling: these require an exact sample from a different distribution at each iteration. Using approximations is dangerous, as biases can accumulate at every iteration, so Markov chains potentially have to be run for a very long time; exact sampling gives a stopping rule, and one that guarantees zero bias.

Figure: coupling from the past (CFTP) overview. (a) A sample is at the end of an infinite chain. (b) We try to find the end with a finite number of random numbers. (c) Eventually, after a random amount of time, we find the sample. Each dash on the time axis represents a random number used by the chain; everything must be consistent with the infinite chain in (a). We try to locate the final state by drawing just some of the random numbers (b): before time −T we know nothing, so the chain could be anywhere. We evolve a bound on the state space forward to time zero; at first this might only localize the sample. We look at more random numbers, further into the past, until a bound collapses to a single state (c); we then follow it to the exact sample at time zero. It takes a random amount of computation before the bound can be made tight.

Exact sampling example: the Ising model

Coupling from the past relies on being able to prove where the final state of a Markov chain lies without knowing the starting location. To demonstrate that this is sometimes possible, we describe the summary state technique, following Childs et al.; the method
is originally due to Huber, who uses the name "bounding chains". Our task is to find the setting of an Ising model, a Potts model (described in an earlier subsection) where each variable takes on q = 2 settings, s_i = ±1. We assume the system evolved under Gibbs sampling (described earlier) from time t = −T to t = 0. Each variable was updated in order at each time step, according to the following algorithm:

    u_i^(t) ~ Uniform[0, 1]
    p = p(s_i = +1 | s_{j∈E_i}) = exp(Σ_{j∈E_i} J_ij s_j) / Σ_{s′=±1} exp(s′ Σ_{j∈E_i} J_ij s_j)
    if u_i^(t) < p then set s_i = +1, else set s_i = −1.

In general, the setting of variable i at time t depends on the random number u_i^(t) and the settings of its neighbors, j ∈ E_i. The summary states algorithm starts by drawing all the random numbers u_i^(t) for some finite range of time, t ∈ [−T, 0). We don't know what happened before time −T, so for each variable we set s_i^(−T) = "?". We then follow the Gibbs sampling updates from t = −T to t = 0, setting states to "?" whenever the result of an update depends on the settings of states we do not know:

    u_i^(t) ~ Uniform[0, 1]
    p_max = max over settings of unknown neighbors of p(s_i = +1 | s_{j∈E_i})
    p_min = min over settings of unknown neighbors of p(s_i = +1 | s_{j∈E_i})
    if u_i^(t) < p_min then set s_i = +1
    if u_i^(t) ≥ p_max then set s_i = −1
    otherwise set s_i = "?".

If some of the s_i^(0) states are set to "?", then we do not know the setting of our exact sample. We draw more of the random numbers, u_i^(t) from time −2T up to time −T, and start again from s^(−2T). The hope is that eventually, after doubling the number of Markov chain steps enough times, no "?"s will remain and we will know s^(0).

More sophisticated versions of the algorithm have been developed for models with q > 2, and for tracking dependencies among unknown states. When the connection strengths J are large, it won't be possible to get rid of all the "?"s; in practice, better Markov chains than Gibbs sampling are required. While the details are considerably more involved, it is possible to bound the behavior of random-cluster samplers (Propp and Wilson) and carefully implemented Swendsen–Wang samplers (Huber).

Discussion and outlook

In this chapter we have reviewed a subset of the established literature on Markov chain Monte
carlo methods most mcmc algorithms are underpinned by proving detailed balance w
hich is almost synonymous with the seminal algorithms of metropolis et al and ha
stings some choices still remain in the choice of acceptance rule and the constr
uction of estimators we presented an interpretation of wasterecycling estimators
which we will apply in a new setting in the next chapter. We also reviewed a two-stage acceptance rule which trades off statistical optimality to give computational savings, and we extended this idea to slice sampling. Algorithms and proposal distributions tuned for particular applications are clearly important, but were beyond the scope of this review. Moreover, we did not discuss techniques that sample distributions with variable dimensionality (e.g. Green) or infinite dimensionality (e.g. Neal (c)). While important in some applications, or as a replacement for model selection techniques, they are not used by the remainder of the thesis. We do have a brief excursion into sampling from an infinite-dimensional distribution at the end of a later chapter, although this is more closely related to the work on exact sampling reviewed in the previous section.

Auxiliary variable methods have the potential to dramatically accelerate Markov chain exploration: an earlier figure demonstrated the large changes possible in a single step of Swendsen-Wang. Slice sampling doesn't make such dramatic moves, but it is an easy-to-apply method that we have found personally useful, both in this thesis and in quickly testing hypotheses in discussions with colleagues. Similarly, Hamiltonian Monte Carlo has proved personally useful in work outside this thesis (Murray and Snelson), and we feel it should be used more widely.

The use of annealing methods appears somewhat mysterious, especially in a statistical setting: it seems strange to sample at a range of temperatures in applications outside of physics. However, the annealing heuristic can be very effective in problems with multiple modes. It is less clear whether these distributions need to be simulated together, as in parallel tempering, or separately, as in the other algorithms. We explore the issue of populations of samples in the next chapter; this work leads us to new algorithms for constructing non-reversible Markov chains, operators that do not satisfy detailed balance. Returning to annealing, a later chapter explores the computation of normalizing constants. There we explore why standard MCMC is insufficient for solving this problem, and the advantages of using multiple distributions as in
annealing. We then provide a critical and comparative review of annealed importance sampling, the multicanonical ensemble, and nested sampling, a new method by John Skilling. We will propose new procedures for running both nested sampling and annealed importance sampling.

Finally, all of the algorithms discussed so far rely on being able to evaluate a function proportional to the target distribution, p*(x) proportional to p(x). It turns out that this assumption is not met by a class of statistical inference problems which we label doubly-intractable. A later chapter is dedicated to constructing new algorithms for this problem; their operation draws on ideas from all areas
of this review, including annealing and exact sampling.

Multiple proposals and non-reversible Markov chains

Markov chain Monte Carlo is ambitious: a huge amount is asked of a solitary point diffusing through the void of a complex high-dimensional space. The target stationary distribution may have multiple modes, a representative sample of which must be found. Also, somehow, the mass of each mode is implicitly estimated so that the correct amount of time is spent in each one. Even navigating a single mode can be difficult. A long-standing idea is that multiple points evolving simultaneously should help: communication amongst points might help find and explore the regions of high probability mass. The simplest way to
introduce multiple points is to target a product model, p(X) = prod_{k=1..K} p(x_k). This is just like parallel tempering, except that each of the independent distributions is the same. Although each of the x_k variables is independent, a Markov chain operator used to update one of them can depend on the settings of the others at the same time step. This provides a limited but adapting source of information about length-scales in the problem and the positions of modes.
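The product-model idea can be made concrete with a short sketch: each chain proposes along the difference of two other chains' states, in the spirit of ADS and ter Braak's differential-evolution MCMC mentioned below. The target, parameter values and names here are ours, chosen only for illustration.

```python
import math
import random

def log_p(x):
    """Illustrative target (ours, not the thesis's): a narrow 2-d Gaussian ridge."""
    a, b = x
    return -0.5 * (a * a + (b - a) ** 2 / 0.01)

def product_model_sweep(chains, gamma=0.7, eps=1e-3, rng=random):
    """One sweep over K coupled chains targeting the product p(x_1)...p(x_K).
    Each chain proposes along the difference of two *other* chains, a
    direction that adapts to the length-scales of the problem."""
    K = len(chains)
    for k in range(K):
        r1, r2 = rng.sample([j for j in range(K) if j != k], 2)
        prop = [chains[k][d] + gamma * (chains[r1][d] - chains[r2][d])
                + rng.gauss(0.0, eps)
                for d in range(len(chains[k]))]
        # (r1, r2) and (r2, r1) are equally likely, so the difference direction
        # is equally likely to be reversed: the mixture proposal is symmetric
        # and the plain Metropolis ratio suffices.
        if rng.random() < math.exp(min(0.0, log_p(prop) - log_p(chains[k]))):
            chains[k] = prop

rng = random.Random(1)
chains = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(10)]
for _ in range(500):
    product_model_sweep(chains, rng=rng)
```

After a few hundred sweeps the population settles onto the narrow ridge, which axis-aligned updates with a fixed width find slow to traverse.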
Adaptive direction sampling (ADS) is a name given to at least two algorithms. One of these updates a point by resampling from its conditional distribution, i.e. Gibbs sampling, along a direction suggested by another pair of points (Gilks et al.). This allows coupled movement of highly correlated variables, which would be difficult under standard axis-aligned Gibbs sampling. Initial experiments with ADS highlighted the need for many more particles than dimensions in order to equilibrate in all necessary directions; then the burden of bringing K points all to equilibrium simultaneously removes the advantage. On the other hand, some advantage is obtained with a small number of points when combined with standard axis-aligned updates to ensure mixing in the full space. I have been able to produce very similar results using a slice-sampling-based version of ADS, as suggested in MacKay. Related ideas are still being actively pursued in the literature (ter Braak; Christen and Fox). While general theoretical results seem elusive, a small number of parallel chains is often empirically useful and seems advisable in general.

This chapter is dedicated to exploring other Monte Carlo algorithms that attempt to leverage an advantage from multiple points. The first of these, population Monte Carlo as in Cappé et al., is an existing importance sampling method described as a competitor to MCMC methods. We highlight some important differences from MCMC-based methods like ADS, not made apparent enough in the current literature. We then return to MCMC methods, looking at those not based on the ADS product model. These methods have a single Markov chain backbone, which branches to a set of points for consideration at each step; ensembles of dependent points might allow more thorough local exploration. A later section reviews and extends multiple-try Metropolis (MTM), an algorithm for making good use of expensive proposal mechanisms. Another section covers ordered overrelaxation, an existing extension to Gibbs sampling which draws multiple points to improve navigation of highly correlated variables. We extend ordered overrelaxation from just Gibbs sampling to arbitrary transition operators and multiple Metropolis proposals.

Population Monte Carlo

Many Monte Carlo algorithms could be
described as population methods; Iba (a) provides a review of some of them. This section refers to a specific method with the name population Monte Carlo (PMC), as described by Cappé et al. and popularized in Robert and Casella. It is described as an algorithm which can be iterated like MCMC but which, unlike Markov methods, can easily be adapted over time. Examples were presented in which replacing MCMC with PMC gave better results; in fact, it is also easy to demonstrate the reverse. This section provides a critical review of PMC.

The template of a PMC method is given in
the algorithm below. K particles are evolved in parallel; an arbitrary distribution q_ks is used to move each particle at each step. These distributions can depend on all the particles' entire histories. Unbiased estimates can be constructed from the proposed points x~_k, weighted by their importance weights as usual. The weights are usually only available up to a constant; consistent estimators are still available by using normalized weights. The resampled points drawn in the resampling step are a biased sample from the target distribution for finite K; for example, a single particle would just give a sample from the proposal distribution. When K is large the points are approximately unbiased samples from the target distribution. It is hoped these will form a useful basis for constructing the proposal distributions at the next iteration.

Algorithm: population Monte Carlo, as in Cappé et al.
    for s = 1 to S
        for k = 1 to K
            select proposal distribution q_ks
            generate x~_k ~ q_ks(x)
            compute weight w_k = p(x~_k) / q_ks(x~_k)
        end for
        resample K positions with replacement from {x~_k}, with probabilities
            proportional to the weights, giving this step's sample {x_1, ..., x_K}
    end for

Estimators should be formed from the importance-weighted ensemble, before the resampling step.
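A minimal sketch of the template above, on an assumed one-dimensional Gaussian target with Gaussian proposals of several widths (all choices ours):

```python
import math
import random

def p_tilde(x):
    """Unnormalised target: a standard Gaussian."""
    return math.exp(-0.5 * x * x)

def q_pdf(x, mu, sigma):
    """Gaussian proposal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def pmc(S=50, K=200, widths=(0.3, 1.0, 3.0), seed=0):
    """A minimal population Monte Carlo loop following the template above.
    Returns the weighted particles from the final iteration; estimators
    should use these weights, formed *before* resampling."""
    rng = random.Random(seed)
    xs = [rng.uniform(-5, 5) for _ in range(K)]
    for s in range(S):
        sigmas = [rng.choice(widths) for _ in range(K)]   # a proposal per particle
        prop = [rng.gauss(xs[k], sigmas[k]) for k in range(K)]
        w = [p_tilde(prop[k]) / q_pdf(prop[k], xs[k], sigmas[k]) for k in range(K)]
        tot = sum(w)
        w = [wk / tot for wk in w]                        # self-normalised weights
        xs = rng.choices(prop, weights=w, k=K)            # resample for next step
        last = (prop, w)
    return last

particles, weights = pmc()
mean_est = sum(w * x for w, x in zip(weights, particles))
```

On this easy target the estimator behaves well; the critique that follows concerns what happens when the proposals do not cover the target.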
As PMC is a framework for importance sampling, the q_ks distributions must have appreciable probability mass across the entire support of the target distribution. While Cappé et al. do mention the need for heavy tails, they simultaneously suggest that PMC is a replacement for MCMC. Population Monte Carlo borrows from MCMC algorithms for the construction of the proposal, and from importance sampling for the construction of appropriate estimators. These two goals seem mutually exclusive. Metropolis-Hastings proposals are typically local in nature; we do not expect them to capture the entire target distribution. Sometimes we might propose dramatic moves, perhaps from a kernel density estimate (Tierney and Mira; Warnes), but such proposals do not need to be a good global approximation to the target distribution for MCMC to work: such transition operators can be combined with local diffusion for robustness. Importance sampling has no such choice. Using MH proposals, which tend to cut off some or most of the target density, can have disastrous effects in PMC: some particles would occasionally have very large weights. This would give high variance estimators and the occasional collapse of all or most particles to a single point in the resampling step. Both Cappé et al. and Robert and Casella describe a mixture PMC algorithm which seems very likely to suffer from these failure modes of importance sampling. The algorithm suggests choosing q_ks from a set of spherical Gaussian distributions with a variety of widths; the probability of adopting each width is adapted according to their past ability to make proposals that survive the resampling step.

To demonstrate the problem with mixture PMC, we use a version of the funnel distribution described by Neal. The distribution is over a zero-mean Gaussian distributed variable v, with a fixed standard deviation, and several independent zero-mean Gaussian variates x_i, each
with variance exp(v). This type of structure can easily exist in hierarchical Bayesian models. The original funnel distribution used nine x_i variates and was exceedingly challenging for simple Metropolis sampling; slice sampling largely alleviated this problem. Here we make the distribution easier to handle, using only four x_i variates so that Metropolis sampling will equilibrate more quickly.

We used spherical Gaussian proposals with standard deviations selected uniformly from an ad hoc set of proposal widths. Metropolis was run many times, producing a good histogram for the marginal distribution of v (figure, panel a); only in the tails of the distribution are some frequencies not quite right. It would have been better to sample fewer, longer runs (see the earlier section), but we wished to provide a direct comparison to PMC run with a matched number of steps and particles. Panel b shows a histogram of samples obtained from the resampling step of the PMC algorithm. These samples do not match the true marginal distribution, although we can expect this distribution to be biased. Panels c-f show estimates of v's marginal distribution based on the importance weights of the particles before resampling. The PMC estimates are terrible for all four variants of the algorithm described in the figure: none of the Gaussian distributions are good importance samplers for this distribution, and a combination of the estimators does not work well either.
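The weight-collapse problem is easy to reproduce numerically. The sketch below draws from a spherical Gaussian and importance-weights against a funnel-like density; the effective sample size falls well below the number of draws. We simplify the setup used in the experiments: v has unit variance here (the original funnel gives v a wider distribution), and all parameter choices are ours.

```python
import math
import random

def log_funnel(v, xs):
    """Funnel-like density (after Neal): v is Gaussian, each x_i ~ N(0, exp(v)).
    We give v unit variance for brevity; the original v is wider."""
    lp = -0.5 * v * v
    for x in xs:
        lp += -0.5 * (x * x / math.exp(v) + v)   # log N(x; 0, e^v) up to a constant
    return lp

def log_q(v, xs, sigma):
    """Spherical Gaussian importance proposal centred at the origin."""
    ss = v * v + sum(x * x for x in xs)
    return -0.5 * ss / sigma ** 2 - (1 + len(xs)) * math.log(sigma)

rng = random.Random(0)
sigma = 2.0
log_w = []
for _ in range(5000):
    z = [rng.gauss(0.0, sigma) for _ in range(5)]   # v plus four x_i variates
    log_w.append(log_funnel(z[0], z[1:]) - log_q(z[0], z[1:], sigma))
m = max(log_w)
w = [math.exp(lw - m) for lw in log_w]
ess = sum(w) ** 2 / sum(wk * wk for wk in w)   # effective sample size
frac = ess / len(w)   # below 1; heavier target tails drive it arbitrarily low
```

When v happens to be large, the x_i have wider tails than any fixed Gaussian proposal covers well, so a few draws carry most of the weight: exactly the failure mode described above.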
Large importance weights during the PMC runs often caused many particles in the population to move together into a new region of space. Unlike Cappé et al., we do not see this as a sign that the space is being explored; instead it indicates the underlying high variance of the importance sampling estimator. When the distributions are inappropriate for importance sampling there is no guarantee that the time spent by the population within a region reflects the target distribution; instead, the times between large-importance-weight events depend on the details of the tails of the proposal distributions. In this case adapting the distributions did not solve the problem; in any case, favoring proposals that can lead to extremely large weights seems unwise.

Iterated importance sampling algorithms are attractive in some contexts. However, using proposals from MCMC algorithms within importance sampling is not generally advisable. We now return to algorithms that stay within the MCMC framework.
Multiple-try Metropolis

This section concerns the multiple-try Metropolis method (MTM) of Liu et al. We review the method and provide a simple interpretation as an evolving joint distribution including auxiliary variables. This almost trivial rewriting of the algorithm makes a possible improvement obvious, and relates to the pivot-based Metropolis method introduced later in the chapter. The MTM algorithm
was motivated as follows. Local proposal distributions q(x'; x) may lead to slow convergence, but longer-range moves are rarely accepted. If multiple points are proposed from a long-range distribution, then it is more likely that an acceptable one will be found. The MTM algorithm below is one way of allowing K draws from a proposal distribution; a figure further on illustrates and motivates the procedure with a graphical model.

[Figure: histograms of samples of the v parameter from the funnel distribution. The curves show the correct marginal posterior. Panels: (a) Metropolis; (b) PMC samples, q separate, adapted; (c) PMC weights, q separate, adapted; (d) PMC weights, q combined, adapted; (e) PMC weights, q combined, fixed; (f) PMC weights, q separate, fixed. All samplers were initialized at a fixed point (v, x). Both Metropolis and PMC used spherical Gaussian proposal distributions with the same set of proposal widths; for PMC these were adapted as in Cappé et al., or chosen from a fixed uniform distribution. The PMC importance weights were either computed using the proposal width that was used ("q separate", as in Cappé et al.) or using the mixture of Gaussians obtained by summing over the choice of width ("q combined", Robert and Casella).]

Algorithm: a single step of the multiple-try Metropolis Markov chain.
    1. Draw K i.i.d. proposals x~_k ~ q(x~_k; x).
    2. Compute weights w(x~_k, x) = p(x~_k) q(x; x~_k) lambda(x~_k, x),
       where lambda(x, x~_k) is any symmetric function.
    3. Choose one proposal, x' in {x~_k}, using probabilities proportional
       to the weights.
    4. Generate a new set of points x~'_k ~ q(x~'_k; x') for k = 1..K-1,
       and define x~'_K = x.
    5. Compute a = sum_k w(x~_k, x) / sum_k w(x~'_k, x').
    6. Assign x <- x' with probability min(1, a).
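A compact sketch of one step of the MTM(II) variant described next (symmetric Gaussian proposals, weights w(x) = p(x)), run here on a t-distribution with five degrees of freedom as in the later experiments; the step size and K are our own illustrative choices.

```python
import random

def p_tilde(x):
    """Unnormalised t-distribution with 5 degrees of freedom."""
    return (1.0 + x * x / 5.0) ** -3.0

def mtm2_step(x, K=5, step=1.0, rng=random):
    """One MTM(II) step: symmetric Gaussian proposals, weights w(x) = p(x)."""
    props = [x + rng.gauss(0.0, step) for _ in range(K)]
    w = [p_tilde(y) for y in props]
    y = rng.choices(props, weights=w, k=1)[0]           # select one proposal
    back = [y + rng.gauss(0.0, step) for _ in range(K - 1)] + [x]
    a = sum(w) / sum(p_tilde(z) for z in back)          # acceptance ratio
    return y if rng.random() < min(1.0, a) else x

rng = random.Random(0)
x, xs = 0.0, []
for _ in range(20000):
    x = mtm2_step(x, rng=rng)
    xs.append(x)
var = sum(v * v for v in xs) / len(xs)   # true variance of t_5 is 5/3
```

Note that each step makes 2K - 1 fresh density evaluations, which is the cost accounting discussed under "Efficiency of MTM" below.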
MTM is actually a family of algorithms, defined by the function lambda used in the weighting step. MTM(II) is the MTM algorithm with symmetric q and lambda = 1/q, giving weights w(x~_k) = p(x~_k). As various choices of lambda were reported to behave very similarly, we use MTM(II) in our experiments.

The graphical model description immediately suggests an alternative algorithm. The resampling step of MTM draws {x~'_k} from the joint distribution; this discards the settings already made by a previous iteration. In some circumstances, keeping the existing proposals could save computer time. This would be achieved by omitting the resampling step after the first iteration, and updating the assignment step to include {x~_k} <- {x~'_k}.

Efficiency of MTM

For a given proposal distribution q(x~_k; x), each step of MTM is more expensive than a single Metropolis-Hastings proposal. Counting the number of target distribution evaluations suggests that the computational cost is roughly K times more for MTM. Various practical issues make the exact
cost trade-off more complex. In many circumstances the time taken to evaluate K probabilities will grow sublinearly with K. This may be because some intermediate results can be shared, or because related computations can be vectorized and computed efficiently on some types of computer. The sequential nature of MCMC often makes it difficult to vectorize operations, yet such code transformations can have surprisingly large benefits, especially in popular interpreted environments such as R, Octave or Matlab.

The results reported by Liu et al. suggest that MTM outperforms Metropolis without needing implementation-based code-tuning. One example involved simple Gaussian-based proposals on a one-dimensional t-distribution with five degrees of freedom. The amount of computer time to reach equilibrium from an extreme fixed starting point appeared to be considerably less for MTM(II) than for Metropolis with a matched number of proposals, as shown in the figure below.

[Figure: multiple-try Metropolis. (a) K proposals are generated and one, x', is selected: x is assumed to come from the target distribution p(x), and K proposals are drawn, x~_k ~ q(x~_k; x); this defines a joint target distribution on which the Markov chain operates. First one of the proposed values, x~_i say, is chosen and labeled x'. (b) A suggested swap makes x' the sample from p(x): one could consider proposing a swap x <-> x', giving a joint state in which x' generated {x, x~_{k != i}}. The acceptance probability for this proposal could typically be low; for large K it will often be obvious that x, rather than x', generated the remaining variables. (c) Resampling all but x and x' makes the joint setting more plausible: MTM swaps x and x' and also resamples the remaining variables from their new stationary distribution, x~'_{k != i} ~ q(x~'_k; x'). The acceptance rule is simply Metropolis-Hastings applied to the proposal (a) -> (c).]

[Figure: performance of multiple-try Metropolis, based on a figure from Liu et al. Samplers (MTM(II) and Metropolis at two step sizes) were run on a t-distribution as described in the main text. Plotted is var(x|t), the variance of a sampler's state after t steps from the fixed start point, estimated from repeated runs; the time taken for a chain's states to match the true variance is a measure of its burn-in time. Computer time was measured as follows: for MTM(II) with K proposals the index is the number of steps taken; for the Metropolis algorithms the index is, optimistically, the number of steps divided by K. The performance of Metropolis and MTM(II) closely reproduces the results from the original paper; rerunning Metropolis with a longer Gaussian step size shows a much shorter burn-in time than MTM.]
This claim is reproducible, but why were different proposal widths used while making the comparison? Presumably a longer step size was deemed appropriate for MTM, as it has more attempts to find a good proposal. But our results show that Metropolis with the longer step size approaches equilibrium, according to the measure used, faster than Metropolis with the shorter step size or than MTM.

The amount of computer time needed to forget a particular atypical initialization is only one measure of performance. We also used R-CODA (Cowles et al.) to estimate the effective number of samples, based on a spectral analysis of the time series; these are compared in the table below.

Table: equilibrium efficiency of MTM and Metropolis. The methods compared are i.i.d. sampling (a control), Metropolis-Hastings at each of the two step sizes, and MTM; the entries report the effective number of samples for a fixed budget of proposals.

The acceptance rates for Metropolis-Hastings at the two step sizes suggest that the optimal step size is somewhere in between the two values tried. MTM's acceptance rate was close to the ideal, both for standard MH and, according to Liu et al.'s experience, for MTM. Despite this, MTM's sampling efficiency on this very simple example distribution is worse than MH, at both equilibrium sampling and initial burn-in. This is a negative result, but one worth noting: using multiple parallel proposals is not, in our experience, worthwhile
in itself. This makes sense: when proposals are made sequentially, a good proposal can be accepted immediately and future proposals move from the new location. In contrast, only one point from a set of parallel proposals can be accepted; if there is more than one good point, the rest are wasted. The motivation for using multiple proposals cannot be to explore larger search regions. Instead, we should consider MTM if making multiple proposals together is computationally cheaper than making them separately.

The orientational bias procedure in Frenkel and Smit is equivalent to MTM(II). In this context the proposals involve moving molecules; a molecule's probability depends on independent terms for its position and orientation. The orientational bias procedure proposes a single new position with several possible orientations. While the issue isn't discussed in detail, presumably it is sharing the positional energy computation across proposals that makes the algorithm worthwhile in molecular dynamics simulations.

A statistical example claiming benefit from multiple proposals is given by Qi and Minka. They drew proposals from local Gaussian approximations based on the Hessian of the log posterior at the current sample location. In high-dimensional problems these proposals would be expensive to construct, as a matrix factorization must be prepared for a multivariate-Gaussian sampler; drawing multiple samples is then comparatively cheap. Even sequential sampling can take advantage of this, because the proposal will be reused until a move is accepted. One wonders if, given the effort expended to construct the distribution, drawing more samples than needed for acceptance could be useful. The reported accuracy for a given CPU time does appear to be slightly better when combining Hessian-based sampling with an MTM variant.

Multiple-importance try
MTM makes it more likely that the chain moves, but is wasteful of information: multiple evaluations of the probability density are made at each iteration, but at most one of these points is used. Qi and Minka suggested a way to use all of the x~_k points, and called their combined Hessian-based MTM method adaptive multiple-importance try (AMIT). We will refer to the part of the algorithm that concerns including all the points in MTM as multiple-importance try (MIT). Qi and Minka also discuss cheap approximations to Hessians, but are still keen to get the best use from them however constructed.

MIT is the standard multiple-try Metropolis algorithm with a particular choice for post-processing the samples. If p can be evaluated then the samples are weighted by w_k = p(x~_k) / q(x~_k; x). When p(x) is only known up to a constant, the w_k are computed using the unnormalized version p*(x) and then made to sum to one. This is standard importance sampling using q(x~_k; x) as a proposal distribution; the locations x of these importance samplers are chosen using MCMC. Using local Gaussian approximations can lead to importance sampling estimators with large or even infinite variances. This suggests that MIT is likely to give erroneous results on high-dimensional problems. MIT will work when good global proposal distributions can be obtained, but we are unsure why MCMC would be an appropriate way to adapt these distributions. These are similar concerns to those discussed in the earlier section regarding population Monte Carlo.
Waste-recycled MTM

Waste recycling (reviewed in an earlier subsection) is a method for using proposals that are not accepted as part of the Markov chain. Here we describe how to recycle the points from the MTM method that are normally unused. Waste recycling does not carry the risks of huge variances associated with using importance sampling estimators with Metropolis proposals.

We first take the MTM joint distribution identified in the figure,

    p(x, x~_1..K) = p(x) prod_k q(x~_k; x),

and augment it with a new variable, x''. The marginal distribution of the new variable should be the target distribution of interest. This could be achieved by copying x, or by picking one of the proposals at random and copying it according to the Metropolis-Hastings acceptance rule:

    a_k = p(x'' = x~_k | x, x~_1..K)
        = (1/K) min(1, [p(x~_k) q(x; x~_k)] / [p(x) q(x~_k; x)]),
    p(x'' = x | x, x~_1..K) = 1 - sum_k a_k.

As in the earlier subsection, any quantity can be estimated by averaging over each of the {x, x~_k} settings, weighted by their probability under this conditional distribution.

So far we are still discarding the x~'_k
variables drawn by the algorithm. If the (x', x~'_1..K) joint state were accepted, we could draw a new auxiliary variable, x''' ~ p(x''' | x', x~'_1..K), using the conditional distribution analogous to the equation above. To include both possible joint settings in an estimator we use a further auxiliary variable, x-bar, which is equal to one of x'' or x'''. We could define the probability of each choice according to the min(1, a) acceptance probability of the MTM algorithm; however, this would sometimes give zero weight to one side. In normal waste-recycling we do not mind ignoring the initial state, as it was already used at the previous iteration, but the x~'_k were created at this iteration. Instead we use a probability proportional to the summed weights on each side,

    p(x-bar = x''' | ...) = sum_k w(x~'_k, x') / [sum_k w(x~_k, x) + sum_k w(x~'_k, x')],

which also gives x-bar the correct stationary distribution. The final estimator averages over all states visited at each iteration, based on an identity of the form

    E_{p(x)}[f(x)] = E_joint[ sum over settings of x-bar of p(x-bar | joint state) f(x-bar) ],

where the outer expectation is over the joint distribution of all variables drawn in an iteration. The expression looks cumbersome, but when expanded out only costs O(K) to compute; the estimator can be implemented once and used as a black box. The conditional probabilities above require some backwards proposal probabilities not normally needed by MTM or MIT. This would remove the potential advantage when proposals are expensive, as in Qi and Minka, but would not be a problem when, as in Frenkel and Smit, the purpose of multiple proposals is shared computation of p(x).
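For concreteness, here is the simple single-proposal form of waste recycling from the earlier review (not the full wr-MTM estimator above): every Metropolis proposal contributes to the estimator, weighted by its acceptance probability. Target and parameter choices are ours.

```python
import math
import random

def p_tilde(x):
    """Unnormalised standard Gaussian target."""
    return math.exp(-0.5 * x * x)

def metropolis_waste_recycled(f, n=20000, step=2.5, seed=0):
    """Metropolis with a waste-recycling estimator of E[f(x)]: at each
    iteration, both the proposal and the current state contribute,
    weighted by the acceptance probability."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(n):
        y = x + rng.gauss(0.0, step)
        a = min(1.0, p_tilde(y) / p_tilde(x))
        total += a * f(y) + (1.0 - a) * f(x)   # recycle the rejected state too
        if rng.random() < a:
            x = y
    return total / n

est = metropolis_waste_recycled(lambda x: x * x)   # estimates E[x^2] = 1
```

Each term is the conditional expectation of f at the next state given the current state and proposal, so the estimator is valid at stationarity without any importance-sampling tail conditions.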
We compared MTM's mean square error (MSE) in estimating a simple expectation of x to that obtained by the waste-recycled version (wr-MTM), on the same t-distribution example as before.

Table: accuracy on the t-distribution after a fixed number of proposals. The post-processing methods compared are Metropolis-Hastings at each of the two step sizes, MTM, wr-MTM and MIT; the entries report mean square error (MSE).

The MTM run with waste-recycling reduces the MSE considerably, but still not to the level of Metropolis-Hastings; speedups from efficient implementation of parallel proposals would still be required to justify waste-recycled MTM. Treating the x~_k proposals as draws from importance samplers, the MIT estimator obtains the best MSE (unless we are applying the alternative algorithm suggested earlier, which keeps proposals between iterations). This reflects the fact that importance sampling is usually more appropriate than MCMC for estimating the statistics of simple one-dimensional densities.

In higher dimensional problems, where MCMC is favored over simple importance sampling, the proposals will generally be more local and take longer to explore the target distribution. This situation was simulated by reducing the proposal width and taking more MTM steps. Averaging over many such runs, the mean square error of MTM in estimating the same expectation was not hurt by waste-recycling. There wasn't much to gain from using the extra samples, as the MTM sequence is highly correlated; as such, recycling would not be worth the computational expense, but wouldn't be too harmful. The MIT importance sampling estimate had a significantly worse MSE: MIT does not necessarily reduce variance, and in high-dimensional problems it could be much worse.

[Figure: the idea behind successive overrelaxation for minimization. When performing component updates, moving past the minimum along the current search direction can give faster progress along the direction that points to the optimum. Monte Carlo overrelaxation methods attempt to successively move to the other side of a component's conditional distribution, which can cause persistent motion along a more interesting direction.]

Ordered overrelaxation

Ordered
overrelaxation (Neal, b) is a technique to improve the convergence of Gibbs sampling for some distributions with strong positive correlations amongst their variables. A variety of MCMC methods exist that are inspired by successive overrelaxation in optimization (see the figure above). Ordered overrelaxation is one of these methods, considered here because it is based on a population of proposals.

Ordered overrelaxation is based on Gibbs sampling. A normal iteration of Gibbs sampling involves resampling x_d from its conditional distribution, p(x_d | x_{not d}). A wasteful way to implement this rule is to draw K samples from the conditional distribution and then pick one of them uniformly at random. We could leave the conditional distribution stationary by also including the current setting and choosing uniformly from the K+1 settings. Moreover, we need only apply a transition operator that leaves this uniform distribution over the K+1 settings stationary; ordered overrelaxation provides such an operator.

The ordered overrelaxation transition operator requires that the candidates for the d-th component can be ordered in some way. If they are real scalars we simply sort by numerical value; if only a partial ordering is available then ties are broken arbitrarily. After sorting the points they are relabeled in order, s_0 <= s_1 <= ... <= s_i = x_d <= ... <= s_K. The operator chooses x_d <- s_{K-i} as the next step in the Markov chain. As long as K is odd, this rule will always pick one of the K new values. Because the rule is deterministic and reversible, it clearly leaves a uniform distribution over the points stationary. Choosing x_d <- s_{K-i} tends to move to the opposite side of the conditional distribution from x_d; this is precisely the goal of overrelaxation. The effect becomes stronger when increasing K: for very large K the point is moved almost deterministically from its current quantile q to 1-q.

In naive implementations, any benefit from persistent motion is cancelled by the cost of drawing additional samples. Whether ordered overrelaxation is useful will depend on whether repeated draws s_k ~ p(x_d | x_{not d}) can be performed cheaply; Neal (b) discusses some circumstances in which this can be done. Another potential cost saving is to make the number of points K small when ordered overrelaxation is less useful. We now introduce a new procedure with this goal.

Adapting K automatically

Our new algorithm attempts to use fewer samples in ordered overrelaxation when the current position is close to the center of its conditional distribution; this is a situation where overrelaxation is typically unhelpful. It requires the user to set k_min and k_max parameters, giving the smallest and largest acceptable number of samples to draw per iteration. The procedure is detailed in the algorithm below. The numbers of new points to the right and left of the current position are tracked as r and l. In an extreme situation, where the current position is at the very edge of the conditional distribution, one of r and l will remain at zero and k_max
points will be drawn. Conversely, when x_d is near the median of the conditional distribution the algorithm will typically return sooner. After obtaining an unordered set of points from the algorithm, the list is sorted together with the current point, x_d = s_i, giving an ordered set {s_k} as before. We now show that ordered overrelaxation still satisfies detailed balance using a variable number of points, k.

Algorithm: self size-adjusting population for ordered overrelaxation.
Generates a number of points, k, and an unordered set {s_1, ..., s_k}.
    Inputs: k_min, k_max, current location x_d.
    Initialize l = 0, r = 0.
    for k = 1 to k_max
        s_k ~ p(x_d | x_{not d})
        if s_k > x_d then r <- r + 1
        else if s_k < x_d then l <- l + 1
        end if
        if r >= k_min and l >= k_min then return
        end if
    end for

We need the equilibrium probability of starting at point x_d = s_i, generating a set {s_k}, and then deterministically transitioning to x_d = s_{k-i}. The order in which the set of points was generated does not matter, although some orderings are not possible because the algorithm would terminate before generating the entire set. Let C(l, r) be the number of orderings of r + l points, with r greater than x_d, that can be generated by the algorithm. The equilibrium probability of all the points is

    C(l, r) prod_k p(s_k | x_{not d}),

where the product includes the current point. Given the symmetry C(l, r) = C(r, l), this is also the probability of starting at x_d = s_{k-i}, producing the same set of points, and then deterministically moving to x_d = s_i. Summing over all intermediate points shows that the transition x_d -> x_d' satisfies detailed balance.
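The basic ordered overrelaxation operator (with fixed K, not the adaptive version above) is short to implement. Here is a sketch for a bivariate Gaussian, where the conditionals are known in closed form; all parameter choices are ours.

```python
import random

def cond_sample(other, rho, rng):
    """Conditional of a unit-variance bivariate Gaussian with correlation rho."""
    return rng.gauss(rho * other, (1.0 - rho * rho) ** 0.5)

def ordered_overrelax(x_d, other, rho, K=11, rng=random):
    """Ordered overrelaxation (after Neal): draw K values from the conditional,
    sort them together with the current value, and jump to the mirrored rank."""
    pts = sorted([cond_sample(other, rho, rng) for _ in range(K)] + [x_d])
    i = pts.index(x_d)     # current rank i among the K+1 ordered points
    return pts[K - i]      # mirrored rank K-i; odd K never returns x_d itself

rng = random.Random(0)
rho = 0.99
x = [0.0, 0.0]
trace = []
for _ in range(4000):
    x[0] = ordered_overrelax(x[0], x[1], rho, rng=rng)
    x[1] = ordered_overrelax(x[1], x[0], rho, rng=rng)
    trace.append(x[0])
var = sum(v * v for v in trace) / len(trace)   # marginal variance is 1
```

With strongly correlated variables the mirrored-rank rule produces the persistent, oscillating motion along the ridge that plain Gibbs sampling lacks.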
Pivot-based transitions

Inspired by the ordered overrelaxation generalization of Gibbs sampling, we now introduce a new method which uses general Markov chain transition operators. We start from any pair of mutually reversible transition operators,
  $T(x' \leftarrow x)\,p(x) \;=\; T^\dagger(x \leftarrow x')\,p(x')$,
which leave a desired target distribution $p(x)$ stationary; a single reversible operator $T = T^\dagger$ will also suffice. The algorithm below takes the existing operator pair and constructs a new transition operator $\tilde T(x' \leftarrow x)$; the procedure is illustrated in the accompanying figure. The pivot-based transition operator will favor moves to the opposite side of the region accessible by $T$, as defined by a user-supplied ordering. The stationary target distribution $p(x)$ is maintained without introducing any rejections into the chain, although $\tilde T(x'{=}x \leftarrow x)$ will carry finite probability in some discrete settings, or if $T(x'{=}x \leftarrow x)$ is finite. Each of the $s_k$ points in the algorithm is marginally distributed according to $p$.
Algorithm (the pivot-based transition $\tilde T(x' \leftarrow x)$):
  1. Take one step of $T^\dagger$ from point $x$ to a pivot state $z$, i.e. $z \sim T^\dagger(z \leftarrow x)$.
  2. Use $T$ to draw $K$ points one step away from $z$, i.e. $s_k \sim T(s_k \leftarrow z)$, $k = 1 \ldots K$.
  3. Use the $s_k$ and $x$ points to create an ordered list $\{s_k\}_{k=1}^{K+1}$, breaking ties arbitrarily, and relabel the points such that $s_k \le s_{k+1}$.
  4. Identify the index $i$ giving $x = s_i$; set $x'$ equal to $s_{K+2-i}$.

Figure: illustration of the pivot-based transitions operator for $K$ points, moving between $x$ and $x'$ via the pivot state $z$ and the set $S$, consisting of the round dots excluding the ringed one. (a) Moving forwards, generating a draw from $\tilde T(x', S, z \leftarrow x)$. (b) The reverse process, involving the same $S$ and $z$, occurs with probability $\tilde T(x, S, z \leftarrow x')$.

All points, including $x$ and $x'$, can be seen as draws from $T(\cdot \leftarrow z)$: they all have the same relation to the pivot state. By this symmetry, the point labeled as the starting position can be resampled from a uniform distribution over the points, or updated as in ordered overrelaxation, which leaves the uniform distribution stationary. While perhaps unconvincing as a proof, this symmetry is important: if all the proposals started at $x$, then this location would be made special, making it harder to construct plausible reverse moves starting at any other point.

We now show explicitly that the new operator satisfies detailed balance. Given a starting point $x$, the probability of generating a pivot state $z$, a new point $x'$ and $K-1$ other points $S$, as in figure (a), is
  $\tilde T(x', S, z \leftarrow x) \;=\; T^\dagger(z \leftarrow x)\; K!\; T(x' \leftarrow z) \prod_{s \in S} T(s \leftarrow z)$.
The first step must produce the pivot state $z$ from $x$; the other points can then be produced in any of $K!$ possible orderings. The probability of producing the same configuration of points by starting at $x'$, as in figure (b), is
  $\tilde T(x, S, z \leftarrow x') \;=\; T^\dagger(z \leftarrow x')\; K!\; T(x \leftarrow z) \prod_{s \in S} T(s \leftarrow z)$.
A small amount of manipulation gives
  $\tilde T(x', S, z \leftarrow x)\,p(x) \;=\; T^\dagger(z \leftarrow x)\,p(x)\; K!\; T(x' \leftarrow z) \prod_{s \in S} T(s \leftarrow z)$
  $\;=\; T(x \leftarrow z)\,p(z)\; K!\; T(x' \leftarrow z) \prod_{s \in S} T(s \leftarrow z)$
  $\;=\; T(x \leftarrow z)\; K!\; T^\dagger(z \leftarrow x')\,p(x') \prod_{s \in S} T(s \leftarrow z) \;=\; \tilde T(x, S, z \leftarrow x')\,p(x')$,
i.e. the new chain satisfies detailed balance. The second line was obtained from the $T$ and $T^\dagger$ operators' mutual reversibility condition; the third line applies the same condition to $x'$. The final result is obtained simply by summing over $z$ and $S$.
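As a concrete, illustrative sketch (not the thesis's reference implementation), the following uses a single reversible Metropolis kernel as both $T$ and $T^\dagger$, targeting a standard normal; the natural ordering on the reals plays the role of the user-supplied ordering, and ties arising from Metropolis rejections are broken at random:

```python
import math, random

def mh_step(x, log_p, step=1.0, rng=random):
    # One Metropolis step with a symmetric Gaussian proposal:
    # a single reversible operator, so it can serve as both T and T-dagger.
    y = x + rng.gauss(0.0, step)
    if math.log(rng.random()) < log_p(y) - log_p(x):
        return y
    return x

def pivot_transition(x, log_p, K=4, rng=random):
    z = mh_step(x, log_p, rng=rng)                       # step 1: pivot state
    s = [mh_step(z, log_p, rng=rng) for _ in range(K)]   # step 2: K points from z
    # Steps 3-4: order the K+1 points (random tie-breaks), move to the opposite rank.
    entries = [(v, rng.random(), False) for v in s] + [(x, rng.random(), True)]
    entries.sort(key=lambda e: (e[0], e[1]))
    i = next(j for j, e in enumerate(entries) if e[2])
    return entries[len(entries) - 1 - i][0]
```

Iterating `pivot_transition` leaves the target stationary while tending to jump across the region reachable from the pivot.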
The number of points $K$ used by the algorithm could be automatically adjusted: we would use the self-sizing algorithm to draw the points, replacing the conditional distribution with $T(\cdot \leftarrow z)$. The proof above still holds if we replace $K!$ with $C(l,r)$.

Ordered overrelaxation with pivot-based transitions

Gibbs sampling updates each component $x_d$ of a distribution in turn by applying a transition operator $T_d$ that resamples from the conditional $p(x_d \mid x_{\setminus d})$. The original ordered overrelaxation replaces Gibbs sampling's $T_d$ operators with modified versions that tend to move to the opposite side of the conditional distribution from the current setting of $x_d$. Gibbs sampling
is not the only Monte Carlo algorithm that updates components of a distribution in turn. Now we can perform ordered overrelaxation by replacing any algorithm's component-based update operators $T_d$ with their pivot-based versions; each update will tend to move the current setting to the opposite side of the distribution defined by $T_d$. When the original Markov chain is Gibbs sampling, the $z$ states become irrelevant and the resulting transition operator is equivalent to the standard ordered overrelaxation algorithm.

The figure below gives an illustrative example of ordered overrelaxation applied to a bivariate Gaussian distribution. Using pivot-based transitions, we can obtain the same benefits as the original ordered overrelaxation with slice sampling, which does not require the ability to sample from conditional distributions. Again, as with standard ordered overrelaxation, the benefits are only worth the computational cost if some saving can be made when drawing multiple points.

Although pivot-based transitions do not introduce any rejections into the chain, if the underlying transition operator is a Metropolis method then the chain will sometimes stay still. The rejections will cause reversals in the direction of motion, reducing the ability of the chain to maintain a persistent direction. This will tend to occur at the edges of a distribution, where rejections are more frequent and where ordered overrelaxation is normally of most use. Panel (c) of the figure demonstrates that pivot-based transitions based on Metropolis work much less well than the same method based on slice or Gibbs sampling. Later in this chapter we will develop a better Metropolis method based on pivot states; first we explore another use of pivot-based transitions.

Persistence with pivot states

As pivot-based transitions apply to general transition operators, we can consider algorithms that are not tied to component-based updates. In this section we present an alternative way to achieve persistent motion using pivot-based transitions. This algorithm is based on the
expanded distribution
  $p(x, z) \;=\; T^\dagger(z \leftarrow x)\,p(x) \;=\; T(x \leftarrow z)\,p(z)$.
For other ways of constructing nonreversible chains that use this distribution, see Neal. Here we do not require the base operator $T$ to be reversible. A sample $(x, z)$ from the expanded distribution provides an equilibrium sample $x$ and a pivot state $z$, as produced by step 1 of the pivot-based transition algorithm. Following the rest of the algorithm gives a new setting $(x', z)$: the same pivot state with a new starting point.

Figure: ordered overrelaxation schemes applied to a highly correlated bivariate Gaussian. (a) Standard ordered overrelaxation (Gibbs sampling). (b) Pivot-based transitions (univariate slice sampling). (c) Pivot-based transitions (Metropolis). Some exact samples from the distribution are shown in black. Standard Metropolis, shown in red, proceeds by slow diffusion. Ordered overrelaxation schemes, shown in blue, are able to make persistent progress along the distribution (here exaggerated with large $K$). Rejections in the underlying transition operators upset this persistent motion, however: the Metropolis sampler in (c) gains little benefit from pivot-based transitions.

This procedure leaves the expanded distribution invariant. The proof is similar to before:
  $\tilde T(x', S \leftarrow x, z)\,p(x, z) \;=\; K!\; T(x' \leftarrow z) \prod_{s \in S} T(s \leftarrow z)\; \cdot\; T(x \leftarrow z)\,p(z)$
  $\;=\; K!\; T(x \leftarrow z) \prod_{s \in S} T(s \leftarrow z)\; \cdot\; T(x' \leftarrow z)\,p(z) \;=\; \tilde T(x, S \leftarrow x', z)\,p(x', z)$.
The final line was obtained by using the definition of the expanded distribution and symmetry in $x$ and $x'$. Summing over the intermediate points $S$, we have shown that the appropriate detailed balance relationship holds with respect to the expanded distribution.

The definition of the expanded distribution suggests that the sample can also be interpreted as an equilibrium sample $z$ and a pivot state $x$ drawn via $T(x \leftarrow z)$. This pivot state could be updated by an alternative pivot-based transition operator $\tilde T^\dagger$: where $T^\dagger$ was used in step 1, we switch to using $T$, and we use $T^\dagger$ to produce the $K$ points in step 2. A version of the above proof shows that this operator also leaves the expanded distribution invariant.

Given a pair sampled from the expanded distribution, $x \sim p(x)$ and $x' \sim T^\dagger(x' \leftarrow x)$, we alternate between two Markov chain updates: first we update $x$, using $x'$ as a pivot state, with the operator $\tilde T$; next we update $x'$, using $x$ as a pivot state, with the alternate operator $\tilde T^\dagger$. The figure below illustrates the effect of alternately applying these two operators. Both operators have a tendency to reverse the ordering of $x$ and $x'$; by continually jumping over each other, the pair move persistently across the space that induces the ordering. By chance, for finite $K$, the points will sometimes stay in the same order; in subsequent iterations motion will be in the opposite direction, allowing the chain to mix across the whole space. The second figure shows pivot-based persistent motion when the base transition operator has a small step size, which would normally lead to slow diffusion. The effect is dramatic: the entire range of the distribution is explored in many fewer iterations.
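The alternating-pair scheme can be sketched as follows, again using a single reversible Metropolis kernel for both base operators (so $T = T^\dagger$); the target here is a uniform distribution on $[0, 50]$ with small Gaussian steps, loosely mirroring the figure's setting. All names and parameter choices are illustrative:

```python
import math, random

def mh_step(x, log_p, step=1.0, rng=random):
    # One Metropolis step with a symmetric Gaussian proposal.
    y = x + rng.gauss(0.0, step)
    if math.log(rng.random()) < log_p(y) - log_p(x):
        return y
    return x

def opposite_rank_update(x, pivot, log_p, K=8, rng=random):
    # Draw K points one T-step from the pivot, rank them together with x,
    # and jump to the opposite rank (ties broken at random).
    pts = [(mh_step(pivot, log_p, rng=rng), rng.random(), False) for _ in range(K)]
    pts.append((x, rng.random(), True))
    pts.sort(key=lambda e: (e[0], e[1]))
    i = next(j for j, e in enumerate(pts) if e[2])
    return pts[len(pts) - 1 - i][0]

def alternating_pair(x, y, log_p, K=8, rng=random):
    x = opposite_rank_update(x, y, log_p, K, rng)  # y acts as the pivot for x
    y = opposite_rank_update(y, x, log_p, K, rng)  # then x acts as the pivot for y
    return x, y
```

The two points keep leap-frogging each other, so the pair drifts persistently along the ordering instead of diffusing.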
However, the improved dynamics alone are not actually sufficient to justify the increased cost of drawing multiple proposals; as with MTM, some further savings, such as parallel computation, are required.

Figure: using pivot states for persistent motion. Each row shows a new iteration of the algorithm described in the main text (marked are: the point to be updated; the pivot state, updated at the next iteration; the new points, the set $S$; the updated point; the next pivot). The pivot state tends to move in the same direction for multiple iterations; in the final two iterations shown, the direction of motion has reversed. Drift in one direction would persist for longer if a larger number of states were drawn from the pivot state.

Figure: example of persistent motion exploring a uniform distribution with Gaussian proposals (position $x$ against iteration number). Metropolis's random walk behavior is greatly reduced by pivot-based transitions, although at a greater computational cost.

Pivot-based Metropolis

Pivot-based Metropolis is a new alternative to MTM-based methods for using multiple Metropolis proposals. As in MTM, we use an arbitrary proposal distribution $q(x'; x)$, but by making different tradeoffs, ordered-overrelaxation updates become possible. The new procedure follows the pivot-based transition algorithm closely, but uses arbitrary proposal distributions rather than transition operators. Arbitrary proposals could be turned into Metropolis-Hastings transition operators for use in that algorithm, but many of the candidate points would then be in the same place due to rejections. The algorithm below provides a way to use all of the points proposed from a proposal distribution $q$. The pivot state can optionally be created using a different proposal distribution $\tilde q$; indeed the pivot state need not live in the same space as the distribution being sampled, unless applying the persistence procedure of the previous subsection. This procedure can be seen as an instance of random proposal distributions (Besag et al.).

Algorithm (the pivot-based Metropolis operator $\tilde T(x' \leftarrow x)$):
  1. Take one step of $\tilde q$ from point $x$ to a pivot state $z$, i.e. $z \sim \tilde q(z; x)$.
  2. Use $q$ to draw $K$ points one step away from $z$, i.e. $s_k \sim q(s_k; z)$, $k = 1 \ldots K$.
  3. Use the $s_k$ and $x$ points to create an ordered list $\{s_k\}_{k=1}^{K+1}$, breaking ties arbitrarily, and relabel the points such that $s_k \le s_{k+1}$. Identify the index $i$ giving $x = s_i$.
  4. Compute weights $w_k = p(s_k)\,\tilde q(z; s_k)\,/\,q(s_k; z)$.
  5. Choose $x' = s_j$ from $\{s_k\}$ using a distribution proportional to the weights, or any reversible move $T_s(j \leftarrow i)$ that leaves this distribution invariant.

The pivot-based Metropolis operator of this algorithm is a valid transition operator for $p(x)$. Proof: the equilibrium probability of starting at $s_i$, generating pivot state $z$ and the remainder of the set of points $S = \{s_k\}$, and choosing final point $s_j$, is
  $K!\; T_s(j \leftarrow i)\; p(s_i)\,\tilde q(z; s_i) \prod_{k \ne i} q(s_k; z)$.
To see that this is invariant to swapping $s_i \leftrightarrow s_j$, note that it equals $K!\,\big[\prod_k q(s_k; z)\big]\, w_i\, T_s(j \leftarrow i)$, and any move that is reversible with respect to the weight distribution satisfies $w_i\, T_s(j \leftarrow i) = w_j\, T_s(i \leftarrow j)$. Summing over all intermediate points $\{s_{k \ne i, j}\}$ and $z$ shows that $\tilde T(s_j \leftarrow s_i)$ satisfies detailed balance.
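A sketch with symmetric Gaussian proposals for both $q$ and $\tilde q$, so the proposal ratios in the weights cancel and $w_k \propto p(s_k)$; the multinomial choice of step 5 is used. This is an illustrative instance, not the thesis's reference implementation, and the ordering of step 3 is skipped because the plain weighted choice does not need it:

```python
import math, random

def pivot_metropolis(x, log_p, K=4, step=1.0, rng=random):
    # Step 1: pivot state from a symmetric Gaussian proposal q-tilde.
    z = x + rng.gauss(0.0, step)
    # Step 2: K candidate points, one q-step away from the pivot, plus x itself.
    cands = [z + rng.gauss(0.0, step) for _ in range(K)] + [x]
    # Step 4: with symmetric q and q-tilde the weights reduce to w_k = p(s_k).
    logw = [log_p(s) for s in cands]
    mx = max(logw)
    w = [math.exp(v - mx) for v in logw]
    # Step 5: choose x' in proportion to the weights (a reversible move).
    u = rng.random() * sum(w)
    acc = 0.0
    for s, wk in zip(cands, w):
        acc += wk
        if u <= acc:
            return s
    return cands[-1]
```

An ordering-based choice in step 5, such as the reflect rule, recovers the overrelaxed behavior; the multinomial choice shown here is the simplest valid option.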
Ordered overrelaxation uses a transition rule that leaves a uniform distribution over a set of points stationary. Pivot-based Metropolis, and perhaps other algorithms, require a generalized move that leaves a nonuniform discrete distribution stationary. A probability distribution over points $\{s_k\}$ can be represented on a unit interval: each point occupies a segment with width equal to its probability $p_k$. The segments can be ordered by any method that does not use the current setting $s_i$ or the history of the Markov chain.

Figure: the reflect move for discrete distributions. A unit interval is constructed from segments, one per point $s_k$; each of these bars has width equal to the point's probability. Given current point $s_i$, the probability of transitioning to another point is proportional to its overlap with $s_i$'s reflection; for example, $s_j$ will be selected with probability $o_{ij}/p_i$. Self-transitions are only possible if $s_i$ overlaps with the middle of the unit interval, which is when ordered overrelaxation is less helpful.

Assume we have a current sample $s_i$ and wish to apply a transition operator that tends to take us to the opposite end of the distribution. Such a transition operator is illustrated in the figure. The probability interval corresponding to the current point, $[c, c + p_i]$, is reflected to $[1 - c - p_i, 1 - c]$. The reflected interval will have some overlap with one or more points' probability segments; we sample from these points with probability proportional to their overlap $o_{ij}$ with the reflected interval. The probability of observing a transition from $s_i$ to $s_j$ is
  $p_i\, T_s(j \leftarrow i) \;=\; p_i\, \frac{o_{ij}}{p_i} \;=\; o_{ij} \;=\; o_{ji}$,
which is independent of the direction of the transition (reflection of the unit interval is a length-preserving involution, so the overlaps are symmetric); therefore detailed balance holds, and the generalized reflection rule leaves the discrete target distribution invariant. Using this rule within the pivot-based Metropolis algorithm gives an operator, based on arbitrary proposals, that is suitable for use in ordered overrelaxation.
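A sketch of the reflect rule; the segment order here is simply fixed by index (any ordering that ignores the current point works). The helper computes the full overlap matrix, whose symmetry is exactly the detailed-balance condition $p_i T_s(j \leftarrow i) = o_{ij} = o_{ji}$:

```python
def reflect_overlaps(p):
    # p: point probabilities (summing to 1), in a fixed order.
    # Returns o[i][j]: overlap of the reflection of segment i with segment j.
    cum = [0.0]
    for pi in p:
        cum.append(cum[-1] + pi)
    n = len(p)
    o = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lo, hi = 1.0 - cum[i + 1], 1.0 - cum[i]  # segment i, reflected
        for j in range(n):
            o[i][j] = max(0.0, min(hi, cum[j + 1]) - max(lo, cum[j]))
    return o

def reflect_move(i, p, u):
    # Sample the next index given current index i and a uniform variate u in [0,1):
    # j is chosen with probability o[i][j] / p[i].
    row = reflect_overlaps(p)[i]
    target = u * p[i]
    acc = 0.0
    for j, oij in enumerate(row):
        acc += oij
        if target < acc:
            return j
    return len(p) - 1
```

Each row of the overlap matrix sums to $p_i$, since the reflected interval of length $p_i$ is entirely covered by the unit interval's segments.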
It is equally suited to the persistent-motion scheme introduced earlier in this section.

Summary

This chapter explored ways to use populations of points drawn from proposal distributions. We started with a review of two existing methods, population Monte Carlo (PMC) and multiple-try Metropolis (MTM), which highlighted two important points:
  1. proposal distributions used by MCMC are usually unsuitable for simple importance sampling; attempts to combine MCMC with estimators based on importance sampling should not forget this;
  2. drawing multiple points in parallel from a proposal distribution usually gives slower mixing than drawing the same number of points sequentially; the parallel methods require savings from shared computations to be competitive.
In response to the first point we derived a
waste-recycling scheme for MTM, as an alternative to an existing importance-sampling-based estimator.

We then focussed on methods that use a population of points to move the state of a Markov chain to the opposite side of a user-defined ordering. Chaining these updates together can cause persistent motion, through Neal's ordered overrelaxation or the method we introduced earlier in the chapter. A random walk will explore a distribution with lengthscale $L$ in $O((L/\epsilon)^2)$ steps of size $\epsilon$; more persistent motion can bring this down to $O(L/\epsilon)$. Our contributions to this area were:
  - an algorithm which automatically reduces the size of a population when ordered overrelaxation is less likely to be useful, saving computation;
  - generalizing ordered overrelaxation to transition operators other than Gibbs sampling;
  - using ordered transition operators for persistent motion without using component-based updates;
  - generalizing ordered updates to operate directly on a set of weighted points; these moves are less likely to reject than using Metropolis-Hastings within the pivot-based algorithm, which is important because rejections destroy persistent motion.
The usefulness of these innovations relies on the assumption that drawing multiple proposals is computationally efficient. It remains to be seen if implementations leveraging caching of intermediate results or parallel computations will make our work effective in real applications. Similar assumptions are made by related work in the literature, which we now briefly consider, along with directions for future research.

Related and future work

Tjelmeland and Stormark investigate an idea similar to pivot-based Metropolis: they suggest drawing a set of antithetic variables, possibly using quasi Monte Carlo. Recent work also applies these ideas to MTM (Craiu and Lemieux); pivot-based Metropolis could also be extended in this way. Tjelmeland also recommends waste-recycling, which would also apply to pivot-based algorithms. The main difference in his algorithm is that all the states are drawn as proposals from the current state $x$, rather than an intermediate state $z$. Making reversible moves based on these proposals requires working out the probability of each state reproducing the entire ensemble, an $O(K^2)$ computation; our pivot state approach reduces this computation to $O(K)$, and makes an exchange of the current point with one of the proposals seem more natural.

As an example, we showed pivot-based ordered overrelaxation applied to slice sampling. In fact, slice sampling already has its own method for overrelaxation (Neal), although this is susceptible to rejections unless the edges of the slice are found accurately. Ordered overrelaxation may be preferable, as it never
has rejections, while needing only enough information to draw $K$ samples. There is a way to eliminate the pivot state when slice sampling from unimodal distributions: it is possible to Gibbs sample the slice-sampling auxiliary distribution, so standard ordered overrelaxation applies.

Earlier in this chapter we described a new way to make persistent motion along a direction that defines an ordering on the state space. Choosing suitable methods for ordering points will, in general, be problem-dependent. One generic application could be simulated tempering, where energy could be used to define an ordering; persistent motion along this ordering should encourage the inverse temperature to move more rapidly between zero and one.

Radford Neal has suggested (personal communication) considering an alternative to pivot-based transitions. This method starts with a random integer $f$ drawn uniformly between $0$ and $K$ inclusive. The current point $x$ is evolved through a sequence of $f$ transitions using $T$; then, starting at $x$ again, a sequence of $b = K - f$ backwards transitions is simulated using $T^\dagger$. Choosing uniformly from the resulting set of states maintains the stationary distribution; also, reordering the points and choosing the complement to $x$, as in ordered overrelaxation, is valid. Producing a chain of states, rather than a star of steps starting at a single state, is likely to produce larger moves. A version of Neal's idea based on the weighted reflect operator above is also possible: the chain could use transition operators that leave a cheaper-to-evaluate surrogate distribution invariant; proposal weights could be evaluated for a thinned subset of the chain, and then the reflect operator applied. Yet another possibility is the multipoint Metropolis method (Qin and Liu), an extension to MTM that uses a chain of states. A disadvantage of using a chain of transitions is that each move must be made sequentially; the pivot-based algorithms introduced in this chapter could evaluate their proposals in parallel on suitable computer hardware.

Normalizing constants and nested sampling

The computation of normalizing constants plays an important role in inference and statistical physics. For
example, Bayesian model comparison needs the evidence, or marginal likelihood, of a model $M$ given observed data $D$:
  $Z \;=\; p(D \mid M) \;=\; \int p(D \mid \theta, M)\, p(\theta \mid M)\, \mathrm{d}\theta \;=\; \int L(\theta)\,\pi(\theta)\, \mathrm{d}\theta$,
where the model has prior $\pi$ and likelihood $L$ over parameters $\theta$. In statistical physics,
  $Z(\beta) \;=\; \sum_x \exp(-\beta E(x))$
is the normalizing constant for the distribution over states $x$ with energy $E$ at inverse temperature $\beta$. In this context $Z$ is an important quantity known as the partition function, a function of $\beta$ from which several fundamental properties of a system can be derived. The marginal likelihood is also a function, in this case of the model $M$, which will become important in a later chapter; in this chapter we focus on estimating the constant for a given model. We also consider estimating the whole $Z(\beta)$ function from physics, which, perhaps surprisingly, is useful for solving the marginal likelihood problem.

Statistical physics normally uses the normalization in its log form: $\log Z$ is known as the free energy of a system. When used in statistical inference, $\log Z$ has also been found to be a useful and natural scale for comparing models. Given two models $M_1$ and $M_2$, the log-likelihood ratio $\log Z_1/Z_2 = \log p(D \mid M_1) - \log p(D \mid M_2)$ can be used in a classical hypothesis test, which rejects $M_2$ in favor of $M_1$ when a threshold is exceeded. This means that absolute differences in $\log Z$ correspond to shifts in
model quality that are meaningful without reference to an alternative model. Researchers decrypting codes at Bletchley Park judged that a difference of roughly a deciban, $\log_{10} Z_1/Z_2 = 0.1$, is about the smallest difference in two models that people can express. The table summarizes qualitative descriptions of the importance of marginal likelihood ratios.

Table: a selection of qualitative descriptions of the importance of evidence values, taken from Kass and Raftery and from Jeffreys (appendix B). Ranges of $\log_{10} Z_1/Z_2$ (bans), $\log_e Z_1/Z_2$ (nats) and $\log_2 Z_1/Z_2$ (bits) are labeled as evidence against $M_2$ that is weak, substantial, strong, decisive, or beyond reasonable doubt. The nats column could be turned into a deviance by multiplying by two; for more on units of information content, particularly the ban, see MacKay.

The precise interpretation of the numerical value of a model's evidence will depend on context. In some applications it is quite possible to keep using $M_2$ even with a marginal-likelihood ratio that, in isolation, seems in favor of $M_1$ beyond reasonable doubt. In a legal setting it may be that a guilty hypothesis being a thousand times more likely than innocent is enough to convict, but other applications associate even more extreme losses with adopting a particular model. In data compression of large files, a difference in size of a few bits is insignificant: if compressing with $M_2$ is computationally
cheap, an improvement in marginal likelihood very much greater than a few bits is required to adopt a new model. Even in the legal setting, one must remember that the most probable model is not necessarily the one with the highest marginal likelihood: extreme prior ratios can cancel large likelihood ratios. Although this shouldn't happen very often (under the rules of rational inference one should mechanically apply Bayes' rule), if the data are strongly out of line with prior expectations it would seem sensible to carefully re-examine the model's assumptions and check for computational errors.

This chapter focusses on the computation of $Z$. The message to take from the above discussion is that computing the evidence of a model to a very high level of precision is usually pointless: it is difficult to appreciate a difference in $\log Z$ of less than one, and some errors larger than this may not affect decisions based on the computation. This means that even noisy Monte Carlo estimates can be useful; indeed, Monte Carlo techniques are sometimes identified as providing "gold standard" estimates of $Z$, suitable for comparison with other approximate techniques (e.g. Beal and Ghahramani; Kuss and Rasmussen). However, just as with any Monte Carlo integration method, it is quite possible to get wrong answers for some classes of difficult problems; this includes algorithms guaranteed to be correct asymptotically. Developing a variety of methods and checks, and hoping for the best, is all we can do in general.

This chapter first examines the necessary ingredients of a Monte Carlo method for computing $Z$. Next we review nested sampling, a new method due to John Skilling, and provide a comparative analysis of this new algorithm and more established techniques. Then, having analyzed nested sampling in its own right, we consider using it as a method for guiding and checking other algorithms.

Starting at the prior

By definition, the prior should spread its probability mass over settings of the parameters deemed reasonable before observing the likelihood $L$. Therefore samples from this distribution can sometimes provide a useful representation of the whole parameter space. In particular, as $Z$ is just an average under the prior, a simple Monte Carlo approach could be attempted:
  $Z \;=\; \int L(\theta)\,\pi(\theta)\,\mathrm{d}\theta \;\approx\; \frac1S \sum_{s=1}^S L(\theta^{(s)}), \qquad \theta^{(s)} \sim \pi.$
The variance of this estimator may be huge. Most of the mass of the integral is associated with parameter settings that are typical under the posterior, and the posterior often occupies a small effective fraction of the prior's volume, so it can take a long time to obtain a representative sample. However, simple Monte Carlo does work on small problems, and provides a simple-to-implement check of more complex code. Starting at the prior will also form the basis of more advanced methods.
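As a concrete check, consider a toy conjugate model (a hypothetical example, not one from the thesis): prior $\theta \sim N(0,1)$ and a single observation $x = 1$ with likelihood $N(x; \theta, 1)$, so the true evidence is available in closed form as $Z = N(1; 0, 2)$:

```python
import math, random

def loglik(theta, x=1.0):
    # log N(x; theta, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2

def simple_mc_evidence(S, rng=random):
    # Z ~= (1/S) sum_s L(theta_s), with theta_s drawn from the N(0,1) prior.
    return sum(math.exp(loglik(rng.gauss(0.0, 1.0))) for _ in range(S)) / S
```

Here the prior and posterior overlap heavily, so the estimator is well behaved and converges to the exact value $e^{-1/4}/\sqrt{4\pi} \approx 0.2197$; in higher-dimensional or more peaked problems the same code would need an enormous $S$.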
Why not start at the posterior instead? Posterior samples are concentrated around the mass of the integral; also, practitioners already sample from the posterior for making predictions, so it would be convenient if these samples could also estimate $Z$. The obvious importance sampling estimator for $Z$ based on posterior samples is nonsensical, as it involves knowing $Z$. A moment of inspiration yields the following harmonic mean estimator, noted by Newton and Raftery:
  $\frac1Z \;=\; \int \frac{1}{L(\theta)}\,\frac{L(\theta)\,\pi(\theta)}{Z}\,\mathrm{d}\theta \;=\; \mathbb{E}_{p(\theta \mid D, M)}\!\left[\frac{1}{L(\theta)}\right] \quad\Longrightarrow\quad \hat Z \;=\; \left(\frac1S \sum_{s=1}^S \frac{1}{L(\theta^{(s)})}\right)^{-1}, \qquad \theta^{(s)} \sim p(\theta \mid D, M).$
As acknowledged by its creators, this estimator has the unfortunate property that the least probable points have small $L$ and so carry the largest weights; thus the effective sample size tends to be very small. Indeed, the estimator can easily have infinite variance.
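The estimator itself is a one-liner. On the toy conjugate model above (prior $N(0,1)$, one observation $x = 1$, posterior $N(0.5, 0.5)$) it does converge, but the weights $1/L$ are heavy-tailed, dominated by rare draws far from the data, which is a mild preview of the pathologies seen on real problems:

```python
import math, random

def harmonic_mean_evidence(S, rng=random):
    # Toy model: prior N(0,1), one observation x=1, likelihood N(1; theta, 1),
    # so the posterior is N(0.5, 0.5).  We cheat and sample it exactly; in
    # practice an MCMC posterior sample would be used.
    total = 0.0
    for _ in range(S):
        theta = rng.gauss(0.5, math.sqrt(0.5))
        inv_L = math.sqrt(2 * math.pi) * math.exp(0.5 * (1.0 - theta) ** 2)
        total += inv_L                      # weight 1/L(theta)
    return S / total                        # (mean of 1/L)^-1
```

Even here the largest few weights dominate the sum, so the estimate wanders far more than the prior-sampling estimate with the same number of draws.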
Various attempts have been made to fix the harmonic mean estimator, both within the original paper and in more recent research (Raftery et al.). The authors are keen to avoid sampling from the prior as well as the posterior. However, this goal seems misplaced. Firstly, implementing samplers for most priors is relatively simple, and allows several diagnostics to be performed: e.g. checking that prior assumptions look reasonable, checking against simple importance sampling, and Geweke's tests for "getting it right". Secondly, while some specific instances seem to work, there is a generic problem with posterior-only methods. As pointed out by Neal in the discussion of Newton and Raftery, the choice of prior has a strong influence on the value of $Z$: taking a broad prior to an improper limit will send $Z$ towards zero. However, in many inference problems the data are so influential that the statistics of the posterior are fairly insensitive to such changes in the prior. This demonstrates that posterior statistics alone bear little relation to $Z$ in many statistical problems: reliable estimates of $Z$ require finding the prior mass associated with the large likelihood values found under the posterior. In low-dimensional problems this is a feasible density estimation problem,
although deterministic approximations to $Z$ based on a direct fit to the posterior may be preferable in those cases. Chib notes that it is sufficient to know the posterior probability at just a single point: for any point $\theta^*$,
  $Z \;=\; \frac{L(\theta^*)\,\pi(\theta^*)}{p(\theta^* \mid D, M)}.$
Chib provides an estimator for $p(\theta^* \mid D, M)$ based on the probability of a Gibbs sampler transitioning to $\theta^*$ from each of the points visited by the sampler. Chib's method is relatively easy to apply, and should work well on unimodal problems; posteriors with more complex shapes will lead to subtle difficulties. In general it is unreasonable to expect the sequence of states visited by a sampler to densely cover the posterior in models with many parameters; if this were possible, it would be feasible to build a kernel density estimate on the basis of a preliminary run and perform importance sampling instead. Practitioners simply hope to get a representative sample of points that are useful for prediction; it is accepted that the sampler may not go near some typical points. As an analogy, consider conducting a survey of the world population: a sample of people might not include anyone from some cities, yet will be adequate for many statistical purposes. In contrast, Chib's method is likely to fail without fairly dense sampling: if none of the posterior samples can easily reach $\theta^*$ within one Gibbs-sampling iteration (if none are in its "city"), the estimator will be badly behaved. This won't occur if $\theta^*$ was chosen from the posterior samples, but this only hides the problem, at the expense of biasing the estimator.

Jumping directly to the posterior is fraught with problems: the normalizer $Z$ is a global quantity that depends on the entire posterior's relationship to the prior. If standard numerical methods apply, that is good news; but if the posterior is sufficiently complicated to justify MCMC, the place to start is probably a much simpler distribution, like the prior. On small problems, simple Monte Carlo can be more reliable than Chib's method (Neal). When solving larger problems, prior sampling doesn't work either: we need advanced methods that start with a feasible sampling
problem, but overcome the variance problems of simple importance sampling.

Bridging to the posterior

When samples from the prior rarely find regions of high posterior mass, a distribution closer to the posterior is needed for importance sampling to work. Designing, or adaptively finding, a tractable approximation to the posterior is one approach, although hard in general. Another solution is a divide-and-conquer approach: rather than finding $Z$ directly, a series of easier subproblems is solved first. Consider a family of distributions specified by an inverse temperature $\beta$:
  $p_\beta(\theta) \;=\; \frac{\pi(\theta)\,L(\theta)^\beta}{Z(\beta)} \;=\; \frac{\pi(\theta)\,e^{-\beta E(\theta)}}{Z(\beta)}, \qquad E(\theta) = -\log L(\theta).$
The prior has $\beta = 0$, while $p_1$ is the posterior. All of the distributions for $\beta > 0$ involve unknown normalizations, given by the partition function $Z(\beta)$. We will try to compute this function for a range of inverse temperatures $0 = \beta_0 < \beta_1 < \cdots < \beta_K = 1$. When adjacent distributions $p_{\beta_{k-1}}$ and $p_{\beta_k}$ are close, they can be compared with standard importance sampling, as reviewed earlier. Weights give the relative importance of states evaluated under the two distributions:
  $w_k(\theta) \;=\; L(\theta)^{\beta_k - \beta_{k-1}}.$
The expected value of these weights is the ratio of the two distributions' normalizers:
  $\frac{Z(\beta_k)}{Z(\beta_{k-1})} \;\approx\; \frac1S \sum_{s=1}^S w_k(\theta^{(s)}), \qquad \theta^{(s)} \sim p_{\beta_{k-1}}.$
These estimates can then be combined using
  $Z(\beta_K) \;=\; Z(\beta_0)\,\frac{Z(\beta_1)}{Z(\beta_0)}\,\frac{Z(\beta_2)}{Z(\beta_1)} \cdots \frac{Z(\beta_K)}{Z(\beta_{K-1})}.$
Numerical concerns suggest taking logs of a large positive product:
  $\log Z(\beta_K) \;=\; \log Z(\beta_0) + \sum_{k=1}^K \log \frac{Z(\beta_k)}{Z(\beta_{k-1})},$
where each term in the sum can be estimated from samples under $p_{\beta_{k-1}}$. This form is also convenient for error bars: the variance of an estimator for $\log Z$ is the sum of the variances of the estimators for the log of each ratio.
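On the toy conjugate model used earlier (prior $N(0,1)$, one observation $x = 1$), each $p_\beta$ is itself a Gaussian that can be sampled exactly, so the bridging estimate can be checked against the closed-form $\log Z = -\tfrac14 - \tfrac12 \log 4\pi$. With the exact samples replaced by Markov chain transitions, the same sketch becomes the skeleton of annealed importance sampling:

```python
import math, random

def log_lik(theta, x=1.0):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2

def bridged_log_Z(K=20, S=5000, rng=random):
    # log Z(1) = sum_k log mean_s L(theta_s)^(b_k - b_{k-1}),
    # with theta_s drawn from p_{b_{k-1}}: here exactly N(b/(1+b), 1/(1+b)).
    betas = [k / K for k in range(K + 1)]
    log_Z = 0.0   # log Z(0) = 0 for a normalized prior
    for bprev, b in zip(betas[:-1], betas[1:]):
        mu, var = bprev / (1 + bprev), 1.0 / (1 + bprev)
        w = [math.exp((b - bprev) * log_lik(rng.gauss(mu, math.sqrt(var))))
             for _ in range(S)]
        log_Z += math.log(sum(w) / S)
    return log_Z
```

With 20 closely spaced temperatures the per-level weights barely vary, so the product of ratios is far better behaved than a single prior-to-posterior importance sampler.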
Given this basic annealing approach, there are several implementation alternatives. Simulated tempering and parallel tempering, reviewed in an earlier section, both offer correlated samples from $p_\beta$ over a user-set range of $\beta$. Annealed importance sampling (AIS), reviewed earlier, samples from distributions which are only close to $p_\beta$, but a single run gives the same estimator as above for $S = 1$:
  $\log w \;=\; \sum_{k=1}^K (\beta_k - \beta_{k-1}) \log L(\theta_k),$
where $\theta_k$ is the sample drawn at the $k$th iteration of AIS, or a single sample from $p_{\beta_{k-1}}$. When $S > 1$, or multiple AIS runs are performed, the samples are combined differently. Independent importance samplers provide local gradient, or ratio, information at each temperature: the samples are first combined at each level, as above. AIS gives up the requirement for valid samples from each $p_\beta$ by running importance sampling on a global distribution; thus AIS dictates forming independent $Z$ estimates from each run and then averaging them. Generally the AIS estimator should be used, because it does not assume exact samples are available from each temperature. In some cases, applying the level-by-level combination to pseudo-samples from rapidly mixing Markov chains can work better than AIS, but this is difficult to justify.

Aside on the prior factorization. Some readers may be more familiar with canonical distributions defined by $p_\beta(x) \propto \exp(-\beta E(x))$, where the energy $E$ is defined in terms of the negative log of the whole unnormalized probability, rather than just the likelihood. For finite state spaces this corresponds to the family above with a uniform $\pi$. In an unbounded state space $p_0$ is an improper distribution; introducing a tractable normalized base measure ensures it is always possible to sample from $p_0$. The base measure need not be a prior distribution: another choice may be computationally convenient, or the problem may not be Bayesian and so have no prior. Later in the thesis a distribution is factorized in an unconventional way for algorithmic reasons. The term "prior" is used throughout this chapter to emphasize a common usage of these algorithms, and that $\pi$ should usually be a simple, tractable starting point.

Thermodynamic integration
above have been independently developed by various communities (Gelman and Meng). These different views can bring a better understanding to the problem. The statistical physics community views Z(beta) as a useful function: applying operators to the partition function often yields interesting, physically meaningful quantities. In particular, for the canonical distribution

    p_beta(theta) = exp(-beta E(theta)) / Z(beta),    Z(beta) = integral exp(-beta E(theta)) dtheta,

the log-normalizer is a moment generating function:

    d log Z / dbeta = (1/Z) dZ/dbeta = -(1/Z) integral E(theta) exp(-beta E(theta)) dtheta = -E_{p_beta}[E],
    d^2 log Z / dbeta^2 = E_{p_beta}[E^2] - (E_{p_beta}[E])^2 = var_{p_beta}[E].

Unlike Z itself, these expectations can reasonably be approximated for a given temperature using samples recorded under simulations at that temperature. Of course, log Z is related to its gradients by a trivial identity:

    log Z(beta_1) = log Z(beta_0) + integral from beta_0 to beta_1 of (d log Z / dbeta) dbeta
                  = log Z(beta_0) - integral from beta_0 to beta_1 of E_{p_beta}[E] dbeta.

The simplest discrete approximation to this integral uses measurements at a sequence of temperatures {beta_k} in order to estimate the normalizer at inverse temperature beta_K:

    log Z(beta_K) ~= log Z(beta_0) - sum_k (beta_k - beta_{k-1}) E_{p_{beta_k}}[E].

Comparing to the annealing estimator above, and remembering that E = -log L, we see that simple thermodynamic integration and a bridging sequence of importance samplers are the same approximation. Thermodynamic integration seems deceptively general; the difficulty is in choosing intermediate distributions. Appropriate temperature-based distributions may be difficult, and sometimes impossible, to find. This chapter explores how to choose a sequence of temperatures, and alternative methods that do not use annealing schedules.
Multicanonical sampling

The multicanonical ensemble was mentioned in an earlier section as a distribution that might allow better Markov chain exploration of the state space than the original target distribution. The multicanonical ensemble is available in terms of the target distribution p_t(theta) = p~_t(theta)/Z_t and a multicanonical weighting function w_mc(theta):

    p_mc(theta) = p~_mc(theta) / Z_mc,    p~_mc(theta) = p~_t(theta) w_mc(theta).

The multicanonical heuristic suggests finding a weighting such that a Markov chain exploring p_mc spends equal times at all energies, including those typical of the posterior and the prior. A weighting function giving approximately this behavior must be found from preliminary runs. Samples from p_mc can be used as an importance sampling proposal distribution as in an earlier subsection: weights w(theta) = p~(theta)/p~_mc(theta) give the normalizing-constant ratio between any distribution p = p~/Z_p and the multicanonical distribution,

    (1/S) sum_s w(theta^(s)) ~= Z_p / Z_mc,    theta^(s) ~ p_mc.

The multicanonical normalization Z_mc is generally unavailable, but can be eliminated by comparing to the base distribution:

    log Z_t ~= log[(1/S) sum_s p~_t(theta^(s)) / p~_mc(theta^(s))] - log[(1/S) sum_s pi(theta^(s)) / p~_mc(theta^(s))].

Instead of bridging carefully between the prior and posterior, the multicanonical ensemble has the ambitious goal of capturing both at once. One hopes that the broad coverage of p_mc gives estimators with reasonable variance for quantities relating to both p_t and pi. This is explored theoretically and experimentally in later sections.

Nested sampling

Nested sampling is a new Monte Carlo method
by Skilling, intended for general Bayesian computation, which bears some relation to earlier work found in McDonald and Singer. It is designed to be a general and robust alternative to annealing-based methods. Like annealing, it starts at the prior and samples from a sequence of distributions that become more constrained at each iteration. However, nested sampling makes no use of temperature, and does not require tuning of intermediate distributions or other large sets of parameters. It also provides a natural means to compute error bars on all of its results without needing multiple runs of the algorithm.

The key feature, and technical difficulty, of nested sampling is sampling from the prior subject to a series of lower bounds on the likelihood. The reward for attempting this novel challenge is an estimate of the prior probability mass associated with each observed likelihood value. This representation of the posterior provides an estimate of the normalization constant, and of any other property of the posterior. In addition to reviewing necessary material, the remainder of this chapter provides: an improved implementation of nested sampling for problems with degenerate likelihoods or discrete distributions; a brief study of a deterministic approximation of nested sampling's behavior, which has theoretical implications for its performance and practical benefits for some applications; a comparative analysis of the generic properties of annealed importance sampling, nested sampling and multicanonical sampling; and illustrative examples, using some simple continuous distributions and the Potts model, an undirected graphical model (incidentally, a new variant of Swendsen-Wang is derived). Some of this material was previously presented in Murray et al. (b).

A change of variables

The normalization of a posterior over variables theta is a weighted sum
of likelihoods L(theta) over elements of prior mass pi(theta) dtheta:

    Z = p(D|m) = integral p(D|theta, m) p(theta|m) dtheta = integral L(theta) pi(theta) dtheta.

We label each element of prior mass dx = pi(theta) dtheta. This is only a change in notation:

    Z = integral L(x) dx.

The integral could in principle be computed by accumulating elements in a standard raster order in the parameter space (figure a). Alternatively, we can add up the contributions from each scalar element of prior mass dx in any order we like, as illustrated in figure b. We choose to order the elements by their corresponding likelihood. As pi is a distribution, its elements sum to one and can be arranged along a unit interval:

    Z = integral from 0 to 1 of L(x) dx,    x(theta) = integral pi(theta') I[L(theta') > L(theta)] dtheta'.

For now, assume theta is continuous and that L provides a total ordering of the elements, so that x(theta) is an invertible change of variables. In a later subsection these assumptions will be relaxed so that the mapping will always be invertible.
Figure: two views of the integral Z = integral L(theta) pi(theta) dtheta for the posterior normalization. (a) Elements of the parameter space pi(theta) dtheta are associated with likelihood values (heights) L(theta); these elements' likelihood values are summed, weighted by their prior masses. (b) Bars with exactly the same set of heights are arranged in order along a scalar unit interval; three bars are colored to help illustrate the correspondence. In (a) the bars have a hyper-area base corresponding to an element in parameter space; in (b) they have a scalar width corresponding to an element of prior mass dx. The area of each bar is its weighted likelihood, so Z is the sum of these: the area under the
curve in (b).

The mapping x(theta) may seem strange, but is closely related to the familiar cumulative distribution function (CDF). For a one-dimensional density p(theta), c(theta) = integral up to theta of p(theta') dtheta'. This quantity is often considered very natural; some authors prefer to define distributions in terms of the CDF. Unlike a density, the CDF is invariant to monotonic changes of variable, and it directly returns the probability mass between two settings of the parameters. Also, its inverse allows the straightforward generation of samples from uniform random variates. Each element dc corresponds to an element of probability mass p(theta) dtheta; naturally these elements sum to one. Cumulative distribution functions for multivariate distributions are less frequently used in statistics: a sensible ordering of the elements of probability mass is less obvious, and the multivariate integrals involved may be difficult.

Comparing the two equations above, we see that x(theta) is just a cumulative distribution function of the prior, corresponding to a particular choice of ordering. For scalar parameters, the standard inverse CDF c^{-1}(x) gives the parameter theta such that fraction x of the prior mass is associated with parameters less than theta. For general parameters, the inverse of the likelihood-sorted CDF, x^{-1}, gives the parameter value such that fraction x of the prior mass is associated with likelihoods greater than L(theta). This is illustrated in figure (a). For multimodal likelihoods, the prior mass satisfying a likelihood constraint will be scattered over disconnected regions.

Figure (nested sampling illustrations, adapted from MacKay): (a) elements of parameter space (top) are sorted by likelihood and arranged on the x-axis; an eighth of the prior mass is inside the innermost likelihood contour in this figure. (b) A point theta^(s) is drawn from the prior inside the likelihood contour defined by x^(s); L^(s) is identified; the ordering on the x-axis and p(x^(s)) are known, but exact values of x^(s) are not known. (c) With N particles, the one with smallest likelihood defines the likelihood contour and is replaced by a new point inside the contour; L^(s) and p(x^(s)) are still known.

Computations in the new representation

Given the change of variables described
in the previous section, the normalizer is just the area under a monotonic one-dimensional curve, L against x. One-dimensional integrals are usually easy to approximate numerically. Assuming an oracle provided some points {x_s, L_s}, ordered such that x_{s+1} < x_s, we can obtain an estimate Z^ based on quadrature:

    Z^ = sum_s L_s w_s,    w_s = (x_{s-1} - x_{s+1}) / 2,

where the quadrature weights given correspond to the trapezoidal rule; rectangle rules can upper- and lower-bound the error Z^ - Z, as long as an upper bound on L is known. The boundary conditions of the sum require an arbitrary choice for the edge x-values, such as x_0 = x_1 and x_{S+1} = x_S. Sensitivity to such choices could always be checked; in our experience they do not matter. Quadrature is effectively approximating the posterior with a distribution over the S particles, each with probability proportional to L_s w_s. Samples from this distribution can approximate samples from the true posterior.
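As a concrete illustration (our own toy construction, assuming a standard normal prior and Gaussian likelihood L(theta) = exp(-tau theta^2 / 2)), the likelihood-sorted coordinate is x(theta) = erf(|theta| / sqrt(2)), the prior mass with likelihood greater than L(theta), and trapezoidal quadrature on the one-dimensional curve recovers the known normalizer Z = (1 + tau)^{-1/2}:

```python
from math import erf, exp, sqrt

# Toy construction: parametrize the curve L(x) by theta rather than
# inverting x(theta); x increases and L decreases along the grid.
tau = 9.0
thetas = [5.0 * k / 2000 for k in range(2001)]
xs = [erf(t / sqrt(2.0)) for t in thetas]        # x(theta), increasing
Ls = [exp(-tau * t * t / 2.0) for t in thetas]   # L(theta), decreasing

# Trapezoidal rule for Z = integral of L(x) dx over the unit interval.
Z = sum((Ls[k] + Ls[k + 1]) / 2.0 * (xs[k + 1] - xs[k]) for k in range(2000))
print(Z, (1.0 + tau) ** -0.5)
```

Here the mapping is available analytically, so the quadrature itself is the only approximation; in nested sampling proper, the x values would instead be known only through a probability distribution.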
The approximate distribution can also be used to directly approximate posterior expectations. The change of variables has removed details of the high-dimensional space, and made the integration problem apparently easy. Of course, the high-dimensional space is still in the original problem and will make the change of variables intractable; this is now the target for approximation.

Nested sampling algorithms

Nested sampling aims to provide a set of points {theta_s, L_s} and a probability distribution over their corresponding x(theta_s) = x_s values.
A simple algorithm to draw S such points is the single-particle algorithm below (see also figure b).

    Algorithm: single-particle nested sampling
        draw initial point theta_1 ~ pi(theta)
        for s = 2 to S:
            draw theta_s ~ p(theta | L_{s-1}),
            where p(theta | l) is proportional to pi(theta) if L(theta) > l, and is 0 otherwise

The first parameter setting is drawn from the prior, which implies that the corresponding cumulative prior quantity must have distribution p(x_1) = Uniform(0, 1). Similarly, p(x_s | x_{s-1}) = Uniform(0, x_{s-1}), as each point is drawn from the prior subject to L(theta_s) > L_{s-1}, i.e. x_s < x_{s-1}. This recursive relationship defines p({x_s}).

A simple generalization, the multiple-particle algorithm below, uses multiple particles: at each step, the one with smallest likelihood is replaced with a draw from a constrained prior (figure c).

    Algorithm: multiple-particle nested sampling
        initialize: draw N points theta^(1), ..., theta^(N) ~ pi(theta);  m = argmin_n L(theta^(n))
        for s = 1 to S:
            theta_s = theta^(m)
            redraw theta^(m) ~ p(theta | L(theta_s)), as in the single-particle algorithm
            m = argmin_n L(theta^(n))

The first parameter theta_1 is the setting with the smallest L from the initial N draws; this is the particle with the largest x. For this parameter to be at x_1, the other N - 1 points must have x < x_1, so

    p(x_1) = N x_1^{N-1}.

The extra factor of N comes from the invariance of the points to reordering, and is needed for correct normalization. Alternatively, it is immediately identified as a standard result from order statistics: the nth ordered point drawn from a distribution has its cumulative quantity
distributed according to Beta(n, N - n + 1), where our case corresponds to n = N (e.g. Balakrishnan and Clifford Cohen). After replacing a particle, there will be N samples uniformly distributed between 0 and x_{s-1}. The point with smallest L and largest x will be a fraction r = x_s / x_{s-1} through this range, distributed as

    p(r) = N r^{N-1}.

Changing variables gives

    p(x_s | x_{s-1}) = N x_s^{N-1} / x_{s-1}^N,    x_s < x_{s-1}.

This defines the joint distribution over the entire sequence:

    p({x_s} | N) = p(x_1 | N) prod_s p(x_s | x_{s-1}, N).

Note that p({x_s} | N) depends only on N: it is the same distribution regardless of the problem-dependent likelihood function. If we knew the x locations, we could combine these with the likelihood observations {L_s} and compute the estimate of the normalization Z^ given by the quadrature above. Instead we have a distribution over x, which gives a distribution over what this estimator would be:

    p(Z^ | {L_s}, N) = integral delta(Z^ - Z^(x)) p(x | N) dx.

We defer philosophizing over the precise meaning of this distribution to a later subsection. For now we assume that this distribution gives reasonable beliefs over the estimate that would be obtained by quadrature. As can be checked in a given application, the uncertainty of this distribution tends to be much larger than the differences between choices of quadrature scheme; we can therefore, somewhat loosely, take it to be a distribution over Z itself. Similarly, distributions over any other quantity, such as log Z or posterior expectations, can be obtained by averaging quadrature estimates over p(x | N).
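The pieces above can be combined into a short sketch. This is our own toy construction: simple rejection sampling stands in for the constrained-prior oracle (affordable only because the problem is one-dimensional and the run is short), and the deterministic log x_s ~= -s/N path discussed later in the chapter replaces a full average over p(x | N).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (illustrative): prior N(0,1), log L(theta) = -tau*theta^2/2,
# so the true evidence is Z = (1 + tau)^{-1/2}.
tau = 99.0
N, S = 100, 600                       # particles, iterations

def loglike(th):
    return -tau * th**2 / 2.0

theta = rng.normal(size=N)
L = loglike(theta)
logL_seq = np.empty(S)
for s in range(S):
    i = np.argmin(L)                  # worst particle defines the contour
    logL_seq[s] = L[i]
    while True:                       # prior draw subject to L(theta) > contour
        th = rng.normal()
        if loglike(th) > L[i]:
            break
    theta[i], L[i] = th, loglike(th)

# Deterministic approximation log x_s ~= -s/N; shell widths x_{s-1} - x_s.
x = np.exp(-np.arange(1, S + 1) / N)
widths = np.concatenate(([1.0], x[:-1])) - x
log_Z = np.log(np.sum(widths * np.exp(logL_seq)))

print(log_Z, -0.5 * np.log(1.0 + tau))
```

The estimate typically lands within a few tenths of the true log Z, consistent with the sqrt(s*)/N uncertainty discussed below; the small truncation error from stopping at S iterations is negligible here.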
To recap, the key ideas required to understand nested sampling are: it would be convenient if we could perform an (intractable) mapping from the original state space to a cumulative quantity x, as numerical computation of Z or posterior expectations would then only involve a one-dimensional function; samples from the prior subject to a nested sequence of constraints give a probabilistic realization of the mapping; and these samples give a distribution over the results of any numerical computation that could be performed given the change of variables. No algorithm can solve all problems, and some pathological integration problems will always be impossible; for nested sampling, the difficulty is in obtaining samples from the nested sequence of constrained priors.

Figure: the arithmetic and geometric means of x_s against iteration number s, for the algorithm with N particles. Error bars on the geometric mean show exp(-s/N +/- sqrt(s)/N); samples of p(x|N) are superimposed (s omitted for clarity).
MCMC approximations

The nested sampling algorithm assumes that obtaining samples from p(theta | L_s) is possible. This is somewhat analogous to thermodynamic integration's requirement for samples drawn from a sequence of temperature-based distributions. Annealed importance sampling offers a way to approximately sample from temperature-based distributions and still obtain unbiased estimates of Z. In contrast, nested sampling really needs exact samples from its intermediate distributions for its theoretical justification to be valid. Rejection sampling of the constrained distributions using pi would slow down exponentially with iteration number s. In general we do not know how to sample efficiently from the constrained distributions, so, despite the theoretical difficulties, we will replace the constrained draw step of the algorithm with an approximate sampler using a few steps of a Markov chain. In practice this often works well, but one must remember to be careful when interpreting error bars that ignore the approximation.

We must initialize the Markov chain for each new sample somewhere. One possibility is to start at the position of the deleted point theta_s on the contour constraint, which is independent of the other points and not far from the bulk of the required uniform distribution. However, if the Markov chain mixes slowly amongst modes, the new point starting at theta_s may be trapped in an insignificant mode. Experience suggests it is generally better to start at one of the other N - 1 existing points inside the contour constraint: these are all ideally draws from the correct distribution p(theta | L_s), and so represent modes fairly. Making this new point effectively independent of the point it cloned may take many Markov chain steps; how many to use is an unfortunate free parameter of MCMC-based nested sampling.

Integrating out x

To estimate quantities of
interest, we must average over p(x | N), as above. The mean of a distribution over log Z can be found by simple Monte Carlo estimation:

    E[log Z] ~= (1/T) sum_t log Z^(x^(t)),    x^(t) ~ p(x | N).

This scheme is easily implemented for any expectation under p(x | N), including error bars from the variance of log Z^. To reduce noise in comparisons between runs, it is advisable to reuse the same samples from p(x | N), e.g. by clamping the seed used to generate them.

A simple deterministic approximation of p(x | N) is useful for understanding, and also provides fast-to-compute, low-variance estimators. Dealing with x_s directly requires substantial work, but notice that after s iterations

    log x_s = sum over the first s steps of log t,

where the t's are drawn iid from p(t) = N t^{N-1}. As this is a sum of independent terms, the central limit theorem suggests that log x_s should be well characterized by its mean and standard deviation:

    log x_s ~= -s/N +/- sqrt(s)/N.

The figure above shows how well this description summarizes sequences sampled from p(x | N). Using the geometric path x_s ~= exp(-s/N) as a proxy for the unknown x is a cheap alternative to Monte Carlo averaging over settings of x (see the figure below). We might further assume that trapezoidal estimates of integrals are dominated by a number of trapezoids found around a particular iteration s*; the corresponding uncertainty in log Z will then be dominated by the uncertainty in log x_{s*} ~= -s*/N +/- sqrt(s*)/N. It usually takes extensive sampling to distinguish sqrt(s*)/N from the true standard deviation of the posterior over log Z.
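This central-limit description is easy to verify by direct simulation (our own check, not from the thesis): each shrinkage ratio is Beta(N, 1)-distributed, so log x_s should have mean -s/N and standard deviation sqrt(s)/N.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate trajectories of log x_s: each step multiplies x by t ~ Beta(N, 1),
# so log x_s is a sum of s iid terms with mean -1/N and std 1/N.
N, s, T = 20, 400, 20000
log_t = np.log(rng.beta(N, 1, size=(T, s)))   # T sampled trajectories
log_x_s = log_t.sum(axis=1)

print(log_x_s.mean(), -s / N)                 # both close to -20
print(log_x_s.std(), np.sqrt(s) / N)          # both close to 1
```

The agreement improves with s, as the central limit theorem suggests; for small s the distribution of log x_s is noticeably skewed.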
Degenerate likelihoods

The progress of nested sampling is supposed to be independent of the likelihood function; the procedures that estimate quantities based on nested sampling results rely on this behavior. As an example, the geometric mean path suggests that with N particles, nested sampling should typically get to a given x after about N log(1/x) iterations. Now consider running the nested sampling algorithm on a problem with a degenerate likelihood function: one containing a plateau, where a finite fraction of the prior mass carries exactly the same likelihood value.

Figure: histograms of errors in computing E_{p(x)}[log Z^] under three approximations for integrating over p(x|N) (random experiments shown). The test system was a hypercube with uniform prior centered on the origin. (a) Monte Carlo estimation using T sampled trajectories; (b) fewer sampled trajectories; (c) the deterministic approximation using the geometric mean trajectory. In this example the distribution p(log Z^) is wide enough that the errors in finding its mean in (b) and (c) are tolerable.

After drawing N particles from the prior, a corresponding fraction of them will share the plateau's likelihood value. If the inequality L(theta_s) > L_{s-1} is strictly enforced, then once the contour reaches the plateau all new particles must leave it: the plateau's entire prior mass is traversed almost immediately, x shrinks far faster than the assumed exponential rate, and the results are very wrong. If instead L(theta_s) >= L_{s-1} is enforced, the results will still be wrong: replacement particles can remain on the plateau, and x is not reached until far later than the assumed iteration.
The problem is that the mapping set up in the earlier subsection required a total ordering of all the dx elements of prior mass: it was assumed that the likelihood function L(theta) provided a unique key for this sort operation. In many applications this will be sufficient; as long as no two likelihood evaluations in the algorithm are numerically identical there is no problem, but degenerate likelihoods like the one above require special treatment. Completely flat plateaus in likelihood functions of continuous variables are rare, but when theta is discrete, finite elements of prior mass carry the same likelihood; this will always introduce plateaus in L(x).

Solving the degeneracy problem requires imposing a total ordering on the dx elements. Earlier presentations of nested sampling (Skilling) suggest introducing an auxiliary labeling of the states that resolves likelihood ties. It was suggested that these labels could be chosen arbitrarily: random samples from Uniform(0,1) suffice, as would a cryptographic identification key derived from theta, or almost anything else. However, a fixed key does not solve the problem with discrete parameters: if the plateau in the example above originated from a single discrete setting, then all the dx elements in this range would receive the same cryptographic key, and likelihood ties would still be unresolved. If random labels are employed, and always regenerated on repeated observations of the same theta, then nested sampling will give the correct results; this is how the code Skilling has made available is implemented. Skilling mentions a better way of thinking about the random labels: instead of an extra piece of information attached to each parameter setting, u is an extra input variable for an extended problem, with a likelihood that depends on both theta and u. The distinction may seem fine; on large problems the same location will rarely be revisited, but getting this right is important for obtaining correct answers on small test problems.
The joint distribution view also allows a rejectionless algorithm for the Potts example in a later subsection, which was tersely introduced in Murray et al. (b). The auxiliary joint distribution is over the variables of interest, theta, and an independent variable u:

    p(theta, u) = p(theta) p(u),    u ~ Uniform(0, 1),

with a joint likelihood L_J(theta, u) that increases with u for fixed theta, and a correspondingly defined normalizer Z_J close to Z. Standard nested sampling is now possible; at each iteration L_J must increase. We can choose the scale of the u-dependence such that differences in log L_J due to u are smaller than the smallest difference in log L allowed by machine precision. This ensures the auxiliary variable will only matter when L(theta) = L_s: the constrained-prior distribution becomes

    p(theta, u | L_s, u_s) proportional to pi(theta) p(u) I[L(theta) > L_s, or (L(theta) = L_s and u > u_s)].

MCMC-based implementations of nested sampling require transition operators that leave this distribution stationary. The simplest method is to propose with an operator that leaves the original constrained prior stationary, then generate u' ~ Uniform(0, 1), rejecting the proposal if the resulting pair violates the constraint. A rejectionless method would be preferable, and is sometimes possible by marginalizing out u:

    M(theta) proportional to pi(theta) * { 1 if L(theta) > L_s;  (1 - u_s) if L(theta) = L_s;  0 otherwise }.

If the operator used for theta was slice sampling (see the earlier subsection), then M can also be slice sampled directly; similarly, a Gibbs sampler could easily be adapted to use the conditionals of M instead of pi. If Metropolis-Hastings proposals are used, it will sometimes be possible to reweight the proposal distribution q to reduce or eliminate rejections; an example of such a transition operator is developed in a later subsection. (Skilling's code is available from http://www.inference.phy.cam.ac.uk/bayesys.) After generating theta_s, the auxiliary variable can be drawn from its conditional distribution:

    u ~ Uniform(u_s, 1) if L(theta) = L_s;    u ~ Uniform(0, 1) if L(theta) > L_s.

As a minor refinement, this could be done lazily, if and when it is required by the next iteration.
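A minimal sketch of the tie-breaking scheme (our own construction: a discrete toy problem with a likelihood plateau, fresh uniform labels regenerated on every draw, and the (L, u) ordering applied lexicographically):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy problem (illustrative): M equally likely states; L = 10 on a fifth of
# them and L = 1 elsewhere, so the true evidence is Z = 2.8 and L(x) has
# plateaus.  Ties are resolved by the auxiliary u ~ Uniform(0,1).
M = 1000
L_of = np.where(np.arange(M) < M // 5, 10.0, 1.0)
Z_true = L_of.mean()

N, S = 50, 300
states = rng.integers(0, M, size=N)
u = rng.random(N)
Ls = L_of[states]

logL_seq = np.empty(S)
for s in range(S):
    i = np.lexsort((u, Ls))[0]            # worst particle under (L, u) ordering
    logL_seq[s] = np.log(Ls[i])
    while True:                           # prior draw subject to (L, u) > (L_i, u_i)
        st, uu = rng.integers(0, M), rng.random()
        if L_of[st] > Ls[i] or (L_of[st] == Ls[i] and uu > u[i]):
            break
    states[i], u[i], Ls[i] = st, uu, L_of[st]

x = np.exp(-np.arange(1, S + 1) / N)      # deterministic log x_s ~= -s/N
widths = np.concatenate(([1.0], x[:-1])) - x
log_Z = np.log(np.sum(widths * np.exp(logL_seq)))

print(log_Z, np.log(Z_true))
```

Without the auxiliary labels, a strict inequality on L alone would skip the plateau's mass in one step; with them, the shrinkage of x behaves exactly as in the non-degenerate case.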
Efficiency of the algorithms
This section attempts to compare the computational efficiency of three significant algorithms for estimating normalizing constants. General comparisons of Monte Carlo algorithms are hard to make, as they are designed for use with problems where ground truth is unavailable. Also, some algorithms are targeted at a particular class of problem, making performance comparisons on other classes unfair. All three normalizing-constant methods considered below are designed to deal with problems in which the base measure is much more diffuse than the compact target distribution p_t; we look at the scaling of computational cost with the nature of this disparity. The behavior of a multivariate Gaussian toy problem is worked out for each algorithm. The base distribution pi is vaguely distributed, with unit precision:

    pi(theta) = exp(-sum_d theta_d^2 / 2) / Z_pi.

The target distribution p_t has higher precision tau:

    p_t(theta) = pi(theta) L(theta) / Z_t,    L(theta) = exp(-(tau - 1) sum_d theta_d^2 / 2).

We pretend that the normalization constants of the Gaussians are unknown, and require all of the algorithms under study to return an approximation to the log of the ratio:

    log(Z_t / Z_pi) = -(D/2) log tau.

Many continuous distributions have locally Gaussian peaks, so poor performance on the toy problem would be a general cause for concern. We also make some general remarks relevant to non-Gaussian modes. Despite this, the theoretical comparison presented here is necessarily limited: unreasonable assumptions are made regarding the algorithms' ability to draw samples from intermediate distributions. Obviously this does not test the ability of the algorithms to find, and then fairly represent, isolated modes; this will be probed in the empirical comparisons in the sections that follow. For now the analysis is confined to behavior that does not require details of the particular Markov chains used in implementations.

Nested sampling

The scaling behavior of nested
sampling can be found without detailed calculation. The algorithm is characterized by an exponential shrinkage of the prior mass under consideration: if the bulk of the posterior mass is found around x*, then an N-particle nested sampling run reaches the dominant terms in the integral after s* ~= N log(1/x*) iterations. The earlier subsection explained that the uncertainty in log Z is typically sqrt(s*)/N; obtaining a unit uncertainty in log Z would require setting the number of particles to N ~= sqrt(s*). For this choice, the number of iterations required to reach the target distribution's mass from the prior is s* ~= (log(1/x*))^2.

Most of a high-dimensional Gaussian's mass is within a hyperspherical shell at a radius proportional to sqrt(D). This means that the bulk of the target distribution is in a shell enclosing a fraction x* ~= tau^{-D/2} of the prior volume. If the prior were uniform, we could immediately say that the bulk of the target distribution is found around x* ~= tau^{-D/2}; it turns out that this is a good enough description for Gaussians too: log(1/x*) ~= (D/2) log tau. Therefore the s* ~= (log(1/x*))^2 computational cost of nested sampling on the toy problem is

    number of samples for unit error:    s* ~= ((D/2) log tau)^2.

The O(D^2) scaling revealed by this analysis is somewhat disappointing: if each sample costs O(D) to produce, then the total cost for a unit error is O(D^3). If one knew that each dimension were independent, one could estimate the log-normalizer of each Gaussian separately; making the D estimates have small enough variance would require O(D (log tau)^2) samples each, but samples of just one variable only cost O(1) computation, which gives a total cost of O(D^2) rather than O(D^3). The reason to resort to Monte Carlo is that real computations do not decompose this way, but we shall see that, without detailed knowledge of the distribution, annealing needs only O(D) iterations. The advantages of nested sampling do come at a cost.

[Footnote: according to the cumulative distribution function, the log of the cumulative prior mass inside a radius sqrt(D/tau) involves an incomplete gamma function, which numerically matches the scaling of -(D/2) log tau to within a small factor over a large range of tau and D.]
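The footnote's claim is easy to check numerically. The sketch below is our own, and uses the standard series for the regularized lower incomplete gamma function rather than any particular library routine:

```python
from math import lgamma, log

def log_chi2_cdf(D, y):
    """log P(chi^2_D <= y) via the series for the regularized lower
    incomplete gamma function P(a, x), valid for x < a + 1."""
    a, x = D / 2.0, y / 2.0
    total, term, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= x / (a + k)
        total += term
    return a * log(x) - x - lgamma(a + 1.0) + log(total)

# Compare the log prior mass inside radius sqrt(D/tau), which holds the bulk
# of the target, with the simple -(D/2) log(tau) volume-ratio description.
ratios = []
for D in (10, 30, 50):
    for tau in (10.0, 100.0, 1000.0):
        r = log_chi2_cdf(D, D / tau) / (-(D / 2.0) * log(tau))
        ratios.append(r)
        print(D, tau, round(r, 3))
```

Over this range the ratio stays within a modest factor of one, closest to one for large tau, consistent with the claim in the footnote.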
Multicanonical sampling

As there are many possible algorithms for constructing an approximate multicanonical ensemble, a general comparison to the method is hard. Instead we generously assume that we have obtained, by chance, the ideal function that reweights the target distribution such that the distribution over energy is uniform. This avoids having to consider the details of histogram-building methods or other energy-density approximations. The danger is that this section's results may be severely misleading if the dominant computational cost is actually obtaining the reweighting; the analysis here should be seen as a lower bound on the costs involved, and could be useful for indicating which method is best for refining initial results, however they were obtained.
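The idealized assumption can be made concrete on a small, enumerable problem. The sketch below (our own construction; the toy model is illustrative) builds the exact energy-flattening weights by enumeration, samples the resulting ensemble, and eliminates Z_mc by taking the difference of the two importance-sampling ratios from the earlier equations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete problem: uniform prior over M states and a Gaussian-bump
# likelihood.  We build the *ideal* multicanonical weighting by enumeration,
# exactly the generous assumption made in the text.
M = 1001
prior = np.full(M, 1.0 / M)
energy = 0.5 * ((np.arange(M) - 500.0) / 50.0) ** 2   # E = -log L
target_unnorm = prior * np.exp(-energy)
Z_true = target_unnorm.sum()

# Flatten the energy marginal: give every occupied unit-width energy bin
# equal mass under p_mc by weighting with 1 / (target mass in that bin).
bins = np.floor(energy).astype(int)
bin_mass = np.bincount(bins, weights=target_unnorm)
w_mc = 1.0 / bin_mass[bins]
p_mc_unnorm = target_unnorm * w_mc
p_mc = p_mc_unnorm / p_mc_unnorm.sum()

S = 200_000
s = rng.choice(M, size=S, p=p_mc)     # exact samples from the ideal ensemble

# Two importance-sampling ratios; the unknown Z_mc cancels in the difference.
ratio_target = np.mean(target_unnorm[s] / p_mc_unnorm[s])  # -> Z_t / Z_mc
ratio_prior = np.mean(prior[s] / p_mc_unnorm[s])           # -> 1 / Z_mc
log_Zt = np.log(ratio_target) - np.log(ratio_prior)

print(log_Zt, np.log(Z_true))
```

Because the ensemble spreads its mass evenly across all energies, the same sample set serves both the target- and prior-facing ratios, which is precisely the "capture both at once" ambition described earlier.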
By definition, the multicanonical ensemble has a uniform marginal distribution over energy. To be defined, this distribution must be confined to some finite range, of width H say, so p_mc(E) = 1/H. This ensemble could be used to compute the normalizing constant of a simple target distribution p_s which also has a uniform energy distribution, but over an energy range with smaller width h. The importance weights are a constant with probability h/H, and zero otherwise. This distribution gives

    var[log Z^_{p_s}] ~= H / (S h)

after S effective samples. Empirically, the estimator still exhibits quite heavy tails after the S ~= H/h samples predicted for unit error, but the expression gives good error bars for larger S. Finding effective widths corresponding to h and H will provide a rule-of-thumb estimate of multicanonical's variance on more complex distributions. We now derive properties of the multicanonical
ensemble for the Gaussian problem. Under the target multivariate Gaussian, the energy E = -log L is proportional to r^2 = sum_d theta_d^2. The square radius r^2 follows a scaled chi-squared distribution with D degrees of freedom:

    p(r^2) proportional to (r^2)^{D/2 - 1} exp(-tau r^2 / 2).

Weighting by the reciprocal of p(r^2) makes the distribution over r^2, and hence the energy, uniform, giving the multicanonical ensemble

    p_mc(theta) proportional to p_t(theta) w_mc(theta) proportional to (r^2)^{1 - D/2},    r^2 < r^2_H.

A maximum energy H must be imposed for the distribution to be defined. A reasonable choice for the square radius defining this limit is somewhat larger than the prior's mean value, E[r^2] = D; this cutoff excludes little of the prior mass from the ensemble. The target distribution is softly confined to a narrow range of the energy spectrum, with an effective width likely to be related to the standard deviation of a chi-squared distribution: r^2_{h_eff} is of order sqrt(D)/tau. While the exact variance of the estimator does not have a simple analytical form, a more detailed analysis shows that the guess H/h_eff ~= r^2_H / r^2_{h_eff} is basically correct:

    var[log Z^_t] ~ tau sqrt(D) / S.

Astoundingly, the number of effective samples required to obtain a unit error scales only as O(sqrt(D)). This is better than possible by a method that exploited independence and estimated the normalizer of each of the D Gaussians separately. Perhaps this is a sign that, through the weighting function, the method has unreasonable a priori knowledge of the answer. Despite the ideal weighting function, the scaling with precision ratio tau is not impressive: as the target distribution's precision grows, the multicanonical sampler becomes exponentially worse than nested sampling (compare the equations above). This is a clear indication that the energy spectrum ratio H/h_eff can be quite different
from the volume ratio that determines nested sampling's performance: the details of a particular target distribution can make either method dramatically better than the other.

Another concern is the cost of each of the S effective samples. The ensemble p_mc is a distribution capturing a large energy range, which will usually take many steps of MCMC exploration to reach equilibrium. In contrast, the intermediate distributions of nested sampling or annealing were designed to be close, so a small number of MCMC steps may be sufficient to equilibrate states at each iteration.

[Footnote: the variance of the two parts of the multicanonical estimator can be approximated via var[log(sum_s w^(s) / S)] ~= var[w] / (S E[w]^2). The ratio of the target distribution and the multicanonical distribution gives importance weights w proportional to (r^2)^{D/2 - 1} exp(-tau r^2 / 2); evaluating the integrals of w and w^2 (for suitably large r^2_H these give approximately r^2_H times the required expectations), applying Stirling's approximation, substituting in r^2_H, assuming large D, and using the approximate independence of the two log estimators when tau is large, gives the tau sqrt(D) / S variance quoted in the main text, with further terms dropped for clarity. This footnote is terse because what matters is the simpler H/h_eff description in the main text, and the following assertion: empirically, the resulting expression reasonably matches the observed estimator variance for large tau in a few or more dimensions.]

The log-probability of
p_mc has a range of E_mc ~= (D/2) log tau between the typical sets of the target and prior distributions. If O(1) iterations of MCMC are required to change this energy by O(1), then it will take at least O(E_mc^2) iterations to equilibrate the multicanonical ensemble by a random walk. This would make the cost of obtaining a single effective sample from the multicanonical ensemble the same order as the total cost of an entire nested sampling run. The potentially good performance of the multicanonical method can be wiped out by the use of some standard Markov chains.

Importance sampling

We now analyze the
performance of combining a bridge of K importance sampling estimates, as in the earlier section, for the toy Gaussian problem. All the intermediate distributions are also Gaussian:

    p_k(theta) = pi(theta) L(theta)^{beta_k} / Z_k proportional to exp(-lambda_k sum_d theta_d^2 / 2),    lambda_k = 1 + beta_k (tau - 1),

so the precisions run from lambda_0 = 1 to lambda_K = tau. This allows computation of the variance of the estimator under the ideal case of direct sampling from each p_k. The log of the importance weights at each level becomes, up to constants,

    log w_k = -(lambda_k - lambda_{k-1}) sum_d theta_d^2 / 2,

and the variance of the log weights under p_{k-1} is

    var_{p_{k-1}}[log w_k] = (lambda_k - lambda_{k-1})^2 var[sum_d theta_d^2] / 4 = (D/2) (lambda_k / lambda_{k-1} - 1)^2:

the variance of log w_k only depends on k through the ratio lambda_k / lambda_{k-1}. The variance of log z^ = sum_k log w_k is minimized by equalizing all the contributing variances, i.e. by a geometric spacing of the precisions. The resulting estimator has variance

    var[log z^] = (K D / 2) (tau^{1/K} - 1)^2 ~= D (log tau)^2 / 2K    for large K;

the approximation becomes accurate very quickly. The amount of computation K required for unit variance scales with the dimensionality of the problem as O(D):

    number of samples for unit error:    K ~= (D/2) (log tau)^2.
to equation we see that the dependence of the computational cost is the same as
nested sampling and that the scaling with dimensionality is better this result
is not strongly tied to the tail behavior of a gaussian redoing the calculations
for distributions p exp slightly to dplog t p d d only changes the number of sa
mples required before hastily discarding nested sampling remember the caveats at
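The idealised analysis above can be checked numerically. The sketch below assumes direct, exact sampling at every level, as in the analysis; the dimensionality, width ratio, schedule length and sample counts are arbitrary illustrative values. It estimates log(Z_K / Z_0) for the toy Gaussian bridge with a geometric schedule, to be compared with the exact value -D log T.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10          # dimensionality
T = 100.0       # ratio of base to target standard deviation
K = 1000        # number of bridging levels
N = 10          # direct samples drawn at each level

# Geometric schedule of standard deviations from the broad base (T) down to the target (1).
sigmas = T * (1.0 / T) ** (np.arange(K + 1) / K)

log_Z = 0.0
for k in range(K):
    x = sigmas[k] * rng.standard_normal((N, D))   # exact draws from p_k
    # log importance weight between unnormalised Gaussians of widths sigma_k, sigma_{k+1}
    log_w = 0.5 * (1.0 / sigmas[k] ** 2 - 1.0 / sigmas[k + 1] ** 2) * (x ** 2).sum(axis=1)
    # log of the mean of the N weights at this level
    log_Z += np.logaddexp.reduce(log_w) - np.log(N)

print(log_Z, -D * np.log(T))   # estimate vs true log(Z_K / Z_0)
```

With these settings the predicted standard deviation of log Z_hat is roughly sqrt(2 D (log T)^2 / (K N)) ~= 0.2, so the estimate should sit close to the exact answer.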
Before hastily discarding nested sampling, remember the caveats at the beginning of this section: practical realities of Markov chain based samplers will make performance worse than predicted here. The Potts model subsection will be an example of a distribution which, while easily sampled by multicanonical and nested sampling, is totally unapproachable with temperature-based annealing methods. As with multicanonical sampling, we were very generous to annealing by assuming the free parameters defining the annealing schedule were chosen optimally; we were unlikely to guess the geometric schedule beta_{k+1}/beta_k = T^{2/K}. Indeed, the optimal schedule balances variances which are intimately related to Z itself: we cannot know the optimal schedule without already having solved the problem. Some work is required to find a good schedule; default choices may not be sufficient. For example, the linear annealing schedule beta_k = k/K gives a variance of log Z_hat that grows with T rather than with log T: for large T this is exponentially worse than the correct schedule, and than nested sampling.

Constructing annealing schedules

Various families of annealing schedule have been used in the literature. While demonstrating AIS on some example distributions, Neal spliced together a linear schedule at high temperatures with a geometric schedule at lower temperatures. This is not too far from the behavior of the optimal schedule
for some well-behaved posteriors (found by a large-K integral approximation to the discrete sum over the levels, and also checked numerically). Beal suggested a family of nonlinear annealing schedules controlled by a linearity parameter e: the schedule is approximately linear for large values of e, but dwells for longer at high temperatures for small positive values. Another ad hoc nonlinear scheme is given by Kuss and Rasmussen, where a linear schedule is raised to the power of four, beta_k = (k/K)^4. Each of the above schemes was presumably the result of some experimentation. Preliminary runs are required to check that a proposed schedule (e.g. the fourth power of a linear schedule) is adequate, and to eliminate others. Also, any free parameters must be found, such as the number of levels, the linearity setting e, or the changepoint between a linear and a geometric schedule. Algorithms for these fitting procedures could be, although rarely are, described in detail. Part of the difficulty is deciding how to include results from earlier, higher-variance runs; often these are only run informally and simply discarded. Regardless of how an annealing schedule is chosen, the selection should ideally be performed automatically; otherwise computer time may become irrelevant compared to the time required by manual adjustments.
An annealing schedule should control the variance of the log weights,

  var_{beta_k}[log w_k] = (beta_{k+1} - beta_k)^2 var_{beta_k}[log L].

Rearranging gives a recurrence relationship for the inverse temperatures,

  beta_{k+1} = beta_k + sqrt( v / var_{beta_k}[log L] ),

which achieves a given target variance v = var[log w] on each weight. This update rule is applied from beta = 0 until an inverse temperature of 1 is reached. The variance of the log Z estimator is then K v. A binary search on the control parameter v can find a balanced annealing schedule with a user-chosen total variance or computational cost.

Implementing this algorithm for constructing annealing schedules requires the variance of the log-likelihood (or energy) at each beta considered. These quantities are not available exactly (otherwise log Z would be known), so the variances must be approximated from some preliminary experiments. Running MCMC at each beta considered would be too costly; one could try to interpolate results measured at a preliminary range of temperatures. Alternatively, statistics at any temperature are available from a single run of either multicanonical or nested sampling. We propose using nested sampling to set a schedule for AIS; the details are given in the algorithm below. Sampling from the multicanonical ensemble would itself need a large set of control parameters to be set, whereas nested sampling can be run before annealing, as it only has two control parameters: n, and the amount of effort to put into sampling at each iteration. Annealed importance sampling can then be run with a schedule estimated from nested sampling. Comparing the annealing results to those predicted by nested sampling could uncover problems with
Markov chains that would go unnoticed using a single method.

Algorithm: construction of an annealing schedule from nested sampling
  Inputs: total target variance V_target; number of nested-sampling particles n; numerical tolerance tol
  Run the nested sampling algorithm, obtaining likelihoods L_s, and assume x_s = exp(-s/n)
  Compute weights w_s from the x_s
  v_max := V_target;  v_min := 0
  repeat
      v_trial := (v_max + v_min) / 2;  beta_0 := 0;  k := 0
      repeat
          create the discrete proxy for the p_beta distribution:  p_{beta_k}(s) proportional to w_s L_s^{beta_k}
          beta_{k+1} := beta_k + sqrt( v_trial / var_{p_{beta_k}}[log L] );  k := k + 1
      until beta_k >= 1  (passed the end of the schedule)
            or  k * v_trial > V_target  (exceeded the target variance)
      if passed the end of the schedule then
          start again with a higher variance per level:  v_min := v_trial
      else
          start again with a lower variance per level:  v_max := v_trial
      end if
  until | K v_trial - V_target | < tol
  return annealing schedule beta_1, ..., beta_K
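A sketch of this construction, using the deterministic x_s = exp(-s/n) proxy weights and a synthetic stand-in for a nested sampling run (the log-likelihood values below are invented for illustration). For simplicity the bisection runs for a fixed number of iterations rather than to a numerical tolerance:

```python
import numpy as np

def weighted_var(a, w):
    """Variance of a under (unnormalised) weights w."""
    w = w / w.sum()
    m = (w * a).sum()
    return (w * (a - m) ** 2).sum()

def build_schedule(log_L, log_w, v):
    """Recurrence beta_{k+1} = beta_k + sqrt(v / var_beta[log L]), with the
    variances taken under the discrete proxy p_beta(s) proportional to w_s L_s^beta."""
    betas = [0.0]
    while betas[-1] < 1.0:
        lw = log_w + betas[-1] * log_L
        p = np.exp(lw - lw.max())
        var = weighted_var(log_L, p)
        betas.append(betas[-1] + np.sqrt(v / max(var, 1e-300)))
    betas[-1] = 1.0                        # clip the final overshoot
    return np.array(betas)

def schedule_from_nested_sampling(log_L, n, V_target):
    """Bisection on the per-level variance v so that K * v is close to V_target."""
    s = np.arange(1, len(log_L) + 1)
    log_w = -s / n                         # deterministic x_s = exp(-s/n) proxy weights
    v_lo, v_hi = 0.0, V_target
    for _ in range(40):
        v = 0.5 * (v_lo + v_hi)
        betas = build_schedule(log_L, log_w, v)
        if (len(betas) - 1) * v > V_target:
            v_hi = v                       # exceeded the target total variance
        else:
            v_lo = v                       # can afford a higher variance per level
    return betas

# A hypothetical nested sampling run: log-likelihoods rising towards the mode.
rng = np.random.default_rng(1)
log_L = np.sort(rng.normal(-50.0, 15.0, size=500))
betas = schedule_from_nested_sampling(log_L, n=20, V_target=1.0)
print(len(betas) - 1, "levels")
```

The returned beta values increase from 0 to 1, with small steps wherever the proxy variance of log L is large, which is the defining property of a variance-balanced schedule.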
Markov chains for normalizing constants

Each of the Monte Carlo algorithms in this chapter requires sampling from complex distributions, p_beta or p_mc. Standard sampling techniques (Metropolis-Hastings, slice sampling, etc.) should apply, but there are some issues special to these algorithms that are worth considering.

Randomize operator orderings

Many MCMC operators concatenate several operators with different behaviors together; Gibbs sampling, for example, updates each dimension of a parameter vector separately. Some users prefer to randomize the order of these updates, so that the resulting mixture operator maintains detailed balance, but many use a deterministic ordering, if only for convenience of implementation. This is not a good idea with algorithms that start out of equilibrium at each iteration. The problem is most easily demonstrated by nested sampling with n = 1: at each iteration, the only particle is by definition on the boundary of the constrained prior. The first update must increase the likelihood of the particle; subsequent updates have some freedom to decrease the likelihood again. As only a limited number of Markov chain steps can be performed at each iteration, the particle will climb unnaturally fast up the likelihood surface in the direction of the first transition operator. In the experiments below, nested sampling is run using a univariate slice sampler applied to each variable in a random order. Initially these experiments used a fixed ordering: the first variable to be updated would systematically become much more constrained than the last, even if by symmetry they were equivalent. Fortunately this pathology is so severe that it quickly made itself known, by causing numerical problems and crashing the slice sampling code. Experiments with spherical Gaussians confirm that annealed importance sampling suffers from a similar problem: histograms of the unweighted final states obtained show that the statistics of each dimension depend on the order in which they were updated.
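The mechanism for removing this order dependence is simply to draw a fresh permutation of the component operators at each sweep. A minimal sketch, with dummy operators that only record which coordinate led each sweep (the operators and counts are purely illustrative):

```python
import random

def sweep(state, operators, rng, randomize=True):
    """Apply a list of single-coordinate update operators, optionally in random order."""
    order = list(range(len(operators)))
    if randomize:
        rng.shuffle(order)          # fresh permutation each sweep
    for i in order:
        state = operators[i](state)
    return state

# Toy operators that just record which coordinate was touched first in a sweep.
first_touched = []
def make_op(i):
    def op(state):
        if not state:               # empty state means this operator ran first
            first_touched.append(i)
        return state + [i]
    return op

rng = random.Random(0)
ops = [make_op(i) for i in range(4)]
for _ in range(200):
    sweep([], ops, rng, randomize=True)

# With randomization, every operator gets to go first sometimes.
print(sorted(set(first_touched)))
```

With a fixed ordering, operator 0 would lead every sweep, which is precisely the systematic asymmetry observed in the experiments above.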
The AIS weights correct this bias asymptotically, but samples without these artifacts will tend to need lower-variance weights. The easiest fix is to randomly permute the Markov chain operators at each iteration.

Changes in lengthscale and energy

There is usually a dramatic difference in scale between a prior and a posterior; it is unlikely that the same Markov chain operators are appropriate for both, yet annealing has to sample them and all the interpolating distributions in between. Similarly, nested sampling has to sample from a sequence of distributions that shrink exponentially in volume, from the prior to the posterior and beyond. Step size parameters in Metropolis-Hastings algorithms must be changed as the algorithm proceeds; it is also profitable to adapt the initial step size of slice sampling. Some authors set a schedule of step sizes by hand, but automatic schemes are clearly desirable. One option is to adapt based on the acceptance rate of the previous iteration, or to use parallel chains; for AIS this would require giving up some theoretical correctness, while nested sampling is already in an approximate setting by using a Markov chain at all. Another option for AIS is to adapt the schedule of proposals after each run: each run is unbiased in Z using any step size, but when adapting it is still advisable to discard early runs, which will have higher variance.
There is usually a large change in the scale of probabilities involved between a diffuse prior and a posterior, which concentrates mass on a much smaller number of states. Many standard Markov chain operators, such as simple Metropolis and slice sampling, are only able to make small changes in log-probability at each iteration. Depending on the algorithm, this may or may not be desirable. The majority of the volume in a high-dimensional solid body is in a thin shell near its surface (MacKay). For nested sampling this means that much of the constrained prior's mass is likely to be close to the likelihood contour surface, and large changes in likelihood are not required; instead we need efficient chains that sample well at close to constant likelihood. Temperature-based distributions have soft constraints that lead to broader distributions over energy, although in many problems they are still constrained to a relatively narrow range. The multicanonical method samples from a single distribution which has significant overlap with both the prior and posterior. Making the distribution over energies uniform requires making some states much more probable than others under p_mc. Simple Metropolis methods are unable to move rapidly between regions of many states with low probability and more compact regions with high probability: the exploration of p_mc's energy spectrum will be characterized by a random walk, or a slower process. This suggests that equilibrating p_mc will need at least Delta_E_mc^2 steps, where Delta_E_mc is the range of log-probabilities under p_mc, not the original distribution.

Some Monte Carlo algorithms, such as hybrid Monte Carlo, are able to make larger changes in energy. Hamiltonian dynamics based on p_mc could be simulated as long as the reweighting function is smooth. Nested sampling could also benefit from Hamiltonian dynamics for its large movements in state-space, although it is not compatible with large changes in energy. Fortunately, versions of slice sampling that can use Hamiltonian dynamics on the prior, and reflections from constraint boundaries, have already been developed (Neal). Another important Markov chain operator for dramatic moves in state-space and energy is Swendsen-Wang. While this algorithm can work at any temperature, it does not allow reweightings of the energy, and is not easily modified to sample at near-constant energy. By recasting the problem we can develop a version of Swendsen-Wang that will work with multicanonical
and nested sampling.

A new version of Swendsen-Wang

The partition functions of the Potts model, the random cluster model and the FKSW joint distribution are identical; also, a sample from any of these distributions is easily converted into a sample from one of the others. (For literature searches, it is helpful to know that in physics a constant-energy distribution is known as a microcanonical ensemble.) This allows using any of the distributions to simulate the Potts model and find its normalization Z_P(J, q). We focus on the random cluster model. Assuming identical positive couplings J on each edge, we rewrite the random cluster distribution over bond configurations d in an unconventional way:

  p(d) = (1/Z_P(J, q)) exp( beta D(d) ) f(d),    f(d) = q^{C(d)},    beta = log(e^J - 1),

where D(d) is the total number of bonds and C(d) is the number of connected components, or clusters, formed by the bonds. Under this factorization the energy is minus the total number of bonds, -D(d), and the inverse temperature on the bonds, beta = log(e^J - 1), is set by the coupling parameter J as before. The algorithm below gives an MCMC operator to update the bond configuration d -> d'. The stationary distribution is a weighted prior, w_mc(D(d)) f(d), where w_mc could be multicanonical weights, could be set to a constant for prior sampling, or could be set to zero and one to sample from the prior subject to constraints on D.

Algorithm: Swendsen-Wang for weighted bond configurations
  1. Create a random coloring s uniformly from the q^{C(d)} colorings satisfying the bond constraints d, as in the Swendsen-Wang algorithm.
  2. Count the sites that allow bonds: E(s) = |{ (ij) in edges : s_i = s_j }|.
  3. Draw a new bond count D' from T(D' | E(s)) = (1/Z_T(s)) w_mc(D') (E(s) choose D').
  4. Throw away the old bonds d, and pick uniformly one of the (E(s) choose D') ways of setting D' bonds in the E(s) available sites.

The probability of proposing a particular coloring and new setting of the bonds is

  T(s, d' | d) = q^{-C(d)} T(D' | E(s)) (E(s) choose D')^{-1} = q^{-C(d)} w_mc(D') / Z_T(s).

Summing over all possible intermediate colorings, the probability of starting with D bonds d and ending with D' bonds d' satisfies

  w_mc(D) q^{C(d)} T(d' | d) = sum_s w_mc(D) w_mc(D') / Z_T(s),

where the sum runs over colorings s consistent with both d and d'. This expression is symmetric under the exchange of d and d'; therefore the transition operator satisfies detailed balance with respect to the weighted prior. It is also ergodic (proof: with finite probability all s_i are given the same color; then any d' with nonzero weight is possible; in turn, all allowable d have finite probability).
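The operator above can be validated by simulation on a graph small enough to enumerate. The sketch below uses a hypothetical four-site cycle, with w_mc chosen as exp(beta D), which recovers the ordinary random cluster distribution; it runs the weighted bond update and compares the empirical distribution over bond configurations with exact enumeration of pi(d) proportional to w_mc(D(d)) q^{C(d)}:

```python
import itertools, math, random

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]    # four sites on a cycle: a tiny toy lattice
n_sites, q, J = 4, 2, 0.8
beta = math.log(math.exp(J) - 1.0)          # inverse temperature on the bonds
w_mc = lambda D: math.exp(beta * D)         # chosen here to recover the random cluster model

def component_labels(bonds):
    """Label the connected components (clusters) formed by the bonds."""
    adj = {i: [] for i in range(n_sites)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    label, c = [-1] * n_sites, 0
    for s0 in range(n_sites):
        if label[s0] < 0:
            stack, label[s0] = [s0], c
            while stack:
                u = stack.pop()
                for v in adj[u]:
                    if label[v] < 0:
                        label[v] = c
                        stack.append(v)
            c += 1
    return label, c

def sw_step(bonds, rng):
    """One weighted Swendsen-Wang bond update."""
    label, C = component_labels(bonds)
    colour = [rng.randrange(q) for _ in range(C)]        # uniform over the q^C(d) valid colourings
    s = [colour[label[i]] for i in range(n_sites)]
    sat = [e for e in edges if s[e[0]] == s[e[1]]]       # sites that allow bonds
    E = len(sat)
    probs = [w_mc(D) * math.comb(E, D) for D in range(E + 1)]
    D_new = rng.choices(range(E + 1), weights=probs)[0]  # draw the new bond count
    return frozenset(rng.sample(sat, D_new))             # place the bonds uniformly

# Exact target by enumeration of all bond subsets.
exact = {}
for r in range(len(edges) + 1):
    for sub in itertools.combinations(edges, r):
        exact[frozenset(sub)] = w_mc(r) * q ** component_labels(sub)[1]
Z = sum(exact.values())
exact = {d: w / Z for d, w in exact.items()}

rng = random.Random(0)
bonds, counts, S = frozenset(), {}, 20000
for _ in range(S):
    bonds = sw_step(bonds, rng)
    counts[bonds] = counts.get(bonds, 0) + 1
max_err = max(abs(counts.get(d, 0) / S - p) for d, p in exact.items())
print("max |empirical - exact| =", round(max_err, 3))
```

Agreement between the empirical and enumerated distributions is consistent with the detailed-balance argument above; substituting multicanonical weights or hard zero/one constraint weights for w_mc changes only the weight function.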
When performing nested sampling using the weighted prior representation, the likelihood constraints are thresholds on the total number of bonds D. This can be realized by setting w_mc to zero for states with fewer bonds. Many states have identical D, which requires careful treatment, as discussed earlier. The simple implementation that draws a random key for each state will lead to some rejections when proposing moves to a state with the same D on the constraint surface. Sampling without rejections can be achieved by setting the weights such that the algorithm leaves the auxiliary constrained distribution invariant. The number of bonds has previously been identified as a useful energy-like target for reweighting (Janke and Kappler). In that work the bonds were updated by single-site Gibbs sampling, rather than the block Gibbs sampling move used here; this does not allow simulation of the fixed-D ensemble, or rapid exploration near fixed D. Single-site updates are easier to implement, however, and would become more attractive on systems that allow a different J_ij on each edge;
in this case the implementation of global updates is much more involved.

Experiments

Detailed comparisons of nested sampling and more established Monte Carlo techniques are not currently available in the literature. Anecdotally, nested sampling has already been useful in astronomy: Mukherjee et al. and Shaw et al. claim nested sampling gives speedups of orders of magnitude over annealing-based methods. The annealing approach that was cited in both papers was very carefully implemented (Beltran et al.). These results are somewhat surprising, given that both nested sampling and annealing follow a sequence of increasingly constrained ensembles and theoretically seem quite similar; if anything, nested sampling seems slightly worse, although this should be a small effect for the astronomy problems that were tested. The focus of this section is not applications, but well-understood test problems; hopefully these will give some better insight into the relative merits of the approaches.

Description of slice sampling experiments

A set of experiments
were performed on some continuous distributions that are amenable to slice sampling; this allowed the same MCMC code to be used within each algorithm. The distributions tested are described first, then the details of the algorithms. The results, which appear in the table below, are discussed in the next subsection.

Gaussian: base and target distributions as considered theoretically above were run in ten dimensions. These experiments reveal the actual performance when confounded by interactions with a particular MCMC operator. In version A we set the standard deviation of the base distribution to be a fixed factor wider than the target, giving the precision ratio T; in version B we made the prior wider still.

t-distribution: this was included as another simple standard distribution with different tail behavior. The target distribution was a ten-dimensional multivariate-t with five degrees of freedom; the base distribution was Gaussian.

Two modes: the target was a mixture of two Gaussians, as tested by Neal; the base distribution was a unit six-dimensional Gaussian.

Deceptive: this two-dimensional problem bridges from a spherical Gaussian distribution to a two-dimensional mixture of Gaussians taken from Neal, with mixture components in four groups: means in the upper-right quadrant, means in the upper-left quadrant, means in the lower-left quadrant, and means in the lower-right quadrant. This target distribution is exceedingly challenging, and more pathological than experienced in many statistical problems. It is interesting, however: the different spacings of the means make it hard for algorithms to know from a distance where the bulk of the probability mass is. This highlights differences between the mass-finding heuristics implicitly performed by the algorithms.

In all cases, one Markov chain update consisted of a simple univariate slice sampler applied once to each variable, in a new random order at each iteration. A linear stepping-out procedure was employed, with an initial step size equal to one, or to the range of settings currently occupied by particles being simulated in parallel. Some additional choices were needed by each method.
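A minimal sketch of this kind of univariate update, a slice sampler with linear stepping out in the style of Neal (2003); the step size w, the step-out limit and the test target are illustrative choices:

```python
import math, random

def slice_sample(x0, log_f, rng, w=1.0, max_steps=50):
    """One univariate slice sampling update with linear stepping out."""
    log_y = log_f(x0) + math.log(1.0 - rng.random())   # log height of the slice
    # Position an interval of width w uniformly around x0, then step out.
    left = x0 - w * rng.random()
    right = left + w
    for _ in range(max_steps):
        if log_f(left) <= log_y:
            break
        left -= w
    for _ in range(max_steps):
        if log_f(right) <= log_y:
            break
        right += w
    # Sample from the interval, shrinking towards x0 on rejection.
    while True:
        x1 = left + (right - left) * rng.random()
        if log_f(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

# Quick check on a standard normal target.
rng = random.Random(0)
x, xs = 0.0, []
for _ in range(5000):
    x = slice_sample(x, lambda t: -0.5 * t * t, rng)
    xs.append(x)
mean = sum(xs) / len(xs)
var = sum((t - mean) ** 2 for t in xs) / len(xs)
print(round(mean, 2), round(var, 2))
```

Applying one such update to each variable, with the order freshly permuted every iteration, gives the full Markov chain sweep used in these experiments.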
Nest: nested sampling runs had two free parameters, the number of particles n and the number of slice-sampling steps used to update at each iteration. Unless n = 1, each new particle was initialized at one of the n particles already satisfying the likelihood constraint. As slice samplers always move, and there are no plateaus in these problems' likelihood functions, the tie-handling details discussed earlier were not required. Rather than setting a number of iterations S in advance, we terminated the nested sampler when the estimate of log Z appeared to have converged: in particular, we used a geometric approximation for x to estimate log Z, and terminated when the remaining prior mass appeared to contribute only a negligible fraction to the sum.
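A sketch of this estimate and termination rule, under the deterministic geometric approximation x_s = exp(-s/n); the test likelihood, the threshold fraction, and the use of the current likelihood as a crude bound on the remaining contribution are assumptions made for illustration:

```python
import numpy as np

def nested_logZ(log_L, n, stop_frac=1e-6):
    """Accumulate log Z from a nested sampling run with x_s = exp(-s/n).
    Stops when the remaining prior mass (crudely bounded using the current
    likelihood) appears to contribute less than stop_frac of the sum."""
    log_Z, x_prev = -np.inf, 1.0
    for s, ll in enumerate(log_L, start=1):
        x = np.exp(-s / n)
        log_Z = np.logaddexp(log_Z, ll + np.log(x_prev - x))  # strip of prior mass times L_s
        if ll + np.log(x) - log_Z < np.log(stop_frac):        # remaining contribution negligible
            return log_Z, s
        x_prev = x
    return log_Z, len(log_L)

# Idealised run: likelihood L(x) = exp(-c*x) as a function of enclosed prior
# mass x, so Z = (1 - exp(-c))/c exactly, and iteration s sees L(x_s).
n, c = 100, 100.0
log_L = [-c * np.exp(-s / n) for s in range(1, 3001)]
est, steps = nested_logZ(log_L, n)
print(est, np.log((1 - np.exp(-c)) / c), steps)
```

On this idealised run the estimate agrees with the analytic log Z to within the quadrature error of the geometric approximation, and the run stops well before exhausting the supplied iterations.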
The reported results used Monte Carlo samples of x; these are very similar to those obtained from the geometric-mean approximation, but also provide quantiles of the predictive posterior over log Z.

AIS-r,c: annealed importance sampling has two free parameters in addition to its annealing schedule. An experiment was repeated r times, each run using c parallel chains. When c > 1, the parallel chains are used to set the initial step sizes of the slice sampler; for multimodal distributions this is not strictly justified within the AIS framework, but it seems unlikely the results will differ greatly from using preliminary runs instead. An AIS initialization labelled "e, nest(n, steps)" means that an annealing schedule was estimated from a nested sampling run with the given n and steps per iteration, and that the given number K of intermediate temperatures was needed for target standard error e. The details of this procedure, and the linear and fourth-power schedules, were given above.

Multicanonical: ten slice sampling chains initialized from the prior were run in parallel on the multicanonical ensemble. Simple error bars were calculated from the variance of the estimates from each chain. On the Gaussian problem the analytically derived weighting function that sets a uniform distribution over energies could be used. We also tried setting the multicanonical weights by estimating the distribution over energies from a nested sampling run.

Discussion of slice sampling results
The experiments with a Gaussian highlight differences between the theoretical ideals above and the realities of MCMC. Nested sampling is fastest with the smallest n, which was tried as a preliminary run. The predictions from one or ten slice-sampling sweeps per iteration are imprecise, as one might expect, but also overconfident: the posterior over log Z has negligible overlap with the true answer. Increasing the amount of MCMC sampling to many more slice-sampling steps per iteration overcomes the bias, but at a large computational cost. Increasing the number of particles n gives more precise answers and increases accuracy. There are several reasons: each particle can start at one of the other particles, which are supposed to be drawn from the target distribution, and the more particles there are, the easier it is to forget exactly which one was copied; also, the distributions change more slowly with larger n.

[Table: empirical behavior of slice-sampling-based nested sampling, AIS and multicanonical sampling; the distributions and methods are described in the main text, and the results are discussed in this subsection. For each test distribution (Gaussian A, Gaussian B, t-distribution, two modes, deceptive) the table lists, for each method and initialization (nest with various n and steps per iteration; AIS initialized from nested sampling, from a linear schedule, and from a fourth-power schedule; multicanonical with nested-sampling-derived or analytic weighting), the number of log-likelihood evaluations and the estimate of log Z, alongside the true value.]

The answer from many particles with one step of sampling per iteration is much better than that from fewer particles with many steps, and is also an order of magnitude
cheaper. Even with only one sampling step per iteration, the estimates and error bars at the largest n are indistinguishable from those obtained by exact sampling from the constrained priors.

Three standard annealed importance sampling runs were performed with the same target error. Obtaining the target error from only one run requires a very long annealing schedule; this gave similar results to shorter runs with smaller K, but error bars are much more easily obtained from multiple runs. Increasing the number of runs required shortening the annealing schedule further for a similar target error or computational cost. The short-K runs had larger errors than predicted, and unreliable error bars: this schedule was designed assuming that the algorithm would keep close to the equilibrium distributions defined in the annealing schedule, and in this case there were not enough bridging distributions for this approximation to be accurate. The result which uses ten parallel chains is similar to the result from separate runs, but with many fewer likelihood evaluations. In the absence of a population of points, the slice sampling code used an initial step size of one, which is inappropriately small for this problem at high temperatures; a user not prepared to adapt step sizes based on a population should find some way to set them appropriately. The sequence of three AIS runs at deliberately higher precision than necessary was performed so that the error bars are somewhat reliable and reproducible, something that is sadly not true of the lower-precision AIS runs. These confirm that a linear annealing schedule is worse than one that dwells for longer at higher temperatures. In this case the fourth-power schedule quite closely follows that set by nested sampling, and has very similar performance.

The first multicanonical result,
with weights set by nested sampling, seems to give comparable uncertainty to AIS for the amount of computer effort. However, running the chains for ten times longer reveals problems with this multicanonical estimator. We had set the weights by estimating the probability mass between each pair of likelihood values L_s and L_{s+1} visited by nested sampling; for simplicity we used the deterministic approximation. This gives noisy estimates of the ideal multicanonical weights, which, it turns out, are not good enough. We could attempt to smooth the approximate multicanonical weighting function; instead, for the Gaussian case, we went directly to the correct multicanonical ensemble. This confirmed that the problem with the estimator was the choice of weighting function, and gives multicanonical sampling a performance on the Gaussian A problem somewhere in between nested sampling and AIS. Even with the ideal weighting, multicanonical sampling fails to equilibrate, and fails to give reasonable error bars, on the Gaussian B problem; this is to be expected given the method's poor scaling with the precision ratio T. (We were able to confirm this because the ideal weighting is quite tractable for this toy problem.) We did not continue to try multicanonical sampling: adapting the weighting function beyond a nested sampling initialization seems important, and this would complicate the comparison.

Turning to the t-distribution example, we see behavior different from the Gaussian case: the annealing schedule estimated from nested sampling is closer to linear than
to a fourth power. Moreover, the linear schedule, and that produced from nested sampling, are reproducibly better than the fourth power; notice the reversal from the Gaussian experiment. A good annealing schedule was estimated cheaply from nested sampling with small n, obtaining answers as good as AIS. With nested sampling alone this would be more expensive; exactly how much is hard to tell from individual runs, because the error bars can be unreliable.

A more reliable comparison of nested sampling and AIS can be made based on many repeated runs. Nested sampling was run repeatedly, each time over a range of n, and compared to AIS with runs of schedules of varying lengths set by nested sampling. Panel (a) of the figure below shows that the actual performance, measured as a mean squared error of the log Z estimates, was similar for the two methods: nested sampling is slightly more accurate at low computational costs, being taken over by AIS at higher precisions. Increasing the number of AIS runs at the expense of annealing schedule length performs worse for the same computational cost; this difference goes away at high precisions, but may be a concern when many runs are required on multimodal problems. The main problem with AIS is that at lower precisions, simple estimates and error bars based on the mean and variance of the importance weights are unreliable: panel (b) shows large biases in the estimates for log Z, and panel (c) shows that the error bars are too small on average. Jackknife estimates (not shown) are sometimes slightly better calibrated, but give very similar results. For a user the error bars are often very important, so in this case the nested sampling estimator is the favorable choice.

The peaks of the two modes target
distribution are highly separated. At low temperatures, the fraction of AIS chains ending in the mode in the positive orthant does not match its actual probability mass of a third. Through weighting these runs, AIS converges to the correct answer once enough sufficiently-long chains have been run; further checks show that it also estimated the mass of each mode correctly. The same is true for nested sampling: the results from large n obtain nearly the correct relative masses of the modes, and in turn the overall normalization, although detailed analysis still shows some signs of bias, with the posterior overlap with the correct answer smaller than apparent from the scalar error bars in the table. Both methods find two modes difficult because it is not feasible to sample correctly from the constrained distributions at late iterations, or from the low-temperature distributions: AIS relies on reweighting over many runs, while nested sampling requires many particles to maintain a fair representation even after many likelihood evaluations. Both nested sampling and AIS have problems.
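The summary quantities used in the figure can be computed from any set of repeated runs. A sketch with synthetic run results, comparing one hypothetical well-calibrated estimator against one that is biased and overconfident (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_logZ = 0.0
R = 1000                                   # number of independent runs

# Hypothetical estimators: one honest, one biased with overconfident error bars.
honest_est = true_logZ + 1.0 * rng.standard_normal(R)
honest_bar = np.ones(R)                    # reported error bar matches the true spread
biased_est = true_logZ - 2.0 + 1.0 * rng.standard_normal(R)
biased_bar = 0.5 * np.ones(R)              # too small: overconfident

def summarize(est, bar):
    err = est - true_logZ
    mse = np.mean(err ** 2)                # panel (a): mean square error
    bias = np.mean(err)                    # panel (b): mean error, or bias
    calib = np.mean((err / bar) ** 2)      # panel (c): calibration; 1 is ideal
    return mse, bias, calib

print(summarize(honest_est, honest_bar))
print(summarize(biased_est, biased_bar))
```

A calibration value near one means the reported error bars match the actual errors; the biased, overconfident estimator scores far above one, which is the failure mode seen for short AIS runs.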
[Figure: average behavior of AIS and nested sampling over repeated runs, for a range of n or target errors. The x-axis gives computational cost in likelihood evaluations. (a) Mean square error of the log Z estimate. (b) Mean error, or bias, of the point estimate. (c) Mean square normalized error (error divided by error-bar width), a measure of calibration.]
[Table: estimates for the deceptive distribution. For each quadrant (upper-left, upper-right, lower-left, lower-right) the table lists the true probability, the nested sampling estimate and the AIS estimate.]
With the deceptive distribution, nested sampling has an error bar on log Z that is far too small. AIS's estimate is consistent with the correct answer, although, as with two modes, with a higher standard error than targeted, due to poor mixing at low temperatures. Additional problems are revealed by looking at the probability distribution over the four quadrants containing each cluster of modes (table above). Again, the nested sampling run is far too confident. In this case AIS's error bars are generally much better, although it is very certain that the upper-right quadrant has much less probability mass than it actually does. Runs of nested sampling with smaller n show that it too can easily lose the upper-right quadrant.

Nested sampling's answers are wrong according to its error bars due to the Markov chain approximation at late iterations: with high likelihood constraints the slice sampler is unable to equilibrate, and is relying on a large number of particles to provide a good starting point. On this problem the inaccuracy of this approximation does not reduce as fast as the size of the error bars with a large number of points. Increasing the number of slice sampling steps per iteration does not solve the problem, because it would take an unrealistically large number to move between isolated modes. However, proposals that take all of the particles' positions into account could help. Shaw et al. use an approximate rejection sampler based on uniform samples within ellipsoidal fits to clusters of the existing particles. No problems with biases were reported using this method, although clearly problems could still occur when the ellipsoids fail to correctly capture the constrained distribution. In general, ellipsoids could be used as Metropolis proposals as part of several and various attempts to equilibrate a particle.

The Potts model

The Potts model was introduced earlier; it describes a class of undirected graphical models over discrete variables s. The variables take
on one of q colors, and the model has a temperature-like coupling parameter J. Gibbs sampling updates of the colors are the simplest way to implement approximate nested sampling; we also try cluster-based updates. While physicists tend to be interested in a broader range of quantities, we focus here on the normalization constant Z_P(J, q), where the discrete variables s are the variables that need to be integrated (i.e. summed) over.

[Table: partition function results for Potts systems; see text for details. Rows: Gibbs AIS, Swendsen-Wang AIS, Gibbs nested sampling, random-cluster nested sampling, acceptance ratio. Columns: an Ising system (q = 2) at its quoted coupling J, and a larger-q Potts system at its quoted J.]

The table shows results on two example systems: an Ising model (q = 2) and a Potts model with larger q in a difficult parameter regime. We tested nested sampling and AIS with Gibbs sampling and with cluster-based updates. Annealed importance sampling was run repeatedly, with a geometric spacing of settings of J as the annealing schedule. Nested sampling used n particles and full-system MCMC updates to approximate each draw from the constrained prior. We also developed an
acceptance ratio method bennett based on our representation in equation which w
e ran extensively and should give nearly correct results the markov chains used
by nested sampling were initialized at one of the N particles satisfying the current constraint. Preliminary experiments that initialized a new particle at the deleted point s on the constraint surface were a failure: the Gibbs nested sampler could get stuck permanently in a local maximum of the likelihood, while the cluster method gave erroneous answers for the Ising system. This supports the earlier suggestions. AIS performed very well on the Ising system and can work with larger q at low coupling strengths; we took advantage of its performance in easy parameter regimes to compute Z, which was needed to interpret the results from the cluster-based nested sampler. However, with a temperature-based annealing schedule, AIS was unable to give useful answers for the larger-q system close to the critical J we evaluated, whereas nested sampling appears to be correct within its error bars. Under these conditions it is known that even the efficient Swendsen-Wang algorithm mixes slowly for Potts models.
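To make the AIS setup concrete, the following is a minimal sketch (not the thesis code): it anneals a tiny periodic Ising chain from J = 0 up to a target coupling with single-site Gibbs updates, using a linearly spaced schedule standing in for the geometric one described above. The system is small enough that log Z can also be computed by brute-force enumeration for comparison; the system size, coupling, schedule and run counts are all illustrative assumptions.

```python
import itertools, math, random

K = 4            # spins in a periodic Ising chain; 2^K states, so Z is enumerable
J_TARGET = 0.4   # illustrative target coupling

def log_f(s, J):
    """Unnormalized log probability: J times the sum of neighboring products."""
    return J * sum(s[i] * s[(i + 1) % K] for i in range(K))

def log_Z_exact(J):
    """Brute-force log partition function over all 2^K spin configurations."""
    return math.log(sum(math.exp(log_f(s, J))
                        for s in itertools.product([-1, 1], repeat=K)))

def gibbs_sweep(s, J, rng):
    """One full sweep of single-site Gibbs updates at coupling J."""
    for i in range(K):
        field = J * (s[i - 1] + s[(i + 1) % K])       # neighbors (periodic)
        p_up = 1.0 / (1.0 + math.exp(-2.0 * field))   # p(s_i = +1 | rest)
        s[i] = 1 if rng.random() < p_up else -1

def ais_log_weight(betas, rng):
    """One AIS run: exact sample at J = 0, then anneal toward J_TARGET."""
    s = [rng.choice([-1, 1]) for _ in range(K)]       # exact draw at J = 0
    logw = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        logw += log_f(s, b * J_TARGET) - log_f(s, b_prev * J_TARGET)
        gibbs_sweep(s, b * J_TARGET, rng)
    return logw

def ais_log_Z(n_runs=200, n_temps=30, seed=0):
    """Estimate log Z(J_TARGET); Z at J = 0 is exactly 2^K."""
    rng = random.Random(seed)
    betas = [t / (n_temps - 1) for t in range(n_temps)]
    logws = [ais_log_weight(betas, rng) for _ in range(n_runs)]
    m = max(logws)
    log_mean_w = m + math.log(sum(math.exp(w - m) for w in logws) / n_runs)
    return log_mean_w + K * math.log(2.0)
```

In this easy regime the AIS estimate lands close to the enumerated answer; the failures described above arise when the intermediate distributions pass through a phase transition that the Markov chain cannot cross.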
This slow mixing is worst with q near critical values of J, which correspond to a first-order phase transition (Gore and Jerrum); see the figure. Typical Potts model states are either entirely disordered or ordered: disordered states contain a jumble of small regions with different colors (e.g. panel (b) of the figure), while in ordered states the system is predominantly one color (e.g. panel (d)). Moving between these two phases is difficult: defining a valid MCMC method that moves between distinct phases requires knowledge of the relative probability of the whole collections of states in those phases. Temperature-based annealing algorithms explore the model for a range of settings of J and fail to capture the correct behavior near the transition; despite using Markov chains closely related to those used in AIS, nested sampling can work in all parameter regimes.

[Figure, panels (a)-(e): Two Potts models with starting states (a) and (c) were simulated with full-system Swendsen-Wang updates at the given J. The corresponding results (b) and (d) are typical of all the intermediate samples: Swendsen-Wang is unable to take (a) into an ordered phase or (c) into a disordered phase, although both phases are typical at this J. In contrast, (e) shows an intermediate state of nested sampling, which succeeds in bridging the phases.]

Panel (e) of the figure shows how nested sampling can explore a mixture of ordered and disordered phases. By moving steadily through these states, nested sampling is able to estimate the prior mass associated with each likelihood value. This behavior is not possible in temperature-based algorithms.
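The way nested sampling attaches prior masses X to likelihood values can be sketched on a deliberately trivial toy problem, not the Potts setup: a Uniform[0, 1] prior with likelihood L(θ) = 2(1 − θ), so Z = ∫ L(θ) dθ = 1 exactly, and the constrained prior can be sampled directly, standing in for the MCMC approximation discussed above. All sizes are illustrative assumptions.

```python
import math, random

def nested_sampling_Z(n_particles=100, n_iters=600, seed=1):
    """Toy nested sampling run; returns an estimate of Z = integral of L.

    Likelihood L(theta) = 2(1 - theta) on a Uniform[0,1] prior, so Z = 1,
    and the constraint L(theta) > L* is just theta < 1 - L*/2, which can
    be sampled exactly here.
    """
    rng = random.Random(seed)
    L = lambda th: 2.0 * (1.0 - th)
    thetas = [rng.random() for _ in range(n_particles)]
    Z, x_prev = 0.0, 1.0
    for i in range(1, n_iters + 1):
        worst = min(range(n_particles), key=lambda j: L(thetas[j]))
        L_star = L(thetas[worst])
        x = math.exp(-i / n_particles)   # deterministic prior-mass estimate
        Z += L_star * (x_prev - x)       # accumulate L dX
        x_prev = x
        # replace the worst particle with a draw from the constrained prior
        thetas[worst] = rng.random() * (1.0 - L_star / 2.0)
    return Z
```

The steady geometric compression x_i ≈ exp(−i/N) is what lets the run walk through intermediate, partially ordered states without ever needing a temperature parameter.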
Such algorithms use J as a control parameter; AIS with a temperature-based schedule is an example.

Discussion and conclusions

Summary

The main purpose of this chapter was to study nested sampling and its relationship with more established methods. We find that it occupies a unique position amongst Monte Carlo algorithms. Unlike the majority of annealing-based methods, nested sampling can deal with first-order phase transitions. While multicanonical sampling also solves this problem, its properties are different in almost every other respect: nested sampling does not need prior setting of weights, its scaling with dimensionality and energy ranges is very different, and nested sampling follows a sequence of distributions, like annealing. In some statistical settings nested sampling has clear advantages over multicanonical sampling; undoubtedly it will perform badly on some applications where multicanonical sampling has already been successful. Nested sampling's theoretical scaling with dimensionality, in the number of iterations required, is worse than that of annealing; in practice the difference can be less dramatic, as bringing an ensemble to equilibrium may require fewer function evaluations than equilibrating a temperature-based distribution. The slice-sampling results do support AIS's superiority, at least for higher-accuracy results, but other issues, such as ease of implementation and quality of error bars, may be more significant. Issues with annealing's error bars were blamed for an order of magnitude of extra cost by both Beltrán et al. and Shaw et al. Despite this, the relative underperformance of annealing reported by these astronomers is difficult to account for given the theoretical and practical results of this chapter. The most likely explanation is a difference in the operators used to update the intermediate distributions. In this chapter the same, or closely related, operators were used with AIS and nested sampling. In Shaw et al., nested sampling was described with its own special update rule using ellipsoids fitted to the current particles; this algorithm could easily require fewer function evaluations than a simple slice sampler or Metropolis method as used within annealing in Beltrán et al. Although the ellipsoid algorithm was suggested by the need to sample from nested sampling's constrained distributions, there is no reason not to use the same basic code to propose moves for an annealing method. Combined with an annealing schedule specified by nested sampling, AIS might perform as well as or better than nested sampling.

Related work

This chapter only considered a subset of the available methods for computing normalizing constants; the focus has been on simple methods that compute the normalizing constant of a generic probability distribution, and we avoided, as much as possible, assuming detailed knowledge about the target distribution and its relationships with other distributions. For users with more time, a richer set of methods is available, some of which are mentioned here. The intermediate distributions in annealed importance sampling do not have to be temperature-based distributions; these have been the focus here because of the simplicity and generality of raising the likelihood to a power, but any other way of bringing in the likelihood gradually can be used, and bringing in one data point at a time may be natural in a statistical setting. A generalization, linked importance sampling (Neal), may be helpful for such cases, where the intermediate levels are fixed and limited in number. Annealed importance sampling can be seen as a member of a wider family of sequential Monte Carlo (SMC) methods. Some of these algorithms allow transfer of particles amongst modes, and should definitely be considered by anyone attracted to nested sampling by this property. A recent example of SMC combined with an interesting set of bridging distributions, leveraging the power of graphical models, is presented in Hamze and de Freitas. Some situations require computing more than one
f the models and explore it with mcmc reversible jump mcmc green makes this poss
ible even when the models have parameter vectors of different dimensionalities t
he relative probability of each model is available from the amount of time the s
ampler spent exploring each model and more advanced estimators based on transiti
on probabilities may be available this discussion and conclusions only provides
the normalizing constants up to a constant although if one of the models conside
red is tractable this constant could be found the path sampling approach of gelm
an and meng suggests computing normalizing constants by integrating along a path
of model hyperparameters rather than along a single inversetemperature paramete
r if the normalizer for each setting of the hyperparameters is required this is
particularly effective contour plots based on independent estimates will usually
be very noisy compared to a path sampling approach philosophy various authors h
ave noted the irony of using monte carlo a frequentist procedure for bayesian co
mputation eg ohagan neal rasmussen and ghahramani rather than giving beliefs abo
ut quantities given the computations performed monte carlo algorithms provide fr
equentist statistical estimators skilling claims that nested sampling is bayesia
n this section examines the extent to which this holds our target is a posterior
distribution over z philosophically this is slightly tricky because z is a cons
tant which we should be able to work out given the prior and likelihood function
s thus according to any rational calculus of beliefs as in cox or jaynes the pos
terior should concentrate all of its mass at the true value the problem is not l
ack of available information but computer time to use it perfectly the solution
to this conundrum is to be careful about what we claim to know assume that some
agent is running nested sampling on our behalf and only reports to us l ls if it
reported more information such as s locations we would be in an embarrassing si
tuation to use this information we would need a probabilistic model including s
in which inference would probably be hard but throwing away knowledge is hard to
deal with rationally so we pretend it was never available thus the inferences f
or nested sampling as in rasmussen and ghahramani are rational for a fictitious
observer with limited information this observers results will be more vague than
if s were not ignored but should be sensible which is not guaranteed by general
approximations to bayesian inference we first set up priors based on the knowle
dge that an agent is running nested sampling which will provide ls quantities wi
th associated xs cumulative mass values our knowledge of nested sampling defines
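This problem-independent distribution over the x_s is easy to simulate: at each iteration the enclosed prior mass shrinks by a factor t ~ Beta(N, 1), so x_s is the product of the shrinkage factors so far and E[log x_s] = −s/N. A minimal sketch, with illustrative sizes:

```python
import math, random

def simulate_log_x(n_particles, n_steps, rng):
    """One run's cumulative-mass trajectory: returns log x_s after n_steps."""
    log_x = 0.0
    for _ in range(n_steps):
        # t ~ Beta(N, 1) via inverse CDF: t = u ** (1/N)
        t = rng.random() ** (1.0 / n_particles)
        log_x += math.log(t)
    return log_x

def mean_log_x(n_particles=10, n_steps=50, n_runs=2000, seed=2):
    """Monte Carlo estimate of E[log x_s]; exact value is -n_steps/n_particles."""
    rng = random.Random(seed)
    return sum(simulate_log_x(n_particles, n_steps, rng)
               for _ in range(n_runs)) / n_runs
```

The spread of these trajectories around −s/N is exactly the source of nested sampling's error bars on log Z.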
This knowledge fixes a problem-independent prior distribution over X = {x_s}. We should also specify a prior distribution over functions L(x). After observing L = {l_s}, we update our beliefs about the underlying cumulative values according to Bayes' rule:

    p(X|L) = p(L|X) p(X) / p(L).

It is this posterior distribution that should be used when computing posteriors over other quantities, such as Z:

    p(Z|L) = ∫ p(Z|X, L) p(X|L) dX.

Note that specifying a prior over monotonic functions, p(L(x)), and computing with it appears difficult in general. Skilling declares that "I can't, so I don't": instead the prior p(X) is used alone. This corresponds to a particular assumption, an improper uniform prior over likelihood functions. This cannot be avoided by claiming general ignorance: unlike the sampled locations, we must be told L = {l_s}. Thus, to maintain a claim of rationality, we must be happy with this particular choice of prior, or type of ignorance, about the likelihood function. In addition, and much more seriously, we must assume the agent is actually running nested sampling exactly as the algorithm specifies. In fact we really know that some approximation will be involved, and as we have seen this can give us unreasonable beliefs, as with most probabilistic modeling where we know a priori that a model's joint distribution does not really capture every detail of a real system. Similar concerns must surround all such attempts to introduce Bayesian methodology into Monte Carlo algorithms for general inference problems. For example, Bayesian learning of the one-dimensional weighting function of the multicanonical method has been attempted (e.g. Smith). This should be a difficult inference problem, as the output of a sampler has a complicated dependence structure; only with incorrect assumptions, i.e. approximations, can Bayesian methods be applied. The ultimate justification for such methods must be their empirical performance, which is sometimes very good.

Chapter: Doubly-intractable distributions

Most of this thesis has been dedicated to sampling from probability distributions where the key difficulty has been an intractable normalization. When considering a posterior over parameters θ given data y,

    p(θ|y) = p(y|θ) p(θ) / p(y),

we assumed that the joint probability in the numerator
could be easily evaluated for any particular joint setting (θ, y). The previous chapter described how the important quantity p(y) can be approximated, but it is often infeasible to compute this quantity exactly. Standard MCMC methods are designed for use with these intractable distributions. Markov chain operators can be constructed by restricting consideration to a manageable subset of θ's state space at each step: in Metropolis-Hastings only two settings are considered, the current setting and a randomly chosen proposal, while Gibbs sampling changes only one component of θ at a time. Metropolis requires an ability to evaluate p(θ, y) ratios for various pairs of settings, and Gibbs sampling requires the ability to sample from the conditional distributions p(θ_i | θ_{j≠i}, y). By considering restricted parts of the state space, neither method needs to know the global normalizing constant p(y). But what if p(y|θ), like p(y), contains a summation over a large state space, so that it cannot feasibly be evaluated pointwise? Then the problem is doubly-intractable and, as we shall see, even performing Markov chain Monte Carlo is potentially exceedingly difficult.

The next section explains why doubly-intractable distributions arise and the difficulties involved, and the section after explores approximations of standard MCMC algorithms in the context of undirected graphical models. Møller et al. provided the first feasible algorithm for models where sampling from p(y|θ) is possible but its normalization is unknown; their method is reviewed and then generalized by us in later sections. Working on these algorithms inspired the new exchange algorithm. These innovations were first described in Murray et al. This chapter provides a slightly more general version of the exchange algorithm, with a new derivation which is somewhat simpler than the original description, and full mathematical derivations of detailed balance, which were previously omitted for space reasons. We then consider further new valid MCMC algorithms for doubly-intractable distributions, which provide a connection to the approximate Bayesian computation (ABC) literature, and provide slice-sampling algorithms for doubly-intractable distributions. Finally, new directions for research in this area are considered.

Bayesian learning of undirected models

The Potts model discussed in
the previous chapter belongs to a very wide class of energy-based models, where

    p(y|θ) = (1/Z(θ)) exp( − Σ_j E_j(y_{C_j}; θ_j) ) = (1/Z(θ)) Π_j f_j(y_{C_j}; θ_j),
    Z(θ) = Σ_y Π_j f_j(y_{C_j}; θ_j).

The sets C_j each index a subset of variables that take part in a corresponding potential function f_j, parameterized by θ_j; each potential expresses mutual compatibilities amongst a subset of variables y_{C_j}. Sampling from the y variables is possible with MCMC, but computing the normalization can be very difficult. An earlier section reviewed how special structure in a graphical model sometimes allows efficient computation of the normalization Z(θ). Most of this chapter concerns methods that apply to general distributions, so for clarity it is convenient to collapse the model to a simpler form:

    p(y|θ) = f(y; θ) / Z(θ).

We now consider sampling from the posterior over parameters when the likelihood is of this unnormalized form. This posterior,

    p(θ|y) = f(y; θ) p(θ) / (Z(θ) p(y)),

offers a new difficulty. As before, p(y) is not needed for MCMC, but the normalizing constant Z(θ) cannot be ignored, as it is a function of the parameters, the variables being sampled. Every time new parameter values are considered, it appears that an intractable computation involving Z(θ) will be required, since MCMC estimators are approximations unless an infinite number of iterations are performed, and each iteration is generally infeasible. The posterior above can be called a doubly-intractable distribution.

While sampling from parameter posteriors by MCMC is a well-established technique, it is largely associated with distributions that could be represented as directed graphical models; sampling parameters in anything but the most trivial undirected graphical model is doubly-intractable. While directed models are a more natural tool for modeling causal relationships, the soft constraints provided by undirected models have proven useful in a variety of problem domains. We briefly mention six applications.

(a) In computer vision, Markov random fields (MRFs), a form of undirected model, are used to model the soft constraint a pixel or image feature imposes on nearby pixels or features (Geman and Geman). This use of MRFs grew out of a long tradition in spatial statistics (Besag).

(b) In language modeling, a common form of sentence model measures a large number of features of a sentence, f_j(s), such as the presence of a word, subject-verb agreement, the output of a parser on the sentence, etc., and assigns each such feature a weight θ_j. A random field model of this is then p(s) = (1/Z) exp(Σ_j θ_j f_j(s)), where the weights can be learned via maximum-likelihood iterative scaling methods (Della Pietra et al.).

(c) These undirected models can be extended to coreference analysis, which deals with determining, for example, whether two items (e.g. strings, citations) refer to the same underlying object (McCallum and Wellner).

(d) Undirected models have been used to model protein folding (Winther and Krogh) and the soft constraints on the configuration of protein side chains (Yanover and Weiss).

(e) Semi-supervised classification is the problem of classifying a large number of unlabeled points using a small number of labeled points and some prior knowledge that nearby points have the same label. This problem can be approached by defining an undirected graphical model over both labeled and unlabeled data (Zhu and Ghahramani).

(f)
Given a set of directed models p(y|θ_j), the products-of-experts idea (Hinton) is a simple way of defining a more powerful undirected model by multiplying them:

    p(y|θ) = (1/Z(θ)) Π_j p(y|θ_j).

The product assigns high probability when there is consensus among the component experts.

Despite the long history and wide applicability of undirected models, until recently Bayesian treatments of learning the parameters of large undirected models have been virtually non-existent. Indeed, there is a related statistical literature on Bayesian inference in undirected models, log-linear models and contingency tables (Albert; Dellaportas and Forster; Dobra et al.). However, this literature, with the notable exception of the technique reviewed later in this chapter, assumes that the partition function Z(θ) can be computed exactly; it concerns low-treewidth graphs, graphical Gaussian models and small contingency tables. For many of the machine learning applications of undirected models cited above this assumption is unreasonable. This chapter addresses Bayesian learning for models with intractable Z(θ).

Do we need Z(θ) for MCMC?

The modeler's effort was put into specifying f(y; θ), so it is tempting to think that Z(θ) should have little relevance and that there must be some way to sidestep computing it. It was established above that the normalizer Z(θ) is not a constant in the context of sampling parameters. This section explores the role of Z(θ) in more detail, addressing the consequences for any scheme that avoids computing it. The algorithm below gives the straightforward application of the standard Metropolis-Hastings algorithm to the doubly-intractable posterior in the above equation.
Algorithm: standard, but infeasible, Metropolis-Hastings (MH)

    Input: initial setting θ, number of iterations S
    for s = 1 ... S:
        Propose θ' ~ q(θ'; θ, y)
        Compute
            a = [ p(θ'|y) q(θ; θ', y) ] / [ p(θ|y) q(θ'; θ, y) ]
              = [ f(y; θ') p(θ') q(θ; θ', y) Z(θ) ] / [ f(y; θ) p(θ) q(θ'; θ, y) Z(θ') ]
        Draw r ~ Uniform[0, 1]
        if r < a then set θ ← θ'
    end for

Computing the acceptance ratio requires a ratio of normalizing constants, Z(θ)/Z(θ'), or at least a bound tight enough for the accept/reject step; this is difficult in general. There are usually some free choices while constructing MCMC algorithms: perhaps some nuisance parameters could be judiciously set to remove Z(θ) through some fortunate cancellations? We are not free to cancel out Z(θ) through a choice of prior p(θ): the prior would have to depend on the number of observed data points and would take on extreme values, dominating any inferences (Murray and Ghahramani). In theory the proposal distribution could be defined to remove explicit Z(θ) dependence from the acceptance ratio, but in practice this does not seem to help; for example, proposals with q(θ'; θ, y) proportional to Z(θ') or to 1/Z(θ') would be difficult to construct without knowing Z, and would be terrible proposals. The only distribution we know of that contains Z(θ) gives probabilities in data space, not over parameters. The next section reviews and extends a method by Møller et al., which introduced auxiliary variables taking on values in the data space; this allows proposals that cancel out the unknown terms, but only if it is possible to draw samples from the intractable distribution p(y|θ). Being able to cancel the unknown normalizers in this way is key.
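On a model small enough to enumerate, the "infeasible" algorithm above can actually be run, which makes the role of the Z(θ)/Z(θ') factor concrete. Below is a sketch on a tiny Ising chain with a single coupling parameter and a Gaussian prior; the model, sizes, and settings are illustrative assumptions, not the thesis's experiments.

```python
import itertools, math, random

K = 4  # spins; 2^K states, so Z(theta) is exactly enumerable here

def log_f(y, theta):
    """Unnormalized log likelihood of one configuration y (periodic chain)."""
    return theta * sum(y[i] * y[(i + 1) % K] for i in range(K))

def log_Z(theta):
    """Exact log partition function by brute-force enumeration."""
    return math.log(sum(math.exp(log_f(y, theta))
                        for y in itertools.product([-1, 1], repeat=K)))

def log_accept(theta, theta_p, y, log_prior):
    """Log MH ratio, including the ratio of partition functions Z(theta)/Z(theta')."""
    return (log_f(y, theta_p) + log_prior(theta_p) + log_Z(theta)
            - log_f(y, theta) - log_prior(theta) - log_Z(theta_p))

def run_mh(y, n_iters=500, step=0.3, seed=3):
    """Random-walk MH over the coupling theta; feasible only because K is tiny."""
    rng = random.Random(seed)
    log_prior = lambda th: -0.5 * th * th   # N(0,1) prior on the coupling
    theta, samples = 0.0, []
    for _ in range(n_iters):
        theta_p = theta + step * rng.gauss(0.0, 1.0)  # symmetric proposal
        if math.log(rng.random()) < log_accept(theta, theta_p, y, log_prior):
            theta = theta_p
        samples.append(theta)
    return samples
```

Dropping the log_Z terms from log_accept changes the acceptance ratio by exactly log Z(θ) − log Z(θ'), which is the quantity every scheme in this chapter must somehow supply or approximate.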
This insight makes it possible to sample from a limited but significant class of distributions for which MCMC was previously impossible. However, the algorithm and its extensions are not a panacea: it is not always possible to draw exact samples from the data distribution. Another problem, or opportunity, is that specifying a workable stationary distribution for the auxiliary variables requires approximating the parameter posterior before sampling begins.

[Figure: (a) The dash marked y shows the position of our observation in data space; the curve shows an unnormalized function f(x; θ), which gives low probability to the alternative data set x. (b) After changing the parameter to θ', the unnormalized function evaluated at the observation has increased, f(y; θ') > f(y; θ); however, the likelihood has decreased, p(y|θ') < p(y|θ). Noticing this requires considering the new high-probability region of data space containing x, which is not necessarily close to the observation.]

An optimistic train of thought sees no problem: Z(θ) is just a sum, and summing over unknowns is a standard feature of MCMC. For example, models with latent variables z often have intractable likelihoods, but while it may be difficult to evaluate

    p(y|θ) = Σ_z p(y, z|θ) = Σ_z p(y|z, θ) p(z|θ),

we can jointly sample over p(θ, z|y) as long as this distribution is known up to a constant; discarding the latents gives samples from the correct marginal p(θ|y). In doubly-intractable problems the likelihood contains 1/Z(θ), the reciprocal of a sum of terms, rather than a sum over latent variables.
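The contrast with ordinary latent variables can be checked numerically on a tiny discrete model: the joint posterior p(θ, z|y), known up to a constant, marginalizes to exactly the right p(θ|y), which is why sampling (θ, z) jointly and discarding z is valid. All probability tables below are made up purely for illustration.

```python
# Tiny discrete model: theta in {0,1}, latent z in {0,1,2}, observed y in {0,1}.
p_theta = [0.3, 0.7]
p_z_given_theta = [[0.2, 0.5, 0.3],
                   [0.6, 0.1, 0.3]]
p_y_given_z_theta = [[0.9, 0.4, 0.1],   # p(y=1 | z, theta=0)
                     [0.2, 0.8, 0.5]]   # p(y=1 | z, theta=1)

def posterior_via_marginal_likelihood(y=1):
    """p(theta|y) from p(y|theta) = sum_z p(y|z,theta) p(z|theta)."""
    def lik(t):
        p1 = sum(p_y_given_z_theta[t][z] * p_z_given_theta[t][z] for z in range(3))
        return p1 if y == 1 else 1.0 - p1
    post = [lik(t) * p_theta[t] for t in range(2)]
    s = sum(post)
    return [p / s for p in post]

def posterior_via_joint(y=1):
    """Marginalize the joint posterior p(theta, z | y) over z."""
    joint = [[(p_y_given_z_theta[t][z] if y == 1 else 1.0 - p_y_given_z_theta[t][z])
              * p_z_given_theta[t][z] * p_theta[t]
              for z in range(3)] for t in range(2)]
    s = sum(sum(row) for row in joint)
    return [sum(row) / s for row in joint]
```

No analogous construction instantiates the reciprocal 1/Z(θ) as extra summed-over variables, which is exactly the difficulty the text goes on to discuss.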
But could there be some similar way of instantiating latent variables to remove the need to compute Z(θ)? A later section provides such a method, with an infinite number of latent variables. But how do we know there is not a better choice? In general, must all simple algorithms fail?

The figure above shows a hypothetical unnormalized probability function for two settings of its parameters. An obvious but important observation is that the change in an unnormalized probability function evaluated at an observation does not necessarily tell us anything about the change in the likelihood of the parameters; in fact we must consider the parameters' effect on the entire observation space. This is in sharp contrast to MCMC sampling of the y variables for fixed parameters, where only two settings need be considered at a time. If we are not going to compute Z(θ) explicitly, then a valid MCMC algorithm must have some other source of global information that is sensitive to changes anywhere in observation space. One of the simplest ways to get information from a probability distribution is, of course, through a sample. A sample x ~ f(x; θ')/Z(θ'), using the same function as panel (b) of the figure, could easily land in the new region of high probability on the left; noticing that f(x; θ') > f(x; θ) gives some indication that any perceived benefit of f(y; θ') > f(y; θ) should be penalized. Despite the apparent paucity of such information, surprisingly, these single samples at each parameter setting are sufficient to create valid MCMC algorithms for p(θ|y) that do not need Z(θ). This is how the approach in Møller et al. works, which explains why it requires samples from the target distribution using an exact or perfect sampling method. Approximate samples from a few MCMC steps cannot guarantee considering the new bubble in data space in panel (b): a Markov chain started at, e.g., the observation will be heavily biased towards a mode near the starting point and may never consider any other modes.

Are spurious modes in data space a problem in practice? Contrastive divergence learning (Hinton) uses very brief MCMC sampling starting at the observed data, and can give useful results on complex problems in machine learning. Sometimes we know the parameter posterior is simple; it is log-concave for fully-observed exponential-family distributions. In such cases, deterministic approximations to the parameter posterior perform favorably compared to pseudo-MCMC approaches (Welling and Parise). However, if we wish to construct a valid MCMC method, we must formally show that the entire data space has been properly considered, as is done in exact sampling.
It explicitly tracks bounds that start off considering the entire observation space. A procedure with this flavor seems an essential part of samplers for doubly-intractable distributions: even though a new latent variable approach introduced later does not use exact sampling as such, its requirements are, and must be, very similar. Exact sampling is a formidable requirement, so deterministic approaches and MCMC methods that do not use the correct posterior distribution will always have a place; some possibilities are studied in the next section. There is also always a place for gold-standard methods that, given enough time, should give correct answers; those are the focus of the remainder of the chapter.

Approximation schemes

For concreteness, this section studies a simple but widespread type of graphical model. The Boltzmann machine (BM) is a Markov random field which defines a probability distribution over a vector of binary variables s = (s_1, ..., s_K), where

    p(s|W) = (1/Z(W)) exp( Σ_{ij} W_ij s_i s_j ).

The symmetric weight matrix W parameterizes this distribution. In a BM there are usually also linear bias terms Σ_i b_i s_i in the exponent; with these, the model is equivalent to a Potts or Ising model with magnetic field parameters. We omit these biases to simplify notation, although the models in the experiments include them.

The usual algorithm for learning BMs is a maximum-likelihood version of the EM algorithm, assuming some of the variables are hidden, s_h, and some observed, s_o (Ackley et al.). The gradient of the log probability is

    ∂ log p(s_o|W) / ∂W_ij = E_c[s_i s_j] − E_u[s_i s_j],

where E_c denotes expectation under the clamped data distribution p(s_h|s_o, W) and E_u denotes expectation under the unclamped distribution p(s|W). For a data set {s^(1), ..., s^(N)} of i.i.d. data, the gradient of the log likelihood is simply summed over n. For Boltzmann machines with large treewidth, discussed in an earlier section, these expectations would take exponential time to compute, and the usual approach is to approximate them using Gibbs sampling or one of many more recent approximate inference algorithms.

Targets for MCMC approximation

Metropolis-Hastings for the parameters of a Boltzmann machine, given fully observed data, needs to bound

    a = [ p(s|W') p(W') q(W; W', s) ] / [ p(s|W) p(W) q(W'; W, s) ]
      = ( Z(W) / Z(W') )^N [ p(W') q(W; W', s) ] / [ p(W) q(W'; W, s) ] exp( Σ_{ij} (W'_ij − W_ij) Σ_n s_i^(n) s_j^(n) ).

The first class of approximation we will pursue is to substitute a deterministic approximation Ẑ(W) for Z(W) in the above expression. Clearly this results in an approximate sampler, which does not converge to the true equilibrium distribution over parameters; moreover, it seems reckless to take an approximate quantity to the N-th power. Despite these caveats, we explore empirically whether approaches based on this class of approximation are viable. Note that, above, we need only compute the ratio of the partition function at pairs of parameter settings.
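The "reckless" N-th power can be quantified with one line of arithmetic: a relative error ε in an estimated partition-function ratio is amplified to roughly a factor e^{Nε} in the acceptance ratio for N i.i.d. data points. With hypothetical numbers:

```python
def amplification(rel_err, n_data):
    """Factor by which the acceptance ratio is distorted when a partition-function
    ratio estimate with relative error rel_err is raised to the power n_data."""
    return (1.0 + rel_err) ** n_data

# e.g. a 1% error in the ratio, compounded over 100 observations,
# distorts the acceptance ratio by roughly a factor of e (about 2.7)
factor = amplification(0.01, 100)
```

So even small systematic errors in Ẑ(W) can overwhelm the likelihood terms once a realistic amount of data is observed.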
This ratio, Z(W')/Z(W), can be approximated directly by importance sampling:

    Z(W') / Z(W) = E_{p(s|W)}[ exp( Σ_{ij} (W'_ij − W_ij) s_i s_j ) ].

Thus any method for estimating expectations under p(s|W), sampling-based or deterministic, can be nested into the Metropolis sampler for W. For small steps W → W', estimating ratios of normalizers and finding gradients with respect to the parameters are closely related problems. Gradients may be more useful, as they provide a direction in which to move, which is useful in algorithms based on dynamical systems such as hybrid Monte Carlo, reviewed in an earlier subsection.
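A minimal numerical check of this importance-sampling identity, on a Boltzmann machine small enough to enumerate: exact draws from p(s|W) stand in for the MCMC or deterministic approximations discussed in the text, and the weights and sample counts are illustrative assumptions.

```python
import itertools, math, random

K = 4  # binary units s_i in {0, 1}; 2^K states, so everything is enumerable

def energy(s, W):
    """Sum over pairs i < j of W[i][j] * s_i * s_j (upper triangle of W)."""
    return sum(W[i][j] * s[i] * s[j]
               for i in range(K) for j in range(i + 1, K))

def log_Z(W):
    return math.log(sum(math.exp(energy(s, W))
                        for s in itertools.product([0, 1], repeat=K)))

def ratio_by_importance_sampling(W, W2, n_samples=20000, seed=4):
    """Estimate Z(W2)/Z(W) = E_{p(s|W)}[exp(sum_ij (W2_ij - W_ij) s_i s_j)]."""
    rng = random.Random(seed)
    states = list(itertools.product([0, 1], repeat=K))
    lz = log_Z(W)
    probs = [math.exp(energy(s, W) - lz) for s in states]
    total = 0.0
    for _ in range(n_samples):
        s = rng.choices(states, weights=probs)[0]   # exact draw from p(s|W)
        total += math.exp(energy(s, W2) - energy(s, W))
    return total / n_samples
```

For small W → W2 steps the importance weights are close to 1, so the estimator's variance is low, matching the remark that ratio estimation and gradient estimation are closely related in this regime.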
Hybrid Monte Carlo would also require a Z-ratio for its accept/reject step; this effort may not be justified when the gradients and Z(W) are only available as approximations. Simpler schemes that use gradient information also exist (Neal). The simplest of these is the uncorrected Langevin method: parameters are updated, without any rejections, according to the rule

    θ_i ← θ_i + (ε²/2) ∂ log p(y, θ)/∂θ_i + ε n_i,

where the n_i are independent draws from a zero-mean, unit-variance Gaussian. Intuitively this rule performs gradient descent but explores away from the optimum through the noise term; strictly, it is only an approximation except in the limit of vanishing step size ε. Using the above or other dynamical methods, a third target for approximation, for systems with continuous parameters, is the gradient of the joint log probability. In the case of BMs we have

    ∂ log p(s, W)/∂W_ij = Σ_n s_i^(n) s_j^(n) − N E_{p(s|W)}[s_i s_j] + ∂ log p(W)/∂W_ij.

Assuming an easy-to-differentiate prior, the main difficulty arises, as in the maximum-likelihood gradient above, from computing the middle term: the unclamped expectations over the variables. Interestingly, although many learning algorithms for undirected models (e.g. the original Boltzmann machine learning rule) are based on computing such gradients, and it would be simple to plug these into approximate stochastic-dynamics MCMC methods to do Bayesian inference, this approach does not appear to have been investigated. We explore this approach in our experiments.

This section has considered two existing sampling schemes, Metropolis and Langevin, and identified three targets for approximation to make these schemes tractable: Z(W), Z(W')/Z(W), and E_{p(s|W)}[s_i s_j], with the Boltzmann machine case made explicit.
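The uncorrected Langevin rule is easy to sketch on a toy target. With a standard-normal "posterior" the update is linear, and the chain's exact stationary variance works out to 1/(1 − ε²/4) rather than 1, which makes the vanishing-ε approximation explicit. The target and settings here are illustrative assumptions.

```python
import random

def langevin_samples(n_steps=50000, eps=0.2, seed=5):
    """Uncorrected Langevin on log p(theta) = -theta^2/2 (standard normal).

    Update: theta <- theta + (eps^2 / 2) * grad_log_p(theta) + eps * n,
    with n ~ N(0, 1) and no accept/reject step.
    """
    rng = random.Random(seed)
    theta, out = 0.0, []
    for _ in range(n_steps):
        grad = -theta                      # d/dtheta of -theta^2/2
        theta = theta + 0.5 * eps * eps * grad + eps * rng.gauss(0.0, 1.0)
        out.append(theta)
    return out

def stationary_variance(eps):
    """Exact stationary variance of the linear chain above: 1 / (1 - eps^2/4)."""
    a = 1.0 - 0.5 * eps * eps
    return eps * eps / (1.0 - a * a)
```

The bias here is benign, but for a Boltzmann machine the gradient itself would also be approximate, so the two sources of error compound.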
While the derivations above focused on Boltzmann machines, these same expressions generalize in a straightforward way to Bayesian parameter inference in a general undirected model of the form given earlier. In particular, many undirected models of interest can be parameterized to have potentials in the exponential family,

    E_j(y_{C_j}; θ_j) = θ_j^T u_j(y_{C_j}).

For such models, the key ingredients of an approximation are the expected sufficient statistics, E_{p(y|θ)}[u_j(y_{C_j})].

Approximation algorithms

In this section, approximations for each of the three target quantities identified above are described. These are used to propose a variety of approximate sampling methods for doubly-intractable distributions, first outlined in Murray and Ghahramani.

Variational lower bounds were developed in statistical physics but are also widely used in machine learning. They use Jensen's inequality to lower-bound the log partition function in the following way:

    log Z = log Σ_x f(x) = log Σ_x q(x) f(x)/q(x) ≥ Σ_x q(x) log [ f(x)/q(x) ],

so that

    log Z ≥ Σ_x q(x) log f(x) + H(q) ≡ F(q).

The relationship holds for any distribution q(x), provided it is not zero where p(x) has support. The second term in the bound is the entropy of the distribution, H(q) = −Σ_x q(x) log q(x); the overall bound F is often called the free energy. See Winn and Bishop for more details and a framework for automatic construction and optimization of variational bounds for a large class of graphical models. The naive mean-field method is a variational method with q constrained to belong to the set of fully factorized distributions,

    Q_MF = { q : q(x) = Π_i q_i(x_i) }.
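As a concrete instance of the bound, here is a sketch of naive mean field on a tiny Boltzmann machine with made-up weights: coordinate updates of the factorized q cannot decrease F(q), and the converged bound can be compared against log Z computed by enumeration.

```python
import itertools, math

W = [[0.0, 0.5, -0.3],
     [0.5, 0.0, 0.8],
     [-0.3, 0.8, 0.0]]    # symmetric weights, illustrative values
K = len(W)

def log_Z_exact():
    """Brute-force log Z for p(s) proportional to exp(sum_{i<j} W_ij s_i s_j)."""
    return math.log(sum(math.exp(sum(W[i][j] * s[i] * s[j]
                                     for i in range(K) for j in range(i + 1, K)))
                        for s in itertools.product([0, 1], repeat=K)))

def entropy(m):
    if m <= 0.0 or m >= 1.0:
        return 0.0
    return -m * math.log(m) - (1.0 - m) * math.log(1.0 - m)

def free_energy(m):
    """F(q) = E_q[log f] + H(q) <= log Z, for q(s) = prod_i q_i(s_i)
    parameterized by the means m_i = q_i(s_i = 1)."""
    e = sum(W[i][j] * m[i] * m[j] for i in range(K) for j in range(i + 1, K))
    return e + sum(entropy(mi) for mi in m)

def mean_field(n_sweeps=100):
    """Coordinate-ascent fixed point: each update maximizes F over one m_i."""
    m = [0.5] * K
    for _ in range(n_sweeps):
        for i in range(K):
            field = sum(W[i][j] * m[j] for j in range(K) if j != i)
            m[i] = 1.0 / (1.0 + math.exp(-field))
    return m
```

In the mean-field Metropolis algorithm defined next, exp(F(q)) at convergence plays the role of Z_MF.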
for the boltzmann machine a local maximum of this lower bound log zmf maxqqmf f
q can be found with an iterative and tractable fixedpoint algorithm see for exa
mple mackay chapter let the meanfield metropolis algorithm be defined by using z
mf in place of z in the acceptance probability computation equation the expectat
ions from the na mean field algorithm could also be used to compute direct ve ap
proximations to the gradients in equation for use in a stochastic dynamics metho
d jensens inequality can be used to obtain much tighter bounds than those given
by the na meanfield method for example when constraining q to be in the set of a
ll ve treestructured distributions qtree optimizing the lower bound on the parti
tion function is still tractable wiegerinck obtaining ztree z the tree metropoli
s algorithm is defined through the use of this approximation in equation alterna
tively expectations under the tree could be used to form the gradient estimate f
or a stochastic dynamics method equation bethe approximation a recent justificat
ion for applying belief propagation to graphs with cycles is the relationship be
tween this algorithms messages and the fixed points of the bethe free energy yed
idia et al this breakthrough gave a new approximation for the partition function
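As a concrete illustration of the naive mean-field bound above, here is a minimal sketch (invented for this example, not code from the thesis) for a tiny Boltzmann machine with 0/1 units, a symmetric weight matrix W with zero diagonal, and biases b. Whatever mean vector the fixed-point iteration reaches, the returned quantity F(q) is a lower bound on log Z, which can be checked against exhaustive enumeration:

```python
import itertools, math

def logZ_exact(W, b):
    # enumerate all binary states of a tiny Boltzmann machine
    n = len(b)
    return math.log(sum(
        math.exp(sum(W[i][j]*s[i]*s[j] for i in range(n) for j in range(i+1, n))
                 + sum(b[i]*s[i] for i in range(n)))
        for s in itertools.product([0, 1], repeat=n)))

def logZ_meanfield(W, b, iters=200):
    # naive mean-field: fully factorized q with means m_i, iterated
    # towards a fixed point m_i = sigmoid(b_i + sum_j W_ij m_j)
    n = len(b)
    m = [0.5]*n
    sig = lambda a: 1.0/(1.0 + math.exp(-a))
    for _ in range(iters):
        for i in range(n):
            m[i] = sig(b[i] + sum(W[i][j]*m[j] for j in range(n) if j != i))
    # F(q) = <log f>_q + H(q), a lower bound on log Z for any m in (0,1)^n
    H = sum(-mi*math.log(mi) - (1-mi)*math.log(1-mi) for mi in m)
    E = sum(W[i][j]*m[i]*m[j] for i in range(n) for j in range(i+1, n)) \
        + sum(b[i]*m[i] for i in range(n))
    return E + H

W = [[0, 0.5, -0.3], [0.5, 0, 0.8], [-0.3, 0.8, 0]]
b = [0.1, -0.2, 0.3]
assert logZ_meanfield(W, b) <= logZ_exact(W, b) + 1e-9
```

Note that the bound holds for any factorized q, so the assertion is valid even if the fixed-point iteration has not fully converged.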
In the loopy Metropolis algorithm, belief propagation is run on each proposed system and the Bethe free energy is used to approximate the acceptance probability (equation …). Traditionally, belief propagation is used to compute marginals; pairwise marginals can be used to compute the expectations used in gradient methods (e.g. equation …) or in finding partition function ratios (equation …). These approaches lead to different algorithms, although their approximations are clearly closely related.

Langevin using brief sampling

The pairwise marginals required in equations … can be approximated by MCMC sampling. The Gibbs sampler used in Subsection … is a popular choice, whereas in Subsection … a more sophisticated Swendsen-Wang sampler is employed. Unfortunately, as in maximum likelihood learning (equation …), the parameter-dependent variance of these estimates can hinder convergence and introduce biases. The brief Langevin algorithm, inspired by work on contrastive divergence (Hinton), uses very brief sampling starting from the data y, which gives biased but low-variance estimates of the required expectations. As the approximations in this section are run as an inner loop of the main sampler, the cheapness of brief sampling makes it an attractive option.

Langevin using exact sampling

Unbiased expectations can be obtained in some systems by using an exact sampling algorithm (Section …). Although the gradients from this method are guaranteed to be unbiased, parameter-dependent variance could lead to worse performance than the proposed brief Langevin method. The variance could be reduced by reusing pseudo-random numbers; however, we shall see later that there are much more elegant ways to use an exact sampler, if one is available.
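The brief-sampling idea above can be sketched concretely (a toy illustration, not code from the thesis). The model term in the Boltzmann-machine log-likelihood gradient, ⟨x_i x_j⟩_data − ⟨x_i x_j⟩_model, is replaced by an estimate obtained after only a few Gibbs sweeps started at the data; the function names and the tiny data set are invented for this sketch:

```python
import math, random

def gibbs_sweep(x, W, b, rng):
    # one systematic-scan Gibbs sweep over binary 0/1 units
    n = len(x)
    for i in range(n):
        a = b[i] + sum(W[i][j]*x[j] for j in range(n) if j != i)
        x[i] = 1 if rng.random() < 1.0/(1.0 + math.exp(-a)) else 0
    return x

def brief_gradient(data, W, b, sweeps=1, seed=0):
    # dL/dW_ij ~ <x_i x_j>_data - <x_i x_j>_brief, where the second
    # expectation uses a few Gibbs sweeps started at each data point
    # (biased, but low variance and cheap)
    rng = random.Random(seed)
    n = len(b)
    pos = [[0.0]*n for _ in range(n)]
    neg = [[0.0]*n for _ in range(n)]
    for y in data:
        x = list(y)
        for _ in range(sweeps):
            gibbs_sweep(x, W, b, rng)
        for i in range(n):
            for j in range(n):
                pos[i][j] += y[i]*y[j]/len(data)
                neg[i][j] += x[i]*x[j]/len(data)
    return [[pos[i][j] - neg[i][j] for j in range(n)] for i in range(n)]

data = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0)]
W = [[0, 0.2, 0.0], [0.2, 0, 0.1], [0.0, 0.1, 0]]
b = [0.0, 0.0, -0.5]
g = brief_gradient(data, W, b)
```

With more sweeps the estimate approaches the (higher-variance) unbiased model expectation; the bias-variance trade-off is the point of the method.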
Pseudo-likelihood

Replacing the likelihood of the parameters with a tractable product of conditional probabilities is a common approximation in Markov random fields for image modeling. One of the earliest Bayesian approaches to learning in large systems of which we are aware was in this context (Wang et al.; Yu and Cheng). The models used in the experiments of Subsection … were not well approximated by the pseudo-likelihood, so it is not explored further here.

Extension to hidden variables

So far we have only considered models of the form p(y;θ) in which all variables y are observed. Often models need to cope with missing data, or have variables that are always hidden; these are often the models that would most benefit from a Bayesian approach to learning the parameters. In fully observed models in the exponential family, the parameter posteriors are often relatively simple, as they are log-concave if the prior used is also log-concave. As seen later in Figure …, the parameter posterior with hidden variables will be a linear combination of log-concave functions, which need not be log-concave and can be multimodal.

In theory, the extension to hidden variables is simple. First consider a model p(y, h; θ), where h are unobserved variables. The parameter posterior is still proportional to p(y;θ)p(θ), and we observe

    p(y;θ) = Σ_h p(y, h; θ) = (1/Z(θ)) Σ_h Π_j f_j(y, h_{c_j}; θ_j),

so that

    log p(y;θ) = −log Z(θ) + log Σ_h Π_j f_j(y, h_{c_j}; θ_j) ≡ −log Z(θ) + log Z_y(θ).

That is, the sum in the second term is a partition function, Z_y(θ), for an undirected graph over the variables h; to see this, compare with equation … and consider the fixed observations y as parameters of the potential functions. In a system with multiple i.i.d. observations, Z_y(θ) must be computed for each setting of y. Note, however, that these additional partition function evaluations are for systems smaller than the original. Therefore, any method that approximates Z, or related quantities, directly from the parameters can still be used for parameter learning in systems with hidden variables. The brief-sampling and pseudo-likelihood approximations rely on settings of every variable provided by the data; for systems with hidden variables, these methods could use settings from samples conditioned on the observed data. In some systems this sampling can be performed easily (Hinton). In Subsection …, several steps of MCMC sampling
over the hidden variables are performed in order to apply the brief Langevin method.

Experiments involving fully observed models

The approximate samplers described in Subsection … were tested on three systems. The first, taken from Edwards and Havránek, lists six binary properties detailing risk factors for coronary heart disease in men. Modeling these variables as outputs of a fully-connected Boltzmann machine, we attempted to draw samples from the distribution over the unknown weights. We can compute Z exactly in this system, which allows us to compare methods against a Metropolis sampler with an exact inner loop. A previous Bayesian treatment of these data also exists (Dellaportas and Forster). While predictions would not need as many samples, we performed sampling for … iterations to obtain reasonable histograms for each of the weights (Figure …). The mean-field, tree and loopy Metropolis methods each proposed changes to one parameter at a time, using a zero-mean Gaussian with variance …; the brief Langevin method used a step size of ….

Qualitatively, the results are the same as those reported by Dellaportas and Forster: parameters deemed important by them have very little overlap with zero. The mean-field Metropolis algorithm failed to converge, producing noisy and wide histograms over an ever-increasing range of weights (Figure …).

Figure …: Histograms of samples for every parameter in the heart disease risk factor model (panels w_aa through w_ff). Results from exact Metropolis are shown in solid blue, loopy Metropolis in dashed purple, and brief Langevin in dotted red; these curves are often indistinguishable. The mean-field and tree Metropolis algorithms performed very badly; to reduce clutter these are shown only once each, in the plots for w_aa and w_ab respectively, in dash-dotted black.

Figure …: Loopy Metropolis is shown in dashed blue, brief Langevin in solid black. Left: an example histogram, as in Figure …, for the …-edge BM; the vertical line shows the true weight. Also shown are the fractions of samples f within … of the true value for every parameter in the …-edge system (center) and the …-edge system (right). The parameters are sorted by f for clarity; higher curves indicate better performance.

The sampler with the tree-based inner loop did not always converge either and, when it did, its samples did not match those of the exact Metropolis algorithm very well. The loopy Metropolis and brief Langevin methods closely match the marginal distributions predicted by the exact Metropolis algorithm for most of the weights. Results are not shown for algorithms using expectations from loopy belief propagation in equation … or equation …, as these gave almost identical performance to loopy Metropolis based on equation ….

Our other two test systems are …-node Boltzmann machines and demonstrate learning where exact computation of Z(w) is intractable. We considered two randomly generated systems, one with … edges and another with …. Each of the parameters not set to zero, including the biases, was drawn from a unit Gaussian. Experiments on an artificial system allow comparisons with the true weight matrix; we ensured our training data were drawn from the correct distribution with an exact sampling method (Childs et al.), reviewed in Subsection …. This level of control would not be available on a natural data set. The loopy Metropolis algorithm and the brief Langevin method were applied to … data points from each system; the model structure was provided, so that only non-zero parameters were learned. Figure … shows a typical histogram of parameter samples; the predictive ability over all parameters is also shown. Short runs on similar systems with stronger weights show that loopy Metropolis can be made to perform arbitrarily badly, more quickly than the brief Langevin method, on this class of system.

Experiment involving hidden variables

Finally, we consider an undirected model approach taken from work on semi-supervised learning
in Zhu and Ghahramani. Here a graph is defined using the 2-d positions x_i = (x_i, y_i) of unlabeled and labeled data. The variables on the graph are the class labels s = {s_i} of the points. The joint model for the l labeled points and u unobserved (hidden) variables is

    p(s | x; σ) = (1/Z(σ)) exp( Σ_{i,j}^{l+u} w_ij s_i s_j ),  where  w_ij = exp( −(x_i − x_j)²/σ_x² − (y_i − y_j)²/σ_y² ).

The edge weights of the model, w_ij, are functions of the Euclidean distance between points i and j, measured with respect to scale parameters σ = (σ_x, σ_y). Nearby points wish to be classified in the same way, whereas far-away points may be approximately uncorrelated, unless linked by a bridge of points in between. (These test sets are available online: http://www.gatsby.ucl.ac.uk/iamblug.)

Figure …: (a) A data set for semi-supervised learning with … variables: two groups of classified points and unlabeled data. (b) Approximate samples from the posterior over the parameters σ_x and σ_y (equation …); an uncorrected Langevin sampler was used, with gradients with respect to log σ approximated by a Swendsen-Wang sampler. (c) Approximate samples using loopy Metropolis. [Axes: σ_y against σ_x.]

The likelihoods in this model can be interesting functions of σ (Zhu and Ghahramani), leading to non-Gaussian and possibly multimodal parameter posteriors with any simple prior. As the likelihood is often a very flat function over some parameter regions, the MAP parameters can change dramatically with small changes in the prior. There is also the possibility that no single setting of the parameters can capture our knowledge.

When performing binary classification, equation …, which is a type of Potts model, can be rewritten as a standard Boltzmann machine. The edge weights w_ij are now all coupled through σ, so our sampler will only explore a two-dimensional parameter space (σ_x, σ_y); however, little of the above theory is changed by this. We can still approximate the partition function and use this in a standard Metropolis scheme, or apply Langevin methods based on equation …, where gradients include sums over edges.

Figure …(a) shows an example data set for this problem. This toy data set is designed to have an interpretable posterior over σ, and demonstrates the type of parameter uncertainty observed in real problems. We can see intuitively that we do not want σ_x or σ_y to be close to zero: this would disconnect all points in the graph, making the likelihood small; parameters that correlate nearby points that are the same will be much more probable under a large range of sensible priors. Neither can both σ_x and σ_y be large: this would force the two labeled clusters to be close, which is also undesirable. However, one of σ_x and σ_y can be large, as long as the other stays below around one. These intuitions are closely matched by the results shown in Figure …(b). This plot shows draws from the parameter posterior using the brief Langevin method, based on a Swendsen-Wang sampling inner loop described in Zhu and Ghahramani. We also reparameterized the posterior to take gradients with respect to log σ rather than σ; this is important for any unconstrained gradient method like Langevin. Note that predictions from typical samples of σ will vary greatly: for example, large σ_x predicts the unlabeled cluster in the top left as mainly one class, whereas large σ_y predicts the other. It would not be possible to obtain the same predictive distribution over labels with a single "optimal" setting of the parameters, as was pursued in Zhu and Ghahramani. This demonstrates how Bayesian inference over the parameters of an undirected model can have a significant impact on predictions.

Figure …(c) shows that loopy Metropolis converges to a very poor posterior distribution, which does not capture the long arms in Figure …(b). This is due to poor approximate partition functions from the inner loop: the graph induced by w contains many tight cycles, which cause problems for loopy belief propagation. As expected, loopy propagation gave sensible posteriors on other problems where the observed points were less dense and formed linear chains.

Discussion

Although MCMC sampling in general undirected models is intractable, there are a variety of approximate methods that can be brought to bear on this problem. We have proposed and explored a range of such approximations, including two variational approximations, brief sampling and the Bethe approximation, combined with Metropolis and Langevin methods; clearly many more approximations could be explored. The mean-field and tree-based Metropolis algorithms performed disastrously, even on simple problems. We believe these failures result from the use of a lower bound as an approximation: where the lower bound is poor, the acceptance probability for leaving that parameter setting will be exceedingly low; thus the sampler is often attracted towards extreme regions where the bound is loose, and does not return. The Bethe-free-energy-based Metropolis algorithm performs considerably better, and gave the best results on one of our artificial systems; however, it also performed terribly on our final application. In general, if an approximation performs poorly in the inner loop, then we cannot expect good parameter posteriors from the outer loop; in loopy propagation it is well known that poor approximations result for frustrated systems and for systems with large weights or tight cycles. The typically broader distributions from brief Langevin, and its less rapid failure with strong weights, mean that we expect it to be more robust than loopy Metropolis. Another advantage is the cost per iteration: the mean-field and belief propagation algorithms are iterative procedures with potentially large and uncontrolled costs for convergence. Brief Langevin gave reasonable answers on some large systems where the other methods failed, although it too can suffer from very large artifacts.
Even on the very simple heart disease data set, one of the posterior marginals was very different under this approximation; this actually means the whole joint distribution is in the wrong place. We now turn to valid MCMC algorithms, which are clearly required if correct inferences are important.

The exchange algorithm

Figure …: An augmented model with the original generative model for observations, p(θ, y), as a marginal. The exchange algorithm is a particular sequence of Metropolis-Hastings steps on this model. [The figure shows θ generating y through f(y;θ)/Z(θ), and θ′ ~ q(θ′;θ,y) generating x through f(x;θ′)/Z(θ′).]

The standard Metropolis-Hastings algorithm proposes taking the data y away from the current parameters θ, which would remove a factor including Z(θ) from the joint probability p(θ, y); the proposal also suggests giving the data to the new parameters θ′, which introduces a new factor including Z(θ′). If each parameter setting always "owned" a data set under the model's joint distribution, then we would not need to keep adding and removing these Z factors. We bring the proposed parameter θ′ into the model and give it a data set of its own, x. This augmented distribution, illustrated in Figure …, has joint probability

    p(y, x, θ, θ′) = p(θ) [f(y;θ)/Z(θ)] q(θ′;θ,y) [f(x;θ′)/Z(θ′)].

Given a setting of θ and y, a second setting of parameters is generated from an arbitrary distribution q(θ′;θ,y); a fantasy data set is then generated from p(x|θ′) = f(x;θ′)/Z(θ′), where the function f is the same as in the data-generating likelihood (equation …). The original joint distribution p(θ, y) is evidently maintained on adding these child variables to the graphical model. As long as we can generate fantasies from this prior model, two Monte Carlo operators are now feasible.

Operator one resamples x from its distribution conditioned on θ′ and y. This is a block-Gibbs sampling operator, which is naturally implemented by sampling θ′ from the proposal, followed by generating a fantasy for that parameter setting. Thus we are able to change θ′ as long as we can sample from p(x|θ′). Resampling x at the same time removes the need to evaluate the change in Z, although this does require an exact sample: this will need a method like coupling from the past (Section …); just having a Markov chain with stationary distribution p(x|θ′) is not sufficient.

Operator two is a Metropolis move that proposes swapping the values of the two parameter settings, (θ, θ′) → (θ′, θ). As the proposal is symmetric, the acceptance ratio is simply the ratio of joint probabilities before and after the swap:

    a = [ q(θ;θ′,y) p(θ′) f(y;θ′) f(x;θ) Z(θ) Z(θ′) ] / [ q(θ′;θ,y) p(θ) f(y;θ) f(x;θ′) Z(θ) Z(θ′) ]
      = [ q(θ;θ′,y) p(θ′) f(y;θ′) f(x;θ) ] / [ q(θ′;θ,y) p(θ) f(y;θ) f(x;θ′) ].

Combining the two operators gives Algorithm …, the exchange algorithm.

Algorithm … Exchange algorithm
Input: initial θ, number of iterations S
for s = 1 … S:
    Propose θ′ ~ q(θ′; θ, y)
    Generate an auxiliary variable x ~ f(x;θ′)/Z(θ′)
    Compute the acceptance ratio
        a = [q(θ;θ′,y) p(θ′) f(y;θ′) f(x;θ)] / [q(θ′;θ,y) p(θ) f(y;θ) f(x;θ′)]
    Draw r ~ Uniform[0,1]; if r < a then set θ ← θ′
end for

Each step tries to take the data y away from the current parameter setting θ.
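The complete loop above can be sketched as follows (a toy illustration, not code from the thesis). The "doubly-intractable" model is a 4-spin Ising chain with f(y;θ) = exp(θ s(y)); because the state space is tiny, exact sampling can be done by enumeration, standing in for a method like coupling from the past. Note that Z(θ) never appears in the acceptance computation itself:

```python
import math, random

# 4-spin Ising chain, f(y; th) = exp(th * s(y)), s(y) = sum of neighbour products.
STATES = [tuple(2*b - 1 for b in map(int, format(k, "04b"))) for k in range(16)]
def s(y): return sum(y[i]*y[i+1] for i in range(3))

def exact_sample(th, rng):
    # enumeration stands in for an exact sampler (e.g. coupling from the past)
    w = [math.exp(th*s(x)) for x in STATES]
    r, acc = rng.random()*sum(w), 0.0
    for x, wi in zip(STATES, w):
        acc += wi
        if r <= acc:
            return x
    return STATES[-1]

def exchange(y, iters=5000, step=0.5, seed=1):
    rng = random.Random(seed)
    th, out = 0.0, []
    log_prior = lambda t: -0.5*t*t        # N(0,1) prior, up to a constant
    for _ in range(iters):
        thp = th + step*rng.gauss(0, 1)   # symmetric proposal q
        x = exact_sample(thp, rng)        # fantasy from f(.; thp)/Z(thp)
        log_a = (log_prior(thp) - log_prior(th)
                 + (thp - th)*s(y)        # log f(y;thp)/f(y;th)
                 + (th - thp)*s(x))       # log f(x;th)/f(x;thp)
        if rng.random() < math.exp(min(0.0, log_a)):
            th = thp
        out.append(th)
    return out

samples = exchange((1, 1, 1, 1))
```

For the all-equal observation used here the likelihood increases with θ, so the sampled posterior concentrates on positive couplings.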
We speculate that a better parameter setting is θ′, which was generated by q(θ′;θ,y). How can we persuade θ to give up the data to the rival parameter setting θ′? We offer it a replacement data set, x, from θ′'s distribution. If f(x;θ) > f(y;θ), then this replacement is preferred by θ to the real data y, which is a good thing. We have to consider both sides of the exchange: the ratio f(y;θ′)/f(x;θ′) measures how much θ′ likes the trade in data sets. Only by liking the data, and generating good replacements, can a parameter enter the set of posterior samples. Parameters that spread high f settings over many data sets will like the swap, but tend to generate unacceptable fantasies; this penalty replaces the need to compute Z.

Comparing the acceptance ratio to the infeasible MH algorithm's ratio (Algorithm …), we identify that the exchange algorithm replaces the ratio of normalizers with a one-sample unbiased importance sampling estimate (cf. equation …):

    Z(θ)/Z(θ′) ≈ f(x;θ)/f(x;θ′),  x ~ f(x;θ′)/Z(θ′).

This gives an interpretation of why the algorithm works, but it is important to remember that arbitrary estimators based on importance sampling, or on other methods for computing Z and its ratios, are unlikely to give a transition operator with the correct stationary distribution. We emphasize the "exchange" motivation because the algorithm is simply Metropolis-Hastings, which is already well understood, here applied to an augmented model.
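The unbiasedness of the one-sample estimator above can be checked by enumeration on a toy system (names invented for this sketch): the exact expectation of f(x;θ)/f(x;θ′) under x ~ f(·;θ′)/Z(θ′) equals the true ratio of normalizers.

```python
import math

# Tiny 3-spin Ising chain: enumerate states to compute exact quantities.
STATES2 = [tuple(2*b - 1 for b in map(int, format(k, "03b"))) for k in range(8)]
def s2(x): return sum(x[i]*x[i+1] for i in range(2))
def Z2(th): return sum(math.exp(th*s2(x)) for x in STATES2)

def estimator_mean(th, thp):
    # E_{x ~ f(.;thp)/Z(thp)} [ f(x;th)/f(x;thp) ], computed exactly
    return sum(math.exp(thp*s2(x))/Z2(thp) * math.exp((th - thp)*s2(x))
               for x in STATES2)

# the expectation telescopes: sum_x exp(th*s2(x))/Z2(thp) = Z2(th)/Z2(thp)
assert abs(estimator_mean(0.7, -0.4) - Z2(0.7)/Z2(-0.4)) < 1e-12
```

Of course a single sample of this estimator can have high variance when θ and θ′ differ greatly, which is exactly the problem the bridging refinement below addresses.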
For more mathematical detail, see Subsection …, which provides a proof of the stationary distribution for a more general version of the algorithm.

Figure …: An alternative representation of the generative model for observations y. All possible parameter settings θ_j are instantiated, fixed, and used to generate a set of data variables x_j; the indicator i is used to set y = x_i. The posterior over i, the parameter chosen by the indicator variable, is identical to p(θ|y) in the original model.

Product space interpretation

The exchange algorithm was originally inspired by Carlin and Chib, which gives a general method for model comparison by MCMC. In this approach, every model is instantiated in the sampler, which explores a product space of all the models' parameter settings. An indicator variable, which is also sampled, specifies which model was used to generate the data; the posterior probability distribution over the indicator variable is the posterior distribution over models. Exchange moves in product spaces are common in the "parallel" or "replica" tempering methods in the physics literature (e.g. Swendsen and Wang; Geyer).

By temporarily assuming that there is only a finite, discrete set of possible parameters {θ_j}, each setting can be considered as a separate model. This suggests a somewhat strange joint distribution (see also Figure …):

    p({x_j}, y, i) = p(i) I(y = x_i) Π_j f(x_j;θ_j)/Z_j,

where I(y = x_i) is an indicator function enforcing y = x_i. This corresponds to choosing a parameter setting i from the prior and generating the observed data as before, but then also generating unobserved data sets using every other setting of the parameters. Although this only appears to be a convoluted restatement of the original generative process, all Z_j terms are now always present: their product is a constant. Each of the unobserved data sets x_{j≠i} can be updated by standard MCMC methods, such as Gibbs sampling, conditioned on i. The remaining data set, x_i = y, is not updated, because it has a single known observed value. To update i, we notice that an isolated MH proposal i → i′ will not work: it is exceedingly unlikely that x_{i′} = y. Instead, we couple i → i′ proposals with an exchange of the data sets x_i = y and x_{i′}. The MH acceptance ratio for this proposal is simple; much of the joint distribution in equation … cancels, leaving

    a = [ q(i;i′,y) p(i′) f(y;θ_{i′}) f(x_{i′};θ_i) ] / [ q(i′;i,y) p(i) f(y;θ_i) f(x_{i′};θ_{i′}) ].

The only remaining problem is that, if the number of parameter settings is very large, we require huge amounts of storage and a very long time to reach equilibrium; in particular, the method seems impractical if θ is continuous. The solution to these problems recreates the exchange algorithm of the previous section: we declare that at each time step all of the x variables are at equilibrium conditioned on the indicator variable i, but we do not need to store these values. Only when a swap of ownership between θ_i and θ_{i′} is proposed do we need to know the current setting of θ_{i′}'s data; we can then draw the value from its equilibrium distribution, x ~ p(x|θ_{i′}) = f(x;θ_{i′})/Z(θ_{i′}), "pretending" that this had already been computed. After
the exchange has been accepted or rejected, the intermediate quantity x can be discarded; we redraw any unobserved data set from its stationary distribution as required. This "retrospective sampling" trick has been used in a variety of other MCMC algorithms for infinite-dimensional models (Papaspiliopoulos and Roberts; Beskos et al.). An infinite-dimensional model and retrospective sampling are not required to describe the exchange algorithm, but they provide connections to the literature which might be useful for some readers; the product model works without exact sampling for models with a small number of parameter settings, and is also how the algorithm was originally conceived. Other readers may prefer the explanation in the previous section: even on continuous parameters, it uses Metropolis-Hastings on a finite auxiliary system, which is more established theoretically.

Bridging exchange algorithm

The Metropolis-Hastings acceptance rule has the maximum acceptance rate of any reversible transition operator that may only accept or reject proposals from q (see Subsection …). As the exchange algorithm is only approximately the same as the direct MH approach, in some sense it rejects moves that it should not. As we have much better approximations of normalizing-constant ratios than equation …, better methods should be possible.

How can good parameter proposals get rejected? It may be that θ′ is a much better explanation of the data y than the current parameters θ, and would be accepted under MH; however, the exchange algorithm can reject the swap because x ~ f(x;θ′)/Z(θ′) is improbable under θ. This makes movement unnecessarily slow, and suggests searching for a way to make swaps look more favorable. The exchange algorithm with bridging draws a fantasy as before, x₀ ~ f(x;θ′)/Z(θ′), but then applies a series of modifications, x₀ → x₁ → ⋯ → x_K, such that x_K is typically more appealing to θ. To describe the precise form of these modifications, we specify a new augmented joint distribution,

    p(y, x₀…x_K, θ, θ′) = p(θ) [f(y;θ)/Z(θ)] q(θ′;θ,y) [f(x₀;θ′)/Z(θ′)] Π_{k=1}^{K} T̃_k(x_k ← x_{k−1}),

which combines the original model, a parameter proposal, the generation of a fantasy, and bridging steps. We choose the T̃_k to be Markov chain transition operators with corresponding stationary distributions p_k. A convenient choice is

    p_k(x) ∝ f(x;θ′)^{β_k} f(x;θ)^{1−β_k} ≡ f_k(x),  where β_k = (K+1−k)/(K+1),

giving K intermediate distributions that bridge between p₀(x) = f(x;θ′)/Z(θ′) and p_{K+1}(x) = f(x;θ)/Z(θ).
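The geometric bridge just defined can be checked numerically; this small sketch (function names invented for illustration) verifies that f_k interpolates between f(·;θ′) and f(·;θ), and that for an exponential-family f each f_k is itself a model of the same form with an interpolated parameter:

```python
import math

# Geometric bridge f_k(x) = f(x; thp)^beta_k * f(x; th)^(1-beta_k),
# with beta_k = (K+1-k)/(K+1), so beta_0 = 1 and beta_{K+1} = 0.
def f_ising(x, th):
    # unnormalized Ising-chain potential exp(th * sum_i x_i x_{i+1})
    return math.exp(th * sum(x[i]*x[i+1] for i in range(len(x)-1)))

def f_bridge(x, thp, th, beta):
    return f_ising(x, thp)**beta * f_ising(x, th)**(1-beta)

x = (1, -1, 1, 1)
K = 3
betas = [(K+1-k)/(K+1) for k in range(K+2)]
assert abs(f_bridge(x, 1.1, 0.3, betas[0]) - f_ising(x, 1.1)) < 1e-12
assert abs(f_bridge(x, 1.1, 0.3, betas[-1]) - f_ising(x, 0.3)) < 1e-12
# for this exponential-family f, the bridge is an Ising model with the
# interpolated coupling beta*thp + (1-beta)*th:
assert abs(f_bridge(x, 1.1, 0.3, 0.5) - f_ising(x, 0.7)) < 1e-12
```

The last assertion illustrates why running Gibbs or Metropolis updates that leave p_k stationary is no harder than simulating the original model.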
Other bridging schemes could be used; in what follows we only assume the directional symmetry p_k(x; θ, θ′) = p_{K+1−k}(x; θ′, θ). Similarly, we require T̃_k(x′ ← x; θ, θ′) = T̃_{K+1−k}(x′ ← x; θ′, θ), which is easily achieved by always using the same type of transition operator for the same underlying stationary distribution.

As before, there are two possible Markov chain operators. Operator one: block-Gibbs sample (θ′, x₀…x_K) from

    p(θ′, x₀…x_K | θ, y) = q(θ′;θ,y) [f(x₀;θ′)/Z(θ′)] Π_{k=1}^{K} T̃_k(x_k ← x_{k−1}).

Operator two: propose swapping θ and θ′, whilst simultaneously reversing the order of the x_k sequence, as illustrated in Figure …. If the T̃_k operators satisfy detailed balance, i.e. T̃_k(x′ ← x) p_k(x) = T̃_k(x ← x′) p_k(x′), then the MH acceptance ratio for operator two does not depend on the details of T̃_k and is easily computed. The concatenation of the two operators results in Algorithm … (Figure …). Note that K = 0 reduces to the previous exchange algorithm.

Figure …: The proposed change under a bridged exchange. Given a current parameter θ, the algorithm generated an exact sample x₀ from θ′'s distribution, and a sequence of data sets that come from distributions closer and closer to p(x|θ). The swap move proposes making θ′ the current parameter and θ the owner of a fantasy; as x₀ was typical of θ′, not θ, the stack of auxiliary variables is reversed, so that it looks like it might have been generated under the original process.

Algorithm … Exchange algorithm with bridging
Input: initial θ, iterations S, bridging levels K
for s = 1 … S:
    Propose θ′ ~ q(θ′; θ, y)
    Generate an auxiliary variable by exact sampling: x₀ ~ p₀(x) = f(x;θ′)/Z(θ′)
    Generate K further auxiliary variables with transition operators:
        x₁ ~ T̃₁(x₁ ← x₀), x₂ ~ T̃₂(x₂ ← x₁), …, x_K ~ T̃_K(x_K ← x_{K−1})
    Compute the acceptance ratio
        a = [q(θ;θ′,y) p(θ′) f(y;θ′)] / [q(θ′;θ,y) p(θ) f(y;θ)] × Π_{k=0}^{K} f_{k+1}(x_k)/f_k(x_k)
    Draw r ~ Uniform[0,1]; if r < a then set θ ← θ′
end for

As before, the Π_{k=0}^{K} f_{k+1}(x_k)/f_k(x_k) term in the acceptance ratio corresponds to an unbiased estimate of Z(θ)/Z(θ′); the proof is that it is an annealed importance sampling (AIS) weight (equation …). This is the natural extension of the simple importance estimate (equation …) in the original exchange algorithm; linked importance sampling (Neal) could also be used as a drop-in replacement. The bridging extension to the exchange algorithm allows us to improve its implicit normalization-constant estimation, and so improve the acceptance rate, for some additional cost. Fortunately, no further expensive exact sampling, on top of that needed by the original algorithm, is required per iteration. The performance as a function of K is explored in Section ….

Details for proof of correctness

It turns out that the algorithm as stated
is also valid when the operators T̃_k do not satisfy detailed balance; we give the details of a proof for this more general case. We define a set of reverse Markov chain operators,

    T̃ᴿ_k(x′ ← x) = T̃_k(x ← x′) p_k(x′) / p_k(x).

Define the transition operator T by steps … of Algorithm …; let Tᴿ be defined by the same algorithm, but using T̃ᴿ_k rather than T̃_k. We now prove that both T and Tᴿ leave p(θ|y) stationary. The probability of the transition (θ → θ′, via x₀…x_K) under T, weighted by p(θ|y) = p(θ)f(y;θ)/(Z(θ)p(y)), is the probability of being at θ, proposing a move to θ′, generating x₀, generating x₁…x_K, and finally accepting:

    T(θ′, x₀…x_K ← θ) p(θ|y) = [p(θ)f(y;θ)/(Z(θ)p(y))] q(θ′;θ,y) [f(x₀;θ′)/Z(θ′)] [Π_{k=1}^{K} T̃_k(x_k ← x_{k−1})]
        × min( 1, [q(θ;θ′,y) p(θ′) f(y;θ′)] / [q(θ′;θ,y) p(θ) f(y;θ)] Π_{k=0}^{K} f_{k+1}(x_k)/f_k(x_k) ).

Rearranging, and using f₀(x) = f(x;θ′) and f_{K+1}(x) = f(x;θ), gives

    min( [p(θ) f(y;θ) q(θ′;θ,y) f(x₀;θ′) / (p(y) Z(θ) Z(θ′))] Π_{k=1}^{K} T̃_k(x_k ← x_{k−1}),
         [p(θ′) f(y;θ′) q(θ;θ′,y) f(x_K;θ) / (p(y) Z(θ) Z(θ′))] Π_{k=1}^{K} T̃_k(x_k ← x_{k−1}) f_k(x_{k−1})/f_k(x_k) ).

We substitute equation … (the definition of the reverse operators) into the second product and reorder its terms:

    min( [p(θ) f(y;θ) q(θ′;θ,y) f(x₀;θ′) / (p(y) Z(θ) Z(θ′))] Π_{k=1}^{K} T̃_k(x_k ← x_{k−1}),
         [p(θ′) f(y;θ′) q(θ;θ′,y) f(x_K;θ) / (p(y) Z(θ) Z(θ′))] Π_{k=1}^{K} T̃ᴿ_k(x_{k−1} ← x_k) ).

Now swapping θ and θ′, reversing the order of the auxiliary variables (i.e. mapping x₀…x_K → x_K…x₀), and swapping T and Tᴿ throughout only swaps the arguments of the min, leaving the value of the expression unchanged. Therefore, the probability of the reverse transition under Tᴿ, via the same intermediate values, is the same as in equation …; that is, for any values of the x_k,

    T(θ′, x₀…x_K ← θ) p(θ|y) = Tᴿ(θ, x_K…x₀ ← θ′) p(θ′|y).

Summing over all possible intermediate values x₀…x_K gives T(θ′ ← θ) p(θ|y) = Tᴿ(θ ← θ′) p(θ′|y); further summing over θ or θ′ shows that both T and Tᴿ leave p(θ|y) stationary. If, as was originally proposed, each T̃_k is reversible, then T = Tᴿ and the bridged exchange algorithm also satisfies detailed balance.

As an aside, we mention that the bridging scheme is very similar to that found in the method of dragging fast variables (Neal). Although dragging was presented using reversible transition operators, a similar proof
to the one above shows that it too can use non-reversible transition operators.

The single auxiliary variable method

The first valid MCMC algorithm for doubly-intractable distributions was discovered by Møller et al.; for comparison, this section reviews their method, which we call the single auxiliary variable method (SAVM). SAVM extends the original model to include a single auxiliary variable, x, which shares the same state space as y (Figure …):

    p(x, y, θ) = p(x|θ,y) [f(y;θ)/Z(θ)] p(θ).

The joint distribution p(θ, y) is unaffected. No known method of defining auxiliary variables removes Z(θ) from the joint distribution; however, through careful choice of q, explicit Z dependence can be removed from the MH ratio,

    a = [ p(x′, θ′|y) q(x, θ; x′, θ′, y) ] / [ p(x, θ|y) q(x′, θ′; x, θ, y) ].

Figure …: (a) The original model: unknown parameters θ generated observed variables y. (b) The SAVM augmented model. The conditional distribution of x must have a tractable θ dependence; in existing approaches this distribution is only a function of one of θ or y, e.g. f(x; θ̂(y))/Z(θ̂(y)), or a normalizable function of (x, θ).

A convenient form of proposal distribution is q(x′, θ′; x, θ, y) = q(θ′;θ,y) q(x′;θ′), which corresponds to the usual change in parameters, θ → θ′, followed by a choice for the auxiliary variable. If this choice, which happens to ignore the old x, uses q(x′;θ′) = f(x′;θ′)/Z(θ′), where f and Z are the same functions as in p(y|θ) (equation …), then the MH acceptance ratio becomes

    a = [ p(x′|θ′,y) p(θ′|y) q(x;θ) q(θ;θ′,y) ] / [ p(x|θ,y) p(θ|y) q(x′;θ′) q(θ′;θ,y) ]
      = [ f(y;θ′) p(θ′) q(θ;θ′,y) p(x′|θ′,y) f(x;θ) ] / [ f(y;θ) p(θ) q(θ′;θ,y) p(x|θ,y) f(x′;θ′) ],

in which every term can be computed. As with the exchange algorithm, the big assumption is that we can draw independent exact samples from the proposal distribution (equation …). The missing part of this description is the conditional distribution of the auxiliary variable, p(x|θ,y). This choice is not key to constructing a valid MH algorithm, but it will have a strong impact on the efficiency of the Markov chain. Normally we have a choice over the proposal distribution; here that choice is forced upon us, and instead we choose the target distribution p(x|θ,y) to match the proposals as closely as possible. We cannot maximize the acceptance rate by choosing p(x|θ,y) = f(x;θ)/Z(θ), as that would reintroduce explicit Z terms into the MH ratio. Møller et al. suggested two possibilities: (1) use a normalizable approximation to the ideal case, or (2) replace θ with a point estimate θ̂, such as the maximum pseudo-likelihood estimate based on the observations y. This gives an auxiliary distribution

    p(x|θ,y) = p(x|y) = f(x;θ̂)/Z(θ̂),

with a fixed normalization constant Z(θ̂), which will cancel in equation …. The broken lines in Figure …(b) indicate that, while x could be a child of both θ and y, in practice previous work has used only one of the possible parents. For concreteness, we assume p(x|θ,y) = f(x;θ̂(y))/Z(θ̂(y)) for some fixed θ̂(y) in all that follows, but our results are applicable to either case.
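On the same toy Ising chain used earlier, a SAVM step can be sketched as follows (an invented illustration, not code from the thesis). The auxiliary x is part of the chain state; its conditional uses a fixed, hand-picked point estimate th_hat, and the unknown normalizer Z(th_hat) cancels from the ratio:

```python
import math, random

# SAVM on a 4-spin chain: p(x | th, y) approximated by f(x; th_hat)/Z(th_hat)
# for a fixed point estimate th_hat (hand-picked here; Moller et al. suggest
# e.g. a maximum pseudo-likelihood estimate).
STATES4 = [tuple(2*b - 1 for b in map(int, format(k, "04b"))) for k in range(16)]
def s4(y): return sum(y[i]*y[i+1] for i in range(3))

def draw(th, rng):
    # enumeration stands in for an exact sampler
    w = [math.exp(th*s4(x)) for x in STATES4]
    r, acc = rng.random()*sum(w), 0.0
    for x, wi in zip(STATES4, w):
        acc += wi
        if r <= acc:
            return x
    return STATES4[-1]

def savm(y, th_hat=1.0, iters=4000, step=0.5, seed=2):
    rng = random.Random(seed)
    th, x = 0.0, draw(0.0, rng)
    out = []
    for _ in range(iters):
        thp = th + step*rng.gauss(0, 1)   # symmetric parameter proposal
        xp = draw(thp, rng)               # proposal for the auxiliary x
        log_a = (-0.5*thp*thp + 0.5*th*th          # N(0,1) prior
                 + (thp - th)*s4(y)                # f(y;thp)/f(y;th)
                 + th_hat*(s4(xp) - s4(x))         # p(x'|.)/p(x|.), Z(th_hat) cancels
                 + th*s4(x) - thp*s4(xp))          # f(x;th)/f(x';thp)
        if rng.random() < math.exp(min(0.0, log_a)):
            th, x = thp, xp
        out.append(th)
    return out

savm_samples = savm((1, 1, 1, 1))
```

Unlike the exchange sketch, the old auxiliary x enters the ratio through the point-estimate distribution, which is where a poor th_hat hurts the acceptance rate.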
Reinterpreting SAVM. Seen in the light of the product model in the figure, Møller et al.'s SAVM method appears slightly strange. SAVM can be reproduced by augmenting our joint model, containing variables {x_i}, with an additional arbitrary latent variable x (with no subscript), as used in SAVM. Then we can define the following proposal: draw j ~ q(j; i), then perform the deterministic three-way swap (x, x_i, x_j) ← (x_j, x, x_i). Before the swap x_i was equal to the observed data y, so after the swap x_j will be equal to the data now owned by θ_j. The acceptance ratio for this proposal is precisely that of SAVM. If we want to take y from θ_i and give it to rival setting θ_j, why involve a third party? Below we will see that the third party can make the transaction harder, or can mediate it: the bridging steps introduced earlier were specifically designed to make the swap more palatable. We now propose an extension of SAVM which has similar bridging steps.

MAVM: a tempered-transitions refinement

As with the exchange algorithm, SAVM's acceptance ratio can be seen as an approximation to the exact normalizing-constant evaluation in the idealized algorithm. SAVM uses the following two unbiased one-sample importance-sampling estimators:

    Z(θ)/Z(θ̂) ≈ f(x; θ)/f(x; θ̂),       x ~ f(x; θ̂)/Z(θ̂),
    Z(θ̂)/Z(θ') ≈ f(x'; θ̂)/f(x'; θ'),    x' ~ f(x'; θ')/Z(θ').

A biased estimate of Z(θ)/Z(θ') is obtained by combining these two estimates; the unknown constant Z(θ̂) fortuitously cancels and, amazingly, substituting this elementary approximation into the MH algorithm gave a valid method. As with the exchange algorithm, SAVM's importance-sampling estimators are very crude. A large mismatch between p(x|θ̂, y) and the exact-sampling proposal q(x|θ', y) can cause a high MH rejection rate. Bridging between these two distributions might help, which suggests replacing the two importance-sampling estimators with annealed importance sampling (AIS) estimates. This gives the algorithm below.
Our new algorithm has K auxiliary variables, and collapses to SAVM for K = 0; we call this method, with K > 0, the multiple auxiliary variable method (MAVM).

Algorithm: multiple auxiliary variable method (MAVM).
Input: initial (θ, x); number of iterations S; number of bridging levels K.
For s = 1 … S:
  1. Propose θ' ~ q(θ'; θ, y).
  2. Propose the first component of x' by exact sampling:
       x'_0 ~ p(x'_0|θ', y) = f(x'_0; θ')/Z(θ').
  3. Propose the remainder of x' using K transition-operator steps:
       x'_1 ~ T̃_1(x'_1 ← x'_0; θ', y), …, x'_K ~ T̃_K(x'_K ← x'_{K−1}; θ', y).
  4. Compute the acceptance ratio
       a = [f(y; θ') p(θ') q(θ; θ', y)] / [f(y; θ) p(θ) q(θ'; θ, y)]
           × ∏_{k=0}^{K} [f_{k+1}(x'_k)/f_k(x'_k)] · [f_k(x_k)/f_{k+1}(x_k)].
  5. Draw r ~ Uniform[0, 1]; if r < a then set (θ, x) ← (θ', x').
End for.

As in the bridged exchange algorithm, the ratios involving the auxiliary variables can be computed online as they are generated: there is no need to store the whole x ensemble. The T̃_k are any convenient transition operators that leave the corresponding distributions p_k stationary, where

    p_k(x) ∝ f(x; θ)^{β_k} f(x; θ̂)^{1−β_k} ≡ f_k(x),

with the β_k decreasing from β_0 = 1 towards zero, for example β_k = (K + 1 − k)/(K + 1). Other sequences of stationary distributions are possible. They must start at p_0(x) = f(x; θ)/Z(θ), as we are forced to draw an exact sample from this as part of the proposal; from there they should bridge towards the approximate, estimator-based distribution used by SAVM.

[Figure: the joint distribution for the annealing-based multiple auxiliary variable method (MAVM). Here it is assumed that p(x_K|θ, y) is based only on a data-driven parameter estimate θ̂. The auxiliary variables bridge towards the distribution implied by θ; the gray-level and thickness of the arrows from y and θ indicate the strengths of their influence on the auxiliary variables, which are controlled by the β_k.]

To motivate and validate this algorithm, we extend the auxiliary variable x to an ensemble of variables x = {x_0, x_1, …, x_K} (see the figure).
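The online accumulation of the bridging ratios in the acceptance rule can be sketched as follows, assuming the geometric bridge above with β_k = (K + 1 − k)/(K + 1); the quadratic log f is a placeholder for any unnormalized log-density:

```python
def log_f(x, theta):
    # Placeholder unnormalized log-density (assumption): log f(x; theta).
    return -0.5 * theta * x * x

def bridge_log_ratio(xs, theta, theta_hat, K):
    # Accumulates sum_k [log f_{k+1}(x_k) - log f_k(x_k)] one x_k at a
    # time, where f_k(x) = f(x;theta)^{b_k} f(x;theta_hat)^{1-b_k} and
    # b_k = (K+1-k)/(K+1), so the whole ensemble never needs storing.
    total = 0.0
    for k, x in enumerate(xs):
        b_k = (K + 1 - k) / (K + 1)
        b_k1 = (K - k) / (K + 1)
        lf_k = b_k * log_f(x, theta) + (1 - b_k) * log_f(x, theta_hat)
        lf_k1 = b_k1 * log_f(x, theta) + (1 - b_k1) * log_f(x, theta_hat)
        total += lf_k1 - lf_k
    return total
```

When θ = θ̂ every bridge level coincides and the accumulated log-ratio is zero, consistent with the estimate being exact in that case.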
We give x_K the same conditional distribution as the single auxiliary variable x in SAVM. The distribution over the remaining variables is defined by a sequence of Markov chain transition operators T̃_k:

    p(x_{K−1}|x_K, θ, y) = T̃_K(x_{K−1} ← x_K; θ, y)
    …
    p(x_0|x_1, θ, y) = T̃_1(x_0 ← x_1; θ, y),

where, as usual, each reverse operator is defined by T̃_k(x' ← x) p_k(x) = T_k(x ← x') p_k(x'). This defines a joint distribution p(x|θ, y) p(θ|y) over the ensemble. Treating the procedure in the algorithm above as a proposal q(θ', x'; θ, x), the acceptance rule is that of standard Metropolis-Hastings; as in other bridging schemes, the details of the transition operators cancel after substitution.

While we started by replacing SAVM's importance-sampling estimators with AIS, the resulting algorithm is more closely related to tempered transitions (Neal). Our approach has cheaper moves than standard tempered transitions, which would regenerate the whole auxiliary ensemble from p(x|θ, y) before every MH proposal; this exploits the generalization of tempered transitions introduced earlier. As with adding bridging to the exchange algorithm, MAVM makes SAVM a closer match to ideal Metropolis-Hastings sampling. There is an additional cost of K Markov chain steps per iteration, but no additional exact sampling, which might need many Markov chain steps. We have also provided an answer to an open question in Møller et al. on how to use both θ and y in the auxiliary distribution p(x|θ, y): we use y in coming up with a point estimate of the parameters, to get a distribution in roughly the right place, and then we bridge towards a better fit to f(x; θ), using ideas from annealing.
Comparison of the exchange algorithm and MAVM

We first consider a concrete example for which all computations are easy. This allows comparison with exact partition-function evaluation and averaging over chains starting from the true posterior. We consider sampling from the posterior of a single precision parameter θ, which has a likelihood corresponding to N i.i.d. zero-mean Gaussian observations y = {y_1, y_2, …, y_N}, with a conjugate prior:

    p(y_n|θ) = Normal(y_n; 0, 1/θ),    p(θ) = Gamma(θ; α, β).

The corresponding posterior is tractable,

    p(θ|y) = Gamma(θ; α + N/2, β + ½ Σ_n y_n²),

but we pretend that the normalizing constant in the likelihood is unknown. We compare the average acceptance rate of the algorithms for two choices of proposal distribution q(θ'; θ, y). All of the algorithms require N exact Gaussian samples, for which we used standard generators; for large N one could also generate the sufficient statistic Σ_n x_n² with a chi-squared routine. We also draw directly from the Gaussian stationary distributions p_k in the bridging algorithms. This simulates an ideal case where the energy levels are close, or the transition operators mix well; more levels would be required for the same performance with less efficient operators.
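The tractable posterior used as ground truth in this comparison follows from standard Gamma-Gaussian conjugacy; a minimal sketch (the hyperparameter values are placeholder assumptions):

```python
import random

def posterior_params(y, alpha=1.0, beta=1.0):
    # Conjugate update for the precision of zero-mean Gaussian data:
    # Gamma(alpha, beta) prior -> Gamma(alpha + N/2, beta + sum(y^2)/2).
    n = len(y)
    return alpha + 0.5 * n, beta + 0.5 * sum(v * v for v in y)

def sample_posterior(y, alpha=1.0, beta=1.0, rng=random):
    # Draw theta ~ p(theta|y); gammavariate takes shape and *scale*,
    # so the rate parameter b is inverted.
    a, b = posterior_params(y, alpha, beta)
    return rng.gammavariate(a, 1.0 / b)
```

Drawing proposals from this exact posterior gives the idealized a = 1 baseline used in the first experiment.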
We now report results for the N observations y shown in the figure. The first experiment uses proposals drawn directly from the parameter posterior above; the MH acceptance probability then becomes a = 1, and all proposals are accepted, when Z(θ) is computed exactly. Therefore any rejections are undesirable by-products of the auxiliary-variable schemes, which can only implicitly obtain noisy estimates of the normalizing constants. Figure (a) shows that both MAVM and the exchange algorithm improve over the SAVM baseline of Møller et al. It appears that a large number K of bridging levels is required to bring the acceptance rate close to the attainable a = 1; however, significant benefit is obtained from a relatively small number of levels, after which there are diminishing returns. As each algorithm requires an exact sample, which in applications can require many Markov chain steps, the improvement from a few extra steps K can be worth the cost. In this artificial situation the performance of MAVM was similar to that of the exchange algorithm. This result favors the exchange algorithm, which has a slightly simpler update rule and does not need to find a maximum pseudolikelihood estimate before sampling begins. In Figure (a) the estimate θ̂ had been set well; Figure (b) shows that the performance of MAVM falls off when this estimate is of poor quality. For moderate K, the exchange algorithm automatically obtains an acceptance rate similar to the best possible performance of MAVM. Only for the smallest K was its performance considerably worse than SAVM: for this simple posterior θ̂ sometimes manages to be a useful intermediary. With a few bridging levels, however, the exchange algorithm has caught up with MAVM. More importantly, the exchange algorithm performs significantly better than SAVM and MAVM in a more realistic situation, where the parameter proposal q(θ'; θ, y) is not ideal. Figure (c) shows results using a Gaussian proposal centered on the current parameter value. The exchange algorithm exploits the local nature of the proposal, rapidly obtaining the same acceptance rate as exactly evaluating Z(θ). MAVM performs much worse, although adding bridging levels does rapidly improve performance over the original SAVM algorithm.
SAVM is now hindered by θ̂, which is more rarely between θ and θ'. The posterior distribution over θ becomes sharper for larger N. This makes the performance of SAVM and MAVM fall off more rapidly as θ̂ is moved away from its optimum value: these methods require better estimates of θ̂ with larger datasets.

Ising model comparison. We have also considered the Ising model distribution, with spins y_i ∈ {±1} on a graph with nodes i and edges E:

    p(y|J, h) = (1/Z(J, h)) exp( J Σ_{(i,j)∈E} y_i y_j + h Σ_i y_i ).

We used the summary-states algorithm for exact sampling, and a single sweep of Gibbs sampling for the transition operators. Results are reported for y drawn from a model with known settings of h and J. Møller et al. used uniform priors over h and J; this only works, or seems to, for a q with small step sizes. The algorithms hang if a large J is proposed, because CFTP based on Gibbs sampling takes a very long time to return a sample in this regime. We used a uniform prior over a restricted range of J, and larger, closer-to-optimal step sizes.

[Figure: comparison of MAVM and the exchange algorithm learning a Gaussian precision. (a) Average acceptance rate as a function of K; MAVM with K = 0 corresponds to SAVM, the method of Møller et al.; exact normalizing-constant evaluation would give an acceptance rate of one. (b) Average acceptance rate as a function of the initial parameter estimate θ̂ required by SAVM (K = 0) and the extended version, MAVM; horizontal bars show the results for the exchange algorithm, which has no θ̂, for each K. (c) As in (a), but with a Gaussian proposal distribution centered on the current parameter setting; the horizontal line shows the maximum average acceptance rate for a reversible transition operator, obtained by exact normalizing-constant evaluation.]

[Figure: performance on a toroidal square Ising lattice. The data were generated from an exact sample with fixed J and h; proposals were Gaussian perturbations. The plots show efficiency, the effective number of samples (estimated by R-CODA, Cowles et al.) divided by the total number of Gibbs-sampling sweeps (computer time), as a function of K for the exchange algorithm, MAVM and SAVM.]

All the algorithms are expensive: SAVM requires many Gibbs sweeps of the Ising model to obtain one effective sample of the parameters. This could be improved by using more advanced exact-sampling methods. The figure gives some example results for two different step sizes. For both methods the number of effective samples obtained per iteration will tend to the maximum possible as K → ∞, and in practice the performance of the two methods does converge. However, these bridging steps come at a cost: the best K is a compromise between the computer time per iteration and the mixing time of the chain. In the examples we tried, the exchange algorithm provided better mixing times than SAVM/MAVM for all finite K. Bridging provided good improvements over SAVM, but only gave appreciable benefits to the exchange algorithm for larger step sizes.
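The Ising quantities used in these experiments can be sketched as follows; the toroidal L×L lattice and the systematic scan order are implementation assumptions:

```python
import math
import random

def log_f_ising(y, J, h):
    # Unnormalized log p(y|J,h) on a toroidal LxL lattice, spins in {-1,+1}:
    # J * sum over edges of y_i y_j + h * sum over sites of y_i.
    # Z(J, h) is the intractable part and never appears.
    L = len(y)
    edges = sum(y[i][j] * (y[i][(j + 1) % L] + y[(i + 1) % L][j])
                for i in range(L) for j in range(L))
    return J * edges + h * sum(map(sum, y))

def gibbs_sweep(y, J, h, rng=random):
    # One systematic-scan Gibbs sweep; leaves p(y|J,h) invariant.
    L = len(y)
    for i in range(L):
        for j in range(L):
            field = J * (y[i][(j + 1) % L] + y[i][(j - 1) % L]
                         + y[(i + 1) % L][j] + y[(i - 1) % L][j]) + h
            p_up = 1.0 / (1.0 + math.exp(-2.0 * field))
            y[i][j] = 1 if rng.random() < p_up else -1
    return y
```

A sweep like this is the transition operator T used both inside the bridging levels and (via CFTP) for exact sampling.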
Discussion

MCMC methods typically navigate complicated probability distributions by local diffusion; longer-range proposals will be rejected unless carefully constructed. It is usually the case that, as the step size of proposals is made sufficiently small, the acceptance rate of an MH method tends to one. However, SAVM does not have this property: it introduces rejections even as θ' → θ. The exchange algorithm has a → 1, for all K, as the step size tends to zero; MAVM will only recover a → 1 as K → ∞. This is because the third party θ̂ in the proposed swap is not necessarily close to θ, even in a simple unimodal one-dimensional posterior distribution (Figure (c)), and it is a significant disadvantage in comparison with the exchange algorithm. We found that the exchange algorithm performs better than the only other existing MCMC method for this problem, and is simpler to implement.
Latent history methods

Both MAVM and the exchange algorithm require exact samples, and use them to compute similar Metropolis-Hastings acceptance ratios. In this section we propose new algorithms which are also valid MCMC methods for doubly-intractable distributions, but which do not draw exact samples. They do follow a very similar coupling procedure, but the properties of the resulting algorithms are quite different. We start with an inspiring but impractical rejection-sampling algorithm for the parameters, given, for example, by Marjoram et al.:

Algorithm: simple rejection sampling for p(θ|y).
  1. Generate θ ~ p(θ).
  2. Simulate y' ~ p(y'|θ).
  3. Accept θ if y' = y.

On real problems, where the data can take on many values, this algorithm will rarely accept. Marjoram et al. suggest relaxing the equality in step 3 by accepting data sets that are close in some sense. They also suggest MCMC algorithms which diffuse the parameters in step 1 rather than drawing from the prior. These algorithms are valid samplers for the wrong distribution: p(θ | y' is near the observed value). This section will introduce an alternative approach, which tries to make it more probable that the fantasy data in step 2 is equal to the observed data, while keeping the validity of the original algorithm.

A generative model specifies a likelihood p(y|θ), but not necessarily the details of an algorithm to sample from this distribution. The latent history representation brings a particular sampling procedure into the description of the model. We declare that a stationary, ergodic Markov chain T was simulated for an infinite number of time steps, generating an arbitrarily-initialized sequence …, x_{−2}, x_{−1}, x_0, and that we observed the data y = x_0. The marginal probability of any x_t, including x_0, is the stationary distribution of the Markov chain, which we set to the target model's p(y|θ).
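The impractical rejection sampler above can be sketched directly; the uniform prior and the Bernoulli forward model are placeholder assumptions:

```python
import random

def prior_sample(rng=random):
    # theta ~ Uniform[0, 1] (placeholder prior).
    return rng.random()

def simulate(theta, d, rng=random):
    # Forward-simulate d binary observations given theta (placeholder model).
    return tuple(1 if rng.random() < theta else 0 for _ in range(d))

def rejection_sample(y, rng=random):
    # Propose from the prior, simulate fantasy data, accept only on an
    # exact match: valid, but the acceptance rate collapses as the data
    # space grows, which is the problem latent history methods attack.
    while True:
        theta = prior_sample(rng)
        if simulate(theta, len(y), rng) == y:
            return theta
```

Even for a handful of binary observations the loop already wastes most of its proposals; for realistic data spaces it is hopeless.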
Unraveling a Markov chain and writing it down as a generative model was a trick employed by Hinton et al. Our latent history representation is illustrated in the figure. As in coupling from the past, we also consider the random variables …, u_{−2}, u_{−1}, u_0 used by the Markov chain transition operator at each time step. Given θ and the u_t variables, which provide T with its source of randomness, each state is a deterministic function of the previous one: x_t = x(x_{t−1}, u_t).

[Figure: the latent history representation augments the observed data y and parameters θ with the history of a Markov chain …, x_{−2}, x_{−1}, with stationary distribution p(y|θ).]

If we can effectively sample over all of the unknowns in this graphical model, we have, by discarding the x_t and u_t, an algorithm to sample θ.

Metropolis-Hastings algorithm. The latent history representation is a distribution with an infinite number of latent variables. Although we cannot explicitly update all of these variables, MCMC sampling is notionally straightforward. The template below would be a valid MCMC procedure if we had an oracle to perform the infinite number of computations in its steps on our behalf. As an aside, we note that this algorithm is similar to algorithm F in Marjoram et al. for approximate Bayesian computation (ABC).

Algorithm: template for a latent history sampler.
Input: initial θ; number of iterations S.
For s = 1 … S:
  1. Block-Gibbs sample from p(x, u|θ, y).
  2. Propose θ' ~ q(θ'; θ, y).
  3. With probability min(1, p(θ') q(θ; θ', y) / [p(θ) q(θ'; θ, y)]):
     follow the deterministic map x_t = x(x_{t−1}, u_t) to find y' = x_0;
     if y' = y then accept the move.
End for.

The algorithm can feasibly be implemented if T provably coalesces to a single final state from any start state, given the sequence of random numbers u_t. This allows step 3 to be replaced with the procedure used by coupling from the past (CFTP) to identify y' while only examining some of the u_t's. Compared with the ABC algorithm there are two differences. Firstly, the equivalent of the data generation is here only performed after the check involving the prior and proposal probabilities; this ordering would also be advisable in the original ABC setting, assuming data simulation is more expensive than evaluation of prior probabilities. Secondly, the major difference: we sample the fantasy y' in a way that has a bias towards reproducing y. The algorithms are not directly comparable, however; the ABC community assumes that they cannot evaluate a function proportional to p(y|θ), which would make it difficult to construct the transition operators needed by our algorithm.

By combining steps 1 and 3, the coupling from the past can be terminated early if at any point it becomes clear that y' ≠ y. All we require now is an oracle that can return any u_t on request. Creating this service can replace the block-Gibbs sampling in step 1, as long as we are consistent with its infinite computation. Markov properties
in the model imply

    p(x, u|θ) = ∏_t p(x_t|x_{t−1}, θ) ∏_t p(u_t|x_{t−1}, x_t, θ).

This structure allows resampling of x_{−1}, x_{−2}, … in sequence, from

    p(x_{t−1}|x_t, θ) = p(x_t|x_{t−1}, θ) p(x_{t−1}|θ) / p(x_t|θ)
                      = T(x_t ← x_{t−1}) p(x_{t−1}) / p(x_t) ≡ T̃(x_{t−1} ← x_t),

the reverse transition operator corresponding to T. As long as p(u_t|x_{t−1}, x_t) is known for operator T, any u_t value can be sampled after the corresponding x_{t−1} and x_t are available. This means that step 1 of the template can be implemented lazily: construct an object that will provide u_t values on request, consistent with block-Gibbs sampling the entire model. This object will respond to a request for u_t by returning the value from memory if it has previously been sampled or, if necessary, by sampling back to x_{t−1} from the furthest-back x value that is known and returning a sample from p(u_t|x_{t−1}, x_t). Within each iteration all samples are stored for future use, to ensure consistency.

The practical implementation of a latent history Metropolis-Hastings sampler is summarized below. This includes an optional refinement, version B, which analytically integrates out u. The refinement requires computing the transition probability for one step of T, which is usually possible, and should increase the acceptance rate; it also makes the algorithm feasible for Markov chains that do not completely coalesce, for example on continuous state spaces. A disadvantage of version B is that its accept step may be difficult to implement without always identifying x_{−1}, whereas version A will be able to prematurely halt CFTP whenever y' ≠ y, which is often easy to prove.

Algorithm: Metropolis-Hastings latent history sampler.
Input: initial θ; number of iterations S; and an algorithm, CFTP, that can use a finite number of the random numbers to prove where an infinitely long Markov chain ends.
For s = 1 … S:
  1. Create a lazy u_t generator consistent with p(x, u|θ, y).
  2. Propose θ' ~ q(θ'; θ, y).
  3. Either, version A: with probability min(1, p(θ') q(θ; θ', y) / [p(θ) q(θ'; θ, y)]),
     identify whether y' ≡ x_0 is y, using CFTP and the u_t generator;
     if y' = y then accept the move.
     Or, version B: draw r ~ Uniform[0, 1]; identify enough about x_{−1} for the
     acceptance rule, using CFTP and the u_t's; accept the move if
       r < T(y ← x_{−1}; θ') p(θ') q(θ; θ', y) / [T(y ← x_{−1}; θ) p(θ) q(θ'; θ, y)].
End for.

Performance. A simple special case highlights large differences between latent history samplers and the previous auxiliary-variable algorithms. We first consider version A of the sampler, applied to a large collection of D independent Bernoulli variables sharing a single unknown parameter θ; this is an Ising model in the limit of zero connection strengths. Assume that the transition operator in use, T, is Gibbs sampling, which for these independent variables draws from the stationary distribution in one sweep. The sampler is implemented as follows: for each of the observed spins y_d, the corresponding u_d random variate is uniformly distributed in [0, θ] if the spin is up, and in [θ, 1] if the spin is down. Now updates to θ are constrained to lie within the two closest surrounding u_d values. This range has a typical scale of 1/D, while the posterior over the parameter has a width that scales as 1/√D. This suggests that it will take at least O(D) steps to equilibrate the single scalar parameter by random-walk sampling: disappointing, given the ideal conditions.

Notice that the performance of the algorithm depends on the details of the transition operators. An unreasonably clever transition operator could generate y if u < p(y|θ), and some other data set otherwise; then the probability of finding y' = y is min(1, p(y|θ')/p(y|θ)), giving an overall algorithm similar to the ideal Metropolis-Hastings algorithm (actually the two-stage rule described earlier). Neither MAVM nor the exchange algorithm has such dependence on the transition operators: how the exact sample was obtained is irrelevant.
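The interval constraint driving the scaling argument above can be sketched as follows; the convention that spin d is up exactly when u_d < θ is one valid choice of Gibbs coupling:

```python
def theta_interval(y, u):
    # Given spins y_d in {0, 1} and Gibbs variates u_d (spin up iff
    # u_d < theta), theta must exceed every u_d of an up spin and stay
    # below every u_d of a down spin. The allowed interval has typical
    # width 1/D, while the posterior width is about 1/sqrt(D): hence
    # the O(D) random-walk equilibration argument.
    lo = max([u_d for y_d, u_d in zip(y, u) if y_d == 1], default=0.0)
    hi = min([u_d for y_d, u_d in zip(y, u) if y_d == 0], default=1.0)
    return lo, hi
```

Any proposed θ outside this interval flips at least one fantasy spin away from the data and is rejected.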
We now turn to version B of the sampler, with the same independent Bernoulli model and Gibbs-sampling operators. Here the transition probabilities in the acceptance rule are equal to the parameters' likelihoods under the model, and the algorithm reduces to standard Metropolis-Hastings for θ, which has the best possible acceptance rate for a given proposal distribution q. Rapidly mixing operators are rewarded by this version of latent histories; operators that move very slowly will perform as in version A. We have implemented both versions of the sampler and applied them to the same Ising model problem as
in the earlier comparison. We used Gibbs sampling for the y variables, combined with the summary-states algorithm, to perform the inferences required by the algorithm. The summary-states code required some modification so that its transitions were driven by the latent history sampler's u-generator, rather than by a generic random-number generator; the state at time 0 or time −1 was then identified as before. The parameter proposals were Gaussian, with a width that preliminary runs indicated was roughly optimal. In version A we terminated each iteration as soon as the summary-state algorithm identified the state at time zero, or as soon as an already-determined part of the state at time zero was incompatible with the data. In version B we always waited until x_{−1} had been identified; more elaborate code could terminate earlier based on bounds on the acceptance ratio, but this is also true for the exchange algorithm, MAVM and most standard Metropolis-Hastings algorithms. As a result of early terminations, version A was roughly five times cheaper per iteration than our implementation of version B. Even with this advantage, the number of effective samples per Gibbs-sampling-like sweep through the system was around ten times worse for version A. The efficiency of version B, in effective samples per Gibbs sweep, was about half that obtained by SAVM. The latent history efficiency computations included the Gibbs-like sweeps needed to compute the u's, while SAVM was given its random numbers for free; this makes the real computation time of the methods comparable, and which method is actually faster will come down to implementation details. However, both the bridged exchange and MAVM algorithms are clearly faster than either variant of latent histories on this Ising problem. We might hope that latent histories will perform better
on other problems. A more realistic reason to be excited by latent histories is that they might be a much better route to approximate methods than any of the previous algorithms in this chapter; we return to this issue below. First we turn to the original motivation for developing latent histories: constructing slice samplers.

Slice sampling doubly-intractable distributions

All of the algorithms for doubly-intractable distributions described so far are based on Metropolis-Hastings. Such algorithms always require step-size parameters that need to be set well: overly large step sizes lead to almost all proposals being rejected, while overly small step sizes give lengthy random-walk-like exploration of the posterior over θ. It would be nice to be able to use a self-tuning method like slice sampling, which has less-important, or even no, step-size parameters.

Latent histories. The latent history representation naturally allows slice sampling. It also has the property, shared with the exchange algorithm, that small steps in θ are always accepted, so moves will always be possible. The trick is to identify the variables in the joint distribution as θ and u, while thinking of the x quantities as just a deterministic function of these variables. Conditioned on the u variables, any standard slice-sampling algorithm can
be applied to the parameter θ. The stationary distribution is simply the prior over the parameters, conditioned on x_0(θ, u) = y; any point not satisfying this constraint is off the slice and is discarded by the slice-sampling procedure. This description gives a slice sampler corresponding to version A of the latent history algorithm. A slice-sampling implementation of version B sets the stationary distribution to T(y ← x_{−1}; θ) p(θ); after each iteration, u is effectively updated in a block-Gibbs update, which involves resetting a lazy u_t generator as in the Metropolis-Hastings algorithm.

MAVM. Standard slice sampling cannot be applied to the parameter θ in the MAVM joint distribution, because this would involve computing Z(θ); as with Metropolis-Hastings, the auxiliary variables must be changed simultaneously. It turns out that, with an appropriate definition of a slice, the MAVM algorithm can be converted into a slice sampler. We first describe the new algorithm. Ostensibly we do just use a standard slice-sampling procedure to update the parameters; however, the range of auxiliary heights (0, h(θ, x)) is defined in terms of the following quantity:

    h(θ, x) ≡ p(θ|y) p(x|θ, y) / q(x|θ, y) ∝ f(y; θ) p(θ) ∏_{k=0}^{K} f_{k+1}(x_k)/f_k(x_k),

in which every unknown normalizer has cancelled. After drawing the slice-sampling auxiliary height h ~ Uniform[0, h(θ, x)], we consider new parameter settings using any of the normal slice-sampling procedures. However, whenever a new setting θ' of the parameters is considered, a new setting of the auxiliary x' variables is generated in the same way as in the MAVM algorithm. Then, as usual, slice membership is ascertained by checking h(θ', x') > h.
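The standard bracketing-and-shrinkage update invoked throughout this section can be sketched for a one-dimensional unnormalized log-density; the fixed step-out width w is the only tuning assumption:

```python
import math
import random

def slice_sample(x, log_f, w=1.0, rng=random):
    # One slice-sampling update in the style of Neal: draw an auxiliary
    # height under f(x), step out a bracket of width w until both ends
    # are off the slice, then shrink towards x until a point on the
    # slice is found. log_f is any unnormalized log-density on the reals.
    log_h = log_f(x) + math.log(1.0 - rng.random())  # height in (0, f(x)]
    left = x - w * rng.random()
    right = left + w
    while log_f(left) > log_h:       # step out to the left
        left -= w
    while log_f(right) > log_h:      # step out to the right
        right += w
    while True:                      # shrink until an on-slice point
        x_new = left + (right - left) * rng.random()
        if log_f(x_new) > log_h:
            return x_new
        if x_new < x:
            left = x_new
        else:
            right = x_new
```

The MAVM slice sampler replaces the pointwise density here with the quantity h(θ, x) above, regenerating the auxiliary x variables at every candidate θ.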
We now show that this procedure satisfies detailed balance with respect to the desired stationary distribution. As before, we use the label S for all rejected points and other ancillary points generated while exploring the slice, in a stepping-out procedure for example. The probability of starting at a setting (θ, x), generating a height h, generating intermediate quantities S, and finally an acceptable new pair (θ', x') is

    p(θ|y) p(x|θ, y) p(h|θ, x, y) q(S|h, θ) q(θ'|S, h, θ) q(x'|θ', y).

Substituting p(h|θ, x, y) = 1/h(θ, x) and the definition of h above, this expression is invariant to exchanging (θ, x) ↔ (θ', x'), provided the bracketing procedure satisfies the symmetry q(S, θ'|h, θ) = q(S, θ|h, θ'). All of the slice-sampling bracketing procedures mentioned earlier ensure this symmetry; therefore the overall procedure satisfies detailed balance. This derivation closely follows that of standard slice sampling. It is possible because, conditioned on h, the decision to reject the intermediate parameters in S is the same whether S is generated forwards from θ or backwards from θ'. In contrast, the acceptability of a proposal in the exchange algorithm cannot be summarized by a scalar slice height h: a move depends on giving a new data set to the old parameter, which will have different probabilities depending on whether the move started at θ or at θ'. We don't currently see any way to define a slice sampler for the exchange representation.

As noted earlier, even very small moves in θ can be rejected when using the MAVM auxiliary system. The usual slice bracketing procedures work on an assumption that nearby regions will always be accepted, and they shrink exponentially towards the current point. Unless a large number of bridging levels are used, this behavior could lead to overly small step sizes; care would be needed in applying
stock slice-sampling code.

Discussion

We established earlier that global information, like that provided by exact sampling, must be found as part of any strictly valid MCMC algorithm for doubly-intractable distributions. Given this, it seems likely that the exchange algorithm performs about as well as is possible: on the problems we have tried, its acceptance rate approaches that of Metropolis-Hastings with moderate K, leaving exact sampling as the dominant cost. We would generally recommend this algorithm if a valid MCMC procedure is required. Møller et al. pointed out that good deterministic approximations to p(y|θ) could make SAVM perform very well. Our results firmly support the use of bridging, and we would recommend always considering MAVM instead.

There is an unresolved issue concerning doubly-intractable distributions when learning from N i.i.d. data points:

    p({y^(n)}|θ) = ∏_{n=1}^{N} p(y^(n)|θ) = (1/Z(θ)^N) ∏_{n=1}^{N} f(y^(n); θ).

All of the known valid MCMC algorithms for p(θ|y) require drawing an exact sample from this data distribution, which means drawing a fantasy containing N data sets, x = {x^(n)}, n = 1 … N. But perhaps there is enough information about Z(θ) in a single exact sample from the data distribution; drawing N exact samples may be prohibitively expensive with large data sets. Whether this problem can be avoided is an open question.

Sadly, it is not possible to draw exact samples from all distributions, and approximate relaxations must be used in some situations. We observed earlier that a bad choice of approximations, applied to the inner loop of standard Metropolis-Hastings, can have surprising and disastrous consequences for the quality of the approximate samples.
The valid algorithms discussed in this chapter are an alternative target for approximation. The obvious relaxation of methods requiring exact sampling from the data distribution is to use a brief run of MCMC, possibly starting from the observed data. Møller et al. report that a fixed length of sampling can be more inefficient than exact sampling for similar levels of performance; however, this idea would be worth exploring further in cases where exact sampling is not possible. A natural relaxation of the latent history representation is truncating the history to some finite length of time, K. This would require specifying an explicit prior distribution for x_{−K}, perhaps an approximation to the original model. One could optionally replace the stationary Markov chain operator T with a sequence of operators T_k, bridging between the approximate distribution used for x_{−K} and the target model distribution. This finite model could be simulated using the latent history algorithms, with explicit computations, or with any standard MCMC technique. We believe that truncated latent history representations are a promising direction for future research in approximate inference for doubly-intractable distributions.
Running finite Markov chains within the exchange algorithm or MAVM is an inner-loop approximation, somewhat like those explored earlier. The consequences of approximating MCMC procedures are hard to predict, and it is possible that very unreasonable inferences will result. In contrast, a truncated latent history representation is a model on which we can apply standard MCMC algorithms. The model will not describe exactly the same beliefs as the original distribution, but all of the inferences performed on it will be self-consistent. Thus, if prior draws from the truncated model look reasonable, it is likely that the inferences will also be sensible. This line of reasoning is really suggesting that we throw out the original doubly-intractable model and replace it with a more tractable latent variable model. If the model was originally introduced as an ad hoc way to introduce dependencies, then
using an alternative that can be treated consistently may be preferable; in particular, we can look at draws from the prior and see if they actually reflect our beliefs. Heckerman et al. follow a related thought process: they discuss circumstances in which undirected graphical models may not be a natural representation, and provide an alternative replacement. Changing the model doesn't seem attractive when an undirected model was derived directly from physical considerations. An example of such a model is recent work on learning protein structure (Podtelezhnikov et al.), which learns the parameters of an energy function using contrastive divergence (Hinton). A full Bayesian treatment of these parameters with MCMC seems daunting: exact sampling from the configurational distribution of a protein chain seems infeasibly difficult. However, using the "wrong" model could still serve as an approximation technique, and doing so may be more stable than inner-loop approximations.

This chapter has not addressed computing marginal likelihoods p(y|M)
for bayesian model comparison of different undirected graphical models evidently
the standard methods discussed in chapter will apply in theory although such co
mputations are likely to be demanding a triplyintractable problem chapter summar
y and future work the markov chain monte carlo method introduced by metropolis e
t al was one of the earliest applications of electronic computers today it conti
nues to be an important technique for approximate numerical integration in a gro
wing number of applications chapter provided an introduction to the challenges i
nvolved while chapter described some of the current literature along with some m
inor extensions in chapter we explored the promising ideas of using multiple par
ticles and multiple proposals in markov chain simulations as with other authors
we found that modest improvements are possible largely derived from the basic le
ngthscale information given by a few samples from a distribution however drawing
multiple proposals can easily cost more to compute than the statistical benefit they provide. It remains to be seen whether implementations leveraging caching of intermediate results, or parallel computation, can gain an advantage in real applications. Other authors in the literature remain hopeful that they can, and our pivot-based approaches seem to bring unique benefits to this area of the field.
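The multiple-try Metropolis algorithm of Liu, Liang, and Wong, one of the multiple-proposal schemes considered in that chapter, is easy to sketch. The following is a minimal illustration, not the pivot-based method developed in this thesis: the one-dimensional standard normal target, the Gaussian proposal, and all tuning constants are my own choices. With a symmetric proposal the candidate weights reduce to the unnormalized target density.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    """Unnormalized log target: standard normal."""
    return -0.5 * x * x

def mtm_step(x, k=5, scale=2.0):
    """One multiple-try Metropolis update with a symmetric Gaussian
    proposal; with symmetry the weights are w(y) = pi(y)."""
    # Draw k candidates from the proposal centred at the current state.
    ys = x + scale * rng.standard_normal(k)
    wy = np.exp(log_pi(ys))
    # Select one candidate with probability proportional to its weight.
    y = rng.choice(ys, p=wy / wy.sum())
    # Draw k-1 reference points around the candidate; the k-th
    # reference point is the current state itself.
    xs = np.append(y + scale * rng.standard_normal(k - 1), x)
    wx = np.exp(log_pi(xs))
    # Generalized Metropolis-Hastings accept/reject decision.
    if rng.random() < wy.sum() / wx.sum():
        return y
    return x

x = 0.0
samples = []
for _ in range(20000):
    x = mtm_step(x)
    samples.append(x)
samples = np.array(samples)
```

The reference points drawn around the selected candidate are what restore detailed balance; dropping them would bias the chain toward high-weight proposals.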
The next chapter investigated methods for computing normalizing constants. One of these, AIS, is actually an importance sampling method rather than MCMC, and nested sampling offers a new sampling paradigm in its own right; however, both of these algorithms use Markov chains in their practical implementation. We extended existing techniques for nested sampling on discrete distributions, and then compared three fundamentally different algorithms based on the same Markov chain. Our theoretical and empirical results show that the algorithms perform very differently and that no method is appropriate for all distributions; it would seem wise to use more than one algorithm as standard practice. We proposed one practical approach to realize this: a method for tuning AIS based on preliminary runs of nested sampling.
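To make the first of these concrete, here is a minimal annealed importance sampling (AIS) sketch on a toy problem of my own devising, not an experiment from the thesis: estimating the normalizer of f1(x) = exp(-(x-2)^2/2), whose true value is sqrt(2*pi) (about 2.507), by annealing from a standard normal base. The geometric path, the number of temperatures, and the Metropolis step size are arbitrary tuning choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_f0(x):
    # Normalized base density: standard normal (log).
    return -0.5 * x * x - 0.5 * np.log(2 * np.pi)

def log_f1(x):
    # Unnormalized target: N(2,1) shape; true normalizer sqrt(2*pi).
    return -0.5 * (x - 2.0) ** 2

betas = np.linspace(0.0, 1.0, 151)   # geometric path f0^(1-b) f1^b

def ais_run(n_mh=5, step=0.7):
    x = rng.standard_normal()        # exact sample from f0
    logw = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Importance-weight increment for the temperature change.
        logw += (b - b_prev) * (log_f1(x) - log_f0(x))
        # A few Metropolis moves leaving the bridge f_b invariant.
        for _ in range(n_mh):
            xp = x + step * rng.standard_normal()
            d = ((1 - b) * (log_f0(xp) - log_f0(x))
                 + b * (log_f1(xp) - log_f1(x)))
            if np.log(rng.random()) < d:
                x = xp
    return logw

logws = np.array([ais_run() for _ in range(200)])
m = logws.max()
Z_hat = np.exp(m) * np.exp(logws - m).mean()  # E[w] = Z1/Z0, with Z0 = 1
```

The same estimator gives a marginal likelihood when f0 is a prior and f1 an unnormalized posterior; the spread of the log weights is the usual diagnostic for whether the schedule is fine enough.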
The final chapter highlights the limitations of MCMC: sampling from doubly-intractable distributions is exceedingly difficult, yet these distributions are found in a large class of undirected statistical models and have received a great deal of attention in recent years. We have introduced three new algorithms for this problem: MAVM, the exchange algorithm, and latent history samplers. These improve on the existing state of the art, remove the need for separate deterministic approximations, and offer new directions for approximating this difficult problem. Exploring truncations of latent histories is a possible area for further work.
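Of these, the exchange algorithm can be stated in a few lines. The sketch below runs it on a deliberately trivial model of my own construction, chosen so that the exact inner sampling step is available and the answer is known; it is not a model from the thesis. A single observation y has unnormalized likelihood f(y; theta) = exp(theta*y - y^2/2) with an N(0, 1) prior, giving a true posterior of N(y/2, 1/2). The auxiliary draw from f(.; theta') makes the intractable normalizers Z(theta) cancel in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_f(v, theta):
    """Unnormalized likelihood term f(v; theta) = exp(theta*v - v^2/2).
    Its normalizer Z(theta) is treated as unknown; in fact f/Z is the
    N(theta, 1) density, which is what lets us sample from it exactly."""
    return theta * v - 0.5 * v * v

def log_prior(theta):
    # N(0, 1) prior on theta.
    return -0.5 * theta * theta

y = 1.0            # the single observed data point
theta = 0.0
samples = []
for _ in range(20000):
    theta_p = theta + rng.standard_normal()   # symmetric proposal
    x = theta_p + rng.standard_normal()       # exact draw from f(.; theta_p)/Z
    # Exchange acceptance ratio: the auxiliary draw x "swaps" parameters
    # with the data, so the unknown Z(theta) terms cancel.
    log_a = (log_prior(theta_p) - log_prior(theta)
             + log_f(y, theta_p) - log_f(y, theta)
             + log_f(x, theta) - log_f(x, theta_p))
    if np.log(rng.random()) < log_a:
        theta = theta_p
    samples.append(theta)
samples = np.array(samples)
# The true posterior here is N(y/2, 1/2).
```

For real undirected models the exact inner draw is the hard part; perfect sampling (for example by coupling from the past) is what makes the method exact rather than approximate.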
This thesis has concentrated on investigating new Markov chains and new algorithms for the better use of existing MCMC operators. These have been investigated empirically, and sometimes theoretically, on simple problems designed to demonstrate the algorithms' key properties. One limitation of the presentation has been the focus on stationary distributions; little has been said about the chains' formal convergence rates. This is partly because making meaningful statements about convergence for general statistical problems seems difficult. Also, wherever possible we have reduced new algorithms to the Metropolis-Hastings algorithm applied to an auxiliary system; this passes some of the burden of deriving rates of convergence onto an established literature. While we have focused on MCMC methodology and proposed a number of novel algorithms, ultimately the computational advantages of these methods have to be proven on challenging real-world applications. This is an area I hope to spend more time on in my future research.

Bibliography

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski.
A learning algorithm for Boltzmann machines. Cognitive Science.

J. Albert. Bayesian selection of log-linear models. Technical report (working paper), Duke University, Institute of Statistics and Decision Sciences.

N. Balakrishnan and A. Clifford Cohen. Order Statistics and Inference: Estimation Methods. Academic Press, San Diego.

T. Bayes. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions. Communicated by Richard Price in a letter to John Canton; also available edited by G. A. Barnard in Biometrika.

M. J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London.

M. J. Beal and Z. Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, editors, Bayesian Statistics. Oxford University Press.

M. Beltrán, J. García-Bellido, J. Lesgourgues, A. R. Liddle, and A. Slosar. Bayesian model selection and isocurvature perturbations. Physical Review D.

C. H. Bennett. Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics.

B. A. Berg and T. Neuhaus. Multicanonical ensemble: a new approach to simulate first-order phase transitions. Physical Review Letters.

J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society.

J. Besag and P. J. Green. Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society B.

J. Besag, P. Green, D. Higdon, and K. Mengersen.
Bayesian computation and stochastic systems. Statistical Science.

A. Beskos, O. Papaspiliopoulos, G. O. Roberts, and P. Fearnhead. Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology).

C. Bischof and M. Bücker. Computing derivatives of computer programs. In Modern Methods and Algorithms of Quantum Chemistry.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, New York.

O. Cappé, A. Guillin, J.-M. Marin, and C. P. Robert. Population Monte Carlo. Journal of Computational and Graphical Statistics.

B. P. Carlin and S. Chib. Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B (Methodological).

G. Casella and C. P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika.

S. Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association.

A. M. Childs, R. B. Patterson, and D. J. C. MacKay. Exact sampling from nonattractive distributions using summary states. Physical Review E.

A. Christen and C. Fox. MCMC using an approximation. Journal of Computational and Graphical Statistics.

J. A. Christen and C. Fox. A self-adjusting multi-scale MCMC algorithm. Poster presented at the Valencia Bayesian meeting, Benidorm. Preprint available from jac@cimat.mx; code available from http://www.cimat.mx/jac/twalk

P. Clifford et al. Discussion on the meeting on the Gibbs sampler and other Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B (Methodological).

M. K. Cowles, N. Best, K. Vines, and M. Plummer. R-CODA. Available from http://www-fis.iarc.fr/coda

R. T. Cox. Probability, frequency and reasonable expectation. American Journal of Physics.
R. V. Craiu and C. Lemieux. Acceleration of the multiple-try Metropolis algorithm using antithetic and stratified sampling. Statistics and Computing.

P. Damien, J. Wakefield, and S. Walker. Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology).

S. Della Pietra, V. J. Della Pietra, and J. D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence.

P. Dellaportas and J. J. Forster. Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika.

L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, New York. Out of print; available from http://cg.scs.carleton.ca/~luc/rnbookindex.html

A. Dobra, C. Tebaldi, and M. West. Data augmentation in multi-way contingency tables with fixed marginal totals. Journal of Statistical Planning and Inference.

P. Dostert, Y. Efendiev, T. Y. Hou, and W. Luo. Coarse-gradient Langevin algorithms for dynamic data integration and uncertainty quantification. Journal of Computational Physics.

S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B.

D. Edwards and T. Havránek. A fast procedure for model search in multidimensional contingency tables. Biometrika.

R. G. Edwards and A. D. Sokal. Generalizations of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm. Physical Review.

J. Ferkinghoff-Borg. Monte Carlo methods in complex systems. PhD thesis, Graduate School of Biophysics, Niels Bohr Institute and Risø National Laboratory, Faculty of Science, University of Copenhagen.

C. M. Fortuin and P. W. Kasteleyn. On the random-cluster model I: introduction and relation to other models. Physica.

D. Frenkel.
Speed-up of Monte Carlo simulations by sampling of rejected states. Proceedings of the National Academy of Sciences.

D. Frenkel and B. Smit. Understanding Molecular Simulation. Academic Press, Inc., 2nd edition.

A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association.

A. Gelman and X.-L. Meng. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical Science.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, second edition.

D. Geman and S. Geman. Stochastic relaxation, Gibbs distributions and Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence.

J. Geweke. Getting it right: joint distribution tests of posterior simulators. Journal of the American Statistical Association.

C. J. Geyer. Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface.

C. J. Geyer. Practical Markov chain Monte Carlo. Statistical Science.

W. R. Gilks. Derivative-free adaptive rejection sampling for Gibbs sampling. In J. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics. Oxford University Press.

W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics.

W. R. Gilks, G. O. Roberts, and E. I. George. Adaptive direction sampling. The Statistician (special issue: Conference on Practical Bayesian Statistics).

V. K. Gore and M. R. Jerrum. The Swendsen-Wang process does not always mix rapidly. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. ACM Press, New York, NY, USA.

V. K. Gore and M. R. Jerrum. The Swendsen-Wang process does not always mix rapidly. Journal of Statistical Physics.

P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika.

F. Hamze and N. de Freitas. Hot coupling: a particle approach to inference and normalization on pairwise undirected graphs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems (NIPS). MIT Press.

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika.

D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research (JMLR).

G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation.

G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation.

J. D. Hobby. A user's manual for MetaPost. Technical report, AT&T Bell Laboratories, Murray Hill, New Jersey.

K. S. Van Horn. Constructing a logic of plausible inference: a guide to Cox's theorem. International Journal of Approximate Reasoning.

M. Huber. Exact sampling and approximate counting techniques. In Proceedings of the ACM Symposium on the Theory of Computing.

M. Huber. A bounding chain for Swendsen-Wang. Random Structures and Algorithms.

Y. Iba. Population Monte Carlo algorithms. Transactions of the Japanese Society for Artificial Intelligence.

Y. Iba. Extended ensemble Monte Carlo. International Journal of Modern Physics C.

W. Janke and S. Kappler. Multibondic cluster algorithm for Monte Carlo simulations of first-order phase transitions. Physical Review Letters.

C. Jarzynski. Equilibrium free-energy differences from nonequilibrium measurements: a master-equation approach. Physical Review E.

E. T. Jaynes. Probability Theory: The Logic of Science (principles and elementary applications). Cambridge University Press.

H. Jeffreys. Theory of Probability. Oxford University Press, 3rd edition. Republished in Oxford Classic Texts in the Physical Sciences.

M. H. Kalos and P. A. Whitlock. Monte Carlo Methods, Volume I: Basics. John Wiley.

R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association.

R. E. Kass, B. P. Carlin, A. Gelman, and R. M. Neal. Markov chain Monte Carlo in practice: a roundtable discussion. American Statistician.

S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science.

M. Kuss and C. E. Rasmussen.
Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research.

S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B.

J. S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing.

J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer.

J. S. Liu, W. H. Wong, and A. Kong. Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika.

J. S. Liu, F. Liang, and W. H. Wong. The use of multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association.

A. P. Lyubartsev, A. A. Martsinovski, S. V. Shevkunov, and P. N. Vorontsov-Velyaminov. New approach to Monte Carlo calculation of the free energy: method of expanded ensembles. Journal of Chemical Physics.

D. MacKay. Nested sampling explanatory illustrations. Available from http://www.inference.phy.cam.ac.uk/bayesys/box/nested.pdf

D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press. Available from http://www.inference.phy.cam.ac.uk/mackay/itila

E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhysics Letters.

P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences.

A. McCallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In IJCAI Workshop on Information Integration on the Web.

I. R. McDonald and K. Singer. Machine calculation of thermodynamic properties of a simple fluid at supercritical temperatures. Journal of Chemical Physics.

N. Metropolis. The beginning of the Monte Carlo method. Los Alamos Science.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics.

J. Møller, A. N. Pettitt, K. K. Berthelsen, and R. W. Reeves. An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Technical report, Department of Mathematical Sciences, Aalborg University.

J. Møller, A. N. Pettitt, R. Reeves, and K. K. Berthelsen. An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika.

P. Mukherjee, D. Parkinson, and A. R. Liddle. A nested sampling algorithm for cosmological model selection. The Astrophysical Journal Letters.

I. Murray and Z. Ghahramani. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In M. Chickering and J. Halpern, editors, Proceedings of the Annual Conference on
