This document summarizes advances in Markov chain Monte Carlo (MCMC) methods. It proposes several new MCMC algorithms for both evaluating normalizing constants and improving sampling of distributions. It also develops new frameworks for constructing nonreversible transition operators that allow more persistent motion in MCMC. Finally, it considers doubly intractable distributions and proposes exact sampling-based MCMC methods that do not require separate approximation before sampling begins.
Advances in Markov chain Monte Carlo methods

Iain Murray, MA MSci (Natural Sciences, Physics), University of Cambridge, UK

Gatsby Computational Neuroscience Unit
University College London
Queen Square, London WC1N 3AR, United Kingdom

Thesis submitted for the degree of Doctor of Philosophy, University of London

I, Iain Murray, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Abstract

Probability distributions over many variables occur frequently in Bayesian inference, statistical physics and simulation studies. Samples from distributions give insight into their typical behavior, and can allow approximation of any quantity of interest, such as expectations or normalizing constants. Markov chain Monte Carlo (MCMC), introduced by Metropolis et al., allows sampling from distributions with intractable normalization, and remains one of the most important tools for approximate computation with probability distributions.

While not needed by MCMC, normalizers are key quantities: in Bayesian statistics, marginal likelihoods are needed for model comparison; in statistical physics, many physical quantities relate to the partition function.

In this thesis we propose and investigate several new Monte Carlo algorithms, both for evaluating normalizing constants and for improved sampling of distributions. Many MCMC correctness proofs rely on using reversible transition operators; often these operators lead to slow, diffusive motion resembling a random walk. After reviewing existing MCMC algorithms, we develop a new framework for constructing non-reversible transition operators that allow more persistent motion. Next we explore and extend MCMC-based algorithms for computing normalizing constants. We compare annealing, multicanonical and nested sampling, giving recommendations for their use. We also develop a new MCMC operator and nested sampling approach for the Potts model. This demonstrates that nested sampling is sometimes better than annealing methods at computing normalizing constants and drawing posterior samples. Finally, we consider doubly-intractable distributions, with extra unknown normalizer terms that do not cancel in standard MCMC algorithms. We propose using several deterministic approximations for the unknown terms, and investigate their interaction with sampling algorithms. We then develop novel exact-sampling-based MCMC methods, the exchange algorithm and latent histories. For the first time, these algorithms do not require a separate approximation before sampling begins. Moreover, the exchange algorithm outperforms the only alternative sampling algorithm for doubly-intractable distributions.

Acknowledgments

I feel very fortunate to have been supervised by Zoubin Ghahramani; my training has benefited greatly from his expertise, enthusiasm and encouragement. I have been equally fortunate to receive regular advice and ideas from David MacKay. Much of this research would never have happened without these mentors and friends.

This work was carried out at the Gatsby Computational Neuroscience Unit at University College London. This is a first-class environment in which to conduct research, both intellectually and socially. Peter Dayan, the director, is owed a lot of credit for this, as are all of Gatsby's members and visitors, past and present. Several individuals have offered me their constant support, advice and good humor, in particular Angela Yu, Ed Snelson, Katherine Heller and members of the Inference Group in Cambridge. I'd also like to extend a special thanks to Alex Boss, whose administrative help often extended beyond the call of duty.

Some of the chapter on nested sampling is a review of John Skilling's nested sampling algorithm. John has been generous with his advice and encouragement, which enabled me to pursue a study based on his work. Thanks also to Radford Neal for comments on aspects of two of the chapters, Hyun-Chul Kim for mean-field code used later in the thesis, and Matthew Stephens for encouraging me to work out the non-infinite explanation of the exchange algorithm. Thanks to my examiners, Peter Green and Lorenz Wernisch, for reviewing this work and catching some of my mistakes.
Of course, I am solely responsible for any errors that remain.

I am enormously grateful to the Gatsby Charitable Foundation for funding my research and providing travel money. Further travel monies were received from the PASCAL network of excellence, AUAI, the NIPS foundation and the Valencia Bayesian meeting.

Various items of free software have played a vital role in conducting this research: GCC, a GNU/Linux system, gnuplot, LaTeX, MetaPost (Hobby), Octave, Valgrind (Seward and Nethercote), Vim and many more. Projects in the public interest such as these deserve considerable support.

Finally, I would like to thank my family and friends for all their support and patience.

Contents

Front matter: abstract; acknowledgments; contents; list of algorithms; list of figures; list of tables; notes on notation.

Introduction: graphical models; directed graphical models; undirected graphical models; the Potts model; computations with graphs; the role of summation; simple Monte Carlo; sampling from distributions; importance sampling; Markov chain Monte Carlo (MCMC); choice of method.

Markov chain Monte Carlo: Metropolis methods; generality of Metropolis-Hastings; Gibbs sampling; a two-stage acceptance rule; conditional estimators; Rao-Blackwellization; waste recycling; construction of estimators; convergence; auxiliary variable methods; Swendsen-Wang; slice sampling; Hamiltonian Monte Carlo; simulated tempering; expanded ensembles; parallel tempering; annealed importance sampling (AIS); tempered transitions; generalization to only forward transitions; generalization to a single pass; annealing methods; multicanonical ensemble; exact sampling; exact sampling example: the Ising model; discussion and outlook.

Multiple proposals and non-reversible Markov chains: population Monte Carlo; multiple-try Metropolis; efficiency of MTM; multiple-importance try; waste-recycled MTM; adapting k automatically; ordered overrelaxation with pivot-based transitions; persistence with pivot states; ordered overrelaxation; pivot-based transitions; pivot-based Metropolis; summary; related and future work.

Normalizing constants and nested sampling: starting at the prior; bridging to the posterior; aside on the prior factorization; thermodynamic integration; multicanonical sampling; nested sampling; a change of variables; computations in the new representation; nested sampling algorithms; MCMC approximations; integrating out x; degenerate likelihoods; efficiency of the algorithms (nested sampling, multicanonical sampling, importance sampling); constructing annealing schedules; Markov chains for normalizing constants; randomize operator orderings; changes in lengthscale and energy; a new version of Swendsen-Wang; description; slice sampling experiments; discussion of slice sampling results; the Potts model; summary; related work; philosophy; experiments; discussion and conclusions.

Doubly-intractable distributions: Bayesian learning of undirected models; do we need Z for MCMC?; targets for MCMC approximation; approximation algorithms; extension to hidden variables; experiments involving fully observed models; experiment involving hidden variables; discussion; product space interpretation; bridging; exchange algorithm; details for proof of correctness; reinterpreting SAVM; approximation schemes; the exchange algorithm; the single auxiliary variable method; MAVM: a tempered-transitions refinement; comparison of the exchange algorithm and MAVM; Ising model comparison; discussion; Metropolis-Hastings algorithm performance; latent history methods; slice sampling doubly-intractable distributions; latent histories; MAVM discussion; summary and future work.

References.

List of algorithms: rejection sampling; Metropolis-Hastings; a two-stage acceptance rule; population Monte Carlo (as in Cappe et al.); a single step of the multiple-try Metropolis Markov chain; self size-adjusting population for ordered overrelaxation; the pivot-based transition; the pivot-based Metropolis operator; single-particle nested sampling; multiple-particle nested sampling; Swendsen-Wang for weighted bond configurations; construction of an annealing schedule from nested sampling; standard but infeasible Metropolis-Hastings; exchange algorithm; exchange algorithm with bridging; multiple auxiliary variable method (MAVM); simple rejection sampling algorithm for p(y); template for a latent history sampler; Metropolis-Hastings latent history sampler.

List of figures: a selection of directed graphical models; some graphical models over three variables; a grid of binary variables; an intractable undirected graphical model; drawing samples from low-dimensional distributions; challenges for Markov chain exploration; the Swendsen-Wang algorithm; slice sampling; the effect of annealing; parallel tempering; annealed importance sampling (AIS); tempered transitions; multicanonical ensemble example; coupling from the past (CFTP) overview; Metropolis and population Monte Carlo on the funnel distribution; multiple-try Metropolis (MTM); performance of multiple-try Metropolis; the idea behind successive overrelaxation; illustration of the pivot-based transitions operator; ordered overrelaxation schemes applied to a bivariate Gaussian; using pivot states for persistent motion; example of persistent motion; reflect move for discrete distributions; views of the integral Z; nested sampling illustrations; the arithmetic and geometric means of nested sampling's progress; errors in nested sampling approximations; empirical average behavior of AIS and nested sampling; Potts model states accessible by MCMC and nested sampling; a simple illustration of the global nature of Z; histograms of approximate samples for heart disease risk model; quality of marginals from approximate MCMC methods; toy semi-supervised problem with results from approximate samplers; the exchange algorithm's augmented model; a product space model motivating the exchange algorithm; the proposed change in joint distribution under a bridged exchange; augmented model used by SAVM; joint model for MAVM; comparison of MAVM and the exchange algorithm learning a precision; performance comparison of exchange and MAVM on an Ising model; the latent history representation.

List of tables: equilibrium efficiency of MTM and Metropolis; accuracy on the t-distribution after a number of proposals; a rough interpretation of the evidence scale; nested sampling, AIS and multicanonical behavior with slice-sampling; estimates of the deceptive distribution partition function; results for Potts systems.

Notes on notation

Probability distributions. We have chosen to follow a fairly loose but commonly-used notation for probabilities. Occasionally we use P(X = x) to denote the probability that a random variable X takes on the value x, but as long as the meaning can be inferred from context we simply write p(x). Often there is more than one probability distribution over the same variable; we simply write q(x) for the probability under distribution q, and p(x) for the probability under p. We never mention the space in which x lives, nor any measure theory, unless it is actually used. We rarely need to distinguish between probability densities and discrete probabilities. This loose notation is imprecise, but hopefully its simplicity will be appreciated by some readers.

Probability of a given b. We use several notations for distributions over variables that depend on other quantities:

p(a|b): the conditional probability of a given b. Bayes' rule can be applied to infer b from a: p(b|a) = p(a|b) p(b) / p(a).

p(a; b): the probability of a is a function depending on some parameter b. One should not necessarily assume that Bayes' rule holds for a and b.

T(a ← b): a transition operator, which gives a probability distribution over a new position a given a starting position b. One could also write T(a; b); the arrow is to provide a more obvious distinction from authors that use T(a → b) for the probability of the transition a → b. Transition operators do not necessarily satisfy the Bayes'-rule-like symmetry known as detailed balance, so the notation T(a|b) is avoided.

T(a ← b; c): this specifies parameters c in addition to the starting location b.
Expectations. We use E_{p(x)}[f(x)] = Σ_x p(x) f(x) for the expectation, or average, of f(x) under the distribution p(x). A sum with no specified range should be taken to mean over all values. The variance of a quantity is given by var_p[f] = E_{p(x)}[f(x)²] − (E_{p(x)}[f(x)])².

Introduction

Probability distributions over many variables occur frequently in Bayesian inference, statistical physics and simulation studies. Computational methods for dealing with these large and complex distributions remain an active area of research. Graphical models provide a powerful framework for representing these distributions; we use these to explain challenging probability distributions and, sometimes, the algorithms to deal with them. A surprising number of statistical problems result in the computation of averages, as we explain below. Monte Carlo techniques approximate these summations using random samples. Many of these methods rely on the use of Markov chains. Extending Markov chain Monte Carlo (MCMC) techniques is the subject of this thesis.

Graphical models

Compact representations of high-dimensional probability distributions are essential for their interpretation, for feasibility of learning, and for computational reasons. As a concrete example, consider a distribution over a vector of D binary variables x. For small D, e.g. two or three, an explicit table of counts for all possible joint settings could be maintained and used for frequency-based estimates of the settings' probabilities. Explicitly storing such a table becomes exponentially more costly as D grows. Even if the table is stored in a sparse data structure, the representation is not useful for learning probabilities: most cells will contain zero observations. Enforcing some structure is essential when learning from large multivariate distributions.

The simplest multivariate distributions assume that all of their component variables x_d are independent:

    p(x) = ∏_{d=1}^{D} p(x_d).

[Figure: a selection of directed graphical models. (a) naïve Bayes, (b) hidden causes, (c) simple parametric model.]

This assumption is generally inappropriate, as it means that no relationships between any of the variables can ever be learned. The graphical model for the equation above has a node for each x_d, with no edges between any of the nodes. Graphs with more structure provide a convenient representation for different factorizations of p(x). Two classes of graphical model are used in this thesis. Directed graphs offer a natural representation of causal relationships, or the result of a sequence of operations; while useful in modeling, we mainly use directed graphs for explanations of algorithms. Undirected graphical models are a more natural representation for constraints and mutual interactions; these are an important motivation for the chapter on doubly-intractable distributions. This section offers a brief review of the required concepts.

Directed graphical models

A directed graphical model is represented by a directed acyclic graph (DAG). Generally the joint distribution factorizes into a product of terms, one for each node conditioned on all of its parents. Figure (a) is a directed graphical model for the joint distribution found in the naïve Bayes model, with class variable c and feature vector x = (x_1, …, x_D):

    p(c, x) = p(c) p(x|c) = p(c) ∏_{d=1}^{D} p(x_d | c).

Figure (a) and this equation contain exactly the same information. From either form it is readily apparent, to those familiar with the representations, that the features x are dependent, but independent conditioned on the class variable c. As more variables and more complicated structure are introduced, the graphical model is often easier to interpret at a glance than an equation giving the factorization of the joint distribution. Several graphical model figures have been included throughout this thesis, for those that might find them useful.

[Figure: some graphical models over three variables, panels (a)-(d).]
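As a quick check of the naïve Bayes factorization above, the joint p(c, x) can be evaluated directly and verified to sum to one. The class prior and per-feature conditionals below are made-up illustration values, not numbers from the thesis (a minimal sketch):

```python
from itertools import product

# Naive Bayes joint: p(c, x) = p(c) * prod_d p(x_d | c).
# All probabilities here are arbitrary illustration values.
p_c = {0: 0.6, 1: 0.4}                  # class prior p(c)
p_x1_given_c = [                         # p(x_d = 1 | c) for each feature d
    {0: 0.1, 1: 0.8},
    {0: 0.3, 1: 0.5},
    {0: 0.7, 1: 0.2},
]

def joint(c, x):
    """p(c, x) for binary features x under the naive Bayes factorization."""
    prob = p_c[c]
    for d, xd in enumerate(x):
        p1 = p_x1_given_c[d][c]
        prob *= p1 if xd == 1 else (1.0 - p1)
    return prob

# Summing the joint over all (c, x) must give 1: a sanity check that the
# factorization defines a normalized distribution.
total = sum(joint(c, x) for c in (0, 1) for x in product((0, 1), repeat=3))
```

For small D this brute-force sum is cheap; the point of the factorization is that each factor is a small table even when D is large.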
Figure (c) demonstrates another piece of directed graphical model notation: sometimes shading, or a double outline, is used to indicate that a node represents an observed variable. Here y are the data being modeled, which are assumed to come from a distribution parameterized by θ. Strictly, this graph says nothing else about the distribution: any joint distribution over two variables can be written as p(y, θ) = p(y) p(θ|y) = p(θ) p(y|θ), so the arrow could equally be drawn the other way. The implication is that generating y from θ is a natural operation. The arrow directions make a big difference in figure (b): the x variables are independent in the generative process, but after observing y our knowledge about x forms a potentially complex joint distribution. We do not expect direct sampling from p(x|y) to be easy; methods based on Markov chains will offer a solution to this problem.

Undirected graphical models

A probability distribution must be normalized, so one representation used frequently in this thesis is

    p(x) = f(x)/Z,    Z = Σ_x f(x),

where f(x) is any positive function for which the normalization sum (or integral) exists. However, without detailed analysis of f, this representation says nothing about the structure of the distribution. Undirected graphical models represent f as a product of terms, but do not identify some nodes as children and others as parents. This is appropriate for representing mutual interactions. Modern undirected models use bipartite factor graphs: figure (b) shows three variables which all interact through a central factor node. Traditionally this distribution would be represented as in figure (c), where the three nodes all take part in a common interaction, as they form a fully-connected clique. The factor representation is more powerful: the traditional representation cannot represent the distribution in figure (d), which contains three pairwise factors,

    p(x_1, x_2, x_3) = f_1(x_1, x_2) f_2(x_2, x_3) f_3(x_1, x_3) / Z.

The Potts model

The Potts model is a widespread example of an undirected graphical model. Its probability distribution is defined over discrete variables s_i, each taking on one of q distinct settings, or "colors":

    p(s; J, q) = (1/Z_p(J, q)) exp( Σ_{(i,j)∈E} J_ij δ(s_i, s_j) ).

The variables exist as nodes on a graph, where (i,j) ∈ E means that nodes i and j are linked by an edge. The Kronecker delta δ(s_i, s_j) is one when s_i and s_j are the same color, and zero otherwise. Neighboring nodes pay an energy penalty of J_ij when they are different colors. Often a single common coupling J_ij = J is used. A common extension allows biases to be put on the colors: an additional term Σ_i h_i(s_i) is put inside the exponential. The Potts model with bias terms and q = 2 has several different names: the Boltzmann machine, the Ising model, binary Markov random fields, or autologistic models.

Computations with graphs

Imagine a problem that involves summing over the settings of a binary Potts model with N variables. In general this seems impossible: there are 2^N possible states. For comparison, the age of the universe is usually estimated to be on the order of 10^29 picoseconds, while most current processors take on the order of 10^3 picoseconds to perform a single instruction.

If the model's graph forms a chain, s_1 – s_2 – ⋯ – s_N, the distribution can be factored into a product of functions involving each adjacent pair:

    p(s; J) = (1/Z(J)) ∏_{n=2}^{N} f_n(s_n, s_{n−1}).

Now certain sums, such as the normalizer Z(J), can be decomposed as follows:

    Z(J) = Σ_{s_1} Σ_{s_2} f_2(s_2, s_1) Σ_{s_3} f_3(s_3, s_2) ⋯ Σ_{s_N} f_N(s_N, s_{N−1}),

and expectations of functions of variables, E_p[g(s_i)], are tractable by the same rearrangement. Performing the sums right-to-left makes the computation O(N Q²) rather than O(Q^N), where Q = 2 for binary variables.

[Figure: a small state space of binary variables, represented as (a) pixels and (b) an undirected graphical model.]

The above technique easily generalizes to tree-structured graphs, but other topologies are common in applications. The figure shows a grid of binary variables; arrays like this are common in computer vision, spatial statistics and statistical physics. Treating each row of pixels as a single variable makes a chain-structured graph with one variable per row, each taking on one state per joint setting of that row. (Summing in diagonal stripes means this need only happen once, rather than many times.) Summing has changed from an operation that takes orders of magnitude longer than the age of the universe to being almost instantaneous.

Advanced versions of this procedure exist for general graphical models, notably the junction tree algorithm (Lauritzen and Spiegelhalter). The cost of the algorithm is determined by the treewidth of a graph, which indicates the largest number of variables that need to be joined in order to form a tree structure.

[Figure: an intractable undirected graphical model, a grid of variables with pairwise interactions.]

The figure shows a genuinely intractable graphical model. Or does it? Even if the variables are only binary, summing over all states with a graph-based exact inference algorithm is infeasible: forming a tractable tree will require making at least one node with joint settings of many variables. The topology of the graph isn't everything, however. If the variables were continuous and Gaussian-distributed, then the model is quite tractable: almost everything one might want to know about a Gaussian distribution can be found easily from a Cholesky decomposition of the covariance matrix. This matrix factorization is flexible, numerically stable, and costs O(n³) for an n×n covariance matrix (Seeger). Even for n in the thousands, a current mid-range workstation can perform the required computation in a couple of seconds.

The role of summation

Summing over all the configurations of a multivariate distribution turns out to be the dominant computational task in many fields. One of the goals of statistical physics is to capture the collective behavior of systems that, like the Potts model, involve enormous numbers of interacting parts. Many physical quantities relate to simple statistics of these parts averaged over the entire system. Other key quantities can be derived from Z, the Zustandssumme or "sum over states".
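The right-to-left chain summation described above is a short dynamic program. Below is a minimal sketch for a Potts chain with a single common coupling J; the chain length, number of colors and coupling are arbitrary illustration values, and a brute-force enumeration is included only to check the O(N Q²) recursion (this is illustrative code, not code from the thesis):

```python
import itertools
import math

def chain_potts_Z(n, q, J):
    """Normalizer Z = sum_s exp(J * sum_i delta(s_i, s_{i+1})) for a chain
    of n Potts variables with q colors, computed by summing variables out
    one at a time from the right: O(n q^2) work instead of O(q^n)."""
    # message[s] = sum over all settings of the variables to the right,
    # given that the current variable takes color s.
    message = [1.0] * q
    for _ in range(n - 1):
        message = [sum(math.exp(J * (s == t)) * message[t] for t in range(q))
                   for s in range(q)]
    return sum(message)

def chain_potts_Z_brute(n, q, J):
    """The same sum by explicit enumeration of all q**n states (tiny n only)."""
    total = 0.0
    for s in itertools.product(range(q), repeat=n):
        total += math.exp(J * sum(s[i] == s[i + 1] for i in range(n - 1)))
    return total
```

For a chain of 6 three-color variables the two functions agree to floating-point precision, while the recursion touches 6 × 3² terms instead of 3⁶ states.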
Bayesian inference, the use of probability theory to deal with uncertainty, is also dominated by the computation of averages. As a canonical example, we consider a statistical model p(x|θ) for data x, generated using parameters θ. The predictive distribution over new data x_{N+1}, given observations of N previous settings x_{1:N}, is an average under the posterior distribution:

    p(x_{N+1} | x_{1:N}) = ∫ p(x_{N+1} | θ) p(θ | x_{1:N}) dθ = E_{p(θ|x_{1:N})}[ p(x_{N+1} | θ) ].

The posterior distribution is given by Bayes' rule,

    p(θ | x_{1:N}) = p(x_{1:N} | θ) p(θ) / p(x_{1:N}),    p(x_{1:N}) = ∫ p(x_{1:N} | θ) p(θ) dθ,

which involves another sum, over θ. In general, any time a quantity is unknown we must consider all of its possible settings, which tends to involve an average under a probability distribution.

This thesis concentrates on computational methods rather than their application. However, modeling and learning from data are a key motivation for this work, so we briefly mention some more general references. Bayesian inference is named after Bayes, although many of the ideas that currently fall under this banner came much later. For more on the philosophy, a good start is a beautifully written paper by Cox. (Cox's work has been subject to some technical objections, which are countered by Horn.) Recent textbooks offer a broader view of probabilistic modeling, and are available from the viewpoints of various communities, including statistics (Gelman et al.), machine learning (MacKay; Bishop) and the physical sciences (Sivia and Skilling).

Simple Monte Carlo

Many of the alleged difficulties with finding averages are unduly pessimistic. Imagine we were interested in the average salary x_p of people p in the set of statisticians S. Formally this is a large, intractable sum over all of the |S| statisticians in the world:

    E[x] = (1/|S|) Σ_{p∈S} x_p.

But to claim that the question is unanswerable is absurd: we can clearly get a reasonable estimate of the answer by measuring just some statisticians,

    E[x] ≈ (1/S) Σ_{s=1}^{S} x_{p_s},    for a random survey of S people {p_s}.

No reasonable application needs the exact answer, which in any case is constantly fluctuating as individual statisticians are hired, promoted and retire. So, for all practical purposes, the problem is solved. Nearly, anyway: conducting surveys that obtain a fair sample from a target population is difficult. But the technique is so useful that we are prepared to invest effort into good sampling methods.

This statistical sampling technique is directly relevant to solving difficult integrals in statistical inference. For example, what is the distribution over an unknown quantity x after observing data D from a distribution with unknown parameters θ?

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ = E_{p(θ|D)}[ p(x|θ) ] ≈ (1/S) Σ_{s=1}^{S} p(x | θ_s),    θ_s ~ p(θ|D).

We draw samples from the target distribution, which, rather than the set of statisticians, is now p(θ|D). Approximating general averages by statistical sampling is known as simple Monte Carlo. The estimates are unbiased, which, if not clear now, is shown more generally later. As long as variances are bounded appropriately, the sum of independent terms will obey a central limit theorem: the estimator's variance will scale like 1/S.

Having an estimator's variance shrink only linearly with computational effort is often considered poor. Sokal begins a lecture course on Monte Carlo with the warning that Monte Carlo is an extremely bad method, to be used only when all alternative methods are worse. However, as Sokal goes on to acknowledge, Monte Carlo is also an extremely important method; on some problems it might be the only way to obtain accurate answers.

Metropolis describes how the physicist Enrico Fermi used Monte Carlo before the development of electronic computers. When insomnia struck, he would spend his nights predicting the results of complex neutron scattering experiments by performing statistical sampling calculations with a mechanical adding machine, as analytically deriving the behavior of many neutrons seemed intractable. Fermi's ability to make accurate predictions astonished his colleagues.

Many problems in physics and statistics are complex and involve many variables.
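The 1/S variance scaling (error bars shrinking like 1/√S) can be demonstrated in a few lines. The target distribution and integrand below are our illustrative choices, not an example from the thesis (a minimal sketch):

```python
import random

def simple_monte_carlo(f, sampler, S):
    """Unbiased simple Monte Carlo estimate of E_p[f(x)] from S
    independent draws returned by sampler()."""
    return sum(f(sampler()) for _ in range(S)) / S

# Illustration: estimate E[x^2] = 1 under a standard normal, at two sample
# sizes.  The error shrinks like 1/sqrt(S), with no reference to any notion
# of problem dimensionality.
random.seed(0)
normal = lambda: random.gauss(0.0, 1.0)
rough = simple_monte_carlo(lambda x: x * x, normal, 100)
finer = simple_monte_carlo(lambda x: x * x, normal, 100_000)
```

With 100 draws the estimate is typically off by around 0.1; with 100,000 draws, by a few thousandths.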
Numerical methods that scale exponentially with the dimensionality of the problem will not work at all. In contrast, Monte Carlo is usually simple, and its 1/√S scaling of error bars, independent of dimensionality, may be good enough. Even when more advanced deterministic methods are available, Monte Carlo can be appropriate if today's computers can return useful answers with less implementation effort than more complex methods.

Sampling from distributions

Just as finding a fair sample from a population is difficult in surveys, sampling correctly from arbitrary probability distributions is also hard. Simple Monte Carlo is only as easy to implement as the random variate generator for the entire joint distribution involved. A graphical description of sampling from a probability distribution is given by figure (a): points are drawn uniformly from the unit area underneath the probability density function, and their corresponding x value recorded. This correctly assigns each element of the input space dx the probability p(x)dx. The probability mass to the left of each point, u(x) = Φ(x), is distributed as Uniform[0,1]. To implement a sampler, we can first draw u ~ Uniform[0,1] and then compute x(u) from the inverse cumulative distribution; that is, x(u) = Φ⁻¹(u), where Φ(x) = ∫_{−∞}^{x} p(x′) dx′. Now the difficulty of sampling from distributions with this general method becomes apparent: it is often infeasible even to normalize p, a single integral over the entire state space, whereas Φ(x) is a whole continuum of such integrals.

[Figure: drawing samples from low-dimensional distributions. (a) Direct sampling: sample locations can be taken from points drawn uniformly under the density curve. (b) Rejection sampling provides a way to implement this: samples are drawn uniformly underneath a simpler curve c·q*(x), as in the rejection sampling algorithm below; points above p*(x) ∝ p(x), such as (x_i, h_i), are rejected, and the remaining points come uniformly from underneath p*, so they are valid samples from p. The constant c is chosen such that c·q* is always above p*. The best setting is shown as c_opt, but finding this value is not always possible.]
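For distributions whose cumulative distribution inverts in closed form, the u = Φ(x) construction above is directly usable. A sketch for the exponential distribution, where Φ(x) = 1 − exp(−λx) gives Φ⁻¹(u) = −log(1 − u)/λ; the rate value below is an arbitrary illustration choice:

```python
import math
import random

def sample_exponential(rate, rng=random):
    """Inverse-CDF sampling: draw u ~ Uniform[0,1), return Phi^{-1}(u).

    For the exponential distribution Phi(x) = 1 - exp(-rate * x), so
    Phi^{-1}(u) = -log(1 - u) / rate."""
    u = rng.random()            # u in [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / rate

# Sanity check: the sample mean should approach 1/rate.
random.seed(1)
draws = [sample_exponential(2.0) for _ in range(50_000)]
mean = sum(draws) / len(draws)
```

The same recipe works for any distribution with a tractable Φ⁻¹; the text's point is that for most multivariate distributions no such closed form exists.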
While Φ⁻¹ is tractable for some standard distributions, random number generators must generally use less direct techniques.

Rejection sampling is a method for drawing samples from a distribution p(x) by using another distribution q(x) from which we can already sample. The method requires an ability to evaluate the probability density of points under both distributions, up to a constant: p*(x) ∝ p(x) and q*(x) ∝ q(x). Further, we must be able to find a constant c such that c·q*(x) ≥ p*(x). Sampling underneath a p*(x) ∝ p(x) curve gives the correct probability distribution, and we can sample under p*(x) by drawing samples from the tractable q(x) distribution and applying the algorithm below (figure (b)). The more closely q matches p, the lower the rejection rate can be made. Providing that c is adjusted to maintain a valid bound, we can improve q after each step of the algorithm. Ideally, q would be updated automatically as points from p* are evaluated; in general this could be difficult, but for log-concave distributions there are at least two techniques (Gilks and Wild; Gilks). For applications of rejection sampling to standard distributions, and several more techniques for non-uniform random variate generation, see Devroye.

Algorithm: rejection sampling.
  Inputs: target distribution p*(x) ∝ p(x); simple distribution q(x) with q*(x) ∝ q(x); number of samples S.
  Outputs: {x_s}, S samples from p(x).
  1. Find a constant c such that c·q*(x) ≥ p*(x) for all x (ideally, minimize c within the constraint).
  2. For s = 1 … S:
     a. Draw candidate x ~ q.
     b. Draw height h ~ Uniform[0, c·q*(x)].
     c. If (x, h) is below p*, i.e. h < p*(x), then set x_s = x; else go back to step (a).

Importance sampling

Computing p*(x) and q*(x), then throwing x away along with all the associated computations, seems wasteful. Yet discarding samples is a key part of rejection sampling's correct operation. Importance sampling avoids such rejections by rewriting the integral as an expectation under any distribution q that has the same support as p:

    ∫ f(x) p(x) dx = ∫ f(x) [p(x)/q(x)] q(x) dx    (if q(x) > 0 wherever p(x) > 0)
                   = ∫ f(x) w(x) q(x) dx ≈ (1/S) Σ_{s=1}^{S} f(x_s) w(x_s),    x_s ~ q(x),

where w(x) = p(x)/q(x).
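The identity above can be checked numerically. The target, proposal and integrand below are illustrative choices of ours, not examples from the thesis; both densities are normalized, so the plain weights w = p/q apply:

```python
import math
import random

# Importance sampling check of E_p[f] = E_q[f * p/q]:
# target p = N(0,1), proposal q = N(0,2), integrand f(x) = x^2,
# so the true value is E_p[x^2] = 1.
p_pdf = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
q_pdf = lambda x: math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))

random.seed(1)
S = 100_000
xs = [random.gauss(0.0, 2.0) for _ in range(S)]    # x_s ~ q
weights = [p_pdf(x) / q_pdf(x) for x in xs]        # w(x_s) = p/q
estimate = sum(w * x * x for w, x in zip(weights, xs)) / S
```

Because the proposal here is broader than the target, the weights are bounded and the estimator is well behaved; the text's warning below applies when q is narrower than p.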
qx f xwx qx dx s f xs wxs s one interpretation is that the weights wxs make some samples from q more important than others rather than assigning ha rsh weights of zero and one as in rejection sampling the simplest view is that e quation is is just simple monte carlo integration of an expectation under qx we immediately know that the estimator is unbiased and might obey a central limit t heorem as before if qx is much smaller than px in some regions of the state spac e then the effective function f xpxqx will have very high or even infinite varia nce under qx states with extreme importance weights are rare events under q and might not be observed within a moderate number of samples this means that error bars based on the empirical variance of the importance weights can be very misle ading for a practical example of this problem along with the associated recommen dation to use broad distributions with heavy tails see mackay section interestin gly the ideal q distribution is not equal to p if f x is a positive function the n qx f xpx would give a zero variance estimator this optimal distribution is mar kov chain monte carlo mcmc unobtainable as evaluating qx requires the normalizat ion zq f xpx dx the target integral but sometimes deliberate matches between q a nd p are useful in practice for example when interested in gathering statistics about the tails of p nothing about the importance sampling trick in equation act ually requires the original integral to be an expectation we can divide and mult iply any integrand by a convenient distribution to make it an expectation thus i mportance sampling allows any integral to be approximated by monte carlo as most of the integrals in dominant applications are expectations we still maintain px in the equations throughout evaluating the importance weights wx requires evalu ating px p xzp but often zp is unknown in these cases the normalizing constant m ust be approximated separately as follows zp zq zq s p x dx p x qx dx q x w 
This gives an approximation for the importance weights,
$$w(x^{(s)}) = \frac{Z_q}{Z_p}\,\frac{p^*(x^{(s)})}{q^*(x^{(s)})} \;\approx\; \frac{w^*(x^{(s)})}{\frac{1}{S}\sum_{s'} w^*(x^{(s')})},$$
which can be used within the simple Monte Carlo estimator above. The resulting estimator is biased, but consistent when both of the estimators involved have bounded variance.

Markov chain Monte Carlo (MCMC)

Both rejection sampling and importance sampling require a tractable surrogate distribution $q(x)$. Neither method will perform well if $\max_x p(x)/q(x)$ is large: rejection sampling will rarely return samples, and importance sampling will have large variance. Markov chain Monte Carlo methods can be used to sample from $p(x)$ distributions that are complex and have unknown normalization. This is achieved by relaxing the requirement that the samples should be independent.

A Markov chain generates a correlated sequence of states. Each step in the sequence is drawn from a transition operator $T(x' \leftarrow x)$, which gives the probability of moving from state $x$ to state $x'$. According to the Markov property, the transition probabilities depend only on the current state $x$. In particular, any free parameters (e.g. step sizes) in a family of transition operators $T(x' \leftarrow x)$ cannot be chosen based on the history of the chain.

A basic requirement for $T$ is that, given a sample from $p(x)$, the marginal distribution over the next state in the chain is also the target distribution of interest $p$:
$$p(x') = \sum_x T(x' \leftarrow x)\,p(x) \qquad \text{for all } x'.$$
By induction, all subsequent steps of the chain will have the same marginal distribution; the transition operator is said to leave the target distribution $p$ stationary. MCMC algorithms often require operators that ensure the marginal distribution over a state of the chain tends to $p(x)$ regardless of the starting state. This requires irreducibility (the ability to reach any $x$ where $p(x) > 0$ in a finite number of steps) and aperiodicity (no states are only accessible at certain regularly spaced times); for more details see Tierney. For now we note that, as long as a $T$ satisfies the stationarity condition above, it can be useful, as the other conditions can be met through combinations with other operators.
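To make the stationarity condition concrete, the following sketch (our illustration, not from the thesis) builds a small discrete chain, checks $p(x') = \sum_x T(x' \leftarrow x)\,p(x)$ numerically, and shows that repeated application of $T$ drives an arbitrary starting distribution toward $p$, which is the role played by irreducibility and aperiodicity.

```python
import numpy as np

# Target distribution over 3 states; columns T[:, x] give probabilities of
# the next state x' given the current state x.
p = np.array([0.2, 0.3, 0.5])

# One way to build a valid T: a Metropolised random walk on {0, 1, 2}.
T = np.zeros((3, 3))
for x in range(3):
    for x_new in (x - 1, x + 1):          # propose a neighbour w.p. 1/2 each
        if 0 <= x_new < 3:
            T[x_new, x] = 0.5 * min(1.0, p[x_new] / p[x])
    T[x, x] = 1.0 - T[:, x].sum()         # rejected proposals stay put

assert np.allclose(T @ p, p)              # stationarity: p = sum_x T(.|x) p(x)

# The chain is irreducible and aperiodic, so any start converges to p.
marginal = np.array([1.0, 0.0, 0.0])      # arbitrary initial distribution
for _ in range(300):
    marginal = T @ marginal
assert np.allclose(marginal, p, atol=1e-6)
```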
Given that $p(x)$ is a complicated distribution, it might seem unreasonable to expect that we could find a transition operator $T$ leaving it stationary. However, it is often easy to construct a transition operator satisfying detailed balance:
$$T(x' \leftarrow x)\,p(x) = T(x \leftarrow x')\,p(x') \qquad \text{for all } x, x'.$$
This states that a step starting at equilibrium and transitioning under $T$ has the same probability forwards ($x \to x'$) or backwards ($x' \to x$). Proving detailed balance only requires considering each pair of states in isolation; there is no sum over all states as in the stationarity condition. Having shown detailed balance, summing over $x$ on both sides immediately recovers the stationarity requirement. Thus detailed balance is a useful property for deriving many MCMC methods; however, it is not always required, or even desirable. A later chapter introduces some MCMC transition operators that do not satisfy it.

Given any transition operator $T$ satisfying the stationarity condition, we can construct a reverse operator $\widetilde{T}$, defined by
$$\widetilde{T}(x \leftarrow x') = \frac{T(x' \leftarrow x)\,p(x)}{\sum_x T(x' \leftarrow x)\,p(x)} = \frac{T(x' \leftarrow x)\,p(x)}{p(x')}.$$
A symmetric form of this relationship shows that an operator satisfying detailed balance is its own reverse transition operator:
$$\widetilde{T}(x \leftarrow x')\,p(x') = T(x' \leftarrow x)\,p(x) \qquad \text{for all } x, x'.$$
Summing over $x$ or $x'$ reveals that this mutual reversibility condition implies that the stationarity condition holds for both $T$ and $\widetilde{T}$. Therefore, constructing a pair of mutually reversible transition distributions is an alternative strategy for constructing MCMC operators; detailed balance is the restricted case where a transition operator must be its own reverse operator.
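The reverse-operator construction can be checked numerically. In this sketch (ours, for illustration) we take an arbitrary column-stochastic $T$, recover its stationary $p$ as the eigenvector with eigenvalue one, and verify that $\widetilde{T}(x \leftarrow x') = T(x' \leftarrow x)\,p(x)/p(x')$ is itself a valid transition operator leaving $p$ stationary.

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary transition matrix; columns T[:, x] sum to one.
T = rng.random((4, 4))
T /= T.sum(axis=0, keepdims=True)

# Stationary distribution: the eigenvector of T with eigenvalue 1.
vals, vecs = np.linalg.eig(T)
p = np.real(vecs[:, np.argmax(np.real(vals))])
p /= p.sum()

# Reverse operator: T_rev[x, x'] = T[x', x] * p[x] / p[x'].
T_rev = T.T * p[:, np.newaxis] / p[np.newaxis, :]

assert np.allclose(T_rev.sum(axis=0), 1.0)   # valid transition operator
assert np.allclose(T_rev @ p, p)             # also leaves p stationary
```

For an operator that satisfies detailed balance (such as the Metropolised chain above), this construction would return $T$ itself.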
Choice of method

In this introduction we described the importance of high-dimensional probability distributions. Samples from these distributions capture their typical properties, which is the basis of the Monte Carlo estimation of expectations such as $\int f(x)\,p(x)\,\mathrm{d}x$. We assume, or hope, that extreme values under $p(x)$ do not form a significant contribution to the integral; for expectations in many physical and statistical applications this is a reasonable assumption.

In high-dimensional problems, sampling from an interesting target distribution $p(x)$ is often intractable. Methods such as rejection sampling fail because finding a close-matching simple distribution $q(x)$ is not possible; for the same reason, importance sampling estimators tend to have very high variance. We described rejection sampling as it remains an important method when i.i.d. samples from low-dimensional distributions are required; this occurs in some simulation work, and as part of some MCMC methods. Similarly, while simple importance sampling has problems in high dimensions, it provides the basis of more advanced methods that are useful on some high-dimensional problems. We neglect much of the importance sampling literature, but will study some methods relating to Markov chains.

The focus of this thesis is methods that use Markov chains. These allow us to draw correlated samples from complex distributions, and to perform Monte Carlo integration in high-dimensional problems. The next chapter reviews several important algorithms based on Markov chains, together with some new contributions. After this we will be in a position to outline the remainder of the thesis.

Chapter: Markov chain Monte Carlo

This chapter reviews some important Markov chain Monte Carlo (MCMC) algorithms. Much of this material is standard and could be skipped by those already familiar with the literature. However, some of the material in this chapter is, to the best of our knowledge, novel. These contributions include generalizations of tempered transitions (subsection …) and the introduction of a slice-sampling version of the two-stage acceptance rule (sections …). We also present some results in an unconventional way, which is designed to help with reading later chapters.

Metropolis methods

The algorithm below gives a procedure for simulating a Markov chain with stationary distribution $p(x)$, due to Hastings.
Algorithm: Metropolis–Hastings
Input: initial setting $x$, number of iterations $S$
For $s = 1 \dots S$:
  1. Propose $x' \sim q(x'; x)$.
  2. Compute $a = \dfrac{p^*(x')\,q(x; x')}{p^*(x)\,q(x'; x)}$.
  3. Set $x \leftarrow x'$ with probability $\min(1, a)$, e.g.:
     (a) draw $r \sim \mathrm{Uniform}[0, 1]$;
     (b) if $r < a$ then set $x \leftarrow x'$.
End for

The setting of $x$ at the end of each iteration is considered as a sample from the target probability distribution $p(x)$. Adjacent samples are identical when the state is not updated in step (b), but every iteration must be recorded as part of the Markov chain.

It is straightforward to show that the Metropolis–Hastings algorithm satisfies detailed balance:
$$
T(x' \leftarrow x)\,p(x) = q(x'; x)\,\min\!\left(1,\, \frac{p(x')\,q(x; x')}{p(x)\,q(x'; x)}\right) p(x)
= \min\big(p(x)\,q(x'; x),\; p(x')\,q(x; x')\big)
= q(x; x')\,\min\!\left(\frac{p(x)\,q(x'; x)}{p(x')\,q(x; x')},\, 1\right) p(x')
= T(x \leftarrow x')\,p(x'),
$$
as required. Here the probability for a forwards transition $x \to x'$ was manipulated into the probability of the reverse transition $x' \to x$; however, it would have been sufficient to stop after the second line and note that the expression is symmetric in $x$ and $x'$.

The algorithm is valid for any proposal distribution $q$. Ideal proposals would be constructed for rapid exploration of the distribution of interest $p$. We could write $q(x'; x, \mathcal{D})$ to emphasize choices of proposal that are based on observed data $\mathcal{D}$; using any fixed $\mathcal{D}$ is valid, but $q$ cannot be based on the past history of the sampler, as such a chain would not be Markov. Often proposals are not complicated data-based distributions: simple perturbations, such as a Gaussian with mean $x$, are commonly chosen. The accept/reject step in the algorithm corrects for the mismatch between the proposal and target distributions.

When the proposal distribution is symmetric, i.e. $q(x'; x) = q(x; x')$ for all $x, x'$, the acceptance ratio in step 2 simplifies to a ratio under the distribution of interest, $a = p^*(x')/p^*(x)$. This is the original Metropolis algorithm (Metropolis et al.). Some authors (e.g. MacKay) prefer to drop such distinctions and simply refer to all Metropolis–Hastings algorithms as Metropolis methods; the next section suggests another justification of this view.
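The algorithm above, written as code. This is our own minimal sketch, with a symmetric Gaussian perturbation proposal (so the acceptance ratio reduces to $p^*(x')/p^*(x)$) and an arbitrary unnormalized 1D target; note that the normalizer of $p^*$ is never needed.

```python
import numpy as np

def metropolis(log_p_star, x0, n_steps, step_size, rng):
    """Random-walk Metropolis for a 1D unnormalized target log_p_star."""
    x = x0
    log_px = log_p_star(x)
    samples = np.empty(n_steps)
    for s in range(n_steps):
        x_prop = x + step_size * rng.standard_normal()  # symmetric q(x'; x)
        log_pprop = log_p_star(x_prop)
        if np.log(rng.random()) < log_pprop - log_px:   # accept w.p. min(1, a)
            x, log_px = x_prop, log_pprop
        samples[s] = x  # record the state every iteration, accepted or not
    return samples

rng = np.random.default_rng(2)
# Unnormalized N(3, 2^2) target.
samples = metropolis(lambda x: -0.5 * ((x - 3.0) / 2.0) ** 2,
                     x0=0.0, n_steps=50_000, step_size=2.5, rng=rng)
burned = samples[1000:]   # discard burn-in (see "Construction of estimators")
```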
Generality of Metropolis–Hastings

Consider an MCMC algorithm that proposes a state from a distribution $q(x'; x)$ and accepts with probability $p_a(x' \leftarrow x)$. Restricting attention to transition operators satisfying detailed balance,
$$p_a(x' \leftarrow x)\,q(x'; x)\,p(x) = p_a(x \leftarrow x')\,q(x; x')\,p(x') \qquad \text{for all } x, x',$$
gives an equality constraint. We also have the inequalities $p_a(x' \leftarrow x) \le 1$ and $p_a(x \leftarrow x') \le 1$, which must hold for probabilities. Optimizing the average acceptance probability, $p(x)\,p_a(x' \leftarrow x) + p(x')\,p_a(x \leftarrow x')$, with respect to $p_a(x' \leftarrow x)$ and $p_a(x \leftarrow x')$ must saturate one or more of the inequalities (a well-known property of linear programming problems). Therefore either $p_a(x' \leftarrow x) = 1$ or $p_a(x \leftarrow x') = 1$, and the other acceptance probability is given by the equality constraint. This gives the Metropolis–Hastings acceptance rule:
$$p_a(x' \leftarrow x) = \min\!\left(1,\, \frac{p(x')\,q(x; x')}{p(x)\,q(x'; x)}\right).$$
According to the equality constraint, any pair of valid acceptance probabilities must have the same ratio $p_a(x' \leftarrow x)/p_a(x \leftarrow x')$ as any other valid pair; therefore they can be obtained by multiplying the Metropolis–Hastings probabilities by a constant less than one. This corresponds to mixing in some fraction of the "do nothing" transition operator, which leaves the chain in the current state, for those two states. It also corresponds to adjusting the proposal distribution $q$ to suggest staying still more often. Staying still more often harms the asymptotic variance of the chain; in this sense (although not by all measures) using the Metropolis–Hastings rule is optimal (Peskun).

Given this result it is unsurprising that Metropolis–Hastings has become almost synonymous with MCMC. It is tempting to conclude that the only way to improve Markov chains for Monte Carlo is by researching domain-specific proposal distributions, such as in the vision community's "data-driven MCMC" (Tu and Zhu). In fact, a rich variety of more generic MCMC-based algorithms exist and continue to be developed. Many of these do satisfy detailed balance, but the corresponding Metropolis–Hastings $q(x'; x)$ distribution is often defined implicitly and would not be a natural description of the method. To illustrate the limitations of claiming all reversible MCMC is "just Metropolis–Hastings", we show that all
reversible MCMC is "just Metropolis". This also demonstrates a methodology used throughout the thesis. We construct a new target distribution,
$$p(x, x') = q(x'; x)\,p(x),$$
i.e. the joint distribution over a point $x \sim p$ and a point $x'$ proposed from that location. The marginal distribution over $x$ is the original target $p$. Now consider the symmetric Metropolis proposal that swaps the values of $x$ and $x'$, such that, putatively, $x'$ comes from $p$ and $x$ was proposed from it. The Metropolis acceptance probability for this swap proposal is
$$\min\!\left(1,\, \frac{p(x', x)}{p(x, x')}\right) = \min\!\left(1,\, \frac{p(x')\,q(x; x')}{p(x)\,q(x'; x)}\right),$$
i.e. the Metropolis–Hastings acceptance probability. In this sense only an algorithm with symmetric proposals is needed, but this is not a very natural description of the Metropolis–Hastings algorithm. Similarly, while it is possible to describe all algorithms satisfying detailed balance as Metropolis or Metropolis–Hastings algorithms, other descriptions may be more natural. However, we will find constructing joint distributions like $p(x, x')$ a useful theoretical tool, and one that suggests new Markov chain operators.

Gibbs sampling

Gibbs sampling (Geman and Geman) resamples each dimension $x_i$ of a multivariate quantity $x$ from its conditional distribution $p(x_i | x_{j \neq i})$. Any individual update maintains detailed balance, which is easily checked directly. Alternatively, we can write the Gibbs sampling update as a Metropolis–Hastings proposal,
$$q(x'; x) = p(x'_i | x_{j \neq i})\;\mathbb{I}(x'_{j \neq i} = x_{j \neq i}),$$
where $\mathbb{I}$ is an indicator function ensuring all components other than $x_i$ stay fixed. The acceptance probability for this proposal is identically one, so it need not be checked.

Gibbs sampling is often easy to implement. If the target distribution is discrete and each variable takes on a small number of settings, then the conditional distributions can be explicitly computed:
$$p(x_i | x_{j \neq i}) = \frac{p^*(x_i, x_{j \neq i})}{\sum_{x'_i} p^*(x'_i, x_{j \neq i})}.$$
In continuous problems the one-dimensional conditional distributions are usually amenable to standard sampling methods, as mentioned in subsection …. A desirable feature of Gibbs sampling is that it has no free parameters, and so can be applied fairly automatically.
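A Gibbs sweep is easiest to see in a toy case where the conditionals are available in closed form. In our illustration below (not a model from the thesis), the target is a bivariate Gaussian with unit variances and correlation $\rho$, for which $p(x_1 | x_2) = \mathcal{N}(\rho x_2,\, 1 - \rho^2)$ and symmetrically for $x_2$.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_sweeps, rng):
    """Resample each coordinate from its exact conditional in turn."""
    x = np.zeros(2)
    cond_sd = np.sqrt(1.0 - rho**2)
    samples = np.empty((n_sweeps, 2))
    for s in range(n_sweeps):
        x[0] = rng.normal(rho * x[1], cond_sd)  # x1 ~ p(x1 | x2)
        x[1] = rng.normal(rho * x[0], cond_sd)  # x2 ~ p(x2 | x1)
        samples[s] = x
    return samples

rng = np.random.default_rng(3)
samples = gibbs_bivariate_gaussian(rho=0.5, n_sweeps=40_000, rng=rng)
```

As $\rho \to 1$ the conditionals become narrow and the sweep mixes slowly, which is why a reparameterization to independent dimensions (mentioned below) would be so effective.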
Indeed, the BUGS packages can create Gibbs samplers for a large variety of models specified using a simple description language (Spiegelhalter et al.). There are actually some free choices regarding how the variables are updated: block-Gibbs sampling chooses groups of variables to update at once. Also, continuous distributions can be reparameterized before Gibbs sampling; a basis in which the dimensions are independent would obviously be particularly effective.

Clifford et al. contains several interesting discussions regarding the nature and history of the Gibbs sampler. The method has often been regarded with a rather undeserved special status amongst MCMC methods; it is just one of many ways to construct a Markov chain with a target stationary distribution. In particular, there is no need to approximate Gibbs sampling when sampling from conditionals is not tractable, as in Ritter and Tanner: Metropolis–Hastings updates of each dimension can be used instead, and there is no need to call this method "Metropolis-within-Gibbs".

A two-stage acceptance rule

If the target distribution is factored into two terms, for example a prior and a likelihood, $p^*(x) = \pi(x)\,L(x)$, a proposal can be accepted with probability
$$p_a = \min\!\left(1,\, \frac{q(x; x')\,\pi(x')}{q(x'; x)\,\pi(x)}\right)\cdot \min\!\left(1,\, \frac{L(x')}{L(x)}\right).$$
Straightforward checking shows that this satisfies detailed balance. We know from previous discussions that it will accept less often than standard Metropolis–Hastings, and is inferior according to Peskun, but computationally it can be more efficient. The algorithm below is an acceptance rule that accepts proposals with probability $p_a$. If $\pi$ is able to veto an a priori unreasonable proposal, then $L(x')$ need not be computed. Factoring out a cheap sanity-check distribution $\pi(x)$ may be worth a fall in acceptance rate in problems where likelihood evaluations are expensive.

Algorithm: a two-stage acceptance rule
Input: initial setting $x$, proposed setting $x'$
  1. Draw $r_1 \sim \mathrm{Uniform}[0, 1]$.
  2. If $r_1 < \dfrac{q(x; x')\,\pi(x')}{q(x'; x)\,\pi(x)}$:
       draw $r_2 \sim \mathrm{Uniform}[0, 1]$;
       if $r_2 < \dfrac{L(x')}{L(x)}$: accept; else: reject.
  3. Else: reject.
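The two-stage rule in code. This is our own sketch with a toy factorization ($\pi$ a broad "prior", $L$ a "likelihood") and a symmetric proposal, so the first test reduces to $\pi(x')/\pi(x)$; the point is that the expensive factor $L$ is only evaluated when the cheap $\pi$-test passes.

```python
import numpy as np

def two_stage_step(x, log_pi, log_L, step_size, rng, cache):
    """One two-stage Metropolis update; cache holds (log_pi(x), log_L(x))."""
    lpi_x, lL_x = cache
    x_prop = x + step_size * rng.standard_normal()   # symmetric q
    lpi_prop = log_pi(x_prop)
    if np.log(rng.random()) >= lpi_prop - lpi_x:
        return x, cache                  # vetoed cheaply: L never evaluated
    lL_prop = log_L(x_prop)              # expensive factor, second stage
    if np.log(rng.random()) < lL_prop - lL_x:
        return x_prop, (lpi_prop, lL_prop)
    return x, cache

# Toy target: pi = N(0, 2^2) "prior", L = N(1, 1) "likelihood",
# so the product is N(0.8, 0.8).
log_pi = lambda x: -0.5 * (x / 2.0) ** 2
log_L = lambda x: -0.5 * (x - 1.0) ** 2
rng = np.random.default_rng(4)
x, cache = 0.0, (log_pi(0.0), log_L(0.0))
samples = np.empty(60_000)
for s in range(samples.size):
    x, cache = two_stage_step(x, log_pi, log_L, 1.5, rng, cache)
    samples[s] = x
```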
As with standard Metropolis–Hastings, none of the probabilities need to be computed exactly: bounding them sufficiently to make each accept/reject decision is all that is required. In practice few Monte Carlo codes implement the required interval arithmetic to make these savings; in contrast, the two-stage procedure is generally applicable and easy to implement. The algorithm is a special case of the "surrogate transitions" method (Liu, pp. …), and also the algorithm of Christen and Fox. The two-stage method here is also equivalent to Dostert et al.'s algorithm II, which presented it in the context of a particular choice for $q(x'; x)$. This literature has a rich variety of possible approximate distributions that can be used for $\pi(x)$ in particular applications; in a statistical application where $p(x)$ depends on a data set, a general choice would be to use a subset of the data to define $\pi(x)$. An obvious generalization is splitting the acceptance rule into more than two terms. Factoring the distribution into many terms could make the acceptance rate fall dramatically, so there would need to be a specific computational benefit.

Construction of estimators

The output of an MCMC algorithm is a set of correlated samples, drawn from a joint distribution $p(x^{(1)}, \dots, x^{(S)})$. At equilibrium, every marginal $p(x^{(s)})$ is the correct distribution of interest. Using these equilibrium samples in the straightforward Monte Carlo construction, the estimator would be unbiased:
$$\mathbb{E}_{p(x^{(1)},\dots,x^{(S)})}\!\left[\frac{1}{S}\sum_{s=1}^{S} f(x^{(s)})\right] = \frac{1}{S}\sum_{s=1}^{S}\,\sum_{x^{(s)}} f(x^{(s)})\,p(x^{(s)}) = \mathbb{E}_p[f(x)].$$
The states drawn from a Markov chain started at an arbitrary position will not have the correct marginal distribution. Asymptotically the estimator is still unbiased under weak conditions, but in practice it is advisable to discard some initial states, allowing the chain to "burn in". After this, there is no need to remove intermediate samples in an attempt to obtain a set of approximately
independent samples: such "thinning" will only make the variance of the estimator worse (Geyer). However, thinning can be justified when adjacent steps are highly correlated, or when computing $f(x)$ is costly; then discarding some samples can save computer time that is better spent running the Markov chain for longer.

A related issue is whether it is better to run one long Markov chain, or perhaps to obtain independent points more quickly by trying many initializations. Geyer also shows that, theoretically, a longer chain is to be preferred. Intuitively, an arbitrary initialization is bad as it introduces bias, which takes time to remove; it turns out this time is better spent evolving an existing chain to new locations. Despite the theory favoring running a single chain, we tend to prefer running a few, rather than one, when more than one computer processor is available: this is the easiest way to parallelize our code. Although agreement amongst multiple chains does not provide a guarantee that the Markov chains are mixing, differing results would reveal a failure to mix that might go unnoticed from a single chain. Finally, multiple chains are useful for adapting step sizes in Metropolis methods, and turn up naturally in population methods (chapter …).

Conditional estimators ("Rao-Blackwellization")

Using a sample average of function values is not the only way to construct MCMC estimators. It is an attractive default procedure, as the estimator only needs to know $f(x)$ evaluated at the samples. However, it is sometimes possible to perform some computations involving $f(x)$ analytically. A simple identity lets us use this analytical knowledge:
$$\sum_x f(x)\,p(x) = \sum_h \left[\sum_x f(x)\,p(x|h)\right] p(h),$$
where $h$ is an arbitrary statistic. If we can sum over the distribution conditioned on $h$, then the bracketed term, evaluated under samples of $h$, can be used as an estimator. A special case provides more concrete motivation:
$$\sum_x f(x_i)\,p(x) = \sum_{x_{j\neq i}} \left[\sum_{x_i} f(x_i)\,p(x_i | x_{j\neq i})\right] p(x_{j\neq i}) = \mathbb{E}_{p(x_{j\neq i})}\!\Big[\mathbb{E}_{p(x_i|x_{j\neq i})}[f(x_i)]\Big].$$
Here we are interested in a function of a
particular variable $x_i$; sums over a single variable are often tractable. The identity allows averaging $\mathbb{E}_{p(x_i|x_{j\neq i})}[f(x_i)]$ under Monte Carlo samples, rather than $f(x)$ itself.

As an example, consider finding the marginal of a binary variable, $f(x_i) = x_i$. The standard estimator throws away $x_{j\neq i}$ and averages the $x_i$ samples: a sequence of zeros and ones. The new estimator throws away the observed $x_i$ values, but does use $x_{j\neq i}$; the resulting average of real numbers between zero and one can be much better behaved. If $x_i$ is nearly independent of the other variables, the new estimator gives nearly the correct answer from one sample, whereas the standard estimator is always noisy.

Does the conditioning trick improve variances in general? Without loss of generality, we assume that the function of interest has zero mean. Under this assumption, the variance of the estimator that conditions on a statistic $h$ becomes
$$\mathrm{var}_{p(h)}\Big[\mathbb{E}_{p(x|h)}[f(x)]\Big] = \sum_h p(h)\left(\sum_x p(x|h)\,f(x)\right)^{\!2} \le \sum_h p(h)\sum_x p(x|h)\,f(x)^2 = \sum_x p(x)\,f(x)^2,$$
where the bound is an application of Jensen's inequality, using the convexity of the square function, and the final equality follows from summing out $h$. This means that
$$\mathrm{var}_{p(h)}\Big[\mathbb{E}_{p(x|h)}[f(x)]\Big] \le \mathrm{var}_{p(x)}[f(x)]:$$
the variance of the conditional estimator is never worse than that of the standard Monte Carlo estimator. This result applies under independent sampling from $p(x)$; unfortunately, the result does not hold in general for correlated MCMC samples (Liu et al.). Even if the variance is improved, we don't generally know when a conditional estimator will be computationally more efficient than the straightforward Monte Carlo estimator. As an extreme example, a constant statistic $h$ gives an estimator with zero variance, because the target expectation is then computed exactly; clearly the cost of computing the conditional expectation must be considered.

The use of this identity was suggested by Gelfand and Smith, an influential paper that popularized Gibbs sampling in the statistics community. Their motivation was essentially the variance bound above, cited as a version of the Rao–Blackwell theorem.
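The binary-marginal example can be made numeric. This sketch is ours, using i.i.d. samples (so the variance guarantee applies) from a toy two-variable model: it compares averaging raw 0/1 draws of $x_i$ with averaging the conditional probabilities $\mathbb{E}[x_i | x_j]$.

```python
import numpy as np

rng = np.random.default_rng(5)

def p_xi_given_xj(xj):
    # Conditional P(x_i = 1 | x_j); x_i is nearly independent of x_j.
    return np.where(xj == 1, 0.40, 0.30)

n = 2_000
xj = (rng.random(n) < 0.5).astype(int)                 # x_j ~ Bernoulli(0.5)
xi = (rng.random(n) < p_xi_given_xj(xj)).astype(int)   # x_i | x_j

standard = xi.mean()                     # average of noisy 0/1 samples
conditional = p_xi_given_xj(xj).mean()   # average of E[x_i | x_j] values

# True marginal: P(x_i = 1) = 0.5*0.40 + 0.5*0.30 = 0.35.  The conditional
# estimator averages numbers in {0.30, 0.40}, so each term has far lower
# variance than a raw 0/1 sample.
```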
The bound's requirement of independent samples was satisfied because the paper proposed running many independent Gibbs sampling runs. As this practice has fallen out of favor, it seems misleading to continue to call the method "Rao-Blackwellization", although some of the literature continues to do so. Despite the lack of Rao–Blackwell guarantees, the estimator can often be justified empirically, as was done earlier by at least Pearl.

Waste recycling

The Metropolis–Hastings algorithm has the undesirable property that, for many proposal distributions, a large fraction of proposed states are rejected from the final set of samples. As computations involving these rejected points were performed, it seems a shame not to use this information in Monte Carlo estimators. The same observation followed our description of rejection sampling, for which the solution was to use the same proposal distribution in importance sampling. MH proposal distributions are generally local in nature and unsuitable as importance sampling proposals. Waste recycling is a framework for constructing better estimators using the importance ratios computed while running MCMC.

A simple way to understand how to perform waste recycling is to augment the stationary distribution over the current state $x$ and the next proposal $x'$ with an auxiliary variable $\tilde{x}$. We declare that $\tilde{x}$ was generated by copying one of $x$ and $x'$:
$$p(\tilde{x} | x, x') = a\,\delta(\tilde{x}, x') + (1 - a)\,\delta(\tilde{x}, x), \qquad a = \min\!\left(1,\, \frac{p^*(x')\,q(x; x')}{p^*(x)\,q(x'; x)}\right).$$
Alternatively, any other rule that gives $\tilde{x}$ the same stationary distribution as $x$ can be used. Now we could use draws of $\tilde{x}$ for estimators instead of $x$. We can also average estimators under the conditional distribution of $\tilde{x}$ given $x$ and $x'$:
$$\mathbb{E}_{p(x)}[f(x)] = \mathbb{E}_{q(x'; x)\,p(x)}\Big[\mathbb{E}_{p(\tilde{x}|x,x')}[f(\tilde{x})]\Big] = \mathbb{E}_{q(x'; x)\,p(x)}\Big[a\,f(x') + (1 - a)\,f(x)\Big].$$
This is just the conditional estimator of the previous section, applied to the auxiliary distribution $p(\tilde{x}, x, x') = p(\tilde{x} | x, x')\,q(x'; x)\,p(x)$. We hesitate to assign credit for this estimator to any particular author: it, or closely related estimators, can be found in various independent sources, including at least Kalos and Whitlock, Tjelmeland, and Frenkel.
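The waste-recycling estimator drops straight into a Metropolis loop: at every iteration, accepted or not, average $a\,f(x') + (1-a)\,f(x)$ instead of only $f$ at the chain's states. A toy sketch of ours, with a symmetric proposal:

```python
import numpy as np

def metropolis_waste_recycling(log_p_star, f, x0, n_steps, step_size, rng):
    """Return the standard estimate of E[f] and the waste-recycling
    estimate, which uses every proposal's acceptance probability."""
    x = x0
    log_px = log_p_star(x)
    std_sum = 0.0
    wr_sum = 0.0
    for _ in range(n_steps):
        x_prop = x + step_size * rng.standard_normal()
        log_pprop = log_p_star(x_prop)
        a = min(1.0, np.exp(log_pprop - log_px))        # acceptance probability
        wr_sum += a * f(x_prop) + (1.0 - a) * f(x)      # E[f(x_tilde) | x, x']
        if rng.random() < a:
            x, log_px = x_prop, log_pprop
        std_sum += f(x)
    return std_sum / n_steps, wr_sum / n_steps

rng = np.random.default_rng(6)
std_est, wr_est = metropolis_waste_recycling(
    log_p_star=lambda x: -0.5 * x**2,    # unnormalized N(0, 1)
    f=lambda x: x**2, x0=0.0, n_steps=50_000, step_size=2.0, rng=rng)
# both estimate E[x^2] = 1 under N(0, 1)
```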
Waste recycling is not necessarily better than the straightforward estimator, let alone worth the computational expense. However, there is empirical evidence that it can be useful, dramatically so in the case of Frenkel. Casella and Robert also derived a "Rao-Blackwellized" estimator for using all the points drawn by the Metropolis algorithm; however, their algorithm has an $O(S^2)$ cost in the number of steps of the Markov chain, and unsurprisingly has not been widely adopted.

[Figure: challenges for Markov chain exploration. (a) Slow diffusion. (b) Burn-in and mode finding. (c) Balancing density and volume.]

Convergence

No review of Markov chains would be complete without discussing their convergence properties. If the simulation is initialized at an arbitrary starting location, how long must we wait before we can treat the chain's states as samples from the stationary distribution? Often Metropolis proposals have a local nature: the typical step size is limited by a need to maintain a reasonable acceptance rate. The amount of time it takes to explore a distance of length $l$ by a diffusive random walk scales like $l^2$ (figure a). This offers a rule of thumb for understanding sampling within a mode.

An arbitrarily chosen initial state is usually very improbable under the target distribution. Reaching a mode in high dimensions can take a long time with sophisticated optimizers, let alone a Markov chain simulation. Analyzing the chain can identify when the statistics of the sampler settle down, allowing the initial exploratory phase to be discarded. Such diagnostics could be severely misleading if there are multiple modes, as in figure (b): in general applications there is no way of knowing if a Markov chain has yet to find the most important regions of a probability distribution.

There are also more subtle problems with convergence that do not relate to finding modes or step sizes. Figure (c) illustrates a
distribution with a "pillar" at $|x| < a$: points within this mode have higher probability density than any state within the shaded plateau, $a < |x| < A$. But given the large size of the plateau, few proposals will suggest moving to the pillar. Initializing the chain within the pillar (using an optimizer) would not help here: $P(|x| < a)$ and $P(a < |x| < A)$, the pillar and the plateau, have equal probability mass, yet only a small fraction of proposals from the pillar to the plateau can be accepted; the acceptance rate is $a/(A - a)$ for symmetric proposals. A proposal distribution that knew the probability masses involved (e.g. i.i.d. sampling, $q(x'; x) = p(x')$) would move rapidly between the modes, but this would involve already having the knowledge that MCMC is being used to discover.

There is a theoretical literature on Markov chain convergence, but for general statistical problems it is difficult to prove much about the validity of any particular MCMC result. In later chapters we make occasional use of a standard diagnostic tool to compare the convergence properties of chains. But in general, we would also run the sampling code on a cut-down version of the problem, where exact results are available or simple importance sampling will work. In inference problems it is a good idea to check that a sampler is consistent with ground truth when learning from synthetic data generated from the prior model. Geweke's "getting it right" method is a more advanced way to check consistency between samplers for the prior and posterior. Such diagnostics are also a guard against errors in the software implementation, a possibility considered seriously even by experts (Kass et al.).

Auxiliary variable methods

As emphasized in the introduction, Monte Carlo is usually best avoided when possible: in tractable problems, computations based on analytical solutions will usually be far more efficient. One would think this means we should analytically marginalize out variables wherever possible. Auxiliary variable methods turn this
thinking on its head. Given a target distribution $p(x)$, we introduce auxiliary variables $z$ such that $p(x) = \int p(x, z)\,\mathrm{d}z$. We could just sample from $p(x)$, corresponding to analytically integrating out $z$ from $p(x, z)$. Instead, auxiliary variable methods instantiate the auxiliary variables in a Markov chain that explores the joint distribution; surprisingly, this can result in a better Monte Carlo method. An alternative way of introducing extra variables is to make the target distribution a conditional of a joint distribution: the whole joint distribution is explored, but only samples from the conditional of interest are retained. This seemingly wasteful procedure can also yield better Monte Carlo estimates. An example is simulated tempering, discussed later in subsection ….

Swendsen–Wang

The Swendsen–Wang algorithm (Swendsen and Wang) is a highly effective algorithm for sampling Potts models (subsection …) with a small number of colors $q$. The algorithm is important in its own right (we use it in chapters … and …) and as a significant development in auxiliary variable methods. Edwards and Sokal provided a scheme for constructing similar auxiliary variable methods for a wider class of models. They also identified the Fortuin–Kasteleyn–Swendsen–Wang (FKSW) auxiliary variable joint distribution that underlies the algorithm. The FKSW joint distribution is over the original Potts color variables $s = \{s_i\}$ on the nodes of a graph, and binary "bond" variables $d = \{d_{ij}\}$ present on each edge:
$$p(s, d) = \frac{1}{Z}\prod_{(i,j)\in E}\Big[(1 - p_{ij})\,\delta(d_{ij}, 0) + p_{ij}\,\delta(d_{ij}, 1)\,\delta(s_i, s_j)\Big], \qquad p_{ij} = 1 - e^{-J_{ij}}.$$
As long as all couplings $J_{ij}$ are positive, the marginal distribution over $s$ is the Potts distribution. The marginal distribution over the bonds is the random cluster model of Fortuin and Kasteleyn,
$$p(d) = \frac{1}{Z}\left[\prod_{(i,j)\in E}(1 - p_{ij})^{1 - d_{ij}}\,p_{ij}^{\,d_{ij}}\right] q^{\,C(d)},$$
where $C(d)$ is the number of connected components in a graph with edges wherever $d_{ij} = 1$. The algorithm of Swendsen and Wang performs block Gibbs sampling on the joint model by alternately sampling from $p(d|s)$ and $p(s|d)$. This also allows a sample from any of the three
distributions (Potts $p(s)$, random cluster $p(d)$, or FKSW $p(s, d)$) to be converted into a sample from one of the others. The algorithm is illustrated in the figure below. While implementing Swendsen–Wang we were grateful for the efficient percolation code available in Newman and Ziff.

It is commonly stated that Swendsen–Wang only allows positive $J_{ij}$ couplings, as presented here. In fact, Swendsen and Wang described the extension to negative bonds, which also follows easily from Edwards and Sokal's generalization. The resulting algorithm only forms bonds on edges where $J_{ij} < 0$ if the adjacent sites have different colors; colors joined by these negative bonds must remain different when choosing a new coloring. For binary systems, two colorings of a cluster are possible: the previous coloring and its inverse. When there are many strong negative interactions, the entire system gets locked into one of two configurations, whereas many configurations are probable. Swendsen–Wang can still be very useful when only some connections have negative weights; Stern et al. provide a nice example in the context of modeling the game of Go. For a wider view of related methods in the context of spatial statistics, see Besag and Green.

[Figure: the Swendsen–Wang algorithm. (a) A Potts state: as an example we run the algorithm on a binary Potts model on a square lattice. (b) An FKSW state: under the FKSW conditional $p(d|s)$, bonds are placed down with probability $p_{ij} = 1 - e^{-J_{ij}}$ wherever adjacent sites have the same color. (c) A random cluster state: discarding the colors gives a sample from the random cluster model. (d) A new FKSW state: sampling from $p(s|d)$ involves assigning each connected component, or "cluster", a new color uniformly at random. Discarding the bonds gives a new setting of the Potts model. This coloring is dramatically different from the previous one; in contrast, a sweep of single-site Gibbs sampling can only diffuse the boundaries of the red and blue regions by roughly the width of a single site.]
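One sweep of the algorithm, sketched in code. This is our own minimal illustration for a $q$-color Potts model on a periodic square lattice with uniform positive coupling $J$, using a small union–find for the connected components; it is not the thesis's implementation, which used the percolation code of Newman and Ziff.

```python
import numpy as np

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def swendsen_wang_sweep(colors, J, q, rng):
    """colors: (L, L) int array in {0..q-1}; returns a new Potts state."""
    L = colors.shape[0]
    n = L * L
    parent = np.arange(n)
    p_bond = 1.0 - np.exp(-J)        # bond probability on same-color edges
    flat = colors.ravel()
    for i in range(L):
        for j in range(L):
            a = i * L + j
            # right and down neighbours, periodic boundaries
            for b in (((i + 1) % L) * L + j, i * L + (j + 1) % L):
                if flat[a] == flat[b] and rng.random() < p_bond:
                    ra, rb = find(parent, a), find(parent, b)
                    parent[ra] = rb  # join the two clusters
    # p(s|d): assign every cluster a colour uniformly at random.
    roots = np.array([find(parent, i) for i in range(n)])
    new_color = {r: rng.integers(q) for r in np.unique(roots)}
    return np.array([new_color[r] for r in roots]).reshape(L, L)

rng = np.random.default_rng(7)
state = rng.integers(2, size=(8, 8))   # q = 2 (Ising-like) starting state
for _ in range(5):
    state = swendsen_wang_sweep(state, J=1.0, q=2, rng=rng)
```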
[Figure: slice sampling. (a–c) show a procedure for unimodal distributions: after sampling $h$ from its conditional distribution, an interval is found enclosing the region where $h(x) > h$; samples are drawn uniformly from this interval until one underneath the curve is found; points outside the curve can be used to shrink the interval. This is an adaptive rejection sampling method. (d) Sampling uniformly from the slice is difficult for multimodal distributions. (e) One valid procedure uses an initial bracket of width $w$ that encloses $x$ but is centered uniformly at random; this bracket is extended in increments of $w$ until it sticks outside the curve at both ends. This can cut off regions of the slice, but applying the previous adaptive rejection sampling procedure still leaves the uniform distribution on the slice invariant.]

Slice sampling

All Metropolis methods make use of a proposal distribution $q(x'; x)$. In continuous spaces such distributions have step-size parameters that have to be set well: almost all proposals are rejected if step sizes are too large, while overly small step sizes lead to slow, diffusive exploration of the target distribution. In contrast, a self-tuning method, which has less-important or even no step-size parameters, would be preferable. Slice sampling (Neal) is a method with one auxiliary variable that fits into the framework of Edwards and Sokal. The auxiliary variable does not directly help mixing, but allows the use of relatively easy-to-implement algorithms that allow self-tuning while generating a valid Markov chain. Like Gibbs sampling, a slice sampling chain has no rejections: it always moves when sampling a continuous distribution.

Slice sampling works by stepping back to the view of figure (a): sampling involves drawing points uniformly underneath a curve $h(x) = p^*(x)$. Rather than drawing points i.i.d. as in rejection sampling, a Markov chain explores this uniform distribution over the original variables $x$ and a height variable $h < h(x)$.
Slice sampling algorithms alternate between updating $h$ and one or more of the original variables. The height has a simple conditional distribution from which it can be resampled: $h \sim \mathrm{Uniform}[0, h(x)]$. Conditioned on $h$, the stationary distribution over $x$ is uniform over the "slice" that lies under the curve, $\{x : h(x) > h\}$. When the distribution is unimodal, this conditional can be sampled by rejection sampling (see figure a–c). More generally, we use transition operators that leave a uniform distribution on the slice stationary. One of the operators introduced by Neal is briefly explained in figure (e); like most Metropolis methods, it uses a step-size parameter, which would ideally match the length-scale of the problem, $l$. Neal also provides a slice sampling operator that can expand its search region exponentially quickly. Skilling and MacKay provide a version with no step size: it starts with a large search region, which shrinks with a fast exponent. Both of these operators are slightly harder to implement than the simple version sketched in the figure.

We now propose a simple, we believe novel, method that can shrink slice sampling's search region more efficiently. The method is a generalization of the two-stage acceptance rule of subsection …, where we factored the target density into $p^*(x) = \pi(x)\,L(x)$. Here we also introduce two auxiliary variables,
$$p(h_\pi | x) = \mathrm{Uniform}[0, \pi(x)] \qquad \text{and} \qquad p(h_L | x) = \mathrm{Uniform}[0, L(x)].$$
The new method follows any of the standard slice sampling algorithms, but defines the slice by requiring both $\pi(x') > h_\pi$ and $L(x') > h_L$.
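The standard procedure of figure (e), stepping out in increments of $w$ and then shrinking on rejected points, can be written compactly for a one-dimensional target. This is our own minimal sketch of that operator, not verbatim code from Neal or the thesis:

```python
import numpy as np

def slice_sample_step(x, log_p_star, w, rng):
    """One slice sampling update: draw a height under the curve, step out
    a bracket of width w around x, then shrink until a slice point is found."""
    log_h = log_p_star(x) + np.log(rng.random())   # h ~ Uniform[0, p*(x)]
    # Initial bracket of width w, positioned uniformly at random around x.
    left = x - w * rng.random()
    right = left + w
    while log_p_star(left) > log_h:    # step out until both ends are
        left -= w                      # outside the curve
    while log_p_star(right) > log_h:
        right += w
    while True:                        # sample uniformly, shrinking the
        x_new = left + (right - left) * rng.random()   # bracket on misses
        if log_p_star(x_new) > log_h:
            return x_new               # a point on the slice: always accepted
        if x_new < x:
            left = x_new
        else:
            right = x_new

rng = np.random.default_rng(8)
x = 0.0
samples = np.empty(20_000)
for s in range(samples.size):
    x = slice_sample_step(x, lambda t: -0.5 * t**2, w=2.0, rng=rng)  # N(0,1)
    samples[s] = x
```

The two-auxiliary-variable version above would simply test the cheap $\pi$-constraint inside the shrinkage loop before ever evaluating $L$.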
intermediate intervals and rejected points as s. Finally, an acceptable point x' is found and adopted. The joint probability of all of these quantities is

T(x', s, h_π, h_L ← x) p(x) = p(x' | s, h_π, h_L) p(s | h_π, h_L, x) p(h_π | x) p(h_L | x) π(x)L(x)/Z
                            = p(x' | s, h_π, h_L) p(s | h_π, h_L, x) / Z,

since p(h_π | x) = 1/π(x) and p(h_L | x) = 1/L(x). All of the standard slice-sampling procedures are designed to ensure that x' is only accepted as the final point if this expression is symmetric in x and x'. Therefore

T(x', s, h_π, h_L ← x) p(x) = T(x, s, h_π, h_L ← x') p(x').

Summing this expression over s, h_π and h_L shows that the operator satisfies detailed balance. (If the slice has size l and the initial search region has size L, the search region requires O(log(L/l)) computations to grow out or shrink; this is much better than the O((L/l)²) steps required by Metropolis to explore the same range by a random walk.)

Multiple auxiliary variables sometimes allow very fast mixing, as in Swendsen-Wang. However, in this context the slice defined by the two auxiliary variables h_π and h_L will typically be smaller than under standard slice sampling. This is likely to increase the mixing time of the chain; Neal warns about this problem with reference to Damien et al. Our motivation is computational: π(x) can be checked first, and only if it satisfies its slice constraint need L(x) > h_L be checked. Thus schemes that start with a large range and rapidly shrink a bracket around the slice can do so with less computation, if π(x) is cheap to compute and can reject some unreasonable points.

Hamiltonian Monte Carlo

Physical simulations often work by discretizing time and approximately computing the result of following the Hamiltonian dynamics of the system being modeled. These dynamics exploit gradients of a target stationary probability density, not just pointwise evaluations as in the Metropolis algorithm. Theoretically, gradients only ever cost a constant multiple of the computer time needed to evaluate a function (Bischof and Bücker), and are a much richer source of information. The difficulty with using such dynamics in simulation work is that inaccuracies accumulate rapidly unless time is discretized very finely, which has a large computational cost.
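Hybrid Monte Carlo, discussed next, pairs these dynamics with an accept/reject correction. A minimal sketch (our own toy example with a unit mass and a standard-normal target; the leapfrog step count and step size are arbitrary choices):

```python
import math, random

def hmc_step(x, log_p, grad_log_p, eps=0.1, n_steps=20, rng=random):
    """One hybrid Monte Carlo update: leapfrog-simulate Hamiltonian dynamics,
    then accept/reject to remove the bias from discretization error."""
    p0 = rng.gauss(0.0, 1.0)                      # resample unit-mass momentum
    x_new = x
    p = p0 + 0.5 * eps * grad_log_p(x)            # first half step for momentum
    for i in range(n_steps):
        x_new += eps * p                          # full step for position
        if i != n_steps - 1:
            p += eps * grad_log_p(x_new)          # full step for momentum
    p += 0.5 * eps * grad_log_p(x_new)            # final half step for momentum
    # H(x, p) = -log p(x) + p^2/2; accept with probability min(1, exp(-dH)).
    d_h = (-log_p(x_new) + 0.5 * p * p) - (-log_p(x) + 0.5 * p0 * p0)
    return x_new if rng.random() < math.exp(min(0.0, -d_h)) else x

# Usage: standard normal target, log p(x) = -x^2/2 up to a constant.
rng = random.Random(1)
x, samples = 0.0, []
for _ in range(2000):
    x = hmc_step(x, lambda t: -0.5 * t * t, lambda t: -t, rng=rng)
    samples.append(x)
```

The persistent trajectory moves x by roughly eps * n_steps per update, rather than with the square-root scaling of a diffusive random walk.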
The Metropolis algorithm targets only an equilibrium distribution over states; the method doesn't need this distribution to be the result of an underlying dynamical system. Instead the algorithm evolves according to a proposal distribution, and its progress often resembles a diffusive random walk. In contrast, a particle evolving under Hamiltonian dynamics will naturally move in persistent trajectories across its state space. These two areas of physical simulation research were combined in "hybrid Monte Carlo" (Duane et al.): Hamiltonian dynamics are simulated approximately, as a Metropolis proposal which is accepted or rejected. This allows persistent, rapid motion across a state space without discretization errors damaging the equilibrium distribution of the chain. If the target probability distribution does not correspond to a Hamiltonian system, then auxiliary variables are introduced to create a fictitious system that can be simulated. While Hamiltonian dynamics is time-reversible, some care is required to ensure that the discretized version is also reversible. Neal reviews a variety of Hamiltonian-based Monte Carlo methods; see MacKay for another review with simple example code. Hybrid Monte Carlo crossed into a statistical setting through work on Bayesian neural networks (Neal), where it is extremely successful. Hamiltonian Monte Carlo methods continue to appear in recent work, especially in Gaussian process modeling, although they could be adopted more widely.

Annealing methods

Simulated annealing (Kirkpatrick et al.) is a heuristic for minimizing a cost function E(x) with isolated local optima. The Metropolis algorithm is run on a probability distribution derived from the energy E:

p_k(x) = p*_k(x)/Z_k = π(x) exp(−β_k E(x)) / Z_k.

A base measure π(x) is often omitted, i.e. set to uniform, but including this term ensures that p_k is defined when β_k = 0. Initially the inverse temperature β_k is set very low, so that the distribution is diffuse; the distribution
is then slowly cooled by increasing β_k towards infinity; eventually all of the probability mass will be concentrated on the global optimum of E(x). There can be no guarantee that a particular sampling procedure will actually find the global optimum (in general it is not possible to do better than an exhaustive search), but the heuristic has proved useful in a variety of applications.

Monte Carlo is not interested in optima, but if p(x) ∝ L(x)π(x) and E(x) = −log L(x), then β_k = 1 returns the target distribution of interest, and distributions with β_k < 1 are more diffuse. The figure below shows an example of such a sequence of distributions. It might be that isolated modes at β = 1 are difficult to explore with the available Markov chain operators; these modes join and disappear at lower β, and the more diffuse distributions will typically be easier to explore. This section reviews algorithms designed to exploit this observation to provide better samples from multimodal target distributions. It turns out that all of these methods can also provide estimates of the target distribution's normalizing constant; a full discussion of normalizing constants is deferred to a later chapter.

Notation. All of the algorithms in this section, and several later in the thesis, use a sequence of distributions. We have adopted the following conventions throughout: p_0 is a base distribution, usually easy to explore or amenable to direct sampling (if the p_k are defined by inverse temperatures, then β_0 = 0); there are K−1 intermediate distributions p_k, 0 < k < K, between the base and target distributions; p_K ≡ p is the target distribution (if the p_k are defined by inverse temperatures, then β_K = 1). The sequences are usually defined using temperatures as in the equation above, but other methods for creating intermediate distributions between the base and target distributions may be used.

Figure: the effect of annealing. At β = 0 the distribution is equal to a convenient base distribution; as the inverse temperature increases, the distribution morphs into the target distribution at β = 1.
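The tempered family p_k(x) ∝ π(x) exp(−β_k E(x)) can be visualized numerically. A small sketch (the bimodal likelihood and broad Gaussian base below are invented purely for illustration):

```python
import math

def tempered_density(xs, log_pi, energy, beta):
    """Normalized grid evaluation of p_beta(x) propto pi(x) * exp(-beta * E(x))."""
    logs = [log_pi(x) - beta * energy(x) for x in xs]
    m = max(logs)                      # subtract the max for numerical stability
    ws = [math.exp(l - m) for l in logs]
    z = sum(ws)
    return [w / z for w in ws]

# Bimodal target: E(x) = -log L(x) with modes at +/-2; base pi(x) is a broad Gaussian.
xs = [i * 0.05 - 6.0 for i in range(241)]          # grid on [-6, 6]; xs[120] = 0.0
log_pi = lambda x: -0.5 * (x / 3.0) ** 2
energy = lambda x: -math.log(math.exp(-2 * (x - 2) ** 2) + math.exp(-2 * (x + 2) ** 2))
cold = tempered_density(xs, log_pi, energy, 1.0)   # target: two sharp modes
warm = tempered_density(xs, log_pi, energy, 0.0)   # base: a single broad mode
```

At β = 0 the grid recovers the broad base distribution; at β = 1 the mass concentrates on the two modes of the target, with very little mass left between them.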
Simulated tempering / expanded ensembles

A coherent way of dealing with annealing in MCMC simulations was independently discovered under the names "expanded ensembles" (Lyubartsev et al.) and "simulated tempering" (Marinari and Parisi). The idea is to bring the choice of intermediate distribution p_k into the joint distribution under exploration:

p(x, k) ∝ w^ST_k p*_k(x),

where the w^ST_k are user-chosen weights attached to each temperature level. The conditional distribution at the coldest temperature returns the original distribution of interest, p(x | k = K) = p_K(x), so a Markov chain at equilibrium returns correlated samples from the correct distribution by only recording states when k = K. The hope is that the extra computation is justified by the decrease in autocorrelation of the Markov chain.

The extent to which simulated tempering improves a chain's mixing time depends heavily on the target distribution and the Markov chains used. One general characteristic of the algorithm is the amount of time it takes to move the temperature index k between its extreme values. Usually Metropolis proposals k' = k ± 1 are used, because large changes in temperature are unlikely to be accepted. The adjacent levels must be similar enough for these changes in k to be accepted frequently; this sets a minimum number of temperature levels, K. Even if all moves were accepted, k would be undergoing a slow random walk: it will take at least O(K²) steps to bring information from the rapidly mixing k = 0 distribution to the target k = K distribution.

The extra cost of the algorithm depends on the fraction of time spent at the other temperature levels. At equilibrium each level has marginal probability

p(k) = w^ST_k Z_k / Σ_k' w^ST_k' Z_k'.

The partition functions Z_k can easily vary by orders of magnitude across temperature levels. In order for simulated tempering to spend a reasonable amount of time at each temperature level, we could set w^ST_k = 1/Z_k, except of course that this ideal weighting is unknown (e.g. Z_K is the unknown normalization of the target distribution). A practical implementation of simulated tempering must therefore be an iterative process, in which the w^ST_k are improved over a series of preliminary runs.
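A sketch of the scheme (our own toy example: E(x) = x²/2 so the target is a unit Gaussian, three levels, and, purely for illustration, weights set from the analytically known Z_k = sqrt(2π/β_k); in practice these would come from preliminary runs):

```python
import math, random

def simulated_tempering(energy, betas, log_w, n_iter, step=1.0, seed=0):
    """Simulated tempering sketch: Metropolis in x at the current level k, plus
    proposals k -> k +/- 1 accepted under p(x, k) propto w_k * exp(-beta_k E(x))."""
    rng = random.Random(seed)
    x, k = 0.0, 0
    cold = []                                # samples recorded while k = K
    for _ in range(n_iter):
        # Update x at fixed inverse temperature beta_k.
        x_prop = x + rng.gauss(0.0, step)
        if math.log(rng.random() + 1e-300) < betas[k] * (energy(x) - energy(x_prop)):
            x = x_prop
        # Propose moving to an adjacent temperature level (symmetric proposal).
        k_prop = k + rng.choice([-1, 1])
        if 0 <= k_prop < len(betas):
            log_a = log_w[k_prop] - log_w[k] - (betas[k_prop] - betas[k]) * energy(x)
            if math.log(rng.random() + 1e-300) < log_a:
                k = k_prop
        if k == len(betas) - 1:
            cold.append(x)
    return cold

betas = [0.2, 0.6, 1.0]
log_w = [0.5 * math.log(b) for b in betas]   # ideal w_k = 1/Z_k for this Gaussian
cold = simulated_tempering(lambda x: 0.5 * x * x, betas, log_w, n_iter=30000)
```

With the ideal weights each level is occupied roughly equally, so about a third of the iterations yield usable target-temperature samples.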
Figure: an illustration of the idea behind the family of parallel tempering algorithms. Each column represents the state of a Markov chain. States in the first row have marginal distribution p(x); the other rows have stationary distributions bridging between this distribution and a vague distribution for which exploration is easy. MCMC proceeds by traditional simulation of each of the independent states and, shown by dotted arrows, Metropolis proposals to exchange the states between adjacent levels.

Parallel tempering

Parallel tempering is a popular alternative to simulated tempering, based on a different joint distribution. While simulated tempering uses a union space, consisting of the temperature index and the original state space, parallel tempering simulates on a product space: replicas of the original state space exist concurrently, one for each of the stationary distributions p_k,

p(X) = ∏_{k=0}^{K} p_k(x_k).

As the states at each level are independent under this target distribution, independent transition operators T_k, with corresponding stationary distributions p_k, can be applied. For the higher-temperature chains to communicate the results of their freer motion to the lower-temperature chains, interactions are also required. The figure illustrates one possible way of evolving a product-space ensemble. The state of the Markov chain consists of an entire column of variables; these evolve using standard MCMC operators, or by swaps between states that are accepted or rejected according to the Metropolis rule. The swap proposals can be scheduled in a variety of ways, but, as with simulated tempering, information will propagate amongst levels by a random walk, or slower. By construction, the amount of computer time spent on each level is completely under the user's control; in particular, no weights w^ST_k are required, which is one source of the method's relative popularity. (Parallel tempering has several names and independent authors; its history is documented by Iba.)
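A sketch of the product-space simulation with replica swaps (toy bimodal target and temperature ladder invented for illustration):

```python
import math, random

def parallel_tempering(energy, betas, n_iter, step=1.0, seed=0):
    """Parallel tempering sketch: one Metropolis update per level per iteration,
    then a proposed swap of states between a random adjacent pair of levels."""
    rng = random.Random(seed)
    xs = [0.0] * len(betas)
    target_samples = []
    for _ in range(n_iter):
        for k, b in enumerate(betas):
            prop = xs[k] + rng.gauss(0.0, step)
            if math.log(rng.random() + 1e-300) < b * (energy(xs[k]) - energy(prop)):
                xs[k] = prop
        # Swap proposal between adjacent levels j and j+1 (Metropolis rule on the joint).
        j = rng.randrange(len(betas) - 1)
        log_a = (betas[j + 1] - betas[j]) * (energy(xs[j + 1]) - energy(xs[j]))
        if math.log(rng.random() + 1e-300) < log_a:
            xs[j], xs[j + 1] = xs[j + 1], xs[j]
        target_samples.append(xs[-1])     # record the beta = 1 replica
    return target_samples

# Bimodal target with modes near +/-2; the hottest replica crosses freely.
energy = lambda x: -math.log(math.exp(-2 * (x - 2) ** 2) + math.exp(-2 * (x + 2) ** 2))
samples = parallel_tempering(energy, [0.05, 0.2, 0.5, 1.0], n_iter=20000, step=1.5)
frac_left = sum(s < 0 for s in samples) / len(samples)
```

Swaps let mode crossings made by the diffuse replicas propagate down to the target-temperature chain, so both modes are visited in roughly equal proportion.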
Another feature of coexisting chains is the possibility of using proposals based on sophisticated interactions between states, e.g. Wang and Swendsen. These advantages are at the expense of an obvious increase in memory requirements, and a harder-to-gauge cost for bringing a larger system to equilibrium.

Annealed importance sampling (AIS)

Annealed importance sampling (Neal) looks more like the original simulated annealing for optimization than the above ensemble methods. An initial point x_1 is drawn from p_0, a vague "high-temperature" base distribution. Further points x_2, ..., x_K result from "cooling": evolving through a sequence of Markov chain transition operators T_1, ..., T_{K−1}, whose stationary distributions p_1, ..., p_{K−1} become sharper at each iteration. The ensemble of K points generated by the algorithm is denoted X = {x_k}, k = 1, ..., K. We treat the procedure that generated them as a proposal, which has the following joint distribution:

q(X) = p_0(x_1) ∏_{k=1}^{K−1} T_k(x_{k+1} ← x_k).

But what is this a proposal for? The marginal distribution over the final point, q(x_K), is not the target distribution, and is probably intractable; we cannot directly compare q(x_K) with p(x_K). Instead we must create an ensemble target distribution p̃(X) that can be compared to q(X). Under this target model, x_K is drawn from the distribution of interest, p ≡ p_K; the remaining auxiliary quantities are drawn using a sequence of reverse Markov chain transitions:

p̃(X) = p_K(x_K) ∏_{k=1}^{K−1} T̃_k(x_k ← x_{k+1}),

where the T̃_k are reverse operators as described earlier. The figure below illustrates both of the ensembles p̃(X) and q(X). Samples of X from q can be assigned importance weights with respect to p̃ in the usual way. Substituting T̃_k(x ← x') = T_k(x' ← x) p_k(x)/p_k(x'), all of the transition probabilities cancel, leaving

w(X) = p̃*(X)/q*(X) = ∏_{k=1}^{K} p*_k(x_k) / p*_{k−1}(x_k).

In Neal's papers the indexes of the points run in the opposite direction; the presentation here is designed to be
consistent with similar algorithms found in later chapters.

Figure: annealed importance sampling (AIS) is standard importance sampling applied to an ensemble. The generative process p̃(X) starts with a draw from the target distribution p(x) ≡ p_K(x), which is intractable. Importance sampling uses proposals drawn from q(X), which starts at a simple distribution, x_1 ~ p_0(x). Samples weighted in proportion to w(X) can be used as though they came from p̃. Typically only x_K is of interest, which under p̃ comes from the target p_K ≡ p. The weights themselves are an unbiased estimate of Z_p̃/Z_q = Z_K/Z_0, as in standard importance sampling.

A valid choice for all of the T_k is the "do nothing" operator, which attaches all of its probability mass to staying still: T(x' ← x) = δ(x' − x). Then q(X) just corresponds to drawing from p_0 and copying the result, and similarly p̃(X) only draws from p_K. As one might expect, the importance weights collapse down to those of standard importance sampling of a target distribution p ≡ p_K with proposals from the base distribution q ≡ p_0. This highlights that the Markov chains do not need to mix well for AIS to be valid, but poor mixing operators will not improve over standard importance sampling.

A feature of AIS is that the importance weights do not require details of the transition operators T_k. This means that a given ensemble X will always be given the same importance, regardless of whether the T_k mix immediately (by drawing from their stationary distributions p_k) or in fact do nothing. The downside to this behavior is that, even with ideal transition operators, there is always a mismatch between p̃(X) and q(X): under perfect mixing the final point is proposed from p_{K−1}, not the target distribution p_K.

The annealed importance sampling technique was first described in the physics literature by Jarzynski. Earlier still was the method of "tempered transitions", which contains the same core idea.
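On a tractable toy problem the whole AIS procedure fits in a few lines (Gaussian endpoints and a geometric bridge, all invented for illustration; here the true ratio Z_K/Z_0 is 0.5):

```python
import math, random

def ais_weight(log_f, betas, n_mcmc=5, step=1.0, rng=random):
    """One AIS run. log_f(x, beta) is the log unnormalized intermediate density;
    betas runs from 0 (base) to 1 (target). Returns the log importance weight,
    whose exponential is an unbiased estimate of Z_target / Z_base."""
    x = rng.gauss(0.0, 1.0)                  # exact draw from the base p_0 = N(0,1)
    log_w = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Weight factor evaluated before moving at the new level.
        log_w += log_f(x, b) - log_f(x, b_prev)
        for _ in range(n_mcmc):              # Metropolis steps leaving p_b invariant
            x_prop = x + rng.gauss(0.0, step)
            if math.log(rng.random() + 1e-300) < log_f(x_prop, b) - log_f(x, b):
                x = x_prop
    return log_w

# Geometric bridge between N(0,1) and an unnormalized N(0, 0.5^2):
log_f = lambda x, b: -(1 - b) * 0.5 * x * x - b * 0.5 * (x / 0.5) ** 2
rng = random.Random(2)
betas = [k / 20.0 for k in range(21)]
weights = [math.exp(ais_weight(log_f, betas, rng=rng)) for _ in range(300)]
z_ratio = sum(weights) / len(weights)        # true value: 0.5
```

Averaging the (unnormalized) weights estimates Z_K/Z_0; the variance of the estimate shrinks as the bridge of intermediate distributions is made finer.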
Tempered transitions

Tempered transitions (Neal) defines a reversible Markov chain transition operator leaving a distribution of interest, p_K, stationary; thus its use is in MCMC rather than importance sampling. It uses pairs of mutually reversible base transitions, T̂_k and Ť_k, with corresponding stationary distributions p_k, k = 0, ..., K−1. Given a current state x̂_K, a candidate state x̌_K is proposed through a sequence of states drawn as follows:

  generate x̂_{K−1} from x̂_K using T̂_{K−1}
  generate x̂_{K−2} from x̂_{K−1} using T̂_{K−2}
    ...
  generate x̂_0 from x̂_1 using T̂_0
  generate x̌_1 from x̂_0 using Ť_0
    ...
  generate x̌_{K−1} from x̌_{K−2} using Ť_{K−2}
  generate x̌_K from x̌_{K−1} using Ť_{K−1}.

This generating sequence is illustrated in panel (a) of the figure below. The candidate state is accepted with probability

a(x̌_K ← x̂_K) = min[ 1, ∏_{k=0}^{K−1} ( p*_k(x̂_{k+1}) / p*_{k+1}(x̂_{k+1}) ) ( p*_{k+1}(x̌_{k+1}) / p*_k(x̌_{k+1}) ) ].

Note that the normalizations of the p_k cancel, so they are not needed. One way to derive this acceptance rule is to identify the whole joint distribution over the sequence as the target of the sampler: if the order of the states were reversed, x̂_K would be the new state generated at equilibrium, and a few lines of manipulation result in the Metropolis acceptance probability above for this proposal. The intermediate states can be discarded, because they are resampled at the next iteration.

Typically p_0 is much more vague than p_K, which allows easier movement around the middle of the proposal sequence. If there are many intermediate distributions, then every p_k can be made close to its neighbors. This suggests that each state visited during the proposal should be close to equilibrium under each of the transition operators, and that the final state should nearly be drawn from a distribution close to p_K; thus the acceptance rate can be high for large K. One might expect that deterministically raising and lowering the temperature gives better performance than the random walk on temperatures employed by simulated tempering; in fact the advantage is roughly cancelled by the need for a larger number of intermediate distributions (see Neal for details).
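A sketch of one tempered-transitions update (our own toy bimodal example; Metropolis kernels are self-reversible, so the same operators serve for both passes):

```python
import math, random

def tempered_transition(x, energy, betas, step=1.0, rng=random):
    """One tempered-transitions update for the target f_K(x) = exp(-E(x)),
    where betas[0] < ... < betas[-1] = 1 and f_k(x) = exp(-betas[k] * E(x))."""
    def metropolis(y, beta):
        y_prop = y + rng.gauss(0.0, step)
        if math.log(rng.random() + 1e-300) < beta * (energy(y) - energy(y_prop)):
            return y_prop
        return y
    y, log_a = x, 0.0
    # Downward pass: record log[f_k(y) / f_{k+1}(y)], then move at level k.
    for k in range(len(betas) - 2, -1, -1):
        log_a += (betas[k + 1] - betas[k]) * energy(y)
        y = metropolis(y, betas[k])
    # Upward pass, mirrored: move at level k, then record log[f_{k+1}(y) / f_k(y)].
    for k in range(len(betas) - 1):
        y = metropolis(y, betas[k])
        log_a -= (betas[k + 1] - betas[k]) * energy(y)
    # Accept or reject the whole trajectory; the normalizers Z_k cancel.
    return y if math.log(rng.random() + 1e-300) < log_a else x

# Usage: bimodal target with modes near +/-2.
energy = lambda x: -math.log(math.exp(-2 * (x - 2) ** 2) + math.exp(-2 * (x + 2) ** 2))
rng = random.Random(3)
x, samples = 2.0, []
for _ in range(4000):
    x = tempered_transition(x, energy, [0.05, 0.15, 0.35, 0.65, 1.0], step=1.5, rng=rng)
    samples.append(x)
frac_left = sum(s < 0 for s in samples) / len(samples)
```

The chain can switch modes during the near-flat low-beta portion of the trajectory, while the single accept/reject decision keeps p_K exactly invariant.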
The main advantages of tempered transitions are that it doesn't need a set of weights, as simulated tempering does, and that it doesn't have parallel tempering's large memory requirements, because the acceptance ratio can be accumulated as the states are generated.

Figure: tempered transitions. (a) The algorithm starts with a state x̂_K and generates the ensemble of intermediate states. (b) The acceptance probability is for the proposal x̌_K which, by reversing all the states, would make x̂_K the new state generated under the equilibrium distribution p_K.

Generalization to only forward transitions

In standard tempered transitions, T̂_k and Ť_k are mutually reversible. Here we consider two modified algorithms. A "forwards" tempered transition operator T̂ uses the same operator T_k for both passes at level k, where each T_k is any transition operator leaving p_k stationary. A "reverse" tempered transition operator Ť instead uses the corresponding reverse operators T̃_k for both passes. Both algorithms use the same acceptance rule as before. Neal shows that the operator defined by standard tempered transitions satisfies detailed balance; here we present a similar proof, showing that T̂ and Ť are mutually reversible with respect to p_K.

As in standard tempered transitions, we start at x̂_K and generate a sequence of states S = (x̂_{K−1}, ..., x̂_0, x̌_1, ..., x̌_K), with x̌_0 ≡ x̂_0. The probability of generating a particular sequence of states, given the initial state and using only forward transitions T_k, is

q(S | x̂_K) = ∏_{k=0}^{K−1} T_k(x̂_k ← x̂_{k+1}) ∏_{k=0}^{K−1} T_k(x̌_{k+1} ← x̌_k).

Each forward transition can be rewritten in terms of its reverse operator using T_k(x' ← x) p_k(x) = T̃_k(x ← x') p_k(x'), which gives

q(S | x̂_K) = ∏_{k=0}^{K−1} T̃_k(x̂_{k+1} ← x̂_k) (p_k(x̂_k)/p_k(x̂_{k+1})) ∏_{k=0}^{K−1} T̃_k(x̌_k ← x̌_{k+1}) (p_k(x̌_{k+1})/p_k(x̌_k)).

Using these identities and the acceptance rule, we can compute the equilibrium probability of starting at x̂_K, generating the sequence, and
accepting the move with the forwards operator T̂:

p_K(x̂_K) q(S | x̂_K) a(x̌_K ← x̂_K)
  = p_K(x̂_K) q(S | x̂_K) min[ 1, ∏_{k=0}^{K−1} ( p_k(x̂_{k+1}) / p_{k+1}(x̂_{k+1}) ) ( p_{k+1}(x̌_{k+1}) / p_k(x̌_{k+1}) ) ]
  = p_K(x̌_K) q̃(S' | x̌_K) a(x̂_K ← x̌_K),

where S' is the sequence S read in reverse order and q̃ is the probability of generating it using only the reverse transitions T̃_k. Summing both sides over all intermediate quantities gives

p_K(x̂_K) T̂(x̌_K ← x̂_K) = p_K(x̌_K) Ť(x̂_K ← x̌_K).

The right-hand side is the equilibrium probability of observing the corresponding reverse move under Ť, which shows that both T̂ and Ť leave p_K invariant. Therefore we only need one of them, T̂ say. This means that the standard tempered transitions algorithm still leaves the target distribution invariant if forward operators T_k are used for both T̂_k and Ť_k: implementing reverse operators is not required.

Generalization to a single pass

The second half of tempered transitions is identical to annealed importance sampling. Neal states that the major difference between the two methods is the requirement for an upward pass in tempered transitions, making the updates twice as expensive. In fact this upward pass is not necessary. Looking back at the AIS construction, there is no reason why the AIS ensemble X ~ q cannot be used as a Metropolis-Hastings proposal for updating a state drawn at equilibrium from p̃. This is a Metropolis independence sampler, in which the proposal does not depend on the current state; this method of turning an importance sampler into a Markov chain operator dates back at least to Hastings.

Hastings saw importance sampling as a more natural way to construct estimators than the Metropolis independence sampler. Notice that an independence sampler sometimes rejects states when the current state is more "important": the states kept by the algorithm depend on the order in which the proposals are made, which is arbitrary, as they are independent. It seems likely that treating the proposals identically, through importance weights, will be better, and theoretically this is usually the case (Liu).

Tempered transitions and annealed importance sampling are really two applications of the same joint distribution.
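The Metropolis independence sampler just mentioned can be sketched directly (toy Gaussian target and proposal of our own choosing; note that the acceptance ratio is a ratio of importance weights):

```python
import math, random

def independence_sampler(log_w, draw, n_iter, rng=random):
    """Metropolis independence sampler: proposals x' ~ q ignore the current
    state and are accepted with probability min(1, w(x')/w(x)), where
    w(x) = p*(x)/q*(x) is the usual importance weight."""
    x = draw()
    lw = log_w(x)
    samples = []
    for _ in range(n_iter):
        x_prop = draw()
        lw_prop = log_w(x_prop)
        if math.log(rng.random() + 1e-300) < lw_prop - lw:
            x, lw = x_prop, lw_prop
        samples.append(x)
    return samples

# Target N(1, 0.75^2) sampled with independent proposals from N(0, 1):
rng = random.Random(4)
log_w = lambda x: -0.5 * ((x - 1) / 0.75) ** 2 + 0.5 * x * x
samples = independence_sampler(log_w, lambda: rng.gauss(0.0, 1.0), 20000, rng=rng)
```

Because the target's tails are lighter than the proposal's here, the weights are bounded and the sampler mixes well; with heavier-tailed targets the chain can get stuck on a single very "important" state.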
The main difference is whether an MCMC operator or an importance sampler is preferable. In isolation the importance sampler may yield better estimators, but the MCMC operator can be combined with other MCMC operators. Hastings argued that unweighted MCMC samples are more interpretable, although one could also obtain approximate unweighted samples by resampling from a distribution over points proportional to their importance weights.

Multicanonical ensemble

The multicanonical ensemble (Berg and Neuhaus) is another method for allowing easier movement around a state space. Rather than introducing extra distributions, the original probability distribution is reweighted:

p_mc(x) ∝ w_mc(E(x)) exp(−E(x)),    E(x) = −log p*(x).

The weight function w_mc depends only on a state's energy E, the negative log of the state's original unnormalized probability. The weights are chosen such that the marginal distribution over energies, p(E), is as uniform as possible; if the energy is unbounded, a cutoff is introduced. Why set the distribution over energies to uniform? As with the temperature-based algorithms, more time is spent in regions of low probability, which helps join isolated modes. But the multicanonical ensemble is just a heuristic; there is nothing fundamental about this distribution, and in fact visualizing such a distribution can be somewhat surprising.

The figure below shows the result of reweighting a simple one-dimensional distribution. For simplicity we created a set of energy bands, and used a procedure that only ensured that the distribution over bands was uniform. The most probable states in panel (c) are in the extreme tail of the original distribution. Few states occupy the lowest energy bands, so any particular one of them must be visited frequently. In many high-dimensional problems the lowest-probability states will be plentiful, and the tails will be given much less prominence. However, when there are only a relatively small number of states at a particular energy, the very high probabilities assigned by the multicanonical ensemble can be very useful.
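For a small enumerable toy problem, the band construction described above can be reproduced exactly (the 1000-state system below, with few low-energy states, is our own invention):

```python
import math
from collections import Counter

def multicanonical_weights(energies, bin_width):
    """Weight each state by the reciprocal of the number of states in its
    energy bin, so the marginal over bins becomes exactly uniform."""
    bins = [int(e // bin_width) for e in energies]
    counts = Counter(bins)
    probs = [1.0 / (len(counts) * counts[b]) for b in bins]
    return bins, probs

# Toy system: 1000 states; E_i = 10 * sqrt(i/1000), so low-energy states are rare.
energies = [10.0 * math.sqrt(i / 1000.0) for i in range(1000)]
bins, probs = multicanonical_weights(energies, bin_width=1.0)

# Total mass of each energy band under the multicanonical distribution:
bin_mass = Counter()
for b, p in zip(bins, probs):
    bin_mass[b] += p
```

Each energy band carries identical mass, so the rare low-energy states individually receive much higher probability than the plentiful high-energy ones, mirroring the behavior described for panel (c).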
Berg and Neuhaus invented the multicanonical ensemble to deal with distributions containing "bottlenecks": regions with small numbers of states that separate regions with significant probability mass. Strangely, bottleneck states can be problematic even when they are individually more probable than many of the states typical under the distribution; a Markov chain is not able to spend much time exploring the states between important regions if there are many more states elsewhere with much higher total probability. This phenomenon is known as an entropic barrier, and can be overcome by a multicanonical ensemble, which, as in simulated tempering, makes states with abnormal energies (low or high) much more probable.

Figure: an example of a multicanonical ensemble. (a) The target distribution of interest (original probability against x). (b) The energy function that defines this distribution; the horizontal lines mark bands of energy used to construct the new ensemble. (c) The multicanonical probability, obtained by setting the probability of each state proportional to the reciprocal of the number of states in its energy range. Drawing a state from this multicanonical distribution and reading off its energy band gives a uniform distribution over the bands. Some artifacts in the plot are due to the precise way the bands were chosen, but note that the sharp peaks in the middle of the multicanonical plot are due in part to the truncation of the original distribution: setting the distribution over energies is a global operation, and can have unexpected consequences.

Finding the weights, or weighting function, for the multicanonical method is difficult. The original application of the multicanonical ensemble used prior knowledge to construct a parametric approximation that would give a near-uniform distribution over energy. General applications often use histogram methods that bin
ranges of energy together and iteratively fit weights for each bin on the basis of preliminary runs. For more advanced methodologies there are whole theses largely dedicated to these methods, e.g. Smith and Ferkinghoff-Borg. Unlike annealing methods, the multicanonical method never simulates the target distribution; however, correlated samples from p_mc can be used to construct importance sampling estimators of any quantity of interest. This is explored further in a later chapter.

Exact sampling

Exact, or "perfect", sampling simply means drawing a sample from a target distribution. Markov chain Monte Carlo uses approximate samples: the distribution over the position at the s-th step of a Markov chain depends, perhaps weakly, on its starting position. Surprisingly, it is sometimes possible to obtain exact samples from Markov chains. A sketch of the technique known as coupling from the past (CFTP; Propp and Wilson) is given in the figure below. The core idea is to find the location of a Markov chain that has run for an infinitely long time, without having to simulate its entire length. Typically, exact sampling algorithms require the ability to track sets of states through a random, possibly large, number of Markov chain steps. This is technically challenging, but provides a means to draw samples from some distributions, such as large Potts models and some spatial point processes, where traditional means fail. See Wilson for reviews, examples and more algorithms for exact sampling with Markov chains.

Is exact sampling actually useful? Some users like to see samples, which can give insight into a problem, but in many statistical applications there is no interest in the samples themselves, only in estimates of expectations. Given some exact samples, the variance of an estimator could be improved by running Markov chains from the samples to produce more (correlated) samples. If it is possible to start with exact samples, some comfort is derived from removing any bias associated with the burn-in of the chain, but it is still
better to run a small number of long Markov chains than to invest in equilibrating many independent ones, as discussed earlier. Several algorithms in a later chapter rely heavily on exact sampling; these require an exact sample from a different distribution at each iteration. Using approximations there is dangerous, as biases can accumulate at every iteration, so the Markov chains would potentially have to be run for a very long time. Exact sampling gives a stopping rule, and one that guarantees zero bias.

Figure: coupling from the past (CFTP) overview. Each dash on the time axis represents a random number used by the chain; everything must be consistent with the infinite chain of (a). (a) A sample is at the end of an infinite chain. (b) We try to find the end with a finite number of random numbers: before time −T we know nothing, so the chain could be anywhere; we evolve a bound on the state space forward to time zero, which at first might only localize the sample. (c) We look at more random numbers, further into the past, until the bound collapses to a single state; we then follow it to the exact sample at time zero. It takes a random amount of computation before the bound can be made tight.

Exact sampling example: the Ising model

Coupling from the past relies on being able to prove where the final state of a Markov chain lies without knowing the starting location. To demonstrate that this is sometimes possible, we describe the summary state technique, following Childs et al.; the method is originally due to Huber, who uses the name "bounding chains". Our task is to find the setting of an Ising model (a Potts model with two settings per variable, s_i ∈ {+1, −1}). We assume the system evolved under Gibbs sampling from time −T to 0: each variable was updated in order at each time step according to the following
algorithm:

  draw u_{i,t} ~ Uniform[0, 1]
  compute p = p(s_i = +1 | s_ne(i)) = exp(Σ_{j∈ne(i)} J_ij s_j) / ( exp(Σ_{j∈ne(i)} J_ij s_j) + exp(−Σ_{j∈ne(i)} J_ij s_j) )
  if u_{i,t} < p then set s_i = +1, else set s_i = −1.

In general, the setting of variable i at time t depends on the random number u_{i,t} and the settings of its neighbors j ∈ ne(i). The summary states algorithm starts by drawing all of the random numbers u_{i,t} for some finite range of time t ∈ (−T, 0]. We don't know what happened before time −T, so for each variable we set s_i^{−T} = "?". We then follow the Gibbs sampling updates from t = −T to t = 0, setting states to "?" whenever the result of an update depends on the settings of states we do not know:

  p_max = max over the possible settings of any "?" neighbors of p(s_i = +1 | s_ne(i))
  p_min = min over the possible settings of any "?" neighbors of p(s_i = +1 | s_ne(i))
  if u_{i,t} < p_min then set s_i = +1; if u_{i,t} ≥ p_max then set s_i = −1; otherwise set s_i = "?".

If some of the s_i^0 states are set to "?", then we do not know the setting of our exact sample. We draw more of the random numbers u_{i,t}, from time −T back to time −2T, and start again from s^{−2T}. The hope is that, after doubling the number of Markov chain steps enough times, eventually no "?"s will remain and we will know s^0.

More sophisticated versions of the algorithm have been developed for models with q > 2 settings, and for tracking dependencies among unknown states. When the connection strengths J are large it won't be possible to get rid of all the "?"s in practice, and better Markov chains than Gibbs sampling are required. While the details are considerably more involved, it is possible to bound the behavior of random-cluster samplers (Propp and Wilson) and of carefully implemented Swendsen-Wang samplers (Huber).
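The summary-state sweep can be sketched for a one-dimensional Ising chain (parameters and helper names are our own; a full CFTP implementation would repeatedly double T, reusing the same random numbers, until the state coalesces by time zero):

```python
import math, random
from itertools import product

def summary_gibbs_sweep(state, J, u):
    """One ordered Gibbs sweep over an Ising chain using summary states
    {+1, -1, '?'}: a site is fixed only when the update has the same outcome
    for every possible setting of its undetermined neighbours."""
    n = len(state)
    for i in range(n):
        nbrs = [state[j] for j in (i - 1, i + 1) if 0 <= j < n]
        options = [([s] if s != '?' else [-1, 1]) for s in nbrs]
        # p(s_i = +1 | neighbours) = 1 / (1 + exp(-2 J * field)), bounded over
        # all resolutions of the '?' neighbours.
        probs = [1.0 / (1.0 + math.exp(-2.0 * J * sum(c))) for c in product(*options)]
        if u[i] < min(probs):
            state[i] = 1
        elif u[i] >= max(probs):
            state[i] = -1
        else:
            state[i] = '?'
    return state

# Fixed random numbers for T sweeps; start from the all-unknown summary state.
rng = random.Random(5)
n, J, T = 10, 0.3, 300
us = [[rng.random() for _ in range(n)] for _ in range(T)]
state = ['?'] * n
for t in range(T):
    state = summary_gibbs_sweep(state, J, us[t])
coalesced = '?' not in state
```

At this small coupling the "?"s die out quickly; at large J they persist, which is why stronger samplers than Gibbs are needed there.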
Discussion and outlook

In this chapter we have reviewed a subset of the established literature on Markov chain Monte Carlo methods. Most MCMC algorithms are underpinned by proving detailed balance, which is almost synonymous with the seminal algorithms of Metropolis et al. and Hastings. Some choices still remain, in the choice of acceptance rule and in the construction of estimators. We presented an interpretation of waste-recycling estimators, which we will apply in a new setting in a later chapter. We also reviewed a two-stage acceptance rule, which trades off statistical optimality to give computational savings; we extended this idea to slice sampling above.

Algorithms and proposal distributions tuned for particular applications are clearly important, but were beyond the scope of this review. Moreover, we did not discuss techniques that sample distributions with variable dimensionality (e.g. Green) or infinite dimensionality (e.g. Neal). While important in some applications, and as a replacement for model selection techniques, they are not used in the remainder of the thesis. We do have a brief excursion into sampling from an infinite-dimensional distribution at the end of a later chapter, although that is more closely related to the work on exact sampling reviewed in the previous section.

Auxiliary variable methods have the potential to dramatically accelerate Markov chain exploration: an earlier figure demonstrated the large changes possible in a single step of Swendsen-Wang. Slice sampling doesn't make such dramatic moves, but it is an easy-to-apply method that we have found personally useful, both in this thesis and in quickly testing hypotheses in discussions with colleagues. Similarly, Hamiltonian Monte Carlo has proved personally useful in work outside this thesis (Murray and Snelson), and we feel it should be used more widely.

The use of annealing methods appears somewhat mysterious, especially in a statistical setting: it seems strange to sample at a range of temperatures in applications outside of physics. However, the annealing heuristic can be very effective in problems with multiple modes. It is less clear whether these distributions need to be simulated together, as in parallel tempering, or separately, as in the other algorithms. We explore the issue of populations of samples in the next chapter. This work leads us to construct new algorithms for constructing non-reversible Markov chains,
operators that do not satisfy detailed balance. Returning to annealing, a later chapter explores the computation of normalizing constants. There we explain why standard MCMC is insufficient for solving this problem, and the advantages of using multiple distributions, as in annealing. We then provide a critical and comparative review of annealed importance sampling, the multicanonical ensemble, and nested sampling, a new method by John Skilling, and we propose new procedures for running both nested sampling and annealed importance sampling. Finally, all of the algorithms discussed so far rely on being able to evaluate a function proportional to the target distribution, p*(x) ∝ p(x). It turns out that this assumption is not met by a class of statistical inference problems, which we label "doubly-intractable"; a further chapter is dedicated to constructing new algorithms for this problem. Their operation draws on ideas from all areas of this review, including annealing and exact sampling.

Chapter: Multiple proposals and non-reversible Markov chains

Markov chain Monte Carlo is ambitious: a huge amount is asked of a solitary point diffusing through the void of a complex, high-dimensional space. The target stationary distribution may have multiple modes, a representative sample of which must be found; also, somehow, the mass of each mode is implicitly estimated, so that the correct amount of time is spent in each one. Even navigating a single mode can be difficult. A long-standing idea is that multiple points evolving simultaneously should help: communication amongst points might help find and explore the regions of high probability mass.

The simplest way to introduce multiple points is to target a product model,

p(X) = ∏_{k=1}^{K} p(x_k).

This is just like parallel tempering, except that each of the independent distributions is the same. Although the x_k variables are independent, a Markov chain operator used to update one of them can depend on the settings of the others at the same time step. This provides a
limited, but adapting, source of information about lengthscales in the problem and the positions of modes. Adaptive direction sampling (ADS) is a name given to at least two algorithms. One of these updates a point by resampling from its conditional distribution (i.e. Gibbs sampling) along a direction suggested by another pair of points (Gilks et al.). This allows coupled movement of highly correlated variables, which would be difficult under standard axis-aligned Gibbs sampling. Initial experiments with ADS highlighted the need for many more particles than dimensions in order to equilibrate in all necessary directions; then the burden of bringing K points all to equilibrium simultaneously removes the advantage. On the other hand, some advantage is obtained with a small number of points when combined with standard axis-aligned updates to ensure mixing in the full space. I have been able to produce very similar results using a slice-sampling-based version of ADS, as suggested in MacKay. Related ideas are still being actively pursued in the literature (ter Braak; Christen and Fox). While general theoretical results seem elusive, a small number of parallel chains is often empirically useful and seems advisable in general.

This chapter is dedicated to exploring other Monte Carlo algorithms that attempt to leverage an advantage from multiple points. The first of these, population Monte Carlo, as in Cappé et al., is an existing importance sampling method described as a competitor to MCMC methods. We highlight some important differences with MCMC-based methods like ADS, not made apparent enough in the current literature. We then return to MCMC methods, looking at those not based on the ADS product model. These methods have a single Markov chain backbone, which branches to a set of points for consideration at each step; ensembles of dependent points might allow more thorough local exploration. A later section reviews and extends multiple-try Metropolis (MTM), an algorithm for making good
use of expensive proposal mechanisms. Another section covers ordered overrelaxation, an existing extension to Gibbs sampling which draws multiple points to improve navigation of highly correlated variables. We extend ordered overrelaxation from just Gibbs sampling to arbitrary transition operators and to multiple Metropolis proposals.

Population Monte Carlo

Many Monte Carlo algorithms could be described as population methods; Iba provides a review of some of them. This section refers to a specific method with the name population Monte Carlo (PMC), as described by Cappé et al. and popularized in Robert and Casella. It is described as an algorithm which can be iterated like MCMC but, unlike Markov methods, can easily be adapted over time. Examples were presented in which replacing MCMC with PMC gave better results; in fact, it is also easy to demonstrate the reverse. This section provides a critical review of PMC.

The template of a PMC method is given in the algorithm below. K particles are evolved in parallel; an arbitrary distribution q_k^(s) is used to move each particle at each step. These distributions can depend on all the particles' entire histories. Unbiased estimates can be constructed from the proposed points x_k^(s), weighted by their importance weights, as usual. The weights are usually only available up to a constant; consistent estimators are still available by using normalized weights. The resampled points drawn in the final step are a biased sample from the target distribution for finite K: for example, K = 1 would just give a sample from the proposal distribution. When K is large, the points are approximately unbiased samples from the target distribution; it is hoped that these will form a useful basis for constructing the proposal distributions at future iterations.

Population Monte Carlo (as in Cappé et al.):
  for s = 1 to S
    for k = 1 to K
      select a proposal distribution q_k^(s)
      generate x_k^(s) ~ q_k^(s)(x)
      compute the weight w_k^(s) = p(x_k^(s)) / q_k^(s)(x_k^(s))
    end for
    resample K positions with replacement from {x_k^(s)}, with probabilities proportional to the weights, giving this step's sample x_1^(s), ..., x_K^(s)
  end for
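As a concrete illustration of this template, here is a minimal sketch of PMC on a toy one-dimensional Gaussian target. The target, the set of proposal widths, and the values of K and S are all our own illustrative choices, not settings from the experiments discussed in this chapter.

```python
import math
import random

def pmc(logp, proposal_widths, K=200, S=20):
    """Sketch of the PMC template: propose, importance-weight, resample.
    `logp` is the unnormalized log target density."""
    xs = [random.gauss(0.0, 1.0) for _ in range(K)]  # initial population
    for _ in range(S):
        sigmas = [random.choice(proposal_widths) for _ in range(K)]
        props = [random.gauss(x, sig) for x, sig in zip(xs, sigmas)]
        # log importance weight of each proposal: log p(x') - log q(x'|x)
        logw = [logp(xp) + 0.5 * ((xp - x) / sig) ** 2 + math.log(sig)
                for xp, x, sig in zip(props, xs, sigmas)]
        m = max(logw)                                 # stabilize exponentiation
        w = [math.exp(lw - m) for lw in logw]
        total = sum(w)
        w = [wi / total for wi in w]                  # normalized weights
        # estimators should be formed from (props, w) here, before resampling
        xs = random.choices(props, weights=w, k=K)    # resampling step
    return props, w

# usage on a toy target: an unnormalized standard normal
random.seed(0)
logp = lambda x: -0.5 * x * x
points, weights = pmc(logp, proposal_widths=[0.3, 1.0, 3.0])
mean_est = sum(w * x for w, x in zip(weights, points))
```

The normalized weights make the estimator consistent even though p is only known up to a constant, as described in the text.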
Estimators should be formed from the importance-weighted ensemble before the resampling step; the resampled particles then form the basis for constructing the proposal distributions at the next iteration. As PMC is a framework for importance sampling, the q_k^(s) distributions must have appreciable probability mass across the entire support of the target distribution. While Cappé et al. do mention the need for heavy tails, they simultaneously suggest that PMC is a replacement for MCMC: population Monte Carlo borrows from MCMC algorithms for the construction of the proposal, and from importance sampling for the construction of appropriate estimators. These two goals seem mutually exclusive. Metropolis-Hastings proposals are typically local in nature; we do not expect them to capture the entire target distribution. Sometimes we might propose dramatic moves, perhaps from a kernel density estimate (Tierney and Mira; Warnes), but such proposals do not need to be a good global approximation to the target distribution for MCMC to work: such transition operators can be combined with local diffusion for robustness. Importance sampling has no such choice. Using MH proposals, which tend to cut off some or most of the target density, can have disastrous effects: in PMC, some particles would occasionally have very large weights. This would give high-variance estimators and the occasional collapse of all or most particles to a single point in the resampling step.

Both Cappé et al. and Robert and Casella describe a "mixture PMC" algorithm which seems very likely to suffer from these failure modes of importance sampling. The algorithm suggests choosing q_k^(s) from a set of spherical Gaussian distributions with a variety of widths; the probability of adopting each width is adapted according to its past ability to make proposals that survive the resampling step. To demonstrate the problem with mixture PMC, we use a version of the funnel distribution described by Neal. The distribution is over a zero-mean Gaussian-
distributed variable v, with some fixed standard deviation, and several independent zero-mean Gaussian variates x_i, each with variance exp(v). This type of structure can easily exist in hierarchical Bayesian models. The original funnel distribution used nine x_i variates and was exceedingly challenging for simple Metropolis sampling; slice sampling largely alleviated this problem. Here we make the distribution easier to handle, using only four x_i variates, so that Metropolis sampling will equilibrate more quickly. We used spherical Gaussian proposals with standard deviations selected uniformly from an ad hoc set of proposal widths. Metropolis was run many times, with many steps in each run, which produced a good histogram for the marginal distribution of v (panel (a) of the figure); only in the tails of the distribution are some frequencies not quite right. It would have been better to sample fewer, longer runs, but we wished to provide a direct comparison to PMC: S steps with K particles.

Panel (b) of the figure shows a histogram of samples obtained from the resampling step of PMC. These samples do not match the true marginal distribution, although we can expect this distribution to be biased. Panels (c)-(f) show estimates of v's marginal distribution based on the importance weights of the particles before resampling. The PMC estimates are terrible for all four variants of the algorithm described in the figure: none of the Gaussian distributions are good importance samplers for this distribution, and a combination of the estimators does not work well either. Large importance weights during the PMC runs often caused many particles in the population to move together into a new region of space. Unlike Cappé et al., we do not see this as a sign that the space is being explored; instead, it indicates the underlying high variance of the importance sampling estimator when the distributions are inappropriate for importance sampling. There is no guarantee that the time spent by the population within a region reflects the target distribution.
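The funnel construction and the random-walk Metropolis baseline can be sketched as follows. The standard deviation of v, the set of proposal widths, and the run length are our own assumptions for illustration; the exact experimental settings are not reproduced here.

```python
import math
import random

def log_funnel(state, n_x=4):
    """Unnormalized log density of a funnel-type distribution:
    v ~ N(0, 3^2) (our choice of scale), x_i | v ~ N(0, exp(v))."""
    v, xs = state[0], state[1:]
    lp = -0.5 * (v / 3.0) ** 2
    lp += sum(-0.5 * x * x / math.exp(v) - 0.5 * v for x in xs)
    return lp

def metropolis(logp, state, widths, steps=20000):
    """Spherical Gaussian random-walk Metropolis; the proposal width is
    chosen uniformly at random from `widths` at each step."""
    lp = logp(state)
    trace_v = []
    for _ in range(steps):
        sig = random.choice(widths)
        prop = [s + random.gauss(0.0, sig) for s in state]
        lp_prop = logp(prop)
        if random.random() < math.exp(min(0.0, lp_prop - lp)):  # accept
            state, lp = prop, lp_prop
        trace_v.append(state[0])  # record the marginal of v
    return trace_v

random.seed(1)
vs = metropolis(log_funnel, [0.0] * 5, widths=[0.2, 1.0, 5.0])
mean_v = sum(vs) / len(vs)
```

With a long enough run, the histogram of `vs` should approach the Gaussian prior on v, with the tails (small v, where the conditional on the x_i becomes narrow) mixing most slowly.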
Instead, the times between large importance weight events depend on the details of the tails of the proposal distributions. In this case, adapting the distributions did not solve the problem; in any case, favoring proposals that can lead to extremely large weights seems unwise. Iterated importance sampling algorithms are attractive in some contexts; however, using proposals from MCMC algorithms within importance sampling is not generally advisable. We now return to algorithms that stay within the MCMC framework.

[Figure: histograms of samples of the v parameter from the funnel distribution. Panels: (a) Metropolis; (b) PMC samples, q separate, adapted; (c) PMC weights, q separate, adapted; (d) PMC weights, q combined, adapted; (e) PMC weights, q combined, fixed; (f) PMC weights, q separate, fixed. The curves show the correct marginal posterior. All samplers were initialized at the same values of v and x. Both Metropolis and PMC used spherical Gaussian proposal distributions with the same set of proposal widths; for PMC these were adapted as in Cappé et al., or chosen from a fixed uniform distribution. The PMC importance weights were either computed using the proposal width that was used ("q separate", as in Cappé et al.) or using "q combined", the mixture of Gaussians obtained by summing over the choice of width (Robert and Casella).]

Multiple-try Metropolis

This section concerns the multiple-try Metropolis (MTM) method of Liu et al. We review the method and provide a simple interpretation as an evolving joint distribution including auxiliary variables. This almost trivial rewriting of the algorithm makes a possible improvement obvious, and relates to the pivot-based Metropolis method introduced later in the chapter. The MTM algorithm was motivated as follows: local proposal distributions q(x'|x) may lead to slow convergence, but longer-range moves are rarely accepted. If multiple points are proposed from a long-range distribution, then it is more likely that an acceptable one will be found.
The MTM algorithm is one way of allowing K draws from a proposal distribution; the figure below illustrates and motivates the procedure with a graphical model.

A single step of the multiple-try Metropolis Markov chain:
  1. Draw K i.i.d. proposals x'_k ~ q(x'_k | x).
  2. Compute the weights w(x'_k, x) = p(x'_k) q(x | x'_k) lambda(x, x'_k), where lambda is any symmetric function: lambda(x, x') = lambda(x', x).
  3. Choose one proposal x' from {x'_k}, using probabilities proportional to the weights.
  4. Generate a new set of points x''_k ~ q(x''_k | x') for k = 1 ... K-1, and define x''_K = x.
  5. Compute a = sum_k w(x'_k, x) / sum_k w(x''_k, x').
  6. Assign x <- x' with probability min(1, a).

MTM is actually a family of algorithms, defined by the function lambda used in step 2 to weight the points. MTM-II is the MTM algorithm with symmetric q and lambda = 1/q, giving weights w(x'_k, x) = p(x'_k). As various choices of lambda were reported to behave very similarly, we use MTM-II in our experiments.

The graphical-model description immediately suggests an alternative algorithm. Step 4 of the MTM algorithm resamples x''_1, ..., x''_{K-1} from the joint distribution; this discards the settings already made by a previous iteration. In some circumstances, keeping the existing proposals could save computer time. This would be achieved by omitting step 4 after the first iteration, and updating step 6 to include the assignments x'_k <- x''_k.

Efficiency of MTM

For a given proposal distribution q(x'_k | x), each step of MTM is more expensive than a single Metropolis-Hastings proposal; counting the number of target-distribution evaluations suggests that the computational cost is roughly K times more. Various practical issues make the exact cost tradeoff more complex. In many circumstances, the time taken to evaluate K probabilities will grow sublinearly with K. This may be because some intermediate results can be shared, or because related computations can be vectorized and computed efficiently on some types of computer. The sequential nature of MCMC often makes it difficult to vectorize operations, yet such code transformations can have surprisingly large benefits, especially in popular interpreted environments such as R, Octave, or MATLAB.
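A single MTM-II update, with a symmetric Gaussian proposal so that the weights reduce to w(x'_k) = p(x'_k), can be sketched as follows. The target, step size, and K are illustrative choices of ours.

```python
import math
import random

def mtm2_step(logp, x, sigma, K=5):
    """One MTM-II update with a symmetric Gaussian proposal (weights w = p)."""
    # Step 1: draw K proposals and weight them by the target density.
    props = [x + random.gauss(0.0, sigma) for _ in range(K)]
    logw = [logp(xp) for xp in props]
    m = max(logw)                               # common shift; cancels in the ratio
    w = [math.exp(lw - m) for lw in logw]
    # Step 3: select one proposal with probability proportional to its weight.
    x_new = random.choices(props, weights=w)[0]
    # Step 4: draw K-1 balancing points from the selected proposal, plus x itself.
    back = [x_new + random.gauss(0.0, sigma) for _ in range(K - 1)] + [x]
    w_back = [math.exp(logp(xb) - m) for xb in back]
    # Steps 5-6: accept with probability min(1, sum(w) / sum(w_back)).
    if random.random() < sum(w) / sum(w_back):
        return x_new
    return x

# usage: sample a standard normal target
random.seed(0)
logp = lambda x: -0.5 * x * x
chain, x = [], 0.0
for _ in range(5000):
    x = mtm2_step(logp, x, sigma=2.0)
    chain.append(x)
```

Using the same log-weight shift `m` for both the forward and balancing sums keeps the acceptance ratio exact while avoiding underflow.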
The results reported by Liu et al. suggest that MTM outperforms Metropolis without needing implementation-based code tuning. One example involved simple Gaussian-based proposals on a one-dimensional t-distribution with five degrees of freedom.

[Figure: multiple-try Metropolis (MTM). (a) x is assumed to come from the target distribution p(x); K proposals are drawn, x'_k ~ q(x'_k | x). This defines a joint target distribution on which the Markov chain operates. First, one of the proposed values, x'_i say, is chosen and labeled x'. (b) One could consider proposing a swap x <-> x', giving a joint state in which x' generated x and the remaining {x'_{k != i}}. The acceptance probability for this proposal would typically be low: for large K, it will often be obvious that x, rather than x', generated the remaining variables. (c) MTM swaps x and x', and also resamples the remaining variables from their new stationary distribution, x''_k ~ q(x''_k | x'). The acceptance rule is simply Metropolis-Hastings applied to the proposal (a) -> (c).]

[Figure: performance of multiple-try Metropolis, based on a figure from Liu et al. Samplers were run on a t-distribution as described in the main text. Plotted is var(x_t) against a computer-time index: the variance of a sampler's state after t steps from x_0, estimated from many runs. The time taken for a chain's states to match the true variance is a measure of its burn-in time. Computer time was measured as follows: for MTM-II, the index is the number of steps taken; for the Metropolis algorithms, the index is, optimistically, the number of steps divided by K. The performance of Metropolis and MTM-II closely reproduces the results from the original paper; rerunning Metropolis with a different Gaussian step size shows a much shorter burn-in time than MTM.]

The amount of computer time to reach
equilibrium from an extreme, fixed starting point x_0 appeared to be considerably less for MTM-II than for Metropolis, as shown in the figure. This claim is reproducible, but why were different proposal widths used while making the comparison? Presumably a longer step size was deemed appropriate for MTM, as it has more attempts to find a good proposal. But our results show that Metropolis with the longer step size approaches equilibrium, according to the measure used, faster than either Metropolis with the shorter step size or MTM.

The amount of computer time needed to forget a particular atypical initialization is only one measure of performance. We also used R-CODA (Cowles et al.) to estimate the effective number of samples, based on a spectral analysis of the time series; these are compared in the table. [Table: equilibrium efficiency of MTM and Metropolis, giving the effective number of samples per number of proposals for an i.i.d.-sampling control, Metropolis-Hastings at each step size, and MTM.]

The acceptance rates for Metropolis-Hastings at the two step sizes suggest that the optimal step size is somewhere in between the two values tried. MTM's acceptance rate was close to ideal, both for standard MH and, according to Liu et al.'s experience, for MTM. Despite this, MTM's sampling efficiency on this very simple example distribution is worse than MH, at both equilibrium sampling and initial burn-in.

This is a negative result, but one worth noting: using multiple parallel proposals is not, in our experience, worthwhile in itself. This makes sense: when proposals are made sequentially, a good proposal can be accepted immediately, and future proposals move on from the new location. In contrast, only one point from a set of parallel proposals can be accepted; if there is more than one good point, the rest are wasted. The motivation for using multiple proposals cannot be to explore larger search regions. Instead, we should consider MTM if making multiple proposals together is computationally cheaper than making them separately. The orientational-bias
procedure in Frenkel and Smit is equivalent to MTM-II. In this context, the proposals involve moving molecules; a molecule's probability depends on independent terms for its position and its orientation. The orientational-bias procedure proposes a single new position with several possible orientations. While the issue isn't discussed in detail, presumably it is sharing the positional energy computation across proposals that makes the algorithm worthwhile in molecular simulations.

A statistical example claiming benefit from multiple proposals is given by Qi and Minka. They drew proposals from local Gaussian approximations, based on the Hessian of the log posterior at the current sample location. In high-dimensional problems these proposals would be expensive to construct, O(D^3) to prepare a matrix factorization for a multivariate-Gaussian sampler; drawing multiple samples is then cheaper, O(D^2) each. Even sequential sampling can take advantage of this, because the proposal will be reused until a move is accepted. One wonders if, given the effort expended to construct the distribution, drawing more samples than needed for acceptance could be useful. The reported accuracy for a given CPU time does appear to be slightly better when combining Hessian-based sampling with an MTM variant.

Multiple-importance try

MTM makes it more likely that the chain moves, but is wasteful of information: multiple evaluations of the probability density are made at each iteration, but at most one of these points is used. Qi and Minka suggested a way to use all of the x'_k points, and called their combined Hessian-based MTM method "adaptive multiple-importance try" (AMIT). We will refer to the part of the algorithm that concerns including all the points in MTM as multiple-importance try (MIT). Qi and Minka also discuss cheap approximations to Hessians, but are still keen to get the best use from them, however constructed. MIT is the standard multiple-try
Metropolis algorithm with a particular choice of post-processing for the samples. If p can be evaluated, then the samples are weighted by w_k = p(x'_k) / q(x'_k | x); when p(x) is only known up to a constant, the w_k are computed using the unnormalized version p*(x) and then made to sum to one. This is standard importance sampling, using q(x'_k | x) as a proposal distribution; the locations x of these importance samplers are chosen using MCMC. Using local Gaussian approximations can lead to importance sampling estimators with large, or even infinite, variances. This suggests that MIT is likely to give erroneous results on high-dimensional problems. MIT will work when good global proposal distributions can be obtained, but we are unsure why MCMC would be an appropriate way to adapt these distributions. These are similar concerns to those discussed in the section on population Monte Carlo.

Waste-recycled MTM

Waste recycling is a method for using proposals that are not accepted as part of the Markov chain. Here we describe how to recycle the points from the MTM method that are normally unused. Waste recycling does not carry the risks of huge variances associated with using importance sampling estimators with Metropolis proposals. We first take the MTM joint distribution identified in the figure,
    p(x, {x'_k}) = p(x) prod_{k=1}^K q(x'_k | x),
and augment it with a new variable x~, whose marginal distribution should be the target distribution of interest. This could be achieved by copying x, or by picking one of the proposals at random and copying it according to the Metropolis-Hastings acceptance rule:
    P(x~ = x'_k | x, {x'_k}) = a_k / K,   where a_k = min(1, p(x'_k) q(x | x'_k) / (p(x) q(x'_k | x))),
    P(x~ = x | x, {x'_k}) = 1 - (1/K) sum_k a_k.
Any quantity can then be estimated by averaging over each of the {x, x'_k} settings, weighted by its probability under this conditional distribution. So far, we are still discarding the x''_k variables drawn by the algorithm. If the (x', {x''_k}) joint state were accepted, we could draw a new auxiliary variable x~' using the conditional distribution analogous to
the one above. To include both possible joint settings in an estimator, we use a further auxiliary variable x^, which is equal to one of x~ or x~'. We could define the probability of x^ = x~' according to the min(1, a) acceptance probability of the MTM algorithm; however, this would sometimes give zero weight to x~. In normal waste-recycling, we do not mind ignoring the initial state, as it was already used at the previous iteration, but here the x''_k were created at this iteration. Instead we use
    P(x^ = x~' | x, {x'_k}, x', {x''_k}) = sum_k w(x'_k, x) / ( sum_k w(x'_k, x) + sum_k w(x''_k, x') ),
which also gives x^ the correct stationary distribution. The final estimator averages over all states visited at each iteration: each of the points {x, x'_k, x''_k} contributes with the probability that the auxiliary-variable construction above would select it, based on an identity expressing E_{p(x)}[f(x)] as an expectation of these conditional averages under the joint distribution over (x, {x'_k}, x', {x''_k}). The expression looks cumbersome, but when expanded out it only costs O(K) to compute, and the estimator can be implemented once and used as a black box. The conditional probabilities of x~ and x^ require some backwards proposal probabilities not normally needed by MTM or MIT. This would remove the potential advantage when proposals are expensive, as in Qi and Minka, but would not be a problem when, as in Frenkel and Smit, the purpose of multiple proposals is shared computation of p(x).

We compared MTM's mean square error (MSE) in estimating E[x] to that obtained by the waste-recycled version (WR-MTM), on the same t-distribution example as before. [Table: accuracy on the t-distribution after a fixed number of proposals, giving the mean square error for Metropolis-Hastings at each step size, MTM, WR-MTM, and MIT.] Post-processing the MTM run with waste-recycling reduces the MSE considerably, but still not to the level of Metropolis-Hastings; speedups from an efficient implementation of parallel proposals would still be required to justify waste-recycled MTM. Treating the x'_k proposals as draws from importance samplers (unless we are applying the alternative algorithm suggested earlier, which reuses proposals), the MIT estimator obtains the

[Figure: the idea behind successive overrelaxation for
minimization: when performing component updates, moving past the minimum along the current search direction can give faster progress along the direction that points to the optimum. Monte Carlo overrelaxation methods attempt to successively move to the other side of a component's conditional distribution, which can cause persistent motion along a more interesting direction.]

best MSE. This reflects the fact that importance sampling is usually more appropriate than MCMC for estimating the statistics of simple one-dimensional densities. In higher-dimensional problems, where MCMC is favored over simple importance sampling, the proposals will generally be more local and take longer to explore the target distribution. This situation was simulated by setting the proposal width to a small value and taking many MTM steps, averaging over many such runs. The mean square error of MTM in estimating E[x] wasn't hurt by waste-recycling, but there wasn't much to gain from using the extra samples, as the MTM sequence is highly correlated. As such, recycling would not be worth the computational expense, but wouldn't be too harmful. The MIT importance sampling estimate had a significantly worse MSE: MIT does not necessarily reduce variance, and in high-dimensional problems it could be much worse.

Ordered overrelaxation

Ordered overrelaxation (Neal) is a technique to improve the convergence of Gibbs sampling for some distributions with strong positive correlations amongst their variables. A variety of MCMC methods exist that are inspired by successive overrelaxation in optimization (see the figure above). Ordered overrelaxation is one of these methods, considered here because it is based on a population of proposals.

Ordered overrelaxation is based on Gibbs sampling. A normal iteration of Gibbs sampling involves resampling x_d from its conditional distribution p(x_d | x_{\d}). A wasteful way to implement this rule is to draw K samples from the conditional distribution and then pick one of them uniformly at random. We could leave the
conditional distribution stationary by also including the current setting, and choosing uniformly from all K+1 settings. Moreover, we need only apply a transition operator that leaves this uniform distribution over the K+1 settings stationary; ordered overrelaxation provides such an operator.

The ordered-overrelaxation transition operator requires that the K+1 candidates for the d-th component can be ordered in some way. If they are real scalars, we simply sort by numerical value; if only a partial ordering is available, then ties are broken arbitrarily. After sorting the points, they are relabeled in order:
    s_0 <= s_1 <= ... <= s_i = x_d <= ... <= s_K.
The operator chooses x_d <- s_{K-i} as the next step in the Markov chain. As long as K is odd, this rule will always pick one of the K new values. Because the rule is deterministic and reversible, it clearly leaves a uniform distribution over the points stationary. Choosing s_{K-i} tends to move to the opposite side of the conditional distribution from x_d; this is precisely the goal of overrelaxation. The effect becomes stronger when increasing K: for very large K, the point is moved almost deterministically from its current quantile q to 1-q.

In naive implementations, any benefit from persistent motion is cancelled by the cost of drawing additional samples. Whether ordered overrelaxation is useful will depend on whether repeated draws from the conditional distribution can be performed cheaply; Neal discusses some circumstances in which this can be done. Another potential cost saving is to make the number of points K small when ordered overrelaxation is less useful. We now introduce a new procedure with this goal.

Adapting K automatically

Our new algorithm attempts to use fewer samples in ordered overrelaxation when the current position is close to the center of its conditional distribution; this is a situation where overrelaxation is typically unhelpful. It requires the user to set K_min and K_max parameters, giving the smallest and largest acceptable number of samples to draw per iteration.
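Before detailing the adaptive variant, the basic ordered-overrelaxation update can be sketched as follows, assuming exact draws from the conditional distribution are available. The bivariate-Gaussian usage and all settings are our own illustrative choices.

```python
import random

def ordered_overrelax(draw_conditional, x_d, K=11):
    """Ordered-overrelaxation update for one component (K odd): draw K points
    from the conditional, sort them together with the current value, and move
    to the mirror-image order statistic s_{K-i}."""
    assert K % 2 == 1
    pts = sorted([draw_conditional() for _ in range(K)] + [x_d])
    i = pts.index(x_d)   # current value is s_i in the ordered list s_0 .. s_K
    return pts[K - i]    # move to the opposite side of the conditional

# usage on a bivariate Gaussian with correlation rho, where each exact
# conditional is x1 | x2 ~ N(rho * x2, 1 - rho^2)
random.seed(0)
rho = 0.99
sd = (1 - rho * rho) ** 0.5
x1, x2 = 0.0, 0.0
trace = []
for _ in range(4000):
    x1 = ordered_overrelax(lambda: random.gauss(rho * x2, sd), x1)
    x2 = ordered_overrelax(lambda: random.gauss(rho * x1, sd), x2)
    trace.append(x1)
```

The update is a deterministic involution on the index (i -> K-i -> i), so it leaves the uniform distribution over the K+1 points stationary, as argued in the text.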
The procedure is detailed in the algorithm below. The numbers of new points to the right and left of the current position are tracked as r and l. In an extreme situation, where the current position is at the very edge of the conditional distribution, one of r and l will remain at zero, and K_max points will be drawn. Conversely, when x_d is near the median of the conditional distribution, the algorithm will typically return sooner. After obtaining an unordered set of points from the algorithm, the list is sorted together with the current point, x_d = s_i, giving an ordered set {s_k}, k = 0 ... K, as before. We now show that ordered overrelaxation still satisfies detailed balance when using a variable K.

Self-size-adjusting population for ordered overrelaxation; generates a number of points K and an unordered set {s_k}:
  inputs: K_min, K_max, current location x_d
  initialize l = 0, r = 0
  for k = 1 ... K_max
    s_k ~ p(x_d | x_{\d})
    if s_k > x_d then r <- r + 1
    else if s_k < x_d then l <- l + 1
    if r >= K_min and l >= K_min then
      K <- k; return
  end for

We need the equilibrium probability of starting at the point x_d = s_i, generating a set {s_k}, and then deterministically transitioning to s_{K-i}. The order in which the set of points was generated does not matter, although some orderings are not possible, because the algorithm would terminate before generating the entire set. Let c_{l,r} be the number of orderings of r + l points, with r of them greater than x_d, that can be generated by the algorithm. The equilibrium probability of the whole configuration is then proportional to c_{l,r} times the product of the conditional densities of all the points. Given the symmetry c_{l,r} = c_{r,l}, this is also the probability of starting at x_d = s_{K-i}, producing the same set of points, and then deterministically moving to s_i. Summing over all intermediate orderings shows that the transition x_d -> s_{K-i} satisfies detailed balance.

Pivot-based transitions

Inspired by the ordered-overrelaxation generalization of Gibbs sampling, we now introduce a new method which uses general Markov chain transition operators. We start from any pair of mutually reversible transition operators,
T and T~, satisfying T~(x'|x) p(x) = T(x|x') p(x'), which leave a desired target distribution p(x) stationary; a single reversible operator, T = T~, will also suffice. The algorithm below takes the existing operator pair and constructs a new transition operator T'(x'|x); the procedure is illustrated in the figure. The pivot-based transition operator will favor moves to the opposite side of the region accessible by T, as defined by a user-supplied ordering. The target stationary distribution p(x) is maintained without introducing any rejections into the chain, although T'(x'=x | x) will carry finite probability in some discrete settings, or if T(x|x) is finite. Each of the points {s_k}, k = 1 ... K, in the algorithm is marginally distributed according to the target distribution, and they all have the same relation to the pivot state.

The pivot-based transition T'(x'|x):
  1. Take one step of T~ from the point x to a pivot state z, i.e. z ~ T~(z|x).
  2. Use T to draw K points one step away from z, i.e. s_k ~ T(s_k|z), k = 1 ... K.
  3. Use the s_k and x points to create an ordered list {s_k}, k = 0 ... K; break ties arbitrarily and relabel the points such that s_k <= s_{k+1}.
  4. Identify the index i giving x = s_i, and set x' equal to s_{K-i}.

[Figure: illustration of the pivot-based transition operator, between points x and x' via a pivot state z and a set S consisting of the remaining generated points. (a) Forwards: from the start point x, generate the pivot state z, the points S from T(s|z), and the point selected for x'; a draw from T'(x', S, z | x). (b) Backwards: the reverse process, involving the same S and z, occurs with probability T'(x, S, z | x').]

All points, including x and x', can be seen as draws from T(s_k|z). By symmetry, the point labeled as the starting position can be resampled from a uniform distribution over the points, or updated as in ordered overrelaxation, which leaves the uniform distribution stationary. While perhaps unconvincing as a proof, this symmetry is important: if all the proposals started
at x, then this location would be made special, making it harder to construct plausible reverse moves starting at any other point.

We now show explicitly that the new operator satisfies detailed balance. Given a starting point x, the probability of generating a pivot state z, a new point x', and K-1 other points S, as in panel (a) of the figure, is
    T'(x', S, z | x) = T~(z|x) * C * T(x'|z) * prod_{s in S} T(s|z),
where C counts the orderings in which the points could have been produced: the first step must produce the pivot state z from x, and the other points can then be generated in any order. The probability of producing the same configuration of points by starting at x', as in panel (b), is
    T'(x, S, z | x') = T~(z|x') * C * T(x|z) * prod_{s in S} T(s|z).
A small amount of manipulation, using the operators' mutual reversibility condition T~(z|x) p(x) = T(x|z) p(z), gives
    T'(x', S, z | x) p(x) = T(x|z) p(z) * C * T(x'|z) * prod_{s in S} T(s|z)
                          = T~(z|x') p(x') * C * T(x|z) * prod_{s in S} T(s|z)
                          = T'(x, S, z | x') p(x'),
and summing over z and S gives T'(x'|x) p(x) = T'(x|x') p(x'), i.e. the new chain satisfies detailed balance. The number of points K used by the algorithm could be automatically adjusted: we would use the earlier self-size-adjusting algorithm to draw the points, replacing the conditional distribution with T(s_k|z). The proof above still holds if we replace the ordering count C with c_{l,r}.

Ordered overrelaxation with pivot-based transitions

Gibbs sampling updates each component x_d of a distribution in turn, by applying a transition operator T_d that resamples from the conditional p(x_d | x_{\d}). The original ordered overrelaxation replaces Gibbs sampling's T_d operators with modified versions that tend to move to the opposite side of the conditional distribution from the current setting of x_d. But Gibbs sampling is not the only Monte Carlo algorithm that updates components of a distribution in turn. Now we can perform ordered overrelaxation by replacing any algorithm's component-based update operators T_d with their pivot-based versions; each update will tend to move the current setting to the opposite side of the distribution defined by T_d. When the original Markov chain is Gibbs sampling, the z
states become irrelevant, and the resulting transition operator is equivalent to the standard ordered-overrelaxation algorithm. The figure gives an illustrative example of ordered overrelaxation applied to a bivariate Gaussian distribution. Using pivot-based transitions, we can obtain the same benefits as the original ordered overrelaxation with slice sampling, which doesn't require the ability to sample from conditional distributions. Again, as with standard ordered overrelaxation, the benefits are only worth the computational cost if some saving can be made when drawing multiple points.

Although pivot-based transitions do not introduce any rejections into the chain, if the underlying transition operator is a Metropolis method then the chain will sometimes stay still. The rejections will cause reversals in the direction of motion, reducing the ability of the chain to maintain a persistent direction. This will tend to occur at the edges of a distribution, where rejections are more frequent and where ordered overrelaxation is normally of most use. Panel (c) of the figure demonstrates that pivot-based transitions based on Metropolis work much less well than the same method based on slice or Gibbs sampling. In a later section we will develop a better Metropolis method based on pivot states; first, we explore another use of pivot-based transitions.

Persistence with pivot states

As pivot-based transitions apply to general transition operators, we can consider algorithms that are not tied to component-based updates. In this section we present an alternative way to achieve persistent motion using pivot-based transitions. The algorithm is based on the expanded distribution
    p(x, z) = T~(z|x) p(x) = T(x|z) p(z).
For other ways of constructing nonreversible chains that use this distribution, see Neal. Here we do not require the base operator T to be reversible. A sample (x, z) from the expanded distribution provides an equilibrium sample x and a pivot state z, as produced by step 1 of the algorithm. Following the rest of the algorithm
gives a new setting (x′, z): the same pivot state with a new starting point.

[Figure: ordered overrelaxation schemes applied to a highly correlated bivariate Gaussian. Panels: (a) standard ordered overrelaxation, Gibbs sampling; (b) pivot-based transitions, univariate slice sampling; (c) pivot-based transitions, Metropolis. Some exact samples from the distribution are shown in black. Standard Metropolis, shown in red, proceeds by slow diffusion. The ordered overrelaxation schemes, shown in blue, are able to make persistent progress along the distribution (here exaggerated with large k). Rejections in the underlying transition operators upset this persistent motion, however: the Metropolis sampler in (c) gains little benefit from pivot-based transitions.]

This procedure leaves the expanded distribution invariant. The proof is similar to before:

    T(x′, S; x, z) p(x, z) = k! T(x′; z) ∏_{s∈S} T(s; z) · T(x; z) p̃(z)
                           = k! T(x; z) ∏_{s∈S} T(s; z) · T(x′; z) p̃(z)
                           = T(x, S; x′, z) · T̃(z; x′) p(x′)
                           = T(x, S; x′, z) p(x′, z).

The final line was obtained by using the expanded-distribution identity above; summing over the intermediate points S shows that the appropriate detailed balance relationship holds with respect to the expanded distribution.

The expanded distribution also suggests that the sample can be interpreted as an equilibrium sample z (under p̃) with a pivot state x drawn via T(x; z). This pivot state could be updated by an alternative pivot-based transition operator based on the same algorithm: where T was used to produce the k points, we switch to using T̃. A version of the above proof shows that this operator also leaves the expanded distribution invariant.

Given a pair sampled from the expanded distribution, (x, x′) with p(x) T̃(x′; x) = p̃(x′) T(x; x′), we alternate between two Markov chain updates. First we update x′, using x as a pivot state, with the algorithm's operator T; next we update x, using x′ as a pivot state, with the alternate operator T̃. The figure illustrates the effect of alternately applying these two operators. Both operators have a tendency to reverse the ordering of x and x′: by continually jumping over each other, the pair move persistently across the space that induces the ordering. By chance, for finite k, the points will sometimes stay in the same order; in subsequent iterations motion will be in the opposite direction, allowing the chain to mix across the whole space. The figure shows pivot-based persistent motion when the base transition operator has a small step size, which would normally lead to slow diffusion. The effect is dramatic: the entire range of the distribution is explored in many fewer iterations. However, the improved dynamics alone are not actually sufficient to justify the increased cost of drawing multiple proposals; as with MTM, some further savings, such as parallel computation, are required.

[Figure: using pivot states for persistent motion. Each row shows a new iteration of the algorithm described in the main text, marking the point to be updated, the pivot state, the new points (the set S), the updated point, and the next pivot. The pivot state tends to move in the same direction for multiple iterations; in the final two iterations shown, the direction of motion has reversed. Drift in one direction would persist for longer if a larger number of states were drawn from the pivot state.]

[Figure: example of persistent motion exploring a uniform distribution with Gaussian proposals (position x against iteration number). Metropolis's random-walk behavior is greatly reduced by pivot-based transitions, although at a greater computational cost.]

Pivot-Based Metropolis

Pivot-based Metropolis is a new alternative to MTM-based methods for using multiple Metropolis proposals. As in MTM, we use an arbitrary proposal distribution q(x_k; x), but by making different trade-offs, ordered-overrelaxation updates become possible. The new procedure will follow the earlier algorithm closely, but uses arbitrary proposal distributions rather than transition operators. Arbitrary proposals
could be turned into Metropolis–Hastings transition operators for use in the earlier algorithm, but many of the candidate points would be in the same place due to rejections. The algorithm below provides a way to use all of the points proposed from a proposal distribution q. The pivot state can optionally be created using a different proposal distribution q̃; indeed, the pivot state need not live in the same space as the distribution being sampled, unless applying the persistence procedure of the previous subsection. This procedure can be seen as an instance of random proposal distributions (Besag et al.).

Algorithm: the pivot-based Metropolis operator T(x′; x)
1. Take one step of q̃ from point x to a pivot state z, i.e. z ~ q̃(z; x).
2. Use q to draw K points one step away from z, i.e. s_k ~ q(s_k; z), k = 1 … K.
3. Use the s_k and x points to create an ordered list (breaking ties arbitrarily) and relabel the points such that s_1 ≤ s_2 ≤ …; identify the index i giving x = s_i.
4. Compute weights w_k = p(s_k) q̃(z; s_k) / q(s_k; z).
5. Choose x′ = s_j from {s_k} using a distribution proportional to the weights, or any reversible move T_s(j; i) that leaves this distribution invariant.

The pivot-based Metropolis operator above is a valid transition operator for p(x).

Proof: the equilibrium probability of starting at s_i, generating pivot state z and the remainder of the set of points {s_k}, and choosing final point s_j is

    T(z, s_{k≠i}, j; s_i) p(s_i) = T_s(j; i) p(s_i) q̃(z; s_i) ∏_{k≠i} q(s_k; z)
                                 = T_s(j; i) w_i ∏_{k} q(s_k; z),

which is invariant to swapping s_i and s_j, since T_s satisfies detailed balance with respect to the weights, T_s(j; i) w_i = T_s(i; j) w_j. Summing over all intermediate points {s_{k≠i,j}, z} shows that T(s_j; s_i) satisfies detailed balance.

Ordered overrelaxation uses a transition rule that leaves a uniform distribution over a set of points stationary; pivot-based Metropolis (and perhaps other algorithms) requires a generalized move that leaves a nonuniform discrete distribution stationary. A probability distribution over points {s_k} can be represented on a unit interval: each point occupies a segment with width equal to its probability p_k. The segments can be ordered by any method that does not use the current setting s_i or the history of the Markov chain.

[Figure: reflect move for discrete distributions. A unit interval is constructed from segments {s_k}; each bar has width equal to the point's probability. Given current point s_i, the probability of transitioning to another point is proportional to its overlap with s_i's reflection; for example, s_j will be selected with probability o_ij / p_i. Self-transitions are only possible if s_i overlaps with the middle of the unit interval, which is when ordered overrelaxation is less helpful.]

Assume we have a current sample s_i and wish to apply a transition operator that tends to take us to the opposite end of the distribution; such an operator is illustrated in the figure. The probability interval corresponding to the current point, [c, c + p_i), is reflected to [1 − c − p_i, 1 − c). The reflected interval will have some overlap with one or more points' probability segments; we sample from these points with probability proportional to their overlap o_ij with the reflected interval. The probability of observing a transition from s_i to s_j is T(s_j; s_i) = o_ij / p_i, so p_i T(s_j; s_i) = o_ij, which is independent of the direction of the transition; therefore detailed balance holds. The generalized reflection rule leaves the discrete target distribution invariant, and using this rule within the pivot-based algorithm gives an operator based on arbitrary proposals that is suitable for use in ordered overrelaxation or in the persistent-motion scheme of the previous subsection.

Summary

This chapter explored ways to use populations of points drawn from proposal distributions. We started with a review of two existing methods, population Monte Carlo (PMC) and multiple-try Metropolis (MTM), which highlighted two important points. First, proposal distributions used by MCMC are usually unsuitable for simple importance sampling; attempts to combine MCMC with estimators based on importance sampling should not forget this. Second, drawing multiple points in parallel from a proposal distribution usually gives slower mixing than drawing the same number of points sequentially.
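As a concrete illustration of the weighted reflect move described earlier, here is a minimal Python sketch (our own code, not from the thesis; all names are ours). It lays the segments on the unit interval in a fixed order, reflects the current segment about the midpoint, and samples a successor in proportion to overlap:

```python
import random

def reflect_move(p, i, rng=random):
    """One reflect move on a discrete distribution (sketch of the
    generalized ordered-overrelaxation rule described in the text).
    p: segment probabilities summing to 1, in a fixed order that must not
    depend on the current index i or on the chain's history.
    Returns the index of the next state."""
    # Left edge c of the current point's segment on the unit interval.
    c = sum(p[:i])
    # Reflect [c, c + p_i) through the midpoint to [1 - c - p_i, 1 - c).
    lo, hi = 1.0 - c - p[i], 1.0 - c
    # Overlap o_ij of each segment [a, a + p_j) with the reflected interval.
    overlaps = []
    a = 0.0
    for pj in p:
        overlaps.append(max(0.0, min(a + pj, hi) - max(a, lo)))
        a += pj
    # The overlaps sum to p_i; choose j with probability o_ij / p_i.
    u = rng.random() * p[i]
    acc = 0.0
    for j, o in enumerate(overlaps):
        acc += o
        if u <= acc:
            return j
    return len(p) - 1
```

For example, with p = (0.3, 0.2, 0.5), the current point s_0 occupies [0, 0.3), which reflects to [0.7, 1.0); only s_2 overlaps, so the move there is deterministic, and one can check that p_i T(s_j; s_i) = o_ij is symmetric, as detailed balance requires.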
The parallel methods require savings from shared computations to be competitive. In response to the first of these points, we derived a waste-recycling scheme for MTM as an alternative to an existing importance-sampling-based estimator.

We then focussed on methods that use a population of points to move the state of a Markov chain to the opposite side of a user-defined ordering. Chaining these updates together can cause persistent motion, through Neal's ordered overrelaxation or the method we introduced above. A random walk will explore a distribution with lengthscale L in O((L/σ)²) steps of size σ; more persistent motion can bring this down to O(L/σ). Our contributions to this area were:
- an algorithm which automatically reduces the size of a population when ordered overrelaxation is less likely to be useful, saving computation;
- generalizing ordered overrelaxation to transition operators other than Gibbs sampling;
- using ordered transition operators for persistent motion without using component-based updates;
- generalizing ordered updates to operate directly on a set of weighted points; these moves are less likely to reject than using Metropolis–Hastings within the pivot-based algorithm, which is important because rejections destroy persistent motion.

The usefulness of these innovations relies on the assumption that drawing multiple proposals is computationally efficient. It remains to be seen whether implementations leveraging caching of intermediate results, or parallel computations, will make our work effective in real applications. Similar assumptions are made by related work in the literature, which we now briefly consider, along with directions for future research.

Related and Future Work

Tjelmeland and Stormark investigate an idea similar to pivot-based Metropolis: they suggest drawing a set of antithetic variables, possibly using quasi-Monte Carlo. Recent work also applies these ideas to MTM (Craiu and Lemieux); pivot-based Metropolis could also be extended in this way.
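The O((L/σ)²) versus O(L/σ) scaling noted in the summary above is easy to check numerically. The following toy simulation (our own illustration, not from the thesis) compares the distance reached by a plain ±σ random walk with perfectly persistent motion over a range of length L:

```python
import random

def max_excursion_random_walk(steps, sigma, rng):
    """Furthest distance from the start reached by a plain +/- sigma
    random walk: after n steps this is typically O(sigma * sqrt(n)), so
    covering a lengthscale L takes O((L/sigma)^2) steps."""
    x, furthest = 0.0, 0.0
    for _ in range(steps):
        x += sigma if rng.random() < 0.5 else -sigma
        furthest = max(furthest, abs(x))
    return furthest

L, sigma = 100.0, 1.0
rng = random.Random(0)
# Persistent (ballistic) motion crosses L in exactly L/sigma = 100 steps.
persistent_steps = int(L / sigma)
# With the same budget, the random walk covers far less than L ...
short_reach = max_excursion_random_walk(persistent_steps, sigma, rng)
# ... and needs on the order of (L/sigma)^2 = 10000 steps to cover L.
long_reach = max_excursion_random_walk(int((L / sigma) ** 2), sigma, rng)
```

With the budget that suffices for persistent motion, the random walk typically reaches only around sqrt(L/σ)·σ ≈ 10 units; it needs roughly the squared number of steps to explore the full range.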
Tjelmeland also recommends waste-recycling, which would also apply to pivot-based algorithms. The main difference in his algorithm is that all the states are drawn as proposals from the current state x, rather than from an intermediate state z. Making reversible moves based on these proposals requires working out the probability of each state reproducing the entire ensemble, an O(K²) computation; our pivot-state approach reduces this computation to O(K), and makes an exchange of the current point with one of the proposals seem more natural.

As an example, we showed pivot-based ordered overrelaxation applied to slice sampling. In fact, slice sampling already has its own method for overrelaxation (Neal), although this is susceptible to rejections unless the edges of the slice are found accurately. Ordered overrelaxation may be preferable, as it never has rejections while needing only enough information to draw k samples. There is a way to eliminate the pivot state when slice sampling from unimodal distributions: it is possible to Gibbs sample the slice-sampling auxiliary distribution, so standard ordered overrelaxation applies.

An earlier subsection described a new way to make persistent motion along a direction that defines an ordering on the state space. Choosing suitable methods for ordering points will be problem-dependent in general. One generic application could be simulated tempering, where energy could be used to define an ordering; persistent motion along this ordering should encourage the inverse temperature to move more rapidly between zero and one.

Radford Neal has suggested (personal communication) considering an alternative to pivot-based transitions. This method starts with a random integer f drawn uniformly between 0 and K inclusive. The current point x is evolved through a sequence of f transitions using T; then, starting at x again, a sequence of b = K − f backwards transitions is simulated using T̃. Choosing uniformly from the resulting states maintains the stationary distribution. Also, reordering the points and choosing the complement to x, as in ordered overrelaxation, is valid. Producing a chain of states, rather than a star of steps starting at a single state, is likely to produce larger moves. A version of Neal's idea based on the weighted reflect operator above is also possible: the chain could use transition operators that leave a cheaper-to-evaluate surrogate distribution invariant; proposal weights could be evaluated for a thinned subset of the chain, and the reflect operator then applied. Yet another possibility is the multipoint Metropolis method (Qin and Liu), an extension to MTM that uses a chain of states. A disadvantage of using a chain of transitions is that each move must be made sequentially; the pivot-based algorithms introduced in this chapter could evaluate their proposals in parallel on suitable computer hardware.

Chapter: Normalizing Constants and Nested Sampling

The computation of normalizing constants plays an important role in inference and statistical physics. For example, Bayesian model comparison needs the evidence, or marginal likelihood, of a model M given observed data D,

    Z ≡ p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ = ∫ L(θ) π(θ) dθ,

where the model has prior π and likelihood L over parameters θ. In statistical physics, Z(β) = Σ_x exp(−βE(x)) is the normalizing constant for the distribution over states x with energy E(x) at inverse temperature β. In this context Z is an important quantity known as the partition function: a function of β from which several fundamental properties of a system can be derived. The marginal likelihood is also a function, in this case of the model M, which will become important in the next chapter. In this chapter we focus on estimating the constant for a given model; we also consider estimating the whole Z(β) function from physics, which, perhaps surprisingly, is useful for solving the marginal likelihood problem. Statistical physics normally uses the normalization in its log form: log Z is known, up to sign and temperature factors, as the free energy of a system. When used in statistical
inference, log Z has also been found to be a useful and natural scale for comparing models. Given two models M₁ and M₂, the log-likelihood ratio log(Z₁/Z₂) = log p(D|M₁) − log p(D|M₂) can be used in a classical hypothesis test, which rejects M₂ in favor of M₁ when a threshold is exceeded. This means that absolute differences in log Z correspond to shifts in model quality that are meaningful without reference to an alternative model. Researchers decrypting codes at Bletchley Park judged that a difference of roughly a deciban (log₁₀(Z₁/Z₂) = 0.1) is about the smallest difference between two models that people can express. The table below contains qualitative descriptions of the importance of marginal likelihood ratios.

[Table: a selection of qualitative descriptions of the importance of evidence values, taken from Kass and Raftery and from Jeffreys (Appendix B). The columns give ranges of log₁₀(Z₁/Z₂) in bans, log_e(Z₁/Z₂) in nats and log₂(Z₁/Z₂) in bits; the corresponding descriptions of the evidence against M₂ run from "weak" through "substantial", "strong" and "decisive" to "beyond reasonable doubt". The numeric ranges did not survive extraction. The third column could be turned into a deviance by multiplying by two; for more on units of information content, particularly the ban, see MacKay.]

The precise interpretation of the numerical value of a model's evidence will depend on context. In some applications it is quite possible to keep using M₂ even with a marginal-likelihood ratio that, in isolation, seems in favor of M₁ beyond reasonable doubt. In a legal setting it may be that a "guilty" hypothesis being a thousand times more likely than "innocent" is enough to convict, but other applications associate even more extreme losses with adopting a particular model. In data compression of large files a small difference in size, a handful of bits, is insignificant: if compressing with M₂ is computationally cheap, an improvement in marginal likelihood very much greater than this is required to adopt a new model. Even in the legal setting, one must remember that the most probable model is not necessarily the one with the highest marginal likelihood: extreme prior ratios can cancel large likelihood ratios. Although this shouldn't happen very often, under the rules of rational inference one should mechanically apply Bayes' rule; but if the data are strongly out of line with prior expectations, it would seem sensible to carefully re-examine the model's assumptions and check for computational errors.

This chapter focusses on the computation of Z. The message to take from the above discussion is that computing the evidence of a model to a very high level of precision is usually pointless: it is difficult to appreciate a difference in log Z of less than one, and some errors larger than this may not affect decisions based on the computation. This means that even noisy Monte Carlo estimates can be useful; indeed, Monte Carlo techniques are sometimes identified as providing "gold standard" estimates of Z suitable for comparison with other approximate techniques (e.g. Beal and Ghahramani; Kuss and Rasmussen). However, just as with any Monte Carlo integration method, it is quite possible to get wrong answers for some classes of difficult problems; this includes algorithms guaranteed to be correct asymptotically. Developing a variety of methods and checks, and hoping for the best, is all we can do in general. This chapter first examines the necessary ingredients of a Monte Carlo method for computing Z. Next we review nested sampling, a new method due to John Skilling, and provide a comparative analysis of this new algorithm and more established techniques. Then, having analyzed nested sampling in its own right, we consider using it as a method for guiding and checking other algorithms.

Starting at the Prior

By definition, the prior should spread its probability mass over settings of the parameters deemed reasonable before observing the likelihood L. Therefore samples from this distribution can sometimes provide a useful representation of the whole parameter space. In particular, as Z is just an average under the prior, a simple Monte Carlo approach could be attempted:

    Z = ∫ L(θ) π(θ) dθ ≈ (1/S) Σ_{s=1}^{S} L(θ^(s)),    θ^(s) ~ π.

The variance of this estimator may be huge. Most of the mass of the integral is associated with parameter settings that are typical under the posterior, and the posterior often occupies a small effective fraction of the prior's volume, so it can take a long time to obtain a representative sample. However, simple Monte Carlo does work on small problems and provides a simple-to-implement check of more complex code. Starting at the prior will also form the basis of more advanced methods.

Why not start at the posterior instead? Posterior samples are concentrated around the mass of the integral; also, practitioners already sample from the posterior for making predictions, and it would be convenient if these samples could also estimate Z. The obvious importance sampling estimator for Z based on posterior samples is nonsensical, as it involves knowing Z. A moment of inspiration yields the following harmonic mean estimator, noted by Newton and Raftery:

    1/Z = ∫ (1/L(θ)) · L(θ)π(θ)/Z dθ = ∫ p(θ|D, M)/L(θ) dθ
    ⟹  Ẑ = [ (1/S) Σ_{s=1}^{S} 1/L(θ^(s)) ]^{−1},    θ^(s) ~ p(θ|D, M).

As acknowledged by its creators, this estimator has the unfortunate property that the least probable points have small L and so carry the largest weights; thus the effective sample size tends to be very small. Indeed, the estimator can easily have infinite variance. Various attempts have been made to fix the harmonic mean estimator, both within the original paper and in more recent research (Raftery et al.). The authors are keen to avoid sampling from the prior as well as the posterior; however, this goal seems misplaced. Firstly, implementing samplers for most priors is relatively simple, and allows several diagnostics to be performed: checking that prior assumptions look reasonable, checking against simple importance sampling, and Geweke's "getting it right" tests. Secondly, while some specific instances seem to work, there is a generic problem with posterior-only methods: as pointed out by Neal in the discussion of Newton and Raftery, the choice of prior has a strong influence on the value of Z.
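The simple Monte Carlo check above is easy to exercise on a toy conjugate model where the evidence is known exactly. The following sketch is our own illustration (the model and all names are ours, not the thesis's): prior θ ~ N(0, 1) and one datum d with unit noise, so Z = N(d; 0, 2) in closed form.

```python
import math
import random

def simple_mc_evidence(loglik, prior_sample, S, rng):
    # Z = E_prior[L(theta)], estimated by averaging likelihoods of prior draws.
    return sum(math.exp(loglik(prior_sample(rng))) for _ in range(S)) / S

d = 1.0
# log N(d; theta, 1): Gaussian likelihood of the single datum.
loglik = lambda th: -0.5 * math.log(2 * math.pi) - 0.5 * (d - th) ** 2
prior_sample = lambda rng: rng.gauss(0.0, 1.0)   # theta ~ N(0, 1)

rng = random.Random(0)
z_hat = simple_mc_evidence(loglik, prior_sample, 20000, rng)
z_true = math.exp(-d * d / 4) / math.sqrt(4 * math.pi)  # N(d; 0, 2)
```

On a problem this small the estimator is accurate; on larger problems its variance grows as the posterior occupies a vanishing fraction of the prior, which is exactly the failure mode discussed above.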
Taking a broad prior to an improper limit will send Z towards zero. However, in many inference problems the data are so influential that the statistics of the posterior are fairly insensitive to such changes in the prior. This demonstrates that posterior statistics alone bear little relation to Z in many statistical problems: reliable estimates of Z require finding the prior mass associated with the large likelihood values found under the posterior. In low-dimensional problems this is a feasible density estimation problem, although deterministic approximations to Z based on a direct fit to the posterior may be preferable in those cases.

Chib notes that it is sufficient to know the posterior probability at just a single point: for any point θ*,

    Z = L(θ*) π(θ*) / p(θ*|D, M).

Chib provides an estimator for p(θ*|D, M) based on the probability of a Gibbs sampler transitioning to θ* from each of the points visited by the sampler. Chib's method is relatively easy to apply and should work well on unimodal problems; posteriors with more complex shapes will lead to subtle difficulties. In general, it is unreasonable to expect the sequence of states visited by a sampler to densely cover the posterior in models with many parameters: if this were possible, it would be feasible to build a kernel density estimate on the basis of a preliminary run and perform importance sampling instead. Practitioners simply hope to get a representative sample of points that are useful for prediction; it is accepted that the sampler may not go near some typical points. As an analogy, consider conducting a survey of the world population: a sample of people might not include anyone from some cities, yet will be adequate for many statistical purposes. In contrast, Chib's method is likely to fail without fairly dense sampling: if none of the posterior samples can easily reach θ* within one Gibbs-sampling iteration (are in its "city"), the estimator will be badly behaved. This won't occur if θ* was chosen from the posterior samples, but this is only hiding the problem at the expense of biasing the estimator.

Jumping directly to the posterior is fraught with problems: the normalizer Z is a global quantity that depends on the entire posterior's relationship to the prior. If standard numerical methods apply, that is good news; but if the posterior is sufficiently complicated to justify MCMC, the place to start is probably a much simpler distribution, like the prior. On small problems, simple Monte Carlo can be more reliable than Chib's method (Neal). When solving larger problems, prior sampling doesn't work either: we need advanced methods that start with a feasible sampling problem but overcome the variance problems of simple importance sampling.

Bridging to the Posterior

When samples from the prior rarely find regions of high posterior mass, a distribution closer to the posterior is needed for importance sampling to work. Designing or adaptively finding a tractable approximation to the posterior is one approach, although hard in general. Another solution is a divide-and-conquer approach: rather than finding Z directly, a series of easier subproblems are solved. First consider a family of distributions specified by an inverse temperature β:

    p_β(θ) = π(θ) L(θ)^β / Z(β) = π(θ) e^{−βE(θ)} / Z(β),    E(θ) ≡ −log L(θ).

The prior has β = 0, while p_{β=1} is the posterior. All of the distributions for 0 < β ≤ 1 involve unknown normalizations given by the partition function Z(β); we will try to compute this function for a range of inverse temperatures 0 = β₀ < β₁ < … < β_K = 1. When adjacent distributions p_{β_{k−1}} and p_{β_k} are close, they can be compared with standard importance sampling, as reviewed earlier. Weights give the relative importance of states evaluated under the two distributions:

    w_k(θ) = L(θ)^{β_k} / L(θ)^{β_{k−1}} = L(θ)^{β_k − β_{k−1}}.

The expected value of these weights is the ratio of the two distributions' normalizers:

    Z(β_k)/Z(β_{k−1}) ≈ (1/S) Σ_{s=1}^{S} w_k(θ^(s)),    θ^(s) ~ p_{β_{k−1}}.

These estimates can then be combined using

    Z(1) = Z(β_K)/Z(β₀) = ∏_{k=1}^{K} Z(β_k)/Z(β_{k−1}),    Z(β₀) = Z(0) = 1.

Numerical concerns suggest taking logs of a large positive product:

    log Z(1) = Σ_{k=1}^{K} log[ Z(β_k)/Z(β_{k−1}) ],

where each term in the sum can be estimated from samples under p_{β_{k−1}}. This form is also convenient for error bars: the variance of an estimator for log Z is the sum of the variances of the estimators for the log of each ratio.

Given this basic annealing approach, there are several implementation alternatives. Simulated tempering and parallel tempering, reviewed back in an earlier section, both offer correlated samples from p_β over a user-set range of β. Annealed importance sampling (AIS) samples from distributions which are only close to p_β, but a single run gives the same estimator as above for S = 1:

    log ŵ = Σ_{k=1}^{K} (β_k − β_{k−1}) log L(θ_k),

where θ_k is the sample drawn at the k-th iteration of AIS (or a single sample from p_{β_{k−1}}). When S > 1, or multiple AIS runs are performed, the samples are combined differently. Independent importance samplers provide local gradient, or ratio, information at each temperature; the samples are first combined at each level, as in the ratio estimator above. AIS gives up the requirement for valid samples from each p_{β_k} by running importance sampling on a global distribution; thus AIS dictates forming independent Ẑ estimates from each run and then averaging them. Generally the AIS estimator should be used, because it does not assume exact samples are available from each temperature. In some cases, applying the per-level ratio estimator to pseudo-samples from rapidly mixing Markov chains can work better than AIS, but is difficult to justify.

An aside on the prior factorization: some readers may be more familiar with canonical distributions defined by p_β(θ) ∝ exp(−βE(θ)), where the energy E is defined in terms of the negative log of the whole unnormalized probability rather than just the likelihood. For finite state spaces this corresponds to the family above with a uniform π. In an unbounded state space, p_{β=0} is then an improper distribution; introducing a tractable normalized base measure π ensures it is always possible to sample from p_{β=0}. The base measure need not be a prior distribution; another choice may be computationally
convenient, or the problem may not be Bayesian and so have no prior. In a later subsection, a distribution is factorized in an unconventional way for algorithmic reasons. The term "prior" is used throughout this chapter to emphasize a common usage of these algorithms, and π should usually be a simple, tractable starting point.

Thermodynamic Integration

The bridging procedures above have been independently developed by various communities (Gelman and Meng); these different views can bring a better understanding to the problem. The statistical physics community views Z(β) as a useful function: applying operators to the partition function often yields interesting, physically meaningful quantities. In particular, for the canonical distribution p_β(θ) = e^{−βE(θ)}/Z(β), with Z(β) = ∫ e^{−βE(θ)} dθ, the log-normalizer is a moment generating function:

    d log Z/dβ  = (1/Z) ∫ (−E(θ)) e^{−βE(θ)} dθ = −E_{p_β}[E]
    d² log Z/dβ² = (1/Z) ∫ E(θ)² e^{−βE(θ)} dθ − [ (1/Z) ∫ E(θ) e^{−βE(θ)} dθ ]²
                 = E_{p_β}[E²] − (E_{p_β}[E])² = var_{p_β}[E].

Unlike Z itself, these expectations can reasonably be approximated for a given temperature using samples recorded under simulations at that temperature. Of course, log Z is related to its gradients by a trivial identity:

    log Z(β₁) = log Z(β₀) + ∫_{β₀}^{β₁} (d log Z/dβ) dβ = log Z(β₀) − ∫_{β₀}^{β₁} E_{p_β}[E] dβ.

The simplest discrete approximation to this integral uses measurements at a sequence of temperatures {β_k}; to estimate the normalizer at inverse temperature β_K,

    log Z(β_K) ≈ − Σ_{k=1}^{K} (β_k − β_{k−1}) ⟨E⟩_{p_{β_{k−1}}}.

Comparing to the log-ratio estimator above, and remembering that E = −log L, we see that simple thermodynamic integration and a bridging sequence of importance samplers are the same approximation. Thermodynamic integration seems deceptively general; the difficulty is in choosing intermediate distributions. Appropriate temperature-based distributions may be difficult, and sometimes impossible, to find. This chapter explores how to choose a sequence of temperatures, as well as alternative methods that don't use annealing schedules.

Multicanonical Sampling

The multicanonical ensemble was mentioned earlier as a distribution that might allow better Markov chain exploration of the state space than the original target distribution. The multicanonical ensemble is available in terms of the target distribution p_t(θ) = p*_t(θ)/Z_t and a multicanonical weighting function w_mc:

    p_mc(θ) = p*_mc(θ)/Z_mc,    p*_mc(θ) = p*_t(θ) w_mc(θ).

The multicanonical heuristic suggests finding a weighting such that a Markov chain exploring p_mc spends equal times at all energies, including those typical of the posterior and the prior. A weighting function giving approximately this behavior must be found from preliminary runs. Samples from p_mc can be used as an importance sampling proposal distribution, as reviewed earlier: weights w(θ) = p*(θ)/p*_mc(θ) give the normalizing constant ratio between any distribution p(θ) = p*(θ)/Z_p and the multicanonical distribution,

    (1/S) Σ_{s=1}^{S} w(θ^(s)) ≈ Z_p/Z_mc,    θ^(s) ~ p_mc.

The multicanonical normalization Z_mc is generally unavailable, but can be eliminated by comparing to the base distribution π:

    log Z_t = log(Z_t/Z_mc) − log(Z_π/Z_mc).

Instead of bridging carefully between the prior and posterior, the multicanonical ensemble has the ambitious goal of capturing both at once: one hopes that the broad coverage of p_mc gives estimators with reasonable variance for quantities relating to both p_t and π. This is explored theoretically and experimentally in later sections.

Nested Sampling

Nested sampling is a new Monte Carlo method by Skilling, intended for general Bayesian computation, which bears some relation to earlier work found in McDonald and Singer. It is designed to be a general and robust alternative to annealing-based methods. Like annealing, it starts at the prior and samples from a sequence of distributions that become more constrained at each iteration. However, nested sampling makes no use of temperature and does not require tuning of intermediate distributions or other large sets of parameters; it also provides a natural means to compute error bars on all of its results without needing multiple runs of the algorithm. The key feature, and technical difficulty, of nested sampling is sampling from the prior subject to a series of lower bounds on the likelihood. The reward for attempting this novel challenge is an estimate of the prior probability mass associated with each observed likelihood value; this representation of the posterior provides an estimate of the normalization constant and any other property of the posterior. In addition to reviewing necessary material, the remainder of this chapter provides: an improved implementation of nested sampling for problems with degenerate likelihoods or discrete distributions; a brief study of a deterministic approximation of nested sampling's behavior, which has theoretical implications for its performance and practical benefits for some applications; a comparative analysis of the generic properties of annealed importance sampling, nested sampling and multicanonical sampling; and illustrative examples, on some simple continuous distributions and on the Potts model, an undirected graphical model (incidentally, a new variant of Swendsen–Wang is derived). Some of this material was previously presented in Murray et al.

A Change of Variables

The normalization of a posterior over variables θ is a weighted sum of likelihoods L(θ) over elements of prior mass π(θ)dθ:

    Z = p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ = ∫ L(θ) π(θ) dθ.

We label each element of prior mass dx = π(θ)dθ; this is only a change in notation:

    Z = ∫ L dx.

The integral could in principle be computed by accumulating elements in a standard raster order in the parameter space (panel (a) of the figure). Alternatively, we can add up the contributions from each scalar element of prior mass dx in any order we like, as illustrated in panel (b); we choose to order the elements by their corresponding likelihood. As π is a distribution, its elements sum to one and can be arranged along a unit interval:

    Z = ∫₀¹ L(x) dx,    x(λ) = ∫_{L(θ)>λ} π(θ) dθ.

For now, assume π is continuous and that L provides a total ordering of elements, so that x is an invertible change of variables; later these assumptions will be relaxed so that the mapping will always be invertible.

[Figure: two views of the integral Z = ∫ L(θ)π(θ)dθ for the posterior normalization. (a)
elements of the parameter space d are associated with likelihood values heights l these elements likelihood values are summed weighted by their prior masses d b bars wi th exactly the same set of heights are arranged in order along a scalar unit int erval three bars are colored to help illustrate the correspondence in a the bars have a hyperarea base corresponding to an element in parameter space in b they have a scalar width corresponding to an element of prior mass dx xdx the area of each bar is its weighted likelihood so z is the sum of these the area under the curve in b the mapping x may seem strange but is closely related to the familia r cumulative distribution function cdf for a onedimensional density p c p d this quantity is often considered very natural some authors prefer to define distrib utions in terms of the cdf unlike a density the cdf is invariant to monotonic ch anges of variable and it directly returns the probability mass between two setti ngs of the parameters also its inverse allows the straightforward generation of samples from uniform random variates each element dc corresponds to an element o f probability mass pd naturally these elements sum to one dc cumulative distribu tion functions for multivariate distributions are less frequently used in statis tics a sensible ordering of the elements of probability mass is less obvious and the multivariate integrals involved may be difficult comparing equations and we see that x is just a cumulative distribution function of the prior correspondin g to a particular choice of ordering for scalar parameters the standard inverse cdf c gives the parameter such that fraction c of the prior mass is associated w ith parameters less than for general parameters the inverse of the likelihoodsor ted cdf x gives the parameter value such that fraction x of the prior mass is as sociated with likelihoods greater than l this is illustrated in figure a for mul timodal likelihoods the prior mass satisfying a 
Figure: nested sampling illustrations, adapted from MacKay. (a) Elements of parameter space (top) are sorted by likelihood and arranged on the x-axis; an eighth of the prior mass is inside the innermost likelihood contour in this figure. (b) A point θₛ is drawn from the prior inside the likelihood contour defined by θₛ₋₁; Lₛ is identified. The ordering on the x-axis and p(xₛ) are known, but exact values of xₛ are not. (c) With N particles, the one with smallest likelihood defines the likelihood contour and is replaced by a new point inside the contour; Lₛ and p(xₛ) are still known.

Computations in the new representation

Given the change of variables described in the previous section, the normalizer is just the area under a monotonic one-dimensional curve, L vs. x. One-dimensional integrals are usually easy to approximate numerically. Assuming an oracle provided some points {(xₛ, Lₛ)}, s = 1…S, ordered such that xₛ₊₁ < xₛ, we can obtain an estimate Ẑ based on quadrature:

    Ẑ = Σₛ Lₛ wₛ,    wₛ = (xₛ₋₁ − xₛ₊₁)/2,

where the quadrature weights given correspond to the trapezoidal rule. Rectangle rules can upper- and lower-bound the error Ẑ − Z, as long as an upper bound on L is known. The boundary conditions of the sum require an arbitrary choice for the edge x-values, such as x₀ = 1 and x_{S+1} = 0. Sensitivity to such choices could always be checked; in our experience they do not matter.

Quadrature is effectively approximating the posterior with a distribution over the S particles, each with probability proportional to Lₛwₛ. Samples from this distribution can approximate samples from the true posterior; the approximate distribution can also be used to directly approximate posterior expectations.

The change of variables has removed details of the high-dimensional space and made the integration problem apparently easy. Of course, the high-dimensional space is still in the original problem and will make the change of variables intractable; this mapping is now the target for approximation.
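The quadrature estimate just described is only a few lines of code. A sketch assuming NumPy, using the edge choices x₀ = 1 and x_{S+1} = 0 from the text; the test integrand L(x) = 2(1 − x) is my own, chosen so that Z = 1 exactly:

```python
import numpy as np

def nested_quadrature(L, x):
    """Trapezoidal estimate of Z = int_0^1 L(x) dx from likelihoods L[s]
    at decreasing prior-mass coordinates x[s]."""
    xp = np.concatenate(([1.0], x, [0.0]))   # edge values x_0 and x_{S+1}
    w = (xp[:-2] - xp[2:]) / 2.0             # w_s = (x_{s-1} - x_{s+1}) / 2
    return float(np.sum(L * w))

# Check on a case where the sorted mapping is known: L(x) = 2(1-x), Z = 1.
x = np.linspace(0.999, 0.001, 2000)
print(nested_quadrature(2.0 * (1.0 - x), x))  # close to 1.0
```

In a real run the x-values are unknown; the point of the next sections is to place a distribution over them.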
Nested sampling algorithms

Nested sampling aims to provide a set of points {(θₛ, Lₛ)} and a probability distribution over their corresponding xₛ = x(θₛ) values. A simple algorithm to draw S such points is the following (see also figure (b)).

Algorithm (single-particle nested sampling):
    draw initial point θ₁ ~ π(θ)
    for s = 2 to S:
        draw θₛ ~ π(θ | L(θ) > Lₛ₋₁), i.e. from the prior restricted to higher likelihoods
    end for

The first parameter set by this algorithm is drawn from the prior, which implies that the corresponding cumulative prior quantity must have distribution p(x₁) = Uniform[0,1]. Similarly, p(xₛ | xₛ₋₁) = Uniform[0, xₛ₋₁], as each point is drawn from the prior subject to Lₛ > Lₛ₋₁, i.e. xₛ < xₛ₋₁. This recursive relationship defines p(x).

A simple generalization uses multiple particles: at each step, the one with smallest likelihood is replaced with a draw from a constrained prior (figure (c)).

Algorithm (multiple-particle nested sampling):
    initialize: draw N points {θₙ} ~ π(θ);  m = argminₙ L(θₙ)
    for s = 1 to S:
        record θₛ = θₘ, Lₛ = L(θₘ)
        redraw θₘ ~ π(θ | L(θ) > Lₛ)
        m = argminₙ L(θₙ)
    end for

The first recorded parameter is the setting with the smallest L from the initial N draws; this is the particle with the largest x. For this parameter to be at x, the other N−1 points must have x′ < x, so p(x₁) = N x₁^{N−1}. The extra factor of N comes from the invariance of the points to reordering and is needed for correct normalization. Alternatively, it is immediately identified as a standard result from order statistics: the nth ordered point drawn from a distribution has its cumulative quantity distributed according to Beta(n, N−n+1), where our case corresponds to n = N, giving Beta(N, 1) (see e.g. Balakrishnan and Clifford Cohen).

After replacing a particle in the redraw step, there will be N samples distributed uniformly between 0 and xₛ. The point with smallest L and largest x will be a fraction r = xₛ₊₁/xₛ through this range, distributed as p(r) = N r^{N−1}. Changing variables gives

    p(xₛ₊₁ | xₛ) = N xₛ₊₁^{N−1} / xₛ^N,    xₛ₊₁ < xₛ.

This defines the joint distribution over the entire sequence,

    p({xₛ} | N) = p(x₁ | N) Π_{s>1} p(xₛ | xₛ₋₁, N).

Note that p({xₛ} | N) depends only on N: it is the same distribution regardless of the problem-dependent likelihood function.
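The multi-particle algorithm can be sketched in a few lines. This is a toy sketch, not the thesis code: the constrained-prior draw is exact (possible here because my example likelihood is monotone in one dimension), the shrinkage is summarized by its mean log xₛ ≈ −s/N, and the example problem and all names are my own:

```python
import math, random

def nested_sampling(loglike, sample_constrained, n, steps):
    """Multi-particle nested sampling; returns an estimate of log Z.

    sample_constrained(logLstar) must draw from the prior restricted to
    log L(theta) > logLstar -- exact here, approximated by MCMC in practice.
    """
    particles = [sample_constrained(-math.inf) for _ in range(n)]
    logL = [loglike(p) for p in particles]
    Z, x = 0.0, 1.0                   # running estimate; prior mass remaining
    for s in range(1, steps + 1):
        worst = min(range(n), key=lambda i: logL[i])
        x_new = math.exp(-s / n)      # deterministic proxy: E[log x_s] = -s/N
        Z += math.exp(logL[worst]) * (x - x_new)   # rectangle rule
        x = x_new
        # Replace the worst particle with a draw inside its likelihood contour.
        particles[worst] = sample_constrained(logL[worst])
        logL[worst] = loglike(particles[worst])
    return math.log(Z)

# Toy problem: uniform prior on [0,1], L(theta) = exp(-theta/0.01), so
# Z = int_0^1 L dtheta ~ 0.01 and log Z ~ -4.6.
scale = 0.01
loglike = lambda th: -th / scale
def sample_constrained(logLstar):
    # {L > L*} is the interval [0, theta*), which we can sample exactly.
    theta_star = 1.0 if logLstar == -math.inf else min(1.0, -scale * logLstar)
    return random.uniform(0.0, theta_star)

random.seed(0)
print(nested_sampling(loglike, sample_constrained, n=100, steps=1500))
```

With N = 100 the estimate lands near log Z ≈ −4.6, with scatter of roughly √s*/N, as the analysis below predicts.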
If we knew the x locations, we could combine them with the likelihood observations {Lₛ} and compute the quadrature estimate of the normalization Ẑ given above. Instead we have a distribution over x, which gives a distribution over what this estimator would be:

    p(Ẑ | {Lₛ}, N) = ∫ δ(Ẑ − Ẑ(x)) p(x | N) dx.

We defer philosophizing over the precise meaning of this distribution to a later subsection. For now, we assume that this distribution gives reasonable beliefs over the estimate that would be obtained by quadrature. As can be checked in a given application, the uncertainty of this distribution tends to be much larger than the differences between choices of quadrature scheme; we can therefore, somewhat loosely, take it to be a distribution over Z itself. Similarly, distributions over any other quantity, such as log Z or posterior expectations, can be obtained by averaging quadrature estimates over p(x | N).

To recap, the key ideas required to understand nested sampling are: (1) it would be convenient if we could perform an intractable mapping from the original state space to a cumulative quantity x — numerical computation of Z or posterior expectations would then only involve a one-dimensional function; (2) samples from the prior subject to a nested sequence of constraints give a probabilistic realization of the mapping; (3) these samples give a distribution over the results of any numerical computation that could be performed given the change of variables. No algorithm can solve all problems — some pathological integration problems will always be impossible for nested sampling. The difficulty is in obtaining samples from the nested sequence of constrained priors.

Figure: the arithmetic and geometric means of xₛ against iteration number s, for the multi-particle algorithm with N particles. Error bars on the geometric mean show exp(−s/N ± √s/N). Samples from p(x | N) are superimposed; some are omitted for clarity.
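The probabilistic realization of the mapping is easy to simulate, since p(x | N) is built from independent Beta(N, 1) shrinkage ratios. The sketch below (NumPy; the synthetic likelihood curve and all names are mine) generates one hidden "true" trajectory, then averages quadrature estimates over fresh trajectories drawn from p(x | N):

```python
import numpy as np
rng = np.random.default_rng(0)

N, S, sigma = 20, 400, 0.01
# Hidden truth: L(x) = exp(-x/sigma), so Z = int_0^1 L(x) dx ~ sigma and
# log Z ~ -4.6.  A run observes likelihoods at its *unknown* x positions.
true_logx = np.cumsum(np.log(rng.beta(N, 1, size=S)))
logLs = -np.exp(true_logx) / sigma

def logZ_given_x(logx):
    # Rectangle-rule quadrature given an assumed x trajectory.
    xs = np.concatenate(([1.0], np.exp(logx)))
    return np.log(np.sum(np.exp(logLs) * (xs[:-1] - xs[1:])))

# Average over p(x|N): each shrinkage ratio t_s = x_s/x_{s-1} ~ Beta(N, 1).
draws = np.array([
    logZ_given_x(np.cumsum(np.log(rng.beta(N, 1, size=S))))
    for _ in range(200)])
print(draws.mean(), draws.std())               # distribution over log Z
print(logZ_given_x(-np.arange(1, S + 1) / N))  # geometric-mean proxy
```

The spread of `draws` here is close to the √s*/N rule of thumb derived below, and the deterministic geometric-mean proxy lands inside it.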
MCMC approximations

The nested sampling algorithm assumes that obtaining samples from the constrained prior π(θ | L(θ) > Lₛ) is possible. This is somewhat analogous to thermodynamic integration's requirement for samples drawn from a sequence of temperature-based distributions. As seen earlier, annealed importance sampling offers a way to approximately sample from temperature-based distributions and still obtain unbiased estimates of Z. In contrast, nested sampling really needs exact samples from its intermediate distributions for its theoretical justification to be valid. Rejection sampling of the constrained distributions using π would slow down exponentially with iteration number s. In general we do not know how to sample efficiently from the constrained distributions, so, despite the theoretical difficulties, we will replace the redraw step of the algorithm with an approximate sampler using a few steps of a Markov chain. In practice this often works well, but one must remember to be careful when interpreting error bars that ignore the approximation.

We must initialize the Markov chain for each new sample somewhere. One possibility is to start at the position of the deleted point θₛ, on the contour constraint, which is independent of the other points and not far from the bulk of the required uniform distribution. However, if the Markov chain mixes slowly amongst modes, the new point starting at θₛ may be trapped in an insignificant mode. Experience suggests it is generally better to start at one of the other N−1 existing points inside the contour constraint: these are all ideally draws from the correct distribution π(θ | L(θ) > Lₛ), so represent modes fairly. Making this new point effectively independent of the point it cloned may take many Markov chain steps; how many to use is an unfortunate free parameter of MCMC-based nested sampling.

Integrating out x

To estimate quantities of interest we must average over p(x | N), as above. The mean of a distribution over log Z can be found by simple Monte Carlo estimation:
    E[log Z] = ∫ log Ẑ(x) p(x | N) dx ≈ (1/T) Σₜ log Ẑ(x⁽ᵗ⁾),    x⁽ᵗ⁾ ~ p(x | N).

This scheme is easily implemented for any expectation under p(x | N), including error bars from the variance of log Ẑ. To reduce noise in comparisons between runs, it is advisable to reuse the same samples from p(x | N), e.g. by clamping the seed used to generate them.

A simple deterministic approximation of p(x | N) is useful for understanding, and also provides fast-to-compute, low-variance estimators. Characterizing xₛ directly after s iterations requires substantial work, but notice that

    log xₛ = Σ_{s′≤s} log t⁽ˢ′⁾,    t ~ p(t | N) = N t^{N−1},

is a sum of independent terms, characterized by its mean and standard deviation: log xₛ ≈ −s/N ± √s/N. As log xₛ is a sum of independent terms, the central limit theorem suggests that this should be a good summary; the figure above shows how well this description summarizes sequences sampled from p(x | N). Using the geometric-mean path xₛ = exp(−s/N) as a proxy for the unknown x is a cheap alternative to Monte Carlo averaging over settings (see figure). We might further assume that trapezoidal estimates of integrals are dominated by a number of trapezoids found around a particular iteration s*; the corresponding uncertainty in log Ẑ will be dominated by the uncertainty in log x_{s*} ≈ −s*/N ± √s*/N. It usually takes extensive sampling to distinguish √s*/N from the true standard deviation of the posterior over log Ẑ.

Degenerate likelihoods

The progress of nested sampling is supposed to be independent of the likelihood function; the procedures that estimate quantities based on nested sampling results rely on this behavior. As an example, the geometric-mean path suggests that with N particles, nested sampling should typically get to a mass x after about N log(1/x) iterations. Now consider running nested sampling on a problem whose likelihood contains a plateau, for instance one taking a single constant value over half of the prior mass:

    L(x) = 1 for x ≤ 1/2,    L(x) = 0 otherwise.

Figure: histograms of errors in computing E_{p(x)}[log Ẑ] under three approximations for integrating over p(x); many random experiments are shown. The test system was a D-dimensional hypercube of side w, with uniform prior centered on the
origin, and a fixed log-likelihood; nested sampling used N particles for S iterations. (a) Monte Carlo estimation using T sampled trajectories; (b) a smaller number of sampled trajectories; (c) the deterministic approximation using the geometric-mean trajectory. In this example the distribution p(log Z) has a substantial width, so the errors in finding its mean in (b) and (c) are tolerable.

After drawing N particles from the prior, about half of them will have x ≤ 1/2 and identical likelihood L = 1. If the inequality Lₛ > Lₛ₋₁ is strictly enforced, all new particles must exceed the plateau value, which quickly becomes impossible: the constraint jumps past the plateau while the recorded shrinkage claims only exp(−s/N), leading to very wrong results. If instead Lₛ ≥ Lₛ₋₁ is enforced, the results will still be wrong: the recorded x keeps shrinking as exp(−s/N) while the true prior mass above the constraint is stuck at 1/2, so small x is not truly reached until far too late an iteration.

The problem is that the mapping set up earlier required a total ordering of all the dx elements of prior mass; it was assumed that the likelihood function L(θ) provided a unique key for this sort operation. In many applications this will be sufficient: as long as no two likelihood evaluations in the algorithm are numerically identical, there is no problem. But degenerate likelihoods like the one above require special treatment. Completely flat plateaus in likelihood functions of continuous variables are rare, but when θ is discrete, finite elements of prior mass carry the same likelihood; this will always introduce plateaus in L(x).

Solving the degeneracy problem requires imposing a total ordering on the dx elements. Earlier presentations of nested sampling (Skilling) suggest introducing an auxiliary labeling of the states that resolves likelihood ties. It was suggested these labels could be chosen arbitrarily: random samples from Uniform[0,1] suffice, as would a cryptographic identification key derived from θ, or almost anything else. However, a fixed key does not solve the problem with discrete parameters: if the plateau in the example above originated from a single discrete setting, then all the dx elements in this range would receive the same cryptographic key, and likelihood ties would still be
unresolved. If random labels are employed, and always regenerated on repeated observations of the same θ, then nested sampling will give the correct results; this is how the code Skilling has made available (http://www.inference.phy.cam.ac.uk/bayesys/) is implemented. Skilling mentions a better way of thinking about the random labels: instead of an extra piece of information attached to each parameter setting, u is an extra input variable for an extended problem, with a likelihood that depends on both θ and u. The distinction may seem fine — on large problems the same location will rarely be revisited — but getting this right is important for obtaining correct answers on small test problems.

The joint-distribution view also allows a rejectionless algorithm for the Potts example considered later, which was tersely introduced in Murray et al. (b). The auxiliary joint distribution is over the variables of interest θ and an independent variable u,

    p(θ, u) = π(θ) p(u) = π(θ) p(u) L̃(θ, u) / L̃(θ, u),

where the joint likelihood combines L with the label, for instance L̃(θ, u) = L(θ) exp(εu) with ε chosen so that the label's contribution to log L̃ is smaller than the smallest difference in log L allowed by machine precision. Standard nested sampling is now possible: at each iteration L̃ must increase, and the choice of ε ensures the auxiliary variable only matters when likelihoods are exactly tied, L(θ) = L(θₛ). The constrained-prior distribution becomes

    π̃(θ, u) ∝ { π(θ),           L(θ) > L(θₛ),
              { π(θ),           L(θ) = L(θₛ) and u > uₛ,
              { 0,              otherwise.

MCMC-based implementations of nested sampling require transition operators that leave this distribution stationary. The simplest method is to propose θ′ with an operator that leaves the original constrained prior stationary, generate u′ ~ Uniform[0,1], and reject the proposal if the pair does not exceed the current key. A rejectionless method would be preferable, and is sometimes possible by marginalizing out u:

    m(θ) ∝ { π(θ),              L(θ) > L(θₛ),
           { π(θ)(1 − uₛ),       L(θ) = L(θₛ),
           { 0,                 otherwise.

If the operator used for θ was slice sampling, then m can also be slice-sampled directly. Similarly, a Gibbs sampler could easily be adapted to use the conditionals of m instead of π. If Metropolis–Hastings proposals are used, it will sometimes be possible to reweight the proposal distribution q to reduce or eliminate rejections; an example of such a transition operator is developed in a later subsection.
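The random-label scheme amounts to sorting on the pair (L(θ), u), with u regenerated on every visit — the ε → 0 limit of the joint likelihood above. The toy below is my own two-state example (Z = 0.9·1 + 0.1·10 = 1.9, rejection sampling only viable because the problem is tiny); it shows strict-inequality nested sampling failing on the plateau while the labeled version behaves:

```python
import math, random
random.seed(3)

# Discrete toy: p(theta=1) = 0.1 with L = 10, else L = 1, so Z = 1.9.
prior = lambda: 1 if random.random() < 0.1 else 0
logL = lambda th: math.log(10.0) if th == 1 else 0.0

def run(n=20, steps=150, tie_break=True):
    def draw(key_star):
        for _ in range(100_000):     # rejection sampling; toy problems only
            th, u = prior(), random.random()
            key = (logL(th), u if tie_break else 0.0)
            if key > key_star:
                return th, key
        return None                  # nothing strictly above the plateau
    pts = [draw((-math.inf, 0.0)) for _ in range(n)]
    Z, x = 0.0, 1.0
    for s in range(1, steps + 1):
        worst = min(range(n), key=lambda i: pts[i][1])
        x_new = math.exp(-s / n)
        Z += math.exp(pts[worst][1][0]) * (x - x_new)
        x = x_new
        new = draw(pts[worst][1])
        if new is None:              # strict inequality stalls on a plateau
            break
        pts[worst] = new
    return Z

print(run(tie_break=True))   # near the true Z = 1.9
print(run(tie_break=False))  # badly wrong: x bookkeeping outruns the truth
```

Without labels, the replacement draws jump straight past the L = 1 plateau, so the algorithm credits the high-likelihood state with far too little prior mass and then stalls.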
After generating θₛ₊₁, the auxiliary variable uₛ₊₁ can be drawn from its conditional distribution,

    uₛ₊₁ ~ { Uniform[uₛ, 1],   L(θₛ₊₁) = L(θₛ),
           { Uniform[0, 1],    L(θₛ₊₁) > L(θₛ).

As a minor refinement, this could be done lazily, if and when it is required by the next iteration.

Efficiency of the algorithms

This section attempts to compare the computational efficiency of three significant algorithms for estimating normalizing constants. General comparisons of Monte Carlo algorithms are hard to make, as they are designed for use with problems where ground truth is unavailable. Also, some algorithms are targeted at a particular class of problem, making performance comparisons on other classes unfair.

All three normalizing-constant methods considered below are designed to deal with problems in which the base measure π is much more diffuse than the compact target distribution p_T. We look at the scaling of computational cost with the nature of this disparity. The behavior of a multivariate Gaussian toy problem is worked out for each algorithm: the base distribution π is vaguely distributed with unit precision, and the target distribution p_T has higher precision τ,

    π(θ) ∝ exp(−θᵀθ/2),    p_T(θ) ∝ π(θ) L(θ),    L(θ) = exp(−(τ−1) θᵀθ/2),    θ ∈ R^D,

so that p_T is Gaussian with precision τ. We pretend that the normalization constants of the Gaussians are unknown, and require all of the algorithms under study to return an approximation to the log of the ratio, log(Z_T/Z_π) = −(D/2) log τ. Many continuous distributions have locally Gaussian peaks, so poor performance on the toy problem would be a general cause for concern; we also make some general remarks relevant to non-Gaussian modes. Despite this, the theoretical comparison presented here is necessarily limited: unreasonable assumptions are made regarding the algorithms' ability to draw samples from intermediate distributions, and obviously this doesn't test the ability of the algorithms to find and then fairly represent isolated modes. That will be probed in the empirical comparisons in the
sections that follow. For now, the analysis is confined to behavior that does not require details of the particular Markov chains used in implementations.

Nested sampling

The scaling behavior of nested sampling can be found without detailed calculation. The algorithm is characterized by an exponential shrinkage of the prior mass under consideration: if the bulk of the posterior mass is found around x*, then an N-particle nested sampling run reaches the dominant terms in the integral after s* ≈ N log(1/x*) iterations. As explained above, the uncertainty in log Ẑ is typically √s*/N, so obtaining a unit uncertainty in log Ẑ requires setting the number of particles to N ≈ √s*. For this choice, the number of iterations required to reach the target distribution's mass from the prior is s* ≈ (log(1/x*))².

Most of a high-dimensional Gaussian's mass is within a hyperspherical shell at a radius of about √(D/τ). This means that the bulk of the target distribution is in a shell enclosing a fraction of roughly τ^{−D/2} of the prior volume. If the prior were uniform, we could immediately say that the bulk of the target distribution is found around x* ≈ τ^{−D/2}; it turns out that this is a good enough description for Gaussians too: log(1/x*) ≈ (D/2) log τ. Therefore the computational cost of nested sampling on the toy problem is

    number of samples for unit error:  s* ≈ ((D/2) log τ)².

The O(D²) scaling revealed by this analysis is somewhat disappointing. If one knew that each dimension were independent, one could estimate the log-normalizer of each Gaussian separately; making the D estimates each have variance 1/D would require about D((log τ)/2)² samples each, but samples of just one variable cost only O(1) computation, which gives a total cost of O(D²) rather than O(D³). The reason to resort to Monte Carlo is that real computations do not decompose this way — but we shall see that, without detailed knowledge of the distribution, annealing needs only O(D) samples (with the same (log τ)² factor). The advantages of nested sampling do come at
a cost. (More carefully: according to the cumulative distribution function, the log cumulative prior mass inside a radius √(D/τ) is −log P(χ²_D ≤ D/τ), which numerically matches the (D/2) log τ scaling to within a small D-dependent factor over a large range of τ and D.)

Multicanonical sampling

As there are many possible algorithms for constructing an approximate multicanonical ensemble, a general comparison to the method is hard. Instead we generously assume that we have obtained, by chance, the ideal function that reweights the target distribution such that the distribution over energy is uniform. This avoids having to consider the details of histogram-building methods or other energy-density approximations. The danger is that this section's results may be severely misleading if the dominant computational cost is actually obtaining the reweighting; the analysis here should be seen as a lower bound on the costs involved, and could be useful for indicating which method is best for refining initial results, however they were obtained.

By definition, the multicanonical ensemble has a uniform marginal distribution over energy. To be defined, this distribution must be confined to some finite range of width H, say, so p_mc(E) = 1/H. This ensemble could be used to compute the normalizing constant of a simple target distribution p_S which also has a uniform energy distribution, but over an energy range with smaller width h < H. The importance weights are then a constant with probability h/H and zero otherwise. This distribution gives Var[log Ẑ_S] ≈ H/(Sh) after S effective samples. Empirically, the estimator still exhibits quite heavy tails after the S ≈ H/h samples predicted for unit error, but the expression gives good error bars for larger S. Finding effective widths corresponding to H and h provides a rule-of-thumb estimate of the multicanonical method's variance on more complex distributions.

We now derive properties of the multicanonical ensemble for the Gaussian problem.
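The Var[log Ẑ] ≈ H/(Sh) rule of thumb is easy to verify in the uniform caricature. A sketch (the specific widths are my choice):

```python
import numpy as np
rng = np.random.default_rng(1)

# Multicanonical ensemble: energy uniform on [0, H].  Target p_S: energy
# uniform on [0, h].  Importance weights are H/h inside the target's range
# and 0 outside, so the estimator of Z_S (true value 1 here) should have
# var[log Z_hat] ~ H / (S h) after S samples.
H, h, S = 50.0, 2.0, 10_000
ests = []
for _ in range(200):
    e = rng.uniform(0.0, H, size=S)
    w = np.where(e < h, H / h, 0.0)
    ests.append(np.log(np.mean(w)))
print(np.var(ests), H / (S * h))   # empirical variance vs rule of thumb
```

The two printed numbers agree to within sampling noise, as the text asserts for S well beyond H/h.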
Under the target multivariate Gaussian, the energy E = −log L is proportional to the square radius r = θᵀθ, which under the prior follows a scaled χ² distribution with D degrees of freedom:

    p_π(r) ∝ r^{D/2−1} exp(−r/2).

Weighting by the reciprocal of this density makes the distribution over r — and hence over the energy — uniform, giving the multicanonical ensemble:

    p_mc(θ) ∝ π(θ) w_mc(r(θ)),    w_mc(r) ∝ r^{1−D/2} exp(r/2),    r ≤ r_H.

A maximum energy, equivalently a maximum square radius r_H, must be imposed for the distribution to be defined. A reasonable choice for the square radius defining this limit is somewhat larger than the prior's mean value of D; this cutoff excludes little of the prior mass from the ensemble.

The target distribution is softly confined to a narrow range of the energy spectrum, with an effective width h_eff likely to be related to the standard deviation of its energy, of order √D. While the exact variance of the estimator does not have a simple analytical form, a more detailed analysis shows that the H/(S h_eff) guess is basically correct:

    Var[log Ẑ_T] ≈ τ√D / S    (up to constant factors).

Astoundingly, the number of effective samples required to obtain a unit error scales only as O(√D). This is better than is possible by a method that exploited independence and estimated the normalizer of each of the D Gaussians separately; perhaps this is a sign that, through the weighting function, the method has unreasonable a priori knowledge of the answer. Despite the ideal weighting function, the scaling with the precision ratio τ is not impressive: as the target distribution's precision grows, the multicanonical sampler becomes exponentially worse than nested sampling, whose cost grows only as (log τ)². This is a clear indication that the energy-spectrum ratio H/h_eff can be quite different from the volume ratio that determines nested sampling's performance; the details of a particular target distribution can make either method dramatically better than the other.

Another concern is the cost of each of the S effective samples. The ensemble p_mc is a distribution capturing a large energy range, which will usually take many steps of MCMC exploration to reach equilibrium. In contrast,
the interm ediate distributions of nested sampling or annealing were designed to be close s o a small number of mcmc steps may be sufficient to equilibrate states at each i teration the logprobability of the variance of the two parts equation of the mul ticanonical estimator equation can be approximated as follows e w varw varlog s ew s ew the ratio of the target distribution equation and multicanonical distrib ution equation give importance weights w rd exp r noting the integrals z w r dr d d d z and w r dr d d and that for suitably large h these give approximately rh times the required expectations we have varlog rh d rh p sd d s d stirlings app rox to both functions substituting in rh assuming large d and using the approxim ate independence of the separate log estimators when t gives the variance of the final estimator as r h i t d var log zt s the version in the main text drops fu rther terms for clarity this footnote is terse because what matters is the simpl er hheff description in the main text and the following assertion empirically th e expression above reasonably matches the observed estimator variance for t in a few or more dimensions efficiency of the algorithms pmc has a range of emc d lo g t between the typical sets of the target and prior distributions if o iteratio ns of mcmc are required to change this energy by then it will take at least oemc iterations to equilibrate the multicanonical ensemble by a random walk this wou ld make the cost of a obtaining a single effective sample from the multicanonica l ensemble the same order as the total cost of an entire nested sampling run the potentially good performance of the multicanonical method can be wiped out by t he use of some standard markov chains importance sampling we now analyze the per formance of combining a bridge of k importance sampling estimates as in section for the toy gaussian problem all the intermediate distributions are also gaussia n pk lk zk k exp zk d d d k t this allows 
computation of the variance of the estimator under the ideal case of direct sampling from each p_k. The log importance weights at each level are

    log w_k = (β_{k+1} − β_k) log L(θ) = −(β_{k+1} − β_k)(τ−1) θᵀθ/2,    θ ~ p_k.

The variance of the log weights under p_k is

    Var_{p_k}[log w_k] = ((β_{k+1} − β_k)(τ−1)/(2λ_k))² · 2D = (D/2)(λ_{k+1}/λ_k − 1)²,

so the variance of log w_k depends on k only through the ratio λ_{k+1}/λ_k. The variance of log Ẑ = Σ_k log w_k is minimized by equalizing all the contributing variances, i.e. by a schedule geometric in the precisions λ_k. The resulting estimator has variance

    Var[log Ẑ] ≈ D (log τ)² / (2K),

with the large-K approximation becoming accurate very quickly. The amount of computation K required for unit variance therefore scales with the dimensionality of the problem as O(D):

    number of samples for unit error:  K ≈ (D/2)(log τ)².

Comparing to the nested sampling result, we see that the dependence of the computational cost on τ is the same, while the scaling with dimensionality is better. This result is not strongly tied to the tail behavior of a Gaussian: redoing the calculations for distributions with other tail behaviors changes the number of samples required only slightly.

Before hastily discarding nested sampling, remember the caveats at the beginning of this section: the practical realities of Markov-chain-based samplers will make performance worse than predicted here. The Potts model will be an example of a distribution which, while easily sampled by multicanonical and nested sampling, is totally unapproachable with temperature-based annealing methods. Also, as with multicanonical sampling, we were very generous to annealing by assuming that the free parameters defining the annealing schedule were chosen optimally: we were unlikely to guess the schedule geometric in λ_k. Indeed, the optimal schedule balances variances which are intimately related to Z itself; we cannot know the optimal schedule without already having solved the problem. Some work is required to find a good schedule, and default choices may not be sufficient. For example, the linear annealing schedule β_k = k/K gives a Var[log Ẑ] that grows like Dτ/K for large τ.
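These variance claims can be checked directly, since every intermediate distribution in the toy problem can be sampled exactly (idealized AIS with no Markov chain; NumPy; all constants are my choices):

```python
import numpy as np
rng = np.random.default_rng(0)

D, tau = 5, 1e4
true_logZ = -0.5 * D * np.log(tau)   # log(Z_T / Z_pi) for the Gaussian toy

def ais_logw(betas):
    # One idealized AIS run: exact samples from each p_k, an isotropic
    # Gaussian with per-dimension precision 1 + beta*(tau - 1).
    logw = 0.0
    for b0, b1 in zip(betas[:-1], betas[1:]):
        prec = 1.0 + b0 * (tau - 1.0)
        th = rng.normal(0.0, prec ** -0.5, size=D)
        logw += (b1 - b0) * (-0.5 * (tau - 1.0)) * np.sum(th ** 2)
    return logw

K = 500
geo = np.concatenate(([0.0], np.geomspace(1e-6, 1.0, K)))  # ~geometric in precision
lin = np.linspace(0.0, 1.0, K + 1)
for name, betas in [("geometric", geo), ("linear", lin)]:
    runs = np.array([ais_logw(betas) for _ in range(100)])
    print(name, runs.mean() - true_logZ, runs.std())
```

The geometric schedule's log weights scatter by about √(D(log τ)²/(2K)) ≈ 1 here, while the linear schedule's are biased tens of nats low — the Dτ/K blow-up in miniature.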
This is exponentially worse (as a function of log τ) than the correct schedule, and than nested sampling.

Constructing annealing schedules

Various families of annealing schedule have been used in the literature. While demonstrating AIS on some example distributions, Neal spliced together a linear schedule at high temperatures with a geometric schedule at lower temperatures; this is not too far from the behavior of the optimal schedule above, found by a large-K integral approximation to the discrete sum (and also checked numerically) for some well-behaved posteriors. Beal suggested a family of nonlinear annealing schedules controlled by a linearity parameter e, which makes the schedule approximately linear for large values of e, but dwells for longer at high temperatures for small positive values. Another ad hoc nonlinear scheme is given by Kuss and Rasmussen, where a linear schedule is raised to the power of four, β_k = (k/K)⁴.

Each of the above schemes was presumably the result of some experimentation. Preliminary runs are required to check that a proposed schedule (e.g. the fourth power of a linear schedule) is adequate, and to eliminate others. Also, any free parameters must be found, such as the number of levels, the linearity setting e, or the changepoint between a linear and a geometric schedule. Algorithms for these fitting procedures could be — although rarely are — described in detail. Part of the difficulty is deciding how to include results from earlier, higher-variance runs; often these are only run informally and simply discarded.

Regardless of how an annealing schedule is chosen, the selection should ideally be performed automatically; otherwise computer time may become irrelevant compared to the time required by manual adjustments. An annealing schedule should control the variance of the log weights,

    Var_{β_k}[log w_k] = (β_{k+1} − β_k)² Var_{β_k}[log L].

Rearranging gives a recurrence relationship for the inverse temperatures,

    β_{k+1} = β_k + sqrt( v / Var_{β_k}[log L] ),

which achieves a given target variance v = Var[log w] on each weight.
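The recurrence is tiny to implement. A sketch using the analytic Var_β[log L] of the Gaussian toy problem — in practice this variance is not available and must come from preliminary runs or another sampler:

```python
import math

def build_schedule(var_logL, v, beta_max=1.0):
    """Schedule from beta_{k+1} = beta_k + sqrt(v / Var_{beta_k}[log L]),
    giving per-level log-weight variance ~v."""
    betas = [0.0]
    while betas[-1] < beta_max:
        step = math.sqrt(v / var_logL(betas[-1]))
        betas.append(min(beta_max, betas[-1] + step))
    return betas

# Analytic variance for the Gaussian toy: log L = -(tau-1)|theta|^2/2 and
# |theta|^2 ~ chi^2_D / lambda under p_beta, so Var[log L] = D(tau-1)^2/(2 lambda^2).
D, tau = 10, 1e4
var_logL = lambda b: D * (tau - 1.0) ** 2 / (2.0 * (1.0 + b * (tau - 1.0)) ** 2)

betas = build_schedule(var_logL, v=0.01)
print(len(betas))   # roughly log(tau) * sqrt(D / (2 v)) levels
```

On this problem the recurrence automatically reproduces the geometric-in-precision schedule that the variance analysis says is optimal.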
This update rule is applied from β = 0 until an inverse temperature of 1 is reached; the variance of the resulting log Ẑ estimator is ≈ Kv. A binary search on the control parameter v can find a balanced annealing schedule with a user-chosen total variance or computational cost.

Implementing this algorithm for constructing annealing schedules requires the variance of the log-likelihood (or energy) at each β considered. These quantities are not available exactly — otherwise log Z would be known — so the variances must be approximated from some preliminary experiments. Running MCMC at each β considered would be too costly; one could try to interpolate results measured at a preliminary range of temperatures. Alternatively, statistics at any temperature are available from a single run of either multicanonical or nested sampling.

We propose using nested sampling to set a schedule for AIS; the details are given in the algorithm below. Sampling from the multicanonical ensemble would need a larger set of control parameters to be set, whereas nested sampling can be run before annealing as it has only two control parameters: N, and the amount of effort to put into sampling at each iteration. Annealed importance sampling can then be run with a schedule estimated from nested sampling. Comparing the annealing results to those predicted by nested sampling could uncover problems with Markov chains that would go unnoticed using a single method.

Algorithm (construction of an annealing schedule from nested sampling):
    inputs: total target variance v_target, number of nested-sampling particles N, numerical tolerance tol
    run nested sampling, obtaining {Lₛ}; assuming xₛ = exp(−s/N), compute the quadrature weights wₛ
    v_min ← 0;  v_max ← v_target;  v_trial ← (v_max + v_min)/2;  k ← 1;  β₁ ← 0
    while v_max − v_min > tol:
        create a discrete proxy for the p_{β_k} distribution:  p_k(θₛ) ∝ wₛ (Lₛ)^{β_k}
        β_{k+1} ← β_k + sqrt( v_trial / Var_{p_k}[log L] )
        if β_{k+1} ≥ 1 then
            (passed the end of the schedule: start again with a higher variance per level)
            v_min ← v_trial;  v_trial ← (v_max + v_min)/2;  k ← 1
        else
        if k · v_trial > v_target then
            (exceeded the target variance: start again with a lower variance per level)
            v_max ← v_trial;  v_trial ← (v_max + v_min)/2;  k ← 1
        else
            k ← k + 1
        end if
        end if
    end while
    return annealing schedule {β_k}

Markov chains for normalizing constants

Each of the Monte Carlo algorithms in this chapter requires sampling from complex distributions — the constrained priors, the p_k, or p_mc. Standard sampling techniques (Metropolis–Hastings, slice sampling, etc.) should apply, but there are some issues special to these algorithms that are worth considering.

Randomize operator orderings

Many MCMC operators concatenate several operators with different behaviors together; Gibbs sampling, for example, updates each dimension of a parameter vector separately. Some users prefer to randomize the order of these updates, so that the resulting mixture operator maintains detailed balance, but many use a deterministic ordering, if only for convenience of implementation. This is not a good idea with algorithms that start out of equilibrium at each iteration.

The problem is most easily demonstrated by nested sampling with N = 1. At each iteration the only particle is, by definition, on the boundary of the constrained prior: the first update must increase the likelihood of the particle, while subsequent updates have some freedom to decrease the likelihood again. As only a limited number of Markov chain steps can be performed at each iteration, the particle will climb unnaturally fast up the likelihood surface in the direction of the first transition operator. In the experiments reported later, nested sampling is run using a univariate slice sampler applied to each variable in a random order. Initially these experiments used a fixed ordering: the first variable to be updated would systematically become much more constrained than the last, even when by symmetry they were equivalent. Fortunately this pathology is so severe that it quickly made itself known, by causing numerical problems and crashing the slice sampling code. Experiments with spherical Gaussians confirm that annealed importance sampling suffers from a similar problem.
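The fix is one line. A sketch of a component-wise sweep (the per-coordinate `conditional_samplers` interface is hypothetical, mine) with the permutation refreshed every sweep:

```python
import random

def sweep(theta, conditional_samplers, rand_order=True):
    """One sweep of single-site updates over theta (a list, updated in place).

    With a fixed ordering, algorithms that start each iteration out of
    equilibrium (nested sampling, AIS) push the first-updated coordinate
    systematically harder than the last; shuffling each sweep removes the
    asymmetry at essentially no cost.
    """
    order = list(range(len(theta)))
    if rand_order:
        random.shuffle(order)          # fresh permutation every sweep
    for i in order:
        theta[i] = conditional_samplers[i](theta)
    return theta
```

Note the permutation must be redrawn per sweep, not fixed once at startup: a single random-but-fixed ordering has the same pathology as a deterministic one.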
Histograms of the unweighted final states show that the statistics of each dimension depend on the order in which the dimensions were updated. The AIS weights correct this bias asymptotically, but samples without these artifacts will tend to need lower-variance weights. The easiest fix is to randomly permute the Markov chain operators at each iteration.

Changes in lengthscale and energy

There is usually a dramatic difference in scale between a prior and a posterior; it is unlikely that the same Markov chain operators are appropriate for both, yet annealing has to sample from them and from all the interpolating distributions in between. Similarly, nested sampling has to sample from a sequence of distributions that shrink exponentially in volume, from the prior to the posterior and beyond. Step-size parameters in Metropolis–Hastings algorithms must be changed as the algorithm proceeds; it is also profitable to adapt the initial step size of slice sampling. Some authors set a schedule of step sizes by hand, but automatic schemes are clearly desirable. One option is to adapt based on the acceptance rate of the previous iteration, or to use parallel chains; for AIS this would require giving up some theoretical correctness, while nested sampling is already in an approximate setting by using a Markov chain at all. Another option for AIS is to adapt the schedule of proposals after each run: each run is unbiased in Z using any step size, but when adapting it is still advisable to discard early runs, which will have higher variance.

There is usually a large change in the scale of probabilities involved between a diffuse prior and a posterior, which concentrates mass on a much smaller number of states. Many standard Markov chain operators, such as simple Metropolis and slice sampling, are only able to make small changes in log-probability at each iteration. Depending on the algorithm, this may or may not be
Depending on the algorithm, this may or may not be desirable. The majority of the volume in a high-dimensional solid body is in a thin shell near its surface (MacKay). For nested sampling this means that much of the constrained distribution's mass is likely to be close to the likelihood contour surface, and large changes in likelihood are not required; instead we need efficient chains that sample well at close to constant likelihood. Temperature-based distributions have soft constraints that lead to broader distributions over energy, although in many problems they are still constrained to a relatively narrow range. The multicanonical method samples from a single distribution which has significant overlap with both the prior and the posterior. Making the distribution over energies uniform requires making some states much more probable than others under p_mc. Simple Metropolis methods are unable to move rapidly between regions of many states with low probability and more compact regions with high probability, so the exploration of p_mc's energy spectrum will be characterized by a random walk or a slower process. This suggests that equilibrating p_mc will need at least E_mc^2 steps, where E_mc is the range of log-probabilities under p_mc, not under the original distribution.

Some Monte Carlo algorithms, such as hybrid Monte Carlo, are able to make larger changes in energy. Hamiltonian dynamics based on p_mc could be simulated as long as the reweighting function is smooth. Nested sampling could also benefit from Hamiltonian dynamics for its large movements in state space, although it is not compatible with large changes in energy. Fortunately, versions of slice sampling that can use Hamiltonian dynamics on the prior, and reflections from constraint boundaries, have already been developed (Neal). Another important Markov chain operator for dramatic moves in state space and energy is Swendsen-Wang. While this algorithm can work at any temperature, it does not allow reweightings of the energy, and is not easily modified to sample at near-constant energy. By recasting the problem we
can develop a version of Swendsen-Wang that will work with multicanonical and nested sampling.

A new version of Swendsen-Wang

The partition functions of the Potts model, the random cluster model and the FKSW joint distribution are identical. Also, a sample from any of these distributions is easily converted into a sample from one of the others. (For literature searches it is helpful to know that in physics a constant-energy distribution is known as a microcanonical ensemble.) This allows using any of the distributions to simulate the Potts model and find its normalization Z_P(J, q). We focus on the random cluster model, assuming identical positive couplings J on each edge. We rewrite the random cluster distribution in an unconventional way:

    P(D) = (1/Z_P(J, q)) exp(-J|E|) q^C(D) exp(|D| log(e^J - 1)),

where |E| is the number of edges in the graph. Under this factorization the "energy" is minus the total number of bonds, -|D|, and the "inverse temperature" of the bonds, beta_D = log(e^J - 1), is set by the coupling parameter J. C(D) is the number of connected components, or clusters, formed by the bonds D.

The algorithm below gives an MCMC operator to update the bond configuration D -> D'. The stationary distribution is a weighted prior proportional to w_mc(|D|) q^C(D), where w_mc could be multicanonical weights, could be set to w_mc = 1 for prior sampling, or could be set to zero and one to sample from the prior subject to constraints on |D|.

Algorithm: Swendsen-Wang for weighted bond configurations
1. Create a random coloring s, drawn uniformly from the q^C(D) colorings satisfying the bond constraints D, as in the Swendsen-Wang algorithm.
2. Count the sites that allow bonds: e_s = #{(i, j) in E : s_i = s_j}.
3. Draw |D'| from t(|D'| | e_s) = w_mc(|D'|) (e_s choose |D'|) / Z_t(s).
4. Throw away the old bonds D, and pick uniformly from one of the (e_s choose |D'|) ways of setting |D'| bonds in the e_s available sites.

The probability of proposing a particular coloring and new setting of the bonds is

    T(s, D' | D) = T(D' | s) T(s | D) = [t(|D'| | e_s) / (e_s choose |D'|)] [1 / q^C(D)] = w_mc(|D'|) / (Z_t(s) q^C(D)).

Summing over all possible intermediate colorings gives the total transition probability T(D' | D).
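The update above can be sketched in code as follows (all function names are our own; a small union-find stands in for an optimized cluster finder, and `w_mc` is passed in as an arbitrary weighting function):

```python
import numpy as np
from math import lgamma

def weighted_sw_step(bonds, edges, n_sites, q, w_mc, rng):
    """One weighted Swendsen-Wang bond update: the stationary
    distribution is pi(D) proportional to w_mc(|D|) * q^C(D).
    `bonds` is a boolean array over `edges` (pairs of site indices)."""
    # 1. colour each cluster of the current bond graph uniformly at random
    parent = list(range(n_sites))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for (i, j), b in zip(edges, bonds):
        if b:
            parent[find(i)] = find(j)
    colour_of_root = {}
    s = np.empty(n_sites, dtype=int)
    for i in range(n_sites):
        r = find(i)
        if r not in colour_of_root:
            colour_of_root[r] = rng.integers(q)
        s[i] = colour_of_root[r]
    # 2. edges eligible for bonds: endpoints share a colour
    eligible = [k for k, (i, j) in enumerate(edges) if s[i] == s[j]]
    e = len(eligible)
    # 3. draw the new bond count d' with t(d'|e) prop. to C(e,d') w_mc(d')
    def log_binom(n, k):
        return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    logp = np.array([log_binom(e, d) + (np.log(w_mc(d)) if w_mc(d) > 0 else -np.inf)
                     for d in range(e + 1)])
    p = np.exp(logp - logp.max())
    d_new = rng.choice(e + 1, p=p / p.sum())
    # 4. place the d' bonds uniformly among the eligible edges
    new_bonds = np.zeros(len(edges), dtype=bool)
    if d_new > 0:
        new_bonds[rng.choice(eligible, size=d_new, replace=False)] = True
    return new_bonds
```

With w_mc = 1 this is a block-Gibbs sweep on the joint coloring-and-bond distribution, so it leaves pi(D) proportional to q^C(D) invariant; constrained or multicanonical sampling only changes the `w_mc` passed in.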
The probability of starting with bonds D and ending with bonds D' is proportional to

    T(D' | D) = w_mc(|D'|) / q^C(D) * sum_s 1/Z_t(s),

where the sum runs over colorings s consistent with both D and D'. Multiplying by the weighted prior gives

    w_mc(|D|) q^C(D) T(D' | D) = w_mc(|D|) w_mc(|D'|) sum_s 1/Z_t(s).

This expression is symmetric under the exchange of D and D'; therefore the transition operator satisfies detailed balance with respect to the weighted prior. It is also ergodic. (Proof: with finite probability all the s_i are given the same color; then any D' with non-zero weight is possible; in turn, all allowable D have finite probability.)

When performing nested sampling in this weighted-prior representation, the likelihood constraints are thresholds on the total number of bonds |D|; this can be realized by setting w_mc = 0 for states with fewer bonds. Many states have identical |D|, which requires careful treatment. The simple implementation that draws a random key for each state will lead to some rejections when proposing moves to a state with the same |D| on the constraint surface; sampling without rejections can be achieved by setting the weights such that the algorithm leaves the auxiliary constrained distribution invariant.

The number of bonds has previously been identified as a useful energy-like target for reweighting (Janke and Kappler). In that work the bonds were updated by single-site Gibbs sampling rather than the block-Gibbs move of step 3; this does not allow simulation of the fixed-|D| ensemble, or rapid exploration near fixed |D|. Single-site updates are easier to implement, however, and would become more attractive on systems that allow a different J_ij on each edge, in which case the implementation of global updates is much more involved.

Experiments

Detailed comparisons of nested sampling and more established Monte Carlo techniques are not currently available in the literature. Anecdotally, nested sampling has already been useful in astronomy: Mukherjee et al. and Shaw et al. claim that nested sampling gives speedups of orders of magnitude over annealing-based methods.
The annealing approach that was cited in both papers was very carefully implemented (Beltrán et al.). These results are somewhat surprising, given that both nested sampling and annealing follow a sequence of increasingly constrained ensembles and theoretically seem quite similar; if anything, nested sampling seems slightly worse, although this should be a small effect for the astronomy problems that were tested. The focus of this section is not applications but well-understood test problems; hopefully these will give some better insight into the relative merits of the approaches.

Description of slice sampling experiments

A set of experiments was performed on some continuous distributions that are amenable to slice sampling; this allowed the same MCMC code to be used within each algorithm. The distributions tested are described first, then details of the algorithms. The results, which appear in the table below, are discussed afterwards.

Gaussian. Base and target distributions, as considered theoretically earlier, were run in ten dimensions. These experiments reveal the actual performance when confounded by interactions with a particular MCMC operator. In version A we set the standard deviation of the base distribution to be a factor tau wider than the target; in version B we made the prior wider still.

t-distribution. This was included as another simple standard distribution with different tail behavior. The target distribution was a ten-dimensional multivariate-t with five degrees of freedom; the base distribution was Gaussian.

Two modes. A mixture of two Gaussians, as tested by Neal: p_t(x) = (1/Z_t)[exp(-|x - mu_1|^2 / 2sigma_1^2) + exp(-|x - mu_2|^2 / 2sigma_2^2)], with parameters as in Neal's experiments. The base distribution was a unit six-dimensional Gaussian.

Deceptive. This two-dimensional problem bridges from a spherical Gaussian distribution to a two-dimensional mixture of Gaussians taken from Neal: p_t(x) = (1/Z_t) sum_ij w_ij exp(-|x - mu_ij|^2 / 2sigma_ij^2), where the mixture components' means fall in four groups: means in the upper-right quadrant, means in the
upper-left quadrant, means in the lower-left quadrant, and means in the lower-right quadrant. This target distribution is exceedingly challenging, and more pathological than is experienced in many statistical problems. It is interesting, however: the different spacings of the means make it hard for algorithms to know from a distance where the bulk of the probability mass is, which highlights differences between the mass-finding heuristics implicitly performed by the algorithms.

In all cases one Markov chain update consisted of a simple univariate slice sampler applied once to each variable, in a new random order at each iteration. A linear stepping-out procedure was employed, with an initial step size equal to one, or to the range of settings currently occupied by particles being simulated in parallel. Some additional choices were needed by each method.

Nest. Nested sampling runs had two free parameters: the number of particles n, and the number of slice-sampling steps used to update at each iteration. Unless n = 1, each new particle was initialized at one of the n particles already satisfying the likelihood constraint. As slice samplers always move, and there are no plateaus in these problems' likelihood functions, the plateau-handling details discussed earlier were not required. Rather than setting a number of iterations S, we terminated the nested sampler when the estimate of log Z appeared to have converged: in particular, we used a geometric approximation for X to estimate log Z, and terminated when the remaining prior mass appeared to contribute only a negligible fraction to the sum. The reported results used Monte Carlo samples of X; these are very similar to those obtained from the geometric-mean approximation, but also provide quantiles of the predictive posterior over log Z.

AIS(R, C). Annealed importance sampling has two free parameters in addition to its annealing schedule: an experiment was repeated R times, each run using C parallel chains.
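The univariate slice-sampling update shared by all of the methods can be sketched as follows (the function names are ours; this follows the standard stepping-out and shrinkage procedure, with `w` the initial step size that the population of particles is used to set):

```python
import numpy as np

def slice_sample_1d(x0, log_f, w, rng, max_steps=100):
    """One univariate slice-sampling update with linear stepping out.
    `w` is the initial bracket width; `log_f` is the log of an
    unnormalised target density."""
    log_y = log_f(x0) + np.log(rng.uniform())   # slice height under the curve
    left = x0 - w * rng.uniform()               # randomly placed initial bracket
    right = left + w
    for _ in range(max_steps):                  # step out until outside the slice
        if log_f(left) < log_y:
            break
        left -= w
    for _ in range(max_steps):
        if log_f(right) < log_y:
            break
        right += w
    while True:                                 # shrink towards x0 until accepted
        x1 = left + (right - left) * rng.uniform()
        if log_f(x1) >= log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```

Because a rejected proposal shrinks the bracket, the update always moves when the density has no plateaus, which is why the plateau-handling machinery was not needed in these experiments.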
When C > 1, the parallel chains are used to set the initial step sizes of the slice sampler. For multimodal distributions this is not strictly justified within the AIS framework, but it seems unlikely that the results will differ greatly from using preliminary runs instead. An AIS initialization listed as "nest(n, k)" estimated an annealing schedule from a nested sampling run with the given n and steps per iteration; the given value of K intermediate temperatures was needed for a target standard error epsilon. The details of this procedure, and of the linear and fourth-power schedules, were given earlier.

Multicanonical. Ten slice-sampling chains, initialized from the prior, were run in parallel on the multicanonical ensemble. Simple error bars were calculated from the variance of the estimates from each chain. On the Gaussian problem the analytically derived weighting function that sets a uniform distribution over energies could be used; we also tried setting the multicanonical weights by estimating the distribution over energies from a nested sampling run.

Discussion of slice sampling results

The experiments with a Gaussian highlight differences between the theoretical ideals discussed earlier and the realities of MCMC. Nested sampling is fastest with n = 1, which was tried as a preliminary run. The predictions from one or ten slice-sampling sweeps per iteration are imprecise, as one might expect, but also overconfident: the posterior over log Z has negligible overlap with the true answer. Increasing the amount of MCMC sampling to many slice-sampling steps per iteration overcomes the bias, but at a large computational cost. Increasing the number of particles n gives more precise answers and increases accuracy, for three reasons: each new particle can start at one of the other particles, which are supposed to be drawn from the target distribution; the more particles there are, the easier it is to forget exactly which one was copied; and the distributions change more slowly with larger n.
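The log Z estimator and termination quantities described above for the nested sampling runs can be sketched as follows (function and variable names are ours; the geometric approximation uses X_s = exp(-s/n), and the Monte Carlo variant draws the per-iteration shrinkage ratios t_s ~ Beta(n, 1) to give a posterior sample over log Z):

```python
import numpy as np

def nested_logZ(L, n, rng=None, n_mc=0):
    """Estimate log Z from an increasing sequence of nested-sampling
    likelihood values L[0] < L[1] < ... obtained with n live particles.
    With n_mc == 0 the geometric approximation X_s = exp(-s/n) is used;
    otherwise n_mc Monte Carlo draws of the shrinkage factors
    t_s ~ Beta(n, 1) give samples from a posterior over log Z."""
    L = np.asarray(L, dtype=float)
    S = len(L)
    if n_mc == 0:
        X = np.exp(-np.arange(1, S + 1) / n)
        widths = np.concatenate(([1.0], X[:-1])) - X
        return np.log(np.sum(widths * L))
    rng = rng or np.random.default_rng()
    logZs = []
    for _ in range(n_mc):
        t = rng.beta(n, 1, size=S)          # shrinkage ratio per iteration
        X = np.cumprod(t)                   # sampled prior-mass trajectory
        widths = np.concatenate(([1.0], X[:-1])) - X
        logZs.append(np.log(np.sum(widths * L)))
    return np.array(logZs)
```

In practice the run is terminated once the remaining mass bound, roughly X_S times the current maximum likelihood, contributes only a negligible fraction to the sum; the Monte Carlo draws also provide quantiles of the predictive posterior over log Z.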
Table: empirical behavior of slice-sampling-based nested sampling, AIS and multicanonical sampling. The distributions and methods are described in the main text. For each distribution (Gaussian A, Gaussian B, t-distribution, two modes, deceptive) the table lists each method (nested sampling for several n and numbers of steps per iteration; AIS initialized from nested sampling, or with linear or fourth-power schedules of length K; multicanonical with nested-sampling or analytic weighting), its initialization, the number of likelihood evaluations used, and the resulting log Z estimate alongside the truth.

The answer from n particles with one step of sampling per iteration is much better than that from n = 1 with many steps, and is also an order of magnitude cheaper. Even with only one sampling step per iteration, the estimates and error bars with larger n are indistinguishable from those obtained by exact sampling from the constrained priors.

Three standard annealed importance sampling runs were performed with the same target error. Obtaining the target error from only one run requires a very long annealing schedule K; this gave similar results to shorter runs, but error bars are much more easily obtained from multiple runs. Increasing the number of runs required shortening the annealing schedule further for a similar target error or computational cost: the short-K runs had larger errors than predicted, and unreliable error bars. This schedule was designed assuming that the algorithm would keep close to the equilibrium distributions defined in the annealing schedule, and in this case there were not enough bridging distributions for that approximation to be accurate. The result which uses ten parallel chains is similar to the
result from separate runs, but with many fewer likelihood evaluations. In the absence of a population of points, the slice-sampling code used an initial step size of one, which is inappropriately small for this problem at high temperatures; a user not prepared to adapt step sizes based on a population should find some way to set them appropriately. The sequence of three AIS runs with long schedules is deliberately at higher precision than necessary, so that the error bars are somewhat reliable and reproducible, something that is sadly not true of the lower-precision AIS runs. These confirm that a linear annealing schedule is worse than one that dwells for longer at higher temperatures; in this case the fourth-power schedule quite closely follows that set by nested sampling, and has very similar performance.

The first multicanonical result, with weights set by nested sampling, seems to give comparable uncertainty to AIS for the amount of computer effort. However, running the chains for ten times longer reveals problems with this multicanonical estimator. We had set the weights by estimating the probability mass between each pair of likelihood values L_s and L_{s+1} visited by nested sampling; for simplicity we used the deterministic approximation, which gives noisy estimates of the ideal multicanonical weights, and it turns out these are not good enough. We could attempt to smooth the approximate multicanonical weighting function; instead, for the Gaussian case, we went directly to the correct multicanonical ensemble. This confirmed that the problem with the estimator was the choice of weighting function, and gives multicanonical a performance somewhere in between nested sampling and AIS on the Gaussian A problem. Even with the ideal weighting, multicanonical fails to equilibrate and fails to give reasonable error bars on the Gaussian B problem; this is to be expected given the method's poor scaling with the precision ratio tau.
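The idea of setting multicanonical weights from an estimated energy distribution can be sketched generically (a histogram-based stand-in, with names of our own choosing, for the nested-sampling-derived or analytic weights used in the experiments):

```python
import numpy as np

def multicanonical_weights(E_samples, bins):
    """Estimate multicanonical log-weights from samples of the energy:
    w(E) proportional to 1 / p_hat(E), so that the reweighted ensemble
    p_mc(x) proportional to p(x) w(E(x)) is roughly flat over the binned
    energy range."""
    hist, edges = np.histogram(E_samples, bins=bins, density=True)
    hist = np.maximum(hist, hist[hist > 0].min())  # guard any empty bins
    log_w = -np.log(hist)
    return edges, log_w - log_w.max()              # normalise to max 0

def log_weight(E, edges, log_w):
    """Piecewise-constant log-weight lookup for an energy value E."""
    k = int(np.clip(np.searchsorted(edges, E) - 1, 0, len(log_w) - 1))
    return log_w[k]
```

A noisy p_hat gives noisy weights, which is exactly the failure mode observed above: the ensemble is only approximately flat, and the chains can stall in the mis-weighted energy ranges; in serious use the weights would be adapted further rather than fixed from a single preliminary estimate.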
We were able to confirm this because the correct ensemble is quite tractable for this toy problem. We did not continue to try multicanonical: adapting the weighting function beyond a nested-sampling initialization seems important, and this would complicate the comparison.

Turning to the t-distribution example, we see behavior different from the Gaussian case. The annealing schedule estimated from nested sampling is closer to linear than to a fourth power; moreover, the linear schedule, and the one produced from nested sampling, are reproducibly better than the fourth power. Notice the reversal from the Gaussian experiment. A good annealing schedule was estimated cheaply from nested sampling with small n; obtaining as good answers with nested sampling alone would be more expensive, though exactly how much is hard to tell from individual runs, because the error bars can be unreliable.

A more reliable comparison of nested sampling and AIS can be made from many repeated runs. Nested sampling was run repeatedly over a range of n, and compared to AIS using runs with schedules of varying lengths set by nested sampling. Figure (a) shows that the actual performance, measured as the mean squared error of the log Z estimates, was similar for the two methods: nested sampling is slightly more accurate at low computational costs, and is overtaken by AIS at higher precisions. Increasing the number of AIS runs at the expense of annealing-schedule length performs worse for the same computational cost; this difference goes away at high precisions, but may be a concern when many runs are required on multimodal problems. The main problem with AIS is that, at lower precisions, simple estimates and error bars based on the mean and variance of the importance weights are unreliable: figure (b) shows large biases in the estimates of log Z, and figure (c) shows that the error bars are too small on average. Jackknife estimates (not shown) are sometimes slightly better calibrated, but give very similar results. For a user, the error bars are often very important, so in this case the nested
sampling estimator is the favorable choice.

The peaks of the two-modes target distribution are highly separated at low temperatures. The fraction of AIS chains ending in the mode in the positive orthant differs from its actual probability mass of a third; through weighting, these runs converge to the correct answer once enough sufficiently long chains have been run. Further checks show that AIS also estimated the mass of each mode correctly. The same is true for nested sampling: the results from large n obtain nearly the correct relative masses of the modes and, in turn, the overall normalization. Detailed analysis still shows some signs of bias: the posterior overlap with the correct answer is smaller than is apparent from the scalar error bars in the table. Both methods find two-modes difficult because it is not feasible to sample correctly from the constrained distributions at late iterations, or from the low-temperature distributions: AIS relies on reweighting over many runs, while nested sampling requires many particles to maintain a fair representation even after many likelihood evaluations. Both nested sampling and AIS have problems.

Figure: average behavior of AIS and nested sampling over repeated runs, for a range of n or target errors. The x-axis gives computational cost in likelihood evaluations: (a) mean square error of the log Z estimate; (b) mean error, or bias, of the point estimate; (c) mean square normalized error (error divided by error-bar width), a measure of calibration.

Table: estimates of the deceptive distribution, reporting the true probability, the nested sampling estimate and the AIS estimate for each quadrant (upper-left, upper-right, lower-left, lower-right).

With the deceptive distribution, nested sampling has an error bar on log Z that is far too small. AIS's estimate is consistent with the correct answer, although, as with two-modes, with a higher standard error than targeted due to poor mixing at low
temperatures. Additional problems are revealed by looking at the probability distribution over the four quadrants containing each cluster of modes (see the table). Again, the nested sampling run is far too confident. In this case AIS's error bars are generally much better, although it is very certain that the upper-right quadrant has much less probability mass than it actually does; runs of nested sampling with smaller n show that it too can easily lose the upper-right quadrant. Nested sampling's answers are wrong according to its error bars because of the Markov chain approximation: at late iterations, with high likelihood constraints, the slice sampler is unable to equilibrate, and relies on a large number of particles to provide a good starting point. On this problem the inaccuracy of this approximation does not shrink as fast as the size of the error bars as the number of points grows. Increasing the number of slice-sampling steps per iteration does not solve the problem, because it would take an unrealistically large number to move between isolated modes. However, proposals that take all of the particles' positions into account could help: Shaw et al. use an approximate rejection sampler based on uniform samples within ellipsoidal fits to clusters of the existing particles. No problems with biases were reported using this method, although clearly problems could still occur when the ellipsoids fail to correctly capture the constrained distribution. In general, ellipsoids could be used as Metropolis proposals, as part of several and varied attempts to equilibrate a particle.

The Potts model

The Potts model was introduced earlier; it describes a class of undirected graphical models over discrete variables s. The variables take on one of q colors, and the model has a temperature-like coupling parameter J. Gibbs sampling updates of s are the simplest way to implement approximate nested sampling; we also try cluster-based updates. While physicists tend to be interested in a broader range of quantities, we
focus here on the normalization constant Z_P(J, q), where the discrete variables s need to be integrated (that is, summed) over.

Table: partition function results for Potts systems (see text for details). Methods compared: Gibbs AIS, Swendsen-Wang AIS, Gibbs nested sampling, random-cluster nested sampling, and an acceptance-ratio method, on an Ising system (q = 2) and a q-state Potts system, at the stated couplings J.

The table shows results on two example systems: an Ising model (q = 2), and a Potts model with larger q in a difficult parameter regime. We tested nested sampling and AIS, each with Gibbs sampling and with cluster-based updates. Annealed importance sampling (AIS) was run repeatedly with a geometric spacing of settings of J as the annealing schedule. Nested sampling used n particles and full-system MCMC updates to approximate each draw from the constrained prior. We also developed an acceptance-ratio method (Bennett) based on our representation of the random cluster model, which we ran extensively and which should give nearly correct results.

The Markov chains used by nested sampling were initialized at one of the n particles satisfying the current constraint. Preliminary experiments that initialized a new particle at the discarded point on the constraint surface were a failure: the Gibbs nested sampler could get stuck permanently in a local maximum of the likelihood, while the cluster method gave erroneous answers for the Ising system. This supports the earlier suggestions about initialization.

AIS performed very well on the Ising system, and can work for larger q at low coupling strengths. We took advantage of its performance in easy parameter regimes to compute the normalization needed to interpret the results from the cluster-based nested sampler. However, with a temperature-based annealing schedule, AIS was unable to give useful answers for the larger-q system close to the critical coupling evaluated; nested sampling appears to be correct within its error bars under these conditions. It is known that even the efficient Swendsen-Wang algorithm mixes slowly for Potts models with large q near critical values of J, which correspond to a first-order phase transition (Gore and Jerrum); see
the figure below. Typical Potts model states are either entirely disordered or ordered: disordered states contain a jumble of small regions with different colors (panel b of the figure), while in ordered states the system is predominantly one color (panel d). Moving between these two phases is difficult: defining a valid MCMC method that moves between distinct phases requires knowledge of the relative probability of the whole collections of states in those phases. Temperature-based annealing algorithms explore the model for a range of settings of J, and fail to capture the correct behavior near the transition, despite using Markov chains closely related to those used in AIS. Nested sampling can work in all parameter regimes.

Figure: two q-state Potts models with starting states (a) and (c) were simulated with full-system Swendsen-Wang updates at a coupling J for which both phases are typical. The corresponding results (b) and (d) are typical of all the intermediate samples: Swendsen-Wang is unable to take (a) into an ordered phase, or (c) into a disordered phase. In contrast, (e) shows an intermediate state of nested sampling, which succeeds in bridging the phases.

Panel (e) of the figure shows how nested sampling can explore a mixture of ordered and disordered phases. By moving steadily through these states, nested sampling is able to estimate the prior mass associated with each likelihood value. This behavior is not possible in algorithms that use J as a control parameter, such as AIS with a temperature-based schedule.

Discussion and conclusions

Summary. The main purpose of this chapter was to study nested sampling and its relationship with more established methods. We find that it fits into a unique position amongst Monte Carlo algorithms. Unlike the majority of annealing-based methods, nested sampling can deal with first-order phase transitions. While multicanonical sampling also solves this problem, its properties are different in almost every other respect: nested sampling does not need prior setting of weights, its scalings with dimensionality and with energy range are very different, and nested sampling follows a sequence of distributions, like annealing. In some statistical settings nested sampling has clear advantages over multicanonical; undoubtedly it will perform badly on some applications where multicanonical has already been successful. Nested sampling's theoretical scaling with dimensionality, in iterations, is worse than that of annealing; in practice the difference can be less dramatic, as bringing an ensemble to equilibrium may require fewer function evaluations than a temperature-based distribution. The slice-sampling results do support AIS's superiority, at least for higher-accuracy results, but other issues, such as ease of implementation and quality of error bars, may be more significant. Issues with annealing's error bars were blamed for an order of magnitude of extra cost by both Beltrán et al. and Shaw et al. Despite this, the relative underperformance of annealing reported by these astronomers is difficult to account for, given the theoretical and practical results of this chapter. The most likely explanation is a difference in the operators used to update the intermediate distributions: in this chapter the same, or closely related, operators were used with AIS and nested sampling, whereas in Shaw et al. nested sampling was described with its own special update rule using ellipsoids fitted to the current particles. This algorithm could easily require fewer function evaluations than a simple slice sampler or Metropolis method, as used within annealing in Beltrán et al. Although the ellipsoid algorithm was suggested by the need to sample from nested sampling's constrained distributions, there is no reason not to use the same basic code to propose moves for an annealing method; combined with an annealing schedule specified by nested sampling, AIS might perform as well as, or better than, nested sampling.

Related work

This chapter only considered a subset of
the available methods for computing normalizing constants. The focus has been on simple methods that compute the normalizing constant of a generic probability distribution; we avoided, as much as possible, assuming detailed knowledge about the target distribution and its relationships with other distributions. For users with more time, a richer set of methods is available, some of which are mentioned here.

The intermediate distributions in annealed importance sampling do not have to be temperature-based; those have been the focus here because of the simplicity and generality of raising the likelihood to a power. Any other way of bringing in the likelihood gradually can be used; adding one data point at a time may be natural in a statistical setting. A generalization, linked importance sampling (Neal), may be helpful for cases where the intermediate levels are fixed and limited in number. Annealed importance sampling can be seen as a member of a wider family of sequential Monte Carlo (SMC) methods. Some of these algorithms allow transfer of particles amongst modes, and should definitely be considered by anyone attracted to nested sampling by this property. A recent example of SMC combined with an interesting set of bridging distributions, leveraging the power of graphical models, is presented in Hamze and de Freitas.

Some situations require the computation of more than one normalizing constant. One option is to construct a distribution containing each of the models and explore it with MCMC; reversible jump MCMC (Green) makes this possible even when the models have parameter vectors of different dimensionalities. The relative probability of each model is available from the amount of time the sampler spent exploring each model, and more advanced estimators based on transition probabilities may be available. This only provides the normalizing constants up to a shared constant, although if one of the models considered is tractable this constant could be
found. The path sampling approach of Gelman and Meng suggests computing normalizing constants by integrating along a path of model hyperparameters, rather than along a single inverse-temperature parameter. If the normalizer for each setting of the hyperparameters is required, this is particularly effective: contour plots based on independent estimates will usually be very noisy compared to a path-sampling approach.

Philosophy

Various authors have noted the irony of using Monte Carlo, a frequentist procedure, for Bayesian computation (e.g. O'Hagan; Neal; Rasmussen and Ghahramani): rather than giving beliefs about quantities given the computations performed, Monte Carlo algorithms provide frequentist statistical estimators. Skilling claims that nested sampling is Bayesian; this section examines the extent to which this holds.

Our target is a posterior distribution over Z. Philosophically this is slightly tricky, because Z is a constant which we should be able to work out given the prior and likelihood functions; thus, according to any rational calculus of beliefs, as in Cox or Jaynes, the posterior should concentrate all of its mass at the true value. The problem is not a lack of available information, but of computer time to use it perfectly. The solution to this conundrum is to be careful about what we claim to know. Assume that some agent is running nested sampling on our behalf, and only reports to us L = {L_s}. If it reported more information, such as the sampled locations, we would be in an embarrassing situation: to use this information we would need a probabilistic model including those locations, in which inference would probably be hard. But throwing away knowledge is hard to deal with rationally, so we pretend it was never available. Thus the inferences for nested sampling, as in Rasmussen and Ghahramani, are rational for a fictitious observer with limited information. This observer's results will be more vague than if the locations were not ignored, but should be sensible, which is not guaranteed by general approximations to
Bayesian inference.

We first set up priors based on the knowledge that an agent is running nested sampling, which will provide the quantities L_s with associated cumulative mass values X_s. Our knowledge of nested sampling defines a problem-independent prior distribution over X = {X_s}. We should also specify a prior distribution over functions L(X). After observing L = {L_s}, we update our beliefs about the underlying cumulative values according to Bayes' rule:

    P(X | L) = P(L | X) P(X) / P(L).

It is this posterior distribution that should be used when computing posteriors over other quantities, such as Z:

    P(Z | L) = integral of P(Z | X, L) P(X | L) dX.

Note that specifying a prior over monotonic functions, P(L(X)), and computing with it, appears difficult in general; Skilling declares "I can't, so I don't". Instead the prior P(X) is used. This corresponds to a particular assumption: an improper uniform prior over likelihood functions. This cannot be avoided by claiming general ignorance: unlike the sample locations, we must be told the L_s; thus, to maintain a claim of rationality, we must be happy with this particular choice of prior, or type of ignorance about the likelihood function. In addition, and much more seriously, we must assume the agent is actually running the exact nested sampling algorithm. In fact we really know that some approximation will be involved, and, as we have seen, this can give us unreasonable beliefs, as with most probabilistic modeling, where we know a priori that a model's joint distribution does not really capture every detail of a real system.

Similar concerns must surround all such attempts to introduce Bayesian methodology into Monte Carlo algorithms for general inference problems. For example, Bayesian learning of the one-dimensional weighting function of the multicanonical method has been attempted (e.g. Smith). This should be a difficult inference problem, as the output of a sampler has a complicated dependence structure; only with incorrect assumptions, i.e. approximations, can Bayesian methods be applied. The ultimate justification
for such methods must be their empirical performance, which is sometimes very good.

Chapter: Doubly-intractable distributions

Most of this thesis has been dedicated to sampling from probability distributions where the key difficulty has been an intractable normalization. When considering a posterior over parameters θ given data y,

    p(θ|y) = p(y|θ) p(θ) / p(y),

we assumed that the joint probability in the numerator could be easily evaluated for any particular joint setting (θ, y). The previous chapter described how the important quantity p(y) can be approximated, but it is often infeasible to compute this quantity exactly. Standard MCMC methods are designed for use with these intractable distributions: Markov chain operators can be constructed by restricting consideration to a manageable subset of the state space at each step. In Metropolis-Hastings only two settings are considered, the current setting and a randomly chosen proposal; Gibbs sampling changes only one component of θ at a time. Metropolis requires an ability to evaluate p(y|θ) ratios for various pairs of parameter settings, and Gibbs sampling requires the ability to sample from the conditional distributions p(θ_i | θ_{j≠i}, y). By considering restricted parts of the state space, neither method needs to know the global normalizing constant p(y).

But what if p(y|θ), like p(y), contains a summation over a large state space, so that it cannot feasibly be evaluated pointwise? Then the problem is doubly-intractable and, as we shall see, even performing Markov chain Monte Carlo is potentially exceedingly difficult. The next section explains why doubly-intractable distributions arise and the difficulties involved. We then explore approximations of standard MCMC algorithms in the context of undirected graphical models. Møller et al. provided the first feasible algorithm for models where sampling from p(y|θ) is possible but its normalization is unknown; their method is reviewed and then generalized by us below. Working on these algorithms inspired the new exchange algorithm. These innovations were first described in Murray et al. This chapter provides a slightly more general version of the exchange algorithm with a new derivation, which is somewhat simpler than the original description; full mathematical derivations of detailed balance are provided, which were previously omitted for space reasons. We then consider further new valid MCMC algorithms for doubly-intractable distributions, which provide a connection to the approximate Bayesian computation (ABC) literature, and give slice sampling algorithms for doubly-intractable distributions. Finally, new directions for research in this area are considered.

Bayesian learning of undirected models

The Potts model discussed in the previous chapter belongs to a very wide class of energy-based models, where

    p(y|θ) = (1/Z(θ)) ∏_j exp( −E_j(y_{C_j}; θ_j) ) = (1/Z(θ)) ∏_j f_j(y_{C_j}; θ_j),
    Z(θ) = Σ_y ∏_j f_j(y_{C_j}; θ_j).

The sets C_j each index a subset of variables that take part in a corresponding potential function f_j, parameterized by θ_j. Each potential expresses mutual compatibilities amongst a subset of variables y_{C_j}. Sampling from the y variables is possible with MCMC, but computing the normalization can be very difficult. As reviewed earlier, special structure in a graphical model sometimes allows efficient computation of the normalization Z(θ); most of this chapter concerns methods that apply to general distributions, so it is convenient for clarity to collapse the model to the simpler form

    p(y|θ) = f(y; θ) / Z(θ).

We now consider sampling from the posterior over parameters when the likelihood is of this unnormalized form. The posterior

    p(θ|y) = f(y; θ) p(θ) / ( Z(θ) p(y) )

offers a new difficulty. As before, p(y) is not needed for MCMC, but the normalizing constant Z(θ) cannot be ignored, as it is a function of the parameters, the variables being sampled. Every time new parameter values are considered, it appears that an intractable computation involving Z(θ) will be required. As MCMC estimators are approximations unless an infinite number of
iterations are performed, and each iteration is generally infeasible, p(θ|y) in the equation above can be called a doubly-intractable distribution.

While sampling from parameter posteriors by MCMC is a well-established technique, it is largely associated with distributions that could be represented as directed graphical models; sampling parameters in anything but the most trivial undirected graphical model is doubly-intractable. While directed models are a more natural tool for modeling causal relationships, the soft constraints provided by undirected models have proven useful in a variety of problem domains. We briefly mention six applications. (a) In computer vision, Markov random fields (MRFs), a form of undirected model, are used to model the soft constraint a pixel or image feature imposes on nearby pixels or features (Geman and Geman); this use of MRFs grew out of a long tradition in spatial statistics (Besag). (b) In language modeling, a common form of sentence model measures a large number of features of a sentence, f_j(s), such as the presence of a word, subject-verb agreement, the output of a parser on the sentence, etc., and assigns each such feature a weight θ_j. A random field model of this is then p(s|θ) = (1/Z(θ)) exp( Σ_j θ_j f_j(s) ), where the weights can be learned via maximum likelihood iterative scaling methods (Della Pietra et al.). (c) These undirected models can be extended to coreference analysis, which deals with determining, for example, whether two items (e.g. strings, citations) refer to the same underlying object (McCallum and Wellner). (d) Undirected models have been used to model protein folding (Winther and Krogh) and the soft constraints on the configuration of protein side chains (Yanover and Weiss). (e) Semi-supervised classification is the problem of classifying a large number of unlabeled points using a small number of labeled points and some prior knowledge that nearby points have the same label; this problem can be approached by defining an undirected graphical
model over both labeled and unlabeled data (Zhu and Ghahramani). (f) Given a set of directed models p_j(y|θ_j), the products of experts idea is a simple way of defining a more powerful undirected model by multiplying them, p(y|θ) = (1/Z(θ)) ∏_j p_j(y|θ_j) (Hinton); the product assigns high probability when there is consensus among the components.

Despite the long history and wide applicability of undirected models, until recently Bayesian treatments of learning the parameters of large undirected models have been virtually non-existent. There is a related statistical literature on Bayesian inference in undirected models, log-linear models and contingency tables (Albert; Dellaportas and Forster; Dobra et al.). However, this literature, with the notable exception of the technique reviewed below, assumes that the partition function Z(θ) can be computed exactly, which is only possible in special cases (i.e. low tree-width graphs, graphical Gaussian models and small contingency tables). For many of the machine learning applications of undirected models cited above, this assumption is unreasonable. This chapter addresses Bayesian learning for models with intractable Z(θ).

Do we need Z for MCMC? The modeler's effort was put into specifying f(y; θ); it is tempting to think that Z(θ) should have little relevance and that there must be some way to side-step computing it. It was established above that the normalizer Z(θ) is not a constant in the context of sampling parameters. This section explores the role of Z(θ) in more detail, addressing the consequences for any scheme that avoids computing it. The following gives the straightforward application of the standard Metropolis-Hastings algorithm to the doubly-intractable distribution above.

    Algorithm (standard but infeasible Metropolis-Hastings, MH).
    Input: initial θ, number of iterations S.
    For s = 1 … S:
        propose θ' ~ q(θ'; θ, y);
        compute the acceptance ratio
            a = [ p(θ'|y) q(θ; θ', y) ] / [ p(θ|y) q(θ'; θ, y) ]
              = [ f(y; θ') p(θ') q(θ; θ', y) Z(θ) ] / [ f(y; θ) p(θ) q(θ'; θ, y) Z(θ') ];
        draw r ~ Uniform[0, 1]; if r < a then set θ ← θ'.
    End for.

Computing the acceptance ratio requires a ratio of normalizing constants, Z(θ)/Z(θ'), or at least a bound tight enough for the current step. This is difficult in general. There are usually some free choices while constructing MCMC algorithms; perhaps some nuisance parameters could be judiciously set to remove Z through some fortunate cancellations? We are not free to cancel out Z(θ) through a choice of prior p(θ): the prior would have to depend on the number of observed data points, and would take on extreme values dominating any inferences (Murray and Ghahramani). In theory the proposal distribution could be defined to remove explicit Z dependence from the acceptance ratio, but in practice this does not seem to help: proposals proportional to Z(θ')g(θ') or to 1/Z(θ') would be difficult to construct without knowing Z, and would be terrible proposals. The only distribution we know of that contains Z(θ) gives probabilities in data space, not over parameters. Below we review and extend a method by Møller et al., which introduced auxiliary variables taking on values in the data space. This allows proposals that cancel out the unknown terms, but only if it is possible to draw samples from the intractable distribution p(y|θ). This key insight makes it possible to sample from a limited but significant class of distributions for which MCMC was previously impossible. However, the algorithm and its extensions are not a panacea: it is not always possible to draw exact samples from the data distribution.

[Figure: (a) the dash marked y shows the position of our observation in data space; the curve shows an unnormalized function f(·; θ) which gives low probability to the alternative data set x. (b) After changing the parameter to θ', the unnormalized function evaluated at the observation has increased, f(y; θ') > f(y; θ); however, the likelihood has decreased, p(y|θ') < p(y|θ). Noticing this requires considering the new high-probability region of data space containing x, which is not necessarily close to the observation.]

Another problem, or opportunity, is that specifying a workable stationary distribution for the auxiliary variables requires approximating the
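The infeasible MH step above can be made concrete on a model small enough to enumerate Z(θ). This is an illustrative sketch, not code from the thesis: the two-spin toy model, the N(0,1) prior, and the names `f`, `Z` and `mh_step` are our own choices. The point is that Z(θ)/Z(θ') enters the acceptance ratio, and is only computable here because the state space has four configurations.

```python
import math
import random

random.seed(0)

# Toy "undirected model": two binary spins with coupling theta,
#   f(y; theta) = exp(theta * y1 * y2),  y_i in {-1, +1}.
def f(y, theta):
    return math.exp(theta * y[0] * y[1])

def Z(theta):
    # Brute-force normalizer: feasible only because the state space is tiny.
    return sum(f((y1, y2), theta) for y1 in (-1, 1) for y2 in (-1, 1))

def log_prior(theta):
    return -0.5 * theta * theta  # N(0, 1) prior on theta

def mh_step(theta, y, step=0.5):
    # Standard MH on p(theta | y); note the explicit Z(theta)/Z(prop) term,
    # which is intractable in real undirected models.
    prop = theta + random.gauss(0.0, step)
    log_a = (math.log(f(y, prop)) + log_prior(prop) + math.log(Z(theta))
             - math.log(f(y, theta)) - log_prior(theta) - math.log(Z(prop)))
    return prop if math.log(random.random()) < log_a else theta

y_obs = (1, 1)       # a single observed configuration
theta = 0.0
samples = []
for _ in range(5000):
    theta = mh_step(theta, y_obs)
    samples.append(theta)
mean = sum(samples) / len(samples)
```

Since y_obs has aligned spins, the posterior over the coupling tilts positive; the sampled mean reflects this.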
parameter posterior befor e sampling begins an optimistic train of thought sees no problem as z is just a sum and summing over unknowns is a standard feature of mcmc for example models w ith latent variables z often have intractable likelihoods but while it may be di fficult to evaluate py z py z z pyz pz we can jointly sample over p zy as long a s this distribution is known up to a constant discarding the latents gives sampl es from the correct marginal py in doublyintractable problems the likelihood con tains z the reciprocal of a sum of terms rather than a sum over latent variables but could there be some similar way of instantiating latent variables to remove the need to compute z section provides such a method with an infinite number of latent variables but how do we know there is not a better choice in general mus t all simple algorithms fail figure shows a hypothetical unnormalized probabilit y function for two settings of its parameters an obvious but important observati on is that the change in an unnormalized probability function evaluated at an ob servation does not necessarily tell us anything about the change in the likeliho od of the parameters in fact we must consider the parameters effect on the entir e observation space this is in sharp contrast to mcmc sampling of the y variable s for fixed parameters where only two settings need be considered at a time if w e are not going to compute z explicitly then a valid mcmc algorithm must have so me other source of global information that is sensitive to changes anywhere in o bservation space one of the simplest ways to get information from a probability distribution is of course through a sample a sample x f x z using the same funct ion as figure b could easily land in the new region of high probability on the l eft noticing that approximation schemes f x f x gives some indication that any p erceived benefit of f y f y should be penalized despite the apparent paucity of such information surprisingly these 
single samples at each parameter setting are sufficient to create valid MCMC algorithms for p(θ|y) that do not need Z(θ). This is how the approach of Møller et al. works, which explains why it requires samples from the target distribution drawn using an exact or "perfect" sampling method. Approximate samples from a few MCMC steps cannot guarantee considering the new "bubble" in data space in panel (b) of the figure: a Markov chain started at, e.g., the observation will be heavily biased towards a mode near the starting point, and may never consider any other modes.

Are spurious modes in data space a problem in practice? Contrastive divergence learning (Hinton) uses very brief MCMC sampling starting at the observed data, and can give useful results on complex problems in machine learning. Sometimes we know the parameter posterior is simple: it is log-concave for fully-observed exponential-family distributions. In such cases deterministic approximations to the parameter posterior perform favorably compared to pseudo-MCMC approaches (Welling and Parise). However, if we wish to construct a valid MCMC method, we must formally show that the entire data space has been properly considered. Exact sampling explicitly tracks bounds that start off considering the entire observation space; a procedure with this flavor seems an essential part of samplers for doubly-intractable distributions. Thus even though a new latent-variable approach introduced later does not use exact sampling as such, the requirements are, and must be, very similar.

Exact sampling is a formidable requirement. Deterministic approaches and MCMC methods that do not use the correct posterior distribution will always have a place; some possibilities are studied in the next section. There is also always a place for gold-standard methods that, given enough time, should give correct answers; those are the focus of the remainder of the chapter.

Approximation schemes

For concreteness, this section studies a simple but widespread type of graphical model.
The Boltzmann machine (BM) is a Markov random field which defines a probability distribution over a vector of binary variables s = (s_1, …, s_K):

    p(s|W) = (1/Z(W)) exp( Σ_{i<j} w_ij s_i s_j ),

where the symmetric weight matrix W parameterizes the distribution. In a BM there are usually also linear bias terms Σ_i b_i s_i in the exponent; with these, the model is equivalent to a Potts or Ising model with magnetic field parameters. We omit these biases to simplify notation, although the models in the experiments do include them.

The usual algorithm for learning BMs is a maximum-likelihood version of the EM algorithm, assuming some of the variables are hidden, s_h, and some observed, s_o (Ackley et al.). The gradient of the log probability is

    ∂ log p(s_o|W) / ∂w_ij = E_c[s_i s_j] − E_u[s_i s_j],

where E_c denotes expectation under the "clamped" data distribution p(s_h|s_o, W), and E_u denotes expectation under the "unclamped" distribution p(s|W). For a data set {s^(1), …, s^(N)} of i.i.d. data, the gradient of the log likelihood is simply summed over n. For Boltzmann machines with large tree-width, these expectations would take exponential time to compute, and the usual approach is to approximate them using Gibbs sampling or one of many more recent approximate inference algorithms.

Targets for MCMC approximation. Metropolis-Hastings for the parameters of a Boltzmann machine, given fully observed data, needs to compute

    a = [ p({s}|W') p(W') q(W; W', {s}) ] / [ p({s}|W) p(W) q(W'; W, {s}) ]
      = ( Z(W)/Z(W') )^N · [ p(W') q(W; W', {s}) / ( p(W) q(W'; W, {s}) ) ] · exp( Σ_{ij} (w'_ij − w_ij) Σ_n s_i^(n) s_j^(n) ).

The first class of approximation we will pursue is to substitute a deterministic approximation of Z(W)/Z(W') into the above expression. Clearly this results in an approximate sampler, which does not converge to the true equilibrium distribution over parameters; moreover, it seems reckless to take an approximate quantity to the N-th power. Despite these caveats, we explore empirically whether approaches based on this class of approximation are viable.

Note that above we need only compute the ratio of the partition function at pairs of parameter settings, Z(W')/Z(W). This ratio can be approximated directly by importance sampling:

    Z(W')/Z(W) = E_{p(s|W)}[ exp( Σ_{ij} (w'_ij − w_ij) s_i s_j ) ].

Thus any method for estimating expectations under p(s|W), sampling-based or deterministic, can be nested into the Metropolis sampler for W. For small steps W → W', estimating ratios of normalizers and finding gradients with respect to the parameters are closely related problems. Gradients may be more useful, as they provide a direction in which to move, which is exploited by algorithms based on dynamical systems such as hybrid Monte Carlo. However, hybrid Monte Carlo would also require a Z ratio for its accept/reject step, and this effort may not be justified when the gradients and Z(W) are only available as approximations. Simpler schemes that use gradient information also exist (Neal). The simplest of these is the uncorrected Langevin method: parameters are updated, without any rejections, according to the rule

    θ_i ← θ_i + (ε²/2) ∂ log p(θ|y)/∂θ_i + ε n_i,

where the n_i are independent draws from a zero-mean, unit-variance Gaussian. Intuitively this rule performs gradient descent, but explores away from the optimum through the noise term; strictly, this is only an approximation except in the limit of vanishing ε. For the above or other dynamical methods, a third target for approximation, in systems with continuous parameters, is the gradient of the joint log probability. In the case of BMs we have

    ∂ log p({s}, W)/∂w_ij = Σ_n s_i^(n) s_j^(n) − N E_{p(s|W)}[s_i s_j] + ∂ log p(W)/∂w_ij,

assuming an easy-to-differentiate prior. The main difficulty arises, as in the acceptance ratio above, from computing the middle term: the unclamped expectations over the variables. Interestingly, although many learning algorithms for undirected models (e.g. the original Boltzmann machine learning rule) are based on computing gradients such as this one, and it would be simple to plug these into approximate stochastic-dynamics MCMC methods to do Bayesian inference, this approach does not appear to have been investigated; we explore it in our experiments. This section has considered
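The importance-sampling identity for Z(W')/Z(W) can be checked numerically on a BM small enough to enumerate. A hedged sketch: the three-node weights, the enumeration-based "exact sampler", and the helper names (`energy`, `ratio_estimate`) are our own illustrative choices, not the thesis's code.

```python
import itertools
import math
import random

random.seed(1)

def energy(s, W):
    # s is a tuple of +/-1 spins; W maps pairs (i, j), i < j, to weights.
    return sum(w * s[i] * s[j] for (i, j), w in W.items())

def Z(W, n):
    # Brute-force partition function; feasible only for tiny n.
    return sum(math.exp(energy(s, W))
               for s in itertools.product((-1, 1), repeat=n))

def make_exact_sampler(W, n):
    # Exact sampling from p(s|W) by enumeration (a stand-in for a
    # perfect sampler, possible here only because n is tiny).
    states = list(itertools.product((-1, 1), repeat=n))
    weights = [math.exp(energy(s, W)) for s in states]
    return lambda: random.choices(states, weights=weights, k=1)[0]

def ratio_estimate(W, W_new, n, m=20000):
    # Monte Carlo estimate of
    #   Z(W')/Z(W) = E_{p(s|W)}[ exp(sum_ij (w'_ij - w_ij) s_i s_j) ].
    draw = make_exact_sampler(W, n)
    total = 0.0
    for _ in range(m):
        s = draw()
        total += math.exp(energy(s, W_new) - energy(s, W))
    return total / m

n = 3
W     = {(0, 1): 0.3, (1, 2): -0.2, (0, 2): 0.1}
W_new = {(0, 1): 0.4, (1, 2): -0.1, (0, 2): 0.1}
est = ratio_estimate(W, W_new, n)
true_ratio = Z(W_new, n) / Z(W, n)
```

For small perturbations of W the importance weights are close to 1, so the estimator has low variance; for large steps the same estimator degrades badly, which is one reason the text treats small-step ratios and gradients as closely related.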
two existing sampling schemes, Metropolis and Langevin, and identified three targets for approximation to make these schemes tractable: Z(W), the ratio Z(W')/Z(W), and the expectations E_{p(s|W)}[s_i s_j]. While the explicit derivations above focused on Boltzmann machines, these same expressions generalize in a straightforward way to Bayesian parameter inference in a general undirected model. In particular, many undirected models of interest can be parameterized to have potentials in the exponential family, f_j(y_{C_j}; θ_j) = exp( θ_j u_j(y_{C_j}) ); for such models the key ingredients of an approximation are the expected sufficient statistics E_{p(y|θ)}[u_j(y_{C_j})].

Approximation algorithms. In this section, approximations for each of the three target quantities above are identified. These are used to propose a variety of approximate sampling methods for doubly-intractable distributions, first outlined in Murray and Ghahramani.

Variational lower bounds were developed in statistical physics but are also widely used in machine learning. They use Jensen's inequality to lower-bound the log partition function in the following way:

    log Z = log Σ_x f(x) = log Σ_x q(x) f(x)/q(x)
          ≥ Σ_x q(x) log( f(x)/q(x) )
          = Σ_x q(x) log f(x) + H(q) ≡ F(q).

The relationship holds for any distribution q(x), provided it is not zero where p(x) has support. The second term in the bound is the entropy of the distribution, H(q) = −Σ_x q(x) log q(x); the overall bound F is often called the free energy. See Winn and Bishop for more details and a framework for automatic construction and optimization of variational bounds for a large class of graphical models.

The naïve mean-field method is a variational method with q constrained to belong to the set of fully factorized distributions, Q_MF = { q : q(x) = ∏_i q_i(x_i) }. For the Boltzmann machine, a local maximum of this lower bound, log Z_MF = max_{q ∈ Q_MF} F(q), can be found with an iterative and tractable fixed-point algorithm (see, for example, MacKay). Let the mean-field Metropolis algorithm be defined by using Z_MF in place of Z in the acceptance probability
computation above. The expectations from the naïve mean-field algorithm could also be used to compute direct approximations to the gradients, for use in a stochastic-dynamics method.

Jensen's inequality can be used to obtain much tighter bounds than those given by the naïve mean-field method. For example, when constraining q to be in the set of all tree-structured distributions, Q_tree, optimizing the lower bound on the partition function is still tractable (Wiegerinck), obtaining Z_tree ≤ Z. The tree Metropolis algorithm is defined through the use of this approximation in the acceptance probability; alternatively, expectations under the tree could be used to form the gradient estimate for a stochastic-dynamics method.

Bethe approximation. A recent justification for applying belief propagation to graphs with cycles is the relationship between this algorithm's messages and the fixed points of the Bethe free energy (Yedidia et al.). This breakthrough gave a new approximation for the partition function. In the loopy Metropolis algorithm, belief propagation is run on each proposed system, and the Bethe free energy is used to approximate the acceptance probability. Traditionally, belief propagation is used to compute marginals; pairwise marginals can be used to compute the expectations used in gradient methods, or in finding partition-function ratios. These approaches lead to different algorithms, although their approximation schemes are clearly closely related.

Langevin using brief sampling. The pairwise marginals required by the gradient and ratio estimates can be approximated by MCMC sampling; the Gibbs sampler used in one of our experiments is a popular choice, whereas elsewhere a more sophisticated Swendsen-Wang sampler is employed. Unfortunately, as in maximum-likelihood learning, the parameter-dependent variance of these estimates can hinder convergence and introduce biases. The brief Langevin algorithm, inspired by work on contrastive divergence (Hinton),
uses very brief sampling starting from the data, which gives biased but low-variance estimates of the required expectations. As the approximations in this section are run as an inner loop to the main sampler, the cheapness of brief sampling makes it an attractive option.

Langevin using exact sampling. Unbiased expectations can be obtained in some systems using an exact sampling algorithm. Although the gradients from this method are guaranteed to be unbiased, parameter-dependent variance could lead to worse performance than the proposed brief Langevin method; the variance could be reduced by reusing pseudo-random numbers. However, we shall see later that there are much more elegant ways to use an exact sampler, if one is available.

Pseudolikelihood. Replacing the likelihood of the parameters with a tractable product of conditional probabilities is a common approximation in Markov random fields for image modeling; one of the earliest Bayesian approaches to learning in large systems of which we are aware was in this context (Wang et al.; Yu and Cheng). The models used in our experiments were not well approximated by the pseudolikelihood, so it is not explored further here.

Extension to hidden variables. So far we have only considered models of the form p(y|θ) where all variables y are observed. Often models need to cope with missing data, or have variables that are always hidden; these are often the models that would most benefit from a Bayesian approach to learning the parameters. In fully observed models in the exponential family, the parameter posteriors are often relatively simple, as they are log-concave if the prior used is also log-concave. As seen in a later figure, the parameter posterior with hidden variables will be a linear combination of log-concave functions, which need not be log-concave and can be multimodal. In theory the extension to hidden variables is simple. First consider a model p(y, h|θ), where h are unobserved variables. The parameter posterior is still
proportional to p(y|θ) p(θ), and we observe

    p(y|θ) = Σ_h p(y, h|θ) = (1/Z(θ)) Σ_h ∏_j f_j(y, h_{C_j}; θ_j),
    log p(y|θ) = −log Z(θ) + log Σ_h ∏_j f_j(y, h_{C_j}; θ_j) = −log Z(θ) + log Z_y(θ).

That is, the sum in the second term is itself a partition function, Z_y(θ), for an undirected graph over the variables h: to see this, compare with the general model definition and consider the fixed observations y as parameters of the potential functions. In a system with multiple i.i.d. observations, Z_y(θ) must be computed for each setting of y. Note, however, that these additional partition-function evaluations are for systems smaller than the original. Therefore, any method that approximates Z, or related quantities, directly from the parameters can still be used for parameter learning in systems with hidden variables. The brief-sampling and pseudolikelihood approximations rely on settings of every variable provided by the data; for systems with hidden variables these methods could use settings from samples conditioned on the observed data. In some systems this sampling can be performed easily (Hinton). In our hidden-variable experiment, several steps of MCMC sampling over the hidden variables are performed in order to apply the brief Langevin method.

Experiments involving fully observed models. The approximate samplers described above were tested on three systems. The first, taken from Edwards and Havránek, lists six binary properties detailing risk factors for coronary heart disease in men. Modeling these variables as outputs of a fully-connected Boltzmann machine, we attempted to draw samples from the distribution over the unknown weights. We can compute Z exactly in this system, which allows us to compare methods against a Metropolis sampler with an exact inner loop; a previous Bayesian treatment of these data also exists (Dellaportas and Forster). While predictions wouldn't need as many samples, we performed sampling for a large number of iterations to obtain reasonable histograms for each of the weights. The mean-field, tree and loopy Metropolis methods each proposed changes to one parameter at a time, using a
zero-mean Gaussian with fixed variance; the brief Langevin method used a fixed step size. Qualitatively, the results are the same as those reported by Dellaportas and Forster: parameters deemed important by them have very little overlap with zero. The mean-field Metropolis algorithm failed to converge, producing noisy and wide histograms over an ever-increasing range of weights. The sampler with the tree-based inner loop did not always converge either, and when it did, its samples did not match those of the exact Metropolis algorithm very well.

[Figure: histograms of samples for every parameter (w_aa through w_ff) in the heart-disease risk-factor model. Results from exact Metropolis are shown in solid blue, loopy Metropolis in dashed purple, and brief Langevin in dotted red; these curves are often indistinguishable. The mean-field and tree Metropolis algorithms performed very badly; to reduce clutter these are only shown once each, in the plots for w_aa and w_ab respectively, in dash-dotted black.]

[Figure: loopy Metropolis shown dashed blue, brief Langevin solid black. Left: an example histogram, as in the previous figure, for one edge of the larger BM; the vertical line shows the true weight. Also shown are the fractions of samples f within a fixed distance of the true value, for every parameter in the two artificial systems (center and right); the parameters are sorted by f for clarity, so higher curves indicate better performance.]

The loopy Metropolis and brief Langevin methods closely match the marginal distributions predicted by the exact Metropolis algorithm for most of the weights. Results are not shown for algorithms using expectations from loopy belief propagation in the gradient-based updates, as these gave almost identical performance to loopy Metropolis based on the partition-function ratio.

Our other two test systems are larger Boltzmann machines, and demonstrate learning where exact computation of Z(W) is
intractable. We considered two randomly generated systems with different numbers of edges; each of the parameters not set to zero, including the biases, was drawn from a unit Gaussian. Experiments on an artificial system allow comparisons with the true weight matrix, and we ensured our training data were drawn from the correct distribution with an exact sampling method (Childs et al.); this level of control would not be available on a natural data set. The loopy Metropolis algorithm and the brief Langevin method were applied to data points drawn from each system; the model structure was provided, so that only non-zero parameters were learned. The figure above shows a typical histogram of parameter samples, together with the predictive ability over all parameters. Short runs on similar systems with stronger weights show that loopy Metropolis can be made to perform arbitrarily badly, more quickly than the brief Langevin method, on this class of system.

Experiment involving hidden variables. Finally, we consider an undirected model approach taken from work on semi-supervised learning in Zhu and Ghahramani. Here a graph is defined using the 2-D positions x = {(x_i, y_i)} of unlabeled and labeled data; the variables on the graph are the class labels s = {s_i} of the points. The joint model for the l labeled points and u unobserved (hidden) variables is

    p(s|x, θ) = (1/Z(θ)) exp( Σ_{i,j}^{l+u} w_ij s_i s_j ),   where   w_ij = exp( −[ (x_i − x_j)²/σ_x² + (y_i − y_j)²/σ_y² ] ).

The edge weights of the model, w_ij, are functions of the Euclidean distance between points i and j, measured with respect to scale parameters θ = (σ_x, σ_y). Nearby points wish to be classified in the same way, whereas far-away points may be approximately uncorrelated, unless linked by a bridge of points in between. (These test sets are available online at http://www.gatsby.ucl.ac.uk/….)

Figure: (a) a data set for semi-supervised learning, with two groups of classified points and unlabeled data; (b) approximate samples from
the posterior of the parameters σ_x and σ_y, obtained by an uncorrected Langevin sampler using gradients with respect to log θ, with expectations approximated by a Swendsen-Wang sampler; (c) approximate samples using loopy Metropolis.

The likelihoods in this model can be interesting functions of θ (Zhu and Ghahramani), leading to non-Gaussian and possibly multimodal parameter posteriors with any simple prior. As the likelihood is often a very flat function over some parameter regions, the MAP parameters can change dramatically with small changes in the prior; there is also the possibility that no single setting of the parameters can capture our knowledge. When performing binary classification, the model, which is a type of Potts model, can be rewritten as a standard Boltzmann machine. The edge weights w_ij are now all coupled through θ, so our sampler need only explore a two-dimensional parameter space, (σ_x, σ_y); however, little of the above theory is changed by this. We can still approximate the partition function and use it in a standard Metropolis scheme, or apply Langevin methods in which the gradients include sums over edges.

Panel (a) of the figure shows an example data set for this problem. This toy data set is designed to have an interpretable posterior over θ, and demonstrates the type of parameter uncertainty observed in real problems. We can see intuitively that we do not want σ_x or σ_y to be close to zero: this would disconnect all points in the graph, making the likelihood small, so parameters that correlate nearby points with the same label will be much more probable under a large range of sensible priors. Neither can both σ_x and σ_y be large: this would force the two labeled clusters to be close, which is also undesirable. However, one of σ_x and σ_y can be large, as long as the other stays below around one. These intuitions are closely matched by the results shown in panel (b), which shows draws from the parameter posterior using the brief Langevin method based on a Swendsen-Wang sampling inner loop described in Zhu and
Ghahramani. We also reparameterized the posterior to take gradients with respect to log θ rather than θ; this is important for any unconstrained gradient method like Langevin. Note that predictions from typical samples of θ will vary greatly: for example, large σ_x predicts the unlabeled cluster in the top left as mainly one class, whereas large σ_y predicts the other. It would not be possible to obtain the same predictive distribution over labels with a single "optimal" setting of the parameters, as was pursued in Zhu and Ghahramani; this demonstrates how Bayesian inference over the parameters of an undirected model can have a significant impact on predictions.

Panel (c) of the figure shows that loopy Metropolis converges to a very poor posterior distribution, which does not capture the long arms seen in panel (b). This is due to poor approximate partition functions from the inner loop: the graph induced by W contains many tight cycles, which cause problems for loopy belief propagation. As expected, loopy propagation gave sensible posteriors on other problems where the observed points were less dense and formed linear chains.

Discussion. Although MCMC sampling in general undirected models is intractable, there are a variety of approximate methods that can be brought forth to tackle this problem. We have proposed and explored a range of such approximations, including two variational approximations, brief sampling, and the Bethe approximation, combined with Metropolis and Langevin methods; clearly many more approximations could be explored. The mean-field and tree-based Metropolis algorithms performed disastrously, even on simple problems. We believe these failures result from the use of a lower bound as an approximation: where the lower bound is poor, the acceptance probability for leaving that parameter setting will be exceedingly low, so the sampler is often attracted towards extreme regions where the bound is loose, and does not return. The Bethe-free-energy-based Metropolis algorithm performs
considerably better, and gave the best results on one of our artificial systems. However, it also performed terribly on our final application. In general, if an approximation performs poorly in the inner loop, then we cannot expect good parameter posteriors from the outer loop; for loopy propagation it is well known that poor approximations result for frustrated systems and systems with large weights or tight cycles. The typically broader distributions from brief Langevin, and its less rapid failure with strong weights, mean that we expect it to be more robust than Loopy Metropolis. Another advantage is the cost per iteration: the mean-field and belief propagation algorithms are iterative procedures with potentially large and uncontrolled costs for convergence. Brief Langevin gave reasonable answers on some large systems where the other methods failed, although it too can suffer from very large artifacts: even on the very simple heart disease data set, one of the posterior marginals was very different under this approximation, which actually means the whole joint distribution is in the wrong place. We now turn to valid MCMC algorithms, which are clearly required if correct inferences are important.

The exchange algorithm

Figure: an augmented model with factors p(θ), f(y; θ)/Z(θ), q(θ'; θ, y) and f(x; θ')/Z(θ'); the original generative model for observations, p(y|θ), appears as a marginal. The exchange algorithm is a particular sequence of Metropolis-Hastings steps on this model.

The standard Metropolis-Hastings algorithm proposes taking the data y away from the current parameters θ, which would remove a factor including Z(θ) from the joint probability p(θ, y). The proposal also suggests giving the data to the new parameters θ', which introduces a new factor including Z(θ'). If each parameter setting always "owned" a data set under the model's joint distribution, then we would not need to keep adding and removing these Z factors. We bring the proposed parameter θ' into the model and give it a data set of its own, x. This augmented
distribution, illustrated in figure (…), has joint probability

    p(y, x, θ, θ') = p(θ) [f(y; θ)/Z(θ)] q(θ'; θ, y) [f(x; θ')/Z(θ')].

Given a setting of the parameters θ, a second setting θ' is generated from an arbitrary distribution q(θ'; θ, y). A fantasy data set is then generated from p(x|θ') = f(x; θ')/Z(θ'), where the function f is the same as in the data-generating likelihood, equation (…). The original joint distribution p(θ, y) is evidently maintained on adding these child variables to the graphical model, as long as we can generate fantasies from this prior model. Two Monte Carlo operators are now feasible.

Operator one resamples x from its distribution conditioned on θ and y. This is a block-Gibbs sampling operator, which is naturally implemented by sampling θ' from the proposal, followed by generating a fantasy for that parameter setting. Thus we are able to change θ' as long as we can sample from p(x|θ'). Resampling x at the same time removes the need to evaluate the change in Z, although it does require an exact sample: this will need a method like coupling from the past (section …). Just having a Markov chain with stationary distribution p(x|θ') is not sufficient.

Operator two is a Metropolis move that proposes swapping the values of the two parameter settings, θ and θ'. As the proposal is symmetric, the acceptance ratio is simply the ratio of joint probabilities before and after the swap:

    a = [q(θ; θ', y) p(θ') (f(y; θ')/Z(θ')) (f(x; θ)/Z(θ))] / [q(θ'; θ, y) p(θ) (f(y; θ)/Z(θ)) (f(x; θ')/Z(θ'))]
      = [q(θ; θ', y) p(θ') f(y; θ') f(x; θ)] / [q(θ'; θ, y) p(θ) f(y; θ) f(x; θ')].

All Z terms cancel. Combining the two operators gives the exchange algorithm:

Algorithm (exchange algorithm). Input: initial θ, number of iterations S.
  for s = 1 … S:
    propose θ' ~ q(θ'; θ, y)
    generate an auxiliary variable x ~ f(x; θ')/Z(θ')
    compute the acceptance ratio
      a = [q(θ; θ', y) p(θ') f(y; θ') f(x; θ)] / [q(θ'; θ, y) p(θ) f(y; θ) f(x; θ')]
    draw r ~ Uniform[0, 1]; if r < a then set θ ← θ'
  end for

Each step tries to take the data y from the current parameter setting θ. We speculate that a better parameter setting is θ', which was generated by q(θ'; θ, y). How can we persuade θ to give up the data to the rival parameter setting θ'? We offer it a replacement data set, x, from θ''s distribution.
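The loop above is short enough to sketch directly. The following is a minimal illustration (not the thesis's code) on an assumed toy model where exact sampling is easy: a zero-mean Gaussian with precision θ, unnormalized likelihood f(y; θ) = exp(−θy²/2), whose normalizer Z(θ) is deliberately never evaluated. The Exponential(1) prior and reflected random-walk proposal are also assumptions made for the sketch.

```python
import math, random

def exchange_step(theta, y, log_f, exact_sample, propose, log_prior):
    """One step of the exchange algorithm with a symmetric proposal q.

    log_f(v, t):      log of the unnormalized likelihood f(v; t)
    exact_sample(t):  exact draw x ~ f(x; t) / Z(t)
    propose(t):       symmetric proposal t' ~ q(t'; t)
    """
    theta_p = propose(theta)
    x = exact_sample(theta_p)  # fantasy owned by the proposed parameters
    # Z(theta)/Z(theta') is replaced by the one-sample importance
    # estimate f(x; theta)/f(x; theta'); all log Z terms cancel.
    log_a = (log_prior(theta_p) - log_prior(theta)
             + log_f(y, theta_p) - log_f(y, theta)
             + log_f(x, theta) - log_f(x, theta_p))
    return theta_p if math.log(random.random()) < log_a else theta

# Toy model (assumed for illustration): y ~ N(0, 1/theta),
# f(y; theta) = exp(-theta*y^2/2), Exponential(1) prior on theta,
# reflected random-walk proposal (symmetric on theta > 0).
random.seed(0)
log_f = lambda v, t: -0.5 * t * v * v
exact_sample = lambda t: random.gauss(0.0, 1.0 / math.sqrt(t))
propose = lambda t: abs(t + random.gauss(0.0, 0.5))
log_prior = lambda t: -t
theta, y = 1.0, 0.3
samples = []
for _ in range(5000):
    theta = exchange_step(theta, y, log_f, exact_sample, propose, log_prior)
    samples.append(theta)
```

Here the true posterior is Gamma(3/2, 1 + y²/2), so the chain's mean should settle near 1.4; the point of the sketch is only that no evaluation of Z(θ) appears anywhere.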
If f(x; θ) > f(y; θ), then this replacement is preferred by θ to the real data y, which is a good thing. We have to consider both sides of the exchange: the ratio f(y; θ')/f(x; θ') measures how much θ' likes the trade in data sets. Only by liking the data, and generating good replacements, can a parameter enter the set of posterior samples. Parameters that spread high f settings over many data sets will like the swap, but tend to generate unacceptable fantasies. This penalty replaces the need to compute Z.

Comparing the acceptance ratio to the infeasible MH algorithm's ratio in algorithm (…), we identify that the exchange algorithm replaces the ratio of normalizers with a one-sample unbiased importance sampling estimate (cf. equation …):

    Z(θ)/Z(θ') ≈ f(x; θ)/f(x; θ'),   x ~ f(x; θ')/Z(θ').

This gives an interpretation of why the algorithm works, but it is important to remember that arbitrary estimators, based on importance sampling or other methods for computing Z and its ratios, are unlikely to give a transition operator with the correct stationary distribution. We emphasize the exchange motivation because the algorithm is simply Metropolis-Hastings, which is already well understood, here applied to an augmented model. For more mathematical detail see subsection (…), which provides a proof of the stationary distribution for a more general version of the algorithm.

Figure: an alternative representation of the generative model for observations y. All possible parameter settings θj are instantiated, fixed, and used to generate a set of data variables xj. The indicator i is used to set y = xi. The posterior over i, the parameter chosen by the indicator variable, is identical to p(θ|y) in the original model.

Product space interpretation

The exchange algorithm was originally inspired by Carlin and Chib, which gives a general method for model comparison by MCMC. In this approach every model is instantiated, and the sampler explores a product space of all the models' parameter settings. An indicator variable, which is also sampled, specifies which model was used to generate the
data. The posterior probability distribution over the indicator variable is the posterior distribution over models. Exchange moves in product spaces are common in the parallel or replica tempering methods in the physics literature (e.g. Swendsen and Wang; Geyer). By temporarily assuming that there is only a finite, discrete set of possible parameters θj, each setting can be considered as a separate model. This suggests a somewhat strange joint distribution (see also figure …):

    p({xj}, y, i) = p(i) I[y = xi] ∏j f(xj; θj)/Z(θj),

where I[y = xi] is an indicator function enforcing y = xi. This corresponds to choosing a parameter setting i from the prior and generating the observed data as before, but then also generating unobserved data sets using every other setting of the parameters. Although this only appears to be a convoluted restatement of the original generative process, all Z(θj) terms are now always present: Z = ∏j Z(θj) is a constant. Again, each of the unobserved data sets x_{j≠i} can be updated by standard MCMC methods such as Gibbs sampling. Conditioned on i, the remaining data set, xi = y, is not updated, because it has a single known, observed value. To update i we notice that an isolated MH proposal i → i' will not work: it is exceedingly unlikely that x_{i'} = y. Instead we couple i → i' proposals with an exchange of the data sets xi = y and x_{i'}. The MH acceptance ratio for this proposal is simple: much of the joint distribution in equation (…) cancels, leaving

    a = [q(i; i', y) p(i') f(y; θ_{i'}) f(x_{i'}; θ_i)] / [q(i'; i, y) p(i) f(y; θ_i) f(x_{i'}; θ_{i'})].

The only remaining problem is that, if the number of parameter settings is very large, we require huge amounts of storage and a very long time to reach equilibrium; in particular, the method seems impractical if θ is continuous. The solution to these problems recreates the exchange algorithm of the previous section. We declare that at each time step all of the x variables are at equilibrium conditioned on the indicator variable i, but we do not need to store these values. Only when a swap of ownership between θ and θ' is
proposed do we need to know the current setting of θ''s data; we can then draw the value from its equilibrium distribution, x ~ p(x|θ') = f(x; θ')/Z(θ'), "pretending" that this had already been computed. After the exchange has been accepted or rejected, the intermediate quantity x can be discarded: we redraw any unobserved data set from its stationary distribution as required. This "retrospective sampling" trick has been used in a variety of other MCMC algorithms for infinite-dimensional models (Papaspiliopoulos and Roberts; Beskos et al.). An infinite-dimensional model and retrospective sampling are not required to describe the exchange algorithm, but provide connections to the literature which might be useful for some readers; the product model works without exact sampling for models with a small number of parameter settings, and is also how the algorithm was originally conceived. Other readers may prefer the explanation in the previous section: even for continuous parameters, it uses Metropolis-Hastings on a finite auxiliary system, which is more established theoretically.

Bridging exchange algorithm

The Metropolis-Hastings acceptance rule has the maximum acceptance rate for any reversible transition operator that may only accept or reject proposals from q (see subsection …). As the exchange algorithm is only approximately the same as the direct MH approach (in some sense it rejects moves that it should not), and as we have much better approximations of normalizing-constant ratios than equation (…), better methods should be possible. How can good parameter proposals get rejected? It may be that θ' is a much better explanation of the data y than the current parameters θ, and would be accepted under MH. However, the exchange algorithm can reject the swap because x ~ f(x; θ')/Z(θ') is improbable under θ. This makes movement unnecessarily slow, and suggests searching for a way to make swaps look more favorable. The exchange algorithm with bridging draws a fantasy as before, x0 ~ f(x; θ')/Z(θ'), but then applies a series of modifications, x0 → x1 → … → xK, such that xK
is typically more appealing to θ. To describe the precise form of these modifications, we specify a new augmented joint distribution,

    p(y, x0…K, θ, θ') = p(θ) [f(y; θ)/Z(θ)] q(θ'; θ, y) [f(x0; θ')/Z(θ')] ∏_{k=1}^{K} Tk(xk; xk−1).

The augmented model combines the original model, a parameter proposal, a generated fantasy, and bridging steps. We choose the Tk to be Markov chain transition operators with corresponding stationary distributions pk. A convenient choice is

    pk(x; θ, θ') ∝ f(x; θ)^{βk} f(x; θ')^{1−βk} ≡ fk(x; θ, θ'),   βk = k/(K+1),

giving K intermediate distributions that bridge between p0(x) = f(x; θ')/Z(θ') and pK+1(x) = f(x; θ)/Z(θ). Other bridging schemes could be used; in what follows we only assume the directional symmetry pk(x; θ, θ') = pK+1−k(x; θ', θ). Similarly, we require Tk(x'; x, θ, θ') = TK+1−k(x'; x, θ', θ), which is easily achieved by always using the same transition operator for the same underlying stationary distribution.

As before, there are two possible Markov chain operators. Operator one: block-Gibbs sample from p(θ', x0…K | θ, y) = q(θ'; θ, y) [f(x0; θ')/Z(θ')] ∏k Tk(xk; xk−1). Operator two: propose swapping θ and θ' whilst simultaneously reversing the order of the xk sequence, as illustrated in figure (…). If the Tk operators satisfy detailed balance, i.e. Tk(x'; x) pk(x) = Tk(x; x') pk(x'), then the MH acceptance ratio for operator two does not depend on the details of Tk, and is easily computed. The concatenation of the two operators results in algorithm (…); note that K = 0 reduces to the previous exchange algorithm. As before, the ∏_{k=0}^{K} [fk+1(xk)/fk(xk)] term in the acceptance ratio corresponds to an unbiased estimate of Z(θ)/Z(θ').

Figure: the proposed change under a bridged exchange. Given a current parameter θ, the algorithm generates an exact sample x0 from θ''s distribution, and a sequence of data sets that come from distributions closer and closer to p(x|θ). The swap move proposes making θ' the current parameter and θ the owner of a fantasy; as x0 was typical of θ', not θ, the stack of auxiliary variables is reversed, so it looks like it might have been generated under the original process.
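The claim that the product of bridging ratios is an annealed importance sampling (AIS) weight with mean Z(θ)/Z(θ') can be checked numerically. The sketch below assumes the same idealized Gaussian setting used later in the comparison experiments: f(x; t) = exp(−tx²/2), so every bridging distribution pk is itself Gaussian and can be sampled exactly, standing in for perfectly-mixing transition operators Tk.

```python
import math, random

def ais_log_weight(theta, theta_p, K):
    """One AIS weight of the kind used implicitly by the bridged exchange
    algorithm, for f(x; t) = exp(-t*x^2/2) with exact draws from each
    bridge.  exp(result) is an unbiased estimate of Z(theta)/Z(theta')."""
    log_f = lambda x, t: -0.5 * t * x * x
    beta = lambda k: k / (K + 1.0)   # f_k = f(.;theta)^beta_k f(.;theta')^(1-beta_k)
    # exact draw from f_0 = f(.; theta'), i.e. precision theta'
    x = random.gauss(0.0, 1.0 / math.sqrt(theta_p))
    log_w = 0.0
    for k in range(K + 1):
        # accumulate log f_{k+1}(x_k) - log f_k(x_k)
        log_w += (beta(k + 1) - beta(k)) * (log_f(x, theta) - log_f(x, theta_p))
        if k < K:
            # ideal transition operator: exact draw from f_{k+1},
            # a Gaussian with interpolated precision
            prec = beta(k + 1) * theta + (1.0 - beta(k + 1)) * theta_p
            x = random.gauss(0.0, 1.0 / math.sqrt(prec))
    return log_w

random.seed(1)
# Z(t) = sqrt(2*pi/t), so Z(1)/Z(4) = 2; the estimator's mean should match.
est = sum(math.exp(ais_log_weight(1.0, 4.0, K=10)) for _ in range(20000)) / 20000
```

With K = 0 this collapses to the one-sample importance estimate of the plain exchange algorithm, whose higher variance is exactly what the bridging levels are buying down.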
Algorithm (exchange algorithm with bridging). Input: initial θ, iterations S, bridging levels K.
  for s = 1 … S:
    propose θ' ~ q(θ'; θ, y)
    generate an auxiliary variable by exact sampling: x0 ~ p0(x) = f(x; θ')/Z(θ')
    generate K further auxiliary variables with transition operators:
      x1 ~ T1(x1; x0), x2 ~ T2(x2; x1), …, xK ~ TK(xK; xK−1)
    compute the acceptance ratio
      a = [q(θ; θ', y) p(θ') f(y; θ')] / [q(θ'; θ, y) p(θ) f(y; θ)] × ∏_{k=0}^{K} [fk+1(xk)/fk(xk)]
    draw r ~ Uniform[0, 1]; if r < a then set θ ← θ'
  end for

The ∏_{k=0}^{K} [fk+1(xk)/fk(xk)] term corresponds to an unbiased estimate of Z(θ)/Z(θ'). Proof: it is an annealed importance sampling (AIS) weight, equation (…). This is the natural extension of the simple importance estimate, equation (…), in the original exchange algorithm; linked importance sampling (Neal) could also be used as a drop-in replacement. The bridging extension to the exchange algorithm allows us to improve its implicit normalizing-constant estimation, and so improve the acceptance rate, for some additional cost. Fortunately, no further expensive exact sampling, on top of that needed by the original algorithm, is required per iteration. The performance as a function of K is explored in section (…).

Details for proof of correctness

It turns out that the algorithm as stated is also valid when the operators Tk do not satisfy detailed balance. We give the details of a proof for this more general case. We define a set of reverse Markov chain operators,

    T̃k(x'; x) = Tk(x; x') pk(x')/pk(x),

define the transition operator T by the steps of algorithm (…), and let T̃ be defined by the same algorithm but using the T̃k rather than the Tk. We now prove that both T and T̃ leave p(θ|y) stationary. The probability of a transition θ → θ' under T, via the auxiliary settings x0…K, is

    T(θ'; θ, x0…K) p(θ|y) = p(θ|y) q(θ'; θ, y) [f(x0; θ')/Z(θ')] [∏_{k=1}^{K} Tk(xk; xk−1)]
        × min{ 1, [q(θ; θ', y) p(θ') f(y; θ') / (q(θ'; θ, y) p(θ) f(y; θ))] ∏_{k=0}^{K} fk+1(xk)/fk(xk) },

i.e. the probability of being at θ, proposing a move to θ', generating x0, generating x1…K, and finally accepting. Rearranging, and using p(θ|y) = p(θ) f(y; θ)/(Z(θ) p(y)), gives

    min{ [p(θ) f(y; θ) q(θ'; θ, y) f(x0; θ') / (p(y) Z(θ) Z(θ'))] ∏_{k=1}^{K} Tk(xk; xk−1),
         [p(θ') f(y; θ') q(θ; θ', y) f(x0; θ') / (p(y) Z(θ) Z(θ'))] ∏_{k=1}^{K} Tk(xk; xk−1) ∏_{k=0}^{K} fk+1(xk)/fk(xk) }.

We then substitute the definitions of pk (equation …) and T̃k (equation …) into the second argument, and reorder its terms: the bridging ratios telescope against the transition probabilities.
After this substitution, the second argument of the min becomes [p(θ') f(y; θ') q(θ; θ', y) f(xK; θ) / (p(y) Z(θ) Z(θ'))] ∏_{k=1}^{K} T̃k(xk−1; xk). Now swapping θ and θ', reversing the order of the auxiliary variables (i.e. mapping x0, x1, …, xK to xK, …, x1, x0), and swapping T and T̃ throughout, only swaps the two arguments of the min, leaving the value of the expression unchanged. Therefore the probability of the reverse transition under T̃, via the same intermediate values, is the same as in equation (…). That is, for any values of x0…K,

    T(θ'; θ, x0…K) p(θ|y) = T̃(θ; θ', xK…0) p(θ'|y).

Summing over all possible intermediate values x0…K gives T(θ'; θ) p(θ|y) = T̃(θ; θ') p(θ'|y), and further summing over θ or θ' shows that both T and T̃ leave p(θ|y) stationary. If, as was originally proposed, the Tk are reversible, then T = T̃, and the bridged exchange algorithm also satisfies detailed balance. As an aside, we mention that the bridging scheme is very similar to that found in the method of "dragging" fast variables (Neal); although dragging was presented using reversible transition operators, a similar proof to the one above shows that it too can use nonreversible transition operators.

The single auxiliary variable method

The first valid MCMC algorithm for doubly-intractable distributions was discovered by Møller et al. For comparison, this section reviews their method, which we call the single auxiliary variable method (SAVM). SAVM extends the original model to include a single auxiliary variable, x, which shares the same state space as y (figure …):

    p(x, y, θ) = p(x|θ, y) [f(y; θ)/Z(θ)] p(θ).

The joint distribution p(θ, y) is unaffected. No known method of defining auxiliary variables removes Z(θ) from the joint distribution; however, through careful choice of q, explicit Z dependence can be removed from the MH ratio,

    a = [p(x', y, θ') q(x, θ; x', θ')] / [p(x, y, θ) q(x', θ'; x, θ)].

Figure: (a) the original model: unknown parameters θ generate observed variables y. (b) The SAVM augmented model. The conditional distribution of x must have a tractable θ dependence; in existing approaches this distribution is only a function of one of θ or y, e.g. f(x; θ̂(y))/Z(θ̂), or a normalizable function of (x, θ).

A convenient form of proposal distribution is q(x', θ'; x, θ, y) = q(θ'; θ, y) q(x'; θ'), which corresponds
to the usual change in parameters, θ → θ', followed by a choice for the auxiliary variable. If this choice, which happens to ignore the old x, uses q(x'; θ') = f(x'; θ')/Z(θ'), where f and Z are the same functions as in p(y|θ), equation (…), then the MH acceptance ratio becomes

    a = [p(x'|θ', y) p(θ') (f(y; θ')/Z(θ')) q(θ; θ', y) (f(x; θ)/Z(θ))] / [p(x|θ, y) p(θ) (f(y; θ)/Z(θ)) q(θ'; θ, y) (f(x'; θ')/Z(θ'))]
      = [p(x'|θ', y) p(θ') f(y; θ') q(θ; θ', y) f(x; θ)] / [p(x|θ, y) p(θ) f(y; θ) q(θ'; θ, y) f(x'; θ')].

Now every term can be computed. As with the exchange algorithm, the big assumption is that we can draw independent, exact samples from the proposal distribution, equation (…). The missing part of this description was the conditional distribution of the auxiliary variable, p(x|θ, y). This choice is not key to constructing a valid MH algorithm, but it will have a strong impact on the efficiency of the Markov chain. Normally we have a choice over the proposal distribution; here that choice is forced upon us, and instead we choose the target distribution p(x|θ, y) to match the proposals as closely as possible. We cannot maximize the acceptance rate by choosing p(x|θ, y) = f(x; θ)/Z(θ), as that would reintroduce explicit Z terms into the MH ratio. Møller et al. suggested two possibilities: (1) use a normalizable approximation to the ideal case; (2) replace θ with a point estimate, θ̂, such as the maximum pseudo-likelihood estimate based on the observations y. This gives an auxiliary distribution

    p(x|θ, y) = p(x|y) = f(x; θ̂(y))/Z(θ̂),

with a fixed normalization constant Z(θ̂), which will cancel in equation (…). The broken lines in figure (b) indicate that, while x could be a child of both θ and y, in practice previous work has only used one of the possible parents. For concreteness, we assume p(x|θ, y) = f(x; θ̂)/Z(θ̂), for some fixed θ̂(y), in all that follows; but our results are applicable to either case.

Reinterpreting SAVM

Seen in the light of the product model in figure (…), Møller et al.'s SAVM method appears slightly strange. SAVM can be reproduced by augmenting our joint model in figure (…), containing variables {xj}, with an additional, arbitrary latent x (with no subscript), as used in SAVM.
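The cancelled form of the SAVM ratio above can be exercised on the same assumed Gaussian toy model used earlier: f(v; t) = exp(−tv²/2) with an Exponential(1) prior. In this sketch (not Møller et al.'s code) θ̂ is an arbitrary made-up point estimate, and the auxiliary variable is refreshed each step from its exact conditional, which is a valid Gibbs update because p(x|θ, y) does not depend on θ.

```python
import math, random

def savm_step(theta, y, theta_hat, step=0.5):
    """One step of the single auxiliary variable method (SAVM) on a toy
    zero-mean Gaussian with precision theta, f(v; t) = exp(-t*v^2/2),
    Exponential(1) prior, symmetric reflected random-walk proposal.
    theta_hat defines p(x | theta, y) = f(x; theta_hat)/Z(theta_hat)."""
    log_f = lambda v, t: -0.5 * t * v * v
    # Gibbs-refresh the auxiliary variable from its exact conditional.
    x = random.gauss(0.0, 1.0 / math.sqrt(theta_hat))
    theta_p = abs(theta + random.gauss(0.0, step))      # symmetric proposal
    x_p = random.gauss(0.0, 1.0 / math.sqrt(theta_p))   # x' ~ f(.;theta')/Z(theta')
    log_a = (log_f(x_p, theta_hat) - log_f(x, theta_hat)  # p(x'|y) / p(x|y)
             + (theta - theta_p)                          # Exponential(1) prior ratio
             + log_f(y, theta_p) - log_f(y, theta)        # likelihood ratio, no Z
             + log_f(x, theta) - log_f(x_p, theta_p))     # auxiliary proposal ratio
    return theta_p if math.log(random.random()) < log_a else theta

random.seed(5)
y, theta_hat = 0.3, 1.5   # theta_hat: a rough, made-up point estimate
theta, samples = 1.0, []
for _ in range(20000):
    theta = savm_step(theta, y, theta_hat)
    samples.append(theta)
```

The chain targets the same Gamma(3/2, 1 + y²/2) posterior as the exchange sketch; the difference is the extra p(x|y) terms and the dependence on the quality of θ̂.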
We can then define the following proposal: draw j ~ q(j; i, y), and perform the deterministic three-way swap in which xj takes the old value of xi, xi takes the old value of x, and x takes the old value of xj. Before the swap, xi was equal to the observed data y, so after the swap xj will be equal to the data, now owned by θj. The acceptance ratio for this proposal is precisely that of SAVM, equation (…). If we want to take y from θi and give it to a rival setting θj, why involve a third party? In section (…) we will see that the third party can make the transaction harder, or can mediate it: the bridging steps in subsection (…) were specifically designed to make the swap more palatable. In the next section we propose an extension of SAVM which has similar bridging steps.

MAVM: a tempered-transitions refinement

As with the exchange algorithm, SAVM's acceptance ratio, equation (…), can be seen as an approximation to the exact normalizing-constant evaluation in algorithm (…). SAVM uses the following two one-sample importance-sampling estimators:

    Z(θ)/Z(θ̂) ≈ f(x; θ)/f(x; θ̂),     x ~ f(x; θ̂)/Z(θ̂),
    Z(θ')/Z(θ̂) ≈ f(x'; θ')/f(x'; θ̂),  x' ~ f(x'; θ')/Z(θ').

A biased estimate of Z(θ)/Z(θ') is obtained by dividing the first of these by the second: the unknown constant Z(θ̂) fortuitously cancels and, amazingly, substituting this elementary approximation into the MH algorithm gave a valid method. As with the exchange algorithm, SAVM's importance sampling estimators are very crude: a large mismatch between the distributions involved can cause a high MH rejection rate. Bridging between these two distributions might help, which suggests replacing the two importance sampling estimators with annealed importance sampling estimates. This gives algorithm (…). Our new algorithm has K+1 auxiliary variables, and collapses to SAVM for K = 0; we call this method, with K > 0, the multiple auxiliary variable method (MAVM).

Algorithm (multiple auxiliary variable method, MAVM). Input: initial θ, x; iterations S; bridging levels K.
  for s = 1 … S:
    propose θ' ~ q(θ'; θ, y)
    propose the first component of x' by exact sampling: x'0 ~ p0(x; θ') = f(x; θ')/Z(θ')
    propose the remainder of x' using K transition-operator steps:
      x'1 ~ T1(x'1; x'0, θ', y), x'2 ~ T2(x'2; x'1, θ', y), …, x'K ~ TK(x'K; x'K−1, θ', y)
    compute the acceptance ratio
      a = [f(y; θ') p(θ') q(θ; θ', y)] / [f(y; θ) p(θ) q(θ'; θ, y)] × ∏_{k=0}^{K} [fk+1(x'k; θ')/fk(x'k; θ')] × ∏_{k=0}^{K} [fk(xk; θ)/fk+1(xk; θ)]
    draw r ~ Uniform[0, 1]; if r < a then set θ ← θ', x ← x'
  end for

As in the bridged exchange algorithm, ratios involving the auxiliary variables can be computed online as they are generated; there is no need to store the whole x ensemble. The Tk are any convenient transition operators that leave the corresponding distributions pk stationary, where

    pk(x; θ) ∝ f(x; θ)^{1−βk} f(x; θ̂)^{βk} ≡ fk(x; θ),   βk = k/(K+1).

Other sequences of stationary distributions are possible: they must start at p0(x; θ) = f(x; θ)/Z(θ), as we are forced to draw an exact sample from this as part of the proposal; from there, they should bridge towards the approximate or estimator-based distribution used by SAVM.

Figure: the joint distribution for the annealing-based multiple auxiliary variable method (MAVM). Here it is assumed that p(xK|θ, y) is based only on a data-driven parameter estimate, as in equation (…). The auxiliary variables bridge towards the distribution implied by θ. The gray-level and thickness of the arrows from y and θ indicate the strengths of influence on the auxiliary variables; these are controlled by βk in equation (…).

To motivate and validate this algorithm, we extend the auxiliary variables x to an ensemble of variables X = (x0, x1, …, xK) (figure …). We give xK the same conditional distribution as the single auxiliary variable x in SAVM, equation (…). The distribution over the remaining variables is defined by a sequence of Markov chain transition operators, with k = K, …, 1:

    p(xk−1 | xk, θ, y) = T̃k(xk−1; xk, θ, y),   where, as usual, T̃k(x'; x, θ) = Tk(x; x', θ) pk(x'; θ)/pk(x; θ).

This defines a stationary distribution over the ensemble, p(X|θ, y) p(θ|y). Treating the procedure in algorithm (…) as a proposal q(X', θ'; X, θ), the acceptance rule is that of standard Metropolis-Hastings; as in other bridging schemes, the details of the transition operators cancel after substituting equation (…). While we started by replacing SAVM's importance sampling estimators with AIS, the resulting algorithm is more closely related to tempered transitions
(Neal). Our approach has cheaper moves than standard tempered transitions, which would regenerate the whole auxiliary ensemble from p(X|θ, y) before every MH proposal; this exploits the generalization of tempered transitions introduced in subsection (…). As with adding bridging to the exchange algorithm, MAVM makes SAVM a closer match to ideal Metropolis-Hastings sampling. There is an additional cost of K Markov chain steps per iteration, but no additional exact sampling, which might need many Markov chain steps. We have also provided an answer to an open question in Møller et al. on how to use both θ and y in the auxiliary distribution p(x|θ, y): we use y in coming up with a point estimate of the parameters, to get a distribution in roughly the right place, and then we bridge towards a better fit to f(x; θ)/Z(θ) using ideas from annealing.

Comparison of the exchange algorithm and MAVM

We first consider a concrete example for which all computations are easy. This allows comparison with exact partition-function evaluation (algorithm …) and averaging over chains starting from the true posterior. We consider sampling from the posterior of a single precision parameter θ, which has a likelihood corresponding to N i.i.d. zero-mean Gaussian observations, y = (y1, y2, …, yN):

    p(y|θ) = ∏_{n=1}^{N} sqrt(θ/2π) exp(−θ yn²/2),

with a conjugate prior p(θ) = Gamma(θ; α, β). The corresponding posterior is tractable,

    p(θ|y) = Gamma(θ; α + N/2, β + Σn yn²/2),

but we pretend that the normalizing constant in the likelihood is unknown. We compare the average acceptance rate of the algorithms for two choices of proposal distribution q(θ'; θ, y). All of the algorithms require N exact Gaussian samples, for which we used standard generators; for large N one could also generate the sufficient statistic Σn xn² with a chi-squared routine. We also draw directly from the Gaussian stationary distributions pk in the bridging algorithms. This simulates an ideal case, where the energy levels are close or the transition operators mix well; more levels would be required for the same performance with less efficient operators.
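The tractable posterior used as ground truth here follows from the standard conjugate update. A small numerical check of that update, on made-up data (the precision, prior parameters, and sample size below are all assumptions for illustration):

```python
import math, random

def gaussian_precision_posterior(y, alpha, beta):
    """Conjugate update for the precision of zero-mean Gaussian data:
    prior Gamma(alpha, beta), likelihood prod_n N(y_n; 0, 1/theta)
    => posterior Gamma(alpha + N/2, beta + sum(y_n^2)/2)."""
    n = len(y)
    return alpha + 0.5 * n, beta + 0.5 * sum(v * v for v in y)

# Made-up data: with enough draws from N(0, 1/theta_true), the posterior
# mean a/b should approach theta_true.
random.seed(2)
theta_true = 2.0
y = [random.gauss(0.0, 1.0 / math.sqrt(theta_true)) for _ in range(2000)]
a, b = gaussian_precision_posterior(y, alpha=1.0, beta=1.0)
post_mean = a / b
```

Drawing proposals directly from this Gamma posterior, as in the first experiment below, makes exact-Z Metropolis-Hastings accept every move, so any rejections isolate the cost of the auxiliary-variable schemes.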
We now report results for fixed N and y. The first experiment uses proposals drawn directly from the parameter posterior, equation (…). The MH acceptance probability becomes a = 1: all proposals would be accepted if Z(θ) were computed exactly. Therefore any rejections are undesirable byproducts of the auxiliary variable scheme, which can only implicitly obtain noisy estimates of the normalizing constants. Figure (a) shows that both MAVM and the exchange algorithm improve over the SAVM baseline of Møller et al. It appears that a large number, K, of bridging levels is required to bring the acceptance rate close to the attainable a = 1; however, significant benefit is obtained from a relatively small number of levels, after which there are diminishing returns. As each algorithm requires an exact
sample, which in applications can require many Markov chain steps, the improvement from a few extra steps K can be worth the cost (see subsection …). In this artificial situation the performance of MAVM was similar to that of the exchange algorithm. This result favors the exchange algorithm, which has a slightly simpler update rule and does not need to find a maximum pseudo-likelihood estimate before sampling begins. In figure (a) we had fixed θ̂; figure (b) shows that the performance of MAVM falls off when this estimate is of poor quality. For moderate K, the exchange algorithm automatically obtains an acceptance rate similar to the best possible performance of MAVM; only for K = 0 was its performance considerably worse than SAVM. For this simple posterior, θ̂ sometimes manages to be a useful intermediary, but by K = 1 the exchange algorithm has caught up with MAVM. More importantly, the exchange algorithm performs significantly better than SAVM and MAVM in a more realistic situation, where the parameter proposal q(θ'; θ, y) is not ideal. Figure (c) shows results using a Gaussian proposal centered on the current parameter value. The exchange algorithm exploits the local nature of the proposal, rapidly obtaining the same
acceptance rate as exact evaluation of Z would give. MAVM performs much worse, although adding bridging levels does rapidly improve performance over the original SAVM algorithm. SAVM is now hindered by θ̂, which is more rarely between θ and θ'. The posterior distribution over θ, equation (…), becomes sharper for larger N; this makes the performance of SAVM and MAVM fall off more rapidly as θ̂ is moved away from its optimum value. These methods require better estimates of θ̂ with larger datasets.

Ising model comparison

We have also considered the Ising model distribution, with yi ∈ {±1}, on a graph with nodes I and edges E:

    p(y) = (1/Z(J, h)) exp( J Σ_{(i,j)∈E} yi yj + h Σ_{i∈I} yi ).

We used the summary states algorithm (subsection …) for exact sampling, and a single sweep of Gibbs sampling for the transition operators. Results are reported for y drawn from a model with fixed h and J. Møller et al. used uniform priors over h and J; this only works, or seems to, for a q with small step sizes: the algorithms hang if a large J is proposed, because CFTP based on Gibbs sampling takes a very long time to return a sample in this regime. We used a uniform prior over a restricted range of J, and larger, closer-to-optimal step sizes.

Figure: comparison of MAVM and the exchange algorithm learning a Gaussian precision. Panels (a)-(c) plot average acceptance probability for the exchange algorithm, MAVM, and SAVM. (a) Average acceptance rate as a function of K; MAVM with K = 0 corresponds to SAVM, the method of Møller et al.; exact normalizing-constant evaluation would give an acceptance rate of one. (b) Average acceptance rate as a function of the initial parameter estimate θ̂ required by SAVM (K = 0) and the extended version, MAVM; horizontal bars show the results for the exchange algorithm, which has no θ̂, for each K. (c) As in (a), but with a Gaussian proposal distribution centered on the current parameter setting; the horizontal line shows the maximum average acceptance rate for a reversible transition operator, obtained by exact normalizing-constant evaluation.
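The single-sweep Gibbs transition operator used in the Ising experiments can be sketched as follows. This is an illustrative stand-in, not the thesis's summary-states code, and the ring lattice, couplings, and sweep count below are assumptions chosen so the behavior is easy to check.

```python
import math, random

def gibbs_sweep(state, J, h, neighbors):
    """One systematic-scan Gibbs sweep for the Ising model
    p(y) = exp(J*sum_{(i,j) in E} y_i y_j + h*sum_i y_i) / Z(J, h),
    y_i in {-1, +1}.  The full conditional is
    P(y_i = +1 | rest) = sigmoid(2*(h + J*sum_{j in ne(i)} y_j))."""
    for i in range(len(state)):
        field = h + J * sum(state[j] for j in neighbors[i])
        p_up = 1.0 / (1.0 + math.exp(-2.0 * field))
        state[i] = 1 if random.random() < p_up else -1
    return state

# Ring of 16 spins as a stand-in lattice (the thesis uses a toroidal
# square grid); a strong positive field should magnetize the chain.
random.seed(4)
n = 16
neighbors = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
state = [random.choice([-1, 1]) for _ in range(n)]
for _ in range(50):
    state = gibbs_sweep(state, J=0.3, h=1.0, neighbors=neighbors)
```

One such sweep plays the role of each transition operator Tk in the bridged algorithms; exact samples, by contrast, need the much more expensive coupling-from-the-past machinery.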
Figure: performance on a toroidal square Ising lattice. The data were generated from an exact sample with fixed J and h; proposals were Gaussian perturbations. Panels (a) and (b) plot efficiency against K for the exchange algorithm, MAVM, and SAVM, for two different step sizes. Efficiency is the effective number of samples, estimated by R-CODA (Cowles et al.), divided by the total number of Gibbs-sampling sweeps (computer time).

All the algorithms are expensive: SAVM requires a large number of Gibbs sweeps of the Ising model to obtain one effective sample of the parameters. This could be improved by using more advanced exact sampling methods. Figure (…) gives some example results for two different step sizes. For both methods, the number of effective samples obtained per iteration will tend to the maximum possible as K → ∞, and in practice the performance of the two methods does converge. However, these bridging steps come at a cost: the best K is a compromise between the computer time per iteration and the mixing time of the chain. In the examples we tried, the exchange algorithm provided better mixing times than SAVM/MAVM for all finite K. Bridging provided good improvements over SAVM, but only gave appreciable benefits to the exchange algorithm for larger step sizes.

Discussion

MCMC methods typically navigate complicated probability distributions by local diffusion; longer-range proposals will be rejected unless carefully constructed. It is usually the case that, as the step size of proposals is made sufficiently small, the acceptance rate of an MH method tends to one. However, SAVM does not have this property: it introduces rejections even when θ' ≈ θ, while the exchange algorithm has a → 1, for all K, as the step size tends to zero. MAVM will only recover a → 1 as K → ∞; this is because the third party in the proposed swap (see subsection …) is not necessarily close to θ, even in a simple, unimodal, one-dimensional posterior distribution (figure c). This is a significant disadvantage in comparison with the
exchange algorithm. We found that the exchange algorithm performs better than the only other existing MCMC method for this problem, and it is simpler to implement.

Latent history methods

Both MAVM and the exchange algorithm require exact samples, and use them to compute similar Metropolis-Hastings acceptance ratios. In this section we propose new algorithms which are also valid MCMC methods for doubly-intractable distributions, but which do not draw exact samples. They do follow a very similar coupling procedure, but the properties of the resulting algorithms are quite different. We start with an inspiring but impractical rejection sampling algorithm for the parameters, given, for example, by Marjoram et al. (algorithm B):

Algorithm (simple rejection sampling algorithm for p(θ|y)).
  1. generate θ ~ p(θ)
  2. simulate y' ~ p(y'|θ)
  3. accept θ if y' = y

On real problems, where the data can take on many values, this algorithm will rarely accept. Marjoram et al. suggest relaxing the equality in step 3 by accepting data sets that are "close" in some sense. They also suggest MCMC algorithms which diffuse the parameters in step 1, rather than drawing from the prior. These algorithms are valid samplers for the wrong distribution, p(θ | y' is near the observed value). This section will introduce an alternative approach, which tries to make it more probable that the fantasy data in step 2 is equal to the observed data, while keeping the validity of the original algorithm.

A generative model specifies a likelihood p(y|θ), but not necessarily the details of an algorithm to sample from this distribution. The latent history representation brings a particular sampling procedure into the description of the model. We declare that a stationary, ergodic Markov chain T was simulated for an infinite number of time steps, generating an arbitrarily-initialized sequence X = (…, x−2, x−1, x0), and that we observed the data y = x0. The marginal probability of any xt, including x0, is the stationary distribution of the Markov chain, which we set to the target model's p(y|θ).
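The three-step rejection algorithm above is feasible for discrete data, which makes its behavior easy to verify. A minimal sketch, using a made-up two-point prior over the bias of a coin and three observed flips:

```python
import random

def rejection_posterior_sample(prior_sample, simulate, y):
    """Simple rejection sampling for p(theta | y): repeatedly draw
    theta ~ p(theta), simulate y' ~ p(y' | theta), and accept only
    when y' == y exactly (only feasible for small discrete data)."""
    while True:
        theta = prior_sample()
        if simulate(theta) == y:
            return theta

# Toy check with a biased coin: theta in {0.2, 0.8}, uniform a priori;
# data = three flips, all heads.  Exact posterior:
# P(theta = 0.8 | y) = 0.8^3 / (0.8^3 + 0.2^3) = 0.9846...
random.seed(3)
prior_sample = lambda: random.choice([0.2, 0.8])
simulate = lambda t: tuple(1 if random.random() < t else 0 for _ in range(3))
y = (1, 1, 1)
draws = [rejection_posterior_sample(prior_sample, simulate, y) for _ in range(2000)]
frac_high = sum(t == 0.8 for t in draws) / len(draws)
```

Even in this tiny example only about a quarter of the proposed (θ, y') pairs are accepted; with realistically large data spaces the acceptance probability collapses, which is exactly the problem the latent history construction below is designed to attack.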
Unraveling a Markov chain and writing it down as a generative model was a trick employed by Hinton et al.; our latent history representation is illustrated in figure (…). As in coupling from the past (section …), we also consider the random variables U = (…, u−2, u−1, u0) used by the Markov chain transition operator at each time step. Given θ and the ut variables, which provide T with its source of randomness, each state is a deterministic function of the previous one, xt = x(xt−1, ut).

Figure: the latent history representation augments the observed data y and parameters θ with the history of a Markov chain, …, x−2, x−1, x0, with stationary distribution p(y|θ). If we can effectively sample over all unknowns in this graphical model then, by discarding the xt and ut, we have an algorithm to sample θ.

Metropolis-Hastings algorithm

The latent history representation is a distribution with an infinite number of latent variables. Although we cannot explicitly update all of these variables, MCMC sampling is notionally straightforward: algorithm (…) would be a valid MCMC procedure if we had an oracle to perform the infinite number of computations in steps (…) and (…) on our behalf. As an aside, we note that this algorithm is similar to algorithm F in Marjoram et al. for approximate Bayesian computation (ABC).

Algorithm (template for a latent history sampler). Input: initial setting θ, number of iterations S.
  for s = 1 … S:
    block-Gibbs sample from p(X, U | θ, y)
    propose θ' ~ q(θ'; θ); with probability min(1, [p(θ') q(θ; θ')] / [p(θ) q(θ'; θ)]):
      follow the deterministic map xt = x(xt−1, ut) under θ' to find y' = x0
      if y' = y then accept the move, θ ← θ'
  end for

The algorithm can feasibly be implemented if T provably coalesces to a single final state from any start state, given the sequence of random numbers ut. This allows one step of algorithm (…) to be replaced with the procedure used by coupling from the past (CFTP, see section …) to identify y' while only examining some of the ut. Further, there are two differences from the ABC approach. Firstly, the equivalent of the generation step is always performed before checking the equivalent of the acceptance step; our
version would al so be advisable in the original abc setting assuming data simulation is more exp ensive than evaluation of prior probabilities secondly the major difference we s ample the fantasy y in a way that has a bias towards reproducing y the algorithm s are not directly comparable however the abc community assumes that they cannot evaluate a function proportional to py which would make it difficult to constru ct the transition operators needed by our algorithm latent history methods by co mbining steps and the coupling from the past can be terminated early if at any p oint it becomes clear that y y all we require now is an oracle that can return a ny ut on request creating this service can replace the block gibbs sampling in s tep as long as we are consistent with its infinite computation markov properties in the model imply px u x t pxt xt t put xt xt this structure allows resampling of x x in sequence from pxt xt pxt xt pxt pxt t xt xt pxt t xt xt pxt the rever se transition operator corresponding to t as long as put xt xt is known for oper ator t any ut value can be sampled after the corresponding xt and xt are availab le this means that step in algorithm can be implemented lazily construct an obje ct that will provide ut values on request consistent with block gibbs sampling t he entire model this object will respond to a request for ut by returning the va lue from memory if it has previously been sampled or if necessary by sampling ba ck to xt from the furthest back x value that is known and returning a sample fro m put xt xt within each iteration all samples are stored for future use to ensur e consistency the practical implementation of a latent history metropolishasting s sampler is summarized in algorithm this includes an optional refinement versio n b which analytically integrates out u this requires computing the transition p robability for one step of t which is usually possible and should increase the a cceptance rate the refinement also makes 
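The lazy u_t oracle described above can be sketched for a deliberately trivial chain; the model below (a single Bernoulli spin resampled from its stationary distribution by Gibbs, with x_t = 1 iff u_t < θ) is an illustrative assumption, chosen so that the reverse operator is just the stationary distribution:

```python
import random

class LazyUGenerator:
    """Sketch of a lazy u_t oracle: memoize every u_t served, and extend
    the state history backwards only as far as requests require."""

    def __init__(self, theta, x0, seed=0):
        self.theta = theta
        self.rng = random.Random(seed)
        self.x = {0: x0}   # known states of the chain, keyed by time t <= 0
        self.u = {}        # memoized u_t values

    def state(self, t):
        # Sample states backwards from the furthest-back known value using
        # the reverse operator (here simply the stationary distribution,
        # because this Gibbs chain forgets x_{t-1} in one step).
        for s in range(min(self.x), t - 1, -1):
            if s not in self.x:
                self.x[s] = int(self.rng.random() < self.theta)
        return self.x[t]

    def get_u(self, t):
        # Return from memory if previously sampled; otherwise draw
        # u_t ~ p(u_t | x_{t-1}, x_t) and store it for future consistency.
        if t not in self.u:
            self.state(t - 1)                 # ensure x_{t-1} is available
            if self.x[t] == 1:                # x_t = 1  =>  u_t in (0, theta)
                self.u[t] = self.rng.random() * self.theta
            else:                             # x_t = 0  =>  u_t in (theta, 1)
                self.u[t] = self.theta + self.rng.random() * (1 - self.theta)
        return self.u[t]

g = LazyUGenerator(theta=0.3, x0=1, seed=1)
u0 = g.get_u(0)
```

For a chain with real memory, `state` would instead apply the reverse operator T̃(x_{t−1} ← x_t; θ) step by step; the memoization pattern is unchanged.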
Algorithm: Metropolis–Hastings latent history sampler.
Input: initial setting θ; number of iterations S; an algorithm CFTP that can use a finite number of random numbers to prove where an infinitely long Markov chain ends (see the coupling-from-the-past section).
for s = 1 … S:
  1. Create a lazy u_t generator consistent with p(u | θ, x_0 = y).
  2. Propose θ′ ~ q(θ′; θ).
  3. Either, version (a): with probability min(1, p(θ′) q(θ; θ′) / [p(θ) q(θ′; θ)]),
       identify whether y′, the state x_0 implied by θ′, is equal to y, using CFTP and the u_t generator;
       if y′ = y then accept the move.
     Or, version (b): draw r ~ Uniform[0, 1];
       identify enough about x_{−1} for the acceptance rule, using CFTP and the u_t's;
       accept the move if
         r < [T(y ← x_{−1}; θ′) p(θ′) q(θ; θ′)] / [T(y ← x_{−1}; θ) p(θ) q(θ′; θ)].
end for

A disadvantage of version (b) is that its acceptance step may be difficult to implement without always identifying x_{−1}, whereas version (a) can prematurely halt CFTP whenever y′ ≠ x_0, which is often easy to prove.

Performance

A simple special case highlights large differences between latent history samplers and the previous auxiliary variable algorithms. We first consider version (a) of the sampler applied to a large collection of D independent Bernoulli variables sharing a single unknown parameter θ; this is an Ising model in the limit of zero connection strengths. Assume that the transition operator in use, T, is Gibbs sampling, which for these independent variables draws from the stationary distribution in one sweep. The sampler is implemented as follows: for each of the observed spins y_d, the corresponding u_d random variate is uniformly distributed in (0, θ) if the spin is up, and in (θ, 1) if the spin is down. Now updates to θ are constrained to lie within the two closest surrounding u_d values. This range has a typical scale of 1/D, while the posterior over the parameter has a width that scales as 1/√D. This suggests that it will take at least O(D) steps to equilibrate the single scalar parameter by random-walk sampling, which is disappointing given the ideal conditions.

Notice that the performance of the algorithm depends on the details of the transition operators. An unreasonably clever transition operator could generate y if u_0 < p(y|θ), and some other data set otherwise; then the probability of finding y′ = y is min(1, p(y|θ′)/p(y|θ)), giving an overall algorithm similar to the ideal Metropolis–Hastings algorithm.
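The scaling argument for version (a) on the independent Bernoulli model can be made concrete. The snippet below (an illustration written for this discussion, not thesis code) block-samples the u_d for D spins and computes the interval of θ values that leaves every spin, and hence y, unchanged; its width is O(1/D), much narrower than the O(D^{-1/2}) posterior:

```python
import random

def allowed_theta_interval(theta, D, seed=0):
    """After block-sampling u_d for D independent Bernoulli spins
    (x_d = 1 iff u_d < theta), return the interval of theta' values
    that keeps every spin, and therefore the data, unchanged."""
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0
    for _ in range(D):
        up = rng.random() < theta                 # simulate an observed spin y_d
        if up:
            u = rng.random() * theta              # u_d ~ Uniform(0, theta)
            lo = max(lo, u)                       # theta' must stay above this u_d
        else:
            u = theta + rng.random() * (1 - theta)  # u_d ~ Uniform(theta, 1)
            hi = min(hi, u)                       # theta' must stay below this u_d
    return lo, hi

lo, hi = allowed_theta_interval(theta=0.5, D=1000)
print(hi - lo)  # typical width is O(1/D), far narrower than the posterior
```

Only proposals θ′ inside (lo, hi) can be accepted by version (a) in a single iteration, which is why random-walk exploration of the posterior needs at least O(D) steps.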
Actually, this corresponds to the two-stage acceptance rule described earlier. Neither MAVM nor the exchange algorithm has such dependence on the transition operators: how the exact sample was obtained is irrelevant.

We now turn to version (b) of the sampler, with the same independent Bernoulli model and Gibbs sampling operators. Here the transition probabilities in the acceptance step are equal to the parameters' likelihoods under the model, and the algorithm reduces to standard Metropolis–Hastings for θ, which has the best possible acceptance rate for a given proposal distribution q. Rapidly mixing operators are rewarded by this version of latent histories; operators that move very slowly will perform as in version (a).

We have implemented both versions of the sampler and applied them to the same Ising model problem as in the earlier subsection. We used Gibbs sampling for the y variables, combined with the summary states algorithm (see the earlier subsection) to perform the inferences required by the algorithm. The summary states code required some modification so that its transitions were driven by the latent history sampler's u-generator rather than by a generic random number generator; the state at time t = 0 or t = −1 was then identified as before. The parameter proposals were Gaussian, with a width which preliminary runs indicated was roughly optimal. In version (a) we terminated each iteration as soon as the summary state algorithm identified the state at time zero, or as soon as a non-'?' state at time zero was incompatible with the data. In version (b) we always waited until x_{−1} had been identified. More elaborate code could terminate earlier based on bounds on the acceptance ratio, but this is also true for the exchange algorithm, MAVM and most standard Metropolis–Hastings algorithms.

As a result of early terminations, version (a) was roughly five times cheaper per iteration than our implementation of version (b). Even with this advantage, the number of effective samples per Gibbs-sampling-like sweep through the system was around ten times worse for version (a). The efficiency of version (b), in effective samples per Gibbs sweep, was about half that obtained by SAVM. The latent history efficiency computations included the Gibbs-like sweeps needed to compute the u's, while SAVM was given its random numbers for free; this makes the real computation time of the methods comparable, and which method is actually faster will come down to implementation details. However, both the bridged exchange and MAVM algorithms are clearly faster than either variant of latent histories on this Ising problem. We might hope that latent histories will perform better on other problems. A more realistic reason to be excited by latent histories is that they might be a much better route to approximate methods than any of the previous algorithms in this chapter; we return to this issue in the discussion. First we turn to the original motivation for developing latent histories: constructing slice samplers.

Slice sampling doubly-intractable distributions

All of the algorithms for doubly-intractable distributions described so far are based on Metropolis–Hastings. Such algorithms always require step-size parameters that need to be set well: overly large step-sizes lead to almost all proposals being rejected, while overly small step-sizes give lengthy random-walk-like exploration of the posterior over θ. It would be nice to be able to use a self-tuning method like slice sampling, which has less-important, or even no, step-size parameters.

Latent histories

The latent history representation naturally allows slice sampling. It also has the property, shared with the exchange algorithm, that small steps in θ are always accepted, so moves will always be possible.
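As a reminder of the self-tuning behavior referred to above, a minimal one-dimensional slice sampler with stepping-out and shrinkage (in the style of Neal's slice sampling) can be sketched as follows; the target `log_f` and the bracket width `w` are illustrative assumptions, and `w` is only a scale guess rather than a critical step size:

```python
import math
import random

def slice_sample(log_f, x0, w=1.0, n=1000, rng=None):
    """1-D slice sampler: draw an auxiliary height under f(x), step out a
    bracket until it contains the slice, then shrink on rejections."""
    rng = rng or random.Random(0)
    x, samples = x0, []
    for _ in range(n):
        log_h = log_f(x) + math.log(rng.random())  # h ~ Uniform(0, f(x))
        # Stepping out: randomly position a bracket, expand past the slice.
        left = x - w * rng.random()
        right = left + w
        while log_f(left) > log_h:
            left -= w
        while log_f(right) > log_h:
            right += w
        # Shrinkage: sample in the bracket, shrinking towards x on rejection.
        while True:
            x_new = left + (right - left) * rng.random()
            if log_f(x_new) > log_h:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return samples

# Example: a standard normal target (log f up to a constant).
samples = slice_sample(lambda x: -0.5 * x * x, x0=0.0, n=2000)
```

The sections below reuse exactly this procedure; only the definition of the slice height changes.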
The trick is to identify the variables in the joint distribution (see the figure) as θ and u, while thinking of the x quantities as just a deterministic function of these variables. Conditioned on the u variables, any standard slice sampling algorithm can be applied to the parameter θ. The stationary distribution is simply the prior over the parameters, conditioned on x_0(θ, u) = y: any point not satisfying the constraint is off the slice and is discarded by the slice sampling procedure. This description gives a slice sampler corresponding to version (a) of the latent history algorithm. A slice sampling implementation of version (b) sets the stationary distribution to T(y ← x_{−1}; θ) p(θ); after each iteration, u is effectively updated in a block-Gibbs update, which involves resetting the lazy u_t generator as in the Metropolis–Hastings algorithm.

MAVM

Standard slice sampling cannot be applied to the parameter θ in the MAVM joint distribution, because this would involve computing Z(θ); as with Metropolis–Hastings, the auxiliary variables must be changed simultaneously. It turns out that, with an appropriate definition of a slice, the algorithm can be converted into a slice sampler. We first describe the new algorithm. Ostensibly we do just use a standard slice sampling procedure to update the parameters θ. However, the range of auxiliary heights, h ∈ (0, ĥ(x, θ)), is defined in terms of the following quantity:

  ĥ(x, θ) = p(θ) p(y|θ) p(x|θ, y) / q(x; θ, y).

Substituting the MAVM forms shows that ĥ is computable: the intractable Z(θ) cancels, leaving f(y; θ) p(θ) / q(x; θ, y) multiplied by a product of bridging ratios of the form f_k(x_k; θ)/f_k(x_{k+1}; θ), k = 1, …, K.

After drawing the slice-sampling auxiliary quantity h ~ Uniform(0, ĥ(x, θ)), we consider new parameter settings using any of the normal slice sampling procedures. However, whenever a new setting θ′ of the parameters is considered, a new setting of the auxiliary x variables is generated in the same way as in the MAVM algorithm; then, as usual, slice membership is ascertained by checking ĥ(x′, θ′) > h.

We now show that this procedure satisfies detailed balance with respect to the desired stationary distribution. As before, we use the label s for all rejected points and other ancillary points generated while exploring the
slice in a stepping-out procedure. The probability of starting at a setting (θ, x), generating a particular height h, generating intermediate quantities s, and finally an acceptable new pair (θ′, x′), is

  p(θ|y) p(x|θ, y) p(h|x, θ, y) q(s; h, θ) q(θ′; s, h) q(x′; θ′, y)
    = [p(θ) p(y|θ) p(x|θ, y) / (p(y) ĥ(x, θ))] q(s; h, θ) q(θ′; s, h) q(x′; θ′, y)
    = [q(x; θ, y) / p(y)] q(s; h, θ) q(θ′; s, h) q(x′; θ′, y),

using p(h|x, θ, y) = 1/ĥ(x, θ) and the definition of ĥ. All of the slice-sampling bracketing procedures mentioned earlier ensure that

  q(s; h, θ) q(θ′; s, h) = q(s; h, θ′) q(θ; s, h);

therefore the expression above is invariant to exchanging (θ, x) ↔ (θ′, x′), and the overall procedure satisfies detailed balance. This derivation closely follows that of standard slice sampling. This is possible because, conditioned on h, the decision to reject the intermediate parameters in s is the same whether s is generated forwards from θ or backwards from θ′. In contrast, the acceptability of a proposal in the exchange algorithm cannot be summarized by a scalar slice height h: a move depends on giving a new data set to the old parameter, which will have different probabilities depending on whether the move started at θ or at θ′. We don't currently see any way to define a slice sampler for the exchange representation.

As noted earlier, even very small moves in θ can be rejected when using the MAVM auxiliary system, unless a large number of bridging levels are used. The usual slice bracketing procedures work on the assumption that nearby regions will always be accepted, and they exponentially shrink towards the current point; this behavior could lead to overly small step sizes, so care would be needed in applying stock slice-sampling code.

Discussion

We established earlier that global information, like that provided by exact sampling, must be found as part of a strictly valid MCMC algorithm for doubly-intractable distributions. Given this, it seems likely that the exchange algorithm performs about as well as is possible: on the problems we have tried, its acceptance rate approaches that of Metropolis–Hastings with moderate K, leaving exact sampling as the dominant cost. We would generally recommend this algorithm if a valid MCMC procedure is required. Møller et al. pointed out that good deterministic approximations to p(y|θ) could make SAVM perform very well; our results firmly support the use of bridging, and we would recommend always considering MAVM instead.

There is an unresolved issue concerning doubly-intractable distributions when learning from N i.i.d. data points:

  p(θ | {y^(n)}) ∝ p(θ) ∏_{n=1}^{N} p(y^(n) | θ) = p(θ) Z(θ)^{−N} ∏_{n=1}^{N} f(y^(n); θ).

All of the known valid MCMC algorithms for this posterior require drawing an exact sample from the joint data distribution; this means drawing a fantasy containing N data sets, x = {x^(n)}, n = 1 … N. But perhaps there is enough information about Z(θ) in a single exact sample from the data distribution; drawing N exact samples may be prohibitively expensive with large data sets. Whether this problem can be avoided is an open question.

Sadly, it is not possible to draw exact samples from all distributions, and approximate relaxations must be used in some situations. An earlier section observed that a bad choice of approximations applied to the inner loop of standard Metropolis–Hastings can have surprising and disastrous consequences for the quality of the approximate samples. The valid algorithms discussed in this chapter are an alternative target for approximation. The obvious relaxation of methods requiring exact sampling from the data distribution is to use a brief run of MCMC, possibly starting from the observed data. Møller et al. report that a fixed length of sampling can be more inefficient than exact sampling for similar levels of performance; however, this idea would be worth exploring further in cases where exact sampling is not possible.

A natural relaxation of the latent history representation is truncating the history to some finite length of time K. This would require specifying an explicit prior distribution for x_{−K}, perhaps an approximation to the original model. One could optionally replace the stationary Markov chain operator T with a sequence of operators T_k, bridging between the approximate distribution used for x_{−K} and the target model
distribution. This finite model could be simulated using the latent history algorithms, with explicit computations, or with any standard MCMC technique. We believe that truncated latent history representations are a promising direction for future research in approximate inference for doubly-intractable distributions.

Running finite Markov chains within the exchange algorithm or MAVM is an inner-loop approximation, somewhat like those explored earlier. The consequences of approximating MCMC procedures are hard to predict, and it is possible that very unreasonable inferences will result. In contrast, a truncated latent history representation is a model on which we can apply standard MCMC algorithms. The model will not describe exactly the same beliefs as the original distribution, but all of the inferences performed on it will be self-consistent. Thus, if prior draws from the truncated model look reasonable, it is likely that the inferences will also be sensible.

This line of reasoning is really suggesting that we throw out the original doubly-intractable model and replace it with a more tractable latent variable model. If the model was originally introduced as an ad hoc way to introduce dependencies, then using an alternative that can be treated consistently may be preferable; in particular, we can look at draws from the prior and see if they actually reflect our beliefs. Heckerman et al. follow a related thought process: they discuss circumstances in which undirected graphical models may not be a natural representation, and provide an alternative replacement. Changing the model doesn't seem attractive when an undirected model was derived directly from physical considerations. An example of such a model is recent work on learning protein structure (Podtelezhnikov et al.), which learns the parameters of an energy function using contrastive divergence (Hinton). A full Bayesian treatment of these parameters with MCMC seems daunting: exact sampling from the configurational distribution of a protein chain seems infeasibly difficult. However, using the "wrong" model could still serve as an approximation technique, and doing so may be more stable than inner-loop approximations.

This chapter has not addressed computing marginal likelihoods p(y|M) for Bayesian model comparison of different undirected graphical models. Evidently the standard methods discussed in the earlier chapters will apply in theory, although such computations are likely to be demanding: a "triply-intractable" problem.

Chapter summary and future work

The Markov chain Monte Carlo method, introduced by Metropolis et al., was one of the earliest applications of electronic computers. Today it continues to be an important technique for approximate numerical integration in a growing number of applications. An introductory chapter provided an introduction to the challenges involved, while the following chapter described some of the current literature along with some minor extensions.

We then explored the promising ideas of using multiple particles and multiple proposals in Markov chain simulations. As with other authors, we found that modest improvements are possible, largely derived from the basic length-scale information given by a few samples from a distribution. However, drawing multiple proposals can easily cost more to compute than the statistical benefit. It remains to be seen whether implementations leveraging caching of intermediate results, or parallel computations, can gain an advantage in real applications; other authors in the literature remain hopeful that they can, and our pivot-based approaches seem to bring unique benefits to this area of the field.

A further chapter investigated methods for computing normalizing constants. One of these, AIS (annealed importance sampling), is actually an importance sampling method rather than MCMC, and nested sampling offers a new sampling paradigm in its own right; however, both of these algorithms use Markov chains in their practical implementation. We extended existing techniques for nested sampling on discrete distributions. We then compared three fundamentally different algorithms based on the same Markov chain. Our theoretical and empirical results show that the algorithms perform very differently and that no method is appropriate for all distributions; it would seem wise to use more than one algorithm as standard practice. We proposed one practical approach to realize this: a method for tuning AIS based on preliminary runs of nested sampling.

The present chapter serves to highlight the limitations of MCMC: sampling from doubly-intractable distributions is exceedingly difficult, yet these distributions are found in a large class of undirected statistical models and have received a great deal of attention in recent years. We have introduced three new algorithms for this problem: MAVM, the exchange algorithm, and latent history samplers. These improve the existing state of the art, remove the need for separate deterministic approximations, and offer new directions for approximating this difficult problem. Exploring truncations of latent histories is a possible area for further work.

This thesis has concentrated on investigating new Markov chains, and new algorithms for the better use of existing MCMC operators. These have been investigated empirically, and sometimes theoretically, on simple problems designed to demonstrate the algorithms' key properties. One limitation of the presentation here has been the focus on stationary distributions: little has been said about the chains' formal convergence rates. This is partly because making meaningful statements about convergence for general statistical problems seems difficult (see also the earlier section). Wherever possible, we have reduced new algorithms to the Metropolis–Hastings algorithm applied to an auxiliary system; this passes some of the burden of deriving rates of convergence onto an established literature. While we have focused on MCMC methodology and proposed a number of novel algorithms, ultimately the computational advantages of these methods have to be proven on challenging real-world
applications this is an area i hope to spend more ti me on in my future research bibliography d h ackley g e hinton and t j sejnowski a learning algorithm for boltzmann machines cognitive science page j albert bay esian selection of loglinear models technical report working paper duke universi ty institute of statistics and decision sciences page n balakrishnan and a cliff ord cohen order statistics inference estimation methods academic press san diego isbn page t bayes an essay towards solving a problem in the doctrine of chances philosophical transations communicated by richard price in a letter to john can ton also available edited by g a barnard in biometrika december page m j beal va riational algorithms for approximate bayesian inference phd thesis gatsby comput ational neuroscience unit university college london page m j beal and z ghahrama ni the variational bayesian em algorithm for incomplete data with application to scoring graphical model structures in j m bernardo m j bayarri j o berger a p d awid d heckerman a f m smith and m west editors bayesian statistics pages oxford university press page m beltrn j garc a abellido j lesgourgues a r liddle and a slosar bayesian model selection and isocurvature perturbations physical review d pages and c h bennett efficient estimation of free energy differences from mon te carlo data journal of computational physics october page b a berg and t neuha us multicanonical ensemble a new approach to simulate firstorder phase transitio ns phys rev lett january page j besag spatial interaction and the statistical an alysis of lattice systems journal of the royal statistical society page j besag and p j green spatial statistics and bayesian computation journal of the royal s tatistical society b page bibliography j besag p green d higdon and k mengersen bayesian computation and stochastic systems statistical science page a beskos o papaspiliopoulos g o roberts and p fearnhead exact and computationally efficient 
likelihoodbased estimation for discretely observed diffusion processes with dis cussion journal of the royal statistical society series b statistical methodolog y page c bischof and m bcker computing derivatives of computer programs in moder n u methods and algorithms of quantum chemistry volume pages page c m bishop pat tern recognition and machine learning springerverlag new york isbn page o capp a guillin jm marin and c p robert population monte carlo journal e of computation al graphical statistics december pages and b p carlin and s chib bayesian model choice via markov chain monte carlo methods journal of the royal statistical soc iety b methodological page g casella and c p robert raoblackwellisation of sampl ing schemes biometrika page s chib marginal likelihood from the gibbs output jou rnal of the american statistical association december page a m childs r b patter son and d j c mackay exact sampling from nonattractive distributions using summa ry states physical review e pages and a christen and c fox mcmc using an approxi mation journal of computational and graphical statistics page j a christen and c fox a selfadjusting multiscale mcmc algorithm poster presented at valencia baye sian meeting benidorm preprint available from jaccimatmx code available from htt pwwwcimatmxjactwalk page p clifford et al discussion on the meeting on the gibbs sampler and other markov chain monte carlo methods journal of the royal statist ical society series b page methodological m k cowles n best k vines and m plumme r rcoda available from httpwwwfisiarcfrcoda pages and bibliography r t cox proba bility frequency and reasonable expectation american journal of physics january pages and r v craiu and c lemieux acceleration of the multipletry metropolis alg orithm using antithetic and stratified sampling statistics and computing page p damien j wakefield and s walker gibbs sampling for bayesian nonconjugate and hie rarchical models by using auxiliary variables journal of the 
royal statistical s ociety series b statistical methodology page s della pietra v j della pietra and j d lafferty inducing features of random fields ieee transactions on pattern an alysis and machine intelligence page p dellaportas and j j forster markov chain monte carlo model determination for hierarchical and graphical loglinear models biometrika pages and l devroye nonuniform random variate generation springerverl ag new york isbn out of print available from httpcgscscarletoncalucrnbookindexht ml page a dobra c tebaldi and m west data augmentation in multiway contingency t ables with fixed marginal totals journal of statistical planning and inference p age p dostert y efendiev t y hou and w luo coarsegradient langevin algorithms fo r dynamic data integration and uncertainty quantification journal of computation al physics page s duane a d kennedy b j pendleton and d roweth hybrid monte carl o physics letters b september page d edwards and t havrnek a fast procedure for model search in multidimensional a contingency tables biometrika august page r g edwards and a d sokal generalizations of the fortuinkasteleynswendsenwang repre sentation and monte carlo algorithm physical review pages and j ferkinghoffborg monte carlo methods in complex systems phd thesis graduate school of biophysics niels bohr institute and rise national laboratory faculty of science university of copenhagen may page c m fortuin and p w kasteleyn on the randomcluster model i introduction and relation to other models physica page bibliography d frenkel speedup of monte carlo simulations by sampling of rejected states proceedings of the national academy of sciences pages and d frenkel and b smit understanding m olecular simulation academic press inc nd edition pages and a e gelfand and a f m smith samplingbased approaches to calculating marginal densities journal of th e american statistical association page a gelman and xl meng simulating normaliz ing constants from importance sampling to 
bridge sampling to path sampling stati stical science pages and a gelman j b carlin h s stern and d b rubin bayesian da ta analysis second edition chapman hallcrc isbn x page d geman and s geman stoch astic relaxation gibbs distribution and bayesian restoration of images ieee tran sactions on pattern analysis and machine intelligence pages and j geweke getting it right joint distribution tests of posterior simulators journal of the americ an statistical association pages and c j geyer markov chain monte carlo maximum likelihood in computing science and statistics proceedings of the rd symposium o n the interface pages page c j geyer practical markov chain monte carlo statisti cal science november w r gilks page derivativefree adaptive rejection sampling f or gibbs sampling in j bernardo j berger a p dawid and a f m smith editors bayes ian statistics oxford university press page w r gilks and p wild adaptive reject ion sampling for gibbs sampling applied statistics page w r gilks g o roberts an d e i george adaptive direction sampling the statistician special issue conferen ce on practical bayesian statistics page v k gore and m r jerrum the swendsenwan g process does not always mix rapidly in proceedings of the twentyninth annual a cm symposium on theory of computing pages acm press new york ny usa page bibliog raphy v k gore and m r jerrum the swendsenwang process does not always mix rapid ly journal of statistical physics page p j green reversible jump markov chain mo nte carlo computation and bayesian model determination biometrika december pages and f hamze and n de freitas hot coupling a particle approach to inference and normalization on pairwise undirected graphs in y weiss b schlkopf and j platt o editors advances in neural information processing systems nips proceedings of th e conference mit press page w k hastings monte carlo sampling methods using mark ov chains and their applications biometrika april pages and d heckerman d m chec kering c meek r 
rounthwaite and c kadie dependency networks for inference collab orative filtering and data visualization journal of machine learning research jm lr page g e hinton training products of experts by minimizing contrastive diverg ence neural computation august pages and g e hinton s osindero and yw teh a fast learning algorithm for deep belief nets neural computation page j d hobby a use rs manual for metapost technical report att bell laboratories murray hill new je rsey page k s v horn constructing a logic of plausible inference a guide to coxs theorem international journal of approximate reasoning page m huber exact sampl ing and approximate counting techniques in th acm symposium on the theory of com puting pages page m huber a bounding chain for swendsenwang random structures al gorithms page y iba population monte carlo algorithms transactions of the japane se society for artificial intelligence a page y iba extended ensemble monte carl o international journal of modern physics c b page w janke and s kappler multibo ndic cluster algorithm for monte carlo simulations of firstorder phase transitio ns physical review letters page c jarzynski equilibrium freeenergy differences f rom nonequilibrium measurements a masterequation approach physical review e nove mber page bibliography e t jaynes probability theory the logic of science princi ples and elementary applications vol cambridge university press april page h jef freys theory of probability oxford university press rd edition edition republish ed in oxford classic texts in the physical sciences page m h kalos and p a whitl ock monte carlo methods volume i basics john wiley isbn page r e kass and a e ra ftery bayes factors journal of the american statistical association page r e kas s b p carlin a gelman and r m neal markov chain monte carlo in practice a roundt able discussion american statistician page s kirkpatrick c d gelatt jr and m p v ecchi optimization by simulated annealing science page m kuss and c e 
rasmussen assessing approximate inference for binary gaussian process classification journ al of machine learning research pages and s l lauritzen and d j spiegelhalter lo cal computations with probabilities on graphical structures and their applicatio n to expert systems with discussion journal of the royal statistical society ser ies b page j s liu metropolized independent sampling with comparisons to rejecti on sampling and importance sampling statistics and computing page j s liu monte carlo strategies in scientific computing springer isbn page j s liu w h wong and a kong covariance structure of the gibbs sampler with applications to the compa risons of estimators and augmentation schemes biometrika page j s liu f liang an d w h wong the use of multipletry method and local optimization in metropolis sa mpling journal of american statistical association pages and a p lyubartsev a a martsinovski s v shevkunov and p n vorontsovvelyaminov new approach to monte car lo calculation of the free energy method of expanded ensembles j chem phys d mac kay nested sampling explanatory illustrations available from httpwwwinferencephy camacukbayesysboxnestedpdf page page bibliography d j c mackay information theor y inference and learning algorithms cambridge university press available from ht tpwwwinferencephycamacukmackayitila pages and e marinari and g parisi simulated tempering a new monte carlo scheme europhysics letters page p marjoram j molitor v plagnol and s tavar markov chain monte carlo without e likelihoods proceeding s of the national academy of sciences pages and a mccallum and b wellner toward conditional models of identity uncertainty with application to proper noun coref erence in ijcai workshop on information integration on the web page i r mcdonald and k singer machine calculation of thermodynamic properties of a simple fluid at supercritical temperatures j chem phys page n metropolis the beginning of the monte carlo method los alamos science page n 
metropolis a w rosenbluth m n rose nbluth a h teller and e teller equation of state calculation by fast computing m achines j chem phys pages and j moller a n pettitt k k berthelsen and r w reeves an efficient markov chain monte carlo method for distributions with intractable normalising constants technical report r department of mathematical sciences aa lborg university pages and j moller a n pettitt r reeves and k k berthelsen biom etrika an efficient markov chain monte carlo method for distributions with intra ctable normalising constants pages and p mukherjee d parkinson and a r liddle a nested sampling algorithm for cosmological model selection the astrophysical jou rnal l page i murray and z ghahramani bayesian learning in undirected graphical models approximate mcmc algorithms in m chickering and j halpern editors proceed ings of the th annual conference on