Markov Chains
A Markov chain is a special type of stochastic process. The standard definition of a stochastic process is an ordered collection of random variables:
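$$\{X_t : t \in T\}$$
where $t$ is an index (frequently, but not necessarily, time). If we think of $X_t$ as a state $X$ at time $t$, and impose the following dependence condition on each state:
$$Pr(X_{t+1}=x_{t+1} \mid X_t=x_t, X_{t-1}=x_{t-1}, \ldots, X_0=x_0) = Pr(X_{t+1}=x_{t+1} \mid X_t=x_t)$$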
then the stochastic process is known as a Markov chain. This conditioning specifies that the future depends on the current state, but not past states. Thus, the Markov chain wanders about the state space, remembering only where it has just been in the last time step. The collection of transition probabilities is sometimes called a transition matrix when dealing with discrete states, or more generally, a transition kernel. In the context of Markov chain Monte Carlo, it is useful to think of the Markovian property as "mild non-independence". MCMC allows us to indirectly generate samples from a particular posterior distribution.
Markovian Jargon
Before we move on, it is important to define some general properties of Markov chains. They are frequently encountered in the MCMC literature, and some will help us decide whether MCMC is producing a useful sample from the posterior.
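Irreducibility:
A Markov chain is irreducible if every state can be reached, in a finite number of steps, from any other state.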
Recurrence:
States which are visited repeatedly are recurrent. If the expected time to return to a particular state is bounded, this is known as positive recurrence, otherwise the recurrent state is null recurrent. Further, a chain is Harris recurrent when it visits all states $X \in S$ infinitely often in the limit as $t \to \infty$; this is an important characteristic when dealing with unbounded, continuous state spaces. Whenever a chain ends up in a closed, irreducible set of Harris recurrent states, it stays there forever and visits every state with probability one.
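Stationarity:
A stationary distribution is left unchanged when multiplied by the transition kernel. If $P$ is some $n \times n$ transition matrix, then $\pi$ is stationary when:
$$\pi P = \pi$$
Thus, $\pi$ no longer depends on the time step, and is referred to as the limiting distribution of the chain. In MCMC, the chain explores the state space according to its limiting marginal distribution.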
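The stationarity condition can be checked numerically for a small discrete chain. The sketch below uses a made-up 3-state transition matrix (the values in `P` are assumptions chosen for illustration, not taken from the text) and approximates the limiting distribution by repeatedly applying the transition matrix to a starting distribution.

```python
# Hypothetical 3-state transition matrix; each row sums to one.
P = [
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]


def step(dist, P):
    """Propagate a row vector of state probabilities one step (dist times P)."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, n_steps=200):
    """Approximate the limiting distribution by repeated multiplication."""
    dist = [1.0] + [0.0] * (len(P) - 1)  # start with all mass on state 0
    for _ in range(n_steps):
        dist = step(dist, P)
    return dist


pi = stationary(P)
pi_next = step(pi, P)  # one more transition should leave pi unchanged
print(pi)
print(pi_next)
```

The two printed vectors should agree to numerical precision, which is the statement $\pi P = \pi$ for this particular matrix.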
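Ergodicity:
Ergodicity is an emergent property of Markov chains which are irreducible, positive Harris recurrent and aperiodic. Ergodicity is defined as:
$$\lim_{n \to \infty} Pr^{(n)}(\theta_i \rightarrow \theta_j) = \pi(\theta_j) \quad \forall\, \theta_i, \theta_j \in \Theta$$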
or in words, after many steps the marginal distribution of the chain is the same at one step as at all other steps. This implies that our Markov chain, which we recall is dependent, can generate samples that are approximately independent if we wait long enough between samples. If it means anything to you, ergodicity is the analogue of the strong law of large numbers for Markov chains. For example, take values $\theta_{i+1}, \ldots, \theta_{i+n}$ from a chain that has reached an ergodic state. A statistic of interest can then be estimated by:
$$\frac{1}{n}\sum_{j=i+1}^{i+n} h(\theta_j) \approx \int f(\theta)\, h(\theta)\, d\theta$$
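As a small illustration of this estimator, the sketch below simulates a two-state chain and compares the ergodic average of $h(\theta) = \theta$ with its exact expectation under the stationary distribution. The transition probabilities, chain length and burn-in are arbitrary assumptions made for the example.

```python
import random


def simulate_chain(n_steps, p01=0.3, p10=0.1, start=0, seed=42):
    """Simulate a two-state chain with P(0 -> 1) = p01 and P(1 -> 0) = p10."""
    rng = random.Random(seed)
    state = start
    states = []
    for _ in range(n_steps):
        u = rng.random()
        if state == 0:
            state = 1 if u < p01 else 0
        else:
            state = 0 if u < p10 else 1
        states.append(state)
    return states


def ergodic_average(states, h, burn_in):
    """Average h over the chain after discarding the first burn_in draws."""
    kept = states[burn_in:]
    return sum(h(s) for s in kept) / len(kept)


states = simulate_chain(200_000)
estimate = ergodic_average(states, h=lambda s: s, burn_in=1_000)
exact = 0.3 / (0.3 + 0.1)  # stationary probability of state 1
print(estimate, exact)
```

Even though successive states are dependent, the long-run average settles near the stationary expectation.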
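Invariance of a distribution $\pi$ with respect to a Markov chain with transition kernel $Pr(y \mid x)$ implies that:
$$\int_x Pr(y \mid x)\, \pi(x)\, dx = \pi(y).$$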
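Invariance is guaranteed for any reversible Markov chain. Consider a Markov chain in reverse sequence:
$$\{\theta^{(n)}, \theta^{(n-1)}, \ldots, \theta^{(0)}\}.$$
This sequence is still Markovian, because:
$$Pr(\theta^{(k)}=y \mid \theta^{(k+1)}=x, \theta^{(k+2)}=x_1, \ldots) = Pr(\theta^{(k)}=y \mid \theta^{(k+1)}=x)$$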
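Forward and reverse transition probabilities may be related through Bayes theorem:
$$Pr(\theta^{(k)}=y \mid \theta^{(k+1)}=x) = \frac{Pr(\theta^{(k+1)}=x \mid \theta^{(k)}=y)\, \pi^{(k)}(y)}{\pi^{(k+1)}(x)}$$
If $\pi^{(i)} = \pi$ for some $i < k$, the reverse chain is homogeneous, and the chain is called reversible because it satisfies the detailed balance equation:
$$\pi(x)\, Pr(y \mid x) = \pi(y)\, Pr(x \mid y)$$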
Reversibility is important because it has the effect of balancing movement through the entire state space. When a Markov chain is reversible, $\pi$ is the stationary distribution of that chain. Hence, if $\pi$ is of interest, we need only find a reversible Markov chain for which $\pi$ is the limiting distribution. This is what MCMC does!
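To make the detailed balance equation concrete, the following sketch builds a Metropolis-style transition matrix for a small discrete target and checks both detailed balance and invariance numerically. The four target weights and the uniform proposal over the other states are assumptions made purely for illustration.

```python
# Hypothetical unnormalized target over four discrete states
weights = [1.0, 2.0, 4.0, 3.0]
total = sum(weights)
pi = [w / total for w in weights]
n = len(pi)

# Metropolis kernel: propose one of the other states uniformly (a symmetric
# proposal), then accept with probability min(1, pi[y] / pi[x]).
P = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if x != y:
            propose = 1.0 / (n - 1)
            accept = min(1.0, pi[y] / pi[x])
            P[x][y] = propose * accept
    P[x][x] = 1.0 - sum(P[x])  # rejected proposals leave the state unchanged


def max_balance_violation(pi, P):
    """Largest absolute difference between pi(x)P(x,y) and pi(y)P(y,x)."""
    size = len(pi)
    return max(abs(pi[x] * P[x][y] - pi[y] * P[y][x])
               for x in range(size) for y in range(size))


def apply_kernel(pi, P):
    """One transition applied to the distribution pi."""
    size = len(pi)
    return [sum(pi[x] * P[x][y] for x in range(size)) for y in range(size)]


violation = max_balance_violation(pi, P)
pi_after = apply_kernel(pi, P)
print(violation)
print(pi, pi_after)
```

The violation should be zero up to rounding, and applying the kernel should return the same distribution: detailed balance implies invariance.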
Gibbs Sampling
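The Gibbs sampler is an MCMC algorithm that updates one parameter at a time. If a posterior has $k$ parameters, we may condition each parameter on current values of the other $k-1$ parameters, and sample from the resultant distributional form (usually easier), and repeat this operation on the other parameters in turn. This procedure generates samples from the posterior distribution. Note that we have now combined Markov chains (conditional independence) and Monte Carlo techniques (estimation by simulation) to yield Markov chain Monte Carlo. Here is a stereotypical Gibbs sampling algorithm:

1. Choose starting values for states (parameters): $\theta = [\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_k^{(0)}]$
2. Initialize counter $j = 1$
3. Draw the following values from each of the $k$ conditional distributions:
$$\begin{aligned}
\theta_1^{(j)} &\sim \pi(\theta_1 \mid \theta_2^{(j-1)}, \theta_3^{(j-1)}, \ldots, \theta_{k-1}^{(j-1)}, \theta_k^{(j-1)}) \\
\theta_2^{(j)} &\sim \pi(\theta_2 \mid \theta_1^{(j)}, \theta_3^{(j-1)}, \ldots, \theta_{k-1}^{(j-1)}, \theta_k^{(j-1)}) \\
\theta_3^{(j)} &\sim \pi(\theta_3 \mid \theta_1^{(j)}, \theta_2^{(j)}, \ldots, \theta_{k-1}^{(j-1)}, \theta_k^{(j-1)}) \\
&\;\;\vdots \\
\theta_{k-1}^{(j)} &\sim \pi(\theta_{k-1} \mid \theta_1^{(j)}, \theta_2^{(j)}, \ldots, \theta_{k-2}^{(j)}, \theta_k^{(j-1)}) \\
\theta_k^{(j)} &\sim \pi(\theta_k \mid \theta_1^{(j)}, \theta_2^{(j)}, \ldots, \theta_{k-2}^{(j)}, \theta_{k-1}^{(j)})
\end{aligned}$$
4. Increment $j$ and repeat until convergence occurs.

As we can see from the algorithm, each distribution is conditioned on the last iteration of its chain values, constituting a Markov chain as advertised. The Gibbs sampler has all of the important properties outlined in the previous section: it is aperiodic, homogeneous and ergodic. Once the sampler converges, all subsequent samples are from the target distribution. This convergence occurs at a geometric rate.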
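The algorithm above can be sketched for a case where the full conditionals are known in closed form. The example below is an illustration only, not part of the original analysis: it assumes a standard bivariate normal target with correlation `rho`, for which each conditional is $N(\rho \cdot \text{other}, 1 - \rho^2)$. The function names and the choice of target are hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(n_iterations, rho, start=(0.0, 0.0), seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    x, y = start
    trace = []
    for _ in range(n_iterations):
        # Each draw conditions on the most recent value of the other parameter
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        trace.append((x, y))
    return trace


def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)


trace = gibbs_bivariate_normal(20_000, rho=0.8)
kept = trace[1_000:]  # discard early draws
mean_x = sum(p[0] for p in kept) / len(kept)
corr = sample_correlation(kept)
print(mean_x, corr)
```

After discarding the early draws, the sample mean and correlation should be close to 0 and 0.8 respectively.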
As an example, consider annual counts of UK coal mining disasters from 1851 to 1962. Occurrences of disasters in the time series are thought to be derived from a Poisson process with a large rate parameter in the early part of the time series, and from one with a smaller rate in the later part. We are interested in locating the change point in the series, which perhaps is related to changes in mining safety regulations.
In [1]:
import numpy as np

disasters_array = np.array([4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
                            3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
                            2, 2, 3, 4, 2, 1, 3, 2, 2, 1, 1, 1, 1, 3, 0, 0,
                            1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
                            0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
                            3, 3, 1, 1, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
                            0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
In [2]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12.5, 3.5))
n_count_data = len(disasters_array)
plt.bar(np.arange(1851, 1962), disasters_array, color="#348ABD")
plt.xlabel("Year")
plt.ylabel("Disasters")
plt.title("UK coal mining disasters, 1851-1962")
plt.xlim(1851, 1962);
We are going to use Poisson random variables for this type of count data. Denoting year $i$'s accident count by $y_i$,
$$y_i \sim \text{Poisson}(\lambda)$$
The modeling problem is to estimate the values of the $\lambda$ parameters.
Looking at the time series above, it appears that the rate declines later in the time series.
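A changepoint model identifies a point (year) during the observation period (call it $\tau$) after which the rate changes. So we are estimating two $\lambda$ parameters: one for the early period and another for the late period.
$$\lambda = \begin{cases} \lambda_1 & \text{if } t \le \tau \\ \lambda_2 & \text{if } t > \tau \end{cases}$$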
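We need to assign prior probabilities to both $\lambda$ parameters. The gamma distribution not only provides a continuous density function for positive numbers, but it is also conjugate with the Poisson sampling distribution. We will specify suitably vague hyperparameters $\alpha$ and $\beta$ for both priors:
$$\lambda_1 \sim \text{Gamma}(\alpha, \beta), \qquad \lambda_2 \sim \text{Gamma}(\alpha, \beta)$$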
Since we do not have any intuition about the location of the changepoint (prior to viewing the data), we will assign a discrete uniform prior over all years 1851-1962.
This gives:
$$P(\lambda_1, \lambda_2, \tau \mid y) \propto P(y \mid \lambda_1, \lambda_2, \tau)\, P(\lambda_1, \lambda_2, \tau)$$
To employ Gibbs sampling, we need to factor the joint posterior into the product of conditional expressions:
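$$\begin{aligned}
P(\lambda_1, \lambda_2, \tau \mid y) &\propto \left[\prod_{t=1851}^{\tau} \text{Poi}(y_t \mid \lambda_1) \prod_{t=\tau+1}^{1962} \text{Poi}(y_t \mid \lambda_2)\right] \text{Gamma}(\lambda_1 \mid \alpha, \beta)\, \text{Gamma}(\lambda_2 \mid \alpha, \beta)\, \frac{1}{111} \\
&\propto \left[\prod_{t=1851}^{\tau} e^{-\lambda_1} \lambda_1^{y_t} \prod_{t=\tau+1}^{1962} e^{-\lambda_2} \lambda_2^{y_t}\right] \lambda_1^{\alpha-1} e^{-\beta\lambda_1}\, \lambda_2^{\alpha-1} e^{-\beta\lambda_2} \\
&\propto \lambda_1^{\sum_{t=1851}^{\tau} y_t + \alpha - 1}\, e^{-(\beta + \tau - 1850)\lambda_1}\, \lambda_2^{\sum_{t=\tau+1}^{1962} y_t + \alpha - 1}\, e^{-(\beta + 1962 - \tau)\lambda_2}
\end{aligned}$$
Here $\tau - 1850$ and $1962 - \tau$ are the numbers of years in the early and late periods, respectively.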
So, the full conditionals are known, and critically for Gibbs, can easily be sampled from.
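$$\lambda_1 \sim \text{Gamma}\left(\sum_{t=1851}^{\tau} y_t + \alpha,\; \tau - 1850 + \beta\right)$$
$$\lambda_2 \sim \text{Gamma}\left(\sum_{t=\tau+1}^{1962} y_t + \alpha,\; 1962 - \tau + \beta\right)$$
$$\tau \sim \text{Categorical}\left(\frac{\lambda_1^{\sum_{t=1851}^{\tau} y_t + \alpha - 1}\, e^{-(\beta + \tau - 1850)\lambda_1}\, \lambda_2^{\sum_{t=\tau+1}^{1962} y_t + \alpha - 1}\, e^{-(\beta + 1962 - \tau)\lambda_2}}{\sum_{k=1851}^{1962} \lambda_1^{\sum_{t=1851}^{k} y_t + \alpha - 1}\, e^{-(\beta + k - 1850)\lambda_1}\, \lambda_2^{\sum_{t=k+1}^{1962} y_t + \alpha - 1}\, e^{-(\beta + 1962 - k)\lambda_2}}\right)$$
In the code that follows, years are indexed from zero, so the number of years in the early period plays the role of the index `tau`, and the number in the late period that of `n_count_data - tau`.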
Implementing this in Python requires random number generators for both the gamma and categorical distributions. We can leverage NumPy for this:
In [3]:
# Function to draw random gamma variate
rgamma = np.random.gamma

# Function to draw random categorical variate
rcategorical = lambda probs, n=None: np.array(probs).cumsum().searchsorted(np.random.sample(n))
Next, in order to generate probabilities for the conditional posterior of $\tau$, we need the kernel of the gamma density:
In [4]:
dgamma = lambda lam, a, b: lam**(a - 1) * np.exp(-b*lam)
Diffuse hyperpriors for the gamma priors on $\lambda_1, \lambda_2$:
In [5]:
alpha, beta = 1., 10
For computational efficiency, it is best to pre-allocate memory to store the sampled values. We need 3 arrays, each with length equal to the number of iterations we plan to run:
In [6]:
# Specify number of iterations
n_iterations = 1000

# Initialize trace of samples
lambda1, lambda2, tau = np.empty((3, n_iterations + 1))
The penultimate step initializes the model parameters to arbitrary values:
In [7]:
lambda1[0] = 60
lambda2[0] = 2
tau[0] = 50
Now we can run the Gibbs sampler.
In [8]:
# Sample from conditionals
for i in range(n_iterations):

    # Current changepoint as an integer index (the trace array stores floats)
    t_cur = int(tau[i])

    # Sample early mean
    lambda1[i+1] = rgamma(disasters_array[:t_cur].sum() + alpha, 1./(t_cur + beta))

    # Sample late mean
    lambda2[i+1] = rgamma(disasters_array[t_cur:].sum() + alpha,
                          1./(n_count_data - t_cur + beta))

    # Sample changepoint
    p = np.array([dgamma(lambda1[i+1], disasters_array[:t].sum() + alpha, t + beta) *
                  dgamma(lambda2[i+1], disasters_array[t:].sum() + alpha, n_count_data - t + beta)
                  for t in range(n_count_data)])
    tau[i+1] = rcategorical(p/p.sum())
Plotting the trace and histogram of the samples reveals the marginal posteriors of each parameter in the model.
In [9]:
for samples in lambda1, lambda2, tau:
    fig, axes = plt.subplots(1, 2)
    axes[0].plot(samples[100:])
    # Integer division keeps the slice index an int
    axes[1].hist(samples[n_iterations//2:])
The Metropolis-Hastings Algorithm
The Gibbs sampler requires the full conditional distributions of the parameters, but posterior conditionals cannot always be neatly specified. In contrast to the Gibbs algorithm, the Metropolis-Hastings algorithm generates candidate state transitions from an alternate distribution, and accepts or rejects each candidate probabilistically. Let us first consider a simple Metropolis-Hastings algorithm for a single parameter, $\theta$. We will use a standard sampling distribution, referred to as the proposal distribution, to produce candidate variables $q_t(\theta' \mid \theta)$. That is, the generated value, $\theta'$, is a possible next value for $\theta$ at step $t+1$. We also need to be able to calculate the probability of moving back to the original value from the candidate, or $q_t(\theta \mid \theta')$. These probabilistic ingredients are used to define an acceptance ratio:
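$$a(\theta', \theta) = \frac{q_t(\theta \mid \theta')\, \pi(\theta')}{q_t(\theta' \mid \theta)\, \pi(\theta)}$$
The value of $\theta^{(t+1)}$ is then determined by:
$$\theta^{(t+1)} = \begin{cases} \theta' & \text{with probability } \min(a(\theta', \theta^{(t)}), 1) \\ \theta^{(t)} & \text{with probability } 1 - \min(a(\theta', \theta^{(t)}), 1) \end{cases}$$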
This transition kernel implies that movement is not guaranteed at every step. It only occurs if the suggested transition is likely based on the acceptance ratio. A single iteration of the Metropolis-Hastings algorithm proceeds as follows:

1. Sample $\theta'$ from $q(\theta' \mid \theta^{(t)})$.
2. Generate a Uniform[0,1] random variate $u$.
3. If $a(\theta', \theta^{(t)}) > u$ then $\theta^{(t+1)} = \theta'$, otherwise $\theta^{(t+1)} = \theta^{(t)}$.

The original form of the algorithm specified by Metropolis required that $q_t(\theta' \mid \theta) = q_t(\theta \mid \theta')$, which reduces $a(\theta', \theta)$ to $\pi(\theta')/\pi(\theta)$, but this is not necessary. In either case, the state moves to high-density points in the distribution with high probability, and to low-density points with low probability. After convergence, the Metropolis-Hastings algorithm describes the full target posterior density, so all points are recurrent.
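A single-parameter version of these steps can be sketched as follows. This is an illustrative example under stated assumptions, not the implementation used later in this document: the target is an unnormalized Gamma(shape 3, rate 2) density, and the proposal is a log-normal centred on the current value, which is asymmetric, so the proposal ratio $q(\theta \mid \theta') / q(\theta' \mid \theta) = \theta'/\theta$ must appear in the acceptance ratio. All names here are hypothetical.

```python
import math
import random


def log_target(theta):
    """Unnormalized log-density of an assumed Gamma(shape=3, rate=2) target."""
    return 2.0 * math.log(theta) - 2.0 * theta


def mh_step(theta, sigma, rng):
    """One Metropolis-Hastings iteration with a log-normal proposal."""
    # 1. Sample a candidate from q(candidate | theta)
    candidate = theta * math.exp(rng.gauss(0.0, sigma))
    # Log acceptance ratio: target ratio plus the proposal (Hastings) ratio,
    # which equals log(candidate / theta) for this proposal
    log_a = (log_target(candidate) - log_target(theta)
             + math.log(candidate) - math.log(theta))
    # 2. Generate a uniform variate, 3. accept if a > u
    u = rng.random()
    if u < math.exp(min(0.0, log_a)):
        return candidate, True
    return theta, False


def metropolis_hastings(n_iterations, start=1.0, sigma=1.0, seed=3):
    rng = random.Random(seed)
    theta = start
    samples, n_accepted = [], 0
    for _ in range(n_iterations):
        theta, accepted = mh_step(theta, sigma, rng)
        samples.append(theta)
        n_accepted += accepted
    return samples, n_accepted


samples, n_accepted = metropolis_hastings(50_000)
kept = samples[1_000:]
posterior_mean = sum(kept) / len(kept)
print(posterior_mean, n_accepted / len(samples))
```

The sample mean should be close to the target mean of 1.5 (shape divided by rate). Dropping the `math.log(candidate) - math.log(theta)` term would bias the draws, because the log-normal proposal is not symmetric.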
Random-walk Metropolis-Hastings
A practical implementation of the Metropolis-Hastings algorithm makes use of a random-walk proposal. Recall that a random walk is a Markov chain that evolves according to:
$$\theta^{(t+1)} = \theta^{(t)} + \epsilon_t$$
$$\epsilon_t \sim f(\phi)$$
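As applied to MCMC sampling, the random walk is used as a proposal distribution, whereby dependent proposals are generated according to:
$$q(\theta' \mid \theta^{(t)}) = f(\theta' - \theta^{(t)}), \quad \text{that is,} \quad \theta' = \theta^{(t)} + \epsilon_t$$
When the density generating $\epsilon_t$ is symmetric about zero, this implies that $q(\theta' \mid \theta^{(t)}) = q(\theta^{(t)} \mid \theta')$, which reduces the Metropolis-Hastings acceptance ratio to:
$$a(\theta', \theta) = \frac{\pi(\theta')}{\pi(\theta)}$$
The random walk distribution for $\epsilon_t$ is frequently a normal density, but it may be any distribution that generates an irreducible proposal chain.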
An important consideration is the specification of the scale parameter for the random walk error distribution. Large values produce random walk steps that are highly exploratory, but tend to produce proposal values in the tails of the target distribution, potentially resulting in very small acceptance rates. Conversely, small values tend to be accepted more frequently, since they tend to produce proposals close to the current parameter value, but may result in chains that mix very slowly. Some simulation studies suggest optimal acceptance rates in the range of 20-50%. It is often worthwhile to optimize the proposal variance by iteratively adjusting its value, according to observed acceptance rates early in the MCMC simulation.
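The effect of the proposal scale on the acceptance rate can be seen in a toy setting. The sketch below assumes a standard normal target and a normal random-walk proposal; the three scales are arbitrary choices meant to represent "too small", "moderate" and "too large".

```python
import math
import random


def log_target(x):
    """Log-density of a standard normal target, up to a constant."""
    return -0.5 * x * x


def rw_acceptance_rate(scale, n_iterations=5_000, seed=0):
    """Fraction of accepted proposals for a random-walk Metropolis sampler."""
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    for _ in range(n_iterations):
        candidate = x + rng.gauss(0.0, scale)  # symmetric proposal
        log_a = log_target(candidate) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_a)):
            x = candidate
            accepted += 1
    return accepted / n_iterations


rates = {scale: rw_acceptance_rate(scale) for scale in (0.05, 2.4, 50.0)}
for scale, rate in rates.items():
    print(scale, rate)
```

The smallest scale accepts nearly every proposal but moves very little per step, the largest rejects almost everything, and the intermediate scale falls within the 20-50% range mentioned above.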
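As an example, consider a simple linear model relating price $p_i$ to age $a_i$, with intercept $\beta_0$, slope $\beta_1$ and precision $\tau$:
$$\mu_i = \beta_0 + \beta_1 a_i$$
$$p_i \sim N(\mu_i, \tau)$$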
In [10]:
age = np.array([13, 14, 14, 12, 9, 15, 10, 14, 9, 14, 13, 12, 9, 10, 15, 11,
                15, 11, 7, 13, 13, 10, 9, 6, 11, 15, 13, 10, 9, 9, 15, 14,
                14, 10, 14, 11, 13, 14, 10])

price = np.array([2950, 2300, 3900, 2800, 5000, 2999, 3950, 2995, 4500, 2800,
                  1990, 3500, 5100, 3900, 2900, 4950, 2000, 3400, 8999, 4000,
                  2950, 3250, 3950, 4600, 4500, 1600, 3900, 4200, 6500, 3500,
                  2999, 2600, 3250, 2500, 2400, 3990, 4600, 450, 4700])/1000.
This function calculates the joint log-posterior, conditional on values for each parameter:
In [11]:
from scipy.stats import distributions
dgamma = distributions.gamma.logpdf
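The samplers that follow call a function named `calc_posterior` with three arguments (intercept, slope, precision). As a rough guide to what such a function might compute, here is a minimal standard-library sketch. The priors (normal with standard deviation 100 on the intercept and slope, Gamma(0.001, 0.001) on the precision), the name `calc_posterior_sketch`, and the explicit data arguments are all assumptions made for illustration, and only the first five observations of the age and price data are used.

```python
import math


def norm_logpdf(x, mu, sd):
    """Log-density of a normal distribution with mean mu and std. dev. sd."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sd)
            - 0.5 * ((x - mu) / sd) ** 2)


def gamma_logpdf(x, shape, rate):
    """Log-density of a gamma distribution in the shape/rate parameterization."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def calc_posterior_sketch(a, b, t, y, x):
    """Joint log-posterior for intercept a, slope b and precision t."""
    if t <= 0:
        return -math.inf  # precision must be positive
    # Assumed vague priors (hypothetical choices)
    logp = norm_logpdf(a, 0.0, 100.0) + norm_logpdf(b, 0.0, 100.0)
    logp += gamma_logpdf(t, 0.001, 0.001)
    # Normal likelihood with mean a + b*x_i and precision t (sd = t**-0.5)
    sd = t ** -0.5
    logp += sum(norm_logpdf(yi, a + b * xi, sd) for yi, xi in zip(y, x))
    return logp


# First five observations of the age and price (in $1000's) data
age_subset = [13, 14, 14, 12, 9]
price_subset = [2.95, 2.3, 3.9, 2.8, 5.0]

good = calc_posterior_sketch(8.0, -0.4, 1.0, price_subset, age_subset)
bad = calc_posterior_sketch(0.0, 5.0, 1.0, price_subset, age_subset)
print(good, bad)
```

Parameter values that track the data receive a much higher log-posterior than implausible ones. To obtain a three-argument function of the form the samplers expect, the data could be bound with default arguments or `functools.partial`.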
In [12]:
rnorm = np.random.normal
runif = np.random.rand

def metropolis(n_iterations, initial_values, prop_var=1):

    n_params = len(initial_values)

    # Initial proposal standard deviations
    prop_sd = [prop_var]*n_params

    # Initialize trace for parameters
    trace = np.empty((n_iterations+1, n_params))

    # Set initial values
    trace[0] = initial_values

    # Calculate joint posterior for initial values
    current_log_prob = calc_posterior(*trace[0])

    # Initialize acceptance counts
    accepted = [0]*n_params

    for i in range(n_iterations):

        if not i%1000: print('Iteration', i)

        # Grab current parameter values
        current_params = trace[i]

        for j in range(n_params):

            # Get current value for parameter j
            p = trace[i].copy()

            # Propose new value
            if j==2:
                # Ensure tau is positive
                theta = np.exp(rnorm(np.log(current_params[j]), prop_sd[j]))
            else:
                theta = rnorm(current_params[j], prop_sd[j])

            # Insert new value
            p[j] = theta

            # Calculate log posterior with proposed value
            proposed_log_prob = calc_posterior(*p)

            # Log-acceptance rate
            alpha = proposed_log_prob - current_log_prob

            # Sample a uniform random variate
            u = runif()

            # Test proposed value
            if np.log(u) < alpha:
                # Accept
                trace[i+1,j] = theta
                current_log_prob = proposed_log_prob
                accepted[j] += 1
            else:
                # Reject
                trace[i+1,j] = trace[i,j]

    return trace, accepted
Let's run the MH algorithm with a very small proposal variance:
In [13]:
n_iter = 10000
trace, acc = metropolis(n_iter, (1,0,1), 0.001)

Iteration 0
Iteration 1000
Iteration 2000
Iteration 3000
Iteration 4000
Iteration 5000
Iteration 6000
Iteration 7000
Iteration 8000
Iteration 9000
In [15]:
np.array(acc, float)/n_iter
Out[15]: array([ 0.9783, 0.97 , 0.9606])
In [16]:
for param, samples in zip(['intercept', 'slope', 'precision'], trace.T):
    fig, axes = plt.subplots(1, 2)
    axes[0].plot(samples)
    axes[0].set_ylabel(param)
    axes[1].hist(samples[n_iter//2:])
In [17]:
trace_hivar, acc = metropolis(n_iter, (1,0,1), 10)

Iteration 0
Iteration 1000
Iteration 2000
Iteration 3000
Iteration 4000
Iteration 5000
Iteration 6000
Iteration 7000
Iteration 8000
Iteration 9000
In [19]:
np.array(acc, float)/n_iter
Out[19]: array([ 0.0269, 0.0023, 0.0078])
In [18]:
for param, samples in zip(['intercept', 'slope', 'precision'], trace_hivar.T):
    fig, axes = plt.subplots(1, 2)
    axes[0].plot(samples)
    axes[0].set_ylabel(param)
    axes[1].hist(samples[n_iter//2:])
In order to avoid having to set the proposal variance by trial-and-error, we can add some tuning logic to the algorithm. The following implementation of Metropolis-Hastings reduces the proposal standard deviation by 10% when the acceptance rate is low (below 20%), and increases it by 10% when the acceptance rate is high (above 50%).
In [20]:
def metropolis_tuned(n_iterations, initial_values, prop_var=1,
                     tune_for=None, tune_interval=100):

    n_params = len(initial_values)

    # Initial proposal standard deviations
    prop_sd = [prop_var] * n_params

    # Initialize trace for parameters
    trace = np.empty((n_iterations+1, n_params))

    # Set initial values
    trace[0] = initial_values

    # Initialize acceptance counts
    accepted = [0]*n_params

    # Calculate joint posterior for initial values
    current_log_prob = calc_posterior(*trace[0])

    if tune_for is None:
        # Integer division so the value can be used as a slice index
        tune_for = n_iterations//2

    for i in range(n_iterations):

        if not i%1000: print('Iteration', i)

        # Grab current parameter values
        current_params = trace[i]

        for j in range(n_params):

            # Get current value for parameter j
            p = trace[i].copy()

            # Propose new value
            if j==2:
                # Ensure tau is positive
                theta = np.exp(rnorm(np.log(current_params[j]), prop_sd[j]))
            else:
                theta = rnorm(current_params[j], prop_sd[j])

            # Insert new value
            p[j] = theta

            # Calculate log posterior with proposed value
            proposed_log_prob = calc_posterior(*p)

            # Log-acceptance rate
            alpha = proposed_log_prob - current_log_prob

            # Sample a uniform random variate
            u = runif()

            # Test proposed value
            if np.log(u) < alpha:
                # Accept
                trace[i+1,j] = theta
                current_log_prob = proposed_log_prob
                accepted[j] += 1
            else:
                # Reject
                trace[i+1,j] = trace[i,j]

            # Tune every tune_interval iterations during the tuning phase
            if (not (i+1) % tune_interval) and (i < tune_for):

                # Calculate acceptance rate
                acceptance_rate = (1.*accepted[j])/tune_interval
                if acceptance_rate<0.2:
                    prop_sd[j] *= 0.9
                elif acceptance_rate>0.5:
                    prop_sd[j] *= 1.1
                accepted[j] = 0

    return trace[tune_for:], accepted
In [21]:
trace_tuned, acc = metropolis_tuned(n_iter, (1,0,1), tune_for=9000)

Iteration 0
Iteration 1000
Iteration 2000
Iteration 3000
Iteration 4000
Iteration 5000
Iteration 6000
Iteration 7000
Iteration 8000
Iteration 9000
In [22]:
np.array(acc, float)/n_iter
Out[22]: array([ 0.0321, 0.0304, 0.0357])
In [23]:
for param, samples in zip(['intercept', 'slope', 'precision'], trace_tuned.T):
    fig, axes = plt.subplots(1, 2)
    axes[0].plot(samples)
    axes[0].set_ylabel(param)
    axes[1].hist(samples[len(samples)//2:])
In [24]:
plt.plot(age, price, 'bo')
plt.xlabel('age (years)'); plt.ylabel('price ($1000\'s)')
xvals = np.linspace(age.min(), age.max())
for i in range(50):
    b0, b1, tau = trace_tuned[np.random.randint(0, 1000)]
    plt.plot(xvals, b0 + b1*xvals, 'r', alpha=0.2)
In this dataset, `log_dose` includes 4 levels of dosage, on the log scale, each administered to 5 rats during the experiment. The response variable is `deaths`, the number of positive responses to the dosage.
The number of deaths can be modeled as a binomial response, with the probability of death being a linear function of dose:
$$y_i \sim \text{Bin}(n_i, p_i)$$
$$\text{logit}(p_i) = a + b x_i$$
The common statistic of interest in such experiments is the LD50, the dosage at which the probability of death is 50%. Use Metropolis-Hastings sampling to fit a Bayesian model to analyze this bioassay data, and to estimate LD50.
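For reference, the LD50 follows directly from the model above once $a$ and $b$ are known. A short derivation, assuming $b \neq 0$ and writing $x_{LD50}$ (a symbol introduced here) for the log-dose at which the probability of death is one half:

```latex
\begin{aligned}
\text{logit}(0.5) &= \log\frac{0.5}{1 - 0.5} = 0 \\
0 &= a + b\, x_{LD50} \\
x_{LD50} &= -\frac{a}{b}
\end{aligned}
```

Each posterior draw of $(a, b)$ therefore yields a draw of the LD50 on the log-dose scale, so its posterior distribution can be summarized from the sampled values of $-a/b$.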
In [28]:
# Log dose in each group
log_dose = [-.86, -.3, -.05, .73]

# Sample size in each group
n = 5

# Outcomes
deaths = [0, 1, 3, 5]
In [25]:
%load ../examples/bioassay.py
Slice Sampling
Though Gibbs sampling is very computationally efficient, it can be difficult to implement in a general way, whereas the Metropolis-Hastings algorithm is relatively inefficient, while being easy to implement for a variety of models. We have seen that it is possible to tune Metropolis samplers, but it would be nice to have a "black-box" method that works for arbitrary continuous distributions, which we may know little about a priori.
The slice sampler bridges this gap by being both efficient and easy to program generally. The idea is to first sample from the conditional distribution for $y$ given some current value of $x$, which is uniform on $(0, f(x))$, then sample $x$ conditional on the new value for $y$, which is uniform on a region defined by the characteristics of the posterior. The steps required to perform a single iteration of the slice sampler to update the current value of $x_i$ are as follows:

1. Sample $y$ uniformly on $(0, f(x_i))$.
2. Use this value of $y$ to define a horizontal slice $S = \{x : y < f(x)\}$.
3. Establish an interval $I$ around $x_i$ that contains most of the slice.
4. Sample $x_{i+1}$ from the part of the slice that lies within $I$.
Hence, slice sampling employs an auxiliary variable ($y$) that is not retained at the end of the iteration. Note that in practice one may operate on the log scale such that $g(x) = \log(f(x))$ to avoid numerical underflow. In this case, the auxiliary variable becomes $z = \log(y) = g(x_i) - e$, where $e \sim \text{Exp}(1)$, resulting in the slice $S = \{x : z < g(x)\}$.
There are many ways of establishing and sampling from the interval $I$, with the only restriction being that the resulting Markov chain leaves $f(x)$ invariant. The objective is to include as much of the slice as possible, so that the potential step size can be large, but not (much) larger than the slice, so that the sampling of invalid points is minimized. Ideally, we would like it to be the slice itself, but it may not always be feasible to determine (and certainly not automatically).
Stepping out
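One method for determining a sampling interval for $x_{i+1}$ involves specifying an initial "guess" at the slice width $w$, and iteratively moving the endpoints out (growing the interval) until either (1) the interval reaches a maximum pre-specified width or (2) $y$ is no longer less than $f(x)$ evaluated at both the left and the right interval endpoints.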
In [32]:
def step_out(func, x0, y, w, m=np.inf):
    """
    func: target function (evaluates at x)
    x0: current value of x
    y: y-value that defines slice
    w: estimate of slice width
    m: factor limiting interval to size m*w (defaults to infinity)

    Returns end points of a sampling interval for slice at y
    """
    left = x0 - w*np.random.random()
    right = left + w

    if np.isinf(m):
        # No limit on the number of expansions (avoids inf - inf = nan)
        i = j = np.inf
    else:
        i = np.floor(m*np.random.random())
        j = (m-1) - i

    # Expand to the left while the endpoint is still inside the slice
    while (i > 0) and (y < func(left)):
        left -= w
        i -= 1

    # Expand to the right while the endpoint is still inside the slice
    while (j > 0) and (y < func(right)):
        right += w
        j -= 1

    return left, right
In [33]:
import scipy as sp
import scipy.stats

gam = sp.stats.distributions.gamma(2, scale=1).pdf

xvals = np.linspace(0, 10)
plt.plot(xvals, gam(xvals))

x, y = 1, 0.03
l, r = step_out(gam, x, y, 1, 5)
plt.plot((l, r), (y, y), 'r')

x, y = 1, 0.3
l, r = step_out(gam, x, y, 1, 5)
plt.plot((l, r), (y, y), 'm')
Out[33]: [<matplotlib.lines.Line2D at 0x11570c390>]
Doubling
The efficiency of stepping out depends largely on the ability to pick a reasonable interval $w$ from which to sample. Otherwise, the doubling procedure may be preferable, as it can be expanded faster. It simply doubles the size of the interval until both endpoints are outside the slice.
In [34]:
def doubling(func, x0, y, w, p=10):
    """
    func: target function (evaluates at x)
    x0: current value of x
    y: y-value that defines slice
    w: estimate of slice width
    p: integer limiting interval to size (2**p)*w (defaults to 10)

    Returns end points of a sampling interval for slice at y
    """
    left = x0 - w*np.random.random()
    right = left + w

    while (p > 0) and ((y < func(left)) or (y < func(right))):

        # Double the interval, extending to a randomly chosen side
        if np.random.random() < 0.5:
            left -= right - left
        else:
            right += right - left

        p -= 1

    return left, right
In [35]:
gam = sp.stats.distributions.gamma(2, scale=1).pdf
Out[35]: [<matplotlib.lines.Line2D at 0x1156ed410>]
Irrespective of which method for interval determination is used, the next step is to draw a value from this interval. One condition that applies to the new point is that it should be as likely to draw the interval $I$ from the new value as it was from the current value. There are two approaches that could be taken:

1. Draw from $I$ until a suitable point is obtained
2. Draw from $I$, which shrinks by some factor each time an unsuitable point is drawn, until a suitable point is obtained

A shrinkage function might proceed as follows:
In [36]:
def shrink(func, x0, y, left, right, w):

    l, r = left, right

    while True:
        x1 = l + np.random.random()*(r - l)
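For a complete picture of how interval expansion and shrinkage fit together, here is a minimal one-dimensional slice sampler written on the log scale. It is a simplified sketch and not the implementation used in this document: it assumes a univariate target, uses stepping out with a separate expansion budget on each side (`max_steps` is a hypothetical parameter), and shrinks the interval towards the current point whenever a draw falls outside the slice.

```python
import random


def slice_sample_1d(logf, x0, w, n_iterations, max_steps=50, seed=0):
    """Univariate slice sampler with stepping out and shrinkage."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_iterations):
        # Auxiliary variable on the log scale: z = g(x) - e, with e ~ Exp(1)
        z = logf(x) - rng.expovariate(1.0)

        # Stepping out: place an interval of width w randomly around x,
        # then expand each end until it leaves the slice
        left = x - w * rng.random()
        right = left + w
        steps = max_steps
        while steps > 0 and z < logf(left):
            left -= w
            steps -= 1
        steps = max_steps
        while steps > 0 and z < logf(right):
            right += w
            steps -= 1

        # Shrinkage: draw uniformly from the interval; on a miss, move the
        # endpoint on that side of the current point in to the rejected draw
        while True:
            x1 = left + rng.random() * (right - left)
            if z < logf(x1):
                x = x1
                break
            if x1 < x:
                left = x1
            else:
                right = x1
        samples.append(x)
    return samples


def log_std_normal(x):
    """Log-density of a standard normal, up to a constant."""
    return -0.5 * x * x


samples = slice_sample_1d(log_std_normal, x0=3.0, w=1.0, n_iterations=20_000)
kept = samples[500:]
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)
```

For the standard normal target used here, the sample mean and variance should be close to 0 and 1. Because the current point always lies inside the slice, the shrinking interval always contains an acceptable value.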
In [37]:
def check_value(x0, x1, left, right, y, func, w):

    l, r = left, right
    differ = False

    while (r - l) > (1.1*w):

        # Midpoint of the current interval (recomputed as it shrinks)
        m = (l + r)/2.

        if ((x0 < m) and (x1 >= m)) or ((x0 >= m) and (x1 < m)):
            # Intervals generated from new point likely different
            differ = True

        # Reduce interval
        if x1 < m:
            r = m
        else:
            l = m

        if differ and (y >= func(l)) and (y >= func(r)):
            # Point is not acceptable
            return False

    # If not rejected above, accept
    return True
Otherwise we can just substitute a trivial function for `check_value`:
check_value = lambda *args, **kwargs: True
In [38]:
uniform = np.random.uniform
standard_exponential = np.random.standard_exponential

def slice(n_iterations, logp, initial_values, w=1, tune=True):

    n_params = len(initial_values)

    # Initialize trace for parameters
    trace = np.empty((n_iterations+1, n_params))

    # Set initial values
    trace[0] = initial_values

    w_tune = []

    for i in range(n_iterations):

        if not i%1000: print('Iteration', i)

        q = trace[i]
        q0 = q.copy()

        w = np.resize(w, len(q0))

        # Auxiliary variable on the log scale
        y = logp(*q0) - standard_exponential()

        # Stepping out procedure
        ql = q0.copy()
        ql -= uniform(0, w)
        qr = q0.copy()
        qr = ql + w

        yl = logp(*ql)
        yr = logp(*qr)

        while((y < yl).all()):
            ql -= w
            yl = logp(*ql)

        while((y < yr).all()):
            qr += w
            yr = logp(*qr)

        q_next = q0.copy()

        while True:

            # Sample uniformly from slice
            qi = uniform(ql, qr)

            yi = logp(*qi)

            if yi > y:
                q = qi
                break
            elif (qi > q).all():
                qr = qi
            elif (qi < q).all():
                ql = qi

        if tune:
            # Tune sampler parameters
            w_tune.append(abs(q0 - q))
            w = 2 * sum(w_tune, 0) / len(w_tune)

        trace[i+1] = q

    return trace
In [39]:
n_iterations = 5000
trace = slice(n_iterations, calc_posterior, (1,0,1))

Iteration 0
Iteration 1000
Iteration 2000
Iteration 3000
Iteration 4000
In [40]:
for param, samples in zip(['intercept', 'slope', 'precision'], trace.T):
    fig, axes = plt.subplots(1, 2)
    axes[0].plot(samples)
    axes[0].set_ylabel(param)
    axes[1].hist(samples[n_iterations//2:])
References
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian Data Analysis, Second Edition (Chapman & Hall/CRC Texts in Statistical Science) (2nd ed.). Chapman and Hall/CRC.
Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3), 705-767. doi:10.1111/1467-9868.00198