4 Maximum entropy distributions
The sum rule says if we sum over all events related to one (set of) variable(s) (integrate out one variable or set of variables), we are left with the marginal probability of the remaining (set of) variable(s).
$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{p(y)}$$

$$p(\theta \mid y = y_0) \propto \ell(\theta \mid y = y_0)\,p(\theta)$$

p(y = y_1)   p(y = y_2)
   0.3          0.7

and

p(θ = θ_1)   p(θ = θ_2)
   0.5          0.5
       p(y | θ = θ_1)   p(y | θ = θ_2)
y_1        0.2              0.4
y_2        0.8              0.6

and

       p(θ | y = y_1)   p(θ | y = y_2)
θ_1        1/3              4/7
θ_2        2/3              3/7
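As a check on the revision above, the posterior columns can be reproduced directly from the prior and likelihood tables. The short sketch below (plain Python, with the numbers taken from the tables above) simply applies Bayes' rule.

```python
# prior, likelihood, and evidence probabilities taken from the tables above
prior = {"theta1": 0.5, "theta2": 0.5}
likelihood = {"theta1": {"y1": 0.2, "y2": 0.8},   # p(y | theta)
              "theta2": {"y1": 0.4, "y2": 0.6}}

for y in ("y1", "y2"):
    p_y = sum(prior[t] * likelihood[t][y] for t in prior)              # p(y), the marginal
    posterior = {t: prior[t] * likelihood[t][y] / p_y for t in prior}  # Bayes' rule
    print(y, round(p_y, 3), posterior)   # y1: 0.3, {1/3, 2/3}; y2: 0.7, {4/7, 3/7}
```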
$$p(\theta \mid y, \mathcal{I}) \propto p(y \mid \theta, \mathcal{I})\,p(\theta \mid \mathcal{I})$$
where $\mathcal{I}$ denotes the background information conditioning the analysis.
4.3 Entropy
Shannon defines entropy as²
$$h = -\sum_{i=1}^{n} p_i \log(p_i)$$
² The axiomatic development for this measure of entropy can be found in Jaynes [2003] and is revisited at the end of the chapter (see Park and Bera [2009] for additional maxent assignments).
⁵ Clearly, if we knew the mean is 2 then we would assign the uniform discrete distribution above.
$$
\begin{aligned}
p(x_i) &= \exp\left[\lambda_0\right]\exp\left[\sum_{j=1}^{m}\lambda_j f_j(x_i)\right] \\
&= \frac{\exp\left[\lambda_0\right]\exp\left[\sum_{j=1}^{m}\lambda_j f_j(x_i)\right]}{\exp\left[\lambda_0\right]\sum_{k=1}^{n}\exp\left[\sum_{j=1}^{m}\lambda_j f_j(x_k)\right]} \\
&= \frac{\exp\left[\sum_{j=1}^{m}\lambda_j f_j(x_i)\right]}{\sum_{k=1}^{n}\exp\left[\sum_{j=1}^{m}\lambda_j f_j(x_k)\right]} \\
&= \frac{k(x_i)}{Z(\lambda_1,\ldots,\lambda_m)}
\end{aligned}
$$
where $k(x_i) = \exp\left[\sum_{j=1}^{m}\lambda_j f_j(x_i)\right]$ is a kernel, and
$$Z(\lambda_1,\ldots,\lambda_m) = \sum_{k=1}^{n}\exp\left[\sum_{j=1}^{m}\lambda_j f_j(x_k)\right]$$
⁷ Since $\lambda_0$ simply ensures the probabilities sum to unity and the partition function assures this, we can define the partition function without $\lambda_0$. That is, $\lambda_0$ cancels as demonstrated above.
⁸ In physical statistical mechanics, the partition function describes the partitioning among different microstates and serves as a generator function for all manner of results regarding a process. The notation, Z, refers to the German word for sum over states, Zustandssumme. An example with relevance for our purposes is
$$\frac{\partial \log Z}{\partial \lambda_1} = \mu$$
Return to the example above. Since we know the support and the mean, $n = 3$ and the function $f(x_i) = x_i$. This implies
$$Z(\lambda_1) = \sum_{i=1}^{3}\exp\left[\lambda_1 x_i\right]$$
and
$$p_i = \frac{k(x_i)}{Z(\lambda_1)} = \frac{\exp\left[\lambda_1 x_i\right]}{\sum_{k=1}^{3}\exp\left[\lambda_1 x_k\right]}$$
The moment condition
$$\sum_{i=1}^{3} p_i x_i - 2.5 = 0$$
or, after substitution,
$$\sum_{i=1}^{3}\frac{\exp\left[\lambda_1 x_i\right]}{\sum_{k=1}^{3}\exp\left[\lambda_1 x_k\right]}\,x_i - 2.5 = 0$$
yields
$$p_1 = 0.116, \qquad p_2 = 0.268, \qquad p_3 = 0.616$$
where $\mu = 2.5$ is the mean of the distribution.
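The example can also be solved numerically. The sketch below (Python with NumPy and SciPy, both assumed available) finds $\lambda_1$ from the moment condition by a scalar root search and recovers the probability assignment.

```python
import numpy as np
from scipy.optimize import brentq

x = np.array([1.0, 2.0, 3.0])        # support
target = 2.5                         # assigned mean

def moment_gap(lam):
    p = np.exp(lam * x)
    p /= p.sum()                     # divide by the partition function Z(lambda_1)
    return p @ x - target            # moment condition

lam1 = brentq(moment_gap, -20.0, 20.0)
p = np.exp(lam1 * x); p /= p.sum()
print(lam1, p)                       # p is approximately (0.116, 0.268, 0.616)
```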
4.6.1 Binomial
Suppose we know there are binary (Bernoulli or "success"/"failure") outcomes associated with each of $n$ draws and the expected value of success equals $np$. The expected value of failure is redundant, hence there is only one moment condition. Then, the maximum entropy assignment includes the number of combinations which produce $x_1$ "successes" and $x_0$ "failures". Of course, this is the binomial operator, $\binom{n}{x_1, x_0} = \frac{n!}{x_1!\,x_0!}$ where $x_1 + x_0 = n$. Let $m_i \equiv \binom{n}{x_{1i}, x_{0i}}$ and generalize the entropy measure to account for $m$,
$$s = -\sum_{i=1}^{n} p_i \log\frac{p_i}{m_i}$$
Now, the kernel (drawn from the Lagrangian) is
$$k_i = m_i \exp\left[\lambda_1 x_i\right]$$
⁹ This suggests logistic regression is a natural (as well as the most common) strategy for analyzing binary outcomes.
p(x, n = 12)

[x_1, x_0]     balanced coin    unbalanced coin
[0, 12]        1/4,096          1/531,441
[1, 11]        12/4,096         24/531,441
[2, 10]        66/4,096         264/531,441
[3, 9]         220/4,096        1,760/531,441
[4, 8]         495/4,096        7,920/531,441
[5, 7]         792/4,096        25,344/531,441
[6, 6]         924/4,096        59,136/531,441
[7, 5]         792/4,096        101,376/531,441
[8, 4]         495/4,096        126,720/531,441
[9, 3]         220/4,096        112,640/531,441
[10, 2]        66/4,096         67,584/531,441
[11, 1]        12/4,096         24,576/531,441
[12, 0]        1/4,096          4,096/531,441
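The table can be regenerated from the binomial assignment. In the sketch below the balanced coin uses success probability 1/2, and the unbalanced coin is assumed to use 2/3, which is consistent with the 531,441 = 3¹² denominators shown.

```python
from math import comb
from fractions import Fraction

n = 12
for label, p1 in [("balanced coin", Fraction(1, 2)), ("unbalanced coin", Fraction(2, 3))]:
    print(label)
    for x1 in range(n + 1):
        x0 = n - x1
        prob = comb(n, x1) * p1**x1 * (1 - p1)**x0    # binomial assignment
        print(f"  [{x1:2d}, {x0:2d}]  {prob}")        # exact fraction (reduced to lowest terms)
```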
Canonical analysis

The above is consistent with a canonical analysis based on relative entropy $-\sum_i p_i\log\frac{p_i}{p_{old,i}}$ where $p_{old,i}$ reflects probabilities assigned based purely on the number of exchangeable ways of generating $x_i$; hence, $p_{old,i} = \frac{\binom{n}{x_i}}{\sum_i\binom{n}{x_i}} = \frac{m_i}{\sum_i m_i}$. Since the denominator of $p_{old,i}$ is absorbed via normalization it can be dropped, then entropy reduces to $-\sum_i p_i\log\frac{p_i}{m_i}$ and the kernel is the same as above
$$k_i = m_i\exp\left[\lambda_1 x_{1i}\right]$$
4.6.2 Multinomial
The multinomial is the multivariate analog to the binomial accommodating $k$ rather than two nominal outcomes. Like the binomial, the sum of the outcomes equals $n$, $\sum_{i=1}^{k} x_i = n$ where $x_k = 1$ for each occurrence of event $k$. We know there are
$$k^n = \sum_{x_1+\cdots+x_k = n}\binom{n}{x_1,\ldots,x_k} = \sum_{x_1+\cdots+x_k = n}\frac{n!}{x_1!\cdots x_k!}$$
and, for the dice example below with $n = 2$ rolls, the kernel is
$$\frac{2!}{x_1!\cdots x_6!}\exp\left[\lambda_1 x_1 + \cdots + \lambda_5 x_5\right]$$
p(x, n = 2)

[x_1, x_2, x_3, x_4, x_5, x_6]    balanced dice    unbalanced dice
[2, 0, 0, 0, 0, 0]                1/36             1/441
[0, 2, 0, 0, 0, 0]                1/36             4/441
[0, 0, 2, 0, 0, 0]                1/36             9/441
[0, 0, 0, 2, 0, 0]                1/36             16/441
[0, 0, 0, 0, 2, 0]                1/36             25/441
[0, 0, 0, 0, 0, 2]                1/36             36/441
[1, 1, 0, 0, 0, 0]                1/18             4/441
[1, 0, 1, 0, 0, 0]                1/18             6/441
[1, 0, 0, 1, 0, 0]                1/18             8/441
[1, 0, 0, 0, 1, 0]                1/18             10/441
[1, 0, 0, 0, 0, 1]                1/18             12/441
[0, 1, 1, 0, 0, 0]                1/18             12/441
[0, 1, 0, 1, 0, 0]                1/18             16/441
[0, 1, 0, 0, 1, 0]                1/18             20/441
[0, 1, 0, 0, 0, 1]                1/18             24/441
[0, 0, 1, 1, 0, 0]                1/18             24/441
[0, 0, 1, 0, 1, 0]                1/18             30/441
[0, 0, 1, 0, 0, 1]                1/18             36/441
[0, 0, 0, 1, 1, 0]                1/18             40/441
[0, 0, 0, 1, 0, 1]                1/18             48/441
[0, 0, 0, 0, 1, 1]                1/18             60/441
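The dice table can be reproduced the same way. The sketch below assumes the unbalanced die assigns face probabilities proportional to 1, 2, ..., 6 (consistent with the 441 = 21² denominators) and enumerates every count vector reachable with n = 2 rolls.

```python
from math import factorial
from fractions import Fraction
from itertools import product

def multinomial_pmf(counts, probs):
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    prob = Fraction(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

balanced = [Fraction(1, 6)] * 6
unbalanced = [Fraction(i, 21) for i in range(1, 7)]   # assumed: faces proportional to 1..6

# every count vector (x1,...,x6) reachable with two rolls
outcomes = set()
for i, j in product(range(6), repeat=2):
    counts = [0] * 6
    counts[i] += 1
    counts[j] += 1
    outcomes.add(tuple(counts))

for counts in sorted(outcomes, reverse=True):
    print(counts, multinomial_pmf(counts, balanced), multinomial_pmf(counts, unbalanced))
```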
4.6.3 Hypergeometric
The hypergeometric probability assignment is a purely combinatorics exercise. Suppose we take $n$ draws without replacement from a finite population of $N$ items of which $m$ are the target items (or events) and $x$ is the number of target items drawn. There are $\binom{m}{x}$ ways to draw the targets times $\binom{N-m}{n-x}$ ways to draw nontargets out of $\binom{N}{n}$ ways to make $n$ draws. Hence, our combinatoric measure is
$$\frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}, \qquad x \in \{\max(0, m+n-N), \ldots, \min(m, n)\}$$
and
$$\sum_{x=\max(0,\,m+n-N)}^{\min(m,n)}\frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}} = 1$$
¹⁰ If we omit $\binom{N}{n}$ in the denominator, it would be captured via normalization. In other words, the kernel is
$$k = \binom{m}{x}\binom{N-m}{n-x}\exp\left[0\right]$$
and the partition function is
$$Z = \sum_{x=\max(0,\,m+n-N)}^{\min(m,n)} k = \binom{N}{n}$$
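This identity (Vandermonde's convolution) is easy to confirm for illustrative values of N, m, and n:

```python
from math import comb

N, m, n = 20, 7, 5                    # illustrative values
lo, hi = max(0, m + n - N), min(m, n)
Z = sum(comb(m, x) * comb(N - m, n - x) for x in range(lo, hi + 1))
print(Z == comb(N, n))                # True: the kernel sums to C(N, n)
```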
$$p(x) = \frac{\exp[0]}{\int_0^3\exp[0]\,dx} = \frac{1}{3}$$
Of course, this is the density function for a uniform with support from 0 to 3.
If we don't know it, we can find the maximum entropy density function for arbitrary mean, $\mu$. Using the partition function approach above we have
$$p(x) = \frac{\exp\left[-\lambda_2(x-\mu)^2\right]}{\int_{-\infty}^{\infty}\exp\left[-\lambda_2(x-\mu)^2\right]dx}$$
so that $\lambda_2 = \frac{1}{2\sigma^2}$ and the density function is
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] = \frac{1}{\sqrt{2\pi}\,10}\exp\left[-\frac{(x-\mu)^2}{200}\right]$$
and
$$E\left[(\log x)^2\right] = \int_0^{\infty}(\log x)^2\,p(x)\,dx = 10$$
so that $\lambda_1 = 0.9$ and $\lambda_2 = \frac{1}{2\sigma^2} = 0.05$. Substitution gives
$$p(x)\propto\exp\left[-\frac{18}{20}\log x - \frac{1}{20}(\log x)^2\right]$$
Completing the square and adding in the constant (from normalization) $\exp\left[-\frac{1}{20}\right]$ gives
$$p(x)\propto\exp\left[-\log x\right]\exp\left[-\frac{1}{20} + \frac{2}{20}\log x - \frac{1}{20}(\log x)^2\right]$$
Rewriting produces
$$p(x)\propto\exp\left[-\log x\right]\exp\left[-\frac{(\log x - 1)^2}{20}\right]$$
which simplifies as
$$p(x)\propto x^{-1}\exp\left[-\frac{(\log x - 1)^2}{20}\right]$$
$$F(z) = \frac{1}{1+\exp[-z]} = \frac{\exp[z]}{1+\exp[z]}$$
and
$$F(x) = \frac{1}{1+\exp\left[-\frac{x-\mu}{s}\right]}$$
where $z = \frac{x-\mu}{s}$ with mean $\mu$ and variance $\frac{\pi^2 s^2}{3}$, and probability density function (pdf)
$$f(z) = \frac{1}{s}F(z)F(-z) = \frac{\exp[-z]}{s\left(1+\exp[-z]\right)^2} = \frac{1}{s\left(\exp\left[\frac{z}{2}\right]+\exp\left[-\frac{z}{2}\right]\right)^2}$$
and
$$f(x) = \frac{1}{s\left(\exp\left[\frac{x-\mu}{2s}\right]+\exp\left[-\frac{x-\mu}{2s}\right]\right)^2}$$
$$\Pr(s_1 \mid y = y_0) = \frac{p\,f(y_0 \mid s_1)}{p\,f(y_0 \mid s_1) + (1-p)\,f(y_0 \mid s_0)}$$
where $p = \Pr(s_1)$ and $f(y \mid s_j) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-\mu_j)^2}{2\sigma^2}\right]$, $j = 0, 1$. Since the variances of the evidence conditional on the state are the same, the normalizing constant, $\frac{1}{\sqrt{2\pi\sigma^2}}$, cancels in the ratio. Accordingly, substitution produces
$$\Pr(s_1 \mid y = y_0) = \frac{p\exp\left[-\frac{(y_0-\mu_1)^2}{2\sigma^2}\right]}{p\exp\left[-\frac{(y_0-\mu_1)^2}{2\sigma^2}\right] + (1-p)\exp\left[-\frac{(y_0-\mu_0)^2}{2\sigma^2}\right]}$$
Now, multiply by
$$\frac{\dfrac{1}{p}\exp\left[\dfrac{(y_0-\mu_1)^2}{2\sigma^2}\right]}{\dfrac{1}{p}\exp\left[\dfrac{(y_0-\mu_1)^2}{2\sigma^2}\right]}$$
to find
$$\Pr(s_1 \mid y = y_0) = \frac{1}{1 + \frac{1-p}{p}\exp\left[-\frac{(y_0-\mu_0)^2}{2\sigma^2} + \frac{(y_0-\mu_1)^2}{2\sigma^2}\right]}$$
Rewrite as a log-odds ratio and expand and simplify the exponential term
$$\Pr(s_1 \mid y = y_0) = \frac{1}{1 + \exp\left[\log\frac{1-p}{p}\right]\exp\left[-\dfrac{y_0 - \frac{\mu_0+\mu_1}{2}}{\frac{\sigma^2}{\mu_1-\mu_0}}\right]}$$
$$\Pr(s_1 \mid y = y_0) = \frac{1}{1 + \exp\left[-\dfrac{y_0 - \frac{\mu_0+\mu_1}{2}}{\frac{\sigma^2}{\mu_1-\mu_0}} + \log\left(\frac{1-p}{p}\right)\right]}
= \frac{1}{1 + \exp\left[-\dfrac{y_0 - \left\{\frac{\mu_0+\mu_1}{2} + \frac{\sigma^2}{\mu_1-\mu_0}\log\left(\frac{1-p}{p}\right)\right\}}{\frac{\sigma^2}{\mu_1-\mu_0}}\right]}$$
which is the cdf for a logistic random variable with mean $\frac{\mu_0+\mu_1}{2} + \frac{\sigma^2}{\mu_1-\mu_0}\log\left(\frac{1-p}{p}\right)$ and scale parameter $s = \frac{\sigma^2}{\mu_1-\mu_0}$.
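The equivalence can be checked numerically: for illustrative values of μ₀, μ₁, σ, and p (assumed here purely for demonstration), the Bayes posterior and the logistic cdf with the derived mean and scale coincide.

```python
import numpy as np

mu0, mu1, sigma, p = 0.0, 1.0, 1.5, 0.3      # illustrative values only

def bayes_posterior(y0):
    f1 = np.exp(-(y0 - mu1) ** 2 / (2 * sigma**2))
    f0 = np.exp(-(y0 - mu0) ** 2 / (2 * sigma**2))
    return p * f1 / (p * f1 + (1 - p) * f0)

s = sigma**2 / (mu1 - mu0)                            # scale parameter
mean = (mu0 + mu1) / 2 + s * np.log((1 - p) / p)      # derived logistic mean

def logistic_cdf(y0):
    return 1.0 / (1.0 + np.exp(-(y0 - mean) / s))

y = np.linspace(-5.0, 5.0, 11)
print(np.max(np.abs(bayes_posterior(y) - logistic_cdf(y))))   # ~0 (machine precision)
```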
$$E\left[\log\left(\exp\left[\frac{z}{2}\right]+\exp\left[-\frac{z}{2}\right]\right)\right] = \int_{-\infty}^{\infty}\log\left(\exp\left[\frac{z}{2}\right]+\exp\left[-\frac{z}{2}\right]\right)\frac{k(z)}{Z}\,dz = 1$$
yields $\lambda = 2$. Hence,
$$f(z) = \frac{k(z;\lambda=2)}{Z(\lambda=2)} = \frac{1}{s\left(\exp\left[\frac{z}{2}\right]+\exp\left[-\frac{z}{2}\right]\right)^2}$$
$$f(x) = \frac{1}{s\left(\exp\left[\frac{x-\mu}{2s}\right]+\exp\left[-\frac{x-\mu}{2s}\right]\right)^2}$$
the density function for a logistic random variable with mean $\mu$ and variance $\frac{\pi^2 s^2}{3}$.
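The unit value of the moment condition can be confirmed by numerical integration against the standard logistic density (s = 1, μ = 0), for example:

```python
import numpy as np
from scipy.integrate import quad

def f(z):                        # standard logistic density in a numerically stable form
    a = np.exp(-abs(z))
    return a / (1.0 + a) ** 2

def integrand(z):                # log(exp[z/2] + exp[-z/2]) * f(z)
    return (abs(z) / 2.0 + np.log1p(np.exp(-abs(z)))) * f(z)

value, _ = quad(integrand, -np.inf, np.inf)
print(value)                     # approximately 1.0
```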
Following Blower [2004], we develop this model specification from the maximum entropy principle.
Bayesian revision yields
$$\Pr(D \mid X) = \frac{\Pr(D, X)}{\Pr(X)}$$
and
$$\Pr(D=1 \mid X) = \frac{1}{1 + \frac{\Pr(D=0,\,X)}{\Pr(D=1,\,X)}}$$
" #
P
m
exp j fj (X1 )
j=1
Pr (D = 1, X, ~) =
Z (1 , . . . , m )
and
" #
P
m
exp j fj (X0 )
j=1
Pr (D = 0, X, ~) =
Z (1 , . . . , m )
" #
P
m
exp j fj (X0 )
Pr (D = 0, X, ~) j=1
= " #
Pr (D = 1, X, ~) P
m
exp j fj (X1 )
j=1
= exp [Y ]
where
m
X
Y = j {fj (X1 ) fj (X0 )}
j=1
and the $\lambda_j$ serve as weights.¹² Hence,
$$\Pr(D=1 \mid X, \vec{\lambda}) = \frac{1}{1 + \frac{\Pr(D=0,\,X,\,\vec{\lambda})}{\Pr(D=1,\,X,\,\vec{\lambda})}} = \frac{1}{1 + \exp\left[-Y\right]}$$
Pr(D = 1, B, C)
               B = 1    B = 2    B = 3    Pr(D = 1, C)
C = L          0.008    0.022    0.02     0.05
C = M          0.009    0.027    0.024    0.06
C = H          0.013    0.041    0.036    0.09
Pr(D = 1, B)   0.03     0.09     0.08     Pr(D = 1) = 0.2

Pr(D = 0, B, C)
               B = 1    B = 2    B = 3    Pr(D = 0, C)   Pr(C)
C = L          0.060    0.145    0.078    0.283          0.333
C = M          0.058    0.140    0.075    0.273          0.333
C = H          0.052    0.125    0.067    0.244          0.334
Pr(D = 0, B)   0.17     0.41     0.22     Pr(D = 0) = 0.8
Pr(B)          0.2      0.5      0.3
¹² For the latent variable random utility model with index structure ($Y = (X_1 - X_0)^T\beta = X^T\beta$) and unobservable $\varepsilon$ a logistic random variable, we have
$$D^* = X^T\beta - \varepsilon$$
$$\Pr(D = 1 \mid X) = \Pr(D^* \geq 0 \mid X) = \Pr\left(\varepsilon \leq X^T\beta \mid X\right) = F\left(X^T\beta\right) = \frac{1}{1+\exp\left[-X^T\beta\right]} = \frac{1}{1+\exp\left[-Y\right]}$$
¹³ Complete knowledge includes all the interactions as well as the nine moments indicated.
Pr(D = 1, B, C)
         B = 1       B = 2       B = 3
C = L    0.0075      0.0225      0.02
C = M    0.009       0.027       0.024
C = H    0.0135      0.0405      0.036

Pr(D = 0, B, C)
         B = 1       B = 2       B = 3
C = L    0.0601375   0.1450375   0.077825
C = M    0.0580125   0.1399125   0.075075
C = H    0.05185     0.12505     0.0671
Hence, the maximum entropy conditional probabilities are
$$\Pr(D=1 \mid B, C) = \frac{\Pr(D=1, B, C)}{\Pr(D=1, B, C) + \Pr(D=0, B, C)}$$
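Applying this formula cell by cell to the two maximum entropy joint tables gives the fitted conditional probabilities; a short sketch:

```python
import numpy as np

B_levels, C_levels = [1, 2, 3], ["L", "M", "H"]
# maximum entropy joint probabilities from the tables above (rows C, columns B)
pr_D1 = np.array([[0.0075, 0.0225, 0.0200],
                  [0.0090, 0.0270, 0.0240],
                  [0.0135, 0.0405, 0.0360]])
pr_D0 = np.array([[0.0601375, 0.1450375, 0.077825],
                  [0.0580125, 0.1399125, 0.075075],
                  [0.0518500, 0.1250500, 0.067100]])

pr_D1_given = pr_D1 / (pr_D1 + pr_D0)
for i, c in enumerate(C_levels):
    for j, b in enumerate(B_levels):
        print(f"Pr(D=1 | B={b}, C={c}) = {pr_D1_given[i, j]:.4f}")
```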
A linear probability model supplies effective starting values for maximum likelihood estimation of a logistic regression. The linear probability model is¹⁵
$$\Pr(D=1 \mid B, C) = X\beta = 0.3342 - 0.1147\,B_1 - 0.0855\,B_2 - 0.1178\,C_L - 0.0881\,C_M$$
Logistic regression is
$$\Pr(D=1 \mid B, C) = \frac{1}{1+\exp\left[-X\beta\right]}$$
Equating the two expressions and solving gives
$$X\beta = -\log\left(\frac{1 - X\beta}{X\beta}\right)$$
or starting values
$$\beta_0 = -\left(X^T X\right)^{-1} X^T \log\left(\frac{1-X\beta}{X\beta}\right)$$
$$\beta_0^T = \begin{bmatrix} -0.5991 & -0.7463 & -0.5185 & -0.7649 & -0.5330 \end{bmatrix}$$
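A sketch of the construction: fit the linear probability model by probability-weighted least squares over the nine (B, C) cells implied by the joint tables, then map the fitted probabilities through the log-odds transformation. The cell weights and the baseline category (B = 3 and C = H) are assumptions consistent with the coefficient pattern reported above, so the output should only approximately reproduce the reported values.

```python
import numpy as np

B_levels, C_levels = [1, 2, 3], ["L", "M", "H"]
# maximum entropy joint probabilities, as in the previous sketch (rows C, columns B)
pr_D1 = np.array([[0.0075, 0.0225, 0.0200],
                  [0.0090, 0.0270, 0.0240],
                  [0.0135, 0.0405, 0.0360]])
pr_D0 = np.array([[0.0601375, 0.1450375, 0.077825],
                  [0.0580125, 0.1399125, 0.075075],
                  [0.0518500, 0.1250500, 0.067100]])

X, y, w = [], [], []
for i, c in enumerate(C_levels):
    for j, b in enumerate(B_levels):
        X.append([1.0, b == 1, b == 2, c == "L", c == "M"])  # baseline B=3, C=H (assumed)
        y.append(pr_D1[i, j] / (pr_D1[i, j] + pr_D0[i, j]))  # Pr(D = 1 | B, C)
        w.append(pr_D1[i, j] + pr_D0[i, j])                  # Pr(B, C), used as cell weight
X, y, W = np.array(X, float), np.array(y), np.diag(w)

# linear probability model by weighted least squares (weights emulate observation-level data)
beta_lpm = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
p_hat = X @ beta_lpm

# logistic starting values: regress the log-odds of the fitted probabilities on X
beta0 = np.linalg.solve(X.T @ W @ X, X.T @ W @ np.log(p_hat / (1 - p_hat)))
print(beta_lpm)   # compare with the linear probability model coefficients above
print(beta0)      # compare with the reported starting values
```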
¹⁴ Of course, this doesn't match the DGP as our background knowledge is incomplete.

¹⁵ Since the model involves only indicator variables, the predicted values are bounded between zero and one.
s.t. $\sum_{\theta} p(\theta_1, \theta_2, \theta_3)\,(\theta_1 - 2\theta_3) = 0$

$$p(x, \theta) = p(x \mid \theta)\,p(\theta) = \frac{(n_1+n_2+n_3)!}{n_1!\,n_2!\,n_3!}\,\theta_1^{n_1}\theta_2^{n_2}\theta_3^{n_3}\,\frac{\exp\left[1.44756\,(\theta_1 - 2\theta_3)\right]}{28.7313}$$
Hence, prior to collection and evaluation of evidence, expected values of $\theta$ are
$$E[\theta_1] = \sum_x\sum_\theta \theta_1\,p(x,\theta) = \sum_\theta \theta_1\,p(\theta) = 0.444554$$
$$E[\theta_2] = \sum_x\sum_\theta \theta_2\,p(x,\theta) = \sum_\theta \theta_2\,p(\theta) = 0.333169$$
$$E[\theta_3] = \sum_x\sum_\theta \theta_3\,p(x,\theta) = \sum_\theta \theta_3\,p(\theta) = 0.222277$$
Hence,
$$E[\theta_1 \mid x = m] = \sum_\theta p(\theta \mid x=m)\,\theta_1 = 0.505373$$
$$E[\theta_2 \mid x = m] = \sum_\theta p(\theta \mid x=m)\,\theta_2 = 0.302243$$
$$E[\theta_3 \mid x = m] = \sum_\theta p(\theta \mid x=m)\,\theta_3 = 0.192384$$
and
$$E[\theta_1 + \theta_2 + \theta_3 \mid x = m] = 1$$
s.t. $\sum_x\sum_\theta p(x,\theta)\,(\theta_1 - 2\theta_3) = 0$
$$p(x,\theta) = p_{old}(x,\theta)\,\frac{\exp\left[\lambda\,(\theta_1 - 2\theta_3)\right]}{Z(\lambda)}$$
$$p(\theta, x = m) = 2520\,\theta_1^5\theta_2^3\theta_3^2\,\frac{\exp\left[1.44756\,(\theta_1 - 2\theta_3)\right]}{28.7313}$$
The probability of the data is
$$p(x = m) = \sum_\theta p(\theta, x = m) = 0.0286946$$
Hence,
$$E[\theta_1 \mid x = m] = \sum_\theta p(\theta \mid x=m)\,\theta_1 = 0.505373$$
$$E[\theta_2 \mid x = m] = \sum_\theta p(\theta \mid x=m)\,\theta_2 = 0.302243$$
$$E[\theta_3 \mid x = m] = \sum_\theta p(\theta \mid x=m)\,\theta_3 = 0.192384$$
and
$$E[\theta_1 + \theta_2 + \theta_3 \mid x = m] = 1$$
as above.
This maximum relative entropy analysis for the joint distribution helps set up simultaneous evaluation of moment and data conditions when we're evaluating the process rather than a specific die.
s.t. $\sum_x\sum_\theta p(x,\theta)\,(\theta_1 - 2\theta_3) = 0$ and $x = m$
Since the likelihood function remains the same, Lagrangian methods yield
$$p(x = m, \theta) = p_{old}(x = m, \theta)\,\frac{\exp\left[\lambda_{(x,\theta)}\,(\theta_1 - 2\theta_3)\right]}{Z(m, \lambda)}$$
$$\frac{\partial \log Z(m, \lambda)}{\partial \lambda_{(x,\theta)}} = 0$$
In other words, the joint probability given the moment and data conditions is
$$p(x = m, \theta) = \frac{2520}{36}\,\theta_1^5\theta_2^3\theta_3^2\,\exp\left[0.0420198\,(\theta_1 - 2\theta_3)\right]$$
Since $p(x = m) = \sum_\theta p(x = m, \theta) = 0.0209722$, the posterior probability of $\theta$ is
$$p(\theta \mid x = m, E[\theta_1 - 2\theta_3] = 0) = 70\,\theta_1^5\theta_2^3\theta_3^2\,\frac{\exp\left[0.0420198\,(\theta_1 - 2\theta_3)\right]}{0.0209722}$$
Hence,
$$E[\theta_1 \mid x = m, E[\theta_1-2\theta_3]=0] = \sum_\theta p(\theta \mid x = m, E[\theta_1-2\theta_3]=0)\,\theta_1 = 0.462225$$
$$E[\theta_2 \mid x = m, E[\theta_1-2\theta_3]=0] = \sum_\theta p(\theta \mid x = m, E[\theta_1-2\theta_3]=0)\,\theta_2 = 0.306663$$
$$E[\theta_3 \mid x = m, E[\theta_1-2\theta_3]=0] = \sum_\theta p(\theta \mid x = m, E[\theta_1-2\theta_3]=0)\,\theta_3 = 0.2311125$$
$$E[\theta_1 + \theta_2 + \theta_3 \mid x = m, E[\theta_1-2\theta_3]=0] = 1$$
and unlike the previous analysis of a specific die, for the process the moment condition is maintained
$$E[\theta_1 - 2\theta_3 \mid x = m, E[\theta_1-2\theta_3]=0] = 0$$
What we've done here is add a canonical term, $\frac{\exp\left[\lambda_{(x,\theta)}\,(\theta_1-2\theta_3)\right]}{Z(m,\lambda)}$, to the standard Bayesian posterior for $\theta$ given the data, $p_{old}(\theta \mid x = m)$, to account for the moment condition. The partition function, $Z(m,\lambda)$, serves to normalize the moment-conditioned posterior distribution.
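A sketch of the whole construction on a discrete grid of die probabilities θ = (θ₁, θ₂, θ₃). The grid resolution below is an assumption for illustration (the text does not spell out the support of θ), but the mechanics are those just described: tilt the Bayes posterior by exp[λ(θ₁ − 2θ₃)] and choose λ so that the posterior moment condition holds.

```python
import numpy as np
from math import comb
from scipy.optimize import brentq

d = 10                                               # grid resolution (an assumption)
grid = np.array([(i / d, j / d, (d - i - j) / d)
                 for i in range(d + 1) for j in range(d + 1 - i)])

n1, n2, n3 = 5, 3, 2                                 # observed counts, x = m
lik = comb(10, 5) * comb(5, 3) * grid[:, 0]**n1 * grid[:, 1]**n2 * grid[:, 2]**n3
post_old = lik / lik.sum()                           # Bayes posterior under a uniform grid prior

g = grid[:, 0] - 2.0 * grid[:, 2]                    # moment function theta_1 - 2*theta_3

def posterior_moment(lam):
    p = post_old * np.exp(lam * g)
    p /= p.sum()                                     # partition function Z(m, lambda)
    return p @ g

lam = brentq(posterior_moment, -50.0, 50.0)          # impose E[theta_1 - 2*theta_3 | x = m] = 0
post = post_old * np.exp(lam * g); post /= post.sum()
print(lam, post @ grid)                              # posterior expectations of theta_1..3
```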
location

Complete ignorance regarding location translates into $f(x) = 1$, uniform or rectangular. Uninformed or constant priors mean likelihood functions completely determine posterior beliefs regarding location.
scale

Complete ignorance regarding scale translates into $f(x) = \frac{1}{x}$ for $x > 0$, or assigning a uniform distribution to the log of scale, $f(y) = 1$ where $y = \log(x)$ (note: $\log\sigma^2 = 2\log\sigma$, so it matters little whether we speak, for instance, of standard deviation, $x = \sigma$, or variance, $x = \sigma^2$, which makes sense as both are indicators of scale and we're depicting ignorance regarding scale). Intuition for this is that probability assignment is invariant to choice of units. This translates into $f(x)\,dx = f(bx)\,d(bx)$ for some constant $b > 0$. Since $d(bx) = b\,dx$ ($\frac{d(bx)}{dx} = b$, which implies $d(bx) = b\,dx$), we require $f(x)\,dx = b\,f(bx)\,dx$, and this is only true for $f(x) = \frac{1}{x}$:
$$f(x)\,dx = \frac{dx}{x}$$
$$b\,f(bx)\,dx = \frac{b}{bx}\,dx = \frac{dx}{x}$$
For example, $f(x) = f(bx) = 1$ (uniform) leads to $f(x)\,dx \neq b\,f(bx)\,dx$ or $1\,dx \neq b\,dx$, which is not scale invariant. Nor does $f(x) = \exp(-x)$ satisfy scale invariance as $f(x)\,dx = \exp(-x)\,dx \neq b\,f(bx)\,dx = b\exp(-bx)\,dx$.
However, it is intuitively appealing to consider scale ignorance as assigning a uniform probability to the log of scale $x$, $y = \log x$, $f(y) = 1$. Then, $f(x) = f(y)\left|\frac{dy}{dx}\right| = \frac{1}{x}$. Uninformed priors mean likelihood functions completely determine posterior beliefs regarding scale.
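The invariance claim is easy to check numerically: the weight the $\frac{1}{x}$ prior assigns to an interval is unchanged when the variable and the interval are rescaled, while a flat prior's assignment is not.

```python
from scipy.integrate import quad

a, b, scale = 2.0, 5.0, 10.0                 # an interval and an arbitrary change of units

for name, f in [("1/x", lambda x: 1.0 / x), ("uniform", lambda x: 1.0)]:
    weight, _ = quad(f, a, b)
    rescaled, _ = quad(f, scale * a, scale * b)
    print(name, weight, rescaled)            # equal for 1/x, unequal for the flat prior
```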
The point here is consistency suggests, or rather demands, that divergence in probability beliefs builds from different likelihood functions (processing the information differently and/or processing different information).
discrete:

uniform.  Knowledge: none.  Kernel: $\exp[0] = 1$.  Assignment: $\Pr(x = i) = \frac{1}{n}$, $i = 1, \ldots, n$.

nonuniform.  Knowledge: $E[x] = \mu$.  Kernel: $\exp[-\lambda x_i]$.  Assignment: $\Pr(x = i) = p_i$, $i = 1, \ldots, n$.

Bernoulli.  Knowledge: $E[x] = p$.  Kernel: $\exp[-\lambda x]$.  Assignment: $\Pr(x = 1) = p$, $x \in \{0, 1\}$.

binomial¹⁸.  Knowledge: $E[x \mid n] = np$, $\Pr(x = 1 \mid n = 1) = p$.  Kernel: $\binom{n}{x}\exp[-\lambda x]$.  Assignment: $\Pr(x = s \mid n) = \binom{n}{s}p^s(1-p)^{n-s}$, $s = 0, \ldots, n$.

multinomial¹⁹.  Knowledge: $E[x_i \mid n] = np_i$, $\Pr(x_i = 1 \mid n = 1) = p_i$, $i = 1, \ldots, k-1$.  Kernel: $\frac{n!}{x_1!\cdots x_k!}\exp\left[-\sum_{i=1}^{k-1}\lambda_i x_i\right]$.  Assignment: $\Pr(x_1, \ldots, x_k \mid n) = \frac{n!}{x_1!\cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}$, $x_i = 0, \ldots, n$.

Poisson²⁰.  Knowledge: $E[x] = \mu$.  Kernel: $\frac{1}{x!}\exp[-\lambda x]$.  Assignment: $\Pr(x = s) = \frac{\mu^s}{s!}\exp[-\mu]$, $s = 0, 1, \ldots$.

geometric²¹.  Knowledge: $E[x] = \frac{1}{p}$, $\Pr(x = 1) = p$.  Kernel: $\exp[-\lambda x]$.  Assignment: $\Pr(x = r) = p(1-p)^{r-1}$, $r = 1, 2, \ldots$.

negative binomial²².  Knowledge: $E[x] = \frac{r(1-p)}{p}$, $\Pr(x = 1) = p$.  Kernel: $\binom{x+r-1}{x}\exp[-\lambda x]$.  Assignment: $\Pr(x = s; r) = \binom{s+r-1}{s}p^r(1-p)^s$, $s = 0, 1, \ldots$.

logarithmic²³.  Knowledge: $E[x] = \frac{-p}{(1-p)\log(1-p)}$, $E[\log x] = \sum_{x=1}^{\infty}\frac{p^x\log x}{-x\log(1-p)} = \frac{1}{\log(1-p)}\left.\frac{\partial}{\partial t}\sum_{k=1}^{\infty}\frac{p^k}{k^t}\right|_{t=1}$.  Kernel: $\exp[-\lambda_1 x - \lambda_2\log x]$.  Assignment: $\Pr(x) = \frac{-p^x}{x\log(1-p)}$, $x = 1, 2, \ldots$.

hypergeometric²⁴.  Knowledge: none (note: $E[x] = \frac{nm}{N}$).  Kernel: $\binom{m}{x}\binom{N-m}{n-x}$.  Assignment: $\Pr(x = s; m, n, N) = \frac{\binom{m}{s}\binom{N-m}{n-s}}{\binom{N}{n}}$.
¹⁸ The kernel for the binomial includes a measure of the number of exchangeable ways to generate $x$ successes in $n$ trials, $\binom{n}{x}$. The measure, say $m(x)$, derives from generalized entropy, $S = -\sum_i p_i\log\frac{p_i}{m_i}$, where $m_i$ reflects a measure that ensures entropy is invariant under transformation (or change in units) as required to consistently capture background information including complete ignorance (see Jeffreys [1939], Jaynes [2003] or Sivia and Skilling [2006]). The first order condition from the Lagrangian with mean moment condition yields the kernel $m_i\exp[-\lambda x_i]$, as reflected for the binomial probability assignment. Since $m_i \neq \frac{1}{n}$ for the binomial distribution, $m_i$ is not absorbed through normalization.
¹⁹ Analogous to the binomial, the kernel for the multinomial includes a measure of the number of exchangeable ways to generate $x_1, \ldots, x_k$ events in $n$ trials, $\frac{n!}{x_1!\cdots x_k!}$, where events are mutually exclusive (as with the binomial) and draws are with replacement.
²⁰ Like the binomial, the kernel for the Poisson includes an invariance measure based on $\frac{1}{x!}$.
continuous:

uniform.  Knowledge: none.  Kernel: $\exp[0] = 1$.  Assignment: $f(x) = \frac{1}{b-a}$, $a < x < b$.

exponential.  Knowledge: $E[x] = \theta > 0$.  Kernel: $\exp[-\lambda x]$.  Assignment: $f(x) = \frac{1}{\theta}\exp\left[-\frac{x}{\theta}\right]$, $0 < x < \infty$.

chi-squared²⁵.  Knowledge: $E[x] = \nu > 0$ (d.f.), $E[\log x] = \frac{\Gamma'(\nu/2)}{\Gamma(\nu/2)} + \log 2$.  Kernel: $\exp[-\lambda_1 x - \lambda_2\log x]$.  Assignment: $f(x) = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}x^{\nu/2-1}\exp\left[-\frac{x}{2}\right]$, $0 < x < \infty$.

beta²⁶.  Knowledge: $E[\log x] = \frac{\Gamma'(a)}{\Gamma(a)} - \frac{\Gamma'(a+b)}{\Gamma(a+b)}$, $E[\log(1-x)] = \frac{\Gamma'(b)}{\Gamma(b)} - \frac{\Gamma'(a+b)}{\Gamma(a+b)}$, $a, b > 0$.  Kernel: $\exp[-\lambda_1\log x - \lambda_2\log(1-x)]$.  Assignment: $f(x) = \frac{1}{B(a,b)}x^{a-1}(1-x)^{b-1}$, $0 < x < 1$.

normal or Gaussian.  Knowledge: $E\left[(x-\mu)^2\right] = \sigma^2$.  Kernel: $\exp[-\lambda(x-\mu)^2]$.  Assignment: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$, $-\infty < x < \infty$.

Student's t.  Knowledge: $E\left[\log(\nu + x^2)\right]$.  Kernel: $\exp[-\lambda\log(\nu + x^2)]$.  Assignment: $f(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$, $-\infty < x < \infty$.

Pareto.  Knowledge: $E[\log x] = \frac{1}{\theta} + \log x_m$, $\theta > 0$.  Kernel: $\exp[-\lambda\log x]$.  Assignment: $f(x) = \frac{\theta x_m^\theta}{x^{\theta+1}}$, $0 < x_m \le x < \infty$.

Laplace.  Knowledge: $E[|x|] = \frac{1}{\lambda}$, $\lambda > 0$.  Kernel: $\exp[-\lambda|x|]$.  Assignment: $f(x) = \frac{\lambda}{2}\exp[-\lambda|x|]$, $-\infty < x < \infty$.

logistic.  Knowledge: $E\left[\log\left(\exp\left[\frac{z}{2}\right]+\exp\left[-\frac{z}{2}\right]\right)\right] = 1$, $z = \frac{x-\mu}{s}$.  Kernel: $\left(\exp\left[\frac{z}{2}\right]+\exp\left[-\frac{z}{2}\right]\right)^{-\lambda}$.  Assignment: $f(x) = \frac{1}{s\left(\exp\left[\frac{z}{2}\right]+\exp\left[-\frac{z}{2}\right]\right)^2}$, $-\infty < x < \infty$.
²⁶ The Dirichlet distribution is the multivariate extension of the beta distribution with