Álvaro Veiga
Departamento de Engenharia Elétrica
Pontifícia Universidade Católica do Rio de Janeiro
Rua Marquês de São Vicente 225
22453-900, Rio de Janeiro, RJ, Brazil
email: alvf@ele.puc-rio.br
October, 2002
This paper updates and improves the material presented in a previous paper of ours, "The Binomial-Beta Hierarchical Model for Ecological Inference Revisited and Implemented via the ECM Algorithm," posted to the Polmeth Paper Archive in 2001. Matlab routines that implement the approach presented in this paper are available from http://www3.ufjf.br/~rmattos.
Abstract
The binomial-beta hierarchical model of King, Rosen, and Tanner (1999) is a recent contribution to ecological inference. Developed for the 2x2 tables case and from a bayesian perspective, the model features the compounding of binomial and beta distributions into a hierarchical structure. From a sample of aggregate observations, inference with this model can be made regarding the values of unobservable disaggregate variables. The paper reviews this EI model with two purposes: First, a faster approach to use it in practice, based on explicit modeling of the disaggregate data generation process along with posterior maximization implemented via the ECM algorithm, is proposed and illustrated with an application to a real dataset; second, limitations concerning the use of marginal posteriors for binomial probabilities as the vehicle of inference (basically, the failure to respect the accounting identity) instead of the predictive distributions for the disaggregate proportions are pointed out. In the concluding section, principles for EI model building in general and directions for further research are suggested.
1 Introduction
The idea of using compound binomial and beta models arose in the EI literature well before KRT's study. More than a decade earlier, Brown and Payne (1986) proposed the aggregated compound multinomial (ACM) model, which is a generalization of a binomial-beta model that allows EI to be made in RxC tables. Brown and Payne's approach has its origins in studies from the sixties and the seventies concerned with the estimation of transition probabilities in Markov chain models. The similarity between this problem and the EI one rests on the common objective of predicting the cell values in a table or matrix by using knowledge of the row and column totals. However, while in the transition probabilities estimation problem the variables in rows and columns are considered the same, generally reflecting different states among which a system or individual may transit from one time to the next, the EI problem admits that the variables in rows and columns may be different, reflecting cross-information about attributes of individuals or phenomena in the same period of time.
Hawkes (1969) was perhaps the first to apply Markov chain models as an alternative to the well-known Goodman regression (Goodman 1953, 1959), in a study of voter transitions. However, around that same time an important contribution was made in econometrics by Lee, Judge and Zellner (1970), who published a book with a comprehensive study on the estimation of Markov chain models. Among the various methods analyzed, the authors proposed an aggregate multinomial model with a constant matrix of transition probabilities for all time periods. They discussed a classical and a bayesian version of this model. In the latter, the probabilities were modeled as random variables following a Dirichlet prior distribution (which is a generalization of the beta distribution) with predefined parameters.
McRae (1977) rediscussed the aggregate multinomial model of Lee, Judge and Zellner (1970) and proposed to model systematic variations of transition probabilities using explanatory variables. Brown and Payne (1986) incorporated all these developments in the context of ecological inference. These authors followed a classical statistics view and proposed a compound multinomial model (Mosimann 1962; Johnson and Kotz 1969) to represent the behavior of the (unobservable) disaggregate data of each table's rows. From the aggregation of these various rows results the ACM model, whose parameters are those of the Dirichlet distribution. Brown and Payne used a reparameterization to generate the expected cell probabilities plus a set of parameters that capture overdispersions of the disaggregate frequencies.
KRT (1999) developed the BBH model for EI. The authors' formulation is quite similar to a 2x2 tables version of the ACM model, with a few though subtle differences. Following a bayesian approach, KRT used a compounding of the binomial and the beta distributions into a hierarchical structure. By doing so, KRT let the quantities of interest be the effective binomial probabilities (plus the parameters of the beta distribution). Considering the 2x2 tables case, this is an important difference from Brown and Payne's (1986) classical formulation, where the compounding (with the beta) of the binomial distribution is such that the binomial probabilities are eliminated by marginalization and only the parameters of the beta part are maintained. As mentioned above, a reparameterization is possible whereby the expected binomial probabilities and the over-dispersion parameters are obtained. Thus, the quantities of interest of the ACM and the BBH models are different.
Though a comparison between these two binomial-beta based models for EI may be fruitful, in this paper we restrict our attention to the BBH model. KRT made a successful implementation of it using MCMC methods, but these methods have the drawback of being (highly) computer intensive. As a consequence, they are too slow for practical ecological inference analysis. In the next sections, we explore an alternative and faster approach to using the same BBH model that may make it more useful for applications. Before proceeding, we state the problem and present some notation.
Technically, the EI problem refers to a situation where we have interest in the cell values of contingency tables for which only the row and column totals are known. The general approach followed in the literature to solve this problem has been to resort to statistical models, which serve to estimate or predict the unknown cell values. In this section, we present the EI problem for the case of a 2x2 contingency table and the basic notation to be used in the description of the BBH model.
Generally, letters written in uppercase will represent random variables, and in lowercase observed or known values. In the left part of table 1, variables $N_{B_i}$ and $N_{W_i}$ represent the unobservable disaggregate frequencies; these might be, for instance, the numbers of black and white people, respectively, who turn out to vote in the ith sampling unit or precinct. In their turn, variables $N_{T_i}$, $n_{X_i}$, and $n_i$ represent the observable aggregate frequencies, and can be seen as the numbers of people, in the ith precinct, who turn out to vote (black or white), who are black, and who are of voting age, respectively. The goal of EI consists of predicting values for $N_{B_i}$ and $N_{W_i}$ given knowledge of the values of $N_{T_i}$, $n_{X_i}$, and $n_i$, for $i = 1,\ldots,P$, where $P$ is the number of sampling units.
In the right part of table 1, we describe the same situation, though in an alternative way, with variables represented as proportions and defined as:
$$x_i = \frac{n_{X_i}}{n_i} \qquad (1)$$
$$T_i = \frac{N_{T_i}}{n_i} \qquad (2)$$
$$B_i = \frac{N_{B_i}}{n_{X_i}} \qquad (3)$$
$$W_i = \frac{N_{W_i}}{n_i - n_{X_i}}. \qquad (4)$$
The problem in this case consists of predicting values of $B_i$ and $W_i$ given knowledge of $T_i$ and $x_i$, for $i = 1,\ldots,P$. The use of proportions instead of frequencies to represent the response variables in EI models has been the most common approach followed in the EI literature (see Achen and Shively, 1995; and King, 1997). However, KRT developed the BBH model representing response variables as frequencies and, in this article, both forms of representing the EI problem will be considered.
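Definitions (1)-(4), together with the accounting identity (5), are easy to check numerically. A minimal sketch with hypothetical precinct counts (all values are illustrative, not from the paper's data):

```python
# Hypothetical counts for one precinct i (illustrative values only)
n_i = 1000             # people of voting age
n_X_i = 400            # blacks of voting age (row total, treated as known)
n_B_i = 240            # blacks who turn out (unobservable in practice)
n_W_i = 300            # whites who turn out (unobservable in practice)
n_T_i = n_B_i + n_W_i  # accounting identity (5): observed aggregate turnout

# Proportions, following definitions (1)-(4)
x_i = n_X_i / n_i             # (1)
T_i = n_T_i / n_i             # (2)
B_i = n_B_i / n_X_i           # (3)
W_i = n_W_i / (n_i - n_X_i)   # (4)

# In proportions the identity reads T_i = B_i * x_i + W_i * (1 - x_i)
assert abs(T_i - (B_i * x_i + W_i * (1 - x_i))) < 1e-12
```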
Table 1 - Representation of variables in the EI problem prior to observing the aggregate data

            Frequencies                                                    Proportions
Blacks      $N_{B_i}$     $n_{X_i} - N_{B_i}$           $n_{X_i}$         $B_i$   $1 - B_i$   $x_i$
Whites      $N_{W_i}$     $n_i - n_{X_i} - N_{W_i}$     $n_i - n_{X_i}$   $W_i$   $1 - W_i$   $1 - x_i$
Total       $N_{T_i}$     $n_i - N_{T_i}$               $n_i$             $T_i$   $1 - T_i$   $1$
Note the values for the row totals $n_{X_i}$ are written in lowercase letters because they are usually regarded as known or given (a convenient but strong assumption) in EI models. For this reason, when the aggregate value for the column total $n_{T_i}$ is observed, the EI problem is set up and the disaggregate data may be represented as conditional random variables. This situation is illustrated in table 2, where $n_{T_i}$ represents an observed value for the aggregate variable $N_{T_i}$ and the disaggregate variable $N_{B_i}$ is now represented conditionally as $N_{B_i} \mid n_{T_i}$. A prediction made for this variable automatically determines a prediction for the other disaggregate variable, because (conditionally) the two are linearly related according to $N_{W_i} \mid n_{T_i} = n_{T_i} - N_{B_i} \mid n_{T_i}$.
Table 2 - Disaggregate variables represented conditionally on the observed aggregate data (e.g., the Blacks row becomes $N_{B_i} \mid n_{T_i}$, $n_{X_i} - N_{B_i} \mid n_{T_i}$, with row total $n_{X_i}$)
$$N_{T_i} = N_{B_i} + N_{W_i} \qquad (5)$$
which, in proportions, is equivalent to the accounting identity:
$$T_i = B_i x_i + W_i (1 - x_i) \qquad (6)$$
Before we proceed, we define in this section two probability distributions that are relevant to our alternative approach to formulate and implement the BBH model.

Definition 1. The random vector $(X, \theta)^T$ follows a binomial-beta hierarchical distribution,
$$(X, \theta)^T \sim BBH(n, c, d) \qquad (7)$$
if:
i) $X \mid \theta \sim Bin(n, \theta)$ (8)
and:
ii) $\theta \sim Beta(c, d)$. (9)
Note that definition 1 characterizes a joint bivariate distribution for $X$ and $\theta$ developed by sequential conditioning, say:
$$p(x, \theta \mid n, c, d) = p(x \mid \theta, n) f(\theta \mid c, d) \qquad (10)$$
where $p(x \mid \theta, n)$ is the binomial density for $X \mid \theta$ and $f(\theta \mid c, d)$ is the beta density for $\theta$. From now on, we will refer to the variable $\theta$, or a particular value taken by it, as a binomial probability. The analytic representation of the BBH distribution is given by:
$$p(x, \theta \mid n, c, d) = \binom{n}{x} \frac{\theta^{x+c-1} (1-\theta)^{n-x+d-1}}{B(c, d)} \qquad (11)$$
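The compounding in (10) can be verified against the closed form (11) numerically; a small sketch using SciPy (the parameter values are arbitrary):

```python
import math
from scipy.stats import binom, beta

n, c, d = 10, 2.0, 3.0   # arbitrary BBH parameters
x, theta = 4, 0.35       # a point (x, theta) at which to evaluate the density

# Sequential-conditioning form (10): p(x, theta) = p(x | theta, n) * f(theta | c, d)
p_seq = binom.pmf(x, n, theta) * beta.pdf(theta, c, d)

# Closed form (11): C(n, x) * theta^(x+c-1) * (1-theta)^(n-x+d-1) / B(c, d)
B_cd = math.gamma(c) * math.gamma(d) / math.gamma(c + d)
p_closed = math.comb(n, x) * theta**(x + c - 1) * (1 - theta)**(n - x + d - 1) / B_cd

assert abs(p_seq - p_closed) < 1e-12
```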
Definition 2. The random vector $(X, \theta_1, \theta_2)^T$ follows an aggregate binomial-beta hierarchical distribution,
$$(X, \theta_1, \theta_2)^T \sim ABBH(n_1, c_1, d_1, n_2, c_2, d_2)$$
if:
i) $X \mid \theta_1, \theta_2 \sim ABin(n_1, n_2, \theta_1, \theta_2)$
and:
ii) $\theta_1 \sim Beta(c_1, d_1)$ and $\theta_2 \sim Beta(c_2, d_2)$, independently,
where $ABin(\cdot)$ stands for the aggregate binomial distribution (Forcina and Marchetti, 1989; Cleave, 1992). Note it is in general different from a binomial distribution, becoming the latter only in the case where $\theta_1 = \theta_2 = \theta$.

Definition 2 characterizes a joint trivariate distribution for $X$, $\theta_1$ and $\theta_2$, also developed by sequential conditioning, say:
$$p(x, \theta_1, \theta_2 \mid n_1, c_1, d_1, n_2, c_2, d_2) = p(x \mid \theta_1, \theta_2, n_1, n_2) f(\theta_1 \mid c_1, d_1) f(\theta_2 \mid c_2, d_2) \qquad (15)$$
where
$$p(x \mid \theta_1, \theta_2, n_1, n_2) = \sum_{x_1=a}^{b} \binom{n_1}{x_1} \binom{n_2}{x - x_1} \theta_1^{x_1} (1-\theta_1)^{n_1-x_1} \theta_2^{x-x_1} (1-\theta_2)^{n_2-x+x_1}$$
and
$$f(\theta_1 \mid c_1, d_1) f(\theta_2 \mid c_2, d_2) = \frac{\theta_1^{c_1-1} (1-\theta_1)^{d_1-1}}{B(c_1, d_1)} \frac{\theta_2^{c_2-1} (1-\theta_2)^{d_2-1}}{B(c_2, d_2)}$$
where the limits in the summation operator are given by $a = \max(0, x - n_2)$ and $b = \min(n_1, x)$.
We should note there is a relation between these two definitions. The aggregate binomially distributed random variable $X \mid \theta_1, \theta_2$, where $X = X_1 + X_2$, used in item i) of definition 2, is equivalent to the assumption that $X_1 \mid \theta_1$ and $X_2 \mid \theta_2$ are independent and each binomially distributed. This is a known result from probability theory that led us to use the term aggregate to name the ABBH distribution in definition 2.
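This sum-of-binomials characterization gives a direct way to compute the aggregate binomial pmf; a sketch (our own function name), using the summation limits $a = \max(0, x - n_2)$ and $b = \min(n_1, x)$ from definition 2:

```python
import math

def abin_pmf(x, n1, n2, t1, t2):
    """Aggregate binomial pmf: X = X1 + X2 with X1 ~ Bin(n1, t1)
    and X2 ~ Bin(n2, t2) independent."""
    a, b = max(0, x - n2), min(n1, x)  # summation limits from definition 2
    return sum(
        math.comb(n1, x1) * t1**x1 * (1 - t1)**(n1 - x1)
        * math.comb(n2, x - x1) * t2**(x - x1) * (1 - t2)**(n2 - (x - x1))
        for x1 in range(a, b + 1)
    )

# With t1 = t2 = t the ABin collapses to a plain Bin(n1 + n2, t)
t = 0.3
assert abs(abin_pmf(5, 8, 12, t, t)
           - math.comb(20, 5) * t**5 * (1 - t)**15) < 1e-12

# In general it is a proper pmf on the support 0, ..., n1 + n2
assert abs(sum(abin_pmf(x, 8, 12, 0.3, 0.7) for x in range(21)) - 1.0) < 1e-12
```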
As we will see in the next section, KRT used only the BBH distribution to formulate their BBH model, while we had to use, in a different fashion, both the BBH and the ABBH distributions because of our alternative approach to implement the model using the ECM algorithm.
We present in this section two versions of the BBH model. Both are developed under a bayesian approach and structured similarly according to three hierarchical stages. The basic difference between them refers to the assumptions made in the first stage for the data generation process (DGP). The first version models the disaggregate DGP using binomial distributions, with the result that the aggregate DGP is, necessarily, modeled according to aggregate binomial distributions. This version forms the basis for the fast implementation method we present ahead in Section 6. The second version, which is KRT's, models the aggregate DGP directly using binomial distributions with a particular design for the binomial probabilities. This second version makes no assumption about the disaggregate DGP. In order to present these two versions, we consider the EI problem description in terms of frequencies (table 1) with P precincts.
In the first stage, the disaggregate data variables $N_{B_i}$ and $N_{W_i}$ at the ith precinct are assumed to follow independent binomial distributions with known counts $n_i$ and $n_{X_i}$, and binomial probabilities $\beta_i$ and $\omega_i$. From the accounting identity in (5), this results in the aggregate data variable $N_{T_i}$ following an aggregate binomial distribution. In the second stage, the binomial probabilities $\beta_i$ and $\omega_i$ are assumed to be sampled from beta distributions with parameters $(c_b, d_b)$ and $(c_w, d_w)$, respectively, where these parameters are taken to be constant across all precincts. In the third and last stage, the beta parameters are assumed to follow non-informative priors. The formal description of the BBH model in this case can be written as:
$$N_{B_i} \mid \beta_i \sim Bin(n_{X_i}, \beta_i) \qquad (17)$$
$$N_{W_i} \mid \omega_i \sim Bin(n_i - n_{X_i}, \omega_i) \qquad (18)$$
$$N_{T_i} \mid \beta_i, \omega_i \sim ABin(n_{X_i}, n_i, \beta_i, \omega_i) \qquad (19)$$
$$\beta_i \mid c_b, d_b \sim Beta(c_b, d_b) \qquad (20)$$
$$\omega_i \mid c_w, d_w \sim Beta(c_w, d_w) \qquad (21)$$
$$c_b, d_b, c_w, d_w \sim n.i.p.d. \qquad (22)$$
for $i = 1,\ldots,P$, and where n.i.p.d. stands for non-informative prior distribution. For simplicity, in this article we work with n.i.p.d.s of the uniform type defined on (0,10] for the beta parameters (alternatively, KRT assumed as priors for these parameters independent exponential distributions with high mean). The value 10 for the upper truncation of these priors was introduced so that we have proper priors.
Expressions (17) and (18), which refer to the first stage, represent the disaggregate DGP with independent binomial models. Taking them together with expressions (20)-(21), which refer to the second stage, we characterize independent BBH distributions (see definition 1) for the random vectors $(N_{B_i}, \beta_i)^T$ and $(N_{W_i}, \omega_i)^T$, respectively. In its turn, expression (19) represents the aggregate DGP with an aggregate binomial model. Taking it together with expressions (20)-(21), we characterize an ABBH distribution (see definition 2) for the random vector $(N_{T_i}, \beta_i, \omega_i)^T$.
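The three stages can be simulated directly, which is also a useful check on the notation; the sketch below (arbitrary parameter values, hypothetical precinct sizes) draws the binomial probabilities from the second stage (20)-(21) and the disaggregate frequencies from the first stage (17)-(18), so that the aggregate variable in (19) arises through the accounting identity (5):

```python
import numpy as np

rng = np.random.default_rng(0)
P = 50                                   # number of precincts (arbitrary)
n = rng.integers(500, 2000, size=P)      # voting-age populations (hypothetical)
n_X = rng.binomial(n, 0.4)               # black counts, treated as known

c_b, d_b, c_w, d_w = 4.0, 2.0, 2.0, 5.0  # beta parameters (arbitrary values)

# Second stage (20)-(21): precinct-level binomial probabilities
beta_i = rng.beta(c_b, d_b, size=P)
omega_i = rng.beta(c_w, d_w, size=P)

# First stage (17)-(18): independent binomial disaggregate DGP
N_B = rng.binomial(n_X, beta_i)
N_W = rng.binomial(n - n_X, omega_i)

# Accounting identity (5): the aggregate N_T then follows the ABin in (19)
N_T = N_B + N_W
assert np.all(N_B <= n_X) and np.all(N_W <= n - n_X) and np.all(N_T <= n)
```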
The vector of quantities of interest is given by $\psi^T = [\beta^T, \omega^T, h^T]$, where $\beta^T = [\beta_1, \ldots, \beta_P]$, $\omega^T = [\omega_1, \ldots, \omega_P]$, and $h^T = [c_b, d_b, c_w, d_w]$. Note the size of vector $\psi$ depends on the number of observations, as it has $2P + 4$ elements. By assuming independence among sampling units, say, that $(N_{B_i}, N_{W_i})$ is independent of $(N_{B_j}, N_{W_j})$, which then implies $N_{T_i}$ is independent of $N_{T_j}$, for $i \neq j$, we can build the aggregate posterior $P_A$ as:
$$P_A(\psi \mid n_T) \propto \prod_{i=1}^{P} \left[ \left( \frac{\omega_i}{1-\omega_i} \right)^{n_{T_i}} (1-\beta_i)^{n_{X_i}} (1-\omega_i)^{n_i - n_{X_i}} \sum_{n_{B_i}=n_{B_i}^L}^{n_{B_i}^U} \binom{n_{X_i}}{n_{B_i}} \binom{n_i - n_{X_i}}{n_{T_i} - n_{B_i}} \left( \frac{\beta_i (1-\omega_i)}{\omega_i (1-\beta_i)} \right)^{n_{B_i}} \frac{\beta_i^{c_b-1} (1-\beta_i)^{d_b-1}}{B(c_b, d_b)} \frac{\omega_i^{c_w-1} (1-\omega_i)^{d_w-1}}{B(c_w, d_w)} \right] \qquad (26)$$
$$p(n_{B_i} \mid n_T) = \int_0^1 \int_0^1 p(n_{B_i} \mid n_{T_i}, \beta_i, \omega_i) \, p(\beta_i, \omega_i \mid n_T) \, d\beta_i \, d\omega_i \qquad (27)$$
$$p(n_{W_i} \mid n_T) = \int_0^1 \int_0^1 p(n_{W_i} \mid n_{T_i}, \beta_i, \omega_i) \, p(\beta_i, \omega_i \mid n_T) \, d\beta_i \, d\omega_i \qquad (28)$$
where $n_T^T = [n_{T_1}, \ldots, n_{T_P}]$ is the observed aggregate data. These predictive posteriors are expected to reflect our uncertainty with regard to the values of the disaggregate variables $N_{B_i}$ and $N_{W_i}$.
In KRT's formulation, the disaggregate DGP in the first stage is not considered. The authors model the aggregate DGP directly by assuming $N_{T_i}$ follows a binomial distribution, with a given count $n_i$ and an aggregate binomial probability $\beta_i x_i + \omega_i (1 - x_i)$. That is to say, under KRT's formulation we have:
$$N_{T_i} \mid \beta_i, \omega_i \sim Bin(n_i, \beta_i x_i + \omega_i (1 - x_i)) \qquad (29)$$
in place of expressions (17), (18) and (19) to represent the first stage of the BBH model. Note that, by taking (29) together with expressions (20)-(21), a BBH distribution is characterized for the random vector $(N_{T_i}, \beta_i x_i + \omega_i (1 - x_i))^T$.
Following our considerations in Section 3, the only instance in which the binomial distribution in (29) is consistent with our assumptions for the disaggregate DGP, namely that $N_{B_i} \mid \beta_i \sim Bin(n_{X_i}, \beta_i)$ and $N_{W_i} \mid \omega_i \sim Bin(n_i - n_{X_i}, \omega_i)$, is when $\beta_i = \omega_i$ (in the more general setting, assumed by KRT, that $\beta_i \neq \omega_i$, the sum $N_{T_i} = N_{B_i} + N_{W_i}$ would necessarily follow an aggregate binomial distribution, as seen in Section 5.1).
Another difference of KRT's formulation from the one based on the disaggregate DGP is that, instead of using the predictive posteriors in (27) and (28), KRT undertook the inferences at precinct level using the marginal posteriors for binomial probabilities: $p(\beta_i \mid n_T)$ and $p(\omega_i \mid n_T)$. The authors used these distributions in full to summarize the uncertainty about the disaggregate data, and obtained them from the joint posterior for vector $\psi$, which we label here as $P_A^*$. By considering $p(h)$ constant $= 10^{-4}$ (since we are using independent uniform priors on (0,10] for the elements of $h$), this posterior is written:
$$P_A^*(\psi \mid n_T) \propto \prod_{i=1}^{P} \binom{n_i}{n_{T_i}} \left( \beta_i x_i + \omega_i (1-x_i) \right)^{n_{T_i}} \left( 1 - \beta_i x_i - \omega_i (1-x_i) \right)^{n_i - n_{T_i}} \frac{\beta_i^{c_b-1} (1-\beta_i)^{d_b-1}}{B(c_b, d_b)} \frac{\omega_i^{c_w-1} (1-\omega_i)^{d_w-1}}{B(c_w, d_w)} \qquad (30)$$
$$p(\beta_i \mid n_T) = \int_{A_{/}} P_A^*(\psi \mid n_T) \, d\psi_{/} \qquad (31)$$
$$p(\omega_i \mid n_T) = \int_{A_{//}} P_A^*(\psi \mid n_T) \, d\psi_{//} \qquad (32)$$
In (31), $\psi_{/}$ is a $(2P + 3) \times 1$ vector composed of all the elements of vector $\psi$ except $\beta_i$, and $A_{/} \subset R^{2P+3}$ is the subspace of admissible values for $\psi_{/}$. Analogous definitions apply in the case of $\omega_i$, say, for $\psi_{//}$ and $A_{//}$ in (32).
6 Posterior Maximization
In this section, we discuss the maximization of the aggregate posterior from the BBH model version based on the disaggregate DGP, and propose a suitable method to handle this maximization. Our goal is to maximize the expression for $P_A$ given in (26). We follow the usual practice of working with its log version $lp_A = \log P_A$, which is written as:
$$lp_A(\psi \mid n_T) = C + \sum_{i=1}^{P} n_{T_i} \ln\left(\frac{\omega_i}{1-\omega_i}\right) + \sum_{i=1}^{P} n_{X_i} \ln(1-\beta_i) + \sum_{i=1}^{P} (n_i - n_{X_i}) \ln(1-\omega_i)$$
$$+ \sum_{i=1}^{P} \ln \left[ \sum_{n_{B_i}=n_{B_i}^L}^{n_{B_i}^U} \binom{n_{X_i}}{n_{B_i}} \binom{n_i - n_{X_i}}{n_{T_i} - n_{B_i}} \left( \frac{\beta_i (1-\omega_i)}{\omega_i (1-\beta_i)} \right)^{n_{B_i}} \right]$$
$$+ (c_b - 1) \sum_{i=1}^{P} \ln \beta_i + (d_b - 1) \sum_{i=1}^{P} \ln(1-\beta_i) + (c_w - 1) \sum_{i=1}^{P} \ln \omega_i + (d_w - 1) \sum_{i=1}^{P} \ln(1-\omega_i)$$
$$- P \ln B(c_b, d_b) - P \ln B(c_w, d_w) \qquad (33)$$
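For a numerical check, the log posterior (33) can be evaluated stably by keeping the binomial coefficients and the inner sum in log scale; a sketch (function and variable names are ours, not those of the paper's Matlab routines):

```python
import numpy as np
from scipy.special import gammaln, betaln, logsumexp

def log_comb(n, k):
    # log of the binomial coefficient C(n, k)
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def lp_A(beta_i, omega_i, c_b, d_b, c_w, d_w, nT, nX, n):
    """Log aggregate posterior (33), up to the additive constant C."""
    total = 0.0
    for t, m, N, b, w in zip(nT, nX, n, beta_i, omega_i):
        lo, hi = max(0, t - (N - m)), min(m, t)   # Duncan-Davis summation limits
        nB = np.arange(lo, hi + 1)
        # log of the inner sum over the hidden n_B
        log_terms = (log_comb(m, nB) + log_comb(N - m, t - nB)
                     + nB * (np.log(b) + np.log1p(-w) - np.log(w) - np.log1p(-b)))
        total += (t * (np.log(w) - np.log1p(-w)) + m * np.log1p(-b)
                  + (N - m) * np.log1p(-w) + logsumexp(log_terms))
        # second-stage beta priors on (beta_i, omega_i)
        total += ((c_b - 1) * np.log(b) + (d_b - 1) * np.log1p(-b)
                  + (c_w - 1) * np.log(w) + (d_w - 1) * np.log1p(-w))
    return total - len(nT) * (betaln(c_b, d_b) + betaln(c_w, d_w))
```

With flat priors ($c_b = d_b = c_w = d_w = 1$), $\exp(lp_A)$ reduces to the aggregate binomial likelihood, which gives a convenient sanity check against a direct convolution of two binomials.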
1 At an early stage of our research, we tried this approach to maximize the $P_A^*$ in (30) using a simulated dataset with 100 observations. Function CONSTR of the Matlab language's optimization toolbox, which implements the sequential quadratic programming method, was used to maximize the $lp_A^*$. The algorithm processed 20,558 iterations over four hours on a PC with a 366 MHz Intel Celeron processor before achieving convergence. Brown and Payne (1986) reported a similar problem when implementing the ACM model with a Newton algorithm.
$$P_D(\psi \mid n_B, n_W) \propto \frac{\prod_{i=1}^{P} \beta_i^{n_{B_i}+c_b-1} (1-\beta_i)^{n_{X_i}-n_{B_i}+d_b-1} \, \omega_i^{n_{W_i}+c_w-1} (1-\omega_i)^{n_i-n_{X_i}-n_{W_i}+d_w-1}}{[B(c_b, d_b) B(c_w, d_w)]^P} \qquad (34)$$
$$lp_D(\psi \mid n_B, n_W) = C + \sum_{i=1}^{P} (n_{B_i} + c_b - 1) \log \beta_i + \sum_{i=1}^{P} (n_{X_i} - n_{B_i} + d_b - 1) \log(1-\beta_i)$$
$$+ \sum_{i=1}^{P} (n_{W_i} + c_w - 1) \log \omega_i + \sum_{i=1}^{P} (n_i - n_{X_i} - n_{W_i} + d_w - 1) \log(1-\omega_i)$$
$$- P \log B(c_b, d_b) - P \log B(c_w, d_w) \qquad (35)$$
The Q-function in the present case is the expected value of the $lp_D$ in (35), conditioned on the observed aggregate data $n_T^T = [n_{T_1}, \ldots, n_{T_P}]$ and on a parameter guess $\psi_k$. It can thus be represented as:
$$Q(\psi \mid \psi_k) = E[lp_D(\psi \mid N_B, N_W) \mid n_T, \psi_k] \qquad (36)$$
From (35), we have that $lp_D$ is linear in the $n_{B_i}$'s and the $n_{W_i}$'s, the elements of the disaggregate data vectors $n_B$ and $n_W$, respectively. This allows us to compute the Q-function by replacing the unobservable disaggregate data with their respective conditional predictions, as follows:
$$Q(\psi \mid \psi_k) = \sum_{i=1}^{P} (\bar{n}_{B_i,k} + c_b - 1) \log \beta_i + \sum_{i=1}^{P} (n_{X_i} - \bar{n}_{B_i,k} + d_b - 1) \log(1-\beta_i)$$
$$+ \sum_{i=1}^{P} (\bar{n}_{W_i,k} + c_w - 1) \log \omega_i + \sum_{i=1}^{P} (n_i - n_{X_i} - \bar{n}_{W_i,k} + d_w - 1) \log(1-\omega_i)$$
$$- P \log B(c_b, d_b) - P \log B(c_w, d_w) \qquad (37)$$
where:
$$\bar{n}_{B_i,k} = E[N_{B_i} \mid n_{T_i}, \beta_{i,k}, \omega_{i,k}] \qquad (38)$$
$$\bar{n}_{W_i,k} = n_{T_i} - \bar{n}_{B_i,k} \qquad (39)$$
Box 1. EMA scheme for maximizing the aggregate posterior of the BBH model
The EMA theory (e.g., McLachlan and Krishnan, 1997) assures that the scheme of steps 1-4 in box 1 presents good convergence properties: For instance, provided that $Q(\psi_{k+1} \mid \psi_k) \geq Q(\psi_k \mid \psi_k)$, as is guaranteed in the M-step, the $lp_A$ will increase monotonically at every iteration. However, the EMA in the present case is even worse than a quasi-Newton algorithm, because the Q-function also does not display an analytic maximum and depends on the same number of arguments. The M-step involves an iterative search on a space with $2P + 4$ dimensions (to be done perhaps with a quasi-Newton algorithm), and since this step is run at every iteration of the EMA, the mode search would take many times longer.
6.2 The ECM Algorithm
Fortunately, an extension of the EMA proposed by Meng and Rubin (1993), known as the Expectation-Conditional-Maximization Algorithm (ECMA), is a viable alternative. The difference between the EMA and the ECMA is that the latter replaces the M-step of the former by conditional maximization (CM) steps which, individually, are simpler than the overall maximization in a single M-step. This is a useful device, for instance, when the M-step is non-analytic but some or all of the CM-steps are analytic. In such a case, though the ECMA may converge in more iterations, it can do so faster than the EMA. The replacement of the M-step by CM-steps is not equivalent to maximizing the Q-function, but it is such that $Q(\psi_{k+1} \mid \psi_k) \geq Q(\psi_k \mid \psi_k)$ at every iteration, and thus the ECMA presents the same monotonicity property as the EMA (Meng and Rubin, 1993; see also Schafer, 1997).
For every problem, the number of CM-steps we may use naturally depends on the number of arguments of the Q-function, but more important is our choice of partition for vector $\psi$. In the present case, a partitioning resulting in three CM-steps of interest can be achieved. To make this clear, let us consider first the maximization of the Q-function in (37), which has to be undertaken at the M-step of the EMA. By applying first-order conditions for a maximum, say, differentiating that function partially with respect to each element of $\psi$, equating the derivatives to zero, and making a few algebraic manipulations, we obtain the following equation system:
$$\beta_i = \frac{\bar{n}_{B_i} + c_b - 1}{n_{X_i} + c_b + d_b - 2}, \quad i = 1, \ldots, P \qquad (40)$$
$$\omega_i = \frac{\bar{n}_{W_i} + c_w - 1}{n_i - n_{X_i} + c_w + d_w - 2}, \quad i = 1, \ldots, P \qquad (41)$$
$$\frac{\partial B(c_b, d_b)/\partial c_b}{B(c_b, d_b)} = \frac{\sum_{i=1}^{P} \log \beta_i}{P} \qquad (42)$$
$$\frac{\partial B(c_b, d_b)/\partial d_b}{B(c_b, d_b)} = \frac{\sum_{i=1}^{P} \log(1-\beta_i)}{P} \qquad (43)$$
$$\frac{\partial B(c_w, d_w)/\partial c_w}{B(c_w, d_w)} = \frac{\sum_{i=1}^{P} \log \omega_i}{P} \qquad (44)$$
$$\frac{\partial B(c_w, d_w)/\partial d_w}{B(c_w, d_w)} = \frac{\sum_{i=1}^{P} \log(1-\omega_i)}{P} \qquad (45)$$
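Since $\partial \log B(c,d)/\partial c = \psi(c) - \psi(c+d)$, with $\psi$ the digamma function, equations (42)-(43) form a two-equation system in $(c_b, d_b)$ given the current $\beta_i$ values (and likewise (44)-(45) for $(c_w, d_w)$). A sketch of a numerical solution; the solver choice and the method-of-moments starting values are our own:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import fsolve

def solve_beta_params(probs):
    """Solve psi(c) - psi(c+d) = mean(log p) and psi(d) - psi(c+d) =
    mean(log(1-p)), i.e., the system (42)-(43), for (c, d) given probs."""
    m1, m2 = np.mean(np.log(probs)), np.mean(np.log1p(-probs))
    # method-of-moments starting values (our choice, for a safe start)
    mu, v = np.mean(probs), np.var(probs)
    s = mu * (1 - mu) / v - 1
    def eqs(p):
        c, d = p
        return [digamma(c) - digamma(c + d) - m1,
                digamma(d) - digamma(c + d) - m2]
    c, d = fsolve(eqs, [mu * s, (1 - mu) * s])
    return c, d

# Sanity check: recover the parameters of a known Beta(3, 2) from a large sample
rng = np.random.default_rng(1)
c_hat, d_hat = solve_beta_params(rng.beta(3.0, 2.0, size=200_000))
assert abs(c_hat - 3.0) < 0.1 and abs(d_hat - 2.0) < 0.1
```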
Maximizing the Q-function is then equivalent to solving the system (40)-(45), which is intrinsically nonlinear. If we use an iterative search algorithm like a quasi-Newton method to solve it, this algorithm may take too long because of the large number of equations and unknowns ($2P + 4$ for both). This results in a complicated M-step that makes the EMA practically useless in this case, as we noted before. However, if we replace this M-step by suitably chosen CM-steps, we can obtain substantial improvements in processing speed, as we explain now.
Consider the following partition of vector $\psi$ into three subvectors:
$$\gamma^T = [\beta^T, \omega^T] \qquad (47)$$
$$h_b^T = [c_b, d_b] \qquad (48)$$
$$h_w^T = [c_w, d_w] \qquad (49)$$
and let there be given a current guess $\psi_k^T = [\gamma_k^T, h_{b,k}^T, h_{w,k}^T]$. Then, at the M-step of the EMA scheme, replace the general maximization of the Q-function by the conditional maximizations:
Box 2. ECMA scheme for maximizing the aggregate posterior of the BBH model.

2. E-step: Compute the conditional expected values for the disaggregate data:
$$\bar{n}_{B_i,k} = E[N_{B_i} \mid n_{T_i}, \beta_{i,k}, \omega_{i,k}]$$
$$\bar{n}_{W_i,k} = n_{T_i} - \bar{n}_{B_i,k};$$
3. CM1-step: Maximize the function $Q(\gamma, h_{b,k}, h_{w,k} \mid \psi_k)$ for $\gamma$, by computing:
$$\beta_{i,k+1} = \frac{\bar{n}_{B_i,k} + c_{b,k} - 1}{n_{X_i} + c_{b,k} + d_{b,k} - 2}$$
$$\omega_{i,k+1} = \frac{\bar{n}_{W_i,k} + c_{w,k} - 1}{n_i - n_{X_i} + c_{w,k} + d_{w,k} - 2}$$
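The E-step and CM1-step above can be sketched in code as follows; this is an illustrative fragment under our notation, not the paper's Matlab implementation, and the remaining CM-steps for the beta parameters (via (42)-(45)) are omitted:

```python
import numpy as np
from scipy.special import gammaln

def log_comb(n, k):
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def e_step_mean(nT, nX, n, b, w):
    """E[N_B | n_T, beta, omega] under the noncentral hypergeometric
    predictive distribution used in the E-step."""
    lo, hi = max(0, nT - (n - nX)), min(nX, nT)   # Duncan-Davis limits
    nB = np.arange(lo, hi + 1)
    logp = (log_comb(nX, nB) + log_comb(n - nX, nT - nB)
            + nB * (np.log(b) + np.log1p(-w) - np.log(w) - np.log1p(-b)))
    p = np.exp(logp - logp.max())
    return float((nB * p).sum() / p.sum())

def ecm_e_cm1(beta_i, omega_i, c_b, d_b, c_w, d_w, nT, nX, n):
    """One E-step plus the closed-form CM1-step of box 2."""
    nB_bar = np.array([e_step_mean(nT[i], nX[i], n[i], beta_i[i], omega_i[i])
                       for i in range(len(nT))])
    nW_bar = nT - nB_bar                           # accounting identity
    beta_new = (nB_bar + c_b - 1) / (nX + c_b + d_b - 2)
    omega_new = (nW_bar + c_w - 1) / (n - nX + c_w + d_w - 2)
    return beta_new, omega_new

# When beta = omega the predictive distribution is central hypergeometric,
# whose mean is n_T * n_X / n
assert abs(e_step_mean(5, 8, 20, 0.3, 0.3) - 5 * 8 / 20) < 1e-9
```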
We conclude this section by adding that our method based on the ECMA allows further improvements in computational efficiency. This might be accomplished by adding more cycles of E- and CM-steps that may produce larger increases in the Q-function at every iteration. Such procedures follow further extensions already available in the EMA literature (see Meng and van Dyk, 1997).
7 Methodological Issues
In their approach to the BBH model, KRT used the marginal posteriors for the binomial probabilities, $p(\beta_i \mid n_T)$ and $p(\omega_i \mid n_T)$, to make inferences regarding values of the disaggregate proportions, $B_i$ and $W_i$. As mentioned before, this procedure is not the most appropriate if we want desirable EI properties, like aggregation consistency and consideration of the Duncan-Davis bounds, to be satisfied. We argue, instead, that when using the BBH model we should make EI by means of the predictive posteriors for the disaggregate frequencies, $p(n_{B_i} \mid n_T)$ and $p(n_{W_i} \mid n_T)$, or better, the predictive posteriors for the disaggregate proportions, $p(b_i \mid n_T)$ and $p(w_i \mid n_T)$. The key reason is that, by construction, predictive posteriors always respect the accounting identity, which guarantees that the disaggregate data predictions will display aggregation consistency and respect the deterministic bounds.
In this section, we discuss these issues in some detail for the version of the BBH model based on the disaggregate DGP. We begin by exploring why the use of posteriors for binomial probabilities is a procedure of inference that generally does not respect the accounting identity. Then we move on to explore the appropriate use of predictive posteriors. An additional benefit of such a discussion is that we can learn principles for EI model development in general. In Appendix 3, we extend this analysis to KRT's version.
$$N_{B_i} \sim Bin(n_{X_i}, \beta_i) \qquad (53)$$
$$N_{W_i} \sim Bin(n_i - n_{X_i}, \omega_i) \qquad (54)$$
$$N_{T_i} \sim ABin(n_{X_i}, n_i, \beta_i, \omega_i) \qquad (55)$$
for $i = 1, \ldots, P$. The $2P \times 1$ vector of unknown parameters for this model is given by:
$$\gamma^T = [\beta^T, \omega^T] \qquad (56)$$
$$L_A(\gamma \mid n_T) = \prod_{i=1}^{P} Abin(n_{T_i} \mid n_{X_i}, n_i, \beta_i, \omega_i)$$
$$= \prod_{i=1}^{P} \left( \frac{\omega_i}{1-\omega_i} \right)^{n_{T_i}} (1-\beta_i)^{n_{X_i}} (1-\omega_i)^{n_i - n_{X_i}} \sum_{n_{B_i}=n_{B_i}^L}^{n_{B_i}^U} \binom{n_{X_i}}{n_{B_i}} \binom{n_i - n_{X_i}}{n_{T_i} - n_{B_i}} \left( \frac{\beta_i (1-\omega_i)}{\omega_i (1-\beta_i)} \right)^{n_{B_i}} \qquad (57)$$
$$l_A(\gamma \mid n_T) = \sum_{i=1}^{P} n_{T_i} \ln\left(\frac{\omega_i}{1-\omega_i}\right) + \sum_{i=1}^{P} n_{X_i} \ln(1-\beta_i) + \sum_{i=1}^{P} (n_i - n_{X_i}) \ln(1-\omega_i)$$
$$+ \sum_{i=1}^{P} \ln \left[ \sum_{n_{B_i}=n_{B_i}^L}^{n_{B_i}^U} \binom{n_{X_i}}{n_{B_i}} \binom{n_i - n_{X_i}}{n_{T_i} - n_{B_i}} \left( \frac{\beta_i (1-\omega_i)}{\omega_i (1-\beta_i)} \right)^{n_{B_i}} \right] \qquad (58)$$
We can easily verify that $l_A$ has an infinite number of stationary points. By applying first-order conditions for a maximum and manipulating, we verify it is not possible to obtain closed-form solutions for the parameters as functions of the aggregate observations $n_T$ and $n_X$. Nevertheless, it is possible to obtain the following relation between the elements of $(\hat{\beta}_i, \hat{\omega}_i)$:
$$\hat{\omega}_i = \frac{t_i}{1-x_i} - \frac{x_i}{1-x_i} \hat{\beta}_i \qquad (59)$$
where $t_i = n_{T_i}/n_i$ and $x_i = n_{X_i}/n_i$. This equation represents, for the ith precinct, a tomography line disposed over the unit square (see fig. 1). Any one of the infinitely many points on this line may be a solution to the problem of maximizing $l_A$. This infinite number of solutions happens because the structure of the aggregate binomial model does not allow us to identify its $2P$ parameters.
More important concerning the aggregate binomial model is that any feasible solution to the likelihood equations can be seen as an EI prediction that displays aggregation consistency and respects the Duncan-Davis bounds. This happens because, for a given aggregate proportion $T_i = t_i$, equation (59) is equivalent to the accounting identity in (6). We further explain this point using fig. 1, which shows the space of possible values for the binomial probabilities $(\beta_i, \omega_i)$, consisting of the unit square $[0,1] \times [0,1]$. The negatively sloped line represents graphically the accounting identity associated with a particular aggregate data pair $(t_i, x_i)$. The projections of this line onto the axes are the admissible intervals $[L_i^b, U_i^b]$ and $[L_i^w, U_i^w]$ in proportions. It is clear that points on the line (satisfying the accounting identity) are predictions which respect the bounds and present aggregation consistency. Points off the line are predictions which do not display aggregation consistency, but which may or may not respect the intervals: For instance, the circles respect both intervals, while the dark points respect only one of the intervals, and the remaining point respects neither interval.
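The admissible intervals and the tomography line can be computed directly from the aggregate pair $(t_i, x_i)$; a sketch (the interval formulas follow from solving the accounting identity with $B_i, W_i \in [0,1]$; the function names are ours):

```python
def duncan_davis_bounds(t, x):
    """Deterministic bounds on (B_i, W_i) implied by t = B*x + W*(1-x)
    with both proportions confined to [0, 1]."""
    Lb, Ub = max(0.0, (t - (1 - x)) / x), min(1.0, t / x)
    Lw, Uw = max(0.0, (t - x) / (1 - x)), min(1.0, t / (1 - x))
    return (Lb, Ub), (Lw, Uw)

def tomography_omega(b, t, x):
    """Tomography line (59): omega as a function of beta for fixed (t, x)."""
    return t / (1 - x) - b * x / (1 - x)

(Lb, Ub), (Lw, Uw) = duncan_davis_bounds(t=0.5, x=0.4)
# The endpoints of the beta interval map onto the endpoints of the omega
# interval: the line's projections onto the axes are exactly these bounds
assert abs(tomography_omega(Lb, 0.5, 0.4) - Uw) < 1e-12
assert abs(tomography_omega(Ub, 0.5, 0.4) - Lw) < 1e-12
```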
What happens when we add to the binomial model the other stages in the hierarchy of the BBH model? First, the aggregate posterior $P_A$ and its log version take new functional forms, as presented earlier in (26) and (33). Second, by applying first-order conditions for a maximum to the log posterior in (33) and manipulating, it is easy to show that the relationship between the modal values $\hat{\beta}_i$ and $\hat{\omega}_i$ takes an implicit nonlinear functional form, as follows:
$$\hat{\omega}_i = \frac{t_i}{1-x_i} - \frac{x_i}{1-x_i} \hat{\beta}_i + F(\hat{\beta}_i, \hat{\omega}_i, x_i, c_b, d_b, c_w, d_w) \qquad (60)$$
According to this expression, for the accounting identity to be valid it is necessary that $F_i = 0$ for any value of its arguments and for all $i = 1, \ldots, P$. But this term is written as:
$$F_i = \frac{A_i}{n_i (1-x_i)} \qquad (61)$$
where:
$$A_i = (c_b - 1)(1 - \hat{\beta}_i) - (d_b - 1)\hat{\beta}_i + (c_w - 1)(1 - \hat{\omega}_i) - (d_w - 1)\hat{\omega}_i \qquad (62)$$
The situations in which $F_i = 0$ are too restrictive. Provided that, at the same time, $\hat{\beta}_i$, $\hat{\omega}_i$, and $x_i$ belong to the open interval (0,1), this will happen, for instance, when $c_b = d_b = c_w = d_w = 1$. It is clear, though, that $F_i \neq 0$ in an infinite number of situations where these beta parameters are such that $A_i \neq 0$.
Even if the accounting identity does not hold true, the values of the marginal posterior modes for $\beta_i$ and $\omega_i$ may still respect the bounds. Fig. 1 illustrates this point by means of the small circles, which represent predicted values for disaggregate proportions that are off the tomography line and thus do not respect the accounting identity. These circles, though, respect the bounded intervals for both variables. The problem in this case is that we do not have aggregation consistency anymore, as the disaggregate data predictions $\hat{\beta}_i$ and $\hat{\omega}_i$, when aggregated (i.e., by substituting them for $B_i$ and $W_i$ in expression (6)), do not fit the observed aggregate proportion $t_i$.
The above considerations lead us to conclude that the proper way of making EI with the BBH model is by using the predictive distributions, which are precisely the conditional distributions for the response (disaggregate) variables given the aggregate data: $p(n_{B_i} \mid n_{T_i})$ and $p(n_{W_i} \mid n_{T_i})$. The advantage of using these distributions is that they always respect the accounting identity, and, consequently, the properties of aggregation consistency and of allowance for the Duncan-Davis bounds will also always be satisfied. This statement bears a wider importance, for it is valid not only for the BBH model under consideration, but also for any statistically based EI model where ecological inference is treated as a prediction problem (McCue, 2001). [This is a central difference between the ways KRT and King (1997) use their respective BBH and TBN models for making EI. Only the latter is used with predictive distributions as the vehicle of ecological inference, and thus only the latter produces EI predictions which respect the accounting identity.] All these considerations are valid also for KRT's version of the BBH model (see Appendix 3).
$$p(n_{B_i} \mid n_{T_i}, \beta_i, \omega_i, h) \qquad (63)$$
From the hierarchical construction of the BBH model, it follows that the disaggregate response variables $N_{B_i}$, $N_{W_i}$ and $N_{T_i}$ are distributed independently of the vector of beta parameters $h$ (see Section 5). This allows us to manipulate expression (63) to produce the following simplification:
$$p(n_{B_i} \mid n_{T_i}, \beta_i, \omega_i, h) = \frac{p(n_{B_i}, n_{T_i} \mid \beta_i, \omega_i) \, p(\beta_i, \omega_i \mid h) \, p(h)}{p(n_{T_i} \mid \beta_i, \omega_i) \, p(\beta_i, \omega_i \mid h) \, p(h)} = p(n_{B_i} \mid n_{T_i}, \beta_i, \omega_i) \qquad (64)$$
From this result, we are allowed to represent the conditional predictive distribution in (64) as a noncentral hypergeometric distribution (McCullagh and Nelder 1989, pp. 257-259), say:
$$p(n_{B_i} \mid n_{T_i}, \beta_i, \omega_i) = \frac{\binom{n_{X_i}}{n_{B_i}} \binom{n_i - n_{X_i}}{n_{T_i} - n_{B_i}} \left( \frac{\beta_i (1-\omega_i)}{\omega_i (1-\beta_i)} \right)^{n_{B_i}}}{\pi_0[\beta_i, \omega_i]} \qquad (65)$$
where:
$$\pi_0[\beta_i, \omega_i] = \sum_{n_{B_i}=n_{B_i}^L}^{n_{B_i}^U} \binom{n_{X_i}}{n_{B_i}} \binom{n_i - n_{X_i}}{n_{T_i} - n_{B_i}} \left( \frac{\beta_i (1-\omega_i)}{\omega_i (1-\beta_i)} \right)^{n_{B_i}} \qquad (66)$$
The mean and variance of this distribution are given, respectively, by:
nUBi
nUBi
Expression (67) for the mean of N_{Bi} | n_{Ti} allows the generation of point predictions for the
disaggregate data. Note in expression (67) that the bounds will necessarily be respected,
because the lower and upper summation limits n_{Bi}^L and n_{Bi}^U, which are those of a
hypergeometric distribution, are also, and not by coincidence, the same Duncan-Davis
bounds (see Appendix 1). In its turn, expression (68) for the variance of N_{Bi} | n_{Ti} allows
the generation of uncertainty measures for the predictions. Neither of these expressions can be
represented in closed form as a function of n_{Ti}, \beta_i and \omega_i, which entails some
cost in their computation.
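As an illustration, the pmf in (65)-(66) and its moments (67)-(68) can be computed by direct summation over the Duncan-Davis bounds. The sketch below is in Python and its function name and scalar interface are ours, for illustration only; it is not the code used in the paper:

```python
from math import comb

def noncentral_hypergeom(n_T, n_X, n, beta, omega):
    """Conditional predictive pmf of N_B given n_T for one unit, as in
    (65)-(66), together with its mean (67) and variance (68).  Inputs:
    aggregate count n_T, column total n_X, unit total n, and binomial
    probabilities beta, omega (all scalars; illustrative interface)."""
    lo = max(0, n_T - (n - n_X))   # lower Duncan-Davis bound n_B^L
    hi = min(n_T, n_X)             # upper Duncan-Davis bound n_B^U
    odds = beta * (1.0 - omega) / (omega * (1.0 - beta))
    ks = range(lo, hi + 1)
    weights = [comb(n_X, k) * comb(n - n_X, n_T - k) * odds**k for k in ks]
    norm = sum(weights)            # the normalizer Lambda_0 in (66)
    pmf = {k: w / norm for k, w in zip(ks, weights)}
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    return pmf, mean, var

# Proportion-scale point prediction and variance, as in (71)-(72):
pmf, m, v = noncentral_hypergeom(n_T=50, n_X=30, n=100, beta=0.6, omega=0.5)
beta_B_bar, V_B = m / 30, v / 30**2
```

Note that when the odds ratio equals one the distribution reduces to the central hypergeometric, whose mean is n_T n_X / n.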
When n_{Ti} is given, the predictions of the two disaggregate variables N_{Bi} and N_{Wi}
become linearly related, since N_{Wi} = n_{Ti} - N_{Bi}, in such a way that the mean and the
variance of the predictive distribution p(n_{Wi} | n_{Ti}, \beta_i, \omega_i) may be obtained
from (67) and (68), as follows:

    E[N_{Wi} \mid n_{Ti}, \beta_i, \omega_i] = n_{Ti} - E[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i]    (69)

    Var[N_{Wi} \mid n_{Ti}, \beta_i, \omega_i] = Var[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i]    (70)

The conditional predictions for the disaggregate frequencies may yet be converted into
predictions for proportions, as follows:
    \bar{\beta}_{Bi} = \frac{E[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i]}{n_{Xi}}    (71)

    V_{Bi} = \frac{Var[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i]}{n_{Xi}^2}    (72)

    \bar{\beta}_{Wi} = \frac{E[N_{Wi} \mid n_{Ti}, \beta_i, \omega_i]}{n_i - n_{Xi}}    (73)

    V_{Wi} = \frac{Var[N_{Wi} \mid n_{Ti}, \beta_i, \omega_i]}{(n_i - n_{Xi})^2}    (74)
Note that the prediction variances in frequencies, given by (68) and (70), are the same.
However, the prediction variances in proportions, given by (72) and (74), differ
multiplicatively by the factor (n_i - n_{Xi})^2 / n_{Xi}^2.
All these results follow from the concept of the conditional predictive distribution. In addition
to using it to guarantee that the accounting identity is respected by our proposed approach, the
computation of the conditional predictive means in (67) and (69) was essential to the
development of the ECMA, particularly in the implementation of the E-step (see expressions
(38) and (39)).
    p(n_{Bi} \mid n_T)
      = \int_0^1 \int_0^1 p(n_{Bi} \mid n_{Ti}, \beta_i, \omega_i) \, p(\beta_i, \omega_i \mid n_T) \, d\beta_i \, d\omega_i    (75)

    p(\beta_i, \omega_i \mid n_T)
      = \int_{A_{\theta_{(-i)}}} p_A(\theta \mid n_T) \, d\theta_{(-i)}    (76)

where \theta_{(-i)} is a (2P + 2) \times 1 vector composed of all elements of \theta except
\beta_i and \omega_i, and A_{\theta_{(-i)}} \subset R^{2P+2} is the subspace of admissible
values of \theta_{(-i)}. The predictive distribution in (75)
is unconditional with respect to the quantities \beta_i and \omega_i because we have averaged
(via the double integration) over their uncertainty. Under a strict bayesian approach, this
distribution should be the one used for inferences regarding the disaggregate data, since it
embodies the uncertainty associated with the parameters (Migon and Gamerman 1999, p.
201). Also, the complete distribution (not only its point and interval characteristics) should
be obtained in order for a full bayesian approach to ecological inference to be undertaken.
However, it is hard to obtain p(n_{Bi} | n_T) analytically by means of the integration in (75),
for two reasons: First, because p(\beta_i, \omega_i | n_T), which has to be obtained via the
complex multidimensional integration of the aggregate posterior in (76), needs to be recovered
in the first place. Second, because p(n_{Bi} | n_{Ti}, \beta_i, \omega_i) is a noncentral
hypergeometric density. As seen in expression (65), the functional form of this density is a
ratio where both the numerator and the denominator depend on the variables of integration
\beta_i and \omega_i. In their approach to the BBH model, KRT used MCMC methods to
recover completely the aggregate posterior in (25) and the marginal posteriors for the binomial
probabilities p(\beta_i | n_{Ti}) and p(\omega_i | n_{Ti}). Those MCMC methods could as
well be used here with this same purpose and with the additional goal of recovering completely
the P unconditional predictive distributions p(n_{Bi} | n_T) and p(n_{Wi} | n_T). But if we
were to use these methods here, we would again incur the high cost of increased computation.
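If draws from the posterior p(\beta_i, \omega_i | n_T) were available, say from an MCMC run, the integration in (75) could be approximated by a Monte Carlo mixture of conditional predictive distributions. A hypothetical sketch (the sampler producing `posterior_draws` is assumed, not provided, and the conditional pmf of (65) is rebuilt locally so the sketch is self-contained):

```python
from math import comb

def conditional_pmf(n_T, n_X, n, beta, omega):
    # Noncentral hypergeometric pmf of (65), over the Duncan-Davis bounds.
    lo, hi = max(0, n_T - (n - n_X)), min(n_T, n_X)
    odds = beta * (1 - omega) / (omega * (1 - beta))
    w = {k: comb(n_X, k) * comb(n - n_X, n_T - k) * odds**k
         for k in range(lo, hi + 1)}
    tot = sum(w.values())
    return {k: v / tot for k, v in w.items()}

def predictive_pmf(n_T, n_X, n, posterior_draws):
    """Monte Carlo approximation of (75): average the conditional pmf over
    posterior draws (beta, omega).  The draws are assumed to come from an
    external sampler (e.g. MCMC), which this sketch does not provide."""
    acc = {}
    for beta, omega in posterior_draws:
        for k, p in conditional_pmf(n_T, n_X, n, beta, omega).items():
            acc[k] = acc.get(k, 0.0) + p
    m = len(posterior_draws)
    return {k: v / m for k, v in acc.items()}
```

Each draw contributes a properly normalized conditional pmf, so the average is itself a pmf supported on the Duncan-Davis bounds.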
Another important issue is that resorting to alternative, simpler simulation approaches
based on the asymptotic normality of posterior functions would not be appropriate here. The
reason is that a regularity condition for the validity of convergence in probability towards the
normal distribution - that the number of parameters or quantities of interest is independent
of the number of observations (Gelman et al., 1995) - is not satisfied here. This happens
because the BBH model uses as many parameters as observations, a usual device adopted in
the formulation of hierarchical models. When the number of observations increases, the
number of quantities of interest increases with it, and in this case the failure of the standard
result of asymptotic normality prevents the use of a multivariate normal approximation to the
aggregate posterior, and the simpler derivation of the predictive distributions from that
approximation.
For these reasons, and in order to keep up with our faster approach, we had to use
instead the conditional predictive distributions p(n_{Bi} | n_{Ti}, \beta_i, \omega_i) and
p(n_{Wi} | n_{Ti}, \beta_i, \omega_i) evaluated at the posterior modes \hat{\beta}_i and
\hat{\omega}_i, which are obtained via the ECMA (Subsection 6.2). Nevertheless, our
approach is a valid EI method that respects the accounting identity and thus displays the
desirable EI properties considered before.
8 Application to a Real Dataset

In this section, we illustrate our alternative approach to the BBH model with an application
to a real dataset. The data we use are the same analyzed by KRT and correspond to the voter
registration and racial background of people in 275 counties of the States of Florida,
Louisiana, North Carolina, and South Carolina (this dataset was obtained from Gary King's
web site http://gking.harvard.edu/data). For each county, observations are available for
total voting age population (n_i), the number of those who are registered (n_{Ti}), and the
number of people who are black (n_{Xi}). The target is to infer the unobservable proportions
of registered people in the population of blacks (\beta_{Bi}) and in the population of whites
(\beta_{Wi}).
Performance results are presented in table 3, which displays information for the two types
of predictors considered: namely, the mode of the posteriors for the binomial probabilities,
labeled posterior mode, and the mean of the predictive posteriors for the disaggregate
proportions, labeled predictive mean. The ECM algorithm converged after 37 iterations,
which took one minute and 59 seconds on a PC with a Pentium 700 MHz processor. For the
posterior mode predictor, the coverage of prediction errors within ten percentage points of
the true values is about 42.18% for variable B and 84.36% for variable W, with
corresponding root mean squared errors of prediction of about 0.194 and 0.0973,
respectively. These figures are practically the same in the case of the predictive mean, for
which the 10% coverage of prediction errors is 42.91% for variable B and 84.36% for
variable W, with corresponding root mean squared errors of about 0.1898 and 0.0976.
The better performance in the case of variable W is due to the greater amount of information
present in the aggregate data for this variable.
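The two summary measures reported above can be computed as follows. This is an illustrative helper, not the code that produced table 3:

```python
def coverage_and_rmse(pred, true, tol=0.10):
    """Share of predictions within `tol` (ten percentage points) of the
    true proportions, and the root mean squared prediction error."""
    errors = [p - t for p, t in zip(pred, true)]
    coverage = sum(abs(e) <= tol for e in errors) / len(errors)
    rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
    return coverage, rmse

# Toy illustration with three hypothetical county predictions:
cov, rmse = coverage_and_rmse([0.30, 0.82, 0.51], [0.29, 0.63, 0.43])
```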
Figs. 2 and 3 display scatter plots of predicted against true values for variables B and W
according to the type of predictor, thus providing information on the behavior of individual
predictions. The horizontal lines within these scatter plots represent the admissible intervals
computed from the Duncan-Davis bounds for each observation. In both figures, the greater
amount of information for disaggregate inferences with regard to variable W is evident,
because the admissible intervals are in general shorter than for variable B. As a consequence
of the greater uncertainty in predicting proportions for variable B, the scatter plots in this
case display a large portion of data points centered in the middle of the intervals. In the case
of variable W, the scatter plots display the data points better aligned with the 45° line.
It is interesting to note that the two predictors produce essentially the same pattern of scatter
plots. This means that, for this particular dataset, it is practically indifferent whether one
predictor or the other is used. Almost all predictions from the posterior mode for the binomial
probabilities stay within the admissible intervals but, since they do not respect the accounting
identity, in the majority of cases they do not display aggregation consistency. In turn, all
predictions from the predictive mean, as they respect the accounting identity, stay within the
admissible intervals and satisfy aggregation consistency.
Figure 2. Disaggregate predictions using posterior mode for binomial probabilities.
Table 4 further compares the two predictors by presenting results for nine selected
counties. The predictions produced by each predictor are different for all these counties; for
some of them, depending on the variable, the predictions appear to be the same, but only at
the four significant digits displayed in the table. For variable B, the largest absolute difference
between predictors happens for counties 224 and 229, where
|\hat{\beta}_{224} - \bar{\beta}_{B,224}| = |\hat{\beta}_{229} - \bar{\beta}_{B,229}| = 0.0400; while in the case of variable W, it is for county 218,
where |\hat{\omega}_{218} - \bar{\beta}_{W,218}| = 0.0082. Table 4 displays estimated standard deviations only for the
predictive means, because we are unable to produce estimated standard deviations for the
posterior mode without simulation (see Section 7).
Table 4 Disaggregate data bounds, true values, and predictions for selected counties.

County   B^L     B^U     W^L     W^U    true B  true W  mode B  mode W  pred B   s_B   pred W   s_W
22      .2857  1.0000   .5454  1.0000   .2857  1.0000   .3215   .9910   .3000  .0048   .9909  .0079
39     1.0000  1.0000  1.0000  1.0000  1.0000  1.0000   .9639  1.0000  1.0000  .0000  1.0000  .0000
50      .6338  1.0000   .9148  1.0000   .6339  1.0000   .8169   .9574   .8169  .0010   .9574  .0012
79      .6671  1.0000   .9723  1.0000  1.0000   .9722   .8253   .9862   .8333  .0039   .9861  .0042
97      .8746  1.0000   .9859  1.0000   .8750  1.0000   .9299   .9930   .9375  .0020   .9930  .0022
150     .0000  1.0000   .3928   .7500   .4250   .5982   .5071   .5701   .5075  .0056   .5687  .0076
218     .0000  1.0000   .4375   .9375   .5000   .6875   .4507   .7207   .4500  .0146   .7125  .0219
224     .0000  1.0000   .7187   .7291   .0000   .7292   .6400   .7232   .6000  .0016   .7229  .0016
229     .0000  1.0000   .4304   .4430   .0000   .4430   .6400   .4362   .6000  .0019   .4354  .0019

Note: B^L, B^U, W^L, W^U are the Duncan-Davis bounds; true B and true W are the true proportions; mode B and mode W are the posterior mode predictions; pred B and pred W are the predictive means, with estimated standard deviations s_B and s_W.
Source: True values were obtained from Gary King's web site (http://gking.harvard.edu/data).
It is interesting to note in table 4 that, for county 39, the posterior mode
\hat{\beta}_{39} = 0.9639 lies outside the admissible interval for variable B, since it is less
than \beta_{B,39}^L = 1.0000. This provides an example of the issue we raised in this article:
predictions made with the posterior mode for the binomial probabilities may not respect the
deterministic bounds (the same happens with counties 8, 15, 30, 87, 112, and 124). Note also
that this does not happen with predictions made with the predictive mean, either for county
39 or for any of the other counties in table 4.
Among the selected counties of table 4, we included counties 50 and 150, which were
used by KRT to illustrate their MCMC-based approach. For variable B in county 50, our
predictions of nearly 0.82 (the same for both predictors) are significantly higher than the
0.73 value obtained by those authors; however, for variable W in that same county, our
predictions of nearly 0.96 come close to the value of 0.98 obtained by KRT (p. 75). In the
case of county 150, our predictions of .50 and .57 for blacks and whites, respectively, also
come close to the .48 and .58 values obtained for the same variables by KRT (p. 77).
9 Concluding Comments
Appendix 1 Bounds for the Disaggregate Data

In this Appendix, we present the deduction of the bounds for the disaggregate frequency
data when a sample n_T^T = [n_{T1}, \ldots, n_{TP}] of aggregate data is given. The sequence
of steps begins by determining the maximum admissible value for the variable N_{Bi}. It is
simple to note that this value is the smaller of the totals of the column and the row to which
the variable belongs, say:

    N_{Bi}^U = \min(N_{Ti}, N_{Xi})

On the other hand, the minimum for the other variable N_{Wi} then follows from
N_{Wi} = N_{Ti} - N_{Bi}:

    N_{Wi} \ge N_{Wi}^L = \max(0, N_{Ti} - N_{Xi})

In a similar fashion, the maximum admissible value for the variable N_{Wi} is the smaller
of the totals of the column and the row to which this variable belongs:

    N_{Wi}^U = \min(N_{Ti}, N_i - N_{Xi})

and the corresponding minimum for N_{Bi} is:

    N_{Bi} \ge N_{Bi}^L = \max(0, N_{Ti} - (N_i - N_{Xi}))
We find it useful to note, once more, that these are the same bounds as those of the
hypergeometric distribution (McCullagh and Nelder, 1989, pp. 257-259).
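The four bounds can be transcribed directly into code. A minimal sketch for one unit (the function name and interface are ours):

```python
def duncan_davis_bounds(n_T, n_X, n):
    """Admissible ranges for the disaggregate frequencies N_B and N_W of
    one unit, given the aggregate data (n_T, n_X, n), as deduced in this
    appendix."""
    nB_hi = min(n_T, n_X)        # N_B^U: smaller of row and column totals
    nW_lo = n_T - nB_hi          # N_W^L = max(0, n_T - n_X)
    nW_hi = min(n_T, n - n_X)    # N_W^U
    nB_lo = n_T - nW_hi          # N_B^L = max(0, n_T - (n - n_X))
    return (nB_lo, nB_hi), (nW_lo, nW_hi)
```

Since N_B + N_W = n_T, the lower bound of one variable is n_T minus the upper bound of the other, which the code exploits.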
Appendix 2 Reparameterization
In order to develop a stable computer algorithm to implement the BBH model based on the
disaggregate DGP, we used a suitable reparameterization. We stress that our purpose here
was not to exploit a multivariate normal approximation to the aggregate posterior, because
such a procedure may not be applicable to the present problem, as discussed in Section 7.2.
Instead, we used the reparameterization only to transform a proper subset of R^{2P+4} (the
original space for the quantities of interest) into the whole R^{2P+4}, thus allowing the mode
search of the ECM algorithm to be made over all floating point numbers. (The function code
we developed to implement the BBH model allows an option to use the reparameterization
or not.)
The original parameters are given by:

    \theta^T = [\beta^T, \omega^T, \psi_b^T, \psi_w^T]    (A2.1)

and the reparameterized quantities by:

    \gamma^T = z(\theta)^T = [\lambda^T, \mu^T, \phi_b^T, \phi_w^T]    (A2.2)

where:

    \lambda = \log\left( \frac{\beta}{1 - \beta} \right)    (P \times 1)    (A2.3)

    \mu = \log\left( \frac{\omega}{1 - \omega} \right)    (P \times 1)    (A2.4)

    \phi_b = \log(\psi_b)    (2 \times 1)    (A2.5)

    \phi_w = \log(\psi_w)    (2 \times 1)    (A2.6)

The log and division operations are applied to each element of the original vectors.
The aggregate log-posterior in the notation of the reparameterization will be represented by
us as:

    l_A(\gamma \mid n_T) = l_A(z(\theta) \mid n_T)    (A2.7)

The ECM algorithm searches the space of \gamma for a mode of the aggregate log-posterior;
we describe it in Box A2.1.
Once the mode \gamma_K is determined, the following transformation back to the original
parameterization is applied:

    \beta_K = \frac{\exp(\lambda_K)}{1 + \exp(\lambda_K)}    (P \times 1)    (A2.8)

    \omega_K = \frac{\exp(\mu_K)}{1 + \exp(\mu_K)}    (P \times 1)    (A2.9)

    \psi_{b,K} = \exp(\phi_{b,K})    (2 \times 1)    (A2.10)

    \psi_{w,K} = \exp(\phi_{w,K})    (2 \times 1)    (A2.11)
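The elementwise transformation (A2.2)-(A2.6) and its inverse (A2.8)-(A2.11) can be sketched as follows; the list-based interface and the names `z` and `z_inv` are ours, for illustration:

```python
import math

def z(beta, omega, psi_b, psi_w):
    """Map the original quantities to unconstrained space, as in
    (A2.2)-(A2.6): elementwise logit for the probabilities, elementwise
    log for the beta hyperparameters."""
    lam = [math.log(b / (1 - b)) for b in beta]
    mu = [math.log(w / (1 - w)) for w in omega]
    phi_b = [math.log(p) for p in psi_b]
    phi_w = [math.log(p) for p in psi_w]
    return lam, mu, phi_b, phi_w

def z_inv(lam, mu, phi_b, phi_w):
    """Back-transformation (A2.8)-(A2.11): elementwise inverse logit
    and exponential."""
    beta = [math.exp(l) / (1 + math.exp(l)) for l in lam]
    omega = [math.exp(m) / (1 + math.exp(m)) for m in mu]
    return (beta, omega,
            [math.exp(p) for p in phi_b],
            [math.exp(p) for p in phi_w])
```

The map is a bijection between the admissible subset of R^{2P+4} and the whole of R^{2P+4}, so mode searching can ignore the original constraints.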
Box A2.1 The ECM algorithm under the reparameterization.

1. Assume a guess \gamma_k^T = [\lambda_k^T, \mu_k^T, \phi_{b,k}^T, \phi_{w,k}^T];
2. E-step: Compute:
       \bar{n}_B = E[N_B \mid n_T, \gamma_k]
       \bar{n}_W = n_T - \bar{n}_B;
4. CM1-step:
       \lambda_{k+1} = \log\left( \frac{\bar{n}_{B,k} + c_{b,k} - 1}{n_X - \bar{n}_{B,k} + d_{b,k} - 1} \right)
       \mu_{k+1} = \log\left( \frac{\bar{n}_{W,k} + c_{w,k} - 1}{n - n_X - \bar{n}_{W,k} + d_{w,k} - 1} \right);
5. CM2-step:
       \phi_{b,k+1} = \arg\max_{\phi_b} Q(\phi_b \mid \lambda_{k+1}, \mu_{k+1}, \phi_{b,k}, \phi_{w,k});
6. CM3-step:
       \phi_{w,k+1} = \arg\max_{\phi_w} Q(\phi_w \mid \lambda_{k+1}, \mu_{k+1}, \phi_{b,k+1}, \phi_{w,k});
   and determine \gamma_{k+1}^T = [\lambda_{k+1}^T, \mu_{k+1}^T, \phi_{b,k+1}^T, \phi_{w,k+1}^T];
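The closed-form CM1-step of Box A2.1 can be sketched for a single unit as follows. The scalar interface is ours, and we assume the arguments of the logarithms are positive (which requires, e.g., \bar{n}_{B,k} + c_{b,k} > 1):

```python
import math

def cm1_update(nB_bar, nW_bar, n_X, n, c_b, d_b, c_w, d_w):
    """CM1-step of Box A2.1 for one unit: closed-form update of the
    logits lambda and mu, given the E-step means nB_bar, nW_bar and the
    current beta hyperparameters (c, d recovered from phi at iteration k).
    Illustrative sketch; assumes all log arguments are positive."""
    lam = math.log((nB_bar + c_b - 1) / (n_X - nB_bar + d_b - 1))
    mu = math.log((nW_bar + c_w - 1) / (n - n_X - nW_bar + d_w - 1))
    return lam, mu
```

With c = d = 1 (uniform beta priors) the update reduces to the plain logit of the E-step means, \lambda_{k+1} = \log(\bar{n}_B / (n_X - \bar{n}_B)).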
Appendix 3 The Accounting Identity in KRT's Version of the BBH Model

We follow here steps similar to those of Section 7.1 to show that KRT's version of the BBH
model also does not produce inferences that respect the accounting identity. We also start
by considering the first stage, where a binomial distribution is used to model the aggregate
DGP:

    N_{Ti} \sim \mathrm{Bin}\left( n_i, \, \beta_i x_i + \omega_i (1 - x_i) \right)    (A3.1)

If we assume a uniform prior defined on (0,1) for each \beta_i and \omega_i, then the
aggregate posterior will be proportional to the aggregate likelihood function. The log version
of the latter is given by:
    l_A(\theta \mid n_T) = \sum_{i=1}^{P} \log \binom{n_i}{n_{Ti}}
      + \sum_{i=1}^{P} n_{Ti} \log\left( \beta_i x_i + \omega_i (1 - x_i) \right)
      + \sum_{i=1}^{P} (n_i - n_{Ti}) \log\left( 1 - \beta_i x_i - \omega_i (1 - x_i) \right)    (A3.2)
By applying first order conditions for a maximum, and after some algebraic manipulation,
we can easily check that the mode(s) of l_A satisfy:

    \hat{\omega}_i = \frac{t_i}{1 - x_i} - \hat{\beta}_i \frac{x_i}{1 - x_i}    (A3.3)

Thus, also in the case of KRT's version, the aggregate posterior modes for the binomial
probabilities in the first stage generate predictions that satisfy the accounting identity.
We now turn to check what happens when we add to (A3.1) the other stages of the
hierarchy of KRT's BBH model. First, the aggregate log-posterior is transformed according
to:

    lp_A(\theta \mid n_T) = C + \sum_{i=1}^{P} \log \binom{n_i}{n_{Ti}}
      + \sum_{i=1}^{P} n_{Ti} \log\left( \beta_i x_i + \omega_i (1 - x_i) \right)
      + \sum_{i=1}^{P} (n_i - n_{Ti}) \log\left( 1 - \beta_i x_i - \omega_i (1 - x_i) \right)
      + \sum_{i=1}^{P} (c_b - 1) \log \beta_i + \sum_{i=1}^{P} (d_b - 1) \log(1 - \beta_i)
      + \sum_{i=1}^{P} (c_w - 1) \log \omega_i + \sum_{i=1}^{P} (d_w - 1) \log(1 - \omega_i)
      - P \log B(c_b, d_b) - P \log B(c_w, d_w)    (A3.4)
where C is a constant. By applying, once more, first order conditions for a maximum and
manipulating, we verify that the posterior modes now satisfy:

    \hat{\omega}_i = \frac{t_i}{1 - x_i} - \hat{\beta}_i \frac{x_i}{1 - x_i}
      + F_i^*(\hat{\beta}_i, \hat{\omega}_i, x_i, c_b, d_b, c_w, d_w)    (A3.5)
According to this expression, for the accounting identity to be valid it is necessary that
F_i^* = 0 for any value of its arguments and all i = 1, \ldots, P. But this term is written as:

    F_i^* = A_i^* \, \frac{[\beta_i x_i + \omega_i (1 - x_i)][1 - \beta_i x_i - \omega_i (1 - x_i)]}{n_i (1 - x_i)}    (A3.6)

where:

    A_i^* = \frac{c_b - 1}{\beta_i} - \frac{d_b - 1}{1 - \beta_i}
      + \frac{c_w - 1}{\omega_i} - \frac{d_w - 1}{1 - \omega_i}    (A3.7)
In an analogous fashion, the situations in which F_i^* = 0 are too restrictive. Provided that
\beta_i, \omega_i and x_i simultaneously belong to the open interval (0,1), this will happen,
for instance, when c_b = d_b = c_w = d_w = 1. It is easy to check, though, that F_i^* \ne 0
for an infinite number of other values of these hyperparameters.
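The behavior of F_i^* can be checked numerically with the signs as written in (A3.6)-(A3.7). An illustrative sketch (function names are ours):

```python
def A_star(beta, omega, c_b, d_b, c_w, d_w):
    # Expression (A3.7).
    return ((c_b - 1) / beta - (d_b - 1) / (1 - beta)
            + (c_w - 1) / omega - (d_w - 1) / (1 - omega))

def F_star(beta, omega, x, n, c_b, d_b, c_w, d_w):
    # Expression (A3.6): the term that breaks the accounting identity.
    t = beta * x + omega * (1 - x)
    return A_star(beta, omega, c_b, d_b, c_w, d_w) * t * (1 - t) / (n * (1 - x))
```

With uniform priors (c_b = d_b = c_w = d_w = 1) the term vanishes, as stated above, while generic hyperparameter values make it nonzero.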
We note that KRT (p. 64) state that their approach to the BBH model respects the
bounds, but they do not give details as to how the bounds are incorporated into the marginal
posteriors for the binomial probabilities. As it seems, the authors relied on the validity of the
accounting identity for the first stage, where the binomial model is defined in isolation, to
justify that those posteriors are defined within the bounds. It seems to be KRT's relation (2)
(appearing on p. 65 and reappearing on p. 71), deduced solely on the basis of the first stage
and equivalent here to relation (A3.3), that justifies that the bounds are respected. However,
this is not what we verify by means of relation (A3.5), which results from considering the
other stages in the hierarchy.
References
Achen, C.H. and W.P. Shively (1995). Cross Level Inference. Chicago: University of Chicago Press.
Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. New York:
Wiley.
Barndorff-Nielsen, O. (1982). Exponential Families. In Encyclopedia of Statistical Sciences, vol. 2. New
York: Wiley. pp. 587-596.
Beckman, R. J. and G. L. Tietjen (1978). Maximum Likelihood Estimation for the Beta Distribution. Journal
of Statistical Computation and Simulation, 3, 4, p. 253.
Bonnans, J. F., J. C. Gilbert, C. Lemaréchal, and C. Sagastizábal (1996). Méthodes numériques d'optimisation.
Projet Promath, Institut National de Recherche en Informatique et en Automatique. Manuscript.
Brown, P.J., and C.D. Payne (1986). Aggregate Data, Ecological Regression, and Voting Transitions.
Journal of the American Statistical Association, 81, pp. 452-460.
Cleave, N. (1992). Ecological Inference. PhD Dissertation, University of Liverpool.
Cleave, N., P.J. Brown and C.D. Payne (1995). Evaluation of Methods for Ecological Inference. Journal
of the Royal Statistical Society A, 158, 1, pp. 55-72.
Dempster, A. P., N. M. Laird and D. Rubin (1977). Maximum Likelihood from Incomplete Data via the
EM Algorithm (with discussion). Journal of the Royal Statistical Society B, 39, pp. 1-38.
Duncan, O. D. and B. Davis (1953). An Alternative to Ecological Correlation. American Sociological
Review, 18, pp. 665-666.
Forcina, A. and G. M. Marchetti (1989). Modelling Transition Probabilities in the Analysis of
Aggregate Data. In Decarli, A., B. J. Francis, R. Gilchrist, and G.U.H. Seber, editors, Statistical
Modelling (Proceedings, Trento, 1989), pp. 157-164. Lecture Notes in Statistics 57. Berlin: Springer-
Verlag.
Gelman, A. B., J. S. Carlin, H. S. Stern and D. B. Rubin (1995). Bayesian Data Analysis. New York:
Chapman & Hall/CRC.
Goodman, L. (1953). Ecological Regression and the Behavior of Individuals. American Sociological
Review, 18, pp. 663-664.
____________ (1959). Some Alternatives to Ecological Correlation. American Journal of Sociology,
64, pp. 610-625.
Hawkes, A. G. (1969). An Approach to the Analysis of Electoral Swing. Journal of the Royal Statistical
Society A, 132, pp. 68-79.
Johnson, N.L. and S. Kotz (1969). Distributions in Statistics: Discrete Distributions. New York: Wiley.
King, Gary (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual
Behavior from Aggregate Data. Princeton: Princeton University Press.
King, Gary, O. Rosen and M. A. Tanner (1999). Binomial-Beta Hierarchical Models for Ecological
Inference. Sociological Methods and Research, 28, 1, pp. 61-90.
Lee, T.C., G.G. Judge and A. Zellner (1970). Estimating the Parameters of the Markov Probability
Model from Aggregate Time Series Data. Contributions to Economic Analysis 65. Amsterdam:
North-Holland.
McCue, K. F. (2001). The Statistical Foundations of the EI Method. The American Statistician, 55, 2,
pp. 106-110.
McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models. 2nd Edition. Monographs on Statistics
and Applied Probability 38. London: Chapman & Hall.
McLachlan, G.J. and T. Krishnan (1997). The EM Algorithm and Extensions. Wiley Series in Probability
and Statistics. New York: Wiley.
MacRae, E. C. (1977). Estimation of Time-Varying Markov Processes with Aggregate Data.
Econometrica, 45, pp. 183-198.
Meng, X.-L. and D. B. Rubin (1993). Maximum Likelihood Estimation via the ECM Algorithm: a General
Framework. Biometrika, 80, pp. 267-278.
Meng, X.-L. and D. van Dyk (1997). The EM algorithm - An Old Folk Song Sung to a Fast New Tune.
Journal of The Royal Statistical Society B, 59, 3, pp.511-567.
Migon, H. S. and D. Gamerman (1999). Statistical Inference: an Integrated Approach. London: Arnold.
Mosimann, J. E. (1962). On the Compound Multinomial Distribution, the Multivariate β-Distribution and
Correlations Among Proportions. Biometrika, 49, pp. 65-82.
Rosen, O., W. Jiang, G. King, and M. A. Tanner (2000). Bayesian and Frequentist Inference for Ecological
Inference: the R×C Case. Forthcoming in Statistica Neerlandica.
Schafer, J. L. (1999). Analysis of Incomplete Multivariate Data. Monographs on Statistics and Applied
Probability 72. Reprinted 1997 Edition. Boca Raton: Chapman & Hall/CRC.
Schoenberg, R. (1996). Constrained maximum likelihood. Manuscript.
Tanner, M. A. (1996). Tools for Statistical Inference: Methods for the Exploration of Posterior
Distributions and Likelihood Functions. 3rd Edition. New York: Springer.