
The Binomial-Beta Hierarchical Model

for Ecological Inference: Methodological Issues and


Fast Implementation via the ECM Algorithm

Rogério Silva de Mattos


Faculdade de Economia e Administração,
Universidade Federal de Juiz de Fora
Campus Universitário, Martelos
36036-330, Juiz de Fora, MG, Brazil
email: rmattos@fea.ufjf.br

Álvaro Veiga
Departamento de Engenharia Elétrica
Pontifícia Universidade Católica do Rio de Janeiro
Rua Marquês de São Vicente 225
22453-900, Rio de Janeiro, RJ, Brazil
email: alvf@ele.puc-rio.br

October, 2002

This paper updates and improves the material presented in a previous paper of ours, "The Binomial-Beta
Hierarchical Model for Ecological Inference Revisited and Implemented via the ECM Algorithm",
posted to the PolMeth Paper Archive in 2001. Matlab routines that implement the approach presented
in this paper are available at http://www3.ufjf.br/~rmattos.

Abstract

The binomial-beta hierarchical model from King, Rosen, and Tanner (1999) is a recent
contribution to ecological inference. Developed for the 2x2 tables case from a bayesian
perspective, the model's central feature is the compounding of binomial and beta
distributions into a hierarchical structure. From a sample of aggregate observations,
inference with this model can be made about the values of unobservable disaggregate
variables. The paper reviews this EI model with two purposes. First, a faster approach
to using it in practice, based on explicit modeling of the disaggregate data generation
process along with posterior maximization implemented via the ECM algorithm, is
proposed and illustrated with an application to a real dataset. Second, limitations
concerning the use of marginal posteriors for binomial probabilities as the vehicle of
inference (basically, the failure to respect the accounting identity), instead of the
predictive distributions for the disaggregate proportions, are pointed out. In the concluding
section, principles for EI model building in general and directions for further research
are suggested.

Keywords: ecological inference; hierarchical models; binomial-beta distribution; ECM algorithm.

1 Introduction

A recent contribution to the ecological inference (EI) literature is the binomial-beta
hierarchical method proposed by King, Rosen and Tanner (1999), in short KRT. These
authors developed a new EI method for problems with 2x2 tables whose central feature is a
bayesian statistical model that compounds the binomial and the beta probability distributions
into a hierarchical structure. KRT called it the binomial-beta hierarchical (BBH) model for
EI, and argued it is more flexible than the truncated bivariate normal (TBN) model proposed
by King (1997) in recovering a wide spectrum of disaggregate data patterns. In order to
implement the BBH model, the authors used the powerful simulation resources of Markov chain
Monte Carlo (MCMC) algorithms, which allow the complete recovery of the joint and
marginal posteriors for the quantities of interest. Since in the BBH model the number of
these quantities is generally large (it depends on the sample size), and those simulation
algorithms are computer intensive, a huge amount of CPU time, of the order of hours, is
needed to implement the inference even with today's computer hardware. This discourages
practitioners from using the BBH model in applied ecological inference studies.
In this paper, we review the BBH model with two purposes: First, to present a faster,
and thus more useful, approach to implement it based on posterior maximization via the
Expectation-Conditional-Maximization algorithm proposed by Meng and Rubin (1993).
Second, to discuss some methodological issues related to the inappropriate use, made by
KRT, of the posteriors for binomial probabilities as the vehicle for making EI with the BBH
model. We argue in this article that a proper approach to inference should be based,
instead, on the predictive posteriors for the disaggregate variables, because only through them
do we ensure the accounting identity is respected and thus guarantee that desirable properties
of EI predictions, such as aggregation consistency and respect for the deterministic bounds
(Duncan and Davis, 1953), are satisfied.
Although the discussion made here remains centered on the basic version of the BBH
model (which does not consider explanatory variables) and on the 2x2 tables case, this
strategy of working with a simpler version of the EI problem was of major importance for us
to achieve the results presented herein. Notwithstanding this, the directions for further research
advanced in the concluding section can promote generalizations of our approach towards
allowing for the effects of explanatory variables and coping with RxC EI problems (recently,
Rosen et al (2000) presented a generalized version of the BBH model for RxC tables, but
still implemented with computer-intensive MCMC methods).
The paper is organized as follows. Section 2 gives a brief historical account of previous
uses of binomial-beta models in the EI literature. Section 3 states the EI problem and
introduces some notation. Section 4 presents definitions of two important distributions used
throughout the text. Section 5 describes the BBH model and suggests a faster
implementation strategy for it based on the maximization of the aggregate posterior. Different
computational methods for running this maximization are considered in Section 6, and a
preferred one, based on the ECM algorithm, is proposed. Section 7 discusses
methodological issues regarding the use of predictive distributions for the disaggregate data
as the proper instruments for ecological inference. Section 8 illustrates our proposed method
with an application to a real dataset on voter registration by race in American states.
Finally, concluding comments with suggestions for further research are provided in
Section 9.

2 Brief Literature Review

The idea of using compound binomial and beta models arose in the EI literature well before
KRT's study. More than a decade earlier, Brown and Payne (1986) proposed the
aggregated compound multinomial (ACM) model, a generalization of a binomial-beta
model that allows EI to be made in RxC tables. Brown and Payne's approach has its
origins in studies from the sixties and seventies concerned with the estimation of transition
probabilities in Markov chain models. The similarity between that problem and the EI one
rests on the common objective of predicting the cell values of a table or matrix from
knowledge of the row and column totals. However, while in the transition probabilities
estimation problem the variables in rows and columns are the same, reflecting
generally different states among which a system or individual may transit from one time to the
next, the EI problem admits that the row and column variables may be different, reflecting
cross-information about attributes of individuals or phenomena in the same period of time.
Hawkes (1969) was perhaps the first to apply Markov chain models as an alternative to
the well-known Goodman regression (Goodman 1953, 1959), in a study of voter transition.
Around the same time, an important contribution was made in econometrics by Lee,
Judge and Zellner (1970), who published a book with a comprehensive study on the estimation
of Markov chain models. Among the various methods analyzed, the authors proposed an
aggregate multinomial model with a constant matrix of transition probabilities for all time
periods. They discussed a classical and a bayesian version of this model. In the latter, the
probabilities were modeled as random variables following a Dirichlet prior distribution
(which is a generalization of the beta distribution) with predefined parameters.
McRae (1977) rediscussed the aggregate multinomial model of Lee, Judge and Zellner
(1970) and proposed to model systematic variations of the transition probabilities using
explanatory variables. Brown and Payne (1986) incorporated all these developments into the
context of ecological inference. These authors followed a classical statistics view and
proposed a compound multinomial model (Mosimann 1962; Johnson and Kotz 1969) to
represent the behavior of the (unobservable) disaggregate data of each table row.
Aggregating these rows yields the ACM model, whose parameters are those
of the Dirichlet distribution. Brown and Payne used a reparameterization to generate the
expected cell probabilities plus a set of parameters that capture overdispersions of the
disaggregate frequencies.
KRT (1999) developed the BBH model for EI. The authors' formulation is quite similar
to a 2x2 tables version of the ACM model, with a few, though subtle, differences. Following a
bayesian approach, KRT used a compounding of the binomial and the beta distributions into
a hierarchical structure. By doing so, KRT let the quantities of interest be the effective
binomial probabilities (plus the parameters of the beta distribution). Considering the 2x2
tables case, this is an important difference from Brown and Payne's (1986) classical
formulation, where the binomial distribution compounded with the beta is such that the
binomial probabilities are eliminated by marginalization and only the parameters of the beta
part are retained. As mentioned above, a reparameterization is possible whereby the
expected binomial probabilities and the over-dispersion parameters are obtained. Thus, the
quantities of interest of the ACM and the BBH models are different.
Though a comparison between those two binomial-beta based models for EI may be
fruitful, in this paper we restrict our attention to the BBH model. KRT implemented it
successfully using MCMC methods, but these methods have the drawback of
being highly computer intensive. As a consequence, they are too slow for practical
ecological inference analysis. In the next sections, we explore an alternative and faster
approach to using the same BBH model that may make it more useful for applications. Before
proceeding, we state the problem and present some notation.

3 Problem Statement and Notation

Technically, the EI problem refers to a situation where we are interested in the cell values of
contingency tables for which only the row and column totals are known. The general
approach followed in the literature to solve this problem has been to resort to statistical
models, which serve to estimate or predict the unknown cell values. In this section, we
present the EI problem for the case of a 2x2 contingency table and the basic notation to be
used in the description of the BBH model.
Generally, letters written in uppercase will represent random variables, and in lowercase
observed or known values. In the left part of table 1, the variables N_Bi and N_Wi represent the
unobservable disaggregate frequencies; these might be, for instance, the numbers of
black and white people, respectively, who turn out to vote in the i-th sampling unit or
precinct. In turn, the variables N_Ti, n_Xi, and n_i represent the observable aggregate
frequencies, and can be seen as the numbers of people, in the i-th precinct, who turn out to
vote (black or white), who are black, and who are of voting age, respectively. The goal of
EI consists of predicting values for N_Bi and N_Wi given knowledge of the values of N_Ti,
n_Xi, and n_i, for i = 1,...,P, where P is the number of sampling units.
In the right part of table 1, we describe the same situation, though in an alternative way,
with variables represented as proportions and defined as:
$$x_i = \frac{n_{Xi}}{n_i} \qquad (1)$$

$$T_i = \frac{N_{Ti}}{n_i} \qquad (2)$$

$$B_i = \frac{N_{Bi}}{n_{Xi}} \qquad (3)$$

$$W_i = \frac{N_{Wi}}{n_i - n_{Xi}} \qquad (4)$$

The problem in this case consists of predicting values of B_i and W_i given knowledge of T_i
and x_i, for i = 1,...,P. The use of proportions instead of frequencies to represent the
response variables in EI models has been the most common approach followed in the EI
literature (see Achen and Shively, 1995; and King, 1997). However, KRT developed the
BBH model representing the response variables as frequencies and, in this article, both forms of
representing the EI problem will be considered.
Table 1. Representation of variables in the EI problem prior to observing the aggregate data

                        Frequencies                              Proportions
          Vote     No Vote              Total          Vote    No Vote    Total
Blacks    N_Bi     n_Xi - N_Bi          n_Xi           B_i     1 - B_i    x_i
Whites    N_Wi     n_i - n_Xi - N_Wi    n_i - n_Xi     W_i     1 - W_i    1 - x_i
Total     N_Ti     n_i - N_Ti           n_i            T_i     1 - T_i    1

Note that the values for the row totals n_Xi are written in lowercase letters because they are
usually regarded as known or given (a convenient but strong assumption) in EI models. For
this reason, when the aggregate value n_Ti for the column total is observed, the EI problem is
set up and the disaggregate data may be represented as conditional random variables. This
situation is illustrated in table 2, where n_Ti represents an observed value for the aggregate
variable N_Ti and the disaggregate variable N_Bi is now represented conditionally as
N_Bi | n_Ti. A prediction made for this variable automatically determines a prediction for the
other disaggregate variable, because (conditionally) the two are linearly related according to
N_Wi | n_Ti = n_Ti - N_Bi | n_Ti.

Table 2. Representation of variables in the EI problem after observing the aggregate variables (frequency case)

          Vote                 No Vote                           Totals
Blacks    N_Bi | n_Ti          n_Xi - N_Bi | n_Ti                n_Xi
Whites    n_Ti - N_Bi | n_Ti   n_i - n_Xi - n_Ti + N_Bi | n_Ti   n_i - n_Xi
Totals    n_Ti                 n_i - n_Ti                        n_i

As we argue in this article, the goal of EI model development is to determine the


distributions of the conditional random variables in table 2 from model assumptions and
deterministic properties of the EI problem. An important deterministic fact that should be
considered in the construction of any EI model is the accounting identity, which for the
variables in frequencies of table 1 consists of:

$$N_{Ti} = N_{Bi} + N_{Wi} \qquad (5)$$

and for the variables in proportions:

$$T_i = B_i x_i + W_i (1 - x_i) \qquad (6)$$

Whatever our choice for representing the accounting identity (in frequencies or
proportions), its importance for structuring an EI model is twofold. First, if the predictions for
the disaggregate variables generated with a particular EI method respect the accounting
identity, then the aggregation of those predictions (using (5) or (6)) will necessarily fit the
observed values of the aggregate variables (n_Ti or t_i). We will call this property
aggregation consistency and consider it desirable.
Second, the accounting identity places lower and upper bounds on the true values
taken by N_Bi and N_Wi, a feature first considered in the EI literature by Duncan
and Davis (1953). For instance, prior to observing the aggregate value n_Ti, as in table 1,
the admissible values for the disaggregate variables N_Bi and N_Wi lie, respectively, within
the intervals [0, n_Xi] and [0, n_i - n_Xi]. But after we observe n_Ti, as in table 2, the
admissible values for these variables lie within the generally tighter intervals [n_Bi^L, n_Bi^U] and
[n_Wi^L, n_Wi^U], where the superscripts L and U indicate lower and upper, respectively. In
Appendix 1, we derive from the accounting identity in (5) expressions to
compute the bounds n_Bi^L, n_Bi^U, n_Wi^L and n_Wi^U (for expressions to compute the bounds in
proportions, see, for instance, King, 1997).
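As a concrete illustration, the bounds in frequencies follow directly from the accounting identity (5). The Python sketch below is ours (the routines referred to in this paper are Matlab ones); the function name and example numbers are purely illustrative.

```python
# Duncan-Davis bounds in frequencies for one precinct, implied by N_Ti = N_Bi + N_Wi.
def dd_bounds(n_T, n_X, n):
    """Admissible ranges for N_B and N_W after observing the aggregate count n_T."""
    nB_L = max(0, n_T - (n - n_X))   # N_B cannot be smaller than this ...
    nB_U = min(n_X, n_T)             # ... nor larger than this
    nW_L = n_T - nB_U                # N_W = n_T - N_B, so its bounds mirror N_B's
    nW_U = n_T - nB_L
    return (nB_L, nB_U), (nW_L, nW_U)

# Example: 100 voting-age people, 30 of them black, 55 turning out to vote.
print(dd_bounds(n_T=55, n_X=30, n=100))   # ((0, 30), (25, 55))
```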
In later sections, we will show that if the EI problem is treated as a prediction problem
from the statistical point of view, then the accounting identity will necessarily be respected
and, as a consequence, the properties of aggregation consistency and of predictions respecting
the Duncan-Davis bounds will both be satisfied.

4 Two Important Distributions

Before we proceed, we define in this section two probability distributions that are relevant to
our alternative approach to formulate and implement the BBH model.

DEFINITION 1 (Binomial-beta hierarchical distribution). Let X be a discrete
random variable defined on [0, n], where n is a known positive integer, and β a
continuous random variable defined on [0, 1]. Then the bivariate random
vector (X, β)^T follows a binomial-beta hierarchical distribution, represented
as:

$$(X, \beta)^T \sim \mathrm{BBH}(n, c, d) \qquad (7)$$

with parameters n, c and d, if:

$$\text{i)} \quad X \mid \beta \sim \mathrm{Bin}(n, \beta) \qquad (8)$$

and:

$$\text{ii)} \quad \beta \sim \mathrm{Beta}(c, d). \qquad (9)$$
Note that definition 1 characterizes a joint bivariate distribution for X and β developed by
sequential conditioning, say:

$$p(x, \beta \mid n, c, d) = p(x \mid \beta, n)\, f(\beta \mid c, d) \qquad (10)$$

where p(x | β, n) is the binomial density for X | β and f(β | c, d) is the beta density for
β. From now on, we will refer to the variable β, or to a particular value taken by it, as a binomial
probability. The analytic representation of the BBH distribution is given by:

$$p(x, \beta \mid n, c, d) = \binom{n}{x} \frac{\beta^{\,x+c-1}(1-\beta)^{\,n-x+d-1}}{B(c, d)} \qquad (11)$$

where B(c, d) is the beta function.
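A small Python sketch (ours, purely illustrative; the function names are assumptions, not part of the paper) evaluates the density (11) and checks it against the sequential-conditioning form (10).

```python
# Evaluate the BBH joint density (11) and check it equals Bin(x | n, beta) * Beta(beta | c, d).
from scipy.special import comb, beta as beta_fn
from scipy.stats import binom, beta

def bbh_density(x, b, n, c, d):
    return comb(n, x) * b**(x + c - 1) * (1 - b)**(n - x + d - 1) / beta_fn(c, d)

lhs = bbh_density(3, 0.4, 10, 2.0, 5.0)
rhs = binom.pmf(3, 10, 0.4) * beta.pdf(0.4, 2.0, 5.0)   # factored form (10)
assert abs(lhs - rhs) < 1e-9
```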

DEFINITION 2 (Aggregate binomial-beta hierarchical distribution). Let X_1 and
X_2 be two independent discrete random variables defined on [0, n_1] and [0, n_2],
respectively, where both n_1 and n_2 are known positive integers; and let θ_1 and
θ_2 be two independent continuous random variables, both defined on [0, 1].
The random vector (X, θ_1, θ_2)^T, where X = X_1 + X_2, follows an aggregate
binomial-beta hierarchical distribution, represented as:

$$(X, \theta_1, \theta_2)^T \sim \mathrm{ABBH}(n_1, c_1, d_1, n_2, c_2, d_2)$$

with parameters n_1, c_1, d_1, n_2, c_2 and d_2, if:

$$\text{i)} \quad X \mid \theta_1, \theta_2 \sim \mathrm{ABin}(n_1, n_2, \theta_1, \theta_2) \qquad (12)$$

$$\text{ii)} \quad \theta_1 \sim \mathrm{Beta}(c_1, d_1) \qquad (13)$$

and:

$$\text{iii)} \quad \theta_2 \sim \mathrm{Beta}(c_2, d_2) \qquad (14)$$

where ABin(·) stands for the aggregate binomial distribution (Forcina and Marchetti, 1989;
Cleave, 1992). Note that it is in general different from a binomial distribution, reducing to one
only in the case where θ_1 = θ_2 = θ.
Definition 2 characterizes a joint trivariate distribution for X, θ_1 and θ_2, also developed
by sequential conditioning, say:

$$p(x, \theta_1, \theta_2 \mid n_1, c_1, d_1, n_2, c_2, d_2) = p(x \mid \theta_1, \theta_2, n_1, n_2)\, f(\theta_1 \mid c_1, d_1)\, f(\theta_2 \mid c_2, d_2) \qquad (15)$$

where p(x | θ_1, θ_2, n_1, n_2) is the aggregate binomial density for X | θ_1, θ_2, while
f(θ_1 | c_1, d_1) and f(θ_2 | c_2, d_2) are the beta densities for θ_1 and θ_2. The analytic
representation of the ABBH distribution is given by:

$$p(x, \theta_1, \theta_2 \mid n_1, c_1, d_1, n_2, c_2, d_2) = \left(\frac{\theta_2}{1-\theta_2}\right)^{x} (1-\theta_1)^{n_1} (1-\theta_2)^{n_2} \sum_{x_1=a}^{b} \binom{n_1}{x_1}\binom{n_2}{x-x_1}\left[\frac{\theta_1(1-\theta_2)}{\theta_2(1-\theta_1)}\right]^{x_1} \cdot \frac{\theta_1^{\,c_1-1}(1-\theta_1)^{\,d_1-1}}{B(c_1, d_1)} \cdot \frac{\theta_2^{\,c_2-1}(1-\theta_2)^{\,d_2-1}}{B(c_2, d_2)} \qquad (16)$$

where the limits of the summation operator are given by a = max(0, x - n_2) and
b = min(n_1, x).
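Since the aggregate binomial density is just the convolution of two binomial pmfs, it is easy to evaluate numerically. The sketch below is ours (function names are illustrative assumptions); it computes the ABin pmf and the ABBH density (16), and checks the collapse to an ordinary binomial when θ_1 = θ_2.

```python
# Aggregate binomial pmf (the ABin distribution of Definition 2) and the ABBH density (16).
from scipy.stats import binom, beta

def abin_pmf(x, n1, n2, t1, t2):
    lo, hi = max(0, x - n2), min(n1, x)          # summation limits a and b
    return sum(binom.pmf(x1, n1, t1) * binom.pmf(x - x1, n2, t2)
               for x1 in range(lo, hi + 1))

def abbh_density(x, t1, t2, n1, c1, d1, n2, c2, d2):
    return abin_pmf(x, n1, n2, t1, t2) * beta.pdf(t1, c1, d1) * beta.pdf(t2, c2, d2)

# With theta1 == theta2 == theta the aggregate binomial collapses to Bin(n1 + n2, theta).
assert abs(abin_pmf(7, 10, 20, 0.3, 0.3) - binom.pmf(7, 30, 0.3)) < 1e-12
```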
We note that there is a relation between these two definitions. Assuming in item i) of
definition 2 that X | θ_1, θ_2, where X = X_1 + X_2, follows an aggregate binomial distribution is
equivalent to assuming that X_1 | θ_1 and X_2 | θ_2 are independent
and each binomially distributed. This is a known result from probability theory, and it led us to
use the term aggregate to name the ABBH distribution in definition 2.
As we will see in the next section, KRT used only the BBH distribution to formulate their
BBH model, while we had to use, in a different fashion, both the BBH and the ABBH
distributions because of our alternative approach to implementing the model with the ECM
algorithm.

5 The Binomial-Beta Hierarchical Model

We present in this section two versions of the BBH model. Both are developed under a
bayesian approach and structured similarly according to three hierarchical stages. The basic
difference between them refers to the assumptions made in the first stage for the data
generation process (DGP). The first version models the disaggregate DGP using binomial
distributions, with the result that the aggregate DGP is, necessarily, modeled according to
aggregate binomial distributions. This version forms the basis for the fast implementation
method we present in Section 6. The second version, which is KRT's, models the
aggregate DGP directly using binomial distributions with a particular design for
the binomial probabilities. This second version makes no assumption about the disaggregate
DGP. In order to present these two versions, we consider the EI problem described in
terms of frequencies (table 1) with P precincts.

5.1 Formulation based on the disaggregate DGP

In the first stage, the disaggregate data variables N_Bi and N_Wi at the i-th precinct are
assumed to follow independent binomial distributions with known counts n_Xi and n_i - n_Xi, and
binomial probabilities β_i and ω_i. From the accounting identity in (5), it follows that the
aggregate data variable N_Ti follows an aggregate binomial distribution. In the second stage,
the binomial probabilities β_i and ω_i are assumed to be sampled from beta distributions with
parameters (c_b, d_b) and (c_w, d_w), respectively, where these parameters are taken to be
constant across all precincts. In the third and last stage, the beta parameters are assumed to
follow non-informative priors. The formal description of the BBH model in this case can be
written as:

$$N_{Bi} \mid \beta_i \sim \mathrm{Bin}(n_{Xi}, \beta_i) \qquad (17)$$
$$N_{Wi} \mid \omega_i \sim \mathrm{Bin}(n_i - n_{Xi}, \omega_i) \qquad (18)$$
$$N_{Ti} \mid \beta_i, \omega_i \sim \mathrm{ABin}(n_{Xi}, n_i - n_{Xi}, \beta_i, \omega_i) \qquad (19)$$
$$\beta_i \mid c_b, d_b \sim \mathrm{Beta}(c_b, d_b) \qquad (20)$$
$$\omega_i \mid c_w, d_w \sim \mathrm{Beta}(c_w, d_w) \qquad (21)$$
$$c_b \sim \mathrm{n.i.p.d.} \qquad (22)$$
$$d_b \sim \mathrm{n.i.p.d.} \qquad (23)$$
$$c_w \sim \mathrm{n.i.p.d.} \qquad (24)$$
$$d_w \sim \mathrm{n.i.p.d.} \qquad (25)$$

for i = 1,...,P, where n.i.p.d. stands for non-informative prior distribution. For
simplicity, in this article we work with n.i.p.d.s of the uniform type defined on (0, 10] for the
beta parameters (alternatively, KRT assumed as priors for these parameters independent
exponential distributions with a high mean). The value 10 for the upper truncation of these
priors was introduced so that we have proper priors.
Expressions (17) and (18), which refer to the first stage, represent the disaggregate
DGP with independent binomial models. Taking them together with expressions (20)-(21),
which refer to the second stage, we characterize independent BBH distributions (see definition
1) for the random vectors (N_Bi, β_i)^T and (N_Wi, ω_i)^T, respectively. In turn, expression
(19) represents the aggregate DGP with an aggregate binomial model. Taking it together
with expressions (20)-(21), we characterize an ABBH distribution (see definition 2) for the
random vector (N_Ti, β_i, ω_i)^T.
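To make the hierarchy concrete, the following Python sketch (ours; the paper's own routines are in Matlab, and the function name and parameter values are illustrative assumptions) simulates data forward from the disaggregate DGP in (17)-(21).

```python
# Forward simulation of the disaggregate DGP of Section 5.1 for P precincts.
import numpy as np

rng = np.random.default_rng(0)

def simulate_bbh(n, n_X, cb, db, cw, dw):
    """n, n_X: integer arrays with precinct sizes and black counts."""
    beta_i  = rng.beta(cb, db, size=len(n))     # second stage, (20)
    omega_i = rng.beta(cw, dw, size=len(n))     # second stage, (21)
    n_B = rng.binomial(n_X, beta_i)             # first stage, (17)
    n_W = rng.binomial(n - n_X, omega_i)        # first stage, (18)
    n_T = n_B + n_W                             # accounting identity (5); N_T is ABin, (19)
    return beta_i, omega_i, n_B, n_W, n_T

n, n_X = np.array([500, 800, 300]), np.array([150, 400, 60])
print(simulate_bbh(n, n_X, cb=2.0, db=3.0, cw=4.0, dw=2.0))
```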
The vector of quantities of interest is given by ψ^T = [β^T, ω^T, h^T], where
β^T = [β_1,...,β_P], ω^T = [ω_1,...,ω_P], and h^T = [c_b, d_b, c_w, d_w]. Note that the size of the vector ψ
depends on the number of observations, as it has 2P + 4 elements. By assuming
independence among sampling units, say, that (N_Bi, N_Wi) is independent of (N_Bj, N_Wj),
which then implies N_Ti is independent of N_Tj, for i ≠ j, we can build the aggregate
posterior P_A as:
$$P_A(\psi \mid n_T) \propto \prod_{i=1}^{P} \left(\frac{\omega_i}{1-\omega_i}\right)^{n_{Ti}} (1-\beta_i)^{n_{Xi}} (1-\omega_i)^{n_i - n_{Xi}} \left[\sum_{n_{Bi}=n_{Bi}^L}^{n_{Bi}^U} \binom{n_{Xi}}{n_{Bi}} \binom{n_i - n_{Xi}}{n_{Ti} - n_{Bi}} \left(\frac{\beta_i(1-\omega_i)}{\omega_i(1-\beta_i)}\right)^{n_{Bi}}\right] \cdot \frac{\beta_i^{\,c_b-1}(1-\beta_i)^{\,d_b-1}}{B(c_b, d_b)} \cdot \frac{\omega_i^{\,c_w-1}(1-\omega_i)^{\,d_w-1}}{B(c_w, d_w)} \qquad (26)$$

which is proportional to the product of a number P of ABBH distributions (see expression (16)).
To implement the full bayesian method for making inferences at the precinct level, we have to
determine from (26) the marginal bivariate posteriors p(β_i, ω_i | n_T), i = 1,...,P, and then
the marginal predictive posteriors:

$$p(n_{Bi} \mid n_T) = \int_0^1 \int_0^1 p(n_{Bi} \mid n_{Ti}, \beta_i, \omega_i)\, p(\beta_i, \omega_i \mid n_T)\, d\beta_i\, d\omega_i \qquad (27)$$

$$p(n_{Wi} \mid n_T) = \int_0^1 \int_0^1 p(n_{Wi} \mid n_{Ti}, \beta_i, \omega_i)\, p(\beta_i, \omega_i \mid n_T)\, d\beta_i\, d\omega_i \qquad (28)$$

where n_T^T = [n_T1,...,n_TP] is the observed aggregate data. These predictive posteriors are
expected to reflect our uncertainty with regard to the values of the disaggregate variables
N_Bi and N_Wi.

5.2 KRT's formulation based on the aggregate DGP

In KRT's formulation, the disaggregate DGP in the first stage is not considered. The authors
model the aggregate DGP directly by assuming that N_Ti follows a binomial distribution, with a
given count n_i and an aggregate binomial probability β_i x_i + ω_i(1 - x_i). That is to say,
under KRT's formulation we have:

$$N_{Ti} \mid \beta_i, \omega_i \sim \mathrm{Bin}(n_i,\; \beta_i x_i + \omega_i(1 - x_i)) \qquad (29)$$

in place of expressions (17), (18) and (19) to represent the first stage of the BBH model.
Note that, by taking (29) together with expressions (20)-(21), a BBH distribution is
characterized for the random vector (N_Ti, β_i x_i + ω_i(1 - x_i))^T.
Following our considerations in Section 3, the only instance in which the binomial
distribution in (29) is consistent with our assumptions for the disaggregate DGP, that
N_Bi | β_i ~ Bin(n_Xi, β_i) and N_Wi | ω_i ~ Bin(n_i - n_Xi, ω_i), is when β_i = ω_i (in the more
general setting, assumed by KRT, that β_i ≠ ω_i, the sum N_Ti = N_Bi + N_Wi would
necessarily follow an aggregate binomial distribution, as seen in Section 5.1).
Another difference of KRT's formulation from the one based on the disaggregate DGP is
that, instead of using the predictive posteriors in (27) and (28), KRT undertook the
inferences at the precinct level using the marginal posteriors for the binomial probabilities,
p(β_i | n_T) and p(ω_i | n_T). The authors used these distributions in full to summarize the
uncertainty about the disaggregate data, and obtained them from the joint posterior for the
vector ψ, which we label here as P*_A. Taking p(h) constant and equal to 10^-4 (since we are
using independent uniform priors on (0, 10] for the elements of h), this posterior is written:

$$P_A^*(\psi \mid n_T) \propto \prod_{i=1}^{P} \binom{n_i}{n_{Ti}} \left(\beta_i x_i + \omega_i(1-x_i)\right)^{n_{Ti}} \left(1 - \beta_i x_i - \omega_i(1-x_i)\right)^{n_i - n_{Ti}} \cdot \frac{\beta_i^{\,c_b-1}(1-\beta_i)^{\,d_b-1}}{B(c_b, d_b)} \cdot \frac{\omega_i^{\,c_w-1}(1-\omega_i)^{\,d_w-1}}{B(c_w, d_w)} \qquad (30)$$

Note that it is proportional to the product of a number P of BBH distributions (see expression
(11)). Determining the marginal posteriors for the binomial probabilities involves complex,
multidimensional integrations of (30), which we generally represent as:

$$p(\beta_i \mid n_T) = \int_{A_{/}} P_A^*(\psi \mid n_T)\, d\psi_{/} \qquad (31)$$

$$p(\omega_i \mid n_T) = \int_{A_{//}} P_A^*(\psi \mid n_T)\, d\psi_{//} \qquad (32)$$

In (31), ψ_/ is a (2P + 3)x1 vector composed of all elements of ψ except β_i, and
A_/ ⊂ R^(2P+3) is the subspace of admissible values for ψ_/. Analogous definitions apply in the
case of ω_i, say, for ψ_// and A_// in (32).
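For completeness, a compact Python sketch of the log of KRT's aggregate posterior (30), up to the additive constant from the flat prior on h, is given below (ours; the paper and KRT used Matlab and MCMC, so this is only an illustrative evaluation, with function and argument names that are assumptions).

```python
# Log of KRT's aggregate posterior (30), up to an additive constant.
import numpy as np
from scipy.stats import binom, beta

def krt_log_posterior(beta_i, omega_i, cb, db, cw, dw, n_T, n, x):
    """beta_i, omega_i, n_T, n, x are arrays over precincts; x = n_X / n."""
    p_agg = beta_i * x + omega_i * (1.0 - x)                 # aggregate probability in (29)
    ll = binom.logpmf(n_T, n, p_agg).sum()                   # first (aggregate) stage
    lp = beta.logpdf(beta_i, cb, db).sum() + beta.logpdf(omega_i, cw, dw).sum()  # second stage
    return ll + lp
```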

5.3 Considerations regarding implementation

In order to obtain the marginal posteriors p(β_i | n_T) and p(ω_i | n_T), i = 1,...,P, KRT
proposed the use of simulation methods based on MCMC algorithms (Tanner, 1996;
Gelman et al, 1995). The precise procedure is described in KRT and involves the use of the
Gibbs sampler coupled with the Metropolis algorithm to simulate the entire aggregate
posterior P*_A in (30), from which the simulated marginal posteriors for
the binomial probabilities are naturally obtained. This approach could as well be used to implement the first
version of the BBH model, based on the disaggregate DGP. A basic difference would be
that, in addition to using the MCMC methods to recover the full posterior P_A in (26), from
which we could also recover the posteriors p(β_i, ω_i | n_T), i = 1,...,P, we would need to
recover the predictive posteriors given in (27) and (28).
Though now widely recognized as the most appropriate tool for implementing complex
inferences, in particular for complex bayesian hierarchical models (as in the present case),
MCMC methods have the drawback of being highly computer intensive and time
consuming, with applications generally requiring hours of computer processing.
In the case of the BBH model (both versions), the situation is even worse because of the large
number of quantities of interest, and because each time we increase the sample size P the
dimension of the vector ψ, and thus of the aggregate posterior to be simulated, grows with it,
implying extended time of intensive processing.
In the next sections, we explore an alternative way to implement the BBH model by
means of maximizing (instead of summarizing) the aggregate posterior for the first version,
and making inferences using the conditional predictive posteriors for the disaggregate
proportions (instead of the marginal posteriors for the binomial probabilities). Although only
point and interval predictions for the quantities of interest are obtained by our
approach, the advantage is that substantially faster computations result.

6 Posterior Maximization

In this section, we discuss the maximization of the aggregate posterior from the BBH model
version based on the disaggregate DGP, and propose a suitable method to handle this
maximization. Our goal is to maximize the expression for P_A given in (26). We follow
the usual practice of working with its log version lp_A = log P_A, which is written as:

$$lp_A(\psi \mid n_T) = C + \sum_{i=1}^{P} n_{Ti} \ln\!\left(\frac{\omega_i}{1-\omega_i}\right) + \sum_{i=1}^{P} n_{Xi} \ln(1-\beta_i) + \sum_{i=1}^{P} (n_i - n_{Xi}) \ln(1-\omega_i) + \sum_{i=1}^{P} \ln\!\left[\sum_{n_{Bi}=n_{Bi}^L}^{n_{Bi}^U} \binom{n_{Xi}}{n_{Bi}} \binom{n_i - n_{Xi}}{n_{Ti} - n_{Bi}} \left(\frac{\beta_i(1-\omega_i)}{\omega_i(1-\beta_i)}\right)^{n_{Bi}}\right] + (c_b - 1)\sum_{i=1}^{P} \ln\beta_i + (d_b - 1)\sum_{i=1}^{P} \ln(1-\beta_i) + (c_w - 1)\sum_{i=1}^{P} \ln\omega_i + (d_w - 1)\sum_{i=1}^{P} \ln(1-\omega_i) - P\ln B(c_b, d_b) - P\ln B(c_w, d_w) \qquad (33)$$

where C is a constant. This aggregate log-posterior is an intrinsically nonlinear function of
the 2P + 4 elements of the vector ψ and displays no analytic maximum; thus, a nonlinear
optimization algorithm is necessary to find a mode of it. A suitable candidate might be a
quasi-Newton version of the sequential quadratic programming method (e.g. Bonnans et al
1996), which is available from major software packages (see also Schoenberg 1996).
However, the large number of arguments of lp_A in (33) makes this sort of algorithm run
too slowly.¹

¹ At an early stage of our research, we tried this approach to maximize the P*_A in (30) using a simulated
dataset with 100 observations. Function CONSTR of the Matlab language's optimization toolbox, which
implements the sequential quadratic programming method, was used to maximize lp*_A. The algorithm
processed 20,558 iterations over four hours on a PC with a 366 MHz Intel Celeron
processor before achieving convergence. Brown and Payne (1986) reported a similar problem when
implementing the ACM model with a Newton algorithm.
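To make the object being maximized concrete, the sketch below (ours and purely illustrative; the direct-summation evaluation of the aggregate binomial term is practical only for moderate precinct sizes) evaluates lp_A in (33) up to the constant C.

```python
# Aggregate log-posterior lp_A of (33), evaluated by direct summation over the
# Duncan-Davis range of each precinct.
import numpy as np
from scipy.stats import binom, beta

def abin_logpmf(n_T, n_X, n, b, w):
    lo, hi = max(0, n_T - (n - n_X)), min(n_X, n_T)
    k = np.arange(lo, hi + 1)
    return np.log(np.sum(binom.pmf(k, n_X, b) * binom.pmf(n_T - k, n - n_X, w)))

def log_posterior(beta_i, omega_i, cb, db, cw, dw, n_T, n_X, n):
    ll = sum(abin_logpmf(n_T[i], n_X[i], n[i], beta_i[i], omega_i[i])
             for i in range(len(n)))                                        # aggregate binomial part
    lp = beta.logpdf(beta_i, cb, db).sum() + beta.logpdf(omega_i, cw, dw).sum()  # beta priors
    return ll + lp
```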

6.1 The EM Algorithm


An alternative approach could in principle be the well-known Expectation-Maximization
Algorithm (EMA), introduced by Dempster, Laird and Rubin (1977) to maximize likelihood
and posterior functions in incomplete data problems. The EMA can be used to maximize
lp_A and it also operates in an iterative way. In each iteration, another maximization is
processed, say, of the so-called Q-function, which in the EMA formalism is the conditional
expectation of the log-posterior based on the complete data. In order to represent the EI
problem under that formalism, we consider the aggregate data as incomplete and the
disaggregate data as complete. The Q-function then becomes a conditional expectation of
the disaggregate log-posterior.
Let P_D be the disaggregate posterior and let n_B^T = [n_B1,...,n_BP] and
n_W^T = [n_W1,...,n_WP] be vectors of disaggregate data (hypothetically observed). P_D can
be represented (taking p(h) constant) as:

$$P_D(\psi \mid n_B, n_W) \propto \frac{\prod_{i=1}^{P} \beta_i^{\,n_{Bi}+c_b-1} (1-\beta_i)^{\,n_{Xi}-n_{Bi}+d_b-1}\, \omega_i^{\,n_{Wi}+c_w-1} (1-\omega_i)^{\,n_i-n_{Xi}-n_{Wi}+d_w-1}}{\left[B(c_b, d_b)\, B(c_w, d_w)\right]^{P}} \qquad (34)$$

Now, let lp_D = log P_D, and write it as:

$$lp_D(\psi \mid n_B, n_W) = C + \sum_{i=1}^{P} (n_{Bi} + c_b - 1)\log\beta_i + \sum_{i=1}^{P} (n_{Xi} - n_{Bi} + d_b - 1)\log(1-\beta_i) + \sum_{i=1}^{P} (n_{Wi} + c_w - 1)\log\omega_i + \sum_{i=1}^{P} (n_i - n_{Xi} - n_{Wi} + d_w - 1)\log(1-\omega_i) - P\log B(c_b, d_b) - P\log B(c_w, d_w) \qquad (35)$$

The Q-function in the present case is the expected value of lp_D in (35), conditioned on
the observed aggregate data n_T^T = [n_T1,...,n_TP] and on a parameter guess ψ_k. It can thus
be represented as:

$$Q(\psi \mid \psi_k) = E[\,lp_D(\psi \mid N_B, N_W) \mid n_T, \psi_k\,] \qquad (36)$$

From (35), we see that lp_D is linear in the n_Bi's and the n_Wi's, the elements of the
disaggregate data vectors n_B and n_W, respectively. This allows us to compute the Q-function
by replacing the unobservable disaggregate data with their respective conditional
predictions, as follows:
$$Q(\psi \mid \psi_k) = \sum_{i=1}^{P} (n_{Bi,k} + c_b - 1)\log\beta_i + \sum_{i=1}^{P} (n_{Xi} - n_{Bi,k} + d_b - 1)\log(1-\beta_i) + \sum_{i=1}^{P} (n_{Wi,k} + c_w - 1)\log\omega_i + \sum_{i=1}^{P} (n_i - n_{Xi} - n_{Wi,k} + d_w - 1)\log(1-\omega_i) - P\log B(c_b, d_b) - P\log B(c_w, d_w) \qquad (37)$$

where:

$$n_{Bi,k} = E[N_{Bi} \mid n_{Ti}, \beta_{i,k}, \omega_{i,k}] \qquad (38)$$

$$n_{Wi,k} = E[N_{Wi} \mid n_{Ti}, \beta_{i,k}, \omega_{i,k}] = n_{Ti} - n_{Bi,k} \qquad (39)$$

The variable n_Bi,k in (38) can be shown to be the mean of a non-central hypergeometric
distribution (e.g., McCullagh and Nelder 1989, pp. 257-259; see also Section 7.2), and a small
sketch of this computation is given after Box 1. Note that we only need to compute the
conditional prediction for n_Bi,k in (38), and then use it in
expression (39) to determine n_Wi,k. The EMA scheme to implement the maximization of
lp_A (and thus of P_A) is described in Box 1.

Box 1. EMA scheme for maximizing the aggregate posterior of the BBH model

1. Assume a current guess ψ_k^T = [β_k^T, ω_k^T, h_k^T];
2. E-step: Compute, for i = 1,...,P:
   n_Bi,k = E[N_Bi | n_Ti, β_i,k, ω_i,k]
   n_Wi,k = n_Ti - n_Bi,k;
3. M-step: Maximize the Q-function for ψ, obtaining:
   ψ_{k+1} = arg max_ψ Q(ψ | ψ_k);
4. Repeat steps 1-3 until convergence, say, until ||ψ_{k+1} - ψ_k|| < ε and/or
   |lp_A(ψ_{k+1}) - lp_A(ψ_k)| < δ, with small ε, δ > 0.
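The E-step expectation can be computed by normalizing the product of the two first-stage binomial pmfs over the Duncan-Davis range, which is equivalent to the non-central hypergeometric mean of expressions (65)-(67) below. A minimal Python sketch (ours; function and variable names are illustrative assumptions):

```python
# E-step of Box 1: conditional expectation of N_B given the observed aggregate count.
import numpy as np
from scipy.stats import binom

def e_step_nB(n_T, n_X, n, b, w):
    lo, hi = max(0, n_T - (n - n_X)), min(n_X, n_T)      # Duncan-Davis bounds
    k = np.arange(lo, hi + 1)
    weights = binom.pmf(k, n_X, b) * binom.pmf(n_T - k, n - n_X, w)
    pmf = weights / weights.sum()                        # non-central hypergeometric pmf
    return float(np.sum(k * pmf))                        # E[N_B | n_T, beta, omega]

nB_k = e_step_nB(n_T=55, n_X=30, n=100, b=0.6, w=0.7)
nW_k = 55 - nB_k                                         # E[N_W | n_T] = n_T - E[N_B | n_T]
```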

The EMA theory (e.g., McLachlan and Krishnan, 1997) assures that the scheme 1-4 of Box
1 has good convergence properties: for instance, provided that
Q(ψ_{k+1} | ψ_k) ≥ Q(ψ_k | ψ_k), as is guaranteed by the M-step, lp_A will increase
monotonically at every iteration. However, the EMA in the present case is even worse than a
quasi-Newton algorithm, because the Q-function also has no analytic maximum
and depends on the same number of arguments. The M-step involves an iterative search over
a space of 2P + 4 dimensions (to be done perhaps with a quasi-Newton algorithm), and
since this step is run at every iteration of the EMA, the mode search would take many
times longer.
6.2 The ECM Algorithm
Fortunately, an extension of the EMA proposed by Meng and Rubin (1993), known as the
Expectation-Conditional-Maximization Algorithm (ECMA), is a viable alternative. The
difference between the EMA and the ECMA is that the latter replaces the M-step of the
former by conditional maximization (CM) steps which, individually, are simpler than the
overall maximization in a single M-step. This is a useful device, for instance, when the M-step
is non-analytic but some or all of the CM-steps are analytic. In such a case, though
the ECMA may take more iterations to converge, it can do so faster than the EMA. The
replacement of the M-step by CM-steps is not equivalent to maximizing the Q-function, but it
is such that Q(ψ_{k+1} | ψ_k) ≥ Q(ψ_k | ψ_k) at every iteration, and thus the ECMA presents the
same monotonicity property as the EMA (Meng and Rubin, 1993; see also Schafer 1997).
For every problem, the number of CM-steps we may use naturally depends on the
number of arguments of the Q-function, but more important is our choice of partition for
the vector ψ. In the present case, a partitioning resulting in three CM-steps of interest can be
achieved. To make this clear, let us consider first the maximization of the Q-function in (37),
which has to be undertaken at the M-step of the EMA. By applying first order conditions
for a maximum, say, differentiating that function partially with respect to each element of ψ,
equating the derivatives to zero, and making a few algebraic manipulations, we obtain the
following equation system:

$$\beta_i = \frac{n_{Bi} + c_b - 1}{n_{Xi} + c_b + d_b - 2}, \qquad i = 1,\ldots,P \qquad (40)$$

$$\omega_i = \frac{n_{Wi} + c_w - 1}{n_i - n_{Xi} + c_w + d_w - 2}, \qquad i = 1,\ldots,P \qquad (41)$$

$$\frac{\partial B(c_b, d_b)/\partial c_b}{B(c_b, d_b)} = \frac{\sum_{i=1}^{P} \log\beta_i}{P} \qquad (42)$$

$$\frac{\partial B(c_b, d_b)/\partial d_b}{B(c_b, d_b)} = \frac{\sum_{i=1}^{P} \log(1-\beta_i)}{P} \qquad (43)$$

$$\frac{\partial B(c_w, d_w)/\partial c_w}{B(c_w, d_w)} = \frac{\sum_{i=1}^{P} \log\omega_i}{P} \qquad (44)$$

$$\frac{\partial B(c_w, d_w)/\partial d_w}{B(c_w, d_w)} = \frac{\sum_{i=1}^{P} \log(1-\omega_i)}{P} \qquad (45)$$

Maximizing the Q-function is then equivalent to solving the system (40)-(45), which is
intrinsically nonlinear. If we use an iterative search algorithm like a quasi-Newton method
to solve it, the algorithm may take too long because of the large number of equations and
unknowns (2P + 4 for both). This results in a complicated M-step that makes the EMA
practically useless in this case, as we noted before. However, if we replace this M-step by
suitably chosen CM-steps, we can obtain substantial improvements in processing speed, as
we explain now.
Consider the following partition of the vector ψ into three subvectors:

$$\psi^T = [\theta^T, h_b^T, h_w^T] \qquad (46)$$

where each subvector is given by:

$$\theta^T = [\beta^T, \omega^T] \qquad (47)$$
$$h_b^T = [c_b, d_b] \qquad (48)$$
$$h_w^T = [c_w, d_w] \qquad (49)$$

and let a current guess ψ_k^T = [θ_k^T, h_{b,k}^T, h_{w,k}^T] be given. Then, at the M-step of the
EMA scheme, replace the general maximization of the Q-function by the conditional
maximizations:

$$\text{CM1:} \quad \theta_{k+1} = \arg\max_{\theta} Q(\theta, h_{b,k}, h_{w,k} \mid \psi_k) \qquad (50)$$

$$\text{CM2:} \quad h_{b,k+1} = \arg\max_{h_b} Q(\theta_{k+1}, h_b, h_{w,k} \mid \theta_{k+1}, h_{b,k}, h_{w,k}) \qquad (51)$$

$$\text{CM3:} \quad h_{w,k+1} = \arg\max_{h_w} Q(\theta_{k+1}, h_{b,k+1}, h_w \mid \theta_{k+1}, h_{b,k+1}, h_{w,k}) \qquad (52)$$

in order to determine a new estimate ψ_{k+1}^T = [θ_{k+1}^T, h_{b,k+1}^T, h_{w,k+1}^T].
By following the above procedure, we replace the M-step by three CM-steps, here
denoted the CM1-, CM2-, and CM3-steps. In the CM1-step, we take as fixed the
elements of the vectors h_b and h_w, setting both equal to their guess values h_{b,k} and h_{w,k},
and the problem is to maximize the function Q(θ, h_{b,k}, h_{w,k} | ψ_k) for θ^T = [β^T, ω^T].
Note that the solution to this conditional maximization problem is simple and equivalent to
computing expressions (40) and (41) for β_i and ω_i (i = 1,...,P), respectively, because those
expressions are now analytic functions of known elements: the conditional predictions n_Bi,k
and n_Wi,k (i = 1,...,P), and the guesses for the beta parameters h_{b,k} and h_{w,k} (see box 2).
Their computation is thus easy to implement, allowing the CM1-step to be performed
very fast. Equally important, they provide a solution for 2P of the 2P + 4 unknowns.
In the CM2-step, the 2P values of the quantities of interest β_i and ω_i are held fixed at
the values β_{i,k+1} and ω_{i,k+1} just obtained in the CM1-step. The problem now is to
maximize the function Q(θ_{k+1}, h_b, h_{w,k} | θ_{k+1}, h_{b,k}, h_{w,k}) for h_b. This is equivalent to solving
the subsystem formed by the two equations (42) and (43) for h_b = [c_b, d_b]^T. This system
does not have an analytic solution because of the terms in the beta function B(c_b, d_b) and its
derivatives appearing in both equations (42) and (43) (Beckman and Tietjen 1978). Given
the β_i's, sub-system (42)-(43) corresponds to the likelihood equations
derived from a beta distribution. As such, it presents a unique solution for h_b, because the
beta distribution belongs to a regular exponential family. For distributions in this family, it is
well known that the log-likelihood function is strictly concave and thus presents a single
maximum, if one exists (Barndorff-Nielsen 1978 and 1982). The same quasi-Newton type of
algorithm considered before is a good option for solving sub-system (42)-(43), in part
because it presents superlinear and sure convergence in the case of strict concavity of the
objective function, but also because the low dimension of the problem (only two arguments)
allows fast processing and better numerical stability (Meng and Rubin, 1993).
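A sketch of such a low-dimensional beta fit is given below (ours; a bounded quasi-Newton search on the negative log-likelihood stands in for the explicit solution of (42)-(43), the (0, 10] box mirrors the truncated uniform priors of Section 5.1, and all names are illustrative assumptions).

```python
# CM2-/CM3-step: fit the beta hyperparameters (c, d) given the theta values fixed in CM1.
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def fit_beta_params(theta, start=(1.0, 1.0)):
    """theta: array of values in (0, 1); returns (c_hat, d_hat) maximizing the beta likelihood."""
    s1, s2 = np.log(theta).sum(), np.log1p(-theta).sum()
    P = len(theta)

    def negloglik(h):
        c, d = h
        return -((c - 1.0) * s1 + (d - 1.0) * s2 - P * betaln(c, d))

    res = minimize(negloglik, x0=np.array(start), method="L-BFGS-B",
                   bounds=[(1e-6, 10.0), (1e-6, 10.0)])
    return res.x
```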
Analogous considerations apply to the CM3-step, and the complete description of the
ECMA is presented in Box 2 (a compact sketch tying the steps together follows the box). In
fact, the algorithm we implemented also makes use of a reparameterization of the vector ψ
so as to achieve better numerical performance (see details in Appendix 2). In Section 8, we
present an illustrative application of our algorithm to a real dataset.

Box 2. ECMA scheme for maximizing the aggregate posterior of the BBH model

1. Assume a current guess ψ_k^T = [θ_k^T, h_{b,k}^T, h_{w,k}^T];
2. E-step: Compute the conditional expected values for the disaggregate data:
   n_Bi,k = E[N_Bi | n_Ti, β_i,k, ω_i,k]
   n_Wi,k = n_Ti - n_Bi,k;
3. CM1-step: Maximize the function Q(θ, h_{b,k}, h_{w,k} | ψ_k) for θ, by computing:

$$\beta_{i,k+1} = \frac{n_{Bi,k} + c_{b,k} - 1}{n_{Xi} + c_{b,k} + d_{b,k} - 2}$$

$$\omega_{i,k+1} = \frac{n_{Wi,k} + c_{w,k} - 1}{n_i - n_{Xi} + c_{w,k} + d_{w,k} - 2}$$

   and then determine θ_{k+1}^T = [β_{k+1}^T, ω_{k+1}^T];
4. CM2-step: Maximize the function Q(θ_{k+1}, h_b, h_{w,k} | θ_{k+1}, h_{b,k}, h_{w,k}) for h_b,
   and then determine h_{b,k+1} = [c_{b,k+1}, d_{b,k+1}];
5. CM3-step: Maximize the function Q(θ_{k+1}, h_{b,k+1}, h_w | θ_{k+1}, h_{b,k+1}, h_{w,k}) for
   h_w, and then determine h_{w,k+1} = [c_{w,k+1}, d_{w,k+1}];
6. Having determined the new estimate ψ_{k+1}^T = [θ_{k+1}^T, h_{b,k+1}^T, h_{w,k+1}^T], repeat steps
   1-5 until convergence, say, until ||ψ_{k+1} - ψ_k|| < ε and/or
   |lp_A(ψ_{k+1}) - lp_A(ψ_k)| < δ, with small ε, δ > 0.
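The driver below (ours) ties the previous sketches together into the scheme of Box 2; `e_step_nB` and `fit_beta_params` are the hypothetical helpers sketched earlier in this section, and the clipping of the CM1 updates is a numerical guard of ours, not part of the paper.

```python
# Compact ECM driver following Box 2 (illustrative sketch; assumes e_step_nB and
# fit_beta_params as defined in the earlier sketches).
import numpy as np

def ecm(n_T, n_X, n, max_iter=500, tol=1e-6):
    P = len(n)
    beta_i, omega_i = np.full(P, 0.5), np.full(P, 0.5)
    cb = db = cw = dw = 1.0
    for _ in range(max_iter):
        # E-step: conditional expectations of the disaggregate counts
        nB = np.array([e_step_nB(n_T[i], n_X[i], n[i], beta_i[i], omega_i[i])
                       for i in range(P)])
        nW = n_T - nB
        # CM1-step: analytic updates (40)-(41), clipped away from 0 and 1 for the beta fit
        beta_new  = np.clip((nB + cb - 1.0) / (n_X + cb + db - 2.0), 1e-6, 1 - 1e-6)
        omega_new = np.clip((nW + cw - 1.0) / (n - n_X + cw + dw - 2.0), 1e-6, 1 - 1e-6)
        # CM2- and CM3-steps: refit the beta hyperparameters
        cb, db = fit_beta_params(beta_new)
        cw, dw = fit_beta_params(omega_new)
        converged = (np.abs(beta_new - beta_i).max() < tol and
                     np.abs(omega_new - omega_i).max() < tol)
        beta_i, omega_i = beta_new, omega_new
        if converged:
            break
    return beta_i, omega_i, (cb, db, cw, dw)
```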

We conclude this section by adding that our method based on the ECMA allows further
improvements in computational efficiency. This might be accomplished by adding more cycles of
E- and CM-steps that may produce larger increases in the Q-function at every iteration.
Such procedures follow further extensions already available in the EMA literature (see Meng
and van Dyk, 1997).
7 Methodological Issues

In their approach to the BBH model, KRT used the marginal posteriors for the binomial
probabilities, p(β_i | n_T) and p(ω_i | n_T), to make inferences regarding the values of the
disaggregate proportions B_i and W_i. As mentioned before, this procedure is not the most
appropriate if we want the desirable EI properties, like aggregation consistency and respect
for the Duncan-Davis bounds, to be satisfied. We argue, instead, that when using
the BBH model we should make EI by means of the predictive posteriors for the
disaggregate frequencies, p(n_Bi | n_T) and p(n_Wi | n_T), or better, the predictive posteriors
for the disaggregate proportions, p(b_i | n_T) and p(w_i | n_T). The key reason is that, by
construction, predictive posteriors always respect the accounting identity, which guarantees
that the disaggregate data predictions will display aggregation consistency and respect the
deterministic bounds.
In this section, we discuss these issues in some detail for the version of the BBH model
based on the disaggregate DGP. We begin by exploring why the use of posteriors for
binomial probabilities is a procedure of inference that generally does not respect the
accounting identity. Then we move on to explore the appropriate use of predictive
posteriors. An additional benefit of this discussion is that we can learn principles for EI
model development in general. In Appendix 3, we extend this analysis to KRT's version.

7.1 Accounting Identity, Aggregation Consistency and Bounds


We saw in Section 5 that the BBH model based on the disaggregate DGP is built
hierarchically in three stages, where the first is a binomial model for the disaggregate
frequencies. We now consider this first stage in isolation, by looking only at the pure
binomial model, say, without its compounding with another distribution. In this particular
case, the use of binomial probabilities as the vehicle of inference guarantees that the bounds are
respected, even though these parameters are unidentified. Although a second hierarchical
stage for the BBH model using the beta distribution may enhance model flexibility, as
defended by KRT, the accounting identity will probably not be respected, with the result
that the bounds are no longer guaranteed to be satisfied.
Let us consider the first-stage binomial model in (17), (18), and (19), which we rewrite
here as follows:

$$N_{Bi} \sim \mathrm{Bin}(n_{Xi}, \beta_i) \qquad (53)$$
$$N_{Wi} \sim \mathrm{Bin}(n_i - n_{Xi}, \omega_i) \qquad (54)$$
$$N_{Ti} \sim \mathrm{ABin}(n_{Xi}, n_i - n_{Xi}, \beta_i, \omega_i) \qquad (55)$$

for i = 1,...,P. The 2Px1 vector of unknown parameters for this model is given by:

$$\theta^T = [\beta^T, \omega^T] \qquad (56)$$

where, as before, β^T = [β_1,...,β_P] and ω^T = [ω_1,...,ω_P].


Now, let us consider the problem of estimating the parameters θ^T = [β^T, ω^T]. Given a
vector of observed aggregate data n_T^T = [n_T1,...,n_TP], and assuming a uniform prior over
(0, 1) for every element of θ, the aggregate posterior is proportional to the observed
likelihood function, here denoted L_A and written as:

$$L_A(\theta \mid n_T) = \prod_{i=1}^{P} \mathrm{ABin}(n_{Ti} \mid n_{Xi}, n_i - n_{Xi}, \beta_i, \omega_i) = \prod_{i=1}^{P} \left(\frac{\omega_i}{1-\omega_i}\right)^{n_{Ti}} (1-\beta_i)^{n_{Xi}} (1-\omega_i)^{n_i - n_{Xi}} \sum_{n_{Bi}=n_{Bi}^L}^{n_{Bi}^U} \binom{n_{Xi}}{n_{Bi}} \binom{n_i - n_{Xi}}{n_{Ti} - n_{Bi}} \left(\frac{\beta_i(1-\omega_i)}{\omega_i(1-\beta_i)}\right)^{n_{Bi}} \qquad (57)$$

The log-likelihood l_A = log L_A is then given by:

$$l_A(\theta \mid n_T) = \sum_{i=1}^{P} n_{Ti} \ln\!\left(\frac{\omega_i}{1-\omega_i}\right) + \sum_{i=1}^{P} n_{Xi} \ln(1-\beta_i) + \sum_{i=1}^{P} (n_i - n_{Xi}) \ln(1-\omega_i) + \sum_{i=1}^{P} \ln\!\left[\sum_{n_{Bi}=n_{Bi}^L}^{n_{Bi}^U} \binom{n_{Xi}}{n_{Bi}} \binom{n_i - n_{Xi}}{n_{Ti} - n_{Bi}} \left(\frac{\beta_i(1-\omega_i)}{\omega_i(1-\beta_i)}\right)^{n_{Bi}}\right] \qquad (58)$$

We can easily verify that l_A has an infinite number of stationary points. By applying first
order conditions for a maximum and manipulating, we verify that it is not possible to obtain
closed-form solutions for the parameters as functions of the aggregate observations n_T
and n_X. Nevertheless, it is possible to obtain the following relation between the elements of
(β_i, ω_i):

$$\omega_i = \frac{t_i}{1-x_i} - \beta_i \frac{x_i}{1-x_i} \qquad (59)$$

where t_i = n_Ti / n_i and x_i = n_Xi / n_i. This equation represents, for the i-th precinct, a
tomography line disposed over the unit square (see fig. 1). Any one of the infinitely many
points on this line may be a solution to the problem of maximizing l_A. This infinity of
solutions happens because the structure of the aggregate binomial model does not allow
us to identify its 2P parameters.
But, more important concerning the aggregate binomial model, any feasible solution
to the likelihood equations can be seen as an EI prediction that displays aggregation
consistency and respects the Duncan-Davis bounds. This happens because, for a given
aggregate proportion T_i = t_i, equation (59) is equivalent to the accounting identity in (6).
We further explain this point using fig. 1, which shows the space of possible values for the
binomial probabilities (β_i, ω_i), consisting of the unit square [0,1]x[0,1]. The negatively sloped
line represents graphically the accounting identity associated with a particular aggregate data
pair (t_i, x_i). The projections of this line onto the axes are the admissible intervals [b_i^L, b_i^U]
and [w_i^L, w_i^U] in proportions. It is clear that points on the line (satisfying the accounting
identity) are predictions which respect the bounds and present aggregation consistency.
Points off the line are predictions which do not display aggregation consistency, but which
may or may not respect the intervals: for instance, the circles respect both intervals, the dark
points respect only one of the intervals, and the remaining point respects neither interval.

Figure 1. Accounting identity, bounds and aggregation consistency.
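To make Figure 1 concrete, a tiny sketch (ours, with illustrative numbers) computes the tomography line (59) and the admissible intervals in proportions for one precinct.

```python
# Tomography line (59) and the admissible intervals in proportions for one precinct.
def omega_on_line(beta_i, t, x):
    """Accounting identity (6) solved for omega_i, given t_i and x_i."""
    return t / (1 - x) - beta_i * x / (1 - x)

def bounds_in_proportions(t, x):
    b_lo, b_hi = max(0.0, (t - (1 - x)) / x), min(1.0, t / x)
    w_lo, w_hi = max(0.0, (t - x) / (1 - x)), min(1.0, t / (1 - x))
    return (b_lo, b_hi), (w_lo, w_hi)

print(bounds_in_proportions(t=0.55, x=0.3))   # ((0.0, 1.0), (0.357..., 0.785...))
```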

What happens when we add to the binomial model the other stages in the hierarchy of
the BBH model? First, the aggregate posterior P_A and its log version take new functional
forms, as presented earlier in (26) and (33). Second, by applying first order conditions for a
maximum to the log-posterior in (33) and manipulating, it is easy to show that the relationship
between the modal values β̂_i and ω̂_i takes an implicit nonlinear functional form, as follows:

$$\hat\omega_i = \frac{t_i}{1-x_i} - \hat\beta_i \frac{x_i}{1-x_i} + F(\hat\beta_i, \hat\omega_i, x_i, c_b, d_b, c_w, d_w) \qquad (60)$$

According to this expression, for the accounting identity to be valid it is necessary that
F_i = 0 for any value of its arguments and for all i = 1,...,P. But this term is written as:

$$F_i = \frac{A_i}{n_i (1 - x_i)} \qquad (61)$$

where:

$$A_i = (c_b - 1)(1 - \hat\beta_i) - (d_b - 1)\hat\beta_i + (c_w - 1)(1 - \hat\omega_i) - (d_w - 1)\hat\omega_i \qquad (62)$$
The situations in which F_i = 0 are very restrictive. Provided that, at the same time, β̂_i, ω̂_i,
and x_i belong to the open interval (0, 1), this will happen, for instance, when
c_b = d_b = c_w = d_w = 1. It is clear, though, that F_i ≠ 0 in infinitely many situations
where these beta parameters are such that A_i ≠ 0.
Even if the accounting identity does not hold, the values of the marginal posterior
modes for β_i and ω_i may still respect the bounds. Fig. 1 illustrates this point by means of
the small circles, which represent predicted values for the disaggregate proportions that are off
the tomography line and, thus, do not respect the accounting identity. These circles, though,
respect the bounded intervals for both variables. The problem in this case is that we no longer
have aggregation consistency, as the disaggregate data predictions β̂_i and ω̂_i,
when aggregated (i.e., by replacing B_i and W_i with them in expression (6)), do not fit the
observed aggregate proportion t_i.
The above considerations lead us to conclude that the proper way of making EI with the
BBH model is by using the predictive distributions, which are precisely the conditional
distributions for the response (disaggregate) variables given the aggregate data:
p(n_Bi | n_Ti) and p(n_Wi | n_Ti). The advantage of using these distributions is that they
always respect the accounting identity, and, consequently, the properties of
aggregation consistency and of allowance for the Duncan-Davis bounds will also
always be satisfied. This statement bears a wider importance, for it is valid not only for the
BBH model under consideration, but also for any statistically based EI model where
ecological inference is treated as a prediction problem (McCue, 2001). [This is a central
difference between the ways KRT and King (1997) use their respective BBH and TBN
models for making EI. Only the latter is used with predictive distributions as the vehicle of
ecological inference, and thus only the latter produces EI predictions which respect the
accounting identity.] All these considerations are also valid for KRT's version of the BBH
model (see Appendix 3).

7.2 Conditional Prediction


Since the predictive distributions are central to a proper use of the BBH model, we now discuss
their construction from the model assumptions. In order to accomplish this, we have to
follow two steps. The first involves the concept of conditional predictive distribution,
which in the present context is represented by:

$$p(n_{Bi} \mid n_{Ti}, \beta_i, \omega_i, h) \qquad (63)$$

From the hierarchical construction of the BBH model, it follows that the disaggregate response
variables N_Bi, N_Wi and N_Ti are distributed independently of the vector of beta
parameters h (see Section 5). This allows us to manipulate expression (63) to produce the
following simplification:
$$p(n_{Bi} \mid n_{Ti}, \beta_i, \omega_i, h) = \frac{p(n_{Bi}, n_{Ti} \mid \beta_i, \omega_i)\, p(\beta_i, \omega_i \mid h)\, p(h)}{p(n_{Ti} \mid \beta_i, \omega_i)\, p(\beta_i, \omega_i \mid h)\, p(h)} = p(n_{Bi} \mid n_{Ti}, \beta_i, \omega_i) \qquad (64)$$

From this result, we are allowed to represent the conditional predictive distribution in (64) as
a non-central hypergeometric distribution (McCullagh and Nelder 1989, pp. 257-259), say:

$$p(n_{Bi} \mid n_{Ti}, \beta_i, \omega_i) = \frac{\displaystyle\binom{n_{Xi}}{n_{Bi}} \binom{n_i - n_{Xi}}{n_{Ti} - n_{Bi}} \left(\frac{\beta_i(1-\omega_i)}{\omega_i(1-\beta_i)}\right)^{n_{Bi}}}{\Delta_0[\beta_i, \omega_i]} \qquad (65)$$

where:

$$\Delta_0[\beta_i, \omega_i] = \sum_{n_{Bi}=n_{Bi}^L}^{n_{Bi}^U} \binom{n_{Xi}}{n_{Bi}} \binom{n_i - n_{Xi}}{n_{Ti} - n_{Bi}} \left(\frac{\beta_i(1-\omega_i)}{\omega_i(1-\beta_i)}\right)^{n_{Bi}} \qquad (66)$$

The mean and variance of this distribution are given, respectively, by:

$$E[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i] = \bar n_{Bi} = \sum_{n_{Bi}=n_{Bi}^L}^{n_{Bi}^U} n_{Bi}\, p(n_{Bi} \mid n_{Ti}, \beta_i, \omega_i) \qquad (67)$$

$$\mathrm{Var}[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i] = \sum_{n_{Bi}=n_{Bi}^L}^{n_{Bi}^U} (n_{Bi} - \bar n_{Bi})^2\, p(n_{Bi} \mid n_{Ti}, \beta_i, \omega_i) \qquad (68)$$

Expression (67) for the mean of N_Bi | n_Ti allows the generation of point predictions for the
disaggregate data. Note in expression (67) that the bounds will necessarily be respected,
because the lower and upper summation limits n_Bi^L and n_Bi^U, which are those of a
hypergeometric distribution, are also, and not coincidentally, the Duncan-Davis
bounds (see Appendix 1). In turn, expression (68) for the variance of N_Bi | n_Ti allows
the generation of uncertainty measures for the prediction. Neither of these expressions can be
represented in closed form as a function of n_Ti, β_i and ω_i, which brings a certain cost to their
computation.
When n_Ti is given, the predictions of the two disaggregate variables N_Bi and N_Wi
become linearly related, in such a way that the mean and the variance of the predictive
distribution p(n_Wi | n_Ti, β_i, ω_i) may be obtained from (67) and (68), as follows:

$$E[N_{Wi} \mid n_{Ti}, \beta_i, \omega_i] = \bar n_{Wi} = n_{Ti} - E[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i] \qquad (69)$$

$$\mathrm{Var}[N_{Wi} \mid n_{Ti}, \beta_i, \omega_i] = \mathrm{Var}[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i] \qquad (70)$$

The conditional predictions for the disaggregate frequencies may in turn be converted into
predictions for proportions, as follows:

$$\hat B_i = \frac{E[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i]}{n_{Xi}} \qquad (71)$$

$$V_{Bi} = \frac{\mathrm{Var}[N_{Bi} \mid n_{Ti}, \beta_i, \omega_i]}{n_{Xi}^2} \qquad (72)$$

$$\hat W_i = \frac{E[N_{Wi} \mid n_{Ti}, \beta_i, \omega_i]}{n_i - n_{Xi}} \qquad (73)$$

$$V_{Wi} = \frac{\mathrm{Var}[N_{Wi} \mid n_{Ti}, \beta_i, \omega_i]}{(n_i - n_{Xi})^2} \qquad (74)$$

Note that the prediction variances in frequencies, given by (68) and (70), are the same.
However, the prediction variances in proportions, given by (72) and (74), differ
multiplicatively by the factor (n_i - n_Xi)^2 / n_Xi^2.
All these results follow from the concept of conditional predictive distribution. Besides
using it to guarantee that the accounting identity is respected by our proposed approach, the
computation of the conditional predictive means in (67) and (69) was essential in the
development of the ECMA, particularly in the implementation of the E-step (see expressions
(38) and (39)).
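A short sketch (ours, with illustrative numbers) of the conversion in (69)-(74) and of the aggregation-consistency check implied by the accounting identity:

```python
# Convert conditional frequency predictions into proportion predictions and check
# that the point predictions aggregate back to the observed proportion t_i.
def to_proportions(nB_hat, var_nB, n_T, n_X, n):
    nW_hat = n_T - nB_hat                                      # (69)
    B_hat, W_hat = nB_hat / n_X, nW_hat / (n - n_X)            # (71), (73)
    V_B, V_W = var_nB / n_X**2, var_nB / (n - n_X)**2          # (72), (74); same numerator by (70)
    return B_hat, W_hat, V_B, V_W

x, t = 0.3, 0.55                                               # x_i = n_X/n and t_i = n_T/n
B_hat, W_hat, _, _ = to_proportions(nB_hat=20.0, var_nB=4.0, n_T=55, n_X=30, n=100)
assert abs(B_hat * x + W_hat * (1 - x) - t) < 1e-12            # accounting identity (6) holds
```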

7.3 Unconditional Prediction


Another concept we shall consider, because of its relevance to bayesian models, is that of the
unconditional predictive distribution, which in the present case is represented by:

$$p(n_{Bi} \mid n_T) = \int_0^1 \int_0^1 p(n_{Bi} \mid n_{Ti}, \beta_i, \omega_i)\, p(\beta_i, \omega_i \mid n_T)\, d\beta_i\, d\omega_i \qquad (75)$$

The posterior (weighting function) for (β_i, ω_i) in (75) is obtained by marginalization of the
aggregate posterior P_A and can be represented as:

$$p(\beta_i, \omega_i \mid n_T) = \int_{A_{///}} P_A(\psi \mid n_T)\, d\psi_{///} \qquad (76)$$

where ψ_/// is a (2P + 2)x1 vector composed of all elements of ψ except β_i and ω_i, and
A_/// ⊂ R^(2P+2) is the subspace of admissible values of ψ_///. The predictive distribution in (75)
is unconditional with respect to the quantities β_i and ω_i because we have averaged (via
the double integration) over their uncertainty. Under a strict bayesian approach, this
distribution should be the one used for inferences regarding the disaggregate data, since it
embodies the uncertainty associated with the parameters (Migon and Gamerman 1999, p.
201). Also, the complete distribution (not only its point and interval characteristics) should
be obtained in order for a full bayesian approach to ecological inference to be undertaken.
However, it is hard to obtain p(n_Bi | n_T) analytically by means of the integration in (75),
for two reasons: First, because p(β_i, ω_i | n_T), which has to be obtained via the complex
multidimensional integration of the aggregate posterior in (76), needs to be recovered in the
first place. Second, because p(n_Bi | n_Ti, β_i, ω_i) is a non-central hypergeometric density. As
seen in expression (65), the functional form of this density is a ratio where both the
numerator and the denominator depend on the variables of integration β_i and ω_i. In their
approach to the BBH model, KRT used MCMC methods to recover completely the
aggregate posterior in (30) and the marginal posteriors for the binomial probabilities p(β_i | n_Ti)
and p(ω_i | n_Ti). Those MCMC methods could as well be used here with this same purpose
and with the additional goal of completely recovering the P unconditional predictive distributions
p(n_Bi | n_Ti) and p(n_Wi | n_Ti). But if we were to use these methods here, we would again
incur the high price of increased computation.
Another important issue is that resorting to alternative, simpler simulation approaches
based on asymptotic normality of posterior functions would not be appropriate here. The
reason is that a regularity condition for the validity of convergence in probability towards the
normal distribution - that the number of parameters or quantities of interest is independent
of the number of observations (Gelman et al, 1995) - is not satisfied here. This
happens because the BBH model uses as many parameters as observations, a usual device
adopted in the formulation of hierarchical models. When the number of observations
increases, the number of quantities of interest increases with it, and in this case the failure of
the standard result of asymptotic normality prevents the use of a multivariate normal
approximation to the aggregate posterior, and the simpler derivations of the predictive
distributions from that approximation.
For these reasons, and in order to keep up with our faster approach, we use
instead the conditional predictive distributions p(n_Bi | n_Ti, β_i, ω_i) and p(n_Wi | n_Ti, β_i, ω_i)
evaluated at the posterior modes β̂_i and ω̂_i, which are obtained via the ECMA (Sub-section
6.2). Nevertheless, our approach is a valid EI method that respects the accounting
identity and thus displays the desirable EI properties considered before.

8 An Example Using a Real Dataset

In this section, we illustrate our alternative approach to the BBH model with an application
to a real dataset. The data we use are the same analyzed by KRT and correspond to voter
registration and racial background of people in 275 counties in the States of Florida,
Louisiana, North Carolina, and South Carolina (this dataset was obtained from Gary King's
web site http://gking.harvard.edu/data). For each county, observations are available for the
total voting age population (n_i), the number of those who are registered (n_Ti), and the
number of people who are black (n_Xi). The target is to infer the unobservable proportions
of registered people in the population of blacks (B_i) and in the population of whites (W_i).
Performance results are presented in table 3, which displays information for the two types
of predictors considered: namely, the mode of the posteriors for the binomial probabilities, labeled
posterior mode, and the mean of the predictive posteriors for the disaggregate proportions,
labeled predictive mean. The ECM algorithm converged after 37 iterations, which took
one minute and 59 seconds on a PC with a 700 MHz Pentium processor. For the
posterior mode predictor, the coverage of prediction errors within ten percentage points of
the true values is about 42.18% for variable B and 84.36% for variable W,
with corresponding root mean squared errors of prediction of about 0.194 and 0.0973,
respectively. These figures are practically the same for the predictive mean, for
which the 10% coverage of prediction errors is 42.91% for variable B and 84.36% for
variable W, with corresponding root mean squared errors of about 0.1898 and 0.0976.
The better performance for variable W reflects the larger amount of information
present in the aggregate data for this variable.
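For reference, the two summary measures can be computed as in the short sketch below (Python; the function name and the band argument are illustrative choices of ours).

    import numpy as np

    def coverage_and_rmse(pred, true, band=0.10):
        """C10: share (%) of counties whose prediction error is within `band`;
        RMSE: root mean squared prediction error."""
        pred, true = np.asarray(pred, float), np.asarray(true, float)
        err = pred - true
        c10 = 100.0 * np.mean(np.abs(err) <= band)
        rmse = float(np.sqrt(np.mean(err ** 2)))
        return c10, rmse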

Table 3 – Performance results from applying the BBH model

    Method of          Iter.   Time      Variable B           Variable W
    inference                  (min:s)   C10       RMSE       C10       RMSE
    Posterior mode      37     1:59      42.18     0.1940     84.36     0.0973
    Predictive mean     37     1:59      42.91     0.1898     84.36     0.0976

Obs.: Iter. = number of iterations; C10 = coverage (%) of errors within a 10 percentage
point band around the true values; RMSE = root mean squared error.

Figs. 2 and 3 display scatter plots of predicted against true values for variables B and W
according to the type of predictor, thus providing information on the behavior of individual
predictions. The horizontal lines within these scatter plots represent the admissible intervals
computed from the Duncan-Davis bounds for each observation. In both figures, the larger
amount of information available for disaggregate inferences on variable W is evident,
because its admissible intervals are in general shorter than those for variable B. As a consequence
of the greater uncertainty in predicting proportions for variable B, the scatter plots in this
case display a large portion of data points centered in the middle of the intervals. In the case
of variable W, the scatter plots display the data points better aligned with the 45° line.
It is interesting to note that the two predictors produce virtually the same pattern of scatter plots.
This means that, for this particular dataset, it is practically indifferent whether one predictor or the
other is used. Almost all predictions from the posterior mode for the binomial probabilities stay within
the admissible intervals, but, since they do not respect the accounting identity, in the majority
of cases they do not display aggregation consistency. In turn, all predictions from the
predictive mean, as they respect the accounting identity, stay within the admissible intervals
and satisfy aggregation consistency.
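A simple numerical check of aggregation consistency is sketched below (Python; the function name, tolerance, and array layout are assumptions of ours): it verifies, county by county, whether a pair of predicted proportions reproduces the observed aggregate proportion.

    import numpy as np

    def aggregation_consistent(B_hat, W_hat, n, n_X, n_T, tol=1e-8):
        """True where the predicted pair satisfies the accounting identity
        B_hat * x + W_hat * (1 - x) = t, with x = n_X / n and t = n_T / n."""
        x, t = n_X / n, n_T / n
        return np.abs(B_hat * x + W_hat * (1 - x) - t) <= tol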
Figure 2. Disaggregate predictions using posterior mode for binomial probabilities.

Figure 3. Disaggregate predictions using predictive means for disaggregate proportions.

Table 4 further compares the two predictors by presenting results for nine selected
counties. The predictions produced by the two predictors differ for all these counties; for
some of them, and depending on the variable, they merely coincide at the
four significant digits displayed in the table. For variable B, the largest absolute difference
between the two predictors occurs for counties 224 and 229, where it equals 0.0400; for
variable W, it occurs for county 218, where it equals 0.0082. Table 4 displays estimated
standard deviations only for the predictive means, because we are unable to produce
estimated standard deviations for the posterior mode without simulation (see Section 7).

Table 4 – Disaggregate data bounds, true values, and predictions for selected counties.

                    Bounds                      True values       Post. mode        Predictive means
    County    B^L     B^U     W^L     W^U       B       W         B       W         B       s_B     W       s_W
      22    .2857  1.0000   .5454  1.0000    .2857  1.0000     .3215   .9910     .3000   .0048   .9909   .0079
      39   1.0000  1.0000  1.0000  1.0000   1.0000  1.0000     .9639  1.0000    1.0000   .0000  1.0000   .0000
      50    .6338  1.0000   .9148  1.0000    .6339  1.0000     .8169   .9574     .8169   .0010   .9574   .0012
      79    .6671  1.0000   .9723  1.0000   1.0000   .9722     .8253   .9862     .8333   .0039   .9861   .0042
      97    .8746  1.0000   .9859  1.0000    .8750  1.0000     .9299   .9930     .9375   .0020   .9930   .0022
     150    .0000  1.0000   .3928   .7500    .4250   .5982     .5071   .5701     .5075   .0056   .5687   .0076
     218    .0000  1.0000   .4375   .9375    .5000   .6875     .4507   .7207     .4500   .0146   .7125   .0219
     224    .0000  1.0000   .7187   .7291    .0000   .7292     .6400   .7232     .6000   .0016   .7229   .0016
     229    .0000  1.0000   .4304   .4430    .0000   .4430     .6400   .4362     .6000   .0019   .4354   .0019

Source: True values were obtained from Gary King's web site (http://gking.harvard.edu/data).

It is interesting to note in table 4 that, for county 39, the posterior mode of about 0.96 lies
outside the admissible interval for variable B, since it is less than the lower bound $B_{39}^L = 1.00$. This provides
an example of the issue posed in this article: predictions made with the posterior
mode for the binomial probabilities may not respect the deterministic bounds (the same
happens with counties 8, 15, 30, 87, 112, and 124). Note also that this does not happen with
predictions made with the predictive mean, neither for county 39 nor for any of the other
counties in table 4.
Among the selected counties of table 4, we included counties 50 and 150, which were
used by KRT to illustrate their MCMC-based approach. For variable B in county 50, our
predictions of nearly 0.82 (the same for both predictors) are significantly higher than the
0.73 value obtained by those authors; however, for variable W in that same county, our
predictions of nearly 0.96 are close to the value of 0.98 obtained by KRT (p. 75). In the case
of county 150, our predictions of .50 and .57 for blacks and whites, respectively, are also
close to the .48 and .58 values obtained for the same variables by KRT (p. 77).

9 Concluding Comments

In this paper, we presented the methodological and computational issues that led us to a
different implementation of KRT's BBH model for EI. On the methodological side, our
approach is developed from a disaggregate DGP version of the BBH model and is based on
aggregate posterior maximization coupled with the inferential use of the predictive posteriors
for the disaggregate proportions. On the computational side, it is based on the choice of
the ECM algorithm to implement posterior maximization. These two aspects were central in
allowing us to produce an alternative and substantially faster EI method based on the BBH
model.
Though we have used the conditional predictive distributions for the disaggregate data in
our approach to the BBH model, a full bayesian approach would be based on the
unconditional predictive distributions, whose analytic forms are still unknown to us. The
problem derives mainly from the difficulty of obtaining the latter distributions for all the 2P
disaggregate proportions in a computationally fast manner. Solutions to obtain them by
simulation exist and could be implemented in the same way as KRT did, by means of
MCMC methods. However, the price of increased computation would still be high in this
respect, while our faster method can produce valid ecological inferences in practice, as
shown in the example.
Our methodological discussion considered relevant aspects regarding the construction of the BBH
model, in particular, and the principles of EI model building, in general. About the
former, we pointed out that KRT's procedure of inference, based on the binomial probabilities
and not on the disaggregate proportions, fails to respect the accounting identity. As a
consequence, and contrary to what is stated by those authors, the EI predictions from the BBH
model bear the risk of not respecting the Duncan-Davis bounds. Unless we force it by
imposing external constraints when implementing the BBH model, a procedure that
would violate its internal consistency, we cannot take for granted that the bounds will be
respected.
About the principles of EI model building, our discussion highlighted the importance of
considering the disaggregate response variables as the relevant quantities of interest to be
modeled, and the need to work with their associated predictive distributions, in order for the
properties of aggregation consistency and predictions within the bounds to be both
satisfied by any statistical EI model. We note that King's TBN model
was developed under such model building principles and thus is an EI method that displays
those desirable properties.
With regard to the particular design of the ECM algorithm we developed for the BBH
model, we note that it has promising prospects for improvements and refinements through
further research. For instance, additional cycles of E- and CM-steps may be explored so as
to produce larger increases of the Q-function at every iteration, which may result in even
faster implementations. Research on these topics may not only improve the computational
efficiency of our method, but also be relevant to efforts to generalize it to extended
versions of the BBH model that incorporate the effects of explanatory variables and cope with
RxC EI problems.

Appendix 1 Bounds for the Disaggregate Frequencies

In this Appendix, we present the deduction of the bounds for the disaggregate frequency
data when a sample $n_T^T = [n_{T_1}, \ldots, n_{T_P}]$ of aggregate data is given. The sequence of steps
begins by determining the maximum admissible value for the variable $N_{B_i}$. It is simple to
note that this value is the smaller of the totals of the column and the row to which the
variable belongs, say:

$$N_{B_i}^U = \max(N_{B_i}) = \min(N_{T_i}, N_{X_i}) \qquad (A1.1)$$

On the other hand, the minimum for the other variable $N_{W_i}$ is given by:

$$N_{W_i}^L = \min(N_{W_i}) = N_{T_i} - \max(N_{B_i}) = N_{T_i} - \min(N_{T_i}, N_{X_i}) = \max(0,\, N_{T_i} - N_{X_i}) \qquad (A1.2)$$

In a similar fashion, the maximum admissible value for the variable $N_{W_i}$ is the smaller of
the totals of the column and the row to which this variable belongs:

$$N_{W_i}^U = \max(N_{W_i}) = \min(N_{T_i}, N_i - N_{X_i}) \qquad (A1.3)$$

and the minimum value for $N_{B_i}$ is:

$$N_{B_i}^L = \min(N_{B_i}) = N_{T_i} - \max(N_{W_i}) = N_{T_i} - \min(N_{T_i}, N_i - N_{X_i}) = \max(0,\, N_{T_i} - (N_i - N_{X_i})) \qquad (A1.4)$$

These results are summarized in table A1.1.



Table A1.1 – Lower and upper bounds for $N_{B_i}$ and $N_{W_i}$

    Variable     Lower bound                                          Upper bound
    $N_{B_i}$    $N_{B_i}^L = \max(0,\, N_{T_i} - (N_i - N_{X_i}))$   $N_{B_i}^U = \min(N_{T_i}, N_{X_i})$
    $N_{W_i}$    $N_{W_i}^L = \max(0,\, N_{T_i} - N_{X_i})$           $N_{W_i}^U = \min(N_{T_i}, N_i - N_{X_i})$

We find it useful to note, once more, that these are the same bounds as those of the hypergeometric
distribution (McCullagh and Nelder, 1989, pp. 257-259).
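For reference, the bounds of table A1.1 can be computed directly from the aggregate data, as in the minimal sketch below (Python; the function name and the example figures are ours).

    import numpy as np

    def frequency_bounds(n, n_X, n_T):
        """Admissible ranges of table A1.1 for the unobserved frequencies N_B and N_W."""
        NB_L = np.maximum(0, n_T - (n - n_X))    # (A1.4)
        NB_U = np.minimum(n_T, n_X)              # (A1.1)
        NW_L = np.maximum(0, n_T - n_X)          # (A1.2)
        NW_U = np.minimum(n_T, n - n_X)          # (A1.3)
        return NB_L, NB_U, NW_L, NW_U

    # Illustrative county with n = 1000, n_X = 300, n_T = 600:
    # N_B lies in [0, 300] and N_W in [300, 600].
    print(frequency_bounds(np.array([1000]), np.array([300]), np.array([600])))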

Appendix 2 Reparameterization

In order to develop a stable computer algorithm to implement the BBH model based on the
disaggregate DGP, we used a suitable reparameterization. We stress that our purpose
here was not to exploit a multivariate normal approximation to the aggregate posterior,
because such a procedure may not be applicable to the present problem, as discussed in
Section 7.2. Instead, we used the reparameterization only to make a transformation from a
proper subset of $\mathbb{R}^{2P+4}$ (the original space for the quantities of interest) onto the whole of
$\mathbb{R}^{2P+4}$, thus allowing the mode search of the ECM algorithm to be carried out over all
floating-point numbers. (The function code we developed to implement the BBH model
includes an option to use the reparameterization or not.)
The original parameters are given by:

$$\Psi^T = [\beta^{bT}, \beta^{wT}, h_b^T, h_w^T] \qquad (A2.1)$$

and the reparameterization we chose may be regarded as a vector function
$z: \mathbb{R}^{2P+4} \to \mathbb{R}^{2P+4}$, so that:

$$\Phi^T = z(\Psi)^T = [\lambda_b^T, \lambda_w^T, \phi_b^T, \phi_w^T] \qquad (A2.2)$$

where $\Phi$ is the new parameter vector and:

$$\lambda_b = \log\frac{\beta^b}{1-\beta^b} \quad (P \times 1) \qquad (A2.3)$$

$$\lambda_w = \log\frac{\beta^w}{1-\beta^w} \quad (P \times 1) \qquad (A2.4)$$

$$\phi_b = \log(h_b) \quad (2 \times 1) \qquad (A2.5)$$

$$\phi_w = \log(h_w) \quad (2 \times 1) \qquad (A2.6)$$

The log and division operations are applied element by element to the original vectors.
The aggregate log-posterior in the notation of the reparameterization will be
represented by us as:

$$l_A^*(\Phi \mid n_T) = l_A(z^{-1}(\Phi) \mid n_T) \qquad (A2.7)$$

The ECM algorithm makes a search in the space of $\Phi$ for a mode of the aggregate log-posterior,
and we describe it in box A2.1.
Once the mode $\Phi_K$ is determined, the following transformation back to the original
parameterization is applied:

$$\beta_K^b = \frac{\exp(\lambda_{b,K})}{1+\exp(\lambda_{b,K})} \quad (P \times 1) \qquad (A2.8)$$

$$\beta_K^w = \frac{\exp(\lambda_{w,K})}{1+\exp(\lambda_{w,K})} \quad (P \times 1) \qquad (A2.9)$$

$$h_{b,K} = \exp(\phi_{b,K}) \quad (2 \times 1) \qquad (A2.10)$$

$$h_{w,K} = \exp(\phi_{w,K}) \quad (2 \times 1) \qquad (A2.11)$$
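A minimal sketch of the two mappings is given below (Python; the vector names follow the notation used above, and the stacking order of the parameter blocks is an assumption of ours).

    import numpy as np

    def z_forward(beta_b, beta_w, h_b, h_w):
        """(A2.3)-(A2.6): logits of the probabilities, logs of the beta hyperparameters."""
        return np.concatenate([np.log(beta_b / (1 - beta_b)),
                               np.log(beta_w / (1 - beta_w)),
                               np.log(h_b), np.log(h_w)])

    def z_backward(phi, P):
        """(A2.8)-(A2.11): recover the original parameters from phi (length 2P + 4)."""
        lam_b, lam_w = phi[:P], phi[P:2 * P]
        beta_b = np.exp(lam_b) / (1 + np.exp(lam_b))
        beta_w = np.exp(lam_w) / (1 + np.exp(lam_w))
        h_b, h_w = np.exp(phi[2 * P:2 * P + 2]), np.exp(phi[2 * P + 2:])
        return beta_b, beta_w, h_b, h_w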

Box A2.1 – ECMA scheme with reparameterization

1. Assume a guess $\Phi_k^T = [\lambda_{b,k}^T, \lambda_{w,k}^T, \phi_{b,k}^T, \phi_{w,k}^T]$;

2. E-step: compute $\bar{n}_B = E[N_B \mid n_T, \Phi_k]$;

3. Compute $\bar{n}_W = n_T - \bar{n}_B$;

4. CM1-step: compute
   $$\lambda_{b,k+1} = \log\frac{\bar{n}_{B,k} + c_{b,k} - 1}{n_X - \bar{n}_{B,k} + d_{b,k} - 1} \qquad
   \lambda_{w,k+1} = \log\frac{\bar{n}_{W,k} + c_{w,k} - 1}{n - n_X - \bar{n}_{W,k} + d_{w,k} - 1};$$

5. CM2-step: $\phi_{b,k+1} = \arg\max_{\phi_b} Q(\phi_b \mid \lambda_{b,k+1}, \lambda_{w,k+1}, \phi_{b,k}, \phi_{w,k})$;

6. CM3-step: $\phi_{w,k+1} = \arg\max_{\phi_w} Q(\phi_w \mid \lambda_{b,k+1}, \lambda_{w,k+1}, \phi_{b,k+1}, \phi_{w,k})$,
   and determine $\Phi_{k+1}^T = [\lambda_{b,k+1}^T, \lambda_{w,k+1}^T, \phi_{b,k+1}^T, \phi_{w,k+1}^T]$;

7. Repeat steps 2–6 until convergence, say, until $\|\Phi_{k+1} - \Phi_k\| < \delta$ and/or
   $|l_A^*(\Phi_{k+1} \mid n_T) - l_A^*(\Phi_k \mid n_T)| < \varepsilon$, with $\delta, \varepsilon \geq 0$.
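To make the scheme of box A2.1 concrete, the sketch below (Python) illustrates one county's E-step expectation, the closed-form CM1 logit update, and a generic hyperparameter update. The hyperparameter step assumes a flat hyperprior on $(c, d)$, so the exact Q-function maximized in the CM2- and CM3-steps may differ from it; all function names are ours.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln, betaln

    def lbinom(n, k):
        return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

    def e_step_one(n_i, nX_i, nT_i, bb, bw):
        """E[N_B | n_T, beta]: mean of the noncentral hypergeometric over its support."""
        k = np.arange(max(0, nT_i - (n_i - nX_i)), min(nT_i, nX_i) + 1)
        logw = (lbinom(nX_i, k) + lbinom(n_i - nX_i, nT_i - k)
                + k * (np.log(bb / (1 - bb)) - np.log(bw / (1 - bw))))
        w = np.exp(logw - logw.max())
        return float(np.sum(k * w) / np.sum(w))

    def cm1_step(nB_bar, nW_bar, n, n_X, c_b, d_b, c_w, d_w):
        """Closed-form logit updates of box A2.1 (assumes c, d >= 1 so logs are defined)."""
        lam_b = np.log((nB_bar + c_b - 1) / (n_X - nB_bar + d_b - 1))
        lam_w = np.log((nW_bar + c_w - 1) / (n - n_X - nW_bar + d_w - 1))
        return lam_b, lam_w

    def cm_hyper_step(beta, phi0):
        """CM2/CM3-type update: refit (c, d) = exp(phi) to the current probabilities by
        maximizing the beta log-density (a flat hyperprior is assumed here)."""
        def neg_q(phi):
            c, d = np.exp(phi)
            return -np.sum((c - 1) * np.log(beta) + (d - 1) * np.log1p(-beta) - betaln(c, d))
        return minimize(neg_q, np.asarray(phi0, float), method="Nelder-Mead").x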

Appendix 3 KRT's Formulation and the Accounting Identity

We follow here steps similar to those in Section 7.1 to show that KRT's version of the BBH
model also does not produce inferences that respect the accounting identity. We again start
by considering the first stage, where a binomial distribution is used to model the aggregate
DGP:

$$N_{T_i} \sim \mathrm{Bin}\big(n_i,\; \beta_i^b x_i + \beta_i^w (1 - x_i)\big) \qquad (A3.1)$$

If we assume a uniform prior defined on (0,1) for each $\beta_i^b$ and $\beta_i^w$, then the aggregate
posterior will be proportional to the aggregate likelihood function. The log version of the
latter is given by:

$$l_A(\beta \mid n_T) = \sum_{i=1}^{P} \log\binom{n_i}{n_{T_i}} + \sum_{i=1}^{P} n_{T_i}\log\big(\beta_i^b x_i + \beta_i^w(1-x_i)\big) + \sum_{i=1}^{P}(n_i - n_{T_i})\log\big(1 - \beta_i^b x_i - \beta_i^w(1-x_i)\big) \qquad (A3.2)$$

By applying first order conditions for a maximum, and making some algebraic manipulation,
we can easily check that the mode(s) of $l_A$ satisfy:

$$\hat\beta_i^w = \frac{t_i}{1-x_i} - \frac{x_i}{1-x_i}\,\hat\beta_i^b \qquad (A3.3)$$

Thus, also in the case of KRT's version, the aggregate posterior modes for the binomial
probabilities in the first stage generate predictions that satisfy the accounting identity.
We now turn to check what happens when we add to (A3.1) the other stages of the
hierarchy of KRT's BBH model. First, the aggregate log-posterior is transformed into:

$$\begin{aligned} lp_A(\beta, h \mid n_T) = C &+ \sum_{i=1}^{P}\log\binom{n_i}{n_{T_i}} + \sum_{i=1}^{P} n_{T_i}\log\big(\beta_i^b x_i + \beta_i^w(1-x_i)\big) \\ &+ \sum_{i=1}^{P}(n_i - n_{T_i})\log\big(1-\beta_i^b x_i - \beta_i^w(1-x_i)\big) \\ &+ \sum_{i=1}^{P}(c_b - 1)\log\beta_i^b + \sum_{i=1}^{P}(d_b - 1)\log(1-\beta_i^b) \\ &+ \sum_{i=1}^{P}(c_w - 1)\log\beta_i^w + \sum_{i=1}^{P}(d_w - 1)\log(1-\beta_i^w) \\ &- P\log B(c_b, d_b) - P\log B(c_w, d_w) \end{aligned} \qquad (A3.4)$$

where C is a constant. By applying, once more, first order conditions for a maximum and
manipulating, we verify that the posterior modes now satisfy:

$$\hat\beta_i^w = \frac{t_i}{1-x_i} - \frac{x_i}{1-x_i}\,\hat\beta_i^b + F_i^*(\hat\beta_i^b, \hat\beta_i^w, x_i, c_b, d_b, c_w, d_w) \qquad (A3.5)$$

According to this expression, for the accounting identity to be valid it is necessary that
$F_i^* = 0$ for any value of its arguments and all $i = 1, \ldots, P$. But this term is written as:

$$F_i^* = A_i^*\,\frac{\big[\hat\beta_i^b x_i + \hat\beta_i^w(1-x_i)\big]\big[1 - \hat\beta_i^b x_i - \hat\beta_i^w(1-x_i)\big]}{n_i(1-x_i)} \qquad (A3.6)$$

where:

$$A_i^* = \frac{c_b - 1}{\hat\beta_i^b} - \frac{d_b - 1}{1-\hat\beta_i^b} + \frac{c_w - 1}{\hat\beta_i^w} - \frac{d_w - 1}{1-\hat\beta_i^w} \qquad (A3.7)$$

In an analogous fashion, the situations in which $F_i^* = 0$ are too restrictive. Provided that $\hat\beta_i^b$,
$\hat\beta_i^w$, and $x_i$ simultaneously belong to the open interval (0,1), this will happen, for instance,
when $c_b = d_b = c_w = d_w = 1$. It is easy to check, though, that $F_i^* \neq 0$ in an infinite
number of situations in which these beta parameters are such that $A_i^* \neq 0$.
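A quick numerical illustration of this point is sketched below (Python; the county figures and hyperparameter values are arbitrary): for a single county, it maximizes the binomial term plus the beta prior terms of (A3.4) over $(\beta_i^b, \beta_i^w)$ and reports the gap $\hat\beta_i^b x_i + \hat\beta_i^w (1 - x_i) - t_i$, which is essentially zero under uniform priors and nonzero otherwise.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_post(b, n_i, x_i, t_i, c_b, d_b, c_w, d_w):
        """Per-county log posterior of (beta_b, beta_w): binomial stage plus beta
        priors with fixed hyperparameters (terms free of beta are dropped)."""
        bb, bw = b
        theta = bb * x_i + bw * (1 - x_i)
        ll = n_i * (t_i * np.log(theta) + (1 - t_i) * np.log(1 - theta))
        lp = ((c_b - 1) * np.log(bb) + (d_b - 1) * np.log(1 - bb)
              + (c_w - 1) * np.log(bw) + (d_w - 1) * np.log(1 - bw))
        return -(ll + lp)

    def identity_gap(c_b, d_b, c_w, d_w, n_i=100, x_i=0.3, t_i=0.6):
        """Gap beta_b * x + beta_w * (1 - x) - t at the mode; zero iff the identity holds."""
        res = minimize(neg_log_post, x0=np.array([0.5, 0.5]),
                       args=(n_i, x_i, t_i, c_b, d_b, c_w, d_w),
                       bounds=[(1e-6, 1 - 1e-6)] * 2, method="L-BFGS-B")
        bb, bw = res.x
        return bb * x_i + bw * (1 - x_i) - t_i

    print(identity_gap(1, 1, 1, 1))   # approximately zero: uniform priors
    print(identity_gap(4, 2, 2, 6))   # nonzero: the F* term of (A3.6) shifts the mode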

We note that KRT (p. 64) state that their approach to the BBH model respects the
bounds, but they do not give details as to how the bounds are incorporated into the marginal
posteriors for the binomial probabilities. As it seems, the authors relied on the validity of the
accounting identity in the first stage, where the binomial model is defined in isolation, to
justify that those posteriors are defined within the bounds. It seems to be KRT's relation (2)
(appearing on p. 65 and reappearing on p. 71), deduced solely from the first stage and
equivalent here to relation (A3.3), that justifies that the bounds are respected. However,
this is not what we verify by means of relation (A3.4), which results from considering the
other stages in the hierarchy.

References
Achen, C.H. and W.P. Shively (1995). Cross-Level Inference. Chicago: University of Chicago Press.
Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. New York: Wiley.
Barndorff-Nielsen, O. (1982). Exponential Families. In Encyclopedia of Statistical Sciences, vol. 2. New York: Wiley, pp. 587–596.
Beckman, R.J. and G.L. Tietjen (1978). Maximum Likelihood Estimation for the Beta Distribution. Journal of Statistical Computation and Simulation, 3, 4, p. 253.
Bonnans, J.F., J.C. Gilbert, C. Lemaréchal and C. Sagastizábal (1996). Méthodes numériques d'optimisation. Projet Promath/Institut National de Recherche en Informatique et Automatique. Manuscript.
Brown, P.J. and C.D. Payne (1986). Aggregate Data, Ecological Regression, and Voting Transitions. Journal of the American Statistical Association, 81, pp. 452–460.
Cleave, N. (1992). Ecological Inference. PhD Dissertation, University of Liverpool.
Cleave, N., P.J. Brown and C.D. Payne (1995). Evaluation of Methods for Ecological Inference. Journal of the Royal Statistical Society A, 158, 1, pp. 55–72.
Dempster, A.P., N.M. Laird and D. Rubin (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion). Journal of the Royal Statistical Society B, 39, pp. 1–38.
Duncan, O.D. and B. Davis (1953). An Alternative to Ecological Correlation. American Sociological Review, 18, pp. 665–666.
Forcina, A. and G.M. Marchetti (1989). Modelling Transition Probabilities in the Analysis of Aggregate Data. In Decarli, A., B.J. Francis, R. Gilchrist, and G.U.H. Seber, editors, Statistical Modelling (Proceedings, Trento, 1989), pp. 157–164. Lecture Notes in Statistics 57. Berlin: Springer-Verlag.
Gelman, A.B., J.S. Carlin, H.S. Stern and D.B. Rubin (1995). Bayesian Data Analysis. New York: Chapman & Hall/CRC.
Goodman, L. (1953). Ecological Regression and the Behavior of Individuals. American Sociological Review, 18, pp. 663–664.
Goodman, L. (1959). Some Alternatives to Ecological Correlation. American Journal of Sociology, 64, pp. 610–625.
Hawkes, A.G. (1969). An Approach to the Analysis of Electoral Swing. Journal of the Royal Statistical Society A, 132, pp. 68–79.
Johnson, N.L. and S. Kotz (1969). Distributions in Statistics: Discrete Distributions. New York: Wiley.
King, Gary (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton: Princeton University Press.
King, Gary, O. Rosen and M.A. Tanner (1999). Binomial-Beta Hierarchical Models for Ecological Inference. Sociological Methods and Research, 28, 1, pp. 61–90.
Lee, T.C., G.G. Judge and A. Zellner (1970). Estimating the Parameters of the Markov Probability Model from Aggregate Time Series Data. Contributions to Economic Analysis 65. Amsterdam: North-Holland.
McCue, K.F. (2001). The Statistical Foundations of the EI Method. The American Statistician, 55, 2, pp. 106–110.
McCullagh, P. and J.A. Nelder (1989). Generalized Linear Models. 2nd Edition. Monographs on Statistics and Applied Probability 38. London: Chapman & Hall.
McLachlan, G.J. and T. Krishnan (1997). The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. New York: Wiley.
McRae, E.C. (1977). Estimation of Time-Varying Markov Processes with Aggregate Data. Econometrica, 45, pp. 183–198.
Meng, X.-L. and D.B. Rubin (1993). Maximum Likelihood Estimation via the ECM Algorithm: a General Framework. Biometrika, 80, pp. 267–278.
Meng, X.-L. and D. van Dyk (1997). The EM Algorithm: an Old Folk-Song Sung to a Fast New Tune. Journal of the Royal Statistical Society B, 59, 3, pp. 511–567.
Migon, H.S. and D. Gamerman (1999). Statistical Inference: an Integrated Approach. London: Arnold.
Mosimann, J.E. (1962). On the Compound Multinomial Distribution, the Multivariate Beta Distribution, and Correlations Among Proportions. Biometrika, 49, pp. 65–82.
Rosen, O., W. Jiang, G. King and M.A. Tanner (2000). Bayesian and Frequentist Inference for Ecological Inference: the RxC Case. Forthcoming in Statistica Neerlandica.
Schafer, J.L. (1999). Analysis of Incomplete Multivariate Data. Monographs on Statistics and Applied Probability 72. Reprinted 1997 Edition. Boca Raton: Chapman & Hall/CRC.
Schoenberg, R. (1996). Constrained Maximum Likelihood. Manuscript.
Tanner, M.A. (1996). Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. 3rd Edition. New York: Springer.
