
Vol. 7/2, October 2007

Tristen Hayfield and Jeffrey S. Racine
McMaster University
Hamilton, Ontario, Canada
hayfietj@mcmaster.ca
racinej@mcmaster.ca

eiPack: R × C Ecological Inference and Higher-Dimension Data Management

by Olivia Lau, Ryan T. Moore, and Michael Kellermann

Introduction

Ecological inference (EI) models allow researchers to infer individual-level behavior from aggregate data when individual-level data is unavailable. Table 1 shows a typical unit of ecological analysis: a contingency table with observed row and column marginals and unobserved interior cells.

          col1       col2       ...   colC
row1      N_{11i}    N_{12i}    ...   N_{1Ci}    N_{1·i}
row2      N_{21i}    N_{22i}    ...   N_{2Ci}    N_{2·i}
...       ...        ...        ...   ...        ...
rowR      N_{R1i}    N_{R2i}    ...   N_{RCi}    N_{R·i}
          N_{·1i}    N_{·2i}    ...   N_{·Ci}    N_i

Table 1: A typical R × C unit in ecological inference; the interior cell counts (printed in red in the original) are typically unobserved.

In ecological inference, challenges arise because information is lost when aggregating across individuals, a problem that cannot be solved by collecting more aggregate-level data. Thus, EI models are unusually sensitive to modeling assumptions. Testing these assumptions is difficult without access to individual-level data, and recent years have witnessed a lively discussion of the relative merits of various models (Wakefield, 2004).

Nevertheless, there are many applied problems in which ecological inferences are necessary, either because individual-level data is unavailable or because the aggregate-level data is considered more authoritative. The latter is true in the voting rights context in the United States, where federal courts often base decisions on evidence derived from one or more EI models (Cho and Yoon, 2001). While packages such as MCMCpack (Martin and Quinn, 2006) and eco (Imai and Lu, 2005) provide tools for 2 × 2 inference, this is insufficient in many applications. In eiPack,

R News ISSN 1609-3631



we implement three existing methods for the general case in which the ecological units are R × C tables.

Methods and Data in eiPack

The methods currently implemented in eiPack are the method of bounds (Duncan and Davis, 1953), ecological regression (Goodman, 1953), and the Multinomial-Dirichlet model (Rosen et al., 2001). The functions that implement these models share several attributes. The ecological tables are defined using a common formula of the form cbind(col1, ..., colC) ~ cbind(row1, ..., rowR). The row and column marginals can be expressed as either proportions or counts. Auxiliary functions renormalize the results for some subset of columns taken from the original ecological table, and appropriate print, summary, and plot functions conveniently summarize the model output.

In the following section, we demonstrate the features of eiPack using the (included) senc dataset, which contains individual-level party affiliation data for Black, White, and Native American voters in 8 counties in southeastern North Carolina. These counties include 212 precincts, which form the ecological units in this dataset. Because the data are observed at the individual level, the interior cell counts are known, allowing us to benchmark the estimates generated by each method.

Method of Bounds

The method of bounds (Duncan and Davis, 1953) uses the observed row and column marginals to calculate upper and lower bounds for functions of the interior cells of each ecological unit. The method of bounds is not a statistical procedure in the traditional sense; the bounds implied by the row and column marginals are deterministic and there is no probabilistic model for the data-generating process.

As implemented in eiPack, the method of bounds allows the user to calculate, for a specified column $k \in \mathbf{k} = \{1, \ldots, C\}$, the deterministic bounds on the proportion of individuals in each row who belong in that column. For each unit being considered, let $j$ be the row of interest, $k$ index columns, $k^*$ be the column of interest, $k'$ be the set of other columns considered, and $\tilde{k}$ be the set of columns excluded. For example, if we want the bounds on the proportion of Native American two-party registrants who are Democrats, $j$ is Native American, $k^*$ is Democrat, $k'$ is Republican, and $\tilde{k}$ is No Party. The unit-level quantity of interest is

$$\frac{N_{jk^*i}}{N_{jk^*i} + \sum_{k \in k'} N_{jki}}$$

The lower and upper bounds on this quantity given by the observed marginals are, respectively,

$$\frac{\max\left(0,\; N_{j\cdot i} - \sum_{k \neq k^*} N_{\cdot ki}\right)}{\max\left(0,\; N_{j\cdot i} - \sum_{k \neq k^*} N_{\cdot ki}\right) + \min\left(N_{j\cdot i},\; \sum_{k \in k'} N_{\cdot ki}\right)}$$

and

$$\frac{\min\left(N_{j\cdot i},\; N_{\cdot k^*i}\right)}{\min\left(N_{j\cdot i},\; N_{\cdot k^*i}\right) + \max\left(0,\; N_{j\cdot i} - N_{\cdot k^*i} - \sum_{k \in \tilde{k}} N_{\cdot ki}\right)}$$

The intervals generated by the method of bounds can be analyzed in a variety of ways. Grofman (2000) suggests calculating the intersection of the unit-level bounds. If this intersection (calculated by eiPack) is non-empty, it represents the range of values that are consistent with the observed marginals in each of the ecological units.

Researchers and practitioners may also choose to restrict their attention to units in which one group dominates, since the bounds will typically be more informative in those units. eiPack allows users to set row thresholds to conduct this extreme case analysis (known as homogeneous precinct analysis in the voting context). For example, suppose the user is interested in the proportion of two-party White registrants registered as Democrats in precincts that are at least 90% White. eiPack calculates the desired bounds:

> out <- bounds(cbind(dem, rep, non) ~ cbind(black,
+   white, natam), data = senc, rows = "white",
+   column = "dem", excluded = "non",
+   threshold = 0.9, total = NULL)

These calculated bounds can then be represented graphically. Segments cover the range of possible values (the true value for each precinct is the red dot, not included in the standard bounds plot). In this example, the intersection of the precinct-level bounds is empty.

> plot(out, row = "white", column = "dem")
# add true values to plot
> idx <- as.numeric(rownames(out$bounds$white.dem))
> truth <- senc$whdem[idx]/(senc$white[idx]
+   - senc$non[idx])
> points((1:length(idx)) / (length(idx) + 1), truth)


[Figure 1: A plot of deterministic bounds. x-axis: Precincts at least 90% White; y-axis: Proportion Democratic.]

Ecological Regression

In ecological regression (Goodman, 1953), observed row and column marginals are expressed as proportions and each column is regressed separately on the row proportions, thus performing C regressions. Regression coefficients then estimate the population internal cell proportions. For a given unit $i$, define

• $X_{ri}$, the proportion of individuals in row $r$,
• $T_{ci}$, the proportion of individuals in column $c$, and
• $\beta_{rci}$, the proportion of row $r$ individuals in column $c$.

The following identities hold:

$$T_{ci} = \sum_{r=1}^{R} \beta_{rci} X_{ri} \quad \text{and} \quad \sum_{c=1}^{C} \beta_{rci} = 1$$

Defining the population cell fractions $\beta_{rc}$ such that $\sum_{c=1}^{C} \beta_{rc} = 1$ for every $r$, ecological regression assumes that $\beta_{rc} = \beta_{rci}$ for all $i$, and estimates the regression equations $T_{ci} = \sum_{r=1}^{R} \beta_{rc} X_{ri} + \varepsilon_{ci}$. Under the standard linear regression assumptions, including $E[\varepsilon_{ci}] = 0$ and $\mathrm{Var}[\varepsilon_{ci}] = \sigma_c^2$ for all $i$, these regressions recover the population parameters $\beta_{rc}$. eiPack implements frequentist and Bayesian regression models (via ei.reg and ei.reg.bayes, respectively).

In the Bayesian implementation, we offer two options for the prior on $\beta_{rc}$. As a default, truncate = FALSE uses an uninformative flat prior that provides point estimates approaching the frequentist estimates (even when those estimates are outside the feasible range) as the number of draws $m \to \infty$. In cases where the cell estimates are near the boundaries, choosing truncate = TRUE imposes a uniform prior over the unit hypercube such that all cell fractions are restricted to the range [0, 1].

Output from ecological regression can be summarized numerically just as in lm, or graphically using density plots. We also include functions to calculate estimates and standard errors of shares of a subset of columns in order to address questions such as, "What is the Democratic share of 2-party registration for each group?" For the Bayesian model, densities represent functions of the posterior draws of the $\beta_{rc}$; for the frequentist model, densities reflect functions of regression point estimates and standard errors calculated using the $\delta$-method.

> out.reg <- ei.reg(cbind(dem, rep, non)
+   ~ cbind(black, white, natam), data = senc)
> lreg <- lambda.reg(out.reg,
+   columns = c("dem", "rep"))
> density.plot(lreg)

[Figure 2: Density plots of ecological regression output. Panels: Proportion Democratic and Proportion Republican; y-axis: Density.]

Multinomial-Dirichlet (MD) model

In the Multinomial-Dirichlet model proposed by Rosen et al. (2001), the data is expressed as counts and a hierarchical Bayesian model is fit using a Metropolis-within-Gibbs algorithm implemented in C. Level 1 models the observed column marginals as multinomial (and independent across units); the choice of the multinomial corresponds to sampling with replacement from the population. Level 2 models the unobserved row cell fractions as Dirichlet (and independent across rows and units); Level 3 models the Dirichlet parameters as i.i.d. Gamma. More formally, without a covariate, the model is


$$(N_{\cdot 1i}, \ldots, N_{\cdot Ci}) \sim \text{Multinomial}\left(N_i,\; \sum_{r=1}^{R} \beta_{r1i} X_{ri},\; \ldots,\; \sum_{r=1}^{R} \beta_{rCi} X_{ri}\right)$$

$$(\beta_{r1i}, \ldots, \beta_{rCi}) \sim \text{Dirichlet}(\alpha_{r1}, \ldots, \alpha_{rC})$$

$$\alpha_{rc} \overset{\text{i.i.d.}}{\sim} \text{Gamma}(\lambda_1, \lambda_2)$$

With a unit-level covariate $Z_i$ in the second level, the model becomes

$$(N_{\cdot 1i}, \ldots, N_{\cdot Ci}) \sim \text{Multinomial}\left(N_i,\; \sum_{r=1}^{R} \beta_{r1i} X_{ri},\; \ldots,\; \sum_{r=1}^{R} \beta_{rCi} X_{ri}\right)$$

$$(\beta_{r1i}, \ldots, \beta_{rCi}) \sim \text{Dirichlet}\left(d_r e^{(\gamma_{r1} + \delta_{r1} Z_i)},\; \ldots,\; d_r e^{(\gamma_{r(C-1)} + \delta_{r(C-1)} Z_i)},\; d_r\right)$$

$$d_r \overset{\text{i.i.d.}}{\sim} \text{Gamma}(\lambda_1, \lambda_2)$$

In the model with a covariate, users have two options for the priors on $\gamma_{rc}$ and $\delta_{rc}$. They may assume an improper uniform prior, as was suggested by Rosen et al. (2001), or they may specify normal priors for each $\gamma_{rc}$ and $\delta_{rc}$ as follows:

$$\gamma_{rc} \sim N(\mu_{\gamma_{rc}}, \sigma^2_{\gamma_{rc}})$$

$$\delta_{rc} \sim N(\mu_{\delta_{rc}}, \sigma^2_{\delta_{rc}})$$

As Wakefield (2004) notes, the weak identification that characterizes hierarchical models in the EI context is likely to make the results sensitive to the choice of prior. Users should experiment with different assumptions about the prior distribution of the upper-level parameters in order to gauge the robustness of their inferences.

The parameterization of the prior on each $(\beta_{r1i}, \ldots, \beta_{rCi})$ implies that the following log-odds ratio of expected fractions is linear with respect to the covariate $Z_i$:

$$\log\left(\frac{E(\beta_{rci})}{E(\beta_{rCi})}\right) = \gamma_{rc} + \delta_{rc} Z_i$$

Conducting an analysis using the MD model requires two steps. First, tuneMD calibrates the tuning parameters used for Metropolis-Hastings sampling:

> tune.nocov <- tuneMD(cbind(dem, rep, non)
+   ~ cbind(black, white, natam), data = senc,
+   ntunes = 10, totaldraws = 100000)

Second, ei.MD.bayes fits the model by calling C code to generate MCMC draws:

> out.nocov <- ei.MD.bayes(cbind(dem, rep, non)
+   ~ cbind(black, white, natam),
+   covariate = NULL, data = senc,
+   tune.list = tune.nocov)

The output of this function can be returned as mcmc objects or arrays; in the former case, the standard diagnostic tools in coda (Plummer et al., 2006) can be applied directly. The MD implementation includes lambda and density.plot functions, usage for which is analogous to ecological regression:

> lmd <- lambda.MD(out.nocov,
+   columns = c("dem", "rep"))
> density.plot(lmd)

If the precinct-level parameters are returned or saved, cover.plot plots the central credible intervals for each precinct. The segments represent the 95% central credible intervals and their medians for each unit (the true value for each precinct is the red dot, not included in the standard cover.plot).

> cover.plot(out.nocov, row = "white",
+   column = "dem")
# add true values to plot
> points(senc$white/senc$total,
+   senc$whdem/senc$white)

[Figure 3: Coverage plot for MD model output. x-axis: Proportion White in precinct; y-axis: Proportion of White Democrats.]

Data Management

In the MD model, reasonable-sized problems produce unreasonable amounts of data. For example, a model for voting in Ohio includes 11000 precincts, 3 racial groups, and 4 parties. Implementing 1000 iterations yields about 130 million parameter draws (11000 × 3 × 4 × 1000). These draws occupy about 1 GB of RAM, and this is almost certainly not enough iterations. We provide a few options to users in order to make this model tractable for large EI problems.

The unit-level parameters present the most significant data management problem. Rather than storing unit-level parameters in the workspace, users can save each chain as a .tar.gz file on


disk using the option ei.MD.bayes(..., ret.beta = "s"), or discard the unit-level draws entirely using ei.MD.bayes(..., ret.beta = "d"). To reconstruct the chains, users can select the row marginals, column marginals, and units of interest, without reconstructing the entire matrix of unit-level draws:

> read.betas(rows = c("black", "white"),
+   columns = "dem", units = 1:150,
+   dir = getwd())

If users are interested in some function of the unit-level parameters, the implementation of the MD model allows them to define a function in R that will be called from within the C sampling algorithm, in which case the unit-level parameters need not be saved for post-processing.

Acknowledgments

eiPack was developed with the support of the Institute for Quantitative Social Science at Harvard University. Thanks to John Fox, Gary King, Kevin Quinn, D. James Greiner, and an anonymous referee for suggestions and Matt Cox and Bob Kinney for technical advice. For further information, see http://www.olivialau.org/software.

Bibliography

W. T. Cho and A. H. Yoon. Strange bedfellows: Politics, courts and statistics: Statistical expert testimony in voting rights cases. Cornell Journal of Law and Public Policy, 10:237-264, 2001.

O. D. Duncan and B. Davis. An alternative to ecological correlation. American Sociological Review, 18:665-666, 1953.

L. Goodman. Ecological regressions and the behavior of individuals. American Sociological Review, 18:663-664, 1953.

B. Grofman. A primer on racial bloc voting analysis. In N. Persily, editor, The Real Y2K Problem: Census 2000 Data and Redistricting Technology. Brennan Center for Justice, New York, 2000.

K. Imai and Y. Lu. eco: R Package for Fitting Bayesian Models of Ecological Inference in 2 × 2 Tables, 2005. URL http://imai.princeton.edu/research/eco.html.

A. D. Martin and K. M. Quinn. Applied Bayesian inference in R using MCMCpack. R News, 6:2-7, 2006.

M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Convergence diagnostics and output analysis for MCMC. R News, 6:7-11, 2006.

O. Rosen, W. Jiang, G. King, and M. A. Tanner. Bayesian and frequentist inference for ecological inference: The R × C case. Statistica Neerlandica, 55(2):134-156, 2001.

J. Wakefield. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society, 167:385-445, 2004.

Olivia Lau, Ryan T. Moore, Michael Kellermann
Institute for Quantitative Social Science
Harvard University, Cambridge, MA
olivia.lau@post.harvard.edu
ryantmoore@post.harvard.edu
kellerm@fas.harvard.edu

The ade4 Package II: Two-table and K-table Methods

by Stéphane Dray, Anne B. Dufour and Daniel Chessel

Introduction

The ade4 package proposes a great variety of explanatory methods to analyse multivariate datasets. As suggested by the acronym ade4 (Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods), the package is devoted to ecologists but it could be useful in many other fields (e.g., Goecke, 2005). Methods available in the package are particular cases of the duality diagram (Escoufier, 1987; Holmes, 2006; Dray and Dufour, 2007) and the implementation of the functions follows the description of this unifying mathematical tool (class dudi). The main functions of the package for one-table analysis methods have been presented in Chessel et al. (2004). This new paper presents a short summary of two-table and K-table methods available in the package.

Ecological illustration

In order to illustrate the methods, we used the dataset jv73 (Verneaux, 1973) which is available in the package. This dataset concerns 12 rivers. For
