
Identification and Estimation of Marginal Effects in Nonlinear Panel Models¹

Victor Chernozhukov (MIT)    Iván Fernández-Val (BU)    Jinyong Hahn (UCLA)    Whitney Newey (MIT)

February 4, 2009

¹ First version of May 2007. We thank J. Angrist, B. Graham, and seminar participants of Brown University, CEMFI, the CEMMAP Microeconometrics: Measurement Matters Conference, the CEMMAP Inference in Partially Identified Models with Applications Conference, the CIREQ Inference with Incomplete Models Conference, Georgetown, Harvard/MIT, MIT, UC Berkeley, USC, the 2007 WISE Panel Data Conference, and the 2009 Winter Econometric Society Meetings for helpful comments. Chernozhukov, Fernández-Val, and Newey gratefully acknowledge research support from the NSF.
Abstract

This paper gives identification and estimation results for marginal effects in nonlinear panel models. We find that linear fixed effects estimators are not consistent, due in part to marginal effects not being identified. We derive bounds for marginal effects and show that they can tighten rapidly as the number of time series observations grows. We also show in numerical calculations that the bounds may be very tight for small numbers of observations, suggesting they may be useful in practice. We propose two novel inference methods for parameters defined as solutions to linear and nonlinear programs, such as marginal effects in multinomial choice models. We show that these methods produce uniformly valid confidence regions in large samples. We give an empirical illustration.
1 Introduction

Marginal effects are commonly used in practice to quantify the effect of variables on an outcome of interest. They are known as average treatment effects, average partial effects, and average structural functions in different contexts (e.g., see Wooldridge, 2002, and Blundell and Powell, 2003). In panel data, marginal effects average over unobserved individual heterogeneity. Chamberlain (1984) gave important results on identification of marginal effects in nonlinear panel data using a control variable. Our paper gives identification and estimation results for marginal effects in panel data under time stationarity and discrete regressors.
It is sometimes thought that marginal effects can be estimated using linear fixed effects, as shown by Hahn (2001) in an example and by Wooldridge (2005) under strong independence conditions. It turns out that the situation is more complicated. The marginal effect may not be identified. Furthermore, with a binary regressor, the linear fixed effects estimator uses the wrong weighting in estimation when the number of time periods T exceeds three. We show that correct weighting can be obtained by averaging individual regression coefficients, extending a result of Chamberlain (1982). We also derive nonparametric bounds for the marginal effect when it is not identified and when regressors are either exogenous or predetermined conditional on individual effects.

The nonparametric bounds are quite simple to compute and to use for inference, but they can be quite wide when T is small. We also consider bounds in semiparametric multinomial choice models where the form of the conditional probability given regressors and individual effects is specified. We find that the semiparametric bounds can be quite tight in binary choice models with additive heterogeneity.
We also give theorems showing that the bounds can tighten quickly as T grows. We find that the nonparametric bounds tighten exponentially fast when conditional probabilities of certain regressor values are bounded away from zero. We also find that in a semiparametric logit model the bounds tighten nearly that fast without any restriction on the distribution of regressors. These results suggest how the bounds can be used in practice. For large T the nonparametric bounds may provide useful information. For small T, bounds in semiparametric models may be quite tight. Also, the tightness of semiparametric bounds for small T makes it feasible to compute them for different small time intervals and combine the results to improve efficiency. To illustrate their usefulness we provide an empirical illustration based on Chamberlain's (1984) labor force participation example.
We also develop estimation and inference methods for semiparametric multinomial choice models. The inferential problem is rather challenging. Indeed, the programs that characterize the population bounds on model parameters and marginal effects are very difficult to use for inference, since the data-dependent constraints are often infeasible in finite samples or under misspecification, which produces empty set estimates and confidence regions. We overcome these difficulties by projecting the data-dependent constraints onto the model space, thus producing an always feasible data-dependent constraint set. We then propose linear and nonlinear programming methods that use these new modified constraints. Our inference procedures have the appealing justification of targeting the true model under correct specification and targeting the best approximating model under incorrect specification. We develop two novel inferential procedures, one called modified projection and the other perturbed bootstrap, that produce uniformly valid inference in large samples. These methods may be of substantial independent interest.
This paper builds on Honore and Tamer (2006) and Chernozhukov, Hahn, and Newey (2004). These papers derived bounds for slope coefficients in autoregressive and static models, respectively. Here we instead focus on marginal effects and give results on the rate of convergence of the bounds as T grows. Moreover, the identification results in Honore and Tamer (2006) and Chernozhukov, Hahn, and Newey (2004) characterize the bounds via linear and nonlinear programs, and thus, for the reasons stated above, they cannot be immediately used for practical estimation and inference. We propose new methods for estimation and inference, which are practical and which can be of interest in other problems, and we illustrate them with an empirical application.
Browning and Carro (2007) give results on marginal effects in autoregressive panel models. They find that more than additive heterogeneity is needed to describe some interesting applications. They also find that marginal effects are not generally identified in dynamic models. Chamberlain (1982) gives conditions for consistent estimation of marginal effects in linear correlated random coefficient models. Graham and Powell (2008) extend the analysis of Chamberlain (1982) by relaxing some of the regularity conditions in models with continuous regressors.

In semiparametric binary choice models, Hahn and Newey (2004) gave theoretical and simulation results showing that fixed effects estimators of marginal effects in nonlinear models may have little bias, as suggested by Wooldridge (2002). Fernandez-Val (2008) found that averaging fixed effects estimates of individual marginal effects has a bias that shrinks faster as T grows than does the bias of the slope coefficients. We show that, with small T, nonlinear fixed effects consistently estimates an identified component of the marginal effects. We also give numerical results showing that the bias of fixed effects estimators of the marginal effect is very small in a range of examples.
The bounds approach we take is different from the bias correction methods of Hahn and Kuersteiner (2002), Alvarez and Arellano (2003), Woutersen (2002), Hahn and Newey (2004), Hahn and Kuersteiner (2007), and Fernandez-Val (2008). The bias corrections are based on large T approximations. The bounds approach instead takes explicit account of possible nonidentification for fixed T. The inference accuracy of bias corrections depends on T being the right size relative to the number of cross-section observations n, while inference for the bounds does not.

In Section 2 we set out a general nonparametric conditional mean model with correlated unobserved individual effects and analyze the properties of linear estimators. Section 3 gives bounds for marginal effects in these models and results on the rate of convergence of these bounds as T grows. Section 4 extends the bound analysis to predetermined regressors. Section 5 considers a semiparametric multinomial choice model, where tighter bounds are available. Section 6 gives results and numerical examples on the calculation of population bounds. Section 7 discusses estimation and Section 8 inference. Section 9 gives an empirical example.
2 A Conditional Mean Model and Linear Estimators

The data consist of $n$ observations of time series $Y_i = (Y_{i1},\dots,Y_{iT})'$ and $X_i = [X_{i1},\dots,X_{iT}]'$, for a dependent variable $Y_{it}$ and a vector of regressors $X_{it}$. We assume throughout that $(Y_i, X_i)$, $(i = 1,\dots,n)$, are independent and identically distributed observations. A case we consider in some depth is binary choice panel data, where $Y_{it} \in \{0,1\}$. For simplicity we also give some results for binary $X_{it}$, where $X_{it} \in \{0,1\}$.
A general model we consider is a nonseparable conditional mean model as in Wooldridge (2005). Here there is an unobserved individual effect $\alpha_i$ and a function $m(x,\alpha)$ such that

$$E[Y_{it} \mid X_i, \alpha_i] = m(X_{it}, \alpha_i), \quad (t = 1,\dots,T). \qquad (1)$$

The individual effect $\alpha_i$ may be a vector of any dimension. For example, $\alpha_i$ could include individual slope coefficients in a binary choice model, where $Y_{it} \in \{0,1\}$, $F(\cdot)$ is a CDF, and

$$\Pr(Y_{it} = 1 \mid X_i, \alpha_i) = E[Y_{it} \mid X_i, \alpha_i] = F(X_{it}'\alpha_{i2} + \alpha_{i1}).$$

Such models have been considered by Browning and Carro (2007) in a dynamic setting. More familiar models with scalar $\alpha_i$ are also included. For example, the binary choice model with an individual location effect has

$$\Pr(Y_{it} = 1 \mid X_i, \alpha_i) = E[Y_{it} \mid X_i, \alpha_i] = F(X_{it}'\beta + \alpha_i).$$

This model has been studied by Chamberlain (1980, 1984, 1992), Hahn and Newey (2004), and others. The familiar linear model $E[Y_{it} \mid X_i, \alpha_i] = X_{it}'\beta + \alpha_i$ is also included as a special case of equation (1).

For binary $X_{it} \in \{0,1\}$ the model of equation (1) reduces to the correlated random coefficients model of Chamberlain (1982). For other $X_{it}$ with finite support that does not vary with $t$ it is a multiple regression version of that model.
The two critical assumptions made in equation (1) are that $X_i$ is strictly exogenous conditional on $\alpha_i$ and that $m(x,\alpha)$ does not vary with time. We consider identification without the strict exogeneity assumption below. Without time stationarity, identification becomes more difficult.
Our primary object of interest is the marginal effect given by

$$\mu_0 = \frac{\int [m(\bar{x}, \alpha) - m(\tilde{x}, \alpha)]\, Q^*(d\alpha)}{D},$$

where $\bar{x}$ and $\tilde{x}$ are two possible values for the $X_{it}$ vector, $Q^*$ denotes the marginal distribution of $\alpha$, and $D$ is the distance, or number of units, corresponding to $\bar{x} - \tilde{x}$. This object gives the average, over the marginal distribution of $\alpha$, of the per unit effect of changing $x$ from $\tilde{x}$ to $\bar{x}$. It is the average treatment effect in the treatment effects literature. For example, suppose $\bar{x} = (\bar{x}_1, \bar{x}_2')'$, where $\bar{x}_1$ is a scalar, and $\tilde{x} = (\tilde{x}_1, \bar{x}_2')'$. Then $D = \bar{x}_1 - \tilde{x}_1$ would be an appropriate distance measure and

$$\mu_0 = \frac{\int [m(\bar{x}_1, \bar{x}_2, \alpha) - m(\tilde{x}_1, \bar{x}_2, \alpha)]\, Q^*(d\alpha)}{\bar{x}_1 - \tilde{x}_1}$$

would be the per unit effect of changing the first component of $X_{it}$. Here one could also consider averages of the marginal effects over different values of $\bar{x}_2$.
For example, consider an individual location effect for binary $Y_{it}$, where $m(x,\alpha) = F(x'\beta + \alpha)$. Here the marginal effect will be

$$\mu_0 = D^{-1} \int [F(\bar{x}'\beta + \alpha) - F(\tilde{x}'\beta + \alpha)]\, Q^*(d\alpha).$$

The restrictions this binary choice model places on the conditional distribution of $Y_{it}$ given $X_i$ and $\alpha_i$ will be useful for bounding marginal effects, as discussed further below.
In this paper we focus on the discrete case where the support of $X_i$ is a finite set. Thus, the events $X_{it} = \bar{x}$ and $X_{it} = \tilde{x}$ have positive probability and no smoothing is required. It would also be interesting to consider continuous $X_{it}$.
Linear fixed effects estimators are used in applied research to estimate marginal effects. For example, the linear probability model with fixed effects has been applied when $Y_{it}$ is binary. Unfortunately, this estimator is not generally consistent for the marginal effect. There are two reasons for this. The first is that the marginal effect is generally not identified, as shown by Chamberlain (1982) for binary $X_{it}$. Second, the fixed effects estimator uses incorrect weighting.

To explain, we compare the limit of the usual linear fixed effects estimator with the marginal effect $\mu_0$. Suppose that $X_i$ has finite support $\{X_1,\dots,X_K\}$ and let $Q_k^*(\alpha)$ denote the CDF of the distribution of $\alpha$ conditional on $X_i = X_k$. Define

$$\mu_k = \int [m(\bar{x},\alpha) - m(\tilde{x},\alpha)]\, Q_k^*(d\alpha)/D, \qquad \mathcal{P}_k = \Pr(X_i = X_k).$$

This $\mu_k$ is the marginal effect conditional on the entire time series $X_i = [X_{i1},\dots,X_{iT}]'$ being equal to $X_k$. By iterated expectations,

$$\mu_0 = \sum_{k=1}^{K} \mathcal{P}_k \mu_k. \qquad (2)$$

We will compare this formula with the limit of linear fixed effects estimators.
An implication of the conditional mean model that is crucial for identification is

$$E[Y_{it} \mid X_i = X_k] = \int m(X_{kt}, \alpha)\, Q_k^*(d\alpha). \qquad (3)$$

This equation allows us to identify some of the $\mu_k$ from differences across time periods of identified conditional expectations.
To simplify the analysis of the linear fixed effects estimator we focus on binary $X_{it} \in \{0,1\}$. Consider the estimator $\hat{\beta}_w$ from least squares on

$$Y_{it} = \beta X_{it} + \alpha_i + v_{it}, \quad (t = 1,\dots,T;\ i = 1,\dots,n),$$

where each $\alpha_i$ is estimated. This is the usual within estimator, where for $\bar{X}_i = \sum_{t=1}^{T} X_{it}/T$,

$$\hat{\beta}_w = \frac{\sum_{i,t} (X_{it} - \bar{X}_i) Y_{it}}{\sum_{i,t} (X_{it} - \bar{X}_i)^2}.$$

Here the estimator of the marginal effect is just $\hat{\beta}_w$. To describe its limit, let $r_k = \#\{t : X_{kt} = 1\}/T$ and let $\sigma_k^2 = r_k(1 - r_k)$ be the variance of a binomial with probability $r_k$.
Theorem 1: If equation (1) is satisfied, $(X_i, Y_i)$ has finite second moments, and $\sum_{k=1}^{K} \mathcal{P}_k \sigma_k^2 > 0$, then

$$\hat{\beta}_w \xrightarrow{p} \frac{\sum_{k=1}^{K} \mathcal{P}_k \sigma_k^2 \mu_k}{\sum_{k=1}^{K} \mathcal{P}_k \sigma_k^2}. \qquad (4)$$
This result is similar to Angrist (1998), who found that, in a treatment effects model with cross section data, the partially linear slope estimator is a variance weighted average effect. Comparing equations (2) and (4) we see that the linear fixed effects estimator converges to a weighted average of the $\mu_k$, weighted by $\sigma_k^2$, rather than the simple average in equation (2). The weights are never completely equal, so the linear fixed effects estimator is not consistent for the marginal effect unless the way $\mu_k$ varies with $k$ is restricted. Imposing restrictions on how $\mu_k$ varies with $k$ amounts to restricting the conditional distribution of $\alpha_i$ given $X_i$, which we do not do in this paper.
One reason for the inconsistency of $\hat{\beta}_w$ is that certain $\mu_k$ receive zero weight. For notational purposes let $X_1 = (0,\dots,0)'$ and $X_K = (1,\dots,1)'$ (where we implicitly assume that these are included in the support of $X_i$). Note that $\sigma_1^2 = \sigma_K^2 = 0$, so that $\mu_1$ and $\mu_K$ are not included in the weighted average. The explanation for their absence is that $\mu_1$ and $\mu_K$ are not identified. These are marginal effects conditional on $X_i$ equal to a vector of constants, where there are no changes over time to help identify the effect from equation (3). Nonidentification of these effects was pointed out by Chamberlain (1982).
Another reason for the inconsistency of $\hat{\beta}_w$ is that for $T \geq 4$ the weights on the $\mu_k$ will differ from the corresponding weights for $\mu_0$. This is because $r_k$ varies with $k$ for $k \notin \{1, K\}$ except when $T = 2$ or $T = 3$.

This result is different from Hahn (2001), who found that $\hat{\beta}_w$ consistently estimates the marginal effect. Hahn (2001) restricted the support of $X_i$ to exclude both $(0,\dots,0)'$ and $(1,\dots,1)'$ and only considered a case with $T = 2$. Thus, neither feature that causes inconsistency of $\hat{\beta}_w$ was present in that example. As noted by Hahn (2001), the conditions that lead to consistency of the linear fixed effects estimator in his example are quite special.
Theorem 1 is also different from Wooldridge (2005). There it is shown that if $b_i = m(1,\alpha_i) - m(0,\alpha_i)$ is mean independent of $X_{it} - \bar{X}_i$ for each $t$ then linear fixed effects is consistent. The problem is that this independence assumption is very strong when $X_{it}$ is discrete. Note that for $T = 2$, $X_{i2} - \bar{X}_i$ takes on the value $0$ when $X_i = (1,1)$ or $(0,0)$, $-1/2$ when $X_i = (1,0)$, and $1/2$ when $X_i = (0,1)$. Thus mean independence of $b_i$ and $X_{i2} - \bar{X}_i$ actually implies that $\mu_2 = \mu_3$ and that these are equal to the marginal effect conditional on $X_i \in \{X_1, X_4\}$. This is quite close to independence of $b_i$ and $X_i$, which is not very interesting if we want to allow correlation between the regressors and the individual effect.
The lack of identification of $\mu_1$ and $\mu_K$ means the marginal effect itself is not identified, and therefore no consistent estimator of it exists. Nevertheless, when $m(x,\alpha)$ is bounded there are informative bounds for $\mu_0$, as we show below.
The second reason for the inconsistency of $\hat{\beta}_w$ can be corrected by modifying the estimator. In the binary $X_{it}$ case Chamberlain (1982) gave a consistent estimator for the identified effect $\mu_I = \sum_{k=2}^{K-1} \mathcal{P}_k \mu_k / \sum_{k=2}^{K-1} \mathcal{P}_k$. The estimator is obtained by averaging across individuals the least squares estimates of $\beta_i$ in

$$Y_{it} = \beta_i X_{it} + \alpha_i + v_{it}, \quad (t = 1,\dots,T;\ i = 1,\dots,n).$$

For $s_{xi}^2 = \sum_{t=1}^{T} (X_{it} - \bar{X}_i)^2$ and $n^* = \sum_{i=1}^{n} 1(s_{xi}^2 > 0)$, this estimator takes the form

$$\hat{\mu} = \frac{1}{n^*} \sum_{i=1}^{n} 1(s_{xi}^2 > 0)\, \frac{\sum_{t=1}^{T} (X_{it} - \bar{X}_i) Y_{it}}{s_{xi}^2}.$$

This is equivalent to running least squares in the model

$$Y_{it} = \beta_k X_{it} + \alpha_k + v_{it}, \qquad (5)$$

for individuals with $X_i = X_k$, and averaging the estimates $\hat{\beta}_k$ over $k$ weighted by the sample frequencies of the $X_k$.
The estimator $\hat{\mu}$ of the identified marginal effect $\mu_I$ can easily be extended to any discrete $X_{it}$. To describe the extension, let

$$\bar{d}_{it} = 1(X_{it} = \bar{x}), \quad \tilde{d}_{it} = 1(X_{it} = \tilde{x}), \quad \bar{r}_i = \sum_{t=1}^{T} \bar{d}_{it}/T, \quad \tilde{r}_i = \sum_{t=1}^{T} \tilde{d}_{it}/T,$$

and $n^* = \sum_{i=1}^{n} 1(\bar{r}_i > 0)\, 1(\tilde{r}_i > 0)$. The estimator is given by

$$\hat{\mu} = \frac{1}{n^*} \sum_{i=1}^{n} 1(\bar{r}_i > 0)\, 1(\tilde{r}_i > 0) \left[ \frac{\sum_{t=1}^{T} \bar{d}_{it} Y_{it}}{T \bar{r}_i} - \frac{\sum_{t=1}^{T} \tilde{d}_{it} Y_{it}}{T \tilde{r}_i} \right].$$

This estimator extends Chamberlain's (1982) estimator to the case where $X_{it}$ is not binary.
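As a concrete illustration, the following minimal Python sketch computes $\hat{\mu}$ from a balanced panel. The function and array names are ours, not from the paper, and the simulated data at the end only exemplify the input format.

```python
import numpy as np

def mu_hat(Y, X, xbar, xtil):
    """Sketch of the extended Chamberlain (1982) estimator (D = 1).

    Y, X : (n, T) arrays of outcomes and scalar discrete regressors.
    xbar, xtil : the two regressor values being compared.
    Averages, over individuals observing both values, the difference of
    time averages of Y at X = xbar and at X = xtil.
    """
    dbar = (X == xbar)                # indicator for X_it = xbar
    dtil = (X == xtil)                # indicator for X_it = xtil
    rbar = dbar.mean(axis=1)
    rtil = dtil.mean(axis=1)
    use = (rbar > 0) & (rtil > 0)     # individuals whose mu_k is identified
    T = Y.shape[1]
    ybar = (dbar * Y).sum(axis=1) / np.where(use, T * rbar, 1.0)
    ytil = (dtil * Y).sum(axis=1) / np.where(use, T * rtil, 1.0)
    return (ybar - ytil)[use].mean()

# Example input format with simulated binary data:
rng = np.random.default_rng(0)
n, T = 5000, 4
X = rng.binomial(1, 0.5, size=(n, T))
alpha = np.sqrt(T) * (X.mean(axis=1) - 0.5) / 0.5
Y = ((X + alpha[:, None] + rng.standard_normal((n, T))) > 0).astype(float)
print(mu_hat(Y, X, xbar=1, xtil=0))
```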
To describe the limit of the estimator $\hat{\mu}$ in general, let $\mathcal{K}^* = \{k : \text{there is } \bar{t} \text{ and } \tilde{t} \text{ such that } X_{k\bar{t}} = \bar{x} \text{ and } X_{k\tilde{t}} = \tilde{x}\}$. This is the set of possible values of $X_i$ where both $\bar{x}$ and $\tilde{x}$ occur in at least one time period, allowing identification of the marginal effect from differences. For all other values of $k$, either $\bar{x}$ or $\tilde{x}$ will be missing from the observations and the marginal effect will not be identified. In the next section we consider bounds for those effects.
Theorem 2: If equation (1) is satisfied, $(X_i, Y_i)$ have finite second moments and $\sum_{k \in \mathcal{K}^*} \mathcal{P}_k > 0$, then

$$\hat{\mu} \xrightarrow{p} \mu_I = \sum_{k \in \mathcal{K}^*} \mathcal{P}_k^* \mu_k, \qquad \mathcal{P}_k^* = \mathcal{P}_k \Big/ \sum_{k \in \mathcal{K}^*} \mathcal{P}_k.$$
Here $\hat{\mu}$ is not an efficient estimator of $\mu_I$ for $T \geq 3$, because $\hat{\mu}$ is least squares over time, which does not properly account for time series heteroskedasticity or autocorrelation. An efficient estimator could be obtained by a minimum distance procedure, though that is complicated. Also, one would have only a few observations with which to estimate the needed weighting matrices, so its properties may not be good in small to medium sized samples. For these reasons we leave construction of an efficient estimator to future work.
To see how large the inconsistency of the linear estimators can be, we consider a numerical example where $X_{it} \in \{0,1\}$ is i.i.d. across $i$ and $t$ with $\Pr(X_{it} = 1) = p_X$, $\varepsilon_{it}$ is i.i.d. $N(0,1)$, and

$$Y_{it} = 1(X_{it} + \alpha_i + \varepsilon_{it} > 0), \qquad \alpha_i = \sqrt{T}(\bar{X}_i - p_X)/\sqrt{p_X(1 - p_X)}.$$

Here we consider the marginal effect for $\bar{x} = 1$, $\tilde{x} = 0$, $D = 1$, given by

$$\mu_0 = \int [\Phi(1 + \alpha) - \Phi(\alpha)]\, Q^*(d\alpha).$$
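The following sketch (ours, not part of the original calculations) approximates these population quantities by simulation with a large $n$; the printed ratios approximate $(\beta_w - \mu_0)/\mu_0$ and $(\mu - \mu_0)/\mu_0$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, T, pX = 200_000, 3, 0.1
X = rng.binomial(1, pX, size=(n, T))
Xbar = X.mean(axis=1)
alpha = np.sqrt(T) * (Xbar - pX) / np.sqrt(pX * (1 - pX))
Y = ((X + alpha[:, None] + rng.standard_normal((n, T))) > 0).astype(float)

# True effect: Monte Carlo integral of Phi(1 + a) - Phi(a) over alpha_i.
mu0 = (norm.cdf(1 + alpha) - norm.cdf(alpha)).mean()

# Within (linear fixed effects) estimator.
Xd = X - Xbar[:, None]
beta_w = (Xd * Y).sum() / (Xd ** 2).sum()

# Chamberlain-type average of individual regression coefficients.
s2 = (Xd ** 2).sum(axis=1)
use = s2 > 0
mu = ((Xd * Y).sum(axis=1)[use] / s2[use]).mean()

print(f"mu0={mu0:.3f}  bias(beta_w)={(beta_w - mu0)/mu0:+.1%}  "
      f"bias(mu_hat)={(mu - mu0)/mu0:+.1%}")
```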
Table 1 and Figure 1 give numerical values for $(\beta_w - \mu_0)/\mu_0$ and $(\mu - \mu_0)/\mu_0$ for several values of $T$ and $p_X$, where $\beta_w = \operatorname{plim} \hat{\beta}_w$ and $\mu = \operatorname{plim} \hat{\mu}$.

We find that the biases (inconsistencies) can be large in percentage terms. We also find that the biases are largest when $p_X$ is small. In this example, the inconsistency of fixed effects estimators of marginal effects seems to be largest when the regressor values are sparse. Also, we find that the differences between the limits of $\hat{\mu}$ and $\hat{\beta}_w$ are larger for larger $T$, which is to be expected because the weights differ more for larger $T$.
3 Bounds in the Conditional Mean Model

Although the marginal effect $\mu_0$ is not identified, it is straightforward to bound it. Also, as we show below, these bounds can be quite informative, which motivates the analysis that follows.

Some additional notation is useful for describing the results. Let

$$m_t^k = E[Y_{it} \mid X_i = X_k]/D$$

be the identified conditional expectation of each time period observation on $Y_{it}$, conditional on the $k$th support point. Also, let $\delta(\alpha) = [m(\bar{x},\alpha) - m(\tilde{x},\alpha)]/D$. The next result gives identification and bound results for $\mu_k$, which can then be used to obtain bounds for $\mu_0$.
Lemma 3: Suppose that equation (1) is satisfied. If there is $\bar{t}$ and $\tilde{t}$ such that $X_{k\bar{t}} = \bar{x}$ and $X_{k\tilde{t}} = \tilde{x}$ then

$$\mu_k = m_{\bar{t}}^k - m_{\tilde{t}}^k.$$

Suppose that $B_\ell \leq m(x,\alpha)/D \leq B_u$. If there is $\bar{t}$ such that $X_{k\bar{t}} = \bar{x}$ then

$$m_{\bar{t}}^k - B_u \leq \mu_k \leq m_{\bar{t}}^k - B_\ell.$$

Also, if there is $\tilde{t}$ such that $X_{k\tilde{t}} = \tilde{x}$ then

$$B_\ell - m_{\tilde{t}}^k \leq \mu_k \leq B_u - m_{\tilde{t}}^k.$$

Suppose that $\delta(\alpha)$ has the same sign for all $\alpha$. Then if for some $k$ there is $\bar{t}$ and $\tilde{t}$ such that $X_{k\bar{t}} = \bar{x}$ and $X_{k\tilde{t}} = \tilde{x}$, the sign of $\delta(\alpha)$ is identified. Furthermore, if $\delta(\alpha)$ is positive then the lower bounds may be replaced by zero, and if $\delta(\alpha)$ is negative then the upper bounds may be replaced by zero.
The bounds on each $\mu_k$ can be combined to obtain bounds for the marginal effect $\mu_0$. Let

$$\bar{\mathcal{K}} = \{k : \text{there is } \bar{t} \text{ such that } X_{k\bar{t}} = \bar{x} \text{ but no } \tilde{t} \text{ such that } X_{k\tilde{t}} = \tilde{x}\},$$
$$\tilde{\mathcal{K}} = \{k : \text{there is } \tilde{t} \text{ such that } X_{k\tilde{t}} = \tilde{x} \text{ but no } \bar{t} \text{ such that } X_{k\bar{t}} = \bar{x}\}.$$

Also, let $\mathcal{P}_0 = \Pr(X_i : X_{it} \neq \bar{x} \text{ and } X_{it} \neq \tilde{x}\ \forall t)$. The following result is obtained by multiplying the $k$th bound in Lemma 3 by $\mathcal{P}_k$ and summing.
Theorem 4: If equation (1) is satisfied and $B_\ell \leq m(x,\alpha)/D \leq B_u$ then $\mu_\ell \leq \mu_0 \leq \mu_u$ for

$$\mu_\ell = \mathcal{P}_0 (B_\ell - B_u) + \sum_{k \in \bar{\mathcal{K}}} \mathcal{P}_k (m_{\bar{t}}^k - B_u) + \sum_{k \in \tilde{\mathcal{K}}} \mathcal{P}_k (B_\ell - m_{\tilde{t}}^k) + \sum_{k \in \mathcal{K}^*} \mathcal{P}_k \mu_k,$$
$$\mu_u = \mathcal{P}_0 (B_u - B_\ell) + \sum_{k \in \bar{\mathcal{K}}} \mathcal{P}_k (m_{\bar{t}}^k - B_\ell) + \sum_{k \in \tilde{\mathcal{K}}} \mathcal{P}_k (B_u - m_{\tilde{t}}^k) + \sum_{k \in \mathcal{K}^*} \mathcal{P}_k \mu_k.$$

If $\delta(\alpha)$ has the same sign for all $\alpha$ and there is some $k^*$ such that $X_{k^* \bar{t}} = \bar{x}$ and $X_{k^* \tilde{t}} = \tilde{x}$, the sign of $\mu_0$ is identified, and if $\mu_0 > 0$ $(< 0)$ then $\mu_\ell$ $(\mu_u)$ can be replaced by $\sum_{k \in \mathcal{K}^*} \mathcal{P}_k \mu_k$.
An estimator can be constructed by replacing the probabilities with sample proportions $\hat{P}_k = \sum_i 1(X_i = X_k)/n$ and $\hat{P}_0 = 1 - \sum_{k \in \bar{\mathcal{K}}} \hat{P}_k - \sum_{k \in \tilde{\mathcal{K}}} \hat{P}_k - \sum_{k \in \mathcal{K}^*} \hat{P}_k$, and each $m_t^k$ by

$$\hat{m}_t^k = 1(n_k > 0) \sum_{i=1}^{n} 1(X_i = X_k) Y_{it} / n_k, \qquad n_k = \sum_{i=1}^{n} 1(X_i = X_k).$$
Estimators of the lower and upper bounds, respectively, are given by

$$\hat{\mu}_\ell = \hat{P}_0 (B_\ell - B_u) + \sum_{k \in \bar{\mathcal{K}}} \hat{P}_k (\hat{m}_{\bar{t}}^k - B_u) + \sum_{k \in \tilde{\mathcal{K}}} \hat{P}_k (B_\ell - \hat{m}_{\tilde{t}}^k) + (n^*/n)\, \hat{\mu},$$
$$\hat{\mu}_u = \hat{P}_0 (B_u - B_\ell) + \sum_{k \in \bar{\mathcal{K}}} \hat{P}_k (\hat{m}_{\bar{t}}^k - B_\ell) + \sum_{k \in \tilde{\mathcal{K}}} \hat{P}_k (B_u - \hat{m}_{\tilde{t}}^k) + (n^*/n)\, \hat{\mu}.$$

The bounds $\hat{\mu}_\ell$ and $\hat{\mu}_u$ will be jointly asymptotically normal with a variance matrix that can be estimated in the usual way, so set inference can be carried out as described in Chernozhukov, Hong, and Tamer (2007) or Beresteanu and Molinari (2008).
As an example, consider the binary $X$ case where $X_{it} \in \{0,1\}$, $\bar{x} = 1$, and $\tilde{x} = 0$. Let $X_K$ denote the $T \times 1$ unit vector and $X_1$ the $T \times 1$ zero vector, both assumed to lie in the support of $X_i$. Here the bounds will be

$$\mu_\ell = \mathcal{P}_K (m_t^K - B_u) + \mathcal{P}_1 (B_\ell - m_t^1) + \sum_{1 < k < K} \mathcal{P}_k \mu_k, \qquad (6)$$
$$\mu_u = \mathcal{P}_K (m_t^K - B_\ell) + \mathcal{P}_1 (B_u - m_t^1) + \sum_{1 < k < K} \mathcal{P}_k \mu_k.$$
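A minimal sketch of the estimated version of the bounds in (6), assuming a balanced binary panel. Under time stationarity any single $t$ identifies $m_t^K$ and $m_t^1$; the sketch pools over $t$ for precision, which is an implementation choice of ours.

```python
import numpy as np

def bounds_binary(Y, X, Bl=0.0, Bu=1.0):
    """Sketch of the estimated bounds (6) for binary X_it, xbar=1, xtil=0.

    Individuals with a switch identify their mu_k; the all-ones and
    all-zeros histories contribute the bracketing terms.
    """
    ones = X.all(axis=1)             # X_i = (1,...,1)
    zeros = (1 - X).all(axis=1)      # X_i = (0,...,0)
    mid = ~ones & ~zeros
    PK, P1 = ones.mean(), zeros.mean()
    mK = Y[ones].mean() if ones.any() else 0.0   # pooled estimate of m^K_t
    m1 = Y[zeros].mean() if zeros.any() else 0.0  # pooled estimate of m^1_t
    # Identified part: average of individual regression coefficients.
    Xd = X - X.mean(axis=1, keepdims=True)
    s2 = (Xd ** 2).sum(axis=1)
    mu_id = ((Xd * Y).sum(axis=1)[mid] / s2[mid]).mean() if mid.any() else 0.0
    lower = PK * (mK - Bu) + P1 * (Bl - m1) + mid.mean() * mu_id
    upper = PK * (mK - Bl) + P1 * (Bu - m1) + mid.mean() * mu_id
    return lower, upper
```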
It is interesting to ask how the bounds behave as $T$ grows. If the bounds converge to $\mu_0$ as $T$ goes to infinity then $\mu_0$ is identified for infinite $T$. If the bounds converge rapidly as $T$ grows then one might hope to obtain tight bounds for $T$ not very large. The following result gives a simple condition under which the bounds converge to $\mu_0$ as $T$ grows.
Theorem 5: Suppose that equation (1) is satisfied, $B_\ell \leq m(x,\alpha)/D \leq B_u$, $X_i = (X_{i1}, X_{i2}, \dots)$ is stationary and, conditional on $\alpha_i$, the support of each $X_{it}$ equals the marginal support of $X_{it}$ and $X_i$ is ergodic. Then

$$\mu_\ell \to \mu_0 \quad \text{and} \quad \mu_u \to \mu_0 \quad \text{as } T \to \infty.$$
This result gives conditions for identification as $T$ grows, generalizing a result of Chamberlain (1982) for binary $X_{it}$. In addition, it shows that the bounds derived above shrink to the marginal effect as $T$ grows. The rate at which the bounds converge in the general model is a complicated question. Here we address it in an example and leave a general treatment to another setting. The example we consider is that where $X_{it} \in \{0,1\}$.
Theorem 6: Suppose that equation (1) is satisfied, $B_\ell \leq m(x,\alpha)/D \leq B_u$, and $X_i$ is stationary and Markov of order $J$ conditional on $\alpha_i$. Then for $p_{1i} = \Pr(X_{it} = 0 \mid X_{i,t-1} = \cdots = X_{i,t-J} = 0, \alpha_i)$ and $p_{Ki} = \Pr(X_{it} = 1 \mid X_{i,t-1} = \cdots = X_{i,t-J} = 1, \alpha_i)$,

$$\max\{|\mu_\ell - \mu_0|,\ |\mu_u - \mu_0|\} \leq (B_u - B_\ell)\, E[(p_{1i})^{T-J} + (p_{Ki})^{T-J}].$$

If there is $\varepsilon > 0$ such that $p_{1i} \leq 1 - \varepsilon$ and $p_{Ki} \leq 1 - \varepsilon$ then

$$\max\{|\mu_\ell - \mu_0|,\ |\mu_u - \mu_0|\} \leq (B_u - B_\ell)\, 2(1 - \varepsilon)^{T-J}.$$

If there is a set $\mathcal{A}$ of values of $\alpha_i$ such that $\Pr(\mathcal{A}) > 0$ and either $\Pr(X_{i1} = \cdots = X_{iJ} = 0 \mid \alpha_i) > 0$ and $p_{1i} = 1$ for all $\alpha_i \in \mathcal{A}$, or $\Pr(X_{i1} = \cdots = X_{iJ} = 1 \mid \alpha_i) > 0$ and $p_{Ki} = 1$ for all $\alpha_i \in \mathcal{A}$, then $\mu_\ell \not\to \mu_0$ or $\mu_u \not\to \mu_0$.
When the conditional probabilities that $X_{it}$ is zero or one are bounded away from one, the bounds converge at an exponential rate. We conjecture that an analogous result could be shown for general $X_{it}$. The conditions under which one of the bounds does not converge violate a hypothesis of Theorem 5, namely that the conditional support of $X_{it}$ equals the marginal support. Theorem 6 shows that in this case the bounds may not shrink to the marginal effect.

The bounds may converge, but not exponentially fast, depending on $\Pr(\mathcal{A})$ and the distribution of $\alpha_i$. For example, suppose that $X_{it} = 1(\alpha_i - \varepsilon_{it} > 0)$, $\alpha_i \sim N(0,1)$, $\varepsilon_{it} \sim N(0,1)$, with $\alpha_i$ i.i.d. over $i$, and $\varepsilon_{it}$ i.i.d. over $t$ and independent of $\alpha_i$. Then

$$\mathcal{P}_K = E[\Phi(\alpha_i)^T] = \int \Phi(\alpha)^T \phi(\alpha)\, d\alpha = \left[\frac{\Phi(\alpha)^{T+1}}{T+1}\right]_{-\infty}^{\infty} = \frac{1}{T+1}.$$

In this example the bounds will converge at the slow rate $1/T$. More generally, the convergence rate will depend on the distribution of $p_{1i}$ and $p_{Ki}$.
It is interesting to note that the convergence rates we have derived so far depend only on the properties of the joint distribution of $(X_i, \alpha_i)$, and not on the properties of the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$. This feature of the problem is consistent with our placing no restrictions on $m(x,\alpha)$. In Section 5 we find that the bounds and rates may be improved when the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$ is restricted.
4 Predetermined Regressors

The previous bound analysis can be extended to cases where the regressor $X_{it}$ is just predetermined rather than strictly exogenous. These cases cover, for example, dynamic panel models where $X_{it}$ includes lags of $Y_{it}$. To describe this extension, let $X_i(t) = [X_{i1},\dots,X_{it}]'$ and suppose that

$$E[Y_{it} \mid X_i(t), \alpha_i] = m(X_{it}, \alpha_i), \quad (t = 1,\dots,T). \qquad (7)$$

For example, this includes the heterogeneous, dynamic binary choice model of Browning and Carro (2007), where $Y_{it} \in \{0,1\}$ and $X_{it} = Y_{i,t-1}$.
As before, the marginal effect is given by $\mu_0 = \int [m(\bar{x},\alpha) - m(\tilde{x},\alpha)]\, Q^*(d\alpha)/D$ for two different possible values $\bar{x}$ and $\tilde{x}$ of the regressors and a distance $D$. Also as before, the marginal effect will have an identified component and an unidentified component. The key implication used to obtain the identified component is

$$E[Y_{it} \mid X_i(t) = X(t)] = \int m(X_t, \alpha)\, Q^*(d\alpha \mid X_i(t) = X(t)), \qquad (8)$$

where $X(t) = [X_1,\dots,X_t]'$.
Bounds are obtained by partitioning the set of possible $X_i$ into subsets that can make use of the above key implication and a subset where the bounds on $m(x,\alpha)/D$ are applied. The key implication applies to subsets of the form $A_t(x) = \{X : X_t = x,\ X_s \neq x\ \forall s < t\}$, that is, the set of possible $X_i$ vectors that have $x$ as the $t$th component and not as any earlier component. The bound applies to the same subset as before, where $x$ never appears, given by $\bar{A}(x) = \{X : X_t \neq x\ \forall t\}$. Together the union of the $A_t(x)$ over all $t$ and $\bar{A}(x)$ constitute a partition of the possible $X$ vectors. Let $\bar{\mathcal{P}}(x) = \Pr(X_i \in \bar{A}(x))$ be the probability that no component of $X_i$ equals $x$, and let

$$\tilde{\mu}_0 = E\left[\sum_{t=1}^{T} \{1(X_i \in A_t(\bar{x})) - 1(X_i \in A_t(\tilde{x}))\}\, Y_{it}\right] \Big/ D.$$

Then the key implication and iterated expectations give
Theorem 7: If equation (7) is satisfied and $B_\ell \leq m(x,\alpha)/D \leq B_u$ then $\mu_\ell \leq \mu_0 \leq \mu_u$ for

$$\mu_\ell = \tilde{\mu}_0 + B_\ell \bar{\mathcal{P}}(\bar{x}) - B_u \bar{\mathcal{P}}(\tilde{x}), \qquad \mu_u = \tilde{\mu}_0 + B_u \bar{\mathcal{P}}(\bar{x}) - B_\ell \bar{\mathcal{P}}(\tilde{x}). \qquad (9)$$
As previously, estimates of these bounds can be formed from sample analogs. Let $\hat{\bar{P}}(x) = \sum_{i=1}^{n} 1(X_i \in \bar{A}(x))/n$ and

$$\hat{\tilde{\mu}} = \sum_{i=1}^{n} \sum_{t=1}^{T} [1(X_i \in A_t(\bar{x})) - 1(X_i \in A_t(\tilde{x}))]\, Y_{it} / (nD).$$

The estimates of the bounds are given by

$$\hat{\mu}_\ell = \hat{\tilde{\mu}} + B_\ell \hat{\bar{P}}(\bar{x}) - B_u \hat{\bar{P}}(\tilde{x}), \qquad \hat{\mu}_u = \hat{\tilde{\mu}} + B_u \hat{\bar{P}}(\bar{x}) - B_\ell \hat{\bar{P}}(\tilde{x}).$$

Inference using these bounds can be carried out analogously to the strictly exogenous case.
An important example is binary $Y_{it} \in \{0,1\}$ with $X_{it} = Y_{i,t-1}$. Here $B_u = 1$ and $B_\ell = 0$, so the marginal effect is

$$\mu_0 = \int [\Pr(Y_{it} = 1 \mid Y_{i,t-1} = 1, \alpha) - \Pr(Y_{it} = 1 \mid Y_{i,t-1} = 0, \alpha)]\, Q^*(d\alpha),$$

i.e., the effect of the lagged $Y_{i,t-1}$ on the probability that $Y_{it} = 1$, holding $\alpha_i$ constant, averaged over $\alpha_i$. In this sense the bounds provide an approximate solution to the problem considered by Feller (1943) and Heckman (1981) of evaluating duration dependence in the presence of unobserved heterogeneity. In this example the bound estimates are

$$\hat{\mu}_\ell = \hat{\tilde{\mu}} - \hat{\bar{P}}(0), \qquad \hat{\mu}_u = \hat{\tilde{\mu}} + \hat{\bar{P}}(1). \qquad (10)$$

The width of the bounds is $\hat{\bar{P}}(0) + \hat{\bar{P}}(1)$, so although these bounds may not be very informative in short panels, they will be in long panels, where $\hat{\bar{P}}(0) + \hat{\bar{P}}(1)$ is small.
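A sketch of the bound estimates in (10) for this dynamic binary case; we assume the outcome matrix includes the initial condition $Y_{i0}$ in its first column, so that the regressor history is $X_{it} = Y_{i,t-1}$ (the layout convention is ours).

```python
import numpy as np

def dynamic_bounds(Y):
    """Sketch of the bound estimates (10): Y_it in {0,1}, X_it = Y_{i,t-1}.

    Y : (n, T+1) array whose first column is the initial condition Y_{i0}.
    """
    X = Y[:, :-1]            # lagged outcome as the regressor history
    Yt = Y[:, 1:]            # outcomes for t = 1,...,T
    T = Yt.shape[1]
    mu_t = 0.0
    for t in range(T):
        # A_t(1): first 1 in the history occurs at t; A_t(0): first 0 at t.
        first_one = (X[:, t] == 1) & (X[:, :t] == 0).all(axis=1)
        first_zero = (X[:, t] == 0) & (X[:, :t] == 1).all(axis=1)
        mu_t += ((first_one.astype(int) - first_zero.astype(int))
                 * Yt[:, t]).mean()
    Pbar0 = (X == 1).all(axis=1).mean()   # histories where 0 never occurs
    Pbar1 = (X == 0).all(axis=1).mean()   # histories where 1 never occurs
    return mu_t - Pbar0, mu_t + Pbar1
```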
Theorems 5 and 6 on the convergence of the bounds as $T$ grows apply to the $\mu_\ell$ and $\mu_u$ of equation (9), since these bounds have a similar structure and the convergence results explicitly allow for dependence of $X_{it}$ over time conditional on $\alpha_i$. For example, for $Y_{it} \in \{0,1\}$ and $X_{it} = Y_{i,t-1}$, equation (7) implies that $Y_{it}$ is Markov conditional on $\alpha_i$ with $J = 1$. Theorem 5 then shows that the bounds converge to the marginal effect as $T$ grows if $0 < \Pr(Y_{it} = 1 \mid \alpha_i) < 1$ with probability one. Theorem 6 also gives the rate at which the bounds converge; e.g., it will be exponential if $\Pr(Y_{it} = 1 \mid Y_{i,t-1} = 1, \alpha_i)$ and $\Pr(Y_{it} = 0 \mid Y_{i,t-1} = 0, \alpha_i)$ are bounded away from one.

It appears that, unlike in the strictly exogenous case, there is only one way to estimate the identified component $\tilde{\mu}_0$. In this sense the estimators given here for the bounds should be asymptotically efficient, so there should be no gain from trying to account for heteroskedasticity and autocorrelation over time. Also, it does not appear possible to obtain tighter bounds when monotonicity holds, because the partition is different for $\bar{x}$ and $\tilde{x}$.
5 Semiparametric Multinomial Choice

The bounds for marginal effects derived in the previous sections did not use any functional form restrictions on the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$. If this distribution is restricted, one may be able to tighten the bounds. To illustrate, we consider a semiparametric multinomial choice model where the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$ is specified and the conditional distribution of $\alpha_i$ given $X_i$ is unknown.
We assume that the vector $Y_i$ of outcome variables can take $J$ possible values $Y^1,\dots,Y^J$. As before, we also assume that $X_i$ has a discrete distribution and can take $K$ possible values $X^1,\dots,X^K$. Suppose that the conditional probability of $Y_i$ given $(X_i, \alpha_i)$ is

$$\Pr(Y_i = Y^j \mid X_i = X^k, \alpha_i) = \mathcal{L}(Y^j \mid X^k, \alpha_i, \theta^*)$$

for some finite dimensional $\theta^*$ and some known function $\mathcal{L}$. Let $Q_k^*$ denote the unknown conditional distribution of $\alpha_i$ given $X_i = X^k$, and let $\mathcal{P}_{jk}$ denote the conditional probability of $Y_i = Y^j$ given $X_i = X^k$. We then have

$$\mathcal{P}_{jk} = \int \mathcal{L}(Y^j \mid X^k, \alpha, \theta^*)\, Q_k^*(d\alpha), \quad (j = 1,\dots,J;\ k = 1,\dots,K), \qquad (11)$$

where $\mathcal{P}_{jk}$ is identified from the data and the right hand side gives the probabilities predicted by the model. This model is semiparametric in having a likelihood $\mathcal{L}$ that is parametric and conditional distributions $Q_k^*$ for the individual effect that are completely unspecified. In general the parameters of the model may be set identified, so the previous equation is satisfied by a set $B$ of parameter values that includes $\theta^*$ and by sets of distributions that include $Q_k^*$ for $k = 1,\dots,K$. We discuss identification of the model parameters in more detail in the next section. Here we focus on bounds for the marginal effect when this model holds.
For example, consider a binary choice model where $Y_{it} \in \{0,1\}$, $Y_{i1},\dots,Y_{iT}$ are independent conditional on $(X_i, \alpha_i)$, and

$$\Pr(Y_{it} = 1 \mid X_i, \alpha_i, \theta) = F(X_{it}'\theta + \alpha_i) \qquad (12)$$

for a known CDF $F(\cdot)$. Then each $Y^j$ consists of a $T \times 1$ vector of zeros and ones, so there are $J = 2^T$ possible values. Also,

$$\mathcal{L}(Y_i \mid X_i, \alpha_i, \theta) = \prod_{t=1}^{T} F(X_{it}'\theta + \alpha_i)^{Y_{it}} [1 - F(X_{it}'\theta + \alpha_i)]^{1 - Y_{it}}.$$

The observed conditional probabilities then satisfy

$$\mathcal{P}_{jk} = \int \left\{ \prod_{t=1}^{T} F(X_t^{k\prime}\theta + \alpha)^{Y_t^j} [1 - F(X_t^{k\prime}\theta + \alpha)]^{1 - Y_t^j} \right\} Q_k^*(d\alpha), \quad (j = 1,\dots,2^T;\ k = 1,\dots,K).$$
As discussed above, for the binary choice model the marginal effect of a change in $X_{it}$ from $\tilde{x}$ to $\bar{x}$, conditional on $X_i = X^k$, is

$$\mu_k = D^{-1} \int [F(\bar{x}'\theta + \alpha) - F(\tilde{x}'\theta + \alpha)]\, Q_k^*(d\alpha), \qquad (13)$$

for a distance $D$. This marginal effect is generally not identified. Bounds can be constructed using the results of Section 3 with $B_\ell = 0$ and $B_u = 1$, since $m(x,\alpha) = F(x'\theta + \alpha) \in [0,1]$. Moreover, in this model the sign of $\delta(\alpha) = D^{-1}[F(\bar{x}'\theta + \alpha) - F(\tilde{x}'\theta + \alpha)]$ does not change with $\alpha$, so we can apply the result in Lemma 3 to reduce the size of the bounds. These bounds, however, are not tight because they do not fully exploit the structure of the model. Sharper bounds are given by
$$\mu_{\ell k} = \min_{\theta \in B,\, Q_k} D^{-1} \int [F(\bar{x}'\theta + \alpha) - F(\tilde{x}'\theta + \alpha)]\, Q_k(d\alpha) \quad \text{s.t.} \quad \mathcal{P}_{jk} = \int \mathcal{L}(Y^j \mid X^k, \alpha, \theta)\, Q_k(d\alpha)\ \forall j, \qquad (14)$$

and

$$\mu_{uk} = \max_{\theta \in B,\, Q_k} D^{-1} \int [F(\bar{x}'\theta + \alpha) - F(\tilde{x}'\theta + \alpha)]\, Q_k(d\alpha) \quad \text{s.t.} \quad \mathcal{P}_{jk} = \int \mathcal{L}(Y^j \mid X^k, \alpha, \theta)\, Q_k(d\alpha)\ \forall j. \qquad (15)$$

In the next sections we discuss how these bounds can be computed and estimated. Here we consider how fast the bounds shrink as $T$ grows.
First, note that since this model is a special case of (more restricted than) the conditional mean model, the bounds here will be sharper than the bounds previously given. Therefore, the bounds here will converge at least as fast as the previous bounds. Imposing the structure here does improve convergence rates; in some cases one can obtain fast rates without any restrictions on the joint distribution of $X_i$ and $\alpha_i$.

We consider carefully the logit model and leave other models to future work. The logit model is simpler than others because $\theta^*$ is point identified. In other cases one would need to account for the bounds for $\theta^*$. To keep the notation simple we focus on the binary $X$ case, $X_{it} \in \{0,1\}$, where $\bar{x} = 1$ and $\tilde{x} = 0$. We find that the bounds shrink at rate $T^{-r}$ for any finite $r$, without any restriction on the joint distribution of $X_i$ and $\alpha_i$.
Theorem 8: For $k = 1$ or $k = K$ and for any $r > 0$, as $T \to \infty$,

$$\mu_{uk} - \mu_{\ell k} = O(T^{-r}).$$
Fixed effects maximum likelihood estimators (FEMLEs) are a common approach to estimating model parameters and marginal effects in multinomial panel models. Here we compare the probability limits of these estimators to the identified sets for the corresponding parameters. The FEMLE treats the realizations of the individual effects as parameters to be estimated. The corresponding population problem can be expressed as

$$\bar{\theta} = \operatorname*{argmax}_{\theta} \sum_{k=1}^{K} \mathcal{P}_k \sum_{j=1}^{J} \mathcal{P}_{jk} \log \mathcal{L}(Y^j \mid X^k, \bar{\alpha}_{jk}(\theta), \theta), \qquad (16)$$

where

$$\bar{\alpha}_{jk}(\theta) = \operatorname*{argmax}_{\alpha} \log \mathcal{L}(Y^j \mid X^k, \alpha, \theta), \quad \forall j, k. \qquad (17)$$

Here we first concentrate out the support points of the conditional distributions of $\alpha$ and then solve for the parameter $\theta$.

Fixed effects estimation therefore imposes that the estimate of $Q_k$ has no more than $J$ points of support. The distributions implicitly estimated by the FEMLE take the form

$$\bar{Q}_k(\alpha) = \begin{cases} \mathcal{P}_{jk}, & \text{for } \alpha = \bar{\alpha}_{jk}(\bar{\theta}); \\ 0, & \text{otherwise.} \end{cases} \qquad (18)$$
The following example illustrates this point using a simple two period model. Consider a two-period binary choice model with a binary regressor and a strictly increasing, symmetric CDF, i.e., $F(-x) = 1 - F(x)$. In this case the estimands of the fixed effects estimators are

$$\bar{\alpha}_{jk}(\theta) = \begin{cases} -\infty, & \text{if } Y^j = (0,0); \\ -(X_1^k + X_2^k)'\theta/2, & \text{if } Y^j = (1,0) \text{ or } Y^j = (0,1); \\ \infty, & \text{if } Y^j = (1,1), \end{cases} \qquad (19)$$

and the corresponding distribution for $\alpha$ has the form

$$\bar{Q}_k(\alpha) = \begin{cases} \Pr\{Y = (0,0) \mid X^k\}, & \text{if } \alpha = -\infty; \\ \Pr\{Y = (1,0) \mid X^k\} + \Pr\{Y = (0,1) \mid X^k\}, & \text{if } \alpha = -(X_1^k + X_2^k)'\bar{\theta}/2; \\ \Pr\{Y = (1,1) \mid X^k\}, & \text{if } \alpha = \infty. \end{cases} \qquad (20)$$
This formulation of the problem is convenient for analyzing the properties of nonlinear fixed effects estimators of marginal effects. Thus, for example, the estimator of the marginal effect $\mu_k$ takes the form

$$\bar{\mu}_k(\theta) = D^{-1} \int [F(\bar{x}'\theta + \alpha) - F(\tilde{x}'\theta + \alpha)]\, d\bar{Q}_k(\alpha). \qquad (21)$$

The average of these estimates across individuals with identified effects is consistent for the identified effect $\mu_I$ when $X$ is binary. This result is shown here analytically for the two-period case and through numerical examples for $T \geq 3$.
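For the two-period case, the implicit distribution (20) makes the fixed effects marginal effect (21) a one-line computation, since $F(\bar{\theta} + \alpha) - F(\alpha)$ vanishes at $\alpha = \pm\infty$. A sketch, with argument names of our own choosing:

```python
import numpy as np

def fe_me_T2(p_yy_k, Xk, theta_bar, F=lambda z: 1 / (1 + np.exp(-z))):
    """Sketch of (18)-(21) for T = 2, xbar = 1, xtil = 0, D = 1.

    p_yy_k    : dict mapping outcomes (y1, y2) to Pr{Y = (y1,y2) | X^k}.
    Xk        : the pair of scalar regressor values (X^k_1, X^k_2).
    theta_bar : FEMLE probability limit (2*theta for logit; Andersen 1973).
    """
    # Middle support point of Q-bar_k; the points at +/- infinity
    # contribute zero to F(theta + a) - F(a).
    a_mid = -(Xk[0] + Xk[1]) * theta_bar / 2
    mass_mid = p_yy_k[(1, 0)] + p_yy_k[(0, 1)]
    return (F(theta_bar + a_mid) - F(a_mid)) * mass_mid
```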
Theorem 9: If $F'(x) > 0$, $F(-x) = 1 - F(x)$, and $\sum_{k=2}^{K-1} \mathcal{P}_k > 0$, then, for $\mathcal{P}_k^* = \mathcal{P}_k / \sum_{k=2}^{K-1} \mathcal{P}_k$,

$$\bar{\mu}_I = \sum_{k=2}^{K-1} \mathcal{P}_k^* \bar{\mu}_k(\bar{\theta}) = \mu_I.$$
For effects that are not identified, the nonlinear fixed effects estimators are usually biased toward zero, which introduces a bias of the same direction in the fixed effects estimator of the average effect $\mu_0$ if there are individuals with unidentified effects in the population. To see this, consider a logit model with a binary regressor, $X^k = (0,0)$, $\tilde{x} = 0$ and $\bar{x} = 1$. Using that $\bar{\theta} = 2\theta^*$ (Andersen, 1973) and $F'(x) = F(x)(1 - F(x)) \leq 1/4$, we have

$$\bar{\mu}_k(\bar{\theta}) = \{F(\bar{\theta}) - F(0)\}\,\{\Pr[Y = (1,0) \mid X^k] + \Pr[Y = (0,1) \mid X^k]\} \leq \frac{\theta^*}{2} \cdot 2\int F(\alpha)(1 - F(\alpha))\, Q_k^*(d\alpha) = \theta^* E[F'(\tilde{x}'\theta^* + \alpha) \mid X = X^k] \leq |\mu_k|.$$

This conjecture is explored further numerically in the next section.
6 Characterization and Computation of Population Bounds

6.1 Identification Sets and Extremal Distributions

We begin our discussion of calculating bounds by considering bounds for the parameter $\theta$. Let $\mathcal{L}_{jk}(\theta, Q_k) := \int \mathcal{L}(Y^j \mid X^k, \alpha, \theta)\, Q_k(d\alpha)$ and $Q := (Q_1,\dots,Q_K)$. For the subsequent inferential analysis, it is convenient to consider a quadratic loss function

$$T(\theta, Q; \mathcal{P}) = \sum_{j,k} \omega_{jk}(\mathcal{P})\, (\mathcal{P}_{jk} - \mathcal{L}_{jk}(\theta, Q_k))^2, \qquad (22)$$

where the $\omega_{jk}(\mathcal{P})$ are positive weights. By the definition of the model in (11), we can see that $(\theta^*, Q^*)$ is such that

$$T(\theta, Q; \mathcal{P}) \geq T(\theta^*, Q^*; \mathcal{P}) = 0$$

for every $(\theta, Q)$. For $T(\theta; \mathcal{P}) := \inf_Q T(\theta, Q; \mathcal{P})$, this implies that

$$T(\theta; \mathcal{P}) \geq T(\theta^*; \mathcal{P}) = 0$$

for every $\theta$. Let $B$ be the set of $\theta$'s that minimize $T(\theta; \mathcal{P})$, i.e.,

$$B := \{\theta : T(\theta; \mathcal{P}) = 0\}.$$

Then we can see that $\theta^* \in B$; in other words, $\theta^*$ is set identified by the set $B$.

It follows from the next lemma that one need only search over discrete distributions for $Q$ to find $B$.
Lemma 10: If the support $C$ of $\alpha_i$ is compact and $\mathcal{L}(Y^j \mid X^k, \alpha, \theta)$ is continuous in $\alpha$ for each $\theta$, $j$, and $k$, then, for each $\theta \in B$ and $k$, a solution to

$$Q_k^* = \operatorname*{argmin}_{Q_k} \sum_{j=1}^{J} \omega_{jk}(\mathcal{P})\, (\mathcal{P}_{jk} - \mathcal{L}_{jk}(\theta, Q_k))^2$$

exists that is a discrete distribution with at most $J$ points of support, and $\mathcal{L}_{jk}(\theta, Q_k^*) = \mathcal{P}_{jk}$, $\forall j, k$.
Another important result is that the bounds for marginal effects can also be found by searching over discrete distributions with few points of support. We focus on the upper bound $\mu_{uk}$ defined in (15); an analogous result holds for the lower bound $\mu_{\ell k}$ in (14).

Lemma 11: If the support $C$ of $\alpha_i$ is compact and $\mathcal{L}(Y^j \mid X^k, \alpha, \theta)$ is continuous in $\alpha$ for each $\theta$, $j$, and $k$, then, for each $\theta \in B$ and $k$, a solution to

$$\bar{Q}_k = \operatorname*{argmax}_{Q_k} D^{-1} \int [F(\bar{x}'\theta + \alpha) - F(\tilde{x}'\theta + \alpha)]\, Q_k(d\alpha) \quad \text{s.t.} \quad \mathcal{L}_{jk}(\theta, Q_k) = \mathcal{P}_{jk}\ \forall j$$

can be obtained from a discrete distribution with at most $J$ points of support.
6.2 Numerical Examples

We carry out some numerical calculations to illustrate and complement the previous analytical results. We use the following binary choice model:

$$Y_{it} = 1\{X_{it}\theta + \alpha_i + \varepsilon_{it} \geq 0\}, \qquad (23)$$

with $\varepsilon_{it}$ i.i.d. over $t$, normal or logistic with zero mean and unit variance. The explanatory variable $X_{it}$ is binary and i.i.d. over $t$ with $p_X = \Pr\{X_{it} = 1\} = 0.5$. The unobserved individual effect $\alpha_i$ is correlated with the explanatory variable for each individual. In particular, we generate this effect as a mixture of a random component and the standardized individual sample mean of the regressor. The random part is independent of the regressors and follows a discretized standard normal distribution, as in Honore and Tamer (2006). Thus, we have $\alpha_i = \alpha_{1i} + \alpha_{2i}$, where

$$\Pr\{\alpha_{1i} = \bar{a}_m\} = \begin{cases} \Phi\left(\frac{\bar{a}_{m+1} + \bar{a}_m}{2}\right), & \text{for } \bar{a}_m = -3.0; \\[4pt] \Phi\left(\frac{\bar{a}_{m+1} + \bar{a}_m}{2}\right) - \Phi\left(\frac{\bar{a}_m + \bar{a}_{m-1}}{2}\right), & \text{for } \bar{a}_m = -2.8, -2.6, \dots, 2.8; \\[4pt] 1 - \Phi\left(\frac{\bar{a}_m + \bar{a}_{m-1}}{2}\right), & \text{for } \bar{a}_m = 3.0; \end{cases}$$

and $\alpha_{2i} = \sqrt{T}(\bar{X}_i - p_X)/\sqrt{p_X(1 - p_X)}$ with $\bar{X}_i = \sum_{t=1}^{T} X_{it}/T$.
Identified sets for parameters and marginal effects are calculated for panels with 2, 3, and 4 periods based on the conditional mean model of Section 2 and on semiparametric logit and probit models. For the logit and probit models the sets are obtained using a linear programming algorithm for discrete regressors, as in Honore and Tamer (2006). Thus, for the parameter $\theta$ we have that $B = \{\theta : L(\theta) = 0\}$, where

$$L(\theta) = \min_{w_k,\, v_{jk},\, \pi_{km}} \sum_{k=1}^{K} w_k + \sum_{j=1}^{J} \sum_{k=1}^{K} v_{jk} \qquad (24)$$

subject to

$$v_{jk} + \sum_{m=1}^{M} \pi_{km} \mathcal{L}(Y^j \mid X^k, \alpha_m, \theta) = \mathcal{P}_{jk}\ \forall j, k, \qquad w_k + \sum_{m=1}^{M} \pi_{km} = 1\ \forall k, \qquad v_{jk} \geq 0,\ w_k \geq 0,\ \pi_{km} \geq 0\ \forall j, k, m.$$
For marginal effects (see also Chernozhukov, Hahn, and Newey, 2004) we solve

$$\mu_{uk}/\mu_{\ell k} = \max/\min_{\pi_{km},\, \theta \in B} \sum_{m=1}^{M} \pi_{km} [F(\bar{x}\theta + \alpha_m) - F(\tilde{x}\theta + \alpha_m)] \qquad (25)$$

subject to

$$\sum_{m=1}^{M} \pi_{km} \mathcal{L}(Y^j \mid X^k, \alpha_m, \theta) = \mathcal{P}_{jk}\ \forall j, \qquad \sum_{m=1}^{M} \pi_{km} = 1, \qquad \pi_{km} \geq 0\ \forall m.$$
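For a fixed $\theta$ (for instance the point-identified logit $\theta^*$), problem (25) is linear in the weights $\pi_{km}$ and can be solved with an off-the-shelf LP solver. A sketch, where the grid, matrix, and function names are ours:

```python
import numpy as np
from scipy.optimize import linprog

def me_bounds_lp(P_k, L_k, effect):
    """Sketch of the linear programs (25) for a fixed theta.

    P_k    : (J,) observed conditional choice probabilities for cell k.
    L_k    : (J, M) model probabilities L(Y^j | X^k, alpha_m, theta)
             evaluated on a grid alpha_1,...,alpha_M.
    effect : (M,) values F(xbar*theta + alpha_m) - F(xtil*theta + alpha_m).
    Returns (lower, upper) for mu_k, or None if the LPs are infeasible.
    """
    A_eq = np.vstack([L_k, np.ones(L_k.shape[1])])  # moment + adding-up rows
    b_eq = np.append(P_k, 1.0)
    lo = linprog(effect, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    hi = linprog(-effect, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if not (lo.success and hi.success):
        return None                                  # infeasible constraints
    return lo.fun, -hi.fun
```

The infeasibility case returned here is exactly the practical problem, discussed in Section 7, that motivates projecting the empirical constraints onto the model space.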
The identified sets are compared to the probability limits of linear and nonlinear fixed effects estimators.

Figure 2 shows identified sets for the slope coefficient $\theta$ in the logit model. The figures agree with the well-known result that the model parameter is point identified when $T \geq 2$, e.g., Andersen (1973). The fixed effects estimator is inconsistent and has a probability limit that is biased away from zero. For example, for $T = 2$ it coincides with the value $2\theta^*$ obtained by Andersen (1973). For $T > 2$, the proportionality $\bar{\theta} = c\theta^*$ for some constant $c$ breaks down.

Identified sets for marginal effects are plotted in Figures 3 to 7, together with the probability limits of fixed effects maximum likelihood estimators (Figures 4 to 6) and linear probability model estimators (Figure 7).¹ Figure 3 shows identified sets based on the general conditional mean model. The bounds of these sets are obtained using the general bounds (G-bound) for binary regressors in (6), and by imposing the monotonicity restriction on $\delta(\alpha)$ in Lemma 3 (GM-bound). In this example the monotonicity restriction has important identification content in reducing the size of the bounds.
Figures 4 to 6 show that marginal effects are point identified for individuals with switches in the value of the regressor, and nonlinear fixed effects estimators are consistent for these effects. This numerical finding suggests that the consistency result for nonlinear fixed effects estimators extends to more than two periods. Unless $\theta^* = 0$, marginal effects for individuals without switches in the regressor are not point identified, which also precludes point identification of the average effect. Nonlinear fixed effects estimators are biased toward zero for the unidentified effects, and have probability limits that usually lie outside of the identified set. However, both the size of the identified sets and the asymptotic biases of these estimators shrink very fast with the number of time periods. In Figure 7 we see that linear probability model estimators have probability limits that usually fall outside the identified set for the marginal effect.

For the probit, Figure 8 shows that the model parameter is not point identified, but the size of the identified set shrinks very fast with the number of time periods. The identified sets and limits of fixed effects estimators in Figures 9 to 13 are analogous to the results for logit.

¹ We consider the version of the linear probability model that allows for individual specific slopes in addition to the fixed effects.
7 Estimation

7.1 Minimum Distance Estimator

In multinomial models with discrete regressors the complete description of the DGP is provided by the parameter vector $(\rho', \rho_X')'$, where

$$\rho = (\rho_{jk},\ j = 1,\dots,J,\ k = 1,\dots,K)', \qquad \rho_X = (\rho_k,\ k = 1,\dots,K)',$$

with $\rho_{jk} = \Pr(Y = Y^j \mid X = X^k)$ and $\rho_k = \Pr(X = X^k)$. We denote the true value of this parameter vector by $(\mathcal{P}', \mathcal{P}_X')'$ and the nonparametric empirical estimates by $(\hat{P}', \hat{P}_X')'$. As is common in regression analysis, we condition on the observed distribution of $X$ by setting the true value of the probabilities of $X$ to the empirical ones, that is,

$$\rho_X = \hat{P}_X, \qquad \mathcal{P}_X = \hat{P}_X.$$

Having fixed the distribution of $X$, the DGP is completely described by the conditional choice probabilities $\rho$.
Our minimum distance estimator is the solution to the following quadratic problem:

$$B_n = \left\{\theta \in B : T(\theta; \hat{P}) \leq \min_{\theta' \in B} T(\theta'; \hat{P}) + \epsilon_n\right\},$$

where $B$ is the parameter space, $\epsilon_n$ is a positive cut-off parameter that shrinks to zero with the sample size, as in Chernozhukov, Hong, and Tamer (2007), and

$$T(\theta; \hat{P}) = \min_{Q = (Q_1,\dots,Q_K) \in \mathcal{Q}}\ \sum_{j,k} \omega_{jk}(\hat{P}) \left(\hat{P}_{jk} - \int_C \mathcal{L}(Y^j \mid X^k, \alpha, \theta)\, Q_k(d\alpha)\right)^2,$$

where $\mathcal{Q}$ is the set of conditional distributions for $\alpha$ with $J$ points of support for each covariate value index $k$; that is, for $\mathcal{S}$ the unit simplex in $\mathbb{R}^J$ and $\delta_{km}$ the Dirac delta function at $\alpha_{km}$,

$$\mathcal{Q} = \left\{Q := (Q_1,\dots,Q_K) : Q_k(d\alpha) = \sum_{m=1}^{J} \pi_{km}\, \delta_{km}(\alpha)\, d\alpha,\ (\alpha_{k1},\dots,\alpha_{kJ}) \in C^J,\ (\pi_{k1},\dots,\pi_{kJ}) \in \mathcal{S},\ \forall k\right\}.$$
Here we make use of Lemma 10, which tells us that we can obtain a minimizing solution for each $Q_k$ as a discrete distribution with at most $J$ points of support. Alternatively, we can write more explicitly

$$T(\theta; \hat{P}) = \min_{\substack{\alpha_k = (\alpha_{k1},\dots,\alpha_{kJ}) \in C^J,\ \forall k \\ \pi_k = (\pi_{k1},\dots,\pi_{kJ}) \in \mathcal{S},\ \forall k}} \sum_{j,k} \omega_{jk}(\hat{P}) \left(\hat{P}_{jk} - \sum_{m=1}^{J} \pi_{km} \mathcal{L}(Y^j \mid X^k, \alpha_{km}, \theta)\right)^2. \qquad (26)$$

In the appendix we give a computational algorithm to solve this problem.
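As a complement, here is one way to evaluate the concentrated objective (26) numerically, assuming a user-supplied likelihood function `L_fn` and uniform weights; the softmax parameterization of the simplex and the multistart loop are implementation choices of ours, not the paper's appendix algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def T_loss(theta, P_hat, L_fn, J, K, C=(-6.0, 6.0), restarts=5, seed=0):
    """Sketch of the concentrated objective (26) with uniform weights.

    P_hat : (J, K) empirical conditional choice probabilities.
    L_fn(j, k, a, theta) returns L(Y^j | X^k, a, theta).
    For each k, optimizes over J support points in C and simplex weights,
    the latter kept feasible via a softmax reparameterization.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for k in range(K):
        def cell_loss(z):
            a = np.clip(z[:J], *C)                       # support points
            w = np.exp(z[J:] - z[J:].max()); w /= w.sum()  # simplex weights
            pred = [sum(w[m] * L_fn(j, k, a[m], theta) for m in range(J))
                    for j in range(J)]
            return sum((P_hat[j, k] - pred[j]) ** 2 for j in range(J))
        best = np.inf
        for _ in range(restarts):                        # guard against local minima
            z0 = np.concatenate([rng.uniform(*C, J), rng.normal(size=J)])
            best = min(best, minimize(cell_loss, z0, method="Nelder-Mead").fun)
        total += best
    return total   # B_n collects the theta values with T_loss near its minimum
```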
For estimation and inference it is important to allow for the possibility that the postulated model is not perfectly specified but still provides a good approximation to the true DGP. In this case, when the conditional choice probabilities are misspecified, $B_n$ estimates the identified set for the parameter of the best approximating model to the true DGP with respect to a chi-square distance. This model is obtained by projecting the true DGP $\mathcal{P}$ onto $\mathcal{M}$, the space of conditional choice probabilities that are compatible with the model. In particular, the projection $\mathcal{P}^*$ corresponds to the solution of the minimum distance problem:

$$\mathcal{P}^*(\mathcal{P}) \in \operatorname*{argmin}_{\rho \in \mathcal{M}} W(\rho, \mathcal{P}), \qquad W(\rho, \mathcal{P}) = \sum_{j,k} w_{jk}(\mathcal{P})(\mathcal{P}_{jk} - \rho_{jk})^2, \qquad (27)$$

where

$$\mathcal{M} = \left\{\rho : \rho_{jk} = \sum_{m=1}^{J} \pi_{km} \mathcal{L}(Y^j \mid X^k, \alpha_{km}, \theta),\ (\alpha_{k1},\dots,\alpha_{kJ}) \in C^J,\ (\pi_{k1},\dots,\pi_{kJ}) \in \mathcal{S},\ \theta \in B,\ \forall (j,k)\right\}.$$
To simplify the exposition, we will assume throughout that $\mathcal{P}^*$ is unique. Of course, when $\mathcal{P} \in \mathcal{M}$, then $\mathcal{P}^* = \mathcal{P}$ and the assumption holds trivially.² The identified set for the parameter of the best approximating model is

$$B^* = \left\{\theta \in B : \exists Q \in \mathcal{Q} \text{ s.t. } \int \mathcal{L}(Y^j \mid X^k, \alpha, \theta)\, dQ_k(\alpha) = \mathcal{P}_{jk}^*,\ \forall (j,k)\right\},$$

i.e., the values of the parameter that are compatible with the projected DGP $\mathcal{P}^* = (\mathcal{P}_{jk}^*,\ j = 1,\dots,J,\ k = 1,\dots,K)$. Under correct specification of the semiparametric model, we have that $\mathcal{P}^* = \mathcal{P}$ and $B^* = B$.

We shall use the following assumptions.

² Otherwise, the assumption can be justified using a genericity argument similar to that presented in Newey (1986); see the Appendix. For non-generic values, we can simply select one element of the projection using an additional complete ordering criterion and work with the resulting approximating model. In practice, we never encountered a non-generic value.
Assumption 1: (i) The function $F$ defined in (12) is continuous in $(\alpha, \theta)$, so that the conditional choice probabilities $\mathcal{L}_{jk}(\alpha, \theta) = \mathcal{L}(Y^j \mid X^k, \alpha, \theta)$ are also continuous for all $(j,k)$; (ii) $B^* \subseteq B$ for some compact set $B$; (iii) $\alpha_i$ has support contained in a compact set $C$; and (iv) the weights $\omega_{jk}(\hat{P})$ are continuous in $\hat{P}$ at $\mathcal{P}$, and $0 < \omega_{jk}(\mathcal{P}) < \infty$ for all $(j,k)$.

Assumption 1(i) holds for commonly used semiparametric models such as the logit and probit models. Condition 1(iv) on the weights is satisfied by the chi-square weights $\omega_{jk}(\mathcal{P}) = \mathcal{P}_k/\mathcal{P}_{jk}$ if $\mathcal{P}_{jk} > 0$, for all $(j,k)$.
In some results we also employ the following assumption.

Assumption 2: Every $\theta \in B^*$ is regular at $\mathcal{P}$ in the sense that, for any sequence $\rho_n \to \mathcal{P}$, there exists a sequence $\theta_n \in \operatorname{argmin}_{\theta' \in B} T(\theta'; \rho_n)$ such that $\theta_n \to \theta$.
In a variety of cases the assumption of regularity appears to be a good one. First of all, the assumption holds under point identification, as in the logit model, by the standard consistency argument for maximum likelihood/minimum distance estimators. Second, for probit and other similar models, we can argue that this assumption can also be expected to hold when the true distribution of the individual effect $\alpha_i$ is absolutely continuous, with the exception perhaps of very non-regular parameter spaces and non-generic situations.

To explain the last point, it is convenient to consider a correctly specified model for simplicity. Let the vector of model conditional choice probabilities for $(Y^1,\dots,Y^J)$ be $\mathcal{L}_k(\alpha, \theta) \equiv (\mathcal{L}_{1k}(\alpha, \theta),\dots,\mathcal{L}_{Jk}(\alpha, \theta))'$. Let $\mathcal{L}_k(\theta) \equiv \{\mathcal{L}_k(\alpha, \theta) : \alpha \in C\}$ and let $\mathcal{M}_k(\theta)$ be the convex hull of $\mathcal{L}_k(\theta)$. In the case of probit the specification is non-trivial in the sense that $\mathcal{M}_k(\theta)$ possesses a non-empty interior with respect to the $J$-dimensional simplex. For every $\theta' \in B$ and some $Q'_k$, we have that $\mathcal{L}_{jk}(\theta', Q'_k) = \mathcal{P}_{jk}$ for all $(j,k)$, that is, $(\mathcal{P}_{1k},\dots,\mathcal{P}_{Jk}) \in \mathcal{M}_k(\theta')$ for all $k$. Moreover, under absolute continuity of the true $Q^*$ we must have $(\mathcal{P}_{1k},\dots,\mathcal{P}_{Jk}) \in \operatorname{interior} \mathcal{M}_k(\theta_0)$ for all $k$, where $\theta_0 \in B$ is the true value of $\theta$. Next, for any $\theta'$ in the neighborhood of $\theta_0$, we must have $(\mathcal{P}_{1k},\dots,\mathcal{P}_{Jk}) \in \operatorname{interior} \mathcal{M}_k(\theta')$ for all $k$, and so on. In order for a point $\theta'$ to be located on the boundary of $B$ we must have that $(\mathcal{P}_{1k},\dots,\mathcal{P}_{Jk})$ lies on the boundary of $\mathcal{M}_k(\theta')$ for some $k$. Thus, if the identified set has a dense interior, which we verified numerically in a variety of examples for the probit model, then each point in the identified set must be regular. Indeed, take first a point $\theta'$ in the interior of $B$. Then, for any sequence $\rho_n \to \mathcal{P}$, we must have $(\rho_{1k},\dots,\rho_{Jk}) \in \mathcal{M}_k(\theta')$ for all $k$ for large $n$, so that $T(\theta'; \rho_n) = 0$ for large $n$. Thus, there is a sequence of points $\theta_n$ in $\operatorname{argmin}_{\theta \in B} T(\theta; \rho_n)$ converging to $\theta'$. Now take a point $\theta'$ on the boundary of $B$; then for each $\epsilon > 0$, there is a $\theta''$ in the interior such that $\|\theta' - \theta''\| \leq \epsilon/2$ and such that there is a sequence of points $\theta_n$ in $\operatorname{argmin}_{\theta \in B} T(\theta; \rho_n)$ and a finite number $n(\epsilon)$ such that for all $n \geq n(\epsilon)$, $\|\theta'' - \theta_n\| \leq \epsilon/2$. Thus, for all $n \geq n(\epsilon)$, $\|\theta' - \theta_n\| \leq \epsilon$. Since $\epsilon > 0$ is arbitrary, it follows that $\theta'$ is regular.
We can now give a consistency result for the quadratic estimator.

Theorem 12: If Assumption 1 holds and $\epsilon_n \propto \log n / n$ then

$$d_H(B_n, B^*) = o_P(1),$$

where $d_H$ is the Hausdorff distance between sets,

$$d_H(B_n, B^*) = \max\left\{\sup_{\theta_n \in B_n} \inf_{\theta \in B^*} \|\theta_n - \theta\|,\ \sup_{\theta \in B^*} \inf_{\theta_n \in B_n} \|\theta_n - \theta\|\right\}.$$

Under Assumption 2 the result holds for $\epsilon_n = 0$.
Moreover, under Assumption 1 the model-predicted probabilities are consistent: for any $\theta_n \in B_n$, and each $j$ and $k$,

$$\hat{P}_{jk}^* = \sum_{m=1}^{J} \pi_{km}(\theta_n)\, \mathcal{L}(Y^j \mid X^k, \alpha_{km}(\theta_n), \theta_n) \xrightarrow{p} \mathcal{P}_{jk}^*, \qquad (28)$$

where $\{\alpha_{km}(\theta_n), \pi_{km}(\theta_n),\ \forall k, m\}$ is a solution to the minimum distance problem (26) for any $\epsilon_n \to 0$, and where we assume that $\mathcal{P}^*$ is unique.
7.2 Marginal Effects

We next consider the problem of estimating marginal effects, which is of prime interest to us. An immediate issue that arises is that we cannot directly use the solution to the minimum distance problem to estimate the marginal effects. Indeed, the constraints of the linear programming problems for these effects in (25) may not hold for any $\theta \in B_n$ when $\mathcal{P}$ is replaced by $\hat{P}$, due to sampling variation or under misspecification. In order to resolve the infeasibility issue, we replace the nonparametric estimates $\hat{P}_{jk}$ by the probabilities $\hat{P}_{jk}^*$ predicted by the model, as defined in (28), and we re-target our estimands to the marginal effects defined in the best approximating model.

To describe the estimator of the bounds for the marginal effects, it is convenient to introduce some notation. Let

$$\mu_{\ell k}(\theta, \rho) = \min_{\alpha_k,\, \pi_k} D^{-1} \sum_{m=1}^{J} [F(\bar{x}'\theta + \alpha_{km}) - F(\tilde{x}'\theta + \alpha_{km})]\, \pi_{km} \quad \text{s.t.} \quad \rho_{jk}^* = \sum_{m=1}^{J} \mathcal{L}(Y^j \mid X^k, \alpha_{km}, \theta)\, \pi_{km}\ \forall j, \qquad (29)$$
$$\alpha_k = (\alpha_{k1},\dots,\alpha_{kJ}) \in C^J, \qquad \pi_k = (\pi_{k1},\dots,\pi_{kJ}) \in \mathcal{S},$$
and

$$\mu_{uk}(\theta, \rho) = \max_{\alpha_k,\, \pi_k} D^{-1} \sum_{m=1}^{J} [F(\bar{x}'\theta + \alpha_{km}) - F(\tilde{x}'\theta + \alpha_{km})]\, \pi_{km} \quad \text{s.t.} \quad \rho_{jk}^* = \sum_{m=1}^{J} \mathcal{L}(Y^j \mid X^k, \alpha_{km}, \theta)\, \pi_{km}\ \forall j, \qquad (30)$$
$$\alpha_k = (\alpha_{k1},\dots,\alpha_{kJ}) \in C^J, \qquad \pi_k = (\pi_{k1},\dots,\pi_{kJ}) \in \mathcal{S},$$

where $\rho^* = (\rho_{jk}^*,\ j = 1,\dots,J,\ k = 1,\dots,K)$ denotes the projection of $\rho$ onto $\mathcal{M}$, i.e., $\rho^*(\rho)$ as defined in (27). Thus, the lower and upper bounds on the true marginal effects of the best approximating model take the form

$$\mu_{\ell k}^* = \min_{\theta \in B^*} \mu_{\ell k}(\theta, \mathcal{P}), \qquad \mu_{uk}^* = \max_{\theta \in B^*} \mu_{uk}(\theta, \mathcal{P}).$$
Under correct specification, these correspond to the lower and upper bounds on the marginal effects in (14) and (15). We estimate the bounds by

$$\hat{\mu}_{\ell k} = \min_{\theta \in B_n} \mu_{\ell k}(\theta, \hat{P}), \qquad \hat{\mu}_{uk} = \max_{\theta \in B_n} \mu_{uk}(\theta, \hat{P}).$$
Theorem 13: If Assumption 1 is satisfied and $\epsilon_n \propto \log n / n$ then

$$\hat{\mu}_{\ell k} = \mu_{\ell k}^* + o_p(1), \qquad \hat{\mu}_{uk} = \mu_{uk}^* + o_p(1).$$

Under Assumption 2 the result holds for $\epsilon_n = 0$.
8 Inference

8.1 Modified Projection Method

The following method projects a confidence region for the conditional choice probabilities onto a simultaneous confidence region for all possible marginal effects and other structural parameters. If a single marginal effect is of interest, then this approach is conservative; if all (or many) marginal effects are of interest, then this approach is sharp (or close to sharp). In the next section we present an approach that appears to be sharp, at least in large samples, when a particular single marginal effect is of interest.

It is convenient to describe the approach in two stages.

Stage 1. The nonparametric space $\mathcal{N}$ of conditional choice probabilities is the product of $K$ simplex sets $\mathcal{S}$ of dimension $J$, that is, $\mathcal{N} = \mathcal{S}^K$. Thus we can begin by constructing a confidence region for the true choice probabilities $\mathcal{P}$ by collecting all probabilities $\rho \in \mathcal{N}$ that pass a goodness-of-fit test:

$$CR_{1-\alpha}(\mathcal{P}) = \left\{\rho \in \mathcal{N} : W(\rho, \hat{P}) \leq c_{1-\alpha}(\chi^2_{K(J-1)})\right\},$$

where $c_{1-\alpha}(\chi^2_{K(J-1)})$ is the $(1-\alpha)$-quantile of the $\chi^2_{K(J-1)}$ distribution and $W$ is the goodness-of-fit statistic:

$$W(\rho, \hat{P}) = n \sum_{j,k} \hat{P}_k \frac{(\hat{P}_{jk} - \rho_{jk})^2}{\rho_{jk}}.$$
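Checking whether a candidate $\rho$ belongs to this region amounts to a chi-square goodness-of-fit test, as in the following sketch (array names ours):

```python
import numpy as np
from scipy.stats import chi2

def in_CR1(rho, P_hat, P_X_hat, n, alpha=0.05):
    """Sketch of the Stage 1 membership check for CR_{1-alpha}(P).

    rho, P_hat : (J, K) candidate and empirical conditional choice probs.
    P_X_hat    : (K,) empirical covariate cell frequencies.
    """
    J, K = P_hat.shape
    W = n * np.sum(P_X_hat[None, :] * (P_hat - rho) ** 2 / rho)
    return W <= chi2.ppf(1 - alpha, df=K * (J - 1))
```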
Stage 2. To construct confidence regions for the marginal effects and any other structural parameters we project each $\rho \in CR_{1-\alpha}(\mathcal{P})$ onto $\mathcal{M}$, the space of conditional choice probabilities that are compatible with the model. We obtain this projection $\rho^*(\rho)$ by solving the minimum distance problem:

$$\rho^*(\rho) = \operatorname*{argmin}_{\bar{\rho} \in \mathcal{M}} W(\bar{\rho}, \rho), \qquad W(\bar{\rho}, \rho) = n \sum_{j,k} \hat{P}_k \frac{(\rho_{jk} - \bar{\rho}_{jk})^2}{\bar{\rho}_{jk}}. \qquad (31)$$

The confidence regions are then constructed from the projections of all the choice probabilities in $CR_{1-\alpha}(\mathcal{P})$. For the identified set of the model parameter, for example, for each $\rho \in CR_{1-\alpha}(\mathcal{P})$ we solve

$$B^*(\rho) = \left\{\theta \in B : \exists Q \in \mathcal{Q} \text{ s.t. } \int \mathcal{L}(Y^j \mid X^k, \alpha, \theta)\, dQ_k(\alpha) = \rho_{jk}^*(\rho),\ \forall (j,k)\right\}. \qquad (32)$$

Denote the resulting confidence region by

$$CR_{1-\alpha}(B^*) = \bigcup \{B^*(\rho) : \rho \in CR_{1-\alpha}(\mathcal{P})\}.$$

We may interpret this set as a confidence region for the set $B^*$ collecting all values of $\theta$ that are compatible with the best approximating model $\mathcal{P}^*$. Under correct specification, $B^*$ is just the identified set $B$.
If we are interested in bounds on the marginal effects, for each $\rho \in CR_{1-\alpha}(\mathcal{P})$ we get

$$\mu_{\ell k}^*(\rho) = \min_{\theta \in B^*(\rho)} \mu_{\ell k}(\theta, \rho), \qquad \mu_{uk}^*(\rho) = \max_{\theta \in B^*(\rho)} \mu_{uk}(\theta, \rho), \qquad k = 1,\dots,K.$$

Denote the resulting confidence regions by

$$CR_{1-\alpha}[\mu_{\ell k}^*, \mu_{uk}^*] = \bigcup \{[\mu_{\ell k}^*(\rho), \mu_{uk}^*(\rho)] : \rho \in CR_{1-\alpha}(\mathcal{P})\}.$$

These sets are confidence regions for the sets $[\mu_{\ell k}^*, \mu_{uk}^*]$, where $\mu_{\ell k}^*$ and $\mu_{uk}^*$ are the lower and upper bounds on the marginal effects induced by any best approximating model in $(B^*, \mathcal{P}^*)$. Under correct specification, these will include the lower and upper bounds on the marginal effect $[\mu_{\ell k}, \mu_{uk}]$ induced by any true model in $(B, \mathcal{P})$.
In a canonical projection method we would implement the second stage by simply intersecting $CR_{1-\alpha}(\mathcal{P})$ with $\mathcal{M}$, but this may give an empty intersection either in finite samples or under misspecification. We avoid this problem by using the projection step instead of the intersection, and also by re-targeting our confidence regions onto the best approximating model. In order to state the result about the validity of our modified projection method in large samples, let $\mathcal{N}_\epsilon$ be the set of $\rho$ vectors with all components bounded away from zero by some $\epsilon > 0$.
Theorem 14: Suppose Assumption 1 holds. Then, for any sequence of true parameter values $(\mathcal{P}', \mathcal{P}_X')'$ with $\mathcal{P} \in \mathcal{N}_\epsilon$,

$$\lim_{n \to \infty} \Pr\{\mathcal{P} \in CR_{1-\alpha}(\mathcal{P})\} = 1 - \alpha, \quad \text{and} \quad \{\mathcal{P} \in CR_{1-\alpha}(\mathcal{P})\} \subseteq \{B^* \subseteq CR_{1-\alpha}(B^*)\} \cap \{[\mu_{\ell k}^*, \mu_{uk}^*] \subseteq CR_{1-\alpha}[\mu_{\ell k}^*, \mu_{uk}^*],\ \forall k\},$$

so that the latter events hold with probability converging to at least $1 - \alpha$.
8.2 Perturbed Bootstrap

In this section we present an approach that appears to be sharper than the projection method, at least in large samples, when a particular single marginal effect is of interest. The estimators for parameters and marginal effects are obtained by nonlinear programming subject to data-dependent constraints that are modified to respect the constraints of the model. The distributions of these highly complex estimators are not tractable, and are also non-regular in the sense that the limit versions of these distributions do not vary with perturbations of the DGP in a continuous fashion. This implies that the usual bootstrap is not consistent. To overcome all of these difficulties we rely on a variation of the bootstrap, which we call the perturbed bootstrap.

The usual bootstrap computes the critical value, the $\alpha$-quantile of the distribution of a test statistic, given a consistently estimated data generating process (DGP). If this critical value is not a continuous function of the DGP, the usual bootstrap fails to consistently estimate the critical value. We instead consider the perturbed bootstrap, where we compute a set of critical values generated by suitable perturbations of the estimated DGP and then take the most conservative critical value in the set. If the perturbations cover at least one DGP that gives a more conservative critical value than the true DGP does, then this approach yields a valid inference procedure.

The approach outlined above is most closely related to the Monte-Carlo inference approach of Dufour (2006); see also Romano and Wolf (2000) for a finite-sample inference procedure for the mean that has a similar spirit. In the set-identified context, this approach was first applied in the MIT thesis work of Rytchkov (2007); see also Chernozhukov (2007).
Recall that the complete description of the DGP is provided by the parameter vector $(\rho', \rho_X')'$, where $\rho = (\rho_{jk},\ j = 1,\dots,J,\ k = 1,\dots,K)'$, $\rho_X = (\rho_k,\ k = 1,\dots,K)'$, $\rho_{jk} = \Pr(Y = Y^j \mid X = X^k)$, and $\rho_k = \Pr(X = X^k)$. The true value of the parameter vector is $(\mathcal{P}', \mathcal{P}_X')'$ and the nonparametric empirical estimate is $(\hat{P}', \hat{P}_X')'$. As before, we condition on the observed distribution of $X$ and thus set $\rho_X = \hat{P}_X$ and $\mathcal{P}_X = \hat{P}_X$.
We consider the problem of performing inference on a real parameter \mu. For example, \mu can be an upper (or lower) bound on the marginal effect \mu_k, such as

\overline{\mu}(\eta) = \max_{\theta \in B^*(\eta), \, Q_k \in \mathcal{Q}} D^{-1} \int [F(\bar{x}'\theta + \alpha) - F(\underline{x}'\theta + \alpha)] \, Q_k(d\alpha) \quad \text{s.t.} \quad L_{jk}(\theta, Q_k) = \eta^*_{jk}, \; \forall j,

where \eta^* = (\eta^*_{jk}, j = 1, \ldots, J, k = 1, \ldots, K) denotes the projection of \eta onto the model space, as defined in (31), and B^*(\eta) is the corresponding projection for the identified set of the parameter \theta, defined as in (32). Alternatively, \mu can be an upper (or lower) bound on a scalar functional c'\theta of the parameter \theta. Then we define \overline{\mu}(\eta) = \max_{\theta \in B^*(\eta)} c'\theta.
As before, we project onto the model space in order to address the problem of infeasibility of the constraints defining the parameters of interest under misspecification or sampling error. Under misspecification, we interpret our inference as targeting the parameters of interest in the best approximating model.
In order to perform inference on the true value \mu(T) of the parameter, we use the statistic

S_n = \hat{\mu} - \mu,

where \hat{\mu} = \mu(P). Let G_n(s, \eta) denote the distribution function of S_n(\eta) = \hat{\mu} - \mu(\eta) when the data follow the DGP \eta. The goal is to estimate the distribution of the statistic S_n under the true DGP \eta = T, that is, to estimate G_n(s, T).
The method proceeds by constructing a confidence region CR_{1-\gamma}(T) that contains the true DGP T with probability 1 - \gamma, close to one. For efficiency purposes, we also want the confidence region to be an efficient estimator of T, in the sense that, as n \to \infty, d_H(CR_{1-\gamma}(T), T) = O_p(n^{-1/2}), where d_H is the Hausdorff distance between sets. Specifically, in our case we use

CR_{1-\gamma}(T) = \{ \eta : W(\eta, P) \le c_{1-\gamma}(\chi^2_{K(J-1)}) \},
where c_{1-\gamma}(\chi^2_{K(J-1)}) is the (1 - \gamma)-quantile of the \chi^2_{K(J-1)} distribution and W is the goodness-of-fit statistic:

W(\eta, P) = n \sum_{j,k} P_k \frac{(P_{jk} - \eta_{jk})^2}{\eta_{jk}}.
Then we define the estimates of lower and upper bounds on the quantiles of G_n(s, T) as

\underline{G}_n^{-1}(\alpha, T) / \overline{G}_n^{-1}(\alpha, T) = \inf / \sup_{\eta \in CR_{1-\gamma}(T)} G_n^{-1}(\alpha, \eta), \qquad (33)

where G_n^{-1}(\alpha, \eta) = \inf\{ s : G_n(s, \eta) \ge \alpha \} is the \alpha-quantile of the distribution function G_n(s, \eta).
Then we construct a (1 - \alpha - \gamma) \cdot 100\% confidence region for the parameter of interest as

CR_{1-\alpha-\gamma}(\mu) = [l, u],

where, for \alpha = \alpha_1 + \alpha_2,

l = \hat{\mu} - \overline{G}^{-1}(1 - \alpha_1, T), \qquad u = \hat{\mu} - \underline{G}^{-1}(\alpha_2, T).

This formulation allows for both one-sided intervals (either \alpha_1 = 0 or \alpha_2 = 0) or two-sided intervals (\alpha_1 = \alpha_2 = \alpha/2).
The following theorem shows that this method delivers (uniformly) valid inference on the
parameter of interest.
Theorem 15: Suppose Assumption 1 holds. Then, for any sequence of true parameter values T_0 = (T_Y', T_X')' \in \Gamma_\delta,

\lim_{n \to \infty} \Pr_{P_0}\big( \mu(T) \in [l, u] \big) \ge 1 - \alpha - \gamma.
In practice, we use the following computational approximation to the procedure described above (a code sketch follows the list):

1. Draw a potential DGP \eta_r = (\eta_{r1}', \ldots, \eta_{rK}')', where \eta_{rk} \sim \mathcal{M}(n P_k, (P_{1k}, \ldots, P_{Jk})) / (n P_k) and \mathcal{M} denotes the multinomial distribution.

2. Keep \eta_r if it passes the chi-square goodness of fit test at the \gamma level, using K(J-1) degrees of freedom, and proceed to the next step. Otherwise reject, and repeat step 1.

3. Estimate the distribution G_n(s, \eta_r) of S_n(\eta_r) by simulation under the DGP \eta_r.

4. Repeat steps 1 to 3 for r = 1, \ldots, R, obtaining G_n(s, \eta_r), r = 1, \ldots, R.

5. Let \underline{G}^{-1}(\alpha, T) / \overline{G}^{-1}(\alpha, T) = \min / \max\{ G_n^{-1}(\alpha, \eta_1), \ldots, G_n^{-1}(\alpha, \eta_R) \}, and construct a (1 - \alpha - \gamma) confidence region for the parameter of interest as CR_{1-\alpha-\gamma}(\mu) = [l, u], where l = \hat{\mu} - \overline{G}^{-1}(1 - \alpha_1, T), u = \hat{\mu} - \underline{G}^{-1}(\alpha_2, T), and \alpha_1 + \alpha_2 = \alpha.
The computational approximation algorithm succeeds whenever it generates at least one draw of the DGP \eta_r that gives more conservative estimates of the tail quantiles than the true DGP does, namely

[G^{-1}(\alpha_2, T), G^{-1}(1 - \alpha_1, T)] \subseteq [G^{-1}(\alpha_2, \eta_r), G^{-1}(1 - \alpha_1, \eta_r)].
9 Empirical Example
We now turn to an empirical application of our methods to a binary choice panel model of female
labor force participation. It is based on a sample of married women in the National Longitudinal
Survey of Youth 1979 (NLSY79). We focus on the relationship between participation and the
presence of young children in the years 1990, 1992, and 1994. The NLSY79 data set is convenient for applying our methods because it provides a relatively homogeneous sample of women between 25 and 33 years old in 1990, which reduces the extent of other potential confounding factors that may affect the participation decision, such as the age profile, and that are more difficult to incorporate in our methods. Other studies that estimate similar models of participation in panel data include Heckman and MaCurdy (1980), Heckman and MaCurdy (1982), Chamberlain (1984), Hyslop (1999), Chay and Hyslop (2000), Carrasco (2001), Carro (2007), and Fernández-Val (2008).
The sample consists of 1,587 married women. Only women continuously married, not students or in the active forces, and with complete information on the relevant variables in the entire sample period are selected from the survey. Descriptive statistics for the sample are shown in Table 2. The labor force participation variable (LFP) is an indicator that takes the value one if the woman's employment status is "in the labor force" according to the CPS definition, and zero otherwise. The fertility variable (kids) indicates whether the woman has any child less than 3 years old. We focus on very young preschool children because most empirical studies find that their presence has the strongest impact on the mother's participation decision. LFP is stable across the years considered, whereas kids is increasing. The proportion of women that change fertility status grows steadily with the number of time periods of the panel, but there are still 49% of the women in the sample for which the effect of fertility is not identified after 3 periods.
The empirical specification we use is similar to Chamberlain (1984). In particular, we estimate the following equation:

LFP_{it} = 1\{ \theta \, kids_{it} + \alpha_i + \epsilon_{it} \ge 0 \}, \qquad (34)
where \alpha_i is an individual specific effect. The parameters of interest are the marginal effects of fertility on participation for different groups of individuals, including the entire population. These effects are estimated using the general conditional mean model and the semiparametric logit and probit models described in Sections 2 and 5, together with linear and nonlinear fixed effects estimators. Analytical and Jackknife large-T bias corrections are also considered, and conditional fixed effects estimates are reported for the logit model.³ The estimates from the general model impose monotonicity of the effects. For the semiparametric estimators, we use the algorithm described in the appendix with penalty \epsilon_n = 1/(n \log n) and iterate the quadratic program 3 times with initial weights w_{jk} = n P_k. This iteration makes the estimates insensitive to the penalty and weighting. We search over discrete distributions with 23 support points at \{-\infty, -4, -3.6, \ldots, 3.6, 4, \infty\} in the quadratic problem, and with 163 support points at \{-\infty, -8, -7.9, \ldots, 7.9, 8, \infty\} in the linear programming problems. The estimates are based on panels of 2 and 3 time periods, both of them starting in 1990.

³ The analytical corrections use the estimators of the bias based on expected quantities in Fernández-Val (2008). The Jackknife bias correction uses the procedure described in Hahn and Newey (2004).
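As a quick check of the grid construction just described, the following lines (an illustration, not the authors' code) build both supports and confirm the stated point counts; the infinite endpoints capture mass escaping to the boundary, where the link F evaluates to 0 or 1.

    import numpy as np

    # 21 interior points from -4 to 4 in steps of 0.4, plus the two infinite endpoints;
    # 161 interior points from -8 to 8 in steps of 0.1, plus the endpoints.
    grid_quadratic = np.concatenate(([-np.inf], np.linspace(-4.0, 4.0, 21), [np.inf]))
    grid_linear = np.concatenate(([-np.inf], np.linspace(-8.0, 8.0, 161), [np.inf]))
    assert grid_quadratic.size == 23 and grid_linear.size == 163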
Tables 3 and 4 report estimates of the model parameters and marginal effects for 2 and 3 period panels, together with 95% confidence regions obtained using the procedures described in the previous section. For the general model these regions are constructed using the normal approximation (95% N) and the nonparametric bootstrap with 200 repetitions (95% B). For the logit and probit models, the confidence regions are obtained by the modified projection method (95% MP), where the confidence interval for T in the first stage is approximated by 50,000 DGPs drawn from the empirical multinomial distributions that pass the goodness of fit test; and by the perturbed bootstrap method (95% PB) with R = 100, \gamma = .01, \alpha_1 = \alpha_2 = .02, and 200 simulations from each DGP to approximate the distribution of the statistic. We also include
confidence intervals obtained by a canonical projection method (95% CP) that intersects the nonparametric confidence interval for T with the space M of probabilities compatible with the semiparametric model:

CR_{1-\alpha}(T) = \{ \eta \in M : W(\eta, P) \le c_{1-\alpha}(\chi^2_{K(J-1)}) \}.
For the fixed effects estimators, the confidence regions are based on the asymptotic normal approximation. The semiparametric estimates are shown for \epsilon_n = 0, i.e., for the solution that gives the minimum value in the quadratic problem.
Overall, we find that the estimates and confidence regions based on the general conditional mean model are too wide to provide informative evidence about the relationship between participation and fertility for the entire population. The semiparametric estimates seem to offer a good compromise, producing more accurate results without adding too much structure to the model. Thus, these estimates are always inside the confidence regions of the general model and do not suffer important efficiency losses relative to the more restrictive fixed effects estimates. Another salient feature of the results is that the misspecification problem of the canonical projection method clearly arises in this application. Thus, this procedure gives empty confidence regions for the panel with 3 periods. The modified projection and perturbed bootstrap methods produce similar (non-empty) confidence regions for the model parameters and marginal effects.
10 Possible Extensions
Our analysis is so far confined to models with only discrete explanatory variables. It would be interesting to extend the analysis to models with continuous explanatory variables. It may be possible to come up with a sieve-type modification. We expect to obtain a consistent estimator of the bound by applying the semiparametric method combined with an increasing number of partitions of the support of the explanatory variables, but we do not yet have a proof. Empirical likelihood based methods should work in a straightforward manner if the panel model of interest is characterized by a set of moment restrictions instead of a likelihood. We may be able to improve the finite-sample properties of our confidence regions by using Bartlett-type corrections.
11 Appendix
11.1 Proofs
Proof of Theorem 1: By eq. (3),

\sum_t (X^k_t - r_k) E[Y_{it} \mid X_i = X^k] = T r_k (1 - r_k) \int m(1, \alpha) Q^*_k(d\alpha) + T (1 - r_k)(-r_k) \int m(0, \alpha) Q^*_k(d\alpha) = T \sigma_k^2 \Delta_k. \qquad (35)

Note also that \bar{X}_i = r_k when X_i = X^k. Then by the law of large numbers,

\sum_{i,t} (X_{it} - \bar{X}_i)^2 / n \to_p E\big[\sum_t (X_{it} - \bar{X}_i)^2\big] = \sum_k T_k \sum_t (X^k_t - r_k)^2 = \sum_k T_k T \sigma_k^2,

\sum_{i,t} (X_{it} - \bar{X}_i) Y_{it} / n \to_p E\big[\sum_t (X_{it} - \bar{X}_i) Y_{it}\big] = \sum_k T_k \sum_t (X^k_t - r_k) E[Y_{it} \mid X_i = X^k] = \sum_k T_k T \sigma_k^2 \Delta_k.

Dividing and applying the continuous mapping theorem gives the result. Q.E.D.
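The conclusion of Theorem 1 can be checked numerically. The sketch below (a toy DGP chosen purely for illustration, not taken from the paper) simulates a binary-regressor panel with two-point heterogeneity and compares the within estimator with the variance-weighted average of cell effects implied by the theorem; both are away from the unweighted average effect of 1.

    import numpy as np

    rng = np.random.default_rng(0)
    n, T = 200_000, 4
    alpha = rng.choice([-1.0, 1.0], size=n)            # individual effect, two cells
    p = np.where(alpha > 0, 0.8, 0.3)                  # Pr(X_it = 1 | alpha_i)
    X = (rng.random((n, T)) < p[:, None]).astype(float)
    # Y_it = m(X_it, alpha_i) + noise, with effect Delta(alpha) = 1 + 0.5 * alpha
    Y = X * (1.0 + 0.5 * alpha[:, None]) + alpha[:, None] + rng.standard_normal((n, T))
    Xdev = X - X.mean(axis=1, keepdims=True)           # within transformation
    beta_w = np.sum(Xdev * Y) / np.sum(Xdev ** 2)      # linear fixed effects estimator
    # Variance-weighted average of cell effects: weights proportional to p(1 - p) per cell.
    theory = (0.21 * 0.5 + 0.16 * 1.5) / (0.21 + 0.16)
    print(round(beta_w, 3), round(theory, 3))          # both close to 0.932, not 1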
Proof of Theorem 2: The set of X_i where \bar{r}_i > 0 and \underline{r}_i > 0 coincides with the set for which X_i = X^k for k \in \mathcal{K}. On this set it will be the case that \bar{r}_i and \underline{r}_i are bounded away from zero. Note also that for \bar{t} such that X^k_{\bar{t}} = \bar{x} we have E[Y_{i\bar{t}} \mid X_i = X^k] = \int m(\bar{x}, \alpha) Q^*_k(d\alpha). Therefore, for \bar{r}_k = \#\{t : X^k_t = \bar{x}\}/T and \underline{r}_k = \#\{t : X^k_t = \underline{x}\}/T, and with \bar{d}_{it} = 1(X_{it} = \bar{x}) and \underline{d}_{it} = 1(X_{it} = \underline{x}), by the law of large numbers,

\frac{1}{n} \sum_{i=1}^n 1(\bar{r}_i > 0) 1(\underline{r}_i > 0) \Big[ \sum_{t=1}^T \frac{\bar{d}_{it} Y_{it}}{T \bar{r}_i} - \sum_{t=1}^T \frac{\underline{d}_{it} Y_{it}}{T \underline{r}_i} \Big] / D \to_p E\Big[ 1(\bar{r}_i > 0) 1(\underline{r}_i > 0) \Big( \sum_{t=1}^T \frac{\bar{d}_{it} E[Y_{it} \mid X_i]}{T \bar{r}_i} - \sum_{t=1}^T \frac{\underline{d}_{it} E[Y_{it} \mid X_i]}{T \underline{r}_i} \Big) \Big] / D

= \sum_{k \in \mathcal{K}} T_k \Big[ \frac{T \bar{r}_k \int m(\bar{x}, \alpha) Q^*_k(d\alpha)}{T \bar{r}_k} - \frac{T \underline{r}_k \int m(\underline{x}, \alpha) Q^*_k(d\alpha)}{T \underline{r}_k} \Big] / D = \sum_{k \in \mathcal{K}} T_k \Delta_k,

\frac{1}{n} \sum_{i=1}^n 1(\bar{r}_i > 0) 1(\underline{r}_i > 0) \to_p E[1(\bar{r}_i > 0) 1(\underline{r}_i > 0)] = \sum_{k \in \mathcal{K}} T_k.

Dividing and applying the continuous mapping theorem gives the result. Q.E.D.
Proof of Lemma 3: As before let Q^*_k(\alpha) denote the conditional CDF of \alpha given X_i = X^k. Note that

m^k_t = \frac{E[Y_{it} \mid X_i = X^k]}{D} = \frac{\int m(X^k_t, \alpha) Q^*_k(d\alpha)}{D}.

Also we have

\Delta_k = \int \Delta(\alpha) Q^*_k(d\alpha) = \frac{\int m(\bar{x}, \alpha) Q^*_k(d\alpha)}{D} - \frac{\int m(\underline{x}, \alpha) Q^*_k(d\alpha)}{D}.

Then if there are \bar{t} and \underline{t} such that X^k_{\bar{t}} = \bar{x} and X^k_{\underline{t}} = \underline{x},

m^k_{\bar{t}} - m^k_{\underline{t}} = \frac{\int m(\bar{x}, \alpha) Q^*_k(d\alpha)}{D} - \frac{\int m(\underline{x}, \alpha) Q^*_k(d\alpha)}{D} = \Delta_k.

Also, if B_\ell \le m(x, \alpha)/D \le B_u, then for each k,

B_\ell \le \frac{\int m(\bar{x}, \alpha) Q^*_k(d\alpha)}{D} \le B_u, \qquad -B_u \le -\frac{\int m(\underline{x}, \alpha) Q^*_k(d\alpha)}{D} \le -B_\ell.

Then if there is \bar{t} such that X^k_{\bar{t}} = \bar{x} we have

m^k_{\bar{t}} - B_u = \frac{\int m(\bar{x}, \alpha) Q^*_k(d\alpha)}{D} - B_u \le \Delta_k \le \frac{\int m(\bar{x}, \alpha) Q^*_k(d\alpha)}{D} - B_\ell = m^k_{\bar{t}} - B_\ell.

The second inequality in the statement of the theorem follows similarly.

Next, if \Delta(\alpha) has the same sign for all \alpha and if for some k^* there are \bar{t} and \underline{t} such that X^{k^*}_{\bar{t}} = \bar{x} and X^{k^*}_{\underline{t}} = \underline{x}, then sgn(\Delta(\alpha)) = sgn(\Delta_{k^*}). Furthermore, since sgn(\Delta_k) = sgn(\Delta_{k^*}) is then known for all k, if it is positive the lower bounds, which are nonpositive, can be replaced by zero, while if it is negative the upper bounds, which are nonnegative, can be replaced by zero. Q.E.D.
Proof of Theorem 4: See text.
Proof of Theorem 5: Let Z_{iT} = \min\{ \sum_{t=1}^T 1(X_{it} = \bar{x})/T, \sum_{t=1}^T 1(X_{it} = \underline{x})/T \}. Note that if Z_{iT} > 0 then 1(A_{iT}) = 1 for the event A_{iT} that there exist \bar{t} and \underline{t} such that X_{i\bar{t}} = \bar{x} and X_{i\underline{t}} = \underline{x}. By the ergodic theorem and continuity of the minimum, conditional on \alpha_i we have Z_{iT} \to_{as} b(\alpha_i) = \min\{ \Pr(X_{it} = \bar{x} \mid \alpha_i), \Pr(X_{it} = \underline{x} \mid \alpha_i) \} > 0. Therefore \Pr(A_{iT} \mid \alpha_i) \ge \Pr(Z_{iT} > 0 \mid \alpha_i) \to 1 for almost all \alpha_i. It then follows by the dominated convergence theorem that

\Pr(A_{iT}) = E[\Pr(A_{iT} \mid \alpha_i)] \to 1.

Also note that \Pr(A_{iT}) = 1 - T_0 - \sum_{k \in \bar{\mathcal{K}}} T_k - \sum_{k \in \underline{\mathcal{K}}} T_k, so that

|\bar{\mu}_0 - \mu_0| \le (B_u - B_\ell) \Big( T_0 + \sum_{k \in \bar{\mathcal{K}}} T_k + \sum_{k \in \underline{\mathcal{K}}} T_k \Big) \to 0. \quad Q.E.D.
Proof of Theorem 6: Let T_1 and T_K be as in equation (6). By the Markov assumption,

T_1 = \Pr(X_{i1} = \cdots = X_{iT} = 0) = E[\Pr(X_{i1} = \cdots = X_{iT} = 0 \mid \alpha_i)]
= E\Big[ \prod_{t=J+1}^T \Pr(X_{it} = 0 \mid X_{i,t-1} = \cdots = X_{i,t-J} = 0, \alpha_i) \cdot \Pr(X_{i1} = \cdots = X_{iJ} = 0 \mid \alpha_i) \Big] \le E[(p^1_i)^{T-J}].

Similarly, T_K \le E[(p^K_i)^{T-J}]. The first bound then follows as in (6). The second bound then follows from the condition that p^k_i is bounded away from one for k \in \{1, K\}. Now suppose that there is a set A of possible \alpha_i such that \Pr(A) > 0, Q_i = \Pr(X_{i1} = \cdots = X_{iJ} = 0 \mid \alpha_i) > 0, and p^1_i = 1. Then

T_1 = E[(p^1_i)^{T-J} Q_i] \ge E[1(\alpha_i \in A)(p^1_i)^{T-J} Q_i] = E[1(\alpha_i \in A) Q_i] > 0.

Therefore, for all T the probability T_1 is bounded away from zero, and hence either \underline{\mu} does not converge to \mu_0 or \bar{\mu} does not converge to \mu_0. Q.E.D.
Proof of Theorem 7: Note that every X_i \in A_t(x) has X_{it} = x. Also, the X_{is} for s > t are completely unrestricted by X_i \in A_t(x). Therefore, it follows by the key implication that

E[Y_{it} \mid X_i \in A_t(x)] = \int m(x, \alpha) Q^*(d\alpha \mid X_i \in A_t(x)).

Then by iterated expectations,

\int m(x, \alpha) Q^*(d\alpha) = \bar{T}(x) \int m(x, \alpha) Q^*(d\alpha \mid X_i \in \bar{A}(x)) + \sum_{t=1}^T \Pr(X_i \in A_t(x)) \int m(x, \alpha) Q^*(d\alpha \mid X_i \in A_t(x))
= \bar{T}(x) \int m(x, \alpha) Q^*(d\alpha \mid X_i \in \bar{A}(x)) + E\Big[ \sum_{t=1}^T 1(X_i \in A_t(x)) Y_{it} \Big].

Using the bound and dividing by D then gives

E\Big[ \sum_{t=1}^T 1(X_i \in A_t(x)) Y_{it} \Big]/D + \bar{T}(x) B_\ell \le \int m(x, \alpha) Q^*(d\alpha)/D \le E\Big[ \sum_{t=1}^T 1(X_i \in A_t(x)) Y_{it} \Big]/D + \bar{T}(x) B_u.

Differencing this bound for x = \bar{x} and x = \underline{x} gives the result. Q.E.D.
Proof of Theorem 8: The size of the identified set for the marginal effect is

\Delta_k = \max_{Q_k \in \mathcal{Q}_k, \beta \in B} D^{-1} \int [F(\alpha + \beta) - F(\alpha)] Q_k(d\alpha) - \min_{Q_k \in \mathcal{Q}_k, \beta \in B} D^{-1} \int [F(\alpha + \beta) - F(\alpha)] Q_k(d\alpha),

where \mathcal{Q}_k = \{ Q_k : \int L(Y^j \mid X^k, \beta, \alpha) Q_k(d\alpha) = T_{jk}, j = 1, \ldots, J \}. The feasible set of distributions \mathcal{Q}_k can be further characterized in this case. Let F_T(\alpha, \beta) := (1, F(X^k_1 \beta + \alpha), \ldots, F(X^k_T \beta + \alpha))' and let \mathcal{P}_J(\alpha, \beta) denote the J \times 1 power vector of F_T(\alpha, \beta) including all the different products of the elements of F_T(\alpha, \beta), i.e.,

\mathcal{P}_J(\alpha, \beta) = \Big( 1, \ldots, F(X^k_T \beta + \alpha), F(X^k_1 \beta + \alpha) F(X^k_2 \beta + \alpha), \ldots, \prod_{t=1}^T F(X^k_t \beta + \alpha) \Big)'.

Note that L(Y^j \mid X^k, \beta, \alpha) = \prod_{t=1}^T F(X^k_t \beta + \alpha)^{Y^j_t} [1 - F(X^k_t \beta + \alpha)]^{1 - Y^j_t}, so the model probabilities are linear combinations of the elements of \mathcal{P}_J(\alpha, \beta). Therefore, for \eta_k = (T_{1k}, \ldots, T_{Jk})' we have \mathcal{Q}_k = \{ Q_k : \mathcal{A}_J \int \mathcal{P}_J(\alpha, \beta) Q_k(d\alpha) = \eta_k \}, where \mathcal{A}_J is a J \times J matrix of known constants. The matrix \mathcal{A}_J is nonsingular, so we have:

\mathcal{Q}_k = \Big\{ Q_k : \int \mathcal{P}_J(\alpha, \beta) Q_k(d\alpha) = M_k \Big\},

where the J \times 1 vector M_k = \mathcal{A}_J^{-1} \eta_k is identified from the data.

Now we turn to the analysis of the size of the identified sets. We focus on the case where k = 1, i.e., X^k is a vector of zeros, and a similar argument applies to k = K. For k = 1 we have that F(X^k_t \beta + \alpha) = F(\alpha) for all t, so the power vector only has T + 1 different elements, given by (1, F(\alpha), \ldots, F(\alpha)^T)'. The feasible set simplifies to:

\mathcal{Q}_k = \Big\{ Q_k : \int F(\alpha)^t Q_k(d\alpha) = M_{kt}, \; t = 0, \ldots, T \Big\},

where the moments M_{kt} are identified by the data. Here \int F(\alpha) Q_k(d\alpha) = M_{k1} is fixed on \mathcal{Q}_k, so the size of the identified set is given by:

\Delta_k = \max_{Q_k \in \mathcal{Q}_k, \beta \in B} D^{-1} \int F(\alpha + \beta) Q_k(d\alpha) - \min_{Q_k \in \mathcal{Q}_k, \beta \in B} D^{-1} \int F(\alpha + \beta) Q_k(d\alpha).

By the change of variable Z = F(\alpha), we can express the previous problem in a form that is related to a Hausdorff truncated moment problem:

\Delta_k = \max_{G_k \in \mathcal{G}_k, \beta \in B} D^{-1} \int_0^1 h_\beta(z) G_k(dz) - \min_{G_k \in \mathcal{G}_k, \beta \in B} D^{-1} \int_0^1 h_\beta(z) G_k(dz), \qquad (36)

where \mathcal{G}_k = \{ G_k : \int_0^1 z^t G_k(dz) = M_{kt}, t = 0, \ldots, T \}, h_\beta(z) = F(\beta + F^{-1}(z)), and F^{-1} is the inverse of F.

If the objective function is r times continuously differentiable, h_\beta \in \mathcal{C}^r[0, 1], with uniformly bounded r-th derivative, |h^{(r)}_\beta(z)| \le \bar{h}^r_\beta < \infty, then we can decompose h_\beta using standard approximation theory techniques as

h_\beta(z) = P_\beta(z, T) + R_\beta(z, T), \qquad (37)

where P_\beta(z, T) is the T-degree best polynomial approximation to h_\beta and R_\beta(z, T) is the remainder term of the approximation; see, e.g., Judd (1998), Chap. 3. By Jackson's Theorem the remainder term is uniformly bounded by

|R_\beta(z, T)| \le \frac{(T - r)!}{T!} \Big( \frac{\pi}{4} \Big)^r \bar{h}^r_\beta = O(T^{-r}), \qquad (38)

as T \to \infty, and this is the best possible uniform rate of approximation by a T-degree polynomial.

Next, note that for any G_k \in \mathcal{G}_k the integral \int_0^1 P_\beta(z, T) G_k(dz) is fixed, since the first T moments of Z are fixed on \mathcal{G}_k. Moreover, \int_0^1 P_\beta(z, T) G_k(dz) is fixed over \beta \in B if the parameter is point identified, B = \{\beta_0\}. Then, we have

\Delta_k = \max_{G_k \in \mathcal{G}_k} \int_0^1 R_\beta(z, T) G_k(dz) - \min_{G_k \in \mathcal{G}_k} \int_0^1 R_\beta(z, T) G_k(dz) \le 2 \frac{(T - r)!}{T!} \Big( \frac{\pi}{4} \Big)^r \bar{h}^r_\beta = O(T^{-r}). \qquad (39)

To complete the proof, we need to check the continuous differentiability condition and the point identification of the parameter for the logit model. Point identification follows from Chamberlain (1992). For differentiability, note that for the logit model

h_\beta(z) = \frac{z e^\beta}{1 - (1 - e^\beta) z}, \qquad (40)

with derivatives

h^{(r)}_\beta(z) = r! \, \frac{e^\beta (1 - e^\beta)^{r-1}}{[1 - (1 - e^\beta) z]^{r+1}}. \qquad (41)

These derivatives are uniformly bounded by \bar{h}^r_\beta = r! \, e^{|\beta|} (e^{|\beta|} - 1)^{r-1} < \infty for any finite r. Q.E.D.
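The O(T^{-r}) logic in (38)-(39) can be visualized numerically. The sketch below (illustrative, not from the paper) evaluates the logit transform h_\beta in (40) and measures the uniform error of polynomial approximations of increasing degree; a Chebyshev least-squares fit is used as a standard near-best stand-in for the exact best approximation P_\beta(z, T), which has no closed form here.

    import numpy as np
    from numpy.polynomial import chebyshev

    beta = 1.0
    h = lambda z: z * np.exp(beta) / (1.0 - (1.0 - np.exp(beta)) * z)  # logit h_beta in (40)
    z = np.linspace(0.0, 1.0, 2001)
    for T in (2, 4, 8, 16):
        # Chebyshev fit on [0, 1]: a near-best proxy for the best T-degree approximation.
        p = chebyshev.Chebyshev.fit(z, h(z), deg=T, domain=[0.0, 1.0])
        print(T, float(np.max(np.abs(h(z) - p(z)))))  # uniform remainder shrinks rapidly in T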
Proof of Theorem 9: Note that for T = 2 and X binary, we have that K = 4. Let X^1 = (0, 0), X^2 = (0, 1), X^3 = (1, 0), and X^4 = (1, 1). By Lemma 3, \Delta^I is identified by

\Delta^I = T^*_2 [\Pr\{Y = (0,1) \mid X^2\} - \Pr\{Y = (1,0) \mid X^2\}] + T^*_3 [\Pr\{Y = (1,0) \mid X^3\} - \Pr\{Y = (0,1) \mid X^3\}].

The probability limit of the fixed effects estimator for this effect is

\bar{\Delta}^I = \sum_{k=2}^3 T^*_k [\Pr\{Y = (0,1) \mid X^k\} + \Pr\{Y = (1,0) \mid X^k\}] \, [F(\bar{\theta}/2) - F(-\bar{\theta}/2)].

The condition for consistency, \bar{\Delta}^I = \Delta^I, can be written as

F(\bar{\theta}/2) = \frac{T_2 \Pr\{Y = (0,1) \mid X^2\} + T_3 \Pr\{Y = (1,0) \mid X^3\}}{\sum_{k=2}^3 T_k [\Pr\{Y = (0,1) \mid X^k\} + \Pr\{Y = (1,0) \mid X^k\}]},

but this is precisely the first order condition of the program (16). This result follows, after some algebra and using the symmetry property of F, by solving the profile problem

\bar{\theta} = \arg\max_\theta \sum_{k=1}^K T_k [\Pr\{Y = (0,1) \mid X^k\} \log F(\Delta X^k \theta / 2) + \Pr\{Y = (1,0) \mid X^k\} \log F(-\Delta X^k \theta / 2)],

where \Delta X^k = X^k_2 - X^k_1. Q.E.D.
Proof of Lemma 10: First, by \theta \in B, we have that \mathcal{T}(\theta; T) = 0, and therefore any Q_k \in \arg\min_{Q_k} \sum_{j=1}^J \omega_{jk}(T) (T_{jk} - L_{jk}(\theta, Q_k))^2 satisfies L_{jk}(\theta, Q_k) = T_{jk} \; \forall j, for each k.

Let the vector of conditional choice probabilities for (Y^1, \ldots, Y^J) be

L_k(\alpha, \theta) \equiv \big( L(Y^1 \mid X^k, \alpha, \theta), \ldots, L(Y^J \mid X^k, \alpha, \theta) \big)'.

Let \mathcal{L}_k(\theta) \equiv \{ L_k(\alpha, \theta) : \alpha \in C \}. Note that, for each \theta \in B, \mathcal{L}_k(\theta) is a closed and bounded set due to compactness of C, and has at most dimension J - 1 since the sum of the elements of L_k(\alpha, \theta) is one. Now, let \mathcal{K}_k(\theta) denote the convex hull of \mathcal{L}_k(\theta). For any \theta \in B we have that there is at least one Q_k such that L_{jk}(\theta, Q_k) = T_{jk} \; \forall j, i.e.,

(T_{1k}, \ldots, T_{Jk}) \in \mathcal{K}_k(\theta).

By Carathéodory's Theorem any point in \mathcal{K}_k(\theta) can be written as a convex combination of at most J vectors located in \mathcal{L}_k(\theta). Then, we can write

(T_{1k}, \ldots, T_{Jk}) = \sum_{m=1}^J \pi_{km} L_k(\alpha_{km}, \theta),

where (\pi_{k1}, \ldots, \pi_{kJ}) is on the unit simplex S of dimension J. Thus, the discrete distribution with J support points at (\alpha_{k1}, \ldots, \alpha_{kJ}) and probabilities (\pi_{k1}, \ldots, \pi_{kJ}) solves the population problem for Q_k. The result also follows from Lindsay (1995, Theorem 18, p. 112, and Theorem 21, p. 116), though Lindsay does not provide proofs for these theorems. Q.E.D.
Proof of Lemma 11: For \theta \in B, let \mathcal{Q}_k = \{ Q_k : L_{jk}(\theta, Q_k) = T_{jk}, j = 1, \ldots, J \}. Let \bar{Q}_k \in \mathcal{Q}_k denote some maximizing value such that

\bar{\mu}_k = D^{-1} \int_C [F(\bar{x}'\theta + \alpha) - F(\underline{x}'\theta + \alpha)] \bar{Q}_k(d\alpha).

Note that, for any \epsilon > 0 we can find a distribution \tilde{Q}^M_k \in \mathcal{Q}_k with a large number M \ge J of support points (\alpha_1, \ldots, \alpha_M) such that

\bar{\mu}_k - \epsilon < D^{-1} \int_C [F(\bar{x}'\theta + \alpha) - F(\underline{x}'\theta + \alpha)] \tilde{Q}^M_k(d\alpha) \le \bar{\mu}_k.

Our goal is to show that given such \tilde{Q}^M_k it suffices to allocate its mass over only at most J support points. Indeed, consider the problem of allocating (\pi_{k1}, \ldots, \pi_{kM}) among (\alpha_1, \ldots, \alpha_M) in order to solve

\max_{(\pi_{k1}, \ldots, \pi_{kM})} \sum_{m=1}^M [F(\bar{x}'\theta + \alpha_m) - F(\underline{x}'\theta + \alpha_m)] \pi_{km}

subject to the constraints:

\pi_{km} \ge 0, \; m = 1, \ldots, M; \qquad \sum_{m=1}^M \pi_{km} L(Y^j \mid X^k, \alpha_m, \theta) = T_{jk}, \; j = 1, \ldots, J; \qquad \sum_{m=1}^M \pi_{km} = 1.

This is a linear program of the form

\max_{\pi \in R^M} c'\pi \quad \text{such that} \quad \pi \ge 0, \; A\pi = b, \; 1'\pi = 1,

and any basic feasible solution to this program has M active constraints, of which at most rank(A) + 1 can be equality constraints. This means that at least M - rank(A) - 1 of the active constraints are of the form \pi_{km} = 0; see, e.g., Theorem 2.3 and Definition 2.9 (ii) in Bertsimas and Tsitsiklis (1997). Hence a basic solution to this linear programming problem will have at least M - J zeroes, that is, at most J strictly positive \pi_{km}'s.⁴ Thus, we have shown that, given the original \tilde{Q}^M_k with M \ge J points of support, there exists a distribution \tilde{Q}^J_k \in \mathcal{Q}_k with just J points of support such that

\bar{\mu}_k - \epsilon < D^{-1} \int_C [F(\bar{x}'\theta + \alpha) - F(\underline{x}'\theta + \alpha)] \tilde{Q}^M_k(d\alpha) \le D^{-1} \int_C [F(\bar{x}'\theta + \alpha) - F(\underline{x}'\theta + \alpha)] \tilde{Q}^J_k(d\alpha) \le \bar{\mu}_k.

This construction works for every \epsilon > 0.

The final claim is that there exists a distribution \tilde{Q}^J_k \in \mathcal{Q}_k with J points of support (\alpha_{k1}, \ldots, \alpha_{kJ}) such that

\bar{\mu}_k = D^{-1} \int_C [F(\bar{x}'\theta + \alpha) - F(\underline{x}'\theta + \alpha)] \tilde{Q}^J_k(d\alpha).

Suppose otherwise; then it must be that

\bar{\mu}_k > \bar{\mu}_k - \delta \ge D^{-1} \int_C [F(\bar{x}'\theta + \alpha) - F(\underline{x}'\theta + \alpha)] \tilde{Q}^J_k(d\alpha),

for some \delta > 0 and for all \tilde{Q}^J_k with J points of support. This immediately contradicts the previous step, where we showed that, for any \epsilon > 0, \bar{\mu}_k and the right hand side can be brought within \epsilon of each other. Q.E.D.

⁴ Note that rank(A) \le J - 1, since \sum_{j=1}^J L(Y^j \mid X^k, \alpha, \theta) = 1. The exact rank of A depends on the sequence X^k, the parameter \theta, the function F, and T. For T = 2 and X binary, for example, rank(A) = J - 2 = 2 when x_1 = x_2, \theta = 0, or F is the logistic distribution; whereas rank(A) = J - 1 = 3 for X^k_1 \ne X^k_2, \theta \ne 0, and F any continuous distribution different from the logistic.
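The basic-solution argument can be seen directly with a linear programming solver. In the toy sketch below (illustrative numbers only; the matrix A stands in for the likelihood constraints, with the adding-up restriction folded in as its last row), a vertex solution returned by a simplex-based solver has at most rank(A) \le J strictly positive weights, exactly as the proof requires.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    M, J = 50, 4
    A = rng.random((J, M))               # stand-in for L(Y^j | X^k, alpha_m, theta)
    A[-1, :] = 1.0                       # adding-up constraint: sum_m pi_m = 1
    b = A @ np.full(M, 1.0 / M)          # feasible by construction (uniform weights)
    c = rng.random(M)                    # stand-in for F(xbar'theta + a_m) - F(xlow'theta + a_m)
    res = linprog(-c, A_eq=A, b_eq=b, bounds=[(0.0, None)] * M, method="highs")
    print(res.status, int(np.sum(res.x > 1e-9)))   # at most J positive weights at a vertex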
Some lemmas are useful for proving Theorem 12.
Lemma A1: Let \mathcal{T}(\theta, Q; \eta) = \sum_{j,k} \omega_{jk}(\eta) (\eta_{jk} - L_{jk}(\theta, Q_k))^2. If Assumption 1 is satisfied then, for \mathcal{Q} equal to the collection of distributions with support contained in a compact set C,

\sup_{\theta \in B, Q \in \mathcal{Q}} |\mathcal{T}(\theta, Q; P) - \mathcal{T}(\theta, Q; T)| = o_P(1).

Proof: Note that we can write

\mathcal{T}(\theta, Q; P) - \mathcal{T}(\theta, Q; T) = \sum_{j,k} \omega_{jk}(P)(P_{jk} - T_{jk})^2 + 2 \sum_{j,k} \omega_{jk}(P)(P_{jk} - T_{jk})(T_{jk} - L_{jk}(\theta, Q_k)) + \sum_{j,k} (\omega_{jk}(P) - \omega_{jk}(T))(T_{jk} - L_{jk}(\theta, Q_k))^2.

The result then follows from P_{jk} - T_{jk} = o_P(1) and \omega_{jk}(P) - \omega_{jk}(T) = o_P(1) by the continuous mapping theorem. Q.E.D.
From Lemma A1, we obtain one-sided uniform convergence:

Lemma A2: Let \mathcal{T}(\theta; \eta) = \inf_{Q \in \mathcal{Q}} \mathcal{T}(\theta, Q; \eta). If Assumption 1 is satisfied then

\sup_{\theta \in B} |\mathcal{T}(\theta; P) - \mathcal{T}(\theta; T)| = o_P(1).

Proof: Let \hat{Q}_\theta \in \arg\inf_{Q \in \mathcal{Q}} \mathcal{T}(\theta, Q; P) and \bar{Q}_\theta \in \arg\inf_{Q \in \mathcal{Q}} \mathcal{T}(\theta, Q; T), so that \mathcal{T}(\theta; P) = \mathcal{T}(\theta, \hat{Q}_\theta; P) and \mathcal{T}(\theta; T) = \mathcal{T}(\theta, \bar{Q}_\theta; T). By definition of \hat{Q}_\theta and \bar{Q}_\theta, we have uniformly in \theta and for all n,

\mathcal{T}(\theta, \hat{Q}_\theta; P) - \mathcal{T}(\theta, \hat{Q}_\theta; T) \le \mathcal{T}(\theta, \hat{Q}_\theta; P) - \mathcal{T}(\theta, \bar{Q}_\theta; T) \le \mathcal{T}(\theta, \bar{Q}_\theta; P) - \mathcal{T}(\theta, \bar{Q}_\theta; T).

Hence

|\mathcal{T}(\theta, \hat{Q}_\theta; P) - \mathcal{T}(\theta, \bar{Q}_\theta; T)| \le \max\big\{ |\mathcal{T}(\theta, \hat{Q}_\theta; P) - \mathcal{T}(\theta, \hat{Q}_\theta; T)|, \; |\mathcal{T}(\theta, \bar{Q}_\theta; P) - \mathcal{T}(\theta, \bar{Q}_\theta; T)| \big\} = o_P(1),

uniformly in \theta by Lemma A1. Q.E.D.
Lemma A3: If Assumption 1 is satisfied then \mathcal{T}(\theta; T) is continuous in \theta.

Proof: By Lemma 10, the problem

\inf_{Q \in \mathcal{Q}} \mathcal{T}(\theta, Q; T)

can be rewritten as

\min_{(\alpha_{1k}, \ldots, \alpha_{Jk}) \in C^J \, \forall k, \; (\pi_{1k}, \ldots, \pi_{Jk}) \in S \, \forall k} \; \sum_{j,k} \omega_{jk}(T) \Big( T_{jk} - \sum_{m=1}^J \pi_{km} L(Y^j \mid X^k, \alpha_{km}, \theta) \Big)^2,

where J and K are finite, and S denotes the unit simplex in R^J. Here, (\alpha_{1k}, \ldots, \alpha_{Jk}) and (\pi_{1k}, \ldots, \pi_{Jk}) characterize discrete distributions with no more than J points of support. Because the objective function is continuous in (\theta, \alpha_{11}, \ldots, \alpha_{JK}, \pi_{11}, \ldots, \pi_{JK}), and because C^{JK} \times S^K is compact, we can apply the theorem of the maximum (e.g., Stokey and Lucas 1989, Theorem 3.6), and obtain the desired conclusion. Q.E.D.
Lemma A4: If Assumption 1 is satisfied then

\sup_{\theta \in B} |\mathcal{T}(\theta; P) - \mathcal{T}(\theta; T)| = O_P(n^{-1}).

Proof: Let \bar{Q}_\theta \in \arg\min_{Q \in \mathcal{Q}} \mathcal{T}(\theta, Q; T). By Lemma 10, we have that T_{jk} = L_{jk}(\theta, \bar{Q}_k) and \mathcal{T}(\theta; T) = 0 \; \forall \theta \in B. Then, we have

\sup_{\theta \in B} |\mathcal{T}(\theta; P) - \mathcal{T}(\theta; T)| = \sup_{\theta \in B} \mathcal{T}(\theta; P) \le \sup_{\theta \in B} \mathcal{T}(\theta, \bar{Q}_\theta; P) = \sum_{j,k} \omega_{jk}(P)(P_{jk} - T_{jk})^2 = O_P(n^{-1}),

where the last equality follows from P_{jk} - T_{jk} = O_P(n^{-1/2}), \omega_{jk}(P) = \omega_{jk}(T) + o_P(1) by the continuous mapping theorem, and J and K being finite. Q.E.D.
Proof of Theorem 12: The consistency result under Assumption 1 and \epsilon_n \propto \log n / n follows from Theorem 3.1 in Chernozhukov, Hong, and Tamer (2007) with a_n = n. Indeed, Condition C.1 in Chernozhukov, Hong, and Tamer (2007) follows by Assumption 1 (B compact), Lemma A3 (\mathcal{T}(\theta; T) continuous), Lemma A2 (uniform convergence of \mathcal{T}(\theta; P) to \mathcal{T}(\theta; T) in B), and Lemma A4 (uniform convergence of \mathcal{T}(\theta; P) to \mathcal{T}(\theta; T) in B at a rate n).

The consistency result under Assumptions 1 and 2 and \epsilon_n = 0 follows from Theorem 3.2 in Chernozhukov, Hong, and Tamer (2007) with a_n = n. It is not difficult to show that Assumption 3.2 implies Condition C.3 in Chernozhukov, Hong, and Tamer (2007), which, along with the other conditions verified above, implies the consistency result.

The second result follows by redefining the estimation problem as

P^* \in \epsilon_n\text{-}\arg\min_{\eta \in M} W(\eta, P), \qquad W(\eta, P) = \sum_{j,k} \omega_{jk}(P)(P_{jk} - \eta_{jk})^2,

where P^* = (P^*_{jk}, j = 1, \ldots, J, k = 1, \ldots, K) and M is the space of conditional choice probabilities that are compatible with the model. Under Assumption 1, M is compact, the function \eta \mapsto W(\eta, P) is continuous for each P in a neighborhood of T, and therefore W(\eta, P) - W(\eta; T) = o_p(1) uniformly in \eta \in M, as P = T + o_p(1). Moreover, W(\eta, T) is uniquely minimized at \eta = T^* by assumption. Therefore, by the consistency theorem for approximate argmin estimators, it follows that the \epsilon_n-argmin P^* is consistent for T^*. Q.E.D.
Proof of Theorem 13: We consider the upper bounds only, since the proof for the lower bounds is analogous. We have that (i) the projection \eta \mapsto \eta^*(\eta) \in \arg\min_{\tilde{\eta} \in M} \sum_{j,k} w_{jk}(\eta)(\eta_{jk} - \tilde{\eta}_{jk})^2 is continuous at T by the theorem of the maximum; (ii) the parameter space for \theta and \alpha is compact; (iii) the function defining the constraints,

(\eta, \theta, \alpha_{k1}, \ldots, \alpha_{kJ}, \pi_{k1}, \ldots, \pi_{kJ}) \mapsto \eta^*_{jk} - \sum_{m=1}^J L(Y^j \mid X^k, \alpha_{km}, \theta) \pi_{km},

is continuous by Assumption 1 and the continuity of the projection; and (iv) the criterion function

(\eta, \theta, \alpha_{k1}, \ldots, \alpha_{kJ}, \pi_{k1}, \ldots, \pi_{kJ}) \mapsto \sum_{m=1}^J [F(\bar{x}'\theta + \alpha_{km}) - F(\underline{x}'\theta + \alpha_{km})] \pi_{km}

is continuous by the assumed continuity of F. Then, using the theorem of the maximum, we conclude that the maximal mapping

(\theta, \eta) \mapsto \overline{\mu}_k(\theta, \eta)

is continuous. By Theorem 12 and the extended continuous mapping theorem we have that

d_H(\hat{B}_n, B^*) \to_p 0, \quad P \to_p T, \quad P^* \to_p T^*,

implies that

d_H(\overline{\mu}_k(\hat{B}_n, P), \overline{\mu}_k(B^*, T)) \to_p 0,

where \overline{\mu}_k(A, \eta) = \{ \overline{\mu}_k(\theta, \eta) : \theta \in A \}. The conclusion of the theorem then immediately follows. Q.E.D.
Proof of Theorem 14: By the uniform central limit theorem, W(T, P) converges in law to \chi^2_{K(J-1)} under any sequence of true DGPs with T_0 in \Gamma_\delta. It follows that

\lim_{n \to \infty} \Pr_{P_0}\{ T \in CR_{1-\alpha}(T) \} = 1 - \alpha.

Further, the event T \in CR_{1-\alpha}(T) implies the event T^* \in \{ \eta^*(\eta) : \eta \in CR_{1-\alpha}(T) \} by construction, which in turn implies the events B^* \subseteq CR_{1-\alpha}(B^*) and [\underline{\mu}^*_k, \overline{\mu}^*_k] \subseteq CR_{1-\alpha}[\underline{\mu}_k, \overline{\mu}_k], \; \forall k. Q.E.D.
Proof of Theorem 15: We have that, for S_n(T) = \hat{\mu} - \mu(T),

\Pr_{P_0}\{ \mu(T) \notin [l, u] \} = \Pr_{P_0}\{ S_n(T) \notin [\underline{G}^{-1}(\alpha_2, T), \overline{G}^{-1}(1 - \alpha_1, T)] \}
\le \Pr_{P_0}\big[ \{ S_n(T) \notin [\underline{G}^{-1}(\alpha_2, T), \overline{G}^{-1}(1 - \alpha_1, T)] \} \cap \{ T \in CR_{1-\gamma}(T) \} \big] + \Pr_{P_0}\{ T \notin CR_{1-\gamma}(T) \}
\le \Pr_{P_0}\{ S_n(T) \notin [G^{-1}(\alpha_2, T), G^{-1}(1 - \alpha_1, T)] \} + \Pr_{P_0}\{ T \notin CR_{1-\gamma}(T) \},

where the last inequality holds because, on the event T \in CR_{1-\gamma}(T), the interval [\underline{G}^{-1}(\alpha_2, T), \overline{G}^{-1}(1 - \alpha_1, T)] contains [G^{-1}(\alpha_2, T), G^{-1}(1 - \alpha_1, T)] by (33). Thus, if \limsup_n \Pr_{P_0}\{ T \notin CR_{1-\gamma}(T) \} \le \gamma, we obtain that \lim_n \Pr_{P_0}\{ \mu(T) \notin [l, u] \} \le \alpha_1 + \alpha_2 + \gamma = \alpha + \gamma, which is the desired conclusion.

It now remains to show that \limsup_n \Pr_{P_0}\{ T \notin CR_{1-\gamma}(T) \} \le \gamma. We have that

\Pr_{P_0}\{ T \notin CR_{1-\gamma}(T) \} = \Pr_{P_0}\{ W(T, P) > c_{1-\gamma}(\chi^2_{K(J-1)}) \}.

By the uniform central limit theorem, W(T, P) converges in law to \chi^2_{K(J-1)} under any sequence T_0 in \Gamma_\delta. Therefore,

\lim_n \Pr_{P_0}\{ W(T, P) > c_{1-\gamma}(\chi^2_{K(J-1)}) \} = \Pr\{ \chi^2_{K(J-1)} > c_{1-\gamma}(\chi^2_{K(J-1)}) \} = \gamma.

Q.E.D.
11.2 Generic Uniqueness of Projections of Probabilities onto the Model Space
The following lemma is motivated by the analysis of Newey (1986) on generic uniqueness of quasi-maximum likelihood population parameter values.

Lemma A5: Let \mathcal{G} be the set of vectors \gamma = (\gamma_{jk}, j = 1, \ldots, J, k = 1, \ldots, K) > 0 that satisfy the system of linear constraints \sum_{j=1}^J \gamma_{jk} = 1, k = 1, \ldots, K. Let proj(\gamma) = \arg\min_{\tilde{\gamma} \in \Lambda} d(\gamma, \tilde{\gamma}), where d(\gamma, \tilde{\gamma}) = \big[ \sum_{k=1}^K \sum_{j=1}^J \omega_{jk} (\gamma_{jk} - \tilde{\gamma}_{jk})^2 \big]^{1/2}, be the projection of \gamma on the set \Lambda, where \omega_{jk} > 0 for all (j, k) are weights normalized so that d is a proper distance, and \Lambda = \cup_{\theta \in B} \Lambda(\theta), where B is compact and \Lambda(\theta) = \{ \gamma : (\gamma_{1k}, \ldots, \gamma_{Jk})' \in \mathcal{K}_k(\theta), \; \forall k \}, where \mathcal{K}_k is defined as in Section 7, with link function F being twice continuously differentiable. The set \mathcal{G}_0 = \{ \gamma \in \mathcal{G} : proj(\gamma) \text{ is unique} \} is an open dense subset of \mathcal{G}.

Proof: We first note that \Lambda is compact and \tilde{\gamma} \mapsto d(\gamma, \tilde{\gamma}) is continuous, so that the minimum is attainable and the projection exists. The rest of the proof has two steps: verification of openness of \mathcal{G}_0 and verification of denseness of \mathcal{G}_0 relative to \mathcal{G}.

To verify openness, we take \gamma_0 \in \mathcal{G}_0 and find an open neighborhood \mathcal{N} of \gamma_0 in \mathcal{G} such that \mathcal{N} \subseteq \mathcal{G}_0. We consider two cases. First, if proj(\gamma_0) is in the interior of \Lambda, then there exists an open neighborhood \mathcal{N}' of \gamma_0 in \Lambda. For each \gamma \in \mathcal{N}', we necessarily have proj(\gamma) = \gamma, so we can take \mathcal{N} = \mathcal{N}'. Second, if proj(\gamma_0) is on the boundary of \Lambda, the verification follows by an argument similar to that given by Newey (1986), p. 7.

To verify denseness, we take \gamma_0 \in \mathcal{G} \setminus \mathcal{G}_0, so that proj(\gamma_0) is not unique. For this to happen it must be that \gamma_0 \notin \Lambda. Take any element \tilde{\gamma}_0 of proj(\gamma_0). Then we can construct a sequence \gamma_n approaching \gamma_0 such that proj(\gamma_n) = \{\tilde{\gamma}_0\}, so that \gamma_n \in \mathcal{G}_0. Indeed, simply take

\gamma_n = \frac{1}{n} \tilde{\gamma}_0 + \frac{n - 1}{n} \gamma_0.

Clearly, \gamma_n \in \mathcal{G} and it approaches \gamma_0. Also, note that by definition \tilde{\gamma}_0 is a point of intersection of \Lambda with the contour set or sphere C_0 = \{ \tilde{\gamma} \in \mathcal{G} : d(\gamma_0, \tilde{\gamma}) = t \} for t = \min_{\tilde{\gamma} \in \Lambda} d(\gamma_0, \tilde{\gamma}). Also, note that the contour set or sphere C_n = \{ \tilde{\gamma} \in \mathcal{G} : d(\gamma_n, \tilde{\gamma}) = t' \}, where t' = \min_{\tilde{\gamma} \in \Lambda} d(\gamma_n, \tilde{\gamma}), lies inside the sphere C_0, since by convexity of the distance

t' \le d(\gamma_n, \tilde{\gamma}_0) \le \frac{1}{n} d(\tilde{\gamma}_0, \tilde{\gamma}_0) + \frac{n - 1}{n} d(\gamma_0, \tilde{\gamma}_0) = \frac{n - 1}{n} t \le t,

with only one common point, C_n \cap C_0 = \{\tilde{\gamma}_0\}. This establishes that proj(\gamma_n) = \{\tilde{\gamma}_0\}. Q.E.D.
11.3 Computation
The quadratic problem (26) can be solved using computational techniques developed for finite mixture models, such as the EM algorithm or vertex direction methods; see, e.g., Laird (1978), Böhning (1995), Lindsay (1995, Chap. 6), and Aitkin (1999). These iterative algorithms, however, are sensitive to initial values and can be very slow to converge in this problem, where we estimate several mixtures over a grid of values for \theta. Moreover, a slow algorithm is especially inconvenient for the resampling based inference that we develop in the next section. The main computational difficulty in mixture problems is to find the location of the support points; see, e.g., Aitkin (1999). Since the mixtures are nuisance parameters in our problem, we propose solving the following penalized quadratic problem:

\mathcal{T}_{\epsilon_n}(\theta; P) = \min_{\pi_{km}} \sum_{j,k} \Big[ w_{jk} \Big( P_{jk} - \sum_{m=1}^M \pi_{km} L(Y^j \mid X^k, \alpha_m, \theta) \Big)^2 + \epsilon_n \sum_{m=1}^M \pi_{km}^2 \Big], \qquad (42)

s.t. \sum_{m=1}^M \pi_{km} = 1, \quad \pi_{km} \ge 0, \quad \forall j, k,

where M is large and \epsilon_n is small. For the weights, we set w_{jk} = n P_k / \sum_{m=1}^M \tilde{\pi}_{km} L(Y^j \mid X^k, \alpha_m, \tilde{\theta}), where (\tilde{\theta}, \tilde{\pi}_{km}, \forall (k, m)) is an initial estimate. This is a convex quadratic programming problem for which there are reliable algorithms that find the solution in polynomial time; see, e.g., the quadprog package in R (Weingessel, 2007). The penalty \epsilon_n acts by choosing a distribution among the set of discrete distributions with support contained in a large grid \{\alpha_1, \ldots, \alpha_M\}. In general there is an infinite number of solutions for Q_k; one of them is a discrete distribution with no more than J << M support points, by Lemma 10. Here, instead of searching for the solution with the minimal support, we search over discrete distributions with support points contained in a large partition of the parameter space C. By making the partition fine enough we guarantee that we cover a solution to the problem, without having to find explicitly the location of the support points. The error of the finite grid approximation approaches zero as M \to \infty if C is compact and the objective function has boundable variation with respect to \alpha_m; see, e.g., Lindsay (1995, Chap. 6). The penalty favors distributions with large supports. This regularization therefore addresses the computational difficulties created by the non-identifiability of Q^*_k.

The final estimates of the identified sets for the parameters and marginal effects are computed by solving the linear programming problems (24) and (25) for all the parameter values \theta which satisfy the condition \mathcal{T}_{\epsilon_n}(\theta; P) \le \min_\theta \mathcal{T}_{\epsilon_n}(\theta; P) + \epsilon_n, and replacing the T_{jk}'s by the probabilities predicted by the model, the P^*_{jk}'s, for this parameter value \theta, defined as in (28).
References
[1] Aitkin, M. (1999), "A general maximum likelihood analysis of variance components in generalized linear models," Biometrics 55(1), 117-128.
[2] Alvarez, J., and M. Arellano (2003), "The Time Series and Cross-Section Asymptotics of Dynamic Panel Data Estimators," Econometrica 71, 1121-1159.
[3] Angrist, J. D. (1998), "Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants," Econometrica 66, 249-288.
[4] Andersen, E. (1973), Conditional Inference and Models for Measuring. Copenhagen: Mentalhygiejnisk Forlag.
[5] Beresteanu, A., and F. Molinari (2008), "Asymptotic properties for a class of partially identified models," Econometrica 76(4), 763-814.
[6] Bertsimas, D., and J. N. Tsitsiklis (1997), Introduction to Linear Optimization, Athena Scientific, Belmont, Massachusetts.
[7] Blundell, R., and J. L. Powell (2003), "Endogeneity in Nonparametric and Semiparametric Regression Models," in M. Dewatripont, L. P. Hansen, and S. J. Turnovsky (eds.), Advances in Economics and Econometrics, Cambridge: Cambridge University Press.
[8] Böhning, D. (1995), "A Review of Reliable Maximum Likelihood Algorithms for Semiparametric Mixture Models," Journal of Statistical Planning and Inference 47, 5-28.
[9] Browning, M., and J. Carro (2007), "Heterogeneity and Microeconometrics Modeling," in R. Blundell, W. K. Newey, and T. Persson (eds.), Advances in Theory and Econometrics, Vol. 3, Cambridge: Cambridge University Press.
[10] Carro, J. M. (2007), "Estimating Dynamic Panel Data Discrete Choice Models with Fixed Effects," Journal of Econometrics 140(2), 503-528.
[11] Carrasco, R. (2001), "Binary Choice With Binary Endogenous Regressors in Panel Data: Estimating the Effect of Fertility on Female Labor Participation," Journal of Business and Economic Statistics 19(4), 385-394.
[12] Chamberlain, G. (1980), "Analysis of Covariance with Qualitative Data," Review of Economic Studies 47, 225-238.
[13] Chamberlain, G. (1982), "Multivariate Regression Models for Panel Data," Journal of Econometrics 18, 5-46.
[14] Chamberlain, G. (1984), "Panel Data," in Z. Griliches and M. Intriligator (eds.), Handbook of Econometrics. Amsterdam: North-Holland.
[15] Chamberlain, G. (1992), "Binary Response Models for Panel Data: Identification and Information," unpublished manuscript.
[16] Chay, K. Y., and D. R. Hyslop (2000), "Identification and Estimation of Dynamic Binary Response Panel Data Models: Empirical Evidence using Alternative Approaches," unpublished manuscript, University of California at Berkeley.
[17] Chernozhukov, V. (2007), Course Materials for 14.385 Nonlinear Econometric Analysis, Fall 2007, MIT OpenCourseWare (http://ocw.mit.edu), MIT.
[18] Chernozhukov, V., J. Hahn, and W. K. Newey (2004), "Bound Analysis in Panel Models with Correlated Random Effects," unpublished manuscript.
[19] Chernozhukov, V., H. Hong, and E. Tamer (2007), "Estimation and Confidence Regions for Parameter Sets in Econometric Models," Econometrica 75(5), 1243-1284.
[20] Dufour, J.-M. (2006), "Monte Carlo Tests with Nuisance Parameters: A General Approach to Finite-Sample Inference and Nonstandard Asymptotics," Journal of Econometrics 133, 443-477.
[21] Feller, W. (1943), "On a General Class of Contagious Distributions," Annals of Mathematical Statistics 14, 389-400.
[22] Fernández-Val, I. (2008), "Fixed Effects Estimation of Structural Parameters and Marginal Effects in Panel Probit Models," unpublished manuscript.
[23] Graham, B. W., and J. L. Powell (2008), "Semiparametric Identification and Estimation of Correlated Random Coefficient Models for Panel Data," unpublished manuscript.
[24] Hahn, J. (2001), "Comment: Binary Regressors in Nonlinear Panel-Data Models with Fixed Effects," Journal of Business and Economic Statistics 19, 16-17.
[25] Hahn, J., and G. Kuersteiner (2002), "Asymptotically Unbiased Inference for a Dynamic Panel Model with Fixed Effects when Both n and T Are Large," Econometrica 70, 1639-1657.
[26] Hahn, J., and W. Newey (2004), "Jackknife and Analytical Bias Reduction for Nonlinear Panel Models," Econometrica 72, 1295-1319.
[27] Hahn, J., and G. Kuersteiner (2007), "Bias Reduction for Dynamic Nonlinear Panel Models with Fixed Effects," unpublished manuscript.
[28] Heckman, J. J. (1981), "Statistical Models for Discrete Panel Data," in C. F. Manski and D. McFadden (eds.), Structural Analysis of Discrete Data with Econometric Applications, MIT Press, Cambridge, MA.
[29] Heckman, J. J., and T. E. MaCurdy (1980), "A Life Cycle Model of Female Labor Supply," Review of Economic Studies 47, 47-74.
[30] Heckman, J. J., and T. E. MaCurdy (1982), "Corrigendum on: A Life Cycle Model of Female Labor Supply," Review of Economic Studies 49, 659-660.
[31] Honoré, B. E., and E. Tamer (2006), "Bounds on Parameters in Dynamic Discrete Choice Models," Econometrica 74(3), 611-629.
[32] Hyslop, D. R. (1999), "State Dependence, Serial Correlation and Heterogeneity in Intertemporal Labor Force Participation of Married Women," Econometrica 67(6), 1255-1294.
[33] Judd, K. L. (1998), Numerical Methods in Economics. MIT Press, Cambridge, MA.
[34] Laird, N. (1978), "Nonparametric Maximum Likelihood Estimation of a Mixing Distribution," Journal of the American Statistical Association 73, 805-811.
[35] Lindsay, B. G. (1995), Mixture Models: Theory, Geometry and Applications, NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 5, IMS: Hayward.
[36] Manski, C. F., and E. Tamer (2002), "Inference on Regressions with Interval Data on a Regressor or Outcome," Econometrica 70, 519-546.
[37] McLachlan, G., and D. Peel (2000), Finite Mixture Models. Wiley Series in Probability and Statistics: Applied Probability and Statistics. Wiley-Interscience, New York.
[38] Newey, W. K. (1986), "Generic Uniqueness of the Population Quasi Maximum Likelihood Estimators," unpublished manuscript.
[39] Romano, J. P., and M. Wolf (2000), "Finite sample nonparametric inference and large sample efficiency," Annals of Statistics 28(3), 756-778.
[40] Rytchkov, O. (2007), Essays on Predictability of Stock Returns. Doctoral Dissertation, MIT.
[41] Stokey, N. L., and R. E. Lucas (1989), Recursive Methods in Economic Dynamics, Harvard University Press: Cambridge.
[42] Weingessel, A. (2007), quadprog: Functions to solve Quadratic Programming Problems. R package version 1.4-11. S original by Berwin A. Turlach. http://www.r-project.org.
[43] Wooldridge, J. M. (2002), Econometric Analysis of Cross Section and Panel Data, Cambridge, MA: MIT Press.
[44] Wooldridge, J. M. (2005), "Fixed-Effects and Related Estimators for Correlated Random-Coefficient and Treatment-Effect Panel Data Models," Review of Economics and Statistics 87, 385-390.
[45] Woutersen, T. (2002), "Robustness Against Incidental Parameters," unpublished manuscript.
Table 1: Biases of linear probability model estimators in percentage of the marginal effect (average probability of the response in parentheses)

            p_X = 0.5           p_X = 0.1           p_X = 0.9
 T        β_w      β̃          β_w      β̃          β_w      β̃
 2      34.63    34.63       -91.20   -91.20      -31.07   -31.07
        (0.62)               (0.49)               (0.74)
 4      12.77     9.91       -61.52   -59.77       20.52    25.32
        (0.61)               (0.47)               (0.75)
 8       5.76     0.74       -33.16   -20.40       19.90    30.38
        (0.60)               (0.45)               (0.77)

Notes: Probit model with a single binary regressor with parameter equal to one. The individual effect is the standardized mean of the regressor. β_w is the probability limit of the linear fixed effects estimator with constant slopes and β̃ is the probability limit of the average of the linear fixed effects estimators with individual specific slopes.
Table 2: Descriptive Statistics for NLSY79 sample (n = 1,587)

Variable    Mean   Changes (%)
LFP1990     0.75
LFP1992     0.74      0.17
LFP1994     0.75      0.28
kids1990    0.38
kids1992    0.35      0.31
kids1994    0.45      0.51

Notes: LFP = 1 if the woman is in the labor force, 0 otherwise; kids = 1 if the woman has any child of age less than 3, 0 otherwise. Changes (%) measures the proportion of women who change status between 1990 and the year corresponding to the row.
Table 3: LFP and Fertility (T = 2, n = 1,587)

[Table body not reproduced: for the logit and probit specifications, the table reports, for the parameter θ* and for the marginal effects at each cell X_k and on average, the cell probabilities P(X_k), the estimates with 95% confidence regions (N, B) from the general CEF model, the estimates with 95% confidence regions (CP, MP, PB) from the semiparametric model, and the FE, FE-BC, CMLE, and Linear FE estimates.]

Notes: Dependent variable is the labor force participation indicator; the regressor is a fertility indicator that takes the value 1 if the woman has a child less than 3 years old. Time periods: 1990 and 1992. Source: NLSY79. N denotes normal approximation; B denotes nonparametric bootstrap; CP denotes canonical projection; MP denotes modified projection; PB denotes perturbed bootstrap; FE denotes fixed effects maximum likelihood estimator (FEMLE); FE-BC denotes bias corrected FEMLE; CMLE denotes conditional FEMLE; Linear FE denotes the linear within groups estimator. * 200 bootstrap repetitions. + Based on 50,000 DGPs. ^ Based on 100 DGPs and 200 simulations for each DGP.
Table 4: LFP and Fertility (T = 3, n = 1,587)

[Table body not reproduced: for the logit and probit specifications, the table reports, for the parameter θ* and for the marginal effects at each cell X_k and on average, the cell probabilities P(X_k), the estimates with 95% confidence regions (N, B) from the general CEF model, the estimates with 95% confidence regions (CP, MP, PB) from the semiparametric model, and the FE, FE-BC, Jackknife, CMLE, and Linear FE estimates.]

Notes: Dependent variable is the labor force participation indicator; the regressor is a fertility indicator that takes the value 1 if the woman has a child less than 3 years old. Time periods: 1990, 1992, and 1994. Source: NLSY79. N denotes normal approximation; B denotes nonparametric bootstrap; CP denotes canonical projection; MP denotes modified projection; PB denotes perturbed bootstrap; FE denotes fixed effects maximum likelihood estimator (FEMLE); FE-BC denotes bias corrected FEMLE; Jack. denotes panel Jackknife bias corrected FEMLE; CMLE denotes conditional FEMLE; Linear FE denotes the linear within groups estimator. * 200 bootstrap repetitions. + Based on 50,000 DGPs. ^ Based on 100 DGPs and 200 simulations for each DGP.
Figure 1: Biases of linear probability model estimators in percentage of the marginal effect. Probit model with a single binary regressor with parameter equal to one and individual location effect. The individual effect is the standardized individual mean of the regressor. (Three panels, T = 2, 4, 8, plotting bias in percent against p_X; plot content not reproduced.)

Figure 2: Logit model: Nonparametric MLE identification sets for the model parameter θ_0 and probability limits of fixed effects estimators. (Three panels, T = 2, 3, 4; series: True, NPMLE, FEMLE; plot content not reproduced.)

Figure 3: Logit model: Identification sets for average marginal effects μ_0 based on the general model. G-bounds are obtained using equation (6) and GM-bounds impose monotonicity of the marginal effects. (Three panels, T = 2, 3, 4; series: True, G-Bound, GM-Bound; plot content not reproduced.)
Figure 4: Logit model (T = 2). Identification sets for marginal effects and probability limits of fixed effects estimators. (Panels for X_k = (0,0), (0,1), (1,1), and the average effect; series: True, GM-Bound where available, NPMLE, FEMLE; plot content not reproduced.)
Figure 5: Logit model (T = 3). Identification sets for marginal effects and probability limits of fixed effects estimators. (Panels for X_k = (0,0,0), (0,0,1), (1,1,1), and the average effect; series: True, GM-Bound where available, NPMLE, FEMLE; plot content not reproduced.)
Figure 6: Logit model (T = 4). Identification sets for marginal effects and probability limits of fixed effects estimators. (Panels for X_k = (0,0,0,0), (0,0,0,1), (1,1,1,1), and the average effect; series: True, GM-Bound where available, NPMLE, FEMLE; plot content not reproduced.)

Figure 7: Logit model: Identification sets for average marginal effects and probability limits of linear model estimators. (Three panels, T = 2, 3, 4; series: True, GM-Bound, NPMLE, Linear; plot content not reproduced.)

Figure 8: Probit model: Nonparametric MLE identification sets for the model parameter θ_0 and probability limits of fixed effects estimators. (Three panels, T = 2, 3, 4; series: True, NPMLE, FEMLE; plot content not reproduced.)

Figure 9: Probit model: Identification sets for average marginal effects μ_0 based on the general model. G-bounds are obtained using equation (6) and GM-bounds impose monotonicity of the marginal effects. (Three panels, T = 2, 3, 4; series: True, G-Bound, GM-Bound; plot content not reproduced.)
Figure 10: Probit model (T = 2). Identification sets for marginal effects and probability limits of fixed effects estimators. (Panels for X_k = (0,0), (0,1), (1,1), and the average effect; series: True, GM-Bound where available, NPMLE, FEMLE; plot content not reproduced.)
Figure 11: Probit model (T = 3). Identification sets for marginal effects and probability limits of fixed effects estimators. (Panels for X_k = (0,0,0), (0,0,1), (1,1,1), and the average effect; series: True, GM-Bound where available, NPMLE, FEMLE; plot content not reproduced.)
Figure 12: Probit model (T = 4). Identification sets for marginal effects and probability limits of fixed effects estimators. (Panels for X_k = (0,0,0,0), (0,0,0,1), (1,1,1,1), and the average effect; series: True, GM-Bound where available, NPMLE, FEMLE; plot content not reproduced.)

Figure 13: Probit model: Identification sets for average marginal effects and probability limits of linear model estimators. (Three panels, T = 2, 3, 4; series: True, GM-Bound, NPMLE, Linear; plot content not reproduced.)