SCHOLZ
Vol. 8, No. 2
The search for a new definition was originally motivated by the desire that the
MML should also cover nonparametric situations. The difficulty here is that typically
we will not have a common $\sigma$-finite dominating measure and hence cannot use
densities as likelihoods. In spite of this, many nonparametric problems have been
solved by the MML with acceptable solutions; however, it is not clear which extension
of the definition has been employed. We are aware of only two attempts to widen the
definition of the MML in order to cover the undominated case as well. The one by
Kiefer and Wolfowitz (1956) essentially suggests a pairwise comparison of possible
distributions $P$ and $Q$ for the data point $x$, by comparing $(dP/d\mu)(x)$ and $(dQ/d\mu)(x)$,
where $\mu = P + Q$. However, the version of the Radon-Nikodym derivatives is not
specified. Selection of various versions can lead to many different solutions for the
MML. In this context consider the following example: Let $\varphi(x)$ be the standard normal
density and define $\varphi^*(x) = \varphi(x)$ for $x \ne 1$ and $\varphi^*(1) = 10$. Let $\{P_\theta : \theta \in R\}$ be the
family of probability distributions generated by the densities $f_\theta(x) = \varphi^*(x - \theta)$,
$\theta \in R$. Now $P_\theta = N(\theta, 1)$ and the m.l.e. of $\theta$ (or $P_\theta$) upon observing $x$ ought to be
$x$ (or $P_x$). However, if one takes the above versions $f_\theta$ as densities, i.e., as likelihoods,
the m.l.e. will be $x - 1$ for all $x$.
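The version dependence in this example can be checked numerically. The following is a minimal sketch, not part of the original argument: the observation `x = 2.0`, the grid of candidate values and the half-width `1e-3` are illustrative assumptions. It maximizes the altered density version over the grid and, for comparison, maximizes the probability of a small interval around the observation, which does not depend on the density version.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def phi_star(z):
    """Version of the same density, altered at the single point z = 1."""
    return 10.0 if z == 1.0 else phi(z)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_prob(theta, x, h):
    """P_theta((x - h, x + h)) for P_theta = N(theta, 1)."""
    return normal_cdf(x + h - theta) - normal_cdf(x - h - theta)

x = 2.0                                    # hypothetical observation
thetas = [k / 4.0 for k in range(0, 17)]   # grid 0, 0.25, ..., 4 (contains x and x - 1)

# Classical recipe applied to the altered version: maximize phi_star(x - theta).
argmax_version = max(thetas, key=lambda t: phi_star(x - t))

# Probabilities of a small neighbourhood of x do not depend on the version.
argmax_local = max(thetas, key=lambda t: interval_prob(t, x, 1e-3))
```

On this grid the altered version is maximized at `x - 1`, while the neighbourhood probability is maximized at `x`.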
It appears, then, that not only is the Kiefer-Wolfowitz definition defective in this
respect but so is the classical definition as well, unless one considers the density
versions as given entities in any statistical model for data. However, the wrong
versions could have been specified and we will not escape the consequences of the
above example. Further, it is not clear how to select unproblematic density versions
in the Kiefer-Wolfowitz approach should one select to represent the statistical model
in terms of given densities.
The only other approach we are aware of for extending the MML towards nonparametric problems is outlined in the new book by Kalbfleisch and Prentice (1980).
Motivating the likelihood in the derivation of the Kaplan-Meier estimator, they
suggest a discretization of the data, via the concept of rounding errors (cf. also
Kempthorne and Folks (1971) and Kalbfleisch (1971)), and then apply the MML to
the discrete (multinomial) problem and let the rounding error tend to zero, hoping
that the discrete m.l.e. will converge to a limit. Kalbfleisch and Prentice suggest
without further support that this will provide an extension of the usual MML. We did
consider providing rigorous support for this last claim. However, the questions of
existence of the limiting m.l.e. (as the grouping gets finer) and whether this limit
would depend on the grouping employed presented major obstacles, and it remained
unclear whether an elegant or even satisfactory approach could be found this way.
In the next section we will present a new definition of the MML which is based on a
pairwise comparison of probability measures in the neighbourhood of the observed
data point $x$. Thus, in a sense, the new definition is a marriage of the two extension
proposals discussed above.
2. DEFINITIONS
Let $\mathcal{X}$ be a metric space with metric $d$ and let $\mathcal{P}$ be a family of probability
measures on the Borel sets of $\mathcal{X}$. For any (data) point $x \in \mathcal{X}$ let $\mathcal{N}_x$ denote the family
of all measurable sets $N_x$ which contain $x$ as an interior point. By $D(N_x)$ denote the
diameter of the set $N_x$.
DEFINITION 1. For $P, Q \in \mathcal{P}$ write $P \ge_x Q$ if
$\lim_{\varepsilon \to 0} \inf\{P(N_x)/Q(N_x) : N_x \in \mathcal{N}_x \text{ with } D(N_x) \le \varepsilon\} \ge 1$.
DEFINITION 2. For $P, Q \in \mathcal{P}$ write $P =_x Q$ when $P \ge_x Q$ and $Q \ge_x P$. Then $P$ and $Q$ are
called equivalent at $x$.
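The local comparison behind Definitions 1 and 2 can be explored numerically on the real line. The following is a rough sketch under stated assumptions: neighbourhoods are restricted to a finite grid of intervals $(x - a, x + b)$ with $a + b \le \varepsilon$, the two distributions are $N(0, 1)$ and $N(1, 1)$, the data point is $x = 0$, and the convention $0/0 = 1$ in the code is an assumption made here, not something taken from the text. The names `inf_ratio` and `make_normal` are hypothetical helpers.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def make_normal(mean):
    """Return a function (lo, hi) -> N(mean, 1) probability of the interval (lo, hi)."""
    return lambda lo, hi: normal_cdf(hi - mean) - normal_cdf(lo - mean)

def inf_ratio(P, Q, x, eps, steps=20):
    """Smallest P(N)/Q(N) over grid intervals N = (x - a, x + b), a, b > 0, a + b <= eps."""
    best = math.inf
    for i in range(1, steps):
        for j in range(1, steps - i + 1):
            a, b = eps * i / steps, eps * j / steps
            p, q = P(x - a, x + b), Q(x - a, x + b)
            if p == 0.0 and q == 0.0:
                ratio = 1.0          # assumed convention 0/0 = 1
            elif q == 0.0:
                ratio = math.inf
            else:
                ratio = p / q
            best = min(best, ratio)
    return best

P, Q = make_normal(0.0), make_normal(1.0)
x = 0.0
ratios = [inf_ratio(P, Q, x, eps) for eps in (0.1, 0.01, 0.001)]
limit = math.exp(0.5)   # ratio of the continuous densities at x: phi(0) / phi(-1)
```

As the diameter bound shrinks, the infimum settles near the ratio of the continuous densities at $x$, which exceeds 1, so in this illustration $N(0, 1)$ dominates $N(1, 1)$ at $x = 0$.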
Comments.
(i) $P =_x Q$ if and only if $\lim_{D(N_x) \to 0} P(N_x)/Q(N_x)$ exists and equals 1.
(ii) If $P \ge_x Q$ and $Q \ge_x R$ then $P \ge_x R$ (transitivity). [We could not establish the same
transitivity relationship for the Kiefer-Wolfowitz definition, the many density
versions representing the basic difficulty.]
(iv) If $\{P\}_x := \{Q \in \mathcal{P} : Q =_x P\}$ and if $\mathcal{P}_x := \{\{P\}_x : P \in \mathcal{P}\}$ then $(\mathcal{P}_x, \ge_x)$ is a [...]
DEFINITION 3. The statistic $P_0 \in \mathcal{P}$ is a maximum likelihood estimator (m.l.e.) with
respect to $x \in \mathcal{X}$ and $\mathcal{P}$ iff for every $Q \in \mathcal{P}$ such that $Q \ge_x P_0$ it follows that $Q =_x P_0$. That
is, $P_0$ is an m.l.e. if and only if there does not exist a $Q \in \mathcal{P} - \{P_0\}_x$ such that $Q \ge_x P_0$
or, equivalently,
$\liminf_{D(N_x) \to 0} Q(N_x)/P_0(N_x) < 1$ for all $Q \in \mathcal{P} - \{P_0\}_x$.
The existence of an m.1.e. according to this definition cannot be taken for granted
and depends on the underlying problem. Further m.l.e.s may not be unique since we
may either have a whole equivalence class of them or the situation may arise where
several equivalence classes of m.l.e.s exist which are not comparable with respect to $\ge_x$.
[...] $P$; then by our [...]
$\lim_{D(N_x) \to 0} P(N_x)/\mu(N_x)$ exists $= p(x)$
and [...] $p(x) + q(x) > 0$, then
$P \ge_x Q \iff p(x) \ge q(x)$.
The case $p(x) + q(x) = 0$ has to be considered on an individual basis in each given
problem.
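The step from the local limits to the comparison of density values can be written out as follows. This is a sketch under assumptions stated here rather than recovered from the text: $\mu(N_x) > 0$ for all neighbourhoods, both limits $p(x) = \lim P(N_x)/\mu(N_x)$ and $q(x) = \lim Q(N_x)/\mu(N_x)$ exist, and $p(x) + q(x) > 0$.

```latex
\frac{P(N_x)}{Q(N_x)}
  = \frac{P(N_x)/\mu(N_x)}{Q(N_x)/\mu(N_x)}
  \;\longrightarrow\; \frac{p(x)}{q(x)}
  \qquad \text{as } D(N_x) \to 0,
```

where the right-hand side is read as $+\infty$ when $q(x) = 0 < p(x)$. Under these assumptions the limit is at least 1 exactly when $p(x) \ge q(x)$.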
Criteria (A) and (B) show the agreement between the new and the classical
definition. This means that the classical methodology for finding m.l.e.s is simply
subsumed in the methodology for the new definition. Further, the new definition
points to those density versions (continuous at $x$, if they exist) that should be used in
the classical definition in order for agreement to occur between the definitions. If
there exists a continuous density version it seems only natural to insist on its use in
the classical definition, since densities are supposed to represent the localized
properties of probability distributions. Unresolved are those cases where density
versions which are continuous at $x$ do not exist. In such cases no natural density
version would suggest itself for the classical approach whereas the new definition
shows how to deal with this question in a satisfactory way, as will be seen from the
following examples.
3. EXAMPLES
In the following examples $P$ ($Q$, ...) may denote either the distribution of one
random variable or the distribution of a random sample of such random variables,
i.e., P serves as a parameter as well. From the context it should be clear what is
meant so that no confusion arises while at the same time we avoid notational
complexities.
Example 3.1. Let $X_1, \ldots, X_n$ be independently and identically distributed as
$P \in \mathcal{P}$, the class of all probability distributions on the real line. Let $P_n$ be the
empirical distribution corresponding to the observed data vector $x = (x_1, \ldots, x_n)$,
i.e.,
$P_n(A) = \sum_{i=1}^{n} I_A(x_i)/n$
where
$I_A(z) = 1$ if $z \in A$, and $I_A(z) = 0$ otherwise.
[...] where $y_1 < \cdots < y_h$ represent the distinct values among $x_1, \ldots, x_n$ appearing with
respective multiplicities $m_1, \ldots, m_h$. Furthermore note that by simple application of
Jensen's inequality one has
[...] $\prod_{i=1}^{h}$ [...]
$Q \ge_x P_n \implies Q(\{x\}) \ge P_n(\{x\}) \implies P_n = Q$, [...]
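The role of Jensen's inequality here can be illustrated with a small numerical check. It rests on the standard fact, assumed here to be the content of the display above, that among all distributions $Q$ the product $\prod_{i} Q(\{y_i\})^{m_i}$ is largest for $Q(\{y_i\}) = m_i/n$, i.e., for the empirical distribution. The sample and the random competitors below are hypothetical.

```python
import math
import random
from collections import Counter

def log_point_likelihood(masses, counts):
    """log of prod_i Q({y_i})^{m_i}; -inf if Q gives zero mass to an observed value."""
    total = 0.0
    for y, m in counts.items():
        q = masses.get(y, 0.0)
        if q <= 0.0:
            return -math.inf
        total += m * math.log(q)
    return total

data = [1.2, 0.7, 1.2, 3.4, 0.7, 1.2]              # hypothetical sample with ties
counts = Counter(data)                             # multiplicities m_i of distinct values y_i
n = len(data)
empirical = {y: m / n for y, m in counts.items()}  # P_n({y_i}) = m_i / n
best = log_point_likelihood(empirical, counts)

rng = random.Random(0)
competitors = []
for _ in range(1000):
    # Random competitor Q: point masses on the observed values plus leftover mass elsewhere.
    w = [rng.random() for _ in range(len(counts) + 1)]
    s = sum(w)
    masses = {y: wi / s for y, wi in zip(counts, w)}
    competitors.append(log_point_likelihood(masses, counts))

worst_gap = best - max(competitors)   # never negative if the empirical distribution is optimal
```

No random competitor attains a larger value than the empirical distribution, in line with the inequality.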
It seems fitting to give this example first, since in some respect it was the first one
"solved" through the MML, namely by Lambert (1760), cf. Edwards (1974).
Example 3.2. Let $X_1, \ldots, X_n$ be independently and identically distributed
$\mathcal{U}(0, \theta)$, $\theta > 0$ (uniform on $(0, \theta)$); then $P_{\theta_0} = \mathcal{U}(0, \theta_0)$, with $\theta_0 = \max(x_1, \ldots, x_n)$,
is the unique m.l.e. based on the observed data vector $x = (x_1, \ldots, x_n)$, since
(a) if $\theta < \theta_0$ then $P_\theta(N_x) = 0$ for $D(N_x)$ small, and $P_{\theta_0}(N_x) > 0$ for all $N_x \in \mathcal{N}_x$,
so
$\lim_{D(N_x) \to 0} P_\theta(N_x)/P_{\theta_0}(N_x) = 0 < 1$.
(b) if $\theta > \theta_0$, let
$N_x^* = N_x \cap \{y \in R^n : \max_i y_i \le \theta_0\}$; then
$\liminf_{D(N_x) \to 0} P_\theta(N_x)/P_{\theta_0}(N_x) = \liminf_{D(N_x) \to 0} (\theta_0/\theta)^n (\lambda(N_x)/\lambda(N_x^*)) = (\theta_0/\theta)^n < 1$
since the ratio $\lambda(N_x)/\lambda(N_x^*)$ ($\ge 1$) can approximate 1 by appropriate choice of $N_x$.
Thus $P_{\theta_0}$ is an m.l.e., and the uniqueness follows easily. Here the classical approach
can either produce the same solution or no solution or any other solution depending
on the choice of density. Note that in this example there is no density for $P_{\theta_0}$ which
is continuous at $x$.
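Cases (a) and (b) can be traced with exact box probabilities. The sketch below assumes rectangular neighbourhoods $\prod_i (x_i - a_i, x_i + b_i)$ and a hypothetical data vector; the helper names `overlap`, `uniform_box_prob` and `ratio` are illustrative and not from the source.

```python
def overlap(lo, hi, left, right):
    """Length of (lo, hi) intersected with (left, right)."""
    return max(0.0, min(hi, right) - max(lo, left))

def uniform_box_prob(theta, x, a, b):
    """P_theta of the box prod_i (x_i - a_i, x_i + b_i) for n i.i.d. U(0, theta) draws."""
    prob = 1.0
    for xi, ai, bi in zip(x, a, b):
        prob *= overlap(xi - ai, xi + bi, 0.0, theta) / theta
    return prob

x = (0.5, 2.0, 1.25)        # hypothetical data; theta0 = max(x) = 2.0
theta0 = max(x)
n = len(x)

def ratio(theta, a, b):
    return uniform_box_prob(theta, x, a, b) / uniform_box_prob(theta0, x, a, b)

eps = 1e-3
sym = ratio(3.0, [eps] * n, [eps] * n)            # box sticks out beyond theta0
thin = ratio(3.0, [eps] * n, [eps * 1e-6] * n)    # box lies almost entirely below theta0
below = ratio(1.5, [eps] * n, [eps] * n)          # theta below theta0: P_theta(box) = 0
limit = (theta0 / 3.0) ** n
```

For a competitor with a larger parameter the ratio stays below 1 and comes arbitrarily close to $(\theta_0/\theta)^n$ when the box barely extends past $\theta_0$; for a smaller parameter the ratio is 0.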
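Example 3.3. Let $X_1, \ldots, X_n$ be independently and identically distributed
$\mathcal{U}(\theta - \tfrac12, \theta + \tfrac12)$, $\theta \in R$ (uniform on $(\theta - \tfrac12, \theta + \tfrac12)$); then any $P_{\theta_0} = \mathcal{U}(\theta_0 - \tfrac12, \theta_0 + \tfrac12)$,
with $\theta_0 \in I = (x_{(n)} - \tfrac12, x_{(1)} + \tfrac12)$, is an m.l.e. based on the observed data vector $x
= (x_1, \ldots, x_n)$; here $x_{(1)} = \min(x_1, \ldots, x_n)$ and $x_{(n)} = \max(x_1, \ldots, x_n)$.
Proof. First note that $P_{\theta_0} =_x P_{\theta_1}$ for $\theta_0, \theta_1 \in I$, since $P_{\theta_0}$ and $P_{\theta_1}$ admit densities which
are continuous at $x$ with the same value 1 at $x$. Now let $\theta_0 \in I$ and $\theta_1 = x_{(n)} - \tfrac12 <
\theta_0$ and let
$N_x^* = N_x \cap \{y \in R^n : \max_i y_i \le \theta_1 + \tfrac12\}$;
then
$\liminf_{D(N_x) \to 0} P_{\theta_1}(N_x)/P_{\theta_0}(N_x) = \liminf_{D(N_x) \to 0} P_{\theta_1}(N_x^*)/P_{\theta_0}(N_x) = 0 < 1$,
hence establishing the claim.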
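The two steps of this proof can be mimicked with rectangular neighbourhoods. The following sketch assumes a hypothetical data vector, reads the uniform family as having half-width 1/2, and takes the boundary value to be the largest observation minus 1/2; these readings and the helper names are assumptions for illustration.

```python
def overlap(lo, hi, left, right):
    """Length of (lo, hi) intersected with (left, right)."""
    return max(0.0, min(hi, right) - max(lo, left))

def box_prob(theta, x, a, b):
    """P_theta of prod_i (x_i - a_i, x_i + b_i) under n i.i.d. U(theta - 1/2, theta + 1/2)."""
    prob = 1.0
    for xi, ai, bi in zip(x, a, b):
        prob *= overlap(xi - ai, xi + bi, theta - 0.5, theta + 0.5)
    return prob

x = (0.25, 0.5, 0.75)                        # hypothetical data
lower, upper = max(x) - 0.5, min(x) + 0.5    # end points of the interval of candidates
theta0 = 0.5 * (lower + upper)               # an interior point
theta1 = lower                               # boundary value

# Two interior parameters give identical probabilities to small boxes around x.
h = [1e-3] * len(x)
same = box_prob(0.3, x, h, h) / box_prob(0.7, x, h, h)

# Skewing the box to the right of x drives the boundary-to-interior ratio towards 0.
ratios = []
for shrink in (1.0, 1e-2, 1e-4):
    a = [1e-3 * shrink] * len(x)   # extent of the box to the left of each x_i
    b = [1e-3] * len(x)            # extent to the right
    ratios.append(box_prob(theta1, x, a, b) / box_prob(theta0, x, a, b))
```

Interior parameters are locally indistinguishable at $x$, while the boundary parameter can be made to look arbitrarily bad within neighbourhoods of small diameter.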
(3.1)
The conventional MML fails to justify (3.1) as a starting point; the new definition,
however, easily leads us to (3.1). This follows from our discussion in (B) above in
Section 2 and the fact that one can find $(F_0, G_0) \in \mathcal{F} \times \mathcal{G}$ such that $P_{F_0, G_0}(X = x,
\delta = d) > 0$.
Example 3.5. The Multivariate Normal Distribution: Let $X_1, \ldots, X_n$ be independently and identically distributed $N_p(\mu, \Sigma)$ with $\mu \in R^p$ and $\Sigma$ a $p \times p$ covariance
matrix. Although $\hat\mu$ and $\hat\Sigma$, the conventional m.l.e.s of $\mu$, $\Sigma$, are always well defined
as algebraic expressions, they could no longer be called m.l.e.s in the classical sense
when $n \le p$ or rank $\hat\Sigma < p$. However, under the new definition $N_p(\hat\mu, \hat\Sigma)$ emerges
again as the unique m.l.e. provided we let
$\mathcal{P} = \{N_p(\mu, \Sigma) : \mu \in R^p, \Sigma$ covariance matrix of rank $\le p\}$.
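The degenerate situation is easy to reproduce. The sketch below uses a hypothetical sample of two points in the plane (so that the sample size does not exceed the dimension) and computes the conventional estimates with divisor n; the resulting covariance matrix is singular, so the fitted normal distribution has no density with respect to Lebesgue measure on the plane, which is the situation the classical definition cannot handle.

```python
from fractions import Fraction

def mean_and_cov(points):
    """Conventional normal estimates: sample mean and covariance with divisor n."""
    n, p = len(points), len(points[0])
    mean = [sum(pt[k] for pt in points) / n for k in range(p)]
    cov = [[sum((pt[r] - mean[r]) * (pt[c] - mean[c]) for pt in points) / n
            for c in range(p)] for r in range(p)]
    return mean, cov

# Hypothetical sample: n = 2 observations in dimension p = 2.
points = [(Fraction(0), Fraction(1)), (Fraction(2), Fraction(5))]
mean, cov = mean_and_cov(points)
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]   # zero: the estimate has rank 1
```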
[...] Here $x_1 < \cdots < x_n$ denote the ordered observed values of the sample. Since $f_0$
exhibits discontinuities at the observed data values, it is not evident how the likelihood
ought to be defined in the classical sense. Should one take the left or right limits, or
some value in between, to account for the fact that the density at $x$ ought to reflect
the probability in a small neighbourhood around $x$? It turns out that the new
definition will produce the $P_0$ given above as the unique m.l.e. for this problem. The
derivation, which is somewhat involved, is given in the Appendix, Section 5.
1980 DEFINITION OF MAXIMUM LIKELIHOOD
4. CONCLUDING REMARKS
After making a case for the necessity of a more careful and broader definition for
the MML a new definition has been put forth. In the few examples presented above it
has performed well. Many more examples, dominated or not, could and should be
treated with more or less ease, and we hope that the versatility of the definition, as
we have experienced it so far, will stand up under future tests.
This new definition will not remove inconsistent m.l.e.s and replace them by
consistent ones, as the following example demonstrates, cf. Barlow (1972). Let $X_1,
\ldots, X_n$ be independently and identically distributed as $F \in \mathcal{F}$,
the family of all
starshaped distribution functions on $[0, 1]$. Here $F$ is starshaped on $[0, 1]$ if $F(x)/x$
is nondecreasing on $[0, 1]$. One easily shows, using the new definition and following
Barlow (1972), that the m.l.e. of $F$ is
$\hat F(x) = x \sum_{i=1}^{n} I[x_i \le x]/n$, [...]
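The inconsistency can be seen in a small computation. The sketch below assumes that the estimator displayed above reads as $x$ times the empirical distribution function, which is a reconstruction of a damaged formula, and it uses an idealized "uniform" sample of equally spaced points instead of random draws. The uniform distribution function $F(x) = x$ is starshaped, yet the estimate tracks $x^2$.

```python
from fractions import Fraction

def starshaped_mle(sample):
    """Return F_hat with F_hat(x) = x * (number of x_i <= x) / n on [0, 1] (assumed form)."""
    n = len(sample)
    def F_hat(x):
        return x * Fraction(sum(1 for xi in sample if xi <= x), n)
    return F_hat

# Hypothetical idealized uniform sample: the points i/(n+1), i = 1, ..., n.
n = 999
sample = [Fraction(i, n + 1) for i in range(1, n + 1)]
F_hat = starshaped_mle(sample)
x = Fraction(1, 2)
value = F_hat(x)   # compare with the uniform c.d.f. value x and with x**2
```

At $x = 1/2$ the estimate is close to $1/4$ rather than $1/2$, and enlarging the sample does not close this gap.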
[...] $N_x^{1} := N_x \cap \bigcap_{i=1}^{n} \{y \in R^n : y_i < x_i\}$ [...]
where
$A_{ij}(N_x) = P_i(N_x^j)/\lambda(N_x^j)$; $r_j(N_x) = \lambda(N_x^j)/\lambda(N_x)$, $i = 1, 2$, $j = 1, \ldots, 2^n$.
[...] exists, e.g.,
$f_i^{2^n}(x) = \prod_{k=1}^{n} f_i(x_k+)$ for $i = 1, 2$.
We remark here that the $f_i^j(x)$ are independent of the density version $f_i$ employed.
Since in the following we will always use only $f(x_k-)$ and $f(x_k+)$, $k = 1, \ldots, n$, there
should be no ambiguity if we identify $P$ and some corresponding density version $f$ of
$P$.
We now state two lemmas whose proofs are straightforward and omitted:
LEMMA 5.1. Let $A_{ij}$ [...] $0$; $j = 1, \ldots, k$ [...].
Then
$\inf_{r \in \mathcal{Y}} \sum_{j=1}^{k} r_j A_{1j} \Big/ \sum_{j=1}^{k} r_j A_{2j}$ [...]
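Since the statement of the lemma is only partly legible, the following sketch illustrates a standard fact of the same shape rather than the lemma itself: for strictly positive numbers, the ratio of two weighted sums with common positive weights is a weighted average of the individual ratios, so its infimum over all such weights is the smallest individual ratio. The values of `A1`, `A2` and `k` are hypothetical.

```python
import random

def ratio(r, A1, A2):
    """(sum_j r_j A_1j) / (sum_j r_j A_2j) for weights r."""
    return sum(rj * a for rj, a in zip(r, A1)) / sum(rj * a for rj, a in zip(r, A2))

rng = random.Random(1)
k = 4
A1 = [rng.uniform(0.1, 2.0) for _ in range(k)]   # hypothetical positive values
A2 = [rng.uniform(0.1, 2.0) for _ in range(k)]
bound = min(a1 / a2 for a1, a2 in zip(A1, A2))
j_star = min(range(k), key=lambda j: A1[j] / A2[j])

# Random strictly positive weights summing to one never fall below the bound ...
samples = []
for _ in range(2000):
    w = [rng.random() + 1e-12 for _ in range(k)]
    s = sum(w)
    samples.append(ratio([wj / s for wj in w], A1, A2))

# ... while weights concentrating on the minimizing index approach it.
t = 1e-9
near = ratio([1 - (k - 1) * t if j == j_star else t for j in range(k)], A1, A2)
```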
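LEMMA 5.2. With $r_j(N_x)$ defined as in (5.1) and $\mathcal{Y}$ as in Lemma 5.1, it follows that with
$k = 2^n$,
$\{(r_1(N_x), \ldots, r_k(N_x)) : N_x \in \mathcal{N}_x,\ D(N_x) \le \varepsilon\} = \mathcal{Y}$.
(5.2)
$\lim_{\varepsilon \to 0} \inf_{N_x \in \mathcal{N}_x,\ D(N_x) \le \varepsilon} P_1(N_x)/P_2(N_x)$ [...]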
We will now show that $P_0$ (with density $f_0$ as defined in Example 3.6) is an m.l.e.
Note that $f_0(x_n+) = 0$ and $f_0(x_n-) > 0$, which in conjunction with (5.2) and (5.3)
implies that
[...] such that $f_0^j(x) > 0\}$
for any $P \in \mathcal{P}$ and $f$ a density version of $P$. We will show in the following steps that
$L(P, P_0) < 1$ for all $P \in \mathcal{P}$ such that $P \ne P_0$, hence establishing that $P_0$ is an m.l.e.
(a) Let $\mathcal{P}_1 \subset \mathcal{P}$ be the following subfamily of distributions: $P \in \mathcal{P}_1$ if and only if
$P$ admits a density $f \in \mathcal{F}$ satisfying the following two conditions:
(i) $f = 0$ on $(x_n, \infty)$
(ii) $f$ is a step function on $(0, x_n)$ with at most one step in each interval $(x_i, x_{i+1})$; $i =
0, \ldots, n - 1$ ($x_0 = 0$), and in case of a step in $(x_i, x_{i+1})$ $f$ is continuous at $x_i$.
The following considerations show that our problem can be reduced to showing
$L(P, P_0) < 1$ for all $P \in \mathcal{P}_1$ with $P \ne P_0$.
[...]
Note that neither step I nor step II, if carried out, will lead to $P_1 = P_0$ or $P_2 = P_0$.
(III) If $f \in \mathcal{F}$ and $f(x_n+) > 0$ let $f_1(x) = k f(x) I_{(0, x_n]}(x)$
with $k > 1$ so that $f_1 \in \mathcal{F}$.
Then $L(f, f_0) < L(f_1, f_0)$.
Hence it remains to show $L(P, P_0) < 1$ for all $P \in$ [...]
with $P \ne P_0$.
[...] $f(x_i-)/f_0(x_i-)$ [...] $f(x_i+)/f_0(x_i+)$ [...]
$f_1(x)$ [...] for $x_i \le x < x_i$ [...]
$= \alpha f(x_i+)$ for $x_i$ [...] $+ \varepsilon$, $\varepsilon > 0$
$= f(x)$ for $x \in$ [...].
Here $\varepsilon$ and $\alpha$ should be chosen so that $f_1 \in \mathcal{F}_1$; in fact $\alpha$ can be chosen arbitrarily
close to 1 by taking $\varepsilon > 0$ sufficiently small, so that $L(f, f_0) < L(f_1, f_0)$ and $f_1$ is [...]
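$L(f, f_0) = \prod_{i=1}^{n} f(x_i-)/f_0(x_i-)$ [...]
[...] $\prod_{i=1}^{n} a_i$
subject to
$a_1 \ge$ [...] $\ge a_n \ge 0$ and
$\sum_{i=1}^{n} a_i (x_i - x_{i-1}) = 1$.
The solution is
$a_i = \min_{k \le i-1} \max_{r \ge i} \dfrac{r - k}{n (x_r - x_k)}$,
$i = 1, \ldots, n$.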
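The min-max expression can be evaluated directly. The sketch below assumes the reading $a_i = \min_{k \le i-1} \max_{r \ge i} (r - k)/(n(x_r - x_k))$ with $x_0 = 0$, where the $a_i$ are the heights of a nonincreasing step density on the intervals $(x_{i-1}, x_i]$; the sample values are hypothetical and exact rational arithmetic is used so that the constraints can be checked without rounding.

```python
from fractions import Fraction

def minmax_heights(xs):
    """a_i = min over k <= i-1 of max over r >= i of (r - k) / (n (x_r - x_k)), x_0 = 0."""
    n = len(xs)
    pts = [Fraction(0)] + [Fraction(v) for v in xs]   # pts[i] = x_i, pts[0] = x_0 = 0
    heights = []
    for i in range(1, n + 1):
        heights.append(min(
            max(Fraction(r - k, n) / (pts[r] - pts[k]) for r in range(i, n + 1))
            for k in range(0, i)))
    return heights

xs = [2, 3, 4]                                     # hypothetical ordered sample
a = minmax_heights(xs)
gaps = [b - c for b, c in zip(xs, [0] + xs[:-1])]  # x_i - x_{i-1}
total_mass = sum(ai * g for ai, g in zip(a, gaps)) # should equal 1
```

For this sample the heights are all equal to 1/4, and for the sample (1, 2, 4) the same function returns 1/3, 1/3, 1/6; in both cases the heights are nonincreasing and integrate to 1.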
This proves that $P_0$ is in fact an m.l.e. according to the new definition. It remains to
show that no other $P \in \mathcal{P}$ can be an m.l.e.
First we claim that any m.l.e. $P$ with density $f$ must by necessity satisfy the
following conditions:
$f(x_n+) = 0$ (5.4)
$f(x_n-) > 0$. (5.5)
To prove (5.4) suppose $f(x_n+) > 0$, and let $P^*$ have density
$f^*(y) =$ [...] $+ f(x_n+) I_{(x_n, x_n + \delta)}(y)$,
where $\alpha > 1$ and $\delta > 0$ are chosen such that $P^* \in \mathcal{P}$. Then (5.2) and (5.3) imply
$\liminf_{D(N_x) \to 0} P^*(N_x)/P(N_x) = \min\{f^{*j}(x)/f^j(x) : j = 1, \ldots, 2^n\} = \alpha^{n-1} \ge 1$
and [...]
Modify $f_0$ as follows: at each jump point $z$ of $f_0$ extend $f_0$ continuously to the right a
small amount beyond $z$, at the same time lowering the plateau value of $f_0$ just prior
to $z$ by an increment $\varepsilon > 0$ so that the resulting density $f^*$ is in $\mathcal{F}$; then [...]
For $\varepsilon > 0$ sufficiently small the last expression is greater than 1 if $P \ne P_0$. This
concludes the proof of the uniqueness of $P_0$ as an m.l.e.
ACKNOWLEDGEMENT
I would like to thank Professor Ron Pyke for his stimulating interest in this problem. Our
many discussions on this subject were very essential in formulating the final form of the
definition presented here. I would also like to thank Professor R. Berk for pointing out his
review (1967) of Zehna (1966) to me.
SUMMARY
A unified definition of the method of maximum likelihood estimation is presented. It is based on a comparison of two probability measures in a neighbourhood
of the observed data point. This definition does not have the shortcomings of earlier definitions,
i.e., it does not depend on the choice of the density version in the dominated case. The definition
also applies to the undominated case, i.e., it provides a coherent approach to
nonparametric maximum likelihood estimation problems which, until now,
have been solved by ad hoc methods. It is shown that the new definition of maximum
likelihood constitutes an extension of the classical approach as used in the
dominated case. Parametric and nonparametric examples illustrate the new methodology.
REFERENCES
Barlow, R.E.; Bartholomew, D.J.; Bremner, J.M., and Brunk, H.D. (1972). Statistical Inference under
Order Restrictions. Wiley, New York.
Berk, R.H. (1967). Review of Zehna (1966). Math. Rev., 33, no. 1922.
Bernoulli, Daniel (1777). The most probable choice between several discrepant observations and the
formation therefrom of the most likely induction. (In Latin.) Acta Acad. Petrop., 3-33. [English
translation: Biometrika, 48 (1961), 3-13.]
Bickel, P.J., and Doksum, K.A. (1977). Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day,
San Francisco.
Edwards, A.W.F. (1974). The history of likelihood. Internat. Statist. Rev., 42, 9-15.
Fisher, R.A. (1912). On an absolute criterion for fitting frequency curves. Messenger Math., 41, 155-160.
Fisher, R.A. (1922). On the mathematical foundation of theoretical statistics. Philos. Trans. Roy. Soc.
London Ser. A , 222, 309-368.
Grenander, U. (1956). On the theory of mortality measurements. Skand. Aktuarietidskr., 39, 125-153.
Kalbfleisch, J.D., and Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New
York.
Kalbfleisch, J.G. (1980). Probability and Statistical Inference. Volume II. Springer-Verlag, New York.
Kaplan, E.L., and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer.
Statist. Assoc., 53, 457-481.
Kempthorne, O., and Folks, L. (1971). Probability, Statistics, and Data Analysis. The Iowa State University
Press, Ames.
Kiefer, J., and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of
infinitely many incidental parameters. Ann. Math. Statist., 27, 887-906.
Lambert, J.H. (1760). Photometria. Augustae Vindelicorum.
Rohatgi, V.K. (1976). An Introduction to Probability Theory and Mathematical Statistics. Wiley, New York.
Zehna, P.W. (1966). Invariance of maximum likelihood estimation. Ann. Math. Statist., 37, 744.