Jun Li, Regina Y. Liu (2008) Multivariate Spacing Based On Data Depth

Multivariate Spacings Based on Data Depth: I. Construction of Nonparametric Multivariate Tolerance Regions
Author(s): Jun Li and Regina Y. Liu
Source: The Annals of Statistics, Vol. 36, No. 3 (Jun., 2008), pp. 1299-1323
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/25464668
The Annals of Statistics
2008, Vol. 36, No. 3, 1299-1323
DOI: 10.1214/07-AOS505
© Institute of Mathematical Statistics, 2008

MULTIVARIATE SPACINGS BASED ON DATA DEPTH: I. CONSTRUCTION OF NONPARAMETRIC MULTIVARIATE TOLERANCE REGIONS

BY JUN LI AND REGINA Y. LIU
University of California at Riverside and Rutgers University

This paper introduces and studies multivariate spacings. The spacings are developed using the order statistics derived from data depth. Specifically, the spacing between two consecutive order statistics is the region which bridges the two order statistics, in the sense that the region contains all the points whose depth values fall between the depth values of the two consecutive order statistics. These multivariate spacings can be viewed as a data-driven realization of the so-called "statistically equivalent blocks." These spacings assume a form of center-outward layers of "shells" ("rings" in the two-dimensional case), where the shapes of the shells follow closely the underlying probabilistic geometry. The properties and applications of these spacings are studied. In particular, the spacings are used to construct tolerance regions. The construction of tolerance regions is nonparametric and completely data driven, and the resulting tolerance region reflects the true geometry of the underlying distribution. This is different from most existing approaches, which require that the shape of the tolerance region be specified in advance. The proposed tolerance regions are shown to meet the prescribed specifications, in terms of $\beta$-content and $\beta$-expectation. They are also asymptotically minimal under elliptical distributions. Finally, a simulation and comparison study on the proposed tolerance regions is presented.
Received January 2007; revised May 2007.
Supported in part by grants from the National Science Foundation, the National Security Agency, and the Federal Aviation Administration.
AMS 2000 subject classifications. Primary 62G15, 62G30; secondary 62G20, 62H05.
Key words and phrases. Data depth, depth order statistics, multivariate spacings, statistically equivalent blocks, tolerance region.

1. Introduction. The term "spacings" in statistics generally refers to either the intervals (or gaps) between two consecutive order statistics or the lengths of these intervals. Spacings have been used extensively in probability and statistics, especially in the areas of distributional characterization, extreme value theory and nonparametric inference. There is a rich literature on the theory and applications of spacings. The excellent treatise by Pyke in [22], as well as the references therein (e.g., [5] and [27]) and thereafter (e.g., [2, 4, 9, 12, 28]), all attest to the importance of spacings. In his paper [22] Pyke wrote,

    Perhaps the most significant restrictions of this paper has been our concern with one dimensional spacings. There are many applications in which samples are drawn from two- or even three-dimensional space and for which it is important to study the spacings of the observations.
Although research on spacings has continued, his call for multivariate spacings has remained largely unanswered. The main difficulty in generalizing the univariate spacings to multivariate settings is the lack of suitable ordering schemes for multivariate observations. This paper has two goals. First, we introduce multivariate spacings using the multivariate ordering derived from the notion of data depth. Second, as an application, we apply the proposed spacings to construct nonparametric multivariate tolerance regions.

The paper is organized as follows. Section 2 is devoted to the development of multivariate spacings. We begin with a brief review of the univariate spacings and some of their properties, as well as a brief description of the subject of data depth and the corresponding depth ordering of multivariate data. Note that the depth ordering is from the center outward rather than the usual univariate linear ordering from the smallest to the largest. For any two consecutive depth order statistics, we define the spacing between them as the region that contains all the points in the sample space whose depth values fall between the depth values of the two order statistics. The multivariate spacings are the collection of these regions formed by all pairs of consecutive order statistics. These regions generally appear as center-outward layers of "shells" ("rings" in $\mathbb{R}^2$), and the shapes of the shells follow closely the probabilistic geometry of the underlying distribution. In Section 3 we first provide a review of tolerance intervals for univariate data as well as the existing approaches for obtaining multivariate tolerance regions. We then describe the proposed construction of nonparametric tolerance regions using the multivariate spacings, and investigate the properties of the proposed tolerance regions. Specifically, we show that these tolerance regions: (1) meet the prescribed specifications in terms of $\beta$-content and $\beta$-expectation, and (2) are asymptotically minimal under a certain class of distributions which includes the elliptical family. The formation of our tolerance region is completely data driven and nonparametric, and the resulting tolerance region has the desirable property of reflecting accurately the underlying probabilistic geometry. In other words, the shape of our proposed tolerance regions is automatically determined by the given data, and does not need to be specified in advance. Most existing approaches require pre-specification of the shape, which can be considered arbitrary or subjective. It is also worth noting that our tolerance region is always connected, which is more suitable in applications such as quality control. Section 4 contains a simulation study and some comparisons with other existing tolerance regions. It confirms several desirable features of our approach. Section 5 contains some concluding remarks. Most technical proofs are collected in the Appendix.

We also observe in Section 3.1 that in using our multivariate spacings to construct tolerance regions, we have in effect argued that our multivariate spacings are an ideal realization of the so-called "statistically equivalent blocks." This is because the realization of our multivariate spacings and their shapes are entirely data driven. Statistically equivalent blocks had been considered by Tukey in [24] and several follow-up papers (see, e.g., [10]) as possible building blocks for the construction of tolerance regions or tools for characterizing distributions. However, these papers again all need to pre-specify the shapes (e.g., rectangles or circles in $\mathbb{R}^2$) of the blocks.

2. Multivariate spacings derived from data depth. We begin with a brief review of the notion of data depth and its corresponding multivariate ordering. This multivariate ordering naturally leads to our multivariate spacings.

2.1. Data depth and center-outward ordering of multivariate data. A data depth is a measure of "depth" of a given point with respect to a multivariate data cloud or its underlying distribution, and it gives rise to a natural center-outward ordering of the points in a multivariate sample. Although the actual depth value has been used widely to develop robust multivariate inference, the depth ordering is less understood and still underutilized. Existing notions of data depth include: Mahalanobis depth ([20]), half-space depth ([14, 25]), simplicial depth ([16]), projection depth ([7, 8, 23, 30]), etc. More discussion on different notions of data depth can be found in [17, 31].
To help facilitate the coming exposition of multivariate spacings, we use the simplicial depth to illustrate the general concept of data depth and its corresponding center-outward ordering. Let $\{X_1, \ldots, X_n\}$ be a random sample from the distribution $F(\cdot) \in \mathbb{R}^p$, $p \ge 2$. Consider the bivariate setting, $p = 2$. Let $\Delta(a, b, c)$ denote the triangle with vertices $a$, $b$ and $c$. Let $I(\cdot)$ be the indicator function, that is, $I(A) = 1$ (or 0) if $A$ occurs (or not). For the given sample $\{X_1, \ldots, X_n\}$, the sample simplicial depth of $x$ is defined as

(2.1)  $D_{F_n}(x) = \binom{n}{3}^{-1} \sum_{(*)} I\big(x \in \Delta(X_{i_1}, X_{i_2}, X_{i_3})\big)$,

which is the fraction of the triangles generated from the sample that contain the point $x$. Here $(*)$ runs over all possible triplets of $\{X_1, \ldots, X_n\}$. A larger value of $D_{F_n}(x)$ indicates that $x$ falls in more triangles generated from the sample, and thus lies deeper within the data cloud. The above can be generalized to dimension $p$ by counting simplices rather than triangles, that is,

(2.2)  $D_{F_n}(x) = \binom{n}{p+1}^{-1} \sum_{(*)} I\big(x \in s[X_{i_1}, \ldots, X_{i_{p+1}}]\big)$,

where $(*)$ runs over all possible subsets of $\{X_1, \ldots, X_n\}$ of size $(p + 1)$. Here $s[X_{i_1}, \ldots, X_{i_{p+1}}]$ is the closed simplex whose vertices are $\{X_{i_1}, \ldots, X_{i_{p+1}}\}$.
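The bivariate definition (2.1) can be sketched directly in code. The following is a minimal brute-force illustration, not an implementation from the paper: the function names are hypothetical, the triangle test uses signed areas, and the cost grows as $O(n^3)$, so it is only meant for small samples.

```python
from itertools import combinations


def _orient(a, b, c):
    """Twice the signed area of triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_closed_triangle(x, a, b, c):
    """True if x lies in the closed triangle with vertices a, b, c."""
    d1, d2, d3 = _orient(a, b, x), _orient(b, c, x), _orient(c, a, x)
    has_neg = min(d1, d2, d3) < 0
    has_pos = max(d1, d2, d3) > 0
    # x is inside (or on the boundary) unless the signs are mixed.
    return not (has_neg and has_pos)


def simplicial_depth_2d(x, sample):
    """Sample simplicial depth (2.1): fraction of sample triangles containing x."""
    triples = list(combinations(sample, 3))
    hits = sum(in_closed_triangle(x, a, b, c) for a, b, c in triples)
    return hits / len(triples)


# Hypothetical toy sample, chosen only for illustration.
sample = [(0, 0), (4, 0), (0, 4), (1, 1)]
print(simplicial_depth_2d((1, 1), sample))  # central point, contained in every triangle
print(simplicial_depth_2d((5, 5), sample))  # point outside the data cloud
```

Sorting the sample by this depth in descending order gives the center-outward ranking that the multivariate spacings are built on.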
If $F$ is given, the simplicial depth of $x$ w.r.t. $F$ is defined as $D_F(x) = P_F\{x \in s[X_1, \ldots, X_{p+1}]\}$, where $X_1, \ldots, X_{p+1}$ are $(p + 1)$ random observations from $F$. $D_F(x)$ measures how "deep" $x$ is w.r.t. $F$, and $D_{F_n}(x)$ in (2.2) is a sample estimate of $D_F(x)$. A fuller motivation together with the key properties of data depth can be found in [16]. In particular, it is shown that $D_F(\cdot)$ is affine invariant, and that $D_{F_n}(\cdot)$ converges uniformly and strongly to $D_F(\cdot)$. The affine invariance ensures that our proposed spacings and inference methods are coordinate free, and the convergence of $D_{F_n}$ to $D_F$ allows us to approximate $D_F(\cdot)$ by $D_{F_n}(\cdot)$ if $F$ is unknown.

For the given sample $\{X_1, X_2, \ldots, X_n\}$, we calculate the depth values $D_{F_n}(X_i)$'s and then order the $X_i$'s according to their descending depth value. Denoting by $X_{[j]}$ the sample point associated with the $j$th largest depth value, we then obtain the sequence $\{X_{[1]}, X_{[2]}, \ldots, X_{[n]}\}$, which is the depth order statistics of the $X_i$'s, with $X_{[1]}$ being the deepest point and $X_{[n]}$ the most outlying. Here, a larger order is associated with a more outlying position w.r.t. the underlying distribution. Note that the order statistics derived from depth are different from the usual order statistics in the univariate case, since the latter are ordered from the smallest sample point to the largest, while the former is from the middle sample point and moves outward in all directions. Figure 1 helps demonstrate this feature of the depth ordering. The two plots show two random samples, each of size 500, drawn respectively from the standard bivariate normal and bivariate exponential distributions. For each plot, the "+" marks the deepest point, and the innermost convex hull encloses the deepest 20% of the sample points. The convex hull expands outward to enclose the next deepest 20% with each expansion. Those convex hulls determined by the decreasing depth value are nested, a feature indicating that the depth ordering is from the center outward. Note that the shape of the depth contours in those plots clearly reflects the underlying probabilistic geometry, relatively spherical in the normal case and fanning upper-right triangularly in the exponential case. The nested shell-like depth contours in Figure 1 also help illustrate the features of the multivariate spacings in Section 2.3.

FIG. 1. Depth contours for: (a) bivariate normal sample; (b) bivariate exponential sample.

We give the definition of Mahalanobis depth here, since it is also used in the simulation study later in Section 4.

DEFINITION 2.1. The Mahalanobis depth ([20]) at $x$ with respect to $F$ is defined as

$mD_F(x) = \big[1 + (x - \mu_F)' \Sigma_F^{-1} (x - \mu_F)\big]^{-1}$,
where $\mu_F$ and $\Sigma_F$ are the mean vector and dispersion matrix of $F$, respectively. The sample version of the Mahalanobis depth is obtained by replacing $\mu_F$ and $\Sigma_F$ with their sample estimates.

Different notions of depth are capable of capturing different aspects of the probabilistic geometry, and may lead to different ordering schemes. However, all the depth orderings are essentially from the center outward. We note that all the aforementioned depths are affine invariant, and so are their resulting orderings. The affine invariance is a desirable feature for the construction of multivariate spacings later in Section 2.3.

Note that geometric depths such as the half-space and the simplicial depths are completely nonparametric and moment-free, and they capture well the underlying probabilistic geometry of the data. Although the Mahalanobis depth captures the underlying geometry less well unless the geometry happens to be elliptical, it is computationally more feasible than geometric depths. Under elliptical distributions, the two geometric depths capture fairly well the elliptical structure in the large sample case and are close competitors to the Mahalanobis depth. Between the two geometric depths, the simplicial depth provides a finer ordering and produces fewer ties than the half-space depth. This point has been observed in [18]. For convenience, we will use the notation $D(\cdot)$ to express any valid notion of depth, unless a particular notion is to be emphasized.

Before we use depth order statistics to formulate multivariate spacings, we review the univariate spacings and some of their properties.

2.2. Univariate spacings. Let $X_1, X_2, \ldots, X_n$ be a random sample from a univariate continuous distribution $F$ which has the support $(a, b)$. Denote by $\{X_{[1]}, X_{[2]}, \ldots, X_{[n]}\}$ the order statistics of the $X_i$'s, namely $X_{[1]} \le X_{[2]} \le \cdots \le X_{[n]}$. Note that we will avoid introducing additional messy notation for differentiating the univariate setting from the multivariate one by using the same notation $X_{[j]}$ throughout the paper to indicate the $j$th order statistic of the sample $X_i$'s. It will generally be clear from the context whether the notation is intended for the univariate or for the multivariate setting. If needed, the phrase "univariate ordering" or "depth ordering" will be used to emphasize the univariate or the multivariate ordering.

Given the order statistics $X_{[1]} \le X_{[2]} \le \cdots \le X_{[n]}$, the spacings of the univariate sample refer to the intervals $L_i = (X_{[i-1]}, X_{[i]}]$, $i = 1, \ldots, n + 1$, with $X_{[0]} = a$ and $X_{[n+1]} = b$, or their lengths $D_i = X_{[i]} - X_{[i-1]}$. For convenience, we proceed to discuss the spacings by assuming that the distribution $F$ follows the uniform distribution on $(0, 1)$ [denoted by $F \sim U(0, 1)$], since the probability integral transformation $F(X)$ transforms the given sample into a sample from $U(0, 1)$. If $F \sim U(0, 1)$, then:

(i) $D_1 + D_2 + \cdots + D_{n+1} = 1$, and
(ii) the density function of $(D_1, D_2, \ldots, D_{n+1})$ is

$f(d_1, \ldots, d_{n+1}) = \begin{cases} n!, & \text{if } d_i \ge 0 \text{ and } d_1 + d_2 + \cdots + d_{n+1} = 1, \\ 0, & \text{otherwise.} \end{cases}$

Thus the density function $f$ is completely symmetrical in its arguments. [21] and [22] have observed that the uniform spacings $(D_1, D_2, \ldots, D_{n+1})$ can be viewed as exponential random variables proportional to their sum. Specifically, assume that $\{U_1, U_2, \ldots, U_{n+1}\}$ is a random sample from the exponential distribution with mean 1 [denoted as Exp(1)], and let

$S = U_1 + U_2 + \cdots + U_{n+1}$ and $W_i = U_i / S$, $i = 1, \ldots, n + 1$.

Then $(W_1, W_2, \ldots, W_{n+1})$ and $(D_1, D_2, \ldots, D_{n+1})$ are identically distributed. A similar property will appear during our formulation of multivariate spacings later.

2.3. Multivariate spacings. The main difficulty in extending the univariate spacings to higher dimensions lies in the lack of a proper ordering of the multivariate data. Applying the center-outward ordering induced from data depth, the multivariate spacings can be defined as follows.

Let $X_1, \ldots, X_n$ be a random sample from a continuous distribution $F$ in $\mathbb{R}^p$, $p \ge 2$. For a given data depth $D(\cdot)$, we calculate all $D_F(X_i)$'s, and obtain the depth order statistics $X_{[1]}, \ldots, X_{[n]}$ in descending depth values. Let $Z_i = D_F(X_i)$ and $Z_{[i]} = D_F(X_{[i]})$, for $i = 1, \ldots, n$. Note that $Z_{[1]} \ge \cdots \ge Z_{[n]}$, which are the reverse univariate order statistics of the $Z_i$'s. The matching indices in $Z_{[1]} \ge \cdots \ge Z_{[n]}$ and $X_{[1]}, \ldots, X_{[n]}$ are useful for tracking depth order statistics with their depth values in defining multivariate spacings and tolerance regions later. We now define the multivariate spacings as follows,

(2.3)  $MS_i = \{x: Z_{[i-1]} > D_F(x) \ge Z_{[i]}\}$, $i = 1, \ldots, n + 1$,
with $Z_{[0]} = \sup_x\{D_F(x)\}$ and $Z_{[n+1]} = 0$. The corresponding sample multivariate spacings are

(2.4)  $MS_i = \{x: Z_{[i-1]} > D_{F_n}(x) \ge Z_{[i]}\}$, $i = 1, \ldots, n$, and $MS_{n+1} = \{x: D_{F_n}(x) < Z_{[n]}\}$,

where $Z_{[0]} = \sup_x\{D_{F_n}(x)\}$, and $Z_{[1]} \ge \cdots \ge Z_{[n]}$ are the reverse order statistics of $Z_i = D_{F_n}(X_i)$, $i = 1, \ldots, n$.

Note that the multivariate spacings here define the "gap" between two consecutive depth order statistics as the shell-shaped region bridging the two order statistics, generalizing the interval linking the two consecutive order statistics in the univariate spacings. Consequently, the multivariate spacings derived from depth order statistics are center-outward layers of "shells." Figure 2 illustrates an example of multivariate spacings determined by a random sample of size five drawn from the bivariate normal distribution with mean (0, 0) and covariance matrix (3 1). The five data points are denoted by circles in the plot. The Mahalanobis depth is used to calculate depth values. The multivariate spacings include six regions: five center-outward layered shells and the outermost region. Note that the shells clearly reflect the elliptical shape of the underlying distribution. Plots of the multivariate spacings for the standard bivariate normal and exponential samples using the simplicial depth show layered shells with shapes similar to those of Figure 1. Again, the shape of the shells reflects the underlying geometric features.

Next, we observe a useful property regarding the coverage probabilities of the proposed multivariate spacings.
FIG. 2. Multivariate spacings for a bivariate normal sample.
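To make the sample spacings (2.4) concrete, the following sketch computes the sample Mahalanobis depth for bivariate data, sorts the sample into depth order statistics, and reports which spacing $MS_i$ a new point falls in. It is a minimal illustration under stated assumptions: the function names and the toy sample are hypothetical, the covariance uses the usual $n - 1$ divisor, and only the two-dimensional case is handled so that the matrix inverse can be written out by hand.

```python
def mahalanobis_depth_2d(sample):
    """Return a function x -> sample Mahalanobis depth for bivariate data."""
    n = len(sample)
    mx = sum(p[0] for p in sample) / n
    my = sum(p[1] for p in sample) / n
    sxx = sum((p[0] - mx) ** 2 for p in sample) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in sample) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / (n - 1)
    det = sxx * syy - sxy ** 2

    def depth(x):
        dx, dy = x[0] - mx, x[1] - my
        # Quadratic form (x - mean)' S^{-1} (x - mean), with S^{-1} written explicitly.
        q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        return 1.0 / (1.0 + q)

    return depth


def depth_order_statistics(sample, depth):
    """X_[1], ..., X_[n]: sample points sorted by descending depth."""
    return sorted(sample, key=depth, reverse=True)


def spacing_index(x, ordered, depth):
    """Index i in 1..n+1 such that x lies in the sample spacing MS_i of (2.4)."""
    d = depth(x)
    for i, point in enumerate(ordered, start=1):
        if d >= depth(point):  # first Z_[i] that does not exceed D(x)
            return i
    return len(ordered) + 1  # outermost region MS_{n+1}


# Hypothetical toy sample of size five.
sample = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2)]
depth = mahalanobis_depth_2d(sample)
ordered = depth_order_statistics(sample, depth)
print(ordered[0])                             # deepest point X_[1]
print(spacing_index((1, 0), ordered, depth))  # a point in the second shell
print(spacing_index((10, 10), ordered, depth))  # a point in the outermost region
```

Any other depth function, such as a simplicial depth, could be passed in place of `depth` without changing the ordering or spacing logic.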
THEOREM 2.1. Let $X_1, \ldots, X_n$ be a random sample from $F \in \mathbb{R}^p$. Assume that the notion of data depth used in deriving the multivariate spacings (2.3) is affine invariant. Then, the coverage probabilities of these multivariate spacings, namely $\{P_F(MS_1), \ldots, P_F(MS_{n+1})\}$, follow the same distribution as the univariate spacings $\{D_1, \ldots, D_{n+1}\}$.
PROOF. Let $Z_i = D_F(X_i)$ and $T_i = P_F(X: D_F(X) > Z_i)$, for $i = 1, \ldots, n$. Then the $T_i$'s can be considered as a random sample drawn from $U[0, 1]$, as seen in [19]. Let $T_{[1]} \le \cdots \le T_{[n]}$ be the order statistics of the $T_i$'s. It is clear that $T_{[i]} = P_F(X: D_F(X) > Z_{[i]})$. Therefore $P_F(MS_i) = T_{[i]} - T_{[i-1]}$, where $T_{[0]} = 0$ and $T_{[n+1]} = 1$, and thus the theorem follows. □

3. Tolerance region based on multivariate spacings. A confidence interval is used to provide an interval estimate for a parameter of interest with a stated confidence level. In production processes or quality control, it is customary to seek an interval that covers a certain proportion of the process distribution with a stated confidence as an assurance for meeting the required product specification. Intervals which fulfill this need are called tolerance intervals. In many practical situations, the quality of a product is specified by multiple characteristics of the product. To ensure the specifications of those multiple characteristics simultaneously, multivariate tolerance regions are needed. Tolerance intervals and regions are integral parts of applications in reliability theory and quality control. They allow the control of intended proportions of production to meet the specified requirements. A high percentage of the production outside this interval (or region) will result in a high loss or rework rate. Before we describe our proposed construction of tolerance regions, we briefly review the literature on tolerance intervals and regions.

If the underlying process distribution is known, either from the design of experiment or the knowledge gained over long experience, the tolerance intervals or regions usually can be established. For example, if the sample is drawn from a normal distribution $N(\mu, \sigma^2)$ with known mean $\mu$ and variance $\sigma^2$, and if we define tolerance intervals as those which contain $100\beta\%$ of the underlying distribution, then the shortest tolerance interval is simply $(\mu - Z_{(1-\beta)/2}\,\sigma,\ \mu + Z_{(1-\beta)/2}\,\sigma]$. Here $Z_{(1-\beta)/2}$ is the upper $(1 - \beta)/2$th quantile of the standard normal distribution. The constant $\beta$ is referred to as the tolerance level. This development can be extended in a straightforward manner to the setting of a $p$-dimensional normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$ [denoted as $N(\mu, \Sigma)$]. In this case, the corresponding smallest tolerance region can be constructed as an ellipsoid. It follows the elliptical level sets of the underlying multivariate normal distribution and satisfies

$E(t) = \{x: (x - \mu)' \Sigma^{-1} (x - \mu) \le t\}$,

where $t$ is the solution of the equation

$\int \cdots \int_{r^2 \le t} (2\pi)^{-p/2}\, e^{-(1/2) r^2}\, r^{p-1} \prod_{i=2}^{p-1} \sin^{p-i}\theta_i \; dr\, d\theta_1 \cdots d\theta_{p-1} = \beta.$
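As a quick sanity check of the equation for $t$, consider the bivariate case. This worked special case is an illustration added here, not taken from the paper; it assumes the integrand is the standard normal density written in polar coordinates. For $p = 2$ the product over $i$ is empty, and the equation reduces to a closed form:

```latex
\int_0^{2\pi}\!\int_0^{\sqrt{t}} (2\pi)^{-1} e^{-r^2/2}\, r \, dr \, d\theta
  \;=\; 1 - e^{-t/2} \;=\; \beta
\quad\Longrightarrow\quad
t = -2\log(1-\beta).
```

For example, $\beta = 0.95$ gives $t = -2\log(0.05) \approx 5.99$, which is the 0.95 quantile of the chi-square distribution with 2 degrees of freedom. More generally, since $(X - \mu)'\Sigma^{-1}(X - \mu)$ follows a chi-square distribution with $p$ degrees of freedom under $N(\mu, \Sigma)$, $t$ is the $\beta$-quantile of that distribution.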
If the distribution or its parameters are unknown, the following two definitions of tolerance regions have been considered and accepted as standard definitions; see [11] for example. Again, let $X_1, \ldots, X_n$ be a random sample from $F \in \mathbb{R}^p$, $p \ge 1$.

DEFINITION 3.1. $T(X_1, \ldots, X_n)$ is called a $\beta$-content tolerance interval (or region) at confidence level $\gamma$ if

(3.1)  $P\big(P_F(T(X_1, \ldots, X_n)) \ge \beta\big) = \gamma.$

DEFINITION 3.2. The region $T(X_1, \ldots, X_n)$ is called a $\beta$-expectation tolerance interval (or region) if

(3.2)  $E\big(P_F(T(X_1, \ldots, X_n))\big) = \beta.$
In the univariate case, if the normality assumption holds but the parameters are unknown, a tolerance interval can be constructed by

(3.3)  $(\bar{X} - cS,\ \bar{X} + cS]$,

where $\bar{X}$ and $S$ are respectively the sample mean and standard deviation. If Definition 3.1 is followed, [15] shows that $c$ can be approximated by

$\sqrt{\dfrac{(n-1)(1 + 1/n)\, Z^2_{(1-\beta)/2}}{\chi^2_{\gamma, n-1}}}.$

Here $\chi^2_{\gamma, n-1}$ is the $(1 - \gamma)$th quantile of the chi-square distribution with degree of freedom $(n - 1)$. When the normality assumption is uncertain, Wilks (in [29]) proposed to use the order statistics, $X_{[1]} \le \cdots \le X_{[n]}$, to construct the following nonparametric tolerance interval:

(3.4)  $T(X_1, \ldots, X_n) = (X_{[r]}, X_{[n-r+1]}]$,

where $r$ is a positive integer and $r < (n + 1)/2$. It has been shown that the coverage probability of this tolerance region, namely $P_F((X_{[r]}, X_{[n-r+1]}])$, follows a Beta distribution with parameters $(n - 2r + 1)$ and $2r$, denoted as Beta$(n - 2r + 1, 2r)$. Based on this observation, $r$ can be chosen to satisfy

(3.5)  $P(\text{Beta}(n - 2r + 1, 2r) > \beta) = \gamma$

or

(3.6)  $E(\text{Beta}(n - 2r + 1, 2r)) = \beta$

to meet the requirement in Definitions 3.1 or 3.2. Note that the tolerance interval in (3.4) is "symmetric" around the observed center point in the sense that the interval excludes an equal number of sample points from both tails. Wald in [26] considered a generalization of this symmetric tolerance interval, namely $(X_{[s]}, X_{[t]}]$, where $1 \le s < t \le n$. Clearly, this includes Wilks' interval as a special case, if $s = r$ and $t = n - r + 1$. Since the coverage probability of $(X_{[s]}, X_{[t]}]$ can be shown to follow Beta$(t - s, n - t + s + 1)$, the desired tolerance interval for Definitions 3.1 or 3.2 can be obtained by choosing $s$ and $t$ as the solutions of

(3.7)  $P(\text{Beta}(t - s, n - t + s + 1) > \beta) = \gamma$

or

(3.8)  $E(\text{Beta}(t - s, n - t + s + 1)) = \beta.$
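For Wilks' symmetric interval (3.4), the choice of $r$ under the $\beta$-content criterion (3.5) can be sketched as follows. This is an illustrative sketch rather than a procedure stated in the paper. It assumes the conservative convention of taking the largest $r$ whose tail probability is at least $\gamma$, since exact equality in (3.5) is rarely attainable for integer $r$. It also relies on the standard identity between Beta distributions with integer parameters and binomial probabilities, $P(\text{Beta}(a, b) > \beta) = P(\text{Bin}(a + b - 1, \beta) \le a - 1)$. The function names are hypothetical.

```python
import math


def binom_cdf(n, k, p):
    """P(Bin(n, p) <= k), computed by direct summation."""
    if k < 0:
        return 0.0
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(min(k, n) + 1))


def wilks_coverage_tail(n, r, beta):
    """P(Beta(n - 2r + 1, 2r) > beta), via P(Bin(n, beta) <= n - 2r)."""
    return binom_cdf(n, n - 2 * r, beta)


def wilks_r(n, beta, gamma):
    """Largest r < (n + 1) / 2 with P(coverage > beta) >= gamma, or None."""
    best = None
    r = 1
    while r < (n + 1) / 2:
        if wilks_coverage_tail(n, r, beta) >= gamma:
            best = r  # tail probability decreases in r, so keep the latest hit
        r += 1
    return best


print(wilks_r(10, 0.5, 0.9))    # a small sample can still support r = 1 here
print(wilks_r(10, 0.99, 0.99))  # None: n = 10 is too small for this requirement
```

A result of `None` signals that even the widest interval $(X_{[1]}, X_{[n]}]$ cannot guarantee the requested content at the requested confidence, so a larger sample is needed.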
Note that the solution for (3.7) or (3.8) may not be unique. Different applications may impose different additional desirable properties and thus constraints on the choice of $s$ and $t$. One intuitively appealing and desirable property is that the tolerance interval (or region) be minimal.

To achieve the minimal nonparametric tolerance interval, Chatterjee and Patra in [3] proposed a large-sample approach based on nonparametric density estimation, which yields asymptotically minimal tolerance intervals. The performance of this approach depends heavily on the methods used for density estimation and smoothing. Moreover, this approach tends to be overly conservative, as observed in [6]. When the underlying distribution is multi-modal, the tolerance interval obtained by this approach may be the union of disjoint intervals, which is not desirable in practice.

In the multivariate case, when $F$ is unknown, there have been efforts to develop nonparametric multivariate tolerance regions. For example, Wald in [26] extended Wilks' approach for constructing the tolerance intervals in the univariate case to the multivariate case by sequentially adapting it for each coordinate. Under this method, the shape of the resulting tolerance region would be limited to hyper-rectangles (or rectangular blocks) with faces parallel to the coordinate hyperplanes. Tukey in [24] generalized Wald's approach to any desired shape by introducing the concept of "statistically equivalent blocks." However, the construction of statistically equivalent blocks here requires choosing a priori an ordering function and thus can be somewhat arbitrary. Moreover, the shape of the constructed tolerance region based on this predetermined ordering function may be difficult to interpret or implement in practice. More discussion on statistically equivalent blocks is given later in Remark 3.1.
Chatterjee and Patra's approach for constructing asymptotically minimal tolerance intervals based on nonparametric density estimation is also applicable to the multivariate case, although it has the same drawbacks mentioned in the univariate case. Recently, using empirical process theory, Di Bucchianico, Einmahl and Mushkudiani [6] succeeded in developing an important new method for constructing the smallest nonparametric multivariate tolerance regions. Although this method possesses several desirable properties, it still has the following potential drawbacks: (i) it requires pre-specifying the shape of the tolerance region; (ii) the obtained tolerance region may not represent well the underlying geometry of the data; and (iii) the obtained region may not be connected. Finally, the computation involved in finding this smallest tolerance region can be quite intensive.

The above review of nonparametric multivariate tolerance regions shows that almost all existing approaches require specifying in advance the shape of the region. Most shapes specified, such as rectangular or elliptical, seem arbitrary and chosen mainly for mathematical convenience. If the shape is not chosen properly, these approaches may lead to gross misrepresentation of the underlying geometry of the data.

Recall that the nonparametric tolerance interval proposed in the Wilks approach [29] has the form $T(X_1, \ldots, X_n) = (X_{[r]}, X_{[n-r+1]}]$. From the point of view of spacings, Wilks' tolerance interval can be easily seen as the union of some suitable number of the univariate spacings,

(3.9)  $T(X_1, \ldots, X_n) = (X_{[r]}, X_{[n-r+1]}] = \bigcup_{i=r+1}^{n-r+1} L_i.$
Similarly, the proposed multivariate spacings derived in Section 2.3 can be used to form tolerance regions in multivariate settings. We now give details on such constructions, and discuss their properties.

3.1. Properties of tolerance regions: F is known. Consider the case where $F \in \mathbb{R}^p$ is known, $p \ge 2$. Recall that $X_{[1]}, \ldots, X_{[n]}$ denote the depth order statistics of the sample $X_i$'s and that $Z_{[1]}, \ldots, Z_{[n]}$ are their corresponding depth values. Recall also that $Z_i = D_F(X_i)$ and $Z_{[1]} \ge \cdots \ge Z_{[n]}$. Then we propose to form the tolerance region as the union of a suitable number of the inner spacings, which can be expressed as follows:

(3.10)  $O_{Z_{[r_n]}} = \bigcup_{i=1}^{r_n} MS_i = \{x: D_F(x) \ge Z_{[r_n]}\}$

for a suitably chosen $r_n$. Here $MS_i$ is the $i$th spacing, as defined in (2.3). Applying Theorem 2.1, the distribution of the coverage probability of the above tolerance region can be determined immediately, as shown in the following theorem.

THEOREM 3.1. The distribution of $P_F(O_{Z_{[r_n]}})$, the coverage probability of the tolerance region defined in (3.10), follows Beta$(r_n, n + 1 - r_n)$.

PROOF. Clearly, $O_{Z_{[r_n]}} = \bigcup_{i=1}^{r_n} MS_i$. It follows from Theorem 2.1 that $\sum_{i=1}^{r_n} P_F(MS_i)$ and $\sum_{i=1}^{r_n} D_i$ are identically distributed. Here $D_i$, $i = 1, \ldots, n + 1$, are the uniform spacings. Recalling the construction of the uniform spacings using exponential random variables given in Section 2.2, we see then that

$P_F(O_{Z_{[r_n]}})$,  $\sum_{i=1}^{r_n} D_i$  and  $\sum_{i=1}^{r_n} U_i \Big/ \sum_{j=1}^{n+1} U_j$

all have the same distribution. Here $U_1, U_2, \ldots, U_{n+1}$ are i.i.d. exponential random variables with mean 1, Exp(1). Since Exp(1) can also be viewed as the Gamma random variable Gamma(1, 1), $\sum_{i=1}^{r_n} U_i / \sum_{j=1}^{n+1} U_j$ can be easily shown to follow Beta$(r_n, n + 1 - r_n)$. □
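The last step of the proof can be checked numerically. The sketch below is a small Monte Carlo experiment added for illustration, not part of the paper: it simulates the ratio $\sum_{i \le r} U_i / \sum_{j \le n+1} U_j$ from i.i.d. Exp(1) variables and compares its empirical mean and variance with those of Beta$(r, n + 1 - r)$. The values $n = 9$, $r = 7$, the seed and the number of replications are arbitrary choices, and the function names are hypothetical.

```python
import random


def simulate_coverage(n, r, reps=20000, seed=0):
    """Draws of sum(U_1..U_r) / sum(U_1..U_{n+1}) with U_j i.i.d. Exp(1)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        u = [rng.expovariate(1.0) for _ in range(n + 1)]
        draws.append(sum(u[:r]) / sum(u))
    return draws


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) random variable."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


n, r = 9, 7
draws = simulate_coverage(n, r)
sim_mean = sum(draws) / len(draws)
sim_var = sum((d - sim_mean) ** 2 for d in draws) / (len(draws) - 1)
theo_mean, theo_var = beta_mean_var(r, n + 1 - r)
print(sim_mean, theo_mean)  # both should be close to r / (n + 1)
print(sim_var, theo_var)
```

Agreement of the first two moments does not prove the distributional claim, but it is a cheap check that the Beta parameters have been read off correctly.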
To finalize constructing the proposed tolerance region in (3.10), we need to identify a suitable $r_n$ which can satisfy Definitions 3.1 or 3.2. Following Theorem 3.1, this is equivalent to finding $r_n$ to meet the following criteria,

(3.11)  $P(\text{Beta}(r_n, n + 1 - r_n) > \beta) = \gamma$

or

(3.12)  $E(\text{Beta}(r_n, n + 1 - r_n)) = \beta.$

For (3.12), $r_n$ can be easily solved as $r_n = \beta(n + 1)$, since $E(\text{Beta}(a, b)) = a/(a + b)$. For (3.11), it is not easy to find an analytical solution. Alternatively, we can obtain an approximation of the solution using the asymptotic result stated in Theorem 3.2 below.

REMARK 3.1 (Multivariate spacings as statistically equivalent blocks). For a multivariate sample of size $n$, Tukey (in [24]) considered a partition of the sample space into $n + 1$ disjoint blocks as statistically equivalent blocks if the following conditions are satisfied:

(a) the coverages of the $(n + 1)$ blocks add up to 1;
(b) the joint distribution of the coverages of the $(n + 1)$ blocks is completely symmetrical;
(c) if the coverages of the $(n + 1)$ blocks are taken as barycentric coordinates on an $n$-simplex, the distribution over the simplex is uniform;
(d) the sum of the coverages of any $k$ preselected blocks of the $(n + 1)$ follows Beta$(k, n - k + 1)$.

From the proof of Theorem 3.1, we can see that our multivariate spacings $\{MS_1, \ldots, MS_{n+1}\}$ satisfy the above conditions and can be viewed as statistically equivalent blocks. Note that the blocks in our multivariate spacings are automatically determined by the given data. In contrast, the statistically equivalent blocks considered in [24] and other follow-ups all need to decide on the shape of the blocks before forming the blocks. Therefore, we view our multivariate spacings as an ideal data-driven realization of statistically equivalent blocks. Moreover, the inherited property of data depth allows our statistically equivalent blocks to follow more closely the data structure and also be completely nonparametric.
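THEOREM 3.2. As $n \to \infty$, if $r_n$ satisfies

$\dfrac{r_n - n\beta}{\sqrt{n\beta(1 - \beta)}} \to z_\gamma,$

where $z_\gamma$ is the $\gamma$-quantile of the standard normal distribution [i.e., $\Phi(z_\gamma) = \gamma$], then

$P\big(P_F(O_{Z_{[r_n]}}) > \beta\big) \to \gamma.$

PROOF. Recall that $Y_i = P_F(X: D_F(X) > Z_i)$, $i = 1, \ldots, n$, with the order statistics $Y_{[1]} \le \cdots \le Y_{[n]}$. Let $\omega_n = \#\{i: Y_i < \beta\}$. Then we obtain $P(P_F(O_{Z_{[r_n]}}) > \beta) = P(\omega_n < r_n)$. Furthermore, since the $Y_i$'s can be viewed as an i.i.d. sample from $U[0, 1]$, $\omega_n$ follows the binomial distribution with parameter $(n, \beta)$. Therefore, as $n \to \infty$, we have

$P\big(P_F(O_{Z_{[r_n]}}) > \beta\big) = P(\omega_n < r_n) = P\left(\dfrac{\omega_n - n\beta}{\sqrt{n\beta(1 - \beta)}} < \dfrac{r_n - n\beta}{\sqrt{n\beta(1 - \beta)}}\right) \to \gamma.$ □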
3.2. Properties of tolerance regions: F is unknown. If $F$ is unknown, the tolerance region is then constructed from the sample spacings in a similar fashion as in (3.10). More specifically, recall that $Z_i = D_{F_n}(X_i)$, $i = 1, \ldots, n$, and that $Z_{[1]} \ge \cdots \ge Z_{[n]}$ are the descending estimated depth values corresponding to the depth order statistics $X_{[1]}, \ldots, X_{[n]}$. The tolerance region is then the union of a suitable number of the inner sample spacings. More precisely, the proposed tolerance region can be expressed as

(3.13)  $O_{[r_n]} = \bigcup_{i=1}^{r_n} MS_i = \{x: D_{F_n}(x) \ge Z_{[r_n]}\}$,

where $MS_i$ is the $i$th sample spacing, as defined in (2.4).

To establish the asymptotic properties for $O_{[r_n]}$ which are analogous to those for $O_{Z_{[r_n]}}$, we require the following conditions on the data depth $D_{F_n}(\cdot)$ used in the derivation of the spacings.

(i) If $F$ is absolutely continuous, then $D_{F_n}(x)$ is uniformly consistent almost surely, that is, as $n \to \infty$,

(3.14)  $d_n = \sup_x |D_{F_n}(x) - D_F(x)| \to 0$  a.s.
(ii) If $F$ is an elliptic distribution with the location-scatter parameter $(\mu, \Sigma)$ (i.e., its density assumes the form $f(x) = |\Sigma|^{-1/2} g((x - \mu)' \Sigma^{-1} (x - \mu))$), then its elliptic contour can be expressed as $e(x) = (x - \mu)' \Sigma^{-1} (x - \mu)$. In this case, the level sets (or contours) of $D_F(x)$ are also in the form of $\{x: e(x) = c\}$ for some $c$, which implies that $D_F(x)$ is a strictly monotone function of $e(x)$. Furthermore, for any $c > 0$,

(3.15)  $P(X: D_F(X) = c) = 0.$
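The discussion of (i) and (ii) under the simplicial depth can be found in [16]. Further discussions of depth contours can be found in [13, 17] and [31]. Under the assumptions (i) and (ii), we now present the main results of the section.

THEOREM 3.3. Assume that conditions (i) and (ii) hold for the depth $D_{F_n}(\cdot)$ used in deriving the spacings. For any $\varepsilon > 0$, if the sequences $r[n;1]$, $r[n;2]$ and $r[n;3]$ ($1 \le r[n;j] \le n$, $j = 1, 2, 3$) satisfy, as $n \to \infty$,

$$\sqrt{n}\left(\frac{r[n;1]}{n} - (\beta + \varepsilon)\right) \to t_\gamma\sqrt{\beta(1-\beta)},$$

$$\sqrt{n}\left(\frac{r[n;2]}{n} - (\beta - \varepsilon)\right) \to t_\gamma\sqrt{\beta(1-\beta)},$$

$$\frac{r[n;3]}{n} \to \beta,$$

then

$$\lim_{n\to\infty} P\bigl(P_F(O^n_{z[r[n;1]]}) \ge \beta\bigr) \ge \gamma,$$

$$\lim_{n\to\infty} P\bigl(P_F(O^n_{z[r[n;2]]}) \ge \beta\bigr) \le \gamma$$

and

$$\lim_{n\to\infty} E\bigl(P_F(O^n_{z[r[n;3]]})\bigr) = \beta.$$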
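The proof of Theorem 3.3 is somewhat involved and is given in the Appendix.

REMARK 3.2. Since $\varepsilon$ in Theorem 3.3 can be arbitrarily small, to obtain $r_n$ satisfying $P(P_F(O^n_{z[r_n]}) \ge \beta) = \gamma$, we may in practice simply take $\varepsilon = 0$ and calculate $r_n$ by solving

$$r_n = n\beta + t_\gamma\sqrt{n\beta(1-\beta)}.$$

If $r_n$ is not an integer, we use $\lfloor r_n \rfloor$ or $\lceil r_n \rceil$, depending on which of the following is closer to $\gamma$,

$$P\bigl(\mathrm{Beta}(\lfloor r_n \rfloor,\, n + 1 - \lfloor r_n \rfloor) \ge \beta\bigr) \quad\text{and}\quad P\bigl(\mathrm{Beta}(\lceil r_n \rceil,\, n + 1 - \lceil r_n \rceil) \ge \beta\bigr).$$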
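The rank selection in Remark 3.2 can be illustrated with a short sketch. It is not part of the original paper: the function names `beta_tail` and `choose_rank` are hypothetical, and the Beta tail probability is computed through the standard identity $P(\mathrm{Beta}(r, n+1-r) \ge \beta) = P(\mathrm{Binomial}(n, \beta) \le r - 1)$, which is the same binomial argument used in the proof of Theorem 3.2.

```python
import math
from statistics import NormalDist


def beta_tail(r, n, beta):
    """P(Beta(r, n + 1 - r) >= beta), evaluated as P(Binomial(n, beta) <= r - 1)."""
    total = 0.0
    for k in range(r):
        # log of the Binomial(n, beta) probability mass at k (log scale avoids overflow)
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(beta) + (n - k) * math.log1p(-beta))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def choose_rank(n, beta, gamma):
    """Hypothetical helper: solve r_n = n*beta + t_gamma*sqrt(n*beta*(1-beta)),
    then pick floor or ceiling, whichever has Beta tail probability closer to gamma."""
    t_gamma = NormalDist().inv_cdf(gamma)  # gamma-quantile of the standard normal
    r_real = n * beta + t_gamma * math.sqrt(n * beta * (1 - beta))
    candidates = sorted(r for r in {math.floor(r_real), math.ceil(r_real)} if 1 <= r <= n)
    if not candidates:
        raise ValueError("no admissible rank between 1 and n")
    return min(candidates, key=lambda r: abs(beta_tail(r, n, beta) - gamma))


if __name__ == "__main__":
    r = choose_rank(300, 0.90, 0.95)
    print(r, beta_tail(r, 300, 0.90))
```

For $n = 300$, $\beta = 0.90$ and $\gamma = 0.95$ the real-valued solution is about 278.5, so the sketch compares the ranks 278 and 279.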

3.3. Asymptotic minimum property. So far, we have justified the proposed tolerance regions according to Definitions 3.1 and 3.2. Next, we will show that under a certain class of distributions (including elliptical distributions), the proposed tolerance regions are asymptotically minimal. This property is clearly desirable.

Asymptotically minimal tolerance regions were first considered in [3] by Chatterjee and Patra. Assume that the sample $X_1, X_2, \ldots, X_n$ is drawn from $F(\cdot) \in \mathbb{R}^p$ which has a density function $f(\cdot)$. Let $\lambda(\cdot)$ denote the $p$-dimensional Lebesgue measure. Consider

$$G_f(v) = P_F(f(X) \le v).$$

Assume that all level sets of $f$ have Lebesgue measures zero, namely $\lambda\{x: f(x) = v\} = 0$ for any $v$. Chatterjee and Patra considered the following $\beta$-content tolerance region formed by density level sets:

(3.16) $R_{f,\beta} = \{x: f(x) \ge \tau_{f,1-\beta}\},$

where $\tau_{f,1-\beta}$ is the $(1-\beta)$-quantile of the random variable $f(X)$. In other words, $\tau_{f,1-\beta}$ is a solution of $G_f(v) = 1 - \beta$. It can be shown that $P_F(R_{f,\beta}) = \beta$, and that, among all subsets whose probability content with respect to $F$ is at least $\beta$, the subset $R_{f,\beta}$ is minimal in the sense of having the smallest Lebesgue measure.
DEFINITION 3.3. A sequence of $\beta$-content tolerance regions $\{S_n\}$ is called asymptotically minimal if

$$\lambda(S_n \Delta R_{f,\beta}) \xrightarrow{P} 0 \quad\text{as } n \to \infty.$$

Here $(A \Delta B)$ indicates the symmetric difference between sets $A$ and $B$.

In the finite sample case, [3] replaced $f$ with a density estimate to obtain a sample tolerance region, and showed the asymptotic minimum property. Clearly, the quality of the obtained tolerance region depends on the density estimation approach used.

Under the approach with data depth $D(\cdot)$, we consider $G_D(v) = P_F(D_F(X) \le v)$. Denote by $\eta_{1-\beta}$ the $(1-\beta)$-quantile of the random variable $D_F(X)$. In other words, $G_D(\eta_{1-\beta}) = P_F(D_F(X) \le \eta_{1-\beta}) = 1 - \beta$. Clearly,

(3.17) $R_{D,\beta} = \{X: D_F(X) \ge \eta_{1-\beta}\}$

is the true depth-based $\beta$-content tolerance region. Definition 3.3 can then be modified for the approach using depth $D(\cdot)$ as:
DEFINITION 3.4. A sequence of $\beta$-content tolerance regions $\{S_n\}$ is called asymptotically minimal w.r.t. the depth function $D(\cdot)$ if

$$\lambda(S_n \Delta R_{D,\beta}) \xrightarrow{P} 0 \quad\text{as } n \to \infty.$$
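In the next two theorems we show that our proposed tolerance regions $O_{Z[r_n]}$ and $O^n_{z[r_n]}$ are asymptotically minimal w.r.t. the chosen depth.

THEOREM 3.4. If condition (3.15) holds for the underlying depth $D(\cdot)$, $O_{Z[r_n]}$ and $O^n_{z[r_n]}$ are asymptotically minimal w.r.t. $D(\cdot)$. Specifically, for $r_n$ satisfying

$$\sqrt{n}\left(\frac{r_n}{n} - \beta\right) \to t_\gamma\sqrt{\beta(1-\beta)},$$

we have

$$\lambda(O_{Z[r_n]} \Delta R_{D,\beta}) \xrightarrow{P} 0 \quad\text{and}\quad \lambda(O^n_{z[r_n]} \Delta R_{D,\beta}) \xrightarrow{P} 0.$$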
The proof of this theorem is given in the Appendix.

Note that, for elliptical distributions, condition (3.15) holds for all the depth notions mentioned in Section 2.1, and thus

$$R_{D,\beta} = \{X: D_F(X) \ge \eta_{1-\beta}\} = \{x: f(x) \ge \tau_{f,1-\beta}\} = R_{f,\beta}.$$

Consequently, Theorem 3.4 leads to the corollary below which implies that our proposed tolerance regions are asymptotically minimal under elliptical distributions.

COROLLARY 3.1. For elliptical distributions, under condition (3.15), we have, as $n \to \infty$,

$$\lambda(O_{Z[r_n]} \Delta R_{f,\beta}) \xrightarrow{P} 0, \qquad \lambda(O^n_{z[r_n]} \Delta R_{f,\beta}) \xrightarrow{P} 0.$$
4. Simulation and comparison studies. In this section, we present some simulation studies to illustrate the performance of our tolerance regions $O_{Z[r_n]}$ and $O^n_{z[r_n]}$. Assume that $F$ is absolutely continuous. The simulation procedure is outlined in the following steps.

Step 1. Generate a random sample $\{X_1, X_2, \ldots, X_n\}$ from $F$. Calculate the depths of the $X_i$'s with respect to the data cloud and identify the $r_n$th largest depth (i.e., $z_{[r_n]}$) as the threshold for forming our tolerance region, where

$$r_n = \begin{cases} n\beta + t_\gamma\sqrt{n\beta(1-\beta)}, & \text{if (3.1) is required,} \\ (n+1)\beta, & \text{if (3.2) is required.} \end{cases}$$

Step 2. Generate another random sample, $\{X_1^*, X_2^*, \ldots, X_n^*\}$, from $F$. Calculate the depth of each $X_i^*$ with respect to the original sample, $\{X_1, X_2, \ldots, X_n\}$, and obtain the proportion of the $X_i^*$'s which assume depth value greater than the threshold $z_{[r_n]}$ obtained in Step 1. This proportion is denoted as $\hat\beta$.

Step 3. Repeat Step 2 $m$ times and use the average of the $m$ $\hat\beta$'s obtained in this manner as an approximation of $P_F(O^n_{z[r_n]})$. Denote this average by $\bar\beta$. If a $\beta$-content tolerance region in (3.1) is sought, we check whether or not $\bar\beta \ge \beta$. Let $I_{\{\bar\beta \ge \beta\}}$ be 1 if the event $\{\bar\beta \ge \beta\}$ occurs, and 0 otherwise. If a $\beta$-expectation tolerance region in (3.2) is sought, we simply record the $\bar\beta$.

Step 4. Repeat Steps 1-3 sufficiently many, say $M$, times. For a $\beta$-content tolerance region, we estimate the confidence level $\gamma$ by $\hat\gamma = \sum_{i=1}^{M} I_{\{\bar\beta_i \ge \beta\}}/M$. For a $\beta$-expectation tolerance region, we estimate $\beta$ simply by the average of the $M$ $\bar\beta$'s, namely $\sum_{i=1}^{M} \bar\beta_i/M$.
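Steps 1-4 can be sketched in code as follows. This is an illustrative sketch only and makes several assumptions that are not in the procedure above: it uses the sample Mahalanobis depth in the usual form $1/(1 + (x-\bar X)'S^{-1}(x-\bar X))$ instead of the simplicial depth (because it has a closed form), it draws from a standard bivariate normal, it rounds $r_n$ to the nearest integer instead of applying the Beta comparison of Remark 3.2, and it uses much smaller $m$ and $M$ than the study reported below. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def fit_depth(sample):
    """Return x -> sample Mahalanobis depth 1 / (1 + d^2) for bivariate data."""
    n = len(sample)
    mx = sum(p[0] for p in sample) / n
    my = sum(p[1] for p in sample) / n
    sxx = sum((p[0] - mx) ** 2 for p in sample) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in sample) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / (n - 1)
    det = sxx * syy - sxy ** 2

    def depth(p):
        dx, dy = p[0] - mx, p[1] - my
        # squared Mahalanobis distance via the explicit 2x2 inverse covariance
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        return 1.0 / (1.0 + d2)

    return depth


def draw(n, rng):
    """Assumed F: standard bivariate normal."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


def one_replication(n, r, m, rng):
    sample = draw(n, rng)                                  # Step 1
    depth = fit_depth(sample)
    threshold = sorted((depth(p) for p in sample), reverse=True)[r - 1]
    props = []
    for _ in range(m):                                     # Steps 2 and 3
        fresh = draw(n, rng)
        props.append(sum(depth(p) > threshold for p in fresh) / n)
    return sum(props) / m                                  # average coverage


def simulate(n=100, beta=0.9, gamma=0.95, m=20, M=30, seed=0):
    rng = random.Random(seed)
    t_gamma = NormalDist().inv_cdf(gamma)
    r_content = max(1, min(n, round(n * beta + t_gamma * math.sqrt(n * beta * (1 - beta)))))
    r_expect = max(1, min(n, round((n + 1) * beta)))
    cov_content = [one_replication(n, r_content, m, rng) for _ in range(M)]  # Step 4
    cov_expect = [one_replication(n, r_expect, m, rng) for _ in range(M)]
    gamma_hat = sum(c >= beta for c in cov_content) / M
    beta_hat = sum(cov_expect) / M
    return gamma_hat, beta_hat


if __name__ == "__main__":
    print(simulate())
```

With such small $m$ and $M$ the estimates are noisy; the sketch is meant to show the structure of the procedure, not to reproduce the reported figures.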
Throughout our simulation study, we set $\beta = 90\%$, $\gamma = 95\%$, $m = 100$, $M = 1000$ and $n = 300$ and $1000$. The simplicial depth is used to calculate all depth values unless specified otherwise. The results are presented in Table 1. From Table 1, we can see that all the estimates are fairly close to the nominal levels. We also present in Table 1, within brackets, the simulation results using the approach given in [6] by Di Bucchianico, Einmahl and Mushkudiani. The coverage from this approach is consistently lower than the nominal value, and also, generally speaking, the difference between the achieved and nominal coverage is larger than that of ours. Thus, our proposed tolerance region is better in terms of achieving the desired tolerance level. Moreover, as observed in [6], if the dimension of the underlying distribution increases, the approach there needs an adjustment to reflect such a change in the target nominal value to prevent the achieved coverage from falling too much below. The adjustment suggested in [6] seems somewhat ad hoc. Adding all these observations to the fact that our approach does not require additional assumptions (e.g., shape of the tolerance region), our approach clearly yields more favorable nonparametric multivariate tolerance regions.
TABLE 1
The achieved confidence levels of the proposed $\beta$-content and $\beta$-expectation tolerance regions when $\gamma = 95\%$ and $\beta = 90\%$

                         beta-content             beta-expectation
F                        n = 300    n = 1000      n = 300            n = 1000
Bivariate Normal         0.954      0.949         0.90131 [0.877]    0.90005 [0.887]
Bivariate Cauchy         0.963      0.961         0.90036 [0.862]    0.90061 [0.863]
Bivariate Exponential    0.941      0.943         0.90043 [0.885]    0.89985 [0.890]

Results in [ ] are obtained using the approach in [6].
FIG. 3. Tolerance region for: (a) a bivariate normal sample; (b) a bivariate exponential sample.
To help visualize the outcome of our constructions, we present further in Figure 3 our proposed tolerance region for bivariate normal and exponential distributions. Under each distribution, the sample size is 500 and the tolerance regions shown are aiming for nominal values $\beta = 90\%$ and $\gamma = 95\%$. Note that since there is no explicit formula for the simplicial depth, there is no explicit expression for the proposed tolerance region $O^n_{z[r_n]} = \{X: D_{F_n}(X) > z_{[r_n]}\}$. (Here $r_n$ is to be determined according to Remark 3.2.) In practice, we can simply present the convex hull spanning all the sample points which achieve higher depth value than $z_{[r_n]}$ as the estimated tolerance region. The algorithm provided in [1] can help determine such convex hulls.

As seen from the plots, our tolerance regions can capture the underlying geometric shapes of the data. For the bivariate normal distribution, the region has the elliptical shape. For the bivariate exponential distribution, the region has a triangular shape fanning upper-right. Overall, our tolerance region focuses more on the central part of the data and also follows the expansion of the probability mass. For example, for the bivariate exponential distribution, our tolerance region does not include the observations near the origin, since their positions are relatively outlying with respect to the center of the distribution. In contrast, the tolerance region obtained by using the method proposed in [6] must include those observations since they have high density. However, these observations may not be acceptable in practice, since they may be considered extreme according to the underlying distribution. This point can be made more pronounced by incurring a small perturbation to create a thinning gap between the points near the origin and the rest of the data. Therefore, the design of our tolerance region in representing a central region of the data is naturally built to prevent the region from including observations which are likely to be extreme.
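The convex-hull approximation described above can be sketched for bivariate data as follows. The sketch is an illustration under stated assumptions: it takes the sample depths as given (computed by any depth function), and it uses Andrew's monotone chain algorithm rather than the Quickhull algorithm of [1], simply because monotone chain is short enough to write out in full. The names `convex_hull`, `inside_hull` and `hull_region` are hypothetical.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull, q):
    """True if q lies in or on a counter-clockwise convex polygon with >= 3 vertices."""
    k = len(hull)
    if k < 3:
        return False
    return all(cross(hull[i], hull[(i + 1) % k], q) >= 0 for i in range(k))


def hull_region(sample, depths, r):
    """Convex hull of the sample points whose depth exceeds the r-th largest depth."""
    threshold = sorted(depths, reverse=True)[r - 1]
    inner = [p for p, d in zip(sample, depths) if d > threshold]
    return convex_hull(inner)


if __name__ == "__main__":
    # Toy data with made-up depth values, for illustration only.
    sample = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (5, 5)]
    depths = [0.5, 0.5, 0.5, 0.5, 0.9, 0.1]
    hull = hull_region(sample, depths, r=6)
    print(hull, inside_hull(hull, (1, 1)), inside_hull(hull, (5, 5)))
```

Checking a new observation against the hull is only an approximation of comparing its depth to the threshold, since the hull lies inside the exact region.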
As discussed in Section 3.1, when the underlying distribution $F$ is known, we can construct the tolerance region which satisfies exactly the preset requirement in (3.1) or (3.2). When $F$ is unknown, we propose a method for constructing the tolerance region based on the sample only and develop its asymptotic properties. From the asymptotic point of view, the proposed method also satisfies the preset requirement. To assess the performance of this proposed tolerance region in the setting of a finite sample size, we conduct another simulation study. In the same bivariate normal setting as above, we use the Mahalanobis depth to construct the tolerance regions separately under $F$ known and $F$ unknown. One advantage in using the Mahalanobis depth is that it has a closed form for both population and sample versions, and thus we can obtain the exact proposed tolerance regions. Figure 4 shows the two tolerance regions. The dashed ellipse is the one when $F$ is known, which is the true tolerance region. The solid ellipse is the proposed tolerance region when $F$ is unknown. The two regions almost coincide with each other, which clearly implies that the finite sample performance of the proposed tolerance region is quite satisfactory. The convex hull in Figure 4 is formed by only the sample points which have higher depth value than $z_{[r_n]}$. (This generally reduces tremendously the computational effort, and is more practical, especially when the depth itself does not have a simple closed form.) It is not surprising that the convex hull is located inside the solid ellipse. Note that the difference between these two regions is not significant. Therefore, although the convex hull formed by those central points is not the exact proposed tolerance region, it is presented here to show that it can be a viable alternative that provides a practical solution.

FIG. 4. Tolerance regions for a bivariate normal sample: unknown F vs. known F.

To use the tolerance region in terms of certifying specifications, we determine whether a new observation is in the tolerance region by first calculating its depth with respect to the given sample and then simply comparing the depth value to the threshold $z_{[r_n]}$. This is a relatively straightforward task in practice.

5. Concluding remarks and future research. In this paper, we introduced multivariate spacings based on the ordering derived from data depth. They satisfy all properties one would expect of a notion of spacings. Moreover, they are nonparametric and they reflect well the geometry of the underlying distribution. We show how to use these spacings to construct tolerance regions for multivariate distributions. The construction of our tolerance regions can be viewed as a multidimensional generalization of the Wilks' method.

Given that our spacings are derived from data depth, the resulting tolerance regions are always connected and naturally located in the "central" region of the data set. This is an important property in applications: in practice, specifications of products are not given in disconnected patches and a single production line is generally designed to produce continuous measurement around a target value. The connected tolerance region ensures that the capability of production processes can be achieved. This point was also discussed and illustrated with Figure 3 in Section 4.

One important direction for the applications of univariate spacings is nonparametric inference. It includes many existing rank tests and goodness-of-fit tests. A survey of these tests as well as relevant references can be found in [22]. In forthcoming papers, we shall explore our multivariate spacings in the development of nonparametric inference. We generalize many of the existing approaches on univariate spacings. We also study multivariate distributional characterizations using
our multivariate spacings.
APPENDIX

PROOF OF THEOREM 3.3. To prove Theorem 3.3, we need the following two lemmas.

LEMMA A.1. For any $r$,

$$|z_{[r]} - Z_{[r]}| \le d_n = \sup_x |D_{F_n}(x) - D_F(x)| \to 0 \quad\text{a.s.}$$

PROOF. From the definition of $d_n$, we have

$$|z_i - Z_i| = |D_{F_n}(X_i) - D_F(X_i)| \le d_n, \qquad i = 1, \ldots, n.$$

Then for any $r$,

$$\#\{i: z_i > Z_{[r]} + d_n\} \le \#\{i: Z_i > Z_{[r]}\} < r.$$

Therefore, $z_{[r]} - d_n \le Z_{[r]}$. Similarly, we can show that $Z_{[r]} - d_n \le z_{[r]}$. The claim of the lemma thus follows. $\Box$
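The inequality in Lemma A.1 is a general fact about order statistics: if two sequences differ by at most $d$ entry by entry, then their $r$th largest values also differ by at most $d$, for every $r$. A minimal numerical check is sketched below. The perturbation model (uniform noise of size `eps`) and the function name `check_lemma` are assumptions made for illustration; they stand in for the true and estimated depth values and are not taken from the paper.

```python
import random


def check_lemma(n=200, eps=0.05, seed=0):
    """Return (largest gap between matched order statistics, largest pointwise gap)."""
    rng = random.Random(seed)
    z_true = [rng.random() for _ in range(n)]                  # stands in for Z_i
    z_hat = [z + rng.uniform(-eps, eps) for z in z_true]       # stands in for z_i
    pointwise = max(abs(a - b) for a, b in zip(z_hat, z_true))
    sorted_true = sorted(z_true, reverse=True)
    sorted_hat = sorted(z_hat, reverse=True)
    ordered = max(abs(a - b) for a, b in zip(sorted_hat, sorted_true))
    return ordered, pointwise


if __name__ == "__main__":
    ordered_gap, pointwise_gap = check_lemma()
    print(ordered_gap <= pointwise_gap)
```

Here the pointwise gap over the sample plays the role of $d_n$; the supremum over all $x$ in (3.14) is at least as large, so the bound of the lemma follows.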
LEMMA A.2. Suppose that $a_n$ and $b_n$ are two sequences of random variables such that, for some random variable $a$ taking values on $[0, \infty]$, $a_n \to a$ and $b_n \to a$ on a positive measure subset of the sample space, say $S$, as $n \to \infty$. Then under the assumptions (i)-(ii),

$$P_F\{O_{a_n} \Delta O_{b_n}\} \to 0 \quad\text{a.s. on the set } S,$$

where $O_{a_n} = \{x: D_{F_n}(x) > a_n\}$, $O_{b_n} = \{x: D_F(x) > b_n\}$, and $A \Delta B = (A \cup B) \setminus (A \cap B)$. Lemma A.2 with proof is given in [13].

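We now proceed with the proof of Theorem 3.3. Assume that $r_n/n \to q$ as $n \to \infty$; then the consistency property of a sample quantile shows that $Z_{[r_n]} \to \eta_q$ a.s., where $\eta_q$ is the upper $q$th quantile of $Z = D_F(X)$. Clearly, Lemma A.1 immediately implies $z_{[r_n]} \to \eta_q$ a.s. Let $a_n = z_{[r_n]}$ and $b_n = Z_{[r_n]}$. Following Lemma A.2, we have

(A.1) $P_F(O^n_{z[r_n]} \Delta O_{Z[r_n]}) \to 0$ a.s.

Denote $A_n = P_F(O^n_{z[r_n]})$, $B_n = P_F(O^n_{z[r_n]} \Delta O_{Z[r_n]})$ and $C_n = P_F(O_{Z[r_n]})$. Then

$$\begin{aligned} P(C_n \ge \beta + \varepsilon) &\le P(A_n + B_n \ge \beta + \varepsilon) \\ &= P(\{A_n + B_n \ge \beta + \varepsilon\} \cap \{B_n \le \varepsilon\}) + P(\{A_n + B_n \ge \beta + \varepsilon\} \cap \{B_n > \varepsilon\}) \\ &\le P(A_n \ge \beta) + P(B_n > \varepsilon) \qquad \forall \varepsilon > 0. \end{aligned}$$

From (A.1), we have, as $n \to \infty$, $P(B_n > \varepsilon) \to 0$. Therefore,

$$\lim_{n\to\infty} P(A_n \ge \beta) \ge \lim_{n\to\infty} P(C_n \ge \beta + \varepsilon) \qquad \forall \varepsilon > 0.$$

If $r_n$ satisfies

$$\sqrt{n}\left(\frac{r_n}{n} - (\beta + \varepsilon)\right) \to t_\gamma\sqrt{\beta(1-\beta)},$$

so that $\lim_{n\to\infty} P(C_n \ge \beta + \varepsilon) = \gamma$, then

$$\lim_{n\to\infty} P(A_n \ge \beta) \ge \gamma.$$

Similarly, we have

$$P(A_n \ge \beta) \le P(C_n + B_n \ge \beta) \le P(C_n \ge \beta - \varepsilon) + P(B_n > \varepsilon),$$

and thus

$$\lim_{n\to\infty} P(A_n \ge \beta) \le \lim_{n\to\infty} P(C_n \ge \beta - \varepsilon) \qquad \forall \varepsilon > 0.$$

If $r_n$ satisfies

$$\sqrt{n}\left(\frac{r_n}{n} - (\beta - \varepsilon)\right) \to t_\gamma\sqrt{\beta(1-\beta)},$$

so that $\lim_{n\to\infty} P(C_n \ge \beta - \varepsilon) = \gamma$, then $\lim_{n\to\infty} P(A_n \ge \beta) \le \gamma$.

Regarding the $\beta$-expectation tolerance region, we now have, following (A.1), $B_n \to 0$ a.s. as $n \to \infty$. Since $B_n \le 1$, $B_n$ is uniformly integrable. Thus $\lim_{n\to\infty} E(B_n) = 0$. Since $E(A_n) \le E(C_n + B_n) \le E(C_n) + E(B_n)$, and $\lim_{n\to\infty} E(C_n) = \beta$ if $r_n$ satisfies $r_n/n \to \beta$, then

(A.2) $\limsup_{n\to\infty} E(A_n) \le \beta.$

Similarly, $E(C_n) \le E(A_n + B_n) \le E(A_n) + E(B_n)$. Then we have

(A.3) $\beta \le \liminf_{n\to\infty} E(A_n).$

Combining (A.2) and (A.3), we obtain $\lim_{n\to\infty} E(A_n) = \beta$, and hence the proof of Theorem 3.3. $\Box$

PROOF OF THEOREM 3.4. Since $r_n/n \to \beta$,

$$G_D(Z_{[r_n]}) = P_F(D_F(X) \le Z_{[r_n]}) \xrightarrow{P} 1 - \beta,$$

which implies

(A.4) $Z_{[r_n]} \xrightarrow{P} \eta_{1-\beta}.$

Moreover, the following

$$\begin{aligned} O_{Z[r_n]} \Delta R_{D,\beta} &= \{X: D_F(X) > Z_{[r_n]},\ D_F(X) < \eta_{1-\beta}\} \cup \{X: D_F(X) \le Z_{[r_n]},\ D_F(X) \ge \eta_{1-\beta}\} \\ &\subset \{X: \eta_{1-\beta} - |Z_{[r_n]} - \eta_{1-\beta}| \le D_F(X) \le \eta_{1-\beta} + |Z_{[r_n]} - \eta_{1-\beta}|\} \end{aligned}$$

immediately implies that

(A.5) $\lambda(O_{Z[r_n]} \Delta R_{D,\beta}) \le \lambda\{X: \eta_{1-\beta} - |Z_{[r_n]} - \eta_{1-\beta}| \le D_F(X) \le \eta_{1-\beta} + |Z_{[r_n]} - \eta_{1-\beta}|\} = \delta(|Z_{[r_n]} - \eta_{1-\beta}|),$

where $\delta(u) = \lambda\{X: \eta_{1-\beta} - u \le D_F(X) \le \eta_{1-\beta} + u\}$. By assumption (3.15), $\delta(0) = \lambda\{X: D_F(X) = \eta_{1-\beta}\} = 0$. Also $\delta(u)$ is right continuous at 0 because of the continuity of $D_F(x)$. Therefore, (A.4) implies that $\delta(|Z_{[r_n]} - \eta_{1-\beta}|) \xrightarrow{P} 0$. Following (A.5), we finally obtain $\lambda(O_{Z[r_n]} \Delta R_{D,\beta}) \xrightarrow{P} 0$.

The rest of the proof regarding $O^n_{z[r_n]}$ can be derived similarly. Following Lemma A.1 and the definition of $d_n$, we obtain

$$\begin{aligned} O^n_{z[r_n]} \Delta R_{D,\beta} &= \{X: D_{F_n}(X) > z_{[r_n]},\ D_F(X) < \eta_{1-\beta}\} \cup \{X: D_{F_n}(X) \le z_{[r_n]},\ D_F(X) \ge \eta_{1-\beta}\} \\ &\subset \{X: D_F(X) > Z_{[r_n]} - 2d_n,\ D_F(X) < \eta_{1-\beta}\} \cup \{X: D_F(X) \le Z_{[r_n]} + 2d_n,\ D_F(X) \ge \eta_{1-\beta}\} \\ &\subset \{X: \eta_{1-\beta} - |Z_{[r_n]} - \eta_{1-\beta}| - 2d_n \le D_F(X) \le \eta_{1-\beta} + |Z_{[r_n]} - \eta_{1-\beta}| + 2d_n\}. \end{aligned}$$

Finally, since $Z_{[r_n]} \xrightarrow{P} \eta_{1-\beta}$ and $d_n \to 0$, we have

$$\lambda(O^n_{z[r_n]} \Delta R_{D,\beta}) \le \lambda\{X: \eta_{1-\beta} - |Z_{[r_n]} - \eta_{1-\beta}| - 2d_n \le D_F(X) \le \eta_{1-\beta} + |Z_{[r_n]} - \eta_{1-\beta}| + 2d_n\} = \delta(|Z_{[r_n]} - \eta_{1-\beta}| + 2d_n) \xrightarrow{P} 0.$$

This completes the proof. $\Box$

Acknowledgment. Jun Li would also like to acknowledge the graduate assistantship provided by the Department of Statistics, Rutgers University.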
REFERENCES

[1] Barber, C. B., Dobkin, D. P. and Huhdanpaa, H. (1996). The Quickhull algorithm for convex hulls. ACM Trans. Math. Software 22 469-483. MR1428265
[2] Beirlant, J., Dierckx, G., Guillou, A. and Stǎricǎ, C. (2002). On exponential representations of log-spacings of extreme order statistics. Extremes 5 157-180. MR1965977
[3] Chatterjee, S. K. and Patra, N. K. (1980). Asymptotically minimal multivariate tolerance sets. Calcutta Statist. Assoc. Bull. 29 73-93. MR0596720
[4] Cressie, N. (1979). An optimal statistic based on higher order gaps. Biometrika 66 619-627. MR0556744
[5] Darling, D. (1953). On a class of problems related to the random division of an interval. Ann. Math. Statist. 24 239-253. MR0058891
[6] Di Bucchianico, A., Einmahl, J. H. J. and Mushkudiani, N. A. (2001). Smallest nonparametric tolerance regions. Ann. Statist. 29 1320-1343. MR1873333
[7] Donoho, D. (1982). Breakdown properties of multivariate location estimators. Ph.D. qualifying paper, Harvard Univ.
[8] Donoho, D. and Gasko, M. (1992). Breakdown properties of location estimates based on half-space depth and projected outlyingness. Ann. Statist. 20 1803-1827. MR1193313
[9] Einmahl, J. H. J. and van Zuijlen, M. (1988). Strong bounds for weighted empirical distribution functions based on uniform spacings. Ann. Probab. 16 108-125. MR0920258
[10] Fraser, D. (1951). Sequentially determined statistically equivalent blocks. Ann. Math. Statist. 22 372-381. MR0043425
[11] Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Charles Griffin, London. MR0317473
[12] Hall, P. (1986). On powerful distributional tests based on sample spacings. J. Multivariate Anal. 19 201-224. MR0853053
[13] He, X. and Wang, G. (1997). Convergence of depth contours for multivariate datasets. Ann. Statist. 25 495-504. MR1439311
[14] Hodges, J. (1955). A bivariate sign test. Ann. Math. Statist. 26 523-527. MR0070921
[15] Howe, W. G. (1969). Two-sided tolerance limits for normal populations - some improvements. J. Amer. Statist. Assoc. 64 610-620.
[16] Liu, R. (1990). On a notion of data depth based on random simplices. Ann. Statist. 18 405-414. MR1041400
[17] Liu, R., Parelius, J. and Singh, K. (1999). Multivariate analysis by data depth: Descriptive statistics, graphics and inference (with discussion). Ann. Statist. 27 783-858. MR1724033
[18] Liu, R. and Singh, K. (1992). Ordering directional data: Concepts of data depth on circles and spheres. Ann. Statist. 20 1468-1484. MR1186260
[19] Liu, R. and Singh, K. (1993). A quality index based on data depth and multivariate rank tests. J. Amer. Statist. Assoc. 88 252-260. MR1212489
[20] Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proc. Nat. Inst. Sci. India 12 49-55.
[21] Moran, P. (1947). The random division of an interval. J. Roy. Statist. Soc. Ser. B 9 92-98. MR0023002
[22] Pyke, R. (1965). Spacings. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 27 395-449. MR0216622
[23] Stahel, W. (1981). Robust Schaetzungen: Infinitesmale Optimalitaet und Schaetzungen von Kovarianzmatrizen. Ph.D. thesis, ETH Zurich.
[24] Tukey, J. (1947). Nonparametric estimation. II. Statistical equivalent blocks and tolerance regions - the continuous case. Ann. Math. Statist. 18 529-539. MR0023033
[25] Tukey, J. (1975). Mathematics and picturing data. Proceedings of the 1975 International Congress of Mathematics 2 523-531. MR0426989
[26] Wald, A. (1943). An extension of Wilks' method for setting tolerance limits. Ann. Math. Statist. 14 45-55. MR0007965
[27] Weiss, L. (1957). Asymptotic power of certain tests of fit based on sample spacings. Ann. Math. Statist. 28 783-786. MR0096327
[28] Wells, M., Jammalamadaka, S. R. and Tiwari, R. (1993). Large sample theory of spacings statistics for tests of fit for the composite hypothesis. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 55 189-203. MR1210431
[29] Wilks, S. (1941). Determination of sample sizes for setting tolerance limits. Ann. Math. Statist. 12 91-96. MR0004451
[30] Zuo, Y. (2003). Projection based depth functions and associated medians. Ann. Statist. 31 1460-1490. MR2012822
[31] Zuo, Y. and Serfling, R. (2000). Structural properties and convergence results for contours of sample depth functions. Ann. Statist. 28 483-499. MR1790006

Department of Statistics
University of California
Riverside, California 92521-0138
USA
E-mail: jun.li@ucr.edu

Department of Statistics
Rutgers University, Hill Center
Piscataway, New Jersey 08854-8019
USA
E-mail: rliu@stat.rutgers.edu

