

Unmasking Multivariate Outliers and Leverage Points


PETER J. ROUSSEEUW and BERT C. VAN ZOMEREN*

Detecting outliers in a multivariate point cloud is not trivial, especially when there are several outliers. The classical identification method does not always find them, because it is based on the sample mean and covariance matrix, which are themselves affected by the outliers. That is how the outliers get masked. To avoid the masking effect, we propose to compute distances based on very robust estimates of location and covariance. These robust distances are better suited to expose the outliers. In the case of regression data, the classical least squares approach masks outliers in a similar way. Also here, the outliers may be unmasked by using a highly robust regression method. Finally, a new display is proposed in which the robust regression residuals are plotted versus the robust distances. This plot classifies the data into regular observations, vertical outliers, good leverage points, and bad leverage points. Several examples are discussed.

KEY WORDS: Breakdown point; Leverage diagnostic; Mahalanobis distance; Minimum volume ellipsoid; Residual plot.

1. IDENTIFICATION OF MULTIVARIATE OUTLIERS

Outliers are observations that do not follow the pattern of the majority of the data. Outliers in a multivariate point cloud can be hard to detect, especially when the dimension p exceeds 2, because then we can no longer rely on visual perception. A classical method is to compute the Mahalanobis distance

MD_i = sqrt((x_i - T(X)) C(X)^{-1} (x_i - T(X))^t)   (1)

for each point x_i. Here, T(X) is the arithmetic mean of the data set X and C(X) is the usual sample covariance matrix. The distance MD_i should tell us how far x_i is from the center of the cloud, taking into account the shape of the cloud as well. It is well known that this approach suffers from the masking effect, by which multiple outliers do not necessarily have a large MD_i. This is due to the fact that T(X) and C(X) are not robust: a small cluster of outliers will attract T(X) and will inflate C(X) in its direction. Therefore, it seems natural to replace T(X) and C(X) in (1) by robust estimators.

Campbell (1980) proposed to insert M estimators for T(X) and C(X), which marked an important improvement. Unfortunately, the breakdown point of M estimators of multivariate location (i.e., the fraction of outliers they can tolerate) is at most 1/(p + 1), so it goes down when there are more coordinates in which outliers can occur [see, e.g., chapter 5 of Hampel, Ronchetti, Rousseeuw, and Stahel (1986)].

As a further step, one may consider estimators of multivariate location and covariance that have a high breakdown point. The first such estimator was proposed by Stahel (1981) and Donoho (1982). Here we will use the minimum volume ellipsoid estimator (MVE) introduced by Rousseeuw (1985). For T(X) we take the center of the minimum volume ellipsoid covering half of the observations, and C(X) is determined by the same ellipsoid (multiplied by a correction factor to obtain consistency at multinormal distributions). A technical definition of the MVE estimator is given in the Appendix, together with two approximate algorithms for its computation. We denote by RD_i the robust distances obtained by inserting the MVE estimates T(X) and C(X) in (1).

Figure 1 illustrates the distinction between the classical and robust estimates. It is a log-log plot of brain weight versus body weight for 28 species. The raw data (before taking logarithms) can be found in Rousseeuw and Leroy (1987, p. 58), where they were used for a different purpose. In Figure 1 we see that the majority of the data follow a clear pattern, with a few exceptions. In the lower right region there are three dinosaurs (observations 6, 16, and 25) with a small brain and a heavy body, and in the upper left area we find the human and the rhesus monkey (observations 14 and 17) with a relatively high brain weight. The 97.5% tolerance ellipse obtained from the classical estimates (dashed line) is blown up by these outliers, and contains all animals but the largest dinosaur. The tolerance ellipse based on the MVE is much narrower (solid line) and does not include the outliers.

The second column of Table 1 shows the classical Mahalanobis distances for these observations. The only outlier outside the classical tolerance ellipse (number 25) yields the only MD_i exceeding the cutoff value sqrt(chi^2_{2,.975}) = 2.72. On the other hand, the robust distances RD_i in the rightmost column identify the exceptional values (all observations with values larger than 2.72 are underscored).

Of course, in two dimensions we can still look at a plot of the data to find the outliers. Algorithms become really necessary in three and more dimensions. For instance, consider the explanatory variables of the stackloss data (Brownlee 1965). This point cloud (with p = 3 and n = 21) contains several outliers. The second column of Table 2 gives the classical MD_i, all of which stay beneath sqrt(chi^2_{3,.975}) = 3.06. The largest MD_i (of observation 17) is only 2.70. The robust distances in the next column, however, clearly pinpoint four outliers (cases 1, 2, 3, and 21).
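For readers who want to reproduce this kind of comparison, the sketch below computes both distance types in Python. It is not the authors' Fortran program: it uses scikit-learn's MinCovDet (the MCD estimator discussed in the Appendix) as a readily available stand-in for the MVE, and the function name and cutoff handling are our own.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet  # MCD: a high-breakdown relative of the MVE


def classical_and_robust_distances(X):
    """Return classical Mahalanobis distances MD_i, robust distances RD_i,
    and the 97.5% chi-squared cutoff used in the text."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape

    # Classical distances, equation (1): sample mean and sample covariance.
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    md = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

    # Robust distances: high-breakdown location/scatter (MCD) in place of the MVE.
    mcd = MinCovDet(random_state=0).fit(X)   # default support uses h = (n + p + 1) // 2 points
    rd = np.sqrt(mcd.mahalanobis(X))

    cutoff = np.sqrt(chi2.ppf(0.975, df=p))  # 2.72 for p = 2, 3.06 for p = 3
    return md, rd, cutoff
```

Observations whose robust distance exceeds the cutoff correspond to the underscored entries in Tables 1 and 2.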
* Peter J. Rousseeuw is Professor, U.I.A., Vesaliuslaan 24, B-2650 Edegem, Belgium. Bert C. van Zomeren is Teaching Assistant, Faculty of Mathematics and Informatics, Delft University of Technology, Julianalaan 132, 2628 BL Delft, The Netherlands. The authors are grateful to David Donoho, John Tukey, and two referees for interesting discussions and helpful comments.

© 1990 American Statistical Association. Journal of the American Statistical Association, September 1990, Vol. 85, No. 411, Theory and Methods.



Table 2. Mahalanobis Distances (MD_i) and Robust Distances (RD_i) for the Stackloss Data, Along With Diagonal Elements of the Hat Matrix

  i   MD_i   RD_i   h_ii
  1   2.25   5.23   .30
  2   2.32   5.27   .32
  3   1.59   4.01   .17
  4   1.27    .84   .13
  5    .30    .80   .05
  6    .77    .78   .08
  7   1.85    .64   .22
  8   1.85    .64   .22
  9   1.36    .83   .14
 10   1.75    .64   .20
 11   1.47    .58   .16
 12   1.84    .79   .22
 13   1.48    .55   .16
 14   1.78    .64   .21
 15   1.69   2.23   .19
 16   1.29   2.11   .13
 17   2.70   2.07   .41
 18   1.50   2.09   .16
 19   1.59   2.29   .17
 20    .81    .64   .08
 21   2.18   3.30   .28

NOTE: Distances exceeding the cutoff value sqrt(chi^2_{3,.975}) = 3.06 are underscored.

Figure 1. Plot of Log Brain Weight Versus Log Body Weight, With 97.5% Tolerance Ellipse Based on the Classical Mean and Covariance (dashed line) and on the Robust Estimator (solid line).

The data set of Hawkins, Bradu, and Kass (1984, table 4) yields a prime example of the masking effect. The first three variables form a point cloud with n = 75 and p = 3. It is known that cases 1 to 14 are outliers, but the classical MD_i in Table 3 do not reveal this. The only MD_i larger than sqrt(chi^2_{3,.975}) = 3.06 belong to observations 12 and 14, which mask all the others. On the other hand, the robust distances in the same table do expose the 14 outliers in a single blow.

The robust estimates and distances have been computed by means of a Fortran program, which can be obtained from us. The computation time is of course larger than that of the classical method, but it is quite feasible (even on a PC) and the user obtains much information at once. We would like to stress that the user does not have to choose any tuning constants in advance and, in fact, the examples in this article were obtained from routine application of the program. Note that we do not necessarily want to delete the outliers; it is only our purpose to find them, after which the user may decide whether they are to be kept, deleted, or corrected, depending on the situation.

Table 1. Classical Mahalanobis Distances (MD_i) and Robust Distances (RD_i) for the Brain Weight Data

  i   MD_i   RD_i
  1   1.01    .54
  2    .70    .54
  3    .30    .40
  4    .38    .63
  5   1.15    .74
  6   2.64   6.83
  7   1.71   1.59
  8    .71    .64
  9    .86    .48
 10    .80   1.67
 11    .69    .69
 12    .87    .50
 13    .68    .52
 14   1.72   3.39
 15   1.76   1.14
 16   2.37   6.11
 17   1.22   2.72
 18    .20    .67
 19   1.86   1.19
 20   2.27   1.24
 21    .83    .47
 22    .42    .54
 23    .26    .29
 24   1.05   1.95
 25   2.91   7.26
 26   1.59   1.04
 27   1.58   1.19
 28    .40    .75

NOTE: Distances exceeding the cutoff value sqrt(chi^2_{2,.975}) = 2.72 are underscored.

Remark. Detecting outliers turns out to be hardest when n/p is relatively small. In such a case a few data points may be nearly collinear by chance, thereby completely determining the MVE. This is caused by the emptiness of multivariate space (the "curse of dimensionality"). As a rule of thumb we recommend applying the MVE when there are at least five observations per dimension, so n/p > 5.

Robust covariance matrices can be used to detect outliers in several kinds of multivariate analysis, such as principal components (Campbell 1980; Devlin, Gnanadesikan, and Kettenring 1981) and canonical correlation and correspondence analysis (Karnel 1988).

2. IDENTIFICATION OF LEVERAGE POINTS IN REGRESSION

In linear regression the cases are of the type (x_i, y_i), where x_i is p-dimensional and the response y_i is one-dimensional. Cases for which x_i is far away from the bulk of the x_i in the data we call leverage points. Leverage points occur frequently when the x_i are observational, unlike "designed" situations with fixed x_i.

Table 3. Mahalanobis Distances (MD_i) and Robust Distances (RD_i) for the Hawkins-Bradu-Kass Data, Along With Diagonal Elements of the Hat Matrix

  i   MD_i    RD_i   h_ii        i   MD_i   RD_i   h_ii
  1   1.92   16.20   .063       39   1.27   1.34   .035
  2   1.86   16.62   .060       40   1.11    .55   .030
  3   2.31   17.65   .086       41   1.70   1.48   .052
  4   2.23   18.18   .081       42   1.77   1.74   .055
  5   2.10   17.82   .073       43   1.87   1.18   .061
  6   2.15   16.80   .076       44   1.42   1.82   .041
  7   2.01   16.82   .068       45   1.08   1.25   .029
  8   1.92   16.44   .063       46   1.34   1.70   .038
  9   2.22   17.71   .080       47   1.97   1.65   .066
 10   2.33   17.21   .087       48   1.42   1.37   .041
 11   2.45   20.23   .094       49   1.57   1.27   .047
 12   3.11   21.14   .144       50    .42    .83   .016
 13   2.66   20.16   .109       51   1.30   1.19   .036
 14   6.38   22.38   .564       52   2.08   1.61   .072
 15   1.82    1.54   .058       53   2.21   2.41   .079
 16   2.15    1.88   .076       54   1.41   1.26   .040
 17   1.39    1.03   .039       55   1.23    .66   .034
 18    .85     .73   .023       56   1.33   1.21   .037
 19   1.15     .59   .031       57    .83    .93   .023
 20   1.59    1.49   .048       58   1.40   1.31   .040
 21   1.09     .87   .030       59    .59    .96   .018
 22   1.55     .90   .046       60   1.89   1.89   .062
 23   1.09     .94   .029       61   1.68   1.31   .051
 24    .97     .83   .026       62    .76   1.22   .021
 25    .80    1.26   .022       63   1.29   1.17   .036
 26   1.17     .86   .032       64    .97   1.14   .026
 27   1.45    1.35   .042       65   1.15   1.40   .031
 28    .87    1.00   .024       66   1.30    .78   .036
 29    .58     .72   .018       67    .63    .37   .019
 30   1.57    1.97   .047       68   1.55   1.64   .046
 31   1.84    1.43   .059       69   1.07   1.17   .029
 32   1.31     .95   .036       70   1.00   1.04   .027
 33    .98     .73   .026       71    .64    .64   .019
 34   1.18    1.42   .032       72   1.05    .52   .028
 35   1.24    1.26   .034       73   1.47   1.14   .043
 36    .85     .86   .023       74   1.65    .96   .050
 37   1.83    1.26   .059       75   1.90   1.99   .062
 38    .75     .92   .021

NOTE: Distances exceeding the cutoff value sqrt(chi^2_{3,.975}) = 3.06 are underscored.

Leverage points may be quite difficult to detect, however, when the x_i have dimension higher than 2, because then we are exactly in the situation described previously in Section 1.

In the usual multiple linear regression model given by y = X theta + e, people often use the diagonal elements h_ii of the hat matrix H = X(X^t X)^{-1} X^t as diagnostics to identify leverage points. Unfortunately, the hat matrix, like the classical Mahalanobis distance, suffers from the masking effect. This can be explained by realizing that there exists a monotone relation between the h_ii and the MD_i of the x_i:

h_ii = (MD_i)^2 / (n - 1) + 1/n.   (2)

Therefore, the h_ii do not necessarily detect the leverage points, contrary to what is commonly believed. Many authors even define leverage in terms of the h_ii which, in our opinion, confuses cause and effect: the cause is the fact that some x_i are outlying, whereas the h_ii are merely some (unreliable) diagnostics trying to find those points. As an illustration let us look at Table 2, which shows the h_ii for the stackloss data. The largest h_ii belongs to observation 17, whereas the RD_i identify observations 1, 2, 3, and 21. Another example is the Hawkins-Bradu-Kass data set (Table 3). We know that the first 14 observations are leverage points, but only 12, 13, and 14 have large h_ii. Therefore, we propose to use the robust distances RD_i of the x_i as leverage diagnostics, because they are less easily masked than the h_ii.

Saying that (x_i, y_i) is a leverage point refers only to the outlyingness of x_i, but does not take the response y_i into account. If (x_i, y_i) lies far from the plane corresponding to the majority of the data, we say that it is a bad leverage point. Such a point is very harmful because it attracts the classical least squares regression (hence the word "leverage") and even tilts it. On the other hand, if (x_i, y_i) does fit the linear relation it will be called a good leverage point, because it improves the precision of the regression coefficients.

To distinguish between good and bad leverage points we have to consider y_i as well as x_i, and we also need to know the linear pattern set by the majority of the data. This calls for a high-breakdown regression estimator, such as least median of squares (LMS), defined by

minimize over theta:  median_{i=1,...,n} r_i^2(theta)   (3)

(Rousseeuw 1984), where r_i(theta) = y_i - x_i theta is the residual of the ith observation.
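Least median of squares is not available in most standard libraries, so the following rough sketch approximates (3) by exact fits to random elemental subsets, in the spirit of the resampling algorithms discussed in the Appendix; the constant k appearing in the scale estimate (4) below is set here to 1.4826 with a finite-sample factor, which is a common choice but an assumption on our part rather than something fixed by the text.

```python
import numpy as np


def lms_regression(X, y, n_subsets=3000, rng=None):
    """Approximate least median of squares (LMS) regression by resampling
    elemental subsets; an illustrative sketch, not the authors' PROGRESS code."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])  # add intercept
    n, q = X.shape                                    # q = number of regression coefficients

    best_theta, best_med = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=q, replace=False)    # exact fit to q points
        try:
            theta = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue                                   # singular subset, skip it
        med = np.median((y - X @ theta) ** 2)          # objective (3)
        if med < best_med:
            best_med, best_theta = med, theta

    # Scale estimate (4); k = 1.4826 * (1 + 5/(n - q)) is an assumed, commonly used choice.
    sigma = 1.4826 * (1 + 5.0 / (n - q)) * np.sqrt(best_med)
    residuals = (y - X @ best_theta) / sigma           # standardized LMS residuals r_i / sigma
    return best_theta, sigma, residuals
```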



The LMS estimate is affine equivariant and has the maximal breakdown point. After computing theta-hat one also calculates the corresponding scale estimate, given by

sigma-hat = k sqrt(median_{i=1,...,n} r_i^2(theta-hat)),   (4)

where k is a positive constant. The standardized LMS residuals r_i/sigma-hat can then be used to indicate regression outliers, that is, points that deviate from the linear pattern of the majority (Rousseeuw 1984).

Figure 2 illustrates our terminology in an example of simple regression. The majority of the data are regular observations, indicated by (a). Points (b) and (d) deviate from the linear pattern and hence are called regression outliers, but (c) is not. Both (c) and (d) are leverage points, because their x_i value is outlying. Therefore, we say that (c) is a good leverage point and (d) is a bad leverage point. The observation (b) is called a vertical outlier, because it is a regression outlier but not a leverage point.

The robust distances in Tables 2 and 3 indicate leverage points but cannot distinguish between good and bad ones, because the y_i are not used. On the other hand, the LMS residual plots in chapter 3 of Rousseeuw and Leroy (1987) pinpoint regression outliers without telling which ones are leverage points. Therefore, it seems like a good idea to construct a new display in which the robust residuals r_i/sigma-hat are plotted versus the robust distances RD_i. In Figure 3 this is done for the stackloss data. Points to the right of the vertical borderline through sqrt(chi^2_{3,.975}) = 3.06 are leverage points, whereas points outside the horizontal tolerance band [-2.5, 2.5] are regression outliers. In this example the four points with the largest RD_i are also regression outliers, so they are bad leverage points. Figure 3 also contains a vertical outlier (observation 4), which is a regression outlier with RD_i below the cutoff. Our cutoff values are to some extent arbitrary, but in the plot we can recognize the boundary cases: observation 21 is not very far away in x-space, whereas case 2 is only a mild regression outlier.

Figure 3. Plot of Robust Residuals Versus Robust Distances RD_i for the Stackloss Data.

A referee asked to compare this display with its classical counterpart, which would plot the usual least squares residuals versus the nonrobust Mahalanobis distances MD_i. This plot is given in Figure 4 for the same data. It does not reveal any leverage points or regression outliers, because all of the points stay between the lines and only observations 21 and 17 come close to being identified. Because of (2), things would not improve when replacing the MD_i by the h_ii.

Figure 5 is the plot of robust residuals versus robust distances for the Hawkins-Bradu-Kass data. It immediately shows that there are 14 leverage points, of which 4 are good and 10 are bad. A glance at Figure 5 reveals the important features of these data, which are hard to discover otherwise. This type of plot presents a visual classification of the data into four categories: the regular observations with small RD_i and small r_i/sigma-hat, the vertical outliers with small RD_i and large r_i/sigma-hat, the good leverage points with large RD_i and small r_i/sigma-hat, and the bad leverage points with large RD_i and large r_i/sigma-hat. Note that a single diagnostic can never be sufficient for this fourfold classification!
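The fourfold classification behind Figures 3 and 5 is easy to mechanize once robust distances and standardized robust residuals are available. The sketch below uses the same cutoffs as in the text (plus or minus 2.5 and sqrt(chi^2_{p,.975})); the function name and labels are ours.

```python
import numpy as np
from scipy.stats import chi2


def classify_observations(robust_distance, robust_residual, p):
    """Label each case as regular, vertical outlier, good leverage, or bad leverage."""
    rd = np.asarray(robust_distance, dtype=float)
    res = np.abs(np.asarray(robust_residual, dtype=float))

    x_cut = np.sqrt(chi2.ppf(0.975, df=p))   # vertical borderline (3.06 when p = 3)
    r_cut = 2.5                              # horizontal tolerance band [-2.5, 2.5]

    labels = np.where((rd <= x_cut) & (res <= r_cut), "regular",
             np.where((rd <= x_cut) & (res > r_cut), "vertical outlier",
             np.where((rd > x_cut) & (res <= r_cut), "good leverage", "bad leverage")))
    return labels
```

Plotting robust_residual against robust_distance, with a vertical line at x_cut and horizontal lines at plus and minus 2.5, reproduces the layout of the display.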

Figure 2. Simple Regression Example With (a) Regular Observations, (b) Vertical Outlier, (c) Good Leverage Point, and (d) Bad Leverage Point.

Figure 4. Plot of Least Squares Residuals Versus Classical Mahalanobis Distances MD_i for the Stackloss Data.



Robust residuals may be used to assign weights to observations or to suggest data transformations (Carroll and Ruppert 1988; Rousseeuw and Leroy 1987). They are much better suited to this than least squares residuals, because least squares tries to produce normal-looking residuals even when the data themselves are not normal.

The combination of the robust residuals with the RD_i also offers another advantage. As pointed out by Atkinson (1986), it may sometimes happen that LMS regression produces a relatively large residual at a good leverage point, because of small variations in the regression coefficients. The amplitude of this effect is roughly proportional to the RD_i, so the problem can only occur in the section on the right side of our new display. This is a distinct improvement over the usual plot of standardized residuals versus the index of the observation, where one does not see whether a given residual corresponds to an x_i at the center or to a leverage point.

Figure 5. Plot of Robust Residuals Versus Robust Distances RD_i for the Hawkins-Bradu-Kass Data.

3. CONCLUSIONS AND OUTLOOK

In this article we have proposed using distances based on high-breakdown estimators to detect outliers in a multivariate point cloud. This is in line with our previous suggestion to identify regression outliers by looking at residuals from a high-breakdown fit. Combining these tools leads to the robust diagnostic plot of Figures 3 and 5. Although we do not claim this approach to be a panacea, it has worked very well for detecting outliers in many real-data examples not described here. Our general impression is that most data sets are further away from the usual assumptions (multivariate normality, approximate linearity) than is commonly assumed. In actual practice our methods have yielded some new and surprising results, for example, in a consulting firm fitting economic models to stock exchange data. Another application was to mining (Chork, in press), in which the outliers reflect mineralizations hidden below the surface, so their detection was the most important part of the analysis.

We would like to stress that we are not advocating that one simply remove the outliers. Instead we consider our plots of robust residuals and/or distances as a mere starting point of the analysis. In some cases the plots may tell us to change the model. In other cases we may be able to go back to the original data and explain where the outliers come from and, perhaps, to correct their values.

For the moment we are still carrying out simulations to compare different algorithms, study the distribution of robust distances, and so on. It turns out that it does not matter so much which high-breakdown estimator is used when the purpose is to detect outliers, because then statistical efficiency is less important than robustness.

Further research is needed to address situations where some of the explanatory variables are discrete, such as 0-1 dummies. The same is true for functionally related explanatory variables (e.g., polynomial terms), because then one cannot expect the majority of the x_i to form a roughly ellipsoidal shape. Nonlinear regression with high breakdown point has been addressed by Stromberg (1989).

Presently we are developing a program called ROMA (which stands for RObust Multivariate Analysis), incorporating both robust regression and robust location/covariance, as well as other techniques such as robust principal components.

Finally, we would like to apologize to all of the people whose work we did not cite. We did not attempt to write a review article (nor was it originally meant to be a discussion paper). Some reviews of the relevant literature on outliers and robustness can be found in Beckman and Cook (1983), Gnanadesikan (1977), Hampel et al. (1986), and Rousseeuw and Leroy (1987).

APPENDIX: METHODS AND ALGORITHMS

Suppose that we have a data set X = (x_1, ..., x_n) of n points in p dimensions and we want to estimate its "center" and "scatter" by means of a row vector T(X) and a matrix C(X). We say that the estimators T and C are affine equivariant when

T(x_1 A + b, ..., x_n A + b) = T(x_1, ..., x_n) A + b

and

C(x_1 A + b, ..., x_n A + b) = A^t C(x_1, ..., x_n) A   (A.1)

for any row vector b and any nonsingular p-by-p matrix A. The sample mean and the sample covariance matrix

T(X) = (1/n) sum_{i=1}^n x_i   and   C(X) = (1/(n - 1)) sum_{i=1}^n (x_i - T(X))^t (x_i - T(X))   (A.2)

are affine equivariant but not robust, because even a single outlier can change them to an arbitrary extent. The minimum volume ellipsoid estimator (MVE) is defined as the pair (T, C), where T(X) is a p-vector and C(X) is a positive-semidefinite p-by-p matrix such that the determinant of C is minimized subject to

#{i; (x_i - T) C^{-1} (x_i - T)^t <= a^2} >= h,   (A.3)

where h = [(n + p + 1)/2], in which [q] is the integer part of q. The number a^2 is a fixed constant, which can be chosen as chi^2_{p,.50} when we expect the majority of the data to come from a normal distribution.
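To make the definition concrete, the following small sketch (illustrative only) evaluates a candidate pair (T, C) against constraint (A.3), using a^2 = chi^2_{p,.50} as suggested in the text; any search over candidates would keep the feasible pair with the smallest det(C).

```python
import numpy as np
from scipy.stats import chi2


def mve_criterion(X, T, C):
    """Return (constraint satisfied?, number of covered points, det(C)) for (A.3)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    h = (n + p + 1) // 2                     # h = [(n + p + 1)/2]
    a2 = chi2.ppf(0.50, df=p)                # fixed constant a^2

    diff = X - np.asarray(T, dtype=float)
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(C), diff)
    covered = int(np.sum(d2 <= a2))
    return covered >= h, covered, float(np.linalg.det(C))
```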



For small samples one also needs a factor c_{n,p}, which depends on n and p. The MVE has a breakdown point of nearly 50%, which means that T(X) will remain bounded and the eigenvalues of C(X) will stay away from zero and infinity when less than half of the data are replaced by arbitrary values (see, e.g., Lopuhaä and Rousseeuw, in press). The robust distances are defined relative to the MVE:

RD_i = sqrt((x_i - T(X)) C(X)^{-1} (x_i - T(X))^t).   (A.4)

One can then compute a weighted mean,

T_1(X) = (sum_{i=1}^n w_i)^{-1} sum_{i=1}^n w_i x_i,   (A.5)

and a weighted covariance matrix,

C_1(X) = (sum_{i=1}^n w_i - 1)^{-1} sum_{i=1}^n w_i (x_i - T_1(X))^t (x_i - T_1(X)),   (A.6)

where the weights w_i = w(RD_i) depend on the robust distances. It can be shown that T_1 and C_1 have the same breakdown point as the initial T and C when the weight function w vanishes for large RD_i [see sec. 5 of Lopuhaä and Rousseeuw (in press)].

The MVE method can still be used when p = 1, in which case it yields the midpoint and the length of the shortest half. The midpoint converges merely as n^{-1/3} (Rousseeuw 1984), whereas the length converges as n^{-1/2} (Grübel 1988). The influence function and finite-sample behavior of the latter were studied by Rousseeuw and Leroy (1988).

The minimum covariance determinant estimator (MCD) is another method with high breakdown point (Rousseeuw 1985). It searches for a subset containing half of the data, the covariance matrix of which has the smallest determinant. Recently, it has been proved that the MCD estimator is asymptotically normal (Butler and Jhun 1990). The MCD estimator needs somewhat more computation time than does the MVE. The MCD estimator has also been computed by means of simulated annealing (R. Grübel, personal communication), but this approach takes much more computation time.
We have tried out two approximate algorithms for the MVE. The first is the resampling algorithm described in Rousseeuw and Leroy (1987). It is based on the idea of looking for a small number of good points, rather than for k bad points, where k = 1, 2, 3, .... This resembles certain regression algorithms used by Rousseeuw (1984) and, independently, by Hawkins et al. (1984). We draw subsamples of p + 1 different observations, indexed by J = {i_1, ..., i_{p+1}}. The mean and covariance matrix of such a subsample are

T_J = (1/(p + 1)) sum_{j in J} x_j   and   C_J = (1/p) sum_{j in J} (x_j - T_J)^t (x_j - T_J).   (A.7)

The corresponding ellipsoid should then be inflated or deflated to contain exactly h points, which amounts to computing

m_J^2 = {(x_i - T_J) C_J^{-1} (x_i - T_J)^t}_{h:n},   (A.8)

because m_J is the right magnification factor. The squared volume of the resulting ellipsoid is proportional to m_J^{2p} det(C_J), of which we keep the smallest value. For this "best" subset J we compute

T(X) = T_J   and   C(X) = (chi^2_{p,.50})^{-1} m_J^2 C_J   (A.9)

as an approximation to the MVE estimator, followed by a reweighting step as in (A.5) and (A.6). The number of subsamples J depends on a probabilistic argument, because we want to be confident that we encounter enough subsamples consisting of p + 1 good points. Moreover, by carrying out a simulation study, we found that c_n = (1 + 15/(n - p))^2 is a reasonable small-sample correction factor. Therefore, this factor was incorporated in all of the examples of our article.
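A compact Python rendering of this resampling scheme is given below. It follows (A.7)-(A.9) but is only an illustration: it is not the authors' Fortran code, it omits the small-sample factor c_n, and the subsequent reweighting step of (A.5)-(A.6) is left out.

```python
import numpy as np
from scipy.stats import chi2


def mve_resampling(X, n_subsets=500, rng=None):
    """Approximate MVE location/scatter by the resampling algorithm of (A.7)-(A.9)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    h = (n + p + 1) // 2

    best, best_vol2 = None, np.inf
    for _ in range(n_subsets):
        J = rng.choice(n, size=p + 1, replace=False)
        T_J = X[J].mean(axis=0)                        # (A.7): subsample mean
        C_J = np.cov(X[J], rowvar=False)               # (A.7): denominator p for p+1 points
        try:
            C_inv = np.linalg.inv(C_J)
        except np.linalg.LinAlgError:
            continue                                    # degenerate subsample, skip it
        diff = X - T_J
        d2 = np.einsum("ij,jk,ik->i", diff, C_inv, diff)
        m2 = np.sort(d2)[h - 1]                         # (A.8): h-th smallest squared distance
        vol2 = m2 ** p * np.linalg.det(C_J)             # proportional to the squared volume
        if vol2 < best_vol2:
            best_vol2, best = vol2, (T_J, C_J, m2)

    T_J, C_J, m2 = best
    C = m2 / chi2.ppf(0.50, df=p) * C_J                 # (A.9): rescale for consistency
    return T_J, C
```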

The projection algorithm is a variant of an algorithm of Gasko and Donoho (1982). For each point x_i we consider

|x_i v^t - L(x_1 v^t, ..., x_n v^t)| / S(x_1 v^t, ..., x_n v^t),   (A.10)

where L and S are the MVE estimates in one dimension, which we compute as follows. For any set of numbers z_1 <= z_2 <= ... <= z_n one can determine its shortest half by taking the smallest of the differences z_h - z_1, z_{h+1} - z_2, ..., z_n - z_{n-h+1}. If the smallest difference is z_j - z_{j-h+1}, we put L equal to the midpoint of the corresponding half,

L(z_1, ..., z_n) = (z_j + z_{j-h+1}) / 2,   (A.11)

and S as its length,

S(z_1, ..., z_n) = c(n) (z_j - z_{j-h+1}),   (A.12)

up to a correction factor c(n), which depends on the sample size. Note that (A.10) is exactly the one-dimensional version of the robust distance RD_i of (A.4), but applied to the projections x_i v^t of the data points x_i on the direction v. As not all possible directions v can be tried, we have to make a selection. We take all v of the form x_l - M, where l = 1, ..., n and M is the coordinatewise median: M = (median_{j=1,...,n} x_{j1}, ..., median_{j=1,...,n} x_{jp}). In the algorithm we update an array (u_i)_{i=1,...,n} while l loops over 1, ..., n. The final u_i are approximations of RD_i which can be plotted or used for reweighting as in (A.5) and (A.6).

Both algorithms are very approximate, but from our experience this usually does not matter much as far as the detection of outliers is concerned. The resampling algorithm is affine equivariant but not permutation invariant, because reordering the x_i will change the random subsamples J. On the other hand, the projection algorithm is permutation invariant because it considers all values of l, but it is not affine equivariant. Note that the projection algorithm is much faster than the resampling algorithm, especially in higher dimensions.
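The one-dimensional shortest half and the projection algorithm can be sketched as follows; this is an illustrative rendering, with the correction factor c(n) of (A.12) omitted and names of our own choosing.

```python
import numpy as np


def shortest_half(z):
    """One-dimensional MVE: midpoint L as in (A.11) and length S as in (A.12),
    without the correction factor c(n)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    h = (n + 2) // 2                       # h = [(n + p + 1)/2] with p = 1
    diffs = z[h - 1:] - z[:n - h + 1]      # z_h - z_1, z_{h+1} - z_2, ..., z_n - z_{n-h+1}
    j = int(np.argmin(diffs))
    L = (z[j + h - 1] + z[j]) / 2.0        # midpoint of the shortest half
    S = diffs[j]                           # length of the shortest half
    return L, S


def projection_distances(X):
    """Approximate robust distances u_i by the projection algorithm, following (A.10)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    M = np.median(X, axis=0)               # coordinatewise median
    u = np.zeros(n)
    for l in range(n):                     # directions v = x_l - M
        v = X[l] - M
        if not np.any(v):
            continue                        # x_l coincides with the median, no direction
        z = X @ v
        L, S = shortest_half(z)
        if S > 0:
            u = np.maximum(u, np.abs(z - L) / S)
    return u
```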
[Received November 1988. Revised August 1989.]

REFERENCES

REFERENCES (continued)

Atkinson, A. C. (1986), "Masking Unmasked," Biometrika, 73, 533-541.

Beckman, R. J., and Cook, R. D. (1983), "Outlier..........s," Technometrics, 25, 119-163.

Brownlee, K. A. (1965), Statistical Theory and Methodology in Science and Engineering (2nd ed.), New York: John Wiley.

Butler, R., and Jhun, M. (1990), "Asymptotics for the Minimum Covariance Determinant Estimator," unpublished manuscript, Colorado State University, Dept. of Statistics.

Campbell, N. A. (1980), "Robust Procedures in Multivariate Analysis I: Robust Covariance Estimation," Applied Statistics, 29, 231-237.

Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, London: Chapman & Hall.

Chork, C. Y. (in press), "Unmasking Multivariate Anomalous Observations in Exploration Geochemical Data From Sheeted-Vein Tin Mineralization," Journal of Geochemical Exploration.

Devlin, S. J., Gnanadesikan, R., and Kettenring, J. R. (1981), "Robust Estimation of Dispersion Matrices and Principal Components," Journal of the American Statistical Association, 76, 354-362.

Donoho, D. L. (1982), "Breakdown Properties of Multivariate Location Estimators," qualifying paper, Harvard University.

Gasko, M., and Donoho, D. (1982), "Influential Observation in Data Analysis," in Proceedings of the Business and Economic Statistics Section, American Statistical Association, pp. 104-109.

Gnanadesikan, R. (1977), Methods for Statistical Data Analysis of Multivariate Observations, New York: John Wiley.

Grübel, R. (1988), "The Length of the Shorth," The Annals of Statistics, 16, 619-628.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: John Wiley.

Hawkins, D. M., Bradu, D., and Kass, G. V. (1984), "Location of Several Outliers in Multiple Regression Data Using Elemental Sets," Technometrics, 26, 197-208.

Karnel, G. (1988), "Robust Canonical Correlation and Correspondence Analysis," unpublished manuscript, Technical University, Vienna, Dept. of Statistics.

Lopuhaä, H. P., and Rousseeuw, P. J. (in press), "Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices," The Annals of Statistics.

Rousseeuw, P. J. (1984), "Least Median of Squares Regression," Journal of the American Statistical Association, 79, 871-880.

——— (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications (Vol. B), eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel Publishing, pp. 283-297.

Rousseeuw, P. J., and Leroy, A. (1987), Robust Regression and Outlier Detection, New York: John Wiley.

——— (1988), "A Robust Scale Estimator Based on the Shortest Half," Statistica Neerlandica, 42, 103-116.

Stahel, W. A. (1981), "Robuste Schätzungen: Infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen," unpublished Ph.D. thesis, ETH Zürich.

Stromberg, A. J. (1989), "Nonlinear Regression With High Breakdown Point," unpublished Ph.D. thesis, Cornell University, School of Operations Research and Industrial Engineering.
