
Estimating Optimal Transformations for Multiple Regression and Correlation
Author(s): Leo Breiman and Jerome H. Friedman
Source: Journal of the American Statistical Association, Vol. 80, No. 391 (Sep., 1985), pp. 580-598
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2288473
Accessed: 23/01/2014 18:34
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://www.jstor.org/page/info/about/policies/terms.jsp

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org.

American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journal of the American Statistical Association.

http://www.jstor.org

This content downloaded from 193.136.144.3 on Thu, 23 Jan 2014 18:34:54 PM All use subject to JSTOR Terms and Conditions

Estimating Optimal Transformations for Multiple Regression and Correlation

LEO BREIMAN and JEROME H. FRIEDMAN*

In regression analysis the response variable Y and the predictor variables X1, ..., Xp are often replaced by functions θ(Y) and φ1(X1), ..., φp(Xp). We discuss a procedure for estimating those functions θ* and φ1*, ..., φp* that minimize e² = E{[θ(Y) − Σ_{j=1}^p φj(Xj)]²}/var[θ(Y)], given only a sample {(y_k, x_{k1}, ..., x_{kp}), 1 ≤ k ≤ N} and making minimal assumptions concerning the data distribution or the form of the solution functions. For the bivariate case, p = 1, θ* and φ* satisfy ρ* = ρ(θ*, φ*) = max_{θ,φ} ρ[θ(Y), φ(X)], where ρ is the product moment correlation coefficient and ρ* is the maximal correlation between X and Y. Our procedure thus also provides a method for estimating the maximal correlation between two variables.

KEY WORDS: Smoothing; ACE.

1. INTRODUCTION

Nonlinear transformation of variables is a commonly used practice in regression problems. Two common goals are stabilization of error variance and symmetrization/normalization of error distribution. A more comprehensive goal, and the one we adopt, is to find those transformations that produce the best-fitting additive model. Knowledge of such transformations aids in the interpretation and understanding of the relationship between the response and predictors.

Let Y, X1, ..., Xp be random variables with Y the response and X1, ..., Xp the predictors. Let θ(Y), φ1(X1), ..., φp(Xp) be arbitrary measurable mean-zero functions of the corresponding random variables. The fraction of variance not explained (e²) by a regression of θ(Y) on Σ_i φi(Xi) is

    e²(θ, φ1, ..., φp) = E{[θ(Y) − Σ_{i=1}^p φi(Xi)]²} / Eθ²(Y).   (1.1)

Then define optimal transformations as functions θ*, φ1*, ..., φp* that minimize (1.1); that is,

    e²(θ*, φ1*, ..., φp*) = min_{θ, φ1, ..., φp} e²(θ, φ1, ..., φp).   (1.2)

We show in Section 5 that optimal transformations exist and satisfy a complex system of integral equations. The heart of our approach is that there is a simple iterative algorithm using only bivariate conditional expectations, which converges to an optimal solution. When the conditional expectations are estimated from a finite data set, use of the algorithm results in estimates of the optimal transformations.

This method has some powerful characteristics. It can be applied in situations where the response or the predictors involve arbitrary mixtures of continuous ordered variables and categorical variables (ordered or unordered). The functions θ, φ1, ..., φp are real-valued. If the original variable is categorical, the application of θ or φj assigns a real-valued score to each of its categorical values.

The procedure is nonparametric. The estimates of the optimal transformations are based solely on the data sample {(y_k, x_{k1}, ..., x_{kp}), 1 ≤ k ≤ N}, with minimal assumptions concerning the data distribution and the form of the optimal transformations. In particular, we do not require the transformation functions to be from a particular parameterized family or even to be monotone. (Later we illustrate situations in which the optimal transformations are not monotone.)

It is applicable in at least three situations:

1. random designs in regression
2. autoregressive schemes in stationary ergodic time series
3. controlled designs in regression

In the first of these, we assume the data (y_k, x_k), k = 1, ..., N, are independent samples from the distribution of Y, X1, ..., Xp. In the second, a stationary mean-zero ergodic time series X1, X2, ... is assumed, and the optimal transformations are defined to be the functions that minimize

    E[θ(X_{p+1}) − Σ_{i=1}^p φi(X_{p+1−i})]² / Eθ²(X_{p+1}).

The data consist of N + p consecutive observations x1, ..., x_{N+p}. This is put in a standard data form by defining

    y_k = x_{k+p},   X_k = (x_{k+p−1}, ..., x_k),   k = 1, ..., N.

* Leo Breiman is Professor, Department of Statistics, University of California, Berkeley, CA 94720. Jerome H. Friedman is Professor, Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305. This work was supported by Office of Naval Research Contracts N00014-82-K-0054 and N00014-81-K-0340.
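To fix ideas, the sample analog of criterion (1.1) is easy to compute. The following sketch (function and variable names are ours, not the paper's) evaluates the fraction of unexplained variance for given transformed values:

```python
import numpy as np

def e2(theta_y, phi_sum):
    """Sample analog of criterion (1.1): the fraction of the variance of
    theta(Y) not explained by the additive predictor sum_i phi_i(X_i).
    theta_y holds theta(y_k); phi_sum holds sum_i phi_i(x_ki)."""
    theta_y = theta_y - theta_y.mean()   # transformations are mean-zero
    phi_sum = phi_sum - phi_sum.mean()
    return np.mean((theta_y - phi_sum) ** 2) / np.mean(theta_y ** 2)
```

A perfect additive fit gives e² = 0, and the trivial fit φ ≡ 0 gives e² = 1.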

© 1985 American Statistical Association
Journal of the American Statistical Association
Vol. 80, No. 391, September 1985, Theory and Methods

In the controlled design situation, a distribution P(dy | x) for the response variable Y is specified for every point x = (x1, ..., xp) in the design space. The Nth-order design consists of a specification of N points x1, ..., xN in the design space, and the data consist of these points together with measurements on the response variables y1, ..., yN. The {y_k} are assumed independent, with y_k drawn from the distribution P(dy | x_k). Denote by P_N(dx) the empirical distribution that gives mass 1/N to each of the points x1, ..., xN. Assume further that P_N → P, where P(dx) is a probability measure on the design space. Then P(dy | x) and P(dx) determine the distribution of the random variables Y, X1, ..., Xp, and the optimal transformations are defined as in (1.2).

For the bivariate case, p = 1, the optimal transformations θ*(Y), φ*(X) satisfy

    ρ*(X, Y) = ρ(θ*, φ*) = max_{θ,φ} ρ[θ(Y), φ(X)],   (1.3)
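As a toy numerical illustration of (1.3) (not the paper's algorithm), one can brute-force a small dictionary of candidate transforms and keep the pair with the largest sample correlation; the dictionary and all names here are our own:

```python
import numpy as np

# A small, arbitrary dictionary of candidate transformations.
TRANSFORMS = {
    "identity": lambda t: t,
    "square": lambda t: t ** 2,
    "cube": lambda t: t ** 3,
}

def max_corr_dictionary(x, y):
    """Return (transform name for y, transform name for x, |corr|)
    maximizing the sample correlation of f(y) with g(x)."""
    best = ("identity", "identity", 0.0)
    for name_f, f in TRANSFORMS.items():
        for name_g, g in TRANSFORMS.items():
            r = abs(np.corrcoef(f(y), g(x))[0, 1])
            if r > best[2]:
                best = (name_f, name_g, r)
    return best
```

For data with y ≈ x², the raw correlation of y with x is near zero, yet the search finds the identity transform for y paired with the square for x, coming much closer to the maximal correlation.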



where ρ is the product-moment correlation coefficient. The quantity ρ*(X, Y) is known as the maximal correlation between X and Y, and it is used as a general measure of dependence (Gebelein 1947; also see Renyi 1959, Sarmanov 1958a,b, and Lancaster 1958). The maximal correlation has the following properties (Renyi 1959):

1. 0 ≤ ρ*(X, Y) ≤ 1.
2. ρ*(X, Y) = 0 if and only if X and Y are independent.
3. If there exists a relation of the form u(X) = v(Y), where u and v are Borel-measurable functions with var[u(X)] > 0, then ρ*(X, Y) = 1.

Therefore, in the bivariate case our procedure can also be regarded as a method for estimating the maximal correlation between two variables, providing as a by-product estimates of the functions θ*, φ* that achieve the maximum.

Renyi (1959) gave a proof of the existence of optimal transformations in the bivariate case under conditions similar to ours. He also derived integral equations satisfied by θ* and φ*, with kernels depending on the bivariate density of X and Y, and concentrated on finding solutions assuming this density known. The equations seem generally intractable, with only a few known solutions. He did not consider the problem of estimating θ*, φ* from data.

Kolmogorov (see Sarmanov and Zaharov 1960 and Lancaster 1969) proved that if Y1, ..., Yq, X1, ..., Xp have a joint normal distribution, then the functions θ(Y1, ..., Yq), φ(X1, ..., Xp) having maximum correlation are linear. It follows from this that in the regression model

    θ(Y) = Σ_{i=1}^p φi(Xi) + Z,   (1.4)

if the φi(Xi), i = 1, ..., p, have a joint normal distribution and Z is an independent N(0, σ²), then the optimal transformations as defined in (1.2) are θ, φ1, ..., φp. Generally, for a model of the form (1.4) with Z independent of (X1, ..., Xp), the optimal transformations are not equal to θ, φ1, ..., φp. But in examples with simulated data generated from models of the form (1.4), with non-normal {φi(Xi)}, the estimated optimal transformations were always close to θ, φ1, ..., φp.

Finally, we note the work in a different direction by Kimeldorf et al. (1982), who constructed a linear-programming algorithm to find the monotone transformations θ(Y), φ(X) that maximize the sample correlation coefficient in the bivariate case p = 1.

There are no analogous results, however, for stationary ergodic series or controlled designs. To remedy this, we show that there are sequences of data smooths that have the requisite properties in all three cases.

This article is presented in two distinct parts. Sections 1-4 give a fairly nontechnical overview of the method and discuss its application to data. Section 5 and Appendix A are, of necessity, more technical, presenting the theoretical foundation for the procedure.

There is relevant previous work. Closest in spirit to the ACE algorithm we develop is the MORALS algorithm of Young et al. (1976) (also see de Leeuw et al. 1976). It uses an alternating least squares fit, but it restricts transformations on discrete ordered variables to be monotonic and transformations on continuous variables to be linear or polynomial. No theoretical framework for MORALS is given.

In the next section, we describe our procedure for finding optimal transformations using algorithmic notation, deferring mathematical justifications to Section 5 and Appendix A. We next illustrate the procedure in Section 3 by applying it to a simulated data set in which the optimal transformations are known. The estimates are surprisingly good. Our algorithm is also applied to the Boston housing data of Harrison and Rubinfeld (1978) as listed in Belsley et al. (1980). The transformations found by the algorithm differ from those generally applied in the original analysis. Finally, we apply the procedure to a multiple time series arising from an air pollution study. Section 4 presents a general discussion and relates this procedure to other empirical methods for finding transformations. A FORTRAN implementation of our algorithm is available from either author.

Section 5 and Appendix A provide some theoretical framework for the algorithm. In Section 5, under weak conditions on the joint distribution of Y, X1, ..., Xp, it is shown that optimal transformations exist and are generally unique up to a change of sign. The optimal transformations are characterized as the eigenfunctions of a set of linear integral equations whose kernels involve bivariate distributions. We then show that our procedure converges to optimal transformations.

Appendix A discusses the algorithm as applied to finite data sets. The results are dependent on the type of data smooth employed to estimate the bivariate conditional expectations. Convergence of the algorithm is proven only for a restricted class of data smooths. However, in more than 1,000 applications of the algorithm on a variety of data sets using three different types of data smoothers, only one (very contrived) instance of nonconvergence has been found.

Appendix A also contains proof of a consistency result. Under fairly general conditions, as the sample size increases, the finite data transformations converge in a "weak" sense to the distribution-space optimal transformations. The essential condition of the consistency theorem involves the asymptotic consistency of a sequence of data smooths. In the case of iid data there are known results concerning the consistency of various smooths. Stone's (1977) pioneering paper established consistency for k-nearest-neighbor smoothing. Devroye and Wagner (1980) and, independently, Spiegelman and Sacks (1980) gave weak conditions for consistency of kernel smooths. See Stone (1977) and Devroye (1981) for a review of the literature.

2. THE ALGORITHM

Our procedure for finding θ*, φ1*, ..., φp* is iterative. Assume a known distribution for the variables Y, X1, ..., Xp. Without loss of generality, let Eθ²(Y) = 1, and assume that all functions have expectation zero. To illustrate, we first look at the bivariate case:

    e²(θ, φ) = E[θ(Y) − φ(X)]².   (2.1)

Consider the minimization of (2.1) with respect to θ(Y) for a given function φ(X), keeping Eθ² = 1. The solution is

    θ1(Y) = E[φ(X) | Y] / ‖E[φ(X) | Y]‖,   (2.2)



with ‖·‖ ≡ [E(·)²]^{1/2}. Next, consider the minimization of (2.1) with respect to φ(X) for a given θ(Y). The solution is

    φ1(X) = E[θ(Y) | X].   (2.3)

Equations (2.2) and (2.3) form the basis of an iterative optimization procedure involving alternating conditional expectations (ACE). In the bivariate case this gives rise to the basic ACE algorithm:

Basic ACE Algorithm

    Set θ(Y) = Y/‖Y‖;
    Iterate until e²(θ, φ) fails to decrease:
        φ1(X) = E[θ(Y) | X]; replace φ(X) with φ1(X);
        θ1(Y) = E[φ(X) | Y] / ‖E[φ(X) | Y]‖; replace θ(Y) with θ1(Y);
    End Iteration Loop;
    θ and φ are the solutions θ* and φ*;
End Algorithm.

This algorithm decreases (2.1) at each step by alternately minimizing with respect to one function and holding the other fixed at its previous evaluation. Each iteration (execution of the iteration loop) performs one pair of these single-function minimizations. The process begins with an initial guess for one of the functions (θ = Y/‖Y‖) and ends when a complete iteration pass fails to decrease e². In Section 5, we prove that the algorithm converges to optimal transformations θ*, φ*.

Now consider the more general case of multiple predictors X1, ..., Xp. We proceed in direct analogy with the basic ACE algorithm. We minimize

    e²(θ, φ1, ..., φp) = E[θ(Y) − Σ_{j=1}^p φj(Xj)]²   (2.4)

through a series of single-function minimizations involving bivariate conditional expectations, holding Eθ² = 1 and Eθ = Eφ1 = ... = Eφp = 0. For a given set of functions φ1(X1), ..., φp(Xp), minimization of (2.4) with respect to θ(Y) yields

    θ1(Y) = E[Σ_{i=1}^p φi(Xi) | Y] / ‖E[Σ_{i=1}^p φi(Xi) | Y]‖.   (2.5)

The next step is to minimize (2.4) with respect to φ1(X1), ..., φp(Xp), given θ(Y). This is accomplished through another iterative algorithm. Consider the minimization of (2.4) with respect to a single function φk(Xk) for given θ(Y) and a given set φ1, ..., φ_{k−1}, φ_{k+1}, ..., φp. The solution is

    φ_{k,1}(Xk) = E[θ(Y) − Σ_{i≠k} φi(Xi) | Xk].   (2.6)

The corresponding iterative algorithm is as follows:

    Set φ1(X1), ..., φp(Xp) = 0;
    Iterate until e²(θ, φ1, ..., φp) fails to decrease;
        For k = 1 to p Do:
            φ_{k,1}(Xk) = E[θ(Y) − Σ_{i≠k} φi(Xi) | Xk];
            replace φk(Xk) with φ_{k,1}(Xk);
        End For Loop;
    End Iteration Loop;
    φ1, ..., φp are the solution functions.

Each iteration of the inner For loop minimizes e² (2.4) with respect to the function φk(Xk), k = 1, ..., p, with all other functions fixed at their previous evaluations. The outer loop is iterated until one complete pass over the predictor variables (inner For loop) fails to decrease e² (2.4). Substituting this alternating procedure for the corresponding single-function optimization in the bivariate algorithm gives the full ACE algorithm for minimizing (2.4):

ACE Algorithm

    Set θ(Y) = Y/‖Y‖ and φ1(X1), ..., φp(Xp) = 0;
    Iterate until e²(θ, φ1, ..., φp) fails to decrease;
        Iterate until e²(θ, φ1, ..., φp) fails to decrease;
            For k = 1 to p Do:
                φ_{k,1}(Xk) = E[θ(Y) − Σ_{i≠k} φi(Xi) | Xk];
                replace φk(Xk) with φ_{k,1}(Xk);
            End For Loop;
        End Inner Iteration Loop;
        θ1(Y) = E[Σ_{i=1}^p φi(Xi) | Y] / ‖E[Σ_{i=1}^p φi(Xi) | Y]‖;
        replace θ(Y) with θ1(Y);
    End Outer Iteration Loop;
    θ, φ1, ..., φp are the solutions θ*, φ1*, ..., φp*;
End ACE Algorithm.

In Section 5, we prove that the ACE algorithm converges to the optimal transformations.

3. APPLICATIONS

In the previous section, the ACE algorithm was developed in the context of known distributions. In practice, data distributions are seldom known. Instead, one has a data set {(y_k, x_{k1}, ..., x_{kp}), 1 ≤ k ≤ N} that is presumed to be a sample from Y, X1, ..., Xp. The goal is to estimate the optimal transformation functions θ*(y), φ1*(x1), ..., φp*(xp) from the data. This can be accomplished by applying the ACE algorithm to the data, with the quantities e², ‖·‖, and the conditional expectations replaced by suitable estimates. The resulting functions θ̂*, φ̂1*, ..., φ̂p* are then taken as estimates of the corresponding optimal transformations.

The estimate for e² is the usual mean squared error for regression:

    ê²(θ̂*, φ̂1*, ..., φ̂p*) = (1/N) Σ_{k=1}^N [θ̂*(y_k) − Σ_{j=1}^p φ̂j*(x_{kj})]².

If g(y, x1, ..., xp) is a function defined for all data values, then ‖g‖² is replaced by

    ‖g‖²_N = (1/N) Σ_{k=1}^N g²(y_k, x_{k1}, ..., x_{kp}).
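On data, the conditional expectations of the ACE iteration are replaced by smooths. The sketch below is our own illustrative code, with a crude running-mean smooth standing in for the supersmoother the paper uses; it implements the basic bivariate loop:

```python
import numpy as np

def smooth(x, t, span=0.3):
    """Estimate E[t | x] by a running mean over a window of about span*N
    neighbors in x-order (a crude stand-in for the supersmoother)."""
    n = len(x)
    half = max(1, int(span * n) // 2)
    order = np.argsort(x)
    t_ord = t[order]
    fit = np.empty(n)
    for j in range(n):
        lo, hi = max(0, j - half), min(n, j + half + 1)
        fit[j] = t_ord[lo:hi].mean()
    out = np.empty(n)
    out[order] = fit          # back to the original observation order
    return out

def ace_bivariate(x, y, n_iter=20):
    """Basic bivariate ACE loop on data: alternate
    phi(x) <- E-hat[theta(y) | x] and theta(y) <- E-hat[phi(x) | y],
    renormalizing so the sample analog of E[theta^2] stays 1."""
    theta = (y - y.mean()) / y.std()
    phi = np.zeros_like(theta)
    for _ in range(n_iter):
        phi = smooth(x, theta)
        phi -= phi.mean()
        theta = smooth(y, phi)
        theta -= theta.mean()
        theta /= np.sqrt((theta ** 2).mean())
    return theta, phi
```

On data from a model such as y = exp(x + ε), the returned θ̂ approximately tracks a standardized log(y) and φ̂ tracks x, up to normalization and smoothing error.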

For the case of categorical variables, the conditional expectation estimates are straightforward: If the data are {(x_k, z_k)}, k = 1, ..., N, and Z is categorical, then

    Ê[X | Z = z] = ( Σ_{z_k = z} x_k ) / ( Σ_{z_k = z} 1 ),

where X is real-valued and the sums are over the subset of observations having (categorical) value Z = z. For variables that can assume many ordered values, the estimation is based on smoothing techniques.
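For a categorical Z, the estimate above is just the within-category sample mean; a minimal sketch (illustrative code, not from the paper) is:

```python
import numpy as np

def cond_mean_categorical(x, z):
    """Estimate E[X | Z = z] for each category z by the mean of the
    x-values observed in that category."""
    return {c: x[z == c].mean() for c in np.unique(z)}
```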


Such procedures have been the subject of considerable study (e.g., see Gasser and Rosenblatt 1979, Cleveland 1979, and Craven and Wahba 1979). Since the smoother is repeatedly applied in the algorithm, high speed is desirable, as well as adaptability to local curvature. We use a smoother employing local linear fits with varying window width determined by local cross-validation (the "supersmoother"; see Friedman and Stuetzle 1982).

The algorithm evaluates θ̂*, φ̂1*, ..., φ̂p* at all the corresponding data values; that is, θ̂*(y) is evaluated at the set of data values {y_k}, k = 1, ..., N. The simplest way to understand the shape of the transformations is by means of a plot of each transformation function versus the corresponding data values; that is, the plots of θ̂*(y_k) versus y_k and of φ̂1, ..., φ̂p versus the data values of x1, ..., xp, respectively.

In this section, we illustrate the ACE procedure by applying it to various data sets. In order to evaluate performance on finite data sets, the procedure is first applied to simulated data for which the optimal transformations are known. We next apply it to the Boston housing data of Harrison and Rubinfeld (1978), as listed in Belsley et al. (1980), contrasting the ACE transformations with those used in the original analysis. For our last example, we apply the ACE procedure to a multiple time series to study the relation between air pollution (ozone) and various meteorological quantities.

Our first example consists of 200 bivariate observations {(y_k, x_k), 1 ≤ k ≤ 200} generated from the model

    y_k = exp[x_k + ε_k],

with the x_k and the ε_k drawn independently from a standard normal distribution N(0, 1). Figure 1(a) shows a scatterplot of these data. Figures 1(b)-1(d) show the results of applying the ACE algorithm to the data. The estimated optimal transformation θ̂*(y) is shown in Figure 1(b)'s plot of θ̂*(y_k) versus y_k, 1 ≤ k ≤ 200. Figure 1(c) is a plot of φ̂*(x_k) versus x_k. These plots suggest the transformations θ(y) = log(y) and φ(x) = x, which are optimal for the parent distribution. Figure 1(d) is a plot of θ̂*(y_k) versus φ̂*(x_k). This plot indicates a more linear relation between the transformed variables than between the untransformed ones.

The next issue we address is how much the algorithm overfits on finite samples, due to the repeated smoothings, resulting in inflated estimates of the maximal correlation ρ* and of R*² = 1 − e*². The answer, on the simulated data sets we have generated, is surprisingly little. To illustrate this, we contrast two estimates of ρ* and R*².
[Figure 1. First Example: (a) Original Data; (b) Transform on y; (c) Transform on x; (d) Transformed Data.]


Table 1. Comparison of ρ* Estimates

    Estimate        Mean    Standard Deviation
    ρ̂ (direct)     .700    .034
    ρ̂* (ACE)       .709    .036

Table 3. Estimate Differences

    Estimate        Mean    Standard Deviation
    ρ̂* − ρ̂        .001    .015
    R̂*² − R̂²      .012    .022

The known optimal transformations using the above model are θ(Y) = log Y, φ(X) = X. Therefore, we define the direct estimate ρ̂ for ρ*, given any data set generated as above, by the sample correlation between log y_k and x_k, and set R̂² = ρ̂². The ACE algorithm produces the estimates

    ρ̂* = (1/N) Σ_{k=1}^N θ̂*(y_k) φ̂*(x_k)   and   R̂*² = 1 − ê*².

In this model ρ* = .707 and R*² = 1 − e*² = .5.

For 100 data sets, each of size 200, generated from the above model, the means and standard deviations of the ρ̂* estimates are in Table 1. The means and standard deviations of the R̂*² estimates are in Table 2. We also computed the differences ρ̂* − ρ̂ and R̂*² − R̂² for the 100 data sets. The means and standard deviations are in Table 3. The preceding experiment was duplicated for the smaller sample size N = 100. In this case we obtained the differences in Table 4.

Table 2. Comparison of R*² Estimates

    Estimate        Mean    Standard Deviation
    R̂² (direct)    .492    .047
    R̂*² (ACE)      .503    .050

Table 4. Estimate Differences, Sample Size 100

    Estimate        Mean    Standard Deviation
    ρ̂* − ρ̂        .029    .051
    R̂*² − R̂²      .042    .034

We next show an application of the procedure to simulated data generated from the model

    y_k = exp[sin(2πx_k) + ε_k/2],   1 ≤ k ≤ 200,

with the x_k sampled from a uniform distribution U(0, 1) and the ε_k drawn independently of the x_k from a standard normal distribution N(0, 1). Figure 2(a) shows a scatterplot of these data. Figures 2(b) and 2(c) show the optimal transformation estimates θ̂*(y) and φ̂*(x). Although log(y) and sin(2πx) are not the optimal transformations for this model [owing to the non-normal distribution of sin(2πx)], these transformations are still clearly suggested by the resulting estimates.

Our next example consists of a sample of 200 triples {(y_k, x_{k1}, x_{k2}), 1 ≤ k ≤ 200} drawn from the model Y = X1·X2, with X1 and X2 generated independently from a uniform distribution U(−1, 1). Note that θ(Y) = log(Y) and φj(Xj) = log(Xj) (j = 1, 2) cannot be solutions here, since Y, X1, and X2 all assume negative values. Figure 3(a) shows a plot of θ̂*(y_k) versus y_k, and Figures 3(b) and 3(c) show corresponding plots of φ̂1*(x_{k1}) and φ̂2*(x_{k2}) (1 ≤ k ≤ 200). All three solution transformation functions are seen to be double-valued. The optimal transformations for this problem are θ*(Y) = log|Y| and φj*(Xj) = log|Xj| (j = 1, 2). The estimates clearly reflect this structure, except near the origin, where the smoother cannot reproduce the infinite discontinuity in the derivative. This example illustrates that the ACE algorithm is able to produce nonmonotonic estimates for both response and predictor transformations.

For our next example, we apply the ACE algorithm to the Boston housing market data of Harrison and Rubinfeld (1978). A complete listing of these data appears in Belsley et al. (1980). Harrison and Rubinfeld used these data to estimate marginal air pollution damages as revealed in the housing market. Central to their analysis was a housing value equation that relates the median value of owner-occupied homes in each of the 506 census tracts in the Boston Standard Metropolitan Statistical Area to air pollution (as reflected in the concentration of nitrogen oxides) and to 12 other variables that are thought to affect housing prices. This equation was estimated by trying to determine the best-fitting functional form of housing price on these 13 variables. By experimenting with a number of possible transformations of the 14 variables (response and 13 predictors), Harrison and Rubinfeld settled on an equation of the form

    log(MV) = a1 + a2(RM)² + a3·AGE + a4·log(DIS) + a5·log(RAD) + a6·TAX
              + a7·PTRATIO + a8(B − .63)² + a9·log(LSTAT) + a10·CRIM
              + a11·ZN + a12·INDUS + a13·CHAS + a14(NOX)^p + ε.

A brief description of each variable is given in Appendix B. (For a more complete description, see Harrison and Rubinfeld 1978, table 4.) The coefficients a1, ..., a14 were determined by a least squares fit to measurements of the 14 variables for the 506 census tracts. The best value for the exponent p was found to be 2.0, by a numerical optimization (grid search). This "basic equation" was used to generate estimates for the willingness to pay for, and the marginal benefits of, clean air. Harrison and Rubinfeld (1978) noted that the results are highly sensitive to the particular specification of the housing price equation.

We applied the ACE algorithm to the transformed measurements (y′, x1′, ..., x13′) (using p = 2 for NOX) appearing in the basic equation. To the extent that these transformations are close to the optimal ones, the algorithm will produce almost linear transformation functions. Departures from linearity indicate transformations that can improve the quality of the fit.

In this (and the following) example we apply the procedure in a forward stepwise manner.
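The direct estimates for the first simulated model can be reproduced in a few lines. This is our own sketch (seed and names are ours), assuming the first model is y = exp(x + ε) as read from the text; with the known transforms log(y) and x, the estimate should fall near ρ* = .707:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.normal(size=N)
y = np.exp(x + rng.normal(size=N))    # first simulated model

# Direct estimates built from the known optimal transforms.
rho_hat = np.corrcoef(np.log(y), x)[0, 1]
R2_hat = rho_hat ** 2                 # estimates R*^2 = .5
```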



For the first pass we consider the 13 bivariate problems (p = 1) involving the response y′ with each of the predictor variables x′ (1 ≤ k ≤ 13) in turn. The predictor that maximizes R̂²[θ̂(y′), φ̂(x′)] is included in the model. Each subsequent pass adds the predictor whose inclusion most increases R̂². The selection procedure is continued until including the best remaining predictor increases the R̂² of the previous pass by less than .01. The resulting forward model involved four predictors, with an R̂² of .8. Applying ACE simultaneously to all 13 predictors resulted in an increase in R̂² of only .02.

[Figure 2. Second Example: (a) Original Data; (b) Transformed y; (c) Transformed x.]

[Figure 3. Third Example: (a) Transformed y; (b) Transformed x1; (c) Transformed x2.]
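The forward stepwise scheme can be sketched as follows. Plain least squares R² stands in here for the ACE R̂² of transformed variables (an illustrative simplification; the function name and threshold handling are ours):

```python
import numpy as np

def forward_select(X, y, min_gain=0.01):
    """Forward stepwise selection in the spirit of the paper's procedure:
    at each pass add the predictor that most increases R^2, stopping when
    the gain falls below min_gain. Ordinary least squares R^2 is used as
    a stand-in for the ACE R^2 of transformed variables."""
    n, p = X.shape
    chosen, r2_prev = [], 0.0
    while len(chosen) < p:
        best_j, best_r2 = None, r2_prev
        for j in range(p):
            if j in chosen:
                continue
            cols = chosen + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])
            resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
            r2 = 1.0 - resid.var() / y.var()
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        if best_j is None or best_r2 - r2_prev < min_gain:
            break
        chosen.append(best_j)
        r2_prev = best_r2
    return chosen, r2_prev
```

With one strongly informative predictor among noise columns, the search picks it on the first pass and then stops, since no remaining variable clears the gain threshold.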



e ~~~~ ~~~~~~~~~~~~~~0.4e

33 2

ai
f4*(I0g MV)

00

''''IW^?|I

0*(PTRATIO)

1
0
--

~~~~0.2

0.0

.2La 8.5
4

I I I II , I,_ ,
9 9.5 10

10.5

11

__2
12

14

16

18

__,_____";
20

22

b
t' 2*(MV)
2
1 0

'~~~~~~~~~~~~~~0. 0.4
0.2*(TAX)
r i0.2 t

'''

0.0

10

20

30

40

50

200

400

600

C 20 0*(RM2)

9.

01
0.0
,

$*(NOX2)

2~~~~~~~~~~~~~~~~~~~~
I

~~~~-0.1

~~~~~~~~~~-0.2 -0.3~~~~~~~~~~~~.0 0.002


0.004 0.008

20

40

60

80

O.OOB

1.0

1.
0.5

.*(log

LSTAT)
1
-

h ~2
-:.

0.0

-1 .o
-4 -3 -2

-2N-1 -1 0 1 2 3 4

log(LSTAT) (b) Transformed MV;(c) Transformed Data: (a) Transformed Iog(MV); Figure 4. Boston RM2(a= .492);(d) Transformed Hdousing y Versus NOX2 (a = .09); (ih)Transformed Tax(a - .122); (g) Transformed PT Rtatio (a = .147); (f)Transformed (a - .417); (e) Transformed Predictor of Transformed y.

This content downloaded from 193.136.144.3 on Thu, 23 Jan 2014 18:34:54 PM All use subject to JSTOR Terms and Conditions


mation 0(y'). Thisfunction is seento havea positive curvature meteorology intheLos Angeles basin.Thedataconsist ofdaily forcentral valuesofy', connecting twostraight linesegments measurements of ozoneconcentration (maximum one houravof different slope in either side. This suggests thattheloga- erage)andeight meteorological for quantities 330 daysof 1976. rithmic transformation maybe too severe.Figure4(b) shows Appendix C liststhe variables used in the study.The ACE thetransformation 0(y) resulting whenthe(forward stepwise) algorithm was appliedherein thesameforward stepwise manACE algorithm is appliedto theoriginal untransformed census neras in theprevious (housing data) example.Fourvariables measurements. (The samepredictor variable setappears inthis wereselected.These are thefirst fourlistedin Appendix C. model.)This analysis indicates that, ifanything, a mildtrans- The resulting R2was .78. Running theACE algorithm with all formation, involving positive is most curvature, for eight appropriate predictor variables an R2 of .79. produces theresponse variable. In order to assess theextent to whichthesemeteorological Figures 4(c)-4(f) show the ACE transformations(X)k, (x,). variablescapture the daily variation of the ozone level, the day-of-the-year was addedandtheACE algorithm was kk4(Xk4) forthe(transformed) variables x' appearing variable predictor in the final model. The standarddeviation u(4,*) is indicated runwith itandthefour selected meteorological variables. This in each graph.This provides a measure of how strongly each can detect possibleseasonaleffects notcaptured by themeteintothe modelfor0*(y'). [Notethatv(0) = orological variables. The resulting R2 was .82. Figures 4>*(xj)enters 5(a)1.] The twoterms that enter moststrongly involve thenumber 5(f) showtheoptimal transformation estimates. ofrooms squared [Figure 4(c)] andthelogarithm ofthefraction The solution fortheresponse transformation, Figure5(a), of population that is of lowerstatus [Figure 4(d)]. 
The nearly showsthat, at most,a verymildtransformation with negative linear shapeof thelatter transformation that theorig- curvature is indicated. suggests Similarly, Figure 5(b) indicates that there inal logarithmic transformation was appropriate forthisvari- is no compelling to consider necessity a transformation on the able. The transformation on thenumber ofrooms vari- mostinfluential predictor Air Force Base squared variable, Sandburg ableisfar from that a simple linear, The solution however, transformation indicating estimates fortherequadratic Temperature. does notadequately itsrelationship to housing capture value. maining variables, areall highly however, nonlinear (andnonForfewer than sixrooms, valueis roughly housing For example, independent monotonic). Figure 5(d) suggests that theozone of roomnumber, whereas forlarger values there is a strong concentration is muchmoreinfluenced by themagnitude than linear increasing dependence. The remaining twovariables that thesignof thepressure gradient. enter intothismodelare pupil-teacher ratioand property tax The solution fortheday-of-the-year variable, Figure5(f), rate.The solution transformation fortheformer, Figure4(e), indicates a substantial seasonaleffect after forthe accounting is seen to be approximately linearwhereas forthelatter, meteorological that variables. This effect is minimum at theyear Figure 4(f), has considerable nonlinear structure. Fortaxrates boundaries and has a broadmaximum at aboutMay peaking ofup to $320, housing priceseemsto fallrapidly with increas- 1. Thiscanbe compared with thedependence ofozonepollution ing tax, whereasforlargerratesthe association is roughly on day-of-the-year intoaccount alone,without taking themeconstant. variables.Figure5(g) showsa smooth teorological of ozone Although thevariable (NOX)2was notselected byourstep- concentration on day-of-the-year. 
Thissmooth has anR2 of .38 wise procedure, we can try to estimate its marginal effect on and is seento peak aboutthree months later (August 3). median homevalueby including itwith thefour selected variThe factthattheday-of-the-year transformation peakedat ables and running ACE withtheresulting fivepredictor vari- thebeginning of May was initially puzzlingto us, sincethe ables. The increasein R2 over thefour-predictor modelwas highest pollution daysoccurfrom toSeptember. July Thislatter .006. The solution on theresponse transformations andoriginal factis confirmed by theday-of-the-year transformation with four predictors changed little. very The solution transformation themeteorological variables removed. Ourcurrent belief is that for(NOX)2is shownin Figure 4(g). This curveis a nonmon- withthemeteorological variables entered, beday-of-the-year otonic function ofNOX2,notwellapproximated bya linear (or comesa partial for hours ofdaylight surrogate before andduring function. monotone) Thismakesitdifficult to formulate a sim- themorning rush.The decline commuter pastMay 1 maythen ple interpretation of thewillingness to pay forclean air from be explained by the factthatdaylight savingtimegoes into thesedata. For low concentration values,housing pricesseem effect in Los Angeleson thelastSundayin April. to increase with increasing forhigher (NOX)2,whereas values Thesedataillustrate that in uncovering ACE is useful interthistrend is substantially reversed. andsuggestive ofthedependence esting Theform relationships. of O*(Yk) verus _j_ f*(Xkj) on the Daggettpressure Figure 4(h) showsa scatterplot and on the day-of-the-year gradient forthefour-predictor model.This plotshowsno evidence of wouldbe extremely difficult to find methodby anyprevious additional structure notcaptured in themodel ology.
The fitted model is of the form

θ*(Y) = Σ_{j=1}^{p} φ_j*(X_j) + e.
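For reference, e^2 as reported in these examples is just the fraction of the variance of the transformed response left unexplained by the sum of the transformed predictors. A minimal sketch of this computation on synthetic data (the function and variable names are ours, not from the ACE program):

```python
import math

def e_squared(theta_vals, phi_sums):
    """e^2 = mean[(theta(y_k) - sum_j phi_j(x_kj))^2] / var(theta(y))."""
    n = len(theta_vals)
    mean_t = sum(theta_vals) / n
    var_t = sum((t - mean_t) ** 2 for t in theta_vals) / n
    mse = sum((t - s) ** 2 for t, s in zip(theta_vals, phi_sums)) / n
    return mse / var_t

# With an exact additive relation, theta(y) = log y = 2x, e^2 vanishes.
x = [i / 50.0 for i in range(1, 101)]
y = [math.exp(2.0 * xi) for xi in x]
e2 = e_squared([math.log(yk) for yk in y], [2.0 * xk for xk in x])
```

With perfect transformations e^2 is essentially zero; with estimated ones it is the quantity the paper reports (e.g., .11 for the Boston housing data).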
4. DISCUSSION

The ACE algorithm provides a fully automated method for estimating optimal transformations in multiple regression. It also provides a method for estimating maximal correlation between random variables. It differs from other empirical methods for finding transformations (Box and Tidwell 1962; Anscombe and Tukey 1963; Box and Cox 1964; Kruskal 1964, 1965; Fraser 1967; Box and Hill 1974; Linsey 1972, 1974; Wood

The e*^2 resulting from the use of the ACE transformations was .11, as compared to the e^2 value of .20 produced by the Harrison and Rubinfeld (1978) transformations involving all 14 variables.

For our final example, we use the ACE algorithm to study the relationship between atmospheric ozone concentration and meteorology.

Journal of the American Statistical Association, September 1985

Figure 5. Solution transformations for the ozone data: θ*(UPO3), φ*(VSTY), φ*(Day of Year), φ*(IBHT), φ*(DGPG).
Breiman and Friedman: Estimating Optimal Transformations
1974; Mosteller and Tukey 1977; and Tukey 1982) in that the "best" transformations of the response and predictor variables are unambiguously defined and estimated without use of ad hoc heuristics, restrictive distributional assumptions, or restriction of the transformations to a particular parametric family.

The algorithm is reasonably computer efficient. On the Boston housing data set comprising 506 data points with 14 variables each, the run took 12 seconds of central processing unit (CPU) time on an IBM 3081 computer. Our guess is that this translates into 2.5 minutes on a VAX 11/750 computer. To extrapolate to other problems, use the estimate that running time is proportional to (number of variables) x (sample size).

A strong advantage of the ACE procedure is the ability to incorporate variables of quite different type in terms of the set of values they can assume. The transformation functions θ(y), φ_1(x_1), . . ., φ_p(x_p) assume values on the real line. Their arguments can, however, assume values on any set. For example, ordered real, periodic (circularly valued) real, ordered categorical, and unordered categorical variables can be incorporated in the same regression equation. For periodic variables, the smoother window need only wrap around the boundaries. For categorical variables, the procedure can be regarded as estimating optimal scores for each of their values. (The special case of a categorical response and a single categorical predictor variable is known as canonical analysis (see Kendall and Stuart 1967, p. 568), and the optimal scores can, in this case, also be obtained by solution of a matrix eigenvector problem.)

The ACE procedure can also handle variables of mixed type. For example, a variable indicating present marital status might take on an integer value (number of years married) or one of several categorical values (N = never, D = divorced, W = widowed, etc.). This presents no additional complication in estimating conditional expectations. This ability provides a straightforward way to handle missing data values (Young et al. 1976). In addition to the regular set of values realized by a variable, it can also take on the value "missing."

In some situations the analyst, after running ACE, may want to estimate values of y rather than θ*(y), given a specific value of x. One method for doing this is to attempt to compute θ*^{-1}(Σ_j φ_j*(x_j)). Letting Z = Σ_j φ_j*(X_j), however, we know that the best least squares predictor of Y of the form g(Z) is given by E(Y | Z). This is implemented in the current ACE program by predicting y as the function of Σ_j φ_j*(x_j) obtained by smoothing the data values of y on the data values of Σ_j φ_j*(x_j). We are grateful to Arthur Owens for suggesting this simple and elegant prediction procedure.

The solution functions θ*(y) and φ_1*(x_1), . . ., φ_p*(x_p) can be stored as a set of values associated with each observation (y_k, x_k1, . . ., x_kp), 1 <= k <= N. Since θ(y) and φ(x), however, are usually smooth (for continuous y, x), they can be easily approximated and stored as cubic spline functions (de Boor 1978) with a few knots.

As a tool for data analysis, the ACE procedure provides graphical output to indicate a need for transformations as well as to guide in their choice. If a particular plot suggests a familiar functional form for a transformation, then the data can be pretransformed using this functional form and the ACE algorithm can be rerun. The linearity (or nonlinearity) of the resulting ACE transformation on the variable in question gives an indication of how good the analyst's guess is. We have found that the plots themselves often give surprising new insights into the relationship between the response and predictor variables.

As with any regression procedure, a high degree of association between predictor variables can sometimes cause the individual transformation estimates to be highly variable, even though the complete model is reasonably stable. When this is suspected, running the algorithm on randomly selected subsets of the data, or on bootstrap samples (Efron 1979), can assist in assessing the variability.

The ACE method has generality beyond that exploited here. An immediate generalization would involve multiple response variables Y_1, . . ., Y_q. The generalized algorithm would estimate optimal transformations θ_1*, . . ., θ_q*, φ_1*, . . ., φ_p* that minimize

E[Σ_{l=1}^{q} θ_l(Y_l) - Σ_{j=1}^{p} φ_j(X_j)]^2

subject to Eθ_l = 0, l = 1, . . ., q, Eφ_j = 0, j = 1, . . ., p, and ||Σ_{l=1}^{q} θ_l(Y_l)||^2 = 1. This extension generalizes the ACE procedure in a sense similar to that in which canonical correlation generalizes linear regression. The ACE algorithm (Section 2) is easily modified to incorporate this extension. An inner loop over the response variables, analogous to that for the predictor variables, replaces the single-function minimization.

5. OPTIMAL TRANSFORMATIONS IN FUNCTION SPACE

5.1 Introduction

In this section, we first prove the existence of optimal transformations (Theorem 5.2). Then we show that the ACE algorithm converges to an optimal transformation (Theorems 5.4 and 5.5).

Define random variables to take values either in the reals or in a finite or countable unordered set. Given a set of random variables Y, X_1, . . ., X_p, a transformation is defined by a set of real-valued measurable functions (θ, φ_1, . . ., φ_p) = (θ, φ), each function defined on the range of the corresponding random variable, such that

Eθ(Y) = 0, Eφ_j(X_j) = 0, j = 1, . . ., p,
Eθ^2(Y) < ∞, Eφ_j^2(X_j) < ∞, j = 1, . . ., p. (5.1)

Use the notation

φ(X) = Σ_{j=1}^{p} φ_j(X_j). (5.2)

Denote the set of all transformations by W.

Definition 5.1. A transformation (θ*, φ*) is optimal for regression if E(θ*)^2 = 1 and

e*^2 = E[θ*(Y) - φ*(X)]^2 = inf{E[θ(Y) - φ(X)]^2; Eθ^2 = 1}.

Definition 5.2. A transformation (θ**, φ**) is optimal for
correlation if E(θ**)^2 = 1, E(φ**)^2 = 1, and

ρ* = E[θ**(Y)φ**(X)] = sup{E[θ(Y)φ(X)]; Eθ^2 = 1, Eφ^2 = 1}.

Theorem 5.1. If (θ**, φ**) is optimal for correlation, then θ* = θ**, φ* = ρ*φ** is optimal for regression, and the converse. Furthermore, e*^2 = 1 - ρ*^2.

Proof. Write

E(θ - φ)^2 = 1 - 2Eθφ + Eφ^2
           = 1 - 2(Eθφ/(Eφ^2)^{1/2})(Eφ^2)^{1/2} + Eφ^2
           >= 1 - 2ρ*(Eφ^2)^{1/2} + Eφ^2, (5.3)

with equality only if Eθφ/(Eφ^2)^{1/2} = ρ*. The minimum of the right side of (5.3) over Eφ^2 is at Eφ^2 = (ρ*)^2, where it is equal to 1 - (ρ*)^2. Then (e*)^2 = 1 - (ρ*)^2; and if (θ**, φ**) is optimal for correlation, then θ* = θ**, φ* = ρ*φ** is optimal for regression. The argument is reversible. (A similar result appears in Csaki and Fischer 1963.)

5.2 Existence of Optimal Transformations

To show existence of optimal transformations, two additional assumptions are needed.

Assumption 5.1. The only set of functions satisfying (5.1) such that

θ(Y) + Σ_j φ_j(X_j) = 0 a.s.

are individually a.s. zero.

To formulate the second assumption, we use Definition 5.3.

Definition 5.3. Define the Hilbert spaces H^2(Y), H^2(X_1), . . ., H^2(X_p) as the sets of functions satisfying (5.1) with the usual inner product; that is, H^2(X_j) is the set of all measurable φ_j such that Eφ_j(X_j) = 0, Eφ_j^2(X_j) < ∞, with (φ_j', φ_j) = E[φ_j'(X_j)φ_j(X_j)].

Assumption 5.2. The conditional expectation operators E(φ_j(X_j) | Y): H^2(X_j) -> H^2(Y); E(φ_i(X_i) | X_j): H^2(X_i) -> H^2(X_j), i ≠ j; and E(θ(Y) | X_i): H^2(Y) -> H^2(X_i) are all compact.

Assumption 5.2 is satisfied in most cases of interest. A sufficient condition is given by the following. Let X, Y be random variables with joint density f_{X,Y} and marginals f_X, f_Y. Then the conditional expectation operator H^2(Y) -> H^2(X) is compact if

∬ [f_{X,Y}^2 / (f_X f_Y)] dx dy < ∞.

Theorem 5.2. Under Assumptions 5.1 and 5.2, optimal transformations exist.

Some machinery is needed.

Proposition 5.1. The set of all functions f of the form

f(Y, X) = θ(Y) + Σ_j φ_j(X_j), θ in H^2(Y), φ_j in H^2(X_j),

with the inner product and norm

(g, f) = E[gf], ||f||^2 = Ef^2,

is a Hilbert space denoted by H^2. The subspace of all functions φ of the form

φ(X) = Σ_j φ_j(X_j), φ_j in H^2(X_j),

is a closed linear subspace denoted by H^2(X). So are H^2(Y), H^2(X_1), . . ., H^2(X_p).

Proposition 5.1 follows from Proposition 5.2.

Proposition 5.2. Under Assumptions 5.1 and 5.2, there are constants 0 < c_1 <= c_2 < ∞ such that

c_1(||θ||^2 + Σ_j ||φ_j||^2) <= ||θ + Σ_j φ_j||^2 <= c_2(||θ||^2 + Σ_j ||φ_j||^2).

Proof. The right-hand inequality is immediate. If the left side does not hold, we can find a sequence f_n = θ_n + Σ_j φ_nj such that ||θ_n||^2 + Σ_j ||φ_nj||^2 = 1 but ||f_n||^2 -> 0. There is a subsequence n' such that θ_n' -> θ, φ_n'j -> φ_j in the sense of weak convergence in H^2(Y), H^2(X_1), . . ., H^2(X_p), respectively. Write

E[φ_n'j(X_j)φ_n'i(X_i)] = E[φ_n'j(X_j)E(φ_n'i(X_i) | X_j)].

Assumption 5.2 implies Eφ_n'j φ_n'i -> Eφ_j φ_i (i ≠ j), and similarly Eθ_n' φ_n'j -> Eθφ_j. Furthermore, ||φ_j|| <= lim inf ||φ_n'j|| and ||θ|| <= lim inf ||θ_n'||. Thus, defining f = θ + Σ_j φ_j,

||f||^2 = ||θ + Σ_j φ_j||^2 <= lim inf ||f_n'||^2 = 0,

which implies, by Assumption 5.1, that θ = φ_1 = · · · = φ_p = 0. On the other hand,

1 = ||θ_n'||^2 + Σ_j ||φ_n'j||^2 = ||f_n'||^2 - 2 Σ_j (θ_n', φ_n'j) - 2 Σ_{i<j} (φ_n'i, φ_n'j),

and the cross terms converge to those of the zero limit functions. Hence, if f = 0, then lim inf ||f_n||^2 >= 1, a contradiction.

Corollary 5.1. If f_n -> f weakly in H^2, then θ_n -> θ weakly in H^2(Y), φ_nj -> φ_j weakly in H^2(X_j), j = 1, . . ., p, and the converse.

Proof. If f_n = θ_n + Σ_j φ_nj -> f = θ + Σ_j φ_j, then by Proposition 5.2, lim sup ||θ_n|| < ∞, lim sup ||φ_nj|| < ∞. Take n' such that θ_n' -> θ', φ_n'j -> φ_j' weakly, and let f' = θ' + Σ_j φ_j'. Then for any g in H^2, (g, f_n') -> (g, f'), so (g, f) = (g, f') for all g. The converse is easier.

Definition 5.4. In H^2, let P_Y, P_j, and P_X denote the projection operators onto H^2(Y), H^2(X_j), and H^2(X), respectively.
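When Y and X take finitely many values, the projection operators just defined are finite matrices, and the eigenproblem for U = P_Y P_X can be iterated directly on score vectors. A sketch (the 3 x 3 joint table and all names are made up for illustration, not taken from the paper):

```python
# Iterate theta <- E(E(theta | X) | Y), renormalizing under the Y-marginal;
# the norm of E(theta | X) at convergence estimates the maximal correlation.
p = [[0.20, 0.05, 0.05],
     [0.05, 0.20, 0.05],
     [0.05, 0.05, 0.30]]
py = [sum(row) for row in p]                       # marginal of Y
px = [sum(p[a][b] for a in range(3)) for b in range(3)]  # marginal of X

def normalize(th):
    """Center and scale scores so E th = 0, E th^2 = 1 under py."""
    m = sum(py[a] * th[a] for a in range(3))
    th = [t - m for t in th]
    nrm = sum(py[a] * t * t for a, t in enumerate(th)) ** 0.5
    return [t / nrm for t in th]

theta = normalize([1.0, 2.0, 4.0])                 # arbitrary starting scores
for _ in range(200):
    phi = [sum(p[a][b] * theta[a] for a in range(3)) / px[b] for b in range(3)]
    theta = normalize([sum(p[a][b] * phi[b] for b in range(3)) / py[a]
                       for a in range(3)])
# estimated maximal correlation rho* = ||E(theta | X)||
rho = sum(px[b] * phi[b] ** 2 for b in range(3)) ** 0.5
```

For a categorical response and a single categorical predictor this is the classical optimal-scoring (canonical analysis) computation mentioned in Section 4.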

On H^2(X_i), P_j (j ≠ i) is the conditional expectation operator, and similarly for P_Y.

Proposition 5.3. P_Y is compact on H^2(X) -> H^2(Y), and P_X is compact on H^2(Y) -> H^2(X).

Proof. Take φ_n in H^2(X), φ_n -> φ weakly. By Corollary 5.1, φ_nj -> φ_j weakly. By Assumption 5.2, P_Y φ_nj -> P_Y φ_j, so that P_Y φ_n -> P_Y φ. Now take θ in H^2(Y), φ in H^2(X); then (θ, P_Y φ) = (θ, φ) = (P_X θ, φ). Thus P_X: H^2(Y) -> H^2(X) is the adjoint of P_Y and hence compact.

Now to complete the proof of Theorem 5.2, consider the functional ||θ - φ||^2 on the set of all (θ, φ) with ||θ||^2 = 1. If there is a θ* that achieves the minimum of ||θ - P_X θ||^2 over ||θ||^2 = 1, then an optimal transformation is θ*, P_X θ*. On ||θ||^2 = 1,

||θ - P_X θ||^2 = 1 - ||P_X θ||^2.

Let s = sup{||P_X θ||; ||θ|| = 1}. Take θ_n such that ||θ_n|| = 1, θ_n -> θ weakly, and ||P_X θ_n|| -> s. By the compactness of P_X, ||P_X θ_n|| -> ||P_X θ||, so ||P_X θ|| = s. Furthermore, ||θ|| <= 1. If ||θ|| < 1, then for θ' = θ/||θ|| we get the contradiction ||P_X θ'|| > s. Hence ||θ|| = 1, and (θ, P_X θ) is an optimal transformation. This argument assumes that s > 0. If s = 0, then ||θ - P_X θ|| = 1 for all θ with ||θ|| = 1, and any (θ, 0) is optimal.

5.3 Characterization of Optimal Transformations

Define two operators, U: H^2(Y) -> H^2(Y) and V: H^2(X) -> H^2(X), by

Uθ = P_Y P_X θ, Vφ = P_X P_Y φ.

Proposition 5.4. U and V are compact, self-adjoint, and non-negative definite. They have the same eigenvalues, and there is a 1-1 correspondence between eigenspaces, specified for a positive eigenvalue by

θ = P_Y φ/||P_Y φ|| <=> φ = P_X θ/||P_X θ||.

Proof. Direct verification.

Let the largest eigenvalue be denoted by λ, λ = ||U|| = ||V||. In the sequel we add the assumption that there is at least one θ(Y) such that ||P_X θ|| > 0. Then λ > 0 and Theorem 5.3 follows.

Theorem 5.3. If θ*, φ* is an optimal transformation for regression, then

λθ* = Uθ*, λφ* = Vφ*.

Conversely, if θ satisfies λθ = Uθ, ||θ|| = 1, then θ, P_X θ is optimal for regression. If φ satisfies λφ = Vφ, then θ = P_Y φ/||P_Y φ|| and λφ/||P_Y φ|| are optimal for regression. In addition,

e*^2 = 1 - λ.

Proof. Let θ*, φ* be optimal. Then φ* = P_X θ*. Write

||θ* - φ*||^2 = 1 - 2(θ*, φ*) + ||φ*||^2.

Note that (θ*, φ*) = (θ*, P_Y φ*) <= ||P_Y φ*||, with equality only if θ* = P_Y φ*/||P_Y φ*||. Therefore θ* = c P_Y φ*, c constant. This implies

||P_Y φ*|| θ* = Uθ*, ||P_Y φ*|| φ* = Vφ*,

so that λ* = ||P_Y φ*|| is an eigenvalue of U and V. Computing gives ||θ* - φ*||^2 = 1 - λ*. Now take θ any eigenfunction of U corresponding to λ with ||θ|| = 1, and let φ = P_X θ; then ||θ - φ||^2 = 1 - λ. This shows that θ*, φ* are not optimal unless λ* = λ. The rest of the theorem is straightforward verification.

Corollary 5.2. If λ has multiplicity one, then the optimal transformation is unique up to a sign change. In any case, the set of optimal transformations is finite dimensional.

5.4 Alternating Conditional Methods

Direct solution of the equations λθ = Uθ or λφ = Vφ is formidable. Attempting to use data to directly estimate the solutions is just as difficult. In the bivariate case, if X, Y are categorical, then λθ = Uθ becomes a matrix eigenvalue problem and is tractable. This is the case treated in Kendall and Stuart (1967).

The ACE algorithm is founded on the observation that there is an iterative method for finding optimal transformations. We illustrate this in the bivariate case. The goal is to minimize ||θ(Y) - φ(X)||^2 with ||θ||^2 = 1. Denote P_X θ = E(θ | X), P_Y φ = E(φ | Y). Start with any first-guess function θ_0(Y) having a nonzero projection on the eigenspace of the largest eigenvalue of U. Then define a sequence of functions by

φ_0 = P_X θ_0, θ_1 = P_Y φ_0/||P_Y φ_0||, φ_1 = P_X θ_1,

and in general

θ_{n+1} = P_Y φ_n/||P_Y φ_n||, φ_{n+1} = P_X θ_{n+1}.

It is clear that at each step in the iteration ||θ - φ||^2 is decreased. It is not hard to show that in general θ_n, φ_n converge to an optimal transformation.

The preceding method of alternating conditional expectations extends to the general multivariate case. The analog is clear; given θ_n, φ_n, the next iteration is the same. However, there is an additional issue: How can P_X θ be computed using only the conditional expectation operators P_j (j = 1, . . ., p)? This is done by starting with some function φ_0 and iteratively subtracting off the projections of θ - φ_n on the subspaces H^2(X_1), . . ., H^2(X_p) until we get a function φ such that the projection of θ - φ on each of the H^2(X_j) is zero. This leads to the double-loop algorithm.

The Double-Loop Algorithm

The Outer Loop. (a) Start with an initial guess θ_0(Y). (b) Put φ_{n+1} = P_X θ_n, θ_{n+1} = P_Y φ_{n+1}/||P_Y φ_{n+1}|| and repeat until convergence.

Theorem 5.4. Let P_E θ_0 be the projection of θ_0 on the eigenspace E of U corresponding to λ. If ||P_E θ_0|| ≠ 0, define an optimal transformation by θ* = P_E θ_0/||P_E θ_0||, φ* = P_X θ*. Then ||θ_n - θ*|| -> 0 and ||φ_n - φ*|| -> 0.
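On a finite sample, the conditional expectations in the alternating iteration above are estimated by data smooths (Appendix A). A minimal bivariate sketch using crude bin-average smooths (our own illustrative code and names, not the ACE program):

```python
import math

def bin_smooth(target, by, n_bins=10):
    """Estimate E(target | by) by averaging within equal-width bins of `by`."""
    lo, hi = min(by), max(by)
    idx = lambda v: min(int((v - lo) / (hi - lo + 1e-12) * n_bins), n_bins - 1)
    sums, cnts = [0.0] * n_bins, [0] * n_bins
    for t, b in zip(target, by):
        sums[idx(b)] += t
        cnts[idx(b)] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, cnts)]
    return [means[idx(b)] for b in by]

def standardize(v):
    m = sum(v) / len(v)
    v = [u - m for u in v]
    nrm = math.sqrt(sum(u * u for u in v) / len(v))
    return [u / nrm for u in v]

def ace_bivariate(x, y, iters=20):
    theta = standardize(list(y))                 # theta_0(y) = y, standardized
    phi = bin_smooth(theta, x)
    for _ in range(iters):
        phi = bin_smooth(theta, x)               # phi   = est. E(theta | x)
        theta = standardize(bin_smooth(phi, y))  # theta = est. E(phi | y), rescaled
    return theta, phi

x = [k / 100.0 for k in range(100)]
y = [math.exp(3.0 * xk) for xk in x]             # nonlinear but monotone link
theta, phi = ace_bivariate(x, y)
corr = sum(t * f for t, f in zip(theta, phi)) / len(theta)
```

For this noiseless monotone relation both estimated transformations linearize the link, and `corr` approaches the sample maximal correlation, close to 1.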

Proof. Notice that θ_{n+1} = Uθ_n/||Uθ_n||. For any n, θ_n = a_n θ* + g_n, where g_n is orthogonal to E, because, if it is true for n, then

θ_{n+1} = (a_n λ θ* + U g_n)/||a_n λ θ* + U g_n||,

and U g_n is orthogonal to E. For any g orthogonal to E, ||Ug|| <= r||g||, where r < λ. Since

a_{n+1} = λ a_n/||Uθ_n||, g_{n+1} = U g_n/||Uθ_n||,

it follows that ||g_{n+1}||/a_{n+1} <= (r/λ)(||g_n||/a_n), so ||g_n||/a_n <= c(r/λ)^n. But ||θ_n|| = 1, implying a_n^2 + ||g_n||^2 = 1. Since a_0 > 0, then a_n > 0; so a_n -> 1. Now use ||θ_n - θ*||^2 = (1 - a_n)^2 + ||g_n||^2 to reach the conclusion. Since ||φ_{n+1} - φ*|| = ||P_X θ_n - P_X θ*|| <= ||θ_n - θ*|| ||P_X||, the theorem follows.

There are two versions of the double loop. In the first, the initial functions φ_0 in the inner loop are the limiting functions produced by the preceding inner loop. This is called the restart version. In the second, the inner loop starts fresh with the initial functions φ_0 = 0. The main theoretical difference is that a stronger consistency result holds for the fresh start. Restart is a faster-running algorithm, and it is embodied in the ACE code.

The Inner Loop. (a) Start with functions θ, φ_0. (b) If, after m stages of iteration, the functions are φ_1^(m), . . ., φ_p^(m), then define, for j = 1, 2, . . ., p,

φ_j^(m+1) = P_j(θ - Σ_{i<j} φ_i^(m+1) - Σ_{i>j} φ_i^(m)).

Theorem 5.5. Let φ_m = Σ_j φ_j^(m). Then ||P_X θ - φ_m|| -> 0.

Proof. Define the operator T by

T = (I - P_p)(I - P_{p-1}) · · · (I - P_1).

Then the iteration in the inner loop is expressed as

θ - φ_{m+1} = T(θ - φ_m), so θ - φ_{m+1} = T^{m+1}(θ - φ_0). (5.5)

Write T(θ - P_X θ) = θ - P_X θ. Noting that θ - φ_0 = (θ - P_X θ) + (P_X θ - φ_0), (5.5) becomes

φ_{m+1} = P_X θ - T^{m+1}(P_X θ - φ_0).

The theorem is then proven by Proposition 5.5.

Proposition 5.5. For any φ in H^2(X), ||T^m φ|| -> 0.

Proof. ||(I - P_j)φ||^2 = ||φ||^2 - ||P_j φ||^2 <= ||φ||^2, so ||Tφ|| <= ||φ||. There is no φ ≠ 0 in H^2(X) such that ||Tφ|| = ||φ||. If there were, then ||P_j φ|| = 0 for all j, and with φ = Σ_j φ_j,

(φ, φ) = Σ_j (φ, φ_j) = Σ_j (P_j φ, φ_j) = 0.

The operator T can be decomposed as I + W, where W is compact (expanding the product, every term other than I contains at least one of the compact operators P_j). Now we claim that ||T^m W|| -> 0 on H^2(X). To prove this, let γ > 0 and define

G(γ) = sup{||TWφ||/||Wφ||; ||φ|| <= 1, ||Wφ|| >= γ}.

Take φ_n with ||φ_n|| <= 1, ||Wφ_n|| >= γ such that ||TWφ_n||/||Wφ_n|| -> G(γ), and a weakly convergent subsequence φ_n' -> φ; then Wφ_n' -> Wφ, so that ||φ|| <= 1, ||Wφ|| >= γ, and G(γ) = ||TWφ||/||Wφ||. Thus G(γ) < 1 for all γ > 0, and G is clearly nonincreasing in γ. Since W = T - I commutes with T, T^m Wφ = W(T^m φ), and ||T^m Wφ|| is nonincreasing in m; as long as ||T^k Wφ|| >= γ,

||T^{k+1} Wφ|| <= G(γ)||T^k Wφ||,

so that, with γ_0 = ||W||, ||T^m Wφ|| <= max(γ, G^m(γ)γ_0) for every φ with ||φ|| <= 1. Hence ||T^m W|| -> 0.

The range of W is dense in H^2(X). Otherwise there is a φ' ≠ 0 with (W*φ', φ) = 0 for all φ; this implies ||T*φ'|| = ||φ'||, and a repetition of the argument given before leads to φ' = 0. For any φ and ε > 0, take Wφ_1 so that ||φ - Wφ_1|| <= ε. Then ||T^m φ|| <= ε + ||T^m Wφ_1||, which completes the proof.
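The inner loop is a cyclic Gauss-Seidel projection scheme (essentially what the later additive-models literature calls backfitting). A minimal data sketch, with bin-average smooths standing in for the projections P_j (illustrative code and names, not the authors' implementation):

```python
import math

def bin_smooth(target, by, n_bins=10):
    """Crude stand-in for P_j: average `target` within equal-width bins of `by`."""
    lo, hi = min(by), max(by)
    idx = lambda v: min(int((v - lo) / (hi - lo + 1e-12) * n_bins), n_bins - 1)
    sums, cnts = [0.0] * n_bins, [0] * n_bins
    for t, b in zip(target, by):
        sums[idx(b)] += t
        cnts[idx(b)] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, cnts)]
    return [means[idx(b)] for b in by]

def inner_loop(theta, xs, sweeps=15):
    """phi_j <- P_j(theta - sum_{i != j} phi_i), cycled over j."""
    n, p = len(theta), len(xs)
    phis = [[0.0] * n for _ in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            resid = [theta[k] - sum(phis[i][k] for i in range(p) if i != j)
                     for k in range(n)]
            phis[j] = bin_smooth(resid, xs[j])
    return phis

n = 200
x1 = [k / n for k in range(n)]
x2 = [((37 * k) % n) / n for k in range(n)]          # roughly shuffled copy of x1
theta = [x1[k] + math.sin(2 * math.pi * x2[k]) for k in range(n)]
phis = inner_loop(theta, [x1, x2])
resid = [theta[k] - phis[0][k] - phis[1][k] for k in range(n)]
mse = sum(r * r for r in resid) / n                  # small: additive fit recovered
```

Each full sweep applies (I - P_p) · · · (I - P_1) to the current residual, which is exactly the operator T of Theorem 5.5.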

The Single-Loop Algorithm

The original implementation of ACE combined a single iteration of the inner loop with an iteration of the outer loop. It is summarized by the following.

1. Start with θ_0, φ^(0) = 0.
2. If the current functions are θ_n, φ_1^(n), . . ., φ_p^(n), define φ^(n+1) by

φ_j^(n+1) = P_j(θ_n - Σ_{i<j} φ_i^(n+1) - Σ_{i>j} φ_i^(n)), j = 1, . . ., p.

3. Let θ_{n+1} = P_Y φ_{n+1}/||P_Y φ_{n+1}||, where φ_{n+1} = Σ_j φ_j^(n+1). Run to convergence.

This is a cleaner algorithm than the double loop, and its implementation on data runs at least twice as fast as the double loop and requires only a single convergence test. Unfortunately, we have been unable to prove that it converges in function space. Assuming convergence, it can be shown that the limiting θ is an eigenfunction of U. But giving conditions for θ to correspond to λ, or even showing that θ will correspond to λ "almost always," seems difficult. For this reason, we adopted the double-loop algorithm instead.

APPENDIX A: THE ACE ALGORITHM ON FINITE DATA SETS

A.1 Introduction

The ACE algorithm is implemented on finite data sets by replacing conditional expectations, given continuous variables, by data smooths. In the theoretical results concerning the convergence and consistency properties of the ACE algorithm, the critical element is the properties of the data smooth used. The results are fragmentary. Convergence of the algorithm is proven only for a restricted class of smooths. In practice, in more than 1,000 runs of ACE on a wide variety of data sets and using three different types of smooths, we have seen only one instance of failure to converge. A fairly general, but weak, consistency proof is given. We conjecture the form of a stronger consistency result.

A.2 Data Smooths

Define a data set D to be a set {x_1, . . ., x_N} of N points in p-dimensional space; that is, x_k = (x_k1, . . ., x_kp). Let D_N be the collection of all such data sets. For fixed D, define F(x) as the space of all real-valued functions defined on D; that is, φ in F(x) is defined by the N real numbers {φ(x_1), . . ., φ(x_N)}. Define F(x_j), j = 1, . . ., p, as the space of all real-valued functions defined on the set {x_1j, x_2j, . . ., x_Nj}.

Definition A.1. A data smooth S of x on x_j is a mapping S: F(x) -> F(x_j) defined for every D in D_N. If φ is in F(x), denote the corresponding element in F(x_j) by S(φ | x_j) and its values by S(φ | x_kj).

Let x be any one of x_1, . . ., x_p. Some examples of data smooths are the following.
1. Histogram. Divide the real axis into disjoint intervals {I_m}. If x_k is in I_m, define

S(φ | x_k) = (1/N_m) Σ_{x_l in I_m} φ(x_l),

where N_m is the number of the x_l in I_m.

2. Nearest Neighbor. Fix M < N/2. Order the x_i, getting x_1 < x_2 < · · · < x_N (assume no ties) and corresponding φ(x_1), . . ., φ(x_N). Put

S(φ | x_k) = (1/2M) Σ_{m=-M, m≠0}^{M} φ(x_{k+m}).

If M points are not available on one side, make up the deficiency on the other side.

3. Kernel. Take K(x) defined on the reals with maximum at x = 0. Then

S(φ | x_k) = Σ_m φ(x_m)K(x_m - x_k) / Σ_m K(x_m - x_k).

4. Regression. Fix M and order the x_k as in example 2. At x_k, regress the values φ(x_{k-M}), . . ., φ(x_{k+M}), excluding φ(x_k), on x_{k-M}, . . ., x_{k+M}, excluding x_k, getting a regression line L(x). Put S(φ | x_k) = L(x_k). If M points are not available on each side of x_k, make up the deficiency on the other side.

5. Supersmoother. See Friedman and Stuetzle (1982).

Some properties of smooths that are relevant to the behavior of ACE are given next. These properties hold only if they are true for all D in D_N.

1. Linearity. A smooth is linear if

S(αφ_1 + βφ_2) = αSφ_1 + βSφ_2

for all φ_1, φ_2 in F(x) and all constants α, β.

2. Constant Preserving. If φ in F(x) is constant (φ ≡ c), then Sφ = c.

To give a further property, introduce the inner product on F(x) defined by

(φ, φ')_N = (1/N) Σ_k φ(x_k)φ'(x_k)

and the corresponding norm || · ||_N.

3. Boundedness. S is bounded by M if ||Sφ||_N <= M||φ||_N for all φ in F(x), where ||Sφ||_N is defined on F(x_j) exactly as ||φ||_N is defined on F(x).

In these examples of smooths, all are linear except the supersmoother. This implies that they can be represented as an N x N matrix operator varying with D. All are constant preserving. Histograms are bounded, the nearest neighbor smooth by 2. Regression is unbounded due to end effects, but in Section A.5 we introduce a modified regression smooth that is bounded by 2. The bound for kernel smooths is more complicated.

A.3 Convergence of ACE

Let the data be of the form (y_k, x_k1, . . ., x_kp) = (y_k, x_k), k = 1, . . ., N. Define smooths S_j: F(y, x) -> F(x_j), j = 1, . . ., p, and S_y: F(y, x) -> F(y). Let H_2(y, x) be the set of all functions in F(y, x) with zero mean, and let H_2(y), H_2(x_j) be the corresponding subspaces. It is essential to modify the smooths so that the resulting functions have zero means. This is done by subtracting the mean; thus the modified S_j is defined by

S_j φ = S_j φ - Av(S_j φ). (A.1)

Henceforth, we use only modified smooths and assume the original smooth to be constant preserving, so that the modified smooths take constants into zero.

The ACE algorithm is defined by the following.

1. θ^(0)(y_k) = y_k, φ_j^(0)(x_kj) = 0.
2. (The inner loop) At the nth stage of the outer loop, start with θ^(n), φ_j^(0). For every m >= 1 and j = 1, . . ., p, define

φ_j^(m+1) = S_j(θ^(n) - Σ_{i<j} φ_i^(m+1) - Σ_{i>j} φ_i^(m)).

Keep increasing m until convergence, to limits φ_j.
3. (The outer loop) Set θ^(n+1) = S_y(Σ_j φ_j)/||S_y(Σ_j φ_j)||_N. Go back to the inner loop with φ_j^(0) = φ_j (restart) or φ_j^(0) = 0 (fresh start). Continue until convergence.

To formalize this algorithm, introduce the space H_2(θ, φ) with elements (θ, φ_1, . . ., φ_p), θ in H_2(y), φ_j in H_2(x_j), and subspaces H_2(θ) with elements (θ, 0, . . ., 0) and H_2(φ) with elements (0, φ_1, . . ., φ_p). For f = (f_0, f_1, . . ., f_p) in H_2(θ, φ), define S_i: H_2(θ, φ) -> H_2(θ, φ) by

(S_i f)_j = f_j, j ≠ i,
(S_i f)_i = S_i(f_0 - Σ_{j≠i} f_j).

Starting with θ = (θ, 0, . . ., 0) and φ^(m) = (0, φ^(m)), one complete cycle in the inner loop is described by

θ - φ^(m+1) = (I - S_p) · · · (I - S_1)(θ - φ^(m)). (A.2)

Define T on H_2(θ, φ) as the product operator in (A.2). Then

θ - φ^(m) = T^m(θ - φ^(0)). (A.3)

If, for a given θ, the inner loop converges, then the limiting φ satisfies

S_j(θ - φ) = 0, j = 1, . . ., p. (A.4)

That is, the smooth of the residuals on any predictor variable is zero. Adding

θ = S_y(Σ_j φ_j)/||S_y(Σ_j φ_j)||_N (A.5)

to (A.4) gives a set of equations satisfied by the estimated optimal transformations.

Assume, for the remainder of this section, that the smooths are linear. Then (A.4) can be written as

S_j φ = S_j θ, j = 1, . . ., p. (A.6)

Let sp(S_j) denote the spectrum of the matrix S_j, and assume 1 is not in sp(S_j). (The number 1 is in the spectrum for constant preserving smooths but not for modified smooths.) Define matrices A_j by A_j = S_j(I - S_j)^{-1} and the matrix A as Σ_j A_j. Assume further that -1 is not in sp(A). Then (A.6) has the unique solution

φ_j = A_j(I + A)^{-1}θ. (A.7)

The element φ = (0, φ_1, . . ., φ_p) given by (A.7) will be denoted by Pθ. Rewrite (A.3), using (I - T)(θ - Pθ) = 0, as

φ^(m) = Pθ - T^m(Pθ - φ^(0)). (A.8)

Therefore, the inner loop converges if it can be shown that T^m f -> 0 for all f in H_2(φ). What we can show is Theorem A.1.

Theorem A.1. If det[I + A] ≠ 0 and if the spectral radii of S_1, . . ., S_p are all less than one, a necessary and sufficient condition for

T^m f -> 0 for all f in H_2(φ) is that

det[λI - (I - S_p/λ)^{-1}(I - S_p) · · · (I - S_1/λ)^{-1}(I - S_1)] (A.9)

has no zeros in |λ| >= 1 except λ = 1.

Proof. For T^m f -> 0, all f in H_2(φ), it is necessary and sufficient that the spectral radius of T be less than one. The equation Tf = λf is, in component form (f = (f_1, . . ., f_p)),

λf_j = -S_j(λ Σ_{i<j} f_i + Σ_{i>j} f_i), j = 1, . . ., p. (A.10)

Let s = Σ_i f_i and rewrite (A.10) as

(λI - S_j)f_j = S_j((1 - λ) Σ_{i<j} f_i - s). (A.11)

If λ = 1, (A.11) becomes (I - S_j)f_j = -S_j s, or f_j = -A_j s and s = -As. By assumption, this implies that s = 0, and hence f_j = 0 for all j. This rules out λ = 1 as an eigenvalue of T. For λ ≠ 1 with |λ| greater than the maximum of the spectral radii of the S_j (j = 1, . . ., p), define g_1 = -s and f_j = (g_{j+1} - g_j)/(1 - λ), so that g_j = (1 - λ) Σ_{i<j} f_i - s. Then (A.11) gives

(λI - S_j)(g_{j+1} - g_j) = (1 - λ)S_j g_j, or g_{j+1} = (I - S_j/λ)^{-1}(I - S_j)g_j. (A.12)

Since g_{p+1} = -λs, (A.12) leads to

λs = (I - S_p/λ)^{-1}(I - S_p) · · · (I - S_1/λ)^{-1}(I - S_1)s. (A.13)

If (A.13) has no nonzero solutions with |λ| >= 1, λ ≠ 1, then s = 0, g_j = 0, and f_j = 0 for all j. Conversely, if (A.13) has a solution s ≠ 0, it leads to a solution of (A.10).

Unfortunately, condition (A.9) is difficult to verify for general linear smooths. If the S_j are self-adjoint and non-negative definite, and all elements in the unmodified smooth matrix are non-negative, then all spectral radii of the S_j are less than one, and (A.9) can be shown to hold by first ruling out solutions of (A.13) with |λ| > 1 and then solutions with |λ| = 1.

Assuming that the inner loop converges to Pθ, the outer loop iteration is given by

θ^(n+1) = S_y(Pθ^(n))/||S_y(Pθ^(n))||_N. (A.14)

Put the matrix S_y P = U, so that

θ^(n+1) = Uθ^(n)/||Uθ^(n)||_N.

If the eigenvalue λ of U having largest absolute value is real and positive, then θ^(n+1) converges to the projection of θ^(0) on the eigenspace of λ. The limiting θ, Pθ is a solution of (A.4) and (A.5). If λ is not real and positive, then θ^(n) oscillates and does not converge. If the smooths are self-adjoint and non-negative definite, then S_y P is the product of two self-adjoint, non-negative definite matrices; hence it has only real non-negative eigenvalues. We are unable to find conditions guaranteeing this for more general smooths.

It can easily be shown that, with modifications near the endpoints, the nearest neighbor smooth satisfies the preceding conditions. Our current research indicates a possibility that other types of common smooths can also be modified into self-adjoint, non-negative definite smooths with non-negative matrix elements. For these, ACE convergence is guaranteed by the preceding arguments. ACE, however, has invariably converged using a variety of non-self-adjoint smooths (with one exception, found using an odd type of kernel smooth). We conjecture that for most data sets, reasonable smooths are "close" enough to being self-adjoint so that their largest eigenvalue is real, positive, and less than one.

A.4 Consistency of ACE

For θ, φ_1, . . ., φ_p any functions in H^2(Y), H^2(X_1), . . ., H^2(X_p), and any data set D in D_N, define functions P_j(φ_i | x_j) by

P_j(φ_i | x_kj) = E(φ_i(X_i) | X_j = x_kj). (A.15)

Let φ_j in H_2(x_j) be defined as the restriction of φ_j to the set of data values {x_1j, . . ., x_Nj} minus its mean value over the data values. Assume that the N data vectors (y_k, x_k) are samples from the distribution of (Y, X_1, . . ., X_p), not necessarily independent or even random (see Section A.5).

Definition A.2. Let S_y^(N), S_j^(N) be any sequence of data smooths. They are mean squared consistent if

E||S_j^(N)(φ_i | x_j) - P_j(φ_i | x_j)||_N^2 -> 0

for all φ_i as above, with the analogous definition for S_y^(N).

Whether or not the algorithm converges, a weak consistency result can be given under general conditions for the fresh-start algorithm. Start with θ_0 in H^2(Y). On each data set, run the inner-loop iteration m times; that is, define

φ^(n+1) = Pθ^(n) - T^m(Pθ^(n)).

Then set

θ^(n+1) = S_y φ^(n+1)/||S_y φ^(n+1)||_N.

Repeat the outer loop l times, getting the final functions θ_N(y; m, l), φ_jN(x_j; m, l). Do the analogous thing in function space starting with θ_0, getting functions whose restrictions to the data set D are denoted by θ(y; m, l), φ_j(x_j; m, l).

Theorem A.2. For the fresh-start algorithm, if the smooths S_y^(N), S_j^(N) are mean squared consistent, linear, and uniformly bounded as N -> ∞, and if for any θ in L_2(Y), ||θ||_N^2 -> ||θ||^2, then

E||θ_N(y; m, l) - θ(y; m, l)||_N^2 -> 0, E||φ_jN(x_j; m, l) - φ_j(x_j; m, l)||_N^2 -> 0.

If θ* is the optimal transformation P_E θ_0/||P_E θ_0||, φ* = P_X θ*, then as m, l -> ∞ in any way,

||θ(·; m, l) - θ*|| -> 0, ||φ_j(·; m, l) - φ_j*|| -> 0.

Proof. First note that for any product of smooths, E||S_i^(N) · · · S_j^(N) θ_0 - P_i · · · P_j θ_0||_N^2 -> 0. This is illustrated with S_i^(N) S_j^(N) θ_0 (i ≠ j). Since E||S_j^(N) θ_0 - P_j θ_0||_N^2 -> 0, then S_j^(N) θ_0 = P_j θ_0 + φ_{j,N}, where E||φ_{j,N}||_N^2 -> 0. Therefore

S_i^(N)(S_j^(N) θ_0) = S_i^(N) P_j θ_0 + S_i^(N) φ_{j,N}.

By assumption, ||S_i^(N) φ_{j,N}||_N <= M||φ_{j,N}||_N, where M does not depend on N. Therefore E||S_i^(N) φ_{j,N}||_N^2 -> 0. By assumption, E||S_i^(N) P_j θ_0 - P_i P_j θ_0||_N^2 -> 0, so that E||S_i^(N) S_j^(N) θ_0 - P_i P_j θ_0||_N^2 -> 0.

Proposition A.1. If θ_N is defined in H_2(y) for all data sets D, and θ in H^2(Y) is such that E||θ_N(y) - θ(y)||_N^2 -> 0, then

E||θ_N/||θ_N||_N - θ/||θ||||_N^2 -> 0.

Proof. Write

θ_N/||θ_N||_N - θ/||θ|| = (θ_N/||θ_N||_N - θ/||θ||_N) + θ(1/||θ||_N - 1/||θ||).

Then two parts are needed:

and second,to showthat F let Forthefirst part,


S2

Write 1IIOIIN 2N

Wm = Um P~,E

W = U -PE;

- Wll-- 0 again. Now, so IlWm


Um6o = Pm)0O + WI0o

1
N

(0N(Yk)
k IIONIIN

0 (Yk)
IIOIIN)

(ON , 0)N
IIONIINIIOIIN)

U'0O = 1IPE-0o + W00.


M 1 ? 10, iO,

(A.16)

For anyE > 0 we will showthat there existsmo,1 suchthat form


0. 001A ||wmo/im 8, |W Soll/2 * (A. 17)

ThenSN Let

to showthat 4, so it is enough SN

to getESN

r > max(G, ImlI; m' Take r = (, + A')/2and selectmosuchthat


VN
= =
-> N

(ON(Yk) -(YJ
111N -

MO). Denote by R(A, Wm)the resolventof Wm.Then WI =

I
27r

IIONIIk +
(IIONIIN

2(ON,
+

0)N
-

I|=r

R Wm)di RQL,

IIIIN)

2(1IOIINIIONIIN

(ON,

0)N)-

and
0

20, E(I10NIIN Both termsare positive, and since EVN IIOIIN)2 and E(IIOIINIIONIIN - (ON, 0)N) O 0. By assumption, 1101kN resultingin SN 40

-ilmi

110112,

rI | gR(A, Wm)11dJAJ, 27r 12=r

Now look at

WN

Nk
IIIIk(1

0 O2(yk)[liII01IN

= r. On JiA = r, for m IIR(A, m io, where alongJIH dI)4is arclength is continuous and bounded.Furthermore, Wm)II_ Wm)II -> IIR({, IIR(Q, If M(r) = maxlpI=rIIR(Q, then W)IIuniformly. W)II,

1/11011]2

< r'M(r)(1 + IIWII where AmO 0 as m -> oo. Certainly,

Am),

IIOIIN IIOIIN/IIOII)

1/11011)2

(1

IIWIII ? r'M(r).
for m2 Fix 6 > 0 suchthat (1 + 6)r < A. Takem' suchthat
max(mo,

from theassumptions. ThenEWN -- 0 follows A. 1, it followsthatEIION(Y; m, 1) Using Proposition E II,N(X,; m, 1)+,(x, that N)II- 0 and, in consequence,
-

0(y; m, ; m, 1)112

ms), Am ' (1 + 6)r. Then

A IIWII/II and
ll'l li'<

1/(1 + 6))'M(r)(l + Am)


1/ 1+
6))'M(r) -

0.

In function space, define


P)m'Q = 0 - Tm0 Um= x

Now choosea newmoand 10suchthat (A. 17) is satisfied. Using(A.17),


ul 00 where 8m,i

Then 0(; m, 1) = Um 0lIUn that in is The laststep theproof showing


- 0*11 -| 0 ||UM00I/11UM00II

P(m)00

A.2. as m, I go to infinity. BeginwithProposition norm. - U0II = IlPyTmPx0ll Now on H2(Y), _ IlTmPx0ll. Proof. llUmO = 1 such that1T1'PX0mll ? 6, all -O 0. If not, take 0mg 110mll IlTmPxIl s m. Let Om,'4 0; then PX0m PxOand
+ IITmPx0II - 0)11 IITm'Px(0m, JlTm'PXOm,ll - 0)11+ IlTm'PxOll. C llPx(0rm Proposition A.2. As m oo, Um-

IIU|| 00H

0 as m, I -l oo. Thus U10 m00* =1


m,I

PE 0O IIPEm1l

PE-00
IIPE-0011

side goes to zeroas m, I U in the uniformoperator and theright

oo. is usedabovebecausewe havein mind The term weakconsistency We conjecture that for reasonable result. a desirable stronger smooths, the set CN = {(Y1,Xl), . . ., (YN,XN); algorithm converges} satisfies froma fixed 00, P(CN) --+1 and thatfor 0N, the limiton CN starting
E[ICNII0N

0N]

0.

will be difficult to prove.A We also conjecture thatsucha theorem wouldbe to assumetheuse mucheasierresult butprobably weaker, side goes to zero. By Proposition (5.5) theright-hand with matrix definite smooths ofself-adjoint non-negative non-negative to some that the we know elements. Then algorithm converges ON, butit is compact. is notnecessarily The operator self-adjoint, Um 0 we that and E[II0N conjecture 0*N] sp(U), By Proposition (A.2), if 0(sp(U)) is anyopen setcontaining C 0(sp(U)). Suppose,forsimthen form sufficiently large,sp(Um) A.5 Mean Squared Consistencyof Nearest to thelargest that theeigenspace eigenvalue EA corresponding plicity, NeighborSmooths ifE, is highergoes through (The proof i of U is one-dimensional. we is applicablein a situation, To show thattheACE algorithm Thenforanyopen neighbutit is morecomplicated.) dimensional, ofTheorem that theassumptions (A.2) can be satisfied. is onlyone eigenvalue needto verify 0 of A,andm sufficiently borhood large,there that thedata (Y,, X), . assuming (YN, XN)are P(m) of Um of Um in 0, )m correspondingWe do this,first Lrn> s, and theprojection Am from a two-dimensional stationary, ergodic process. Thenthe in theuniform Moreover, samples operator topology. to PEA to )r converges and, -l 11012 impliesthatforany 0 E L2(Y), 11011k value. ergodictheorem having largest absolute of Urn as theeigenvalue 'ir can be taken E~I0ISI >~ 1lOW of U and 4m is theeigenvalue trivially, eigenvalue If iL' is thesecondlargest linearsequenceof smooths To show thatwe can get a bounded, absolute value, then(assuming of Urn havingthesecondhighest E,~ smooths. that aremeansquared consistent, we use thenearest neighbor is one-dimensional) 4m > A'.
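The alternating iteration that this appendix analyzes can be illustrated with a small numerical sketch. This is not the authors' implementation: the k-nearest-neighbor smoother below, the function names, and all parameter choices are our own assumptions, standing in for the generic smooth S.

```python
import numpy as np

def knn_smooth(z, x, k=20):
    """Average z over the k nearest rank-neighbors of each x on either side;
    a crude stand-in for the nearest neighbor smooth discussed in the text."""
    order = np.argsort(x)
    rank = np.empty(len(x), dtype=int)
    rank[order] = np.arange(len(x))
    z_sorted = z[order]
    out = np.empty(len(x))
    for i in range(len(x)):
        lo = max(0, rank[i] - k)
        hi = min(len(x), rank[i] + k + 1)
        out[i] = z_sorted[lo:hi].mean()
    return out

def ace_bivariate(y, x, n_iter=20, k=20):
    """Bivariate ACE iteration (sketch): alternately smooth theta(y) on x to
    get phi(x), then smooth phi(x) on y and renormalize to get theta(y)."""
    theta = (y - y.mean()) / y.std()
    for _ in range(n_iter):
        phi = knn_smooth(theta, x, k)           # phi <- estimate of E[theta | x]
        theta = knn_smooth(phi, y, k)           # theta <- estimate of E[phi | y]
        theta = (theta - theta.mean()) / theta.std()
    phi = knn_smooth(theta, x, k)
    return theta, phi
```

On simulated data with y = x² plus noise, the iteration recovers transformations whose sample correlation approaches the maximal correlation, even though the recovered phi is far from monotone.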


Theorem A.3. Let (Y₁, X₁), . . . , (Y_N, X_N) be samples from a stationary ergodic process such that the distribution of X has no atoms. Then there exists a mean squared consistent sequence of nearest neighbor smooths of Y on X.

The proof begins with Lemma A.1.

Lemma A.1. Suppose that P(dx) has no atoms, and let P_N(dx) ⇒ P(dx). Take δ_N > 0, δ_N → δ > 0; define J(x; ε) = [x − ε, x + ε],

ε_N(x) = min{ε; P_N(J(x, ε)) ≥ δ_N},   ε(x) = min{ε; P(J(x, ε)) ≥ δ}.

Then

P_N(J(x, ε_N(x)) Δ J(x, ε(x))) → 0 uniformly in x   (A.18)

and

lim sup_N sup_{x,y: |x−y|≤h} P_N(J(x, ε(x)) Δ J(y, ε(y))) ≤ ω(h),   (A.19)

where Δ denotes symmetric difference and ω(h) → 0 as h → 0.

Proof. Let F_N(x), F(x) be the cumulative df's corresponding to P_N, P. Since F_N ⇒ F and F is continuous, it follows that sup_x |F_N(x) − F(x)| → 0. To prove (A.18), note that the two intervals are concentric, so one contains the other, and

P_N(J(x, ε_N(x)) Δ J(x, ε(x))) = |P_N(J(x, ε_N(x))) − P_N(J(x, ε(x)))| ≤ |δ_N − δ| + |F_N(x + ε(x)) − F(x + ε(x))| + |F_N(x − ε(x)) − F(x − ε(x))|,

which does it. To prove (A.19), it is sufficient to show that

sup_{x,y: |x−y|≤h} P(J(x, ε(x)) Δ J(y, ε(y))) ≤ ω(h).

First, note that |ε(x) − ε(y)| ≤ |x − y|. If J(x, ε(x)) and J(y, ε(y)) overlap, then their symmetric difference consists of two intervals I₁, I₂ such that |I₁| ≤ 2|x − y|, |I₂| ≤ 2|x − y|. There is an h₀ > 0 such that if |x − y| ≤ h₀, the two neighborhoods always overlap. Otherwise there is a sequence {x_n} with ε(x_n) → 0 and P(J(x_n, ε(x_n))) = δ, which is impossible, since P has no atoms. Then for h ≤ h₀,

sup_{x,y: |x−y|≤h} P(J(x, ε(x)) Δ J(y, ε(y))) ≤ 2 sup_{|I|≤2h} P(I),

and the right-hand side goes to zero as h → 0.

The lemma is applied as follows: Let g(y) be any bounded function in L²(Y). Using I(·) to denote the indicator function, define

P_δ(g | x) = (1/δ) ∫ g(y) I(x′ ∈ J(x, ε(x))) P(dy, dx′) = (1/δ) ∫ P_X(g | x′) I(x′ ∈ J(x, ε(x))) P(dx′).

Note that P_δ(g | ·) is bounded and continuous in x. Denote by S_N^δ the smooths

S_N^δ(g | x) = (1/[δN]) Σ_j g(y_j) I(x_j ∈ J(x, ε_N(x))),

that is, nearest neighbor smooths with M = [δN].

Proposition A.3. E||S_N^δ g − P_δ g||²_N → 0 for fixed δ.

Proof. By (A.18), with probability one, S_N^δ(g | x) can be replaced for all x by

g_N(x, ω) = (1/[δN]) Σ_j g(y_j) I(x_j ∈ J(x, ε(x))),

where ω is a sample sequence. By the ergodic theorem, for a countable {x_n} dense on the real line, there is W′ with P(W′) = 1 such that for each x_n and ω ∈ W′,

ζ_N(x_n, ω) = g_N(x_n, ω) − P_δ(g | x_n) → 0.

Use (A.19) to establish that for any bounded interval J and any ω ∈ W′, ζ_N(x, ω) → 0 uniformly for x ∈ J. Then write

(1/N) Σ_k ζ²_N(x_k, ω) ≤ (1/N) Σ_k ζ²_N(x_k, ω) I(x_k ∈ J) + c (1/N) Σ_k I(x_k ∉ J).

The first term is bounded and goes to zero for ω ∈ W′; hence its expectation goes to zero. The expectation of the second term is bounded by cP(X ∉ J). Since J can be taken arbitrarily large, this completes the proof.

Using the inequality

E||S_N^δ g − P_X g||²_N ≤ 2E||S_N^δ g − P_δ g||²_N + 2||P_δ g − P_X g||²

gives

lim sup_N E||S_N^δ g − P_X g||²_N ≤ 2||P_δ g − P_X g||².

Proposition A.4. For any φ(x) ∈ L²(X), lim_{δ→0} ||P_δ φ − P_X φ|| = 0.

Proof. For φ bounded and continuous,

P_δ(φ | x) = (1/δ) ∫ φ(x′) I(x′ ∈ J(x, ε(x))) P(dx′) → φ(x)   as δ → 0

for every x. Since sup_x |P_δ(φ | x)| ≤ c for all δ, then ||P_δ φ − φ|| → 0. The proposition follows if it can be shown that for every φ ∈ L²(X), lim sup_δ ||P_δ φ|| < ∞. But, by the Cauchy-Schwarz inequality and P(J(x, ε(x))) = δ,

||P_δ φ||² = ∫ [(1/δ) ∫ φ(x′) I(x′ ∈ J(x, ε(x))) P(dx′)]² P(dx) ≤ ∫ [(1/δ) ∫ φ²(x′) I(x′ ∈ J(x, ε(x))) P(dx′)] P(dx).

Suppose that x′ is such that there are numbers ε₊, ε₋ with P([x′, x′ + ε₊]) = δ and P([x′ − ε₋, x′]) = δ. Then x′ ∈ J(x, ε(x)) implies x ≤ x′ + ε₊ and x ≥ x′ − ε₋, so that

(1/δ) ∫ I(x′ ∈ J(x, ε(x))) P(dx) ≤ 2.   (A.20)

If, say, P([x′, ∞)) < δ, then x ≥ x′ − ε₋ and (A.20) still holds, and similarly if P((−∞, x′]) < δ. By Fubini's theorem, then, ||P_δ φ||² ≤ 2||φ||².

Take {θ_n} to be a countable set of functions dense in L²(Y). By Propositions A.3 and A.4, for any ε > 0 we can select δ(ε, n) and N(δ, n) so that

E||S_N^δ θ_n − P_X θ_n||²_N ≤ ε   for δ ≤ δ(ε, n), N ≥ N(δ, n).

Let ε_M ↓ 0 as M → ∞; define δ_M = min_{n≤M} δ(ε_M, n) and N(M) = max_{n≤M} N(δ_M, n). Then

E||S_N^{δ_M} θ_n − P_X θ_n||²_N ≤ ε_M   for n ≤ M, N ≥ N(M).

Put M(N) = max{M; N ≥ max(M, N(M))}. Then M(N) → ∞ as N → ∞, and the sequence of smooths S_N = S_N^{δ_{M(N)}} is mean squared consistent for all θ_n. Noting that for θ ∈ L²(Y),

E||S_N θ − P_X θ||²_N ≤ 3E||S_N θ_n − P_X θ_n||²_N + 9||θ − θ_n||²

completes the proof of the theorem.

The fact that ACE uses modified smooths S̃_N g = S_N g − Av(S_N g) and functions g such that Eg = 0 causes no problems, since, in the notation of Proposition A.3, Av(S_N g) → 0 almost surely.
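The window construction of Lemma A.1 can be sketched directly. The helper below is our own illustrative code (names and tie-handling are assumptions): it finds the smallest symmetric interval J(x0, eps) that captures at least a fraction delta of the sample x's and averages the g-values falling in it, in the spirit of the smooth S_N^δ of Proposition A.3.

```python
import bisect

def nn_smooth(xs, gs, x0, delta):
    """Evaluate a nearest neighbor smooth at x0: average the g-values whose
    x falls in the smallest window [x0 - eps, x0 + eps] holding at least a
    fraction delta of the sample (illustration of Lemma A.1's construction)."""
    n = len(xs)
    m = max(1, int(delta * n))                    # required occupancy, about [delta*N]
    pairs = sorted(zip(xs, gs))
    sx = [p[0] for p in pairs]
    sg = [p[1] for p in pairs]
    eps = sorted(abs(x - x0) for x in sx)[m - 1]  # smallest eps capturing >= m points
    lo = bisect.bisect_left(sx, x0 - eps)
    hi = bisect.bisect_right(sx, x0 + eps)
    return sum(sg[lo:hi]) / (hi - lo)
```

As delta shrinks with N large, this average approaches the conditional expectation, which is the content of Propositions A.3 and A.4.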


Assume g is bounded, and write

Av(S_N g) = (1/N) Σ_k ζ_N(x_k, ω) + (1/N) Σ_k P_δ(g | x_k).

By the ergodic theorem, the second term goes a.s. to EP_δ(g | X), and an argument mimicking the first part of the proof of Proposition A.3 shows that the first term goes to zero a.s. Finally, write

|EP_δ(g | X)| = |EP_δ(g | X) − EP_X g| ≤ ||P_δ g − P_X g||,

since EP_X g = Eg = 0. Thus, Theorem A.3 can be easily changed to account for modified smooths.

In the controlled experiment situation, the {X_k} are not random, but the condition P_N(dx) ⇒ P(dx) is imposed. Additional assumptions are necessary.

Assumption A.1. For θ(Y) any bounded function in L²(Y), E(θ(Y) | X = x) is continuous in x.

Assumption A.2. For i ≠ j and φ(x) any bounded continuous function, E(φ(X_i) | X_j = x) is continuous in x.

A necessary result is Proposition A.5.

Proposition A.5. For θ(y) bounded in L²(Y) and φ(x) bounded and continuous,

(1/N) Σ_{j=1}^N θ(y_j)φ(x_j) → Eθ(Y)φ(X)   a.s.

Proof. Let T_N = Σ_j θ(y_j)φ(x_j). Then ET_N = Σ_j g(x_j)φ(x_j), where g(x) = E[θ(Y) | X = x]. By hypothesis, ET_N/N → Eθ(Y)φ(X). Furthermore,

var(T_N) = Σ_j E[θ(y_j) − g(x_j)]² φ²(x_j) = Σ_j h(x_j)φ²(x_j),

where h(x) = E[(θ(Y) − g(X))² | X = x]. Since hφ² is continuous and bounded, var(T_N)/N → Eh(X)φ²(X). Now the application of Kolmogorov's exponential bound gives

T_N/N − ET_N/N → 0   a.s.,

proving the proposition.

If the data are iid, stronger results can be obtained. For instance, mean squared consistency can be proven for a modified regression smooth similar to the supersmoother. For x any point, let J(x) be the indexes of the M points in {X_k} directly above x plus the M directly below. If there are only M′ < M above (below), then include the M + (M − M′) directly below (above). For a regression smooth,

S(θ | x) = θ̄_x + [Γ_x(θ, x)/σ²_x](x − x̄_x),   (A.21)

where θ̄_x, x̄_x are the averages of θ(y_j), x_j over the indexes in J(x), and Γ_x(θ, x), σ²_x are the covariance between θ(y_k), x_k and the variance of x_k over the indexes in J(x). Write the second term in (A.21) as

[Γ_x(θ, x)/σ_x][(x − x̄_x)/σ_x].

If there are M points above and M below in J(x), it is not hard to show that |(x − x̄_x)/σ_x| ≤ 1. This is not true near the endpoints, where (x − x̄_x)/σ_x can become arbitrarily large as M gets large. This endpoint behavior keeps the regression smooth from being uniformly bounded. To remedy this, define

[x]₁ = x,   |x| ≤ 1,
    = sign(x),   |x| > 1,

and define the modified regression smooth by

S(θ | x) = θ̄_x + [Γ_x(θ, x)/σ_x][(x − x̄_x)/σ_x]₁.   (A.22)

This modified smooth is bounded by 2 max_j |θ(y_j)|.

Theorem A.4. If, as N → ∞, M → ∞, M/N → 0, and P(dx) has no atoms, then the modified regression smooths are mean squared consistent.

The proof is in Breiman and Friedman (1982). We are almost certain that the modified regression smooths are also mean squared consistent for stationary ergodic time series, and in the weaker sense for controlled experiments, but under less definitive conditions on the rates at which M → ∞.
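The endpoint-clipped smooth (A.22) can be sketched in a few lines. The window here is simplified to the 2M sample points nearest x rather than the exact above/below bookkeeping of J(x), and all names are our own; this is an illustration, not the authors' implementation.

```python
def modified_smooth(xs, ts, x0, M):
    """Local linear smooth of t = theta(y) on x with the [.]_1 clip of (A.22),
    which keeps the estimate bounded near the endpoints of the data."""
    n = len(xs)
    idx = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:min(2 * M, n)]
    xbar = sum(xs[i] for i in idx) / len(idx)
    tbar = sum(ts[i] for i in idx) / len(idx)
    cov = sum((xs[i] - xbar) * (ts[i] - tbar) for i in idx) / len(idx)
    sd = (sum((xs[i] - xbar) ** 2 for i in idx) / len(idx)) ** 0.5
    if sd == 0.0:
        return tbar
    z = (x0 - xbar) / sd
    z = max(-1.0, min(1.0, z))   # the clip [z]_1: identity on [-1, 1], sign outside
    return tbar + (cov / sd) * z
```

For interior points the clip is inactive and the usual local linear fit is returned; at the boundary the slope contribution is capped, so the output cannot exceed twice the largest |t| in magnitude.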

In Theorem A.2 we add the restriction that θ₀ be a bounded function in L²(Y). Then the condition on θ may be relaxed to the following: for θ any bounded function in L²(Y), E||θ||²_N → ||θ||². Furthermore, because of Assumptions A.1 and A.2, mean squared consistency of the smooths can be relaxed to the following requirements.

Assumption A.3. For i ≠ j and every bounded continuous function φ(x_i), E||S_j φ − P_j φ||²_N → 0.

Assumption A.4. For every bounded function θ(y) in L²(Y), E||S θ − P θ||²_N → 0.

Assumption A.5. For every bounded continuous function φ(x_i), E||S φ − P φ||²_N → 0.

The existence of sequences of nearest-neighbor smooths satisfying Assumptions A.3, A.4, and A.5 can be proven in a fashion similar to the proof of Theorem A.3. Assumption A.3 is proven using Proposition A.4. Assumptions A.4 and A.5 require Proposition A.5 in addition.

APPENDIX B: VARIABLES USED IN THE HOUSING VALUE EQUATION OF HARRISON AND RUBINFELD (1978)

MV - median value of owner-occupied homes
RM - average number of rooms in owner units
AGE - proportion of owner units built prior to 1940
DIS - weighted distances to five employment centers in the Boston region
RAD - index of accessibility to radial highways
TAX - full property tax rate ($/$10,000)
PTRATIO - pupil-teacher ratio by town school district
B - black proportion of population
LSTAT - proportion of population that is lower status
CRIM - crime rate by town
ZN - proportion of town's residential land zoned for lots greater than 25,000 square feet
INDUS - proportion of nonretail business acres per town
CHAS - Charles River dummy = 1 if tract bounds the Charles River, 0 otherwise
NOX - nitrogen oxide concentration in pphm

APPENDIX C: VARIABLES USED IN THE OZONE-POLLUTION EXAMPLE

UPO3 - Upland ozone concentration (ppm)
SBTP - Sandburg Air Force Base temperature (C°)
IBHT - inversion base height (ft.)
DGPG - Daggett pressure gradient (mm hg)
VSTY - visibility (miles)
VDHT - Vandenburg 500 millibar height (m)
HMDT - humidity (percent)
IBTP - inversion base temperature (F°)
WDSP - wind speed (mph)


[Received July 1982. Revised August 1984.]

REFERENCES

Anscombe, F. J., and Tukey, J. W. (1963), "The Examination and Analysis of Residuals," Technometrics, 5, 141-160.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics, New York: John Wiley.
Box, G. E. P., and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Ser. B, 26, 211-252.
Box, G. E. P., and Hill, W. J. (1974), "Correcting Inhomogeneity of Variance With Power Transformation Weighting," Technometrics, 16, 385-389.
Box, G. E. P., and Tidwell, P. W. (1962), "Transformations of the Independent Variables," Technometrics, 4, 531-550.
Breiman, L., and Friedman, J. (1982), "Estimating Optimal Transformations for Multiple Regression and Correlation," Technical Report 9, University of California, Berkeley, Dept. of Statistics.
Cleveland, W. S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 828-836.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 317-403.
Csaki, P., and Fischer, J. (1963), "On the General Notion of Maximal Correlation," Magyar Tudomanyos Akademia, Matematikai Kutato Intezet, Kozlemenyei, 8, 27-51.
DeBoor, C. (1978), A Practical Guide to Splines, New York: Springer-Verlag.
De Leeuw, J., Young, F. W., and Takane, Y. (1976), "Additive Structure in Qualitative Data: An Alternating Least Squares Method With Optimal Scaling Features," Psychometrika, 41, 471-503.
Devroye, L. (1981), "On the Almost Everywhere Convergence of Nonparametric Regression Function Estimates," The Annals of Statistics, 9, 1310-1319.
Devroye, L., and Wagner, T. J. (1980), "Distribution-Free Consistency Results in Nonparametric Discrimination and Regression Function Estimation," The Annals of Statistics, 8, 231-239.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, 7, 1-26.
Fraser, D. A. S. (1967), "Data Transformations and the Linear Model," Annals of Mathematical Statistics, 38, 1456-1465.
Friedman, J. H., and Stuetzle, W. (1982), "Smoothing of Scatterplots," Technical Report ORION006, Stanford University, Dept. of Statistics.
Gasser, T., and Rosenblatt, M. (eds.) (1979), Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, No. 757, New York: Springer-Verlag.
Gebelein, H. (1947), "Das Statistische Problem der Korrelation als Variations- und Eigenwertproblem und Sein Zusammenhang mit der Ausgleichungsrechnung," Zeitschrift fuer Angewandte Mathematik und Mechanik, 21, 364-379.
Harrison, D., and Rubinfeld, D. L. (1978), "Hedonic Housing Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, 5, 81-102.
Kendall, M. G., and Stuart, A. (1967), The Advanced Theory of Statistics (Vol. 2), New York: Hafner Publishing.
Kimeldorf, G., May, J. H., and Sampson, A. R. (1982), "Concordant and Discordant Monotone Correlations and Their Evaluations by Nonlinear Optimization," in Optimization in Statistics: Studies in the Management Sciences (19), eds. S. H. Zanakis and J. S. Rustagi, Amsterdam: North-Holland, pp. 117-130.
Kruskal, J. B. (1964), "Nonmetric Multidimensional Scaling: A Numerical Method," Psychometrika, 29, 115-129.
Kruskal, J. B. (1965), "Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data," Journal of the Royal Statistical Society, Ser. B, 27, 251-263.
Lancaster, H. O. (1958), "The Structure of Bivariate Distributions," Annals of Mathematical Statistics, 29, 719-736.
Lancaster, H. O. (1969), The Chi-Squared Distribution, New York: John Wiley.
Lindsey, J. K. (1972), "Fitting Response Surfaces With Power Transformations," Journal of the Royal Statistical Society, Ser. C, 21, 234-237.
Lindsey, J. K. (1974), "Construction and Comparison of Statistical Models," Journal of the Royal Statistical Society, Ser. B, 36, 418-425.
Mosteller, F., and Tukey, J. W. (1977), Data Analysis and Regression, Reading, MA: Addison-Wesley.
Renyi, A. (1959), "On Measures of Dependence," Acta Mathematica Academiae Scientiarum Hungaricae, 10, 441-451.
Sarmanov, O. V. (1958a), "The Maximal Correlation Coefficient (Symmetric Case)," Doklady Akademii Nauk SSSR, 120, 715-718.
Sarmanov, O. V. (1958b), "The Maximal Correlation Coefficient (Nonsymmetric Case)," Doklady Akademii Nauk SSSR, 121, 52-55.
Sarmanov, O. V., and Zaharov, V. K. (1960), "Maximum Coefficients of Multiple Correlation," Doklady Akademii Nauk SSSR, 130, 269-271.
Spiegelman, C., and Sacks, J. (1980), "Consistent Window Estimation in Nonparametric Regression," The Annals of Statistics, 8, 240-246.
Stone, C. J. (1977), "Consistent Nonparametric Regression," The Annals of Statistics, 5, 595-620.
Tukey, J. W. (1982), "The Use of Smelting in Guiding Re-Expression," in Modern Data Analysis, eds. J. Launer and A. Siegel, New York: Academic Press.
Wood, J. T. (1974), "An Extension of the Analysis of Transformations of Box and Cox," Journal of the Royal Statistical Society, Ser. C, 23, 278-283.
Young, F. W., de Leeuw, J., and Takane, Y. (1976), "Regression With Qualitative and Quantitative Variables: An Alternating Least Squares Method With Optimal Scaling Features," Psychometrika, 41, 505-529.

Comment
DARYL PREGIBON and YEHUDA VARDI*

In data analysis, the choice of transformations is often done subjectively. ACE is a major attempt to bring objectivity to this area. As Breiman and Friedman have demonstrated with their examples, and as we have experienced with our own, ACE is a powerful tool indeed. Our comments are sometimes critical in nature and reflect our view that there is much more to be done on the subject. We consider the methodology a significant contribution to statistics, however, and would like to compliment the authors for attacking an important problem, for narrowing the gap between mathematical statistics and data analysis, and for providing the data analyst with a useful tool.

* Daryl Pregibon and Yehuda Vardi are Members of Technical Staff, AT&T Bell Laboratories, Murray Hill, NJ 07974.

1. ACE IN THEORY: HOW MEANINGFUL IS MAXIMAL CORRELATION?

To keep our discussion simple we limit it here to the bivariate case, though the issues that we raise are equally relevant to the general case. The basis of ACE lies in the properties of maximal correlation.

© 1985 American Statistical Association
Journal of the American Statistical Association
September 1985, Vol. 80, No. 391, Theory and Methods
