
Statistical Science
1997, Vol. 12, No. 4, 279–300

The Gaussian Hare and the Laplacian Tortoise: Computability of Squared-Error versus Absolute-Error Estimators

Stephen Portnoy and Roger Koenker

Stephen Portnoy is Professor, Department of Statistics, University of Illinois at Urbana-Champaign, 101 Illini Hall, 725 S. Wright Street, Champaign, Illinois 61820 (e-mail: portnoy@stat.uiuc.edu). Roger Koenker is Professor, Departments of Economics and Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois 61820 (e-mail: roger@ysidro.econ.uiuc.edu).

Abstract. Since the time of Gauss, it has been generally accepted that $\ell_2$-methods of combining observations by minimizing sums of squared errors have significant computational advantages over earlier $\ell_1$-methods based on minimization of absolute errors advocated by Boscovich, Laplace and others. However, $\ell_1$-methods are known to have significant robustness advantages over $\ell_2$-methods in many applications, and related quantile regression methods provide a useful, complementary approach to classical least-squares estimation of statistical models. Combining recent advances in interior point methods for solving linear programs with a new statistical preprocessing approach for $\ell_1$-type problems, we obtain a 10- to 100-fold improvement in computational speeds over current (simplex-based) $\ell_1$-algorithms in large problems, demonstrating that $\ell_1$-methods can be made competitive with $\ell_2$-methods in terms of computational speed throughout the entire range of problem sizes. Formal complexity results suggest that $\ell_1$-regression can be made faster than least-squares regression for n sufficiently large and p modest.

Key words and phrases: $\ell_1$, $L_1$, least absolute deviations, median, regression quantiles, interior point, statistical preprocessing, linear programming, simplex method, simultaneous confidence bands.

1. INTRODUCTION

Although $\ell_1$-methods of estimation, which minimize sums of absolute residuals, have a long history in statistical applications, there is still some reluctance to adopt them for the analysis of large datasets because they are regarded as computationally highly demanding. In particular, the simplex algorithm of linear programming that is the mainstay of modern $\ell_1$-computation has acquired a reputation as unwieldy in large problems. This reputation may be partially attributed to theoretical results on worst-case performance of the simplex algorithm, which establish that for certain pathological problems the number of simplex iterations required for a solution can increase exponentially with problem size. However, over the range of problem dimensions typically encountered in statistical practice, the simplex algorithm performs extremely well for problems of moderate size. Up to a few hundred observations, $\ell_1$-regression via the simplex algorithm is actually faster, for example, than conventional $\ell_2$-regression in the standard implementations provided by S-PLUS. However, for problems exceeding a few thousand observations, current implementations of simplex begin to live up to their slothful theoretical reputation.

Nevertheless, interest in the application of $\ell_1$-estimation methods, and quantile regression more generally, in large-scale data analysis has grown steadily in recent years. Applications of quantile regression (Koenker and Bassett, 1978; Powell, 1986) in economics to problems with sample sizes in the range 10,000–100,000 are now almost routine. See, for example, Buchinsky (1994, 1995), Chamberlain (1994) and Manning, Blumberg and Moulton (1995). Interest in bootstrapped inference
in such applications makes the need for efficient computational methods acute. Nonparametric quantile regression using local polynomials (Welsh, 1996; Fan and Gijbels, 1996) or splines (Koenker, Ng and Portnoy, 1994; Green and Silverman, 1994) has also stimulated the demand for more efficient $\ell_1$-computation. Chen and Donoho (1995) and Tibshirani (1996) have recently proposed application of $\ell_1$-penalties as model selection devices for a broad range of applications including image processing. Finally, there has been considerable interest in multivariate analysis in problems like the Oja (1983) median, which can be formulated as $\ell_1$-regression problems with pseudo-observations constructed as U-statistics from an initial sample. See Chaudhuri (1992) for a development of this approach. Taken together, these developments strongly motivate the search for improved methods of computing $\ell_1$-type estimators when n is large.

Fig. 1. The Gaussian Hare and the Laplacian Tortoise: this picture is a slightly "retouched" version of a wood engraving by J. J. Grandville from "Fables de La Fontaine" (published in Paris, 1838). The portrait of Gauss is taken from an 1803 portrait by J. C. A. Schwartz. The portrait of Laplace appears in "Cauchy: Un Mathématicien Légitimiste au XIXe Siècle," by Bruno Belhoste (Belin, Paris).

Following Karmarkar (1984), there has also been intense interest among numerical analysts in alternative "interior point" methods for solving linear programs (LP's). Rather than moving from vertex to vertex around the outer surface of the constraint set as dictated by simplex, the interior point approach solves a sequence of quadratic problems in which the relevant interior of the constraint set is approximated by an ellipsoid. This approach, as shown by Karmarkar and subsequent authors, provides demonstrably better worst-case performance than the simplex algorithm and has also demonstrated impressive practical performance on a broad range of large-scale linear programs arising in commerce as well as in extensive numerical trials.

After a brief historical introduction to $\ell_1$-computation, we compare recent interior point methods to existing simplex-based methods. We find, quite in accordance with recent literature on more general LP's, that the interior point approach is competitive with simplex in moderate-sized problems (say, n up to 1,000) and exhibits a rapidly increasing advantage over simplex for larger problems. We then propose a new form of statistical preprocessing for general quantile regression problems that also has the effect of dramatically reducing the computational burden. This preprocessing step is somewhat reminiscent of the subsampling approach in earlier $O(n)$ univariate quantile algorithms like that of Floyd and Rivest (1975). Taken
together, the preprocessing step and a careful choice of interior point versus simplex yields an algorithm that is 10 to 100 times faster than current simplex methods for a variety of test problems with sample sizes in the range 10,000–200,000. In practice, as we will see, combining the preprocessing step and interior point methods yields $\ell_1$-computations which rival the speeds achievable with current $\ell_2$-methods over the entire range of problem dimensions. In theory, the results of Section 6 imply that $\ell_1$-computations can be made strictly faster than their $\ell_2$ counterparts for problems with n sufficiently large.

2. INTRODUCTION TO $\ell_1$-COMPUTATION

In 1760, the Croatian Jesuit Roger Boscovich, while on a visit to London, posed the following problem to Thomas Simpson:

Let there be any number of quantities a, b, c, d, e, all given, and let it be required to find corrections to be applied to them under these conditions:
1. that their differences may be in a given ratio;
2. that the sum of the positive corrections may be equal to the sum of the negative;
3. that the sum of the positive and sum of the negative corrections may be a minimum.

In modern notation the problem may be formulated as follows: find $\hat\alpha, \hat\beta$ such that

$y_i = \hat\alpha + \hat\beta x_i + \hat u_i$

for given observations $(y_i, x_i)$, $i = 1, \dots, n$; the "corrections" $\hat u_i$ satisfy

$\sum \hat u_i = 0$

and

$\sum |\hat u_i| = \min!$

Clearly the differences in the corrected observations

$\hat y_i - \hat y_j = \hat\beta (x_i - x_j)$

would then satisfy the first requirement that they were in given ratios, determined solely by the x's. Stigler (1984), in his lively commentary on this exchange, concludes that Simpson made some progress on the problem, recognizing, for example, that the solution should pass through the point $(\bar x, \bar y)$ and one data point. Boscovich provided a geometric solution, but the problem does not seem to have been fully resolved in print until the publication of Laplace (1789), who recognized (see Stigler, 1986) that one could obtain the slope estimate $\hat\beta$ by computing a weighted median of the candidate slopes $s_i = (y_i - \bar y)/(x_i - \bar x)$, $i = 1, 2, \dots, n$. More explicitly, let $s_{(i)}$ denote the ordered $s_i$'s and let $w_{(i)}$ be the associated weights, $w_i = |x_i - \bar x|$, ordered according to the $s_i$'s. Then $\hat\beta = s_{(m)}$, where

$m = \min\Bigl\{ j : \sum_{i=1}^{j} w_{(i)} \ge \frac{1}{2} \sum_{i=1}^{n} |w_{(i)}| \Bigr\}.$
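Laplace's weighted-median solution is easy to state in code. The following sketch is ours (in NumPy, not the authors' S-PLUS/Fortran environment); the function name and the assumption that all $x_i \ne \bar x$ are illustrative only.

    import numpy as np

    def weighted_median_slope(x, y):
        # Candidate slopes s_i = (y_i - ybar)/(x_i - xbar), weights w_i = |x_i - xbar|.
        xbar, ybar = x.mean(), y.mean()
        s = (y - ybar) / (x - xbar)
        w = np.abs(x - xbar)
        order = np.argsort(s)            # order the s_i, carrying the weights along
        s, w = s[order], w[order]
        # m = min{ j : sum_{i<=j} w_(i) >= (1/2) sum |w_(i)| }
        m = np.searchsorted(np.cumsum(w), 0.5 * w.sum())
        return s[m]

The fitted line then passes through $(\bar x, \bar y)$ with slope $s_{(m)}$, consistent with Simpson's observation that the solution interpolates the mean point and one observation.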
With the advent of least squares at the end of the 18th century, Boscovich's prototype $\ell_1$-estimator faded into obscurity. It was revived more than a century later by Edgeworth (1887), who, like Laplace earlier, argued that it could deliver better estimates when the required "corrections" did not happen to follow the Gaussian law. Subsequently, Edgeworth (1888) discarded Boscovich's second constraint that the residuals sum to zero, and proposed to minimize the sum of absolute residuals with respect to both intercept and slope parameters, calling this his "double median" method. He noted that this approach could be extended, in principle, to a "plural median" method. A geometric algorithm was given for the bivariate case, and a discussion of conditions under which such median methods were preferable to least-squares methods was also provided. Unfortunately, the geometric approach to computing Edgeworth's new median regression estimator was rather awkward, requiring, as he admitted, "the attention of a mathematician, and in the case of many unknowns some power of hypergeometrical conception" (Edgeworth, 1888, page 190).

Only with the emergence of the simplex algorithm for linear programming in the late 1940s did $\ell_1$-methods become practical on a large scale. Papers by Charnes, Cooper and Ferguson (1955), Wagner (1959) and others provided a foundation for modern implementations, such as Barrodale and Roberts (1974) and Bartels and Conn (1980). See Bloomfield and Steiger (1983) for an extensive discussion of the algorithmic development of $\ell_1$-methods, including some very interesting empirical comparisons of the performance of several competing algorithms.

The simplex approach to solving the general $\ell_1$-regression problem

(2.1)   $\min_{b \in \mathbb{R}^p} \sum_{i=1}^{n} |y_i - x_i' b|$

relies on the reformulation as the linear program

(2.2)   $\min\{ e'u + e'v \mid y = Xb + u - v,\ (u, v) \in \mathbb{R}_+^{2n} \}.$

Here e denotes an n-vector of ones. This problem has the dual formulation

(2.3)   $\max\{ y'd \mid X'd = 0,\ d \in [-1, 1]^n \},$
or, equivalently, setting $a = (d + e)/2$,

(2.4)   $\max\{ y'a \mid X'a = \tfrac{1}{2} X'e,\ a \in [0, 1]^n \}.$

A p-element subset of $N = \{1, 2, \dots, n\}$ will be denoted by h, and $X(h), y(h)$ will denote the submatrix and subvector of $X, y$ with the corresponding rows and elements identified by h. Recognizing that solutions of (2.1) may be characterized as planes which pass through precisely $p = \dim(b)$ observations, or as convex combinations of such "basic" solutions, we can begin with any such solution, which we may write as

(2.5)   $b(h) = X(h)^{-1} y(h).$

We may regard any such "basic" primal solution as an extreme point of the polyhedral convex constraint set. A natural algorithmic strategy is then to move to the adjacent vertex of the constraint set in the direction of steepest descent. This transition involves two stages: the first chooses a descent direction by considering the removal of each of the current basic observations and computing the gradient in the resulting direction; then, having selected the direction of steepest descent and thus an observation to be removed from the currently active "basic" set, we must find the maximal step length in the chosen direction by searching over the remaining $n - p$ available observations for a new element to introduce into the "basic" set. Each of these transitions involves an elementary "simplex pivot" matrix operation to update the current basis. The iteration continues in this manner until no descent direction can be found, at which point the current $b(h)$ can be declared optimal.
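Formulation (2.2) can be handed directly to any general-purpose LP solver. The sketch below is ours and purely illustrative: scipy's linprog stands in for the special-purpose simplex and interior point codes discussed in this paper, so no timing conclusions should be drawn from it.

    import numpy as np
    from scipy.optimize import linprog

    def l1_regression(X, y):
        # LP (2.2): min e'u + e'v  s.t.  y = Xb + u - v, u >= 0, v >= 0, b free.
        n, p = X.shape
        c = np.concatenate([np.zeros(p), np.ones(2 * n)])   # cost on (b, u, v)
        A_eq = np.hstack([X, np.eye(n), -np.eye(n)])        # y = Xb + u - v
        bounds = [(None, None)] * p + [(0, None)] * (2 * n)
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)
        return res.x[:p]

A vertex solution returned by such a solver passes through p of the observations, which is exactly the "basic" characterization (2.5).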

The simplex algorithm offers an extremely efficient approach to computing $\ell_1$-type estimators for many applications, yielding, as we shall see below, timings that are quite competitive with least squares on problems of moderate size. See Shamir (1993) for a survey of the extensive literature on the computational complexity of the simplex method. However, the performance of simplex on large problems is somewhat less satisfactory. Problems of sample size 50,000 may require as much as 50 times the computational effort of least squares to compute a median regression ($\ell_1$) estimate. Recent implementations of simplex, notably those of Bixby and collaborators (see, e.g., Bixby's discussion of Lustig, Marsten and Shanno, 1994), primarily address efficient treatment of sparsity, and preprocessing to eliminate strictly dominated constraints, and therefore do not seem to be promising from the point of view of statistical applications. We begin our search for more efficient methods for large problems by looking in the most obvious place: the literature on interior point algorithms for linear programming, which have dramatically improved upon simplex for a broad class of problems.

3. INTERIOR POINT METHODS FOR CANONICAL LP'S

Although prior work in the Soviet literature offered theoretical support for the idea that linear programs could be solved in polynomial time, Karmarkar (1984) constituted a watershed in thinking about linear programming, both by making a more cogent theoretical argument and by offering direct evidence for the first time that interior point methods were demonstrably faster in specific, large, practical problems.

The close connection between the interior point approach of Karmarkar (1984) and earlier work on barrier methods for constrained optimization, notably Fiacco and McCormick (1968), was observed by Gill et al. (1986) and others and has led to what may be called, without much fear of exaggeration, a paradigm shift in the theory and practice of linear and nonlinear programming. Remarkably, some of the fundamental ideas required for this shift appeared already in the 1950s in a sequence of Oslo working papers by the economist Ragnar Frisch. This work is summarized in Frisch (1956). We will sketch the main outlines of the approach, with the understanding that further details may be found in the excellent expository papers of Wright (1992), Lustig, Marsten and Shanno (1994) and the references cited there.

Consider the canonical linear program

(3.1)   $\min\{ c'x \mid Ax = b,\ x \ge 0 \},$

and associate with this problem the following logarithmic barrier (potential-function) reformulation:

(3.2)   $\min\{ B(x, \mu) \mid Ax = b \},$

where

$B(x, \mu) = c'x - \mu \sum \log x_k.$

In effect, (3.2) replaces the inequality constraints in (3.1) by the penalty term of the log barrier. Solving (3.2) with a sequence of parameters $\mu$ such that $\mu \to 0$, we obtain in the limit a solution to the original problem (3.1). This approach was elaborated in Fiacco and McCormick (1968) for general constrained optimization, but was revived as a linear programming tool only after its close connection to the approach of Karmarkar (1984) was pointed out by Gill et al. (1986). The use of the logarithmic potential function seems to have been introduced by
Frisch (1956), who described it in the following vivid terms:

My method is altogether different than simplex. In this method we work systematically from the interior of the admissible region and employ a logarithmic potential as a guide—a sort of radar—in order to avoid crossing the boundary.

Suppose that we have an initial feasible point $x_0$ for (3.1), and consider solving (3.2) by the classical Newton method. Writing the gradient and Hessian of B with respect to x as

$\nabla B = c - \mu X^{-1} e,$
$\nabla^2 B = \mu X^{-2},$

where $X = \mathrm{diag}(x)$ and e denotes an n-vector of 1's, we have at each step the Newton problem

(3.3)   $\min_p \{ c'p - \mu p' X^{-1} e + \tfrac{1}{2} \mu p' X^{-2} p \mid Ap = 0 \}.$

Solving this problem and moving from $x_0$ in the resulting direction p toward the boundary of the constraint set maintains feasibility and is easily seen to improve the objective function. The first-order conditions for this problem may be written as

(3.4)   $\mu X^{-2} p + c - \mu X^{-1} e = A'y,$
(3.5)   $Ap = 0,$

where y denotes an m-vector of Lagrange multipliers. Solving for y explicitly, by multiplying through in the first equation by $AX^2$ and using the constraint to eliminate p, we have

(3.6)   $AX^2 A' y = AX^2 c - \mu A X e.$

These normal equations may be recognized as generated from the linear least squares problem

(3.7)   $\min_y \| X A' y - (X c - \mu e) \|^2.$

Solving for y, computing the Newton direction p from (3.4) and taking a step in the Newton direction toward the boundary constitute the essential features of the primal log barrier method. A special case of this approach is the affine scaling algorithm in which we take $\mu = 0$ at each step in (3.6), an approach anticipated by Dikin (1967) and studied by Vanderbei, Meketon and Freedman (1986) and numerous subsequent authors.
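In code, the affine scaling special case ($\mu = 0$) of this primal iteration amounts to little more than repeatedly solving the normal equations (3.6). The following fragment is our bare-bones illustration, not a robust solver; it assumes a strictly feasible starting point and takes a fixed fraction of the step to the boundary.

    import numpy as np

    def affine_scaling(A, b, c, x, steps=50, frac=0.97):
        # Sketch of primal affine scaling for min c'x s.t. Ax = b, x >= 0,
        # started from a strictly feasible x (Ax = b, x > 0).
        for _ in range(steps):
            X2 = x ** 2                                        # diag(X)^2 as a vector
            y = np.linalg.solve((A * X2) @ A.T, (A * X2) @ c)  # normal equations (3.6), mu = 0
            p = -X2 * (c - A.T @ y)                            # descent direction; Ap = 0 by construction
            if np.linalg.norm(p) < 1e-10:
                break                                          # (approximately) optimal
            neg = p < 0
            if not neg.any():
                raise ValueError("problem appears unbounded below")
            x = x + frac * np.min(-x[neg] / p[neg]) * p        # fixed fraction of step to boundary
        return x

Each iteration is dominated by one weighted least-squares solve, the $O(np^2)$ factorization that recurs throughout the complexity accounting below.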
Recognizing that similar methods may be applied to the primal and dual formulations simultaneously, recent theory and implementation of interior point methods for linear programming have focused on attacking both formulations. The dual problem corresponding to (3.1) may be written as

(3.8)   $\max\{ b'y \mid A'y + z = c,\ z \ge 0 \}.$

Optimality in the primal implies

(3.9)   $c - \mu X^{-1} e = A'y,$

so setting $z = \mu X^{-1} e$ we have the system

(3.10)   $Ax = b,\ x > 0;\qquad A'y + z = c,\ z > 0;\qquad Xz = \mu e.$

Solutions $[x(\mu), y(\mu), z(\mu)]$ of these equations constitute the central path of solutions to the logarithmic barrier problem, which approach the classical complementary slackness condition $x'z = 0$ as $\mu \to 0$, while maintaining primal and dual feasibility along the path.

If we now apply Newton's method to this system of equations, we obtain

(3.11)   $\begin{pmatrix} Z & 0 & X \\ A & 0 & 0 \\ 0 & A' & I \end{pmatrix} \begin{pmatrix} p_x \\ p_y \\ p_z \end{pmatrix} = \begin{pmatrix} \mu e - Xz \\ b - Ax \\ c - A'y - z \end{pmatrix},$

which can be solved explicitly as

(3.12)
$p_y = (A Z^{-1} X A')^{-1} \bigl[ A Z^{-1} X (c - \mu X^{-1} e - A'y) + b - Ax \bigr],$
$p_x = X Z^{-1} \bigl[ A' p_y + \mu X^{-1} e - c + A'y \bigr],$
$p_z = -A' p_y + c - A'y - z.$

Like the primal method, the real computational effort of computing this step is the Choleski factorization of the diagonally weighted matrix $A Z^{-1} X A'$. Note that the consequence of moving from a purely primal view of the problem to one that encompasses both the primal and dual is that $A X^2 A'$ has been replaced by $A Z^{-1} X A'$ and the right-hand side of the equation for the y-Newton step has altered somewhat. However, the computational effort is essentially identical. To complete the description of the primal-dual algorithm we would need to specify how far to go in the Newton direction p, how to adjust $\mu$ as the iterations proceed and how to stop.

In fact, the most prominent examples of implementations of the primal-dual log barrier approach now employ a variant due to Mehrotra (1992), which resolves all of these issues. We will briefly describe this variant in the next section in the context of a slightly more general class of linear programs which encompasses the $\ell_1$-problem as well as the general linear quantile regression problem.
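The explicit solution (3.12) is equally direct to transcribe. This fragment (ours) computes one primal–dual Newton step from a current iterate $(x, y, z)$ and a given $\mu$, with the Cholesky-backed solve of $A Z^{-1} X A'$ left to a generic routine.

    import numpy as np

    def primal_dual_step(A, b, c, x, y, z, mu):
        # One Newton step for the central-path system (3.10), i.e. (3.11)-(3.12).
        w = x / z                                  # diagonal of Z^{-1}X
        r = c - mu / x - A.T @ y                   # c - mu X^{-1}e - A'y
        M = (A * w) @ A.T                          # A Z^{-1} X A'; Cholesky in real code
        py = np.linalg.solve(M, (A * w) @ r + b - A @ x)
        pz = c - A.T @ y - z - A.T @ py
        px = w * (A.T @ py + mu / x - c + A.T @ y)
        return px, py, pz

Only the diagonal weights distinguish this factorization from the primal one, which is the point made in the text.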
284 S. PORTNOY AND R. KOENKER

4. INTERIOR POINT METHODS FOR QUANTILE REGRESSION

Quantile regression, as introduced in Koenker and Bassett (1978), places asymmetric weight on positive and negative residuals, and solves the slightly modified $\ell_1$-problem

(4.1)   $\min_{b \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau(y_i - x_i'b),$

where $\rho_\tau(r) = r[\tau - I(r < 0)]$ for $\tau \in (0, 1)$. This yields the modified linear program

(4.2)   $\min\{ \tau e'u + (1 - \tau) e'v \mid y = Xb + u - v,\ (u, v) \in \mathbb{R}_+^{2n} \}$

and dual formulations

(4.3)   $\max\{ y'd \mid X'd = 0,\ d \in [\tau - 1, \tau]^n \}$

or, setting $a = d + (1 - \tau)e$,

(4.4)   $\max\{ y'a \mid X'a = (1 - \tau) X'e,\ a \in [0, 1]^n \}.$

The dual formulation of the quantile regression problem fits nicely into the standard formulations of interior point methods for linear programs with bounded variables. The function $a(\tau)$ that maps $[0, 1]$ to $[0, 1]^n$ plays a crucial role in connecting the statistical theory of quantile regression to the classical theory of rank tests as described in Gutenbrunner and Jurečková (1992) and Gutenbrunner, Jurečková, Koenker and Portnoy (1993). See Koenker and d'Orey (1987, 1993) for a detailed description of modifications of the Barrodale–Roberts (Barrodale and Roberts, 1974) simplex algorithm for this problem.
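The check function $\rho_\tau$ is one line of code. The sketch below is ours; for $\tau = 1/2$ the objective is half the $\ell_1$ criterion (2.1), so median regression is recovered as a special case.

    import numpy as np

    def rho_tau(r, tau):
        # rho_tau(r) = r * (tau - I(r < 0)); for tau = 0.5 this is |r| / 2.
        return r * (tau - (r < 0))

    def qreg_objective(b, X, y, tau):
        # Objective (4.1), summed over the sample.
        return rho_tau(y - X @ b, tau).sum()

In the LP sketch given in Section 2, replacing the ones in the cost vector by $\tau$ on the u-block and $1 - \tau$ on the v-block turns that program into formulation (4.2).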
Adding slack variables, s, satisfying the constraint $a + s = e$, we obtain the barrier function

(4.5)   $B(a, s, \mu) = y'a + \mu \sum_{i=1}^{n} (\log a_i + \log s_i),$

which should be maximized subject to the constraints $X'a = (1 - \tau)X'e$ and $a + s = e$. The Newton step $\delta_a$ solving

(4.6)   $\max\{ y'\delta_a + \mu \delta_a'(A^{-1} - S^{-1})e - \tfrac{1}{2} \mu \delta_a'(A^{-2} + S^{-2})\delta_a \}$

subject to $X'\delta_a = 0$, satisfies

(4.7)   $y + \mu(A^{-1} - S^{-1})e - \mu(A^{-2} + S^{-2})\delta_a = Xb$

for some $b \in \mathbb{R}^p$, and $\delta_a$ such that $X'\delta_a = 0$. As before, multiplying through by $X'(A^{-2} + S^{-2})^{-1}$ and using the constraint, we can solve explicitly for the vector b,

(4.8)   $b = (X'WX)^{-1} X'W \bigl[ y + \mu(A^{-1} - S^{-1})e \bigr],$

where $W = (A^{-2} + S^{-2})^{-1}$. This is a form of the primal log barrier algorithm described above. Setting $\mu = 0$ in each step yields an affine scaling variant of the algorithm. We should stress again that the basic linear algebra of each iteration is essentially unchanged; only the form of the diagonal weighting matrix W has changed. We should also emphasize that there is nothing especially sacred about the explicit form of the barrier function used in (4.5). Indeed, one of the earliest proposed modifications of Karmarkar's original work was the affine scaling algorithm of Vanderbei, Meketon and Freedman (1986), which used, implicitly, $\mu \sum_{i=1}^{n} \log[\min(a_i, s_i)]$ in lieu of the additive specification.

Again, it is natural to ask if a primal–dual form of the algorithm could improve performance. In the bounded variables formulation we have the Lagrangian

(4.9)   $L(a, s, b, u, \mu) = B(a, s, \mu) - b'\bigl( X'a - (1 - \tau)X'e \bigr) - u'(a + s - e),$

and setting $v = \mu A^{-1} e$ we have the first-order conditions, describing the central path (see Gonzaga, 1992),

(4.10)   $X'a = (1 - \tau)X'e;\quad a + s = e;\quad Xb + u - v = y;\quad USe = \mu e;\quad AVe = \mu e,$

yielding the Newton step

(4.11)
$\delta_b = (X'WX)^{-1} \bigl[ X'W\xi + X'a - (1 - \tau)X'e \bigr],$
$\delta_a = W(\xi - X\delta_b),$
$\delta_s = -\delta_a,$
$\delta_u = \mu S^{-1}e - Ue - S^{-1}U\delta_s,$
$\delta_v = \mu A^{-1}e - Ve - A^{-1}V\delta_a,$

where $\xi = y - Xb + \mu(A^{-1} - S^{-1})e$ and W now denotes the diagonal matrix $(S^{-1}U + A^{-1}V)^{-1}$. The most successful implementations of this approach to date employ the predictor-corrector step of Mehrotra (1992), which is described in the context of bounded variables problems in Lustig, Marsten and Shanno (1992). A related earlier approach is described in Zhang (1992). In Mehrotra's approach we proceed somewhat differently. Rather than solving for the
Newton step (4.11) directly, we substitute the step directly into (4.10) to obtain

(4.12)
$X'(a + \delta_a) = (1 - \tau)X'e;$
$(a + \delta_a) + (s + \delta_s) = e;$
$X(b + \delta_b) + (u + \delta_u) - (v + \delta_v) = y;$
$(U + \Delta_u)(S + \Delta_s)e = \mu e;$
$(A + \Delta_a)(V + \Delta_v)e = \mu e,$

where $\Delta_a, \Delta_v, \Delta_u, \Delta_s$ denote the diagonal matrices with diagonals $\delta_a, \delta_v, \delta_u, \delta_s$, respectively. As noted by Lustig, Marsten and Shanno, the primary difference between solving this system and the prior Newton step is the presence of the nonlinear terms $\Delta_u\Delta_s$, $\Delta_a\Delta_v$ in the last two equations. To approximate a solution to these equations, Mehrotra (1992) suggests first solving for an affine primal–dual direction by setting $\mu = 0$ in (4.11). Given this preliminary direction, we may then compute the step length using the following ratio tests:

(4.13)   $\hat\gamma_P = \sigma \min\Bigl\{ \min_{\delta_{a_j} < 0} \bigl( -a_j/\delta_{a_j} \bigr),\ \min_{\delta_{s_j} < 0} \bigl( -s_j/\delta_{s_j} \bigr) \Bigr\},$

(4.14)   $\hat\gamma_D = \sigma \min\Bigl\{ \min_{\delta_{u_j} < 0} \bigl( -u_j/\delta_{u_j} \bigr),\ \min_{\delta_{v_j} < 0} \bigl( -v_j/\delta_{v_j} \bigr) \Bigr\},$

using scaling factor $\sigma = 0.99995$, as in Lustig, Marsten and Shanno. Then, defining the function

(4.15)   $\hat g(\hat\gamma_P, \hat\gamma_D) = (s + \hat\gamma_P \delta_s)'(u + \hat\gamma_D \delta_u) + (a + \hat\gamma_P \delta_a)'(v + \hat\gamma_D \delta_v),$

the new $\mu$ is taken as

(4.16)   $\mu = \Bigl( \frac{\hat g(\hat\gamma_P, \hat\gamma_D)}{\hat g(0, 0)} \Bigr)^3 \frac{\hat g(0, 0)}{2n}.$

To interpret (4.15) we may use the first three equations of (4.10) to write, for any primal–dual feasible point $(u, v, s, a)$,

(4.17)   $\tau e'u + (1 - \tau)e'v - \bigl( a - (1 - \tau)e \bigr)'y = u's + a'v.$

So the quantity $u's + a'v$ is equal to the duality gap, the difference between the primal and dual objective function values at $(u, v, s, a)$, and $\hat g(\hat\gamma_P, \hat\gamma_D)$ is the duality gap after the tentative affine scaling step. Note that the quantity $a - (1 - \tau)e$ is simply the vector d appearing in the dual formulation (4.3). At a solution, classical duality theory implies that the duality gap vanishes; that is, the values of the primal and dual objective functions are equal and the complementary slackness condition $u's + a'v = 0$ holds. If, in addition to feasibility, $(u, v, s, a)$ happened to lie on the central path, the last two equations of (4.10) would imply that

$u's + a'v = 2\mu n.$

Thus, the function $\hat g$ in (4.15) may be seen as an attempt to adapt $\mu$ to the current iterate in such a way that, for any given value of the duality gap, $\mu$ is chosen to correspond to the point on the central path with that gap. By definition, $\hat g(\hat\gamma_P, \hat\gamma_D)/\hat g(0, 0)$ is the ratio of the duality gap after the tentative affine-scaling step to the gap at the current iterate. If this ratio is small, the proposed step is favorable and we should reduce $\mu$ further, anticipating that the recentering and nonlinearity adjustment of the modified step will yield further progress. If, on the other hand, $\hat g(\hat\gamma_P, \hat\gamma_D)$ isn't much different from $\hat g(0, 0)$, the affine scaling direction is unfavorable, and further reduction in $\mu$ is ill-advised. Since leaving $\mu$ fixed in the iteration brings us back to the central path, such unfavorable steps are intended to enable better progress in subsequent steps by bringing the current iterate back to the vicinity of the central path. The rationale for the cubic adjustment in (4.16), which implements these heuristics, is based on the fact that the recentering of the Newton direction embodied in the terms $\mu A^{-1}e$ and $\mu S^{-1}e$ of (4.11) and (4.18) accommodates the $O(\mu)$ term in the expansion of the duality gap function $\hat g$, while the nonlinearity adjustment described below accommodates the $O(\mu^2)$ effect of the $\delta_s\delta_u$ and $\delta_a\delta_v$ terms.

We compute the following approximation to the solution of system (4.12), with this $\mu$ and the nonlinear terms $\Delta_s\Delta_u$ and $\Delta_a\Delta_v$ taken from the preliminary primal-dual affine direction:

(4.18)
$\delta_b = (X'WX)^{-1} \bigl[ X'W\xi + X'a - (1 - \tau)X'e \bigr],$
$\delta_a = W(\xi - X\delta_b),$
$\delta_s = -\delta_a,$
$\delta_u = \mu S^{-1}e - Ue - S^{-1}U\delta_s - S^{-1}\Delta_s\Delta_u e,$
$\delta_v = \mu A^{-1}e - Ve - A^{-1}V\delta_a - A^{-1}\Delta_a\Delta_v e.$

The iteration proceeds until the algorithm terminates when the duality gap $y'a - (1 - \tau)e'Xb + e'v$ becomes smaller than a specified $\varepsilon$. Recall that the duality gap is zero at a solution, and thus this criterion offers a more direct indication of convergence than is usually available in iterative algorithms.


Fig. 2. Timing comparison of three $\ell_1$-algorithms for median regression: times are in seconds for the median of five replications for iid Gaussian data. The parametric dimension of the models is p + 1, with p indicated above each plot; p columns are generated randomly and an intercept parameter is appended to the resulting design. Timings were made at eight design points, n ∈ {200; 400; 800; 1,200; 2,000; 4,000; 8,000; 12,000}. The solid line represents the results for the simplex-based Barrodale–Roberts algorithm implemented in S-PLUS as l1fit, the rqfn dashed line represents a primal-dual interior point algorithm, mek uses an affine scaling form of the interior point approach and the dotted line represents least-squares timings based on lm(y∼x) as a benchmark.

Our expectations about satisfactory computational speed of regression estimators are inevitably strongly conditioned by our experience with least squares. In Figure 2 we illustrate the results of a small experiment to compare the computational speed of three $\ell_1$-algorithms: the Barrodale–Roberts simplex algorithm (Barrodale and Roberts, 1974), which is employed in many contemporary statistical packages; Meketon's affine scaling algorithm; and our implementation of Mehrotra's (1992) predictor-corrector version of the primal–dual log barrier algorithm. Meketon's algorithm is indicated in the figure as mek, and our Mehrotra implementation as rqfn, for regression quantiles via Frisch–Newton. The two interior point algorithms were coded in Fortran employing Lapack (Anderson et al., 1995) subroutines for the requisite linear algebra. They were then incorporated as functions into S-PLUS, and timings are based on the S-PLUS function unix.time(). The Barrodale–Roberts timings are based on the S-PLUS implementation l1fit(x,y). For comparison we also illustrate timings for least-squares estimation based on the S-PLUS function lm(y∼x).

Such comparisons are inevitably fraught with qualifications about programming style, system overhead and so on. We have chosen to address the comparison within the S-PLUS environment because (a) it is the computing environment in which we feel most comfortable, a view widely shared by the statistical research community, and (b) it offers a convenient means of incorporating new functions in lower-level languages, like Fortran and C, providing a reasonably transparent and efficient interface with the rest of the language. We have considerable experience with the Barrodale–Roberts (BR) Fortran code (Barrodale and Roberts, 1974) as implemented in S-PLUS for l1fit. This code also underlies the quantile regression routines described in Koenker and d'Orey (1987, 1993) and still represents the state of the art after more than 20 years. The S-PLUS function l1fit incurs a modest overhead getting problems into and out of BR's Fortran, but this overhead is quickly dwarfed by the time spent in the Fortran in large problems. Similarly, we have tried to write the interior point code to minimize the S-PLUS overhead, although some improvements are still possible in this respect. Least-squares timings are also potentially controversial. The S-PLUS function lm as described by Chambers (1992) offers three method options: QR decomposition, Cholesky and singular-value decomposition. All of our comparisons are based on the
default choice of the QR method. Again there is a modest overhead involved in getting the problem descriptions into, and the solutions out of, the lower-level Lapack routines which underlie lm. We have run some very limited timing comparisons outside S-PLUS, directly in Fortran, to evaluate these overhead effects, and our conclusion is that any distortions in relative performance due to overhead effects are slight.

We would stress that the code underlying the least-squares computations we report is the product of decades of refinement, while our interior point routines are still in their infancy. There is still considerable scope for improvement in the latter.

Several features of the figures are immediately striking. For small problems all the $\ell_1$-algorithms perform impressively. They are all faster than the QR implementation of least squares which is generally employed in lm. For small problems the simplex implementation of Barrodale and Roberts is the clear winner, but its roughly quadratic (in sample size) growth over the illustrated range quickly dissipates its initial advantage. The interior point algorithms do considerably better than simplex at larger sample sizes, exhibiting roughly linear growth, as does least squares. Meketon's affine scaling algorithm performs slightly better than the primal–dual algorithm, which is somewhat surprising, but for larger p the difference is hardly noticeable.

Beyond the range of problem sizes illustrated here, the advantage of the interior point method over simplex grows exorbitant, fully justifying the initial enthusiasm with which Karmarkar (1984) was received. Nevertheless, there is still a significant gap between $\ell_1$ and $\ell_2$ performance in large samples. We explore this gap from the probabilistic viewpoint of computational complexity in the next section.

5. COMPUTATIONAL COMPLEXITY

In this section we investigate the computational complexity of the interior point algorithms for quantile regression described above. We should stress at the outset, however, that the probabilistic approach to complexity analysis adopted here is rather different than that employed in the rest of the interior point literature, where the focus on worst-case analysis has led to striking discrepancies between theoretical rates and observed computational experience. The probabilistic approach has the virtue that the derived rates are much sharper and consequently more consonant with observed performance. A similar gap between worst-case theory and average practice can be seen in the analysis of parametric linear programming via the simplex algorithm, where it is known that in certain problems with an n-by-p constraint matrix there can be as many as $n^p$ distinct solutions. However, exploiting some special aspects of the quantile regression problem and employing a probabilistic approach, Portnoy (1991) was able to show that the number of distinct vertex solutions (in $\tau$) is $O_p(n \log n)$, a rate which provides excellent agreement with empirical experience.

For interior point methods the crux of the complexity argument rests on showing that at each iteration the algorithm reduces the duality gap by a proportion, say $\theta_n < 1$. Thus, after K iterations, an initial duality gap of $\Delta_0$ has been reduced to $\theta_n^K \Delta_0$. Once the gap is sufficiently small (say, less than $\varepsilon$), there is only one vertex of the constraint set at which the duality gap can be smaller. This follows obviously from the fact that the vertices are discrete. Thus, the vertex with the smaller duality gap must be the optimal one, and this vertex may be identified by taking p simplex-type steps. This process, called purification in Gonzaga (1992, Lemma 4.7), requires in our notation p steps involving $O(np^2)$ operations each, or $O(np^3)$ operations. Hence, the number of iterations K required to make $\theta_n^K \Delta_0 < \varepsilon$ is

$K < \log(\Delta_0/\varepsilon)/(-\log \theta_n).$

In the worst-case analysis of the interior point literature, $\varepsilon$ is taken to be $2^{-L}$, where L is the total number of binary bits required to encode the entire data of the problem; in our notation, L would be $O(np)$. Further, the conventional worst-case analysis employs the bound $\theta_n < (1 - cn^{-1/2})$ and takes $\Delta_0$ independent of n, so the number of required iterations is $O(\sqrt{n}\, L)$. Since each iteration requires a weighted least-squares solution of $O(np^2)$ operations, the complexity of the algorithm as a whole would be $O(n^{5/2} p^3)$, apparently hopelessly disadvantageous relative to least squares. Fortunately, however, in the random problems for which quantile regression methods are designed, the $\varepsilon$ bound on the duality gap at the second best vertex can be shown to be considerably larger, at least with probability tending to 1, than this worst-case value of $2^{-L}$. Lemma A.1 provides the bound $\log \varepsilon = O_p(p \log n)$ under mild conditions on the underlying regression model. This leads to a considerably more optimistic view of these methods for large problems.

Renegar (1988) and numerous subsequent authors have established the existence of a large class of interior point algorithms for solving linear programs which, starting from an initially feasible primal–dual point with duality gap $\Delta_0$,
can achieve convergence to a prescribed accuracy $\varepsilon$ in $O(\sqrt{n} \log[\Delta_0/\varepsilon])$ iterations in the worst case. More recently, Sonnevend, Stoer and Zhao (1991) have shown under somewhat stronger nondegeneracy conditions that this rate can be improved to $O(n^a \log[\Delta_0/\varepsilon])$ with $a < 1/2$. We will call an algorithm which achieves this rate an $n^a$-algorithm. They give explicit conditions, which hold with probability 1 if the y's have a continuous density, for the case $a = 1/4$. The following result then follows immediately from Lemma A.1.

Theorem 5.1. Under the conditions of Lemma A.1, an $n^a$-algorithm for median regression converges in $O_p(n^a p \log n)$ iterations. With $O(np^2)$ operations required per iteration and $O(np^3)$ operations required for the final "purification" process, such an algorithm has complexity $O_p(n^{1+a} p^3 \log n)$.

Mizuno, Todd and Ye (1993) provide an alternative probabilistic approach to the existence of an $n^a$-algorithm, with $a < 1/2$, and provide a heuristic argument for $a = 1/4$. They also conjecture that $n^a$ might be improvable to $\log n$ by a more refined probabilistic approach. This would improve the overall complexity in Theorem 5.1 to $O_p(np^3 \log^2 n)$ and seems quite plausible in light of the empirical evidence reported below, and elsewhere in the interior point literature. In either case we are still faced with a theoretical gap between $\ell_1$ and $\ell_2$ performance that substantiates the empirical experience reported in the previous section. We now introduce a new form of preprocessing for $\ell_1$-problems that has been successful in further narrowing this gap.

6. PREPROCESSING FOR QUANTILE REGRESSION

Many modern linear programming algorithms include an initial phase of preprocessing which seeks to reduce problem dimensions by identifying redundant variables and dominated constraints. See, for example, the discussion in Lustig, Marsten and Shanno (1994, Section 8.2) and the remarks of the discussants. Bixby, in this discussion, reports reductions of 20–30% in the row and column dimensions of a sample of standard commercial test problems due to "aggressive implementation" of preprocessing. Standard preprocessing strategies for LP's are not, however, particularly well suited to the statistical applications which underlie quantile regression. In this section we describe some new preprocessing ideas designed explicitly for quantile regression, which can be used to reduce dramatically the effective sample sizes for these problems.

The basic idea underlying our preprocessing step rests on the following elementary observation. Consider the directional derivative of the median regression ($\ell_1$) problem

$\min_b \sum_{i=1}^{n} |y_i - x_i'b|,$

which may be written in direction w as

$g(b, w) = -\sum_{i=1}^{n} x_i'w\ \mathrm{sgn}^*(y_i - x_i'b,\ -x_i'w),$

where

$\mathrm{sgn}^*(u, v) = \mathrm{sgn}(u)$ if $u \ne 0$, and $\mathrm{sgn}(v)$ if $u = 0$.

Optimality may be characterized as a $b^*$ such that $g(b^*, w) \ge 0$ for all $w \in \mathbb{R}^p$. Suppose for the moment that we "knew" that a certain subset $J_H$ of the observations $N = \{1, \dots, n\}$ would fall above the optimal median plane and another subset $J_L$ would fall below. Then consider the revised problem

$\min_{b \in \mathbb{R}^p} \sum_{i \in N \setminus (J_L \cup J_H)} |y_i - x_i'b| + |y_L - x_L'b| + |y_H - x_H'b|,$

where $x_K = \sum_{i \in J_K} x_i$ for $K \in \{H, L\}$, and $y_L, y_H$ can be chosen as arbitrarily small and large enough, respectively, to ensure that the corresponding residuals remain negative and positive. We will refer in what follows to these combined pseudo-observations as globs. The new problem, under our provisional hypothesis, has exactly the same gradient condition as the original one, and therefore the same solutions, but the revision has reduced the effective sample size by $\#\{J_L \cup J_H\} - 2$, that is, by the number of observations in the globs.

How might we know $J_L, J_H$? Consider computing a preliminary estimate $\hat\beta$ based on a subsample of m observations. Compute a simultaneous confidence band for $x_i'\beta$ based on this estimate for each $i \in N$. Under plausible sampling assumptions we will see that the length of each interval is proportional to $p/\sqrt{m}$, so if M denotes the number of $y_i$ falling inside the band, $M = O_p(np/\sqrt{m})$. Take $J_L, J_H$ to be composed of the indices of the observations falling outside the band. So we may now create the "globbed" observations $(y_K, x_K)$, $K \in \{L, H\}$, and reestimate based on $M + 2$ observations. Finally, we must check to verify that all the observations in $J_H, J_L$ have the anticipated residual signs; if so, we are done; if not, we must repeat the process. If the coverage probability of the bands is P, presumably near 1, then the expected number of repetitions of this process is the expectation of a geometric random variable Z with expectation $P^{-1}$. We will call each repetition a cycle.
GAUSSIAN HARE, LAPLACIAN TORTOISE 289

6.1 Implementation

In this subsection we will sketch some further details of the preprocessing strategy. We should emphasize that there are many aspects of the approach that deserve further research and refinement. In an effort to encourage others to contribute to this process, we have made all of the code described below available at the website http://www.econ.uiuc.edu/research/rqn/rqn.html. We will refer in what follows to the Frisch–Newton quantile regression algorithm with preprocessing as prqfn.

The basic structure of the current prqfn algorithm looks like this:

k ← 0
l ← 0
m ← [2n^(2/3)]
while (k is small) {
    k = k + 1
    solve for initial rq using first m observations
    compute confidence interval for this solution
    reorder globbed sample as first M observations
    while (l is small) {
        l = l + 1
        solve for new rq using the globbed sample
        check residual signs of globbed observations
        if no bad signs: return optimal solution
        if only few bad: adjust globs, reorder sample, update M, continue
        if too many bad: increase m and break to outer loop
    }
}

The algorithm presumes that the data has undergone some initial randomization so that the first m observations may be considered representative of the sample as a whole. In all of the experiments reported below we use the Mehrotra–Lustig–Marsten–Shanno primal–dual algorithm to compute the subsample solutions. For some "intermediately large" problems it would be preferable to use the simplex approach, but we postpone this refinement. Although the affine scaling algorithm of Meketon (1986) exhibited excellent performance on certain subsets of our early test problems, like those represented in Figure 2, we found its performance inconsistent in other tests. It was consequently abandoned in favor of the more reliable primal–dual formulation. This choice is quite consistent with the general development of the broader literature on interior point methods for linear programming, but probably also deserves further exploration.

6.2 Confidence Bands

The confidence bands used in our reported computational experiments are of the standard Scheffé type. Under iid error assumptions the covariance matrix of the initial solution is given by

$V = \omega^2 (X'X)^{-1},$

where $\omega^2 = \tau(1 - \tau)/f^2[F^{-1}(\tau)]$; the reciprocal of the error density at the $\tau$th quantile is estimated using the Hall–Sheather bandwidth (Hall and Sheather, 1988) for Siddiqui's (1960) estimator. Quantiles of the residuals from the initial fit are computed using the Floyd–Rivest algorithm (Floyd and Rivest, 1975). We then pass through the entire sample computing the intervals

$B_i = \bigl( x_i'\hat\beta - \zeta \|\hat V^{1/2} x_i\|,\ x_i'\hat\beta + \zeta \|\hat V^{1/2} x_i\| \bigr).$

The parameter $\zeta$ is currently set, naively, at 2, but could, more generally, be set as $\zeta = \bigl( \Phi^{-1}(1 - \alpha) + \sqrt{2p - 1} \bigr)/\sqrt{2} = O(\sqrt{p})$ to achieve $(1 - \alpha)$ coverage for the band, which thus assures that the number of cycles is geometric. Since, under the moment condition of Lemma A.1, if $p \to \infty$, the quantity $\|\hat V^{1/2} x_i\|$ also behaves like the square root of a $\chi^2$ random variable, the width of the confidence band is $O_p(p/\sqrt{m})$.

Unfortunately, using the Scheffé bands requires $O(np^2)$ operations, a computation of the same order as that required by least-squares estimation of the model. It seems reasonable, therefore, to consider alternatives. One possibility, suggested by the Studentized range, is to base intervals on the inequality

(6.1)   $|x_i'\hat\beta| \le \max_j \bigl( |\hat\beta_j| / s_j \bigr) \times \sum_{j=1}^{p} |x_{ij}|\, s_j,$

where $s_j$ is $\hat\omega$ times the square root of the jth diagonal element of the $(X'X)^{-1}$ matrix, and $\hat\omega$ is computed as for the Scheffé intervals. This approach provides conservative (although not "exact") confidence bands with width $c_q \sum_{j=1}^{p} |x_{ij}|\, s_j$. Note that this requires only $O(np)$ operations, thus providing an improved rate. Choice of the constant $c_q$ is somewhat problematic, but some experimentation with simulated data showed that $c_q$ could be taken conservatively to be approximately 1, and that the algorithm was remarkably independent of the precise value of $c_q$. For these bands the width is again $O_p(p/\sqrt{m})$, as for the Scheffé bands. Although these $O(np)$ confidence bands worked well in simulation experiments, and thus merit further study, the computational experience reported here is based entirely on the more traditional Scheffé bands.

After creating the globbed sample, we again solve the quantile regression problem, this time with the M observations of the globbed sample. Finally, we check the signs of the globbed observations. If they all agree with the signs predicted by the confidence band, we may declare victory and return the optimal solution. If there are only a few incorrect signs, we have found it expedient to adjust the globs, reintroduce these observations into the new globbed sample and re-solve. If there are too many incorrect signs, we return to the initial phase, increasing the initial sample size somewhat, and repeat the process. One or two repetitions of the inner (fixup) loop are not unusual; more than two cycles of the outer loop is highly unusual given current settings of the confidence band parameters.

6.3 Choosing m

The choice of the initial subsample size m and its implications for the complexity of an interior point algorithm for quantile regression with preprocessing is resolved by the following result.

Theorem 6.1. Under the conditions of Lemma A.1, for any nonrecursive quantile regression algorithm with complexity $O_p(n^\alpha p^\beta \log n)$ for problems with dimension $(n, p)$, there exists a confidence band construction based on an initial subsample of size m with expected width $O_p(p/\sqrt{m})$, and, consequently, the optimal initial subsample size is $m^* = O((np)^{2/3})$. With this choice of $m^*$, M is also $O((np)^{2/3})$. Then, with $\alpha = 1 + a$ and $\beta = 3$, from Theorem 5.1, the overall complexity of the algorithm with preprocessing is, for any $n^a$ underlying interior point algorithm,

$O_p\bigl( (np)^{2(1+a)/3} p^3 \log n \bigr) + O_p(np).$

For $a < 1/2$, n sufficiently large and p fixed, this complexity is dominated by the complexity of the confidence band computation, and is strictly smaller than the $O(np^2)$ complexity of least squares.

Proof. Formally, we treat only the case of p fixed, but we have tried to indicate the role of p in the determination of the constants, where possible. Thus, for example, for $p \to \infty$, we have suggested above that the widths of both the Scheffé bands and the Studentized range bands are $O_p(p/\sqrt{m})$. For p fixed this condition is trivially satisfied. By independence we may conclude that the number of observations inside such a confidence band will be

$M = O_p(np/\sqrt{m}),$

and minimizing, for any constant c,

(6.2)   $m^\alpha p^\beta \log m + c(np/\sqrt{m})^\alpha p^\beta \log(cnp/\sqrt{m})$

yields

$m^* = O\bigl( (np)^{2/3} \bigr).$
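The balancing behind $m^*$ can be made explicit. Ignoring the logarithmic and $p^\beta$ factors common to both terms of (6.2), the two terms are of the same order when

$m^{\alpha} \asymp \Bigl( \frac{np}{\sqrt{m}} \Bigr)^{\alpha}
\iff m \asymp \frac{np}{\sqrt{m}}
\iff m^{3/2} \asymp np
\iff m \asymp (np)^{2/3},$

and since the first term of (6.2) is increasing in m while the second is decreasing, the minimum is attained, up to logarithmic factors, at this crossing point.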
Substituting this $m^*$ back into (6.2), Theorem 5.1 implies that we have complexity

$O\bigl( (np)^{2(1+a)/3} p^3 \log n \bigr)$

for each cycle of the preprocessing. The number of cycles required is bounded in probability, since it is a realization of a geometrically distributed random variable with a finite expectation. The complexity computation for the algorithm as a whole is completed by observing that the required residual checking is $O(np)$ for each cycle, and employing the Studentized range confidence bands also requires $O(np)$ operations per cycle. Thus the contribution of the confidence band construction and residual checking is precisely $O_p(np)$, and for any $a < 1/2$ the complexity of the $\ell_1$-algorithm is therefore dominated by this term for any fixed p and n sufficiently large.

Remarks. (1) Clearly these results apply not only to median regression, but to quantile regression in general. (2) If the explicit rates in p of Theorem 6.1 hold for $p \to \infty$, and if the Mizuno–Todd–Ye conjecture that $n^a$ can be improved to $\log n$ holds, then the complexity of the algorithm becomes

$O\bigl( n^{2/3} p^3 \log^2 n \bigr) + O_p(np).$

The contribution of the first term in this expression would then assure an improvement over least squares for n sufficiently large, provided $p = o(n^{1/5})$, a rate approaching the domain of nonparametric regression applications. (3) It is tempting to consider the recursive application of the preprocessing approach described above; this can be effective in reducing the complexity of the solution of the initial subsample problem of size m, but it does not appear possible to make it effective in dealing with the globbed sample. This accounts for the qualifier "nonrecursive" in the statement of the theorem.

7. COMPUTATIONAL EXPERIENCE

In this section we provide some further evidence on the performance of our implementation of the algorithm on both simulated and real data.

Fig. 3. Timing comparison of two $\ell_1$-algorithms for median regression: times are in seconds for the mean of five replications for iid Gaussian data. The parametric dimension of the models is p + 1, with p indicated above each plot; p columns are generated randomly and an intercept parameter is appended to the resulting design. Timings were made at four design points, n ∈ {20,000; 40,000; 80,000; 120,000}. The dotted line represents the results for the simplex-based Barrodale–Roberts algorithm l1fit, which increases roughly quadratically in n. The solid line represents prqfn, the timings of the Frisch–Newton interior point algorithm with preprocessing.

Fig. 4. Timing comparison of two $\ell_1$-algorithms for median regression: times are in seconds for the mean of 10 replications for iid Gaussian data. The parametric dimension of the models is p + 1, with p indicated above each plot; p columns are generated randomly and an intercept parameter is appended to the resulting design. Timings were made at eight design points, n ∈ {40,000; 60,000; 80,000; 100,000; 120,000; 140,000; 160,000; 180,000}. The rqfn dashed line represents a primal-dual interior point algorithm; prqfn is rqfn with preprocessing; and the dotted line represents least-squares timings based on lm(y∼x) as a benchmark.

In Figure 3 we compare the performance of l1fit with the new prqfn, which combines the primal–dual algorithm with preprocessing. With the range of sample sizes 20,000–120,000, the clear superiority of prqfn is very striking. At n = 20,000, prqfn is faster than l1fit by a factor of about 10, and it is faster by a factor of 100 at n = 120,000. The quadratic growth in the l1fit timings is also quite apparent in this figure.

In Figure 4 we illustrate another small experiment to compare rqfn and prqfn with lm for n up to 180,000. Patience, or more accurately the lack thereof, however, does not permit us to include further comparisons with l1fit. Figure 4 displays the improvement provided by preprocessing and shows that prqfn is actually slightly faster than lm for p = 4 and quite close to least-squares speed for p = 8 for this range of sample sizes. It may be noted that internal Fortran timings of prqfn have shown that most of the time is spent in the primal–dual routine rqfn for n < 200,000. The results of Sections 5 and 6 suggest that the greatest value of preprocessing appears when n is large enough that the time needed to create the globs and check residuals is comparable to that spent in rqfn.

Finally, we report some experience with a moderately large econometric application. This is a fairly typical wage equation as employed in the labor economics literature. See Buchinsky (1994, 1995) for a much more extensive discussion of related results. The data are from the 5% sample of the 1990 U.S. Census and consist of annual salary and related characteristics on 113,547 men from the state of Illinois who responded that they worked 40 or more weeks in the previous year and who worked on average 35 or more hours per week.

We seek to investigate the determinants of the logarithm of individuals' reported wage or salary income in 1989 based on their attained educational level, a quadratic labor market experience effect, and other characteristics. Results are reported for five distinct quantiles. Least-squares results for the same model appear in the final column of Table 1. The standard errors reported in parentheses were computed by the sparsity method described in Koenker (1994) using the Hall–Sheather bandwidth. There are a number of interesting findings. The experience profile of salaries is quite consistent across quantiles, with salary increasing with experience at a decreasing rate. There is a very moderate tendency toward more deceleration in salary growth with experience at the lower quantiles. The white–nonwhite salary gap is highest at the first quartile, with whites receiving a 17% premium over nonwhites with similar characteristics, but this appears to decline both in the lower tail and for higher quantiles. Marriage appears to entail an enormous premium at the lower quantiles, nearly a 30% premium at the fifth percentile, for example, but this premium declines somewhat as salary rises.

Table 1
Quantile regression results for a U.S. wage equation

Covariate   τ = 0.05     τ = 0.25     τ = 0.5      τ = 0.75     τ = 0.95     ols
Intercept   7.60598      7.95888      8.27162      8.52930      8.54327      8.21327
            (0.028468)   (0.012609)   (0.009886)   (0.010909)   (0.025368)   (0.010672)
exp         0.04596      0.04839      0.04676      0.04461      0.05062      0.04582
            (0.001502)   (0.000665)   (0.000522)   (0.000576)   (0.001339)   (0.000563)
exp^2       -0.00080     -0.00075     -0.00069     -0.00062     -0.00056     -0.00067
            (0.000031)   (0.000014)   (0.000011)   (0.000012)   (0.000028)   (0.000012)
Education   0.07034      0.08423      0.08780      0.09269      0.11953      0.09007
            (0.001770)   (0.000784)   (0.000615)   (0.000678)   (0.001577)   (0.000664)
White       0.14202      0.17084      0.15655      0.13930      0.10262      0.14694
            (0.014001)   (0.006201)   (0.004862)   (0.005365)   (0.012476)   (0.005249)
Married     0.28577      0.24069      0.20120      0.18083      0.20773      0.21624
            (0.011013)   (0.004878)   (0.003824)   (0.004220)   (0.009814)   (0.004129)

cal iid-error linear model or, indeed, any of the thereby enabling computational speed comparable
conventional models accommodating some form of to that of least squares for some large quantile re-
parametric heteroscedasticity. gression problems. There are many possible refine-
In Table 2 we report the time (in seconds) re- ments of the basic approach investigated here, but
quired to produce the estimates in Table 1, using the message for that Gaussian hare who has been
three alternative quantile regression algorithms. frolicking in the flowers, confident of victory, is clear.
The time required for the least-squares estimates Laplace’s old tortoise, despite the house he wears
reported in the last column of Table 1 was 7.8 on his back to protect him from inclement statisti-
seconds, roughly comparable to the prqfn times. cal weather, has a few new tricks and the race is far
Again, the interior point approach with preprocess- from over.
ing as incorporated in prqfn is considerably quicker
than the interior point algorithm applied to the full APPENDIX
data set in rqfn. The simplex approach to comput-
Lemma A.1. In the linear model Yi = x0i β + ui ,
ing quantile regression estimates is represented
i = 1; : : : ; n; assume the following:
here by the modification of the Barrodale–Roberts
(Barrodale and Roberts, 1974) algorithm described (i) ”xi ; Yi ‘; i = 1; : : : ; n• are iid with a bounded
in Koenker and d’Orey (1987) and denoted by rq continuous density in <p+1 ;
in the table. There is obviously a very substantial (ii) EŽxij Žp < ∞ and EŽYi Ža < ∞; for some a > 0:
gain in moving away from the simplex approach to
computation in large problems of this type. Then the duality gap of the median regression esti-
mator at the second best vertex exceeds n−p+5‘ with
probability tending to 1 as n → ∞; and the initial
8. CONCLUSIONS
duality gap 10 satisfies log 10 = Op log n‘.
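For readers who want to experiment with specifications of this kind, the sketch below is a rough modern stand-in: it fits the same five quantiles to synthetic data whose coefficients loosely imitate the OLS column of Table 1. It assumes the statsmodels package; its QuantReg solver is unrelated to the prqfn, rqfn and rq implementations timed in Table 2, so nothing about those timings should be inferred from it.

    # A minimal sketch of a Table 1-style set of estimates on synthetic data.
    # All data-generating choices below are assumptions for illustration only.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 10_000                                   # stand-in for the 113,547 census records
    exp_ = rng.uniform(0, 40, n)                 # potential labor market experience
    educ = rng.integers(8, 21, n).astype(float)  # years of schooling
    white = rng.binomial(1, 0.8, n).astype(float)
    married = rng.binomial(1, 0.6, n).astype(float)
    # hypothetical coefficients, loosely imitating the ols column of Table 1
    y = (8.2 + 0.046 * exp_ - 0.00067 * exp_**2 + 0.09 * educ
         + 0.147 * white + 0.216 * married + rng.normal(0, 0.6, n))

    X = sm.add_constant(np.column_stack([exp_, exp_**2, educ, white, married]))
    for tau in (0.05, 0.25, 0.50, 0.75, 0.95):
        fit = sm.QuantReg(y, X).fit(q=tau)       # one quantile regression per tau
        print(f"tau = {tau:4.2f}", np.round(fit.params, 5))
    print("ols     ", np.round(sm.OLS(y, X).fit().params, 5))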
8. CONCLUSIONS

There is a compelling general case for the superiority of interior point methods over traditional simplex methods for large linear programming problems, and for large quantile regression applications in particular. We have shown that preprocessing can effectively reduce the sample size dimension of quantile regression problems from n to O_p(n^{2/3}), thereby enabling computational speed comparable to that of least squares for some large quantile regression problems. There are many possible refinements of the basic approach investigated here, but the message for that Gaussian hare who has been frolicking in the flowers, confident of victory, is clear. Laplace's old tortoise, despite the house he wears on his back to protect him from inclement statistical weather, has a few new tricks and the race is far from over.

APPENDIX

Lemma A.1. In the linear model $Y_i = x_i'\beta + u_i$, $i = 1, \ldots, n$, assume the following:

(i) $\{(x_i, Y_i),\ i = 1, \ldots, n\}$ are iid with a bounded continuous density in $\mathbb{R}^{p+1}$;
(ii) $E|x_{ij}|^p < \infty$ and $E|Y_i|^a < \infty$, for some $a > 0$.

Then the duality gap of the median regression estimator at the second best vertex exceeds $n^{-(p+5)}$ with probability tending to 1 as $n \to \infty$, and the initial duality gap $\Delta_0$ satisfies $\log \Delta_0 = O_p(\log n)$.

Proof. Let $\hat\beta$ denote the optimal median regression solution based on the data $\{(x_i, Y_i),\ i = 1, \ldots, n\}$, and let $\hat d$ denote the corresponding dual solution. Consider the duality gap at another trial solution pair $(\tilde\beta, \tilde d)$, which we can write

(A.1)
$$\Delta^* = \sum_{i=1}^n |Y_i - x_i'\tilde\beta| - Y'\tilde d
= \Big[\sum_{i=1}^n |Y_i - x_i'\tilde\beta| - \sum_{i=1}^n |Y_i - x_i'\hat\beta|\Big] + \big[Y'\hat d - Y'\tilde d\big],$$

since the duality gap at $(\hat\beta, \hat d)$ is zero. Now, as in Koenker and Bassett (1978), let $h = \{i_1, \ldots, i_p\}$ denote a subset of $N = \{1, 2, \ldots, n\}$ consisting of $p$ distinct indices. Define $X(h)$ to be the $p \times p$ matrix with rows $\{x_i' : i \in h\}$, define $Y(h)$ to be the vector with coordinates $\{Y_i : i \in h\}$ and let $\beta(h) = X^{-1}(h)Y(h)$. Now suppose $\hat h$ is the subset defining the optimal solution (i.e., $\hat\beta = \beta(\hat h)$) and that $\tilde h$ represents the vertex of the constraint set (distinct from $\hat h$) for which $\tilde\beta = \beta(\tilde h)$ is nearest to $\hat\beta$ in terms of the primal objective function.

Note that $\tilde h$ is also a subset of size $p$ and differs from $\hat h$ in exactly one element. Let $i_1$ be the (unique) index in $\hat h \cap \tilde h^c$, and let $i_2 \in \tilde h \cap \hat h^c$, where $h^c$ denotes the complement of $h$. Let $\hat r_i$ and $\tilde r_i$ denote the respective $i$th residuals for $\hat\beta$ and $\tilde\beta$, and note the following: $\hat r_i = 0$ for $i \in \hat h$; $\tilde r_i = 0$ for $i \in \tilde h$; $\mathrm{sgn}(\hat r_i) = \mathrm{sgn}(\tilde r_i)$ for $i \notin \hat h \cup \tilde h$; and $\mathrm{sgn}(\hat r_{i_2}) = -\mathrm{sgn}(\tilde r_{i_1})$. Therefore, since the dual contribution to $\Delta^*$ is positive, we can write

$$\Delta^* \ge \Delta_1^* \equiv \sum_{i \notin \tilde h} |Y_i - x_i'\tilde\beta| - \sum_{i \notin \hat h} |Y_i - x_i'\hat\beta|$$
$$= \big(Y_{i_1} - x_{i_1}'\tilde\beta\big)\,\mathrm{sgn}(\tilde r_{i_1}) - \big(Y_{i_2} - x_{i_2}'\hat\beta\big)\,\mathrm{sgn}(\hat r_{i_2}) - \sum_{i \notin \hat h \cup \tilde h} x_i'\tilde\beta\,\mathrm{sgn}(\tilde r_i) + \sum_{i \notin \hat h \cup \tilde h} x_i'\hat\beta\,\mathrm{sgn}(\hat r_i)$$

(A.2)
$$= \big(Y_{i_1} + Y_{i_2}\big)\,\mathrm{sgn}(\tilde r_{i_1}) - \Big(\sum_{i \notin \tilde h} \mathrm{sgn}(\tilde r_i)\, x_i'\Big) X^{-1}(\tilde h)Y(\tilde h) + \Big(\sum_{i \notin \hat h} \mathrm{sgn}(\hat r_i)\, x_i'\Big) X^{-1}(\hat h)Y(\hat h) \equiv \sum_{j=1}^{p+1} b_j Y_{i_j},$$

where, for $j = 2, \ldots, p$,

(A.3)
$$b_1 = \Big[\mathrm{sgn}(\tilde r_{i_1}) - \Big(\sum_{i \notin \tilde h} \mathrm{sgn}(\tilde r_i)\, x_i'\Big) X^{-1}(\tilde h)\Big]_1,$$
$$b_2 = \Big[\mathrm{sgn}(\tilde r_{i_1}) - \Big(\sum_{i \notin \hat h} \mathrm{sgn}(\hat r_i)\, x_i'\Big) X^{-1}(\hat h)\Big]_1,$$
$$b_{j+1} = \Big[\Big(\sum_{i \notin \hat h} \mathrm{sgn}(\hat r_i)\, x_i'\Big) X^{-1}(\hat h)\Big]_j - \Big[\Big(\sum_{i \notin \tilde h} \mathrm{sgn}(\tilde r_i)\, x_i'\Big) X^{-1}(\tilde h)\Big]_j.$$

Now let $\mathcal{H}$ be the set of all pairs $(h, h^*)$, where $h$ and $h^*$ are $p$-element subsets of indices that differ in exactly one element. Note that there are $\binom{n}{p+1}\binom{p+1}{2}$ such pairs. For $(h, h^*) \in \mathcal{H}$, let $\Delta_1(h, h^*) = \sum_{j=1}^{p+1} b_j Y_{i_j}$, where $\{b_j\}$ are defined for arbitrary $(h, h^*) \in \mathcal{H}$ as in (A.3) and where $\{i_j\}$ ranges over $h \cup h^*$. Then, clearly,

(A.4)
$$\Delta_1^* \ge \inf\big\{\Delta_1(h, h^*) : (h, h^*) \in \mathcal{H}\big\}.$$

Now fix an arbitrary pair $(h, h^*) \in \mathcal{H}$,

(A.5)
$$P\Big(|\Delta_1(h, h^*)| \le \frac{1}{n^{p+5}}\Big) \le P\Big(|b_j| \le \frac{1}{n^3},\ j = 2, \ldots, p+1\Big) + P\Big(\text{for some } j:\ |b_j| > \frac{1}{n^3}\ \wedge\ \Big|Y_{i_j} + \sum_{k \ne j} \frac{b_k}{b_j}\, Y_{i_k}\Big| \le \frac{1}{n^{p+2}}\Big) \equiv P_A + P_B.$$

Now note that $P_A$ is the probability that the vector whose coordinates are the "sum over $i \notin \hat h$" terms in $(b_2, \ldots, b_{p+1})$ (see (A.3)) lies in a specified cube with sides of length $2/n^3$ (centered at the other terms in $b_j$). That is, if $\mathcal{C}$ denotes the set of all cubes in $\mathbb{R}^p$ with sides of length $2/n^3$, then

(A.6)
$$P_A \le \sup\Big\{P\Big(\Big(\sum_{i \notin h} \mathrm{sgn}(r_i)\, x_i'\Big) X^{-1}(h) \in C\Big) : C \in \mathcal{C}\Big\}.$$

Now, multiplying the condition in $P_A$ through by $X(h)$ and noting that the density of the sum in $P_A$ is bounded by the density of a single $x_i$, (A.6) becomes

(A.7)
$$P_A \le E_h\big[a_0\, \mathrm{Vol}(C)\, \det(X(h))\big] \le a_1 \Big(\frac{2}{n^3}\Big)^p \le \frac{a_2}{n^{p+2}},$$

where $a_0$, $a_1$ and $a_2$ are constants, $\det(X(h))$ is bounded by the moment condition on $\{x_i\}$ and, for $p \ge 1$, $3p \ge p + 2$.

Similarly, $P_B$ is bounded by

(A.8)
$$P_B \le \sup_t P\big(Y_{i_j} \in [t,\, t + n^{-(p+2)}]\big) \le \frac{a_3}{n^{p+2}}$$

(where $a_3$ is a bound on the density of $Y_i$). Thus, from (A.5), (A.7) and (A.8),

(A.9)
$$P\big(\Delta_1^* \le n^{-(p+5)}\big) \le \binom{n}{p+1}\binom{p+1}{2}\, \frac{a_2 + a_3}{n^{p+2}} \to 0.$$

That is, $\Delta_1^* > n^{-(p+5)}$ with probability tending to 1.
Finally, taking the initial $\beta$ and $d$ to be 0, the initial duality gap is bounded by $\Delta_0 = \sum_{i=1}^n |Y_i|$. However, a simple Chebyshev-type inequality based on the moment condition ($E|Y_i|^a < +\infty$) yields

$$P\Big(\sum_{i=1}^n |Y_i| \ge n^{1+2/a}\Big) \le n\, P\big(|Y_1| \ge n^{2/a}\big) \le \frac{n\, E|Y_1|^a}{n^2} \to 0.$$

Thus $\Delta_0 \le n^{1+2/a}$ with probability tending to 1, and hence $\log \Delta_0 = O_p(\log n)$, completing the proof.
ACKNOWLEDGMENTS

This research has been partially supported by NSF Grant SBR-93-20555. Most of the computing carried out in the course of this research was conducted on the Sparc 20 ragnar.econ.uiuc.edu in the Econometrics Lab at the University of Illinois and primarily supported by NSF Grant SBR-95-12440.

The authors would like to thank Kevin Hallock for providing the census data used in Section 7, and Marc Meketon, Mike Osborne, Gib Bassett and two referees for useful comments in the course of this research. We are, of course, solely responsible for any errors.
REFERENCES

Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S. and Sorensen, D. (1995). LAPACK Users' Guide. SIAM, Philadelphia.
Barrodale, I. and Roberts, F. D. K. (1974). Solution of an overdetermined system of equations in the ℓ1 norm. Communications of the ACM 17 319–320.
Bartels, R. and Conn, A. (1980). Linearly constrained discrete ℓ1 problems. ACM Trans. Math. Software 6 594–608.
Bloomfield, P. and Steiger, W. L. (1983). Least Absolute Deviations: Theory, Applications, and Algorithms. Birkhäuser, Boston.
Buchinsky, M. (1994). Changes in US wage structure 1963–87: an application of quantile regression. Econometrica 62 405–458.
Buchinsky, M. (1995). Quantile regression, the Box–Cox transformation model and U.S. wage structure 1963–1987. J. Econometrics 65 109–154.
Chamberlain, G. (1994). Quantile regression, censoring and the structure of wages. In Advances in Econometrics (C. Sims, ed.). North-Holland, Amsterdam.
Chambers, J. M. (1992). Linear models. In Statistical Models in S (J. M. Chambers and T. J. Hastie, eds.) 95–144. Wadsworth, Pacific Grove, CA.
Charnes, A., Cooper, W. W. and Ferguson, R. O. (1955). Optimal estimation of executive compensation by linear programming. Management Science 1 138–151.
Chaudhuri, P. (1992). Generalized regression quantiles. In Proceedings of the Second Conference on Data Analysis Based on the L1 Norm and Related Methods 169–186. North-Holland, Amsterdam.
Chen, S. and Donoho, D. L. (1995). Atomic decomposition by basis pursuit. SIAM J. Sci. Stat. Comp. To appear.
Dikin, I. I. (1967). Iterative solution of problems of linear and quadratic programming. Soviet Math. Dokl. 8 674–675.
Edgeworth, F. Y. (1887). On observations relating to several quantities. Hermathena 6 279–285.
Edgeworth, F. Y. (1888). On a new method of reducing observations relating to several quantities. Philosophical Magazine 25 184–191.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.
Fiacco, A. V. and McCormick, G. P. (1968). Nonlinear Programming: Sequential Unconstrained Minimization Techniques. Wiley, New York.
Floyd, R. W. and Rivest, R. L. (1975). Expected time bounds for selection. Communications of the ACM 18 165–173.
Frisch, R. (1956). La résolution des problèmes de programme linéaire par la méthode du potentiel logarithmique. Cahiers du Séminaire d'Econometrie 4 7–20.
Gauss, C. F. (1821). Theoria combinationis observationum erroribus minimis obnoxiae: pars prior. [Translated (1995) by G. W. Stewart as Theory of the Combination of Observations Least Subject to Error. SIAM, Philadelphia.]
Gill, P., Murray, W., Saunders, M., Tomlin, J. and Wright, M. (1986). On projected Newton barrier methods for linear programming and an equivalence to Karmarkar's projective method. Math. Programming 36 183–209.
Gonzaga, C. C. (1992). Path-following methods for linear programming. SIAM Rev. 34 167–224.
Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. Chapman and Hall, London.
Gutenbrunner, C. and Jurečková, J. (1992). Regression quantile and regression rank score process in the linear model and derived statistics. Ann. Statist. 20 305–330.
Gutenbrunner, C., Jurečková, J., Koenker, R. and Portnoy, S. (1993). Tests of linear hypotheses based on regression rank scores. J. Nonparametric Statist. 2 307–333.
Hall, P. and Sheather, S. (1988). On the distribution of a studentized quantile. J. Roy. Statist. Soc. Ser. B 50 381–391.
Karmarkar, N. (1984). A new polynomial time algorithm for linear programming. Combinatorica 4 373–395.
Koenker, R. (1994). Confidence intervals for regression quantiles. In Asymptotic Statistics, Proceedings of the Fifth Prague Symposium (P. Mandl and M. Hušková, eds.) 349–359. Springer, Heidelberg.
Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica 46 33–50.
Koenker, R. and d'Orey, V. (1987). Computing regression quantiles. J. Roy. Statist. Soc. Ser. C 36 383–393.
Koenker, R. and d'Orey, V. (1993). Computing dual regression quantiles and regression rank scores. J. Roy. Statist. Soc. Ser. C 43 410–414.
Koenker, R., Ng, P. and Portnoy, S. (1994). Quantile smoothing splines. Biometrika 81 673–680.
Laplace, P.-S. (1789). Sur quelques points du système du monde. Mémoires de l'Académie des Sciences de Paris. (Reprinted in Œuvres complètes 11 475–558. Gauthier-Villars, Paris.)
Lustig, I. J., Marsten, R. E. and Shanno, D. F. (1992). On implementing Mehrotra's predictor–corrector interior-point method for linear programming. SIAM J. Optim. 2 435–449.
Lustig, I. J., Marsten, R. E. and Shanno, D. F. (1994). Interior point methods for linear programming: computational state of the art (with discussion). ORSA J. Comput. 6 1–36.
Manning, W., Blumberg, L. and Moulton, L. H. (1995). The demand for alcohol: the differential response to price. J. Health Economics 14 123–148.
Mehrotra, S. (1992). On the implementation of a primal–dual interior point method. SIAM J. Optim. 2 575–601.
Meketon, M. S. (1986). Least absolute value regression. Technical report, Bell Labs, Holmdel, NJ.
Mizuno, S., Todd, M. J. and Ye, Y. (1993). On adaptive-step primal–dual interior point algorithms for linear programming. Math. Oper. Res. 18 964–981.
Oja, H. (1983). Descriptive statistics for multivariate distributions. Statist. Probab. Lett. 1 327–332.
Portnoy, S. (1991). Asymptotic behavior of the number of regression quantile breakpoints. SIAM J. Sci. Statist. Comput. 12 867–883.
Powell, J. L. (1986). Censored regression quantiles. J. Econometrics 32 143–155.
Renegar, J. (1988). A polynomial-time algorithm based on Newton's method for linear programming. Math. Programming 40 59–93.
Shamir, R. (1993). Probabilistic analysis in linear programming. Statist. Sci. 8 57–64.
Siddiqui, M. (1960). Distribution of quantiles in samples from a bivariate population. J. Res. Nat. Bur. Stand. B 64 145–150.
Sonnevend, G., Stoer, J. and Zhao, G. (1991). On the complexity of following the central path of linear programs by linear extrapolation II. Math. Programming 52 527–553.
Stigler, S. M. (1984). Boscovich, Simpson and a 1760 manuscript note on fitting a linear relation. Biometrika 71 615–620.
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard Univ. Press.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
Vanderbei, R. J., Meketon, M. S. and Freedman, B. A. (1986). A modification of Karmarkar's linear programming algorithm. Algorithmica 1 395–407.
Wagner, H. M. (1959). Linear programming techniques for regression analysis. J. Amer. Statist. Assoc. 54 206–212.
Welsh, A. H. (1996). Robust estimation of smooth regression and spread functions and their derivatives. Statist. Sinica 6 347–366.
Wright, M. H. (1992). Interior methods for constrained optimization. Acta Numerica 1 341–407.
Zhang, Y. (1992). Primal–dual interior point approach for computing ℓ1-solutions and ℓ∞-solutions of overdetermined linear systems. J. Optim. Theory Appl. 77 323–341.

Comment
Ronald A. Thisted

Ronald A. Thisted is Professor, Departments of Statistics, Health Studies, and Anesthesia and Critical Care, University of Chicago, Chicago, Illinois 60637 (e-mail: thisted@galton.uchicago.edu).

1. INTRODUCTION

There are few papers on statistical computation that deserve to be described as "fabulous," but surely this is one. It contains a number of significant contributions, both to the practice of statistical computation and to the ways in which we think about the difficulty of computational problems that are relevant to data analysis. While absolute-error estimation is formally equivalent to linear programming, it is refreshing to see computational advances in this area that focus on specifically statistical applications, since those applications often have quite different features from a "typical" linear programming problem viewed from the operations research perspective. While the mathematical structure of quantile regression can be reduced to the same structure that is required to maximize profits for an airline given the constraints of equipment, crew, bookings and so on, the practical issues that arise in the two contexts are actually quite different. By concentrating on the statistical aspects, Portnoy and Koenker have produced real computational advances specific to the statistical problem of quantile regression.

This article also illustrates how valuable it can be to switch from a statistical point of view to a numerical analyst's point of view, and then back. In many ways, this interplay of statistical and computational approaches is reminiscent of the gains that integrating a dual problem with a primal algorithm can produce.

An important feature of this work is that it brings attention to the primal–dual formulation of the quantile regression problem, which has the useful feature that it provides a natural measure of convergence for the computation. The "duality gap," that is, the difference between the current value of the objective function being minimized in the primal problem and the value of the objective function being maximized by the dual problem, shrinks to zero at an optimal solution.
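The arithmetic of this convergence measure is easy to exhibit. The sketch below, a minimal numpy illustration rather than anything in the authors' code, uses the median regression case: the primal objective at any trial β is an upper bound on the optimum, y'd at any d with X'd = 0 and −1 ≤ d_i ≤ 1 is a lower bound, and their difference is the duality gap, nonnegative by weak duality.

    # Duality gap for median (l1) regression.
    # Primal:  min_b  sum_i |y_i - x_i'b|
    # Dual:    max_d  y'd  subject to  X'd = 0,  -1 <= d_i <= 1.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 500, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_cauchy(n)

    def primal_value(b):
        # upper bound: the l1 objective at any trial coefficient vector
        return np.abs(y - X @ b).sum()

    def feasible_dual(d0):
        # project d0 onto {d : X'd = 0} (a least squares residual),
        # then shrink into the box [-1, 1]^n; the result is dual feasible
        d = d0 - X @ np.linalg.lstsq(X, d0, rcond=None)[0]
        return d / max(1.0, np.abs(d).max())

    b_try = np.linalg.lstsq(X, y, rcond=None)[0]     # a crude primal point
    d_try = feasible_dual(np.sign(y - X @ b_try))    # a feasible dual point
    gap = primal_value(b_try) - y @ d_try            # >= 0 by weak duality
    print(primal_value(b_try), y @ d_try, gap)

As the primal and dual iterates improve, the two bounds squeeze together, and the gap provides a stopping criterion that requires no knowledge of the (unknown) optimal value.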
For considering the performance of algorithms in statistical contexts, the notion of average-case as opposed to worst-case performance is an appealing one. This contrasts with much work on algorithmic complexity investigations in computer science, which tend to focus on the latter rather than the former. Indeed, the notion of average-case performance of an algorithm that is, for instance, incorporated into a statistical package, is one that should be appealing to frequentists and Bayesians alike—albeit for different reasons.

In the remainder of my comments, I shall address this aspect of the paper in a bit more detail, and shall also propose some attractive lines for additional research.

2. AVERAGE-CASE VERSUS WORST-CASE PERFORMANCE

Floyd and Rivest's work on algorithms for computing quantiles (Floyd and Rivest, 1975) is the earliest example of which I am aware of using average-time performance of an algorithm to assess its computational complexity in a formal way. What has become clear is that if, in fact, average performance is of great practical importance, there are large gains in computing time which can be realized by using algorithms that occasionally may do worse than their "optimal" counterparts. The Floyd–Rivest selection algorithm is a good example.

The Floyd–Rivest idea for selecting an empirical quantile involves a preprocessing step. In this step, confidence intervals are calculated from a subset of the data, say the first m points in the data set, in order to (provisionally) exclude points from the computation which are quite unlikely to be near the quantile of interest. Occasionally the true quantile will fall outside the confidence interval, requiring the computation to backtrack after the error is recognized. Using this idea, however, requires that the first m data points constitute a random sample of the entire data. In the context of quantile regression (of which quantile selection is a very special case), this objective can be achieved by an initial randomization step (as the authors note).
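In the scalar case the idea can be sketched in a few lines. The subsample size m = n^(2/3) and the two standard deviation band below are illustrative guesses, not the tuned constants of the authors' prqfn; agreement with np.quantile in the final line is up to its interpolation convention.

    # Confidence-interval preprocessing for a scalar quantile: bracket the
    # target from a subsample, discard points that cannot matter, and fall
    # back to the slow path if the bracket fails.
    import numpy as np

    def quantile_with_preprocessing(y, tau, rng):
        n = len(y)
        m = int(n ** (2 / 3))
        sub = rng.choice(y, size=m, replace=False)   # the required randomization
        # order-statistic confidence band around the tau-th sample quantile
        band = 2 * np.sqrt(tau * (1 - tau) / m)
        lo = np.quantile(sub, max(tau - band, 0.0))
        hi = np.quantile(sub, min(tau + band, 1.0))
        below = np.count_nonzero(y < lo)             # "globbed": only counted
        inside = y[(y >= lo) & (y <= hi)]
        k = int(np.ceil(tau * n)) - below            # rank of target inside band
        if not (1 <= k <= len(inside)):
            return np.quantile(y, tau)               # bracket failed: backtrack
        return np.partition(inside, k - 1)[k - 1]    # k-th smallest inside band

    rng = np.random.default_rng(2)
    y = rng.lognormal(size=1_000_000)
    print(quantile_with_preprocessing(y, 0.95, rng), np.quantile(y, 0.95))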
By selecting a random m of n elements to occupy the first m data positions, the randomization step adds O(m) = O(n^{2/3}) to the computation, which of course does not affect the asymptotics presented in the paper—provided that only one pass through the data needs to be made. Unfortunately, very large data sets cannot always fit into random access memory, which increases both the computational and the data-management complexity of the problem.

Given the sloth to which the human (and the hare) both are wont, many implementors of the techniques described here will simply omit implementation of the randomization step. For this reason, it would be instructive to know what happens to rqfn when it is applied to a large data set in which the signed residuals obtained from the final fit are already sorted. This corresponds to the worst-case scenario for Floyd–Rivest, and in this unfavorable situation their algorithm performs quite badly. If this does cause problems for rqfn, just how "nonrandom" must the data be in order for an approach such as rq once again to become competitive?

Statistical decision theorists will recognize the discussion above as a variant on the minimax versus Bayesian decision framework. If the worst-case distribution of data points is a real (or even a likely) possibility, then the best algorithm is one which works well against this least-favorable (prior) distribution. On the other hand, if the data can be thought of as a random sample from a particular distribution, then the best algorithm is one which works well for most realizations from that (prior) distribution.

The gains to be made from the preprocessing step are impressive, but they are based on Gaussian, not Laplacian distributions. The empirical investigations described by the authors use data in which the errors are also Gaussian, guaranteeing the applicability of their asymptotics. It would be worthwhile to repeat the series of experiments reported here using two alternative data distributions with non-Gaussian shapes. First, a contaminated normal, for example, (1 − α)N(0, 1²) + αN(0, 3²), might be expected to produce globs which are smaller than the Gaussian calculations would suggest. Second, a lognormal would introduce the treble features of failing to have moments, being highly asymmetric and actually being representative of some data sets which we might actually see (in economics, for instance). Tortoiselike plodding in this direction might be fruitful indeed in helping us to appreciate the limitations (or the robustness!) of the preprocessing included in the prqfn approach.

3. BATCH PROCESSING

Whenever I wish to calculate a quantile regression function for a data set, I am usually interested in obtaining several quantiles at once. Are there gains that can be achieved by performing the calculations simultaneously for several choices of τ, instead of repeating the entire algorithm for each? The most promising venue for exploring this question appears to be in the globbing phase of the preprocessing. If, for instance, we are interested in τ ∈ {0.01, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99}, it would seem that I could first solve the τ = 0.01 problem, and then automatically include all of the data points whose residuals fell below the first percentile
in the lower glob for assessing the 99th percentile problem. I would then alternate between high and low choices for τ, perhaps decreasing m after every second iteration of this process.
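The scalar analogue of this reuse is pure rank bookkeeping, as the following sketch shows; np.partition stands in for the regression computation, and the "lower glob" is just a count. This is an illustration of the proposal, not an implemented feature of the authors' code.

    # Once the 0.01 quantile is known, everything below it can sit in a
    # "lower glob" (a single count) when the 0.99 quantile is assessed.
    import numpy as np

    rng = np.random.default_rng(3)
    y = rng.normal(size=1_000_000)
    n = y.size

    j = int(0.01 * n)                        # 0-based index of the tau = 0.01 order statistic
    q01 = np.partition(y, j)[j]              # solve the low quantile first
    lower_glob = np.count_nonzero(y < q01)   # these points are only counted from now on
    kept = y[y >= q01]                       # points still carried explicitly

    k = int(np.ceil(0.99 * n)) - lower_glob  # rank of the 0.99 quantile within kept
    q99 = np.partition(kept, k - 1)[k - 1]

    k_full = int(np.ceil(0.99 * n))
    print(q99 == np.partition(y, k_full - 1)[k_full - 1])  # True: same order statistic

In the regression setting the glob would carry aggregated rows of the design matrix rather than a bare count, but the rank arithmetic that lets one τ's solution prune another τ's problem is the same.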
4. CONCLUSION

The results presented here should provide impetus to revise the standard reference works in statistical computation (such as Kennedy and Gentle, 1980; Press, Flannery, Teukolsky and Vetterling, 1986; Thisted, 1988), several of which discuss the difficulties of L1-methods. This paper also opens the door to potentially large areas of fruitful research in statistical computing. The authors are commended for accelerating the pace of this research by making their computer code available on the World Wide Web.

Comment
M. R. Osborne

M. R. Osborne is staff member, Centre for Mathematics and its Applications, Australian National University, Canberra ACT 0200, Australia.

This is an interesting and unusual paper stylishly written in a manner well-reflected in the title. I trust it finds a wide readership. The authors indicate that there is considerable opportunity for further application of their ideas.

The paper presents two main themes:

1. a case for the use of interior point methods instead of the more usual simplicial style of algorithm here identified with Barrodale and Roberts's LP-based algorithm as implemented in S-PLUS; other alternatives to the simplicial style methods have been championed recently (see Osborne and Watson, 1996);
2. an argument for "preconditioning" the calculation by tentatively classifying residuals predicted not to be zero in the final solution and aggregating their contribution to the necessary conditions; there is no reason why this step cannot be applied to methods other than interior point methods.

I have reservations about the case for the use of interior point methods, although not necessarily about the conclusions. These reservations are as follows:

1. Exponential worst case behavior of the simplex method is unusual. The examples I know can all be classified as very badly scaled. Quite a deal of work has gone into computing average case behavior, and this tends to give a very different picture. Given the general stochastic bias of the development here, it is a little surprising that this aspect is not referenced.
2. There is additional structure in the quantile problem over and above the generic LP. This comes from the special interval constrained form of the dual problem. This allows one simplex step to move off one bound constraint to its opposite bound, and this means that the new basic solution can be written down without further calculation. This pattern can occur in sequences of consecutive steps. This sequence is actually a linesearch step in other formulations (Osborne, 1985). It can be computed by the fast median algorithm of Bloomfield and Steiger, for example. The Barrodale–Roberts approach is equivalent to using a comparison sort in this context and seems already sufficient to explain the O(n²) behavior observed. Recently, Osborne and Watson (1996) have observed that the secant algorithm can be applied here and interpreted as an alternative to the usual median of three partitioning in the fast median computation. The improvement over Bloomfield and Steiger can be staggering in problems which arise in fitting a deterministic model in the presence of noise. For the record, the code distributed by Bartels, Conn and Sinclair used a heap sort in the linesearch implementation and was perhaps the first to improve on the O(n²) asymptotics. It would seem to be time that S-PLUS used a more modern implementation. (A sketch of the weighted-median linesearch step appears after this comment.)
3. There is at least some folklore concerning the inferior performance of interior point methods when compared with simplex-style methods in postoptimality computations. However, this is the type of computation employed when studying the behavior of regression quantiles as a function of the quantile parameter.
4. Primal–dual interior point methods have some question marks regarding their complete numerical stability when nonuniqueness or degeneracy occurs. That this potential trouble is on the cards is well documented in the original Bassett and Koenker paper for the stackloss data set. The robustness possible with piecewise linear continuation methods is documented in my paper in JIMA Numerical Analysis of some years ago.

Disclaimer. Constraints of time and vagaries of the mail service have meant this discussion had to be prepared between Sydney and Singapore, and after an excellent dinner. Unintentionally, claims made may be stronger than would have been the case if a better vehicle than memory were available.
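To make the linesearch remark in reservation 2 concrete: along a direction δ the ℓ1 objective is piecewise linear in the step length, and its exact minimizer is a weighted median of the residual breakpoints. The sketch below is illustrative only, not the Bloomfield–Steiger or Osborne–Watson code, and it sorts for brevity where an expected O(n) selection algorithm would be used in earnest.

    # Exact l1 linesearch: minimize sum_i |y_i - x_i'(b + t*delta)| over t.
    # With r = y - Xb and g = X @ delta, the objective equals
    # sum_i |g_i| * |t - r_i/g_i| plus a constant, a weighted median problem.
    import numpy as np

    def weighted_median(values, weights):
        # minimizer over t of sum_i weights_i * |t - values_i|
        order = np.argsort(values)
        v, w = values[order], weights[order]
        cum = np.cumsum(w)
        return v[np.searchsorted(cum, 0.5 * cum[-1])]

    def l1_linesearch(y, X, b, direction):
        g = X @ direction
        ok = np.abs(g) > 1e-12                      # ignore unaffected residuals
        t = (y - X @ b)[ok] / g[ok]                 # breakpoints of the objective
        return weighted_median(t, np.abs(g[ok]))    # exact l1-optimal step

    rng = np.random.default_rng(4)
    X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
    step = l1_linesearch(y, X, np.zeros(3), np.array([1.0, 0.0, 0.0]))
    print(step)   # step along e1 minimizing sum |y - X*(t*e1)|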

Rejoinder
Stephen Portnoy and Roger Koenker

We would like to begin by thanking the discussants for their encouraging comments as well as expressing our appreciation to the Editor, Paul Switzer, for organizing the discussion. We certainly share Ron Thisted's hope that this work may induce others to reevaluate the frequently lamented computational burden of ℓ1-methods, and thereby gradually expand the domain of applicability for quantile regression and related methods.

There are many pathways left to explore. As Mike Osborne notes, there are significant potential improvements possible in the simplex approach. It is indeed remarkable that the algorithm by Barrodale and Roberts is still the vehicle of choice among most statistically minded tortoises 25 years after its appearance. Our preprocessing strategy provides a very effective way of speeding up the simplex approach as well. In fact, it was only after we found this approach unsatisfactory for very large n and p that we began to explore interior point alternatives to simplex.

Both discussants comment on the importance of effective postoptimality analysis. In several earlier papers we have emphasized the value of estimation and inference methods based on the entire primal and dual quantile regression processes. As we have noted, these processes can be computed with O_p(n log n) simplex steps, starting from any initially optimal basic solution. However, in large problems it may suffice to compute the process β̂(τ) or its dual counterpart on some prespecified grid. In such cases, it seems reasonable to explore interior point strategies for moving from one τ to the next, in effect, tunneling back through the interior rather than traversing from one vertex to another on the exterior of the constraint set. For n large this may be significantly quicker. As Osborne notes, there have been some doubts raised about interior point methods for postoptimality analysis. However, recent work, notably Monteiro and Mehrotra (1996), appears more promising. Thisted's suggestions for adapting our preprocessing approach for postoptimality analysis are worth pursuing since the quantile regression solution at any given τ is clearly informative about other solutions at nearby τ.

Following Thisted's comments, some experimentation was done to explore the consequences of nonnormal distributions. We considered Cauchy response and design variables—a setting where the random mechanism underlying globbing may be expected to fail, and also lognormal distributions. Approximate ratios of timings to those for normal cases appear in Table 1. Cauchy disturbances appear to degrade performance somewhat for large n and modest p, but asymmetry has negligible effect. Other informal experimentation indicates little effect for distributions less extreme than Cauchy, although a more systematic study of the adaptive choice of the tuning constants of the algorithm may have some value in improving performance for Cauchy-like samples.

Table 1
Ratios of timings to those for normal samples

        Cauchy                       Lognormal
        n = 100,000   n = 50,000     n = 100,000   n = 50,000
p = 8   1.34          1.07           1.09          1.01
p = 4   1.75          0.82           1.05          0.89
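The design of this experiment is straightforward to emulate. The sketch below is schematic only: the statsmodels QuantReg solver stands in for the prqfn and rqfn codes, so it reproduces the structure of Table 1 rather than its numbers, and heavy-tailed designs may make the fit noticeably slower.

    # Time a median regression fit on Cauchy and lognormal data and report
    # the ratio to the Gaussian case, in the spirit of Table 1.
    import time
    import numpy as np
    import statsmodels.api as sm

    def fit_time(dist, n=50_000, p=4, seed=0):
        rng = np.random.default_rng(seed)
        if dist == "cauchy":                 # heavy tails in design and response
            X = rng.standard_cauchy((n, p))
            u = rng.standard_cauchy(n)
        elif dist == "lognormal":            # asymmetric errors and design
            X = rng.lognormal(size=(n, p))
            u = rng.lognormal(size=n)
        else:
            X = rng.normal(size=(n, p))
            u = rng.normal(size=n)
        y = X.sum(axis=1) + u
        start = time.perf_counter()
        sm.QuantReg(y, sm.add_constant(X)).fit(q=0.5)
        return time.perf_counter() - start

    base = fit_time("normal")
    for dist in ("cauchy", "lognormal"):
        print(dist, round(fit_time(dist) / base, 2))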
Mike Osborne raises the question of the effect of degeneracy on the performance of the algorithm. Because degeneracy is a serious potential problem for exterior point methods, there has been considerable attention devoted to it in the interior point literature. Güler, den Hertog, Roos and Terlaky (1993) provide an excellent survey of this topic. Since primal and dual degeneracy involve extreme points (vertices) of the primal and dual constraint sets, respectively, there is reason to believe that interior point methods may be less sensitive to degeneracy than simplex. This has been our experience in some limited experiments, but further investigation is definitely warranted.

Thinking about degeneracy leads naturally, in the theology of linear programming at least, to the subject of purification. Under degeneracy most interior point methods converge to a point on the relative interior of the solution set, thus apparently complicating any attempt to "purify" an interior point solution by finding a nearby vertex solution. Whether effective purification strategies can be devised to combine interior and exterior point approaches remains a subject of intense research interest and may eventually yield further hope for the Laplacian tortoise.

ADDITIONAL REFERENCES

Güler, O., den Hertog, D., Roos, C. and Terlaky, T. (1993). Degeneracy in interior point methods for linear programming: a survey. Ann. Oper. Res. 46 107–138.
Kennedy, W. and Gentle, J. E., Jr. (1980). Statistical Computing. Dekker, New York.
Monteiro, R. D. C. and Mehrotra, S. (1996). A general parametric analysis approach and its implications to sensitivity analysis in interior point methods. Math. Programming 72 65–82.
Osborne, M. R. (1985). Finite Algorithms in Optimization and Data Analysis. Wiley, New York.
Osborne, M. R. and Watson, G. A. (1996). Aspects of M-estimation and ℓ1 fitting. In Numerical Analysis (D. F. Griffiths and G. A. Watson, eds.). World Scientific, Singapore.
Press, W., Flannery, B., Teukolsky, S. and Vetterling, W. (1986). Numerical Recipes: The Art of Scientific Computing. Cambridge Univ. Press.
Thisted, R. A. (1988). Elements of Statistical Computing. Chapman and Hall, London.
