STAN-LCS-106
UC-405

INTERPRETABLE PROJECTION PURSUIT*

Sally Claire Morton

Stanford University
October 1989

*Ph.D. Dissertation
Abstract

This thesis modifies projection pursuit by trading accuracy for interpretability. The modification produces a more parsimonious and understandable description of the data without sacrificing the structure which projection pursuit seeks. Following an introduction, the modification is developed for exploratory projection pursuit and projection pursuit regression respectively. In the former case, the interpretability index is motivated by the varimax and entropy rotation criteria of factor analysis, and the projection index must be replaced by a rotationally invariant one; only slight alterations of the original exploratory projection pursuit algorithm are required. In the latter case, a weighting of accuracy and interpretability, analogous to a roughness penalty which trades goodness-of-fit for smoothness, is optimized, and the connections with established models are described. The computational algorithms for both interpretable techniques are detailed, and examples illustrate the results. The modification as applied to linear regression is shown to be analogous to a nonlinear continuous version of variable selection techniques. Possible extensions to other data analysis methods are cited and avenues for future research are identified. The conclusion considers the tradeoff between accuracy and interpretability in general; an example of calculating the binwidth for a histogram illustrates the applicability of the tradeoff.
Acknowledgments

I am grateful to my principal advisor for his guidance and enthusiasm, and to my secondary advisor, Persi Diaconis. I also thank Kirk Cameron, Tom DiCiccio, David Draper, Brad Efron, Michael Martin, Daryl Pregibon, John Rolph, Joe Romano, Anne Sheehy, David Siegmund, and Hal Stern; and my friends: Mark Barnett, Ginger Gordon, Arla Haggerty, Mark Knowles, Curt Lasher, Holly LeCount, Alice Lundin, and Michele Marincovich.

This work was supported in part by the Department of Energy, Grant DE-AC03-76F00515.
Table of Contents

Abstract  iii
Acknowledgments  v

Introduction  1

1. Interpretable Exploratory Projection Pursuit  4
   1.1 The Original Exploratory Projection Pursuit Technique  4
       1.1.1 Introduction  4
   1.2 The Interpretable Exploratory Projection Pursuit Approach  14
       1.2.1 A Combinatorial Strategy  14
       1.2.2 A Numerical Optimization Strategy  15
   1.3 The Interpretability Index  17
       1.3.1 Factor Analysis Background  17
       1.3.2 The Varimax Index  19
       1.3.3 The Entropy Index  22
   1.4 The Algorithm  26
       1.4.1 Rotational Invariance of the Projection Index  28
       1.4.2 The Fourier Index  29
       1.4.3 Projection Axes Restriction  35
       1.4.4 Comparison With Factor Analysis  37
   1.5 Examples  42
2. Interpretable Projection Pursuit Regression  57
   2.1 The Projection Pursuit Regression Technique  57
       2.1.1 Introduction  58
       2.1.2 The Algorithm  60
       2.1.3 Model Selection Strategy  62
   2.2 The Interpretable Projection Pursuit Regression Approach  62
       2.2.1 The Interpretability Index  63
       2.2.2 Attempts to Include the Number of Terms  65
       2.2.3 The Optimization Procedure  68
   2.3 The Air Pollution Example  73
3. Connections and Conclusions  82
   3.1 Interpretable Linear Regression  82
   3.2 Comparison With Ridge Regression  86
   3.3 Interpretability as a Prior  88
   3.4 Future Work  90
       3.4.1 Further …  90
       3.4.2 Extensions  91
       3.4.3 Algorithmic Improvements  92
   3.5 Example and Conclusion  92
       3.5.1 Example  93
       3.5.2 Conclusion  97
Appendix A  99
   A.1 Interpretable Exploratory Projection Pursuit Gradients  99
   A.2 Interpretable Projection Pursuit Regression Gradients  102
References  104
Figure Captions

[1.1] Most structured projection of the automobile data.  12
[1.2] Varimax interpretability index for q = 1, p = 2.  20
[1.3] Varimax interpretability index for q = 1, p = 3.  21
[1.4] Varimax interpretability index contours for q = 1, p = 3.  21
[1.5] Simulated data.  43
[1.6] Structure and interpretability indices versus λ for the simulated data.  44
[1.7] Histograms of the projected simulated data for various values of λ.  45
[1.8] Most structured Fourier projection of the automobile data.  47
[1.9] Most structured Legendre projection of the automobile data, replotted.  48
[1.10] Structure and interpretability indices versus λ for the automobile data.  49
[1.11] Scatterplots of the interpretable solutions for the automobile data.  50
[1.12] Parameter paths for the automobile data.  53
[1.13] Country symbol scatterplot of the automobile data.  55
[2.1] Fraction of unexplained variance versus number of terms for the air pollution data.  74
[2.2] Model paths for the air pollution data for models with number of terms m = 1, ..., 6.  76
[2.3] Model paths for the air pollution data for models with number of terms m = 7, 8, 9.  77
[2.4] Draftsman's display for the air pollution data.  79
[3.1] Interpretable linear regression example.  84
[3.2] Interpretability …  89
[3.3] IMSE …  96

List of Tables

[1.1] Most structured Legendre plane index values for the automobile data.  29
[1.2] Linear combinations for the automobile data.  51
[1.3] Abbreviated linear combinations for the automobile data.  52
Introduction

Projection pursuit methods include exploratory projection pursuit (Friedman 1987) and projection pursuit regression (Friedman and Stuetzle 1981). The former is an exploratory procedure which produces a description of a group of variables. The latter is a formal modeling procedure which determines the relationship of a dependent variable to a set of predictors. The common outcome of all projection pursuit methods is a collection of linear combinations of the original variables; the description or model is nonlinear in these combinations, and is summarized pictorially by a graphic representation of the projected data points.

In each, suppose that q projections of p variables are made. The resulting matrix of linear combinations is q × p, each row corresponding to a projection direction. The statistician must explain these vectors, both singly and as a group, in the context of the original p variables and in relation to the nonlinear components of the description or model.

The goal of this dissertation is to illustrate a method for trading accuracy of a description or model in return for parsimony in the matrix of linear combinations. The modification makes the results of this promising technique more understandable while sacrificing only a small amount of the structure, or accuracy, of the results.

In this dissertation, interpretability is used in a broader sense than parsimony or simplicity of a description or model. Tukey stated the concept in 1961: it may pay not to try to describe in the analysis the complexities that are really present in the situation. Several authors have pursued the idea, for example by rounding the parameters of a model fitted to data to be 0 or ±1, while considering the effect on intuition and on prediction of future observations.

Throughout, interpretability is relative to the particular statistician and dataset. A more interpretable description or model is (i) easier to relate to the variables in a model, (ii) easier to compare, (iii) easier to remember, and (iv) easier to explain. The adjective "uninterpretable" is used interchangeably with "complex", and "interpretable" with "simple", throughout this thesis, as is common in the philosophical literature.

The quantification of interpretability is a difficult problem; the diversity of opinion about simplicity in the philosophical literature presents a veritable chaos of opinion. Fortunately, the concept is manageable here: the linear combinations produced by projection pursuit can be ranked by a computer via an interpretability index, so that one set of combinations is judged more interpretable than another. This mathematical index lends itself to numerical optimization.

Exploratory projection pursuit is considered first, in Chapter 1. A short introduction to the method motivates the simplifying modification. Various interpretability indices which can be employed in the search for interpretable results are compared, and the algorithm is examined versus alternative strategies. The modification is illustrated with examples. Chapter 2 develops interpretable projection pursuit regression: the modification, the interpretability index, and the algorithm, ending with an example. Chapter 3 connects the new approach with established techniques of trading accuracy for interpretability, describes how the tradeoff between accuracy and interpretability has been used by others in similar statistical settings, and proposes extensions to other methods that might benefit from general application of this modification.
Chapter 1

Interpretable Exploratory Projection Pursuit

Section 1.1 presents the original exploratory projection pursuit technique and outlines the algorithm. The modification, which requires that a simplicity index be defined, is developed next, and interpretable exploratory projection pursuit is demonstrated. The technique is then applied to an example.

1.1 The Original Exploratory Projection Pursuit Technique

Exploratory projection pursuit examines high dimensional datasets without requiring any initial assumptions, while providing a pictorial description of the data.
1.1.1 Introduction

Classical multivariate techniques, such as principal components analysis or discriminant analysis, assume structure which is linear or normal in nature. Datasets measured on a multidimensional space which violate these assumptions are difficult for classical methods to handle. Exploratory projection pursuit addresses the problem of finding nonlinear structure in high dimensional data. A linear projection reduces the visual dimension: the technique projects the data onto low dimensional subspaces and searches for the views which exhibit structure. Initially, interesting projections were chosen interactively by eye (McDonald 1982, Asimov); in the current form of the technique, the structure of a view is defined by a mathematical projection index, and the search for interesting views is an exhaustive numerical optimization. As Huber (1985) remarks, many classical techniques are forms of exploratory projection pursuit distinguished by their projection index choices.

For example, consider principal components analysis applied to a random vector Y ∈ R^p. The first principal component is the solution of the maximization problem

    max_{β₁} Var[β₁ᵀY]   subject to   β₁ᵀβ₁ = 1                                [1.1]

and the second component solves the same problem with the additional constraints that β₂ᵀβ₂ = 1 and β₂ᵀY is uncorrelated with β₁ᵀY. Variance is a global measure of structure. In contrast, the novelty and applicability of exploratory projection pursuit lies in its ability to recognize nonlinear or local structure.
Many definitions of structure, and corresponding projection indices, exist; each measures how interesting a view is. All present indices, however, depend on the same basic premise. The idea is that though interesting structure is difficult to define, uninteresting structure is clearly defined: it is normality. Friedman (1987) and Huber (1985) provide theoretical justifications for this premise. Effectively, exploratory projection pursuit methods equate structure with nonnormality, attempting to address situations in which the interesting behavior is not captured by the covariance structure, which the statistician can analyze by classical methods. Desirable properties of a projection index include computational speed, invariance, and sensitivity to the kind of structure to be found. These considerations shape the original algorithm.
The original exploratory projection pursuit algorithm is presented by Friedman (1987). Actually, the abstract version given here is due to Huber (1985), thereby simplifying the notation. The index G is a functional of the projected density, and the problem is

    max_{β₁,β₂} G(β₁ᵀY, β₂ᵀY)   subject to   Var[β₁ᵀY] = Var[β₂ᵀY] = 1
                                 and   Cov[β₁ᵀY, β₂ᵀY] = 0                     [1.2]

The constraints of the projection ensure that the structure seen in the plane is not due to covariance (Cov) effects which can be dealt with by classical methods.
Initially, the original data Y is sphered (Tukey and Tukey 1981). The sphered variable Z ∈ R^p is defined as

    Z ≡ D^{−1/2} Uᵀ (Y − E[Y])                                                 [1.3]

with U and D resulting from an eigensystem decomposition of the covariance matrix of Y. That is,

    Cov[Y] = U D Uᵀ

where U is the matrix of eigenvectors and D is the diagonal matrix of associated eigenvalues.
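In code, the sphering transformation of [1.3] might be sketched as follows; this is a minimal NumPy illustration, not the thesis's implementation, and it assumes the rows of Y are observations:

```python
import numpy as np

def sphere(Y):
    """Sphere the data as in [1.3]: Z = D^{-1/2} U^T (Y - E[Y]).

    Y : (n, p) array, rows are observations.
    Returns Z (zero mean, identity sample covariance) together with
    U and the eigenvalues, for mapping combinations back via [1.5].
    """
    Yc = Y - Y.mean(axis=0)              # center the data
    cov = np.cov(Yc, rowvar=False)       # sample covariance: cov = U D U^T
    eigvals, U = np.linalg.eigh(cov)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))
    Z = Yc @ U @ D_inv_sqrt              # each row is D^{-1/2} U^T (y - mean)
    return Z, U, eigvals
```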
Since sphering standardizes the principal component directions, the maximization problem [1.2] involving combinations of Y can be translated to one involving linear combinations of Z:

    max_{α₁,α₂} G(α₁ᵀZ, α₂ᵀZ)   subject to   α₁ᵀα₁ = α₂ᵀα₂ = 1 and α₁ᵀα₂ = 0   [1.4]

The combinations in the two spaces are related by

    β₁ = U D^{−1/2} α₁,   β₂ = U D^{−1/2} α₂                                   [1.5]

so that the constraints of [1.4] translate into

    β₁ᵀ Σ β₁ = β₂ᵀ Σ β₂ = 1,   β₁ᵀ Σ β₂ = 0                                    [1.6]

where Σ ≡ Cov[Y]. The fact that the standardization conditions are imposed with respect to the covariance structure excludes covariance effects from the index (Friedman 1987). Thus, all numerical calculations are actually performed in the sphered space, though the combinations are usually reported, after norming, in the original data space; the statistician associates the value of the index with the visual projection in the original coordinates. In subsequent notation, the constraints [1.6] are assumed.
The optimization of problem [1.4] is a two-stage procedure. The initial stage is a coarse search over many starting planes. The second stage employs the derivatives of the index to make an accurate search in the vicinity of a good starting point. The optimization procedure therefore requires that the projection index possess certain properties; these criteria apply equally to the interpretability index introduced later.

Friedman (1987) motivates his projection index by transforming to variables for which the uninteresting case is uniform. He begins by defining

    R₁ ≡ 2Φ(β₁ᵀZ) − 1
    R₂ ≡ 2Φ(β₂ᵀZ) − 1

where Φ is the cumulative distribution function of a standard normal random variable. If the projected data is normal, and hence uninteresting, (R₁, R₂) is uniform on the square [−1, 1]². As a measure of nonnormality, the index is the integrated squared distance of the density p(R₁, R₂) from the uniform density,

    G_L(β₁ᵀY, β₂ᵀY) ≡ ∫₋₁¹ ∫₋₁¹ [ p(R₁, R₂) − 1/4 ]² dR₁ dR₂                   [1.7]

taken with respect to a uniform weight function. This accommodates a Legendre polynomial expansion of the index.
Expanding [1.7] in Legendre polynomials yields an infinite sum involving their expectations; in application, the sum is truncated and the moments are estimated from the n observations. Because the data is sphered first, the index is blind to location, scale, and covariance structure: affine transformations of Y do not change it, the property of affine invariance. Thus G_L identifies projections which exhibit nonnormality beyond the covariance structure. Research continues into other definitions of structure, such as indices aimed at clustering. In the following example, exploratory projection pursuit using G_L is applied; afterwards, a property which G_L lacks is discussed and motivates a new index.
Example. The automobile dataset (Friedman 1987) is examined. Ten variables Y₁, ..., Y₁₀ were collected on a set of automobiles; the second variable indicates the country of origin. As Friedman notes, discrete variables are gaussianized, which means the discrete values are replaced by normal scores after any repeated observations are randomly ordered; their discrete marginals would otherwise exaggerate structure. In addition, the combinations defining the most structured plane (β₁, β₂) are normed to length one. As a result, the coefficient for a variable in a combination measures its contribution: in the projection scatterplot, the horizontal and vertical coordinates of each observation are the (standardized) values β₁ᵀY and β₂ᵀY. The definition of the solution plane is

    β₁ = (−0.21, −0.03, −0.91, 0.16, 0.30, −0.05, −0.01, 0.03, 0.00, −0.02)ᵀ
    β₂ = (−0.75, −0.13, 0.43, 0.45, −0.07, 0.04, −0.15, −0.03, 0.02, −0.01)ᵀ

These combinations produce the scatterplot of Fig. 1.1.
These
1. Interpretable
Exploratory
Projection
Page 12
Pursuit
..
..
. * *.*.
.
.I .I .* :* .
: .
c
. 7. *.y
- .,* #.:.* .,.
, -: *
*>..i\..*
. . .*..
r.*,$s?:.. :..: .
**. : :$ ,I
. . ., . - .:.
. ., :
. .
-1
8
-1
Fig. 1.1
of the automobile
data
[l.S].
The scatterplot exhibits clear structure. However, the combinations defining it still involve all ten variables. In the extreme each axis could be a single variable, the x axis Y₁, the y axis Y₂, the z axis Y₃, and so on, and the combinations would be perfectly interpretable; whether the structure found here actually requires all the variables is discussed further in Section 1.4.3.

A natural question is whether the view exhibits more structure than a projection of unstructured points would. The statistician can simulate an answer to this question by generating values of the index for the same number of points and dimensions under the null hypothesis of normality, and noting how unusual the former is. Sun (1989) delves deeper into the problem, providing an analytical approximation for a critical significance level with which to interpret the structure exhibited in the most structured plane.

Though exploratory projection pursuit has reduced the visual dimension of the problem, the linear projections still involve all ten variables, and the statistician must decipher what these combinations represent. An infinite number of pairs of linear combinations satisfy the constraints and define the same plane and scatterplot: the orientation of the axes within the plane is inconsequential, since the pair may be spun via an orthogonal rotation of the plane and maintain the structure found. These facts lead to the principle behind the modification: among all pairs spanning a structured plane, choose the simplest, or most interpretable, linear combinations, so that each variable's contribution is represented in the most parsimonious way possible. The vectors have length one, so coefficients are comparable. Friedman (1987) attempts a simplification in this spirit, rotating the vectors within the solution plane until the contributions of the horizontal and vertical coordinates are maximized in turn, which requires β₂ to differ as much as possible from β₁ and yields a more interpretable representation. Such a within-plane rotation costs nothing in structure. More importantly, in the modification considered here, simplification of the vectors in all p dimensions, not merely within the plane, is considered.

1.2 The Interpretable Exploratory Projection Pursuit Approach
Exploratory projection pursuit produces the scatterplot of (β₁ᵀY, β₂ᵀY), the value of the index G_L(β₁ᵀY, β₂ᵀY), and the linear combinations (β₁, β₂). The interpretable approach searches for simpler combinations; the scatterplots may be visually compared with the original, and simplification continues until the loss in structure is deemed sufficient. A precise definition of interpretability, or parsimony, is required. If a combination initially involves all the variables, increasing its interpretability amounts to a form of variable selection in the projection, letting the statistician understand the nonlinear structure found in (β₁, β₂) in terms of a few variables without sacrificing the structure itself.
1.2.1 A Combinatorial Strategy

Variable selection in the projection can be made combinatorial, as in linear regression. A similar idea has been applied to principal components (Krzanowski): each row of the matrix of combinations has two ones and all other entries zero. The result is a variable selection method with each component confined to a pair of variables.

Applying a combinatorial strategy to exploratory projection pursuit results in the following number of solutions, indexed by variable subsets. Given p variables and the fact that each variable is either in or out of a combination, there are 2^p subsets available to each of the two combinations. The combinations are symmetric, meaning that the (β₁, β₂) plane is the same as the (β₂, β₁) plane. However, this count does not include the 2^p pairs with both subsets identical, which are permissible solutions. It does include 2^p − p degenerate pairs. Accounting for these, the count of candidate subset pairs is

    C(2^p, 2) + 2^p − 2^p − p = 2^{2p−1} − 2^{p−1} − p

Due to the unit-length constraint, each subset pair still leaves a continuum of projections, and unlike the principal components case, the numerical optimization must be completely redone for each pair of subsets. Because the count grows exponentially in p, this strategy is computationally infeasible.
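The exponential growth is easy to check numerically. A small illustration, assuming the count formula as reconstructed above:

```python
def candidate_pairs(p):
    # 2^{2p-1} - 2^{p-1} - p candidate pairs of variable subsets
    return 2 ** (2 * p - 1) - 2 ** (p - 1) - p

for p in (5, 10, 20):
    print(p, candidate_pairs(p))
# p = 10 already gives over half a million candidate pairs,
# each requiring its own numerical optimization.
```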
1.2.2 A Numerical Optimization Strategy

The alternative strategy retains a numerical optimization. The objective function used is a weighted combination of the exploratory projection pursuit index G and an interpretability, or simplicity, index S:

    max_{β₁,β₂} (1 − λ) G(β₁ᵀY, β₂ᵀY)/max G + λ S(β₁, β₂)                      [1.9]

with an interpretability parameter λ ∈ [0, 1]. The projection index is divided by its maximum so that the two contributions are on a comparable scale. The interpretable exploratory projection pursuit algorithm is applied in an iterative manner. First solve the original problem to find the most structured plane, and set max G equal to the index value for this most structured plane. For a succession of values (λ₁, λ₂, ..., λ_l), such as (0.1, 0.2, ..., 1.0), solve [1.9] with λ = λᵢ. In each case, use the previous λ_{i−1} solution as a starting point.

One way to envision the procedure is to imagine beginning at the most structured solution (Friedman 1987), equipped with an interpretability dial. As the dial is turned, the weight on simplicity is increased and the solution moves away from the most structured plane toward a simpler one. If the projection index is relatively flat near its maximum, the loss in structure is gradual.

The weighting in [1.9] parallels the roughness penalty approach to curve fitting (Silverman). In that context the problem is a minimization: the first term is a goodness-of-fit criterion measuring fidelity of the fitted curve to the data values, and the second is a measure of the roughness of the curve, such as the integrated squared second derivative. Roughness plays the role complexity plays here; smoothness corresponds to interpretability.

Another way to view [1.9] is to think of the maximization as the unconstrained re-expression of the family of constrained subproblems

    max_{β₁,β₂} G(β₁ᵀY, β₂ᵀY)   subject to   S(β₁, β₂) ≥ s                     [1.10]

indexed by the simplicity level s. The computational cost of the numerical strategy is comparable to that of an exploratory projection pursuit solution for each λ: the inner loop is the same for either. The outer loop, however, is reduced from
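an exponential number of combinatorial subproblems to l + 1 numerical optimizations. The λ sweep can be summarized in a short sketch; the optimizer, the structure index G, and the simplicity index S below are placeholder callables (hypothetical names, with the norming constraints omitted for brevity):

```python
import numpy as np
from scipy.optimize import minimize

def interpretable_sweep(beta0, G, S, lambdas=np.arange(0.1, 1.01, 0.1)):
    """Solve [1.9] along a succession of lambda values.

    beta0 : most structured solution (lambda = 0), found first.
    G, S  : callables mapping the stacked combinations to the projection
            index and the simplicity index (placeholders in this sketch).
    Each solution warm-starts the next, mimicking the 'interpretability
    dial' described in the text.
    """
    max_G = G(beta0)                     # normalizing constant from step one
    solutions, beta = [], beta0
    for lam in lambdas:
        obj = lambda b: -((1 - lam) * G(b) / max_G + lam * S(b))
        res = minimize(obj, beta, method="BFGS")   # constraints omitted here
        beta = res.x
        solutions.append((lam, beta.copy()))
    return solutions
```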
1.3 The Interpretability Index

The interpretability index S measures the simplicity of the combinations; to be optimized alongside the projection index G, it needs to be differentiable and compatible with Lagrange multipliers. Consider a general q × p matrix of combinations, corresponding to q dimensional exploratory projection pursuit. What characteristics does an interpretable matrix possess?

1.3.1 Factor Analysis Background

Researchers have faced this question in factor analysis, where the solution loading matrix often is rotated to simple structure. Though the goals of the two methods are different, the factor analysis rotation criteria motivate the definition of a simplicity index S.

Interpretability of a matrix may be thought of in two ways. Local interpretability concerns how simple each vector or row is individually. In a general and discrete sense, the more zeros a vector has, the more interpretable it is, as less variables are involved. Global interpretability concerns how simple a collection of vectors is. Given that the vectors are defining a plane and should not collapse on each other, a simple set of vectors is one in which each row clearly contains its own small set of variables and has zeros elsewhere.

Thurstone (1935) advocated such simple structure. For example, each combination (row) should have at least one zero and, for each pair of rows, only a few columns should have nonzero entries in both rows. In summary, his requirements correspond to a matrix which involves few variables per row: combinations should not overlap much or have nonzero coefficients for the same variables, and those that do should clearly divide into subgroups.

These discrete notions of interpretability must be translated into a continuous measure which is tractable for computer optimization. Local simplicity for a single vector is discussed first and the results are extended to a set of two vectors.
1.3.2 The Varimax Index

Interpretability of a single vector translates, in a discrete sense, into as many zero entries as possible. A discrete index counting nonzero coefficients is

    D(w) ≡ Σᵢ₌₁ᵖ I{wᵢ ≠ 0}                                                     [1.11]

where I{·} is the indicator function. Because projection pursuit linear combinations represent directions and are usually normed as a final step, the index should involve the normed coefficients. In addition, the sign of a coefficient is inconsequential; only its magnitude should be measured. In conclusion, a continuous index should depend on w through the normed squared coefficients wᵢ²/wᵀw, i = 1, ..., p, irrespective of sign.

The varimax rotation criterion of factor analysis (Gorsuch 1983, Harman 1976) satisfies these requirements; a rotation of the same family underlies the simplification of Friedman (1987) remarked in Section 1.1. The corresponding varimax interpretability index for a single vector is

    S_v(w) ≡ (p/(p−1)) Σᵢ₌₁ᵖ ( wᵢ²/wᵀw − 1/p )²                                [1.12]

The leading constant is added to make the index value lie in [0, 1]. Fig. 1.2 shows the value of the varimax index for a linear combination w in two dimensions.
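In code, [1.12] is one line. A minimal sketch, assuming the reconstructed normalizing constant p/(p−1):

```python
import numpy as np

def varimax_index(w):
    """Varimax simplicity index S_v(w) of [1.12], lying in [0, 1]."""
    w = np.asarray(w, dtype=float)
    p = w.size
    v = w ** 2 / (w @ w)                 # normed squared coefficients
    return p / (p - 1) * np.sum((v - 1.0 / p) ** 2)

print(varimax_index([1, 0, 0]))          # 1.0: a coordinate direction
print(varimax_index([1, 1, 1]))          # 0.0: the most complex direction
```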
Fig. 1.2  Varimax interpretability index for q = 1, p = 2. The value of the index for a linear combination w in two dimensions is plotted versus the angle of the direction in radians over the range [0, π].

The index is largest at the coordinate directions and smallest when the components of w are equal in magnitude. Fig. 1.3 plots the index surface for p = 3 over vectors with all positive components.

Fig. 1.3  Varimax interpretability index for q = 1, p = 3. The surface of the index is plotted as the vertical coordinate versus the first two coordinates (w₁, w₂) of vectors of length one in the first quadrant.

Fig. 1.4 shows contours for the surface in Fig. 1.3. These contours are just the curves of points w which satisfy the equation formed when the left side of [1.12] is set equal to a constant. As the interpretability index is increased, the contours move toward the three points e₁ = (1,0,0), e₂ = (0,1,0), and e₃ = (0,0,1).

Fig. 1.4  Varimax interpretability index contours for q = 1, p = 3. The axes are the components (w₁, w₂, w₃). The contours, from the center point outward, are those points which have varimax values S_v(w) = (0.0, 0.01, 0.05, 0.2, 0.3, 0.6, 0.8). The center point is the contour S_v(w) = 0.0; the next three joined curves are contours for S_v(w) = 0.01, 0.05, 0.2, and the curves toward the vertices correspond to S_v(w) = 0.3, 0.6, 0.8.
The criterion derives its name from the fact that it maximizes the variance of the squared normed coefficients; expanded, it involves the sum of their fourth powers.

1.3.3 The Entropy Index

The vector of normed squared coefficients has length one and all entries are positive, similar to a multinomial set of probabilities. A natural measure of how even the distribution of a probability vector is, is its entropy (Renyi 1961). The corresponding simplicity index is

    S_e(w) ≡ 1 + (1/ln p) Σᵢ₌₁ᵖ (wᵢ²/wᵀw) ln(wᵢ²/wᵀw)
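A corresponding sketch of the entropy index, with the 0 · ln 0 = 0 convention handled explicitly:

```python
import numpy as np

def entropy_index(w):
    """Entropy simplicity index S_e(w): 1 at +/- e_j, 0 when all normed
    squared coefficients equal 1/p (Renyi 1961)."""
    w = np.asarray(w, dtype=float)
    p = w.size
    v = w ** 2 / (w @ w)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(v > 0, v * np.log(v), 0.0)   # 0 * log 0 := 0
    return 1.0 + terms.sum() / np.log(p)
```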
Both indices share several properties. The maximum value of each is attained exactly at the coordinate vectors ±eⱼ, j = 1, ..., p, the simplest directions. The minimum value is attained when all normed squared coefficients equal 1/p; in terms of the original variables, such a direction involves every variable equally and is the most difficult to interpret. The following definitions make the intuition precise.

Definition. Let ξ = (ξ₁, ξ₂, ..., ξ_p) and γ = (γ₁, γ₂, ..., γ_p) be any two vectors with the same sum. Then ξ majorizes γ if γ = ξC for some doubly stochastic matrix C; in other words, if γ is a smoothed version of ξ. An example of a chain of vectors ordered by majorization is

    (1, 0, ..., 0) ≻ (1/2, 1/2, 0, ..., 0) ≻ ... ≻ (1/(p−1), ..., 1/(p−1), 0) ≻ (1/p, ..., 1/p)

Definition. A function f : R^p → R is strictly Schur-convex if ξ majorizes γ implies f(ξ) ≥ f(γ), with strict inequality if γ is not a permutation of ξ.

Intuitively, if a vector ξ is more spread out or uneven than γ, then S(ξ) > S(γ). This intuitive idea of interpretability is thereby given a precise mathematical meaning. Using Schur-convexity, other simplicity indices could be defined.
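The majorization chain above gives a quick numerical sanity check (not a proof) that the varimax index is monotone in the majorization ordering; the compact redefinition of S_v here repeats the sketch given earlier:

```python
import numpy as np

def varimax_index(w):
    w = np.asarray(w, float); p = w.size; v = w ** 2 / (w @ w)
    return p / (p - 1) * ((v - 1 / p) ** 2).sum()

def chain(p):
    """(1,0,...,0) majorizes (1/2,1/2,0,...) majorizes ... (1/p,...,1/p)."""
    return [np.r_[np.full(k, 1.0 / k), np.zeros(p - k)] for k in range(1, p + 1)]

p = 6
# the chain lives on the normed squared scale, so pass square roots
vals = [varimax_index(np.sqrt(v)) for v in chain(p)]
assert all(a > b for a, b in zip(vals, vals[1:]))   # strictly decreasing
```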
1.3.4 The Distance Index

Both indices in effect measure the closeness of the normed squared vector to the point (1/p, 1/p, ..., 1/p), which might thus be called the least simple or most complex point. Let the normed squared version of any vector w be

    v_w ≡ ( w₁²/wᵀw, w₂²/wᵀw, ..., w_p²/wᵀw )

and let v_c ≡ (1/p, ..., 1/p) denote the most complex point. A distance index is then

    S_d(w) ≡ c ‖ v_w − v_c ‖                                                   [1.13]

where the constant c is calculated to make the index lie in [0, 1]. Alternatively, distance can be measured toward simple points rather than away from the complex one. Let u₁, ..., u_J be simple probability vectors,

    u_{ji} ≥ 0, i = 1, ..., p,   Σᵢ₌₁ᵖ u_{ji} = 1,   j = 1, ..., J              [1.14]

such as the coordinate vertices; as in projection pursuit, v_w lies in the same simplex of R^p as the u_j and v_c. An index is

    S(w) ≡ 1 − c min_j ‖ v_w − u_j ‖                                            [1.15]

where again the constant c must be found; other choices of simple points can be used, and an average or total distance could replace the minimum. If only the coordinate vertices are used, the minimizing vertex is the one where k corresponds to the maximum |wᵢ|, i = 1, ..., p, and the minimization has a closed form. Analogous simple points are obtainable for configurations of two, three, ..., p − 2 zero variables each. Algebra reveals that the extremes occur either when wᵢ²/wᵀw = 1/p for all coefficients, or at a vertex where w_k²/wᵀw = 1 and the indices agree. The difficulty is the minimum itself: the active vertex, and hence the relationship between the index values, varies, and the index is not differentiable at the crossover points, a problem when a derivative-based optimization procedure is used.
The varimax approach extends to a set of q combinations w_j = (w_{j1}, ..., w_{jp}), j = 1, ..., q. In the set version of the varimax index, the variance of the squared normed coefficients is taken across the vectors and summed over the variables [1.16]. If the sums were reversed and the variance taken across the variables within each vector, the index would reduce to the sum S_v(w₁) + ... + S_v(w_q) of individual simplicities [1.17]. For two combinations with appropriate norming, the set index reduces to the average of the individual varimax terms together with a cross-product term [1.18]. With orthogonality of the combinations, this cross-product term is maximized when different variables load into the two combinations and minimized when the same variables appear in both, so the set index encourages the rows to divide the variables between them.

The interpretable exploratory projection pursuit problem is [1.9] with this varimax interpretability index. The original exploratory projection pursuit algorithm is followed, but two changes are required, as described in the first three subsections below.
1.4 The Algorithm

1.4.1 Rotational Invariance of the Projection Index

The observed structure of a scatterplot is immaterial to the direction from which it is viewed: a plane should receive the same index value G no matter which orientation of axes within it is used. Stated another way, the interpretable algorithm seeks the simplest way of representing the plane, as measured by S_v, so the projection index must not prefer one orientation.

Definition. A projection index G is rotationally invariant if

    G(β₁ᵀY, β₂ᵀY) = G(β̃₁ᵀY, β̃₂ᵀY)                                            [1.19]

whenever (β̃₁, β̃₂) is obtained from (β₁, β₂) by a rotation of the projection plane.

Affine invariance is a welcome byproduct of sphering, but rotational invariance within the plane is a separate property, and G_L is not rotationally invariant. Intuitively, G_L transforms the marginals of the scatterplot, and the marginals change as the axes spin. Empirically, the lack of invariance is evident in the automobile example: Table 1.1 gives values of the Legendre index in the most structured plane as the axes are rigidly rotated from 0 to π.

Table 1.1  Most structured Legendre plane index values for the automobile data. Values of the Legendre index for different orientations of the axes in the most structured plane are given (equally spaced rotations from 0 to π).

    rotation:          0                                      ...                                     π
    G_L(β₁ᵀY, β₂ᵀY):  0.35  0.36  0.34  0.32  0.30  0.29  0.29  0.29  0.30  0.32  0.35
1.4.2 The Fourier Index

An alternative index can be constructed directly in coordinates adapted to rotation. Consider the polar coordinates (R, Θ) of a projected point, where R is defined as half the squared distance of the point from the origin and Θ is its angle. If the projected density is N(0, I), then R and Θ are independent and

    R ~ Exp[1],   Θ ~ Unif[−π, π]

The proposed index is the integral of the squared distance between the density of (R, Θ) and the null hypothesis density, which is the product of the exponential and uniform densities. The expansion of this index parallels that of G_L, with orthogonal systems chosen specifically for their weight functions and rotational behavior: the Legendre polynomials of Friedman's index are by definition orthogonal with respect to a uniform weight (Unif[−1, 1]), which does not respect rotations of the plane, whereas the radial part here requires polynomials orthogonal with respect to the exponential weight.
The appropriate system is the Laguerre polynomials, orthogonal with respect to the weight W(u) = e^{−u}. The polynomials are

    L₀(u) = 1
    L₁(u) = u − 1
    L_i(u) = (u − 2i + 1) L_{i−1}(u) − (i − 1)² L_{i−2}(u)                     [1.21]

The weighted Laguerre functions are defined as

    ℓ_i(u) ≡ L_i(u) e^{−u/2}

with the orthogonality relationships

    ∫₀^∞ ℓ_i(u) ℓ_j(u) du = 0,   i, j ≥ 0 and i ≠ j

where the nonzero diagonal terms supply normalizing constants. A density f on [0, ∞) can then be written as

    f(u) = e^{−u/2} Σ_{i≥0} a_i L_i(u),   a_i ∝ ∫₀^∞ L_i(u) e^{−u/2} f(u) du

so that if a random variable W has density f, the coefficients can be written, up to the normalizing constants, as a_i = E_f[ℓ_i(W)].
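The recurrence [1.21] translates directly into code. A sketch, using the unnormalized polynomials exactly as written above (the scaling, and hence the normalizing constants, are an assumption of this reconstruction):

```python
import numpy as np

def laguerre_poly(i, u):
    """L_i(u) via the recurrence [1.21]:
    L_0 = 1, L_1 = u - 1, L_i = (u - 2i + 1) L_{i-1} - (i - 1)^2 L_{i-2}."""
    u = np.asarray(u, dtype=float)
    L_prev, L = np.ones_like(u), u - 1.0
    if i == 0:
        return L_prev
    for j in range(2, i + 1):
        L_prev, L = L, (u - 2 * j + 1) * L - (j - 1) ** 2 * L_prev
    return L

def ell(i, u):
    """The weighted Laguerre function l_i(u) = L_i(u) exp(-u/2)."""
    u = np.asarray(u, dtype=float)
    return laguerre_poly(i, u) * np.exp(-u / 2.0)
```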
The Θ portion of the expansion uses the Fourier system: a suitable periodic function f : R → R may be written as a trigonometric series with pointwise convergence. The orthogonality relationships are

    ∫₋π^π cos²(kv) dv = ∫₋π^π sin²(kv) dv = π,   k ≥ 1
    ∫₋π^π cos(kv) sin(jv) dv = 0,   k ≥ 0 and j ≥ 1
    ∫₋π^π dv = 2π

and the coefficients are

    a_k ≡ (1/π) ∫₋π^π cos(kv) f(v) dv = (1/π) E_f[cos(kW)]
    b_k ≡ (1/π) ∫₋π^π sin(kv) f(v) dv = (1/π) E_f[sin(kW)]

Combining the two systems, the joint density of (R, Θ) is expanded in the weighted Laguerre polynomials and trigonometric functions as

    p(u, v) = e^{−u/2} Σ_{i≥0} L_i(u) ( c_i + Σ_{k≥1} [ a_{ik} cos(kv) + b_{ik} sin(kv) ] )

with coefficients

    a_{ik} ≡ (1/π) E_p[ℓ_i(R) cos(kΘ)],   i ≥ 0 and k ≥ 1
    b_{ik} ≡ (1/π) E_p[ℓ_i(R) sin(kΘ)],   i ≥ 0 and k ≥ 1
The null hypothesis density is the product of the exponential density and the uniform density:

    f_R(u) f_Θ(v) = (e^{−u}) (1/2π) = (1/2π) e^{−u/2} ℓ₀(u)

Multiplication, integration, and the orthogonality relationships reduce the integrated squared distance between p and this null density to a weighted sum of squared expansion coefficients [1.22]. Maximizing [1.22] is equivalent to maximizing the index G_F(β₁ᵀY, β₂ᵀY), which, after dropping constants, is a weighted sum of the squared coefficients a_{ik} and b_{ik} and the purely radial coefficients, minus a multiple of E_p[ℓ₀(R)] [1.23]. This index is the exponential-weight analogue of the Legendre index. In application, the sums are truncated and the expectations are replaced by sample means, instead of the moment calculations of Friedman (1987). For example, E_p[ℓ_i(R) cos(kΘ)] is estimated by

    (1/n) Σⱼ₌₁ⁿ ℓ_i(r_j) cos(kθ_j)

where r_j and θ_j are the half squared radius and angle for the projected jth observation.

The Fourier index is rotationally invariant. Suppose the points are spun by an angle of τ. The purely radial terms and the final term in [1.23] do not change. The sine and cosine of the new angle are

    cos(Θ + τ) = cos Θ cos τ − sin Θ sin τ
    sin(Θ + τ) = sin Θ cos τ + cos Θ sin τ

and expanding each component with these identities and collecting terms using cos²(kτ) + sin²(kτ) = 1 yields

    E_p²[ℓ_i(R) cos(k(Θ+τ))] + E_p²[ℓ_i(R) sin(k(Θ+τ))]
        = E_p²[ℓ_i(R) cos(kΘ)] + E_p²[ℓ_i(R) sin(kΘ)]                           [1.24]

so each squared-coefficient pair, and hence the index, is unchanged. The truncated version of the index is invariant by the same argument. Moreover, replacing the expected value by the sample mean does not affect [1.24], so the sample truncated index is rotationally invariant as well. A similar analysis applies to indices built from truncated Hermite function expansions of the projected density, and to the asymptotic behavior of G_L.
The class of densities for which this index is finite, given the portion weighted by e^{−u/2}, must be determined, and the asymptotic behavior of G_F is being investigated. The Fourier G_F and Legendre G_L indices are based on comparing the density of the projection with the null hypothesis density. Jones and Sibson (1987) define a rotationally invariant index based on entropy, which tends to equate structure with clustering and so seeks to find clusters. A possible future direction is an analytically tractable index derived from Radon transforms. The interpretable algorithm uses [1.9] as a criterion with G_F as its objective function, since G_F is rotationally invariant while the optimal representation as measured by S_v requires an index indifferent to rotation within the plane.

1.4.3 Projection Axes Restriction
In the sphered space, the correlation constraint [1.6] is imposed on the combinations through the orthogonality constraint on (α₁, α₂). The maximizing combinations are orthogonal in the sphered space, but the corresponding pair (β₁, β₂) in the original variable space is orthogonal only with respect to the covariance metric, as noted in Section 1.1, so the orientation of the axes of the scatterplot is not unique and the maximum simplicity representation must be fixed. The interpretability index is applied in the following manner. Without loss of generality, let β₁ be the optimal first combination; the simplicity of the pair is defined through β₁ together with the combination in the solution plane which is orthogonal to β₁ [1.25]. Mathematically, this translation fixes the orientation whose interpretability is measured. As an added bonus, the translated pair is orthogonal, so the simplified representation graphs the structure with respect to perpendicular axes. In effect, a further subjective simplification is made: an orthogonal representation of the solution is more interpretable. Throughout
the remainder, this translation [1.27] is assumed: whenever S_v(β₁, β₂) is referred to, the orthogonal translation of the pair is meant.

1.4.4 Comparison With Factor Analysis

Given that the interpretability index is motivated by the rotation criteria of factor analysis, a comparison of this method with factor analysis is warranted. A projection of the data is defined as

    X ≡ B Y

where X = (X₁, X₂)ᵀ, the matrix B is 2 × p, and the observed variables are Y = (Y₁, Y₂, ..., Y_p)ᵀ. This definition is similar to the one for principal components. In principal components analysis, as was remarked in Section 1.1.1, all p combinations are computed and the matrix B is p × p. To simplify the principal components solution, B may be multiplied by an orthogonal rotation matrix Q, producing QBY; the rotation is chosen to maximize a simplicity index subject to maintaining the constraints. Since the rotation leaves the fitted representation unchanged, simplification costs nothing in that setting; here the structure index constrains the rotation, and the correlation constraints are maintained by Q.
The analogous factor analysis model for p variables and two factors is

    Y = Ω f + ε

where f = (f₁, f₂)ᵀ are the underlying factors, Ω is the p × 2 factor-loading matrix, and ε is a p × 1 error vector. The model is found by seeking to explain the covariance matrix of (Y₁, Y₂, ..., Y_p)ᵀ. In order to simplify the description, the factor-loading matrix is rotated, producing ΩQ, without disturbing the fitted covariance structure. Simplifying the interpretable projection pursuit combinations is analogous to simplifying the factor-loading matrix in a two factor model; a transpose is taken since the factor analysis linear combinations are columns rather than rows, producing QΩᵀ, a 2 × p matrix of combinations. The difference is that the factor analysis rotation leaves its model untouched, while the interpretable exploratory projection pursuit solution lives in R^p subject to orthogonality constraints, and the modification involves rocking the solution combinations away from the optimum to increase interpretability. The simplified coefficients play the role the rotated loadings play in the factor-analysis setting.
1.4.5 The Procedure

The interpretable exploratory projection pursuit problem is

    max_{β₁,β₂} (1 − λ) G_F(β₁ᵀY, β₂ᵀY)/max G_F + λ S_v(β₁, β₂)                 [1.28]

The algorithm is outlined, in condensed form, as follows.

1. Find the most structured plane by the original exploratory projection pursuit algorithm, and set max G_F equal to its index value.
2.-5. For the interpretability parameter values λ₁ < λ₂ < ... < λ_l, solve [1.28] with λ = λᵢ, using the previous solution plane P_{i−1} as the starting values. The search for the best plane is performed in the sphered data space, as discussed in Section 1.1; an important computational shortcut works behind the scenes, while the statistician sees the βᵢ in the original space.
6. If i = l, EXIT. Otherwise increment the parameter and continue.

The modification of Friedman's (1987) algorithm concerns the dimension reduction step.
The removal of the unimportant components in principal components analysis normally reduces the computational work involved. The gradients of the interpretability index are computed in the original space and carried by

    α₁ = D^{1/2} Uᵀ β₁,   α₂ = D^{1/2} Uᵀ β₂                                    [1.29]

to the sphered space. If the gradient components are nonzero only in the p − q dropped dimensions, the derivative-based optimization procedure cannot reach the maximum. Thus, no reduction in dimension is performed during the interpretable search. The optimization starts in the original rather than the sphered data space; for example, the starting point could be the most structured pair of original variables. On the other hand, stepping evenly through the unsphered data space might not cover the data adequately in some subspace, leading to constrained optimization problems. Each step reduces to a quadratically constrained problem posed as a quadratic programming problem. The translation [1.5] from correlated to orthogonal combinations is extremely powerful here. At present, work continues toward an optimization which maintains the constraints directly. However, the present approach is an effective one.
The spirit of Friedman's (1987) simplification can be recovered in two ways. The initial pair of vectors defining the most structured plane can be rotated to its simplest orientation before the sweep begins, or Steps 5 and 6 can be run with λ equal to a very small value, say 0.01. This slight weight on simplicity leaves the structure essentially undisturbed while producing a representation similar to Friedman's. The initial solution may be caught in a local maximum of the projection index; this spinning can free it, since both the projection pursuit index term and interpretability have been increased. In the examples, a finer sequence of λ values, such as (0.0, 0.05, 0.1, ...), is used when the coarser sequence does not progress smoothly. Throughout, the solution of the algorithm at λ_{i−1} is used as the starting point at λᵢ, which is in the spirit of rocking the solution away from the original solution. The examples have shown that the objective tradeoff made initially is fairly stable.
1.5 Examples

In this section, two examples of interpretable exploratory projection pursuit are examined. As remarked in Section 1.1 when the example was considered, the data Y is usually standardized before the analysis.

The Simulated Example. The data in this example consists of 200 (n) points in two (p) dimensions, lying along a line at an angle of thirty degrees to the horizontal with normal variation about the line. Since the data is only in two dimensions and can be viewed completely in a single scatterplot, using interpretable exploratory projection pursuit in this instance is quite contrived; the example serves to illustrate the way the procedure behaves.
Fig. 1.5  Simulated data: n = 200 points in p = 2 dimensions.
The algorithm modifications, rotational invariance and the axes restriction, are irrelevant in this situation: with p = 2 the projection is a single linear combination. As the interpretability parameter λ increases, the solution direction rotates from the line through the data, essentially the first principal component of the data, toward the horizontal axis, whose projection is the simpler combination; the horizontal and vertical lines in the figure indicate these limiting projections, and the data exhibits the most structure when projected onto the thirty degree line.

Fig. 1.6  Structure and interpretability indices versus λ for the simulated data. The indices are calculated at each λ; the combinations are normed to length one.
Fig. 1.7  Histograms of the projected simulated data for two values of λ.

    λ = 0.0:  G_L = 0.93,  S₁ = 0.47,  β = (0.92, 0.47)ᵀ
    λ = 0.5:  G_L = 0.78,  S₁ = 0.71,  β = (0.97, 0.24)ᵀ

As predicted, the rotation of the solution toward the horizontal axis is evident in the coefficients, and the loss of structure appears in the merging of the modes of the histogram; to the eye, however, the two projections
are virtually indistinguishable.

The Automobile Example. The automobile data of Section 1.1 is re-examined, in a slightly different notation, using interpretable exploratory projection pursuit with the rotationally invariant G_F projection index instead of G_L. The analysis demonstrates two points. First, the lack of rotational invariance of G_L matters: rigidly spinning the axes of the most structured plane yields different Legendre index values, as reported in Table 1.1, whereas G_F maintains the same value after any rotation of the axes. Second, the representation constraint [1.6] is visible: with the combinations translated to an orthogonal pair in the original metric, the most structured G_F plane has index value 0.26, and a loss of 7% in the index buys a far more interpretable pair, a difference barely visible to the eye. The corresponding projection is shown in Fig. 1.8, and Fig. 1.9 repeats Fig. 1.1 with the same limits for comparison purposes.
Fig. 1.8  Most structured Fourier projection of the automobile data, which exhibits the clustering found before.
If the Legendre solution combinations are translated to an orthogonal pair, they become

    β̃₁ = (−0.21, −0.03, −0.91, 0.16, 0.30, −0.05, −0.01, 0.03, 0.00, −0.02)ᵀ
    β̃₂ = (−0.80, −0.14, 0.27, 0.49, −0.02, 0.03, −0.16, −0.02, 0.02, −0.01)ᵀ

with interpretability measure S_v(β̃₁, β̃₂) = 0.17. The combinations produced by the interpretable algorithm at a moderate interpretability parameter reach measure 0.80, while their scatterplot remains close to the original.

Fig. 1.9  Most structured Legendre projection of the automobile data, plotted with the same limits as Fig. 1.8 for comparison.
The resulting solutions for the values λ = (0.0, 0.1, ..., 1.0), analogous to Fig. 1.6 for the simulated data, are summarized in Fig. 1.10; the solution occasionally gets bumped out of a projection index local maximum between successive λ values. Six of the solutions are displayed in Fig. 1.11. The coefficients move quickly to the extremes: in the continuous index S_v, small coefficients, which are squared, contribute almost nothing, which suggests investigating the use of a power less than two in the index in the future.

Fig. 1.10  Structure and interpretability indices versus λ for the automobile data.
Fig. 1.11  Scatterplots of the interpretable solutions for the automobile data for a range of λ values (λ = 0.4 through 1.0).
Table 1.2  Linear combinations for the automobile data. The linear combinations are given for the range of λ values. The first row for each λ value is β₁ᵀ and the second is β₂ᵀ.

    λ = 0.4    0.00 -0.09 -0.97  0.06  0.22 -0.01  0.02 -0.02 -0.02  0.00
              -0.06  0.15  0.00  0.95 -0.15  0.08 -0.03  0.19  0.07 -0.06
    λ = 0.5    0.00 -0.08 -0.98  0.04  0.16 -0.02  0.03 -0.02 -0.02 -0.01
              -0.03  0.12  0.01  0.97 -0.11  0.12 -0.03  0.14  0.06 -0.04
    λ = 0.6    0.01 -0.05 -0.99  0.01  0.12 -0.01  0.04  0.00 -0.01 -0.01
              -0.03  0.06  0.00  0.98 -0.07  0.11 -0.03  0.11  0.05 -0.03
    λ = 0.7    0.01 -0.05 -0.99  0.02  0.09 -0.01  0.05 -0.01 -0.02 -0.01
              -0.02  0.06  0.01  0.98 -0.04  0.10 -0.04  0.11  0.05 -0.03
    λ = 0.8    0.02 -0.04 -0.99  0.01  0.09  0.04  0.02  0.02 -0.01  0.00
              -0.02 -0.02  0.01  0.99 -0.02 -0.02  0.03  0.01  0.04 -0.09
    λ = 0.9    0.01 -0.03 -1.00  0.00  0.03  0.03  0.02  0.01  0.00 -0.01
               0.00 -0.01  0.00  1.00  0.00  0.01  0.03  0.02  0.03 -0.07
    λ = 1.0    0.00  0.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
               0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00
Table 1.3  Abbreviated linear combinations for the automobile data. The linear combinations are given for the range of λ values as in Table 1.2. A "-" replaces any coefficient less than 0.05 in absolute value.

    λ = 0.4      -   -0.09 -0.97  0.06  0.22    -     -     -     -     -
               -0.06  0.15    -   0.95 -0.15  0.08    -   0.19  0.07 -0.06
    λ = 0.5      -   -0.08 -0.98    -   0.16    -     -     -     -     -
                 -    0.12    -   0.97 -0.11  0.12    -   0.14  0.06    -
    λ = 0.6      -   -0.05 -0.99    -   0.12    -     -     -     -     -
                 -    0.06    -   0.98 -0.07  0.11    -   0.11  0.05    -
    λ = 0.7      -   -0.05 -0.99    -   0.09    -   0.05    -     -     -
                 -    0.06    -   0.98    -   0.10    -   0.11  0.05    -
    λ = 0.8      -     -   -0.99    -   0.09    -     -     -     -     -
                 -     -     -   0.99    -     -     -     -     -   -0.09
    λ = 0.9      -     -   -1.00    -     -     -     -     -     -     -
                 -     -     -   1.00    -     -     -     -     -   -0.07
    λ = 1.0      -     -   -1.00    -     -     -     -     -     -     -
                 -     -     -   1.00    -     -     -     -     -     -
Fig. 1.12  Parameter paths for the automobile data. The coefficients in each combination of the solution are plotted versus λ.
Another useful diagnostic tool is fashioned from the λ sequence: the coefficients in each combination of the solution are plotted versus λ, as in Fig. 1.12. The paths make the variable selection visible: if one variable is in a combination, it usually is absent in the other. The final combinations involve four variables which are split into two pairs, one in each combination; one of the selected variables is automobile weight.

However, no external measurement of a plane's usefulness exists; the index delineates structure, but whether a structured plane is meaningful must be judged by the statistician, for example by replotting the points with the country of origin as the plotting symbol. Rather than a confirmatory tool, interpretable exploratory projection pursuit is a descriptive procedure to find and present structure in a particular dataset.

Friedman (1987) presents a method for finding other interesting planes after the first. He recursively applies a transformation to the structured projections which removes the structure found, leaving other planes undisturbed, and then searches again; after several interesting planes have been identified, the analysis stops. The interpretable technique applies equally well after each structure removal.
Page 55
1.5 Examples
.-.
*
.,
- -.
. . ...:: y.,.
-.. . . .
.. ,.. .. .
. **
: :
.
Fig.
The interpretability
onal.
;*
.
..
of a collection
j(:
However,
in orthogonal
-I
..*
I.13
example,
.*
removal procedure
solution
For
anyway.
Originally, exploratory projection pursuit was interactive, as mentioned in Section 1.1.1. The statistician used a joystick to rotate the point cloud, choosing three variables at a time to view; a fourth dimension could be shown in color. She identified the structure by eye. However, this task was time-consuming at best, and combinations of more than four variables could not be examined. The solution was to automate the identification of structure by defining an index which a numerical optimizer could maximize mathematically. The automation removed the burden of identification, but she retained the burden of selection and explanation. The interpretability index is an analogous automation: a mathematical measure which, coupled with a numerical optimization routine, automates the simplification the statistician previously carried out by eye.
Chapter 2

Interpretable Projection Pursuit Regression

This chapter describes interpretable projection pursuit regression. It is similar in organization and strategy to the previous chapter. After reviewing the notation and the original algorithm, the common modification is considered and the interpretable algorithm is detailed, though the interpretability index differs from that of Chapter 1. The final section applies the approach to an example.

2.1 The Projection Pursuit Regression Technique

The projection pursuit regression technique is presented in Friedman and Stuetzle (1981). Friedman (1985) describes several algorithmic improvements, in addition to extensions to classification and multiple response settings. In this chapter, only single response regression is considered.
2.1.1 Introduction

One way to view projection pursuit regression is to consider it as a generalization of familiar ordinary linear regression. Let the response variable be Y and the predictors X = (X₁, X₂, ..., X_p)ᵀ. With slightly unfamiliar notation, the linear model is a linear function of the centered X:

    Y − E[Y] = wᵀX + ε                                                          [2.1]

where ε is the random error term with zero mean. The vector w = (w₁, w₂, ..., w_p)ᵀ consists of the regression coefficients. In general, this linear function is estimated by the conditional expectation of Y given particular predictor values; the expected value of Y is estimated by the sample mean, and the fitted value is

    ŷ(x) = E[Y] + wᵀx                                                           [2.2]

The L₂ loss between random variables is

    L₂(w, X, Y) ≡ E[Y − ŷ]²                                                     [2.3]

and the parameters w minimize L₂(w, X, Y). In practice, the sample mean over the n data points replaces the population mean.

Projection pursuit regression generalizes [2.1] by letting the response depend on linear combinations of the predictors through unrestricted smooth functions:

    Y − E[Y] = Σⱼ₌₁ᵐ βⱼ fⱼ(αⱼᵀX) + ε                                            [2.4]

fitted subject to the constraint that αⱼᵀαⱼ = 1. The linear model is the special case in which the relationship between Y and the projection is a straight line; a natural generalization allows the functions to vary. The functions fⱼ are smooth but otherwise unrestricted and are estimated nonparametrically, the parameters βⱼ capture the relative scale of the terms, and the response depends only on the projections αⱼᵀX. The resulting fitted value is the sum of the smooths

    ŷ(X) = E[Y] + Σⱼ₌₁ᵐ βⱼ fⱼ(αⱼᵀX)                                             [2.5]

with parameters estimated as in [2.3]. In relation to exploratory projection pursuit, the model counters the curse of dimensionality by working in a lower dimensional projection space (Huber 1985).
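Evaluating the fitted model [2.5] is just a sum of ridge functions. A minimal sketch, in which the smooth functions are represented by generic callables (an assumption of this illustration, not the thesis's data structures):

```python
import numpy as np

def ppr_predict(X, y_mean, betas, alphas, fs):
    """Evaluate the projection pursuit regression fit [2.5]:
    yhat(x) = E[Y] + sum_j beta_j f_j(alpha_j^T x).

    X      : (n, p) predictor matrix
    betas  : (m,) term coefficients
    alphas : (m, p) unit-length directions
    fs     : list of m callables, the estimated smooth functions
    """
    yhat = np.full(X.shape[0], float(y_mean))
    for beta_j, alpha_j, f_j in zip(betas, alphas, fs):
        yhat += beta_j * f_j(X @ alpha_j)   # ridge function of the projection
    return yhat
```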
Projection pursuit regression is suited to situations in which the relationship is nonlinear but smooth, as opposed to other nonlinear methods such as recursive partitioning. Diaconis and Shahshahani (1984) show that any reasonable function can be approximated by the model for a sufficient number of terms m. Substantial theoretical work remains to be done with respect to the properties of the method versus ordinary regression. In addition, the numerical aspects of the algorithm are difficult, as described next.

2.1.2 The Algorithm
The parameters βⱼ, αⱼ, fⱼ are estimated by minimizing

    min_{βⱼ,αⱼ,fⱼ : j=1,...,m} L₂(β, α, f, X, Y)                                [2.6]

subject to αⱼᵀαⱼ = 1, E[fⱼ] = 0, and Var[fⱼ] = 1 for j = 1, ..., m. The criterion [2.6] cannot be minimized simultaneously in all parameters. Friedman (1985) employs an alternating optimization strategy; results if certain difficulties arise are discussed in Section 2.2.3. First, he considers a specific term k, k = 1, ..., m, while all others are held fixed. For the kth term, the problem is

    min_{βₖ,αₖ,fₖ} E[ Rₖ − βₖ fₖ(αₖᵀX) ]²                                        [2.7]

where

    Rₖ ≡ Y − E[Y] − Σ_{j≠k} βⱼ fⱼ(αⱼᵀX)                                         [2.8]

is the residual of the other terms. The algorithm cycles through the terms in turn, alternating until the objective decreases as little as possible.

The minimizing function estimate fₖ for a fixed direction is found by smoothing Rₖ against αₖᵀX with a scatterplot smoother, and the fitted curve is standardized to mean zero and variance one. The minimizing direction αₖ cannot be determined directly, as seen in [2.7], but rather requires an iterative search procedure. Minimizing the criterion as a function of αₖ is a least-squares problem. In applying optimization methods, if the function to be minimized is of least-squares form, a method which capitalizes specifically on this structure is preferable to directly computing first derivatives and the full Hessian, since the least-squares Hessian simplifies (Gill et al. 1981). The difficulty is that fₖ is not given parametrically but is itself estimated by the smoother, whose properties govern how accurately the objective can be evaluated at a point αₖᵀx. Friedman (1985) capitalizes on the least-squares structure with a Gauss-Newton iterative procedure.
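A stripped-down version of one pass of the alternating strategy can be sketched as follows. The generic `smooth` helper is a hypothetical stand-in for the variable-span smoother of the original algorithm, and the direction update is omitted; only the [2.7]-[2.8] residual-and-smooth step is shown:

```python
import numpy as np

def backfit_term(X, Y, betas, alphas, fs, k, smooth):
    """One pass of [2.7]-[2.8] for term k, holding the other terms fixed.

    smooth(t, r) is assumed to return a callable fitted to the scatter
    of residuals r against projections t (hypothetical helper).
    """
    m = len(betas)
    Rk = Y - Y.mean() - sum(betas[j] * fs[j](X @ alphas[j])
                            for j in range(m) if j != k)   # residual [2.8]
    t = X @ alphas[k]
    f_hat = smooth(t, Rk)                     # smooth Rk against the projection
    vals = f_hat(t)
    mu, sd = vals.mean(), vals.std()
    f_std = lambda u, f=f_hat, mu=mu, sd=sd: (f(u) - mu) / sd  # E=0, Var=1
    beta_k = np.mean(Rk * f_std(t))           # least-squares coefficient
    return beta_k, f_std
```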
2.1.3 Model Selection Strategy

Projection pursuit regression produces a sequence of models indexed by the number of terms. Friedman's procedure fits a large model with M terms and then finds each smaller model using the next larger model as a starting point, discarding its least important term. The importance of a term k in a model of m terms is defined as

    Iₖ ≡ |βₖ|                                                                   [2.10]

the absolute value of its coefficient; because the fⱼ are constrained to have variance one, |βₖ| measures the term's contribution to the model. The statistician then plots the minimized criterion [2.6], which is the model's residual sum of squares, versus the number of terms for each model. In most cases, the plot has an elbow shape. The usual advice is to choose that model closest to the tip of the elbow, where the increase in accuracy due to an additional
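term levels off. The importance ranking used in the backward pruning is one line; a sketch:

```python
import numpy as np

def term_importance(betas):
    """I_k = |beta_k| of [2.10]; valid because each f_j has variance one."""
    return np.abs(np.asarray(betas))

def pruning_order(betas):
    """Indices of terms from most to least important."""
    return np.argsort(-term_importance(betas))
```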
2.2 The Interpretable Projection Pursuit Regression Approach

A projection pursuit regression analysis produces a series of models with number of terms m = 1, ..., M. Each is considered along with the parameters associated with it: the statistician examines each direction αⱼ in order to understand what combination of variables it represents, expressed along with its function fⱼ, while the parameters βⱼ measure the relative importance of the terms.

As with any regression technique, a combinatorial approach to selection of the variables entering the combinations is conceivable, similar to the strategy of Section 1.2.1, but infeasible. As before, a weighted criterion balancing the goodness-of-fit criterion and an interpretability index is employed. The minimization problem [2.6] becomes

    min_{βⱼ,αⱼ,fⱼ : j=1,...,m} (1 − λ) L₂(β, α, f, X, Y) − λ S(α₁, α₂, ..., αₘ)  [2.11]

with interpretability parameter λ ∈ [0, 1]; the simplicity index is subtracted because the problem is a minimization.

2.2.1 The Interpretability Index

The interpretability index S for a group of m vectors (α₁, α₂, ..., αₘ) is a generalization of the index of interpretable exploratory projection pursuit, with two differences. First, the directions αⱼ are constrained to unit length when normed but are not orthogonal to each other in general. Second, the arguments about pairs of combinations differ. In the exploratory setting the simplest pairs of combinations are ones in which different variables load into different rows; in the regression setting there is no such interest: combinations may well be simple yet load into the same columns, so that the same variables appear in all combinations.

Within each combination, individual simplicity is encouraged by the varimax index S_v [1.12], which rewards a vector whose normed squared coefficients vary. Different combinations, however, should not be forced to involve different variables. Summing the simplicities of the individual directions does not achieve the remaining goal, which is to decrease the total number of variables in the model: the directions can be individually simple while spreading across many variables. Consider the m × p matrix Ω of directions whose components are the αⱼᵢ, and define the column weights

    γᵢ ≡ Σⱼ₌₁ᵐ α²ⱼᵢ,   i = 1, ..., p                                            [2.12]

Each component γᵢ is positive and contains the relative contribution of variable i to the model. For the model to involve few variables, the components of γ should be as varied as possible. For γ appropriately normed, this measure of variation is

    V(γ) ≡ (p/(p−1)) Σᵢ₌₁ᵖ ( γᵢ/Σₖγₖ − 1/p )²                                    [2.13]

The index S_g combines the varimax interpretability of each combination with the variation of the total column weights taken over the rows of the matrix [2.14]. This re-expression is in contrast to S_v for interpretable exploratory projection pursuit, which asks the columns of the pair to be dissimilar; here the columns of Ω are instead asked to vary widely in total weight, so the same small subset of variables may serve every term of the model. An example using this criterion appears in Section 2.3.
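A sketch of the group index as reconstructed above: the average varimax simplicity of the individual directions plus the variation of the column weights γ of [2.12]-[2.13]. The equal weighting of the two pieces is an assumption of this reconstruction, not the thesis's exact constant:

```python
import numpy as np

def gamma_weights(A):
    """Column weights gamma_i = sum_j alpha_ji^2 of [2.12]; A is (m, p)."""
    return (A ** 2).sum(axis=0)

def group_index(A):
    """Reconstructed group interpretability index for the rows of A."""
    A = np.asarray(A, dtype=float)
    p = A.shape[1]
    v = lambda w: w ** 2 / (w @ w)
    sv = lambda w: p / (p - 1) * ((v(w) - 1 / p) ** 2).sum()   # varimax [1.12]
    gamma = gamma_weights(A)
    V = p / (p - 1) * ((gamma / gamma.sum() - 1 / p) ** 2).sum()  # [2.13]
    return 0.5 * (np.mean([sv(a) for a in A]) + V)   # assumed equal weighting
```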
2.2.2 Attempts to Include the Number of Terms

In the discussion so far, an implicit argument is that a model with fewer terms is more interpretable. On further reflection, this conclusion is not so certain. The interpretability of a term rests on its combination together with its function fⱼ, which must be visually assessed, and ranking models in order of interpretability can be impossible. Compare (1) a model with m = 2 and terms f₁(X₁), f₂(X₂) against (2) a model with m = 1 and a single term whose combination averages the two variables. If the two variables measure widely varying characteristics (apples vs. oranges), then a combination of the two would be difficult to understand, and the first model is clearly more interpretable. However, if the variables measure similar characteristics (reading and spelling scores) which could be combined easily into a single variable (verbal ability), the second model might be easier to interpret.

As with interpretable exploratory projection pursuit, the statistician is equipped with an interpretability dial. If the number of terms is to be weighted in the measure of simplicity, the index must account for it. In the following discussion, three ways to include the number of terms in an interpretability index are considered; unfortunately, all are unsuccessful. Factors required for a successful measure are delineated below.

The first approach is to multiply the interpretability index by a factor depending on the number of terms. The resulting index is reweighted whenever the number of terms changes, so models of different sizes are not measured on a common scale; instead, the index should be such that submodels of the current model remain comparable as the number of terms increases.

The second approach, as in Chapter 1, thinks of the collection of combinations as a point and measures its distance to a particular simple point, giving an index S_b(α₁, α₂, ..., αₘ) which reduces to an expression in the components γ defined in [2.12]. Unfortunately, even though the index is continuous, the removal of a term changes the minimizing component of the minimization discontinuously, and the derivatives are discontinuous.

The third approach replaces the minimum by an average or total distance toward the simple vectors, which is continuous. Unfortunately, though continuous, the measure does not distinguish the number of terms in the way model selection in linear regression does, and its behavior as terms are removed from the regression model is poor.

The outlook for a single criterion that simultaneously measures the tradeoff among accuracy, simplicity of the combinations, and the number of terms is therefore not promising, and none is proposed here. Instead, the number of terms m is controlled separately, and the one parameter λ controls interpretability for a model of fixed size rather than a strict penalty on m. The strategy is to begin with a large number of terms and to compare, for each m, the path of models the interpretable procedure produces; the model comparison for interpretable projection pursuit regression is described in Section 2.3.

2.2.3 The Optimization
Procedure

An algorithm is employed which minimizes [2.11] for various λ values. Throughout the description, updating a parameter means recomputing the quantities dependent on it. The procedure, in outline:

1. Minimize the objective function [2.6] for the M term model by the method outlined in Friedman and Stuetzle (1981).
2. Use a backwards procedure to generate the smaller models. For i = 1, ..., M − m:
   Rank the terms from most (term 1) to least important (term M − i + 1) as measured by [2.10]. Discard the least important term. Use an alternating procedure to minimize [2.6] over the remaining M − i terms:
   a. For k = 1, ..., M − i: update βₖ and fₖ, and do a single Gauss-Newton step for αₖ, iterating until the term converges.
   b. If the objective decreased sufficiently, perform another pass through the terms (GOTO a). Otherwise, the optimization of the M − i term model is complete.
3. Let i = 1.
4. The objective function is now [2.11] with λ = λᵢ. Use an alternating procedure starting from the previous solution point, taking the terms in reverse order of importance (term m first):
   a. For k = 1, ..., m: choosing from among several steplengths, update the kth term by updating its direction, function, and coefficient. Complete the pass.
   b. If the objective decreased sufficiently, perform another pass (GOTO a). Otherwise, the optimization for the parameter value λᵢ is complete.
5.-6. If i = l, EXIT. Otherwise increment i and GOTO 4.
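An outline of the outer structure of this procedure, in the same hedged spirit; the inner routines below are placeholder callables for the steps described above, not the thesis's code:

```python
def interpretable_ppr(fit_ppr, prune, refit_interpretable, M, m, lambdas):
    """Skeleton of Steps 1-6: fit a large model, prune backwards to m
    terms by importance, then sweep the interpretability dial.

    fit_ppr, prune, refit_interpretable : placeholder callables for the
    routines described in the text (hypothetical names).
    """
    model = fit_ppr(M)                        # Steps 1-2: M-term fit
    while model.n_terms > m:                  # backward pass by importance
        model = prune(model)                  # drop least important, refit [2.6]
    path = []
    for lam in lambdas:                       # Steps 3-6: lambda sweep
        model = refit_interpretable(model, lam)   # minimize [2.11], warm start
        path.append((lam, model))
    return path
```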
The alternating procedure for [2.11] differs from the original in character as well as in detail. The search direction is the negative gradient of the objective rather than a Gauss-Newton direction, since the interpretability term destroys the pure least-squares form of the Hessian; the gradients of the interpretability index are given in Appendix A, and the resulting method is not as accurate as Gauss-Newton. Because the interpretability index couples the terms, an attempt is made to forecast the effect of the simplification of one term on the others. For example, a change in term 1 may permit a simpler configuration of the remaining terms, and the objective decreases as a result; once it changes, the other terms quickly follow suit, like a row of dominoes. While this approximation occasionally sacrifices accuracy in the solution for an individual term, the compromise produces a decrease of the objective across the terms.

Two deliberate changes are made. The first is due to the fact that the interpretability index is a simplicity measure, irrespective of the terms' relative importance: in a simplification of the model, that term which least affects the model fit should be simplified first, giving up accuracy where it costs least. As a result, the terms are looked at in reverse order of importance (Step 4). The second change is that the algorithm permits limited uphill moves, a form of annealing (Lundy 1985): the minimum steplength in Step 4a is positive, so moves that appear impossible when one term is simplified at a time become possible, as is demonstrated in the following example. Both changes affect the behavior of the algorithm more than the final results; occasionally the algorithm is slow at forecasting the effect of one term on the others, and the terms readjust only after several passes.
2.3 The Air Pollution Example

The example is the air pollution data analyzed using additive models fit by conditional expectations in Breiman and Friedman (1985). The daily observations were recorded in Los Angeles. Projection pursuit regression is applied, with the full model fit initially and the backwards procedure generating models with number of terms m = 1, ..., M (Steps 1 and 2 in Section 2.2.3). As noted in the introduction, the accuracy of each model is measured by the fraction of the variance of the response it cannot explain. The inaccuracy is defined as

    U ≡ L₂(β, α, f, X, Y) / Var[Y]

The plot of the number of terms and fraction of unexplained variance of each model is shown in Fig. 2.1.
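The inaccuracy measure is one line in code; a sketch:

```python
import numpy as np

def unexplained_fraction(y, yhat):
    """U: residual sum of squares over total variance of the response."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
```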
Fig. 2.1  Fraction of unexplained variance U versus number of terms m for the air pollution data.

In the usual approach, the statistician weighs the complexity of adding a term against the gain in accuracy. She generally chooses the model where the marginal increase in accuracy due to the next term levels off. In some situations, such an elbow may not exist or it may not be a good model choice in actuality. Only one model for each number of terms m is found; the model space is one dimensional in U. Interpretable projection pursuit regression expands the model space for a particular m by adding an interpretability measure: for each number of terms, the algorithm traces a path of models indexed by λ, each with its own inaccuracy U and simplicity S_g. The starting point is the original model, which has the smallest U.

Lubinsky and Pregibon, through a description space characterization, argue for a view of data analysis more comprehensive than the two-dimensional summary given above, but they agree that two important dimensions of a description are accuracy and parsimony. The U and S_g summary is an extension of this descriptive information in the direction of interpretability.

Initially the λ sequence is (0.0, 0.1, ..., 1.0). However, the resulting path through the model space can consist of a few clumps of models separated by large breaks in the path. In order to eliminate breaks and exhibit a smoother path, additional values of λ are chosen specifically to produce a more continuous sequence, a step made necessary by the nature of the algorithm described in Section 2.2.3. For example, if on the first pass the path has a large hole between two adjacent λ values, additional intermediate values such as (0.33, 0.36, 0.39) are run. Various diagnostic plots display the paths. The interpretability parameter is solely a device for generating a spectrum of models; models are distinguished and compared by their (U, S_g) values together with the number of terms, and ideally the statistician selects a model with acceptable values of all three.
Fig. 2.2  Model paths for the air pollution data for models with number of terms m = 1, ..., 6. Each point indicates the interpretability S_g and fraction of unexplained variance U for a model with the given number of terms.
Fig. 2.3  Model paths for the air pollution data for models with number of terms m = 7, 8, 9. Each point indicates the interpretability S_g and fraction of unexplained variance U for a model with the given number of terms.
However, though the U and S scales appear to be the same for all the graphs, they are not comparable across panels, as implicit in each path is the number of terms m; graphing all models on a common scale without marking m is misleading. A symbolic scatterplot, in which all models are graphed with inaccuracy U versus interpretability S and a plotting symbol that depends on the number of terms, displays the simplicity and accuracy of every model at once. As interpretability increases along a path, the inaccuracy of a model usually increases as well. Occasionally a move results in smaller interpretability, indicating that the algorithm has jumped out of a local minimum and readjusted; for m = 3, this type of behavior is evident for intermediate interpretability values S ∈ [0.2, 0.4], where the partitioning of mass among the coefficients readjusts. A model should be judged by its interpretability value, which is a function of the number of terms m and of the coefficients, rather than by the value of λ that produced it. The draftsman's display (Fig. 2.4) shows, for a particular inaccuracy U, how simplicity S and the number of terms vary, and whether a simpler model must be much less accurate. More formally, the statistician may weigh the number of parameters and the inaccuracy and interpretability criteria against one another.
Fig. 2.4 Draftsman's display for the air pollution data. All possible pairwise scatterplots of number of terms m, fraction of unexplained variance U, and interpretability S are shown.
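A display of this kind is easy to produce with standard tools. A sketch, assuming a hypothetical list of (m, U, S) triples collected during the λ search:

    import itertools
    import matplotlib.pyplot as plt

    def draftsman(models):
        """All pairwise scatterplots of number of terms m, fraction of
        unexplained variance U, and interpretability S (cf. Fig. 2.4)."""
        names = ["m", "U", "S"]
        cols = list(zip(*models))                 # three parallel columns
        pairs = list(itertools.combinations(range(3), 2))
        fig, axes = plt.subplots(1, len(pairs), figsize=(9, 3))
        for ax, (i, j) in zip(axes, pairs):
            ax.scatter(cols[i], cols[j])
            ax.set_xlabel(names[i])
            ax.set_ylabel(names[j])
        fig.tight_layout()
        return fig

    # hypothetical models from the search
    draftsman([(1, 0.45, 0.9), (2, 0.30, 0.6), (3, 0.28, 0.4),
               (7, 0.22, 0.6)]).savefig("draftsman.png")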
The statistician can learn much from the display. For models with few terms, even moderate interpretability (S ≥ 0.40) cannot be achieved without crossing into substantial inaccuracy. A seven term model explains the bulk of the variance; with an inaccuracy of approximately U = 0.25, an interpretability of S = 0.60 is possible, though whether a model with slightly less inaccuracy would serve better is debatable. If an inaccuracy U close to 0.25 is acceptable, a two term model may be chosen, with

    β₁ = 0.21,  α₁ = (0.0, 0.0, 0.0, 0.0, ..., 0.0)ᵀ
    β₂ = 1.10,  α₂ = (0.0, 0.0, 0.3, 0.9, 0.0, −0.2)ᵀ.

The fourth variable, Sandburg Air Force Base temperature, is influential, as is also found by the alternating conditional expectation and additive model approaches (Breiman and Friedman 1985; Hastie and Tibshirani 1984). The last variable, day of the year, also has an effect. The projection pursuit model uses linear combinations of the variables rather than the variables themselves, and combines the terms functionally. As remarked above, the interpretability of the models is a function of the number of terms m and of the coefficient vectors; this qualitative assessment is a further subjective gauging, similar to the weighing of accuracy against complexity, which the automated index does not replace.

In contrast to exploratory projection pursuit, projection pursuit regression is a modeling procedure, and therefore an objective measure of predictive error may be obtained. Methods such as cross-validation may be used, for example, when choosing the smoothness of the fitted functions f̂. Though the gauging of a model's interpretability is subjective, the statistician may prefer one form of description, or one number of terms, over another.
As discussed in the preceding sections, a good objective measure of predictive error can help distinguish one model from another. However, due to a lack of identifiability, cross-validation cannot be included directly in the interpretability (λ) minimization problem. Unfortunately, the interpretability index is a subjective measure of how the complexity of the models increases with an additional term. Thus, interpretable projection pursuit regression is an exploratory modeling method rather than a confirmatory technique.
Chapter 3

Connections and Conclusions

In this chapter, connections between the interpretable modification and other model selection procedures are identified. A comparison of the trading method described in the previous two chapters with established techniques is warranted, and demonstrates the generality of the approach. The discussion of interpretable linear regression and its connections with variable selection techniques is preliminary; conclusions are drawn at the end of the chapter.

3.1 Interpretable Linear Regression

The interpretable modification as applied to linear regression provides an example whose properties are simple and interesting. In this section, the problem is stated as a minimization problem and the accuracy and interpretability tradeoff is described. As in Chapter 2, matrix notation is used. Rather than use random variables, the approach is stated in terms of observed and fitted values, so the tradeoff is a squared distance minimization rather than an expected value minimization. The observed vector Y consists of one response value for each observation, and X is the matrix of predictors; if an intercept term is required, a column of ones may be included in X. The model is

    Y = Xβ + ε.
The parameters β are estimated by minimizing the squared distance between the observed and fitted vectors. The problem in matrix form is

    min_β (Y − Xβ)ᵀ(Y − Xβ).   [3.1]

The least squares estimate is

    β_LS ≡ (XᵀX)⁻¹XᵀY.

The interpretable modification, using the interpretability index S defined in [1.11] and the interpretability parameter λ ∈ [0, 1], is

    min_β (1 − λ) [(Y − Xβ)ᵀ(Y − Xβ)] / [(Y − Xβ_LS)ᵀ(Y − Xβ_LS)] − λ S(β).   [3.2]

The denominator standardizes the squared distance of a candidate fit by that of the least squares fit, so that the combination of the two criteria is sensible. If the correlation of the observed and fitted vectors is used instead of squared distance, the problem becomes a maximization, since the least squares solution has maximum correlation, and the interpretability term is added rather than subtracted. As the interpretability parameter λ increases, the fitted vector ŷ = Xβ̂ moves away from the least squares fit (Fig. 3.1).
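A minimal numerical sketch of the tradeoff [3.2]. The simplicity index here is a stand-in (closeness of the normalized coefficient vector to a signed coordinate axis) rather than the thesis's index, and the derivative-free optimizer is a convenience, not the method of Chapter 2:

    import numpy as np
    from scipy.optimize import minimize

    def interpretable_fit(X, Y, lam, simplicity):
        """Minimize (1 - lam) * standardized squared error - lam * simplicity."""
        beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
        denom = np.sum((Y - X @ beta_ls) ** 2)   # least squares squared distance

        def objective(beta):
            rss = np.sum((Y - X @ beta) ** 2)
            return (1 - lam) * rss / denom - lam * simplicity(beta)

        return minimize(objective, beta_ls, method="Nelder-Mead").x

    def axis_simplicity(beta):
        """Stand-in index: equals 1 when beta lies on a coordinate axis."""
        b = beta / np.linalg.norm(beta)
        return np.max(np.abs(b))

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 3))
    Y = X @ np.array([1.0, 0.8, 0.1]) + rng.normal(scale=0.5, size=50)
    print(interpretable_fit(X, Y, 0.0, axis_simplicity))  # least squares fit
    print(interpretable_fit(X, Y, 0.5, axis_simplicity))  # simpler, less accurate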
Fig. 3.1 Interpretable linear regression. As the interpretability parameter λ increases, the fit moves away from the least squares fit ŷ_LS to the interpretable fit ŷ_ILR in the space spanned by the p predictors.
The interpretable fits ŷ_ILR differ from those of classical subset selection: the p − 1 variables retained by the interpretable method may not be the best p − 1 in a least squares sense. Even if the variable subset is the same, the interpretable fitting coefficients differ, because the index S attempts to pull the β_i apart since a diverse group is considered more simple, whereas the least squares criterion weighs accuracy alone.
The definition of the interpretability index S as a smooth function of the coefficients of a model means that the search through the model space is a smooth optimization. A linear solution, however, cannot be determined analytically; the interpretable estimate must be found by numerical search. In comparison, classical variable selection criteria involve a count function of the number of parameters in a model. For example, the complexity portion of the Mallows (1973) Cp criterion or the Akaike (1974) AIC, recast as an interpretability index, is

    S_M(β) ≡ 1 − p/c,   [3.3]

where p is the number of variables included in the model and c is a constant chosen so that the index lies in [0, 1]. The Mallows variable selection technique uses an unbiased estimate of the model prediction error, and the resulting search through the model space includes only one model for a given number of variables. Alternatively, the interpretability index guides the statistician through the model space in a nonlinear manner. Variable selection is a model estimation procedure in which variables are discarded according to the discrete criterion [3.3]; the interpretable search moves through the space continuously, so it is a model search rather than a model selection procedure. As is discussed in Section 1.3.4, an alternate interpretability index S_d can be thought of as a type of shrinkage. The natural question is then the connection of the modification with ridge regression.
3.2 Comparison With Ridge Regression
As defined in Section 1.3.4, the index S_d involves the distances to a set of simple points V = {v₁, ..., v_J}. Before restricting its values to be ∈ [0, 1], this simplicity index is the negative of the minimum distance to V:

    S_d(β) = − min_{j=1,...,J} (β − v_j)ᵀ(β − v_j).

The predictors X are standardized, and the coefficient vector β is normalized, so that the relative partitioning of mass among the coefficients, not the absolute size of the vector itself, determines simplicity. Though other indices, with properties such as Schur-convexity, are possible, these choices already lead to a nonlinear solution. The simple points need not span the unit sphere in Rᵖ but are just vectors on it; for example, they could be ±e_j, j = 1, ..., p, so that a vector concentrated on one variable is simple. If the optimization problem is reparameterized in matrix form with a new interpretability parameter K, [3.2] becomes

    min_β (Y − Xβ)ᵀ(Y − Xβ) + K min_{j=1,...,J} (β − v_j)ᵀ(β − v_j),  K > 0.   [3.4]
The minimization over the simple points may be placed outside of the first term, since the first term does not depend on j:

    min_{j=1,...,J} min_β (Y − Xβ)ᵀ(Y − Xβ) + K (β − v_j)ᵀ(β − v_j).

The solution of the inner portion of [3.4] for a given simple point v_j is

    β̂ ≡ (XᵀX + KI)⁻¹(XᵀY + K v_j).   [3.5]

This estimated vector is similar to that of ridge regression,

    β̂_R ≡ (XᵀX + KI)⁻¹XᵀY,   [3.6]

which is advised in situations where XᵀX is unstable (Thisted 1976), or under a Bayesian viewpoint with a normal prior assumption. The ridge estimate shrinks from the least squares solution β̂_LS toward the origin as K increases.
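The inner solution [3.5] and the ridge estimate [3.6] differ only in the K v_j term. The sketch below makes the outer minimization over simple points explicit; the choice V = {±e_j} follows the example in the text and the data are synthetic:

    import numpy as np

    def interpretable_estimate(X, Y, K, V):
        """Solve [3.4]: for each simple point v the inner minimizer is
        beta = (X'X + K I)^{-1} (X'Y + K v)   [3.5];
        keep the v with the smallest total criterion."""
        A = X.T @ X + K * np.eye(X.shape[1])
        XtY = X.T @ Y
        best, best_val = None, np.inf
        for v in V:
            beta = np.linalg.solve(A, XtY + K * v)
            val = np.sum((Y - X @ beta) ** 2) + K * np.sum((beta - v) ** 2)
            if val < best_val:
                best, best_val = beta, val
        return best

    def ridge_estimate(X, Y, K):
        """beta_R = (X'X + K I)^{-1} X'Y   [3.6]: shrinkage toward the origin."""
        return np.linalg.solve(X.T @ X + K * np.eye(X.shape[1]), X.T @ Y)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(40, 2))
    Y = X @ np.array([0.9, 0.2]) + rng.normal(scale=0.3, size=40)
    V = [np.array(v, float) for v in ([1, 0], [-1, 0], [0, 1], [0, -1])]
    print(ridge_estimate(X, Y, 5.0))             # pulled toward (0, 0)
    print(interpretable_estimate(X, Y, 5.0, V))  # pulled toward (1, 0)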
The interpretability estimate β̂ [3.5] shrinks away from the least squares solution toward the simple point v₁, say, as the interpretability parameter K increases; the minimizing simple point may change during the procedure, however. If the varimax index is used instead, the result is a solution similar to [3.5] except that the shrinkage is away from the set of 2p points (±e₁, ..., ±e_p). Shrinkage toward a point other than the origin also arises in ridge regression terminology, as noted above.
As Draper and Smith (1981) point out, ridge regression places a restriction on the size of the coefficients; the interpretability approach also places a restriction on the coefficients, as it norms them and pulls them toward simple points. Clearly, the approach is related to shrinkage and to Bayesian estimation problems.

3.3 Interpretability as a Prior

Suppose the errors are independent and identically normally distributed. The negative log likelihood is then, up to constants,

    −log L(β; Y) = (Y − Xβ)ᵀ(Y − Xβ),

and the maximum likelihood estimate minimizes this expression. A distribution placed on the coefficients is called a prior; the estimate which maximizes the posterior distribution combines the likelihood and the prior. Under certain conditions, maximizing the posterior given by this likelihood and an interpretability prior reproduces the reparameterized problem [3.4].
Fig. 3.2 Interpretability prior density for p = 2. The prior f_K(β) is plotted versus the angle of the coefficient vector β in radians for various values of K.
Maximizing the posterior is equivalent to minimizing the negative log likelihood minus the log prior density. To recover [3.4], the prior distribution of the coefficients for a given interpretability parameter value K > 0 is taken to be

    f_K(β) = C_K exp(K S(β)),

a distribution on the sphere of the type discussed by Watson (1983).
The normalizing constant C_K is calculated using numerical integration. The prior is a monotonic function of the simplicity index S₁, as in Fig. 1.2; for p = 2 it is plotted in Fig. 3.2 versus the angle of the coefficient vector, and it is largest at simple vectors such as β = (0, 1).
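For p = 2 the coefficient vector can be parameterized by its angle, so C_K reduces to a one-dimensional integral. In this sketch the simplicity index on the circle is a stand-in that equals 1 on the coordinate axes and 0 midway between them, not the thesis's S₁:

    import numpy as np
    from scipy.integrate import quad

    def S(theta):
        """Stand-in simplicity of beta = (cos t, sin t)."""
        return np.abs(np.cos(2 * theta))

    def prior(theta, K):
        """Interpretability prior f_K(beta) = C_K exp(K S(beta)) on the circle."""
        C_K = 1.0 / quad(lambda t: np.exp(K * S(t)), 0.0, 2 * np.pi)[0]
        return C_K * np.exp(K * S(theta))

    for K in (0.0, 1.0, 5.0):
        # density at a simple vector (angle 0) versus a diffuse one (angle pi/4)
        print(K, prior(0.0, K), prior(np.pi / 4, K))

As K increases the prior concentrates on the simple vectors, reproducing the behavior shown in Fig. 3.2.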
3.4 Future Work

The previous sections connect the interpretable method with other variable selection techniques; similar connections with further methods are worth pursuing. In addition, ideas both for extending the approach and for improving the algorithms are given below.

3.4.1 Further Connections
Schwarz (1978) and C. J. Stone (1981, 1982) suggest other model selection techniques with asymptotic justifications, in the spirit of the Akaike (1974) criteria, with which the interpretable technique could be compared. In addition, the minimum description length principle (Rissanen 1987), which measures the complexity of the description of a model, affords an interesting comparison. Work on shrinkage in stepwise regression (Copas 1983) may provide ways to extend the ridge regression discussion of Section 3.2. Interestingly, the rules noted involve the number of parameters rather than a smooth measure of complexity.

In Chapter 2, the varimax and entropy simplicity measures were chosen with computational and intuitive motivation. Interpretability as measured by these indices might also be examined philosophically; though the terminology differs, the framing of the interpretable approach as the placing of a prior on the coefficients parallels discussions of simplicity such as Rosenkrantz (1977).
3.4.2 Extensions

The interpretable approach turns a combinatorial search through the model space into a numerical one. It can be applied to other description and accuracy measures, and extended to methods whose resulting descriptions are more complicated. Whenever a class of models for which the description and accuracy are known analytically exists, an interpretable optimization procedure can be employed; linear regression variable selection is an example. In situations where feasible all-subsets variable selection procedures do not exist, the interpretable method provides a numerical search which smartly eliminates combinations of variables. At present, examples seem to indicate that the tradeoff must be of the form [3.1]; whether a more general tradeoff can be done, and whether an unstable index makes the search unstable, must still be investigated.
3.4.3
Page 92
and Conclusions
Connections
These methods
Algorithmic
As described
algorithm
could be compared
with
the interpretable
error.
Improvements
in Chapter
would benefit
variant
Fourier
others.
Other rotationally
1, the interpretable
from further
projection
exploratory
improvements.
First,
projection
projection
testing
pursuit
the rotationally
in-
and comparison
with
in Section 1.4.2.
Second, present work involves designing a procedure
optimization
improvement
projection
pursuit
needs further
regression
forecasting
time involved.
procedure
described
in Chapters
in Section
the
2.2.3
moves to a maximizing
3.5 A General Framework

As mentioned in the previous chapters, accuracy is traded for parsimony, sometimes implicitly, in many data analysis problems. The identification and formalization of such tradeoffs suggests a general framework for tackling similar problems. In this section, an example of calculating the accuracy decrease and interpretability increase for a histogram is examined. The definition of interpretability must be broadened to deal with simplifying outcomes, such as the rounding of the binwidth, for which the choices were previously subjective; measuring the interpretability increase is difficult.

3.5.1 The Histogram Example
To draw a histogram of n observations, the binwidth h must be determined. Usually a rule of thumb is employed: h is chosen so that the resulting intervals are simple, usually multiples of a round unit, and so that no observations fall outside the plot. A theoretical approach, used by Scott (1979) to determine the binwidth, is to estimate the value which minimizes the integrated mean squared error of the histogram density estimate f̂ from the true density f,

    IMSE ≡ E ∫ [f̂(x) − f(x)]² dx.   [3.7]

A further rounding step, which generally leads to a slightly different binwidth, can then be applied, and the accuracy cost of rounding can be calculated according to theoretical results.
Minimizing [3.7] yields the optimal binwidth

    h* = [6 / (n ∫ f′(x)² dx)]^{1/3}.   [3.8]

Scott also shows that if the binwidth is multiplied by a factor c, the change in IMSE is

    IMSE(c h*) = [(c³ + 2)/(3c)] IMSE(h*),

a relationship verified via Monte Carlo for normal data. In reality, the true density f, and therefore its derivative, are unknown. Scott suggests using the normal density, with estimated standard deviation, as a reference distribution, giving ĥ_S = 3.49 s n^{−1/3}; Freedman and Diaconis (1981) give an alternative approximation based on the interquartile range, ĥ_FD = 2 (IQR) n^{−1/3}. Both approximations appear to be robust. Given either approximation, ĥ_S or ĥ_FD, the statistician may simplify the binwidth estimate by rounding; the benefits of such simplification are discussed in a moment. For example, the estimate may be rounded so that the intervals are multiples of a round unit. The new binwidth may also be written as

    ĥ* ≡ (1 + e) ĥ_S,

where e is positive or negative depending on whether the estimate is rounded up or down (e ≤ −1 is impossible, since the binwidth is positive).
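The two reference rules and a rounding step in code. The constants in ĥ_S and ĥ_FD are the published rules; the grid of round units is an assumption about what counts as a simple width:

    import numpy as np

    def scott_binwidth(x):
        """Scott's normal-reference rule: h_S = 3.49 s n^(-1/3)."""
        return 3.49 * np.std(x, ddof=1) * len(x) ** (-1 / 3)

    def freedman_diaconis_binwidth(x):
        """Freedman-Diaconis rule: h_FD = 2 IQR n^(-1/3)."""
        q75, q25 = np.percentile(x, [75, 25])
        return 2 * (q75 - q25) * len(x) ** (-1 / 3)

    def round_binwidth(h, units=(1, 2, 2.5, 5, 10)):
        """Round h to the nearest unit times a power of ten."""
        scale = 10.0 ** np.floor(np.log10(h))
        return min((u * scale for u in units), key=lambda c: abs(c - h))

    x = np.random.default_rng(3).normal(loc=50, scale=10, size=500)
    h = scott_binwidth(x)
    print(h, round_binwidth(h), freedman_diaconis_binwidth(x))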
The estimation procedure is:

    binwidth h*              (minimize IMSE)
        ↓
    estimated binwidth ĥ_S   (use normal density as reference)
        ↓
    rounded binwidth ĥ*      (round)
Using the multiplication result above, the relationship between the IMSE of the rounded and unrounded binwidths is

    IMSE(ĥ*) = {[(1 + e)³ + 2] / [3(1 + e)]} IMSE(ĥ_S).

The percent change in IMSE due to rounding can be plotted as a function of the multiplying factor e (Fig. 3.3); certainly, rounding up or down results in a loss of accuracy, but the loss is small for moderate e. The corresponding interpretability repercussions are difficult to measure explicitly. Rounded binwidths are simpler: the resulting interval endpoints are easier to retain in short-term memory, and rounding removes confusion in explaining the histogram or comparing it to another. In fact, Ehrenberg (1981) argues that rounded figures in general are easier to grasp. Finally, prior rounding of the data x_i may itself prompt the choice of a rounded binwidth.
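The quantity plotted in Fig. 3.3 follows directly from this relationship; a small sketch:

    def imse_ratio(e):
        """IMSE((1 + e) h) / IMSE(h) at the optimal h: ((1+e)^3 + 2) / (3(1+e))."""
        c = 1 + e
        return (c ** 3 + 2) / (3 * c)

    for e in (-0.2, -0.1, 0.0, 0.1, 0.2):
        print(f"e = {e:+.1f}: IMSE increases by {100 * (imse_ratio(e) - 1):.1f}%")

Rounding the binwidth by ten percent in either direction costs only about one percent in IMSE.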
Fig. 3.3 Percent change in IMSE due to rounding the binwidth, plotted as a function of the multiplying factor e.

Measuring the extra interpretability of a histogram due to rounding directly is difficult, and the effect of the number of significant digits retained is likewise hard to determine. Experiments might be designed to quantify the gain. A typical experiment might be to divide a statistics class in two, presenting one group with the rounded histogram and the other with an unrounded version. Measurements of interpretability could be made on the basis of the correctness of answers to questions such as How many observations lie between two values? or Where is the density largest? With too much rounding, accuracy may be lost; for example, the question Where is the mode? may become unanswerable.
Alternatively, expert opinion could be used to assign an interpretability value to rounding.

3.5.2 Conclusion

Computers provide the statistician with an amount of flexibility undreamed of previously. On the one hand, this flexibility is welcome. In the initial stages of an analysis, it permits exploratory data analysis (Tukey 1977), to let the data drive the analysis rather than subjecting it to preconceived assumptions. In later stages, questions about the validity of these models or descriptions can be readily answered with computer-intensive confirmational procedures (Efron 1982, 1988). As Tukey (1983) remarks, the question is no longer what can be confirmed? but rather what can be done?. Computing power is extending statisticians' imagination and eliminating computational boundaries; even unconscious restrictions imposed by previous mathematical limitations are alleviated (McDonald 1982).

On the other hand, with these abilities the number of possible models or descriptions has grown enormously, and the results of an analysis may become hard to understand, hard to explain, and unmanageable. Just as grappling with the theoretical demons of new computer-intensive methods such as projection pursuit (Huber 1985) is important, so is a controlled translation and communication of the results of a statistical analysis. The principle of parsimony, trading a little accuracy in return for a substantially simpler description, is helpful here. Fortunately, the very computing power which has cultivated these novel techniques also provides a means to monitor the parsimonious aspects of a description. Interpretable projection pursuit strikes such a balance: the statistician retains her new abilities and communicates the results in a particular, interpretable form, without sacrificing the exploratory nature of the technique.
Appendix A

Gradients

A.1 Interpretable Exploratory Projection Pursuit Gradients

In this section, the gradients of the interpretable exploratory projection pursuit objective function are calculated. The optimization

    max_{α₁,α₂} F(α₁, α₂)   [A.1]

is conducted using a numerical gradient method. The desired gradients are

    dF/dα_j = (∂F/∂α_{j1}, ∂F/∂α_{j2}, ..., ∂F/∂α_{jp})ᵀ,  j = 1, 2.

From [A.1], the gradients may be written in vector notation as

    dF/dα_j = (1 − λ) dI/dα_j + λ dS/dα_j,  j = 1, 2,

where I is the projection index and S the simplicity index.
The Fourier projection index gradients are calculated from [1.23], in which the Laguerre terms L_i(R) appear; differentiating and using the recursive equations from [1.20] yields

    ∂L₀/∂R = 0,
    ∂L₁/∂R = −1,
    ∂L_i/∂R = [(2i − 1 − R)/i] ∂L_{i−1}/∂R − (1/i) L_{i−1} − [(i − 1)/i] ∂L_{i−2}/∂R,  i = 2, 3, ....   [A.2]

The gradients of the radius squared, R² = X₁² + X₂² with X_j = α_jᵀx, follow from the definition:

    dR²/dα_j = 2 X_j x,  j = 1, 2.   [A.3]
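Under the assumption that the L_i are the Laguerre polynomials, the recursion [A.2] can be checked against a finite difference:

    def laguerre_and_derivative(R, n):
        """L_i(R) and dL_i/dR from the recurrences
        i L_i = (2i - 1 - R) L_{i-1} - (i - 1) L_{i-2},
        i L_i' = (2i - 1 - R) L_{i-1}' - L_{i-1} - (i - 1) L_{i-2}'  [cf. A.2]."""
        L, dL = [1.0, 1.0 - R], [0.0, -1.0]
        for i in range(2, n + 1):
            L.append(((2 * i - 1 - R) * L[i - 1] - (i - 1) * L[i - 2]) / i)
            dL.append(((2 * i - 1 - R) * dL[i - 1] - L[i - 1]
                       - (i - 1) * dL[i - 2]) / i)
        return L, dL

    h, R = 1e-6, 0.7
    up, _ = laguerre_and_derivative(R + h, 3)
    dn, _ = laguerre_and_derivative(R - h, 3)
    _, dL = laguerre_and_derivative(R, 3)
    print((up[3] - dn[3]) / (2 * h), dL[3])  # the two numbers should agree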
In [A.3], note that x is a vector ∈ Rᵖ, while X₁ and X₂ are scalars. As with the calculation of the simplicity index itself, the simplicity gradient notation may be translated from the normalized coefficient vectors β₁, β₂ to the directions α₁, α₂ by the chain rule:

    dS/dα₁ = (∂S/∂β₁)(∂β₁/∂α₁) + (∂S/∂β₂)(∂β₂/∂α₁),
    dS/dα₂ = (∂S/∂β₁)(∂β₁/∂α₂) + (∂S/∂β₂)(∂β₂/∂α₂).   [A.4]

In [A.4], note that each partial derivative of a β vector with respect to an α vector is a p × p matrix; for example,

    ∂β₁/∂α₁ ≡ (∂β_{1r}/∂α_{1s}),  r, s = 1, ..., p.

Differentiating the normalization [1.25] yields the off-diagonal elements, which involve products of the components β_{1r} and β_{2s}, for r, s = 1, ..., p and r ≠ s; the diagonal elements are obtained in the same way.
The simplicity index is a combination of the individual simplicities S₁(β₁) and S₁(β₂) and a cross-term, denoted as C. Taking partial derivatives of the combination with respect to the components of β₁ yields, for r = 1, ..., p, expressions involving β_{1r}, (p − 1)S₁(β₁), (p − 1)S₁(β₂), and C(β₁, β₂). The partial derivative with respect to β₂ is identical, with β₂ components replacing β₁ components.
A.2 Interpretable Projection Pursuit Regression Gradients

In this section, the gradients for interpretable projection pursuit regression are calculated. The objective function is

    F(β, α, f; X, Y) ≡ (1 − λ) L²(β, α, f; X, Y)/Var(Y) − λ S(α₁, α₂, ..., α_m).   [A.5]

The desired gradients with respect to the directions α_j may be written in vector notation as

    dF/dα_j = (∂F/∂α_{j1}, ∂F/∂α_{j2}, ..., ∂F/∂α_{jp})ᵀ,  j = 1, ..., m.

Friedman (1985) calculates the gradients of the L² distance as

    ∂L²/∂α_j = −2 E{[R_j − β_j f_j(α_jᵀX)] β_j f_j′(α_jᵀX) X},  j = 1, ..., m,

where R_j is the residual with the jth term removed and the derivatives f_j′ are computed using interpolation.
The partial derivatives of the simplicity index with respect to the components α_{ji}, for j = 1, ..., m and i = 1, ..., p, follow by the chain rule through the normalized squared loadings w_{jk} ≡ α_{jk}²/(α_jᵀα_j):

    ∂w_{jk}/∂α_{ji} = 2 α_{jk} [(α_jᵀα_j) δ_{ki} − α_{jk} α_{ji}] / (α_jᵀα_j)²,  j = 1, ..., m and i, k = 1, ..., p,

where δ_{ki} is the Kronecker delta.
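The quotient-rule expression above can be verified numerically; a sketch, with w the normalized squared loadings assumed in this reconstruction (a single direction α, so the index j is dropped):

    import numpy as np

    def w(alpha):
        """Normalized squared loadings: w_k = alpha_k^2 / (alpha' alpha)."""
        return alpha ** 2 / (alpha @ alpha)

    def dw_dalpha(alpha):
        """Analytic Jacobian J[k, i] = dw_k / dalpha_i."""
        n, p = alpha @ alpha, len(alpha)
        J = np.empty((p, p))
        for k in range(p):
            for i in range(p):
                J[k, i] = 2 * alpha[k] * ((n if k == i else 0.0)
                                          - alpha[k] * alpha[i]) / n ** 2
        return J

    alpha, h = np.array([0.6, -0.3, 0.9]), 1e-6
    fd = np.column_stack([(w(alpha + h * np.eye(3)[i]) - w(alpha - h * np.eye(3)[i]))
                          / (2 * h) for i in range(3)])
    print(np.max(np.abs(fd - dw_dalpha(alpha))))  # near zero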
References

Akaike, H. (1974). A new look at the statistical model identification, IEEE Transactions on Automatic Control AC-19, 716-723.

Asimov, D. (1985). The grand tour: a tool for viewing multidimensional data, SIAM Journal on Scientific and Statistical Computing 6, 128-143.

Bellman, R. E. (1961). Adaptive Control Processes, Princeton University Press, Princeton.

Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion), Journal of the American Statistical Association 80, 580-619.

Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion), Annals of Statistics 17, 453-555.

Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983). Graphical Methods for Data Analysis, Wadsworth, Boston.

Copas, J. B. (1983). Regression, prediction and shrinkage (with discussion), Journal of the Royal Statistical Society, Series B 45, 311-354.

Dawes, R. M. (1979). The robust beauty of improper linear models in decision making, American Psychologist 34, 571-582.

Diaconis, P. (1987). Personal communication.

Diaconis, P. and Freedman, D. (1981). On the histogram as a density estimator: L2 theory, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 57, 453-476.

Diaconis, P. and Shahshahani, M. (1984). On nonlinear functions of linear combinations, SIAM Journal on Scientific and Statistical Computing 5, 175-191.

Donoho, D. L. and Johnstone, I. M. (1989). Projection-based approximation and a duality with kernel methods, Annals of Statistics 17, 58-106.

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, second edition, Wiley, New York.

Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans, CBMS-NSF Regional Conference Series 38, SIAM, Philadelphia.

Efron, B. (1988). Computer-intensive methods in statistical regression, SIAM Review 30, 421-449.

Ehrenberg, A. S. C. (1981). The problem of numeracy, The American Statistician 35, 67-71.

Friedman, J. H. (1984). SMART user's guide, Technical Report LCS001, Department of Statistics, Stanford University.

Friedman, J. H. (1985). Classification and multiple regression through projection pursuit, Technical Report LCS012, Department of Statistics, Stanford University.

Friedman, J. H. (1987). Exploratory projection pursuit, Journal of the American Statistical Association 82, 249-266.

Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression, Journal of the American Statistical Association 76, 817-823.

Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis, IEEE Transactions on Computers C-23, 881-889.

Gill, P. E., Murray, W. and Wright, M. H. (1981). Practical Optimization, Academic Press, London.

Gill, P. E., Murray, W., Saunders, M. A. and Wright, M. H. (1986). User's guide for NPSOL, Technical Report, Department of Operations Research, Stanford University.

Good, I. J. (1968). Corroboration, explanation, evolving probability, simplicity and a sharpened razor, British Journal for the Philosophy of Science 19, 123-143.

Good, I. J. and Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities, Biometrika 58, 255-277.

Gorsuch, R. L. (1983). Factor Analysis, Lawrence Erlbaum Associates, New Jersey.

Hall, P. (1987). On polynomial-based projection indices for exploratory projection pursuit, Annals of Statistics 17, 589-605.

Harman, H. H. (1976). Modern Factor Analysis, The University of Chicago Press, Chicago.

Hastie, T. and Tibshirani, R. (1984). Generalized additive models, Technical Report LCS002, Department of Statistics, Stanford University.

Huber, P. (1985). Projection pursuit (with discussion), Annals of Statistics 13, 435-525.

Jones, M. C. (1983). The Projection Pursuit Algorithm for Exploratory Data Analysis, Ph.D. Dissertation, University of Bath.

Jones, M. C. and Sibson, R. (1987). What is projection pursuit? (with discussion), Journal of the Royal Statistical Society, Series A 150, 1-36.

Krzanowski, W. J. (1987). Selection of variables to preserve multivariate data structure, using principal components, Applied Statistics 36, 22-33.

Lubinsky, D. and Pregibon, D. (1988). Data analysis as search, Journal of Econometrics 38, 247-268.

Lundy, M. (1985). Applications of the annealing algorithm to combinatorial problems in statistics, Biometrika 72, 191-198.

Mallows, C. L. (1973). Some comments on Cp, Technometrics 15, 661-676.

Mallows, C. L. (1983). Data description, in Scientific Inference, Data Analysis, and Robustness, Academic Press, New York, 135-151.

Marshall, A. W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications, Academic Press, New York.

McCabe, G. P. (1984). Principal variables, Technometrics 26, 137-144.

McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models, Chapman and Hall, London.

McDonald, J. A. (1982). Interactive Graphics for Data Analysis, Ph.D. Dissertation, Department of Statistics, Stanford University.

McDonald, J. A. and Pedersen, J. (1985). Computing environments for data analysis, part I: introduction, SIAM Journal on Scientific and Statistical Computing 6, 1004-1012.

Reinsch, C. H. (1967). Smoothing by spline functions, Numerische Mathematik 10, 177-183.

Rényi, A. (1961). On measures of entropy and information, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 547-561, University of California Press, Berkeley.

Rissanen, J. (1987). Stochastic complexity (with discussion), Journal of the Royal Statistical Society, Series B 49, 223-265.

Rosenkrantz, R. D. (1977). Inference, Method and Decision, Reidel, Boston.

Schwarz, G. (1978). Estimating the dimension of a model, Annals of Statistics 6, 461-464.

Scott, D. (1979). On optimal and data-based histograms, Biometrika 66, 605-610.

Silverman, B. W. (1984). Penalized maximum likelihood estimation, in Encyclopedia of Statistical Sciences, eds. S. Kotz and N. L. Johnson, Wiley, New York, 664-667.

Sober, E. (1975). Simplicity, Clarendon Press, Oxford.

Stone, C. J. (1981). Admissible selection of an accurate and parsimonious normal linear regression model, Annals of Statistics 9, 475-485.

Stone, C. J. (1982). Local asymptotic admissibility of a generalization of Akaike's model selection rule, Annals of the Institute of Statistical Mathematics 34, 123-133.

Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz, Journal of the Royal Statistical Society, Series B 41, 276-278.

Sun, J. (1989). P-values in Projection Pursuit, Ph.D. Dissertation, Department of Statistics, Stanford University.

Thisted, R. A. (1976). Ridge Regression, Minimax Estimation and Empirical Bayes Methods, Ph.D. Dissertation, Department of Statistics, Stanford University.

Thurstone, L. L. (1935). The Vectors of Mind, University of Chicago Press, Chicago.

Tukey, J. W. (1961). Discussion, emphasizing the connection between analysis of variance and spectrum analysis, Technometrics 3, 201-202.

Tukey, J. W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading, MA.

Tukey, J. W. (1983). Another look at the future, in Computer Science and Statistics: Proceedings of the 14th Symposium on the Interface, eds. K. Heiner, R. Sacher and J. Wilkinson, Springer-Verlag, New York.

Tukey, P. A. and Tukey, J. W. (1981). Preparation; prechosen sequences of views, in Interpreting Multivariate Data, ed. V. Barnett, Wiley, New York, 189-213.

Watson, G. S. (1983). Statistics on Spheres, Wiley, New York.