
SLAC REPORT-355
STAN-LCS-106
UC-405

INTERPRETABLE PROJECTION PURSUIT*

SALLY CLAIRE MORTON

Stanford Linear Accelerator Center
Stanford University
Stanford, California 94309

OCTOBER 1989

Prepared for the Department of Energy under contract number DE-AC03-76SF00515.
Printed in the United States of America. Available from the National Technical
Information Service, U.S. Department of Commerce, 5285 Port Royal Road,
Springfield, Virginia 22161. Price: Printed Copy A06; Microfiche A01.

*Ph.D. Dissertation

Abstract

The goal of this thesis is to modify projection pursuit by trading accuracy for interpretability. The modification produces a more parsimonious and understandable model without sacrificing the structure which projection pursuit seeks. The method retains the nonlinear versatility of projection pursuit while clarifying the results.

Following an introduction which outlines the dissertation, the first and second chapters contain the technique as applied to exploratory projection pursuit and projection pursuit regression respectively. The interpretability of a description is measured as the simplicity of the coefficients which define its linear projections. Several interpretability indices for a set of vectors are defined based on the ideas of rotation in factor analysis and entropy. A roughness penalty approach is used to search for a more parsimonious interpretable description, with interpretability replacing smoothness. The two methods require slightly different interpretability indices due to their contrary goals. In the former case, a rotationally invariant projection index is needed and defined. In the latter, alterations in the original weighting algorithm are required. The computational algorithms for both interpretable exploratory projection pursuit and interpretable projection pursuit regression are described. Examples of real data are considered in each situation.


The third chapter deals with the connections between the proposed modification and other ideas which seek to produce more interpretable models. The modification as applied to linear regression is shown to be analogous to a nonlinear continuous method of variable selection. It is compared with other variable selection techniques and is analyzed in a Bayesian context. Possible extensions to other data analysis methods are cited and avenues for future research are identified.

The conclusion addresses the issue of sacrificing accuracy for parsimony in general. An example of calculating the tradeoff between accuracy and interpretability due to a common simplifying action, namely rounding the binwidth for a histogram, illustrates the applicability of the approach.

Acknowledgments

I am grateful to my principal advisor Jerry Friedman for his guidance and enthusiasm. I also thank my secondary advisors and examiners: Persi Diaconis, Brad Efron, Joe Oliger; my teachers and colleagues: Ani Adhikari, Kirk Cameron, Tom DiCiccio, David Draper, Jim Hodges, Iain Johnstone, Mark Knowles, Michael Martin, Art Owen, Daryl Pregibon, John Rolph, Joe Romano, Anne Sheehy, David Siegmund, Hal Stern; and my friends: Mark Barnett, Ginger Brower, Renata Byl, Ray Cowan, Marty Dart, Judi Davis, Glen Diener, Heather Gordon, Arla Haggerty, Curt Lasher, Holly LeCount, Alice Lundin, Michele Marincovich, Mike Strange and Joan Winters.

This work was supported in part by the Department of Energy, Grant DE-AC03-76SF00515.

I dedicate this thesis to my parents, sister, and brothers, who inspire me by example.

Table of Contents

Abstract ........ iii
Acknowledgments ........ v
Introduction ........ 1
1. Interpretable Exploratory Projection Pursuit ........ 4
  1.1 The Original Exploratory Projection Pursuit Technique ........ 4
    1.1.1 Introduction ........ 4
    1.1.2 The Algorithm ........ 7
    1.1.3 The Legendre Projection Index ........ 9
    1.1.4 The Automobile Example ........ 11
  1.2 The Interpretable Exploratory Projection Pursuit Approach ........ 14
    1.2.1 A Combinatorial Strategy ........ 14
    1.2.2 A Numerical Optimization Strategy ........ 15
  1.3 The Interpretability Index ........ 17
    1.3.1 Factor Analysis Background ........ 17
    1.3.2 The Varimax Index For a Single Vector ........ 19
    1.3.3 The Entropy Index For a Single Vector ........ 22
    1.3.4 The Distance Index For a Single Vector ........ 24
    1.3.5 The Varimax Index For Two Vectors ........ 26
  1.4 The Algorithm ........ 27
    1.4.1 Rotational Invariance of the Projection Index ........ 28
    1.4.2 The Fourier Projection Index ........ 29
    1.4.3 Projection Axes Restriction ........ 35
    1.4.4 Comparison With Factor Analysis ........ 37
    1.4.5 The Optimization Procedure ........ 39
  1.5 Examples ........ 42
    1.5.1 An Easy Example ........ 42
    1.5.2 The Automobile Example ........ 46
2. Interpretable Projection Pursuit Regression ........ 57
  2.1 The Original Projection Pursuit Regression Technique ........ 57
    2.1.1 Introduction ........ 57
    2.1.2 The Algorithm ........ 58
    2.1.3 Model Selection Strategy ........ 60
  2.2 The Interpretable Projection Pursuit Regression Approach ........ 62
    2.2.1 The Interpretability Index ........ 63
    2.2.2 Attempts to Include the Number of Terms ........ 65
    2.2.3 The Optimization Procedure ........ 68
  2.3 The Air Pollution Example ........ 73
3. Connections and Conclusions ........ 82
  3.1 Interpretable Linear Regression ........ 82
  3.2 Comparison With Ridge Regression ........ 86
  3.3 Interpretability as a Prior ........ 88
  3.4 Future Work ........ 90
    3.4.1 Further Gradients ........ 90
    3.4.2 Extensions ........ 91
    3.4.3 Algorithmic Improvements ........ 92
  3.5 A General Framework ........ 92
    3.5.1 The Histogram Example ........ 93
    3.5.2 Conclusion ........ 97
Appendix A ........ 99
  A.1 Interpretable Exploratory Projection Pursuit Gradients ........ 99
  A.2 Interpretable Projection Pursuit Regression Gradients ........ 102
References ........ 104

Figure Captions

[1.1] Most structured projection scatterplot of the automobile data according to the Legendre index. ........ 12
[1.2] Varimax interpretability index for q = 1, p = 2. ........ 20
[1.3] Varimax interpretability index for q = 1, p = 3. ........ 21
[1.4] Varimax interpretability index contours for q = 1, p = 3. ........ 21
[1.5] Simulated data with n = 200 and p = 2. ........ 43
[1.6] Projection and interpretability indices versus λ for the simulated data. ........ 44
[1.7] Projected simulated data histograms for various values of λ. ........ 45
[1.8] Most structured projection scatterplot of the automobile data according to the Fourier index. ........ 47
[1.9] Most structured projection scatterplot of the automobile data according to the Legendre index. ........ 48
[1.10] Projection and interpretability indices versus λ for the automobile data. ........ 49
[1.11] Projected automobile data scatterplots for various values of λ. ........ 50
[1.12] Parameter trace plots for the automobile data. ........ 53
[1.13] Country of origin projection scatterplot of the automobile data. ........ 55
[2.1] Fraction of unexplained variance U versus number of terms m for the air pollution data. ........ 74
[2.2] Model paths for the air pollution data for models with number of terms m = 1, ..., 6. ........ 76
[2.3] Model paths for the air pollution data for models with number of terms m = 7, 8, 9. ........ 77
[2.4] Draftsman's display for the air pollution data. ........ 79
[3.1] Interpretable linear regression. ........ 84
[3.2] Interpretability prior density for p = 2. ........ 89
[3.3] Percent change in IMSE versus multiplying fraction e in the binwidth example. ........ 96

List of Tables

[1.1] Most structured Legendre plane index values for the automobile data. ........ 29
[1.2] Linear combinations for the automobile data. ........ 51
[1.3] Abbreviated linear combinations for the automobile data. ........ 52

Introduction

The goal of this thesis is to modify projection pursuit by trading accuracy for more interpretability in the results. The two techniques examined are exploratory projection pursuit (Friedman 1987) and projection pursuit regression (Friedman and Stuetzle 1981). The former is an exploratory data analysis method which produces a description of a group of variables. The latter is a formal modeling procedure which determines the relationship of a dependent variable to a set of predictors.

The common outcome of all projection pursuit methods is a collection of vectors which define the directions of the linear projections. The remaining component is nonlinear and is summarized pictorially rather than mathematically. For example, in exploratory projection pursuit the description contains the projection directions and the histograms of the projected data points. In projection pursuit regression, the model contains the linear combinations of predictors and the smooths of the dependent variable versus the projected predictors. Given a dataset of n observations of p variables each, suppose q projections are made. The resulting matrix is q × p, each row corresponding to a projection. The statistician is faced with a collection of vectors and a nonlinear graphic representation. The statistician must try to understand and explain these vectors, both singly and as a group, in the context of the original p variables and in relation to the nonlinear components. The purpose of this thesis is to illustrate a method for trading some of the accuracy in the description or model in return for more interpretability and parsimony in the matrix. The object is to retain the versatility and flexibility of this promising technique while increasing the clarity of the results.

In this dissertation, interpretability is used in a similar yet broader sense than parsimony, or simplicity, which may be thought of as a special case. The principle of parsimony is that as few parameters as possible should be used in a description or model. Tukey stated the concept in 1961 as "It may pay not to try to describe in the analysis the complexities that are really present in the situation." Several methods exist which choose more parsimonious models, such as Mallows' (1973) Cp in linear regression which balances the number of parameters and prediction error. Another example is the work of Dawes (1979), who restricts the parameters in linear models based on standardized data to be 0 or ±1, calling the resulting descriptions improper linear models. His conclusion is that these models predict almost as well as ordinary linear regressions and do better than clinical intuition based on experience.

Throughout this thesis, accuracy is not measured in terms of the prediction of future observations but rather refers to the goodness-of-fit of the model or description to the particular data. Parsimony, while considering solely the number of parameters in a model, shares the general philosophical goals of interpretability. These goals are to produce results which are

(i) easier to understand.
(ii) easier to compare.
(iii) easier to remember.
(iv) easier to explain.

The adjective simple is used interchangeably with interpretable throughout this thesis, as is complex with uninterpretable. However, the new term interpretability is included in part to distinguish this notion from that of simplicity, which receives extensive treatment in the philosophical and Bayesian literature.

The quantification of interpretability is a difficult problem. The concept is not easy to define or measure. As Sober (1975) comments, the diversity of our intuitions about simplicity "... presents a veritable chaos of opinion." Fortunately, some situations are easier than others. In particular, linear combinations as produced by projection pursuit and many other data analysis methods lend themselves readily to the development of an interpretability index. This mathematical index can serve as a cognostic (Tukey 1983), or a diagnostic which can be interpreted by a computer rather than a human, in an automatic search for more interpretable results.

Exploratory projection pursuit is considered first in Chapter 1. A short review of the original method motivates the simplifying modification. The algorithmic approach chosen is supported versus alternative strategies. Various interpretability indices are developed. The modification requires changes to the original algorithm to be employed. An example of the resulting interpretable exploratory projection pursuit method is examined. Projection pursuit regression is considered in a similar manner in Chapter 2. The differing goals of this second procedure compel changes in the interpretability index. The chapter concludes with an example.

Chapter 3 connects the new approach with established techniques of trading accuracy for interpretability. Extensions of this modification to other data analysis methods that might benefit from it are proposed. The thesis closes with a general application of the tradeoff between accuracy and interpretability. This work is an example of an approach which simplifies the complex results of a novel statistical method. The hope is that the framework described within will be used by others in similar circumstances.

Chapter 1

Interpretable Exploratory Projection Pursuit

In this chapter, interpretable exploratory projection pursuit is demonstrated. Section 1.1 presents the basic concepts and goals of the original technique and outlines the algorithm. An example which provides the motivation for the new approach is included. The modification is described and support for the strategy chosen is given in the next section. The new method requires that a simplicity index be defined, which is discussed in Section 1.3. Section 1.4 details the algorithm, and its application to the example is described in the final section.

1.1 The Original Exploratory Projection Pursuit Technique

Exploratory projection pursuit (Friedman 1987) is an extension and improvement of the algorithm presented by Friedman and Tukey in 1974. It is a computer intensive data analysis tool for understanding high dimensional datasets. The method helps the statistician look for structure in the data without requiring any initial assumptions, while providing the basis for future formal modeling.

1.1.1 Introduction

Classical multivariate methods such as principal components analysis or discriminant analysis can be used successfully when the data is elliptical or normal in nature and well-described by its first few moments. Exploratory projection pursuit is designed to deal with the type of nonlinear structure these older techniques are ill-equipped to handle.

The method linearly projects the data cloud onto a line (one dimensional exploratory projection pursuit) or onto a plane (two dimensional exploratory projection pursuit). By reducing the dimensionality of the problem while maintaining the same number of datapoints, the technique overcomes the curse of dimensionality (Bellman 1961). This malady is due to the fact that a huge number of points is required before structure is revealed in high dimensional space. A linear projection is chosen for two reasons (Friedman 1987). First, the definition is easier to understand as it consists of one or two linear combinations of the original variables. Second, a linear projection does not exaggerate structure in the data as it is a smoothed shadow of the actual datapoints.

The goal of exploratory projection pursuit is to find projections which exhibit structure. Initially, the idea was to let the statistician interactively choose interesting views by eye (McDonald 1982, Asimov 1985). Because the time required to perform an exhaustive search was prohibitive, the method was automated. Structure is no longer measured on a multidimensional human pattern recognition scale but rather on a mathematical one. This simplification means that numerous possible indices may be defined. A projection index which measures the structure of a view is defined, and the space is searched via a computer numerical optimization. Thus the scheme, while making the analysis feasible for large datasets, requires the careful choice of a projection index.

As Huber (1985) points out, many classical techniques are forms of exploratory projection pursuit for specific projection index choices. For example, consider principal components analysis. Let Y be a random vector in ℝᵖ.

The definition of the jth principal component is the solution of the maximization problem

max_{βⱼ} Var[βⱼᵀY]  subject to  βⱼᵀβⱼ = 1 and βⱼᵀY is uncorrelated with all previous principal components

Thus, principal components analysis is an example of one dimensional projection pursuit with variance (Var) as the projection index. Variance is a global or general measure of how structured a view is. In contrast, the novelty and applicability of exploratory projection pursuit lies in its ability to recognize nonlinear or local structure. As remarked previously, many definitions of structure and corresponding projection index choices exist.
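To make the equivalence concrete, the following minimal sketch (illustrative only; the data and names are not from the thesis) computes the first principal component as the solution of the maximization problem above and checks that no other unit direction yields a larger projection variance.

    import numpy as np

    # Illustrative sketch: the first principal component solves
    # max_beta Var[beta' Y] subject to beta' beta = 1, i.e., one
    # dimensional projection pursuit with variance as the index G.
    rng = np.random.default_rng(0)
    Y = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 2.0]], size=500)

    cov = np.cov(Y, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    beta1 = eigvecs[:, -1]            # eigenvector of the largest eigenvalue

    # Grid check: no unit vector beats the eigenvector's projection variance.
    angles = np.linspace(0.0, np.pi, 1000)
    grid_best = max(np.var(Y @ np.array([np.cos(a), np.sin(a)])) for a in angles)
    assert grid_best <= np.var(Y @ beta1) * 1.001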

All present indices, however, depend on the same basic premise. The idea is that though interesting is difficult to define or agree on, uninteresting is clearly normality. Friedman (1987) and Huber (1985) provide theoretical support for this choice. Effectively, the normal distribution can be explored adequately using traditional methods which explain covariance structure. Exploratory projection pursuit is attempting to address situations for which these methods are not applicable.

The statistician must choose a metric by which to measure distance from normality. It is this normal origin that leads the algorithm to views she holds interesting. The choice is thus based on individual preference for the type of structure to be found. Desirable computing properties and invariance considerations also affect the decision. These properties have been neglected so far in this discussion; they are discussed in the concrete context of a particular index as the original exploratory projection pursuit algorithm is detailed in the next subsection.

1.1.2 The Algorithm
Two dimensional exploratory projection pursuit is more interesting and useful than one dimensional, so the former is discussed solely. The two dimensional situation also raises intriguing problems when interpretable projection pursuit is considered which need not be addressed in the one dimensional case. For the present, the goal is to find one structured plane. Actually, the data may have several interesting views or local projection index optima and the algorithm should find as many as possible. Section 1.5.3 addresses this point.

The original algorithm presented by Friedman (1987) is reviewed in the abstract version due to Huber (1985), thereby simplifying the notation. Though the data consists of n observations of length p, consider first a random variable Y ∈ ℝᵖ. The goal is to find linear combinations (β₁, β₂) which

max_{β₁,β₂} G(β₁ᵀY, β₂ᵀY)
subject to  Var[β₁ᵀY] = Var[β₂ᵀY] = 1  and  Cov[β₁ᵀY, β₂ᵀY] = 0    [1.1]

where G is the projection index which measures the structure of the projection density. The constraints on the linear combinations ensure that the structure seen in the plane is not due to covariance (Cov) effects which can be dealt with by classical methods.
Initially, the original data Y is sphered (Tukey and Tukey 1981). The sphered variable Z ∈ ℝᵖ is defined as

Z ≡ D^(−1/2) Uᵀ (Y − E[Y])    [1.2]

with U and D resulting from an eigensystem decomposition of the covariance matrix Σ of Y. That is,

Σ = E[(Y − E[Y])(Y − E[Y])ᵀ] = U D Uᵀ    [1.3]

with U the orthonormal matrix of eigenvectors of Σ, and D the diagonal matrix of associated eigenvalues. The axes in the sphered space are the principal component directions of Y. The previous optimization problem [1.1] involving Y can be translated to one involving linear combinations α₁, α₂ and projections of Z:

max_{α₁,α₂} G(α₁ᵀZ, α₂ᵀZ)
subject to  α₁ᵀα₁ = α₂ᵀα₂ = 1  and  α₁ᵀα₂ = 0    [1.4]
The fact that the standardization conditions, imposed in the original problem to exclude covariance structure which the technique is not concerned with, are now geometric constraints reduces the computational work required (Friedman 1987). Thus, all numerical calculations are actually performed on the sphered data Z. The sphering, however, is merely a computational convenience and is invisible to the statistician. In subsequent notation, the parameters of a projection index G are (β₁ᵀY, β₂ᵀY) though the value of the index is calculated for the sphered data [1.2]. She associates the value of the index with the visual projection in the original data space.

After the maximizing sphered data plane is found, the vectors α₁ and α₂ are translated to the unsphered original variable space via

β₁ = U D^(−1/2) α₁,   β₂ = U D^(−1/2) α₂    [1.5]

Since only the direction of the vectors matters, these combinations are usually normed as a final step.
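A minimal numerical sketch of the sphering transformation [1.2]-[1.3] and the back-translation [1.5] is given below; the function names are illustrative and not from the original program.

    import numpy as np

    def sphere(Y):
        """Sphered variable Z = D^(-1/2) U' (Y - E[Y]) per [1.2]-[1.3]."""
        Yc = Y - Y.mean(axis=0)
        cov = np.cov(Yc, rowvar=False)       # Sigma = U D U'
        D, U = np.linalg.eigh(cov)
        Z = Yc @ U / np.sqrt(D)              # rows are sphered observations
        return Z, U, D

    def unsphere_direction(alpha, U, D):
        """Back-translate a direction: beta = U D^(-1/2) alpha per [1.5]."""
        beta = U @ (alpha / np.sqrt(D))
        return beta / np.linalg.norm(beta)   # normed as a final step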


The variance and correlation constraints on β₁ and β₂ can be written as

β₁ᵀ Σ β₁ = β₂ᵀ Σ β₂ = 1  and  β₁ᵀ Σ β₂ = 0    [1.6]

Thus, β₁ and β₂ are orthogonal in the covariance metric while α₁ and α₂ are orthogonal geometrically.

The optimization method used to solve the maximization problem [1.4] is a coarse search followed by an application of steepest descent, a derivative-based optimization procedure. The initial survey of the index space via a coarse stepping approach helps the algorithm avoid deception by a small local maximum. The second stage employs the derivatives of the index to make an accurate search in the vicinity of a good starting point.

The numerical optimization procedure requires that the projection index possess certain computational properties. The index should be fast and stable to compute, and must be differentiable. These criteria surface again with respect to the interpretability index defined in Section 1.3.

1.1.3 The Legendre Projection Index

Friedman's (1987) Legendre index exhibits these properties. He begins by transforming the sphered projections to a square with the definition
R₁ ≡ 2Φ(α₁ᵀZ) − 1
R₂ ≡ 2Φ(α₂ᵀZ) − 1

where Φ is the cumulative distribution function of a standard normal random variable. Under the null hypothesis that the projections are normal and uninteresting, the density p(R₁, R₂) is uniform on the square [−1, 1] × [−1, 1]. As a measure of nonnormality, he takes the integral of the squared distance from the uniform

G_L(β₁ᵀY, β₂ᵀY) ≡ ∫₋₁¹ ∫₋₁¹ [ p(R₁, R₂) − 1/4 ]² dR₁ dR₂    [1.7]

He expands the density p(R₁, R₂) using Legendre polynomials, which are orthogonal on the square with respect to a uniform weight function. This expansion, along with subsequent integration taking advantage of the orthogonality relationships, yields an infinite sum involving expected values of the Legendre polynomials in R₁ and R₂, which are functions of the random variable Y. In application, the expansion is truncated and sample moments replace theoretical moments. Thus, instead of E[f(Y)] for a function f, the sample mean over the n observations y₁, y₂, ..., yₙ is calculated.
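For illustration, the transformation to the square is a one-line computation on the projected data; the names in this sketch are hypothetical.

    import numpy as np
    from scipy.stats import norm

    def to_square(Z, alpha1, alpha2):
        """Map sphered projections to [-1,1]^2: R_j = 2*Phi(alpha_j' Z) - 1.
        Under the null hypothesis the pairs (R1, R2) fill the square uniformly."""
        R1 = 2.0 * norm.cdf(Z @ alpha1) - 1.0
        R2 = 2.0 * norm.cdf(Z @ alpha2) - 1.0
        return R1, R2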

This index has been shown to find structure in the center of the distribution rather than in the tails. Thus, it identifies projections which exhibit clustering rather than heavy-tailed densities. G_L also has the property of affine invariance (Huber 1985), which means that it is invariant under scale changes and location shifts of Y. The index has this characteristic as it is based on the sphered variables Z. Since exploratory projection pursuit should not be drawn in by covariance structure, this property is desirable for any projection index. The Legendre index is also stable and fast to compute. Research continues in the area of projection indices but so far alternatives have proved less computationally feasible. Other indices use different sets of polynomials to take advantage of different weight functions or use alternate methods of measuring distance from normality, as discussed in Section 1.4.2.

In the following section, an example of the original algorithm using G_L is discussed in order to provide the motivation for the proposed modification. After considering how to modify the method in order to make the results more interpretable, the projection index clearly needs a new theoretical property which G_L does not have. Consequently, a new index is defined in Section 1.4.2.

1.1.4 The Automobile Example

The automobile dataset (Friedman 1987) consists of ten variables collected on 392 car models and reported in Consumer Reports from 1972 to 1982:

Y1 : gallons per mile (fuel inefficiency)
Y2 : number of cylinders in engine
Y3 : size of engine (cubic inches)
Y4 : engine power (horse power)
Y5 : automobile weight
Y6 : acceleration (time from 0 to 60 mph)
Y7 : model year
Y8 : American (0,1)
Y9 : European (0,1)
Y10 : Japanese (0,1)

The second variable has 5 possible values while the last three are binary, indicating the car's country of origin. As Friedman suggests, these variables are gaussianized, which means the discrete values are replaced by normal scores after any repeated observations are randomly ordered. The object is to ensure that their discrete marginals do not overly bias the search for structure. In addition, all the variables are standardized to have zero mean and unit variance before the analysis.
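One normal-scores recipe consistent with this description is sketched below; the exact tie-breaking and plotting-position conventions are not spelled out in the text, so those details are assumptions.

    import numpy as np
    from scipy.stats import norm

    def gaussianize(x, rng):
        """Replace values by normal scores, ordering repeated values at random."""
        n = len(x)
        order = np.lexsort((rng.random(n), x))  # sort by value, ties broken randomly
        ranks = np.empty(n)
        ranks[order] = np.arange(1, n + 1)
        return norm.ppf(ranks / (n + 1))        # normal scores (assumed convention)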

The linear combinations which define the maximizing plane (β₁, β₂) are normed to length one. As a result, the coefficient for a variable in a combination represents its relative importance. The definition of the solution plane is

β₁ = (−0.21, −0.03, −0.91, 0.16, 0.30, −0.05, −0.01, 0.03, 0.00, −0.02)ᵀ
β₂ = (−0.75, −0.13, 0.43, 0.45, −0.07, 0.04, −0.15, −0.03, 0.02, −0.01)ᵀ

That is, the horizontal coordinate of each observation is −0.21 times fuel inefficiency (gaussianized and standardized) −0.03 times the number of cylinders (standardized) and so on. The scatterplot of the points is shown in Figure 1.1. In the projection scatterplot, the combinations (β₁, β₂) are the orthogonal horizontal and vertical axes and each observation is plotted as (β₁ᵀY, β₂ᵀY).

Fig. 1.1 Most structured projection scatterplot of the automobile data according to the Legendre index.

These vectors are orthogonal in the covariance metric due to the constraint [1.6]. However, in the usual reference system, the orthogonal axes correspond to the variables. For example, the x axis is Y1, the y axis is Y2, the z axis is Y3, and so on. The combinations are not orthogonal in this reference frame. This departure from common graphic convention is discussed further in Section 1.4.3.

The first step is to look at the structure of the projected points. In this case, the points are clustered along the vertical axis at low values and straggle out to the upper right corner with a few outliers to the left. The obvious concern is whether this structure actually exists in the data or is due to sampling fluctuation. The value of the index G_L for this particular view is 0.35 and the question is whether this value is significant. Friedman (1987) approximates the answer to this question by generating values of the index for the same number

of observations and dimensions (n and p) under the null hypothesis of normality. Comparison of the observed value with these generated ones gives an idea of how unusual the former is. Sun (1989) delves deeper into the problem, providing an analytical approximation for a critical value given the data size and chosen significance level. The structure found in this example is significant.

Given the clustering exhibited along the vertical axis, the second step is to attempt to interpret the linear combinations which define the projection plane. Though by projecting the data from ten to two dimensions, exploratory projection pursuit has reduced the visual dimension of the problem, the linear projections must still be interpreted in the original number of variables. The structure is represented by a set of points in two dimensions but understanding what these points represent in terms of the original data requires considering all ten variables.
An infinite number of pairs of vectors (β₁, β₂) exist which satisfy the constraints [1.6] and define the most structured plane. In other words, the two vectors can be spun around the origin in the plane rigidly via an orthogonal rotation of the scatterplot and they still satisfy the constraints and maintain the structure found. The orientation of the structure is inconsequential to its visual representation.
These facts lead to the principle that a plane should be defined in the simplest or most interpretable way possible. Given that the data is standardized and the linear combinations have length one, the coefficients represent the individual contribution of each variable to the combination. Friedman (1987) attempts to find the simplest representation of the plane by spinning the vectors until the variance of the squared coefficients of the second combination β₂, the vertical coordinate, is maximized. This action forces the coefficients of β₂ to differ as much as possible. As a consequence, variables are forced out of the combination. This more interpretable and parsimonious vector has fewer variables in it. Such a vector is easier to understand, compare, remember and explain.

The goal of the present work is to expand this approach. A precise definition of interpretability is discussed in Section 1.3. Criteria which involve both combinations are considered. More importantly, not only is rotation of the vectors in the plane allowed but also the solution plane may be rocked in p dimensions slightly away from the most structured plane. The resulting loss in structure is acceptable only if the gain in interpretability is deemed sufficient.

1.2 The Interpretable Exploratory Projection Pursuit Approach

As shown in the preceding example, exploratory projection pursuit produces the scatterplot (β₁ᵀY, β₂ᵀY), the value of the index G_L(β₁ᵀY, β₂ᵀY), and the linear combinations (β₁, β₂). The scatterplot's nonlinear structure may be visually assessed but attaching much meaning to the actual numerical value of the index is difficult. As remarked earlier, the linear combinations must still be interpreted in terms of the original number of variables. In fact, a mental attempt to understand these vectors may involve rounding them by eye and dropping variables which have too small coefficients. The question to be considered is whether the linear combinations can be made more interpretable without losing too much observed structure in the projection. The idea of the present modification is to trade some of the structure found in the scatterplot in return for more comprehensibility or parsimony in (β₁, β₂).

1.2.1 A Combinatorial Strategy

If interpretability is linked with parsimony, a combinatorial method of variable selection might be considered. The analogous approach in linear regression is all subsets regression. A similar idea, principal variables, has been applied to principal components (Krzanowski 1987, McCabe 1984). This method restricts the linear transformation matrix to a specific form, for example each row has two ones and all other entries zero. The result is a variable selection method with each component formed from two of the original variables.

Both variable selection methods, all subsets and principal variables, are discrete in nature and only consider whether a variable is in or out of the model.

Applying a combinatorial strategy to exploratory projection pursuit results in the following number of solutions, each of which consists of a pair of variable subsets. Given p variables and the fact that each variable is either in or out of a combination, 2ᵖ possible single subsets of variables exist. The combinations are symmetric, meaning that the (β₁, β₂) plane is the same as the (β₂, β₁) plane. Thus, the number of pairs of subsets with unequal members is C(2ᵖ, 2). However, this count does not include the 2ᵖ pairs with both subsets identical, which are permissible solutions. It does include 2ᵖ + p degenerate pairs in which one subset is empty, or both subsets consist of the same single variable. These degenerate pairs do not define planes. The total number of solutions is

C(2ᵖ, 2) + 2ᵖ − 2ᵖ − p = 2^(2p−1) − 2^(p−1) − p    [1.8]

Some planes counted may not be permissible due to the correlation constraint as discussed in Section 1.4.3, so the actual count may be slightly less. However, the principal term remains of the order of 2^(2p−1). The number of subsets grows exponentially with p and for each subset the optimization must be completely redone. Unlike all subsets regression, no smart search methods exist for eliminating poor contenders which do not produce interesting projections. Due to the time required to produce a single exploratory projection pursuit solution, this combinatorial approach is not feasible.
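The growth of the count [1.8] is easy to tabulate; this illustrative snippet evaluates it for a few values of p.

    # Candidate subset pairs per [1.8]: 2^(2p-1) - 2^(p-1) - p.
    for p in (4, 6, 8, 10):
        print(p, 2 ** (2 * p - 1) - 2 ** (p - 1) - p)
    # p = 4 already requires 116 full optimizations; p = 10 requires 523,766.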

1.2.2 A Numerical Optimization Strategy

Given that a numerical optimization is already being done, an approach is to consider whether a variable selection or interpretability criterion can be included in the optimization. The objective function used is

max_{β₁,β₂} (1 − λ) G(β₁ᵀY, β₂ᵀY)/max G + λ S(β₁, β₂)    [1.9]

for λ ∈ [0, 1]. This function is the weighted sum of the projection index contribution and an interpretability or simplicity index S contribution, indexed by the interpretability parameter λ. The interpretability index S is defined to have values ∈ [0, 1]. Analogously, the value of the projection index is divided by its maximum possible value in order that its contribution is also ∈ [0, 1].

The interpretable exploratory projection pursuit algorithm is applied in an iterative manner. First, find the best plane using the original exploratory projection pursuit algorithm. Set max G equal to the index value for this most structured plane. For a succession of values (λ₁, λ₂, ..., λ_l), such as (0.1, 0.2, ..., 1.0), solve [1.9] with λ = λᵢ. In each case, use the previous λ_{i−1} solution as a starting point.
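A schematic version of this continuation scheme follows. The stubs G and S stand in for the projection and interpretability indices, and the unconstrained optimizer is a placeholder for the constrained procedure described in Section 1.4.5; all names here are illustrative.

    import numpy as np
    from scipy.optimize import minimize

    def solution_path(G, S, beta_start, lambdas=np.arange(0.1, 1.01, 0.1)):
        """Solve [1.9] for a succession of lambda values with warm starts.
        Constraints on the combinations are omitted in this sketch."""
        max_G = G(beta_start)        # index value of the most structured plane
        path, beta = [], np.asarray(beta_start, dtype=float)
        for lam in lambdas:
            objective = lambda b: -((1.0 - lam) * G(b) / max_G + lam * S(b))
            beta = minimize(objective, beta, method="BFGS").x   # warm start
            path.append((lam, beta.copy()))
        return path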

One way to envision the procedure is to imagine beginning at the most structured solution and being equipped with an interpretability dial. As the dial is turned, the value of λ increases as the weight on simplicity, or the cost of complexity, is increased. The plane rocks smoothly away from the initial solution. If the projection index is relatively flat near the maximum (Friedman 1987), the loss in structure is gradual. When to stop turning the dial is discussed in Section 1.5.3.

The additive functional form choice for the objective function [1.9] is made by drawing a parallel with roughness penalty methods for curve-fitting (Silverman 1984). In that context, the problem is a minimization. The first term is a goodness-of-fit criterion such as the squared distance between the observed and fitted values, and the second is a measure of the roughness of the curve such as the integrated square of the curve's second derivative. As the fit improves, the curve becomes more rough. The negative of roughness, or smoothness, is comparable to interpretability.

An alternate idea is to think of solving a series of constrained maximization subproblems

max_{β₁,β₂} G(β₁ᵀY, β₂ᵀY)  subject to  S(β₁, β₂) ≥ cᵢ    [1.10]

for values of cᵢ such as (0.0, 0.1, ..., 1.0). This problem may be rewritten as an unconstrained maximization using the method of Lagrangian multipliers. Reinsch (1967) describes the relationship between [1.9] and [1.10], and consequently the relationship between cᵢ and λᵢ.

The computational savings for this numerical method versus a combinatorial approach can be substantial. The amount of work required by the functional approach [1.9] is linear in l. The inner loop, namely finding one exploratory projection pursuit solution, is the same for either. The outer loop, however, is reduced from [1.8] to l.

1.3 The Interpretability Index

The interpretability index S measures the simplicity of the pair of vectors (β₁, β₂). It has a minimum value of zero at the least simple pair and a maximum value of one at the most simple. Like the projection index G, it needs to be differentiable and fast to compute.

Analysis

For two dimensional

Background
exploratory

projection

pursuit,

the object is simplify

2 X p matrix
.
Consider a general q X p matrix
sional exploratory
matrix

projection

0 with entries wij, corresponding

pursuit.

have? When is one matrix

What characteristics

to

q dimen-

does an interpretable

more simple than another?

Researchers have

considered such questions with respect to factor loading matrices. The goal in factor analysis is to explain the correlation structure in the variables via the factor model. The solution is not unique and the factor matrix often is rotated to make it more interpretable. Comments are made regarding the geometric difference between this rotation and the interpretable rocking of the most structured plane in Section 1.4.4. Though the two situations are different, the philosophical goal is the same and so factor analysis rotation research is used as a starting point in the development of a simplicity index S.

Intuitively, the interpretability of a matrix may be thought of in two ways. Local interpretability measures how simple a combination or row is individually. In a general and discrete sense, the more zeros a vector has, the more interpretable it is as fewer variables are involved. Global interpretability measures how simple a collection of vectors is. Given that the vectors are defining a plane and should not collapse on each other, a simple set of vectors is one in which each row clearly contains its own small set of variables and has zeros elsewhere.
Thurstone (1935) advocated simple structure in the factor loading matrix, defined by a list of desirable properties which were discrete in nature. For example, each combination (row) should have at least one zero and for each pair of rows, only a few columns should have nonzero entries in both rows. In summary, his requirements correspond to a matrix which involves, or has nonzero entries for, only a subset of variables (columns). Combinations should not overlap too much or have nonzero coefficients for the same variables, and those that do should clearly divide into subgroups.

These discrete notions of interpretability must be translated into a continuous measure which is tractable for computer optimization. Local simplicity for a single vector is discussed first and the results are extended to a set of two vectors.

1.3.2 The Varimax Index For a Single Vector

Consider a single vector ω = (ω₁, ω₂, ..., ω_p)ᵀ. In a discrete sense, interpretability translates into as few variables in the combination or as many zero entries as possible. The goal is to smooth the discrete count interpretability index

D(ω) ≡ ∑_{i=1}^p I{ωᵢ = 0}    [1.11]

where I{·} is the indicator function.

Since the exploratory projection pursuit linear combinations represent directions and are usually normed as a final step, the index should involve the normed coefficients. In addition, in order to smooth the notion that a variable is in or out of the vector, the general unevenness of the coefficients should be measured. The sign of the coefficients is inconsequential. In conclusion, the index should measure the relative mass of the coefficients irrespective of sign and thus should involve the normed squared quantities

ωᵢ²/ωᵀω,  i = 1, ..., p

In the 1950s, several factor analysis researchers arrived separately at the same criterion (Gorsuch 1983, Harman 1976) which is known as varimax and is the variance of the normed squared coefficients. This is the criterion which Friedman (1987) used as discussed in Section 1.1.3. The corresponding interpretability index is denoted by S_v and is defined as

S_v(ω) ≡ (p/(p−1)) ∑_{i=1}^p ( ωᵢ²/ωᵀω − 1/p )²    [1.12]

The leading constant is added to make the index value be ∈ [0, 1]. Fig. 1.2 shows the value of the varimax index for a linear combination ω in two dimensions (p = 2) versus the angle of the direction arctan(ω₂/ω₁) of the linear combination ω.

Fig. 1.2 Varimax interpretability index for q = 1, p = 2. The value of the index for a linear combination ω in two dimensions is plotted versus the angle of the direction in radians over the range [0, π].
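Index [1.12] is straightforward to compute; an illustrative implementation with its two extreme cases:

    import numpy as np

    def varimax_index(w):
        """Varimax interpretability index S_v [1.12] for a single vector."""
        p = len(w)
        nu = w ** 2 / np.dot(w, w)           # normed squared coefficients
        return p / (p - 1.0) * np.sum((nu - 1.0 / p) ** 2)

    print(varimax_index(np.array([1.0, 0.0, 0.0])))   # 1.0: one variable only
    print(varimax_index(np.ones(3)))                  # 0.0: equal weights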

Fig. 1.3 shows the varimax index for vectors ω of length one in three dimensions (p = 3). Only vectors with all positive components are plotted due to symmetry. Fig. 1.3 shows the value of the index as the vertical coordinate versus the values of (ω₁, ω₂). The value of ω₃ is known and does not need to be graphed. Fig. 1.4 shows contours for the surface in Fig. 1.3. These contours are just the curves of points ω which satisfy the equation formed when the left side of [1.12] is set equal to a constant value. As the value of the interpretability index is increased, the contours move away from (1/√3, 1/√3, 1/√3) toward the three points e₁ = (1,0,0), e₂ = (0,1,0) and e₃ = (0,0,1).

Fig. 1.3 Varimax interpretability index for q = 1, p = 3. The surface of the index is plotted as the vertical coordinate versus the first two coordinates (ω₁, ω₂) of vectors of length one in the first quadrant.

Fig. 1.4 Varimax interpretability index contours for q = 1, p = 3. The axes are the components (ω₁, ω₂, ω₃). The contours, from the center point outward, are those points which have varimax values S_v(ω) = (0.0, 0.01, 0.05, 0.2, 0.3, 0.6, 0.8).

The centerpoint is the contour for S_v(ω) = 0.0 and the next three joined curves are contours for S_v(ω) = 0.01, 0.05, 0.2. The next three sets of lines going outward toward the eᵢ are contours corresponding to S_v(ω) = 0.3, 0.6, 0.8.

Since ω is normed, the varimax criterion S_v is equivalent to the quartimax criterion, which derives its name from the fact it involves the sum of the fourth powers of the coefficients. S_v is also the coefficient of variation squared of the squared vector components.

1.3.3 The Entropy Index For a Single Vector

The vector of normed squared coefficients has length one and all entries are positive, similar to a multinomial probability vector. The negative entropy of a set of probabilities measures how nonuniform the distribution is (Renyi 1961). The more simple a vector, the more uneven or distinguishable from one another its entries are. Thus a second possible interpretability index is the negative entropy of the normed squared coefficients or

S_e(ω) ≡ 1 + (1/ln p) ∑_{i=1}^p (ωᵢ²/ωᵀω) ln(ωᵢ²/ωᵀω)

The usual entropy measure is slightly altered to have values ∈ [0, 1]. The two simplicity measures S_v and S_e share four common properties.

Property 1. Both are maximized when

ω/‖ω‖ = ±eⱼ,  j = 1, ..., p

where the eⱼ are the unit axis vectors. Thus, the maximum value occurs when only one variable is in the combination.

Property 2. Both are minimized when

ωᵢ²/ωᵀω = 1/p,  i = 1, ..., p

or when the projection is an equally weighted average. The argument could be made that an equally weighted average is in fact simple. However, in terms of deciding which variable most clearly affects the projection, it is the most difficult to interpret.

Property 3. Both are symmetric in the coefficients ωᵢ. No variable counts more than any other.

Property 4. Both are strictly Schur-convex as defined below. The following explanation follows Marshall and Olkin (1979).

Definition. Let ζ = (ζ₁, ζ₂, ..., ζ_p) and γ = (γ₁, γ₂, ..., γ_p) be any two vectors ∈ ℝᵖ. Let ζ_[1] ≥ ζ_[2] ≥ ... ≥ ζ_[p] and γ_[1] ≥ γ_[2] ≥ ... ≥ γ_[p] denote their components in decreasing order. The vector ζ majorizes γ (ζ ≻ γ) if

∑_{i=1}^k ζ_[i] ≥ ∑_{i=1}^k γ_[i]  for k = 1, ..., p−1,  and  ∑_{i=1}^p ζ_[i] = ∑_{i=1}^p γ_[i]

The above definition holds if and only if γ = ζP where P is a doubly stochastic matrix, that is P has nonnegative entries, and column and row sums of one. In other words, if γ is a smoothed or averaged version of ζ, it is majorized by ζ. An example of a set of majorizing vectors is

(1, 0, ..., 0) ≻ (1/2, 1/2, 0, ..., 0) ≻ ... ≻ (1/(p−1), ..., 1/(p−1), 0) ≻ (1/p, ..., 1/p)

Definition. A function f : ℝᵖ → ℝ is strictly Schur-convex if

ζ ≻ γ  implies  f(ζ) ≥ f(γ)

with strict inequality if γ is not a permutation of ζ.

This type of convexity is an extension of the usual idea of Jensen's Inequality. Basically, if a vector ζ is more spread out or uneven than γ, then S(ζ) > S(γ). This intuitive idea of interpretability now has an explicit mathematical meaning. The two indices S_v and S_e rank all majorizable vectors in the same order. Using the theory of Schur-convexity, a general class of simplicity indices could be defined.
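The ordering imposed by Schur-convexity can be checked numerically. The sketch below evaluates the entropy index S_e along the majorizing chain displayed above and confirms that it decreases strictly; the same check passes for S_v.

    import numpy as np

    def entropy_index(w):
        """Entropy interpretability index S_e for a single vector."""
        p = len(w)
        nu = w ** 2 / np.dot(w, w)
        nz = nu[nu > 0]                         # convention: 0 * log 0 = 0
        return 1.0 + np.sum(nz * np.log(nz)) / np.log(p)

    p = 5
    # Vectors whose normed squares form the majorizing chain above.
    chain = [np.sqrt([1.0 / k] * k + [0.0] * (p - k)) for k in range(1, p + 1)]
    values = [entropy_index(np.asarray(w)) for w in chain]
    assert all(a > b for a, b in zip(values, values[1:]))   # strictly decreasing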

1.3.4 The Distance Index For a Single Vector

Besides the variance interpretation, S_v measures the squared distance from the normed squared vector to the point (1/p, 1/p, ..., 1/p), which might thus be called the least simple or most complex point. Let the notation for the Euclidean norm of any vector δ be ‖δ‖. If ν_ω is defined to be the squared and normed version of ω,

ν_ω ≡ ( ω₁²/ωᵀω, ω₂²/ωᵀω, ..., ω_p²/ωᵀω )ᵀ

and ν_c is the most complex point, then

S_v(ω) = (p/(p−1)) ‖ν_ω − ν_c‖²    [1.13]

This index can be generalized. In contrast to having one most complex point, an alternate index can be defined by considering a set V = {ν₁, ..., ν_J} of J simple points. This set must have the properties

ν_{ji} ≥ 0,  i = 1, ..., p,  j = 1, ..., J
∑_{i=1}^p ν_{ji} = 1,  j = 1, ..., J    [1.14]

The νⱼ are the squares of vectors on the unit sphere ∈ ℝᵖ, as are ν_ω and ν_c. An example is V = {eⱼ : j = 1, ..., p}, with J = p. In the event that this index is used with exploratory projection pursuit, the statistician could define her own set of simple points rather than be restricted to the choice of ν_c used in S_v.

This distance would be large when ω is not simple, so the interpretability index should involve the negative of it. An example is

S_d(ω) ≡ 1 − c min_{j=1,...,J} ‖ν_ω − νⱼ‖²    [1.15]

The constant c is calculated so that the values are ∈ [0, 1]. Any distance norm can be used and an average or total distance could replace the minimum. If V is chosen to be the eⱼ and the minimum Euclidean norm is used, the distance index becomes

S_d(ω) ≡ 1 − (p/(p−1)) [ ∑_{i=1}^p (ωᵢ²/ωᵀω)² − 2 ω_k²/ωᵀω + 1 ]

where k corresponds to the maximum |ωᵢ|, i = 1, ..., p. The minimization does not need to be done at each step, though the largest absolute coefficient in the vector must be found. Analogous results are obtainable for similar choices of V such as all permutations of two 1/2 entries and p − 2 zeros ((1/2, 1/2, 0, ..., 0), (1/2, 0, 1/2, 0, ..., 0), ...) corresponding to simple solutions of two variables each.
Since the eⱼ maximize S_v and are the simple points associated with S_d, the relationship between the two indices proves interesting. Algebra reveals

S_d(ω) = −S_v(ω) + (2/(p−1)) ( p ω_k²/ωᵀω − 1 )

The minimum of the second term occurs when

ωᵢ²/ωᵀω = 1/p,  i = 1, ..., p

or all coefficients are equal, and then S_d(ω) = S_v(ω) = 0. The maximum occurs when

ω_k²/ωᵀω = 1

and S_d(ω) = S_v(ω) = 1. The relationship between the interim values varies.

The difficulty with the distance index S_d is that its derivatives are not smooth and present problems when a derivative-based optimization procedure is used. Given that the entropy index S_e and the varimax index S_v share common properties and the latter is easier to deal with computationally, the varimax approach is generalized to two dimensions.

1.3.5 The Varimax Index For Two Vectors

The varimax index S_v can be extended to measure the simplicity of a set of q vectors ωⱼ = (ω_{j1}, ω_{j2}, ..., ω_{jp})ᵀ, j = 1, ..., q. In the following, the varimax index [1.12] for one combination is called S₁. In order to force orthogonality between the squared normed vectors, the variance is taken across the vectors and summed over the variables to produce, up to a norming constant,

∑_{i=1}^p ∑_{j=1}^q ( ν_{ji} − ν̄ᵢ )²,  where ν_{ji} ≡ ω_{ji}²/ωⱼᵀωⱼ and ν̄ᵢ ≡ (1/q) ∑_{j=1}^q ν_{ji}    [1.16]

If the sums were reversed and the variance was taken across the variables and summed over the vectors, the unnormed index would equal

S₁(ω₁) + S₁(ω₂) + ... + S₁(ω_q)    [1.17]

with each element the one dimensional simplicity [1.12] of the corresponding vector. The previous approach [1.16] results in a cross-product term.

For two dimensional exploratory projection pursuit, q = 2 and the index, with appropriate norming, reduces to

S_v(ω₁, ω₂) = (1/2) [ ∑_{i=1}^p ν_{1i}² + ∑_{i=1}^p ν_{2i}² − 2 ∑_{i=1}^p ν_{1i} ν_{2i} ]    [1.18]

with ν₁ and ν₂ the normed squared versions of ω₁ and ω₂. The first term measures the local simplicities of the vectors while the second is a cross-product term measuring the orthogonality of the two normed squared vectors. This cross-product forces the vectors to squared orthogonality, so that different groups of variables appear in each vector. S_v is maximized when the normed squared versions of ω₁ and ω₂ are e_k and e_l, k ≠ l, and minimized when both are equal to (1/p, 1/p, ..., 1/p).
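An illustrative implementation of the two-vector index, using the expanded form of [1.18] above (the norming constant follows that reconstruction):

    import numpy as np

    def varimax_pair_index(w1, w2):
        """Two-vector varimax index [1.18]: local simplicity terms minus a
        cross-product measuring squared orthogonality."""
        nu1 = w1 ** 2 / np.dot(w1, w1)
        nu2 = w2 ** 2 / np.dot(w2, w2)
        return 0.5 * (nu1 @ nu1 + nu2 @ nu2 - 2.0 * nu1 @ nu2)

    p = 4
    e1, e2 = np.eye(p)[0], np.eye(p)[1]
    print(varimax_pair_index(e1, e2))                  # 1.0: disjoint variables
    print(varimax_pair_index(np.ones(p), np.ones(p)))  # 0.0: equal averages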

1.4 The Algorithm

In this section, the algorithm used to solve the interpretable exploratory projection pursuit problem [1.9] with interpretability index S_v is discussed. The general approach of the original exploratory projection pursuit algorithm is followed but two changes are required, as described in the first three subsections.

1.4.1 Rotational Invariance of the Projection Index

As discussed in Section 1.1.4, the orientation of a scatterplot is immaterial and a particular observed structure should have the same value of the projection index G no matter from which direction it is viewed. An alternative way of stating this is that the projection index should be a function of the plane, not of the way the plane is represented (Jones 1983). The interpretable projection pursuit algorithm should always describe a plane in the simplest way possible as measured by S_v. Given these two facts, any projection index used by the algorithm should have the property of rotational invariance.

Definition. A projection index G is rotationally invariant if

G(β₁ᵀY, β₂ᵀY) = G(η₁ᵀY, η₂ᵀY)    [1.19]

where (β₁, β₂) is (η₁, η₂) rotated through an angle θ, or

(β₁, β₂) = (η₁, η₂) Q(θ)

with Q(θ) the orthogonal rotation matrix associated with the angle θ.

Rotational invariance should not be confused with affine invariance. As remarked in Section 1.1.3, the latter property is a welcome byproduct of sphering. Friedman's Legendre index G_L [1.7] is not rotationally invariant. This property is not required for his algorithm. As described in Section 1.1.4, he simplifies the axes after finding the most structured plane by maximizing the single vector varimax index S₁ for the second combination. He does not allow the solution to rock away from the most structured plane in order to increase interpretability.

Recall that the first step in calculating G_L is to transform the marginals of the projected sphered data to a square. Under the null hypothesis, the projection is N(0, I), which means the projection scatterplot looks like a disk and the orientation of the axes is immaterial. Intuitively, however, if the null hypothesis is not true, the placement of the axes affects the marginals and thus the index value.
Empirically, this lack of rotational invariance is evident. In the automobile example from Section 1.1.4, the most structured plane was found to have G_L(β₁ᵀY, β₂ᵀY) = 0.35. The projection index values for the scatterplot as the axes are spun through a series of angles θ are shown in Table 1.1. A new index, which seeks to maintain the computational properties and to find the same structure as G_L, is developed in the next subsection.

θ     0     π/20  2π/20 3π/20 4π/20 5π/20 6π/20 7π/20 8π/20 9π/20 π/2
G_L   0.35  0.36  0.34  0.32  0.30  0.29  0.29  0.29  0.30  0.32  0.35

Table 1.1 Most structured Legendre plane index values for the automobile data. Values of the Legendre index for different orientations of the axes in the most structured plane are given.

1.4.2 The Fourier Projection Index

The Legendre index G_L is based on the Cartesian coordinates of the projected sphered data (α₁ᵀZ, α₂ᵀZ). The index is based on knowing the distribution of these coordinates under the null hypothesis of normality. Polar coordinates are the natural alternative given that rotational invariance is desired. The distribution of these coordinates is also known under the null hypothesis. The expansion of the density via orthogonal polynomials is done in a manner similar to that of G_L.

The polar coordinates of a projected sphered point (α₁ᵀZ, α₂ᵀZ) are

R ≡ (1/2) [ (α₁ᵀZ)² + (α₂ᵀZ)² ],  Θ ≡ arctan(α₂ᵀZ / α₁ᵀZ)

Actually, the usual polar coordinate definition involves the radius of the point rather than its square, but the notation is easier given the above definition of R. Under the null hypothesis that the projection is N(0, I), R and Θ are independent and

R ~ Exp(1),  Θ ~ Unif[−π, π]

The proposed index is the integral of the squared distance between the density
of

(R, 0)

and the null hypothesis density, which is the product of the exponential

and uniform

densities,

The density pR,e is expanded as the tensor product


onal polynomials
properties.

chosen specifically

The weight functions

der to utilize the orthogonality


0 portion

for their

weight functions

both Cartesian coordinates

and rotational

must match the densities f~ and fe

relationships

and the polynomials

of the expansion must result in rotational

is not affected by rotation.

of two sets of orthog-

Friedman

chosen for the

invariance.

By definition,

(1987) uses Legendre polynomials

as his two density functions

are identical

is Lebesgue measure, and his algorithm

require rotational

Hall (1989) considered Hermite

developed a one dimensional index. The following


the two authors approaches.

Throughout

for

(Unif[-l,l]),

the Legendre weight function


invariance.

in or-

does not

polynomials

and

discussion combines aspects of

the discussion, i, j, and k are integers.

1.4 The Algordhm

Page31

The set of polynomials

for the R portion

are defined on the interval

is the Laguerre polynomials

[O,oo) with weight function

W(U) = e-.

which

The polyno-

mials are
Lo(u) = 1
L,(u)

= u - 1

L;(u)

= (u - 2i + lpi-l(u)

[1.21]

The associated Laguerre functions

- (i - 1)2L44

are defined as

Z;(U)f .L;(u)e-i .
The orthogonality

relationships

O"
J

between the p.olynomials are

Zi(U)Zj(U)dU

where 6ij is the i(ronecker

delta function.

Any piecewise smooth function


these polynomials

;,j>Oandi#j

Sij

: R H R may be expanded in terms of

as
f(U)

= e%&(u)
i=O

where the ai are the Laguerre coefficients

a; Z

O Li(U)t?-3uf(U)dU

J0

The smoothness property


derivatives

of f means the function

has piecewise continuous first

and the Fourier series converges pointwise.

has density f, the coefficients

can be written

Ui = Ef [Zi(W)]

as

If a random variable W

1. Interpretable

Exploratory

The 63 portion

Pursuit

Projection

Page 32

of the index is expanded in terms of sines and cosines. Any

piecewise smooth function

f : R H R may be written

as

f(w)= y + 5 [ahcos(kw)+ bksin( kv)]


k=l

with pointwise

convergence.

The orthogonahty

relationships

between these trigonometric

7r
%
J
xJ
J
r
J
cos2(kv)dv =

--r

-%

sin2(kv)dv

bk are

k>l

= 0

k>Oandj>l

cos(kw)sin(jv)dv

dv=2r

the Fourier coefficients

1
ak E -~
1
bk E =

_ cos(kv)f(v)dv = ;Ef [cos(kW )]

-%

sin(

kv)f(v)dv

= i.Ef

[sin(kW)]

The density pR,e can be expanded via a tensor product

of the two sets of poly-

nomials as

i- 2 [aikcos(kv)+ b;ksin(ku)])
k=l

The Uik and

are

--x

--I

The ak and

= ?r

functions

bik are

the coefficients defined as

C&k 3

:Ep [ii(R) cos(kO)]

i,k>O

bik G

:Ep

i>Oandk>l

[Ii(R) sin(kO)]

Page 33

1.4 The Algorithm


The null distribution,
the uniform

which is the product

density over [-ir,~]

of the exponential

density

and

is

h(u>fo(~)
= ($+) (+-) =

-&O(U)

The index [1.20] becomes

The further

condition

tion, integration,

that pR,e is square-integrable

and use of the orthogonality

and subsequent

relationships

multiplica-

show that the index

G(PTY, /3,Y) equals

+(27r)(&)

~(271)~+~~1T(u~k+b~k)-(2r)(~)(~)
i=O k=l
i=l

[1.22] is equivalent

Maximizing

to maximizing

G&-Y&Y)

The definitions

[I.221

the Fourier index defined as

= ~G(Pf-y,&y)

of the coefficients

- f

in GF yield

GF(PTY,p,Y) =f 2 Ep[Ii(R)] + 2 2 (Ei [ii(R) cos(k@)I+ Epk(R)sin(k@)l)


i=O k=l

i=O

- ;Ep[lo(R)].
[1.23]
This index

closely resembles the form of GL in Friedman

i = 0 term in the first sum and the last subtracted


function

is the exponential

polynomials.

In application,

instead

(1987).

The extra

term appear since the weight

of Lebesgue measure as it is for Legendre

each sum is truncated

at the same fixed value and

1. Interpretable

Explorato y

Projection

Pursuit

Page 34

the expected values are approximated


data points.

For example, Ei

by the sample moments taken over the

[Ii(R) cos(kO)]is approximated

by

where rj and Oj are the radius squared and angle for the projected jth observation.
The Fourier index is rotationally
spun by an angle of r.

invariant.

Suppose the projected

The radius squared R is unaffected

points are

by the shift so the

first sum and final term in [1.23] d o not change. The sine and cosine of the new
angle are
cos( 0 + T) = sin 0 cos 7 + cos 0 sin 7
sin(O + r) = cosOcos7
Each component

- sin@sinr

in the second term of [1.23] is

Ei[Zi(R) cos(k(0 + T))] +

E~[Zi(R)sin(k(O

cos2(kT)Ei[Zi(R)

sin(kO)]

+ r))] =
+ sin2(kr)E~[Zi(R)

cos(k@ )]

+ 2 sin( kT) cos(kT)Ep[Zi(R)


sin(kO) cos(k@ )]
+

cos2(kT)Ei[Zi(R) cos(k@ )]+

-2sin(kr)

[1.24]
sin2(kr)Ei[Zi(R)

cos(k~)E,[Z;(R)sin(kC3)cos(kO)]
= Ei[Zi(R)cos(kO)]+

E~[Zi(R)sin(kO)]

and the index value is not afFected by the rotation.


the index also has this property
replacing

The truncated

as it is true for each component.

version of
Moreover,

the expected value by the sample mean does not affect [1.24], so the

sample truncated

version of the index is rotationally

Hall (1989) p ro p oses a one dimensional


m ite weight function
densities.

sin(k@)]

In addition,

in the truncated

Hermite

invariant.
function

index.

of e- 3 u2 helps bound his index for heavytailed

The Herprojection

he addresses the question of how many terms to include

version of the index.

Similarly,

the asymptotic

behavior of GL

1.4 The

Page 35

Algorithm

is being investigated

at present. The Fourier index consists of the bounded sine

and cosine Fourier contribution,

and of the Laguerre function

portion

which is

weighted by e- 3 U. The class of densities for which this index is finite must be
determined

as well as the number of terms needed in the sample version.

The Fourier GF and Legendre GL indices are based on comparing the density
of the projection
a rotationally

with the null hypothesis density. Jones and Sibson (1987) define

invariant

to equate structure
find clusters.

with

rotationally

searches through the possible planes with the weighted objec-

[1.9] as a criterion.

Every plane has a single projection

it since GF is rotationally

have a single possible representation;

analytically.

index could be based on

Axes Restriction

The algorithm

S,. Unfortunately,

invariant

(Donoho and Johnstone 1989).

1.4.3 Projection

associated with

Their index tends

outliers while the density indices GF and GL tend to

A possible future

Radon transforms

tive function

index based on comparing cumulants.

the optimal

invariant.

Ideally, each plane would

the most interpretable

representation

index value

one as measured by

of a given plane cannot be solved

However, as the weight on simplicity

(X) is increased, the algorithm

tends to represent each plane most simply.


In order to help the algorithm
plane, the constraints

find the most interpretable

on the linear combinations

of a

(pi, ,&) must be changed. Re-

call that in the original

algorithm,

the linear combinations

(&, &) which define th e solution plane. This constraint

translates

into an orthogonality

which define the solution

the correlation

representation

constraint

[1.6] is imposed on

for the linear combinations

(oi, ~22)

plane in the sphered data space. However, simplicity

is measured for the two unsphered combinations


the two vectors are (fek,

constraint

&eZ),

(/?I, /32) and is maximized

# 1. These maximizing

combinations

thogonal in the original data space and correspond to the variable

k and

when
are or-

variable

1. Inierpwfable Ezploraioy Projeciion Pursuit

Page 36

I axes. Unless two variables are uncorrelated,


achieved by a pair of uncorrelated
To ensure that the algorithm

the maximum

orthogonality.

can find a maximum,

in the following

manner.

the optimal

constraint,

Without

translation

loss of generality,

the plane until the two vectors are orthogonal.


@2onto PI and taking the remainder.
a component

given any pime

after the linear combinations

Unfortunately,

cannot be

combinations.

by the pair (PI, p2) which satisfies the correlation


of the plane is calculated

simplicity

defined

the interpretability

have been translated

to

is not known so it is done


PI is fixed and /?2 is spun in

The spinning is done by projecting

That is, ,& is decomposed into the sum of

which is parallel to PI and a component

which is orthogonal

The new ,&, which is called ,8i, is the latter component.

to PI.

Mathematically,

[1.25]

Whether Sv(P, , Pi> is always greater than or equal to SV(p,,p2)


However, as noted above, the maximum

is not clear.

value of the index can be achieved given

this translation.
As an added bonus,
original

the two combinations

(p,,&)

are orthogonal

variable space, which is the usual graphic reference frame.

tion is in contrast to the original exploratory

projection

pursuit

in the

This situa-

algorithm

which

graphs with respect to the covariance metric as noted in Section 1.1.4. In effect,
a further

subjective

representation

simplification

in the solution

has been made as the visual

is more interpretable.

The final solution for any particular

X value is reported as the normed, trans-

lated set of vectors


[1.26]

1.4 The Algorithm

Throughout

Page 37

the rest of this thesis, the orthogonal

(Pr, &,) is written

1.4.4

and Il.181 becomes

= $KP- W l(PJ+ (p- l)S@{)


Comparison

Given that
comparison

W ith

factor

with

X = (Xi, X2)r

is assumed and

for (&, Pi). As a result, whenever S,(p,, /?2) is referred to. the

actual value is S,,(pl,&)

wdv

translat.ion

Factor

+ 21

112

. [1.27]
2

Analysis

analysis is used to motivate

this method

- $ -L!&g

is warranted.

the interpretability

The two dimensional

index,
projection

is defined as

XEBY
where

is the 2 X p linear combination

matrix

and the observed variables are Y = (K, Y2, . . . , YP)=. This definition
to the one for principal
principal

projection

components except that in the latter case, usually

in Section

1.1.1, principal

to simplify

Section 1.1.4, involved rigidly

B,

due to Friedman

is p X p. In addition,
maximize

a different

matrix

constraint.

by multiplying

(1987) and discussed in

spinning the projected points (X1,X2)

tion plane. The new set of points is QX =

lation

components

all p

index.

The first attempt

orthogonal

components are found so that the dimension of

as was remarked

is similar

QBY

as in [1.19]. S
mce the rotation
Thus simplification

the linear combination

in the solu-

where Q is a two dimensional


is rigid, it maintains

the corre-

through spinning in the plane is achieved


matrix

by Q.

1. Interpretable

Exploratory

p variable

The analogous

In this model,

(fi,
f2>

(Y,,yz,...

Page 38

Pursuit

factor analysis model is

there are assumed to be two unknown

and E is a

and is found

Projection

X 1 error vector.

by seeking to explain

the covariance

, YP) given certain distributional

changing the explanatory

order to simplify

assumptions.

matrix

which is comparable in dimensionality

pursuit

linear combination

to a more interpretable

solution

51 is p X 2

of the variables

Due to the non-uniqueness

A rotation

is made in

to CJQ.

matrix,

QB.

rotated producing

Taking the transpose of the new factor-loading

matrix

structure

power of the model.

the factor-loading

factors

The factor loading matrix

of the model, the factors can be orthogonally

without

underlying

matrix

produces QflT, a 2 X p

to the new exploratory

projection

Thus, spinning the linear combinations

is analogous to simplifying

the factor-loading

matrix in a two factor model. A transpose is taken since the factor analysis linear
combinations

are of the underlying

combinations

are of the observed variables.

between factor analysis and principal


Interpretable
plane.

exploratory

factors, while the exploratory

This comparison is analagous to that

components analysis.

projection

pursuit

In this case, the linear combinations

only to the correlation


via [1.25] to further

constraint.

involves rocking

(pi, /32) are moved in

They are then transposed

increase interpretability.

may not be linear combinations

data analysis

the solution

RP subject

to orthogonality

The more interpretable

coefficients

of the original ones. Such a move is not allowed

in the factor-analysis

setting.

The interpretable

method may be thought

looser, less restrictive

version of factor analysis rotation.

of as a

1.4 The

Page 39

Algorithm

1.4.5 The Optimization

Procedure

The interpretable exploratory projection pursuit objective function is


(I-

A) G~(~~~~,Ty)

v (p 1, p 2 )

AS

max GF

where &@I,
rithm

P2)

is calculated after the translation

[1.28]

[1.25]. The computer algo-

employed is similar to that of the original exploratory

algorithm

described in Section 1.1.2. The algorithm

projection

is outlined

pursuit

below and then

comments are made on the specific steps.


0. Sphere the data [1.2].
1. Conduct
lem [l.l]

a coarse search to find a starting

point to solve the original prob-

(or [1.28] with X = 0).

2. Use an accurate derivative based optimization


structured

procedure to find the most

plane.

3. Spin the solution


representation.

vectors in the optimal

plane to the most interpretable

Call this solution Pa.

4. Decide on a sequence (Xi, X2, . . . , Xl) of interpretability

5. Use a derivative based optimization


and starting

6. If i =

I,

plane Pi-l.

EXIT.

values.

procedure to solve [1.28] with X = Xi

Call the new solution plane

Otherwise,

parameter

set i = i + 1 and GOT0

Pi..
5.

The search for the best plane is performed in the sphered data space, as
discussed in Section 1.1.2. However, an important

note is that the interpretability

of the plane must always be calculated in terms of the original variables.


combinations,

not the ai combinations,

are the ones the statistician

she is unaware of the sphering, which is just a computational

The pi

sees. In fact,

shortcut

behind

the scenes.

The modification
Friedman

does require one important

difference in the sphering. In

(1987), the suggestion is to consider only the first Q sphered variables

1. Interpretable

Exploratory

Projection

Pursuit

2 where q < p and a considerable


dropping

of the unimportant

unimportant
putational

amount of the variance is explained.

The

sphered variables is the same as the dropping

components in principal
work involved.

Page 40

of

components analysis and reduces the com-

In Step 5, the interpretability

gradients are calculated

for (PI, /32) and th en t ranslated via the inverse of [1.5],

a1 = DfUp,
a2 = DbJp,

p.291

to the sphered space. If the gradient components are nonzero only in the p - q
dropped dimensions,
optimization

they become zero after translation.

procedure

the maximum

The derivative-based

assumes it is at the maximum

has not been reached.

and stops, even though

Thus, no reduction

in dimension

during

sphering should be made.


The coarse search in Step 1 is done to ensure that the algorithm
the vicinity

of a large local maximum.

starts in

The procedure which Friedman

(1987)

employs is based on the axes in the sphered data space.

He finds the most

structured

the sphered space.

pair of axes and then takes large steps through

Since the interpretability


a feasible alternative

measure S, is calculated for the original variable space,


m ight be to coarse step through

the original

rather than

sphered data space. For example, the starting point could be the most structured
pair of original

axes. This pair of combinations

is in fact the simplest possible.

On the other hand, stepping evenly through the unsphered data space m ight not
cover the data adequately

as the points could be concentrated

due to covariance structure.


tried so far, the starting

Sphering solves this problem.

procedure used is the program

(Gill et al. 1986). This package solves nonlinear

tion problems.

In all data examples

point did not have an effect on the final solution.

In Steps 2 and 5, the accurate optimization


NPSOL

in some subspace

The search direction

constrained

at any step is the solution

optimiza-

to a quadratic

1.4 The

Page

Algorithm

programming
binations

problem.

The package is employed to solve for the sphered com-

((Y~,cx~) subject to the length and orthogonality

gradients for the projection


index SV derivatives

given the complicated

is extremely

translations

this problem is a difficult

from the un-

[1.25] and from the sphered to unsphered

space [ 1.51. These gradients are given in Appendix

design a steepest descent algorithm

[1.4]. The

The interpretability

as they involve translations

combinations

The package NPSOL

constraints

index GF are straightforward.

are more difficult

correlated to orthogonal

powerful.

A.
At present, work continues to

which maintains

the constraints.

However,

between the sphered and unsphered spaces,

one.

Step 3 can be performed


discretely

41

in two ways.

The initial

spun in the plane to the most interpretable

pair of vectors
representation

can be

or Steps 5

and 6 can be run with A equal to a very small value, say 0.01. This slight weight
on simplicity
permitted

does not overpower

to rock but spinning is allowed.

representation
is similar

of the most structured

to Friedmans

interpretability
The initial
inator

the desire for structure.

The result is the most interpretable

plane. As noted previously,

(1987) simplification

value of the projection

except that a two vector varimax

index GF is used as max G in the denom-

However, the algorithm

and as the weight on simplicity

move to a larger maximum.

may be caught in a

is increased, the procedure may

Thus, the contribution

of the projection

may at some time be greater than one. This is an unexpected


terpretable

this spinning

index is used instead of a single vector one.

of the first term in [1.28].

local maximum

The plane is not

projection

pursuit

approach, both structure

index term

benefit of the in-

and interpretability

have

been increased.
In the,examples

tried, the algorithm

is not very sensitive to the X sequence

choice as long as the values are not too far apart.


(O.O,O.l,. . . , 1.0) produces the same solutions
sequence (0.0,0.25,0.5,1.0)

does not.

For example,

as (0.0,0.05,0.1,

the sequence

. . . , 1.0) but the

1. Interpretable

Exploratory

Throughout
as the starting

Projection

Pursuit

Page

the loop in Steps 5 and 6, the previous solution


point for the application

of the algorithm

at Xi-1 is used

with X;. This approach

is in the spirit of rocking the solution away from the original solution.
have shown that the objective
made initially

is fairly

42

Examples

smooth and a large gain in simplicity

is

in turn for a small loss in structure.

As remarked in Section 1.1.4 when the example was considered, the data Y is
usually standardized

before analysis. The reported combinations

the coefficients represent the relative importance


nation.

are [1.26]. Thus

of the variables in each combi-

The next section consists of the analysis of an easy example followed by

a return to the automobile

Example.

1.5 Examples
In this section, two examples of interpretable
are examined.

exploratory

projection

The first is an example of the one dimensional algorithm,

second is the two dimensional algorithm

applied to the automobile

in Section 1.1.3. Several implementation

pursuit
while the

data analyzed

issues are discussed at the end of the

section.

1.5.1 An Easy Example


The simulated

data in this example consists of 200 (n) points in two (p) di-

mensions. The horizontal


distributed

and vertical coordinates are independent

with means of zero and variances of one and nine respectively.

data is spun about the origin through


tracted

from each coordinate

coordinate

and normally

of the remaining

an angle of thirty

degrees. Three is sub-

of the first fifty points and three is added to each


points.

The data appear in Fig. 1.5.

Since the data is only in two dimensions and can be viewed completely
scatterplot,

using interpretable

quite contrived.

The

exploratory

projection

pursuit

in this instance is

However, this exercise is useful in helping understand

the procedure works and its outcome.

in a

the way

1.5 Ezamples
101

Page 43
I

-10"""""""'
-10

-5

I""1
5

10

Yl

Fig.

1.5

Simulated

The algorithm

p = 2.

is run on the data using the varimax

for a single vector


axes restriction

data with n = 200 ad

S1 and the Legendre index GL.

modifications

are irrelevant

interpretability

Rotational

in this situation

sional solution is sought. The values of the simplicity

index

invariance

and

since a one dimen-

parameter

X for which the

solutions are found are (0,O. 1,0.2, . . . , 1 .O).


The most structured
or the first principal
the observations

line through

component

the data should be about thirty

of the data.

are split into two groups.

line toward

From Fig. 1.5, the horizontal


projected

and vertical

merge into one. In the horizontal

projection,

of these two axes.

the most structure

onto it. If the data is projected onto the vertical

lines in the

axes. The algorithm

the more structured

axis exhibits

onto this line,

The most interpretable

entire space, R2 in this case, are the horizontal


should move the solution

When projected

degrees

when the data is

axis, the two clusters

the two groups only overlap slightly.

1. Interpretable

Ezploratoq

Projection

Pursuit

Page 44

-y-+-

~2c

0.8

&

0.6

-1

0.0

0.8

0.6

0.4

0.2

Fig.

Projection and interpretability


indices versus X for the simulated
data. The projection index values are normalized by the X = 0
value and are joined by the solid line. The simplicity index
values are joined by the dashed line.

1.6

Fig. 1.6 shows the values of the projection


the values of A. The projection
downward

and interpretability

indices versus

index begins at 1.0 as it is normed and moves

to about 0.3 as the weight on simplicity

increases.

The simplicity

index begins at about 0.2 and increases to 1.0.


In addition

to this graph, the statistician

Fig. 1.7 shows the projection


.

combinations

histograms,

should view the various projections.


values of the indices, and the linear

for four chosen values of A. The histogram binwidths

as twice the interquartile


have area one.

range divided by ni and the histograms

are calculated
are normed to

Page45

1.5 Examples
0.25

,,I,

, , I I

I \ I ,_l

0.20
0.15
0.10 1

h=O.O

A=O.5

Fig. 1.7

Projected simulated data histograms for various


The values of the indices and combinations are

values of X.

X = 0.0, GL = 1.00, S1 = 0.21, ,L3= (0.85,0.52)T


A = 0.3,

X=

0.5,

GL = 0.93, Sl =
GL = 0.78, S1 =

0.47,
0.71,

p = (0.92,0.47)T
B = (0.97,0.24)T

X = 1.0, GL = 0.27, Sl = 1.00, ,!3= (l.OO,O.OO)T

As predicted,
_

the horizontal
is evident

the algorithm

moves from an angle of about thirty

axis. As the weight on simplicity

in the merging

groups are overlapping.

of the two groups.


Also important

degrees to

is increased, the loss of structure


In the fmal histogram,

to note is the comparison

the two

between the

Page46

1. InterpretableExploratory Projection Pursuit


first and second histograms
value of the projection
the two histograms
1.5.2

are virtually

The automobile

indistinguishable

Example

interpretable

exploratory

consists of 392 (n) observations

GF projection
rotational

in a different
notation

index instead

projection

pursuit.

projection

pursuit

analysis

uses the Fourier

of the Legendre GL index due to the desire for

point

for the algorithm,

of Section 1.4.5. Comparison


invariance

using the Fourier index.

orthogonal

in the original

the plane.

The corresponding

this choice results

the X = 0 or PO plane in the

of these two solutions

in the Legendre projection.

projection

demonstrates

The dashed axes are (PI, ,&), which

variable space and are the simplest


PO solution

representation

constraint
With

of

as Fig. 1.8

purposes.

metric.

Rigidly

[1.6].

imagination,

index values.

are

for the Legendre index is shown in

In Fig. 1.9, the dashed axes are (/&,pz),


variance

the

Fig. 1.8 shows the PO

Fig. 1.9. This plot is the same as Fig. 1.1 but with the same limits
for comparison

using two

Recall that this data

as discussed in Section 1.4.1. Naturally,

starting

lack of rotational

1.1.3 is re-examined

of ten (p) variables.

exploratory

invariance

of 0.26. However,

to the eye.

data discussed in Section

The interpretable

A loss of 7% in the

index is traded for a gain in simplicity

The Automobile

dimensional

at X = 0 and X = 0.3 respectively.

rotating

However,
spinning

For example,

which

the combinations

are orthogonal
maintains

in the co-

the correlation

the Legendre index changes as shown in Table 1.1.


these axes through
the new marginals

different

angles produces lower

after a rotation

of f are less

clustered and therefore GL is reduced.


The value of the Fourier index GF for the projection
the value for the projection

in Fig. 1.8 is 0.21 while

in Fig. 1.9 is 0.19. The Legendre index GL value

for the second is 0.35 as previously

reported.

For the projection

in Fig. 1.8, the

Legendre index varies from 0.19 to 0.21, depending on the orientation

of the axes.

1.5 Ezamples

Page 47

.:.. .
I
:. . .,*
*.
.
.
.
I
: . y;. -.
: *. ..:+J:
:.*;..
I
. . a... ..
; .. 1:
. -. .:
..*....
.J.
-- -.-. J : ;;.Y++
*.*.z
.: ..I.

Fag. I.8

----

Most structured projection scatterplot of the automobile data


according to the Fourier index. The dashed axes are the solution
combinations which define the plane.

Both find projections

which exhibit

endre index combinations

are translated

clustering

into two groups.

to an orthogonal

If the Leg-

pair via jl.51, they

become
,& = (-0.21,

-0.03,-0.91,

,&. = (-0.80,-0.14,

0.16,

0.27,

and have interpretability


the simplest representation

0.30, -0.05, -0.01,

0.49,-0.02,

measure S@,

0.03, 0.00, -0.02)T

0.03,-O-16,--0.02,

0.02,-0.01)T

/&) = 0.50. Of course, this may not be

of the plane though the orthogonalizing

transforma-

tion may help.


The Fourier index combinations
PI = (-0.06,
p2 = (

0.00,

-0.22, -0.79, -0.44,


O.ll,-0.53,

originally

are

0.34,-0.05,

0.82,-0.03,

0.02, -0.14, -0.05,

0.05,-0.01,

0.16,

0.04)T

0.04,-0.06)T

1. InterpretableEzploratoy Projection Pursuit

Fig. 1.9

Most structured projection scatterplot of the automobile data


according to the Legendre index. The dashed axes are the solution combinations which define the plane maximize GL and are
orthogonal in the covariance metric.

and have interpretability


representation

Page48

measure 0.17.

These axes are spun to the simplest

as discussed in Section 1.4.5 and orthogonalized.

axes are shown in Fig. 1.8 and the combinations


p1 = (-0.03,

-0.12, -0.95, -0.01,

@2 = (-0.02,

0.16,-0.09,

with interpretability

0.29,-0.02,

0.94,-0.19,

parameter

are
0.00, -0.03,-0.02,

0.09,-0.01,

0.18,

O.OO)T

0.07,-0.06)T

measure 0.80.

Fig. 1.10 shows the values of the interpretability


plicity

The resulting

and Fourier indices for sim-

values X = (0.0, 0.1, . . . , l.O), analogous to Fig. 1.6 for the sim-

ulated data. This example demonstrates


as the interpretability

that the projection

index does, possibly

index may increase

because the algorithm

gets bumped

Page49

1.5 Examples

CT
2

0.2

0.4

0.6

0.6

Fig. 1.10

Projection and interpretability


indices versus X for the automobile data.
The projection index values are normalized by the
X = 0 value and are joined by the solid line. The simplicity
index values are joined by the dashed line.

out of a projection

local maximum

as it moves toward a simpler solution.

Given this plot, the statistician


lution

planes for specific

may then choose to view several of the so-

X values.

actual values of the combinations

Six solutions

are given in Table 1.2. If coefficients

less than 0.05 in absolute value are replaced by -,


sense, this action of discarding
nature of the interpretability
at which
coefficients

the coefficients
move quickly

suggests investigating

are shown in Fig. 1.11.

small coefficients

which are

Table 1.3 results.


is contrary

The

In some

to the continuous

measure. However, the second table shows the rate


decrease.

Due to the squaring

in the index

SV, the

to one but slowly to zero. This convergence problem

the use of a power less than two in the index in the future

as discussed in Section 3.4.3. The table also demonstrates

the global simplicity

1. Interpretable

Exploratory

I I I I
3

Projection

I I I I
. ..

Pursuit

Page 50

I I I /

-_

-4

-2

-4

-4

I 1

h=0.4

*...

-2
X=0.6

..

-2

h=0.7

..

i:

:i

.I

;.
. .
;;

;
.

*.
.

.
-.

;.
.
. :
:- *:

-4

-2

x=1.0

Fig.

I.11

Projected automobile data scatterplots for various values of A.


The values of the indices can be seen in Fig. 1.10 and the combinations are given in Table 1.2

Page 51

1.5 Examples

0.4

0.00

-0.09

-0.97

0.06

0.22

-0.01

0.02

-0.02

-0.02

0.00

-0.06

0.15

0.00

0.95

-0.15

0.08

-0.03

0.19

0.07

-0.06

0.00

-0.08

-0.98

0.04

0.16

-0.02

0.03

-0.02

-0.02

-0.01

-0.03

0.12

0.01

0.97

-0.11

0.12

-0.03

0.14

0.06

-0.04

0.01

-0.05

-0.99

0.01

0.12

-0.01

0.04

0.00

-0.01

-0.01

-0.03

0.06

0.00

0.98

-0.07

0.11

-0.03

0.11

0.05

-0.03

0.01

-0.05

-0.99

0.02

0.09

-0.01

0.05

-0.01

-0.02

-0.01

-0.02

0.06

0.01

0.98

-0.04

0.10

-0.04

0.11

0.05

-0.03

0.02

-0.04

-0.99

0.01

0.09

0.04

0.02

0.02

-0.01

0.00

-0.02

-0.02

0.01

0.99

-0.02

-0.02

0.03

0.01

0.04

-0.09

0.01

-0.03

-1.00

0.00

0.03

0.03

0.02

0.01

0.00

-0.01

0.00

-0.01

0.00

1.00

0.00

0.01

0.03

0.02

0.03

-0.07

0.00

0.00

-1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.5

0.6

0.7

0.8

0.9

1.0

Table 1.2

Linear combinations for the automobile data. The linear combinations are given for the range of X values. The first row for
each X value is PT and the second is ,f?T.

1. Interpretable

3.2

Exploratory

Page 52

Pursuit

-0.10

-0.96

0.05

0.25

0.17

0.94

-0.19

0.06

0.24

0.94

-0.18

-0.10
3.3

Projection

-0.05
-

0.17

-0.96
-

-0.09

-0.97

0.06

0.22

0.15

0.95

-0.15

0.08
0.08
-

0.18
0.18
-

0.07
0.07
-

-0.05
-0.05
-

3.4
-0.06
3.5

-0.08

0.12
-0.05

3.6

3.7

3.8

3.9

-0.98
-0.99

0.06

-0.05

-0.99

0.06

-1.00

-0.99

-1.00

1.0

Table I.3

0.97
0.98
0.98
0.99
1.00
1.00

0.16
-0.11
0.12
-0.07
0.09
0.09

0.08
0.12
0.11
0.10

0.05
-

0.19
0.14
0.11
0.11

0.07
0.06
0.05
0.05

-0.06
-

-0.09

-0.07

Abbreviated
linear combinations for the automobile data. The
linear combinations
are given for the range of X values as in
Table 1.2. A - replaces any coefficient less than 0.05 in absolute
value.

Page 59

1.5 Examples
1.0
0.5
0.0

-_-----_
-

-0.5
-1.0

IIll

IIll

IIII

IIII

0.4

0.2

0.0

III1

0.8

III1

III1

0.2

IIII

0.0

0.8

Y3
IIII

0.5

IIII

0.6

y1

1.0

III1

0.4

Iill

IIII

III1

III1

---

---

-0.5
-1.0
III1

III1

0.2

IIII

IllI

0.4

0.8

Ill1

IIll-

0.8

IIII

0.2

IIII

IIll

0.4

0.6

Yr

IllI-

0.8

YS

1.0
0.5
-

---

-\

0.0

III1

IIII

0.2

IIII

IIII

0.4

0.6

IIll-,

0.8

y,

Fig.

I. It

Parameter trace plots for the automobile data.


The values of
the parameters are given for those variables whose coefficients
are large enough. The parameter values (/3li solid, &i dashed)
are plotted versus A.

1. Interpretable

Exploratory

Projection Pursuit

of the two combinations.

If one variable

Page 54

is in a combination,

it usually

is absent

in the other.
Another

useful diagnostic

tool fashioned

after ridge regression trace graphs

is shown in Fig. 1.12. For variables whose coefficients


X, the values of the coefficients

are 2 0.10 for any value of

in each combination

One of the most interesting

solution

are plotted

versus X.

planes is the one for X = 0.8. The combi-

nations involve four variables which are split into two pairs, one in each combination.

The first combination

and the fifth

variable

involves the negative of the third variable engine size

automobile

weight.

The second combination

engine power and the negative of the gaussianized


the solution

plane with

by the plotting

must decide when the tradeoff


In this example,

since the projection

plane. However,

the decision may be difficult.

no external

a planes usefulness.
No doubt

measurement
Rather,

interpretable
cedure to.find
ticular

error can be used to judge

of planes which exhibit

Friedman

projection

(1987) presents

applies a transformation

it yet leaves other orthogonal


pursuit
planes.

algorithm

In fact,
After

disturbing
a method

to the structured

planes undisturbed.

The

can employ the same pro-

Once the A value and hence the par-

plane have been chosen, the structure


as desired.

structure.

should be removed without

planes.

He recursively

several interesting

found and simplified

Given that this is an exploratory

1.9. The object is to find all such views.

of other interesting

exploratory

increased from the previous

such as prediction

plane is found, its structure

plane which normalizes

and accu-

the decision rests with the statistician.

in Figs. 1.8 and

removal.

between simplicity

index actually

this data has a wealth

two are evident

for structure

delineated

the X = 0.8 model is a good stopping

place, especially

the structure

Japanese flag. Fig. 1.13 shows

the type of car (Japanese or Non-Japanese)

racy should be halted.

a structured

the

symbol.

The statistician

technique,

involves

is removed and the next plane is

Page 55

1.5 Examples

.-.

*
.,
- -.
. . ...:: y.,.
-.. . . .
.. ,.. .. .
. **
: :
.

Fig.

The interpretability

onal.

;*
.
..

of a collection

j(:

of solution planes could be considered.

two planes might be simple in relation

However,

in orthogonal

-I
..*

Country of origin projection scatterplot of the automobile data


for X = 0.8. American and European cars are plotted as points,
Japanese cars as asterisks.

I.13

example,

.*

since the structure


planes, in practice

to each other if they are orthog-

removal procedure
solution

For

does not affect structure

sets of planes tend to be orthogonal

anyway.
Originally,
Section

1.1.1.

to rotate
color.

exploratory
After

the point

The statistician

was time-consuming
The solution

projection

choosing

pursuit

was interactive,

three variables,

cloud in real time.

A fourth

the statistician
dimension

could then pick out structure


and only allowed

was to automate

combinations

the structure

as mentioned

in

used a joystick
could be shown in

by eye. However,

this task

of four variables

at best.

identification

by defining

an index

1. Interpretable

Exploratory

which measures structure


statistician

Projection Pursuit

mathematically

loses some of her individual

Page 56

and to use a computer


choice in structure

optimizer.

identification

The

but she

can define her own index if desired.


Interpretable
variable

selection.

coupled with

exploratory

projection

The interpretability

a numerical

pursuit

is an analogous

index is a mathematical

optimization

routine,

choosing which three or four variables

to view.

automation

of

measure which,

takes the place of interactively

Chapter

Interpretable
Pursuit
Regression

Projection

This chapter

describes interpretable

ilar in organization
common

to the previous

projection

chapter

modification

reviewing

the notation

is considered

ing goals of exploratory


interpretability

pursuit

though

to both have been addressed previously.

inal algorithm,

In the second section,

is detailed.

and projection

index must be changed slightly

It is sim-

condensed as several issues

and strategy.

pursuit

regression.

Section 2.1 deals with the orig-

and the new algorithm

projection

the

Due to the differ-

pursuit

regression,

from that of Chapter

the

1. The final

section consists of an example.

2.1 The Original


The original
and Stuetzle

Projection

projection

(1981).

tures and extends

pursuit

Friedman

the approach

regression in addition

Pursuit

Regression

Technique

regression technique

(1984a, 1985) improves


to include

classification

to single response regression.

response regression is considered.

57

is presented in Friedman
several algorithmic
and multiple

In this chapter,

fea-

response
only single

2. Interpretable

2.1.1

Projection

Pursuit

Page 58

Regression

Introduction

The easiest way to understand


as a generalization
tion pursuit
(1982).

pursuit

linear regression.

regression is to consider it

Many authors motivate

projec-

regression in this manner, among them Efron (1988) and McDonald

For ease of notation,

predictors

= (X1,X2,.

linear function
familiar

of ordinary

projection

suppose the means are removed from each of the

of the centered X.

notation,

The goal is to model the response Y as a

. . ,Xp)T.

With

the usual assumptions

and slightly

un-

the single response linear regression model may be written

WI

Y - E[Y] = WTX + c

where c is the random error term with zero mean. The vector w = (WI, ~2, . . . , w~)~
consists of the regression coefficients.
In general, this linear function
Y given particular

is estimated

values of the predictors

by the conditional

expectation

2 = (q, 52,. . . , zp). The fitted

of

value

of Y is
3(x)
The expected

= E[Y] + WTX

value of Y is estimated

distance between the true and fitted

by the sample mean.

w of the model are estimated

m in

In practice,
mean.

The expected

L2

random variables is

L2(w, x, Y) E E[Y - Q2

The parameters

WI

Lg(w,X,Y)

by

P-31

the sample mean over the n data points replaces the population

2.1 The Original Projection

Page 59

Pursuit Regression Technique

The model [2.1] may be written

Y - E[Y] = /qct'X)

WI

+ E

where p = wTw and a = (01, CY~,. . . , c+,) is w normed.


is analogous to [2.2]. Th e p arameters

equation

[2.3] subject

to the constraint

The rewritten
the projection

model

between the fitted


generalization

projections

the fitted

pursuit

stricted

combinations

oj which

zero mean and unit variance.

of functions

o. The relationship

is a straight

line. A natural

pursuit

unrestricted

regression
functions

of

parametrically.

regression model with m terms is

@jfj(aTX)

the functions

The parameters

In the usual way, the conditional

WI

+ E

define the directions

to have length one. In addition,

the terms.

oTX

Y depends only on

value to be a sum of univariate

y - E[Y]= 5
The linear

as in

to vary. Projection

which are smooth but otherwise

The projection

p and o may be estimated

variables X onto the direction

values Y and the projection

allowing

value

response variable

is to allow this relationship

does just that,

fitted

that aTa, = 1.

[2.4] shows that

of the predictor

The resulting

of the smooths
fj are smooth,

pj capture

are re-

and have

the variation

between

mean is used to estimate

the sum

as
m
p(X)

E[Y]

@j.fj(aTx>

j=l

with the parameters


Analogous

estimated

to exploratory

as in [2.3].
projection

pursuit,

may be more successful than other nonlinear


mensional

space (Huber

1985).

The model

projection

methods

pursuit

by working

regression

in a lower di-

[2.5] is useful when the underlying

2. Interpretable

relationship

Pursuit Regression

Projection

Page 60

between the response and predictors

is nonlinear,

gression [ 2.41, and when the relationship

is smooth,

methods

Diaconis

that

such as recursive

any function

number

can be approximated

of terms

m.

Substantial

theoretical

properties

algorithm

are difficult

2.1.2

partitioning.

work

of the method.

as opposed

to other

and Shahshahani

by the model
remains

versus ordinary

nonlinear

(1984) show

[2.5] for a large enough

to be done

In addition,

re-

with

the numerical

respect

to the

aspects of the

as discussed in Section 2.2.3.

The Algorithm

The parameters

pj, oj and the functions

min

fj are estimated

by minimizing

La@, a, f, x, Y)

@j,CYj,fj:j=l,...VZ

QTaj = 1

WI

JVjl = 0
and Var[fj]
The criterion
However,
for.

algorithm
k=

if certain

Friedman

results

l,..

[2.6] cannot

be minimized

simultaneously

ones are fixed, the optimal

(1985) employs

are discussed

in Section

. , m. The problem

for all the parameters.

optimization

as they are pertinent

2.2.3.

values of others are easily solved

such an alternating

in this section

is considered

j = 1,. . . ,m

= 1

First,

strategy.

His

when the modified

he considers

a specific term

k,

[2.6] may be written

min -Wk - Bkfi;(~TX)12

pk#k,fk

WI

where Rk s
j#k
For the kth term,
in turn

while

the three sets of parameters

all others

are held constant.

have been found, the next term is considered.

After

/?k, ok, and fk are estimated


all elements

The algorithm

of the kth term

cycles through

the

2.1 The Original Projection


in the model until

terms

alternating

strategy

The minimizing

Pursuit Regression Technique

the objective

Page 61

in [2.6] d oes not decrease sufficiently.

The

is discussed in more detail in Section 2.2.3.


,Bk is

P-81
The minimizing

This

function

estimate

is found

fk for any particular

using the nonparametric

man (198413). Th e resulting


constraints.

direction

procedure.

squares problem

as possible.

Minimizing

but rather

as an

the criterion

and requires an

[2.6] as a function

of ok is a least-

In applying

than use a method

of applying

these methods,

an iterative

search procedure

if the function

to be minimized

simplifies

and is easily approximated

(Gill

determining

optimization

is of least-squares

fact and uses the Gauss-Newton

first

deriva-

Hessian is preferable.

specifically

et al. 1981).

to

about the function

which only employs

the Hessian, outweighs their additional

However,

onthis

directly

which also uses the actual or approximate

the difficulty

rately estimating

ak.

function

the goal is to use as much information

Thus, rather

tives, a method

italizes

to satisfy mean and variance

ok cannot be determined

as seen in [2.7].

a function,

Usually

curve is standardized

discussed in Fried-

value for each observation.

The minimizing

minimize

smoother

It is not expressed as a mathematical

estimated

iterative

point azx is

or accu-

properties.

form,

the Hessian

Friedman

(1985) cap-

procedure

to find the optimal

2. Interpretable

2.1.3

Projection

Model

Selection

Strategy

Model selection in the original


ing the number

Page 62

Pursuit Regression

projection

pursuit

of terms m in the model.

gression considers not only the number

regression consists of choos-

Interpretable

projection

pursuit

re-

of terms but also the interpretability

of

those terms.
Friedman
M.

(1985) suggests starting

The procedure

for finding

the algorithm

this model is discussed in Section 2.2.3. A model

with M - 1 terms is then determined


the previous

with a large number of terms

model as a starting

using the M - 1 most important

point.

The importance

terms in

of a term k in a model

of m terms is defined as
Ik E fjj

[2.10]

where ]/3l] is the maximum

absolute

parameter.

Since the functions

to have variance one, [@J-lmeasures the contribution

strained

fj are con-

of the term to the

model.
The statistician

may then plot the value of the objective

in [2.6], which

is

the models residual sum of squares, versus the number of terms for each model.
In most cases, the plot has an elbow shape. The usual advice is to choose that
model closest to the tip of the elbow, where the increase in accuracy
additional

2.2

term is not worth

The Interpretable
A projection

number

of terms

functions

fi.

m =

the increased complexity.

Projection

pursuit

regression
1, . . . , M.

Since these functions

they are visually

Pursuit
analysis

combination

represents.

The parameters

Regression
produces

Approach

a series of models

The models nonlinear

components

are smooths and not functionally

assessed by the statistician.

associated

due to the

Each is considered

aj in order to understand
,f3j measure the relative

what

with

are the

expressed,

along with

its

aspect of the data it

importance

of the terms.

2.2 The Interpretable

Each direction

Projection

oj must be considered in the context of the original

variables p. The collection


such as the correlation

of combinations

constraint

regression technique,
of variables

similar

to those in Section

selection

before, a weighted

penalty

criterion

in projection
balancing

The simplicity

1.2 show that


pursuit

As with any

a combinatorial

criterion

The minimization

problem

index measures the interpretability


since the objective

1nterpretabilit.y

that the number of terms m is fixed.

This assump-

tion is discussed in Section 2.2.2. The goal is define an interpretability


am). At first glance, the situation

foragroupofmvectors(ai,cy2,...,
exploratory

two vectors (PI, ,&) is measured.


a plane.
which,

In relation
when normed

in the latter

the simplest

index S
is a gen-

where the simplicity

of

case, the vectors define

pairs of combinations

are ones

This type of orthogonality

is

in Section 2.3.5. Measures of this squared orthog-

and each vectors

individual

Within

each combination,

variables

index

pursuit

and squared, are orthogonal.

onality

interpretability

projection

However,

to each other,

named squared orthogonality

of

Index

Consider for the moment

of interpretable

to be

in the next two subsections.

The Interpretability

eralization

[2X]

of the collection

is minimized.

an

becomes

, Q2 T.--T a m )

xqcrl

As

[2.6] with

in the first term causes both contributions

and is subtracted

ap-

regression is not feasible.

min La

index choices are.discussed

2.2.1

pursuit.

is desirable for parsimony.

L2(P)crJJm

for X E [0, l]. Th e d enominator

combinations

projection

the goodness-of-fit

is employed.

(1 -A)
min
/3j,Crj,fj:j=l,..., TTZ

E [O,l].

in exploratory

a variable selection method which causes the same subgroup

proach to variable

interpretability

number of

is not subject to any global restriction

to appear in all the combinations

Arguments

Page 63

Pursuif Regression Approach

interpretability

enter the index

are selected via the single vector

Sr [1.12] by encouraging

the vector

coefficients

S, [1.18].
varimax
to vary.

2. Interpretable

Different

variables

Page 64

Pursuit Regression

Projection

are selected for either vector by forcing

the combinations

to

exclude the same variables.


Projection

pursuit

be simple within
is penalized.
interest

regression has a different

itself.

That is, homogeneity

The varimax

each vector

of the coefficients

within

should
a vector

index Sr for one vector achieves this. However,

in the

of decreasing the total number of variables in the model, the same small

subset of variables
of each direction
variables

should be nonzero in each vector.

Summing

as in [ 1.171 d oes not force the vectors

necessarily.

On the other hand, if the exploratory

the simplicities

to include

to squared orthogonality

and would contain

may be used for projection


if variable

pursuit

selection is the object,

Consider

the m X p matrix

the directions

different

the same

projection

measure [l ,161 for q = m vectors is used, the vectors would

varimax

Though

goal. First,

variables.

pursuit
be forced

This old index

regression to achieve this outcome.

However,

a new index must be developed.


of combinations

aj are constrained

defined for all sets of combinations

to have length one in [2.6], the index is

in general.

The goal is that all the combina-

tions are nonzero or load into the same columns so that the same variables
in all combinations.

are

An index which forces this outcome is based on a summary

vector y of the matrix

52 whose components

are

2
3; -

2
j=l

Each component
tribution

yi is positive

to variable

5;

i=l,...,p

[2.12]

T~j

and contains

the sum of each terms

i, or the total column weight.

The components

relative

con-

of the new

2.2 The Interpretable

Projection

Pursuit Regression Approach

Page 65

vector sum to m. The object is to force these components


sible. For a single combination,
index S, [1.12].

to be as varied as pos-

this is achieved via the varimax

For y appropriately

normed,

this measure is

= $-&py

WY)

interpretability

12.131

84

The index
function

[2.13] f orces the weight

of all the combinations,

f2

Sd

O!y1,(Y2,...,~m)=-

The function

[F

of the columns

1)5 5 g-&l

waJ+2(pP

j#k

j=l

is in contrast

to force the squared orthogonality

interpretability

of each combination.
exploratory

projection

over the terms is subtracted

of each column

to be dissimilar.

of the each total column weight over the rows of the matrix,

of the model,

in order

of the combinations.

The index forces the overall weights


dispersion

+m;;:l) *

i=l

to S, for interpretable

[1.18], in which the cross-product

pursuit

As a

this index is called Ss and may also be written

Sr measures the individual

This re-expression

of R to vary widely.

depends on the goodness-of-fit

criterion,

An example

The

or terms
using this

index is discussed in Section 2.3.

2.2.2

Attempts

to Include

In the discussion
argument

that

so certain.

with

However,

assessed.

of Terms

of terms m in the model is fixed.

fewer terms is more interpretable

on further

Each term involves

The interpretability
be visually

so far, the number

a model

-more persuasive.

the Number

reflection,

its combination

of the combination

this conclusion

The

than one with


does not appear

oj and the smooth

function

can be measured but the function

fj.
must

2. Interpretable

Projection

Pursuit

Consider the following


out knowing

Regression

example with two variables Xi and X2 (p = 2). With-

the data context,

ranking

(1) n-l =

in order of interpretability

function
Situations

of widely

is impossible.

varying

would be difficult

f2(X2)

The first model involves two functions

combination.

of the most complex

easily can be imagined

more interpretable.

The second involves only one

combination

in which

clearly

For example, if the two variables


traits

(apples vs.

to understand

models

f(X1ix2)

each, the simplest

consisting

the two fitted

2, fl(&),

(2)m=l,

of one variable

Page 66

oranges),

of variables,

the average.

one or the other


are distinct

model

measurements

then a combination

of the two

and Model (1) would be preferable.

However, if

the two were aggregate measures of similar

characteristics

(reading

and spelling

scores) which could be combined easily into a single variable (verbal ability),
second model m ight be easier to interpret.
the number

As with
tion pursuit
statistician
weight
number

interpretable

is equipped

exploratory

with

projection

pursuit,

the algorithm

interpretable

as an interactive
dial.

process in which the

and the model becomes more simple.


is included

should drop terms as it simplifies

three methods

considered.

Factors required

fortunately,

none of the resulting

projec-

As the dial is turned,

of terms to begin with is m and this parameter

discussion,

ways to include

measure of a model is enlightening

an interpretability

(X) in [2.11] is increased

the following

the

are unsuccessful.

regression can be envisioned

sure of simplicity,

below.

However, considering

of terms in the interpretability

even if present attempts

is

for including

the

If the

in the meathe model.

In

m in the index [2.13] are

to put the index values E [0, l] are ignored.


indices works in practice

Un-

for the reasons given

2.2

The Interpretable

Projection

Pursuit

Since the interpretability


%rst attempt

Regression

Page 67

Approach

of a model decreases with the number of terms, the

is to multiply

the interpretability

index [2.13] by A, obtaining

Sa(CY1,Cr2,. . .) CXm)E -;[$-&-;)2]

t=l

The resulting

index Sa decreases as the number of terms increases. However, this

measure does not work in practice as each terms contribution

is reweighted

when

the number of terms changes. Instead, the index should be such that submodels
of the current

model measured contribute

the same to the index, regardless

of

the size of the complete model.


Each model can be thought
number

of points

Section 2.3.4.

increases.

The total

increases as the number


contain

of as a point
Consider

E BP, As terms are added, the

a distance

interpretability

of distance from a set of points


of points

the sarne variables,

Since the object

does.

a plausible

interpretability

Sb(al,Q2,-..,~m)--

ri$

el[Vc~~

index

as in

to a particular
is that

point

all the terms

index is

-Vll12

j=l

As in Chapter

1, the set V is composed of simple vectors

squared and normed


one dimensional

[1.14] and vaj is the

version of the oj term [1.13]. This index is similar

index [1.15]. The m inimum

j but of the sum of distances


the same interpretable

to the

is not taken of each individual

in order to ensure that

term

all the terms simplify

to

vector ~1. If the set V consists of the cls (1 = 1,. . . ,p),

Sb reduces to
Pm

2
"ji
- T

cc
i=l

where yk is the largest


with

the removal

discontinuous.

j=l

~j

component

of the m inimization,

2-/k

+ m

tyj

as defined

in [2.12].

the derivatives

Unfortunately,

of the given index

even
are

Page 68

2. Interpretable Projection Pursuit Regression


Averaging

which

all the possible

is continuous.

toward

distances

vectors

the closest ~1 overpowers

Unfortunately,

though

continuous

all the terms to collapse toward


In conclusion,

the number

sion model

cannot

selection

the interpretability

linear regression
procedure

is poor.

to explore

2.2.3

The strategy

term model

it.

The outlook
Instead,

should

procedure

projection
than a strict

themselves

situation

with

model

regresin which

between

model

pursuit

of

measur-

pursuit

the tradeoff

for an analogy

interpretable

the number

projection

controls

inter-

selection

in

regression
selection

compare

these models is discussed

is described

in the next subsection.

is a
pro-

in Section

Procedure
a large number

of models

of terms m = 1,. . . , M, the following


models for different

X. Steps 1 and 2 find the original


which

toward

and smoothly

to a one parameter

X completely

to find a sequence of interpretable


ity parameter

be less than

of the terms

such a method,

is to begin with

model with number

at incorporating

and the interpretability

be reduced

The Optimization

T must

simplifies

and pulls the term

for simultaneously

the model space, rather

cess. How the statistician


2.3. The optimization

As a result,

attempts

A method

parameter

and accuracy.

each term

as opposed to Sa, the index S, does not force

Without

yet.

pretability

index

the same simple vector.

of terms

has not been found

ul.

the others

none of the three

terms into the index works.


ing both

a third

The power r must be chosen so that

one of the interpretable

one so that

produces

minimizes

[2.6].

M.

For each sub-

algorithm

is employed

values of the interpretabil-

projection

pursuit

regression

Steps 3 and 4 find the sequence of models

which minimize [2.11] for various X values. Throughout the description, updating a parameter

means noting

the new value and basing subsequent

dependent

Page 69

2.2 The Interpretable

Projection

Pursuit Regression Approach

calculations

Moving

a model means permanently

on it.

changing its parameter

values.

1. The objective function is [2.6]. Use the stagewise modeling procedure outlined in Friedman and Stuetzle (1981) to build the M term model.

2. Use a backwards stepwise approach to fit down to the m term model.
   For i = 1, ..., M - m
   begin
      Rank the terms from most (term 1) to least important (term M - i + 1) as measured by [2.10]. Discard the least important term. Use an alternating procedure to minimize [2.6] over the remaining M - i terms.
      a. For k = 1, ..., M - i
      begin
         Update the term's parameters α_k, β_k and curve f_k, assuming the other terms are fixed. Choosing from among several steplengths, do a single Gauss-Newton step to find the new direction α_k. Complete the iteration by updating β_k and f_k using [2.8] and [2.9] respectively. Only one Gauss-Newton step is taken for each iteration due to the expense of a step. Continue iterating until the objective stops decreasing sufficiently.
      end
      b. If the objective decreased sufficiently on the last complete loop through the terms (a), perform another pass (GOTO a). Otherwise, move the model; the optimization of the M - i term model is complete.
   end


3. Let λ_0 = 0 and call the m term model resulting from Steps 1 and 2 the λ_0 model. Choose a sequence of interpretability parameters (λ_1, λ_2, ..., λ_I). Let i = 1.

4. The objective function is [2.11]. Set λ = λ_i and solve for the m term model using a forecasting alternating procedure with the λ_{i-1} model as the starting point. Reorder the terms in reverse order of importance [2.10], from least (term 1) to most important (term m). Make a move in the best direction possible.
   a. For k = 1, ..., m
   begin
      Choosing from among several steplengths, update the α_k resulting from the best step in the steepest descent direction. Complete the iteration by updating β_k and f_k using [2.8] and [2.9] respectively. Only one steepest descent step is taken due to the expense of a step. Always perform at least one iteration and then continue iterating until the objective stops decreasing sufficiently.
   end
   b. If only one loop through the terms (a) has been completed, move the model regardless of whether the objective decreases or not and perform another pass (GOTO a). If more than one loop has been completed and the objective decreased sufficiently, move the model and perform another pass (GOTO a). Otherwise, the optimization of the m term model with interpretability parameter λ_i is complete.
   c. If i = I, EXIT. Otherwise, let i = i + 1 and GOTO 4.
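A schematic of the outer loop of Steps 3 and 4 might look as follows; `fit_ppr` and `steepest_descent_pass` are hypothetical stand-ins for Steps 1-2 and Steps 4a-b respectively, not routines from the thesis software.

```python
def interpretable_path(X, y, m, lambdas, fit_ppr, steepest_descent_pass):
    """Steps 3 and 4: sweep the interpretability parameter lambda, warm-starting
    each fit from the previous model so the path through model space is smooth."""
    model = fit_ppr(X, y, m)            # Steps 1 and 2: the lambda = 0 model
    path = [(0.0, model)]
    for lam in lambdas:
        # terms are visited in reverse order of importance inside the pass,
        # and the whole set of m terms is always moved at least once
        model = steepest_descent_pass(model, X, y, lam)
        path.append((lam, model))
    return path
```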

The forecasting alternating procedure in Step 4 differs from the alternating procedure in Step 2. In the latter case, determining the optimal direction α_l for a specific term reduced to a least-squares problem. While this problem could not be solved analytically and had to be iterated, the least-squares form lent itself to a special procedure (Gauss-Newton) which utilized the estimated Hessian of the objective, as noted in Section 2.1.2. However, with the addition of the interpretability term, [2.11] no longer has this form. Thus, a cruder optimization method must be used. Steepest descent is employed, and the step direction is the negative of the gradient of the objective function; the gradients are given in Appendix A. Unfortunately, this method is not as accurate as Gauss-Newton.

The Step 4 approach is also different in that an attempt is made to forecast the effect a simplification of one term will have on the others. In the original algorithm as described in Step 2, each term is considered separately: a term is not updated unless the objective decreases as a result. This approximation, made for ease in calculation, ignores the interaction among the terms' contributions to the simplicity index. For example, suppose a change in term one does not produce a decrease in the objective, but this change in term one and the resulting shift in term two together produce a decrease. Such a combination of moves is not considered in the original algorithm, and scenarios exist in which it does not find the global minimum. Using this look one step ahead approach in interpretable projection pursuit regression does not work because of the strong interaction between the terms: the increase in interpretability for an individual term must be very large before it changes on its own. However, once it changes, the other terms quickly follow suit, like a row of dominoes. The result is that if the Step 2 approach is used, the algorithm produces either very complex or very simple models but none in between.

Thus, a compromise is reached based on empirical evidence. The first change is due to the fact that all terms contribute equally to the interpretability measure, irrespective of their relative importance [2.10]. Thus, when considering simplification of the model, that term which least affects the model fit should be considered first, as it does as well as any other at increasing interpretability. As a result, the terms are looked at in reverse order of importance (Step 4).

The second change is that the algorithm moves the entire set of m terms at once as opposed to one term at a time. A move is not evaluated until it is formed from a sequence of submoves, each of which is a shift of an individual term (Step 4a). These submoves are made in reverse order of goodness-of-fit importance and are made in the best direction possible, the steepest descent one.

The third change is that the algorithm always moves the model, even if the move appears to be a poor one. In some respects this jiggling is a form of annealing (Lundy 1985). The minimum steplength considered in Step 4a is positive yet quite small, so large unwelcome increases in the objective are impossible.

Both the second and third changes are attempts at forecasting the effect of the simplification of one term on subsequent terms. The given algorithm works well in practice, as is demonstrated in the next section. However, the requirement that the algorithm always move means that occasionally a clearly worse model results; usually on the next step, the algorithm quickly adjusts for the next value of λ. This behavior is seen in the following example.

2.3 The Air Pollution Example

The example in this section concerns air pollution. The data is analyzed using additive models in Hastie and Tibshirani (1984) and Buja et al. (1989), and using alternating conditional expectations (ACE) in Breiman and Friedman (1985). It consists of 330 (n) observations of one response variable (Y) and nine (p) independent variables each. The daily observations were recorded in Los Angeles in 1976. The variables are


Y : ozone concentration
x1 : Vandenburg 500 millibar height
x2 : windspeed
x3 : humidity
x4 : Sandburg Air Force Base temperature
x5 : inversion base height
x6 : Daggott pressure gradient
x7 : inversion base temperature
x8 : visibility
x9 : day of the year
As in exploratory projection pursuit, all of the variables are standardized to have mean zero and variance one before projection pursuit regression is applied. As suggested by Friedman (1984a), the original algorithm (Steps 1 and 2 in Section 2.2.3) is run for a large value of M initially. For this example, M is chosen to be nine (p). The algorithm produces all the submodels with number of terms m = 1, ..., M by backstepping from the largest. The inaccuracy of each model is measured as the fraction of variance it cannot explain. As noted in the introduction, inaccuracy in this thesis denotes lack of fit as measured with respect to the data rather than to the population in general. From [2.6], this fraction is defined as

$$U \equiv \frac{E\left[\,Y - \sum_{j=1}^{m}\beta_j f_j(\alpha_j^T X)\right]^2}{\operatorname{Var}(Y)}$$

The plot of the number of terms and fraction of unexplained variance of each model is shown in Fig. 2.1.
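In code, with sample means standing in for the expectations, this is simply (a minimal sketch):

```python
import numpy as np

def unexplained_variance(y, y_hat):
    """Fraction of unexplained variance U: lack of fit measured on the data."""
    return np.mean((y - y_hat) ** 2) / np.var(y)
```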

Fig. 2.1 Fraction of unexplained variance U versus number of terms m for the air pollution data. Slight elbows are seen at m = 2, 5, 7.

Using the original approach, the statistician chooses the model from this plot by weighing the increase in accuracy against the additional complexity of adding a term. She generally chooses a model at an elbow in the plot, where the marginal increase in accuracy due to the next term levels off. In some situations, such an elbow may not exist or it may not be a good model choice in actuality. Only one model for each number of terms m is found; the model space is one dimensional, with one point for each number of terms. Interpretable projection pursuit regression expands the model space for a particular m to two dimensions by adding an interpretability measure. The starting point of the algorithm search for a given m is the model shown in Fig. 2.1. Then, for a sequence of λ values which signify an increasing weight on simplicity, the algorithm cuts a path through the model plane [U × S_v].

2.3 The Air Pollution

Lubinsky

Page 75

Example

(1988) d iscuss searching

and Pregibon

model space in general.


vides the structure

through

Their premise is that a formalization

a description

of this action pro-

by which such a search can be automated.


which

space characterization,
prehensive

than

agree that

two important

In their work, the latter

Their

is based on Mallows (1983) work,

the two dimensional

U and Sg summary

description

dimensions

concept is an extension

eters measure and is in the spirit

includes both the conciseness of the description

descriptive

is-more

given above.

are accuracy

comThey

and parsimony.

of the usual number

of interpretability

or

of param-

as defined in this thesis.

It

and its usefulness in conveying

information.
Initially for this and other examples, the interpretability parameter λ sequence is (0.0, 0.1, ..., 1.0). However, the usual result is a path through the model space which consists of a few clumps of models separated by large breaks in the path. Even the forecasting nature of the algorithm described in Section 2.2.3 cannot completely eliminate these large hops between model groups. In order to produce a smoother path, the statistician is advised to run the algorithm with additional values of λ specifically chosen to produce a more continuous curve. For example, if on the first pass the path has a large hole between the λ = 0.3 and λ = 0.4 models, the algorithm should be run with additional λ values of (0.33, 0.36, 0.39). Using this strategy, twenty models are produced for each value of m = 1, ..., 9. The actual λ values used are not shown in the following figures as their values are not important; the interpretability parameter is solely a guide for the algorithm through the model space.

Various diagnostic plots can be made of the collection of models, which are distinguished by their m, U and S_v values. Chambers et al. (1983) provide several possibilities. Given that the number of terms variable m is discrete, a partitioning approach is used. Partitioning plots are shown in Figs. 2.2 and 2.3. Each point in a plot represents a model with the given number of terms, interpretability S_v and inaccuracy U. Ideally, for the best comparison, these plots should be lined up side by side.

Fig. 2.2 Model paths for the air pollution data for models with number of terms m = 1, ..., 6. Each point indicates the interpretability S_v and fraction of unexplained variance U for a model with the given number of terms.

Fig. 2.3 Model paths for the air pollution data for models with number of terms m = 7, 8, 9. Each point indicates the interpretability S_v and fraction of unexplained variance U for a model with the given number of terms.

However, note that though the S_v scales superficially appear to be the same for all the graphs, they are not, as implicit in each plot is the number of terms m. A symbolic scatterplot, in which all models are graphed in one plot of unexplained variance U versus interpretability S_v with a particular symbol for the number of terms, also obscures the fact that the interpretability scales are dependent on the number of terms in the model.

As interpretability increases, so does the inaccuracy of a model. For most values of m, the path through the model space is elbow-shaped, indicating that initially a large gain in interpretability is made for a small increase in inaccuracy. The curves shift to the left as m increases, as the additional terms decrease the overall inaccuracy of all possible models. For all values of m, the λ = 1 models have the same inaccuracy; these models have all directions parallel and equal to an e_j, so in effect they only have one term of one variable.

Due to the forecasting algorithm employed, the path through the model space is not always monotonic. Occasionally, a clearly worse model, as evidenced by a non-monotonic intermediate move resulting in smaller interpretability and larger inaccuracy, is found; usually on the next step, the algorithm readjusts. This type of behavior is evident for m = 3. The algorithm does not find the global minimum inaccuracy for a fixed interpretability value, and a poor move may be needed to force the algorithm out of a local minimum. As a result, several models are possible for a given interpretability value and number of terms m. In contrast, linear regression variable selection methods find the model which minimizes the inaccuracy for a particular number of terms m.

The draftsman's display in Fig. 2.4 shows how the number of terms m, the inaccuracy U and the simplicity S_v vary. Again, note that plotting all the models versus the same interpretability scale is misleading, since a model with fewer terms is considered simpler than one with more. This set of plots is useful in determining the models from which to choose if certain requirements must be met, such as U ≤ 0.20, S_v ≥ 0.50, intermediate interpretability values S_v ∈ [0.2, 0.4], or some combination of the parameters. More formally, the statistician can define equivalence classes of models from the draftsman plots consisting of sets of models with different numbers of terms which satisfy certain inaccuracy and interpretability criteria.

Fig. 2.4 Draftsman's display for the air pollution data. All possible pairwise scatterplots of number of terms m, fraction of unexplained variance U and interpretability S_v are shown.

The statistician must now choose a model. The first term explains the bulk of the variance, approximately 75% (U = 0.25). For models with m ≤ 6, crossing the 0.20 threshold (U ≤ 0.20) cannot be achieved with even moderate interpretability (S_v ≥ 0.40), as seen in Fig. 2.2. If a model which explains more than 80% of the variance is required, a seven term model with S_v = 0.60 is possible. However, its advantage over the original two term model (Fig. 2.1), which explains slightly less, is debatable. If an inaccuracy U close to 0.25 is acceptable, simple two or three term models are possible. For example, a three term model exists with S_v = 0.40 and U = 0.21. Alternatively, a two term model exists with S_v = 0.85, U = 0.23 and combinations

$$\alpha_1 = (0.0,\; 0.0,\; 0.0,\; 1.0,\; 0.0,\; 0.0,\; 0.0,\; 0.0,\; 0.0)^T \qquad \beta_1 = 0.21$$
$$\alpha_2 = (0.0,\; 0.0,\; 0.3,\; 0.9,\; 0.0,\; 0.0,\; 0.1,\; 0.0,\; -0.2)^T \qquad \beta_2 = 1.10$$

The fourth variable, Sandburg Air Force Base temperature, is the most influential, as is cited in other analyses of this data (Hastie and Tibshirani 1984, Breiman and Friedman 1985). The last variable, day of the year, also has an effect. As discussed in Sections 1.5.2 and 3.4.3, the convergence of the coefficients to zero is slow, and an interpretability index with a power less than two may be warranted.

The projection pursuit regression model is different in form to the additive and alternating conditional expectation models, as it includes smooths of several variables rather than of one variable. Thus, the three models cannot be compared functionally. However, the three inaccuracy measures are comparable.

As remarked in Section 2.2.2, the subjective notion of interpretability is a function of the number of terms m and depends on the context of the data. Though each of the curves is a smooth and therefore does not have a functional form, one model may be more understandable or explainable than another; for example, a function may have a clear quadratic form. The inclusion of this qualitative assessment is a further gauging of a model's interpretability which is not automated by the index S_v: the statistician views the functions f_j when choosing among the models, similar to the weighing of accuracy in the models.

In contrast to exploratory projection pursuit, projection pursuit regression is a modeling procedure. As such, an objective measure of predictive error may be applied to choose between models, and methods such as cross-validation may be used to produce an unbiased estimate of the prediction error. A good strategy is to choose a small number of models based on the above procedure and then distinguish between them using a resampling method. Unfortunately, the cross-validation procedure cannot be included in the interpretability index due to a lack of identifiability, which results from the fact that the number of terms cannot be included in the index: interpretable projection pursuit modeling is not a one parameter (λ) minimization problem. The complexity of the models increases with an additional term, and a subjective measure of how much interpretability is gained or lost with a term must be made. Thus, interpretable projection pursuit regression is an exploratory rather than a strict modeling technique.

Chapter 3

Connections and Conclusions

A comparison of the accuracy and interpretability tradeoff method described in the previous two chapters with other established model selection ideas is warranted and interesting. In this chapter, connections between the proposed modification and various other model space search methods are considered. Since other model selection procedures have not yet been proposed for projection pursuit, the tradeoff is considered for linear regression, a setting whose properties are known for comparison; this discussion is preliminary. The last section identifies topics of future work, including other data analysis methods to which the interpretable modification could be extended, and includes an example of the trading of accuracy for interpretability which demonstrates the generality of the approach.

3.1 Interpretable Linear Regression

In this section, the interpretable modification is considered for linear regression, and the problem is stated as a minimization problem. Rather than use random variables and an expected value minimization, the problem is described as a minimization of the squared distance between the observed and fitted values. As in Chapter 2, matrix notation is used: the vector Y consists of the response values for the n observations (y_1, y_2, ..., y_n), and the n × p matrix X has entries x_ij, the value of the jth predictor for the ith observation.

If an intercept term is required, a column of ones is included. The error vector is ε and the model may be written as

$$Y = X\beta + \varepsilon$$

The parameters β = (β_1, β_2, ..., β_p) are estimated by minimizing the squared distance between the n fitted and actual observations. The problem in matrix form is

$$\min_{\beta}\;(Y - X\beta)^T(Y - X\beta) \qquad [3.1]$$

The least squares estimates solve the normal equations:

$$\hat\beta_{LS} \equiv (X^TX)^{-1}X^TY$$

The modification, for values of the interpretability parameter λ ∈ [0, 1], is

$$\min_{\beta}\;(1 - \lambda)\,\frac{(Y - X\beta)^T(Y - X\beta)}{(Y - X\hat\beta_{LS})^T(Y - X\hat\beta_{LS})}\;-\;\lambda\,S(\beta) \qquad [3.2]$$

where the interpretability index S is the single vector varimax index $S_1$ defined in [1.12], for example. The denominator of the first term is the minimum squared distance possible, which would be attained at the ordinary linear regression solution [3.1]. Recall that if the correlation of the predictors with the response is used instead of squared distance as a measure of the model's fit, the problem becomes a maximization, since the ordinary linear regression solution has maximum correlation; in that case the simplicity term should be added rather than subtracted. As the interpretability parameter λ increases, the fitted vector $\hat Y = X\hat\beta$ moves away from the ordinary least squares fit $\hat Y_{LS}$ in the space spanned by the p predictors, as shown in Fig. 3.1.

Fig. 3.1 Interpretable linear regression. As the interpretability parameter λ increases, the fit moves away from the least squares fit $\hat Y_{LS}$ to the interpretable fit $\hat Y_{ILR}$ in the space spanned by the p predictors.

The interpretable fits $\hat Y_{ILR}$ are not necessarily the same length as the least squares one.


As a variables

coefficient

decreases toward zero , the fit moves into the space

spanned by a subset of p - 1 predictors.

The most interpretable

p - 1 variables

may not be the best p - 1 in a least squares sense. Even if the variable
is the same, the interpretable
.

fitting

least squares model.

coefficients

search may not guide the statistician


The interpretability

index

S attempts

subset

to the best
to pull the

/?i apart since a diverse group is considered more simple, whereas the

least squares method

considers only squared distance when choosing a model.
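As a concrete sketch, the criterion [3.2] can be evaluated directly; the scaled varimax form used below for S is one plausible normalization of [1.12], assumed here for illustration.

```python
import numpy as np

def varimax_1(beta):
    """Single-vector varimax-style simplicity of the squared, normed coefficients,
    scaled so that a basis vector scores 1 and an equal-loading vector scores 0."""
    v = beta ** 2 / np.sum(beta ** 2)
    p = len(v)
    return (p * np.sum(v ** 2) - 1.0) / (p - 1.0)

def objective(beta, X, y, lam):
    """Interpretable linear regression criterion [3.2]:
    (1 - lam) * RSS(beta) / RSS(LS)  -  lam * S(beta)."""
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = lambda b: np.sum((y - X @ b) ** 2)
    return (1.0 - lam) * rss(beta) / rss(beta_ls) - lam * varimax_1(beta)
```

A numerical search routine would then minimize this objective over β for each value of λ.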

The definition of the interpretability index S as a smooth function means that it can be used as a cognostic (Tukey 1983) to guide the automatic search for a more interpretable model. However, these smooth and differentiable properties are the root of the reason the equation [3.2] cannot be solved explicitly. The problem is that the interpretability index contains the squared and normed coefficients, which make the objective a nonquadratic function. A linear solution similar to [3.1] is not possible.

In comparison, Mallows' (1973) $C_P$ and Akaike's (1974) AIC variable selection criteria involve a count function as an interpretability index, as defined in [1.11]. For both the $C_P$ and AIC criteria, the complexity of a model is equal to the number of variables in the model, say k. With norming, the associated interpretability index is

$$S_M(\beta) \equiv 1 - \frac{k}{p} \qquad [3.3]$$

As [3.2] does not admit a linear solution, the more general Mallows $C_L$ criterion, which may be used for any linear estimator, cannot be determined for the interpretable estimate.

The Mallows variable selection technique uses an unbiased estimate of the model prediction error to choose the best model. The resulting search through the model space includes only one model for a given number of variables. Alternatively, the interpretability search procedure is a model estimation procedure which guides the statistician in a nonlinear manner through the model space. Another result of this search is variable selection, as variables are discarded for interpretability. Using this terminology, the discreteness of the criterion means that the $C_P$ and AIC techniques are solely variable, rather than model, selection procedures.
As is discussed in Section 1.3.4, an alternate interpretability index $S_d$ can be defined as the negative distance from a set of simple points. The natural question is whether the modification [3.2] can be thought of as a type of shrinkage.

3.2 Comparison With Ridge Regression

In Section 1.3.4, a distance interpretability index $S_d$ is defined in [1.15] which involves the distances to a set of simple points V = {v_1, ..., v_J}. Before restricting its values to be ∈ [0, 1], this simplicity measure of the coefficient vector β is the negative of the minimum distance to V, or

$$-\min_{j=1,\ldots,J}\;\sum_{i=1}^{p}\left(v_{\beta i} - v_{ji}\right)^2$$

where $v_\beta$ is the squared and normed version of β. The coefficient vector β is squared and normed so that vectors of any length may be compared equally and so that the relative mass, not the absolute size or sign, of the elements matters. Though these actions lead to a nonlinear solution, properties such as Schur-convexity are possible.

Suppose for the moment that the response Y and predictors X are standardized to have mean zero and variance one. This standardization ensures that one variable does not overwhelm the others in the coefficient vector, though the length of the coefficient vector is not necessarily one. This action partially removes the reason for normalizing the coefficient vector in the interpretability index. In order to remove the need for squaring, all possible sign combinations must be considered: the simple set vectors $v_j$ are no longer the squares of vectors on the unit $R^p$ sphere but just vectors on it. For example, they could be $\pm e_j$, j = 1, ..., p. The distance to each simple point is written

$$(\beta - v_j)^T(\beta - v_j) \qquad j = 1, \ldots, p$$

If the optimization problem is reparameterized with a new interpretability parameter K, [3.2] becomes, in matrix form,

$$\min_{\beta}\;(Y - X\beta)^T(Y - X\beta) \;+\; K\,\min_{j=1,\ldots,p}\,(\beta - v_j)^T(\beta - v_j) \qquad K > 0 \qquad [3.4]$$

The second minimum may be placed outside of the first, since the first term does not involve $v_j$, resulting in

$$\min_{j=1,\ldots,p}\;\left[\,\min_{\beta}\;(Y - X\beta)^T(Y - X\beta) + K\,(\beta - v_j)^T(\beta - v_j)\right]$$

The solution vector which minimizes the bracketed portion of [3.4] is

$$\hat\beta \equiv (X^TX + KI)^{-1}(X^TY + K v_l) \qquad [3.5]$$

where I is the p × p identity matrix and l is the index which minimizes the bracketed portion of [3.4]. This estimated vector is similar to that of ridge regression (Thisted 1976, Draper and Smith 1981) with ridge parameter K. The ridge estimate is

$$\hat\beta_R \equiv (X^TX + KI)^{-1}X^TY \qquad [3.6]$$
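A minimal sketch of the two estimates follows; taking the simple set to be the signed basis vectors is an illustrative choice, not the only one.

```python
import numpy as np

def ridge(X, y, K):
    """Ridge estimate [3.6]: shrinks the least squares solution toward the origin."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + K * np.eye(p), X.T @ y)

def interpretable_estimate(X, y, K):
    """Estimate [3.5]: shrinks toward the best simple point v_l, here +/- e_j."""
    p = X.shape[1]
    A = X.T @ X + K * np.eye(p)
    best, best_val = None, np.inf
    for v in np.vstack([np.eye(p), -np.eye(p)]):     # candidate simple points
        b = np.linalg.solve(A, X.T @ y + K * v)
        val = np.sum((y - X @ b) ** 2) + K * np.sum((b - v) ** 2)
        if val < best_val:
            best, best_val = b, val
    return best
```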

Ridge regression may be examined from either a frequentist or Bayesian viewpoint (Thisted 1976). The method is advised in situations where the matrix $X^TX$ is unstable, which can occur when the variables are collinear. In addition, it does better than linear regression in terms of mean square error, as it allows biased estimates. If a Bayesian analysis is used, the distributional assumption is made that the error term ε has independent components, each with mean zero and variance σ². Given this assumption, the prior belief produces the Bayes rule [3.6].

The ridge estimate $\hat\beta_R$ [3.6] shrinks away from the least squares solution $\hat\beta_{LS}$ [3.1] toward the origin as the ridge parameter K increases. The interpretability estimate $\hat\beta$ [3.5] shrinks away from the least squares solution toward the simple point $v_l$ as the interpretability parameter K increases. The index l may change during the procedure, however. If the varimax index [1.12] is used instead of the distance index [1.15], the result is a solution similar to [3.5] except that the shrinkage is away from a set of 2p points $(\pm e_1, \ldots, \pm e_p)$. Shrinkage toward points is the usual ridge regression terminology, so the distance index is used in the explanation above.

As Draper and Smith (1981) point out, ridge regression places a restriction on the size of the coefficients β, whether or not that restriction is believed. The interpretability approach [3.2] does not require any restriction on the coefficients, as it norms and squares the coefficients before examining them. As noted, however, this produces mathematical problems. Clearly, the approach can be viewed as placing a prior on the coefficients, and this Bayesian viewpoint is discussed in the next section.

3.3 Interpretability as a Prior

The least squares solution is also the maximum likelihood solution under certain conditions. The necessary assumptions are that the elements of the error term ε are independent and identically distributed N(0, σ²). Then the squared distance is the negative of the log likelihood, or

$$-\log L(\beta; Y) = (Y - X\beta)^T(Y - X\beta)$$

The maximum likelihood estimate minimizes this expression. Given a prior distribution on β, the posterior distribution is equal to the likelihood multiplied by the prior. Thus, as Good and Gaskins (1971) point out, minimizing the reparameterized [3.2],

$$(Y - X\beta)^T(Y - X\beta) - K\,S(\beta)$$

is equivalent to minimizing the negative log likelihood minus the log prior. To do so is to put a prior density on β where the prior is proportional to exp(K S(β)). For a given interpretability index, the prior for an interpretability parameter value K > 0 is

$$f_K(\beta) = c_K^{-1}\,\exp\!\big(K\,S(\beta)\big) \qquad K > 0$$

where $c_K$ is the normalizing constant. For the varimax index, the prior belongs to the general exponential family defined by Watson (1983). The coefficients are dependent.

Fig. 3.2 Interpretability prior density for p = 2. The prior $f_K(\beta)$ is plotted versus the angle of the coefficient vector in radians for various values of K.

The prior for the p = 2 case is plotted in Fig. 3.2. The constant $c_K$ is calculated using numerical integration. Since the exponential is a monotonic function, the basic shape of the curve resembles the index $S_1$ as in Fig. 1.2. As K increases, the prior becomes more steep as the weight on interpretability increases. The coefficients are pushed toward an angle of zero (β = (1, 0)) or an angle of π/2 (β = (0, 1)). Similar results would be seen for p = 3.
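A minimal numerical sketch of this normalization for p = 2 follows, assuming the scaled varimax form of S used earlier; the exact constant in [1.12] may differ.

```python
import numpy as np

def varimax_1(beta):
    v = beta ** 2 / np.sum(beta ** 2)
    p = len(v)
    return (p * np.sum(v ** 2) - 1.0) / (p - 1.0)

def prior_density(K, n_grid=2000):
    """Interpretability prior f_K on the angle of beta for p = 2,
    normalized by numerical integration over [0, pi/2]."""
    theta = np.linspace(1e-6, np.pi / 2 - 1e-6, n_grid)
    unnorm = np.array([np.exp(K * varimax_1(np.array([np.cos(t), np.sin(t)])))
                       for t in theta])
    c_K = np.trapz(unnorm, theta)          # normalizing constant
    return theta, unnorm / c_K
```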

3.4 Future Work

The previous sections may serve as a foundation for the comparison of the interpretable method with other variable selection techniques. Further work connecting this approach with others is proposed below. In addition, ideas for both extending the method to other data analysis methods and improving the algorithm are given.

3.4.1 Further Connections

Schwarz (1978) and C. J. Stone (1981, 1982) suggest other model selection techniques based on Bayesian assumptions. M. Stone (1979) asymptotically compares the Schwarz and Akaike (1974) criteria, noting that the comparison is affected by the type of analysis used. In the Bayesian framework, the interpretable technique could be compared with these others utilizing asymptotic analysis. Another technique which may provide an interesting comparison is that of Rissanen (1987), who applies the minimum description length principle from coding theory to measure the complexity of a model. In addition, the work by Copas (1983) on shrinkage in stepwise regression may provide ways to extend the ridge regression discussion of Section 3.2. Interestingly enough, all the model selection rules noted involve the number of parameters, k in [3.3], rather than a smooth measure of complexity.

In Chapter 2, the varimax and entropy indices have similar intuitive and computational appeal and both have historical motivation. Interpretability as measured by these indices could be compared with simplicity as defined using philosophical terminology in Good (1968), Sober (1975) and Rosenkrantz (1977). Though the framing of the interpretable approach as the placing of a prior on the coefficients attempts to do this, a less mathematical and more philosophical discussion may prove beneficial.

3.4.2 Extensions

The interpretable approach changes the usual combinatorial model selection search into a numerical one. It can be applied to any data analysis procedure whose resulting description or model involves linear combinations. If the index is extended to measure the simplicity of other description types, such as functions, the tradeoff of interpretability and accuracy may prove useful for even more complicated descriptions.

Interpretable linear regression is an example. For linear regression, all subsets regression is possible for two reasons. First, the least squares solution is known analytically to be [3.1]. Second, branch and bound search methods can be employed which smartly search the model space, eliminating areas so that all possible contenders need not be considered. For methods that produce a model from any subset of predictors but for which such feasible variable selection approaches do not exist, the interpretable selection procedure may prove useful.

At present, examples seem to indicate that interpretable linear regression may help in collinear situations. Whenever the least squares solution is unstable, ridge regression is usually suggested; interpretable linear regression instead clearly chooses the variables to include from a group of collinear ones and produces a stable solution.

A general class of models for which the interpretable method might prove useful is generalized linear models (McCullagh and Nelder 1983), of which logistic regression is an example. At present, stepwise methods are used to choose models.
These methods could be compared with the interpretable approach on the basis of prediction error.

3.4.3 Algorithmic Improvements

As described in Chapter 1, the interpretable exploratory projection pursuit algorithm would benefit from further improvements. First, the rotationally invariant Fourier projection index $G_F$ needs further testing and comparison with others. Other rotationally invariant projection indices are possible, as suggested in Section 1.4.2. Second, present work involves designing a procedure to solve the constrained optimization instead of using the general routine as outlined in Section 1.4.5. The improvement should decrease the computational time involved.

Analogously, the interpretable projection pursuit regression forecasting procedure described in Section 2.2.3 needs further investigation. As mentioned in Chapters 1 and 2, the convergence of the coefficients to zero as the weight on interpretability increases is slow. This property results because of the squaring used in the varimax index and the result that the slope goes to zero as the combinations move to a maximizing $e_i$ (Figs. 1.2 and 1.3). Solutions might be to use a lower power, or a piecewise function which is the varimax index except for a range close to the $e_i$'s where it is linear; the sections of the latter index could be matched to have the same derivatives at their joined points.

3.5 A General Framework

The main result of this thesis is the computer implementation of a modification of projection pursuit which makes the results more understandable. Beyond this specific application, the approach provides a general framework for tackling similar problems. In addition, the search for interpretability at the expense of accuracy is made often in statistics, sometimes implicitly. The identification and formalization of this action is useful, since choices which previously were subjective become objective. In the next subsection, the rounding of the binwidth for a histogram is examined. This common example shows that the consequences of a simplifying action in terms of accuracy loss can be approximated. The definition of interpretability must be broadened to deal with a much more elusive set of outcomes than linear combinations, and measuring the interpretability increase is difficult.

3.5.1 The Histogram Example

Consider the problem of drawing a histogram of n observations $x_1, x_2, \ldots, x_n$. Two quantities must be determined. The first is the left endpoint of the first bin, $x_0$, which usually is chosen so that no observations fall on the boundaries of a bin. The second is the binwidth h, which generally is calculated according to a rule of thumb and then rounded so that the resulting intervals are simple, usually multiples of powers of ten. Based on the mathematical approach used to determine the rules widely employed, the loss in accuracy incurred by rounding the binwidth can be calculated.

The purpose of a histogram is to estimate the shape of the true underlying density. Scott (1979) determines a binwidth rule which asymptotically minimizes the integrated mean square error of the histogram density from the true density. Diaconis and Freedman (1981) use the same criterion to lead to a slightly different rule and further theoretical results. A certain approximation step which Scott employs is useful in approximating the accuracy lost if the binwidth is further rounded. The discussion below follows his approach.

The integrated mean square error of the estimated histogram density $\hat f$ from the true density f is

$$IMSE \equiv \int_{-\infty}^{\infty} E\big[\hat f(x) - f(x)\big]^2\,dx = \frac{1}{nh} + \frac{h^2}{12}\int_{-\infty}^{\infty} f'(x)^2\,dx + O\!\left(\frac{1}{n} + h^3\right) \qquad [3.7]$$

where h is the histogram binwidth. Minimizing the first two terms of [3.7] produces the estimate

$$\hat h = \left(\frac{6}{\,n\int_{-\infty}^{\infty} f'(x)^2\,dx\,}\right)^{1/3} \qquad [3.8]$$

Scott also shows that if the binwidth is multiplied by a factor c > 0, the increase in IMSE is

$$IMSE(c\,\hat h) = \frac{c^3 + 2}{3c}\,IMSE(\hat h) \qquad [3.9]$$

Via Monte Carlo studies, Scott shows [3.8] and [3.9] to be good approximations for normal data.

In reality, [3.8] is useless, since the underlying density f, and therefore its derivative, are unknown. Scott suggests the approximation

$$\hat h_S = 3.49\,s\,n^{-1/3}$$

where s is the estimated standard deviation of the data, using the normal density as a reference distribution. Diaconis and Freedman suggest the similar approximation

$$\hat h_D = 2\,(IQ)\,n^{-1/3}$$

where IQ is the interquartile range of the data. Monte Carlo studies have shown these approximations to be robust.
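Both reference rules are easily computed; a minimal sketch:

```python
import numpy as np

def scott_binwidth(x):
    """Normal-reference binwidth: h_S = 3.49 s n^(-1/3)."""
    return 3.49 * np.std(x, ddof=1) * len(x) ** (-1.0 / 3.0)

def freedman_diaconis_binwidth(x):
    """Interquartile-range binwidth: h_D = 2 IQ n^(-1/3)."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) * len(x) ** (-1.0 / 3.0)
```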
Given either approximation $\hat h_S$ or $\hat h_D$, the statistician may elect to further approximate the binwidth estimate by rounding. The benefits of such simplification are discussed in a moment. At present, consider rounding the estimate. The new estimate $\hat h^*$ is $\in (\hat h_S - u,\; \hat h_S + u)$, where u is some rounding unit. For example, u would be 1/2 if the binwidth is rounded to an integer. The new binwidth may also be written as

$$\hat h^* \equiv (1 + e)\,\hat h_S$$

where e is a positive or negative factor depending on whether the old estimate is rounded up or down. The factor e must be ≥ -1, as a negative binwidth is impossible. The estimation procedure may be drawn schematically:

binwidth → (minimize first two terms) → estimated binwidth $\hat h$ → (use normal density as reference) → approximate binwidth $\hat h_S$ → (round) → rounded binwidth $\hat h^*$

If [3.9] is used as an approximation for the resulting loss in IMSE due to rounding, the relationship between the IMSE of $\hat h_S$ and $\hat h^*$ is written

$$IMSE(\hat h^*) - IMSE(\hat h_S) = \frac{(3 + e)\,e^2}{3\,(1 + e)}\,IMSE(\hat h_S)$$

The percent change in IMSE can be plotted as a function of the multiplying factor e, as shown in Fig. 3.3. This exercise demonstrates that rounding a binwidth up or down results in different repercussions in terms of accuracy.

Fig. 3.3 Percent change in IMSE versus multiplying fraction e in the binwidth example. The rounded binwidth $\hat h^* = (1 + e)\hat h_S$, where $\hat h_S$ is the estimated binwidth due to Scott (1979).

The increase in interpretability is difficult to measure explicitly. Certainly the histogram is easier to draw and describe, since the class boundaries are simpler. The class delineations are easier to remember; in fact, Ehrenberg (1981) shows that two digits other than zero are all that can be retained in short-term memory. In addition, the rounding removes confusion in explaining the histogram or comparing it to another. Finally, rounding in the actual observations $x_i$ may prompt rounding of the binwidth, since extra digits beyond the number of significant ones in the data lead to a false sense of accuracy.
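The percent change is cheap to tabulate; a minimal sketch using the relationship above:

```python
def imse_percent_change(e):
    """Percent increase in IMSE when h_S is rounded to (1 + e) h_S."""
    return 100.0 * (3.0 + e) * e ** 2 / (3.0 * (1.0 + e))

# e.g. rounding a binwidth up by about 9% (e = 0.09) costs well under 1% in IMSE
print(imse_percent_change(0.09))   # ~0.77
```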


How to measure the interpretability
Diaconis

determine.
interpretability

gain.

of a histogram

directly

is difficult

(1987) suggests conducting

experiments

A typical

be to divide a statistics

experiment

m ight

to quantify

into two matched groups and to present to either group respectively


histogram

and its simplified

version.

Measurements

to
the
class

a unrounded

of interpretability

could be

made on the basis of the correctness of answers to questions such as How would
-

adding the following

observations

change the shape of the histogram?

is the lower quartile ?. As interpretability

is increased, some information

or Where
may be

lost. For example, the question Where is the mode? may become unanswerable.

Page 97

5.5 A General Framework

This loss of information,

a more general measure of inaccuracy

could be measured along with interpretability.


be included

3.5.2

by asking data analysts their reaction

have changed statistics

hand, the resulting


and applications.

flexibility

expert opinion

could

to rounding.

principle

technique,

to monitor

pursuit.

modes of information

these computer
these abilities

of exploratory

undreamed

of possibilities

losing the ability

amount

of flexi-

computer-intensive

the statistician

and communicating

retains

her new

to achieve her basic

those results to others.

tools has come freedom.

provide

of abilities

The aim of this thesis is to use the

results in a particular

without

On the one

In the initial

the power with which to follow

stages of an

the basic tenets

data analysis (Tukey 1977), to let the data drive the analysis rather

than subjecting

it to preconceived

stages, the wealth


expanded

previously

In this manner,

translation

goal of clearly understanding

analysis,

and irreversibly.

the sacrifice of an acceptable

for more interpretable

projection

With

has cultivated

and unmanageable.

of parsimony

in return

substantially

On the other, the sheer number and complexity

can be both bewildering

assumptions

of models or descriptions

enormously.

using the bootstrap

Confirmation
(Efron

noted, the statistician


What

IMSE,

Conclusion

Computers

bility

In addition,

than

which may not be true.

In later

which can be fit and evaluated

of these models

can be readily

answered

procedures.

As Tukey

can be confirmed?

but rather

1979) or other resampling

no longer answers What

has

can be done?.

Computing
putational
statisticians

power is extending

and confirmational
imagination

and eliminating

boundaries.

are alleviated

Even unconscious
(McDonald

results of such an analysis can be complex,


important,

hard to explain.

doras box has been opened.

Though

previous mathematical,
restrictions

and Pedersen 1985).

hard to understand,

this progress is exciting,

Just as grappling

with

comon the
The

and even more


in a sense a Pan-

the theoretical

demons of

9. Connections and Conclusions

new methods

Page 98

such as projection

task, so too is considering


In order to understand
effectively,

a controlled

computing

power

which

pursuit

(Huber

the parsimonious
and communicate

1985) is a necessary and difficult

aspects.
the results of a statistical

use of these new methods


has produced

these

novel

to balance the search for an accurate and truthful


equally

important

such a balance.

desire for simplicity.

is helpful.
techniques

description

Interpretable

Fortunately,
provides

analysis
the very
a means

of the data with an

projection

pursuit

strikes

Appendix A

Gradients

A.1 Interpretable Exploratory Projection Pursuit Gradients

In this section, the gradients for the interpretable exploratory projection pursuit objective function are calculated. Since the search procedure is conducted in the sphered space, the desired gradients are

$$\frac{dF}{d\alpha_j} = \left(\frac{\partial F}{\partial \alpha_{j1}}, \frac{\partial F}{\partial \alpha_{j2}}, \ldots, \frac{\partial F}{\partial \alpha_{jp}}\right)^T \qquad j = 1, 2$$

From [A.1], the gradients may be written in vector notation as

$$\frac{\partial F}{\partial \alpha_j} = \frac{(1 - \lambda)}{\max G_F}\,\frac{\partial G_F}{\partial \alpha_j} + \lambda\,\frac{\partial S}{\partial \alpha_j} \qquad j = 1, 2$$

The Fourier projection index gradients are calculated from [1.23], yielding [A.2]. From the definition of the Laguerre functions [1.20],

$$\frac{\partial \mathcal{L}_i}{\partial R} = \frac{\partial L_i}{\partial R}\,e^{-\frac{1}{2}R} - \frac{1}{2}\,L_i\,e^{-\frac{1}{2}R} \qquad i = 0, 1, \ldots$$

with recursive equations derived from the definition of the Laguerre polynomials [1.21]:

$$\frac{\partial L_0}{\partial R} = 0, \qquad \frac{\partial L_1}{\partial R} = -1, \qquad \frac{\partial L_i}{\partial R} = \left(\frac{2i - 1 - R}{i}\right)\frac{\partial L_{i-1}}{\partial R} - \frac{1}{i}\,L_{i-1} - \left(\frac{i - 1}{i}\right)\frac{\partial L_{i-2}}{\partial R} \qquad i = 2, 3, \ldots$$

The gradients of the radius squared R and angle θ are calculated using the definition [1.20]. If $X_1 \equiv \alpha_1^T Z$ and $X_2 \equiv \alpha_2^T Z$, the gradients are

$$\frac{\partial R}{\partial \alpha_j} = (2 X_j)\,Z \qquad j = 1, 2 \qquad [A.3]$$

A.1 Interpretable

Exploratory

Projection Pursuit

Page 101

Gradients

In [A.3], note that 2 is a vector E RP, while X1 and X2 are scalars. As with the
calculation

of the value of the index, the expected

values are approximated

by

sample means over the data.


The gradients

of the simplicity

tion [1.27], th e orthogonally


mapping
written

translated

from the unsphered


in matrix

notation

index are calculated


component

&

using the index definidefinition

11.251, and the

to the sphered space [1.5]. The gradients

may be

as

[A*41

ml as,-a~;ap,

d012=bpiap2acv2 *
In [A.4], note that the partial
a p X p matrix.

derivative

of one vector with respect to another

is

For example,

ah = aplr
(-1
aQ, ts aals

r,s

= l,,..,

Using [1.5] yields


aPjr
-=aaj,

Urs
a

where U and D are the eigenvector

r,s = l,...,p

and eigenvalue matrices

defined in [1.3]. Using

[1.25] yields
a&
-

= - Plr (

P2aTPl>

ah,

MA

w;, _
Plr
---pTB,(BI,)
ap2,
The diagonal

elements are

WlsW2)
I2

r,s=l,...,

pandr#s

The two vector varimax simplicity index $S_1$ [1.12] may be written as a combination of the individual simplicities and a cross-term denoted as C, yielding

$$S(\beta_1, \beta_2) = \frac{1}{2p}\Big[(p - 1)\,S_1(\beta_1) + (p - 1)\,S_1(\beta_2)\Big] - C(\beta_1, \beta_2)$$

Taking partial derivatives with respect to $\beta_1$ yields

$$\frac{\partial S}{\partial \beta_{1r}} = \frac{2\beta_{1r}}{p}\left[(p - 1)\,S_1(\beta_1) - \frac{1}{p} + C(\beta_1, \beta_2)\right] \qquad r = 1, \ldots, p$$

The partial with respect to $\beta_2$ is identical, with $\beta_2$ components replacing $\beta_1$ components.

A.2 Interpretable Projection Pursuit Regression Gradients

In this section, the gradients for the interpretable projection pursuit regression objective function

$$F(\beta, \alpha, f, \lambda, Y) \equiv (1 - \lambda)\,L_2(\beta, \alpha, f; Y) - \lambda\,S_v(\alpha_1, \alpha_2, \ldots, \alpha_m) \qquad [A.5]$$

are calculated. The gradients used in the steepest descent search for directions $\alpha_j$ are

$$\frac{dF}{d\alpha_j} = \left(\frac{\partial F}{\partial \alpha_{j1}}, \frac{\partial F}{\partial \alpha_{j2}}, \ldots, \frac{\partial F}{\partial \alpha_{jp}}\right)^T \qquad j = 1, \ldots, m$$

From [A.5], the gradients may be written in vector notation as

$$\frac{\partial F}{\partial \alpha_j} = (1 - \lambda)\,\frac{\partial L_2}{\partial \alpha_j} - \lambda\,\frac{\partial S_v}{\partial \alpha_j} \qquad j = 1, \ldots, m$$

Friedman (1985) calculates the $L_2$ distance gradients as

$$\frac{\partial L_2}{\partial \alpha_j} = -2\,E\big[R_j - \beta_j f_j(\alpha_j^T X)\big]\,\beta_j\,f_j'(\alpha_j^T X)\,X \qquad j = 1, \ldots, m$$

The partials $f_j'$ of the curves $f_j$ are estimated from [2.7] using interpolation.

The gradients of the simplicity index are calculated using the index definition [2.13]. Taking partial derivatives yields

$$\frac{\partial S_v}{\partial \alpha_{ji}} = m\sum_{k=1}^{p}\left(\gamma_k - \frac{1}{p}\right)\frac{\partial \gamma_k}{\partial \alpha_{ji}} \qquad j = 1, \ldots, m \text{ and } i = 1, \ldots, p$$

The partials of the overall vector γ [2.12] are

$$\frac{\partial \gamma_k}{\partial \alpha_{ji}} = \frac{-2\,\alpha_{jk}^2\,\alpha_{ji}}{(\alpha_j^T\alpha_j)^2} \quad (i \neq k), \qquad \frac{\partial \gamma_k}{\partial \alpha_{jk}} = \frac{2\,\alpha_{jk}\,(\alpha_j^T\alpha_j - \alpha_{jk}^2)}{(\alpha_j^T\alpha_j)^2}$$

for j = 1, ..., m and i = 1, ..., p.
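These partials can be verified numerically; a minimal finite-difference check follows (helper names illustrative):

```python
import numpy as np

def gamma(alphas):
    """gamma_k = sum_j alpha_jk^2 / (alpha_j^T alpha_j), as in [2.12]."""
    return sum(a ** 2 / (a @ a) for a in alphas)

def gamma_grad(alphas, j, i):
    """Analytic partials of gamma with respect to alpha_{ji} (vector over k)."""
    a = alphas[j]
    n2 = a @ a
    g = -2.0 * a ** 2 * a[i] / n2 ** 2               # case i != k
    g[i] = 2.0 * a[i] * (n2 - a[i] ** 2) / n2 ** 2   # case i == k
    return g

# finite-difference check of the analytic formula
rng = np.random.default_rng(0)
alphas = [rng.normal(size=4) for _ in range(3)]
j, i, eps = 1, 2, 1e-6
ap = [a.copy() for a in alphas]; ap[j][i] += eps
am = [a.copy() for a in alphas]; am[j][i] -= eps
numeric = (gamma(ap) - gamma(am)) / (2 * eps)
assert np.allclose(numeric, gamma_grad(alphas, j, i), atol=1e-5)
```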

References

Akaike, H. (1974). A new look at the statistical model identification, IEEE Transactions on Automatic Control AC-19, 716-723.

Asimov, D. (1985). The Grand Tour: a tool for viewing multidimensional data, SIAM Journal of Scientific and Statistical Computing 6, 128-143.

Bellman, R. E. (1961). Adaptive Control Processes, Princeton University Press, Princeton.

Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion), Journal of the American Statistical Association 80, 580-619.

Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion), Annals of Statistics 17, 453-555.

Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983). Graphical Methods for Data Analysis, Wadsworth, Boston.

Copas, J. B. (1983). Regression, prediction and shrinkage (with discussion), Journal of the Royal Statistical Society, Series B 45, 311-354.

Dawes, R. M. (1979). The robust beauty of improper linear models in decision making, American Psychologist 34, 571-582.

Diaconis, P. (1987). Personal communication.

Diaconis, P. and Freedman, D. (1981). On the histogram as a density estimator: L2 theory, Zeitschrift fuer Wahrscheinlichkeitstheorie und Verwandte Gebiete 57, 453-476.

Diaconis, P. and Shahshahani, M. (1984). On nonlinear functions of linear combinations, SIAM Journal of Scientific and Statistical Computing 5, 175-191.

Donoho, D. L. and Johnstone, I. M. (1989). Projection-based approximation and a duality with kernel methods, Annals of Statistics 17, 58-106.

Draper, N. and Smith, H. (1981). Applied Regression Analysis, Wiley, New York.

Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Plans, CBMS 38, SIAM-NSF, Philadelphia.

Efron, B. (1988). Computer-intensive methods in statistical regression, SIAM Review 30, 421-449.

Ehrenberg, A. S. C. (1981). The problem of numeracy, The American Statistician 35, 67-71.

Friedman, J. H. (1984a). SMART user's guide, Technical Report LCS001, Department of Statistics, Stanford University.

Friedman, J. H. (1984b). A variable span smoother, Technical Report LCS005, Department of Statistics, Stanford University.

Friedman, J. H. (1985). Classification and multiple regression through projection pursuit, Technical Report LCS012, Department of Statistics, Stanford University.

Friedman, J. H. (1987). Exploratory projection pursuit, Journal of the American Statistical Association 82, 249-266.

Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression, Journal of the American Statistical Association 76, 817-823.

Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis, IEEE Transactions on Computers C-23, 881-889.

Gill, P., Murray, W. and Wright, M. H. (1981). Practical Optimization, Academic Press, London.

Gill, P. E., Murray, W., Saunders, M. A. and Wright, M. H. (1986). User's guide for NPSOL, Technical Report SOL 86-2, Department of Operations Research, Stanford University.

Good, I. J. (1968). Corroboration, explanation, evolving probability, simplicity and a sharpened razor, British Journal for the Philosophy of Science 19, 123-143.

Good, I. J. and Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities, Biometrika 58, 255-277.

Gorsuch, R. L. (1983). Factor Analysis, Lawrence Erlbaum Associates, New Jersey.

Hall, P. (1987). On polynomial-based projection indices for exploratory projection pursuit, Annals of Statistics 17, 589-605.

Harman, H. H. (1976). Modern Factor Analysis, The University of Chicago Press, Chicago.

Hastie, T. and Tibshirani, R. (1984). Generalized additive models, Technical Report LCS002, Department of Statistics, Stanford University.

Huber, P. (1985). Projection pursuit (with discussion), Annals of Statistics 13, 435-525.

Jones, M. C. (1983). The projection pursuit algorithm for exploratory data analysis, Ph.D. Dissertation, University of Bath.

Jones, M. C. and Sibson, R. (1987). What is projection pursuit? (with discussion), Journal of the Royal Statistical Society, Series A 150, 1-36.

Krzanowski, W. J. (1987). Selection of variables to preserve multivariate data structure, using principal components, Applied Statistics 36, 22-33.

Lubinsky, D. and Pregibon, D. (1988). Data analysis as search, Journal of Econometrics 38, 247-268.

Lundy, M. (1985). Applications of the annealing algorithm to combinatorial problems in statistics, Biometrika 72, 191-198.

Mallows, C. L. (1973). Some comments on Cp, Technometrics 15, 661-676.

Mallows, C. L. (1983). Data description, in Scientific Inference, Data Analysis, and Robustness, eds. G. E. P. Box, T. Leonard and C.-F. Wu, Academic Press, New York, 135-151.

Marshall, A. W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications, Academic Press, New York.

McCabe, G. P. (1984). Principal variables, Technometrics 26, 137-144.

McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models, Chapman and Hall, New York.

McDonald, J. A. (1982). Interactive graphics for data analysis, Ph.D. Dissertation, Department of Statistics, Stanford University.

McDonald, J. A. and Pedersen, J. (1985). Computing environments for data analysis part I: introduction, SIAM Journal of Scientific and Statistical Computing 6, 1004-1012.

Reinsch, C. H. (1967). Smoothing by spline functions, Numerische Mathematik 10, 177-183.

Rényi, A. (1961). On measures of entropy and information, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, ed. J. Neyman, University of California Press, Berkeley, 547-561.

Rissanen, J. (1987). Stochastic complexity (with discussion), Journal of the Royal Statistical Society, Series B 49, 223-239, 252-265.

Rosenkrantz, R. D. (1977). Inference, Method and Decision, Reidel, Boston.

Schwarz, G. (1978). Estimating the dimension of a model, Annals of Statistics 6, 461-464.

Scott, D. (1979). On optimal and data-based histograms, Biometrika 66, 605-610.

Silverman, B. W. (1984). Penalized maximum likelihood estimation, in Encyclopedia of Statistical Sciences, eds. S. Kotz and N. L. Johnson, Wiley, New York, 664-667.

Sober, E. (1975). Simplicity, Clarendon Press, Oxford.

Stone, C. J. (1981). Admissible selection of an accurate and parsimonious normal linear regression model, Annals of Statistics 9, 475-485.

Stone, C. J. (1982). Local asymptotic admissibility of a generalization of Akaike's model selection rule, Annals of the Institute of Statistical Mathematics 34, 123-133.

Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz, Journal of the Royal Statistical Society, Series B 41, 276-278.

Sun, J. (1989). P-values in projection pursuit, Ph.D. Dissertation, Department of Statistics, Stanford University.

Thisted, R. A. (1976). Ridge regression, minimax estimation and empirical Bayes methods, Ph.D. Dissertation, Department of Statistics, Stanford University.

Thurstone, L. L. (1935). The Vectors of the Mind, University of Chicago Press, Chicago.

Tukey, J. W. (1961). Discussion, emphasizing the connection between analysis of variance and spectrum analysis, Technometrics 3, 201-202.

Tukey, J. W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading, MA.

Tukey, J. W. (1983). Another look at the future, in Computer Science and Statistics: Proceedings of the 14th Symposium on the Interface, eds. K. Heiner, R. Sacher and J. Wilkinson, Springer-Verlag, New York, 2-8.

Tukey, P. A. and Tukey, J. W. (1981). Preparation; prechosen sequences of views, in Interpreting Multivariate Data, ed. V. Barnett, Wiley, New York, 189-213.

Watson, G. S. (1983). Statistics on Spheres, Wiley, New York.
