Quasi Regression and black boxes

Finding important variables and interactions in black boxes

Art B. Owen
Stanford University
owen@stat.stanford.edu

Tao Jiang
Stanford University
jiang@stat.stanford.edu

Theme

As dimension increases, many numerical problems become more statistical.

Because:

1. the sample is inevitably sparse,
2. error depends on the unsampled part of the space,
3. worst-case error bounds are inapplicable.
Example: integration

$I = \int_{(0,1)^d} f(x)\,dx$

Sampling Methods

1. Monte Carlo: $n^{-1/2}$
2. Quasi-Monte Carlo: $n^{-1}(\log n)^{d-1}$, but no practical error estimate
3. Randomized Quasi-Monte Carlo: replication based error estimates, and $n^{-3/2}(\log n)^{(d-1)/2}$

Rates are asymptotic under mild conditions on $f$.

Also statistical: approximation

Mortgage backed securities integrand

Paskov & Traub, Caflisch, Morokoff, & Owen

$Y$ = Present value of 30 years of monthly cash flows.

Prepayment:

1. puts lumps into payment stream
2. more common when interest rates are low

MBS Model (from Goldman-Sachs)

$Y = f(X)$
$X \sim U[0,1]^{360} \to Z = \Phi^{-1}(X)$
Interest rates: $r_1, \dots, r_{360} \sim$ Geometric Brownian motion driven by $Z$
Prepayment fraction: $A + B \arctan(C + D\,r_t)$


QMC super on MBS

But $Y = f(X)$ is 99.96% additive:
Latin hypercube sampling variance is about 0.04% of MC.

Also $Y = f(X)$ is 99.98% odd (antisymmetric):
antithetic sampling variance is about 0.02% of MC.

Additive and odd: was virtually linear in $Z$ upon further investigation.

Curse of dimensionality not broken by QMC; we just had an easy integrand.

QMC requires low “effective dimension” to trounce MC.

ANOVA of $L^2[0,1]^d$

Hoeffding, Efron & Stein, Sobol’

Main effects and $k$-factor interactions generalizing familiar discrete ANOVA

$f(x) = \sum_{u \subseteq \{1,2,\dots,d\}} f_u(x)$

• $f_u$ depends only on $x$-components in set $u$
• $f_\emptyset = \int f(x)\,dx$, the “grand mean”
• $\sigma^2(f) = \sum_{u \ne \emptyset} \int f_u(x)^2\,dx$
• $\int f_u(x) f_v(x)\,dx = 0$, $u \ne v$

$\frac{1}{n}\sum_{i=1}^n f(x_i) = \sum_u \frac{1}{n}\sum_{i=1}^n f_u(x_i)$

QMC $x_i$ very uniform in low dimensional projections.
Great for functions dominated by $f_u$ with small $|u|$.
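
The ANOVA decomposition can be made concrete on a toy case. The following is a minimal sketch, not taken from the talk: it tabulates the grand mean, the two main effects and the two-factor interaction of a hypothetical function $f(x_1, x_2) = x_1 + 2x_2 + x_1 x_2$ on a midpoint grid, assuming NumPy. The names `anova_2d` and `example` are illustrative only.

```python
import numpy as np

def anova_2d(f, m=200):
    """Tabulate the ANOVA terms of f on [0,1]^2 on an m x m midpoint grid."""
    t = (np.arange(m) + 0.5) / m                 # midpoint nodes in (0, 1)
    x1, x2 = np.meshgrid(t, t, indexing="ij")
    F = f(x1, x2)
    f0 = F.mean()                                # grand mean
    f1 = F.mean(axis=1) - f0                     # main effect of x1
    f2 = F.mean(axis=0) - f0                     # main effect of x2
    f12 = F - f0 - f1[:, None] - f2[None, :]     # two-factor interaction
    return F, f0, f1, f2, f12

def example(x1, x2):
    # Hypothetical test function, chosen only for illustration.
    return x1 + 2.0 * x2 + x1 * x2

F, f0, f1, f2, f12 = anova_2d(example)
v1, v2, v12 = np.mean(f1**2), np.mean(f2**2), np.mean(f12**2)
total = F.var()                                  # sigma^2(f) on the grid
print("additive fraction:", (v1 + v2) / total)
```

On the grid the three variance components add up to the total variance, and for this test function almost all of it sits in the main effects, which is the sense in which a function can be "nearly additive".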

Isotropic integrand

Capstick & Keister, Papageorgiou & Traub, Owen

$f(x) = \cos\Big(\sqrt{\tfrac{1}{2}\,\|\Phi^{-1}(x)\|^2}\,\Big)$, $x \sim U(0,1)^d$

$\int f(x)\,dx = E\cos\Big(\sqrt{\chi^2_{(d)}/2}\,\Big)$

closed form (Mathematica) aids comparison of methods

Varies equally in all directions

QMC does well

For $d = 25$ over 99% of variance from 1, 2, 3 dimensional ANOVA effects
... after numerical investigation exploiting symmetry and Gaussianity

Diaconis: closed form $\ne$ understanding

The borehole function

Morris, Mitchell, Ylvisaker

Flow from upper to lower aquifer:

$\dfrac{2\pi T_u\,[H_u - H_l]}{\log\!\left(\dfrac{r}{r_w}\right)\left[1 + \dfrac{2 L T_u}{\log(r/r_w)\, r_w^2 K_w} + \dfrac{T_u}{T_l}\right]}$

$r_w$, $r$      Radii of borehole and basin
$T_u$, $T_l$    Transmissivities, upper and lower
$H_u$, $H_l$    Potentiometric heads, upper and lower
$L$, $K_w$      Length and conductivity

Which variables are important?
Which interact?
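
As a sketch of how such a black box is evaluated in practice, the borehole flow formula can be written as a plain Python function. The argument names mirror the symbols on the slide; the numerical inputs in the usage lines are arbitrary illustrative values, not ranges given in the talk.

```python
import math

def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    """Flow from the upper to the lower aquifer through the borehole."""
    log_ratio = math.log(r / rw)
    bracket = 1.0 + 2.0 * L * Tu / (log_ratio * rw**2 * Kw) + Tu / Tl
    return 2.0 * math.pi * Tu * (Hu - Hl) / (log_ratio * bracket)

# Arbitrary illustrative inputs (assumptions, not values from the talk).
base = dict(rw=0.1, r=1000.0, Tu=80000.0, Hu=1050.0,
            Tl=100.0, Hl=760.0, L=1400.0, Kw=10000.0)
flow = borehole(**base)

# The flow is linear in the head difference Hu - Hl.
doubled = borehole(**{**base, "Hu": base["Hl"] + 2.0 * (base["Hu"] - base["Hl"])})
print(flow, doubled)
```

The function is cheap and has eight inputs, so questions such as "which variables are important?" can be explored by sampling it many times.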
Black box functions

$Y = f(X)$, without “+ e”

Examples          X                Y
Semiconductors    Device design    Speed, heat
Aerospace         Wing shape       Lift, drag
Automotive        Auto frame       Strength, weight
Statistics        Predictors       Responses

Used to design products. Cheaper than physical experiments. Costs from milliseconds to hours. Dimension from 3 to 300. Accuracy varies too.

Kriging widely used: Journel, Huijbregts, Sacks, Ylvisaker, Welch, Wynn, Mitchell

A small neural net

Venables, Ripley

Predict $\log_{10}(\text{perf})$ from the others

perf     published performance of computer
syct     cycle time in nanoseconds
mmin     minimum main memory in kilobytes
mmax     maximum main memory in kilobytes
cach     cache size in kilobytes
chmin    minimum number of channels
chmax    maximum number of channels

Function found by training on 209 examples.

Given $f(x)$ on $[0,1]^d$

How can we tell if $f$ is:

1. Nearly linear?
2. Nearly additive?
3. Nearly quadratic?
4. Has mostly 3 factor interactions or less?
5. Which variables matter most?
6. Which interactions matter most?

We would like:

1. a systematic approach
2. that also predicts $f$

The n-net function

$f(x) = 0.46 - 1.21x_1 + 1.36x_2 + 1.42x_3 - 1.01x_4 - 0.33x_5 + 0.30x_6$
$\quad - 2.82\,S(-1.12 + 0.45x_1 + 2.24x_2 + 2.51x_3 - 1.63x_4 - 0.56x_5 + 0.43x_6)$
$\quad + 3.17\,S(-1.09 + 2.28x_1 - 0.10x_2 + 1.44x_3 + 2.70x_4 + 1.24x_5 + 0.25x_6)$
$\quad + 0.39\,S(0.04 - 0.11x_1 + 0.11x_2 + 0.12x_3 - 0.10x_4 - 0.04x_5 + 0.02x_6)$

where $S$ is a sigmoidal function

$S(z) = [1 + \exp(-z)]^{-1}$
MC approximation

Let: $y = f(x) = \sum_{r \in U} \beta_r \psi_r(x) = \sum_{r \in R} \beta_r \psi_r(x) + \eta(x)$

Orthonormal basis $\psi_r$

$\eta(x)$ a deterministic truncation error

Estimate $\beta_r$ from $f(x_i)$ values, where $x_i \sim U(0,1)^d$, getting

$\tilde f(x) = \sum_{r \in R} \tilde\beta_r \psi_r(x)$

Apply graphical and numerical interpretation to $\tilde f$

Start with univariate basis functions

$\phi_0, \phi_1, \phi_2, \dots$

First is constant, all are orthonormal

$\phi_0(x) = 1$, $0 \le x \le 1$

$\int_0^1 \phi_j(x)\,dx = 0$, $j \ge 1$

$\int_0^1 \phi_j(x)\phi_k(x)\,dx = 1_{j=k}$

EG: orthogonal polynomials, sinusoids, wavelets, Hermite($\Phi^{-1}()$), Chebychev(qbeta())

Tensor product basis

$x = (x_1, x_2, \dots, x_d) \in [0,1]^d$
$r = (r(1), r(2), \dots, r(d)) \in \{0, 1, 2, \dots\}^d$
$\psi_r(x) = \prod_{j=1}^d \phi_{r(j)}(x_j)$

Finite subset of basis:

Rank$(r) \equiv \|r\|_0 = \sum_{j=1}^d 1_{r(j)>0} \le B_0$
Degree$(r) \equiv \|r\|_1 = \sum_{j=1}^d r(j) \le B_1$
Order$(r) \equiv \|r\|_\infty = \max_{1 \le j \le d} r(j) \le B_\infty$

Polynomials $\psi_r$, with ...

Rank $\|r\|_0 \le 3$, Degree $\|r\|_1 \le 4$, Order $\|r\|_\infty \le 3$.

Rank  Deg  Order  $\psi_r(x)$         #
0     0    0      Const               $1$
1     1    1      Linear              $d$
1     2    2      Quad                $d$
1     3    3      Cubic               $d$
2     2    1      Lin × Lin           $\binom{d}{2}$
2     3    2      Lin × Quad          $d(d-1)$
2     4    3      Lin × Cubic         $d(d-1)$
2     4    2      Quad × Quad         $\binom{d}{2}$
3     3    1      Lin × Lin × Lin     $\binom{d}{3}$
3     4    2      Lin × Lin × Quad    $3\binom{d}{3}$

$p = 1 + 3d + 3d(d-1) + (2/3)\,d(d-1)(d-2)$
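
The count $p$ can be checked by brute force. The sketch below (an illustration, not the authors' code) enumerates all multi-indices $r$ that satisfy given bounds on rank, degree and order, and compares the count with the closed form for the bounds 3, 4, 3. The function names are hypothetical.

```python
from itertools import product

def basis_indices(d, max_rank, max_degree, max_order):
    """All multi-indices r in {0..max_order}^d meeting the rank and degree bounds."""
    keep = []
    for r in product(range(max_order + 1), repeat=d):
        rank = sum(1 for rj in r if rj > 0)      # ||r||_0
        degree = sum(r)                          # ||r||_1
        if rank <= max_rank and degree <= max_degree:
            keep.append(r)
    return keep

def p_formula(d):
    """Closed form for rank <= 3, degree <= 4, order <= 3."""
    return 1 + 3 * d + 3 * d * (d - 1) + 2 * d * (d - 1) * (d - 2) // 3

for d in range(1, 7):
    print(d, len(basis_indices(d, 3, 4, 3)), p_formula(d))
```

The same enumeration with $d = 6$ and bounds 3, 8, 4 gives the 1145 basis functions quoted for the N-net example.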
Interpretation

Variance of $f$ is $\sum_{r \ne 0} \beta_r^2 + \int \eta(x)^2\,dx$

Importance of $S$ is $\sum_{r \in S} \beta_r^2$

Estimate by $\sum_{r \in S} \big(\tilde\beta_r^2 - \widehat{\mathrm{Var}}(\tilde\beta_r)\big)$

Subsets of interest include:

$\{r \mid r(1) > 0\}$                  involves $x_1$
$\{r \mid r(1) = 0\}$                  does not involve $x_1$
$\{r \mid \|r\|_0 = 1\}$               additive part
$\{r \mid 0 < \|r\|_0 \le k\}$         interactions up to order $k$
$\{r \mid 0 < \|r\|_1 \le k\}$         of degree at most $k$
$\{r \mid r(j) = 0,\ j > 3\}$          uses only first 3 inputs

Approximation through integration

Define: $Z(x) = (\psi_0(x), \dots, \psi_{p-1}(x))^T$

Optimal $\beta$ is

$\beta = \arg\min_\beta \int \big(f(x) - Z(x)^T\beta\big)^2\,dx = \Big(\int Z(x)Z(x)^T\,dx\Big)^{-1} \int Z(x) f(x)\,dx$

also,

$\mathrm{ISE} = \int \big(f(x) - Z(x)^T\tilde\beta\big)^2\,dx$

Regression and quasi-regression

$\beta = \Big(\int Z(x)Z(x)^T\,dx\Big)^{-1} \int Z(x) f(x)\,dx = \int Z(x) f(x)\,dx$ by orthogonality

Observations

$x_i \sim U[0,1]^d$, $1 \le i \le n$, IID

Regression

$\hat\beta = (Z^T Z)^{-1} Z^T Y$, where $Z$ is $n \times p$ and $Y$ is $n \times 1$

Quasi-Regression

$\tilde\beta = \frac{1}{n} Z^T Y$

Precursors of quasi-regression

Quasi-interpolation
Chui & Diamond, Wang
“Ignore the denominator” ($Z^T Z$) to get fast approximate interpolation.

Computer experiments
Koehler and Owen 1996 advocate quasi-regression for computer experiments
Efromovich 1992 applies qr to sinusoids on $[0,1]$.
Owen 1992 describes quasi-regression for Latin hypercube sampling
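
To contrast the two estimators, here is a small sketch under stated assumptions: a one-dimensional input, the first three orthonormal shifted Legendre polynomials as the basis, and a hypothetical black box `f` that lies exactly in their span with coefficients (1, 2, 0.5). None of these choices come from the talk; NumPy is assumed.

```python
import numpy as np

def legendre_basis(x):
    """First three orthonormal shifted Legendre polynomials on [0, 1]."""
    u = 2.0 * x - 1.0
    return np.column_stack([
        np.ones_like(x),
        np.sqrt(3.0) * u,
        np.sqrt(5.0) * (3.0 * u**2 - 1.0) / 2.0,
    ])

beta_true = np.array([1.0, 2.0, 0.5])

def f(x):
    # Hypothetical black box, exactly in the span of the basis.
    return legendre_basis(x) @ beta_true

rng = np.random.default_rng(0)
n = 20000
x = rng.random(n)                 # x_i ~ U[0, 1], IID
Z = legendre_basis(x)             # n x p matrix
Y = f(x)                          # n function values

beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # regression
beta_tilde = Z.T @ Y / n                           # quasi-regression
print(beta_hat, beta_tilde)
```

Regression recovers the coefficients exactly here because there is no truncation error, while quasi-regression carries Monte Carlo error of order $n^{-1/2}$. In exchange it needs only $O(np)$ work and no $p \times p$ matrix.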
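Accuracy in Monte Carlo sampling

Define:

$\varepsilon_i = Y_i - Z_i^T\beta$
$\delta_{p \times 1} = \frac{1}{n}\sum_{i=1}^n Z_i\,(Y_i - Z_i^T\beta)$
$A_{p \times p} = \frac{1}{n}\sum_{i=1}^n Z_i Z_i^T - I$

Now

$\tilde\beta = \frac{1}{n} Z^T Y = \frac{1}{n} Z^T (Z\beta + \varepsilon) = \beta + \delta + A\beta$

$\hat\beta = (Z^T Z)^{-1} Z^T (Z\beta + \varepsilon) = \beta + (Z^T Z)^{-1} Z^T \varepsilon = \beta + (I + A)^{-1}\delta = \beta + (I - A + A^2 - A^3 \cdots)\,\delta \doteq \beta + \delta - A\delta$

Fast stable updates

Define:

$\tilde\beta_r^{(n)} \equiv \frac{1}{n}\sum_{i=1}^n \psi_r(x_i) f(x_i)$
$S_r^{(n)} \equiv \frac{1}{n}\sum_{i=1}^n \big(\psi_r(x_i) f(x_i) - \tilde\beta_r^{(n)}\big)^2$

Then:

$\tilde\beta_r^{(n)} = \tilde\beta_r^{(n-1)} + \frac{1}{n}\Big[\psi_r(x_n) f(x_n) - \tilde\beta_r^{(n-1)}\Big]$
$S_r^{(n)} = \frac{n-1}{n}\, S_r^{(n-1)} + \frac{n-1}{n^2}\Big[\psi_r(x_n) f(x_n) - \tilde\beta_r^{(n-1)}\Big]^2$

Chan, Golub, LeVeque, who use $n S_r^{(n)}$

$E\Big(\frac{1}{n-1}\, S_r^{(n)}\Big) = \mathrm{Var}\big(\tilde\beta_r^{(n)}\big)$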
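
The one-pass updates can be sketched as follows. This is an illustration assuming that $S_r^{(n)}$ is the $1/n$-normalized sum of squared deviations, so that $n S_r^{(n)}$ is the running sum of squares; the random values stand in for $\psi_r(x_i) f(x_i)$ and the function name `update` is hypothetical.

```python
import numpy as np

def update(beta, S, n, y):
    """Fold the n-th value y into the running mean beta and variance S.

    beta and S summarize the first n - 1 values (S normalized by 1/(n-1)
    terms' count, i.e. a ddof=0 variance); the return values summarize n.
    """
    dev = y - beta
    beta_new = beta + dev / n
    S_new = (n - 1) / n * S + (n - 1) / n**2 * dev**2
    return beta_new, S_new

rng = np.random.default_rng(1)
y = rng.normal(size=1000)          # stand-in for psi_r(x_i) * f(x_i)

beta, S = 0.0, 0.0
for n, yn in enumerate(y, start=1):
    beta, S = update(beta, S, n, yn)

print(beta, S)
```

Only two numbers per basis function are stored, and no pass over earlier data is needed, which is what makes a very large basis feasible.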

Updatable accuracy estimates

Predict $f(x_n)$ by $\tilde f_{n-1}(x_n)$; $x_n$ is independent of $\tilde f_{n-1}$

Average recent squared errors

$\widehat{\mathrm{ISE}}(n_m) = \dfrac{1}{n_m - n_{m-1}} \sum_{i = n_{m-1}+1}^{n_m} \big(f(x_i) - \tilde f_{i-1}(x_i)\big)^2$

on subsequence $n_m = m(m+1)/2$

estimates avg ISE over recent $\approx \sqrt{2n}$ values

Diagnostic:

Large LOF and small $\sum_r \mathrm{Var}(\tilde\beta_r)$ $\Longrightarrow$ need bigger basis

Presented as lack-of-fit: $1 - R^2$

$\mathrm{LOF} = \dfrac{\mathrm{ISE}}{\mathrm{Var}}$, $\quad \widehat{\mathrm{LOF}} = \dfrac{\mathrm{AVG}(f - \tilde f)^2}{\mathrm{AVG}(f - \tilde\beta_0)^2}$

$\log_{10}(\mathrm{LOF})$    $R^2$
$-4$                 99.99%
$-3$                 99.9%
$-2$                 99%
$-1$                 90%
$0$                  0%
$1$                  $-900$%
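
The correspondence between lack-of-fit and $R^2$ is plain arithmetic, $R^2 = 1 - \mathrm{LOF}$. A tiny sketch (the function name is illustrative):

```python
def r_squared(log10_lof):
    """R^2 implied by a given log10(LOF), using LOF = 1 - R^2."""
    return 1.0 - 10.0 ** log10_lof

for k in (-4, -3, -2, -1, 0, 1):
    print(f"log10(LOF) = {k:+d}  ->  R^2 = {100 * r_squared(k):.2f}%")
```

A LOF above 1 means the approximation is worse than predicting the constant $\tilde\beta_0$, which is why $R^2$ can be negative.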
Costs of algebra

                    Time              Space             Time × Space
Kriging             $O(n^3 + p^3)$    $O(n^2 + p^2)$    $O(n^5 + p^5)$
Regression          $O(np^2)$         $O(p^2)$          $O(np^4)$
Quasi-regression    $O(np)$           $O(p)$            $O(np^2)$

Quasi-reg allows larger $n$ or much larger $p$
$p = 1{,}000{,}000$ doable by quasi-reg., not by reg. (Owen, Ann Stat 2000)

Footprint

                    Cost of f: Low        Cost of f: High
Dimension: High     (Quasi-)regression    [good luck]
Dimension: Low      Easy                  Kriging

Incorporating shrinkage

Hoerl, Kennard, Efromovich, Donoho, Johnstone, Beran ...

$\tilde f_{\lambda,n}(x) = \sum_r \lambda_{r,n}\,\tilde\beta_{r,n}\,\psi_r(x)$, $\quad \lambda_{r,n} \in [0,1]$

Optimally

$\lambda_{r,n} = \dfrac{\beta_r^2}{\beta_r^2 + \mathrm{Var}(\tilde\beta_{r,n})}$

Shrinkage can reduce prediction variance.

We use data to estimate $\lambda_{r,n}$, e.g.

$\hat\lambda_{r,n} = \dfrac{\big(\tilde\beta_r^{(n-1)}\big)^2}{\big(\tilde\beta_r^{(n-1)}\big)^2 + S_r^{(n-1)}}$
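
A plug-in shrinkage factor of the form $\tilde\beta^2 / (\tilde\beta^2 + \text{variance estimate})$ can be sketched in a few lines. The function name and the guard for a zero denominator are assumptions added for the illustration.

```python
def shrinkage_factor(beta_tilde, var_est):
    """Plug-in estimate of lambda = beta^2 / (beta^2 + Var), in [0, 1]."""
    denom = beta_tilde**2 + var_est
    if denom <= 0.0:
        return 0.0                   # no signal and no variance estimate yet
    return beta_tilde**2 / denom

# A well-determined coefficient is barely shrunk; a noisy one is shrunk hard.
print(shrinkage_factor(2.0, 0.01))
print(shrinkage_factor(0.05, 0.01))
```

Coefficients that are small relative to their sampling variance are pulled toward zero, which limits the variance that thousands of nearly empty basis functions would otherwise add to the prediction.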

Exploiting residuals

For $r \ne 0$: $\beta_r(f) = \beta_r(f - c)$, for $c \in \mathbb{R}$

$\mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^n \psi_r(x_i)\,(f(x_i) - c)\Big)$ depends on $c$

Try $c \approx \beta_0$

More generally

$\tilde\beta_r^{(n)} = \frac{1}{n}\sum_{i=1}^n \psi_r(x_i)\Big(f(x_i) - \sum_{s \ne r} \lambda_{s,i-1}\,\tilde\beta_s^{(i-1)}\,\psi_s(x_i)\Big)$

Original quasi-reg: $\lambda_{r,i} = 0$ or $1_{r=0}$; $\quad \lambda_{r,i} = 1_{r \in R}$
Self-consistent quasi-reg: $\lambda_{r,i} = \hat\lambda_{r,i} \in [0,1]$

Bounding $\int \tilde f^2$ by sample variance ... eliminates explosive feedback

$\tilde\beta_r$ and $S_r$ still updatable

NB: $n\big(\tilde\beta_r^{(n)} - \beta_r\big)$ is a martingale in $n$

N-net example

$f(x)$ is prediction of $\log_{10}(\text{perf})$

$d = 6$

$\phi_r$ are Legendre polynomials
$\psi_r$ are tensor products

$\|r\|_0 \le 3$, $\|r\|_1 \le 8$, $\|r\|_\infty \le 4$ $\Longrightarrow$ $p = 1145$

Net is fast, so $n = 500{,}000$
(about 3 min on an 800 MHz PC in Java)
Neural net accuracy

[Figure: LOF (log scale, 10^-3 to 10^3) against sample size (100 to 100000).]

Number of bases is 1145
Shrinkage applied after n=600 (lower curve)

Neural net results

###### Anova at Iteration 500000 ######
1-RSquare (LOF) is 0.0011707 at iteration 499500
Beta[0] (constant factor) is 2.0717
Sample mean is 2.0719, sample variance is 0.14359
Unbiased estimates of dimension variances
0.11441 0.026592 0.0027723 0.0 0.0 0.0
Dimension Probabilities
(Ratios of dimension variances to sample variance)
0.79676 0.18518 0.019307 0.0 0.0 0.0

Neural net results, ctd

Variances on one and two variables / sample variance

         syct          mmin          mmax          cach          chmin
syct     0.5177106
mmin     9.292114E-4   0.01069175
mmax     0.008898125   0.02590950    0.08782891
cach     0.05507833    0.006469443   0.05429608    0.1301971
chmin    0.01091619    6.212815E-4   0.008541468   0.01008703    0.03679156
delch    2.480628E-4   4.889575E-4   2.725553E-4   0.001473632   2.348261E-4

Biggest main effect: syct is 52%
Biggest interaction syct × cach is 5.5%

Results

$f(x) = 0.46 - 1.21x_1 + 1.36x_2 + 1.42x_3 - 1.01x_4 - 0.33x_5 + 0.30x_6$
$\quad - 2.82\,S(-1.12 + 0.45x_1 + 2.24x_2 + 2.51x_3 - 1.63x_4 - 0.56x_5 + 0.43x_6)$
$\quad + 3.17\,S(-1.09 + 2.28x_1 - 0.10x_2 + 1.44x_3 + 2.70x_4 + 1.24x_5 + 0.25x_6)$
$\quad + 0.39\,S(0.04 - 0.11x_1 + 0.11x_2 + 0.12x_3 - 0.10x_4 - 0.04x_5 + 0.02x_6)$

Additive component of $\tilde f$

Var    syct     mmin     mmax     cach     chmin    delch    total
%      0.520    0.011    0.088    0.131    0.037    0.009    0.797
Effect of cycle time

[Figure: effect of cycle time, plotted over [0, 1].]

Degree              1         2         3          4
Coef                -0.272    -0.030    0.00242    .0000777
% of $\tilde f$     51.38     0.630     0.00041    0.000004

Caveats

• Important variables in $E(Y \mid X = x)$ are not necessarily causal
• Same for $f(x) = \hat E(Y \mid X = x)$ and $\tilde f$
• Training $x$ not from a product measure (nor are test $x$)

Non-product measure issues

False positives: $f$, $\tilde f$ might have large structure in a region with no data

False negatives: $\int (\tilde f - f)^2\,dx$ might be dominated by $x$ away from data. Small error and simple model might mask poor fit in training region. (Easy to compare $f$ and $\tilde f$ on training data.)

Functions $\psi_r$ and estimated anova components are correlated on the empirical distribution

Using product of empirical margins mitigates problem (only slightly)

CPU inputs

[Figure: scatterplot matrix of the six inputs syct, mmin, mmax, cach, chmin, chdel.]

Biggest interaction

Cycle time × Cache Size: 5.5% of $\tilde f$

[Figure: plot of the cycle time × cache size interaction.]


Biggest interaction

[Figure: Cycle Time x Cache Size Interaction, plotted over syct (horizontal axis) and cach (vertical axis), both on [0, 1].]

$\|r\|_0 \le 3$, $\|r\|_1 \le 8$, $\|r\|_\infty \le 4$

2nd biggest interaction

Cycle time × Main Memory Max: 5.4% of $\tilde f$

[Figure: plot of the cycle time × maximum main memory interaction.]

2nd biggest interaction

[Figure: Cycle Time x Max Main Memory Interaction, plotted over syct (horizontal axis) and mmax (vertical axis), both on [0, 1].]

$\|r\|_0 \le 3$, $\|r\|_1 \le 8$, $\|r\|_\infty \le 4$

N-net conclusions

1. $\tilde f$ a fairly simple function wrt $U[0,1]^6$
2. $x_1$ most important, and nearly linear
3. At least one interaction not supported by data
4. Non-random cross-validation (leave out clusters) might help


Next directions

1. MARS-like dynamic choice of basis
2. Comparisons of $f$ and $\tilde f$ on training data
3. Decompositions of $\tilde f$ under empirical measures
4. Distinguishing $f$ structure from $\tilde f$ artifacts
5. More types of statistical/ML black boxes
6. Missing data (arise in function mining too)
7. Stopping rules
8. More basis function choices
9. Block diagonal or banded $E(Z^T Z)$ (EG B-splines)
10. Examples with noise (unusable basis fns)

Robot arm function

Robot arm has 4 joints: Lengths $L_j$, angles $\theta_j$

Shoulder at $(0, 0)$, hand at $(u, v)$:

$u = \sum_{j=1}^4 L_j \cos\Big(\sum_{k=1}^j \theta_k\Big)$, $\quad v = \sum_{j=1}^4 L_j \sin\Big(\sum_{k=1}^j \theta_k\Big)$

$f$ is shoulder to hand distance

$f(L_1, L_2, L_3, L_4, \theta_1, \theta_2, \theta_3, \theta_4) = \sqrt{u^2 + v^2}$

$0 \le L_j \le 1$, $\quad 0 \le \theta_j \le 2\pi$
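
The robot arm function is simple to code, which makes it a convenient test black box. A minimal sketch, with a hypothetical function name and argument layout: each segment's direction is the running sum of the joint angles, and the output is the distance from the shoulder at the origin to the hand.

```python
import math

def robot_arm(lengths, angles):
    """Shoulder-to-hand distance for segment lengths L_j and joint angles theta_j."""
    u = v = 0.0
    heading = 0.0
    for L, theta in zip(lengths, angles):
        heading += theta                 # cumulative angle theta_1 + ... + theta_j
        u += L * math.cos(heading)
        v += L * math.sin(heading)
    return math.hypot(u, v)

# Fully stretched arm: the distance is the sum of the lengths.
print(robot_arm([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
# Second segment folded back onto the first, remaining segments of length zero.
print(robot_arm([1.0, 1.0, 0.0, 0.0], [0.0, math.pi, 0.0, 0.0]))
```

With eight inputs and strong interactions between angles, this function is a natural candidate for the kind of variable and interaction screening described in the talk.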