Computers & Operations Research 34 (2007) 70–90

www.elsevier.com/locate/cor
Neural network and regression spline value function
approximations for stochastic dynamic programming
Cristiano Cervellera
a
, Aihong Wen
b
, Victoria C.P. Chen
b,∗,1
a
Institute of Intelligent Systems for Automation-ISSIA-CNR, National Research Council of Italy,
Via De Marini 6, 16149 Genova, Italy
b
Department of Industrial & Manufacturing Systems Engineering, The University of Texas at Arlington,
Campus Box 19017, Arlington, TX 76019-0017, USA
Available online 30 June 2005
Abstract
Dynamic programming is a multi-stage optimization method that is applicable to many problems in engineering.
A statistical perspective of value function approximation in high-dimensional, continuous-state stochastic dynamic
programming (SDP) was first presented using orthogonal array (OA) experimental designs and multivariate adap-
tive regression splines (MARS). Given the popularity of artificial neural networks (ANNs) for high-dimensional
modeling in engineering, this paper presents an implementation of ANNs as an alternative to MARS. Comparisons
consider the differences in methodological objectives, computational complexity, model accuracy, and numerical
SDP solutions. Two applications are presented: a nine-dimensional inventory forecasting problem and an eight-
dimensional water reservoir problem. Both OAs and OA-based Latin hypercube experimental designs are explored,
and OA space-filling quality is considered.
᭧ 2005 Elsevier Ltd. All rights reserved.
Keywords: Design of experiments; Statistical modeling; Markov decision process; Orthogonal array; Latin hypercube;
Inventory forecasting; Water reservoir management

Corresponding author. Tel.: +1 817 272 2342; fax: +1 817 272 3406.
E-mail addresses: cervellera@ge.issia.cnr.it (C.Cervellera), axw0932@omega.uta.edu (A.Wen),
vchen@uta.edu (V.C.P. Chen).
1
Supported by NSF Grants #INT 0098009 and #DMI 0100123 and a Technology for Sustainable Environment (TSE) grant
under the US Environmental Protection Agency’s Science to Achieve Results (STAR) program (Contract #R-82820701-0).
0305-0548/$ - see front matter ᭧ 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cor.2005.02.043
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 71
1. Introduction
The objective of dynamic programming (DP) is to minimize a “cost” subject to certain constraints over
several stages, where the relevant “cost” is defined for a specific problem. For an inventory problem, the
“cost” is the actual cost involved in holding inventory, having to backorder items, etc., and the constraints
involve capacities for holding and ordering items. An equivalent objective is to maximize a “benefit”,
such as the robustness of a wastewater treatment system against extreme conditions. The state variables
track the state of the systemas it moves through the stages, and a decision is made in each stage to achieve
the objective.
Recursive properties of the DP formulation permit a solution via the fundamental recurrence equation
[1]. In stochastic dynamic programming (SDP), uncertainty is modeled in the formof randomrealizations
of stochastic system variables, and the estimated expected “cost” (“benefit”) is minimized (maximized).
In particular, Markov decision processes may be modeled by a SDP formulation through which the state
variables resulting fromeach decision comprise a Markov process. DP has been applied to many problems
in engineering, such as water resources, revenue management, pest control, and wastewater treatment.
For general background on DP, see Bellman [1], Puterman [2] and Bertsekas [3]. For some applications,
see Gal [4], Shoemaker [5], White [6,7], Foufoula-Georgiou and Kitanidis [8], Culver and Shoemaker
[9], Chen et al. [10], and Tsai et al. [11]. In high dimensions, traditional DP solution methods can become
computationally intractable and attempts have been made to reduce the computational burden [12–14].
For continuous-state SDP, the statistical perspective of Chen et al. [15] and Chen [16] enabled the first
truly practical numerical solution approach for high dimensions. Their method utilized orthogonal array
(OA, [17]) experimental designs and multivariate adaptive regression splines (MARS, [18]). Although
MARS has been applied in several areas (e.g., [19–22]), artificial neural network (ANN) models are much
more prevalent in all areas of engineering (e.g., [23–30]). Given their popularity, an ANN-based SDP
solution method would be more accessible to engineers. In this paper, we present:
• The statistical modeling approach to solving continuous-state SDP.
• Implementation of ANN models in continuous-state SDP as an alternative to MARS.
• Implementation of OA-based Latin hypercube (OA–LH) designs [31] as an alternative to pure OAs.
• Consideration of the “space-filling” quality of generated experimental designs. The designs are in-
tended to distribute discretization points so that they “fill” the state space; however, generated designs
of the same type do not necessarily have equivalent space-filling quality.
• Comparisons between ANN and MARS value function approximations, including methodological
perspectives, computational considerations, and numerical SDP solution quality on two applications.
The novelty of our study is to simultaneously consider how the SDP solution is affected by different
approximation methods and designs of different types and different space-filling qualities.
In the statistical perspective of continuous-state SDP, the two basic requirements are an experimental
design to discretize the continuous state space and a modeling method to approximate the value function
An appropriate design should have good space-filling quality, and an appropriate modeling method should
be flexible in high dimensions, provide a smooth approximation, and be reasonably efficient. A review
of several experimental designs and modeling methods is provided by Chen et al. [32] in the context of
computer experiments. Of the possible modeling choices given in this review paper, only MARS, neural
networks, and kriging satisfy the first two criteria. Some preliminary exploration of the kriging model
72 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90
yielded positive results on a ten-dimensional wastewater treatment application [33], but solving for the
maximum likelihood estimates of the kriging parameters continues to be a computational challenge. For
this application, kriging required 40 times more computational effort than MARS. In addition, kriging
software is even harder to obtain than MARS software, which is counter to our motivation to employ
ANNs.
Comparisons betweenMARSandANNs for empirical modelinghave beenconductedbyDeVeauxet al.
[34] and Pham and Peat [35], and in both studies MARS performed better overall. Although it is obvious
that ANN models can be employed to approximate functions, the existing literature does not provide
sufficient studies to predict how they will compare to MARS in the SDP setting. In continuous-state SDP,
value functions are typically well-behaved (e.g., convex) and the SDP data involve little noise. Previous
comparisons involved statistical data with considerable noise. It is expected that ANNmodels will perform
better in low noise situations. However, ANN models can add extraneous curvature, a disadvantage
when a smooth convex approximation is desired. MARS, on the other hand, constructs a parsimonious
approximation, but its limited structure, although quite flexible, could affect its approximation ability,
which would be most noticeable in low noise situations. Finally, computational speed is an important
aspect of SDP, and the ANN model fitting process has a reputation of being slow.
Studies on different experimental designs for metamodeling have been conducted [36–38]. However,
none have considered OA–LH designs. Given the success of OA designs in continuous-state SDP, we
chose the OA–LH design as a variation that might work well with ANN models. Points of an OA design
are limited to a fairly sparse grid, while the OA–LH design allows points to move off the grid. The grid
restriction was convenient for the MARS structure, but provides no advantage for ANN models.
Section 2 describes SDP in the context of the two applications that are later used to test our methods.
The SDP statistical modeling approach is given in Section 3. Section 4 provides background on OA and
OA–LH designs. Section 5 describes ANNs, and Section 6 presents the comparisons between the ANNs
and MARS, including the implementation of OA–LH designs. Conclusions are presented in Section 7.
2. SDP applications
In continuous-state DP, the systemstate variables are continuous. Chen et al. [15] and Chen [16] solved
inventory forecasting test problems with four-, six-, and nine-state variables modeled over three stages.
Application to ten- and seven-dimensional water reservoir SDP problems appears in Baglietto et al. [39]
and Baglietto et al. [40], and application to a 20-dimensional wastewater treatment systemappears in Tsai
et al. [11]. Here, we consider the nine-dimensional inventory forecasting problem, which is known to
have significant interaction effects between inventory levels and forecasts due to the capacity constraints,
and an eight-dimensional water reservoir problem, which has simpler structure and is more typical of
continuous-state SDP problems.
2.1. Inventory forecasting problem
For an inventory problem, the inventory levels are inherently discrete since items are counted, but the
range of inventory levels for an item is too large to be practical for a discrete DP solution. Thus, one can
relax the discreteness and use a continuous range of inventory levels. The inventory forecasting problem
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 73
uses forecasts of customer demand and a probability distribution on the forecasting updates to model the
demand more accurately than traditional inventory models (see [41]).
2.1.1. State and decision variables
The state variables for inventory forecasting consist of the inventory levels and demand forecasts for
each item. For our nine-dimensional problem, we have three items and two forecasts for each. Suppose
we are currently at the beginning of stage t. Let I
(i)
t
be the inventory level for item i at the beginning of
the current stage. Let D
(i)
(t,t )
be the forecast made at the beginning of the current stage for the demand of
item i occurring over the stage. Similarly, let D
(i)
(t,t +1)
be the forecast made at the beginning of the current
stage for the demand of item i occurring over the next stage (t +1). Then the state vector at the beginning
of stage t is
x
t
=

I
(1)
t
, I
(2)
t
, I
(3)
t
, D
(1)
(t,t )
, D
(2)
(t,t )
, D
(3)
(t,t )
, D
(1)
(t,t +1)
, D
(2)
(t,t +1)
, D
(3)
(t,t +1)

T
.
The decision variables control the order quantities. Let u
(i)
t
be the amount of item i ordered at the
beginning of stage t. Then the decision vector is u
t
=

u
(1)
t
, u
(2)
t
, u
(3)
t

T
, where u
(i)
t
is the amount of item
i ordered in stage t. For simplicity, it is assumed that orders arrive instantly.
2.1.2. Transition function
The transition function for stage t specifies the new state vector to be x
t +1
=f
t
(x
t
, u
t
,
t
), where x
t
is
the state of the system(inventory levels and forecasts) at the beginning of stage t, u
t
is the amount ordered
in stage t, and
t
is the stochasticity associated with updates in the demand forecasts. The evolution of
these forecasts over time is modeled using the multiplicative Martingale model of forecast evolution, for
which the stochastic updates are applied multiplicatively and assumed to have a mean of one, so that the
sequence of future forecasts for stage t + k forms a martingale [42]. See Chen [16] for details on the
stochastic modeling.
2.1.3. Objectives and constraints
The constraints for inventory forecasting are placed on the amount ordered, in the form of capacity
constraints, and on the state variables, in the formof state space bounds. For this test problem, the capacity
constraints are chosen to be restrictive, forcing interactions between the state variables, but are otherwise
arbitrary. The bounds placed on the state variables in each stage are necessary to ensure accurate modeling
over the state space.
The objective function is a cost function involving inventory holding costs and backorder costs. The
typical V-shaped cost function common in inventory modeling is
c
v
(x
t
, u
t
) =
n
I
¸
i=1

h
i
[I
(i)
t +1
]
+
+ ¬
i
[−I
(i)
t +1
]
+

, (1)
where [q]
+
=max{0, q}, h
i
is the holding cost parameter for itemi, and ¬
i
is the backorder cost parameter
for item i. The smoothed version of Eq. (1), described in Chen [16], is employed here.
74 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90
Fig. 1. Reservior network.
2.2. Water reservoir problem
Optimal operation of water reservoir networks has been studied extensively in the literature (e.g.,
Archibald et al. [43], Foufoula-Georgiou and Kitanidis [8], Gal [4], Lamond and Sobel [44], Turgeon
[45] and Yalpwotz [46] provides an excellent survey). The water reservoir network we consider consists
of four reservoirs (see Fig. 1 for the configuration). In typical models, the inputs to the nodes are water
released from upstream reservoirs and stochastic inflows from external sources (e.g., rivers and rain),
while the outputs correspond to the amount of water to be released during a given time stage (e.g., a
month).
2.2.1. State and decision variables
Ina realistic representationof inflowdynamics, the amount of (random) water flowingintothe reservoirs
from external sources during a given stage t depends on the amounts of past stages. In this paper, we
follow the realistic modeling of Gal [4] and use an autoregressive linear model of order one. Thus, the
ith component of the random vector
t
, representing the stochastic external flow into reservoir i, has the
form
c
(i)
t
= a
(i)
t
c
(i)
t −1
+ b
(i)
t
+ c
(i)
t
.
(i)
t
, (2)
where
t
is a random correction that follows a standard normal distribution. The coefficients of the
linear combinations (a
t
, b
t
, c
t
) actually depend on t, so that it is possible to model proper external inflow
behaviors for different months. The actual values of the coefficients used for the model were based on
the one-reservoir real model described in Gal [4] and extended to our four-reservoir network.
Given Eq. (2), the state vector consists of the water level and stochastic external inflow for each
reservoir:
x
t
=

w
(1)
t
, w
(2)
t
, w
(3)
t
, w
(4)
t
, c
(1)
t
, c
(2)
t
, c
(3)
t
, c
(4)
t

T
,
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 75
where w
(i)
t
is the amount of water in reservoir i (i = 1, 2, 3, 4) at the beginning of stage t, and c
(i)
t
the
stochastic external inflowinto reservoir i during stage t. The decision vector is u
t
=

u
(1)
t
, u
(2)
t
, u
(3)
t
, u
(4)
t

T
,
where u
(i)
t
be the amount of water released from reservoir i during stage t.
2.2.2. Transition function
The transition function for stage t updates the water levels and external inflows. The amount of water
at stage t + 1 reflects the flow balance between the water that enters (upstream releases and stochastic
external inflows) and the water that is released. In order to deal with unexpected peaks from stochastic
external inflows, each reservoir has a floodway, so that the amount of water never exceeds the maximum
value W
(i)
. Thus, the new water level is
w
(i)
t +1
= min



w
(i)
t
+
¸
j∈U
(i)
u
(j)
t
− u
(i)
t
+ c
(i)
t
, W
(i)



,
where U
(i)
the set of indexes corresponding to the reservoirs which release water into reservoir i. For the
stochastic external inflows, we can update them using Eq. (2). Like the inventory forecasting model, the
two equations above can be grouped in the compact transition function x
t +1
= f
t
(x
t
, u
t
,
t
).
2.2.3. Objectives and constraints
The constraints for water reservoir management are placed on the water releases u
(i)
t
in the form:
0u
(i)
t
min



w
(i)
t
+
¸
j∈U
i
u
(j)
t
, R
(i)



,
where R
(i)
the maximum pumpage capability for reservoir i. This implies nonnegativity of w
i
t +1
. The ob-
jective function is intended to maintain a reservoir’s water level close to a target value ˜ w
(i)
t
and maximize
benefit represented by a concave function g of the water releases and/or water levels (e.g., power genera-
tion, irrigation, etc.). For maintaining targets, the intuitive cost would be |w
(i)
t
− ˜ w
(i)
t
|, which is aV-shaped
function. To facilitate numerical minimization, we utilized a smoothed version, denoted by l

w
(i)
t
, ˜ w
(i)
t

,
that is analogous to the smoothed cost function for the inventory forecasting problem. For the modeling
benefits, we consider a nonlinear concave function g of the following form (for z ∈ [0, +∞)):
g(z, j, [) =













([ + 1)z, 0z
1
3
j,
[z +
1
2
j −
45
j
2

z −
2
3
j

3

153
2j
2

z −
2
3
j

4
,
1
3
jz
2
3
j,
[z +
1
2
j, z
2
3
j.
Such a function models a benefit which becomes relevant for “large” values of the water releases, de-
pending on suitable parameters j and [. In our example, we used R
(i)
for j and a small number 0.1
76 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90
for [. Then, the total cost function at stage t can be written as
c
t
(x
t
, u
t
,
t
) =
4
¸
i=1
l

w
(i)
t
, ˜ w
(i)
t


4
¸
i=1
p
(i)
g

u
(i)
t
, j
(i)
, [

,
where the p
(i)
, i = 1, 2, 3, 4 are weights chosen to balance the benefits against the costs.
3. SDP statistical modeling solution approach
3.1. Future value function
For a given stage, the future value function provides the optimal cost to operate the system from stage t
through T as a function of the state vector x
t
. Following the notation in Section 2, this is written recursively
as
F
t
(x
t
) = min
u
t
E{c(x
t
, u
t
,
t
) + F
t +1
(x
t +1
)}
s.t. x
t +1
= f
t
(x
t
, u
t
,
t
),
(x
t
, u
t
) ∈ I
t
. (3)
If we have the future value functions for all the stages, then given the initial state of the system, we
can track the optimal solution through the value functions. Thus, given a finite time horizon T, the basic
SDP problem is to find F
1
(·), . . . , F
T
(·). The recursive formulation permits the DP backward solution
algorithm [1], first by assuming F
T +1
(x
T +1
) ≡ 0, so that F
T
(x
T
) =c
T
(x
T
) (where c
T
(x
T
) is the cost of
the final stage depending only on x
T
), then solving in order for F
T
(·) through F
1
(·). See the description
in Chen [16] or Chen et al. [47] for further details.
3.2. Numerical solution algorithm
Given the state of the systemat the beginning of the current stage, the corresponding optimal cost yields
one point on that stage’s value function. The complete future value function is obtained by solving this
optimization problem for all possible beginning states. For continuous states, this is an infinite process.
Thus, except in simple cases where an analytical solution is possible, a numerical SDP solution requires
a computationally tractable approximation method for the future value function.
Froma statistical perspective, let the state vector be the vector of predictor variables. We seek to estimate
the unknown relationship between the state vector and the future value function. Response observations
for the future value function may be collected via the minimization in Eq. (3). Thus, the basic procedure
uses an experimental design to choose Npoints in the n-dimensional state space so that one may accurately
approximate the future value function; then constructs an approximation.
Chen et al. [15] introduced this statistical perspective using strength three OAs and MARS (smoothed
with quintic functions) applied to the inventory forecasting test problems. MARS was chosen due to its
ease of implementation in high dimensions, ability to identify important interactions, and easy derivative
calculations. The choice of the strength three OAs was motivated by the desire to model interactions,
but univariate modeling can suffer due to the small number of distinct levels in each dimension (e.g.,
p = 13 was the highest number of levels explored). Chen [16] additionally explored strength two OAs
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 77
with equivalent N to the strength three OAs, but with lower order balance permitting higher p; however,
their performance was disappointing.
4. Discretizations based on OAs
In continuous-state SDP, the classical solution method discretizes the state space with a full grid,
then uses multi-linear interpolation to estimate the future value function. This full grid is equivalent to a
full factorial experimental design, for which the number of points grows exponentially as the dimension
increases. To avoid this, a fractional factorial design, i.e., taking a subset of the full grid, may be employed.
OAs are special forms of fractional factorial designs. Afull factorial design for n variables, each at p levels,
contains all p
n
level combinations. An OA of strength d (d <n) for n variables, each at p levels, contains
all possible level combinations in any subset of d dimensions, with the same frequency, z. Therefore, when
projected down onto any d dimensions, a full factorial grid of p
d
points replicated z times is represented.
4.1. Measuring space-filling quality
There are a variety of ways to generate OAs (e.g., [48–53]). However, there is no way to guarantee their
space-filling quality. In practice, one can generate many OAs, calculate various space-filling measures,
then choose the best of the bunch. Chen [17] has conducted the only study of how this quality affects an
approximation, and only OAs of strength three were studied.
Looking at the two configurations of points in Fig. 2, the one on the left is seen to space the points
better. There are two basic ways to measure space-filling quality: minimax and maximin (refer to [32]).
Let D be the n-dimensional closed and bounded set of the continuous state space, and P be the set of n
design points in D. Minimax criteria seek to minimize the maximumdistance between (nondesign) points
in D and the nearest-neighbor design point in P. Using the same distance measure as above, define:
MAXDIST(D, P) = sup
x∈D
¸
min
y∈P
d(x, y)
¸
.
Since the supremum over the infinite set of points in D is difficult to calculate, this measure is usually
calculated over a finite sample of points in D, using a full factorial grid or some type of sampling scheme.
However, this measure is generally impractical to calculate in high dimensions. A design that represents
D well will have lower MAXDIST(D, P) than a design that does not.








x
1
0 1 4
x
2
• •
0
2
4
1
3
x
1
0
2
4
1
3
2 3 0 1 4
x
2
2 3
Fig. 2. Two configurations of design points.
78 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90
Maximin criteria seek to maximize the minimum distance between any pair of points in P. Let d(x, y)
be a distance measure (commonly Euclidean distance) over R. Then define:
MINDIST(P) = min
x∈P



min
y∈P
y=x
d(x, y)



.
Higher MINDIST(P) should correspond to a more uniform scattering of design points. For the OA
designs in this paper, their space-filling quality was judged using measures of both types and a measure
introduced by Chen [17].
4.2. OA-based LH designs
LH designs are equivalent to OA designs of strength one. In other words, if we project the N points
of a LH design onto any single dimension, the points will correspond to N distinct levels in that di-
mension. However, only OAs of strength two and higher are technically orthogonal, so the LH design
does not maintain this property. A hybrid OA–LH design, for which an LH design is (randomly) cre-
ated from an OA structure, was introduced by Tang [31] to address the lack of orthogonality in LH
designs. Although OA–LH designs only satisfy LH properties, the underlying OA provides some bal-
ance. With regard to approximating a function, the LH design provides more information along a single
dimension (for univariate terms), while the OA design provides better coverage in higher dimensions (for
interactions).
In this paper, we utilize the strength three OAs described in Chen [17], and the OA–LH designs are
based on these strength three OAs. Specifically, we employed strength three OAs with p = 11 levels
(N = 1331) and p = 13 levels (N = 2197) in each dimension, and two realizations of OA–LH designs
were generated for each OA. For each size (N = 1331 and 2197), one “good”, one “neutral”, and one
“poor” OA was selected based on space-filling measures described above. Thus, a total of 6 OAs and
12 OA–LH designs were tested. For the inventory forecasting problem, nine-dimensional designs were
generated, and for the water reservoir problem, eight-dimensional designs were generated.
5. Artificial neural networks
5.1. Model description
ANNs are mathematical modeling tools widely used in many different fields (for general background
and applications see Lippmann [54], Haykin [55], Bertsekas and Tsitsiklis [14], Cherkassky and Mulier
[56]; for statistical perspectives see White [57], Barron et al. [58], Ripley [59], or Cheng and Titterington
[60]; for approximation capabilities see Barron [61]). From a purely “analytical” perspective, an ANN
model is a nonlinear parameterized function that computes a mapping of n input variables into (typically)
one output variable, i.e., n predictors and one response. The basic architecture of ANNs is an interconnec-
tion of neural nodes that are organized in layers. The input layer consists of n nodes, one for each input
variable, and the output layer consists of one node for the output variable. In between there are hidden
layers which induce flexibility into the modeling. Activation functions define transformations between
layers, the simplest one being a linear function. Connections between nodes can “feed back” to previous
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 79
layers, but for function approximation, the typical ANN is feedforward only with one hidden layer and
uses “sigmoidal” activation functions (monotonic and bounded). A comprehensive description of various
ANN forms may be found in [55].
In our inventory forecasting SDP application, we utilized a feedforward one-hidden-layer ANN
(FF1ANN) with a “hyperbolic tangent” activation function: Tanh(x) = (e
x
− e
−2x
)/(e
x
+ e
−x
). De-
fine n as the number of input variables, H as the number of hidden nodes, v
jh
as weights linking input
nodes j to hidden nodes h, w
h
as weights linking hidden nodes h to the output node, and 0
h
and ¸ as
constant terms called “bias” nodes (like intercept terms). Then our ANN model form is
ˆ g(x; w, v, , ¸) = Tanh

H
¸
h=1
w
h
Z
h
+ ¸

,
where for each hidden node h
Z
h
= Tanh

n
¸
i=1
v
ih
x
i
+ 0
h

.
The parameters v
ih
, w
h
, 0
h
, and ¸ are estimated using a nonlinear least-squares approach called training.
For simplicity, we will refer to the entire vector of parameters as w in the following sections.
5.2. Model fitting
Our training employed a batch processing method (as opposed to the more common on-line processing)
that minimized the squared error loss lack-of-fit (LOF) criterion:
LOF( ˆ g) =
N
¸
j=1
[y
j
− ˆ g(x
j
, w)]
2
,
where N is the number of data points on which the network is being trained, a.k.a., the training data set.
Nonlinear optimization techniques used to conduct the minimization are typically gradient descent-based
methods. We used the backpropagation method (see [62,63]), which computes the gradient of the output
with respect to the outer parameters, then “backpropagates” the partial derivatives through the hidden
layers, in order to compute the gradient with respect to the entire set of parameters. This is completed in
two phases: (i) a forward phase in which an input x
j
is presented to the network in order to generate the
various outputs (i.e., the output of the network and those of the inner activation functions) corresponding
to the current value of the parameters, and (ii) a backward phase in which the gradient is actually computed
on the basis of the values obtained in the forward phase. After all N samples are presented to the network
and the corresponding gradients are computed, the descent algorithm step is completed by upgrading the
vector of parameters w using the steepest-descent method:
w
k+1
= w
k
− :
k
∇LOF(w
k
),
where ∇LOF(w
k
) is the gradient of the LOF function with respect to the vector w
k
. Implementation of
this approach is very simple, since the computation of the gradient requires only products, due to the
80 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90
mathematical structure of the network. Unfortunately, backpropagation presents many well-known dis-
advantages, such as entrapment in local minima. As a consequence, the choice of the stepsize :
k
is crucial
for acceptable convergence, and very different stepsizes may be appropriate for different applications.
The best training algorithmfor FF1ANNs, in terms of rate of convergence, is the Levenberg–Marquardt
method (first introduced in [64]), which is an approximation of the classic second-order (i.e., it makes
use of the Hessian matrix) Newton method. Since our LOF criterion is quadratic, we can approximate its
Hessian matrix via backpropagation as
J[e(w)]
T
J[e(w)],
where
e(w) = [y
1
− ˆ g(x
1
, w), . . . , y
N
− ˆ g(x
N
, w)]
T
and J[e(w)] is the Jacobian matrix of e(w). Each row of the matrix contains the gradient of the jth single
error term y
j
− ˆ g(x
j
, w) with respect to the set of parameters w.
The update rule for the parameter vector at each step is
w
k+1
= w
k
− [J[e(w
k
)]
T
J[e(w
k
)] + zI]
−1
J[e(w
k
)]
T
e(w
k
)
T
.
When z is large, the method is close to gradient descent with a small stepsize. When z = 0 we have a
standard Newton method with an approximated Hessian matrix (Gauss–Newton method), which is more
accurate near the minima. After each successful step (i.e., when the LOF decreases) z is decreased, while
it is increased when the step degrades the LOF.
5.3. Model implementation
Other than the selection of a model fitting approach, the most important decision in implementing an
ANN model is the choice of architecture. For a FF1ANN, the number of nodes in the hidden layer, H,
determines the architecture. Unfortunately, the appropriate number of hidden nodes is never clear-cut.
Existing theoretical bounds on the accuracy of the approximation are non-constructive or too loose to
be useful in practice. Thus, it is not possible to define “a priori” the optimal number of hidden nodes,
since it depends on the number of samples available and on the smoothness of the function we want to
approximate. Too few nodes would result in a network with poor approximation properties, while too
many would “overfit” the training data (i.e., the network merely “memorizes” the data at the expense
of generalization capability). Therefore, a tradeoff between these two possibilities is needed. There are
many rules of thumb and methods for model selection and validation in the literature, as well as methods
to avoid or reduce overfitting (see [55]), but the choice of a good architecture requires several modeling
attempts and is always very dependent on the particular application.
For our applications, we identified the appropriate number of hidden nodes by exploring many ar-
chitectures and checking the average cost using a test data set from the simulation described in Section
6.2. The best results for the nine-dimensional inventory forecasting problem were obtained for values of
H = 40 for N = 1331 and H = 60 for N = 2197. Similarly, the best results for the eight-dimensional
water reservoir problem were obtained using H = 10 for both sample sizes. For larger values of H the
solutions worsened.
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 81
6. ANNs and MARS comparisons
ANNs have become very popular in many areas of science and engineering, due to their proven ability
to fit any data set and to approximate any real function, together with the existence of accessible software,
such as Matlab’s (www.mathworks.com) Neural Network Toolbox, which was used in our study.
The primary disadvantages are (1) the difficulty in selecting the number of hidden nodes H and (2)
potential overfitting of the training data. The second disadvantage occurs because it is tempting to try
to fit the data as closely as possible even when it is not appropriate. Both disadvantages can be handled
by proper use of a test data set, but this process greatly increases the computational effort required.
From a statistical perspective, ANNs have the additional disadvantage of an approximation that does not
provide interpretable information on the relationships between the input (predictor) variables and output
(response) variable.
MARS [18] is based on traditional statistical thinking with a forward stepwise algorithm to select
terms in a linear model followed by a backward procedure to prune the model. The approximation bends
to model curvature at “knot” locations, and one of the objectives of the forward stepwise algorithm
is to select appropriate knots. The search for new basis functions can be restricted to interactions of a
maximumorder. For the inventoryforecastingproblems, MARSwas restrictedtoexplore uptothree-factor
interations. Chen [16] also considered MARS limited to two-factor interactions, but the approximation
degraded. However, for the water reservoir problem, two-factor interactions were sufficient. In addition,
the backwardprocedure was not utilizedsince preliminarytestingdemonstrateda verylarge computational
burden with little change on the approximation. After selection of the basis functions is completed,
smoothness to achieve a certain degree of continuity may be applied. In our case, quintic functions were
used to achieve second derivative continuity (see [16]). The primary disadvantage of MARS is that the
user must specify the maximum number of basis functions to be added, a parameter called M
max
. This
is very similar to the difficulty in selecting H for an ANN, and, like ANNs, a test data set can be used
to select M
max
. MARS is both flexible and easily implemented while providing useful information on
significant relationships between the predictor variables and the response; however, commercial MARS
software has only recently become available from Salford Systems (www.salford-systems.com).
For our study, our own C code was employed.
From the perspective of SDP value function approximation, MARS and ANNs are both excellent for
modeling in high dimensions when noise is small relative to the signal, as is the case with SDP. Their
main difference is the interpretability of the approximation, but this is not a critical issue for accurate
SDP solutions. An issue that may arise in other SDP applications is the selection of the key parameter,
M
max
for MARS and H for ANNs. For the inventory forecasting and water reservoir problems, the value
functions have similar structure fromstage to stage; thus, the same M
max
or Hfor all stages is appropriate.
When different M
max
or H is preferred for different stages, then the SDP algorithm would greatly benefit
from automatic selection of M
max
or H. This is an issue for future research.
6.1. Memory and computational requirements
The memory and computational requirements for a FF1ANN with one output variable depend on the
number of input (predictor) variables n, the number of hidden nodes H, and the size of the training data set
N. For MARS, the key quantities are n, N, the number of eligible knots K, and the maximum number of
basis functions M
max
. In this section, we will explore how the requirements might grow with increasing n
82 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90
(the dimension of an SDPproblem). Chen et al. [15] set N=O(n
3
), which assumes Nis based on a strength
three OA, K=O(n), which assumes knots can only placed at the p distinct levels of the strength three OA,
and O(n)M
max
O(n
3
). In this paper, the addition of the OA–LH designs permits the placement of a
higher number of knots. However, given an adequate K, it should not be expected to increase with n since
the larger K primarily impacts the quality of univariate modeling. Based on computational experience
with a variety of value function forms, K50 is more than adequate, and a more reasonable upper limit
on M
max
is O(N
2/3
) = O(n
2
). For ANNs, H depends on n, N, and the complexity of the underlying
function. A reasonable range on H (in terms of growth) matches that of O(n)HO(N
2/3
) = O(n
2
).
One should keep in mind that the above estimates are intended to be adequate for approximating SDP
value functions, which are convex and, consequently, generally well behaved. The inventory forecasting
value function, however, does involve a steep curve due to high backorder costs.
For memory requirements, a FF1ANN requires storage of V = [(n + 1)H + H + 1] parameters.
Assuming each of these is a double-precision real number, the total number of bytes required to store a
FF1ANN grows as
O(n
2
)S
FF1ANN
= 8V O(n
3
).
The maximum total number of bytes required to store one smoothed MARS approximation with up to
three-factor interactions grows as [15]
O(n)S
MARS
= 8(6M

max
+ 2n + 3Kn + 1)O(n
2
).
Thus, it appears that the memory requirement for FF1ANNs grows faster than that for MARS. However,
the largest FF1ANN tested in the paper required 5288 bytes and the largest MARS required 36,800 bytes,
so for n = 9, FF1ANNs are still considerably smaller.
Another aspect of memory requirements is memory required while constructing the approximation.
We utilized Levenberg–Marquardt (LM) backpropagation to train the FF1ANN, and a disadvantage of
the LM method lies in its memory requirements. Each step requires the storage of a (N×V) matrix and
the (V × V) Hessian approximation, so the memory needed grows as
O(n
5
)S2
FF1ANN
= O(NnH) + O(n
2
H
2
)O(n
6
).
However, if the memory at disposal is sufficient, LM backpropagation is generally preferred over other
training methods because of its faster convergence. MARS construction stores up to a (N×M
max
) matrix,
resulting in a growth of
O(n
4
)S2
MARS
= O(NM
max
)O(n
5
).
The computational requirements for one iteration of our SDP statistical modeling solution approach
consist of: (i) solving the optimization in Eq. (3) for one stage t at every experimental design point and
(ii) constructing the spline approximation of the future value function. The optimization evaluates a value
function approximation and its first derivative several times in each stage, except the last. For FF1ANNs,
(i) is performed via optimization that requires the evaluation of the output of an ANN and of its first
derivative. The time required for these evaluations is O(nH), so we can write the total time required
to solve the SDP equation for all N points as O(WnNH), where W is the work required to solve the
minimization.
For what concerns (ii), we consider LM backpropagation, which is the one we actually used for
our FF1ANNs. The time for the computation of LM backpropagation is dominated by the Hessian
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 83
Table 1
Actual average run times (in minutes) for one SDP iteration using FF1ANN and MARS models
Model N Inventory forecasting Water reservoir
450 MHz P2 550 MHz P2 933 MHz P2 3.06 GHz IX
ANN 1331 65 min (H = 40) 90 min (H = 40) 13 min (H = 10)
MARS 1331 61 min (M
max
= 300) 43 min (M
max
= 100)
ANN 2197 185 min (H = 60) 87 min (H = 60) 18 min (H = 10)
MARS 2197 112 min (M
max
= 400) 67 min (M
max
= 100)
The FF1ANN runs used W
LM
= 125. For the inventory forecasting problem, Matlab was used to conduct the FF1ANN runs
on two Pentium II (P2) machines, and a C code executed the MARS runs on a different P2 machine. For the water reservoir
problem, all runs were conducted in Matlab on an Intel Xeon (IX) workstation; however, calling the MARS C code in Matlab
resulted in significant slow down.
approximation and inversion. The (V×V) Hessian approximation is the product of two (N×V) matrices
that are obtained by backpropagation. Here, for each row of the matrix (which corresponds to a single
experimental design point x
j
), the computation is basically reduced to [H + 1 + (n + 1)H] products,
so the time required to obtain the gradient for all N experimental design points is O(NnH). The matrix
product requires NV
2
real valued products, so this operation (which is O(Nn
2
H
2
)) occupies the relevant
part of the time required to compute the approximation. For the inversion of a (V ×V) matrix, an efficient
algorithm can provide a computational time that is O(V
2
) = O(n
2
H
2
). Therefore, the time needed to
compute one step of LMbackpropagation is dominated by the matrix product: O(Nn
2
H
2
). Define W
LM
to
be the number of steps required to attain the backpropagation goal. (Typically, W
LM
never needs to exceed
300.) Then the time required to train the network by LMbackpropagation grows as O(W
LM
Nn
2
H
2
). Thus,
the total run time for one iteration of the FF1ANN-based SDP solution grows as
O(n
7
)C
FF1ANN
= O(WNnH) + [O(W
LM
n
2
NH
2
)]O(n
9
).
The total run time for one iteration of the MARS-based SDP solution grows as [15]
O(n
7
)C
MARS
= O(WNM
max
) + O(nNM
3
max
)O(n
10
),
where the time to construct the MARS approximation (second term) dominates.
Actual average run times for one SDP iteration are shown in Table 1. For the inventory forecasting
problem, the MARS and ANN runs were conducted on different machines in different countries. Despite
this, it can been seen that the run times for the same N are comparable. In order to conduct a fairer
computational comparison on the same programming platform, considerable effort was expended to
develop a Matlab code that could access our MARS C code. Unfortunately, the ability to call C code
from Matlab is still flawed, and the MARS runs suffered significant slow down. For the water reservoir
problem, all runs were conducted in the same place; however, the slow down experienced by calling the
MARS C code from Matlab is obvious. Unfortunately, at this time, a completely fair comparison that
uses the same programming platform is not readily implementable. Overall, our experience demonstrates
that the ANN model is clearly easier to implement.
84 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90































••
••











ANN
OA3G
p11
H40
MARS
OA3G
p11
200
MARS
OA3G
p11
300
ANN
OA3N
p11
H40
MARS
OA3N
p11
200
MARS
OA3N
p11
300
ANN
OA3P
p11
H40
MARS
OA3P
p11
200
MARS
OA3P
p11
300
0
4
8
1
5
9
A
b
s
o
l
u
t
e
E
r
r
o
r
2
3
6
7
11
12
13
14
15
10
Fig. 3. Inventory forecasting: Boxplots illustrate the distribution of absolute error for all 9 SDP solutions using strength 3 OAs
with p = 11 (N = 1331). Below each boxplot, the first line indicates the use of ANNs or MARS. The second line indicates the
type of OA, specifically: OA3G = “good” strength 3 OA, OA3N = “neutral” strength 3 OA, and OA3P = “poor” strength 3 OA.
The third line specifies p = 11, and the fourth line provides M
max
for MARS and the number of hidden nodes (H) for ANNs.


••
••






••

••
•••



••








••

•••


••

••


•••





••



ANN
OA3G
p13
H60
MARS
OA3G
p13
300
MARS
OA3G
p13
400
ANN
OA3N
p13
H60
MARS
OA3N
p13
300
MARS
OA3N
p13
400
ANN
OA3P
p13
H60
MARS
OA3P
p13
300
MARS
OA3P
p13
400
0
4
8
1
5
9
A
b
s
o
l
u
t
e
E
r
r
o
r
2
3
6
7
11
10
12
13
14
15
Fig. 4. Inventory forecasting: Boxplots illustrate the distribution of absolute error for all 9 SDP solutions using strength 3 OAs
with p = 13 (N = 2197). Labeling below the boxplots uses the same notation as that in Fig. 3.
6.2. SDP solutions
Both the nine-dimensional inventory forecasting problem and the eight-dimensional water reservoir
problem were solved over three stages (T = 3). Although several FF1ANN architectures were tested,
only one FF1ANN architecture for each design was employed to generate the plots in Figs. 3–8 For the
inventory forecasting problem, H = 40 for N = 1331 and H = 60 for N = 2197, and for the water
reservoir problem, H = 10 for both N. Several MARS approximations were constructed for each design
by varying M
max
and K. For the inventory forecasting problem, results with different M
max
are shown,
and for the water reservoir problem, M
max
= 100. For the OAs we set K = 9 for p = 11 and K = 11
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 85

••
••















••





•••
••




















•••



ANN
OA3P
p11
H40
ANN
LH1G
N1331
H40
MARS
OA3P
p11
200
MARS
LH2G
N1331
200
K9
MARS
LH2N
N1331
200
K9
ANN
OA3N
p11
H40
ANN
LH1N
N1331
H40
MARS
OA3P
p11
300
MARS
LH2G
N1331
300
K35
MARS
LH2P
N1331
300
K17
A
b
s
o
l
u
t
e
E
r
r
o
r
0
4
8
1
5
9
2
3
6
7
11
10
12
13
14
15
Fig. 5. Inventory forecasting: Boxplots illustrate the distribution of absolute error for 10 SDP solutions using N =1331. Included
fromFig. 3 are the best (ANN/OA3P; MARS/OA3P/M
max
=200) and worst (ANN/OA3N; MARS/OA3P/M
max
=300) solutions.
Beloweach boxplot, the first line indicates the use of ANNs or MARS. The second line indicates the type of experimental design,
specifically: OA3 notation as in Fig. 3, LH1G = first LH randomization based on a “good” strength 3 OA, LH2G = second LH
randomization based on a “good” strength 3 OA, and similarly for LH1N, LH2N, LH1P, and LH2P. The third line indicates the
size of the design, specifying p = 11 for the OAs and N = 1331 for the OA–LH designs. The fourth line provides M
max
for
MARS and the number of hidden nodes (H) for ANNs. Finally, a fifth line appears for the OA–LH/MARS solutions indicating
the number of eligible knots (K) utilized in the MARS algorithm. The three best OA–LH-based solutions are shown beside the
best OA-based solutions (in the left half), and similarly for the worst solutions (in the right half).


••
••





••


••

••
•••




•••
••









••



••








••




••





ANN
OA3G
p13
H60
ANN
LH2G
N2197
H60
MARS
OA3G
p13
400
MARS
LH2G
N2197
400
K11
MARS
LH2N
N2197
300
K17
ANN
OA3N
p13
H60
ANN
LH1P
N2197
H60
MARS
OA3P
p13
400
MARS
LH2P
N2197
400
K11
MARS
LH1P
N2197
400
K17
A
b
s
o
l
u
t
e
E
r
r
o
r
0
4
8
1
5
9
2
3
6
7
11
10
12
13
14
15
Fig. 6. Inventory forecasting: Boxplots illustrate the distribution of absolute error for 10 SDP solutions using N = 2197 exper-
imental design points. Included from Fig. 4 are the best (ANN/OA3G, MARS/OA3G/M
max
= 400) and worst (ANN/OA3N,
MARS/OA3P/M
max
= 400 solutions. Labeling below the boxplots uses the same notation as that in Fig. 5.
for p = 13 (the maximum allowable in both cases). For the OA–LH designs K was chosen as high as
50. Simulations using the “reoptimizing policy” described in Chen [16] were conducted to determine the
optimal decisions and to test the SDP solutions.
86 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90
A
b
s
o
l
u
t
e
E
r
r
o
r
ANN
OA3G
p11
H10
MARS
OA3G
p11
100
ANN
OA3N
p11
H10
MARS
OA3N
p11
100
ANN
OA3P
p11
H10
MARS
OA3P
p11
100
ANN
OA3G
p13
H10
MARS
OA3G
p13
100
ANN
OA3N
p13
H10
MARS
OA3N
p13
100
ANN
OA3P
p13
H10
MARS
OA3P
p13
100
0
4
8
1
5
9
2
3
6
7
10
Fig. 7. Water reservoir: Boxplots illustrate the distribution of absolute error for all 12 SDP solutions using strength 3 OAs with
p = 11 (N = 1331) and p = 12 (N = 2197). Labeling below the boxplots uses the same notation as that in Fig. 3.
A
b
s
o
l
u
t
e
E
r
r
o
r
ANN
OA3N
p11
H10
ANN
LH1G
N1331
H10
MARS
OA3N
p13
100
MARS
LH1G
N1331
100
K50
MARS
LH1G
N2197
100
K50
ANN
LH1N
N2197
H10
ANN
OA3P
p11
H10
ANN
LH1P
N1331
H10
MARS
OA3P
p13
100
MARS
LH1P
N1331
100
K50
MARS
LH1N
N2197
100
K50
ANN
LH1G
N2197
H10
0
4
8
1
5
9
2
3
6
7
10
Fig. 8. Water reservoir: Boxplots illustrate the distribution of absolute error for 12 SDP solutions. Included from Fig. 7 are
the best (ANN/OA3N/p11, MARS/OA3N/p13) and worst (ANN/OA3P/p11, MARS/OA3P/p13) solutions. Labeling below the
boxplots uses the same notation as that in Fig. 5.
6.2.1. Inventory forecasting problem
For each of 100 randomly chosen initial state vectors, a simulation of 1000 demand forecast sequences
was conducted for the SDP solutions. The simulation output provided the 100 mean costs (corresponding
to the 100 initial state vectors), each over 1000 sequences. For each initial state vector, the smallest mean
cost achieved by a set of 90 SDP solutions was used as the “true” mean cost for computing the “errors”
in the mean costs. Boxplots in Figs. 3–6 display the distributions (over the 100 initial state vectors) of
these errors. The average over the 100 “true” mean costs was approximately 62.
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 87
Figs. 3 and 4 display all 18 of the OA-based SDP solutions. They are organized left to right by the
“goodness” of the OAdesign. The OA/MARS solutions with p=11 (N=1331) are all fairly close, except
for the “worst” one, which is “MARS OA3P p11 300”, i.e., the one that used the poor OAand M
max
=300.
The other MARS OA3P solution (with M
max
=200) is used to represent the “best” OA/MARS solution,
although it is essentially equivalent to the other good OA/MARSsolutions. For the OA/FF1ANNsolutions
with p = 11 (N = 1331), the “worst” one is “ANN OA3N p11 H40”, i.e., the one that used the neutral
OA3 (there is actually another outlier above 15). The “best” one is “ANNOA3P p11 H40”, using the poor
OA. The OA/FF1ANN solution with the good OA is not terrible, but is worst than most of the OA/MARS
solutions because of its higher median. Ideally, we would want good OAs to result in better solutions
than poor OAs; however, sometimes it does not happen this way because the poor qualities are located in
regions of the state space that are not critical to finding a good solution.
For the MARS solutions with p = 13 (N = 2197), the “best” solution is “MARS OA3G p13 400”,
which used the good OA and M
max
=400; while the “worst” is “MARS OA3P p13 400”, which used the
poor OA and M
max
= 400. For the FF1ANN solutions with p = 13 (N = 2197), the “best” solution is
with the good OA, while the “worst” is with the neutral OA. It should be noted in Figs. 3 and 4, as was
illustrated in Chen [16,17], that good OAs result in better SDP solutions on average.
Figs. 5and6displaythe best andworst solutions for eachof the four types of SDPsolutions, withthe best
solutions on the left and worst on the right. The OA/FF1ANNand OA/MARS solutions are repeated from
Figs. 3 and 4. Note the solution for “ANN LH1N N1331 H40”, which used the first OA–LH realization
from the neutral OA. This is the “worst” solution among the FF1ANNs using OA–LH designs, but it is
actually a good solution. The interesting observation here is that the FF1ANNs performed better with the
OA–LH designs. The best solution overall is “MARS LH2G N2197 400 K11”, which used the second
OA–LH realization from the good OA, M
max
= 400, and K = 11. However, there is one disconcerting
solution, and that is “MARS LH2GN1331 300 K35”. It is troubling because it uses a LHdesign generated
from the good OA, but it is visibly one of the “worst” solutions. To some extent the goodness measures
are no longer valid for the OA–LH designs, because the randomization used in generating them may have
resulted in a poorly spaced realization.
6.2.2. Water reservoir problem
For each of 50 randomly chosen initial state vectors, a simulation of 100 demand forecast sequences
was conducted for the SDP solutions. The simulation output provided the 50 mean costs (corresponding
to the 50 initial state vectors), each over 100 sequences. The “errors” in mean costs were computed in the
same manner as the inventory forecasting problem. Boxplots in Figs. 7 and 8 display the distributions (over
the 50 initial state vectors) of these errors. The average over the 50 “true” mean costs was approximately
−47.
Fig. 7 displays all the OA-based SDP solutions. They are organized left to right by OA design size
and by the space-filling quality of the OA design. Among the SDP solutions with p =11 (N =1331) the
“worst” one is clearly “ANN OA3P p11 H=10”. The “best” one is not as easy to identify, but “ANN
OA3N p11 H=10” is seen to have the lowest median. Among the SDP solutions with p=13 (N =2197)
the “worst” one, having a significantly longer box, is “MARS OA3P p13 100”. All the other solutions
for this size design are somewhat comparable, but “MARS OA3N p13 100” is seen to have the lowest
median. For both sizes, the worst solutions employed the poor OA, indicating that poor solutions can be
avoided by not using a poor OA. This agrees with the findings in the previous section.
88 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90
Fig. 8 displays the best and worst solutions fromFig. 7 along with the best and worst OA–LH/FF1ANN
and OA–LH/MARS solutions. The figure is organized with the best solutions on the left and worst on the
right. Some improvement is seen using the OA–LHdesigns. Unlike the inventory forecasting problem, the
water reservoir problem should not require many interaction terms in the value function approximation;
thus, the increased univariate information from LH designs should be an advantage. Like the findings in
the previous section, the worst OA–LH/FF1ANN solutions are not bad, supporting the expectation that
ANN models perform well with OA–LH designs.
7. Concluding remarks
The statistical modeling approach for numerically solving continuous-state SDP was used to simultane-
ously study OA–LHdesigns and anANN-based implementation as alternatives to the OA/MARS method
of Chen et al. [15] and Chen [16]. In addition, the space-filling quality of the OA designs was considered.
For the SDP solutions using OA designs, good OAs reduced the possibility of a poor SDP solution. For
the inventory forecasting problem, involving significant interaction terms, MARS performed similarly
using both OAand OA–LHdesigns, while FF1ANNsolutions noticeably improved with OA–LHdesigns.
For the water reservoir problem, overall, good solutions could be achieved using MARS models with
either good OA or OA–LH designs and FF1ANN models with good OA–LH designs. With regard to
both computational SDP requirements and quality of SDP solutions, we judged FF1ANNs and MARS
to be very competitive with each other overall. A small advantage to MARS is the information on the
relationships between the future value function and the state variables that may be gleaned from the
fitted MARS model. Current research is developing a method to automatically select M
max
for MARS,
so as to permit flexibility in the approximation across different SDP stages. Unfortunately, a method for
automatically selecting H for ANNs does not seem possible.
References
[1] Bellman RE. Dynamic programming. Princeton: Princeton University Press; 1957.
[2] Puterman ML. Markov decision processes. NewYork, NY: Wiley; 1994.
[3] Bertsekas DP, Tsitsiklis JN. Dynamic programming and optimal control. Belmont, MA: Athena Scientific; 2000.
[4] Gal S. Optimal management of a multireservoir water supply system. Water Resources Research 1979;15:737–48.
[5] Shoemaker CA. Optimal integrated control of univoltine pest populations with age structure. Operations Research
1982;30:40–61.
[6] White DJ. Real applications of Markov decision processes. Interfaces 1985;15(6):73–83.
[7] White DJ. Further real applications of Markov decision processes. Interfaces 1988;18(5):55–61.
[8] Foufoula-Georgiou E, Kitanidis PK. Gradient dynamic programming for stochastic optimal control of multidimensional
water resources systems. Water Resources Research 1988;24:1345–59.
[9] Culver TB, Shoemaker CA. Dynamic optimal ground-water reclamation with treatment capital costs. Journal of Water
Resources Planning and Management 1997;123:23–9.
[10] ChenVCP, Günther D, Johnson EL. Solving for an optimal airline yield management policy via statistical learning. Journal
of the Royal Statistical Society, Series C 2003;52(1):1–12.
[11] Tsai JCC, Chen VCP, Chen J, Beck MB. Stochastic dynamic programming formulation for a wastewater treatment
decision-making framework. Annals of Operations Research, Special Issue on Applied Optimization Under Uncertainty
2004;132:207–21.
[12] Johnson SA, Stedinger JR, Shoemaker CA, Li Y, Tejada-Guibert JA. Numerical solution of continuous-state dynamic
programs using linear and spline interpolation. Operations Research 1993;41:484–500.
C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90 89
[13] Eschenbach EA, Shoemaker CA, Caffey H. Parallel processing of stochastic dynamic programming for continuous state
systems with linear interpolation. ORSA Journal on Computing 1995;7:386–401.
[14] Bertsekas DP, Tsitsiklis JN. Neuro-dynamic programming. Belmont, MA: Athena Scientific; 1996.
[15] Chen VCP, Ruppert D, Shoemaker CA. Applying experimental design and regression splines to high-dimensional
continuous-state stochastic dynamic programming. Operations Research 1999;47:38–53.
[16] Chen VCP. Application of MARS and orthogonal arrays to inventory forecasting stochastic dynamic programs.
Computational Statistics and Data Analysis 1999;30:317–41.
[17] ChenVCP. Measuring the goodness of orthogonal array discretizations for stochastic programming and stochastic dynamic
programming. SIAM Journal of Optimizations 2001;12:322–44.
[18] Friedman JH. Multivariate adaptive regression splines (with discussion). Annals of Statistics 1991;19:1–141.
[19] Carey WP, Yee SS. Calibration of nonlinear solid-state sensor arrays using multivariate regression techniques. Sensors and
Actuators, B: Chemical 1992;B9(2):113–22.
[20] Griffin WL, Fisher NI, Friedman JH, Ryan CG. Statistical techniques for the classification of chromites in diamond
exploration samples. Journal of Geochemical Exploration 1997;59(3):233–49.
[21] Kuhnert PM, Do K-A, McClure R. Combining non-parametric models with logistic regression: an application to motor
vehicle injury data. Computational Statistics and Data Analysis 2000;34(3):371–86.
[22] Shepherd KD, Walsh MG. Development of reflectance spectral libraries for characterization of soil properties. Soil Science
Society of America Journal 2002;66(3):988–98.
[23] Smets HMG, Bogaerts WFL. Deriving corrosion knowledge from case histories: the neural network approach. Materials
& Design 1992;13(3):149–53.
[24] Labossiere P, Turkkan N. Failure prediction of fibre-reinforced materials with neural networks. Journal of Reinforced
Plastics and Composites 1993;12(12):1270–80.
[25] AbdelazizAY, Irving MR, Mansour MM, El-ArabatyAM, NosseirAI. Neural network-based adaptive out-of-step protection
strategy for electrical power systems. International Journal of Engineering Intelligent Systems for Electrical Engineering
and Communications 1997;5(1):35–42.
[26] Chen VCP, Rollins DK. Issues regarding artificial beural network modeling for reactors and fermenters. Bioprocess
Engineering 2000;22:85–93.
[27] Li QS, Liu DK, Fang JQ, JearyAP, Wong CK. Damping in buildings: its neural network model andARmodel. Engineering
Structures 2000;22(9):1216–23.
[28] Liu J. Prediction of the molecular weight for a vinylidene chloride/vinyl chloride resin during shear-related thermal
degradation in air by using a back-propagation artificial neural network. Industrial and Engineering Chemistry Research
2001;40(24):5719–23.
[29] Pigram GM, Macdonald TR. Use of neural network models to predict industrial bioreactor effluent quality. Environmental
Science and Technology 2001;35(1):157–62.
[30] Kottapalli S. Neural network based representation of UH-60A pilot and hub accelerations. Journal of the American
Helicopter Society 2002;47(1):33–41.
[31] Tang B. Orthogonal array-based Latin hypercubes. Journal of the American Statistical Association 1993;88:1392–7.
[32] Chen VCP, Tsui K-L, Barton RR, Allen JK. A review of design and modeling in computer experiments. In: Khattree R,
Rao CR, editors. Handbook of statistics: statistics in industry, vol. 22. Amsterdam: Elsevier Science; 2003. p. 231–61.
[33] Chen VCP, Welch WJ. Statistical methods for deterministic biomathematical models. In: Proceedings of the 53rd session
of the international statistical institute, Seoul, Korea. August 2001.
[34] De Veaux RD, Psichogios DC, Ungar LH. Comparison of two nonparametric estimation schemes: MARS and neural
networks. Computers & Chemical Engineering 1993;17(8):819–37.
[35] Pham DT, Peat BJ. Automatic learning using neural networks and adaptive regression. Measurement and Control
1999;32(9):270–4.
[36] Palmer KD. Data collection plans and meta models for chemical process flowsheet simulators. PhD dissertation, School
of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA, 1998.
[37] Simpson TW, Lin DKJ, Chen W. Sampling strategies for computer experiments: design and analysis. International Journal
of Reliability and Applications 2001;2(3):209–40.
[38] Allen T, Bernshteyn MA, Kabiri K. Constructing meta-models for computer experiments. Journal of Quality Technology
2003;35:264–74.
90 C. Cervellera et al. / Computers & Operations Research 34 (2007) 70–90
[39] Baglietto M, Cervellera C, Parisini T, Sanguineti M, Zoppoli R. Approximating networks for the solution of T-stage
stochastic optimal control problems. In: Proceedings of the IFAC workshop on adaptation and learning in control and
signal processing, 2001.
[40] Baglietto M, Cervellera C, Sanguineti M, Zoppoli R. Water reservoirs management under uncertainty by approximating
networks and learning from data. In: Topics on participatory planning and managing water systems. Amsterdam: Elsevier
Science Ltd., 2005, to appear.
[41] Hadley G, Whitin TM. Analysis of inventory systems. Englewood Cliffs, NJ: Prentice-Hall; 1963.
[42] Heath DC, Jackson PL. Modelling the evolution of demand forecasts with application to safety stock analysis in
production/distribution systems. Technical report #989, School of Operations Research & Industrial Engineering, Cornell
University, Ithaca, NY, 1991.
[43] Archibald TW, McKinnon KIM, Thomas LC. An aggregate stochastic dynamic programming model of multireservoir
Systems. Water Resources Research 1997;33:333–40.
[44] Lamond BF, Sobel MJ. Exact and approximate solutions of affine reservoirs models. Operations Research 1995;43:
771–80.
[45] Turgeon A. A decomposition method for the long-term scheduling of reservoirs in series. Water Resources Research
1981;17:1565–70.
[46] Yakowitz S. Dynamic programming applications in water resources. Water Resources Research 1982;18:673–96.
[47] Chen VCP, Chen J, Beck MB. Statistical learning within a decision-making framework for more sustainable urban
environments. In: Proceedings of the joint research conference on statistics in quality, industry, and technology, Seattle,
WA. June 2000.
[48] Rao CR. Hypercubes of strength ‘d’ leading to confounded designs in factorial experiments. Bulletin of the Calcutta
Mathematical Society 1946;38:67–78.
[49] Bose RC, Bush KA. Orthogonal arrays of strength two and three. Annals of Mathematical Statistics 1952;23:508–24.
[50] Raghavarao D. Constructions and combinatorial problems in design of experiments. NewYork: Wiley; 1971.
[51] Hedayat AS, Wallis WD. Hadamard matrices and their applications. Annals of Statistics 1978;6:1184–238.
[52] Dey A. Orthogonal fractional factorial designs. NewYork: Wiley; 1985.
[53] Hedayat AS, Stufken J, Su G. On difference schemes and orthogonal arrays of strength t . Journal of Statistical Planning
and Inference 1996;56:307–24.
[54] Lippmann RP. An introduction to computing with neural nets. IEEE ASSP Magazine (1987) 4–22.
[55] Haykin SS. Neural networks: a comprehensive foundation. 2nd ed., Upper Saddle River, NJ: Prentice-Hall; 1999.
[56] Cherkassky V, Mulier F. Learning from data: concepts theory, and methods. NewYork: Wiley; 1998.
[57] White H. Learning in neural networks: a statistical perspective. Neural Computation 1989;1:425–64.
[58] Barron AR, Barron RL, Wegman EJ. Statistical learning networks: a unifying view. Computer science and statistics. In:
Proceedings of the 20th symposium on the interface. 1992. p. 192–203.
[59] Ripley BD. Statistical aspects of neural networks. In: Barndorff-Nielsen OE, Jensen JL, Kendall WS, editors. Networks
and chaos—statistical and probabilistic aspects. NewYork: Chapman & Hall; 1993. p. 40–123.
[60] Cheng B, Titterington DM. Neural networks: a review from a statistical perspective (with discussion). Statistical science
1994;9:2–54.
[61] BarronAR. Universal approximation bounds for superpositions of a sigmoidal function. IEEETransactions on Information
Theory 1993;39:930–45.
[62] Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. In: Rumelhart DE,
McClelland JL, editors. Parallel distributed processing: explorations in the microstructure of cognition. Cambridge, MA:
MIT Press; 1986. p. 318–62.
[63] Werbos PK. The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. New
York: Wiley; 1994.
[64] HaganMT, Menhaj M. Trainingfeedforwardnetworks withthe Marquardt algorithm. IEEETransactiononNeural Networks
1994;5:989–93.

C. Cervellera et al. / Computers & Operations Research 34 (2007) 70 – 90

71

1. Introduction The objective of dynamic programming (DP) is to minimize a “cost” subject to certain constraints over several stages, where the relevant “cost” is defined for a specific problem. For an inventory problem, the “cost” is the actual cost involved in holding inventory, having to backorder items, etc., and the constraints involve capacities for holding and ordering items. An equivalent objective is to maximize a “benefit”, such as the robustness of a wastewater treatment system against extreme conditions. The state variables track the state of the system as it moves through the stages, and a decision is made in each stage to achieve the objective. Recursive properties of the DP formulation permit a solution via the fundamental recurrence equation [1]. In stochastic dynamic programming (SDP), uncertainty is modeled in the form of random realizations of stochastic system variables, and the estimated expected “cost” (“benefit”) is minimized (maximized). In particular, Markov decision processes may be modeled by a SDP formulation through which the state variables resulting from each decision comprise a Markov process. DP has been applied to many problems in engineering, such as water resources, revenue management, pest control, and wastewater treatment. For general background on DP, see Bellman [1], Puterman [2] and Bertsekas [3]. For some applications, see Gal [4], Shoemaker [5], White [6,7], Foufoula-Georgiou and Kitanidis [8], Culver and Shoemaker [9], Chen et al. [10], and Tsai et al. [11]. In high dimensions, traditional DP solution methods can become computationally intractable and attempts have been made to reduce the computational burden [12–14]. For continuous-state SDP, the statistical perspective of Chen et al. [15] and Chen [16] enabled the first truly practical numerical solution approach for high dimensions. Their method utilized orthogonal array (OA, [17]) experimental designs and multivariate adaptive regression splines (MARS, [18]). Although MARS has been applied in several areas (e.g., [19–22]), artificial neural network (ANN) models are much more prevalent in all areas of engineering (e.g., [23–30]). Given their popularity, an ANN-based SDP solution method would be more accessible to engineers. In this paper, we present: • • • • The statistical modeling approach to solving continuous-state SDP. Implementation of ANN models in continuous-state SDP as an alternative to MARS. Implementation of OA-based Latin hypercube (OA–LH) designs [31] as an alternative to pure OAs. Consideration of the “space-filling” quality of generated experimental designs. The designs are intended to distribute discretization points so that they “fill” the state space; however, generated designs of the same type do not necessarily have equivalent space-filling quality. • Comparisons between ANN and MARS value function approximations, including methodological perspectives, computational considerations, and numerical SDP solution quality on two applications.

The novelty of our study is to simultaneously consider how the SDP solution is affected by different approximation methods and designs of different types and different space-filling qualities. In the statistical perspective of continuous-state SDP, the two basic requirements are an experimental design to discretize the continuous state space and a modeling method to approximate the value function An appropriate design should have good space-filling quality, and an appropriate modeling method should be flexible in high dimensions, provide a smooth approximation, and be reasonably efficient. A review of several experimental designs and modeling methods is provided by Chen et al. [32] in the context of computer experiments. Of the possible modeling choices given in this review paper, only MARS, neural networks, and kriging satisfy the first two criteria. Some preliminary exploration of the kriging model

g. none have considered OA–LH designs. Cervellera et al. However. kriging required 40 times more computational effort than MARS. one can relax the discreteness and use a continuous range of inventory levels. including the implementation of OA–LH designs. the inventory levels are inherently discrete since items are counted. six-. The SDP statistical modeling approach is given in Section 3.and seven-dimensional water reservoir SDP problems appears in Baglietto et al. convex) and the SDP data involve little noise. and the ANN model fitting process has a reputation of being slow. and an eight-dimensional water reservoir problem. The grid restriction was convenient for the MARS structure. a disadvantage when a smooth convex approximation is desired. which is known to have significant interaction effects between inventory levels and forecasts due to the capacity constraints. and in both studies MARS performed better overall. SDP applications In continuous-state DP. Inventory forecasting problem For an inventory problem. In continuous-state SDP. MARS. / Computers & Operations Research 34 (2007) 70 – 90 yielded positive results on a ten-dimensional wastewater treatment application [33]. could affect its approximation ability. Section 2 describes SDP in the context of the two applications that are later used to test our methods. which has simpler structure and is more typical of continuous-state SDP problems. and nine-state variables modeled over three stages. and Section 6 presents the comparisons between the ANNs and MARS. which would be most noticeable in low noise situations. ANN models can add extraneous curvature. computational speed is an important aspect of SDP. the existing literature does not provide sufficient studies to predict how they will compare to MARS in the SDP setting. Here. the system state variables are continuous. Although it is obvious that ANN models can be employed to approximate functions. but provides no advantage for ANN models. Points of an OA design are limited to a fairly sparse grid. but solving for the maximum likelihood estimates of the kriging parameters continues to be a computational challenge. Finally. [11]. constructs a parsimonious approximation. although quite flexible.. but the range of inventory levels for an item is too large to be practical for a discrete DP solution. Previous comparisons involved statistical data with considerable noise. Chen et al. but its limited structure. Studies on different experimental designs for metamodeling have been conducted [36–38]. while the OA–LH design allows points to move off the grid. kriging software is even harder to obtain than MARS software. on the other hand. For this application. Conclusions are presented in Section 7.72 C. [15] and Chen [16] solved inventory forecasting test problems with four-. we consider the nine-dimensional inventory forecasting problem. we chose the OA–LH design as a variation that might work well with ANN models. Application to ten. value functions are typically well-behaved (e. Section 5 describes ANNs. It is expected that ANN models will perform better in low noise situations. [40]. Thus. However. [39] and Baglietto et al.1. which is counter to our motivation to employ ANNs. In addition. Section 4 provides background on OA and OA–LH designs. The inventory forecasting problem . [34] and Pham and Peat [35]. and application to a 20-dimensional wastewater treatment system appears in Tsai et al. 2. Given the success of OA designs in continuous-state SDP. Comparisons between MARS andANNs for empirical modeling have been conducted by DeVeaux et al. 2.

described in Chen [16]. the capacity constraints are chosen to be restrictive. forcing interactions between the state variables. but are otherwise arbitrary. The typical V-shaped cost function common in inventory modeling is nI cv (xt . The objective function is a cost function involving inventory holding costs and backorder costs. D(t.t) . Let It be the inventory level for item i at the beginning of (i) the current stage. it is assumed that orders arrive instantly. / Computers & Operations Research 34 (2007) 70 – 90 73 uses forecasts of customer demand and a probability distribution on the forecasting updates to model the demand more accurately than traditional inventory models (see [41]). D(t. is employed here. ut is the amount ordered in stage t.1.t+1) . ut .t) . and i is the backorder cost parameter for item i. and t is the stochasticity associated with updates in the demand forecasts.1. ut ) = i=1 hi [It+1 ]+ + (i) (i) i [−It+1 ]+ . Transition function The transition function for stage t specifies the new state vector to be xt+1 = ft (xt . Then the decision vector is ut = i ordered in stage t. Let D(t. . where xt is the state of the system (inventory levels and forecasts) at the beginning of stage t. t ). State and decision variables The state variables for inventory forecasting consist of the inventory levels and demand forecasts for each item. See Chen [16] for details on the stochastic modeling. let D(t. D(t. so that the sequence of future forecasts for stage t + k forms a martingale [42].C. D(t. D(t. 2.2. Then the state vector at the beginning of stage t is xt = It .1. 2. (1) where [q]+ =max{0. The evolution of these forecasts over time is modeled using the multiplicative Martingale model of forecast evolution.1. It . q}. (1). For simplicity.t+1) be the forecast made at the beginning of the current stage for the demand of item i occurring over the next stage (t + 1). for which the stochastic updates are applied multiplicatively and assumed to have a mean of one.3. Cervellera et al. For this test problem. in the form of capacity constraints. The bounds placed on the state variables in each stage are necessary to ensure accurate modeling over the state space. (1) (2) (3) T (i) ut .t) be the forecast made at the beginning of the current stage for the demand of item i occurring over the stage. in the form of state space bounds. and on the state variables. It . Similarly. hi is the holding cost parameter for item i. where ut 2.t+1) . ut .t+1) (1) (2) (3) (1) (2) (3) (1) (2) (3) T (i) . For our nine-dimensional problem. D(t. The smoothed version of Eq. Suppose (i) we are currently at the beginning of stage t. Let ut (i) be the amount of item i ordered at the is the amount of item beginning of stage t. ut . The decision variables control the order quantities. we have three items and two forecasts for each.t) . Objectives and constraints The constraints for inventory forecasting are placed on the amount ordered.

1. The coefficients of the linear combinations (at . Thus. Gal [4].g. wt . wt .g. Water reservoir problem Optimal operation of water reservoir networks has been studied extensively in the literature (e. Reservior network. In this paper. 2. . (2) where t is a random correction that follows a standard normal distribution. t . while the outputs correspond to the amount of water to be released during a given time stage (e. Archibald et al. a month). State and decision variables In a realistic representation of inflow dynamics. Lamond and Sobel [44]. Foufoula-Georgiou and Kitanidis [8]. t . Cervellera et al.. In typical models. The actual values of the coefficients used for the model were based on the one-reservoir real model described in Gal [4] and extended to our four-reservoir network. t (1) (2) (3) (4) (1) (2) (3) (4) T .g. ct ) actually depend on t. / Computers & Operations Research 34 (2007) 70 – 90 Fig. t . so that it is possible to model proper external inflow behaviors for different months. wt . 2. has the form (i) t = at (i) (i) t−1 + bt + c t (i) (i) (i) t .. (2).74 C. the inputs to the nodes are water released from upstream reservoirs and stochastic inflows from external sources (e. 1.2. Given Eq. The water reservoir network we consider consists of four reservoirs (see Fig.. bt . Turgeon [45] and Yalpwotz [46] provides an excellent survey). the amount of (random) water flowing into the reservoirs from external sources during a given stage t depends on the amounts of past stages. the state vector consists of the water level and stochastic external inflow for each reservoir: xt = wt . representing the stochastic external flow into reservoir i. rivers and rain). we follow the realistic modeling of Gal [4] and use an autoregressive linear model of order one. [43]. 1 for the configuration). the ith component of the random vector t .2.

etc. ) = z+ − 2 z− − 2 z− . W wt+1 = min wt + ⎭ ⎩ (i) j ∈U where U (i) the set of indexes corresponding to the reservoirs which release water into reservoir i. In order to deal with unexpected peaks from stochastic external inflows.C.g. Thus. ⎪ 2 3 2 3 3 3 ⎪ ⎪ ⎪ 2 ⎪ ⎩ z+ 1 . For the modeling benefits. 0 z . To facilitate numerical minimization.2. wt .2. ut . 3. ⎩ ⎭ i j ∈U i where R (i) the maximum pumpage capability for reservoir i. the new water level is ⎫ ⎧ ⎬ ⎨ (j ) (i) (i) (i) (i) (i) .1 . The decision vector is ut = where ut be the amount of water released from reservoir i during stage t. Objectives and constraints (i) The constraints for water reservoir management are placed on the water releases ut in the form: ⎧ ⎫ ⎨ ⎬ (j ) (i) (i) 0 ut min wt + ut . we can update them using Eq.2. that is analogous to the smoothed cost function for the inventory forecasting problem. In our example. each reservoir has a floodway. ⎪ ⎪ ⎪ 3 ⎪ ⎨ 2 2 3 153 2 4 1 1 45 g(z. ut − u t + t . 2. R (i) . For the stochastic external inflows. the intuitive cost would be |wt − wt |. The ob(i) jective function is intended to maintain a reservoir’s water level close to a target value wt and maximize ˜ benefit represented by a concave function g of the water releases and/or water levels (e. depending on suitable parameters and .3. so that the amount of water never exceeds the maximum value W (i) . Cervellera et al. For maintaining targets. ut . z . ut . This implies nonnegativity of wt+1 . we utilized a smoothed version.. 4) at the beginning of stage t. 2 3 Such a function models a benefit which becomes relevant for “large” values of the water releases. Like the inventory forecasting model. 2. ut . (i) (i) t the (1) (2) (3) (4) T ut .). we used R (i) for and a small number 0. . (2). irrigation. / Computers & Operations Research 34 (2007) 70 – 90 75 where wt (i) is the amount of water in reservoir i (i = 1. denoted by l wt . and stochastic external inflow into reservoir i during stage t. The amount of water at stage t + 1 reflects the flow balance between the water that enters (upstream releases and stochastic external inflows) and the water that is released. Transition function The transition function for stage t updates the water levels and external inflows. we consider a nonlinear concave function g of the following form (for z ∈ [0. power genera(i) (i) ˜ tion. the two equations above can be grouped in the compact transition function xt+1 = ft (xt . +∞)): ⎧ 1 ⎪ ( + 1)z. 2. z . t ). which is a V-shaped (i) (i) ˜ function.

Then.t. (xt . ut . / Computers & Operations Research 34 (2007) 70 – 90 for .. where the p(i) . Chen [16] additionally explored strength two OAs . wt ˜ − i=1 p(i) g ut . Chen et al. first by assuming FT +1 (xT +1 ) ≡ 0. t ) + Ft+1 (xt+1 )} ut s. then constructs an approximation. p = 13 was the highest number of levels explored). We seek to estimate the unknown relationship between the state vector and the future value function. but univariate modeling can suffer due to the small number of distinct levels in each dimension (e. For continuous states. a numerical SDP solution requires a computationally tractable approximation method for the future value function. The choice of the strength three OAs was motivated by the desire to model interactions. xt+1 = ft (xt . given a finite time horizon T.76 C. the future value function provides the optimal cost to operate the system from stage t through T as a function of the state vector xt . 2. The complete future value function is obtained by solving this optimization problem for all possible beginning states. 3. [15] introduced this statistical perspective using strength three OAs and MARS (smoothed with quintic functions) applied to the inventory forecasting test problems. i = 1. (3) If we have the future value functions for all the stages. The recursive formulation permits the DP backward solution algorithm [1]. then solving in order for FT (·) through F1 (·). . [47] for further details. except in simple cases where an analytical solution is possible. Thus. ut ) ∈ t . Following the notation in Section 2. this is written recursively as Ft (xt ) = min E{c(xt . ut . (i) (i) . we can track the optimal solution through the value functions. then given the initial state of the system. the corresponding optimal cost yields one point on that stage’s value function. . Cervellera et al. t ). FT (·).g. t ) = i=1 l (i) (i) wt . the total cost function at stage t can be written as 4 4 ct (xt . 4 are weights chosen to balance the benefits against the costs. Future value function For a given stage. . let the state vector be the vector of predictor variables. (3). SDP statistical modeling solution approach 3. the basic procedure uses an experimental design to choose N points in the n-dimensional state space so that one may accurately approximate the future value function. this is an infinite process. so that FT (xT ) = cT (xT ) (where cT (xT ) is the cost of the final stage depending only on xT ). the basic SDP problem is to find F1 (·). Thus.2. Thus. From a statistical perspective. ability to identify important interactions. . ut . Response observations for the future value function may be collected via the minimization in Eq. MARS was chosen due to its ease of implementation in high dimensions. See the description in Chen [16] or Chen et al. . and easy derivative calculations. 3. 3. Numerical solution algorithm Given the state of the system at the beginning of the current stage.1.

However. and P be the set of n design points in D. There are two basic ways to measure space-filling quality: minimax and maximin (refer to [32]). / Computers & Operations Research 34 (2007) 70 – 90 77 with equivalent N to the strength three OAs. Cervellera et al.C. To avoid this. then choose the best of the bunch. then uses multi-linear interpolation to estimate the future value function. Measuring space-filling quality There are a variety of ways to generate OAs (e. contains all pn level combinations.1. 4. . Therefore. using a full factorial grid or some type of sampling scheme. Two configurations of design points. the one on the left is seen to space the points better. each at p levels. the classical solution method discretizes the state space with a full grid. A design that represents D well will have lower MAXDIST(D. this measure is usually calculated over a finite sample of points in D. Let D be the n-dimensional closed and bounded set of the continuous state space.e. a full factorial grid of p d points replicated times is represented. Looking at the two configurations of points in Fig.. Chen [17] has conducted the only study of how this quality affects an approximation. 2. In practice. . y) . contains all possible level combinations in any subset of d dimensions. However. each at p levels. A full factorial design for n variables. OAs are special forms of fractional factorial designs. with the same frequency. define: MAXDIST(D. this measure is generally impractical to calculate in high dimensions. however.g. x∈D y∈P Since the supremum over the infinite set of points in D is difficult to calculate. P) = sup min d(x. may be employed. and only OAs of strength three were studied. Minimax criteria seek to minimize the maximum distance between (nondesign) points in D and the nearest-neighbor design point in P. i. for which the number of points grows exponentially as the dimension increases. This full grid is equivalent to a full factorial experimental design. their performance was disappointing. 4.. 2. taking a subset of the full grid. [48–53]). An OA of strength d (d < n) for n variables. P) than a design that does not. 4 3 x1 2 1 0 • • • • • 0 1 2 x2 3 4 x1 4 3 2 1 0 • • • • • 0 1 2 x2 3 4 Fig. when projected down onto any d dimensions. calculate various space-filling measures. there is no way to guarantee their space-filling quality. Using the same distance measure as above. Discretizations based on OAs In continuous-state SDP. a fractional factorial design. one can generate many OAs. but with lower order balance permitting higher p.

a total of 6 OAs and 12 OA–LH designs were tested. For each size (N = 1331 and 2197). 5. Haykin [55]. we utilize the strength three OAs described in Chen [17]. For the OA designs in this paper. the underlying OA provides some balance. one “good”. The input layer consists of n nodes. only OAs of strength two and higher are technically orthogonal. n predictors and one response. y) be a distance measure (commonly Euclidean distance) over R.2. and for the water reservoir problem. so the LH design does not maintain this property. for statistical perspectives see White [57]. and the output layer consists of one node for the output variable. In other words. eight-dimensional designs were generated. nine-dimensional designs were generated. / Computers & Operations Research 34 (2007) 70 – 90 Maximin criteria seek to maximize the minimum distance between any pair of points in P. an ANN model is a nonlinear parameterized function that computes a mapping of n input variables into (typically) one output variable. and two realizations of OA–LH designs were generated for each OA. Artificial neural networks 5. Cherkassky and Mulier [56]. Connections between nodes can “feed back” to previous . their space-filling quality was judged using measures of both types and a measure introduced by Chen [17]. The basic architecture of ANNs is an interconnection of neural nodes that are organized in layers. one for each input variable. the points will correspond to N distinct levels in that dimension. Barron et al. With regard to approximating a function. the LH design provides more information along a single dimension (for univariate terms). Let d(x. Ripley [59]. for approximation capabilities see Barron [61]).78 C. Cervellera et al. For the inventory forecasting problem. 4. Model description ANNs are mathematical modeling tools widely used in many different fields (for general background and applications see Lippmann [54]. i. OA-based LH designs LH designs are equivalent to OA designs of strength one. Bertsekas and Tsitsiklis [14]. [58]. and one “poor” OA was selected based on space-filling measures described above. However. From a purely “analytical” perspective. one “neutral”. Although OA–LH designs only satisfy LH properties. was introduced by Tang [31] to address the lack of orthogonality in LH designs. Activation functions define transformations between layers. and the OA–LH designs are based on these strength three OAs.1. if we project the N points of a LH design onto any single dimension. ⎭ x∈P ⎩ y∈P y=x Higher MINDIST(P) should correspond to a more uniform scattering of design points. A hybrid OA–LH design. Then define: ⎧ ⎫ ⎨ ⎬ MINDIST(P) = min min d(x. In between there are hidden layers which induce flexibility into the modeling. the simplest one being a linear function. while the OA design provides better coverage in higher dimensions (for interactions).e. In this paper. for which an LH design is (randomly) created from an OA structure. y) .. Thus. we employed strength three OAs with p = 11 levels (N = 1331) and p = 13 levels (N = 2197) in each dimension. Specifically. or Cheng and Titterington [60].

in order to compute the gradient with respect to the entire set of parameters. where ∇LOF(wk ) is the gradient of the LOF function with respect to the vector wk .2.a. and (ii) a backward phase in which the gradient is actually computed on the basis of the values obtained in the forward phase. where for each hidden node h n Zh = Tanh i=1 vih xi + h .e.. vj h as weights linking input nodes j to hidden nodes h. but for function approximation. due to the . a. we will refer to the entire vector of parameters as w in the following sections. Implementation of this approach is very simple. the training data set. / Computers & Operations Research 34 (2007) 70 – 90 79 layers. Cervellera et al. then “backpropagates” the partial derivatives through the hidden layers. A comprehensive description of various ANN forms may be found in [55]. In our inventory forecasting SDP application. This is completed in two phases: (i) a forward phase in which an input xj is presented to the network in order to generate the various outputs (i. Nonlinear optimization techniques used to conduct the minimization are typically gradient descent-based methods. the descent algorithm step is completed by upgrading the vector of parameters w using the steepest-descent method: wk+1 = wk − k ∇LOF(wk ). After all N samples are presented to the network and the corresponding gradients are computed. the output of the network and those of the inner activation functions) corresponding to the current value of the parameters. ) = Tanh ˆ h=1 wh Z h + . The parameters vih . wh as weights linking hidden nodes h to the output node. we utilized a feedforward one-hidden-layer ANN (FF1ANN) with a “hyperbolic tangent” activation function: Tanh(x) = (ex − e−2x )/(ex + e−x ). We used the backpropagation method (see [62.k.C. Model fitting Our training employed a batch processing method (as opposed to the more common on-line processing) that minimized the squared error loss lack-of-fit (LOF) criterion: N LOF(g) = ˆ j =1 [yj − g(xj . w)]2 . since the computation of the gradient requires only products. the typical ANN is feedforward only with one hidden layer and uses “sigmoidal” activation functions (monotonic and bounded). which computes the gradient of the output with respect to the outer parameters. wh . 5. w. v. and h and as constant terms called “bias” nodes (like intercept terms).63]). Define n as the number of input variables. For simplicity. h . . and are estimated using a nonlinear least-squares approach called training. Then our ANN model form is H g(x. H as the number of hidden nodes.. ˆ where N is the number of data points on which the network is being trained.

where e(w) = [y1 − g(x1 . determines the architecture. When = 0 we have a standard Newton method with an approximated Hessian matrix (Gauss–Newton method). For a FF1ANN. After each successful step (i. For our applications. as well as methods to avoid or reduce overfitting (see [55]). Each row of the matrix contains the gradient of the jth single error term yj − g(xj . the network merely “memorizes” the data at the expense of generalization capability). Unfortunately. yN − g(xN .e. H. the best results for the eight-dimensional water reservoir problem were obtained using H = 10 for both sample sizes. Cervellera et al. is the Levenberg–Marquardt method (first introduced in [64]). .3. Model implementation Other than the selection of a model fitting approach. which is more accurate near the minima.e. it is not possible to define “a priori” the optimal number of hidden nodes.2. . . Thus.. the choice of the stepsize k is crucial for acceptable convergence. Too few nodes would result in a network with poor approximation properties. such as entrapment in local minima. the number of nodes in the hidden layer. but the choice of a good architecture requires several modeling attempts and is always very dependent on the particular application. we identified the appropriate number of hidden nodes by exploring many architectures and checking the average cost using a test data set from the simulation described in Section 6. while it is increased when the step degrades the LOF. The best training algorithm for FF1ANNs. The best results for the nine-dimensional inventory forecasting problem were obtained for values of H = 40 for N = 1331 and H = 60 for N = 2197. and very different stepsizes may be appropriate for different applications. since it depends on the number of samples available and on the smoothness of the function we want to approximate.e. Similarly. There are many rules of thumb and methods for model selection and validation in the literature. it makes use of the Hessian matrix) Newton method. 5. . when the LOF decreases) is decreased. w) with respect to the set of parameters w. w). the method is close to gradient descent with a small stepsize. . the most important decision in implementing an ANN model is the choice of architecture. while too many would “overfit” the training data (i. backpropagation presents many well-known disadvantages.80 C. in terms of rate of convergence. a tradeoff between these two possibilities is needed. For larger values of H the solutions worsened. the appropriate number of hidden nodes is never clear-cut. ˆ The update rule for the parameter vector at each step is wk+1 = wk − [J [e(wk )]T J [e(wk )] + I ]−1 J [e(wk )]T e(wk )T . w)]T ˆ ˆ and J [e(w)] is the Jacobian matrix of e(w). As a consequence.. / Computers & Operations Research 34 (2007) 70 – 90 mathematical structure of the network. which is an approximation of the classic second-order (i. Therefore. Since our LOF criterion is quadratic. Unfortunately.. we can approximate its Hessian matrix via backpropagation as J [e(w)]T J [e(w)]. Existing theoretical bounds on the accuracy of the approximation are non-constructive or too loose to be useful in practice. When is large.

the same Mmax or H for all stages is appropriate. An issue that may arise in other SDP applications is the selection of the key parameter. From the perspective of SDP value function approximation. smoothness to achieve a certain degree of continuity may be applied. After selection of the basis functions is completed. / Computers & Operations Research 34 (2007) 70 – 90 81 6. such as Matlab’s (www. From a statistical perspective. however. Memory and computational requirements The memory and computational requirements for a FF1ANN with one output variable depend on the number of input (predictor) variables n. together with the existence of accessible software. Cervellera et al.com). two-factor interactions were sufficient. quintic functions were used to achieve second derivative continuity (see [16]). The search for new basis functions can be restricted to interactions of a maximum order. Their main difference is the interpretability of the approximation.C. the number of hidden nodes H. This is very similar to the difficulty in selecting H for an ANN. When different Mmax or H is preferred for different stages. the backward procedure was not utilized since preliminary testing demonstrated a very large computational burden with little change on the approximation. In addition. MARS was restricted to explore up to three-factor interations.com) Neural Network Toolbox. which was used in our study. For MARS. For the inventory forecasting problems. like ANNs. for the water reservoir problem. The second disadvantage occurs because it is tempting to try to fit the data as closely as possible even when it is not appropriate. the key quantities are n. our own C code was employed. Mmax for MARS and H for ANNs. but the approximation degraded. For our study. 6. In our case. N. a parameter called Mmax .salford-systems. Chen [16] also considered MARS limited to two-factor interactions. due to their proven ability to fit any data set and to approximate any real function. commercial MARS software has only recently become available from Salford Systems (www. then the SDP algorithm would greatly benefit from automatic selection of Mmax or H. The primary disadvantages are (1) the difficulty in selecting the number of hidden nodes H and (2) potential overfitting of the training data. as is the case with SDP. a test data set can be used to select Mmax . but this process greatly increases the computational effort required. but this is not a critical issue for accurate SDP solutions. ANNs have the additional disadvantage of an approximation that does not provide interpretable information on the relationships between the input (predictor) variables and output (response) variable. However. The approximation bends to model curvature at “knot” locations. MARS [18] is based on traditional statistical thinking with a forward stepwise algorithm to select terms in a linear model followed by a backward procedure to prune the model. This is an issue for future research. and. In this section. and the maximum number of basis functions Mmax . MARS and ANNs are both excellent for modeling in high dimensions when noise is small relative to the signal.1. we will explore how the requirements might grow with increasing n .mathworks. ANNs and MARS comparisons ANNs have become very popular in many areas of science and engineering. The primary disadvantage of MARS is that the user must specify the maximum number of basis functions to be added. and one of the objectives of the forward stepwise algorithm is to select appropriate knots. MARS is both flexible and easily implemented while providing useful information on significant relationships between the predictor variables and the response. For the inventory forecasting and water reservoir problems. the number of eligible knots K. and the size of the training data set N. thus. Both disadvantages can be handled by proper use of a test data set. the value functions have similar structure from stage to stage.

However. The time for the computation of LM backpropagation is dominated by the Hessian . [15] set N =O(n3 ). which are convex and. and a more reasonable upper limit on Mmax is O(N 2/3 ) = O(n2 ). if the memory at disposal is sufficient. However. Chen et al. H depends on n. For what concerns (ii). The inventory forecasting value function. Another aspect of memory requirements is memory required while constructing the approximation. which is the one we actually used for our FF1ANNs. given an adequate K. LM backpropagation is generally preferred over other training methods because of its faster convergence. (i) is performed via optimization that requires the evaluation of the output of an ANN and of its first derivative. (3) for one stage t at every experimental design point and (ii) constructing the spline approximation of the future value function. and O(n) Mmax O(n3 ). consequently. where W is the work required to solve the minimization. so we can write the total time required to solve the SDP equation for all N points as O(WnNH). Based on computational experience with a variety of value function forms. However. Each step requires the storage of a (N×V ) matrix and the (V × V ) Hessian approximation. it should not be expected to increase with n since the larger K primarily impacts the quality of univariate modeling. the addition of the OA–LH designs permits the placement of a higher number of knots. Thus. / Computers & Operations Research 34 (2007) 70 – 90 (the dimension of an SDP problem). One should keep in mind that the above estimates are intended to be adequate for approximating SDP value functions. the total number of bytes required to store a FF1ANN grows as O(n2 ) SFF1ANN = 8V O(n3 ). a FF1ANN requires storage of V = [(n + 1)H + H + 1] parameters. The optimization evaluates a value function approximation and its first derivative several times in each stage.82 C. Cervellera et al. K 50 is more than adequate. which assumes N is based on a strength three OA. N. and the complexity of the underlying function. the largest FF1ANN tested in the paper required 5288 bytes and the largest MARS required 36. so for n = 9. however. The computational requirements for one iteration of our SDP statistical modeling solution approach consist of: (i) solving the optimization in Eq. and a disadvantage of the LM method lies in its memory requirements. K =O(n). which assumes knots can only placed at the p distinct levels of the strength three OA. For memory requirements. For ANNs. resulting in a growth of O(n4 ) S2MARS = O(NMmax ) O(n5 ). In this paper. MARS construction stores up to a (N ×Mmax ) matrix. A reasonable range on H (in terms of growth) matches that of O(n) H O(N 2/3 ) = O(n2 ). For FF1ANNs. except the last. we consider LM backpropagation. The time required for these evaluations is O(nH). so the memory needed grows as O(n5 ) S2FF1ANN = O(NnH) + O(n2 H 2 ) O(n6 ).800 bytes. it appears that the memory requirement for FF1ANNs grows faster than that for MARS. does involve a steep curve due to high backorder costs. We utilized Levenberg–Marquardt (LM) backpropagation to train the FF1ANN. FF1ANNs are still considerably smaller. The maximum total number of bytes required to store one smoothed MARS approximation with up to three-factor interactions grows as [15] ∗ O(n) SMARS = 8(6Mmax + 2n + 3Kn + 1) O(n2 ). Assuming each of these is a double-precision real number. generally well behaved.

The total run time for one iteration of the MARS-based SDP solution grows as [15] O(n7 ) CMARS = O(WNMmax ) + O(nNM3 ) O(n10 ). however. it can been seen that the run times for the same N are comparable. Here. WLM never needs to exceed 300. In order to conduct a fairer computational comparison on the same programming platform. / Computers & Operations Research 34 (2007) 70 – 90 Table 1 Actual average run times (in minutes) for one SDP iteration using FF1ANN and MARS models Model N Inventory forecasting 450 MHz P2 ANN MARS ANN MARS 1331 1331 2197 2197 65 min (H = 40) 185 min (H = 60) 550 MHz P2 61 min (Mmax = 300) 112 min (Mmax = 400) 933 MHz P2 90 min (H = 40) 87 min (H = 60) Water reservoir 3. the total run time for one iteration of the FF1ANN-based SDP solution grows as O(n7 ) CFF1ANN = O(WNnH) + [O(WLM n2 NH2 )] O(n9 ). Overall. an efficient algorithm can provide a computational time that is O(V 2 ) = O(n2 H 2 ).06 GHz IX 83 13 min (H = 10) 43 min (Mmax = 100) 18 min (H = 10) 67 min (Mmax = 100) The FF1ANN runs used WLM = 125. Thus. calling the MARS C code in Matlab resulted in significant slow down. Actual average run times for one SDP iteration are shown in Table 1.C. For the inversion of a (V × V ) matrix. Despite this. max where the time to construct the MARS approximation (second term) dominates. Unfortunately. For the inventory forecasting problem. all runs were conducted in the same place. For the water reservoir problem. a completely fair comparison that uses the same programming platform is not readily implementable. however. For the inventory forecasting problem. the MARS and ANN runs were conducted on different machines in different countries. the slow down experienced by calling the MARS C code from Matlab is obvious. For the water reservoir problem. Matlab was used to conduct the FF1ANN runs on two Pentium II (P2) machines. the computation is basically reduced to [H + 1 + (n + 1)H ] products. Cervellera et al. Define WLM to be the number of steps required to attain the backpropagation goal. The (V ×V ) Hessian approximation is the product of two (N ×V ) matrices that are obtained by backpropagation. the ability to call C code from Matlab is still flawed. Therefore. our experience demonstrates that the ANN model is clearly easier to implement. . so the time required to obtain the gradient for all N experimental design points is O(NnH). considerable effort was expended to develop a Matlab code that could access our MARS C code. approximation and inversion. at this time. The matrix product requires NV 2 real valued products. the time needed to compute one step of LM backpropagation is dominated by the matrix product: O(Nn2 H 2 ). for each row of the matrix (which corresponds to a single experimental design point xj ). and the MARS runs suffered significant slow down.) Then the time required to train the network by LM backpropagation grows as O(WLM Nn2 H 2 ). so this operation (which is O(Nn2 H 2 )) occupies the relevant part of the time required to compute the approximation. (Typically. and a C code executed the MARS runs on a different P2 machine. Unfortunately. all runs were conducted in Matlab on an Intel Xeon (IX) workstation.

Although several FF1ANN architectures were tested.2. and the fourth line provides Mmax for MARS and the number of hidden nodes (H) for ANNs. / Computers & Operations Research 34 (2007) 70 – 90 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • Absolute Error ANN MARS MARS ANN MARS MARS ANN MARS MARS OA3G OA3G OA3G OA3N OA3N OA3N OA3P OA3P OA3P p11 p11 p11 p11 p11 p11 p11 p11 p11 H40 200 300 H40 200 300 H40 200 300 Fig. Inventory forecasting: Boxplots illustrate the distribution of absolute error for all 9 SDP solutions using strength 3 OAs with p = 11 (N = 1331). 4. and for the water reservoir problem. Labeling below the boxplots uses the same notation as that in Fig. only one FF1ANN architecture for each design was employed to generate the plots in Figs. 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 • • • • • • • Absolute Error • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • ANN MARS MARS ANN MARS MARS ANN MARS MARS OA3G OA3G OA3G OA3N OA3N OA3N OA3P OA3P OA3P p13 p13 p13 p13 p13 p13 p13 p13 p13 H60 300 400 H60 400 H60 300 400 300 Fig. Mmax = 100. 6. H = 10 for both N. Below each boxplot. Cervellera et al. H = 40 for N = 1331 and H = 60 for N = 2197. The third line specifies p = 11. For the inventory forecasting problem. For the OAs we set K = 9 for p = 11 and K = 11 . Several MARS approximations were constructed for each design by varying Mmax and K. the first line indicates the use of ANNs or MARS. 3. OA3N = “neutral” strength 3 OA. The second line indicates the type of OA. and OA3P = “poor” strength 3 OA. SDP solutions Both the nine-dimensional inventory forecasting problem and the eight-dimensional water reservoir problem were solved over three stages (T = 3). 3. specifically: OA3G = “good” strength 3 OA. 3–8 For the inventory forecasting problem. Inventory forecasting: Boxplots illustrate the distribution of absolute error for all 9 SDP solutions using strength 3 OAs with p = 13 (N = 2197).84 C. and for the water reservoir problem. results with different Mmax are shown.

Included from Fig. / Computers & Operations Research 34 (2007) 70 – 90 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 85 • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • Absolute Error ANN ANN MARS MARS MARS ANN ANN MARS MARS MARS OA3P LH1G OA3P LH2G LH2N OA3N LH1N OA3P LH2G LH2P p11 N1331 p11 N1331 N1331 p11 N1331 p11 N1331 N1331 300 H40 H40 200 200 200 H40 H40 300 300 K35 K17 K9 K9 Fig. 5. 5. . Inventory forecasting: Boxplots illustrate the distribution of absolute error for 10 SDP solutions using N = 1331. 4 are the best (ANN/OA3G. a fifth line appears for the OA–LH/MARS solutions indicating the number of eligible knots (K) utilized in the MARS algorithm. For the OA–LH designs K was chosen as high as 50. The fourth line provides Mmax for MARS and the number of hidden nodes (H) for ANNs. MARS/OA3P/Mmax = 400 solutions. 3. Simulations using the “reoptimizing policy” described in Chen [16] were conducted to determine the optimal decisions and to test the SDP solutions. MARS/OA3P/Mmax =200) and worst (ANN/OA3N. Below each boxplot. The third line indicates the size of the design. specifically: OA3 notation as in Fig. and similarly for LH1N. 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 • • • • • • • • • • • • • • • • • • • • • Absolute Error • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • ANN ANN MARS MARS OA3G LH2G OA3G LH2G p13 N2197 p13 N2197 H60 H60 400 400 K11 MARS ANN ANN MARS MARS MARS LH2N OA3N LH1P OA3P LH2P LH1P N2197 p13 N2197 p13 N2197 N2197 300 H60 H60 400 400 400 K17 K11 K17 Fig. specifying p = 11 for the OAs and N = 1331 for the OA–LH designs. LH2N. Finally. MARS/OA3P/Mmax =300) solutions. LH1P. Inventory forecasting: Boxplots illustrate the distribution of absolute error for 10 SDP solutions using N = 2197 experimental design points. and similarly for the worst solutions (in the right half). Labeling below the boxplots uses the same notation as that in Fig. the first line indicates the use of ANNs or MARS. The second line indicates the type of experimental design. LH1G = first LH randomization based on a “good” strength 3 OA. The three best OA–LH-based solutions are shown beside the best OA-based solutions (in the left half). MARS/OA3G/Mmax = 400) and worst (ANN/OA3N. 3 are the best (ANN/OA3P. Cervellera et al. Included from Fig. and LH2P. 6.C. LH2G = second LH randomization based on a “good” strength 3 OA. for p = 13 (the maximum allowable in both cases).

8. Labeling below the boxplots uses the same notation as that in Fig. For each initial state vector. MARS/OA3N/p13) and worst (ANN/OA3P/p11. 6. Inventory forecasting problem For each of 100 randomly chosen initial state vectors.1. MARS/OA3P/p13) solutions. Included from Fig. 7 are the best (ANN/OA3N/p11. . 5. Water reservoir: Boxplots illustrate the distribution of absolute error for 12 SDP solutions. Boxplots in Figs. 3. 10 9 8 7 Absolute Error 6 5 4 3 2 1 0 ANN ANN MARS OA3N LH1G LH1G p11 N1331 N1331 H10 H10 100 K50 MARS OA3N p13 100 ANN MARS ANN ANN LH1N LH1G OA3P LH1P N2197 N2197 p11 N1331 H10 100 H10 H10 K50 MARS LH1P N1331 100 K50 MARS OA3P p13 100 ANN MARS LH1G LH1N N2197 N2197 H10 100 K50 Fig. / Computers & Operations Research 34 (2007) 70 – 90 10 9 8 7 Absolute Error 6 5 4 3 2 1 0 ANN MARS ANN MARS ANN MARS ANN MARS ANN MARS ANN MARS OA3G OA3G OA3N OA3N OA3P OA3P OA3G OA3G OA3N OA3N OA3P OA3P p13 p13 p13 p13 p13 p13 p11 p11 p11 p11 p11 p11 H10 100 H10 100 H10 100 H10 H10 100 100 H10 100 Fig. the smallest mean cost achieved by a set of 90 SDP solutions was used as the “true” mean cost for computing the “errors” in the mean costs. each over 1000 sequences. 3–6 display the distributions (over the 100 initial state vectors) of these errors. a simulation of 1000 demand forecast sequences was conducted for the SDP solutions. Water reservoir: Boxplots illustrate the distribution of absolute error for all 12 SDP solutions using strength 3 OAs with p = 11 (N = 1331) and p = 12 (N = 2197). Cervellera et al. The average over the 100 “true” mean costs was approximately 62.86 C.2. The simulation output provided the 100 mean costs (corresponding to the 100 initial state vectors). Labeling below the boxplots uses the same notation as that in Fig. 7.

C. The simulation output provided the 50 mean costs (corresponding to the 50 initial state vectors). 6. the “best” solution is “MARS OA3G p13 400”. which used the poor OA and Mmax = 400. The OA/FF1ANN solution with the good OA is not terrible. the “best” solution is with the good OA. For the OA/FF1ANN solutions with p = 11 (N = 1331). The average over the 50 “true” mean costs was approximately −47. the one that used the neutral OA3 (there is actually another outlier above 15). while the “worst” is with the neutral OA. The OA/FF1ANN and OA/MARS solutions are repeated from Figs. which is “MARS OA3P p11 300”. For the FF1ANN solutions with p = 13 (N = 2197). but it is visibly one of the “worst” solutions. 3 and 4. while the “worst” is “MARS OA3P p13 400”.2. It is troubling because it uses a LH design generated from the good OA. that good OAs result in better SDP solutions on average. Among the SDP solutions with p = 11 (N = 1331) the “worst” one is clearly “ANN OA3P p11 H = 10”. having a significantly longer box. except for the “worst” one. because the randomization used in generating them may have resulted in a poorly spaced realization. which used the second OA–LH realization from the good OA. Cervellera et al. Mmax = 400.. each over 100 sequences. The interesting observation here is that the FF1ANNs performed better with the OA–LH designs. The “best” one is not as easy to identify. with the best solutions on the left and worst on the right.17]. as was illustrated in Chen [16. Note the solution for “ANN LH1N N1331 H40”. but it is actually a good solution. the one that used the poor OA and Mmax =300. a simulation of 100 demand forecast sequences was conducted for the SDP solutions.2. The other MARS OA3P solution (with Mmax = 200) is used to represent the “best” OA/MARS solution. indicating that poor solutions can be avoided by not using a poor OA. The “best” one is “ANN OA3P p11 H40”. i. but “MARS OA3N p13 100” is seen to have the lowest median. However. although it is essentially equivalent to the other good OA/MARS solutions. but “ANN OA3N p11 H = 10” is seen to have the lowest median. 5 and 6 display the best and worst solutions for each of the four types of SDP solutions. The “errors” in mean costs were computed in the same manner as the inventory forecasting problem. Ideally. They are organized left to right by OA design size and by the space-filling quality of the OA design. is “MARS OA3P p13 100”. sometimes it does not happen this way because the poor qualities are located in regions of the state space that are not critical to finding a good solution. there is one disconcerting solution. Among the SDP solutions with p = 13 (N = 2197) the “worst” one. however. . It should be noted in Figs. 7 displays all the OA-based SDP solutions. For the MARS solutions with p = 13 (N = 2197). we would want good OAs to result in better solutions than poor OAs. 3 and 4 display all 18 of the OA-based SDP solutions.e. Figs. using the poor OA. Water reservoir problem For each of 50 randomly chosen initial state vectors. This agrees with the findings in the previous section. i. 7 and 8 display the distributions (over the 50 initial state vectors) of these errors. Boxplots in Figs. which used the first OA–LH realization from the neutral OA. the “worst” one is “ANN OA3N p11 H40”. / Computers & Operations Research 34 (2007) 70 – 90 87 Figs. The best solution overall is “MARS LH2G N2197 400 K11”. For both sizes. the worst solutions employed the poor OA. To some extent the goodness measures are no longer valid for the OA–LH designs. Fig. The OA/MARS solutions with p =11 (N =1331) are all fairly close. but is worst than most of the OA/MARS solutions because of its higher median. This is the “worst” solution among the FF1ANNs using OA–LH designs. 3 and 4..e. They are organized left to right by the “goodness” of the OA design. and K = 11. and that is “MARS LH2G N1331 300 K35”. All the other solutions for this size design are somewhat comparable. which used the good OA and Mmax = 400.

Beck MB. Tsitsiklis JN.52(1):1–12. Tejada-Guibert JA. 1994. A small advantage to MARS is the information on the relationships between the future value function and the state variables that may be gleaned from the fitted MARS model. Bertsekas DP. so as to permit flexibility in the approximation across different SDP stages. Unfortunately. Stedinger JR. For the inventory forecasting problem. 7.18(5):55–61. Unlike the inventory forecasting problem. the water reservoir problem should not require many interaction terms in the value function approximation. Kitanidis PK. Series C 2003. thus. while FF1ANN solutions noticeably improved with OA–LH designs. New York. Interfaces 1985. Shoemaker CA. Like the findings in the previous section. With regard to both computational SDP requirements and quality of SDP solutions. Solving for an optimal airline yield management policy via statistical learning. Journal of Water Resources Planning and Management 1997. 1957. 8 displays the best and worst solutions from Fig. Günther D.132:207–21.41:484–500. Cervellera et al. Chen VCP. Water Resources Research 1988.123:23–9. Stochastic dynamic programming formulation for a wastewater treatment decision-making framework. Dynamic programming and optimal control. 2000. The figure is organized with the best solutions on the left and worst on the right. For the SDP solutions using OA designs. Some improvement is seen using the OA–LH designs. Concluding remarks The statistical modeling approach for numerically solving continuous-state SDP was used to simultaneously study OA–LH designs and an ANN-based implementation as alternatives to the OA/MARS method of Chen et al. Foufoula-Georgiou E. Operations Research 1982. White DJ. Numerical solution of continuous-state dynamic programs using linear and spline interpolation. Operations Research 1993. Gal S.30:40–61.15(6):73–83. Markov decision processes. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] Bellman RE. Further real applications of Markov decision processes. Optimal management of a multireservoir water supply system. Current research is developing a method to automatically select Mmax for MARS. Gradient dynamic programming for stochastic optimal control of multidimensional water resources systems.88 C. Annals of Operations Research. a method for automatically selecting H for ANNs does not seem possible. Johnson SA. Journal of the Royal Statistical Society. In addition. the worst OA–LH/FF1ANN solutions are not bad. Shoemaker CA. Chen J. Dynamic programming. overall. Tsai JCC. 7 along with the best and worst OA–LH/FF1ANN and OA–LH/MARS solutions. Li Y. we judged FF1ANNs and MARS to be very competitive with each other overall. Interfaces 1988. involving significant interaction terms. the increased univariate information from LH designs should be an advantage. NY: Wiley.15:737–48. White DJ. the space-filling quality of the OA designs was considered. Johnson EL. Culver TB. Optimal integrated control of univoltine pest populations with age structure. supporting the expectation that ANN models perform well with OA–LH designs. MA: Athena Scientific. [15] and Chen [16]. MARS performed similarly using both OA and OA–LH designs. Chen VCP. Shoemaker CA. Dynamic optimal ground-water reclamation with treatment capital costs. / Computers & Operations Research 34 (2007) 70 – 90 Fig. Water Resources Research 1979. good solutions could be achieved using MARS models with either good OA or OA–LH designs and FF1ANN models with good OA–LH designs. Special Issue on Applied Optimization Under Uncertainty 2004. good OAs reduced the possibility of a poor SDP solution. Belmont. Real applications of Markov decision processes. Princeton: Princeton University Press. Puterman ML. For the water reservoir problem. [12] .24:1345–59.

Kabiri K. International Journal of Reliability and Applications 2001. Cervellera et al. August 2001. Handbook of statistics: statistics in industry. Yee SS. Environmental Science and Technology 2001. Data collection plans and meta models for chemical process flowsheet simulators. Industrial and Engineering Chemistry Research 2001. [37] Simpson TW. Korea. Comparison of two nonparametric estimation schemes: MARS and neural networks. [35] Pham DT. Journal of Geochemical Exploration 1997. [32] Chen VCP. [21] Kuhnert PM. Do K-A. Application of MARS and orthogonal arrays to inventory forecasting stochastic dynamic programs.C. [24] Labossiere P. Friedman JH. [18] Friedman JH. Walsh MG. [36] Palmer KD. Development of reflectance spectral libraries for characterization of soil properties. Bernshteyn MA. Neuro-dynamic programming. Combining non-parametric models with logistic regression: an application to motor vehicle injury data. Rao CR. Fisher NI. Shoemaker CA.7:386–401. Fang JQ. A review of design and modeling in computer experiments.40(24):5719–23. 231–61. Journal of the American Helicopter Society 2002. Psichogios DC. McClure R. NosseirAI. Barton RR. 1996. Seoul. Tsitsiklis JN. El-ArabatyAM. Journal of the American Statistical Association 1993. In: Proceedings of the 53rd session of the international statistical institute.13(3):149–53.34(3):371–86. [23] Smets HMG.59(3):233–49. Statistical methods for deterministic biomathematical models. Irving MR. Parallel processing of stochastic dynamic programming for continuous state systems with linear interpolation. editors.22:85–93.B9(2):113–22. Prediction of the molecular weight for a vinylidene chloride/vinyl chloride resin during shear-related thermal degradation in air by using a back-propagation artificial neural network. In: Khattree R. [26] Chen VCP. Peat BJ. Georgia Institute of Technology. Neural network-based adaptive out-of-step protection strategy for electrical power systems. Statistical techniques for the classification of chromites in diamond exploration samples. vol. Rollins DK. Jeary AP.19:1–141. . Sampling strategies for computer experiments: design and analysis.5(1):35–42. Computational Statistics and Data Analysis 1999.22(9):1216–23. Wong CK. Bogaerts WFL. Ungar LH. ORSA Journal on Computing 1995. Liu DK. Shoemaker CA.12(12):1270–80. [31] Tang B. Applying experimental design and regression splines to high-dimensional continuous-state stochastic dynamic programming. Macdonald TR. Materials & Design 1992. Computational Statistics and Data Analysis 2000.35:264–74. [25] AbdelazizAY.66(3):988–98. [20] Griffin WL. Ryan CG. Calibration of nonlinear solid-state sensor arrays using multivariate regression techniques. Lin DKJ. Operations Research 1999. Welch WJ. [17] Chen VCP. Journal of Reinforced Plastics and Composites 1993. Soil Science Society of America Journal 2002. Measuring the goodness of orthogonal array discretizations for stochastic programming and stochastic dynamic programming. Damping in buildings: its neural network model and AR model. Mansour MM.12:322–44. Measurement and Control 1999. Failure prediction of fibre-reinforced materials with neural networks. / Computers & Operations Research 34 (2007) 70 – 90 89 [13] Eschenbach EA. USA. GA.88:1392–7. School of Industrial and Systems Engineering. Bioprocess Engineering 2000. Amsterdam: Elsevier Science. Turkkan N. Atlanta.47(1):33–41. [19] Carey WP. Chen W. Journal of Quality Technology 2003. [34] De Veaux RD. PhD dissertation. Annals of Statistics 1991.35(1):157–62. Sensors and Actuators. B: Chemical 1992. [15] Chen VCP. [28] Liu J. Allen JK.30:317–41. Deriving corrosion knowledge from case histories: the neural network approach. p. MA: Athena Scientific. Caffey H. [16] Chen VCP. Multivariate adaptive regression splines (with discussion). International Journal of Engineering Intelligent Systems for Electrical Engineering and Communications 1997. Computers & Chemical Engineering 1993. [22] Shepherd KD. [27] Li QS. 22. Neural network based representation of UH-60A pilot and hub accelerations. SIAM Journal of Optimizations 2001. [33] Chen VCP. Tsui K-L.32(9):270–4. Orthogonal array-based Latin hypercubes.47:38–53. [29] Pigram GM. Engineering Structures 2000. [30] Kottapalli S. [14] Bertsekas DP. [38] Allen T. Issues regarding artificial beural network modeling for reactors and fermenters.17(8):819–37. Constructing meta-models for computer experiments. 1998.2(3):209–40. Use of neural network models to predict industrial bioreactor effluent quality. Automatic learning using neural networks and adaptive regression. Ruppert D. 2003. Belmont.

9:2–54. Technical report #989. [44] Lamond BF.. Williams RJ. Statistical science 1994. Approximating networks for the solution of T-stage stochastic optimal control problems. Cornell University. In: Barndorff-Nielsen OE. Sanguineti M. Bulletin of the Calcutta Mathematical Society 1946. [56] Cherkassky V. 2nd ed. Whitin TM. p. 192–203. [63] Werbos PK. Constructions and combinatorial problems in design of experiments.38:67–78. [58] Barron AR. Parallel distributed processing: explorations in the microstructure of cognition. Ithaca. [54] Lippmann RP. [45] Turgeon A. [55] Haykin SS. Beck MB.. Titterington DM. Networks and chaos—statistical and probabilistic aspects. IEEE Transactions on Information Theory 1993. p. [62] Rumelhart DE. Hinton GE. Statistical aspects of neural networks.6:1184–238. Computer science and statistics.33:333–40. industry. An introduction to computing with neural nets. New York: Wiley. [53] Hedayat AS. [59] Ripley BD. Statistical learning networks: a unifying view. and technology. Analysis of inventory systems. In: Rumelhart DE. Menhaj M. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transaction on Neural Networks 1994. MA: MIT Press. Su G. [46] Yakowitz S. [57] White H. [60] Cheng B. WA. p. 1991. Wegman EJ. New York: Chapman & Hall. Chen J. [52] Dey A. Upper Saddle River. Operations Research 1995. Kendall WS. NY.43: 771–80. Modelling the evolution of demand forecasts with application to safety stock analysis in production/distribution systems. McClelland JL. NJ: Prentice-Hall. [43] Archibald TW. Orthogonal arrays of strength two and three. / Computers & Operations Research 34 (2007) 70 – 90 [39] Baglietto M. Seattle. [48] Rao CR. Neural networks: a review from a statistical perspective (with discussion). 1985. Learning internal representations by error propagation. Sobel MJ. Learning in neural networks: a statistical perspective. Neural networks: a comprehensive foundation. 2005. New York: Wiley. Dynamic programming applications in water resources. to appear. Zoppoli R.56:307–24. 1986. Mulier F. School of Operations Research & Industrial Engineering.5:989–93. [41] Hadley G. Wallis WD. Hypercubes of strength ‘d’ leading to confounded designs in factorial experiments. Cambridge. Zoppoli R.17:1565–70. McKinnon KIM. 1993. Annals of Statistics 1978. Parisini T. In: Topics on participatory planning and managing water systems. Statistical learning within a decision-making framework for more sustainable urban environments. [42] Heath DC. In: Proceedings of the joint research conference on statistics in quality. Journal of Statistical Planning and Inference 1996. Neural Computation 1989.39:930–45. 1992. Stufken J. Training feedforward networks with the Marquardt algorithm. IEEE ASSP Magazine (1987) 4–22. Hadamard matrices and their applications. Cervellera C. Jensen JL.23:508–24. Exact and approximate solutions of affine reservoirs models. In: Proceedings of the 20th symposium on the interface. 1971. 1963. In: Proceedings of the IFAC workshop on adaptation and learning in control and signal processing. Englewood Cliffs. NJ: Prentice-Hall. editors. Learning from data: concepts theory. Annals of Mathematical Statistics 1952.1:425–64. Amsterdam: Elsevier Science Ltd. Water Resources Research 1981. 1998. New York: Wiley. [51] Hedayat AS. On difference schemes and orthogonal arrays of strength t. . Bush KA. [40] Baglietto M. Water Resources Research 1982. Orthogonal fractional factorial designs. 1994. The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. An aggregate stochastic dynamic programming model of multireservoir Systems. [49] Bose RC. Water Resources Research 1997. A decomposition method for the long-term scheduling of reservoirs in series. Jackson PL. 318–62. [50] Raghavarao D. Cervellera et al. Barron RL. 2001. New York: Wiley. and methods. Water reservoirs management under uncertainty by approximating networks and learning from data. [64] Hagan MT.90 C. 1999. Sanguineti M. Cervellera C. [47] Chen VCP. June 2000.18:673–96. editors. 40–123. Thomas LC. [61] Barron AR.

Master your semester with Scribd & The New York Times

Special offer for students: Only $4.99/month.

Master your semester with Scribd & The New York Times

Cancel anytime.