Abstract
We consider a product rental network with a fixed number of rental units distributed across
multiple locations. The units are accessed by customers without prior reservation and on an
on-demand basis. Customers are provided with the flexibility to decide on how long to keep a
unit and where to return it. Because of the randomness in demand and in the length of the
rental periods and in unit returns, there is a need to periodically reposition inventory away from
some locations and into others. In deciding on how much inventory to reposition and where, the
system manager balances potential lost sales with repositioning costs. Although the problem is
increasingly common in applications involving on-demand rental services, little is known about
the nature of the optimal policy for systems with a general network structure or about effective
approaches to solving the problem. In this paper, we address these limitations. First, we offer a
characterization of the optimal policy. We show that the optimal policy in each period can be
described in terms of a well-specified region over the state space. Within this region, it is optimal
not to reposition any inventory, while, outside the region, it is optimal to reposition but only
such that the system moves to a new state that is on the boundary of the no-repositioning region.
We also provide a simple check for when a state is in the no-repositioning region. Second, we
leverage the features of the optimal policy, along with properties of the optimal cost function, to
propose a provably convergent approximate dynamic programming algorithm to tackle problems
with a large number of dimensions. We provide numerical experiments to illustrate the effectiveness
of the algorithm and to highlight the impact of various problem parameters.
Keywords: product rental networks; vehicle sharing; inventory repositioning; optimal policies;
approximate dynamic programming algorithms
∗ saif@umn.edu
† drjiang@pitt.edu
‡ xiang.li2@target.com
§ iselix@nus.edu.sg
We consider a product rental network with a fixed number of rental units distributed across multiple
locations. Inventory level is reviewed periodically and, in each period, a decision is made on how
much inventory to reposition away from one location to another. Customers may pick a product
up without reservation, and are allowed to keep the product for one or more periods, without
committing to a specific return time or location. Thus, demand is random, and so are the rental
periods and return locations of rented units. Demand that cannot be fulfilled at the location at
which it arises is considered lost and incurs a lost sales penalty (or is fulfilled through other means
at an additional cost). Inventory repositioning is costly and the cost depends on both the origins
and destinations of the repositioning. The firm is interested in minimizing the lost revenue from
unfulfilled demand (lost sales) and the cost incurred from repositioning inventory (repositioning
cost). Note that more aggressive inventory repositioning can reduce lost sales but leads to higher
repositioning cost. Hence, the firm must carefully manage the tradeoff between demand fulfillment and repositioning cost.
Problems with the above features are common in practice. We are particularly motivated by a
variety of car sharing programs that allow customers to rent from one location and return vehicles
to another location. A prominent example1 is Car2Go, which has over 2.5 million registered users
and a fleet of over 14,000 vehicles in 26 cities in North America, Europe and Asia. These services
are on-demand (i.e., they do not require a reservation ahead of use) and offer one-way rentals,
where a customer may return the bike or vehicle to a location of her choice (in fact, Car2Go is
“free-floating” in the sense that there are no designated pickup or drop-off locations). A challenge
in managing these services is the spatial mismatch between vehicle supply and demand that arises
from the uncertainty in trip duration, origin, and destination and the time dependency of these
trip characteristics (e.g., the distribution of trip origin, destination, and duration may depend on
time of day). Unless adequately mitigated with the periodic repositioning of inventory, the spatial
mismatch between supply and demand can lead to significant loss in revenue. For example, some
1 Other examples that share some of the same features include bikeshare systems where customers can pick up a bike from one location and return it to any other location within the service region; shipping container rentals in the freight industry where containers can be rented in one location and returned to a different location, with locations corresponding in some cases to ports in different countries; and the use of certain medical equipment, such as IV pumps and wheelchairs, in large hospitals by different departments located in various areas of the hospital.
meters [Habibi et al., 2016]. However, repositioning can be expensive. In the case of cars, vehicles
often have to be moved one at a time using a service team and a utility vehicle2 . The following
quote from an article in New York Magazine (2015) on the operations of Car2Go illustrates this
challenge: “[Customers say] you can’t always find a car when you need one — this in spite of
the efforts of Car2Go’s ‘street team’ to distribute vehicles evenly through the coverage zone. The
40-person squad patrols the borough for vehicles that are out of gas, illegally parked, too densely
Although the problem is common in practice3 and carries significant economic costs for the
affected firms and their customers, the existing literature on this topic is limited. In particular,
how to manage these systems optimally is, to the best of our knowledge, not known. Moreover,
there do not appear to be efficient methods for computing solutions for systems as general as the
one we consider in this paper, including effective heuristics. This relative lack of results appears to
be due to the multidimensional nature of the problem as well as the lost sales feature, compounded
by the presence of randomness in demand, in rental periods, and in return locations. In this paper,
we address these limitations through two main contributions, the first theoretical and the second algorithmic.
• On the theoretical side, we offer one of the first characterizations of the optimal policy for
the inventory repositioning problem in a general network setting with multiple locations,
accounting for important features of on-demand systems, such as randomness in trip volumes,
duration, origin, and destination as well as important spatial and temporal dependencies
(e.g., the likelihood of a trip terminating at a particular location being dependent on its origin as well as trip start time).
• On the algorithmic side, we propose an approximate dynamic programming (ADP) algorithm that can effectively solve the repositioning problem to near-
optimality. We provide a proof of convergence for our algorithm that takes a fundamentally
2 http://blog.car2go.com/2017/06/26/know-relocation-car2gos-relocated/
3 Renting may become even more prevalent as the economy shifts away from a model built on the exclusive ownership of resources to one based on on-demand access and resource sharing; see Sundararajan [2016] for discussion and other examples.
different approach from existing analyses, making it a theoretical contribution to the ADP literature, independent from the repositioning application.
We formulate the problem as a stochastic dynamic program. We show that the problem in each period is one that involves solving a convex optimization
problem (and hence can be solved without resorting to an exhaustive search). More significantly,
we show that the optimal policy in each period can be described in terms of two well-specified
regions over the state space. If the system is in a state that falls within one region, it is optimal not
to reposition any inventory (we refer to this region as “the no-repositioning” region). If the system
is in a state that is outside this region, then it is optimal to reposition some inventory but only
such that the system moves to a new state that is on the boundary of the no-repositioning region.
Moreover, we provide a simple check for when a state is in the no-repositioning region, which also
allows us to compute the optimal policy more efficiently. When the problem only involves two loca-
tions, the no-repositioning region becomes a line segment, and the optimal policy can be specified
in terms of fixed thresholds (see He et al. [2018] for a treatment of this special case using a different
approach).
One of the distinctive features of the problem considered lies in its non-linear state update
function. This non-linearity introduces difficulties in showing the convexity of the problem that
must be solved in each period. To address this difficulty, we leverage the fact that the state update
function is piecewise affine and derive properties for the directional derivatives of the value func-
tion. This approach has potential applicability to other systems with piecewise affine state update
functions. Another distinctive feature of the problem is the multi-dimensionality of the state and
action spaces. Unlike many classical inventory problems, the optimal inventory repositioning policy
cannot, in general, be characterized by simple thresholds in the state space, as increasing inventory
at one location requires reducing inventory at some other locations. Instead, we show that the
optimal policy is defined by a no-repositioning region within which it is optimal to do nothing and
outside of which it is optimal to reposition to the region’s boundary. Such an optimal policy not
only generalizes the threshold policy for two-location problems (i.e., it implies a simple threshold
policy for two-location problems) but also preserves some of the computational benefits. There-
fore, the results in this paper may also be useful in informing future studies of multi-dimensional inventory problems.
Due to the curse of dimensionality, the optimal policy (and value function) can be difficult to
compute for problems with more than a small number of dimensions. To address this issue, we
leverage the results obtained regarding the structure of both the value function and the optimal
policy to develop an approximate dynamic programming (ADP) algorithm. The algorithm combines
aspects of approximate value iteration (see, for example, De Farias and Van Roy [2000] and Munos
and Szepesvári [2008]) and stochastic dual dynamic programming (see Pereira and Pinto [1991]).
Convexity of the optimal value function is leveraged to represent the approximate value function as
the maximum over a set of hyperplanes, while the no-repositioning region characterization of the
optimal policy is used to reduce the number of single-period optimization problems that need to
be solved. We also prove a new convergence result for the infinite horizon setting, showing that the
value function approximation converges almost surely to the optimal value function. We conduct
numerical experiments to illustrate the effectiveness of jointly utilizing value and policy structure,
which, to our knowledge, has not yet been explored by related methods in the literature.
The rest of the paper is organized as follows. In Section 2, we review related literature. In
Section 3, we describe and formulate the problem. In Section 4, we analyze the structure of the
optimal policy for the special case of a single period problem. In Sections 5 and 6, we use the results
from the single period problem to extend the analysis to problems with finitely and infinitely many
periods. In Section 7, we describe the ADP algorithm and provide numerical results. In Section 8, we offer concluding remarks.
Notation. Throughout the paper, the following notation will be used. We use e to denote
a vector of all ones, ei to denote a vector of zeros except 1 at the ith entry, and 0 to denote a
vector of all zeros (the dimension of these vectors will be clear from the context). Also, we write
∆_{n−1}(M) to denote the (n − 1)-dimensional simplex, i.e., ∆_{n−1}(M) = {(x_1, . . . , x_n) | Σ_{i=1}^n x_i = M, x ≥ 0}. Similarly, we use S_n(M) to denote the n-dimensional simplex with interior, i.e.,
S_n(M) = {(x_1, . . . , x_n) | Σ_{i=1}^n x_i ≤ M, x ≥ 0}. Throughout, we use ordinary lowercase letters (e.g.,
x) to denote scalars, and boldfaced lowercase letters (e.g., x) to denote vectors. The Euclidean
norm is denoted ‖·‖_2. For functions f_1 and f_2 with domain X, let ‖f_1‖_∞ = sup_{x∈X} |f_1(x)| and let ‖f_1 − f_2‖_∞ = sup_{x∈X} |f_1(x) − f_2(x)|.
There is growing literature on inventory repositioning in car and bike sharing systems; see for
example Shu et al. [2013], Nair and Miller-Hooks [2011], O’Mahony and Shmoys [2015], Freund
et al. [2016], Liu et al. [2016], Ghosh et al. [2017], Schuijbroek et al. [2017], Li et al. [2018],
Shui and Szeto [2018], and the references therein. Most of this literature focuses on the static
repositioning problem, where the objective is to find the optimal placement of vehicles before
demand arises, with no more repositioning being made afterwards (e.g., repositioning overnight for
the next day). Much of this work employs mixed integer programming formulations and focuses
on the development of algorithms and heuristics. Similarly, the papers that focus on dynamic
repositioning generally consider heuristic solution techniques and do not offer structural results
regarding the optimal policy (see, for example, Ghosh et al. [2017] and Li et al. [2018]). A related
stream of literature models vehicle sharing systems (mostly in the context of bike share systems) as
closed queueing networks and uses steady state approximations to evaluate system performance; see,
for example, George and Xia [2011], Fricker and Gast [2016], Banerjee et al. [2017] and Braverman
et al. [2018]. Moreover, Chung et al. [2018] analyzes incentive-based repositioning policies for bike
sharing. Other work considers related strategic issues such as fleet sizing, service region design,
infrastructure planning, and user dissatisfaction; see, for example, Jian et al. [2016], Raviv and
Kolka [2013], He et al. [2017], Lu et al. [2017], Freund et al. [2017], Kabra et al. [2018], and Kaspi
et al. [2017]. Comprehensive reviews of the literature on vehicle and bike sharing can be found in
There is literature that addresses inventory repositioning that arises in other settings, including
in the context of repositioning of empty containers in the shipping industry, empty railcars in
railroad operations, and cars in traditional car rentals; see, for example, Lee and Meng [2015] for a review.
However, that literature focuses on simple networks and relies on heuristics when considering more
general problems; see, for example, Song [2005] and Li et al. [2007]. To our knowledge, there are no existing results characterizing the optimal repositioning policy for a general network setting of the kind we consider.
The paper that is closest to ours is He et al. [2018], which was subsequent to an earlier version of our paper.4 Our work differs from He et al. [2018] in the following ways:
• Our model is more general than He et al. [2018] in that we allow the rental periods of cars to
be larger than one. This, we believe, is more appropriate for many real world applications.
Allowing for longer rental durations means that a problem with n locations now requires 2n inventory states so that ongoing rentals from each location
can be tracked; He et al. [2018] do not require this and formulate an n-dimensional problem.
• In He et al. [2018], the proof of convexity is based on a reformulation of the problem into a
linear program. However, it is not clear from their current proof whether this reformulation
continues to work for the case where rental periods can be greater than one. Our method,
though more complicated, works for both cases. The technical difference also suggests that our analysis is not a straightforward extension of theirs.
• He et al. [2018] characterize the optimal policy for problems with two locations. We provide
characterizations of the optimal policy for the n-location problem, which includes the two-location problem as a special case.
• He et al. [2018] approximate the problem using the robust approach, while we develop a
systematic and theoretically consistent approach for approximating the true value function.
Empirical testing of our approach shows that it achieves high-quality approximations within a reasonable amount of time.
Next, there is related literature on dynamic fleet management. This literature is concerned with
the assignment of vehicles to loads that originate and terminate in different locations over multiple
periods. Recent examples from this literature include Topaloglu and Powell [2006], Godfrey and
Powell [2002], and Powell and Carvalho [1998]. In a typical dynamic fleet management problem,
movements of all vehicles, both full and empty, are decision variables. This is in contrast to
our problem where the movement of vehicles is in part determined by uncontrolled events involving
rentals with uncertain durations and destinations, and where decisions involve only the repositioning of idle units.
4 The first version of our paper appeared online ahead of the first version of He et al. [2018]. He et al. [2018] refer to that version of our paper.
Moreover, the emphasis of the dynamic fleet management literature is on the development of solution procedures and not on the characterization of the optimal policy.
Finally, there is related literature on computational methods that can solve problems with
convex value functions. Some well-known cutting-plane-based approaches are the stochastic de-
composition algorithm of Higle and Sen [1991], the stochastic dual dynamic programming (SDDP)
method introduced in Pereira and Pinto [1991], and the cutting plane and partial sampling approach
of Chen and Powell [1999]. Our method is most closely related to SDDP, where full expectations
are computed at each iteration. Linowsky and Philpott [2005], Philpott and Guan [2008], Shapiro
[2011], and Girardeau et al. [2014] provide convergence analyses of SDDP, but these analyses are de-
signed for finite-horizon problems (or two-stage stochastic programs) and rely on an exact terminal
value function and/or the existence of only a finite number of distinct cuts.
Our algorithm is most closely related to the cutting plane methods for the infinite horizon
setting proposed in Birge and Zhao [2007] and Warrington et al. [2018]. Birge and Zhao [2007]
proves uniform convergence of the value function approximations to optimal for the case of linear
dynamics, given a strong condition that the cut in each iteration is computed at a state that approximately
maximizes a Bellman error criterion, which requires solving a difference of convex functions optimization problem (or a suitable approximation). Warrington et al. [2018]
focuses on the deterministic setting, uses a fixed set of sampled states at which cuts are computed,
and does not show consistency of their algorithm. Our algorithm removes these restrictions, yet we
are still able to show uniform convergence to the optimal value function. In particular, our analysis
allows for non-linear dynamics and cuts to be computed at states sampled from a distribution.
Furthermore, the use of policy structure (i.e., the no-repositioning zone characterization) in an ADP algorithm has, to our knowledge, not been explored in this literature.
As an alternative to cutting plane algorithms, Godfrey and Powell [2001] and Powell et al. [2004]
propose methods based on stochastic approximation (see Kushner and Yin [2003]) to estimate
scalar or separable convex functions. The main idea is to iteratively update a piecewise linear
approximation via noisy samples while ensuring that convexity is maintained. Nascimento and
Powell [2009] extend the technique to a finite-horizon ADP setting for the problem of lagged asset
acquisition (single inventory state) and provide a convergence analysis; see also Nascimento and
Powell [2010]. However, these methods are not immediately applicable to our situation, where the value function is not separable across locations.
We consider a product rental network consisting of n locations and N rental units. Inventory
level is reviewed periodically and, in each period, a decision is made on how much inventory to
reposition away from one location to another. Inventory repositioning is costly and the cost depends
on both the origins and destinations of the repositioning. The review periods are of equal length
and decisions are made over a specified planning horizon, either finite or infinite.
Demand in each period is positive and random, with each unit of demand requiring the usage
of one rental unit for one or more periods, with the rental period being also random. Demand
that cannot be satisfied at the location at which it arises is considered lost and incurs a lost sales
penalty. A location in the context of a free-floating car sharing system may correspond to a specified
geographic area (e.g., a zip code area, a neighborhood, or a set of city blocks). Units rented at
one location can be returned to another. Hence, not only are rental durations random, but so are
return destinations. At any time, a rental unit can be either at one of the locations, available for rent, or with a customer as an ongoing rental.
The sequence of events in each period is as follows. At the beginning of the period, inventory
level at each location is observed. A decision is then made on how much inventory to reposition
away from one location to another. Subsequently, demand is realized at each location followed by
the realization of product returns. Note that the review period is assumed to be sufficiently long for the repositioning to be completed within the period.5
We index the periods by t ∈ N, with t = 1 indicating the first period in the planning horizon. We
let x_t = (x_{t,1}, . . . , x_{t,n}) denote the vector of inventory levels before repositioning in period t, where x_{t,i} denotes the inventory level at location i. Similarly, we let y_t = (y_{t,1}, . . . , y_{t,n})
5 This assumption is reasonable for the car sharing application where repositioning is assured by dedicated “street
teams,” as in the Car2Go example mentioned, who are assigned to different areas within the service region for the
rapid redeployment of vehicles (e.g., if rental periods are multiples of one hour, this requires that the street
teams are large enough to carry out the needed repositioning within one hour). In some applications, repositioning
is crowdsourced (this approach is used by Bird, the scooter service provider) or is carried out by users who receive
a monetary compensation for moving a vehicle. In such cases, the assumption of relatively short repositioning
time is also reasonable. There may of course be settings where there is an upper limit on how many units can be
repositioned in a given period (in that case, a constraint would need to be imposed on the amount of inventory that
can be repositioned in a given period). There may also be settings where the repositioning consists of multiple periods
(e.g., in the case of shipping containers). The analysis in that case is more complicated as it requires expanding the
state space to include the “delivery” status of each inventory that is in the process of being repositioned. We leave
this as a potential topic for future research.
denote the vector of inventory levels after repositioning in period t, where yt,i denotes the corre-
sponding inventory level at location i. Note that inventory repositioning should always preserve
the total on-hand inventory. Therefore, we require Σ_{i=1}^n y_{t,i} = Σ_{i=1}^n x_{t,i}.
Inventory repositioning is costly and, for each unit of inventory repositioned away from location
i to location j, a cost of cij is incurred. Consistent with our motivating application of a car sharing
system, we assume there is a cost associated with the repositioning of each unit (this is in contrast
with other applications, such as bike sharing systems where repositioning can occur in batches6 );
see He et al. [2018] for similar treatment. Let c = (cij ) denote the cost vector and let wij denote
the amount of inventory to be repositioned away from location i to location j. Then, the minimum
cost associated with repositioning from an inventory level x to another inventory level y is given by

    min   c · w
    subject to   Σ_{i=1}^n w_{ij} − Σ_{k=1}^n w_{jk} = y_j − x_j ,   ∀ j = 1, . . . , n,
                 w ≥ 0.
The first constraint ensures that the change in inventory level at each location is consistent with the
amounts of inventory being moved into (Σ_i w_{ij}) and out of (Σ_k w_{jk}) that location. The second
constraint ensures that the amount of inventory being repositioned away from one location to
another is nonnegative so that the associated cost is accounted for in the objective. It is clear that the optimal value depends on x and y only through the difference z = y − x. Accordingly, for any z with e^T z = 0, we define
6 In some bike sharing systems, bikes are moved a few units at a time (e.g., Citi Bike uses tricycles that hold 3-4 bikes
at a time to move bikes during the day; this is also the case for several bike sharing systems in China). In such cases,
a per-unit repositioning cost would be a reasonable approximation. In other systems, although units are moved in
batches, the repositioning cost is mostly due to the handling of each unit (e.g., the process of loading and unloading
bikes onto a truck and positioning it in place in a docking station). For example, this would be the case when all
locations are always visited in each period as part of fixed route for the repositioning vehicles. In some situations,
repositioning is crowdsourced (as in the case of the Bird scooters). In that case, the repositioning cost corresponds
to the payment made to the individuals who carried out the repositioning.
    C(z) = min { c · w : Σ_{i=1}^n w_{ij} − Σ_{k=1}^n w_{jk} = z_j ∀ j = 1, . . . , n,  w ≥ 0 }.          (1)
Then the inventory repositioning cost from x to y is C(y − x). Without loss of generality, we
assume that cij ≥ 0 satisfy the triangle inequality (i.e., cik ≤ cij + cjk for all i, j, k).
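For illustration, the repositioning cost C(z) can be computed with an off-the-shelf LP solver. The sketch below is ours (not from the paper); the function name, the small cost matrix, and the net-change vector are made-up inputs, and z must sum to zero for the flow-balance constraints to be feasible.

    # Minimal sketch of evaluating C(z) in (1) with scipy's LP solver.
    import numpy as np
    from scipy.optimize import linprog

    def repositioning_cost(c, z):
        """Solve min c.w  s.t.  sum_i w[i,j] - sum_k w[j,k] = z[j],  w >= 0."""
        n = len(z)
        A_eq = np.zeros((n, n * n))
        for j in range(n):
            for i in range(n):
                A_eq[j, i * n + j] += 1.0   # w[i, j] flows into location j
                A_eq[j, j * n + i] -= 1.0   # w[j, i] flows out of location j
        res = linprog(c.flatten(), A_eq=A_eq, b_eq=z, bounds=(0, None), method="highs")
        return res.fun, res.x.reshape(n, n)

    if __name__ == "__main__":
        c = np.array([[0.0, 1.0, 2.0],
                      [1.0, 0.0, 1.0],
                      [2.0, 1.0, 0.0]])      # satisfies the triangle inequality
        z = np.array([2.0, -1.0, -1.0])      # net change in inventory (sums to zero)
        cost, w = repositioning_cost(c, z)
        print(cost)                          # expected: 3.0 (one unit from 2 to 1, one from 3 to 1)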
We let dt = (dt,1 , . . . , dt,n ) denote the vector of random demands in period t, with dt,i corre-
sponding to the demand at location i. The amount of demand that cannot be fulfilled is given by
(d_{t,i} − y_{t,i})^+ = max(0, d_{t,i} − y_{t,i}). Let β denote the per unit lost sales penalty. Then, the total lost sales penalty in period t is

    L(y_t, d_t) = β Σ_{i=1}^n (d_{t,i} − y_{t,i})^+.          (3)
We assume that each product can be rented at most once within a review period, that is, rental durations are at least one period long.
To model the randomness in both the rental periods and return locations, we assume that, at
the end of each period, a random fraction pt,ij of products rented from location i is returned to
location j for all i, j ∈ {1, 2, . . . , n}, with the rest continuing to be rented. We let P_t denote the matrix of return fractions (p_{t,ij}).
The ith row of P_t must satisfy Σ_{j=1}^n p_{t,ij} ≤ 1. The case where Σ_{j=1}^n p_{t,ij} < 1 corresponds to a setting where rental periods can be greater than one, while the case where Σ_{j=1}^n p_{t,ij} = 1 corresponds
to a setting where rental periods are exactly equal to one. Let µ_t denote the joint distribution of (d_t, P_t).
Finally, let γ_{t,i} for i = 1, 2, . . . , n and t = 1, 2, . . . , T denote the quantity of products rented
from location i that remain outstanding at the beginning of period t. To our knowledge, the
problem specification above is among the most general in the literature, and the various assumptions
we make are either consistent with or further extend assumptions found in the literature (for example,
our assumptions are consistent with those in He et al. [2018] except that we allow for rental
durations to be multiple periods and for this duration to be random). We denote c_max = max_{i,j} c_{ij}
and c_min = min_{i,j: i≠j} c_{ij}. The next two assumptions state some useful conditions on P_t and the cost parameters.
Assumption 1. For every period t, there exists a random variable pt ∈ [pmin , 1] such that
    Σ_{j=1}^n p_{t,ij} = Σ_{j=1}^n p_{t,kj} = p_t ,   ∀ i, k = 1, 2, . . . , n.

An alternative statement is that p_{t,ij} = p_t q̃_{t,ij} for some q̃_{t,ij} with Σ_{j=1}^n q̃_{t,ij} = 1 for all i.
Assumption 1 implies that the rental duration does not depend on the origin of the rental, but
the distribution of the return locations does depend on the origin. This assumption is plausible
since the rental duration and the return location are usually two separate decisions.
The second assumption enforces boundedness in the difference of cost parameters, with the
upper bound depending on pmin . If pmin = 1, where the rental duration is always one period
(corresponding to the setting of He et al. [2018]), the restriction reduces to ρcmax ≤ β. This means
that the cost of lost sales outweighs the cost of inventory repositioning in the next period. A
similar condition is assumed in He et al. [2018]. In this sense, the convexity in our paper can
include their results as a special case except that in their model, lost sales costs depend on both
origin and destination and the return destinations are known by the platform at the time of rental.
If pmin < 1, the assumption prevents the unpleasant situation where one might want to deliberately forgo demand fulfillment in order to reduce future repositioning costs.
Our main result is that, under Assumptions 1 and 2, the value function in each period, consisting
of the lost sales and the cost-to-go as defined next, is always convex, allowing for many structural properties to be established.
The model we described above can be formulated as a Markov decision process. Fix a time
period t. The system states correspond to the on-hand inventory levels xt and the outstanding
inventory levels γ t . The state space is specified by the (2n − 1)-dimensional simplex, i.e., (xt , γ t ) ∈
∆2n−1 (N ). Throughout the paper, we denote S := Sn (N ) and ∆ := ∆2n−1 (N ) since these notations
are frequently used. Actions correspond to the vector of target inventory levels y t . Given state
(x_t, γ_t), the action space is an (n − 1)-dimensional simplex, i.e., y_t ∈ ∆_{n−1}(e^T x_t). The transition dynamics are given by

    x_{t+1,i} = (y_{t,i} − d_{t,i})^+ + Σ_{j=1}^n (γ_{t,j} + min(y_{t,j}, d_{t,j})) p_{t,ji} ,   ∀ i = 1, 2, . . . , n,  t = 1, 2, . . . , T,
    γ_{t+1,i} = (γ_{t,i} + min(y_{t,i}, d_{t,i})) (1 − Σ_{j=1}^n p_{t,ij}) ,                      ∀ i = 1, 2, . . . , n,  t = 1, 2, . . . , T.
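To make the transition dynamics concrete, the following sketch (our own notation, assuming numpy; not code from the paper) computes one period's state update from post-repositioning inventory y, outstanding rentals γ, a realized demand d, and a return-fraction matrix P.

    import numpy as np

    def transition(y, gamma, d, P):
        rented = np.minimum(y, d)              # units rented this period
        leftover = np.maximum(y - d, 0.0)      # units left on the ground
        outstanding = gamma + rented           # units currently with customers, by origin
        returns = P.T @ outstanding            # units returned to each location
        x_next = leftover + returns
        gamma_next = outstanding * (1.0 - P.sum(axis=1))   # rentals that remain outstanding
        return x_next, gamma_next

    # Example with three locations; row sums of P equal 0.8, i.e., p_t = 0.8.
    y = np.array([4.0, 3.0, 3.0])
    gamma = np.array([1.0, 0.0, 1.0])
    d = np.array([5.0, 1.0, 2.0])
    P = 0.8 * np.array([[0.5, 0.3, 0.2],
                        [0.2, 0.6, 0.2],
                        [0.3, 0.3, 0.4]])
    x_next, gamma_next = transition(y, gamma, d, P)
    # Conservation check: the total number of units is preserved.
    assert np.isclose(x_next.sum() + gamma_next.sum(), y.sum() + gamma.sum())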
Given a state (x_t, γ_t) and an action y_t, the repositioning cost is given by C(y_t − x_t), and the expected lost sales penalty is

    l_t(y_t) = ∫ L_t(y_t, d_t) dµ_t = β ∫ Σ_i (d_{t,i} − y_{t,i})^+ dµ_t.          (4)
The single-period cost is the sum of the inventory repositioning cost and lost sales penalty:
rt (xt , γ t , y t ) = C(y t − xt ) + lt (y t ).
The objective is to minimize the expected discounted cost over a specified planning horizon. In the
case of a finite planning horizon with T periods, the optimality equations are given by
    v_t(x_t, γ_t) = min_{y_t ∈ ∆_{n−1}(e^T x_t)} { r_t(x_t, γ_t, y_t) + ρ ∫ v_{t+1}(x_{t+1}, γ_{t+1}) dµ_t }          (5)

for t = 1, 2, . . . , T , and

    v_{T+1}(x_{T+1}, γ_{T+1}) = 0,          (6)

where ρ ∈ [0, 1) is the discount factor.
It is useful to note that the problem to be solved in each period can be expressed in the following form:

    v_t(x_t, γ_t) = min_{y_t ∈ ∆_{n−1}(e^T x_t)} C(y_t − x_t) + u_t(y_t, γ_t),          (7)

where

    u_t(y_t, γ_t) = ∫ U_t(y_t, γ_t, d_t, P_t) dµ_t,          (8)

and

    U_t(y_t, γ_t, d_t, P_t) = L_t(y_t, d_t) + ρ v_{t+1}(τ_x(y_t, γ_t, d_t, P_t), τ_γ(y_t, γ_t, d_t, P_t)),          (9)

where

    τ_x(y, γ, d, P) = (y − d)^+ + P^T (γ + min(y, d)),
    τ_γ(y, γ, d, P) = (γ + min(y, d)) ◦ (e − P e),          (10)

where ◦ denotes the Hadamard product (or the entrywise product), i.e., (a ◦ b)_i = a_i b_i.
In the next section, we consider the one-period problem, which provides us with a few useful results
before moving on to the multi-period and infinite horizon versions of the problem.
In this section, we consider the following one-period problem:

    min_{y ∈ ∆_{n−1}(e^T x)} C(y − x) + u(y, γ),          (11)

where C(·) is the repositioning cost specified by (1). As shown in equations (7)–(9), the problem
to be solved in each period is of the form (11). Here we assume that u(·) is a convex and continuous
function that maps ∆ to R ∪ {−∞, ∞}, though most of the results in this section still hold when
u(x, γ) is only convex in x. In Section 5, we will show that ut (y t , γ t ) defined in (8) is indeed convex
and continuous in (y t , γ t ).
We start by describing some properties of the repositioning cost C(·) defined in (1). These
properties will be useful for characterizing the optimal policy in subsequent sections. Unless stated
otherwise, proofs for all results in the paper can be found in the Appendix.
2. (Convexity): C(λz 1 + (1 − λ)z 2 ) ≤ λC(z 1 ) + (1 − λ)C(z 2 ) for all z 1 , z 2 ∈ H and λ ∈ [0, 1].
Moreover, due to the triangle inequality, it is not optimal to simultaneously move inventory
into and out of the same location. This property can be stated as follows.
Lemma 2. There exists an optimal solution w to (1) such that

    Σ_{i=1}^n w_{ij} = z_j^+   and   Σ_{k=1}^n w_{jk} = z_j^−   for all j = 1, . . . , n.
Lemma 2 leads to the following bound for the repositioning cost C(z).
Lemma 3.

    (c_min / 2) Σ_{i=1}^n |z_i|  ≤  C(z)  ≤  (c_max / 2) Σ_{i=1}^n |z_i|.          (12)
The proof follows from Lemma 2. There exists an optimal solution w to (1) such that
    C(z) = Σ_{i,j} c_{ij} w_{ij} = (1/2) Σ_j Σ_i c_{ij} w_{ij} + (1/2) Σ_j Σ_i c_{ji} w_{ji}
         ≤ (c_max / 2) Σ_j z_j^+ + (c_max / 2) Σ_j z_j^− = (c_max / 2) Σ_j |z_j|.
It is easy to see that, in (12), the equality holds if cij = cmax for all i, j. Therefore, the bound is
tight. The lower bound follows the same logic. In Section 5, we will use Lemma 3 to derive an upper bound on the directional derivatives of the value function.
The principal result of this section is the characterization of the optimal policy through the no-
repositioning set, the collection of inventory levels from which no repositioning should be made.
The no-repositioning set for a function u(·) when the outstanding inventory level is γ can be defined
as follows:

    Ω_u(γ) = { x ∈ ∆_{n−1}(I) : u(x, γ) ≤ C(y − x) + u(y, γ) for all y ∈ ∆_{n−1}(I) },          (13)

where I = N − Σ_{i=1}^n γ_i. Note that I is a function of γ (or, equivalently, of x). For notational simplicity, we suppress this dependence. By definition, no repositioning should be
made from inventory levels inside Ωu (γ). In the following theorem, we show that Ωu (γ) is non-
empty, connected and compact and, for inventory levels outside Ωu (γ), it is optimal to reposition
to some point on the boundary of Ω_u(γ). In what follows, we denote the boundary of a set E by B(E).

Theorem 4. The no-repositioning set Ω_u(γ) is nonempty, connected and compact for all γ ∈ S. Moreover, there exists an optimal policy π* for problem (11) such that

    π*(x, γ) = x                      if x ∈ Ω_u(γ);
    π*(x, γ) ∈ B(Ω_u(γ))              otherwise.          (14)
Solving a nondifferentiable convex program such as (11) usually involves some computational
effort. One way to reduce this effort, suggested by Theorem 4, is to characterize the no-repositioning
set Ωu (γ). Characterizing the no-repositioning region can help us identify when a state is inside
Ωu (γ), which allows our ADP algorithm to more easily compute the value iteration step; see Section
7. Let

    u′(x, γ; z, η) = lim_{t↓0} [u(x + tz, γ + tη) − u(x, γ)] / t          (15)
denote the directional derivative of u(·) at (x, γ) along the direction (z, η). Since u(·) is assumed
to be convex and continuous in ∆, u0 (x, γ; z, η) is well defined for (x, γ) ∈ ∆. We call (z, η) a
feasible direction at (x, γ) if (x + tz, γ + tη) ∈ ∆ for small enough t > 0. In what follows, we
provide a series of first-order characterizations of Ω_u(γ), the first of which relies on the directional derivatives of u(·).
Proposition 5 is essential for several subsequent results. However, using Proposition 5 to verify
whether a point lies inside the no-repositioning set is computationally impractical, as it involves
checking an infinite number of inequalities in the form of (16). In the following proposition, we pro-
vide a second characterization of Ω_u(γ) using subdifferentials. Before we proceed, we introduce the notion of a subgradient: a vector g is a subgradient of u(·, γ) at x if u(y, γ) ≥ u(x, γ) + g^T (y − x)
for all y. The set of all subgradients of u(·, γ) at x is denoted by ∂_x u(x, γ). It is well known that ∂_x u(x, γ) is nonempty, convex, and compact for x in the relative interior of the domain.
Proposition 6. x ∈ Ωu (γ) if
∂x u(x, γ) ∩ G 6= ∅, (17)
where G = {(g1 , . . . , gn ) : gi − gj ≤ cij ∀ i, j}. If x > 0, then the converse is also true.
Proposition 6 suggests that whether a point lies inside the no-repositioning set depends on whether
u(·, γ) has certain subgradients at this point. Such a characterization is useful if we can compute
the subdifferential ∂x u(x, γ). In particular, if u(·, γ) is differentiable at x, then ∂x u(x, γ) consists
of a single point ∇x u(x, γ). In this case, determining its optimality only involves checking n(n − 1)
inequalities.
Corollary 7. Suppose u(·, γ) is differentiable at x ∈ ∆n−1 (I). Then, x ∈ Ωu (γ) if and only if
    ∂u(x, γ)/∂x_i − ∂u(x, γ)/∂x_j ≤ c_{ij}          (18)
for all i, j.
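The check in Corollary 7 is straightforward to implement when a gradient of u(·, γ) is available. The sketch below is illustrative (the function name, sample gradient, and cost matrix are ours, not the paper's):

    import numpy as np

    def in_no_repositioning_set(grad, c, tol=1e-12):
        """grad: gradient of u(., gamma) at x; c: matrix of unit repositioning costs."""
        diff = grad[:, None] - grad[None, :]   # diff[i, j] = du/dx_i - du/dx_j
        return bool(np.all(diff <= c + tol))

    grad = np.array([0.6, 0.2, 0.5])
    c = np.full((3, 3), 0.5); np.fill_diagonal(c, 0.0)
    print(in_no_repositioning_set(grad, c))    # True: all pairwise gaps are at most 0.5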
The no-repositioning set Ωu (γ) can take on many forms. We first discuss the case where there
are only two locations. In this case, the no-repositioning set corresponds to a closed line segment, and the optimal policy reduces to a two-threshold policy.

Corollary 8. Suppose n = 2. Then Ω_u(γ) = {(x, I − x) : x ∈ [s_1(γ), s_2(γ)]}, where s_1(γ) = inf{x : u′((x, I − x, γ_1, γ_2); (1, −1, 0, 0)) ≥ −c_{21}} and s_2(γ) = sup{x : u′((x, I − x, γ_1, γ_2); (−1, 1, 0, 0)) ≥ −c_{12}}.
Corollary 8 is a direct consequence of Theorem 4, Proposition 5, and the fact that there are
only two feasible directions (z, 0). It shows that the optimal policy to problem (11) in the two-
dimensional case is described by two thresholds s_1(γ) < s_2(γ) on the on-hand inventory level x at location 1: if x is less than s_1, it is optimal to bring the inventory level at location 1 up to s_1 by repositioning
inventory from location 2 to location 1. On the other hand, if x is greater than s_2, it is optimal to
bring the inventory level at location 1 down to s2 . When x falls between s1 and s2 , it is optimal
not to reposition as the benefit of inventory repositioning cannot offset the cost.
When there are more than two locations, there does not exist a threshold policy. In what
follows, we characterize the no-repositioning set for three important special cases, the first of which
corresponds to when u(·, γ) is a convex quadratic function. In this case, the no-repositioning set is a polyhedron.

Example 1. For a fixed γ, suppose u(y, γ) = y^T B(γ) y + y^T b(γ) + b_0(γ) and B(γ) is positive semidefinite. Then, by Corollary 7, Ω_u(γ) is the intersection of ∆_{n−1}(I) with a finite collection of half-spaces and is therefore a polyhedron.
We point out that, in general, the no-repositioning set can be non-convex. The following
example illustrates that even if u(·) is smooth, Ωu (γ) might still be non-convex.
Example 2. Suppose γ = 0, u(y) = y_1^3 + y_2^2 + y_3^2, and c_{ij} = 0.5 (note that the inventory state y is three-dimensional). By Corollary 7, Ω_u(0) = {y ∈ ∆_{n−1} : −0.5 ≤ 3y_1^2 − 2y_3 ≤ 0.5, −0.5 ≤ 3y_1^2 − 2y_2 ≤ 0.5, −0.5 ≤ 2y_2 − 2y_3 ≤ 0.5}.
In particular, the region under the parabolas 2y_2 − 3y_1^2 = 0.5 and 2y_3 − 3y_1^2 = 0.5 is not convex. See Figure 1 for an illustration.
[Figure 1: The no-repositioning set Ω_u within the feasible region A_I (plotted against y_1).]
In this section, we return to the study of the multi-period problem. The optimality equations are
given by (5) and (6). It is clear from (7) that the problem to be solved in each period can be
reduced to (11) with ut (·) in place of u(·). Consequently, the optimal policy in each period will
have the same form as the one-period problem if the functions ut (·), t = 1, . . . , T are convex and
continuous in ∆.
Recall that u_t(y, γ) = ∫ U_t(y, γ, d_t, P_t) dµ_t, where U_t(y, γ, d, P) = L_t(y, d) + ρ v_{t+1}(τ_x(y, γ, d, P), τ_γ(y, γ, d, P)).
If the state update functions τ_x(·, d, P), τ_γ(·, d, P) were linear, then convexity would be preserved through
the composition v_{t+1}(τ_x(y, γ, d, P), τ_γ(y, γ, d, P)). As a result, U_{t,d,P}(·) := U_t(·, d, P), and therefore u_t(·), would
be convex. However, with non-linear state updates, this is not always the case. In our context,
the state update function is piecewise affine, with the domain of each affine segment determined by the realized demand. As a result, the composition v_{t+1}(τ_x(·), τ_γ(·)) instead is piecewise convex. In spite of this, we show that in our context, U_{t,d,P}(·), and hence u_t(·), is
convex under some mild conditions on the cost parameters ct,ij and the return probabilities pt,ij .
Theorem 9. Suppose Assumptions 1 and 2 hold. For any given t = 1, . . . , T , the function u_t(·)
defined in (8) is convex and continuous in ∆. The no-repositioning set Ω_{u_t}(γ) is nonempty, connected and compact for all γ ∈ S, and can be characterized as in Propositions 5 and 6 and Corollary 7. Moreover,

1. |u′_t(y_t, γ_t; ±η, ∓η)| ≤ β Σ_{i=1}^n η_i for all (y_t, γ_t) ∈ ∆ and any feasible direction (±η, ∓η) with η ≥ 0;

2. u′_t(y_t, γ_t; 0, z) ≤ (ρ c_max / 2) Σ_{i=1}^n |z_i| for all (y_t, γ_t) ∈ ∆ and any feasible direction (0, z) with e^T z = 0.
A comprehensive proof of Theorem 9 can be found in Appendix A.1. Here, we give an outline of
the approach. We apply induction, starting from vT +1 (y, γ) = 0. We show in Proposition A.5 and
Proposition A.6 that if vt+1 (·) is convex and satisfies certain bounds on its directional derivatives,
then for any realization of d_t, P_t, the function U_{t,d_t,P_t}(·) is convex and satisfies two types of bounds
on its directional derivatives. The first type shows that if we remove some of the available inventory
and turn it into ongoing rentals, the resulting change in cost can be upper bounded by the lost
sales cost of these products. The same bound holds if we remove some of the ongoing rentals and
make them available at the locations from which they were rented. The second type of bound states
that if we change the origin of some of the ongoing rentals (i.e., we change γ only), the difference
in cost can be upper bounded by the product of (ρcmax /2) and the one-norm of the difference in γ.
The primary reason is that the total return fraction for period t, pt , does not depend on the origin.
Therefore, the difference of costs is at most the repositioning cost in the next period. To complete
the induction, we show in Proposition A.8 that, given the convexity of u_t(y_t, γ_t) and some bounds on its directional derivatives, the value function v_t(·) is convex and satisfies the corresponding bounds.
We have shown that the optimal policy for the multi-period problem has the same form as the
one-period problem. Next we show that the same can be said about the stationary problem with
infinitely many periods. In such a problem, we denote the common distribution for (dt , P t ) by µ.
Similarly, we denote the common values of Lt (·), lt (·) and rt (·) by L(·), l(·) and r(·), respectively.
We use π to denote a stationary policy that uses the same decision rule π in each period. Under
π, the state of the process is a Markov random sequence {(X_t, Γ_t), t = 1, 2, . . .}. The optimization problem is

    v(x, γ) = min_π E^π_x { Σ_{t=1}^∞ ρ^{t−1} r(X_t, Γ_t, π(X_t, Γ_t)) }.          (20)

Let

    ṽ_T(x, γ) = min_π E^π_x { Σ_{t=1}^T ρ^{t−1} r(X_t, Γ_t, π_t(X_t, Γ_t)) }          (21)
denote the value function of a stationary problem with T periods. It is well known that the functions
ṽ_T(·) converge uniformly to v(·) and v(·) is the unique solution7 to

    v(x, γ) = min_{y ∈ ∆_{n−1}(e^T x)} { r(x, γ, y) + ρ ∫ v(τ_x(y, γ, d, P), τ_γ(y, γ, d, P)) dµ },          (22)
where τ_x(·) and τ_γ(·) correspond to the state update functions defined in (10). Define u(y, γ) = ∫ { L(y, d) + ρ v(τ_x(y, γ, d, P), τ_γ(y, γ, d, P)) } dµ. Then the problem to be solved in each period again reduces to problem (11).
Theorem 10. Suppose Assumptions 1 and 2 hold. The function u(·, γ) is convex and continuous
in ∆. The no-repositioning set Ω_u(γ) is nonempty, connected and compact for all γ ∈ S, and can be characterized as in Propositions 5 and 6 and Corollary 7. There exists an optimal stationary policy π* such that

    π*(x, γ) = x                      if x ∈ Ω_u(γ);
    π*(x, γ) ∈ B(Ω_u(γ))              otherwise.          (24)

Moreover, we have

1. |u′(y, γ; ∓η, ±η)| ≤ β Σ_{i=1}^n η_i for all (y, γ) ∈ ∆ and any feasible direction (∓η, ±η) with η ≥ 0;

2. u′(y, γ; 0, z) ≤ (ρ c_max / 2) Σ_{i=1}^n |z_i| for all (y, γ) ∈ ∆ and any feasible direction (0, z) with e^T z = 0.
So far, we have studied the theoretical properties of the repositioning problem. In this section, we develop an approximate dynamic programming algorithm, which we refer to as “Repositioning-ADP,” that exploits the structure of both the value function and the optimal policy under a sampled
demand and return model. Although Theorems 9 and 10 allow for the use of convex optimization
to help resolve the issue of a multi-dimensional continuous action space, the difficulty of a multi-
dimensional and continuous state space remains. We refer readers to Bertsekas and Tsitsiklis [1996]
and Powell [2007] for a detailed discussion of the computational challenges and solution methods
associated with large MDPs. In particular, we note that when the problem size (the number of
locations or the number of time periods) is large, simple approximations of continuous problems,
such as discretization or aggregation, will usually fail. In addition, discretization can cause our structural results to break down, so that the characterization of
the optimal policy given in Theorems 9 and 10 can no longer be readily used. Informal numerical
experiments show that even if we do not consider the ongoing rentals (rental period is always one),
approximating the dynamic program via discretization within a reasonable accuracy is already a computationally prohibitive task.
It is thus necessary for us to consider more scalable techniques. A key feature of the algorithm
we describe next is that each iteration involves solving one or more linear programs, allowing it
to leverage the scalability and computational advantages of off-the-shelf solvers. We show via nu-
merical experiments that the algorithm can produce high quality solutions on problems with states
up to 20 dimensions (10 locations) within a reasonable amount of time. The algorithm also pos-
sesses the important theoretical property of asymptotically optimal value function approximations;
see Theorem 14. In the rest of this section, we motivate and describe the algorithm, prove its
convergence, discuss some practical considerations, and present the numerical results.
Theorems 9 and 10 describe the most important feature of our dynamic program, that u(·), the
summation of current period lost sales and the cost-to-go, is convex and continuous. Moreover, the optimal policy is characterized by a no-repositioning region. Our algorithm
takes advantage of these two structural results. It is well known that a convex function can be represented as the pointwise supremum of its supporting (tangent) hyperplanes:

    u(y, γ) = sup_{ŷ, γ̂} { u(ŷ, γ̂) + (y − ŷ)^T ∇_y u(ŷ, γ̂) + (γ − γ̂)^T ∇_γ u(ŷ, γ̂) }.
This suggests that we can build an approximation to u(·) by iteratively adding lower-bounding
hyperplanes, with the hope that the approximation becomes arbitrarily good when enough hyper-
planes are considered. This is the main idea of the algorithm, with special considerations made to exploit the policy structure, handle the non-linear state updates, and address the infinite horizon setting.
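As a concrete illustration of the cut representation (our own sketch, not the paper's code; the class and method names are made up), the approximation u_J can be stored as a list of cut coefficients and evaluated as their pointwise maximum:

    # Piecewise-affine lower approximation as the maximum of cuts
    #   g_k(y, gamma) = (y - y_k)'a_k + (gamma - gamma_k)'b_k + c_k.
    import numpy as np

    class CutApproximation:
        def __init__(self):
            self.cuts = []                      # list of (y_k, gamma_k, a_k, b_k, c_k)

        def add_cut(self, y_k, gamma_k, a_k, b_k, c_k):
            self.cuts.append((np.array(y_k), np.array(gamma_k),
                              np.array(a_k), np.array(b_k), float(c_k)))

        def value(self, y, gamma):
            if not self.cuts:
                return 0.0                      # u_0 = 0 before any cuts are added
            return max((y - yk) @ ak + (gamma - gk) @ bk + ck
                       for yk, gk, ak, bk, ck in self.cuts)

    # Example: two cuts on a three-location problem.
    u = CutApproximation()
    u.add_cut([1, 1, 1], [0, 0, 0], [0.2, -0.1, 0.0], [0.1, 0.1, 0.1], 1.0)
    u.add_cut([2, 0, 1], [1, 0, 0], [-0.3, 0.4, 0.1], [0.0, 0.2, 0.0], 0.8)
    print(u.value(np.array([1.5, 0.5, 1.0]), np.array([0.5, 0.0, 0.5])))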
The features and analysis that distinguish our ADP algorithm from previous work in the literature are as follows:

1. Our algorithm has the ability to skip the optimization step when a sampled state is detected to lie in the no-repositioning region of the current approximation (see Lemma 11), thereby reducing the number of linear programs that must be solved.
2. The underlying model of SDDP and other cutting-plane methods (see, e.g., Higle and Sen
[1991], Pereira and Pinto [1991], Birge and Zhao [2007]) is typically a two-stage or multi-
stage stochastic linear program. In our case, we must account for the non-linear state update
functions τx and τγ . We emphasize that the choice to design our algorithm around u(·) and not
v(·) is critical. Otherwise, the non-linear state update functions would need to be incorporated
in the optimization step, in the sense of (22). Given that v(τx (y, γ, d, P ), τγ (y, γ, d, P )) is
not necessarily convex, this poses a significant challenge. Instead, our approach leverages the
convexity of u(·) to sidestep this issue entirely (the state updates are then computed outside of the optimization).
3. Our algorithm is designed for the infinite horizon setting, where each approximation “boot-
straps” from the previous approximation and convergence is achieved despite the absence of
a terminal condition such as “vT +1 ≡ 0” used in the finite-horizon case. As such, the conver-
gence analyses used in Chen and Powell [1999], Linowsky and Philpott [2005], Philpott and
Guan [2008], Shapiro [2011], and Girardeau et al. [2014] do not apply.8 Moreover, we remove
a strong condition used in a previous convergence result by Birge and Zhao [2007] for the
infinite horizon setting, where cuts are computed at states that approximately maximize a
Bellman error criterion. Selecting such a state requires solving a difference of convex functions
optimization problem. Our algorithm and proof technique do not require this costly step.
Throughout this section, suppose that we are given M samples of the demand and the return
fraction matrix (d1 , P 1 ), (d2 , P 2 ), . . . , (dM , P M ). Our goal is to optimize the sampled model. The
idea is to start with an initial piecewise-affine approximation u0 (y, γ) (such as u0 (y, γ) = 0) and
then dynamically add linear functions (referred to as cuts in our discussion) into consideration.
Each cut is an affine function of the form

    g_k(y, γ) = (y − y_k)^T a_k + (γ − γ_k)^T b_k + c_k ,

and the approximation after the cuts generated so far is u_J(y, γ) = max_k g_k(y, γ).
8 For example, we do not make use of a property that there are only a finite number of distinct cuts; see Lemma 1
of Philpott and Guan [2008]. We remark, however, that our algorithm has a natural adaptation for finite-horizon
problems.
At each iteration, a new cut is obtained as a supporting hyperplane of the function

    ū_J(y, γ) = (1/M) Σ_{s=1}^M { L(y, d_s) + ρ v̄_J(τ_x(y, γ, d_s, P_s), τ_γ(y, γ, d_s, P_s)) },

where

    v̄_J(x, ζ) = min_{y ∈ ∆_{n−1}(e^T x)} C(y − x) + u_J(y, ζ),

evaluated at a sample point (ỹ, γ̃). Note that v̄_J(x, ζ) is a linear program. To find the derivatives of
v̄J (x, ζ), we write down the dual formulation for v̄J (x, ζ) as follows:
    v̄_J(x, ζ) = max   (λ_0 e + λ)^T x + Σ_{k=1}^J µ_k ( −a_k^T y_k + b_k^T (ζ − γ_k) + c_k )
                s.t.   Σ_{k=1}^J µ_k = 1,
                       λ_i − λ_j ≤ c_{ij} ,                          ∀ i, j = 1, 2, . . . , n,          (26)
                       −λ_i + Σ_{k=1}^J µ_k a_{ki} − λ_0 ≥ 0,         ∀ i = 1, 2, . . . , n,
                       µ_k ≥ 0,                                       ∀ k = 1, 2, . . . , J.
From (26), we see that ∇_x v̄_J(x, ζ) = λ*_0 e + λ* and ∇_ζ v̄_J(x, ζ) = Σ_{k=1}^J µ*_k b_k, where
(λ*_0, λ*, µ*) is an optimal solution for problem (26). The Jacobian of the state update
function follows from (10); in particular,

    ∇_{γ̄,γ} = Diag(e − P_k e)   and   ∇_{γ̄,y} = Diag( (e − P_k e) ◦ 1_{y ≤ d_k} ),

where x̄ and γ̄ stand for τ_x and τ_γ, respectively. By solving problem (26) for all pairs
(τ_x(y, γ, d_i, P_i), τ_γ(y, γ, d_i, P_i)), i = 1, . . . , M, we can find a tangent hyperplane of ū_J(y, γ) at (ỹ, γ̃).
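To illustrate how v̄_J might be evaluated in practice, the following sketch formulates the primal of (26) with cvxpy. This is our own illustration (not the paper's implementation); the container `cuts` holding tuples (y_k, γ_k, a_k, b_k, c_k) is an assumed data structure, and the dual signs returned by the solver should be verified before using them as the gradients discussed above.

    import cvxpy as cp
    import numpy as np

    def solve_vbar(x, zeta, cost, cuts):
        """cuts: list of (y_k, gamma_k, a_k, b_k, c_k); assumes at least one cut,
        otherwise the epigraph variable t is unbounded below."""
        n = len(x)
        y = cp.Variable(n, nonneg=True)
        w = cp.Variable((n, n), nonneg=True)
        t = cp.Variable()
        flow = cp.sum(w, axis=0) - cp.sum(w, axis=1) == y - x   # inflow minus outflow
        mass = cp.sum(y) == float(np.sum(x))                    # repositioning preserves total inventory
        cons = [flow, mass]
        for yk, gk, ak, bk, ck in cuts:
            cons.append(t >= (y - yk) @ ak + float((zeta - gk) @ bk) + ck)
        prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(cost, w)) + t), cons)
        prob.solve()
        # flow.dual_value and mass.dual_value play the roles of lambda and lambda_0 in (26);
        # the sign convention depends on the solver and should be checked.
        return prob.value, y.value, flow.dual_value, mass.dual_value

In the algorithm, a problem of this form would be solved once per sample (d_s, P_s), with the resulting gradients combined through the Jacobians of (10) to form the new cut.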
While (26) can always be solved, we can apply Proposition 6, a characterization of the no-
repositioning region, to reduce the computational load. We first define some terms for uJ (y, γ).
Let K = {k | aki − akj ≤ cij ∀ i, j} denote the set of cuts that satisfy the no-reposition condition.
    D_k = { (y, γ) ∈ ∆ : (y − y_k)^T a_k + (γ − γ_k)^T b_k + c_k ≥ (y − y_l)^T a_l + (γ − γ_l)^T b_l + c_l  ∀ l = 1, 2, . . . , K }          (27)
denote a subset of the feasible region that is dominated by the k-th cut. Then we have the following
lemma.
Lemma 11. If (x, ζ) ∈ D_k with k ∈ K, we have x ∈ Ω_{u_J}(ζ)9 and one optimal solution for problem
(26) is λ = a_k , µ_k = 1, µ_l = 0 for all l ≠ k.
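A minimal sketch (our own, with assumed names) of the check that Lemma 11 enables: identify the cut attaining the maximum at the sampled state and test whether its slopes satisfy the no-repositioning condition defining K; if so, the optimization step can be skipped.

    import numpy as np

    def can_skip_optimization(x, zeta, cuts, cost, tol=1e-9):
        """cuts: nonempty list of (y_k, gamma_k, a_k, b_k, c_k); cost: matrix of c_ij."""
        vals = [(x - yk) @ ak + (zeta - gk) @ bk + ck for yk, gk, ak, bk, ck in cuts]
        k = int(np.argmax(vals))                     # cut dominating at (x, zeta)
        ak = cuts[k][2]
        slope_gap = ak[:, None] - ak[None, :]        # a_{ki} - a_{kj}
        return bool(np.all(slope_gap <= cost + tol)) # k in K: no-repositioning condition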
The basic procedure of adding a cut (CABS) is summarized in Subroutine 1, and the Repositioning-ADP algorithm is given in Algorithm 1. The essential idea is to iterate the following steps: (1)
sample a set of states, (2) compute the appropriate supporting hyperplanes at each state, and
(3) add the hyperplanes to the convex approximation of u(y, γ). If uJ (y, γ) ≤ u(y, γ), we have
ūJ (y, γ) ≤ u(y, γ). Therefore, gs+NJ (y, γ), a tangent hyperplane for ūJ (y, γ), is a lower bound
for u(y, γ), which means that u_{J+1}(y, γ) is also a lower bound for u(y, γ). Through the course of the algorithm, the approximations u_J therefore remain lower bounds on the true value function.
There are several reasonable strategies for sampling the set SJ+1 . The easiest way is to set
|SJ | = 1 (i.e., only add a single cut10 per iteration) and then sample one state according to some
distribution over ∆ — this is the approach taken in the numerical experiments of this paper. Our implementation also adapts the sampling distribution over iterations to improve the practical performance (see Section 7.3); therefore, we introduce the following
Assumption 3. On any iteration J, the sampling distribution produces a set S_J of states from ∆ such that, for every open set A ⊆ ∆◦,
9 Note that, in general, Ω_u(γ) ≠ ∪_{k∈K} D_k. The reason is that even if two cuts are both not in K, the intersection of these two cuts could still include a subgradient that satisfies the no-repositioning condition.
10 If parallel computing is available, one might consider the “batch” version of the algorithm (i.e., |SJ+1 | > 1) by
performing the inner for-loop of Algorithm 1 on multiple processors (or workers). In this case, each worker receives
uJ , samples a state, and computes the appropriate supporting hyperplane. The main processor would then aggregate
the results into uJ+1 and start the next iteration by broadcasting uJ+1 to each worker.
    Σ_{J=1}^∞ P( S_J ∩ A ≠ ∅ ) = ∞.

For example, in the case of one sample per iteration, one might consider the following sampling strategy parameterized
by a deterministic sequence {ε_J}: with probability 1 − ε_J, choose the state in any manner and with
probability ε_J, select a state uniformly at random over ∆◦. In this case, we have that P(S_J ∩ A ≠ ∅) ≥ ε_J · volume(A). As long as Σ_J ε_J = ∞, Assumption 3 is satisfied.
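The mixture strategy just described can be sketched as follows (illustrative only; `preferred_sampler` is an assumed callable implementing the "any manner" part, and ε_J = 1/J is one choice whose sum diverges, so Assumption 3 holds):

    import numpy as np

    def sample_state(J, n, N, preferred_sampler, rng=np.random.default_rng()):
        """J >= 1 is the iteration index; returns (x, gamma) with total mass N."""
        eps_J = 1.0 / J                          # sum_J 1/J diverges
        if rng.random() < eps_J:
            u = rng.dirichlet(np.ones(2 * n))    # uniform over the (2n-1)-simplex
            state = N * u
            return state[:n], state[n:]
        return preferred_sampler()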
Let us now introduce some notation. For any bounded function f : ∆ → R, we define the mapping
    (Lf)(y, γ) = l(y) + ρ ∫ min_{y′ ∈ ∆_{n−1}(e^T x′)} { C(y′ − x′) + f(y′, γ′) } dµ    ∀ (y, γ) ∈ ∆,          (28)

where x′ = τ_x(y, γ, d, P) and γ′ = τ_γ(y, γ, d, P).
Bellman operator associated with the MDP defined in (20); see, for example, Bertsekas and Tsitsik-
lis [1996]. The difference from the standard definition is that L comes from the Bellman recursion
for u(y, γ) instead of v(x, γ). With this in mind, we henceforth simply refer to L as the “Bellman operator.” The operator L has the following standard properties.

Lemma 12.

1. (Monotonicity) If f_1 ≤ f_2, then Lf_1 ≤ Lf_2.

2. (Contraction) ‖Lf_1 − Lf_2‖_∞ ≤ ρ ‖f_1 − f_2‖_∞.

3. (Fixed Point) The optimal value function u is the unique fixed point of L, i.e., Lu = u.

4. (Constant Shift) Let 1 be the constant one function, i.e., 1(·) = 1, and let α be a scalar. For any bounded f, L(f + α1) = Lf + ρα1.
For simplicity, we consider the case where |S_{J+1}| = 1 for all iterations J. The extension to the batch
case, |SJ+1 | > 1, follows the same idea and is merely a matter of more complicated notation (note,
however, that we will nevertheless make use of a simple special case of batch algorithm as an anal-
ysis tool within the proof). Our convergence proof makes use of a Lipschitz condition, which is a consequence of the bounds on directional derivatives established earlier.
Lemma 13. Consider a bounded function f : ∆ → R that is convex, continuous, and satisfies
1. |f′(y, γ; ∓η, ±η)| ≤ β Σ_{i=1}^n η_i for all (y, γ) ∈ ∆ and any feasible direction (∓η, ±η) with η ≥ 0;

2. f′(y, γ; 0, v) ≤ (ρ c_max / 2) Σ_{i=1}^n |v_i| for all (y, γ) ∈ ∆ and any feasible direction (0, v) with e^T v = 0.

Then, the function f is Lipschitz continuous on ∆◦ with Lipschitz constant (3/2)√(2n) β. In addition, the function Lf also satisfies the two conditions above, i.e., properties 1 and 2 above hold
with Lf replacing f.
Theorem 14. Suppose Assumptions 1, 2, and 3 hold and that Repositioning-ADP samples one
state per iteration. If u0 (·) = 0, the sequence {uJ (·)} converges uniformly and almost surely to the
optimal value function u(·), i.e., it holds that kuJ − uk∞ → 0 almost surely.
The proof of Theorem 14 relies on relating each sample path of the algorithm to an auxiliary
algorithm where the cuts are added in “batches” rather than one by one. We show that, after ac-
counting for the different timescales, the value function approximations generated by Repositioning-
ADP are close to the approximations generated by the auxiliary algorithm. By noticing that the
auxiliary algorithm is an approximate value iteration algorithm whose per-iteration error can be
bounded in k · k∞ due to Lemma 13, we quantify its error against exact value iteration, which in
turn allows us to quantify the error between Repositioning-ADP and exact value iteration. We
make use of ε-covers of the state space (for arbitrarily small ε) along with Assumption 3 to argue that, with probability one, states arbitrarily close to any given state are sampled infinitely often.
There are two primary practical challenges that arise when implementing Algorithm 1: (1) the
value function approximations are represented by an unbounded number of cuts and (2) the design of the distribution used to sample states.
If we keep adding cuts to the existing approximation, some cuts become dominated by others, i.e.,
there exists some k ∈ {1, 2, . . . , J} such that gk (y, γ) < maxj=1,...,J gj (y, γ) for all (y, γ) ∈ ∆. It is
important to remove these redundant cuts since they can lower the efficiency of solving optimization
problem (26). Fortunately, the simple structure of the simplex enables us to check whether a piece
is redundant efficiently and effectively. We first show how to determine whether a cut is completely dominated by a single other cut (Proposition 15); the check reduces to a single inequality in the cut coefficients.
Therefore, to check whether a cut is completely dominated by another cut, one just needs
to perform a series of elementary operations and check one inequality, the computational effort
of which is negligible compared to solving a linear program. Therefore, we can always check
whether the current cut either dominates or is dominated by other cuts by checking at most 2NJ
inequalities. Though Proposition 15 is simple to implement, it does not cover the situation where
a cut is dominated by the maximum of several other cuts, which occurs more frequently. The next proposition addresses this case.
Proposition 16. D_k ≠ ∅ if and only if the optimal objective value of the following linear program is less than or equal to zero:

    min   t
    subject to   e^T y + e^T γ = N,
                 t ≥ a_l^T (y − y_l) + b_l^T (γ − γ_l) + c_l − a_k^T (y − y_k) − b_k^T (γ − γ_k) − c_k ,   ∀ l ≠ k,          (29)
                 y, γ ≥ 0.
By solving the linear program (29) at most NJ times, one can remove all of the redundant
pieces. One can perform this redundancy check periodically. Our numerical implementation also
employs the following strategy. We track the iterations at which each cut is dominating (i.e., attains the maximum) when identifying
the sets D_k. If a cut does not become dominating for a number of iterations greater than some threshold,
we consider it potentially redundant and check for redundancy by solving (29) once. Despite these
attempts to reduce the number of cuts, problem instances with more locations naturally require more
cuts to accurately represent the value function u(·). To control the computation required for solving
linear programs, we also make the practical recommendation to set an absolute upper bound on
the number of cuts used in the approximation, with older cuts dropped as needed.
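A sketch (ours, not the paper's code) of the redundancy test based on (29), using scipy's LP solver; the `cuts` list of tuples (y_k, γ_k, a_k, b_k, c_k) is an assumed data structure.

    import numpy as np
    from scipy.optimize import linprog

    def cut_is_redundant(k, cuts, N, tol=1e-9):
        """Cut k is redundant (D_k is empty) when the optimal value of (29) is positive."""
        yk, gk, ak, bk, ck = cuts[k]
        n = len(yk)
        obj = np.zeros(2 * n + 1)
        obj[-1] = 1.0                             # minimize t (last variable)
        A_ub, b_ub = [], []
        for l, (yl, gl, al, bl, cl) in enumerate(cuts):
            if l == k:
                continue
            # g_l(y, gamma) - g_k(y, gamma) - t <= 0, rearranged into A_ub x <= b_ub.
            row = np.concatenate([al - ak, bl - bk, [-1.0]])
            rhs = (al @ yl + bl @ gl - cl) - (ak @ yk + bk @ gk - ck)
            A_ub.append(row)
            b_ub.append(rhs)
        if not A_ub:
            return False                          # a single cut cannot be redundant
        A_eq = np.concatenate([np.ones(2 * n), [0.0]])[None, :]   # e'y + e'gamma = N
        bounds = [(0, None)] * (2 * n) + [(None, None)]
        res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      A_eq=A_eq, b_eq=[N], bounds=bounds, method="highs")
        return res.fun > tol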
We now propose a more effective method of sampling states for the ADP algorithm beyond the
naive choice of a uniform distribution over ∆. Our tests indicate that, especially when the number
of locations is large, uniform sampling is unable to prioritize the sampling in important regions of
the state space (for example, states with large γ are unlikely to occur in problem instances where
the return probability is high). A reasonable strategy is to periodically simulate the ADP policy
(i.e., the one implied by the current value function approximation) and collect the set of states
visited under this policy — termed a replay buffer. On future iterations, we can sample a portion
of states at which to compute cuts directly from the replay buffer. This idea is based on the notion
of experience replay within the reinforcement learning literature (see Lin [1992] and also Mnih et al.).
In our numerical experiments, we consider problems with up to n = 10 locations, i.e., a dynamic program with a 20-dimensional continuous state space.
ρ = 0.95, the repositioning costs to be c_min = c_max = 1, and the lost sales cost as β = 2c_max = 2.^11
11 A relatively high lost sales cost reflects the attitude of customers that on-demand rental services should be convenient. Therefore, the firm bears the risk of customers leaving the platform when they are inconvenienced by low supply.
We generate the M demand and return probability samples as follows. With each location i, we associate a truncated
normal demand distribution (so that it is nonnegative) with mean ν_i and standard deviation σ_i.
The ν_i are drawn from a uniform distribution and then normalized so that Σ_i ν_i = 0.3. We then
set σ_i = ν_i so that locations with higher mean demand are also more volatile. Next, we follow
Assumption 1 and sample one outcome of a matrix (q̃ij ) such that each row is chosen uniformly
from a standard simplex. Each of the M samples of the return probability matrix consists of (q̃ij )
multiplied by a random scaling factor drawn from Uniform(0.7, 0.9).12 Hence, we have pmin = 0.7.
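For reference, the instance-generation procedure described above can be sketched as follows (our own illustration; all names are ours, and clipping at zero is used as a crude stand-in for the truncated normal distribution):

    import numpy as np

    def generate_samples(n, M, rng=np.random.default_rng(0)):
        nu = rng.uniform(size=n)
        nu *= 0.3 / nu.sum()                       # normalize so the means sum to 0.3
        sigma = nu.copy()                          # higher-demand locations are more volatile
        Q = rng.dirichlet(np.ones(n), size=n)      # one row per origin, each on the simplex
        demands, returns = [], []
        for _ in range(M):
            d = np.maximum(rng.normal(nu, sigma), 0.0)   # clipped normal demand
            p = rng.uniform(0.7, 0.9)                    # total return fraction this period
            demands.append(d)
            returns.append(p * Q)
        return np.array(demands), np.array(returns)

    d_samples, P_samples = generate_samples(n=5, M=1000)   # shapes (M, n) and (M, n, n)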
In our experiments, we compare the Repositioning-ADP (R-ADP) policy to the myopic (Myo.)
policy (i.e., the policy associated with v(·) = 0) and the baseline no-repositioning (No-R) policy.
We use a maximum of 1000 cuts for all problem instances and we run the Repositioning-ADP
algorithm for 10,000 iterations for n ≤ 6 and for 20,000 iterations for n = 7, 8, 9, 10. We initially
sample 80% of states randomly¹³ and 20% of states from the replay buffer of the myopic policy. As
the algorithm progresses, we transition toward a distribution of 20% randomly, 0% from the myopic
replay buffer, and 80% from the current ADP replay buffer. Note that Assumption 3 is satisfied for
this sampling scheme. Redundancy checks are performed every 250 iterations. The performance
of the ADP algorithm is evaluated using Monte-Carlo simulation over 500 sample paths (across 20
initial states, randomly sampled subject to zero outstanding rentals) at various times during the
training process. Since the ADP algorithm itself is random, we repeat the training process 10 times.
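The schedule used to shift the sampling mixture can be made concrete as follows; the start and end proportions are the values reported above, while the linear interpolation between them is our own illustrative assumption.

```python
# Sketch of the sampling-mixture schedule used during training.
def sampling_weights(iteration, total_iterations):
    """Return (uniform, myopic-buffer, ADP-buffer) sampling probabilities, moving
    from (0.8, 0.2, 0.0) at the start toward (0.2, 0.0, 0.8) by the end."""
    frac = min(iteration / total_iterations, 1.0)
    start, end = (0.8, 0.2, 0.0), (0.2, 0.0, 0.8)
    return tuple((1 - frac) * s + frac * e for s, e in zip(start, end))
```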
The results¹⁴ are summarized in Table 1. The first column ‘n & Dim.’ shows the number of
locations and the dimension of the state. The second column ‘Sec./Iter.’ shows the CPU time on a
4 GHz Intel Core i7 processor using four cores, which includes the time needed to remove cuts and
generate the replay buffer. The ‘R-ADP Cost’ column refers to the average cost attained by the ADP
¹² This roughly corresponds to an average rental duration between 1.1 and 1.5 periods, which is reasonable for systems with frequent repositioning.
¹³ Each sampled state is given by (ỹ, γ̃) = (ξ y′, (1 − ξ) γ′) ∈ ∆, where y′ and γ′ are independent uniform samples from ∆_{n−1}(N) and ξ ∼ Uniform(p_min(M), p_max(M)), where p_min(M) and p_max(M) are the minimum and maximum row sums of the return fraction matrix over the M samples. This sampling scheme can be considered a nearly uniform sample over the state space, except that the two parts of the state are re-scaled by relevant problem parameters so that they are more likely to fall in important regions.
¹⁴ The specific realizations of the randomly generated parameters used in each instance are available upon request. The same random seed is used in all instances (i.e., all n) to generate the problem parameters.
policy; the ‘% Decr. Myo’ and ‘% Decr. No-R’ columns refer to the percentage improvement (cost
decrease) that the ADP policy achieves over the myopic policy and the baseline no-repositioning
policy, respectively. Significantly lower costs are observed in all instances: 19%–56% against the
The ‘% to LB’ column should be interpreted as an optimality gap metric, computed as the
percentage of the lower bound (LB) achieved when the baseline no-repositioning policy is set as
In terms of wall clock time, we observe that our ADP algorithm produces near-optimal results
for n ≤ 6 within an hour (for n = 6, this corresponds to 0.29 × 10,000 seconds, or about 48 minutes). For the
larger problems of n ≥ 7, when provided a limited amount of computation — around three hours
for 20,000 iterations — the estimated optimality gap is slightly larger, between 12%–17%. Figure 2
shows the performance of the ADP policy as the algorithm progresses, along with 95% confidence
intervals and lower bounds. We believe that the optimality gaps for the larger problems can be
reduced with additional computation, as suggested by the convergence plots of Figure 2. Our
informal experiments suggest that reasonable solutions might be attainable for problems of up to
n = 30 locations if computation time on the order of a few days is allowed (especially if parallel computation is used). In addition, reducing the number of demand and return fraction samples could facilitate the convergence of Repositioning-ADP on larger problems beyond the 20-dimensional cases we tested. Approximating MDPs of larger dimension than this is well known to be extremely challenging due to the curse of dimensionality; indeed, Lu et al. [2017] approximate a 9-dimensional problem using two-stage stochastic integer programming and He et al. [2018] approximate a 5-dimensional problem using a robust approach.
Finally, we examine the average percentage of total inventory repositioned in each period by
the ADP policy and the myopic policy, respectively. The results are given in the last two columns of Table 1: the repositioning activity of the ADP policy is between 37% (5 locations) and 79% (9 locations) higher than the repositioning activity of the
myopic policy. This suggests that the improvement upon the myopic policy can be attributed to
a considerably more aggressive repositioning strategy. Since the myopic policy does not take into
account customers’ return behaviors, the additional repositioning activity observed in the ADP
policy can be explained by its attempt to plan for the future by counteracting the effects of P .
This leads to an important question: in which situations are the effects of P worthwhile to consider
and in which situations do they not matter? In other words, when does the myopic policy perform
well? We investigate these and other practical questions in the next section.
! "
4αq
αp 1 −
1 − αp 5 Parameter Name Value Range
1 n, number of locations 5 –
N , total inventory 1.0 –
5 2 ρ, discount factor 0.95 –
β, lost sales cost 4.5 –
cij , repositioning cost {1, 2} –
!α "
q
αp αν , demand mean 0.3 [0.1, 0.5]
5
ασ , demand volatility 1.0 [0.5, 1.5]
4 3 αp , return fraction 0.75 [0.4, 1.0]
repo. cost per link = 1 αq , return uniformity 0.5 [0.0, 1.0]
Figure 3: Network Structure used in Section 7.5 Table 2: Parameter Values used in Section 7.5
Here, we aim to compare the ADP, myopic, and no-repositioning policies across a range of parameter
settings, with the goal of studying the impacts of (1) total demand, (2) demand volatility, (3) rental
duration (i.e., fraction of products returned per period), and (4) uniformity of return locations. We
again use N = 1 and set ρ = 0.95. Due to the large number of MDP instances that we need
to solve, we only consider n = 5 locations, creating a set of 10-dimensional MDPs. Let ν̃i = 0.2
for i = 1, 2, . . . , 5 and set the mean demand at each location to be ν_i = α_ν ν̃_i for some scaling parameter α_ν, so that Σ_i ν_i = α_ν. Similar to before, we set σ_i = α_σ ν_i for another scaling parameter
ασ . The repositioning costs cij are illustrated in Figure 3: the cost between adjacent locations is 1
and the cost between non-adjacent locations (e.g., 1 and 3 or 5 and 2) is 2. The return behavior is
governed by two parameters: α_p, the return fraction (i.e., one minus the fraction
of rented products that remain rented), and α_q, which we interpret as the “return uniformity.” For
αq = 1, returns are split evenly between the 5 locations and for αq = 0, products are returned
to their rental origins. These two parameters are also illustrated in Figure 3 for location 1. We
vary each of the scaling parameters αν , ασ , αp , and αq individually; their nominal values and test
ranges are summarized in Table 2. In all solved instances, we use a maximum of 1,000 cuts in the
approximation while the ADP algorithm is run for 10,000 iterations; all other algorithmic settings remain as described earlier.
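The following sketch builds the cost and return matrices for this 5-location instance; it is our own construction following the verbal description and Figure 3, not code from the paper.

```python
# Sketch constructing the Section 7.5 test instance (illustrative code).
import numpy as np

def build_instance(alpha_nu=0.3, alpha_sigma=1.0, alpha_p=0.75, alpha_q=0.5, n=5):
    nu = alpha_nu * np.full(n, 0.2)           # nu_i = alpha_nu * 0.2, so sum = alpha_nu
    sigma = alpha_sigma * nu
    # Repositioning costs: 1 between adjacent locations on the ring, 2 otherwise.
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                ring_dist = min((i - j) % n, (j - i) % n)
                cost[i, j] = 1 if ring_dist == 1 else 2
    # Return fractions: alpha_p*(1 - (n-1)*alpha_q/n) back to the rental location,
    # alpha_p*alpha_q/n to each of the other locations (rows sum to alpha_p).
    P = np.full((n, n), alpha_p * alpha_q / n)
    np.fill_diagonal(P, alpha_p * (1 - (n - 1) * alpha_q / n))
    return nu, sigma, cost, P
```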
Figure 4: Impact of Parameters (Left-Axis & Bar: % Improvement of ADP; Right-Axis & Line: Raw Cost)
There are a few key takeaways from these experiments, which we now summarize. We see that
when the mean demand αν in the system is high (greater than 45% of the total inventory), the
performance of the myopic policy essentially matches that of the ADP policy. This can be explained
by the observation that lost sales in high demand systems are somewhat inevitable, so the impact
of considering return behavior and future cost is diminished. On the other hand, when demand
is between 10% and 40% of the total inventory, a substantial improvement of between 7%–40%
beyond the myopic performance is observed. The largest improvement, 40%, is seen for α_ν = 0.2.
Demand volatility has a similarly strong effect on raw costs, with the cost of the ADP
policy ranging from 0.54 for α_σ = 0.5 and growing roughly tenfold to 5.90 for α_σ = 1.5. We also observe
that the gap between the ADP and myopic policies shrinks for higher volatility systems.
The latter two plots are related to the return behavior parameters αp and αq . Although the cost
decreases when the return fraction αp increases (as there is more inventory with which to satisfy
demand), the improvement upon the myopic policy increases. Intuitively, given more available
inventory due to fewer ongoing rentals, the ADP policy has more “opportunities” to reposition and
plan for future periods. We also see that return uniformity αq tends to increase the cost under the
ADP and myopic policies, but interestingly, if no-repositioning is used, the cost is actually reduced
as αq increases in the range of [0, 0.6]. This can perhaps be explained by the “natural repositioning”
induced by the return behaviors, an effect that disappears when active repositioning (i.e., ADP or
myopic) is allowed. Similar to the case of demand volatility, the gap between the ADP and myopic policies narrows as α_q increases.
7.6 Discussion
With free-floating car sharing systems like Car2Go operating in cities like Berlin, Chicago, Vancouver, and Seattle, the number of locations (that might be aggregated by, e.g., zip or postal code)
in the model can become larger. We describe two aggregation-based approaches that can be used
to integrate our model and methodology into problems that are of potentially very large scale.
1. The first is to simply segment the service region into subregions of manageable size and
solve independent MDPs in each region. To justify such an approximation, we note that
high repositioning costs for locations that are geographically far from one another encourage
repositioning to occur between nearby stations. However, one drawback is that repositioning
would not be able to occur across borders of the subregions, even if two locations are close.
This could be, in part, addressed by experimenting with different boundaries for regions.
2. The second possible solution is hierarchical aggregation, where nearby locations are grouped
with varying levels of granularity. Our approach can be directly applied to the most coarsely
aggregated system. With the coarse repositioning solution fixed (a heuristic could be used to disaggregate it among the locations within each group), the model can then be solved again at a finer level of granularity.
8 Conclusion
In this paper, we consider the problem of optimal repositioning of inventory in a product rental net-
work with multiple locations and where demand, rental periods, and return locations are stochastic.
We show that the optimal policy is specified in terms of a region in the state space, inside of which
it is optimal not to carry out any repositioning and outside of which it is optimal to reposition
inventory. We also prove that when repositioning, it is always optimal to do so such that the system
moves to a new state that is on the boundary of the no-repositioning region and provide a simple
check for when a state is in the no-repositioning region. We then propose a provably convergent
approximate dynamic programming algorithm that relies on an approximation of the convex value function by iteratively adding hyperplanes. Numerical experiments on instances with up to ten locations illustrate the effectiveness of the approach.

References
S. Banerjee, D. Freund, and T. Lykouris. Pricing and optimization in shared vehicle systems: An
approximation framework. arXiv preprint arXiv:1608.06819, 2017.
A. Braverman, J. G. Dai, X. Liu, and L. Ying. Empty-car routing in ridesharing systems. arXiv
preprint arXiv:1609.07219, 2018.
Z.-L. Chen and W. B. Powell. Convergent cutting-plane and partial-sampling algorithm for multi-
stage stochastic linear programs with recourse. Journal of Optimization Theory and Applications,
102(3):497–524, 1999.
H. Chung, D. Freund, and D. B. Shmoys. Bike Angels: An analysis of Citi Bike’s incentive
program. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable
Societies. ACM, 2018.
D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and
temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589–608,
2000.
D. Freund, S. G. Henderson, and D. B. Shmoys. Bike sharing. In M. Hu, editor, Sharing Economy:
Making Supply Meet Demand. Springer, 2019.
C. Fricker and N. Gast. Incentives and redistribution in homogeneous bike-sharing systems with
stations of finite capacity. EURO Journal on Transportation and Logistics, 5(3):261–291, 2016.
G. A. Godfrey and W. B. Powell. An adaptive dynamic programming algorithm for dynamic fleet
management, i: Single period travel times. Transportation Science, 36(1):21–39, 2002.
L. He, H.-Y. Mak, Y. Rong, and Z.-J. M. Shen. Service region design for urban electric vehicle
sharing systems. Manufacturing & Service Operations Management, 19(2):309–327, 2017.
L. He, Z. Hu, and M. Zhang. Robust repositioning for vehicle sharing. Manufacturing & Service
Operations Management (Forthcoming), 2018.
L. He, H.-Y. Mak, and Y. Rong. Operations management of vehicle sharing systems. In M. Hu,
editor, Sharing Economy: Making Supply Meet Demand. Springer, 2019.
J. L. Higle and S. Sen. Stochastic decomposition: An algorithm for two-stage linear programs with
recourse. Mathematics of Operations Research, 16(3):650–669, 1991.
A. Kabra, E. Belavina, and K. Girotra. Bike-share systems: Accessibility and availability. Chicago
Booth Research Paper No. 15-04. Available at SSRN: https://ssrn.com/abstract=2555671, 2018.
H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications,
volume 35. Springer, 2003.
C.-Y. Lee and Q. Meng. Handbook of Ocean Container Transport Logistics. Springer, 2015.
J. Li, C. S. Leung, Y. Wu, and K. Liu. Allocation of empty containers between multi-ports.
European Journal of Operational Research, 182(1):400–412, 2007.
Y. Li, Y. Zheng, and Q. Yang. Dynamic bike reposition: A spatio-temporal reinforcement learning
approach. In Proceedings of the 24th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining. ACM, 2018.
L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching.
Machine Learning, 8(3-4):293–321, 1992.
J. Liu, L. Sun, W. Chen, and H. Xiong. Rebalancing bike sharing systems: A multi-source data
smart optimization. In Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Dis-
covery and Data Mining, pages 1005–1014. ACM, 2016.
M. Lu, Z. Chen, and S. Shen. Optimizing the profitability and quality of service in carshare systems
under demand uncertainty. Manufacturing & Service Operations Management, 20(2):162–180,
2017.
R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine
Learning Research, 9(May):815–857, 2008.
R. Nair and E. Miller-Hooks. Fleet management for vehicle sharing operations. Transportation
Science, 45(4):524–540, 2011.
J. M. Nascimento and W. B. Powell. Dynamic programming models and algorithms for the mutual
fund cash balance problem. Management Science, 56(5):801–815, 2010.
E. O’Mahony and D. B. Shmoys. Data analysis and optimization for (citi) bike sharing. In AAAI,
pages 687–694, 2015.
A. B. Philpott and Z. Guan. On the convergence of stochastic dual dynamic programming and
related methods. Operations Research Letters, 36(4):450–455, 2008.
W. B. Powell and T. A. Carvalho. Dynamic control of logistics queueing network for large-scale
fleet management. Transportation Science, 32(2):90–109, 1998.
T. Raviv and O. Kolka. Optimal inventory management of a bike-sharing station. IIE Transactions,
45(10):1077–1093, 2013.
R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1st edition,
1970.
J. Shu, M. C. Chou, Q. Liu, C.-P. Teo, and I.-L. Wang. Models for effective deployment and
redistribution of bicycles within public bicycle-sharing systems. Operations Research, 61(6):
1346–1359, 2013.
C. Shui and W. Szeto. Dynamic green bike repositioning problem–a hybrid rolling horizon artificial
bee colony algorithm approach. Transportation Research Part D: Transport and Environment,
60:119–136, 2018.
D. P. Song. Optimal threshold control of empty vehicle redistribution in two depot service systems.
IEEE Transactions on Automatic Control, 50(1):87–90, 2005.
J. Warrington, P. N. Beuchat, and J. Lygeros. Generalized dual dynamic programming for infinite
horizon problems in continuous state and action spaces. arXiv preprint arXiv:1711.07222, 2018.
In this subsection, we provide a complete and self-contained proof for our main result, Theorem 9.
Lemma A.1. If u(x, γ) is jointly convex in x and γ, then the directional derivative u′(x, γ; z, η) satisfies the standard properties of directional derivatives of convex functions; in particular, it is subadditive in the direction: u′(x, γ; z¹ + z², η¹ + η²) ≤ u′(x, γ; z¹, η¹) + u′(x, γ; z², η²).
Proof. These are well-known properties of directional derivatives of convex functions (see, for example,
Rockafellar [1970]).
Lemma A.2. A function f : (a, b) → R is convex if and only if it is continuous with increasing
left- or right-derivative.
Proof. The “only if” part is clear. For the “if” part, we assume f is continuous with increasing right-derivative; the proof for the left-derivative is similar. It is well known that a function on an open set in Rⁿ is convex if and only if there exists a subgradient at every point. So, it suffices to show that f′(x+) is a subgradient at x for every x ∈ (a, b). Let g_x(y) = f′(x+)(y − x) + f(x).
We first show that f(y) ≥ g_x(y) if f′(x+) is strictly increasing. To show this, let h_x(y) = g_x(y) − ε for some ε > 0. We claim that f(y) ≥ h_x(y). Suppose this is not true. Then there exists z ∈ (a, b) such that (f − h_x)(z) < 0. If z > x, let c = sup{d ≥ x | (f − h_x)(y) ≥ 0 for y ∈ [x, d]}. Note that, by continuity, x < c < z and (f − h_x)(c) = 0; and, by construction, for any d > c there exists y ∈ (c, d) such that (f − h_x)(y) < 0. So, there exists a decreasing sequence y_n such that y_n → c and (f − h_x)(y_n) < 0. It follows that (f − h_x)′(c+) = f′(c+) − f′(x+) ≤ 0. This contradicts the assumption that f′(·+) is strictly increasing. On the other hand, if z < x, let c = inf{d ≤ x | (f − h_x)(y) ≥ 0 for y ∈ [d, x]}. Then z < c < x, (f − h_x)(c) = 0, and there exists a decreasing sequence y_n such that y_n → c and (f − h_x)(y_n) ≥ 0. Therefore, (f − h_x)′(c+) = f′(c+) − f′(x+) ≥ 0, which again contradicts the strict monotonicity of f′(·+) since c < x. Since ε > 0 was arbitrary, letting ε ↓ 0 yields f(y) ≥ g_x(y).
Now, suppose f′(x+) is increasing, and let h(x) = f(x) + (ε/2)x² for some ε > 0. Then h′(x+) = f′(x+) + εx is strictly increasing. By the first part of the proof, h(y) = f(y) + (ε/2)y² ≥ h′(x+)(y − x) + h(x); letting ε ↓ 0 gives f(y) ≥ f′(x+)(y − x) + f(x) = g_x(y), as required.
Proof. The “only if” part is always true for a convex function on Rⁿ. For the “if” part, note that
g : (a, b) → R is convex if and only if it is continuous with increasing left- or right-derivative (Lemma A.2). It follows that if g : [a, b] → R is continuous and piecewise convex on [a₀, a₁], [a₁, a₂], . . . , [a_{m−1}, a_m], where a = a₀ < · · · < a_m = b, then to show g is convex, we only need to show that g′(x−) ≤ g′(x+) for x ∈ (a, b). To apply this argument to f, consider its restriction to a line, φ(s) = f(y + sz) for s in an interval J; then φ(s) is convex if φ′(s−) ≤ φ′(s+) for s ∈ J. Setting x = y + sz, we have φ′(s−) = −f′(x; −z) and φ′(s+) = f′(x; z). It follows that −f′(x; −z) ≤ f′(x; z) implies that f is convex.
Lemma A.4. For y ∈ Rⁿ, we let J⁻(y) = {i | y_i < 0}, J⁰(y) = {i | y_i = 0}, J⁺(y) = {i | y_i > 0}, J⁰⁻(y) = J⁰(y) ∪ J⁻(y), and J⁰⁺(y) = J⁰(y) ∪ J⁺(y). Then

  U′_{t,d,P}(y, γ; z, η) = −β Σ_{i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁻(z))} z_i + ρ v′_{t+1}(x, ζ; w⁺, δ⁺),    (31)

where

  w⁺_i = z_i + ι_i + Σ_{j ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁻(z))} z_j p_{ji}   for i ∈ J⁺(y−d) ∪ (J⁰(y−d) ∩ J⁺(z)),
  w⁺_i = ι_i + Σ_{j ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁻(z))} z_j p_{ji}         for i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁰⁻(z)),

and

  δ⁺_i = η_i (1 − Σ_{j=1}^n p_{ij})            for i ∈ J⁺(y−d) ∪ (J⁰(y−d) ∩ J⁺(z)),
  δ⁺_i = (η_i + z_i)(1 − Σ_{j=1}^n p_{ij})     for i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁰⁻(z)),

and x = τ_x(y, γ, d, P), ζ = τ_γ(y, γ, d, P).

Proof. Let

  ϑ_i = Σ_{j=1}^n γ_j p_{ji},    ι_i = Σ_{j=1}^n η_j p_{ji}.
Note that

  L(y, d) = β Σ_{i ∈ J⁻(y−d)} (d_i − y_i),    (32)

and let the next state, under (d, P), be defined by x(y, γ), ζ(y, γ), with components

  x_i(y, γ) = (y_i − d_i) + ϑ_i + Σ_{j ∈ J⁺(y−d)} d_j p_{ji} + Σ_{j ∈ J⁰⁻(y−d)} y_j p_{ji}   for i ∈ J⁺(y − d),
  x_i(y, γ) = ϑ_i + Σ_{j ∈ J⁺(y−d)} d_j p_{ji} + Σ_{j ∈ J⁰⁻(y−d)} y_j p_{ji}                 for i ∈ J⁰⁻(y − d),    (33)

and

  ζ_i(y, γ) = (γ_i + d_i)(1 − Σ_{j=1}^n p_{ij})   for i ∈ J⁺(y − d),
  ζ_i(y, γ) = (γ_i + y_i)(1 − Σ_{j=1}^n p_{ij})   for i ∈ J⁰⁻(y − d).    (34)
For t > 0 sufficiently small, we have

  J⁰(y + tz − d) = J⁰(y − d) ∩ J⁰(z),
  J⁺(y + tz − d) = J⁺(y − d) ∪ (J⁰(y − d) ∩ J⁺(z)),    (35)
  J⁰⁻(y + tz − d) = J⁻(y − d) ∪ (J⁰(y − d) ∩ J⁰⁻(z)),

where the last equation follows directly from the first and second. For y + tz, we have, directly by (32), that

  L(y + tz, d) = β Σ_{i ∈ J⁻(y+tz−d)} (d_i − y_i − t z_i),

and

  x_i(y + tz, γ + tη) = y_i + t z_i − d_i + ϑ_i + t ι_i + Σ_{j ∈ J⁺(y+tz−d)} d_j p_{ji} + Σ_{j ∈ J⁰⁻(y+tz−d)} (y_j + t z_j) p_{ji}   for i ∈ J⁺(y + tz − d),
  x_i(y + tz, γ + tη) = ϑ_i + t ι_i + Σ_{j ∈ J⁺(y+tz−d)} d_j p_{ji} + Σ_{j ∈ J⁰⁻(y+tz−d)} (y_j + t z_j) p_{ji}                     for i ∈ J⁰⁻(y + tz − d),

and

  ζ_i(y + tz, γ + tη) = (γ_i + t η_i + d_i)(1 − Σ_{j=1}^n p_{ij})            for i ∈ J⁺(y + tz − d),
  ζ_i(y + tz, γ + tη) = (γ_i + t η_i + y_i + t z_i)(1 − Σ_{j=1}^n p_{ij})    for i ∈ J⁰⁻(y + tz − d).

It follows that

  L(y + tz, d) − L(y, d) = −β t Σ_{i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁻(z))} z_i.
Set

  w⁺ = [x(y + tz, γ + tη) − x(y, γ)] / t,    δ⁺ = [ζ(y + tz, γ + tη) − ζ(y, γ)] / t.

Then

  lim_{t→0} [v_{t+1}(x(y + tz, γ + tη), ζ(y + tz, γ + tη)) − v_{t+1}(x(y, γ), ζ(y, γ))] / t = v′_{t+1}(x, ζ; w⁺, δ⁺).

It follows that

  U′_{t,d,P}(y, γ; z, η) = −β Σ_{i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁻(z))} z_i + ρ v′_{t+1}(x, ζ; w⁺, δ⁺).    (36)
Next we show that, given the convexity of v_{t+1}(x_{t+1}, γ_{t+1}) and certain bounds on its directional derivatives, the function u_t(y_t, γ_t) not only is convex (Proposition A.5), but also satisfies two types of directional-derivative bounds (Proposition A.6). Specifically, suppose that:
1. v_{t+1}(x_{t+1}, γ_{t+1}) is continuous and jointly convex in x_{t+1} and γ_{t+1};
2. v′_{t+1}(x_{t+1}, γ_{t+1}; z − η, η) ≤ C(−z) + (β + ρ c_max − c_min) Σ_{i=1}^n η_i for any feasible direction (z − η, η) with η ≥ 0.
represents γ_{t+1}(y, γ). The continuity of u(·) follows from the Dominated Convergence Theorem, as U_{d,p}(y, γ) ≤ β Σ_i d_i + ρ ‖v‖_∞. To conclude that U_{d,p}(·) is convex, from Proposition A.3, it suffices to show that

  −U′_{d,p}(y, γ; −z, −η) ≤ U′_{d,p}(y, γ; z, η)   ∀ (y, γ) ∈ ∆, (z, η) ∈ R^{2n} with e^T(z + η) = 0.

We first derive U′_{d,p}(y, γ; z, η) and U′_{d,p}(y, γ; −z, −η) for (y, γ) ∈ ∆°, so every direction is feasible.
Following the derivation in Lemma A.4, we obtain

  U′_{d,p}(y, γ; z, η) = −β Σ_{i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁻(z))} z_i + ρ v′(x, γ; w⁺, δ⁺),    (37)

and

  U′_{d,p}(y, γ; −z, −η) = β Σ_{i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁺(z))} z_i + ρ v′(x, γ; w⁻, δ⁻),

where

  w⁺_i = z_i + ι_i + Σ_{j ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁻(z))} z_j p_{ji}    for i ∈ J⁺(y−d) ∪ (J⁰(y−d) ∩ J⁺(z)),
  w⁺_i = ι_i + Σ_{j ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁻(z))} z_j p_{ji}          for i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁰⁻(z)),

  w⁻_i = −z_i − ι_i − Σ_{j ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁺(z))} z_j p_{ji}   for i ∈ J⁺(y−d) ∪ (J⁰(y−d) ∩ J⁻(z)),
  w⁻_i = −ι_i − Σ_{j ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁺(z))} z_j p_{ji}         for i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁰⁺(z)),

and

  δ⁺_i = η_i (1 − Σ_{j=1}^n p_{ij})           for i ∈ J⁺(y−d) ∪ (J⁰(y−d) ∩ J⁺(z)),
  δ⁺_i = (η_i + z_i)(1 − Σ_{j=1}^n p_{ij})    for i ∈ J⁻(y−d) ∪ (J⁰(y−d) ∩ J⁰⁻(z)).
Therefore,

  w⁺_i + w⁻_i = z_i + Σ_{j ∈ J⁰(y−d) ∩ J⁻(z)} z_j p_{ji} − Σ_{j ∈ J⁰(y−d) ∩ J⁺(z)} z_j p_{ji}     for i ∈ J⁰(y − d) ∩ J⁺(z),
  w⁺_i + w⁻_i = −z_i + Σ_{j ∈ J⁰(y−d) ∩ J⁻(z)} z_j p_{ji} − Σ_{j ∈ J⁰(y−d) ∩ J⁺(z)} z_j p_{ji}    for i ∈ J⁰(y − d) ∩ J⁻(z),
  w⁺_i + w⁻_i = Σ_{j ∈ J⁰(y−d) ∩ J⁻(z)} z_j p_{ji} − Σ_{j ∈ J⁰(y−d) ∩ J⁺(z)} z_j p_{ji}           otherwise,

and

  δ⁺_i + δ⁻_i = −z_i (1 − Σ_{j=1}^n p_{ij})   for i ∈ J⁰(y − d) ∩ J⁺(z),
  δ⁺_i + δ⁻_i = z_i (1 − Σ_{j=1}^n p_{ij})    for i ∈ J⁰(y − d) ∩ J⁻(z),
  δ⁺_i + δ⁻_i = 0                              otherwise.

We now define

  w^I_i = Σ_{j=1}^n z_i p_{ij} + Σ_{j ∈ J⁰(y−d) ∩ J⁻(z)} z_j p_{ji} − Σ_{j ∈ J⁰(y−d) ∩ J⁺(z)} z_j p_{ji}    for i ∈ J⁰(y − d) ∩ J⁺(z),
  w^I_i = −Σ_{j=1}^n z_i p_{ij} + Σ_{j ∈ J⁰(y−d) ∩ J⁻(z)} z_j p_{ji} − Σ_{j ∈ J⁰(y−d) ∩ J⁺(z)} z_j p_{ji}   for i ∈ J⁰(y − d) ∩ J⁻(z),
  w^I_i = Σ_{j ∈ J⁰(y−d) ∩ J⁻(z)} z_j p_{ji} − Σ_{j ∈ J⁰(y−d) ∩ J⁺(z)} z_j p_{ji}                            otherwise,

and

  w^O_i = z_i (1 − Σ_{j=1}^n p_{ij})     for i ∈ J⁰(y − d) ∩ J⁺(z),
  w^O_i = −z_i (1 − Σ_{j=1}^n p_{ij})    for i ∈ J⁰(y − d) ∩ J⁻(z),
  w^O_i = 0                               otherwise.
Then,

  U′_{d,p}(y, γ; z, η) + U′_{d,p}(y, γ; −z, −η)
    = β Σ_{i∈J⁰(y−d)} |z_i| + ρ v′(x, γ; w⁺, δ⁺) + ρ v′(x, γ; w⁻, δ⁻)
    ≥ β Σ_{i∈J⁰(y−d)} |z_i| + ρ v′(x, γ; w⁺ + w⁻, δ⁺ + δ⁻)    (by subadditivity, Lemma A.1)
    = β Σ_{i∈J⁰(y−d)} |z_i| + ρ v′(x, γ; w^I + w^O, −w^O)
    ≥ β Σ_{i∈J⁰(y−d)} |z_i| − ρ v′(x, γ; −w^I − w^O, w^O)    (by the convexity of v, Proposition A.3)
    ≥ β Σ_{i∈J⁰(y−d)} |z_i| − ρ ( (c_max/2) Σ_{i=1}^n |w^I_i| + (β + ρ c_max − c_min) Σ_{i=1}^n |w^O_i| )    (by Lemma A.8 and Lemma 3)
    ≥ β Σ_{i∈J⁰(y−d)} |z_i| − ρ c_max p Σ_{i∈J⁰(y−d)} |z_i| − ρ (β + ρ c_max − c_min)(1 − p) Σ_{i∈J⁰(y−d)} |z_i|    (by the triangle inequality)
    = β Σ_{i∈J⁰(y−d)} |z_i| − (p ρ c_max + ρ (β + ρ c_max − c_min)(1 − p)) Σ_{i∈J⁰(y−d)} |z_i|
    ≥ β Σ_{i∈J⁰(y−d)} |z_i| − (p ρ c_max + (β + ρ c_max − c_min)(1 − p)) Σ_{i∈J⁰(y−d)} |z_i|    (since ρ ≤ 1)
    = β Σ_{i∈J⁰(y−d)} |z_i| − (β + (ρ c_max − c_min) − p(β − c_min)) Σ_{i∈J⁰(y−d)} |z_i|
    ≥ β Σ_{i∈J⁰(y−d)} |z_i| − β Σ_{i∈J⁰(y−d)} |z_i|    (by Assumption 2)
    ≥ 0.
Therefore, Ud,p (·) is convex on ∆◦ . Since ∆ is locally simplicial, the continuous extension of Ud,p (·)
from S ◦ to S must be convex (See for example Rockafellar [1970] Theorem 10.3). Thus u(·) is
convex.
Proposition A.6. Suppose that:
1. v_{t+1}(x_{t+1}, γ_{t+1}) is continuous and jointly convex in x_{t+1} and γ_{t+1};
2. v′_{t+1}(x_{t+1}, γ_{t+1}; z − η, η) ≤ C(−z) + (β + ρ c_max − c_min) Σ_{i=1}^n η_i for any feasible direction (z − η, η) with η ≥ 0;
3. v′_{t+1}(x_{t+1}, γ_{t+1}; z + η, −η) ≤ C(−z) + β Σ_{i=1}^n η_i for any feasible direction (z + η, −η) with η ≥ 0;
4. v′_{t+1}(x_{t+1}, γ_{t+1}; 0, z) ≤ (ρ c_max / 2) Σ_{i=1}^n |z_i| for any feasible direction (0, z) with e^T z = 0.
Then we have:
1. |u′_t(y_t, γ_t; ∓ξ, ±ξ)| ≤ β Σ_{i=1}^n ξ_i for all (y_t, γ_t) ∈ ∆ and any feasible direction (∓ξ, ±ξ) with ξ ≥ 0;
2. u′_t(y_t, γ_t; 0, z) ≤ (ρ c_max / 2) Σ_{i=1}^n |z_i| for all (y_t, γ_t) ∈ ∆ and any feasible direction (0, z) with e^T z = 0.
Proof. We omit the subscript t to reduce notation. To show the first inequality, we start by showing that u′(y, γ; −ξ, ξ) ≤ β Σ_i ξ_i. From Lemma A.4, noting that −ξ ≤ 0, we have

  U′_{d,p}(y, γ; −ξ, ξ) = β Σ_{i ∈ J⁻(y−d) ∪ J⁰(y−d)} ξ_i + ρ v′(x, γ; w − δ, δ),

where

  w_i = −ξ_i Σ_{j=1}^n p_{ij} + Σ_{j ∈ J⁺(y−d)} ξ_j p_{ji}   for i ∈ J⁺(y − d),
  w_i = Σ_{j ∈ J⁺(y−d)} ξ_j p_{ji}                            for i ∈ J⁻(y − d) ∪ J⁰(y − d),

and

  δ_i = ξ_i (1 − Σ_{j=1}^n p_{ij}) = ξ_i (1 − p)   for i ∈ J⁺(y − d),
  δ_i = 0                                            for i ∈ J⁻(y − d) ∪ J⁰(y − d).
Note that e^T w = 0, and so it is clear that (w − δ, δ) is a feasible direction at (x, ζ). It follows that

  U′_{d,p}(y, γ; −ξ, ξ) = β Σ_{i ∈ J⁻(y−d) ∪ J⁰(y−d)} ξ_i + ρ v′(x, γ; w − δ, δ),

and, bounding the last term with the assumed directional-derivative bounds (as in the analogous chain of inequalities below),

  U′_{d,p}(y, γ; −ξ, ξ) ≤ β Σ_{i=1}^n |ξ_i| = β Σ_{i=1}^n ξ_i

holds for each (y, γ) ∈ ∆ and each feasible direction (z − η, η). It follows that u′(y, γ; −ξ, ξ) = ∫ U′_{d,p}(y, γ; −ξ, ξ) dµ ≤ β Σ_i ξ_i. From Lemma A.5, u(y, γ) is convex, thus

  u′(y, γ; ξ, −ξ) ≥ −u′(y, γ; −ξ, ξ) ≥ −β Σ_i ξ_i.
Now we show that u′(y, γ; ξ, −ξ) ≤ β Σ_{i=1}^n ξ_i for all (y, γ) ∈ ∆ and every feasible direction (ξ, −ξ) with ξ ≥ 0. From Lemma A.4,

  U′_{d,p}(y, γ; ξ, −ξ) = −β Σ_{i∈J⁻(y−d)} ξ_i + ρ v′(x, ζ; w + δ, −δ),

where

  w_i = ξ_i Σ_{j=1}^n p_{ij} − Σ_{j∈J⁺(y−d)∪J⁰(y−d)} ξ_j p_{ji}   for i ∈ J⁺(y − d) ∪ J⁰(y − d),
  w_i = −Σ_{j∈J⁺(y−d)∪J⁰(y−d)} ξ_j p_{ji}                          for i ∈ J⁻(y − d).
Then,

  U′_{d,p}(y, γ; ξ, −ξ)
    = −β Σ_{i∈J⁻(y−d)} ξ_i + ρ v′(x, γ; w + δ, −δ)
    ≤ β Σ_{i∈J⁻(y−d)} |ξ_i| + (ρ c_max / 2) Σ_{i=1}^n |w_i| + ρ β Σ_{i=1}^n |δ_i|    (from the third inequality of the assumption)
    ≤ β Σ_{i∈J⁻(y−d)} |ξ_i| + (β/2) Σ_{i=1}^n |w_i| + ρ β Σ_{i=1}^n |δ_i|    (since ρ c_max ≤ β, ρ ≤ 1)
    ≤ β Σ_{i∈J⁻(y−d)} |ξ_i| + (ρβ/2) Σ_{i=1}^n 2 Σ_{j∈J⁺(y−d)∪J⁰(y−d)} |ξ_j| p_{ji}
      + ρ β Σ_{i∈J⁺(y−d)∪J⁰(y−d)} |ξ_i| (1 − Σ_{j=1}^n p_{ij})    (by the triangle inequality)
    ≤ β Σ_{i∈J⁻(y−d)} |ξ_i| + ρ β Σ_{i∈J⁺(y−d)∪J⁰(y−d)} |ξ_i| ≤ β Σ_{i=1}^n |ξ_i|.
So, U′_{d,p}(y, γ; ξ, −ξ) ≤ β Σ_{i=1}^n |ξ_i| = β Σ_{i=1}^n ξ_i holds for each (y, γ) ∈ ∆ and each feasible direction (z + η, −η). It follows that u′(y, γ; ξ, −ξ) = ∫ U′_{d,p}(y, γ; ξ, −ξ) dµ ≤ β Σ_i ξ_i. From Lemma A.5, u(y, γ) is convex, thus

  u′(y, γ; −ξ, ξ) ≥ −u′(y, γ; ξ, −ξ) ≥ −β Σ_i ξ_i.
For the second inequality, Lemma A.4 gives

  U′_{d,p}(y, γ; 0, z) = ρ v′(x, γ; ι, z(1 − p)),

where ι_i = Σ_{j=1}^n z_j p_{ji} for all i and p = Σ_{j=1}^n p_{ij} (Assumption 1). Therefore,

  Σ_{i=1}^n ι_i = Σ_{i=1}^n Σ_{j=1}^n z_j p_{ji} = Σ_{j=1}^n p z_j = 0,

so (ι, z(1 − p)) is a feasible direction at (x, γ); applying the assumed directional-derivative bounds (the intermediate steps parallel the previous cases) then yields

  U′_{d,p}(y, γ; 0, z) = ρ v′(x, γ; ι, z(1 − p)) ≤ (ρ c_max / 2) Σ_{i=1}^n |z_i|,

which holds for all realizations (d, p). It follows that the integral u′(y, γ; 0, z) = ∫ U′_{d,p}(y, γ; 0, z) dµ satisfies the same bound.
The upcoming result, Proposition A.8, assists with the eventual induction over t by stating
that if ut (y t , γ t ) is convex and certain bounds on its directional derivatives are satisfied, then
vt (xt , γ t ) not only is convex, but also satisfies the bounds on the directional derivatives required by
Proposition A.6. Before continuing, we need to introduce a technical lemma that carefully analyzes
the optimal repositioning plan. This lemma plays a crucial role in the proof of Proposition A.8.
n
X n n
1X X
ξi + |ξj − ηj | ≤ ηj (38)
2
j=1 j=1 j=1
Proof. Let I = {i | y_i − η_i < 0}. If I = ∅, we have y − η ≥ 0 and we just take ξ = η. Thus, in the remainder of the proof, we assume I ≠ ∅. Note that y_i − x_i < 0 for all i ∈ I, so I ⊆ J⁻.
The interpretation of the first case above is that locations j with more inventory than the target level y_j transfer inventory to locations that are below the target level. The second case is interpreted analogously.
We claim that ξ satisfies the desired properties. We first verify that y_j − ξ_j ≥ 0. This is clear except for j ∈ J⁺, for which

  y_j − ξ_j = y_j − Σ_{i∈I} [(η_i − y_i)/(x_i − y_i)] w_ij − η_j ≥ y_j − Σ_{i∈I} w_ij − η_j ≥ y_j − Σ_i w_ij − η_j = x_j − η_j ≥ 0,

where the first inequality follows from x_i ≥ η_i > y_i for i ∈ I. Therefore, y − ξ ≥ 0 and part (1) is complete. Also, using x_i − y_i = Σ_{j∈J⁺} w_ij for i ∈ I, we have

  Σ_j ξ_j = Σ_{j∈I} y_j + Σ_{j∈J⁻\I} η_j + Σ_{j∈J⁺} ( Σ_{i∈I} [(η_i − y_i)/(x_i − y_i)] w_ij + η_j )
          = Σ_{j∈I} y_j + Σ_{j∈(J⁻\I) ∪ J⁺} η_j + Σ_{i∈I} [(η_i − y_i)/(x_i − y_i)] (x_i − y_i) = Σ_i η_i,
proving part (2). For part (3), note that

  Σ_{j=1}^n |ξ_j − η_j| = 2 Σ_{j∈I} (η_j − y_j) ≤ 2 Σ_{j=1}^n |η_j|.
Next, for each j ∈ J⁺, we have

  y_j − ξ_j − (x_j − η_j) − Σ_i w̄_ij + Σ_i w̄_ji
    = (y_j − x_j) − ξ_j + η_j − Σ_{i∈I} w̄_ij − Σ_{i∉I} w̄_ij
    = Σ_i w_ij − Σ_{i∈I} [(η_i − y_i)/(x_i − y_i)] w_ij − Σ_i w_ij + Σ_{i∈I} [(η_i − y_i)/(x_i − y_i)] w_ij = 0,

where we used Σ_i w̄_ji = 0. Similarly, for all j ∈ I, we have

  x_j − η_j − y_j + ξ_j − Σ_i w̄_ji + Σ_i w̄_ij
    = x_j − η_j − (1 − (η_j − y_j)/(x_j − y_j)) Σ_i w_ji
    = x_j − Σ_i w_ji − η_j − [(η_j − y_j)/(x_j − y_j)] (y_j − x_j) = 0,

and, for all j ∈ J⁻ \ I,

  x_j − η_j − y_j + ξ_j − Σ_i w̄_ji + Σ_i w̄_ij = x_j − y_j − Σ_i w_ji = 0.

Therefore, we have shown that w̄ is a feasible solution to the optimization problem for C(·) defined in (1) with argument y − ξ − x + η, so that

  C(y − ξ − x + η) ≤ Σ_{i,j} c_ij w̄_ij.
If we define ŵ as

  ŵ_ij = [(η_i − y_i)/(x_i − y_i)] w_ij   for i ∈ I, j ∈ J⁺,
  ŵ_ij = 0                                 otherwise,

then ŵ is feasible for the problem defining C(ξ − η), and hence

  Σ_{i∈I} Σ_{j∈J⁺} [(η_i − y_i)/(x_i − y_i)] c_ij w_ij = c · ŵ ≥ C(ξ − η).
Finally, we have

  C(y − ξ − x + η) ≤ C(y − x) − Σ_{i∈I} Σ_{j∈J⁺} [(η_i − y_i)/(x_i − y_i)] c_ij w_ij ≤ C(y − x) − C(ξ − η),
proving part (4). The last claim follows directly from the construction of ξ and parts (2) and
(3).
Proposition A.8. Suppose that:
1. u′_t(y_t, γ_t; ±η, ∓η) ≤ β Σ_{i=1}^n η_i for all (y_t, γ_t) ∈ ∆ and for any feasible direction (±η, ∓η) with η ≥ 0;
2. u′_t(y_t, γ_t; 0, z) ≤ (ρ c_max / 2) Σ_{i=1}^n |z_i| for all (y_t, γ_t) ∈ ∆ and for any feasible direction (0, z) with e^T z = 0.
Then, the value function v_t(·) is convex and continuous in ∆ with Ω_v(γ) = ∆_{n−1}(I) for γ ∈ S.
Moreover, for all (x_t, γ_t) ∈ ∆ and all feasible directions of the indicated form,

  v′_t(x_t, γ_t; z − η, η) ≤ C(−z) + (β + ρ c_max − c_min) Σ_{i=1}^n η_i,    (39)
  v′_t(x_t, γ_t; z + η, −η) ≤ C(−z) + β Σ_{i=1}^n η_i,    (40)
  v′_t(x_t, γ_t; 0, z) ≤ (ρ c_max / 2) Σ_{i=1}^n |z_i|.    (41)
Proof. We omit the subscript t throughout the proof to reduce notation. To show that v(·) is convex, suppose y¹ and y² are optimal solutions of (11) for (x¹, γ¹) and (x², γ²), respectively. For λ ∈ [0, 1], the point λy¹ + (1 − λ)y² is feasible for the state (λx¹ + (1 − λ)x², λγ¹ + (1 − λ)γ²), so

  v(λx¹ + (1 − λ)x², λγ¹ + (1 − λ)γ²) ≤ u(λy¹ + (1 − λ)y², λγ¹ + (1 − λ)γ²) + C(λ(y¹ − x¹) + (1 − λ)(y² − x²)) ≤ λ v(x¹, γ¹) + (1 − λ) v(x², γ²),

by convexity of u(·) and Lemma 1. Continuity follows from Berge’s Maximum Theorem, as the
set-valued map x ↦ ∆_{n−1}(I) is continuous. To show the result in (39), suppose (z − η, η) is a
feasible direction. Let y∗ be an optimal solution to equation (11) at (x, γ), so that v(x, γ) = u(y∗, γ) + C(y∗ − x).
Let t > 0 be small enough such that x + t(z − η) ≥ 0. According to Lemma A.7, there exists a ξ such that C(y∗ − tξ − x − tz + tη) = C(y∗ − x − tz) − tC(ξ − η). Therefore, y∗ − tξ is a feasible solution to equation (11) at (x + t(z − η), γ + tη).
Adding and subtracting u(y ∗ , γ) in the numerator of the first term and then taking limits on both
sides, we get
where the second inequality follows by Lemma A.1 and the third inequality follows by Lemma 3.
Now we show equation (40). Suppose (z + η, −η) is a feasible direction. Again, let y ∗ be an
optimal solution to equation (11) at (x, γ). Then y ∗ + tη is clearly a feasible solution to equation
where the second inequality follows from the subadditivity and the positive homogeneity of C(·)
Finally, to show equation (41), suppose (0, z) is a feasible direction and again let y∗ be an optimal solution to equation (11) at (x, γ). Then y∗ is a feasible solution for (x, γ + tz) and thus,
The proof of Theorem 9 for (u_t(·)) and (v_t(·)) follows from Proposition A.5, Proposition A.6,
Proposition A.8, and induction. Consequently, an optimal policy for each period is provided by
Theorem 4, and the no-repositioning set can be characterized as in Propositions 5 and 6 and Corollary 7.
Proof of Lemma 1: It is clear that the linear program (1) is feasible and bounded. Therefore, an optimal solution to (1) exists and strong duality holds. The dual linear program can be written as

  C(z) = max λ^T z
         subject to λ_j − λ_i ≤ c_ij ∀ i, j.    (42)

For t ≥ 0, positive homogeneity follows from

  C(tz) = max{ t λ^T z : λ_j − λ_i ≤ c_ij ∀ i, j } = t max{ λ^T z : λ_j − λ_i ≤ c_ij ∀ i, j } = t C(z).

As the pointwise supremum of a family of linear (hence convex) and lower semicontinuous functions (λ^T z for each λ), C(·) is also convex and lower semicontinuous. It is well known that a convex function on a locally simplicial convex set is upper semicontinuous (Rockafellar [1970], Theorem 10.2). Therefore, as H is a polyhedron, C(·) must be continuous. From convexity and positive homogeneity, subadditivity follows:

  C(z¹ + z²) = 2 C((1/2) z¹ + (1/2) z²) ≤ 2 ((1/2) C(z¹) + (1/2) C(z²)) = C(z¹) + C(z²).
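As a small numerical illustration of the dual formulation (42), the sketch below (our own code, not the paper's) computes C(z) with an off-the-shelf LP solver; positive homogeneity and subadditivity can then be spot-checked on random directions z with e^T z = 0.

```python
# Compute C(z) via the dual LP (42): C(z) = max{ lambda^T z : lambda_j - lambda_i <= c_ij }.
import itertools
import numpy as np
from scipy.optimize import linprog

def C(z, cost):
    n = len(z)
    # Maximize z^T lambda  <=>  minimize -z^T lambda, subject to lambda_j - lambda_i <= c_ij.
    A_ub, b_ub = [], []
    for i, j in itertools.permutations(range(n), 2):
        row = np.zeros(n)
        row[j], row[i] = 1.0, -1.0
        A_ub.append(row)
        b_ub.append(cost[i][j])
    # When e^T z = 0 the objective is shift-invariant, so we pin lambda_n = 0.
    bounds = [(None, None)] * (n - 1) + [(0.0, 0.0)]
    res = linprog(-np.asarray(z), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    return -res.fun
```

For example, one can verify numerically that C(2z) = 2 C(z) and C(z¹ + z²) ≤ C(z¹) + C(z²) for random z, z¹, z² whose components sum to zero.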
Proof of Lemma 2: It is easy to see that an equivalent condition is wi,j wj,k = 0 for all i, j, k. To
show this is true, suppose w is an optimal solution and there exists i, j, k such that wi,j , wj,k > 0.
If i = k, we can set at least one of wi,j and wj,i to 0 without violating the constraints. If i 6= k, we
can set at least one of wi,j and wj,k to 0, and increase wi,k accordingly. In both cases, the resulting
objective is at least as good. Repeating this for all i, j, and k enforces the condition everywhere.
Proof of Theorem 4: Fix γ ∈ S. Let y ∗ (x, γ) = {y ∈ ∆n−1 (I) : u(x, γ) = C(y − x) + u(y, γ)}
be the set of optimal solutions corresponding to the system state x ∈ S. It is easy to verify that
As C(·) and u(·) are continuous and ∆_{n−1}(I) is compact, by Berge’s Maximum Theorem, y∗(·) is upper hemicontinuous¹⁵ and compact-valued. As C(·) and u(·)
are also convex, y∗(·) is also convex-valued. So, it is clear from (43) that Ω_u(γ) is nonempty. To
show Ω_u(γ) is compact, suppose y¹, y², . . . is a sequence in Ω_u(γ) such that yⁿ ∈ y∗(xⁿ, γ) for some xⁿ ∈ ∆_{n−1}(I). Passing to a subsequence if necessary, we may
assume that y^{n_k} ∈ y∗(x^{n_k}, γ), x^{n_k} → x and y^{n_k} → y. As y∗(·) is compact-valued, by the Closed
Graph Theorem, y ∗ (·) has a closed graph. This implies that y ∈ y ∗ (x, γ) ⊂ Ωu (γ), and therefore
Ωu (γ) is compact.
To show that Ωu (γ) is connected, suppose the reverse is true. Then, there exist open sets V1 , V2
in ∆n−1 (I) such that V1 ∩ V2 = ∅, V1 ∪ V2 ⊃ Ωu (γ), and V1 ∩ Ωu (γ) and V2 ∩ Ωu (γ) are nonempty.
As y∗(·) is convex-valued, this implies that, for any x ∈ ∆_{n−1}(I), y∗(x, γ) is contained in either V1 or V2.
¹⁵ Upper hemicontinuity can be defined as follows. Suppose X and Y are topological spaces. A correspondence f : X → P(Y) (the power set of Y) is upper hemicontinuous if for any open set V in Y, f⁻¹(V) = {x ∈ X | f(x) ⊂ V} is open in X.
Let U_k = {x ∈ ∆_{n−1}(I) : y∗(x, γ) ⊂ V_k} for k = 1, 2. By upper hemicontinuity, U1 and U2 are open, U1 ∩ U2 = ∅,
U1 ∪ U2 ⊃ ∆_{n−1}(I), and U1 ∩ ∆_{n−1}(I) and U2 ∩ ∆_{n−1}(I) are nonempty. This implies that the
(n−1)-dimensional simplex ∆n−1 (I) is not connected. We have reached a contradiction. Therefore,
Next, to show that π∗ is optimal, note that π∗(x, γ) = x for x ∈ Ω_u(γ) is clear from (13). If
x ∉ Ω_u(γ), then, by (43), π∗(x, γ) ∈ Ω_u(γ). Now, suppose there exists π∗(x, γ) = y ∈ Ω_u(γ)°;
then y + t(x − y) ∈ Ω_u(γ) for small enough t > 0. Set z = y + t(x − y). Then u(z, γ) + C(z − x) ≤ u(y, γ) + C(y − x),
so z is as good a solution as y. Therefore, there exists an optimal solution π∗(x, γ) ∈ B(Ω_u(γ)) if x ∉ Ω_u(γ).
Proof of Proposition 5: Suppose x ∈ Ωu (γ). Take any feasible direction (z, 0) at (x, γ). Then,
by (13),
  [u(x + tz, γ) − u(x, γ)] / t ≥ −C(z)

for t > 0. Taking the limit as t ↓ 0, we have u′(x, γ; z, 0) ≥ −C(z). Conversely, suppose
u′(x, γ; z, 0) ≥ −C(z) for any feasible direction z at x in H. Let φ(t) = u(x + tz, γ). Then
φ(·) is convex, φ(0) = u(x, γ), and φ′(0+) = u′(x, γ; z, 0) ≥ −C(z). By the subgradient inequality,
tφ′(0+) + φ(0) ≤ φ(t). This implies that −tC(z) + u(x, γ) ≤ u(x + tz, γ) is true for any feasible
tφ0 (0+) + φ(0) ≤ φ(t). This implies that −tC(z) + u(x, γ) ≤ u(x + tz, γ) is true for any feasible
such that u(x, γ) > C(y − x) + u(y, γ). Take any g ∈ ∂x u(x, γ). By the subgradient inequality,
Suppose w = (w_ij) is an optimal solution to problem (1). Then C(y − x) = Σ_i Σ_j c_ij w_ij, and by
Lemma 2, −g^T(y − x) = Σ_i g_i (y_i − x_i)⁻ − Σ_j g_j (y_j − x_j)⁺ = Σ_i Σ_j (g_i − g_j) w_ij. So, we have

  Σ_i Σ_j c_ij w_ij < Σ_i Σ_j (g_i − g_j) w_ij.
For the “only if” part, suppose x > 0 and x ∈ Ωu (γ). Assume ∂x u(x, γ) ∩ G = ∅. We will show
that this leads to a contradiction. Let P be the orthogonal projection from Rⁿ onto the subspace H = {x ∈ Rⁿ : Σ_i x_i = 0}. Then

  P(x) = x − (Σ_i x_i / n) e,
since ∂_x u(x, γ) ⊆ H. As ∂_x u(x, γ) is closed and P(G) is compact, by the Hahn–Banach theorem, there exists z ∈ H such that ⟨g, z⟩ < ⟨λ, z⟩
for every g ∈ P(∂_x u(x, γ)) and for every λ ∈ P(G), or equivalently, as ⟨g, z⟩ = ⟨P(g), z⟩ and
⟨λ, z⟩ = ⟨P(λ), z⟩, for every g ∈ ∂_x u(x, γ) and for every λ ∈ G. As z is a feasible direction in H at x, Proposition 5 implies u′(x, γ; z, 0) ≥ −C(z). So, we have

  −C(z) ≤ u′(x, γ; z, 0) = max_{g ∈ ∂_x u(x, γ)} ⟨g, z⟩ < ⟨λ, z⟩

for every λ ∈ G. However, by the dual formulation (42), there exists λ ∈ {(y₁, y₂, . . . , yₙ) | y_j − y_i ≤ c_ij ∀ i, j} such that ⟨λ, z⟩ = C(z), or equivalently, ⟨−λ, z⟩ = −C(z). Recognizing that −λ ∈ G, we obtain −C(z) < ⟨−λ, z⟩ = −C(z), a contradiction.
  ∂_x u(x, γ) = ( ∂u(x, γ)/∂x₁, ∂u(x, γ)/∂x₂, . . . , ∂u(x, γ)/∂xₙ ).
In this case, it is easy to see that (17) is simplified to (18). To show that x ∈ Ωu (γ) implies (18)
for x ∈ B(∆n−1 (I)). The rest of the proof is the same as Proposition 6.
Proof of Theorem 10: To show that the value function retains its structure in the infinite horizon set-
ting, we invoke the general approach outlined in Porteus [1975] and Porteus [1982] which “iterates”
the structural properties of the one-stage problem. Let V ∗ be the space of convex, continuous and
bounded functions over ∆. Note that a one-step structure preservation property holds by Lemma
A.5, Lemma A.6, and Lemma A.8: combined, they say that if the next period value function is in
V ∗ , then the optimal value of the current period is also in V ∗ . Furthermore, the set V ∗ with the
sup-norm k · k∞ is a complete metric space. These two observations allow us to apply Corollary 1
of Porteus [1975] and conclude that v ∈ V ∗ (the remaining assumptions needed to apply the result
can be easily checked). The rest of the proof follows from Lemma A.5, Lemma A.6, and Lemma A.8.
Proof of Lemma 11: If x ∈ D_k, then a_k ∈ ∂_x u_J(x, γ). Since a_{k,i} − a_{k,j} ≤ c_{ij} for all i, j, according to
Proposition 6, we have x ∈ Ω_{u_J}(ζ). For the second part, we first write down the primal formulation:

  w ≥ 0,
  e^T z = e^T x,
  ξ ≥ (z − y^k)^T a_k + (ζ − γ^k)^T b_k + c_k,   ∀ k = 1, 2, . . . , J,
  z ≥ 0.
also satisfies the complementary slackness condition. Therefore, the solution is optimal.
Proof of Lemma 12 These basic properties for our Bellman operator L are well-known for the
standard Bellman operator and can be proved in an analogous manner; see, for example, Puterman [1994].
Proof of Lemma 13 Let (y, γ) ∈ ∆◦ . Let ξ + = max(ξ, 0) and ξ − = min(ξ, 0), then (ξ, η) =
To conclude, the fact that Lf also satisfies the directional derivative conditions follows by Lemma
Proof of Theorem 14 We want to show that for each ε > 0, there exists an almost surely finite
iteration index J(ε) such that for all J ≥ J(ε), it holds that ‖u_J − u‖_∞ ≤ ε. Let B_r(y, γ) be a
(2n−1)-dimensional ball centered at (y, γ) ∈ ∆° with radius r. Consider some ε₀ > 0 (to be specified
later) and let C(ε₀) be an ε₀-covering of ∆, meaning that C(ε₀) is a finite collection of points in ∆°
(representing the centers of a finite collection of balls with radius ε₀) and ∆ ⊆ ∪_{(y,γ)∈C(ε₀)} B_{ε₀}(y, γ).
Let (ỹ¹, γ̃¹), (ỹ², γ̃²), . . . denote the sequence of sample points visited by the algorithm (one per
iteration). Thus, by Assumption 3, we have Σ_J P{(ỹ^J, γ̃^J) ∈ B_{ε₀}(y, γ)} = ∞, and an application
of the Borel–Cantelli lemma tells us that each ball B_{ε₀}(y, γ) associated with the covering is visited
infinitely often with probability one. To reduce notation, we will often suppress (y, γ) and use B_{ε₀}
to denote a generic ball in the covering. Our proof follows three main ideas:
1. For any infinite trajectory of sampled states, we can split it into an infinite number of “phases”
such that in each phase, every ball associated with the ε₀-covering is visited at least once.
2. We can then construct an auxiliary “batch” algorithm whose iteration counter aligns with the
sequence of phases from the previous step. This new algorithm is defined as another instance
of Algorithm 1, where on any given iteration, we group all states visited in the corresponding phase into a single batch. In what follows,
we will refer to the main algorithm as the “asynchronous” version of the batch algorithm.
3. The auxiliary batch algorithm can be viewed as an approximate version of value iteration.
Using the properties of L, we can show that it converges to an approximation of u (with error
depending on ε₀). Finally, we conclude by arguing that the main algorithm does not deviate too far from the batch version.
Define J₀ = 0 and, for K ≥ 0,

  J_{K+1} = min{ J > J_K : ∀ (y, γ) ∈ C(ε₀), ∃ J′ s.t. J_K < J′ ≤ J, (ỹ^{J′}, γ̃^{J′}) ∈ B_{ε₀}(y, γ) }
to be the first time after J_K such that every ball in the ε₀-covering is visited at least once. Notably,
J₁ is the first time that the entire covering is visited at least once. We denote the set of iterations

  𝒥_K = {J_{K−1} + 1, J_{K−1} + 2, . . . , J_K}

to be the “Kth phase” of the algorithm and let Ŝ_K = {(ỹ^J, γ̃^J)}_{J ∈ 𝒥_K} be the set of states visited
We now describe “path-dependent” instances of Algorithm 1 to assist with the remaining anal-
ysis. To be precise with the definitions, let us consider a sample path ω. The auxiliary batch
algorithm associated with ω is a new instance of Algorithm 1 that uses iteration counter K and
generates hyperplanes at the set of states SK = ŜK (ω) for all K ≥ 1. The initial approximation is
û0 = u0 and the estimate after K batch updates is denoted ûK (y, γ)(ω) = maxi=1,...,NK ĝi (y, γ)(ω).
We are now interested in studying the stochastic process {ûK (y, γ)}.
Next, we observe that the hyperplanes generated at iteration K + 1 of the batch algorithm are
tangent to Lû_K at the points in S_{K+1}. Let κ = (3/2)√(2n) β. Note that by repeatedly applying
Lemma 13 and using u0 = 0, we can argue that all (tangent) hyperplanes generated throughout the
algorithm have directional derivatives bounded by κ. It follows that if (ỹ, γ̃) is a sample point in
S_{K+1} that lies in a ball B_{ε₀} and it generates a hyperplane ĝ, then the underestimation error within
the ball is upper-bounded by max_{(y,γ)∈B_{ε₀}} (Lû_K)(y, γ) − ĝ(y, γ) ≤ 2κε₀ (using the fact that there
Therefore, we have a form of approximate value iteration and can analyze it accordingly (see
Bertsekas and Tsitsiklis [1996]). Utilizing the monotonicity and shift properties of Lemma 12, we
Subtracting (2κε₀) 1 from both sides and then applying (44) for K = 1, we have
Iterating these steps, we see that L^K û₀ − (2κε₀)(1 + ρ + · · · + ρ^{K−1}) 1 ≤ û_K. Taking limits, using
the convergence of the value iteration algorithm (see Puterman [1994]), and noting that ûK ≤ u
  u(y, γ) − 2κε₀/(1 − ρ) ≤ lim_{K→∞} û_K(y, γ) ≤ u(y, γ),   ∀ (y, γ) ∈ ∆.    (45)
Hence, we have shown that the auxiliary batch algorithm generates value function approximations
The final step is to relate the main asynchronous algorithm to the auxiliary batch version.
We claim that the value function approximation ûK generated by the Kth phase, for K ≥ 1, of
the batch algorithm is within a certain error bound of the approximation from the asynchronous
algorithm at JK :
Let us consider the first phase, K = 1. Recall that the two algorithms are initialized with identical
and Lû0 ≤ LuJ for any J ∈ J1 by the monotonicity property of Lemma 12. Also note that the
auxiliary batch algorithm builds a uniform underestimate of Lû0 with points of tangency belonging
  û₁(ỹ^{J+1}, γ̃^{J+1}) ≤ (Lu_J)(ỹ^{J+1}, γ̃^{J+1}) = g_{J+1}(ỹ^{J+1}, γ̃^{J+1}).

Suppose (ỹ^{J+1}, γ̃^{J+1}) is in a ball B_{ε₀}. Then g_{J+1} can fall below û₁ by at most max_{(y,γ)∈B_{ε₀}} û₁(y, γ) −
g_{J+1}(y, γ) ≤ 4κε₀ within the ball. Since this holds for every hyperplane (and corresponding ball)
added throughout the phase K = 1 and noting that every point in ∆ can be associated with some
well-approximating hyperplane (due to the property that each phase contains at least one visit to
which proves (46) for K = 1. Applying L to both sides of (47), utilizing Lemma 12, noting that û2
underestimates Lû1 , and applying the nondecreasing property of the sequence {uJ }, we have
  û₂(ỹ^{J+1}, γ̃^{J+1}) − ρ(4κε₀) ≤ (Lu_J)(ỹ^{J+1}, γ̃^{J+1}) = g_{J+1}(ỹ^{J+1}, γ̃^{J+1}),

and we obtain û₂ − ρ(4κε₀) 1 − (4κε₀) 1 ≤ u_{J₂}, proving (46) for K = 2. We can iterate these steps
to argue (46) for any K. Taking limits (all subsequent limits exist due to the boundedness and
  lim_{K→∞} û_K(y, γ) − 4κε₀/(1 − ρ) ≤ lim_{K→∞} u_{J_K}(y, γ) = lim_{J→∞} u_J(y, γ) ≤ u(y, γ),   ∀ (y, γ) ∈ ∆,    (48)

where the equality follows from the fact that u_{J_K}(y, γ) is a subsequence of u_J(y, γ). Combining (45) and
(48), we obtain
  u(y, γ) − 6κε₀/(1 − ρ) ≤ lim_{K→∞} û_K(y, γ) − 4κε₀/(1 − ρ) ≤ lim_{J→∞} u_J(y, γ) ≤ u(y, γ),   ∀ (y, γ) ∈ ∆,
trarily close to u: if we set ε₀ = ε(1 − ρ)/(6κ), this implies the existence of J(ε) such that for all J ≥ J(ε), ‖u_J − u‖_∞ ≤ ε.
Introducing the dummy variable t to denote the maximum, we obtain the formulation (29).