
Inventory Repositioning in On-Demand Product Rental Networks

Saif Benjaafar*¹, Daniel Jiang†², Xiang Li‡³ and Xiaobo Li§⁴

¹Department of Industrial and Systems Engineering, University of Minnesota
²Department of Industrial Engineering, University of Pittsburgh
³Target Corporation
⁴Department of Industrial Systems Engineering and Management, National University of Singapore

Abstract
We consider a product rental network with a fixed number of rental units distributed across
multiple locations. The units are accessed by customers without prior reservation and on an
on-demand basis. Customers are provided with the flexibility to decide on how long to keep a
unit and where to return it. Because of the randomness in demand and in the length of the
rental periods and in unit returns, there is a need to periodically reposition inventory away from
some locations and into others. In deciding on how much inventory to reposition and where, the
system manager balances potential lost sales with repositioning costs. Although the problem is
increasingly common in applications involving on-demand rental services, little is known about
the nature of the optimal policy for systems with a general network structure or about effective
approaches to solving the problem. In this paper, we address these limitations. First, we offer a
characterization of the optimal policy. We show that the optimal policy in each period can be
described in terms of a well-specified region over the state space. Within this region, it is optimal
not to reposition any inventory, while, outside the region, it is optimal to reposition but only
such that the system moves to a new state that is on the boundary of the no-repositioning region.
We also provide a simple check for when a state is in the no-repositioning region. Second, we
leverage the features of the optimal policy, along with properties of the optimal cost function, to
propose a provably convergent approximate dynamic programming algorithm to tackle problems
with a large number of dimensions. We provide numerical experiments to illustrate the effectiveness
of the algorithm and to highlight the impact of various problem parameters.

Keywords: product rental networks; vehicle sharing; inventory repositioning; optimal policies; approximate dynamic programming algorithms


*saif@umn.edu   †drjiang@pitt.edu   ‡xiang.li2@target.com   §iselix@nus.edu.sg



1 Introduction

We consider a product rental network with a fixed number of rental units distributed across multiple

locations. Inventory level is reviewed periodically and, in each period, a decision is made on how

much inventory to reposition away from one location to another. Customers may pick a product

up without reservation, and are allowed to keep the product for one or more periods, without

committing to a specific return time or location. Thus, demand is random, and so are the rental

periods and return locations of rented units. Demand that cannot be fulfilled at the location at

which it arises is considered lost and incurs a lost sales penalty (or is fulfilled through other means

at an additional cost). Inventory repositioning is costly and the cost depends on both the origins

and destinations of the repositioning. The firm is interested in minimizing the lost revenue from

unfulfilled demand (lost sales) and the cost incurred from repositioning inventory (repositioning

cost). Note that more aggressive inventory repositioning can reduce lost sales but leads to higher

repositioning cost. Hence, the firm must carefully manage the tradeoff between demand fulfillment

and inventory repositioning.

Problems with the above features are common in practice. We are particularly motivated by a

variety of car sharing programs that allow customers to rent from one location and return vehicles

to another location. A prominent example¹ is Car2Go, which has over 2.5 million registered users

and a fleet of over 14,000 vehicles in 26 cities in North America, Europe and Asia. These services

are on-demand (i.e., they do not require a reservation ahead of use) and offer one-way rentals,

where a customer may return the bike or vehicle to a location of her choice (in fact, Car2Go is

“free-floating” in the sense that there are no designated pickup or drop-off locations). A challenge

in managing these services is the spatial mismatch between vehicle supply and demand that arises

from the uncertainty in trip duration, origin, and destination and the time dependency of these

trip characteristics (e.g., the distribution of trip origin, destination, and duration may depend on

time of day). Unless adequately mitigated with the periodic repositioning of inventory, the spatial

mismatch between supply and demand can lead to significant loss in revenue. For example, some
¹ Other examples that share some of the same features include bikeshare systems, where customers can pick up a bike from one location and return it to any other location within the service region; shipping container rentals in the freight industry, where containers can be rented in one location and returned to a different location, with locations corresponding in some cases to ports in different countries; and the use of certain medical equipment, such as IV pumps and wheelchairs, in large hospitals by different departments located in various areas of the hospital.



studies indicate that a user would typically give up on a rental if they have to walk more than 500

meters [Habibi et al., 2016]. However, repositioning can be expensive. In the case of cars, vehicles

often have to be moved one at a time using a service team and a utility vehicle². The following

quote from an article in New York Magazine (2015) on the operations of Car2Go illustrates this

challenge: “[Customers say] you can’t always find a car when you need one — this in spite of

the efforts of Car2Go’s ‘street team’ to distribute vehicles evenly through the coverage zone. The

40-person squad patrols the borough for vehicles that are out of gas, illegally parked, too densely

packed in one region, or otherwise causing problems.”

Although the problem is common in practice³ and carries significant economic costs for the

affected firms and their customers, the existing literature on this topic is limited. In particular,

how to manage these systems optimally is, to the best of our knowledge, not known. Moreover,

there do not appear to be efficient methods for computing solutions for systems as general as the

one we consider in this paper, including effective heuristics. This relative lack of results appears to

be due to the multidimensional nature of the problem as well as the lost sales feature, compounded

by the presence of randomness in demand, in rental periods, and in return locations. In this paper,

we address these limitations through two main contributions. The first contribution is theoretical

and the second is computational:

• On the theoretical side, we offer one of the first characterizations of the optimal policy for

the inventory repositioning problem in a general network setting with multiple locations,

accounting for important features of on-demand systems, such as randomness in trip volumes,

duration, origin, and destination as well as important spatial and temporal dependencies

(e.g., likelihood of a trip terminating somewhere being dependent on its origin as well as trip

volumes that are dependent on time and location).

• On the computational side, we describe a new cutting-plane-based approximate dynamic

programming (ADP) algorithm that can effectively solve the repositioning problem to near-

optimality. We provide a proof of convergence for our algorithm that takes a fundamentally
² http://blog.car2go.com/2017/06/26/know-relocation-car2gos-relocated/
³ Renting may become even more prevalent as the economy shifts away from a model built on the exclusive ownership of resources to one based on on-demand access and resource sharing; see Sundararajan [2016] for discussion and other examples.



different view from existing cutting-plane-based approaches. This can be viewed as a theo-

retical contribution to the ADP literature, independent from the repositioning application.

Specifically, we formulate the repositioning problem as a multi-period stochastic dynamic pro-

gram. We show that the problem in each period is one that involves solving a convex optimization

problem (and hence can be solved without resorting to an exhaustive search). More significantly,

we show that the optimal policy in each period can be described in terms of two well-specified

regions over the state space. If the system is in a state that falls within one region, it is optimal not

to reposition any inventory (we refer to this region as “the no-repositioning” region). If the system

is in a state that is outside this region, then it is optimal to reposition some inventory but only

such that the system moves to a new state that is on the boundary of the no-repositioning region.

Moreover, we provide a simple check for when a state is in the no-repositioning region, which also

allows us to compute the optimal policy more efficiently. When the problem only involves two loca-

tions, the no-repositioning region becomes a line segment, and the optimal policy can be specified

in terms of fixed thresholds (see He et al. [2018] for a treatment of this special case using a different

approach).

One of the distinctive features of the problem considered lies in its non-linear state update

function. This non-linearity introduces difficulties in showing the convexity of the problem that

must be solved in each period. To address this difficulty, we leverage the fact that the state update

function is piecewise affine and derive properties for the directional derivatives of the value func-

tion. This approach has potential applicability to other systems with piecewise affine state update

functions. Another distinctive feature of the problem is the multi-dimensionality of the state and

action spaces. Unlike many classical inventory problems, the optimal inventory repositioning policy

cannot, in general, be characterized by simple thresholds in the state space, as increasing inventory

at one location requires reducing inventory at some other locations. Instead, we show that the

optimal policy is defined by a no-repositioning region within which it is optimal to do nothing and

outside of which it is optimal to reposition to the region’s boundary. Such an optimal policy not

only generalizes the threshold policy for two-location problems (i.e., it implies a simple threshold

policy for two-location problems) but also preserves some of the computational benefits. There-

fore, the results in this paper may also be useful in informing future studies of multi-dimensional



problems.

Due to the curse of dimensionality, the optimal policy (and value function) can be difficult to

compute for problems with more than a small number of dimensions. To address this issue, we

leverage the results obtained regarding the structure of both the value function and the optimal

policy to construct an approximate dynamic programming algorithm. The algorithm combines

aspects of approximate value iteration (see, for example, De Farias and Van Roy [2000] and Munos

and Szepesvári [2008]) and stochastic dual dynamic programming (see Pereira and Pinto [1991]).

Convexity of the optimal value function is leveraged to represent the approximate value function as

the maximum over a set of hyperplanes, while the no-repositioning region characterization of the

optimal policy is used to reduce the number of single-period optimization problems that need to

be solved. We also prove a new convergence result for the infinite horizon setting, showing that the

value function approximation converges almost surely to the optimal value function. We conduct

numerical experiments to illustrate the effectiveness of jointly utilizing value and policy structure,

which, to our knowledge, has not yet been explored by related methods in the literature.

The rest of the paper is organized as follows. In Section 2, we review related literature. In

Section 3, we describe and formulate the problem. In Section 4, we analyze the structure of the

optimal policy for the special case of a single period problem. In Sections 5 and 6, we use the results

from the single period problem to extend the analysis to problems with finitely and infinitely many

periods. In Section 7, we describe the ADP algorithm and provide numerical results. In Section 8,

we provide concluding comments.

Notation. Throughout the paper, the following notation will be used. We use e to denote

a vector of all ones, ei to denote a vector of zeros except 1 at the ith entry, and 0 to denote a

vector of all zeros (the dimension of these vectors will be clear from the context). Also, we write
$\Delta_{n-1}(M)$ to denote the $(n-1)$-dimensional simplex, i.e., $\Delta_{n-1}(M) = \{(x_1, \ldots, x_n) \mid \sum_{i=1}^n x_i = M,\ x \geq 0\}$. Similarly, we use $S_n(M)$ to denote the $n$-dimensional simplex with interior, i.e., $S_n(M) = \{(x_1, \ldots, x_n) \mid \sum_{i=1}^n x_i \leq M,\ x \geq 0\}$. Throughout, we use ordinary lowercase letters (e.g.,

x) to denote scalars, and boldfaced lowercase letters (e.g., x) to denote vectors. The Euclidean

norm is denoted $\|\cdot\|_2$. For functions $f_1$ and $f_2$ with domain $\mathcal{X}$, let $\|f_1\|_\infty = \sup_{x \in \mathcal{X}} |f_1(x)|$ and let $f_1 \leq f_2$ denote $f_1(x) \leq f_2(x)$ for all $x \in \mathcal{X}$.



2 Literature Review

There is growing literature on inventory repositioning in car and bike sharing systems; see for

example Shu et al. [2013], Nair and Miller-Hooks [2011], O’Mahony and Shmoys [2015], Freund

et al. [2016], Liu et al. [2016], Ghosh et al. [2017], Schuijbroek et al. [2017], Li et al. [2018],

Shui and Szeto [2018], and the references therein. Most of this literature focuses on the static

repositioning problem, where the objective is to find the optimal placement of vehicles before

demand arises, with no more repositioning being made afterwards (e.g., repositioning overnight for

the next day). Much of this work employs mixed integer programming formulations and focuses

on the development of algorithms and heuristics. Similarly, the papers that focus on dynamic

repositioning generally consider heuristic solution techniques and do not offer structural results

regarding the optimal policy (see, for example, Ghosh et al. [2017] and Li et al. [2018]). A related

stream of literature models vehicle sharing systems (mostly in the context of bike share systems) as

closed queueing networks and uses steady state approximations to evaluate system performance; see,

for example, George and Xia [2011], Fricker and Gast [2016], Banerjee et al. [2017] and Braverman

et al. [2018]. Moreover, Chung et al. [2018] analyzes incentive-based repositioning policies for bike

sharing. Other work considers related strategic issues such as fleet sizing, service region design,

infrastructure planning, and user dissatisfaction; see, for example, Jian et al. [2016], Raviv and

Kolka [2013], He et al. [2017], Lu et al. [2017], Freund et al. [2017], Kabra et al. [2018], and Kaspi

et al. [2017]. Comprehensive reviews of the literature on vehicle and bike sharing can be found in

He et al. [2019] and Freund et al. [2019].

There is literature that addresses inventory repositioning that arises in other settings, including

in the context of repositioning of empty containers in the shipping industry, empty railcars in

railroad operations, and cars in traditional car rentals; see, for example, Lee and Meng [2015] for

a comprehensive review. The literature on empty container repositioning is particularly extensive.

However, that literature focuses on simple networks and relies on heuristics when considering more

general problems; see for example Song [2005] and Li et al. [2007]. To our knowledge, there are no

results regarding the optimal policy for a general network.

The paper that is closest to ours is He et al. [2018], which was subsequent to an earlier version



of this paper⁴ and which considers a problem similar to ours. A summary of the main differences

between our paper and He et al. [2018] is given below.

• Our model is more general than He et al. [2018] in that we allow the rental periods of cars to

be larger than one. This, we believe, is more appropriate for many real world applications.

This generalization introduces additional complexity to the problem. In particular, a system

with n locations now requires a 2n-dimensional inventory state so that ongoing rentals from each location
can be tracked; He et al. [2018] does not require this and formulates an n-dimensional problem.

• In He et al. [2018], the proof of convexity is based on a reformulation of the problem into a

linear program. However, it is not clear from their current proof whether this reformulation

continues to work for the case where rental periods can be greater than one. Our method,

though more complicated, works for both cases. The technical difference also shows that our

generalization of longer rental periods is non-trivial.

• He et al. [2018] characterize the optimal policy for problems with two locations. We provide

characterizations of the optimal policy for the n-location problem, which includes the two-

location problem as a special case.

• He et al. [2018] approximate the problem using the robust approach, while we develop a

systematic and theoretically consistent approach for approximating the true value function.

Empirical testing of our approach shows that it achieves high-quality approximations within

a reasonable amount of time.

Next, there is related literature on dynamic fleet management. This literature is concerned with

the assignment of vehicles to loads that originate and terminate in different locations over multiple

periods. Recent examples from this literature include Topaloglu and Powell [2006], Godfrey and

Powell [2002], and Powell and Carvalho [1998]. In a typical dynamic fleet management problem,

movements of all vehicles, both full and empty, are decision variables. This is in contrast to

our problem where the movement of vehicles is in part determined by uncontrolled events involving

rentals with uncertain durations and destinations, and where decisions involve only the repositioning
⁴ The first version of our paper appeared online ahead of the first version of He et al. [2018]. He et al. [2018] refer to that version of our paper.



of unused assets. Note that most of the literature on dynamic fleet management focuses on the

development of solution procedures but not on the characterization of the optimal policy.

Finally, there is related literature on computational methods that can solve problems with

convex value functions. Some well-known cutting-plane-based approaches are the stochastic de-

composition algorithm of Higle and Sen [1991], the stochastic dual dynamic programming (SDDP)

method introduced in Pereira and Pinto [1991], and the cutting plane and partial sampling approach

of Chen and Powell [1999]. Our method is most closely related to SDDP, where full expectations

are computed at each iteration. Linowsky and Philpott [2005], Philpott and Guan [2008], Shapiro

[2011], and Girardeau et al. [2014] provide convergence analyses of SDDP, but these analyses are de-

signed for finite-horizon problems (or two-stage stochastic programs) and rely on an exact terminal

value function and/or on there being only a finite number of cuts.

Our algorithm is most closely related to the cutting plane methods for the infinite horizon

setting proposed in Birge and Zhao [2007] and Warrington et al. [2018]. Birge and Zhao [2007]

proves uniform convergence of the value function approximations to optimal for the case of linear

dynamics, given a strong condition that the cut in each iteration is computed at a state where a

Bellman error criterion is approximately maximized. Computation of such a state is a difference

of convex functions optimization problem (or a suitable approximation). Warrington et al. [2018]

focuses on the deterministic setting, uses a fixed set of sampled states at which cuts are computed,

and does not show consistency of their algorithm. Our algorithm removes these restrictions, yet we

are still able to show uniform convergence to the optimal value function. In particular, our analysis

allows for non-linear dynamics and cuts to be computed at states sampled from a distribution.

Furthermore, the use of policy structure (i.e., the no-repositioning zone characterization) in an

SDDP-like algorithm is new.

As an alternative to cutting plane algorithms, Godfrey and Powell [2001] and Powell et al. [2004]

propose methods based on stochastic approximation (see Kushner and Yin [2003]) to estimate

scalar or separable convex functions. The main idea is to iteratively update a piecewise linear

approximation via noisy samples while ensuring that convexity is maintained. Nascimento and

Powell [2009] extend the technique to a finite-horizon ADP setting for the problem of lagged asset

acquisition (single inventory state) and provide a convergence analysis; see also Nascimento and

Powell [2010]. However, these methods are not immediately applicable to our situation, where the



value function is multi-dimensional.

3 The Inventory Repositioning Problem

We consider a product rental network consisting of n locations and N rental units. Inventory

level is reviewed periodically and, in each period, a decision is made on how much inventory to

reposition away from one location to another. Inventory repositioning is costly and the cost depends

on both the origins and destinations of the repositioning. The review periods are of equal length

and decisions are made over a specified planning horizon, either finite or infinite.

Demand in each period is positive and random, with each unit of demand requiring the usage

of one rental unit for one or more periods, with the rental period being also random. Demand

that cannot be satisfied at the location at which it arises is considered lost and incurs a lost sales

penalty. A location in the context of a free-floating car sharing system may correspond to a specified

geographic area (e.g., a zip code area, a neighborhood, or a set of city blocks). Units rented at

one location can be returned to another. Hence, not only are rental durations random but also are

return destinations. At any time, a rental unit can be either at one of the locations, available for

rent, or with a customer being rented.

The sequence of events in each period is as follows. At the beginning of the period, inventory

level at each location is observed. A decision is then made on how much inventory to reposition

away from one location to another. Subsequently, demand is realized at each location followed by

the realization of product returns. Note that the review period is assumed to be sufficiently long

for repositioning initiated in one period to be completed in the same period.⁵

We index the periods by t ∈ N, with t = 1 indicating the first period in the planning horizon. We

let xt = (xt,1 , . . . , xt,n ) denote the vector of inventory levels before repositioning in period t, where
⁵ This assumption is reasonable for the car sharing application, where repositioning is carried out by dedicated “street teams,” as in the Car2Go example mentioned, who are assigned to different areas within the service region for the rapid redeployment of vehicles (e.g., if rental periods are multiples of one hour, this assumption means that the street teams are large enough to carry out the needed repositioning within one hour). In some applications, repositioning is crowdsourced (this approach is used by Bird, the scooter service provider) or is carried out by users who receive monetary compensation for moving a vehicle. In such cases, the assumption of a relatively short repositioning time is also reasonable. There may of course be settings where there is an upper limit on how many units can be repositioned in a given period (in that case, a constraint would need to be imposed on the amount of inventory that can be repositioned in a given period). There may also be settings where repositioning takes multiple periods (e.g., in the case of shipping containers). The analysis in that case is more complicated, as it requires expanding the state space to include the “delivery” status of each unit that is in the process of being repositioned. We leave this as a potential topic for future research.



xt,i denotes the corresponding inventory level at location i. Similarly, we let y t = (yt,1 , . . . , yt,n )

denote the vector of inventory levels after repositioning in period t, where yt,i denotes the corre-

sponding inventory level at location i. Note that inventory repositioning should always preserve
the total on-hand inventory. Therefore, we require $\sum_{i=1}^n y_{t,i} = \sum_{i=1}^n x_{t,i}$.

Inventory repositioning is costly and, for each unit of inventory repositioned away from location

i to location j, a cost of cij is incurred. Consistent with our motivating application of a car sharing

system, we assume there is a cost associated with the repositioning of each unit (this is in contrast

with other applications, such as bike sharing systems where repositioning can occur in batches⁶);

see He et al. [2018] for similar treatment. Let c = (cij ) denote the cost vector and let wij denote

the amount of inventory to be repositioned away from location i to location j. Then, the minimum

cost associated with repositioning from an inventory level x to another inventory level y is given

by the solution to the following linear program:

$$\begin{aligned} \min \quad & c \cdot w \\ \text{subject to} \quad & \sum_{i=1}^n w_{ij} - \sum_{k=1}^n w_{jk} = y_j - x_j \quad \forall\, j = 1, \ldots, n \\ & w \geq 0. \end{aligned}$$

The first constraint ensures that the change in inventory level at each location is consistent with the
amounts of inventory being moved into ($\sum_i w_{ij}$) and out of ($\sum_k w_{jk}$) that location. The second
constraint ensures that the amount of inventory repositioned away from one location to
another is always nonnegative, so that the associated cost is accounted for in the objective. It is clear
⁶ In some bike sharing systems, bikes are moved a few units at a time (e.g., Citi Bike uses tricycles that hold 3-4 bikes at a time to move bikes during the day; this is also the case for several bike sharing systems in China). In such cases, a per-unit repositioning cost would be a reasonable approximation. In other systems, although units are moved in batches, the repositioning cost is mostly due to the handling of each unit (e.g., the process of loading and unloading bikes onto a truck and positioning each in place in a docking station). For example, this would be the case when all locations are visited in each period as part of a fixed route for the repositioning vehicles. In some situations, repositioning is crowdsourced (as in the case of the Bird scooters). In that case, the repositioning cost corresponds to the payment made to the individuals who carry out the repositioning.



that the value of the linear program depends only on the difference z = y − x. Define

$$C(z) = \min \; c \cdot w \quad \text{subject to} \quad \sum_{i=1}^n w_{ij} - \sum_{k=1}^n w_{jk} = z_j \ \ \forall\, j = 1, \ldots, n, \quad w \geq 0, \tag{1}$$

for any $z \in H$, where

$$H := \left\{ z \in \mathbb{R}^n : \sum_{i=1}^n z_i = 0 \right\}. \tag{2}$$

Then the inventory repositioning cost from x to y is C(y − x). Without loss of generality, we

assume that cij ≥ 0 satisfy the triangle inequality (i.e., cik ≤ cij + cjk for all i, j, k).
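As a concrete illustration, C(z) can be computed with an off-the-shelf LP solver. Below is a minimal sketch in Python using scipy.optimize.linprog; the three-location cost matrix and the imbalance vector z are hypothetical inputs, not data from the paper.

import numpy as np
from scipy.optimize import linprog

def repositioning_cost(c, z):
    # Solve the LP in (1): minimize c . w subject to flow balance and w >= 0.
    n = len(z)
    A_eq = np.zeros((n, n * n))   # one balance constraint per location j
    for j in range(n):
        for i in range(n):
            A_eq[j, i * n + j] += 1.0   # inflow w_ij into location j
            A_eq[j, j * n + i] -= 1.0   # outflow w_jk out of location j
    res = linprog(c.flatten(), A_eq=A_eq, b_eq=z, bounds=(0, None), method="highs")
    return res.fun

# Hypothetical three-location example; z must sum to zero (z is in H).
c = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
print(repositioning_cost(c, np.array([1.0, 0.0, -1.0])))   # prints 2.0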

We let dt = (dt,1 , . . . , dt,n ) denote the vector of random demands in period t, with dt,i corre-

sponding to the demand at location i. The amount of demand that cannot be fulfilled is given by

(dt,i − yt,i )+ = max(0, dt,i − yt,i ). Let β denote the per unit lost sales penalty. Then, the total lost

sales penalty incurred in period t across all locations is given by

$$L(y_t, d_t) = \beta \sum_{i=1}^n (d_{t,i} - y_{t,i})^+. \tag{3}$$

We assume that each product can be rented at most once within a review period, that is, rental

periods are longer than review periods.

To model the randomness in both the rental periods and return locations, we assume that, at

the end of each period, a random fraction pt,ij of products rented from location i is returned to

location j for all i, j ∈ {1, 2, . . . , n}, with the rest continuing to be rented. We let P t denote the

matrix of random fractions, i.e.,

$$P_t = \begin{pmatrix} p_{t,11} & \cdots & p_{t,1n} \\ \vdots & \ddots & \vdots \\ p_{t,n1} & \cdots & p_{t,nn} \end{pmatrix}.$$

The $i$th row of $P_t$ must satisfy $\sum_{j=1}^n p_{t,ij} \leq 1$. The case where $\sum_{j=1}^n p_{t,ij} < 1$ corresponds to a setting where rental periods can be greater than one, while the case where $\sum_{j=1}^n p_{t,ij} = 1$ corresponds

to a setting where rental periods are exactly equal to one. Let µt denote the joint distribution



of dt and Pt. We assume that the random sequence (dt, Pt) is independent over
time, and that the expected aggregate demand in each period is finite (i.e., $\int \sum_{i=1}^n d_{t,i} \, d\mu_t < +\infty$).

However, we allow dt and P t to be dependent.

Finally, let γt,i for i = 1, 2, . . . , n and t = 1, 2, . . . , T denote the quantity of the product rented

from location i that remains outstanding at the beginning of period t. To our knowledge, the

problem specification above is among the most general in the literature and the various assumptions

we make are either consistent with or further extend assumptions found in the literature (for example,

our assumptions are consistent with those in He et al. [2018] except that we allow for rental

durations to be multiple periods and for this duration to be random). We denote $c_{\max} = \max_{i,j} c_{ij}$
and $c_{\min} = \min_{i \neq j} c_{ij}$. The next two assumptions state some useful conditions on $P_t$ and the

repositioning costs cij .

Assumption 1. For every period t, there exists a random variable pt ∈ [pmin , 1] such that

$$\sum_{j=1}^n p_{t,ij} = \sum_{j=1}^n p_{t,kj} = p_t, \quad \forall\, i, k = 1, 2, \ldots, n.$$

An alternative statement is that $p_{t,ij} = p_t \tilde{q}_{t,ij}$ for some $\tilde{q}_{t,ij}$ with $\sum_{j=1}^n \tilde{q}_{t,ij} = 1$ for all $i$.

Assumption 1 implies that the rental duration does not depend on the origin of the rental, but

the distribution of the return locations does depend on the origin. This assumption is plausible

since the rental duration and the return location are usually two separate decisions.

Assumption 2. The repositioning costs satisfy ρcmax − cmin ≤ pmin (β − cmin ).

The second assumption enforces boundedness in the difference of cost parameters, with the

upper bound depending on pmin . If pmin = 1, where the rental duration is always one period

(corresponding to the setting of He et al. [2018]), the restriction reduces to ρcmax ≤ β. This means

that the cost of lost sales outweighs the cost of inventory repositioning in the next period. A

similar condition is assumed in He et al. [2018]. In this sense, our convexity result includes
theirs as a special case, except that in their model, lost sales costs depend on both

origin and destination and the return destinations are known by the platform at the time of rental.

If pmin < 1, the assumption prevents the unpleasant situation where one might want to deliberately



“hide” the inventory due to the difference in the repositioning cost. It is clear from the assumption

that ρcmax − cmin ≤ pmin (β − cmin ) ≤ pt (β − cmin ).

Our main result is that, under Assumptions 1 and 2, the value function in each period, consisting

of the lost sales and the cost-to-go as defined next, is always convex, allowing for many structural

properties of the optimal policy to be derived.

The model we described above can be formulated as a Markov decision process. Fix a time

period t. The system states correspond to the on-hand inventory levels xt and the outstanding

inventory levels γ t . The state space is specified by the (2n − 1)-dimensional simplex, i.e., (xt , γ t ) ∈

∆2n−1 (N ). Throughout the paper, we denote S := Sn (N ) and ∆ := ∆2n−1 (N ) since these notations

are frequently used. Actions correspond to the vector of target inventory levels y t . Given state

(xt , γ t ), the action space is an (n − 1)-dimensional simplex, i.e., y t ∈ ∆n−1 (eT xt ). The transition

probabilities are induced by the state update function:

$$\begin{aligned} x_{t+1,i} &= (y_{t,i} - d_{t,i})^+ + \sum_{j=1}^n (\gamma_{t,j} + \min(y_{t,j}, d_{t,j}))\, p_{t,ji} && \forall\, i = 1, \ldots, n,\ t = 1, \ldots, T, \\ \gamma_{t+1,i} &= (\gamma_{t,i} + \min(y_{t,i}, d_{t,i})) \Big(1 - \sum_{j=1}^n p_{t,ij}\Big) && \forall\, i = 1, \ldots, n,\ t = 1, \ldots, T. \end{aligned}$$

Given a state (xt , γ t ) and an action y t , the repositioning cost is given by C(y t − xt ), and the

expected lost sales penalty is given by

$$l_t(y_t) = \int L_t(y_t, d_t) \, d\mu_t = \beta \int \sum_i (d_{t,i} - y_{t,i})^+ \, d\mu_t. \tag{4}$$

The single-period cost is the sum of the inventory repositioning cost and lost sales penalty:

rt (xt , γ t , y t ) = C(y t − xt ) + lt (y t ).

The objective is to minimize the expected discounted cost over a specified planning horizon. In the

case of a finite planning horizon with T periods, the optimality equations are given by

$$v_t(x_t, \gamma_t) = \min_{y_t \in \Delta_{n-1}(e^T x_t)} \; r_t(x_t, \gamma_t, y_t) + \rho \int v_{t+1}(x_{t+1}, \gamma_{t+1}) \, d\mu_t \tag{5}$$

for t = 1, 2, . . . , T , and

$$v_{T+1}(x_{T+1}, \gamma_{T+1}) = 0, \tag{6}$$



where ρ ∈ [0, 1) is the discount factor.

It is useful to note that the problem to be solved in each period can be expressed in the following

form:

$$v_t(x_t, \gamma_t) = \min_{y_t \in \Delta_{n-1}(e^T x_t)} \; C(y_t - x_t) + u_t(y_t, \gamma_t), \tag{7}$$

where

$$u_t(y_t, \gamma_t) = \int U_t(y_t, \gamma_t, d_t, P_t) \, d\mu_t, \tag{8}$$

and

$$U_t(y_t, \gamma_t, d_t, P_t) = L_t(y_t, d_t) + \rho v_{t+1}(\tau_x(y_t, \gamma_t, d_t, P_t), \tau_\gamma(y_t, \gamma_t, d_t, P_t)), \tag{9}$$

where

$$\begin{aligned} \tau_x(y, \gamma, d, P) &= (y - d)^+ + P^T(\gamma + \min(y, d)), \\ \tau_\gamma(y, \gamma, d, P) &= (\gamma + \min(y, d)) \circ (e - Pe), \end{aligned} \tag{10}$$

where ◦ denotes the Hadamard product (or the entrywise product), i.e.,

(a1 , a2 , . . . , an ) ◦ (b1 , b2 , . . . , bn ) = (a1 b1 , a2 b2 , . . . , an bn ).
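To make these dynamics concrete, the following sketch evaluates the lost sales penalty (3) and the state update (10) for a single realization of (d, P); all numerical values are hypothetical.

import numpy as np

def lost_sales(y, d, beta):
    # Equation (3): beta per unit of unmet demand, summed over locations.
    return beta * np.maximum(d - y, 0.0).sum()

def state_update(y, gamma, d, P):
    # Equation (10): min(y, d) units start new rentals; (y - d)^+ stays on hand;
    # P[i, j] is the fraction of units rented from i that return to j this period.
    rented = gamma + np.minimum(y, d)
    x_next = np.maximum(y - d, 0.0) + P.T @ rented
    gamma_next = rented * (1.0 - P.sum(axis=1))   # still outstanding, by origin
    return x_next, gamma_next

# Hypothetical two-location instance; row sums of P are below one,
# so some rentals remain outstanding (rental periods exceed one period).
y = np.array([3.0, 1.0]); gamma = np.array([1.0, 0.0]); d = np.array([2.0, 2.0])
P = np.array([[0.3, 0.2], [0.1, 0.4]])
print(lost_sales(y, d, beta=5.0))
print(state_update(y, gamma, d, P))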

In the next section, we consider the one-period problem, which provides us with a few useful results

before moving on to the multi-period and infinite horizon versions of the problem.

4 The One-Period Problem

In this section, we study the following convex optimization problem

$$v(x, \gamma) = \min_{y \in \Delta_{n-1}(e^T x)} \; C(y - x) + u(y, \gamma) \quad \text{for } (x, \gamma) \in \Delta, \tag{11}$$

where C(·) is the repositioning cost specified by (1). As shown in equations (7) - (9), the problem

to be solved in each period is of the form (11). Here we assume that u(·) is a convex and continuous

function that maps ∆ to R ∪ {−∞, +∞}, though most of the results in this section still hold when

u(x, γ) is only convex in x. In Section 5, we will show that ut (y t , γ t ) defined in (8) is indeed convex

and continuous in (y t , γ t ).



4.1 Properties of the Repositioning Cost

We start by describing some properties of the repositioning cost C(·) defined in (1). These

properties will be useful for characterizing the optimal policy in subsequent sections. Unless stated

otherwise, proofs for all results in the paper can be found in the Appendix.

Lemma 1. C(·) satisfies the following properties:

1. (Positive Homogeneity): C(tz) = tC(z) for all t ≥ 0.

2. (Convexity): C(λz 1 + (1 − λ)z 2 ) ≤ λC(z 1 ) + (1 − λ)C(z 2 ) for all z 1 , z 2 ∈ H and λ ∈ [0, 1].

3. (Sub-Additivity): C(z 1 + z 2 ) ≤ C(z 1 ) + C(z 2 ) for all z 1 , z 2 ∈ H.

4. (Continuity): C(z) is continuous in z ∈ H.

Moreover, due to the triangle inequality, it is not optimal to simultaneously move inventory

into and out of the same location. This property can be stated as follows.

Lemma 2. There exists an optimal solution w to (1) such that

$$\sum_{i=1}^n w_{ij} = z_j^+ \quad \text{and} \quad \sum_{k=1}^n w_{jk} = z_j^- \quad \text{for all } j = 1, \ldots, n.$$

Lemma 2 leads to the following bound for the repositioning cost C(z).

Lemma 3.

$$\frac{c_{\min}}{2} \sum_{i=1}^n |z_i| \;\leq\; C(z) \;\leq\; \frac{c_{\max}}{2} \sum_{i=1}^n |z_i|. \tag{12}$$

The proof follows from Lemma 2. There exists an optimal solution w to (1) such that

$$C(z) = \sum_{i,j} c_{ij} w_{ij} = \frac{1}{2} \sum_j \sum_i c_{ij} w_{ij} + \frac{1}{2} \sum_j \sum_i c_{ji} w_{ji} \;\leq\; \frac{c_{\max}}{2} \sum_j z_j^+ + \frac{c_{\max}}{2} \sum_j z_j^- = \frac{c_{\max}}{2} \sum_j |z_j|.$$

It is easy to see that, in (12), the equality holds if cij = cmax for all i, j. Therefore, the bound is

tight. The lower bound follows the same logic. In Section 5, we will use Lemma 3 to derive an

important bound on the directional derivatives of the value function.



4.2 Characterization of the Optimal Policy

The principal result of this section is the characterization of the optimal policy through the no-

repositioning set, the collection of inventory levels from which no repositioning should be made.

The no-repositioning set for a function u(·) when the outstanding inventory level is γ can be defined

as follows:

$$\Omega_u(\gamma) = \{x \in \Delta_{n-1}(I) : u(x, \gamma) \leq C(y - x) + u(y, \gamma) \ \forall\, y \in \Delta_{n-1}(I)\}, \quad \forall\, \gamma \in S, \tag{13}$$

where $I = N - \sum_{i=1}^n \gamma_i$. Note that I is a function of γ (or, equivalently, of x). For notational

simplicity, we suppress the dependency of I on γ (or x). By definition, no repositioning should be

made from inventory levels inside Ωu (γ). In the following theorem, we show that Ωu (γ) is non-

empty, connected and compact and, for inventory levels outside Ωu (γ), it is optimal to reposition

to some point on the boundary of Ωu (γ). In what follows, we denote the boundary of a set E by

B(E), and the interior of E by E ◦ .

Theorem 4. The no-repositioning set Ωu (γ) is nonempty, connected and compact for all γ ∈ S.

An optimal policy π ∗ to (11) satisfies

$$\pi^*(x, \gamma) = x \ \text{ if } x \in \Omega_u(\gamma); \qquad \pi^*(x, \gamma) \in B(\Omega_u(\gamma)) \ \text{ otherwise.} \tag{14}$$

Solving a nondifferentiable convex program such as (11) usually involves some computational

effort. One way to reduce this effort, suggested by Theorem 4, is to characterize the no-repositioning

set Ωu (γ). Characterizing the no-repositioning region can help us identify when a state is inside

Ωu (γ), which allows our ADP algorithm to more easily compute the value iteration step; see Section

7. Let
$$u'(x, \gamma; z, \eta) = \lim_{t \downarrow 0} \frac{u(x + tz, \gamma + t\eta) - u(x, \gamma)}{t} \tag{15}$$

denote the directional derivative of u(·) at (x, γ) along the direction (z, η). Since u(·) is assumed

to be convex and continuous in ∆, u′(x, γ; z, η) is well defined for (x, γ) ∈ ∆. We call (z, η) a

feasible direction at (x, γ) if (x + tz, γ + tη) ∈ ∆ for small enough t > 0. In what follows, we

provide a series of first order characterizations of Ωu (γ), the first of which relies on the directional



derivatives.

Proposition 5. x ∈ Ωu (γ) if and only if

$$u'(x, \gamma; z, 0) \geq -C(z) \tag{16}$$

for any feasible direction (z, 0) at (x, γ).

Proposition 5 is essential for several subsequent results. However, using Proposition 5 to verify

whether a point lies inside the no-repositioning set is computationally impractical, as it involves

checking an infinite number of inequalities in the form of (16). In the following proposition, we pro-

vide a second characterization of Ωu(γ) using subdifferentials. Before we proceed, we introduce
the following notation: g is said to be a subgradient of u(·, γ) at x if u(y, γ) ≥ u(x, γ) + gᵀ(y − x)

for all y. The set of all subgradients of u(·, γ) at x is denoted by ∂x u(x, γ). It is well known that

∂x u(x, γ) is nonempty, closed and convex for x > 0.

Proposition 6. x ∈ Ωu (γ) if

$$\partial_x u(x, \gamma) \cap G \neq \emptyset, \tag{17}$$

where G = {(g1 , . . . , gn ) : gi − gj ≤ cij ∀ i, j}. If x > 0, then the converse is also true.

Proposition 6 suggests that whether a point lies inside the no-repositioning set depends on whether

u(·, γ) has certain subgradients at this point. Such a characterization is useful if we can compute

the subdifferential ∂x u(x, γ). In particular, if u(·, γ) is differentiable at x, then ∂x u(x, γ) consists

of a single point ∇x u(x, γ). In this case, determining its optimality only involves checking n(n − 1)

inequalities.

Corollary 7. Suppose u(·, γ) is differentiable at x ∈ ∆n−1 (I). Then, x ∈ Ωu (γ) if and only if

$$\frac{\partial u(x, \gamma)}{\partial x_i} - \frac{\partial u(x, \gamma)}{\partial x_j} \leq c_{ij} \tag{18}$$

for all i, j.
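When the gradient of u(·, γ) is available, the check in Corollary 7 amounts to comparing the n(n − 1) pairwise gradient differences against the repositioning costs. A minimal sketch in Python (the gradient and cost values below are hypothetical):

import numpy as np

def in_no_repositioning_set(grad, c, tol=1e-9):
    # Corollary 7: x lies in the no-repositioning set if and only if
    # du/dx_i - du/dx_j <= c_ij for every ordered pair (i, j).
    diff = grad[:, None] - grad[None, :]   # diff[i, j] = g_i - g_j
    return bool(np.all(diff <= c + tol))

grad = np.array([1.0, 1.4, 0.8])           # hypothetical gradient of u at x
c = np.full((3, 3), 0.5); np.fill_diagonal(c, 0.0)
print(in_no_repositioning_set(grad, c))    # False: 1.4 - 0.8 = 0.6 > 0.5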

The no-repositioning set Ωu (γ) can take on many forms. We first discuss the case where there

are only two locations. In this case, the no-repositioning set corresponds to a closed line segment



with the boundary being the two end points. The optimal policy reduces to a state-dependent

two-threshold policy.

Corollary 8. Suppose n = 2. For γ ∈ S, let I = N − γ1 − γ2 . Then Ωu (γ) = {(x, I − x) : x ∈

[s₁(γ), s₂(γ)]}, where s₁(γ) = inf{x : u′((x, I − x, γ₁, γ₂); (1, −1, 0, 0)) ≥ −c₂₁} and s₂(γ) = sup{x : −u′((x, I − x, γ₁, γ₂); (−1, 1, 0, 0)) ≤ c₁₂}. An optimal policy π* to (11) satisfies

$$\pi^*(x, I - x, \gamma_1, \gamma_2) = \begin{cases} (s_1(\gamma), I - s_1(\gamma)) & \text{if } x < s_1(\gamma), \\ (x, I - x) & \text{if } s_1(\gamma) \leq x < s_2(\gamma), \\ (s_2(\gamma), I - s_2(\gamma)) & \text{otherwise.} \end{cases}$$

Corollary 8 is a direct consequence of Theorem 4, Proposition 5, and the fact that there are

only two feasible directions (z, 0). It shows that the optimal policy to problem (11) in the two-

dimensional case is described by two thresholds s1 (γ) < s2 (γ) on the on-hand inventory level x at

location 1. If x is lower than s1 , it is optimal to bring the inventory level up to s1 by repositioning

inventory from location 2 to location 1. On the other hand, if x is greater than s2 , it is optimal to

bring the inventory level at location 1 down to s2 . When x falls between s1 and s2 , it is optimal

not to reposition as the benefit of inventory repositioning cannot offset the cost.
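For instance, once s₁(γ) and s₂(γ) are known, applying the two-location policy is a simple projection onto the interval [s₁, s₂]; the threshold values in the sketch below are hypothetical.

def two_location_policy(x1, I, s1, s2):
    # Corollary 8: clamp location 1's inventory to [s1, s2];
    # location 2 receives the remaining I - y1 units.
    y1 = min(max(x1, s1), s2)
    return (y1, I - y1)

print(two_location_policy(x1=1.0, I=10.0, s1=3.0, s2=7.0))   # -> (3.0, 7.0)
print(two_location_policy(x1=5.0, I=10.0, s1=3.0, s2=7.0))   # -> (5.0, 5.0), no repositioning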

When there are more than two locations, a simple threshold policy no longer exists. In what

follows, we characterize the no-repositioning set for three important special cases, the first of which

corresponds to when u(·, γ) is a convex quadratic function. In this case, the no-repositioning set is

a polyhedron defined by n(n − 1) linear inequalities.

Example 1. For a fixed γ, suppose u(y, γ) = yᵀB(γ)y + yᵀb(γ) + b₀(γ) and B(γ) is positive semidefinite. By Corollary 7, Ωᵤ(γ) = {y ∈ ∆ₙ₋₁(I) : 2yᵀBᵢ(γ) + bᵢ(γ) − 2yᵀBⱼ(γ) − bⱼ(γ) ≤ cᵢⱼ ∀ i, j}, where Bᵢ(γ) is the i-th row of B(γ).

We point out that, in general, the no-repositioning set can be non-convex. The following

example illustrates that even if u(·) is smooth, Ωu (γ) might still be non-convex.

Example 2. Suppose γ = 0, u(y) = y₁³ + y₂² + y₃², and cᵢⱼ = 0.5 (note that the inventory state y is always nonnegative, so u is convex). Then, the no-repositioning set is characterized by Ωᵤ = {y ∈ ∆ₙ₋₁ : −0.5 ≤ 3y₁² − 2y₃ ≤ 0.5, −0.5 ≤ 3y₁² − 2y₂ ≤ 0.5, −0.5 ≤ 2y₂ − 2y₃ ≤ 0.5}.



Note that in Example 2, u(y) is a convex function but the no-repositioning set is not convex because the region under the parabolas 2y₂ − 3y₁² = 0.5 and 2y₃ − 3y₁² = 0.5 is not convex. See Figure 1 for the case where N = y₁ + y₂ + y₃ = 1.

[Figure 1 here: a plot of the feasible region with the non-convex no-repositioning set Ωᵤ shaded; the horizontal axis is y₁.]

Figure 1: An illustration of a non-convex no-repositioning set

5 The Multi-Period Problem

In this section, we return to the study of the multi-period problem. The optimality equations are

given by (5) and (6). It is clear from (7) that the problem to be solved in each period can be

reduced to (11) with ut (·) in place of u(·). Consequently, the optimal policy in each period will

have the same form as the one-period problem if the functions ut (·), t = 1, . . . , T are convex and

continuous in ∆.
Recall that $u_t(y, \gamma) = \int U_t(y, \gamma, d_t, P_t) \, d\mu_t$, where

Ut (y, γ, d, P ) = Lt (y, d) + ρvt+1 (τx (y, γ, d, P ), τγ (y, γ, d, P )).

If the state update functions τx(·, d, P), τγ(·, d, P) were linear, then convexity would be preserved through
vt+1(τx(y, γ, d, P), τγ(y, γ, d, P)). As a result, Ut,d,P(·) = Ut(·, d, P), and therefore ut(·), would

be convex. However, with non-linear state updates, this is not always the case. In our context,

the state update function is piecewise affine, with the domain of each affine segment specified by



a polyhedron. This implies that vt+1 (τx (y, γ, d, P ), τγ (y, γ, d, P )) is not necessarily convex, but

instead is piecewise convex. In spite of this, we show that in our context, Ut,d,P(·), and hence ut(·), are
convex under some mild conditions on the cost parameters cij and the return probabilities pt,ij.

Theorem 9. Suppose Assumptions 1 and 2 hold. For any given t = 1, . . . , T , the function ut (·)

defined in (8) is convex and continuous in ∆. The no-repositioning set Ωut (γ) is nonempty, con-

nected and compact for all γ ∈ S, and can be characterized as in Propositions 5 and 6 and Corollary 7.

An optimal policy π ∗ = (π1∗ , . . . , πT∗ ) to the multi-period problem satisfies

$$\pi_t^*(x_t, \gamma_t) = x_t \ \text{ if } x_t \in \Omega_{u_t}(\gamma_t); \qquad \pi_t^*(x_t, \gamma_t) \in B(\Omega_{u_t}(\gamma_t)) \ \text{ otherwise.} \tag{19}$$

Moreover, for any t = 1, 2, . . . , T , we have

1. $|u_t'(y_t, \gamma_t; \pm\eta, \mp\eta)| \leq \beta \sum_{i=1}^n \eta_i$ for all $(y_t, \gamma_t) \in \Delta$ and any feasible direction $(\pm\eta, \mp\eta)$ with $\eta \geq 0$;

2. $u_t'(y_t, \gamma_t; 0, z) \leq (\rho c_{\max}/2) \sum_{i=1}^n |z_i|$ for all $(y_t, \gamma_t) \in \Delta$ and any feasible direction $(0, z)$ with $e^T z = 0$.

A comprehensive proof of Theorem 9 can be found in Appendix A.1. Here, we give an outline of

the approach. We apply induction, starting from vT +1 (y, γ) = 0. We show in Proposition A.5 and

Proposition A.6 that if vt+1 (·) is convex and satisfies certain bounds on its directional derivatives,

then, for any realization of (dt, Pt), the function Ut,dt,Pt(·) is convex and satisfies two types of bounds

on its directional derivatives. The first type shows that if we remove some of the available inventory

and turn it into ongoing rentals, the resulting reduction or increase in cost can be upper bounded by the lost
sales cost of these products. The same bound holds if we remove some of the ongoing rentals and
make them available at the locations from which they were rented. The second type of bound states

that if we change the origin of some of the ongoing rentals (i.e., we change γ only), the difference

in cost can be upper bounded by the product of (ρcmax /2) and the one-norm of the difference in γ.

The primary reason is that the total return fraction for period t, pt , does not depend on the origin.

Therefore, the difference of costs is at most the repositioning cost in the next period. To complete

the induction, we show in Proposition A.8 that given the convexity of ut (y t , γ t ) and some bounds



on its directional derivatives, vt (xt , γ t ) is convex and satisfies the directional derivative bounds

required by Proposition A.6.

6 Infinite Horizon Problem

We have shown that the optimal policy for the multi-period problem has the same form as the

one-period problem. Next we show that the same can be said about the stationary problem with

infinitely many periods. In such a problem, we denote the common distribution for (dt , P t ) by µ.

Similarly, we denote the common values of Lt (·), lt (·) and rt (·) by L(·), l(·) and r(·), respectively.

We use π to denote a stationary policy that uses the same decision rule π in each period. Under

π, the state of the process is a Markov random sequence {(Xt , Γt ), t = 1, 2, . . .}. The optimization

problem can be written as a Markov decision process (MDP):


$$v(x, \gamma) = \min_{\pi} \; \mathbb{E}_x^{\pi} \left\{ \sum_{t=1}^{\infty} \rho^{t-1} r(X_t, \Gamma_t, \pi(X_t, \Gamma_t)) \right\}, \tag{20}$$

where X1 = x a.e. is the initial state of the process. Let

$$\tilde{v}_T(x, \gamma) = \min_{\pi} \; \mathbb{E}_x^{\pi} \left\{ \sum_{t=1}^{T} \rho^{t-1} r(X_t, \Gamma_t, \pi_t(X_t, \Gamma_t)) \right\} \tag{21}$$

denote the value function of a stationary problem with T periods. It is well known that the functions
ṽT(·) converge uniformly to v(·) and that v(·) is the unique solution⁷ to

$$v(x, \gamma) = \min_{y \in \Delta_{n-1}(e^T x)} \; r(x, \gamma, y) + \rho \int v(\tau_x(y, \gamma, d, P), \tau_\gamma(y, \gamma, d, P)) \, d\mu, \tag{22}$$

where τx (·) and τγ (·) correspond to the state update functions defined in (10), i.e.,

$$\begin{aligned} \tau_x(y, \gamma, d, P) &= (y - d)^+ + P^T(\gamma + \min(y, d)), \\ \tau_\gamma(y, \gamma, d, P) &= (\gamma + \min(y, d)) \circ (e - Pe). \end{aligned} \tag{23}$$
⁷ For details, the reader may refer to Chapter 6 of Puterman [1994].



Similar to the multi-period problem, the problem to be solved can be reduced to the one-period

problem (11)

$$v(x, \gamma) = \min_{y \in \Delta_{n-1}(e^T x)} \; C(y - x) + u(y, \gamma),$$

where $u(y, \gamma) = \int U(y, \gamma, d, P) \, d\mu$ and $U(y, \gamma, d, P) = L(y, d) + \rho v(\tau_x(y, \gamma, d, P), \tau_\gamma(y, \gamma, d, P))$.

Theorem 10. Suppose Assumptions 1 and 2 hold. The function u(·, γ) is convex and continuous

in ∆. The no-repositioning set Ωu(γ) is nonempty, connected and compact for all γ ∈ S, and can be
characterized as in Propositions 5 and 6 and Corollary 7. An optimal policy π* = (π*, π*, . . .) to the

stationary problem with infinitely many periods satisfies

$$\pi^*(x, \gamma) = x \ \text{ if } x \in \Omega_u(\gamma); \qquad \pi^*(x, \gamma) \in B(\Omega_u(\gamma)) \ \text{ otherwise.} \tag{24}$$

Moreover, we have
1. $|u'(y, \gamma; \mp\eta, \pm\eta)| \leq \beta \sum_{i=1}^n \eta_i$ for all $(y, \gamma) \in \Delta$ and any feasible direction $(\mp\eta, \pm\eta)$ with $\eta \geq 0$;

2. $u'(y, \gamma; 0, z) \leq (\rho c_{\max}/2) \sum_{i=1}^n |z_i|$ for all $(y, \gamma) \in \Delta$ and any feasible direction $(0, z)$ with $e^T z = 0$.

7 An Approximate Dynamic Programming Approach

So far, we have studied the theoretical properties of the repositioning problem. In this section,

we propose an approximate dynamic programming algorithm, to which we refer as “Repositioning-

ADP,” that exploits the structure of both the value function and the optimal policy under a sampled

demand and return model. Although Theorems 9 and 10 allow for the use of convex optimization

to help resolve the issue of a multi-dimensional continuous action space, the difficulty of a multi-

dimensional and continuous state space remains. We refer readers to Bertsekas and Tsitsiklis [1996]

and Powell [2007] for a detailed discussion of the computational challenges and solution methods

associated with large MDPs. In particular, we note that when the problem size (the number of

locations or the number of time periods) is large, simple approximations of continuous problems,

such as discretization or aggregation, will usually fail. In addition, discretization can cause our



structural properties to break down, which means the convexity result and characterization of

the optimal policy given in Theorems 9 and 10 can no longer be readily used. Informal numerical

experiments show that even if we do not consider the ongoing rentals (rental period is always one),

approximating the dynamic program via discretization within a reasonable accuracy is already a

formidable task for a three-location problem.

It is thus necessary for us to consider more scalable techniques. A key feature of the algorithm

we describe next is that each iteration involves solving one or more linear programs, allowing it

to leverage the scalability and computational advantages of off-the-shelf solvers. We show via nu-

merical experiments that the algorithm can produce high quality solutions on problems with states

up to 20 dimensions (10 locations) within a reasonable amount of time. The algorithm also pos-

sesses the important theoretical property of asymptotically optimal value function approximations;

see Theorem 14. In the rest of this section, we motivate and describe the algorithm, prove its

convergence, discuss some practical considerations, and present the numerical results.

7.1 The Repositioning-ADP Algorithm

Theorems 9 and 10 describe the most important feature of our dynamic program, that u(·), the

sum of the current-period lost sales and the cost-to-go, is convex and continuous. Moreover,

Proposition 6 provides a characterization of when it is optimal not to reposition. Our algorithm

takes advantage of these two structural results. It is well known that a convex function can be

written as the point-wise supremum of its tangent hyperplanes, i.e.,

$$u(y, \gamma) = \sup_{\hat{y}, \hat{\gamma}} \; u(\hat{y}, \hat{\gamma}) + (y - \hat{y})^T \nabla_y u(\hat{y}, \hat{\gamma}) + (\gamma - \hat{\gamma})^T \nabla_\gamma u(\hat{y}, \hat{\gamma}).$$

This suggests that we can build an approximation to u(·) by iteratively adding lower-bounding

hyperplanes, with the hope that the approximation becomes arbitrarily good when enough hyper-

planes are considered. This is the main idea of the algorithm, with special considerations made to

account for the complicated structure of the state update functions.
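Concretely, such an approximation can be stored as the coefficient vectors of the accumulated cuts and evaluated as their pointwise maximum, which lower-bounds the convex function being approximated. A minimal sketch of this representation in Python (the class and its names are ours, for illustration only):

import numpy as np

class CutApproximation:
    # Stores cuts g_k(y, gamma) = a_k.(y - y_k) + b_k.(gamma - gamma_k) + c_k
    # and evaluates u_J(y, gamma) = max_k g_k(y, gamma).
    def __init__(self):
        self.cuts = []

    def add_cut(self, a, b, c, y_k, gamma_k):
        self.cuts.append((np.asarray(a), np.asarray(b), float(c),
                          np.asarray(y_k), np.asarray(gamma_k)))

    def value(self, y, gamma):
        return max(a @ (y - y_k) + b @ (gamma - gamma_k) + c
                   for a, b, c, y_k, gamma_k in self.cuts)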

The features and analysis that distinguish our ADP algorithm from previous work in the liter-

ature are summarized below.

1. Our algorithm has the ability to skip the optimization step when a sampled state is detected



as being in the no-repositioning region. This step uses Proposition 6 and it is applied to the

value function approximation at every iteration.

2. The underlying model of SDDP and other cutting-plane methods (see, e.g., Higle and Sen

[1991], Pereira and Pinto [1991], Birge and Zhao [2007]) is typically a two-stage or multi-

stage stochastic linear program. In our case, we must account for the non-linear state update

functions τx and τγ . We emphasize that the choice to design our algorithm around u(·) and not

v(·) is critical. Otherwise, the non-linear state update functions would need to be incorporated

in the optimization step, in the sense of (22). Given that v(τx (y, γ, d, P ), τγ (y, γ, d, P )) is

not necessarily convex, this poses a significant challenge. Instead, our approach leverages the

convexity of u(·) to sidestep this issue entirely (the state updates are then computed outside

of the optimization step).

3. Our algorithm is designed for the infinite horizon setting, where each approximation “boot-

straps” from the previous approximation and convergence is achieved despite the absence of

a terminal condition such as “vT +1 ≡ 0” used in the finite-horizon case. As such, the conver-

gence analyses used in Chen and Powell [1999], Linowsky and Philpott [2005], Philpott and

Guan [2008], Shapiro [2011], and Girardeau et al. [2014] do not apply.⁸ Moreover, we remove

a strong condition used in a previous convergence result by Birge and Zhao [2007] for the

infinite horizon setting, where cuts are computed at states that approximately maximize a

Bellman error criterion. Selecting such a state requires solving a difference of convex functions

optimization problem. Our algorithm and proof technique do not require this costly step.

Throughout this section, suppose that we are given M samples of the demand and the return

fraction matrix (d1 , P 1 ), (d2 , P 2 ), . . . , (dM , P M ). Our goal is to optimize the sampled model. The

idea is to start with an initial piecewise-affine approximation u0 (y, γ) (such as u0 (y, γ) = 0) and

then dynamically add linear functions (referred to as cuts in our discussion) into consideration.

Suppose we currently have uJ (y, γ) = maxk=1,...,NJ gk (y, γ) where

$$g_k(y, \gamma) = (y - y_k)^T a_k + (\gamma - \gamma_k)^T b_k + c_k,$$
⁸ For example, we do not make use of a property that there are only a finite number of distinct cuts; see Lemma 1
of Philpott and Guan [2008]. We remark, however, that our algorithm has a natural adaptation for finite-horizon
problems.



and NJ is the total number of cuts in the approximation after iteration J. We then need to evaluate

the functional value and the gradient of the following function:

$$\bar{u}_J(y, \gamma) = \frac{1}{M} \sum_{s=1}^{M} \left\{ L(y, d_s) + \rho \bar{v}_J(\tau_x(y, \gamma, d_s, P_s), \tau_\gamma(y, \gamma, d_s, P_s)) \right\},$$

where

$$\bar{v}_J(x, \zeta) = \min_{z \in \Delta_{n-1}(e^T x)} \; C(z - x) + u_J(z, \zeta) \tag{25}$$

at a sample point (ỹ, γ̃). Note that v̄J(x, ζ) is the optimal value of a linear program. To find the derivatives of
v̄J(x, ζ), we write down the dual formulation for v̄J(x, ζ) as follows:

$$\begin{aligned}
\bar{v}_J(x, \zeta) = \max \quad & (\lambda_0 e + \lambda)^T x + \sum_{k=1}^{N_J} \mu_k \left( -a_k^T y_k + b_k^T (\zeta - \gamma_k) + c_k \right) \\
\text{s.t.} \quad & \sum_{k=1}^{N_J} \mu_k = 1, \\
& \lambda_i - \lambda_j \leq c_{ij}, \quad \forall\, i, j = 1, 2, \ldots, n, \\
& -\lambda_i + \sum_{k=1}^{N_J} \mu_k a_{ki} - \lambda_0 \geq 0, \quad \forall\, i = 1, 2, \ldots, n, \\
& \mu_k \geq 0, \quad \forall\, k = 1, 2, \ldots, N_J.
\end{aligned} \tag{26}$$

From (26), we understand that ∇x v̄J(x, ζ) = λ0* e + λ* and ∇ζ v̄J(x, ζ) = Σ_{k=1}^{NJ} µk* bk, where

(λ∗0 , λ∗ , µ∗ ) is an optimal solution for problem (26). The Jacobian matrix for the state update

function is

∇x̄,y = Diag(1_{y>dk}) + Pk ◦ (1_{y≤dk} e^T),   ∇x̄,γ = Pk,
∇γ̄,γ = Diag(e − Pk e),   and   ∇γ̄,y = Diag((e − Pk e) ◦ 1_{y≤dk}),

where x̄ and γ̄ stand for τx and τγ, respectively. By solving (26) for all pairs (τx(y, γ, di, Pi), τγ(y, γ, di, Pi)), we can find the tangent hyperplane of ūJ(y, γ) at (ỹ, γ̃).
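To make this computation concrete, the following is a minimal sketch (ours, not the authors' implementation) of how (26) might be solved with an off-the-shelf LP solver; the function name, the tuple layout of the cuts, and the use of scipy are our assumptions. The decision vector is ordered as [λ0, λ1, …, λn, µ1, …, µNJ].

    import numpy as np
    from scipy.optimize import linprog

    def solve_dual_26(x, zeta, cuts, c):
        """Solve (26); cuts is a list of (a_k, b_k, y_k, gamma_k, c_k),
        c is the n-by-n repositioning cost matrix."""
        n, N = len(x), len(cuts)
        # objective: (lambda0*e + lambda)^T x
        #   + sum_k mu_k * (-a_k^T y_k + b_k^T (zeta - gamma_k) + c_k)
        obj = np.concatenate(([x.sum()], x,
                              [-a @ yk + b @ (zeta - gk) + ck
                               for (a, b, yk, gk, ck) in cuts]))
        A_ub, b_ub = [], []
        for i in range(n):                    # lambda_i - lambda_j <= c_ij
            for j in range(n):
                if i != j:
                    row = np.zeros(1 + n + N)
                    row[1 + i], row[1 + j] = 1.0, -1.0
                    A_ub.append(row); b_ub.append(c[i][j])
        for i in range(n):                    # lambda0 + lambda_i - sum_k mu_k a_ki <= 0
            row = np.zeros(1 + n + N)
            row[0], row[1 + i] = 1.0, 1.0
            for k, (a, *_rest) in enumerate(cuts):
                row[1 + n + k] = -a[i]
            A_ub.append(row); b_ub.append(0.0)
        A_eq = [np.concatenate(([0.0], np.zeros(n), np.ones(N)))]  # sum_k mu_k = 1
        bounds = [(None, None)] * (1 + n) + [(0, None)] * N        # mu >= 0
        res = linprog(-obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=bounds, method="highs")
        lam0, lam, mu = res.x[0], res.x[1:1 + n], res.x[1 + n:]
        grad_x = lam0 + lam                   # = lambda0* e + lambda*
        grad_zeta = sum(m * b for m, (_, b, *_rest) in zip(mu, cuts))
        return -res.fun, grad_x, grad_zeta

The returned gradients correspond exactly to ∇x v̄J = λ0* e + λ* and ∇ζ v̄J = Σk µk* bk above; they are then pushed through the Jacobians of the state update to assemble a cut.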

While (26) can always be solved, we can apply Proposition 6, a characterization of the no-

repositioning region, to reduce the computational load. We first define some terms for uJ (y, γ).

Let K = {k | aki − akj ≤ cij ∀ i, j} denote the set of cuts that satisfy the no-reposition condition.



Parameters: l, cij, ρ > 0
Data: d1, d2, . . . , dM, P1, P2, . . . , PM
Input: current approximation uJ(y, γ) = max_{k=1,...,NJ} gk(y, γ); new sample point (ỹ, γ̃)
Initialization: K = ∅
for k = 1, 2, . . . , NJ,
    if aki − akj ≤ cij ∀ i, j
        Add k into the set K
    end if
end for
for s = 1, 2, . . . , M,
    x̄s = (ỹ − ds)+ + Ps^T (γ̃ + min(ỹ, ds)),
    γ̄s = (γ̃ + min(ỹ, ds)) ◦ (e − Ps e),
    Let D = argmax_{k=1,...,NJ} gk(x̄s, γ̄s)
    if D ∩ K = ∅
        Solve v̄J(x̄s, γ̄s) in equation (26)
        Denote the optimal solution by (λ0*, λ*, µ*) and the objective value by v*
    else
        Pick any k ∈ D ∩ K
        λ0* = 0, λ* = ak, µ* = ek, v* = (x̄s − yk)^T ak + (γ̄s − γk)^T bk + ck
    end if
    Set c̄s = β Σ_{i=1}^n (ds,i − ỹi)+ + ρ v*,  ās = λ0* e + λ*,  b̄s = Σ_{j=1}^{NJ} µj* bj
    Set ∇s_{x̄,y} = Diag(1_{ỹ>ds}) + Ps ◦ (1_{ỹ≤ds} e^T),  ∇s_{x̄,γ} = Ps
    Set ∇s_{γ̄,γ} = Diag(e − Ps e),  ∇s_{γ̄,y} = Diag((e − Ps e) ◦ 1_{ỹ≤ds})
end for
Set ã = (1/M) Σ_{s=1}^M ( −β Σ_{i=1}^n 1_{ỹi≤ds,i} ei + ρ ∇s_{x̄,y} ās + ρ ∇s_{γ̄,y} b̄s )
Set b̃ = (1/M) Σ_{s=1}^M ( ρ ∇s_{x̄,γ} ās + ρ ∇s_{γ̄,γ} b̄s ),  c̃ = (1/M) Σ_{s=1}^M c̄s
Output: g̃(y, γ) = ã^T (y − ỹ) + b̃^T (γ − γ̃) + c̃

Subroutine 1: Cut Adding Basic Subroutine (CABS)

Parameters: l, cij , ρ > 0


Data: d1 , d2 , . . . , dM , P 1 , P 2 , . . . , P M
Input: Initial approximation u0 (y, γ) = maxk=1,...,N0 gk (y, γ)
for J = 0, 1, 2, . . .
Sample a finite set of states SJ+1 from ∆ according to some distribution
for s = 1, 2, . . . , |SJ+1 |
Let (ỹ s , γ̃ s ) be the sth sampled state in SJ+1
Run Subroutine 1 with input (ỹ s , γ̃ s ) and let the result be gs+NJ (y, γ)
end for
Set uJ+1 (y, γ) = maxk=1,...,NJ+1 gk (y, γ), where NJ+1 = NJ + |SJ+1 |
end for

Algorithm 1: Repositioning-ADP Algorithm



We also let

Dk = { (y, γ) ∈ ∆ : (y − yk)^T ak + (γ − γk)^T bk + ck ≥ (y − yl)^T al + (γ − γl)^T bl + cl, ∀ l = 1, 2, . . . , NJ }     (27)

denote the subset of the feasible region on which the k-th cut dominates. Then we have the following lemma.

Lemma 11. If x ∈ Dk with k ∈ K, we have x ∈ ΩuJ(ζ)9 and one optimal solution for problem (26) is λ = ak, µk = 1, µl = 0, ∀ l ≠ k.

The basic procedure of adding a cut (CABS) is summarized in Subroutine 1, and the Repositioning-ADP algorithm is given in Algorithm 1. The essential idea is to iterate the following steps: (1)

sample a set of states, (2) compute the appropriate supporting hyperplanes at each state, and

(3) add the hyperplanes to the convex approximation of u(y, γ). If uJ (y, γ) ≤ u(y, γ), we have

ūJ (y, γ) ≤ u(y, γ). Therefore, gs+NJ (y, γ), a tangent hyperplane for ūJ (y, γ), is a lower bound

for u(y, γ), which means that uJ+1 (y, γ) is also a lower bound for u(y, γ). Through the course of

Repositioning-ADP, we obtain an improving sequence of lower approximations to the true u(y, γ)

function. Hence, if u0 is a uniform underestimate of u, the sequence {uJ(y, γ)} is bounded and monotone, and thus its limit exists.
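Schematically, the outer loop of the method can be organized as follows; this is a sketch under our own naming, with cabs standing in for Subroutine 1 and sample_state for whichever sampling distribution is chosen (data and parameters are assumed to be closed over by cabs).

    # Sketch of Algorithm 1 (Repositioning-ADP); each call to cabs returns one
    # lower-bounding affine cut, and u_J is the max over all cuts collected so far.
    def repositioning_adp(cabs, sample_state, num_iterations, batch_size=1):
        cuts = []                         # each cut: a tuple (a, b, y_tilde, gamma_tilde, c)
        for J in range(num_iterations):
            for _ in range(batch_size):   # |S_{J+1}| sampled states per iteration
                y_tilde, gamma_tilde = sample_state(J)
                cuts.append(cabs(cuts, y_tilde, gamma_tilde))
        return cuts                       # u_J(y, gamma) = max_k g_k(y, gamma)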

There are several reasonable strategies for sampling the set SJ+1 . The easiest way is to set

|SJ | = 1 (i.e., only add a single cut10 per iteration) and then sample one state according to some

distribution over ∆ — this is the approach taken in the numerical experiments of this paper. Our

implementation of Repositioning-ADP also uses an iteration-dependent state sampling distribu-

tion to improve the practical performance (see Section 7.3); therefore, we introduce the following

assumption to support the convergence analysis.

9. Note that, in general, Ωu(γ) ≠ ∪k∈K Dk. The reason is that even if two cuts are both not in K, the intersection of these two cuts could still include the subgradient that satisfies the no-reposition condition.

10. If parallel computing is available, one might consider the “batch” version of the algorithm (i.e., |SJ+1| > 1) by performing the inner for-loop of Algorithm 1 on multiple processors (or workers). In this case, each worker receives uJ, samples a state, and computes the appropriate supporting hyperplane. The main processor would then aggregate the results into uJ+1 and start the next iteration by broadcasting uJ+1 to each worker.

Assumption 3. On any iteration J, the sampling distribution produces a set SJ of states from ∆◦. The sampled sets {SJ}∞J=1 satisfy

Σ_{J=1}^∞ P(SJ ∩ A ≠ ∅) = ∞

for any set A ⊆ ∆◦ with positive volume.

This should be interpreted as an “adequate exploration” assumption. As an example, for the case of one sample per iteration, one might consider the following sampling strategy parameterized by a deterministic sequence {εJ}: with probability 1 − εJ, choose the state in any manner and, with probability εJ, select a state uniformly at random over ∆◦. In this case, we have that P(SJ ∩ A ≠ ∅) ≥ εJ · volume(A). As long as ΣJ εJ = ∞, Assumption 3 is satisfied.
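As a sketch of this mixture strategy (our illustration; the experiments in Section 7.4 use the re-scaled sampling scheme described in footnote 13), taking εJ = 1/J makes the divergence condition hold:

    import numpy as np

    def sample_state(J, heuristic_sample, n, N=1.0, rng=np.random.default_rng()):
        eps = 1.0 / max(J, 1)                     # harmonic epsilon_J: sum diverges
        if rng.random() < eps:
            # Dirichlet(1,...,1) is uniform on the simplex; scale to total mass N
            s = N * rng.dirichlet(np.ones(2 * n))
            return s[:n], s[n:]                    # split the state into (y, gamma)
        return heuristic_sample(J)                 # e.g., a replay buffer (Section 7.3.2)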

7.2 Convergence of Repositioning-ADP

Let us now introduce some notation. For any bounded function f : ∆ → R, define the mapping L so that Lf : ∆ → R is the bounded function given by

(Lf)(y, γ) = l(y) + ρ ∫ min_{y′ ∈ ∆n−1(eT x′)} { C(y′ − x′) + f(y′, γ′) } dµ   ∀ (y, γ) ∈ ∆,     (28)

where x′ = τx(y, γ, d, P) and γ′ = τγ(y, γ, d, P). Note that L is closely related to the standard

Bellman operator associated with the MDP defined in (20); see, for example, Bertsekas and Tsitsik-

lis [1996]. The difference from the standard definition is that L comes from the Bellman recursion

for u(y, γ) instead of v(x, γ). With this in mind, we henceforth simply refer to L as the “Bellman

operator” and note a few standard properties.

Lemma 12. The Bellman operator L has the following properties:

1. (Monotonicity) Given bounded f1 , f2 : ∆ → R with f1 ≤ f2 , then Lf1 ≤ Lf2 .

2. (Contraction) For any bounded f1, f2 : ∆ → R, it holds that ‖Lf1 − Lf2‖∞ ≤ ρ ‖f1 − f2‖∞.

3. (Fixed Point) The optimal value function u is the unique fixed point of L, i.e., Lu = u.

4. (Constant Shift) Let 1 be the constant one function, i.e., 1(·) = 1, and let α be a scalar. For

any bounded f : ∆ → R, it holds that L(f + α1) = Lf + ρ α1.



We are now ready to discuss the convergence of the Repositioning-ADP algorithm. For sim-

plicity, we consider the case where |SJ+1 | = 1 for all iterations J. The extension to the batch

case, |SJ+1| > 1, follows the same idea and is merely a matter of more complicated notation (note, however, that we will nevertheless make use of a simple special case of the batch algorithm as an analysis tool within the proof). Our convergence proof makes use of a Lipschitz condition, which is a

property of the repositioning problem, stated in the following lemma.

Lemma 13. Consider a bounded function f : ∆ → R that is convex, continuous, and satisfies

1. |f′(y, γ; ∓η, ±η)| ≤ β Σ_{i=1}^n ηi for all (y, γ) ∈ ∆ and any feasible direction (∓η, ±η) with η ≥ 0;

2. f′(y, γ; 0, v) ≤ (ρcmax/2) Σ_{i=1}^n |vi| for all (y, γ) ∈ ∆ and any feasible direction (0, v) with eT v = 0.

Then, the function f is Lipschitz continuous on ∆◦ with Lipschitz constant (3/2)√(2n) β. In addition, the function Lf also satisfies the two conditions above, i.e., properties 1 and 2 above hold with Lf replacing f.

We are now ready to state the convergence result for Repositioning-ADP.

Theorem 14. Suppose Assumptions 1, 2, and 3 hold and that Repositioning-ADP samples one state per iteration. If u0(·) = 0, the sequence {uJ(·)} converges uniformly and almost surely to the optimal value function u(·), i.e., it holds that ‖uJ − u‖∞ → 0 almost surely.

The proof of Theorem 14 relies on relating each sample path of the algorithm to an auxiliary

algorithm where the cuts are added in “batches” rather than one by one. We show that, after ac-

counting for the different timescales, the value function approximations generated by Repositioning-

ADP are close to the approximations generated by the auxiliary algorithm. By noticing that the

auxiliary algorithm is an approximate value iteration algorithm whose per-iteration error can be bounded in ‖ · ‖∞ due to Lemma 13, we quantify its error against exact value iteration, which in turn allows us to quantify the error between Repositioning-ADP and exact value iteration. We make use of ε-covers of the state space (for arbitrarily small ε) along with Assumption 3 to argue that this error converges to zero.



7.3 Some Practical Considerations

There are two primary practical challenges that arise when implementing Algorithm 1: (1) the

value function approximations are represented by an unbounded number of cuts and (2) the design

of the state-sampling strategy, which becomes crucial in a high-dimensional state space.

7.3.1 Removing Redundant Cuts

If we keep adding cuts to the existing approximation, some cuts become dominated by others, i.e., there exists some k ∈ {1, 2, . . . , NJ} such that gk(y, γ) < maxj=1,...,NJ gj(y, γ) for all (y, γ) ∈ ∆. It is

important to remove these redundant cuts since they can lower the efficiency of solving optimization

problem (26). Fortunately, the simple structure of the simplex enables us to check whether a piece

is redundant efficiently and effectively. We first show how to determine whether a cut is completely

dominated by another cut over the simplex.

Proposition 15. a1^T(y − y1) + b1^T(γ − γ1) + c1 ≥ a2^T(y − y2) + b2^T(γ − γ2) + c2 for all (y, γ) ∈ ∆ if and only if

c1 − a1^T y1 − b1^T γ1 − c2 + a2^T y2 + b2^T γ2 + min{ mini {a1i − a2i}, mini {b1i − b2i} } ≥ 0.

Thus, to check whether a cut is completely dominated by another cut, one only needs to perform a series of elementary operations and verify one inequality, the computational effort of which is negligible compared to solving a linear program. Consequently, we can always check whether the current cut either dominates or is dominated by other cuts by checking at most 2NJ

inequalities. Though Proposition 15 is simple to implement, it does not cover the situation where

a cut is dominated by the maximum of several other cuts, which occurs more frequently. The next

proposition addresses this situation.

Proposition 16. Dk ≠ ∅ if and only if the objective function value of the following linear program

min  t
subject to  eT y + eT γ = N,
            t ≥ al^T(y − yl) + bl^T(γ − γl) + cl − ak^T(y − yk) − bk^T(γ − γk) − ck,  ∀ l ≠ k,     (29)
            y, γ ≥ 0,

is negative.

By solving the linear program (29) at most NJ times, one can remove all of the redundant

pieces. One can perform this redundancy check periodically. Our numerical implementation also employs the following strategy: we track the iterations at which each cut dominates when identifying D. If a cut has not been dominating for a number of iterations greater than some threshold, we consider it potentially redundant and check for redundancy by solving (29) once. Despite these attempts to reduce the number of cuts, problem instances with more locations naturally require more

cuts to accurately represent the value function u(·). To control the computation required for solving

linear programs, we also make the practical recommendation to set an absolute upper bound on

the number of cuts used in the approximation, with older cuts dropped as needed.
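For illustration, the two checks above might be implemented as follows (a sketch with our own function names and data layout, assuming each cut is stored as a tuple (a, b, y, γ, c); the N factor in the pairwise test reduces to the condition displayed in Proposition 15 when, as in our experiments, N = 1):

    import numpy as np
    from scipy.optimize import linprog

    def pairwise_dominates(cut1, cut2, N=1.0):
        """Proposition 15: True if cut1 >= cut2 everywhere on the simplex."""
        a1, b1, y1, g1, c1 = cut1
        a2, b2, y2, g2, c2 = cut2
        const = (c1 - a1 @ y1 - b1 @ g1) - (c2 - a2 @ y2 - b2 @ g2)
        slope = min((a1 - a2).min(), (b1 - b2).min())
        return const + N * slope >= 0.0

    def cut_is_needed(k, cuts, N=1.0):
        """Proposition 16 via LP (29): cut k attains the max somewhere iff optimum < 0.
        Decision vector is [t, y_1..n, gamma_1..n]."""
        n = len(cuts[0][0])
        ak, bk, yk, gk, ck = cuts[k]
        A_ub, b_ub = [], []
        for l, (al, bl, yl, gl, cl) in enumerate(cuts):
            if l == k:
                continue
            # (g_l - g_k)(y, gamma) - t <= 0
            row = np.concatenate(([-1.0], al - ak, bl - bk))
            rhs = (cl - al @ yl - bl @ gl) - (ck - ak @ yk - bk @ gk)
            A_ub.append(row); b_ub.append(-rhs)
        A_eq = [np.concatenate(([0.0], np.ones(2 * n)))]   # e^T y + e^T gamma = N
        bounds = [(None, None)] + [(0, None)] * (2 * n)
        res = linprog(np.concatenate(([1.0], np.zeros(2 * n))),
                      A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[N],
                      bounds=bounds, method="highs")
        return res.fun < 0.0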

7.3.2 Sampling Distribution

We now propose a more effective method of sampling states for the ADP algorithm beyond the

naive choice of a uniform distribution over ∆. Our tests indicate that, especially when the number

of locations is large, uniform sampling is unable to prioritize the sampling in important regions of

the state space (for example, states with large γ are unlikely to occur in problem instances where

the return probability is high). A reasonable strategy is to periodically simulate the ADP policy

(i.e., the one implied by the current value function approximation) and collect the set of states

visited under this policy — termed a replay buffer. On future iterations, we can sample a portion

of states at which to compute cuts directly from the replay buffer. This idea is based on the notion

of experience replay within the reinforcement learning literature (see Lin [1992] and also Mnih et al.

[2015] for a recent application).
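A minimal sketch of the replay-buffer idea follows (ours; greedy_step, which solves (25) at the current state under the current cuts and then simulates one period of demand and returns, is an assumed helper):

    # Periodically simulate the policy implied by the current cuts and store
    # the visited states; future iterations sample cut locations from `buffer`.
    def refresh_replay_buffer(cuts, greedy_step, initial_state, horizon, buffer):
        state = initial_state
        for _ in range(horizon):
            state = greedy_step(cuts, state)   # reposition, then simulate one period
            buffer.append(state)
        return buffer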

7.4 Experiments on Random Problem Instances

We first present some benchmarking results of running Repositioning-ADP on a set of randomly

generated problems ranging from n = 2 to n = 10 locations, the largest of which corresponds to

a dynamic program with a 20-dimensional continuous state space. We set the discount factor as

ρ = 0.95, the repositioning costs to be cmin = cmax = 1, and the lost sales cost as β = 2cmax = 2.11

11. A relatively high lost sales cost reflects the attitude of customers that on-demand rental services should be convenient. Therefore, the firm bears the risk of customers leaving the platform when they are inconvenienced by low supply.


We consider normalized total inventory of N = 1, and for each problem instance, we take M = 50

demand and return probability samples as follows. With each location i, we associate a truncated

normal demand distribution (so that it is nonnegative) with mean νi and standard deviation σi. The νi are drawn from a uniform distribution and then normalized so that Σi νi = 0.3. We then set σi = νi so that locations with higher mean demand are also more volatile. Next, we follow

Assumption 1 and sample one outcome of a matrix (q̃ij ) such that each row is chosen uniformly

from a standard simplex. Each of the M samples of the return probability matrix consists of (q̃ij )

multiplied by a random scaling factor drawn from Uniform(0.7, 0.9).12 Hence, we have pmin = 0.7.
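For concreteness, the instance generator just described might look as follows (a sketch under our own naming; we clip the normal samples at zero as a simple stand-in for the truncated normal, and the seeding details are ours):

    import numpy as np

    def generate_instance(n, M, rng=np.random.default_rng(0)):
        nu = rng.uniform(size=n)
        nu *= 0.3 / nu.sum()                    # mean demands normalized to sum to 0.3
        sigma = nu.copy()                       # higher mean demand => more volatile
        demands = np.maximum(rng.normal(nu, sigma, size=(M, n)), 0.0)
        q = rng.dirichlet(np.ones(n), size=n)   # one outcome: rows uniform on the simplex
        scales = rng.uniform(0.7, 0.9, size=M)  # random return-fraction scaling per sample
        P = scales[:, None, None] * q[None, :, :]
        return demands, P                       # M demand vectors, M return matrices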

In our experiments, we compare the Repositioning-ADP (R-ADP) policy to the myopic (Myo.)

policy (i.e., the policy associated with v(·) = 0) and the baseline no-repositioning (No-R) policy.

We use a maximum of 1000 cuts for all problem instances and we run the Repositioning-ADP

algorithm for 10,000 iterations for n ≤ 6 and for 20,000 iterations for n = 7, 8, 9, 10. We initially

sample 80% of states randomly13 and 20% of states from the replay buffer of the myopic policy. As

the algorithm progresses, we transition toward a distribution of 20% randomly, 0% from the myopic

replay buffer, and 80% from the current ADP replay buffer. Note that Assumption 3 is satisfied for

this sampling scheme. Redundancy checks are performed every 250 iterations. The performance

of the ADP algorithm is evaluated using Monte-Carlo simulation over 500 sample paths (across 20

initial states, randomly sampled subject to zero outstanding rentals) at various times during the

training process. Since the ADP algorithm itself is random, we repeat the training process 10 times

for each problem instance in order to obtain confidence intervals.
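The evaluation itself is standard discounted-cost simulation; a minimal sketch (ours; step_cost applies the policy's repositioning at the current state, draws one demand/return realization, and returns the period cost together with the next state):

    import numpy as np

    def evaluate_policy(step_cost, initial_states, num_paths, horizon, rho):
        totals = []
        for s0 in initial_states:
            for _ in range(num_paths // len(initial_states)):
                state, total, disc = s0, 0.0, 1.0
                for _ in range(horizon):
                    cost, state = step_cost(state)
                    total += disc * cost
                    disc *= rho
                totals.append(total)
        # mean discounted cost and its standard error across sample paths
        return np.mean(totals), np.std(totals) / np.sqrt(len(totals))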

The results14 are summarized in Table 1. The first column ‘n & Dim.’ shows the number of

locations and the dimension of the state. The second column ‘Sec./Iter.’ shows the CPU time on a

4 GHz Intel Core i7 processor using four cores, which includes the time needed to remove cuts and

generate the replay buffer. The ‘R-ADP Cost’ column refers to the average cost attained by the ADP
12. This roughly corresponds to an average rental duration between 1.1 and 1.5 periods, which is reasonable for systems with frequent repositioning.

13. Each sampled state is given by (ỹ, γ̃) = (ξ y′, (1 − ξ) γ′) ∈ ∆, where y′ and γ′ are independent uniform samples from ∆n−1(N) and ξ ∼ Uniform(pmin(M), pmax(M)), where pmin(M) and pmax(M) are the minimum and maximum row sums of the return fraction matrix over the M samples. This sampling scheme can be considered a nearly uniform sample over the state space, except with the two parts of the state re-scaled by relevant problem parameters so that they are more likely to fall in important regions.

14. The specific realizations of the randomly generated parameters used in each instance are available upon request. The same random seed is used in all instances (i.e., all n) to generate the problem parameters.


n (Dim. 2n)   Sec./Iter.   R-ADP Cost   % Decr. Myo.   % Decr. No-R   % to LB   R-ADP %-R   Myo. %-R
2 (4)         0.06         1.20         56.16%         76.48%         99.18%    5.01%       3.01%
3 (6)         0.21         3.55         23.63%         52.17%         98.70%    17.20%      11.50%
4 (8)         0.27         1.69         22.25%         60.69%         95.84%    5.62%       3.72%
5 (10)        0.22         3.65         31.64%         65.14%         96.38%    15.54%      11.30%
6 (12)        0.29         2.45         37.70%         75.20%         94.08%    8.09%       5.26%
7 (14)        0.42         3.08         19.23%         60.12%         88.10%    11.87%      8.49%
8 (16)        0.44         3.83         24.91%         59.67%         85.03%    14.42%      8.77%
9 (18)        0.48         2.40         35.04%         64.65%         88.20%    8.77%       4.89%
10 (20)       0.54         2.14         28.58%         59.65%         83.31%    7.28%       4.49%

Table 1: Summary of Results for Repositioning-ADP Benchmarks

policy; the ‘% Decr. Myo’ and ‘% Decr. No-R’ columns refer to the percentage improvement (cost

decrease) that the ADP policy achieves over the myopic policy and the baseline no-repositioning

policy, respectively. Significantly lower costs are observed in all instances: 19%–56% against the

myopic policy and 52%–76% against the no-repositioning policy.

The ‘% to LB’ column should be interpreted as an optimality gap metric, computed as the

percentage of the lower bound (LB) achieved when the baseline no-repositioning policy is set as

“0% optimal.” This is done via the formula

% to LB = (Cost of No-R Policy − Cost of ADP Policy) / (Cost of No-R Policy − Best Lower Bound).     (30)

In terms of wall clock time, we observe that our ADP algorithm produces near-optimal results

for n ≤ 6 within an hour (for n = 6, we are using 0.29 · 10000 seconds or 48 minutes). For the

larger problems of n ≥ 7, when provided a limited amount of computation — around three hours

for 20,000 iterations — the estimated optimality gap is slightly larger, between 12%–17%. Figure 2

shows the performance of the ADP policy as the algorithm progresses, along with 95% confidence

intervals and lower bounds. We believe that the optimality gaps for the larger problems can be

reduced with additional computation, as suggested by the convergence plots of Figure 2. Our

informal experiments suggest that reasonable solutions might be attainable for problems of up to

n = 30 locations if computation time on the order of a few days is allowed (especially if parallel

computing is available). Other considerations, such as shorter/finite horizons or smaller numbers

of demand and return fraction samples, could facilitate the convergence of Repositioning-ADP



Figure 2: Performance of Repositioning-ADP on Randomly Generated Problems. Panels: (a) 3 Locations / 6-Dim. State; (b) 5 Locations / 10-Dim. State; (c) 7 Locations / 14-Dim. State; (d) 8 Locations / 16-Dim. State; (e) 9 Locations / 18-Dim. State; (f) 10 Locations / 20-Dim. State. [Plots not reproduced here.]

on larger problems beyond the 20-dimensional cases we tested. Approximating MDPs of dimension larger than this is well known to be extremely challenging due to the curse of dimensionality;

indeed, Lu et al. [2017] approximate a 9-dimensional problem using two-stage stochastic integer

programming and He et al. [2018] approximate a 5-dimensional problem using a robust approach

within an MDP model.

Finally, we examine the average percentage of total inventory repositioned in each period by

the ADP policy and the myopic policy, respectively. The results are given in the last two columns,



‘R-ADP %-R’ and ‘Myo. %-R’, and show that the repositioning activity of the ADP policy is

between 37% (5 locations) and 79% (9 locations) higher than the repositioning activity of the

myopic policy. This suggests that the improvement upon the myopic policy can be attributed to

a considerably more aggressive repositioning strategy. Since the myopic policy does not take into

account customers’ return behaviors, the additional repositioning activity observed in the ADP

policy can be explained by its attempt to plan for the future by counteracting the effects of P .

This leads to an important question: in which situations are the effects of P worthwhile to consider

and in which situations do they not matter? In other words, when does the myopic policy perform

well? We investigate these and other practical questions in the next section.

Figure 3: Network Structure used in Section 7.5. (Five locations arranged in a ring with repositioning cost 1 per link; at each location, a rented unit remains rented with probability 1 − αp, returns to its origin with probability αp(1 − 4αq/5), and returns to each of the other four locations with probability αp αq/5. Figure not reproduced here.)

Parameter Name               Value    Range
n, number of locations       5        –
N, total inventory           1.0      –
ρ, discount factor           0.95     –
β, lost sales cost           4.5      –
cij, repositioning cost      {1, 2}   –
αν, demand mean              0.3      [0.1, 0.5]
ασ, demand volatility        1.0      [0.5, 1.5]
αp, return fraction          0.75     [0.4, 1.0]
αq, return uniformity        0.5      [0.0, 1.0]

Table 2: Parameter Values used in Section 7.5

7.5 Comparative Statics

Here, we aim to compare the ADP, myopic, and no-repositioning policies across a range of parameter

settings, with the goal of studying the impacts of (1) total demand, (2) demand volatility, (3) rental

duration (i.e., fraction of products returned per period), and (4) uniformity of return locations. We

again use N = 1 and set ρ = 0.95. Due to the large number of MDP instances that we need

to solve, we only consider n = 5 locations, creating a set of 10-dimensional MDPs. Let ν̃i = 0.2

for i = 1, 2, . . . , 5 and we set the mean demand at each location to be νi = αν ν̃i for some scaling parameter αν, so that Σi νi = αν. Similar to before, we set σi = ασ νi for another scaling parameter

ασ . The repositioning costs cij are illustrated in Figure 3: the cost between adjacent locations is 1

and the cost between non-adjacent locations (e.g., 1 and 3 or 5 and 2) is 2. The return behavior is



controlled by two parameters αp , the fraction of rented products returned (thus, 1−αp is the fraction

of rented products that remain rented), and αq , which we interpret as the “return uniformity.” For

αq = 1, returns are split evenly between the 5 locations and for αq = 0, products are returned

to their rental origins. These two parameters are also illustrated in Figure 3 for location 1. We

vary each of the scaling parameters αν , ασ , αp , and αq individually; their nominal values and test

ranges are summarized in Table 2. In all solved instances, we use a maximum of 1,000 cuts in the

approximation while the ADP algorithm is run for 10,000 iterations; all other algorithmic settings

are as described in Section 7.4. The results are given in Figure 4.
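For reference, the cost matrix and the return-fraction matrix induced by (αp, αq) on this five-location ring can be constructed as follows (a sketch of ours, consistent with Figure 3 and the description above):

    import numpy as np

    def build_instance(alpha_p, alpha_q, n=5):
        # ring distances: adjacent links cost 1, all non-adjacent pairs cost 2
        c = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i != j:
                    hops = min(abs(i - j), n - abs(i - j))
                    c[i, j] = 1.0 if hops == 1 else 2.0
        # returns: stay rented w.p. 1 - alpha_p; returned units go to the origin
        # w.p. 1 - (n-1)*alpha_q/n and to each other location w.p. alpha_q/n
        P = alpha_p * (alpha_q / n) * np.ones((n, n))
        np.fill_diagonal(P, alpha_p * (1.0 - (n - 1) * alpha_q / n))
        return c, P

Note that every row of P sums to αp; setting αq = 1 splits returns evenly across the five locations, while αq = 0 sends every returned unit back to its rental origin, matching the two limiting cases described above.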

Figure 4: Impact of Parameters (Left-Axis & Bar: % Improvement of ADP; Right-Axis & Line: Raw Cost)

There are a few key takeaways from these experiments, which we now summarize. We see that

when the mean demand αν in the system is high (greater than 45% of the total inventory), the

performance of the myopic policy essentially matches that of the ADP policy. This can be explained

by the observation that lost sales in high demand systems are somewhat inevitable, so the impact

of considering return behavior and future cost is diminished. On the other hand, when demand

is between 10% and 40% of the total inventory, a substantial improvement of between 7%–40%

beyond the myopic performance is observed. The largest improvement, 40%, is seen for αν = 0.2.



Demand volatility ασ has a strong influence on cost for all three policies, with the cost of the ADP policy growing roughly tenfold, from 0.54 at ασ = 0.5 to 5.90 at ασ = 1.5. We also observe

that the gap between the ADP and myopic policies shrinks for higher volatility systems.

The latter two plots are related to the return behavior parameters αp and αq . Although the cost

decreases when the return fraction αp increases (as there is more inventory with which to satisfy

demand), the improvement upon the myopic policy increases. Intuitively, given more available

inventory due to fewer ongoing rentals, the ADP policy has more “opportunities” to reposition and

plan for future periods. We also see that return uniformity αq tends to increase the cost under the

ADP and myopic policies, but interestingly, if no-repositioning is used, the cost is actually reduced

as αq increases in the range of [0, 0.6]. This can perhaps be explained by the “natural repositioning”

induced by the return behaviors, an effect that disappears when active repositioning (i.e., ADP or

myopic) is allowed. Similar to the case of demand volatility, the gap between the ADP and myopic

policy decreases as uniformity increases.

7.6 Discussion

With free-floating car sharing systems like Car2Go operating in cities like Berlin, Chicago, Vancouver, and Seattle, the number of locations (which might be aggregated by, e.g., zip or postal code) in the model can become quite large. We describe two aggregation-based approaches that can be used

to integrate our model and methodology into problems that are of potentially very large scale.

1. The first is to simply segment the service region into subregions of manageable size and

solve independent MDPs in each region. To justify such an approximation, we note that

high repositioning costs for locations that are geographically far from one another encourage

repositioning to occur between nearby stations. However, one drawback is that repositioning

would not be able to occur across borders of the subregions, even if two locations are close.

This could be, in part, addressed by experimenting with different boundaries for regions.

2. The second possible solution is hierarchical aggregation, where nearby locations are grouped

with varying levels of granularity. Our approach can be directly applied to the most coarsely

aggregated system. With the coarse repositioning solution fixed (a heuristic could be used to

distribute/extract inventory amongst individual locations so that the repositioning between



aggregate states is realized), our approach can then be repeated within each aggregate state

at a finer level of granularity.

Both of these approaches are interesting avenues for future research.

8 Conclusion

In this paper, we consider the problem of optimal repositioning of inventory in a product rental net-

work with multiple locations and where demand, rental periods, and return locations are stochastic.

We show that the optimal policy is specified in terms of a region in the state space, inside of which

it is optimal not to carry out any repositioning and outside of which it is optimal to reposition

inventory. We also prove that when repositioning, it is always optimal to do so such that the system

moves to a new state that is on the boundary of the no-repositioning region and provide a simple

check for when a state is in the no-repositioning region. We then propose a provably convergent

approximate dynamic programming algorithm, Repositioning-ADP, that builds a lower approxi-

mation of the convex value function by iteratively adding hyperplanes. Numerical experiments on

problems with up to 20 dimensions support the effectiveness of the algorithmic approach.



References

S. Banerjee, D. Freund, and T. Lykouris. Pricing and optimization in shared vehicle systems: An
approximation framework. arXiv preprint arXiv:1608.06819, 2017.

D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont,


MA, 1996.

J. R. Birge and G. Zhao. Successive linear approximation solution of infinite-horizon dynamic


stochastic programs. SIAM Journal on Optimization, 18(4):1165–1186, 2007.

A. Braverman, J. G. Dai, X. Liu, and L. Ying. Empty-car routing in ridesharing systems. arXiv
preprint arXiv:1609.07219, 2018.

Z.-L. Chen and W. B. Powell. Convergent cutting-plane and partial-sampling algorithm for multi-
stage stochastic linear programs with recourse. Journal of Optimization Theory and Applications,
102(3):497–524, 1999.

H. Chung, D. Freund, and D. B. Shmoys. Bike Angels: An analysis of Citi Bike’s incentive
program. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable
Societies. ACM, 2018.

D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and
temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589–608,
2000.

D. Freund, A. Norouzi-Fard, A. Paul, S. G. Henderson, and D. B. Shmoys. Data-driven rebalancing


methods for bike-share systems. Technical report, 2016.

D. Freund, S. G. Henderson, and D. B. Shmoys. Minimizing multimodular functions and allocat-


ing capacity in bike-sharing systems. In International Conference on Integer Programming and
Combinatorial Optimization, pages 186–198. Springer, 2017.

D. Freund, S. G. Henderson, and D. B. Shmoys. Bike sharing. In M. Hu, editor, Sharing Economy:
Making Supply Meet Demand. Springer, 2019.

C. Fricker and N. Gast. Incentives and redistribution in homogeneous bike-sharing systems with
stations of finite capacity. EURO Journal on Transportation and Logistics, 5(3):261–291, 2016.



D. K. George and C. H. Xia. Fleet-sizing and service availability for a vehicle rental system via
closed queueing networks. European Journal of Operational Research, 211:198–207, 2011.

S. Ghosh, P. Varakantham, Y. Adulyasak, and P. Jaillet. Dynamic repositioning to reduce lost


demand in bike sharing systems. Journal of Artificial Intelligence Research, 58:387–430, 2017.

P. Girardeau, V. Leclere, and A. B. Philpott. On the convergence of decomposition methods for


multistage stochastic convex programs. Mathematics of Operations Research, 40(1):130–145,
2014.

G. A. Godfrey and W. B. Powell. An adaptive, distribution-free algorithm for the newsvendor


problem with censored demands, with applications to inventory and distribution. Management
Science, 47(8):1101–1112, 2001.

G. A. Godfrey and W. B. Powell. An adaptive dynamic programming algorithm for dynamic fleet
management, i: Single period travel times. Transportation Science, 36(1):21–39, 2002.

S. Habibi, F. Sprei, C. Englund, S. Pettersson, A. Voronov, J. Wedlin, and H. Engdahl. Compar-


ison of free-floating car sharing services in cities. European Council of Energy Efficient Economy
Summer Study, pages 771–778, 2016.

L. He, H.-Y. Mak, Y. Rong, and Z.-J. M. Shen. Service region design for urban electric vehicle
sharing systems. Manufacturing & Service Operations Management, 19(2):309–327, 2017.

L. He, Z. Hu, and M. Zhang. Robust repositioning for vehicle sharing. Manufacturing & Service
Operations Management (Forthcoming), 2018.

L. He, H.-Y. Mak, and Y. Rong. Operations management of vehicle sharing systems. In M. Hu,
editor, Sharing Economy: Making Supply Meet Demand. Springer, 2019.

J. L. Higle and S. Sen. Stochastic decomposition: An algorithm for two-stage linear programs with
recourse. Mathematics of Operations Research, 16(3):650–669, 1991.

N. Jian, D. Freund, H. M. Wiberg, and S. G. Henderson. Simulation optimization for a large-scale


bike-sharing system. In Proceedings of the 2016 Winter Simulation Conference, pages 602–613.
IEEE Press, 2016.

A. Kabra, E. Belavina, and K. Girotra. Bike-share systems: Accessibility and availability. Chicago
Booth Research Paper No. 15-04. Available at SSRN: https://ssrn.com/abstract=2555671, 2018.



M. Kaspi, T. Raviv, and M. Tzur. Bike-sharing systems: User dissatisfaction in the presence of
unusable bicycles. IISE Transactions, 49(2):144–158, 2017.

H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications,
volume 35. Springer, 2003.

C.-Y. Lee and Q. Meng. Handbook of Ocean Container Transport Logistics. Springer, 2015.

J. Li, C. S. Leung, Y. Wu, and K. Liu. Allocation of empty containers between multi-ports.
European Journal of Operational Research, 182(1):400–412, 2007.

Y. Li, Y. Zheng, and Q. Yang. Dynamic bike reposition: A spatio-temporal reinforcement learning
approach. In Proceedings of the 24th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining. ACM, 2018.

L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching.
Machine Learning, 8(3-4):293–321, 1992.

K. Linowsky and A. B. Philpott. On the convergence of sampling-based decomposition algorithms


for multistage stochastic programs. Journal of Optimization Theory and Applications, 125(2):
349–366, 2005.

J. Liu, L. Sun, W. Chen, and H. Xiong. Rebalancing bike sharing systems: A multi-source data
smart optimization. In Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Dis-
covery and Data Mining, pages 1005–1014. ACM, 2016.

M. Lu, Z. Chen, and S. Shen. Optimizing the profitability and quality of service in carshare systems
under demand uncertainty. Manufacturing & Service Operations Management, 20(2):162–180,
2017.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Ried-


miller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement
learning. Nature, 518(7540):529–533, 2015.

R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine
Learning Research, 9(May):815–857, 2008.

R. Nair and E. Miller-Hooks. Fleet management for vehicle sharing operations. Transportation
Science, 45(4):524–540, 2011.



J. M. Nascimento and W. B. Powell. An optimal approximate dynamic programming algorithm for
the lagged asset acquisition problem. Mathematics of Operations Research, 34(1):210–237, 2009.

J. M. Nascimento and W. B. Powell. Dynamic programming models and algorithms for the mutual
fund cash balance problem. Management Science, 56(5):801–815, 2010.

E. O’Mahony and D. B. Shmoys. Data analysis and optimization for (citi) bike sharing. In AAAI,
pages 687–694, 2015.

M. V. F. Pereira and L. M. V. G. Pinto. Multi-stage stochastic optimization applied to energy


planning. Mathematical Programming, 52:359–375, 1991.

A. B. Philpott and Z. Guan. On the convergence of stochastic dual dynamic programming and
related methods. Operations Research Letters, 36(4):450–455, 2008.

E. Porteus. Conditions for characterizing the structure of optimal strategies in infinite-horizon


dynamic programs. Journal of Optimization Theory and Applications, 36(3):419–432, 1982.

E. L. Porteus. On the optimality of structured policies in countable stage decision processes.


Management Science, 22(2):148–157, 1975.

W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John


Wiley & Sons, 2nd edition, 2007.

W. B. Powell and T. A. Carvalho. Dynamic control of logistics queueing network for large-scale
fleet management. Transportation Science, 32(2):90–109, 1998.

W. B. Powell, A. Ruszczyński, and H. Topaloglu. Learning algorithms for separable approximations


of discrete stochastic optimization problems. Mathematics of Operations Research, 29(4):814–836,
2004.

M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John


Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.

T. Raviv and O. Kolka. Optimal inventory management of a bike-sharing station. IIE Transactions,
45(10):1077–1093, 2013.

R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1st edition,
1970.



J. Schuijbroek, R. C. Hampshire, and W.-J. Van Hoeve. Inventory rebalancing and vehicle routing
in bike sharing systems. European Journal of Operational Research, 257(3):992–1004, 2017.

A. Shapiro. Analysis of stochastic dual dynamic programming method. European Journal of


Operational Research, 209(1):63–72, 2011.

J. Shu, M. C. Chou, Q. Liu, C.-P. Teo, and I.-L. Wang. Models for effective deployment and
redistribution of bicycles within public bicycle-sharing systems. Operations Research, 61(6):
1346–1359, 2013.

C. Shui and W. Szeto. Dynamic green bike repositioning problem–a hybrid rolling horizon artificial
bee colony algorithm approach. Transportation Research Part D: Transport and Environment,
60:119–136, 2018.

D. P. Song. Optimal threshold control of empty vehicle redistribution in two depot service systems.
IEEE Transactions on Automatic Control, 50(1):87–90, 2005.

H. Topaloglu and W. B. Powell. Dynamic-programming approximations for stochastic time-staged


integer multicommodity-flow problems. INFORMS Journal on Computing, 18(1):31–42, 2006.

J. Warrington, P. N. Beuchat, and J. Lygeros. Generalized dual dynamic programming for infinite
horizon problems in continuous state and action spaces. arXiv preprint arXiv:1711.07222, 2018.



A Appendix

A.1 Proof of Theorem 9

In this subsection, we provide a complete and self-contained proof for our main result, Theorem 9.

We first provide some technical preparations.

Lemma A.1. If u(x, γ) is jointly convex in x and γ, then u′(x, γ; z, η) satisfies the following properties:

• (Positive Homogeneity) u′(x, γ; tz, tη) = t u′(x, γ; z, η) for t ≥ 0;

• (Sub-Additivity) u′(x, γ; z1 + z2, η1 + η2) ≤ u′(x, γ; z1, η1) + u′(x, γ; z2, η2).

Proof. These are well-known properties of directional derivatives of convex functions (see, for example,

Rockafellar [1970]).

Lemma A.2. A function f : (a, b) → R is convex if and only if it is continuous with increasing

left- or right-derivative.

Proof. The “only if” part is clear. For the “if” part, we assume f is continuous with increasing right-derivative, as the proof for the left-derivative is similar. It is common knowledge that a function on an open set in Rn is convex if and only if there exists a subgradient at every point. So, it suffices to show that f′(x+) is a subgradient at x for every x ∈ (a, b). Let gx(y) = f′(x+)(y − x) + f(x). We need to show that f(y) ≥ gx(y) for y ∈ (a, b).

We first show that f(y) ≥ gx(y) if f′(x+) is strictly increasing. To show this, let hx(y) = gx(y) − ε for some ε > 0. We claim that f(y) ≥ hx(y). Suppose this is not true. Then there exists z ∈ (a, b)

such that (f − hx)(z) < 0. If z > x, let c = sup{d ≥ x | (f − hx)(y) ≥ 0 for y ∈ [x, d]}. Note that, by continuity, x < c < z and (f − hx)(c) = 0; and, by construction, for any d > c there exists y ∈ (c, d) such that (f − hx)(y) < 0. So, there exists a decreasing sequence yn such that yn → c and (f − hx)(yn) < 0. It follows that (f − hx)′(c+) = f′(c+) − f′(x+) ≤ 0. This contradicts the assumption that f′(·) is strictly increasing. On the other hand, if z < x, let c = inf{d ≤ x | (f − hx)(y) ≥ 0 for y ∈ [d, x]}. Then z < c < x, (f − hx)(c) = 0, and there exists a decreasing sequence yn such that yn → c and (f − hx)(yn) ≥ 0. Therefore, (f − hx)′(c+) = f′(c+) − f′(x+) ≥ 0.



This again contradicts the assumption that f′(·) is strictly increasing. So, we conclude f(y) ≥ hx(y). As ε can be arbitrarily small, we must have f(y) ≥ gx(y).

Now, suppose f′(x+) is increasing, and let h(x) = f(x) + (ε/2)x² for some ε > 0. Then h′(x+) is strictly increasing. By the first part of the proof, h(y) = f(y) + (ε/2)y² ≥ h′(x+)(y − x) + h(x) = (f′(x+) + εx)(y − x) + f(x) + (ε/2)x² = gx(y) + εx(y − x) + (ε/2)x². Letting ε → 0 on both sides, we have f(y) ≥ gx(y). The proof is complete.

Lemma A.3. Suppose E ⊂ Rn is convex, f : E → R is continuous, the polyhedra E1, E2, . . . , Em form a partition of E, and f is convex on each of E1, E2, . . . , Em. Then f is convex if and only if −f′(x; −z) ≤ f′(x; z) for all x ∈ E and z ∈ Rn.

Proof. The “only if” part is always true for a convex function on Rn. For the “if” part, note that g : (a, b) → R is convex iff it is continuous with increasing left- or right-derivative (for a proof, see Lemma A.2). It follows that if g : [a, b] → R is continuous and piecewise convex on [a0, a1], [a1, a2], . . ., [am−1, am], where a = a0 < · · · < am = b, then to show g is convex, we only need to show that g′(x−) ≤ g′(x+) for x ∈ [a, b]. To apply this argument to f, note that f is convex if φ(s) = f(y + sz) is convex as a function of s for each y ∈ E and z ∈ Rn. As E is convex, the domain of φ(·) is an interval J ⊂ R. As s varies in J, y + sz intersects with E1, E2, . . . , Em for s in (possibly empty) intervals J1, J2, . . . , Jm, respectively. As E1, E2, . . . , Em forms a partition of E, J1, J2, . . . , Jm forms a partition of J. It follows that φ(s) is piecewise convex. Therefore, φ(s) is convex if φ′(s−) ≤ φ′(s+) for s ∈ J. Set x = y + sz; then φ′(s−) = −f′(x; −z) and φ′(s+) = f′(x; z). It follows that −f′(x; −z) ≤ f′(x; z) implies f is convex.

In the next lemma, we decompose the directional derivative of Ut,d,P(·).

Lemma A.4. For y ∈ Rn, we let J−(y) = {i | yi < 0}, J0(y) = {i | yi = 0}, J+(y) = {i | yi > 0}, J0+(y) = {i | yi ≥ 0} and J0−(y) = {i | yi ≤ 0}. For any realization (d, P), we have

U′t,d,P(y, γ; z, η) = −β Σ_{i ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} zi + ρ v′t+1(x, ζ; w+, δ+),     (31)

where ιi = Σ_{j=1}^n ηj pji ∀ i,

w+i = zi + ιi + Σ_{j ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} zj pji   for i ∈ J+(y−d) ∪ (J0(y−d) ∩ J+(z)),
w+i = ιi + Σ_{j ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} zj pji        for i ∈ J−(y−d) ∪ (J0(y−d) ∩ J0−(z)),

and

δ+i = ηi (1 − Σ_{j=1}^n pij)          for i ∈ J+(y−d) ∪ (J0(y−d) ∩ J+(z)),
δ+i = (ηi + zi)(1 − Σ_{j=1}^n pij)    for i ∈ J−(y−d) ∪ (J0(y−d) ∩ J0−(z)),

and

x = τx(y, γ, d, P);  ζ = τγ(y, γ, d, P).

Proof. Let

ϑi = Σ_{j=1}^n γj pji,   ιi = Σ_{j=1}^n ηj pji.

Note that

L(y, d) = β Σ_{i ∈ J−(y−d)} (di − yi),     (32)

and let the next state, under (d, P), be defined by x(y, γ), ζ(y, γ), with components

xi(y, γ) = (yi − di) + ϑi + Σ_{j∈J+(y−d)} dj pji + Σ_{j∈J0−(y−d)} yj pji   for i ∈ J+(y−d),
xi(y, γ) = ϑi + Σ_{j∈J+(y−d)} dj pji + Σ_{j∈J0−(y−d)} yj pji              for i ∈ J0−(y−d),     (33)

ζi(y, γ) = (γi + di)(1 − Σ_{j=1}^n pij)   for i ∈ J+(y−d),
ζi(y, γ) = (γi + yi)(1 − Σ_{j=1}^n pij)   for i ∈ J0−(y−d).     (34)

Choose t small enough so that the following hold:



J−(y + tz − d) = J−(y − d) ∪ (J0(y − d) ∩ J−(z)),
J0(y + tz − d) = J0(y − d) ∩ J0(z),
J+(y + tz − d) = J+(y − d) ∪ (J0(y − d) ∩ J+(z)),     (35)
J0−(y + tz − d) = J−(y − d) ∪ (J0(y − d) ∩ J0−(z)),

where the last equation follows directly from the first and second. For y + tz, we have, directly by (32), that

L(y + tz, d) = β Σ_{i ∈ J−(y+tz−d)} (di − yi − tzi),

and directly from (33) and (34),

xi(y + tz, γ + tη) = yi + tzi − di + ϑi + tιi + Σ_{j∈J+(y+tz−d)} dj pji + Σ_{j∈J0−(y+tz−d)} (yj + tzj) pji   for i ∈ J+(y + tz − d),
xi(y + tz, γ + tη) = ϑi + tιi + Σ_{j∈J+(y+tz−d)} dj pji + Σ_{j∈J0−(y+tz−d)} (yj + tzj) pji                  for i ∈ J0−(y + tz − d),

and

ζi(y + tz, γ + tη) = (γi + tηi + di)(1 − Σ_{j=1}^n pij)          for i ∈ J+(y + tz − d),
ζi(y + tz, γ + tη) = (γi + tηi + yi + tzi)(1 − Σ_{j=1}^n pij)    for i ∈ J0−(y + tz − d).

It follows by (35) that

L(y + tz, d) − L(y, d) = −βt Σ_{i ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} zi.

For the state update equations, we have

xi(y + tz, γ + tη) − xi(y, γ) = tzi + tιi + Σ_{j ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} tzj pji   for i ∈ J+(y−d) ∪ (J0(y−d) ∩ J+(z)),
xi(y + tz, γ + tη) − xi(y, γ) = tιi + Σ_{j ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} tzj pji        for i ∈ J−(y−d) ∪ (J0(y−d) ∩ J0−(z)),

and

ζi(y + tz, γ + tη) − ζi(y, γ) = tηi (1 − Σ_{j=1}^n pij)          for i ∈ J+(y−d) ∪ (J0(y−d) ∩ J+(z)),
ζi(y + tz, γ + tη) − ζi(y, γ) = t(ηi + zi)(1 − Σ_{j=1}^n pij)    for i ∈ J−(y−d) ∪ (J0(y−d) ∩ J0−(z)).

Set

w+ = (x(y + tz, γ + tη) − x(y, γ)) / t,   δ+ = (ζ(y + tz, γ + tη) − ζ(y, γ)) / t.

Then

lim_{t→0} [ vt+1(x(y + tz, γ + tη), ζ(y + tz, γ + tη)) − vt+1(x(y, γ), ζ(y, γ)) ] / t = v′t+1(x, ζ; w+, δ+).

It follows that

U′t,d,P(y, γ; z, η) = −β Σ_{i ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} zi + ρ v′t+1(x, ζ; w+, δ+).     (36)

This concludes the proof.

Next we show that, given the convexity of vt+1 (xt+1 , γ t+1 ) and certain bounds on its directional

derivatives, the function ut (y t , γ t ) not only is convex (Proposition A.5), but also satisfies two types

of bounds on its directional derivatives (Proposition A.6).

Proposition A.5. Suppose Assumptions 1 and 2 hold and

1. vt+1(xt+1, γt+1) is continuous and jointly convex in xt+1 and γt+1,

2. v′t+1(xt+1, γt+1; z − η, η) ≤ C(−z) + (β + ρcmax − cmin) Σ_{i=1}^n ηi for any feasible direction (z − η, η) with η ≥ 0.

Then ut(yt, γt) defined in (8) is continuous and jointly convex in yt and γt.



Proof. We omit the subscript t to reduce notation. Note that we use γ to represent γt, while ζ(y, γ) represents γt+1(y, γ). The continuity of u(·) follows from the Dominated Convergence Theorem, as Ud,P(y, γ) ≤ β Σi di + ρ ‖v‖∞. To conclude that Ud,P(·) is convex, from Lemma A.3, it remains to be shown that

−U′d,P(y, γ; −z, −η) ≤ U′d,P(y, γ; z, η)   ∀ (y, γ) ∈ ∆, (z, η) ∈ R2n with eT(z + η) = 0.

We first derive U′d,P(y, γ; z, η) and U′d,P(y, γ; −z, −η) for (y, γ) ∈ ∆◦, so every direction is feasible.

It follows from Lemma A.4 that

U′d,P(y, γ; z, η) = −β Σ_{i ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} zi + ρ v′(x, γ; w+, δ+),     (37)

and

U′d,P(y, γ; −z, −η) = β Σ_{i ∈ J−(y−d) ∪ (J0(y−d) ∩ J+(z))} zi + ρ v′(x, γ; w−, δ−),

where

w+i = zi + ιi + Σ_{j ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} zj pji    for i ∈ J+(y−d) ∪ (J0(y−d) ∩ J+(z)),
w+i = ιi + Σ_{j ∈ J−(y−d) ∪ (J0(y−d) ∩ J−(z))} zj pji         for i ∈ J−(y−d) ∪ (J0(y−d) ∩ J0−(z)),

w−i = −zi − ιi − Σ_{j ∈ J−(y−d) ∪ (J0(y−d) ∩ J+(z))} zj pji   for i ∈ J+(y−d) ∪ (J0(y−d) ∩ J−(z)),
w−i = −ιi − Σ_{j ∈ J−(y−d) ∪ (J0(y−d) ∩ J+(z))} zj pji        for i ∈ J−(y−d) ∪ (J0(y−d) ∩ J0+(z)),

δ+i = ηi (1 − Σ_{j=1}^n pij)          for i ∈ J+(y−d) ∪ (J0(y−d) ∩ J+(z)),
δ+i = (ηi + zi)(1 − Σ_{j=1}^n pij)    for i ∈ J−(y−d) ∪ (J0(y−d) ∩ J0−(z)),

and

δ−i = −ηi (1 − Σ_{j=1}^n pij)         for i ∈ J+(y−d) ∪ (J0(y−d) ∩ J−(z)),
δ−i = −(ηi + zi)(1 − Σ_{j=1}^n pij)   for i ∈ J−(y−d) ∪ (J0(y−d) ∩ J0+(z)).

Therefore,

w+i + w−i = zi + Σ_{j∈J0(y−d)∩J−(z)} zj pji − Σ_{j∈J0(y−d)∩J+(z)} zj pji    for i ∈ J0(y−d) ∩ J+(z),
w+i + w−i = −zi + Σ_{j∈J0(y−d)∩J−(z)} zj pji − Σ_{j∈J0(y−d)∩J+(z)} zj pji   for i ∈ J0(y−d) ∩ J−(z),
w+i + w−i = Σ_{j∈J0(y−d)∩J−(z)} zj pji − Σ_{j∈J0(y−d)∩J+(z)} zj pji          otherwise,

and

δ+i + δ−i = −zi (1 − Σ_{j=1}^n pij)   for i ∈ J0(y−d) ∩ J+(z),
δ+i + δ−i = zi (1 − Σ_{j=1}^n pij)    for i ∈ J0(y−d) ∩ J−(z),
δ+i + δ−i = 0                          otherwise.

We now define

wIi = Σ_{j=1}^n zi pij + Σ_{j∈J0(y−d)∩J−(z)} zj pji − Σ_{j∈J0(y−d)∩J+(z)} zj pji     for i ∈ J0(y−d) ∩ J+(z),
wIi = −Σ_{j=1}^n zi pij + Σ_{j∈J0(y−d)∩J−(z)} zj pji − Σ_{j∈J0(y−d)∩J+(z)} zj pji    for i ∈ J0(y−d) ∩ J−(z),
wIi = Σ_{j∈J0(y−d)∩J−(z)} zj pji − Σ_{j∈J0(y−d)∩J+(z)} zj pji                        otherwise,

and

wOi = zi (1 − Σ_{j=1}^n pij)    for i ∈ J0(y−d) ∩ J+(z),
wOi = −zi (1 − Σ_{j=1}^n pij)   for i ∈ J0(y−d) ∩ J−(z),
wOi = 0                          otherwise.

It is not hard to verify that w+ + w− = wI + wO and wO = −(δ + + δ − ), with wO ≥ 0. Therefore,



we have:

U′d,P(y, γ; z, η) + U′d,P(y, γ; −z, −η)
= β Σ_{i∈J0(y−d)} |zi| + ρ v′(x, γ; w+, δ+) + ρ v′(x, γ; w−, δ−)
≥ β Σ_{i∈J0(y−d)} |zi| + ρ v′(x, γ; w+ + w−, δ+ + δ−)   (by subadditivity, Lemma A.1)
= β Σ_{i∈J0(y−d)} |zi| + ρ v′(x, γ; wI + wO, −wO)
≥ β Σ_{i∈J0(y−d)} |zi| − ρ v′(x, γ; −wI − wO, wO)   (by the convexity of v, Lemma A.3)
≥ β Σ_{i∈J0(y−d)} |zi| − ρ ( (cmax/2) Σ_{i=1}^n |wIi| + (β + ρcmax − cmin) Σ_{i=1}^n |wOi| )   (by Lemma A.8 and Lemma 3)
≥ β Σ_{i∈J0(y−d)} |zi| − ρcmax p Σ_{i∈J0(y−d)} |zi| − ρ(β + ρcmax − cmin)(1 − p) Σ_{i∈J0(y−d)} |zi|   (by the triangle inequality)
= β Σ_{i∈J0(y−d)} |zi| − (pρcmax + ρ(β + ρcmax − cmin)(1 − p)) Σ_{i∈J0(y−d)} |zi|
≥ β Σ_{i∈J0(y−d)} |zi| − (pρcmax + (β + ρcmax − cmin)(1 − p)) Σ_{i∈J0(y−d)} |zi|   (since ρ ≤ 1)
= β Σ_{i∈J0(y−d)} |zi| − (β + (ρcmax − cmin) − p(β − cmin)) Σ_{i∈J0(y−d)} |zi|
≥ β Σ_{i∈J0(y−d)} |zi| − β Σ_{i∈J0(y−d)} |zi|   (by Assumption 2)
≥ 0.

Therefore, Ud,P(·) is convex on ∆◦. Since ∆ is locally simplicial, the continuous extension of Ud,P(·) from ∆◦ to ∆ must be convex (see, for example, Rockafellar [1970], Theorem 10.3). Thus u(·) is

convex.

Proposition A.6. Suppose Assumptions 1 and 2 hold and

1. vt+1(xt+1, γt+1) is continuous and jointly convex in xt+1 and γt+1,

2. v′t+1(xt+1, γt+1; z − η, η) ≤ C(−z) + (β + ρcmax − cmin) Σ_{i=1}^n ηi for any feasible direction (z − η, η) with η ≥ 0,

3. v′t+1(xt+1, γt+1; z + η, −η) ≤ C(−z) + β Σ_{i=1}^n ηi for any feasible direction (z + η, −η) with η ≥ 0,

4. v′t+1(xt+1, γt+1; 0, z) ≤ (ρcmax/2) Σ_{i=1}^n |zi| for any feasible direction (0, z) with eT z = 0,

then we have:

1. |u′t(yt, γt; ∓ξ, ±ξ)| ≤ β Σ_{i=1}^n ξi for all (yt, γt) ∈ ∆ and any feasible direction (∓ξ, ±ξ) with ξ ≥ 0;

2. u′t(yt, γt; 0, z) ≤ (ρcmax/2) Σ_{i=1}^n |zi| for all (yt, γt) ∈ ∆ and any feasible direction (0, z) with eT z = 0.

Proof. We omit the subscript t to reduce notation. To show the first inequality, we start by showing that u′(y, γ; −ξ, ξ) ≤ β Σi ξi. From Lemma A.4, noting that −ξ ≤ 0, we have

U′d,P(y, γ; −ξ, ξ) = β Σ_{i ∈ J−(y−d) ∪ J0(y−d)} ξi + ρ v′(x, γ; w − δ, δ),

where

wi = −ξi Σ_{j=1}^n pij + Σ_{j∈J+(y−d)} ξj pji   for i ∈ J+(y − d),
wi = Σ_{j∈J+(y−d)} ξj pji                        for i ∈ J−(y − d) ∪ J0(y − d),

and

δi = ξi (1 − Σ_{j=1}^n pij) = ξi (1 − p)   for i ∈ J+(y − d),
δi = 0                                      for i ∈ J−(y − d) ∪ J0(y − d).

Note that eT w = 0 and so it is clear that (w − δ, δ) is a feasible direction at (x, ζ). It follows that

U′d,P(y, γ; −ξ, ξ)
= β Σ_{i∈J−(y−d)∪J0(y−d)} ξi + ρ v′(x, γ; w − δ, δ)
≤ β Σ_{i∈J−(y−d)∪J0(y−d)} |ξi| + ρ C(−w) + ρ(β + ρcmax − cmin) Σ_{i=1}^n |δi|   (by assumption)
≤ β Σ_{i∈J−(y−d)∪J0(y−d)} |ξi| + (ρcmax/2) Σ_{i=1}^n |wi| + ρ(β + ρcmax − cmin) Σ_{i=1}^n |δi|   (by Lemma 3)
≤ β Σ_{i∈J−(y−d)∪J0(y−d)} |ξi| + (2ρcmax/2) Σ_{i=1}^n Σ_{j∈J+(y−d)} |ξj| pji + ρ(β + ρcmax − cmin)(1 − p) Σ_{i∈J+(y−d)} |ξi|   (by the triangle inequality)
= β Σ_{i∈J−(y−d)∪J0(y−d)} |ξi| + (pρcmax + ρ(β + ρcmax − cmin)(1 − p)) Σ_{j∈J+(y−d)} |ξj|
≤ β Σ_{i∈J−(y−d)∪J0(y−d)} |ξi| + (pρcmax + (β + ρcmax − cmin)(1 − p)) Σ_{j∈J+(y−d)} |ξj|
≤ β Σ_{i∈J−(y−d)∪J0(y−d)} |ξi| + (β + (ρcmax − cmin) − p(β − cmin)) Σ_{j∈J+(y−d)} |ξj|
≤ β Σ_{i∈J−(y−d)∪J0(y−d)} |ξi| + β Σ_{i∈J+(y−d)} |ξi|   (by Assumption 2)
≤ β Σ_{i=1}^n |ξi|.

So, U′d,P(y, γ; −ξ, ξ) ≤ β Σ_{i=1}^n |ξi| = β Σ_{i=1}^n ξi holds for each (y, γ) ∈ ∆ and every feasible direction (−ξ, ξ). It follows that u′(y, γ; −ξ, ξ) = ∫ U′d,P(y, γ; −ξ, ξ) dµ ≤ β Σi ξi. From Proposition A.5, u(y, γ) is convex, thus

u′(y, γ; ξ, −ξ) ≥ −u′(y, γ; −ξ, ξ) ≥ −β Σi ξi.
Pn
Now we show that u0 (y, γ; ξ, −ξ) ≤ β i=1 ξi for all (y, γ) ∈ ∆ for all feasible direction (ξ, −ξ)

with ξ ≥ 0. From the previous analysis, we have

X
0
Ud,p (y, γ; ξ, −ξ) = −β ξi + ρv 0 (x, ζ; w + δ, −δ)
i∈J − (y−d)

where


 P P
 ξi nj=1 pij − j∈J + (y−d)∪J 0 (y−d) ξj pji for i ∈ J + (y − d) ∪ J 0 (y − d),
wi =

 P
 − j∈J + (y−d)∪J 0 (y−d) ξj pji for i ∈ J − (y − d),

52

Electronic copy available at: https://ssrn.com/abstract=2942921




 P
 ξi (1 − nj=1 pij ) for i ∈ J + (y − d) ∪ J 0 (y − d),
δi =


 0 for i ∈ J − (y − d).

Clearly, (w + δ, −δ) is a feasible direction at (x, ζ). It follows that

U′d,P(y, γ; ξ, −ξ)
= −β Σ_{i∈J−(y−d)} ξi + ρ v′(x, γ; w + δ, −δ)
≤ β Σ_{i∈J−(y−d)} |ξi| + (ρcmax/2) Σ_{i=1}^n |wi| + ρβ Σ_{i=1}^n |δi|   (from the third inequality of the assumption)
≤ β Σ_{i∈J−(y−d)} |ξi| + (ρβ/2) Σ_{i=1}^n |wi| + ρβ Σ_{i=1}^n |δi|   (since ρcmax ≤ β, ρ ≤ 1)
≤ β Σ_{i∈J−(y−d)} |ξi| + (ρβ/2) · 2 Σ_{i=1}^n Σ_{j∈J+(y−d)∪J0(y−d)} |ξj| pji + ρβ Σ_{i∈J+(y−d)∪J0(y−d)} |ξi| (1 − Σ_{j=1}^n pij)   (by the triangle inequality)
≤ β Σ_{i∈J−(y−d)} |ξi| + ρβ Σ_{i∈J+(y−d)∪J0(y−d)} |ξi| ≤ β Σ_{i=1}^n |ξi|.

So, U′d,P(y, γ; ξ, −ξ) ≤ β Σ_{i=1}^n |ξi| = β Σ_{i=1}^n ξi holds for each (y, γ) ∈ ∆ and every feasible direction (ξ, −ξ). It follows that u′(y, γ; ξ, −ξ) = ∫ U′d,P(y, γ; ξ, −ξ) dµ ≤ β Σi ξi. From Proposition A.5, u(y, γ) is convex, thus

u′(y, γ; −ξ, ξ) ≥ −u′(y, γ; ξ, −ξ) ≥ −β Σi ξi.

Summing up all the above, we have shown the first inequality.

Now we show the second inequality. From Lemma A.4, we have

U′d,P(y, γ; 0, z) = ρ v′(x, γ; ι, z(1 − p)),

where ιi = Σ_{j=1}^n zj pji ∀ i and p = Σ_{j=1}^n pij (Assumption 1). Therefore,

Σ_{i=1}^n ιi = Σ_{j=1}^n Σ_{i=1}^n zj pji = Σ_{j=1}^n p zj = 0,


and eT z(1 − p) = 0. It follows that

U′d,P(y, γ; 0, z) = ρ v′(x, γ; ι, z(1 − p))
≤ ρ v′(x, γ; ι, 0) + ρ v′(x, γ; 0, z(1 − p))   (by subadditivity, Lemma A.1)
≤ ρ C(−ι) + (ρ²cmax/2) Σ_{i=1}^n |zi| (1 − p)   (by assumption)
≤ (ρcmax/2) Σ_{i=1}^n |ιi| + (ρ²cmax/2) Σ_{i=1}^n |zi| (1 − p)   (by Lemma 3)
≤ (ρcmax/2) Σ_{i=1}^n Σ_{j=1}^n |zj| pji + (ρ²cmax/2) Σ_{i=1}^n |zi| (1 − p)   (by the triangle inequality)
≤ (ρcmax/2) Σ_{i=1}^n |zi|.   (by the subadditivity)

So, U′d,P(y, γ; 0, z) ≤ (ρcmax/2) Σ_{i=1}^n |zi| holds for all realizations (d, P). It follows that the integral also satisfies the condition.

The upcoming result, Proposition A.8, assists with the eventual induction over t by stating

that if ut (y t , γ t ) is convex and certain bounds on its directional derivatives are satisfied, then

vt (xt , γ t ) not only is convex, but also satisfies the bounds on the directional derivatives required by

Proposition A.6. Before continuing, we need to introduce a technical lemma that carefully analyzes

the optimal repositioning plan. This lemma plays a crucial role in the proof of Proposition A.8.

Lemma A.7. Suppose x, y ≥ 0 and eT x = eT y. Suppose x − η ≥ 0. Then there exists another vector ξ such that: 1) y − ξ ≥ 0, 2) eT ξ = eT η, 3) Σ_{j=1}^n |ξj − ηj| ≤ 2 Σ_{j=1}^n |ηj|, 4) C(y − ξ − x + η) = C(y − x) − C(ξ − η). Moreover, if η ≥ 0, we have ξ ≥ 0 and

Σ_{j=1}^n (ξj − ηj)+ = (1/2) Σ_{j=1}^n |ξj − ηj| ≤ Σ_{j=1}^n ηj.     (38)
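To illustrate the construction behind this lemma, consider a small example of ours (not from the source): let n = 2, x = (2, 0), y = (1, 1), and η = (2, 0), so that x − η = (0, 0) ≥ 0 and eT x = eT y = 2. The construction in the proof below yields ξ = (1, 1). Then y − ξ = (0, 0) ≥ 0, eT ξ = 2 = eT η, Σj |ξj − ηj| = 2 ≤ 2 Σj |ηj| = 4, and C(y − ξ − x + η) = C(0) = 0 = C(y − x) − C(ξ − η) since y − x = ξ − η = (−1, 1); inequality (38) holds as 1 ≤ 2.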

Proof. Let I = {i | yi − ηi < 0}. If I = ∅, we have y − η ≥ 0 and we can simply take ξ = η. Thus, in what follows, we assume that I ≠ ∅. We let J− = {i | yi − xi ≤ 0} and J+ = {i | yi − xi > 0}. Note that yi − xi < 0 ∀ i ∈ I, so I ⊆ J−.



Lemma 2 suggests that there exists w ≥ 0 such that c · w = C(y − x) and

yj − xj = −Σ_{i∈J+} wji = −Σ_i wji   for j ∈ J−;
yj − xj = Σ_{i∈J−} wij = Σ_i wij     for j ∈ J+.

The interpretation of the first case above is that locations j with more inventory than the target

level yj transfer inventory to locations that are below the target level. The second case is interpreted

in an analogous way. Let ξ be such that






ξj = yj                                       for j ∈ I;
ξj = ηj                                       for j ∈ J− \ I;
ξj = Σ_{i∈I} ((ηi − yi)/(xi − yi)) wij + ηj   for j ∈ J+.

We claim that ξ satisfies the desired properties. We first verify that yj − ξj ≥ 0. This is clearly

true if j ∈ J − from the construction of ξ. Now let j ∈ J + , then we have

$$y_j - \xi_j = y_j - \sum_{i\in I}\frac{\eta_i - y_i}{x_i - y_i}\,w_{ij} - \eta_j \;\ge\; y_j - \sum_{i\in I} w_{ij} - \eta_j \;\ge\; y_j - \sum_{i} w_{ij} - \eta_j = x_j - \eta_j \ge 0,$$

where the first inequality follows from $x_i \ge \eta_i > y_i$ for i ∈ I. Therefore, y − ξ ≥ 0 and part (1) is complete. Also, using $x_i - y_i = \sum_{j\in J^+} w_{ij}$ for i ∈ I, we have
$$\sum_j \xi_j = \sum_{j\in I} y_j + \sum_{j\in J^-\setminus I}\eta_j + \sum_{j\in J^+}\Big(\sum_{i\in I}\frac{\eta_i-y_i}{x_i-y_i}\,w_{ij} + \eta_j\Big) = \sum_{j\in I} y_j + \sum_{j\in (J^-\setminus I)\,\cup\, J^+}\eta_j + \sum_{i\in I}\frac{\eta_i-y_i}{x_i-y_i}\,(x_i-y_i) = \sum_i \eta_i,$$

verifying part (2). Note that:






$$\xi_j - \eta_j = \begin{cases} y_j - \eta_j & \text{for } j \in I;\\[2pt] 0 & \text{for } j \in J^-\setminus I;\\[2pt] \sum_{i\in I}\dfrac{\eta_i-y_i}{x_i-y_i}\,w_{ij} & \text{for } j \in J^+. \end{cases}$$



To show part (3), we have
$$\sum_{j=1}^{n}|\xi_j - \eta_j| = 2\sum_{j\in I}(\eta_j - y_j) \le 2\sum_{j=1}^{n}|\eta_j|.$$

To show that $C(y - \xi - x + \eta) \le C(y - x) - C(\xi - \eta)$, let $\bar w$ be such that
$$\bar w_{ij} = \begin{cases} \Big(1 - \dfrac{\eta_i - y_i}{x_i - y_i}\Big)\, w_{ij} & \text{for } i \in I;\\[6pt] w_{ij} & \text{for } i \notin I. \end{cases}$$

Then for all $j \in J^+$, we have
$$\begin{aligned}
&\; y_j - \xi_j - (x_j - \eta_j) - \sum_i \bar w_{ij} + \sum_i \bar w_{ji}\\
&= (y_j - x_j) - \xi_j + \eta_j - \sum_{i\in I}\bar w_{ij} - \sum_{i\notin I}\bar w_{ij}\\
&= \sum_i w_{ij} - \sum_{i\in I}\frac{\eta_i-y_i}{x_i-y_i}\,w_{ij} - \sum_i w_{ij} + \sum_{i\in I}\frac{\eta_i-y_i}{x_i-y_i}\,w_{ij} = 0,
\end{aligned}$$

where we used $\sum_i \bar w_{ji} = 0$. Similarly, for all j ∈ I, we have

$$\begin{aligned}
&\; x_j - \eta_j - y_j + \xi_j - \sum_i \bar w_{ji} + \sum_i \bar w_{ij}\\
&= x_j - \eta_j - \sum_i\Big(1 - \frac{\eta_j - y_j}{x_j - y_j}\Big)\, w_{ji}\\
&= x_j - \eta_j - \sum_i w_{ji} + \frac{\eta_j - y_j}{x_j - y_j}\sum_i w_{ji} = x_j - \eta_j - (x_j - y_j) + (\eta_j - y_j) = 0,
\end{aligned}$$

and for all $j \in J^-\setminus I$, we have
$$x_j - \eta_j - y_j + \xi_j - \sum_i \bar w_{ji} + \sum_i \bar w_{ij} = x_j - y_j - \sum_i w_{ji} = 0.$$

Therefore, we have shown that $\bar w$ is a feasible solution to the optimization problem for C(·) defined in (1) at y − ξ − x + η. Thus, we have
$$\begin{aligned}
C(y-\xi-x+\eta) &\le \sum_{i,j} c_{ij}\,\bar w_{ij}\\
&= \sum_{i,j} c_{ij}\, w_{ij} - \sum_{i\in I}\sum_{j}\frac{\eta_i-y_i}{x_i-y_i}\, w_{ij}\, c_{ij}\\
&= C(y-x) - \sum_{i\in I}\sum_{j\in J^+} c_{ij}\Big(\frac{\eta_i-y_i}{x_i-y_i}\Big)\, w_{ij}.
\end{aligned}$$

If we define $\hat w$ as
$$\hat w_{ij} = \begin{cases} \dfrac{\eta_i-y_i}{x_i-y_i}\, w_{ij} & \text{for } i \in I,\ j \in J^+;\\[6pt] 0 & \text{otherwise}, \end{cases}$$

then we can check that $\hat w$ is a feasible solution to (1) at ξ − η. Therefore, we have
$$\sum_{i\in I}\sum_{j\in J^+} c_{ij}\Big(\frac{\eta_i-y_i}{x_i-y_i}\Big)\, w_{ij} = c\cdot\hat w \ge C(\xi-\eta).$$

Finally, we have
$$C(y-\xi-x+\eta) \le C(y-x) - \sum_{i\in I}\sum_{j\in J^+} c_{ij}\Big(\frac{\eta_i-y_i}{x_i-y_i}\Big)\, w_{ij} \le C(y-x) - C(\xi-\eta).$$

Together with the subadditivity of C(·), we conclude that

C(y − ξ − x + η) = C(y − x) − C(ξ − η),

proving part (4). The last claim follows directly from the construction of ξ and parts (2) and

(3).
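To make the construction above concrete, here is a small numerical sketch (illustrative only: the inputs x, y, η and the flow w realizing C(y − x) are hypothetical example data, and w is assumed to have the sender/receiver structure guaranteed by Lemma 2). It builds ξ exactly as in the proof and checks parts (1)–(3) and inequality (38).

import numpy as np

def construct_xi(x, y, eta, w):
    # Construction of xi from the proof of Lemma A.7. Assumes e'x = e'y,
    # x - eta >= 0, and w >= 0 carries flow only from locations with
    # y_j <= x_j (the set J-) to locations with y_j > x_j (the set J+).
    n = len(x)
    I = [i for i in range(n) if y[i] - eta[i] < 0]
    Jplus = [j for j in range(n) if y[j] - x[j] > 0]
    xi = np.zeros(n)
    for j in range(n):
        if j in I:
            xi[j] = y[j]
        elif j not in Jplus:  # j in J- \ I
            xi[j] = eta[j]
        else:                 # j in J+
            xi[j] = eta[j] + sum((eta[i] - y[i]) / (x[i] - y[i]) * w[i, j]
                                 for i in I)
    return xi

# Hypothetical three-location example with one unit of flow from 0 to 2.
x = np.array([2.0, 1.0, 0.0]); y = np.array([1.0, 1.0, 1.0])
eta = np.array([1.5, 0.0, 0.0])              # x - eta >= 0 and I = {0}
w = np.zeros((3, 3)); w[0, 2] = 1.0          # realizes C(y - x)
xi = construct_xi(x, y, eta, w)              # xi = (1.0, 0.0, 0.5)
assert np.all(y - xi >= -1e-12)                                   # part (1)
assert np.isclose(xi.sum(), eta.sum())                            # part (2)
assert np.abs(xi - eta).sum() <= 2 * np.abs(eta).sum() + 1e-12    # part (3)
assert xi.sum() + 0.5 * np.abs(xi - eta).sum() <= 2 * eta.sum() + 1e-12  # (38)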

Proposition A.8. Suppose $u_t(\cdot)$ is convex and continuous in ∆ and

1. $u_t'(y^t, \gamma^t; \pm\eta, \mp\eta) \le \beta \sum_{i=1}^{n} \eta_i$ for all $(y^t, \gamma^t) \in \Delta$ and for any feasible direction $(\pm\eta, \mp\eta)$ with η ≥ 0;

2. $u_t'(y^t, \gamma^t; 0, z) \le (\rho c_{\max}/2) \sum_{i=1}^{n} |z_i|$ for all $(y^t, \gamma^t) \in \Delta$ and for any feasible direction (0, z) with $e^T z = 0$.

Then, the value function $v_t(\cdot)$ is convex and continuous in ∆ with $\Omega_v(\gamma) = \Delta_{n-1}(I)$ for γ ∈ S. For each $(x^t, \gamma^t) \in \Delta$ and η ≥ 0, the directional derivatives satisfy
$$v_t'(x^t, \gamma^t; z-\eta, \eta) \le C(-z) + (\beta + \rho c_{\max} - c_{\min})\sum_{i=1}^{n}\eta_i \tag{39}$$
for any feasible direction (z − η, η) and
$$v_t'(x^t, \gamma^t; z+\eta, -\eta) \le C(-z) + \beta\sum_{i=1}^{n}\eta_i \tag{40}$$
for any feasible direction (z + η, −η). In addition,
$$v_t'(x^t, \gamma^t; 0, z) \le \frac{\rho c_{\max}}{2}\sum_{i=1}^{n}|z_i| \tag{41}$$
for any feasible direction (0, z) with $e^T z = 0$.

Proof. We omit the subscript t throughout the proof to reduce notation. To show that v(·) is convex, suppose $y^1$ and $y^2$ are optimal solutions of (11) for $(x^1, \gamma^1)$ and $(x^2, \gamma^2)$, respectively. Then, $\lambda y^1 + (1-\lambda)y^2 \in \Delta_{n-1}(\lambda e^T x^1 + (1-\lambda)e^T x^2)$ and thus
$$\begin{aligned}
v(\lambda x^1 + (1-\lambda)x^2,\ \lambda\gamma^1 + (1-\lambda)\gamma^2) &\le u(\lambda y^1 + (1-\lambda)y^2,\ \lambda\gamma^1 + (1-\lambda)\gamma^2) + C\big(\lambda(y^1 - x^1) + (1-\lambda)(y^2 - x^2)\big)\\
&\le \lambda v(x^1, \gamma^1) + (1-\lambda)\, v(x^2, \gamma^2),
\end{aligned}$$

by convexity of u(·) and Lemma 1. Continuity follows from Berge's Maximum Theorem, as the set-valued map $x \mapsto \Delta_{n-1}(I)$ is continuous. To show the result in (39), suppose (z − η, η) is a feasible direction. Let y* be an optimal solution to equation (11) at (x, γ). Therefore,

$$v(x,\gamma) = \min_{y\in\Delta_{n-1}(e^T x)}\ C(y-x) + u(y,\gamma) = C(y^*-x) + u(y^*,\gamma).$$

Let t > 0 be small enough such that x + t(z − η) ≥ 0. According to Lemma A.7, there exists a vector ξ ≥ 0 such that for small enough t: 1) $y^* - t\xi \ge 0$; 2) $e^T\xi = e^T\eta$; 3) $\sum_{j=1}^{n}|\xi_j - \eta_j| \le 2\sum_{j=1}^{n}|\eta_j|$; 4) $C(y^* - t\xi - x - tz + t\eta) = C(y^* - x - tz) - tC(\xi - \eta)$. Therefore, $y^* - t\xi$ is a feasible solution to equation (11)



at (x + tz − tη, γ + tη) and thus we have
$$\begin{aligned}
&\frac{v(x+tz-t\eta,\ \gamma+t\eta) - v(x,\gamma)}{t}\\[4pt]
&\quad\le \frac{u(y^*-t\xi,\ \gamma+t\eta) - u(y^*-t\xi,\ \gamma+t\xi) + u(y^*-t\xi,\ \gamma+t\xi) - u(y^*,\gamma)}{t}\\[4pt]
&\qquad + \frac{C(y^*-t\xi-x-tz+t\eta) - C(y^*-x-tz) + C(y^*-x-tz) - C(y^*-x)}{t}\\[4pt]
&\quad\le \frac{u(y^*-t\xi,\ \gamma+t\eta) - u(y^*-t\xi,\ \gamma+t\xi) + u(y^*-t\xi,\ \gamma+t\xi) - u(y^*,\gamma)}{t} - C(\xi-\eta) + C(-z).
\end{aligned}$$

Adding and subtracting u(y*, γ) in the numerator of the first term and then taking limits on both sides, we get
$$\begin{aligned}
v'(x,\gamma;\, z-\eta, \eta) &\le u'(y^*,\gamma;-\xi,\eta) - u'(y^*,\gamma;-\xi,\xi) + u'(y^*,\gamma;-\xi,\xi) - C(\xi-\eta) + C(-z)\\
&\le u'(y^*,\gamma;0,\eta-\xi) + u'(y^*,\gamma;-\xi,\xi) - C(\xi-\eta) + C(-z)\\
&\le \frac{\rho c_{\max}}{2}\sum_{i=1}^{n}|\xi_i-\eta_i| + \beta\sum_{i=1}^{n}\xi_i - \frac{c_{\min}}{2}\sum_{i=1}^{n}|\xi_i-\eta_i| + C(-z)\\
&\le \beta\sum_{i=1}^{n}\eta_i + \frac{\rho c_{\max}-c_{\min}}{2}\sum_{i=1}^{n}|\xi_i-\eta_i| + C(-z)\\
&\le (\beta + \rho c_{\max} - c_{\min})\sum_{i=1}^{n}\eta_i + C(-z),
\end{aligned}$$

where the second inequality follows by Lemma A.1 and the third inequality follows by Lemma 3.

Now we show equation (40). Suppose (z + η, −η) is a feasible direction. Again, let y* be an optimal solution to equation (11) at (x, γ). Then $y^* + t\eta$ is clearly a feasible solution to equation (11) at (x + tz + tη, γ − tη) and thus we have
$$\begin{aligned}
v'(x,\gamma;\, z+\eta, -\eta) &= \lim_{t\to 0}\frac{v(x+tz+t\eta,\ \gamma-t\eta) - v(x,\gamma)}{t}\\
&\le \lim_{t\to 0}\frac{u(y^*+t\eta,\ \gamma-t\eta) + C(y^*-x-tz) - u(y^*,\gamma) - C(y^*-x)}{t}\\
&\le u'(y^*,\gamma;\eta,-\eta) + C(-z)\\
&\le \beta\sum_{i=1}^{n}\eta_i + C(-z),
\end{aligned}$$
where the second inequality follows from the subadditivity and the positive homogeneity of C(·) and the last inequality follows from the assumption.



To show the result in (41), suppose (0, z) is a feasible direction at (x, γ). Let y* be an optimal solution to equation (11) at (x, γ). Then y* is a feasible solution for (x, γ + tz) and thus,
$$\begin{aligned}
v'(x,\gamma;\, 0, z) &= \lim_{t\to 0}\frac{v(x,\ \gamma+tz) - v(x,\gamma)}{t}\\
&\le \lim_{t\to 0}\frac{u(y^*,\ \gamma+tz) + C(y^*-x) - u(y^*,\gamma) - C(y^*-x)}{t}\\
&= \lim_{t\to 0}\frac{u(y^*,\ \gamma+tz) - u(y^*,\gamma)}{t} = u'(y^*,\gamma;0,z) \le \frac{\rho c_{\max}}{2}\sum_{i=1}^{n}|z_i|,
\end{aligned}$$
which completes the proof.

The proofs of Theorem 9 for $(u_t(\cdot))$ and $(v_t(\cdot))$ follow from Proposition A.5, Proposition A.6, Proposition A.8, and induction over t. Consequently, an optimal policy for each period is provided by Theorem 4, and the no-repositioning set can be characterized as in Propositions 5 and 6 and Corollary 7.

A.2 Other Proofs

Proof of Lemma 1: It is clear that the linear program (1) is bounded and feasible. Therefore, an optimal solution to (1) exists and strong duality holds. The dual linear program can be written as
$$C(z) = \max\ \lambda^T z \quad \text{subject to}\quad \lambda_j - \lambda_i \le c_{ij}\ \ \forall\, i, j. \tag{42}$$

From (42), we have
$$C(tz) = \max\big\{t\lambda^T z : \lambda_j - \lambda_i \le c_{ij},\ \forall\, i,j\big\} = t\max\big\{\lambda^T z : \lambda_j - \lambda_i \le c_{ij},\ \forall\, i,j\big\} = tC(z).$$

Therefore, C(·) is positively homogeneous. As the pointwise supremum of a collection of convex

and lower semicontinuous functions (λT z for each λ), C(·) is also convex and lower semicontinuous.

It is well known that a convex function on a locally simplicial convex set is upper semicontinuous

(Rockafellar [1970] Theorem 10.2). Therefore, as H is a polyhedron, C(·) must be continuous. From



the positive homogeneity and the convexity, we have
$$C(z_1 + z_2) = 2C\Big(\frac{1}{2}z_1 + \frac{1}{2}z_2\Big) \le 2\Big(\frac{1}{2}C(z_1) + \frac{1}{2}C(z_2)\Big) = C(z_1) + C(z_2).$$

Therefore, C(·) is sub-additive. 
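As an aside, C(·) is simply the optimal value of a small transportation LP, so it is easy to evaluate numerically from either (1) or the dual (42). The following sketch (illustrative only; the three-location cost matrix is hypothetical and SciPy's linprog is assumed to be available) solves the primal formulation.

import numpy as np
from scipy.optimize import linprog

def repositioning_cost(z, c):
    # C(z) = min sum_ij c_ij w_ij  s.t.  (inflow - outflow) at j equals z_j, w >= 0.
    n = len(z)
    A_eq = np.zeros((n, n * n))        # w is flattened row-major: w_ij -> i*n + j
    for j in range(n):
        for i in range(n):
            A_eq[j, i * n + j] += 1.0  # w_ij flows into j
            A_eq[j, j * n + i] -= 1.0  # w_ji flows out of j
    res = linprog(c.flatten(), A_eq=A_eq, b_eq=z, bounds=(0, None), method="highs")
    return res.fun

# Hypothetical example: move one unit of net inventory from location 0 to location 2.
c = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
z = np.array([-1.0, 0.0, 1.0])         # e'z = 0
print(repositioning_cost(z, c))        # 2.0

The dual (42) certifies the same value: λ = (0, 1, 2) satisfies λ_j − λ_i ≤ c_ij for all i, j and attains λᵀz = 2.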

Proof of Lemma 2: It is easy to see that an equivalent condition is $w_{i,j}\, w_{j,k} = 0$ for all i, j, k. To show this is true, suppose w is an optimal solution and there exist i, j, k such that $w_{i,j}, w_{j,k} > 0$. If i = k, we can set at least one of $w_{i,j}$ and $w_{j,i}$ to 0 without violating the constraints. If i ≠ k, we can set at least one of $w_{i,j}$ and $w_{j,k}$ to 0 and increase $w_{i,k}$ accordingly. In both cases, the resulting objective is at least as good. Repeating this argument for all i, j, and k enforces the condition.
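The exchange argument in this proof is easy to operationalize. The sketch below (illustrative only; it assumes the unit costs satisfy the triangle inequality, so each rerouting step never increases the objective) clears intermediate locations one at a time. The inner loop terminates because every step zeroes an entry into or out of location j without creating new flow through j, and once j is cleared it never again both receives and sends.

import numpy as np

def simplify_flow(w):
    # Given a feasible flow w >= 0, return a flow with the same net inflow at
    # every location and with w[i, j] * w[j, k] == 0 for all i, j, k.
    w = np.asarray(w, dtype=float).copy()
    n = w.shape[0]
    np.fill_diagonal(w, 0.0)
    for j in range(n):                 # clear intermediate location j
        while True:
            ins = [i for i in range(n) if i != j and w[i, j] > 0]
            outs = [k for k in range(n) if k != j and w[j, k] > 0]
            if not ins or not outs:    # j no longer both receives and sends
                break
            i, k = ins[0], outs[0]
            m = min(w[i, j], w[j, k])
            w[i, j] -= m
            w[j, k] -= m
            if i != k:
                w[i, k] += m           # reroute the two-hop flow i -> j -> k
            # if i == k, the unit traveled i -> j -> i and is simply canceled
    return w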

Proof of Theorem 4: Fix γ ∈ S. Let $y^*(x, \gamma) = \{y \in \Delta_{n-1}(I) : u(x, \gamma) = C(y - x) + u(y, \gamma)\}$ be the set of optimal solutions corresponding to the system state x ∈ S. It is easy to verify that
$$\Omega_u(\gamma) = \bigcup_{x \in \Delta_{n-1}(I)} y^*(x, \gamma). \tag{43}$$

As C(·) and u(·) are continuous and ∆n−1 (I) is compact, by Berge’s Maximum Theorem, y ∗ (·) is

a nonempty-valued and compact-valued upper hemicontinuous15 correspondence. As C(·) and u(·)

are also convex, y ∗ (·) is also convex-valued. So, it is clear from (43) that Ωu (γ) is nonempty. To

show Ωu (γ) is compact, suppose y 1 , y 2 , . . . is a sequence in Ωu (γ) such that y n ∈ y ∗ (xn , γ) for

n ∈ N and y n → y. We need to show that y ∈ Ωu (γ). By passing through a subsequence, we may

assume that y nk ∈ y ∗ (xnk , γ), xnk → x and y nk → y. As y ∗ (·) is compact-valued, by the Closed

Graph Theorem, y ∗ (·) has a closed graph. This implies that y ∈ y ∗ (x, γ) ⊂ Ωu (γ), and therefore

Ωu (γ) is compact.

To show that Ωu (γ) is connected, suppose the reverse is true. Then, there exist open sets V1 , V2

in ∆n−1 (I) such that V1 ∩ V2 = ∅, V1 ∪ V2 ⊃ Ωu (γ), and V1 ∩ Ωu (γ) and V2 ∩ Ωu (γ) are nonempty.

As y ∗ (·) is convex-valued, this implies that, for any x ∈ ∆n−1 (I), y ∗ (x, γ) is either in V1 or in V2 ,
15
Upper hemicontinuity can be defined as follows. Suppose X and Y are topological spaces. A correspondence
f : X → P(Y ) (power set of Y ) is upper hemicontinuous if for any open set V in Y , f −1 (V ) = {x ∈ X|f (x) ⊂ V } is
open in X.



but not both. Let (U1 , γ) = y ∗−1 (V1 ) and (U2 , γ) = y ∗−1 (V2 ). Then U1 , U2 are open, U1 ∩ U2 = ∅,

U1 ∪ U2 ⊃ ∆n−1 (I), and U1 ∩ ∆n−1 (I) and U2 ∩ ∆n−1 (I) are nonempty. This implies that the

(n−1)-dimensional simplex ∆n−1 (I) is not connected. We have reached a contradiction. Therefore,

Ωu (γ) is also connected.

Next, to show that π* is optimal, note that π*(x, γ) = x for $x \in \Omega_u(\gamma)$ is clear from (13). If $x \notin \Omega_u(\gamma)$, then, by (43), $\pi^*(x, \gamma) \in \Omega_u(\gamma)$. Now, suppose $\pi^*(x, \gamma) = y \in \Omega_u(\gamma)^\circ$; then $y + t(x - y) \in \Omega_u(\gamma)$ for small enough t > 0. Set z = y + t(x − y). Then
$$u(z, \gamma) + C(z - x) \le u(y, \gamma) + C(y - z) + C(z - x) = u(y, \gamma) + tC(y - x) + (1 - t)C(y - x) = u(y, \gamma) + C(y - x).$$
So, z is as good a solution as y. Therefore, there exists an optimal solution $\pi^*(x, \gamma) \in B(\Omega_u(\gamma))$ if $x \notin \Omega_u(\gamma)$.

Proof of Proposition 5: Suppose $x \in \Omega_u(\gamma)$. Take any feasible direction (z, 0) at (x, γ). Then, by (13),
$$\frac{u(x + tz, \gamma) - u(x, \gamma)}{t} \ge -C(z)$$
for t > 0. Taking the limit as t ↓ 0, we have $u'(x, \gamma; z, 0) \ge -C(z)$. Conversely, suppose $u'(x, \gamma; z, 0) \ge -C(z)$ for any feasible direction z at x in H. Let φ(t) = u(x + tz, γ). Then, φ(·) is convex, φ(0) = u(x, γ), and $\varphi'(0+) = u'(x, \gamma; z, 0) \ge -C(z)$. By the subgradient inequality, $t\varphi'(0+) + \varphi(0) \le \varphi(t)$. This implies that $-tC(z) + u(x, \gamma) \le u(x + tz, \gamma)$ for any feasible direction (z, 0). Therefore, we have $x \in \Omega_u(\gamma)$.

Proof of Proposition 6: For the "if" part, suppose $x \notin \Omega_u(\gamma)$. Then, there exists $y \in \Delta_{n-1}(I)$ such that u(x, γ) > C(y − x) + u(y, γ). Take any $g \in \partial_x u(x, \gamma)$. By the subgradient inequality, $u(x, \gamma) + g^T(y - x) \le u(y, \gamma)$. It follows that
$$C(y - x) < -g^T(y - x).$$

Suppose $w = (w_{ij})$ is an optimal solution to problem (1). Then $C(y - x) = \sum_i\sum_j c_{ij} w_{ij}$, and by Lemma 2, $-g^T(y - x) = \sum_i g_i (y_i - x_i)^- - \sum_j g_j (y_j - x_j)^+ = \sum_i\sum_j (g_i - g_j)\, w_{ij}$. So, we have
$$\sum_i\sum_j c_{ij}\, w_{ij} < \sum_i\sum_j (g_i - g_j)\, w_{ij}.$$



Hence, there exist i and j such that $g_i - g_j > c_{ij}$. This implies g ∉ G.

For the "only if" part, suppose x > 0 and $x \in \Omega_u(\gamma)$. Assume $\partial_x u(x, \gamma) \cap G = \emptyset$. We will show that this leads to a contradiction. Let P be the orthogonal projection from $\mathbb{R}^n$ onto the subspace $H = \{x \in \mathbb{R}^n : \sum_i x_i = 0\}$. Then
$$P(x) = x - \frac{\sum_i x_i}{n}\, e,$$

where e = (1, . . . , 1) in Rn . Noting that G + αe ⊂ G for any α ∈ R, it is easy to verify that

∂x u(x, γ) ∩ G = ∅ if and only if ∂x u(x, γ) ∩ P (G) = ∅,

since $\partial_x u(x, \gamma) \subseteq H$. As $\partial_x u(x, \gamma)$ is closed and P(G) is compact, by the Hahn–Banach theorem, there exist z ∈ H, a ∈ ℝ and b ∈ ℝ such that
$$\langle g, z\rangle < a < b < \langle \lambda, z\rangle$$
for every $g \in P(\partial_x u(x, \gamma))$ and for every λ ∈ P(G), or equivalently, as ⟨g, z⟩ = ⟨P(g), z⟩ and ⟨λ, z⟩ = ⟨P(λ), z⟩, for every $g \in \partial_x u(x, \gamma)$ and for every λ ∈ G. As z is a feasible direction in H at $x \in \Omega_u(\gamma)$, by Proposition 5, we have $u'(x, \gamma; z, 0) \ge -C(z)$. It follows that
$$\sup\{\langle g, z\rangle : g \in \partial_x u(x, \gamma)\} = u'(x, \gamma; z, 0) \ge -C(z).$$
So, we have
$$-C(z) \le a < b < \langle\lambda, z\rangle$$
for every λ ∈ G. However, by the dual formulation (42), there exists $\lambda \in \{(y_1, y_2, \ldots, y_n) \mid y_j - y_i \le c_{ij}\ \forall\, i, j\}$ such that ⟨λ, z⟩ = C(z), or equivalently, ⟨−λ, z⟩ = −C(z). Recognizing that −λ ∈ G leads to the contradiction. Therefore, it follows that $\partial_x u(x, \gamma) \cap G \ne \emptyset$.

Proof of Corollary 7: If u(·, γ) is differentiable at x, then
$$\partial_x u(x, \gamma) = \left(\frac{\partial u(x, \gamma)}{\partial x_1}, \frac{\partial u(x, \gamma)}{\partial x_2}, \ldots, \frac{\partial u(x, \gamma)}{\partial x_n}\right).$$
In this case, it is easy to see that (17) simplifies to (18). To show that $x \in \Omega_u(\gamma)$ implies (18) for $x \in B(\Delta_{n-1}(I))$, note that the equality $\sup\{g^T z : g \in \partial_x u(x, \gamma)\} = u'(x, \gamma; z, 0)$ now holds for $x \in B(\Delta_{n-1}(I))$. The rest of the proof is the same as that of Proposition 6.
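In the differentiable case, the membership test is a finite collection of pairwise comparisons and is trivial to check numerically. A minimal sketch (the gradient and cost data below are hypothetical; per Proposition 6, membership holds when $g_i - g_j \le c_{ij}$ for all i, j):

import numpy as np

def in_no_repositioning_region(grad_u, c, tol=1e-9):
    # Differentiable case of Corollary 7: x belongs to the no-repositioning
    # set if the gradient g of u(., gamma) at x satisfies g_i - g_j <= c_ij.
    g = np.asarray(grad_u, dtype=float)
    return bool(np.all(g[:, None] - g[None, :] <= np.asarray(c) + tol))

# Hypothetical data: marginal values and repositioning costs for 3 locations.
g = np.array([0.3, -0.1, 0.2])
c = np.full((3, 3), 0.5); np.fill_diagonal(c, 0.0)
print(in_no_repositioning_region(g, c))   # True: max g_i - g_j = 0.4 <= 0.5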

Proof of Theorem 10: To show that the value function retains its structure in the infinite-horizon setting, we invoke the general approach outlined in Porteus [1975] and Porteus [1982], which "iterates" the structural properties of the one-stage problem. Let V* be the space of convex, continuous and bounded functions over ∆. Note that a one-step structure preservation property holds by Lemma A.5, Lemma A.6, and Proposition A.8: combined, they say that if the next-period value function is in V*, then the optimal value function of the current period is also in V*. Furthermore, the set V* with the sup-norm $\|\cdot\|_\infty$ is a complete metric space. These two observations allow us to apply Corollary 1 of Porteus [1975] and conclude that v ∈ V* (the remaining assumptions needed to apply the result can be easily checked). The rest of the proof follows from Lemma A.5, Lemma A.6, Proposition A.8, Theorem 4, Propositions 5 and 6, and Corollary 7.

Proof of Lemma 11: If $x \in D_k$, then $a_k \in \partial_x u_J(x, \zeta)$. Since $a_{ki} - a_{kj} \le c_{ij}$ for all i, j, according to Proposition 6, we have $x \in \Omega_{u_J}(\zeta)$. For the second part, we first write down the primal formulation for problem (25):
$$\begin{aligned}
\bar v_J(x, \zeta) = \min\quad & c \cdot w + \xi\\
\text{subject to}\quad & \sum_{i=1}^{n} w_{i,j} - \sum_{k=1}^{n} w_{j,k} = z_j - x_j \quad \forall\, j = 1, 2, \ldots, n\\
& w \ge 0\\
& e^T z = e^T x\\
& \xi \ge (z - y^k)^T a_k + (\zeta - \gamma^k)^T b_k + c_k \quad \forall\, k = 1, 2, \ldots, J\\
& z \ge 0.
\end{aligned}$$
Since $x \in \Omega_{u_J}(\zeta)$, one optimal solution to the primal formulation is w = 0, z = x, $\xi = (x - y^k)^T a_k + (\zeta - \gamma^k)^T b_k + c_k$. The dual solution $\lambda = a_k$, $\mu_k = 1$, $\mu_l = 0$ for all $l \ne k$ is clearly feasible. It also satisfies the complementary slackness conditions. Therefore, the solution is optimal.

Proof of Lemma 12: These basic properties of our Bellman operator L are well known for the standard Bellman operator and can be proved in an analogous manner; see, for example, Puterman [1994] or Bertsekas and Tsitsiklis [1996].

Proof of Lemma 13: Let (y, γ) ∈ ∆°. Let ξ⁺ = max(ξ, 0) and ξ⁻ = min(ξ, 0); then (ξ, η) = (ξ⁺, −ξ⁺) + (ξ⁻, −ξ⁻) + (0, ξ⁺ + ξ⁻ + η), and it follows from Theorem 10 and Assumption 2 that
$$\begin{aligned}
|f'(y,\gamma;\xi,\eta)| &\le |f'(y,\gamma;\xi^+,-\xi^+)| + |f'(y,\gamma;\xi^-,-\xi^-)| + |f'(y,\gamma;0,\xi^+ + \xi^- + \eta)|\\
&\le \beta\sum_{i=1}^{n}|\xi_i^+| + \beta\sum_{i=1}^{n}|\xi_i^-| + (\rho c_{\max}/2)\,\|\xi^+ + \xi^- + \eta\|_1\\
&\le \beta\,\|\xi\|_1 + (\rho c_{\max}/2)\,\|\xi^+ + \xi^- + \eta\|_1\\
&\le \beta\,\|\xi\|_1 + (\beta/2)\big(\|\xi^+\|_1 + \|\xi^-\|_1 + \|\eta\|_1\big)\\
&\le (3/2)\,\beta\,\|(\xi,\eta)\|_1 \le (3/2)\, 2n\,\beta\,\|(\xi,\eta)\|_2.
\end{aligned}$$
To conclude, the fact that Lf also satisfies the directional derivative conditions follows by Proposition A.8, Lemma A.5, and Lemma A.6.

Proof of Theorem 14: We want to show that for each ε > 0, there exists an almost surely finite iteration index J(ε) such that for all J ≥ J(ε), it holds that $\|u_J - u\|_\infty \le \varepsilon$. Let $B_r(y, \gamma)$ be a (2n−1)-dimensional ball centered at (y, γ) ∈ ∆° with radius r. Consider some ε₀ > 0 (to be specified later) and let C(ε₀) be an ε₀-covering of ∆, meaning that C(ε₀) is a finite collection of points in ∆° (representing the centers of a finite collection of balls with radius ε₀) and $\Delta \subseteq \bigcup_{(y,\gamma)\in C(\varepsilon_0)} B_{\varepsilon_0}(y,\gamma)$. Let $(\tilde y^1, \tilde\gamma^1), (\tilde y^2, \tilde\gamma^2), \ldots$ denote the sequence of sample points visited by the algorithm (one per iteration). Thus, by Assumption 3, we have $\sum_J \mathbb{P}\{(\tilde y^J, \tilde\gamma^J) \in B_{\varepsilon_0}(y,\gamma)\} = \infty$, and an application of the Borel–Cantelli lemma tells us that each ball $B_{\varepsilon_0}(y,\gamma)$ associated with the covering is visited infinitely often with probability one. To reduce notation, we will often suppress (y, γ) and use $B_{\varepsilon_0}$ to denote a generic ball in the covering. Our proof follows three main ideas:

1. For any infinite trajectory of sampled states, we can split it into an infinite number of "phases" such that in each phase, every ball associated with the $\varepsilon_0$-covering is visited at least once.

2. We can then construct an auxiliary "batch" algorithm whose iteration counter aligns with the sequence of phases from the previous step. This new algorithm is defined as another instance of Algorithm 1, where on any given iteration, we group all states visited in the corresponding phase of the main algorithm into a single batch and perform all updates at once. For clarity, we will refer to the main algorithm as the "asynchronous" version of the batch algorithm.

3. The auxiliary batch algorithm can be viewed as an approximate version of value iteration. Using the properties of L, we can show that it converges to an approximation of u (with error depending on $\varepsilon_0$). Finally, we conclude by arguing that the main algorithm does not deviate too far from the auxiliary version.

Let J₀ = 0 and for K = 1, 2, …, define the random variable
$$J_{K+1} = \min\{J > J_K : \forall\,(y,\gamma) \in C(\varepsilon_0),\ \exists\, J'\ \text{s.t.}\ J_K < J' \le J,\ (\tilde y^{J'}, \tilde\gamma^{J'}) \in B_{\varepsilon_0}(y,\gamma)\}$$
to be the first time after $J_K$ such that every ball in the ε₀-covering is visited at least once. Notably, $J_1$ is the first time that the entire covering is visited at least once. We denote the set of iterations
$$\mathcal{J}_K = \{J_{K-1}+1,\, J_{K-1}+2,\, \ldots,\, J_K\}$$
to be the "Kth phase" of the algorithm and let $\hat S_K = \{(\tilde y^J, \tilde\gamma^J)\}_{J\in\mathcal{J}_K}$ be the set of states visited throughout the course of phase K.

We now describe "path-dependent" instances of Algorithm 1 to assist with the remaining analysis. To be precise with the definitions, let us consider a sample path ω. The auxiliary batch algorithm associated with ω is a new instance of Algorithm 1 that uses iteration counter K and generates hyperplanes at the set of states $S_K = \hat S_K(\omega)$ for all K ≥ 1. The initial approximation is $\hat u_0 = u_0$ and the estimate after K batch updates is denoted $\hat u_K(y,\gamma)(\omega) = \max_{i=1,\ldots,N_K} \hat g_i(y,\gamma)(\omega)$. We are now interested in studying the stochastic process $\{\hat u_K(y,\gamma)\}$.

Next, we observe that the hyperplanes generated at iteration K + 1 of the batch algorithm are tangent to $L\hat u_K$ at the points in $S_{K+1}$. Let $\kappa = (3/2)\, 2n\, \beta$. Note that by repeatedly applying Lemma 13 and using $u_0 = 0$, we can argue that all (tangent) hyperplanes generated throughout the algorithm have directional derivatives bounded by κ. It follows that if (ỹ, γ̃) is a sample point in $S_{K+1}$ that lies in a ball $B_{\varepsilon_0}$ and it generates a hyperplane ĝ, then the underestimation error within the ball is upper-bounded by $\max_{(y,\gamma)\in B_{\varepsilon_0}} \big[(L\hat u_K)(y,\gamma) - \hat g(y,\gamma)\big] \le 2\kappa\varepsilon_0$ (using the fact that there is zero estimation error at (ỹ, γ̃), the tangent point). Applying this across the ε₀-covering, we have:
$$L\hat u_K - (2\kappa\varepsilon_0)\,\mathbf{1} \;\le\; \max_{i=N_K+1,\ldots,N_{K+1}} \hat g_i \;\le\; \max_{i=1,\ldots,N_{K+1}} \hat g_i = \hat u_{K+1}. \tag{44}$$

Therefore, we have a form of approximate value iteration and can analyze it accordingly (see Bertsekas and Tsitsiklis [1996]). Utilizing the monotonicity and shift properties of Lemma 12, we apply L to both sides of (44) for K = 0 to obtain
$$L(L\hat u_0 - 2\kappa\varepsilon_0\,\mathbf{1}) = L^2\hat u_0 - \rho\,(2\kappa\varepsilon_0)\,\mathbf{1} \le L\hat u_1.$$
Subtracting $(2\kappa\varepsilon_0)\,\mathbf{1}$ from both sides and then applying (44) for K = 1, we have
$$L^2\hat u_0 - \rho\,(2\kappa\varepsilon_0)\,\mathbf{1} - (2\kappa\varepsilon_0)\,\mathbf{1} \le L\hat u_1 - (2\kappa\varepsilon_0)\,\mathbf{1} \le \hat u_2.$$

Iterating these steps, we see that $L^K\hat u_0 - (2\kappa\varepsilon_0)(1 + \rho + \cdots + \rho^{K-1})\,\mathbf{1} \le \hat u_K$. Taking limits, using the convergence of the value iteration algorithm (see Puterman [1994]), and noting that $\hat u_K \le u$ for all K, we arrive at
$$u(y,\gamma) - \frac{2\kappa\varepsilon_0}{1-\rho} \;\le\; \lim_{K\to\infty}\hat u_K(y,\gamma) \;\le\; u(y,\gamma), \qquad \forall\,(y,\gamma)\in\Delta. \tag{45}$$

Hence, we have shown that the auxiliary batch algorithm generates value function approximations

that closely approximate u in the limit.

The final step is to relate the main asynchronous algorithm to the auxiliary batch version.

We claim that the value function approximation ûK generated by the Kth phase, for K ≥ 1, of

the batch algorithm is within a certain error bound of the approximation from the asynchronous

algorithm at JK :

$$\hat u_K - (4\kappa\varepsilon_0)(1 + \rho + \cdots + \rho^{K-1})\,\mathbf{1} \le u_{J_K}. \tag{46}$$

Let us consider the first phase, K = 1. Recall that the two algorithms are initialized with identical approximations, so $\hat u_0 = u_0$. Since $\{u_J\}$ is a nondecreasing sequence of functions, we have $\hat u_0 \le u_J$ and $L\hat u_0 \le Lu_J$ for any $J \in \mathcal{J}_1$ by the monotonicity property of Lemma 12. Also note that the auxiliary batch algorithm builds a uniform underestimate of $L\hat u_0$ with points of tangency belonging to $\hat S_1$, so we have $\hat u_1 \le L\hat u_0 \le Lu_J$. The hyperplane $g_{J+1}$ added in iteration $J + 1 \in \mathcal{J}_1$ of the asynchronous algorithm is tangent to $Lu_J$ at $(\tilde y^{J+1}, \tilde\gamma^{J+1})$, so it follows that
$$\hat u_1(\tilde y^{J+1}, \tilde\gamma^{J+1}) \le (Lu_J)(\tilde y^{J+1}, \tilde\gamma^{J+1}) = g_{J+1}(\tilde y^{J+1}, \tilde\gamma^{J+1}).$$


Suppose $(\tilde y^{J+1}, \tilde\gamma^{J+1})$ is in a ball $B_{\varepsilon_0}$. Then $g_{J+1}$ can fall below $\hat u_1$ by at most $\max_{(y,\gamma)\in B_{\varepsilon_0}} \big[\hat u_1(y,\gamma) - g_{J+1}(y,\gamma)\big] \le 4\kappa\varepsilon_0$ within the ball. Since this holds for every hyperplane (and corresponding ball)

added throughout the phase K = 1 and noting that every point in ∆ can be associated with some

well-approximating hyperplane (due to the property that each phase contains at least one visit to

every ball), we have

$$\hat u_1 - (4\kappa\varepsilon_0)\,\mathbf{1} \;\le\; \max_{J=1,\ldots,J_1} g_J = u_{J_1}, \tag{47}$$

which proves (46) for K = 1. Applying L to both sides of (47), utilizing Lemma 12, noting that $\hat u_2$ underestimates $L\hat u_1$, and applying the nondecreasing property of the sequence $\{u_J\}$, we have
$$\hat u_2 - \rho\,(4\kappa\varepsilon_0)\,\mathbf{1} \le L\hat u_1 - \rho\,(4\kappa\varepsilon_0)\,\mathbf{1} \le Lu_{J_1} \le Lu_J, \qquad \forall\, J \in \mathcal{J}_2.$$
By the same reasoning as above, for $J + 1 \in \mathcal{J}_2$, it must hold that
$$\hat u_2(\tilde y^{J+1}, \tilde\gamma^{J+1}) - \rho\,(4\kappa\varepsilon_0) \le (Lu_J)(\tilde y^{J+1}, \tilde\gamma^{J+1}) = g_{J+1}(\tilde y^{J+1}, \tilde\gamma^{J+1}),$$
and we obtain $\hat u_2 - \rho\,(4\kappa\varepsilon_0)\,\mathbf{1} - (4\kappa\varepsilon_0)\,\mathbf{1} \le u_{J_2}$, proving (46) for K = 2. We can iterate these steps to argue (46) for any K. Taking limits (all subsequent limits exist due to the boundedness and monotonicity of the sequences), we get
$$\lim_{K\to\infty}\hat u_K(y,\gamma) - \frac{4\kappa\varepsilon_0}{1-\rho} \;\le\; \lim_{K\to\infty} u_{J_K}(y,\gamma) = \lim_{J\to\infty} u_J(y,\gamma) \;\le\; u(y,\gamma), \qquad \forall\,(y,\gamma)\in\Delta, \tag{48}$$

where the equality follows from the fact that $u_{J_K}(y,\gamma)$ is a subsequence of $u_J(y,\gamma)$. Combining (45) and (48), we obtain
$$u(y,\gamma) - \frac{6\kappa\varepsilon_0}{1-\rho} \;\le\; \lim_{K\to\infty}\hat u_K(y,\gamma) - \frac{4\kappa\varepsilon_0}{1-\rho} \;\le\; \lim_{J\to\infty} u_J(y,\gamma) \;\le\; u(y,\gamma), \qquad \forall\,(y,\gamma)\in\Delta,$$



showing that the asynchronous algorithm generates value function approximations that are arbitrarily close to u: if we set $\varepsilon_0 = \varepsilon(1-\rho)/(6\kappa)$, this implies the existence of J(ε) such that for J ≥ J(ε), $\|u_J - u\|_\infty \le \varepsilon$.

Proof of Lemma 16: Cut k having a dominating region in ∆ means that
$$\min_{(y,\gamma)\in\Delta}\ \Big\{\max_{l\ne k}\big[a_l^T(y-y^l) + b_l^T(\gamma-\gamma^l) + c_l\big] - a_k^T(y-y^k) - b_k^T(\gamma-\gamma^k) - c_k\Big\} < 0.$$
Introducing the dummy variable t to denote the inner maximum, we obtain the formulation (29).
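For intuition, once ∆ is described by linear constraints, (29) is an ordinary LP. The sketch below is illustrative only: the description A_delta v ≤ b_delta of ∆ (acting on the stacked vector v = (y, γ)) and the cut data (a, b, c0, ys, gs) are hypothetical inputs supplied by the caller, and SciPy's linprog is assumed to be available.

import numpy as np
from scipy.optimize import linprog

def cut_has_dominating_region(k, a, b, c0, ys, gs, A_delta, b_delta, tol=1e-9):
    # Formulation (29): cut k has a dominating region iff
    #   min over (y, gamma) in Delta of  [max_{l != k} cut_l(y, gamma)] - cut_k(y, gamma)  < 0,
    # linearized by letting t upper-bound cut_l for every l != k.
    J, n = a.shape
    def affine(l):  # cut_l(y, gamma) = coef . (y, gamma) + const
        return np.concatenate([a[l], b[l]]), c0[l] - a[l] @ ys[l] - b[l] @ gs[l]
    rows, rhs = [], []
    for l in range(J):
        if l == k:
            continue
        coef, const = affine(l)
        rows.append(np.concatenate([coef, [-1.0]]))  # cut_l(y, gamma) - t <= 0
        rhs.append(-const)
    for row, s in zip(A_delta, b_delta):             # (y, gamma) must lie in Delta
        rows.append(np.concatenate([row, [0.0]]))
        rhs.append(s)
    ck_coef, ck_const = affine(k)
    obj = np.concatenate([-ck_coef, [1.0]])          # minimize t - cut_k(y, gamma)
    res = linprog(obj, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * (2 * n + 1), method="highs")
    return res.status == 0 and res.fun - ck_const < -tol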
