
Proceedings of the 2011 Industrial Engineering Research Conference

T. Doolen and E. Van Aken, eds.

A Dynamic Programming Approach for Nation Building Problems
Gregory Tauer, Rakesh Nagi, and Moises Sudit
Department of Industrial and Systems Engineering,
State University of New York at Buffalo
Buffalo, NY
Abstract
Recent attention has been given to quantitative methods for studying Nation Building problems. A nation's economic,
political, and social structures constitute large and complex dynamic systems. This leads to the construction of large
and computationally intensive Nation Building simulation models, especially when a high level of detail and validity
are important. We consider a Markov Decision Process model for the Nation Building problem and attempt a dynamic
programming solution approach. However, DP algorithms are subject to the curse of dimensionality, which is especially
problematic since the models we consider are of large size and high dimensionality. We propose an algorithm
that focuses on a local decision rule for the area of a Nation Building model's state space around the target nation's
actual state. This process progresses in an online fashion; as the actual state transitions, a new local decision rule is
computed. Decisions are chosen to maximize an infinite horizon discounted reward criterion that considers both short-
and long-term gains. Short-term gains can be described exactly by the local model. Long-term gains, which must be
considered to avoid myopic behavior of local decisions, are approximated locally as fixed rewards.

Keywords
Nation Building, Approximate Dynamic Programming, Markov Decision Processes.

1. Introduction
The ultimate goal of a Nation Building mission, as stated by RAND, is to "... promote political and economic reforms
with the objective of transforming a society emerging from conflict into one at peace with itself and its neighbors" [1].
Nation Building is a task at which the United Nations and United States have considerable experience, and
much effort has been devoted to understanding what works well and how the process can be improved [2]. Part of this
improvement effort includes the construction of simulation models to study nation building problems [3, 4].
This paper considers the problem of using simulation models for the task of planning in a nation building environment.
Our goal is to find a policy that can be followed to most efficiently realize a given measure of stability. We take
a dynamic programming approach and view the problem as a Markov Decision Process (MDP), which allows us to
study the dynamic and random nature of the nation building problem in a structured way. The largest difficulty faced
in this effort is the high dimensionality, and therefore large number of potential states, of the simulation models under
study. This is the well-known Curse of Dimensionality in dynamic programming. For instance, the example model
shown in Section 4 involves 500 people that can each be in any of 20 employment categories, for a total of 2.28 × 10^34
state combinations (one for each unique employment configuration of the 500 people).
To help overcome the intractable number of states encountered in nation building problems, we assume some good
policy is known prior to any study of the simulation model. This assumption is based on the long history and study
of nation building efforts. In Section 3.1 we will call this prior policy an Expert Policy, and it will be used to help
judge the long-term consequences of immediate actions. In some sense this is similar to transfer learning, specifically the
transfer of human advice [5]. Our suggested approach is even more closely related to the rollout policy approach [6],
since we concentrate on results which improve a given policy.
The approach we adopt is similar in many aspects to response surface methodology (RSM) in the statistics literature.
In RSM, one starts from some initial factor setting (x_0) and runs a series of experiments around x_0. The results of
these experiments are then used to fit a function, often a low-degree polynomial function, to approximate the response
surface around x_0. This function is then used as guidance to make small changes to improve the system. Usually, only
small changes are made because the function being used to approximate the response surface was fit exclusively to the
area near x_0; the approximation might provide poor generalization if big changes are made. This prompts an iterative
approach where a function is fit, a small change is made, then a new function is fit for the area around the new factor
setting, and so on.
In our problem we have a large dynamic system with some current state x_0. We will sample the dynamics of the
system around x_0, fit a function to approximate the value of states (the value function) near x_0, and then use that
function to make decisions about how to control the system. While in RSM it is possible to directly sample the response
surface, the main complication in our problem is that it is not possible to directly sample the value function.
The expert policy helps to more accurately estimate the true value function by providing a mechanism to look at the
system far away from x_0 without needing to explicitly consider faraway states. Similarly to RSM, after the system
moves away from x_0 a new function will be fit; when the system transitions to a new state, x_0 is redefined as the new
current state.
The approach of localized approximation is not in itself novel [7-10], and neither is the approach of using parameterized
functions to approximate the value function [11-13]. However, we will discuss how their combination can help
overcome some problems with using DP with simulation, as well as propose an extension to these techniques that
allows the incorporation of prior knowledge about how to act: our Expert Policy.

2. Background
This section contains a brief overview of necessary topics. Section 2.1 briefly describes the Markov Decision Process
framework and Section 2.2 shows how a linear programming formulation can solve such problems. Section
2.3 discusses specific difficulties faced when using simulation models to study MDPs. Finally, Section 2.4 discusses
the use of local approximations and Section 2.5 covers using linear basis functions to approximate the value function of an MDP.
2.1 Markov Decision Processes
Certain qualities of Nation Building models can be inferred from the systems they address. We assume these models
are stochastic and discrete time. With the additional simplifying assumption that any continuous quantities (such as
gallons of water in a reservoir) can be expressed as discrete quantities (the reservoir is empty, half full, or full), the
result is a discrete time, discrete state, stochastic system that is well modeled as a Markov Decision Process.
A Markov Decision Process (MDP) [14] consists of a set of states X, a set of actions A, a state transition probability
function p(x'|x,a) ∈ [0,1] giving the probability of moving from x ∈ X to x' ∈ X when action a ∈ A is taken, and a
reward function R(x,a) ∈ ℝ for all x ∈ X and a ∈ A. Given the current state x ∈ X, a decision maker must choose an
action a ∈ A. Their action choice will result in the system transitioning to state x' with probability p(x'|x,a) and the
decision maker receiving an immediate reward of R(x,a). The decision maker must then choose a new action to take
in x', and so on. The decision maker's behavior can be specified by a decision rule π : X → A, known as a policy. The
goal of optimization from a planning perspective is to find a policy that directs the system such that as much reward
as possible is collected. The specific objective that will be used here is to maximize the discounted expected reward
received over an infinite horizon with respect to a starting state. This quantity is called the value of a state and is
denoted V^π(x). Starting from state x_0, where x_t is the state at time t, this is given by:

V^π(x_0) = E[ Σ_{t=0}^{∞} γ^t R(x_t, π(x_t)) | x_0 ],    0 < γ < 1.        (1)

The discount parameter γ, 0 < γ < 1, controls the tradeoff between preferring short- or long-term rewards. The ultimate
goal of planning in an MDP is to solve for an optimal policy π* that satisfies π* = arg max_π V^π(x). When π is omitted
from the notation V^π(x), the value is assumed to be with respect to an optimal policy:

V(x) = V^{π*}(x) = max_π V^π(x).        (2)

V(x) can be represented recursively with respect to the values of the other states. This relationship is called the Bellman
equation: V^π(x) = R(x, π(x)) + γ Σ_{x'∈X} p(x'|x, π(x)) V^π(x'). Combined with the relationship of Equation (2), the
Bellman equation shows that:

V(x) = max_{a∈A} ( R(x,a) + γ Σ_{x'∈X} p(x'|x,a) V(x') ),        (3)

which also means that π* can be defined with respect to V(x) by the action which maximizes the argument:
π*(x) = arg max_{a∈A} ( R(x,a) + γ Σ_{x'∈X} p(x'|x,a) V(x') ). In the following sections, it will be assumed that
0 ≤ R(x,a) ≤ R_max for all x ∈ X and a ∈ A.
A key problem faced by MDPs is the curse of dimensionality: the tendency for the size of X to quickly become
intractable with increasing dimensionality of the state space. Section 2.3 and Section 2.4 will review two methods for
addressing this problem.
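As an illustration of the max-form Bellman equation (3), the following minimal Python sketch runs value iteration on a tiny, hypothetical tabular MDP; the transition and reward numbers are toy values chosen for illustration only and are not taken from the paper.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions (hypothetical values, for illustration only).
# P[a, x, y] = p(y | x, a), R[x, a] = immediate reward.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]])
gamma = 0.95

# Value iteration: repeatedly apply the max-form Bellman equation (3).
V = np.zeros(3)
for _ in range(1000):
    Q = R + gamma * np.einsum("axy,y->xa", P, V)   # Q[x, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy, the arg-max form of (3)
print("V* =", V, "policy =", policy)
```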
2.2 Linear Programming Solution Approach
The solution method that will be used here is a linear programming approach. Equation (3) can be used as a set of
bounds on V(x), with one bound for each action a ∈ A: V(x) ≥ R(x,a) + γ Σ_{x'∈X} p(x'|x,a) V(x') for all a ∈ A. The MDP can
be solved using these bounds for all states x ∈ X with the linear program [14]:

LP_P = min { Σ_{x∈X} V(x) : V(x) ≥ R(x,a) + γ Σ_{x'∈X} p(x'|x,a) V(x'),  ∀ x ∈ X, a ∈ A }.        (4)

There are |X| variables in LP_P; these variables are the state values V(x), one for each x ∈ X. LP_P also has
|X|·|A| constraints, one for each state-action pair; this approach scales poorly with high dimensionality.
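A minimal sketch of the exact linear program (4) on the same kind of toy MDP (hypothetical numbers, not from the paper), using scipy.optimize.linprog; each state-action pair contributes one constraint, which is exactly why the formulation scales poorly.

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP (hypothetical numbers, for illustration only).
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]])
gamma, n_states, n_actions = 0.95, 3, 2

# LP (4): minimize sum_x V(x) subject to
#   V(x) >= R(x,a) + gamma * sum_y p(y|x,a) V(y)   for every (x, a).
# Rewritten for linprog (A_ub v <= b_ub): gamma*P[a,x,:].V - V(x) <= -R(x,a).
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        row = gamma * P[a, x, :].copy()
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[x, a])

c = np.ones(n_states)   # objective: sum of state values
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("optimal values V* =", res.x)
```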
2.3 MDP and Simulation
Using Markov Decision Processes to study simulation models presents additional challenges. For one, the simulation
models we examine do not provide the state transition probability function p(x'|x,a) explicitly. Reinforcement Learning
deals with this problem of solving an MDP without prior knowledge of p(x'|x,a) [15]. One way to handle not
knowing p(x'|x,a) is to estimate it by running the simulation model, although this approach is problematic in practice when facing large state spaces.
We assume a type of simulation model called a generative model, defined by [8] to be a black box that takes as
an input any state-action pair (x,a) and returns a randomly sampled next state x' and the reward function value R(x,a).
This allows an estimate of p(x'|x,a) to be strengthened for an arbitrary x ∈ X. It is also possible to feed the generative
model's output back into itself, resulting in a trajectory of sequential states {x_0, x_1, ..., x_N}, which is the output typically considered a simulation run.
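A sketch of what such a generative-model interface might look like in Python; the GenerativeModel protocol, the rollout helper, and the RandomWalk example are our own illustrative names, not part of the paper.

```python
import random
from typing import Protocol, Tuple, List, Hashable

State = Hashable
Action = Hashable

class GenerativeModel(Protocol):
    """Black-box simulator in the sense of [8]: one sampled transition per call."""
    def step(self, x: State, a: Action) -> Tuple[State, float]:
        """Return a randomly sampled next state x' and the reward R(x, a)."""
        ...

def rollout(model: GenerativeModel, x0: State, policy, n_steps: int) -> List[State]:
    """Feed the model's output back into itself to obtain a trajectory x_0..x_N."""
    trajectory, x = [x0], x0
    for _ in range(n_steps):
        x, _reward = model.step(x, policy(x))
        trajectory.append(x)
    return trajectory

class RandomWalk:
    """Tiny example generative model: a biased walk on the integers."""
    def step(self, x: int, a: int) -> Tuple[int, float]:
        x_next = x + (1 if random.random() < 0.5 + 0.1 * a else -1)
        return x_next, float(max(x_next, 0))   # reward grows with the state

print(rollout(RandomWalk(), 0, policy=lambda x: 1, n_steps=5))
```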
2.4 Local Approximations
Examining only the part of the state space centered around (local to) the current system state is one method for dealing with
very large state spaces [7-10]. An H-neighborhood around state x_0 is defined to be the subset of states X̃ ⊆ X such
that every x' ∈ X̃ is reachable from x_0 within H transitions of the system. It is possible to find an
H-neighborhood when given a generative model by using a look-ahead tree [8]. Using a full look-ahead tree is often
impractical due to the exponential slowdown it experiences with increasing H. In Section 3 this will be avoided by
sampling depth first on only a small subset of the branches.
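A minimal sketch of that depth-first sampling idea: rather than expanding the full look-ahead tree, a few random depth-H trajectories are run through the black-box model and the visited states are collected. Function and parameter names are our own, illustrative choices.

```python
import random
from typing import Callable, Hashable, List, Set, Tuple

State = Hashable
Action = Hashable
GenStep = Callable[[State, Action], Tuple[State, float]]   # generative model

def sample_neighborhood(step: GenStep, x0: State, actions: List[Action],
                        H: int, n_branches: int) -> Set[State]:
    """Approximate the H-neighborhood of x0 by running a few random
    depth-H trajectories instead of expanding the full look-ahead tree."""
    visited = {x0}
    for _ in range(n_branches):
        x = x0
        for _ in range(H):
            x, _reward = step(x, random.choice(actions))
            visited.add(x)
    return visited

# Example with a trivial random-walk generative model:
walk = lambda x, a: (x + random.choice([-1, 1]), 0.0)
print(sample_neighborhood(walk, 0, actions=[0], H=3, n_branches=5))
```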
Given an H-neighborhood X̃, a local approximation method applied to planning in an MDP first solves for an optimal
policy considering just X̃, implements this policy for the current transition, and then repeats the process for the
state the system transitions into. This approach is advantageous since the size of X̃, and thus the complexity of solving
the approximation, can be set through H independently of the total number of states |X|.
The solution to a local model centered around x_0 and containing states X̃ can be found using the LP:

LP^H = min { V(x_0) : V(x) ≥ R(x,a) + γ Σ_{x'∈X̃} p(x'|x,a) V(x'),  ∀ x ∈ X̃, a ∈ A }.        (5)

LP^H exactly solves an H-neighborhood local approximation around x_0 [10].


2.5 Value Function Approximation
Another method for dealing with very large state spaces is to represent V(x) using a function with parameters θ [11-13].
We are specifically interested in approximating V(x) using a function that is linear with respect to θ, since this
will allow a linear programming approach [12]. This is done using a set of basis functions F, where the value of basis
function f ∈ F for state x is denoted f(x). Then V(x) can be approximated by a linear combination of these basis
functions as Ṽ(x|θ):

Ṽ(x|θ) = Σ_{f∈F} θ_f f(x).        (6)

The goal of this approach is to use far fewer basis functions than there are states and let solution methods solve for θ
instead of directly for V(x). Note that the basis functions f(x) may be nonlinear even though Ṽ(x|θ) is linear with
respect to θ. An overview of using basis functions to compactly represent functions is given in [16].
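A small sketch of the linear value-function approximation (6), with a handful of hand-picked basis functions for a two-dimensional state; the specific features and parameter values are arbitrary examples, not the ones used later in the paper.

```python
import numpy as np

# Example basis functions f(x) for a 2-dimensional state x (arbitrary choices).
basis = [
    lambda x: 1.0,            # constant feature
    lambda x: x[0],           # linear features
    lambda x: x[1],
    lambda x: x[0] * x[1],    # a nonlinear (quadratic) feature
]

def v_tilde(x, theta):
    """Equation (6): V~(x | theta) = sum_f theta_f * f(x)."""
    return sum(t * f(x) for t, f in zip(theta, basis))

theta = np.array([0.5, 1.0, -0.2, 0.03])   # parameters found by some solver
print(v_tilde(np.array([4.0, 7.0]), theta))
```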
The approximation Ṽ(x|θ) can be substituted into the exact LP_P (4) to obtain the approximate linear program shown
in (7), the solution to which approximates the solution of the MDP under study [12]:

LP_A = min { Σ_{x∈X} Σ_{f∈F} θ_f f(x) : Σ_{f∈F} θ_f f(x) ≥ R(x,a) + γ Σ_{x'∈X} p(x'|x,a) Σ_{f∈F} θ_f f(x'),  ∀ x ∈ X, a ∈ A }.        (7)

The only difference between LP_A and LP_P is the substitution of Ṽ for V. LP_A has only |F| variables, which can be far
fewer than the |X| variables of LP_P. Unfortunately, LP_A still has |X|·|A| constraints, and therefore an
intractable number of constraints when |X| is large. A solution has been proposed in which only a subset of the constraints
is imposed; the resulting LP is called a Reduced Linear Program [17]. While this approach only offers formal guarantees
under special conditions, it has been shown to work well in some applications, as illustrated by the queueing
network example of [17]. Unfortunately, LP_A is not suitable for direct use in a simulation setting since it requires prior
knowledge of X and p(x'|x,a), which are assumed to be unavailable.

3. Methodology
In Section 2, two different methods were examined for reducing the number of states that must be considered in finding
an approximate solution to an MDP. In Section 3.1, the method from Section 2.5 of linear programming with a
linear value function approximation will be modified to consider only the area local to some initial state x_0. Section
3.2 will show how this approximation can be used in an online fashion to approximately solve the MDP under study,
and Section 3.3 discusses some of this method's limitations and practical issues.
3.1 Incorporating Expert Policies into the Linear Programming Approach
The largest advantage enjoyed by methods of local approximation to MDPs, that they do not consider states outside of
X̃, is also a significant weakness. Take for example an extreme case where R(x,a) = 0 for all x ∈ X̃ but R(x',a) > 0 for all
x' ∈ X \ X̃. From the local model's point of view this example contains no rewards, and so the estimated value of all x ∈ X̃
will be zero and all policies will be thought optimal. In some sense, this error is due to the local approximation not
considering the portion of the state space that is interesting (where R(x,a) > 0). In this section we discuss how, through
the use of a given suggested policy, it is possible to sample the value of following that policy and include the result as a fixed
reward (by which we mean a constant reward) in the local model.
It is the case in many systems that an expert is able to provide a policy before any optimization. We know by the definition
in (2) that an expert's policy π_E has a value V^{π_E}(x) ≤ V(x). An important property of V^{π_E}(x) is that it can be
estimated with arbitrary precision through simulation; all that is needed is to repeatedly run the simulation starting
from x while following π_E, record the sum of discounted rewards received, and take their average. We will call this
estimated value V̂^E_x and note that V̂^E_x = V^{π_E}(x) + ε for some error ε. In this section we assume ε = 0, but in
Section 3.3 we propose a strategy to cope with the reality that |ε| > 0 due to sampling error.
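The Monte Carlo estimate V̂^E_x described above is simple to sketch. In the following Python fragment the generative-model step function, the expert policy, and the truncation horizon are all hypothetical stand-ins chosen for illustration.

```python
import random
from statistics import mean

def estimate_expert_value(step, expert_policy, x, gamma, n_runs, horizon):
    """Estimate V^E_x: average discounted return of following the expert
    policy from state x, truncated after `horizon` steps (gamma^horizon small)."""
    returns = []
    for _ in range(n_runs):
        state, total, discount = x, 0.0, 1.0
        for _ in range(horizon):
            state, reward = step(state, expert_policy(state))
            total += discount * reward
            discount *= gamma
        returns.append(total)
    return mean(returns)

# Example with a toy generative model (hypothetical; reward equals the state).
toy_step = lambda s, a: (max(s + random.choice([-1, a]), 0), float(s))
print(estimate_expert_value(toy_step, lambda s: 1, x=5, gamma=0.95,
                            n_runs=200, horizon=100))
```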
We pursue an approximate approach similar to that of Section 2.5, since it allows us to set our workload independently
of |X| by controlling how many constraints are sampled. This is especially important since, in problems of
high dimensionality, the H-neighborhood around x_0 might itself be too large to solve exactly. Unfortunately, what is
left after we enforce only a sample of the constraints local to x_0 from LP_A is an approximation of an approximation,
the performance of which is hard to guarantee.
The resulting LP, which we name RALP^H_E, is a Reduced Approximate Linear Program for the H-neighborhood around
an initial state x_0, assuming a given expert's policy π_E:
min   Σ_{x∈X̃(F1)} Σ_{f∈F} θ_f f(x)        (8)

subject to:

Σ_{f∈F} θ_f f(x) ≥ R̂(x,a) + γ Σ_{x'} p̂(x'|x,a) Σ_{f∈F} θ_f f(x'),   ∀ x ∈ X̃(F1), a ∈ A        (9)

Σ_{f∈F} θ_f f(x) ≥ V̂^E_x,   ∀ x ∈ X̃(F2)        (10)

This approximation is built on samples taken from a given generative model. The sets X̃(F1) and X̃(F2) are random
subsets of the H-neighborhood X̃. The constraint set (9) is built as follows. First the generative model is run from x_0
using a random action a_0, resulting in a next state x_1. The generative model is run again, this time from x_1 using a new
randomly chosen action a_1, to get a next state x_2. This process is repeated until x_H is reached; then one of the visited
states is chosen at random and added to the set X̃(F1). This process is repeated F1 times, resulting in |X̃(F1)| = F1 (it is
assumed there are no repeats). Next, one constraint of type (9) is built for each x_i ∈ X̃(F1) by sampling the generative
model C1 times for each a ∈ A. The resulting samples are used to form the approximations R̂(x,a) and p̂(x'|x,a) and the
constraint (9) for each state in X̃(F1). While constraint set (9) includes a summation over all next states x', this summation
only has to be performed over those x' for which p̂(x'|x,a) > 0, of which there are at most C1 (since only C1 samples were
taken to build p̂(x'|x,a)). X̃(F1) is also now used to build the objective function.

X̃(F2) is built in the same way as X̃(F1), except that instead of selecting a state to add at random from {x_0, x_1, ..., x_H},
x_H is chosen. V̂^E_x is then estimated from the average of (1) over C2 runs of D time steps while following π_E. Each
one of the C2 runs for each x_H ∈ X̃(F2) requires running the generative model for D time steps using policy π_E to get
{x_H, x_{H+1}, ..., x_{H+D}} and evaluating (1) for this sequence of states, actions, and rewards. The estimate V̂^E_x can then be
used to generate the corresponding constraint (10) for each x ∈ X̃(F2).
The solution to RALP^H_E is a value for the parameters θ. Since the interest here is in planning, we wish to know
what action to take from x_0. This can be found using θ, R̂, and p̂ by:

a_RALP = arg max_{a∈A} ( R̂(x_0,a) + γ Σ_{x'} p̂(x'|x_0,a) Σ_{f∈F} θ_f f(x') ).        (11)

For convenience this whole process, with its parameters, is referred to as RALP(x_0, π_E, H, D, C1, C2, F1, F2, γ) = a_RALP.
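To make the whole procedure concrete, the following Python sketch assembles a simplified version of RALP^H_E for a generic black-box model: it samples X̃(F1) and X̃(F2), forms R̂, p̂, and V̂^E, solves the reduced LP (8)-(10) with scipy.optimize.linprog, and returns the greedy action (11). All names, default parameter values, and the crude bounds placed on θ are our own illustrative choices, not the authors' implementation, and no feasibility checks are shown.

```python
import random
from collections import Counter

import numpy as np
from scipy.optimize import linprog

def ralp_action(step, x0, actions, expert_policy, basis,
                H=5, D=50, C1=10, C2=10, F1=30, F2=10, gamma=0.95):
    """Sketch of RALP(x0, pi_E, H, D, C1, C2, F1, F2, gamma) -> a_RALP.

    `step(x, a)` is a black-box generative model returning (x', R(x, a));
    `expert_policy(x)` returns the expert's action; `basis` is a list of
    callables f(x).  Everything here is an illustrative simplification.
    """
    phi = lambda x: np.array([f(x) for f in basis], dtype=float)
    k = len(basis)

    def random_trajectory():
        """Depth-H rollout from x0 under randomly chosen actions."""
        traj, x = [x0], x0
        for _ in range(H):
            x, _ = step(x, random.choice(actions))
            traj.append(x)
        return traj

    # X~(F1): a random state of each trajectory; X~(F2): the final state x_H.
    X_F1 = [random.choice(random_trajectory()) for _ in range(F1)]
    X_F2 = [random_trajectory()[-1] for _ in range(F2)]

    def sampled_model(x, a):
        """C1 generative-model samples -> (R_hat(x,a), p_hat(.|x,a))."""
        rewards, counts = [], Counter()
        for _ in range(C1):
            x_next, r = step(x, a)
            rewards.append(r)
            counts[x_next] += 1
        return float(np.mean(rewards)), {xn: c / C1 for xn, c in counts.items()}

    def expert_value(x):
        """Monte Carlo estimate of V^E_x: C2 expert rollouts of D steps."""
        total = 0.0
        for _ in range(C2):
            s, disc = x, 1.0
            for _ in range(D):
                s, r = step(s, expert_policy(s))
                total += disc * r
                disc *= gamma
        return total / C2

    # Objective (8): minimize sum over X~(F1) of theta . phi(x).
    c = sum(phi(x) for x in X_F1)

    # Constraints (9) and (10), rewritten as A_ub @ theta <= b_ub.
    A_ub, b_ub = [], []
    for x in X_F1:
        for a in actions:
            R_hat, p_hat = sampled_model(x, a)
            exp_phi = sum(p * phi(xn) for xn, p in p_hat.items())
            A_ub.append(gamma * exp_phi - phi(x))      # (9)
            b_ub.append(-R_hat)
    for x in X_F2:
        A_ub.append(-phi(x))                           # (10)
        b_ub.append(-expert_value(x))

    # Crude bounds on theta guard against an unbounded reduced LP;
    # a robust implementation would also check the solver status.
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(-1e6, 1e6)] * k)
    theta = res.x

    # Action choice (11): greedy one-step look-ahead from x0 under V~(.|theta).
    def q_hat(a):
        R_hat, p_hat = sampled_model(x0, a)
        return R_hat + gamma * sum(p * float(theta @ phi(xn))
                                   for xn, p in p_hat.items())
    return max(actions, key=q_hat)
```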
3.2 An Online Approximate Linear Programming Approach
In this section an algorithm is described that uses the approximate local model described in Section 3.1 to compute a
policy on demand. We call this policy π^H_E, and its value for a given state is computed as:

π^H_E(x) = RALP(x, π_E, H, D, C1, C2, F1, F2, γ).        (12)

This method is called an Online Approach because π^H_E is only computed as required; π^H_E(x) will only be evaluated if
state x is encountered. If only a subset of states is ever expected to be visited, this online approach is beneficial since
it limits the time spent planning policy choices for states that are never encountered.
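A sketch of the online loop implied by (12): the local decision rule is recomputed only at the states actually visited. The ralp_action function is assumed to be something like the sketch given after Section 3.1; all names and parameters here are illustrative.

```python
def run_online(step, ralp_action, x0, actions, expert_policy, basis,
               n_steps=200, **ralp_params):
    """Online control: a local decision rule is recomputed at every visited state."""
    x, total_reward = x0, 0.0
    for _ in range(n_steps):
        a = ralp_action(step, x, actions, expert_policy, basis, **ralp_params)
        x, r = step(x, a)          # advance the (simulated) system by one step
        total_reward += r
    return total_reward
```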

3.3 Practical Issues


One large issue the method described in Section 3.2 faces in practice is that of choosing suitable basis functions. In
many cases some intuition on this is available; in other cases it is necessary to guess. When forced to guess, it might
become necessary to include far more flexibility in the approximation than would otherwise be required. When using
a linear combination of basis functions as an approximation, increased flexibility comes from an increased number
of basis functions (higher |F|). This is undesirable from a computational standpoint since the resulting linear program's
size is directly related to |F|. One strength of our method over similar approaches is that, by limiting the states
examined to a local subset, the number of basis functions needed to provide a decent fit over that area should be smaller.
A second issue faced in practical applications is that, although assumed known exactly in Section 3.1, the estimate V̂^E_x
has uncertainty associated with it. This could be handled by using the worst-case confidence interval bound; however,
since the goal is to compute a next-step action in an approximate sense, it seems reasonable to continue assigning
V̂^E_x the mean of its samples, as described in Section 3.1.

4. Results
The proposed method is illustrated using a discrete time nation building simulation model with a structure simplified
from [4], then modified to have multiple regions and to be controllable. The model under consideration is of a labor
market and concerns the employment state of an area's population. Within each of the model's five regions, each resident
can occupy one of four employment states: unemployed, private sector employee, government employee, or criminal.
Unemployed persons may become employed as either private sector or government employees, or become criminals.
Likewise, private sector employees, government employees, and criminals may all become unemployed. In addition to
employment transitions, unemployed persons may emigrate to a different region. The allowable employment state
transitions are shown in Figure 1.

[Figure 1: Population transitions within region.]
The rates at which transitions occur follow binomial distributions with rates governed by the action choice. There are two
basic types of control actions in this model: Economic Stimulus, which favorably affects the unemployed to private
sector / government employee transition rates, and Crime Prevention, which favorably affects the unemployed to
criminal transition rate. These rates are given in Table 1 for the Crime Prevention action and Table 2 for the Economic
Stimulus action. Each action can be chosen for at most one region in each discrete time period, leading to a total of
|A| = 36 unique action settings.
Table 1: Effect of crime prevention action on rates between unemployed and criminal populations.

                            On                           Off
  From/To        Unemployed   Criminal        Unemployed   Criminal
  Unemployed         -          0.00              -          0.02
  Criminal         0.06          -              0.02          -

Table 2: Effect of economic stimulus action on rates between unemployed and govt./private employed populations.

                                 On                                        Off
  From/To        Unemployed   Private   Government        Unemployed   Private   Government
  Unemployed         -         0.025      0.038               -         0.010      0.010
  Private          0.000         -          -               0.009         -          -
  Government       0.000         -          -               0.009         -          -
The total population in the model is 500 residents spread across the five regions. Unemployed residents may migrate
from their current region to another: residents emigrating from Region 1 join Region 2, those leaving Region 2 join
Region 3, and so on, with Region 5 feeding back into Region 1. The number of unemployed residents migrating also
follows a binomial distribution, with a rate of 0.01.
Considered as an MDP, the state of this simulation is a 20-dimensional population vector (one dimension per population
group per region), for a total of |X| = 2.28 × 10^34 states. The reward function R(x,a) is defined to be the total
population across all five regions employed by the private sector or the government. The decision maker's goal is to
choose on which region to apply the control actions to direct the system toward maximum employment.
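To make the simulation concrete, the following sketch shows what one time step of such a labor-market generative model might look like for a single region, with binomial draws governing the flows between population groups. The rate constants are placeholders loosely based on Tables 1 and 2 as reconstructed above, and the sequential capping of the unemployed pool is our own simplification, not the authors' model code.

```python
import numpy as np

rng = np.random.default_rng(0)

def step_region(pop, stimulus_on, crime_prevention_on):
    """One discrete time step for a single region (a simplified sketch).

    pop holds population counts; each outflow is drawn from a binomial
    distribution whose rate depends on the action settings, in the spirit of
    Tables 1 and 2 (rates here are illustrative placeholders)."""
    u, p, g, c = pop["unemp"], pop["private"], pop["govt"], pop["criminal"]

    # Rates selected by the actions (placeholder values).
    to_private = 0.025 if stimulus_on else 0.010
    to_govt    = 0.038 if stimulus_on else 0.010
    job_loss   = 0.000 if stimulus_on else 0.009
    to_crime   = 0.000 if crime_prevention_on else 0.020
    crime_exit = 0.060 if crime_prevention_on else 0.020

    # Sample outflows; the unemployed pool is drawn down sequentially so it
    # cannot go negative.
    hired_p = rng.binomial(u, to_private)
    hired_g = rng.binomial(u - hired_p, to_govt)
    new_criminals = rng.binomial(u - hired_p - hired_g, to_crime)
    lost_p = rng.binomial(p, job_loss)
    lost_g = rng.binomial(g, job_loss)
    reformed = rng.binomial(c, crime_exit)

    return {
        "unemp":    u - hired_p - hired_g - new_criminals + lost_p + lost_g + reformed,
        "private":  p + hired_p - lost_p,
        "govt":     g + hired_g - lost_g,
        "criminal": c + new_criminals - reformed,
    }

# One-period reward contribution of this region would be private + govt counts.
region = {"unemp": 50, "private": 0, "govt": 0, "criminal": 50}
print(step_region(region, stimulus_on=True, crime_prevention_on=False))
```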
The model was tested using the method of local approximation described in Section 3, with the parameter choices given
in Table 3. As required by the method proposed in Section 3, π_E is defined as: apply the economic stimulus and
crime prevention efforts to the region with the highest total population, with ties broken at random. The set of basis
functions F for the method is defined as the set of polynomials that span all polynomials of degree two centered on
x_0, so there are a total of |F| = (20 choose 2) + 20 + 20 + 1 = 231 basis functions.

Table 3: Detailed parameter settings.

    H    C1    C2     F1     F2     D      γ
   20    20    50    1000   100    100    0.95
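A short sketch of how such a degree-two polynomial basis centered on x_0 could be generated for the 20-dimensional state (constant, linear, squared, and pairwise cross terms, 231 features in total); the construction below is our own illustration of one way to obtain these features.

```python
import numpy as np
from itertools import combinations

def quadratic_basis(x0):
    """Basis functions spanning all polynomials of degree <= 2 centered on x0:
    1 constant + 20 linear + 20 squared + C(20,2) cross terms = 231 features."""
    n = len(x0)
    feats = [lambda x: 1.0]
    feats += [lambda x, i=i: x[i] - x0[i] for i in range(n)]
    feats += [lambda x, i=i: (x[i] - x0[i]) ** 2 for i in range(n)]
    feats += [lambda x, i=i, j=j: (x[i] - x0[i]) * (x[j] - x0[j])
              for i, j in combinations(range(n), 2)]
    return feats

x0 = np.zeros(20)
basis = quadratic_basis(x0)
print(len(basis))                     # 231
print(basis[5](np.arange(20.0)))      # evaluate one feature at a sample state
```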

Testing was done from two starting states. In the first starting state, each region has 50 unemployed residents, 50
criminal residents, and no private or government employees. In the second starting state, Region 1 has 75 private
sector and 75 government employees and no unemployed or criminal residents, while the other four regions have 100
unemployed residents. The second starting state was chosen to demonstrate a condition where π_E performs poorly.
The model was run repeatedly from each starting state to compare the performance of three policy choices: random
action selection, π_E, and the method of local approximation from Section 3. Runs were carried out for 200 discrete
time steps from the first starting state and 20 from the second. The shorter time period from the second starting condition
was chosen to illustrate the ability of RALP^H_E to recover in a state area where π_E alone performs poorly. Results
are reported as a 99% confidence interval on the average reward received over the course of a run. No confidence
interval bounds are reported for the random and π_E cases since they have short runtimes and were run enough times
(10,000 each) to make their confidence interval widths insignificant. The results are presented in Table 4.
Table 4: Results from two starting states of the three tested policies.

                        Random      π_E              RALP^H_E
                         Mean       Mean       CI-     Mean     CI+
  Starting Point 1       171.7      179.5      175.9   180.8    185.8
  Starting Point 2       218.4      209.1      222.6   225.9    229.1
As shown by Table 4, RALP^H_E outperformed random action selection from both starting points. While the mean
performance of RALP^H_E also exceeded the mean performance of π_E from each starting state, this difference was not
statistically significant for Starting Point 1. The second starting point demonstrates how RALP^H_E can still perform
better than random choices on this model even when the expert policy it uses, π_E, cannot. Unfortunately, the state space
of this model is too large to permit an optimal solution, so it is not possible to report on performance with respect to
optimality.

5. Conclusions
Using an online method to compute policy choices is advantageous if the effort required to compute a single choice
approximately is far less than that required to compute the entire policy at once. This is the case when each policy choice
is computed based on only the subset of an MDP's state space near the current state x_0, since the size of the
H-neighborhood around x_0 depends on H, not on the total number of states. Section 3 discussed how the estimates
obtained from these local approximations can be strengthened using a reasonable policy guess, and Section 4 showed
this to be the case empirically. Future work will focus on improving this approach and further testing its performance.
One specific area for improvement is that, while the method presented here focuses on improving an approximation of
the value function, this may not always result in an improved policy.


References
1. Dobbins, J., Jones, S. G., Crane, K., and DeGrasse, B. C., 2007. The Beginner's Guide to Nation-Building. RAND
Corporation.
2. Dobbins, J., 2005. Nation-Building: UN Surpasses U.S. on Learning Curve. Technical report, RAND Corporation.
3. Choucri, N., Electris, C., Goldsmith, D., Mistree, D., Madnick, S., Morrison, J. B., Siegel, M., and Sweitzer-Hamilton,
M., 2005. Understanding and Modeling State Stability: Exploiting System Dynamics. Technical
report, Composite Information Systems Laboratory, Sloan School of Management, Massachusetts Institute of
Technology.
4. Richardson, D. B., Deckro, R. F., and Wiley, V. D., 2004. Modeling and Analysis of Post-Conflict Reconstruction.
The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology, 1(4), 201-213.
5. Taylor, M. E. and Stone, P., 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of
Machine Learning Research, 10, 1633-1685.
6. Bertsekas, D. P. and Castanon, D. A., 1999. Rollout Algorithms for Stochastic Scheduling Problems. Journal of
Heuristics, 5(1), 89-108.
7. Chang, H. S. and Marcus, S. I., 2003. Approximate Receding Horizon Approach for Markov Decision Processes:
Average Reward Case. Journal of Mathematical Analysis and Applications, 286, 636-651.
8. Kearns, M., Mansour, Y., and Ng, A. Y., 2002. A Sparse Sampling Algorithm for Near-Optimal Planning in Large
Markov Decision Processes. Machine Learning, 49, 193-208.
9. Leizarowitz, A. and Shwartz, A., 2008. Exact Finite Approximations of Average-Cost Countable Markov Decision
Processes. Automatica, 44, 1480-1487.
10. Heinz, S., Kaibel, V., Peinhardt, M., Rambau, J., and Tuchscherer, A., 2006. LP-Based Local Approximation for
Markov Decision Problems. Technical report, Konrad-Zuse-Zentrum für Informationstechnik Berlin.
11. Tsitsiklis, J. N. and Van Roy, B., 1997. An Analysis of Temporal-Difference Learning with Function Approximation.
IEEE Transactions on Automatic Control, 42(5), 674-690.
12. de Farias, D. P. and Van Roy, B., 2003. The Linear Programming Approach to Approximate Dynamic Programming.
Operations Research, 51(6), 850-865.
13. Powell, W., 2007. Approximate Dynamic Programming. John Wiley & Sons Inc.
14. Puterman, M., 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley &
Sons Inc.
15. Kaelbling, L., Littman, M., and Moore, A., 1996. Reinforcement Learning: A Survey. Journal of Artificial
Intelligence Research, 4, 237-285.
16. Bishop, C., 2006. Pattern Recognition and Machine Learning. Springer.
17. de Farias, D. P. and Van Roy, B., 2004. On Constraint Sampling in the Linear Programming Approach to Approximate
Dynamic Programming. Mathematics of Operations Research, 29(3), 462-478.
