Copyright © Javier R. Movellan
April 20, 2010
Please cite as:
Movellan J. R. (2009) Primer on Stochastic Optimal Control. MPLab Tutorials, University of California San Diego.
1 Conventions
Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for fixed parameters and important functions. We leave implicit the properties of the probability space (Ω, F, P) in which the random variables are defined. Notation of the form X ∈ ℝⁿ is shorthand for X : Ω → ℝⁿ, i.e., the random variable X takes values in ℝⁿ. We use E for expected values and Var for variance. When the context makes it clear, we identify probability functions by their arguments. For example, p(x, y) is shorthand for the joint probability mass or joint probability density that the random variable X takes the specific value x and the random variable Y takes the value y. Similarly, E[Y | x] is shorthand for the expected value of the variable Y given that the random variable X takes value x. We use subscripted colons to indicate sequences, e.g., X1:t = {X1, ..., Xt}. Given a random variable X and a function f we use df(X)/dX to represent a random variable that maps values of X into the derivative of f evaluated at the values taken by X. When safe we gloss over the distinction between discrete and continuous random variables. Unless stated otherwise, conversion from one to the other simply calls for the use of integrals and probability density functions instead of sums and probability mass functions.
Optimal policies are presented in terms of maximization of a reward function. Equivalently, they could be presented as minimization of costs, simply by setting the cost function equal to the reward with opposite sign (e.g., reward/cost, maximize/minimize, value/cost-to-go).
2 Finite Horizon Problems
Consider a stochastic process {(Xt, Ut, Ct, Rt) : t = 1:T} where Xt is the state of the system, Ut the action, Ct the control law specific to time t, i.e., Ut = Ct(Xt), and Rt a reward process (aka utility, cost, etc.). We use the convention that an action Ut is produced at time t after Xt is observed (see Figure 1). This results in a new state Xt+1 and a reward Rt that can depend on Xt, Ut and on the future state Xt+1. This point of view has the disadvantage that the reward Rt looks into the future, i.e., we need to know Xt+1 to determine Rt. The advantage is that the approach is more natural for situations in which Rt depends only on Xt, Ut; in this special case Rt does not look into the future. In any case, all the derivations work for the more general case in which the reward may depend on Xt, Ut, Xt+1.
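As a concrete illustration of this timing convention, here is a minimal Matlab sketch of a single episode. The dynamics, reward, and control law used here are illustrative assumptions, not part of the original text.

T = 10; gamma = 1;
step   = @(x, u) x + u + randn;        % example dynamics: X_{t+1} = X_t + U_t + Z_t
reward = @(x, u, xn) -xn^2;            % example reward that looks one step ahead
c      = @(t, x) -x;                   % example control law U_t = C_t(X_t)
x = 5; R = zeros(1, T);                % initial state and reward record
for t = 1:T
    u  = c(t, x);                      % action produced after X_t is observed
    xn = step(x, u);                   % new state X_{t+1}
    R(t) = reward(x, u, xn);           % R_t may depend on X_t, U_t, X_{t+1}
    x = xn;
end
Rbar1 = sum(gamma .^ (0:T-1) .* R);    % discounted return starting at t = 1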
Remark 2.1. Alternative Conventions. In some cases it is useful to think of the action at time t as having an instantaneous effect on the state, which evolves at a longer time scale. This is equivalent to the convention adopted here but with the action shifted by one time step, i.e., Ut in our convention corresponds to Ut−1 in the instantaneous-action-effect convention.
This section focuses on episodic problems of fixed length, i.e., each episode starts at time 1 and ends at a fixed time T ≥ 1.
Our goal is to find a control law c1, c2, ..., cT which maximizes a performance function of the following form

ρ(c1:T) = E[R̄1 | c1:T]    (1)

where

R̄t = Σ_{τ=t}^{T} γ^{τ−t} Rτ ,   t = 1, ..., T    (2)

i.e.,

R̄t = Rt + γ R̄t+1    (3)

When γ ∈ [0, 1] it is called the discount factor because it tends to discount rewards that occur far into the future. If γ > 1 then future rewards become more important than present rewards. Note that R̄T = RT.
We let the optimal value function v̄t be defined as follows

v̄t(xt) = max_{ct:T} E[R̄t | xt, ct:T]    (4)
Let c̄t+1:T be a policy that maximizes E[R̄t+1 | xt+1, ct+1:T] for all xt+1, and let c̄t(xt) maximize E[R̄t | xt, ct, c̄t+1:T] for all xt, with c̄t+1:T fixed. Then

E[R̄t | xt, c̄t:T] = max_{ct:T} E[R̄t | xt, ct:T]    (7)

for all xt.
Proof. Note that

E[R̄t+1 | xt, ct, xt+1, ct+1:T] = E[R̄t+1 | xt+1, ct+1:T]    (12)

Using Lemma 8.1 and the fact that there is a policy c̄t+1:T that maximizes E[R̄t+1 | xt+1, ct+1:T] for all xt+1, it follows that

max_{ct+1:T} E[R̄t+1 | xt, ct:T] = max_{ct+1:T} Σ_{xt+1} p(xt+1 | xt, ct) E[R̄t+1 | xt+1, ct+1:T]
  = Σ_{xt+1} p(xt+1 | xt, ct) max_{ct+1:T} E[R̄t+1 | xt+1, ct+1:T]    (13)
  = Σ_{xt+1} p(xt+1 | xt, ct) E[R̄t+1 | xt+1, c̄t+1:T] = E[R̄t+1 | xt, ct, c̄t+1:T]    (14)

Since E[R̄t | xt, ct:T] = E[Rt | xt, ct] + γ E[R̄t+1 | xt, ct:T], the maximization over ct:T separates: c̄t+1:T maximizes the second term for any choice of ct, and c̄t then maximizes the resulting expression over ct, which proves (7).
Remark 2.2. The optimality principle suggests a recursive way of finding optimal policies: it is easy to find an optimal policy at the terminal time T. For each state xT such a policy chooses an action that maximizes the terminal reward RT, i.e.,

E[RT | xT, c̄T] = max_{cT} E[RT | xT, cT]    (16)

Provided we have an optimal policy c̄t+1:T for times t+1, ..., T, we can leave it fixed and then optimize with respect to ct. This allows us to recursively compute an optimal policy, starting at time T and working our way down to time 1.
The optimality principle leads to the Bellman Optimality Equation, which we state here as a corollary of the Optimality Principle.

Corollary 2.1 (Bellman Optimality Equation).

v̄t(xt) = max_{ut} E[Rt + γ v̄t+1(Xt+1) | xt, ut]    (17)

for t = 1, ..., T, where

E[v̄t+1(Xt+1) | xt, ut] = Σ_{xt+1} p(xt+1 | xt, ut) v̄t+1(xt+1)    (18)

and

v̄T+1(x) def= 0, for all x    (19)
Proof. Obvious for t = T. For t < T revisit equation (13) to get

max_{ct+1:T} E[R̄t+1 | xt, ct:T] = Σ_{xt+1} p(xt+1 | xt, ct) max_{ct+1:T} E[R̄t+1 | xt+1, ct+1:T]    (20)
  = Σ_{xt+1} p(xt+1 | xt, ct) v̄t+1(xt+1) = E[v̄t+1(Xt+1) | xt, ct]    (21)
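For a finite state and action space the backward recursion of Corollary 2.1 can be implemented directly. The following Matlab sketch uses a randomly generated tabular model in place of the user's transition probabilities p(xt+1 | xt, ut) and expected rewards E[Rt | xt, ut]; all numerical values are illustrative assumptions.

% Backward induction (finite-horizon value iteration), eqs. (17)-(19).
% P(i,j,u) = p(X_{t+1}=j | X_t=i, U_t=u),  R(i,u) = E[R_t | X_t=i, U_t=u].
nx = 3; nu = 2; T = 20; gamma = 0.9;
P = rand(nx, nx, nu); P = P ./ sum(P, 2);          % example transition model
R = randn(nx, nu);                                 % example reward model
V = zeros(nx, T+1);                                % V(:,T+1) = 0, eq. (19)
policy = zeros(nx, T);
for t = T:-1:1
    Q = zeros(nx, nu);
    for u = 1:nu
        Q(:, u) = R(:, u) + gamma * P(:, :, u) * V(:, t+1);   % eq. (18)
    end
    [V(:, t), policy(:, t)] = max(Q, [], 2);       % eq. (17): maximize over actions
end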
Remark 2.3. It is useful to clarify the assumptions made to prove the optimality principle:

Assumption 1:
E[Rt | xt, ct:T] = E[Rt | xt, ct]    (22)

Assumption 2:
p(xt+1 | xt, ct:T) = p(xt+1 | xt, ct)    (23)

Assumption 3:
E[R̄t+1 | xt, ct, xt+1, ct+1:T] = E[R̄t+1 | xt+1, ct+1:T]    (24)
Assumption 4: Most importantly, we assumed that the optimal policy c̄t+1:T did not impose any constraints on the set of policies ct with respect to which we were performing the optimization. This would be violated if there were an additional penalty or reward that depended directly on ct:T. For example, this assumption would be violated if we were to force the policies of interest to be stationary. This would amount to putting a large penalty on policies that do not satisfy c1 = c2 = ... = cT.
Figure 1 displays a process that satisfies Assumptions 1-3. Note that under this model the reward depends on the start state, the end state, and the action. In addition, we let the reward depend on the control law itself. This allows, for example, the set of available actions to depend on the current time and state.
Figure 1: State process, action process Ut, controller Ct, and reward process Rt under the convention adopted in this document.
Remark 2.4. Note that the derivations did not require the standard Markovian assumption, i.e.,

p(xt+1 | x1:t, c1:t) = p(xt+1 | xt, ct)    (25)
Remark 2.5. Consider now the case in which the admissible control laws are of the form

Ut = Ct(Xt) ∈ At(Xt)    (26)

where At(xt) is the set of available actions when visiting state xt at time t. We can frame this problem by implicitly adding a large negative constant to the reward function when Ct chooses inadmissible actions. In this case the Bellman equation reduces to the following form

v̄t(xt) = max_{ut ∈ At(xt)} E[Rt + γ v̄t+1(Xt+1) | xt, ut]    (27)
Remark 2.6. Now note that we could apply the restriction that the only admissible action at time t given xt is exactly the action chosen by a given policy ct. This leads to the Bellman Equation for the Value of a given policy

vt(xt, ct:T) = E[Rt + γ vt+1(Xt+1, ct+1:T) | xt, ct]    (28)

where

vt(xt, ct:T) = E[R̄t | xt, ct:T]    (29)

is the value of visiting state xt at time t given policy ct:T.
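Equation (28) can be evaluated with the same backward sweep used for the optimality equation, replacing the maximization by the action prescribed by the policy. A minimal Matlab sketch, reusing the tabular model P, R, gamma, T, nx and the array policy from the previous sketch:

% Policy evaluation, eq. (28): backward recursion for a fixed policy.
Vpi = zeros(nx, T+1);                      % v_{T+1} = 0
for t = T:-1:1
    for i = 1:nx
        u = policy(i, t);                  % action prescribed by c_t at state i
        Vpi(i, t) = R(i, u) + gamma * P(i, :, u) * Vpi(:, t+1);
    end
end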
Remark 2.7. Note that the Bellman equation cannot be used to solve the open loop control problem, i.e., the problem obtained by restricting the set of allowable control laws to open loop laws. Such laws would be of the form

Ut = ct(X1)    (30)

which would violate Assumption 4, since

E[R2 | x2, c2] ≠ E[R2 | x1, x2, c1:2]    (31)
Remark 2.8 (Sutton and Barto (1998), Reinforcement Learning, page 76, step leading to equation (3.14)). Since assuming stationary policies violates Assumption 4, this step in Sutton and Barto's proof is not valid. The results are correct, however, since for the infinite horizon case it is possible to prove Bellman's equation using other methods (see Bertsekas' book, for example).
Remark 2.9. A problem of interest occurs when the set of possible control laws is a parameterized collection. In the general case such a problem will involve interdependencies between the different ct, i.e., the constraints on C cannot be expressed as

Σ_{t=0}^{T} ft(Ct)    (32)

which is required for the optimality principle to work. For example, if c1:T is implemented as a feed-forward neural network parameterized by the weights w, then the policy would be stationary, i.e., c1 = c2 = ... = cT, a constraint that cannot be expressed using (32). The problem can be approached by having time be one of the inputs to the model.
Let the action Ut ∈ [0, 1] represent a gamble of Ut Xt dollars. Thus, using no discount factor (γ = 1), Bellman's optimality equation takes the following form

Thus, since K is a constant with respect to x, the optimal policy will be identical at times T − 2, T − 3, ..., 1, i.e., the optimal gambling policy is the same at every time step.
Consider a linear dynamical system

Xt+1 = at Xt + bt Ut + ct Zt

where Xt ∈ ℝⁿ is the system's state, at ∈ ℝ^(n×n), Ut ∈ ℝ^m is a control signal, bt ∈ ℝ^(n×m), Zt ∈ ℝ^d are zero mean, independent random vectors with covariance equal to the identity matrix, and ct ∈ ℝ^(n×d). Our goal is to find a control sequence ut:T = ut, ..., uT that minimizes the following cost

Rt = Xt′ qt Xt + Ut′ gt Ut    (45)

where the state cost matrix qt is symmetric positive semidefinite and the control cost matrix gt is symmetric positive definite. Thus the goal is to keep the state Xt as close as possible to zero, while using small control signals.
We define the value at time t of a state xt given a policy π and terminal time T ≥ t as follows

vt(xt, π) = Σ_{τ=t}^{T} γ^{τ−t} E[Rτ | xt, π]    (46)
Thus, for a linear control law ut = κt xt the value function is quadratic in the state, vt(xt) = xt′ νt xt + βt, with

vt(xt) = xt′ ( qt + κt′ gt κt + (at + bt κt)′ νt+1 (at + bt κt) ) xt    (59)
  + Tr(ct′ νt+1 ct) + βt+1    (60)

Taking the gradient of the value with respect to the gain κt gives

∂vt(xt)/∂κt = 2 ( gt κt + bt′ νt+1 (at + bt κt) ) xt xt′    (73)
This gradient approach is useful for adaptive approaches to non-stationary problems and for iterative approaches to solving non-linear control problems via linearizations.
The optimal value of κt can also be found by setting the gradient to zero and solving the resulting algebraic equation, which gives

κt = −(gt + bt′ νt+1 bt)⁻¹ bt′ νt+1 at    (75)
Taking the gradient with respect to ut in a manner similar to how we did above for κt, we get

ut = −Kt xt    (85)

Kt def= (gt + b′ νt+1 b)⁻¹ b′ νt+1 a    (86)
3.4 Summary of Equations for Optimal Policy
Let

νT = qT    (87)
uT = 0    (88)

and, for t = T − 1, ..., 1,

Kt = (gt + b′ νt+1 b)⁻¹ b′ νt+1 a
νt = qt + a′ νt+1 a − a′ νt+1 b Kt
ut = −Kt xt    (91)

where the constant term of the value function satisfies

βt = βt+1 + Tr(c′ νt+1 c)    (94)
Below is Matlab code for the system and cost

% X_{t+1} = a_t X_t + b u_t + c Z_t
% R_t = X_t' q_t X_t + U_t' g_t U_t
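% A minimal sketch of the backward recursion summarized above, assuming
% time-invariant a, b, c, q, g (illustrative example values).
n = 2; m = 1; T = 50;
a = [1 0.1; 0 1]; b = [0; 0.1]; c = 0.01 * eye(n);
q = eye(n); g = 0.1;
nu = zeros(n, n, T); beta = zeros(1, T); K = zeros(m, n, T);
nu(:, :, T) = q;                               % nu_T = q_T, eq. (87)
for t = T-1:-1:1
    K(:, :, t)  = (g + b' * nu(:, :, t+1) * b) \ (b' * nu(:, :, t+1) * a);
    nu(:, :, t) = q + a' * nu(:, :, t+1) * a ...
                  - a' * nu(:, :, t+1) * b * K(:, :, t);
    beta(t)     = beta(t+1) + trace(c' * nu(:, :, t+1) * c);   % eq. (94)
end
% Optimal control at time t: u_t = -K(:,:,t) * x_t, eq. (91).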
Remark 3.4. Suppose the cost function is of the form

Rt = (Xt − θt)′ qt (Xt − θt) + Ut′ gt Ut

where θ1:T is a desired sequence of states. We can handle this case by augmenting the system as follows

X̄t+1 = ā X̄t + b̄ Ut + c̄ Zt

where

X̄t = [ Xt ; θt ; 1 ] ∈ ℝ^(2n+1)    (97)

ā = [ a(n×n)   0(n×n)   0(n×1)
      0(n×n)   I(n×n)   δt
      0(1×n)   0(1×n)   1      ] ∈ ℝ^((2n+1)×(2n+1))    (98)

b̄ = [ b(n×m) ; 0(n×m) ; 0(1×m) ] ∈ ℝ^((2n+1)×m)    (99)

c̄ = [ c(n×d) ; 0(n×d) ; 0(1×d) ] ∈ ℝ^((2n+1)×d)    (100)

where δt def= θt+1 − θt and we use subscripts as a reminder of the dimensionality of the matrices. The return function is now strictly quadratic on the extended state space

Rt = X̄t′ q̄t X̄t + Ut′ gt Ut    (103)

where

q̄t = [ qt       −qt      0(n×1)
       −qt       qt      0(n×1)
       0(1×n)   0(1×n)   0      ] ∈ ℝ^((2n+1)×(2n+1))    (104)
3.5 Example
Consider the simple case in which Xt+1 = a Xt + Ut + c Zt, i.e., b = I: at time t we are at xt and we want to get as close to zero as possible at the next time step. There is no cost for the size of the control signal, so qt = I, gt = 0, and the target state is zero. Thus we have

νT = I    (107)
uT = 0    (108)
Kt = (gt + b′ νt+1 b)⁻¹ b′ νt+1 a = a
νt = I    (115)
ut = −a xt , for t = 1, ..., T − 1    (116)

In this case all the controller does is anticipate the most likely next state (i.e., a xt) and compensate for it so that the expected value of the state at the next time step is zero.
We can add a drag force proportional to vt, constant through the period [t, t + Δt], and a random force constant through the same period:

[ xt+Δt ]   [ 1   Δt      ] [ xt ]   [ Δt²/2 ]        [ σ Δt²/2    0     ] [ Z1,t ]
[ vt+Δt ] = [ 0   1 − μΔt ] [ vt ] + [ Δt    ] ut  +  [ 0          σ Δt  ] [ Z2,t ]    (120)

where μ is the drag coefficient and σ scales the random force.
We can express this as a 2-dimensional discrete time system

x̄t+Δt = a x̄t + b ut + c Zt    (121)

where

x̄t = [ xt ; vt ],  a = [ 1  Δt ; 0  1 − μΔt ],  b = [ Δt²/2 ; Δt ],  c = [ σ Δt²/2  0 ; 0  σ Δt ]    (122)

and solve the problem of finding an optimal application of forces to keep the system at a desired location and/or velocity while minimizing energy consumption.
Figure 2 shows results of a simulation (Matlab code available) for a point mass moving along a line. The mass is located at -10 at time zero. There is a constant quadratic cost for applying a force at every time step, and a large quadratic cost at the terminal time (the goal is to be at the origin with zero velocity by 10 seconds). Note the inverted-U shape of the obtained velocity. Also note that the system applies a positive force during the first half of the run and then a negative force (brakes) that becomes increasingly larger as it gets close to the desired location. Note this would have been hard to do with a standard proportional controller (a change of sign in the applied force from positive early on to negative as we get close to the objective).
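A minimal sketch of such a simulation, using the backward recursion of Section 3.4; the mass, costs, and noise level are illustrative assumptions and not necessarily those used for Figure 2.

% Point mass on a line: LQR with control cost at every step and a large
% terminal cost at T (assumed illustrative parameters).
dt = 0.01; T = 1000;                      % 10 seconds at 100 steps/sec
a  = [1 dt; 0 1]; b = [dt^2/2; dt]; c = 0.1 * [dt^2/2 0; 0 dt];
g  = 1e-4;                                % control cost per step
qT = 1e4 * eye(2);                        % large terminal cost
q  = zeros(2);                            % no state cost before T
nu = qT; K = zeros(T, 2);
for t = T-1:-1:1                          % backward Riccati recursion
    K(t, :) = (g + b' * nu * b) \ (b' * nu * a);
    nu = q + a' * nu * a - a' * nu * b * K(t, :);
end
x0 = [-10; 0]; X = zeros(2, T); X(:, 1) = x0; U = zeros(1, T); x = x0;
for t = 1:T-1                             % forward simulation
    U(t) = -K(t, :) * x;
    x = a * x + b * U(t) + c * randn(2, 1);
    X(:, t+1) = x;
end
plot((1:T) * dt, X(1, :));                % position over time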
Figure 2: Simulation of the point-mass problem. Top: position, velocity, and control (force/mass) as a function of time (centisecs). Bottom: position gain and velocity gain as a function of time (centisecs).
The stationary control function

ut = −K xt    (124)
K = (b′ ν b + g)⁻¹ b′ ν a    (125)

where ν is the stationary solution of the Riccati recursion, minimizes the long term average cost

lim_{T→∞} (1/T) Σ_{t=1}^{T} E[Xt′ q Xt + Ut′ g Ut]    (126)

The average of the constant terms satisfies

β̄t = ((t − 1)/t) β̄t−1 + (1/t) Tr(c′ νt c)    (127)

and in the stationary case

β̄ = ((t − 1)/t) β̄ + (1/t) Tr(c′ ν c)    (128)
β̄ = Tr(c′ ν c)    (129)
4.1 Example
We want to control
Xt+1 = Xt + Ut + Zt (131)
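For this scalar system (a = b = c = 1) the stationary gain can be obtained by iterating the Riccati recursion to convergence. A minimal Matlab sketch, with assumed state and control costs q and g:

% Stationary LQR gain for X_{t+1} = X_t + U_t + Z_t (a = b = c = 1).
a = 1; b = 1; c = 1; q = 1; g = 1;        % assumed example costs
nu = q;
for k = 1:1000                            % iterate the Riccati recursion
    K  = (b * nu * b + g) \ (b * nu * a);
    nu = q + a * nu * a - a * nu * b * K;
end
K                                         % stationary gain, u_t = -K * x_t
avgCost = c * nu * c;                     % average cost per step, cf. eq. (129)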
the action at time t is produced after Yt is observed. This action is determined by a controller Ct whose input is Y1:t, U1:t−1, i.e., the information observed up to time t, and whose output is the action at time t, i.e.,

Ut = Ct(Ot)    (133)

Ot = (Y1:t, U1:t−1)    (134)

The model is specified by a distribution for the initial state X1,

p(x1)    (135)

a sensor model

p(yt | xt, ut−1)    (136)

and a state dynamics model

p(xt+1 | xt, ut)    (137)
Remark 5.1. Alternative Conventions. Under our convention the effect of actions is not instantaneous, i.e., the action at time t affects the state and the observation at time t + 1. In some cases it is useful to think of the effect of actions as occurring at a shorter time scale than the state dynamics. In such cases it may be useful to model the distribution of observations at time t as being determined by the state and action at time t. Under this convention, Ut corresponds to what we call Ut+1 (see right side of Figure 3).
It may also be useful to think of it this way: Xt generates Yt, which is used by the controller Ct to generate Ut.
Our goal will be to find a controller that optimizes a performance function of the same form as before.
Figure 3: Left: The convention adopted in this document. Arrows represent de-
pendency relationships between variables. Dotted figures indicate unobservable
variables, continuous figures indicate observable variables. Under this conven-
tion the effect of actions is not instantaneous. Right: Alternative convention.
Under this convention the effect of actions is instantaneous.
Assumption 2:
Assumption 3:
Remark 5.2. The catch is that the number of states needed to represent the observable process grows exponentially with time. For example, if we have binary observations and actions, the number of possible information states by time t is 4^t. Thus it is critical to summarize all the available information.
Remark 5.3. Open Loop Policies. We can model open loop processes as special cases of partially observable control processes. In such cases the state is observed at time 1, but thereafter the observation process is uninformative (e.g., it could be a constant).
Assumption 1:
S1 = f1 (Y1 ) (143)
Assumption 2:
St+1 = ft (St , Yt+1 , Ut ) (144)
Assumption 3:
E[Rt | ot , ut ] = E[Rt | st , ut ] (145)
Assumption 4:
p(yt+1 | y1:t , u1:t ) = p(yt+1 | st , ut ) (146)
and thus the optimal action at time T depends only on sT. We will now show that if this is true at time t + 1 then it is also true at time t:

min_{ut} E[Rt + γ v̄t+1(St+1) | ot, ut] = min_{ut} E[Rt + γ v̄t+1(St+1) | st, ut] def= v̄t(st)    (150)

where st def= ft(ot). Thus, we only need to keep track of st to find the optimal policy with respect to ot.
Moreover, note that the update of the posterior distribution only requires the current posterior distribution, which becomes a prior, and the new action and observation:

p(xt+1 | y1:t+1, u1:t) ∝ p(yt+1 | xt+1, ut) Σ_{xt} p(xt | y1:t, u1:t) p(xt+1 | xt, ut)    (153)
E[Rt | ot, ut] = Σ_{xt} p(xt | ot, ut) Rt(xt, ut) = E[Rt | st, ut]    (154)

and

p(yt+1 | y1:t, u1:t) = Σ_{xt, xt+1} p(xt | y1:t, u1:t−1) p(xt+1 | xt, ut) p(yt+1 | xt+1)    (155)
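The belief update (153) and the predictive distribution (155) are straightforward to implement for a finite state space. A minimal Matlab sketch, where the transition model P and sensor model O are randomly generated stand-ins for the user's model:

% Discrete Bayes filter: one update of the posterior (belief) over X_t.
% P(i,j,u) = p(X_{t+1}=j | X_t=i, U_t=u),  O(j,y) = p(Y_{t+1}=y | X_{t+1}=j).
nx = 4; ny = 3; nu = 2;
P = rand(nx, nx, nu); P = P ./ sum(P, 2);       % example transition model
O = rand(nx, ny);     O = O ./ sum(O, 2);       % example sensor model
belief = ones(nx, 1) / nx;                      % p(x_t | y_1:t, u_1:t-1)
u = 1; ynext = 2;                               % example action and observation
pred   = P(:, :, u)' * belief;                  % predictive state distribution
py     = O(:, ynext)' * pred;                   % p(y_{t+1} | y_1:t, u_1:t), eq. (155)
belief = O(:, ynext) .* pred / py;              % posterior at t+1, eq. (153)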
Consider again a linear dynamical system

Xt+1 = a Xt + b Ut + c Zt

where Xt ∈ ℝⁿ is the system's state, a ∈ ℝ^(n×n), Ut ∈ ℝ^m is a control signal, b ∈ ℝ^(n×m), Zt ∈ ℝ^d are zero mean, independent random vectors with covariance equal to the identity matrix, and c ∈ ℝ^(n×d). Our goal is to find a control sequence ut:T = ut, ..., uT that minimizes the following cost

Rt = Xt′ qt Xt + Ut′ gt Ut    (159)

where the state cost matrix qt is symmetric positive semidefinite and the control cost matrix gt is symmetric positive definite. Thus the goal is to keep the state Xt as close as possible to zero, while using small control signals.
Let

Ot def= (Y1:t, U1:t−1)    (160)

represent the information available at time t. We will solve the problem by assuming that the optimal cost-to-go is of the form

v̄t(ot) = E[Xt′ νt Xt | ot] + βt(ot)    (161)
First note that since g is positive definite, the optimal control at time T is uT = 0. Thus

v̄T(oT) = E[XT′ qT XT | oT] = E[XT′ νT XT | oT] + βT(oT)    (162)

and our assumption is correct for the terminal time T with

νT = qT ,  βT(oT) = 0    (163)

Expanding v̄t using the dynamics, the fact that E[Zt,i Zt,j | xt, ut] = δi,j, and the assumption that E[βt+1(Ot+1) | ot, ut] does not depend on ut, we obtain

v̄t(ot) = E[Xt′ qt Xt | ot] + E[βt+1(Ot+1) | ot] + min_{ut} E[(a Xt + b ut)′ νt+1 (a Xt + b ut) + ut′ gt ut | ot, ut]    (168)

The minimization part is equivalent to the one presented in (323) with the following equivalences: b → b, x → ut, a → νt+1, C → a Xt, d → gt. Thus, using (329),

ut = −κt E[Xt | ot]    (170)

where

κt = λt a    (171)
λt def= (b′ νt+1 b + gt)⁻¹ b′ νt+1    (172)
We will later show that the last term is constant with respect to u1:t. Thus the optimal control depends on the observations only through E[Xt | ot].
Consider the process X̃, Ỹ obtained by setting the controls to zero, which shares initial distribution and noise variables Z, W with the processes X, Y, H defined in previous sections.
Thus

E[Xt | ot] = a^(t−1) E[X1 | ot] + Σ_{τ=1}^{t−1} a^(t−1−τ) b uτ + Σ_{τ=1}^{t−1} a^(t−1−τ) c E[Zτ | ot]    (191)

E[X̃t | ot] = a^(t−1) E[X1 | ot] + Σ_{τ=1}^{t−1} a^(t−1−τ) c E[Zτ | ot]    (192)

then

Ỹt = Yt − k Σ_{τ=1}^{t−1} a^(t−1−τ) b Uτ    (196)

and therefore knowing o1:t determines õ1:t = ỹ1:t. Thus

E[X̃t | ot] = E[X̃t | ỹ1:t]    (197)

and

Xt − E[Xt | ot] = X̃t − E[X̃t | ỹ1:t]    (198)

which is constant with respect to u1:t−1.
Remark 6.1. Note that the control equations for the partially observable case are identical to the control equations for the fully observable case, but using E[Xt | ot] instead of xt.
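A minimal sketch of this certainty-equivalence controller, combining the gains of Section 3 with a Kalman filter estimate of E[Xt | ot]. The sensor model Yt = k Xt + w Wt and all numerical values are assumptions for illustration.

% Certainty-equivalence control: apply the fully observable gains to the
% state estimate E[X_t | o_t] computed with a Kalman filter.
% Assumed model: X_{t+1} = a X_t + b u_t + c Z_t,  Y_t = k X_t + w W_t.
n = 2; T = 200;
a = [1 0.1; 0 1]; b = [0; 0.1]; c = 0.05 * eye(n);
k = [1 0]; w = 0.1; q = eye(n); g = 0.1;
% Backward pass: LQR gains (identical to the fully observable case).
nu = q; K = zeros(T, n);
for t = T-1:-1:1
    K(t, :) = (g + b' * nu * b) \ (b' * nu * a);
    nu = q + a' * nu * a - a' * nu * b * K(t, :);
end
% Forward pass: Kalman filter + control u_t = -K_t E[X_t | o_t].
x = [1; 0]; xhat = zeros(n, 1); S = eye(n);     % state, estimate, estimate cov.
for t = 1:T-1
    u = -K(t, :) * xhat;
    x = a * x + b * u + c * randn(n, 1);        % true (hidden) state
    y = k * x + w * randn;                      % observation
    xhat = a * xhat + b * u;                    % predict
    S = a * S * a' + c * c';
    G = S * k' / (k * S * k' + w^2);            % Kalman gain
    xhat = xhat + G * (y - k * xhat);           % correct with observation
    S = (eye(n) - G * k) * S;
end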
7 Continuous Time Control
For this section we recommend reading first the tutorial on stochastic differential equations, particularly the section on Ito's rule.
Consider a dynamical system governed by the following system of stochastic differential equations

dXt = a(Xt, Ut) dt + c(Xt, Ut) dBt

where Bt is standard Brownian motion. Given a policy π, with Ut = π(Xt, t), we define the value function

v(xt, t) = E[ ∫_t^T e^(−(s−t)/τ) r(Xs, Us, s) ds + g(XT) e^(−(T−t)/τ) | xt, π ]

where r is the instantaneous reward function, τ is the time constant for the temporal discount of the reward, and g is the terminal reward.
v(xt, t) = E[ ∫_t^(t+Δ) e^(−(s−t)/τ) r(Xs, Us, s) ds | xt, π ]
  + E[ ∫_(t+Δ)^T e^(−(s−(t+Δ))/τ) r(Xs, Us, s) ds | xt, π ] e^(−Δ/τ)    (206)
  = E[ ∫_t^(t+Δ) e^(−(s−t)/τ) r(Xs, Us, s) ds | xt, π ] + E[ v(Xt+Δ, t+Δ) | xt, π ] e^(−Δ/τ)

Thus

E[ v(Xt+Δ, t+Δ) | xt, π ] e^(−Δ/τ) − v(xt, t) = −E[ ∫_t^(t+Δ) e^(−(s−t)/τ) r(Xs, Us, s) ds | xt, π ]    (207)

Letting f(Δ) def= E[ v(Xt+Δ, t+Δ) | xt, π ], the left hand side can be written as

E[ v(Xt+Δ, t+Δ) | xt, π ] e^(−Δ/τ) − v(xt, t) = f(Δ) e^(−Δ/τ) − f(0)    (210)
It is easy to verify that for any differentiable function f

lim_{Δ→0} ( f(Δ) e^(−Δ/τ) − f(0) ) / Δ = −(1/τ) f(0) + ḟ(0)    (211)

Thus

lim_{Δ→0} (1/Δ) ( E[ v(Xt+Δ, t+Δ) | xt, π ] e^(−Δ/τ) − v(xt, t) ) = dv(xt, t)/dt − (1/τ) v(xt, t)    (212)

where the total derivative of v is defined as follows

dv(xt, t)/dt = lim_{Δ→0} (1/Δ) ( E[ v(Xt+Δ, t+Δ) | xt, π ] − v(xt, t) )    (213)

Thus taking limits on the left hand side and right hand side of (207) we get

dv(xt, t)/dt − (1/τ) v(xt, t) = −r(xt, ut, t)    (214)
We will now expand the total derivative dv(xt, t)/dt. For s ≥ t let

Ys def= v(Xs, s)    (215)
Us def= π(Xs, s)    (217)
vt(x, s) def= ∂v(x, s)/∂s    (218)
vx(x, s) def= ∂v(x, s)/∂x    (219)
vxx(x, s) def= ∂²v(x, s)/∂x ∂x′    (220)
c²(x, u) def= c(x, u) c(x, u)^T    (221)

Then, by Ito's rule,

dE[Yt]/dt = dv(xt, t)/dt = vt(xt, t) + vx(xt, t) a(xt, ut) + ½ Tr( c²(xt, ut) vxx(xt, t) )    (222)
Putting together (214) and (222) we get the Hamilton-Jacobi-Bellman (HJB) equation for the value function of a policy π:

(1/τ) v(x, t) = r(x, u, t) + vt(x, t) + vx(x, t) a(x, u) + ½ Tr( c²(x, u) vxx(x, t) )
u = π(x, t)
v(x, T) = g(x)
    (223)
Numerical Solution: We have the value of v for time T. If we can get the first and second derivatives of v with respect to x we can then use the HJB equation to obtain ∂v(x, T)/∂t. We can then use this to find v for time step T − Δt:

v(x, T − Δt) ≈ v(x, T) − Δt ∂v(x, T)/∂t    (224)

We can then progress backwards in time until we reach the starting time t.
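A rough 1-D finite-difference sketch of this backward procedure for a fixed policy; the dynamics, noise, reward, policy, discount, and grid are illustrative assumptions.

% Backward-in-time finite-difference sketch of the HJB equation (223) for a
% fixed policy (1-D example with assumed model).
tau = 1; T = 1; dt = 1e-3; dx = 0.05;
x = (-3:dx:3)';
afun = @(x, u) -x + u;            % assumed drift a(x,u)
cfun = @(x, u) 0.2;               % assumed noise magnitude c(x,u)
rfun = @(x, u, t) -x.^2 - u.^2;   % assumed instantaneous reward
pol  = @(x, t) -0.5 * x;          % assumed policy u = pi(x,t)
gfun = @(x) -x.^2;                % terminal reward
v = gfun(x);                      % v(x,T) = g(x)
for t = T:-dt:dt
    u   = pol(x, t);
    vx  = gradient(v, dx);        % first derivative in x
    vxx = gradient(vx, dx);       % second derivative in x
    vt  = v / tau - rfun(x, u, t) - vx .* afun(x, u) ...
          - 0.5 * cfun(x, u).^2 .* vxx;      % solve eq. (223) for dv/dt
    v   = v - dt * vt;            % eq. (224): step from t to t - dt
end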
Since at the extremum π takes the value of the optimal policy,

(1/τ) v(x, t) = sup_π { r(x, π(x)) + vt(x, t) + vx(x, t) a(x, π(x)) + ½ Tr( c²(x, π(x)) vxx(x, t) ) }    (228)

And since the only part of the equation that depends on π is u = π(x), the HJB equation for the optimal value function follows:

(1/τ) v(x, t) = sup_u { r(x, u) + vt(x, t) + vx(x, t) a(x, u) + ½ Tr( c²(x, u) vxx(x, t) ) }
v(x, T) = g(x)
    (229)
If we now make the reward independent of the time t and let the horizon become infinite, the value function will also be independent of t. Thus the derivative of v with respect to time must be zero and the HJB equation for the value function of a policy follows

(1/τ) v(x) = r(x, u) + vx(x) a(x, u) + ½ Tr( c²(x, u) vxx(x) )
u = π(x)
    (231)

Using the same logic, we get the HJB equation for the optimal value function

(1/τ) v(x) = sup_u { r(x, u) + vx(x) a(x, u) + ½ Tr( c²(x, u) vxx(x) ) }    (232)
Consider now a system with control-affine dynamics

dXt = ( a(Xt) + b(Xt) Ut ) dt + c(Xt) dBt    (233)

and optimal value function

v(x, t) def= max_π E[ ∫_t^T e^(−(s−t)/τ) r(Xs, Us) ds + gT(XT) | Xt = x, π ]    (234)

where Us = π(Xs) and the instantaneous reward takes the following form

r(x, u) def= g(x) − u^T q u    (235)
In this case the HJB equation looks as follows

(1/τ) v(x, t) = max_u { g(x) − u^T q u + ∂v(x, t)/∂t + a(x)^T ∂v(x, t)/∂x + u^T b(x)^T ∂v(x, t)/∂x
  + ½ Tr[ c(x) c(x)^T ∂²v(x, t)/∂x² ] }    (236)

Most importantly, the maximum over u can be computed analytically. Taking the gradient of the right hand side of (236) with respect to u and setting it to zero we get

−2 q u + b(x)^T ∂v(x, t)/∂x = 0    (237)

Thus the optimal action is

û = ½ q⁻¹ b(x)^T ∂v(x, t)/∂x    (238)

If q is not full rank then there are infinitely many optimal actions. We can choose one by using the pseudo-inverse of q. We need to be careful about q. For example, consider the 1-D case: if we let q → 0 the optimal gain goes to infinity, which basically sets the state to zero in an infinitesimal time dt.
Substituting the optimal action into the HJB equation we get

(1/τ) v(x, t) = g(x) − û^T q û + ∂v(x, t)/∂t + a(x)^T ∂v(x, t)/∂x + 2 û^T q û + ½ Tr[ c(x) c(x)^T ∂²v(x, t)/∂x² ]    (239)

Simplifying, the HJB equation for the optimal value function looks as follows

−∂v(x, t)/∂t = −(1/τ) v(x, t) + g(x) + û^T q û + a(x)^T ∂v(x, t)/∂x + ½ Tr[ c(x) c(x)^T ∂²v(x, t)/∂x² ]    (240)

with

v(x, t) = E[ ∫_t^T r(Xs, Us) e^(−(s−t)/τ) ds | Xt = x, π ]    (242)
where

Us = π(Xs)    (243)
r(x, u) = −(x − θ)^T p (x − θ) − u^T q u    (244)

where the target state θ can be a function of time. This corresponds to the problem of having the state Xt track the trajectory θt. We assume the value function takes the following form

v(x, t) = −x^T νt x + 2 μt^T x + βt    (245)

Thus,

∂v(x, t)/∂x = 2 (μt − νt x)    (246)
∂²v(x, t)/∂x² = −2 νt    (247)
∂v(x, t)/∂t = −x^T ν̇t x + 2 μ̇t^T x + β̇t    (248)

where

νt = (νt + νt^T)/2 (i.e., νt is taken to be symmetric)    (249)
ν̇t = dνt/dt    (250)
μ̇t = dμt/dt    (251)
β̇t = dβt/dt    (252)
Consider the optimal HJB equation (240)

−∂v(x, t)/∂t = −(1/τ) v(x, t) + g(x) + u^T q u + (∂v(x, t)/∂x)^T a x + ½ Tr[ c c^T ∂²v(x, t)/∂x² ]    (253)

where

g(x) = −(x − θ)^T p (x − θ)    (254)
u(x) = ½ q⁻¹ b^T ∂v(x, t)/∂x = q⁻¹ b^T (μt − νt x)    (255)

Thus

x^T ν̇t x − 2 μ̇t^T x − β̇t = (1/τ) x^T νt x − (2/τ) μt^T x − (1/τ) βt − (x − θt)^T p (x − θt)    (256)
  + (μt − νt x)^T b q⁻¹ b^T (μt − νt x)    (257)
  + 2 (μt − νt x)^T a x − Tr[c^T νt c]    (258)
Expanding some terms

x^T ν̇t x − 2 μ̇t^T x − β̇t = (1/τ) x^T νt x − (2/τ) μt^T x − (1/τ) βt    (259)
  − x^T p x + 2 θt^T p x − θt^T p θt    (260)
  + x^T νt b q⁻¹ b^T νt x − 2 μt^T b q⁻¹ b^T νt x + μt^T b q⁻¹ b^T μt    (261)
  + 2 μt^T a x − 2 x^T νt a x − Tr[c^T νt c]    (262)

Gathering quadratic, linear, and constant terms we get the continuous time Riccati equations

ut(x) = κt − λt x
κt = q⁻¹ b^T μt
λt = q⁻¹ b^T νt
v(x, t) = −x^T νt x + 2 μt^T x + βt
ν̇t = (1/τ) νt − p + νt b q⁻¹ b^T νt − 2 νt a
μ̇t = (1/τ) μt − p^T θt + νt b q⁻¹ b^T μt − a^T μt    (263)
β̇t = (1/τ) βt + θt^T p θt − μt^T b q⁻¹ b^T μt + Tr[c^T νt c]
νt = (νt + νt^T)/2
νT = (p + p^T)/2
μT = (p + p^T)/2 θT
βT = −θT^T p θT

For the standard regulator problem the target is θt = 0. The update equation for μ then shows that μT = 0 and therefore μt = 0 for all t.
Thus the update equations for the linear quadratic regulator are as follows

ut(x) = −kt x    (267)
kt = q⁻¹ b^T νt    (268)
ν̇t = (1/τ) νt − p + νt b q⁻¹ b^T νt − 2 νt a    (269)
β̇t = (1/τ) βt + Tr[c^T νt c]    (270)
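A minimal sketch integrating these ordinary differential equations backward in time with Euler steps; the system matrices, costs, and horizon are illustrative assumptions.

% Backward Euler integration of the continuous-time Riccati equations
% (267)-(270) for the regulator case (target = 0, mu_t = 0).
n = 2; tau = 1; Tend = 5; dt = 1e-3;
a = [0 1; 0 -0.5]; b = [0; 1]; c = 0.1 * eye(n);
p = eye(n); q = 1;
nu = (p + p') / 2;                     % nu_T = (p + p')/2
beta = 0;                              % beta_T = 0
for t = Tend:-dt:dt
    nudot   = nu / tau - p + nu * b * (q \ b') * nu - 2 * nu * a;
    betadot = beta / tau + trace(c' * nu * c);
    nu   = nu - dt * nudot;            % step backward from t to t - dt
    beta = beta - dt * betadot;
    nu = (nu + nu') / 2;               % keep nu symmetric
    k  = q \ (b' * nu);                % feedback gain, u(x) = -k x, eqs. (267)-(268)
end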
Suppose we approximate the value function as a linear combination of nf features,

v(x, t) ≈ Σ_{i=1}^{nf} φi(x) wi(t) = φ(x)^T w(t)    (271)

where φ : ℝ^(nx) → ℝ^(nf) is a known function that maps each state x into nf features of that state, and w(t) ∈ ℝ^(nf) is an unknown weight vector that tells us how to combine the state features to obtain the value function of a state. Thus

∂v(x, t)/∂x = Σ_{i=1}^{nf} (∂φi(x)/∂x) wi(t) = ψ(x) w(t)    (272)

where

ψ(x) def= ∂φ(x)/∂x    (273)
Discretizing in time,

∂v(x, t)/∂t ≈ (1/Δt) v(x, t + Δt) − (1/Δt) v(x, t)    (279)
  = (1/Δt) v(x, t + Δt) − (1/Δt) φ(x)^T w(t)    (280)

Collecting terms constant, linear, and quadratic with respect to w we get

g(x) + (1/Δt) v(x, t + Δt)
  + [ ψ(x)^T a(x) + h(x) − (1/τ + 1/Δt) φ(x) ]^T w(t)
  + ¼ w(t)^T ψ(x)^T b(x) q⁻¹ b(x)^T ψ(x) w(t) = 0    (281)

where h(x) is an nf-dimensional vector whose ith element is defined as follows

hi(x) = ½ Tr[ c(x) c(x)^T ∂²φi(x)/∂x² ]    (282)
Our goal is to find a value of w that satisfies (281). To do so we collect a sample {x1, x2, ..., xns} of states and define an error function which captures the extent to which the HJB equation is violated. For time step t we want to find w(t) that minimizes

Σ_{i=1}^{ns} ( ai(t) + bi(t)^T w(t) + w(t)^T ci(t) w(t) )²

where

ai(t) = g(xi) + (1/Δt) v(xi, t + Δt)    (286)
bi(t) = ψ(xi)^T a(xi) + h(xi) − (1/τ + 1/Δt) φ(xi)    (287)
ci(t) = ¼ ψ(xi)^T b(xi) q⁻¹ b(xi)^T ψ(xi)    (288)

This is a Quadratic Regression problem that can be solved using iterative methods (see Appendix). Unfortunately this problem has local minima (or difficult plateaus), so it is important to get good starting points. The solution for time T is unique and we can use it as the starting point for time T − Δt. Provided Δt is small, this should be a good starting solution. For some reason, starting points close to zero also seem to work well. Note that to compute the ai(t) terms we need v(x, t + Δt). We can thus solve the problem by doing a backward pass, starting at time T.
Another important issue is to have enough samples so that the regression problem used to estimate w(t) is not underconstrained. If the number of samples is small, one possibility is to use something like Bayesian regression, which allows for sequential learning of the parameters.
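A minimal sketch of one backward step of this procedure: the terms ai, bi, ci of (286)-(288) are assembled at sample states and w(t) is fit by numerically minimizing the squared violation of (281). The 1-D features, dynamics, costs, and the weight vector assumed for time t + Δt are all illustrative assumptions.

% One backward step of the value-function fit: minimize the squared
% violation of eq. (281) over sample states (1-D illustrative example).
tau = 1; dt = 0.01; qc = 1;                 % discount, time step, control cost
afun = @(x) -x; bfun = @(x) 1; cfun = @(x) 0.5;
gfun = @(x) -x.^2;                          % state reward g(x)
phi  = @(x) [ones(size(x)), x, x.^2];       % features (1, x, x^2)
psi  = @(x) [zeros(size(x)), ones(size(x)), 2*x];    % d phi / dx
hrow = 0.5 * cfun(0)^2 * [0, 0, 2];         % h(x), eq. (282), constant here
xs = linspace(-2, 2, 41)'; ns = numel(xs);  % sample states
wnext = [-1; 0; -1];                        % assumed w(t + dt) from previous step
vnext = phi(xs) * wnext;                    % v(x_i, t + dt)
ai = gfun(xs) + vnext / dt;                                   % eq. (286)
bi = psi(xs) .* afun(xs) + repmat(hrow, ns, 1) ...
     - (1/tau + 1/dt) * phi(xs);                              % eq. (287)
err = @(w) sum((ai + bi * w + 0.25 * (psi(xs) * w).^2 ...
                * bfun(0)^2 / qc).^2);      % squared violation of eq. (281)
w = fminsearch(err, wnext);                 % start from the solution at t + dt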
What's Needed
a(x), b(x) can be learned from examples using non-linear regression.

7.7 Example
We will approximate the value function using linear and quadratic features

φ(x) = (1, x, x²)^T    (297)

Thus

a = b = c = 1    (298)
ψ(x) = (0, 1, 2x)    (299)
∂²φ1(x)/∂x² = 0    (300)
∂²φ2(x)/∂x² = 0    (301)
∂²φ3(x)/∂x² = 2    (302)
h1(x) = 0    (303)
h2(x) = 0    (304)
h3(x) = 1    (305)
Thus, at iteration k of the quadratic regression (Lemma 8.4), the regression targets and regressors evaluated at each sample state xi are

yi = wk^T ci wk − ai(t)
Θi,· = bi(t)^T + 2 wk^T ci

with ai, bi, ci given by (286)-(288).
8 Appendix
Lemma 8.1. If wi ≥ 0 for all i and θ̄ maximizes f(i, θ) for all i, then

max_θ Σ_i wi f(i, θ) = Σ_i wi max_θ f(i, θ)    (313)

Proof. Since wi ≥ 0,

max_θ Σ_i wi f(i, θ) ≤ Σ_i wi max_θ f(i, θ)    (314)

Moreover,

max_θ Σ_i wi f(i, θ) ≥ Σ_i wi f(i, θ̄) = Σ_i wi max_θ f(i, θ)    (315)
Proof. Let θ̄i be such that

f(i, θ̄i) = max_θ f(i, θ)    (318)

and let θ* be such that

Σ_i wi f(i, θ*) = max_θ Σ_i wi f(i, θ)    (319)

Then, when (313) holds,

Σ_i wi ( f(i, θ̄i) − f(i, θ*) ) = 0    (320)

Thus, since

f(i, θ̄i) − f(i, θ*) ≥ 0    (321)

it follows that, for every i with wi > 0,

f(i, θ*) = f(i, θ̄i) = max_θ f(i, θ)    (322)
Lemma 8.3 (Quadratic Minimization). Let

ρ(x) = (b x + C)^T a (b x + C) + x^T d x    (323)

where a and d are symmetric positive definite matrices and C is a random vector with the same dimensionality as bx. Taking the Jacobian with respect to x, applying the chain rule, and setting it to zero we have

b^T a (b x̂ + C) + d x̂ = 0

This is commonly known as the Normal Equation. Thus the value x̂ that minimizes ρ is

x̂ = −h C    (328)

where

h = (b^T a b + d)⁻¹ b^T a    (329)

Moreover, the value at the minimum is

ρ(x̂) = C^T a C − C^T h^T b^T a C = C^T k C    (337)

where

k = a − h^T b^T a
Lemma 8.4 (Quadratic Regression). We want to minimize

ρ(w) = Σ_i ( ai + bi^T w + w^T ci w )²    (340)

This can be done iteratively. Given a current estimate wk we linearize the quadratic term around wk,

w^T ci w ≈ wk^T ci wk + 2 wk^T ci (w − wk) = −wk^T ci wk + 2 wk^T ci w    (341)

so that each iteration reduces to a linear least squares problem with targets and regressors

yi = −ai + wk^T ci wk    (343)
Θi,· = bi^T + 2 wk^T ci