Copyright © Javier R. Movellan
April 20, 2010
Please cite as:
Movellan J. R. (2009) Primer on Stochastic Optimal Control. MPLab Tutorials, University of California San Diego.
1 Conventions
Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for fixed parameters and important functions. We leave implicit the properties of the probability space (Ω, F, P) in which the random variables are defined. Notation of the form X ∈ ℝⁿ is shorthand for X : Ω → ℝⁿ, i.e., the random variable X takes values in ℝⁿ. We use E for expected values and Var for variance. When the context makes it clear, we identify probability functions by their arguments. For example, p(x, y) is shorthand for the joint probability mass or joint probability density that the random variable X takes the specific value x and the random variable Y takes the value y. Similarly, E[Y | x] is shorthand for the expected value of the variable Y given that the random variable X takes value x. We use subscripted colons to indicate sequences, e.g., X1:t = {X1, ..., Xt}. Given a random variable X and a function f we use df(X)/dX to represent a random variable that maps values of X into the derivative of f evaluated at the values taken by X. When safe we gloss over the distinction between discrete and continuous random variables. Unless stated otherwise, conversion from one to the other simply calls for the use of integrals and probability density functions instead of sums and probability mass functions.
Optimal policies are presented in terms of maximization of a reward function. Equivalently, they could be presented as minimization of costs, simply by setting the cost function equal to the reward with opposite sign (e.g., reward/cost, maximize/minimize, value/cost-to-go).
2 Finite Horizon Problems
Consider a stochastic process {(Xt, Ut, Ct, Rt) : t = 1:T} where Xt is the state of the system, Ut the action, Ct the control law specific to time t, i.e., Ut = Ct(Xt), and Rt a reward process (aka utility, cost, etc.). We use the convention that an action Ut is produced at time t after Xt is observed (see Figure 1). This results in a new state Xt+1 and a reward Rt that can depend on Xt, Ut and on the future state Xt+1. This point of view has the disadvantage that the reward Rt looks into the future, i.e., we need to know Xt+1 to determine Rt. The advantage is that the approach is more natural for situations in which Rt depends only on Xt, Ut; in this special case Rt does not look into the future. In any case, all the derivations work for the more general case in which the reward may depend on Xt, Ut, Xt+1.
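As a concrete illustration of this timing convention, here is a minimal Matlab sketch of a single episode. The dynamics, reward, and control law used here are illustrative assumptions, not part of the original text.

T = 10; gamma = 1;
step   = @(x, u) x + u + randn;        % example dynamics: X_{t+1} = X_t + U_t + Z_t
reward = @(x, u, xn) -xn^2;            % example reward that looks one step ahead
c      = @(t, x) -x;                   % example control law U_t = C_t(X_t)
x = 5; R = zeros(1, T);                % initial state and reward record
for t = 1:T
    u  = c(t, x);                      % action produced after X_t is observed
    xn = step(x, u);                   % new state X_{t+1}
    R(t) = reward(x, u, xn);           % R_t may depend on X_t, U_t, X_{t+1}
    x = xn;
end
Rbar1 = sum(gamma .^ (0:T-1) .* R);    % discounted return starting at t = 1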
Remark 2.1. Alternative Conventions. In some cases it is useful to think of the action at time t as having an instantaneous effect on the state, which evolves at a longer time scale. This is equivalent to the convention adopted here but with the action shifted by one time step, i.e., Ut in our convention corresponds to Ut−1 in the instantaneous-action-effect convention.
This section focuses on episodic problems of fixed length, i.e., each episode starts at time 1 and ends at a fixed time T ≥ 1.
Our goal is to find a control law c1, c2, ..., cT which maximizes a performance function of the following form

ρ(c1:T) = E[R̄1 | c1:T]    (1)

where

R̄t = Σ_{τ=t}^{T} γ^{τ−t} Rτ ,   t = 1, ..., T    (2)

i.e.,

R̄t = Rt + γ R̄t+1    (3)

When γ ∈ [0, 1] it is called the discount factor because it tends to discount rewards that occur far into the future. If γ > 1 then future rewards become more important than present rewards. Note that R̄T = RT.
We let the optimal value function v̄t be defined as follows

v̄t(xt) = max_{ct:T} E[R̄t | xt, ct:T]    (4)
Let c̄t+1:T be a policy that maximizes E[R̄t+1 | xt+1, ct+1:T] for all xt+1, and let c̄t(xt) maximize E[R̄t | xt, ct, c̄t+1:T] for all xt, with c̄t+1:T fixed. Then

E[R̄t | xt, c̄t:T] = max_{ct:T} E[R̄t | xt, ct:T]    (7)

for all xt.
Proof. Note that

E[R̄t+1 | xt, ct, xt+1, ct+1:T] = E[R̄t+1 | xt+1, ct+1:T]    (12)

Using Lemma 8.1 and the fact that there is a policy c̄t+1:T that maximizes E[R̄t+1 | xt+1, ct+1:T] for all xt+1, it follows that

max_{ct+1:T} E[R̄t+1 | xt, ct:T] = max_{ct+1:T} Σ_{xt+1} p(xt+1 | xt, ct) E[R̄t+1 | xt+1, ct+1:T]
  = Σ_{xt+1} p(xt+1 | xt, ct) max_{ct+1:T} E[R̄t+1 | xt+1, ct+1:T]    (13)
  = Σ_{xt+1} p(xt+1 | xt, ct) E[R̄t+1 | xt+1, c̄t+1:T] = E[R̄t+1 | xt, ct, c̄t+1:T]    (14)

Since E[R̄t | xt, ct:T] = E[Rt | xt, ct] + γ E[R̄t+1 | xt, ct:T], the maximization over ct:T separates: c̄t+1:T maximizes the second term for any choice of ct, and c̄t then maximizes the resulting expression over ct, which proves (7).
Remark 2.2. The optimality principle suggests a recursive way of finding optimal policies: it is easy to find an optimal policy at the terminal time T. For each state xT such a policy chooses an action that maximizes the terminal reward RT, i.e.,

E[RT | xT, c̄T] = max_{cT} E[RT | xT, cT]    (16)

Provided we have an optimal policy c̄t+1:T for times t+1, ..., T, we can leave it fixed and then optimize with respect to ct. This allows us to recursively compute an optimal policy, starting at time T and working our way down to time 1.
The optimality principle leads to the Bellman Optimality Equation, which we state here as a corollary of the Optimality Principle.

Corollary 2.1 (Bellman Optimality Equation).

v̄t(xt) = max_{ut} E[Rt + γ v̄t+1(Xt+1) | xt, ut]    (17)

for t = 1, ..., T, where

E[v̄t+1(Xt+1) | xt, ut] = Σ_{xt+1} p(xt+1 | xt, ut) v̄t+1(xt+1)    (18)

and

v̄T+1(x) def= 0, for all x    (19)
Proof. Obvious for t = T. For t < T revisit equation (13) to get

max_{ct+1:T} E[R̄t+1 | xt, ct:T] = Σ_{xt+1} p(xt+1 | xt, ct) max_{ct+1:T} E[R̄t+1 | xt+1, ct+1:T]    (20)
  = Σ_{xt+1} p(xt+1 | xt, ct) v̄t+1(xt+1) = E[v̄t+1(Xt+1) | xt, ct]    (21)
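For a finite state and action space the backward recursion of Corollary 2.1 can be implemented directly. The following Matlab sketch uses a randomly generated tabular model in place of the user's transition probabilities p(xt+1 | xt, ut) and expected rewards E[Rt | xt, ut]; all numerical values are illustrative assumptions.

% Backward induction (finite-horizon value iteration), eqs. (17)-(19).
% P(i,j,u) = p(X_{t+1}=j | X_t=i, U_t=u),  R(i,u) = E[R_t | X_t=i, U_t=u].
nx = 3; nu = 2; T = 20; gamma = 0.9;
P = rand(nx, nx, nu); P = P ./ sum(P, 2);          % example transition model
R = randn(nx, nu);                                 % example reward model
V = zeros(nx, T+1);                                % V(:,T+1) = 0, eq. (19)
policy = zeros(nx, T);
for t = T:-1:1
    Q = zeros(nx, nu);
    for u = 1:nu
        Q(:, u) = R(:, u) + gamma * P(:, :, u) * V(:, t+1);   % eq. (18)
    end
    [V(:, t), policy(:, t)] = max(Q, [], 2);       % eq. (17): maximize over actions
end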
Remark 2.3. It is useful to clarify the assumptions made to prove the optimality principle:

Assumption 1:
E[Rt | xt, ct:T] = E[Rt | xt, ct]    (22)

Assumption 2:
p(xt+1 | xt, ct:T) = p(xt+1 | xt, ct)    (23)

Assumption 3:
E[R̄t+1 | xt, ct, xt+1, ct+1:T] = E[R̄t+1 | xt+1, ct+1:T]    (24)
Assumption 4: Most importantly, we assumed that the optimal policy c̄t+1:T did not impose any constraints on the set of policies ct with respect to which we were performing the optimization. This would be violated if there were an additional penalty or reward that depended directly on ct:T. For example, this assumption would be violated if we were to force the policies of interest to be stationary. This would amount to putting a large penalty on policies that do not satisfy c1 = c2 = ... = cT.
Figure 1 displays a process that satisfies Assumptions 1-3. Note that under this model the reward depends on the start state, the end state, and the action. In addition, we let the reward depend on the control law itself. This allows, for example, the set of available actions to depend on the current time and state.
Figure 1: State process, action process Ut, controller Ct, and reward process Rt under the convention adopted in this document.
Remark 2.4. Note that the derivations did not require the standard Markovian assumption, i.e.,

p(xt+1 | x1:t, c1:t) = p(xt+1 | xt, ct)    (25)
Remark 2.5. Consider now the case in which the admissible control laws are of the form

Ut = Ct(Xt) ∈ At(Xt)    (26)

where At(xt) is the set of available actions when visiting state xt at time t. We can frame this problem by implicitly adding a large negative constant to the reward function when Ct chooses inadmissible actions. In this case the Bellman equation reduces to the following form

v̄t(xt) = max_{ut ∈ At(xt)} E[Rt + γ v̄t+1(Xt+1) | xt, ut]    (27)
Remark 2.6. Now note that we could apply the restriction that the only admissible action at time t given xt is exactly the action chosen by a given policy ct. This leads to the Bellman Equation for the Value of a given policy

vt(xt, ct:T) = E[Rt + γ vt+1(Xt+1, ct+1:T) | xt, ct]    (28)

where

vt(xt, ct:T) = E[R̄t | xt, ct:T]    (29)

is the value of visiting state xt at time t given policy ct:T.
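Equation (28) can be evaluated with the same backward sweep used for the optimality equation, replacing the maximization by the action prescribed by the policy. A minimal Matlab sketch, reusing the tabular model P, R, gamma, T, nx and the array policy from the previous sketch:

% Policy evaluation, eq. (28): backward recursion for a fixed policy.
Vpi = zeros(nx, T+1);                      % v_{T+1} = 0
for t = T:-1:1
    for i = 1:nx
        u = policy(i, t);                  % action prescribed by c_t at state i
        Vpi(i, t) = R(i, u) + gamma * P(i, :, u) * Vpi(:, t+1);
    end
end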
Remark 2.7. Note that the Bellman equation cannot be used to solve the open loop control problem, i.e., the problem obtained by restricting the set of allowable control laws to open loop laws. Such laws would be of the form

Ut = ct(X1)    (30)

which would violate Assumption 4, since

E[R2 | x2, c2] ≠ E[R2 | x1, x2, c1:2]    (31)
Remark 2.8 (Sutton and Barto (1998), Reinforcement Learning, page 76, step leading to equation (3.14)). Since assuming stationary policies violates Assumption 4, this step in Sutton and Barto's proof is not valid. The results are correct, however, since for the infinite horizon case it is possible to prove Bellman's equation using other methods (see Bertsekas' book, for example).
Remark 2.9. A problem of interest occurs when the set of possible control laws is a parameterized collection. In the general case such a problem will involve interdependencies between the different ct, i.e., the constraints on C cannot be expressed as

Σ_{t=0}^{T} ft(Ct)    (32)

which is required for the optimality principle to work. For example, if c1:T is implemented as a feed-forward neural network parameterized by the weights w, then the policy would be stationary, i.e., c1 = c2 = ... = cT, a constraint that cannot be expressed using (32). The problem can be approached by having time be one of the inputs to the model.
Let the action Ut ∈ [0, 1] represent a gamble of Ut Xt dollars. Thus, using no discount factor (γ = 1), Bellman's optimality equation takes the following form

Thus, since K is a constant with respect to x, the optimal policy will be identical at times T − 2, T − 3, ..., 1, i.e., the optimal gambling policy is the same at every time step.
Consider a linear dynamical system

Xt+1 = at Xt + bt Ut + ct Zt

where Xt ∈ ℝⁿ is the system's state, at ∈ ℝ^(n×n), Ut ∈ ℝ^m is a control signal, bt ∈ ℝ^(n×m), Zt ∈ ℝ^d are zero mean, independent random vectors with covariance equal to the identity matrix, and ct ∈ ℝ^(n×d). Our goal is to find a control sequence ut:T = ut, ..., uT that minimizes the following cost

Rt = Xt′ qt Xt + Ut′ gt Ut    (45)

where the state cost matrix qt is symmetric positive semidefinite and the control cost matrix gt is symmetric positive definite. Thus the goal is to keep the state Xt as close as possible to zero, while using small control signals.
We define the value at time t of a state xt given a policy π and terminal time T ≥ t as follows

vt(xt, π) = Σ_{τ=t}^{T} γ^{τ−t} E[Rτ | xt, π]    (46)
Thus, for a linear control law ut = κt xt the value function is quadratic in the state, vt(xt) = xt′ νt xt + βt, with

vt(xt) = xt′ ( qt + κt′ gt κt + (at + bt κt)′ νt+1 (at + bt κt) ) xt    (59)
  + Tr(ct′ νt+1 ct) + βt+1    (60)

Taking the gradient of the value with respect to the gain κt gives

∂vt(xt)/∂κt = 2 ( gt κt + bt′ νt+1 (at + bt κt) ) xt xt′    (73)
This gradient approach is useful for adaptive approaches to non-stationary problems and for iterative approaches to solving non-linear control problems via linearizations.
The optimal value of κt can also be found by setting the gradient to zero and solving the resulting algebraic equation, which gives

κt = −(gt + bt′ νt+1 bt)⁻¹ bt′ νt+1 at    (75)
Taking the gradient with respect to ut in a manner similar to how we did above for κt, we get

ut = −Kt xt    (85)

Kt def= (gt + b′ νt+1 b)⁻¹ b′ νt+1 a    (86)
3.4 Summary of Equations for Optimal Policy
Let

νT = qT    (87)
uT = 0    (88)

and, for t = T − 1, ..., 1,

Kt = (gt + b′ νt+1 b)⁻¹ b′ νt+1 a
νt = qt + a′ νt+1 a − a′ νt+1 b Kt
ut = −Kt xt    (91)

where the constant term of the value function satisfies

βt = βt+1 + Tr(c′ νt+1 c)    (94)
Below is Matlab code for the system and cost

% X_{t+1} = a_t X_t + b u_t + c Z_t
% R_t = X_t' q_t X_t + U_t' g_t U_t
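% A minimal sketch of the backward recursion summarized above, assuming
% time-invariant a, b, c, q, g (illustrative example values).
n = 2; m = 1; T = 50;
a = [1 0.1; 0 1]; b = [0; 0.1]; c = 0.01 * eye(n);
q = eye(n); g = 0.1;
nu = zeros(n, n, T); beta = zeros(1, T); K = zeros(m, n, T);
nu(:, :, T) = q;                               % nu_T = q_T, eq. (87)
for t = T-1:-1:1
    K(:, :, t)  = (g + b' * nu(:, :, t+1) * b) \ (b' * nu(:, :, t+1) * a);
    nu(:, :, t) = q + a' * nu(:, :, t+1) * a ...
                  - a' * nu(:, :, t+1) * b * K(:, :, t);
    beta(t)     = beta(t+1) + trace(c' * nu(:, :, t+1) * c);   % eq. (94)
end
% Optimal control at time t: u_t = -K(:,:,t) * x_t, eq. (91).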
Remark 3.4. Suppose the cost function is of the form

Rt = (Xt − θt)′ qt (Xt − θt) + Ut′ gt Ut

where θ1:T is a desired sequence of states. We can handle this case by augmenting the system as follows

X̄t+1 = ā X̄t + b̄ Ut + c̄ Zt

where

X̄t = [ Xt ; θt ; 1 ] ∈ ℝ^(2n+1)    (97)

ā = [ a(n×n)   0(n×n)   0(n×1)
      0(n×n)   I(n×n)   δt
      0(1×n)   0(1×n)   1      ] ∈ ℝ^((2n+1)×(2n+1))    (98)

b̄ = [ b(n×m) ; 0(n×m) ; 0(1×m) ] ∈ ℝ^((2n+1)×m)    (99)

c̄ = [ c(n×d) ; 0(n×d) ; 0(1×d) ] ∈ ℝ^((2n+1)×d)    (100)

where δt def= θt+1 − θt and we use subscripts as a reminder of the dimensionality of the matrices. The return function is now strictly quadratic on the extended state space

Rt = X̄t′ q̄t X̄t + Ut′ gt Ut    (103)

where

q̄t = [ qt       −qt      0(n×1)
       −qt       qt      0(n×1)
       0(1×n)   0(1×n)   0      ] ∈ ℝ^((2n+1)×(2n+1))    (104)
3.5 Example
Consider the simple case in which Xt+1 = a Xt + Ut + c Zt, i.e., b = I: at time t we are at xt and we want to get as close to zero as possible at the next time step. There is no cost for the size of the control signal, so qt = I, gt = 0, and the target state is zero. Thus we have

νT = I    (107)
uT = 0    (108)
Kt = (gt + b′ νt+1 b)⁻¹ b′ νt+1 a = a
νt = I    (115)
ut = −a xt , for t = 1, ..., T − 1    (116)

In this case all the controller does is anticipate the most likely next state (i.e., a xt) and compensate for it so that the expected value of the state at the next time step is zero.
We can add a drag force proportional to vt, constant through the period [t, t + Δt], and a random force constant through the same period:

[ xt+Δt ]   [ 1   Δt      ] [ xt ]   [ Δt²/2 ]        [ σ Δt²/2    0     ] [ Z1,t ]
[ vt+Δt ] = [ 0   1 − μΔt ] [ vt ] + [ Δt    ] ut  +  [ 0          σ Δt  ] [ Z2,t ]    (120)

where μ is the drag coefficient and σ scales the random force.
We can express this as a 2-dimensional discrete time system

x̄t+Δt = a x̄t + b ut + c Zt    (121)

where

x̄t = [ xt ; vt ],  a = [ 1  Δt ; 0  1 − μΔt ],  b = [ Δt²/2 ; Δt ],  c = [ σ Δt²/2  0 ; 0  σ Δt ]    (122)

and solve the problem of finding an optimal application of forces to keep the system at a desired location and/or velocity while minimizing energy consumption.
Figure 2 shows results of a simulation (Matlab code available) for a point mass moving along a line. The mass is located at -10 at time zero. There is a constant quadratic cost for applying a force at every time step, and a large quadratic cost at the terminal time (the goal is to be at the origin with zero velocity by 10 seconds). Note the inverted-U shape of the obtained velocity. Also note that the system applies a positive force during the first half of the run and then a negative force (brakes) that becomes increasingly larger as it gets close to the desired location. Note this would have been hard to do with a standard proportional controller (a change of sign in the applied force from positive early on to negative as we get close to the objective).
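A minimal sketch of such a simulation, using the backward recursion of Section 3.4; the mass, costs, and noise level are illustrative assumptions and not necessarily those used for Figure 2.

% Point mass on a line: LQR with control cost at every step and a large
% terminal cost at T (assumed illustrative parameters).
dt = 0.01; T = 1000;                      % 10 seconds at 100 steps/sec
a  = [1 dt; 0 1]; b = [dt^2/2; dt]; c = 0.1 * [dt^2/2 0; 0 dt];
g  = 1e-4;                                % control cost per step
qT = 1e4 * eye(2);                        % large terminal cost
q  = zeros(2);                            % no state cost before T
nu = qT; K = zeros(T, 2);
for t = T-1:-1:1                          % backward Riccati recursion
    K(t, :) = (g + b' * nu * b) \ (b' * nu * a);
    nu = q + a' * nu * a - a' * nu * b * K(t, :);
end
x0 = [-10; 0]; X = zeros(2, T); X(:, 1) = x0; U = zeros(1, T); x = x0;
for t = 1:T-1                             % forward simulation
    U(t) = -K(t, :) * x;
    x = a * x + b * U(t) + c * randn(2, 1);
    X(:, t+1) = x;
end
plot((1:T) * dt, X(1, :));                % position over time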
Figure 2: Simulation of the point-mass problem. Top: position, velocity, and control (force/mass) as a function of time (centisecs). Bottom: position gain and velocity gain as a function of time (centisecs).
The stationary control function

ut = −K xt    (124)
K = (b′ ν b + g)⁻¹ b′ ν a    (125)

where ν is the stationary solution of the Riccati recursion, minimizes the long term average cost

lim_{T→∞} (1/T) Σ_{t=1}^{T} E[Xt′ q Xt + Ut′ g Ut]    (126)

The average of the constant terms satisfies

β̄t = ((t − 1)/t) β̄t−1 + (1/t) Tr(c′ νt c)    (127)

and in the stationary case

β̄ = ((t − 1)/t) β̄ + (1/t) Tr(c′ ν c)    (128)
β̄ = Tr(c′ ν c)    (129)
4.1 Example
We want to control
Xt+1 = Xt + Ut + Zt (131)
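For this scalar system (a = b = c = 1) the stationary gain can be obtained by iterating the Riccati recursion to convergence. A minimal Matlab sketch, with assumed state and control costs q and g:

% Stationary LQR gain for X_{t+1} = X_t + U_t + Z_t (a = b = c = 1).
a = 1; b = 1; c = 1; q = 1; g = 1;        % assumed example costs
nu = q;
for k = 1:1000                            % iterate the Riccati recursion
    K  = (b * nu * b + g) \ (b * nu * a);
    nu = q + a * nu * a - a * nu * b * K;
end
K                                         % stationary gain, u_t = -K * x_t
avgCost = c * nu * c;                     % average cost per step, cf. eq. (129)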
the action at time t is produced after Yt is observed. This action is determined by a controller Ct whose input is Y1:t, U1:t−1, i.e., the information observed up to time t, and whose output is the action at time t, i.e.,

Ut = Ct(Ot)    (133)

Ot = (Y1:t, U1:t−1)    (134)

The model is specified by a distribution for the initial state X1,

p(x1)    (135)

a sensor model

p(yt | xt, ut−1)    (136)

and a state dynamics model

p(xt+1 | xt, ut)    (137)
Remark 5.1. Alternative Conventions. Under our convention the effect of actions is not instantaneous, i.e., the action at time t affects the state and the observation at time t + 1. In some cases it is useful to think of the effect of actions as occurring at a shorter time scale than the state dynamics. In such cases it may be useful to model the distribution of observations at time t as being determined by the state and action at time t. Under this convention, Ut corresponds to what we call Ut+1 (see right side of Figure 3).
It may also be useful to think of it this way: Xt generates Yt, which is used by the controller Ct to generate Ut.
Our goal will be to find a controller that optimizes a performance function of the same form as before.
Figure 3: Left: The convention adopted in this document. Arrows represent de-
pendency relationships between variables. Dotted figures indicate unobservable
variables, continuous figures indicate observable variables. Under this conven-
tion the effect of actions is not instantaneous. Right: Alternative convention.
Under this convention the effect of actions is instantaneous.
Assumption 2:
Assumption 3:
Remark 5.2. The catch is that the number of states needed to represent the observable process grows exponentially with time. For example, if we have binary observations and actions, the number of possible information states by time t is 4^t. Thus it is critical to summarize all the available information.
Remark 5.3. Open Loop Policies. We can model open loop processes as special cases of partially observable control processes. In such cases the state is observed at time 1, but thereafter the observation process is uninformative (e.g., it could be a constant).
Assumption 1:
S1 = f1 (Y1 ) (143)
Assumption 2:
St+1 = ft (St , Yt+1 , Ut ) (144)
Assumption 3:
E[Rt | ot , ut ] = E[Rt | st , ut ] (145)
Assumption 4:
p(yt+1 | y1:t , u1:t ) = p(yt+1 | st , ut ) (146)
and thus the optimal action at time T depends only on sT. We will now show that if this is true at time t + 1 then it is also true at time t:

min_{ut} E[Rt + γ v̄t+1(St+1) | ot, ut] = min_{ut} E[Rt + γ v̄t+1(St+1) | st, ut] def= v̄t(st)    (150)

where st def= ft(ot). Thus, we only need to keep track of st to find the optimal policy with respect to ot.
Moreover, note that the update of the posterior distribution only requires the current posterior distribution, which becomes a prior, and the new action and observation:

p(xt+1 | y1:t+1, u1:t) ∝ p(yt+1 | xt+1, ut) Σ_{xt} p(xt | y1:t, u1:t) p(xt+1 | xt, ut)    (153)
E[Rt | ot, ut] = Σ_{xt} p(xt | ot, ut) Rt(xt, ut) = E[Rt | st, ut]    (154)

and

p(yt+1 | y1:t, u1:t) = Σ_{xt, xt+1} p(xt | y1:t, u1:t−1) p(xt+1 | xt, ut) p(yt+1 | xt+1)    (155)
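The belief update (153) and the predictive distribution (155) are straightforward to implement for a finite state space. A minimal Matlab sketch, where the transition model P and sensor model O are randomly generated stand-ins for the user's model:

% Discrete Bayes filter: one update of the posterior (belief) over X_t.
% P(i,j,u) = p(X_{t+1}=j | X_t=i, U_t=u),  O(j,y) = p(Y_{t+1}=y | X_{t+1}=j).
nx = 4; ny = 3; nu = 2;
P = rand(nx, nx, nu); P = P ./ sum(P, 2);       % example transition model
O = rand(nx, ny);     O = O ./ sum(O, 2);       % example sensor model
belief = ones(nx, 1) / nx;                      % p(x_t | y_1:t, u_1:t-1)
u = 1; ynext = 2;                               % example action and observation
pred   = P(:, :, u)' * belief;                  % predictive state distribution
py     = O(:, ynext)' * pred;                   % p(y_{t+1} | y_1:t, u_1:t), eq. (155)
belief = O(:, ynext) .* pred / py;              % posterior at t+1, eq. (153)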
Consider again a linear dynamical system

Xt+1 = a Xt + b Ut + c Zt

where Xt ∈ ℝⁿ is the system's state, a ∈ ℝ^(n×n), Ut ∈ ℝ^m is a control signal, b ∈ ℝ^(n×m), Zt ∈ ℝ^d are zero mean, independent random vectors with covariance equal to the identity matrix, and c ∈ ℝ^(n×d). Our goal is to find a control sequence ut:T = ut, ..., uT that minimizes the following cost

Rt = Xt′ qt Xt + Ut′ gt Ut    (159)

where the state cost matrix qt is symmetric positive semidefinite and the control cost matrix gt is symmetric positive definite. Thus the goal is to keep the state Xt as close as possible to zero, while using small control signals.
Let

Ot def= (Y1:t, U1:t−1)    (160)

represent the information available at time t. We will solve the problem by assuming that the optimal cost-to-go is of the form

v̄t(ot) = E[Xt′ νt Xt | ot] + βt(ot)    (161)
First note that since g is positive definite, the optimal control at time T is uT = 0. Thus

v̄T(oT) = E[XT′ qT XT | oT] = E[XT′ νT XT | oT] + βT(oT)    (162)

and our assumption is correct for the terminal time T with

νT = qT ,  βT(oT) = 0    (163)

Expanding v̄t using the dynamics, the fact that E[Zt,i Zt,j | xt, ut] = δi,j, and the assumption that E[βt+1(Ot+1) | ot, ut] does not depend on ut, we obtain

v̄t(ot) = E[Xt′ qt Xt | ot] + E[βt+1(Ot+1) | ot] + min_{ut} E[(a Xt + b ut)′ νt+1 (a Xt + b ut) + ut′ gt ut | ot, ut]    (168)

The minimization part is equivalent to the one presented in (323) with the following equivalences: b → b, x → ut, a → νt+1, C → a Xt, d → gt. Thus, using (329),

ut = −κt E[Xt | ot]    (170)

where

κt = λt a    (171)
λt def= (b′ νt+1 b + gt)⁻¹ b′ νt+1    (172)
We will later show that the last term is constant with respect to u1:t. Thus the optimal control depends on the observations only through E[Xt | ot].
Consider the process X̃, Ỹ obtained by setting the controls to zero, which shares initial distribution and noise variables Z, W with the processes X, Y, H defined in previous sections.
Thus

E[Xt | ot] = a^(t−1) E[X1 | ot] + Σ_{τ=1}^{t−1} a^(t−1−τ) b uτ + Σ_{τ=1}^{t−1} a^(t−1−τ) c E[Zτ | ot]    (191)

E[X̃t | ot] = a^(t−1) E[X1 | ot] + Σ_{τ=1}^{t−1} a^(t−1−τ) c E[Zτ | ot]    (192)

then

Ỹt = Yt − k Σ_{τ=1}^{t−1} a^(t−1−τ) b Uτ    (196)

and therefore knowing o1:t determines õ1:t = ỹ1:t. Thus

E[X̃t | ot] = E[X̃t | ỹ1:t]    (197)

and

Xt − E[Xt | ot] = X̃t − E[X̃t | ỹ1:t]    (198)

which is constant with respect to u1:t−1.
Remark 6.1. Note that the control equations for the partially observable case are identical to the control equations for the fully observable case, but using E[Xt | ot] instead of xt.
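A minimal sketch of this certainty-equivalence controller, combining the gains of Section 3 with a Kalman filter estimate of E[Xt | ot]. The sensor model Yt = k Xt + w Wt and all numerical values are assumptions for illustration.

% Certainty-equivalence control: apply the fully observable gains to the
% state estimate E[X_t | o_t] computed with a Kalman filter.
% Assumed model: X_{t+1} = a X_t + b u_t + c Z_t,  Y_t = k X_t + w W_t.
n = 2; T = 200;
a = [1 0.1; 0 1]; b = [0; 0.1]; c = 0.05 * eye(n);
k = [1 0]; w = 0.1; q = eye(n); g = 0.1;
% Backward pass: LQR gains (identical to the fully observable case).
nu = q; K = zeros(T, n);
for t = T-1:-1:1
    K(t, :) = (g + b' * nu * b) \ (b' * nu * a);
    nu = q + a' * nu * a - a' * nu * b * K(t, :);
end
% Forward pass: Kalman filter + control u_t = -K_t E[X_t | o_t].
x = [1; 0]; xhat = zeros(n, 1); S = eye(n);     % state, estimate, estimate cov.
for t = 1:T-1
    u = -K(t, :) * xhat;
    x = a * x + b * u + c * randn(n, 1);        % true (hidden) state
    y = k * x + w * randn;                      % observation
    xhat = a * xhat + b * u;                    % predict
    S = a * S * a' + c * c';
    G = S * k' / (k * S * k' + w^2);            % Kalman gain
    xhat = xhat + G * (y - k * xhat);           % correct with observation
    S = (eye(n) - G * k) * S;
end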
7 Continuous Time Control
For this section we recommend reading first the tutorial on stochastic differential equations, particularly the section on Ito's rule.
Consider a dynamical system governed by the following system of stochastic differential equations

dXt = a(Xt, Ut) dt + c(Xt, Ut) dBt

where Bt is standard Brownian motion. Given a policy π, with Ut = π(Xt, t), we define the value function

v(xt, t) = E[ ∫_t^T e^(−(s−t)/τ) r(Xs, Us, s) ds + g(XT) e^(−(T−t)/τ) | xt, π ]

where r is the instantaneous reward function, τ is the time constant for the temporal discount of the reward, and g is the terminal reward.
v(xt, t) = E[ ∫_t^(t+Δ) e^(−(s−t)/τ) r(Xs, Us, s) ds | xt, π ]
  + E[ ∫_(t+Δ)^T e^(−(s−(t+Δ))/τ) r(Xs, Us, s) ds | xt, π ] e^(−Δ/τ)    (206)
  = E[ ∫_t^(t+Δ) e^(−(s−t)/τ) r(Xs, Us, s) ds | xt, π ] + E[ v(Xt+Δ, t+Δ) | xt, π ] e^(−Δ/τ)

Thus

E[ v(Xt+Δ, t+Δ) | xt, π ] e^(−Δ/τ) − v(xt, t) = −E[ ∫_t^(t+Δ) e^(−(s−t)/τ) r(Xs, Us, s) ds | xt, π ]    (207)

Letting f(Δ) def= E[ v(Xt+Δ, t+Δ) | xt, π ], the left hand side can be written as

E[ v(Xt+Δ, t+Δ) | xt, π ] e^(−Δ/τ) − v(xt, t) = f(Δ) e^(−Δ/τ) − f(0)    (210)
It is easy to verify that for any differentiable function f

lim_{Δ→0} ( f(Δ) e^(−Δ/τ) − f(0) ) / Δ = −(1/τ) f(0) + ḟ(0)    (211)

Thus

lim_{Δ→0} (1/Δ) ( E[ v(Xt+Δ, t+Δ) | xt, π ] e^(−Δ/τ) − v(xt, t) ) = dv(xt, t)/dt − (1/τ) v(xt, t)    (212)

where the total derivative of v is defined as follows

dv(xt, t)/dt = lim_{Δ→0} (1/Δ) ( E[ v(Xt+Δ, t+Δ) | xt, π ] − v(xt, t) )    (213)

Thus taking limits on the left hand side and right hand side of (207) we get

dv(xt, t)/dt − (1/τ) v(xt, t) = −r(xt, ut, t)    (214)
We will now expand the total derivative dv(xt, t)/dt. For s ≥ t let

Ys def= v(Xs, s)    (215)
Us def= π(Xs, s)    (217)
vt(x, s) def= ∂v(x, s)/∂s    (218)
vx(x, s) def= ∂v(x, s)/∂x    (219)
vxx(x, s) def= ∂²v(x, s)/∂x ∂x′    (220)
c²(x, u) def= c(x, u) c(x, u)^T    (221)

Then, by Ito's rule,

dE[Yt]/dt = dv(xt, t)/dt = vt(xt, t) + vx(xt, t) a(xt, ut) + ½ Tr( c²(xt, ut) vxx(xt, t) )    (222)
Putting together (214) and (222) we get the Hamilton-Jacobi-Bellman (HJB) equation for the value function of a policy π:

(1/τ) v(x, t) = r(x, u, t) + vt(x, t) + vx(x, t) a(x, u) + ½ Tr( c²(x, u) vxx(x, t) )
u = π(x, t)
v(x, T) = g(x)
    (223)
Numerical Solution: We have the value of v for time T. If we can get the first and second derivatives of v with respect to x we can then use the HJB equation to obtain ∂v(x, T)/∂t. We can then use this to find v for time step T − Δt:

v(x, T − Δt) ≈ v(x, T) − Δt ∂v(x, T)/∂t    (224)

We can then progress backwards in time until we reach the starting time t.
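A rough 1-D finite-difference sketch of this backward procedure for a fixed policy; the dynamics, noise, reward, policy, discount, and grid are illustrative assumptions.

% Backward-in-time finite-difference sketch of the HJB equation (223) for a
% fixed policy (1-D example with assumed model).
tau = 1; T = 1; dt = 1e-3; dx = 0.05;
x = (-3:dx:3)';
afun = @(x, u) -x + u;            % assumed drift a(x,u)
cfun = @(x, u) 0.2;               % assumed noise magnitude c(x,u)
rfun = @(x, u, t) -x.^2 - u.^2;   % assumed instantaneous reward
pol  = @(x, t) -0.5 * x;          % assumed policy u = pi(x,t)
gfun = @(x) -x.^2;                % terminal reward
v = gfun(x);                      % v(x,T) = g(x)
for t = T:-dt:dt
    u   = pol(x, t);
    vx  = gradient(v, dx);        % first derivative in x
    vxx = gradient(vx, dx);       % second derivative in x
    vt  = v / tau - rfun(x, u, t) - vx .* afun(x, u) ...
          - 0.5 * cfun(x, u).^2 .* vxx;      % solve eq. (223) for dv/dt
    v   = v - dt * vt;            % eq. (224): step from t to t - dt
end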
Since at the extremum π takes the value of the optimal policy,

(1/τ) v(x, t) = sup_π { r(x, π(x)) + vt(x, t) + vx(x, t) a(x, π(x)) + ½ Tr( c²(x, π(x)) vxx(x, t) ) }    (228)

And since the only part of the equation that depends on π is u = π(x), the HJB equation for the optimal value function follows:

(1/τ) v(x, t) = sup_u { r(x, u) + vt(x, t) + vx(x, t) a(x, u) + ½ Tr( c²(x, u) vxx(x, t) ) }
v(x, T) = g(x)
    (229)
If we now make the reward independent of the time t and let the horizon become infinite, the value function will also be independent of t. Thus the derivative of v with respect to time must be zero and the HJB equation for the value function of a policy follows

(1/τ) v(x) = r(x, u) + vx(x) a(x, u) + ½ Tr( c²(x, u) vxx(x) )
u = π(x)
    (231)

Using the same logic, we get the HJB equation for the optimal value function

(1/τ) v(x) = sup_u { r(x, u) + vx(x) a(x, u) + ½ Tr( c²(x, u) vxx(x) ) }    (232)
Consider now a system with control-affine dynamics

dXt = ( a(Xt) + b(Xt) Ut ) dt + c(Xt) dBt    (233)

and optimal value function

v(x, t) def= max_π E[ ∫_t^T e^(−(s−t)/τ) r(Xs, Us) ds + gT(XT) | Xt = x, π ]    (234)

where Us = π(Xs) and the instantaneous reward takes the following form

r(x, u) def= g(x) − u^T q u    (235)
In this case the HJB equation looks as follows

(1/τ) v(x, t) = max_u { g(x) − u^T q u + ∂v(x, t)/∂t + a(x)^T ∂v(x, t)/∂x + u^T b(x)^T ∂v(x, t)/∂x
  + ½ Tr[ c(x) c(x)^T ∂²v(x, t)/∂x² ] }    (236)

Most importantly, the maximum over u can be computed analytically. Taking the gradient of the right hand side of (236) with respect to u and setting it to zero we get

−2 q u + b(x)^T ∂v(x, t)/∂x = 0    (237)

Thus the optimal action is

û = ½ q⁻¹ b(x)^T ∂v(x, t)/∂x    (238)

If q is not full rank then there are infinitely many optimal actions. We can choose one by using the pseudo-inverse of q. We need to be careful about q. For example, consider the 1-D case: if we let q → 0 the optimal gain goes to infinity, which basically sets the state to zero in an infinitesimal time dt.
Substituting the optimal action into the HJB equation we get

(1/τ) v(x, t) = g(x) − û^T q û + ∂v(x, t)/∂t + a(x)^T ∂v(x, t)/∂x + 2 û^T q û + ½ Tr[ c(x) c(x)^T ∂²v(x, t)/∂x² ]    (239)

Simplifying, the HJB equation for the optimal value function looks as follows

−∂v(x, t)/∂t = −(1/τ) v(x, t) + g(x) + û^T q û + a(x)^T ∂v(x, t)/∂x + ½ Tr[ c(x) c(x)^T ∂²v(x, t)/∂x² ]    (240)

with

v(x, t) = E[ ∫_t^T r(Xs, Us) e^(−(s−t)/τ) ds | Xt = x, π ]    (242)
where

Us = π(Xs)    (243)
r(x, u) = −(x − θ)^T p (x − θ) − u^T q u    (244)

where the target state θ can be a function of time. This corresponds to the problem of having the state Xt track the trajectory θt. We assume the value function takes the following form

v(x, t) = −x^T νt x + 2 μt^T x + βt    (245)

Thus,

∂v(x, t)/∂x = 2 (μt − νt x)    (246)
∂²v(x, t)/∂x² = −2 νt    (247)
∂v(x, t)/∂t = −x^T ν̇t x + 2 μ̇t^T x + β̇t    (248)

where

νt = (νt + νt^T)/2 (i.e., νt is taken to be symmetric)    (249)
ν̇t = dνt/dt    (250)
μ̇t = dμt/dt    (251)
β̇t = dβt/dt    (252)
Consider the optimal HJB equation (240)

−∂v(x, t)/∂t = −(1/τ) v(x, t) + g(x) + u^T q u + (∂v(x, t)/∂x)^T a x + ½ Tr[ c c^T ∂²v(x, t)/∂x² ]    (253)

where

g(x) = −(x − θ)^T p (x − θ)    (254)
u(x) = ½ q⁻¹ b^T ∂v(x, t)/∂x = q⁻¹ b^T (μt − νt x)    (255)

Thus

x^T ν̇t x − 2 μ̇t^T x − β̇t = (1/τ) x^T νt x − (2/τ) μt^T x − (1/τ) βt − (x − θt)^T p (x − θt)    (256)
  + (μt − νt x)^T b q⁻¹ b^T (μt − νt x)    (257)
  + 2 (μt − νt x)^T a x − Tr[c^T νt c]    (258)
Expanding some terms

x^T ν̇t x − 2 μ̇t^T x − β̇t = (1/τ) x^T νt x − (2/τ) μt^T x − (1/τ) βt    (259)
  − x^T p x + 2 θt^T p x − θt^T p θt    (260)
  + x^T νt b q⁻¹ b^T νt x − 2 μt^T b q⁻¹ b^T νt x + μt^T b q⁻¹ b^T μt    (261)
  + 2 μt^T a x − 2 x^T νt a x − Tr[c^T νt c]    (262)

Gathering quadratic, linear, and constant terms we get the continuous time Riccati equations

ut(x) = κt − λt x
κt = q⁻¹ b^T μt
λt = q⁻¹ b^T νt
v(x, t) = −x^T νt x + 2 μt^T x + βt
ν̇t = (1/τ) νt − p + νt b q⁻¹ b^T νt − 2 νt a
μ̇t = (1/τ) μt − p^T θt + νt b q⁻¹ b^T μt − a^T μt    (263)
β̇t = (1/τ) βt + θt^T p θt − μt^T b q⁻¹ b^T μt + Tr[c^T νt c]
νt = (νt + νt^T)/2
νT = (p + p^T)/2
μT = (p + p^T)/2 θT
βT = −θT^T p θT

For the standard regulator problem the target is θt = 0. The update equation for μ then shows that μT = 0 and therefore μt = 0 for all t.
Thus the update equations for the linear quadratic regulator are as follows

ut(x) = −kt x    (267)
kt = q⁻¹ b^T νt    (268)
ν̇t = (1/τ) νt − p + νt b q⁻¹ b^T νt − 2 νt a    (269)
β̇t = (1/τ) βt + Tr[c^T νt c]    (270)
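A minimal sketch integrating these ordinary differential equations backward in time with Euler steps; the system matrices, costs, and horizon are illustrative assumptions.

% Backward Euler integration of the continuous-time Riccati equations
% (267)-(270) for the regulator case (target = 0, mu_t = 0).
n = 2; tau = 1; Tend = 5; dt = 1e-3;
a = [0 1; 0 -0.5]; b = [0; 1]; c = 0.1 * eye(n);
p = eye(n); q = 1;
nu = (p + p') / 2;                     % nu_T = (p + p')/2
beta = 0;                              % beta_T = 0
for t = Tend:-dt:dt
    nudot   = nu / tau - p + nu * b * (q \ b') * nu - 2 * nu * a;
    betadot = beta / tau + trace(c' * nu * c);
    nu   = nu - dt * nudot;            % step backward from t to t - dt
    beta = beta - dt * betadot;
    nu = (nu + nu') / 2;               % keep nu symmetric
    k  = q \ (b' * nu);                % feedback gain, u(x) = -k x, eqs. (267)-(268)
end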
Suppose we approximate the value function as a linear combination of nf features,

v(x, t) ≈ Σ_{i=1}^{nf} φi(x) wi(t) = φ(x)^T w(t)    (271)

where φ : ℝ^(nx) → ℝ^(nf) is a known function that maps each state x into nf features of that state, and w(t) ∈ ℝ^(nf) is an unknown weight vector that tells us how to combine the state features to obtain the value function of a state. Thus

∂v(x, t)/∂x = Σ_{i=1}^{nf} (∂φi(x)/∂x) wi(t) = ψ(x) w(t)    (272)

where

ψ(x) def= ∂φ(x)/∂x    (273)
Discretizing in time,

∂v(x, t)/∂t ≈ (1/Δt) v(x, t + Δt) − (1/Δt) v(x, t)    (279)
  = (1/Δt) v(x, t + Δt) − (1/Δt) φ(x)^T w(t)    (280)

Collecting terms constant, linear, and quadratic with respect to w we get

g(x) + (1/Δt) v(x, t + Δt)
  + [ ψ(x)^T a(x) + h(x) − (1/τ + 1/Δt) φ(x) ]^T w(t)
  + ¼ w(t)^T ψ(x)^T b(x) q⁻¹ b(x)^T ψ(x) w(t) = 0    (281)

where h(x) is an nf-dimensional vector whose ith element is defined as follows

hi(x) = ½ Tr[ c(x) c(x)^T ∂²φi(x)/∂x² ]    (282)
Our goal is to find a value of w that satisfies (281). To do so we collect a sample {x1, x2, ..., xns} of states and define an error function which captures the extent to which the HJB equation is violated. For time step t we want to find w(t) that minimizes

Σ_{i=1}^{ns} ( ai(t) + bi(t)^T w(t) + w(t)^T ci(t) w(t) )²

where

ai(t) = g(xi) + (1/Δt) v(xi, t + Δt)    (286)
bi(t) = ψ(xi)^T a(xi) + h(xi) − (1/τ + 1/Δt) φ(xi)    (287)
ci(t) = ¼ ψ(xi)^T b(xi) q⁻¹ b(xi)^T ψ(xi)    (288)

This is a Quadratic Regression problem that can be solved using iterative methods (see Appendix). Unfortunately this problem has local minima (or difficult plateaus), so it is important to get good starting points. The solution for time T is unique and we can use it as the starting point for time T − Δt. Provided Δt is small, this should be a good starting solution. For some reason, starting points close to zero also seem to work well. Note that to compute the ai(t) terms we need v(x, t + Δt). We can thus solve the problem by doing a backward pass, starting at time T.
Another important issue is to have enough samples so that the regression problem used to estimate w(t) is not underconstrained. If the number of samples is small, one possibility is to use something like Bayesian regression, which allows for sequential learning of the parameters.
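A minimal sketch of one backward step of this procedure: the terms ai, bi, ci of (286)-(288) are assembled at sample states and w(t) is fit by numerically minimizing the squared violation of (281). The 1-D features, dynamics, costs, and the weight vector assumed for time t + Δt are all illustrative assumptions.

% One backward step of the value-function fit: minimize the squared
% violation of eq. (281) over sample states (1-D illustrative example).
tau = 1; dt = 0.01; qc = 1;                 % discount, time step, control cost
afun = @(x) -x; bfun = @(x) 1; cfun = @(x) 0.5;
gfun = @(x) -x.^2;                          % state reward g(x)
phi  = @(x) [ones(size(x)), x, x.^2];       % features (1, x, x^2)
psi  = @(x) [zeros(size(x)), ones(size(x)), 2*x];    % d phi / dx
hrow = 0.5 * cfun(0)^2 * [0, 0, 2];         % h(x), eq. (282), constant here
xs = linspace(-2, 2, 41)'; ns = numel(xs);  % sample states
wnext = [-1; 0; -1];                        % assumed w(t + dt) from previous step
vnext = phi(xs) * wnext;                    % v(x_i, t + dt)
ai = gfun(xs) + vnext / dt;                                   % eq. (286)
bi = psi(xs) .* afun(xs) + repmat(hrow, ns, 1) ...
     - (1/tau + 1/dt) * phi(xs);                              % eq. (287)
err = @(w) sum((ai + bi * w + 0.25 * (psi(xs) * w).^2 ...
                * bfun(0)^2 / qc).^2);      % squared violation of eq. (281)
w = fminsearch(err, wnext);                 % start from the solution at t + dt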
What's Needed
a(x), b(x) can be learned from examples using non-linear regression.

7.7 Example
We will approximate the value function using linear and quadratic features

φ(x) = (1, x, x²)^T    (297)

Thus

a = b = c = 1    (298)
ψ(x) = (0, 1, 2x)    (299)
∂²φ1(x)/∂x² = 0    (300)
∂²φ2(x)/∂x² = 0    (301)
∂²φ3(x)/∂x² = 2    (302)
h1(x) = 0    (303)
h2(x) = 0    (304)
h3(x) = 1    (305)
Thus, at iteration k of the quadratic regression (Lemma 8.4), the regression targets and regressors evaluated at each sample state xi are

yi = wk^T ci wk − ai(t)
Θi,· = bi(t)^T + 2 wk^T ci

with ai, bi, ci given by (286)-(288).
8 Appendix
Lemma 8.1. If wi ≥ 0 for all i and θ̄ maximizes f(i, θ) for all i, then

max_θ Σ_i wi f(i, θ) = Σ_i wi max_θ f(i, θ)    (313)

Proof. Since wi ≥ 0,

max_θ Σ_i wi f(i, θ) ≤ Σ_i wi max_θ f(i, θ)    (314)

Moreover,

max_θ Σ_i wi f(i, θ) ≥ Σ_i wi f(i, θ̄) = Σ_i wi max_θ f(i, θ)    (315)
Proof. Let θ̄i be such that

f(i, θ̄i) = max_θ f(i, θ)    (318)

and let θ* be such that

Σ_i wi f(i, θ*) = max_θ Σ_i wi f(i, θ)    (319)

Then, when (313) holds,

Σ_i wi ( f(i, θ̄i) − f(i, θ*) ) = 0    (320)

Thus, since

f(i, θ̄i) − f(i, θ*) ≥ 0    (321)

it follows that, for every i with wi > 0,

f(i, θ*) = f(i, θ̄i) = max_θ f(i, θ)    (322)
Lemma 8.3 (Quadratic Minimization). Let

ρ(x) = (b x + C)^T a (b x + C) + x^T d x    (323)

where a and d are symmetric positive definite matrices and C is a random vector with the same dimensionality as bx. Taking the Jacobian with respect to x, applying the chain rule, and setting it to zero we have

b^T a (b x̂ + C) + d x̂ = 0

This is commonly known as the Normal Equation. Thus the value x̂ that minimizes ρ is

x̂ = −h C    (328)

where

h = (b^T a b + d)⁻¹ b^T a    (329)

Moreover, the value at the minimum is

ρ(x̂) = C^T a C − C^T h^T b^T a C = C^T k C    (337)

where

k = a − h^T b^T a
Lemma 8.4 (Quadratic Regression). We want to minimize

ρ(w) = Σ_i ( ai + bi^T w + w^T ci w )²    (340)

This can be done iteratively. Given a current estimate wk we linearize the quadratic term around wk,

w^T ci w ≈ wk^T ci wk + 2 wk^T ci (w − wk) = −wk^T ci wk + 2 wk^T ci w    (341)

so that each iteration reduces to a linear least squares problem with targets and regressors

yi = −ai + wk^T ci wk    (343)
Θi,· = bi^T + 2 wk^T ci