
Engineering Applications of Artificial Intelligence 55 (2016) 119–127

journal homepage: www.elsevier.com/locate/engappai
http://dx.doi.org/10.1016/j.engappai.2016.06.008

Lyapunov theory based stable Markov game fuzzy control for non-linear systems

Rajneesh Sharma
Netaji Subhas Institute of Technology, New Delhi, India
E-mail address: rajneesh496@gmail.com

Article info

Article history:
Received 12 February 2016
Received in revised form 4 May 2016
Accepted 17 June 2016
Available online 5 July 2016

Abstract

In this paper we propose a Lyapunov theory based Markov game fuzzy controller which is both safe and stable. We attempt to optimize a reinforcement learning (RL) based controller using Markov games, simultaneously hybridizing it with a Lyapunov theory based control for stability. The proposed technique results in an RL based, game theoretic, adaptive, self learning, optimal fuzzy controller which is both robust and has guaranteed stability. The proposed controller is an annealed hybrid of fuzzy Markov games and Lyapunov theory based control. Fuzzy systems have been employed as generic function approximators for scaling the proposed approach to continuous state-action domains. We test the proposed controller on three benchmark non-linear control problems: (i) inverted pendulum, (ii) trajectory tracking of a standard two-link robotic manipulator, and (iii) tracking control of a two link selective compliance assembly robotic arm (SCARA). Simulation results and a comparative evaluation against baseline fuzzy Markov game based control showcase the superiority and effectiveness of the proposed approach.

© 2016 Elsevier Ltd. All rights reserved.

Keywords:
Reinforcement learning
Q learning
Fuzzy Markov games
Lyapunov theory
Inverted pendulum
Two link robotic manipulator
SCARA

1. Introduction
The reinforcement learning (RL) paradigm centers on Markov decision processes (MDPs) as the underlying model for adaptive optimal control of non-linear systems (Busoniu et al., 2010; Wiering and van Otterlo, 2012). A critical assumption in MDP based RL is that the environment is stationary. However, imposing such a restrictive assumption on the environment may not be feasible, especially when the controller has to deal with disturbances and parametric variations.
Notwithstanding this limitation, RL has been used successfully for controlling a wide variety of non-linear systems. For example, in Kobayashi et al. (2009) a meta-learning method based on temporal difference error has been employed for inverted pendulum control (IPC); Kumar et al. (2012) presents a self tuning fuzzy Q controller for IPC; Ju et al. (2014) proposes a kernel based approximate dynamic programming approach for inverted pendulum control; and in Liu et al. (2014) an experience replay least squares policy iteration procedure has been proposed for efficient utilization of experiential information. In the literature, we can find quite a few variants of the inverted pendulum problem. In our work, however, we have used the standard version of the pendulum, wherein the pivot point is mounted on a cart that can move horizontally.
Another domain where RL has been applied is robotic manipulator control, which is a highly coupled, non-linear and time varying task. The task becomes even more challenging when the controller has to cope with varying payload mass and external disturbances. Both neural network based RL and fuzzy systems based RL have been employed for robotic manipulator control.
In Lin (2009) the author uses an H-infinity reinforcement learning based controller on a fuzzy wavelet network (FWN). An actor-critic RL formulation is implemented, avoiding complex Riccati equations, for controlling a SCARA. An adaptive neural RL control has been proposed in Tang et al. (2014) to counter unknown functions and dead zone inputs in an actor-critic RL configuration, wherein Lyapunov theory has been employed to show boundedness of all closed loop signals. For a comprehensive and in-depth look at controllers employing soft computing techniques, e.g., neural networks, fuzzy systems and evolutionary computation, on robotic manipulators, we refer the reader to Katic and Vukobratovic (2013).
As stated earlier, all RL based controller design approaches share a basic lacuna: they assume an MDP framework. To make the RL controller design process more general and robust, we introduced a Markov game formulation wherein the noise and disturbance are viewed as an opponent to the controller in a game theoretic setup (Sharma and Gopal, 2008). This formulation helped us in designing RL controllers that are robust in handling disturbances and noise, as the controller always tries to optimize against the worst case opponent or noise. The Markov game formulation (Sharma and Gopal, 2008) allows broadening of the MDP


based RL control to encompass multiple adaptive agents with competitive goals. The Markov game controller was able to deal with disturbances and parameter variations of the controlled plant. However, both MDP and Markov game based RL approaches failed to address one key concern, namely, stability of the designed controller.
To be specific, there is no guarantee that the controller will remain stable in the presence of disturbances and/or parameter variations. Our attempt herein is to design self learning, model free controllers with guaranteed stability. This is sought to be achieved by incorporating a Lyapunov theory based action generation mechanism in the game theoretic RL setup. The controller has all the advantages of game based RL (Markov game control) and has guaranteed stability due to the inclusion of a Lyapunov theory based action.
This work is motivated by the need to address the stability issue in RL based control by proposing a 'safe and stable' game theoretic controller. The controller is safe as it uses a Markov game framework for optimization; the controller always optimizes against the worst opponent, or 'plays safe' as referred to in the game theory literature (Vrabie and Vamvoudakis, 2013). In the proposed approach, the Markov game based safe policy is hybridized with a Lyapunov theory based stable policy for generating a 'safe and stable' policy. This hybridization is carried out in an annealed or gradual manner for arriving at a safe and stable game theoretic control.
Robotic manipulators (Katic and Vukobratovic, 2013) are highly coupled, non-linear and time varying uncertain systems. Furthermore, industrial robotic manipulators are employed for picking up and releasing objects, or they have to deal with a varying payload mass. This presents a highly challenging and complex task for testing our proposed approach. We test our approach on two degrees of freedom (DOF) robotic manipulators as they capture all the intricacies of a six DOF manipulator and are computationally less expensive. We employ the approach on two robotic arms, i.e., a standard two link robot arm and a SCARA.
The proposed controller belongs to the class of self learning/adaptive systems with roots in machine learning (Wiering and van Otterlo, 2012). In contrast to other Artificial Intelligence based and conventional controllers, RL based controllers do not assume access to a desired response or trajectory. The proposed controller assumes knowledge of neither the desired response nor the system model. The controller discovers optimal actions by repeated trial and error interactions with the system/plant it intends to control. It has access only to a heuristic reinforcement signal emanating from the plant, which tells the controller whether the action taken was good or bad. This makes the control task a very challenging one. The advantage is that the designed controller is self learning, adaptive and suitable for controlling an unknown system.
The paper is structured as follows: a systematic presentation of the RL approaches that lead to the formulation of the proposed methodology is given in Section 2. The formulation of the Lyapunov theory based stable Markov game fuzzy controller for the three tasks: (a) inverted pendulum, (b) two link robotic manipulator, and (c) SCARA, along with the simulation models and parameters thereof, is described in Section 3. Section 4 presents simulation results and a comparative evaluation of Lyapunov Markov game fuzzy control against baseline fuzzy Markov game control for the three problems. Section 5 summarises the paper and outlines the scope for future work.

2. Lyapunov theory based Markov game fuzzy approach


To facilitate reader understanding of the proposed approach,
we give a brief description of some relevant RL approaches.

2.1. Reinforcement learning algorithms

Reinforcement learning is an online learning paradigm wherein the learning agent's goal is to adapt its behavior to maximize/minimize a cumulative reward/cost obtained from the environment (Busoniu et al., 2010). The key feature that sets RL apart from other Artificial Intelligence based techniques is its extremely goal-oriented nature, and its ability to sacrifice short term gains for long term benefits.
There are various ways of designing an RL based controller (Wiering and van Otterlo, 2012). In principle, however, they can be broadly classified as (a) model based, and (b) model free. In model based RL an explicit model of the system is constructed, while in model free RL the model is built on the fly, as the agent attempts to control the system. Herein, we briefly describe the model free RL approach of Q learning (Busoniu et al., 2010). For other RL approaches the reader is referred to Wiering and van Otterlo (2012).
2.1.1. Q learning
At every time stage $k$, an adaptive agent (controller) chooses an action $a^k$ to be applied in the current state $s^k$. The agent then receives a reinforcement signal $r^k$ from the environment, and the environment transitions to the next state $s^{k+1}$ under the action $a^k$ chosen by the agent. The transition from state $s^k$ to $s^{k+1}$ is made as per the underlying state transition probability $p(s^k, s^{k+1})$. The agent aims to find the optimal policy $\pi^k: \pi^k(s^k) \rightarrow a^k;\ a^k \in A(s^k)$ so as to minimize the expected sum of discounted cost, i.e., $E\big[\sum_{j=0}^{\infty} \gamma^j r^{k+j}\big]$, where $r^{k+j}$ is the cost incurred $j$ steps into the future and $\gamma$ is the discount factor, $0 \le \gamma < 1$.

In Q learning (Wiering and van Otterlo, 2012), the Q-value defines the quality of a state-action pair, and is the total expected discounted cost incurred by a policy that takes action $a^k \in A(s^k)$ in state $s^k \in S$ and follows the optimal policy in the subsequent states. Q values implicitly contain information regarding the transition probabilities. For the state-action pair $(s^k, a^k)$, the Q-value is defined as:

$Q(s^k, a^k) = r^k + \gamma \sum_{s^{k+1} \in S} p(s^k, a^k, s^{k+1})\, V(s^{k+1})$   (1)

where
• $V(s^{k+1}) = \min_{a \in A(s^{k+1})} Q(s^{k+1}, a)$ is the state value;
• $r^k$ is the immediate cost of taking action $a^k$ at state $s^k$.

Q-values for every state-action pair can be evaluated by using (1) as an update rule in an iterative manner. In RL domains where the system model is not exactly known, implying that the transition probabilities $p(s^k, a^k, s^{k+1})$ are unknown, this cannot be implemented. Watkins (Wiering and van Otterlo, 2012) generalized Eq. (1), doing away with the need for an explicit system model, either in the form of a cost structure or transition probabilities:

$Q(s^k, a^k) \leftarrow Q(s^k, a^k) + \alpha\Big\{ r^k + \gamma \min_{a \in A(s^{k+1})} Q(s^{k+1}, a) - Q(s^k, a^k) \Big\}$   (2)

where $\alpha$ is the learning rate parameter, $0 \le \alpha < 1$; it governs the degree to which newly acquired information supersedes the earlier information.
Q-learning is guaranteed to converge to the optimal Q-values provided each state-action pair is visited infinitely often and the learning rate parameter is reduced in a gradual manner.
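As a concrete illustration of update (2), a minimal tabular sketch is given below; the toy environment, state/action sizes and cost signal are our own placeholders rather than the paper's benchmarks.

import numpy as np

# Minimal tabular Q-learning sketch of update (2); the environment is a dummy
# stand-in for the real plant and is not taken from the paper.
n_states, n_actions = 10, 3
gamma, alpha, epsilon = 0.9, 0.1, 0.1      # discount factor, learning rate, exploration
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(state, action):
    """Dummy transition: returns (next_state, cost)."""
    next_state = (state + action - 1) % n_states   # arbitrary deterministic transition
    cost = float(next_state != 0)                  # zero cost only in the goal state
    return next_state, cost

s = 5
for k in range(10000):
    # epsilon-greedy, cost-minimising action selection
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmin(Q[s]))
    s_next, r = step(s, a)
    # Q(s,a) <- Q(s,a) + alpha * [ r + gamma * min_a' Q(s',a') - Q(s,a) ]
    Q[s, a] += alpha * (r + gamma * np.min(Q[s_next]) - Q[s, a])
    s = s_next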

Q learning can be extended to continuous state-action space problems by using fuzzy systems as generic function approximators, termed fuzzy Q learning (Sharma and Gopal, 2008). Q learning, as well as fuzzy Q learning, assumes the underlying system to be an MDP, which in our view is quite restrictive. This is particularly true when the controller has to handle disturbances and parameter variations. In Sharma and Gopal (2008) the authors proposed a game theory based approach called fuzzy Markov games for controller optimization in the presence of external disturbances and parameter variations. We briefly describe the fuzzy Markov game approach.

2.2. Fuzzy Markov games

Fuzzy Markov games generalise Markov games (Sharma and Gopal, 2008) to high/infinite dimensional state-action spaces by using fuzzy inference systems (FIS). Markov games were initially designed for agent optimization in a multi agent scenario but can be used for non-linear RL controller optimization. In fuzzy Markov games, controller optimization is viewed as a two player zero sum game between the controller and the disturber. We have $N$ rules of the following form:

$R_i$: If $s_1^k$ is $L_1^i$ and $\ldots$ and $s_n^k$ is $L_n^i$
then $a = a_1$ and $d = d_1$ with $q(i, 1, 1)$
or $a = a_1$ and $d = d_2$ with $q(i, 1, 2)$
$\ldots$
or $a = a_m$ and $d = d_o$ with $q(i, m, o)$   (3)

$R_i$ being the $i$th rule of the rule-base, $A = \{a_1, a_2, \ldots, a_m\}$ is the controller action set while $D = \{d_1, d_2, \ldots, d_o\}$ is the action set for the disturber. The parameter $q(i, a_i, d_i)$ represents the quality of a controller-disturber action pair for each rule $R_i$. For each rule $R_i$, a game matrix (e.g., for $|A| = 3$, $|D| = 3$) is constructed (Fig. 1).

Fig. 1. Game matrix.

2.2.1. Global continuous control action
We calculate the optimal policy $a_i^*(s^k)$, a probability distribution over the controller action set $A$, by using linear programming (Sharma and Gopal, 2008) on the game matrix (Fig. 1) at state $s^k$:

$a_i^*(s^k) = \arg\min_{a_i \in PD(A)} \max_{d_i \in D} \sum_{a_i \in A} q(i, a_i, d_i)\, a_i$   (4)

$PD(A)$ is a probability distribution on the controller action set $A$.

At each stage $k$, the input state vector $s^k$ is matched to the rule antecedents of the rule base (3) to get the rule-firing strengths $\phi_i(s^k)$. A global control policy $a^*(s^k)$ is generated from the policies of each rule $R_i$, with the rule firing strength vector $\phi(s^k) = \{\phi_1(s^k)\ \phi_2(s^k) \ldots \phi_i(s^k) \ldots \phi_N(s^k)\}$ as weights, i.e.,

$a^*(s^k) = \frac{\sum_{i=1}^{N} a_i^*(s^k)\,\phi_i(s^k)}{\sum_{i=1}^{N} \phi_i(s^k)}$   (5)

The next step is to use an exploration-exploitation policy (EEP) (Wiering and van Otterlo, 2012), resulting in a Minimax policy:

$\bar{a}_i(s^k) = \underset{a}{\mathrm{Minimax}}\ a_i(s^k) = \begin{cases} a_i^r(s^k) & \text{with probability } \varepsilon \\ a_i^*(s^k) & \text{otherwise} \end{cases}$   (6)

We have used pseudo-stochastic exploration wherein the exploration (the $\varepsilon$ parameter) is gradually reduced to 0.02. The policy $\bar{a}_i(s^k)$ is random with probability $\varepsilon$; $a_i^*(s^k)$ is the minimax solution (4) and $a_i^r(s^k)$ is a random policy dictating random action selection.

A global EEP policy $\bar{a}(s^k)$ is generated from these rule-specific policies as:

$\bar{a}(s^k) = \frac{\sum_{i=1}^{N} \bar{a}_i(s^k)\,\phi_i(s^k)}{\sum_{i=1}^{N} \phi_i(s^k)}$   (7)

Using this policy we get the global continuous control action as:

$a^{mgc}(s^k) = \bar{a}(s^k)\, a^k$   (8)

$a^{mgc}(s^k)$ is the fuzzy Markov game controller action at state $s^k$. This continuous action is applied to the system, which transitions to the next state $s^{k+1}$ and generates a reinforcement signal $r^k$. The game solver (linear program) uses the q-parameter estimates and generates a desired value $V_i(s^{k+1})$ for the next state $s^{k+1}$:

$V_i(s^{k+1}) = \min_{a_i \in PD(A)} \max_{d_i \in D} \sum_{a_i \in A} q(i, a_i, d_i)\, a_i; \quad a_i \in A,\ d_i \in D$   (9)

Next, we generate a global desired value $V(s^{k+1})$ as:

$V(s^{k+1}) = \frac{\sum_{i=1}^{N} V_i(s^{k+1})\,\phi_i(s^{k+1})}{\sum_{i=1}^{N} \phi_i(s^{k+1})}$   (10)

2.2.2. Global Q-value
The global Q value is the weighted average of the individual rule based q values and is evaluated for each rule $R_i$, based on the solution of the game matrix at state $s^k$. The probability $a_i^*(s^k)$ is the linear programming solution of the game matrix defined at the state under each rule $R_i$ (4). Using the EEP we get the policy $\bar{a}_i(s^k)$ (6), from which actions $a_i(s^k)$ are obtained stochastically to form the global Q-value. Q-values are inferred from the quality of the local Minimax discrete actions:

$Q(s^k) = \frac{\sum_{i=1}^{N} q(i, a_i, d_i)\,\phi_i(s^k)}{\sum_{i=1}^{N} \phi_i(s^k)}$   (11)

$a_i = a_i(s^k)$ is the Minimax controller action, and $d_i = d_i(s^k)$ is the disturbance action for rule $R_i$. The inferred disturbance action is the weighted sum of the rule specific $d_i$ (in each rule $R_i$):

$d(s^k) = \frac{\sum_{i=1}^{N} d_i(s^k)\,\phi_i(s^k)}{\sum_{i=1}^{N} \phi_i(s^k)}$   (12)

2.2.3. q update
For controlling the system we use the continuous action $a^{mgc}$, but for learning we use the discrete action $a_i$ chosen stochastically under each rule $R_i$. The temporal difference (TD) error is calculated as:

$\Delta Q = r^k + \gamma V(s^{k+1}) - Q(s^k)$   (13)

q-values are updated as:

$q(i, a_i, d_i) \leftarrow q(i, a_i, d_i) + \alpha\, \Delta Q\, \frac{\phi_i(s^k)}{\sum_{i=1}^{N} \phi_i(s^k)}$   (14)
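To make the per-rule minimax solution in (4)/(9) and the firing-strength weighted blending in (5), (7) and (10)-(12) concrete, here is a minimal Python sketch; the 3 x 3 cost matrix, the function names and the use of scipy's linear programming solver are our own assumptions, not the paper's implementation.

import numpy as np
from scipy.optimize import linprog

def minimax_policy(q_rule):
    """Safe (minimax) mixed strategy for one rule's m x o game matrix of costs
    q(i, a, d): minimise over pi the worst-case expected cost over disturber actions.
    Linear program: min v  s.t.  sum_a pi_a * q[a, d] <= v for every d,  sum_a pi_a = 1."""
    m, o = q_rule.shape
    c = np.r_[np.zeros(m), 1.0]                       # objective: minimise the game value v
    A_ub = np.c_[q_rule.T, -np.ones(o)]               # one row per disturber action d
    b_ub = np.zeros(o)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)      # probabilities sum to one
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]        # pi >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    pi, v = res.x[:m], res.x[-1]
    return pi, v                                      # pi plays the role of a_i*, v that of V_i

def fuzzy_blend(values, firing_strengths):
    """Firing-strength weighted average, as used in (5), (7) and (10)-(12)."""
    w = np.asarray(firing_strengths, dtype=float)
    return np.tensordot(w, np.asarray(values), axes=1) / w.sum()

# Tiny usage example with an arbitrary 3 x 3 cost matrix for one rule
q_example = np.array([[1.0, 3.0, 2.0],
                      [2.0, 1.0, 4.0],
                      [3.0, 2.0, 1.0]])
pi_star, v_star = minimax_policy(q_example)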

2.3. Lyapunov theory based control

Lyapunov theory has emerged as a powerful method for designing controllers that are guaranteed to be stable or can maintain system operation within an intended range (Levine, 2009). In Lyapunov's direct or second method, the key idea is to construct a scalar positive Lyapunov function comprising the system's state variables and then to derive a control law that makes this function's derivative negative along system trajectories. The resultant controller is guaranteed to be stable.
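As a textbook illustration of the direct method (our example, not taken from the paper): for the scalar system $\dot{x} = -x + u$ with feedback $u = -kx$, $k \ge 0$, the candidate $V(x) = \tfrac{1}{2}x^2 > 0$ gives

$\dot{V}(x) = x\dot{x} = -(1 + k)x^2 < 0 \quad \text{for } x \ne 0,$

so the closed loop is asymptotically stable. The controllers derived below follow the same recipe, with a cost based candidate function.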
2.3.1. Inverted pendulum control
The control law proposed in this paper attempts to make the inverted pendulum locally asymptotically stable around its unstable upright position for a very large domain of attraction. The Lyapunov function is chosen as the infinite cumulative cost $C(r^k) = \sum_{k=0}^{\infty} r^k = \sum_{k=0}^{\infty}\big[(\theta^k)^2 + \lambda(\dot{\theta}^k)^2\big]$, $\lambda > 0$, and a control law is obtained which makes this Lyapunov function's first derivative negative, or drives the system to a desired domain of attraction. The control law thus obtained is guaranteed to be stable as per Lyapunov's stability theorem (Levine, 2009).

The dynamics of the inverted pendulum can be described by a state variable formulation of the form $s^k = (\theta^k, \dot{\theta}^k)$, where $\theta^k$ is the angle of the pendulum from the vertical position at instant $k$. The Lyapunov theory based control action that makes the value function derivative negative is (Aguilar-Ibanez, 2008):

$a^{lyap}(s^k) = \frac{1}{\Delta(\theta^k)}\Big[\psi(\theta^k, \dot{\theta}^k)\cos\theta^k + \mu(\theta^k, \dot{\theta}^k)\Big]$   (15)

where $\Delta(\theta^k) = 1 - \beta\cos^2\theta^k$, $\psi(\theta^k, \dot{\theta}^k) = \big([\dot{\theta}^k]^2 + \lambda\cos\theta^k\big)\sin\theta^k$, and $\lambda > 0$ is a real, positive constant.

2.3.2. Robotic manipulator control
For each link of the robotic manipulator we have two state variables, $s_1^k = (\theta_1^k, \dot{\theta}_1^k)$ and $s_2^k = (\theta_2^k, \dot{\theta}_2^k)$. $(\theta_1^k, \dot{\theta}_1^k)$ are the angle of link 1 and its derivative, while $(\theta_2^k, \dot{\theta}_2^k)$ are the corresponding variables for link 2. The control law which guarantees the system to be locally asymptotically stable along all the system trajectories can be derived using Lyapunov's direct method (for link 1) as:

$a^{lyap}(s_1^k) = \frac{1}{\Delta(\theta_1^k)}\Big[\psi(\theta_1^k, \dot{\theta}_1^k)\cos\theta_1^k + \mu(\theta_1^k, \dot{\theta}_1^k)\Big]$   (16)

where

$\Delta(\theta_1^k) = 1 - \beta\cos^2\theta_1^k, \qquad \psi(\theta_1^k, \dot{\theta}_1^k) = \big([\dot{\theta}_1^k]^2 + \lambda\cos\theta_1^k\big)\sin\theta_1^k$   (17)

and $\lambda > 0$ is a real, positive constant. A similar expression is obtained for link 2 of the manipulator with $s_2^k = (\theta_2^k, \dot{\theta}_2^k)$.

2.4. Lyapunov theory based Markov game fuzzy control
The Lyapunov theory based Markov game fuzzy controller's action is an annealed mix of the fuzzy Markov game based action and the Lyapunov theory based control action. The approach uses a rule-action pair visits dependent parameter $\kappa^k \in [0, 1]$, which governs the degree of hybridisation of the two mechanisms based on experiential information:

$\kappa^k = \frac{n^k(i, a_i, d_i)}{n_o + n^k(i, a_i, d_i)}$

where $n_o > 0$ is a constant and $n^k(i, a_i, d_i)$ is a count of the number of times the (rule, controller action, disturber action) tuple has been visited till instant $k$. The Lyapunov theory based Markov game controller action is an annealed combination of the fuzzy Markov game based action (8) and the Lyapunov theory based action (16) as per:

$a^{lyap\_mg}(s^k) = (1 - \kappa^k)\, a^{mgc}(s^k) + \kappa^k\, a^{lyap}(s^k)$   (18)

Under this control action $a^{lyap\_mg}(s^k)$, the inverted pendulum transitions to the next state $s^{k+1}$ and generates the cost:

$r^k_{lyap\_mg} = \begin{cases} (\theta^k)^2 + \lambda(\dot{\theta}^k)^2 & \text{if } |\theta^k| < 180^\circ \\ 100 & \text{if } |\theta^k| > 180^\circ \end{cases}$   (19)

i.e., a high cost if the pendulum meets the failure condition, otherwise a quadratic cost to incentivize the pendulum to remain in the upright position.

The temporal difference error corresponding to the Markov game controller, $\Delta Q$ (13), gets modified to:

$\Delta Q_{lyap\_mg} = r^k_{lyap\_mg} + \gamma V(s^{k+1}) - Q(s^k)$   (20)

We use the reward $r^k_{lyap\_mg}$ (19) to update the candidate function $C(r^k_{lyap\_mg})$ online. $\Delta Q_{lyap\_mg}$ is used in the q value update (14). For the robotic manipulator problem we have:

$a^{lyap\_mg}(s_n^k) = (1 - \kappa^k)\, a^{mgc}(s_n^k) + \kappa^k\, a^{lyap}(s_n^k); \quad n = 1, 2 \text{ being the link variable}$   (21)

The Lyapunov theory based Markov game fuzzy controller is detailed in Fig. 2. The figure shows the various modules that take part in generating a safe and stable control policy.

Fig. 2. Lyapunov theory based Markov game fuzzy controller.

We now briefly describe how the proposed controller arrives at a game theoretic stable policy. The FIS antecedent part (3) receives the current state, control action and disturber action to generate a rule firing strength vector $\phi(s^k) = \{\phi_1(s^k)\ \phi_2(s^k) \ldots \phi_i(s^k) \ldots \phi_N(s^k)\}$. The FIS consequent part in (3) generates rule specific q values. These q values are used to form a game matrix which is solved using linear programming to generate an optimal policy $a^*(s^k)$ (5) and an optimal value $V(s^{k+1})$ (9).

The optimal policy (linear programming solution) is mixed with a random policy to generate an EEP policy $\bar{a}(s^k)$ (7), which in turn is used to derive the Markov game control action $a^{mgc}$ (8). The q values are also used in conjunction with an EEP policy to get the global Q value $Q(s^k)$ (11). The Lyapunov module consists of a candidate function and a Lyapunov action ($a^{lyap}$) generator as per (16). The Lyapunov theory based action and the Markov game based action are combined in the hybridizer to produce $a^{lyap\_mg}$ (18).

Under this action, the system transitions to the next state $s^{k+1}$ and generates a reinforcement signal $r^k_{lyap\_mg}$ (19), which indicates the goodness or otherwise of the control action $a^{lyap\_mg}$. This reinforcement signal is used to generate a TD error $\Delta Q_{lyap\_mg}$ (20), which is used to tune the q values (14), thus making the controller adaptive. This entire process of matching the current system state to the FIS antecedents (Part A) to generate the rule firing strength vector, generating the optimal control policy, rewards/costs, TD error, and tuning of q values is repeated at the next state. This sequential decision making procedure and the associated refining of q values continues till either a goal state is reached or the system fails. The q values get progressively refined as the reinforcement learning (experience) goes on, and the controller ultimately learns a safe and stable policy to counter parameter variations and disturbances.
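A minimal sketch of the visit-count based annealing $\kappa^k = n^k/(n_o + n^k)$ and the hybrid action in (18) and (21) is given below; the class name, the bookkeeping details and the value of $n_o$ are illustrative choices of ours.

import numpy as np

class LyapMGHybridizer:
    """Annealed mixing of the Markov game action and the Lyapunov action.
    kappa^k = n / (n0 + n) is computed from the visit count of the
    (rule, controller action, disturber action) tuple, and the applied control
    is (1 - kappa) * a_mgc + kappa * a_lyap, mirroring Eq. (18)."""

    def __init__(self, n_rules, n_actions, n_disturbances, n0=100.0):
        self.n0 = float(n0)                     # n_o > 0, illustrative value
        self.visits = np.zeros((n_rules, n_actions, n_disturbances))

    def kappa(self, i, ai, di):
        n = self.visits[i, ai, di]
        return n / (self.n0 + n)

    def action(self, i, ai, di, a_mgc, a_lyap):
        k = self.kappa(i, ai, di)
        self.visits[i, ai, di] += 1             # record this (rule, a_i, d_i) visit
        return (1.0 - k) * a_mgc + k * a_lyap   # Eq. (18); applied per link it gives Eq. (21)

# Usage sketch: hyb = LyapMGHybridizer(9, 3, 3); u = hyb.action(i, ai, di, a_mgc, a_lyap)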

3. Lyapunov theory based Markov game controller realization

3.1. Inverted pendulum control

We employ the proposed approach for controlling the benchmark inverted pendulum. The state space of the system consists of two real-valued variables, $\theta$ (pole angle, in rad) and $\dot{\theta}$ (pole velocity, in rad/s). Fig. 3 shows the inverted pendulum.

Fig. 3. Inverted pendulum control.

We apply the Lyapunov Markov game fuzzy control action to the cart to balance the pendulum (Kumar et al., 2012). The pendulum is initialised from a position close to the $(\theta, \dot{\theta}) = (0, 0)$ position (a standard practice). A trial is terminated when either the pendulum remains balanced for 3000 time steps, equivalent to 3 min of balancing in real time, or it meets the failure condition $|\theta| > 180^\circ$. The simulation model used is:

$\ddot{\theta} = \dfrac{g\sin\theta - \dfrac{\alpha\, m_p\, l\, \dot{\theta}^2 \sin(2\theta)}{2} - \alpha\cos(\theta)\, F}{\dfrac{4}{3}l - \alpha\, m_p\, l \cos^2\theta}$   (22)

where $g$ is the gravity constant ($g = 9.8\ \mathrm{m/s^2}$), $m_p$ is the mass of the pendulum ($m_p = 2\ \mathrm{kg}$), $M$ is the mass of the cart ($M = 8\ \mathrm{kg}$), $l$ is the pole length ($l = 0.5\ \mathrm{m}$) and $\alpha = 1/(M + m_p)$. $F$ is the magnitude of the force applied to the cart, which corresponds to $a^{lyap\_mg}(s^k)$ in our case. The simulation parameters used herein are: learning rate parameter $\alpha = 0.09$, discount factor $\gamma = 0.88$, and action set for the applied force $F \in \{-90, 0, +90\}\ \mathrm{N}$.

We partition each pendulum variable into three Gaussian fuzzy subsets, i.e., $(\theta^k, \dot{\theta}^k)$ are each divided into three subsets, thereby generating $3^2 = 9$ fuzzy rules for the Lyapunov theory based Markov game fuzzy controller. The membership degree of the input variables is computed using the Gaussian membership function:

$\mu_{l_p}(x_j) = e^{-\frac{(x_j - x_j^{l_p})^2}{2\sigma_j^2}}; \quad l_p = 1, 2, 3; \; j = 1, 2$   (23)

where $l_p$ are the fuzzy labels for variable $j$, $(x_1 = \theta^k,\ x_2 = \dot{\theta}^k)$. The fuzzy label centers are defined as $x_j^{l_p} = a_j + b_j(l_p - 1)$ with $a_1 = -12$, $a_2 = -50$, $b_1 = 12$, $b_2 = 50$, and the widths are $\sigma_1 = 12$, $\sigma_2 = 50$. The simulation parameters, fuzzification scheme and force action set are kept identical for the baseline fuzzy Markov game fuzzy controller (MGFC).
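The sketch below computes the Gaussian memberships of (23) and the nine rule firing strengths for a pendulum state; the negative sign on the centre offsets and the use of a product T-norm are our reading/assumptions rather than statements made in the paper.

import numpy as np
from itertools import product

# Gaussian fuzzification of the pendulum state (theta, theta_dot), Section 3.1.
a = np.array([-12.0, -50.0])      # centre offsets a_j (sign assumed, giving symmetric centres)
b = np.array([12.0, 50.0])        # centre spacing b_j
sigma = np.array([12.0, 50.0])    # Gaussian widths sigma_j

centres = np.array([a + b * (lp - 1) for lp in (1, 2, 3)]).T   # shape (2 variables, 3 labels)

def membership(x):
    """Membership degree of each variable in each of its three fuzzy labels, Eq. (23)."""
    x = np.asarray(x, dtype=float).reshape(2, 1)
    return np.exp(-((x - centres) ** 2) / (2.0 * sigma.reshape(2, 1) ** 2))

def firing_strengths(x):
    """Rule firing strengths phi_i for the 3 x 3 = 9 rules (product T-norm assumed)."""
    mu = membership(x)
    return np.array([mu[0, i] * mu[1, j] for i, j in product(range(3), range(3))])

phi = firing_strengths([5.0, -20.0])   # example state, in the units used for the fuzzification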

3.2. Robotic manipulator control

A robot manipulator having $n$ joints may be modeled as (Katic and Vukobratovic, 2013; Sharma and Gopal, 2008):

$M(\theta)\ddot{\theta} + C(\theta, \dot{\theta})\dot{\theta} + g(\theta) + f(\dot{\theta}) + u_d = u$   (24)

$\theta$ corresponds to the $n \times 1$ vector of joint angles, $u$ is the $n \times 1$ vector of torques applied at each joint, $M(\theta)$ specifies the $n \times n$ inertia matrix, $C(\theta, \dot{\theta})$ is the $n \times n$ centrifugal and Coriolis matrix, $f(\dot{\theta})$ corresponds to the $n \times 1$ friction vector, $g(\theta)$ refers to the $n \times 1$ gravitational vector, while $u_d$ refers to an $n \times 1$ vector of external disturbances.

The proposed Lyapunov theory based Markov game fuzzy controller (LyapMGFC) has been implemented as a decentralized controller, i.e., one controller for each link of the manipulator, with $(\theta(n), \dot{\theta}(n))$ being the joint variables for the $n$th link. Each link variable has been fuzzified into three partitions, which leads to nine rules corresponding to one link and overall eighteen rules for both links. The membership functions for fuzzification have been adapted from Sharma and Gopal (2008) and consist of:

$\mu_{l_p}(x_j(n)) = e^{-\frac{(x_j(n) - x_j^{l_p}(n))^2}{2\sigma_j(n)^2}}; \quad l_p = 1, 2, 3; \; j = 1, 2; \; n = 1, 2$   (25)

where $l_p$ is the (fuzzy) label for the joint variable $j$ in link $n$, with $x_1(1) = \theta(1)$, $x_1(2) = \theta(2)$, $x_2(1) = \dot{\theta}(1)$, $x_2(2) = \dot{\theta}(2)$. The centers and widths corresponding to these fuzzy variables are $x_j^{l_p}(n) = a_j(n) + b_j(n)(l_p - 1)$ with $a_1(1) = a_1(2) = -2$, $a_2(1) = a_2(2) = -10$, $b_1(1) = b_1(2) = 2$, $b_2(1) = b_2(2) = 10$ being the center values for the fuzzy sets, and the width values are $\sigma_1(1) = \sigma_1(2) = 1.5$, $\sigma_2(1) = \sigma_2(2) = 8$. The tracking error is calculated as $e^k(n) = \theta_d^k(n) - \theta^k(n);\ n = 1, 2$, and the cost function is defined as $c^k(n) = \dot{e}^k(n) + \lambda e^k(n);\ \lambda = \lambda^T > 0$. $\theta_d^k = [\theta_d^k(1)\ \theta_d^k(2)]$ is the desired trajectory vector. The exploration level is fixed at 0.4 initially and then decayed from 0.4 to 0.02 over the iterations. The learning-rate parameter is fixed at 0.05 and the discount factor is fixed at 0.9.

3.2.1. Standard two link robot
We simulate the robotic manipulator using a 4th order Runge-Kutta (RK) integrator with a sampling time of 20 ms for a time horizon of 20 s. The simulation model and other parameters can be found in Appendix A1. Two cases have been considered: (i) control with external disturbance, i.e., disturbance torques with a distribution of ±20% (Gaussian) around the torque applied to the robot; (ii) control with external disturbances and payload variations: the controller works against external disturbances as in (i), and payload mass variations. The mass $m_2$ is varied as: (a) $t < 4$ s, $m_2 = 1$ kg, (b) $4 < t \le 7$ s, $m_2 = 2.5$ kg, (c) $7 < t \le 9$ s, $m_2 = 1$ kg, (d) $9 < t \le 12$ s, $m_2 = 4$ kg, (e) $12 < t \le 15$ s, $m_2 = 1$ kg, (f) $15 < t \le 18$ s, $m_2 = 2$ kg, and (g) $18 < t \le 20$ s, $m_2 = 1$ kg. We have kept the simulation and learning parameters the same for both the MGFC and LyapMGFC controllers.

3.2.2. Selective compliance assembly robotic arm (SCARA)
The simulation model (Sharma and Gopal, 2008) and parameters can be found in Appendix A2. The robotic arm has been simulated for 16 s using the fourth-order Runge-Kutta method with a fixed time step of 20 ms. The simulation and controller parameters for both controllers (LyapMGFC and MGFC) are kept the same as in the two-link robotic arm.
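A generic fourth-order Runge-Kutta step of the kind used for these simulations (20 ms sampling) can be sketched as follows; the dynamics callable and the torque argument are placeholders, not the paper's code.

import numpy as np

def rk4_step(f, x, u, dt=0.02):
    """One 4th-order Runge-Kutta step for dx/dt = f(x, u), with the control/torque u
    held constant over the 20 ms sampling interval used in Section 3.2."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

# Usage sketch: x_next = rk4_step(dynamics, x, tau), with 'dynamics' returning the state derivative.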

4. Simulation results

We give a comparison of the proposed controller against the baseline game theory based RL control, namely Markov game fuzzy RL control. The proposed Lyapunov theory based Markov game fuzzy control shares all shortcomings and advantages with the corresponding Markov game fuzzy control, both being RL based approaches. This provides a fair comparison ground for the proposed control approach. This is more so because our primary claim is that the inclusion of a Lyapunov theory based action generation mechanism makes the fuzzy Markov game controller stable. For a comparison of fuzzy Markov game control against conventional and other RL based methods, we refer the reader to Sharma and Gopal (2008).

4.1. Inverted pendulum control

We give simulation results of applying the proposed LyapMGFC control approach to the inverted pendulum and compare it against the baseline Markov game fuzzy control (MGFC) approach (Sharma, 2015).

Fig. 4 shows the success probability comparison, wherein it can be seen that the LyapMGFC controller achieves 100% success in about 600 trials (350 trials for MGFC); a conspicuous difference is that although MGFC achieves a perfect success percentage in 350 trials, it fails to maintain it. On the other hand, LyapMGFC maintains its level of performance, i.e., it is more consistent. Furthermore, LyapMGFC achieves 100% stable success much earlier than MGFC.

Fig. 4. Probability of success achieved by controllers.

Fig. 5 shows the average balancing steps (averaged over 25 steps) comparison of the controllers. The LyapMGFC controller achieves 3000 balance steps in about 580 trials, while MGFC takes about 850 trials. But more importantly, LyapMGFC exhibits a very stable behavior as it maintains 100% success all along. We observe that the LyapMGFC controller outperforms the MGFC controller in terms of consistency.

Fig. 5. Average balancing steps comparison.

Finally, Fig. 6 shows the pole angle trajectory comparison of the two controllers for a successful trial. We observe a lower pole angle tracking error for the LyapMGFC controller.

Fig. 6. Pole angle trajectory.

4.2. Two link robot arm control

4.2.1. Disturbances and parameter variations
Figs. 7 and 8 show link 1 and link 2 tracking errors for the standard two link robotic manipulator. It is seen that the Lyapunov theory based Markov game controller has lower tracking errors. Furthermore, the Lyapunov theory based Markov game fuzzy controller is able to handle payload mass variations more effectively. The proposed controller makes the robot arm track the desired trajectory with very low errors and is fast.

Fig. 7. Two link controller comparison: tracking error for link 1.
Fig. 8. Two link controller comparison: tracking error for link 2.

Figs. 9 and 10 depict the torques exerted at both links. For link 1 the torques exerted are lower, while for link 2 they are comparable to (at times lower than) those of the corresponding Markov game controller. These results indicate the relative superiority of Lyapunov theory based control over Markov game control. Lower torques at the joints is a distinct advantage of LyapMGFC over MGFC.

Fig. 9. Two link controller comparison: torque exerted at link 1.
Fig. 10. Two link controller comparison: torque exerted at link 2.

4.2.2. Two link manipulator control without payload variations
Figs. 11 and 12 show the tracking error comparison of the controllers when the robotic arm does not handle any payload, i.e., a pure tracking problem. The Lyapunov theory based controller has a lower error for link 1, while it is comparable in the case of link 2.

Fig. 11. Two link comparison without payload variations: tracking error for link 1.
Fig. 12. Two link comparison without payload variations: tracking error for link 2.

Figs. 13 and 14 show the control torque comparison for both links. We observe that in the absence of payload variations the Lyapunov theory based controller's performance is at par with the Markov game controller. This is probably because the Markov game setup is designed exclusively for handling disturbances, while the inclusion of Lyapunov theory adds a stability aspect to the controller.

Fig. 13. Two link comparison without payload variations: torque exerted at link 1.
Fig. 14. Two link comparison without payload variations: torque exerted at link 2.

4.3. SCARA control

The payload mass $m$ is varied as: (a) $t < 4$ s, $m_2 = m$, (b) $4 < t \le 7$ s, $m_2 = 2m$, (c) $7 < t \le 10$ s, $m_2 = m$, (d) $9 < t \le 13$ s, $m_2 = 3m$, (e) $13 < t \le 16$ s, $m_2 = m$.

Figs. 15 and 16 show the tracking error comparison of the controllers for SCARA. We observe that Lyapunov theory based Markov game control is slightly better than Markov game control for link 1 and almost comparable for link 2.

Fig. 15. SCARA controller comparison: tracking error for link 1.
Fig. 16. SCARA controller comparison: tracking error for link 2.

Figs. 17 and 18 show the control torque comparison for SCARA. Lyapunov theory based control results in comparable torques for link 1 and lower torques for link 2. We admit that the results for SCARA are not as convincing as in the case of the two link robot. However, they do show that Lyapunov theory based control is viable, if not superior, for SCARA. We again emphasize that the main advantage of the Lyapunov theory based approach is the stability guarantee it holds, which is nonexistent in the case of pure Markov game control.

Fig. 17. SCARA controller comparison: torque exerted at link 1.
Fig. 18. SCARA controller comparison: torque exerted at link 2.

Without payload variations, both MGFC and LyapMGFC have identical performance. The results once again indicate the superior parameter variation handling capability of the Lyapunov theory based control over the pure Markov game controller. Last but not the least, an advantage the Lyapunov theory based control holds is a guaranteed tracking stability.

5. Conclusions and future scope

In this work we have presented a novel Lyapunov theory based Markov game fuzzy controller which offers stable and safe control. The proposed controller demonstrated its superiority over the baseline Markov game fuzzy controller in terms of robustness and stability, as evidenced by simulations on the benchmark problems: (a) inverted pendulum, (b) two link robot arm, and (c) SCARA. In future we intend to apply the proposed scheme to partially observable Markov domains using the POMDP framework.

Appendix A1. Two-link robot arm control

The simulation model (Fig. 19) for the two-link robot arm has been taken from Sharma and Gopal (2008). A two-link robot arm has all the nonlinear effects common to general robot manipulators and is easy to simulate.

• Manipulator dynamics

$\begin{bmatrix} \alpha + \beta + 2\eta\cos\theta_2 & \beta + \eta\cos\theta_2 \\ \beta + \eta\cos\theta_2 & \beta \end{bmatrix} \begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} + \begin{bmatrix} -\eta(2\dot{\theta}_1\dot{\theta}_2 + \dot{\theta}_2^2)\sin\theta_2 \\ \eta\dot{\theta}_1^2\sin\theta_2 \end{bmatrix} + \begin{bmatrix} \alpha e_1\cos\theta_1 + \eta e_1\cos(\theta_1 + \theta_2) \\ \eta e_1\cos(\theta_1 + \theta_2) \end{bmatrix} = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix}$

$\alpha = (m_1 + m_2)a_1^2, \quad \beta = m_2 a_2^2, \quad \eta = m_2 a_1 a_2, \quad e_1 = g/a_1$

Manipulator parameters: $a_1 = a_2 = 1$ m, $m_1 = m_2 = 1$ kg.

• Desired trajectory: $\theta_1^{desired} = \sin(t)$, $\theta_2^{desired} = \cos(t)$.

• System state-space is continuous and has four variables, i.e., $x^k = (\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2)$.

The control task is to apply a sequence of torques $\tau(k) = [\tau_1(k)\ \tau_2(k)]$ at each joint so that the joints follow a prescribed trajectory in joint space, i.e., $q_d = [\theta_1^{desired}, \theta_2^{desired}]$.
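A Python sketch of the corresponding joint-acceleration computation, following our reconstruction of the printed matrices (friction, external disturbance torque and payload variation are omitted):

import numpy as np

def two_link_dynamics(x, tau, m1=1.0, m2=1.0, a1=1.0, a2=1.0, g=9.8):
    """State derivative for the Appendix A1 two-link arm, x = [th1, th2, th1d, th2d].
    Solves M(theta) * thdd = tau - C(theta, thd) - G(theta) for the joint accelerations."""
    th1, th2, th1d, th2d = x
    alpha = (m1 + m2) * a1 ** 2
    beta = m2 * a2 ** 2
    eta = m2 * a1 * a2
    e1 = g / a1
    c2, s2 = np.cos(th2), np.sin(th2)
    M = np.array([[alpha + beta + 2 * eta * c2, beta + eta * c2],
                  [beta + eta * c2,             beta]])
    C = np.array([-eta * (2 * th1d * th2d + th2d ** 2) * s2,
                   eta * th1d ** 2 * s2])
    G = np.array([alpha * e1 * np.cos(th1) + eta * e1 * np.cos(th1 + th2),
                  eta * e1 * np.cos(th1 + th2)])
    thdd = np.linalg.solve(M, np.asarray(tau, dtype=float) - C - G)
    return np.array([th1d, th2d, thdd[0], thdd[1]])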


Fig. 19. Simulation model for two-link robotic manipulator.

Appendix A2. SCARA robot control

A 2 DOF SCARA consists of two parallel revolute joints and two links (Fig. 20). The dynamical model of the robot has been taken from Sharma and Gopal (2008), and the numerical values of the parameters are the same as in Sharma and Gopal (2008).

Fig. 20. Two DOF SCARA robot.

• Dynamical model and parameters

$\begin{bmatrix} p_1 + p_2 c_2 & p_3 + 0.5 p_2 c_2 \\ p_3 + 0.5 p_2 c_2 & p_3 \end{bmatrix} \begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} + \begin{bmatrix} p_4 - p_2 s_2 \dot{\theta}_2 & -0.5 p_2 s_2 \dot{\theta}_2 \\ 0.5 p_2 s_2 \dot{\theta}_1 & p_5 \end{bmatrix} \begin{bmatrix} \dot{\theta}_1 \\ \dot{\theta}_2 \end{bmatrix} = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix}$

$p_1 = 5.0, \; p_2 = 0.9, \; p_3 = 0.3, \; p_4 = 3.0, \; p_5 = 0.6, \; s_2 = \sin(\theta_2), \; c_2 = \cos(\theta_2)$

• Desired trajectories

The desired trajectories for the joint angles $\theta_1 = \theta(1)$ and $\theta_2 = \theta(2)$ are set as in Sharma and Gopal (2008): $\theta^{desired}(1) = 0.5 + 0.2(\sin t + \sin 2t)$ (rad), $\theta^{desired}(2) = 0.13 - 0.1(\sin t + \sin 2t)$ (rad).

References

Aguilar-Ibanez, C., 2008. A constructive Lyapunov function for controlling the inverted pendulum. In: American Control Conference, Westin Seattle Hotel, Seattle, Washington, USA, June 11–13.
Busoniu, L., Babuska, R., De Schutter, B., Ernst, D., 2010. Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press.
Ju, X., Lian, C., Zuo, L., He, H., 2014. Kernel based approximate dynamic programming for real time online learning control: an experimental study. IEEE Trans. Control Syst. Technol. 22 (1).
Katic, D., Vukobratovic, M., 2013. Intelligent Control of Robotic Systems. Springer Science.
Kobayashi, K., Mizoue, H., Kuremoto, T., Obayashi, M., 2009. A meta-learning method based on temporal difference error. In: ICONIP 2009, Part I, LNCS, vol. 5863, pp. 530–537.
Kumar, R., Nigam, M.J., Sharma, S., Bhavsar, P., 2012. Temporal difference based tuning of fuzzy logic controller through reinforcement learning to control an inverted pendulum. Int. J. Intell. Syst. Appl. 9, 15–21.
Levine, J., 2009. Analysis and Control of Nonlinear Systems. Springer-Verlag, London.
Lin, Chuan-Kai, 2009. H-infinity reinforcement learning control of robot manipulators using fuzzy wavelet networks. Fuzzy Sets Syst. 160, 1765–1786.
Liu, Q., Zhou, X., Zhu, F., Fu, Q., Fu, Y., 2014. Experience replay for least squares policy iteration. IEEE/CAA J. Autom. Sin. 1 (3).
Sharma, Rajneesh, Gopal, M., 2008. A Markov game adaptive fuzzy controller for robot manipulators. IEEE Trans. Fuzzy Syst. 16 (1), 171–186.
Sharma, Rajneesh, 2015. Game theoretic Lyapunov fuzzy control for inverted pendulum. In: Proceedings of the 4th IEEE International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), Amity University, Noida, Uttar Pradesh, India, pp. 1–6, September 2–4.
Tang, Li, Liu, Yan-Jun, Tong, Shaocheng, 2014. Adaptive neural control using reinforcement learning for a class of robot manipulator. Neural Computing and Applications, Springer-Verlag, vol. 25, pp. 135–141.
Vrabie, D., Vamvoudakis, K.G., 2013. Adaptive Control and Differential Games by Reinforcement Learning Principles. IET.
Wiering, M., van Otterlo, M., 2012. Reinforcement Learning: State-of-the-Art. Adaptation, Learning and Optimization, vol. 12. Springer, Berlin.
