
A Reinforcement Learning Fuzzy Controller for the Ball and Plate System

Nima Mohajerin, Mohammad B. Menhaj, Member, IEEE, Ali Doustmohammadi

Abstract—In this paper, a new fuzzy logic controller, namely the Reinforcement Learning Fuzzy Controller (RLFC), is proposed and implemented. Based on fuzzy logic, this newly proposed online-learning controller is capable of improving its behavior by learning from the experiences it gains through interaction with the plant. RLFC is well suited for hardware implementation with or without a priori knowledge about the plant. To support this claim, a hardware implementation of the Ball and Plate system was built, and RLFC was then developed and applied to it. The obtained results are illustrated in this paper.
Index Terms—Fuzzy Logic Controller, Reinforcement Learning, Ball and Plate system, Balancing Systems, Model-free


Manuscript received February 5, 2010.
Nima Mohajerin is with the School of Science and Technology of Örebro University, Örebro, Sweden.
Mohammad B. Menhaj is with the Electrical Engineering Department of Amirkabir University of Technology.
Ali Doustmohammadi is with the Electrical Engineering Department of Amirkabir University of Technology.
978-1-4244-8126-2/10/$26.00 ©2010 IEEE

BALANCING systems are among the most popular and challenging test platforms for control engineers. Such systems include the traditional cart-pole system (inverted pendulum), the ball and beam (BnB) system, multiple inverted pendulums, the ball and plate (BnP) system, etc. These systems are promising test benches for investigating the performance of both model-free and model-based controllers. For the more complicated ones (such as multiple inverted pendulums or the BnP system), even if one bothers to model them mathematically, the resulting model is likely to be too complicated to be used in a model-based design. One would much prefer to use an implemented version of such a system (if available and not risky) and observe its behavior while the proposed controller is applied to it. This paper is devoted to a project whose main goal is to control a ball over a flat rotary surface (the plate), mimicking a human's behavior when controlling the same plant, i.e. the BnP system. The proposed controller should neither depend on any physical characteristics of the BnP system nor be supervised by an expert. It should learn an optimal behavior from its experiences interacting with the BnP system and improve its action-generation strategy; however, some prior knowledge about the overall behavior of the BnP system may also be available and can be used to reduce the time needed to reach the goal.
The few published papers on the BnP system are mainly devoted to achieving the defined goals regarding the BnP itself rather than to how those goals are achieved [3, 4, 5, 8, 9]. They can be categorized into two main groups: those based on mathematical modeling (with or without hardware implementation) and those proposing model-free controllers. Since simplifying the mathematical model of the BnP system yields two separate BnB systems, the first category is goal-oriented and is of no interest for the current project [4, 5]. On the other hand, the hardware apparatus used in the second category is the CE151 [6] (or, in some rare cases, another apparatus [9]), all of which use image feedback for ball position sensing. Among these works, [5] is devoted to a mechatronic design of the BnP system controlled by a classic model-based controller that benefits from touch-sensor feedback, while [3, 4] use the CE151 apparatus. Note that image feedback is a time bottleneck, which will be discussed in Section III. In [3], a fuzzy logic controller (FLC) is designed which learns online from a conventional controller. Although the work in [4] is done through mathematical modeling and is applied to the CE151 apparatus, it is of more interest because it tackles the problem of trajectory planning (to be stated in Section III). The reports in [8] and [9] focus on motion planning and control, though they are less relevant to this work.
To achieve the desired goal, a different approach is demonstrated in this paper. This approach is based on a fuzzy logic controller which learns on the basis of reinforcement learning (RL). Additionally, a modified version of the BnP apparatus is implemented in this project as a test platform for the proposed controller.
In this paper, the fundamental concepts of RL are embodied into fuzzy logic control methodologies. This leads to a new controller, namely the Reinforcement Learning Fuzzy Controller (RLFC), which is capable of learning from its own experiences. Inheriting from fuzzy control methodologies, RLFC is a model-free controller; nevertheless, previous knowledge about the system can be included in RLFC so as to decrease the learning time. However, as will be seen, learning in RLFC is not a separate phase of its functioning.
This paper is divided into six sections. After this introduction, Section II explains RLFC both conceptually and mathematically. In Section III, the BnP system is introduced and the hardware specification of the implemented version of this system is outlined. Section IV discusses the modifications needed to make RLFC applicable to the BnP system. Section V illustrates and analyzes the results of RLFC performance on the implemented BnP system; there, RLFC performance is also compared with that of a human controlling the same plant. Finally, Section VI concludes the paper.
This section explains the idea and the mathematics of the proposed controller, i.e. RLFC. First, the behavior of RLFC is outlined conceptually; the mathematics is then detailed.
A. Outline
According to Fig. 1, RLFC consists of two main blocks: a controller (FLC) and a critic. The task of the controller is to generate and apply actions in each given situation and to improve its state-action mapping, while the critic has to evaluate the current state of the plant. Neither the controller nor the critic necessarily knows anything about how the plant responds to actions, how the actions will influence its states, or what the best action is in any given situation. There is also no need for the critic to know how the actions in the controller are generated. The most important responsibility of the critic is to generate an evaluating signal (the reinforcement signal) which best represents the current state of the plant. The reinforcement signal is then fed to the controller.
Once the controller receives a reinforcement signal, it should determine whether its generated action was a good or a bad one, and in both cases how good or how bad it was. These notions are embodied in a measure named the reinforcement measure, which is a function of the reinforcement signal. According to this measure, the controller then attempts to update its knowledge of generating actions, i.e. it improves its mapping from states to actions. The process of generating actions is separate from the process of updating the knowledge, and thus they can be regarded as two parallel tasks. This implies that while the learning procedure is a discrete-time process, the controlling procedure can be continuous-time. However, without loss of generality it is assumed that the actions are generated in a discrete-time manner and that each action is generated after the reinforcement signal describing the consequences of the previously generated action has been reported and the parameters have been updated. The dashed line in Fig. 1 indicates the controller's awareness of its generated actions. Although the controller is nominally aware of its generated actions, in a hardware implementation, inaccuracies in the mechanical structure and electronic devices, as well as other unknown disturbances, may distort the generated actions.
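
The cycle outlined above can be sketched as a simple loop. This is only an illustration under stated assumptions, not the authors' implementation: the callables `observe`, `critic`, `update`, `act` and `apply_action` are hypothetical placeholders, and the quantity passed to `update` assumes the convention, introduced later in this section, that a smaller reinforcement signal means a better plant state.

```python
def rlfc_loop(observe, critic, update, act, apply_action, iterations):
    """Discrete-time RLFC cycle: evaluate the plant, learn, then act.

    All five callables are hypothetical stand-ins for the blocks of Fig. 1.
    """
    previous_r = None
    for _ in range(iterations):
        state = observe()
        r = critic(state)               # reinforcement signal for this iteration
        if previous_r is not None:
            # Positive difference = the previous action improved the plant
            # (assumes "smaller reinforcement signal is better").
            update(previous_r - r)
        action = act(state)             # generated only after the update
        apply_action(action)
        previous_r = r
```

Note that the update for the previous action always happens before the next action is generated, which mirrors the ordering assumed in the text.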

Fig. 1. A typical application of the RLFC.

B. The Controller
The aforementioned concept is general enough to be applicable to any fuzzy controller scheme; however, we assume that the fuzzy IF-THEN rules and the controller structure are of Mamdani type [9]. Imagine that the input to the fuzzy inference engine (FIS) is an $n$-element vector, each element of which is a fuzzy number produced by the fuzzification block [11]:

$$x = [x_1 \; x_2 \; \dots \; x_n]^T \tag{1}$$

Assume that on the universe of discourse of input variable $x_i$, i.e. $U_i$, a number of term sets, $n_i$, are defined. The $l$th fuzzy rule is (for now forget about the consequence part):

$$\text{IF } x_1 \text{ is } A_1^{l_1} \text{ AND } x_2 \text{ is } A_2^{l_2} \text{ AND } \dots \text{ AND } x_n \text{ is } A_n^{l_n} \text{ THEN } y \text{ is } B^l \tag{2}$$

where $1 \le l_i \le n_i$ and $A_i^{l_i}$ is the $l_i$th fuzzy set defined on the universe of discourse of $x_i$ $(1 \le i \le n)$. (Please note that superscripts are not powers unless stated to be so.) The output variable is $y$ and there are $M$ fuzzy sets defined on the universe of discourse of $y$ ($M \in \mathbb{N}$); $V$ is the universe of discourse of $y$. Note that (2) is a fuzzy relation defined on $U \times V$ [11], where $U = U_1 \times U_2 \times \dots \times U_n$. Note also that in the real world, all of the above variables express physical quantities. Therefore all the corresponding universes of discourse are bounded and can be covered by a limited number of fuzzy sets.
For hardware implementation, what matters first is the processing speed of the controller. In other words, we have to establish a fuzzy controller architecture that gives the best trade-off between performance and complexity. For this reason, we propose the following elements for the FLC structure: singleton fuzzifier, product inference engine and center average defuzzifier [11]. In this case, given an input vector $x$, the output of the controller is given by:

$$y = f(x) = \frac{\sum_{l=1}^{L} \bar{y}^l \left( \prod_{i=1}^{n} \mu_{A_i^{l_i}}(x_i) \right)}{\sum_{l=1}^{L} \left( \prod_{i=1}^{n} \mu_{A_i^{l_i}}(x_i) \right)} \tag{3}$$

where $\bar{y}^l$ is the center of the normal fuzzy set $B^l$. However, as mentioned earlier, other FLC structures may be considered. Only those rules that have a non-zero premise (IF part), i.e. the fired rules, participate in generating $y$. This fact does not depend on the FLC structure.
At the design stage, the controller does not know which states it will observe; in other words, the designer can hardly know which rules are useful to embody in the fuzzy rule base. Thus, all rules with premises made from all possible combinations of input variables, using the AND operator, are included in the fuzzy rule base. The number of these rules is:

$$L = \prod_{i=1}^{n} n_i \tag{4}$$

Obviously, $L$ grows exponentially with the number of variables and defined term sets. Consequently, the processing time increases drastically. To solve this, we assume that for any given value of each variable there are at most two term sets with non-zero membership values. This condition, which will be referred to as the covering condition, is necessary; if it holds, then the number of fuzzy rules which contribute to generating the actual output, i.e. the fired rules, is at most $2^n$ (2 raised to the power $n$). Furthermore, to reduce the time needed for discovering the fired rules, we implement a set of conditional statements rather than an exhaustive search among all of the rules.
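
The idea of locating the fired rules by comparisons instead of scanning the whole rule base, and then applying the center-average defuzzifier to those rules only, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes triangular term sets that peak at the sorted values in `centers` and fall to zero at the neighbouring peaks (one way of satisfying the covering condition), with the outermost sets saturating at 1 beyond the range. All function and parameter names are hypothetical.

```python
from itertools import product


def active_sets(x, centers):
    """Return [(set_index, membership)] for the sets with non-zero membership.

    Assumes triangular sets peaking at the sorted `centers`, each reaching
    zero at its neighbours' peaks, so at most two sets are active at once.
    """
    if x <= centers[0]:
        return [(0, 1.0)]
    if x >= centers[-1]:
        return [(len(centers) - 1, 1.0)]
    # A handful of comparisons locates the interval that contains x.
    for j in range(len(centers) - 1):
        if centers[j] <= x < centers[j + 1]:
            w = (x - centers[j]) / (centers[j + 1] - centers[j])
            if w == 0.0:
                return [(j, 1.0)]
            return [(j, 1.0 - w), (j + 1, w)]


def fired_rules(x, centers_per_input):
    """Map each fired rule (tuple of set indices) to its firing strength."""
    per_input = [active_sets(xi, c) for xi, c in zip(x, centers_per_input)]
    rules = {}
    for combo in product(*per_input):
        strength = 1.0
        for _, mu in combo:
            strength *= mu  # product inference engine
        rules[tuple(i for i, _ in combo)] = strength
    return rules


def center_average(rules, consequent_center):
    """Center-average defuzzifier restricted to the fired rules."""
    numerator = sum(consequent_center[idx] * w for idx, w in rules.items())
    return numerator / sum(rules.values())
```

With $n$ inputs, `fired_rules` returns at most $2^n$ entries regardless of how large the full rule base is, which is the point of the covering condition.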
C. The Decision Making
As previously mentioned, the controller is discrete-time. Thus, in each iteration, as the controller observes the plant state, it identifies the fired rules. From now on, the $L$-rule FLC shrinks to a $2^n$-rule FLC where $2^n \ll L$. The key point in generating the output, i.e. decision making, is how the consequences (THEN parts) of these fired rules are selected.
Referring to (2), the term set assigned to the consequence of the $l$th rule is $B^l$. It was also mentioned that there are $M$ term sets defined on the universe of discourse of the output variable. Assume that these term sets are denoted by $W^i$ where $i = 1, 2, \dots, M$. Having fired the $l$th rule in the $k$th iteration, the probability of choosing $W^i$ for its consequence is:

$$P_k^l(j = i) \tag{5}$$

where $j$ is a random variable with an unknown discrete distribution over the indices. The aim of the reinforcement learning algorithm, which is discussed in the next sub-section, is to learn this probability for each rule such that the overall performance of the controller is (nearly) optimized.
Our objective in this section is to define $P_k^l$ sensibly, so that it is well suited both to applying the reward/punishment procedure and to software programming. To fulfil these objectives, for each rule a bounded sub-space of $\mathbb{R}$ is chosen, where $\mathbb{R}$ represents the set of real numbers. Factors governing how this one-dimensional sub-space should be chosen are discussed in section IV. Let the sub-space related to the $l$th rule be:

$$\Omega^l = [a_0^l, a_M^l] \tag{6}$$

This distance is divided into $M$ sub-distances as shown by (8-a), each of which is assigned to an index $i$ where $i = 1, 2, \dots, M$. We have:

$$\Omega^l = \bigcup_{r=1}^{M} \Omega_r^l \tag{7}$$

$$\begin{aligned} &\text{a) } \Omega_r^l = [a_{r-1}^l, a_r^l), \quad r = 1, \dots, M \\ &\text{b) } a_0^l \le a_s^l \le a_{s+1}^l \le a_M^l, \quad s = 1, \dots, M-2 \\ &\text{c) } \forall l \in \{1, 2, \dots, L\} \ \&\ \forall p, q \in \{1, 2, \dots, M\},\ p \ne q: \ \Omega_p^l \cap \Omega_q^l = \emptyset \end{aligned} \tag{8}$$

We calculate the probability $P^l$ by (9):

$$P^l(j = i) = \frac{|\Omega_i^l|}{|\Omega^l|} \tag{9}$$

where $|\Omega_r^l|$ represents the numeric length of sub-distance $\Omega_r^l$ and is calculated by (10).

$$|\Omega_r^l| = a_r^l - a_{r-1}^l \tag{10}$$

Until now the iteration counter, $k$, has been omitted. However, since the reinforcement learning procedure is carried out by tuning the above parameters, they are all functions of $k$. Thus, (9) turns into:

$$P_k^l(j = i) = \frac{|\Omega_i^l(k)|}{|\Omega^l(k)|} = \frac{a_i^l(k) - a_{i-1}^l(k)}{a_M^l(k) - a_0^l(k)} \tag{11}$$

By observing (7), (8) and (11) it is obvious that

$$\sum_{i=1}^{M} P_k^l(j = i) = 1 \tag{12}$$

Hence $P_k^l(j = i)$ is a probability function and satisfies the necessary axioms.
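
The selection mechanism described in this sub-section, where each consequence index owns a sub-distance whose relative length is its selection probability, can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the list `a` is assumed to hold the breakpoints $a_0, \dots, a_M$ of a single rule, and the function names are hypothetical.

```python
import bisect
import random


def probabilities(a):
    """Selection probability of each index 1..M: sub-distance length / total length."""
    total = a[-1] - a[0]
    return [(a[i] - a[i - 1]) / total for i in range(1, len(a))]


def choose_index(a, rng=random):
    """Draw a uniform number on [a_0, a_M] and return the index of its sub-distance."""
    u = rng.uniform(a[0], a[-1])
    # bisect_right gives r with a[r-1] <= u < a[r]; zero-length
    # sub-distances are skipped automatically.
    r = bisect.bisect_right(a, u)
    return min(max(r, 1), len(a) - 1)  # clamp the rare u == a_M case
```

For example, with breakpoints `[0.0, 1.0, 3.0, 4.0]` the three indices have probabilities 0.25, 0.5 and 0.25; lengthening or shortening one sub-distance is then all that is needed to change how often its index is drawn.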
D. Reinforcement Learning Algorithm
In this sub-section the proposed algorithm for tuning the above-defined parameters is described. This algorithm is based on reinforcement learning methods and satisfies the six axioms mentioned by Barto in [12]. Let $r(k)$ be the reinforcement signal generated by the critic module in the $k$th iteration. Note that it represents the effect on the plant of the action previously generated by the FLC, i.e. $y(k-1)$, and that before generating the $k$th action the parameters of the related fired rules should be updated. In other words, this scalar can represent the change in the current state of the plant.
To be more specific, as a general assumption, imagine that a smaller reinforcement signal represents a better state of the plant. The change in the current state of the plant is then stated by (13). Therefore, an improvement in the plant situation is indicated by $\Delta r(k) > 0$, while $\Delta r(k) < 0$ indicates that the plant situation has worsened.

$$\Delta r(k) = r(k-1) - r(k) \tag{13}$$

However, since (13) is based solely on the output of the critic, it contains no information about which rules have been fired, which term sets have been chosen for generating $y(k-1)$, etc. Thus, $\Delta r(k)$ is not immediately applicable to updating the parameters. The reward/punishment updating procedure means that if the generated action resulted in an improvement in the plant state, the probabilities of choosing the same term set for the consequence of each corresponding fired rule should be increased. However, if this action caused the plant state to worsen, these probabilities should be decreased.
$\Delta r(k)$ will be mathematically manipulated in order to shape the reinforcement measure. This measure is the amount by which the mentioned probabilities, (11), are changed.
As the first step, a simple modification is applied to $\Delta r(k)$. This step may be skipped if $\Delta r(k)$ is already a suitable representation of the change in the system state. A comprehensive example will illustrate this case in section IV. This manipulation is done by $f(\cdot)$, where $f: \mathbb{R} \to \mathbb{R}$:

$$\hat{r}(k) = f(\Delta r(k)) \tag{14}$$

To modify the probabilities defined by (11), the corresponding sub-distances (8) are changed. The amount of change in the sub-distances relating to the $l$th rule is defined by $\Delta_i^l(k)$ in (15).

$$\Delta_i^l(k) = g \cdot \varepsilon(l, i) \cdot \alpha(l) \cdot \hat{r}(k) \tag{15}$$

Regarding (15), $g$ is a gain and acts as a scaling factor, $\varepsilon(l, i)$ represents the exploration/exploitation policy [1], and $\alpha(l)$ is the firing rate of the $l$th rule, obtained by substituting $x$ into the membership function obtained from the premise part of the $l$th rule. Note that this factor expresses the contribution of this rule in generating the output.
There are a variety of exploration/exploitation policies [1, 2, , 12]; however, here we propose a simple one:

$$\varepsilon(l, i) = 1 - e^{-\lambda\, n_i^l(k)} \tag{16}$$

where $n_i^l(k)$ counts how many times $W^i$ has been chosen for the $l$th rule and $\lambda$ is a scaling factor. Considering a particular rule, as $W^i$ is chosen more often for this rule, $\varepsilon(l, i)$ grows exponentially to one, letting $\Delta_i^l(k) \to g \cdot \alpha(l) \cdot \hat{r}(k)$.
The reinforcement measure introduced in subsection II.A is given by (17).

$$\tilde{\Delta}_i^l(k) = \begin{cases} \Delta_i^l(k), & \hat{r}(k) \ge 0 \\ \max\{-|\Omega_i^l(k)|,\ \Delta_i^l(k)\}, & \hat{r}(k) < 0 \end{cases} \tag{17}$$

In (17), the max operator is used in the case $\hat{r}(k) < 0$. This avoids strongly reinforcing those sub-distances that have not been chosen. The reason is clearer if (18) is studied. Equation (18) gives the updating rule:

$$a_q^l(k) = \begin{cases} a_q^l(k-1), & q < i \\ \max\{a_{q-1}^l(k),\ a_q^l(k-1) + \tilde{\Delta}_i^l(k)\}, & q \ge i \end{cases} \tag{18}$$

where $q = 0, 1, \dots, M$.
Regarding (18) there are several points to be noted:
1- $i$ is the index of the chosen term set for the consequence part of the $l$th fired rule, i.e. $B^l = W^i$.
2- $q$ is a counter which starts from $i$ and ends at $M$. There is no need to update those parameters which are not modified, so $q$ may start from $i$. This reduces the processing time.
3- There are $M$ parameters for each rule in the rule base, but only the parameters of fired rules are modified. Hence there are at most $M \cdot 2^n$ modifications per iteration.
4- By (18), only the lengths of subspaces $\Omega_i^l$ and $\Omega^l$ are modified. Although subspaces $\Omega_q^l$ for $q = i+1, \dots, M$ are shifted, their length remains unchanged.
5- Modification of $\Omega^l$ always lets other subspaces (and hence other indices) be chosen. As the system learns more, a dominant subspace will be found, but the lengths of un-chosen subspaces remain non-zero as long as the effect of choosing them is not observed as a worsening result. This feature is useful in the case of slightly changing plants.
Theorem 1. By (18), if a term set receives reward/punishment, then the probability of choosing that term set is increased/decreased.
a) Reward. Assume that $W^i$ has been chosen for the $l$th rule and the resulting action has an improving effect. Thus, $P_k^l(j = i)$ should increase. Note that in this case $\hat{r}(k) > 0$, and by (13), (14), (15), (16) and (17) it is obvious that $\tilde{\Delta}_i^l(k) > 0$. Using (5) we have:

$$\Delta P_k^l(j = i) = P_k^l(j = i) - P_{k-1}^l(j = i) = \frac{|\Omega_i^l(k)|}{|\Omega^l(k)|} - \frac{|\Omega_i^l(k-1)|}{|\Omega^l(k-1)|} \tag{19}$$

In case of an improvement, $\Delta P_k^l(j = i)$ must be a positive scalar. Using (9), (10), (11) and (12) in (19), and taking $a_0^l = 0$ so that the total length is $a_M^l$, we obtain:

$$\Delta P_k^l(j = i) = \frac{a_i^l(k) - a_{i-1}^l(k)}{a_M^l(k)} - \frac{a_i^l(k-1) - a_{i-1}^l(k-1)}{a_M^l(k-1)}$$

According to (18) the above equation yields:

$$\Delta P_k^l(j = i) = \frac{a_i^l + \tilde{\Delta}_i^l(k) - a_{i-1}^l}{a_M^l + \tilde{\Delta}_i^l(k)} - \frac{a_i^l - a_{i-1}^l}{a_M^l}$$

in which we omit $(k-1)$ on the right-hand side. It can easily be seen that:

$$\Delta P_k^l(j = i) = \frac{\tilde{\Delta}_i^l(k) \left( a_M^l - a_i^l + a_{i-1}^l \right)}{a_M^l \left( a_M^l + \tilde{\Delta}_i^l(k) \right)}$$

This equation with regard to (8) implies that:

$$\Delta P_k^l(j = i) > 0$$

b) Punishment. In this case $\hat{r}(k) < 0$, so that

$$\tilde{\Delta}_i^l(k) = \max\{-|\Omega_i^l(k)|,\ \Delta_i^l(k)\}$$

Since a negative-length sub-distance is undefined, the max operators used in the above equation and in (17) ensure that update rule (18) does not yield an undefined sub-distance. The max operators are used to satisfy (8-b).
In this case $\Delta P_k^l(j = i)$ in (19) has to be a negative scalar. The procedure is the same as in part a.
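
The effect claimed by Theorem 1 can be checked numerically with a small sketch of update rule (18). This is an illustrative reading of (17) and (18), not the authors' code: the list `a` is assumed to hold the breakpoints $a_0, \dots, a_M$ of one fired rule, `i` is the chosen index, `delta` is the raw change from (15), and the sign of `delta` is taken to match the sign of the reinforcement (i.e. the gain, policy and firing-rate factors are assumed non-negative). The function names are hypothetical.

```python
def update_breakpoints(a, i, delta):
    """Apply (17) then (18) to the breakpoints of one rule; returns a new list."""
    length = a[i] - a[i - 1]
    # (17): a punishment may shrink the chosen sub-distance to zero, not below.
    measure = delta if delta >= 0 else max(-length, delta)
    new = list(a)
    for q in range(i, len(a)):
        # (18): shift breakpoints from index i upward while keeping them ordered.
        new[q] = max(new[q - 1], a[q] + measure)
    return new


def probability(a, i):
    """Probability (11) of choosing index i given breakpoints a."""
    return (a[i] - a[i - 1]) / (a[-1] - a[0])
```

For instance, starting from `[0.0, 1.0, 3.0, 4.0]` with index 2 chosen, a reward of 1.0 gives `[0.0, 1.0, 4.0, 5.0]` and raises the probability of index 2 from 0.5 to 0.6, while an oversized punishment of -5.0 is capped and gives `[0.0, 1.0, 1.0, 2.0]`, leaving the breakpoints ordered.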


A Ball and Plate system, referred to above as BnP, is an evolution of the traditional Ball and Beam (BnB) system [13, 14]. It consists of a plate whose slope can be manipulated in two perpendicular directions and a ball placed on it. The behavior of the ball is of interest and can be controlled by tilting the plate. Following this scheme, various structures may be proposed and applied in practice. Usually image feedback is used to locate the position of the ball. However, because image feedback is less accurate and has a slower sampling rate than touch-screen sensors (or simply touch sensors), we opt for a touch sensor.
The hardware structure implemented in this project, as outlined in Fig. 2, consists of five blocks. However, the whole system can be viewed as a single block whose input is a two-element vector, i.e. the target angles (20). The output of this block is a six-element vector which contains the position and velocity of the ball and the current angles of the plate. Table I lists the related parameters.



TABLE I
Symbol | Parameter | Unit
$(x_d, y_d)$ | Target position of the ball |
$(x, y)$ | Current position of the ball |
$(v_x, v_y)$ | Current velocity of the ball | Pixel per
$(\theta_x, \theta_y)$ | Current angles of the plate | Angle step
$\theta_x$ | Plate angle with regard to x axis | Angle step
$\theta_y$ | Plate angle with regard to y axis | Angle step
$(u_x, u_y)$ | Control signal | Angle step
$u_x$ | Plate target angle with regard to x axis | Angle step
$u_y$ | Plate target angle with regard to y axis | Angle step

u = x , y


In this paper we are interested in the following problem:
Simple command of the ball. The objective is to place the ball at any desired location on the plate surface, starting from an arbitrary initial position.
Before explaining the control system, a summary of the hardware specifications of the BnP system implemented for this project is given. A complete or even brief description of how this apparatus was built is beyond the scope of this paper. However, a summary is needed to show that the plant for RLFC was built roughly and has many inaccuracies, so that a classic controller will definitely be unable to control it. Referring back to Fig. 2, the function of each block is summarized next.

The Actuating Block
The actuating block consists of high-accuracy stepping motors equipped with precise incremental encoders (3600 ppr*) coupled to their shafts, plus accurate micro-stepping drivers.
The original step size of the steppers is 0.9 degrees, reducible by the drivers down to 1/200 of the step size, i.e. $4.5 \times 10^{-3}$ degrees per step.
Taking into account the mechanical limitations, the smallest measurable and applicable amount of rotation is 0.1 degrees.

Fig. 2. Hardware structure of the implemented BnP system. The dashed square separates the electronics section from the mechanical parts.

The Sensor Block
The sensor is a 15-inch touch-screen sensor.
The sensor output is a message packet sent through RS-232 serial communication at 19200 bps.
Thus the fastest sampling rate of the whole sensor block is 384 samples per second. This implies that the maximum available time for decision making is $1/384 \approx 2.604 \times 10^{-3}$ seconds.
The area of the touch sensor surface on which pressure can be sensed is 30.41 × 22.81 cm.
The sensor resolution is 1900 × 1900 pixels. If the sensor sensitivity is uniformly distributed over its sensitive area, then each pixel is assigned to an area approximately equal to 1.8 × 1.2 mm² of the sensor surface.
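
The step-size and timing figures quoted for the actuating and sensor blocks follow from simple arithmetic, which the short sketch below reproduces. The constant names are illustrative, and the "bits per sample" value is only a derived ratio: the paper does not describe the packet format, so no particular framing is implied.

```python
FULL_STEP_DEG = 0.9         # native stepper step size, degrees
MICROSTEP_DIVISOR = 200     # finest subdivision offered by the drivers
BAUD_RATE_BPS = 19200       # RS-232 line rate, bits per second
SAMPLES_PER_SECOND = 384    # fastest sampling rate of the sensor block

microstep_deg = FULL_STEP_DEG / MICROSTEP_DIVISOR
bits_per_sample = BAUD_RATE_BPS / SAMPLES_PER_SECOND
decision_time_ms = 1000.0 / SAMPLES_PER_SECOND

print(f"micro-step size: {microstep_deg:.4f} degrees")
print(f"bits per sample: {bits_per_sample:.0f}")
print(f"time per sample: {decision_time_ms:.3f} ms")
```

The last figure is the upper bound on the time the controller has to find the fired rules, update their parameters and generate an action if it is to keep up with every sensor sample.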
The Interface Block
The third and main section of the BnP system is its electronic interface. This interface receives commands from an external device in which the controller has been implemented and then takes the necessary corresponding actions. Each decision made by the controller algorithm is translated into a message packet, which is then sent to the interface via a typical serial link (RS-232) or another communication platform. The interface then sends the necessary signals to the actuators. In addition to some low-level signal manipulation (such as low-pass filtering of the sensor readings and noise cancelling), upon request from the main controller the interface sends current information, such as the ball position and velocity or the position of the actuators, to the main controller.

* ppr: pulses per rotation.
bps: bits per second, a unit of measurement for serial communication.


In section II, RLFC was discussed in full. This section presents the modifications needed to make it applicable to controlling the implemented BnP system, in order to solve the second problem stated in the previous section.

A. Primary Design Stage
Fig. 1 depicts the control architecture as well as the signal flows. With regard to Table I, the illustrated signals are explained next.
Control signal vector:

$$u = [u_x \; u_y]^T \tag{21}$$

Plant state vector:

$$s = [x \; \dots \; u_y]^T \tag{22}$$

Plant target state vector:

$$s_d = [x_d \; \dots \; 0 \; 0]^T \tag{23}$$

Error vector:

$$e = [e_x \; e_y \; e_{v_x} \; e_{v_y}]^T \tag{24}$$

where $e_x = x - x_d$ and $e_y = y - y_d$. Since we want to make the ball steady, $e_{v_x} = v_x$ and $e_{v_y} = v_y$.
In Fig. 1 it is obvious that there are six input variables to the controller section of RLFC. Let us arrange them in the vector $x$ as written in (25).

$$x = [e_x \; e_y \; v_x \; v_y \; \theta_x \; \theta_y]^T \tag{25}$$

In Fig. 3, the corresponding defined term sets are presented graphically. The number, shape and distribution of the defined term sets are chosen based on logical sense and practical experience. The system is not bound to perform poorly if these selections are changed.

Fig. 3. The defined term sets on the universes of discourse of (a) $e_x$ and $e_y$, (b) $v_x$ and $v_y$, (c) $\theta_x$ and $\theta_y$. For the respective units refer to Table I.

On the universe of discourse of each input variable, a specific number of term sets are defined. Let these numbers be $n_x$, $n_y$, $n_{v_x}$, $n_{v_y}$, $n_{\theta_x}$ and $n_{\theta_y}$. According to (4), the fuzzy rule base contains $L$ rules where:

$$L = n_x \cdot n_y \cdot n_{v_x} \cdot n_{v_y} \cdot n_{\theta_x} \cdot n_{\theta_y}$$

For our very specific implementation the following quantities are assigned to the above variables:

$$(n_x = n_y = 7,\ n_{v_x} = n_{v_y} = 5,\ n_{\theta_x} = n_{\theta_y} = 2) \Rightarrow L = 4900$$

Assuming that the covering condition is satisfied, as is the case in Fig. 3, in each iteration there are at most $2^6 = 64$ fired rules. However, if an exhaustive search is done among the 4900 rules to discover the fired rules, i.e. to calculate each premise membership value and check whether it is zero or not, then the processing time would grow beyond tolerable bounds. Instead, since we know the exact location of each term set, locating each measured value of the input variables on its corresponding universe of discourse reveals the term sets with non-zero membership value. Doing this for all six input variables, there will be at most two discovered term sets for each of them. Hence the combinations of these term sets using the AND operator directly show which rules are fired. The locating procedure can be coded in a programming language using a set of conditional statements. If there are $n$ defined term sets on a particular universe of discourse, then $(n+1)$ conditional statements are needed in order to discover the fired rules. Hence instead of $L$ arithmetic calculations, we are faced with $S$ logical comparisons, where:

$$S = \sum_{i=1}^{n} (n_i + 1)$$

For our case study: S=32.
There is no special constraint on the distribution of term sets on the universes of discourse of the output variables. There are M=51 triangular term sets which are uniformly distributed over each universe of discourse. Fig. 4 illustrates the location of these term sets.

Fig. 4. The defined term sets on the universes of discourse of the output variables.

B. Critic: Generation of Reinforcement Signal
Until now, the reinforcement signal has only been defined as an evaluation of the system's current situation generated by the critic block. No more precise definition could have been given so far, since the nature of this signal directly depends on the nature of the plant to be controlled. Referring back to Fig. 1, it is obvious that the input to the critic is the error signal (24). According to this figure, the reinforcement signal is a function of this vector:

$$r = g(e) = g(e_x, e_y, v_x, v_y) \tag{27}$$

The critic is thus defined by $g(\cdot)$, which is the designer's choice. The only necessary condition is that this function should represent the current state of the plant as well as possible. A proposed general form of this function is depicted in (28).

$$r = c_x (e_x^2 + c_v v_x^2) + c_y (e_y^2 + c_v v_y^2) \tag{28}$$

Equation (28) can be regarded as a revised version of LMS. Note that the aim of RLFC is to minimize (28). In (28) three coefficients are defined, which are explained next. $c_v$ is a balancing coefficient between the velocity of the ball and its position error. As $c_v$ increases, the controller gives more credit to stabilizing the ball rather than guiding it to the desired location. $c_x$ and $c_y$ represent the mutual interaction between the two actuators. This interaction comes from the inevitable imprecision of the mechanical structure of the BnP system. Because of this mutual interaction, the motion of the ball in each direction is not only a function of the corresponding angle of the plate. The exact values of $c_x$ and $c_y$ depend on the mechanical specifications; they can be chosen and then tuned experimentally, or learnt.
C. Reinforcement Measure
Having proposed the reinforcement signal, we seek a suitable function to produce the reinforcement measure $\tilde{\Delta}_i^l(k)$ according to (17). Equation (29) is a proposed function for (14):

$$\hat{r}(k) = f(\Delta r(k)) = \frac{\Delta r(k)}{r_{max}} + \frac{\operatorname{sgn}(\Delta r(k))}{r(k)} \tag{29}$$

This function consists of two terms: the first one scales the pure reinforcement signal received from the critic, and the second one tunes the learning sensitivity to reward/punishment when the plant state is around the desired target state. Note that $r_{max}$ can be easily calculated using (28).
According to (29) and (15), the reinforcement measure is given by (30) replaced in (17):

$$\Delta_i^l(k) = g \cdot \left( 1 - e^{-\lambda\, n_i^l(k)} \right) \cdot \alpha(l) \cdot \left( \frac{\Delta r(k)}{r_{max}} + \frac{\operatorname{sgn}(\Delta r(k))}{r(k)} \right) \tag{30}$$

where, as in (15) and (16), $g$ is the gain, $n_i^l(k)$ counts how often $W^i$ has been chosen for the $l$th rule, $\lambda$ is a scaling factor and $\alpha(l)$ is the firing rate of the $l$th rule.
D. Adding a Priori Knowledge
From a very general point of view, the proposed algorithm is a search in the space of possible actions. However, it is possible to add a priori knowledge in order to increase the learning speed. To describe this, it helps to explain how the random selection of output term sets takes place. Digital processors can produce uniformly distributed random numbers, and this is used in RLFC. First, a random number is generated by the processor, and then it is checked to which sub-distance (see (8)) it belongs. The number of that sub-distance is then the index of the chosen term set. Let the randomly generated number for the $l$th rule be $j^l$. Equation (31) must hold:

$$j^l \in \Omega^l \tag{31}$$

By this concept, adding a priori knowledge in terms of modifying the bounds on $j^l$ is a simple procedure. For a typical example regarding our implemented RLFC, observe the following rule form:

IF $e_x$ is $A_1^1$ AND $e_y$ is whatever AND $v_x$ is $A_3^2$ AND $v_y$ is whatever AND $\theta_x$ is whatever AND $\theta_y$ is whatever THEN $u_x$ is $B^{j^l}$

This applies to a set of rules in which the first and third conditions are fixed as mentioned. Referring back to the term sets depicted in Fig. 3 and Fig. 4, this clearly indicates that the sensible choice for this condition is a high deviation in $\theta_x$ towards the positive direction of the Cartesian coordinate system. Thus:

$$j^l \in [35, 50]$$

This implies that insensible choices are ignored for the mentioned rules.
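
The two application-specific ingredients of this section, the quadratic critic (28) and the restriction of the random number to a prior interval, can be sketched as follows. This is an illustration under assumptions that are not in the paper: the default coefficient values are arbitrary placeholders, the breakpoints are taken to lie at the integers 0 to 51 (so that each of the M=51 sub-distances initially has unit length), and the function names are hypothetical.

```python
import bisect
import random


def critic(ex, ey, vx, vy, cx=1.0, cy=1.0, cv=0.5):
    """Reinforcement signal in the form of (28); smaller means a better state."""
    return cx * (ex**2 + cv * vx**2) + cy * (ey**2 + cv * vy**2)


def choose_index_with_prior(a, low, high, rng=random):
    """Draw the random number on [low, high] only, then locate its sub-distance.

    `a` holds the breakpoints a_0..a_M of one rule; [low, high] encodes the
    a priori knowledge about which consequences are sensible for that rule.
    """
    u = rng.uniform(low, high)
    r = bisect.bisect_right(a, u)
    return min(max(r, 1), len(a) - 1)
```

Under the integer-breakpoint assumption, restricting the draw to [35, 50] means that only the term sets with indices 36 and above can be selected for the rules in question, so the search never spends iterations on consequences already known to be insensible.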
After the modified RLFC was implemented, it was applied
experimentally to the implemented BnP system. The graphs in
this section are the results of a series of experiments. In all
the experiments, the position of the ball versus time was
collected using the monitoring section of the implemented
BnP. These raw data were then processed by software
enhancement procedures to avoid a large number of confusing
graphs. After omitting the time, the x position versus the y
position is obtained. The points corresponding to a series of
iterations are then drawn on a single graph. The units of x and
y are pixels, and the origin of the coordinate system is selected
as seen by the touch sensor. Each figure shows the
touch-sensitive area of the plate, that is, the 1900×1900 pixels
of the touch sensor. The location of the ball is illustrated
approximately around the mean of the acquired points. Dark
areas on the figures indicate the presence of the ball over the
corresponding areas of the plate: the darker a region of the
figure, the more often the ball was in the corresponding area
over all observed iterations. The target (desired) location of
the ball in all the illustrated experiments is the centre of the
plate.
In Fig. 5, the improvement in the behavior of the ball under
RLFC control is shown. After approximately 70000 iterations,
an acceptable performance is obtained. Note that the time
needed for the iterations normally varies; in our experiments,
70000 iterations took around 20 minutes. Since the performance
of the system is satisfactory around the 70000th iteration,
RLFC is regarded as a trained system at this stage. However,
since the learning procedure is not switched off afterwards, the
captions of the following figures do not include the term
"trained". Instead, the 70000th iteration is mentioned as a
reference point for a well-trained system. The control signals
corresponding to the best performance illustrated in Fig. 5 are
shown in Fig. 6.
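
The way the density plots are built (time is dropped, the x-y samples of many iterations are accumulated, and darker regions mean more visits) can be sketched as below. This is a minimal sketch under stated assumptions, not the software used in the experiments: it assumes logged samples of the form (t, x, y) with x and y in the range 0 to 1900 pixels, and the function names, the grid resolution `cells`, and the sample values are hypothetical.

```python
from collections import Counter


def occupancy_grid(samples, sensor_size=1900, cells=19):
    """Count how often the ball was observed in each cell of a square grid.

    samples: iterable of (t, x, y) tuples, x and y in touch-sensor pixels.
    Returns a Counter mapping (column, row) -> number of samples.
    """
    cell = sensor_size / cells
    counts = Counter()
    for _t, x, y in samples:  # the time stamp is dropped, only (x, y) is kept
        if 0 <= x < sensor_size and 0 <= y < sensor_size:
            counts[(int(x // cell), int(y // cell))] += 1
    return counts


def mean_position(samples):
    """Mean (x, y) of the acquired points."""
    xs = [x for _t, x, _y in samples]
    ys = [y for _t, _x, y in samples]
    return sum(xs) / len(xs), sum(ys) / len(ys)


# Hypothetical log: two samples near the plate centre, one further away.
log = [(0.00, 940, 960), (0.05, 955, 948), (0.10, 1210, 700)]
grid = occupancy_grid(log)
print(grid.most_common(1), mean_position(log))
```

A cell with a higher count would be drawn darker, so a well-controlled ball produces a small dark region around the target.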

Fig. 5. Improvement in behavior of the ball under RLFC control. The top
left figure corresponds to the first 15000 iterations, where a priori
knowledge is initially embedded. The top right and bottom left figures
show the improvement in performance. After approximately 70000
iterations, the bottom right performance is regarded as acceptable.

Fig. 6. Control inputs corresponding to the best performance in Fig. 5.

In all the aforementioned experiments, the ball is initially
located in the same place. To show that this is not a necessary
condition, another starting point is chosen in Fig. 7. Because
RLFC has not previously experienced these new states often
enough, it does not perform well at first. However, as soon as
the ball reaches a state that has been experienced often enough
(shown by an arrow), the behavior comes under control. The
number of iterations in this figure is about 3500.

In order to compare the performance of RLFC with that of a
human, 10 individuals (all healthy adults with no apparent
nervous or muscular disorder) were selected and asked to
control the implemented BnP system. Each individual was
allowed to try the system 10 times. Note that in all these
experiments, the steppers were released and the individuals
controlled the plate with their own hands. This indeed removes
the most complicated nonlinearity and imprecision of the
system: the actuators. Fig. 8 illustrates the best performance,
which was the 7th try of the individual who had the best
control over the BnP system.

Fig. 7. Different starting point.

Fig. 8. Human best performance.

The main idea of the discussed work was to propose a
human-like controller, capable of learning from its own past
experience as well as embedding some prior knowledge and
reasonable facts. Although still far from exact human-like
behavior, applying this controller to a very complex and
uncertain plant resulted in satisfactory performance,
especially when compared with a skilled human trying to
control the same plant. A variety of extensions and
modifications to the proposed method is possible, from the
form of the fuzzy IF-THEN rules to the method of tuning the
various defined parameters.

