
An Architecture for Distinguishing between Predictors and Inhibitors in Reinforcement Learning

arXiv:1312.5714v1 [cs.AI] 19 Dec 2013

Patrick Connor Faculty of Computer Science Dalhousie University Halifax, NS, B3H 4R2

Thomas Trappenberg Faculty of Computer Science Dalhousie University Halifax, NS, B3H 4R2

Abstract

Reinforcement learning treats each input, feature, or stimulus as having a positive or negative reward value. Some stimuli, however, negate or inhibit the values of certain other predictors (excitors) when presented with them, but are otherwise neutral. We show that both linear and non-linear value-function approximators assign an inhibitory feature a strong value with the opposite valence of the predictor it inhibits (i.e., inhibitor = -excitor). In one circumstance, this gives a correct prediction (i.e., excitor + inhibitor = neutral outcome). Importantly, however, value-function approximators incorrectly predict that when the inhibitor is presented alone, a negative or oppositely valenced outcome will follow, whereas the inhibitor alone is actually followed by a neutral outcome. Essentially, we show that having reward value as a direct predictive target can make inhibitors indistinguishable from excitors that predict the oppositely valenced outcome. We show that this problem can be easily avoided if the reinforcement learning problem is broken into 1) a supervised learning module that predicts the positive appearance of primary reinforcements and 2) a reinforcement learning module that sums their agent-defined values.


Introduction

Value-function approximators (VFAs; for a useful introduction to value-function approximation, see [1]) are used to map multi-dimensional real-world states to values in the practical application of reinforcement learning (RL) algorithms. A variety of function approximation techniques have been applied to this domain (e.g., [2, 3, 4]). Among applications, the input representation will vary, but the output representation or prediction target remains the same: reward value. The reward value scale ranges from reward (positive values) through neutral (zero value) to punishment (negative values).

In biological RL, as depicted in classical conditioning experiments, there is a distinction between stimuli (i.e., features) that predict punishment and those that cancel a prediction of reward. In the conditioned inhibition paradigm [5, 6], animals receive alternating presentations of a stimulus followed by a reinforcer, say, a reward (A+), and a combination of that stimulus and another, followed by no reward (AX-). Stimulus A becomes excitatory and stimulus X cancels or inhibits A's reward prediction. However, the two are not viewed as entirely symmetric opposites in this field. For example, extinguishing the inhibitor (X) generally takes more than mere non-reinforced presentations [7, 8], a procedure that is effective on excitatory stimuli such as A.

In contrast, the VFA does see excitors and inhibitors as perfectly symmetric opposites. If trained on the same data as in the conditioned inhibition paradigm (i.e., [1 0] → 1, [1 1] → 0), an approximator will subsequently predict a negative reinforcement or punishment when presented with X alone (i.e., [0 1] → -1) instead of the correct zero-valued outcome (since X is actually a predictor of reward omission, not punishment). In this way, there exists a confusion between predictors of an outcome and inhibitors of the oppositely valenced outcome. Making the distinction between them is important. Besides increasing the prediction error when the inhibitor is presented alone (or in certain combinations), such circumstances could be read incorrectly by an agent and lead to irrational actions (e.g., fear-driven actions instead of disappointment-driven actions and vice versa).

Here, we demonstrate this issue in concrete terms and offer a simple, generic solution. Instead of predicting the reward value, we train function approximators to predict specific primary reinforcements (e.g., food, charging station, etc.). Then, the agent-defined reward values of the expected reinforcements are summed to give a reward value prediction. This approach breaks the reinforcement-learning problem into an intermediate supervised learning layer and an RL layer readout. The intermediate prediction of real-world primary reinforcements recognizes the importance of developing high-level representations of the world, though doing so is not the focus of the present work. Instead, we show that the choice of what a supervised learner predicts, or its output representation, is equally important, and is reflected in the resulting effect on prediction accuracy.
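The confusion described above can be reproduced in a few lines. The following is a minimal sketch, assuming a plain delta-rule (LMS) learner with no bias term; the learning rate, the number of passes, and the helper name `predict` are illustrative choices, not values from the paper.

```python
# Conditioned inhibition with a plain linear (LMS) value-function approximator.
# Features are [A, X]; targets are reward values, as in the text:
#   A+  : [1, 0] -> 1   (A alone is rewarded)
#   AX- : [1, 1] -> 0   (A together with X is not rewarded)
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

w = [0.0, 0.0]   # weights for A and X
lr = 0.1         # hypothetical learning rate

for _ in range(2000):
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = y - pred
        # Standard LMS (delta-rule) update
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]


def predict(x):
    """Linear reward-value prediction for feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


print("weights [A, X]:", [round(wi, 3) for wi in w])
print("prediction for X alone:", round(predict([0.0, 1.0]), 3))
```

The learner fits both training trials exactly, yet it can only do so by giving X a weight that is the negative of A's weight, so X presented alone is predicted to be punishing rather than neutral.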

Predicting the future state instead of values

Figure 1A depicts the typical value-function approximation approach used in RL. The input features that express the environmental state are submitted to the approximator, which predicts the total expected future reward value from that state. It is updated according to one of a variety of RL paradigms (e.g., Sarsa, Q-learning, TD(λ), etc.) and the specifics of the approximator (e.g., gradient descent for a linear approximator). In contrast, Figure 1B depicts the deeper, two-stage RL architecture that this work advocates: multiple supervised learning models predict aspects of the future state (predicted features) and, in particular, the appearance of primary reinforcements. A key difference between the predicted feature and reward value targets is that the former has positive values only (a feature is either present or not) and the latter can have both positive and negative values. This rectification is a crucial detail of the proposed approach.

The notion of predicting future observations or states is not new to RL [9, 10]. For example, in temporal difference (TD) networks, Sutton and Tanner [9] demonstrate the use of TD learning to predict future observations, not just reward. Although more computation is required than for predicting reward alone, the richer predictive output has potential uses, including foreseeing future states and allowing the agent to act in anticipation of them.

We show below how such predictions can be used to resolve the problem described in the previous section, but there is also another benefit for RL. Each agent places value on certain kinds of stimuli in the world. For some of these stimuli, the value changes dynamically based on the internal state of the agent. For example, when a robotic agent has a low battery, the charging station ought to have a higher value than when the robot is fully charged. A simple VFA would be forced to learn the value of approaching the charging station depending on its battery state. This is certainly possible, but we might take a cue from biological systems, where it seems that the dynamic revaluation of a primary reinforcer due to satiety is second nature. Satiety can be integrated in the proposed approach simply: the reward value of any primary reinforcer can be weighted by a satiety parameter in the RL module. Being fully sated would neutralize the reward value otherwise attributed to the associated reinforcer.
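The second-stage readout with satiety weighting can be sketched as follows. The paper does not specify the form of the weighting, so the factor `(1 - satiety)` below is an assumption, as are the function name `reward_value`, the dictionary layout, and the reinforcer names.

```python
def reward_value(predicted_salience, agent_values, satiety):
    """Second-stage readout: sum of agent-defined values of expected reinforcers.

    predicted_salience: reinforcer name -> non-negative first-stage prediction
    agent_values:       reinforcer name -> agent-defined value (+ reward, - punishment)
    satiety:            reinforcer name -> level in [0, 1]; 1 means fully sated
    """
    total = 0.0
    for name, salience in predicted_salience.items():
        # Assumed weighting: full satiety scales the reinforcer's value to zero.
        weight = agent_values[name] * (1.0 - satiety.get(name, 0.0))
        total += weight * max(0.0, salience)
    return total


# Hypothetical first-stage output: the charging station is expected next.
predictions = {"charging_station": 1.0, "collision": 0.0}
values = {"charging_station": 1.0, "collision": -1.0}

low_battery = reward_value(predictions, values, {"charging_station": 0.0})
full_battery = reward_value(predictions, values, {"charging_station": 1.0})
print("low battery:", low_battery, "| fully charged:", full_battery)
```

Because the satiety factor lives in the readout, the first-stage predictors do not have to be retrained when the agent's internal state changes.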

Function approximators

In our evaluation, the standard VFA is instantiated as a linear function via least mean squares (LMS) linear regression and as a non-linear function via support-vector regression (SVR) [11]. This choice was made because both models required very little adjusting of parameters, which might otherwise complicate the interpretation of the results. For the two-stage architecture, we use a slightly different version of LMS, which we derive and relate to LMS below.

The form of SVR used here, called epsilon-SVR, involves fitting a hyperplane with margins (i.e., a hyper-rectangular prism) so that it encompasses all data while at the same time minimizing the hyperplane's slope [11]. This is often infeasible, so outliers are permitted but at a cost, leading to a trade-off between minimization of the slope and the acceptance of outliers. Since explanations of simulation results will not rely on the formal definition of this non-linear model, we direct the interested reader to Smola and Schölkopf [11]. We used LIBSVM [12] in our simulations (settings: cost = 10, epsilon = 1e-5, radial basis function kernel with $\gamma = 1/(\text{number of features})$).

Figure 1: Architectures for RL using function approximation (FA). A. The generic VFA, which takes features of the current state and directly predicts the total expected future reward for the agent. B. The proposed two-stage RL architecture. In the first stage, the salience of a primary reinforcer (or other state feature being tracked) in the next time step (or in the future, if the TD network approach is adopted) is predicted. The final predicted reward value is the weighted sum of the agent-defined values multiplied by the expected salience of the primary reinforcer.

Underlying LMS linear regression is the goal of maximizing the likelihood of the given data being generated by a Gaussian random variable with a parameterized n-dimensional linear mean. This function can be expressed as

$$y = \theta^T x + N(0, \sigma) \tag{1}$$

where $y$ is the outcome being predicted, $x$ is a vector of inputs, $\theta$ is the vector of model parameters, and $0$ and $\sigma$ represent the mean and standard deviation of the Gaussian random variable, respectively. The probability density function (PDF) becomes

$$p(y, x \mid \theta) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(y - \theta^T x)^2}{2\sigma^2}} \tag{2}$$


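For the two-stage architecture, the reinforcer outcome being predicted cannot be less than zero,

$$y = G(\theta^T x + N(0, \sigma)) \tag{3}$$

where $G(\cdot)$ is the threshold-linear function [13], which returns its argument whenever it is positive and returns zero otherwise. The PDF becomes

$$p(y, x \mid \theta) = \begin{cases} \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(y - \theta^T x)^2}{2\sigma^2}}, & \text{for } y > 0 \\[2ex] \dfrac{1}{2}\left(1 - \operatorname{erf}\left(\dfrac{\theta^T x}{\sigma\sqrt{2}}\right)\right), & \text{for } y = 0 \\[2ex] 0, & \text{for } y < 0 \end{cases} \tag{4}$$

differing from the other PDF by zeroing the probability of negative $y$ values and instead heaping it onto the $y = 0$ case. From this PDF, let us derive the two-stage function approximator's learning rule and, indirectly, that of LMS as well. Given data generated from Equation 3, we can infer the values of $\theta$ using maximum likelihood estimation. The probability or likelihood that a certain training data set is generated from the PDF is

$$L(\theta) = p(y^{(1)}, \ldots, y^{(m)}, x^{(1)}, \ldots, x^{(m)} \mid \theta) = \prod_{i=1}^{m} p(y^{(i)}, x^{(i)} \mid \theta) \tag{5}$$

where $m$ is the number of training data points. We can maximize this log-concave likelihood function by taking its log and ascending its gradient,

$$\begin{aligned} \log L(\theta) &= \log \prod_{i=1}^{m} p(y^{(i)}, x^{(i)} \mid \theta) = \sum_{i=1}^{m} \log p(y^{(i)}, x^{(i)} \mid \theta) \\ &= -m_{y>0} \log\left(\sigma\sqrt{2\pi}\right) - \sum_{i,\, y^{(i)}>0} \frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} - m_{y=0} \log(2) + \sum_{i,\, y^{(i)}=0} \log\left(1 - \operatorname{erf}\left(\frac{\theta^T x^{(i)}}{\sigma\sqrt{2}}\right)\right) \end{aligned} \tag{6}$$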


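where $m_{y>0}$ and $m_{y=0}$ are the number of data points for which $y > 0$ and $y = 0$, respectively. Taking the gradient of this function with respect to each $\theta_j$ gives

$$\frac{\partial \log L(\theta)}{\partial \theta_j} = \frac{1}{\sigma^2} \sum_{i,\, y^{(i)}>0} \left(y^{(i)} - \theta^T x^{(i)}\right) x_j^{(i)} \; - \; \sqrt{\frac{2}{\pi\sigma^2}} \sum_{i,\, y^{(i)}=0} \frac{e^{-\frac{(\theta^T x^{(i)})^2}{2\sigma^2}}}{1 - \operatorname{erf}\left(\frac{\theta^T x^{(i)}}{\sigma\sqrt{2}}\right)} \, x_j^{(i)} \tag{7}$$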
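As a sanity check on the derivation, the log-likelihood of the rectified model and its gradient can be coded directly and compared against a finite-difference estimate. This is a sketch under stated assumptions: the function names, the test point `theta`, and the choice `sigma = 1.0` (far larger than the value used in the simulations, chosen only to keep the numbers well conditioned) are illustrative.

```python
import math


def log_likelihood(theta, data, sigma):
    """Log-likelihood of the rectified-Gaussian model (targets y >= 0)."""
    total = 0.0
    for x, y in data:
        z = sum(t * xi for t, xi in zip(theta, x))  # theta^T x
        if y > 0:
            total -= math.log(sigma * math.sqrt(2 * math.pi))
            total -= (y - z) ** 2 / (2 * sigma ** 2)
        else:
            # y == 0: mass of the clipped tail, (1/2) * (1 - erf(.)); erfc = 1 - erf
            total += math.log(0.5 * math.erfc(z / (sigma * math.sqrt(2))))
    return total


def gradient(theta, data, sigma):
    """Analytic gradient of the log-likelihood with respect to theta."""
    grad = [0.0] * len(theta)
    for x, y in data:
        z = sum(t * xi for t, xi in zip(theta, x))
        if y > 0:
            coef = (y - z) / sigma ** 2
        else:
            coef = -(math.sqrt(2 / (math.pi * sigma ** 2))
                     * math.exp(-z ** 2 / (2 * sigma ** 2))
                     / math.erfc(z / (sigma * math.sqrt(2))))
        for j, xj in enumerate(x):
            grad[j] += coef * xj
    return grad


def numerical_gradient(theta, data, sigma, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += h
        down[j] -= h
        diff = log_likelihood(up, data, sigma) - log_likelihood(down, data, sigma)
        grad.append(diff / (2 * h))
    return grad


# Conditioned inhibition trials as (features [A, X], reinforcer present?)
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
theta = [0.8, -0.5]   # arbitrary test point
sigma = 1.0           # illustrative value only

analytic = gradient(theta, data, sigma)
numeric = numerical_gradient(theta, data, sigma)
max_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
print("analytic:", analytic)
print("numeric: ", numeric)
```

Agreement between the two gradients supports the signs and constants in the expression for the $y = 0$ term.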




The gradient can be ascended by iteratively updating the model parameters,

$$\theta_j := \theta_j + \alpha \, \frac{\partial \log L(\theta)}{\partial \theta_j} \tag{8}$$

where the learning rate is $\alpha = \frac{\sigma^2}{n+2}$. The learning rule for $\theta$ can be similarly derived for standard LMS (Equation 2), resulting in

$$\frac{\partial \log L(\theta)}{\partial \theta_j} = \frac{1}{\sigma^2} \sum_{i=1}^{M} \left(y^{(i)} - \theta^T x^{(i)}\right) x_j^{(i)} \tag{9}$$

where $M$ is the total number of data points. The practical feature that distinguishes the rectified learning rule is what happens for data points where $y = 0$: when the weighted sum $\theta^T x$ is negative, learning is essentially deactivated, especially for small $\sigma$. For our simulations, we use a low noise level of $\sigma$ = 1e-4.

The second or RL stage of our architecture computes a weighted sum of its inputs. For simplicity in our simulations, rewarding reinforcers will be given weights of +1, and punishing reinforcers will have weights of -1.
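The rectified rule can be exercised on the conditioned inhibition data to see the deactivation of learning at work. The sketch below assumes per-trial (stochastic) updates and a step size of sigma^2 / (n + 2); the helper names are hypothetical, and the guards in `inverse_mills` are a numerical safeguard added here because the exponential and the error-function term both underflow for such a small sigma.

```python
import math

SIGMA = 1e-4  # small noise parameter, as in the simulations


def inverse_mills(u):
    """phi(u) / Phi(-u) for a standard normal, guarded against underflow."""
    if u > 30.0:
        return u + 1.0 / u      # asymptotic form; avoids 0/0
    if u < -30.0:
        return 0.0              # numerator underflows: learning is switched off
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return phi / (0.5 * math.erfc(u / math.sqrt(2.0)))


def update(theta, x, y, sigma=SIGMA):
    """One gradient-ascent step of the rectified rule on a single trial."""
    n = len(theta)
    alpha = sigma ** 2 / (n + 2)              # assumed step size
    z = sum(t * xi for t, xi in zip(theta, x))
    if y > 0:
        coef = (y - z) / sigma ** 2           # identical to standard LMS
    else:
        # y == 0 term of the gradient, rewritten via the inverse Mills ratio
        coef = -inverse_mills(z / sigma) / sigma
    return [t + alpha * coef * xi for t, xi in zip(theta, x)]


def predict(theta, x):
    """Threshold-linear prediction G(theta^T x)."""
    return max(0.0, sum(t * xi for t, xi in zip(theta, x)))


theta = [0.0, 0.0]                            # weights for [A, X]
for _ in range(2000):
    theta = update(theta, [1.0, 0.0], 1.0)    # A+  : reinforcer present
    theta = update(theta, [1.0, 1.0], 0.0)    # AX- : reinforcer absent

trained = list(theta)
for _ in range(100):                          # non-reinforced X-alone trials
    theta = update(theta, [0.0, 1.0], 0.0)

print("weights after training:", [round(t, 3) for t in trained])
print("weights after X-alone trials:", [round(t, 3) for t in theta])
print("prediction for X alone:", predict(theta, [0.0, 1.0]))
```

In this sketch the inhibitor's weight settles near -1, the X-alone trials leave it untouched, and the prediction for X alone is zero rather than negative.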

Simulation and results

To show the difference between the two architectures, a single-step RL problem is simulated. The structure is captured in Table 1. In brief, it is the combination of two simultaneous conditioned inhibition experiments, where one reinforcement is rewarding and the other is punishing. The experiment is arranged to show how well these architectures distinguish between inhibitory features and excitatory features of the opposite valence, in a case where both rewards and punishments do occur. As an RL problem, there are no actions to take; an agent is forced from one state to the next. As an aside, however, one can imagine an agent's actions also being represented as input features, for which the process of learning to predict outcomes based on proposed actions would then be the same.

Specifically, there are 4 binary features, such that each feature is associated with a specific outcome that follows its presentation: feature PR is associated with delivery of a positive (rewarding) reinforcer (P), OP is associated with omission of P, NR is associated with delivery of a negative (punishing) reinforcer (N), and ON is associated with the omission of N. There are a total of $2^4 = 16$ unique data points. Training on the full dataset includes all 16 data points. The partial dataset is the full dataset excluding the data points in the lightly shaded rows of the table (numbered 2, 5, 7, 8, 10, and 14, marked here with an asterisk). The removed data points contain evidence of the non-linear nature of the inhibitory features OP and ON.

Table 1: Simulation structure and results. The first 8 columns describe the simulation data (No.; input features PR, OP, NR, ON; outcomes P, N, RV) and the last 6 columns describe the reward value predictions for each model (VFA results LR-F, LR-P, SV-F, SV-P; two-stage results 2S-F, 2S-P), rounded to the nearest one-tenth. Column RV gives the true reward value against which the results are compared. See text for further explanation and a description of the abbreviations used.

| No. | PR | OP | NR | ON | P | N | RV | LR-F | LR-P | SV-F | SV-P | 2S-F | 2S-P |
|-----|----|----|----|----|---|---|----|------|------|------|------|------|------|
| 1   | 0 | 0 | 0 | 0 | 0 | 0 | 0  | 0    | 0  | 0  | 0    | 0  | 0  |
| 2*  | 0 | 0 | 0 | 1 | 0 | 0 | 0  | 0.5  | 1  | 0  | 0.6  | 0  | 0  |
| 3   | 0 | 0 | 1 | 0 | 0 | 1 | -1 | -0.5 | -1 | -1 | -1   | -1 | -1 |
| 4   | 0 | 0 | 1 | 1 | 0 | 0 | 0  | 0    | 0  | 0  | 0    | 0  | 0  |
| 5*  | 0 | 1 | 0 | 0 | 0 | 0 | 0  | -0.5 | -1 | 0  | -0.6 | 0  | 0  |
| 6   | 0 | 1 | 0 | 1 | 0 | 0 | 0  | 0    | 0  | 0  | 0    | 0  | 0  |
| 7*  | 0 | 1 | 1 | 0 | 0 | 1 | -1 | -1   | -2 | -1 | -1.6 | -1 | -1 |
| 8*  | 0 | 1 | 1 | 1 | 0 | 0 | 0  | -0.5 | -1 | 0  | -0.6 | 0  | 0  |
| 9   | 1 | 0 | 0 | 0 | 1 | 0 | 1  | 0.5  | 1  | 1  | 1    | 1  | 1  |
| 10* | 1 | 0 | 0 | 1 | 1 | 0 | 1  | 1    | 2  | 1  | 1.6  | 1  | 1  |
| 11  | 1 | 0 | 1 | 0 | 1 | 1 | 0  | 0    | 0  | 0  | 0    | 0  | 0  |
| 12  | 1 | 0 | 1 | 1 | 1 | 0 | 1  | 0.5  | 1  | 1  | 1    | 1  | 1  |
| 13  | 1 | 1 | 0 | 0 | 0 | 0 | 0  | 0    | 0  | 0  | 0    | 0  | 0  |
| 14* | 1 | 1 | 0 | 1 | 0 | 0 | 0  | 0.5  | 1  | 0  | 0.6  | 0  | 0  |
| 15  | 1 | 1 | 1 | 0 | 0 | 1 | -1 | -0.5 | -1 | -1 | -1   | -1 | -1 |
| 16  | 1 | 1 | 1 | 1 | 0 | 0 | 0  | 0    | 0  | 0  | 0    | 0  | 0  |
| Average absolute error | | | | | | | | 0.25 | 0.375 | 0 | 0.22 | 0 | 0 |

A positive reinforcer is delivered (P = 1) after a data point is presented where feature PR is present and feature OP is not, as shown in column P of the table. A negative reinforcer is given when NR is present and ON is not, as shown in column N. Both positive and negative reinforcers may be given on the same trial. Finally, the reward value (column RV) for a data point is the positive reinforcement minus the negative reinforcement for that data point (i.e., P - N), since the associated second-stage weights are +1 and -1, respectively.

The learning cycle for the standard VFA in this experiment is the following:

1. the 4-feature state vector is presented as input
2. the reward value prediction is made
3. in the next time step, the actual reward value outcome is detected and used to update the model

The learning cycle for the two-stage architecture in this experiment is the following:

1. the 4-feature state vector is presented as input
2. the prediction of the positive reinforcer is made by one function approximator and the prediction of the negative reinforcer is made by a second approximator
3. the reward value prediction is computed as the positive reinforcer prediction minus the negative reinforcer prediction (P - N)
4. in the next time step, the absence/presence of each reinforcer is detected and used to update the associated function approximator

Note that simulations revealed that presenting multiple copies of the data to these models did not affect the results, and neither did the order of the data points.

The table contains the predictions after training for each approach and each data point, allowing us to see where poor predictions occur. In the bottom row, the average prediction error for each approach is recorded. The columns LR-F and LR-P denote the reward value predictions for LMS when the full dataset and the partial dataset are used for training, respectively. In the LR-F case, the predictions are routinely incorrect by 0.5. By looking at results for data points numbered 2, 3, 5, and 9, we can see that the LMS weights were 0.5 for the predictor of reward (PR) as well as for the predictor of the omission of punishment (ON). The same confusion occurs between the predictor of punishment and the predictor of the omission of reward, both having a weight of -0.5. The LR-P results fare no better, the difference being that the weights are exactly doubled. In both cases, the OP and NR signals and the PR and ON signals have equivalent roles in the eyes of a linear VFA.

In the results labeled SV-F (full dataset), the additional data points identify the non-linearity in the system and the model captures this information perfectly. Not so, however, when the partial dataset is used (SV-P results). Here, the SV-P model behaves like the linear model, since the partial dataset gives no hint of a non-linear facet. There is a small distinction between the SV-P and LR-P results: in SV-P, the inhibitory features are not as potent as the excitatory predictive features (0.6 versus 1.0).

In contrast to these standard value-function approximation results, the two-stage architecture using the rectified LMS function approximator perfectly predicts the correct outcomes, regardless of whether the full (2S-F results) or partial (2S-P results) dataset is used.
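The simulation data and the linear VFA columns can be regenerated from the rules stated above. The following sketch assumes batch gradient descent on squared error with no bias term, a hypothetical learning rate, and illustrative function names; it is not the authors' code, only a check that the stated rules lead to the LR-F and LR-P figures.

```python
from itertools import product

HELD_OUT = {2, 5, 7, 8, 10, 14}   # rows excluded from the partial dataset


def build_rows():
    """Enumerate the 16 data points as (row number, (PR, OP, NR, ON), RV)."""
    rows = []
    for number, x in enumerate(product((0, 1), repeat=4), start=1):
        pr, op, nr, on = x
        p = int(pr and not op)    # positive reinforcer delivered
        n = int(nr and not on)    # negative reinforcer delivered
        rows.append((number, x, p - n))
    return rows


def fit_lms(train, steps=2000, lr=0.05):
    """Batch gradient descent on squared error for a linear model, no bias."""
    w = [0.0] * 4
    for _ in range(steps):
        grad = [0.0] * 4
        for x, rv in train:
            err = rv - sum(wi * xi for wi, xi in zip(w, x))
            for j in range(4):
                grad[j] += err * x[j]
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w


def mean_abs_error(w, data):
    """Average absolute prediction error over the given data points."""
    errors = [abs(rv - sum(wi * xi for wi, xi in zip(w, x))) for x, rv in data]
    return sum(errors) / len(errors)


rows = build_rows()
full = [(x, rv) for _, x, rv in rows]
partial = [(x, rv) for number, x, rv in rows if number not in HELD_OUT]

w_full = fit_lms(full)
w_partial = fit_lms(partial)
mae_full = mean_abs_error(w_full, full)        # evaluated on all 16 rows
mae_partial = mean_abs_error(w_partial, full)  # evaluated on all 16 rows

print("LR-F weights [PR, OP, NR, ON]:", [round(w, 3) for w in w_full])
print("LR-P weights [PR, OP, NR, ON]:", [round(w, 3) for w in w_partial])
print("average absolute error:", round(mae_full, 3), round(mae_partial, 3))
```

Under these assumptions, the fitted weights give PR and ON the same sign and OP and NR the opposite sign, so the linear model has no way to treat an inhibitor differently from an excitor of the opposite valence.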


Discussion

Why does the two-stage architecture work? Recall that each function approximator predicts the presence/absence of a single reinforcer and that the prediction cannot be negative (e.g., less than zero food) because the output is rectified. Without this rectification (i.e., using standard LMS), the two-stage architecture gives exactly the same results as LR-F and LR-P (not shown in the table). In both tests (2S-F, 2S-P), the internal rectified LMS weight of the predictor of a rewarding reinforcer (PR) is +1 and the weight of the predictor of omission of the rewarding reinforcer (OP) is -1. This is different from the unrectified LMS weights (0.5 and -0.5) in the LR-F case. The reason for this is that when the inhibitor OP is presented alone, the learning rule shuts off learning so that its weight cannot be shrunk toward -0.5. This outcome of rectification resonates with the animal learning literature (but see [14]), which generally finds that non-reinforced presentations of such inhibitors do not extinguish them (i.e., they keep their inhibitory quality), in contrast to excitors, which are easily extinguished using the same procedure.

From the LR-P results, however, we saw that having weights of +1 and -1 for PR and OP, respectively, did not equate to perfect results. The difference here is that when OP is presented alone, the rectified LMS prediction of the reinforcer is truncated at zero to reflect the associated PDF (i.e., no negative outcome values) from which the model is derived. This avoids the erroneous result in data point 5. The same rectification applies to the predictor of omission of the negative reinforcer (data point 2) and other similar scenarios (data points 7, 8, 10, and 14). In contrast, such a rectification does not make sense in direct value-function approximation, where both positive (rewarding) and negative (punishing) outcomes occur.
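As a worked illustration of the two effects just described, consider data point 5 (OP alone, reinforcer absent, so $y = 0$) with the rectified weight $\theta_{OP} = -1$ and $\sigma$ = 1e-4. Assuming the rectified-Gaussian density with threshold-linear output $G$ given earlier, the factor that scales learning on a $y = 0$ trial and the resulting prediction evaluate as follows; this is a sketch of the arithmetic, not an additional result.

$$\theta^T x = -1, \qquad \frac{e^{-\frac{(\theta^T x)^2}{2\sigma^2}}}{1 - \operatorname{erf}\left(\frac{\theta^T x}{\sigma\sqrt{2}}\right)} = \frac{e^{-5 \times 10^{7}}}{1 - \operatorname{erf}(-7071.07)} \approx \frac{0}{2} = 0, \qquad \hat{y} = G(-1) = 0$$

By contrast, the unrectified LMS gradient on the same trial is proportional to $(0 - (-1))\,x_j = x_j$, which pulls the OP weight back toward zero, and the unrectified prediction is $-1$, which is the LR-P error on data point 5.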
Although the final output of the two-stage architecture is reward value, our approach is not simply changing the contents of the VFA to encapsulate multiple function approximators and an RL module. Instead, the two are fundamentally different in terms of the quantity they predict (reward vs. presence of a specific reinforcer). In the same vein, the architecture we propose is not trained using a scalar reward prediction error, but rather a vector of state feature-specific prediction errors, one scalar per feature/internal function approximator.

Apart from the rectification, the model we used to demonstrate the two-stage architecture learns only linear relationships. However, this is not to say that one could not use in its place a model that learns non-linear relationships. In fact, it will be necessary to have non-linear predictive abilities for specific combinations of features (e.g., XOR or AND cases). Here, the rectified linear model was used to demonstrate the advantage of our two-stage structure. Non-linear models that are used in its place must also rectify their output, even during learning. For example, the use of SVR without rectification for the partial dataset, but in the two-stage architecture, gives the same results as shown in SV-P.

Such rectification is non-native to most regression models. One exception is Dual Pathway Regression [15]. Unlike LMS, which integrates inhibitory features by summing their effects with excitatory features, it multiplies the excitatory feature sum by a factor between zero and one, which is smaller for greater inhibitory feature sums. As a result, inhibition can only shrink predictions, not make them negative. This model is better than LMS at eliminating the effects of irrelevant features and can capture certain non-linear relationships when appropriately extended [16]. In the present experiment, it gives the correct predictions without modification when embedded in the two-stage architecture for both the full and partial datasets.

The input representation that would best support the two-stage architecture is likely a hierarchical one. The best deep learning representations here would seem to be those with varying levels of feature complexity that correlate with the primary reinforcers meaningful to the agent. For the thirsty agent, a low-level bank of edge-detectors would seem sufficient to identify a pool or puddle with numerous ripples. In the case of a calm pool, however, a very high-level representation that captures the mirror-like effect of the water would be more discriminative. A hierarchical representation affords such varying levels of complexity and thus would seem to effectively support the prediction of future reinforcements.


Conclusion

We have demonstrated how a two-stage RL architecture can overcome a fundamental issue faced by VFAs: the inability to distinguish between predictors of reinforcement and inhibitors of reinforcement of the opposite valence. The prediction of reward value is replaced by the prediction of primary reinforcers, which themselves have a positive salience only. Such a rectified output representation allows an inhibitor (or predictor of omission) to have a negative influence when presented together with a predictor of reinforcement, but no influence when it is presented alone. This eliminates its confusion with predictors of reinforcement with the opposite valence.

References

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge Univ Press, 1998.
[2] R. S. Sutton, "Generalization in reinforcement learning: Successful examples using sparse coarse coding," Advances in Neural Information Processing Systems, pp. 1038-1044, 1996.
[3] S. Mahadevan and M. Maggioni, "Value function approximation with diffusion wavelets and Laplacian eigenfunctions," in Advances in Neural Information Processing Systems, 2005, pp. 843-850.
[4] G. Konidaris, S. Osentoski, and P. S. Thomas, "Value function approximation in reinforcement learning using the Fourier basis," in AAAI, 2011.
[5] I. P. Pavlov, Conditioned reflexes: an investigation of the physiological activity of the cerebral cortex. Oxford, England: Oxford University Press, 1927.
[6] R. A. Rescorla, "Pavlovian conditioned inhibition," Psychological Bulletin, vol. 72, pp. 77-94, 1969.
[7] C. L. Zimmer-Hart and R. A. Rescorla, "Extinction of Pavlovian conditioned inhibition," Journal of Comparative and Physiological Psychology, vol. 86, pp. 837-845, 1974.
[8] D. T. Lysle and H. Fowler, "Inhibition as a 'slave' process: deactivation of conditioned inhibition through extinction of conditioned excitation," Journal of Experimental Psychology: Animal Behavior Processes, vol. 11, no. 1, pp. 71-94, 1985.
[9] R. S. Sutton and B. Tanner, "Temporal-difference networks," in Advances in Neural Information Processing Systems, 2004, pp. 1377-1384.
[10] E. J. Rafols, M. B. Ring, R. S. Sutton, and B. Tanner, "Using predictive representations to improve generalization in reinforcement learning," in IJCAI, 2005, pp. 835-840.
[11] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.
[12] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.
[13] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
[14] I. Baetu and A. G. Baker, "Extinction and blocking of conditioned inhibition in human causal learning," Learning & Behavior, vol. 38, pp. 394-407, 2010.
[15] P. Connor and T. Trappenberg, "Biologically plausible feature selection through relative correlation," in Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN), 2013, pp. 759-766.
[16] P. Connor, "A dual pathway approach for solving the spatial credit assignment problem in a biological way," Ph.D. dissertation, Dalhousie University, November 2013.