Neural circuits have evolved to perform a range of different functions, from sensory coding to muscle control and decision making. A central goal of systems neuroscience is to elucidate what these functions are and how neural circuits implement them. A common ‘top-down’ approach starts by formulating a hypothesis about the function performed by a given neural system (e.g. efficient coding/decision making), which can be formalised via an objective function [1–10]. This hypothesis is then tested by comparing the predicted behaviour of a model circuit that maximises the assumed objective function (possibly given constraints, such as noise/metabolic costs etc.) with recorded responses.

One of the earliest applications of this approach was sensory coding, where neural circuits are thought to efficiently encode sensory stimuli, with limited information loss [7–13]. Over the years, top-down models have also been proposed for many central functions performed by neural circuits, such as generating the complex patterns of activity necessary for initiating motor commands [3], detecting predictive features in the environment [4], or memory storage [5]. Nevertheless, it has remained difficult to make quantitative contact between top-down model predictions and data; in particular, to rigorously test which (if any) of the proposed functions is actually being carried out by a real neural circuit.

The first problem is that a pure top-down approach requires us to hypothesise the function performed by a given neural circuit, which is often not possible. Second, even if our hypothesis is correct, there may be multiple ways for a neural circuit to perform the same function, so that the predictions of the top-down model may not match the data.

Here, we propose an approach to directly infer the function performed by a recurrent neural network from recorded responses. To do this, we first construct a top-down model describing how a recurrent neural network could optimally perform a range of possible functions, given constraints on how much information each neuron encodes about its inputs. A reward function quantifies how useful each state of the network is for performing a given function. The network’s dynamics are then optimised to maximise how frequently the network visits states associated with a high reward. This framework is very general: different choices of reward function result in the network performing diverse functions, from efficient coding to decision making and optimal control.

Next, we tackle the inverse problem of inferring the reward function from the observed network dynamics. To do this, we show that there is a direct correspondence between optimising a neural network as described above, and optimising an agent’s actions (in this case, each neuron’s responses) via reinforcement learning (RL) [14–18] (Fig 1). As a result, we can use inverse RL [19–23] to infer the objective performed by a network from recorded neural responses. Specifically, we are able to derive a closed-form expression for the reward function optimised by a network, given its observed dynamics.

We hypothesise that the inferred reward function, rather than e.g. the properties of individual neurons, is the most succinct mathematical summary of the network, one that generalises across different contexts and conditions. Thus we can use our framework to quantitatively predict how the network will adapt or learn in order to perform the same function when the external context (e.g. stimulus statistics), constraints (e.g. noise level), or the structure of the network (e.g. due to cell death or experimental manipulation) change. Our framework thus not only allows us to use RL to train neural networks and inverse RL to infer their function; it also generates experimental predictions for a wide range of possible manipulations.
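The core idea above — that optimised dynamics concentrate the network on high-reward states — can be sketched numerically. The toy below is our own illustration, not the paper's algorithm: the 4-neuron network, the spike-count reward, the temperature λ, and the Gibbs form p(σ) ∝ exp(r(σ)/λ) standing in for the optimised steady state are all assumed for this example.

```python
import itertools
import math

# Minimal sketch (illustrative only): a reward function assigns a value to
# each state of a hypothetical 4-neuron binary network, and optimised
# dynamics visit high-reward states more often. Here the optimised steady
# state is mimicked by a Gibbs distribution p(s) ~ exp(r(s)/lam).

n = 4
states = list(itertools.product([0, 1], repeat=n))

def reward(s):
    # example objective: reward states in which exactly 2 neurons fire
    return 1.0 if sum(s) == 2 else 0.0

lam = 0.2                                   # plays the role of a coding cost
weights = [math.exp(reward(s) / lam) for s in states]
Z = sum(weights)
p_opt = [w / Z for w in weights]            # 'optimised' steady state

# Compare occupancy of rewarded states with an unoptimised (uniform) network
p_rew_opt = sum(p for p, s in zip(p_opt, states) if reward(s) > 0)
p_rew_unif = sum(reward(s) > 0 for s in states) / len(states)
```

With these assumed numbers the tilted distribution places over 95% of its steady-state mass on rewarded states, versus 6/16 for the uniform network; lowering the temperature-like constant sharpens this further.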
bioRxiv preprint first posted online Apr. 5, 2019; doi: http://dx.doi.org/10.1101/598086. The copyright holder for this preprint (which
was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
It is made available under a CC-BY-NC-ND 4.0 International license.
[FIG. 1 graphic: panels A–F comparing a recurrent network driven towards rewarded states (network state, spikes) with an agent moving towards a rewarded location in a maze (state, actions); axis and panel labels removed.]

…reward they receive from their environment (Fig 1B). Conversely, another paradigm, called inverse RL [19–23], explains how to go in the opposite direction: to infer the reward associated with different states of the environment from observations of the agent’s actions. We reasoned that, if we could establish a mapping between optimising neural network dynamics (Fig 1A) and optimising an agent’s actions via RL (Fig 1B), then we could use inverse RL to infer the objective function optimised by a neural network from its observed dynamics.

To illustrate this idea, let us compare the problem […]

π∗ (a|s) ∝ pπ (a) e^( 〈vπ∗ (s′)〉p(s′|s,a) / λ ),  (3)
where vπ (s) is a ‘value function’ defined as the predicted reward in the future (minus the coding cost) if the agent starts in each state (SI section 1A):

vπ (s) = r (s) − λcπ (s) − Lπ
  + 〈r (s′) − λcπ (s′)〉pπ (s′|s) − Lπ
  + 〈r (s′′) − λcπ (s′′)〉pπ (s′′|s) − Lπ + … ,  (4)

where s, s′ and s′′ denote three consecutive states of the agent.

[FIG. 2 graphic, panels D–F: observed spike counts, inferred versus ground-truth reward, and predicted spiking probabilities under biased input and after cell death; axis labels removed.]
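The recursion above (Eqns 3–4) can be sketched with a minimal soft value iteration. The fragment below is a simplified stand-in, not the paper's exact algorithm: it assumes a 3-state chain with a rewarded end state, a fixed uniform action prior p(a), and a discounted objective (the text uses an average-reward formulation with the policy's own marginal as prior), but the fixed point has the same structure — the value enters through a soft-max, and the optimal policy is the prior tilted by e^(〈v〉/λ).

```python
import math

# Soft (entropy-regularised) value iteration on an assumed 3-state chain.
states = [0, 1, 2]
actions = [-1, +1]                      # move left / right, clipped at the ends
r = {0: 0.0, 1: 0.0, 2: 1.0}            # only the right-most state is rewarded
gamma, lam, p_a = 0.9, 0.5, 0.5         # discount, KL penalty, uniform prior

def step(s, a):
    # deterministic transitions, so <v(s')> is just v(step(s, a))
    return min(max(s + a, 0), 2)

v = {s: 0.0 for s in states}
for _ in range(200):                    # iterate the soft Bellman fixed point
    v = {s: r[s] + lam * math.log(sum(p_a * math.exp(gamma * v[step(s, a)] / lam)
                                      for a in actions))
         for s in states}

def policy(s):
    # Eqn 3 analogue: pi(a|s) proportional to p(a) * exp(<v(s')>/lam)
    w = {a: p_a * math.exp(gamma * v[step(s, a)] / lam) for a in actions}
    z = sum(w.values())
    return {a: wa / z for a, wa in w.items()}
```

With these numbers the value increases monotonically towards the rewarded state and policy(1) puts most of its mass on moving right; increasing λ pulls the policy back towards the prior, which is the coding-cost trade-off described in the text.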
We can use entropy-regularised RL to optimise each neuron’s response probability, πi (σ̃i |σ, x), exactly as we did for the agent in the maze. Further, as each update increases the objective function Lπ, we can alternate updates for different neurons to optimise the dynamics of the entire network. In SI section 1B, we show that this results in transition probabilities given by:

πi∗ (σ̃i = 1|σ/i , x) = [ 1 + ((1 − pπ∗ (σi )) / pπ∗ (σi )) e^( −δvπ∗ (σ/i , x) / (nλ) ) ]^(−1),  (6)

where σ/i is the state of all neurons except neuron i, and

δvπ∗ (σ/i , x) = Σσi =−1,1 σi 〈vπ∗ (σ, x′)〉p(x′|x) ,  (7)

where vπ (σ, x) is the value associated with each state, defined in Eqn 4, with a coding cost, cπ (σ, x) = Σ_{i=1}^{n} DKL [πi (σ̃i |σ, x) ∥ pπ (σ̃i )], which penalises deviations from each neuron’s average firing rate. The network dynamics are optimised by alternately updating the value function and neural response probabilities (SI section 1B).

To see how this works in practice, we simulated a network of 8 neurons that receive a binary input x (Fig 2A). The assumed goal of the network is to fire exactly 2 spikes when x = −1, and 6 spikes when x = 1, while minimising the coding cost. To achieve this, the reward was set to unity when the network fired the desired number of spikes, and zero otherwise (Fig 2B). Using entropy-regularised RL, we derived optimal tuning curves for each neuron, which show how their spiking probability should vary depending on the input, x, and the number of spikes fired by other neurons (Fig 2C). We confirmed that after optimisation the number of spikes fired by the network was tightly peaked around the target values (Fig 2D). Decreasing the coding cost reduced noise in the network, decreasing variability in the total spike count.

Inferring the objective function from the neural dynamics

We next asked if we could use inverse RL to infer the reward function optimised by a neural network, just from its observed dynamics (Fig 2D). For simplicity, we first consider a recurrent network that receives no external input, in which case the optimal dynamics (Eqn 6) correspond to Gibbs sampling from a steady-state distribution: p (σ) ∝ Πi p (σi ) exp( vπ (σ) / (λn) ). Combining this with the Bellman equality (which relates the value function, vπ (σ, x), to the reward function, r (σ, x)), we can derive the following expression for the reward function:

r (σ) ∝ Σ_{i=1}^{n} log( p (σi |σ/i ) / p (σi ) ) + C,  (8)

where p (σi |σ/i ) denotes the probability that neuron i is in state σi, given the current state of all the other neurons, C is an irrelevant constant, and the unknown constant of proportionality comes from the fact that we don’t know the coding cost, λ. SI section 1C shows how we can recover the reward function when there is an external input. In this case, we do not obtain a closed-form expression for the reward function, but must instead infer it via maximum likelihood.

Figure 2E shows how we can use inverse RL to infer the reward function optimised by a model network from its observed dynamics. As for the agent in the maze, once we know the reward function, we can use it to predict how the network dynamics will vary depending on the internal/external constraints. For example, we can predict how neural tuning curves vary when we alter the input statistics (Fig 2F, upper), or kill a cell (Fig 2F, lower).

Application to a pair-wise coupled network

Our framework is limited by the fact that the number of states, ns, scales exponentially with the number of neurons (ns = 2^n). Thus, it quickly becomes infeasible to compute the optimal dynamics as the number of neurons increases. Likewise, we need an exponential amount of data to reliably estimate the sufficient statistics of the network, required to infer the reward function.

This problem can be circumvented by using a tractable parametric approximation of the value function. For example, if we approximate the value function by a quadratic function of the responses, our framework predicts a steady-state response distribution of the form: p (σ) ∝ exp( Σi,j≠i Jij σi σj + Σi hi σi ), where Jij denotes the pair-wise coupling between neurons, and hi is the bias. This corresponds to a pair-wise Ising model, which has been used previously to model recorded neural responses [26, 27]. In SI section 1D we derive an approximate RL algorithm to optimise the coupling matrix, J, for a given reward function r (σ) and coding cost.

To illustrate this, we simulated a network of 12 neurons arranged in a ring, with reward function equal to 1 if exactly 4 adjacent neurons are active together, and 0 otherwise. After optimisation, nearby neurons had positive couplings, while distant neurons had negative couplings (Fig 3A). The network dynamics generated a single hill of activity which drifted smoothly in time. This is reminiscent of ring attractor models, which have been influential in modeling neural functions such as the rodent head direction system [28, 29].

As before, we could then use inverse RL to infer the reward function from the observed network dynamics. However, note that when we use a parametric approximation of the value function this problem is not well-posed, and we have to make additional assumptions (SI section 1D). In our simulation, we found that assuming that the reward function was both sparse and positive was sufficient to recover the original reward function used to
optimise the network (SI section 2C).

[FIG. 3 graphic: panels A–D showing the optimised couplings and example raster plots of network activity; axis labels removed.]

FIG. 3. Pair-wise coupled network. (A) We optimized the parameters of a pair-wise coupled network, using a reward function that was equal to 1 when exactly 4 adjacent neurons were simultaneously active, and 0 otherwise. The resulting couplings between neurons are schematized on the left, with positive couplings in red and negative couplings in blue. The exact coupling strengths are plotted in the centre. On the right we show an example of the network dynamics. Using inverse RL, we can infer the original reward function used to optimise the network from its observed dynamics. We can then use this inferred reward to predict how the network dynamics will vary when we increase the coding cost (B), remove connections between distant neurons (C) or selectively activate certain neurons (D).

Having inferred the reward function optimised by the network, we can use it to predict how the coupling matrix, J, and network dynamics vary depending on the internal/external constraints. For example, increasing the coding cost resulted in stronger positive couplings between nearby neurons and a hill of activity that sometimes jumped discontinuously between locations (Fig 3B); removing connections between distant neurons resulted in two uncoordinated peaks of activity (Fig 3C); finally, selectively activating certain neurons ‘pinned’ the hill of activity to a single location (Fig 3D).

Inferring efficiently encoded stimulus features

An influential hypothesis, called ‘efficient coding’, posits that sensory neural circuits have evolved to encode maximal information about sensory stimuli, given internal constraints [7–13]. However, the theory does not specify which stimulus features are relevant to the organism, and thus should be encoded. Here we show how one can use inverse RL to: (i) infer which stimulus features are encoded by a recorded neural network, and (ii) test whether these features are encoded efficiently.

Efficient coding posits that neurons maximise information encoded about some relevant feature, y (x), given constraints on the information encoded by each neuron about their inputs, x (Fig 4A). This corresponds to maximising:

Lπ = Iπ (y (x); σ) − λ Σ_{i=1}^{n} Iπ (σ̃i ; σ, x) ,  (9)

where λ controls the strength of the constraint. Noting […] where we have omitted terms which don’t depend on π. Now this is exactly the same as the objective function we have been using so far (Eqn 5), in the special case where the reward function, r (σ, x), is equal to the log-posterior, log pπ (y (x)|σ). As a result we can maximise Lπ via an iterative algorithm, where on each iteration we update the reward function by setting r (x, σ) ← log pπ (y (x)|σ), before then optimising the network dynamics via entropy-regularised RL. Thus, thanks to the correspondence between entropy-regularised RL and efficient coding, we could derive an algorithm to optimise the dynamics of a recurrent network to perform efficient coding [27].

As an illustration, we simulated a network of 7 neurons that receive a sensory input consisting of 7 binary pixels (Fig 4B, top). Here, the ‘relevant feature’, y (x), was a single binary variable, which was equal to 1 if more than 4 pixels were active, and -1 otherwise (Fig 4A, bottom). Using the efficient-coding algorithm described above, we derived optimal tuning curves, showing how each neuron’s spiking probability should vary with both the number of active pixels and the number of spikes fired by other neurons (Fig 4C). We also derived how the optimal readout, p (y|σ), depended on the number of spiking neurons (Fig 4D). Finally, we checked that the optimised network encoded significantly more information about the relevant feature than a network of independent neurons, over a large range of coding costs (Fig 4E).

Now, imagine that we just observe the stimulus and neural responses (Fig 4F). Can we recover the relevant feature, y (x)? To do this, we first use inverse RL to infer the reward function from observed neural responses (Fig 4G). As described above, if the network is performing efficient coding then the inferred reward, r (σ, x), should be proportional to the log-posterior, log p (y (x)|σ). Thus, given σ, the inferred reward, r (σ, x), should only depend on changes to the input, x, that alter y (x). As a result, we can use the inferred reward to uncover all inputs, x, that map onto the same
[FIG. 4 graphic: panels A–H showing the input pixels and relevant feature, optimal tuning curves, readout probabilities p(y|σ), encoded information versus coding cost, observed responses, inferred reward, and predicted tuning curves under changed input statistics; axis labels removed.]

FIG. 4. Efficient coding and inverse RL. (A) The neural code was optimised to efficiently encode an external input, x, so as to maximise information about a relevant stimulus feature y (x). (B) The input, x, consisted of 7 binary pixels. The relevant feature, y (x), was equal to 1 if >4 pixels were active, and -1 otherwise. (C) Optimising a network of 7 neurons to efficiently encode y (x) resulted in all neurons having identical tuning curves, which depended on the number of active pixels and total spike count. (D) The posterior probability that y = 1 varied monotonically with the spike count. (E) The optimised network encoded significantly more information about y (x) than a network of independent neurons with matching stimulus-dependent spiking probabilities, p (σi = 1|x). The coding cost used for the simulations in the other panels is indicated by a red circle. (F-G) We use the observed responses of the network (F) to infer the reward function optimised by the network, r (σ, x) (G). If the network efficiently encodes a relevant feature, y (x), then the inferred reward (solid lines) should be proportional to the log-posterior, log p (y (x)|σ) (empty circles). This allows us to (i) recover y (x) from observed neural responses, and (ii) test whether this feature is encoded efficiently by the network. (H) We can use the inferred objective to predict how varying the input statistics, by reducing the probability that pixels are active, causes the population to split into two cell types, with different tuning curves and mean firing rates (right).

value of y (x). Here, we see that the inferred reward collapses onto two curves only (blue and red in Fig 4G), depending on the total number of pixels in the stimulus. This allows us to deduce that the relevant coded variable, y (x), must be a sharp threshold on the number of simultaneously active pixels (despite the neural tuning curves varying smoothly with the number of active pixels; Fig 4C). Next, having recovered y (x), we can check whether it is encoded efficiently by seeing whether the inferred reward, r (σ, x), is proportional to the log-posterior, log p (y (x)|σ).

Finally, once we have inferred the function performed by the network, we can predict how its dynamics will vary with context, such as when we alter the input statistics. For example, in our simulation, reducing the probability that input pixels are active causes the neural population to split into two cell-types, with distinct tuning curves and mean firing rates (Fig 4H) [13].

DISCUSSION

A large research effort has been devoted to developing […] from recorded responses, we show how to infer the objective function optimised by the network. This approach, where the objective function is inferred from data, should allow one to: (i) quantitatively predict the responses of a given recorded network, from functional principles; and (ii) build top-down models of high-level neural areas, whose function is a priori unknown.

An alternative, bottom-up, approach is to construct phenomenological models describing how neurons respond to given sensory stimuli and/or neural inputs [26, 27, 30, 31]. In common with our work, such models are directly fitted to neural data. However, unlike our work, they do not set out to reveal the function performed by the network. Further, they are often poor at predicting neural responses in different contexts (e.g. varying stimulus statistics). Here we hypothesize that it is the function performed by a neural circuit that remains invariant, not its dynamics or individual cell properties. Thus, if we can infer what this function is, we should be able to predict how the network dynamics will adapt depending on the context, so as to perform the same function under different constraints. As a result, our theory could predict how the dynamics of a recorded network will adapt in response to a large range of experimental manipulations, such as varying the stimulus statistics, blocking connections, knocking out/stimulating cells etc.

There is an extensive literature on how neural networks could perform RL [32–34]. Our focus here was different: we sought to use tools from RL and inverse RL to infer the function performed by a recurrent neural network. Thus, we do not assume the network receives an explicit reward signal: the reward function is simply a way of expressing which states of the network are useful for performing a given function. In contrast to previous work, we treat each neuron as an independent agent, which optimises its responses to maximise the reward achieved by the network, given a constraint on how much it can encode about its inputs. As well as being required for biological realism, the coding constraint has the benefit of making the inverse RL problem well-posed. Indeed, under certain assumptions, we show that it is possible to write a closed-form expression for the reward function optimised by the network, given its steady-state distribution (Eqn 8).

Our framework relies on several assumptions about the network dynamics. First, we assume that the network has Markov dynamics, such that its state depends only on the preceding time-step. To relax this assumption, we could redefine the network state to include spiking activity in several time-steps. For example, we could thus include the fact that neurons are unlikely to fire two spikes within a given temporal window, called their refractory period. Of course, this increase in complexity would come at the expense of decreased computational tractability, which may necessitate approximations. Second, we assume the only constraint that neurons face is a ‘coding cost’, which limits how much information they encode about other neurons and external inputs. In reality, biological networks face many other constraints, such as the metabolic cost of spiking and constraints on the wiring between cells, which we may want to incorporate explicitly into our framework in the future.

Our work unifies several influential theories of neural coding that were considered separately in previous work. For example, we show a direct link between entropy-regularised RL [15–18] (Fig 1), ring-attractor networks [28, 29] (Fig 3), and efficient sensory coding [7–13] (Fig 4). Further, given a static network without dynamics, our framework is directly equivalent to rate-distortion theory [24, 25]. While interesting in its own right, this generality means that we can potentially apply our theory to infer the function performed by diverse neural circuits that have evolved to perform a broad range of different functional objectives. This contrasts with previous work, where neural data is typically used to test a single top-down hypothesis, formulated in advance.

ACKNOWLEDGMENTS

This work was supported by an ANR JCJC grant (ANR-17-CE37-0013) to M.C., ANR Trajectory (ANR-15-CE37-0011), ANR DECORE (ANR-18-CE37-0011), the French State program Investissements d’Avenir managed by the Agence Nationale de la Recherche (LIFESENSES; ANR-10-LABX-65), EC Grant No. H2020-785907 from the Human Brain Project (SGA2), and an AVIESAN-UNADEV grant to O.M. The authors would like to thank Ulisse Ferrari for useful discussions and feedback.
[1] Yang GR, Joglekar MR, Song HF, Newsome WT, Wang XJ (2019) Task representations in neural networks trained to perform many cognitive tasks. Nat Neurosci 22:297–306.
[2] Heeger DJ (2017) Theory of cortical function. Proc Natl Acad Sci USA 114:1773–1782.
[3] Sussillo D, Abbott LF (2009) Generating coherent patterns of activity from chaotic neural networks. Neuron 63:544–557.
[4] Gütig R (2016) Spiking neurons can discover predictive features by aggregate-label learning. Science 351(6277):aab4113.
[5] Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci USA 79:2554–2558.
[6] Körding K (2007) Decision theory: what should the nervous system do? Science 318:606–610.
[7] Boerlin M, Machens CK, Denève S (2013) Predictive coding of dynamical variables in balanced spiking networks. PLoS Comp Bio 9:e1003258.
[8] Simoncelli EP, Olshausen BA (2001) Natural image statistics and neural representation. Ann Rev Neurosci 24:1193–1216.
[9] Tkacik G, Prentice JS, Balasubramanian V, Schneidman E (2010) Optimal population coding by noisy spiking neurons. Proc Natl Acad Sci USA 107:14419–14424.
[10] Chalk M, Marre O, Tkacik G (2018) Toward a unified theory of efficient, predictive, and sparse coding. Proc Natl Acad Sci USA 115:186–191.
[11] Barlow HB (1961) Possible principles underlying the transformations of sensory messages. Sensory Communication, ed Rosenblith WA (MIT Press, Cambridge, MA), pp 217–234.
[12] Field DJ (1994) What is the goal of sensory coding? Neural Comput 6:559–601.
[13] Gjorgjieva J, Sompolinsky H, Meister M (2014) Benefits of pathway splitting in sensory coding. J Neurosci 34:12127–12144.
[14] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT Press.
[15] Todorov E (2008) General duality between optimal control and estimation. Proc of the 47th IEEE Conference on Decision and Control, 4286–4292.
[16] Schulman J, Chen X, Abbeel P (2017) Equivalence between policy gradients and soft Q-learning. arXiv:1704.06440.
[17] Haarnoja T, Tang H, Abbeel P, Levine S (2017) Reinforcement learning with deep energy-based policies. Proc 34th International Conf on Machine Learning 70:1352–1361.
[18] Tiomkin S, Tishby N (2017) A unified Bellman equation for causal information and value in Markov decision processes. arXiv:1703.01585.
[19] Ng AY, Russell SJ (2000) Algorithms for inverse reinforcement learning. Proc of the 17th International Conf on Machine Learning, 663–670.
[20] Rothkopf CA, Dimitrakakis C (2011) Preference elicitation and inverse reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 34–48.
[21] Herman M, Gindele T, Wagner J, Schmitt F, Burgard W (2016) Inverse reinforcement learning with simultaneous estimation of rewards and dynamics. Artificial Intelligence and Statistics, 102–110.
[22] Wu Z, Schrater P, Pitkow X (2018) Inverse POMDP: Inferring what you think from what you do. arXiv:1805.09864.
[23] Reddy S, Dragan AD, Levine S (2018) Where do you think you're going?: Inferring beliefs about dynamics from behavior. arXiv:1805.08010.
[24] Berger T (1971) Rate Distortion Theory. Englewood Cliffs.
[25] Bialek W, van Steveninck RRDR, Tishby N (2006) Efficient representation as a design principle for neural coding and computation. IEEE International Symposium on Information Theory, 659–663.
[26] Schneidman E, Berry MJ, Segev R, Bialek W (2006) Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440:1007–1012.
[27] Tkacik G, Marre O, Amodei D, Schneidman E, Bialek W, Berry MJ (2014) Searching for collective behavior in a large network of sensory neurons. PLoS Comp Bio 10:e1003408.
[28] Ben-Yishai R, Bar-Or RL, Sompolinsky H (1995) Theory of orientation tuning in visual cortex. Proc Natl Acad Sci USA 92:3844–3848.
[29] Zhang K (1996) Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. J Neurosci 16:2112–2126.
[30] Pillow JW, Shlens J, Paninski L, Sher A, Litke AM, Chichilnisky EJ, Simoncelli EP (2008) Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature 454:995–999.
[31] McIntosh L, Maheswaranathan N, Nayebi A, Ganguli S, Baccus S (2016) Deep learning models of the retinal response to natural scenes. Adv Neur Inf Proc Sys 29:1369–1377.
[32] Niv Y (2009) Reinforcement learning in the brain. J Mathemat Psychol 53:139–154.
[33] Dayan P, Niv Y (2008) Reinforcement learning: the good, the bad and the ugly. Curr Op Neurobio 18:185–196.
[34] Daw ND, Doya K (2006) The computational neurobiology of learning and reward. Curr Op Neurobio 16:199–204.
vπ (s) = r (s) − λcπ (s) − Lπ + 〈vπ (s′)〉pπ (s′|s) .  (4)

Now, given an initial policy πold, one can show, via the policy improvement theorem [1–4], that applying the greedy policy update,

πnew (a|s) = arg maxπ [ −λcπ (s) + 〈vπold (s′)〉pπ (s′|s) ] ,  (5)

is guaranteed to improve the objective function, so that Lπnew ≥ Lπold. Solving the above maximisation, we arrive at the following update:

πnew (a|s) = pπold (a) e^( 〈vπold (s′)〉p(s′|s,a) / λ ) / Zπold (s) ,  (6)

where Zπold (s) is a normalisation constant. Repeated application of the Bellman recursion (Eqn 4) and greedy policy update (Eqn 6) returns the optimal policy, π∗ (a|s), which maximises Lπ.

We consider a recurrent neural network, with n neurons, each described by a binary variable, σi = −1/1, denoting whether a given neuron is silent/fires a spike in each temporal window. The network receives an external input, x. The network state is described by a vector of n binary values, σ = (σ1 , σ2 , . . . , σn )T. Both the network and input are assumed to have Markov dynamics. Neurons are updated asynchronously, by updating a random neuron at each time-step with probability πi (σi′ |σ). The network dynamics are thus described by:

p (σ′ |σ, x) = (1/n) Σ_{i=1}^{n} πi (σi′ |σ, x) Π_{j≠i} δ(σj′ , σj ) ,  (10)

where δ(σj′ , σj ) = 1 if σj′ = σj and 0 otherwise. Equivalently, we can say that at each time-step, a set of proposed updates, σ̃, are independently sampled from Πi πi (σ̃i |σ), and then a neuron i is selected at random to be updated, such that σi′ ← σ̃i .
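The asynchronous update rule in Eqn 10 can be sketched directly. In the fragment below the conditional πi is a hypothetical hand-written policy (each neuron tends to copy its left neighbour on a ring), chosen only so the sampler has something to sample; the update structure — pick one neuron uniformly at random, resample its state, leave the rest unchanged — follows the equation.

```python
import random

# Toy instantiation of the asynchronous dynamics of Eqn 10: one neuron,
# chosen uniformly at random (the 1/n factor), resamples its state from its
# conditional pi_i; every other neuron keeps its previous state (the delta
# terms). The particular pi_i below is an assumed example policy.

def pi_i(i, sigma):
    # assumed policy: neuron i tends to copy its left neighbour on a ring
    return 0.9 if sigma[i - 1] == 1 else 0.1   # P(sigma_i' = +1)

def async_step(sigma, rng):
    sigma = list(sigma)                        # copy: sigma_j' = sigma_j, j != i
    i = rng.randrange(len(sigma))
    sigma[i] = 1 if rng.random() < pi_i(i, sigma) else -1
    return sigma

rng = random.Random(0)
traj = [[-1] * 8]                              # n = 8 neurons, all silent
for _ in range(1000):
    traj.append(async_step(traj[-1], rng))
```

Each transition changes at most one coordinate, so long runs of such updates implement the Gibbs-sampling-like dynamics discussed in the main text.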
In the exact algorithm, we updated the value function by setting it equal to v̂ (σ, x) (Eqn 15). Since in the parametric case this is not possible, we can instead update φ to minimise:

Gφ̄ (φ) = (1/2) 〈 ( vφ (σ, x) − v̂φ̄ (σ, x) )² 〉p(σ,x) ,  (23)

where v̂φ̄ (σ, x) is the target value function, defined as in Eqn 22, with parameters φ̄ [3–5].

We follow the procedure set out in [3–5] to transform this into a stochastic gradient descent algorithm. First, we perform nbatch samples from the current policy. Next, we perform a stochastic gradient descent update:

φ ← φ − (η / nbatch) Σ_{l=1}^{nbatch} f (σl , xl ) ( vφ (σl , xl ) − v̂φ̄ (σl , xl ) ) ,  (24)

where η is a constant that determines the learning rate. Finally, after doing this nepoch times, we update the target parameters, φ̄ ← φ. These steps are repeated until convergence.

2. Inverse RL

We can infer the parameters of the value function, φ, by maximising the log-likelihood: 〈log pφ (σ′ |σ, x)〉D. We can choose the form of the value function to ensure …

II. SIMULATION DETAILS

A. Agent navigating a maze

We considered an agent navigating a 15 × 15 maze. The agent could choose to move up, down, left or right in the maze. At each time-step there was a 5% probability that the agent moved in a random direction, independent of its selected action. Moving in the direction of a barrier (shown in blue in Fig 1D) would result in the agent remaining in the same location. After reaching the ‘rewarded’ location (bottom right of the maze), the agent was immediately transported to a starting location in the top left of the maze. We optimised the agent’s policy at both low and high coding cost (λ = 0.013/0.13 respectively). The reward was inferred from the agent’s policy after optimisation as described in SI section IA.1.

B. Network with single binary input

We simulated a network of 8 neurons that receive a single binary input, x. The stimulus has a transition probability: p (x′ = 1|x = −1) = p (x′ = −1|x = 1) = 0.02. The reward function was unity when x = 1 and the network fired exactly 2 spikes, or when x = −1 and the network fired exactly 6 spikes. We set λ = 0.114. To avoid trivial solutions where a subset of neurons spike
that this is tractable. For example, if the value function continuously while other neurons are silent, we defined
is quadratic in the responses, then this corresponds to the coding cost to penalise deviations from the population
inferring the parameters of a pair-wise Ising model [6, 7]. averaged firing rate (rather than the average firing rate
After inferring φ, we want to infer the reward function. for eachneuron). Thus, the coding cost was defined as
At convergence, ∇φ F (φ) = 0 and φ̄ = φ, so that: n
c(σ) = i=1 〈log πi (σ̃i |σ) /p (σ̃i )〉πi (σ̃i |σ) , where p (σ̃i ) is
0 = 〈f (σ, x) (vφ (σ, x) − v̂φ (σ, x))〉D (25) the average spiking probability, across all neurons.
We inferred the reward r (σ) from neural responses as
= 〈f (σ, x) (r (σ, x) − r̂φ (σ, x))〉D (26)
described in SI section 1C. In Fig 2E we rescaled and
where r̂ (σ, x) is given by: shifted the inferred reward to have the same mean and
variance as the true reward.
n
T 1 T We used the inferred reward to predict how neural
r̂φ (σ, x) = f (σ, x) φ − λ log p (σi ) e nλ φ f (σ,x)
tuning curves should adapt when we alter the stim-
i=1 σi
(27) ulus statistics (Fig 2F, upper) or remove a cell (Fig
In the exact case, where vφ (σ, x) = v̂φ (σ, x) (and thus, 2F, lower). For fig 2F (upper), we altered the stim-
Gφ̄ (φ) = 0), then the inferred reward equals r̂φ (σ, x). ulus statistics by setting p (x′ = 1|x = −1) = 0.01 and
However, this is not necessarily true when we assume an p (x′ = −1|x = 1) = 0.03. For figure 2C (lower), we re-
approximate value function. moved one cell from the network. In both cases, we man-
Just as we did for the value function, we can express ually adjusted λ to keep the average coding cost constant.
the reward function as a linear combination of basis func-
tions: r (σ, x) = θ T g (σ, x). Thus, Eqn 26 becomes:
C. Pair-wise coupled network
T
f (σ) g (σ) θ = 〈f (σ, x) r̂φ (σ, x)〉D . (28)
D
We considered a network of 12 neurons arranged in a
If the reward function has the same number of parame- ring. We defined a reward function that was equal to 1 if
ters than the approximate value function (i.e. f (σ) and exactly 4 adjacent neurons were active, and 0 otherwise.
g (σ) have the same size), then we can solve this equation We defined the coding cost as described in the previous
to find θ. Alternatively, if the reward function has more section, to penalise deviations from the population aver-
parameters than the value function, then we require ad- aged mean firing rate.
ditional assumptions to unambiguously infer the reward. We approximated the value function by a quadratic
function, v(σ) = Σ_{i,j≠i} J_ij σ_i σ_j + Σ_i h_i σ_i. We optimised the parameters of this value function using the algorithm described in SI section 1D, with λ = 0.05. We used batches of n_batch = 40 samples, and updated the target parameters after every n_epoch = 100 batches.

We inferred the reward function from the inferred network couplings, J and h. As described in SI section 1D, this problem is only well-posed if we assume a low-dimensional parametric form for the reward function, or add additional assumptions. For our simulation, we assumed the reward function was always positive (r(σ) > 0) and sparse (i.e. we penalised large values of l_1 = Σ_σ r(σ)).

Finally, for the simulations shown in panels 3B–D, we ran the optimisation with λ = 0.1, 0.01 and 0.1, respectively. For panel 3C we removed connections between neurons separated by a distance of 3 or more on the ring. For panel 3D we forced two of the neurons to be continuously active.

D. Efficient coding

We considered a stimulus consisting of m = 7 binary variables, x_i = −1/1. The stimulus had Markov dynamics, with each unit updated asynchronously. The stimulus dynamics were given by:

p(x′|x) = (1/m) Σ_{i=1}^m (1 + exp[ −J x_i′ Σ_{j=1}^m x_j ])^{−1} δ(x′_{/i}, x_{/i}),   (29)

where J = 1.5 is a coupling constant, and x_{/i} denotes the state of all units other than unit i. A ‘relevant’ variable, y(x), was equal to 1 if 4 or more inputs equalled 1, and equal to −1 otherwise.

We optimised a network of n = 7 neurons to efficiently code the relevant variable, y(x), using the algorithm described in the main text. For Fig 4B–C we set λ = 0.167. For Fig 4D we varied λ between 0.1 and 0.5. For Fig 4E we altered the stimulus statistics so that:

p(x′|x) = (1/m) Σ_{i=1}^m (1 + exp[ −x_i′ (J_0 + J Σ_{j=1}^m x_j) ])^{−1} δ(x′_{/i}, x_{/i}),   (30)

where J_0 was a bias term that we varied between 0 and 0.4. For each value of J_0 we adjusted λ so as to keep the average coding cost constant.
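These stimulus dynamics can be sketched by updating one randomly chosen unit per step with a sigmoidal probability set by the summed field. The exact form of the field in Eqn 29 is partly our assumption in reconstructing the equation (in particular the sign convention and the inclusion of unit i in the sum); with J = 1.5 the resulting dynamics are bistable, so the relevant variable y(x) switches slowly:

```python
import numpy as np

rng = np.random.default_rng(1)
m, J, T = 7, 1.5, 20000   # units, coupling constant, number of update steps

def step(x):
    """One asynchronous stimulus update (cf. Eqn 29): redraw one random unit."""
    i = rng.integers(m)
    p_up = 1.0 / (1.0 + np.exp(-J * x.sum()))   # P(x_i' = +1); field form assumed
    x = x.copy()
    x[i] = 1 if rng.random() < p_up else -1
    return x

def y(x):
    """Relevant variable: +1 if 4 or more of the m inputs equal 1, else -1."""
    return 1 if (x == 1).sum() >= 4 else -1

x = rng.choice([-1, 1], size=m)
ys = []
for _ in range(T):
    x = step(x)
    ys.append(y(x))
ys = np.array(ys)
switch_rate = (ys[1:] != ys[:-1]).mean()        # small value => y has slow dynamics
```

Adding a bias J_0 to the field, as in Eqn 30, skews the two basins and changes how often y switches, which is the manipulation used for Fig 4E.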
[1] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT Press.
[2] Mahadevan S (1996) Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22:159–195.
[3] Schulman J, Chen X, Abbeel P (2017) Equivalence between policy gradients and soft Q-learning. arXiv:1704.06440.
[4] Haarnoja T, Tang H, Abbeel P, Levine S (2017) Reinforcement learning with deep energy-based policies. Proc 34th International Conf on Machine Learning, 70:1352–1361.
[5] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D (2015) Human-level control through deep reinforcement learning. Nature, 518:529–533.
[6] Schneidman E, Berry II MJ, Segev R, Bialek W (2006) Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440:1007–1012.
[7] Tkačik G, Marre O, Amodei D, Schneidman E, Bialek W, Berry II MJ (2014) Searching for collective behavior in a large network of sensory neurons. PLoS Comp Bio, 10:e1003408.