4.4 Discussion
4.4.2 Outlook
Our work not only demonstrated and quantified the advantages of the accelerated analog neuromorphic approach, but also laid the groundwork for implementing reinforcement learning in a spiking neuromorphic network. We firmly believe that, building on this groundwork, future studies using more neuromorphic resources will show more elaborate agents acting in complex environments guided by reinforcement learning, such as the insect navigation task in Billaudelle et al. [2019b] and Schreiber et al. [2020].
5 Biologically plausible deep reinforcement learning in a time-continuous framework
The project in this section was done in close collaboration with Walter Senn, Dominik Dold, Oskar Riedler and Mihai Petrovici. At the time of writing, it is being prepared for publication.
A central question in neuroscience is how the brain is capable of learning and building memories. This question concerns several sub-disciplines of neuroscience and spans orders of magnitude in both time and space: from molecular models of synaptic plasticity to changes in behavior over decades. Despite the myriad proposed models and experimental findings (compare section 2.2.3), it remains an open issue what the basic coding scheme(s) of the brain are and which learning rule(s) it realizes.
The recent success of deep learning [LeCun et al., 2015] has brought the question of whether deep learning is realized in the mammalian brain back into focus. Deep reinforcement learning is particularly interesting due to its success in machine learning: it has often reached super-human performance in playing board and video games [Mnih et al., 2013, Silver et al., 2017, Vinyals et al., 2019]. The success of deep learning is based on three key components: the backpropagation algorithm [Rumelhart et al., 1986], the availability of large labeled datasets and the availability of cheap and powerful GPUs. Here, we focus on mechanistic modeling of the backpropagation algorithm and disregard the other two factors.
For a long time, backpropagation in the brain had been considered impossible [Richards et al., 2019]. It seemed implausible that an error signal is propagated backwards over several layers of neurons, tailored individually for each synapse and still obeying biological constraints like locality of interactions and the heterogeneity of the neuro-synaptic parameters. In the last couple of years, several studies either relaxed implausible assumptions [Lillicrap et al., 2016] or even suggested models for the backpropagation mechanism [O’Reilly, 1996, Xie and Seung, 2003, Roelfsema and Ooyen, 2005, Rombouts et al., 2015, Scellier and Bengio, 2017, Whittington and Bogacz, 2017, Amit, 2019, Mesnard et al., 2019, Marblestone et al., 2016, Pozzi et al., 2018, Whittington and Bogacz, 2019, Richards et al., 2019]. A common feature of these works is that they mostly consider supervised and unsupervised learning only, although reinforcement learning is clearly present in the brain [Niv, 2009]. Deep reinforcement learning is considered in only a few of these studies [Rombouts et al., 2015, Pozzi et al., 2018], and these models do not respect other biological constraints such as time-continuous dynamics.
A large body of literature has been published on models of reinforcement learning in the brain, see for example the review by Niv [2009]. However, these publications focus more on the biological aspects of reinforcement learning and less on the learning capabilities of the underlying model, and they often consider only shallow learning architectures, for example in Farries and Fairhall [2007], Izhikevich [2007a], Frémaux et al. [2010, 2013], Deperrois et al. [2019]. This restriction to shallow learning limits the capabilities of these models in terms of task complexity, whereas animals and humans can clearly solve more complex tasks than these shallow models allow.
In this project, we extend the model of deep supervised learning in a time-continuous framework based on the principle of least action [Senn et al., in preparation, Dold, 2020] to include reinforcement learning. In their work, Senn et al. [in preparation] derive neural dynamics and plasticity rules from first principles and present a framework of time-continuous deep learning using only local interactions and plasticity. The authors propose a model where stereotypical microcircuits and a predictive firing mechanism of the neurons enable learning via backpropagation at any time without separate phases. We amended the model with a lateral interaction among the neurons representing the actions and with a global neuromodulator based on the reward-prediction error. The lateral interaction is closely related to winner-takes-all (WTA) circuits, while the reward-prediction error measures the deviation of the received reward from the expectation [Schultz, 2016, Sutton and Barto, 2018]. We show that the proposed model approximates policy-gradient learning and thereby maximizes the expected reward.
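To make the connection between a reward-prediction error and policy-gradient learning concrete, the following is a minimal sketch of a standard REINFORCE-style update with a running-average reward baseline. It is not the time-continuous model described above: the toy task (a one-hot stimulus whose index is the rewarded action), the softmax sampling that stands in for the lateral competition between action neurons, and all names and parameter values (`train`, `greedy_action`, `lr`, `baseline_rate`) are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(n_trials=5000, n_classes=3, lr=0.2, baseline_rate=0.05, seed=0):
    """REINFORCE with a running-average reward baseline on a toy task.

    Hypothetical setup: the stimulus is a one-hot vector and the rewarded
    action is the index of the active input.
    """
    rng = random.Random(seed)
    # weights[a][i]: connection from input i to action unit a
    weights = [[0.0] * n_classes for _ in range(n_classes)]
    expected_reward = 0.0  # running estimate of the received reward
    for _ in range(n_trials):
        stimulus = rng.randrange(n_classes)
        logits = [weights[a][stimulus] for a in range(n_classes)]
        probs = softmax(logits)
        # Stochastic action selection (stand-in for the lateral competition)
        action = rng.choices(range(n_classes), weights=probs)[0]
        reward = 1.0 if action == stimulus else -1.0
        # Reward-prediction error: received reward minus expectation
        rpe = reward - expected_reward
        # d log pi(action) / d logit_a = 1[a == action] - probs[a]
        for a in range(n_classes):
            grad = (1.0 if a == action else 0.0) - probs[a]
            weights[a][stimulus] += lr * rpe * grad
        # Move the expectation towards the received reward
        expected_reward += baseline_rate * rpe
    return weights, expected_reward


def greedy_action(weights, stimulus):
    """Action with the largest logit for a given one-hot stimulus index."""
    logits = [weights[a][stimulus] for a in range(len(weights))]
    return max(range(len(logits)), key=logits.__getitem__)
```

In this sketch the reward-prediction error acts as a single scalar that scales every weight change, which mirrors the role of a global neuromodulator; the per-weight factor is the gradient of the log-probability of the chosen action.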
The proposed model is tested on a reduced version of the popular MNIST dataset [LeCun et al., 1998]. Furthermore, we verify the robustness of the model against both fixed and random temporal reward delays as well as against fixed-pattern noise (heterogeneous parameters) in the lateral circuit. Finally, we briefly sketch and test two alternative forms of reward-maximizing interactions in the action layer.
Our work contributes to the search for mechanistic models of biologically plausible deep reinforcement learning. This model could be the basis of more elaborate biological reinforcement learning models, for example based on actor-critic architectures, or it could inspire experiments to explore hallmarks of deep reinforcement learning in the brain.
5.1 Materials and Methods
The presented work extends and builds on the framework of supervised learning in the principle of least action model [Senn et al., in preparation, Dold, 2020]. Here, we give a summary of the model, so that the connection between the model and its extension becomes apparent. Moreover, we describe policy-gradient learning in a deep neural network, which we will use as a standard comparison for the presented learning rules.