A bio-inspired bistable recurrent cell allows for long-lasting memory
Nicolas Vecoven
University of Liège
Damien Ernst
University of Liège
Guillaume Drion
University of Liège
Abstract
Recurrent neural networks (RNNs) provide state-of-the-art performance in a wide variety of tasks that require memory. These performances can often be achieved thanks to gated recurrent cells such as gated recurrent units (GRU) and long short-term memory (LSTM). Standard gated cells share a layer internal state to store information at the network level, and long-term memory is shaped by network-wide recurrent connection weights. Biological neurons, on the other hand, are capable of holding information at the cellular level for an arbitrarily long amount of time through a process called bistability. Through bistability, cells can stabilize to different stable states depending on their own past state and inputs, which permits the durable storing of past information in the neuron state. In this work, we take inspiration from biological neuron bistability to embed RNNs with long-lasting memory at the cellular level. This leads to the introduction of a new bistable biologically-inspired recurrent cell that is shown to strongly improve RNN performance on time-series which require very long memory, despite using only cellular connections (all recurrent connections are from neurons to themselves, i.e. a neuron state is not influenced by the state of other neurons). Furthermore, equipping this cell with recurrent neuromodulation makes it possible to link it to standard GRU cells, taking a step towards the biological plausibility of GRU.
Recurrent neural networks (RNNs) have been widely used in the past years, providing excellent performance on many problems requiring memory, such as sequence-to-sequence modeling, speech recognition, and neural translation. These achievements are often the result of the development of the long short-term memory (LSTM [1]) and gated recurrent unit (GRU [2]) recurrent cells, which allow RNNs to capture time-dependencies over long horizons. Despite all the work analyzing the performances of such cells [3], recurrent cells remain predominantly black-box models. There has been some advance in understanding the dynamical properties of RNNs as a whole from a non-linear control perspective ([4]), but little has been done in understanding the underlying system of recurrent cells themselves. Rather, they have been built for their robust mathematical properties when computing gradients with back-propagation through time (BPTT). Research on new recurrent cells is still ongoing and, building up on LSTM and GRU, recent works have proposed other types of gated units ([5], [6], [7]). In addition, an empirical search over hundreds of different gated architectures has been carried out in [8].

In parallel, there has been an increased interest in assessing the biological plausibility of neural networks. There has not only been a lot of interest in spiking neural networks ([9, 10, 11]), but also in reconciling more traditional deep learning models with biological plausibility ([12, 13, 14]). RNNs are a promising avenue for the latter ([15]) as they are known to provide great performances
from a deep learning point of view while theoretically allowing a discrete dynamical simulation of biological neurons.

RNNs combine simple cellular dynamics and a rich, highly recurrent network architecture. The recurrent network architecture enables the encoding of complex memory patterns in the connection weights. These memory patterns rely on global feedback interconnections of large neuronal populations. Such global feedback interconnections are difficult to tune, and can be a source of vanishing or exploding gradients during training, which is a major drawback of RNNs. In biological networks, a significant part of advanced computing is handled at the cellular level, mitigating the burden at the network level. Each neuron type can switch between several complex firing patterns, which include spiking, bursting, and bistability. In particular, bistability is the ability of a neuron to switch between two stable outputs depending on input history. It is a form of cellular memory ([16]).

In this work, we propose a new biologically motivated bistable recurrent cell (BRC), which endows classical RNNs with local cellular memory rather than global network memory. More precisely, BRCs are built such that their hidden recurrent state does not directly influence other neurons (i.e. they are not recurrently connected to other cells). To make cellular bistability compatible with the RNN feedback architecture, a BRC is constructed by taking a feedback control perspective on biological neuron excitability ([17]). This approach enables the design of biologically-inspired cellular dynamics by exploiting the RNN structure rather than through the addition of complex mathematical functions.

To test the capacities of cellular memory, the bistable cells are first connected in a feedforward manner, getting rid of the network memory coming from global recurrent connections. Despite having only cellular temporal connections, we show that BRCs provide good performances on standard benchmarks and surpass more standard cells such as LSTMs and GRUs on benchmarks whose datasets are composed of extremely sparse time-series. Secondly, we show that the proposed bio-inspired recurrent cell can be made more comparable to a standard GRU by using a special kind of recurrent neuromodulation. We call this neuromodulated bistable recurrent cell the nBRC, standing for neuromodulated BRC. The comparison between nBRCs and GRUs provides food for thought and is a step towards reconciling traditional gated recurrent units and biological plausibility.
RNNs have been widely used to tackle many problems having a temporal structure. In such problems, the relevant information can only be captured by processing observations obtained over multiple time-steps. More formally, a time-series can be defined as X = [x_1, . . . , x_T] with T ∈ N and x_i ∈ R^n. To capture time-dependencies, RNNs maintain a recurrent hidden state whose update depends on the previous hidden state and the current observation of the time-series, making them dynamical systems and allowing them to handle arbitrarily long sequences of inputs. Mathematically, RNNs maintain a hidden state h_t = f(h_{t−1}, x_t; θ), where h_0 is a constant and θ are the parameters of the network. In its most standard form, an RNN updates its state as follows:

h_t = g(U x_t + W h_{t−1})   (1)

where g is a standard activation function such as a sigmoid or a hyperbolic tangent. However, RNNs using Equation 1 as the update rule are known to be difficult to train on long sequences due to vanishing (or, more rarely, exploding) gradient problems. To alleviate this problem, more complex recurrent update rules have been proposed, such as LSTMs ([1]) and GRUs ([2]). These updates allow recurrent networks to be trained on much longer sequences by using gating principles. By way of illustration, the updates related to a gated recurrent unit are

z_t = σ(U_z x_t + W_z h_{t−1})
r_t = σ(U_r x_t + W_r h_{t−1})
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ tanh(U_h x_t + r_t ⊙ W_h h_{t−1})   (2)

where z is the update gate (used to tune the update speed of the hidden state with respect to new inputs) and r is the reset gate (used to reset parts of the memory).
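To make the gated update concrete, here is a minimal NumPy sketch of one GRU step, written to follow Equation 2 exactly as stated above (biases omitted); the variable names are ours and this is not the TensorFlow implementation used later in the experiments.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, Uz, Wz, Ur, Wr, Uh, Wh):
    """One GRU update following Equation 2 (biases omitted).

    x      : (batch, n_in)   input x_t
    h_prev : (batch, n_hid)  previous hidden state h_{t-1}
    U*     : (n_in, n_hid)   input weights;  W* : (n_hid, n_hid) recurrent weights
    """
    z = sigmoid(x @ Uz + h_prev @ Wz)             # update gate z_t
    r = sigmoid(x @ Ur + h_prev @ Wr)             # reset gate r_t
    h_cand = np.tanh(x @ Uh + r * (h_prev @ Wh))  # candidate state, reset applied as r_t (Hadamard) W_h h_{t-1}
    return z * h_prev + (1.0 - z) * h_cand        # new hidden state h_t
```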
Neuronal bistability: a feedback viewpoint

Biological neurons are intrinsically dynamical systems that can exhibit a wide variety of firing patterns. In this work, we focus on the control of bistability, which corresponds to the coexistence of two stable states at the neuronal level. Bistable neurons can switch between their two stable states in response to transient inputs ([16, 18]), endowing them with a kind of never-fading cellular memory ([16]).

Complex neuron firing patterns are often modeled by systems of ordinary differential equations (ODEs). Translating ODEs into an artificial neural network algorithm often leads to mixed results due to increased complexity and the difference in modeling language. Another approach to model neuronal dynamics is to use a control systems viewpoint ([17]). In this viewpoint, a neuron is modeled as a set of simple building blocks connected using a multiscale feedback, or recurrent, interconnection pattern.

A neuronal feedback diagram focusing on one time-scale, which is sufficient for bistability, is illustrated in Figure 1A. The block 1/(Cs) accounts for membrane integration, C being the membrane capacitance and s the complex frequency. The outputs of presynaptic neurons V_pre are combined at the input level to create a synaptic current I_syn. Neuron-intrinsic dynamics are modeled by the negative feedback interconnection of a nonlinear function I_int = f(V_post), called the IV curve in neurophysiology, which outputs an intrinsic current I_int that adds to I_syn to create the membrane current I_m. The slope of f(V_post) determines the feedback gain, a positive slope leading to negative feedback and a negative slope to positive feedback. I_m is then integrated by the postsynaptic neuron membrane to modify its output voltage V_post.

Figure 1: A. One-timescale control diagram of a neuron. B. Plot of the function I_int = V_post − α tanh(V_post) for two different values of α. Full dots correspond to stable states, empty dots to unstable states.

The switch between monostability and bistability is achieved by shaping the nonlinear function I_int = f(V_post) (Figure 1B). The neuron is monostable when f(V_post) is monotonic with positive slope (Figure 1B, left). Its only stable state corresponds to the voltage at which I_int = 0 in the absence of synaptic inputs (full dot). The neuron switches to bistability through the creation of a local region of negative slope in f(V_post) (Figure 1B, right). Its two stable states correspond to the voltages at which I_int = 0 with positive slope (full dots), separated by an unstable state where I_int = 0 with negative slope (empty dot). The local region of negative slope corresponds to a local positive feedback where the membrane voltage is unstable.

In biological neurons, a local positive feedback is provided by regenerative gating, such as sodium and calcium channel activation or potassium channel inactivation ([18, 19]). The switch from monostability to bistability can therefore be controlled by tuning ion channel density. This property can be emulated in electrical circuits by combining transconductance amplifiers to create the function

I_int = V_post − α tanh(V_post),   (3)

where the switch from monostability to bistability is controlled by a single parameter α ([20]). α models the effect of sodium or calcium channel activation, which tunes the local slope of the function, hence the local gain of the feedback loop (Figure 1B).
For α ∈ ]0, 1[, which models a low sodium or calcium channel density, the function is monotonic, leading to monostability (Figure 1B, left). For α ∈ ]1, +∞[, which models a high sodium or calcium channel density, a region of negative slope is created around V_post = 0, and the neuron becomes bistable (Figure 1B, right). This bistability leads to never-fading memory: in the absence of significant input perturbation, the system remains indefinitely in one of the two stable states depending on the input history.

Neuronal bistability can therefore be modeled by a simple feedback system whose dynamics is tuned by a single feedback parameter α. This parameter switches the system between monostability and bistability by tuning the shape of the feedback function f(V_post), whereas neuron convergence dynamics is controlled by a single feedforward parameter C. In biological neurons, both these parameters can be modified dynamically by other neurons via a mechanism called neuromodulation, providing a dynamic, controllable memory at the cellular level. The key challenge is to find an appropriate mathematical representation of this mechanism that can be efficiently used in artificial neural networks, and, more particularly, in RNNs.
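As a small numerical illustration of Equation 3 (our own addition, not from the paper), the sketch below locates the stable states of I_int = V_post − α tanh(V_post) on a voltage grid for the two values of α plotted in Figure 1B; the grid bounds and resolution are arbitrary choices.

```python
import numpy as np

def stable_states(alpha, v_max=3.0, n=60000):
    """Find the zeros of I_int = V - alpha * tanh(V) (Equation 3) on a grid and keep those
    with positive slope, i.e. the stable states; an illustrative sketch only."""
    v = np.linspace(-v_max, v_max, n)           # even point count so V = 0 is not a grid point
    i_int = v - alpha * np.tanh(v)
    crossings = np.nonzero(np.sign(i_int[:-1]) != np.sign(i_int[1:]))[0]
    roots = v[crossings]                        # grid point just before each sign change
    slope = 1.0 - alpha / np.cosh(roots) ** 2   # dI_int/dV at the approximate roots
    return roots[slope > 0.0]

print(stable_states(0.5))   # a single stable state near V = 0 (monostable)
print(stable_states(1.5))   # two stable states near +-1.3, while V = 0 is unstable (bistable)
```

For α = 0.5 this returns one stable state at the origin, whereas for α = 1.5 it returns two stable states near ±1.3 and the origin becomes unstable, matching Figure 1B.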
The bistable recurrent cell (BRC)

To model controllable bistability in RNNs, we start by drawing two main comparisons between the feedback structure of Figure 1A and the GRU equations (Equation 2). First, we note that the reset gate r has a role that is similar to the one played by the feedback gain α in Equation 3. In the GRU equations, r is the output of a sigmoid function, which implies r ∈ ]0, 1[. These possible values for r correspond to negative feedback only, which does not allow for bistability. The update gate z, on the other hand, has a role similar to that of the membrane capacitance C. Second, one can see through the matrix multiplications W_z h_{t−1}, W_r h_{t−1} and W_h h_{t−1} that each cell uses the internal state of other neurons to compute its own state without going through synaptic connections. In biological neurons, the intrinsic dynamics defined by I_int is constrained to only depend on the neuron's own state V_post, and the influence of other neurons comes only through the synaptic compartment (I_syn), or through neuromodulation.

To enforce this cellular feedback constraint in the GRU equations and to endow them with bistability, we propose to update h_t as follows:

h_t = c_t ⊙ h_{t−1} + (1 − c_t) ⊙ tanh(U x_t + a_t ⊙ h_{t−1})   (4)

where a_t = 1 + tanh(U_a x_t + w_a ⊙ h_{t−1}) and c_t = σ(U_c x_t + w_c ⊙ h_{t−1}). a_t corresponds to the feedback parameter α, with a_t ∈ ]0, 2[. c_t corresponds to the update gate in GRU and plays the role of the membrane capacitance C, determining the convergence dynamics of the neuron. We call this updated cell the bistable recurrent cell (BRC).

The main differences between a BRC and a GRU are twofold. First, each neuron has its own internal state h_t that is not directly affected by the internal state of the other neurons. Indeed, since the four instances of h_{t−1} only appear in Hadamard products, the only temporal connections in layers of BRCs are from neurons to themselves. This enforces the memory to be only cellular. Second, the feedback parameter a_t is allowed to take values in the range ]0, 2[ rather than ]0, 1[. This allows the cell to switch between monostability (a_t ≤ 1) and bistability (a_t > 1) (Figure 2A,B). The proof of this switch is provided in Appendix A.

It is important to note that the parameters a_t and c_t are dynamic: they are neuromodulated by the previous layer, that is, their value depends on the output of other neurons. Tests were carried out with a and c as parameters learned by stochastic gradient descent, which resulted in a lack of representational power, leading to the need for neuromodulation. This neuromodulation scheme was the most evident choice as it maintains the cellular memory constraint and leads to the update rule that is most similar to standard recurrent cells (Equation 2). However, as will be discussed later, other neuromodulation schemes can be thought of.

Likewise, from a neuroscience perspective, a_t could well be greater than 2. Limiting the range of a_t to ]0, 2[ was done for numerical stability and for symmetry between the ranges of bistable and monostable neurons. We argue that this is not an issue as, for values of a_t sufficiently above 1, the dynamics of the neurons become very similar (as suggested in Figure 2A).

Figure 2C shows the dynamics of a BRC with respect to a_t and c_t. For a_t < 1, the cell exhibits a classical monostable behavior, relaxing to the stable state in the absence of inputs (blue curves in Figure 2C).
On the other hand, a bistable behavior can be observed for a_t > 1: the cells can either stabilize on an upper stable state or a lower stable state depending on past inputs (red curves in Figure 2C). Since these upper and lower stable states do not correspond to an h_t equal to 0, they can be associated with cellular memory that never fades over time. Furthermore, Figure 2 also illustrates that neuron convergence dynamics depend on the value of c_t.
Figure 2: A. Bifurcation diagram of Equation 4 for U x_t = 0. B. Plots of the function h_t − F(h_t) for two values of a_t, where F(h_t) = c_t h_t + (1 − c_t) tanh(a_t h_t) is the right-hand side of Equation 4 with x_t = 0. Full dots correspond to stable states, empty dots to unstable states. C. Response of a BRC to an input time-series for different values of a_t and c_t.
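The BRC update is compact enough to be written out directly. Below is a minimal NumPy sketch of one BRC layer step (Equation 4); shapes and names are ours, and the point is only to show that every recurrent term is element-wise, so no neuron reads the state of another neuron.

```python
import numpy as np

def brc_step(x, h_prev, U, Ua, Uc, wa, wc):
    """One BRC layer update (Equation 4); all recurrence is a Hadamard product.

    x      : (batch, n_in)   input x_t
    h_prev : (batch, n_hid)  previous hidden state h_{t-1}
    U, Ua, Uc : (n_in, n_hid) feedforward weights
    wa, wc    : (n_hid,)      per-neuron recurrent weights
    """
    a = 1.0 + np.tanh(x @ Ua + wa * h_prev)            # feedback gain a_t in ]0, 2[; a_t > 1 -> bistable
    c = 1.0 / (1.0 + np.exp(-(x @ Uc + wc * h_prev)))  # update gate c_t in ]0, 1[, the "membrane capacitance"
    return c * h_prev + (1.0 - c) * np.tanh(x @ U + a * h_prev)
```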
The recurrently neuromodulated bistable recurrent cell (nBRC)

To further improve the performance of BRCs, one can relax the cellular memory constraint. By creating a dependency of a_t and c_t on the outputs of other neurons of the layer, one can build a kind of recurrent layer-wise neuromodulation. We refer to this modified version of a BRC as an nBRC, standing for recurrently neuromodulated BRC. The update rule for the nBRC is the same as for the BRC and follows Equation 4. The difference comes in the computation of a_t and c_t, which are neuromodulated as follows:

a_t = 1 + tanh(U_a x_t + W_a h_{t−1})
c_t = σ(U_c x_t + W_c h_{t−1})   (5)

The update rule of nBRCs being that of BRCs (Equation 4), bistable properties are maintained, and hence the possibility of a cellular memory that does not fade over time. However, the new recurrent neuromodulation scheme adds a type of network memory on top of the cellular memory. This recurrent neuromodulation scheme brings the update rule even closer to that of a standard GRU, which is highlighted when comparing Equation 2 with Equation 4 under the neuromodulation of Equation 5. We stress that, as opposed to GRUs, bistability is still ensured through a_t belonging to ]0, 2[. A relaxed cellular memory constraint is also ensured, as each neuron's past state h_{t−1} only directly influences its own current state and not the states of the other neurons of the layer (Hadamard product in the h_t update of Equation 4). This is important for numerical stability, as the introduction of a cellular positive feedback for bistability leads to global instability if the update is computed using other neuron states directly (as is done in the classical GRU update, see the matrix multiplication W_h h_{t−1} in Equation 2).

Finally, let us note that, to be consistent with the biological model presented above, Equation 5 should be interpreted as a way to represent a neuromodulation mechanism of a neuron by those from its own layer and the layer that precedes. Hence the possible analogy between the gates z and r in GRUs and neuromodulation. In this respect, studying the introduction of new types of gates based on more biologically plausible neuromodulation architectures would certainly be interesting.

To demonstrate the performance of BRCs and nBRCs with respect to standard GRUs and LSTMs, we tackle three problems. The first is a one-dimensional toy problem, the second is a two-dimensional denoising problem and the third is the sequential MNIST problem. The supervised setting is the same for all three benchmarks: the network is presented with a time-series and is asked to output a prediction (a regression for the first two benchmarks and a classification for the third) after having received the last element of the time-series, x_T. We show that the introduction of bistability in recurrent cells is especially useful for datasets in which only sparse time-series are available. In this section, we also take a look at the dynamics inside BRC neurons in the context of the denoising benchmark and show that bistability is heavily used by the neural network.
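Relative to the BRC sketch above, the recurrent neuromodulation of Equation 5 changes only how a_t and c_t are computed: full recurrent matrices W_a and W_c replace the per-neuron weights, while the state update itself stays element-wise. A hedged sketch:

```python
import numpy as np

def nbrc_step(x, h_prev, U, Ua, Uc, Wa, Wc):
    """One nBRC update: the state equation of the BRC (Equation 4) with a_t and c_t
    recurrently neuromodulated by the whole layer (Equation 5). Sketch only."""
    a = 1.0 + np.tanh(x @ Ua + h_prev @ Wa)             # layer-wise neuromodulation of the feedback gain
    c = 1.0 / (1.0 + np.exp(-(x @ Uc + h_prev @ Wc)))   # layer-wise neuromodulation of the update gate
    return c * h_prev + (1.0 - c) * np.tanh(x @ U + a * h_prev)  # h_{t-1} still only gates its own neuron
```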
Table 1: Mean square error on the test set of the different architectures on the copy first input benchmark, after training. Results are shown for different values of T.

Figure 3: Evolution of the average mean square error (± standard deviation) over three runs on the copy input benchmark, for GRU and BRC and for different values of T.

For the first two problems, training sets and test sets are generated; for the MNIST benchmark, the standard train and test sets are used. All averages and standard deviations reported were computed over three different seeds. We found that there was very little variation between runs, and thus believe that three runs are enough to capture the performance of the different architectures. For benchmark 1, networks are composed of two recurrent layers, whereas for benchmarks 2 and 3, networks are composed of four recurrent layers. Different recurrent cells are always tested on similar networks (i.e. the same number of layers and neurons). We use the TensorFlow ([21]) implementation of GRUs and LSTMs. Finally, the Adam optimizer is used for training all networks. The source code for carrying out the experiments is available at https://github.com/nvecoven/BRC.
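As an illustration of how such a cell plugs into this kind of training setup, a BRC-like cell can be wrapped in a standard Keras RNN layer. The sketch below is our own (layer sizes are placeholders, since the exact network sizes were not recoverable here) and is not the code from the repository linked above.

```python
import tensorflow as tf

class BRCCell(tf.keras.layers.Layer):
    """Keras-compatible BRC cell implementing Equation 4; an illustrative sketch."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.state_size = units
        self.output_size = units

    def build(self, input_shape):
        dim = input_shape[-1]
        self.U = self.add_weight(shape=(dim, self.units), name="U")
        self.Ua = self.add_weight(shape=(dim, self.units), name="Ua")
        self.Uc = self.add_weight(shape=(dim, self.units), name="Uc")
        self.wa = self.add_weight(shape=(self.units,), name="wa")  # per-neuron recurrent weights
        self.wc = self.add_weight(shape=(self.units,), name="wc")

    def call(self, inputs, states):
        h = states[0]
        a = 1.0 + tf.tanh(tf.matmul(inputs, self.Ua) + self.wa * h)
        c = tf.sigmoid(tf.matmul(inputs, self.Uc) + self.wc * h)
        new_h = c * h + (1.0 - c) * tf.tanh(tf.matmul(inputs, self.U) + a * h)
        return new_h, [new_h]

# Two stacked BRC layers followed by a linear read-out; the sizes are placeholders.
inputs = tf.keras.Input(shape=(None, 1))
h = tf.keras.layers.RNN(BRCCell(100), return_sequences=True)(inputs)
h = tf.keras.layers.RNN(BRCCell(100))(h)
outputs = tf.keras.layers.Dense(1)(h)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```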
Copy first input benchmark

In this benchmark, the network is presented with a one-dimensional time-series of T time-steps, where x_t ∼ N(0, 1) for all t. After receiving x_T, the network output should approximate x_1, a task that is well suited for measuring the capacity to learn long temporal dependencies when T is large. Note that this benchmark also requires the ability to filter out irrelevant signals: after the first time-step, the networks are continuously presented with noisy inputs that they must learn to ignore. The mean square error on the test set is shown for different values of T in Table 1. For smaller values of T, all recurrent cells achieve similar performances. The advantage of using bistable recurrent cells appears when T becomes large (Figure 3). Indeed, for the largest values of T, only networks made of bistable cells are capable of beating the random-guessing threshold, which is equal to 1 in this setting: since x_1 is sampled from N(0, 1), always predicting 0 yields the lowest expected error, equal on average to the standard deviation of the distribution.
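A possible generator for one sample of this benchmark, written with our own conventions, is sketched below.

```python
import numpy as np

def copy_first_input_sample(T, rng=np.random.default_rng()):
    """One sample of the copy first input benchmark: T steps of N(0, 1) noise;
    the target is the first value, which must be remembered until the end."""
    x = rng.standard_normal((T, 1)).astype("float32")
    return x, x[0, 0]
```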
Denoising benchmark

The copy first input benchmark is interesting as a means to highlight the memorisation capacity of a recurrent neural network, but it does not test its ability to exploit complex relationships between different elements of the input signal to predict the output. In the denoising benchmark, the network is presented with a two-dimensional time-series of T time-steps. Five different time-steps t_1, . . . , t_5 are sampled uniformly among the first T − N time-steps, with N ∈ {0, . . . , T − 5}, and are communicated to the network through the first dimension of the time-series by setting x_t[1] = 0 if t ∈ {t_1, . . . , t_5}, x_t[1] = 1 if t = T and x_t[1] = −1 otherwise. Note that this dimension is also used to notify the network that the end of the time-series has been reached (and thus that the network should output its prediction). The second dimension is a data stream, generated as for the copy first input benchmark, that is x_t[2] ∼ N(0, 1) for all t. At time-step T, the network is asked to output [x_{t_1}[2], . . . , x_{t_5}[2]]. The error on a prediction is the mean squared error over the five values, i.e. (1/5) Σ_{i=1..5} (x_{t_i}[2] − O[i])², with O the output of the neural network. Note that the parameter N controls the length of the forgetting period, as it forces the relevant inputs to lie in the first T − N time-steps, ensuring that t_i < T − N for all i ∈ {1, . . . , 5}.

Table 2: Mean square error on the test set of the different architectures on the denoising benchmark, after training. Results are shown with and without a constraint on the location of the relevant inputs: with the constraint, relevant inputs cannot appear in the last N time-steps, that is x_t[1] = −1, ∀t > (T − N). In this experiment, results were obtained with T = 400.

As one can see in Table 2 (generated with T = 400 and two different values of N), for N = 200, GRUs and LSTMs are unable to exceed random guessing (a mean square error of 1), whereas the performance of BRCs and nBRCs is virtually unaffected. Table 2 also provides a very important observation. GRUs and LSTMs are, in fact, able to learn long-term dependencies, as they achieve extremely good performances when N = 0. All the samples generated with N = 200 could also be generated with N = 0, meaning that, with the right parameters, the GRU and LSTM networks could achieve good predictive performances on such samples. However, our results show that GRUs and LSTMs are unable to learn those parameters when the dataset is composed only of such samples. That is, GRUs and LSTMs need training datasets containing some samples for which the required memory span is short in order to learn efficiently, which then allows them to also learn the samples with a longer temporal structure. Bistable cells, on the other hand, are not susceptible to this caveat.

To further highlight this behavior, we design another benchmark that is a variant of the copy first input benchmark. Here, the network is presented with a one-dimensional time-series of length T = 600 where x_t = 0 for all t except one time-step t_0, with x_{t_0} ∼ N(0, 1) and t_0 chosen uniformly at random over the sequence. The network is tasked to output x_{t_0}.
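A possible generator for one sample of this modified benchmark is sketched below; the range from which t_0 is drawn is our assumption, as the original bounds were lost.

```python
import numpy as np

def modified_copy_sample(T=600, rng=np.random.default_rng()):
    """One sample of the modified copy benchmark: all zeros except a single N(0, 1)
    value at a random position t0; the target is that value."""
    x = np.zeros((T, 1), dtype="float32")
    t0 = int(rng.integers(T))        # t0 drawn uniformly over the whole sequence (assumed)
    x[t0, 0] = rng.standard_normal()
    return x, x[t0, 0]
```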
Table 3: Mean square error on the test set of the different architectures on the modified copy input benchmark, after training.

Table 3 shows that, using this training scenario, GRUs are capable of achieving a low mean square error on the test set used for the original copy first input benchmark with T = 600, in which the relevant input always appears at the first time-step. This was not the case in Table 1 (for T = 600), where the networks were trained on a dataset in which every sample requires a 600-time-step-long dependency. The performance of BRCs and nBRCs, on the other hand, remained excellent in this scenario as well.
Sequential MNIST

In this benchmark, the network is presented with MNIST images, shown pixel by pixel as a time-series. MNIST images are made of 1024 pixels (32 by 32), showing that BRCs and nBRCs can learn dependencies over thousands of time-steps. As in both previous benchmarks, we add n_black time-steps of black pixels at the end of the time-series to create a forgetting period. Results are shown in Table 4 for two values of n_black and are consistent with what has been observed in both previous benchmarks.
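For concreteness, turning images into the pixel-by-pixel sequences described above could look as follows. Note that the standard Keras MNIST images are 28 by 28, so any resizing to 32 by 32 used by the authors is not shown here, and the value of n_black is purely illustrative.

```python
import numpy as np
import tensorflow as tf

def to_pixel_sequence(images, n_black):
    """Flatten images into pixel-by-pixel time-series and append n_black black time-steps
    (the forgetting period described above); normalisation to [0, 1] is our choice."""
    x = images.reshape(images.shape[0], -1, 1).astype("float32") / 255.0
    pad = np.zeros((x.shape[0], n_black, 1), dtype="float32")
    return np.concatenate([x, pad], axis=1)

(train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()
train_seq = to_pixel_sequence(train_x, n_black=300)   # n_black = 300 is illustrative only
```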
Table 4: Accuracy on the MNIST test set of the different architectures, after training. Images are fed to the recurrent network pixel by pixel. Results are shown for MNIST images with n_black black pixels appended to the image.

Until now, we have looked at the learning performance of bistable recurrent cells. It is however interesting to take a deeper look at the dynamics of such cells to understand whether or not bistability is actually used by the network. To this end, we pick a random time-series from the denoising benchmark and analyse some properties of a_t and c_t. Figure 4 shows the proportion of bistable cells per layer and the average value of c_t per layer.

Figure 4: Representation of the BRC parameters, per layer, of a recurrent neural network when shown a time-series of the denoising benchmark (T = 400, N = 0): the proportion of bistable neurons (a_t > 1) per layer and the average value of c_t per layer. Layer numbering increases with depth (i.e. layer i corresponds to the ith layer of the network). The time-steps at which a relevant input is shown to the model are clearly distinguishable from the behaviour of these measures alone.

The dynamics of these parameters show that they are well used by the network, and three main observations can be made. First, as relevant inputs are shown to the network, the proportion of bistable neurons tends to increase in layers 2 and 3, effectively storing information and thus confirming the interest of introducing bistability for long-term memory. As more information needs to be stored, the network leverages bistability by increasing the number of bistable neurons. Second, as relevant inputs are shown to the network, the average value of c_t tends to increase in layer 3, effectively making the network less and less sensitive to new inputs. Third, one can observe a transient regime when a relevant input is shown: there is a sharp decrease in the average value of c_t, making the network extremely sensitive to the current input, which allows for its efficient memorization.

In this paper, we introduced two important concepts from biological brains into recurrent neural networks: cellular memory and bistability. This led to the development of a new cell, called the bistable recurrent cell (BRC), which proved to be very efficient on several datasets requiring long-term memory and on which the performances of classical recurrent cells such as GRUs and LSTMs were limited.

Furthermore, by relaxing the cellular memory constraint and using a special rule for recurrent neuromodulation, we were able to create a neuromodulated bistable recurrent cell (nBRC) which is very similar to a standard GRU. This is of great interest and provides insights on how gates in GRUs and LSTMs, among others, could in fact be linked to neuromodulation in biological brains. As future work, it would be of interest to study more complex and biologically plausible neuromodulation schemes and see what types of new gated architectures could emerge from them.
Acknowledgements
Nicolas Vecoven gratefully acknowledges the financial support of the Belgian FRIA.

References

[1] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735–1780.
[2] Cho K, Van Merriënboer B, Bahdanau D, Bengio Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259. 2014.
[3] Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling; 2014.
[4] Sussillo D, Barak O. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation. 2013;25(3):626–649.
[5] Zhou GB, Wu J, Zhang CL, Zhou ZH. Minimal gated unit for recurrent neural networks. International Journal of Automation and Computing. 2016;13(3):226–234.
[6] Dey R, Salem FM. Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE; 2017. p. 1597–1600.
[7] Jing L, Gulcehre C, Peurifoy J, Shen Y, Tegmark M, Soljacic M, et al. Gated orthogonal recurrent units: On learning to forget. Neural Computation. 2019;31(4):765–783.
[8] Jozefowicz R, Zaremba W, Sutskever I. An empirical exploration of recurrent network architectures. In: International Conference on Machine Learning; 2015. p. 2342–2350.
[9] Tavanaei A, Ghodrati M, Kheradpisheh SR, Masquelier T, Maida A. Deep learning in spiking neural networks. Neural Networks. 2019;111:47–63.
[10] Pfeiffer M, Pfeil T. Deep learning with spiking neurons: opportunities and challenges. Frontiers in Neuroscience. 2018;12:774.
[11] Bellec G, Salaj D, Subramoney A, Legenstein R, Maass W. Long short-term memory and learning-to-learn in networks of spiking neurons. In: Advances in Neural Information Processing Systems; 2018. p. 787–797.
[12] Bengio Y, Lee DH, Bornschein J, Mesnard T, Lin Z. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156. 2015.
[13] Miconi T. Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks. eLife. 2017;6:e20899.
[14] Bellec G, Scherr F, Hajek E, Salaj D, Legenstein R, Maass W. Biologically inspired alternatives to backpropagation through time for learning in recurrent neural nets. arXiv preprint arXiv:1901.09049. 2019.
[15] Barak O. Recurrent neural networks as versatile tools of neuroscience research. Current Opinion in Neurobiology. 2017;46:1–6.
[16] Marder E, Abbott L, Turrigiano GG, Liu Z, Golowasch J. Memory from the dynamics of intrinsic membrane currents. Proceedings of the National Academy of Sciences. 1996;93(24):13481–13486.
[17] Drion G, O'Leary T, Dethier J, Franci A, Sepulchre R. Neuronal behaviors: A control perspective. In: 2015 54th IEEE Conference on Decision and Control (CDC). IEEE; 2015. p. 1923–1944.
[18] Drion G, O'Leary T, Marder E. Ion channel degeneracy enables robust and tunable neuronal firing rates. Proceedings of the National Academy of Sciences. 2015;112(38):E5361–E5370.
[19] Franci A, Drion G, Seutin V, Sepulchre R. A balance equation determines a switch in neuronal excitability. PLoS Computational Biology. 2013;9(5).
[20] Ribar L, Sepulchre R. Neuromodulation of neuromorphic circuits. IEEE Transactions on Circuits and Systems I: Regular Papers. 2019;66(8):3028–3040.
[21] Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: Large-scale machine learning on heterogeneous systems; 2015. Software available from http://tensorflow.org/.
[22] Golubitsky M, Schaeffer DG. Singularities and groups in bifurcation theory. Vol. 1. Springer Science & Business Media; 2012.
A Proof of bistability for BRC and nBRC for a_t > 1

Theorem A.1. The system defined by the equation

h_t = c h_{t−1} + (1 − c) tanh(U x_t + a h_{t−1}) = F(h_{t−1})   (6)

with c ∈ [0, 1[ is monostable for a ∈ [0, 1] and bistable for a > 1 in some finite range of U x_t centered around x_t = 0.

Proof. We can show that the system undergoes a supercritical pitchfork bifurcation at the equilibrium point (x*, h*) = (0, 0) for a = a_pf = 1 by verifying the conditions

G(h*)|_{a_pf} = dG(h_t)/dh_t |_{h*, a_pf} = d²G(h_t)/dh_t² |_{h*, a_pf} = dG(h_t)/da |_{h*, a_pf} = 0,   (7)

d³G(h_t)/dh_t³ |_{h*, a_pf} > 0,   d²G(h_t)/(dh_t da) |_{h*, a_pf} < 0,   (8)

where G(h_t) = h_t − F(h_t) ([22]). This gives

G(h*)|_{a_pf} = (1 − c)(h* − tanh(a_pf h*)) = 0,   (9)

dG(h_t)/dh_t |_{h*, a_pf} = (1 − c)(a_pf (tanh²(a_pf h*) − 1) + 1) = (1 − c)(1 − a_pf) = 0,   (10)

d²G(h_t)/dh_t² |_{h*, a_pf} = (1 − c) 2 a_pf² tanh(a_pf h*)(1 − tanh²(a_pf h*)) = 0,   (11)

dG(h_t)/da |_{h*, a_pf} = (1 − c) h* (tanh²(a_pf h*) − 1) = 0,   (12)

d³G(h_t)/dh_t³ |_{h*, a_pf} = (1 − c)(2 a_pf³ (tanh²(a_pf h*) − 1)² + 4 a_pf³ tanh²(a_pf h*)(tanh²(a_pf h*) − 1)) = 2(1 − c) > 0,   (13)

d²G(h_t)/(dh_t da) |_{h*, a_pf} = (1 − c)((tanh²(a_pf h*) − 1) + 2 a_pf h* tanh(a_pf h*)(1 − tanh²(a_pf h*))) = c − 1 < 0.   (14)

The stability of (x*, h*) for a ≠ 1 can be assessed by studying the linearized system

h_t = dF(h_t)/dh_t |_{h*} h_{t−1}.   (15)

The equilibrium point is stable if dF(h_t)/dh_t ∈ [0, 1[, singular if dF(h_t)/dh_t = 1, and unstable if dF(h_t)/dh_t ∈ ]1, +∞[. We have

dF(h_t)/dh_t |_{h*} = c + (1 − c) a (1 − tanh²(a h*))   (16)
                    = c + (1 − c) a,   (17)

which shows that (x*, h*) is stable for a ∈ [0, 1[ and unstable for a > 1.

It follows that for a < 1, the system has a unique stable equilibrium point at (x*, h*), whose uniqueness is verified by the monotonicity of G(h_t) (dG(h_t)/dh_t > 0, ∀h_t).

For a > 1, the point (x*, h*) is unstable, and there exist two stable points (x*, ±h_s) whose basins of attraction are defined by h_t ∈ ]−∞, h*[ for −h_s and h_t ∈ ]h*, +∞[ for h_s.
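The pitchfork conditions (7) and (8) can also be checked symbolically. The following SymPy sketch (ours, not part of the paper) evaluates the derivatives of G at (h*, a_pf) = (0, 1) and recovers the values derived above.

```python
import sympy as sp

h, a, c = sp.symbols("h a c", real=True)
F = c * h + (1 - c) * sp.tanh(a * h)      # right-hand side of Equation 6 with U x_t = 0
G = h - F
point = {h: 0, a: 1}                      # candidate pitchfork point (h*, a_pf) = (0, 1)

# Conditions (7): G, dG/dh, d2G/dh2 and dG/da all vanish at the pitchfork point.
print([sp.simplify(expr.subs(point)) for expr in (G, G.diff(h), G.diff(h, 2), G.diff(a))])
# Conditions (8): d3G/dh3 evaluates to 2 - 2c > 0 and d2G/(dh da) to c - 1 < 0 for c in [0, 1[.
print(sp.simplify(G.diff(h, 3).subs(point)), sp.simplify(G.diff(h, a).subs(point)))
```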