Dimensions of Timescales in Neuromorphic Computing Systems
Herbert Jaeger, Dirk Doorakkers, Celestine Lawrence, Giacomo Indiveri
Herbert Jaeger (University of Groningen, [email protected]), Celestine Lawrence (University of Groningen, [email protected]), Dirk Doorakkers (University of Groningen, [email protected]), Giacomo Indiveri (ETH Zurich and University of Zurich, [email protected])
February 10, 2021
Abstract
This article is a public deliverable of the EU project Memory technologies with multi-scale time constants for neuromorphic architectures (MeMScales, memscales.eu/, Call ICT-06-2019 Unconventional Nanoelectronics, project number 871371). This arXiv version is a verbatim copy of the deliverable report, with administrative information stripped. It collects a wide and varied assortment of phenomena, models, research themes and algorithmic techniques that are connected with timescale phenomena in the fields of computational neuroscience, mathematics, machine learning and computer science, with a bias toward aspects that are relevant for neuromorphic engineering. It turns out that this theme is very rich indeed and spreads out in many directions which defy a unified treatment. We collected several dozens of sub-themes, each of which has been investigated in specialized settings (in the neurosciences, mathematics, computer science and machine learning) and has been documented in its own body of literature. The more we dived into this diversity, the more it became clear that our first effort to compose a survey must remain sketchy and partial. We conclude with a list of insights distilled from this survey which give general guidelines for the design of future neuromorphic systems.

1 Introduction and overview
The original topic for this deliverable, agreed more than a year ago and specified in the project proposal, was Report on literature survey and analysis of STDP and RC [= spike-time dependent plasticity and reservoir computing, respectively] guidance for the design of indeterminate hardware, with the understanding that the guiding focus of this survey would lie on timescale aspects. When we started to carry out this survey we perceived that, first, the restriction to STDP and RC would exclude many technically and algorithmically important phenomena in general analog (spiking) microchips — so we extended our perspective to neuromorphic computing in general. Second, we realized that the phenomenology of "timescales" is very rich, and this word is applied to quite different phenomena in different contexts. Therefore, an important contribution of this deliverable report is to stake out the conceptual dimensions of this scintillating word. Hence our new title,
Dimensions of “timescales” in neuromorphic computing systems.
The report is structured as follows. In Section 2 we unfold the conceptual dimensions of the timescales concept, by pointing out different uses and subconcepts of this notion in different formal-theoretical, computational and physical contexts. In the following three sections we compile the findings of a literature survey, sorted into the fields of neuroscience (Section 3), mathematics and theoretical physics (Section 4), and computer science / machine learning (Section 5). Section 6 distils a number of take-home messages from the findings in this deliverable which we hope are helpful for informing future research in MeMScales and beyond. A final Section 7 gives concrete guidelines, related to physical timescales, for the design of indeterminate hardware, which result from the specific givens in recent developments in STDP and RC research.
2 Dimensions of "timescales"

In this section we give a "travel guide" for the landscape of timescale phenomena, and point out terminologies that are used. We found it not possible (at least, not at present) to develop a unified, comprehensive conceptual framework. Therefore, we present our findings in the form of a collection of objects and places of interest, as in a tourist guide where showplaces are paid passing visits.
A “Speed” and “memory”.
There are at least two different, but likewise fundamental, understandings of "timescales".

The first one is to speak of fast or slow timescales when a dynamical system evolves faster or slower, as one could for instance mathematically determine by changing time constants in ordinary differential equations (ODEs). When the system is "fast", its rates of change in numerical dynamical variables are high — timeseries will exhibit many high absolute first derivatives, have strong components in the high-frequency end of their Fourier spectrum, etc. We note, however, that there is no unique mathematical criterion to measure "speed". For instance, if an ODE-defined dynamical system (DS) is close to or in a fixed point attractor, even very small ODE time constants will not translate to large numerical change rates.

The second is to speak of long or short memory durations. In this view, a DS is evolving on a slow timescale when it has long "memory spans". This means that some information that is encoded in the system state at some time t can be decoded again at (much) later times. Instead of using the word "memory", which is too closely suggestive of neural and cognitive processing, we find it preferable to speak of "preservation of information across time". Exploring the preservation of information across time has been one of the main themes in the theoretical literature on reservoir computing (RC) in the last 20 years.

B State variables, control parameters.
In mathematical models of DS, it is customary to distinguish dynamical state variables from control parameters. Both appear as arguments in the defining function of iterated maps and differential equations (and other formalisms), as in the generic ODE $\dot{x} = f(x, a)$ where $x$ is the state vector and $a$ denotes the vector of control parameters. The idea is that the latter are "fixed" or "given" and are not affected by the system state update operators. However, this role distribution, dynamical vs. fixed, is not always clear-cut. Often one considers scenarios where the control parameters are subject to slow changes, for instance induced by top-down regulatory input in hierarchical neural processing architectures.
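For concreteness, the following minimal Python sketch (our own illustration, with arbitrary functions and constants) integrates a one-dimensional state equation of the form $\dot{x} = f(x, a)$ with a forward Euler loop while the control parameter $a$ drifts on a much slower, externally imposed timescale.

```python
import numpy as np

# Fast state dynamics x' = f(x, a): a leaky unit pulled toward tanh(a).
def f(x, a):
    return -x + np.tanh(a)

dt = 0.01          # integration step
steps = 20000      # number of Euler steps
tau_x = 0.1        # fast timescale of the state variable
tau_a = 50.0       # slow timescale of the drifting "control parameter"

x, a = 0.0, 0.0
for k in range(steps):
    a += dt / tau_a * (np.sin(0.05 * k * dt) - a)   # slow, externally imposed drift of a
    x += dt / tau_x * f(x, a)                        # fast state update

# The state x tracks the slowly moving value tanh(a) almost instantaneously,
# which is the usual intuition behind treating a as a "fixed" control parameter.
print(f"final a = {a:.3f}, final x = {x:.3f}, tanh(a) = {np.tanh(a):.3f}")
```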
C Absolute and relative timescales.

Some systems, formal or physical, can be "run" faster or slower such that the speed is the only thing that changes. Examples are systems defined by differential equations, whose "velocity" can be set by time constants; or digital microprocessor systems whose clock cycle duration can be varied. To describe this "velocity" one needs an absolute reference time which can be formal (as in ODE systems) or physically "real", as in physical microprocessors. Absolute timescales are important for computing systems that are interacting with their input/output environment "in realtime".

Many of the fascinating properties of complex dynamical systems arise not from their absolute "velocity" but from the fact that subsets of dynamical variables, or subsystems, evolve faster or slower than other subsystems. Temporal multiscale properties can also be attributed to the dynamics of a single variable: in a single-variable timeseries one may identify a spectrum of short- and long-range "correlations", or "memory traces", or statistical dependencies, etc.
Multiscale dynamics are a general and maybe essential characteristic of complex systems, although this concept has no single, commonly agreed definition. It is remarkable that there seems to be no good word for the "velocity" of a dynamical system, which is why we put this word in quotes. In speaking about "velocity", one says "the system is slow" or "the fast subsystem", but no-one says "the speed (or velocity) of the system is high".
D Measuring time.
For a physicist or signal processing engineer, time is given by nature and invariably denoted by $t$. Mathematicians do not care about real time and speak of "unit time steps" or "the unit time interval", an arbitrary convention to associate the unit interval on the real line as a reference to quantify time. When a mathematician talks with a signal processing engineer, the former tends to be puzzled (if not disturbed) by the fact that the latter keeps talking about "seconds", a word that one will not find in mathematical textbooks on dynamical systems.

Theoretical computer scientists ignore the aspect of temporal duration entirely. The formal models of computing automata only know of "state update steps", where the only aspect that is left over from physical time $t$ or unit time $[0, 1]$ is serial order and causation: the next state comes after the previous and the latter determines the former. A most interesting challenge with regard to measuring time arises for computational neuroscientists when they want to explain how a brain can "estimate" or "experience" time. What mechanism in neural dynamics can enable a subject to estimate the presentation duration of a stimulus? Proposed answers include the use of neural delay lines, neural reference oscillators that function as clocks, or stimulus-duration-characteristic patterns in high-dimensional neural transients. We find that a general theoretical treatment of how "clocks" or "time-meters" can be defined in dynamical systems would be a rewarding subject of study.
E Collective and derived variables.
In statistical physics, neural field models, population dynamics and many other domains where one investigates systems made from large numbers of interacting small subsystems or "particles", one often describes the global dynamics of the "population" or "ensemble" through derived collective variables. Their timescale is typically slower than the native, local timescales of the interacting subsystems; and one typically tries to capture the global dynamics with a small number of such collective variables. Slowness is here connected with dimension reduction, simplification or abstraction.

F Sub- and supersampling.
In discrete-time models of dynamical systems one can create "speedups" by subsampling and "slowdowns" by supersampling / interpolation. This however makes sense only for discrete-time models that can be understood as sampled versions of a continuous-time process. It makes no sense to supersample, for instance, the state sequence of a Turing machine.
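A small numpy sketch of both operations, under the assumption that the discrete series is a sampled version of an underlying continuous signal (signal and rates are arbitrary illustrative choices):

```python
import numpy as np

t = np.arange(0.0, 1.0, 0.001)           # "continuous" reference signal sampled at 1 kHz
x = np.sin(2 * np.pi * 5.0 * t)          # 5 Hz sine

x_sub = x[::10]                          # subsampling: keep every 10th sample ("speedup")

t_fine = np.arange(0.0, 1.0, 0.0005)     # denser time grid
x_super = np.interp(t_fine, t, x)        # supersampling by linear interpolation ("slowdown")

print(len(x), len(x_sub), len(x_super))  # 1000, 100, 2000
```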
G Slowing-down by discretization.
When a real-valued timeseries is discretized by binning, or a fine-grained discrete-valued timeseries is further simplified by coarsening, high-frequency detail (which can be regarded as fast-timescale information) gets lost in cases where there are oscillatory fluctuations within bins in the original timeseries. Thus, discretization or binning procedures may cut the spectrum of effective timescales.
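A toy numpy illustration of this effect (our own example): a small fast oscillation that stays inside a single amplitude bin vanishes completely after binning.

```python
import numpy as np

x = 0.55 + 0.02 * np.sin(2 * np.pi * 50 * np.linspace(0.0, 1.0, 1000))
# fast oscillation around 0.55, staying inside the bin [0.5, 0.6)

bins = np.linspace(0.0, 1.0, 11)            # coarse amplitude bins of width 0.1
x_binned = bins[np.digitize(x, bins) - 1]   # quantize each value to its lower bin edge

print(np.ptp(x), np.ptp(x_binned))          # 0.04 vs 0.0: the fast detail is gone
```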
H Frequency filtering.
Applying frequency filters to trajectories deletes dynamical components on the timescales corresponding to the cancelled frequencies. In the special case of low-pass (smoothing) filters, fast-timescale information is lost. In the special case of high-pass (baseline normalization) filters the opposite effect is achieved.
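For concreteness, a minimal scipy sketch of both cases (our own illustration, with arbitrary cutoff and signal):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 1000.0                                   # sampling rate in Hz
t = np.arange(0.0, 2.0, 1.0 / fs)
x = np.sin(2 * np.pi * 1.0 * t) + 0.3 * np.sin(2 * np.pi * 80.0 * t)  # slow + fast component

b_lo, a_lo = butter(4, 10.0, btype="low", fs=fs)    # keep timescales slower than ~0.1 s
b_hi, a_hi = butter(4, 10.0, btype="high", fs=fs)   # keep timescales faster than ~0.1 s

x_slow = filtfilt(b_lo, a_lo, x)   # low-pass: fast-timescale information is lost
x_fast = filtfilt(b_hi, a_hi, x)   # high-pass: the slow baseline is removed

print(np.std(x_slow), np.std(x_fast))
```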
I What is a “moment”?
We are used to thinking of a timeline as an ordered sequence of time points; let us call them "moments". Even when one leaves out the complications of relativity theory, seeing time as a succession of zero-duration moments is not always the most helpful view. Cognitive neuroscientists tell us that the subjective experience of "now" in some ways integrates over several milliseconds. When neuroscientists try to detect or define "synchrony" in neural spike patterns, they must soften the mathematical notion of point-sharp co-temporality to short intervals. Signal processing engineers would sometimes like to get rid of delays in their equations but can't. Abstractly speaking, in high-dimensional dynamical systems with nonzero-length signal travel pathways, relevant information-carrying "patterns" arise not instantaneously but need some minimal duration to realize themselves. Such observations suggest that in hierarchically structured complex dynamical systems, a hierarchy of "nowness windows" might be an appropriate concept, with short-duration "moments" defined for small, local subsystems and increasingly longer-duration "moments" as one goes up in the subsystem hierarchy.
J Homeostasis, stability, robustness.
Biological organisms, brains, non-digital microchips made from unconventional materials, and many other computing or cognizing systems must preserve their functionality in the presence of external perturbations, change of environment, aging, parameter drift and other challenges. They do so through a wide spectrum of stabilization mechanisms which exploit, for example, redundancies, attractor-like phenomena, stabilizing feedback control, adaptation and learning, or robust network topologies for system architectures. A common denominator in this diversity of mechanisms is that they aim to ensure that vital system variables stay within a (narrow) viability window, often by attempting to stabilize them close to an optimal value. This has a twofold aspect of slowness. First, change rates of variables that are being stabilized are slow (when the stabilization is successful). Second, these critical variables must be stabilized through long timespans — vital variables through the entire system lifetime.
K Nonstationarity and mode hierarchies.
Computing systems exhibit nonstationary dynamics, be it because they are input-driven or because they "learn" or because they execute a sequence of subprograms. System trajectories (timeseries) resulting from nonstationary dynamics can be qualitatively or quantitatively described through temporal hierarchies of dynamical modes. For instance, a neuronal spike train can be characterized on a very short timescale by an interspike interval, on a short timescale by burst modes, on a longer timescale by a locally averaged firing frequency, and on a very long timescale by asymptotic measures.

In information processing systems, one may find ways to characterize what the current mode "represents". For instance, the neural activity trajectories in a speech-processing brain might be described as "encoding" or "representing" linguistic phonemes, syllables, words, phrases, sentences, texts.

There exists no unique, general mathematical characterization of modes. Modes of an evolving DS might, for instance, be characterized in terms of frequency spectra, signal shapes, signal energy, attractor structures, degrees of chaoticity, or regions of the system's state space, to name but a few. Describing temporal multiscale dynamics is very much the same task as characterizing modes, and there seems to be an unlimited repertoire of options.
L Hierarchical architectures.
The human brain, most autonomous robot control systems, and many multiscale signal processing and control systems are hierarchically structured. The "bottom" layers are in direct contact with incoming signals and generate output signals, while "higher" processing layers carry out increasingly "cognitive" tasks based on increasingly abstracted and condensed representations of the information contained in the input signals.

We find it a wide-spread, even paradigmatic view that higher levels operate on slower timescales than lower levels. This view is supported by evidence from biological brains, and guides the design of artificial signal processing and control systems that have to cope with temporal multiscale data. It also agrees with the view of classical AI, where action planning architectures generate goals and subgoals, plans and subplans, procedures and subprocedures in a nested way, where higher nesting levels are taken care of by higher processing layers.

A formidable challenge arises for formal modelers and concrete system developers (in computational neuroscience, machine learning and robotics). It concerns the nature of "top-down" influences: in what sense, and by which concrete mechanisms, do higher layers influence the processing on lower layers? Should this influencing be understood and realized as attention, prediction, context setting, or modulation? Many questions, both conceptual and algorithmic/mathematical, are still open.
M Characterizing multiscale dynamics from left to right and from the side.
In symbolic dynamics and theoretical computer science, a theme related to multi-timescale dynamics is infinite-length symbol sequences. They can be characterized by automata models, where some type of automaton generates the sequence "from left to right" — that is, the sequence is seen as the trajectory of a dynamical system. But such sequences are also described and analyzed as being the fixed points of applying grammar rules. This method of characterizing the structure of an infinite sequence is a-temporal but directly yields a transparent account of its multiscale, hierarchical structure. Research to connect these two views has only recently started. It seems likely (even obvious) that multiscale properties of DS trajectories are related to memory mechanisms that are effective in the generating DS.

In theoretical modeling of timeseries data (in theoretical physics and economics in particular), stochastic dynamics with long memories are discussed in terms of the shape of the corresponding power spectrum. One speaks of fat or heavy tailed or $1/f$ power laws. Such long-memory behavior is associated with self-induced criticality or edge of chaos conditions, and is often claimed as a characteristic of complex natural processes, for instance in economics, neural dynamics, or speech.

Theoretical computer science offers a canonical repertoire of methods to specify automata with increasing memory capacities (from finite-state automata through a variety of stack automata to Turing machines), and how they relate to an equivalent grammar. It would be interesting to investigate how such memory mechanism hierarchies of symbolic automata and their grammars can be transferred into the domain of continuous-time, continuous-value DS and the power spectrum properties of their trajectories.

N Time warping.
In real-life timeseries one frequently finds local speed-ups or slow-downs, for example a speaker stretching out the pronunciation of a vowel for emphasis. A related effect occurs when different realizations of a signal originate from slower or faster generators, for example from slower or faster speakers. Biological brains can, within limits, compensate for such time warping in inputs. For artificial temporal pattern recognition systems — often recurrent neural networks (RNNs) — this poses serious challenges.
O Online and real-time processing.
Many applications of signal processing and control systems must generate output responses to incoming data streams with no or only minimal delay. This is the generic case for control systems, but also for many other applications, for instance speech-to-speech translation or medical cardiographic monitoring. One speaks of online or real-time processing. One may make a fine distinction (not always observed) between these two concepts.

In online processing, the signal processing system is "entrained" to the driving input stream. Its internal states directly "synchronize" with the input, where "synchronizing" is understood in a generalized way that includes nonlinear transformations and memory effects. The processing system can appropriately be regarded mathematically as a dynamical system. Analog signal processing devices and RNNs are prototypical examples.

In real-time processing — a natively digital-computing notion — the algorithmics of the system-internal processing is decoupled from the input. The input signal stream is sampled and buffered, processing subtasks are identified and solved by algorithms which must run fast enough to deliver results within predefined time limits. On universal computers this may require the use of an underlying real-time operating system.

P Time complexity classes.
In theoretical CS, input-output tasks that can be algorithmically solved (i.e., computable tasks) are ordered into a hierarchy of time complexity. In theoretical CS, input arguments are always formatted as finite-length symbol strings ("words"). The runtime of an algorithm is measured by the number of machine update steps (concretely, clock cycles) needed from the presentation of the input word until the output word has been generated. To define a complexity class, the runtime is related to the length $|w|$ of the input word $w$. For example, the class P of polynomially computable tasks comprises all tasks for which some algorithm (or deterministic machine) and some polynomial $p$ exist, such that the algorithm terminates within $p(|w|)$ update steps, for all input words $w$.

We remark that the concept of time complexity is tied to understanding "computing" as "running a Turing machine from presenting an input word until it terminates with an output word". This concept of time complexity cannot be naturally transferred to online processing tasks.

Q Different names for different timescales.
Biological brains exhibit dynamical processes on many timescales, and different processes affect different physical elements in brains in different ways. This leads to an entangled maze of dynamical phenomena in which it is hard not to get lost. A coarse orientation is provided by the conceptual sequence inference → adaptation → learning → development → evolution. These terms denote bundles of dynamical phenomena which manifest themselves on increasingly long timescales. None of them has a precise definition, but all of them are used in computational science, cognitive science and neural-networks-based machine learning with more or less similar semantic intuitions:

• Inference processes refer to the fast operations of sensor processing, motor control and "reasoning" which do not essentially rely on structural or parametric changes of the neural processing system, using the system "as is". In machine learning one often speaks of "inference" when a ready-trained neural network (or other ML model) is used to process task instances for which it has been trained.

• Adaptation is a particularly broad and vague concept. A common denominator of its uses seems to be that adaptation works on slower timescales than inference, and is in principle reversible. It often describes processes by which a cognitive / neural system re-calibrates or re-focusses itself when the environmental context of operation changes. In formal models, adaptation processes are often expressed through changes of control parameters in neural subsystems, induced by "top down" regulatory mechanisms or subsystem-inherent homeostatic self-stabilization mechanisms. While this seems to us the most common intuition connected to the word "adaptation", it is also used in a much more generalized way to denote any change of any sort of system (from a single synapse to a biological population in an ecological niche) that improves the system's "performance" or "viability". In those cases, adaptation is not usually reversible.

• Learning refers to processes which expand the functionality of a cognitive system on the basis of experience. Learning processes are usually considered irreversible ("forgetting" comprises processes in their own right which cannot be understood as time-reversed learning). Learning processes are commonly associated with irreversible changes in system parameters — in neural networks typically "synaptic weights". Structural changes (like deletion of neural connections or adding neurons to a network) may also result from learning, though this aspect seems less central to the "learning" concept than mere parametric change.

• Development is a notion which is much more common in the cognitive and neurosciences than in machine learning. It refers to the life-long history of an individual, autonomous cognitive system (animal, human, or generalized "agent"). The development history is often segmented into life periods like pre-natal development, stages of infancy, youth, adolescence, and old age, which are in turn associated with specific structure-changing processes in the agent's brain. We foresee that developmental change will also become an important theme in neuromorphic computing systems based on non-digital hardware which cannot be "programmed" and whose physical substrate is subject to aging.

• Evolution is the longest-timescale item in our list of process categories. It describes the adaptive change of entire populations, across generations, to fit a (possibly changing) environmental "niche".

Mathematical models of cognitive systems describe inference and adaptation processes (typically) through changes in the values of system variables (dynamical state variables and/or control parameters). The system equations do not structurally change. In contrast, models of development and evolutionary processes must account for structural changes in the system equations. Formal tools for effecting and simulating structural change in system equations exist in the form of genetic / evolutionary algorithms. However, mathematical theories that can be used to characterize and analyse structural change in qualitative terms are scarce, heuristic, and generally still under-developed. We find that certain tools in mathematical logic ("non-monotonic logic") come closest. However, these formalisms are not yet connected to dynamical systems modeling.
R Philosophy of time.
Time is a fundamental quality of human experience, and philosophical inquiries have approached this theme from many angles. This lies outside our competences and we only list some of the aspects of time that have been investigated by philosophers (gleaned from Callender (2011)).

• Time and metaphysics. What are the ontological realities ("presentalist", "possibilist", "eternalist") of the past, the present, and the future? Is time continuous or discrete?

• The direction of time.
What is the difference between the past and the future? Is the arrow of time inherent in time, an effect of causality, or of thermodynamical laws?

• Time, ethics and experience.
Themes include: the subjective "now"; memory, anticipation, decisions and free will; development of time concepts in children; benefit and harm in the past and the future.

• Time in physics.
Are (which) physical laws time reversible? How is time related to space? How is time understood in relativity theory and quantum theory? What are clocks? How is time affected by the uncertainty principle? Is there time at all?

This listing of aspects of time, of ways in which time can flow faster or slower, and of our ways to observe a system for shorter or longer durations, makes it clear that a systematic, unified account of "timescales" is out of reach. In order to give instructive initial input to the MeMScales project, the best we can currently do is to compile a "tourist guide"-like collection of concrete empirical findings, mathematical models and theories, and machine learning approaches which have a bearing on some of the listed dimensions of "timescales". We coarsely sort these collection items into three sections, Brains, Mathematics and Computing, which is somewhat arbitrary since many lines of research cross-connect these areas. Our survey will be anything but complete: firstly because we largely omit entire domains of science (in particular biology, physics, psychology and philosophy), and secondly because even in the three domains that we did explore (the wider neurosciences, mathematics and computer science / machine learning), our bounded expertise and the breadth of the subject put limits on what we could effectively cover. At the end of each section we list themes that we know should be included in future extensions of such a survey.
3 Brains

This section collects perspectives of research, empirical findings and models from the wider neurosciences, including (some of) cognitive science.

The titles of the first four subsections are the consequence of our framework-building towards a theory of neuromorphic signal processing, which we hope to work out more fully and more systematically in our future work in MeMScales. A metaphor to motivate our strategy is that it is incredibly difficult to solve a Rubik's cube by just focusing on one side at a time. In that analogy, it is even undefined which parts of the cube correspond to an open problem. For in some definitions, there already are reasonable solutions to so-called open problems like the binding problem (Skarda, 1999), the stability-plasticity dilemma (Shouval et al., 2002), and systems memory consolidation (Van Kesteren et al., 2012). But these solutions do not bind together into a comprehensive theory. The organization of material in our four subsections below arises from distinguishing two axes along which neural timescales can be discussed: an individual (neuron) versus recurrent axis, and a dynamical processing versus plastic adaptation axis.
A biological neuron exhibits phenomena on multiple timescales due to voltage-gated (Doyle et al., 1998) and ligand-gated (Katz, 1971) ion channels (Ranjan et al., 2011), spatiotemporal filtering across dendritic cables (Rall, 2009), hierarchical synaptic-dendritic-membrane-somatic processing (Gao et al., 2018) and biochemical pathways involving multiple chemical compounds (Bray, 1995; Barkai and Leibler, 1997; Bargmann, 2006). Thus, a single neuron has an enormous capacity for signal processing, much greater than that of the McCulloch-Pitts and LSTM units in presently widespread artificial neural networks.

A special primitive for spatiotemporal processing at the dendritic level is coincidence detection. It can explain concentration-invariant signal recognition, for example in olfactory networks (Hopfield, 1995). Chaining multi-timescale transient units with a coincidence detector results in transient synchrony (Hopfield and Brody, 2001) and can explain signal recognition that is invariant under uniform time warping.

A pioneering model of temporal processing at the membrane level is due to Hodgkin and Huxley (Hodgkin and Huxley, 1952), which considers the membrane potential and ion channel activation numbers as a coupled system of nonlinear ODEs. A generalization of the Hodgkin-Huxley model with multiple ion channels whose conductances are nonlinear and modulatable at multiple timescales is now the gold standard for modelling the membrane dynamics of a neuron. Note that if the ion channel activation numbers do not have any immediate inter-neuronal effects, then just modelling the membrane potential is sufficient for a complete neurodynamical understanding. For example, Izhikevich (2004) showed that a reduced 2-dimensional threshold-reset ODE system is sufficient to reproduce a set of 20 kinds of temporal processing behaviors in cortical neurons, including tonic spiking, phasic spiking, spike bursting, spike latency, subthreshold oscillations, rebound, bistability, and spike frequency adaptation.
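For concreteness, the following short Python sketch integrates the two-dimensional threshold-reset system of Izhikevich (2004) with the elementary Euler method; the parameter values are the commonly quoted regular-spiking settings, and the input current and step size are arbitrary illustrative choices.

```python
import numpy as np

# Izhikevich (2004) two-variable threshold-reset neuron, Euler integration.
# v: membrane potential (mV), u: recovery variable.
a, b, c, d = 0.02, 0.2, -65.0, 8.0   # commonly quoted regular-spiking parameters
I = 10.0                             # constant input current (illustrative choice)
dt = 0.5                             # time step in ms

v, u = -65.0, b * -65.0
spike_times = []
for step in range(2000):             # simulate one second
    v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
    u += dt * a * (b * v - u)
    if v >= 30.0:                    # spike threshold and reset
        spike_times.append(step * dt)
        v, u = c, u + d

print(f"{len(spike_times)} spikes, first at {spike_times[0]:.1f} ms" if spike_times else "no spikes")
```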
Here we will focus on the workings of recurrent neural networks, ignoring plasticity. Functionally, a worm brain can be understood to operate in sensory-inter-command-motor layers (Gray et al., 2005). The human brain is similar but more complicated (Eliasmith et al., 2012), i.e. the command layer is split into functional regions performing action selection and motor processing, and the inter-layer is split into functional regions performing information encoding, transform calculation, reward evaluation, working memory and information decoding.

Three noteworthy observations arise from the study of recurrent processing with multiple timescales. Firstly, there often exists a behavioural hierarchy (Davis, 1979) resulting in a 'singleness of action' where a long-timescale state controls shorter timescales. For example, the mating state of stickleback fish activates a stereotypical dance movement (Tinbergen, 1951), and honeybees in a communicative state employ a waggle dance routine (Von Frisch, 1967). Secondly, the behavioural hierarchy can be deep, as in a reproductive instinct that activates sub-behaviours such as nest-building or defensive fighting. Experiments have shown that a three-level hierarchy explains worm locomotion both behaviourally and in neuroanatomy (Kaplan et al., 2020). Lastly, there need not always be an equivalence between anatomical and behavioural hierarchy; for example, chains of neurons can generate birdsongs (Long et al., 2010).
Here we will consider the form of synaptic plasticity postulated by Hebb (Hebb, 1949), where any change in the synaptic weight from one neuron to another is based only on signals due to the activity of those two neurons, i.e. deterministic bi-terminal interactions. Networks with Hebbian plasticity, with or without memory of the neuronal activity (an extreme case is a strict "spike-time" dependence (Caporale and Dan, 2008)), are theoretically capable of signal processing primitives such as principal component analysis (Oja, 1982), self-organizing maps (Kohonen, 1982), and independent component analysis (Jutten and Herault, 1991). So, at a network level, Hebbian learning can be much deeper than the popular maxim of "cells that fire together, wire together". Also, Hebbian-like learning is possible within a single cell (Fernando et al., 2009) if it contains motifs of chemical cycles where the concentrations of different chemical species (such as in gene regulatory networks or phosphorylation cycles) can mimic the functionality of synaptic weights (slow-varying control parameters) and action potentials (fast-varying state variables).

Among the gamut of possible Hebbian plasticity rules, the most noteworthy is the Bienenstock-Cooper-Munro (BCM) model (Bienenstock et al., 1982) because it has been experimentally justified (Cooper and Bear, 2012). In the BCM model, the rate of change of the synaptic weight equals a fast rate (inverse timescale) times the correlation in the neuron-to-neuron activity times a saliency factor (a mean-deviation of the postsynaptic activity), minus a slow rate times the synaptic weight. Thus, the BCM model has an increased rate of forgetting on introducing uncorrelated noise, can converge in whitened environments by means of higher-order statistics, can learn direction sensitivity without relying on neuroanatomical asymmetry, and can have a single neuron be both directionally and orientationally sensitive by learning on video stimuli.

Also noteworthy is that a biophysical model of bidirectional synaptic plasticity (Shouval et al., 2002) can be phenomenologically reduced to a voltage-based STDP rule (Clopath et al., 2010), which under certain input conditions is equivalent to the BCM model. Experimental measurements of STDP on the visual cortex, somatosensory cortex and hippocampus could be fit to the phenomenological model, and distinct timescales were extracted.
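As a concrete instance of such a primitive, the following sketch implements Oja's (1982) single-neuron rule, which converges to the leading principal component of its input stream; the input distribution and learning rate are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-dimensional zero-mean inputs with most variance along the direction (1, 1).
C = np.array([[3.0, 2.5],
              [2.5, 3.0]])                      # input covariance (illustrative)
X = rng.multivariate_normal([0.0, 0.0], C, size=5000)

w = rng.normal(size=2)                          # synaptic weight vector
eta = 0.005                                     # learning rate

for x in X:
    y = w @ x                                   # linear "postsynaptic" activity
    w += eta * y * (x - y * w)                  # Oja's rule: Hebbian term plus decay

w_unit = w / np.linalg.norm(w)
eigval, eigvec = np.linalg.eigh(C)
pc1 = eigvec[:, -1]                             # true leading principal direction
print("learned direction:", w_unit, " leading eigenvector (up to sign):", pc1)
```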
3.4 Recurrent plasticity

We can look at recurrent plasticity as having two sides: deterministic effects and non-deterministic effects.

All deterministic effects that are beyond bi-terminal interactions can be subsumed under the banner of neo-Hebbian plasticity (Gerstner et al., 2018), including neuromodulated STDP (Frémaux and Gerstner, 2016) due to some form of reward or punishment (leading to the concept of an eligibility trace and three-factor rules) and heterosynaptic plasticity due to some conservation law (Oh et al., 2015), such as the spatial conservation of the total synaptic weight (Von der Malsburg, 1973) or the normalization of the synaptic weights (Hyvärinen and Oja, 1998) to energetically sustainable levels (Walker and Stickgold, 2006; Kandel et al., 2014). Note that neo-Hebbian plasticity combined with a suitable inter-layer (that is capable of generating rewards internally for congruent or novel information) is sufficient for effective systems memory consolidation (Van Kesteren et al., 2012), but of course in reality non-determinism will also play a supplementary role, as discussed below.

All non-deterministic effects (including bi-terminal interactions) can be subsumed under the banner of neural Darwinism (Edelman, 1993), also known as neuroevolution (Stanley et al., 2019). It is plasticity that is based on the principle of selection upon variation, and hence is biased towards generating a hierarchical organization. Experimental evidence supports a hierarchical organization in the basal ganglia to generate action sequences (Jin and Costa, 2015), so it can play an important role in learning procedural memory. Of course, at some level genetics can also enforce a hierarchy (Felleman and Van Essen, 1991), but there is reason to believe that neural Darwinism plays a major role, given that hierarchies in the brain are themselves adaptive. For example, even people with cerebellar agenesis learn to walk (Boyd, 2010), and spoken language perception colonizes the visual cortex in blind children (Bedny et al., 2015; Lane et al., 2015). Also, neural Darwinism can work for hard problems like nonlinear blind source separation, for which deterministic and global optimization methods like slow feature analysis end up failing in high dimensions (Wiskott and Sejnowski, 2002).
We list a number of themes that would warrant a closer inspection but for which we lacked time or expertise on this occasion. Surely there is an endless list of such themes. Nevertheless, with deeper thought or moments of serendipity, we should work towards an ideal where newer themes are assimilated or accommodated into older themes (the Piagetian pun is intended (Piaget, 1952)).

• Neural clock circuits and entrainment of neural dynamics to clock signals.

• Variable binding through theta-wave phase synchronization.

• Hierarchies of memory mechanisms — a large research field in the cognitive and neurosciences which would need a separate, extensive treatment. Surveys are given, for example, by Durstewitz et al. (2000) or Fusi and Wang (2016).

• The role of cerebellar processing in timing fine-control.
• Experimental demonstrations of different time constants in cortical processing (Bernacchia et al., 2011).

• Perception of temporal patterns (Large and Palmer, 2002).

• Stages in ontogenetic development.
4 Mathematics

In this section we collect "pure" mathematical themes and formal modeling methods from dynamical systems theory and some areas of theoretical physics. Topics in formal logic are treated in Section 5.
Arguably the most popular mathematical approach to studying multiple-timescale dynamics has been via singular perturbation theory (SPT) of systems of ODEs. Intuitively, this theory studies perturbations with small parameters where the dynamics cannot straightforwardly be approximated by the limiting case where the parameters vanish (O'Malley, Jr., 1991; Verhulst, 2005). A geometric approach to singular perturbation theory (GSPT) was first set up by the works of A.N. Tikhonov and those of N. Levinson (Vasil'eva and Volosov, 1967; Kaper, 1999), later worked out in more detail by N. Fenichel (Fenichel, 1979). This geometric approach formalizes interpretations of certain singularly perturbed systems as 'slow-fast' systems, where some variables operate on a relatively fast timescale compared to other, slower evolving variables. An enormous amount of research has been done on these slow-fast systems, as they are relevant for the mathematical description of many processes in the life sciences. An extensive modern overview of mathematical theory on slow-fast ODE systems is given by Kuehn (2015). In particular, slow-fast systems have become a big research topic in mathematical neuroscience, see for example Rubin and Terman (2002); Izhikevich (2007); Ermentrout and Terman (2010); Pusuluri et al. (2020). Another notable application of SPT is in control science, where the classical text is Kokotovic et al. (1999).

To illustrate the mathematical approach to slow-fast ODEs, consider the two-dimensional system

\[ \tau_1 \frac{dx}{dT} = f(x, y), \qquad \tau_2 \frac{dy}{dT} = g(x, y), \tag{1} \]

with $f$ and $g$ some possibly non-linear functions. We assume $\tau_1, \tau_2 > 0$ are constants setting the timescales of the $x$ and $y$ variables. Now define a new parameter $\epsilon = \tau_1 / \tau_2$. Then system (1) can be transformed both into

\[ \epsilon \frac{dx}{ds} = f(x, y), \qquad \frac{dy}{ds} = g(x, y), \tag{2} \]

and into

\[ \frac{dx}{dt} = f(x, y), \qquad \frac{dy}{dt} = \epsilon\, g(x, y), \tag{3} \]

via reparameterizations of time $T = s \cdot \tau_2$ and $T = t \cdot \tau_1$ respectively. As long as $\epsilon > 0$, these systems are equivalent. We are interested in the case $\tau_1 \ll \tau_2$; intuitively this means the $x$-variable operates on a much faster timescale than the relatively slow $y$-variable. In that case $0 < \epsilon \ll 1$, and we might consider how systems (2) and (3) behave for small $\epsilon > 0$ in relation to the singular limit $\epsilon \to 0$. We may now also call $s$ the slow timescale, and $t$ the fast timescale. For $\epsilon = 0$, it is important to remark that systems (2) and (3) are not equivalent anymore. The case $\epsilon = 0$ for system (2) is also referred to as the slow subsystem or reduced problem, while the case $\epsilon = 0$ for system (3) may be called the fast subsystem or layer problem.

Intuitively, for $0 < \epsilon \ll 1$, the slow variable $y$ can be approximated by a constant on the fast timescale. Therefore, $y$ can be interpreted as a bifurcation parameter of the fast subsystem. The reduced problem describes the dynamics at $\epsilon = 0$ of the slow variable on a one-dimensional manifold $C$, also called the critical manifold, which is given by the zeros of $f$. Observe that $C$ can alternatively be said to be given by the equilibria of the fast subsystem (3). Close to attracting hyperbolic parts of $C$, the slow subsystem approximately describes the dynamics of the system for $0 < \epsilon \ll 1$. This is formalized by Fenichel's Theorem, see for example Fenichel (1979); Kaper (1999) or Chapter 3 of Kuehn (2015). Orbits starting close to an attracting hyperbolic part of $C$ for $0 < \epsilon \ll 1$ stay close to $C$ by Fenichel's Theorem, approximating the flow of the reduced problem, until nearing a point on $C$ where hyperbolicity is lost. This happens at bifurcations, with respect to $y$, of the fast subsystem. What happens after reaching such a bifurcation point requires careful analysis of the full system. One might find jumps between attractors of the fast subsystem (stable equilibria in the two-dimensional example under consideration here), as the system converges towards the vicinity of another attracting hyperbolic part of $C$. While this type of behavior often occurs in slow-fast ODE models, immediate jumps between attractors of the fast subsystem cannot be predicted in general from a decomposition of the full system into fast and slow subsystems at $\epsilon = 0$. Indeed, a peculiar type of behavior might occur where the full system for some time approximately follows a non-attracting hyperbolic part of $C$. This phenomenon is known as a canard, and for example plays a role in the analysis of spike adding for models of bursting neurons in mathematical neuroscience (Terman, 1991; Linaro et al., 2012).

The theory of slow-fast ODE systems has been extended to multiscale stochastic differential equations (SDEs) incorporating noise, and more generally to the context of random dynamical systems, see Chapter 15 of Kuehn (2015). Theory on slow-fast SDEs has for example been applied to give a rigorous analysis of certain multiscale synaptic plasticity models for neural networks in Galtier and Wainrib (2012, 2013b). Also, slow-fast systems of maps can be studied with similar techniques, and have been applied to neuron models as well, see for example Mira and Shilnikov (2005) and Ibarz et al. (2011).

Multiple timescales can also be introduced in differential equations via delays (Yanchuk and Giacomelli, 2017; Ruschel, 2020). The simplest such delay systems are modelled by

\[ \tau_L\, \dot{x}(t) = -x(t) + F(x(t - \tau_D)), \tag{4} \]

where $\tau_L$ is the intrinsic timescale of the system, $F(x)$ is a nonlinear function of $x$, and $\tau_D$ is the time delay. When $\tau_D$ is large compared to $\tau_L$, it is known that these types of systems can exhibit a host of interesting spatio-temporal dynamical phenomena (Yanchuk and Giacomelli, 2017). Intuitively, a comparatively large delay introduces a slow timescale next to the fast intrinsic dynamics of the system.
Such delay systems with large delay have recently been shown to be relevant for approaches to reservoir computing with opto-electronic hardware (Hart et al., 2019). These opto-electronic delay systems can be viewed as an alternative method for implementing high-dimensional neural networks. Space-time representations allow the dynamics of a delay system with low spatial dimension to be interpreted as spatio-temporal dynamics of spatially extended systems. As such, delay systems have recently also been thought of as useful for the study of complex dynamical behavior in large-scale connected networks. Although delay systems were originally thought of as similar to ring networks of identical neurons, Hart et al. (2019) propose that delay systems can be used to implement networks with arbitrary topologies.

By defining $\epsilon = \tau_L / \tau_D$ and rescaling time by the delay $\tau_D$, equation (4) can be rewritten into the singularly perturbed delay equation $\epsilon\, \dot{x}(t) = -x(t) + F(x(t - 1))$. Such systems have been studied for example in Chow and Mallet-Paret (1983) and Ivanov and Sharkovsky (1992).
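As a numerical companion to the slow-fast formalism above, the following Python sketch integrates the van der Pol oscillator written in the form of system (3), a standard textbook slow-fast example showing slow drift along the critical manifold interrupted by fast jumps; the parameter values are illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp

eps = 0.01   # timescale separation parameter (illustrative)

# Van der Pol oscillator in slow-fast form, written like system (3):
#   dx/dt = f(x, y) = y - (x**3 / 3 - x)      (fast variable)
#   dy/dt = eps * g(x, y) = -eps * x          (slow variable)
def rhs(t, state):
    x, y = state
    return [y - (x**3 / 3.0 - x), -eps * x]

sol = solve_ivp(rhs, (0.0, 400.0), [0.1, 0.0], max_step=0.1)

x, y = sol.y
# The trajectory spends most of its time drifting slowly near the critical
# manifold y = x**3/3 - x and makes fast jumps between its attracting branches,
# producing relaxation oscillations in x.
print("range of x:", x.min(), x.max())
```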
We list a number of themes that would warrant a closer inspection but for which we lacked time or expertise on this occasion:

• Critical slowdown of dynamics close to bifurcations.

• Characterizing multiscale structure in infinite symbol sequences via fixed points of grammar rule applications.

• Reaction-diffusion systems.

• Variables of multi-dimensional iterative maps can be given differing update frequencies. Little (if any) formal mathematical theory seems to exist on this topic.

• Line attractors and their generalizations.

• Statistical physics modeling of collective phenomena and generalized synchronization, slaving principle (Haken, 1983).
5 Computing
This section collects topics, techniques and models from computer science — machine learning and artificial neural networks in particular. The ordering of subsections is arbitrary.
In neural network (NN) architectures used in machine learning, a variety (but not a very large variety) of formal/algorithmic neuron models is used. The neuron models used in feedforward NNs always have a-temporal state update equations of the kind $x_i = f_i(\sum_j w_{ij} x_j + b)$, where the $x_j$ are the activations of neurons feeding into neuron $i$ and $f_i$ is an (almost always) monotonically growing "activation function". Time becomes a relevant theme only in recurrent neural networks (RNNs). Besides the a-temporal models $x_i = f_i(\sum_j w_{ij} x_j + b)$, which can also be used in discrete-time RNNs (which then mathematically can be regarded as implementing iterated maps), here we find a diversity of neuron models that either are specified by ordinary differential equations (ODEs) — from the simple leaky integrator neuron $c\, \dot{x}_i = -x_i + f_i(\sum_j w_{ij} x_j + b)$ with time constant $c$, to LSTM units (Hochreiter and Schmidhuber, 1997), to multi-variable circuit equations for use in analog VLSI neuromorphic microchips (Chicca et al., 2014) — or by discretized versions of such ODE models, typically using the elementary Euler approximation; or spiking versions which include a discontinuous neural state reset operation. All of these contain time constants. In complex neuron models (often with a biological motivation), different time constants can be set for different variables. For instance, slow synaptic efficiency adaptation rules ("slow" relative to the soma potential dynamics) are crucial for creating dynamical memory traces in liquid state machines (Maass et al., 2002). When these neurons are "executed" in digital simulations, they can be made to "run" faster or slower over a wide range (limited only by numerical stability conditions) compared to each other or to some reference timescale. In analog neuromorphic hardware realizations however, these time constants are fixed by physical givens, and changing them is only possible if the chip design allows one to access and "set" the physical correlates of time constants (for instance voltages), and only within limited ranges.

It is easy to simulate multiscale temporal dynamics of time-discretized ODE models on digital machines — all one has to do is to set different desired time constants in the various variable equations.
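A minimal Euler-discretized sketch of the leaky-integrator equation above (our own illustration, with arbitrary weights and constants), in which two units of the same small network are given different time constants and thereby realize a fast and a slow processing timescale:

```python
import numpy as np

dt = 0.01                                  # Euler step
c = np.array([0.05, 1.0])                  # per-unit time constants: unit 0 fast, unit 1 slow
W = np.array([[0.0, 0.0],
              [0.5, 0.0]])                 # the slow unit listens to the fast one (illustrative)
b = 0.0

x = np.zeros(2)
trace = []
for k in range(3000):
    u = 1.0 if k < 50 else 0.0             # brief input pulse into both units
    # Euler step of  c_i * dx_i/dt = -x_i + tanh(sum_j W_ij x_j + b + u)
    x = x + dt / c * (-x + np.tanh(W @ x + b + u))
    trace.append(x.copy())

trace = np.array(trace)
# After the pulse ends (t = 0.5), the fast unit relaxes within a few hundredths of a
# time unit, while the slow unit forgets the pulse only on a timescale of about one
# time unit: same equation, different timescales, set purely by the time constants.
print("state at t=0.6:", trace[60], "  state at t=3.0:", trace[300])
```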
It is not always easy to realize multiscale dynamics of time-discretized ODE models on digital machines when the computed dynamics must match physical "real-time" in online signal processing and control applications. The system state update equations must be simple enough, or implemented cleverly enough, or parallelized enough, to ensure that the digital processing needed to compute the next time slice state takes at most as much time as the physical time allotted for a sampling interval. This may become demanding in robot control applications, in particular in compliant robots where the sampling frequency must be high (on the order of 1000 Hz) in order to react fast enough to sensor signals signifying effector impact.

In serial-data "cognitive level" AI / machine learning tasks like text processing or visual gesture analysis, the processing algorithm must have a working memory which has (at least) the power of a stack memory. This is needed to process the hierarchically nested temporal structure in "grammatically" organized input sequences. Realizing such a stack memory is of course not a problem for digital computers when they execute symbolic-AI programs. When RNNs that have been set up or trained for such tasks are simulated on digital machines, the hierarchical memory organization cannot be directly mapped to the (easily available) physical stack memory mechanisms of the digital computer. Instead, this memory functionality must become encoded and realized in terms of the RNN's dynamics. One way to do so is to train binary context-level switching neurons which can set the (single) RNN into a temporal hierarchy of dynamical processing modes (Pascanu and Jaeger, 2011).
In theoretical (symbolic-digital) computer science the concept of time complexity refers to upper bounds on the maximal number of processing steps needed by a Turing machine to compute its result when it is started on any input word of length $n$ (Jaeger, 2019). This leads to a classification scheme for the "difficulty" of computing problems. For instance, the time complexity class P is the set of all input-output computing tasks T for which there exists some Turing machine M and some polynomial function $p: \mathbb{N} \to \mathbb{N}$ such that M always terminates within $p(n)$ update steps. Some of the deepest unsolved problems in theoretical computer science (and indeed, in mathematics) concern such time complexity classes, in particular the famous P $\stackrel{?}{=}$ NP problem (Cook, 2000).

This standard usage of the term "time complexity" is confined to characterizing the computational demands of evaluating functions — a Turing machine (and all other, equivalent mathematical definitions of an algorithm, of which there are many) incorporates an input-word to output-word mapping. In the context of neuromorphic computing and recurrent neural networks, a dynamical systems interpretation of "computing" seems more adequate than a function-evaluation interpretation. Some models of "computing" have been proposed in theoretical computer science which account for continual online processing of unbounded-length, symbolic input streams, in particular interactive Turing machines (van Leeuwen and Wiedermann, 2001) and more recently stream automata (Endrullis et al., 2019). Following the traditional questions that are considered in classical complexity theory, this research aims at classifying continual input-output stream processing tasks into complexity classes. The adopted perspective on discussing such complexity classes is however still tightly tied to the classical, function-based concepts of time complexity, in that such automata are designed in a way that upon reading a new input symbol, they can "detach" from the input stream, do a possibly highly complex computation in traditional Turing machine fashion, and after this computation terminates, produce an output symbol (or not). Complexity class hierarchies investigated in such research typically include classes of continual serial input-output tasks which are inaccessible to physical machines — super-Turing tasks — in that oracles are invoked, that is, external additional input (outside the input data stream) is allowed which provides information that itself is not Turing-computable.

There is also a body of research which aims at transferring concepts and methods from symbolic complexity theory to RNNs with real-valued activations and/or weights (Siegelmann and Sontag, 1994; Sima and Orponen, 2003). One common theme and finding in this line is that a single infinite-precision real number allows one to encode an infinite amount of information, which gives rise to super-Turing computing powers (that is, symbolic input-output functions can be computed which no Turing machine can compute). However, there is no evidence that such super-Turing performance can be physically realized, given the noise and the limited precision (observability) of physical state variables.
A traditional topic in classical (symbolic) AI is "reasoning about action and change", or "reasoning about action and time" (with several conferences and workshops and a wealth of publications that have these expressions in their titles). The objective of this research is to extend the expressive powers of logic-based knowledge representation and inference formalisms to facilitate the representation of, and formal reasoning about, a cluster of themes that includes action, change, planning, intentions, time, events, causation and more. Such formalisms are algorithmically processed with so-called theorem provers (also called inference engines). These are heuristic, discrete combinatorial search algorithms whose processing steps are not interpreted as temporal steps but as logical arguments. Time, timing, measuring time, comparing durations, ordering events on a timeline etc. are objects that are logically reasoned about, in reasoning steps whose ordering is conceived as logical, not temporal. More than four decades of research have produced a rich body of representation formalisms. We can only pinpoint a few examples. Allen (1991) is an early survey. Fundamental figures of reasoning about time are captured by modal operators in temporal logics (also known as tense logics) (Garson, 2014). A related classical subfield of AI, qualitative physics (Forbus, 1988) (closely related: naive physics, qualitative reasoning), explores logic-based formalisms which capture the everyday reasoning of humans about their mesoscale physical environment. A rather recent development is hybrid logic / dynamical-systems formalisms to reason about physical dynamical systems (Geuvers et al., 2010) in ways that capture the measurement metrics of "real" continuous time. Such formalisms are intended for formal verification of hybrid physical-computational systems in systems engineering.

Slow feature analysis (SFA), developed by L. Wiskott (Wiskott and Sejnowski, 2002; Franzius et al., 2008), is a method to extract features from timeseries data (in particular video streams) which are defined by the fact that they change slowly. SFA has been used to explain the functioning of feature detection cells in visual cortex (Berkes and Wiskott, 2003) and hippocampal place cells (Schönfeld and Wiskott, 2015). Interestingly (and possibly, limitingly), the slow features found by SFA are functions of single input frames, not — as one might expect — functions of input episodes that last nonzero time.

5.6 Speed control in RNNs
Biological brains can generate and recognize instances of output patterns which differ from each other only in speed (for instance, generating or recognizing slow and fast versions of hand-waving, dance, or music patterns). For a mathematical model of an RNN written in ODEs with time constants, it would be straightforward to adjust the processing speed in generation or recognition by scaling all time constants with the same factor. But physical neural systems, whether biological or in neuromorphic hardware, cannot scale all physical time constants with a global factor. This leads to a very interesting mathematical and biological (and algorithmic) question: how can the qualitative dynamics of an RNN be sped up or slowed down without a global time constant scaling? We are aware of two approaches, which both make use of the fact that when an RNN is excited by different-speed versions of the same input pattern, the elicited network states populate different regions of state space. By characterizing the geometry of these different regions with (cheaply computable) variables, and subsequently actively controlling these variables by elementary linear controllers (wyffels et al., 2014) or conceptor filters (Jaeger, 2014), speed variations up to a factor of 10 have been achieved for pattern generation tasks.
Even outside relativistic physics, space and time depend on each other. In particular, travelling solitons and waves need time proportional to travel distance. This may become exploitable for the design of neural mechanisms for variable-speed pattern recognition and generation. The idea is to encode the target pattern spatially on a neural surface and let it be "read" by a travelling wave or soliton whose travel speed is determined by a single variable, or very few variables, that can be controlled physically or algorithmically. Neural field theories of cognitive cortical processing, which are based on solitons and waves, are worked out in some detail (Engels and Schöner, 1995; Lins and Schöner, 2014), but as far as we can see, so far not with the aim of explaining speed control.
In robotics and intelligent agent modeling, the cognitive control of action selection and motor control is typically organized in a hierarchy of planning and controlling layers. Higher layers in such hierarchies operate on slower timescales than lower layers; the lowest layers operate on the fastest timescales. Hierarchical agent "architectures" are so common and have been proposed so abundantly for 50 years that we can give only a few exemplary ad hoc pointers. Examples: In classical AI architectures, such hierarchies are explained in terms of plan-subplan hierarchies (Saffiotti et al., 1995). In control engineering, hierarchical control architectures have become an explicit industry standard (Albus, 1993). In the archetype subsumption architecture (Brooks, 1989) in behavior-based robotics, higher-level "behaviors" can suppress lower ones. An influential early model in neural networks / machine learning casts the control hierarchy as a trainable hierarchical mixture of experts in which higher-level experts can gate lower-level experts. A similar structure, based on ODEs where higher-layer behavior-controlling ODEs were run with slower time constants, powered several winners in RoboCup world championships (Jaeger and Christaller, 1998). When multi-layered RNNs are trained for robot tasks, timescale-differentiated layers of control emerge automatically (Yamashita and Tani, 2008).
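As an illustration of the "slower time constants in higher layers" principle, here is a schematic two-layer leaky-integrator update in the spirit of multiple-timescale RNNs. This is a sketch only: it is not the exact model of Yamashita and Tani (2008) or Jaeger and Christaller (1998), and all names and parameters are our own.

```python
import numpy as np

def two_timescale_step(x_fast, x_slow, u, p, dt=1.0):
    """One Euler step of a two-layer leaky RNN with layer-specific time constants.

    The fast layer (small tau_f) reacts to the input; the slow layer (large tau_s)
    provides a slowly varying context that modulates the fast layer.
    p is a dict of weights and time constants (tau_f, tau_s, W_ff, W_fs, W_in, W_ss, W_sf).
    """
    h_fast, h_slow = np.tanh(x_fast), np.tanh(x_slow)
    x_fast = x_fast + (dt / p["tau_f"]) * (
        -x_fast + p["W_ff"] @ h_fast + p["W_fs"] @ h_slow + p["W_in"] @ u)
    x_slow = x_slow + (dt / p["tau_s"]) * (
        -x_slow + p["W_ss"] @ h_slow + p["W_sf"] @ h_fast)
    return x_fast, x_slow
```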
In reinforcement learning (RL), a key submechanism is to represent and compute eligibility traces (Sutton and Barto, 1998). This refers to a number of algorithmic methods to maintain a memory trace (with weighted decay) of past actions (of a complete agent or a single neuron, in the latter case the action being spike generation) and input signals (sensor input to an agent or spike input to a neuron), paired with information about the (estimated) utility of that action history for obtaining reward. The setting of the decay rate determines the memory horizon. Reinforcement learning can be expected to play a large role in neuromorphic training. Recently, eligibility traces have also become instrumental in designing neurally plausible (and hence potentially implementable in physical spiking neuromorphic microchips) approximate methods to emulate backpropagation learning in spiking neural networks (Bellec et al., 2019).
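The decaying-trace mechanism can be illustrated with the textbook tabular TD(lambda) update following Sutton and Barto; the decay parameters gamma and lam set the memory horizon over which past states remain eligible for credit. The code is a generic illustration, not a model from the neuromorphic literature.

```python
import numpy as np

def td_lambda_episode(states, rewards, n_states, alpha=0.1, gamma=0.95, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces.

    states  : list of visited state indices, length len(rewards) + 1.
    rewards : list of rewards received at each step.
    Credit for a reward is spread over roughly 1 / (1 - gamma * lam) preceding steps,
    so gamma * lam plays the role of the decay rate discussed in the text.
    """
    V = np.zeros(n_states)
    e = np.zeros(n_states)                        # eligibility trace per state
    for t in range(len(rewards)):
        s, s_next = states[t], states[t + 1]
        delta = rewards[t] + gamma * V[s_next] - V[s]   # TD error
        e *= gamma * lam                          # decay all traces
        e[s] += 1.0                               # bump the trace of the visited state
        V += alpha * delta * e                    # credit past states via their traces
    return V
```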
In many temporal machine learning tasks, the incoming signal can be sometimes faster, sometimes slower. This can happen when signal sources change (for instance, there are slow and fast speakers), but it can also happen within a single instance of an input signal (for instance, when a speaker gets excited and speaks faster, or when their pronunciation has temporal idiosyncrasies, such as stretching vowels longer than the average speaker). This is a problem for machine learning algorithms. In brute-force learning paradigms (deep learning in particular), such time warping effects are handled by providing exhaustive training samples that cover all sorts of warping effects. A more training-data-economical approach is to send input signals through some time-unwarping preprocessing filter before feeding them to the RNN in training and exploitation, such that the RNN only has to cope with speed-normalized input signals. Another approach, which we find the most elegant, is to leave the input stream in its original time-warped version and adapt the processing speed of the RNN, speeding it up when the input signal slows down, such that each RNN state update step (in discrete-time RNN models) or unit-time state evolution (in continuous-time RNN models) covers the same phenomenological change increment in the input stream (Lukoševičius et al., 2006).
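A schematic sketch of this third approach: the reservoir's effective update step is modulated by a local estimate of the input's speed, so that equal amounts of input change (rather than of elapsed time) produce equal amounts of state change. This captures only the gist of the idea; the actual time-warping-invariant network of Lukoševičius et al. (2006) differs in its details, and v_ref, leak and the speed estimate used here are illustrative choices.

```python
import numpy as np

def warp_invariant_reservoir(u, W, W_in, v_ref, leak=0.5):
    """Run a reservoir whose effective update step scales with the local input speed.

    u    : input sequence of shape (T, k);  W : (n, n) recurrent weights;  W_in : (n, k).
    v_ref: reference speed that normalizes the local change of the input.
    """
    x = np.zeros(W.shape[0])
    states, u_prev = [], u[0]
    for ut in u:
        speed = np.linalg.norm(ut - u_prev) / v_ref      # local speed estimate
        step = min(1.0, leak * speed)                    # effective step size in [0, 1]
        x = (1 - step) * x + step * np.tanh(W @ x + W_in @ ut)
        u_prev = ut
        states.append(x.copy())
    return np.array(states)
```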
Continual learning refers to the machine learning challenge to make a neural network (feedforward or recurrent) learn a sequence of tasks, one after the other, such that when the next task is trained into the network, the new weight adaptations do not destroy what the network has previously learnt in other tasks. This catastrophic forgetting (or catastrophic interference) problem has remained without a convincing solution since it was first acknowledged decades ago (French, 2003). Only recently, a number of novel approaches in deep learning found effective algorithmic ways to defuse this problem. This is today a very active strand of research in deep learning, now called continual learning, which has yielded a variety of effective algorithmic paradigms and a differentiated view on variants of the problem statement (Parisi et al., 2019; He et al., 2019a). The continual learning problem is closely related to the theme of transfer learning (which concerns the generalization and carry-over of competences learnt on other tasks to a new task), the theme of federated learning (which concerns the integration of learning progress made in the peripheral nodes in a network of decentralized local learners (Kairouz et al., 2019)), and the theme of meta-learning (which concerns the learning of learning strategies).

Continual learning is connected to timescale and memory topics in several ways. First, in some continual learning algorithms, weight changes in synapses that are deemed important for previously learnt tasks are discouraged, reducing (= slowing down) their adaptation rate. Second, the continual learning problem in its most demanding form poses itself on the longest possible, namely the lifelong learning, timescale. Third, some continual learning algorithms rely on generative memory replay of previously learnt tasks.
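The first connection (discouraging changes to important weights) is often realized by adding an importance-weighted quadratic penalty to the task loss, as in elastic-weight-consolidation-style algorithms. The sketch below shows one gradient step of this generic scheme; the variable names and the form of the importance estimate are illustrative, not taken from the cited surveys.

```python
import numpy as np

def penalized_gradient_step(w, grad_task, w_old, importance, lr=0.01, lam=1.0):
    """One SGD step with a penalty that discourages changing important weights.

    w          : current weights;  grad_task : gradient of the new task's loss.
    w_old      : weights after the previous task;  importance : per-weight importance
                 estimate (e.g. a Fisher-information proxy).
    Large 'importance' effectively slows the adaptation rate of the corresponding weight.
    """
    grad_penalty = lam * importance * (w - w_old)   # pull important weights back to old values
    return w - lr * (grad_task + grad_penalty)
```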
The simple linear readout which is typically used for training RNNs in the reservoir computing (RC) field can be used to define natural numerical measures for "how much" memory about previous input is preserved in the current network state. In its most basic format, the memory capacity of a discrete-time "echo state" reservoir network is measured by (i) feeding it with white noise input, (ii) training linear readout units y_d by linear regression on the task to recover the input value u(t − d) from d steps before, and (iii) summing the squared correlation coefficients between the signals y_d and u(t − d) over all delays d to obtain the desired measure (Jaeger, 2002). Note that the training of readout units here is not done to solve a "useful" task but solely to quantify a core characteristic of an RNN (or, for that matter, any other dynamical system). Note further that the "memory" which can be determined in this way is a purely dynamical short-term memory and involves no learning inside the RNN. This concept of memory capacity has become the anchor point for a (by now) extensive literature of mathematical analyses which explore memory in RNNs under aspects like the impact of noise (Antonik et al., 2018), continuous time (Hermans and Schrauwen, 2010a), high-dimensional input (Hermans and Schrauwen, 2010b), infinite-dimensional neural networks constructed by kernel methods (Hermans and Schrauwen, 2012), different neuron models (Büsing et al., 2010), or nonlinear readouts (Grigoryeva et al., 2016), to name but a few. The literature is by now extensive and a systematic survey would be welcome. Frady et al. (2018) develop a classification scheme for dynamical memory tasks and measures which highlights the richness of phenomena and perspectives associated with dynamical memory in RNNs.

The memory capacity of reservoir networks has become a standard metric to quantify or predict the "goodness" of reservoir networks for cognitive tasks in studies where different network architectures (Strauss et al., 2012), reservoir pre-training methods (Schrauwen et al., 2008), or reservoir control parameter tunings are compared. The latter is often associated with investigations of reservoir performance "close to the edge of chaos" (Legenstein and Maass, 2007) (which in most cases we find an incorrect usage of terminology; correctly it should be "close to the loss of the echo state property").

Measuring delayed-input-to-trained-output correlations is not the only way of quantifying the dynamical memory capacity of RNNs or general input-driven dynamical systems. If one adopts a probabilistic perspective, information-theoretic measures like the Fisher memory matrix (Ganguli et al., 2008) are more informative, albeit harder to estimate empirically.

We point out that dynamical memory cast as state-based information carry-over from the past to the present, as discussed above, is not the same as working memory. Working memory is a complex concept used in the cognitive and neurosciences for a spectrum of transient recall phenomena in animal and human remembering (Baddeley, 2003; Botvinic and Plaut, 2006; Fusi and Wang, 2016). Working memory phenomena usually entail additional control mechanisms to encode and decode context information and insertion of knowledge stored in long-term memory.

Neither "dynamical memory", "working memory", nor "short-term memory" have generally shared, precise definitions, and when one studies the literature one must be careful to appreciate the specific meaning of such terms intended by the author.
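The procedure (i)-(iii) described above translates directly into a few lines of code for a standard discrete-time tanh echo state network. The sketch below uses ridge regression for the readouts and illustrative hyperparameters; it is a generic implementation of the measure, not code from the cited works.

```python
import numpy as np

def memory_capacity(W, W_in, T=5000, washout=500, max_delay=100, ridge=1e-6, seed=0):
    """Estimate the short-term memory capacity of an echo state network.

    (i) drive the reservoir with i.i.d. uniform noise, (ii) train one linear readout per
    delay d by ridge regression to reconstruct u(t - d), (iii) sum the squared correlation
    coefficients between reconstruction and delayed input over all delays.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    u = rng.uniform(-1, 1, T)
    x, X = np.zeros(n), np.zeros((T, n))
    for t in range(T):
        x = np.tanh(W @ x + W_in * u[t])
        X[t] = x
    mc = 0.0
    for d in range(1, max_delay + 1):
        Xd = X[washout:]                         # states at times t = washout .. T-1
        yd = u[washout - d: T - d]               # targets u(t - d)
        w_out = np.linalg.solve(Xd.T @ Xd + ridge * np.eye(n), Xd.T @ yd)
        r = np.corrcoef(Xd @ w_out, yd)[0, 1]
        mc += r ** 2
    return mc

# usage: a random reservoir with spectral radius 0.9 (heuristic echo state scaling)
n = 100
W = np.random.default_rng(1).normal(size=(n, n))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))
W_in = np.random.default_rng(2).uniform(-0.5, 0.5, n)
print(memory_capacity(W, W_in))                  # bounded above by n
```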
We list a number of themes that would warrant a closer inspection but for which we lacked time or expertise on this occasion:

• Generating and detecting timing and rhythm patterns in music, speech or gesture recognition / production (Eck, 2002a,b, 2007).
• Methods for dynamical adaptation of learning rates in gradient-descent training of neural networks.
• Timescales in connection with statistical efficiency of neural sampling and Markov-chain Monte Carlo sampling algorithms (Neal, 1993; Jaeger, 2020).
• Interactions between adaptation rates, memory duration, and residual approximation errors in online adaptive signal processing (Farhang-Boroujeny, 1998).
• The role of delays in (neural network based) architectures for motion control.
• Subsampling and supersampling in digital signal processing.
• Attention and working memory mechanisms in deep learning, especially for language processing (Bahdanau et al., 2015).
Our meandering journey through the landscape of temporal and timescale phenomena in natural and artificial "cognitive" systems has delivered a large and speckled collection of findings. We could not bind them together in a unifying "story" (we tried this in a first write-up but had to abandon the attempt because there were many themes left that did not fit into the unified picture that we started to draw). But despite the heterogeneity of our findings, there are some lessons learnt that we believe provide useful input to the MemScales consortium at an early time in the project:

Timescales is a multidimensional concept.
There are many ways in which "timescale" themes come to the surface when thinking about cognitive systems. One consequence for hardware and computational methods research in MemScales: there is not a single good (or even best) way to design neuromorphic systems with regard to timescales. How multiple timescales have to be physically and algorithmically supported depends on the use scenario of the targeted system.
Timescales cannot be ignored.
Our belief that timescales and memory hierarchies are important was a raison d'être for launching our project. Our findings substantiate and underline this initial belief and convince us that a dedicated focus on timescales is necessary in the further development of neuromorphic computing.
More complex cognitive processing needs more timescales.
A task's cognitive complexity seems closely linked to the spectrum of memory timescales needed for it. This indicates that for a systematic development of neuromorphic technologies it is helpful to work out a complexity hierarchy of task types and initially not "reach for the stars" but concentrate on tasks of modest complexity that require integrating information across only a few timescales (or even a single one).
Relative and absolute timescales.
A neuromorphic computing system (hardware plus algorithms) must support a range of timescales that widens as the cognitive task complexity grows. If the system is used in offline mode (for instance, text processing), one only needs to aim for a wide range of relative timescales. If the system is targeting online processing tasks (for instance robot control or cardiac monitoring), one must in addition match the system's absolute timescales to the task data streams. The main challenge here is probably to physically realize slow enough timescales.
Tricks to avoid many physical timescales.
It is not easy to realize a wide spectrum of timescales directly in physical effects on a non-digital neuromorphic microchip. There are a number of workarounds that may alleviate this challenge:

• Digital-analog hybrid processors where slow timescales are made possible by digital buffering. Needs the development of dedicated digital-analog algorithmics.
• Large (possibly very large) RNNs can encode large amounts of information from past input in the current network state and thus have longer dynamic memory spans. Needs microchip technologies for realizing (very) large RNNs.
• Reservoir transfer methods may have some potential for realizing long memory spans even in modestly sized RNNs if these are explicitly trained for the specific memory functionalities demanded by the target task.
• Designing RNN architectures that include explicit mode switching mechanisms (possibly trainable) may realize temporally nested processing levels. Needs the development of dedicated architectures and learning algorithms, and a clear understanding of the "stack memory" demands of a task.
Delays may make neuromorphic computing difficult.
Signal travel delays in unclocked analog neuromorphic microchips become a problem when delay times are not well separated from the fastest timescales demanded by the processing task (if they are well separated, delays can be ignored). For high-frequency online processing tasks (for instance in future neuromorphic low-energy communication nodes), an explicit modeling and algorithmic compensation of physical delays is needed. For multi-timescale offline tasks, an upper limit for task throughput rates is given by the necessity to separate physical delays from the fastest task timescale. We note that delays pose no mathematical or algorithmic problem in digital computing as long as physical on-chip delay times are much shorter than clock cycle times.
Delays may make neuromorphic computing easy.
If a way were found to physically realize tapped delay lines (by traveling waves or solitons, maybe skyrmionic ones?), multiple-timescale dynamics (with the longest scale given by the longest signal travel time on the delay line) might become explicitly designable. Needed: mathematics and algorithms embedding tapped delay lines in analog computing architectures.
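A software caricature of the tapped-delay-line idea (the physical realization by waves or solitons is the open question; this sketch only shows how taps at different positions expose different timescales to a readout):

```python
import numpy as np
from collections import deque

class TappedDelayLine:
    """The input travels along a line of given length; each tap reads the signal at a
    different travel delay, giving a readout direct access to multiple timescales
    (the longest one set by the line length). A physical realization would replace
    this software buffer by a traveling wave or soliton."""
    def __init__(self, length, taps):
        self.buffer = deque([0.0] * length, maxlen=length)
        self.taps = taps                        # tap positions = delays in time steps

    def step(self, u):
        self.buffer.appendleft(u)               # the signal enters and travels down the line
        return np.array([self.buffer[t] for t in self.taps])

# usage: a trained readout (e.g. linear regression) would operate on the tap vector
line = TappedDelayLine(length=1000, taps=[1, 10, 100, 999])
features = [line.step(u) for u in np.sin(np.linspace(0, 20, 2000))]
```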
Life history timescale.
If the motto of brain-like computing is taken seriously, the "lifespan" timescale of an individual hardware system becomes relevant. Digital microchips don't age and don't have an individuation history: if they start processing 0's and 1's differently from when they were sold, they are called "broken" and are replaced by an identical twin. Analog neuromorphic microchips, in contrast, will likely be individual from the moment when they leave the fab (due to device mismatch); they will often exhibit slow parameter drift and physical aging; and they cannot be "programmed" in the traditional sense but will likely have to be trained. This will lead to individual lifelong learning and adaptation histories. Needed: novel mathematical tools to describe qualitative change, and continual/lifelong learning schemes (algorithms and training schemes) that are appropriate for physically aging systems, which in particular will require a collaboration between learning and homeostatic self-stabilization mechanisms.
Today, virtually all analog spiking neuromorphic hardware demonstrations are based on either STDP, RC, or sometimes a combination of the two. This also holds true for our research in the predecessor project NeuRAM3 (for instance He et al. (2019b); Yousefzadeh et al. (2018); Cove et al. (2018)). Although our survey has made it clear that biological brains as well as machine learning techniques derive their strength from a much wider range of computational / learning principles than STDP and RC, at the current point in time it makes sense to focus on these two paradigms, identify the current state of development in the theory and the practical uses thereof, and from that derive concrete (minimal) requirements for physical timescales that have to be delivered by analog spiking neuromorphic hardware. We treat STDP and RC in turn, but start with a general summary of timescales and their biological, algorithmical and hardware realizations.

7.1 Timescales: biological, algorithmical, hardware
Research in the neurosciences has identified a plethora of neural adaptation mechanisms. They are based on a wide spectrum of physical and physiological mechanisms and operate on all levels of the brain's hierarchical architecture, from synapses to membranes to entire neural assemblies and projection pathways; and they serve many functions (as far as one can identify them today) in homeostatic regulation, fast and slow adaptation to input characteristics, short-term, working and long-term memory, learning and ontogenesis. This richness is far from being fully understood in the neurosciences, and there exists no unified or comprehensive mathematical model.

Nonetheless, it is instructive to be aware of some core concepts and findings from neuroscience. Table 1 gives a high-level indicative overview which reflects the ongoing discussions in the consortium. It remains to be explored which physical effects of hardware devices can serve which biologically motivated mechanism. This is an intricate question because it is not the physical device / effect per se that serves a computational/biological mechanism, but a complex interplay of the core physical effect with circuit designs and control schemes, like for instance pulse pattern schemes for setting PCM resistances.
Biological plasticity phenomenon | Timescale | Mechanism | Candidate physical device / effect
Short-term plasticity | 1 ms – 10 ms | STDP, SDSP | capacitors
Long-term plasticity | 10 ms – 500 ms for weight change; 1 h – years for weight preservation | LTP/LTD | non-volatile memristive devices (for preserving the results of LTP)
Intrinsic plasticity | 0.5 s – 10 s | threshold adaptation | volatile ReRAM, TFT, ...
Homeostatic plasticity | 1 s – 1 h | synaptic scaling | volatile ReRAM, PCM drift, TFT, ...
Structural plasticity | 1 h – lifetime | architecture reorganisation | reconfigurable / extendable architectures
Table 1: Overview of plasticity phenomena.

The large number of physiological mechanisms underlying this spectrum of phenomena, as well as the wealth of formal models in theoretical neuroscience that capture these phenomena at different levels of abstraction, as well as physical differences between brain physiology and electronics, make it impossible to copy biological mechanisms one-to-one to electronic microchips. Furthermore, it is not necessarily the most promising engineering strategy to even try to copy brain mechanisms exactly into analog spiking hardware. On the one hand, many biochemical mechanisms will be hard to replicate in electronic systems, and on the other hand, electronic systems may offer opportunities (especially faster timescales or very long-term non-volatile memory states) that are not afforded in physiological brain substrates. Yet, Table 1 teaches a clear lesson: in order to endow artificial systems with proxies of the biological inference, adaptation and learning mechanisms, a wide range of timescales must be covered.

How this is concretely done will depend on the available hardware, targeted performance and use-cases, and algorithmic models. In the following subsection we will work this out in an exemplary case study.
7.2 Timescale requirements for SDSP systems based on analog spiking event-based neuromorphic hardware

There is not a single, well-defined STDP adaptation rule in biological brains. In fact, it is an experimental challenge to localize, measure and formulate STDP mechanisms in mammalian brains. In the machine learning / computational neuroscience / neuromorphic engineering communities, a broad variety of STDP variants and combinations of them with other neural adaptation mechanisms have been explored — Joshi and Triesch (2009); Clopath et al. (2010); Graupner and Brunel (2012); Galtier and Wainrib (2013a); Klampfl and Maass (2013); Roclin et al. (2013); Bengio et al. (2017); Thiele et al. (2018) are but a small selection of approaches that document this variability. The initial specific concept of STDP (as described in the landmark paper by Markram et al. (1997), with many forerunners) does not cover this variability. The term "spike-timing-dependent synaptic plasticity" and the acronym STDP were introduced in Song et al. (2000). The term
Spike-Driven Synaptic Plasticity (SDSP), apparently introduced by Fusi et al. (2000) in a formal model of an adaptive synapse independent of, but potentially effective in, a variety of learning/adaptation mechanisms, should be preferred over the term STDP when one considers spike-driven synaptic plasticity phenomena in a more general setting than the original STDP framing. Since neuromorphic electronic circuits and the neural network learning algorithms used in them explore and exploit more general mechanisms than STDP proper, we will use the term SDSP in this section.

In order to become concrete, we must however settle on a specific model, and this should not be a repetition of what we already developed in NeuRAM3. Instead, our choice should open doors for the currently most promising line of SDSP exploits, the so-called 3-factor rules. Again, this principle comprises many different variants. Generally speaking, in 3-factor SDSP rules, the adaptation effects determined by the basic two factors (pre- and postsynaptic activations) of SDSP become multiplicatively modulated by a third factor, which represents some kind of global control signal, which can be variously interpreted as a reward signal, a derivative of a supervised target signal, a temporal coordination / synchronization guide, or a mean-field population activity signal for achieving homeostatic regulation of a neuron's average activity level (summarized in Kusmierz et al. (2017)).

Our choice is to opt for the first among the mentioned interpretations, and concretely for the model recently proposed by Bellec et al. (2019, 2020). This model, named the e-prop model, imports mechanisms from reinforcement learning and utilizes them to realize an approximation of stochastic gradient descent (SGD) with an SDSP mechanism. SGD is the main enabling learning principle that empowers deep learning techniques, and is thus of great potential value for neuromorphic technologies since the current state of the art in machine learning is defined through SGD-trained neural networks. However, in the deep learning field, one does not use spiking neuron models. Much effort has been spent in the last 10 years to find approximations of SGD that also work in spiking neural environments, with limited success. The model of Bellec et al. has immediately created a strong resonance: it builds on and transcends previous approaches to approximate SGD in spiking networks, is mathematically transparent, can be adapted to a variety of neuron models, and has been explicitly formulated with analog spiking hardware implementations in mind; furthermore, MemScales members (Indiveri, Jaeger) enjoy a long-standing collaboration with the group of Wolfgang Maass where this model originates. Closely related SDSP-realized approximations of SGD are currently being explored in a number of research groups. Nefti and Averbeck (2019) review approaches to transferring neurobiological models of reinforcement learning to artificial neural networks, emphasizing the benefits of neuron models that include sub-mechanisms that operate on different timescales, and report brackets for biological time constants. Payvand et al. (2020) (whose first author is a member of the INI) present an analog circuit for an on-chip realization of (a version of) such learning rules, and demonstrate it in a simulation. Concrete values of effective time constants are unfortunately not provided. The documentation of the mathematical formalism in Bellec et al.
(2019, 2020) is particularly detailed, which gives us the option to analyse conditions on time constants, which we now proceed to do.

Following Bellec et al. (2020), we first give a brief summary of e-prop for the case of leaky integrate-and-fire (LIF) neurons (formulated in a discrete-time setting, using a unit timestep of δt = 1 millisecond), the simplest and arguably most popular spiking neuron model in neuromorphic engineering theory. The core of SGD algorithms in supervised learning for the adaptation of a synaptic weight w_{ji} from pre-synaptic neuron i to post-synaptic neuron j is the error gradient dE/dw_{ji}, which can be factorized as

\[ \frac{dE}{dw_{ji}} = \sum_t \frac{dE}{dz_j^t} \cdot \frac{dz_j^t}{dw_{ji}} =: \sum_t L_j^t \, e_{ji}^t , \tag{5} \]

where z_j^t is the postsynaptic spike train (a binary signal), the summation goes over the time points of the learning history, the factor dE/dz_j^t =: L_j^t is the learning signal, and the factor dz_j^t/dw_{ji} =: e_{ji}^t is the eligibility trace. The eligibility trace depends on pre- and postsynaptic spiking (see below) and is thus a form of SDSP. The learning signal is the "third" factor in the customary terminology when one speaks of 3-factor rules.

Note that this formulation (5) captures the weight change gradient obtained from accumulating information about a whole training sequence or a training batch. For an instantaneous weight adaptation in a single model update step from time t to t + δt = t + 1, as needed for adaptive hardware implementations, (5) reduces to the online learning rule

\[ \Delta^t w_{ji} = -\eta \, L_j^t \, e_{ji}^t , \tag{6} \]

where η is a learning rate. We now take a closer look first at the eligibility trace and then at the "third factor", the learning signal. We first give a brief summary account of the formalism in Bellec et al. (2020), which is geared toward discrete-time simulations on a digital computer, and then discuss what conditions on physical time constants in unclocked event-based analog hardware implementations can be derived.

Since pre- or postsynaptic spike trains z_j^t are not differentiable, they are replaced by exponentially smoothed (filtered) versions

\[ \bar{z}^t := F_\alpha(z^t) := \alpha \, F_\alpha(z^{t-1}) + z^t \tag{7} \]

when needed. Bellec et al. (2020) derive that the eligibility trace e_{ji}^t can then be re-written as

\[ e_{ji}^{t+1} = \psi_j^{t+1} \, \bar{z}_i^t , \tag{8} \]

where ψ_j^t is a pseudo-derivative of ∂z_j^t / ∂v_j^t (used variously in the literature for making spike trains differentiable under consideration of the post-synaptic neuron j's refractory period r; Bellec et al. (2020) refer back to Bellec et al. (2018)), given by

\[ \psi_j^t := \begin{cases} 0 & \text{if } t \text{ lies inside the refractory period } r, \\ \frac{\gamma_{\mathrm{pd}}}{v_{\mathrm{th}}} \, \max\!\left( 0, \; 1 - \left| \frac{v_j^t - v_{\mathrm{th}}}{v_{\mathrm{th}}} \right| \right) & \text{else,} \end{cases} \tag{9} \]

where in turn v_j^t is the membrane potential of neuron j, v_th its firing threshold, and γ_pd is a heuristic damping parameter that is set to γ_pd = 0.3. v_j^t evolves according to

\[ v_j^{t+1} = \alpha v_j^t + \sum_{i \neq j} W_{ji}^{\mathrm{rec}} z_i^t + \sum_i W_{ji}^{\mathrm{in}} x_i^{t+1} - z_j^t \, v_{\mathrm{th}} , \tag{10} \]
\[ z_j^t = H(v_j^t - v_{\mathrm{th}}) , \tag{11} \]

where W_{ji}^{rec}, W_{ji}^{in} are the recurrent and input weights to neuron j, x^t is the input signal and H is the Heaviside step function. The decay rate α can be expressed in terms of an exponential decay function by

\[ \alpha = \exp(-\delta t / \tau_m) , \tag{12} \]

where τ_m is the membrane time constant. A biologically plausible value is τ_m = 20 ms. With a stepsize δt = 1 ms (which is used in the simulations in Bellec et al. (2020)), this gives α ≈ 0.95.
The learning signal L_j^t in (5) measures the deviation between the output signals y_k generated by the network output neurons k (which are not recurrently connected to each other), defined by the leaky integration rule

\[ y_k^t = \kappa \, y_k^{t-1} + \sum_j W_{kj}^{\mathrm{out}} z_j^t + b_k^{\mathrm{out}} , \tag{13} \]

and the target outputs y_k^{*,t}, through a simple linear combination of the errors

\[ L_j^t = \sum_k B_{jk} \, \left( y_k^t - y_k^{*,t} \right) . \tag{14} \]

The error backprojection weights B_{jk} are determined in Bellec et al. (2020) according to various heuristics, among them fixing them at random values. While this works satisfactorily in the demonstrations given in Bellec et al. (2020), we see this dependence on heuristic intuition to define the learning signal as an opportunity for further improvements of this model. Which values of κ were chosen in the demonstrations in Bellec et al. (2020) remained undocumented there. However, it is clear that for tasks where the target outputs y_k^{*,t} are smooth signals, they must be assumed to be high-pass filtered to preclude arbitrary baseline drifts which cannot be learnt by neuronal outputs. To connect our following discussion of synaptic/neuronal time constants with task-specific time constants, we will consider the period length T* (in milliseconds) of the lowest significant frequency in y_k^{*,t} as the slowest task-relevant timescale. The challenge for online learning with spiking neurons is to be slow enough to be able to integrate task-relevant information on that timescale T*.

We now consider the question how this model, which is formulated in a discrete-time set-up for simulation on digital computers, translates into requirements for RNN implementations in unclocked, event-based, spiking hardware.

We first consider the second factor z̄_i^t in the eligibility trace (8), which represents pre-synaptic spikes arriving at the synapse w_{ji}. Note that the 1 ms time difference between t + 1 and t in (8) is due to the discrete-time simulation scenario, where a unit time step is assumed for propagating the information from neuron i to neuron j. For electronic event-based neuromorphic hardware we may assume that the travel delays of electric signals are negligible, hence instead of (8) we will consider

\[ e_{ji}^t = \psi_j^t \, \bar{z}_i^t . \tag{15} \]

According to (7) and (12), z̄_i^t is an exponentially smoothed version of z_i^t with an exponential time constant that we will call τ_pre. In order to exploit (7) in analog unclocked hardware, a physical variable available at the physical implementation of synapse w_{ji} must represent z̄_i^t, that is, a physical leaky integration of the incoming spike train z_i^t with time constant τ_pre must be effected somewhere in the circuit — either at the sending neuron i (then this signal must be sent to all receiving neurons j), or at the receiving synapse w_{ji} (then the integration must be physically repeated at all synapses to which i sends out its spike train).

We note that it is not possible to derive general rules for how τ_pre should be set for the network to solve its learning task. Whether a specific setting will be successful depends on many design variables, for instance the size of the RNN (in larger RNNs, less precision per synapse is needed), the functional specialization of neurons i and j (they might specialize on high-frequency components in the outward task, leading to more relaxed constraints on τ_pre), the average firing rate of the feeding neuron i (the higher it is, the smaller τ_pre can be), and, importantly, a model of how task-relevant information is encoded in z_i^t.
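For concreteness, the discrete-time equations (6)-(15) can be condensed into a single update step as in the following sketch. This is a pedagogical toy only: biases, the refractory case of Eq. (9) and the input-weight update are omitted, and all parameter values and data structures are our own illustrative choices rather than those of Bellec et al. (2020).

```python
import numpy as np

def eprop_lif_step(s, x_t, y_target, p):
    """One discrete-time e-prop update step for a recurrent LIF network (Eqs. 6-15).

    s : dict with state and weights (v, z, zbar, y, W_rec, W_in, W_out, B).
    p : dict with parameters (dt, tau_m, tau_pre, v_th, gamma_pd, kappa, eta).
    """
    alpha = np.exp(-p["dt"] / p["tau_m"])       # membrane decay, Eq. (12)
    a_pre = np.exp(-p["dt"] / p["tau_pre"])     # decay of the presynaptic trace, Eq. (7)

    # membrane potential with spike reset (Eq. 10) and new spikes (Eq. 11)
    s["v"] = alpha * s["v"] + s["W_rec"] @ s["z"] + s["W_in"] @ x_t - s["z"] * p["v_th"]
    z_new = (s["v"] >= p["v_th"]).astype(float)

    # pseudo-derivative of the spike nonlinearity, Eq. (9), refractory case omitted
    psi = (p["gamma_pd"] / p["v_th"]) * np.maximum(
        0.0, 1.0 - np.abs((s["v"] - p["v_th"]) / p["v_th"]))

    # smoothed presynaptic spike trains (Eq. 7) and eligibility traces e_ji (Eq. 15)
    s["zbar"] = a_pre * s["zbar"] + s["z"]
    elig = np.outer(psi, s["zbar"])             # shape (post, pre)

    # leaky readout (Eq. 13, bias omitted) and random-feedback learning signal (Eq. 14)
    s["y"] = p["kappa"] * s["y"] + s["W_out"] @ z_new
    L = s["B"] @ (s["y"] - y_target)

    # online three-factor weight update, Eq. (6)
    s["W_rec"] -= p["eta"] * L[:, None] * elig

    s["z"] = z_new
    return s
```

The two places where the time constants discussed next enter this sketch are the decay factors computed from τ_m (membrane integration) and τ_pre (presynaptic trace).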
We must make further assumptions to arrive at a well-defined problem. We will proceed under the following assumptions:

1. Task-relevant information is encoded in the network by rate coding.
2. The synapse w_{ji} contributes significantly to the network's functionality with regard to the slowest task-relevant timescale T*.

The leaky integration of z_i^t should be such that a significant memory trace of spikes that lie T* in the past is still present in z̄_i^t. What "significant" means is subject to an essentially arbitrary commitment. Here we opt for a plausible heuristic and require that the contribution of z_i^{t-T*} to z̄_i^t is reduced by a forgetting factor F of at most 1/2. This leads to the condition

\[ \exp(-T^*/\tau_{\mathrm{pre}}) \geq 1/2 , \tag{16} \]

that is

\[ \tau_{\mathrm{pre}} \geq -\frac{T^*}{\ln(1/2)} = \frac{T^*}{\ln 2} \approx 1.44 \cdot T^* . \tag{17} \]

Next we turn to the first factor ψ_j^t in the eligibility trace (15). This factor accounts for the postsynaptic spike timing when interpreted in an SDSP perspective. Inspecting (9) we see that this factor follows the temporal profile of the membrane potential v_j^t of neuron j, which in turn (see (10)) is a leaky integration of recurrent and input spike trains arriving at j. We have to transfer the discrete-time formulation of Bellec et al. to the continuous-time, event-based situation in analog unclocked hardware, that is, we must translate the discrete timestep discount factor α in (10) to an exponential decay rate that we will call (like Bellec et al. do) τ_m. Again we must make additional assumptions to arrive at a specific statement of our problem. Repeating the assumptions and the heuristic that we committed to above, we arrive at the same conclusion as in (17):

\[ \tau_m \geq \frac{T^*}{\ln 2} \approx 1.44 \cdot T^* . \tag{18} \]

This suggests that membrane leakage time constants are needed that are in the order of the slowest relevant task time constants. Bellec et al. (2020) used a biologically motivated value of τ_m = 20 ms. They demonstrated their model on a supervised task of phoneme recognition, where T* is 10 ms, which satisfies our constraints (17) and (18). However, in another experiment, where e-prop was adapted to a reinforcement learning situation, the task-relevant slowest timescale was in the order of T* = 2000 ms, still with τ_m = 20 ms. This is at odds with (18). The solution to this puzzle is an argument that combines the influence of network size with the choice of the forgetting factor F = 1/2.
If we plug in smaller forgetting factors in (17), (18), we end up with smaller admissible time constants τ_pre, τ_m. They result in smaller T*-delayed task-relevant additive components in the signals z̄_i^t, ψ_j^t, a source of variation which in turn can however be compensated by the linear combinations effected through W_{ji}^{rec}, W_{ji}^{in}. The efficacy of this compensation scales with the size N of the RNN and the numerical accuracy of the used computing environment. Bellec et al. used floating-point precision arithmetic and large networks (with 2400 neurons in the phoneme recognition demo; the network size is not documented for the reinforcement learning task). In this light, our suggestions (17), (18) should be considered as extremely conservative if not pessimistic, relevant (only) for very small networks with a few neurons and low numerical precision (or with noise).

We summarize our findings:

• For implementing the e-prop algorithm for supervised training of RNNs in analog spiking neuromorphic hardware on the basis of elementary LIF neuron models, two leaky integration mechanisms are needed, one for the membrane potential and one for the smoothing of spike trains arriving at a synapse.
• Lower bounds on the minimally necessary time constants for these two integration mechanisms depend on a number of design variables (in particular network size and realizable numerical accuracy) and task specifics (in particular the slowest task-relevant time constant in the task signals). Under the most conservative assumptions (very small network, low numerical accuracy) one can reason that the leaky integration time constants for the membrane and synapse integrations must be in the order of the slowest task-specific time constant. As network size and/or available numerical accuracy increases, increasingly faster neuronal/synaptic time constants can be expected to be sufficient for realizing the e-prop algorithm.

• Not all membrane or synapse time constants need to satisfy the conditions described here. In order to enable an RNN architecture to cope with the slowest task-relevant timescales, it is enough if some neurons / synapses are capable of the required slow integrations. Specifically, hierarchical network architectures are often designed in a way that "higher" layers operate in slower timescale modes than "lower" layers.

We emphasize that the considerations made above are tied to the specifics of the e-prop algorithm, with its specific version of SDSP and its specific training objective and system architecture proposed in Bellec et al. (2020). There are many other SDSP rules, other training objectives (in particular unsupervised ones, or tasks based on non-temporal data) and architectures, for which other considerations would have to be made. In particular, the necessity of leaky-integrating incoming spike trains at each synapse is a consequence of the specific e-prop mathematics and will not be required in many other SDSP versions, tasks or architectures.

The main lesson to be drawn from this case study is that there should be a mechanism in the neuromorphic system whose time constant matches the slowest timescale T* of the outward task (where "matching" needs to be qualified: it need not be identity, but can mean that the corresponding hardware mechanism has a faster timescale that can be expanded to the task timescale through computational effects). We found the same lesson taught to us in a quite different experimental and algorithmic scenario too, as will be reported in Section 7.3. If that lesson holds true, then a very wide range of task-dictated timescales T* must be served: ranging from milliseconds in robot/prosthetics control to days or weeks or even years in environmental monitoring, just to name two application tasks that have been proposed as targets for neuromorphic computing technologies.

We emphasize that this case study does not imply a recommendation for MemScales research to implement this specific model. We chose it as a representative because the article of Bellec et al. (2020) gave a mathematical model in all detail, from which we could develop an exemplary analysis. Other SDSP models have been or are being explored in our consortium, like Yousefzadeh et al. (2018) or Cartiglia et al. (2020). It is impossible to provide a theoretical analysis that covers all options, and it would be inappropriate to try to identify a "best" one.

We finally point out that the e-prop algorithm has recently been employed in a reservoir computing set-up, where it was used to optimize the recurrent weights of a reservoir for an entire class of learning tasks in a "learning to learn" scenario (Subramoney et al., 2021).
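Equations (16)-(18) above translate into a one-line rule of thumb for the minimal leaky-integration time constant implied by a task timescale T* and a chosen forgetting factor F (an illustrative helper, not part of the cited works):

```python
import numpy as np

def min_time_constant(T_star, F=0.5):
    """Smallest tau satisfying exp(-T_star / tau) >= F, i.e. tau >= -T_star / ln(F) (Eqs. 16-17)."""
    return -T_star / np.log(F)

print(min_time_constant(10.0))            # ~14.4 ms for a 10 ms task timescale, F = 1/2
print(min_time_constant(2000.0, F=0.1))   # a more permissive F admits tau well below 1.44 * T_star
```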
7.3 Timescale requirements for RC systems based on analog spiking event-based neuromorphic hardware

Reservoir computing based on physical reservoirs is a flourishing research area. Physical RC systems have been built on the basis of many different non-digital physical substrates. Popular media include optics (Antonik et al., 2018), nano-mechanics (Coulombe et al., 2017), carbon nanotubes (Dale et al., 2016), magnetic skyrmions (Prychynenko et al., 2018), spintronics (Torrejon et al., 2017), or gold nanoparticle thin films (Minnai et al., 2018) (survey in Tanaka et al. (2019)). These studies are mostly experimental. Theoretical analyses, or at least systematic explorations of dynamical phenomena in controlled simulations, are scarce. We are aware of only one work in the optical RC community (Grigoryeva et al., 2016) and another one on memristive-electronics-based reservoirs (Sheldon et al., 2020), which is however still rather rudimentary.

An inherent obstacle to general theoretical analyses is that every physical system comes with idiosyncratic dynamical properties that leave their mark on the computational properties of the respective system, and would have to be analysed on a case-by-case basis. While a large body of analytical research has accumulated over the last two decades for reservoirs that are mathematically defined on the simplest possible rate-based neuron model (the echo state networks), insights made there do not easily carry over to other sorts of reservoirs. Specifically, no theoretical analyses of the computational / learning characteristics of reservoirs based on analog spiking continuous-time neural networks are yet available.

A natural starting point for such analyses, with special attention paid to timescale phenomena, would be to study the memory capacity (MC) of analog spiking reservoir RNNs. In its original format, which was expressed for discrete-time non-spiking reservoirs of the echo state network type, the MC is a measure for how many previous inputs of a one-dimensional white noise signal can be recovered by trained linear readouts, weighted with an accuracy factor. In the work that started this research line (Jaeger, 2002) it was shown that MC is bounded by the number of neurons in the reservoir. This triggered a large number of follow-up studies (a Google Scholar query on "echo state network" "memory capacity" returns more than 600 papers) which extended the original analysis with regard to input signal type, neuron model, noise robustness, input dimension, network architecture, alternative definitions of MC, and more. Contributions came from mathematics, theoretical physics, the neurosciences and machine learning. The broad interest in this question can be explained by the fundamental nature of the question of information transport in dynamical systems in general, the relevance for machine learning tasks (see Dambre et al. (2012) for the intimate connection between memory properties and general computational capacities), and the relevance for understanding dynamical short-term memory in biological brains.
We are however not aware of mathematical analyses of MC in spiking RNNs, although it has been experimentally measured in a number of simulation studies.

In the group of Jaeger at the University of Groningen, the PhD student Dirk Doorakkers, whose position is funded through MemScales and who is a mathematician with a specialization in dynamical systems theory (and who authored Section 4 of this deliverable report), will carry out a dissertation project that centrally addresses the question of information transport in spiking RNNs. His project, with the working title
Double transients in multi-timescale systems provide a geometric description of dynamic coding with activity-silent working memory, plans a rigorous analysis of mechanisms in spiking RNNs where

1. an input signal is initially encoded in a temporal activation pattern of the RNN, which
2. propagates in time through the RNN for a delay ("memory") period d, undergoing a sequence of transformations, until
3. upon a cue signal a desired output transform of the input signal is recovered by a decoding ("readout") mechanism.

This analysis will be done with the tools of contemporary dynamical systems theory, in particular slow-fast systems (singular perturbation methods) and bifurcation theory, aiming for a characterization of such memory mechanisms in terms of generic geometrical dynamical systems concepts, which to a certain degree would render the analyses transferable to general classes of multi-timescale hardware reservoirs. This will constitute a substantial contribution to task T1.4, Toward a general model of unconventional computing.

For the time being, the best that we can offer is a summary of findings that we collected in the NeuRAM3 forerunner project to MemScales. Jaeger's group was charged to realize an online heartbeat anomaly classifier on the Dynap-se, a spiking analog neuromorphic microchip developed at the Institute of Neuroinformatics in Zurich (Moradi et al., 2018). The challenge was that the natural time constant of human heartbeats is 1 sec, while the slowest available time constants for spike train integration on the Dynap-se were much faster. Our findings and methods are reported in He et al. (2019b). Here we give a summary account, which agrees well with our observations in Section 7.2:

• In earlier simulations (not reported in He et al. (2019b)) we found that the learning task could be solved with spike train integration time constants that matched the task time constants.
• The unavailability of physical time constants that were as slow as the 1 sec time constant of the heartbeat data led to failures in "direct-attack" attempts to train a Dynap-se based reservoir.
• The task became solvable on the Dynap-se when a novel reservoir transfer method was employed to pre-configure the synaptic weights in the hardware reservoir in a way that allowed linear combinations of spike trains arriving at a receiving neuron to compensate for the small forgetting factors (see Section 7.2) inherent in the Dynap-se physics. The reservoir comprised about 750 neurons.

References
J. S. Albus. A reference model architecture for intelligent systems design. In P. J.Antsaklis and K. M. Passino, editors,
An Introduction to Intelligent and AutonomousControl , chapter 2, pages 27–56. Kluwer Academic Publishers, 1993.J. F. Allen. Time and time again: The many ways to represent time.
InternationalJournal of Intelligent Systems , 6(4):341–355, 1991.P. Antonik, M. Hermans, M. Haelterman, and S. Massar. Random pattern and fre-quency generation using a photonic reservoir computer with output feedback.
NeuralProcessing Letters , 47(3):1041–1054, 2018.A. Baddeley. Working memory: looking back and looking forward.
Nature Reviews:Neuroscience , 4(10):829–839, 2003.D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning toalign and translate. In
International Conference on Learning Representations (ICLR) ,2015. URL http://arxiv.org/abs/1409.0473v6 .Cornelia I Bargmann. Chemosensation in c. elegans. In
WormBook: The Online Reviewof C. elegans Biology [Internet] . WormBook, 2006.Naama Barkai and Stan Leibler. Robustness in simple biochemical networks.
Nature ,387(6636):913–917, 1997.Marina Bedny, Hilary Richardson, and Rebecca Saxe. “visual” cortex responds to spokenlanguage in blind children.
Journal of Neuroscience , 35(33):11674–11681, 2015.G. Bellec, D. Salaj, A. Subramoney, R. Legenstein, and W. Maass. Long short-termmemory and learning-to-learn in networks of spiking neurons. arxiv manuscript, 2018.URL https://arxiv.org/abs/2006.12484 .G. Bellec, F. Scherr, E. Hajek, D. Salaj, R. Legenstein, and W. Maass. Biologicallyinspired alternatives to backpropagation through time for learning in recurrent neuralnets. arxiv manuscript, 2019. URL https://arxiv.org/abs/1901.09049 .G. Bellec, F. Scherr, A. Subramoney, E. Hajek, D. Salaj, R. Legenstein, and W. Maass.A solution to the learning dilemma for recurrent networks of spiking neurons.
NatureCommunications , 11(1):1–15, 2020.T. Bengio, Y. abd Mesnard, S. Fischer, A. abd Zhang, and Y. Wu. Stdp-compatibleapproximation of backpropagation in an energy-based model.
Neural Computation ,29(3):555–577, 2017.P. Berkes and L. Wiskott. Slow feature analysis yields a rich repertoire of complexcell properties.
Cognitive Sciences EPrint Archives (CogPrints) , 2804, 2003. URL http://cogprints.org/2804/ .A. Bernacchia, H. Seo, D. Lee, and X.-J. Wang. A reservoir of time constants for memorytraces in cortical neurons.
Nature Neuroscience, 14(3):366–372, 2011. Elie L Bienenstock, Leon N Cooper, and Paul W Munro. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex.
Journal of Neuroscience , 2(1):32–48, 1982.M. M. Botvinic and D. C. Plaut. Short-term memory for serial order: A recurrent neuralnetwork model.
Psychological Review , 113(2):201–233, 2006.CAR Boyd. Cerebellar agenesis revisited.
Brain , 133(3):941–944, 2010.Dennis Bray. Protein molecules as computational elements in living cells.
Nature , 376(6538):307–312, 1995.R.A. Brooks. The whole iguana. In M. Brady, editor,
Robotics Science , pages 432–456.MIT Press, Cambridge, Mass., 1989.L. B¨using, B. Schrauwen, and R. Legenstein. Connectivity, dynamics, and memory inreservoir computing with binary and analog neurons.
Neural Computation , 22(5):1272–1311, 2010.C. Callender. Introduction. In C. Callender, editor,
The Oxford Handbook of Philosophyof Time . Oxford University Press, 2011.Natalia Caporale and Yang Dan. Spike timing–dependent plasticity: a hebbian learningrule.
Annu. Rev. Neurosci. , 31:25–46, 2008.M. Cartiglia, G. Haessig, and G. Indiveri. An error-propagation spiking neural networkcompatible with neuromorphic processors. In
Proc. 2020 IEEE International Confer-ence on Artificial Intelligence Circuits and Systems (AICAS) , pages 84–88, 2020.E. Chicca, F. Stefanini, C. Bartolozzi, and G. Indiveri. Neuromorphic electronic circuitsfor building autonomous cognitive systems.
Proc. of the IEEE , 102(9):1367–1388,2014.S.-N. Chow and J. Mallet-Paret. Singularly perturbed delay differential equations. InJ. Chandra and A.C. Scott, editors,
Coupled Nonlinear Oscillators , pages 7–12. North-Holland Publishing Company, 1983.Claudia Clopath, Lars B¨using, Eleni Vasilaki, and Wulfram Gerstner. Connectivityreflects coding: a model of voltage-based stdp with homeostasis.
Nature neuroscience
Nature Reviews Neuroscience , 13(11):798–810,2012.J. C. Coulombe, M. C. A. York, and J. Sylvestre. Computing with net-works of nonlinear mechanical oscillators.
PLOS ONE , 12(6), 2017. URL
https://doi.org/10.1371/journal.pone.0178663. Cove, R. George, J. Frascaroli, S. Brivio, C. Mayr, H. Mostafa, G. Indiveri, and S. Spiga. Spike-driven threshold-based learning with memristive synapses and neuromorphic silicon neurons.
Journal of Physics D: Applied Physics , 51:344003, 2018.M. Dale, S. Miller, J. F. anbd Stepney, and M. A. Trefzer. Evolving carbon nanotubereservoir computers. In
Proc. Int. Conf. on Unconventional Computation and NaturalComputation , pages 49–61, 2016.J. Dambre, D. Verstraeten, B. Schrauwen, and S. Massar. Information processing ca-pacity of dynamical systems.
Nature Scientific Reports , 2:id 514, 2012.William J Davis. Behavioural hierarchies.
Trends in Neurosciences , 2:5–7, 1979.Declan A Doyle, Joao Morais Cabral, Richard A Pfuetzner, Anling Kuo, Jacqueline MGulbis, Steven L Cohen, Brian T Chait, and Roderick MacKinnon. The structure ofthe potassium channel: molecular basis of k+ conduction and selectivity. science , 280(5360):69–77, 1998.D. Durstewitz, J. K. Seamans, and T. J. Sejnowski. Neurocomputational models ofworking memory.
Nature Neuroscience , 3:1184–91, 2000.D. Eck. Real-time musical beat induction with spiking neurons. Technical Report IDSIA-22-02, IDSIA, Instituto Dalle Molle di studi sull’ intelligenza artificiale, Galleria 2, CH-6900 Manno, Switzerland, 2002a. ftp://ftp.idsia.ch/pub/techrep/IDSIA-22-02.ps.gz.D. Eck. Finding downbeats with a relaxation oscillator.
Psychological Research ∼ eckdoug/papers/2002 psyres.pdf.D. Eck. Identifying metrical and temporal structure withan autocorrelation phase matrix. Music Perception ∼ eckdoug/papers/2006 rppw draft.pdf.Gerald M Edelman. Neural darwinism: selection and reentrant signaling in higher brainfunction. Neuron , 10(2):115–125, 1993.Chris Eliasmith, Terrence C Stewart, Xuan Choo, Trevor Bekolay, Travis DeWolf,Yichuan Tang, and Daniel Rasmussen. A large-scale model of the functioning brain. science , 338(6111):1202–1205, 2012.J. Endrullis, J. W. Klop, and R. Bakhshi. Transducer degrees: atoms, infima andsuprema.
Acta Informatica , 57(3-5):727–758, 2019.Ch. Engels and G. Sch¨oner. Dynamic fields endow behavior-based robots with repre-sentations.
Robotics & Autonomous Systems , 14:55–77, 1995.G.B. Ermentrout and D. Terman.
Mathematical Foundations of Neuroscience , volume of Interdisciplinary Applied Mathematics . Springer, New York NY, 2010.B. Farhang-Boroujeny.
Adaptive Filters: Theory and Applications . Wiley, 1998.Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in theprimate cerebral cortex. In
Cereb cortex. Citeseer, 1991. N. Fenichel. Geometric singular perturbation theory for ordinary differential equations.
Journal of Differential Equations , :53–98, 1979.Chrisantha T Fernando, Anthony ML Liekens, Lewis EH Bingle, Christian Beck,Thorsten Lenser, Dov J Stekel, and Jonathan E Rowe. Molecular circuits for as-sociative learning in single-celled organisms. Journal of the Royal Society Interface , 6(34):463–469, 2009.K. D. Forbus. Qualitative physics: past, present and future. In
Exploring ArtificialIntelligence: Survey Talks from the National Conferences on Artificial Intelligence ,pages 239–296. Morgan Kaufmann, 1988.E.P. Frady, D. Kleyko, and F.T. Sommer. A theory of sequence indexing and workingmemory in recurrent neural networks.
Neural Computation , 30(6):1449–1513, 2018.M. Franzius, B. Wilbert, and L. Wiskott. Invariant object recognition with slow featureanalysis. In
Proc. of ICANN 2008 , number 5163 in Lecture Notes in Computer Science,pages 961–970. Springer Verlag Berlin, 2008. DOI: 10.1007/978-3-540-87536-9 98.Nicolas Fr´emaux and Wulfram Gerstner. Neuromodulated spike-timing-dependent plas-ticity, and theory of three-factor learning rules.
Frontiers in neural circuits , 9:85,2016.R. M. French. Catastrophic interference in connectionist networks. In L. Nadel, edi-tor,
Encyclopedia of Cognitive Science , volume 1, pages 431–435. Nature PublishingGroup, 2003.S. Fusi and X.-J. Wang. Short-term, long-term, and working memory. In M. Arbiband J. Bonaiuto, editors,
From Neuron to Cognition via Computational Neuroscience ,pages 319–344. MIT Press, 2016.S. Fusi, M. Annunziato, D. Badoni, A. Salamon, and D. J. Amit. Spike-driven synapticplasticity: Theory, simulation, VLSI implementation.
Neural Computation , 12(10):2227–2258, 2000.M. Galtier and G. Wainrib. Multiscale analysis of slow-fast neuronal learningmodels with noise.
Journal of Mathematical Neuroscience , (13), 2012. doi: .M. Galtier and G. Wainrib. A biological gradient descent for prediction through acombination of STDP and homeostatic plasticity. Neural Computation , 25(11):2815–2832, 2013a. URL http://de.arxiv.org/abs/1206.4812 .M.N. Galtier and G. Wainrib. A biological gradient descent for prediction through acombination of STDP and homeostatic plasticity.
Neural Computation, 25(11):2815–2832, 2013b. S. Ganguli, D. Huh, and H. Sompolinsky. Memory traces in dynamical systems. PNAS, 105(48):18970–18975, 2008. Shangce Gao, MengChu Zhou, Yirui Wang, Jiujun Cheng, Hanaki Yachi, and Jiahai Wang. Dendritic neuron model with effective learning algorithms for classification, approximation, and prediction.
IEEE transactions on neural networks and learningsystems , 30(2):601–614, 2018.James Garson. Modal logic. In Edward N. Zalta, editor,
The Stanford Encyclopedia ofPhilosophy . Summer 2014 edition, 2014.Wulfram Gerstner, Marco Lehmann, Vasiliki Liakoni, Dane Corneil, and Johanni Brea.Eligibility traces and plasticity on behavioral time scales: experimental support ofneohebbian three-factor learning rules.
Frontiers in neural circuits , 12:53, 2018.H. Geuvers, A. Koprowski, D. Synek, and E. van der Weegen. Automated machine-checked hybrid system safety proofs. In
Proc. Int. Conf. on Interactive TheoremProving , pages 259–274. Springer, 2010.M. Graupner and N. Brunel. Calcium-based plasticity model explains sensitivity ofsynaptic changes to spike pattern, rate, and dendritic location.
PNAS , 109(10):3991–3996, 2012.Jesse M Gray, Joseph J Hill, and Cornelia I Bargmann. A circuit for navigation incaenorhabditis elegans.
Proceedings of the National Academy of Sciences , 102(9):3184–3191, 2005.L. Grigoryeva, J. Henriques, L. Larger, and J.P. Ortega. Nonlinear memory capacity ofparallel time-delay reservoir computers in the processing of multidimensional signals.
Neural Computation , 28(7):1411–1451, 2016.H. Haken.
Advanced Synergetics - Instability Hierarchies of Self-Organizing Systems andDevices , volume 20 of
Springer Series in Synergetics . Springer, Berlin/Heidelberg,1983.J.D. Hart, L. Larger, T.E. Murphy, and R. Roy. Delayed dynamical systems: networks,chimeras and reservoir computing.
Philosophical Transactions of the Royal Society A , :20180123, 2019.X. He, Sygnowski J., A. Galashov, A. A. Rusu, Y. W. Teh, and R. Pascanu. Taskagnostic continual learning via meta learning. In Proc. neurIPS 2019 , 2019a. https://arxiv.org/abs/1906.05201 .X. He, T. Liu, F. Hadaeghi, and H. Jaeger. Reservoir transfer on analog neuromorphichardware. In
Proc. 9th Int. IEEE/EMBS Conf. on Neural Engineering , pages 1234–1238, 2019b.Donald Olding Hebb.
The organization of behavior: a neuropsychological theory . J.Wiley; Chapman & Hall, 1949.M. Hermans and B. Schrauwen. Memory in linear recurrent neural networks in contin-uous time.
Neural Networks, 23(3):341–355, 2010a. M. Hermans and B. Schrauwen. Memory in reservoirs for high dimensional input. In
Proc. WCCI 2010 (IEEE World Congress on Computational Intelligence) , pages2662–2668, 2010b.M. Hermans and B. Schrauwen. Recurrent kernel machines: Computing with infiniteecho state networks.
Neural Computation , 24(1):104–133, 2012.S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural Computation , 9(8):1735–1780, 1997.Alan L Hodgkin and Andrew F Huxley. A quantitative description of membrane currentand its application to conduction and excitation in nerve.
The Journal of physiology ,117(4):500, 1952.John J Hopfield. Pattern recognition computation using action potential timing forstimulus representation.
Nature , 376(6535):33–36, 1995.John J Hopfield and Carlos D Brody. What is a moment? transient synchrony asa collective mechanism for spatiotemporal integration.
Proceedings of the NationalAcademy of Sciences , 98(3):1282–1287, 2001.Aapo Hyv¨arinen and Erkki Oja. Independent component analysis by general nonlinearhebbian-like learning rules. signal processing , 64(3):301–313, 1998.B. Ibarz, J.M. Casado, and M.A.F. Sanjuan. Map-based models in neuronal dynamics.
Physics Reports , :1–74, 2011.A.F. Ivanov and A.N. Sharkovsky. Oscillations in singularly perturbed delay equations.In C.K.R.T. Jones, U. Kirchgraber, and H.O. Walther, editors, Dynamics Reported:Expositions in Dynamical Systems , volume New series: , pages 164–224. Springer-Verlag, 1992.E.M. Izhikevich. Dynamical Systems in Neuroscience: The Geometry of Excitability andBursting . MIT Press, Cambridge MA, 2007.Eugene M Izhikevich. Which model to use for cortical spiking neurons?
IEEE transac-tions on neural networks , 15(5):1063–1070, 2004.H. Jaeger. Short term memory in echo state networks. GMD-Report 152,GMD - German National Research Institute for Computer Science, 2002. URL .H. Jaeger. Controlling recurrent neural networks by conceptors. Technical Report 31,Jacobs University Bremen, 2014. arXiv:1403.3369.H. Jaeger. Computability and complexity. Lecture notes of the-oretical computer science ii, Jacobs University Bremen, 2019. .H. Jaeger. Exploring the landscapes of “computing”: digital, neuromorphic,unconventional — and beyond. manuscript arXiv:2011.12013, 2020. URL http://arxiv.org/abs/2011.12013 . 40. Jaeger and Th. Christaller. Dual dynamics: Designing behavior systems for au-tonomous robots.
Xin Jin and Rui M Costa. Shaping action sequences in basal ganglia circuits. Current Opinion in Neurobiology, 33:188–196, 2015.
P. Joshi and J. Triesch. Optimizing generic neural microcircuits through reward modulated STDP. In Proc. ICANN 2009, pages 239–248, 2009.
Christian Jutten and Jeanny Herault. Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1):1–10, 1991.
Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, K. A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G.L. D’Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adria Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečny, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrede Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramer, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. arXiv report, 2019. URL https://arxiv.org/abs/1912.04977.
Eric R Kandel, Yadin Dudai, and Mark R Mayford. The molecular and systems biology of memory. Cell, 157(1):163–186, 2014.
T.J. Kaper. An introduction to geometric methods and dynamical systems theory for singular perturbation problems. Proceedings of Symposia in Applied Mathematics, 85–131, 1999.
Harris S Kaplan, Oriana Salazar Thula, Niklas Khoss, and Manuel Zimmer. Nested neuronal dynamics orchestrate a behavioral hierarchy across timescales. Neuron, 105(3):562–576, 2020.
B Katz. Quantal mechanism of neural transmitter release. Science, 173(3992):123–126, 1971.
S. Klampfl and W. Maass. Emergence of dynamic memory traces in cortical microcircuit models through STDP. Journal of Neuroscience, 33(28):11515, 2013.
Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69, 1982.
P. Kokotovic, H.K. Khalil, and J. O’Reilly. Singular Perturbation Methods in Control: Analysis and Design. Classics in Applied Mathematics. SIAM, Philadelphia PA, 1999.
C. Kuehn. Multiple Time Scale Dynamics. Springer, Cham, Switzerland, 1st edition, 2015.
L. Kusmierz, T. Isomura, and T. Toyoizumi. Learning with three factors: modulating Hebbian plasticity with errors. Current Opinion in Neurobiology, 46:170–177, 2017.
Connor Lane, Shipra Kanjlia, Akira Omaki, and Marina Bedny. “Visual” cortex of congenitally blind adults responds to syntactic movement. Journal of Neuroscience, 35(37):12859–12868, 2015.
E. W. Large and C. Palmer. Perceiving temporal regularity in music. Cognitive Science, 26:1–37, 2002.
R. Legenstein and W. Maass. Edge of chaos and prediction of computational performance for neural circuit models. Neural Networks, 20(3):323–334, 2007.
D. Linaro, A. Champneys, M. Desroches, and M. Storace. Codimension-two homoclinic bifurcations underlying spike adding in the Hindmarsh-Rose burster. SIAM Journal on Applied Dynamical Systems, (3):939–962, 2012.
J. Lins and G. Schöner. Neural fields. In S. Coombes, P. beim Graben, R. Potthast, and J. Wright, editors, Neural fields: theory and applications, pages 319–339. Springer Verlag, 2014.
Michael A Long, Dezhe Z Jin, and Michale S Fee. Support for a synaptic chain model of neuronal sequence generation. Nature, 468(7322):394–399, 2010.
M. Lukoševičius, D. Popovici, H. Jaeger, and U. Siewert. Time warping invariant echo state networks. IUB Technical Report 2, International University Bremen, 2006.
W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002.
H. Markram, J. Lübke, M. Frotscher, and B. Sakmann. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297):213–215, 1997.
C. Minnai, M. Mirigliano, S. A. Brown, and P. Milani. The nanocoherer: an electrically and mechanically resettable resistive switching device based on gold clusters assembled on paper. Nano Futures, 2:011002, 2018.
C. Mira and A. Shilnikov. Slow-fast dynamics generated by noninvertible plane maps. International Journal of Bifurcation and Chaos, (11):3509–3534, 2005.
S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri. A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs). IEEE Transactions on Biomedical Circuits and Systems, 12(1):106–122, 2018.
R.M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.
E. O. Neftci and B. B. Averbeck. Reinforcement learning in artificial and biological systems. Nature Machine Intelligence, 1(3):133–143, 2019.
Won Chan Oh, Laxmi Kumar Parajuli, and Karen Zito. Heterosynaptic structural plasticity on local dendritic segments of hippocampal CA1 neurons. Cell Reports, 10(2):162–169, 2015.
Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.
R.E. O’Malley, Jr. Singular Perturbation Methods for Ordinary Differential Equations. Applied Mathematical Sciences. Springer-Verlag, New York NY, 1991.
G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
R. Pascanu and H. Jaeger. A neurodynamical model for working memory. Neural Networks, 24(2):199–207, 2011. DOI: 10.1016/j.neunet.2010.10.003.
M. Payvand, M. E. Fouda, F. Kurdahi, A. Eltawil, and E.O. Neftci. Error-triggered three-factor learning dynamics for crossbar arrays. In Proc. 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), pages 218–222, 2020.
Jean Piaget. The origins of intelligence in children. International Universities Press, New York, 1952.
D. Prychynenko, M. Sitte, K. Litzius, B. Krüger, G. Bourianoff, M. Kläui, J. Sinova, and K. Everschor-Sitte. Magnetic skyrmion as a nonlinear resistive element: A potential building block for reservoir computing. Phys. Rev. Applied, 9:014034, 2018.
K. Pusuluri, H. Ju, and A.L. Shilnikov. Chaotic dynamics in neural systems. In R.A. Meyers, editor, Encyclopedia of Complexity and Systems Science, to appear. Springer Science, 2020.
Wilfrid Rall. Rall model. Scholarpedia, 4(4):1369, 2009.
Rajnish Ranjan, Georges Khazen, Luca Gambazzi, Srikanth Ramaswamy, Sean L Hill, Felix Schürmann, and Henry Markram. Channelpedia: an integrative and interactive database for ion channels. Frontiers in Neuroinformatics, 5:36, 2011.
D. Roclin, O. Bichler, C. Gamrat, S. J. Thorpe, and J. O. Klein. Design study of efficient digital order-based STDP neuron implementations for extracting temporal features. In Proc. of The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–7, 2013.
J.E. Rubin and D. Terman. Chapter 3: Geometric singular perturbation analysis of neuronal dynamics. In B. Fiedler, editor, Handbook of Dynamical Systems, pages 93–146. Elsevier, 2002.
S. Ruschel. Multiple Time-Scale Delay Systems in Mathematical Biology and Laser Dynamics. PhD thesis, Technical University of Berlin, 2020.
A. Saffiotti, K. Konolige, and E.H. Ruspini. A multivalued logic approach to integrating planning and control. Artificial Intelligence, 76:481–526, 1995.
F. Schönfeld and L. Wiskott. Modeling place field activity with hierarchical slow feature analysis. Frontiers in Computational Neuroscience, 9:51, 2015.
Benjamin Schrauwen, Marion Wardermann, David Verstraeten, Jochen J. Steil, and Dirk Stroobandt. Improving reservoirs using intrinsic plasticity. Neurocomputing, 71:1159–1171, 2008.
F. C. Sheldon, A. Kolchinsky, and F. Caravelli. The computational capacity of memristor reservoirs. arXiv manuscript, 2020. URL https://arxiv.org/abs/2009.00112v2.
Harel Z Shouval, Mark F Bear, and Leon N Cooper. A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proceedings of the National Academy of Sciences, 99(16):10831–10836, 2002.
H.T. Siegelmann and E.D. Sontag. Analog computation via neural networks. Theoretical Computer Science, 131(2):331–360, 1994.
J. Sima and P. Orponen. General-purpose computation with neural networks: A survey of complexity theoretic results. Neural Computation, 15(12):2727–2778, 2003.
Christine A Skarda. The perceptual form of life. Journal of Consciousness Studies, 6(11-12):79–93, 1999.
S. Song, K. D. Miller, and L. F. Abbott. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3(9):919–926, 2000.
Kenneth O Stanley, Jeff Clune, Joel Lehman, and Risto Miikkulainen. Designing neural networks through neuroevolution. Nature Machine Intelligence, 1(1):24–35, 2019.
T. Strauss, W. Wustlich, and R. Labahn. Design strategies for weight matrices of echo state networks. Neural Computation, 24(12):3246–3276, 2012.
A. Subramoney, F. Scherr, and W. Maass. Reservoirs learn to learn. In K. Nakajima and I. Fischer, editors, Reservoir Computing: Theory, Physical Implementations and Applications, chapter 1.3. Springer, 2021. Preprint arXiv:1909.07486.
R. Sutton and A. G. Barto. Reinforcement learning: an introduction. Cambridge: MIT Press, 1998. Online version at http://incompleteideas.net/book/ebook/the-book.html.
G. Tanaka et al. Recent advances in physical reservoir computing: A review. Neural Networks, 115:100–123, 2019. Preprint in https://arxiv.org/abs/1808.04962.
D. Terman. Chaotic spikes arising from a model of bursting in excitable cell membranes. SIAM Journal on Applied Mathematics, (5):1418–1450, 1991.
J. C. Thiele, O. Bichler, and A. Dupret. Event-based, timescale invariant unsupervised online deep learning with STDP. Frontiers in Computational Neuroscience, 12: article 46, 2018.
Niko Tinbergen. The study of instinct. Clarendon Press/Oxford University Press, 1951.
J. Torrejon, M. Riou, F. A. Araujo, S. Tsunegi, G. Khalsa, D. Querlioz, P. Bortolotti, V. Cros, K. Yakushiji, A. Fukushima, H. Kubota, S. Yuasa, M. D. Stiles, and J. Grollier. Neuromorphic computing with nanoscale spintronic oscillators. Nature, 547(27 July):428–431, 2017.
Marlieke TR Van Kesteren, Dirk J Ruiter, Guillén Fernández, and Richard N Henson. How schema and novelty augment memory formation. Trends in Neurosciences, 35(4):211–219, 2012.
J. van Leeuwen and J. Wiedermann. Beyond the Turing limit: Evolving interactive systems. In International Conference on Current Trends in Theory and Practice of Computer Science, number 2234 in LNCS, pages 90–109. Springer, 2001.
A.B. Vasil’eva and V.M. Volosov. The work of Tikhonov and his pupils on ordinary differential equations containing a small parameter. Russian Mathematical Surveys, 124–142, 1967.
F. Verhulst. Methods and Applications of Singular Perturbations: Boundary Layers and Multiple Timescale Dynamics. Texts in Applied Mathematics. Springer, New York NY, 2005.
Chr. von der Malsburg. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14(2):85–100, 1973.
Karl von Frisch. The dance language and orientation of bees. Harvard University Press, 1967.
Matthew P Walker and Robert Stickgold. Sleep, memory, and plasticity. Annu. Rev. Psychol., 57:139–166, 2006.
L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
F. wyffels, J. Li, T. Waegeman, B. Schrauwen, and H. Jaeger. Frequency modulation of large oscillatory neural networks. Biological Cybernetics, 108:145–157, 2014.
Y. Yamashita and J. Tani. Emergence of functional hierarchy in a multiple timescale neural network model: A humanoid robot experiment. PLOS Computational Biology, 4(11):e1000220, 2008.
S. Yanchuk and G. Giacomelli. Spatio-temporal phenomena in complex systems with time delays.