Synchronization and Redundancy: Implications for Robustness of Neural Learning and Decision Making
Jake Bouvrie ([email protected])
Department of Mathematics, Duke University, Durham, NC USA

Jean-Jacques Slotine ([email protected])
Nonlinear Systems Laboratory, Massachusetts Institute of Technology, Cambridge, MA USA

Keywords: stochastic systems, learning, synchronization, neural decision making, population coding, collective enhancement, neural uncertainty, stochastic contraction
Abstract
Learning and decision making in the brain are key processes critical to survival, and yet are processes implemented by non-ideal biological building blocks which can impose significant error. We explore quantitatively how the brain might cope with this inherent source of error by taking advantage of two ubiquitous mechanisms, redundancy and synchronization. In particular we consider a neural process whose goal is to learn a decision function by implementing a nonlinear gradient dynamics. The dynamics, however, are assumed to be corrupted by perturbations modeling the error which might be incurred due to limitations of the biology, intrinsic neuronal noise, and imperfect measurements. We show that error, and the associated uncertainty surrounding a learned solution, can be controlled in large part by trading off synchronization strength among multiple redundant neural systems against the noise amplitude. The impact of the coupling between such redundant systems is quantified by the spectrum of the network Laplacian, and we discuss the role of network topology in synchronization and in reducing the effect of noise. A range of situations in which the mechanisms we model arise in brain science are discussed, and we draw attention to experimental evidence suggesting that cortical circuits capable of implementing the computations of interest here can be found on several scales. Finally, simulations comparing theoretical bounds to the relevant empirical quantities show that the theoretical estimates we derive can be tight.
1 Introduction

Learning and decision making in the brain are key processes critical to survival, and yet are processes implemented by imperfect biological building blocks which can impose significant error. We suggest that the brain can cope with this inherent source of error by taking advantage of two ubiquitous mechanisms: redundancy, and sharing of information. These concepts will be made precise in the context of a specific model and learning scenario which together can serve as a conceptual tool for illustrating the effect of redundancy and sharing.

Motivated by the problem of learning to discriminate, we consider a neural process whose goal is to learn a decision function by implementing a nonlinear gradient dynamics. The dynamics, however, are assumed to be corrupted by perturbations modeling the error which might be incurred. This general perspective is intended to capture a range of possible learning instances occurring at different anatomical scales: The neural process can involve whole brain areas communicating via behavioral, motor or sensory pathways (Schnitzler and Gross, 2005), as in the case of the multiple amygdala-thalamus loops assumed to underpin fear conditioning for instance (LeDoux, 2000; Maren, 2001). Interacting local field potentials (LFPs) may also be modeled as both direct (by long range phase-locking, e.g. in olfactory systems (Friedrich et al., 2004)) or indirect measurements of coordination and interaction among large assemblies of neurons. The learning dynamics may alternatively model smaller ensembles of individual neurons, as in primary motor cortex, though we do not emphasize biological realism in our models at this scale. Nevertheless, one may still draw useful conclusions as to the role of redundancy and information sharing. The error too may be treated at different scales, and may take the form of noise intrinsic to the neural environment (Faisal et al., 2008) on a large, aggregate scale (e.g. in the case of LFPs) or on a small scale involving localized populations of neurons.

If there is noise corrupting the learning process, an immediate question is whether it is possible to gauge the accuracy of the predictions of the learned function, and to what extent the organism can reduce uncertainty in its decisions by taking advantage of a simple, common information sharing mechanism. If there is redundancy in the form of multiple independent copies of the dynamical circuit (Adams, 1998; Fernando et al., 2010), it is reasonable to expect that averaging over the different solutions might reduce noise via cancellation effects. In the case of learning in the brain, however, this approach is problematic because neurons are susceptible to saturation of their firing rates, and on large scales aggregate signal amplitudes will also saturate; the macroscopic dynamics that neuron populations and assemblies obey can be strongly nonlinear. When the dynamics followed by different dynamical systems are nonlinear, one cannot expect to gain a meaningful signal by linear averaging (see e.g. (Tabareau et al., 2010) and examples therein). As a simple illustration of this phenomenon, consider a collection of noisy sinusoidal oscillators allowed to run starting from different initial conditions, with identical frequencies and independent noise terms. The oscillators will be out of phase from each other, so an average over the trajectories will not yield anything close to a clean version of a sinusoid at the desired frequency.
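This failure of naive averaging is easy to reproduce numerically. The following minimal sketch (our illustration; amplitude, frequency, noise level and oscillator count are arbitrary choices, not quantities from the paper) averages fifty out-of-phase noisy sinusoids:

```python
import numpy as np

# Sketch: averaging out-of-phase noisy oscillators destroys the signal.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 2000)
omega = 2.0 * np.pi                                  # common frequency (1 Hz)
phases = rng.uniform(0.0, 2.0 * np.pi, size=50)      # different initial conditions
noise = 0.3 * rng.standard_normal((50, t.size))      # independent noise per oscillator
trajectories = np.sin(omega * t[None, :] + phases[:, None]) + noise

average = trajectories.mean(axis=0)
# The average has amplitude far below the unit amplitude of each oscillator,
# and carries no clean oscillation at the common frequency.
print(np.abs(average).max())
```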
On the other hand it is reasonable to suppose that synchronization across neuron populations or between macro-scale cortical loops may provide sufficient phase alignment to make linear averaging, and thus "consensus", a powerful proposal for reducing the effects of noise (Tabareau et al., 2010; Masuda et al., 2010; Cao et al., 2010; Young et al., 2010; Poulakakis et al., 2010; Gigante et al., 2009). Indeed, it is well known that synchrony within a system of coupled dynamical elements provides (quantifiable) robustness to perturbations occurring in any one element's dynamics (Needleman et al., 2001; Wang and Slotine, 2005; Pham and Slotine, 2007).

We will place much emphasis on exploring quantitatively the role of synchrony in controlling uncertainty arising from noise modeling neural error. In particular, we base our work on the argument that noisy, nonlinear trajectories can be linearly averaged if fluctuations due to noise can be made small, and that fluctuations can be made small by coupling the dynamical elements appropriately. In the stochastic setting adopted here, "synchronization" refers to state synchrony: the tendency for individual elements' trajectories to move towards a common trajectory, in a quantifiable sense. The estimates we present directly characterize the tradeoff between the network's tendency towards synchrony and the noise, and ultimately address the specific role this tradeoff plays in determining uncertainty surrounding a function learned by an imperfect learning system. We further show how and where the topology of the network of neural ensembles impacts the extent to which the noise, and therefore uncertainty, can be controlled. The estimates we provide take into account in a fundamental way both the nonlinearity in the dynamics and the noise. More generally, the work discussed here also has implications in other related domains, such as networks of coupled learners or adaptive sensor networks, and can be extended to multitask online or dynamic learning settings. The difficulty inherent in analyzing dynamic learning systems, such as hierarchies with feedback (Mumford, 1992; Lee and Mumford, 2003), poses a challenge. But considering dynamic systems can yield substantial benefits: transients can be important, as suggested by the literature on regularization paths and early stopping (Yao et al., 2007). Furthermore, the role of feedback/backprojections and attention-like mechanisms in learning and recognition systems, both biological and artificial, is known to be important but is not well understood (Hahnloser et al., 1999; Itti and Koch, 2001; Hung et al., 2005).

The paper is organized as follows. In Section 2 we consider a specific learning problem and define a system of stochastic differential equations (SDEs) modeling a simple dynamic learning process. We then discuss stability and network topology in the context of synchronization. In Section 3 we present the main theoretical results of the paper, a set of uncertainty estimates, postponing proofs until later. Then in Section 5 we provide simulations and compare empirical estimates to the theoretical quantities predicted by the Theorems in Section 3. Section 4 provides a discussion addressing the significance and applicability of our theoretical contributions to neuroscience and behavior. Finally, in Section 6 we give proofs of the results stated in Section 3.
2 A Dynamical Model of Learning

2.1 Learning as Noisy Gradient Dynamics

The learning process we will model is that of a one-dimensional linear fitting problem described by gradient based minimization of a square loss objective, in the spirit of Rao & Ballard (Rao and Ballard, 1999). This is perhaps the simplest and most fundamental abstract learning problem that an organism might be confronted with: that of using experiential evidence to infer correlations and ultimately discover causal relationships which govern the environment and which can be used to make predictions about the future. The model realizing this learning process is also simple, in that we capture neural communication as an abstract process "in which a neural element (a single neuron or a population of neurons) conveys certain aspects of its functional state to another neural element" (Schnitzler and Gross, 2005). In doing so, we focus on the underlying computations taking place in the nervous system, rather than dwell on neural representations. Even this simple setting becomes involved technically, and is rich enough to explore all of the key themes discussed above. Our model also supports nonlinear decision functions in the sense that we might consider taking a linear function of nonlinear variables whose values might be computed upstream. In this case the development would be similar, but extended to the multidimensional case. The model may also be extended to richer function classes and more exotic loss functions directly, however for our purposes the additional generality does not yield significant further insight and furthermore might raise biological plausibility concerns, in the sense that one would have to carefully justify biologically, on a case-by-case basis, the particular nonlinearities going into a nonlinear decision function.

To make the setting more concrete, we begin by assuming that we have observed a set of input-output examples $\{x_i \in \mathbb{R}, y_i \in \mathbb{R}\}_{i=1}^m$, each representing a generic unit of sensory experience, and want to estimate the linear regression function $f_w(x) = wx$. Adopting the square loss, the total error incurred on the observations by $f_w$ is given by the familiar expression
$$E(w) = \frac{1}{2}\sum_{i=1}^m \big(y_i - f_w(x_i)\big)^2 = \frac{1}{2}\sum_{i=1}^m (y_i - w x_i)^2.$$
We will model adaptation (training) by a noisy gradient descent process, with biologically plausible dynamics, on this squared prediction error loss function. The trajectory of the slope parameter over time $w(t)$ and its governing dynamics may be represented in the biology in various forms. Stochastic rate codes, average activities in populations of neurons and population codes, localized direct electrical signals and chemical concentration gradients are some possibilities occurring across a range of scales. The dynamical system may also be interpreted as modeling the noisy, time-varying strength of a local field potential or other macro electrophysiological signal when there are multiple, interacting brain regions. We discuss these possibilities further in Section 4.

The gradient of $E$ with respect to the weight parameter is given by $\nabla_w E = -\sum_{i=1}^m (y_i - w x_i)x_i$, and serves as the starting point. The gradient dynamics $\dot w = -\nabla_w E(w)$ are both linear and noise-free. Following the discussion above, we modify these dynamics to capture nonlinear saturation effects as well as (often substantial) noise modeling error. Saturation effects lead to a saturated gradient which we model in the form of the hyperbolic tangent nonlinearity,
$$\dot w = -\tanh\big(a \nabla_w E(w)\big),$$
where $a$ is a slope parameter. Note that the saturated dynamics need not be interpretable as itself the gradient of an appropriate loss function. The fundamental learning problem is defined by the square loss, but it is implemented using an imperfect mechanism which imposes the nonlinearity. (Put differently, in our setting the nonlinearity is not part of the learning problem, and so the saturated gradient dynamics should not be viewed as the gradient of another error criterion.) The error is modeled with an additional diffusion (noise) term, giving the SDE
$$dw_t = -\tanh\big(a \nabla_w E(w_t)\big)\,dt + \sigma\,dB_t, \qquad (1)$$
where $dB_t$ denotes the standard 1-dimensional Wiener increment process and $\sigma > 0$. As mentioned before, this noise term $\sigma\,dB_t$ and corresponding error is due to intrinsic neuronal noise (Faisal et al., 2008) (aggregated or localized) and possible interference between large assemblies of neurons or circuits, and parallels the more general concept of measurement error in networks of coupled dynamical systems.

2.2 Redundancy and Coupling

We now consider the effect of having $n$ independent copies of the neural system or pathway implementing the dynamics (1), with associated parameters $\{w_1(t), \ldots, w_n(t)\}$. Since these dynamics are nonlinear, the effect of the noise cannot be reduced by simply averaging over the independent trajectories. However, if the circuit copies are coupled strongly enough they will attempt to synchronize, and averaging over the copies becomes a potentially powerful way to reduce the effect of the noise (Sherman and Rinzel, 1991; Needleman et al., 2001). The noise can be potentially large (we do not make any small-noise assumptions), and will of course act to break the synchrony. We will explore how well the noise can be reduced by synchronization and redundancy in the sections that follow.

Given $n$ diffusively coupled copies of the noisy neural system, and setting $a = 1$ in (1), we have the following system of nonlinear SDEs:
$$dw_i(t) = -\tanh\Big[\sum_{\ell=1}^m \big(w_i(t)x_\ell - y_\ell\big)x_\ell\Big]\,dt + \sum_{j=1}^n W_{ij}\big(w_j - w_i\big)\,dt + \sigma\,dB_t^{(i)} \qquad (2)$$
for $i = 1, \ldots, n$, where the $B_t^{(i)}$ are independent standard Wiener processes.
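Before turning to the interpretation of the couplings, here is a minimal simulation sketch of system (2), assuming all-to-all coupling; the values of n, κ, σ, the data, the time step, and the initial-condition interval are illustrative choices, not quantities prescribed by the paper at this point. It uses the same Euler-Maruyama scheme employed in Section 5:

```python
import numpy as np

# Sketch: Euler-Maruyama integration of the coupled system (2).
rng = np.random.default_rng(1)
m, n = 10, 20                         # observations, circuit copies
x = rng.standard_normal(m)
x /= np.linalg.norm(x)                # normalize so ||x|| = 1
w_true = 1.5
y = w_true * x + 0.1 * rng.standard_normal(m)

kappa, sigma, dt, T = 5.0, 1.0, 1e-3, 5.0
Wmat = kappa * (np.ones((n, n)) - np.eye(n))   # all-to-all coupling matrix
L = np.diag(Wmat.sum(axis=1)) - Wmat           # network Laplacian

w = rng.uniform(-5.0, 5.0, size=n)             # random initial conditions
for _ in range(int(T / dt)):
    grad = w * (x @ x) - (x @ y)               # sum_l (w_i x_l - y_l) x_l, per circuit
    drift = -np.tanh(grad) - L @ w             # saturated gradient plus diffusive coupling
    w = w + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)

print(w.mean(), (x @ y) / (x @ x))   # center of mass vs. least-squares solution w*
```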
The diffusive couplings here should be interpreted as modeling abstract intercommunication between and among different neural circuits, with coupling strengths $W_{ji} = W_{ij} = \kappa > 0$ when element $i$ is connected to element $j$ (and $W_{ij} = 0$ otherwise). Defining $(\mathbf{w})_i$ to be the (scalar) output of the $i$-th circuit, we can rewrite the system (2) in vector form as
$$d\mathbf{w}(t) = -\Big(\tanh\Big[\sum_{i=1}^m \big(\mathbf{w}(t)x_i - y_i\big)x_i\Big] + L\mathbf{w}(t)\Big)\,dt + \sigma\,d\mathbf{B}_t \qquad (3)$$
where $L = \operatorname{diag}(W\mathbf{1}) - W$ is the network Laplacian, and $\mathbf{B}_t$ is the standard $n$-dimensional Wiener process. The spectrum of the network Laplacian captures important properties of the network's topology, and will play a key role. Finally, the change of variable $X_t := \mathbf{w}_t\|x\|^2 - \langle x, y\rangle\mathbf{1}$, with $(x)_i = x_i$, $(y)_i = y_i$, yields a system that will be easier to analyze:
$$dX_t = -\big(\tanh(X_t)\|x\|^2 + LX_t\big)\,dt + \tilde\sigma\,d\mathbf{B}_t \qquad (4)$$
where we have defined $\tilde\sigma := \sigma\|x\|^2$ and the hyperbolic tangent applies elementwise. The unique globally stable equilibrium point for the deterministic part of (4) is seen to be $X^* = 0$, which checks with the fact that the solution to the linear regression problem is $w^* = \langle x, y\rangle/\langle x, x\rangle$ in this simple case.

2.3 Network Topology

The topology of a network of dynamical systems strongly influences synchronization, including the rate at which elements synchronize and whether sync (or the tendency to sync) can occur at all in the first place. Thus the pattern of interconnections among neural systems plays an important role in controlling uncertainty by way of synchronization properties. In a network of stochastic systems of the general (diffusive) type described in Section 2, topology can be seen to influence the robustness of synchrony to noise through the spectrum of the network Laplacian. Laplacians arising in various interesting networks and applications have received much attention, both in biological decision making and in the context of synchronization of dynamical systems more generally (Kopell and Ermentrout, 1986; Kopell, 2000; Jadbabaie et al., 2003; Wang and Slotine, 2005; Taylor et al., 2009; Poulakakis et al., 2010).

We will consider four important network graphs here, and these arrangements will be helpful examples to keep in mind when interpreting the results given in Section 3. The simplest graph of coupled elements is perhaps the full, all-to-all graph. As one may guess, this network is also the easiest to synchronize since each element can speak directly to the others. The spectrum of the network Laplacian $\lambda(L)$ for this graph shows why it might be especially effective for reducing uncertainty in the context of Equation (3). With uniform coupling strength $\kappa > 0$ and $n$ denoting the number of elements in the network, one can check that $\lambda(L) = \{0, n\kappa, \ldots, n\kappa\}$. Denote by $\lambda_-$ the smallest non-zero (Fiedler) eigenvalue, and by $\lambda_+$ the largest eigenvalue. Here $\lambda_- = \lambda_+ = n\kappa$, and it is these eigenvalues that control synchronization for any given network. As we will show below in Theorem 3.1, the effect of the noise can be reduced particularly quickly precisely because the non-zero eigenvalues depend on both parameters, $\kappa$ and $n$.

[Figure 1: Examples of (undirected) network graphs.]

If fewer connections are made in the network it becomes harder to synchronize, and we move away from the all-to-all ideal. Figure 1 shows some other common network graphs. The undirected ring graph, appearing in the middle, has spectrum $\lambda_i(L) = 2\kappa\big[1 - \cos\big(\frac{2\pi}{n}(i-1)\big)\big]$, $i = 1, \ldots, n$. If the single edge connecting the first and last elements is removed to make a chain as shown on the left in the Figure, the network becomes considerably harder to synchronize (Kopell and Ermentrout, 1986), although the spectrum of the chain looks similar: $\lambda_i(L) = 2\kappa\big[1 - \cos\big(\frac{\pi}{n}(i-1)\big)\big]$. This makes intuitive sense because information is constrained to flow through only one path, and with possibly significant delays. Finally, the star graph shown on the right in the Figure has spectrum $\lambda(L) = \{0, \kappa, \ldots, \kappa, n\kappa\}$, and we can see that the key Fiedler eigenvalue $\lambda_- = \kappa$ does not grow with the size of the network $n$. The Theorems in Section 3 then predict that it will be impossible to increase the synchronization rate simply by incorporating more copies of the neural circuit. The coupling strength must also increase to make fluctuations from the common trajectory (synchronization subspace) small. We will discuss this case in more detail. As might be particularly relevant to brain anatomy, random graphs and directed graphs may also be considered, and have been studied extensively (Bollobas, 2001).

In neuroscience-related models, each connection in a network has an associated biophysical cost in terms of energy and space requirements. The all-to-all network, with $n^2$ connections among $n$ circuits or neurons, is often criticized as being biologically unrealistic because of this cost. However, it has been noted that all-to-all connectivity can be implemented with $2n$ connections using quorum sensing ideas (Taylor et al., 2009), wherein a global average is computed and shared. The global average is computed given inputs from all $n$ elements, and this average is sent back to each circuit via another $n$ connections. The shared variable may be communicated by synapses, or sensed chemically or electrically. Although quorum sensing cannot realize an arbitrary set of $n^2$ connections, the global average may be a weighted average or there may be several common variables organized hierarchically. This allows for a rich set of networks with $O(n)$ connectivity which behave more like networks with all-to-all connectivity for synchronization and stability purposes. Furthermore, dynamics in the computation of the quorum variable itself, when appropriate for modeling purposes, does not necessarily pose any special difficulty for establishing synchronization properties if virtual systems are used (Russo and Slotine, 2010).

The difficulty with which synchrony may be imposed can be "normalized" by the number of connections in many cases to obtain a comparison between synchronization properties of various graphs that takes biological cost into account. Using quorum variables where appropriate, graphs whose spectra depend on $n$ are thus roughly comparable on equal biological terms. Cost-normalized comparisons of synchronization properties are not always possible or meaningful, however.
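The spectra quoted above are easy to verify numerically. The following sketch (our illustration; the size n and strength κ are arbitrary) builds the four Laplacians and prints λ− and λ+:

```python
import numpy as np

def laplacian(A):
    # L = diag(A 1) - A for a symmetric weighted adjacency matrix A
    return np.diag(A.sum(axis=1)) - A

n, kappa = 8, 2.0
full = kappa * (np.ones((n, n)) - np.eye(n))
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = kappa
    ring[(i + 1) % n, i] = kappa
chain = ring.copy()
chain[0, n - 1] = chain[n - 1, 0] = 0.0   # remove the edge closing the ring
star = np.zeros((n, n))
star[0, 1:] = kappa
star[1:, 0] = kappa

for name, A in [("full", full), ("ring", ring), ("chain", chain), ("star", star)]:
    ev = np.sort(np.linalg.eigvalsh(laplacian(A)))
    print(name, "lambda_-:", ev[1].round(3), "lambda_+:", ev[-1].round(3))
# full: lambda_- = lambda_+ = n*kappa; star: lambda_- = kappa, lambda_+ = n*kappa
```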
Consider, for example, the ring and chain networks introduced above. There is a difference of only one edge between the two, but in the noise-free setting, for example, the chain requires asymptotically four times more effort to synchronize than the ring architecture (see e.g. (Wang and Slotine, 2005), Example 4.5).

2.4 Stability

We turn to analyzing the stability of the nonlinear system given by Equation (4). We will argue that this is difficult for two reasons: the presence of noise, and the fact that the (noise-free) dynamics saturate in magnitude. Indeed, without additional assumptions, one cannot in general show that the system is globally exponentially stable. A common method for studying the stability properties of a noiseless nonlinear dynamical system is via Lyapunov theory (Slotine and Li, 1991), however in the presence of noise system trajectories along the Lyapunov surface may not be strictly decreasing. Contraction analysis (Lohmiller and Slotine, 1998; Wang and Slotine, 2005) is a differential formalism related to Lyapunov exponents, and captures the notion that a system is stable in some region if initial conditions or temporary disturbances are forgotten. If all neighboring trajectories converge to each other, global exponential convergence to a single trajectory can be concluded:
Definition 2.1 (Contraction). Given the system equations $\dot{\mathbf{x}} = f(\mathbf{x}, t)$, a region of the state space is called a contraction region if the Jacobian $J_f = \frac{\partial f}{\partial \mathbf{x}}$ is uniformly negative definite in that region. Furthermore, the contraction rate is given by $\beta$, where $\frac{1}{2}\big(J_f + J_f^\top\big) \preceq -\beta I \prec 0$.

An analogous definition in the case of stochastic dynamics has also been developed (Pham et al., 2009), and requires contraction of the noise-free dynamics as well as a uniform upper bound on the variance of the noise. However, for the system (4) the Jacobian is found to be
$$J(\mathbf{w}) = \|x\|^2\,\operatorname{diag}\big(\tanh^2(\mathbf{w}) - \mathbf{1}\big) - L,$$
so that the eigenvalues satisfy $\lambda\big(J(\mathbf{w})\big) < -\lambda_{\min}(L) = 0$ pointwise, but with no uniform negative bound. The subspace of constant vectors is a flow-invariant subspace, and $L$ does not contribute to the dynamics in this flow-invariant space since $L$ has a zero eigenvalue corresponding to its constant eigenvector. This difficulty can arise whenever one considers diffusively coupled elements, and in such cases the usual way around it is to work with an auxiliary or virtual system (as in e.g. (Pham and Slotine, 2007)) and study contraction to the flow-invariant subspace starting from initial conditions outside. However, since $\tanh'(x) = 1 - \tanh^2(x)$, we are still left with the difficulty that the noise-free dynamics can have a convergence rate to equilibrium arbitrarily close to zero as one travels far out into the tails of the tanh function; the system is not necessarily contracting. Indeed, for any saturated dynamics $\tanh\big(f(\mathbf{x}, t)\big)$, the rate can be arbitrarily small. Thus one cannot easily determine the rate of convergence to equilibrium using standard techniques. The analysis which we provide in the succeeding sections will attempt to get around these difficulties by separately exploring the system's behavior in and out of the flow-invariant (synchronization) subspace of constant vectors.
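The following sketch (our illustration; graph, coupling, and state scales are arbitrary choices) makes the degeneracy concrete: the largest eigenvalue of $J(\mathbf{w})$ approaches zero as the state moves into the tails of the tanh nonlinearity.

```python
import numpy as np

# Sketch: J(w) = ||x||^2 diag(tanh(w)^2 - 1) - L is negative definite pointwise,
# but its largest eigenvalue climbs toward 0 for large |w|: no uniform contraction rate.
rng = np.random.default_rng(2)
n, kappa = 10, 1.0
A = kappa * (np.ones((n, n)) - np.eye(n))
L = np.diag(A.sum(axis=1)) - A
x_norm_sq = 1.0

for scale in [0.1, 1.0, 10.0, 100.0]:
    w = scale * rng.standard_normal(n)
    J = x_norm_sq * np.diag(np.tanh(w) ** 2 - 1.0) - L
    print(scale, np.linalg.eigvalsh(J).max())   # approaches 0 as the scale grows
```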
3 Main Results

In this section we present and interpret the main results of the paper. The argument we put forward is that noisy, nonlinear trajectories can be linearly averaged to reduce the noise if fluctuations due to noise can be made small. We show that the fluctuations can be made small by coupling the dynamical systems, and that one can precisely control the size of the fluctuations. In particular, we give estimates which show that the tradeoff between noise and coupling strength among neural circuits determines the amount of uncertainty surrounding the decisions made by the neural system. Proofs of the Theorems are postponed until Section 6.

3.1 Preliminaries

We begin by decomposing the stochastic process $\{X_t \in \mathbb{R}^n\}_{t \ge 0}$ into a sum describing fluctuations about the center of mass. Let $P = I - (1/n)\mathbf{1}\mathbf{1}^\top$, the canonical projection onto the zero-mean subspace of $\mathbb{R}^n$, and define $Q = I - P$. Then for all $t \ge 0$, $X_t = PX_t + QX_t$. Clearly, $\ker P = \operatorname{im} Q$ is the subspace of constant vectors. We will adopt the notation $\tilde X_t$ for $PX_t$, and $\bar X_t$ for the scalar average $\frac{1}{n}\mathbf{1}^\top X_t$, so that $QX_t = \bar X_t\mathbf{1}$ (along with the analogous notation $\tilde{\mathbf{w}}_t$ and $\bar w_t$), and derive expressions for these quantities based on Equation (4). The macroscopic variable $\bar X_t$ satisfies
$$d\bar X_t = \tfrac{1}{n}\mathbf{1}^\top dX_t = -\tfrac{\|x\|^2}{n}\mathbf{1}^\top \tanh(X_t)\,dt + \tfrac{\tilde\sigma}{\sqrt{n}}\,dB_t \qquad (5)$$
(with $B_t$ a scalar standard Wiener process), and thus
$$d\tilde X_t = dX_t - \mathbf{1}\,d\bar X_t = -\Big(\tanh(X_t)\|x\|^2 + LX_t - \tfrac{\|x\|^2}{n}\mathbf{1}\mathbf{1}^\top\tanh(X_t)\Big)dt + \tilde\sigma\,d\mathbf{B}_t - \tfrac{\tilde\sigma}{\sqrt{n}}\mathbf{1}\,dB_t. \qquad (6)$$
In terms of the original variable $\mathbf{w}$, the fluctuations $\tilde{\mathbf{w}}_t$ are purely due to the noise, while $\bar w_t$ parameterizes the average decision function. As the decision function we consider is linear, the uncertainty in the decisions is directly equivalent to uncertainty in the parameter $w$. We will study the evolution of both the mean and the fluctuation processes over time, however to assess uncertainty the central quantity of interest will be the size of the ball containing the fluctuations (the "microscopic" variables). We characterize the magnitude of the fluctuations via the squared norm process satisfying
$$d\|\tilde X_t\|^2 = -2\Big(\|x\|^2\langle \tilde X_t, \tanh(X_t)\rangle + \langle \tilde X_t, LX_t\rangle\Big)dt + (n-1)\tilde\sigma^2\,dt + 2\tilde\sigma\|\tilde X_t\|\,dB_t \qquad (7)$$
which follows from (6) by applying Ito's Lemma to the function $h(\tilde X_t) = \langle \tilde X_t, \tilde X_t\rangle$ and the fact that $\langle \tilde X_t, d\mathbf{B}_t\rangle = \|\tilde X_t\|\,dB_t$ in law.

The first, and central, result says that the ball centered at $\bar w$ (the center of mass) containing the fluctuations can be controlled in expectation from above and below by the coupling strength and, in most cases, the number of circuit copies, via the spectrum of the network Laplacian $L$. We note that lower bounds are typically ignored in the dynamical systems literature, possibly because they are less important for stability analyses. We have found, however, that such bounds can be derived in the case of saturated gradient dynamics, and that control from below can yield further insight into the present problem of neural learning. Let $\lambda_+$ be the largest eigenvalue of $L$, and let $\lambda_-$ be the smallest non-zero eigenvalue of $L$.

Theorem 3.1 (Fluctuations can be made small). After transients of rate $\lambda_-$,
$$\frac{(n-1)\sigma^2}{2\lambda_+}\left(1 - \frac{\|x\|^2}{\lambda_-}\right) \le \mathbb{E}\big\|\tilde{\mathbf{w}}(t)\big\|^2 \le \frac{(n-1)\sigma^2}{2\lambda_-}$$
where $\tilde{\mathbf{w}} = P\mathbf{w}(t)$.

Clearly the lower bound is informative only when $\lambda_- > \|x\|^2$. While we do not explicitly assume any particular bound on the size of the examples $\|x\|$, it is reasonable that $\lambda_- \gg \|x\|^2$ since $\lambda_-$ can depend on the number of circuits $n$ and will always depend on the coupling strength $\kappa$, which can be large. Large coupling strengths can be found in a variety of circumstances, particularly in the case of motor control circuits (Grandhe et al., 1999; Kiemel et al., 2003) for example.
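To make the statement concrete, consider the all-to-all graph, for which $\lambda_- = \lambda_+ = n\kappa$. Taking $\|x\| = 1$, the bounds specialize (a worked instance we add for illustration; the parameter values match Table 1 of Section 5) to
$$\frac{(n-1)\sigma^2}{2n\kappa}\left(1 - \frac{1}{n\kappa}\right) \;\le\; \mathbb{E}\|\tilde{\mathbf{w}}(t)\|^2 \;\le\; \frac{(n-1)\sigma^2}{2n\kappa},$$
so with $n = 20$, $\kappa = 5$, $\sigma = 10$ this gives $9.405 \le \mathbb{E}\|\tilde{\mathbf{w}}(t)\|^2 \le 9.5$: both the fluctuation level and the gap between the bounds shrink like $1/(n\kappa)$.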
In the next Theorem we give the variance of the fluctuations via a higher moment of $\|\tilde{\mathbf{w}}\|$. This result makes use of the lower bound in Theorem 3.1, and leads to a result that gives control of the fluctuations in probability rather than in expectation.

Theorem 3.2 (Variance of the trajectory distances to the center of mass). After transients of rate $\lambda_-$,
$$\operatorname{var}\big(\|\tilde{\mathbf{w}}(t)\|^2\big) \le \left(\frac{(n-1)\sigma^2}{\lambda_-}\right)^2\left(\frac{1}{2} + \frac{1}{n-1}\right) - \left(\frac{(n-1)\sigma^2}{2\lambda_+}\right)^2\left(1 - \frac{\|x\|^2}{\lambda_-}\right)^2.$$

Chebyshev's inequality combined with Theorem 3.2 immediately gives the following Corollary.
Corollary 3.1.
After transients of rate $\lambda_-$,
$$\mathbb{P}\Big[\,\big|\,\|\tilde{\mathbf{w}}(t)\|^2 - \mathbb{E}\|\tilde{\mathbf{w}}(t)\|^2\,\big| \ge \varepsilon\Big] \le \frac{\operatorname{var}\big(\|\tilde{\mathbf{w}}(t)\|^2\big)}{\varepsilon^2}. \qquad (8)$$

Since any connected network graph has non-trivial eigenvalues which depend on the uniform coupling strength $\kappa$, we see that for fixed $n$, as $\kappa \to \infty$, $\operatorname{var}\big(\|\tilde{\mathbf{w}}(t)\|^2\big) \to 0$. In the case of the all-to-all network topology, for example, the eigenvalues of $L$ depend on both $n$ and $\kappa$, so that $\operatorname{var}\big(\|\tilde{\mathbf{w}}(t)\|^2\big) = O(\kappa^{-2})$, giving a power law decay of order $O(\kappa^{-2}\varepsilon^{-2})$ on the right hand side of Equation (8) in Corollary 3.1.
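This power-law decay can be tabulated directly (our sketch; the values of n, σ and ε are arbitrary choices, ‖x‖ = 1 is assumed, and the variance bound is that of Theorem 3.2):

```python
import numpy as np

# Sketch: Chebyshev tail bound of Corollary 3.1 for the all-to-all graph.
n, sigma, eps = 20, 10.0, 5.0

def var_bound(lam_minus, lam_plus):
    a = ((n - 1) * sigma**2 / lam_minus) ** 2 * (0.5 + 1.0 / (n - 1))
    b = ((n - 1) * sigma**2 / (2 * lam_plus)) ** 2 * (1 - 1.0 / lam_minus) ** 2
    return a - b

for kappa in [1.0, 10.0, 100.0, 1000.0]:
    lam = n * kappa                      # lambda_- = lambda_+ = n*kappa (all-to-all)
    tail = min(1.0, var_bound(lam, lam) / eps**2)
    print(kappa, tail)                   # decays like kappa^-2 once below 1
```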
Finally, we turn to estimating in expectation the steady-state average distance between the trajectories of the circuit copies and the noise-free solution. As we have argued in Section 2.4, the rate of convergence to equilibrium of the trajectories $w_i(t)$ can be arbitrarily small. Although from the Theorems above the fluctuations can be made small, one cannot in general make a similar statement about the center of mass process $\bar w_t$ unless assumptions about the initial conditions are made (and by extension, the same holds true for the trajectories $w_i(t)$). Such an assumption would lead to control over the contribution of the tanh terms, and establishes a lower bound on the contraction rate. Rather than make a specific assumption, however, we state a general result. We again provide a lower bound, this time following from the law of large numbers governing sums of i.i.d. Gaussian random variables and the lower bound on the fluctuations provided by Theorem 3.1.

Theorem 3.3 (Average distance to the noise-free trajectory). Denote by $w^*$ the minimizer of the squared-error objective $E(w)$. After transients of rate $\lambda_-$,
$$\frac{\sigma^2}{n} + \left[\frac{(n-1)\sigma^2}{2n\lambda_+}\left(1 - \frac{\|x\|^2}{\lambda_-}\right)\right]_+ \le \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \big(w_i(t) - w^*\big)^2\right] \le \frac{\sigma^2}{2\lambda_-} + \mathbb{E}\big[(\bar w_t - w^*)^2\big]$$
where $[\,\cdot\,]_+ \equiv \max(0, \cdot)$.

Theorem 3.3 says that average closeness of the noisy system to the noise-free optimum is controlled by the tradeoff between the noise and the coupling strength, and by the number of circuit copies $n$. The former controls in large part the magnitude of the fluctuations, as discussed above. The latter quantity is the unavoidable linear averaging component, and can be brought to zero only as fast as the law of large numbers allows, $O(n^{-1/2})$ at best. For fixed $n$, as $\lambda_- \to \infty$ the upper and lower bounds coincide, since $\mathbb{E}\big[(\bar w(t) - w^*)^2\big] \to \sigma^2/n$. As both $n \to \infty$ and $\kappa \to \infty$, Theorem 3.3 confirms that $w_i(t) \to w^*$ in expectation. If the fluctuations are not made small, however, linear averaging will be wrong, and the error will of course be greater. Just how bad linear averaging is when the fluctuations are allowed to be large is described in large part by the maximum curvature of the noise-free dynamics. (One way to see this is to take the first-order Taylor expansion of the dynamics with integral remainder. The remainder term can be upper bounded by the spectral radius of the Hessian matrix, which is related to curvature; see e.g. (Tabareau et al., 2010).)

Finally, we note that the estimates above depend on the number of samples $m$ only through the norm of the examples $x$, and it is reasonable to assume that this quantity may be appropriately normalized based on the maximum values conveyed by subsystems or rates of neurons comprising the circuit in the case of population or rate codes, or maximum field strengths in the case of LFPs. However, the requirement that the organism must collect $m$ observations before learning can proceed is not essential. We may also consider the online learning setting, where data are observed sequentially and updates to the parameters $(\mathbf{w})_i$ are made separately on the basis of each observation in temporal order. The analysis above studies convergence to and distance from the solution in the steady state, whatever that solution may be, given $m$ pieces of evidence. Thus the online setting can also be considered as long as the time between observations is longer than the transient periods. Indeed, in many scenarios learning and decision making processes in the brain can take place on short time scales relative to the time scale on which experience is accumulated. In this case, when another piece of information arrives, the system moves to a region defined (stochastically) around a new steady-state. A complication can arise when the new point arrives during the transient period of the previous learning process, before the system has had a chance to settle, on average, into the new equilibrium; however, we do not attempt to model this situation here.

4 Discussion

The estimates given in Section 3 quantify the tradeoff between the degree of synchronization and the noise (error), and the role this tradeoff plays in determining the uncertainty of a decision function learned by way of a stochastic, nonlinear dynamics. Estimates both in expectation and in probability were derived. We showed how and where both the coupling strength and the topology of the network of neural ensembles impact the extent to which the noise, and therefore uncertainty due to error, can be controlled. In particular, for most networks (see Section 2.3) the effect of the noise can be reduced by either increasing the coupling strength or the number of redundant systems (or both), leading to a steady-state solution that is going to be closer to the ideal, error-free solution with high probability. From a technical standpoint, this is because fluctuations about the common trajectory are exactly the way in which the noise enters the picture; when the fluctuations are made small, the error is made small. In this way an organism may mitigate error imposed by a noisy, imperfect learning apparatus and solve a learning task to greater accuracy. Furthermore, synchronization and redundancy can both improve the speed of learning, in the sense that the rate of convergence to the steady state solution also depends on these mechanisms. Each of the bounds presented in Section 3 above holds after transient terms of order $e^{-t\lambda_-}$ vanish, where $\lambda_-$ is the smallest non-zero eigenvalue of the network Laplacian. For any stable connected network, strong coupling strengths directly improve convergence rates to the steady-state, as seen by the dependence of $\lambda_-$ on $\kappa$. In the case of all-to-all (including approximately all-to-all and many random graphs), $\lambda_- = O(n\kappa)$ so that both increased redundancy and sync will improve the speed of learning.

Our overarching goal has been to explore quantitatively the role of redundancy and synchronization in reducing error incurred by a stochastic, non-ideal learning and decision making neural process. We have gone about this by considering a model which emphasizes the underlying computations rather than specific neural representations.
Synchronization has been suggested, over a diverse history of experimental work, as a fundamental mechanism for improvement in precision and reduction of uncertainty in the nervous system (see e.g. (Needleman et al., 2001; Enright, 1980)). Redundancy too is an important and commonly occurring mechanism. In retinal ganglion cells (Croner et al., 1993; Puchalla et al., 2005) and heart cells (Clay and DeHaan, 1979) the spatial mean across coupled cells cancels out noise. Populations of hair cells in otoliths perform redundant, collaborative computations to achieve robustness (Kandel et al., 2000; Eliasmith and Anderson, 2004), and it has been suggested that multiple cortical (amygdala-thalamus) loops contribute to fear response/conditioning, and emotion more generally (LeDoux, 2000). With motor tasks such as reaching or standing, it has been argued that planning and representation occur at least partially in redundant coordinate systems and involve redundant degrees of freedom (Scholz and Schoner, 1999). Todorov (Todorov, 2008) maintains that redundancy and noise combine to give rise to optimal muscle control policies, raising the interesting possibility that in some cases the impact of the noise may need to be adjusted but not necessarily eliminated altogether. On a more localized scale, reach direction has also been found to be conveyed by populations of neurons with overlapping tuning curves (Georgopoulos et al., 1982), where synchrony within such populations plays an important role (Grammont and Riehle, 1999). Multiple sensorimotor transformations involving disparate brain regions may be at play in the parietal cortex, where redundant sensory inputs from multiple modalities must be mapped into motor responses (Ting, 2007; Pouget and Sejnowski, 1997). In the ascending auditory pathway, varying degrees of redundancy have been noted, and contribute to the robust representation of frequency and more complex auditory objects (Chechik et al., 2006). Ensemble measurements have also been connected to behavior and have been suggested as inputs to brain-machine interfaces, while in stochastic neural decision making it has been suggested that it is the collective behavior across multiple populations of neurons that is responsible for perception and decision making, rather than the activity of a single neuron or population of neurons (Gigante et al., 2009).

In these examples and more generally, we suggest that redundancy plus feedback synchronization is a mechanism which may be used to improve the accuracy, robustness and speed of a learning process involving the relevant respective brain areas. This is separate from, and in contrast with, redundancies which are harnessed to specifically increase storage capacity, as in the case of associative memory models (Hertz et al., 1991). There, robustness to corruption is also achieved (via pattern completion dynamics) but the degree of robustness must be traded off against capacity. The primary function of such populations of neurons is ostensibly to store and retrieve memory patterns rather than to implement adaptive, learning dynamics while eliminating noise.

Another theme emerging from these instances of sync and redundancy is that key computations may be seen as implemented by distant brain regions coupled together by way of long-distance projections and network "hubs". Recent experimental observations in
C. elegans cast this interpretation in a developmental light (Varier and Kaiser, 2011), and suggest that such interactions occur from an early stage in life and are important for normal development in even simple organisms. Learning processes realized by such computations and interactions are certainly susceptible to noise, and must cope with this noise one way or another. We suggest that synchronization and redundancy are not only present and possible, but provide a ready, natural solution.

The ability to learn and make decisions reliably in the presence of uncertainty is of fundamental importance for survival of any organism. This uncertainty can be seen to arise from three distinct sources, and the approach discussed here treats only the first two: intrinsic neuronal noise, both local and in aggregate, and noise in the form of measurement error, under which we include error due to limitations in precision and nonlinearity in biological systems. A third and equally important source of error is that of uncertainty in the inference process itself (Yang and Shadlen, 2007; Kiani and Shadlen, 2009). This uncertainty is specific to and inherent in the decision problem and is characterized by the posterior distribution over decisions given the experiential evidence. Our work only considers uncertainty beyond that of the inference process, and as such is one part of a larger puzzle. We argue that intrinsic noise is both experimentally and theoretically important, and involved enough technically, to be addressed in isolation, while holding all other variables constant. Indeed, intrinsic noise intensities can be large. The role of the network's topology and coupling mechanism also strongly influences the overall picture, often in surprising or subtle ways. But it is also possible that the methods recruited here can be applied towards understanding some aspect of the inference error if different inferences from the same observations can be made by different "expert" circuits, each with their own biases. Then averaging, nonlinearity and the uncertainty could potentially be treated in a similar framework.
Asymptotic stability of the stochastic system considered here is guaranteed as long as there is coupling. In general, if the dynamics of a stochastic system are contracting or can be made contracting with feedback, then combinations (e.g. parallel, serial, hierarchical) of such systems will be contracting (Pham et al., 2009; Lohmiller and Slotine, 1998). In the present setting, the system governing the fluctuations about the mean trajectory is contracting with a rate dependent on the coupling strength and the noise variance. Thus combinations of learning systems of the general type considered here can enjoy strong stability guarantees automatically, since the individual systems are contracting.

Finally, we have assumed throughout that the errors affecting the collection of redundant neural circuits or systems are mutually independent. This is not an unreasonable modeling assumption: For large-scale learning processes involving different brain areas, noise imposed by local spike irregularities is largely unrelated to noise present in distant circuits. Within small populations of neurons, it is likely that dependence among intrinsic neuronal noise sources decays rapidly in space, so that nearest neighbors may experience somewhat correlated noise, but beyond this are not significantly impacted by other members of the population. As noises in a biological environment can never be fully dependent (whether due to thermal or chemical-kinetic factors, or otherwise), partial dependence among noise inputs may be explicitly modeled as, for example, mixing processes if desired (Doukhan, 1994). Estimates of the form discussed here would then be augmented with mixing terms, leading to results which make identical qualitative statements about the role of redundancy and sync. Fluctuations, and the effect of the noise, would still be reducible but would require larger coupling strengths or more redundancy compared to what would be necessary if the noise sources were independent.
Figure 2: (Left) Typical simulated trajectories for coupled and uncoupled networks driven by thesame noise. (Right) Population average trajectories for the coupled and uncoupled systems.
5 Simulations

To empirically test the estimates given in Section 3, we simulated several systems of SDEs given by Equation (4) using Euler-Maruyama integration (over time $t \in [0, 10]$ s, at regularly spaced sample points), for different settings of the parameters $n$ (number of circuits or elements), $\kappa$ (coupling strength) and $\sigma$ (noise standard deviation). Initial conditions were randomly drawn from a uniform distribution on an interval symmetric about zero, and we fixed $\|x\| = 1$ and the coupling arrangement to all-to-all coupling with fixed strength determined by $\kappa$. For simplicity the simulated systems had equilibrium point at zero, corresponding to $y = 0$, so that $\langle x, y\rangle = 0$ and $X^* = w^*$ (the change of variables is the identity map and we can identify $X_t$ with $\mathbf{w}_t$).

For comparison purposes we first show on the left in Figure 2 typical simulated trajectories of uncoupled (top) and coupled (bottom) populations when $n = 20$, $\kappa = 5$, $\sigma = 10$. Both populations are driven by the same noise and the same set of initial conditions, however each element is driven by noise independent from the others, as assumed above. From the units on the vertical axes, one can see that coupling clearly reduces inter-trajectory fluctuations as expected. On the right in Figure 2, we show the coupled/uncoupled populations' respective center of mass trajectories for this particular simulation instance. One can see from this figure that the average of the coupled system tends closer to zero ($X^*$), and is less affected by large noise excursions.

To empirically test tightness of the estimates given in Section 3, we repeated simulations of each respective system and averaged the relevant outcomes to approximate the expectations appearing in the bounds. Transient periods were excluded in all cases. In Tables 1 through 5 we show the values predicted by the bounds and the corresponding simulated quantities, for each respective triple of system parameter settings. Sample standard deviations of the simulated averages (expectations) are given in parentheses. In Figure 3 we show theoretical versus simulated expected magnitudes of the fluctuations $\mathbb{E}\|\tilde X_t\|^2$ when $n = 200$ and $\sigma = 10$ over a range of coupling strengths. The solid dark trace is the upper bound of Theorem 3.1, while the open circles are the average simulated quantities (separate simulations were run for each $\kappa$). Error bars are also given for the simulated expectations. Note that the magnitude scale ($y$-axis) is logarithmic, so the error bars are also plotted on a log scale. We omitted the lower theoretical bound from the plot because it is too close to the upper bound to visualize well relative to the scale of the bounds. Generally, the estimates relating to the magnitude of the fluctuations are seen to be tight, and the variance estimate is within an order of magnitude.

  Quantity              Lower Bound   Simulated            Upper Bound
  E‖X̃_t‖²               9.405         -                    9.500
  var(‖X̃_t‖²)           -             9.450 (std ≈ 14)     111.046
  (1/n) E‖X_t − X*‖²    5.470         -                    12.249 (std ≈ 22)

Table 1: Estimates vs. simulated quantities: n = 20, κ = 5, σ = 10.

  Quantity              Lower Bound   Simulated            Upper Bound
  E‖X̃_t‖²               11.281        -                    11.875
  var(‖X̃_t‖²)           -             14.261 (std ≈ 23)    184.45
  (1/n) E‖X_t − X*‖²    1.814         -                    1.946 (std ≈ 2)

Table 2: Estimates vs. simulated quantities: n = 20, κ = 1, σ = 5.

  Quantity              Lower Bound   Simulated            Upper Bound
  E‖X̃_t‖²               45.125        -                    47.500
  var(‖X̃_t‖²)           -             230.275 (std ≈ 373)  2951.234
  (1/n) E‖X_t − X*‖²    7.256         -                    14.784 (std ≈ 24)

Table 3: Estimates vs. simulated quantities: n = 20, κ = 1, σ = 10.

  Quantity              Lower Bound   Simulated            Upper Bound
  E‖X̃_t‖²               49.005        -                    49.500
  var(‖X̃_t‖²)           -             49.332 (std ≈ 70)    2598
  (1/n) E‖X_t − X*‖²    1.490         -                    1.449 (std ≈ 1)

Table 4: Estimates vs. simulated quantities: n = 100, κ = 1, σ = 10.

  Quantity              Lower Bound   Simulated            Upper Bound
  E‖X̃_t‖²               9.880         -                    9.900
  var(‖X̃_t‖²)           -             2.151 (std ≈ 3)      102.362
  (1/n) E‖X_t − X*‖²    1.099         -                    1.496 (std ≈ 1)

Table 5: Estimates vs. simulated quantities: n = 100, κ = 5, σ = 10.
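A minimal sketch of this protocol (our reconstruction, not the authors' code; the run count, time step, burn-in period, and initial-condition interval are illustrative) for the Table 1 setting:

```python
import numpy as np

# Sketch: repeat Euler-Maruyama runs of system (4) with ||x|| = 1 and compare the
# time-averaged E||X~_t||^2 with the Theorem 3.1 bounds.
rng = np.random.default_rng(3)
n, kappa, sigma = 20, 5.0, 10.0               # the setting of Table 1
dt, T, burn_in, runs = 1e-3, 5.0, 1.0, 100
A = kappa * (np.ones((n, n)) - np.eye(n))
L = np.diag(A.sum(axis=1)) - A                # lambda_- = lambda_+ = n*kappa = 100
P = np.eye(n) - np.ones((n, n)) / n           # projection onto zero-mean subspace

estimates = []
for _ in range(runs):
    X = rng.uniform(-5.0, 5.0, size=n)
    samples = []
    for step in range(int(T / dt)):
        X = X - (np.tanh(X) + L @ X) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
        if step * dt > burn_in:               # exclude transients
            samples.append(np.sum((P @ X) ** 2))
    estimates.append(np.mean(samples))

lam = n * kappa
print("simulated:", np.mean(estimates))
print("bounds:", (n - 1) * sigma**2 / (2 * lam) * (1 - 1 / lam),
      (n - 1) * sigma**2 / (2 * lam))
```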
Figure 3: Simulated vs. theoretical upper bound estimates of the fluctuations' expected magnitude over a range of coupling strengths $\kappa$. Here $n = 200$ circuits and $\sigma = 10$.

For the experiments with large noise amplitudes, the empirical estimates can appear to slightly violate the bounds where the bounds are tight, since the variance across simulations is large. The lower bound estimating the distance of the center of mass to the noise-free solution is also seen to be reasonably good. For comparison, we give the upper estimate where the empirical distance is substituted in place of the expectation, in order to show closeness to the lower bound. Theorem 3.3 predicts that the upper and lower estimates will eventually coincide if $\kappa$ and/or $n$ are chosen large enough.

6 Proofs

In this section we provide proofs of the results discussed in Section 3. We first introduce a key Lemma to be used in the development immediately below.
Lemma 6.1.
Let $P = I - (1/n)\mathbf{1}\mathbf{1}^\top$, the canonical projection onto the zero-mean subspace of $\mathbb{R}^n$. Then for all $x \in \mathbb{R}^n$,
$$0 \le \langle Px, \tanh(x)\rangle \le \|Px\|^2,$$
where the hyperbolic tangent applies elementwise.

Proof. Given $x \in \mathbb{R}^n$, define the index sets $I = \{1, \ldots, n\}$, $I_+ = \{i \in I \mid (Px)_i \ge 0\}$, and $I_- = I \setminus I_+$. Since $Px$ is zero mean, $\sum_{i \in I_+} (Px)_i = \sum_{i \in I_-} |(Px)_i|$. We will express the hyperbolic tangent as $\tanh(z) = 2s(2z) - 1$, where $s(z) = (1 + e^{-z})^{-1}$ is the logistic sigmoid function. If we let $\mu = \frac{1}{n}\mathbf{1}^\top x$ be the center of mass of $x$, then $(Px)_i = x_i - \mu \ge 0$ implies $s(2x_i) \ge s(2\mu)$ by monotonicity of $s$. Likewise, $(Px)_i < 0$ implies $s(2x_i) < s(2\mu)$. Finally, note that since $P^2 = P$ and $\mathbf{1} \in \ker P$,
$$\langle Px, \tanh(x)\rangle = \langle Px, P\big(2s(2x) - \mathbf{1}\big)\rangle = 2\langle Px, s(2x)\rangle.$$
Using these facts, we prove the lower bound first:
$$\langle Px, \tanh(x)\rangle = 2\Big(\sum_{i \in I_+} (Px)_i\, s(2x_i) - \sum_{i \in I_-} |(Px)_i|\, s(2x_i)\Big) \ge 2\Big(s(2\mu)\sum_{i \in I_+} (Px)_i - s(2\mu)\sum_{i \in I_-} |(Px)_i|\Big) = 2s(2\mu)\cdot 0 = 0.$$
Turning to the upper bound, since $\|Px\|^2 = \langle Px, x\rangle$, we prove the equivalent statement $\langle Px, 2s(2x) - x\rangle \le 0$. First, if $\mu = 0$, then $Px = x$, so
$$\langle Px, \tanh(x)\rangle = \langle x, \tanh(x)\rangle \le \|x\|\,\|\tanh(x)\| \le \|x\|^2 = \|Px\|^2,$$
since $\|\tanh(x)\| \le \|x\|$ by virtue of the fact that $|\tanh(z)| = \tanh(|z|) \le |z|$ for any $z \in \mathbb{R}$. Now suppose that $\mu > 0$. If $z \ge \mu > 0$, we can upper bound $2s(2z) = \tanh(z) + 1$ by the line tangent to the point $(\mu, 2s(2\mu))$: $2s(2z) \le mz + b$, with $m = 1 - \tanh^2(\mu) < 1$ and $b = 2s(2\mu) - m\mu > 0$. If $z < \mu$, we can take the lower bound $2s(2z) > (z - \mu) + 2s(2\mu)$, which holds because $2s(2z)$ has slope at most one everywhere. Using these estimates, we have that
$$\begin{aligned}
\langle Px, 2s(2x) - x\rangle &= \sum_{i \in I_+} (Px)_i\big(2s(2x_i) - x_i\big) + \sum_{i \in I_-} |(Px)_i|\big(x_i - 2s(2x_i)\big)\\
&\le \sum_{i \in I_+} (Px)_i\big(b - (1 - m)x_i\big) + \sum_{i \in I_-} |(Px)_i|\big(\mu - 2s(2\mu)\big)\\
&\le \sum_{i \in I_+} (Px)_i\big(b - (1 - m)\mu\big) + \sum_{i \in I_-} |(Px)_i|\big(\mu - 2s(2\mu)\big)\\
&= \Big(\sum_{i \in I_+} (Px)_i\Big)\big(b + m\mu - 2s(2\mu)\big) = 0.
\end{aligned}$$
The second inequality follows from the fact that $(1 - m) > 0$ and $x_i \ge \mu$ for $i \in I_+$. Since $\sum_{i \in I_+} (Px)_i = \sum_{i \in I_-} |(Px)_i|$, and recalling that by definition $b$ satisfies $m\mu + b = 2s(2\mu)$, the final equalities follow. If $\mu < 0$, then the proof is similar, taking the line tangent to the point $(\mu, 2s(2\mu))$ as a lower bound for $2s(2z)$ on $z \le \mu$ and the line $(z - \mu) + 2s(2\mu)$ as an upper bound for $z > \mu$.
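Before using the Lemma, a quick numerical spot-check (our sketch) of the two inequalities on random vectors:

```python
import numpy as np

# Sketch: verify 0 <= <Px, tanh(x)> <= ||Px||^2 on random inputs (Lemma 6.1).
rng = np.random.default_rng(4)
n = 50
P = np.eye(n) - np.ones((n, n)) / n
for _ in range(10000):
    x = 10.0 * rng.standard_normal(n)
    Px = P @ x
    inner = Px @ np.tanh(x)
    assert -1e-9 <= inner <= Px @ Px + 1e-9
print("no violations found")
```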
6.1 Proof of Theorem 3.1

We begin by adding $2\lambda\|\tilde X_t\|^2\,dt$, with $\lambda \in (0, \infty)$, to both sides of Equation (7) to obtain
$$\begin{aligned}
d\|\tilde X_t\|^2 + 2\lambda\|\tilde X_t\|^2\,dt &= -2\|x\|^2\langle \tanh X_t, \tilde X_t\rangle\,dt + 2\big(\lambda\|\tilde X_t\|^2 - \langle LX_t, \tilde X_t\rangle\big)\,dt + (n-1)\tilde\sigma^2\,dt + 2\tilde\sigma\|\tilde X_t\|\,dB_t\\
&= e^{-2\lambda t}\,d\big(\|\tilde X_t\|^2 e^{2\lambda t}\big),
\end{aligned}$$
where the second equality follows by noticing that the right hand side is the total Ito derivative of the left hand side of the first equality. Now multiply both sides by $e^{2\lambda t}$, switch to integral form, and multiply both sides by $e^{-2\lambda t}$ to arrive at
$$\begin{aligned}
\|\tilde X_t\|^2 &= e^{-2\lambda t}\|\tilde X_0\|^2 + \int_0^t e^{2\lambda(s-t)}\Big((n-1)\tilde\sigma^2 - 2\|x\|^2\langle \tanh X_s, \tilde X_s\rangle\Big)\,ds\\
&\quad + 2\int_0^t e^{2\lambda(s-t)}\big(\lambda\|\tilde X_s\|^2 - \langle LX_s, \tilde X_s\rangle\big)\,ds + 2\tilde\sigma\int_0^t e^{2\lambda(s-t)}\|\tilde X_s\|\,dB_s. \qquad (9)
\end{aligned}$$

Upper Bound: Next, note that $\langle LX_t, \tilde X_t\rangle = \langle L\tilde X_t, \tilde X_t\rangle$ since $L(QX_t) = 0$, and that $\tilde X_t$ is by definition orthogonal to any constant vector. For all $t$ we also have that
$$\lambda_-\|\tilde X_t\|^2 - \langle L\tilde X_t, \tilde X_t\rangle \le 0 \quad\text{and}\quad -\langle \tanh X_t, \tilde X_t\rangle \le 0 \qquad (10)$$
almost surely. The first inequality follows from the fact that, for all $\tilde X_t \in \operatorname{im} P$,
$$\lambda_-\|\tilde X_t\|^2 \le \langle L\tilde X_t, \tilde X_t\rangle \le \lambda_+\|\tilde X_t\|^2,$$
if $\lambda_-$ is the Fiedler eigenvalue of $L$ and $\lambda_+$ is the largest eigenvalue of $L$. The second inequality is given by Lemma 6.1. Setting $\lambda \equiv \lambda_-$ and applying the inequalities (10) to Equation (9) gives the estimate
$$\begin{aligned}
\|\tilde X_t\|^2 &\le e^{-2\lambda_- t}\|\tilde X_0\|^2 + (n-1)\tilde\sigma^2\int_0^t e^{2\lambda_-(s-t)}\,ds + 2\tilde\sigma\int_0^t e^{2\lambda_-(s-t)}\|\tilde X_s\|\,dB_s\\
&= e^{-2\lambda_- t}\|\tilde X_0\|^2 + \frac{(n-1)\tilde\sigma^2}{2\lambda_-}\big(1 - e^{-2\lambda_- t}\big) + 2\tilde\sigma\int_0^t e^{2\lambda_-(s-t)}\|\tilde X_s\|\,dB_s \qquad (11)
\end{aligned}$$
almost surely. Taking expectations and noting that $\mathbb{E}\big[\int_0^t e^{2\lambda_-(s-t)}\|\tilde X_s\|\,dB_s\big] = 0$, we have that
$$\mathbb{E}\|\tilde X_t\|^2 \le \frac{(n-1)\tilde\sigma^2}{2\lambda_-} \qquad (12)$$
after transients of rate $\lambda_-$.

Lower Bound: We show that $\mathbb{E}\|\tilde X_t\|^2$ has a lower bound that can also be expressed in terms of the coupling strength and the noise level. The derivation is similar to that of the upper bound, and we begin with Equation (9). We set $\lambda \equiv \lambda_+$ and apply the estimates $\lambda_+\|\tilde X_s\|^2 - \langle L\tilde X_s, \tilde X_s\rangle \ge 0$ and $\langle \tanh X_s, \tilde X_s\rangle \le \|\tilde X_s\|^2$ for all $s$ a.s., yielding
$$\|\tilde X_t\|^2 \ge e^{-2\lambda_+ t}\|\tilde X_0\|^2 + \int_0^t e^{2\lambda_+(s-t)}\Big((n-1)\tilde\sigma^2 - 2\|x\|^2\|\tilde X_s\|^2\Big)\,ds + 2\tilde\sigma\int_0^t e^{2\lambda_+(s-t)}\|\tilde X_s\|\,dB_s.$$
Taking expectations and integrating the Ito term, we have
$$\mathbb{E}\|\tilde X_t\|^2 \ge e^{-2\lambda_+ t}\,\mathbb{E}\|\tilde X_0\|^2 + \frac{(n-1)\tilde\sigma^2}{2\lambda_+}\big(1 - e^{-2\lambda_+ t}\big) - 2\|x\|^2\int_0^t e^{2\lambda_+(s-t)}\,\mathbb{E}\|\tilde X_s\|^2\,ds.$$
After transients of rate $\lambda_-$, we can apply (12) to estimate the remaining integral and lower bound the above equation by
$$e^{-2\lambda_+ t}\,\mathbb{E}\|\tilde X_0\|^2 + \frac{(n-1)\tilde\sigma^2}{2\lambda_+}\big(1 - e^{-2\lambda_+ t}\big) - \|x\|^2\,\frac{(n-1)\tilde\sigma^2}{2\lambda_-\lambda_+}\big(1 - e^{-2\lambda_+ t}\big).$$
Since $\lambda_- \le \lambda_+$, transients of rate $\lambda_+$ have already transpired if we suppose that we have waited for transients of rate $\lambda_-$. Therefore, we can say that after transients of rate $\lambda_-$,
$$\mathbb{E}\|\tilde X_t\|^2 \ge \frac{(n-1)\tilde\sigma^2}{2\lambda_+}\left(1 - \frac{\|x\|^2}{\lambda_-}\right). \qquad (13)$$

Finally, we can obtain corresponding upper and lower bounds for the original system (3) by noting that, since $\tilde X_t = P\big(\mathbf{w}(t)\|x\|^2 - \langle x, y\rangle\mathbf{1}\big) = \|x\|^2 P\mathbf{w}(t)$, we have $\mathbb{E}\|\tilde{\mathbf{w}}\|^2 = \mathbb{E}\|\tilde X_t\|^2/\|x\|^4$, where we have used the notation $\tilde{\mathbf{w}}$ for $P\mathbf{w}$. The $\|x\|^4$ in the denominator then cancels with the same quantity occurring in $\tilde\sigma^2$ in Equations (12) and (13), giving the final form shown in Theorem 3.1.

6.2 Fluctuations Estimates: Proof of Theorem 3.2

We first derive the fourth moment of the norm of the fluctuations. Starting from Equation (11), allow transients of rate $\lambda_-$ to pass so that we are left with the integral inequality
$$\|\tilde X_t\|^2 \le \frac{(n-1)\tilde\sigma^2}{2\lambda_-} + 2\tilde\sigma\int_0^t e^{2\lambda_-(s-t)}\|\tilde X_s\|\,dB_s.$$
Squaring both sides, we can apply the identity $(a + b)^2 \le 2a^2 + 2b^2$ to obtain
$$\|\tilde X_t\|^4 \le \left(\frac{(n-1)\tilde\sigma^2}{\sqrt{2}\,\lambda_-}\right)^2 + 8\tilde\sigma^2\left(\int_0^t e^{2\lambda_-(s-t)}\|\tilde X_s\|\,dB_s\right)^2.$$
Taking expectations and invoking Ito's Isometry for the second term leads to
$$\begin{aligned}
\mathbb{E}\|\tilde X_t\|^4 &\le \left(\frac{(n-1)\tilde\sigma^2}{\sqrt{2}\,\lambda_-}\right)^2 + 8\tilde\sigma^2\int_0^t e^{4\lambda_-(s-t)}\,\mathbb{E}\|\tilde X_s\|^2\,ds\\
&\le \left(\frac{(n-1)\tilde\sigma^2}{\sqrt{2}\,\lambda_-}\right)^2 + \frac{8\tilde\sigma^2}{4\lambda_-}\left(\frac{(n-1)\tilde\sigma^2}{2\lambda_-}\right) = \left(\frac{(n-1)\tilde\sigma^2}{\lambda_-}\right)^2\left(\frac{1}{2} + \frac{1}{n-1}\right),
\end{aligned}$$
where the estimate (12) has been substituted in for $\mathbb{E}\|\tilde X_s\|^2$. An upper bound on the variance is then obtained from the identity $\operatorname{var}(Z) = \mathbb{E}[Z^2] - (\mathbb{E}Z)^2$ and the lower estimate given in Equation (13). Reversing the change of variables as in Section 6.1 yields the final result.

6.3 Proof of Theorem 3.3

Theorem 3.1 can be applied towards providing a lower bound for the average distance between the noisy trajectories of the neural circuit and the noise-free solution to the learning problem. First observe that, from the orthogonal decomposition $X_t = PX_t + QX_t$ and the change of variables mapping (3) to (4),
$$\|X_t\|^2 = n\bar X_t^2 + \|\tilde X_t\|^2 = \|x\|^4\,\|\mathbf{w} - w^*\mathbf{1}\|^2. \qquad (14)$$
Furthermore, we have that
$$\bar X_t = n^{-1}\sum_i X_i(t) = n^{-1}\sum_i \big(w_i\|x\|^2 - \langle x, y\rangle\big),$$
so evidently $\|x\|^{-4}\,\mathbb{E}\bar X_t^2 = \mathbb{E}\big[(\bar w_t - w^*)^2\big]$. Next, note that if the fluctuations are small, the trajectories $(w_i(t))_{i=1}^n$ are close to one another and the average trajectory $\bar w_t = n^{-1}\mathbf{w}(t)^\top\mathbf{1}$ evolves essentially as $\bar w_t \sim w^* + \frac{\sigma}{\sqrt{n}}W_t$, where $W_t$ is interpreted as a white noise process. In this case we then have that $\mathbb{E}\big[(\bar w_t - w^*)^2\big] = \frac{\sigma^2}{n}$, and we see that $\mathbb{E}\big[(\bar w_t - w^*)^2\big] \ge \frac{\sigma^2}{n}$ when the fluctuations are not necessarily small. So we have that $\|x\|^{-4}\,\mathbb{E}\bar X_t^2 \ge \frac{\sigma^2}{n}$. Combining the above with Theorem 3.1,
$$\frac{\sigma^2}{n} + \left[\frac{(n-1)\sigma^2}{2n\lambda_+}\left(1 - \frac{\|x\|^2}{\lambda_-}\right)\right]_+ \le \frac{\mathbb{E}\bar X_t^2}{\|x\|^4} + \frac{\mathbb{E}\|\tilde X_t\|^2}{n\|x\|^4} \le \frac{\sigma^2}{2\lambda_-} + \mathbb{E}\big[(\bar w_t - w^*)^2\big]$$
with the notation $[\,\cdot\,]_+ \equiv \max(0, \cdot)$. Equation (14) then shows that the middle quantity above is equal to $\mathbb{E}\big[\frac{1}{n}\sum_{i=1}^n (w_i(t) - w^*)^2\big]$.
Acknowledgments

The authors acknowledge helpful suggestions from and discussions with Jonathan Mattingly. JB gratefully acknowledges support under NSF contract IIS-08-03293, ONR contract N000140710625 and Alfred P. Sloan Foundation grant no. BR-4834 to M. Maggioni.
References
Adams, P. (1998). Hebb and Darwin. Journal of Theoretical Biology, 195(4):419–438.

Bennett, M. and Zukin, R. (2004). Electrical coupling and neuronal synchronization in the mammalian brain. Neuron, 41(4):495–511.

Bollobas, B. (2001). Random Graphs. Cambridge University Press, 2nd edition.

Cao, M., Stewart, A., and Leonard, N. (2010). Convergence in human decision-making dynamics. Systems and Control Letters, 59:87–97.

Chechik, G., Anderson, M., Bar-Yosef, O., Young, E., Tishby, N., and Nelken, I. (2006). Reduction of information redundancy in the ascending auditory pathway. Neuron, 51(3):278–280.

Clay, J. R. and DeHaan, R. L. (1979). Fluctuations in interbeat interval in rhythmic heart-cell clusters: role of membrane voltage noise. Biophysical Journal, 28(3):377–389.

Croner, L. J., Purpura, K., and Kaplan, E. (1993). Response variability in retinal ganglion cells of primates. Proceedings of the National Academy of Sciences of the United States of America, 90(17):8128–8130.

Doukhan, P. (1994). Mixing: Properties and Examples. Springer-Verlag.

Eliasmith, C. and Anderson, C. (2004). Neural Engineering: Computation, Representation, and Dynamics in Neurobiological Systems. MIT Press.

Enright, J. (1980). Temporal precision in circadian systems: a reliable neuronal clock from unreliable components? Science, 209(4464):1542–1545.

Faisal, A., Selen, L., and Wolpert, D. (2008). Noise in the nervous system. Nat. Rev. Neurosci., 9:292–303.

Fernando, C., Goldstein, R., and Szathmary, E. (2010). The neuronal replicator hypothesis. Neural Computation, 22(11):2809–2857.

Friedrich, R. W., Habermann, C. J., and Laurent, G. (2004). Multiplexing using synchrony in the zebrafish olfactory bulb. Nature Neurosci., 7:862–871.

Fukuda, T., Kosaka, T., Singer, W., and Galuske, R. (2006). Gap junctions among dendrites of cortical GABAergic neurons establish a dense and widespread intercolumnar network. J. Neurosci., 26(13):3434–3443.

Georgopoulos, A., Kalaska, J., Caminiti, R., and Massey, J. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2(11):1527–1537.

Gigante, G., Mattia, M., Braun, J., and Del Giudice, P. (2009). Bistable perception modeled as competing stochastic integrations at two levels. PLoS Comput Biol, 5(7):e1000430.

Grammont, F. and Riehle, A. (1999). Precise spike synchronization in monkey motor cortex involved in preparation for movement. Exp. Brain Res., 128:118–122.

Grandhe, S., Abbas, J., and Jung, R. (1999). Brain-spinal cord interactions stabilize the locomotor rhythm to an external perturbation. Biomed Sci Instrum., 35:175–180.

Hahnloser, R., Douglas, R. J., Mahowald, M., and Hepp, K. (1999). Feedback interactions between neuronal pointers and maps for attentional processing. Nat. Neurosci., 2:746–752.

Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley.

Hung, C., Kreiman, G., Poggio, T., and DiCarlo, J. (2005). Fast readout of object identity from macaque inferior temporal cortex. Science, 310(5749):863–866.

Itti, L. and Koch, C. (2001). Computational modelling of visual attention. Nat. Rev. Neurosci., 2(3):194–203.

Jadbabaie, A., Lin, J., and Morse, A. S. (2003). Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Transactions on Automatic Control, 48(6):988–1001.

Kandel, E., Schwartz, J., and Jessell, T. (2000). Principles of Neural Science (4th ed.). McGraw-Hill.

Kiani, R. and Shadlen, M. N. (2009). Representation of confidence associated with a decision by neurons in the parietal cortex. Science, 324(5928):759–764.

Kiemel, T., Gormley, K. M., Guan, L., Williams, T. L., and Cohen, A. H. (2003). Estimating the strength and direction of functional coupling in the lamprey spinal cord. J. Comput. Neurosci., 15(2):233–245.

Kopell, N. (2000). We got rhythm: Dynamical systems of the nervous system. Notices of the AMS, 47:6–16.

Kopell, N. and Ermentrout, G. (1986). Symmetry and phase-locking in chains of weakly coupled oscillators. Communications on Pure and Applied Mathematics, 39:623–660.

LeDoux, J. (2000). Emotion circuits in the brain. Annu. Rev. Neurosci., 23:155–184.

Lee, T. and Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. J Opt Soc Am A: Opt Image Sci Vis, 20(7):1434–1448.

Lohmiller, W. and Slotine, J. (1998). On contraction analysis for non-linear systems. Automatica, 34:683–696.

Maren, S. (2001). Neurobiology of Pavlovian fear conditioning. Annu. Rev. Neurosci., 24:897–931.

Masuda, N., Kawamura, Y., and Kori, H. (2010). Collective fluctuations in networks of noisy components. New Journal of Physics, 12(9):093007.

Mumford, D. (1992). On the computational architecture of the neocortex – II: The role of cortico-cortical loops. Biol. Cyb., 66:241–251.

Needleman, D. J., Tiesinga, P. H., and Sejnowski, T. J. (2001). Collective enhancement of precision in networks of coupled oscillators. Physica D: Nonlinear Phenomena, 155(3-4):324–336.

Pham, Q.-C. and Slotine, J.-J. (2007). Stable concurrent synchronization in dynamic system networks. Neural Networks, 20(1):62–77.

Pham, Q.-C., Tabareau, N., and Slotine, J.-J. (2009). A contraction theory approach to stochastic incremental stability. IEEE Transactions on Automatic Control, 54(4):816–820.

Pouget, A. and Sejnowski, T. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2):222–237.

Poulakakis, I., Scardovi, L., and Leonard, N. (2010). Coupled stochastic differential equations and collective decision making in the two-alternative forced-choice task. In Proc. American Control Conference.

Puchalla, J. L., Schneidman, E., Harris, R. A., and Berry, M. J. (2005). Redundancy in the population code of the retina. Neuron, 46(3):493–504.

Rao, R. and Ballard, D. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci., 2:79–87.

Russo, G. and Slotine, J.-J. (2010). Global convergence of quorum sensing networks. Physical Review E, 82:041919.

Schnitzler, A. and Gross, J. (2005). Normal and pathological oscillatory communication in the brain. Nature Reviews Neuroscience, 6:285–296.

Scholz, J. and Schoner, G. (1999). The uncontrolled manifold concept: identifying control variables for a functional task. Exp Brain Res, 126(3):289–306.

Sherman, A. and Rinzel, J. (1991). Model for synchronization of pancreatic beta-cells by gap junction coupling. Biophysical Journal, 59(3):547–559.

Slotine, J.-J. E. and Li, W. (1991). Applied Nonlinear Control. Prentice-Hall.

Tabareau, N., Slotine, J.-J., and Pham, Q.-C. (2010). How synchronization protects from noise. PLoS Comput Biol, 6(1):e1000637.

Taylor, A., Tinsley, M. R., Wang, F., Huang, Z., and Showalter, K. (2009). Dynamical quorum sensing and synchronization in large populations of chemical oscillators. Science, 323(5914):614–617.

Ting, L. (2007). Dimensional reduction in sensorimotor systems: a framework for understanding muscle coordination of posture. In Cisek, P., Drew, T., and Kalaska, J., editors, Computational Neuroscience: Theoretical Insights into Brain Function, pages 301–325. Elsevier.

Todorov, E. (2008). Recurrent neural networks trained in the presence of noise give rise to mixed muscle-movement representations. Preprint.

Varier, S. and Kaiser, M. (2011). Neural development features: Spatio-temporal development of the Caenorhabditis elegans neuronal network. PLoS Computational Biology, 7(1):1–9.

Wang, W. and Slotine, J. (2005). On partial contraction analysis for coupled nonlinear oscillators. Biological Cybernetics, 92(1):38–53.

Yang, T. and Shadlen, M. N. (2007). Probabilistic reasoning by neurons. Nature, 447(7148):1075–1080.

Yao, Y., Rosasco, L., and Caponnetto, A. (2007). On early stopping in gradient descent learning.
Constructive Approximation, 26(2):289–315.

Young, G., Scardovi, L., and Leonard, N. (2010). Robustness of noisy consensus dynamics with directed communication. In Proc. American Control Conference.