NEURAL KALMAN FILTERING
Beren Millidge
School of Informatics, University of Edinburgh
[email protected]
Alexander Tschantz
Sackler Center for Consciousness Science, School of Engineering and Informatics, University of Sussex
[email protected]
Anil K Seth
Sackler Center for Consciousness Science, Evolutionary and Adaptive Systems Research Group, School of Engineering and Informatics
Christopher L Buckley
Evolutionary and Adaptive Systems Research Group, School of Engineering and Informatics, University of Sussex
[email protected]

ABSTRACT
The Kalman filter is a fundamental filtering algorithm that fuses noisy sensory data, a previous state estimate, and a dynamics model to produce a principled estimate of the current state. It assumes, and is optimal for, linear models and white Gaussian noise. Due to its relative simplicity and general effectiveness, the Kalman filter is widely used in engineering applications. Since many sensory problems the brain faces are, at their core, filtering problems, it is possible that the brain possesses neural circuitry that implements equivalent computations to the Kalman filter. The standard approach to Kalman filtering requires complex matrix computations that are unlikely to be directly implementable in neural circuits. In this paper, we show that a gradient-descent approximation to the Kalman filter requires only local computations with variance weighted prediction errors. Moreover, we show that it is possible under the same scheme to adaptively learn the dynamics model with a learning rule that corresponds directly to Hebbian plasticity. We demonstrate the performance of our method on a simple Kalman filtering task, and propose a neural implementation of the required equations.

Keywords: Kalman Filtering, Computational Neuroscience, Predictive Coding, Filtering

The Bayesian Brain hypothesis has gained significant traction in cognitive neuroscience over the last two decades (Knill and Pouget, 2004; Doya et al., 2007; Pouget et al., 2013).
The idea that the brain is in some sense performing (some approximation to) Bayesian inference to solve tasks is fairly mainstream, and has received extensive experimental support (Kersten et al., 2004; Körding and Wolpert, 2004; Tenenbaum et al., 2006; Angelaki et al., 2009; Ernst and Banks, 2002), although there is still significant debate on the exact nature of the computations performed, and especially about whether probabilities and probability densities such as likelihoods and priors are represented explicitly, implicitly, or not at all (Sanborn and Chater, 2016; Ma et al., 2006, 2008; Pouget et al., 2000).
PREPRINT - 22nd February, 2021

What is not as widely appreciated in the literature is that the brain must not only perform inference on the hidden states of the world from sensory data, it must also track the changes in those states over time. This means that the brain must infer a time-varying posterior and constantly keep it updated as the outside world evolves. This changes the problem the brain faces from that of Bayesian inference to that of Bayesian filtering (Kutschireiter, 2018).

The filtering problem is as follows (Jazwinski; Stengel, 1994). You have some estimated state $\hat{x}_t$, and some model of how the world evolves (the dynamics model): $\hat{x}_{t+1} = f(\hat{x}_t)$. You also receive observations $y$, and you have some model of how the observations depend on the estimated state (the observation model): $y = g(\hat{x}_t)$. Your task, then, is to compute $p(\hat{x}_{t+1} \mid \hat{x}_t, y_{1:t})$. In the general nonlinear case, this calculation is analytically intractable and extremely expensive to compute exactly. Some form of approximate solution is required.

Two forms of approximation are generally used. The first is to approximate the model, such as by assuming linearity of the dynamics and observation models. This approach is taken by the Kalman filter. With the assumption of an initial Gaussian distribution, the linear model implies that the posterior distribution will remain Gaussian for all time, and so analytic update rules for the mean and covariance can be derived.

The second method is to approximate the posterior, usually with a set of samples (or particles). This approach is taken by the class of particle filtering algorithms, which track the changing posterior by propagating the particles through the dynamics and then resampling based upon updated measurement information (Arulampalam et al., 2002; Gordon et al., 1993). This approach can handle general nonlinear filtering cases, but suffers strongly from the curse of dimensionality.
If the state-space is high-dimensional, the number of particles required for a good approximation grows rapidly (Doucet et al., 2000).

While there has been some work focusing on the implementation of particle filtering methods in neural circuitry (Kutschireiter et al., 2015), we focus on a neural implementation of the Kalman filter. There is substantial evidence that the brain is capable of Bayes-optimal integration of noisy measurements, and is apparently in possession of robust forward models both in perception (Zago et al., 2008; Simoncelli, 2009) and motor control (Munuera et al., 2009; Gold and Shadlen, 2003; Todorov, 2004). de Xivry et al. (2013) have even shown that a Kalman filter successfully fits psychomotor data on visually guided saccades and smooth pursuit movements, although they remain agnostic on how it may be implemented in the brain.

In this paper, we show that it is possible to derive a gradient descent approximation to the analytical Kalman filter equations, which results in relatively simple update equations that are biologically plausible and can be embedded in a straightforward neural architecture of rate-coded integrate-and-fire neurons. This approach eschews any complicated linear algebra or explicit computation of the Kalman gain matrix. Moreover, it also provides for adaptively learning the coefficients of the dynamics and measurement models through Hebbian plasticity. The paper comprises three sections. First, we provide a general introduction to Kalman filtering and derive our gradient descent algorithm. Secondly, we propose how the resulting equations can be translated into a neural architecture of linear integrate-and-fire neurons with Hebbian plasticity. Finally, we compare the performance of our filter to exact analytical Kalman filtering on a simple simulated tracking task.

Related Work
Several earlier works have also tried to approach the problem of Kalman filtering in the brain. Wilson and Finkel (2009) repurpose a line attractor network and show that it recapitulates the dynamics of a Kalman filter in the regime of low prediction error. However, their model only works for a one-dimensional stimulus, does not encode uncertainty, and only works when a linearisation around zero prediction error holds. Deneve et al. (2007) encode Kalman filter dynamics in a recurrent attractor network. Their approach, however, encodes stimuli by means of basis functions, which leads to an exponentially growing number of basis functions required to tile the space as the dimensionality of the input grows. In our approach, neurons directly encode the mean of the estimated posterior distribution, which means that our network size scales linearly with the number of dimensions. Our gradient method also completely eschews the direct
computation of the Kalman gain, which simplifies the required computations significantly. Additionally, Beck et al. (2007) show that probabilistic population coding approaches can compute posteriors for the exponential family of distributions, of which the Gaussian distribution is a member. However, to our knowledge, no explicitly worked out application of the population coding approach to Kalman filtering exists.

Our model is closely related to predictive coding approaches to brain function (Friston, 2005; Rao and Ballard, 1999; Spratling, 2017, 2008). The predictive coding and Kalman filtering update rules have been compared explicitly in the context of control theory (Baltieri and Buckley, 2020), although without any precise statement of their mathematical relationship. In some works (Friston et al., 2008; Friston, 2008), it has been claimed that predictive coding and Kalman filtering are equivalent. Interpreted strictly, this claim is false. In this paper, in effect, we make explicit and precise the relationship between predictive coding and Kalman filtering. Specifically, predictive coding and Kalman filtering optimize the same Bayesian objective (with slightly different interpretations), which is convex in the linear, Gaussian case. The Kalman filter derives the analytical solution to this convex minimization problem, while the predictive coding updates arise from a gradient descent on this objective, leading to important differences in the actual update rule. In Appendix B, we demonstrate the relationship with predictive coding mathematically.

Finally, our model can be extended straightforwardly to arbitrary nonlinear dynamics simply by changing the dynamics and observation models to include arbitrary nonlinear functions, i.e. $x_{t+1} = f(x_t, u_t)$ and $y_t = g(x_t)$. The only effect this has on the functioning of our algorithm is to introduce nonlinear derivative terms such as $\partial f(x_t, u_t) / \partial x_t$ into the update equations.
This renders our approach a gradient descent on the Extended Kalman Filter (EKF) (Ribeiro, 2004) objective, which the actual EKF algorithm solves analytically by using a local linearisation of the dynamics around the current state. In some cases, such as the simple activation functions used in artificial neural networks, these nonlinear derivative terms are relatively straightforward and may be computed in a local and biologically plausible manner. Further nonlinear refinements of the Kalman filter have been proposed, such as the Unscented Kalman Filter (Wan and Van Der Merwe, 2000), and whether our gradient based approach can be extended to such algorithms remains an interesting avenue for future research.

Kalman Filtering
The Kalman filter solves the general filtering problem presented above by making two simplifying assumptions. The first is that both the dynamics model and the observation model are linear. The second is that the noise entering the system is white and Gaussian. This makes both the prior and the likelihood Gaussian. Since the Gaussian distribution is a conjugate prior to itself, this induces a Gaussian posterior, which can then serve as the prior in the next timestep. Since both prior and posterior are Gaussian, filtering can continue recursively for any number of timesteps without the posterior growing in complexity and becoming intractable. The Kalman filter is the Bayes-optimal solution provided that the assumptions of linear models and white Gaussian noise are met (Kalman, 1960). Due to its simplicity and utility, the Kalman filter is widely used in engineering, time-series analysis, aeronautics, and economics (Grewal and Andrews, 2010; Leondes, 1970; Schneider, 1988; Harvey, 1990).

The Kalman filter is defined upon the following linear state-space model:

$$x_{t+1} = A x_t + B u_t + \omega, \qquad y_{t+1} = C x_{t+1} + z \qquad (1)$$

where $x_t$ represents the hidden or internal state at time t, and $u_t$ is the control (the known inputs to the system) at time t. The matrices A, B, and C parametrize the linear dynamics and observation models, and $\omega$ and $z$ are both zero-mean white Gaussian noise processes with covariances $\Sigma_\omega$ and $\Sigma_z$, respectively. Since the posterior $p(x_{t+1} \mid y_{1:t}, x_t)$ is Gaussian, it can be represented by its two sufficient statistics: the mean $\mu$ and the covariance matrix $\Sigma_x$. For simplicity, the model is presented in discrete time. The continuous time analogue of the Kalman filter is the Kalman-Bucy filter (Kalman and Bucy, 1961); generalization of this scheme to continuous time is an avenue for future work.

Kalman filtering proceeds by first "projecting" forward the current estimates according to the dynamics model.
Then these estimates are "corrected" by new sensory data. The Kalman filtering equations are as follows:
Projection:

$$\hat{\mu}_{t+1} = A\mu_t + Bu_t, \qquad \hat{\Sigma}_x(t+1) = A\Sigma_x(t)A^T + \Sigma_\omega \qquad (2)$$

Correction:

$$\mu_{t+1} = \hat{\mu}_{t+1} + K(y_{t+1} - C\hat{\mu}_{t+1}), \qquad \Sigma_x(t+1) = (I - KC)\hat{\Sigma}_x(t+1),$$
$$K = \hat{\Sigma}_x(t+1)\,C^T\,[C\hat{\Sigma}_x(t+1)C^T + \Sigma_z]^{-1} \qquad (3)$$

where $\mu_t$ and $\Sigma_x(t)$ are the mean and covariance of the estimate of the state $x$ at time t, and K is the Kalman gain matrix. Although these update rules provide an analytically exact solution to the filtering problem, the complicated linear algebra expressions, especially that for the Kalman gain matrix K, make it hard to see how such equations could be implemented directly in the brain.

We therefore take a different approach, which utilizes the fact that the filtering equations can be derived directly from Bayes' rule. The mean of the posterior distribution is also the MAP (maximum-a-posteriori) point, since a Gaussian distribution is unimodal. Thus, to estimate the new mean, we simply have to compute

$$\operatorname*{argmax}_{\hat{x}_{t+1}} p(\hat{x}_{t+1} \mid y_{t+1}, \hat{x}_t) \propto \operatorname*{argmax}_{\hat{x}_{t+1}} p(y_{t+1} \mid \hat{x}_{t+1})\,p(\hat{x}_{t+1} \mid \hat{x}_t)$$
$$= \operatorname*{argmax}_{\hat{x}_{t+1}} \mathcal{N}(y_{t+1}; C\hat{x}_{t+1}, \Sigma_z)\,\mathcal{N}(\hat{x}_{t+1}; A\hat{x}_t + Bu_t, \Sigma_\omega)$$
$$= \operatorname*{argmax}_{\mu_{t+1}} Z \exp\Big(-(y - C\mu_{t+1})^T\Sigma_z^{-1}(y - C\mu_{t+1}) - (\mu_{t+1} - A\mu_t - Bu_t)^T\hat{\Sigma}_x^{-1}(\mu_{t+1} - A\mu_t - Bu_t)\Big)$$
$$= \operatorname*{argmin}_{\mu_{t+1}} (y - C\mu_{t+1})^T\Sigma_z^{-1}(y - C\mu_{t+1}) + (\mu_{t+1} - A\mu_t - Bu_t)^T\hat{\Sigma}_x^{-1}(\mu_{t+1} - A\mu_t - Bu_t) \qquad (4)$$

The first step is just a writing out of Bayes' rule. Next, the fact that both the likelihood and the prior are Gaussian is used to write them as Gaussian densities. In the second line, the algebraic form of the Gaussian density is substituted, and we switch the maximization variable to $\mu_{t+1}$ since the maximum of a Gaussian is its mean. In the final line, we minimize the negative log probability instead of maximizing the probability, which removes the exponential and the normalizing constant Z (which can be computed analytically since the posterior is Gaussian).
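For reference, the projection and correction steps of equations (2)-(3) can be sketched directly in NumPy. This is a minimal illustrative sketch with our own variable names, not the authors' released code:

```python
import numpy as np

def kalman_step(mu, Sigma, y, u, A, B, C, Sigma_w, Sigma_z):
    """One projection + correction step of the analytical Kalman filter."""
    # Projection: push the previous estimate through the dynamics model (Eq. 2)
    mu_hat = A @ mu + B @ u
    Sigma_hat = A @ Sigma @ A.T + Sigma_w
    # Correction: fuse the projected estimate with the new observation (Eq. 3)
    K = Sigma_hat @ C.T @ np.linalg.inv(C @ Sigma_hat @ C.T + Sigma_z)
    mu_new = mu_hat + K @ (y - C @ mu_hat)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_hat
    return mu_new, Sigma_new
```

Note how the Kalman gain requires an explicit matrix inversion; it is exactly this step that the gradient formulation below avoids.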
The result is a convex optimization problem which can be solved analytically to give the Kalman filter equations shown above (see Appendix A for a full derivation). However, the optimization problem can also be solved by gradient descent. Moreover, the required gradients give rise to simple, biologically plausible learning rules for networks of integrate-and-fire neurons, and even allow for the adaptive computation of the coefficients of the A, B, and C matrices using nothing more than third-factor Hebbian plasticity. This means that a close approximation of Kalman filtering can be achieved in the brain without complex linear algebra. (The log transformation is valid under maximization and minimization since the log function is monotonic.)

Derivation of the gradients proceeds as follows. Our objective is the minimization of the loss function $\mathcal{L}$,

$$\operatorname*{argmax}_{\hat{x}_{t+1}} p(y_{t+1} \mid \hat{x}_{t+1})\,p(\hat{x}_{t+1} \mid \hat{x}_t) = \operatorname*{argmin}_{\mu_{t+1}} \mathcal{L}$$
$$\mathcal{L} = (y - C\mu_{t+1})^T\Sigma_z^{-1}(y - C\mu_{t+1}) + (\mu_{t+1} - A\mu_t - Bu_t)^T\hat{\Sigma}_x^{-1}(\mu_{t+1} - A\mu_t - Bu_t)$$
$$= y^T\Sigma_z^{-1}y - 2\mu_{t+1}^T C^T\Sigma_z^{-1}y + \mu_{t+1}^T C^T\Sigma_z^{-1}C\mu_{t+1} + \mu_{t+1}^T\hat{\Sigma}_x^{-1}\mu_{t+1} - 2\mu_{t+1}^T\hat{\Sigma}_x^{-1}(A\mu_t + Bu_t) + (A\mu_t + Bu_t)^T\hat{\Sigma}_x^{-1}(A\mu_t + Bu_t) \qquad (5)$$

We proceed by taking derivatives with respect to $\mu_{t+1}$:

$$\frac{d\mathcal{L}}{d\mu_{t+1}} = -2C^T\Sigma_z^{-1}y + 2C^T\Sigma_z^{-1}C\mu_{t+1} + 2\hat{\Sigma}_x^{-1}\mu_{t+1} - 2\hat{\Sigma}_x^{-1}A\mu_t - 2\hat{\Sigma}_x^{-1}Bu_t$$
$$= -2C^T\Sigma_z^{-1}[y - C\mu_{t+1}] + 2\hat{\Sigma}_x^{-1}[\mu_{t+1} - A\mu_t - Bu_t]$$
$$\propto -C^T\Sigma_z^{-1}\epsilon_z + \hat{\Sigma}_x^{-1}\epsilon_x \qquad (6)$$

where $\epsilon_z = y - C\mu_{t+1}$ and $\epsilon_x = \mu_{t+1} - A\mu_t - Bu_t$, and the constant factor of 2 is absorbed into the learning rate. This means that the Kalman filter gradient is, in effect, an update scheme relying on variance weighted prediction errors, which can be computed with only the linear operations of weighted subtraction and addition.

It is also possible to adaptively learn the coefficients of the A, B, and C matrices by gradient descent, by taking gradients of the loss function with respect to these variables. We go through each in turn.
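Before turning to the learning rules, the state update implied by equation (6) can be written as a short iterative procedure. This is our own sketch: the learning rate and number of steps are free parameters, and `Pi_z`, `Pi_x` denote the precision matrices $\Sigma_z^{-1}$ and $\hat{\Sigma}_x^{-1}$:

```python
import numpy as np

def gradient_filter_step(mu_prev, y, u, A, B, C, Pi_z, Pi_x, lr=0.1, n_steps=5):
    """Estimate mu_{t+1} by gradient descent on the sum of precision-weighted
    prediction errors (Eq. 6), instead of the analytical Kalman update."""
    mu = A @ mu_prev + B @ u                 # initialize at the dynamical prediction
    for _ in range(n_steps):
        eps_z = y - C @ mu                   # sensory prediction error
        eps_x = mu - A @ mu_prev - B @ u     # dynamical prediction error
        mu = mu - lr * (-C.T @ Pi_z @ eps_z + Pi_x @ eps_x)  # descend Eq. 6 gradient
    return mu
```

Because the objective is convex, a handful of iterations typically suffices, consistent with the small number of gradient steps reported in the Results section.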
$$\frac{d\mathcal{L}}{dA} = \frac{d}{dA}\Big[(\mu_{t+1} - A\mu_t - Bu_t)^T\hat{\Sigma}_x^{-1}(\mu_{t+1} - A\mu_t - Bu_t)\Big] = -2\hat{\Sigma}_x^{-1}[\mu_{t+1} - A\mu_t - Bu_t]\mu_t^T \propto -\hat{\Sigma}_x^{-1}\epsilon_x\mu_t^T \qquad (7)$$

This is a simple Hebbian update rule between the dynamical prediction errors and the previous state estimate. A similar derivation is possible with respect to the B matrix:

$$\frac{d\mathcal{L}}{dB} = \frac{d}{dB}\Big[(\mu_{t+1} - A\mu_t - Bu_t)^T\hat{\Sigma}_x^{-1}(\mu_{t+1} - A\mu_t - Bu_t)\Big] = -2\hat{\Sigma}_x^{-1}[\mu_{t+1} - A\mu_t - Bu_t]u_t^T \propto -\hat{\Sigma}_x^{-1}\epsilon_x u_t^T \qquad (8)$$

which is another simple Hebbian update rule, between the dynamical prediction errors and the control variables. It is also possible to take derivatives with respect to the C observation matrix:

$$\frac{d\mathcal{L}}{dC} = \frac{d}{dC}\Big[(y - C\mu_{t+1})^T\Sigma_z^{-1}(y - C\mu_{t+1})\Big] = -2\Sigma_z^{-1}[y - C\mu_{t+1}]\mu_{t+1}^T \propto -\Sigma_z^{-1}\epsilon_z\mu_{t+1}^T \qquad (9)$$

which is again a simple Hebbian update rule, between the sensory prediction errors and the estimated state. In the next section we show how these equations can be implemented in a biologically plausible neural architecture.

Neural Implementation
Our neural implementation assumes three distinct subpopulations of neurons. The first population represents the best estimate $\mu_{t+1}$ at any point in time. The other two populations represent either the dynamical prediction errors $\epsilon_x$ or the sensory prediction errors $\epsilon_z$. It is important to note that this scheme is only one possible equivalent representation, and is a fairly direct translation of the mathematics into a neural implementation. It is possible to implement the same scheme in a variety of different neural architectures.

Sensory data $y_t$ enters the model at the bottom and is projected upwards to the sensory prediction error neurons via driving feed-forward afferent connections. The connectivity pattern of these connections is sparse (in our model, one-to-one). The sensory prediction errors become variance weighted through lateral connections which encode the precision matrix $\Sigma_z^{-1}$. These prediction errors are transmitted upwards through diffuse inhibitory connections embodying the matrix $C^T$ and synapse onto the recurrent connections of the estimation neurons representing $\mu_{t+1}$, thus implementing the required gradient descent rule. At the same time, the estimation neurons project diffuse feedback inhibitory connections back to the sensory prediction error neurons, representing the $C\mu_{t+1}$ term.

The dynamical prediction error neurons receive sparse driving excitatory feedforward connections from the estimation neurons, and diffuse inhibitory feedback connections from higher layers encoding the previous state (representing the $A\mu_t$ term) and, if required, an action efference copy representing the $Bu_t$ term. It has been assumed that the current state is derived not directly from the current estimation, but from a separate process at a higher level.
This is because we assume that, if the Kalman filter were implemented in the brain, it would not stand alone but would be deeply integrated within a hierarchical structure that ultimately implements some complex nonlinear filtering scheme. This hierarchical scheme would enable context-dependent Kalman filtering at the lower levels. An additional advantage is that if the Kalman filter only has to deal with local state variables, then it is more likely that the linearity approximations it requires approximately hold. If the system is instead confined to a single layer, so that the estimate at any one time feeds back upon itself, then this can be accomplished with an additional recurrent connection. While this approach could be extended using a nonlinear generative model, and would become a gradient descent version of the extended Kalman filter, the EKF only uses a local linearisation of the nonlinear dynamics and thus performs relatively poorly in highly nonlinear regimes, still necessitating more complex nonlinear representations in higher regions of the brain.

Broadly, this scheme accords with predictive coding models of the brain (Bastos et al., 2012; Friston, 2005) and with known general principles of neurophysiology, such as diffuse inhibitory feedback connections versus driving feedforward excitatory connections. The only case where this rule does not apply is the inhibitory feedforward connections between the sensory prediction errors and the estimation units, which require a precise one-to-one connectivity that is unlikely in the brain. However, recent work in predictive coding has shown that such a one-to-one connectivity pattern can empirically be relaxed with relatively little harm to performance (Millidge et al., 2020).

Importantly, this scheme makes adaptive learning of the A, B, and C weight matrices correspond directly to Hebbian plasticity between the prediction error units and the originators of the connection.
For instance, take the A-matrix learning update rule $\Delta A \propto \hat{\Sigma}_x^{-1}\epsilon_x\mu_t^T$ (equation 7). In elementwise terms, this can be expressed as $\Delta A_{ij} = (\hat{\Sigma}_x^{-1}\epsilon_x)_i\,(\mu_t)_j$, which simply makes the update to that connection the product of the variance weighted prediction error and the top-down previous state, corresponding exactly to a third-factor Hebbian learning rule.

One potential worry is that implementing the Kalman filter estimation as a gradient descent process, rather than an instantaneous inference, makes it too slow. In the time it takes the recurrent connections to perform multiple gradient descent steps, the world could have moved on, so that the computed estimate is out of date. In the results below we show, however, that convergence is rapid due to the convexity of the underlying estimation problem. Nearly perfect convergence to the analytical Kalman filter estimate is possible within 5 gradient steps, and reasonable estimates are obtainable within 2, and sometimes even 1, gradient step. Moreover, we found empirically that using too few gradient steps effectively smooths the estimate, potentially a desirable property for the system.

Results
The proposed model was implemented and tested on a simple Kalman filtering application: tracking the motion of an accelerating body given only noisy sensor measurements. The body is accelerated with an initial high acceleration that rapidly decays according to an exponential schedule. The Kalman filter must infer the position, velocity, and true acceleration of the body from only a kinematic dynamics model and noisy sensor measurements.
The body is additionally perturbed by white Gaussian noise in the position, velocity, and acceleration. The control schedule and the true position, velocity, and acceleration of the body are shown in Figure 1 below.
Figure 1: The true dynamics, control input, and observations generated by a random C matrix.

The analytical Kalman filter was set up as follows. It was provided with the true kinematic dynamics matrix A and the true control matrix B,

$$A = \begin{bmatrix} 1 & dt & 0 \\ 0 & 1 & dt \\ 0 & 0 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ dt \end{bmatrix} \qquad (10)$$

The observation matrix C was initialized randomly, with coefficients drawn from a normal distribution with mean 0 and variance 1. This effectively random mapping of sensory states meant that the filter could not simply read off the correct estimate directly but had to disentangle the measurements first. The Q and R matrices of the analytical Kalman filter (the process and observation noise covariances, i.e. $\Sigma_\omega$ and $\Sigma_z$) were set to constant diagonal matrices, where the constant was the variance of the noise added to the system.

The performance of the analytical Kalman filter, which computed updates using equations 1-3, is compared with that of our neural Kalman filter using gradient descent dynamics. In this comparison, the A, B, and C matrices are fixed to their correct values and only the estimated mean is inferred, according to equation 6. Comparisons are provided for a number of different gradient steps. As can be seen in Figure 2, only a small number (5) of gradient descent steps is required to obtain performance very closely matching the analytical result. This is likely due to the convexity of the underlying optimization problem, and means that using gradient descent for "online" inference is not prohibitively slow. The simulation also shows the estimate for too few (2) gradient steps, for which results are similar, but the estimate may be slightly smoother. The code used for these simulations is freely available online at https://github.com/Bmillidgework/NeuralKalmanFiltering
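The close agreement with the analytical filter can also be checked directly in code: for a single filtering step, the objective in equation (5) is a convex quadratic, so gradient descent with a suitable step size converges to exactly the analytical Kalman correction. The following is our own illustrative check (the dimensions, noise scales, and seed are arbitrary), separate from the linked repository:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
C = rng.standard_normal((n, n))
Sigma_w, Sigma_z = 0.1 * np.eye(n), 0.1 * np.eye(n)
mu_t, Sigma_t = rng.standard_normal(n), np.eye(n)
y = rng.standard_normal(n)

# Analytical Kalman correction (Eqs. 2-3), omitting the control term
mu_hat = A @ mu_t
Sigma_hat = A @ Sigma_t @ A.T + Sigma_w
K = Sigma_hat @ C.T @ np.linalg.inv(C @ Sigma_hat @ C.T + Sigma_z)
mu_kalman = mu_hat + K @ (y - C @ mu_hat)

# Gradient descent on the precision-weighted objective (Eq. 6)
Pi_z, Pi_x = np.linalg.inv(Sigma_z), np.linalg.inv(Sigma_hat)
H = C.T @ Pi_z @ C + Pi_x                 # Hessian of the convex objective
lr = 1.0 / np.linalg.eigvalsh(H).max()    # step size small enough to converge
mu = mu_hat.copy()
for _ in range(5000):
    grad = -C.T @ Pi_z @ (y - C @ mu) + Pi_x @ (mu - mu_hat)
    mu -= lr * grad

print(np.max(np.abs(mu - mu_kalman)))     # negligibly small: the two estimates agree
```

With far fewer iterations (e.g. 5, as in the experiments), the gradient estimate is an approximation rather than an exact match, which is the regime studied in Figure 2.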
Figure 2: Tracking performance of our gradient filter compared to the true values and the analytical Kalman filter. The first row shows accurate tracking over 2000 timesteps. The second row zooms in on a 100 timestep period to demonstrate tracking performance in miniature and the effect of few gradient updates.

Next, we demonstrate the adaptive capabilities of our algorithm. In Figure 3, we show the performance of our algorithm in predicting the position, velocity, and acceleration of the body when provided with a faulty A matrix. Using equation 7, our model learns the A matrix online via gradient descent. To ensure numerical stability, a very small learning rate must be used. The entries of the A matrix given to the algorithm were initialized as random Gaussian noise with a mean of 0 and a standard deviation of 1. The performance of the algorithm without learning the A matrix is also shown; estimation performance is completely degraded without the adaptive learning. The adaptive process converges remarkably quickly. It is interesting, moreover, to compare the matrix coefficients learned through the Hebbian plasticity to the known true coefficients. Often they do not match the true values, and yet the network is able to approximate Kalman filtering almost exactly. Precisely how this works is an area for future exploration. If a system similar to this is implemented in the brain, then this could imply that the dynamics model inherent in the synaptic weight matrix should not necessarily be interpretable.

We also show (second row of Figure 3) that, perhaps surprisingly, both the A and B matrices can be learned simultaneously. In the simulations presented below, either only the A matrix, or both the A and B matrices, were initialized with random Gaussian coefficients, and the network learned to obtain accurate estimates of the hidden state in these cases. The results of only learning the B matrix were extremely similar to those for the A matrix.
For conciseness, these results were not included. Interested readers are encouraged to look at the NKF A B matrix.ipynb file in the online code, where these experiments were run.
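As an illustration of how the Hebbian rule of equation (7) identifies dynamics, the following sketch learns A from a stream of fully observed states. This is our own toy setup: in the paper the rule operates on estimated means (where the learned coefficients need not match the true ones), whereas here, with fully observed states and i.i.d. inputs, the rule does recover the true matrix. The kinematic A matrix, noise scale, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, dt = 3, 0.1
A_true = np.array([[1.0, dt, 0.0], [0.0, 1.0, dt], [0.0, 0.0, 1.0]])
A_est = rng.standard_normal((n, n))   # random Gaussian initialization, as in the text
Pi_x = np.eye(n)                      # dynamics precision (identity for simplicity)

lr = 0.01                             # small learning rate for numerical stability
for _ in range(5000):
    x = rng.standard_normal(n)                           # observed previous state
    x_next = A_true @ x + 0.01 * rng.standard_normal(n)  # next state under true dynamics
    eps_x = x_next - A_est @ x                           # dynamical prediction error
    A_est += lr * Pi_x @ np.outer(eps_x, x)              # Hebbian: error times presynaptic state
```

Each weight update is the outer product of a (precision-weighted) postsynaptic error and the presynaptic state, which is exactly the third-factor Hebbian form discussed above.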
Figure 3: Filtering performance when adaptively learning the A matrix (first row) or both the A and B matrices in concert (second row). The filtering behaviour of the randomly initialized filters without adaptive learning is also shown.

We also tried adaptively learning the C matrix using equation 9, but all attempts to do so failed. Although the exact reason is unclear, we hypothesise that an incorrect C matrix corrupts the observations, which provide the only "source of truth" to the system. If the dynamics are completely unknown but the observations are known, then the true state of the system must be at least approximately near that implied by the observations, and the dynamics can be inferred from that. On the other hand, if the dynamics are known but the observation mapping is unknown, then the actual state of the system could lie on any of a large number of possible dynamical trajectories whose exact specifics are underspecified. The network thus learns a C matrix which corresponds to some dynamical trajectory that succeeds in minimizing the loss function, but which is completely dissimilar to the actual trajectory the system undergoes. This can be seen by plotting the loss obtained according to equation 5 (Figure 4), which rapidly decreases even though the estimate diverges from the true values.
Discussion
Thus far, we have derived a gradient descent approximation to the Kalman filtering equations. We have shown that the estimated state converges rapidly to the exact analytical result. The gradients imply an update rule which is a simple sum of variance weighted prediction errors, and which can be straightforwardly computed in a network of rate-coded integrate-and-fire neurons.

We have also shown that our gradient interpretation allows one to derive simple rules for learning the dynamics matrices A and B, which correspond to Hebbian plasticity, and shown empirically that these rules can be used to obtain accurate filtering performance even when the initial dynamics matrices provided to the system are Gaussian noise.
Figure 4: Very poor tracking behaviour with a learnt C matrix, despite the fact that the Bayesian loss function rapidly decreases to a minimum. This shows that the filter can find a prediction-error minimizing "solution" which departs almost arbitrarily from reality if the C matrix is randomized.

Since our model requires only prediction errors and Hebbian plasticity, we have claimed that it is biologically plausible. We here interpret biological plausibility to require local and simple (linear) computation. This means that the computation occurring at each "neuron" can only require information that could in principle be available at that neuron or synapse. Additionally, we have assumed that neurons can only perform simple operations on their inputs, primarily summation of incoming weighted inputs, and that synaptic weights are multiplicative.

However, there are several deficiencies of our model in terms of biological plausibility which it is important to state. Our model assumes full connectivity for the "diffuse" connections required to implement matrix multiplications, and in other cases it requires one-to-one excitatory connectivity; neither constraint is fully upheld in neural circuitry. Additionally, in one case (that of the "C matrix" between the populations of neurons representing the estimate and the sensory prediction errors), we have assumed a complete symmetry of backward and forward weights, such that the connections which embody the C matrix downwards also implement the $C^T$ matrix when traversing upwards. This constraint is also not satisfied within the brain. Finally, our model can represent negative numbers in states or prediction errors, which rate-coded neurons cannot. The robustness of our method to the
The robustness of our method to the tightening of these constraints is thus an interesting topic for future investigation.

PREPRINT - 22nd February, 2021

We believe, however, that despite some lack of biological plausibility, our model is useful in that it shows how a standard engineering algorithm can be derived in a way more amenable to neural computation, and provides a sketch of how it could be implemented in the brain. Moreover, we hope to draw attention to Bayesian filtering algorithms and how they can be implemented neurally, rather than only Bayesian inference on static posteriors.

While our implementation uses the prior state derived from the previous output of the algorithm, it is also possible that in the brain the prior state could be set by, or influenced by, feedback connections from higher levels, representing more advanced stages of processing which could inform the state estimate. We believe this is an important point about how Kalman filtering may fit into the larger picture. Since Kalman filtering turns out to be a relatively straightforward algorithm which can be implemented in a rather small neural circuit, it could perhaps serve as a composable building block of cortical processing. Kalman filters could potentially be implemented at the lowest hierarchical levels to achieve some immediate processing and reduction of sensory noise before passing improved state estimates on to higher levels, which could then apply more complex nonlinear filtering algorithms. Using a Kalman filter locally, at the bottom of the hierarchy, would also mitigate the primary disadvantage of the Kalman filter: its assumption of linearity. If restricted to dealing with local states and observations, the linearity approximation would become substantially more accurate, and the Kalman filter could provide valuable noise reduction and sensor fusion to state estimators at higher levels.
Meanwhile, the higher levels would provide more global contextualization of the prior state at each filtering step.
Conclusion
We have shown that it is possible to derive a gradient descent approximation to the Kalman filter which requires only local computation and simple operations such as the summation of variance-weighted prediction errors. We have proposed a simple neural architecture of rate-coded integrate-and-fire neurons which can implement this algorithm. We have additionally shown that the gradient descent approach also provides a learning rule for adaptively learning the dynamics model which is identical to Hebbian plasticity. Finally, we have shown that our algorithm rapidly converges to the estimate of the exact Kalman filter, and that the Hebbian plasticity rule we have derived enables dynamics models to be learnt online during filtering.
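As a sanity check on this convergence claim, the convergence of a gradient-descent update on variance-weighted prediction errors to the exact Kalman estimate can be demonstrated numerically. The following NumPy sketch is our own illustration, not the paper's code; the tracking system, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

def kalman_step(mu, Sigma, y, A, C, Sigma_x, Sigma_z):
    """One exact Kalman filter step (no control input, for brevity)."""
    mu_hat = A @ mu                                # projected mean
    Sigma_hat = A @ Sigma @ A.T + Sigma_x          # projected covariance
    K = Sigma_hat @ C.T @ np.linalg.inv(C @ Sigma_hat @ C.T + Sigma_z)
    mu_new = mu_hat + K @ (y - C @ mu_hat)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_hat
    return mu_new, Sigma_new

def gradient_step(mu, Sigma, y, A, C, Sigma_x, Sigma_z, lr=0.02, n_iters=4000):
    """Gradient descent on the quadratic loss, using only the two
    precision-weighted prediction errors (the local computations)."""
    mu_hat = A @ mu
    Sigma_hat = A @ Sigma @ A.T + Sigma_x
    mu_est = mu_hat.copy()
    for _ in range(n_iters):
        e_z = np.linalg.solve(Sigma_z, y - C @ mu_est)     # sensory prediction error
        e_x = np.linalg.solve(Sigma_hat, mu_est - mu_hat)  # dynamics prediction error
        mu_est = mu_est + lr * (C.T @ e_z - e_x)
    return mu_est

# simulate a noisy constant-velocity system and run both filters side by side
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
C = np.eye(2)
Sigma_x, Sigma_z = 0.01 * np.eye(2), 0.1 * np.eye(2)

x = np.zeros(2)
mu_kf = np.zeros(2)
mu_gd = np.zeros(2)
Sigma = 0.1 * np.eye(2)
max_dev = 0.0
for _ in range(20):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Sigma_x)
    y = C @ x + rng.multivariate_normal(np.zeros(2), Sigma_z)
    mu_gd = gradient_step(mu_gd, Sigma, y, A, C, Sigma_x, Sigma_z)
    mu_kf, Sigma = kalman_step(mu_kf, Sigma, y, A, C, Sigma_x, Sigma_z)
    max_dev = max(max_dev, float(np.max(np.abs(mu_gd - mu_kf))))
```

Because the per-step loss is convex and quadratic, the inner gradient loop converges to the exact Kalman posterior mean whenever the step size is small relative to the largest eigenvalue of the loss Hessian; here `max_dev` ends up at numerical precision.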
Appendix A: Derivation of Kalman Filtering Equations from Bayes’ Rule
In this appendix we derive the Kalman filtering equations directly from Bayes' rule. The first step is to derive the projected covariance (all expectations are taken around the mean, so they denote covariances and the deterministic control term drops out):

E[\hat{x}_{t+1} \hat{x}_{t+1}^T] = E[(Ax + Bu + \omega)(Ax + Bu + \omega)^T]   (11)
= E[A x x^T A^T] + E[A x u^T B^T] + E[A x \omega^T] + E[B u x^T A^T] + E[B u \omega^T] + E[\omega x^T A^T] + E[\omega u^T B^T] + E[\omega \omega^T]   (12)
= A E[x x^T] A^T + E[\omega \omega^T]   (13)
= A \Sigma_x(t) A^T + \Sigma_\omega   (14)

Step 12 uses the facts that the matrices A and B are constant, so they can be taken out of the expectation operator, and that the covariances between the state, the noise, and the control — E[x u^T], E[x \omega^T], E[u \omega^T] — are assumed to be 0. Step 13 uses the facts that E[x x^T] = \Sigma_x(t) and E[\omega \omega^T] = \Sigma_\omega.

Next we optimize the following loss function, derived from Bayes' rule above (equation 5):

L = (y - C\mu_{t+1})^T \Sigma_z^{-1} (y - C\mu_{t+1}) + (\mu_{t+1} - A\mu_t - Bu_t)^T \hat{\Sigma}_x^{-1} (\mu_{t+1} - A\mu_t - Bu_t)   (15)

To obtain the Kalman estimate for \mu_{t+1}, we simply take the derivative of the loss, set it to zero, and solve analytically.
Writing R for the observation noise covariance \Sigma_z and \Sigma_x for the projected covariance \hat{\Sigma}_x, we first expand the loss into a convenient quadratic form:

L = \mu_{t+1}^T [C^T R^{-1} C + \Sigma_x^{-1}] \mu_{t+1} - \mu_{t+1}^T [C^T R^{-1} y + \Sigma_x^{-1} A\mu_t + \Sigma_x^{-1} B u_t] - [y^T R^{-1} C + \mu_t^T A^T \Sigma_x^{-1} + u_t^T B^T \Sigma_x^{-1}] \mu_{t+1} + const   (16)

\frac{dL}{d\mu_{t+1}} = 2[C^T R^{-1} C + \Sigma_x^{-1}] \mu_{t+1} - 2[C^T R^{-1} y + \Sigma_x^{-1}(A\mu_t + B u_t)] = 0   (17)

\mu_{t+1} = [C^T R^{-1} C + \Sigma_x^{-1}]^{-1} [C^T R^{-1} y + \Sigma_x^{-1}(A\mu_t + B u_t)]   (18)
= [\Sigma_x - \Sigma_x C^T [C \Sigma_x C^T + R]^{-1} C \Sigma_x] [C^T R^{-1} y + \Sigma_x^{-1}(A\mu_t + B u_t)]   (19)
= [\Sigma_x - K C \Sigma_x] [C^T R^{-1} y + \Sigma_x^{-1}(A\mu_t + B u_t)]   (20)
= A\mu_t + B u_t + \Sigma_x C^T R^{-1} y - K C \Sigma_x C^T R^{-1} y - K C (A\mu_t + B u_t)   (21)
= \hat{\mu}_{t+1} - K C \hat{\mu}_{t+1} + [\Sigma_x C^T R^{-1} - K C \Sigma_x C^T R^{-1}] y   (22)
= \hat{\mu}_{t+1} - K C \hat{\mu}_{t+1} + K K^{-1} [\Sigma_x C^T R^{-1} - K C \Sigma_x C^T R^{-1}] y   (23)
= \hat{\mu}_{t+1} - K C \hat{\mu}_{t+1} + K [(C \Sigma_x C^T + R)(\Sigma_x C^T)^{-1} \Sigma_x C^T R^{-1} - C \Sigma_x C^T R^{-1}] y   (24)
= \hat{\mu}_{t+1} - K C \hat{\mu}_{t+1} + K [(C \Sigma_x C^T + R) R^{-1} - C \Sigma_x C^T R^{-1}] y   (25)
= \hat{\mu}_{t+1} - K C \hat{\mu}_{t+1} + K y   (26)
= \hat{\mu}_{t+1} + K [y - C \hat{\mu}_{t+1}]   (27)

where K = \Sigma_x C^T [C \Sigma_x C^T + R]^{-1} is the Kalman gain and \hat{\mu}_{t+1} = A\mu_t + B u_t is the projected mean.

The first few steps rearrange the loss function into a convenient form and then derive an expression for \mu_{t+1} directly. Step 19 applies the Woodbury matrix inversion lemma to the [C^T R^{-1} C + \Sigma_x^{-1}]^{-1} term. The next step rewrites the formula in terms of the Kalman gain matrix K and multiplies it through. The other major manipulation is the multiplication of the last term of equation 22 by K K^{-1}, which is valid since K K^{-1} = I. This derives the optimal posterior mean as the analytical solution to the optimization problem.
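The equivalence between the direct solution of the normal equations (equation 18) and the Kalman-gain form (equation 27), together with the Woodbury matrix inversion identity used in between, can be checked numerically. The following NumPy sketch is our own illustration, using arbitrary random matrices (R here denotes the observation noise covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2                                  # state and observation dimensions

def random_spd(k):
    """A random, well-conditioned symmetric positive-definite covariance."""
    M = rng.normal(size=(k, k))
    return M @ M.T + k * np.eye(k)

A = rng.normal(size=(n, n))                  # dynamics matrix
B = rng.normal(size=(n, 1))                  # control matrix
C = rng.normal(size=(m, n))                  # observation matrix
R = random_spd(m)                            # observation noise covariance
Sigma_x = random_spd(n)                      # projected state covariance
mu_t = rng.normal(size=n)
u_t = rng.normal(size=1)
y = rng.normal(size=m)

mu_hat = A @ mu_t + B @ u_t                  # projected mean
Rinv = np.linalg.inv(R)
Sxinv = np.linalg.inv(Sigma_x)

# direct solution of the stationarity condition (normal equations)
mu_direct = np.linalg.solve(C.T @ Rinv @ C + Sxinv,
                            C.T @ Rinv @ y + Sxinv @ mu_hat)

# Kalman-gain form of the same solution
K = Sigma_x @ C.T @ np.linalg.inv(C @ Sigma_x @ C.T + R)
mu_kalman = mu_hat + K @ (y - C @ mu_hat)

# Woodbury identity connecting the two forms
woodbury_lhs = np.linalg.inv(C.T @ Rinv @ C + Sxinv)
woodbury_rhs = (np.eye(n) - K @ C) @ Sigma_x
```

Both forms agree to numerical precision; the Kalman-gain form is preferred in practice because it inverts an m-by-m matrix in observation space rather than an n-by-n matrix in state space.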
Deriving the optimal covariance is straightforward and proceeds as follows:

E[\mu_{t+1} \mu_{t+1}^T] = E[(\hat{\mu}_{t+1} + K y - K C \hat{\mu}_{t+1})(\hat{\mu}_{t+1} + K y - K C \hat{\mu}_{t+1})^T]   (28)
= E[\hat{\mu}_{t+1} \hat{\mu}_{t+1}^T] - E[\hat{\mu}_{t+1} \hat{\mu}_{t+1}^T] C^T K^T - K C E[\hat{\mu}_{t+1} \hat{\mu}_{t+1}^T] + K E[y y^T] K^T + K C E[\hat{\mu}_{t+1} \hat{\mu}_{t+1}^T] C^T K^T   (29)
= \Sigma_{\hat{\mu}_{t+1}} - \Sigma_{\hat{\mu}_{t+1}} C^T K^T - K C \Sigma_{\hat{\mu}_{t+1}} + K [R + C \Sigma_{\hat{\mu}_{t+1}} C^T] K^T   (30)
= \Sigma_{\hat{\mu}_{t+1}} - \Sigma_{\hat{\mu}_{t+1}} C^T K^T - K C \Sigma_{\hat{\mu}_{t+1}} + \Sigma_{\hat{\mu}_{t+1}} C^T [C \Sigma_{\hat{\mu}_{t+1}} C^T + R]^{-1} [R + C \Sigma_{\hat{\mu}_{t+1}} C^T] K^T   (31)
= \Sigma_{\hat{\mu}_{t+1}} - \Sigma_{\hat{\mu}_{t+1}} C^T K^T - K C \Sigma_{\hat{\mu}_{t+1}} + \Sigma_{\hat{\mu}_{t+1}} C^T K^T   (32)
= \Sigma_{\hat{\mu}_{t+1}} - K C \Sigma_{\hat{\mu}_{t+1}}   (33)
= [I - K C] \Sigma_{\hat{\mu}_{t+1}}   (34)

which is the Kalman update equation for the optimal covariance. The second line follows from the assumption that E[x y^T] = 0. In equation 31 the definition of the Kalman gain is substituted back in, and the two C \Sigma_{\hat{\mu}_{t+1}} C^T + R terms cancel.

Appendix B: Relationship between Predictive Coding and Kalman Filtering
In this appendix we clarify the relationship between predictive coding and Kalman filtering. To jump ahead, we demonstrate that predictive coding and Kalman filtering optimize the same objective – effectively performing maximum-a-posteriori inference over a trajectory of states and observations – however, predictive coding optimizes the objective
via gradient descent on the mean state variable \mu_t, while Kalman filtering instead solves this (convex) optimization problem analytically, resulting in an exact single-step update. Due to its exact nature, the Kalman filter solution should be preferred for engineering applications; however, the gradient descent equations of predictive coding are simple, elegant, and local in nature, allowing them in theory to be performed efficiently by distributed neural circuitry.

Predictive coding is usually derived as a variational inference algorithm. Variational inference is a class of approximate Bayesian inference methods used to turn the problem of computing a posterior directly into the problem of optimizing the parameters \theta of a separate variational distribution q(x; \theta) to be as close as possible to the true posterior. In most cases this algorithm is approximate, as the variational distribution is chosen from a simpler family of distributions than the true posterior so as to allow for efficient computation. In this case, however, the variational family is of the same family as the true posterior (both are Gaussian), and so the usually approximate variational inference method becomes exact.

Variational inference proceeds by defining a variational distribution q(x; \theta), which is to be optimized, and minimizing the divergence between this distribution and the true posterior. Since the true posterior is assumed to be intractable (which is why we would use variational inference in the first place), we cannot directly minimize the divergence between them, but instead minimize a tractable variational bound, called the variational free energy F:

F(y) = D_{KL}[q(x; \theta) || p(y, x)]
     = D_{KL}[q(x; \theta) || p(x | y)] - \ln p(y)
     \geq D_{KL}[q(x; \theta) || p(x | y)]   (35)

To evaluate F, it is necessary to specify the variational density q(x; \theta) and the generative model p(y, x).
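The decomposition in equation 35 can be verified numerically in a one-dimensional conjugate Gaussian model, where the free energy, the KL divergence to the true posterior, and the log marginal likelihood all have closed forms. A self-contained sketch; all numbers and helper names here are our own arbitrary choices:

```python
import numpy as np

def gauss_logpdf(v, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * np.log(2 * np.pi * var) - (v - mean) ** 2 / (2 * var)

# generative model: x ~ N(0, s0), y | x ~ N(x, sy); variances s0, sy
s0, sy = 2.0, 0.5
y = 1.3
# variational distribution q(x) = N(mq, sq)
mq, sq = 0.4, 0.7

# free energy F = -E_q[ln p(y, x)] - H[q], with the expectations in closed form
E_lik = gauss_logpdf(y, mq, sy) - sq / (2 * sy)       # E_q[ln p(y | x)]
E_prior = gauss_logpdf(mq, 0.0, s0) - sq / (2 * s0)   # E_q[ln p(x)]
H_q = 0.5 * np.log(2 * np.pi * np.e * sq)             # Gaussian entropy
F = -(E_lik + E_prior) - H_q

# exact posterior p(x | y) and marginal likelihood p(y)
sp = 1.0 / (1.0 / s0 + 1.0 / sy)                      # posterior variance
mp = sp * y / sy                                      # posterior mean
kl = 0.5 * (np.log(sp / sq) + (sq + (mq - mp) ** 2) / sp - 1.0)
log_marginal = gauss_logpdf(y, 0.0, s0 + sy)
```

Here F equals the KL divergence to the true posterior minus the log marginal likelihood, so (since the KL term is non-negative) F acts as a tractable surrogate objective for posterior inference.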
In the case of Kalman filtering, we are interested in performing variational inference over full trajectories y_{1:T}, x_{1:T} of observations and hidden states. Importantly, it is straightforward to demonstrate that with a Markov factorization of the generative model,

p(y_{1:T}, x_{1:T}) = p(y_1 | x_1) p(x_1) \prod_{t=2}^{T} p(y_t | x_t) p(x_t | x_{t-1}),

and a mean-field temporal factorization of the variational density, so that it is independent across timesteps, q(x_{1:T}; \theta) = \prod_{t=1}^{T} q(x_t; \theta), the free energy of the trajectory factorizes into independently optimizable free energies for each timestep:

F(y_{1:T}) = \sum_{t=1}^{T} F_t(y_t),   F_t(y_t) = D_{KL}[q(x_t; \theta) || p(y_t, x_t | x_{t-1})]   (36)

This temporal factorization of the free energy means that the minimization at each timestep is independent of the others, so we need only consider the minimization at a single timestep to understand the solution, since all timesteps are identical in terms of the solution method. Applying the linear Gaussian assumptions of the Kalman filter, we can specify our generative model in terms of Gaussian distributions:

p(y_t, x_t | x_{t-1}) = p(y_t | x_t) p(x_t | x_{t-1}) = \mathcal{N}(y_t; C x_t, \Sigma_z) \mathcal{N}(x_t; A x_{t-1}, \Sigma_x)   (37)

Since we know the posterior is Gaussian, it makes sense to also use a Gaussian distribution for the variational approximate distribution.
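The Markov factorization used above can be checked numerically: for a linear Gaussian state-space model, the product of the per-timestep conditionals p(y_t | x_t) p(x_t | x_{t-1}) must equal the density of the whole trajectory viewed as a single joint Gaussian. A minimal one-dimensional, two-timestep sketch (our own construction; all constants are arbitrary):

```python
import numpy as np

def g(v, mean, var):
    """1-D Gaussian log density."""
    return -0.5 * np.log(2 * np.pi * var) - (v - mean) ** 2 / (2 * var)

def mvn_logpdf(z, cov):
    """Zero-mean multivariate Gaussian log density."""
    k = len(z)
    return -0.5 * (k * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + z @ np.linalg.solve(cov, z))

a, c = 0.8, 1.5            # scalar dynamics and observation weights
p0, q, r = 1.0, 0.3, 0.2   # prior, dynamics-noise, and observation-noise variances

# the trajectory z = (x1, x2, y1, y2) is a linear map of independent noises
# e = (x1-noise, dynamics noise, obs noise 1, obs noise 2), so its joint
# covariance is M diag(p0, q, r, r) M^T
M = np.array([[1.0,     0.0, 0.0, 0.0],
              [a,       1.0, 0.0, 0.0],
              [c,       0.0, 1.0, 0.0],
              [c * a,   c,   0.0, 1.0]])
D = np.diag([p0, q, r, r])
cov = M @ D @ M.T

x1, x2, y1, y2 = 0.5, -0.2, 0.7, 0.1   # an arbitrary trajectory

# joint log density vs. the Markov-factorized sum of conditionals
lhs = mvn_logpdf(np.array([x1, x2, y1, y2]), cov)
rhs = g(x1, 0.0, p0) + g(x2, a * x1, q) + g(y1, c * x1, r) + g(y2, c * x2, r)
```

The two quantities agree exactly (the noise-to-trajectory map M has unit determinant), which is the factorization that lets the trajectory free energy split into per-timestep terms.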
Importantly, for predictive coding we make an additional assumption – the Laplace approximation – which characterises the variance of this Gaussian as an analytic function of the mean, thus defining

q(x_t; \theta) = \mathcal{N}(x_t; \mu_t, \sigma(\mu_t))   (38)

where \theta = [\mu_t, \sigma(\mu_t)] are the parameters of the variational distribution – in this case a mean and a variance, since we have assumed a Gaussian variational distribution. (In the literature, the negative of the bound in equation 35 is also called the Evidence Lower Bound, or ELBO. Additionally, to make p(y_t, x_t | x_{t-1}) not a function of x_{t-1}, we implicitly average over our estimate of x_{t-1} from the previous timestep: p(x_t | x_{t-1}) = E_{q(x_{t-1})}[p(x_t | x_{t-1})].) With the variational distribution and generative model precisely specified, it is now possible to explicitly evaluate the variational free energy for a specific timestep:

F_t(y_t) = D_{KL}[q(x_t; \theta) || p(y_t, x_t | x_{t-1})] = -E_{q(x_t; \theta)}[\ln p(y_t, x_t | x_{t-1})] - H[q(x_t; \theta)]   (39)

where the second term is the entropy of the variational distribution. Since we are only interested in minimizing with respect to the mean \mu_t, and the expression for the entropy of a Gaussian does not depend on the mean, we can ignore this entropy term in subsequent steps. The key quantity is the 'energy' term E_{q(x_t; \theta)}[\ln p(y_t, x_t | x_{t-1})].
Since the Laplace approximation ensures that most of the probability mass lies near the mode \mu_t of the variational distribution, we can approximate the expectation well using a Taylor expansion to second order around the mode:

E_{q(x_t; \theta)}[\ln p(y_t, x_t | x_{t-1})] \approx \ln p(y_t, \mu_t | \mu_{t-1}) + \frac{\partial \ln p(y_t, x_t | x_{t-1})}{\partial x_t}\Big|_{x_t = \mu_t} \underbrace{(E[x_t] - \mu_t)}_{=0} + \frac{1}{2} \frac{\partial^2 \ln p(y_t, x_t | x_{t-1})}{\partial x_t^2}\Big|_{x_t = \mu_t} \underbrace{E[(x_t - \mu_t)^2]}_{=\sigma}   (40)

Since the first-order term vanishes, as E[x_t] - \mu_t = \mu_t - \mu_t = 0, and we can neglect the second-order term since it contributes only through \sigma and not \mu_t, the only term that matters for the minimization is the zeroth-order term \ln p(y_t, \mu_t | \mu_{t-1}). This means that we can write the overall optimization problem solved by predictive coding as

argmin_{\mu_t} F_t(y_t) = argmin_{\mu_t} -\ln p(y_t, \mu_t | \mu_{t-1})   (41)

which is the same as the MAP optimization problem presented in equation 4. This means that ultimately the variational inference problem solved by predictive coding and the MAP estimation problem solved by the Kalman filter are the same. Although the interpretation of \mu_t differs slightly – from being a parameter of a Gaussian variational distribution to simply being a variable in the generative model – the actual update rules involving \mu_t are the same in both cases. Thus the differences between the MAP Kalman filtering objective and the variational predictive coding objective under the Laplace approximation are effectively only interpretational.

References
Angelaki, D. E., Gu, Y., and DeAngelis, G. C. (2009). Multisensory integration: psychophysics, neurophysiology, and computation. Current Opinion in Neurobiology, 19(4):452–458.
Arulampalam, M. S., Maskell, S., Gordon, N., and Clapp, T. (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188.
Baltieri, M. and Buckley, C. L. (2020). On Kalman-Bucy filters, linear quadratic control and active inference. arXiv preprint arXiv:2005.06269.
Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., and Friston, K. J. (2012). Canonical microcircuits for predictive coding. Neuron, 76(4):695–711.
Beck, J., Ma, W., Latham, P., and Pouget, A. (2007). Probabilistic population codes and the exponential family of distributions. Progress in Brain Research, 165:509–519.
de Xivry, J.-J. O., Coppe, S., Blohm, G., and Lefevre, P. (2013). Kalman filtering naturally accounts for visually guided and predictive smooth pursuit dynamics. Journal of Neuroscience, 33(44):17301–17313.
Deneve, S., Duhamel, J.-R., and Pouget, A. (2007). Optimal sensorimotor integration in recurrent cortical networks: a neural implementation of Kalman filters. Journal of Neuroscience, 27(21):5744–5756.
Doucet, A., Godsill, S., and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208.
Doya, K., Ishii, S., Pouget, A., and Rao, R. P. (2007). Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press.
Ernst, M. O. and Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429.
Friston, K. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456):815–836.
Friston, K. (2008). Hierarchical models in the brain. PLoS Computational Biology, 4(11):e1000211.
Friston, K. J., Trujillo-Barreto, N., and Daunizeau, J. (2008). DEM: a variational treatment of dynamic systems. NeuroImage, 41(3):849–885.
Gold, J. I. and Shadlen, M. N. (2003). The influence of behavioral context on the representation of a perceptual decision in developing oculomotor commands. Journal of Neuroscience, 23(2):632–651.
Gordon, N. J., Salmond, D. J., and Smith, A. F. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113. IET.
Grewal, M. S. and Andrews, A. P. (2010). Applications of Kalman filtering in aerospace 1960 to the present [historical perspectives]. IEEE Control Systems Magazine, 30(3):69–78.
Harvey, A. C. (1990). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Jazwinski, A. (1970). Stochastic Processes and Filtering Theory.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45.
Kalman, R. E. and Bucy, R. S. (1961). New results in linear filtering and prediction theory. Journal of Basic Engineering, 83(1):95–108.
Kersten, D., Mamassian, P., and Yuille, A. (2004). Object perception as Bayesian inference. Annual Review of Psychology, 55:271–304.
Knill, D. C. and Pouget, A. (2004). The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12):712–719.
Körding, K. P. and Wolpert, D. M. (2004). Bayesian integration in sensorimotor learning. Nature, 427(6971):244.
Kutschireiter, A. (2018). Nonlinear Filtering in Neuroscience: Theory and Application. PhD thesis, University of Zurich.
Kutschireiter, A., Surace, S. C., Sprekeler, H., and Pfister, J.-P. (2015). The neural particle filter. arXiv preprint arXiv:1508.06818.
Leondes, C. T. (1970). Theory and applications of Kalman filtering. Technical report, Advisory Group for Aerospace Research and Development, Neuilly-sur-Seine (France).
Ma, W. J., Beck, J. M., Latham, P. E., and Pouget, A. (2006). Bayesian inference with probabilistic population codes. Nature Neuroscience, 9(11):1432.
Ma, W. J., Beck, J. M., and Pouget, A. (2008). Spiking networks for Bayesian inference and choice. Current Opinion in Neurobiology, 18(2):217–222.
Millidge, B., Tschantz, A., Seth, A., and Buckley, C. L. (2020). Relaxing the constraints on predictive coding models. arXiv preprint arXiv:2010.01047.
Munuera, J., Morel, P., Duhamel, J.-R., and Deneve, S. (2009). Optimal sensorimotor control in eye movement sequences. Journal of Neuroscience, 29(10):3026–3035.
Pouget, A., Beck, J. M., Ma, W. J., and Latham, P. E. (2013). Probabilistic brains: knowns and unknowns. Nature Neuroscience, 16(9):1170.
Pouget, A., Dayan, P., and Zemel, R. (2000). Information processing with population codes. Nature Reviews Neuroscience, 1(2):125.
Rao, R. P. and Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79.
Ribeiro, M. I. (2004). Kalman and extended Kalman filters: concept, derivation and properties. Institute for Systems and Robotics, 43:46.
Sanborn, A. N. and Chater, N. (2016). Bayesian brains without probabilities. Trends in Cognitive Sciences, 20(12):883–893.
Schneider, W. (1988). Analytical uses of Kalman filtering in econometrics—a survey. Statistical Papers, 29(1):3–33.
Simoncelli, E. P. (2009). Optimal estimation in sensory systems. The Cognitive Neurosciences, IV, pages 525–535.
Spratling, M. W. (2008). Reconciling predictive coding and biased competition models of cortical function. Frontiers in Computational Neuroscience, 2:4.
Spratling, M. W. (2017). A review of predictive coding algorithms. Brain and Cognition, 112:92–97.
Stengel, R. F. (1994). Optimal Control and Estimation. Courier Corporation.
Tenenbaum, J. B., Griffiths, T. L., and Kemp, C. (2006). Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7):309–318.
Todorov, E. (2004). Optimality principles in sensorimotor control. Nature Neuroscience, 7(9):907.
Wan, E. A. and Van Der Merwe, R. (2000). The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373), pages 153–158. IEEE.
Wilson, R. and Finkel, L. (2009). A neural implementation of the Kalman filter. In Advances in Neural Information Processing Systems, pages 2062–2070.
Zago, M., McIntyre, J., Senot, P., and Lacquaniti, F. (2008). Internal models and prediction of visual gravitational motion.