Contrastive Similarity Matching for Supervised Learning
Shanshan Qin, Nayantara Mudur, and Cengiz Pehlevan
John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
Department of Physics, Harvard University, Cambridge, MA, USA
Abstract
We propose a novel biologically-plausible solution to the credit assignment problem motivated by observations in the ventral visual pathway and trained deep neural networks. In both, representations of objects in the same category become progressively more similar, while objects belonging to different categories become less similar. We use this observation to motivate a layer-specific learning goal in a deep network: each layer aims to learn a representational similarity matrix that interpolates between previous and later layers. We formulate this idea using a contrastive similarity matching objective function and derive from it deep neural networks with feedforward, lateral, and feedback connections, and neurons that exhibit biologically-plausible Hebbian and anti-Hebbian plasticity. Contrastive similarity matching can be interpreted as an energy-based learning algorithm, but with significant differences from others in how a contrastive function is constructed.
Synaptic plasticity is generally accepted as the underlying mechanism of learning in the brain, which almost always involves a large population of neurons and synapses across many different brain regions. How the brain modifies and coordinates individual synapses in the face of the limited information available to each synapse in order to achieve a global learning task, the credit assignment problem, has puzzled scientists for decades. A major effort in this domain has been to look for a biologically-plausible implementation of the back-propagation of error algorithm (BP) (Rumelhart et al., 1986), which has long been disputed due to its biological implausibility (Crick, 1989), although recent studies have made progress in resolving some of these concerns (Xie and Seung, 2003; Lee et al., 2015; Lillicrap et al., 2016; Nøkland, 2016; Scellier and Bengio, 2017; Guerguiev et al., 2017; Whittington and Bogacz, 2017; Sacramento et al., 2018; Richards and Lillicrap, 2019; Whittington and Bogacz, 2019; Belilovsky et al., 2018; Ororbia and Mali, 2019; Lillicrap et al., 2020).

In this paper, we present a novel approach to the credit assignment problem, motivated by observations on the nature of hidden layer representations in the ventral visual pathway of the brain and deep neural networks. In both, representations of objects belonging to different categories become less similar, while representations of objects belonging to the same category become more similar (Grill-Spector and Weiner, 2014; Kriegeskorte et al., 2008; Yamins and DiCarlo, 2016). In other words, categorical clustering of representations becomes more and more explicit in the later layers (Fig. 1). These results suggest a new approach to the credit assignment problem. By assigning each layer a layer-local similarity matching task (Pehlevan and Chklovskii, 2019; Obeid et al., 2019), whose goal is to learn an intermediate representational similarity matrix between previous and later layers, we may be able to avoid the need for backward propagation of errors (Fig. 1). Motivated by this idea and previous observations that error signals can be implicitly propagated via the change of neural activities (Hinton and McClelland, 1988; Scellier and Bengio, 2017), we propose a biologically plausible supervised learning algorithm, the contrastive similarity matching (CSM) algorithm.

Figure 1: Supervised learning via layer-wise similarity matching. For inputs of different categories, similarity matching differentiates the representations progressively (top), while for objects of the same category, representations become more and more similar (middle). For a given set of training data and their corresponding labels, the training process can be regarded as learning hidden representations whose similarity matrices match those of both input and output (bottom). The tuning of representational similarity is indicated by the springs, with the constraint that the input and output similarity matrices are fixed.

Contrastive training (Anderson and Peterson, 1987; Movellan, 1991; Baldi and Pineda, 1991) has been used to learn the energy landscapes of neural networks (NNs) whose dynamics minimize an energy function. Examples include influential algorithms like Contrastive Hebbian Learning (CHL) (Movellan, 1991) and Equilibrium Propagation (EP) (Scellier and Bengio, 2017), where weight updates rely on the difference of the neural activity between a free phase and a clamped (CHL) or nudged (EP) phase to locally approximate the gradient of an error signal. The learning process can be interpreted as minimizing a contrastive function, which reshapes the energy landscape to eliminate spurious fixed points and make the desired fixed point more stable.

The CSM algorithm applies this idea to a contrastive function formulated by nudging the output neurons of a multilayer similarity matching objective function (Obeid et al., 2019). As a consequence, the hidden layers learn intermediate representations between their previous and later layers.
From the CSM contrastive function, we derive deep neural networks with feedforward, lateral, and feedback connections, and neurons that exhibit biologically-plausible Hebbian and anti-Hebbian plasticity. The nudged phase of the CSM algorithm is analogous to the nudged phase of EP, but differs from it. It performs Hebbian feedforward and anti-Hebbian lateral updates. CSM has the opposite sign for the lateral connection updates compared with EP and CHL, because our weight updates solve a minimax problem. Anti-Hebbian learning pushes neurons within a layer to learn different representations. The free phase of CSM also differs: only feedforward weights are updated, by an anti-Hebbian rule, whereas in EP and CHL all weights are updated.

Our main contributions and results are listed below:

• We provide a novel approach to the credit assignment problem using biologically-plausible learning rules by generalizing the similarity matching principle (Pehlevan and Chklovskii, 2019) to supervised learning tasks and introducing the Contrastive Similarity Matching algorithm.

• The proposed supervised learning algorithm can be related to other energy-based algorithms, but with a distinct underlying mechanism.

• We present a version of our neural network algorithm with structured connectivity.

• We show, using numerical simulations, that the performance of our algorithm is on par with other energy-based algorithms, and that the representations learned by our Hebbian/anti-Hebbian network are sparser.

The rest of this paper is organized as follows. In Section 2, to illustrate our main ideas, we introduce and discuss supervised similarity matching. We then introduce the nudged deep similarity matching objective, from which we derive the CSM algorithm for deep neural networks with nonlinear activation functions and structured connectivity. We discuss the relation of CSM to other energy-based learning algorithms. In Section 3, we report the performance of CSM and compare it with EP, highlighting the differences between them. Finally, we discuss our results and possible biological mechanisms, and relate them to other works, in Section 4.
Here we illustrate our main idea in a simple setting. Let $\mathbf{x}_t \in \mathbb{R}^n$, $t = 1, \cdots, T$, be a set of data points and $\mathbf{z}^l_t \in \mathbb{R}^k$ be their corresponding desired outputs or labels. Our idea is that the representation learned by the hidden layer, $\mathbf{y}_t \in \mathbb{R}^m$, should be half-way between the input $\mathbf{x}$ and the desired output $\mathbf{z}^l$. We formulate this idea using representational similarities, quantified by the dot product of representational vectors within a layer. Our proposal can be formulated as the following optimization problem, which we name supervised similarity matching:

$$\min_{\{\mathbf{y}_t\}_{t=1}^T} \frac{1}{T^2} \sum_{t=1}^{T}\sum_{t'=1}^{T} \left[ \left(\mathbf{x}_t^\top \mathbf{x}_{t'} - \mathbf{y}_t^\top \mathbf{y}_{t'}\right)^2 + \left(\mathbf{y}_t^\top \mathbf{y}_{t'} - \mathbf{z}_t^{l\top} \mathbf{z}^l_{t'}\right)^2 \right]. \quad (1)$$

To get an intuition about what this cost function achieves, consider the case where only one training datum exists. Then, $\mathbf{y}^\top\mathbf{y} = \frac{1}{2}\left(\mathbf{x}^\top\mathbf{x} + \mathbf{z}^{l\top}\mathbf{z}^l\right)$, satisfying our condition. When multiple training data are involved, interactions between different data points lead to a non-trivial solution, but the fact that the hidden layer representations are in between the input and output layers stays.

The optimization problem (1) can be analytically solved, making our intuition precise. Let the representational similarity matrix of the input layer be $R^x_{tt'} \equiv \mathbf{x}_t^\top \mathbf{x}_{t'}$, of the hidden layer be $R^y_{tt'} \equiv \mathbf{y}_t^\top \mathbf{y}_{t'}$, and of the output layer be $R^z_{tt'} \equiv \mathbf{z}_t^{l\top} \mathbf{z}^l_{t'}$. Instead of solving for $\mathbf{y}$ directly, we can reformulate and solve the supervised similarity matching problem (1) for $R^y$, and then obtain the $\mathbf{y}$s by a matrix factorization through an eigenvalue decomposition. By completing the square, problem (1) becomes an optimization problem for $R^y$:

$$\min_{R^y \in \mathcal{S}_m} \frac{1}{T^2} \left\| \frac{1}{2}\left(R^x + R^z\right) - R^y \right\|_F^2, \quad (2)$$

where $\mathcal{S}_m$ is the set of symmetric matrices with rank $m$, and $F$ denotes the Frobenius norm. The optimal $R^y$ is given by keeping the top $m$ modes in the eigenvalue decomposition of $\frac{1}{2}(R^x + R^z)$ and setting the rest to zero. If $m \ge \operatorname{rank}(R^x + R^z)$, then the optimal $R^y$ exactly equals $\frac{1}{2}(R^x + R^z)$, achieving a representational similarity matrix that is the average of the input and output layers.
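To make this analytic solution concrete, the following sketch solves problem (2) numerically: it forms $\frac{1}{2}(R^x + R^z)$, keeps the top-$m$ eigenmodes, and reads off one valid set of hidden representations. The function name, array shapes, and the random example data are our own illustrative choices, not part of the paper.

```python
import numpy as np

def supervised_similarity_match(X, Z, m):
    """Solve problem (2): X is an n x T input matrix, Z a k x T label matrix.

    Returns the optimal similarity matrix R_y (T x T) and one factorization
    Y (m x T) with Y^T Y = R_y, obtained from the top-m eigenmodes.
    """
    target = 0.5 * (X.T @ X + Z.T @ Z)        # (R_x + R_z) / 2, shape T x T
    evals, evecs = np.linalg.eigh(target)      # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:m]          # keep the top-m modes
    evals_m = np.clip(evals[top], 0.0, None)   # Gram-matrix average is PSD up to numerics
    R_y = (evecs[:, top] * evals_m) @ evecs[:, top].T
    Y = np.diag(np.sqrt(evals_m)) @ evecs[:, top].T   # one valid m x T solution
    return R_y, Y

# Example: 50 random data points, 10-d inputs, 3-d labels, 6 hidden units.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))
Z = rng.standard_normal((3, 50))
R_y, Y = supervised_similarity_match(X, Z, m=6)
```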
The supervised similarity matching problem (1) can be solved by an online algorithm that can in turn be mapped onto the operation of a biologically plausible network with a single hidden layer, which runs an attractor dynamics minimizing an energy function (see Appendix A for details). This approach can be generalized to multi-layer and nonlinear networks. We do not pursue it further because the resulting algorithm does not perform as well, due to spurious fixed points of the nonlinear dynamics for a given input $\mathbf{x}_t$. The Contrastive Similarity Matching algorithm overcomes this problem.

Our goal is to combine the ideas of supervised similarity matching and contrastive learning to derive a biologically plausible supervised learning algorithm. To do so, we first define the nudged similarity matching problem.

In energy-based learning algorithms like CHL and EP, weight updates rely on the difference of neural activity between a free phase and a clamped/nudged phase to locally approximate the gradient of an error signal. This process can be interpreted as minimizing a contrastive function, which reshapes the energy landscape to eliminate spurious fixed points and make the fixed point corresponding to the desired output more stable. We adopt this idea to introduce what we call the nudged similarity matching cost function, and derive its dual formulation, which will be the energy function used in our contrastive formulation.

We consider a $P$-layer network ($P-1$ hidden layers) with activation function $f$. For notational convenience, we denote inputs to the network by $\mathbf{r}^{(0)}$, outputs by $\mathbf{r}^{(P)}$, and activities of hidden layers by $\mathbf{r}^{(p)}$, $p = 1, \cdots, P-1$.
We propose the following objective function for the training phase, where outputs are nudged toward the desired labels $\mathbf{z}^l_t$:

$$\min_{\substack{a_- \le \mathbf{r}^{(p)}_t \le a_+ \\ t=1,\cdots,T;\; p=1,\cdots,P}} \; \sum_{p=1}^{P} \frac{\gamma^{p-P}}{T^2} \sum_{t=1}^{T}\sum_{t'=1}^{T} \left\| \mathbf{r}^{(p-1)\top}_t \mathbf{r}^{(p-1)}_{t'} - \mathbf{r}^{(p)\top}_t \mathbf{r}^{(p)}_{t'} \right\|^2 + \sum_{p=1}^{P} \frac{2\gamma^{p-P}}{T} \sum_{t=1}^{T} F\!\left(\mathbf{r}^{(p)}_t\right)^\top \mathbf{1} + \frac{2\beta}{T} \sum_{t=1}^{T} \left\| \mathbf{r}^{(P)}_t - \mathbf{z}^l_t \right\|^2. \quad (3)$$

Here, $\beta$ is a control parameter that specifies how strong the nudge is; the $\beta \to \infty$ limit corresponds to clamping the output layer to the desired output. $\gamma \ge 0$ sets the strength of feedback relative to feedforward input. $F(\mathbf{r}^{(p)}_t)$ is a regularizer defined elementwise and related to the activation function by $dF(\mathbf{r}^{(p)}_t)/d\mathbf{r}^{(p)}_t = \mathbf{u}^{(p)}_t - \mathbf{b}^{(p)}_t$, where $\mathbf{r}^{(p)}_t = f(\mathbf{u}^{(p)}_t)$, $\mathbf{u}^{(p)}_t$ and $\mathbf{r}^{(p)}_t$ are the total input and the output of the $p$-th layer respectively, and $\mathbf{b}^{(p)}_t$ is the threshold of neurons in layer $p$. The reason for the inclusion of this regularizer will be apparent below. We assume $f$ to be a monotonic and bounded function, whose bounds are given by $a_-$ and $a_+$.

The objective function (3) is almost identical to the deep similarity matching objective introduced in (Obeid et al., 2019), except for the nudging term. Obeid et al. (2019) used the $\beta = 0$ version as an unsupervised algorithm. Here, we use a non-zero $\beta$ for supervised learning.

We note that we have not made a reference to a particular neural network yet. This is because the neural network that optimizes (3) will be fully derived from the nudged similarity matching problem. It will not be prescribed as in traditional approaches to deep learning. We next describe how to do this derivation.

Using the duality transforms introduced in (Pehlevan et al., 2018; Obeid et al., 2019), the above nudged supervised deep similarity matching problem (3) can be turned into a dual minimax problem:

$$\min_{\{W^{(p)}\}} \max_{\{L^{(p)}\}} \frac{1}{T} \sum_{t=1}^{T} l_t\left(\{W^{(p)}\}, \{L^{(p)}\}, \mathbf{r}^{(0)}_t, \mathbf{z}^l_t, \beta\right), \quad (4)$$

where

$$l_t := \min_{\substack{a_- \le \mathbf{r}^{(p)}_t \le a_+ \\ p=1,\ldots,P}} \sum_{p=1}^{P} \gamma^{p-P} \left[ \operatorname{Tr} W^{(p)\top} W^{(p)} - 2\,\mathbf{r}^{(p)\top}_t W^{(p)} \mathbf{r}^{(p-1)}_t + \frac{1+\gamma(1-\delta_{pP})}{2}\, c^{(p)} \left( 2\,\mathbf{r}^{(p)\top}_t L^{(p)} \mathbf{r}^{(p)}_t - \operatorname{Tr} L^{(p)\top} L^{(p)} \right) + 2 F\!\left(\mathbf{r}^{(p)}_t\right)^\top \mathbf{1} \right] + 2\beta \left\| \mathbf{r}^{(P)}_t - \mathbf{z}^l_t \right\|^2. \quad (5)$$

Here, we introduced $c^{(p)}$ as a parameter that governs the relative importance of forward versus recurrent inputs; $c^{(p)} = 1$ corresponds to the exact transformation, details of which are given in Appendix B.

In Appendix C, we show that the objective of the min in $l_t$ defines an energy function for a deep neural network with feedforward, lateral, and feedback connections (Figure 2). It has the following neural dynamics:

$$\tau_p \frac{d\mathbf{u}^{(p)}}{dt} = -\mathbf{u}^{(p)} + W^{(p)} \mathbf{r}^{(p-1)}_t - c^{(p)} \left[1 + \gamma(1-\delta_{pP})\right] L^{(p)} \mathbf{r}^{(p)}_t + \mathbf{b}^{(p)}_t + \gamma(1-\delta_{pP})\, W^{(p+1)\top} \mathbf{r}^{(p+1)}_t - \beta\delta_{pP}\left(\mathbf{r}^{(P)}_t - \mathbf{z}^l_t\right), \qquad \mathbf{r}^{(p)}_t = f(\mathbf{u}^{(p)}), \quad (6)$$

where $\delta_{pP}$ is the Kronecker delta, $p = 1, \cdots, P$, $\tau_p$ is a time constant, $W^{(P+1)} = \mathbf{0}$, and $\mathbf{r}^{(P+1)}_t = \mathbf{0}$. Therefore, the minimization can be performed by running the dynamics until convergence.

Figure 2: Illustration of the Hebbian/anti-Hebbian network with hidden layers that implements the contrastive similarity matching algorithm. The output layer neurons alternate between the free phase and the nudged phase.
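As an illustration of how this minimization can be carried out, the following sketch integrates the dynamics (6) for a single input using a simple forward-Euler scheme. It assumes a hard-sigmoid activation ($a_- = 0$, $a_+ = 1$) and $c^{(p)} = 1/(2-\delta_{pP})$, matching the choices used later in the simulations; the step size, tolerance, layer sizes, and function signature are illustrative choices of ours.

```python
import numpy as np

def hard_sigmoid(x):
    return np.clip(x, 0.0, 1.0)          # bounded activation, a_- = 0, a_+ = 1

def relax(x, z, Ws, Ls, bs, beta=0.0, gamma=1.0, c=None,
          dt=0.1, tol=1e-5, max_steps=2000):
    """Run the dynamics (6) to a fixed point for one input x (and label z if beta > 0).

    Ws[p], Ls[p], bs[p] are the feedforward, lateral, and bias parameters of layer p
    (p = 0, ..., P-1 here, corresponding to layers 1, ..., P in the text).
    """
    P = len(Ws)
    if c is None:
        c = [0.5] * (P - 1) + [1.0]       # c^{(p)} = 1 / (2 - delta_{pP})
    u = [np.zeros(W.shape[0]) for W in Ws]
    r = [hard_sigmoid(ui) for ui in u]
    for _ in range(max_steps):
        max_change = 0.0
        for p in range(P):
            below = x if p == 0 else r[p - 1]
            fb = 0.0 if p == P - 1 else gamma * (Ws[p + 1].T @ r[p + 1])
            lat = c[p] * (1.0 + gamma * (p != P - 1)) * (Ls[p] @ r[p])
            nudge = beta * (r[p] - z) if p == P - 1 else 0.0
            du = -u[p] + Ws[p] @ below + bs[p] + fb - lat - nudge
            u[p] = u[p] + dt * du
            max_change = max(max_change, np.max(np.abs(dt * du)))
        r = [hard_sigmoid(ui) for ui in u]
        if max_change < tol:
            break
    return r
```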
This observation will be the building block of our CSM algorithm, which we present below. Finally, we note that the introduction of the regularizer in (3) is necessary for the energy interpretation and for proving the convergence of the neural dynamics (Obeid et al., 2019).

We first state our contrastive function and then discuss its implications. We suppress the dependence on training data in $l_t$ and define:

$$\{L^{*(p)}\} \equiv \arg\max_{\{L^{(p)}\}} \frac{1}{T} \sum_{t=1}^{T} l_t\left(\{W^{(p)}\}, \{L^{(p)}\}, \{\mathbf{b}^{(p)}\}, \beta\right), \quad (7)$$

and

$$E\left(\{W^{(p)}\}, \{\mathbf{b}^{(p)}\}, \beta\right) = \frac{1}{T} \sum_{t=1}^{T} l_t\left(\{W^{(p)}\}, \{L^{*(p)}\}, \{\mathbf{b}^{(p)}\}, \beta\right). \quad (8)$$

Finally, we formulate our contrastive function as

$$J_\beta\left(\{W^{(p)}\}, \{\mathbf{b}^{(p)}\}\right) = E(\beta) - E(0), \quad (9)$$

which is to be minimized over the feedforward and feedback weights $\{W^{(p)}\}$, as well as the biases $\{\mathbf{b}^{(p)}\}$. For fixed biases, minimization of the first term, $E(\beta)$, corresponds exactly to the optimization of the minimax dual of nudged deep similarity matching (4). The second term, $E(0)$, corresponds to a free phase, where no nudging is applied. We note that in order to arrive at a contrastive minimization problem, we use the same optimal lateral weights, (7), from the nudged phase in the free phase. Compared to the minimax dual of nudged deep similarity matching (4), we also optimize over the biases for better performance.

Minimization of the contrastive function (9) closes the energy gap between the nudged and free phases. Because the energy functions are evaluated at the fixed points of the neural dynamics (6), this procedure enforces the output of the nudged network to be a fixed point of the free neural dynamics.

To optimize our contrastive function (9) in a stochastic (one training datum at a time) manner, we use the following procedure. For each pair of training data $\{\mathbf{r}_t, \mathbf{z}^l_t\}$, we run the nudged phase ($\beta \ne 0$) dynamics (6) until convergence to get the fixed points $\mathbf{r}^{(p)}_{\beta,t}$. Next, we run the free phase ($\beta = 0$) neural dynamics (6) until convergence and collect the fixed points $\mathbf{r}^{(p)}_{0,t}$. $L^{(p)}$ is updated following a gradient ascent of (7), while $W^{(p)}$ and $\mathbf{b}^{(p)}$ follow a gradient descent of (9):

$$\Delta L^{(p)} \propto \left(\mathbf{r}^{(p)}_{\beta,t} \mathbf{r}^{(p)\top}_{\beta,t} - L^{(p)}\right), \qquad \Delta W^{(p)} \propto \left(\mathbf{r}^{(p)}_{\beta,t} \mathbf{r}^{(p-1)\top}_{\beta,t} - \mathbf{r}^{(p)}_{0,t} \mathbf{r}^{(p-1)\top}_{0,t}\right), \qquad \Delta \mathbf{b}^{(p)} \propto \left(\mathbf{r}^{(p)}_{\beta,t} - \mathbf{r}^{(p)}_{0,t}\right). \quad (10)$$

In practice, learning rates can be chosen differently to achieve the best performance. A constant prefactor before $L^{(p)}$ can be added to achieve numerical stability. The above CSM algorithm is summarized in Algorithm 1.

Algorithm 1 Contrastive Similarity Matching (CSM)
Input: Initial $\{W^{(p)}\}$, $\{L^{(p)}\}$, $\{\mathbf{b}^{(p)}\}$, $\{\mathbf{r}^{(p)}\}$, $p = 1, \ldots, P$
for $t = 1$ to $T$ do
    Run the nudged phase neural dynamics (6) with $\beta \ne 0$ until convergence; collect the fixed points $\{\mathbf{r}^{(p)}_{\beta,t}\}$
    Run the free phase dynamics (6) with $\beta = 0$ until convergence; collect the fixed points $\{\mathbf{r}^{(p)}_{0,t}\}$
    Update $\{L^{(p)}\}$, $\{W^{(p)}\}$, and $\{\mathbf{b}^{(p)}\}$ according to (10)
end for
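For concreteness, here is a minimal sketch of one iteration of Algorithm 1 for a single training pair, reusing the `relax` helper sketched above and applying the updates (10); the learning rates and the function name are placeholders of ours, not values from the paper.

```python
import numpy as np

def csm_step(x, z, Ws, Ls, bs, beta=1.0, gamma=1.0,
             lr_W=0.1, lr_L=0.01, lr_b=0.1):
    """One CSM update for a single (x, z) pair, following Eq. (10)."""
    # Nudged phase: relax the dynamics (6) with beta != 0 and collect fixed points.
    r_nudged = relax(x, z, Ws, Ls, bs, beta=beta, gamma=gamma)
    # Free phase: relax the dynamics (6) with beta = 0.
    r_free = relax(x, z, Ws, Ls, bs, beta=0.0, gamma=gamma)
    for p in range(len(Ws)):
        below_n = x if p == 0 else r_nudged[p - 1]
        below_f = x if p == 0 else r_free[p - 1]
        # Lateral update: gradient ascent on (7), using nudged-phase fixed points only.
        Ls[p] += lr_L * (np.outer(r_nudged[p], r_nudged[p]) - Ls[p])
        # Contrastive update of feedforward weights and biases (gradient descent on (9)).
        Ws[p] += lr_W * (np.outer(r_nudged[p], below_n) - np.outer(r_free[p], below_f))
        bs[p] += lr_b * (r_nudged[p] - r_free[p])
    return Ws, Ls, bs
```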
The CSM algorithm can be related to gradient descent in the $\beta \to 0$ limit. To see this, collect all the $W^{(p)}$ and $\mathbf{b}^{(p)}$ parameters under one vector variable $\boldsymbol{\theta}$, denote all the lateral connection matrices defined in (7) by $L^*$, and represent the fixed points of the network by $\bar{\mathbf{r}}$. Now the energy function can be written as $E(\boldsymbol{\theta}, \beta; L^*, \bar{\mathbf{r}})$, where $L^*$ and $\bar{\mathbf{r}}$ depend on $\boldsymbol{\theta}$ and $\beta$ implicitly. In the limit of small $\beta$, one can approximate the energy function to leading order by
$$E(\boldsymbol{\theta}, \beta; L^*, \bar{\mathbf{r}}) \approx E(\boldsymbol{\theta}, 0) + \left( \operatorname{Tr}\left( \frac{\partial E}{\partial L^*} \frac{\partial L^*}{\partial \beta} \right) + \frac{\partial E}{\partial \bar{\mathbf{r}}} \cdot \frac{\partial \bar{\mathbf{r}}}{\partial \beta} + \frac{\partial E}{\partial \beta} \right)\Bigg|_{\beta=0} \beta. \quad (11)$$

Note that the maximization in (7) implies $\partial E/\partial L^* = \mathbf{0}$. If $\partial E/\partial \bar{\mathbf{r}}$ is also $\mathbf{0}$, i.e. the minima of (5) are not on the boundaries but in the interior of the feasible set, then in the limit $\beta \to 0$,
the gradient of the contrastive function is the gradient of the mean squared error function with respect to $\boldsymbol{\theta}$:

$$\lim_{\beta \to 0} \frac{1}{\beta} \frac{\partial J}{\partial \boldsymbol{\theta}} = \frac{\partial}{\partial \boldsymbol{\theta}} \frac{\partial E}{\partial \beta}\Bigg|_{\beta=0} = \frac{\partial}{\partial \boldsymbol{\theta}} \frac{1}{T} \sum_{t=1}^{T} \left\| \bar{\mathbf{r}}^{P}_t - \mathbf{z}^l_t \right\|^2. \quad (12)$$

It is important to note that while the $\beta \to 0$ limit admits this gradient descent interpretation, in practice $\beta$ is a hyperparameter to be tuned. In Appendix F, we present simulations that confirm the existence of an optimal $\beta$ away from $\beta = 0$.

The CSM algorithm is similar in spirit to other contrastive algorithms, such as CHL and EP. Like these algorithms, CSM performs two runs of the neural dynamics, in a "free" and a "nudged" phase. However, there are important differences. One major difference is that in CSM, the contrastive function is minimized by the feedforward weights. The lateral weights take part in the maximization of a different minimax objective (7). In CHL and EP, such minimization is done with respect to all the weights.

As a consequence of this difference, CSM uses a different update for lateral weights than CHL and EP. This anti-Hebbian update is different in two ways: 1) It has the opposite sign, i.e. the EP and CHL nudged/clamped phase lateral updates are Hebbian. 2) No update is applied in the free phase. As we will demonstrate in numerical simulations, our lateral update imposes a competition between different units in the same layer. When network activity is constrained to be nonnegative, such lateral interactions are inhibitory and sparsify neural activity.

Analogs of two hyperparameters of our algorithm play special roles in EP and CHL. The $\beta \to \infty$ limit of EP corresponds to clamping the output to the desired value in the nudged phase (Scellier and Bengio, 2017). Similarly, the $\beta \to \infty$ limit of CSM also corresponds to training with fully clamped output units. We discussed the gradient descent interpretation of the $\beta \to 0$ limit above. In CHL, backpropagation is recovered in the limit where the feedback strength, the analog of our $\gamma$ parameter, vanishes (Xie and Seung, 2003). In CSM, $\gamma$ is a hyperparameter to be tuned, which we explore in Appendix F.

We can also generalize the nudged supervised similarity matching (Eq. 3) to derive a Hebbian/anti-Hebbian network with structured connectivity. Following Obeid et al. (2019), we can modify any of the cross terms in the layer-wise similarity matching objective (Eq. 3) by introducing synapse-specific structure constants. For example:

$$-\frac{1}{T^2} \sum_{i}^{N^{(p)}} \sum_{j}^{N^{(p-1)}} \sum_{t}^{T} \sum_{t'}^{T} r^{(p)}_{t,i}\, r^{(p)}_{t',i}\, r^{(p-1)}_{t,j}\, r^{(p-1)}_{t',j}\, s^{W}_{ij}, \quad (13)$$

where $N^{(p)}$ is the number of neurons in the $p$-th layer, and $s^{W}_{ij} \ge 0$ specifies the structure of the feedforward connectivity between the $p$-th and $(p-1)$-th layers; setting a structure constant to zero eliminates the corresponding connection. Similarly, one can introduce $s^{L}_{ij}$ to specify the structure of the lateral connections (Fig. 6A). Using such structure constants, one can introduce many different architectures, some of which we experiment with below. We present a detailed explanation of these points in Appendix C.
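To illustrate how such structure constants can be specified in practice, the sketch below builds binary masks ($s_{ij} \in \{0, 1\}$) for a layer arranged on a two-dimensional grid, restricting feedforward connections to a disc of a given radius and lateral connections to neurons sharing a site, in the spirit of the structured networks used in Section 3. The helper names and the grid-coordinate convention are ours.

```python
import numpy as np

def grid_coords(side, nps):
    """(x, y) coordinates for side*side sites with `nps` neurons per site."""
    xs, ys = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)   # one row per site
    return np.repeat(coords, nps, axis=0)                  # one row per neuron

def feedforward_mask(prev_side, prev_nps, side, nps, radius):
    """Binary s^W (post x pre): connect a unit only to previous-layer units
    whose site lies within `radius` of its own site."""
    pre, post = grid_coords(prev_side, prev_nps), grid_coords(side, nps)
    dist = np.linalg.norm(post[:, None, :] - pre[None, :, :], axis=-1)
    return (dist <= radius).astype(float)

def lateral_mask(side, nps):
    """Binary s^L (post x post): lateral connections only between neurons
    that share the same (x, y) site."""
    site_id = np.repeat(np.arange(side * side), nps)
    return (site_id[:, None] == site_id[None, :]).astype(float)

# Example: a 28x28 input grid (1 neuron per site) feeding a 28x28 hidden grid with
# 4 neurons per site and a connection radius of 4, one of the MNIST settings explored.
s_W = feedforward_mask(prev_side=28, prev_nps=1, side=28, nps=4, radius=4)
s_L = lateral_mask(side=28, nps=4)
```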
In this section, we report the simulation results of the CSM algorithm on a supervised classification task using the MNIST dataset of handwritten digits (LeCun et al., 2010) and the CIFAR-10 image dataset (Krizhevsky et al., 2009). For our simulations, we used the Theano Deep Learning framework (Team et al., 2016) and modified the code released by Scellier and Bengio (2017). The activation functions of the units were $f(x) = \min\{1, \max\{x, 0\}\}$ and $c^{(p)} = 1/(2 - \delta_{pP})$. Following Scellier and Bengio (2017), we used the persistent particle technique to tackle the long period of free phase relaxation. We stored the fixed points of hidden layers at the end of the free phase and used them to initialize the state of the network in the next epoch.

Figure 3: Comparison of training (left) and validation (right) errors between the CSM and EP algorithms for a network with one hidden layer (784-500-10, upper panels) and three hidden layers (784-500-500-500-10, lower panels) trained on the MNIST dataset.

The inputs consist of gray-scale 28-by-28 pixel images, and each image is associated with a label ranging over $\{0, \cdots, 9\}$. We encoded the labels $\mathbf{z}^l$ as one-hot 10-dimensional vectors. We trained fully connected NNs with one and three hidden layers, with lateral connections within each hidden layer. The performance of the CSM algorithm was compared with several variants of the EP algorithm: (1) EP: beta regularized, where the networks had no lateral connections and the sign of $\beta$ was randomized to act as a regularizer as in (Scellier and Bengio, 2017); (2) EP: beta positive, where the networks had no lateral connections and $\beta$ was a positive constant; (3) EP: lateral, where networks had lateral connections and were trained with a positive constant $\beta$. In all the fully-connected network simulations for MNIST, the number of neurons in each hidden layer is 500. We attained 0% training error and 2.16% and 3.52% validation errors with CSM, in the one and three hidden layer cases respectively. This is on par with the performance of the EP algorithm, which attains validation errors of 2.53% and 2.73% respectively for variant 1, and 2.18% and 2.77% for variant 2 (Fig. 3). In the 3-layer case, a training error-dependent adaptive learning rate scheme (CSM-Adaptive) was used, wherein the learning rate for the lateral updates is successively decreased when the training error drops below certain thresholds (see Appendix D for details).
CIFAR-10 is a more challenging dataset that contains 32-by-32 RGB images of objects belonging to ten classes of animals and vehicles. For fully connected networks, the performance of CSM was compared with EP (positive constant $\beta$). We obtain validation errors of 59.21% and 51.76% in the one and two hidden layer networks respectively with CSM, and validation errors of 57.60% and 53.43% with EP (Fig. 4). The means and standard errors of the mean of the last twenty validation errors are reported here, in order to account for fluctuations about the mean. It is interesting to note that for both algorithms, deeper networks perform better for CIFAR-10, but not for MNIST. For both datasets, the best performing network trained with CSM achieves slightly better validation accuracy than the best performing network trained with EP. The errors corresponding to the fully connected networks for both algorithms and datasets are summarized in Table 1. Here, CSM has been compared to the variant of EP with $\beta > 0$.

Table 1: Training and validation errors of fully connected networks trained with CSM and EP on MNIST and CIFAR-10. CIFAR-10 validation errors are means over the last twenty epochs.

MNIST:
  Rule       Train (%)   Validate (%)
  CSM: 1hl   0.00        2.16
  EP: 1hl    0.03        2.18
  CSM: 3hl   0.00        3.52
  EP: 3hl    0.00        2.77

CIFAR-10:
  Rule       Train (%)   Validate (%)
  CSM: 1hl   1.77        59.21
  EP: 1hl    0.76        57.60
  CSM: 2hl   17.96       51.76
  EP: 2hl    1.25        53.43

Figure 4: Training (left) and validation (right) error curves for fully connected networks trained on the CIFAR-10 dataset with the CSM (solid) and EP (dashed) algorithms. The best fully connected CSM network attains slightly better validation accuracy than the best fully connected EP network.

While CSM and EP perform similarly, their learned representations differ in sparseness (Fig. 5). Due to the non-negativity of hidden unit activity and the anti-Hebbian lateral updates, the CSM network ends up with inhibitory lateral connections, which enforce sparse responses (Fig. 5). This can also be seen from the similarity matching objective (3). Imagine there are only two inputs with a negative dot product, $\mathbf{x} \cdot \mathbf{x}' < 0$. The next layer will at least partially match this dot product; however, because the lowest value of $\mathbf{y} \cdot \mathbf{y}'$ is zero due to $\mathbf{y}, \mathbf{y}' \ge 0$, $\mathbf{y}$ and $\mathbf{y}'$ will be forced to be orthogonal, with non-overlapping sets of active neurons. Sparse responses are a general feature of cortical neurons (Olshausen and Field, 2004) and are energy-efficient, making the representations learned by CSM more biologically relevant.

We also examined the performance of CSM in networks with structured connectivity. Every hidden layer can be constructed by first considering sites arranged on a two-dimensional grid. Each site only receives inputs from selected nearby sites, controlled by the radius parameter (Fig. 6A). This setting resembles retinotopy (Kandel et al., 2000) in the visual cortex. Multiple neurons can be present at a single site, controlled by the neurons per site (NPS) parameter. We consider lateral connections only between neurons sharing the same (x, y) coordinate.

For the MNIST dataset, networks with structured connectivity trained with the CSM rule achieved a 2.22% validation error for a single hidden layer network with a radius of 4 and NPS of 20 (Fig. 6B) (see Appendix D for details). For the CIFAR-10 dataset, a one hidden layer structured network trained with the CSM algorithm achieves 34% training error and 49.5% validation error after 250 epochs, which is a significant improvement compared to the fully connected one hidden layer network. This structured network had a radius of 4 and NPS of 3. A two hidden layer structured network yielded a training error of 46.8% and a validation error of 51.4% after 200 epochs. Errors reported for the structured runs are the averages of five trials. The results for all fully connected and structured networks are reported in Appendices D and E.
In this paper, we proposed a new solution to the credit assignment problem by generalizing the similarity matching principle to the supervised domain and proposed a biologically plausible supervised learning algorithm, the Contrastive Similarity Matching algorithm. In CSM, a supervision signal is introduced by minimizing the energy difference between a free phase and a nudged phase. CSM differs significantly from other energy-based algorithms in how the contrastive function is constructed. We showed that when a non-negativity constraint is imposed on neural activity, the anti-Hebbian learning rule for the lateral connections makes the representations sparse and biologically relevant. We also derived the CSM algorithm for neural networks with structured connectivity.

Figure 5: Representations of neurons in NNs trained by the CSM algorithm are much sparser than those of the EP algorithm on the MNIST dataset. (A) Heatmaps of representations at the second hidden layer; each row is the response of 500 neurons to a given digit image. Upper: CSM algorithm. Lower: EP algorithm. (B) Representation sparsity, defined as the fraction of neurons whose activity is larger than a threshold (0.01), along different layers. Layer 0 is the input. The network has a 784-500-500-500-10 architecture.

Figure 6: (A) Sketch of structured connectivity in a deep neural network. Neurons live on a 2-d grid. Each neuron takes input from a small grid (blue shades) of the previous layer and a small grid of inhibition from its nearby neurons (orange shades). (B) Training and validation curves of CSM with structured single hidden layer networks on the MNIST dataset, with a receptive field of radius 4 and neurons per site 4, 16, and 20.

The idea of using representational similarity for training neural networks has taken various forms in previous work. The similarity matching principle has recently been used to derive various biologically plausible unsupervised learning algorithms (Pehlevan and Chklovskii, 2019), such as principal subspace projection (Pehlevan and Chklovskii, 2015), blind source separation (Pehlevan et al., 2017), feature learning (Obeid et al., 2019), and manifold learning (Sengupta et al., 2018). It has been used for semi-supervised classification (Genkin et al., 2019). Similarity matching has also been used as part of a local cost function to train a deep convolutional network (Nøkland and Eidnes, 2019), where instead of layer-wise similarity matching, each hidden layer aims to learn representations similar to the output layer. Representational similarity matrices derived from neurobiology data have recently been used to regularize CNNs trained for image classification; the resulting networks are more robust to noise and adversarial attacks (Li et al., 2019). It would be interesting to study the robustness of neural networks trained by the CSM algorithm.

Like other contrastive learning algorithms, CSM operates with two phases: free and nudged. Previous studies in contrastive learning provided various biologically possible implementations of such two-phased learning. One proposal is to introduce the teacher signal into the network through an oscillatory coupling with a period longer than the time scale of neural activity converging to a steady state. Baldi and Pineda (1991) proposed that such oscillations might be related to rhythms in the brain. In more recent work, Scellier and Bengio (2017) provided an implementation of EP, also applicable to CSM with minor modifications. They proposed that synaptic updates only happen in the nudged phase, with weights continuously updating according to a differential anti-Hebbian rule as the neuron's state moves from the fixed point of the free phase to the fixed point of the nudged phase. Further, such a differential rule can be related to spike-timing-dependent plasticity (Xie and Seung, 2000; Bengio et al., 2015). CSM can use the same mechanism for feedforward and feedback updates. Lateral connections need to be separately updated in the free phase. The differential updating of synapses in different phases of the algorithm can be implemented by neuromodulatory gating of synaptic plasticity (Brzosko et al., 2019; Bazzari and Parri, 2019).

A practical issue of CSM and other energy-based algorithms such as EP and CHL is that the recurrent dynamics take a long time to converge. Recently, a discrete-time version of EP has shown much faster training speed (Ernoult et al., 2019), and its application to CSM could be an interesting future direction.

Acknowledgements

We acknowledge support by NIH, the Intel Corporation through the Intel Neuromorphic Research Community, and a Google Faculty Research Award. We thank Dina Obeid and Blake Bordelon for helpful discussions.
A Derivation of a Supervised Similarity Matching Neural Network
The supervised similarity matching cost function (1) is formulated in terms of the activities of units, but a statement about the architecture and the dynamics of the network has not been made. We will derive all these from the cost function, without prescribing them. To do so, we need to introduce variables that correspond to the synaptic weights in the network. As it turns out, these variables are dual to correlations between unit activities (Pehlevan et al., 2018).

To see this explicitly, following the method of (Pehlevan et al., 2018), we expand the squares in Eq. (1) and introduce new dual variables $W \in \mathbb{R}^{m \times n}$, $\widetilde{W} \in \mathbb{R}^{k \times m}$ and $L \in \mathbb{R}^{m \times m}$ using the following identities:

$$-\frac{1}{T^2}\sum_{t=1}^{T}\sum_{t'=1}^{T}\mathbf{y}_t^\top\mathbf{y}_{t'}\,\mathbf{x}_t^\top\mathbf{x}_{t'}=\min_{W}-\frac{2}{T}\sum_{t=1}^{T}\mathbf{y}_t^\top W\mathbf{x}_t+\operatorname{Tr}W^\top W,$$
$$\frac{1}{T^2}\sum_{t=1}^{T}\sum_{t'=1}^{T}\left(\mathbf{y}_t^\top\mathbf{y}_{t'}\right)^2=\max_{L}\frac{2}{T}\sum_{t=1}^{T}\mathbf{y}_t^\top L\mathbf{y}_t-\operatorname{Tr}L^\top L,$$
$$-\frac{1}{T^2}\sum_{t=1}^{T}\sum_{t'=1}^{T}\mathbf{y}_t^\top\mathbf{y}_{t'}\,\mathbf{z}_t^{l\top}\mathbf{z}^l_{t'}=\min_{\widetilde{W}}-\frac{2}{T}\sum_{t=1}^{T}\mathbf{z}_t^{l\top}\widetilde{W}\mathbf{y}_t+\operatorname{Tr}\widetilde{W}^\top\widetilde{W}. \quad (14)$$

Plugging these into Eq. (1), and changing orders of optimization, we arrive at the following dual, minimax formulation of supervised similarity matching:

$$\min_{W,\widetilde{W}}\max_{L}\frac{1}{T}\sum_{t=1}^{T}l_t\left(W,\widetilde{W},L,\mathbf{x}_t,\mathbf{z}^l_t\right), \quad (15)$$

where

$$l_t:=\operatorname{Tr}W^\top W-\operatorname{Tr}L^\top L+\operatorname{Tr}\widetilde{W}^\top\widetilde{W}+\min_{\mathbf{y}_t}\left(-\mathbf{y}_t^\top W\mathbf{x}_t+\mathbf{y}_t^\top L\mathbf{y}_t-\mathbf{y}_t^\top\widetilde{W}^\top\mathbf{z}^l_t\right). \quad (16)$$

A stochastic optimization of the above objective can be mapped to a Hebbian/anti-Hebbian network following the steps in (Pehlevan et al., 2018). For each training datum, $\{\mathbf{x}_t, \mathbf{z}^l_t\}$, a two-step procedure is performed. First, the optimal $\mathbf{y}_t$ that minimizes $l_t$ is obtained by a gradient flow until convergence,

$$\dot{\mathbf{y}} = W\mathbf{x}_t - L\mathbf{y}_t + \widetilde{W}^\top\mathbf{z}^l_t. \quad (17)$$

We interpret this flow as the dynamics of a neural circuit with linear activation functions, where the dual variables $W$, $\widetilde{W}$ and $L$ are synaptic weight matrices (Fig. 7A). In the second part of the algorithm, we update the synaptic weights by a gradient descent-ascent on (16) with $\mathbf{y}_t$ fixed. This gives the following synaptic plasticity rules:

$$\Delta W = \eta\left(\mathbf{y}_t\mathbf{x}_t^\top - W\right), \qquad \Delta L = \eta\left(\mathbf{y}_t\mathbf{y}_t^\top - L\right), \qquad \Delta\widetilde{W} = \eta\left(\mathbf{z}^l_t\mathbf{y}_t^\top - \widetilde{W}\right). \quad (18)$$

The learning rate $\eta$ of each matrix can be chosen differently to achieve the best performance.
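Here is a minimal sketch of this two-step procedure on a small linear task (in the spirit of Fig. 7B); `W1`, `W2`, and `L` play the roles of $W$, $\widetilde{W}$, and $L$, and the dimensions, learning rate, and number of relaxation steps are illustrative choices of ours rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k, T = 4, 3, 2, 200                      # input, hidden, output sizes; data points
A = rng.uniform(-1, 1, (k, n))
X = rng.uniform(-1, 1, (n, T))
Z = A @ X                                       # linear teacher, z^l_t = A x_t

W1 = 0.1 * rng.standard_normal((m, n))          # input -> hidden weights (W)
W2 = 0.1 * rng.standard_normal((k, m))          # hidden -> output weights (tilde W)
L = np.eye(m)                                   # lateral (anti-Hebbian) weights
eta = 0.05

for epoch in range(20):
    for t in range(T):
        x, z = X[:, t], Z[:, t]
        # Step 1: relax the hidden units with the output clamped to z (Eq. 17).
        y = np.zeros(m)
        for _ in range(100):
            y += 0.1 * (W1 @ x - L @ y + W2.T @ z)
        # Step 2: local gradient descent-ascent updates (Eq. 18).
        W1 += eta * (np.outer(y, x) - W1)
        L  += eta * (np.outer(y, y) - L)
        W2 += eta * (np.outer(z, y) - W2)
```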
Figure 7: A linear NN with Hebbian/anti-Hebbian learning rules. (A) During the learning process, the output neurons (blue) are clamped at their desired states. After training, the prediction for a new input $\mathbf{x}$ is given by the value of $\mathbf{z}$ at the fixed point of the neural dynamics. (B) The network is trained on a linear task: $\mathbf{z}^l_t = A\mathbf{x}_t$. Test error, defined as the mean square error between the network's prediction, $\mathbf{z}^p_t$, and the ground-truth value, $\mathbf{z}^l_t$, $\frac{1}{T}\sum_{t=1}^T\|\mathbf{z}^p_t-\mathbf{z}^l_t\|_F$, decreases with the gradient ascent-descent steps during learning. (C) Scatter plot of the predicted value versus the desired value (element-wise). (D) The algorithm learns the correct mapping between $\mathbf{x}$ and $\mathbf{z}$ even in the presence of small Gaussian noise. In these examples, $\mathbf{x}\in\mathbb{R}^n$, $A\in\mathbb{R}^{k\times n}$, the elements of $\mathbf{x}$ and $A$ are drawn from a uniform distribution in the range $[-1, 1]$, $\mathbf{y}\in\mathbb{R}^m$ and $\mathbf{z}\in\mathbb{R}^k$. In (C) and (D), 200 data points are shown.

Overall, the network dynamics (17) and the update rules (18) map to a NN with one hidden layer, with the output layer clamped to the desired state. The updates of the feedforward weights are Hebbian, and the updates of the lateral weights are anti-Hebbian (Fig. 7A).

For prediction, the network takes an input data point, $\mathbf{x}_t$, and runs with the output unclamped until convergence. We take the value of the $\mathbf{z}$ units at the fixed point as the network's prediction.

Because the $\mathbf{z}$ units are not clamped during prediction and are dynamical variables, the correct outputs are not necessarily fixed points of the network in the prediction phase. To make sure that the network produces correct fixed points, at least for the training data, we introduce the following step to the training procedure. We aim to construct a neural dynamics for the output layer in the prediction phase such that its fixed point $\mathbf{z}$ corresponds to the desired output $\mathbf{z}^l$. Since the output layer receives input $\widetilde{W}\mathbf{y}$ from the previous layer, a decay term that depends on $\mathbf{z}$ is required to achieve a stable fixed point at $\mathbf{z}=\mathbf{z}^l$. The simplest way is introducing lateral inhibition. The output layer then has the following neural dynamics:

$$\dot{\mathbf{z}} = \widetilde{W}\mathbf{y} - L^z\mathbf{z}, \quad (19)$$

where the lateral connections $L^z$ are learned such that the fixed point $\mathbf{z}^* \approx \mathbf{z}^l$. This is achieved by minimizing the following target function:

$$\min_{L^z}\frac{1}{T}\sum_{t=1}^{T}\left\|\widetilde{W}\mathbf{y}_t - L^z\mathbf{z}^l_t\right\|^2. \quad (20)$$

Taking the derivative of the above target function with respect to $L^z$, while keeping the other parameters and variables evaluated at the fixed point of the neural dynamics, we get the following "delta" learning rule for $L^z$:

$$\Delta L^z = \eta\left(\widetilde{W}\mathbf{y}_t - L^z\mathbf{z}^l_t\right)\mathbf{z}^{l\top}_t. \quad (21)$$

After learning, the NN makes a prediction about a new input $\mathbf{x}$ by running the neural dynamics of $\mathbf{y}$ and $\mathbf{z}$, (17) and (19), until they converge to a fixed point. We take the value of the $\mathbf{z}$ units at the fixed point as the prediction. As shown in Fig. 7B-D, the linear network and weight update rules solve linear tasks efficiently.

Although the above procedure can be generalized to multi-layer and nonlinear networks, one has to address the issue of spurious fixed points of nonlinear dynamics for a given input $\mathbf{x}_t$. The Contrastive Similarity Matching algorithm presented in the main text overcomes this problem by borrowing ideas from energy-based learning algorithms such as Contrastive Hebbian Learning and Equilibrium Propagation.

B Supervised Deep Similarity Matching
In this section, we follow (Obeid et al., 2019) to derive the minimax dual of the deep similarity matching objective function. We start from rewriting the objective function (3) by expanding its first term and combining the same terms from adjacent layers, which gives, up to an overall constant factor and terms that depend only on the inputs,

$$\min_{\substack{a_-\le \mathbf{r}^{(p)}_t\le a_+\\ t=1,\cdots,T;\ p=1,\cdots,P}} \sum_{p=1}^{P}\frac{\gamma^{p-P}}{T^2}\sum_{t=1}^{T}\sum_{t'=1}^{T}\left(-\,\mathbf{r}^{(p)\top}_t\mathbf{r}^{(p)}_{t'}\,\mathbf{r}^{(p-1)\top}_t\mathbf{r}^{(p-1)}_{t'}+\frac{1+\gamma(1-\delta_{pP})}{2}\,c^{(p)}\left(\mathbf{r}^{(p)\top}_t\mathbf{r}^{(p)}_{t'}\right)^2\right)+\sum_{p=1}^{P}\frac{\gamma^{p-P}}{T}\sum_{t=1}^{T}F\!\left(\mathbf{r}^{(p)}_t\right)^\top\mathbf{1}+\frac{\beta}{T}\sum_{t=1}^{T}\left\|\mathbf{r}^{(P)}_t-\mathbf{z}^l_t\right\|^2, \quad (22)$$

where $c^{(p)}$ is a parameter that changes the relative importance of within-layer and between-layer similarity; we set it to $1/(2-\delta_{pP})$ in our simulations. Using the identities

$$-\frac{1}{T^2}\sum_{t=1}^{T}\sum_{t'=1}^{T}\mathbf{r}^{(p)\top}_t\mathbf{r}^{(p)}_{t'}\,\mathbf{r}^{(p-1)\top}_t\mathbf{r}^{(p-1)}_{t'}=\min_{W^{(p)}}-\frac{2}{T}\sum_{t=1}^{T}\mathbf{r}^{(p)\top}_t W^{(p)}\mathbf{r}^{(p-1)}_t+\operatorname{Tr}W^{(p)\top}W^{(p)}, \quad (23)$$

$$\frac{1}{T^2}\sum_{t=1}^{T}\sum_{t'=1}^{T}\left(\mathbf{r}^{(p)\top}_t\mathbf{r}^{(p)}_{t'}\right)^2=\max_{L^{(p)}}\frac{2}{T}\sum_{t=1}^{T}\mathbf{r}^{(p)\top}_t L^{(p)}\mathbf{r}^{(p)}_t-\operatorname{Tr}L^{(p)\top}L^{(p)}, \quad (24)$$

in (22), and exchanging the optimization order of $\mathbf{r}^{(p)}_t$ and the weight matrices, we turn the target function (3) into the following minimax problem

$$\min_{\{W^{(p)}\}}\max_{\{L^{(p)}\}}\frac{1}{T}\sum_{t=1}^{T}l_t\left(\{W^{(p)}\},\{L^{(p)}\},\mathbf{r}^{(0)}_t,\mathbf{z}^l_t,\beta\right), \quad (25)$$

where we have defined an "energy" term (Eq. 5 in the main text). The neural dynamics of each layer can be derived by following the gradient of $l_t$:

$$\frac{d\mathbf{u}^{(p)}_t}{dt}\propto-\frac{\partial l_t}{\partial\mathbf{r}^{(p)}_t}=2\gamma^{p-P}\left[-\mathbf{u}^{(p)}+\mathbf{b}^{(p)}_t+W^{(p)}\mathbf{r}^{(p-1)}_t+\gamma(1-\delta_{pP})\,W^{(p+1)\top}\mathbf{r}^{(p+1)}_t-\left[1+\gamma(1-\delta_{pP})\right]c^{(p)}L^{(p)}\mathbf{r}^{(p)}_t-\beta\delta_{pP}\left(\mathbf{r}^{(P)}_t-\mathbf{z}^l_t\right)\right],\qquad\mathbf{r}^{(p)}_t=f(\mathbf{u}^{(p)}). \quad (26)$$

Defining $\tau_p^{-1}=2\gamma^{p-P}$, the above equation becomes Eq. 6 in the main text.

C Supervised Similarity Matching for Neural Networks with Structured Connectivity
In this section, we derive the supervised similarity matching algorithm for neural networks with structured connectivity. Structure can be introduced to the quartic terms in (22):

$$-\frac{1}{T^2}\sum_{i}^{N^{(p)}}\sum_{j}^{N^{(p-1)}}\sum_{t}^{T}\sum_{t'}^{T}r^{(p)}_{t,i}\,r^{(p)}_{t',i}\,r^{(p-1)}_{t,j}\,r^{(p-1)}_{t',j}\,s^{W,(p)}_{ij},\qquad \frac{1}{T^2}\sum_{i}^{N^{(p)}}\sum_{j}^{N^{(p)}}\sum_{t}^{T}\sum_{t'}^{T}r^{(p)}_{t,i}\,r^{(p)}_{t',i}\,r^{(p)}_{t,j}\,r^{(p)}_{t',j}\,s^{L,(p)}_{ij}, \quad (27)$$

where $s^{W,(p)}_{ij}$ and $s^{L,(p)}_{ij}$ specify the feedforward connections of layer $p$ with layer $p-1$ and the lateral connections within layer $p$, respectively. For example, setting them to be 0s eliminates all connections. Now we have the following deep structured similarity matching cost function for supervised learning:

$$\min_{\substack{a_-\le \mathbf{r}^{(p)}_t\le a_+\\ t=1,\cdots,T;\ p=1,\cdots,P}}\sum_{p=1}^{P}\frac{\gamma^{p-P}}{T^2}\sum_{t=1}^{T}\sum_{t'=1}^{T}\sum_{i,j}\left(-\,r^{(p)}_{t,i}r^{(p)}_{t',i}r^{(p-1)}_{t,j}r^{(p-1)}_{t',j}\,s^{W,(p)}_{ij}+\frac{1+\gamma(1-\delta_{pP})}{2}\,r^{(p)}_{t,i}r^{(p)}_{t',i}r^{(p)}_{t,j}r^{(p)}_{t',j}\,s^{L,(p)}_{ij}\right)+\sum_{p=1}^{P}\frac{\gamma^{p-P}}{T}\sum_{t=1}^{T}F\!\left(\mathbf{r}^{(p)}_t\right)^\top\mathbf{1}+\frac{\beta}{T}\sum_{t=1}^{T}\left\|\mathbf{r}^{(P)}_t-\mathbf{z}^l_t\right\|^2. \quad (28)$$

For each layer, we can define dual variables $W^{(p)}_{ij}$ and $L^{(p)}_{ij}$ for interactions with positive structure constants, and define the following variables:

$$\bar{W}^{(p)}_{ij}=\begin{cases}W^{(p)}_{ij}, & s^{W,(p)}_{ij}\ne 0\\ 0, & s^{W,(p)}_{ij}=0\end{cases},\qquad \bar{L}^{(p)}_{ij}=\begin{cases}L^{(p)}_{ij}, & s^{L,(p)}_{ij}\ne 0\\ 0, & s^{L,(p)}_{ij}=0\end{cases}. \quad (29)$$

Now we can rewrite (28) as:

$$\min_{\{\bar{W}^{(p)}\}}\max_{\{\bar{L}^{(p)}\}}\frac{1}{T}\sum_{t=1}^{T}\bar{l}_t\left(\{\bar{W}^{(p)}\},\{\bar{L}^{(p)}\},\mathbf{r}^{(0)}_t,\mathbf{z}^l_t,\beta\right), \quad (30)$$

where

$$\bar{l}_t:=\min_{\substack{a_-\le\mathbf{r}^{(p)}_t\le a_+\\ p=1,\ldots,P}}\sum_{p=1}^{P}\gamma^{p-P}\Bigg\{\sum_{\substack{i,j\\ s^{W,(p)}_{ij}\ne 0}}\frac{W^{(p)2}_{ij}}{s^{W,(p)}_{ij}}-\sum_{\substack{i,j\\ s^{L,(p)}_{ij}\ne 0}}\frac{\gamma(1-\delta_{pP})}{2}\frac{L^{(p)2}_{ij}}{s^{L,(p)}_{ij}}+\left[1+\gamma(1-\delta_{pP})\right]\mathbf{r}^{(p)\top}_t\bar{L}^{(p)}\mathbf{r}^{(p)}_t-2\,\mathbf{r}^{(p)\top}_t\bar{W}^{(p)}\mathbf{r}^{(p-1)}_t+2F\!\left(\mathbf{r}^{(p)}_t\right)^\top\mathbf{1}\Bigg\}+\beta\left\|\mathbf{r}^{(P)}_t-\mathbf{z}^l_t\right\|^2. \quad (31)$$

The neural dynamics follows the gradient of (31), which is

$$\tau_p\frac{d\mathbf{u}^{(p)}_t}{dt}=-\mathbf{u}^{(p)}+\mathbf{b}^{(p)}_t+\bar{W}^{(p)}\mathbf{r}^{(p-1)}_t+\gamma(1-\delta_{pP})\,\bar{W}^{(p+1)\top}\mathbf{r}^{(p+1)}_t-\left[1+\gamma(1-\delta_{pP})\right]\bar{L}^{(p)}\mathbf{r}^{(p)}_t-\beta\delta_{pP}\left(\mathbf{r}^{(P)}_t-\mathbf{z}^l_t\right),\qquad\mathbf{r}^{(p)}_t=f(\mathbf{u}^{(p)}),\quad p=1,\cdots,P. \quad (32)$$

Local learning rules follow the gradient descent and ascent of (31):

$$\Delta W^{(p)}_{ij}\propto\left(r^{(p)}_{j}\,r^{(p-1)}_{i}-\frac{W^{(p)}_{ij}}{s^{W,(p)}_{ij}}\right), \quad (33)$$

$$\Delta L^{(p)}_{ij}\propto\left(r^{(p)}_{j}\,r^{(p)}_{i}-\frac{L^{(p)}_{ij}}{s^{L,(p)}_{ij}}\right). \quad (34)$$

D Hyperparameters and Performance in Numerical Simulations with the MNIST Dataset
D.1 One Hidden Layer
Table 2 reports the training and validation errors of three variants of the EP algorithm and the CSM algorithm for a single hidden layer network on MNIST. The models were trained until the training error dropped to 0%, or as close to 0% as possible (as in the case of the EP algorithm with $\beta > 0$).
Table 2: Comparison of the training and validation errors of different algorithms for one hidden layer fully connected NNs on the MNIST data set.

Algorithm     Learning Rate                      Training Error (%)   Validation Error (%)   No. epochs
EP: ±β        α_W = 0.1, 0.05                    0                    2.53                    40
EP: +β        α_W = 0.5, 0.125                   0.034                2.18                    100
EP: lateral   α_W = 0.5, 0.25; α_L = 0.75        0                    2.29                    25
CSM           α_W = 0.5, 0.375; α_L = 0.01       0                    2.16                    25

D.2 Three Hidden Layers
In Table 3, the CSM algorithm employs a scheme with decaying learning rates. Specifically, the learning rates for the lateral updates are divided by factors of 5, 10, 50, and 100 as the training error drops below successively lower thresholds (starting at 5%).

Table 3: Comparison of the training and validation errors of different algorithms for three hidden layer fully connected NNs on the MNIST data set.

Algorithm      Learning Rate                             Training Error (%)   Validation Error (%)   No. epochs
EP: ±β         α_W = 0.128, 0.032, 0.008, 0.002          0                    2.73                    250
EP: +β         α_W = 0.128, 0.032, 0.008, 0.002          0                    2.77                    250
EP: lateral    α_W = 0.128, 0.032, 0.008, 0.002; α_L = –  –                    –                       –
CSM (adaptive) –                                         0                    3.52                    –

D.3 Structured Connectivity

In this section, we explain the simulations for structured connectivity and report the results. Every hidden layer in these networks can be considered as multiple two-dimensional grids stacked onto each other, with each grid containing neurons/units at periodically arranged sites. Each site only receives inputs from selected nearby sites. In this scheme, we consider lateral connections only between neurons sharing the same (x, y) coordinate, and the length and width of the grid are the same. In Table 4, 'Full' refers to simulations where the input is the 28 × 28 MNIST input image and 'Crop' refers to simulations in which the input image is a cropped 20 × 20 MNIST image. The first three, annotated by 'Full', correspond to the simulations reported in the main text. Errors are reported at the last epoch of the run. In networks with structured connectivity, additional hyperparameters are required to constrain the structure, which are enumerated below:

• Neurons-per-site (nps): The number of neurons placed at each site in a given hidden layer, i.e. the number of two-dimensional grids stacked onto each other. The nps for the input is 1.

• Stride: Spacing between adjacent sites, relative to the input channel. The stride of the input is always 1, i.e. sites are placed at (0, 0), (0, 1), (1, 0), and so on, on the two-dimensional grid. If the stride of the l-th layer is s, the nearest sites to the site at the origin will be (0, s) and (s, 0). A layer with stride s and nps n will have d × d × n units, where d = 28/s for the 'Full' runs and d = 20/s for the 'Crop' runs. The nps values and stride together assign coordinates to all the units in the network.

• Radius: The radius of the circular two-dimensional region that all units in the previous layer must lie within in order to have non-zero weights to the current unit. Any units in the previous layer lying outside the circle will not be connected to the unit.

Table 4: Comparison of the training and validation errors of different algorithms for one hidden layer NNs with structured connectivity on the MNIST data set.
Algorithm         Learning Rate      Training Error (%)   Validation Error (%)   No. epochs
R4, NPS4, Full    α_W = –; α_L = –   –                     –                       –
R4, NPS20, Full   α_W = –; α_L = –   –                     2.22                    –

E Hyperparameters and Performance in Numerical Simulations with the CIFAR-10 Dataset
Table 5 records the training and validation errors obtained for the CSM and EP algorithms for fully connected networks, as well as for CSM with structured networks, on the CIFAR-10 dataset. The validation error column for fully connected runs reports the mean of the last twenty validation errors reported at the end of the training period, as well as the standard error on the mean. For the structured runs, the training and validation errors reported are the averages of the last epoch's reported errors from 5 trials, and the standard errors on the means. This is done in order to account for fluctuations in the error during training.

Table 5: Comparison of the training and validation errors of different algorithms for different networks on the CIFAR-10 data set.

Algorithm, Connectivity, No. Hidden Layers   Learning Rate                                  Train Error (%)   Val Error (%)   No. epochs
CSM, FC, 1HL    α_W = 0.059, 0.017; α_L = 0.067                                             1.77              59.21           –
CSM, FC, 2HL    α_W = –; α_L = –                                                            17.96             51.76           –
CSM, Str, 1HL   α_W = 0.050, 0.0375; α_L = 0.01                                             34                49.5            250
CSM, Str, 2HL   α_W = 0.265, 0.073, 0.020; α_L = 0.075, 0.020                               46.8              51.4            200
EP, FC, 1HL     α_W = 0.014, 0.011                                                          0.76              57.60           –
EP, FC, 2HL     α_W = 0.014, 0.011, –                                                       1.25              53.43           –
F Performance of CSM as a Function of Nudge Strength and Feedback Strength
In the nudged deep similarity matching objective (3), β controls the strength of nudging, while γ specifies the strength of the feedback input compared with the feedforward input. Scellier and Bengio (2017) used β = 1 in their simulations. In all the simulations reported in the main text, we have set β = 1, γ = 1. In this section, we trained a single hidden layer network using CSM on MNIST while systematically varying the values of β and γ and keeping other parameters fixed. The validation errors for these experiments are documented in Tables 6 and 7 and plotted in Fig. 8. We find that the network has optimal values for γ less than 1.5, and β in the range bounded by 0.5 and 1. At these values, the network is able to converge to low validation errors (< 3%).

Figure 8: Validation error of a single hidden layer network trained by the CSM algorithm on the MNIST dataset as a function of the parameter β (A) and γ (B). Four trials were conducted for values for which the validation error was less than 3%. Dots indicate the mean validation error over trials, and error bars indicate the standard deviation over trials.

Table 6: Validation errors at the end of the training period for a fully connected one hidden layer network trained on MNIST, with different β values. For values that lay within the parameter range that converged to low (< 3%) validation errors, four trials were conducted. Here γ = 1.

β value   Mean Validation Error (%)   Minimum Validation Error (%)   Maximum Validation Error (%)
0.01      89.70                       89.70                          89.70
0.1       90.09                       90.09                          90.09
0.25      46.20                       2.28                           90.09
0.5       2.19                        2.17                           2.21
0.75      2.42                        2.22                           2.51
1.0       2.36                        2.26                           2.48
1.2       23.30                       2.36                           85.91
1.5       2.62                        2.40                           2.75
2.0       79.55                       79.55                          79.55

Table 7: Validation errors at the end of the training period for a fully connected one hidden layer network trained on MNIST, with different γ values. For parameters that converged to low (< 3%) validation errors, four trials were conducted. Here β = 1.

γ value   Mean Validation Error (%)   Minimum Validation Error (%)   Maximum Validation Error (%)
0.2       2.75                        2.64                           2.85
0.5       2.51                        2.43                           2.60
0.7       2.38                        2.24                           2.47
0.8       2.41                        2.32                           2.47
0.9       2.26                        2.21                           2.31
1.0       2.45                        2.38                           2.53
1.1       2.46                        2.37                           2.63
1.2       2.35                        2.26                           2.48
1.3       2.43                        2.28                           2.54
1.5       83.16                       83.16                          83.16
1.8       90.09                       90.09                          90.09

References

Anderson, J. R. and Peterson, C. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019.
Baldi, P. and Pineda, F. (1991). Contrastive learning and neural oscillations. Neural Computation, 3(4):526–545.

Bazzari, A. H. and Parri, H. R. (2019). Neuromodulators and long-term synaptic plasticity in learning and memory: A steered-glutamatergic perspective. Brain Sciences, 9(11):300.

Belilovsky, E., Eickenberg, M., and Oyallon, E. (2018). Greedy layerwise learning can scale to ImageNet. arXiv preprint arXiv:1812.11446.

Bengio, Y., Mesnard, T., Fischer, A., Zhang, S., and Wu, Y. (2015). STDP as presynaptic activity times rate of change of postsynaptic activity. arXiv preprint arXiv:1509.05936.

Brzosko, Z., Mierau, S. B., and Paulsen, O. (2019). Neuromodulation of spike-timing-dependent plasticity: past, present, and future. Neuron, 103(4):563–581.

Crick, F. (1989). The recent excitement about neural networks. Nature, 337(6203):129–132.

Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y., and Scellier, B. (2019). Updates of equilibrium prop match gradients of backprop through time in an RNN with static input. In Advances in Neural Information Processing Systems, pages 7079–7089.

Genkin, A., Sengupta, A. M., and Chklovskii, D. (2019). A neural network for semi-supervised learning on manifolds. In International Conference on Artificial Neural Networks, pages 375–386. Springer.

Grill-Spector, K. and Weiner, K. S. (2014). The functional architecture of the ventral temporal cortex and its role in categorization. Nature Reviews Neuroscience, 15(8):536–548.

Guerguiev, J., Lillicrap, T. P., and Richards, B. A. (2017). Towards deep learning with segregated dendrites. eLife, 6:e22901.

Hinton, G. E. and McClelland, J. L. (1988). Learning representations by recirculation. In Neural Information Processing Systems, pages 358–366.

Kandel, E. R., Schwartz, J. H., Jessell, T. M., Siegelbaum, S., and Hudspeth, A. (2000). Principles of Neural Science, volume 4. McGraw-Hill, New York.

Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., Tanaka, K., and Bandettini, P. A. (2008). Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6):1126–1141.

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

LeCun, Y., Cortes, C., and Burges, C. (2010). MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2.

Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. (2015). Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 498–515. Springer.

Li, Z., Brendel, W., Walker, E., Cobos, E., Muhammad, T., Reimer, J., Bethge, M., Sinz, F., Pitkow, Z., and Tolias, A. (2019). Learning from brains how to regularize machines. In Advances in Neural Information Processing Systems, pages 9525–9535.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276.

Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., and Hinton, G. (2020). Backpropagation and the brain. Nature Reviews Neuroscience, pages 1–12.

Movellan, J. R. (1991). Contrastive Hebbian learning in the continuous Hopfield model. In Connectionist Models, pages 10–17. Elsevier.

Nøkland, A. (2016). Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045.

Nøkland, A. and Eidnes, L. H. (2019). Training neural networks with local error signals. arXiv preprint arXiv:1901.06656.

Obeid, D., Ramambason, H., and Pehlevan, C. (2019). Structured and deep similarity matching via structured and deep Hebbian networks. In Advances in Neural Information Processing Systems, pages 15377–15386.

Olshausen, B. A. and Field, D. J. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487.

Ororbia, A. G. and Mali, A. (2019). Biologically motivated algorithms for propagating local target representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4651–4658.

Pehlevan, C. and Chklovskii, D. (2015). A normative theory of adaptive dimensionality reduction in neural networks. In Advances in Neural Information Processing Systems, pages 2269–2277.

Pehlevan, C. and Chklovskii, D. B. (2019). Neuroscience-inspired online unsupervised learning algorithms: Artificial neural networks. IEEE Signal Processing Magazine, 36(6):88–96.

Pehlevan, C., Mohan, S., and Chklovskii, D. B. (2017). Blind nonnegative source separation using biological neural networks. Neural Computation, 29(11):2925–2954.

Pehlevan, C., Sengupta, A. M., and Chklovskii, D. B. (2018). Why do similarity matching objectives lead to Hebbian/anti-Hebbian networks? Neural Computation, 30(1):84–124.

Richards, B. A. and Lillicrap, T. P. (2019). Dendritic solutions to the credit assignment problem. Current Opinion in Neurobiology, 54:28–36.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Sacramento, J., Costa, R. P., Bengio, Y., and Senn, W. (2018). Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in Neural Information Processing Systems, pages 8721–8732.

Scellier, B. and Bengio, Y. (2017). Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11:24.

Sengupta, A., Pehlevan, C., Tepper, M., Genkin, A., and Chklovskii, D. (2018). Manifold-tiling localized receptive fields are optimal in similarity-preserving neural networks. In Advances in Neural Information Processing Systems, pages 7080–7090.

Team, T. T. D., Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., et al. (2016). Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.

Whittington, J. C. and Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5):1229–1262.

Whittington, J. C. and Bogacz, R. (2019). Theories of error back-propagation in the brain. Trends in Cognitive Sciences.

Xie, X. and Seung, H. S. (2000). Spike-based learning rules and stabilization of persistent neural activity. In Advances in Neural Information Processing Systems, pages 199–208.

Xie, X. and Seung, H. S. (2003). Equivalence of backpropagation and contrastive Hebbian learning in a layered network. Neural Computation, 15(2):441–454.

Yamins, D. L. and DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex.