Synaptic Metaplasticity in Binarized Neural Networks
Axel Laborieux, Maxence Ernoult, Tifenn Hirtzlin, Damien Querlioz
Université Paris-Saclay, CNRS, Centre de Nanosciences et de Nanotechnologies, 91120 Palaiseau, France.
* [email protected], [email protected]

ABSTRACT
While deep neural networks have surpassed human performance in multiple situations, they are prone to catastrophic forgetting: upon training a new task, they rapidly forget previously learned ones. Neuroscience studies, based on idealized tasks, suggest that in the brain, synapses overcome this issue by adjusting their plasticity depending on their past history. However, such "metaplastic" behaviour has never been leveraged to mitigate catastrophic forgetting in deep neural networks. In this work, we highlight a connection between metaplasticity models and the training process of binarized neural networks, a low-precision version of deep neural networks. Building on this idea, we propose and demonstrate experimentally, in situations of multitask and stream learning, a training technique that prevents catastrophic forgetting without needing previously presented data, nor formal boundaries between datasets. We support our approach with a theoretical analysis on a tractable task. This work bridges computational neuroscience and deep learning, and presents significant assets for future embedded and neuromorphic systems.
Introduction
In recent years, deep neural networks have experienced incredible developments, outperforming the state of the art, and sometimes human performance, for tasks ranging from image classification to natural language processing [1]. Nonetheless, these models suffer from catastrophic forgetting [2, 3] when learning new tasks: synaptic weights optimized during former tasks are not protected against further weight updates and are overwritten, causing the accuracy of the neural network on these former tasks to plummet [4, 5] (see Fig. 1(a)). Balancing between learning new tasks and remembering old ones is sometimes thought of as a trade-off between plasticity and rigidity: synaptic weights need to be modified in order to learn, but also to remain stable in order to remember. This issue is particularly critical in embedded environments, where data is processed in real time without the possibility of storing past data. Given the rate of synaptic modifications, most artificial neural networks were found to have exponentially fast forgetting [6]. This contrasts strongly with the capability of the brain, whose forgetting process is typically described by a power-law decay [7], and which can naturally perform continual learning.

The neuroscience literature provides insights about underlying mechanisms in the brain that enable task retention. In particular, it was suggested by Fusi et al. [6, 8] that memory storage requires, within each synapse, hidden states with multiple degrees of plasticity. For a given synapse, the higher the value of this hidden state, the less likely this synapse is to change: it is said to be consolidated. These hidden variables could account for activity-dependent mechanisms regulated by intercellular signalling molecules occurring in real synapses [9, 10]. The plasticity of the synapse itself being plastic, this behaviour is named "metaplasticity". The metaplastic state of a synapse can be viewed as a criterion of importance with respect to the tasks that have been learned so far, and therefore constitutes one possible approach to overcome catastrophic forgetting.

Until now, models of metaplasticity have been used for idealized situations in neuroscience studies. However, intriguingly, in the field of deep learning, binarized neural networks [11] (or the closely related XNOR-NETs [12]) have a remote connection with the concept of metaplasticity that has so far never been explored. Binarized neural networks are neural networks whose weights and activations are constrained to the values +1 and −1. These networks were developed for performing inference with low computational and memory cost and, surprisingly, can achieve excellent accuracy on multiple vision [12, 16] and signal processing tasks. The training procedure of binarized neural networks involves a real value associated with each synapse, which accumulates the gradients of the loss computed with binary weights. This real value is said to be "hidden", as during inference we only use its sign to get the binary weight. In this work, we interpret the hidden weight in binarized neural networks as a metaplastic variable that can be leveraged to achieve multitask learning. Based on this insight, we develop a learning strategy using binarized neural networks to alleviate catastrophic forgetting with strong biological-type constraints: previously presented data can not be stored, nor generated, and the loss function is not task-dependent with weight penalties.

An important benefit of our synapse-centric approach is that it does not require a formal separation between datasets, which also allows the possibility of learning a single task in a more continuous fashion. Traditionally, if new data appears, the network needs to relearn, incorporating the new data into the old data: otherwise the network will just learn the new data and forget what it had already learned. Through the example of the progressive learning of datasets, we show that our metaplastic binarized neural network, by contrast, can continue to learn a task when new data becomes available, without seeing the previously presented data of the dataset. This feature makes our approach particularly attractive for embedded contexts. The spatially and temporally local nature of the consolidation mechanism also makes it highly attractive for hardware implementations, in particular using neuromorphic approaches.

Our approach takes a remarkably different direction from the considerable research in deep learning that is now addressing the question of catastrophic forgetting. Many proposals consist in keeping or retrieving information about the data or the model at previous tasks: using data generation [18], the storing of exemplars [19], or preserving the initial model response in some components of the network [20]. These strategies do not seem connected to how the brain avoids catastrophic forgetting, need a very formal separation of the tasks, and are not very appropriate for embedded contexts. A solution to the trade-off between plasticity and rigidity more connected to ours is to protect synaptic weights from further changes according to their "importance" for the previous task. For example, elastic weight consolidation [3] uses the diagonal elements of the Fisher information matrix of the model distribution with respect to its parameters to identify synaptic weights qualifying as important for a given task. In another work [21], the consolidation strategy consists in computing an importance factor based on a path integral. Finally, [22] uses the sensitivity of the network with respect to small changes in synaptic weights. In all these techniques, the desired memory effect is enforced by changing the loss function and does not emerge from the synaptic behaviour itself. This aspect requires a very formal separation of the tasks, and makes these models still largely incompatible with the constraints of biology and embedded contexts.
The highly non-local nature of the consolidation mechanism also makes it difficult to implement in neuromorphic-type hardware.

Specifically, the contributions of the present work are the following:

• We interpret the hidden real value associated with each weight (or hidden weight) in binarized neural networks as a metaplastic variable, and we propose a new training algorithm for these networks adapted to learning different tasks sequentially (Alg. 1).

• We show that our algorithm allows a binarized neural network to learn permuted MNIST tasks sequentially with an accuracy equivalent to elastic weight consolidation, but without any change to the loss function or the explicit computation of a task-specific importance factor. More complex sequences such as MNIST-Fashion-MNIST can also be learned sequentially, with test accuracy on both tasks showing no degradation with respect to the accuracy reached on a single task.

• We show that our algorithm enables learning the Fashion-MNIST and the CIFAR-10 datasets by learning sequentially each subset of these datasets, which we call the stream-type setting.

• We show that our approach has a mathematical justification in the case of a tractable quadratic binary task where the trajectory of hidden weights can be derived explicitly.
Figure 1. Problem setting and illustration of our approach. (a) Problem setting: two training sets (here MNIST and Fashion-MNIST) are presented sequentially to a fully connected neural network. When learning MNIST (epochs 0 to 50), the MNIST test accuracy reaches 97%, while the Fashion-MNIST accuracy stays around 10%. When learning Fashion-MNIST (epochs 50 to 100), the associated test accuracy reaches 85% while the MNIST test accuracy collapses to ∼20% in 25 epochs: this phenomenon is known as "catastrophic forgetting". (b) Illustration of our approach: in a binarized neural network, each synapse incorporates a hidden weight W^h used for learning and a binary weight W^b = sign(W^h) used for inference. Our method, inspired by neuroscience works in the literature [6, 8], amounts to regarding hidden weights as metaplastic states that can encode memory across tasks and thereby alleviate forgetting. Compared with the conventional training technique of binarized neural networks, it consists in modulating some hidden weight updates by a function f_meta(W^h) whose shape is indicated in (c). This modulation is applied to negative updates of positive hidden weights, and to positive updates of negative hidden weights. f_meta(|W^h|) being a decreasing function, this modulation makes the hidden weight signs less likely to switch back as they grow in absolute value.

Multitask Learning with Metaplastic Binarized Neural Networks

The training process of conventional binarized neural networks relies on updating hidden real weights associated with each synapse, using loss gradients computed with binary weights. The binary weights are the signs of the hidden real weights, and are used in the equations of both the forward and backward passes. By contrast, the hidden weights are updated as a result of the learning rule, which therefore affects the binary weights only when a hidden weight changes sign - the detailed training algorithms are presented in Supplementary Algorithms 1 and 2 of Supplementary Note 1. Hidden weight magnitudes have no impact on inference: two given binary weights of a binarized neural network may both be equal to one, while their corresponding hidden weights may differ depending on the history of the training process.
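To fix ideas, here is a minimal PyTorch-style sketch of one such conventional training step: a single binarized layer trained with a plain SGD-like update and a toy regression loss. The tensor names and the loss are illustrative assumptions of ours, not the authors' released code.

```python
import torch

def sign_ste(x):
    # Forward: sign(x); backward: derivative of Hardtanh, i.e. identity
    # clipped to |x| <= 1, the usual surrogate for training BNNs.
    x_clipped = torch.clamp(x, -1.0, 1.0)
    return (torch.sign(x) - x_clipped).detach() + x_clipped

w_hidden = torch.randn(784, 100, requires_grad=True)   # real hidden weights
x = torch.randn(64, 784)                               # a toy input batch
target = torch.randn(64, 100)                          # a toy target

w_binary = sign_ste(w_hidden)      # binary weights, used for inference
y = x @ w_binary                   # forward pass uses binary weights only
loss = ((y - target) ** 2).mean()  # loss is evaluated with binary weights
loss.backward()                    # gradients flow back to the hidden weights

with torch.no_grad():
    w_hidden -= 0.005 * w_hidden.grad  # the *hidden* weight accumulates updates
    w_hidden.grad.zero_()
```

Only the sign of `w_hidden` ever reaches the forward pass, so its magnitude is exactly the kind of hidden state discussed above.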
Algorithm 1
Our modification of the BNN training procedure to implement metaplasticity. W^h are the hidden weights, θ_BN are the Batch Normalization parameters, U_W and U_θ are the parameter updates prescribed by the Adam algorithm [23], (x, y) is a batch of labelled training data, m is the metaplasticity parameter, and η is the learning rate. "·" denotes the element-wise product of two tensors with compatible shapes. The difference between our implementation and the non-metaplastic implementation (recovered for m = 0) lies in the condition of lines 6 to 9. f_meta is applied element-wise with respect to W^h. "cache" denotes all the intermediate layer computations that need to be stored for the backward pass. The details of the Forward and Backward functions are provided in Supplementary Note 1.

Input: W^h, θ_BN, U_W, U_θ, (x, y), m, η. Output: W^h, θ_BN, U_W, U_θ.
1: W^b ← Sign(W^h)  ▷ Compute binary weights
2: ŷ, cache ← Forward(x, W^b, θ_BN)  ▷ Perform inference
3: C ← Cost(ŷ, y)  ▷ Compute the mean loss over the batch
4: (∂_W C, ∂_θ C) ← Backward(C, ŷ, W^b, θ_BN, cache)  ▷ Cost gradients
5: (U_W, U_θ) ← Adam(∂_W C, ∂_θ C, U_W, U_θ)
6: if U_W · W^b > 0 then  ▷ If U_W prescribes to decrease |W^h|
7:  W^h ← W^h − η U_W · f_meta(m, W^h)  ▷ Metaplastic update
8: else
9:  W^h ← W^h − η U_W
10: end if
11: θ_BN ← θ_BN − η U_θ
12: return W^h, θ_BN, U_W, U_θ

Described as such, the training process of binarized neural networks is intriguingly similar to that of the metaplastic Hopfield networks of [8]: it prescribes to binarize the weight for the computation of the preactivations, or "synaptic currents", and to update a metaplastic hidden variable for learning. This comparison suggests that the hidden weights in binarized neural networks could also be used as metaplastic variables. In our work, we show that we can use the hidden weights as a criterion of importance to learn several tasks sequentially with one binarized neural network, which involves one single set of synaptic weights. However, for this purpose, the training procedure of binarized neural networks needs to be adapted. Based on the work of Fusi [6], our intuition is that binary weights with high hidden weight values are relevant to the current task and can be consolidated: the learning process should ensure that the greater the hidden real value, the more difficult the binary weight is to switch back. Without such a blocking mechanism, there cannot be long-term memory across tasks, since the number of updates required to learn a given task is heuristically equal to the number of updates required to unlearn it. We therefore introduce the function f_meta to provide an asymmetry, at equivalent gradient absolute values, between updates towards zero hidden weights and away from zero (see Fig. 1(b)). The higher the hidden weight, the more difficult it is for the binary weight to switch sign, which is very similar in spirit to the cascade of metaplastic states introduced in [6]. The strength of the metaplasticity effect is characterized by the real parameter m of the function f_meta (see Fig. 1(c)), the case m = 0 corresponding to conventional, non-metaplastic training.
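For concreteness, the conditional update of Algorithm 1 can be sketched in a few lines of PyTorch; the tensor names are ours, and the optimizer step producing `update` is assumed to come from Adam as in the algorithm:

```python
import torch

def f_meta(m, w_hidden):
    # Eq. (6): f_meta(m, x) = 1 - tanh^2(m * x); equals 1 at x = 0 and
    # decays to 0 for large |x|, freezing strongly consolidated weights.
    return 1.0 - torch.tanh(m * w_hidden) ** 2

def metaplastic_update(w_hidden, update, m, lr):
    # `update` is the step prescribed by the optimizer (Adam in Alg. 1)
    # from gradients computed with the binary weights. The f_meta
    # modulation applies only where the step would push the hidden
    # weight towards zero, i.e. where update * sign(w_hidden) > 0.
    w_binary = torch.sign(w_hidden)
    towards_zero = update * w_binary > 0
    step = torch.where(towards_zero, update * f_meta(m, w_hidden), update)
    return w_hidden - lr * step
```

Setting m = 0 makes f_meta identically one, so the sketch degenerates to the conventional BNN update, exactly as stated above.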
We first evaluate our approach on the permuted MNIST benchmark, where each task consists of a fixed spatial permutation of the MNIST pixels (see Methods). We train a binarized neural network with two hidden layers of 4,096 units using Algorithm 1, with several metaplasticity values m and 40 epochs per task (see Methods). Fig. 2 shows this process of learning six tasks. The conventional binarized neural network (m = 0.0) is subject to catastrophic forgetting: after learning a given task, the test accuracy quickly drops upon learning a new task. Increasing the parameter m gradually prevents the test accuracy on previous tasks from decreasing, with eventually the m = 1.35 binarized neural network (Fig. 2(d)) managing to learn all six tasks with test accuracies comparable to the 97.4% test accuracy achieved by the BNN trained on one task only (see Table 1). Figs. 2(g) and 2(h) show the distribution of the metaplastic hidden weights in the second layer after learning Task 1 and Task 2. The consolidated weights of the first task in Fig. 2(g) correspond to hidden weights between zero and five in magnitude. We also observe in Fig. 2(g) that around 10% of binary weights still have hidden weights near zero after learning one task. These weights correspond to synapses that repeatedly switched between +1 and −1.

Table 1. Binarized neural network test accuracies on six permuted MNISTs at the end of training, for different settings: a conventional (non-metaplastic) BNN (m = 0.0), elastic weight consolidation (computed with the parameter λ_EWC given in Supplementary Note 2), random consolidation, and our metaplastic binarized neural network approach (m = 1.35), all implemented on the same binarized neural network architecture (see Methods). We indicate the mean and standard deviation over five trials for each of Tasks 1 to 6. [The numerical entries of this table could not be recovered from the source.]

We see that the random consolidation approach does not allow multitask learning. On the other hand, our approach achieves a performance similar to elastic weight consolidation for learning six permuted MNISTs with the given architecture, although, unlike elastic weight consolidation, the consolidation is based on an entirely local rule and does not change the loss function.

Supplementary Figure 1 shows a more detailed analysis of the performance of our approach when learning up to ten MNIST permutations, and for varying sizes of the binarized neural network, highlighting the connection between network size and its capacity in terms of number of tasks.

As a control experiment, we also applied Algorithm 1 to a full-precision network, except for the weight binarization step described in line one. Fig. 2(e) and Fig. 2(f) show the final accuracy of each task at the end of learning for a binarized neural network and for a deep neural network with real-valued weights, respectively, with the same architecture.
Figure 2. Permuted MNIST learning task. (a-d) Binarized neural network learning six tasks sequentially for several values of the metaplastic parameter m, ranging from m = 0 in (a) to m = 1.35 in (d). Curves are averaged over five runs and shadows correspond to one standard deviation. (e,f) Final test accuracy on each task after the last task has been learned. The dots indicate the mean values over five runs, and the shaded zones one standard deviation. (e) corresponds to a binarized neural network and (f) corresponds to our method applied to a deep neural network with real-valued weights and the same architecture. (g,h) Hidden weight distributions of an m = 1.35 binarized neural network with two hidden layers of 4,096 units after learning each task for 40 epochs: (g) one permuted MNIST and (h) two permuted MNISTs.

For the same range of m values, the full-precision network cannot retain more than three tasks with accuracy above 90%. This highlights that our weight consolidation strategy is tied specifically to the use of a binarized neural network. This experimental result points out the fundamentally different meanings of the hidden weights in a binarized neural network and of the real weights in a full-precision neural network. In full-precision networks, inference is carried out using the real weights; in particular, the loss function is also computed using these weights. Conversely, in binarized neural networks, inference is done with the binary weights, and the loss function is also evaluated with these binary weights, which has two major consequences. First, the hidden weights do not undergo the same updates as the weights of a full-precision network. Second, a synapse whose hidden weight is positive and which is prescribed a positive update will consequently not affect the loss, nor its gradient at the next learning iteration, since the loss only takes into account the sign of the hidden weights. Hidden weights in binarized neural networks consequently have a natural tendency to spread over time (Fig. 2(g,h)): they are not technically weights, but a trace of the history of the network updates that is relevant for memory effects.
Figure 3. MNIST/Fashion-MNIST sequential learning. Binarized neural network learning MNIST and Fashion-MNIST sequentially ((a) and (b)) or Fashion-MNIST and MNIST ((c) and (d)), for two values of the metaplastic parameter m: a non-metaplastic network (m = 0) and a metaplastic one.

We then consider the harder situation of sequentially learning two different datasets: MNIST and Fashion-MNIST [24], the latter consisting of images of fashion items belonging to ten classes. As shown in Fig. 3(b), the metaplastic binarized neural network manages to learn both tasks sequentially with baseline accuracies, regardless of the order chosen to learn the tasks.

Figure 4. Stream learning experiments. (a) Progressive learning of the Fashion-MNIST dataset. The dataset is split into 60 parts of only 1,000 examples each, all containing all ten classes. Each sub-dataset is learned for 20 epochs. The dashed lines represent the accuracies reached when the training is done on the full dataset for 20 epochs, so that all curves are obtained with the same number of optimization steps. (b) Progressive learning of the CIFAR-10 dataset. The dataset is split into 20 parts of only 2,500 examples each. Each sub-dataset is learned for 200 epochs. The dashed lines represent the accuracies reached when the training is done on the full dataset for 200 epochs. Shadows correspond to one standard deviation around the mean over five runs.
Stream Learning: Learning one Task from Subsets of Data
We have shown that the hidden weights of binarized neural networks can readily be used as importance factors for synaptic consolidation. In our approach, it is therefore not required to compute an explicit importance factor for each synaptic weight: our consolidation strategy is carried out simultaneously with the weight update, and locally in space, as consolidation only involves the hidden weights. The absence of formal dataset boundaries in our approach is important to tackle another aspect of catastrophic forgetting, where all the training data of a given task is not available at the same time. In this section, we use our method to address this situation, which we call "stream learning": the network learns one task but can only access one subset of the full dataset at a given time. Subsets of the full dataset are learned sequentially, and the data of previous subsets cannot be accessed in the future.

We first consider the Fashion-MNIST dataset, split into 60 subsets presented sequentially during training (see Methods). The learning curves for regular and metaplastic binarized neural networks are shown in Fig. 4(a), the dashed lines corresponding to the accuracy reached by the same architecture trained on the full dataset after full convergence. We observe that the metaplastic binarized neural network trained sequentially on subsets of data performs as well as the non-metaplastic binarized neural network trained on the full dataset. The difference in accuracy between the baselines can be explained by our consolidation strategy gradually reducing the number of weights able to switch, therefore acting as a learning rate decay. As the training settings are common across all subsets of data, the metaplastic binarized neural network gains new knowledge with each subset of data without any information about subset boundaries. This feature is especially useful for embedded applications, and is not currently possible in alternative approaches of the literature addressing catastrophic forgetting.

Mathematical Interpretation
We now provide a mathematical interpretation for the hidden weights of binarized neural networks: we show in archetypal situations that the larger a hidden weight gets while learning a given task, the bigger the loss increase upon flipping the sign of the associated binary weight, and consequently the more important this weight is with respect to this task. For this purpose, we define a quadratic binary task, an analytically tractable and convex counterpart of a binarized neural network optimization task. This task, defined formally in Supplementary Note 3, consists in finding the global optimum on a landscape featuring a uniform (Hessian) curvature. The gradient used for the optimization is evaluated using only the sign of the parameters W^h (Fig. 5(a)), in the same way that binarized neural networks employ only the sign of hidden weights for computing gradients during training. In Supplementary Note 3, we demonstrate theoretically that throughout optimization on the quadratic binary task, if the uniform norm of the weight optimum vector is greater than one, the hidden weight vector diverges. Fig. 5(a) shows an example in two dimensions where such a divergence is seen. This situation is reminiscent of the training of binarized neural networks on practical tasks, where the divergence of some hidden weights is observed. In the particular case of a diagonal Hessian curvature, a correspondence exists between diverging hidden weights and components of the weight optimum greater than one in absolute value. We can derive an explicit form for the asymptotic evolution of the diverging hidden weights while optimizing: the hidden weights diverge linearly, W^h_{i,t} ∼ W̃^h_i t, with a speed proportional to the curvature and the absolute magnitude of the global optimum (see Supplementary Note 3). Given this result, we can prove the following theorem (see Supplementary Note 3):
Theorem 1. Let W optimize the quadratic binary task with optimum weight W* and curvature matrix H, using the optimization scheme:

W^h_{t+1} = W^h_t − η H · (sign(W^h_t) − W*).

We assume H equal to diag(λ_1, ..., λ_d) with λ_i > 0 for all i ∈ {1, ..., d}. Then, if |W*_i| > 1, the variation of loss resulting from flipping the sign of W^b_{i,t} is:

Δ_i L(W_t) ∼ 4 (λ_i + |W̃^h_i| / η)  as t → +∞.   (1)

This theorem states that the increase in the loss induced by flipping the sign of a diverging hidden weight is asymptotically proportional to the sum of the curvature and a term proportional to the hidden weight's divergence speed. Hence the correlation between high-valued hidden weights and important binary weights.

Interestingly, this interpretation, established rigorously in the case of a diagonal Hessian curvature, may generalize to non-diagonal Hessian cases. Fig. 5, for example, illustrates the correspondence between hidden weights and high impact on the loss upon sign change on a quadratic binary task (Fig. 5(b)) with a 500-dimensional non-diagonal Hessian matrix (see Methods for the generation procedure). Fig. 5(c,d,e) finally shows that this correspondence extends to a practical binarized neural network trained on MNIST. In this case, the cost variation E_data(ΔL) upon switching binary weight signs increases monotonically with the magnitudes of the hidden weights (see Methods for implementation details). These results provide an interpretation as to why hidden weights can be thought of as local importance factors useful for continual learning applications.
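To make the statement concrete, the following NumPy simulation runs the quadratic binary task dynamics with a toy diagonal Hessian (the dimensions, curvatures and optimum are arbitrary choices of ours, not values from the paper), and checks both the predicted linear divergence of consolidated hidden weights and the loss increase of Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, T = 4, 0.01, 200_000
lam = np.array([0.5, 1.0, 2.0, 4.0])        # curvatures (diagonal H)
w_star = np.array([0.3, -0.8, 1.5, -2.0])   # components 2 and 3 exceed 1
w_h = rng.normal(size=d)

for _ in range(T):
    w_h -= eta * lam * (np.sign(w_h) - w_star)   # the update of Eq. (8)

# Predicted asymptotic slope: sign(W*_i) * eta * lam_i * (|W*_i| - 1)
slope = np.sign(w_star) * eta * lam * (np.abs(w_star) - 1.0)
print("hidden weights:   ", w_h)
print("predicted slope*t:", slope * T)   # matches w_h where |W*_i| > 1

def loss(w_b):
    return (w_b - w_star) @ np.diag(lam) @ (w_b - w_star)

w_b = np.sign(w_h)
for i in (2, 3):                          # flip a consolidated binary weight
    flipped = w_b.copy()
    flipped[i] = -flipped[i]
    print(f"component {i}: dL = {loss(flipped) - loss(w_b):.3f}, "
          f"theory 4*lam*|W*| = {4 * lam[i] * abs(w_star[i]):.3f}")
```

Components with |W*_i| < 1 oscillate around zero (Lemma 1 of Supplementary Note 3), while the two others diverge linearly, and flipping their signs costs exactly 4λ_i|W*_i|, in line with Eq. (1).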
Discussion and Related Works

Addressing catastrophic forgetting with ideas from both neuroscience and machine learning has led us to an artificial neural network with richer synaptic behaviour that can perform continual learning without requiring an overhead computation of task-related importance factors. The continual learning capability of metaplastic binarized neural networks emerges from their intrinsic design, which is in stark contrast with other consolidation strategies [3, 21, 22]. The resulting model is more autonomous because the optimized loss function is the same across all tasks.

Figure 5. Interpretation of the meaning of hidden weights. (a) Example of a hidden weight trajectory in a two-dimensional quadratic binary task. One hidden weight W^h_x diverges because the optimal weight vector W* has uniform norm greater than one (Lemma 2 of Supplementary Note 3). (b) Mean increase in the loss incurred by switching the sign of a hidden weight, as a function of the normalized value of the hidden weight, for a 500-dimensional quadratic binary task. The mean is taken by assigning hidden weights to bins of increasing absolute value. The leftmost point corresponds to hidden weights staying bounded. (c,d,e) Increase in the loss incurred by switching the sign of hidden weights, as a function of the normalized absolute value of the hidden weight, in a binarized neural network trained on MNIST. The scales differ because the layers have different numbers of weights and thus different relative importance. See Methods for implementation details.

Metaplastic synapses enable binarized neural networks to learn several tasks sequentially similarly to related works but, more importantly, our approach takes the first steps beyond a more fundamental limitation of deep learning, namely the need for a full dataset to learn a given task. A single autonomous model able to learn a task from small amounts of data while still gaining knowledge, approaching to some extent the way the brain acquires new information, paves the way for widespread use of embedded hardware for which it is impossible to store large datasets.

Additionally, taking inspiration from the metaplastic behaviour of actual synapses of the brain resulted in a strategy where the consolidation is local in space and time. This makes our approach particularly suited for artificial-intelligence dedicated hardware and neuromorphic computing, which can save considerable energy by employing circuit architectures optimized for the topology of neural network models, therefore limiting data movement [25]. The fact that our approach builds on synapses with rich behaviour also resonates with the progress of nanotechnologies, which can provide compact and energy-efficient electronic devices able to mimic neuroscience-inspired models [26, 27, 28, 29]. This also evidences the benefit of taking inspiration from biology with regard to purely mathematically-motivated approaches: biology-inspired approaches tend to be naturally compatible with the constraints of hardware development and can be amenable to the development of energy-efficient artificial intelligence.

In conclusion, we have shown that the hidden weights involved in the training of binarized neural networks are excellent candidates as metaplastic variables that can be efficiently leveraged for continual learning. We have implemented long-term memory into binarized neural networks by modifying the hidden weight update of synapses. Our work highlights that binarized neural networks might be more than a low-precision version of deep neural networks, as well as the potential benefits of the synergy between neuroscience and machine learning research, which for instance aims to convey long-term memory to artificial neural networks. We have also mathematically justified our technique in a tractable quadratic binary problem. Our method allows for online synaptic consolidation directly from model behaviour, which is important for neuromorphic dedicated hardware, and is also useful for a variety of settings subject to catastrophic forgetting.
This work was supported by European Research Council Starting Grant NANOINFER (reference: 715872). The authors would like to thank L. Herrera-Diez, J. Thiele, G. Hocquet, P. Bessière, T. Dalgaty and J. Grollier for discussion and invaluable feedback on the manuscript.
Author Contributions
AL developed the PyTorch code used in this project and performed all subsequent simulations. AL and ME carried out the mathematical analysis of the Mathematical Interpretation section. TH provided the initial idea for the project, and an initial NumPy version of the code. Authors ME and TH contributed equally to the project. DQ directed the work. All authors participated in data analysis, discussed the results and co-edited the manuscript.
Conflict of Interest Statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Methods
Metaplasticity-Inspired Training of Binarized Neural Networks
The binarized neural networks studied in this work are designed and trained following the principles introduced in [11] - specific implementation details are provided in Supplementary Note 2. These networks consist of binarized layers where both weight values and neuron activations assume binary values, meaning {+1, −1}. Binarized neural networks can achieve high accuracy on vision tasks [12, 16], provided that the number of neurons is increased with regard to real-valued neural networks. Binarized neural networks are especially promising for AI hardware because, unlike conventional deep networks which rely on costly matrix-vector multiplications, these operations for binarized neural networks can be done in hardware with XNOR logic gates and pop-count operations, reducing the power consumption by several orders of magnitude [13, 14].

In this work, we propose an adaptation of the conventional binarized neural network training technique to provide binarized neural networks with metaplastic synapses. We introduce the function f_meta : R⁺ × R → R to provide an asymmetry, at equivalent gradient value and for a given weight, between updates towards zero hidden value and away from zero. Alg. 1 describes our optimization update rule; the unmodified version of the update rule is recovered when m = 0. f_meta is defined such that:

∀ x ∈ R, f_meta(0, x) = 1,   (2)
∀ m ∈ R⁺, f_meta(m, 0) = 1,   (3)
∀ m ∈ R⁺, ∂_x f_meta(m, 0) = 0,   (4)
∀ m ∈ R⁺, lim_{|x|→+∞} f_meta(m, x) = 0.   (5)

Conditions (3) and (4) ensure that, for near-zero real values, the weights are free to switch in order to learn. Condition (5) ensures that the farther from zero a real value is, the more difficult it is to make the corresponding weight switch back. In all the experiments of this paper, we use:

f_meta(m, x) = 1 − tanh²(m · x).   (6)

The parameter m controls how fast binary weights are consolidated (Fig. 1(c)). The specific choice of f_meta is made to have a variety of plasticities over large ranges of time steps (iteration steps), with an exponential dependence as in [6]. Specific values of the hyperparameters can be found in Supplementary Note 2.
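As a quick worked example, the following snippet numerically checks conditions (2)-(5) for this choice of f_meta (the value m = 1.35 is the one used in the permuted MNIST experiments; the test points are arbitrary):

```python
import numpy as np

def f_meta(m, x):
    return 1.0 - np.tanh(m * x) ** 2   # the choice of Eq. (6)

x = np.linspace(-10.0, 10.0, 5)
assert np.allclose(f_meta(0.0, x), 1.0)       # (2): m = 0 recovers the plain BNN
assert np.isclose(f_meta(1.35, 0.0), 1.0)     # (3): unit modulation at zero
eps = 1e-6                                     # (4): zero slope at the origin
assert abs((f_meta(1.35, eps) - f_meta(1.35, -eps)) / (2 * eps)) < 1e-9
assert f_meta(1.35, 1e3) < 1e-12              # (5): vanishes far from zero
```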
Multitask training experiments

A permuted version of the MNIST dataset consists of a fixed spatial permutation of pixels applied to each example of the dataset. For comparison, we also train a full-precision (32-bit floating point) version of our network with the same architecture, but with a tanh activation function instead of sign. The parameters learned in batch normalization are not binary and therefore cannot be consolidated by our metaplastic strategy. Therefore, in our experiments, the binarized and full-precision neural networks have task-specific batch normalization parameters, in order to isolate the effect of weight consolidation on the test accuracies of previous tasks. The elastic weight consolidation control is trained with the parameter λ_EWC reported in Supplementary Note 2. The random consolidation control presented in Table 1 consists in computing the same importance factors as elastic weight consolidation, but then randomly shuffling the importance factors across synapses.
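For illustration, one way to build such permuted tasks with torchvision is the following sketch (our own construction, not necessarily the authors' exact pipeline):

```python
import torch
from torchvision import datasets, transforms

def make_permuted_mnist(seed, root="./data"):
    # One task = one fixed pixel permutation applied to every MNIST example.
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(28 * 28, generator=g)
    tf = transforms.Compose([
        transforms.ToTensor(),
        transforms.Lambda(lambda img: img.view(-1)[perm]),  # permute pixels
    ])
    return datasets.MNIST(root, train=True, download=True, transform=tf)

tasks = [make_permuted_mnist(seed=k) for k in range(6)]  # six sequential tasks
```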
Stream learning experiments

For the Fashion-MNIST experiments, we use a metaplastic binarized neural network with two hidden layers of 1,024 units. The dataset is split into 60 subsets of 1,000 examples each, and each subset is learned for 20 epochs. (All classes are represented in each subset.) For the CIFAR-10 experiments, we use a binary version of VGG-7 similarly to [16], with six convolution layers of 128-128-256-256-512-512 filters and kernel sizes of 3. Dropout is also used in this network.
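A minimal sketch of this subset protocol, assuming a generic torch Dataset and a placeholder training loop, is:

```python
import torch
from torch.utils.data import Subset

def stream_subsets(train_set, n_parts):
    # Cut the training set into n_parts disjoint, shuffled subsets; earlier
    # subsets are never revisited once the stream has moved on.
    idx = torch.randperm(len(train_set)).tolist()
    size = len(train_set) // n_parts
    return [Subset(train_set, idx[k * size:(k + 1) * size])
            for k in range(n_parts)]

# for part in stream_subsets(train_set, n_parts=60):
#     train(part, epochs=20)   # `train` is a placeholder training loop
```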
Sign Switch in a binarized neural network

Two major differences between the quadratic binary task and the binarized neural network are the dependence on the training data and the relative contribution of each parameter, which is lower in the case of the BNN than in the quadratic binary task. The procedure for generating Fig. 5(c,d,e) has to be adapted accordingly. Bins of increasing normalized hidden weights are created but, instead of computing the cost variation for a single sign switch, a fixed number of weights is switched within each bin, so as to increase the contribution of the sign switches to the cost variation. The resulting cost variation is then normalized with respect to the number of switched weights. An average is taken over several realizations of the hidden weights to be switched. Given the different sizes of the three layers, the numbers of switched weights per bin for each layer are respectively 1,000, 2,000, and 100.
Symmetric Positive Definite Matrix Generation
To generate random symmetric positive definite matrices, we first generate the diagonal matrix of eigenvalues D = diag(λ_1, ..., λ_d) with a uniform or normal distribution of mean µ and variance σ², ensuring that all eigenvalues are positive. We then use the subgroup algorithm described in [30] to generate a random rotation R in dimension d, and finally compute H = Rᵀ · D · R.
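The following NumPy sketch implements this construction; for brevity, it samples the random rotation via the QR decomposition of a Gaussian matrix (a standard Haar sampler, used here in place of the subgroup algorithm of [30]):

```python
import numpy as np

def random_spd(d, mu=1.0, sigma=0.3, rng=None):
    # Eigenvalues drawn from a normal law, forced positive as in the text.
    rng = np.random.default_rng() if rng is None else rng
    eigvals = np.abs(rng.normal(mu, sigma, size=d))
    # Random rotation: QR of a Gaussian matrix, with the diagonal signs of R
    # folded back into Q so that the result is Haar-distributed.
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    rot = q * np.sign(np.diag(r))
    return rot.T @ np.diag(eigvals) @ rot   # H = R^T D R

H = random_spd(500)
assert np.allclose(H, H.T) and np.all(np.linalg.eigvalsh(H) > 0)
```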
Data and Code Availability

Throughout this work, all simulations are performed using PyTorch 1.1.0. The source code used in this work is freely available online in the GitHub repository: https://github.com/Laborieux-Axel/SynapticMetaplasticityBNN. All datasets used (MNIST, Fashion-MNIST, CIFAR-10) are available in the public domain.
References

1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
2. Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A. & Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. In Proceedings of the International Conference on Learning Representations (ICLR) (2014).
3. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A. et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114, 3521–3526 (2017).
4. French, R. M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci., 128–135 (1999).
5. McClelland, J. L., McNaughton, B. L. & O'Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev., 419 (1995).
6. Fusi, S., Drew, P. J. & Abbott, L. F. Cascade models of synaptically stored memories. Neuron (2005).
7. Wixted, J. T. & Ebbesen, E. B. On the form of forgetting. Psychol. Sci., 409–415 (1991).
8. Benna, M. K. & Fusi, S. Computational principles of synaptic memory consolidation. Nat. Neurosci., 1697 (2016).
9. Abraham, W. C. & Bear, M. F. Metaplasticity: the plasticity of synaptic plasticity. Trends Neurosci., 126–130 (1996).
10. Abraham, W. C. Metaplasticity: tuning synapses and networks for plasticity. Nat. Rev. Neurosci., 387 (2008).
11. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R. & Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830 (2016).
12. Rastegari, M., Ordonez, V., Redmon, J. & Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 525–542 (Springer, 2016).
13. Conti, F., Schiavone, P. D. & Benini, L. XNOR neural engine: A hardware accelerator IP for 21.6-fJ/op binary neural network inference. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2940–2951 (2018).
14. Bankman, D., Yang, L., Moons, B., Verhelst, M. & Murmann, B. An always-on 3.8 µJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28-nm CMOS. IEEE J. Solid-State Circuits, 158–172 (2018).
15. Hirtzlin, T., Bocquet, M., Penkovsky, B., Klein, J.-O., Nowak, E., Vianello, E., Portal, J.-M. & Querlioz, D. Digital biologically plausible implementation of binarized neural networks with differential hafnium oxide resistive memory arrays. Front. Neurosci. 13 (2019).
16. Lin, X., Zhao, C. & Pan, W. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, 345–353 (2017).
17. Penkovsky, B., Bocquet, M., Hirtzlin, T., Klein, J.-O., Nowak, E., Vianello, E., Portal, J.-M. & Querlioz, D. In-memory resistive RAM implementation of binarized neural networks for medical applications. In Design, Automation and Test in Europe Conference (DATE) (2020).
18. Shin, H., Lee, J. K., Kim, J. & Kim, J. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, 2990–2999 (2017).
19. Rebuffi, S.-A., Kolesnikov, A., Sperl, G. & Lampert, C. H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001–2010 (2017).
20. Li, Z. & Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell., 2935–2947 (2017).
21. Zenke, F., Poole, B. & Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, 3987–3995 (JMLR.org, 2017).
22. Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M. & Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), 139–154 (2018).
23. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
24. Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
25. Editorial. Big data needs a hardware revolution. Nature, 145, DOI: 10.1038/d41586-018-01683-1 (2018).
26. Ambrogio, S., Narayanan, P., Tsai, H., Shelby, R. M., Boybat, I., di Nolfo, C., Sidler, S., Giordano, M., Bodini, M., Farinha, N. C. et al. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature, 60–67 (2018).
27. Boyn, S., Grollier, J., Lecerf, G., Xu, B., Locatelli, N., Fusil, S., Girod, S., Carrétéro, C., Garcia, K., Xavier, S. et al. Learning through ferroelectric domain dynamics in solid-state synapses. Nat. Commun., 1–7 (2017).
28. Romera, M., Talatchian, P., Tsunegi, S., Araujo, F. A., Cros, V., Bortolotti, P., Trastoy, J., Yakushiji, K., Fukushima, A., Kubota, H., Yuasa, S., Ernoult, M., Vodenicarevic, D., Hirtzlin, T., Locatelli, N., Querlioz, D. & Grollier, J. Vowel recognition with four coupled spin-torque nano-oscillators. Nature, 230 (2018).
29. Torrejon, J., Riou, M., Araujo, F. A., Tsunegi, S., Khalsa, G., Querlioz, D., Bortolotti, P., Cros, V., Yakushiji, K., Fukushima, A. et al. Neuromorphic computing with nanoscale spintronic oscillators. Nature, 428 (2017).
30. Diaconis, P. & Shahshahani, M. The subgroup algorithm for generating uniform random variables. Probab. Eng. Inf. Sci., 15–32 (1987).
31. Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).

Supplementary Note 1: Forward and Backward Propagation in Binarized Neural Networks
Supplementary Algorithm 1
Forward function of the BNN, reproduced from [11]. W^b = (W^b_l)_{l=1...L} are the binary weights, θ_BN = {(γ_l, β_l) | l = 1...L} are the Batch Normalization parameters. L is the total number of layers, and the subscript l, when specified, is the layer index. x is a batch of input data with dimensions (P, N), with P the number of pixels and N the number of examples in the batch. E(·) and Var(·) are the batch-wise mean and variance. While they are computed during training from the statistics of the batches, running averages of the mean and variance are stored to be used at test time; this enables the network to infer on a single example at test time. ε is a small number that avoids division by zero.

Input: W^b, θ_BN, x. Output: ŷ, cache.
1: a^b_0 ← x  ▷ The input is not binarized
2: for l = 1 to L do  ▷ Loop over the layers
3:  z_l ← W^b_l a^b_{l−1}  ▷ Matrix multiplication
4:  a_l ← γ_l · (z_l − E(z_l)) / √(Var(z_l) + ε) + β_l  ▷ Batch Normalization
5:  if l < L then  ▷ If not the last layer
6:   a^b_l ← Sign(a_l)  ▷ The activation is binarized
7:  end if
8: end for
9: ŷ ← a_L
10: return ŷ, cache
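For illustration, a compact PyTorch rendition of this forward pass might look as follows; the module structure and names are our own assumptions rather than the released code:

```python
import torch

class BNNForward(torch.nn.Module):
    # Fully connected BNN forward pass: binary weights everywhere, binarized
    # activations except at the input and at the output layer.
    def __init__(self, sizes=(784, 4096, 4096, 10)):
        super().__init__()
        self.weights = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.randn(o, i))
             for i, o in zip(sizes, sizes[1:])]
        )
        self.bns = torch.nn.ModuleList(
            [torch.nn.BatchNorm1d(o) for o in sizes[1:]]   # (gamma, beta, E, Var)
        )

    def forward(self, x):
        a = x                                   # the input is not binarized
        for l, (w, bn) in enumerate(zip(self.weights, self.bns)):
            z = a @ torch.sign(w).t()           # matmul with binary weights
            a = bn(z)                           # batch normalization
            if l < len(self.weights) - 1:
                a = torch.sign(a)               # binarize hidden activations
        return a                                # y_hat, not binarized
```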
Supplementary Algorithm 2

Backward function of the BNN, reproduced from [11]. W^b = (W^b_l)_{l=1...L} are the binary weights, θ_BN = {(γ_l, β_l) | l = 1...L} are the Batch Normalization parameters. BackBatchNorm(·) specifies how to backpropagate through Batch Normalization [31]. L is the total number of layers, and the subscript l, when specified, is the layer index. 1_{|a_l| ≤ 1} is the derivative of Hardtanh, taken as a surrogate for backpropagating through the Sign activation.

Input: C, ŷ, W^b, θ_BN, cache. Output: (∂_W C, ∂_θ C).
1: g_{a_L} ← ∂C/∂ŷ  ▷ Cost gradient with respect to the output
2: for l = L down to 1 do  ▷ Loop backward over the layers
3:  if l < L then  ▷ If not the last layer
4:   g_{a_l} ← g_{a^b_l} · 1_{|a_l| ≤ 1}  ▷ Backpropagate through Sign
5:  end if
6:  (g_{z_l}, g_{γ_l}, g_{β_l}) ← BackBatchNorm(g_{a_l}, z_l, γ_l, β_l)  ▷ See [31]
7:  g_{a^b_{l−1}} ← (W^b_l)ᵀ g_{z_l}
8:  g_{W^b_l} ← g_{z_l} (a^b_{l−1})ᵀ
9: end for
10: ∂_W C ← {g_{W^b_l} | l = 1...L}
11: ∂_θ C ← {g_{γ_l}, g_{β_l} | l = 1...L}
12: return (∂_W C, ∂_θ C)

The optimization is performed using the Adaptive Moment Estimation (Adam) algorithm [23]. As the Sign function is not differentiable at zero and has zero derivative everywhere else, the derivative of the Hardtanh function is used as a surrogate for the derivative of Sign during error backpropagation. The activation function is the Sign function, except at the output layer. The input neurons are not binarized. We use batch normalization [31] at all layers, as detailed in Supplementary Algorithm 1. The following derivation for layer l,

γ_l (z − E(z)) / √(Var(z) + ε) + β_l = ( γ_l / √(Var(z) + ε) ) ( z − [ E(z) − β_l √(Var(z) + ε) / γ_l ] ),

and therefore

Sign( γ_l (z − E(z)) / √(Var(z) + ε) + β_l ) = Sign(γ_l) · Sign( z − [ E(z) − β_l √(Var(z) + ε) / γ_l ] ),

shows that, because the Sign function is invariant under multiplication of its argument by any positive constant, the only task-dependent parameters that need to be stored on an inference hardware chip are the term between square brackets, along with the sign of γ_l. The number of task-dependent parameters scales as the number of neurons, which is orders of magnitude smaller than the number of synapses.

The Adam optimizer updates the hidden weights with loss gradients computed using the binary weights only. We use a small weight decay in the Adam optimizer to make near-zero hidden values more stable. Consolidated weights, however, are not subject to weight decay, as we implement weight decay as a modification of the loss gradient, which is gradually suppressed by f_meta.
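As a sanity check of this reparameterization, the following NumPy snippet (with arbitrary test values of ours) verifies that the sign of the batch-normalized pre-activation equals sign(γ) times the sign of a simple threshold comparison:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 64))          # pre-activations (batch, units)
mean, var = z.mean(0), z.var(0)
gamma = rng.normal(size=64)
beta = rng.normal(size=64)
eps = 1e-5

bn_sign = np.sign(gamma * (z - mean) / np.sqrt(var + eps) + beta)
threshold = mean - beta * np.sqrt(var + eps) / gamma   # the bracketed term
folded_sign = np.sign(gamma) * np.sign(z - threshold)
assert np.array_equal(bn_sign, folded_sign)
```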
Supplementary Note 2: Training parameters

Table 2. Hyperparameters for the permuted MNISTs experiment.

                   Binarized meta     Binarized EWC      Full precision
Layers             784-4096-4096-10   784-4096-4096-10   784-4096-4096-10
Learning rate      0.005              0.005              0.005
Minibatch size     100                100                100
Epochs/task        40                 40                 40
m                  1.35               -                  -
λ_EWC              -                  *                  -

Table 3. Hyperparameters for the FMNIST-MNIST experiment.

                   Binarized meta
Layers             784-4096-4096-10
Learning rate      0.005
Minibatch size     100
Epochs/task        50
m                  *

The batch normalization layer parameters were not learned for the Fashion-MNIST experiment, whereas they were learned for the CIFAR-10 experiment.

Table 4. Hyperparameters for the stream learning experiments.

                   Stream FMNIST      Stream CIFAR-10
Network            Binarized meta     Binarized meta
Layers             784-1024-1024-10   VGG-7
Sub-parts          60                 20
Learning rate      0.005              0.0001
Minibatch size     100                64
Epochs/subset      20                 200
m                  *                  *

(* Values marked with an asterisk could not be recovered from the source. In the stream learning experiments, the batch normalization parameters β and γ are set to fixed values.)

Supplementary Figure 6. Influence of the network size on the number of tasks learned.
Mean test accuracy over the tasks learned so far, for up to ten tasks. Each task is a permuted version of MNIST, learned for 40 epochs. The binarized neural network architecture consists of two hidden layers with a number of hidden units ranging from 512 to 4,096. (a) uses metaplasticity with parameter m = 1.35 and (b) uses elastic weight consolidation.

Supplementary Note 3: Mathematical proofs

Definition 1 (Quadratic Binary Task). Consider the loss function:

L(W) = (W − W*)ᵀ H (W − W*),   (7)

with H ∈ R^{d×d} a symmetric positive definite matrix. Gradients are given by g(W) = H (W − W*). We assume the following optimization scheme:

W^h_{t+1} = W^h_t − η H (sign(W^h_t) − W*),   (8)

where sign returns the sign of a vector component-wise.

Lemma 1 (Condition for hidden weight confinement). Let W^h optimize a quadratic binary task according to the dynamics W^h_{t+1} = W^h_t − η H (sign(W^h_t) − W*). Let B_∞ be the unit ball for the infinity norm and B̄_∞ its closure. Then:

W* ∈ B_∞ ⇒ ∃ C > 0, ∀ t ∈ N, ‖W^h_t‖_∞ < C,   (9)

W* ∉ B̄_∞ ⇒ lim_{t→∞} ‖W^h_t‖_∞ = ∞.   (10)
Proof of Lemma 1. We first prove Eq. (10). Assume that W* ∉ B̄_∞, so that there exists at least one component i ∈ {1, ..., d} such that |W*_i| > 1. Since H is symmetric positive definite, it is invertible. Taking the Euclidean scalar product of H⁻¹e_i with the update (W^h_{t+1} − W^h_t) yields:

⟨e_i, W^h_{t+1} − W^h_t⟩_{H⁻¹} = (H⁻¹ e_i)ᵀ (W^h_{t+1} − W^h_t)
  = −η (H⁻¹ e_i)ᵀ H (sign(W^h_t) − W*)
  = −η e_iᵀ (H⁻¹)ᵀ H (sign(W^h_t) − W*)
  = −η e_iᵀ H⁻¹ H (sign(W^h_t) − W*)
  = −η e_iᵀ (sign(W^h_t) − W*)
  = −η (sign(W^h_{i,t}) − W*_i),

where the fourth equality uses the fact that H⁻¹ is also symmetric. Since |W*_i| > 1, the sign of sign(W^h_{i,t}) − W*_i is constant (and nonzero), so the component of W^h_t along H⁻¹e_i is expected to diverge. More precisely, assume W*_i > 1. Then sign(W^h_{i,t}) − W*_i ≤ 1 − W*_i < 0 and:

⟨e_i, W^h_{t+1} − W^h_t⟩_{H⁻¹} ≥ −η (1 − W*_i).   (11)

Summing Eq. (11) from time step 0 to t yields:

⟨e_i, W^h_t⟩_{H⁻¹} ≥ −η (1 − W*_i) t + ⟨e_i, W^h_0⟩_{H⁻¹},   (12)

showing that lim_{t→+∞} ⟨e_i, W^h_t⟩_{H⁻¹} = +∞. Consequently, there exists j ∈ {1, ..., d} such that lim_{t→+∞} ⟨e_j, W^h_t⟩ = +∞, and therefore lim_{t→∞} ‖W^h_t‖_∞ = +∞. Similarly, if W*_i < −1, we show that:

⟨e_i, W^h_t⟩_{H⁻¹} ≤ η (1 + W*_i) t + ⟨e_i, W^h_0⟩_{H⁻¹},   (13)

giving the same conclusion as above.

We now prove Eq. (9). Assume that W* ∈ B_∞, i.e. |W*_i| < 1 for all i ∈ {1, ..., d}. We have:

‖W^h_{t+1}‖²_{H⁻¹} = ⟨W^h_t + ΔW^h_t, W^h_t + ΔW^h_t⟩_{H⁻¹}
  = ‖W^h_t‖²_{H⁻¹} + 2 ⟨ΔW^h_t, W^h_t⟩_{H⁻¹} + ‖ΔW^h_t‖²_{H⁻¹}
  = ‖W^h_t‖²_{H⁻¹} + 2 (H⁻¹ ΔW^h_t)ᵀ W^h_t + ‖ΔW^h_t‖²_{H⁻¹}
  = ‖W^h_t‖²_{H⁻¹} − 2η (sign(W^h_t) − W*)ᵀ W^h_t + ‖ΔW^h_t‖²_{H⁻¹},

so that:

‖W^h_{t+1}‖²_{H⁻¹} − ‖W^h_t‖²_{H⁻¹} ≤ 0  ⇔  (sign(W^h_t) − W*)ᵀ W^h_t ≥ ‖ΔW^h_t‖²_{H⁻¹} / (2η).   (14)

We want to show that if W^h_t is large enough in the norm ‖·‖_{H⁻¹}, Eq. (14) is met. First note that, because the dimension is finite, there exist two constants α, β > 0 such that α‖x‖_{H⁻¹} < ‖x‖_∞ < β‖x‖_{H⁻¹} for all x ∈ R^d, and also that ‖ΔW^h_t‖_{H⁻¹} = η ‖sign(W^h_t) − W*‖_H. By the triangle inequality:

η ‖sign(W^h_t) − W*‖_H ≤ η (‖sign(W^h_t)‖_H + ‖W*‖_H).

Denoting (e_α)_α the eigenbasis of H and (λ_α)_α the associated eigenvalues, the Cauchy-Schwarz inequality gives:

‖sign(W^h_t)‖²_H = ⟨H sign(W^h_t), sign(W^h_t)⟩ = Σ_α λ_α |⟨sign(W^h_t), e_α⟩|² ≤ Σ_α λ_α ‖sign(W^h_t)‖² ‖e_α‖² ≤ d² λ_max,

since ‖sign(W^h_t)‖² = d and ‖e_α‖ = 1, so that:

‖ΔW^h_t‖_{H⁻¹} ≤ η (d √λ_max + ‖W*‖_H).   (15)

Thus the right-hand side of Eq. (14) is bounded. Also note that:

(sign(W^h_t) − W*)ᵀ W^h_t = Σ_i (1 − sign(W^h_{i,t}) W*_i) |W^h_{i,t}|
  ≥ Σ_i (1 − |W*_i|) |W^h_{i,t}|
  ≥ (1 − ‖W*‖_∞) Σ_i |W^h_{i,t}|
  ≥ (1 − ‖W*‖_∞) ‖W^h_t‖_∞.
So far, we have shown that the left-hand side of Eq. (14) is lower-bounded by a positive constant times the infinity norm of W^h_t, while its right-hand side is bounded. Therefore, to ensure Eq. (14), it suffices that:

(1 − ‖W*‖_∞) ‖W^h_t‖_∞ ≥ (η/2) (d √λ_max + ‖W*‖_H)²
  ⇔ ‖W^h_t‖_∞ ≥ η (d √λ_max + ‖W*‖_H)² / (2 (1 − ‖W*‖_∞)),

and thus, since ‖W^h_t‖_∞ > α ‖W^h_t‖_{H⁻¹}, it suffices that:

‖W^h_t‖_{H⁻¹} ≥ M := η (d √λ_max + ‖W*‖_H)² / (2α (1 − ‖W*‖_∞)).

We can therefore conclude that ‖W^h_t‖_{H⁻¹} ≥ M ⇒ ‖W^h_{t+1}‖_{H⁻¹} < ‖W^h_t‖_{H⁻¹}. Because the update ΔW^h_t is bounded in the norm ‖·‖_{H⁻¹}, an absolute upper bound of ‖W^h_t‖_∞ is:

C = β max( ‖W^h_0‖_{H⁻¹}, M + η (d √λ_max + ‖W*‖_H) ).

Thus we have proven that W* ∈ B_∞ ⇒ ∃ C > 0, ∀ t ∈ N, ‖W^h_t‖_∞ < C. ∎

Lemma 2 (Hidden weight trajectory). Let W^h optimize a quadratic binary task according to the dynamics W^h_{t+1} = W^h_t − η H (sign(W^h_t) − W*), and assume H = diag(λ_1, ..., λ_d). Then:

|W*_i| > 1 ⇒ W^h_{i,t} ∼_{t→+∞} sign(W*_i) η λ_i (|W*_i| − 1) t = W̃^h_i t.   (16)

Proof of Lemma 2. If H = diag(λ_1, ..., λ_d), the dynamics of W^h_t defined in Eq. (8) rewrites component-wise:

ΔW^h_{i,t} := W^h_{i,t+1} − W^h_{i,t} = −η λ_i (sign(W^h_{i,t}) − W*_i), for all i ∈ {1, ..., d}.   (17)

By Lemma 1, components i such that |W*_i| < 1 stay bounded, while components with |W*_i| > 1 diverge. For such a component, ΔW^h_{i,t} eventually has the sign of W*_i, since Eq. (17) rewrites:

ΔW^h_{i,t} = sign(W*_i) η λ_i (|W*_i| − sign(W*_i W^h_{i,t})),   (18)

where the factor in parentheses is positive, so that W^h_{i,t} necessarily ends up having the same sign as W*_i. Hence there exists t_{0,i} ∈ N such that:

∀ t > t_{0,i},  ΔW^h_{i,t} = sign(W*_i) η λ_i (|W*_i| − 1).   (19)

By definition of t_{0,i}, W^h_{i,t} and W*_i have opposite signs before t_{0,i}, so that:

∀ t ≤ t_{0,i},  ΔW^h_{i,t} = sign(W*_i) η λ_i (|W*_i| + 1).   (20)

Therefore, summing Eq. (17) between 0 and t yields:

W^h_{i,t} = W^h_{i,0} + Σ_{u=0}^{t_{0,i}} sign(W*_i) η λ_i (|W*_i| + 1) + Σ_{u=t_{0,i}+1}^{t} sign(W*_i) η λ_i (|W*_i| − 1)
  = W^h_{i,0} + sign(W*_i) η λ_i (|W*_i| + 1) t_{0,i} + sign(W*_i) η λ_i (|W*_i| − 1)(t − t_{0,i})
  ∼_{t→+∞} sign(W*_i) η λ_i (|W*_i| − 1) t = W̃^h_i t.   (21)  ∎

Theorem 2 (Importance of hidden weights in a quadratic binary task). Let W optimize a quadratic binary task according to the dynamics W^h_{t+1} = W^h_t − η H (sign(W^h_t) − W*), and assume H = diag(λ_1, ..., λ_d). Then, for any component i such that |W*_i| > 1, the variation of loss resulting from flipping sign(W^h_{i,t}) → −sign(W^h_{i,t}) is:

Δ_i L(W^h_t) = 4 λ_i |W*_i| = 4 (λ_i + |W̃^h_i| / η).   (22)

Proof of Theorem 2. Using Eq. (7), the loss reads:

L(W^h_t) = (sign(W^h_t) − W*)ᵀ H (sign(W^h_t) − W*) = Σ_{i=1}^{d} λ_i (sign(W^h_{i,t}) − W*_i)²
  = Σ_{i : |W*_i| ≤ 1} λ_i (sign(W^h_{i,t}) − W*_i)² + Σ_{i : |W*_i| > 1} λ_i (sign(W^h_{i,t}) − W*_i)².

Using Lemma 2, for all components i such that |W*_i| > 1, there exists t_{0,i} such that for all t > t_{0,i}, sign(W^h_{i,t}) = sign(W*_i), and therefore λ_i (sign(W^h_{i,t}) − W*_i)² = λ_i (1 − |W*_i|)². Defining T = max_{i : |W*_i| > 1} t_{0,i}, the loss rewrites for t > T:

L(W^h_t) = Σ_{i : |W*_i| ≤ 1} λ_i (sign(W^h_{i,t}) − W*_i)² + Σ_{i : |W*_i| > 1} λ_i (|W*_i| − 1)².

Then, the increase in the loss if the sign of a binary component with |W*_i| > 1 is flipped is:

Δ_i L(W^h_t) = λ_i ( (|W*_i| + 1)² − (|W*_i| − 1)² ) = 4 λ_i |W*_i|.   (23)

Using the explicit form of W^h_{i,t} in Eq. (21) along with Eq. (23), we get:

W^h_{i,t} = W^h_{i,0} + sign(W*_i) η λ_i (|W*_i| + 1) t_{0,i} + sign(W*_i) η λ_i (|W*_i| − 1)(t − t_{0,i})
  = W^h_{i,0} + sign(W*_i) η λ_i ( Δ_i L / (4 λ_i) + 1 ) t_{0,i} + sign(W*_i) η λ_i ( Δ_i L / (4 λ_i) − 1 )(t − t_{0,i})
  = W^h_{i,0} + sign(W*_i) (η / 4) Δ_i L · t + sign(W*_i) η λ_i (2 t_{0,i} − t)
  = sign(W*_i) (η / 4) (Δ_i L − 4 λ_i) t + W^h_{i,0} + 2 sign(W*_i) η λ_i t_{0,i}.

Since W^h_{i,t} has the same sign as W*_i for t large enough, multiplying both sides by sign(W*_i) and dividing by t yields:

Δ_i L(W^h_t) = 4 ( λ_i + |W^h_{i,t}| / (η t) ) + O(1/t),   (24)

which, combined with |W^h_{i,t}| / t → |W̃^h_i| (Lemma 2), gives Eq. (22). ∎