DecisiveNets: Training Deep Associative Memories to Solve Complex Machine Learning Problems
Vincent Gripon, Carlos Lassance, and Ghouthi Boukli Hacene
IMT Atlantique and Mila
Abstract.
Learning deep representations to solve complex machine learning tasks has become the prominent trend in the past few years. Indeed, Deep Neural Networks are now the gold standard in domains as various as computer vision, natural language processing or even playing combinatorial games. However, problematic limitations are hidden behind this surprising universal capability. Among other things, explainability of the decisions is a major concern, especially since deep neural networks are made up of a very large number of trainable parameters. Moreover, computational complexity can quickly become a problem, especially in contexts constrained by real time or limited resources. Therefore, understanding how information is stored and the impact this storage can have on the system remains a major and open issue. In this chapter, we introduce a method to transform deep neural network models into deep associative memories, with simpler, more explainable and less expensive operations. We show through experiments that these transformations can be done without penalty on predictive performance. The resulting deep associative memories are excellent candidates for artificial intelligence that is easier to theorize and manipulate.
Keywords:
Deep Neural Networks, Deep Associative Memories, Quantization, Sparse Coding
1 Introduction

In recent years, Deep Neural Networks (DNNs) have experienced a succession of major breakthroughs that have gradually elevated them to the standard method in machine learning. DNNs obey the philosophy of replacing arbitrary human choices with automatic optimizations, directly derived from an objective function related to the problem to be solved. Thus, once the contours of an architecture have been determined, its functionality is acquired through a complex process involving a large amount of data, random processes and various combinations aimed at obtaining an efficient and generalizable solution.

Contrary to many other methods, DNNs do not seem to suffer from overfitting. Interestingly, the opposite seems to be true: larger architectures generalize better, a phenomenon for which there is still no clear explanation (we refer the reader to [1] for a possible one). On the other hand, as the
architectures (and their performance) grow, their complexity and explainability deteriorate. In areas where the decisions of DNNs need to be understood, or where the resources involved in the computations are limited, a complex dilemma is thus faced that has given rise to an abundant literature (cf. Section 2).

Instead of trying to explain these complex architectures, a promising strategy is to approximate trained models by simpler and less expensive ones. This is, for example, the purpose of the knowledge distillation field [2]. But even if the predictive capacity is thus transmitted from complex models to simpler ones, the fact remains that the latter are difficult to explain and compute. Why not try instead to approximate trained deep neural network architectures by models built on lighter and more explainable mechanisms, allowing an explicit account of how information is stored? In this chapter we provide elements to address such a question.

While DNNs are constructed explicitly as mathematical functions, with the goal of associating inputs with outputs, other models take a different view, where neural networks become the receptacle of information elements that can be retrieved from a fraction of their content. In this vein, sparse associative memories are very simple, yet surprisingly powerful tools that can be deployed in budget-constrained systems.

In this chapter, we introduce a method for transforming a deep neural network model, optimized for a machine learning task, into a deep associative memory with similar performance, although using processes that are lighter to implement and simpler to visualize and interpret. We illustrate this capability using standardized datasets in the field of computer vision.

The outline of the chapter is as follows. In Section 2 we introduce related work. In Section 3 we introduce notations and the deep neural network and associative memory methodologies. In Section 4, we introduce our proposed method that aims at transforming deep neural networks into deep associative memories. In Section 5 we perform experiments and discuss the obtained results, and finally we conclude in Section 6.
2 Related Work

Deep neural network complexity has been a hot topic for researchers in the past few years. There are many reasons to be interested in the compression of DNNs, including lack of memory, constrained energy, maximum acceptable latency, or targeted data throughput. However, many works [3] show that there exists a compromise between network size and accuracy, leading to enormous architectures to reach state-of-the-art performance in many domains including vision and natural language processing.

As an effort to reduce the size of DNNs while maintaining a high level of accuracy, some authors propose to rely on pruning, where neurons and/or weights can be removed during training based on a measure of their importance in the decision process. Pruning can be unstructured [4], meaning that any set of weights and neurons can be removed. As a consequence, the sparsity of the resulting operators can be hard to leverage depending on the targeted hardware. On the contrary, other works aim at removing specific sets of neurons and weights which are more easily exploitable [5]. To measure neuron and/or weight importance, numerous criteria have been introduced. For instance, in [6], the authors use the sum of absolute weights of each channel to select less important parameters. Soft Filter Pruning (SFP) [7] is another approach that dynamically prunes filters in a soft manner. In [8], the authors propose Neuron Importance Score Propagation (NISP), a method that estimates neuron scores using the reconstruction error of the last layer before classification when computing back-propagation. Yamamoto et al. [9] use a channel-pruning technique based on an attention mechanism, where attention blocks are introduced into each layer and updated during training to evaluate the importance of each channel. In [10], the authors propose a learnable differentiable mask that aims at finding out, during the training process, the less important neurons, channels or even layers, and pruning them. In [11], the authors propose to give the DNN the ability to decide during training which criterion should be considered for each layer when pruning. Another method that relates better to our proposed solution is Shift Attention Layers [12]. The idea is to prune all but one weight per convolutional kernel, which reduces both DNN memory and complexity and eases DNN implementation on resource-limited embedded systems [13].

Another line of works consists in quantizing values and weights in DNNs, with the extreme case being binarizing both [14,15]. As a consequence, memory requirements are heavily reduced and operations considerably simplified. Problematically, many such works end with a noticeable drop in accuracy of the corresponding architectures. In order to compensate for this drop, it is often required to increase the size of the architecture, leading to an unclear optimal trade-off in the general case. Other approaches propose to limit the accuracy drop by using low bit quantization, where weights and/or activations are represented using 2, 3 or 4 bits [16,17,18,19], or even learn the number of bits required to represent the values of each layer [20].
In such a scenario, possible values are limited, the quantized network size is reduced, and multiplications may be replaced by Look-Up Tables (LUTs) [21].

Other works propose to compensate the drop in accuracy by relying on knowledge distillation [2], where a bigger architecture (called the teacher) is used to train a smaller one (called the student).

The method we introduce in this chapter takes yet a different approach: instead of trying to reduce the complexity of a DNN architecture, we progressively transform it into a Deep Hetero-Associative Memory (DHAM), built from assembling lighter operators.
Associative memories are devices able to store and then retrieve pieces of information (called messages) from a portion of their content. The most prominent model is the one proposed by John Hopfield in [22], where the principle consists in using the
Gram matrix to store binary ({−1, 1}) messages. Using a simple iterative algorithm, such messages can be reliably retrieved as long as their number remains small compared to the number of neurons in the architecture. In [23], the authors proved that this bound evolves as n/log(n), contradicting the conjecture of many previous works that estimated this capacity to be linear in n.

To increase the number of messages it is possible to store, other authors have proposed to consider sparse binary ({0, 1}) messages instead. Because each of these messages contains less entropy, it is theoretically possible to store more of them. For example, in [24], the authors do not update the iterative retrieval procedure proposed in [22]. Better results are generally obtained using the model in [25], where connection weights are thresholded at 1. More recently, in [26], the authors introduced a constraint on stored messages: they can be split into c parts, where each part contains exactly one nonzero value. In [27], a comparative study of these models is performed, showing that the latter one can reliably store the largest number of messages. This number of messages evolves as n²/log(n), where n is the number of neurons in the architecture.

In our work, we use the model introduced in [26], deployed in multiple layers to mimic the architecture of a DNN. We describe in detail how to build such a model in the next section.

3 Notations and Background

In this section we introduce the vocabulary and notations required for the remainder of this work. We begin with DNNs and continue with sparse associative memories.
3.1 Deep Neural Networks

Deep Neural Networks [28] (DNNs) are composite systems obtained by assembling layers. Layers can be of various types and serve different purposes. In this work, we consider a generic model for a layer using a function f_l : x ↦ σ(Wx + b), where W is a weight tensor, b is a bias tensor and σ is a nonlinear activation function. Layers can be assembled using various combinators, including additions or concatenations. As a result, a composite function f is obtained, which associates an input x with a corresponding decision ŷ = f(x).

We refer to f as the network function. Initially, the coefficients in the tensors W and b – called parameters – are typically arbitrarily chosen, with no relation to the considered problem to be solved using the deep neural network.

During a learning phase, network functions are trained to solve a problem by optimizing an objective function on a training dataset. Most of the time, a training dataset is made of pairs (x, y), where y is the ideal decision associated with x. During the training phase, y and ŷ = f(x) are compared using an objective function. The derivative of this objective function is used to update the parameters of the deep neural network.

Once trained, the network function is expected to perform well on previously unseen data. This property, called generalization, is often assessed using a validation set. A validation set is a dataset sampled from the same distribution as the training set, but disjoint from it. The prediction of the network function on the validation set is a measure of its ability to perform on inputs not used during training.
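To make this layer model concrete, here is a minimal sketch (in NumPy; the names and the two-layer example are ours, not the chapter's) of the generic layer function and of a composite network function obtained by assembling two such layers:

```python
import numpy as np

def layer(x, W, b, sigma):
    """Generic layer f_l: x -> sigma(Wx + b)."""
    return sigma(W @ x + b)

relu = lambda z: np.maximum(z, 0.0)

# A network function f obtained by composing two layers.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((16, 8)), np.zeros(16)
W2, b2 = rng.standard_normal((4, 16)), np.zeros(4)

def f(x):
    return layer(layer(x, W1, b1, relu), W2, b2, relu)

y_hat = f(rng.standard_normal(8))  # decision for an input x
```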
3.2 Sparse Associative Memories

We follow the model introduced in [26]. Let us consider pairs (x_i, y_i)_i that we want to store. Our aim is to learn the mapping x_i ↦ y_i using a neural network. We consider that vectors x_i and y_i have a specific shape. Namely, they are all binary ({0, 1}) vectors. Vectors x_i are such that they can be split into c subvectors, each containing ℓ consecutive coordinates. The resulting subvectors contain exactly one nonzero value. Similarly, vectors y_i can be split into c′ subvectors, each containing ℓ′ consecutive coordinates. Each subvector contains exactly one nonzero value. An example of a vector x_i, with c = 4 and ℓ = 3, is given in Figure 1, together with an associated vector y_i, with c′ = 2 and ℓ′ = 5.

[Fig. 1: Arbitrary example of a mapping between x_i, built with c = 4 and ℓ = 3, and y_i, built with c′ = 2 and ℓ′ = 5.]

In order to store such pairs, a matrix W is built using the following formula:

W = max_i (y_i x_i^⊤),   (1)

where max is applied coefficient-wise, and x_i^⊤ is the transpose of x_i. Let us point out that W is of dimensions c′ℓ′ × cℓ.

Once W has been built, it can be used to try to retrieve y_i from x_i. To this end, we use Algorithm 1:

z = W x_i
for j ∈ {1, . . . , c′} do
    ŷ_i[ℓ′(j − 1) : ℓ′j] = WTA(z[ℓ′(j − 1) : ℓ′j])
end
return ŷ_i

Algorithm 1: Algorithm to retrieve y_i from the corresponding input probe x_i and W. We use Python notations here, where ŷ_i[ℓ′(j − 1) : ℓ′j] denotes the subvector obtained from ŷ_i by taking the coordinates ℓ′(j − 1) (included) to ℓ′j (excluded). The operator WTA is a binary winner-takes-all that outputs a 1 at a coordinate if and only if it corresponds to the maximum value in the corresponding vector (0 otherwise).
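The storage rule (1) and the retrieval procedure of Algorithm 1 are simple enough to be sketched in a few lines of NumPy (a minimal illustration under our own naming; not code from the chapter):

```python
import numpy as np

def store(xs, ys):
    """Build W = max_i (y_i x_i^T), coefficient-wise maximum over all pairs."""
    return np.maximum.reduce([np.outer(y, x) for x, y in zip(xs, ys)])

def retrieve(W, x, c_out, l_out):
    """Algorithm 1: binary winner-takes-all per output subvector of length l_out."""
    z = W @ x
    y_hat = np.zeros_like(z)
    for j in range(c_out):
        block = z[l_out * j : l_out * (j + 1)]
        y_hat[l_out * j + np.argmax(block)] = 1  # one winner per subvector
    return y_hat

# Example with c = 4, l = 3 (inputs) and c' = 2, l' = 5 (outputs).
x = np.array([0,1,0, 1,0,0, 0,0,1, 0,1,0], dtype=float)
y = np.array([0,0,1,0,0, 1,0,0,0,0], dtype=float)
W = store([x], [y])
assert np.array_equal(retrieve(W, x, c_out=2, l_out=5), y)
```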
1) (included) to ℓi (excluded). The operator WTA is a binary winner-takes-all that outputs a1 at coordinate j if and only if it corresponds to the maximum value in thecorresponding vector (0 otherwise).In general, there is no guarantee that the retrieved vector ˆ y i is identical to y i . All depends on the number of pairs used to generate W and to the potentialoverlap between the vectors in ( x i ) i or the vectors in ( y i ) i . For a more in-depthanalysis of this problem, we refer the reader to [29].Very much alike DNNs, it is possible to assemble associative memories tocreate deep architectures, called Deep Hetero-Associative Memories or DHAMin the following, where the output of one is the input of another. Problematically,associative memories are meant to store mappings between pairs of vectors. Assuch, there is very little interest in creating such assemblies, while the scope ofpossible applications remains very limited.In the next section, we show how it is possible to create DHAMs able to solvecomplex machine learning problems, far beyond the case of storing mappingsbetween pairs of vectors. Our proposed methodology is based on the following idea: it is possible to traina deep neural network, while progressively updating its operations so that itultimately functions as a DHAM. Our motivation is threefold:1. Because of the simplicity of the operations used in the context of DHAMs,we expect the resulting architecture to require significantly less operationsto achieve a very similar performance to DNNs.2. As DHAMs operations are simple, it is quite straight-forward to understandthe decisions taken by a specific layer, and to obtain guarantees about thedecision process.3. Because DHAMs internal representations are very sparse, they are strongcandidates for better transferability and robustness to deviations of the in-puts.We shall investigate these points later in the experiments. But for now, let usexplain how we can amend the training process of DNNs so that they ultimatelyfunction as DHAMs. ecisiveNets 7
The main idea is to act on the nonlinear activation function σ that is used at each layer of the considered DNN architecture. Starting with σ, the training process will smoothly and continuously transform it until it eventually becomes a local winner-takes-all operator. To this end, we make use of a temperature t that is scaled exponentially from t_init to t_final at each step of the learning procedure. This temperature acts on a softmax function. More precisely, let us denote by x the input tensor of a layer, and by ŷ its output. We construct the following nonlinear function:

ŷ[ℓ(i − 1) : ℓi] = σ_t(x[ℓ(i − 1) : ℓi]), ∀i, where σ_t(z) = (softmax(t · σ(z)) / max(softmax(t · σ(z)))) ⊙ σ(z),

and ⊙ denotes the coordinate-wise product. Note that instead of ℓ, it is possible to define such a function using c as a parameter, such that cℓ is the total number of dimensions of the tensor x along the considered axis. In our experiments, we found that acting on the feature maps axis when using convolutional neural networks worked best. We also found that it was best to disregard the gradients of the softmax portion of the activation.

At the beginning of the training process, the initial temperature is chosen so that it is close to 0. As a consequence, the softmax operator outputs a constant vector and σ_t = σ. At the end of the training process, the temperature is very high, so that the softmax operator acts as a max operator.

Once the training process is finished, and before using our trained architecture for processing test inputs, we completely remove the softmax operator and consider instead the following nonlinear function:

ŷ[ℓ(i − 1) : ℓi] = σ_WTA(x[ℓ(i − 1) : ℓi]), ∀i, where σ_WTA(z) = (max(σ(z)) == σ(z)) ⊙ σ(z).

This nonlinear function can be interpreted as the limit case of the previous one when the temperature tends to infinity.
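The two activations can be sketched as follows (our own PyTorch rendering of the formulas above, assuming each group of ℓ values lies along the last axis of the tensor): the softmax gate is detached so that its gradients are disregarded, as advocated above, and the temperature follows the exponential schedule from t_init to t_final.

```python
import torch
import torch.nn.functional as F

def sigma_t(z, t, sigma=F.relu):
    """Tempered activation: gate sigma(z) by a normalized softmax of temperature t."""
    a = sigma(z)
    gate = F.softmax(t * a, dim=-1)
    gate = gate / gate.max(dim=-1, keepdim=True).values  # equals 1 everywhere at t ~ 0
    return gate.detach() * a  # gradients flow only through sigma(z)

def sigma_wta(z, sigma=F.relu):
    """Test-time activation: keep only the winning value in each group."""
    a = sigma(z)
    mask = (a == a.max(dim=-1, keepdim=True).values).to(a.dtype)
    return mask * a

def temperature(step, total_steps, t_init=1.0, t_final=1000.0):
    """Exponential schedule from t_init to t_final over the training run."""
    return t_init * (t_final / t_init) ** (step / total_steps)
```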
In the case of convolutional neural networks, the resulting architecture thus implements a competition between feature maps. More precisely, inside a group of ℓ consecutive feature maps, each location (e.g. a pixel in the case of processing images) can activate only one feature map. When assembling the c obtained subtensors, each location is thus summarized as a combination of the c corresponding choices of activated feature maps. The number of possible combinations is thus ℓ^c for each location. In our experiments, we found that choosing small values of ℓ, and thus large values of c, usually led to the best results, which is in accordance with the fact that this choice maximizes the number ℓ^c, and thus the diversity of possible values for each processed input.
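Concretely, the grouped competition along the feature-map axis might be applied as in the following sketch (our own reshaping choices; we assume channels are organized as c consecutive groups of ℓ maps):

```python
import torch

def grouped_wta(a, l):
    """Keep, at each spatial location, only the winning map inside each group of
    l consecutive feature maps. a has shape (N, c*l, H, W)."""
    n, ch, h, w = a.shape
    g = a.view(n, ch // l, l, h, w)                    # (N, c, l, H, W)
    mask = (g == g.max(dim=2, keepdim=True).values)    # one winner per group/location
    return (mask.to(a.dtype) * g).view(n, ch, h, w)

a = torch.relu(torch.randn(1, 8, 4, 4))
out = grouped_wta(a, l=2)  # at most 4 maps (one per group) contribute per location
```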
Another important remark is that the resulting winner-takes-all function differs from that of DHAMs in that the remaining maximum value is kept as is, instead of being set to 1. We tested both possibilities in our experiments, with a slight advantage in keeping the value. We believe that the main reason for this difference lies in the fact that keeping the value allows for considering the case where a group of ℓ feature maps shows no clear winner.

In the limit case where ℓ = 1, we retrieve the function σ, thus encompassing the untouched reference DNN architecture. On the contrary, when ℓ is maximum, we can only select one feature map for each location in the treated inputs. More generally, varying ℓ from 1 to its maximum value has the effect of reducing the number of possible combinations of feature maps, together with increasing sparsity and reducing computations. As a matter of fact, when processing the next layer, a larger value of ℓ causes more values to be nullified, and thus considerably reduces the number of multiplications to compute.

In the following section, we present our experiments and discuss the obtained results.

5 Experiments

For our experiments we make use of the CIFAR-10 and CIFAR-100 datasets. These two datasets are made for image classification purposes, and have the double advantage of being relatively small, so that simulation time remains acceptable, and being challenging enough to be considered for benchmarking purposes. CIFAR-10 comes with 10 classes and 5,000 examples per class for training, while CIFAR-100 comes with 100 classes and 500 examples per class for training. Both datasets are made of RGB images of 32x32 pixels.

Architecture-wise, we make use of Resnet-18 [30] for CIFAR-10 and Resnet-50 [30] for CIFAR-100. We motivate this choice by the fact that these architectures exhibit a very interesting trade-off between complexity and accuracy [3].

During training, we divide the learning rate by 10 twice, at epochs 100 and 200. We use a total of 300 epochs for training the architectures, with batches of size 128. We use standard data augmentation techniques comprising random cropping and horizontal flipping. We report the average accuracy over 5 experiments for CIFAR-10, while for CIFAR-100 we only run one experiment due to timing constraints.

5.1 Varying ℓ

In a first series of experiments, we vary the parameter ℓ. To considerably reduce the combinatorial space of possibilities in our architectures, we only considered the case of using the same value of ℓ for all layers, despite our belief that this might be suboptimal. We report the obtained results in Table 1. We also report the number of multiplications required to process a single input. As expected, the method exhibits a trade-off between number of multiplications and accuracy.

We recall that ℓ = 1 corresponds to the untouched baseline DNN architecture. Interestingly, we observe that with both considered datasets, the proposed methodology was able to improve the accuracy of the system, while considerably reducing the number of multiplications.

Table 1: Accuracy of considered architectures while varying ℓ. The number of multiplications for processing an input is also reported.

      Resnet18 and CIFAR-10              Resnet50 and CIFAR-100
ℓ     accuracy    multiplications        accuracy    multiplications
1     95.21%      5070848                78.50%      1297809408

5.2 Varying c

Instead of varying ℓ, it is possible to vary the parameter c. The difference is the following: in Resnet architectures, the number of feature maps is regularly scaled up while progressing deeper in the architecture.
When ℓ is fixed, the effect is to increase the number of subvectors considered in deeper layers. When c is fixed, the effect is to increase the length of the considered subvectors in deeper layers. The obtained results are presented in Table 2. Unsurprisingly, the results we obtained using c were less interesting than when using ℓ. The reason is that when forcing ℓ = 2 we obtain the maximum number of possible combinations (which we cannot obtain while varying the c parameter).

5.3 Influence of the temperature

In the previous experiments, we purposely did not report the choice of the initial and final temperatures. The results we indicated were obtained using a grid search for the best possible temperature parameters. In this subsection, we report typical results we found for this search. These are summarized in Table 3. Interestingly, we observe very little dependence on the choice of these parameters, as long as the final temperature is high enough to ensure a smooth transition towards the winner-takes-all nonlinear activation function.
Table 2: Accuracy of considered architectures while varying c.

c     Resnet18 and CIFAR-10    Resnet50 and CIFAR-100
1     59.73%                   13.39%
2     70.69%                   26.34%
4     81.04%                   41.84%
8     87.38%                   55.27%
16    91.09%                   62.01%
32    94.00%                   65.89%
64
5.4 Few-shot transfer

In the next experiment, we aim at assessing the potential of the proposed methodology for improving transfer performance. We use for this purpose the CIFAR-FS dataset, which is built from CIFAR-100 by splitting the classes into three groups. The first group, made of 64 classes, is used for training a CNN backbone architecture (in our case transformed into a DHAM). A second group of 16 classes is meant for validation purposes; we disregard it in this experiment. The backbone architecture is then used to extract features from inputs belonging to 5 unseen classes chosen uniformly at random among the 20 remaining ones. We dispose of 5 examples per class to perform prediction, and evaluate the performance on the remaining 595 vectors per class. To do so, we rely on a simple nearest class mean classifier, known to reach top performance for this type of problem [31]. We perform an average over 10,000 runs obtained by varying the choice of the 5 classes among 20 and the choice of the 5 supervised examples per class. Results are summarized in Table 4. Interestingly, with this experiment we can clearly see that the performance obtained using DHAMs is significantly better than that of the considered baseline.
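The nearest class mean classifier used here is simple enough to sketch (our own minimal version in the spirit of [31]): each class is represented by the average of its 5 support features, and each query is assigned to the closest class mean.

```python
import numpy as np

def ncm_predict(support, queries):
    """support: dict class -> (k, d) array of features; queries: (m, d) array.
    Returns the predicted class label for each query."""
    labels = sorted(support)
    means = np.stack([support[c].mean(axis=0) for c in labels])  # (n_classes, d)
    # Euclidean distance from every query to every class mean.
    d = np.linalg.norm(queries[:, None, :] - means[None, :, :], axis=-1)
    return [labels[i] for i in d.argmin(axis=1)]
```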
Table 3: Accuracy of considered architectures while varying the initial and final temperatures and for various choices of ℓ.

ℓ    t_init    t_final    Resnet18/CIFAR-10    Resnet50/CIFAR-100
2    0.1       100        94.56%               74.19%
2    1         100        95.01%               78.17%
2    10        100        95.24%               79.30%
2    0.1       1000       94.81%               77.93%
2    1         1000       95.25%               78.81%
2    10        1000       95.25%               79.23%
2    0.1       10000      95.07%               78.28%
2    1         10000      95.11%               78.89%
2    10        10000      95.13%               78.88%
4    0.1       100        90.35%               53.00%
4    1         100        93.65%               66.27%
4    10        100        94.65%               76.80%
4    0.1       1000       94.02%               74.61%
4    1         1000       94.72%               76.56%
4    10        1000       94.65%               76.58%
4    0.1       10000      94.49%               76.16%
4    1         10000      94.74%               76.73%
4    10        10000      94.73%               76.89%

5.5 Robustness to noise

As mentioned in our motivations, by replacing the usual operations of DNNs with the simple local winner-takes-all of DHAMs, we expect the architectures to better accommodate small variations of their inputs. As a matter of fact, the selection of a maximum has the effect of removing a large portion of the small contributions due to noise, as long as they do not provoke a change in the winner selection. In the next experiment, we stress the ability of our proposed DHAMs to better accommodate various additive noises. We consider the noises described in [32], that is to say: Gaussian, Shot and Impulse noises. They are considered with five levels of severity each. We report in Table 5 the obtained average accuracy under this type of perturbation. We observe that DHAMs are able to considerably reduce the impact of noise, depending on the choice of ℓ. Interestingly, larger values of ℓ can prove more robust to such deviations, despite losing more on the clean test set performance.
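A minimal way to probe this kind of robustness is sketched below (our own illustration; [32] provides calibrated corruptions, while here we simply add Gaussian noise of increasing severity, with severity levels of our choosing):

```python
import torch

def accuracy_under_noise(model, images, labels,
                         sigmas=(0.02, 0.04, 0.08, 0.16, 0.32)):
    """Average accuracy over five severities of additive Gaussian noise."""
    accs = []
    with torch.no_grad():
        for s in sigmas:
            noisy = (images + s * torch.randn_like(images)).clamp(0, 1)
            preds = model(noisy).argmax(dim=1)
            accs.append((preds == labels).float().mean().item())
    return sum(accs) / len(accs)
```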
5.6 Visualizing feature map usage

As we mentioned in the previous section, the main interpretation of the proposed methodology is that it creates competition between feature maps (when the softmax is applied along this dimension of the processed tensors). In terms of entropy, it would be ideal for the gradient descent to make sure that each feature map is used approximately as often as any other. In order to visualize this, we depict in Figure 2 and Figure 3 the proportion of wins of activations of feature maps of arbitrarily selected layers in a trained Resnet50 architecture when processing test images. Note that the proportion of wins of a given feature map represents the sum of the binary version of its pixels obtained using WTA. We observe that, contrary to our expectations, there is a strong unbalance in the selection of feature maps. We believe feature maps that are highly selective are likely to better reflect class-specific features, resulting in better overall classification. In future work, it would be interesting to consider introducing feature-map-based pruning criteria based on how strong this unbalance is.

We perform a second experiment in which we aim at comparing the proportion of wins of feature maps when processing inputs from different classes. To do so, we consider inputs from the first and second classes of CIFAR-100 and report the proportion of wins of feature maps of an arbitrarily selected layer in a trained Resnet50.
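This statistic can be computed directly from the binary winner-takes-all maps (a sketch, with our own shape conventions):

```python
import torch

def win_proportions(a, l):
    """Proportion of spatial locations won by each feature map within its group.
    a: activations of shape (N, c*l, H, W) after the nonlinearity sigma."""
    n, ch, h, w = a.shape
    g = a.view(n, ch // l, l, h, w)
    wins = (g == g.max(dim=2, keepdim=True).values).float()  # binary WTA maps
    return wins.mean(dim=(0, 3, 4))  # shape (c, l): win rate of each map per group
```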
Table 4: Accuracy of DHAMs when performing few-shot transfer classification on the CIFAR-FS dataset. The confidence interval at 95% is indicated between parentheses.

ℓ     Resnet18           Resnet50
1     82.74 (± 0.13)     83.94
      (± 0.13)           84.83
      (± 0.14)           82.43
      (± 0.15)           77.93
      (± 0.16)           76.40
      (± 0.17)           70.15

Table 5: Accuracy of DHAMs subject to various types of noise (architecture: Resnet18).

ℓ     clean data    Gaussian noise    Shot noise    Impulse noise
1     95.21%        46.40%            59.50%        51.75%
2
64    78.28%        48.72%            55.60%        40.03%
Figure 4 compares the proportion of wins of the same feature maps when considering two different classes, and shows that this proportion is almost the same in the first layer, somewhat unbalanced in the 21st layer, and considerably unbalanced in the 48th (last) layer. Consequently, and as expected, the first layers show almost the same proportion of wins of feature maps, since they are generic layers, while deeper layers show an important difference in the proportion of wins, since they are more specific.
[Fig. 2: Proportion of wins for classes 1 and 2 in the 12 first subvectors of the first layer of Resnet50 trained on CIFAR-100 when ℓ = 2.]

[Fig. 3: Proportion of wins in the 4 first subvectors of the first layer of Resnet50 trained on CIFAR-100 when ℓ = 8.]

[Fig. 4: Mean amount of wins in the 12 first feature maps of the 1st, 21st and 48th layers of Resnet50 for classes 1 and 2, trained on CIFAR-100 when ℓ = 2.]

6 Conclusion

In this chapter we introduced a methodology to progressively transform classical deep neural networks, able to solve complex tasks such as the classification of images, into deep hetero-associative memories, with much lighter and simpler operators. We showed via experiments that this conversion can be performed at no cost in accuracy, while drastically reducing the number of multiplications. The experiments also show that we are able to improve the performance in transfer learning and the robustness towards deviations of the inputs.

The proposed methodology allows us to train sparse hetero-associative memories to solve any complex problem that can be solved using DNNs, despite the nondifferentiability of their main operators. For this reason, we believe that it is a promising direction of research for expanding the potential of these systems, even in very competitive engineering domains.

In future work, we would like to better understand the decisions taken by DHAMs using visualization tools. We also think it would be desirable to implement such systems on specific hardware, demonstrating the full potential of this solution.
References
1. P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, "Deep double descent: Where bigger models and more data hurt," arXiv preprint arXiv:1912.02292, 2019.
2. G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
3. G. B. Hacene, "Processing and learning deep neural networks on chip," Ph.D. dissertation, École nationale supérieure Mines-Télécom Atlantique, 2019.
4. S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
5. Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2736–2744.
6. H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," arXiv preprint arXiv:1608.08710, 2016.
7. Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," arXiv preprint arXiv:1808.06866, 2018.
8. R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, "NISP: Pruning networks using neuron importance score propagation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9194–9203.
9. K. Yamamoto and K. Maeno, "PCAS: Pruning channels with attention statistics," arXiv preprint arXiv:1806.05382, 2018.
10. R. K. Ramakrishnan, E. Sari, and V. P. Nia, "Differentiable mask for pruning convolutional and recurrent networks," IEEE, 2020, pp. 222–229.
11. Y. He, Y. Ding, P. Liu, L. Zhu, H. Zhang, and Y. Yang, "Learning filter pruning criteria for deep convolutional neural networks acceleration," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2009–2018.
12. G. B. Hacene, C. Lassance, V. Gripon, M. Courbariaux, and Y. Bengio, "Attention based pruning for shift networks," arXiv preprint arXiv:1905.12300, 2019.
13. G. B. Hacene, V. Gripon, M. Arzel, N. Farrugia, and Y. Bengio, "Quantized guided pruning for efficient hardware implementations of convolutional neural networks," arXiv preprint arXiv:1812.11337, 2018.
14. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.
15. M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
16. Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," arXiv preprint arXiv:1510.03009, 2015.
17. F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv preprint arXiv:1605.04711, 2016.
18. S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
19. S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, "Learned step size quantization," arXiv preprint arXiv:1902.08153, 2019.
20. M. Nikolić, G. B. Hacene, C. Bannon, A. D. Lascorz, M. Courbariaux, Y. Bengio, V. Gripon, and A. Moshovos, "BitPruning: Learning bitlengths for aggressive and accurate quantization," arXiv preprint arXiv:2002.03090, 2020.
21. E. Wang, J. J. Davis, P. Y. Cheung, and G. A. Constantinides, "LUTNet: Rethinking inference in FPGA soft logic," IEEE, 2019, pp. 26–34.
22. J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
23. R. McEliece, E. Posner, E. Rodemich, and S. Venkatesh, "The capacity of the Hopfield associative memory," IEEE Transactions on Information Theory, vol. 33, no. 4, pp. 461–482, 1987.
24. S.-I. Amari, "Characteristics of sparsely encoded associative memory," Neural Networks, vol. 2, no. 6, pp. 451–457, 1989.
25. D. J. Willshaw, O. P. Buneman, and H. C. Longuet-Higgins, "Non-holographic associative memory," Nature, vol. 222, no. 5197, pp. 960–962, 1969.
26. V. Gripon and C. Berrou, "Sparse neural networks with large learning diversity," IEEE Transactions on Neural Networks, vol. 22, no. 7, pp. 1087–1096, 2011.
27. V. Gripon, J. Heusel, M. Löwe, and F. Vermet, "A comparative study of sparse associative memories," Journal of Statistical Physics, vol. 164, no. 1, pp. 105–129, 2016.
28. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
29. V. Gripon, J. Heusel, M. Löwe, and F. Vermet, "A comparative study of sparse associative memories," Journal of Statistical Physics, vol. 164, pp. 105–129, 2016.
30. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
31. Y. Wang, W.-L. Chao, K. Q. Weinberger, and L. van der Maaten, "SimpleShot: Revisiting nearest-neighbor classification for few-shot learning," arXiv preprint arXiv:1911.04623, 2019.
32. D. Hendrycks and T. Dietterich, "Benchmarking neural network robustness to common corruptions and perturbations," arXiv preprint arXiv:1903.12261, 2019.