Neural Similarity Learning
Weiyang Liu
Zhen Liu
James M. Rehg, Le Song
Georgia Institute of Technology · Mila, Université de Montréal · Ant Financial
* Equal Contribution
Abstract
Inner product-based convolution has been the founding stone of convolutional neural networks (CNNs), enabling end-to-end learning of visual representation. By generalizing the inner product with a bilinear matrix, we propose the neural similarity, which serves as a learnable parametric similarity measure for CNNs. Neural similarity naturally generalizes the convolution and enhances flexibility. Further, we consider neural similarity learning (NSL) in order to learn the neural similarity adaptively from training data. Specifically, we propose two different ways of learning the neural similarity: static NSL and dynamic NSL. Interestingly, dynamic neural similarity makes the CNN become a dynamic inference network. By regularizing the bilinear matrix, NSL can be viewed as learning the shape of the kernel and the similarity measure simultaneously. We further justify the effectiveness of NSL from a theoretical viewpoint. Most importantly, NSL shows promising performance in visual recognition and few-shot learning, validating the superiority of NSL over its inner product-based convolution counterparts.
Recent years have witnessed the unprecedented success of convolutional neural networks (CNNs) in supervised learning tasks such as image recognition [20], object detection [47], semantic segmentation [40], etc. As the core of a CNN, a standard convolution operator typically contains two components: a learnable template (i.e., kernel) and a similarity measure (i.e., inner product). One active stream of works [13, 63, 25, 61, 8, 53, 24, 59, 26] aims to improve the flexibility of the convolution kernel and increase its receptive field in a data-driven way. Another stream of works [39, 36] focuses on finding a better similarity measure to replace the inner product. However, there still lacks a unified formulation that takes both the shape of the kernel and the similarity measure into consideration.
Figure 1: Bipartite graph comparison of inner product, static neural similarity and dynamic neural similarity. A line represents a multiplication operation and a circle denotes an element in a vector. Green denotes the kernel and yellow denotes the input.
To bridge this gap, we propose neural similarity learning (NSL) for CNNs. NSL first defines the neural similarity by generalizing the inner product with a parametric bilinear matrix and then learns the neural similarity jointly with the convolution kernels. A graphical comparison between inner product and neural similarity is given in Figure 1. With certain regularities on the neural similarity, NSL can be viewed as learning the shape of the kernel and the similarity measure simultaneously. Based on the neural similarity, we propose the neural similarity network (NSN) by stacking convolution layers with neural similarity. We consider two distinct ways to learn the neural similarity in a CNN. First, we learn a static neural similarity, which is essentially a (regularized) bilinear similarity. By having more parameters, the static neural similarity becomes a natural generalization of the standard inner product. Second, and more interestingly, we also consider learning the neural similarity in a dynamic fashion. Specifically, we use an additional neural network module to learn the neural similarity adaptively from the input images. This module is jointly optimized with the CNN via back-propagation. Using the dynamic neural similarity, the CNN becomes a dynamic neural network, because the equivalent weights of the neuron are input-dependent. In a high-level sense, CNNs with dynamic neural similarity share the same spirit as HyperNetworks [18] and dynamic filter networks [28].

A key motivation behind NSL lies in the fact that inner product-based similarity is unlikely to be optimal for every task. Learning the similarity measure adaptively from data can be beneficial in different tasks. A hidden layer with dynamic neural similarity can be viewed as a quadratic function of the input, while a standard hidden layer is a linear function of the input. Therefore, dynamic neural similarity introduces more flexibility from the function approximation perspective.

NSL aims to construct a flexible CNN with strong generalization ability, and we can control the flexibility by imposing different regularizations on the bilinear similarity matrix. In this paper, we mostly consider the block-diagonal matrix with shared blocks as the bilinear similarity matrix in order to reduce the number of parameters. In different applications, we will usually impose domain-specific regularizations. By properly regularizing the bilinear similarity matrix, NSL is able to make better use of the parameters than standard convolutional learning and find a good trade-off between generalization ability and representation flexibility.

NSL is closely connected to a surprising theoretical result in [16]: optimizing an underdetermined quadratic objective over a matrix W with gradient descent on a factorization of this matrix leads to an implicit regularization of the solution (minimum nuclear norm). A more recent theoretical result in [5] further shows that gradient descent for deep matrix factorization tends to give low-rank solutions. Since NSL can be viewed as a form of factorization over the convolution kernel, we argue that such factorization also yields some implicit regularization in gradient-based optimization, which may lead to a more generalizable inductive bias. We give more theoretical insights later in the paper.

While showing strong generalization ability in generic visual recognition, NSL is also very effective for few-shot learning due to its better flexibility.
Compared to initialization-based methods [14, 46], NSL can naturally make full use of the pretrained model for few-shot learning. Specifically, we propose three different learning strategies to perform few-shot recognition. Besides applying both static and dynamic NSL to few-shot recognition, we further propose to meta-learn the neural similarity. Specifically, we adopt model-agnostic meta-learning [14] to learn the bilinear similarity matrix. Using this strategy, NSL can benefit from the generalization ability of both the pretrained model and the meta information [14]. Our results show that NSL can effectively improve few-shot recognition by a considerable margin.

Our main contributions can be summarized as follows:
• We propose the neural similarity, which generalizes the inner product via a bilinear similarity. Furthermore, we derive the neural similarity network by stacking convolution layers with neural similarity. Although this paper mostly discusses CNNs, we note that NSL can easily be applied to fully connected networks and recurrent networks.
• We propose both static and dynamic learning strategies for the neural similarity. In order to overcome the convergence difficulty of dynamic neural similarity, we propose hyperspherical learning [39] with identity residuals to stabilize the training.
• We apply neural similarity learning to generic visual recognition and few-shot image recognition. For few-shot learning, we propose novel usages of NSL and significantly improve the current few-shot learning performance.
Flexible convolution. Dilated (atrous) convolution [61, 8] has been proposed in order to construct a convolution kernel with a large receptive field for semantic segmentation. [13, 25] improve the convolution kernel for high-level vision tasks by making the shape of the kernel learnable and deformable. [39, 36] provide a decoupled view to understand the similarity measure and propose some alternative (learnable) similarity measures. Such decoupled similarity is shown to be useful for improving network generalization and adversarial robustness.
Dynamic neural networks. Dynamic neural networks have input-dependent neurons, which makes the network adapt in order to deal with different inputs. HyperNetworks [18] use a recurrent network to dynamically generate weights for another recurrent network, such that the weights can vary across many timesteps. Dynamic filter networks [28] generate filters that are dynamically conditioned on the input. These dynamic neural networks usually perform poorly in image recognition tasks and cannot make use of any pretrained models. In contrast, the dynamic NSN performs consistently better than its CNN counterpart and is able to take advantage of pretrained models for few-shot learning. [11] investigates input-dependent networks by dynamically selecting filters, while NSN uses a totally different approach to achieve dynamic inference.
Meta-learning. A classic approach [7, 50] to meta-learning is to train a meta-learner that learns to update the parameters of the learner's model. This approach has been adopted to learn deep networks [1, 32, 43, 51]. Recently, a series of works [46, 14] address the meta-learning problem by learning a good network initialization. Specifically for few-shot learning, there are initialization-based methods [43, 46, 14, 10], hallucination-based methods [57, 19, 2] and metric learning-based methods [55, 52, 54]. Besides having a very different formulation from previous works, NSL also combines the advantages of the initialization-based methods with the generalization ability of the pretrained model.
We denote a convolution kernel of size C × H × V (C for the number of channels, H for the height and V for the width) as $\tilde{W}$. We flatten the kernel in each channel separately and then concatenate the results into a vector $W = \{\tilde{W}^F_{1,:,:}, \tilde{W}^F_{2,:,:}, \cdots, \tilde{W}^F_{C,:,:}\} \in \mathbb{R}^{CHV}$, where $\tilde{W}^F_{i,:,:}$ denotes the flattened kernel weights of the $i$-th channel. Similarly, we denote an input patch of the same size C × H × V as $\tilde{X}$, and its flattened version as $X$. A standard convolution operator uses the inner product $W^\top X$ to compute the output feature map in a sliding window fashion. Instead of using the inner product to compute the similarity, we generalize the convolution with a bilinear similarity matrix:

$$f_M(W, X) = W^\top M X \qquad (1)$$

where $M \in \mathbb{R}^{CHV \times CHV}$ denotes the bilinear similarity matrix and is used to parameterize the similarity measure. In fact, if we require $M$ to be a symmetric positive semi-definite matrix, it shares some similarities with distance metric learning [60]. Although we do not necessarily need to constrain the matrix $M$, we will still impose some structural constraints on $M$ in order to stabilize the training and save parameters in practice. To avoid introducing too many parameters in the generalized convolution operator, we make the bilinear similarity matrix $M$ block-diagonal with shared blocks (there are C blocks in total):

$$f_M(W, X) = W^\top \begin{pmatrix} M_s & & \\ & \ddots & \\ & & M_s \end{pmatrix} X \qquad (2)$$

where $M = \mathrm{diag}(M_s, \cdots, M_s)$ and $M_s$ is of size HV × HV. Interestingly, the hyperspherical convolution [39] becomes a special case of this bilinear formulation when $M$ is a diagonal matrix with the normalizing factor $\frac{1}{\|W\|\|X\|}$ on the diagonal. Since additional parameters are introduced to control the similarity measure, we are able to learn a similarity measure directly from data (i.e., static neural similarity) or learn a neural predictor that estimates such a similarity matrix from the input feature map (i.e., dynamic neural similarity). In this paper, we mainly consider two structures for $M_s$.

Diagonal/Unconstrained neural similarity. If we require $M_s$ to be a diagonal matrix, we end up with the diagonal neural similarity (DNS). DNS is very parameter-efficient and can be viewed as a weighted inner product or an element-wise attention. Besides that, DNS essentially puts an additional spatial mask over the feature map, so it is semantically meaningful. If no constraint is imposed on $M_s$, we obtain the unconstrained neural similarity (UNS), which is very flexible but requires many more parameters.

We first introduce a static learning strategy for the neural similarity. Specifically, we learn the matrix $M_s$ jointly with the convolution kernel via back-propagation. An intuitive overview of static neural similarity is given in Figure 2(a). Once $M_s$ has been jointly learned after training, it stays fixed in the inference stage. More interestingly, since Equation (1) shows that the neural similarity enters the convolution operator via a linear multiplication, we can compute equivalent weights for the kernel in advance if the neural similarity is static. Therefore, we can view the new kernel as $M^\top W$. As a result, when it comes to deployment in practice, the number of parameters used in static NSN is the same as the CNN baseline and the inference speed is also the same.
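As a concrete illustration, the following PyTorch-style sketch (a minimal example of our own, not the authors' released implementation; the module and variable names are our convention) builds a convolution layer with a static, shared block-diagonal neural similarity $M_s$ and folds it into the equivalent kernel $M^\top W$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticNSConv2d(nn.Module):
    """Convolution with a static neural similarity f_M(W, X) = W^T M X,
    where M = diag(M_s, ..., M_s) is block-diagonal with a shared HV x HV block."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.01)
        hv = kernel_size * kernel_size
        # Shared block M_s, initialized to identity so training starts from a plain conv.
        self.M_s = nn.Parameter(torch.eye(hv))

    def equivalent_weight(self):
        # Apply M_s to the spatial dimension of every channel: new kernel = M^T W.
        out_c, in_c, k, _ = self.weight.shape
        w = self.weight.view(out_c, in_c, k * k)       # (out, in, HV)
        w = torch.einsum('oih,hg->oig', w, self.M_s)   # multiply by M_s^T per channel
        return w.view(out_c, in_c, k, k)

    def forward(self, x):
        # At deployment the equivalent kernel can be precomputed once and cached.
        return F.conv2d(x, self.equivalent_weight(), padding=self.weight.shape[-1] // 2)

# Usage: drop-in replacement for nn.Conv2d(16, 32, 3) in a CNN.
layer = StaticNSConv2d(16, 32, 3)
y = layer(torch.randn(2, 16, 8, 8))   # -> (2, 32, 8, 8)
```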
Figure 2: Intuitive comparison between static neural similarity and dynamic neural similarity. (a) Static neural similarity. (b) Dynamic neural similarity.
Learning a static neural similarity can be viewed as a factorized learning of neurons. It also shares a lot of similarities with matrix factorization in the sense that the equivalent neuron weight $\hat{W}$ is factorized into two matrices, $M^\top$ and $W$. Although the original weights and the factorized weights are mathematically equivalent, they have different behaviors and properties during gradient-based optimization [16]. Recent theories [16, 5, 33] suggest that an implicit regularization may encourage gradient-based matrix factorization to give minimum nuclear norm or low-rank solutions. Besides that, we also have structural constraints to explicitly regularize the matrix $M$. Furthermore, we can view this static neural similarity convolution as a one-hidden-layer linear network; it has been shown that such over-parameterization can be beneficial to generalization [29, 3, 44, 4].

Besides the static neural similarity, we further propose to learn the neural similarity dynamically. The intuition is that the similarity measure should adapt to the input in order to achieve optimal flexibility. From a cognitive science perspective, it is also plausible to equip the network with dynamic inference [56, 31]. The difference between static and dynamic neural similarity is shown in Figure 2. Specifically, the dynamic neural similarity is generated by an additional neural network $M_\theta(\cdot)$ with parameters $\theta$, namely $M_s = M_\theta(X)$. As a result, learning a dynamic neural similarity jointly with the network parameters amounts to solving the following optimization problem (without loss of generality, we use one neuron as an example):

$$\{W, \theta\} = \arg\min_{\{W, \theta\}} \sum_i \mathcal{L}\big(y_i, W^\top M_\theta(X_i) X_i\big) \qquad (3)$$

where $y_i$ is the ground truth value for $X_i$, and $\mathcal{L}$ is some loss function. Both $W$ and $\theta$ can be learned end-to-end using back-propagation. Note that, although $X_i$ denotes the entire sample here, $X_i$ becomes the local patch of the input feature map in CNNs. For simplicity, we consider a one-neuron fully connected layer instead of a convolution layer. Due to the dynamic neural similarity, the equivalent weights $M_\theta(X)^\top W$ become a function of the input $X$ and therefore construct a dynamic neural network. In fact, dynamic networks that generate the neuron weights entirely with an additional neural network have poor generalization ability in recognition tasks [18]. In contrast, our dynamic NSN achieves a delicate balance between generalization and flexibility by using neuron weights that are "semi-generated" (i.e., part of the weights are statically and directly learned from supervision, while the neural similarity matrix is generated dynamically from the input). Interestingly, we notice that hyperspherical convolution [39] can be viewed as a special case of dynamic neural similarity: its equivalent similarity matrix $M_\theta(X) = \mathrm{diag}(\frac{1}{\|W\|\|X\|}, \cdots, \frac{1}{\|W\|\|X\|})$ also depends on the input feature map but does not have any parameter $\theta$.

Hyperspherical learning with identity residuals. In our experiments, we find that naively using a neural network to predict the neural similarity is very unstable during training, leading to difficulty in convergence (it requires a lot of tricks to converge).
To address the training stability problem, we propose hyperspherical networks (SphereNet) [39] with identity residuals to serve as the neural similarity predictor. The convergence stability of hyperspherical learning over standard neural networks is discussed in [39, 37, 38, 36, 35, 34]. In order to further stabilize the training, we learn the residual of an identity similarity matrix instead of directly learning the entire similarity matrix. Formally, the neural similarity predictor is written as $M_\theta(X) = \mathrm{SphereNet}(X; \theta) + I$, where $I$ is an identity matrix and $\mathrm{SphereNet}(X; \theta)$ denotes the hyperspherical network with parameters $\theta$ and input $X$. To save parameters, we can use hyperspherical convolutional networks instead of hyperspherical fully-connected networks. One advantage of SphereNet is that each element of its output is bounded between −1 and 1 ([0, 1] if using ReLU), making the similarity matrix bounded and well behaved. In contrast, the output of a standard neural network is unbounded, easily making some values of the similarity matrix dominantly large. Most importantly, SphereNet with identity residuals empirically yields not only more stable convergence but also stronger generalization.

Disjoint and Shared Parameterization in the Neural Similarity Predictor. We mainly consider disjoint and shared parameterizations for the dynamic neural similarity predictor.
Figure 3: Comparison between disjoint and shared parameterization for the dynamic neural similarity predictor. (a) Disjoint parameterization. (b) Shared parameterization.
Disjoint parameterization. Disjoint parameterization treats every dynamic neural similarity independently. For each convolution kernel (i.e., neuron), we use a disjoint neural network to predict the neural similarity matrix $M_s$. A brief overview is given in Figure 3(a).

Shared parameterization. Assuming that there exists an intrinsic structure for predicting the neural similarity from the input, we consider a shared neural network that produces the neural similarity matrix for different convolution kernels (usually convolution kernels of the same size). To address the dimension mismatch problem of the input feature map, we adopt an adaptation network (e.g., a convolutional or fully-connected network) to first transform the inputs to the same dimension. Note that these adaptation networks are not shared across different kernels in general, but we can share adaptation networks for input feature maps of the same size. An intuitive comparison between disjoint and shared parameterization is given in Figure 3 (Conv1 and Conv2 denote different convolution kernels). By sharing the neural similarity prediction networks across different kernels, the total number of parameters can be significantly reduced. Most importantly, this shared neural similarity network may be able to learn some meta-knowledge about the neural similarity.
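To make the dynamic predictor concrete, here is a minimal PyTorch-style sketch (our own illustrative code, not the released implementation) of a diagonal dynamic neural similarity for a single kernel: a small predictor maps the local patch to the diagonal of $M_s$ and an identity residual is added. For simplicity the predictor only normalizes its input and uses a bounded activation, rather than implementing a full SphereConv layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicDiagonalNS(nn.Module):
    """One-neuron dynamic neural similarity: output = W^T M_theta(X) X,
    with a diagonal M_theta(X) predicted from the (flattened) input patch."""
    def __init__(self, patch_dim, hidden_dim=64):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(patch_dim) * 0.01)  # kernel W
        # Predictor for the diagonal of M_s; the input is normalized and the output
        # bounded (tanh) to mimic the well-behaved outputs of hyperspherical layers.
        self.predictor = nn.Sequential(
            nn.Linear(patch_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, patch_dim), nn.Tanh())

    def forward(self, x):
        # x: (batch, patch_dim) flattened local patches
        x_normed = F.normalize(x, dim=1)        # stabilizes the predictor input
        diag = self.predictor(x_normed) + 1.0   # identity residual: M = I + residual
        return (x * diag) @ self.weight         # W^T diag(M) x, per sample

neuron = DynamicDiagonalNS(patch_dim=3 * 3 * 16)
out = neuron(torch.randn(8, 3 * 3 * 16))        # -> (8,)
```

A shared parameterization would reuse one such predictor (plus per-size adaptation layers) across all kernels of the same size, while a disjoint parameterization would instantiate one predictor per kernel.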
One of the biggest advantages of the neural similarity formulation is that one can impose suitable regularizations on the neural similarity matrix $M$ for different tasks. It gives us a way to incorporate prior knowledge and problem understanding into the neural network. The regularization on $M$ controls the flexibility of the neural similarity. If we impose no constraints on $M$, it will have far too many parameters; although it may be flexible enough, the generalization is not necessarily good. Instead, we usually need to impose some constraints (e.g., block-diagonal with shared blocks, diagonal, etc.) in order to save parameters and improve generalization.

Structural regularization. As a typical example, requiring $M$ to be a block-diagonal matrix with shared blocks is a strong structural regularization. Dilated convolution can be viewed as both a structural and a sparsity regularization on $M_s$. In fact, more advanced structural regularizations could be considered. For instance, requiring $M$ to be a symmetric or symmetric positive semi-definite matrix is also feasible (by using a Cholesky factorization $M = LL^\top$ where $L$ is a learnable lower triangular matrix) and can largely limit the learnable class of similarity measures. Most importantly, structural regularizations may bring more geometric and semantic interpretability.

Sparsity regularization. Soft sparsity regularization on the matrix $M_s$ can be enforced via an $\ell_1$-norm penalty. One can also impose a hard sparsity constraint to limit the number of non-zero values in $M_s$, similar to [42]. It is also appealing to enforce a sparsity-one pattern on $M_s$, because it can construct efficient neural networks based on the shift operation in [59].

NSL is also a unified framework for jointly learning the kernel shape and the similarity measure. If we further factorize $M_s$ into the multiplication of a diagonal Boolean matrix $D$ and a similarity matrix $R$, then the neural similarity can be parameterized as

$$f_M(W, X) = W^\top \begin{pmatrix} DR & & \\ & \ddots & \\ & & DR \end{pmatrix} X = W^\top \cdot \underbrace{\begin{pmatrix} D & & \\ & \ddots & \\ & & D \end{pmatrix}}_{\text{Kernel Shape}} \cdot \underbrace{\begin{pmatrix} R & & \\ & \ddots & \\ & & R \end{pmatrix}}_{\text{Similarity Measure}} \cdot X \qquad (4)$$

where $D = \mathrm{diag}(d_1, \cdots, d_{HV})$ in which $d_i \in \{0, 1\}, \forall i$ is a Boolean value. $D$ controls the shape of the kernel because it spatially masks out some elements of the kernel. Specifically, because the diagonal of $D$ is binary, some elements of $M_s$ become zero and therefore the kernel shape is controlled by $D$. On the other hand, $R$ still serves as the neural similarity matrix, similar to the previous $M_s$. $D$ can also be viewed as masking out some elements of each column in $R$. Interestingly, if we do not require the diagonal of $D$ to be Boolean, it becomes a continuous spatial mask for the kernel shape.

Optimization. First of all, we only consider $D$ to be static in both static and dynamic NSN. The optimization of $D$ is non-trivial, because it is a Boolean matrix which is discrete and cannot be optimized directly using gradients. Therefore, we use a heuristic approach to optimize $D$. Specifically, we maintain a real-valued matrix $D_r$ which is used to construct the Boolean matrix $D$. We define $D = \mathbb{I}(D_r, \alpha)$, where $\mathbb{I}(v, \alpha)$ is an element-wise function that outputs 1 if $v > \alpha$ and 0 otherwise, and $\alpha$ is a fixed threshold. We update $D_r$ with the following equation:

$$\{D_r\}_{t+1} = \{D_r\}_t - \eta \frac{\partial \mathcal{L}}{\partial D} \qquad (5)$$

where $D_r$ is only maintained in order to update $D$.
In both forward and backward passes, only $D$ is used for computation, while $D_r$ is used to generate $D$. Essentially, the gradient w.r.t. $D$ serves as a noisy gradient for $D_r$. A similar optimization strategy has also been employed in [22, 12, 42]. $R$ is updated end-to-end using back-propagation. It is also easy to produce $D$ dynamically with a neural network, but we do not consider this case for simplicity.

After introducing the neural similarity learning of a single convolution kernel, we discuss how to construct a neural similarity network using this building block. In order to save parameters, we let all the convolution kernels of the same layer share the same neural similarity matrix, which means that we require the same convolution layer to have the same similarity measure. We empirically validate this design choice in Section 7.1. Stacking convolution layers with static (dynamic) neural similarity gives us static (dynamic) NSN. Note that static NSN has the same number of parameters as a standard CNN in deployment but yields better generalization ability. Compared to [28], dynamic NSN has better regularity on the convolution kernel and is also able to utilize pretrained CNN models.
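Returning to the heuristic update of the Boolean mask $D$ described above, it can be implemented with a straight-through-style trick. The sketch below is our own illustration (the threshold value is an assumption): it keeps a real-valued $D_r$, thresholds it in the forward pass, and routes the gradient of the binarized $D$ back to $D_r$.

```python
import torch

class ThresholdSTE(torch.autograd.Function):
    """Forward: binarize D_r with a fixed threshold alpha.
    Backward: pass the gradient w.r.t. D through to D_r unchanged (noisy gradient)."""
    @staticmethod
    def forward(ctx, d_r, alpha):
        return (d_r > alpha).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # dL/dD is used as the update direction for D_r

# Example: a 3x3 kernel shape mask (9 diagonal entries of D), alpha = 0.5 assumed.
d_r = torch.full((9,), 0.6, requires_grad=True)
w = torch.randn(9, requires_grad=True)
x = torch.randn(9)

d = ThresholdSTE.apply(d_r, 0.5)       # Boolean diagonal of D
loss = ((d * w) @ x - 1.0) ** 2        # masked kernel D * W applied to a patch
loss.backward()                        # d_r.grad now holds the noisy gradient dL/dD
```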
Training from pretrained models. In order to make use of pretrained models, we can simply use the pretrained model as our backbone network (with all the weights loaded). We then add the static or dynamic neural similarity modules to the convolution kernels and train the neural similarity modules with the backbone weights fixed until convergence. Optionally, we can finetune the entire network after the training of the neural similarity modules. In contrast, other dynamic networks [18, 28] are not able to take advantage of pretrained models. Note that it is not necessary for static or dynamic NSN to be trained from pretrained models: they can also be trained from scratch (weights of both the backbone and the neural similarity modules are optimized from random initialization) and still yield better results than the CNN baselines.
Training and inference. Similar to CNNs, both static and dynamic NSN can be trained end-to-end using mini-batch stochastic gradient descent. Apart from the fact that the factorized form with $D$ and $R$ needs to be optimized with a heuristic approach, the training is essentially the same as for a standard CNN. In the inference stage, we can compute all the equivalent weights of static NSN in advance to speed up inference in practice. For dynamic NSN, the inference is also similar to a standard CNN, with slightly more computation from the neural similarity module.

As mentioned before, NSL can be viewed as a form of matrix multiplication where the weight matrix $W$ is factorized as $M^\top W'$ ($W'$ is the new weight matrix and $M$ is the similarity matrix). Such a factorized form not only provides more modeling and regularization flexibility, but also introduces an implicit regularization (under gradient descent). The implicit regularization in matrix factorization is studied in [16]. We first compare the behavior of gradient descent on $W$ and on $\{W', M\}$ to observe the difference. We consider a simple example of a one-layer neural network with least squares loss (i.e., linear regression): $\min_W \mathcal{L}(W) := \sum_i \|y_i - W^\top X_i\|^2$, where $W \in \mathbb{R}^{n \times m}$ is the weight matrix of the neurons, $y_i \in \mathbb{R}^m$ is the target and $X_i \in \mathbb{R}^n$ is the $i$-th sample. The behavior of gradient descent with an infinitesimally small learning rate can be captured by the differential equation $\dot{W}_t + \nabla\mathcal{L}(W_t) = 0$ with an initial condition $W_0$, where $\dot{W}_t := \frac{\mathrm{d}W_t}{\mathrm{d}t}$. For NSL, the objective becomes $\min_{\{W', M\}} \mathcal{L}(W', M) := \sum_i \|y_i - W'^\top M X_i\|^2$, so the corresponding differential equations of gradient descent on $W'$ and $M$ are $\dot{W}'_t + \nabla_{W'}\mathcal{L}(W'_t, M_t) = 0$ and $\dot{M}_t + \nabla_M\mathcal{L}(W'_t, M_t) = 0$, respectively (with initial conditions $W'_0$ and $M_0$). Therefore, the gradient flows of the standard update on $W$ and the factorized NSL update on $\{W', M\}$ can be expressed as

$$\text{Standard: } \dot{W}_t = \sum_i X_i \big(y_i - W_t^\top X_i\big)^\top = \sum_i X_i (r^i_t)^\top \quad \big(\text{define } r^i_t = y_i - W_t^\top X_i\big)$$
$$\text{NSL: } \dot{W}_t = M_t^\top \dot{W}'_t + \dot{M}_t^\top W'_t = M_t^\top M_t \sum_i X_i (r^i_t)^\top + \sum_i X_i (r^i_t)^\top W_t'^\top W'_t \qquad (6)$$

from which we observe that the gradient dynamics of the NSL update are very different from those of the standard update. Therefore, NSL may introduce a regularization effect that differs from the standard update, and we argue that such implicit regularization induced by NSL is beneficial to the generalization power. [16] conjectures that optimizing matrix factorization with gradient descent implicitly regularizes the solution towards minimum nuclear norm. [5] extends the analysis of implicit regularization to deep matrix factorization (i.e., multi-layer linear neural networks) and shows that multi-layer matrix factorization enhances an implicit tendency towards low-rank solutions. [15, 27] show that gradient descent converges to the maximum margin solution in linear neural networks for binary classification of separable data. More interestingly, [5] argues that implicit regularization in matrix factorization may not be captured using simple mathematical norms.
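The difference between the two gradient flows is easy to reproduce numerically. The following NumPy sketch is an illustrative experiment of our own (not from the paper): it runs gradient descent on an underdetermined linear regression once over $W$ directly and once over the factorization $M^\top W'$, and compares the nuclear norms of the two solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, num = 20, 15, 10                      # underdetermined: fewer samples than n
X = rng.standard_normal((num, n))
Y = rng.standard_normal((num, m))
lr, steps = 0.01, 5000

# (a) Plain gradient descent on W.
W = np.zeros((n, m))
for _ in range(steps):
    W -= lr * X.T @ (X @ W - Y)

# (b) Gradient descent on the factorization W = M^T W'.
M = np.eye(n) + 0.01 * rng.standard_normal((n, n))
Wp = 0.01 * rng.standard_normal((n, m))
for _ in range(steps):
    R = X @ M.T @ Wp - Y                    # residuals
    grad_Wp = M @ X.T @ R                   # dL/dW'
    grad_M = Wp @ R.T @ X                   # dL/dM
    Wp -= lr * grad_Wp
    M -= lr * grad_M

nuc = lambda A: np.linalg.norm(A, 'nuc')
print('fit error (a):', np.linalg.norm(X @ W - Y))
print('fit error (b):', np.linalg.norm(X @ M.T @ Wp - Y))
print('nuclear norm (a):', nuc(W), ' (b):', nuc(M.T @ Wp))
```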
A classic dynamic neural unit (DNU) [17] receives not only external inputs but also state feedback signals from itself and other neurons. A general mathematical model of an isolated DNU is given by the differential equation $\dot{x}(t) = -\alpha x(t) + f(w, x(t), u)$, $y(t) = g(x(t))$, where $x$ is the DNU's neural state, $w$ is the weight vector, $u$ is the external input, $f(\cdot)$ is the nonlinear activation and $g(\cdot)$ is the DNU's output function. As a dynamical system, the output of a DNU depends on both the external input and the output time stamp. The neural state trajectory also depends on the equilibrium convergence property of the DNU. Different from the DNU, dynamic NSN does not have state feedback or self-recurrence. Instead, it realizes dynamic output with a neural similarity generator that changes the equivalent weight matrix adaptively based on the input. However, it would be interesting to combine self-recurrence with NSL, since it can save parameters and strengthen the approximation power.

Recent work [9, 41, 49, 58] shows that many existing deep neural networks can be considered as different numerical schemes approximating an ordinary differential equation (ODE). NSN with certain similarity designs is also equivalent to approximating an ODE. For example, $f_M = W^\top(\tilde{W} + M)X = X_m + W^\top M X$, where $W^\top \tilde{W} = \mathrm{Diag}(0, \cdots, 0, 1, 0, \cdots, 0)$ (the 1 lies at the center location), can be written as $x_{n+1} = x_n + \Delta t \cdot g_n(x_n)$ (i.e., ResNet), where $x_n$ is the input feature map at depth $n$ and $g_n(\cdot)$ is the transformation at depth $n$. This is one step of forward Euler discretization of the ODE $\dot{x}_t = g(x, t)$. Different neural similarity designs correspond to different iterative methods for ODEs.

Connection and comparison to existing works. Static NSN is a direct generalization of the standard CNN, and can be viewed as factorized learning (with optional regularizations) of convolution kernels. Dynamic NSN can be viewed as a non-trivial generalization of hyperspherical convolution [39] in the sense that hyperspherical convolution is also input-dependent and can be viewed as the special case of $M$ being $\frac{1}{\|W\|\|X\|} I$. Compared to dynamic filter networks [28], dynamic NSN achieves a better trade-off between flexibility and generalization. Dynamic filter networks are very flexible since the weights are completely generated by another network, but they yield unsatisfactory image recognition accuracy. In contrast, dynamic NSN imposes strong regularizations on the weights and is less flexible than dynamic filter networks, but it has much stronger generalization ability while still being dynamic. When $M$ has no constraints, our dynamic NSN becomes essentially equivalent to the dynamic filter network. [11] proposes to dynamically select filters to perform inference, while NSL dynamically estimates a similarity measure.

Dynamic NSN is a high-order function of the input. Dynamic NSN outputs $W^\top M_\theta(X) X$. Assume $M_\theta(X)$ is a one-layer neural network, i.e., $M_\theta(X) = W' X^\top$. Then the one-layer dynamic NSN is written as $W^\top W' X^\top X$, which is a quadratic function of $X$. In general, $M_\theta(X)$ is much more nonlinear, so a one-layer dynamic NSN is naturally a high-order function of the input $X$. Therefore, dynamic NSN has stronger approximation ability and flexibility than the standard convolution.
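The quadratic behavior is easy to check numerically; the toy example below (our own, using a linear predictor $M_\theta(X) = W' X^\top$) shows that scaling the input by $t$ scales the one-layer dynamic NSN output by $t^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.standard_normal(n)    # static kernel W
Wp = rng.standard_normal(n)   # predictor weights W'
X = rng.standard_normal(n)

def dynamic_nsn(x):
    # One-layer dynamic NSN with a linear predictor M_theta(x) = W' x^T.
    M = np.outer(Wp, x)
    return W @ M @ x          # W^T M_theta(x) x = (W^T W') (x^T x)

# Scaling the input by 2 scales the output by 4: the layer is quadratic in X.
print(dynamic_nsn(X), dynamic_nsn(2 * X) / dynamic_nsn(X))   # second value is 4.0
```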
Self-attention as a global dynamic neural similarity. Since self-attention [62] is also a high-order function of the input, it can also be viewed as a form of dynamic neural similarity. We define a novel global neural similarity that can reduce to self-attention in Appendix B.
Applications. For fair comparison, the backbone network architecture is the same in each experiment. We mostly use a VGG-like plain CNN architecture. Detailed structures for baselines and NSN are provided in Appendix A. For CIFAR-10 and CIFAR-100, we follow the same augmentation settings as [21]. For the ImageNet-2012 dataset, we mostly follow the settings in [30]. Batch normalization, ReLU, mini-batch size 128, and SGD with momentum are used as defaults in all methods. For CIFAR-10 and CIFAR-100, we start momentum SGD with learning rate 0.1. The learning rate is divided by 10 at 34K and 54K iterations and training stops at 64K iterations. For ImageNet, the learning rate starts at 0.1 and is divided by 10 at 200K, 375K and 550K iterations (finished at 600K).

Method               Error (%)
Baseline CNN         7.78
Dynamic NSN (CNN)    7.04
Dynamic NSN (SN)
Table 1: Predictor Network.
Different neural similarity predictors. We consider two types of architectures for the neural similarity predictor of dynamic NSN: a standard CNN and SphereNet [39]. We experiment on CIFAR-10, and DNS ($M_s$ is diagonal) is used in NSN. Table 1 shows that SphereNet works better than a standard CNN as the neural similarity predictor, because SphereNet has better convergence properties and can stabilize NSN training. In fact, dynamic NSN cannot converge if a CNN is trivially applied as the predictor, and we have to apply normalization (or a sigmoid activation) to the predictor's final output to make it converge. In contrast, SphereNet makes dynamic NSN converge easily and perform better. Therefore, we use SphereNet as the neural similarity predictor for dynamic NSN by default.

Method             Error (%)
Baseline CNN       7.78
Static NSN         7.15
Static NSN (J)     6.92
Dynamic NSN        6.85
Dynamic NSN (J)
Table 2: Joint learning.
Joint learning of kernel shape and similarity. We now evaluate how jointly learning the kernel shape and similarity improves NSN. We use CIFAR-10 in this experiment. For both static and dynamic NSN, we use DNS ($M_s$ is a diagonal matrix). For dynamic NSN, we use SphereNet [39] as the neural similarity predictor. Table 2 shows that jointly learning $D$ and $R$ performs better than simply learning $M_s$. However, for simplicity, we still learn a single $M_s$ in the other experiments.

Method                   Error (%)
Baseline CNN             7.78
Dynamic NSN (Shared)     7.20
Dynamic NSN (Disjoint)
Table 3: Predictor parameterization.
Shared vs. disjoint dynamic NSN. We evaluate the shared and disjoint parameterizations for the neural similarity predictor. We use CIFAR-10 in this experiment. For both static and dynamic NSN, we use DNS. Table 3 shows that the shared similarity predictor performs slightly worse than the disjoint one, but the shared one saves nearly half of the parameters used by the disjoint one.
Method                   CIFAR-10    CIFAR-100
Baseline CNN             7.78        28.95
Baseline CNN++           7.29        28.70
Static NSN w/ DNS        7.15        28.35
Static NSN w/ UNS        7.38        28.11
Dynamic NSN w/ DNS       6.85
Dynamic NSN w/ UNS
Table 4: Error (%) on CIFAR-10 & CIFAR-100.
CIFAR-10/100. We comprehensively evaluate both static and dynamic NSN on CIFAR-10 and CIFAR-100. All dynamic NSN variants use SphereNet as the neural similarity predictor. Both DNS and UNS are evaluated for comparison. Because dynamic NSN uses slightly more parameters than the baseline CNN, we construct a new baseline, CNN++, by making the baseline CNN deeper and wider such that its number of parameters is slightly larger than that of all NSN variants. The results in Table 4 verify the superiority of both static and dynamic NSN. Our dynamic NSN outperforms both the baseline CNN and CNN++ by a considerable margin. Moreover, one can observe that dynamic NSN generally performs better than static NSN, showing that dynamic inference can be beneficial for the image recognition task. DNS and UNS perform similarly on CIFAR-10 and CIFAR-100, indicating that DNS is already flexible enough for the image recognition task.
Method Top-1 Top-5
Table 5: Validation error (%) on ImageNet-2012.
ImageNet-2012. In order to be parameter-efficient, we evaluate dynamic NSN with DNS on the ImageNet-2012 dataset. The backbone network is a VGG-like 10-layer plain CNN, so the absolute performance is not state-of-the-art; the purpose here is an apples-to-apples fair comparison. Using the same backbone network, dynamic NSN is significantly and consistently better than both the baseline CNN and CNN++. Note that baseline CNN++ is a deeper and wider version of the baseline CNN. The results in Table 5 show that dynamic NSN yields strong generalization ability with the same number of parameters, and most importantly, the experiments demonstrate that the dynamic inference mechanism can work well in a challenging large-scale image recognition task.

Static NSN. It is very natural to apply static NSN to few-shot learning. Similar to the finetuning baseline, we first train a backbone network on the base class data. In the testing stage, we first finetune both the static neural similarity matrix and the classifier on the novel class data and then use the finetuned classifier to make predictions. Note that, in order to use a pretrained backbone, we need to initialize the neural similarity matrix with an identity matrix. Due to the strong regularity that we impose on the similarity matrix, static NSN is able to preserve rich information from the pretrained model while quickly adapting to the novel class data.
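A minimal sketch of this adaptation step is given below (our own illustration: the naming convention `M_s`/`classifier`, the assumption that each similarity block is stored as a square matrix parameter, and the 100-epoch schedule are ours, not from the paper):

```python
import torch
import torch.nn as nn

def fewshot_adapt(model, support_x, support_y, epochs=100, lr=0.1):
    """Freeze the pretrained backbone; adapt only the similarity matrices and classifier."""
    sim_params, cls_params = [], []
    for name, p in model.named_parameters():
        if 'M_s' in name:                # assumed naming for static similarity matrices
            nn.init.eye_(p)              # identity init preserves the pretrained behavior
            sim_params.append(p)
        elif 'classifier' in name:
            cls_params.append(p)
        else:
            p.requires_grad_(False)      # backbone stays fixed
    opt = torch.optim.SGD(sim_params + cls_params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(model(support_x), support_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```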
Dynamic NSN. Dynamic NSN is very suitable for few-shot learning due to its dynamic nature: its filters are conditioned on the input. Because dynamic NSN is able to learn meta-information about the similarity measure, its intermediate layers do not need to be finetuned in the testing stage. From a high-level perspective, dynamic NSN shares some similarities with MAML [14] in the sense that dynamic NSN learns to transform its filters with a projection matrix, while MAML transforms its filters using gradient updates during inference. We directly train the dynamic NSN on the base class data. In the testing stage, we first retrain the classifier using the novel class data, and then classify the query images using the dynamic NSN and the retrained classifier.
Meta-learned static NSN. Inspired by MAML [14], we propose to meta-learn the neural similarity. We pretrain the network on the base classes with an identity similarity and then meta-learn the neural similarity and the classifier similarly to MAML. The meta-learned static NSN dynamically transforms its filters via projection using the gradients, similar to MAML. The meta-optimization is given by

$$\min_M \sum_{\tau_i \sim p(\tau)} \mathcal{L}_{\tau_i}(f_{M'}) \quad \text{s.t.} \quad M' = M - \eta \nabla_M \mathcal{L}_{\tau_i}(f_M) \qquad (7)$$

which aims to learn a good initialization for the static neural similarity matrix. During testing, the procedure exactly follows MAML [14], except that the meta-learned static NSN only updates the neural similarity matrix with gradients. The pretrained model was recently shown to perform well with certain normalization [10]. Meta-learned static NSN is able to take full advantage of the pretrained model, and can be viewed as an interpolation between the pretrained model and MAML [14]. In fact, the dynamic neural similarity can also be meta-learned similarly, which is left for future investigation.

Method                        Backbone     5-shot Accuracy (%)
Finetuning Baseline [46]      CNN-4        49.79
Discriminative k-shot [6]     ResNet-34    73.90
Dynamic NSN (ours)            CNN-9        77.44

Table 6: Few-shot classification on the Mini-ImageNet test set.
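A minimal sketch of the meta-update in Eq. (7) is shown below (our own illustration in plain PyTorch, using a first-order approximation; `tasks.sample()`, `task.support`, `task.query` and `model_loss(M, data)`, which stands in for $\mathcal{L}_\tau(f_M)$ with the pretrained backbone fixed, are assumed helpers):

```python
import torch

def meta_learn_similarity(M, tasks, model_loss, inner_lr=0.01, outer_lr=0.001, steps=1000):
    """First-order MAML-style meta-learning of the static similarity matrix M only."""
    opt = torch.optim.Adam([M], lr=outer_lr)
    for _ in range(steps):
        task = tasks.sample()                      # one few-shot episode (support + query)
        # Inner step: adapt M on the support set.
        support_loss = model_loss(M, task.support)
        grad = torch.autograd.grad(support_loss, M)[0]
        M_adapted = M - inner_lr * grad            # M' = M - eta * grad_M L(f_M)
        # Outer step: evaluate the adapted M' on the query set and update M.
        query_loss = model_loss(M_adapted, task.query)
        opt.zero_grad()
        query_loss.backward()
        opt.step()
    return M
```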
Experiment on Mini-ImageNet. The experimental protocol is the same as in [46, 14]. Following [46], we use 4 convolution layers with 32 filters (3 × 3) per layer. Batch normalization [23], ReLU non-linearity and 2 × 2 pooling are used. For all the NSN variants, we use the best setup and hyperparameters. The results in Table 6 show that all three of our proposed few-shot learning strategies work reasonably well. The dynamic NSN outperforms the other competitive methods by a considerably large margin. Static NSN works better than most existing methods. Meta-learned static NSN also shows obvious advantages over its direct competitor MAML. Moreover, we also compare with the recent state-of-the-art method LEO [48], which uses features from ResNet-28. Our dynamic NSN with the CNN-9 backbone achieves accuracy comparable to LEO while using far fewer network parameters. This experiment further validates the strong generalization ability of all NSN variants.

We have proposed a general yet powerful framework that generalizes traditional convolution with the neural similarity. Our framework can capture the similarity structure that lies in the data of interest, and regularizing the similarity to accommodate the nature of the input dataset may yield better performance. Our experiments on image recognition and few-shot learning show the potential of our framework to be flexible, generalizable and interpretable. The framework can be further applied to more applications, e.g., semantic segmentation, and may inspire different threads of research.

Acknowledgements
Weiyang Liu was supported in part by Baidu Fellowship and Nvidia GPU Grant. Le Song wassupported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351,CAREER IIS-1350983, DARPA Program on Learning with Less Labels.
References

[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.
[2] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
[3] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.
[4] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
[5] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655, 2019.
[6] Matthias Bauer, Mateo Rojas-Carulla, Jakub Bartłomiej Świątkowski, Bernhard Schölkopf, and Richard E Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
[7] Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche ..., 1990.
[8] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018.
[9] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In NIPS, 2018.
[10] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In ICLR, 2019.
[11] Zhourong Chen, Yang Li, Samy Bengio, and Si Si. Gaternet: Dynamic filter selection in convolutional neural network via a dedicated global gating network. arXiv preprint arXiv:1811.11205, 2018.
[12] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
[13] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
[14] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[15] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In NIPS, 2018.
[16] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In NIPS, 2017.
[17] Madan Gupta, Liang Jin, and Noriyasu Homma. Static and dynamic neural networks: from fundamentals to advanced theory. John Wiley & Sons, 2004.
[18] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
[19] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[22] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NIPS, 2016.
[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[24] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
[25] Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.
[26] Yunho Jeon and Junmo Kim. Constructing fast network through deconstruction of convolution. In NIPS, 2018.
[27] Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. arXiv preprint arXiv:1810.02032, 2018.
[28] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In NIPS, 2016.
[29] Kenji Kawaguchi. Deep learning without poor local minima. In NIPS, 2016.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[31] Samuel J Leven and Daniel S Levine. Multiattribute decision making in context: A dynamic neural network methodology. Cognitive Science, 20(2):271–299, 1996.
[32] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
[33] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In COLT, 2018.
[34] Rongmei Lin, Weiyang Liu, Zhen Liu, Chen Feng, Zhiding Yu, James M. Rehg, Li Xiong, and Le Song. Compressive hyperspherical energy minimization. arXiv preprint arXiv:1906.04892, 2019.
[35] Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. In NIPS, 2018.
[36] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M Rehg, and Le Song. Decoupled networks. In CVPR, 2018.
[37] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[38] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[39] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. In NIPS, 2017.
[40] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[41] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In ICML, 2018.
[42] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. arXiv preprint arXiv:1801.06519, 2018.
[43] Tsendsuren Munkhdalai and Hong Yu. Meta networks. arXiv preprint arXiv:1703.00837, 2017.
[44] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
[45] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In NIPS, 2018.
[46] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[47] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[48] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
[49] Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272, 2018.
[50] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[51] Albert Shaw, Wei Wei, Weiyang Liu, Le Song, and Bo Dai. Meta architecture search. In NeurIPS, 2019.
[52] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
[53] Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360 imagery. In NIPS, 2017.
[54] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
[55] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
[56] Yingxu Wang. The cognitive processes of formal inferences. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 1(4):75–86, 2007.
[57] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. arXiv preprint arXiv:1801.05401, 2018.
[58] E Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
[59] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions. In CVPR, 2018.
[60] Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In NIPS, 2003.
[61] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[62] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
[63] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. arXiv preprint arXiv:1811.11168, 2018.

Appendix A Experimental Details
Table 7: Our plain CNN architectures (CNN-9 for CIFAR-10/100 and CNN-10 for ImageNet-2012). Conv1.x, Conv2.x and Conv3.x denote convolution units that may contain multiple convolution layers.

In CIFAR-10/100, our DNS predictor uses the structure "Input - 64 hidden units (SphereConv, only x being normalized) - 9 output units (no ReLU)", and our UNS predictor uses the structure "Input - 128 hidden units (SphereConv, only x being normalized) - 81 output units (no ReLU)". Note that SphereConv comes from [39]. For the DNS predictor, we add an identity matrix to the output of the predictor to improve its initialization point. For the UNS predictor, we simply use the output of the network as the neural similarity matrix. For CIFAR-10/100, we use the same training data augmentation as in [36].

On the ImageNet-2012 dataset, the DNS predictor uses the structure "Input - 32 hidden units (SphereConv, only input is normalized) - 9 output units (no ReLU)".

For meta-learning on the Mini-ImageNet dataset, we use DNS for all experiments. For our non-MAML baseline and NSN models, we train the models on both the training and validation sets of Mini-ImageNet, while we train the MAML-trained static NSN model with the training set only.

For non-MAML training, we use the Adam optimizer with lr = 1e−, β1 = 0., β2 = 0.. For non-MAML testing, we finetune the model on query sets with SGD with lr = 0., momentum = 0., dampening = 0. and weight decay = 0. for 100 epochs.

In the non-MAML static NSN experiments, we train the whole model from scratch and fix the static similarity matrices to be identity; during testing, we only finetune the matrices and the classifier. The (non-MAML) dynamic NSN experiments are similar except that there is no static similarity matrix anymore.

In the MAML-trained static NSN experiments, we use the trained non-MAML static NSN as a pretrained model, and meta-train both the static similarity matrices and the classifier. For the MAML gradient steps on the support set, we first run 5 gradient steps on both the static similarity matrices and the classifier with step size 0.. Because the MAML-trained static NSN has less capacity for finetuning on query sets, we run an additional 20 gradient steps with the same step size but on the classifier only.

The CNN-9 network architecture of dynamic NSN on Mini-ImageNet is the same as the one we use on CIFAR-10/100.

Our code is publicly available at https://github.com/wy1iu/NSL . For all the missing experimental details, please refer to our code repository.

Appendix B Local and Global Neural Similarity
B.1 Formulation
The original dynamic neural similarity is performed in a local fashion, meaning that the similarity matrix operates on the local patch instead of the entire input feature map. We extend the original neural similarity from operating on the local patch to operating on the global input feature map. As a result, we refer to the original neural similarity as the local neural similarity (LNS). Specifically, for an input feature map $X \in \mathbb{R}^{m \times m \times c}$ of size $m \times m \times c$ and a convolution kernel $W \in \mathbb{R}^{k \times k \times c}$ of size $k \times k \times c$ (stride 1 and dimension-preserving padding), the global neural similarity (GNS) for convolution is formulated as

$$F^G_M = W_G^\top M_G X_F \qquad (8)$$

where $F^G_M$ is a vector of size $mm \times 1$, which is different from the standard neural similarity (with stride 1 and dimension-preserving padding), $W_G$ is the block circulant matrix (a special case of a Toeplitz matrix) that performs the 2D convolution, $M_G$ is the neural similarity matrix, and $X_F \in \mathbb{R}^{mmc \times 1}$ is the flattened vector of the input feature map $X$. The block circulant matrix $W_G$ converts the 2D convolution into a matrix multiplication. The GNS matrix $M_G \in \mathbb{R}^{mmc \times mmc}$ usually takes the following block-diagonal form with the same block matrix $M^s_G$:

$$M_G = \begin{pmatrix} M^s_G & & \\ & \ddots & \\ & & M^s_G \end{pmatrix} \in \mathbb{R}^{mmc \times mmc} \qquad (9)$$

where there are $c$ blocks $M^s_G \in \mathbb{R}^{mm \times mm}$. Note that if $M^s_G$ is a diagonal matrix, it serves a role similar to a spatial attention mask for the input feature map (the spatial attention map is also shared across different channels of the input feature map if we require $M_G$ to be a block-diagonal matrix with shared blocks).
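The contrast between diagonal LNS and diagonal GNS (elaborated in the LNS vs. GNS discussion below) is easy to see in code; the toy sketch that follows is our own illustration: diagonal LNS reuses one per-patch weighting inside every sliding window, while diagonal GNS applies one spatial mask over the whole feature map before a standard convolution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
m, c, k = 6, 3, 3
x = torch.randn(1, c, m, m)
w = torch.randn(1, c, k, k)

# Diagonal LNS: the same k*k*c weighting is applied inside every sliding window.
lns_diag = torch.rand(c * k * k)
patches = F.unfold(x, k, padding=k // 2)                 # (1, c*k*k, m*m) local patches
out_lns = (lns_diag[None, :, None] * patches * w.view(1, -1, 1)).sum(1).view(1, 1, m, m)

# Diagonal GNS: one spatial mask over the whole feature map, then a standard conv.
gns_mask = torch.rand(m, m)
out_gns = F.conv2d(x * gns_mask, w, padding=k // 2)

# Structurally similar diagonal similarities, but not equivalent in general.
print(out_lns.shape, out_gns.shape, torch.allclose(out_lns, out_gns))
```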
Figure 4: Comparison between the (local) neural similarity and the global neural similarity. The local neural similarity models a single location, while the global neural similarity models multiple locations.
LNS vs. GNS. The difference between the neural similarity and the global neural similarity lies in whether the convolution is taken into consideration. For the original neural similarity, although we apply it to the convolution kernel, we do not consider the sliding window operation; instead, we only consider the local inner product operation and combine the neural similarity matrix locally. For the global neural similarity, we take the convolution into account and transform the original convolution operation into a matrix multiplication (using a Toeplitz matrix). An intuitive comparison is given in Figure 4. From the computation perspective, GNS and LNS are not equivalent in general. For example, consider the case where both $M$ in LNS and $M_G$ in GNS are diagonal matrices. Although both similarity matrices share the same structure, the equivalent outputs are totally different. For LNS, each position in the output feature map is obtained with a weighted inner product: the diagonal $M$ serves as an element-wise weighting factor for computing the inner product, and the same set of weighting factors is repeatedly applied to every sliding window (of the same size as the convolution kernel) in the input feature map. In contrast, a diagonal $M_G$ in GNS serves as a spatial attention mask for the entire input feature map: it is equivalent to first computing a Hadamard product between the input feature map and the spatial mask induced by $M_G$, and then performing a standard 2D convolution with the kernel W
on the result. GNS and LNS are only equivalent when GNS considers an input feature map of size $1 \times 1 \times c$ (i.e., the input feature map contains only one spatial location). Both static GNS and dynamic GNS are analogous to the corresponding variants of LNS.

Self-attention as dynamic GNS. Dynamic GNS can be written as follows:

$$F^G_M = W_G^\top \cdot M_G(X; \theta) \cdot X_F \qquad (10)$$

where $M_G(X; \theta)$ is a function of $X$. We show that self-attention [62] is a special case of dynamic GNS. We first resize the dimension of $X_F$ in Eq. (10) to $mm \times c$ when multiplying with $M_G(X; \theta)$. Then, after the multiplication, we resize $X_F$ back to $m \times m \times c$. We consider the case $M_G(X; \theta) = G_1(X) G_2(X)^\top$, where $G_1(X)$ is a $1 \times 1$ convolution that transforms $X \in \mathbb{R}^{m \times m \times c}$ into a new feature map of size $m \times m \times c_1$, followed by a resize of the new feature map to $G_1(X) \in \mathbb{R}^{mm \times c_1}$. $G_2(X)$ is also a combination of a $1 \times 1$ convolution and a resize operation, the same as $G_1(X)$. One can see that $M_G(X; \theta) = G_1(X) G_2(X)^\top$ is essentially a self-attention map. By multiplying the self-attention map back onto the feature map, we obtain exactly the same self-attention mechanism as in [62]. As a form of dynamic GNS, the self-attention operation can be written as

$$F^{\text{self-attention}}_M = W_G^\top \cdot \mathrm{Resize}\Big(G_1(X) G_2(X)^\top \cdot \mathrm{Resize}(X_F, mm, c),\ mmc, 1\Big) \qquad (11)$$

Connection to spatial transformers. Dynamic GNS is also closely related to spatial transformer networks [24]. A spatial transformer contains a localization network, a grid generator, and a sampler. The localization network takes the feature map as input and outputs parameters for the grid generator; the grid generator and the sampler then transform the feature map. This pipeline resembles neural similarity learning and can be viewed as a special case of GNS.
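A compact PyTorch-style sketch of this view is given below (our own illustration in the spirit of Eq. (10); the softmax normalization of the attention map is an assumption borrowed from typical self-attention implementations rather than from the formulation above, and all names are ours):

```python
import torch
import torch.nn as nn

class SelfAttentionGNS(nn.Module):
    """Self-attention expressed as a dynamic global neural similarity:
    an input-dependent (hw x hw) matrix G1(X) G2(X)^T re-mixes spatial locations
    before a standard convolution is applied."""
    def __init__(self, channels, attn_channels=8, kernel_size=3):
        super().__init__()
        self.g1 = nn.Conv2d(channels, attn_channels, 1)   # 1x1 conv -> "query"-like map
        self.g2 = nn.Conv2d(channels, attn_channels, 1)   # 1x1 conv -> "key"-like map
        self.conv = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.g1(x).flatten(2).transpose(1, 2)             # (b, hw, attn_channels)
        k = self.g2(x).flatten(2).transpose(1, 2)             # (b, hw, attn_channels)
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)   # (b, hw, hw): dynamic M_G block
        x_flat = x.flatten(2).transpose(1, 2)                 # (b, hw, c): resized X_F
        x_mixed = (attn @ x_flat).transpose(1, 2).view(b, c, h, w)
        return self.conv(x_mixed)                             # W_G applied after the similarity

layer = SelfAttentionGNS(channels=16)
print(layer(torch.randn(2, 16, 8, 8)).shape)                  # torch.Size([2, 16, 8, 8])
```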
B.2 Preliminary Experiments
We implement self-attention with our dynamic GNS in both a standard CNN and SphereNet [39], and then evaluate them with image classification on CIFAR-10. To simplify evaluation, we only perform mild data augmentation on the CIFAR-10 training set, unlike in the main paper. We use the CNN-9 architecture from [39] for both the standard CNN and SphereNet, but we use 128, 192 and 256 as the numbers of filters in Conv1.x, Conv2.x and Conv3.x. For more details, refer to our code repository. Table 8 shows the results of CNN and SphereNet with and without self-attention. We can see that self-attention does not seem to bring many gains to the image classification task. However, we observe that using SphereNet can boost the advantages of self-attention and achieve a considerable accuracy gain.
Method                          Accuracy (%)
CNN                             90.86
CNN w/ self-attention           90.69
SphereNet                       91.31
SphereNet w/ self-attention
Table 8: CNN and SphereNet with self-attention (dynamic GNS) on CIFAR-10.

Significance of NSL for Meta-Learning
One of the keys in MAML [14] is that it uses gradient updates to make the network parameters dynamically dependent on the input. Essentially, we can view it as a novel realization of dynamic neural networks, except that the network parameters are changed dynamically along the gradient direction. Different from MAML, dynamic NSL realizes the dynamic neural network with an additional neural similarity predictor.