MMA Regularization: Decorrelating Weights of Neural Networks by Maximizing the Minimal Angles
Zhennan Wang Canqun Xiang Wenbin Zou Chen Xu
Shenzhen University {wangzhennan2017, xiangcanqun2018}@email.szu.edu.cn, {wzou, xuchen_szu}@szu.edu.cn
Abstract
The strong correlation between neurons or filters can significantly weaken the generalization ability of neural networks. Inspired by the well-known Tammes problem, we propose a novel diversity regularization method to address this issue, which makes the normalized weight vectors of neurons or filters distributed on a hypersphere as uniformly as possible, through maximizing the minimal pairwise angles (MMA). This method can easily exert its effect by plugging the MMA regularization term into the loss function with negligible computational overhead. The MMA regularization is simple, efficient, and effective. Therefore, it can be used as a basic regularization method in neural network training. Extensive experiments demonstrate that MMA regularization is able to enhance the generalization ability of various modern models and achieves considerable performance improvements on the CIFAR100 and TinyImageNet datasets. In addition, experiments on face verification show that MMA regularization is also effective for feature learning.
1 Introduction

Although neural networks have achieved state-of-the-art results in a variety of tasks, they contain redundant neurons or filters due to the over-parametrization issue [37, 17], which is prevalent in networks [35]. The redundancy can lead to catching limited directions in feature space and poor generalization performance [23].

To address the redundancy problem and make neurons more discriminative, some methods have been developed to encourage the angular diversity between pairwise weight vectors of neurons or filters in a layer, which can be categorized into the following three types. The first type reduces the redundancy by dropping some weights and then retraining them iteratively during optimization [31, 9, 32], which suffers from a complex training scheme and a very long training phase. The second type is the widely used orthogonal regularization [34, 48, 19, 47], which exploits a regularization term in the loss function to enforce the pairwise weight vectors to be as orthogonal as possible. However, it has been proven that orthogonal regularization tends to group neurons closer, especially when the number of neurons is greater than the dimension [20], and therefore it only produces marginal improvements [31]. The third type also utilizes a regularization term, but encourages the weight vectors to be uniformly spaced through minimizing the hyperspherical potential energy [20, 18], inspired by the Thomson problem [43, 40]. Nonetheless, its disadvantage is that both the time complexity and the space complexity are very high [20], and it suffers from a huge number of local minima and stationary points due to its highly non-convex and non-linear objective function [18].

In this paper, we propose a simple, efficient, and effective method of angular diversity regularization which penalizes the minimum angles between pairwise weight vectors in each layer. Similar to the intuition of the third type mentioned above, the most diverse state is that the normalized weight vectors are distributed on a hypersphere uniformly. To model the criterion of uniformity, we employ the well-known Tammes problem, that is, to find the arrangement of n points on a unit sphere which maximizes the minimum distance between any two points [42, 25, 29, 22, 28]. However, the optimal
Figure 1: Comparison of filter cosine similarity from the first layer of VGG19-BN trained on CIFAR100 with several different methods of angular diversity regularization: (a) baseline, (b) orthogonal, (c) MHE (s=0), (d) MMA. The number of similarity values above 0.2 is 495 (baseline), 120 (orthogonal), 51 (MHE), and 0 (MMA), demonstrating the effectiveness of MMA regularization.

solutions for the Tammes problem only exist for some combinations of the number of points n and dimensions d, which are collected on N.J.A. Sloane's homepage [39], and obtaining a uniform distribution for an arbitrary combination of n and d is still an open mathematical problem [25]. In this paper, we propose a numerical optimization method to get approximate solutions for the Tammes problem through maximizing the minimal pairwise angles between weight vectors, named MMA for short. We further develop the MMA regularization for neural networks to promote the angular diversity of weight vectors in each layer and thus improve the generalization performance.

There are several advantages of MMA regularization: (a) as analyzed in Section 3.2, the gradient of the MMA loss is stable and consistent, so it is easy to optimize and obtains near-optimal solutions for the Tammes problem, as shown in Table 1; (b) as verified in Table 3, the MMA regularization is easy to implement with negligible computational overhead, yet yields considerable performance improvements; (c) the MMA regularization is effective for both the hidden layers and the output layer, decorrelating the filters and enlarging the inter-class separability respectively. Therefore, it can be applied to multiple tasks, such as the image classification and face verification demonstrated in this paper.

To intuitively make sense of the effectiveness of MMA regularization, we visualize the cosine similarity of filters from the first layer of VGG19-BN trained on CIFAR100 in Figure 1. We compare several different methods of angular diversity regularization, including the orthogonal regularization in [34], the MHE regularization in [20], and the proposed MMA regularization. The results show that MMA regularization obtains the most uncorrelated filters. Besides, MMA regularization keeps some negative correlations, which have been verified to be beneficial for neural networks [5].

In summary, the main contributions of this paper are three-fold:

• We propose a numerical method for the Tammes problem, called MMA, which can get near-optimal solutions under arbitrary combinations of the number of points and dimensions.
• We develop the novel MMA regularization which effectively promotes the angular diversity of weight vectors and therefore improves the generalization power of neural networks.
• Various experiments on multiple tasks show that MMA regularization is generally effective and can become a basic regularization method for training neural networks.
2 Related Work

To improve the generalization power of neural networks, many regularization methods have been proposed, such as weight decay [14], decoupled weight decay [21], weight elimination [46], nuclear norm [33], dropout [41], DropConnect [44], adding noise [2], and early stopping [24].

Recently, diversity-promoting regularization approaches have been emerging. These methods mainly penalize the neural network by adding a regularization term to the loss function. The regularization term either promotes the diversity of activations through minimizing the cross-covariance of hidden activations [6], or directly promotes the diversity of neurons or filters through enforcing pairwise orthogonality [34, 48, 19, 47] or minimizing the global potential energy [20, 18]. For many tasks, these methods obtain marginal improvements [34, 31, 49, 4]. Another stream of approaches gets comparatively diverse neurons or filters by cyclically dropping and relearning some of the weights [31, 9, 32], which leads to substantial performance gains but suffers from complex training. In contrast, our proposed simple MMA regularization achieves significant performance improvements while employing the standard training procedure.

The most related work to our method is MHE [20], which also targets the uniform distribution of normalized weight vectors on a hypersphere. However, MHE is inspired by the Thomson problem [43] and models the criterion of uniformity as the minimum global potential energy, which suffers from high computational complexity and many local minima [18]. Inspired by the Tammes problem [42, 22], our proposed MMA regularization models the criterion as maximizing the minimal angles, which is the key reason why our method is more efficient and effective.
3 MMA Regularization

As our proposed regularization is inspired by the Tammes problem, we first analyze the Tammes problem and propose a numerical method called MMA which maximizes the minimal pairwise angles between the vectors. Then we compare several numerical methods for the Tammes problem through gradient analysis, which demonstrates the advantage of the proposed MMA. Finally, we develop a novel angular diversity regularization for neural networks based on the proposed MMA.
3.1 The Tammes Problem

Construction of points spaced uniformly on a unit hypersphere $S^{d-1} \subset \mathbb{R}^d$ ($d \in \{2, 3, 4, \dots\}$) is an important problem for various applications ranging from coding theory to computational geometry [29]. There are many ways to model the criterion of uniformity. One approach is to maximize the minimal pairwise distance between the points [29], i.e.

$$\max \min_{i,j,\, i \neq j} \|\hat{w}_i - \hat{w}_j\|, \quad \text{s.t.}\ \forall i\ \hat{w}_i = \frac{w_i}{\|w_i\|} \quad (1)$$

where $w_i \in \mathbb{R}^{d \times 1}$ denotes the coordinate vector of the $i$-th point, $\hat{w}$ denotes the $l_2$-normalized vector, and $\|\cdot\|$ denotes the Euclidean norm. This criterion means the points on a unit sphere are spaced uniformly when the minimal pairwise distance is maximized, which is known as the Tammes problem [42, 22] or the optimal spherical code [8, 39]. Denoting the dimension by $d$ and the number of points by $n$, we first analyze the analytical solutions for the case of $d \geq n - 1$, and then propose numerical solutions for the case of $d < n - 1$.

The analytical solutions for $d \geq n - 1$. As the distance between any two points on a unit hypersphere is inversely proportional to their cosine similarity, the Tammes problem is equivalent to minimizing the maximal pairwise cosine similarity, i.e.

$$\min \max_{i,j,\, i \neq j} \hat{w}_i \cdot \hat{w}_j, \quad \text{s.t.}\ \forall i\ \hat{w}_i = \frac{w_i}{\|w_i\|} \quad (2)$$

The maximum of $\hat{w}_i \cdot \hat{w}_j$ must be no smaller than the average. Therefore, the minimum is derived as:

$$n(n-1) \max_{i,j,\, i \neq j} \hat{w}_i \cdot \hat{w}_j \geq \sum_{i,j,\, i \neq j} \hat{w}_i \cdot \hat{w}_j = \Big\|\sum_i \hat{w}_i\Big\|^2 - \sum_i \|\hat{w}_i\|^2 = \Big\|\sum_i \hat{w}_i\Big\|^2 - n \geq -n \quad (3)$$

Therefore, the minimum of the maximal pairwise cosine similarity is $-\frac{1}{n-1}$, which can be reached when all pairwise angles between the points are equal to each other and the sum of all vectors is a zero vector. This criterion has a matrix form:

$$C = \hat{W}\hat{W}^T = \begin{pmatrix} 1 & -\frac{1}{n-1} & \cdots & -\frac{1}{n-1} \\ -\frac{1}{n-1} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & -\frac{1}{n-1} \\ -\frac{1}{n-1} & \cdots & -\frac{1}{n-1} & 1 \end{pmatrix}, \quad \text{s.t.}\ \forall i\ \hat{W}_i = \frac{w_i^T}{\|w_i\|} \quad (4)$$

where $\hat{W} \in \mathbb{R}^{n \times d}$ denotes the set of $l_2$-normalized points. According to matrix theory, the eigenvalues of the matrix $C$ are $\lambda = 0$ with algebraic multiplicity 1 and $\lambda = \frac{n}{n-1}$ with algebraic multiplicity $n - 1$. As all the eigenvalues of $C$ are greater than or equal to zero, $C$ is a positive semi-definite matrix. According to the spectral theorem [3], $\hat{W}$ can be obtained through the eigendecomposition of $C$, which is the analytical solution for the Tammes problem. However, since the rank of $C$ is $n - 1$, the rank of $\hat{W}$ and hence the minimum dimension of the points are also $n - 1$. Therefore, this analytical solution only exists for the case of $d \geq n - 1$.
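To make the construction concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper; the function name is hypothetical) that builds the Gram matrix $C$ of Equation (4) and recovers $\hat{W}$ by eigendecomposition for $d = n - 1$:

```python
import numpy as np

def analytical_tammes(n):
    """Build the Gram matrix C of Equation (4) and recover W_hat by
    eigendecomposition, giving n unit vectors in R^(n-1) whose pairwise
    cosine similarity is exactly -1/(n-1)."""
    C = np.full((n, n), -1.0 / (n - 1))
    np.fill_diagonal(C, 1.0)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    # Drop the single (near-)zero eigenvalue; the remaining n-1 columns,
    # scaled by sqrt(eigenvalue), satisfy W_hat @ W_hat.T == C.
    W_hat = eigvecs[:, 1:] * np.sqrt(eigvals[1:])
    return W_hat                             # shape (n, n-1), unit-norm rows

W = analytical_tammes(4)                     # regular tetrahedron on S^2
print(np.round(W @ W.T, 6))                  # off-diagonal entries are all -1/3
```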
Figure 2: Comparison of the gradient norm (scaled by $1/\|w_i\|$) as the pairwise angle varies, for the cosine loss (Equation 7), the MMA loss (Equation 8), the Riesz-Fisher loss with s=1 (Equation 10), and the logarithmic loss (Equation 12). The gradient of the MMA loss is stable and consistent.
The numerical solutions for $d < n - 1$. So far, for the case of $d < n - 1$, analytical solutions for the Tammes problem only exist for some combinations of $n$ and $d$ [39]. For most combinations, the optimal solutions are unknown. Consequently, numerical methods are used to get approximate solutions.

As the objective (Equation 1) of the Tammes problem is not globally differentiable [30], the conventional solution [1] alternatively optimizes a differentiable potential energy function to get approximate solutions, as discussed in the next subsection. Nonetheless, with the help of SGD [36] and modern automatic differentiation libraries [27], we can now directly use Equation (1) to implement the optimization and get approximate solutions. However, the calculation of the Euclidean length is expensive. Alternatively, as mentioned in Equation (2), we can use the cosine similarity as the objective function, called the cosine loss, which is formulated as follows:

$$l_{\text{cosine}} = \frac{1}{n} \sum_{i=1}^{n} \max_{j,\, j \neq i} \text{Cos}_{ij}, \quad \text{Cos} = \hat{W}\hat{W}^T, \quad \text{s.t.}\ \forall i\ \hat{W}_i = \frac{w_i^T}{\|w_i\|} \quad (5)$$

where $\text{Cos} \in \mathbb{R}^{n \times n}$ denotes the cosine similarity matrix of the points. Employing the global maximum similarity as in Equation (2) is inefficient, as it only updates the closest pair of points. Therefore, we alternatively use the average of each vector's maximum similarity.

The cosine loss can be optimized quickly by taking advantage of the matrix form. However, we find this loss is hard to converge, especially for the case that $w_i$ is very close to $w_j$, which is very prevalent in neural networks [35]. As analyzed in the next subsection, this is because the gradient is too small to cover random fluctuations during the optimization. Gaining insight from ArcFace [7], we propose the angular version of the cosine loss as the objective function:

$$l_{\text{MMA}} = -\frac{1}{n} \sum_{i=1}^{n} \min_{j,\, j \neq i} \theta_{ij}, \quad \theta = \arccos(\hat{W}\hat{W}^T), \quad \text{s.t.}\ \forall i\ \hat{W}_i = \frac{w_i^T}{\|w_i\|} \quad (6)$$

where $\theta \in \mathbb{R}^{n \times n}$ denotes the pairwise angle matrix. As this loss maximizes the minimal pairwise angles, we name it the MMA loss for short. The MMA loss is very efficient and robust for optimization, so it is easy to get near-optimal numerical solutions for the Tammes problem. Besides, it can also get close solutions for the case $d \geq n - 1$, which is validated in Section 4. In the next subsection, we demonstrate the advantage of the proposed MMA loss through gradient analysis and comparison.

3.2 Gradient Analysis and Comparison

In this subsection, we analyze and compare the gradients of the loss functions generating approximate solutions for uniformly spaced points. To simplify the derivation, we only consider the norm of the gradient of the core function, composing the summation in the loss functions, w.r.t. the corresponding weight vector $w_i$. For intuitive comparison, the analysis results are presented in Figure 2.

Corresponding to the cosine loss in Equation (5), the gradient norm is derived as follows:

$$\left\|\frac{\partial \text{Cos}_{ij}}{\partial w_i}\right\| = \left\|\frac{\partial \big(\frac{w_i^T w_j}{\|w_i\| \|w_j\|}\big)}{\partial w_i}\right\| = \frac{\|(I - M_{w_i}) w_j\|}{\|w_i\| \|w_j\|} = \frac{\|w_j\| \sin\theta_{ij}}{\|w_i\| \|w_j\|} = \frac{\sin\theta_{ij}}{\|w_i\|}, \quad M_{w_i} = \frac{w_i w_i^T}{\|w_i\|^2} \quad (7)$$

where $M_{w_i}$ represents the projection matrix of $w_i$. From the above derivation and Figure 2, we can see that the gradient norm is very small when the pairwise angle is close to zero.
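This vanishing gradient can be checked numerically with automatic differentiation. The following is a small PyTorch sketch (our own illustration; the test vectors and function name are hypothetical) that backpropagates through the cosine similarity and reproduces the $\sin\theta_{ij}/\|w_i\|$ norm of Equation (7):

```python
import torch

def cosine_grad_norm(theta_deg: float) -> float:
    """Gradient norm of Cos_ij w.r.t. w_i for a pair of unit vectors
    separated by angle theta; Equation (7) predicts sin(theta) / ||w_i||."""
    theta = torch.deg2rad(torch.tensor(theta_deg))
    w_i = torch.tensor([1.0, 0.0], requires_grad=True)
    w_j = torch.stack([torch.cos(theta), torch.sin(theta)])  # fixed neighbor
    cos_ij = torch.dot(w_i, w_j) / (w_i.norm() * w_j.norm())
    cos_ij.backward()
    return w_i.grad.norm().item()

for deg in (1.0, 10.0, 90.0):
    # The two columns agree; at 1 degree the gradient norm is only ~0.017.
    print(deg, cosine_grad_norm(deg),
          torch.sin(torch.deg2rad(torch.tensor(deg))).item())
```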
That is why the cosine loss is hard to converge for the case that $w_i$ and $w_j$ are close to each other, as shown experimentally in Section 4. Next, we derive the gradient norm corresponding to the MMA loss in Equation (6):

$$\left\|\frac{\partial \theta_{ij}}{\partial w_i}\right\| = \left\|\frac{\partial \theta_{ij}}{\partial \cos\theta_{ij}} \cdot \frac{\partial \cos\theta_{ij}}{\partial w_i}\right\| = \frac{1}{\sin\theta_{ij}} \cdot \frac{\sin\theta_{ij}}{\|w_i\|} = \frac{1}{\|w_i\|} \quad (8)$$

Compared to the gradient norm of the cosine loss in Equation (7), the gradient norm of the MMA loss is independent of the pairwise angle $\theta_{ij}$, so it does not encounter a vanishingly small gradient even when $\theta_{ij}$ is zero. Figure 2 shows that the gradient norm corresponding to the MMA loss is stable and consistent. Therefore, the MMA loss is easy to optimize and gets near-optimal solutions for the Tammes problem, as verified by the experiments in Section 4.

In addition to the above two loss functions, we also analyze the Riesz-Fisher loss [29] and the logarithmic loss [29], which are often used to get uniformly distributed points on a hypersphere. The philosophy behind the two loss functions is that the points on a hypersphere are uniformly spaced when the potential energy is minimal, and both of them are formulated as kernel functions of the potential energy. The Riesz-Fisher loss and the corresponding gradient norm are:

$$l_{\text{RF}} = \frac{1}{n(n-1)} \sum_{i \neq j} \|\hat{w}_i - \hat{w}_j\|^{-s}, \quad s > 0 \quad (9)$$

$$\left\|\frac{\partial \|\hat{w}_i - \hat{w}_j\|^{-s}}{\partial w_i}\right\| = \frac{s}{\|w_i\|} \cdot \frac{\cos\frac{\theta_{ij}}{2}}{(2\sin\frac{\theta_{ij}}{2})^{s+1}} \quad (10)$$

where $s$ is a hyperparameter, set to 1 in Figure 2 for easy comparison. The logarithmic loss and the corresponding gradient norm are:

$$l_{\text{log}} = -\frac{1}{n(n-1)} \sum_{i \neq j} \log \|\hat{w}_i - \hat{w}_j\| \quad (11)$$

$$\left\|\frac{\partial \log \|\hat{w}_i - \hat{w}_j\|}{\partial w_i}\right\| = \frac{1}{\|w_i\|} \cdot \frac{\cos\frac{\theta_{ij}}{2}}{2\sin\frac{\theta_{ij}}{2}} \quad (12)$$

Due to limited space, more details of the derivation are presented in the supplementary material. As visualized in Figure 2, the Riesz-Fisher loss and the logarithmic loss have similar properties: the gradient norm is sharp around angles near zero and drops rapidly as the angle increases. Besides, the greater the $s$ of the Riesz-Fisher loss is, the sharper the gradient norm becomes. The very large gradient norm at angles near zero can cause instability and prevent the normal learning of neural networks, and the very small gradient norm at angles away from zero makes the updates inefficient. We argue that this is why the two loss functions obtain only inaccurate solutions for the Tammes problem in Section 4 and perform less well in terms of accuracy in Table 3.

3.3 MMA Regularization for Neural Networks

In this subsection, we develop the MMA regularization for neural networks, which promotes learning towards uniformly distributed weight vectors in angular space. For $d \geq n - 1$, we could employ the cosine similarity matrix in Equation (4) to constrain the weights. However, as the MMA loss can generate accurate approximate solutions in any case and is easy to implement, we uniformly exploit the MMA loss of Equation (6) as the angular regularization:

$$l_{\text{MMA\_regularization}} = \lambda \sum_{i=1}^{L} l_{\text{MMA}}(W_i) \quad (13)$$

where $\lambda$ denotes the regularization coefficient, $L$ denotes the total number of layers, including convolutional layers and fully connected layers, and $W_i$ denotes the weight matrix of the $i$-th layer with each row denoting a vectorized filter or neuron.
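As a concrete illustration, here is a minimal PyTorch sketch of Equations (6) and (13). This is our own illustration under the assumption that each weight matrix is flattened to one row per filter; the function names and the clamping constant are our choices, not from an official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mma_loss(weight: torch.Tensor) -> torch.Tensor:
    """MMA loss of Equation (6): the negative mean, over rows, of each
    row's minimal pairwise angle. `weight` has shape (n, d)."""
    w_hat = F.normalize(weight, dim=1)                # l2-normalize each row
    cos = w_hat @ w_hat.t()                           # pairwise cosine matrix
    cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)              # keep arccos differentiable
    cos = cos - 2.0 * torch.eye(cos.size(0), device=cos.device)  # mask self-pairs
    min_angle = torch.acos(cos.max(dim=1).values)     # min angle = acos(max cos)
    return -min_angle.mean()

def mma_regularization(model: nn.Module, coeff: float) -> torch.Tensor:
    """Equation (13): coefficient times the sum of per-layer MMA losses
    over convolutional and fully connected layers."""
    reg = 0.0
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            reg = reg + mma_loss(m.weight.flatten(1)) # one row per filter/neuron
    return coeff * reg
```

The diagonal is pushed below -1 before taking the row-wise maximum so that a vector's self-similarity is never selected as its closest neighbor.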
The MMA regularization is complementary and orthogonal to weight decay [14]. Weight decay regularizes the Euclidean norm of the weight vectors, while MMA regularization promotes the directional diversity of the weight vectors. MMA regularization can be applied to both hidden layers and the output layer. For hidden layers, MMA regularization can reduce the redundancy of filters, which is very common in neural networks [35]. Consequently, the unnecessary overlap in the features captured by the network's filters is diminished. For the output layer, MMA regularization can maximize the inter-class separability and therefore enhance the discriminative power of neural networks.

4 Experiments on the Tammes Problem

This section compares several numerical methods for the Tammes problem, measured by the minimum angle, as shown in Table 1. The first column denotes the dimension $d$ and the second column denotes the number of points $n$. The third column lists the minimal pairwise angles of the optimal solutions collected in [39]. The remaining columns are the minimum angles obtained by several different numerical methods, including the MMA loss in Equation (6), the cosine loss in Equation (5), the Riesz-Fisher loss with $s = 2$ in Equation (9), and the logarithmic loss in Equation (11). The weights are initialized with values drawn from the standard normal distribution and then optimized by SGD [36] for 10000 iterations. The initial learning rate is set to 0.1 and reduced by a factor of 5 once learning stagnates, and the momentum is set to 0.9.
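Under that setup, the optimization loop can be sketched as follows (our own illustration, reusing `mma_loss` from the Section 3.3 sketch; the plateau scheduler is one way to realize "reduced by a factor of 5 once learning stagnates"):

```python
import torch

def solve_tammes(n: int, d: int, iters: int = 10000) -> float:
    """Approximate the Tammes problem for n points in d dimensions by
    minimizing the MMA loss of Equation (6); returns the minimum
    pairwise angle (degrees) of the resulting configuration."""
    torch.manual_seed(0)
    w = torch.randn(n, d, requires_grad=True)          # standard normal init
    opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.2)
    for _ in range(iters):
        loss = mma_loss(w)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step(loss.item())                        # cut lr on stagnation
    with torch.no_grad():
        w_hat = torch.nn.functional.normalize(w, dim=1)
        cos = w_hat @ w_hat.t() - 2.0 * torch.eye(n)   # mask self-pairs
        return torch.rad2deg(torch.acos(cos.max().clamp(-1.0, 1.0))).item()

print(solve_tammes(n=4, d=3))  # close to arccos(-1/3), about 109.47 degrees
```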
Figure 3: Coefficient tuning for VGG19-BN (TOP-1 accuracy on CIFAR100 versus the coefficient of MMA regularization).
Figure 4: Coefficient tuning for ResNet20 (TOP-1 accuracy on CIFAR100 versus the coefficient of MMA regularization).

Table 1: Minimum angle (degree) obtained by several different loss functions for the Tammes problem. The best results are highlighted in bold.

    d    n    optimal    l_MMA    l_cosine    l_RF    l_log

For $d \geq n - 1$, as analyzed in Section 3.1, each pairwise angle of the optimal solutions is $\arccos(-\frac{1}{n-1})$, verified by the second row ($d$=3, $n$=4), the fifth row ($d$=4, $n$=5), and the ninth row ($d$=5, $n$=6), from which we can observe that the optimal solutions can be easily achieved by any of the four numerical methods. For $d < n - 1$, all the numerical solutions are more or less worse than the optimal solutions. However, the MMA loss robustly obtains the solutions closest to the optimum. The cosine loss can also achieve very close solutions, but it is not robust for cases with too many points, like the third row ($d$=3, $n$=30), the fourth row ($d$=3, $n$=130), and the eighth row ($d$=4, $n$=600). This is due to the too-small gradient, as analyzed in Section 3.2. The Riesz-Fisher loss and the logarithmic loss are also robust, but they converge to solutions far from the optimal.

5 Image Classification

5.1 Experimental Settings

We conduct image classification experiments on CIFAR100 [13] and TinyImageNet [15]. For both datasets, we follow the simple data augmentation in [16]. We employ various classic networks as the backbone networks, including ResNet56 [10], VGG19 [38] with batch normalization [12] denoted by VGG19-BN, VGG16 with batch normalization denoted by VGG16-BN, WideResNet [50] with 16 layers and a widening factor of 8 denoted by WRN-16-8, and DenseNet [11] with 40 layers and a growth rate of 12 denoted by DenseNet-40-12. We denote the corresponding MMA regularization version of a model by X-MMA. For fair comparison, not only the X-MMA models but also the corresponding backbones are trained from scratch, so our results may be slightly different from the ones presented in the original papers due to different random seeds and hardware settings. For CIFAR100, the hyperparameters and settings are the same as in the original papers. For TinyImageNet, we follow the settings in [45]. Besides, all the random seeds are fixed, so the experiments are reproducible and comparisons are absolutely fair. Moreover, in order to reduce the variance of evaluation, we employ the average accuracy of the last five epochs as the evaluation criterion.
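In training, the regularizer simply adds to the task loss. The following is a minimal sketch of one training step (our own illustration; `mma_regularization` is the sketch from Section 3.3, and the coefficient values follow the ablation below):

```python
import torch.nn.functional as F

def train_step(model, images, labels, optimizer, coeff=0.07):
    """One SGD step with the MMA regularization term of Equation (13)
    added to the cross-entropy loss."""
    optimizer.zero_grad()
    logits = model(images)
    loss = F.cross_entropy(logits, labels) + mma_regularization(model, coeff)
    loss.backward()
    optimizer.step()
    return loss.item()
```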
5.2 Ablation Study

To understand the behavior of MMA regularization, we conduct comprehensive ablation experiments on CIFAR100. Except where otherwise noted, we use VGG19-BN for the ablation experiments.
Impact of the hyperparameter.
The MMA regularization coefficient λ is the only hyperparameter. As skip connections have been shown to implicitly promote the angular diversity of neurons [26], we separately select VGG19-BN and ResNet20 to investigate the impact of different coefficients for models without and with skip connections, as shown in Figure 3 and Figure 4 respectively. From both figures, we can see that the effect of MMA regularization with too-small coefficients is not obvious, while too-large coefficients improve performance only slightly or even decrease it. This is because too strong a regularization prevents the normal learning of neural networks to some extent. For VGG19-BN, MMA regularization is not very sensitive to the hyperparameter and works well from 0.03 to 0.2, demonstrating the robustness of MMA regularization. For ResNet20, it is more sensitive because of the skip connections. In the following experiments, we set the MMA regularization coefficient to 0.07 for VGG models and 0.03 for models with skip connections.

Table 2: Accuracy (%) of applying MMA regularization to different layers.

    Model           TOP-1    TOP-5
    baseline        72.08    90.50
    hidden          73.45    90.91
    hidden+output

The MMA regularization is applicable to both the hidden layers and the output layer. In Table 2, we study the effect of MMA regularization applied to the hidden layers (hidden) and to all layers (hidden+output). The results show that the hidden version improves over the VGG19-BN baseline by a considerable margin and, moreover, the hidden+output version improves the performance further. This indicates that the MMA regularization is effective for both the hidden layers and the output layer, and that the effects accumulate. As analyzed in Section 3.3, the effectiveness for hidden layers comes from decorrelating the filters or neurons, and the effectiveness for the output layer comes from enlarging the inter-class separability.
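The hidden versus hidden+output ablation of Table 2 can be expressed by restricting which layers the regularizer visits. A minimal sketch (ours; it assumes the classifier is the last Conv2d/Linear module in traversal order, which holds for standard VGG/ResNet implementations):

```python
import torch.nn as nn

def mma_regularization_split(model, coeff, include_output=True):
    """Apply the MMA loss to hidden layers only, or to hidden plus the
    output (classification) layer, as in the Table 2 ablation."""
    layers = [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    if not include_output:
        layers = layers[:-1]   # drop the final classification layer
    return coeff * sum(mma_loss(m.weight.flatten(1)) for m in layers)
```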
Comparison with other angular regularization.
This part compares several angular regularizations from the perspective of computation time per batch, occupied memory, accuracy, and the minimum pairwise angles of several layers, as shown in Table 3. Besides the MMA regularization, we also consider the MHE regularization [20] and the widely used orthogonal regularization [34, 48, 19, 47], which also penalize the pairwise angles. MHE actually uses the Riesz-Fisher loss (s > 0) or the logarithmic loss (s = 0) to implement the regularization [20]. The orthogonal regularization promotes all the pairwise weight vectors to be orthogonal. Here, we adopt the orthogonal regularization in [34]:

$$l_{\text{orthogonal}} = \lambda \sum_{i=1}^{L} \|\hat{W}_i \hat{W}_i^T - I\|_F \quad (14)$$

where $\hat{W}_i$ denotes the $l_2$-normalized weight matrix of the $i$-th layer, $I$ denotes the identity matrix, and $\|\cdot\|_F$ denotes the Frobenius norm.

Table 3: Comparison of several different methods of angular regularization. The MMA achieves the most diverse filters and the highest accuracy with negligible computational overhead.

    Regularization   Time(s)/Batch   Memory(MiB)   Accuracy (%)       Minimum Angle (degree)
                                                   TOP-1    TOP-5     L3-3   L4-3   L5-3   Classify
    baseline         0.070           1127          72.08    90.50     70.1   16.0   30.8   54.0
    MMA              0.095           1229

The minimum angles are measured at three hidden layers and the classification layer, denoted by L3-3, L4-3, L5-3, and Classify respectively. These experiments are based on PyTorch [27] and an NVIDIA GeForce GTX 1080 GPU. Compared to the baseline, the MMA regularization and the orthogonal regularization slightly increase the computation time and occupied memory, whereas the MHE regularization increases them greatly due to the computation of all pairwise distances. In terms of accuracy, the MMA regularization improves over the baseline by a substantial margin. The orthogonal regularization is also effective but inferior to the MMA regularization. The MHE regularization is merely comparable to the baseline, which may be because of the unstable gradient analyzed in Section 3.2. We also observe a strong link between the minimal pairwise angles in hidden layers and the accuracy: the larger the minimal angles, the higher the accuracy. This is because larger minimal angles mean more diverse filters, which improve the generalizability of models. The MMA regularization is also the most effective at enlarging the minimal pairwise angle of the classification layer, which increases the inter-class separability and enhances the discriminative power of neural networks. More plots and comparisons of the minimal pairwise angles are shown in the supplementary material.
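For reference, the per-layer term of the orthogonal regularizer in Equation (14) used in this comparison admits an equally short sketch (our own illustration):

```python
import torch
import torch.nn.functional as F

def orthogonal_loss(weight: torch.Tensor) -> torch.Tensor:
    """Per-layer term of Equation (14): Frobenius distance between the
    normalized Gram matrix and the identity."""
    w_hat = F.normalize(weight.flatten(1), dim=1)
    gram = w_hat @ w_hat.t()
    eye = torch.eye(gram.size(0), device=gram.device)
    return torch.norm(gram - eye, p='fro')
```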
5.3 Results and Analysis

Table 4: Accuracy (%) on CIFAR100.

    Model                 TOP-1    TOP-5
    ResNet56              70.39    91.12
    ResNet56-MMA
    VGG19-BN              72.08    90.50
    VGG19-BN-MMA
    WRN-16-8              78.97    94.84
    WRN-16-8-MMA
    DenseNet-40-12        73.98    92.74
    DenseNet-40-12-MMA
We first compare various modern architectures with their MMA regularization versions on CIFAR100. From the results shown in Table 4, we can see that X-MMA typically improves the corresponding backbone model. In particular, MMA regularization improves the TOP-1 accuracy of VGG19-BN by 1.65%. MMA regularization is also able to robustly improve the performance of models with skip connections like ResNet, DenseNet, and WideResNet, although the improvement is not as distinct as for VGG. This is because the skip connections have implicitly reduced feature correlations to some extent [26].

Table 5: Accuracy (%) on TinyImageNet.

    Model            TOP-1    TOP-5
    ResNet56         54.80    78.71
    ResNet56-MMA
    VGG16-BN         62.16    82.41
    VGG16-BN-MMA
To further demonstrate the consistency of MMA's superiority, we also evaluate MMA regularization with ResNet56 and VGG16-BN on TinyImageNet, with coefficients of 0.01 and 0.07 respectively. The results are reported in Table 5, where the X-MMA models consistently outperform the original backbones on both TOP-1 and TOP-5 accuracy. It is worth emphasizing that the X-MMA models achieve these improvements with negligible computational overhead and without modifying the original network architecture.
6 Face Verification

ArcFace [7] is one of the state-of-the-art face verification methods, which proposes an additive angular margin between the learned feature and the target weight vector in the classification layer. This method essentially encourages intra-class feature compactness by promoting the learned features to be close to the target weight vectors. As analyzed in Section 3.3, MMA regularization can achieve diverse weight vectors and therefore improve the inter-class separability of the classification layer. Consequently, MMA regularization is complementary to the objective of ArcFace and should boost accuracy further. Motivated by this analysis, we propose ArcFace+ by applying MMA regularization to ArcFace. The objective function of ArcFace+ is defined as:

$$l_{\text{arcface+}} = l_{\text{arcface}}(m) + \lambda\, l_{\text{MMA}}(W_{\text{classify}}) \quad (15)$$

where $m$ is the angular margin of ArcFace, $\lambda$ is the regularization coefficient, and $W_{\text{classify}}$ is the weight matrix of the classification layer.

For fair comparison, both ArcFace and ArcFace+ are trained from scratch, so our results for ArcFace may be slightly different from the ones presented in the original paper due to different settings and hardware. The implementation settings are detailed in the supplementary material.

Table 6: Comparison of verification results (%).

    Method      LFW      CFP-FP    AgeDB-30
    ArcFace     99.35    95.30     94.62
    ArcFace+

From the results shown in Table 6, we can see that ArcFace+ outperforms ArcFace across all three verification datasets by margins that are very significant in the field of face verification. This comparison validates the effectiveness of MMA regularization in feature learning. Note that these results are obtained with the default coefficient 0.03; we argue the results may be better with hyperparameter tuning.
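In code, Equation (15) is a one-line addition on top of an existing ArcFace head. A sketch (ours; `arcface_loss` stands for whatever ArcFace implementation is in use, and `mma_loss` is the Section 3.3 sketch):

```python
def arcface_plus_loss(arcface_loss, classifier_weight, coeff=0.03):
    """Equation (15): the ArcFace objective plus the MMA loss on the
    classification-layer weight matrix."""
    return arcface_loss + coeff * mma_loss(classifier_weight)
```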
7 Conclusion

In this paper, we propose a novel regularization method for neural networks, called MMA regularization, to encourage an angularly uniform distribution of weight vectors and therefore decorrelate the filters or neurons. The MMA regularization has a stable and consistent gradient, is easy to implement with negligible computational overhead, and is effective for both the hidden layers and the output layer. Extensive experiments on image classification demonstrate that MMA regularization is able to enhance the generalization power of neural networks by considerable margins. Moreover, MMA regularization is also effective for feature learning, with significant margins due to enlarging the inter-class separability. As MMA can be viewed as a basic regularization method for neural networks, we will explore its effectiveness on other tasks, such as object detection, object tracking, and image captioning.

References

[1] Patrick Guy Adams. A numerical approach to Tammes' problem in Euclidean n-space. 1997.
[2] Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3):643–674, 1996.
[3] Sheldon Jay Axler. Linear Algebra Done Right, volume 2. Springer, 1997.
[4] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
[5] Mircea I Chelaru and Valentin Dragoi. Negative correlations in visual cortical networks. Cerebral Cortex, 26(1):246–256, 2016.
[6] Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068, 2015.
[7] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[8] Thomas Ericson and Victor Zinoviev. Codes on Euclidean Spheres. Elsevier, 2001.
[9] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, et al. DSD: Dense-sparse-dense training for deep neural networks. arXiv preprint arXiv:1607.04381, 2016.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[13] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[14] Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950–957, 1992.
[15] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 2015.
[16] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
[17] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. 2017.
[18] Rongmei Lin, Weiyang Liu, Zhen Liu, Chen Feng, Zhiding Yu, James M Rehg, Li Xiong, and Le Song. Regularizing neural networks via minimizing hyperspherical energy. arXiv preprint arXiv:1906.04892, 2020.
[19] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. In Advances in Neural Information Processing Systems, pages 3950–3960, 2017.
[20] Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. In Advances in Neural Information Processing Systems, pages 6222–6233, 2018.
[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[22] L Lovisolo and EAB Da Silva. Uniform distribution of points on a hyper-sphere with applications to vector bit-plane encoding. IEE Proceedings - Vision, Image and Signal Processing, 148(3):187–193, 2001.
[23] Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.
[24] Nelson Morgan and Hervé Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In Advances in Neural Information Processing Systems, pages 630–637, 1990.
[25] Oleg R Musin and Alexey S Tarasov. The Tammes problem for n = 14. Experimental Mathematics, 24(4):460–468, 2015.
[26] A Emin Orhan and Xaq Pitkow. Skip connections eliminate singularities. arXiv preprint arXiv:1701.09175, 2017.
[27] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
[28] Federico Pernici, Matteo Bruni, Claudio Baecchi, and Alberto Del Bimbo. Fix your features: Stationary and maximally discriminative embeddings using regular polytope (fixed classifier) networks. arXiv preprint arXiv:1902.10441, 2019.
[29] Marko D Petković and Nenad Živić. The Fekete problem and construction of the spherical coverage by cones. Facta Universitatis - Series: Mathematics and Informatics, 28(4):393–402, 2013.
[30] János D Pintér. Globally optimized spherical point arrangements: model variants and illustrative results. Annals of Operations Research, 104(1-4):213–230, 2001.
[31] Aaditya Prakash, James Storer, Dinei Florencio, and Cha Zhang. RePr: Improved training of convolutional filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10666–10675, 2019.
[32] Siyuan Qiao, Zhe Lin, Jianming Zhang, and Alan L Yuille. Neural rejuvenation: Improving deep network training by enhancing computational resource utilization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 61–71, 2019.
[33] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[34] Pau Rodríguez, Jordi Gonzàlez, Guillem Cucurull, Josep M. Gonfaus, and F. Xavier Roca. Regularizing CNNs with locally constrained decorrelations. In International Conference on Learning Representations, 2017.
[35] Aruni RoyChowdhury, Prakhar Sharma, and Erik G. Learned-Miller. Reducing duplicate filters in deep neural networks. In NIPS Workshop on Deep Learning: Bridging Theory and Practice, 2017.
[36] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[37] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In International Conference on Machine Learning, pages 2217–2225, 2016.
[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[39] N.J.A. Sloane, R.H. Hardin, W.D. Smith, et al. Tables of spherical codes. See http://neilsloane.com/packings/, 2000.
[40] Steve Smale. Mathematical problems for the next century. The Mathematical Intelligencer, 20(2):7–15, 1998.
[41] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[42] Pieter Merkus Lambertus Tammes. On the origin of number and arrangement of the places of exit on the surface of pollen-grains. Recueil des Travaux Botaniques Néerlandais, 27(1):1–84, 1930.
[43] Joseph John Thomson. XXIV. On the structure of the atom: an investigation of the stability and periods of oscillation of a number of corpuscles arranged at equal intervals around the circumference of a circle; with application of the results to the theory of atomic structure. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 7(39):237–265, 1904.
[44] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
[45] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376, 2020.
[46] Andreas S Weigend, David E Rumelhart, and Bernardo A Huberman. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems, pages 875–882, 1991.
[47] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6176–6185, 2017.
[48] Pengtao Xie, Yuntian Deng, Yi Zhou, Abhimanu Kumar, Yaoliang Yu, James Zou, and Eric P Xing. Learning latent space models with angular constraints. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3799–3810. JMLR.org, 2017.
[49] Pengtao Xie, Barnabas Poczos, and Eric P Xing. Near-orthogonality regularization in kernel methods. In UAI, volume 3, page 6, 2017.
[50] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Supplementary Material
A Detailed Derivation of Equation (10) and Equation (12)
The detailed derivation of Equation (10) is as follows:

$$\left\|\frac{\partial \|\hat{w}_i - \hat{w}_j\|^{-s}}{\partial w_i}\right\| = \left\|\frac{\partial \|\hat{w}_i - \hat{w}_j\|^{-s}}{\partial \|\hat{w}_i - \hat{w}_j\|} \cdot \frac{\partial \|\hat{w}_i - \hat{w}_j\|}{\partial (\hat{w}_i - \hat{w}_j)} \cdot \frac{\partial (\hat{w}_i - \hat{w}_j)}{\partial \hat{w}_i} \cdot \frac{\partial \hat{w}_i}{\partial w_i}\right\|$$

$$= \left\|\frac{-s}{\|\hat{w}_i - \hat{w}_j\|^{s+1}} \cdot \frac{(\hat{w}_i - \hat{w}_j)^T}{\|\hat{w}_i - \hat{w}_j\|} \cdot I \cdot \frac{I - M_{w_i}}{\|w_i\|}\right\| = \frac{s \|(I - M_{w_i})\hat{w}_j\|}{\|w_i\| \|\hat{w}_i - \hat{w}_j\|^{s+2}}$$

$$= \frac{s}{\|w_i\|} \cdot \frac{\sin\theta_{ij}}{(2\sin\frac{\theta_{ij}}{2})^{s+2}} = \frac{s}{\|w_i\|} \cdot \frac{\cos\frac{\theta_{ij}}{2}}{(2\sin\frac{\theta_{ij}}{2})^{s+1}}, \quad \text{with}\ M_{w_i} = \frac{w_i w_i^T}{\|w_i\|^2}$$

The detailed derivation of Equation (12) is as follows:

$$\left\|\frac{\partial \log \|\hat{w}_i - \hat{w}_j\|}{\partial w_i}\right\| = \frac{1}{\|\hat{w}_i - \hat{w}_j\|} \left\|\frac{\partial \|\hat{w}_i - \hat{w}_j\|^{-(-1)}}{\partial w_i}\right\| = \frac{1}{\|\hat{w}_i - \hat{w}_j\|} \cdot \frac{|-1|}{\|w_i\|} \cdot \frac{\cos\frac{\theta_{ij}}{2}}{(2\sin\frac{\theta_{ij}}{2})^{0}} = \frac{1}{\|w_i\|} \cdot \frac{\cos\frac{\theta_{ij}}{2}}{2\sin\frac{\theta_{ij}}{2}}$$
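The closed form of Equation (10) can also be cross-checked numerically with automatic differentiation. A small PyTorch sketch (our own illustration with arbitrary test vectors):

```python
import torch

def riesz_grad_norm(theta_deg: float, s: float = 1.0):
    """Compare the autograd gradient norm of ||w_i_hat - w_j_hat||^(-s)
    w.r.t. w_i against the closed form of Equation (10)."""
    theta = torch.deg2rad(torch.tensor(theta_deg))
    w_i = torch.stack([torch.cos(theta), torch.sin(theta)]).requires_grad_()
    w_j = torch.tensor([1.0, 0.0])                  # unit neighbor at angle theta
    dist = (w_i / w_i.norm() - w_j).norm()
    (dist ** (-s)).backward()
    autograd_norm = w_i.grad.norm().item()
    half = theta / 2                                # ||w_i|| = 1 in this setup
    closed_form = (s * torch.cos(half) / (2 * torch.sin(half)) ** (s + 1)).item()
    return autograd_norm, closed_form               # the two values agree

print(riesz_grad_norm(30.0))
```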
B Dataset Description of Section 5

We conduct our image classification experiments on CIFAR100 [4] and TinyImageNet [5]. CIFAR100 consists of 50k training and 10k test images of 32 × 32 pixels. We present experiments trained on the training set and evaluated on the test set. The TinyImageNet dataset is a subset of the ILSVRC2012 classification dataset [8]. It consists of 200 object classes, and each class has 500 training images, 50 validation images, and 50 test images. All images have been downsampled to 64 × 64 pixels. As the labels for the test set are not released, we present experiments trained on the training set and evaluated on the validation set. For both datasets, we follow the simple data augmentation in [6]. For training, 4 pixels are padded on each side, and a 32 × 32 crop for CIFAR100 or a 64 × 64 crop for TinyImageNet is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32 × 32 image for CIFAR100 or 64 × 64 image for TinyImageNet. Note that our focus is on the effectiveness of our proposed MMA regularization, not on pushing state-of-the-art results, so we do not use any additional data augmentation or training tricks to improve accuracy.
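This augmentation corresponds to standard torchvision transforms. A sketch for CIFAR100 (replace 32 with 64 for TinyImageNet):

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),     # pad 4 pixels per side, random crop
    T.RandomHorizontalFlip(),        # random horizontal flip
    T.ToTensor(),
])
test_transform = T.ToTensor()        # single view of the original image
```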
C Implementation Settings of Section 6

We employ CASIA [10] as the training dataset and LFW [3], CFP-FP [9], and AgeDB-30 [7] as the face verification datasets. For the embedding network, we employ ResNet50 [2]. The angular margin m is set to 0.5 according to the ArcFace paper [1]. The regularization coefficient λ is set to 0.03. Other hyperparameters and settings exactly follow the ArcFace paper [1], except for the batch size and learning schedule. Due to hardware limits, we set the batch size to 440 (the ArcFace paper uses 512). Accordingly, we finish the training process at 38K iterations and decay the learning rate by a factor of 10 at 23750 and 33250 iterations, to ensure the same number of training samples.

D Supplement to Section 5.2: Comparison of the Minimal Pairwise Angle
Figure 1: Comparison of the minimal pairwise angle from all layers of VGG19-BN trained on CIFAR100 with several different diversity regularizations. The MMA regularization gets the largest minimal pairwise angle consistently across all layers, and therefore the most diverse weight vectors.

References

[1] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[3] Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. 2008.
[4] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[5] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 2015.
[6] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
[7] Stylianos Moschoglou, Athanasios Papaioannou, Christos Sagonas, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. AgeDB: The first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 51–59, 2017.
[8] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[9] Soumyadip Sengupta, Jun-Cheng Chen, Carlos Castillo, Vishal M Patel, Rama Chellappa, and David W Jacobs. Frontal to profile face verification in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
[10] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.