Stop memorizing: A data-dependent regularization framework for intrinsic pattern learning
Wei Zhu, Qiang Qiu, Bao Wang, Jianfeng Lu, Guillermo Sapiro, Ingrid Daubechies
Wei Zhu
Department of Mathematics, Duke University
[email protected]

Qiang Qiu
Department of Electrical Engineering, Duke University
[email protected]

Bao Wang
Department of Mathematics, University of California, Los Angeles
[email protected]

Jianfeng Lu
Department of Mathematics, Department of Physics, and Department of Chemistry, Duke University
[email protected]

Guillermo Sapiro
Department of Electrical Engineering, Duke University
[email protected]

Ingrid Daubechies
Department of Mathematics, Duke University
[email protected]
Abstract
Deep neural networks (DNNs) typically have enough capacity to fit random data by brute force, even when conventional data-dependent regularizations focusing on the geometry of the features are imposed. We find that the reason for this is the inconsistency between the enforced geometry and the standard softmax cross entropy loss. To resolve this, we propose a new framework for data-dependent DNN regularization, the Geometrically-Regularized-Self-Validating neural Network (GRSVNet). During training, the geometry enforced on one batch of features is simultaneously validated on a separate batch using a validation loss consistent with that geometry. We study a particular case of GRSVNet, the Orthogonal-Low-rank-Embedding (OLE)-GRSVNet, which is capable of producing highly discriminative features residing in orthogonal low-rank subspaces. Numerical experiments show that OLE-GRSVNet outperforms DNNs with conventional regularization when trained on real data. More importantly, unlike conventional DNNs, OLE-GRSVNet refuses to memorize random data or random labels, suggesting that it learns only intrinsic patterns by reducing the memorizing capacity of the baseline DNN.
1 Introduction

It remains an open question why DNNs, typically with far more model parameters than training samples, can achieve such small generalization error. Previous work used various complexity measures from statistical learning theory, such as the VC dimension [17], Rademacher complexity [2], and uniform stability [3, 11], to provide an upper bound for the generalization error, suggesting that the effective capacity of DNNs, possibly with some regularization techniques, is usually limited. However, the experiments by Zhang et al. [20] showed that, even with data-independent regularization, DNNs can perfectly fit the training data when the true labels are replaced by random labels, or when the training data are replaced by Gaussian noise. This suggests that DNNs with data-independent regularization have enough capacity to "memorize" the training data. This poses an interesting question for network regularization design: is there a way for DNNs to refuse to (over)fit training samples with random labels, while exhibiting better generalization power than conventional DNNs when trained with true labels? Such networks are very important because they would extract only intrinsic patterns from the training data instead of memorizing miscellaneous details.
One would expect that data-dependent regularizations should be a better choice for reducing the memorizing capacity of DNNs. Such regularizations are typically enforced by penalizing the standard softmax cross entropy loss with an extra geometric loss that regularizes the feature geometry [9, 21, 19]. However, regularizing DNNs with an extra geometric loss has two disadvantages. First, the output of the softmax layer, usually viewed as a probability distribution, is typically inconsistent with the feature geometry enforced by the geometric loss; therefore, the geometric loss typically receives a small weight to avoid jeopardizing the minimization of the softmax loss. Second, we find that DNNs with such regularization can still perfectly (over)fit random training samples or random labels. The reason is that the geometric loss (because of its small weight) is ignored and only the softmax loss is minimized.

This suggests that simply penalizing the softmax loss with a geometric loss is not sufficient to regularize DNNs. Instead, the softmax loss should be replaced by a validation loss that is consistent with the enforced geometry. More specifically, every training batch B is split into two sub-batches, the geometry batch B_g and the validation batch B_v. The geometric loss l_g is imposed on the features of B_g so that they exhibit a desired geometric structure. A semi-supervised learning algorithm based on the proposed feature geometry is then used to generate a predicted label distribution for the validation batch, which, combined with the true labels, defines a validation loss l_v on B_v. The total loss on the training batch B is then defined as the weighted sum l = l_g + λ l_v. Because the predicted label distribution on B_v is based on the enforced geometry, the geometric loss l_g can no longer be neglected. Therefore, l_g and l_v are minimized simultaneously, i.e., the geometry is correctly enforced (small l_g) and it can be used to predict validation samples (small l_v). We call such DNNs Geometrically-Regularized-Self-Validating neural Networks (GRSVNets). See Figure 1a for a visual illustration of the network architecture.

GRSVNet is a general architecture because every consistent geometry/validation pair fits into this framework as long as the loss functions are differentiable. In this paper, we focus on a particular type of GRSVNet, the Orthogonal-Low-rank-Embedding-GRSVNet (OLE-GRSVNet). More specifically, we impose the OLE loss [12] on the geometry batch to produce features residing in orthogonal subspaces, and we use the principal angles between the validation features and those subspaces to define a predicted label distribution on the validation batch. We prove that the loss function attains its minimum if and only if the subspaces of different classes spanned by the features in the geometry batch are orthogonal, and the features in the validation batch reside perfectly in the subspaces corresponding to their labels (see Figure 1e).
We show in our experiments that OLE-GRSVNet has better generalization performance when trained on real data, but it refuses to memorize the training samples when given random training data or random labels, which suggests that OLE-GRSVNet effectively learns intrinsic patterns.

Our contributions can be summarized as follows:

• We propose a general framework, GRSVNet, to effectively impose data-dependent DNN regularization. The core idea is the self-validation of the enforced geometry with a consistent validation loss on a separate batch of features.
• We study a particular case of GRSVNet, OLE-GRSVNet, that can produce highly discriminative features: samples from the same class belong to a low-rank subspace, and the subspaces for different classes are orthogonal.
• OLE-GRSVNet achieves better generalization performance than DNNs with conventional regularizers. More importantly, unlike conventional DNNs, OLE-GRSVNet refuses to fit the training data (i.e., its training error stays close to that of random guessing) when the training data or the training labels are randomly generated. This implies that OLE-GRSVNet never memorizes the training samples; it only learns intrinsic patterns.
2 Related work

Many data-dependent regularizations focusing on feature geometry have been proposed for deep learning [9, 21, 19]. The center loss [19] produces compact clusters by minimizing the Euclidean distance between features and their class centers. LDMNet [21] extracts features that sample a collection of low-dimensional manifolds. The OLE loss [9, 12] increases inter-class separation and intra-class similarity by embedding inputs into orthogonal low-rank subspaces.
Figure 1: GRSVNet architecture and the results of different networks with the same VGG-11 baseline architecture on the SVHN dataset with real data and real labels. (a) GRSVNet architecture (better understood in its special case OLE-GRSVNet, detailed in Section 3). (b)-(e) Features of the test data learned by different networks, visualized in 3D using PCA. Note that for OLE-GRSVNet, only four classes (out of ten) have nonzero 3D embedding (Theorem 2). (f) Training/testing accuracy.

However, as mentioned in Section 1, these regularizations are imposed by adding the geometric loss to the softmax loss; the softmax output, viewed as a probability distribution, is typically not consistent with the desired geometry. Our proposed GRSVNet instead uses a validation loss based on the regularized geometry, so that the predicted label distribution has a meaningful geometric interpretation.

The way in which GRSVNets impose the geometric loss and the validation loss on two separate batches of features, extracted with two identical baseline DNNs, bears a certain resemblance to the siamese network architecture [5] used extensively in metric learning [4, 7, 8, 14, 16]. The difference is that, unlike the contrastive loss [7] and the triplet loss [14] in metric learning, the feature geometry is explicitly regularized in GRSVNets, and a representation of the geometry, e.g., a basis of the low-rank subspace, can later be used directly for the classification of test data.

Our work is also related to two recent papers [20, 1] addressing the memorization of DNNs. Zhang et al. [20] empirically showed that conventional DNNs, even with data-independent regularization, are fully capable of memorizing random labels or random data. Arpit et al. [1] argued that DNNs trained with stochastic gradient descent (SGD) tend to fit patterns first before memorizing miscellaneous details, suggesting that memorization in DNNs also depends on the data itself, and that SGD with early stopping is a valid strategy in conventional DNN training. We demonstrate in this paper that when data-dependent regularization is imposed in accordance with the validation, GRSVNets never memorize random labels or random data, and only extract intrinsic patterns. An explanation of this phenomenon is provided in Section 4.
3 OLE-GRSVNet

As pointed out in Section 1, the core idea of GRSVNet is to self-validate the geometry using a consistent validation loss. To make this idea concrete, we study a particular case, OLE-GRSVNet, where the regularized feature geometry is a collection of orthogonal low-rank subspaces, and the validation loss is defined by the principal angles between the validation features and those subspaces.
The OLE loss was originally proposed in [12]. Consider a K-way classification problem. Let X = [x_1, . . . , x_N] ∈ R^{d×N} be a collection of data points {x_i}_{i=1}^N ⊂ R^d, and let X_c denote the submatrix of X formed by the inputs of the c-th class. The authors of [12] proposed to learn a linear transformation T : R^d → R^d that maps data from the same class X_c into a low-rank subspace, while mapping the entire data X into a high-rank linear space. This is achieved by solving

    \min_{T : \mathbb{R}^d \to \mathbb{R}^d} \ \sum_{c=1}^{K} \|T X_c\|_* - \|T X\|_*, \qquad \text{s.t. } \|T\| = 1,    (1)

where ‖·‖_* is the matrix nuclear norm, a convex lower bound of the rank function on the unit ball in the operator norm [13]. The norm constraint ‖T‖ = 1 is imposed to avoid the trivial solution T = 0. It is proved in [12] that the OLE loss (1) is always nonnegative, and the global optimum value 0 is attained if T X_c ⊥ T X_{c'} for all c ≠ c'.

Lezama et al. [9] later used the OLE loss as a data-dependent regularization for deep learning. Given a baseline DNN that maps a batch of inputs X into the features Z = Φ(X; θ), the OLE loss on Z is

    l_g(Z) = \sum_{c=1}^{K} \|Z_c\|_* - \|Z\|_* = \sum_{c=1}^{K} \|\Phi(X_c; \theta)\|_* - \|\Phi(X; \theta)\|_*.    (2)

The OLE loss is then combined with the standard softmax loss for training; we will henceforth call such a network "softmax+OLE." Softmax+OLE significantly improves the generalization performance, but it suffers from two problems because of the inconsistency between the softmax loss and the OLE loss. First, the learned features no longer exhibit the desired geometry of orthogonal low-rank subspaces. Second, as will be shown in Section 4, softmax+OLE is still capable of memorizing random data or random labels, i.e., it has not reduced the memorizing capacity of DNNs.
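As a concrete reference for (2), the following is a minimal PyTorch-style sketch of the OLE loss on one feature batch. The function name, the column-per-sample layout, and the use of the singular values to evaluate the nuclear norm are our own illustrative choices, not part of the implementation in [9].

```python
import torch

def ole_loss(Z, labels, num_classes):
    """OLE loss (2): sum of per-class nuclear norms minus the nuclear norm of the
    whole feature matrix. Z is d x N, with one column per sample."""
    total = torch.linalg.svdvals(Z).sum()          # ||Z||_* = sum of singular values
    per_class = Z.new_zeros(())
    for c in range(num_classes):
        Zc = Z[:, labels == c]                     # features of class c
        if Zc.shape[1] > 0:
            per_class = per_class + torch.linalg.svdvals(Zc).sum()   # ||Z_c||_*
    return per_class - total                       # nonnegative by Theorem 1 below
```

Automatic differentiation through singular values is numerically delicate when some of them are close to zero, which is precisely why the implementation section later works with an explicit subgradient of the nuclear norm.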
We will now explain how to incorporate the OLE loss into the GRSVNet framework. First, let us better understand the geometry enforced by the OLE loss by stating the following theorem.

Theorem 1. Let Z = [Z_1, . . . , Z_K] be a horizontal concatenation of the matrices {Z_c}_{c=1}^K. The OLE loss l_g(Z) defined in (2) is always nonnegative. Moreover, l_g(Z) = 0 if and only if Z_c^* Z_{c'} = 0 for all c ≠ c', i.e., the column spaces of Z_c and Z_{c'} are orthogonal.

The proof of Theorem 1, as well as those of the remaining theorems, is detailed in the appendix. Note that Theorem 1, which ensures that the OLE loss is minimized if and only if the features of different classes are orthogonal, is a much stronger result than that in [12]. We then need to define a validation loss l_v that is consistent with the geometry enforced by l_g. A natural choice is based on the principal angles between the validation features and the subspaces spanned by {Z_c}_{c=1}^K.

We now detail the architecture of OLE-GRSVNet. Given a baseline DNN, we split every training batch X ∈ R^{d×|B|} into two sub-batches, the geometry batch X^g ∈ R^{d×|B_g|} and the validation batch X^v ∈ R^{d×|B_v|}, which are mapped by the same baseline DNN into the features Z^g = Φ(X^g; θ) and Z^v = Φ(X^v; θ). The OLE loss l_g(Z^g) is imposed on the geometry batch so that the span(Z_c^g) are orthogonal low-rank subspaces, where span(Z_c^g) is the column space of Z_c^g. Let Z_c^g = U_c Σ_c V_c^* be the (compact) singular value decomposition (SVD) of Z_c^g; the columns of U_c then form an orthonormal basis of span(Z_c^g). For any feature z = Φ(x; θ) ∈ Z^v in the validation batch, its projection onto the subspace span(Z_c^g) is proj_c(z) = U_c U_c^* z. The cosine similarity between z and proj_c(z) is then defined as the (unnormalized) probability of x belonging to class c, i.e.,

    \hat{y}_c = P(x \in c) \triangleq \left\langle z, \frac{\mathrm{proj}_c(z)}{\max(\|\mathrm{proj}_c(z)\|, \varepsilon)} \right\rangle \Big/ \sum_{c'=1}^{K} \left\langle z, \frac{\mathrm{proj}_{c'}(z)}{\max(\|\mathrm{proj}_{c'}(z)\|, \varepsilon)} \right\rangle,    (3)

where a small ε is chosen for numerical stability.
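A minimal sketch of the predicted label distribution (3), assuming the geometry-batch features are given as a matrix with one column per sample and a validation feature z is a single vector; the helper names are ours.

```python
import torch

def class_bases(Z_g, labels_g, num_classes):
    """Orthonormal bases U_c of span(Z_c^g) via the compact SVD of each class block."""
    bases = []
    for c in range(num_classes):
        Zc = Z_g[:, labels_g == c]                     # d x (number of class-c samples)
        U, S, Vh = torch.linalg.svd(Zc, full_matrices=False)
        bases.append(U)                                # columns span the class subspace
    return bases

def subspace_probs(z, bases, eps=1e-6):
    """Predicted class probabilities (3) for a single validation feature z."""
    sims = []
    for U in bases:
        proj = U @ (U.T @ z)                           # proj_c(z) = U_c U_c^* z
        sims.append(torch.dot(z, proj / proj.norm().clamp(min=eps)))
    sims = torch.stack(sims)
    return sims / sims.sum()                           # normalize over the K classes
```

At test time, the same computation with the class subspaces estimated from the full training set (see the end of this section) yields the predicted label as the argmax of these probabilities.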
Figure 2: Training and testing accuracy of different networks on the SVHN dataset with (a) random labels or (b) random data (Gaussian noise). Note that softmax, softmax+wd, and softmax+OLE can all perfectly (over)fit the random training data or the training data with random labels. However, OLE-GRSVNet refuses to fit the training data when there are no intrinsically learnable patterns.

The validation loss for x is then defined as the cross entropy between the predicted distribution ŷ = (ŷ_1, . . . , ŷ_K)^T ∈ R^K and the true label y ∈ {1, . . . , K}. More specifically, let Y^v be the collection of true labels and Ŷ^v ∈ R^{K×|B_v|} the collection of predicted label distributions on the validation batch; the validation loss is then defined as

    l_v(Y^v, \hat{Y}^v) = \frac{1}{|B_v|} \sum_{x \in X^v} H(\delta_y, \hat{y}) = -\frac{1}{|B_v|} \sum_{x \in X^v} \log \hat{y}_y,    (4)

where δ_y is the Dirac distribution at label y, and H(·, ·) is the cross entropy between two distributions. The empirical loss l on the training batch X is then defined as

    l(X, Y) = l([X^g, X^v], [Y^g, Y^v]) = l_g(Z^g) + \lambda\, l_v(Y^v, \hat{Y}^v).    (5)

See Figure 1a for a visual illustration of the OLE-GRSVNet architecture.
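Putting the pieces together, here is a sketch of the empirical loss (5) on one training batch, reusing ole_loss, class_bases, and subspace_probs from the sketches above; the default value of λ is only a placeholder (the values actually used are reported in Section 6).

```python
import torch

def grsvnet_loss(Z_g, y_g, Z_v, y_v, num_classes, lam=5.0, eps=1e-6):
    """Empirical loss (5): l = l_g(Z^g) + lambda * l_v(Y^v, Y_hat^v)."""
    l_g = ole_loss(Z_g, y_g, num_classes)              # OLE loss on the geometry batch
    bases = class_bases(Z_g, y_g, num_classes)         # subspaces spanned by the Z_c^g
    l_v = Z_g.new_zeros(())
    for i in range(Z_v.shape[1]):                      # validation batch, one column per sample
        probs = subspace_probs(Z_v[:, i], bases, eps)
        l_v = l_v - torch.log(probs[y_v[i]].clamp(min=eps))   # cross entropy (4) with the true label
    return l_g + lam * l_v / Z_v.shape[1]
```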
Because of the consistency between l_g and l_v, we have the following theorem.

Theorem 2. For any λ > 0, and any geometry/validation splitting of X = [X^g, X^v] such that X^v contains at least one sample from each class, the empirical loss function defined in (5) is always nonnegative, and l(X, Y) = 0 if and only if both of the following conditions hold:

• The features of the geometry batch belonging to different classes are orthogonal, i.e., span(Z_c^g) ⊥ span(Z_{c'}^g) for all c ≠ c'.
• For every datum x ∈ X_c^v, i.e., x belongs to class c in the validation batch, its feature z = Φ(x; θ) belongs to span(Z_c^g).

Moreover, if l < ∞, then rank(span(Z_c^g)) ≥ 1 for all c, i.e., Φ(·; θ) does not trivially map the data to 0.

Remark: The requirement λ > 0 is crucial in Theorem 2, because otherwise the network could map every input to 0 and achieve the minimum. This is validated in our numerical experiments.

After the training process has finished, we map the entire training set X^all = [X_1^all, . . . , X_K^all] (or a random portion of X^all) into the features Z^all = Φ(X^all; θ*), where θ* is the learned parameter. The low-rank subspace span(Z_c^all) for class c can be obtained via the SVD of Z_c^all. The label of a test datum x is then determined by the principal angles between z = Φ(x; θ*) and {span(Z_c^all)}_{c=1}^K.

4 Toy experiments

Before delving into the implementation details of OLE-GRSVNet, we first look at two toy experiments that illustrate the proposed framework. We use VGG-11 [15] as the baseline architecture and compare the performance of the following four DNNs: (a) the baseline network with a softmax classifier (softmax); (b) VGG-11 regularized by weight decay (softmax+wd); (c) VGG-11 regularized by penalizing the softmax loss with the OLE loss (softmax+OLE); (d) OLE-GRSVNet.

We first train these four DNNs on the Street View House Numbers (SVHN) dataset with the original data and labels, without data augmentation. The test accuracy and the PCA embedding of the learned test features are shown in Figure 1. OLE-GRSVNet has the highest test accuracy among the competing DNNs. Moreover, because of the consistency between the geometric loss and the validation loss, the test features produced by OLE-GRSVNet are even more discriminative than those of softmax+OLE: features of the same class reside in a low-rank subspace, and different subspaces are (almost) orthogonal. Note that in Figure 1e, features of only four classes out of ten (though ideally it should be three) have nonzero 3D embedding (Theorem 2).

Next, we train the same networks, without changing hyperparameters, on the SVHN dataset with either (a) randomly generated labels, or (b) random training data (Gaussian noise). We train the DNNs for 800 epochs to ensure their convergence; the learning curves of training/testing accuracy are shown in Figure 2. Note that the baseline DNN, with either data-independent or conventional data-dependent regularization, can perfectly (over)fit the training data, while OLE-GRSVNet refuses to memorize the training data when there are no intrinsically learnable patterns.

In another experiment, we generate three classes of one-dimensional data in R^d: the data points in the i-th class are i.i.d. samples from a Gaussian distribution whose standard deviation in the i-th coordinate is 50 times larger than in the other coordinates. Each class has 500 data points, and we randomly shuffle the class labels after generation.
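For concreteness, a sketch of this data generation; the ambient dimension and the random seed are our own illustrative choices, since the text only specifies three classes, 500 points per class, and the 50× standard-deviation ratio.

```python
import torch

torch.manual_seed(0)
dim, n_per_class = 3, 500                       # ambient dimension is an illustrative choice
X, y = [], []
for c in range(3):
    std = torch.ones(dim)
    std[c] = 50.0                               # the c-th coordinate has 50x larger standard deviation
    X.append(torch.randn(n_per_class, dim) * std)
    y.append(torch.full((n_per_class,), c))
X = torch.cat(X)
y = torch.cat(y)[torch.randperm(3 * n_per_class)]   # randomly shuffle the class labels
```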
We then train a multilayer perceptron (MLP) with 128 neurons in each layer for 2000 epochs to classify these low-dimensional data with random labels. We found that only three layers are needed to perfectly classify these data when using a softmax classifier. However, after incrementally adding more layers to the baseline MLP, we found that OLE-GRSVNet still refuses to memorize the random labels, even for a 100-layer MLP. This further suggests that OLE-GRSVNet refuses to memorize training data by brute force when there are no intrinsic patterns in the data. A visual illustration of this experiment is shown in Figure 3.

We provide an intuitive explanation for why OLE-GRSVNet generalizes very well when given truly labeled data but refuses to memorize random data or random labels. By Theorem 2, we know that OLE-GRSVNet attains its global minimum if and only if the features of every random training batch exhibit the same orthogonal low-rank-subspace structure. This essentially implies that OLE-GRSVNet is implicitly conducting O(N^{|B|})-fold data augmentation, where N is the number of training data and |B| is the batch size, while conventional data augmentation by manipulation of the inputs, e.g., random cropping, flipping, etc., is typically O(N). This poses a very interesting question: does this mean that OLE-GRSVNet could also memorize random data if the baseline DNN had exponentially many model parameters? Or is it the learning algorithm (SGD) that prevents OLE-GRSVNet from learning a decision boundary too complicated for classifying random data? Answering this question will be the focus of our future research.

5 Implementation details

Most of the operations in the computational graph of OLE-GRSVNet (Figure 1a) explained in Section 3 are basic matrix operations. The only two exceptions are the OLE loss (Z^g → l_g(Z^g)) and the SVD (Z^g → (U_1, . . . , U_K)). We hereby specify their forward and backward propagations.

According to the definition of the OLE loss in (2), we only need to find a (sub)gradient of the nuclear norm to back-propagate the OLE loss. The characterization of the subdifferential of the nuclear norm is given in [18]. More specifically, assuming m ≥ n for simplicity, let U ∈ R^{m×m}, Σ ∈ R^{m×n}, V ∈ R^{n×n} be the SVD of a rank-s matrix A, and let U = [U^{(1)}, U^{(2)}], V = [V^{(1)}, V^{(2)}] be the partitions of U and V, respectively, where U^{(1)} ∈ R^{m×s} and V^{(1)} ∈ R^{n×s}. Then the subdifferential of the nuclear norm at A is

    \partial \|A\|_* = \left\{ U^{(1)} V^{(1)*} + U^{(2)} W V^{(2)*}, \ \forall\, W \in \mathbb{R}^{(m-s)\times(n-s)} \text{ with } \|W\| \le 1 \right\},    (6)

where ‖·‖ is the spectral norm. Note that to use (6), one needs to identify the rank-s column space of A, i.e., span(U^{(1)}), to find a subgradient, which is not necessarily easy because of numerical error. The authors of [9] intuitively truncated the numerical SVD with a small parameter chosen a priori to ensure numerical stability.
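A minimal sketch of one such subgradient, taking W = 0 in (6) and estimating the rank by truncating small singular values; the relative tolerance below is a placeholder rather than the value used in this work.

```python
import torch

def nuclear_norm_subgrad(A, rel_tol=1e-6):
    """One subgradient of ||A||_*: U^(1) V^(1)* in (6) with W = 0.
    The rank s is estimated by discarding singular values below rel_tol * sigma_max."""
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    s = int((S > rel_tol * S[0]).sum())          # estimated rank
    return U[:, :s] @ Vh[:s, :]                  # U^(1) V^(1)*
```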
[Figure 3, panel training accuracies: 100.0% for softmax with a 3-layer MLP; 33.4%, 33.4%, 34.2%, 35.0%, 30.6%, and 29.8% for OLE-GRSVNet with 3-, 5-, 10-, 20-, 50-, and 100-layer MLPs, respectively.]
Figure 3: Visual illustration of the second toy experiment in Section 4. (a) Three classes of one-dimensional data in R^d. (b) Labels randomly shuffled. (c)-(i) Features extracted by the baseline MLP with a softmax classifier or by OLE-GRSVNet. Only three layers of MLP are needed for a conventional DNN to perfectly memorize the random labels, but even with a 100-layer MLP, OLE-GRSVNet still refuses to memorize them because there are no intrinsically learnable patterns.

We show in the following theorem, using the backward stability of the SVD, that such a concern is, in theory, not necessary.
Theorem 3. Let U_ε, Σ_ε, V_ε be the numerically computed reduced SVD of A ∈ R^{m×n}, i.e., U_ε ∈ R^{m×n}, V_ε ∈ R^{n×n}, (U_ε + δU_ε) Σ_ε (V_ε + δV_ε)^* = A + δA = A_ε, and ‖δU‖, ‖δV‖, ‖δA‖ are all O(ε), where ε is the machine error. If rank(A) = s ≤ n, and the smallest nonzero singular value σ_s(A) of A satisfies σ_s(A) ≥ η > 0, then

    d(U_ε V_ε^*, \partial \|A\|_*) = O(ε/η).    (7)

However, in practice we did observe that using a small threshold to truncate the numerical SVD can speed up convergence, especially in the first few epochs of training. With the help of Theorem 3, we can easily find a stable subgradient of the OLE loss in (2).

Z^g → (U_1, . . . , U_K)

Unlike the computation of the subgradient in Theorem 3, we have to threshold the singular vectors of Z_c^g, because the desired output U_c should be an orthonormal basis of the low-rank subspace span(Z_c^g). In the forward propagation, we threshold the singular vectors U_c so that the smallest retained singular value is at least a fixed fraction of the largest singular value.

As for the backward propagation, one needs the Jacobian of the SVD, which is derived in [10]. Typically, for a matrix A ∈ R^{n×n}, computing the Jacobian of the SVD of A involves solving on the order of n^2 small 2×2 linear systems. We have not implemented the backward propagation of the SVD in this work because it involves technical implementation with the CUDA API. In our current implementation, the node (U_1, . . . , U_K) is detached from the computational graph during back-propagation, i.e., the validation loss l_v is only propagated back through the path l_v → Ŷ^v → Z^v → θ. Our rationale is the following: the validation loss l_v can be propagated back through two paths, l_v → Ŷ^v → Z^v → θ and l_v → Ŷ^v → (U_1, . . . , U_K) → Z^g → θ. The first path modifies θ so that Z_c^v moves closer to U_c, while the second path moves U_c closer to Z_c^v. Cutting off the second path when computing the gradient might decrease the speed of convergence, but our numerical experiments suggest that the training process is still well-behaved under this simplification.
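A sketch combining the two choices just described — the singular-value thresholding in the forward pass and the detached backward path. The relative threshold is written as a named constant because it is a design parameter, not a value taken from the text.

```python
import torch

REL_THRESHOLD = 0.1   # placeholder: fraction of the largest singular value that is kept

def class_basis_detached(Zc):
    """Forward pass Z_c^g -> U_c. The features are detached first, so the validation loss
    is back-propagated only through l_v -> Y_hat^v -> Z^v -> theta, as described above."""
    U, S, Vh = torch.linalg.svd(Zc.detach(), full_matrices=False)
    keep = S >= REL_THRESHOLD * S[0]
    return U[:, keep]
```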
6 Experiments

In this section, we demonstrate the superiority of OLE-GRSVNet over conventional DNNs in two respects: (a) it has greater generalization power when trained on true data and true labels; (b) unlike conventionally regularized DNNs, it refuses to memorize the training samples when given random training data or random labels.

We use a similar experimental setup as in Section 4. The same four modifications of three baseline architectures (VGG-11, VGG-16, and VGG-19 [15]) are considered: (a) Softmax, (b) Softmax+wd, (c) Softmax+OLE, and (d) OLE-GRSVNet. The performance of the networks is tested on the following datasets:

• MNIST. The MNIST dataset contains 28×28 grayscale images of digits from 0 to 9. There are 60,000 training samples and 10,000 testing samples. No data augmentation was used.
• SVHN. The Street View House Numbers (SVHN) dataset contains 32×32 RGB images of digits from 0 to 9. The training and testing sets contain 73,257 and 26,032 images, respectively. No data augmentation was used.
• CIFAR. This dataset contains 32×32 RGB images of ten classes, with 50,000 images for training and 10,000 images for testing. We use "CIFAR+" to denote experiments on CIFAR with data augmentation: 4-pixel padding, 32×32 random cropping, and horizontal flipping.
All networks are trained from scratch with the "Xavier" initialization [6]. SGD with Nesterov momentum 0.9 is used for the optimization, and the batch size is set to 200 (a 100/100 split for the geometry/validation batch is used in OLE-GRSVNet). We set the initial learning rate to 0.01 and decrease it ten-fold at 50% and 75% of the total training epochs. For the experiments with true labels, all networks are trained for 100 and 160 epochs on MNIST and SVHN, respectively. For CIFAR, we train the networks for 200, 300, and 400 epochs for VGG-11, VGG-16, and VGG-19, respectively. To ensure the convergence of SGD, all networks are trained for 800 epochs in the experiments with random labels. The mean accuracy over five independent trials is reported.

The weight decay parameter µ is kept fixed across all experiments. The weight for the OLE loss in "softmax+OLE" is chosen according to [9]: a different value is used for MNIST and SVHN, for CIFAR with VGG-11 and VGG-16, and for CIFAR with VGG-19. For OLE-GRSVNet, the parameter λ in (5) is determined by cross-validation: we set λ = 10 for MNIST, λ = 5 for SVHN and for CIFAR with VGG-11 and VGG-16, and λ = 1 for CIFAR with VGG-19.

Table 1 reports the performance of the networks trained on the original data with real or randomly generated labels. The numbers without parentheses are the percentage accuracies on the test data when the networks are trained with real labels, and the numbers enclosed in parentheses are the accuracies on the training data when given random labels. Accuracies on the training data with real labels (always 100%) and accuracies on the test data with random labels (always close to 10%) are omitted from the table. As we can see, similar to the experiment in Section 4, when trained with real labels, OLE-GRSVNet exhibits better generalization performance than the competing networks. But when trained with random labels, OLE-GRSVNet, unlike the other networks, refuses to memorize the training samples because there are no intrinsically learnable patterns. This is still the case even if we increase the number of training epochs to 2000.

We point out that by combining different regularizations and tuning the hyperparameters, the test error of conventional DNNs can indeed be reduced. For example, if we combine weight decay, conventional OLE regularization, batch normalization, and data augmentation, and increase the learning rate, the test accuracy on the CIFAR dataset can be pushed higher. However, this does not change the fact that such a network can still perfectly memorize the training samples when given random labels. This corroborates the claim in [20] that conventional regularization appears to be more of a tuning parameter than something that plays an essential role in reducing network capacity.
Testing (training) accuracy (%)

Dataset   VGG   Softmax          Softmax+wd       Softmax+OLE      OLE-GRSVNet
MNIST     11    99.40 (100.00)   99.47 (100.00)   99.49 (100.00)   —
SVHN      11    93.10 (99.99)    93.73 (100.00)   94.04 (99.99)    —
CIFAR     11    81.81 (100.00)   81.87 (100.00)   82.04 (99.95)    —
CIFAR     16    83.37 (100.00)   83.97 (99.99)    84.35 (99.96)    —
CIFAR     19    83.56 (99.99)    84.21 (99.97)    84.71 (99.96)    —
CIFAR+    11    89.52 (99.98)    89.68 (99.98)    90.04 (99.93)    —
CIFAR+    16    91.21 (99.96)    91.29 (99.96)    91.40 (99.92)    —
CIFAR+    19    91.19 (99.96)    91.53 (99.95)    — (99.91)        91.65 (10.07)

Table 1: Testing or training accuracies when trained on training data with real or random labels. The numbers without parentheses are the percentage accuracies on the testing data when the networks are trained with real labels. The numbers enclosed in parentheses are the accuracies on the training data when the networks are trained with random labels. This suggests that OLE-GRSVNet outperforms conventional DNNs on the testing data when trained with real labels; moreover, unlike conventional DNNs, OLE-GRSVNet refuses to memorize the training data when trained with random labels.
7 Conclusion

We proposed a general framework, GRSVNet, for data-dependent DNN regularization. The core idea is the self-validation of the enforced geometry on a separate batch using a validation loss consistent with the geometric loss, so that the predicted label distribution has a meaningful geometric interpretation. In particular, we studied a special case of GRSVNet, OLE-GRSVNet, which is capable of producing highly discriminative features: samples from the same class belong to a low-rank subspace, and the subspaces for different classes are orthogonal. When trained on benchmark datasets with real labels, OLE-GRSVNet achieves better test accuracy than DNNs with different regularizations sharing the same baseline architecture. More importantly, unlike conventional DNNs, OLE-GRSVNet refuses to memorize and overfit the training data when trained on random labels or random data. This suggests that OLE-GRSVNet effectively reduces the memorizing capacity of DNNs and extracts only intrinsically learnable patterns from the data.

Although we provided some intuitive explanation of why GRSVNet generalizes well on real data and refuses to overfit random data, there are still open questions to be answered. For example, what is the minimum representational capacity of the baseline DNN (i.e., number of layers and number of units) that would make even GRSVNet trainable on random data? Or is it the learning algorithm (SGD) that prevents GRSVNet from learning a decision boundary complicated enough to classify random samples? Moreover, we still have not answered why conventional DNNs, while fully capable of memorizing random data by brute force, typically find generalizable solutions on real data. These questions will be the focus of our future work.
Acknowledgments
The authors would like to thank José Lezama for providing the code of OLE [9]. Work partially supported by NSF, DoD, NIH, and Google.

References

[1] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, and S. Lacoste-Julien. A closer look at memorization in deep networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 233-242, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.
[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526, 2002.
[4] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335-1344, 2016.
[5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539-546. IEEE, 2005.
[6] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[7] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1735-1742. IEEE, 2006.
[8] J. Hu, J. Lu, and Y.-P. Tan. Discriminative deep metric learning for face verification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1875-1882, 2014.
[9] J. Lezama, Q. Qiu, P. Musé, and G. Sapiro. OLE: Orthogonal low-rank embedding, a plug and play geometric loss for deep learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2018.
[10] T. Papadopoulo and M. I. A. Lourakis. Estimating the Jacobian of the singular value decomposition: Theory and applications. In Computer Vision - ECCV 2000, pages 554-570, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg.
[11] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419, 2004.
[12] Q. Qiu and G. Sapiro. Learning transformations for clustering and classification. The Journal of Machine Learning Research, 16(1):187-225, 2015.
[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.
[14] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[16] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1988-1996. Curran Associates, Inc., 2014.
[17] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[18] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:33-45, 1992.
[19] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016, pages 499-515, Cham, 2016. Springer International Publishing.
[20] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
[21] W. Zhu, Q. Qiu, J. Huang, R. Calderbank, G. Sapiro, and I. Daubechies. LDMNet: Low dimensional manifold regularized neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2018.
A Proof of Theorem 1
It suffices to prove the case K = 2, as the case of larger K can be proved by induction. In order to simplify the notation, we restate the theorem for K = 2.
Theorem. Let A ∈ R^{N×m} and B ∈ R^{N×n} be matrices with the same number of rows, and let [A, B] ∈ R^{N×(m+n)} be their concatenation. We have

    \|[A, B]\|_* \le \|A\|_* + \|B\|_*.    (8)

Moreover, equality holds if and only if A^* B = 0, i.e., the column spaces of A and B are orthogonal.

Proof. The inequality (8) and the sufficient condition for equality are easy to prove. More specifically,

    \|[A, B]\|_* = \|[A, 0] + [0, B]\|_* \le \|[A, 0]\|_* + \|[0, B]\|_* = \|A\|_* + \|B\|_*.    (9)

Moreover, if A^* B = 0, then

    |[A, B]|^2 = [A, B]^* [A, B] = \begin{pmatrix} A^*A & A^*B \\ B^*A & B^*B \end{pmatrix} = \begin{pmatrix} A^*A & 0 \\ 0 & B^*B \end{pmatrix} = \begin{pmatrix} |A|^2 & 0 \\ 0 & |B|^2 \end{pmatrix},    (10)

where |A| = (A^*A)^{1/2}. Therefore,

    \|[A, B]\|_* = \mathrm{Tr}(|[A, B]|) = \mathrm{Tr}\begin{pmatrix} |A| & 0 \\ 0 & |B| \end{pmatrix} = \mathrm{Tr}(|A|) + \mathrm{Tr}(|B|) = \|A\|_* + \|B\|_*.    (11)

Next, we show the necessary condition for equality, i.e.,

    \|[A, B]\|_* = \|A\|_* + \|B\|_* \;\Longrightarrow\; A^*B = 0.    (12)

Let \begin{pmatrix} E & G \\ G^* & F \end{pmatrix} = |[A, B]| be the symmetric positive semidefinite square root of [A, B]^*[A, B], so that

    |A|^2 = A^*A = E^2 + GG^*, \qquad |B|^2 = B^*B = F^2 + G^*G, \qquad A^*B = EG + GF.    (13)

Let {a_i}_{i=1}^m and {b_i}_{i=1}^n be orthonormal eigenvectors of |A| and |B|, respectively. Then

    \|\,|A| a_i\|^2 = \langle |A|^2 a_i, a_i \rangle = \langle (E^2 + GG^*) a_i, a_i \rangle = \|E a_i\|^2 + \|G^* a_i\|^2.    (14)

Similarly,

    \|\,|B| b_i\|^2 = \|F b_i\|^2 + \|G b_i\|^2.    (15)

Suppose that \|[A, B]\|_* = \|A\|_* + \|B\|_*. Then

    \|A\|_* + \|B\|_* = \mathrm{Tr}(|A|) + \mathrm{Tr}(|B|)
    = \sum_{i=1}^m \langle |A| a_i, a_i \rangle + \sum_{i=1}^n \langle |B| b_i, b_i \rangle
    = \sum_{i=1}^m \|\,|A| a_i\| + \sum_{i=1}^n \|\,|B| b_i\|
    = \sum_{i=1}^m \big( \|E a_i\|^2 + \|G^* a_i\|^2 \big)^{1/2} + \sum_{i=1}^n \big( \|F b_i\|^2 + \|G b_i\|^2 \big)^{1/2}
    \ge \sum_{i=1}^m \|E a_i\| + \sum_{i=1}^n \|F b_i\|
    \ge \sum_{i=1}^m \langle E a_i, a_i \rangle + \sum_{i=1}^n \langle F b_i, b_i \rangle
    = \mathrm{Tr}(E) + \mathrm{Tr}(F) = \mathrm{Tr}(|[A, B]|) = \|[A, B]\|_* = \|A\|_* + \|B\|_*.    (16)

Therefore, both of the inequalities in this chain must be equalities, and the first one is an equality only if G = 0. Combined with the last equation in (13), this implies

    A^*B = EG + GF = 0.    (17)
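As a quick sanity check (with arbitrary random matrices and dimensions of our choosing), the statement can also be verified numerically: the nuclear norm of a concatenation is subadditive, and equality holds when the column spaces are orthogonal.

```python
import torch

torch.manual_seed(0)
nuc = lambda M: torch.linalg.svdvals(M).sum()    # nuclear norm

A, B = torch.randn(6, 3), torch.randn(6, 4)
print(nuc(torch.cat([A, B], dim=1)) <= nuc(A) + nuc(B))              # True: inequality (8)

A = torch.zeros(6, 3); A[:3, :] = torch.randn(3, 3)                  # A^T B = 0 by construction
B = torch.zeros(6, 3); B[3:, :] = torch.randn(3, 3)
print(torch.isclose(nuc(torch.cat([A, B], dim=1)), nuc(A) + nuc(B))) # True: equality case
```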
B Proof of Theorem 2

Proof.
Recall that l is defined in (5) as

    l(X, Y) = l([X^g, X^v], [Y^g, Y^v]) = l_g(Z^g) + \lambda\, l_v(Y^v, \hat{Y}^v).    (18)

The nonnegativity of l_g(Z^g) is guaranteed by Theorem 1. The validation loss l_v(Y^v, Ŷ^v) is also nonnegative, since it is the average (over the validation batch) of cross entropy losses:

    l_v(Y^v, \hat{Y}^v) = \frac{1}{|B_v|} \sum_{x \in X^v} H(\delta_y, \hat{y}) = -\frac{1}{|B_v|} \sum_{x \in X^v} \log \hat{y}_y.    (19)

Therefore l = l_g + λ l_v is also nonnegative.

Next, for a given λ > 0, l(X, Y) attains its minimum value zero if and only if both l_g(Z^g) and l_v(Y^v, Ŷ^v) are zero.

• By Theorem 1, l_g(Z^g) = 0 if and only if span(Z_c^g) ⊥ span(Z_{c'}^g) for all c ≠ c'.
• According to (19), l_v(Y^v, Ŷ^v) = 0 if and only if ŷ(x) = δ_y for all x ∈ X^v, i.e., for every x ∈ X_c^v, its feature z = Φ(x; θ) belongs to span(Z_c^g).

Finally, we prove that if λ > 0 and X^v contains at least one sample from each class, then rank(span(Z_c^g)) ≥ 1 for every c ∈ {1, . . . , K}. If not, then there exists c ∈ {1, . . . , K} such that rank(span(Z_c^g)) = 0. Let x ∈ X^v be a validation datum belonging to class y = c. Recall that the predicted probability of x belonging to class c is defined in (3) as

    \hat{y}_c = P(x \in c) \triangleq \left\langle z, \frac{\mathrm{proj}_c(z)}{\max(\|\mathrm{proj}_c(z)\|, \varepsilon)} \right\rangle \Big/ \sum_{c'=1}^{K} \left\langle z, \frac{\mathrm{proj}_{c'}(z)}{\max(\|\mathrm{proj}_{c'}(z)\|, \varepsilon)} \right\rangle = 0.    (20)

Thus we have

    l \ge \lambda\, l_v = -\frac{\lambda}{|B_v|} \sum_{x \in X^v} \log \hat{y}_y \ge -\frac{\lambda}{|B_v|} \log \hat{y}(x)_c = +\infty.    (21)

C Proof of Theorem 3
First, we need the following lemma.
Lemma 1.
Let A ∈ R^{m×n} be a rank-s matrix, and let A = U^{(1)} Σ^{(1)} V^{(1)*} be the compact SVD of A, i.e., U^{(1)} ∈ R^{m×s}, Σ^{(1)} ∈ R^{s×s}, V^{(1)} ∈ R^{n×s}. Then the subdifferential of the nuclear norm at A is

    \partial \|A\|_* = \left\{ U^{(1)} V^{(1)*} + \tilde{U}^{(2)} \tilde{W} \tilde{V}^{(2)*} \right\},    (22)

where \tilde{U}^{(2)} ∈ R^{m×(n−s)}, \tilde{V}^{(2)} ∈ R^{n×(n−s)}, and \tilde{W} ∈ R^{(n−s)×(n−s)} satisfy: the columns of \tilde{U}^{(2)} and \tilde{V}^{(2)} are orthonormal, span(U^{(1)}) ⊥ span(\tilde{U}^{(2)}), span(V^{(1)}) ⊥ span(\tilde{V}^{(2)}), and ‖\tilde{W}‖ ≤ 1.

Proof. Based on the characterization of the subdifferential in (6), we only need to show that the following two sets are identical:

    D_1 = \left\{ U^{(1)} V^{(1)*} + U^{(2)} W V^{(2)*}, \ \forall\, W \in \mathbb{R}^{(m-s)\times(n-s)} \text{ with } \|W\| \le 1 \right\},    (23)
    D_2 = \left\{ U^{(1)} V^{(1)*} + \tilde{U}^{(2)} \tilde{W} \tilde{V}^{(2)*}, \ \tilde{U}^{(2)}, \tilde{V}^{(2)}, \tilde{W} \text{ satisfy the conditions in the lemma} \right\}.    (24)

On one hand, let d = U^{(1)} V^{(1)*} + U^{(2)} W V^{(2)*} ∈ D_1, and let U^{(2)} W = \bar{U} \bar{Σ} \bar{V}^* be the reduced SVD of U^{(2)} W ∈ R^{m×(n−s)}, i.e., \bar{U} ∈ R^{m×(n−s)}, \bar{Σ} ∈ R^{(n−s)×(n−s)}, \bar{V} ∈ R^{(n−s)×(n−s)}. We can then set \tilde{U}^{(2)} = \bar{U}, \tilde{W} = \bar{Σ} \bar{V}^*, and \tilde{V}^{(2)} = V^{(2)}. It is easy to check that \tilde{U}^{(2)}, \tilde{V}^{(2)}, \tilde{W} satisfy the conditions in the lemma, and

    d = U^{(1)} V^{(1)*} + \tilde{U}^{(2)} \tilde{W} \tilde{V}^{(2)*} \in D_2.    (25)

On the other hand, let d = U^{(1)} V^{(1)*} + \tilde{U}^{(2)} \tilde{W} \tilde{V}^{(2)*} ∈ D_2, where \tilde{U}^{(2)}, \tilde{V}^{(2)}, \tilde{W} satisfy the conditions in the lemma. Write \tilde{U}^{(2)} = U^{(2)} P and \tilde{V}^{(2)} = V^{(2)} Q, where P ∈ R^{(m−s)×(n−s)} and Q ∈ R^{(n−s)×(n−s)} have orthonormal columns. Setting W = P \tilde{W} Q^*, we have

    \tilde{U}^{(2)} \tilde{W} \tilde{V}^{(2)*} = U^{(2)} P \tilde{W} Q^* V^{(2)*} = U^{(2)} W V^{(2)*},    (26)

where ‖W‖ ≤ 1. Therefore,

    d = U^{(1)} V^{(1)*} + \tilde{U}^{(2)} \tilde{W} \tilde{V}^{(2)*} = U^{(1)} V^{(1)*} + U^{(2)} W V^{(2)*} \in D_1.    (27)

Now we go on to prove Theorem 3.

Proof.
Let rank(A) = s, and split the computed singular vectors into two parts: U_ε = [U_ε^{(1)}, U_ε^{(2)}], V_ε = [V_ε^{(1)}, V_ε^{(2)}], where U_ε^{(1)} ∈ R^{m×s}, U_ε^{(2)} ∈ R^{m×(n−s)}, V_ε^{(1)} ∈ R^{n×s}, and V_ε^{(2)} ∈ R^{n×(n−s)}. By the backward stability of the SVD, we have ‖U^{(1)} − U_ε^{(1)}‖ = O(ε/η) and ‖V^{(1)} − V_ε^{(1)}‖ = O(ε/η), and there exist \tilde{U}^{(2)}, \tilde{V}^{(2)} satisfying the conditions in Lemma 1 such that ‖\tilde{U}^{(2)} − U_ε^{(2)}‖ = O(ε/η) and ‖\tilde{V}^{(2)} − V_ε^{(2)}‖ = O(ε/η).

Because of Lemma 1, we have (U^{(1)} V^{(1)*} + \tilde{U}^{(2)} \tilde{V}^{(2)*}) ∈ ∂‖A‖_*, and

    d(U_ε V_ε^*, \partial \|A\|_*) \le \|U_ε V_ε^* - (U^{(1)} V^{(1)*} + \tilde{U}^{(2)} \tilde{V}^{(2)*})\|
    = \|(U_ε^{(1)} V_ε^{(1)*} + U_ε^{(2)} V_ε^{(2)*}) - (U^{(1)} V^{(1)*} + \tilde{U}^{(2)} \tilde{V}^{(2)*})\|
    \le \|(U_ε^{(1)} - U^{(1)}) V_ε^{(1)*}\| + \|U^{(1)} (V_ε^{(1)*} - V^{(1)*})\| + \|(U_ε^{(2)} - \tilde{U}^{(2)}) V_ε^{(2)*}\| + \|\tilde{U}^{(2)} (V_ε^{(2)*} - \tilde{V}^{(2)*})\|
    = O(ε/η).