Improving Generalization by Controlling Label-Noise Information in Neural Network Weights
Hrayr Harutyunyan, Kyle Reing, Greg Ver Steeg, Aram Galstyan
Abstract
In the presence of noisy or incorrect labels, neural networks have the undesirable tendency to memorize information about the noise. Standard regularization techniques such as dropout, weight decay or data augmentation sometimes help, but do not prevent this behavior. If one considers neural network weights as random variables that depend on the data and stochasticity of training, the amount of memorized information can be quantified with the Shannon mutual information between weights and the vector of all training labels given inputs, I(w; y | x). We show that for any training algorithm, low values of this term correspond to reduction in memorization of label-noise and better generalization bounds. To obtain these low values, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. We illustrate the effectiveness of our approach on versions of MNIST, CIFAR-10, and CIFAR-100 corrupted with various noise models, and on a large-scale dataset Clothing1M that has noisy labels.
1. Introduction
Supervised learning with deep neural networks has shown great success in the last decade. Despite having millions of parameters, modern neural networks generalize surprisingly well. However, their training is particularly susceptible to noisy labels, as shown by Zhang et al. (2016) in their analysis of generalization error. In the presence of noisy or incorrect labels, networks start to memorize the training labels, which degrades the generalization performance (Chen et al., 2019). At the extreme, standard architectures have the capacity to achieve 100% classification accuracy on training data, even when labels are assigned at random (Zhang et al., 2016). Furthermore, standard explicit or implicit regularization techniques such as dropout, weight decay or data augmentation do not directly address nor completely prevent label memorization (Zhang et al., 2016; Arpit et al., 2017).

Figure 1: Neural networks tend to memorize labels when trained with noisy labels (80% noise in this case), even when dropout or weight decay are applied. Our training approach limits label-noise information in neural network weights, avoiding memorization of labels and improving generalization. Please refer to Sec. 2.1 for more details.

Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292. Correspondence to: Hrayr Harutyunyan <[email protected]>. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Poor generalization due to label memorization is a significant problem because many large, real-world datasets are imperfectly labeled. Label noise may be introduced when building datasets from unreliable sources of information or when using crowd-sourcing resources like Amazon Mechanical Turk. A practical solution to the memorization problem is likely to be algorithmic, as sanitizing labels in large datasets is costly and time consuming. Existing approaches for addressing the problem of label-noise and generalization performance include deriving robust loss functions (Natarajan et al., 2013; Ghosh et al., 2017; Zhang & Sabuncu, 2018; Xu et al., 2019), loss correction techniques (Sukhbaatar et al., 2014; Tong Xiao et al., 2015; Goldberger & Ben-Reuven, 2017; Patrini et al., 2017), re-weighting samples (Jiang et al., 2017; Ren et al., 2018), detecting incorrect samples and relabeling them (Reed et al., 2014; Tanaka et al., 2018; Ma et al., 2018), and employing two networks that select training examples for each other (Han et al., 2018; Yu et al., 2019).

We propose an information-theoretic approach that directly addresses the root of the problem. If a classifier is able to correctly predict a training label that is actually random, it must have somehow stored information about this label in the parameters of the model. To quantify this information, Achille & Soatto (2018) consider weights as a random variable, w, that depends on stochasticity in training data and parameter initialization. The entire training dataset is considered a random variable consisting of a vector of inputs, x, and a vector of labels for each input, y.
The amount of label memorization is then given by the Shannon mutual information between weights and labels conditioned on inputs, I(w; y | x). Achille & Soatto (2018) show that this term appears in a decomposition of the commonly used expected cross-entropy loss, along with three other individually meaningful terms. Surprisingly, cross-entropy rewards large values of I(w; y | x), which may promote memorization if labels contain information beyond what can be inferred from x. Such a result highlights that in addition to the network's representational capabilities, the loss function – or more generally, the learning algorithm – plays an important role in memorization. To this end, we wish to study the utility of limiting I(w; y | x), and how it can be used to modify training algorithms to reduce memorization.

Our main contributions towards this goal are as follows: 1) We show that low values of I(w; y | x) correspond to reduction in memorization of label-noise, and lead to better generalization gap bounds. 2) We propose training methods that control memorization by regularizing label-noise information in weights. When the training algorithm is a variant of stochastic gradient descent, one can achieve this by controlling label-noise information in gradients. A promising way of doing this is through an additional network that tries to predict the classifier gradients without using label information. We experiment with two training procedures that incorporate gradient prediction in different ways: one which uses the auxiliary network to penalize the classifier, and another which uses predicted gradients to train it. In both approaches, we employ a regularization that penalizes the L2 norm of predicted gradients to control their capacity. The latter approach can be viewed as a search over training algorithms, as it implicitly looks for a loss function that balances training performance with label memorization. 3) Finally, we show that the auxiliary network can be used to detect incorrect or misleading labels. To illustrate the effectiveness of the proposed approaches, we apply them to corrupted versions of MNIST, CIFAR-10, and CIFAR-100 with various label noise models, and to the Clothing1M dataset, which already contains noisy labels. We show that methods based on gradient prediction yield drastic improvements over standard training algorithms (such as training with the cross-entropy loss), and outperform competitive approaches designed for learning with noisy labels.
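To make the second procedure above more concrete, the following PyTorch-style sketch shows one possible training step in which a label-free auxiliary network predicts the classifier's gradient and the classifier is updated with that prediction. It is a minimal sketch under our own assumptions (predicting the gradient with respect to the logits, a mean-squared-error objective for the predictor, the names `classifier` and `q_net`), not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def limit_style_step(classifier, q_net, opt_f, opt_q, x, y, beta=1.0):
    """One illustrative step: an auxiliary network predicts the cross-entropy
    gradient w.r.t. the logits without seeing labels; the classifier is then
    updated using the predicted (label-free) gradient."""
    logits = classifier(x)                                   # (B, K)
    with torch.no_grad():
        # Gradient of cross-entropy w.r.t. logits: softmax(logits) - one_hot(y).
        ce_grad = F.softmax(logits, dim=1) - F.one_hot(y, logits.shape[1]).float()

    mu = q_net(x)                                            # predicted gradient, no labels used
    # Train q to match the cross-entropy gradient; the beta term penalizes the
    # L2 norm of predictions, acting as a capacity control on label information.
    q_loss = F.mse_loss(mu, ce_grad) + beta * (mu ** 2).sum(dim=1).mean()
    opt_q.zero_grad()
    q_loss.backward()
    opt_q.step()

    # Backpropagating sum(logits * mu) delivers mu as the gradient of the logits,
    # so the classifier never receives gradients computed from the labels.
    surrogate = (logits * mu.detach()).sum()
    opt_f.zero_grad()
    surrogate.backward()
    opt_f.step()
    return q_loss.item()
```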
2. Label-Noise Information in Weights
We begin by formally introducing a measure of label-noise information in weights, and discuss its connections to memorization and generalization. Throughout the paper we use several information-theoretic quantities such as entropy, H(X) = −E[log p(X)], mutual information, I(X; Y) = H(X) + H(Y) − H(X, Y), Kullback–Leibler divergence, KL(p(x) || q(x)) = E_{x∼p(x)}[log(p(x)/q(x))], and their conditional variants (Cover & Thomas, 2006).

Consider a setup in which a labeled dataset, S = (x, y), for data x = {x^(i)}_{i=1}^n and categorical labels y = {y^(i)}_{i=1}^n, is generated from a distribution p_θ(x, y). A training algorithm for learning weights w of a fixed probabilistic classifier f(y | x, w) can be denoted as a conditional distribution A(w | S). Given any training algorithm A, its training performance can be measured using the expected cross-entropy:

$$H_{p,f}(\mathbf{y} \mid \mathbf{x}, w) = \mathbb{E}_{S}\,\mathbb{E}_{w \mid S}\left[\sum_{i=1}^{n} -\log f\big(y^{(i)} \mid x^{(i)}, w\big)\right].$$

Achille & Soatto (2018) present a decomposition of this expected cross-entropy, which reduces to the following when the data generating process is fixed:

$$H_{p,f}(\mathbf{y} \mid \mathbf{x}, w) = H(\mathbf{y} \mid \mathbf{x}) - \underbrace{I(w; \mathbf{y} \mid \mathbf{x})}_{\text{memorizing label-noise}} + \mathbb{E}_{\mathbf{x}, w}\,\mathrm{KL}\big[p(\mathbf{y} \mid \mathbf{x}) \,\|\, f(\mathbf{y} \mid \mathbf{x}, w)\big]. \quad (1)$$

The problem of minimizing this expected cross-entropy is equivalent to selecting an appropriate training algorithm. If the labels contain information beyond what can be inferred from inputs (meaning non-zero H(y | x)), such an algorithm may do well by memorizing the labels through the second term of (1). Indeed, minimizing the empirical cross-entropy loss, A_ERM(w | S) = δ(w*), where w* ∈ arg min_w Σ_{i=1}^n −log f(y^(i) | x^(i), w), does exactly that (Zhang et al., 2016).
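As a quick numerical illustration of these definitions (ours, not part of the original text), the snippet below computes entropy, mutual information, and KL divergence for a small discrete joint distribution and checks that I(X; Y) = H(X) + H(Y) − H(X, Y) coincides with KL(p(x, y) || p(x)p(y)).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A toy joint distribution p(x, y) over 2 x 2 outcomes.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_x, H_y, H_xy = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())
I_xy = H_x + H_y - H_xy                                   # mutual information
kl = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))    # KL(p(x,y) || p(x)p(y))

print(I_xy)   # ~0.278 bits
print(kl)     # equals I_xy: mutual information is itself a KL divergence
```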
2.1. I(w; y | x) Reduces Memorization

To demonstrate that I(w; y | x) is directly linked to memorization, we prove that any algorithm with small I(w; y | x) overfits less to label-noise in the training set.
Theorem 2.1. Consider a dataset S = (x, y) of n i.i.d. samples, x = {x^(i)}_{i=1}^n and y = {y^(i)}_{i=1}^n, where the domain of labels is a finite set, Y. Let A(w | S) be any training algorithm, producing weights for a possibly stochastic classifier f(y | x, w). Let ŷ^(i) denote the prediction of the classifier on the i-th example and let e^(i) = 1[ŷ^(i) ≠ y^(i)] be a random variable corresponding to predicting y^(i) incorrectly. Then, the following inequality holds:

$$\mathbb{E}\left[\sum_{i=1}^{n} e^{(i)}\right] \ge \frac{H(\mathbf{y} \mid \mathbf{x}) - I(w; \mathbf{y} \mid \mathbf{x}) - \sum_{i=1}^{n} H(e^{(i)})}{\log\big(|\mathcal{Y}| - 1\big)}.$$

This result establishes a lower bound on the expected number of prediction errors on the training set, which increases as I(w; y | x) decreases. For example, consider a corrupted version of the MNIST dataset where each label is changed with probability 0.8 to a uniformly random incorrect label. By the above bound, every algorithm for which I(w; y | x) = 0 will make at least 80% prediction errors on the training set in expectation. In contrast, if the weights retain 1 bit of label-noise information per example, the classifier will make at least 40.5% errors in expectation. The proof of Thm. 2.1 uses Fano's inequality and is presented in the supplementary material (Sec. A.1). Below we discuss the dependence of error probability on I(w; y | x).
Remark 1. If we let k = |Y| and r = (1/n) E[Σ_{i=1}^n e^(i)] denote the expected training error rate, then by Jensen's inequality we can simplify Thm. 2.1 as follows:

$$r \ge \frac{H(y^{(1)} \mid x^{(1)}) - I(w; \mathbf{y} \mid \mathbf{x})/n - \frac{1}{n}\sum_{i=1}^{n} H(e^{(i)})}{\log(k-1)} \ge \frac{H(y^{(1)} \mid x^{(1)}) - I(w; \mathbf{y} \mid \mathbf{x})/n - H(r)}{\log(k-1)}. \quad (2)$$

Solving this inequality for r is challenging. One can simplify the right-hand side further by bounding H(e^(1)) ≤ 1 (assuming that entropies are measured in bits). However, this will loosen the bound. Alternatively, we can find the smallest r_0 for which (2) holds and claim that r ≥ r_0.
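The smallest r_0 satisfying (2) can be found numerically. The sketch below (our own illustration) does this for uniform label noise over k classes, for which H(y^(1) | x^(1)) = H(p) + p log(k − 1) as discussed in Remark 3 below; it recovers the numbers quoted earlier, r_0 = 0.8 for p = 0.8 with I(w; y | x) = 0, and r_0 ≈ 0.405 when one bit of label-noise information per example is retained.

```python
import numpy as np

def binary_entropy(r):
    """Binary entropy H(r) in bits, with the convention 0*log(0) = 0."""
    r = np.clip(r, 1e-12, 1 - 1e-12)
    return -(r * np.log2(r) + (1 - r) * np.log2(1 - r))

def smallest_error_rate(p, info_bits_per_example, k=10):
    """Smallest r0 satisfying (2) for uniform label noise, i.e. the smallest r with
       r * log2(k-1) + H(r) >= H(p) + p*log2(k-1) - I(w;y|x)/n."""
    rhs = binary_entropy(p) + p * np.log2(k - 1) - info_bits_per_example
    grid = np.linspace(0.0, 1.0, 100001)
    feasible = grid[grid * np.log2(k - 1) + binary_entropy(grid) >= rhs]
    return feasible[0] if len(feasible) else 1.0

print(smallest_error_rate(0.8, 0.0))  # ~0.8   (I(w;y|x) = 0)
print(smallest_error_rate(0.8, 1.0))  # ~0.405 (1 bit of label-noise info per example)
```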
Remark 2. If |Y| = 2, then log(|Y| − 1) = 0; substituting this into (13) of the supplementary instead leads to:

$$H(r) \ge H(y^{(1)} \mid x^{(1)}) - I(w; \mathbf{y} \mid \mathbf{x})/n.$$
Remark 3. When we have uniform label noise, where a label is incorrect with probability p (0 ≤ p < (k−1)/k), and I(w; y | x) = 0, the bound of (2) is tight, i.e., it implies that r ≥ p. To see this, we note that H(y^(1) | x^(1)) = H(p) + p log(k − 1), substituting which into (2) gives us:

$$r \ge \frac{H(p) + p\log(k-1) - H(r)}{\log(k-1)} = p + \frac{H(p) - H(r)}{\log(k-1)}. \quad (3)$$

Therefore, when r = p, the inequality holds, implying that r_0 ≤ p. To show that r_0 = p, we need to show that for any 0 ≤ r < p, (3) does not hold. Let r ∈ [0, p) and assume that (3) holds. Then

$$r \ge p + \frac{H(p) - H(r)}{\log(k-1)} \ge p + \frac{H(p) - \big(H(p) + (r-p)H'(p)\big)}{\log(k-1)} \ge p + \frac{-(r-p)\log(k-1)}{\log(k-1)} = 2p - r. \quad (4)$$

The second inequality above follows from the concavity of H(x), and the third from the fact that H'(p) > −log(k − 1) when 0 ≤ p < (k − 1)/k. Eq. (4) directly contradicts r < p. Therefore, Eq. (3) cannot hold for any r < p. When I(w; y | x) > 0, we can find the smallest r_0 by a numerical method. Fig. 2 plots r_0 vs I(w; y | x) when the label noise is uniform. When the label-noise is not uniform, the bound of (2) becomes loose as Fano's inequality becomes loose. We leave the problem of deriving better lower bounds in such cases for future work.

Figure 2: The lower bound r_0 on the rate of training errors r that Thm. 2.1 establishes for varying values of I(w; y | x), in the case when label noise is uniform and the probability of a label being incorrect is p (curves shown for p = 0.2, 0.4, 0.6, 0.8).

Thm. 2.1 provides theoretical guarantees that memorization of noisy labels is prevented when I(w; y | x) is small, in contrast to standard regularization techniques – such as dropout, weight decay, and data augmentation – which only slow it down (Zhang et al., 2016; Arpit et al., 2017). To demonstrate this empirically, we compare an algorithm that controls I(w; y | x) (presented in Sec. 3) against these regularization techniques on the aforementioned corrupted MNIST setup. We see in Fig. 1 that explicitly preventing memorization of label-noise information leads to optimal training performance (20% training accuracy) and good generalization on a non-corrupted validation set. Other approaches quickly exceed 20% training accuracy by incorporating label-noise information, and generalize poorly as a consequence. The classifier here is a fully connected neural network with 4 hidden layers, each having 512 ReLU units. The rates of dropout and weight decay were selected according to the performance on a validation set.
2.2. I(w; y | x) Improves Generalization

The information that weights contain about a training dataset S has previously been linked to generalization (Xu & Raginsky, 2017). The following bound relates the expected difference between train and test performance to the mutual information I(w; S).

Theorem 2.2. (Xu & Raginsky, 2017) Suppose ℓ(ŷ, y) is a loss function, such that ℓ(f_w(x), y) is a σ-sub-Gaussian random variable for each w. Let S = (x, y) be the training set, A(w | S) be the training algorithm, and (x̄, ȳ) be a test sample independent from S and w. Then the following holds:

$$\left|\,\mathbb{E}\left[\ell\big(f_w(\bar{x}), \bar{y}\big) - \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_w(x^{(i)}), y^{(i)}\big)\right]\right| \le \sqrt{\frac{2\sigma^2}{n}\, I(w; S)}. \quad (5)$$

For good test performance, learning algorithms need to have both a small generalization gap and good training performance. The latter may require retaining more information about the training set, meaning there is a natural conflict between increasing training performance and decreasing the generalization gap bound of (5). Furthermore, information in weights can be decomposed as follows: I(w; S) = I(w; x) + I(w; y | x). We claim that one needs to prioritize reducing I(w; y | x) over I(w; x) for the following reason. When noise is present in the training labels, fitting this noise implies a non-zero value of I(w; y | x), which grows linearly with the number of samples n. In such cases, the generalization gap bound of (5) becomes a constant and does not improve as n increases. To get meaningful generalization bounds via (5) one needs to limit I(w; y | x). We hypothesize that for efficient learning algorithms, this condition might also be sufficient.
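To spell out the last point (our own one-line check, not in the original text): if fitting the label noise forces I(w; y | x) ≥ cn for some constant c > 0, then, since I(w; S) = I(w; x) + I(w; y | x) ≥ I(w; y | x),

$$\sqrt{\frac{2\sigma^2}{n}\, I(w; S)} \;\ge\; \sqrt{\frac{2\sigma^2}{n}\, I(w; \mathbf{y} \mid \mathbf{x})} \;\ge\; \sqrt{\frac{2\sigma^2}{n}\, c\, n} \;=\; \sigma\sqrt{2c},$$

so the right-hand side of (5) stays bounded away from zero no matter how large n becomes.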
3. Methods Limiting Label Information
We now consider how to design training algorithms that control I(w; y | x). We assume f(y | x, w) = Multinoulli(y; s(a)), with a being the output of a neural network h_w(x), and s(·) the softmax function. We consider the case when h_w(x) is trained with a variant of stochastic gradient descent for T iterations. The inputs and labels of a mini-batch at iteration t are denoted by x_t and y_t respectively, and are selected using a deterministic procedure (such as cycling through the dataset, or using pseudo-randomness). Let w_0 denote the weights after initialization, and w_t the weights after iteration t. Let L(w; x, y) be some classification loss function (e.g., the cross-entropy loss) and g_t^L ≜ ∇_w L(w_{t−1}; x_t, y_t) be the gradient at iteration t. Let g_t denote the gradients used to update the weights, possibly different from g_t^L. Let the update rule be w_t = Ψ(w_{t−1}, g_t), with the final weights w_T denoted by w for convenience.

To limit I(w; y | x), the following sections discuss two approximations which relax the computational difficulty while still providing meaningful bounds: 1) first, we show that the information in weights can be replaced by information in the gradients; 2) second, we introduce a variational bound on the information in gradients. The bound employs an auxiliary network that predicts gradients of the original loss without label information. We then explore two ways of incorporating predicted gradients: (a) using them in a regularization term for gradients of the original loss, and (b) using them to train the classifier.

Looking at (1), it is tempting to add I(w; y | x) as a regularization to the H_{p,f}(y | x, w) objective and minimize over all training algorithms:

$$\min_{A(w \mid S)}\; H_{p,f}(\mathbf{y} \mid \mathbf{x}, w) + I(w; \mathbf{y} \mid \mathbf{x}). \quad (6)$$

This becomes equivalent to minimizing E_{x,w} KL[p(y | x) || f(y | x, w)]. Unfortunately, the optimization problem of (6) is hard to solve for two major reasons. First, the optimization is over training algorithms (rather than over the weights of a classifier, as in the standard machine learning setup). Second, the penalty I(w; y | x) is hard to compute or approximate.

To simplify the problem of (6), we relate information in weights to information in gradients as follows:

$$I(w; \mathbf{y} \mid \mathbf{x}) \le I(g_{1..T}; \mathbf{y} \mid \mathbf{x}) = \sum_{t=1}^{T} I(g_t; \mathbf{y} \mid \mathbf{x}, g_{<t}). \quad (7)$$

4. Experiments

We set up experiments with noisy datasets to see how well the proposed methods perform for different types and amounts of label noise. The simplest baselines in our comparison are the standard cross-entropy (CE) and mean absolute error (MAE) loss functions. The next baseline is the forward correction approach (FW) proposed by Patrini et al. (2017), where the label-noise transition matrix is estimated and used to correct the loss function. Finally, we include the recently proposed determinant mutual information (DMI) loss, which is the log-determinant of the confusion matrix between predicted and given labels (Xu et al., 2019). Both the FW and DMI baselines require initialization with the best result of the CE baseline. To avoid small experimental differences, we implement all baselines, closely following the original implementations of FW and DMI. We train all baselines except DMI using the ADAM optimizer (Kingma & Ba, 2014) with a fixed learning rate. As DMI is very sensitive to the learning rate, we tune it by choosing the best value from a small grid.
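As a concrete reference for the noise models used in the experiments below (the uniform noise described in the next paragraph and the pair noise used later for CIFAR-10), here is a minimal sketch of how such corruptions can be generated. It is our own illustration rather than the authors' released preprocessing code; the class pairs in the example use CIFAR-10's standard class ordering.

```python
import numpy as np

def corrupt_uniform(labels, p, num_classes, rng):
    """With probability p, replace each label by a uniformly random *incorrect* class."""
    labels = labels.copy()
    flip = rng.random(len(labels)) < p
    offsets = rng.integers(1, num_classes, size=len(labels))   # shift by 1..K-1, never 0
    labels[flip] = (labels[flip] + offsets[flip]) % num_classes
    return labels

def corrupt_pairs(labels, p, pairs, rng):
    """Pair noise: with probability p, a source class is relabeled as its paired class."""
    labels = labels.copy()
    for src, dst in pairs:
        idx = np.where(labels == src)[0]
        chosen = idx[rng.random(len(idx)) < p]
        labels[chosen] = dst
    return labels

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)
y_uniform = corrupt_uniform(y, p=0.8, num_classes=10, rng=rng)
# truck->automobile, bird->airplane, deer->horse, cat->dog in CIFAR-10 indices.
y_pair = corrupt_pairs(y, p=0.4, pairs=[(9, 1), (2, 0), (4, 7), (3, 5)], rng=rng)
```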
For all baselines, model selection is done by choosing the model with the highest accuracy on a validation set that follows the noise model of the corresponding training set. All scores are reported on a clean test set. Additional experimental details, including the hyperparameter grids, are presented in supplementary section B. The implementation of the proposed method and the code for replicating the experiments are available at https://github.com/hrayrhar/limit-label-memorization.

To compare the variants of our approach discussed earlier and see which ones work well, we do experiments on the MNIST dataset with corrupted labels. In this experiment, we use a simple uniform label-noise model, where each label is set to an incorrect value uniformly at random with probability p. In our experiments we try 4 values of p – 0%, 50%, 80%, 89%. We split the 60K images of MNIST into training and validation sets, containing 48K and 12K samples respectively. For each noise amount we try 3 different training set sizes, the largest being all 4.8·10^4 training samples. All classifiers and auxiliary networks are 4-layer CNNs, with a shared architecture presented in the supplementary (Sec. B). For this experiment we include two additional baselines where additive noise (Gaussian or Laplace) is added to the gradients with respect to logits. We denote these baselines with the names "CE + GN" and "CE + LN". The comparison with these two baselines demonstrates that the proposed method does more than simply reduce information in gradients via noise. We also consider a variant of LIMIT where instead of sampling g_t from q we use the predicted mean µ_t.

Table 1: Test accuracy comparison on multiple versions of MNIST corrupted with uniform label noise. Error bars are presented in the supplementary material (Sec. C).

Figure 3: Smoothed training and testing accuracy plots of various approaches on MNIST with 80% uniform noise.

Table 1 shows the test performance of different approaches averaged over 5 training/validation splits. Standard deviations and additional combinations of p and n are presented in the supplementary (see Tables 4 and 5). Additionally, Fig. 3 shows the training and testing performance of the best methods during training when p = 0.8 and all training samples are used. Overall, variants of LIMIT produce the best results and improve significantly over standard approaches. The variants with a Laplace distribution perform better than those with a Gaussian distribution. This is likely due to the robustness of MAE. Interestingly, LIMIT works well and trains faster when the sampling of g_t in q is disabled (rows with "-S"). Thus, hereafter we consider this as our primary approach. As expected, the soft regularization approach of (10) and cross-entropy variants with noisy gradients perform significantly worse than LIMIT. We exclude these baselines in our future experiments. Additionally, we tested the importance of penalizing the norm of predicted gradients by comparing training and testing performance of LIMIT with varying regularization strength β in the supplementary (Fig. A0).
We found that this penalty is essential for preventing memorization.

In our approach, the auxiliary network q should not be able to distinguish correct and incorrect samples, unless it overfits. We found that it learns to predict "correct" gradients on examples with incorrect labels (Sec. C). Motivated by this, we use the distance between predicted and cross-entropy gradients to detect samples with incorrect or misleading labels (Fig. 4). When we interpret this distance as a score for classifying whether a label is correct, we get a 99.87% ROC AUC score (Sec. C).

Figure 4: Most mislabeled examples in the MNIST (top left), CIFAR-10 (top right), and Clothing1M (bottom) datasets, according to the distance between predicted and cross-entropy gradients. More examples are presented in the supplementary (Sec. C).

Next we consider a harder dataset, CIFAR-10 (Krizhevsky et al., 2009), with two label noise models: uniform noise and pair noise. For pair noise, certain classes are confused with some other similar class. Following the setup of Xu et al. (2019) we use the following four pairs: truck → automobile, bird → airplane, deer → horse, cat → dog. Note that in this type of noise H(y | x) is much smaller than in the case of uniform noise. We split the 50K images of CIFAR-10 into training and validation sets, containing 40K and 10K samples respectively. For the CIFAR experiments we use ResNet-34 networks (He et al., 2016) with standard data augmentation, consisting of random horizontal flips and random 28x28 crops padded back to 32x32. For our proposed methods, the auxiliary network q is a ResNet-34 as well. We noticed that for more difficult datasets, it may happen that while q still learns to produce good gradients, the updates with these less informative gradients may corrupt the initialization of the classifier. For this reason, we add an additional variant of LIMIT, which initializes the q network with the best CE baseline, similar to the DMI and FW baselines.

Table 2: Test accuracy comparison on CIFAR-10, corrupted with various label noise types, on CIFAR-100 with 40% uniform label noise, and on the Clothing1M dataset. The error bars are computed by bootstrapping the test set 1000 times. The missing error bars are presented in the supplementary material (Sec. C).

Table 2 presents the results on CIFAR-10. Again, variants of LIMIT improve significantly over standard baselines, especially in the case of uniform label noise. As expected, when q is initialized with the best CE model (similar to FW and DMI), the results are better. As in the case of MNIST, our approach helps even when the dataset is noiseless.

CIFAR-100. To test the proposed methods on a classification task with many classes, we apply them on CIFAR-100 with 40% uniform noise. We use the same networks as in the case of CIFAR-10. The results presented in Table 2 indicate several interesting phenomena. First, training with the MAE loss fails, which was observed by other works as well (Zhang & Sabuncu, 2018). The gradient of MAE with respect to the logits is f(x)_y (ŷ − y). When f(x)_y is small, there is little signal to fix the mistake. In fact, in the case of CIFAR-100, f(x)_y is approximately 0.01 in the beginning, slowing down the training. The performance of FW degrades as the approximation error of the noise transition matrix becomes large. DMI does not give a significant improvement over CE due to numerical issues with computing the determinant of a 100x100 confusion matrix. LIMIT_L performs worse than other variants, as training q with MAE becomes challenging. However, performance improves when q is initialized with the CE model. LIMIT_G does not suffer from the mentioned problem and works with or without initialization.

Finally, as our last experiment, we consider the Clothing1M dataset (Xiao et al., 2015), which has 1M images labeled with one of 14 possible clothing labels. The dataset has very noisy training labels, with roughly 40% of examples incorrectly labeled. More importantly, the label noise in this dataset is realistic and instance dependent. For this dataset we use ResNet-50 networks and employ standard data augmentation, consisting of random horizontal flips and random crops of size 224x224 after resizing images to size 256x256. The results shown in the last column of Table 2 demonstrate that DMI and LIMIT with initialization perform the best, producing similar results.

5. Related Work

Our approach is related to many works that study memorization and learning with noisy labels. Our work also builds on theoretical results studying how generalization relates to information in neural network weights. In this section we present the related work and discuss the connections.

Learning with noisy labels is a longstanding problem and has been studied extensively (Frenay & Verleysen, 2014). Many works have studied and proposed loss functions that are robust to label noise. Natarajan et al. (2013) propose robust loss functions for binary classification with label-dependent noise. Ghosh et al. (2017) generalize this result to the multi-class classification problem and show that the mean absolute error (MAE) loss function is tolerant to label-dependent noise. Zhang & Sabuncu (2018) propose a new loss function, called generalized cross-entropy (GCE), that interpolates between MAE and CE with a single parameter q ∈ [0, 1]. Xu et al. (2019) propose a new loss function (DMI), which is equal to the log-determinant of the confusion matrix between predicted and given labels, and show that it is robust to label-dependent noise. These loss functions are robust in the sense that the best performing hypotheses on clean data and noisy data are the same in the regime of infinite data. When training on finite datasets, training with these loss functions may still result in memorization of training labels.

Another line of research seeks to estimate label-noise and correct the loss function accordingly (Sukhbaatar et al., 2014; Tong Xiao et al., 2015; Goldberger & Ben-Reuven, 2017; Patrini et al., 2017; Hendrycks et al., 2018; Yao et al., 2019). Some works use meta-learning to treat the problem
While our approachalso has a teaching component, the network uses all samplesinstead of filtering. Li et al. (2019) propose a meta-learningapproach that optimizes a classification loss along with aconsistency loss between predictions of a mean teacher andpredictions of the model after a single gradient descent stepon a synthetically labeled mini-batch.Some approaches assume particular label-noise models,while our approach assumes that H ( y | x ) > , whichmay happen because of any type of label noise or attributenoise (e.g., corrupted images or partially observed inputs).Additionally, the techniques used to derive our approach canbe adopted for regression or multilabel classification tasks.Furthermore, some methods require access to small cleanvalidation data, which is not required in our approach. Defining and quantifying information in neural networkweights is an open challenge and has been studied by multi-ple authors. One approach is to relate information in weightsto their description length. A simple way of measuringdescription length was proposed by Hinton & van Camp(1993) and reduces to the L2 norm of weights. Another wayto measure it is through the intrinsic dimension of an objec-tive landscape (Li et al., 2018; Blier & Ollivier, 2018). Liet al. (2018) observed that the description length of neuralnetwork weights grows when they are trained with noisylabels (Li et al., 2018), indicating memorization of labels.Achille & Soatto (2018) define information in weights asthe KL divergence from the posterior of weights to the prior.In a subsequent study they provide generalization boundsinvolving the KL divergence term (Achille & Soatto, 2019).Similar bounds were derived in the PAC-Bayesian setupand have been shown to be non-vacuous (Dziugaite & Roy,2017). With an appropriate selection of prior on weights, theabove KL divergence becomes the Shannon mutual infor-mation between the weights and training dataset, I ( w ; S ) .Xu & Raginsky (2017) derive generalization bounds thatinvolve this latter quantity. Pensia et al. (2018) upper bound I ( w ; S ) when the training algorithm consists of iterativenoisy updates. They use the chain-rule of mutual informa-tion as we did in (7) and bound information in updates byadding independent noise. It has been observed that addingnoise to gradients can help to improve generalization incertain cases (Neelakantan et al., 2015). Another approach restricts information in gradients by clipping them (Menonet al., 2020) .Achille & Soatto (2018) also introduce the term I ( w ; y | x ) and show the decomposition of the cross-entropy describedin (1). In a recent work, Yin et al. (2020) consider a similarterm in the context of meta-learning and use it as a regu-larization to prevent memorization of meta-testing labels.Given a meta-learning dataset M , they consider the informa-tion in the meta-weights θ about the labels of meta-testingtasks given the inputs of meta-testing tasks, I ( θ ; Y | X ) .They bound this information with a variational upper boundKL ( q ( θ | M ) || r ( θ )) and use multivariate Gaussian distri-butions for both. For isotropic Gaussians with equal covari-ances, the KL divergence reduces to (cid:107) θ − θ (cid:107) , which wasstudied by Hu et al. (2020) as a regularization to achieverobustness to label-noise. Note that this bounds not only I ( θ ; Y | X ) but also I ( θ ; X , Y ) . In contrast, we bound only I ( w ; y | x ) and work with information in gradients. 6. 
6. Conclusion and Future Work

Several recent theoretical works have highlighted the importance of the information about the training data that is memorized in the weights. We distinguished two components of it and demonstrated that the conditional mutual information of weights and labels given inputs is closely related to memorization of labels and generalization performance. By bounding this quantity in terms of information in gradients, we were able to derive the first practical schemes for controlling label information in the weights and demonstrated that this outperforms approaches for learning with noisy labels. In the future, we plan to explore ways of improving the bound of (7) and to design a better bottleneck in the gradients. Additionally, we aim to extend the presented ideas to reducing instance memorization.

Acknowledgements

We thank Alessandro Achille for his valuable comments. We also thank the anonymous reviewers whose suggestions helped improve this manuscript. Hrayr Harutyunyan was supported by the USC Annenberg Fellowship. This work is based in part on research sponsored by the Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory, DARPA or the U.S. Government.

References

Achille, A. and Soatto, S. Emergence of invariance and disentanglement in deep representations. J. Mach. Learn. Res., 19(1):1947–1980, January 2018. ISSN 1532-4435.

Achille, A. and Soatto, S. Where is the information in a deep neural network? arXiv preprint arXiv:1905.12213, 2019.

Arazo, E., Ortego, D., Albert, P., O'Connor, N. E., and McGuinness, K. Unsupervised label noise modeling and loss correction. arXiv preprint arXiv:1904.11238, 2019.

Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 233–242. JMLR.org, 2017.

Blier, L. and Ollivier, Y. The description length of deep learning models. In Advances in Neural Information Processing Systems, pp. 2216–2226, 2018.

Chen, P., Liao, B., Chen, G., and Zhang, S. Understanding and utilizing deep neural networks trained with noisy labels. arXiv preprint arXiv:1905.05040, 2019.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, 2006.

Dziugaite, G. and Roy, D. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. March 2017.

Frenay, B. and Verleysen, M. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, May 2014. ISSN 2162-2388. doi: 10.1109/TNNLS.2013.2292894.

Ghosh, A., Kumar, H., and Sastry, P. Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Goldberger, J. and Ben-Reuven, E. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.

Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M.
Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pp. 8527–8537, 2018.

Han, J., Luo, P., and Wang, X. Deep self-learning from noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5138–5147, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hendrycks, D., Mazeika, M., Wilson, D., and Gimpel, K. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pp. 10456–10465, 2018.

Hinton, G. E. and van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93, pp. 5–13, New York, NY, USA, 1993. Association for Computing Machinery. ISBN 0897916115. doi: 10.1145/168304.168306. URL https://doi.org/10.1145/168304.168306.

Hu, W., Li, Z., and Yu, D. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hke3gyHYwH.

Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Li, C., Farkhoor, H., Liu, R., and Yosinski, J. Measuring the intrinsic dimension of objective landscapes. April 2018.

Li, J., Wong, Y., Zhao, Q., and Kankanhalli, M. S. Learning to learn from noisy labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5051–5059, 2019.

Ma, X., Wang, Y., Houle, M. E., Zhou, S., Erfani, S. M., Xia, S.-T., Wijewickrema, S., and Bailey, J. Dimensionality-driven learning with noisy labels. arXiv preprint arXiv:1806.02612, 2018.

Menon, A. K., Rawat, A. S., Reddi, S. J., and Kumar, S. Can gradient clipping mitigate label noise? In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rklB76EKPr.

Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In Advances in Neural Information Processing Systems, pp. 1196–1204, 2013.

Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.

Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952, 2017.

Pensia, A., Jog, V., and Loh, P.-L. Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 546–550. IEEE, 2018.

Reed, S. E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. CoRR, abs/1412.6596, 2014.

Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning.
arXiv preprint arXiv:1803.09050, 2018.

Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and Meng, D. Meta-weight-net: Learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, pp. 1917–1928, 2019.

Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L. D., and Fergus, R. Training convolutional networks with noisy labels. In ICLR 2015, 2014.

Tanaka, D., Ikami, D., Yamasaki, T., and Aizawa, K. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552–5560, 2018.

Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2691–2699, June 2015. doi: 10.1109/CVPR.2015.7298885.

Xiao, T., Xia, T., Yang, Y., Huang, C., and Wang, X. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699, 2015.

Xu, A. and Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pp. 2524–2533, 2017.

Xu, Y., Cao, P., Kong, Y., and Wang, Y. L_DMI: A novel information-theoretic loss function for training deep nets robust to label noise. In Advances in Neural Information Processing Systems, pp. 6222–6233, 2019.

Yao, J., Wu, H., Zhang, Y., Tsang, I., and Sun, J. Safeguarded dynamic label regression for noisy supervision. Proceedings of the AAAI Conference on Artificial Intelligence, 33:9103–9110, July 2019. doi: 10.1609/aaai.v33i01.33019103.

Yin, M., Tucker, G., Zhou, M., Levine, S., and Finn, C. Meta-learning without memorization. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BklEFpEYwS.

Yu, X., Han, B., Yao, J., Niu, G., Tsang, I. W.-H., and Sugiyama, M. How does disagreement help generalization against label corruption? In ICML, 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhang, Z. and Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pp. 8778–8788, 2018.

Supplementary material: Improving Generalization by Controlling Label-Noise Information in Neural Network Weights

Algorithm 1 LIMIT: limiting label information memorization in training. Our implementation is available at https://github.com/hrayrhar/limit-label-memorization.
Input: Training dataset {(x^(i), y^(i))}_{i=1}^n.
Input: Gradient norm regularization coefficient β. {λ is set to 1}
Initialize the classifier f(y | x, w) and the gradient predictor q_φ(· | x, g_{<t}).

A. Proofs

This section presents the proofs and some remarks that were not included in the main text due to space constraints.

A.1. Proof of Thm. 2.1

Theorem A.1. (Thm. 2.1 restated) Consider a dataset S = (x, y) of n i.i.d. samples, x = {x^(i)}_{i=1}^n and y = {y^(i)}_{i=1}^n, where the domain of labels is a finite set, Y, with |Y| > 2. Let A(w | S) be any training algorithm, producing weights for a possibly stochastic classifier f(y | x, w). Let ŷ^(i) denote the prediction of the classifier on the i-th example and e^(i) = 1[ŷ^(i) ≠ y^(i)] be a random variable corresponding to predicting y^(i) incorrectly.
Then, the following holds:

$$\mathbb{E}\left[\sum_{i=1}^{n} e^{(i)}\right] \ge \frac{H(\mathbf{y} \mid \mathbf{x}) - I(w; \mathbf{y} \mid \mathbf{x}) - \sum_{i=1}^{n} H(e^{(i)})}{\log\big(|\mathcal{Y}| - 1\big)}.$$

Proof. For each example we consider the following Markov chain:

$$y^{(i)} \rightarrow \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \rightarrow \begin{bmatrix} x^{(i)} \\ w \end{bmatrix} \rightarrow \widehat{y}^{(i)}.$$

In this setup Fano's inequality gives a lower bound for the error probability:

$$H(e^{(i)}) + P(e^{(i)} = 1)\log\big(|\mathcal{Y}| - 1\big) \ge H(y^{(i)} \mid x^{(i)}, w), \quad (13)$$

which can be written as:

$$P(e^{(i)} = 1) \ge \frac{H(y^{(i)} \mid x^{(i)}, w) - H(e^{(i)})}{\log\big(|\mathcal{Y}| - 1\big)}.$$

Summing this inequality for i = 1, ..., n we get

$$\sum_{i=1}^{n} P(e^{(i)} = 1) \ge \frac{\sum_{i=1}^{n}\big(H(y^{(i)} \mid x^{(i)}, w) - H(e^{(i)})\big)}{\log(|\mathcal{Y}| - 1)} \ge \frac{\sum_{i=1}^{n}\big(H(y^{(i)} \mid \mathbf{x}, w) - H(e^{(i)})\big)}{\log(|\mathcal{Y}| - 1)} \ge \frac{H(\mathbf{y} \mid \mathbf{x}, w) - \sum_{i=1}^{n} H(e^{(i)})}{\log(|\mathcal{Y}| - 1)}.$$

The correctness of the last step follows from the fact that total correlation is always non-negative (Cover & Thomas, 2006):

$$\sum_{i=1}^{n} H(y^{(i)} \mid \mathbf{x}, w) - H(\mathbf{y} \mid \mathbf{x}, w) = TC(\mathbf{y} \mid \mathbf{x}, w) \ge 0.$$

Finally, using the fact that H(y | x, w) = H(y | x) − I(w; y | x), we get the desired result:

$$\mathbb{E}\left[\sum_{i=1}^{n} e^{(i)}\right] \ge \frac{H(\mathbf{y} \mid \mathbf{x}) - I(w; \mathbf{y} \mid \mathbf{x}) - \sum_{i=1}^{n} H(e^{(i)})}{\log\big(|\mathcal{Y}| - 1\big)}. \quad (14)$$

A.2. Proof of Prop. 3.1

Proposition A.1. (Prop. 3.1 restated) If g_t = µ_t + ε_t, where ε_t ∼ N(0, σ_q² I_d) is an independent noise and E[µ_t^T µ_t] ≤ L², then the following inequality holds:

$$I(g_t; \mathbf{y} \mid \mathbf{x}, g_{<t}) \le \frac{d}{2}\log\!\left(1 + \frac{L^2}{d\sigma_q^2}\right).$$

Proof. Given that ε_t and µ_t are independent, let us bound the expected L2 norm of g_t:

$$\mathbb{E}\big[g_t^T g_t\big] = \mathbb{E}\big[(\epsilon_t + \mu_t)^T(\epsilon_t + \mu_t)\big] = \mathbb{E}\big[\epsilon_t^T \epsilon_t\big] + \mathbb{E}\big[\mu_t^T \mu_t\big] \le d\sigma_q^2 + L^2.$$

Among all random variables Z with E[Z^T Z] ≤ C, the Gaussian distribution Y ∼ N(0, (C/d) I_d) has the largest entropy, given by H(Y) = (d/2) log(2πe C/d). Therefore,

$$H(g_t) \le \frac{d}{2}\log\!\left(\frac{2\pi e\,(d\sigma_q^2 + L^2)}{d}\right).$$

With this we can upper bound I(g_t; y | x, g_{<t}):

$$I(g_t; \mathbf{y} \mid \mathbf{x}, g_{<t}) = H(g_t \mid \mathbf{x}, g_{<t}) - H(g_t \mid \mathbf{y}, \mathbf{x}, g_{<t}) \le H(g_t) - H(\epsilon_t) = \frac{d}{2}\log\!\left(1 + \frac{L^2}{d\sigma_q^2}\right).$$

B. Experimental Details

In this section we describe the details of experiments and implementations.

Classifier architectures. The architecture of classifiers used in the MNIST experiments is presented in Table 3. The ResNet-34 used in the CIFAR-10 and CIFAR-100 experiments differs from the standard ResNet-34 architecture (which is designed for 224x224 images) in two ways: (a) the first convolutional layer has 3x3 kernels and stride 1, and (b) the max pooling layer after it is skipped. The architecture of the ResNet-50 used in the Clothing1M experiment follows the original (He et al., 2016).

Hyperparameter search. The CE, MAE, and FW baselines have no hyperparameters. For DMI, we tuned the learning rate by choosing the best value from a small grid. The soft regularization approach of (11) has two hyperparameters, λ and β, each selected from a small grid of values. The objective of LIMIT instances has two terms, λ H_{p,q} and β‖µ_t‖₂²; consequently, we need only one hyperparameter instead of two. We choose to set λ = 1 and select β from a grid of values. When sampling is enabled, we select σ_q from a grid of values. In the MNIST and CIFAR experiments, we trained all models for 400 epochs and terminated training early when the best validation accuracy was not improved in the last 100 epochs.
All models for Clothing1M were trained for 30 epochs.

C. Additional Results

Effectiveness of gradient norm penalty. In the main text we discussed that the proposed approach may overfit if the gradient predictor q_φ(· | x, g_{<t}) is left unconstrained. Fig. A0 compares the training and testing performance of LIMIT instances trained with varying values of the gradient norm penalty β; the penalty is essential for preventing memorization.

Figure A0: Training and testing accuracies of "LIMIT_G −S" and "LIMIT_L −S" instances with varying values of β on MNIST with 80% uniform label noise. The curves are smoothed for better presentation.

In the proposed approach, the auxiliary network q should not be able to distinguish correct and incorrect samples, unless it overfits. In fact, Fig. A1 shows that if we look at the norm of predicted gradients, examples with correct and incorrect labels are indistinguishable in easy cases (MNIST with 80% uniform noise and CIFAR-10 with 40% uniform noise) and have large overlap in harder cases (CIFAR-10 with 40% pair noise and CIFAR-100 with 40% uniform noise). Therefore, we hypothesize that the auxiliary network learns to utilize incorrect samples effectively by predicting "correct" gradients. This also hints that the distance between the predicted and cross-entropy gradients might be useful for detecting samples with incorrect or confusing labels. Fig. A2 confirms this intuition, demonstrating that this distance separates correct and incorrect samples perfectly in easy cases (MNIST with 80% uniform noise and CIFAR-10 with 40% uniform noise) and separates them well in harder cases (CIFAR-10 with 40% pair noise and CIFAR-100 with 40% uniform noise). If we interpret this distance as a score for classifying the correctness of a label, we get a 91.1% ROC AUC score in the hardest case (CIFAR-10 with 40% pair noise) and more than 99% in the easier cases. Motivated by these results, we use this analysis to detect samples with incorrect or confusing labels in the original MNIST, CIFAR-10, and Clothing1M datasets. We present a few incorrect/confusing labels for each class in Figures A3 and A4.

Figure A1: Histograms of the norm of predicted gradients for examples with correct and incorrect labels. The gradient predictions are done using the best instances of LIMIT.

Quantitative results. Tables 4, 5, 6, and 7 present test accuracy comparisons on multiple corrupted versions of MNIST and CIFAR-10. The presented error bars are standard deviations. In the case of MNIST, we compute them over 5 training/validation splits. In the case of CIFAR-10, due to the high computational cost, we have only one run for each model and dataset pair. The standard deviations are computed by resampling the corresponding test sets 1000 times with replacement.
Figure A2: Histograms of the distance between predicted and actual gradients for examples with correct and incorrect labels. The gradient predictions are done using the best instances of LIMIT. The per-panel ROC AUC scores are 0.9987 (MNIST, 80% uniform noise), 0.9913 (CIFAR-10, 40% uniform noise), 0.9109 (CIFAR-10, 40% pair noise), and 0.9585 (CIFAR-100, 40% uniform noise).

Table 4: Test accuracy comparison on multiple versions of MNIST corrupted with uniform label noise.

Table 5: Test accuracy comparison on multiple versions of MNIST corrupted with uniform label noise.
Table 6: Test accuracy comparison on CIFAR-10, corrupted with uniform label noise. The error bars are computed by bootstrapping the test set 1000 times.

Table 7: Test accuracy comparison on CIFAR-10, corrupted with pair noise, described in Sec. 4.2. The error bars are computed by bootstrapping the test set 1000 times.
Figure A3: Most confusing 8 labels per class in the MNIST (on the left) and CIFAR-10 (on the right) datasets, according to the distance between predicted and cross-entropy gradients. The gradient predictions are done using the best instances of LIMIT.
Figure A4: Most confusing labels per class in the Clothing1M dataset, according to the distance between predicted and cross-entropy gradients. The gradient predictions are done using the best instances of LIMIT.