Making Convex Loss Functions Robust to Outliers using e-Exponentiated Transformation

Suvadeep Hajra

Abstract
In this paper, we propose a novel e-exponentiated transformation, 0 ≤ e < 1, for loss functions. When the transformation is applied to a convex loss function, the transformed loss function becomes more robust to outliers. Using a novel generalization error bound, we theoretically show that the transformed loss function has a tighter bound for datasets corrupted by outliers. Our empirical observations show that the accuracy obtained using the transformed loss function can be significantly better than that obtained using the original loss function, and comparable to that obtained by some other state-of-the-art methods in the presence of label noise.
1. Introduction
Convex loss functions are widely used in machine learning, as their usage leads to a convex optimization problem in a single-layer neural network or in a kernel method. That, in turn, provides the theoretical guarantee of efficiently obtaining a globally optimal solution. However, many earlier studies have pointed out that convex loss functions are not robust to outliers (Long & Servedio, 2008; 2010; Ding & Vishwanathan, 2010; Manwani & Sastry, 2013; Rooyen et al., 2015; Ghosh et al., 2015). Indeed, a convex loss imposes a penalty which grows at least linearly with the negative margin for a wrongly classified example, thus making the classification hyperplane greatly impacted by the outliers. Consequently, nonconvex loss functions have been widely studied as a robust alternative to convex loss functions (Masnadi-Shirazi & Vasconcelos, 2008; Long & Servedio, 2010; Ding & Vishwanathan, 2010; Denchev et al., 2012; Manwani & Sastry, 2013; Ghosh et al., 2015).

(Author affiliation: Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India.)

In this paper, we propose the e-exponentiated transformation for loss functions, which makes a convex loss function more robust to outliers. Given a convex loss function l(ŷ, y), we define its e-exponentiated transformation to be l_{e,c}(ŷ, y) = l(σ_{e,c}(ŷ), y) for 0 ≤ e < 1 and some real positive constant c, where σ_{e,c}(ŷ) is given by

\sigma_{e,c}(\hat{y}) = \begin{cases} \mathrm{sgn}(\hat{y})\,|\hat{y}|^{e} & \text{if } |\hat{y}| \ge c \\ c^{e-1}\,\hat{y} & \text{otherwise} \end{cases} \qquad (1)

with |ŷ| denoting the absolute value of ŷ ∈ R and the sign function sgn(ŷ) defined to be equal to 1 for ŷ ≥ 0 and −1 otherwise. For a differentiable convex loss function l(·,·), its e-exponentiated transformation l_{e,c}(·,·) is differentiable everywhere except at ŷ ∈ {−c, c}. Thus, a gradient-based optimization algorithm can be used for empirical risk minimization with an e-exponentiated loss function.
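The transformation of Eq. (1) is straightforward to implement. The following is a minimal NumPy sketch of σ_{e,c} and of the e-exponentiated logistic loss; the default values e = 0.75 and c = 0.5 are illustrative assumptions (the paper does not fix c here), not values prescribed by the paper.

```python
import numpy as np

def sigma_ec(y_hat, e=0.75, c=0.5):
    """Squashing function of Eq. (1): sgn(y_hat)|y_hat|^e for |y_hat| >= c,
    and the linear piece c^(e-1) * y_hat otherwise.  The two pieces agree
    at |y_hat| = c, so sigma is continuous."""
    y_hat = np.asarray(y_hat, dtype=float)
    sgn = np.where(y_hat >= 0, 1.0, -1.0)   # sgn(0) = 1, as in the paper
    return np.where(np.abs(y_hat) >= c,
                    sgn * np.abs(y_hat) ** e,
                    c ** (e - 1) * y_hat)

def logistic_loss(y_hat, y):
    """Standard convex logistic loss as a function of the margin y * y_hat."""
    return np.log1p(np.exp(-y * np.asarray(y_hat, dtype=float)))

def exp_logistic_loss(y_hat, y, e=0.75, c=0.5):
    """e-exponentiated logistic loss: l(sigma_ec(y_hat), y)."""
    return logistic_loss(sigma_ec(y_hat, e, c), y)
```

For a badly misclassified point the transformed loss is smaller and flatter than the original, which is the source of the robustness: at ŷ = −10 with y = +1 the logistic loss is about 10, while the transformed loss is about 10^{0.75} ≈ 5.6.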
Moreover, an e-exponentiated loss function l_{e,c}(ŷ, y) is more robust to outliers than the corresponding convex loss function l(ŷ, y), since its slope satisfies

\left| \tfrac{d}{d\hat{y}} l_{e,c}(\hat{y}, y) \right| = e\,|\hat{y}|^{e-1} \left| \tfrac{d}{d\sigma_{e,c}(\hat{y})} l(\sigma_{e,c}(\hat{y}), y) \right| < \left| \tfrac{d}{d\hat{y}} l(\hat{y}, y) \right|

for ŷ < −1 (please refer to Figure 1).

[Figure 1. e-exponentiated transformation of the logistic and hinge losses, for e = 0.75 and e = 1.00; e = 1.00 corresponds to the original loss. The same value of c is used in all the plots.]

Additionally, by introducing a novel generalization error bound, we show that the bound for an e-exponentiated loss function can be tighter than that for the corresponding convex loss function. Unlike existing generalization error bounds (Rosasco et al., 2004), which depend strongly on the Lipschitz constant of the loss function, our derived bound depends on the Lipschitz constant only weakly. Consequently, even though an e-exponentiated loss function has a larger Lipschitz constant than the corresponding convex loss function, its bound can be tighter.

In summary, the contributions of the paper are as follows:

1. We propose an e-exponentiated transformation of convex loss functions. The proposed transformation can make a convex loss function more robust to outliers.

2. Using a novel generalization error bound, we show that the bound for an e-exponentiated loss function can be tighter than that for the corresponding convex loss function. Our derived bound depends only weakly on the Lipschitz constant of a loss function. Consequently, our bound for a loss function can be tighter in spite of a larger Lipschitz constant.

3. We empirically verify the accuracy obtained by our proposed e-exponentiated loss functions on several datasets.
The results show that we can obtain significantly better accuracies using the e-exponentiated loss function than those obtained by the corresponding convex loss, and accuracies comparable to those obtained by some other state-of-the-art methods in the presence of label noise.

The organization of the work is as follows. In Section 2, we formally introduce the empirical risk minimization problem. Section 3 derives a novel generalization error bound; using the bound, we also show that the bound can be tighter for an e-exponentiated loss function. In Section 4, we present our experimental results. Finally, Section 5 concludes the work.
2. Empirical Risk Minimization Using e-Exponentiated Loss

We consider the empirical risk minimization of a linear classifier with an e-exponentiated loss function for a binary classification problem. Given a convex loss function l(·,·), the empirical risk of a linear classifier is given by:

\hat{R}_l(w; D) = \frac{1}{N} \sum_{i=1}^{N} l(\hat{y}_i, y_i) = \frac{1}{N} \sum_{i=1}^{N} l(w^T \phi(x_i), y_i) \qquad (2)

where D = {(x_i, y_i)}_{i=1}^{N} is the training set, φ(x_i) ∈ R^d is the feature representation of the sample x_i, and the target y_i takes a value from {1, −1} for i ∈ {1, ..., N}. The corresponding empirical risk with the e-exponentiated loss l_{e,c}(·,·) is given by:

\hat{R}_{l_{e,c}}(w; D) = \frac{1}{N} \sum_{i=1}^{N} l_{e,c}(\hat{y}_i, y_i) = \frac{1}{N} \sum_{i=1}^{N} l(\sigma_{e,c}(w^T \phi(x_i)), y_i) \qquad (3)

where e ∈ [0, 1), c > 0, and σ_{e,c}(·) is as defined in Eq. (1). In the rest of the paper, we will drop the second argument of \hat{R}_l(w; D) and \hat{R}_{l_{e,c}}(w; D) whenever D can be inferred from the context.
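As a concrete illustration of Eq. (3), the sketch below minimizes the e-exponentiated logistic risk of a linear classifier by full-batch gradient descent, using the chain rule through σ_{e,c}. The toy data, learning rate, and the choices e = 0.75 and c = 0.5 are all assumptions for illustration, not settings from the paper.

```python
import numpy as np

def sigma(yh, e, c):
    """Eq. (1)."""
    sgn = np.where(yh >= 0, 1.0, -1.0)
    return np.where(np.abs(yh) >= c, sgn * np.abs(yh) ** e, c ** (e - 1) * yh)

def dsigma(yh, e, c):
    """Derivative of sigma w.r.t. yh (undefined only at |yh| = c).
    np.maximum keeps the unused branch finite when |yh| < c."""
    return np.where(np.abs(yh) >= c,
                    e * np.maximum(np.abs(yh), c) ** (e - 1),
                    c ** (e - 1))

def risk_and_grad(w, X, y, e=0.75, c=0.5):
    """Empirical risk of Eq. (3) with the logistic loss, and its gradient
    in w via the chain rule through sigma."""
    yh = X @ w                                   # raw scores w^T phi(x_i)
    z = sigma(yh, e, c)                          # squashed scores
    risk = np.mean(np.log1p(np.exp(-y * z)))
    dl_dz = -y / (1.0 + np.exp(y * z))           # logistic derivative in z
    grad = (X * (dl_dz * dsigma(yh, e, c))[:, None]).mean(axis=0)
    return risk, grad

# toy run: two separable blobs plus one mislabelled far-away outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 0.3, (20, 2)), rng.normal(-1, 0.3, (20, 2)),
               [[8.0, 8.0]]])
y = np.concatenate([np.ones(20), -np.ones(20), [-1.0]])
w = np.zeros(2)
for _ in range(300):
    risk, g = risk_and_grad(w, X, y)
    w -= 0.2 * g
```

Because the transformed loss grows only like |ŷ|^e for large |ŷ|, the single outlier contributes a bounded, sublinear penalty and pulls the hyperplane less than it would under the plain logistic loss.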
3. Generalization Error Bounds of Empirical Risk Minimization with e-Exponentiated Loss

In this section, we present an upper bound for the generalization error incurred by an e-exponentiated loss function. Towards this end, we first propose a novel method for estimating the upper bound. Our method of bounding the generalization error captures the average behaviour of a loss function, as opposed to other existing methods (Rosasco et al., 2004) which capture the worst-case behaviour. In particular, our method is more suitable for analysing nonconvex problems where the risk function is smooth in most regions but contains some high-gradient regions of very low probability. Consequently, our bound shows a weak dependence on the Lipschitz constant of the loss function, as opposed to other existing methods (Rosasco et al., 2004) which depend on the Lipschitz constant monotonically. Finally, applying the derived bound, we show that empirical risk minimization with an e-exponentiated loss function can have a tighter generalization error bound than that obtained using the corresponding convex loss function.

The gradient of an e-exponentiated loss function can be very large (of the order of c^{e−1} × L_l, where L_l is the Lipschitz constant of the corresponding convex loss), making the Lipschitz constant of the transformed loss very large for c ≪ 1. On the other hand, the existing generalization error bound gets looser as the Lipschitz constant gets larger. To overcome this issue, we propose a novel bound. Our bound is based on the work of Rosasco et al. (2004). Before stating our bound, let us introduce certain notations and definitions.

Definition 1
A function f : A ↦ R, A ⊆ R^n, is said to be L_f-Lipschitz continuous, L_f > 0, if

|f(a) − f(b)| \le L_f \, \|a − b\| \qquad (4)

for every a, b ∈ A.

Definition 2
A function f : A ↦ R, A ⊆ R^n, is said to be Lipschitz-in-the-small continuous if there exist ε > 0 and L_f(ε) > 0 such that ||a − b|| ≤ ε implies

|f(a) − f(b)| \le L_f(\epsilon) \, \|a − b\| \qquad (5)

for every a, b ∈ A.

Note that, in general, whenever a function f(x) is continuous and differentiable, L_f ≥ L_f(ε) ≥ sup_x |f'(x)| = L_f for all ε > 0 (so all three quantities coincide), where f'(x) is the gradient of f(x) at x. However, this might not be true when the function f(x) also depends on the distribution of the input x.

With the above definitions, we state our generalization error bound in the next theorem. Note that since the closed ball in R^d defined as W_M ≜ {w ∈ R^d | ||w|| ≤ M} is a compact set, we can cover it by a union of a finite number of balls of radius ε for any ε > 0. Let us denote the covering number of W_M by C(ε). Also, we define the expected risk corresponding to the empirical risk given by Eq. (2):

R_l(w) = E_{x,y}[\, l(w^T \phi(x), y) \,] \qquad (6)

where E_{x,y}[·] denotes expectation over the joint distribution of x and y. Also note that so far we have used the notation l(·,·) to represent a convex loss function; in this section, however, we use it to represent an arbitrary loss function. With the above definitions and notations, we state our generalization error bound in the following theorem.

Theorem 1
Let D_N = {(x_i, y_i)}_{i=1}^{N} be such that φ(x_i) ∈ {φ(x) ∈ R^d | ||φ(x)|| ≤ 1} and y_i ∈ {−1, +1}. Let w ∈ W_M ≜ {w ∈ R^d | ||w|| ≤ M} with M ≥ 1. Let the loss function l(·,·) be L_l-Lipschitz continuous. Set B = L_{R_l}(M) M + C_l, where L_{R_l}(ε) is as defined in Eq. (5) and C_l > 0 is such that C_l ≥ l(0, y) for y ∈ {−1, +1}. Then for all ε > 0, we have

P\Big( \Big\{ D_N \,\Big|\, \sup_{w \in W_M} |R_l(w) − \hat{R}_l(w; D_N)| \le \epsilon + \frac{L_l \epsilon^2}{2B} \Big\} \Big) \ge 1 − 2\Big( C\Big(\frac{\epsilon}{4 L_{R_l}(\epsilon')}\Big) + 1 \Big) \exp\Big( −\frac{N\epsilon^2}{8B^2} \Big) \qquad (7)

where ε' > 0 is such that ε' ≥ min{ ε, ε / (4 L_{R_l}(ε')) } (and such an ε' always exists).

The proof of Theorem 1 is deferred to the appendix (Section 6). To compare our result with the previous result, we state the result of (Rosasco et al., 2004) in the next theorem:
Theorem 2 (Rosasco et al., 2004) Let D_N, M, W_M, L_l, and C_l be as defined in Theorem 1. Set B = L_l M + C_l. Then for all ε > 0, we have

P\Big( \Big\{ D_N \,\Big|\, \sup_{w \in W_M} |R_l(w) − \hat{R}_l(w; D_N)| \le \epsilon \Big\} \Big) \ge 1 − 2\,C\Big(\frac{\epsilon}{4 L_l}\Big) \exp\Big( −\frac{N\epsilon^2}{8B^2} \Big) \qquad (8)

Remark 1
The confidence bound on the RHS of Eq. (8) involves L_l, the Lipschitz constant of the loss function. Thus, the bound is a monotonically decreasing function of L_l, i.e., it gets worse as L_l gets larger. The confidence bound of Eq. (7), on the other hand, no longer involves the Lipschitz constant L_l of the loss function directly. Instead, it involves L_{R_l}(ε), which can be reasonably small even when L_l is very large.

Remark 2
Comparing Eq. (7) and (8), we see two main differences. First, on the LHS of Eq. (7), ε has been replaced by the slightly larger quantity ε + L_l ε²/(2B). Since we generally take ε ≪ 1 and B ≥ 1, L_l ε²/(2B) can be negligible even for reasonably large L_l. Thus, it does not weaken the error bound significantly. Secondly, on the RHS of Eq. (7), C(ε/(4L_l)) has been replaced by C(ε/(4L_{R_l}(ε'))) + 1. Since for x ≪ 1 the covering number C(x) ≫ 1, Eq. (7) also does not weaken the confidence probability significantly. Moreover, if L_{R_l}(ε') is reasonably smaller than L_l, the confidence bound given by Eq. (7) can be significantly better than that given by Eq. (8).

Remark 3
For nonconvex problems where the risk is smooth on most of its domain but has a very high gradient on some regions of very low probability, the bound given by Theorem 2 can be very loose, as the corresponding Lipschitz constant can be very large. However, Theorem 1 can still provide a tight bound under a proper distributional assumption. Thus, Theorem 1 is better suited for analysing nonconvex problems.
From Theorem 1, we see that when L_l ε/B ≪ 1, the generalization error bound is a monotonically decreasing function of L_{R_l}(ε), where l(·,·) is the loss function used in the empirical risk minimization. Thus, to compare the generalization error bound of an e-exponentiated loss function with that of the corresponding convex loss function, we compare L_{R_l}(ε) with L_{R_{l_{e,c}}}(ε), where l(·,·) is a convex loss function and l_{e,c}(·,·) is its e-exponentiated transformation. Since L_{R_l}(ε) depends on the distribution of x and y, we assume that the margin yŷ = y w^T φ(x) follows a uniform distribution. Moreover, since by our previous assumptions ||φ(x)|| ≤ 1 and ||w|| ≤ M, we have |yŷ| ≤ M. Note that in this case,

L_{R_{l_{e,c}}}(\epsilon) = L_{R_{l_{e,c}}}(M) = \sup_{\|w\| \le M} \Big\| \tfrac{d}{dw} R_{l_{e,c}} \Big\| = L_{R_{l_{e,c}}}

Thus, we compute an upper bound of L_{R_{l_{e,c}}} as

L_{R_{l_{e,c}}} = \sup_{\|w\| \le M} \Big\| \tfrac{d}{dw} E_{x,y}[\, l(\sigma_{e,c}(w^T\phi(x)), y) \,] \Big\| = \sup_{\|w\| \le M} \Big\| E_{x,y}\Big[ \tfrac{d}{dw} l(\sigma_{e,c}(w^T\phi(x)), y) \Big] \Big\| \equiv \Big| E_{−M \le \delta \le M}\Big[ \tfrac{d}{d\delta} l(\sigma_{e,c}(\delta)) \Big] \Big|, \quad \text{where } \delta = y\hat{y} \qquad (9)

The RHS of Eq. (9) can be shown to be less than L_{R_l} ≡ |E_{−M ≤ δ ≤ M}[ (d/dδ) l(δ) ]| for sufficiently large M and a convex loss function l(·,·) with non-positive gradient. Note that most of the standard convex loss functions for classification have a non-positive gradient as a function of the margin.

In the next section, we show the experimental results using e-exponentiated loss functions.
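The claim that the RHS of Eq. (9) is smaller than L_{R_l} can be checked numerically for the logistic loss. The sketch below draws margins δ uniformly from [−M, M], the distributional assumption made above, and compares the two averaged derivatives; the values M = 10, e = 0.75, and c = 0.5 are illustrative choices, not values fixed by the paper.

```python
import numpy as np

def sigma(d, e=0.75, c=0.5):
    sgn = np.where(d >= 0, 1.0, -1.0)
    return np.where(np.abs(d) >= c, sgn * np.abs(d) ** e, c ** (e - 1) * d)

def dsigma(d, e=0.75, c=0.5):
    return np.where(np.abs(d) >= c,
                    e * np.maximum(np.abs(d), c) ** (e - 1),
                    c ** (e - 1))

def dlogistic(z):
    # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)): non-positive everywhere
    return -1.0 / (1.0 + np.exp(z))

rng = np.random.default_rng(0)
M = 10.0
delta = rng.uniform(-M, M, 500_000)   # margins delta = y * y_hat

avg_plain = np.mean(dlogistic(delta))                       # E[ d/ddelta l(delta) ]
avg_exp = np.mean(dlogistic(sigma(delta)) * dsigma(delta))  # E[ d/ddelta l(sigma(delta)) ]
```

Integrating the derivative over the interval gives the averaged slopes in closed form: for the logistic loss, avg_plain = −M/(2M) = −1/2 exactly, while avg_exp = −M^e/(2M) ≈ −0.28 here, so the averaged slope of the transformed loss is indeed smaller in magnitude even though its worst-case slope (near the origin) is larger.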
4. Experimental Results
To demonstrate empirically the improvement obtained using e-exponentiated loss functions, we show the results of two sets of experiments. In the first set, we compare the accuracies obtained using an e-exponentiated loss function with those obtained using the corresponding convex loss function on a subset of the ImageNet dataset (Deng et al., 2009). In the second set, we compare the e-exponentiated loss functions with other state-of-the-art methods for learning with noisy labels on four datasets.

To show the improvement in accuracy of the e-exponentiated loss functions over the corresponding convex loss functions, we performed experiments on a subset of the ImageNet dataset. We randomly split our collected subset into training, validation, and test sets. For the experiments, we extracted pre-trained features of the images by passing them through the first five layers of a pre-trained AlexNet model (Krizhevsky et al., 2012). We downloaded the pre-trained model from (Shelhamer, 2013 (accessed October 2018)) and used the code of (Kratzert, 2017 (accessed October 2018)) to extract the pre-trained features. Note that only a small number of labels are common between our subset of the ImageNet dataset and the ImageNet LSVRC-2010 contest dataset on which the AlexNet model was pre-trained.

For classification using the pre-trained features, we used a three-layer fully connected neural network with ReLU activation. We performed the experiments using the e-exponentiated softmax loss and logistic loss, with e = 1.00, e = 0.75, and a third, smaller value of e, and a fixed c. Note that e = 1 gives us the original convex loss function. We fixed the dimension of the hidden layers and used the Adam optimizer for optimization. To find suitable values of the initial learning rate and of the keep probability for dropout, we performed cross-validation using the top-5 accuracy on the validation set. The top-1 and top-5 test accuracies of all the experiments are shown in Table 1. The results show an improvement in top-1 and top-5 accuracies for the smallest value of e over e = 1.00 for both the softmax and the logistic loss; for the intermediate value e = 0.75, the accuracies obtained lie in between.

[Table 1. Top-1 and Top-5 accuracies obtained on the subset of the ImageNet dataset, using the e-exponentiated logistic and softmax loss functions with the three values of e. The e-exponentiated loss function with e = 1 gives back the original convex loss function.]

In the second set of experiments, we compare the accuracies obtained using the e-exponentiated loss function with other state-of-the-art methods under label noise added to the training set. For this purpose, we adopt the experimental setup of (Ma et al., 2018).

Experimental Setup
As in (Ma et al., 2018), we performed the experiments by adding symmetric label noise at three rates to four benchmark datasets: MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), CIFAR-10 (Krizhevsky, 2009), and CIFAR-100 (Krizhevsky, 2009). For all the datasets, we used the same model and optimization setup as in (Ma et al., 2018). Additionally, we performed experiments using the e-exponentiated softmax loss function with a fixed c and the same three values of e as before. As mentioned earlier, e = 1 gives back the corresponding softmax loss. Following Ma et al. (2018), we repeated each experiment five times and report the mean accuracies.

Baseline Methods
For comparison purposes, we use the baseline methods that were used in (Ma et al., 2018). For the sake of completeness, we briefly describe them:

Forward (Patrini et al., 2017)
Noisy labels are corrected by multiplying the network predictions with a label transition matrix.
Backward (Patrini et al., 2017)
Noisy labels are corrected by multiplying the loss by the inverse of a label transition matrix.
Boot-soft (Reed et al., 2014)
The loss function is modified by replacing the target label with a convex combination of the target label and the network output.
Boot-hard (Reed et al., 2014)
It is the same as Boot-soft, except that instead of directly using the class predictions in the convex combination, it converts the class-prediction vector to a {0, 1}-vector by thresholding before using it in the convex combination.

D2L (Ma et al., 2018)
It uses an adaptive loss function which exploits the differential behaviour of the deep representation subspace while a network is trained on noisy labels.
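The bootstrapping target shared by Boot-soft and Boot-hard can be sketched in a few lines; the mixing weight beta = 0.95 below is an illustrative value, not one taken from Reed et al. (2014).

```python
import numpy as np

def bootstrap_targets(targets, predictions, beta=0.95, hard=False):
    """Replace one-hot targets by a convex combination of the targets and
    the network's own predictions (Reed et al., 2014).  With hard=True the
    prediction is first thresholded to a one-hot {0, 1}-vector."""
    predictions = np.asarray(predictions, dtype=float)
    if hard:
        onehot = np.zeros_like(predictions)
        onehot[np.arange(len(predictions)), predictions.argmax(axis=1)] = 1.0
        predictions = onehot
    return beta * np.asarray(targets, dtype=float) + (1.0 - beta) * predictions
```

The combined targets are then fed to the usual cross-entropy loss in place of the raw (possibly noisy) labels.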
Training with e-Exponentiated Loss Functions  We found that for larger networks, the rate of convergence with an e-exponentiated loss function is slow in the initial iterations due to the smaller magnitude of the gradients. For a similar problem, Barron (2017) used an "annealing" approach in which, at the beginning of the optimization, they start with a convex loss function and at each epoch gradually make the loss function nonconvex by slowly tuning a hyper-parameter. In our experiments, however, we take a simpler approach: for an initial fraction of the total number of epochs the model is trained, we train with e = 1, and after that we switch the value of e to our desired lower value. We leave the use of more sophisticated approaches such as annealing as future work.
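The epoch-based switch described above can be sketched as follows; the warmup fraction of 1/3 and the target value e = 0.75 are placeholders, since the exact fraction used in the experiments is not specified in the text.

```python
def e_schedule(epoch, total_epochs, target_e=0.75, warmup_fraction=1 / 3):
    """Train with the convex loss (e = 1) for an initial fraction of the
    epochs, then switch to the desired lower e.  warmup_fraction and
    target_e are illustrative placeholders, not values from the paper."""
    return 1.0 if epoch < warmup_fraction * total_epochs else target_e
```

At each epoch, the value returned by the schedule would simply be plugged into the e-exponentiated loss used for that epoch's updates.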
Results  The results are shown in Table 2. From the table, we can see that the accuracies obtained by the e-exponentiated softmax loss with e below 1 are comparable (within a small margin) or better in most of the settings relative to the methods Backward, Boot-hard, and Boot-soft. However, its performance is relatively worse than that of the methods Forward and D2L, relative to which the accuracies obtained by the e-exponentiated loss function are comparable or better in only a minority of the settings. Moreover, in some settings, the accuracy obtained by those two methods is better than that obtained by the e-exponentiated loss function by a wide margin. However, it should be noted that the scope of our work is to develop better loss functions for the problem, and many of the other label-correction methods can be used along with our proposed loss functions.
5. Conclusion
In this paper, we have proposed the e-exponentiated transformation of loss functions. The e-exponentiated convex loss functions are differentiable everywhere except at two points, and thus can be optimized using gradient-descent-based algorithms, while being more robust to outliers. Additionally, using a novel generalization error bound, we have shown that the bound can be tighter for an e-exponentiated loss function than for the corresponding convex loss function, in spite of its much larger Lipschitz constant. Finally, by empirical evaluation, we have shown that the accuracy obtained using an e-exponentiated loss function can be significantly better than that obtained using the corresponding convex loss function, and comparable to the accuracy obtained by some other state-of-the-art methods in the presence of label noise.

References
Barron, J. T. A more general robust loss function. CoRR, abs/1701.03077, 2017. URL http://arxiv.org/abs/1701.03077.

Denchev, V. S., Ding, N., Vishwanathan, S. V. N., and Neven, H. Robust classification with adiabatic quantum optimization. In ICML, 2012.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Li, F.-F. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Ding, N. and Vishwanathan, S. V. N. t-logistic regression. In NIPS, pp. 514–522, 2010.

Ghosh, A., Manwani, N., and Sastry, P. S. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.

Kratzert, F. Finetune AlexNet with Tensorflow 1.0, 2017 (accessed October 2018). URL https://github.com/kratzert/finetune_alexnet_with_tensorflow/tree/5d751d62eb4d7149f4e3fd465febf8f07d4cea9d.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.

Laurent, T. and Brecht, J. Deep linear networks with arbitrary loss: All local minima are global. In
ICML, pp. 2908–2913, 2018.

[Table 2. Experiments on the four benchmark datasets (MNIST, SVHN, CIFAR-10, and CIFAR-100) under 0%, 20%, and two further rates of symmetric label noise. Columns report the accuracies of Forward, Backward, Boot-hard, Boot-soft, D2L, and the softmax cross-entropy with the e-exponentiated transformation at three values of e (e = 1.00, e = 0.75, and a third, smaller value). The accuracies of the other methods have been taken from (Ma et al., 2018).]

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.

Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. In ICML, pp. 608–615, 2008.

Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. Machine Learning, 78:287–304, 2010.

Ma, X., Wang, Y., Houle, M. E., Zhou, S., Erfani, S. M., Xia, S., Wijewickrema, S. N. R., and Bailey, J. Dimensionality-driven learning with noisy labels. In ICML, pp. 3361–3370, 2018.

Manwani, N. and Sastry, P. S. Noise tolerance under risk minimization. IEEE Trans. Cybernetics, 43(3):1146–1151, 2013.

Masnadi-Shirazi, H. and Vasconcelos, N. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In NIPS, pp. 1049–1056, 2008.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning, 2011.

Patrini, G., Rozza, A., Menon, A. K., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pp. 2233–2241, 2017.

Reed, S. E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. CoRR, abs/1412.6596, 2014. URL http://arxiv.org/abs/1412.6596.

Rooyen, B., Menon, A. K., and Williamson, R. C. Learning with symmetric label noise: The importance of being unhinged. In NIPS, pp. 10–18, 2015.

Rosasco, L., Vito, E. D., Caponnetto, A., Piana, M., and Verri, A. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.

Shelhamer, E. bvlc_alexnet.caffemodel, 2013 (accessed October 2018). URL http://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel.
6. Proof of Theorem 1
Before proceeding to the proof of Theorem 1, we state and prove another result which is required for the proof.
Lemma 1
Let the expected risk R_l(w) be Lipschitz-in-the-small continuous and the corresponding loss function be L_l-Lipschitz. Then for ||w_1 − w_2|| ≤ ε, ε > 0, and ρ > 0,

|\hat{R}_l(w_1) − \hat{R}_l(w_2)| \le L_{R_l}(\epsilon) \, \|w_1 − w_2\| + \rho \qquad (10)

is satisfied with probability at least 1 − 2 \exp( −N\rho^2 / (2 L_l^2 \epsilon^2) ).

Proof:
Since R_l(w) is Lipschitz-in-the-small continuous and ||w_1 − w_2|| ≤ ε, we have

|R_l(w_1) − R_l(w_2)| \le L_{R_l}(\epsilon) \, \|w_1 − w_2\| \qquad (11)

If we let z_i = l(w_1^T φ(x_i), y_i) − l(w_2^T φ(x_i), y_i), then we can write

E[z] = R_l(w_1) − R_l(w_2), \quad \text{and} \quad \frac{1}{N} \sum_{i=1}^{N} z_i = \hat{R}_l(w_1) − \hat{R}_l(w_2)

Since ||w_1 − w_2|| ≤ ε and the loss function l(·,·) is L_l-Lipschitz, |z_i| ≤ L_l ε. Using Hoeffding's inequality, we get

P\Big\{ D_N \,\Big|\, \big| (R(w_1) − R(w_2)) − (\hat{R}(w_1; D_N) − \hat{R}(w_2; D_N)) \big| \ge \rho \Big\} \le 2 \exp\Big( −\frac{N\rho^2}{2 L_l^2 \epsilon^2} \Big) \qquad (12)

Combining Eq. (11) and (12) completes the proof. □
Now we prove Theorem 1.
Proof of Theorem 1:
We mainly follow the proof of Rosasco et al. (2004). To simplify the notation, we drop the subscript of D_N and of R_l(·) and \hat{R}_l(·) throughout the proof. First, denoting

\Delta_D(w) = R(w) − \hat{R}(w) \qquad (13)

and using Lemma 1, we get

|\Delta_D(w) − \Delta_D(w_0)| \le |R(w) − R(w_0)| + |\hat{R}(w; D) − \hat{R}(w_0; D)| \le 2 L_R(\epsilon') \, \|w − w_0\| + \rho \qquad (14)

for all ||w − w_0|| ≤ ε', for some ε' > 0, with probability at least 1 − 2 \exp( −N\rho^2 / (2 L_l^2 \epsilon'^2) ). Putting ρ = L_l ε' ε / B into the above statement, we get

|\Delta_D(w) − \Delta_D(w_0)| \le 2 L_R(\epsilon') \, \|w − w_0\| + \frac{L_l \epsilon' \epsilon}{B} \qquad (15)

with probability at least 1 − 2 \exp( −N\epsilon^2 / (2B^2) ). Again, following (Rosasco et al., 2004), we have

P(A) = P( \cup_{i=1}^{m} A_{w_i} ) \le 2m \exp\Big( −\frac{N\epsilon^2}{2B^2} \Big) \qquad (16)

where w_1, ..., w_m are the m = C( ε / (2 L_R(ε')) ) points such that the closed balls B( w_i, ε / (2 L_R(ε')) ) with radius ε / (2 L_R(ε')) and center w_i cover the whole set W_M = {w ∈ R^d | ||w|| ≤ M}, and

A_{w_i} = \{ D \,|\, |\Delta_D(w_i)| \ge \epsilon \} \quad \text{for } i = 1, \dots, m. \qquad (17)

When ε' ≥ ε / (2 L_R(ε')), for every w ∈ W_M there exists some i ∈ {1, ..., m} such that w ∈ B( w_i, ε / (2 L_R(ε')) ), i.e.

\|w − w_i\| \le \frac{\epsilon}{2 L_R(\epsilon')} \qquad (18)

Note that D ∈ A is a dataset for which there exists some w_i whose empirical risk has not converged to its expected risk. Thus, for all D ∉ A, we have |Δ_D(w_i)| ≤ ε for all i ∈ {1, ..., m}. Now, combining Eq.
(15) and (18), we can say that when there exists some ε' > 0 such that ε' ≥ ε / (2 L_R(ε')),

|\Delta_D(w) − \Delta_D(w_i)| \le \epsilon + \frac{L_l \epsilon \epsilon'}{B} \qquad (19)

holds for all w ∈ W_M and some w_i, with probability at least 1 − 2 \exp( −N\epsilon^2 / (2B^2) ). Therefore, if there exists an ε' > 0 such that ε' ≥ ε / (2 L_R(ε')),

|\Delta_D(w)| \le 2\epsilon + \frac{L_l \epsilon \epsilon'}{B} \qquad (20)

holds for all w ∈ W_M with probability at least

\Big(1 − 2 \exp\Big( −\frac{N\epsilon^2}{2B^2} \Big)\Big)\Big(1 − 2m \exp\Big( −\frac{N\epsilon^2}{2B^2} \Big)\Big) \ge 1 − 2(m+1) \exp\Big( −\frac{N\epsilon^2}{2B^2} \Big) = 1 − 2\Big( C\Big(\frac{\epsilon}{2 L_R(\epsilon')}\Big) + 1 \Big) \exp\Big( −\frac{N\epsilon^2}{2B^2} \Big).

By replacing ε with ε/2, and by replacing ε' with ε whenever ε' > ε, the statement of the theorem follows.

It remains to show that there always exists an ε' > 0 such that ε' ≥ ε / (2 L_R(ε')). Note that L_R(ε') is a monotonically increasing function of ε'. If ε' ≥ ε / (2 L_R(ε')) holds for some ε' < ε, we are already done. Otherwise, we have 2 ε' L_R(ε') < ε; but 2 ε' L_R(ε') can be increased unboundedly by increasing ε', making it larger than ε eventually.