Analyzing Overfitting under Class Imbalance in Neural Networks for Image Segmentation
Zeju Li, Konstantinos Kamnitsas and Ben Glocker
Abstract—Class imbalance poses a challenge for developing unbiased, accurate predictive models. In particular, in image segmentation neural networks may overfit to the foreground samples from small structures, which are often heavily under-represented in the training set, leading to poor generalization. In this study, we provide new insights on the problem of overfitting under class imbalance by inspecting the network behavior. We find empirically that when training with limited data and strong class imbalance, at test time the distribution of logit activations may shift across the decision boundary, while samples of the well-represented class seem unaffected. This bias leads to a systematic under-segmentation of small structures. This phenomenon is consistently observed for different databases, tasks and network architectures. To tackle this problem, we introduce new asymmetric variants of popular loss functions and regularization techniques including a large margin loss, focal loss, adversarial training, mixup and data augmentation, which are explicitly designed to counter logit shift of the under-represented classes. Extensive experiments are conducted on several challenging segmentation tasks. Our results demonstrate that the proposed modifications to the objective function can lead to significantly improved segmentation accuracy compared to baselines and alternative approaches.
Index Terms—overfitting, class imbalance, image segmentation.
I. INTRODUCTION

THE success of convolutional neural networks (CNNs) is strongly linked with the availability of large-scale, representative datasets. However, in many real-world applications such as medical image segmentation, the availability of large, annotated datasets is still limited. But even when there is a sufficient number of images available, the fundamental problem of class imbalance remains, where regions-of-interest (ROIs) (i.e. foreground classes) are heavily under-represented in the training data [3], [30]. Similar to [9], the class imbalance ratio of one image can be defined as the ratio between the number of pixels of the background class (which is commonly the most frequent class) and the number of pixels of different object classes. The class imbalance ratio of a whole dataset would then be reported as the average class imbalance ratio of all images in the set. Class imbalance ratios of 100:1 or higher are not uncommon in applications such as lesion segmentation, as shown in Table I.

When the model is trained with imbalanced datasets, it can overfit to the training samples from the under-represented classes and may not generalize well at test time. However, the effects of overfitting under class imbalance on the model
behavior are not well understood. In this study, we investigate how the distribution of activations of the classification layer (logits) changes when the model is trained using different amounts of training data with strong class imbalance. As the model is trained with less training data and overfits the under-represented classes more, we find that the model projects unseen samples of the under-represented classes closer to, and even across, the decision boundary, while samples of the over-represented classes remain unaffected. This biased distribution shift leads to under-segmentation of the under-represented class. Current solutions to address class imbalance or to mitigate overfitting do not explicitly consider this asymmetric logit shift and are unable to lead to significant improvements, as we show through an extensive set of experiments.

This study sheds new light on the problem of overfitting in the presence of class imbalance by making the following key contributions: 1) Via inspection of the network behavior on four segmentation tasks and datasets, and two popular model architectures, we conclude that overfitting under class imbalance consistently leads to decreased performance on under-represented classes, specifically in terms of low sensitivity; 2) We identify the shift in the logit distribution of unseen test samples of under-represented classes as a result of overfitting under class imbalance; 3) Based on our observations, we propose simple yet effective asymmetric variants of five loss functions and regularization techniques which are explicitly designed to change the network behavior, yielding improved segmentation accuracy for the under-represented classes.

This article is an extension of our earlier work presented at MICCAI 2019 [27].

Z. Li, K. Kamnitsas and B. Glocker are with the BioMedIA Group, Department of Computing, Imperial College London, SW7 2AZ, United Kingdom. E-mail: [email protected].
We extend our previous work on multiple aspects: 1) We provide a more detailed analysis including experiments on two additional datasets; 2) We further include a 3D U-Net to confirm that our observations hold across different network architectures; 3) We explore the proposed training objectives with the Dice loss in addition to cross-entropy; 4) We enrich the experiments by adding comparisons when training with F-score, extend experiments to multi-class segmentation, and also evaluate another regularization method which is noted as asymmetric augmentation. Our findings here confirm our initial observations about the biased behavior of neural networks. The behavior of logit distribution shift is consistently observed across different types of data, tasks and architectures. Our work highlights the importance of the issue of overfitting under class imbalance. The quantitative evaluation further supports our proposal of taking class imbalance into account when designing the learning objective.

II. RELATED WORK
A. Class imbalance
Class imbalance, which has been the focus of previous works [4], [19], is a common issue in image classification and image segmentation. Compared with the literature on class imbalance, a key contribution of this study is the focus on the model behaviour when it overfits to the under-represented classes, with a detailed analysis and potential solutions. In the following, we discuss related work categorized by different methodological approaches.
1) Re-weighting:
A common approach to tackle class imbalance is class-level re-weighting, which assigns higher weights or higher sampling probability to the under-represented classes based on sample frequency [41], [46] or advanced rules [9]. In this study, we explore re-weighting as a baseline approach in all experiments, where we train the models with patches which are separately sampled from different classes with the same probability. Beyond that, sample-level re-weighting strategies have also been proposed to build a balanced model. For example, hard sample mining was proposed to avoid the dominant effect of majority classes [11]. Similarly, focal loss and its variants were proposed to weight difficult samples over easy samples [1], [14], [29], [43] to steer the learning towards small objects. However, the under-represented samples are not necessarily difficult to predict during training. In fact, as we show empirically, the training samples of the under-represented classes are learned well due to overfitting. In this case, we find that a focal loss may even decrease the performance when processing imbalanced datasets because it reduces the focus on the under-represented samples. Therefore, in this study, we improve upon focal loss by removing the attenuation of under-represented classes. Margin-based loss functions were proposed to learn discriminative embeddings and are widely adopted in metric learning and face recognition [10], [31]. Margin losses can also be seen as a kind of re-weighting approach which changes the magnitude of the gradient of the network output by multiplying a scalar, as we show in the supplementary Section VII. In this study, we propose to only assign margins for the under-represented classes. The design of uneven margins for imbalanced datasets was first proposed in [26] for the perceptron.
Recently, Large Margin Local Embedding (LMLE) was proposed to put more constraints on the under-represented classes by only applying multiple margins to the minority classes, with a computationally expensive metric-learning based framework [17]. More recently, two concurrent studies were also proposed to set larger margins for the under-represented classes from the perspective of uncertainty [23] or generalization bound [5], [49]. In this study, we empirically show that one should not assign margins for the over-represented classes, based on the observations of asymmetric logit distribution under class imbalance.
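The class-level re-weighting baseline described above (training patches sampled from each class with equal probability) can be sketched in a few lines. This is an illustrative NumPy sketch with a synthetic label map, not the exact sampler used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_balanced_centers(label_map, n_patches):
    """Pick patch centers so that each class is chosen with equal
    probability, regardless of how many pixels it occupies."""
    classes = np.unique(label_map)
    centers = []
    for _ in range(n_patches):
        c = rng.choice(classes)                 # uniform over classes
        idx = np.argwhere(label_map == c)       # all pixels of that class
        centers.append(tuple(idx[rng.integers(len(idx))]))
    return centers

# toy label map: foreground occupies only 1% of the pixels
seg = np.zeros((100, 100), dtype=int)
seg[45:55, 45:55] = 1                           # 100 of 10,000 pixels
centers = sample_balanced_centers(seg, 200)
fg_frac = np.mean([seg[c] for c in centers])
print(fg_frac)  # close to 0.5 despite the 99:1 pixel imbalance
```

About half of the sampled patch centers land on the foreground class, even though it covers only 1% of the image.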
2) Data synthesis:
Our work is related to data synthesis methods [6], [13] which generate synthetic samples of the minority classes based on intra-class relationships between samples to increase the variance of the under-represented class. In addition, we create synthetic samples in the latent feature space rather than image space and provide two new ways to synthesize samples of the under-represented classes for modern machine learning models. We also propose to adopt stronger data augmentation for the under-represented classes by changing the augmentation probabilities to alleviate overfitting.
3) Other methods:
The above-mentioned methods are all based on changing the training data distribution to tackle class imbalance. In contrast, some other approaches try to counter class imbalance by modifying the training strategy. Specifically, [15] first trained their model with data which is sampled from each class with the same probability. Thereafter, they retrain only the output layer with uniformly sampled data while freezing all other network parameters. In this way, they could separately learn a diverse representation and a classifier for the realistic data distribution. Similar strategies, which aim to change the decision boundary at test time, have also been proposed recently for long-tailed recognition [21], [38], [48]. These approaches are complementary to our proposed solutions and could be combined. Other learning paradigms such as meta-learning [42] and transfer learning [32] were also recently proposed for long-tail learning, but these are outside the scope of this paper.
4) Segmentation:
The problem of class imbalance in image segmentation is different from that in image recognition [32], because the dominating class in image segmentation is the background class with diverse characteristics, and its segmentation accuracy is highly robust. In contrast, the accuracy for the majority classes in long-tailed image recognition can degrade with common techniques such as re-weighting [21]. In addition, the evaluation of segmentation performance mostly relies on the foreground classes, and therefore, the focus is on improving accuracy in those classes. For example, recent studies proposed to provide a better trade-off between sensitivity and precision for segmentation [14], [33]. However, these strategies yield little improvement when processing highly imbalanced datasets, as shown in our experiments. This is because a deep neural network may achieve near-perfect training accuracy even for the under-represented samples without benefitting from the modified loss function. Class imbalance in image segmentation has also been approached via a boundary loss [22]. However, it is only applicable to segmentation. The authors show promising results with sufficient training data, but the model may still be prone to overfit the under-represented class with a limited dataset. Other work adopted multi-stage approaches with candidate proposals and background suppression [36], [39]. However, the candidate prediction process may still suffer from class imbalance. In addition, any missed candidates in one stage cannot be recovered in a later stage. In contrast, the solutions proposed here are general loss functions that can be incorporated into any model or learning approach and are applicable beyond image segmentation.
B. Regularization techniques
To improve generalization of deep neural networks, a number of regularization techniques are available. This includes dropout [37], weight decay [24], data augmentation [7], [8], data mixing [45], [47], and adversarial training [12], [44]. However, most of these techniques were proposed for general image classification tasks where class imbalance is not explicitly addressed. It is also unclear how these techniques affect the network behavior in this setting.

III. OVERFITTING UNDER CLASS IMBALANCE AND ITS EFFECT ON SEGMENTATION PERFORMANCE
To explore the effects of overfitting on the network behavior, we train CNNs using different amounts of data, on segmentation tasks that exhibit strong class imbalance. We conduct experiments on challenging segmentation tasks using data from the Multimodal Brain Tumor Image Segmentation (BRATS) challenge [2], the Anatomical Tracings of Lesions After Stroke (ATLAS) dataset [28], small organ segmentation (data from [25]) and Kidney Tumor Segmentation (KiTS) [16]. The statistics of those four datasets are summarized in Table I. To ensure our findings generalize across models, in our investigation we employ two convolutional network architectures that have proven potent in a variety of segmentation tasks: we employ a DeepMedic architecture [20] for the experiments on brain lesions and multi-organ segmentation tasks, on which it has previously shown high performance [20], [35], and a well-configured 3D U-Net [18] for the experiments on kidney tumor segmentation on KiTS19 data, which is the base model of the winning entry of the KiTS19 challenge [16]. The detailed network configurations are summarized in Section V.
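The class imbalance ratios reported in Table I follow the per-image definition given in the introduction (background pixels divided by foreground pixels, averaged over the dataset). As a minimal illustration with a synthetic label map (not taken from any of the datasets):

```python
import numpy as np

def imbalance_ratio(label_map, background=0):
    """Per-image class imbalance ratio: background pixels / foreground pixels.
    The dataset-level ratio in Table I is the average of this over all images."""
    n_bg = np.sum(label_map == background)
    n_fg = np.sum(label_map != background)
    return n_bg / n_fg

# toy 10x10 label map with a 2x2 foreground structure
seg = np.zeros((10, 10), dtype=int)
seg[4:6, 4:6] = 1
print(imbalance_ratio(seg))  # 96 background / 4 foreground = 24.0
```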
TABLE I
The statistics and class imbalance ratios of the four datasets used in this study. Class imbalance ratio is defined as the average ratio between the number of the background (BG) pixels and the foreground (FG) pixels over all images.
Columns: Dataset | Total FG pixels | Total BG pixels | Class imbalance ratio (avg. ± std.). [Numeric entries for the four datasets were lost in extraction.]

The observations on the test and training set are summarized in Fig. 1. With less training data, we notice a clear decrease of segmentation accuracy on test data while the accuracy on training data increases due to easier overfitting, as expressed by DSC (defined as DSC = 2 · sensitivity · precision / (sensitivity + precision)). We observe that overfitting leads to a reduction of sensitivity while precision remains largely stable. In all settings and tasks, the specificity of the foreground always remains near-perfect.

A. Logit distribution shift
To obtain a better understanding of the network behavior after training on imbalanced data, we monitor the logit distribution when processing training and unseen test samples. The observations we make for the tasks of brain tumor core, kidney tumor and brain stroke lesion segmentation are summarized in Fig. 3, 4 and 5, respectively. We notice that the logit distribution of foreground samples shifts significantly towards and even across the decision boundary, while the logit distribution of background samples remains stable. The shift of the foreground logits results in a higher number of false negatives, which causes a drastic decrease of sensitivity (calculated as TP / (TP + FN)). This biased logit shift under class imbalance may also occur in other tasks such as image classification. However, it is particularly prevalent in image segmentation with small structures-of-interest.

We find that this shift of logits correlates with how much a model overfits to the under-represented class. Training with less data leads to more overfitting, and the logit distribution shift becomes larger. Moreover, we find that the logit shift also correlates with the size of structures represented by the foreground class. The rarest class shifts the most, as shown in the right part of Fig. 4.

In image segmentation, a CNN is optimized to push the logits of different classes away from each other and far from the decision boundary. It is relatively easy for a deep CNN to build an embedding for the training samples from the under-represented class, because it just needs to build a set of case-specific filters to facilitate memorization. For example, as a CNN will only observe very few training samples of the foreground class, a CNN can dedicate specific model parameters to memorize all foreground samples, even if the individual patterns are rather complex. Specifically, we find a CNN seems to be more confident about foreground samples during training, mapping them farther away from the decision boundary when overfitting, as shown in Fig. 3, Fig. 4 and Fig. 5. However, these tailored filters will not generalize to unseen test data. Therefore, the activations for test samples of the under-represented class are smaller in magnitude (sub-optimal pattern matching of filters and unseen samples), leading to the observed distribution shift. In contrast, a CNN has to build generic filters for a well-represented class to represent many different characteristics of the same class, leading to good generalization. Such filters will map unseen samples to similar locations in logit space and no shift between the embeddings of training and test samples is observed.
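The per-class logit shift Δẑ = |ẑ_test| − |ẑ_train| plotted in the right parts of Figs. 3-5 is simple to compute once logits are collected. A sketch with synthetic numbers (the arrays stand in for logits gathered from a trained model):

```python
import numpy as np

def mean_logit_shift(train_logits, test_logits):
    """Shift of the mean logit magnitude between training and unseen test
    samples of one class; negative values mean the test logits moved
    towards the decision boundary."""
    return np.abs(np.mean(test_logits)) - np.abs(np.mean(train_logits))

# synthetic foreground logits: confident at training time, shifted at test time
fg_train = np.array([3.0, 4.0, 5.0])
fg_test = np.array([0.5, 1.0, -0.5])     # one sample crossed the boundary
bg_train = np.array([-4.0, -5.0, -6.0])
bg_test = np.array([-4.5, -5.0, -5.5])   # background barely moves

print(mean_logit_shift(fg_train, fg_test))  # large negative shift
print(mean_logit_shift(bg_train, bg_test))  # close to zero
```

The asymmetry of the two printed values mirrors the behavior described above: only the under-represented class drifts towards the boundary.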
As a result of class imbalance and overfitting, a CNN may underperform on the under-represented class while still generalizing well for the well-represented class.

While the negative effect of class imbalance and overfitting on model performance is well known, to our knowledge there has been little work investigating the specifics of how the network behavior is affected. Only by understanding better
Fig. 1. Performance on brain tumor core, brain stroke lesion, small organs, kidney and kidney tumor segmentation with varying amounts of training data. The foreground (FG) and background (BG) samples are highly imbalanced, as noted below each subfigure. With less training data, performance drops due to the decrease of sensitivity, while the precision is largely retained.
[Fig. 2 panels: BRATS w/ DeepMedic (red: brain tumor core); ATLAS w/ DeepMedic (red: brain stroke lesion); KiTS w/ 3D U-Net (red: kidney tumor, blue: kidney); abdominal organs w/ DeepMedic (red: gallbladder, blue: aorta, purple: vena cava, magenta: vein). Columns show the image, ground truth, and predictions with decreasing amounts of training data, with and without our regularization.]
Fig. 2. Visualization of different datasets and segmentation results with different portions of training data. With less training data, the models are prone to under-segment the under-represented classes. The proposed regularization methods can alleviate the overfitting of under-represented classes and provide segmentation results with higher sensitivity and overall accuracy. Best viewed in color.

the implication can we devise mitigation strategies. Previous loss functions and regularization techniques that aim to prevent overfitting did not take the behavior that we observe into account, and thus show limited success for improving segmentation accuracy in the setting of limited data with strong class imbalance. Here, we propose solutions via new asymmetric variants of existing objective functions, leading to better feature embeddings for the under-represented samples and significant improvements in segmentation accuracy for small structures-of-interest.

IV. TACKLING OVERFITTING UNDER CLASS IMBALANCE WITH ASYMMETRIC OBJECTIVE FUNCTIONS
Based on our observations above about the biased behavior of CNNs, we design modifications to existing loss functions and training strategies to prevent the logit distribution shift. Specifically, we add a bias for the under-represented class. Although the original techniques were proposed for different purposes, our modifications share a common goal: keep the logit activations of the under-represented class away from the decision boundary. Even if the logit of a foreground sample shifts towards the decision boundary, as long as it does not cross it, its prediction remains correct (cf. Fig. 6).
A. Asymmetric large margin loss
Fig. 3. (Left part) Activations of the classification layer (logit z for background, logit z for brain tumor core) when processing (top) tumor and (bottom) background samples of BRATS with DeepMedic, using different amounts of training data. The CNN maps training and testing samples of the background class to similar logit values. However, the mean activation for testing data shifts significantly for the tumor class towards and sometimes across the decision boundary. (Right part) The shift of the mean value of logits observed when processing training and testing data (Δẑ = |ẑ_test| − |ẑ_train|).

Fig. 4. (Left part) Activations of the classification layer (logit z for background, logit z for kidney, logit z for kidney tumor) when processing (top) tumor, (middle) kidney and (bottom) background samples of KiTS with 3D U-Net, using different amounts of training data. The CNN also fails to map the training and testing samples of the tumor class to a similar position. (Right part) The shift of the mean value of logits.

We consider a CNN for the task of semantic segmentation. For a training dataset \{(x_i, y_i)\}_{i=1}^{N} with N samples, we denote a training sample with x_i and its corresponding one-hot vector y_i. If c is the total number of classes of the task, y_i has c elements, with its j'th element y_{ij} ∈ {0, 1} corresponding to the j'th class. y_{ij} equals 1 if j is the real class of x_i, or 0 otherwise. With this notation, the cross-entropy (CE) loss can be written as:

L_{CE}(x_i, y_i) = -\sum_{j=1}^{c} y_{ij} \log(p_{ij}),   (1)

where p_{ij} is the predicted probability by the network that the real class of x_i is j. (We formulate CE as a sum over classes to make class-specific modifications.) Probability p_{ij} is commonly obtained via a softmax function over the c activations \{(z_{ij})_{j=1}^{c} ∈ R^c\} that the network outputs for x_i at its last layer. These activations are called the logits. With this, p_{ij} is given by:

p_{ij} = \frac{e^{z_{ij}}}{\sum_{j=1}^{c} e^{z_{ij}}}.   (2)

Besides CE, the smooth version of the DSC metric is an alternative choice for the loss function, which is widely used for medical image segmentation [33]. The DSC loss can be calculated in the form of 1 − DSC = (FP + FN) / (2TP + FP + FN), which is:
L_{DSC}(x_i, y_i) = \sum_{j=1}^{c} \frac{(1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})}{2 y_{ij} p_{ij} + (1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})}.   (3)

Fig. 5. (Left part) Activations of the classification layer when processing lesion samples of ATLAS with DeepMedic, using different amounts of training data. (Right part) The shift of the mean value of logits.

The large margin loss was proposed for increasing the Euclidean distances between logits of different classes to learn discriminative features [40]. Symmetrically, it is implemented by adding a margin on the logits of every class:

L_{CE}^{M}(x_i, y_i) = -\sum_{j=1}^{c} y_{ij} \log(q_{ij}),   (4)

in which we require:

q_{ij} = \frac{e^{z_{ij} - y_{ij} m}}{\sum_{j=1}^{c} e^{z_{ij} - y_{ij} m}},   (5)

where m is a hyper-parameter for the margin. Although the large margin loss encourages the model to map different classes away from each other, the decision boundary remains in the center. According to our observations, class imbalance causes shifts of unseen foreground samples towards the background class. To mitigate this, a regularizer may aim to move the decision boundary closer to the background class. Our asymmetric modification only sets the margin for the rare classes. We define r as a one-hot vector with c elements, with its j'th element r_j ∈ {0, 1} corresponding to the j'th class, and r_j equals 1 if j is taken as the rare class. With the indication of r, we derive the asymmetric large margin loss as:

\hat{L}_{CE}^{M}(x_i, y_i) = -\sum_{j=1}^{c} y_{ij} \log(\hat{q}_{ij}),   (6)

where we require:

\hat{q}_{ij} = \frac{e^{z_{ij} - y_{ij} r_j m}}{\sum_{j=1}^{c} e^{z_{ij} - y_{ij} r_j m}}.   (7)

In this study, we define r_j as 1 for the foreground samples and 0 for the background samples. In other applications, r_j can also be defined as a continuous variable indicating the rarity of the classes, with r_j ∈ [0, 1], for the methods in Sections IV-A, IV-B and IV-C.
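For intuition, the asymmetric large margin CE of Eqs. (6)-(7) can be sketched for a single sample in NumPy; the logits below are illustrative, and setting r to all zeros recovers the plain CE of Eqs. (1)-(2):

```python
import numpy as np

def asym_margin_ce(z, y, r, m=2.0):
    """Asymmetric large margin CE (Eqs. 6-7): the margin m is subtracted
    from the true-class logit only for classes marked rare (r_j = 1)."""
    z_m = z - y * r * m                  # shift logits of the rare true class
    e = np.exp(z_m - np.max(z_m))        # stabilised softmax, Eq. (7)
    q = e / e.sum()
    return -np.sum(y * np.log(q))        # Eq. (6)

z = np.array([1.0, 1.5])   # logits for (background, foreground)
y = np.array([0.0, 1.0])   # a foreground sample
r = np.array([0.0, 1.0])   # only the foreground class is rare

# the margin inflates the loss for rare-class samples, pushing their
# logits further from the decision boundary during training
print(asym_margin_ce(z, y, r))             # margin applied
print(asym_margin_ce(z, y, np.zeros(2)))   # r = 0: plain CE
```

For a background sample (y = (1, 0)), y * r is zero, so the loss is untouched; this is exactly the asymmetry of Eq. (7).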
Similarly, the symmetric and asymmetric large margin losses for the DSC loss can be derived by substituting equation 5:

L_{DSC}^{M}(x_i, y_i) = \sum_{j=1}^{c} \frac{(1 - y_{ij}) q_{ij} + y_{ij}(1 - q_{ij})}{2 y_{ij} q_{ij} + (1 - y_{ij}) q_{ij} + y_{ij}(1 - q_{ij})},   (8)

and

\hat{L}_{DSC}^{M}(x_i, y_i) = \sum_{j=1}^{c} \frac{(1 - y_{ij}) \hat{q}_{ij} + y_{ij}(1 - \hat{q}_{ij})}{2 y_{ij} \hat{q}_{ij} + (1 - y_{ij}) \hat{q}_{ij} + y_{ij}(1 - \hat{q}_{ij})}.   (9)

B. Asymmetric focal loss
The focal loss was proposed for small object detection by reducing the weight of well-classified samples and focusing on samples which are near the decision boundary [29]. It adds attenuation inside the loss function based on the logit activations:

L_{CE}^{focal}(x_i, y_i) = -\sum_{j=1}^{c} (1 - p_{ij})^{\gamma} y_{ij} \log(p_{ij}),   (10)

where γ is the hyper-parameter to control the focus. The symmetric focal loss prevents logits from becoming too large and makes every class stay near the decision boundary. However, this makes it likely for the unseen foreground samples to shift across the decision boundary. We remove the loss attenuation for the foreground class to keep it away from the decision boundary:

\hat{L}_{CE}^{focal}(x_i, y_i) = \sum_{j=1}^{c} \Big( -r_j y_{ij} \log(p_{ij}) - (1 - r_j)(1 - p_{ij})^{\gamma} y_{ij} \log(p_{ij}) \Big).   (11)

Inspired by the focal loss [29], related work integrates a similar attenuation term into the DSC loss [1], [43]. In practice, we find that the logarithmic DSC loss [1] significantly changes the magnitude of the DSC loss, making it difficult to combine with other losses. The attenuation in the focal Tversky loss [43] is very large and may overly suppress the easier class. Here, we propose another form of DSC loss with an adaptive weight preserving a similar loss magnitude.
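The asymmetric focal CE of Eq. (11) can likewise be sketched per sample; the probabilities below are illustrative:

```python
import numpy as np

def asym_focal_ce(p, y, r, gamma=2.0):
    """Asymmetric focal CE (Eq. 11): the (1 - p)^gamma attenuation is kept
    for non-rare classes (r_j = 0) but removed for rare ones (r_j = 1)."""
    weight = r + (1.0 - r) * (1.0 - p) ** gamma
    return -np.sum(weight * y * np.log(p))

p = np.array([0.3, 0.7])   # predicted probabilities (background, foreground)
y = np.array([0.0, 1.0])   # a foreground sample
r = np.array([0.0, 1.0])   # foreground is the rare class

# the rare class keeps the full CE term: loss equals -log(0.7)
print(asym_focal_ce(p, y, r))
# symmetric focal loss (r = 0) attenuates the same sample by (1 - 0.7)^2
print(asym_focal_ce(p, y, np.zeros(2)))
```

The second value is much smaller, illustrating why the symmetric focal loss down-weights exactly the foreground samples we want the model to keep focusing on.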
Specifically, we add the attenuation term to the false negatives part of the function and prevent the network from being too confident about its prediction:

L_{DSC}^{focal}(x_i, y_i) = \sum_{j=1}^{c} \frac{(1 - y_{ij}) p_{ij} + (1 - p_{ij})^{\gamma} y_{ij} (1 - p_{ij})}{2 y_{ij} p_{ij} + (1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})}.   (12)

Compared with the original version of the CE loss, this formulation for the DSC loss has a similar effect of reducing the penalty for the well-classified samples while keeping the magnitude of the loss similar to the original one, as shown in Supplementary Fig. 9. We refer to this as the focal DSC loss in the following. Similarly, the asymmetric version of the focal DSC loss is derived by removing the attenuation term for the foreground class:

\hat{L}_{DSC}^{focal}(x_i, y_i) = \sum_{j=1}^{c} \Big( \frac{(1 - y_{ij}) p_{ij} + r_j y_{ij} (1 - p_{ij})}{2 y_{ij} p_{ij} + (1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})} + \frac{(1 - y_{ij}) p_{ij} + (1 - r_j)(1 - p_{ij})^{\gamma} y_{ij} (1 - p_{ij})}{2 y_{ij} p_{ij} + (1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})} \Big).   (13)

Fig. 6. Illustration of the proposed asymmetric modifications for the existing loss functions and regularization techniques. We keep the logit activations of the foreground class far away from the decision boundary by setting a bias for the foreground class in different ways.

C. Asymmetric adversarial training
Adversarial training was proposed to learn more robust classifiers by training with difficult samples [12]. The network is trained by considering adversarial samples as additional training data [34], [44]:

L_{adv}(x_i, y_i) = L(x_i, y_i) + L\Big(x_i + l \cdot \frac{d_{adv}}{\|d_{adv}\|}, y_i\Big),   (14)

with

d_{adv} = \arg\max_{d; \|d\| < \epsilon} L(x_i + d, y_i).   (15)

Here, d_{adv} is the direction of the generated adversarial samples, and l and ε are the magnitude and the range of the adversarial perturbations, respectively. L is the chosen loss function, which can be L_{CE} and/or L_{DSC}. Similar to the large margin loss, symmetric adversarial training preserves the decision boundary and may cause difficulties for unseen foreground samples, which tend to shift towards the background class. Our proposed asymmetric adversarial training aims to produce a larger space between the foreground class and the decision boundary. Specifically, we generate samples by considering more of the rare classes:

\hat{d}_{adv} = \arg\max_{d; \|d\| < \epsilon} L(x_i + d, y_i \odot r) \,\Big|\, y_i \cdot r > 0,   (16)

where "⊙" refers to the element-wise product and "·" refers to the dot product.

D. Asymmetric mixup
Mixup is a simple yet effective data augmentation algorithm to improve generalization by generating extra training samples from linear combinations of pairs of images and their labels [47]:

L_{mixup}(x_i, y_i, x_k, y_k) = L(x_i, y_i) + L(\tilde{x}_i, \tilde{y}_i),   (17)

where (\tilde{x}_i, \tilde{y}_i) is the generated training sample:

\tilde{x}_i = \lambda x_i + (1 - \lambda) x_k, \quad \tilde{y}_i = \lambda y_i + (1 - \lambda) y_k.   (18)

Here, λ is randomly selected based on a beta distribution, and (x_k, y_k) is another random training sample. Mixup regularizes the model by centering the decision boundary between classes, which helps very little in our setting. Different from the original mixup, which generates samples with soft labels, our modification generates hard labels by considering augmented samples that are near the foreground samples as foreground class. Asymmetric mixup can keep the decision boundary away from the foreground class and increase the area of the foreground logit distribution. This prevents unseen under-represented samples from shifting across the decision boundary. Specifically, the mixed image \tilde{x}_i, which has a certain distance from the background class, is taken as a foreground sample:

\hat{\tilde{y}}_i = \begin{cases} y_i & \text{if } (\lambda > m \text{ and } y_i \cdot r (1 - y_k \cdot r) == 1) \text{ or } y_i == y_k, \\ y_k & \text{if } (1 - \lambda > m \text{ and } y_k \cdot r (1 - y_i \cdot r) == 1) \text{ or } y_i == y_k, \\ \emptyset & \text{otherwise}, \end{cases}   (19)

where m is the margin to guarantee that the augmented samples do not get too close to background samples. In practice, we do not update the model using training samples with \hat{\tilde{y}}_i = \emptyset.

E. Asymmetric augmentation
In order to extend the latent space of the foreground class, we also evaluate a simple method to compensate class imbalance by adjusting the magnitude of augmentation for the different classes. Standard data augmentation preserves the label $y_i$ and applies the same set of heuristic transformations, such as scaling and rotation, to the original training sample $x_i$ regardless of its class. The generated training sample $\tilde{x}_i$ is obtained as:

$$\tilde{x}_i = A(x_i), \quad (20)$$

where $A$ is the chosen transformation, applied with a certain probability. When the dataset is highly imbalanced, adding more synthesized background samples is not necessary. Our simple variant of data augmentation reduces the number of transformed samples for the background classes. In this asymmetric setting, the generated sample is obtained as:

$$\hat{\tilde{x}}_i = \begin{cases} A(x_i) & \text{if } y_i \cdot r == 1, \\ A_{small}(x_i) & \text{otherwise}, \end{cases} \quad (21)$$

where $A_{small}$ applies the transformations with a smaller probability.

F. The combination of asymmetric techniques
The above-mentioned modifications introduce more variance for the under-represented classes in the latent space or the image space, adding a bias towards the foreground class from different perspectives. In practice, some or all of the techniques can be integrated into a single model to combat overfitting under class imbalance. Specifically, we first generate different sets of augmented samples following the asymmetric adversarial training, the asymmetric mixup and the asymmetric augmentation of Equations (16), (19) and (21), separately. The network can then be optimized on the extended training set with a loss function that combines the asymmetric large margin loss and the asymmetric focal loss:

$$\hat{L}_{CE}^{combine}(x_i, y_i) = \sum_{j=1}^{c} \Big( -r_j\, y_{ij} \log(\hat{q}_{ij}) - (1 - r_j)(1 - \hat{q}_{ij})^{\gamma}\, y_{ij} \log(\hat{q}_{ij}) \Big). \quad (22)$$

A combined DSC loss can be formulated in a similar way.

V. EXPERIMENTS
A. Experimental setup
We demonstrate the effect of our proposed modifications on a variety of medical image segmentation tasks using different models and training scenarios. Here, we summarize the dataset splits and experimental settings, which are kept the same as in the motivational experiments in Section III. We keep the hyper-parameters of the methods the same for the original baselines and our modified techniques. The hyper-parameters are summarized in Supplementary Tables V, VI and VII. Additionally, we conduct a sensitivity analysis of all the hyper-parameters and summarize the results in Supplementary Table VIII. We also provide the source code for our experiments: https://github.com/ZerojumpLine/OverfittingUnderClassImbalance
1) Brain tumor segmentation:
We first evaluate the asymmetric techniques for the case of binary brain tumor core segmentation using the DeepMedic network architecture, a well performing method for this task [20]. To investigate the behavior under overfitting and to better isolate the effect of the objective functions, we do not use dropout, weight decay or data augmentation in this experiment. We train the network with the CE loss, unless otherwise specified. By default, we sample 50% of the training samples from the foreground class. We conduct experiments using the training dataset of the BRATS2017 dataset [2], which contains 285 four-modality Magnetic Resonance (MR) images. The MR images all have the same voxel spacing of 1.0 × ×
2) Brain stroke lesion segmentation:
We also evaluate the asymmetric techniques for the case of brain stroke lesion segmentation [28], again using DeepMedic. Here, we use a more realistic setting, employing standard regularization techniques including dropout, weight decay and data augmentation, as in the original work where the model achieved high performance for stroke lesion segmentation [20]. We implement our asymmetric techniques with the default training setting and default network architecture. The augmentation includes small intensity shifts and flipping in the sagittal plane with probability 0.5. The network is always trained with the CE loss. We conduct experiments using the ATLAS dataset [28], which contains 220 T1-weighted MR images. The MR images have the same voxel spacing of 1.0 × ×
3) Small organ segmentation:
For organ segmentation in Section III, we use a default DeepMedic network. We conduct experiments using the training dataset of the abdominal organ segmentation challenge [25], which contains 30 computed tomography (CT) scans. We train the network to segment thirteen abdominal organs. We test on 10 cases and train models using 20 cases (100%) and 5 cases (25% of the training set). We resample all the CT images to a common voxel spacing of 2.0 × ×
4) Kidney tumor segmentation:
In addition, we evaluate the asymmetric techniques for the case of kidney tumor segmentation. We train a well-configured 3D U-Net [18] which includes extensive data augmentation with scaling, rotation, brightness, contrast, gamma and Gaussian noise augmentations with a predefined policy [16]. We also trained DeepMedic with similar augmentation strategies, which yielded lower accuracy on this task; therefore, we evaluate the asymmetric regularization techniques on the U-Net for kidney tumor segmentation. The task includes the segmentation of both the kidney and the kidney tumor. As the segmentation of the kidney is relatively easy, in this experiment we only focus on tumor segmentation and only take the kidney tumor as the foreground class to implement
TABLE II
Evaluation of brain tumor core segmentation using DeepMedic with different amounts of training data and different techniques to counter overfitting. The results are calculated with post-processing. Results which have worse DSC than the vanilla baseline are highlighted with shading. Best and second best results are in bold, with the best also underlined.

Method | 5% training (DSC SEN PRC HD) | 10% training | 20% training | 50% training
Vanilla - CE [20] | 50.4 41.0 83.5 18.0 | 62.5 56.0 83.1 14.3 | 64.9 59.8 85.7 13.8 | 69.4 65.4 85.3 15.7
Vanilla - CE - 80% tumor | 45.5 36.0 86.7 17.8 | 61.5 54.2 81.7 18.5 | 65.3 59.6 85.0 15.1 | 68.6 64.1 86.1 14.8
Vanilla - F1 (DSC) | 47.2 37.4 86.6 15.9 | 58.9 51.1 83.6 20.1 | 64.3 58.1 83.5 16.3 | 67.1 62.5 86.5 15.3
Vanilla - F2 [14] | 45.8 36.9 81.9 17.9 | 59.3 52.2 84.9 18.0 | 66.4 61.1 83.4 14.1 | 68.8 66.0 83.4 13.7
Vanilla - F4 [14] | 51.6 42.5 83.8 18.1 | 59.6 53.0 82.9 18.4 | 65.9 61.9 85.4 14.2 | 67.5 64.5 84.9 13.7
Vanilla - F8 [14] | 47.4 38.7 83.1 19.6 | 59.8 52.4 87.0 15.4 | 64.5 60.3 85.2 14.7 | 67.9 65.4 81.6 14.9
Large margin loss [31] | 44.5 35.9 82.8 20.2 | 60.9 53.5 84.0 17.6 | 67.0 61.6 86.1 14.4 | 66.5 62.2 88.1 13.7
Asymmetric large margin loss | 56.8 48.9 83.4
Symmetric combination | 50.0 42.0 84.6 21.1 | 60.3 53.1 84.7 25.1 | 64.1 58.3 86.6 19.1 | 67.2 63.1 86.6 15.1
Asymmetric combination |
TABLE III
Evaluation of brain stroke lesion segmentation on ATLAS based on DeepMedic with different amounts of training data and different techniques to counter overfitting. The results are calculated with post-processing. Results which have worse DSC than the vanilla baseline are highlighted with shading. Best and second best results are in bold, with the best also underlined.

Method | 30% training (DSC SEN PRC HD) | 50% training | 100% training
Vanilla - w/ augmentation [20] | 22.2 18.3 60.9 48.6 | 45.2 40.9 59.7 31.1 | 54.5 49.5 67.3 32.2
Vanilla - w/o augmentation | 15.0 11.7 59.1 51.9 | 40.3 35.9 53.0 40.2 | 51.7 48.0 62.3 31.9
Vanilla - asymmetric augmentation | 22.4 18.8 58.0 50.2 | 47.3 43.4 57.5 32.1 | 56.9 51.9 69.8 28.2
Large margin loss [31] | 18.9 14.8 64.4 48.8 | 45.3 40.7 60.5 36.8 | 55.1 49.4 70.0 28.0
Asymmetric large margin loss | 23.5 19.8 58.6 45.9 | 47.7 44.3 58.4 33.8
Focal loss [29] | 20.4 16.7 62.7 47.9 | 46.9 41.8 61.7 31.4 | 56.0 50.8 69.1 30.9
Asymmetric focal loss | 26.3 22.2 59.0 46.4 | 49.0 47.8 56.3 31.7 | 56.6 63.2 55.6 27.9
Adversarial training [12] | 20.1 16.7 57.3 56.9 | 47.2 41.6 62.6 35.0 | 54.0 48.5 69.7 34.9
Asymmetric adversarial training |

asymmetric techniques. To be specific, we always set $r$ as $[0, 0, 1]^{\top}$. The network is always trained with both the CE and the sample-wise DSC loss, with equal weights for the two losses. We conduct experiments using the training dataset of the KiTS19 dataset [16], which contains 210 CT images. We resample all the CT images to a common voxel spacing of 1.6 × ×

B. Quantitative results
Taking the provided manual segmentations as the ground truth, we calculate DSC, sensitivity (SEN), precision (PRC) and the 95% Hausdorff distance (HD, in mm) to evaluate segmentation accuracy. The initial segmentation results of our method always have higher sensitivity and DSC, but sometimes produce more false positive predictions and therefore worse distance-based metrics such as HD. We argue that in practice this problem can be addressed by connected-component-based post-processing, which is widely adopted in many segmentation methods [18]. Specifically, we assume there is only one target component and suppress all but the largest region. We report results both with and without these post-processing operations. The quantitative segmentation results on the BRATS, ATLAS and KiTS datasets using different amounts of training data with post-processing are summarized in Tables II, III and IV, respectively. The corresponding results without post-processing are summarized in Supplementary Tables X, XI and XII. We also evaluate one of the proposed methods, the asymmetric focal loss, on abdominal organ segmentation, in which multiple classes are considered under-represented; these experiments are summarized in Supplementary Table IX.

Class imbalance affects the segmentation sensitivity of the under-represented class, as shown in Section III. We find that previous attempts to tackle class imbalance do not improve sensitivity, while our asymmetric methods do lead to better results with higher sensitivity across different tasks. This indicates that the proposed methods may effectively mitigate

TABLE IV
Evaluation of kidney and kidney tumor segmentation based on 3D U-Net with different amounts of training data and different techniques to counter overfitting. The results are calculated with post-processing. Results which have worse DSC than the vanilla baseline are highlighted with shading. Best and second best results are in bold, with the best also underlined.

Method (Kidney) | 10% training (DSC SEN PRC HD) | 50% training | 100% training
Vanilla - w/ augmentation [18] | 93.3 91.2 96.9 5.4 | 96.4 95.8 97.1 2.7 | 96.6 96.1 97.3 2.4
Vanilla - w/o augmentation | 92.3 89.3 96.8 12.1 | 96.1 95.6 96.7 2.8 | 96.3 95.8 96.9 2.7
Vanilla - asymmetric augmentation | 94.3 92.2 97.0 5.2 | 94.9 94.5 95.5 5.9 | 96.1 95.8 96.4 3.8
Large margin loss [31] |
Focal loss [29] | 91.4 85.9 99.2 10.6 | 94.1 89.6 99.2 4.2 | 94.3 90.0 99.1 4.2
Asymmetric focal loss | 92.0 86.7 99.0 6.0 | 94.7 90.9 98.9 3.5 | 94.8 90.9 99.1 3.1
Adversarial training [12] | 94.1 91.9 97.3 9.1 | 96.3 95.7 97.1 2.6 | 96.6 96.2 97.2
Asymmetric adversarial training | 94.4 92.5 97.2 5.7
Mixup [47] |
Asymmetric mixup |
Asymmetric combination | 93.5 89.7 98.5 5.2 | 93.9 90.0 98.3 5.3 | 96.7 95.6 97.9

Method (Kidney tumor) | 10% training (DSC SEN PRC HD) | 50% training | 100% training
Vanilla - w/ augmentation [18] | 54.6 46.0 80.0
Vanilla - w/o augmentation | 37.4 31.5 65.6 96.0 | 62.8 58.7 75.9 47.8 | 73.0 69.1 83.4 18.9
Vanilla - asymmetric augmentation | 55.9 48.2 76.4 71.5 | 74.3 70.3 85.2 33.3 | 78.4 76.9 85.7 19.8
Large margin loss [31] | 52.2 44.3 77.2 68.5 | 78.2 74.3 87.8 26.6 | 80.2 79.1 84.5 25.5
Asymmetric large margin loss | 55.5 48.3 77.4 71.6
Focal loss [29] | 47.1 37.5 78.2 74.5 | 73.0 66.0 87.6 40.2 | 79.0 73.2 90.0 20.3
Asymmetric focal loss |

overfitting under class imbalance. These results in turn support our previous analysis of logit shift under class imbalance and indicate that taking this implication into account helps build unbiased networks.
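The overlap metrics (DSC, SEN, PRC) and the largest-component post-processing described above can be sketched in numpy; this is a minimal illustration for 2D binary masks, not the evaluation code used in the paper (the 95% HD is omitted, as it needs a distance transform):

```python
import numpy as np
from collections import deque

def seg_metrics(pred, gt):
    """DSC, sensitivity (SEN) and precision (PRC) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = (pred & gt).sum()
    fp = (pred & ~gt).sum()
    fn = (~pred & gt).sum()
    return {"DSC": 2 * tp / (2 * tp + fp + fn),
            "SEN": tp / (tp + fn),       # fraction of foreground recovered
            "PRC": tp / (tp + fp)}       # fraction of predictions correct

def largest_component(mask):
    """Suppress all but the largest 4-connected foreground region of a
    2D binary mask (the single-target post-processing assumed above)."""
    mask = mask.astype(bool)
    seen = np.zeros_like(mask)
    best = np.zeros_like(mask)
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        comp, q = [], deque([(sy, sx)])   # BFS over one component
        seen[sy, sx] = True
        while q:
            y, x = q.popleft()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    q.append((ny, nx))
        if len(comp) > best.sum():        # keep only the largest region
            best = np.zeros_like(mask)
            for y, x in comp:
                best[y, x] = True
    return best
```

In practice a library routine such as a connected-component labeler would replace the hand-written BFS; the sketch keeps the example dependency-free.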
1) Baseline experiments:
We perform baseline experiments with binary brain tumor core segmentation, as shown in Table II. We show that increasing the sampling weight of tumor samples (from 50% to 80%) decreases performance when the dataset is highly imbalanced. This is because increasing the weight encourages the network to memorize the under-represented samples and may actually lead to more overfitting, thus being counter-productive. Simply changing the objective function to an F-score, defined as

$$F_\beta = \frac{(1+\beta^2)\,\mathrm{sensitivity} \cdot \mathrm{precision}}{\beta^2 \cdot \mathrm{precision} + \mathrm{sensitivity}},$$

which is a balancing loss that weights sensitivity $\beta$-times more than precision [14], [33], shows only little improvement, increasing the sensitivity slightly. Changing the sampling weights or training with a loss function based on F-scores seems to have little impact when foreground training samples are limited, with training accuracy close to 100%, as shown in Fig. 1. A common approach to alleviate under-segmentation is to adjust the thresholds of decision boundaries based on validation sets. In this work, however, we observe a distribution shift on unseen test data, which can differ significantly from the training/validation sets. Hence, a threshold selected on validation data may not be optimal for new test data. Due to the lack of ground truth, it is practically not possible to optimise the decision thresholds for a specific test set.
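For reference, the F-score above can be written as a one-line numpy sketch ($\beta = 1$ recovers F1, i.e. the DSC for binary masks; the sample values in the test are illustrative only):

```python
import numpy as np

def f_beta(sensitivity, precision, beta=1.0):
    """F-score as defined above; larger beta weights sensitivity
    more heavily relative to precision."""
    b2 = beta ** 2
    return (1 + b2) * sensitivity * precision / (b2 * precision + sensitivity)
```

With a sensitive but imprecise prediction (e.g. SEN = 0.8, PRC = 0.4), F2 exceeds F1, which is exactly why the F2/F4/F8 baselines are expected to trade precision for sensitivity.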
2) Asymmetric large margin loss:
The original large margin loss decreases performance in some cases, while our modification yields improvements over the symmetric version in all cases.
3) Asymmetric focal loss:
The original focal loss also decreases the sensitivity in some cases and leads to worse performance. This is because the focal term decreases the weight of foreground samples and pushes their logits closer to the decision boundary, making false negative predictions more likely. Our modification removes the loss attenuation for the under-represented class and improves performance in all cases. We notice that the asymmetric focal loss can make the other class (kidney) overfit more, but this is not the focus of this study and can easily be addressed by keeping the focal term for the background class.
4) Asymmetric adversarial training:
When the network is trained without data augmentation (as shown in Table II), the original adversarial training seems to be effective when little training data is available, while our modifications can further improve the sensitivity and boost the performance substantially.

When the network is trained with data augmentation (as shown in Tables III and IV), we find that the original adversarial training does not improve the performance when training data is limited. This indicates that in this case the samples generated by adversarial training might not add anything on top of intensity augmentation. In contrast, our proposed modifications seem to always help, leading to better segmentation performance.

Fig. 7. Activations of the classification layer when processing tumor (top) and background (bottom) samples of BRATS with DeepMedic, using 5% training data. Panels compare the vanilla model, original/asymmetric large margin loss, original/asymmetric focal loss, original/asymmetric adversarial training, original/asymmetric mixup, and the symmetric/asymmetric combination. Asymmetric modifications lead to better separation of the logits of unseen tumor samples.

Fig. 8. Activations of the classification layer for tumor (top), kidney (middle) and background (bottom) samples of KiTS with 3D U-Net, using 10% training data. Panels compare training w/o augmentation, original/asymmetric augmentation, original/asymmetric large margin loss, original/asymmetric focal loss, original/asymmetric adversarial training, original/asymmetric mixup, and the symmetric/asymmetric combination. Asymmetric modifications also lead to better separation of the logits of unseen tumor samples and are complementary to standard data augmentation.
5) Asymmetric mixup:
We find that the original mixup can be effective for the well-represented class. For example, it always improves the segmentation performance for the kidney, as shown in Table IV. However, the original mixup leads to lower sensitivity for the under-represented class. We find that asymmetric mixup improves the performance on BRATS to a large extent, as shown in Table II. However, we also find it to be less effective for ATLAS and KiTS, which only have one image channel, as shown in Tables III and IV. This may be because the intensity distributions of healthy and lesion regions in ATLAS and KiTS overlap substantially, as shown in Supplementary Fig. 10. In this case, the mixed samples, which are very similar to the background samples in intensity but are taken as foreground samples, may confuse the network. For BRATS, however, four image channels are available, and the healthy and tumor regions show larger differences in the T2 and fluid-attenuated inversion recovery (FLAIR) sequences. The mixed samples seem to take good advantage of this intensity relationship.
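The hard-label rule of Eq. (19) can be illustrated with a minimal numpy sketch for the binary case (label 1 = foreground); in practice `lam` is drawn from a beta distribution and the margin value here is illustrative:

```python
import numpy as np

def asymmetric_mixup(x_i, y_i, x_k, y_k, lam, m=0.3):
    """Mix a pair of samples (Eq. 18) but assign a hard label per
    Eq. 19: the mix keeps a foreground label only if it carries at
    least weight m from a foreground sample; ambiguous mixes get
    label None and are dropped from the update."""
    x_mix = lam * x_i + (1 - lam) * x_k
    if y_i == y_k:                    # same class: label is unambiguous
        return x_mix, y_i
    if y_i == 1 and lam > m:          # mix stays close enough to fg x_i
        return x_mix, 1
    if y_k == 1 and (1 - lam) > m:    # mix stays close enough to fg x_k
        return x_mix, 1
    return x_mix, None                # discarded (label set to empty)
```

Note that a foreground/background mix is never relabeled as background: it is either foreground or discarded, which is precisely how the method keeps the decision boundary near the background class.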
6) Asymmetric augmentation:
Despite its simplicity, we find asymmetric augmentation to be an effective method to improve segmentation performance of the under-represented classes in most cases in terms of DSC and sensitivity, as summarized in Table III. However, we also notice that it can decrease performance when data augmentation is strong and training data is sufficient, as shown in Table IV. This might be because strong asymmetric augmentation drives the model to focus too much on the foreground samples, in turn making the background samples under-represented.
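Equation (21) amounts to a class-dependent augmentation probability; a minimal sketch (the probabilities `p_fg` and `p_bg` are illustrative, not values from the paper):

```python
import numpy as np

def asymmetric_augment(x, y, transform, p_fg=0.9, p_bg=0.3, rng=None):
    """Apply `transform` with a higher probability to foreground
    samples (y == 1) than to background samples, so the augmentation
    budget is spent on the under-represented class (cf. Eq. 21)."""
    rng = rng if rng is not None else np.random.default_rng()
    p = p_fg if y == 1 else p_bg
    return transform(x) if rng.random() < p else x
```

Any standard transformation (scaling, rotation, intensity shift) can be passed as `transform`; only its application probability differs between classes.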
7) The combination of asymmetric techniques:
We also combine the asymmetric techniques which were found to improve segmentation accuracy. The combination of the asymmetric techniques is a safe choice, leading to the overall best segmentation results with improved sensitivity in all cases. In contrast, the combination of the symmetric counterparts is unable to mitigate overfitting and often decreases sensitivity.
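The combined objective of Eq. (22) — margin only for the rare classes, focal attenuation only for the background — can be sketched for a single sample as follows. The sketch assumes the margin-adjusted softmax subtracts the margin $m$ from the rare true-class logit; the hyper-parameter values are illustrative:

```python
import numpy as np

def combined_asymmetric_ce(z, y, r, m=1.0, gamma=2.0):
    """One-sample sketch of Eq. 22. z: logits (c,), y: one-hot label
    (c,), r: indicator vector of the under-represented classes (c,).
    The margin m is applied only to the rare true-class logit
    (asymmetric large margin) and the focal factor (1-q)^gamma only
    to background classes (asymmetric focal)."""
    z_m = z - m * (y * r)               # margin only for a rare true class
    q = np.exp(z_m - z_m.max())
    q = q / q.sum()                     # q-hat: margin-adjusted softmax
    fg_term = -(r * y * np.log(q)).sum()                    # plain CE
    bg_term = -(((1 - r) * (1 - q) ** gamma) * y * np.log(q)).sum()
    return fg_term + bg_term
```

For a foreground sample the margin increases the loss (pushing its logits away from the boundary), while a confidently classified background sample is attenuated by the focal factor.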
C. Logit distribution changes
The effects of all the techniques on the logit distributions of BRATS using 5% training data are presented in Fig. 7. Asymmetric techniques increase the variance of the foreground class and expand its logit distribution. The original large margin loss and adversarial training try to push samples from different classes away from each other; however, the logits of unseen data remain centered around the decision boundary and thus the predictions are not improved. The original large margin loss results in even larger shifts for the foreground samples. With our asymmetric modifications, only the logits of foreground samples are pushed away, and the unseen foreground logits tend to remain on the correct side of the decision boundary. The original focal loss encourages the network to prevent the logits of each class from staying too far from the decision boundary. However, it allows foreground logits to remain near the decision boundary, which can result in false negative predictions on unseen samples. Our asymmetric focal loss removes this constraint for foreground samples. The original mixup encourages symmetric distributions of the different classes but does not consider class imbalance. Asymmetric mixup exploits the latent space, based on the relationship between samples, to generate foreground samples and keeps the decision boundary near the background class. This leads to the largest improvement, increasing the region of the foreground logit distribution and reducing the logit shift of unseen foreground samples. The combination of the four asymmetric techniques stabilizes the logits even further.

The effects of all the techniques on the logit distribution of KiTS using 10% training data are summarized in Fig. 8. The original data augmentation can reduce the logit shift of both unseen kidney and kidney tumor samples, although it is not specifically designed to regularize the logit distribution. The asymmetric augmentation can further reduce the logit shift of the unseen tumor samples. 
The asymmetric large margin loss reduces the tumor logit shift towards the kidney class. Although the logit distribution is already regularized by the strong augmentation, the proposed asymmetric techniques provide further benefits in stabilizing the logits.

VI. CONCLUSION
We study overfitting of neural networks under class imbalance by inspecting network behavior. We observe that when processing unseen under-represented samples, the logit activations tend to shift towards the decision boundary and the sensitivity decreases. This phenomenon is confirmed across a variety of tasks and two popular network architectures. We derive simple yet effective asymmetric variants of existing loss functions and regularization techniques to prevent this overfitting. We show that our proposed methods can substantially improve segmentation performance under class imbalance in terms of DSC and increased sensitivity, outperforming previous solutions. We believe more regularization methods can be derived to alleviate this problem by considering the biased network behavior. We also believe that plotting logit distributions may be a useful network-inspection tool and help gain a better understanding of network behavior under different training scenarios. In future work, we will investigate whether monitoring intermediate activations may provide further insights for other challenging settings such as domain shift or self-supervised learning.

ACKNOWLEDGEMENTS
This work received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 757173, project MIRA, ERC-2017-STG). ZL is supported by the China Scholarship Council (CSC). KK is funded by the UKRI London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare.

REFERENCES

[1] N. Abraham and N. M. Khan. A novel focal tversky loss function with improved attention u-net for lesion segmentation. In , pages 683–687. IEEE, 2019.
[2] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos. Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Sci. Data, 4:170117, 2017.
[3] T. Brosch, L. Y. Tang, Y. Yoo, D. K. Li, A. Traboulsee, and R. Tam. Deep 3d convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation. IEEE Transactions on Medical Imaging, 35(5):1229–1239, 2016.
[4] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.
[5] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pages 1567–1578, 2019.
[6] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[7] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
[8] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.
[9] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9268–9277, 2019.
[10] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[11] Q. Dong, S. Gong, and X. Zhu. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1367–1381, 2018.
[12] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
[13] H. Guo and H. L. Viktor. Learning from imbalanced data sets with boosting and data generation: the databoost-im approach. ACM SIGKDD Explorations Newsletter, 6(1):30–39, 2004.
[14] S. R. Hashemi, S. S. M. Salehi, D. Erdogmus, S. P. Prabhu, S. K. Warfield, and A. Gholipour. Asymmetric loss functions and deep densely-connected networks for highly-imbalanced medical image segmentation: Application to multiple sclerosis lesion detection. IEEE Access, 7:1721–1735, 2018.
[15] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, and H. Larochelle. Brain tumor segmentation with deep neural networks. Medical Image Analysis, 35:18–31, 2017.
[16] N. Heller, N. Sathianathen, A. Kalapara, E. Walczak, K. Moore, H. Kaluzniak, J. Rosenberg, P. Blake, Z. Rengel, M. Oestreich, et al. The kits19 challenge data: 300 kidney tumor cases with clinical context, ct semantic segmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445, 2019.
[17] C. Huang, Y. Li, C. C. Loy, and X. Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375–5384, 2016.
[18] F. Isensee, P. F. Jäger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein. Automated design of deep learning methods for biomedical image segmentation. arXiv preprint arXiv:1904.08128, 2019.
[19] J. M. Johnson and T. M. Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):27, 2019.
[20] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker. Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Med. Image Anal., 36:61–78, 2017.
[21] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis. Decoupling representation and classifier for long-tailed recognition. In International Conference on Learning Representations (ICLR), 2020.
[22] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. B. Ayed. Boundary loss for highly unbalanced segmentation. In International Conference on Medical Imaging with Deep Learning, pages 285–296, 2019.
[23] S. Khan, M. Hayat, S. W. Zamir, J. Shen, and L. Shao. Striking the right balance with uncertainty. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 103–112, 2019.
[24] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems.
ICML, volume 2, pages 379–386, 2002.
[27] Z. Li, K. Kamnitsas, and B. Glocker. Overfitting of neural nets under class imbalance: Analysis and improvements for segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 402–410. Springer, 2019.
[28] S.-L. Liew, J. M. Anglin, N. W. Banks, M. Sondag, K. L. Ito, H. Kim, J. Chan, J. Ito, C. Jung, N. Khoshab, et al. A large, open source dataset of stroke anatomical brain images and manual lesion segmentations. Scientific Data, 5:180011, 2018.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[30] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
[31] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning (ICML), pages 507–516, 2016.
[32] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.
[33] F. Milletari, N. Navab, and S.-A. Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In , pages 565–571. IEEE, 2016.
[34] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.
[35] M. H. Savenije, M. Maspero, G. G. Sikkes, J. R. van der Voort van Zyp, A. N. TJ Kotte, G. H. Bol, and C. A. T. van den Berg. Clinical implementation of mri-based organs-at-risk auto-segmentation with convolutional networks for prostate radiotherapy. Radiation Oncology, 15:1–12, 2020.
[36] A. A. A. Setio, F. Ciompi, G. Litjens, P. Gerke, C. Jacobs, S. J. Van Riel, M. M. W. Wille, M. Naqibullah, C. I. Sánchez, and B. van Ginneken. Pulmonary nodule detection in ct images: false positive reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging, 35(5):1160–1169, 2016.
[37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[38] K. Tang, J. Huang, and H. Zhang. Long-tailed classification by keeping the good and removing the bad momentum causal effect. arXiv preprint arXiv:2009.12991, 2020.
[39] V. V. Valindria, I. Lavdas, J. Cerrolaza, E. O. Aboagye, A. G. Rockall, D. Rueckert, and B. Glocker. Small organ segmentation in whole-body mri using a two-stage fcn and weighting schemes. In MICCAI-MLMI, pages 346–354. Springer, 2018.
[40] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
[41] S. Wang, W. Liu, J. Wu, L. Cao, Q. Meng, and P. J. Kennedy. Training deep neural networks on imbalanced data sets. In , pages 4368–4374. IEEE, 2016.
[42] Y.-X. Wang, D. Ramanan, and M. Hebert. Learning to model the tail. In Advances in Neural Information Processing Systems, pages 7029–7039, 2017.
[43] K. C. Wong, M. Moradi, H. Tang, and T. Syeda-Mahmood. 3d segmentation with exponential logarithmic loss for highly unbalanced object sizes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 612–619. Springer, 2018.
[44] C. Xie, M. Tan, B. Gong, J. Wang, A. Yuille, and Q. V. Le. Adversarial examples improve image recognition. arXiv preprint arXiv:1911.09665, 2019.
[45] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pages 6023–6032, 2019.
[46] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining, pages 435–442. IEEE, 2003.
[47] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.
[48] B. Zhou, Q. Cui, X.-S. Wei, and Z.-M. Chen. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9719–9728, 2020.
[49] P. Zhou and J. Feng. Understanding generalization and optimization performance of deep cnns. In International Conference on Machine Learning, 2018.

SUPPLEMENTARY MATERIAL
VII. THE ANALYSIS OF LARGE MARGIN LOSS FROM A SAMPLE RE-WEIGHTING PERSPECTIVE
Some of the proposed methods are based on two well-known loss functions, the focal loss and the large margin loss. We argue that both can be seen as sample-level re-weighting methods, which change the magnitude of the gradient of the network output by multiplying it with a scalar. Specifically, the focal loss decreases the weights of well-classified samples, while the large margin loss increases the weights of all samples, especially of well-classified ones. Here, we provide an analysis of these loss functions from a sample re-weighting perspective.

Following the formulation in the main text, we first consider a CNN trained using the cross-entropy loss with one sample $x_i$ and its one-hot label $y_i$:

$$L_{CE}(x_i, y_i) = -\sum_{j=1}^{c} y_{ij} \log(p_{ij}) = -y_i \cdot \log(p_i), \quad (23)$$

where $p_i$ is the predicted probability, a normalized form of the network output $z_i$:

$$p_i = \frac{e^{z_i}}{e^{z_i} \cdot \mathbf{1}}. \quad (24)$$

Let us look at the gradient of a general loss function $L(x_i, y_i)$ with respect to the network parameters $\theta$:

$$\frac{\partial L(x_i, y_i)}{\partial \theta} = \frac{\partial L(x_i, y_i)}{\partial z_i} \frac{\partial z_i}{\partial \theta}, \quad (25)$$

where the former term is associated with the design of the loss function and the latter with the network architecture. For the cross-entropy loss $L_{CE}(x_i, y_i)$ we have:

$$\frac{\partial L_{CE}(x_i, y_i)}{\partial z_i} = \frac{\partial L_{CE}(x_i, y_i)}{\partial p_i} \frac{\partial p_i}{\partial z_i} = -\frac{1}{p_i \cdot y_i}\, (p_i \cdot y_i)(y_i - p_i) = p_i - y_i. \quad (26)$$

At the instance level, we can assign different weights to different samples $x_i$ via scalars $w_i$. The weights could be derived from the frequency of samples or other heuristic rules. Typically, we multiply $L_{CE}(x_i, y_i)$ by $w_i$, and the gradient after re-weighting becomes:

$$\frac{\partial \big( w_i L_{CE}(x_i, y_i) \big)}{\partial z_i} = w_i (p_i - y_i). \quad (27)$$
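The cross-entropy gradient and the re-weighting identity above can be checked numerically with finite differences; a minimal sketch on toy logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def num_grad(f, z, h=1e-5):
    """Central finite differences of a scalar loss w.r.t. the logits z."""
    g = np.zeros_like(z)
    for j in range(z.size):
        zp, zm = z.copy(), z.copy()
        zp[j] += h
        zm[j] -= h
        g[j] = (f(zp) - f(zm)) / (2 * h)
    return g

z = np.array([1.5, -0.3, 0.2])   # toy logits
y = np.array([0.0, 1.0, 0.0])    # one-hot label
w_i = 2.5                        # an illustrative instance weight

ce = lambda t: -np.log(softmax(t) @ y)   # cross-entropy (Eq. 23)
p = softmax(z)
# dL_CE/dz = p - y, and re-weighting simply scales this gradient
assert np.allclose(num_grad(ce, z), p - y, atol=1e-4)
assert np.allclose(num_grad(lambda t: w_i * ce(t), z), w_i * (p - y), atol=1e-4)
```

The same check extends to the margin-adjusted loss by replacing `ce` with the loss computed from the margin softmax.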
(27) It can be seen that re-weighting would change the gradientof the network output for different samples and make themodel fit better the chosen samples, which are assigned alarger weight w i .Similarly, we can also derive the gradient of the focal lossas: ∂ (cid:16) L CE focal ( x i , y i ) (cid:17) ∂ z i = ∂ (cid:16) L CE focal ( x i , y i ) (cid:17) ∂ p i ∂ p i ∂ z i = (cid:16) (1 − p i · y i ) γ − γ p i · y i log( p i · y i )(1 − p i · y i ) γ − (cid:17) ( p i − y i )= w i focal ( p i − y i ) , (28)The weight term of focal loss w i focal is a scalar and relatedto the sample probability p i · y i . Generally speaking, w i focal would decrease when p i · y i is large, therefore focal loss wouldmake the model fit the easy cases less.More interestingly, we next look into the effect of large mar-gin loss on the gradient. Large margin loss would change thecalculation of probability for the training process. Specifically,we substitute p i with q i to calculate the loss function, wherewe require: q i = e z i − y i m e z i − y i m · , (29)where m is the hyper-parameter for the margin. In this case,the gradient of large margin loss can be derived as: ∂L CE M ( x i , y i ) ∂ z i = ∂L CE ( x i , y i ) ∂ q i ∂ q i ∂ z i = e z i · e z i − y i m · ( p i − y i ) = w i M ( p i − y i ) . (30)The weight term of large margin loss w i M is also a scalarand related to the network output z i · y i . It can be seen thatthe existence of a margin m would increase the gradient ofsample x i . Moreover, w i M would be larger as z i · y i becomeslarger, therefore large margin loss would make the model fitthe easy cases more, and keep the distribution of z i away fromthe decision boundary.The analysis for L DSC can be done in a similar way.VIII. T
HE MAGNITUDE OF FOCAL
DSC
LOSS
The proposed focal DSC loss has similar behaviour with theoriginal of focal loss, as shown in Figure 9. In addition, it doesnot change the magnitude of loss too much compared withexisting solutions [1], [43], making it easier to be combinedwith other losses. We find it is particularly important forour experiments with 3D U-net [18] because this frameworkadopts a loss function which is a combination of cross entropyand DSC loss.IX. H
YPER - PARAMETERS OF THE REGULARIZATIONTECHNIQUES
We summarize the hyper-parameters in Table V, Table VIand Table VII as a reference for practitioners. We find whenthe asymmetric regularization techniques are combined to-gether, the network could be regularized too much. In this case,
Fig. 9. The comparison of focal loss with cross entropy and DSC loss. Thebehavior of focal loss for cross entropy and DSC loss are similar using theformulation in equation 10 and 12. the model would not converge and even perform poorly on thetraining data. Therefore, we always choose hyper-parameterswith smaller regularization magnitude for the experiments withthe combined regularization. Empirically, we find decreasingthe hyper-parameters of large margin loss and/or focal loss isa sensible choice.
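The re-weighting identities derived in Section VII can be checked numerically. The sketch below (a single sample with three classes; the logit values are arbitrary, chosen only for illustration) verifies equation (26) and the large margin identity of equation (30), i.e. that the margin-softmax gradient $q_i - y_i$ equals $w_M^i (p_i - y_i)$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# One sample with c = 3 classes; the true class has index 1.
z = np.array([2.0, 0.5, -1.0])   # network output (logits), illustrative
y = np.array([0.0, 1.0, 0.0])    # one-hot label
p = softmax(z)                    # equation (24)

# Equation (26): the gradient of cross entropy w.r.t. the logits is p - y.
grad_ce = p - y

# Equation (28): the focal loss multiplies this gradient by a scalar weight
# that shrinks as the true-class probability p·y approaches 1.
gamma = 2.0
pt = p @ y
w_focal = (1 - pt) ** gamma - gamma * pt * np.log(pt) * (1 - pt) ** (gamma - 1)

# Equations (29)-(30): the large margin loss subtracts the margin m from the
# true-class logit before the softmax; its gradient is q - y, which equals
# w_M * (p - y) with w_M >= 1.
m = 1.0
q = softmax(z - y * m)
w_margin = np.exp(z).sum() / np.exp(z - y * m).sum()
assert np.allclose(q - y, w_margin * (p - y))   # the re-weighting identity
```

Note that for a poorly-classified sample (small $p_i \cdot y_i$, as here) the focal weight can exceed 1; it is only for well-classified samples that the focal term suppresses the gradient, which is exactly the behavior described above.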
TABLE V
HYPER-PARAMETERS OF EXPERIMENTS USING DEEPMEDIC WITH BRATS.

BRATS (5% / 10% / 20% / 50% data). [Most cell values were lost in extraction; the surviving entries are listed below.]
Individual: adversarial l: 10 / 10 / 10 / 20; mixup λ (symmetric): 0.2 / 0.2 / 0.2 / 0.2; mixup λ (asymmetric): 1 / 1 / 1 / 1; large margin m, focal γ, adversarial ε, mixup m: values lost.
Combined: adversarial l: 10 / 10 / 10 / 20; mixup λ (symmetric): 0.2 / 0.2 / 0.2 / 0.2; mixup λ (asymmetric): 1 / 1 / 1 / 1; large margin m, focal γ, adversarial ε, mixup m: values lost.

TABLE VI
HYPER-PARAMETERS OF EXPERIMENTS USING DEEPMEDIC WITH ATLAS.

ATLAS (30% / 50% / 100% data). [Most cell values were lost in extraction; the surviving entries are listed below.]
Individual: adversarial l: 10 / 10 / 10; mixup λ (symmetric): 0.2 / 0.2 / 0.2; mixup λ (asymmetric): 1 / 1 / 1; large margin m, focal γ, adversarial ε, mixup m: values lost.
Combined: adversarial l: 10 / 10 / 10; mixup λ (symmetric): 0.2 / 0.2 / 0.2; mixup λ (asymmetric): —— / —— / 1; mixup m: —— / —— / 0.2; probability of background samples being augmented: 50% / 50% / ——.

TABLE VII
HYPER-PARAMETERS OF EXPERIMENTS USING 3D U-NET WITH KITS.

KiTS (10% / 50% / 100% data). [Most cell values were lost in extraction; the surviving entries are listed below.]
Individual: adversarial l: 100 / 50 / 50; mixup λ (symmetric): 0.2 / 0.2 / 0.2; mixup λ (asymmetric): 1 / 1 / 1; large margin m, focal γ, adversarial ε, mixup m: values lost.
Combined: large margin m, focal γ, adversarial ε, adversarial l, mixup λ (symmetric), mixup λ (asymmetric), mixup m: —— / —— / ——; probability of background samples being augmented: 0% / —— / ——.

X. SENSITIVITY ANALYSIS

We conduct a series of controlled experiments with different hyper-parameters to provide more practical details on the proposed regularization techniques. Specifically, we use a baseline DeepMedic model for brain tumor core segmentation with 5% of the BRATS training data. The experimental details are consistent with the descriptions in Section V. We summarize the results with and without post-processing in Table VIII.

We can see from the results that the proposed methods improve the baseline segmentation results under varied hyper-parameters in most cases. Specifically, the asymmetric large margin loss yields improvements in most cases; however, a specific hyper-parameter may yield unexpected results (i.e. m = 0.5). A potential reason is that a model which focuses on a small portion of easy under-represented samples (cf. equation 30 in Section VII) would overfit more. The asymmetric large margin loss with a larger m makes the model emphasize more of the under-represented samples and therefore generalize better. Asymmetric adversarial training and asymmetric mixup yield considerable improvements when the perturbation in data augmentation is larger (i.e. l > 2.5 for asymmetric adversarial training and m < 0.4 for asymmetric mixup). The asymmetric focal loss is robust and improves the segmentation results for all chosen hyper-parameters. Therefore, we recommend choosing the asymmetric focal loss first for new applications.

XI. THE INTENSITY HISTOGRAM OF DIFFERENT DATASETS
Empirically, we find that asymmetric mixup is the most effective method for tumor segmentation with BRATS. However, asymmetric mixup shows limited improvements for ATLAS and KiTS. We believe this is because the multi-channel information in BRATS can create more useful synthetic information, as shown in Figure 10.

XII. THE QUANTITATIVE RESULTS OF ABDOMINAL ORGAN SEGMENTATION
We evaluate one of our proposed techniques, the asymmetric focal loss, on the task of abdominal organ segmentation to demonstrate that our method can feasibly be applied to multi-class segmentation. Specifically, we train a basic DeepMedic model using 25% of the training data, with the same settings as in the empirical experiments of Section III. Considering the class distribution of the dataset, as shown in Figure 11, we take class 4, class 5, class 8, class 9, class 10, class 11, class 12 and class 13 as rare classes. Accordingly, we initialize the indicator vector r with ones at the positions of these rare classes and zeros elsewhere. We use γ = 4 in this experiment. We apply the post-processing described in Section V separately to the results of every class. The results are shown in Table IX. The asymmetric focal loss achieves better overall segmentation results than cross entropy or its symmetric variant. More importantly, it achieves better segmentation results with higher sensitivity for most rare classes. Specifically, the asymmetric focal loss improves the average DSC of the rare classes by 4.9%. We also notice that the asymmetric focal loss decreases the segmentation performance for the esophagus, which is taken as a rare class. This is because the esophagus is very small, and post-processing removes correct segmentation regions by mistake while leaving the false positive predictions. We believe more advanced post-processing would help improve the segmentation in this case.

TABLE VIII
THE SENSITIVITY ANALYSIS OF DIFFERENT HYPER-PARAMETERS. WE CONDUCT EXPERIMENTS WITH DIFFERENT PARAMETERS FOR BRAIN TUMOR CORE SEGMENTATION (5% TRAINING DATA) WITH BRATS USING DEEPMEDIC. RESULTS WHICH HAVE WORSE DSC THAN THE VANILLA BASELINE ARE HIGHLIGHTED WITH GRAY SHADING.

w/ post-processing
Method                           Parameter          DSC   SEN   PRC   HD
Vanilla - CE                     ——                 50.4  41.0  83.5  18.0
Asymmetric large margin loss     m = 0.2            53.6  44.8  84.8  15.6
                                 m = 0.5            48.4  39.4  81.5  16.9
                                 m = 1              56.8  48.9  83.4  15.0
                                 m = 1.5            54.1  45.6  81.7  15.3
                                 m = 2              51.6  42.8  84.0  16.7
                                 m = 3              54.4  45.7  82.3  14.3
Asymmetric focal loss            γ = 0.5            53.4  44.8  79.6  16.6
                                 γ = 1              53.9  45.2  81.9  17.9
                                 γ = 1.5            56.5  48.3  87.8  13.8
                                 γ = 2              58.8  51.4  81.6  15.0
                                 γ = 3              57.5  49.0  85.5  14.2
                                 γ = 4              55.2  48.5  78.3  15.5
Asymmetric adversarial training  ε = 1e-5, l = 2.5  50.3  41.8  82.0  17.2
                                 ε = 1e-5, l = 5    58.1  50.0  84.7  14.1
                                 ε = 1e-5, l = 10   58.5  50.8  80.1  16.2
                                 ε = 1e-5, l = 15   53.8  46.2  80.1  16.9
                                 ε = 1e-5, l = 20   56.6  50.7  76.9  18.8
                                 ε = 1e-4, l = 10   57.6  51.1  78.9  16.1
                                 ε = 1e-6, l = 10   56.2  48.5  81.1  17.8
Asymmetric mixup                 m = 0.1            52.1  47.3  73.8  20.7
                                 m = 0.15           58.1  53.7  75.0  19.9
                                 m = 0.2            59.8  56.8  74.7  17.7
                                 m = 0.25           60.4  55.0  82.0  15.6
                                 m = 0.3            59.1  54.3  82.0  15.3
                                 m = 0.4            52.1  44.2  84.2  21.4
                                 m = 0.8            50.3  41.6  85.5  17.7

w/o post-processing
Vanilla - CE                     ——                 51.0  42.6  78.6  17.5
Asymmetric large margin loss     m = 0.2            53.2  46.0  79.8  18.3
                                 m = 0.5            48.8  40.8  78.1  17.5
                                 m = 1              55.5  50.6  76.2  23.9
                                 m = 1.5            52.6  47.2  73.1  25.8
                                 m = 2              51.4  44.2  78.2  18.8
                                 m = 3              53.4  47.3  75.1  21.3
Asymmetric focal loss            γ = 0.5            54.2  48.0  76.2  22.0
                                 γ = 1              53.7  46.8  76.0  22.8
                                 γ = 1.5            54.3  49.6  76.3  25.9
                                 γ = 2              57.3  52.7  76.4  24.4
                                 γ = 3              55.7  50.3  75.4  24.6
                                 γ = 4              54.4  50.3  71.1  25.4
Asymmetric adversarial training  ε = 1e-5, l = 2.5  50.5  43.6  76.3  21.3
                                 ε = 1e-5, l = 5    56.6  51.3  76.1  21.9
                                 ε = 1e-5, l = 10   56.8  51.8  74.8  23.6
                                 ε = 1e-5, l = 15   53.3  47.6  74.8  21.5
                                 ε = 1e-5, l = 20   55.4  53.2  72.0  26.2
                                 ε = 1e-4, l = 10   56.9  53.3  74.1  22.4
                                 ε = 1e-6, l = 10   55.2  50.0  76.0  23.8
Asymmetric mixup                 m = 0.1            52.0  48.8  68.7  32.2
                                 m = 0.15           58.0  55.7  70.6  31.6
                                 m = 0.2            59.3  57.9  70.6  27.8
                                 m = 0.25           60.1  55.9  78.0  23.5
                                 m = 0.3            59.2  55.4  77.9  17.6
                                 m = 0.4            52.8  45.3  80.2  21.5
                                 m = 0.8            51.0  43.6  79.2  18.7

Fig. 10. The intensity histograms of (a) BRATS, (b) ATLAS and (c) KiTS. The intensities of the foreground and background classes overlap strongly for ATLAS and KiTS. This may be a factor due to which asymmetric mixup does not create useful synthetic samples and cannot improve the segmentation performance as much.

Fig. 11. The class distribution of the abdomen dataset used in this study. We summarize the total pixel count of the different classes. We take classes 4, 5, 8, 9, 10, 11, 12 and 13 as rare classes.
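Section XII above applies the asymmetric focal loss with a rare-class indicator vector r. The exact formulation is given in the main text (equation 10), which is not reproduced here; the sketch below shows one plausible reading of the asymmetry, in which the focal modulation $(1 - p_t)^\gamma$ is applied only to samples whose true class is not marked in r, so that easy background samples are down-weighted without suppressing the under-represented classes. All names and the per-sample formulation are illustrative, not the paper's definitive implementation.

```python
import numpy as np

def asymmetric_focal_loss(p, y, r, gamma=4.0):
    """Per-sample loss sketch. p: (c,) predicted probabilities, y: (c,)
    one-hot label, r: (c,) indicator of rare classes, gamma: focusing
    parameter. Assumed reading: the focal term (1 - p_t)^gamma is applied
    only when the true class is NOT rare; rare classes keep plain CE."""
    pt = float(p @ y)            # probability assigned to the true class
    is_rare = bool(y @ r)        # does the label belong to a rare class?
    ce = -np.log(pt)
    if is_rare:
        return ce                # no down-weighting for rare classes
    return (1.0 - pt) ** gamma * ce  # focal down-weighting otherwise

# Rare-class indicator mirroring Section XII (classes 4, 5, 8-13 rare),
# assuming index 0 is the background class.
r = np.zeros(14)
r[[4, 5, 8, 9, 10, 11, 12, 13]] = 1.0
```

For example, a well-classified background sample with $p_t = 0.9$ contributes a loss scaled by $0.1^4 = 10^{-4}$, while an equally well-classified rare-class sample keeps its full cross-entropy term.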
XIII. QUANTITATIVE RESULTS WITHOUT POST-PROCESSING

The quantitative segmentation results without post-processing are summarized in Table X, Table XI and Table XII. Without post-processing, the proposed asymmetric regularization methods improve DSC but can lead to worse distance-based evaluation metrics such as the Hausdorff distance (HD). This is because the regularized model, being more sensitive to the under-represented classes, makes relatively more false positive predictions. False positive predictions far from the ground truth increase HD significantly. However, in practice most false positive predictions can easily be removed by connected component-based post-processing, as described in Section V. In this way, we eventually obtain better or similar HD with our methods, as shown in the main text.

TABLE IX
EVALUATION OF ABDOMEN SEGMENTATION WITH 25% OF TRAINING DATA WITH SYMMETRIC (SY.) AND ASYMMETRIC (ASY.) FOCAL LOSS. THE RARE CLASSES ARE MARKED WITH r. AVG IS THE AVERAGE PERFORMANCE OF ALL CLASSES. AVGr IS THE AVERAGE PERFORMANCE OF ALL RARE CLASSES. BEST RESULTS ARE HIGHLIGHTED IN BOLD.

Classes: c1 (spleen), c2 (right kidney), c3 (left kidney), c4 (gallbladder)r, c5 (esophagus)r, c6, c7, c8r, c9 (vena cava)r, c10 (vein)r, c11 (pancreas)r, c12 (right adrenal)r, c13 (left adrenal)r, AVG, AVGr. Each class has three columns: vanilla CE / sy. focal loss / asy. focal loss. [Entries for c1-c5 and most DSC values were lost in extraction; the recoverable rows are:]

             c6              c7              c8r             c9 (vena cava)r  c10 (vein)r
DSC          (only 87.4 and 88.4 recoverable for this group)
Sensitivity  84.1/85.2/84.0  28.8/28.5/31.0  76.6/73.1/82.9  56.8/60.3/76.9   28.4/16.5/31.0
Precision    92.3/92.8/94.2  91.3/84.1/86.3  91.6/91.5/86.3  86.1/79.3/72.7   91.5/82.1/80.0

             c11 (pancreas)r  c12 (r adrenal)r  c13 (l adrenal)r  AVG             AVGr
DSC          (only 17.2 and 24.3 recoverable for this group)
Sensitivity  11.1/17.3/18.3   45.3/24.3/50.3    27.2/22.2/41.6    51.2/48.1/55.7  40.2/35.6/48.9
Precision    56.6/52.0/61.7   74.5/60.8/69.4    61.4/52.7/63.6    82.4/78.3/76.4  76.8/69.6/68.5
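The connected component-based post-processing referred to above (and detailed in Section V of the main text) typically keeps only the largest connected foreground component, removing isolated false positives that inflate the Hausdorff distance. The paper's actual pipeline may differ; a minimal, self-contained 2D sketch of the idea is:

```python
from collections import deque

def keep_largest_component(mask):
    """mask: 2D list of 0/1 values. Returns a copy in which only the
    largest 4-connected foreground component is kept; small spurious
    false positive regions are removed."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                # Breadth-first search to collect one connected component.
                comp, q = [], deque([(i, j)])
                seen[i][j] = True
                while q:
                    a, b = q.popleft()
                    comp.append((a, b))
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        na, nb = a + da, b + db
                        if 0 <= na < h and 0 <= nb < w and mask[na][nb] and not seen[na][nb]:
                            seen[na][nb] = True
                            q.append((na, nb))
                if len(comp) > len(best):
                    best = comp
    out = [[0] * w for _ in range(h)]
    for a, b in best:
        out[a][b] = 1
    return out
```

For instance, a mask containing a three-pixel lesion and a single distant false positive pixel keeps only the lesion; for 3D volumes the same logic applies with 6-connectivity.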
TABLE X
EVALUATION OF BRAIN TUMOR CORE SEGMENTATION USING DEEPMEDIC WITH DIFFERENT AMOUNTS OF TRAINING DATA AND DIFFERENT TECHNIQUES TO COUNTER OVERFITTING. THE RESULTS ARE CALCULATED WITHOUT ANY POST-PROCESSING. RESULTS WHICH HAVE WORSE DSC THAN THE VANILLA BASELINE ARE HIGHLIGHTED WITH GRAY SHADING. BEST AND SECOND BEST RESULTS ARE HIGHLIGHTED IN BOLD WITH THE BEST ALSO BEING UNDERLINED.

Method                        5% training (DSC SEN PRC HD)  10% training          20% training          50% training
Vanilla - CE [20]             51.0 42.6 78.6 17.5           62.8 56.9 81.6 (lost) (entries lost)        (entries lost)
Asymmetric large margin loss  55.5 50.6 76.2 23.9           64.3 57.9 84.1 (lost) (entries lost)        (entries lost)
Symmetric combination         50.6 43.0 82.2 20.3           61.0 54.2 83.4 23.3   64.9 59.5 85.9 16.8   67.4 63.9 84.4 15.7
Asymmetric combination        (entries lost)
[Rows for the remaining methods were lost in extraction.]
TABLE XI
EVALUATION OF BRAIN STROKE LESION SEGMENTATION ON ATLAS USING DEEPMEDIC WITH DIFFERENT AMOUNTS OF TRAINING DATA AND DIFFERENT TECHNIQUES TO COUNTER OVERFITTING. THE RESULTS ARE CALCULATED WITHOUT POST-PROCESSING. RESULTS WHICH HAVE WORSE DSC THAN THE VANILLA BASELINE ARE HIGHLIGHTED WITH SHADING. BEST AND SECOND BEST RESULTS ARE IN BOLD WITH THE BEST ALSO UNDERLINED.

Method                         30% training (DSC SEN PRC HD)  50% training          100% training
Vanilla - w/ augmentation [20] 22.9 22.4 52.3 41.7            47.7 48.5 55.3 (lost) (entries lost)
Large margin loss [31]         20.2 17.0 59.4 (lost)          (entries lost)        (entries lost)
Asymmetric large margin loss   24.0 23.8 50.4 41.4            49.2 52.6 54.7 35.4   56.9 59.9 60.8 27.7
Focal loss [29]                21.9 20.2 55.0 (lost)          (entries lost)        (entries lost)
[Rows for the remaining methods were lost in extraction.]

TABLE XII
EVALUATION OF KIDNEY AND KIDNEY TUMOR SEGMENTATION BASED ON 3D U-NET WITH DIFFERENT AMOUNTS OF TRAINING DATA AND DIFFERENT TECHNIQUES TO COUNTER OVERFITTING. THE RESULTS ARE CALCULATED WITHOUT ANY POST-PROCESSING. RESULTS WHICH HAVE WORSE DSC THAN THE VANILLA BASELINE ARE HIGHLIGHTED WITH SHADING. BEST AND SECOND BEST RESULTS ARE IN BOLD WITH THE BEST ALSO UNDERLINED.

Kidney
Method                           10% training (DSC SEN PRC HD)  50% training         100% training
Vanilla - w/ augmentation [18]   93.7 91.7 96.9 5.6             96.5 96.1 97.0 3.6   (entries lost)
Vanilla - w/o augmentation       92.8 90.2 96.6 12.4            96.3 93.1 96.6 2.5   96.5 96.4 96.8 3.8
Vanilla - asymmetric augment.    94.7 93.0 96.9 5.2             95.6 95.9 95.8 5.9   96.5 96.6 96.6 3.7
Large margin loss [31]           (entries lost)
Asymmetric adversarial training  94.6 92.8 97.2 4.6             (entries lost)       (entries lost)
Mixup [47]                       (entries lost)
Asymmetric combination           94.0 90.5 98.4 4.8             94.3 90.8 95.4 5.1   (entries lost)

Kidney tumor
Large margin loss [31]           54.8 48.2 77.2 84.0            77.1 75.2 84.5 58.8  80.9 82.1 83.5 47.4
Asymmetric large margin loss     55.7 50.1 75.4 99.6            77.9 76.0 84.9 54.6  (entries lost)
Mixup [47]                       54.5 48.3 79.6 82.2            (entries lost)       (entries lost)
[Rows for the remaining methods were lost in extraction.]
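For reference, the overlap metrics reported throughout these tables follow their standard definitions (DSC = 2TP/(2TP+FP+FN), sensitivity = TP/(TP+FN), precision = TP/(TP+FP)). A small sketch for binary masks, assuming non-empty prediction and ground truth (HD is omitted for brevity):

```python
import numpy as np

def overlap_metrics(pred, gt):
    """pred, gt: boolean (or 0/1) arrays of the same shape. Returns DSC,
    sensitivity and precision in percent, following the standard
    definitions. Assumes pred and gt each contain at least one foreground
    element, so the denominators are non-zero."""
    pred = np.asarray(pred, bool)
    gt = np.asarray(gt, bool)
    tp = np.logical_and(pred, gt).sum()   # true positives
    fp = np.logical_and(pred, ~gt).sum()  # false positives
    fn = np.logical_and(~pred, gt).sum()  # false negatives
    dsc = 2 * tp / (2 * tp + fp + fn)
    sen = tp / (tp + fn)                  # sensitivity (recall)
    prc = tp / (tp + fp)                  # precision
    return 100 * dsc, 100 * sen, 100 * prc
```

For example, `overlap_metrics([1, 1, 0, 0], [1, 0, 1, 0])` has one true positive, one false positive and one false negative, giving 50.0 for all three metrics.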