Analyzing Overfitting under Class Imbalance in Neural Networks for Image Segmentation
Zeju Li, Konstantinos Kamnitsas and Ben Glocker
Abstract—Class imbalance poses a challenge for developing unbiased, accurate predictive models. In particular, in image segmentation neural networks may overfit to the foreground samples from small structures, which are often heavily under-represented in the training set, leading to poor generalization. In this study, we provide new insights on the problem of overfitting under class imbalance by inspecting the network behavior. We find empirically that when training with limited data and strong class imbalance, at test time the distribution of logit activations may shift across the decision boundary, while samples of the well-represented class seem unaffected. This bias leads to a systematic under-segmentation of small structures. This phenomenon is consistently observed for different databases, tasks and network architectures. To tackle this problem, we introduce new asymmetric variants of popular loss functions and regularization techniques including a large margin loss, focal loss, adversarial training, mixup and data augmentation, which are explicitly designed to counter logit shift of the under-represented classes. Extensive experiments are conducted on several challenging segmentation tasks. Our results demonstrate that the proposed modifications to the objective function can lead to significantly improved segmentation accuracy compared to baselines and alternative approaches.
Index Terms—overfitting, class imbalance, image segmentation.
I. INTRODUCTION

THE success of convolutional neural networks (CNNs) is strongly linked with the availability of large-scale, representative datasets. However, in many real-world applications such as medical image segmentation, the availability of large, annotated datasets is still limited. But even when there is a sufficient number of images available, the fundamental problem of class imbalance remains, where regions-of-interest (ROIs) (i.e. foreground classes) are heavily under-represented in the training data [3], [30]. Similar to [9], the class imbalance ratio of one image can be defined as the ratio between the number of pixels of the background class (which is commonly the most frequent class) and the number of pixels of different object classes. The class imbalance ratio of a whole dataset would then be reported as the average class imbalance ratio of all images in the set. Class imbalance ratios of 100:1 or higher are not uncommon in applications such as lesion segmentation, as shown in Table I.

When the model is trained with imbalanced datasets, it can overfit to the training samples from the under-represented classes and may not generalize well at test time. However, the effects of overfitting under class imbalance on the model
behavior are not well understood. In this study, we investigate how the distribution of activations of the classification layer (logits) changes when the model is trained using different amounts of training data with strong class imbalance. As the model is trained with less training data and overfits the under-represented classes more, we find that the model projects unseen samples of the under-represented classes closer to, and even across, the decision boundary, while samples of the over-represented classes remain unaffected. This biased distribution shift leads to under-segmentation of the under-represented class. Current solutions to address class imbalance or to mitigate overfitting do not explicitly consider this asymmetric logit shift and are unable to lead to significant improvements, as we show through an extensive set of experiments.

This study sheds new light on the problem of overfitting in the presence of class imbalance by making the following key contributions: 1) Via inspection of the network behavior on four segmentation tasks and datasets, and two popular model architectures, we conclude that overfitting under class imbalance consistently leads to decreased performance on under-represented classes, specifically in terms of low sensitivity; 2) We identify the shift in the logit distribution of unseen test samples of under-represented classes as a result of overfitting under class imbalance; 3) Based on our observations, we propose simple yet effective asymmetric variants of five loss functions and regularization techniques which are explicitly designed to change the network behavior, yielding improved segmentation accuracy for the under-represented classes.

This article is an extension of our earlier work presented at MICCAI 2019 [27].

Z. Li, K. Kamnitsas and B. Glocker are with the BioMedIA Group, Department of Computing, Imperial College London, SW7 2AZ, United Kingdom. E-mail: [email protected].
We extend our previous work on multiple aspects: 1) We provide a more detailed analysis including experiments on two additional datasets; 2) We further include a 3D U-Net to confirm that our observations hold across different network architectures; 3) We explore the proposed training objectives with the Dice loss in addition to cross-entropy; 4) We enrich the experiments by adding comparisons when training with F-score, extend experiments to multi-class segmentation, and also evaluate another regularization method which is noted as asymmetric augmentation. Our findings here confirm our initial observations about the biased behavior of neural networks. The behavior of logit distribution shift is consistently observed across different types of data, tasks and architectures. Our work highlights the importance of the issue of overfitting under class imbalance. The quantitative evaluation further supports our proposal of taking class imbalance into account when designing the learning objective.

II. RELATED WORK
A. Class imbalance
Class imbalance, which has been the focus of previous works [4], [19], is a common issue in image classification and image segmentation. Compared with the literature on class imbalance, a key contribution of this study is the focus on the model behaviour when it overfits to the under-represented classes, with a detailed analysis and potential solutions. In the following, we discuss related work categorized by different methodological approaches.
1) Re-weighting:
A common approach to tackle class imbalance is class-level re-weighting, which assigns higher weights or higher sampling probability to the under-represented classes based on sample frequency [41], [46] or advanced rules [9]. In this study, we explore re-weighting as a baseline approach in all experiments, where we train the models with patches which are separately sampled from different classes with the same probability. Beyond that, sample-level re-weighting strategies have also been proposed to build a balanced model. For example, hard sample mining was proposed to avoid the dominant effect of majority classes [11]. Similarly, focal loss and its variants were proposed to weight difficult samples over easy samples [1], [14], [29], [43] to steer the learning towards small objects. However, the under-represented samples are not necessarily difficult to predict during training. In fact, as we show empirically, the training samples of the under-represented classes are learned well due to overfitting. In this case, we find that a focal loss may even decrease the performance when processing imbalanced datasets because it reduces the focus on the under-represented samples. Therefore, in this study, we improve upon focal loss by removing the attenuation of under-represented classes. Margin-based loss functions were proposed to learn discriminative embeddings and are widely adopted in metric learning and face recognition [10], [31]. Margin losses can also be seen as a kind of re-weighting approach which changes the magnitude of the gradient of the network output by multiplying a scalar, as we show in the supplementary Section VII. In this study, we propose to only assign margins for the under-represented classes. The design of uneven margins for imbalanced datasets was first proposed in [26] for the perceptron.
Recently, Large Margin Local Embedding (LMLE) was proposed to put more constraints on the under-represented classes by only applying multiple margins to the minority classes, with a computationally expensive metric-learning based framework [17]. More recently, two concurrent studies were also proposed to set larger margins for the under-represented classes from the perspective of uncertainty [23] or generalization bound [5], [49]. In this study, we empirically show that one should not assign margins for the over-represented classes, based on the observations of asymmetric logit distribution under class imbalance.
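The class-level re-weighting baseline described above (training patches sampled from each class with equal probability) can be sketched in a few lines. This is an illustrative NumPy sketch with a synthetic label map, not the exact sampler used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_balanced_centers(label_map, n_patches):
    """Pick patch centers so that each class is chosen with equal
    probability, regardless of how many pixels it occupies."""
    classes = np.unique(label_map)
    centers = []
    for _ in range(n_patches):
        c = rng.choice(classes)                 # uniform over classes
        idx = np.argwhere(label_map == c)       # all pixels of that class
        centers.append(tuple(idx[rng.integers(len(idx))]))
    return centers

# toy label map: foreground occupies only 1% of the pixels
seg = np.zeros((100, 100), dtype=int)
seg[45:55, 45:55] = 1                           # 100 of 10,000 pixels
centers = sample_balanced_centers(seg, 200)
fg_frac = np.mean([seg[c] for c in centers])
print(fg_frac)  # close to 0.5 despite the 99:1 pixel imbalance
```

About half of the sampled patch centers land on the foreground class, even though it covers only 1% of the image.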
2) Data synthesis:
Our work is related to data synthesis methods [6], [13] which generate synthetic samples of the minority classes based on intra-class relationships between samples to increase the variance of the under-represented class. In addition, we create synthetic samples in the latent feature space rather than image space and provide two new ways to synthesize samples of the under-represented classes for modern machine learning models. We also propose to adopt stronger data augmentation for the under-represented classes by changing the augmentation probabilities to alleviate overfitting.
3) Other methods:
The above-mentioned methods are all based on changing the training data distribution to tackle class imbalance. In contrast, some other approaches try to counter class imbalance by modifying the training strategy. Specifically, [15] first trained their model with data which is sampled from each class with the same probability. Thereafter, they retrain only the output layer with uniformly sampled data while freezing all other network parameters. In this way, they could separately learn a diverse representation and a classifier for the realistic data distribution. Similar strategies, which aim to change the decision boundary at test time, have also been proposed recently for long-tailed recognition [21], [38], [48]. These approaches are complementary to our proposed solutions and could be combined. Other learning paradigms such as meta-learning [42] and transfer learning [32] were also recently proposed for long-tail learning, but these are outside the scope of this paper.
4) Segmentation:
The problem of class imbalance in image segmentation is different from that in image recognition [32], because the dominating class in image segmentation is the background class with diverse characteristics, and its segmentation accuracy is highly robust. In contrast, the accuracy for the majority classes in long-tailed image recognition can degrade with common techniques such as re-weighting [21]. In addition, the evaluation of segmentation performance mostly relies on the foreground classes, and therefore, the focus is on improving accuracy in those classes. For example, recent studies proposed to provide a better trade-off between sensitivity and precision for segmentation [14], [33]. However, these strategies yield little improvement when processing highly imbalanced datasets, as shown in our experiments. This is because a deep neural network may achieve near-perfect training accuracy even for the under-represented samples without benefitting from the modified loss function. Class imbalance in image segmentation has also been approached via a boundary loss [22]. However, it is only applicable to segmentation. The authors show promising results with sufficient training data, but the model may still be prone to overfit the under-represented class with a limited dataset. Other work adopted multi-stage approaches with candidate proposals and background suppression [36], [39]. However, the candidate prediction process may still suffer from class imbalance. In addition, any missed candidates in one stage cannot be recovered in a later stage. In contrast, the solutions proposed here are general loss functions that can be incorporated into any model or learning approach and are applicable beyond image segmentation.
B. Regularization techniques
To improve generalization of deep neural networks, a number of regularization techniques are available. This includes dropout [37], weight decay [24], data augmentation [7], [8], data mixing [45], [47], and adversarial training [12], [44]. However, most of these techniques were proposed for general image classification tasks where class imbalance is not explicitly addressed. It is also unclear how these techniques affect the network behavior in this setting.

III. OVERFITTING UNDER CLASS IMBALANCE AND ITS EFFECT ON SEGMENTATION PERFORMANCE
To explore the effects of overfitting on the network behavior, we train CNNs using different amounts of data, on segmentation tasks that exhibit strong class imbalance. We conduct experiments on challenging segmentation tasks using data from the Multimodal Brain Tumor Image Segmentation (BRATS) challenge [2], the Anatomical Tracings of Lesions After Stroke (ATLAS) dataset [28], small organ segmentation (data from [25]) and Kidney Tumor Segmentation (KiTS) [16]. The statistics of those four datasets are summarized in Table I. To ensure our findings generalize across models, in our investigation we employ two convolutional network architectures that have proven potent in a variety of segmentation tasks: we employ a DeepMedic architecture [20] for the experiments on brain lesions and multi-organ segmentation tasks, on which it has previously shown high performance [20], [35], and a well-configured 3D U-Net [18] for the experiments on kidney tumor segmentation on KiTS19 data, which is the base model of the winning entry of the KiTS19 challenge [16]. The detailed network configurations are summarized in Section V.
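The class imbalance ratios reported in Table I follow the per-image definition given in the introduction (background pixels divided by foreground pixels, averaged over the dataset). As a minimal illustration with a synthetic label map (not taken from any of the datasets):

```python
import numpy as np

def imbalance_ratio(label_map, background=0):
    """Per-image class imbalance ratio: background pixels / foreground pixels.
    The dataset-level ratio in Table I is the average of this over all images."""
    n_bg = np.sum(label_map == background)
    n_fg = np.sum(label_map != background)
    return n_bg / n_fg

# toy 10x10 label map with a 2x2 foreground structure
seg = np.zeros((10, 10), dtype=int)
seg[4:6, 4:6] = 1
print(imbalance_ratio(seg))  # 96 background / 4 foreground = 24.0
```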
TABLE I
The statistics and class imbalance ratios of the four datasets used in this study. Class imbalance ratio is defined as the average ratio between the number of the background (BG) pixels and the foreground (FG) pixels over all images.
Columns: Dataset | Total FG pixels | Total BG pixels | Class imbalance ratio (avg. ± std.). [Numeric entries for the four datasets were lost in extraction.]

The observations on the test and training set are summarized in Fig. 1. With less training data, we notice a clear decrease of segmentation accuracy on test data while the accuracy on training data increases due to easier overfitting, as expressed by DSC (defined as DSC = 2 · sensitivity · precision / (sensitivity + precision)). We observe that overfitting leads to a reduction of sensitivity while precision remains largely stable. In all settings and tasks, the specificity of the foreground always remains near-perfect.

A. Logit distribution shift
To obtain a better understanding of the network behavior after training on imbalanced data, we monitor the logit distribution when processing training and unseen test samples. The observations we make for the tasks of brain tumor core, kidney tumor and brain stroke lesion segmentation are summarized in Fig. 3, 4 and 5, respectively. We notice that the logit distribution of foreground samples shifts significantly towards and even across the decision boundary, while the logit distribution of background samples remains stable. The shift of the foreground logits results in a higher number of false negatives, which causes a drastic decrease of sensitivity (calculated as TP / (TP + FN)). This biased logit shift under class imbalance may also occur in other tasks such as image classification. However, it is particularly prevalent in image segmentation with small structures-of-interest.

We find that this shift of logits correlates with how much a model overfits to the under-represented class. Training with less data leads to more overfitting, and the logit distribution shift becomes larger. Moreover, we find that the logit shift also correlates with the size of structures represented by the foreground class. The rarest class shifts the most, as shown in the right part of Fig. 4.

In image segmentation, a CNN is optimized to push the logits of different classes away from each other and far from the decision boundary. It is relatively easy for a deep CNN to build an embedding for the training samples from the under-represented class, because it just needs to build a set of case-specific filters to facilitate memorization. For example, as a CNN will only observe very few training samples of the foreground class, a CNN can dedicate specific model parameters to memorize all foreground samples, even if the individual patterns are rather complex. Specifically, we find a CNN seems to be more confident about foreground samples during training, mapping them farther away from the decision boundary when overfitting, as shown in Fig. 3, Fig. 4 and Fig. 5. However, these tailored filters will not generalize to unseen test data. Therefore, the activations for test samples of the under-represented class are smaller in magnitude (sub-optimal pattern matching of filters and unseen samples), leading to the observed distribution shift. In contrast, a CNN has to build generic filters for a well-represented class to represent many different characteristics of the same class, leading to good generalization. Such filters will map unseen samples to similar locations in logit space and no shift between the embeddings of training and test samples is observed.
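The per-class logit shift Δẑ = |ẑ_test| − |ẑ_train| plotted in the right parts of Figs. 3-5 is simple to compute once logits are collected. A sketch with synthetic numbers (the arrays stand in for logits gathered from a trained model):

```python
import numpy as np

def mean_logit_shift(train_logits, test_logits):
    """Shift of the mean logit magnitude between training and unseen test
    samples of one class; negative values mean the test logits moved
    towards the decision boundary."""
    return np.abs(np.mean(test_logits)) - np.abs(np.mean(train_logits))

# synthetic foreground logits: confident at training time, shifted at test time
fg_train = np.array([3.0, 4.0, 5.0])
fg_test = np.array([0.5, 1.0, -0.5])     # one sample crossed the boundary
bg_train = np.array([-4.0, -5.0, -6.0])
bg_test = np.array([-4.5, -5.0, -5.5])   # background barely moves

print(mean_logit_shift(fg_train, fg_test))  # large negative shift
print(mean_logit_shift(bg_train, bg_test))  # close to zero
```

The asymmetry of the two printed values mirrors the behavior described above: only the under-represented class drifts towards the boundary.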
As a result of class imbalance and overfitting, a CNN may underperform on the under-represented class while still generalizing well for the well-represented class.

While the negative effect of class imbalance and overfitting on model performance is well known, to our knowledge there has been little work investigating the specifics of how the network behavior is affected. Only by understanding better
Fig. 1. Performance on brain tumor core, brain stroke lesion, small organs, kidney and kidney tumor segmentation with varying amounts of training data. The foreground (FG) and background (BG) samples are highly imbalanced, as noted below each subfigure. With less training data, performance drops due to the decrease of sensitivity, while the precision is largely retained.
[Fig. 2 panels: BRATS w/ DeepMedic (red: brain tumor core); ATLAS w/ DeepMedic (red: brain stroke lesion); KiTS w/ 3D U-Net (red: kidney tumor, blue: kidney); abdominal organs w/ DeepMedic (red: gallbladder, blue: aorta, purple: vena cava, magenta: vein). Columns show the image, ground truth, and predictions with decreasing amounts of training data, with and without our regularization.]
Fig. 2. Visualization of different datasets and segmentation results with different portions of training data. With less training data, the models are prone to under-segment the under-represented classes. The proposed regularization methods can alleviate the overfitting of under-represented classes and provide segmentation results with higher sensitivity and overall accuracy. Best viewed in color.

the implication can we devise mitigation strategies. Previous loss functions and regularization techniques that aim to prevent overfitting did not take the behavior that we observe into account, and thus show limited success for improving segmentation accuracy in the setting of limited data with strong class imbalance. Here, we propose solutions via new asymmetric variants of existing objective functions, leading to better feature embeddings for the under-represented samples and significant improvements in segmentation accuracy for small structures-of-interest.

IV. TACKLING OVERFITTING UNDER CLASS IMBALANCE WITH ASYMMETRIC OBJECTIVE FUNCTIONS
Based on our observations above about the biased behavior of CNNs, we design modifications to existing loss functions and training strategies to prevent the logit distribution shift. Specifically, we add a bias for the under-represented class. Although the original techniques were proposed for different purposes, our modifications share a common goal: keep the logit activations of the under-represented class away from the decision boundary. Even if the logit of a foreground sample shifts towards the decision boundary, as long as it does not cross it, its prediction remains correct (cf. Fig. 6).
A. Asymmetric large margin loss
Fig. 3. (Left part) Activations of the classification layer (logit z for background, logit z for brain tumor core) when processing (top) tumor and (bottom) background samples of BRATS with DeepMedic, using different amounts of training data. The CNN maps training and testing samples of the background class to similar logit values. However, the mean activation for testing data shifts significantly for the tumor class towards and sometimes across the decision boundary. (Right part) The shift of the mean value of logits observed when processing training and testing data (Δẑ = |ẑ_test| − |ẑ_train|).

Fig. 4. (Left part) Activations of the classification layer (logit z for background, logit z for kidney, logit z for kidney tumor) when processing (top) tumor, (middle) kidney and (bottom) background samples of KiTS with 3D U-Net, using different amounts of training data. The CNN also fails to map the training and testing samples of the tumor class to a similar position. (Right part) The shift of the mean value of logits.

We consider a CNN for the task of semantic segmentation. For a training dataset \{(x_i, y_i)\}_{i=1}^{N} with N samples, we denote a training sample with x_i and its corresponding one-hot vector y_i. If c is the total number of classes of the task, y_i has c elements, with its j'th element y_{ij} ∈ {0, 1} corresponding to the j'th class. y_{ij} equals 1 if j is the real class of x_i, or 0 otherwise. With this notation, the cross-entropy (CE) loss can be written as:

L_{CE}(x_i, y_i) = -\sum_{j=1}^{c} y_{ij} \log(p_{ij}),   (1)

where p_{ij} is the predicted probability by the network that the real class of x_i is j. (We formulate CE as a sum over classes to make class-specific modifications.) Probability p_{ij} is commonly obtained via a softmax function over the c activations \{(z_{ij})_{j=1}^{c} ∈ R^c\} that the network outputs for x_i at its last layer. These activations are called the logits. With this, p_{ij} is given by:

p_{ij} = \frac{e^{z_{ij}}}{\sum_{j=1}^{c} e^{z_{ij}}}.   (2)

Besides CE, the smooth version of the DSC metric is an alternative choice for the loss function, which is widely used for medical image segmentation [33]. The DSC loss can be calculated in the form of 1 − DSC = (FP + FN) / (2TP + FP + FN), which is:
L_{DSC}(x_i, y_i) = \sum_{j=1}^{c} \frac{(1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})}{2 y_{ij} p_{ij} + (1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})}.   (3)

Fig. 5. (Left part) Activations of the classification layer when processing lesion samples of ATLAS with DeepMedic, using different amounts of training data. (Right part) The shift of the mean value of logits.

The large margin loss was proposed for increasing the Euclidean distances between logits of different classes to learn discriminative features [40]. Symmetrically, it is implemented by adding a margin on the logits of every class:

L_{CE}^{M}(x_i, y_i) = -\sum_{j=1}^{c} y_{ij} \log(q_{ij}),   (4)

in which we require:

q_{ij} = \frac{e^{z_{ij} - y_{ij} m}}{\sum_{j=1}^{c} e^{z_{ij} - y_{ij} m}},   (5)

where m is a hyper-parameter for the margin. Although the large margin loss encourages the model to map different classes away from each other, the decision boundary remains in the center. According to our observations, class imbalance causes shifts of unseen foreground samples towards the background class. To mitigate this, a regularizer may aim to move the decision boundary closer to the background class. Our asymmetric modification only sets the margin for the rare classes. We define r as a one-hot vector with c elements, with its j'th element r_j ∈ {0, 1} corresponding to the j'th class, and r_j equals 1 if j is taken as the rare class. With the indication of r, we derive the asymmetric large margin loss as:

\hat{L}_{CE}^{M}(x_i, y_i) = -\sum_{j=1}^{c} y_{ij} \log(\hat{q}_{ij}),   (6)

where we require:

\hat{q}_{ij} = \frac{e^{z_{ij} - y_{ij} r_j m}}{\sum_{j=1}^{c} e^{z_{ij} - y_{ij} r_j m}}.   (7)

In this study, we define r_j as 1 for the foreground samples and 0 for the background samples. In other applications, r_j can also be defined as a continuous variable indicating the rarity of the classes, with r_j ∈ [0, 1], for the methods in Sections IV-A, IV-B and IV-C.
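For intuition, the asymmetric large margin CE of Eqs. (6)-(7) can be sketched for a single sample in NumPy; the logits below are illustrative, and setting r to all zeros recovers the plain CE of Eqs. (1)-(2):

```python
import numpy as np

def asym_margin_ce(z, y, r, m=2.0):
    """Asymmetric large margin CE (Eqs. 6-7): the margin m is subtracted
    from the true-class logit only for classes marked rare (r_j = 1)."""
    z_m = z - y * r * m                  # shift logits of the rare true class
    e = np.exp(z_m - np.max(z_m))        # stabilised softmax, Eq. (7)
    q = e / e.sum()
    return -np.sum(y * np.log(q))        # Eq. (6)

z = np.array([1.0, 1.5])   # logits for (background, foreground)
y = np.array([0.0, 1.0])   # a foreground sample
r = np.array([0.0, 1.0])   # only the foreground class is rare

# the margin inflates the loss for rare-class samples, pushing their
# logits further from the decision boundary during training
print(asym_margin_ce(z, y, r))             # margin applied
print(asym_margin_ce(z, y, np.zeros(2)))   # r = 0: plain CE
```

For a background sample (y = (1, 0)), y * r is zero, so the loss is untouched; this is exactly the asymmetry of Eq. (7).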
Similarly, the symmetric and asymmetric large margin losses for the DSC loss can be derived by substituting equation 5:

L_{DSC}^{M}(x_i, y_i) = \sum_{j=1}^{c} \frac{(1 - y_{ij}) q_{ij} + y_{ij}(1 - q_{ij})}{2 y_{ij} q_{ij} + (1 - y_{ij}) q_{ij} + y_{ij}(1 - q_{ij})},   (8)

and

\hat{L}_{DSC}^{M}(x_i, y_i) = \sum_{j=1}^{c} \frac{(1 - y_{ij}) \hat{q}_{ij} + y_{ij}(1 - \hat{q}_{ij})}{2 y_{ij} \hat{q}_{ij} + (1 - y_{ij}) \hat{q}_{ij} + y_{ij}(1 - \hat{q}_{ij})}.   (9)

B. Asymmetric focal loss
The focal loss was proposed for small object detection by reducing the weight of well-classified samples and focusing on samples which are near the decision boundary [29]. It adds attenuation inside the loss function based on the logit activations:

L_{CE}^{focal}(x_i, y_i) = -\sum_{j=1}^{c} (1 - p_{ij})^{\gamma} y_{ij} \log(p_{ij}),   (10)

where γ is the hyper-parameter to control the focus. The symmetric focal loss prevents logits from becoming too large and makes every class stay near the decision boundary. However, this makes it likely for the unseen foreground samples to shift across the decision boundary. We remove the loss attenuation for the foreground class to keep it away from the decision boundary:

\hat{L}_{CE}^{focal}(x_i, y_i) = \sum_{j=1}^{c} \Big( -r_j y_{ij} \log(p_{ij}) - (1 - r_j)(1 - p_{ij})^{\gamma} y_{ij} \log(p_{ij}) \Big).   (11)

Inspired by the focal loss [29], related work integrates a similar attenuation term into the DSC loss [1], [43]. In practice, we find that the logarithmic DSC loss [1] significantly changes the magnitude of the DSC loss, making it difficult to combine with other losses. The attenuation in the focal Tversky loss [43] is very large and may overly suppress the easier class. Here, we propose another form of DSC loss with an adaptive weight preserving a similar loss magnitude.
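The asymmetric focal CE of Eq. (11) can likewise be sketched per sample; the probabilities below are illustrative:

```python
import numpy as np

def asym_focal_ce(p, y, r, gamma=2.0):
    """Asymmetric focal CE (Eq. 11): the (1 - p)^gamma attenuation is kept
    for non-rare classes (r_j = 0) but removed for rare ones (r_j = 1)."""
    weight = r + (1.0 - r) * (1.0 - p) ** gamma
    return -np.sum(weight * y * np.log(p))

p = np.array([0.3, 0.7])   # predicted probabilities (background, foreground)
y = np.array([0.0, 1.0])   # a foreground sample
r = np.array([0.0, 1.0])   # foreground is the rare class

# the rare class keeps the full CE term: loss equals -log(0.7)
print(asym_focal_ce(p, y, r))
# symmetric focal loss (r = 0) attenuates the same sample by (1 - 0.7)^2
print(asym_focal_ce(p, y, np.zeros(2)))
```

The second value is much smaller, illustrating why the symmetric focal loss down-weights exactly the foreground samples we want the model to keep focusing on.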
Specifically, we add the attenuation term to the false negatives part of the function and prevent the network from being too confident about its prediction:

L_{DSC}^{focal}(x_i, y_i) = \sum_{j=1}^{c} \frac{(1 - y_{ij}) p_{ij} + (1 - p_{ij})^{\gamma} y_{ij} (1 - p_{ij})}{2 y_{ij} p_{ij} + (1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})}.   (12)

Compared with the original version of the CE loss, this formulation for the DSC loss has a similar effect of reducing the penalty for the well-classified samples while keeping the magnitude of the loss similar to the original one, as shown in Supplementary Fig. 9. We refer to this as the focal DSC loss in the following. Similarly, the asymmetric version of the focal DSC loss is derived by removing the attenuation term for the foreground class:

\hat{L}_{DSC}^{focal}(x_i, y_i) = \sum_{j=1}^{c} \Big( \frac{(1 - y_{ij}) p_{ij} + r_j y_{ij} (1 - p_{ij})}{2 y_{ij} p_{ij} + (1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})} + \frac{(1 - y_{ij}) p_{ij} + (1 - r_j)(1 - p_{ij})^{\gamma} y_{ij} (1 - p_{ij})}{2 y_{ij} p_{ij} + (1 - y_{ij}) p_{ij} + y_{ij}(1 - p_{ij})} \Big).   (13)

Fig. 6. Illustration of the proposed asymmetric modifications for the existing loss functions and regularization techniques. We keep the logit activations of the foreground class far away from the decision boundary by setting a bias for the foreground class in different ways.

C. Asymmetric adversarial training
Adversarial training was proposed to learn more robust classifiers by training with difficult samples [12]. The network is trained by considering adversarial samples as additional training data [34], [44]:

L_{adv}(x_i, y_i) = L(x_i, y_i) + L\Big(x_i + l \cdot \frac{d_{adv}}{\|d_{adv}\|}, y_i\Big),   (14)

with

d_{adv} = \arg\max_{d; \|d\| < \epsilon} L(x_i + d, y_i).   (15)

Here, d_{adv} is the direction of the generated adversarial samples, and l and ε are the magnitude and the range of the adversarial perturbations, respectively. L is the chosen loss function, which can be L_{CE} and/or L_{DSC}. Similar to the large margin loss, symmetric adversarial training preserves the decision boundary and may cause difficulties for unseen foreground samples, which tend to shift towards the background class. Our proposed asymmetric adversarial training aims to produce a larger space between the foreground class and the decision boundary. Specifically, we generate samples by considering more of the rare classes:

\hat{d}_{adv} = \arg\max_{d; \|d\| < \epsilon} L(x_i + d, y_i \odot r) \,\Big|\, y_i \cdot r > 0,   (16)

where "⊙" refers to the element-wise product and "·" refers to the dot product.

D. Asymmetric mixup
Mixup is a simple yet effective data augmentation algorithm to improve generalization by generating extra training samples from linear combinations of pairs of images and their labels [47]:

L_{mixup}(x_i, y_i, x_k, y_k) = L(x_i, y_i) + L(\tilde{x}_i, \tilde{y}_i),   (17)

where (\tilde{x}_i, \tilde{y}_i) is the generated training sample:

\tilde{x}_i = \lambda x_i + (1 - \lambda) x_k, \quad \tilde{y}_i = \lambda y_i + (1 - \lambda) y_k.   (18)

Here, λ is randomly selected based on a beta distribution, and (x_k, y_k) is another random training sample. Mixup regularizes the model by centering the decision boundary between classes, which helps very little in our setting. Different from the original mixup, which generates samples with soft labels, our modification generates hard labels by considering augmented samples that are near the foreground samples as foreground class. Asymmetric mixup can keep the decision boundary away from the foreground class and increase the area of the foreground logit distribution. This prevents unseen under-represented samples from shifting across the decision boundary. Specifically, the mixed image \tilde{x}_i, which has a certain distance from the background class, is taken as a foreground sample:

\hat{\tilde{y}}_i = \begin{cases} y_i & \text{if } (\lambda > m \text{ and } y_i \cdot r (1 - y_k \cdot r) == 1) \text{ or } y_i == y_k, \\ y_k & \text{if } (1 - \lambda > m \text{ and } y_k \cdot r (1 - y_i \cdot r) == 1) \text{ or } y_i == y_k, \\ \emptyset & \text{otherwise}, \end{cases}   (19)

where m is the margin to guarantee that the augmented samples do not get too close to background samples. In practice, we do not update the model using training samples with \hat{\tilde{y}}_i = \emptyset.

E. Asymmetric augmentation
In order to extend the latent space of the foreground class, we also evaluate a simple method to compensate class imbalance by adjusting the magnitude of augmentation for the different classes. Standard data augmentation preserves the label $y_i$ and applies the same set of heuristic transformations, such as scaling and rotation, to the original training sample $x_i$ regardless of its class. The generated training sample $\tilde{x}_i$ is obtained as:

$$\tilde{x}_i = A(x_i), \quad (20)$$

where $A$ is the chosen transformation, applied with a certain probability. When the dataset is highly imbalanced, adding more synthesized background samples is not necessary. Our simple variant of data augmentation reduces the number of transformed samples for the background classes. In this asymmetric setting, the generated sample is obtained as:

$$\hat{\tilde{x}}_i = \begin{cases} A(x_i) & \text{if } y_i \cdot r == 1, \\ A_{small}(x_i) & \text{otherwise}, \end{cases} \quad (21)$$

where $A_{small}$ applies the transformations with a smaller probability.

F. The combination of asymmetric techniques
The above-mentioned modifications introduce more variance for the under-represented classes in the latent space or the image space, adding a bias towards the foreground class from different perspectives. In practice, some or all of the techniques can be integrated into a single model to combat overfitting under class imbalance. Specifically, we first generate different sets of augmented samples following the asymmetric adversarial training, the asymmetric mixup and the asymmetric augmentation of Equations (16), (19) and (21), separately. The network can then be optimized on the extended training set with a loss function that combines the asymmetric large margin loss and the asymmetric focal loss:

$$\hat{L}_{CE}^{combine}(x_i, y_i) = \sum_{j=1}^{c} \Big( -r_j\, y_{ij} \log(\hat{q}_{ij}) - (1 - r_j)(1 - \hat{q}_{ij})^{\gamma}\, y_{ij} \log(\hat{q}_{ij}) \Big). \quad (22)$$

A combined DSC loss can be formulated in a similar way.

V. EXPERIMENTS
A. Experimental setup
We demonstrate the effect of our proposed modifications on a variety of medical image segmentation tasks using different models and training scenarios. Here, we summarize the dataset splits and experimental settings, which are kept the same as in the motivational experiments in Section III. We keep the hyper-parameters of the methods the same for the original baselines and our modified techniques. The hyper-parameters are summarized in Supplementary Tables V, VI and VII. Additionally, we conduct a sensitivity analysis of all the hyper-parameters and summarize the results in Supplementary Table VIII. We also provide the source code for our experiments: https://github.com/ZerojumpLine/OverfittingUnderClassImbalance
1) Brain tumor segmentation:
We first evaluate the asymmetric techniques for the case of binary brain tumor core segmentation using the DeepMedic network architecture, a well performing method for this task [20]. To investigate the behavior under overfitting and to better isolate the effect of the objective functions, we do not use dropout, weight decay or data augmentation in this experiment. We train the network with the CE loss, unless otherwise specified. By default, we sample 50% of the training samples from the foreground class. We conduct experiments using the training dataset of the BRATS2017 dataset [2], which contains 285 four-modality Magnetic Resonance (MR) images. The MR images all have the same voxel spacing of 1.0 × ×
2) Brain stroke lesion segmentation:
We also evaluate the asymmetric techniques for the case of brain stroke lesion segmentation [28], again using DeepMedic. Here, we use a more realistic setting, employing standard regularization techniques including dropout, weight decay and data augmentation, as in the original work where the model achieved high performance for stroke lesion segmentation [20]. We implement our asymmetric techniques with the default training setting and default network architecture. The augmentation includes small intensity shifts and flipping in the sagittal plane with probability 0.5. The network is always trained with the CE loss. We conduct experiments using the ATLAS dataset [28], which contains 220 T1-weighted MR images. The MR images have the same voxel spacing of 1.0 × ×
3) Small organ segmentation:
For organ segmentation in Section III, we use a default DeepMedic network. We conduct experiments using the training dataset of the abdominal organ segmentation challenge [25], which contains 30 computed tomography (CT) scans. We train the network to segment thirteen abdominal organs. We test on 10 cases and train models using 20 cases (100%) and 5 cases (25% of the training set). We resample all the CT images to a common voxel spacing of 2.0 × ×
4) Kidney tumor segmentation:
In addition, we evaluate the asymmetric techniques for the case of kidney tumor segmentation. We train a well-configured 3D U-Net [18] which includes extensive data augmentation with scaling, rotation, brightness, contrast, gamma and Gaussian noise augmentations with a predefined policy [16]. We also trained DeepMedic with similar augmentation strategies, which yielded lower accuracy on this task; therefore, we evaluate the asymmetric regularization techniques on the U-Net for kidney tumor segmentation. The task includes the segmentation of both the kidney and the kidney tumor. As the segmentation of the kidney is relatively easy, in this experiment we only focus on tumor segmentation and only take the kidney tumor as the foreground class to implement
TABLE II
Evaluation of brain tumor core segmentation using DeepMedic with different amounts of training data and different techniques to counter overfitting. The results are calculated with post-processing. Results which have worse DSC than the vanilla baseline are highlighted with shading. Best and second best results are in bold, with the best also underlined.

Method | 5% training (DSC SEN PRC HD) | 10% training | 20% training | 50% training
Vanilla - CE [20] | 50.4 41.0 83.5 18.0 | 62.5 56.0 83.1 14.3 | 64.9 59.8 85.7 13.8 | 69.4 65.4 85.3 15.7
Vanilla - CE - 80% tumor | 45.5 36.0 86.7 17.8 | 61.5 54.2 81.7 18.5 | 65.3 59.6 85.0 15.1 | 68.6 64.1 86.1 14.8
Vanilla - F1 (DSC) | 47.2 37.4 86.6 15.9 | 58.9 51.1 83.6 20.1 | 64.3 58.1 83.5 16.3 | 67.1 62.5 86.5 15.3
Vanilla - F2 [14] | 45.8 36.9 81.9 17.9 | 59.3 52.2 84.9 18.0 | 66.4 61.1 83.4 14.1 | 68.8 66.0 83.4 13.7
Vanilla - F4 [14] | 51.6 42.5 83.8 18.1 | 59.6 53.0 82.9 18.4 | 65.9 61.9 85.4 14.2 | 67.5 64.5 84.9 13.7
Vanilla - F8 [14] | 47.4 38.7 83.1 19.6 | 59.8 52.4 87.0 15.4 | 64.5 60.3 85.2 14.7 | 67.9 65.4 81.6 14.9
Large margin loss [31] | 44.5 35.9 82.8 20.2 | 60.9 53.5 84.0 17.6 | 67.0 61.6 86.1 14.4 | 66.5 62.2 88.1 13.7
Asymmetric large margin loss | 56.8 48.9 83.4
Symmetric combination | 50.0 42.0 84.6 21.1 | 60.3 53.1 84.7 25.1 | 64.1 58.3 86.6 19.1 | 67.2 63.1 86.6 15.1
Asymmetric combination |
TABLE III
Evaluation of brain stroke lesion segmentation on ATLAS based on DeepMedic with different amounts of training data and different techniques to counter overfitting. The results are calculated with post-processing. Results which have worse DSC than the vanilla baseline are highlighted with shading. Best and second best results are in bold, with the best also underlined.

Method | 30% training (DSC SEN PRC HD) | 50% training | 100% training
Vanilla - w/ augmentation [20] | 22.2 18.3 60.9 48.6 | 45.2 40.9 59.7 31.1 | 54.5 49.5 67.3 32.2
Vanilla - w/o augmentation | 15.0 11.7 59.1 51.9 | 40.3 35.9 53.0 40.2 | 51.7 48.0 62.3 31.9
Vanilla - asymmetric augmentation | 22.4 18.8 58.0 50.2 | 47.3 43.4 57.5 32.1 | 56.9 51.9 69.8 28.2
Large margin loss [31] | 18.9 14.8 64.4 48.8 | 45.3 40.7 60.5 36.8 | 55.1 49.4 70.0 28.0
Asymmetric large margin loss | 23.5 19.8 58.6 45.9 | 47.7 44.3 58.4 33.8
Focal loss [29] | 20.4 16.7 62.7 47.9 | 46.9 41.8 61.7 31.4 | 56.0 50.8 69.1 30.9
Asymmetric focal loss | 26.3 22.2 59.0 46.4 | 49.0 47.8 56.3 31.7 | 56.6 63.2 55.6 27.9
Adversarial training [12] | 20.1 16.7 57.3 56.9 | 47.2 41.6 62.6 35.0 | 54.0 48.5 69.7 34.9
Asymmetric adversarial training |

asymmetric techniques. To be specific, we always set $r$ as $[0, 0, 1]^{\top}$. The network is always trained with both the CE and the sample-wise DSC loss, with equal weights for the two losses. We conduct experiments using the training dataset of the KiTS19 dataset [16], which contains 210 CT images. We resample all the CT images to a common voxel spacing of 1.6 × ×

B. Quantitative results
Taking the provided manual segmentations as the ground truth, we calculate DSC, sensitivity (SEN), precision (PRC) and the 95% Hausdorff distance (HD, in mm) to evaluate segmentation accuracy. The initial segmentation results of our method always have higher sensitivity and DSC, but sometimes produce more false positive predictions and therefore worse distance-based metrics such as HD. We argue that in practice this problem can be addressed by connected-component-based post-processing, which is widely adopted in many segmentation methods [18]. Specifically, we assume there is only one target component and suppress all but the largest region. We report results both with and without these post-processing operations. The quantitative segmentation results on the BRATS, ATLAS and KiTS datasets using different amounts of training data with post-processing are summarized in Tables II, III and IV, respectively. The corresponding results without post-processing are summarized in Supplementary Tables X, XI and XII. We also evaluate one of the proposed methods, the asymmetric focal loss, on abdominal organ segmentation, in which multiple classes are considered under-represented; these experiments are summarized in Supplementary Table IX.

Class imbalance affects the segmentation sensitivity of the under-represented class, as shown in Section III. We find that previous attempts to tackle class imbalance do not improve sensitivity, while our asymmetric methods do lead to better results with higher sensitivity across different tasks. This indicates that the proposed methods may effectively mitigate

TABLE IV
Evaluation of kidney and kidney tumor segmentation based on 3D U-Net with different amounts of training data and different techniques to counter overfitting. The results are calculated with post-processing. Results which have worse DSC than the vanilla baseline are highlighted with shading. Best and second best results are in bold, with the best also underlined.

Method (Kidney) | 10% training (DSC SEN PRC HD) | 50% training | 100% training
Vanilla - w/ augmentation [18] | 93.3 91.2 96.9 5.4 | 96.4 95.8 97.1 2.7 | 96.6 96.1 97.3 2.4
Vanilla - w/o augmentation | 92.3 89.3 96.8 12.1 | 96.1 95.6 96.7 2.8 | 96.3 95.8 96.9 2.7
Vanilla - asymmetric augmentation | 94.3 92.2 97.0 5.2 | 94.9 94.5 95.5 5.9 | 96.1 95.8 96.4 3.8
Large margin loss [31] |
Focal loss [29] | 91.4 85.9 99.2 10.6 | 94.1 89.6 99.2 4.2 | 94.3 90.0 99.1 4.2
Asymmetric focal loss | 92.0 86.7 99.0 6.0 | 94.7 90.9 98.9 3.5 | 94.8 90.9 99.1 3.1
Adversarial training [12] | 94.1 91.9 97.3 9.1 | 96.3 95.7 97.1 2.6 | 96.6 96.2 97.2
Asymmetric adversarial training | 94.4 92.5 97.2 5.7
Mixup [47] |
Asymmetric mixup |
Asymmetric combination | 93.5 89.7 98.5 5.2 | 93.9 90.0 98.3 5.3 | 96.7 95.6 97.9

Method (Kidney tumor) | 10% training (DSC SEN PRC HD) | 50% training | 100% training
Vanilla - w/ augmentation [18] | 54.6 46.0 80.0
Vanilla - w/o augmentation | 37.4 31.5 65.6 96.0 | 62.8 58.7 75.9 47.8 | 73.0 69.1 83.4 18.9
Vanilla - asymmetric augmentation | 55.9 48.2 76.4 71.5 | 74.3 70.3 85.2 33.3 | 78.4 76.9 85.7 19.8
Large margin loss [31] | 52.2 44.3 77.2 68.5 | 78.2 74.3 87.8 26.6 | 80.2 79.1 84.5 25.5
Asymmetric large margin loss | 55.5 48.3 77.4 71.6
Focal loss [29] | 47.1 37.5 78.2 74.5 | 73.0 66.0 87.6 40.2 | 79.0 73.2 90.0 20.3
Asymmetric focal loss |

overfitting under class imbalance. These results in turn support our previous analysis of logit shift under class imbalance and indicate that taking this implication into account helps build unbiased networks.
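The overlap metrics (DSC, SEN, PRC) and the largest-component post-processing described above can be sketched in numpy; this is a minimal illustration for 2D binary masks, not the evaluation code used in the paper (the 95% HD is omitted, as it needs a distance transform):

```python
import numpy as np
from collections import deque

def seg_metrics(pred, gt):
    """DSC, sensitivity (SEN) and precision (PRC) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = (pred & gt).sum()
    fp = (pred & ~gt).sum()
    fn = (~pred & gt).sum()
    return {"DSC": 2 * tp / (2 * tp + fp + fn),
            "SEN": tp / (tp + fn),       # fraction of foreground recovered
            "PRC": tp / (tp + fp)}       # fraction of predictions correct

def largest_component(mask):
    """Suppress all but the largest 4-connected foreground region of a
    2D binary mask (the single-target post-processing assumed above)."""
    mask = mask.astype(bool)
    seen = np.zeros_like(mask)
    best = np.zeros_like(mask)
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        comp, q = [], deque([(sy, sx)])   # BFS over one component
        seen[sy, sx] = True
        while q:
            y, x = q.popleft()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    q.append((ny, nx))
        if len(comp) > best.sum():        # keep only the largest region
            best = np.zeros_like(mask)
            for y, x in comp:
                best[y, x] = True
    return best
```

In practice a library routine such as a connected-component labeler would replace the hand-written BFS; the sketch keeps the example dependency-free.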
1) Baseline experiments:
We perform baseline experiments with binary brain tumor core segmentation, as shown in Table II. We show that increasing the sampling weight of tumor samples (from 50% to 80%) decreases performance when the dataset is highly imbalanced. This is because increasing the weight encourages the network to memorize the under-represented samples and may actually lead to more overfitting, thus being counter-productive. Simply changing the objective function to an F-score, defined as

$$F_\beta = \frac{(1+\beta^2)\,\mathrm{sensitivity} \cdot \mathrm{precision}}{\beta^2 \cdot \mathrm{precision} + \mathrm{sensitivity}},$$

which is a balancing loss that weights sensitivity $\beta$-times more than precision [14], [33], shows only little improvement, increasing the sensitivity slightly. Changing the sampling weights or training with a loss function based on F-scores seems to have little impact when foreground training samples are limited, with training accuracy close to 100%, as shown in Fig. 1. A common approach to alleviate under-segmentation is to adjust the thresholds of decision boundaries based on validation sets. In this work, however, we observe a distribution shift on unseen test data, which can differ significantly from the training/validation sets. Hence, a threshold selected on validation data may not be optimal for new test data. Due to the lack of ground truth, it is practically not possible to optimise the decision thresholds for a specific test set.
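For reference, the F-score above can be written as a one-line numpy sketch ($\beta = 1$ recovers F1, i.e. the DSC for binary masks; the sample values in the test are illustrative only):

```python
import numpy as np

def f_beta(sensitivity, precision, beta=1.0):
    """F-score as defined above; larger beta weights sensitivity
    more heavily relative to precision."""
    b2 = beta ** 2
    return (1 + b2) * sensitivity * precision / (b2 * precision + sensitivity)
```

With a sensitive but imprecise prediction (e.g. SEN = 0.8, PRC = 0.4), F2 exceeds F1, which is exactly why the F2/F4/F8 baselines are expected to trade precision for sensitivity.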
2) Asymmetric large margin loss:
The original large margin loss decreases performance in some cases, while our modification yields improvements over the symmetric version in all cases.
3) Asymmetric focal loss:
The original focal loss also decreases the sensitivity in some cases and leads to worse performance. This is because the focal term decreases the weight of foreground samples and pushes their logits closer to the decision boundary, making false negative predictions more likely. Our modification removes the loss attenuation for the under-represented class and improves performance in all cases. We notice that the asymmetric focal loss can make the other class (kidney) overfit more, but this is not the focus of this study and can easily be addressed by keeping the focal term for the background class.
4) Asymmetric adversarial training:
When the network is trained without data augmentation (as shown in Table II), the original adversarial training seems to be effective when little training data is available, while our modifications can further improve the sensitivity and boost the performance substantially.

When the network is trained with data augmentation (as shown in Tables III and IV), we find that the original adversarial training does not improve the performance when training data is limited. This indicates that in this case the samples generated by adversarial training might not add anything on top of intensity augmentation. In contrast, our proposed modifications seem to always help, leading to better segmentation performance.

Fig. 7. Activations of the classification layer when processing tumor (top) and background (bottom) samples of BRATS with DeepMedic, using 5% training data. Panels compare the vanilla model, original/asymmetric large margin loss, original/asymmetric focal loss, original/asymmetric adversarial training, original/asymmetric mixup, and the symmetric/asymmetric combination. Asymmetric modifications lead to better separation of the logits of unseen tumor samples.

Fig. 8. Activations of the classification layer for tumor (top), kidney (middle) and background (bottom) samples of KiTS with 3D U-Net, using 10% training data. Panels compare training w/o augmentation, original/asymmetric augmentation, original/asymmetric large margin loss, original/asymmetric focal loss, original/asymmetric adversarial training, original/asymmetric mixup, and the symmetric/asymmetric combination. Asymmetric modifications also lead to better separation of the logits of unseen tumor samples and are complementary to standard data augmentation.
5) Asymmetric mixup:
We find that the original mixup can be effective for the well-represented class. For example, it always improves the segmentation performance for the kidney, as shown in Table IV. However, the original mixup leads to lower sensitivity for the under-represented class. We find that asymmetric mixup improves the performance on BRATS to a large extent, as shown in Table II. However, we also find it to be less effective for ATLAS and KiTS, which only have one image channel, as shown in Tables III and IV. This may be because the intensity distributions of healthy and lesion regions in ATLAS and KiTS overlap substantially, as shown in Supplementary Fig. 10. In this case, the mixed samples, which are very similar to the background samples in intensity but are taken as foreground samples, may confuse the network. For BRATS, however, four image channels are available, and the healthy and tumor regions show larger differences in the T2 and fluid-attenuated inversion recovery (FLAIR) sequences. The mixed samples seem to take good advantage of this intensity relationship.
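The hard-label rule of Eq. (19) can be illustrated with a minimal numpy sketch for the binary case (label 1 = foreground); in practice `lam` is drawn from a beta distribution and the margin value here is illustrative:

```python
import numpy as np

def asymmetric_mixup(x_i, y_i, x_k, y_k, lam, m=0.3):
    """Mix a pair of samples (Eq. 18) but assign a hard label per
    Eq. 19: the mix keeps a foreground label only if it carries at
    least weight m from a foreground sample; ambiguous mixes get
    label None and are dropped from the update."""
    x_mix = lam * x_i + (1 - lam) * x_k
    if y_i == y_k:                    # same class: label is unambiguous
        return x_mix, y_i
    if y_i == 1 and lam > m:          # mix stays close enough to fg x_i
        return x_mix, 1
    if y_k == 1 and (1 - lam) > m:    # mix stays close enough to fg x_k
        return x_mix, 1
    return x_mix, None                # discarded (label set to empty)
```

Note that a foreground/background mix is never relabeled as background: it is either foreground or discarded, which is precisely how the method keeps the decision boundary near the background class.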
6) Asymmetric augmentation:
Despite its simplicity, we find asymmetric augmentation to be an effective method to improve segmentation performance of the under-represented classes in most cases in terms of DSC and sensitivity, as summarized in Table III. However, we also notice that it can decrease performance when data augmentation is strong and training data is sufficient, as shown in Table IV. This might be because strong asymmetric augmentation drives the model to focus too much on the foreground samples, in turn making the background samples under-represented.
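Equation (21) amounts to a class-dependent augmentation probability; a minimal sketch (the probabilities `p_fg` and `p_bg` are illustrative, not values from the paper):

```python
import numpy as np

def asymmetric_augment(x, y, transform, p_fg=0.9, p_bg=0.3, rng=None):
    """Apply `transform` with a higher probability to foreground
    samples (y == 1) than to background samples, so the augmentation
    budget is spent on the under-represented class (cf. Eq. 21)."""
    rng = rng if rng is not None else np.random.default_rng()
    p = p_fg if y == 1 else p_bg
    return transform(x) if rng.random() < p else x
```

Any standard transformation (scaling, rotation, intensity shift) can be passed as `transform`; only its application probability differs between classes.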
7) The combination of asymmetric techniques:
We also combine the asymmetric techniques which were found to improve segmentation accuracy. The combination of the asymmetric techniques is a safe choice, leading to the overall best segmentation results with improved sensitivity in all cases. In contrast, the combination of the symmetric counterparts is unable to mitigate overfitting and often decreases sensitivity.
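The combined objective of Eq. (22) — margin only for the rare classes, focal attenuation only for the background — can be sketched for a single sample as follows. The sketch assumes the margin-adjusted softmax subtracts the margin $m$ from the rare true-class logit; the hyper-parameter values are illustrative:

```python
import numpy as np

def combined_asymmetric_ce(z, y, r, m=1.0, gamma=2.0):
    """One-sample sketch of Eq. 22. z: logits (c,), y: one-hot label
    (c,), r: indicator vector of the under-represented classes (c,).
    The margin m is applied only to the rare true-class logit
    (asymmetric large margin) and the focal factor (1-q)^gamma only
    to background classes (asymmetric focal)."""
    z_m = z - m * (y * r)               # margin only for a rare true class
    q = np.exp(z_m - z_m.max())
    q = q / q.sum()                     # q-hat: margin-adjusted softmax
    fg_term = -(r * y * np.log(q)).sum()                    # plain CE
    bg_term = -(((1 - r) * (1 - q) ** gamma) * y * np.log(q)).sum()
    return fg_term + bg_term
```

For a foreground sample the margin increases the loss (pushing its logits away from the boundary), while a confidently classified background sample is attenuated by the focal factor.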
C. Logit distribution changes
The effects of all the techniques on the logit distributions of BRATS using 5% training data are presented in Fig. 7. Asymmetric techniques increase the variance of the foreground class and expand its logit distribution. The original large margin loss and adversarial training try to push samples from different classes away from each other; however, the logits of unseen data remain centered around the decision boundary and thus the predictions are not improved. The original large margin loss results in even larger shifts for the foreground samples. With our asymmetric modifications, only the logits of foreground samples are pushed away, and the unseen foreground logits tend to remain on the correct side of the decision boundary. The original focal loss encourages the network to prevent the logits of each class from staying too far from the decision boundary. However, it allows foreground logits to remain near the decision boundary, which can result in false negative predictions on unseen samples. Our asymmetric focal loss removes this constraint for foreground samples. The original mixup encourages symmetric distributions of the different classes but does not consider class imbalance. Asymmetric mixup exploits the latent space, based on the relationship between samples, to generate foreground samples and keeps the decision boundary near the background class. This leads to the largest improvement, increasing the region of the foreground logit distribution and reducing the logit shift of unseen foreground samples. The combination of the four asymmetric techniques stabilizes the logits even further.

The effects of all the techniques on the logit distribution of KiTS using 10% training data are summarized in Fig. 8. The original data augmentation can reduce the logit shift of both unseen kidney and kidney tumor samples, although it is not specifically designed to regularize the logit distribution. The asymmetric augmentation can further reduce the logit shift of the unseen tumor samples. 
The asymmetric large margin loss reduces the tumor logit shift towards the kidney class. Although the logit distribution is already regularized by the strong augmentation, the proposed asymmetric techniques provide further benefits in stabilizing the logits.

VI. CONCLUSION
We study overfitting of neural networks under class imbalance by inspecting network behavior. We observe that when processing unseen under-represented samples, the logit activations tend to shift towards the decision boundary and the sensitivity decreases. This phenomenon is confirmed across a variety of tasks and two popular network architectures. We derive simple yet effective asymmetric variants of existing loss functions and regularization techniques to prevent this overfitting. We show that our proposed methods can substantially improve segmentation performance under class imbalance in terms of DSC and increased sensitivity, outperforming previous solutions. We believe more regularization methods can be derived to alleviate this problem by considering the biased network behavior. We also believe that plotting logit distributions may be a useful network-inspection tool and help gain a better understanding of network behavior under different training scenarios. In future work, we will investigate whether monitoring intermediate activations may provide further insights for other challenging settings such as domain shift or self-supervised learning.

ACKNOWLEDGEMENTS
This work received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 757173, project MIRA, ERC-2017-STG). ZL is supported by the China Scholarship Council (CSC). KK is funded by the UKRI London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare.

REFERENCES

[1] N. Abraham and N. M. Khan. A novel focal tversky loss function with improved attention u-net for lesion segmentation. In , pages 683–687. IEEE, 2019.
[2] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos. Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Sci. Data, 4:170117, 2017.
[3] T. Brosch, L. Y. Tang, Y. Yoo, D. K. Li, A. Traboulsee, and R. Tam. Deep 3d convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation. IEEE Transactions on Medical Imaging, 35(5):1229–1239, 2016.
[4] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.
[5] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pages 1567–1578, 2019.
[6] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[7] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
[8] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.
[9] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9268–9277, 2019.
[10] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[11] Q. Dong, S. Gong, and X. Zhu. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1367–1381, 2018.
[12] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
[13] H. Guo and H. L. Viktor. Learning from imbalanced data sets with boosting and data generation: the databoost-im approach. ACM SIGKDD Explorations Newsletter, 6(1):30–39, 2004.
[14] S. R. Hashemi, S. S. M. Salehi, D. Erdogmus, S. P. Prabhu, S. K. Warfield, and A. Gholipour. Asymmetric loss functions and deep densely-connected networks for highly-imbalanced medical image segmentation: Application to multiple sclerosis lesion detection. IEEE Access, 7:1721–1735, 2018.
[15] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, and H. Larochelle. Brain tumor segmentation with deep neural networks. Medical Image Analysis, 35:18–31, 2017.
[16] N. Heller, N. Sathianathen, A. Kalapara, E. Walczak, K. Moore, H. Kaluzniak, J. Rosenberg, P. Blake, Z. Rengel, M. Oestreich, et al. The kits19 challenge data: 300 kidney tumor cases with clinical context, ct semantic segmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445, 2019.
[17] C. Huang, Y. Li, C. C. Loy, and X. Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375–5384, 2016.
[18] F. Isensee, P. F. Jäger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein. Automated design of deep learning methods for biomedical image segmentation. arXiv preprint arXiv:1904.08128, 2019.
[19] J. M. Johnson and T. M. Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):27, 2019.
[20] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker. Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Med. Image Anal., 36:61–78, 2017.
[21] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis. Decoupling representation and classifier for long-tailed recognition. In International Conference on Learning Representations (ICLR), 2020.
[22] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. B. Ayed. Boundary loss for highly unbalanced segmentation. In International Conference on Medical Imaging with Deep Learning, pages 285–296, 2019.
[23] S. Khan, M. Hayat, S. W. Zamir, J. Shen, and L. Shao. Striking the right balance with uncertainty. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 103–112, 2019.
[24] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems.
ICML, volume 2, pages 379–386, 2002.
[27] Z. Li, K. Kamnitsas, and B. Glocker. Overfitting of neural nets under class imbalance: Analysis and improvements for segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 402–410. Springer, 2019.
[28] S.-L. Liew, J. M. Anglin, N. W. Banks, M. Sondag, K. L. Ito, H. Kim, J. Chan, J. Ito, C. Jung, N. Khoshab, et al. A large, open source dataset of stroke anatomical brain images and manual lesion segmentations. Scientific Data, 5:180011, 2018.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[30] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
[31] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning (ICML), pages 507–516, 2016.
[32] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.
[33] F. Milletari, N. Navab, and S.-A. Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In , pages 565–571. IEEE, 2016.
[34] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.
[35] M. H. Savenije, M. Maspero, G. G. Sikkes, J. R. van der Voort van Zyp, A. N. TJ Kotte, G. H. Bol, and C. A. T. van den Berg. Clinical implementation of mri-based organs-at-risk auto-segmentation with convolutional networks for prostate radiotherapy. Radiation Oncology, 15:1–12, 2020.
[36] A. A. A. Setio, F. Ciompi, G. Litjens, P. Gerke, C. Jacobs, S. J. Van Riel, M. M. W. Wille, M. Naqibullah, C. I. Sánchez, and B. van Ginneken. Pulmonary nodule detection in ct images: false positive reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging, 35(5):1160–1169, 2016.
[37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[38] K. Tang, J. Huang, and H. Zhang. Long-tailed classification by keeping the good and removing the bad momentum causal effect. arXiv preprint arXiv:2009.12991, 2020.
[39] V. V. Valindria, I. Lavdas, J. Cerrolaza, E. O. Aboagye, A. G. Rockall, D. Rueckert, and B. Glocker. Small organ segmentation in whole-body mri using a two-stage fcn and weighting schemes. In MICCAI-MLMI, pages 346–354. Springer, 2018.
[40] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
[41] S. Wang, W. Liu, J. Wu, L. Cao, Q. Meng, and P. J. Kennedy. Training deep neural networks on imbalanced data sets. In , pages 4368–4374. IEEE, 2016.
[42] Y.-X. Wang, D. Ramanan, and M. Hebert. Learning to model the tail. In Advances in Neural Information Processing Systems, pages 7029–7039, 2017.
[43] K. C. Wong, M. Moradi, H. Tang, and T. Syeda-Mahmood. 3d segmentation with exponential logarithmic loss for highly unbalanced object sizes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 612–619. Springer, 2018.
[44] C. Xie, M. Tan, B. Gong, J. Wang, A. Yuille, and Q. V. Le. Adversarial examples improve image recognition. arXiv preprint arXiv:1911.09665, 2019.
[45] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pages 6023–6032, 2019.
[46] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining, pages 435–442. IEEE, 2003.
[47] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.
[48] B. Zhou, Q. Cui, X.-S. Wei, and Z.-M. Chen. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9719–9728, 2020.
[49] P. Zhou and J. Feng. Understanding generalization and optimization performance of deep cnns. In International Conference on Machine Learning, 2018.

SUPPLEMENTARY MATERIAL
VII. THE ANALYSIS OF LARGE MARGIN LOSS FROM A SAMPLE RE-WEIGHTING PERSPECTIVE
Some of the proposed methods are based on two well-known loss functions, the focal loss and the large margin loss. We argue that both can be seen as sample-level re-weighting methods, which change the magnitude of the gradient of the network output by multiplying it with a scalar. Specifically, the focal loss decreases the weights of well-classified samples, while the large margin loss increases the weights of all samples, especially of well-classified ones. Here, we provide an analysis of these loss functions from a sample re-weighting perspective.

Following the formulation in the main text, we first consider a CNN trained using the cross-entropy loss with one sample $x_i$ and its one-hot label $y_i$:

$$L_{CE}(x_i, y_i) = -\sum_{j=1}^{c} y_{ij} \log(p_{ij}) = -y_i \cdot \log(p_i), \quad (23)$$

where $p_i$ is the predicted probability, a normalized form of the network output $z_i$:

$$p_i = \frac{e^{z_i}}{e^{z_i} \cdot \mathbf{1}}. \quad (24)$$

Let us look at the gradient of a general loss function $L(x_i, y_i)$ with respect to the network parameters $\theta$:

$$\frac{\partial L(x_i, y_i)}{\partial \theta} = \frac{\partial L(x_i, y_i)}{\partial z_i} \frac{\partial z_i}{\partial \theta}, \quad (25)$$

where the former term is associated with the design of the loss function and the latter with the network architecture. For the cross-entropy loss $L_{CE}(x_i, y_i)$ we have:

$$\frac{\partial L_{CE}(x_i, y_i)}{\partial z_i} = \frac{\partial L_{CE}(x_i, y_i)}{\partial p_i} \frac{\partial p_i}{\partial z_i} = -\frac{1}{p_i \cdot y_i}\, (p_i \cdot y_i)(y_i - p_i) = p_i - y_i. \quad (26)$$

At the instance level, we can assign different weights to different samples $x_i$ via scalars $w_i$. The weights could be derived from the frequency of samples or other heuristic rules. Typically, we multiply $L_{CE}(x_i, y_i)$ by $w_i$, and the gradient after re-weighting becomes:

$$\frac{\partial \big( w_i L_{CE}(x_i, y_i) \big)}{\partial z_i} = w_i (p_i - y_i). \quad (27)$$
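The cross-entropy gradient and the re-weighting identity above can be checked numerically with finite differences; a minimal sketch on toy logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def num_grad(f, z, h=1e-5):
    """Central finite differences of a scalar loss w.r.t. the logits z."""
    g = np.zeros_like(z)
    for j in range(z.size):
        zp, zm = z.copy(), z.copy()
        zp[j] += h
        zm[j] -= h
        g[j] = (f(zp) - f(zm)) / (2 * h)
    return g

z = np.array([1.5, -0.3, 0.2])   # toy logits
y = np.array([0.0, 1.0, 0.0])    # one-hot label
w_i = 2.5                        # an illustrative instance weight

ce = lambda t: -np.log(softmax(t) @ y)   # cross-entropy (Eq. 23)
p = softmax(z)
# dL_CE/dz = p - y, and re-weighting simply scales this gradient
assert np.allclose(num_grad(ce, z), p - y, atol=1e-4)
assert np.allclose(num_grad(lambda t: w_i * ce(t), z), w_i * (p - y), atol=1e-4)
```

The same check extends to the margin-adjusted loss by replacing `ce` with the loss computed from the margin softmax.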
(27) It can be seen that re-weighting would change the gradientof the network output for different samples and make themodel fit better the chosen samples, which are assigned alarger weight w i .Similarly, we can also derive the gradient of the focal lossas: ∂ (cid:16) L CE focal ( x i , y i ) (cid:17) ∂ z i = ∂ (cid:16) L CE focal ( x i , y i ) (cid:17) ∂ p i ∂ p i ∂ z i = (cid:16) (1 − p i · y i ) γ − γ p i · y i log( p i · y i )(1 − p i · y i ) γ − (cid:17) ( p i − y i )= w i focal ( p i − y i ) , (28)The weight term of focal loss w i focal is a scalar and relatedto the sample probability p i · y i . Generally speaking, w i focal would decrease when p i · y i is large, therefore focal loss wouldmake the model fit the easy cases less.More interestingly, we next look into the effect of large mar-gin loss on the gradient. Large margin loss would change thecalculation of probability for the training process. Specifically,we substitute p i with q i to calculate the loss function, wherewe require: q i = e z i − y i m e z i − y i m · , (29)where m is the hyper-parameter for the margin. In this case,the gradient of large margin loss can be derived as: ∂L CE M ( x i , y i ) ∂ z i = ∂L CE ( x i , y i ) ∂ q i ∂ q i ∂ z i = e z i · e z i − y i m · ( p i − y i ) = w i M ( p i − y i ) . (30)The weight term of large margin loss w i M is also a scalarand related to the network output z i · y i . It can be seen thatthe existence of a margin m would increase the gradient ofsample x i . Moreover, w i M would be larger as z i · y i becomeslarger, therefore large margin loss would make the model fitthe easy cases more, and keep the distribution of z i away fromthe decision boundary.The analysis for L DSC can be done in a similar way.VIII. T
HE MAGNITUDE OF FOCAL
DSC
LOSS
The proposed focal DSC loss has similar behaviour with theoriginal of focal loss, as shown in Figure 9. In addition, it doesnot change the magnitude of loss too much compared withexisting solutions [1], [43], making it easier to be combinedwith other losses. We find it is particularly important forour experiments with 3D U-net [18] because this frameworkadopts a loss function which is a combination of cross entropyand DSC loss.IX. H
YPER - PARAMETERS OF THE REGULARIZATIONTECHNIQUES
We summarize the hyper-parameters in Table V, Table VIand Table VII as a reference for practitioners. We find whenthe asymmetric regularization techniques are combined to-gether, the network could be regularized too much. In this case,
Fig. 9. The comparison of focal loss with cross entropy and DSC loss. Thebehavior of focal loss for cross entropy and DSC loss are similar using theformulation in equation 10 and 12. the model would not converge and even perform poorly on thetraining data. Therefore, we always choose hyper-parameterswith smaller regularization magnitude for the experiments withthe combined regularization. Empirically, we find decreasingthe hyper-parameters of large margin loss and/or focal loss isa sensible choice.
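The re-weighting identities derived in Section VII can be checked numerically. The sketch below (a single sample with three classes; the logit values are arbitrary, chosen only for illustration) verifies equation (26) and the large margin identity of equation (30), i.e. that the margin-softmax gradient $q_i - y_i$ equals $w_M^i (p_i - y_i)$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# One sample with c = 3 classes; the true class has index 1.
z = np.array([2.0, 0.5, -1.0])   # network output (logits), illustrative
y = np.array([0.0, 1.0, 0.0])    # one-hot label
p = softmax(z)                    # equation (24)

# Equation (26): the gradient of cross entropy w.r.t. the logits is p - y.
grad_ce = p - y

# Equation (28): the focal loss multiplies this gradient by a scalar weight
# that shrinks as the true-class probability p·y approaches 1.
gamma = 2.0
pt = p @ y
w_focal = (1 - pt) ** gamma - gamma * pt * np.log(pt) * (1 - pt) ** (gamma - 1)

# Equations (29)-(30): the large margin loss subtracts the margin m from the
# true-class logit before the softmax; its gradient is q - y, which equals
# w_M * (p - y) with w_M >= 1.
m = 1.0
q = softmax(z - y * m)
w_margin = np.exp(z).sum() / np.exp(z - y * m).sum()
assert np.allclose(q - y, w_margin * (p - y))   # the re-weighting identity
```

Note that for a poorly-classified sample (small $p_i \cdot y_i$, as here) the focal weight can exceed 1; it is only for well-classified samples that the focal term suppresses the gradient, which is exactly the behavior described above.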
TABLE V
HYPER-PARAMETERS OF EXPERIMENTS USING DEEPMEDIC WITH BRATS.

BRATS (5% / 10% / 20% / 50% data). [Most cell values were lost in extraction; the surviving entries are listed below.]
Individual: adversarial l: 10 / 10 / 10 / 20; mixup λ (symmetric): 0.2 / 0.2 / 0.2 / 0.2; mixup λ (asymmetric): 1 / 1 / 1 / 1; large margin m, focal γ, adversarial ε, mixup m: values lost.
Combined: adversarial l: 10 / 10 / 10 / 20; mixup λ (symmetric): 0.2 / 0.2 / 0.2 / 0.2; mixup λ (asymmetric): 1 / 1 / 1 / 1; large margin m, focal γ, adversarial ε, mixup m: values lost.

TABLE VI
HYPER-PARAMETERS OF EXPERIMENTS USING DEEPMEDIC WITH ATLAS.

ATLAS (30% / 50% / 100% data). [Most cell values were lost in extraction; the surviving entries are listed below.]
Individual: adversarial l: 10 / 10 / 10; mixup λ (symmetric): 0.2 / 0.2 / 0.2; mixup λ (asymmetric): 1 / 1 / 1; large margin m, focal γ, adversarial ε, mixup m: values lost.
Combined: adversarial l: 10 / 10 / 10; mixup λ (symmetric): 0.2 / 0.2 / 0.2; mixup λ (asymmetric): —— / —— / 1; mixup m: —— / —— / 0.2; probability of background samples being augmented: 50% / 50% / ——.

TABLE VII
HYPER-PARAMETERS OF EXPERIMENTS USING 3D U-NET WITH KITS.

KiTS (10% / 50% / 100% data). [Most cell values were lost in extraction; the surviving entries are listed below.]
Individual: adversarial l: 100 / 50 / 50; mixup λ (symmetric): 0.2 / 0.2 / 0.2; mixup λ (asymmetric): 1 / 1 / 1; large margin m, focal γ, adversarial ε, mixup m: values lost.
Combined: large margin m, focal γ, adversarial ε, adversarial l, mixup λ (symmetric), mixup λ (asymmetric), mixup m: —— / —— / ——; probability of background samples being augmented: 0% / —— / ——.

X. SENSITIVITY ANALYSIS

We conduct a series of controlled experiments with different hyper-parameters to provide more practical details on the proposed regularization techniques. Specifically, we use a baseline DeepMedic model for brain tumor core segmentation with 5% of the BRATS training data. The experimental details are consistent with the descriptions in Section V. We summarize the results with and without post-processing in Table VIII.

We can see from the results that the proposed methods improve the baseline segmentation results under varied hyper-parameters in most cases. Specifically, the asymmetric large margin loss yields improvements in most cases; however, a specific hyper-parameter may yield unexpected results (i.e. m = 0.5). A potential reason is that a model which focuses on a small portion of easy under-represented samples (cf. equation 30 in Section VII) would overfit more. The asymmetric large margin loss with a larger m makes the model emphasize more of the under-represented samples and therefore generalize better. Asymmetric adversarial training and asymmetric mixup yield considerable improvements when the perturbation in data augmentation is larger (i.e. l > 2.5 for asymmetric adversarial training and m < 0.4 for asymmetric mixup). The asymmetric focal loss is robust and improves the segmentation results for all chosen hyper-parameters. Therefore, we recommend choosing the asymmetric focal loss first for new applications.

XI. THE INTENSITY HISTOGRAM OF DIFFERENT DATASETS
Empirically, we find that asymmetric mixup is the most effective method for tumor segmentation with BRATS. However, asymmetric mixup shows limited improvements for ATLAS and KiTS. We believe this is because the multi-channel information in BRATS can create more useful synthetic information, as shown in Figure 10.

XII. THE QUANTITATIVE RESULTS OF ABDOMINAL ORGAN SEGMENTATION
We evaluate one of our proposed techniques, the asymmetric focal loss, on the task of abdominal organ segmentation to demonstrate that our method can feasibly be applied to multi-class segmentation. Specifically, we train a basic DeepMedic model using 25% of the training data, with the same settings as in the empirical experiments of Section III. Considering the class distribution of the dataset, as shown in Figure 11, we take class 4, class 5, class 8, class 9, class 10, class 11, class 12 and class 13 as rare classes. Accordingly, we initialize the indicator vector r with ones at the positions of these rare classes and zeros elsewhere. We use γ = 4 in this experiment. We apply the post-processing described in Section V separately to the results of every class. The results are shown in Table IX. The asymmetric focal loss achieves better overall segmentation results than cross entropy or its symmetric variant. More importantly, it achieves better segmentation results with higher sensitivity for most rare classes. Specifically, the asymmetric focal loss improves the average DSC of the rare classes by 4.9%. We also notice that the asymmetric focal loss decreases the segmentation performance for the esophagus, which is taken as a rare class. This is because the esophagus is very small, and post-processing removes correct segmentation regions by mistake while leaving the false positive predictions. We believe more advanced post-processing would help improve the segmentation in this case.

TABLE VIII
THE SENSITIVITY ANALYSIS OF DIFFERENT HYPER-PARAMETERS. WE CONDUCT EXPERIMENTS WITH DIFFERENT PARAMETERS FOR BRAIN TUMOR CORE SEGMENTATION (5% TRAINING DATA) WITH BRATS USING DEEPMEDIC. RESULTS WHICH HAVE WORSE DSC THAN THE VANILLA BASELINE ARE HIGHLIGHTED WITH GRAY SHADING.

w/ post-processing
Method                           Parameter          DSC   SEN   PRC   HD
Vanilla - CE                     ——                 50.4  41.0  83.5  18.0
Asymmetric large margin loss     m = 0.2            53.6  44.8  84.8  15.6
                                 m = 0.5            48.4  39.4  81.5  16.9
                                 m = 1              56.8  48.9  83.4  15.0
                                 m = 1.5            54.1  45.6  81.7  15.3
                                 m = 2              51.6  42.8  84.0  16.7
                                 m = 3              54.4  45.7  82.3  14.3
Asymmetric focal loss            γ = 0.5            53.4  44.8  79.6  16.6
                                 γ = 1              53.9  45.2  81.9  17.9
                                 γ = 1.5            56.5  48.3  87.8  13.8
                                 γ = 2              58.8  51.4  81.6  15.0
                                 γ = 3              57.5  49.0  85.5  14.2
                                 γ = 4              55.2  48.5  78.3  15.5
Asymmetric adversarial training  ε = 1e-5, l = 2.5  50.3  41.8  82.0  17.2
                                 ε = 1e-5, l = 5    58.1  50.0  84.7  14.1
                                 ε = 1e-5, l = 10   58.5  50.8  80.1  16.2
                                 ε = 1e-5, l = 15   53.8  46.2  80.1  16.9
                                 ε = 1e-5, l = 20   56.6  50.7  76.9  18.8
                                 ε = 1e-4, l = 10   57.6  51.1  78.9  16.1
                                 ε = 1e-6, l = 10   56.2  48.5  81.1  17.8
Asymmetric mixup                 m = 0.1            52.1  47.3  73.8  20.7
                                 m = 0.15           58.1  53.7  75.0  19.9
                                 m = 0.2            59.8  56.8  74.7  17.7
                                 m = 0.25           60.4  55.0  82.0  15.6
                                 m = 0.3            59.1  54.3  82.0  15.3
                                 m = 0.4            52.1  44.2  84.2  21.4
                                 m = 0.8            50.3  41.6  85.5  17.7

w/o post-processing
Vanilla - CE                     ——                 51.0  42.6  78.6  17.5
Asymmetric large margin loss     m = 0.2            53.2  46.0  79.8  18.3
                                 m = 0.5            48.8  40.8  78.1  17.5
                                 m = 1              55.5  50.6  76.2  23.9
                                 m = 1.5            52.6  47.2  73.1  25.8
                                 m = 2              51.4  44.2  78.2  18.8
                                 m = 3              53.4  47.3  75.1  21.3
Asymmetric focal loss            γ = 0.5            54.2  48.0  76.2  22.0
                                 γ = 1              53.7  46.8  76.0  22.8
                                 γ = 1.5            54.3  49.6  76.3  25.9
                                 γ = 2              57.3  52.7  76.4  24.4
                                 γ = 3              55.7  50.3  75.4  24.6
                                 γ = 4              54.4  50.3  71.1  25.4
Asymmetric adversarial training  ε = 1e-5, l = 2.5  50.5  43.6  76.3  21.3
                                 ε = 1e-5, l = 5    56.6  51.3  76.1  21.9
                                 ε = 1e-5, l = 10   56.8  51.8  74.8  23.6
                                 ε = 1e-5, l = 15   53.3  47.6  74.8  21.5
                                 ε = 1e-5, l = 20   55.4  53.2  72.0  26.2
                                 ε = 1e-4, l = 10   56.9  53.3  74.1  22.4
                                 ε = 1e-6, l = 10   55.2  50.0  76.0  23.8
Asymmetric mixup                 m = 0.1            52.0  48.8  68.7  32.2
                                 m = 0.15           58.0  55.7  70.6  31.6
                                 m = 0.2            59.3  57.9  70.6  27.8
                                 m = 0.25           60.1  55.9  78.0  23.5
                                 m = 0.3            59.2  55.4  77.9  17.6
                                 m = 0.4            52.8  45.3  80.2  21.5
                                 m = 0.8            51.0  43.6  79.2  18.7

Fig. 10. The intensity histograms of (a) BRATS, (b) ATLAS and (c) KiTS. The intensities of the foreground and background classes overlap strongly for ATLAS and KiTS. This may be a factor due to which asymmetric mixup does not create useful synthetic samples and cannot improve the segmentation performance as much.

Fig. 11. The class distribution of the abdomen dataset used in this study. We summarize the total pixel count of the different classes. We take classes 4, 5, 8, 9, 10, 11, 12 and 13 as rare classes.
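Section XII above applies the asymmetric focal loss with a rare-class indicator vector r. The exact formulation is given in the main text (equation 10), which is not reproduced here; the sketch below shows one plausible reading of the asymmetry, in which the focal modulation $(1 - p_t)^\gamma$ is applied only to samples whose true class is not marked in r, so that easy background samples are down-weighted without suppressing the under-represented classes. All names and the per-sample formulation are illustrative, not the paper's definitive implementation.

```python
import numpy as np

def asymmetric_focal_loss(p, y, r, gamma=4.0):
    """Per-sample loss sketch. p: (c,) predicted probabilities, y: (c,)
    one-hot label, r: (c,) indicator of rare classes, gamma: focusing
    parameter. Assumed reading: the focal term (1 - p_t)^gamma is applied
    only when the true class is NOT rare; rare classes keep plain CE."""
    pt = float(p @ y)            # probability assigned to the true class
    is_rare = bool(y @ r)        # does the label belong to a rare class?
    ce = -np.log(pt)
    if is_rare:
        return ce                # no down-weighting for rare classes
    return (1.0 - pt) ** gamma * ce  # focal down-weighting otherwise

# Rare-class indicator mirroring Section XII (classes 4, 5, 8-13 rare),
# assuming index 0 is the background class.
r = np.zeros(14)
r[[4, 5, 8, 9, 10, 11, 12, 13]] = 1.0
```

For example, a well-classified background sample with $p_t = 0.9$ contributes a loss scaled by $0.1^4 = 10^{-4}$, while an equally well-classified rare-class sample keeps its full cross-entropy term.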
XIII. QUANTITATIVE RESULTS WITHOUT POST-PROCESSING

The quantitative segmentation results without post-processing are summarized in Table X, Table XI and Table XII. Without post-processing, the proposed asymmetric regularization methods improve DSC but can lead to worse distance-based evaluation metrics such as the Hausdorff distance (HD). This is because the regularized model, being more sensitive to the under-represented classes, makes relatively more false positive predictions. False positive predictions far from the ground truth increase HD significantly. However, in practice most false positive predictions can easily be removed by connected component-based post-processing, as described in Section V. In this way, we eventually obtain better or similar HD with our methods, as shown in the main text.

TABLE IX
EVALUATION OF ABDOMEN SEGMENTATION WITH 25% OF TRAINING DATA WITH SYMMETRIC (SY.) AND ASYMMETRIC (ASY.) FOCAL LOSS. THE RARE CLASSES ARE MARKED WITH r. AVG IS THE AVERAGE PERFORMANCE OF ALL CLASSES. AVGr IS THE AVERAGE PERFORMANCE OF ALL RARE CLASSES. BEST RESULTS ARE HIGHLIGHTED IN BOLD.

Classes: c1 (spleen), c2 (right kidney), c3 (left kidney), c4 (gallbladder)r, c5 (esophagus)r, c6, c7, c8r, c9 (vena cava)r, c10 (vein)r, c11 (pancreas)r, c12 (right adrenal)r, c13 (left adrenal)r, AVG, AVGr. Each class has three columns: vanilla CE / sy. focal loss / asy. focal loss. [Entries for c1-c5 and most DSC values were lost in extraction; the recoverable rows are:]

             c6              c7              c8r             c9 (vena cava)r  c10 (vein)r
DSC          (only 87.4 and 88.4 recoverable for this group)
Sensitivity  84.1/85.2/84.0  28.8/28.5/31.0  76.6/73.1/82.9  56.8/60.3/76.9   28.4/16.5/31.0
Precision    92.3/92.8/94.2  91.3/84.1/86.3  91.6/91.5/86.3  86.1/79.3/72.7   91.5/82.1/80.0

             c11 (pancreas)r  c12 (r adrenal)r  c13 (l adrenal)r  AVG             AVGr
DSC          (only 17.2 and 24.3 recoverable for this group)
Sensitivity  11.1/17.3/18.3   45.3/24.3/50.3    27.2/22.2/41.6    51.2/48.1/55.7  40.2/35.6/48.9
Precision    56.6/52.0/61.7   74.5/60.8/69.4    61.4/52.7/63.6    82.4/78.3/76.4  76.8/69.6/68.5
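The connected component-based post-processing referred to above (and detailed in Section V of the main text) typically keeps only the largest connected foreground component, removing isolated false positives that inflate the Hausdorff distance. The paper's actual pipeline may differ; a minimal, self-contained 2D sketch of the idea is:

```python
from collections import deque

def keep_largest_component(mask):
    """mask: 2D list of 0/1 values. Returns a copy in which only the
    largest 4-connected foreground component is kept; small spurious
    false positive regions are removed."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                # Breadth-first search to collect one connected component.
                comp, q = [], deque([(i, j)])
                seen[i][j] = True
                while q:
                    a, b = q.popleft()
                    comp.append((a, b))
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        na, nb = a + da, b + db
                        if 0 <= na < h and 0 <= nb < w and mask[na][nb] and not seen[na][nb]:
                            seen[na][nb] = True
                            q.append((na, nb))
                if len(comp) > len(best):
                    best = comp
    out = [[0] * w for _ in range(h)]
    for a, b in best:
        out[a][b] = 1
    return out
```

For instance, a mask containing a three-pixel lesion and a single distant false positive pixel keeps only the lesion; for 3D volumes the same logic applies with 6-connectivity.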
TABLE X
EVALUATION OF BRAIN TUMOR CORE SEGMENTATION USING DEEPMEDIC WITH DIFFERENT AMOUNTS OF TRAINING DATA AND DIFFERENT TECHNIQUES TO COUNTER OVERFITTING. THE RESULTS ARE CALCULATED WITHOUT ANY POST-PROCESSING. RESULTS WHICH HAVE WORSE DSC THAN THE VANILLA BASELINE ARE HIGHLIGHTED WITH GRAY SHADING. BEST AND SECOND BEST RESULTS ARE HIGHLIGHTED IN BOLD WITH THE BEST ALSO BEING UNDERLINED.

Method                        5% training (DSC SEN PRC HD)  10% training          20% training          50% training
Vanilla - CE [20]             51.0 42.6 78.6 17.5           62.8 56.9 81.6 (lost) (entries lost)        (entries lost)
Asymmetric large margin loss  55.5 50.6 76.2 23.9           64.3 57.9 84.1 (lost) (entries lost)        (entries lost)
Symmetric combination         50.6 43.0 82.2 20.3           61.0 54.2 83.4 23.3   64.9 59.5 85.9 16.8   67.4 63.9 84.4 15.7
Asymmetric combination        (entries lost)
[Rows for the remaining methods were lost in extraction.]
TABLE XI
EVALUATION OF BRAIN STROKE LESION SEGMENTATION ON ATLAS USING DEEPMEDIC WITH DIFFERENT AMOUNTS OF TRAINING DATA AND DIFFERENT TECHNIQUES TO COUNTER OVERFITTING. THE RESULTS ARE CALCULATED WITHOUT POST-PROCESSING. RESULTS WHICH HAVE WORSE DSC THAN THE VANILLA BASELINE ARE HIGHLIGHTED WITH SHADING. BEST AND SECOND BEST RESULTS ARE IN BOLD WITH THE BEST ALSO UNDERLINED.

Method                         30% training (DSC SEN PRC HD)  50% training          100% training
Vanilla - w/ augmentation [20] 22.9 22.4 52.3 41.7            47.7 48.5 55.3 (lost) (entries lost)
Large margin loss [31]         20.2 17.0 59.4 (lost)          (entries lost)        (entries lost)
Asymmetric large margin loss   24.0 23.8 50.4 41.4            49.2 52.6 54.7 35.4   56.9 59.9 60.8 27.7
Focal loss [29]                21.9 20.2 55.0 (lost)          (entries lost)        (entries lost)
[Rows for the remaining methods were lost in extraction.]

TABLE XII
EVALUATION OF KIDNEY AND KIDNEY TUMOR SEGMENTATION BASED ON 3D U-NET WITH DIFFERENT AMOUNTS OF TRAINING DATA AND DIFFERENT TECHNIQUES TO COUNTER OVERFITTING. THE RESULTS ARE CALCULATED WITHOUT ANY POST-PROCESSING. RESULTS WHICH HAVE WORSE DSC THAN THE VANILLA BASELINE ARE HIGHLIGHTED WITH SHADING. BEST AND SECOND BEST RESULTS ARE IN BOLD WITH THE BEST ALSO UNDERLINED.

Kidney
Method                           10% training (DSC SEN PRC HD)  50% training         100% training
Vanilla - w/ augmentation [18]   93.7 91.7 96.9 5.6             96.5 96.1 97.0 3.6   (entries lost)
Vanilla - w/o augmentation       92.8 90.2 96.6 12.4            96.3 93.1 96.6 2.5   96.5 96.4 96.8 3.8
Vanilla - asymmetric augment.    94.7 93.0 96.9 5.2             95.6 95.9 95.8 5.9   96.5 96.6 96.6 3.7
Large margin loss [31]           (entries lost)
Asymmetric adversarial training  94.6 92.8 97.2 4.6             (entries lost)       (entries lost)
Mixup [47]                       (entries lost)
Asymmetric combination           94.0 90.5 98.4 4.8             94.3 90.8 95.4 5.1   (entries lost)

Kidney tumor
Large margin loss [31]           54.8 48.2 77.2 84.0            77.1 75.2 84.5 58.8  80.9 82.1 83.5 47.4
Asymmetric large margin loss     55.7 50.1 75.4 99.6            77.9 76.0 84.9 54.6  (entries lost)
Mixup [47]                       54.5 48.3 79.6 82.2            (entries lost)       (entries lost)
[Rows for the remaining methods were lost in extraction.]
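For reference, the overlap metrics reported throughout these tables follow their standard definitions (DSC = 2TP/(2TP+FP+FN), sensitivity = TP/(TP+FN), precision = TP/(TP+FP)). A small sketch for binary masks, assuming non-empty prediction and ground truth (HD is omitted for brevity):

```python
import numpy as np

def overlap_metrics(pred, gt):
    """pred, gt: boolean (or 0/1) arrays of the same shape. Returns DSC,
    sensitivity and precision in percent, following the standard
    definitions. Assumes pred and gt each contain at least one foreground
    element, so the denominators are non-zero."""
    pred = np.asarray(pred, bool)
    gt = np.asarray(gt, bool)
    tp = np.logical_and(pred, gt).sum()   # true positives
    fp = np.logical_and(pred, ~gt).sum()  # false positives
    fn = np.logical_and(~pred, gt).sum()  # false negatives
    dsc = 2 * tp / (2 * tp + fp + fn)
    sen = tp / (tp + fn)                  # sensitivity (recall)
    prc = tp / (tp + fp)                  # precision
    return 100 * dsc, 100 * sen, 100 * prc
```

For example, `overlap_metrics([1, 1, 0, 0], [1, 0, 1, 0])` has one true positive, one false positive and one false negative, giving 50.0 for all three metrics.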