Cutting-edge 3D Medical Image Segmentation Methods in 2020: Are Happy Families All Alike?
Jun Ma
Abstract—Segmentation is one of the most important and popular tasks in medical image analysis, and it plays a critical role in disease diagnosis, surgical planning, and prognosis evaluation. During the past five years, on the one hand, thousands of medical image segmentation methods have been proposed for various organs and lesions in different medical images, which makes it more and more challenging to fairly compare different methods. On the other hand, international segmentation challenges provide a transparent platform to fairly evaluate and compare different methods. In this paper, we present a comprehensive review of the top methods in ten 3D medical image segmentation challenges during 2020, covering a variety of tasks and datasets. We also identify the "happy-families" practices in the cutting-edge segmentation methods, which are useful for developing powerful segmentation approaches. Finally, we discuss open research problems that should be addressed in the future. We also maintain a list of cutting-edge segmentation methods at https://github.com/JunMa11/SOTA-MedSeg.
Index Terms—Image Segmentation, Deep Learning, U-Net, Convolutional Neural Networks, Survey, Review.
1 INTRODUCTION

Medical image segmentation aims to delineate the anatomical structures of interest, such as tumors, organs, and tissues, in a semi-automatic or fully automatic way, and it has many applications in clinical practice, such as radiomic analysis [1], treatment planning [2], and survival analysis [3]. Currently, medical image segmentation is also an active research topic. Figure 1 presents the word cloud of the paper titles at the 23rd International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2020), which is the largest international event in the medical image analysis community. It can be found that the term 'segmentation' has a very high frequency, and putting the top high-frequency words together forms a meaningful phrase: "image segmentation using deep learning/network(s)".

Fig. 1: Word cloud of the paper titles in MICCAI 2020.

Jun Ma was with the Department of Mathematics, School of Science, Nanjing University of Science and Technology, Nanjing, China, 210094. Manuscript finished on December 31, 2020. Comments are welcome.
1. https://miccai2020.org/en/
Since U-Net [4], the legendary medical image segmentation approach, appeared in 2015, numerous new segmentation methods have been proposed for various segmentation tasks [5], [6]. With so many segmentation papers at hand, it becomes extremely hard to compare them and to identify methodological progress, because the proposed methods are usually evaluated on different datasets with different data splits, metrics, and implementations.

Public segmentation challenges provide a standard platform for getting insights into the current cutting-edge approaches, where solutions are evaluated and compared against each other in a transparent and fair way. In MICCAI 2020, there were in total ten international 3D medical image segmentation challenges. All these challenges follow the Biomedical Image Analysis ChallengeS (BIAS) initiative [7]. Specifically, the challenge designs are transparent and standardized, and the proposals (http://miccai.org/events/challenges/) have also passed peer review. Table 1 provides an overview of the ten segmentation challenges, which can be roughly divided into
• five single-modality image segmentation tasks, including three CT image segmentation tasks and two MR image segmentation tasks;
• five multi-modality image segmentation tasks, including two bi-modality tasks, two triple-modality tasks, and one four-modality task.

In this paper, we first provide a comprehensive review of the ten 3D medical segmentation challenges and the associated top solutions. We also identify the "happy-families" elements in the top solutions. Finally, we highlight some problems and potential future directions for medical image segmentation.

TABLE 1: Task overview of ten 3D medical image segmentation challenges. 'Seg. Targets' denotes the segmentation targets in each task.
The main contributions of this paper are summarized as follows:
• We provide a comprehensive review of ten recent international 3D medical image segmentation challenges, including the task descriptions, the datasets, and, more importantly, the top solutions of the participating teams, which represent the cutting-edge segmentation methods at present.
• We identify the widely used "happy-families" components in the top methods, which are useful for developing powerful segmentation approaches.
• We summarize several unsolved problems and potential research directions, which could promote developments in the medical image segmentation field.
2 PRELIMINARIES: WIDELY USED METHODS IN DEEP LEARNING-BASED MEDICAL IMAGE SEGMENTATION

nnU-Net [8] (no new net) is a dynamic, fully automatic segmentation framework for medical images, which is based on the widely used U-Net architecture [4], [9]. It automatically configures the preprocessing, the network architecture, the training, the inference, and the post-processing for any new segmentation task. Without manual intervention, nnU-Net surpasses most existing approaches, achieving state-of-the-art results in 33 of 53 segmentation tasks and otherwise showing performance comparable to the top leaderboard entries. Currently, nnU-Net is the most popular backbone for 3D medical image segmentation tasks because it is powerful, flexible, out-of-the-box, and open-source.
Loss functions are used to guide the network to learn meaningful predictions and to dictate how the network is supposed to trade off mistakes. Cross entropy loss and Dice loss [10], [11] are the two most popular loss functions in segmentation tasks. Specifically, cross entropy aims to minimize the dissimilarity between two distributions and is defined by

$$L_{CE} = -\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{N} g_i^c \log s_i^c, \qquad (1)$$

where $g_i^c$ is the ground truth binary indicator of class label $c$ at voxel $i$, and $s_i^c$ is the corresponding predicted segmentation probability.

Dice loss can directly optimize the Dice Similarity Coefficient (DSC), which is the most commonly used segmentation evaluation metric. In general, there are two variants of the Dice loss. One employs squared terms in the denominator [10] and is defined by

$$L_{Dice\text{-}square} = 1 - \frac{2\sum_{c=1}^{C}\sum_{i=1}^{N} g_i^c s_i^c}{\sum_{c=1}^{C}\sum_{i=1}^{N} (g_i^c)^2 + \sum_{c=1}^{C}\sum_{i=1}^{N} (s_i^c)^2}. \qquad (2)$$

The other does not use squared terms in the denominator [11] and is defined by

$$L_{Dice} = 1 - \frac{2\sum_{c=1}^{C}\sum_{i=1}^{N} g_i^c s_i^c}{\sum_{c=1}^{C}\sum_{i=1}^{N} g_i^c + \sum_{c=1}^{C}\sum_{i=1}^{N} s_i^c}. \qquad (3)$$

The default loss function in nnU-Net is the unweighted sum $L_{CE} + L_{Dice}$.

The Dice Similarity Coefficient (DSC) and the Hausdorff Distance (HD) are two widely used segmentation metrics, which measure the region overlap ratio and the boundary distance, respectively. Let $G$ and $S$ be the ground truth and the segmentation result, respectively. DSC is defined by

$$DSC = \frac{2|G \cap S|}{|G| + |S|}. \qquad (4)$$

A similar metric, IoU (Jaccard), is sometimes used as an alternative, which is defined by

$$IoU = \frac{|G \cap S|}{|G \cup S|}. \qquad (5)$$

Let $\partial G$ and $\partial S$ be the boundary points of the ground truth and the segmentation, respectively. The Hausdorff Distance is defined by

$$HD(\partial G, \partial S) = \max(hd(\partial G, \partial S), hd(\partial S, \partial G)), \qquad (6)$$

where $hd(\partial G, \partial S) = \max_{x \in \partial G} \min_{y \in \partial S} \|x - y\|$ and $hd(\partial S, \partial G) = \max_{y \in \partial S} \min_{x \in \partial G} \|y - x\|$. To eliminate the impact of outliers, the 95% HD (HD95) is also widely used, which is based on the 95th percentile of the distances between the boundary points in $\partial G$ and $\partial S$.
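To make these definitions concrete, the following NumPy/SciPy sketch computes DSC, IoU, and HD95 for binary 3D masks. It is an illustrative implementation (boundary extraction via binary erosion, brute-force pairwise distances), not the official evaluation code of any challenge:

```python
import numpy as np
from scipy import ndimage

def dsc(g, s):
    """Dice Similarity Coefficient (Eq. 4) for binary masks."""
    g, s = g.astype(bool), s.astype(bool)
    inter = np.logical_and(g, s).sum()
    return 2.0 * inter / (g.sum() + s.sum())

def iou(g, s):
    """Intersection over Union / Jaccard (Eq. 5)."""
    g, s = g.astype(bool), s.astype(bool)
    inter = np.logical_and(g, s).sum()
    return inter / np.logical_or(g, s).sum()

def _boundary(mask):
    """Boundary voxels: the mask minus its one-voxel erosion."""
    mask = mask.astype(bool)
    return np.argwhere(mask & ~ndimage.binary_erosion(mask))

def hd95(g, s):
    """95th-percentile Hausdorff distance (cf. Eq. 6).
    Brute force, so only suitable for small boundary point sets."""
    bg, bs = _boundary(g), _boundary(s)
    d = np.linalg.norm(bg[:, None, :] - bs[None, :, :], axis=-1)
    return max(np.percentile(d.min(axis=1), 95),   # directed G -> S
               np.percentile(d.min(axis=0), 95))   # directed S -> G
```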
3 SINGLE-MODALITY IMAGE SEGMENTATION

The task in the CADA challenge (https://cada-as.grand-challenge.org/Overview/) is to segment aneurysms from 3D CT images. The organizers provide 92 cases for training and 23 cases for testing, where all cases contain cerebral aneurysms without vasospasm. The main difficulty in this challenge is the highly imbalanced labels. As shown in Figure 2 (the first row), the aneurysm is very small, and most of the voxels in the CT images belong to the background.

Six metrics are used to quantitatively evaluate the segmentation results: Jaccard (IoU), Hausdorff distance (HD), mean distance (MD), the Pearson correlation coefficient between the predicted and reference volumes of all aneurysms (Volume Pearson R), the mean absolute difference between the predicted and reference volumes (Volume Bias), and the standard deviation of the difference between the predicted and reference volumes (Volume Std). For the ranking, a maximum-minimum normalization is performed across all participants. In this way, each individual metric takes a value between 0 (worst case among all participants) and 1 (perfect fit between the reference and the predicted segmentation). The ranking score is calculated as the average of the normalized metrics; a sketch is given below.
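To make the ranking scheme concrete, below is a minimal NumPy sketch of one plausible reading of the max-min normalization; the team count, metric values, and the handling of error-type metrics are illustrative assumptions:

```python
import numpy as np

def rank_scores(metric_table, higher_is_better):
    """metric_table: (num_teams, num_metrics) raw metric values.
    Returns one ranking score per team: the mean of per-metric
    max-min normalized values, each mapped to [0, 1]."""
    m = np.asarray(metric_table, dtype=float)
    # flip error-type metrics (e.g., HD) so larger is better everywhere
    m[:, ~np.asarray(higher_is_better)] *= -1
    lo, hi = m.min(axis=0), m.max(axis=0)
    normalized = (m - lo) / (hi - lo)  # 0 = worst team, 1 = best team
    return normalized.mean(axis=1)

# two hypothetical teams evaluated on (IoU, HD):
# IoU is higher-is-better, HD is lower-is-better
scores = rank_scores([[0.76, 3.1], [0.75, 2.8]],
                     higher_is_better=[True, False])
```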
TABLE 2: Quantitative results of the top-2 teams on the CADA Challenge Leaderboard. The bold numbers denote the best results; '-' denotes not available.

Metrics       Mediclouds (Rank 1st)   junma [12] (Rank 2nd)
IoU           -                       0.758
HD            -                       -
Vol. Bias     -                       -
Final Score   0.833                   -
Table 2 shows the quantitative segmentation results of the top-2 teams on the challenge leaderboard. The team 'junma' achieved the best IoU, while the team 'Mediclouds' achieved better performance in the remaining five metrics. However, the final score difference is marginal. The method of the team 'Mediclouds', unfortunately, is not available. Thus, we only present the solution of the team 'junma'. Specifically, Ma and Nie [12] developed their method based on nnU-Net [8], where the main modification was to use a large patch size during training and inference. Five models were trained with five-fold cross-validation, and each model was trained on a V100 (32G) GPU. Each testing case was predicted by the ensemble of the five trained models.
4. Myocardium, left and right ventricle.
5. Whole tumor, enhancing tumor, and tumor core.
6. https://cada-as.grand-challenge.org/FinalRanking/
The task in the ASOCA challenge (https://asoca.grand-challenge.org/Home/) is to segment the coronary arteries from Cardiac Computed Tomography Angiography (CCTA) images. The organizers provide 40 cases for training and 20 cases for testing. The main difficulties in this challenge are the imbalance problem and the appearance variations. On the one hand, the coronary arteries only occupy a small proportion of the whole CT image. On the other hand, the arteries from healthy and unhealthy cases have different appearances. Figure 2 (the second row) presents a visualized example. DSC and HD95 are used to evaluate and rank the segmentation results.

TABLE 3: Quantitative results of the top-2 teams on the ASOCA Challenge Leaderboard. The bold numbers denote the best results; '-' denotes not available.
Team Name    DSC   HD95   Final Rank
RuochenGao   -     -      1
SenYang      -     -      2

Table 3 shows the quantitative segmentation results of the top-2 teams on the challenge leaderboard during MICCAI 2020. The 1st-place team had a better DSC while the 2nd-place team obtained a better HD95, indicating that the top-2 teams achieved better region overlap and better boundary distance, respectively.

The team 'RuochenGao' used nnU-Net [8] as the backbone. The whole pipeline includes three independent networks for three tasks: epicardium segmentation, artery segmentation, and scale map regression [13]. The final segmentation results were generated by the ensemble of the artery segmentation results and the scale map regression results, followed by removing the vessels outside the epicardium. The team 'SenYang' proposed an improved 2D U-Net with selective kernels (SK-UNet), where the regular convolution blocks were replaced by SE-Res modules in the encoder. Moreover, SK modules [14], including different convolution filters and kernel sizes, were used in the decoder to leverage multi-scale information.

The segmentation task in the VerSe challenge (https://verse2020.grand-challenge.org/) is to segment the vertebrae from CT images. The organizers provide 100 cases for training, 100 cases for public testing (the participants can access the testing cases), and 100 cases for hidden testing (this testing set is not publicly available, and participants are required to submit their solutions as Docker containers) [15], [16]. The annotations consist of 28 different vertebrae, but each case may only contain a subset of them. There are several difficulties in this challenge: highly varying fields-of-view (FoV) across cases, large scan sizes, highly correlated shapes of adjacent vertebrae, scan noise, the presence of vertebral fractures, metal implants, and so on [17]. Figure 2 (the third row) presents a visualized example.
7. https://asoca.grand-challenge.org/MICCAI Ranking/
DSC and HD are used to evaluate and rank the segmentation results.

Fig. 2: Visualized examples in three CT segmentation tasks. The ground truth of the original image (a) in each task is shown in 2D projected onto the raw data (b) and in 3D together with a volume rendering of the raw data (c).

Payer et al., the defending champions of VerSe 2019 [17], succeeded in winning this year's challenge again with the SpatialConfiguration-Net [18] and the U-Net [4], [9]. Specifically, they proposed a coarse-to-fine approach with three stages:
• Stage 1: localizing the whole spine with a 3D U-Net-based heatmap regression network, which removes the background (a heatmap target sketch is given after Table 4).
• Stage 2: localizing and identifying all vertebra landmarks simultaneously via a 3D SpatialConfiguration-Net, which combines local appearance with the spatial configuration of landmarks. To address missed vertebrae, an MRF-based graphical model was employed to refine the localization results.
• Stage 3: segmenting each vertebra individually with a 3D U-Net.
Table 4 presents the quantitative segmentation results on the public testing set. The absent results will be added when the challenge summary paper is released.

TABLE 4: Quantitative vertebrae segmentation results of the winner solution in VerSe 2020. '-' denotes not available currently.

Testing set   DSC      HD95
Public        0.9354   -
Hidden        -        -
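Stages 1 and 2 rely on heatmap regression. As a reference, a common way to construct the Gaussian heatmap target for a single landmark is sketched below (NumPy; the sigma value is an illustrative choice, not the value used by Payer et al.):

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma=4.0):
    """Target volume for heatmap regression: a Gaussian blob centred on
    the landmark; the network is trained to regress this map (e.g., with
    an L2 loss), and the landmark is recovered as the argmax at test time."""
    grids = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    sq_dist = sum((g - c) ** 2 for g, c in zip(grids, center))
    return np.exp(-sq_dist / (2 * sigma ** 2))

# hypothetical usage: one heatmap channel per vertebra landmark
# target = gaussian_heatmap((96, 96, 96), center=(48, 40, 52))
```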
Fig. 3: Visualized examples in two MR segmentation tasks. The ground truth of the original image (a) in each task is shown in 2D projected onto the raw data (b) and in 3D rendering (c).

The task in the M&Ms challenge is to segment the left ventricle (LV), the myocardium (MYO), and the right ventricle (RV) from multi-centre, multi-vendor, and multi-disease cardiac MR images. The testing set was not released to participants during the challenge. Participants are required to build a Singularity image and share it with the organizers. Figure 3 (the first row) presents a visualized example. Four metrics are used to evaluate and rank the segmentation results: DSC, IoU, Average Symmetric Surface Distance (ASSD), and HD.

TABLE 5: Quantitative segmentation results (in terms of DSC and HD) of the top-3 teams on the M&Ms Challenge Leaderboard. 'ED' and 'ES' denote the end-diastolic and end-systolic phase cardiac MR images. The bold numbers are the best results, and the italic numbers are not significantly different from the best results (p > 0.05 in a T-test). '-' denotes not available.

Metrics           Peter M. Full [19]   Yao Zhang [20]   Jun Ma [21]
                  Rank 1st             Rank 2nd         Rank 3rd
ED  LV   DSC/HD   -/-                  -/-              -/-
ED  MYO  DSC/HD   -/-                  -/-              -/-
ED  RV   DSC/HD   -/-                  -/-              -/-
ES  LV   DSC/HD   -/-                  -/-              -/-
ES  MYO  DSC/HD   -/-                  -/-              -/-
ES  RV   DSC/HD   -/-                  -/-              -/-

The top-3 teams developed their methods based on nnU-Net [8]. Specifically, Full et al. [19], the 1st-place team, handled the domain shift problem with an ensemble of five 2D and five 3D nnU-Net models that were trained with batch normalization and extensive data augmentation, such as random rotation, flipping, gamma correction, multiplicative/additive brightness, and so on. Zhang et al. [20], the 2nd-place team, used label propagation to leverage unlabelled cases and exploited style transfer to reduce the variance among different centers and vendors; the final solution was a single model without postprocessing or ensembling. Ma [21], the 3rd-place team, addressed the domain shift problem by enlarging the training set with histogram matching, where new training cases were generated by transferring the intensity distributions of 25 unlabelled cases to the existing labelled cases (a sketch is given at the end of this section); the final solution was an ensemble of five 3D nnU-Net models. Table 5 presents the quantitative segmentation results of the top-3 teams. It can be found that the differences among them were marginal and not statistically significant, indicating that all (three) roads lead to Rome.

The task in the EMIDEC challenge (http://emidec.com/) is to segment the myocardium, the infarction, and the no-reflow areas from delayed-enhancement cardiac MR images. The organizers provide 100 cases for training and 50 cases for testing [22]. The main difficulties in this challenge are the low contrast, the varied short-axis orientations, the heterogeneous appearances of the myocardial pathology areas, and the unbalanced distribution between normal and pathological cases. Figure 3 (the second row) presents a visualized example. The evaluation and ranking metrics include:
• clinical metrics: the average errors for the volume of the myocardium (in mm³), and the volume (in mm³) and the percentage of the infarction and no-reflow areas;
• geometrical metrics: the average DSC for the different areas and the Hausdorff distance (in 3D) for the myocardium.

Table 6 presents the quantitative segmentation results of the top-3 teams on the final leaderboard. Both Zhang and Ma, the top-2 teams, used a two-stage cascaded framework and developed their methods based on nnU-Net [8]. Specifically, Zhang [23] first used a 2D nnU-Net, focusing on intra-slice information, to obtain a preliminary segmentation, and then a 3D nnU-Net, focusing on volumetric spatial information, was employed to refine the segmentation results.
The 3D nnU-Net took the combination of the preliminary segmentation and the original image as the input. Finally, the scattered voxels in the segmentation results were removed in a postprocessing step. Ma [24] used the 2D nnU-Net in both stages. First, a 2D U-Net was used to segment the whole heart, including the left ventricle and the myocardium. Then, the whole heart was cropped as a region of interest (ROI). Finally, a new 2D U-Net was trained to segment the infarction and no-reflow areas in the ROI. The final model was an ensemble of five 2D nnU-Net models in each stage. Feng et al. used a dilated 2D U-Net [25] with rotation-based augmentation, which aims to overcome the varied short-axis orientations.

TABLE 6: Quantitative results of the top-3 teams on the EMIDEC Challenge Leaderboard.

Targets      Metrics            Zhang [23]   Ma [24]    Feng et al.
                                Rank 1st     Rank 2nd   Rank 3rd
Myocardium   DSC                0.8786       0.8628     0.8356
             Vol. Diff. (mm³)   9258         10153      15187
             HD                 13.01        14.31      33.77
Infarction   DSC                0.7124       0.6224     0.5468
             Vol. Diff. (mm³)   3118         4874       3971
             Vol. Diff. Ratio   2.38%        3.50%      2.89%
No-reflow    DSC                0.7851       0.7776     0.7222
             Vol. Diff. (mm³)   634.7        829.7      883.4
             Vol. Diff. Ratio   0.38%        0.49%      0.53%
The methods of Zhang [23] and Ma [24] obtained comparable results for the myocardium and no-reflow areas, but Zhang achieved significantly better results for the infarction, with a DSC 9 and 17 points higher than the methods of Ma [24] and Feng et al., respectively. The major methodological difference is that Zhang used a 3D network in the second stage while Ma and Feng et al. used 2D networks. Thus, one possible reason might be that the 3D network can exploit more image contextual information than the 2D networks, leading to better performance.
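As a side note, the histogram matching augmentation used by the 3rd-place M&Ms solution [21] can be sketched in a few lines with scikit-image; this is an illustrative version under our own assumptions, not the authors' exact code:

```python
import numpy as np
from skimage.exposure import match_histograms

def histogram_matching_augmentation(labeled_image, unlabeled_image):
    """Generate a new training image by transferring the intensity
    distribution of an unlabeled case onto a labeled case. The original
    label map remains valid because the voxel geometry is unchanged."""
    return match_histograms(labeled_image.astype(np.float32),
                            unlabeled_image.astype(np.float32))

# hypothetical usage: pair the matched image with the original label
# volume to enlarge the training set
# new_img = histogram_matching_augmentation(img_center_A, img_center_B)
```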
8. http://emidec.com/leaderboard
9. http://emidec.com/downloads/papers/paper-24.pdf
4 MULTI-MODALITY 3D IMAGE SEGMENTATION
The task in the ADAM challenge (http://adam.isi.uu.nl/) is to segment aneurysms from TOF-MRA and structural MR images. The organizers provide 113 cases for training and 141 cases for testing. Among the 113 training cases, 93 cases contain at least one untreated, unruptured intracranial aneurysm and 20 cases do not have intracranial aneurysms. Among the 141 testing cases, 117 cases contain at least one untreated, unruptured intracranial aneurysm and 26 cases do not. Each case has two folders:
• 'orig' folder: containing the original TOF-MRA images and structural images (T1, T2, or FLAIR). The structural image was aligned to the TOF image with elastix.
• 'pre' folder: all images were preprocessed with 'n4biasfieldcorrection' to correct bias field inhomogeneities.
The main difficulty in this challenge is the extreme imbalance. Specifically, the median aneurysm size is only 238 voxels, which is several orders of magnitude smaller than the median image size, leading to an extremely imbalanced foreground-background ratio. Figure 4 presents visualized examples. Participants are allowed to use any of the provided images to develop their methods. The testing set is hidden by the organizers, and participants should submit their methods as Docker containers.

Fig. 4: Visualized examples in the ADAM Challenge. The ground truth (b) of the intracranial aneurysm is shown in 2D projected onto the TOF-MRA and the structural MR image (a) and in 3D together with a volume rendering of the raw data (c). The red arrows point to the intracranial aneurysm.

Table 7 presents the quantitative results of the top-2 teams on the ADAM Challenge Leaderboard during MICCAI 2020. Both teams developed their methods based on nnU-Net [8]. Specifically, to alleviate the imbalance problem, the team 'junma' trained two groups of five-fold nnU-Net models with Dice + cross entropy loss and Dice + TopK loss, respectively [26]. Only the preprocessed TOF-MRA images were used
during training. The final model was the ensemble of the five best models from cross-validation. To speed up the inference, the default test-time augmentation (TTA) in nnU-Net was disabled during testing. The team 'joker' modified the default nnU-Net by introducing residual blocks in the encoder and replacing the instance normalization with group normalization. The loss function was Dice + TopK loss. The final model was the ensemble of four models with different modalities and output classes.

10. https://elastix.lumc.nl/
11. http://stnava.github.io/ANTs/
12. http://adam.isi.uu.nl/results/results-miccai-2020/

TABLE 7: Quantitative results of the top-2 teams on the ADAM Challenge Leaderboard. The bold numbers are the best results; '-' denotes not available.
Team    DSC   HD95   Volumetric Similarity   Rank
junma   -     -      -                       1
joker   -     -      -                       2
As shown in Table 7, the team 'junma' achieved the best DSC and Volumetric Similarity, and the team 'joker' achieved the best HD95. However, it should be noted that the differences between them are marginal.

The task in the HECKTOR challenge is to segment head and neck tumors from PET and CT images [27], [28]. The provided images are cropped with bounding boxes (144 × 144 × 144 mm³) locating the oropharynx region. The main difficulties are the multi-modality fusion, the imbalance problem, and the unseen testing cases from a new medical center. Figure 5 presents visualized examples.

Fig. 5: Visualized examples in the HECKTOR Challenge. The ground truth (b) of the head and neck tumor is shown in 2D projected onto the CT and the PET image (a) and in 3D together with a volume rendering of the raw data (c).

Table 8 presents the quantitative segmentation results of the top-2 teams on the HECKTOR Challenge Leaderboard during MICCAI 2020. Both teams developed their methods based on the two-channel 3D U-Net [9]. Specifically, the team 'andrei.iantsen' replaced the batch normalization with squeeze-and-excitation normalization and introduced residual blocks in the encoder. The loss function was the unweighted sum of the Dice loss and the focal loss [29]. Four models with leave-one-center-out splits and four additional models with random data splits were trained for 800 epochs using the Adam optimizer [30] on two NVIDIA 1080 Ti GPUs with a batch size of 2. The final model was an ensemble of the eight models. The team 'junma' [31] first trained five 3D nnU-Net models [8] with Dice + TopK loss in five-fold cross-validation. Then, a segmentation quality score was defined via model ensembling, which was used to select the cases with high uncertainties. Finally, the high-uncertainty cases were refined by a hybrid active contour model with iterative convolution-thresholding methods [32], [33], [34]. Both teams concatenated the PET and CT images as input, and both used model ensembles to predict the testing set. The Dice loss was also incorporated in the loss functions of both teams.

TABLE 8: Quantitative results of the top-2 teams on the HECKTOR Challenge Leaderboard. The bold numbers are the best results; '-' denotes not available.

Team             DSC   Precision   Recall   Rank
andrei.iantsen   -     -           -        1
junma            -     -           -        2
The final rank was based on the DSC scores on the testing set. The 1st-place team 'andrei.iantsen' obtained a better DSC and Recall, while the 2nd-place team 'junma' obtained a better Precision. However, the differences between the two teams are marginal, especially for the DSC and Precision. Moreover, both teams achieved significantly higher Precision than Recall, indicating that most of the segmented voxels were real tumor voxels but many tumor voxels were missed by the model.
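For reference, the voxel-wise precision and recall behind this observation are computed from the overlap counts; a minimal NumPy sketch:

```python
import numpy as np

def precision_recall(gt, seg):
    """Voxel-wise precision and recall for binary masks.
    High precision with low recall means most predicted voxels are
    correct but many true tumor voxels are missed (under-segmentation)."""
    gt, seg = gt.astype(bool), seg.astype(bool)
    tp = np.logical_and(gt, seg).sum()      # true-positive voxels
    precision = tp / max(seg.sum(), 1)      # fraction of prediction that is correct
    recall = tp / max(gt.sum(), 1)          # fraction of ground truth recovered
    return precision, recall
```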
The task in the MyoPS challenge is to segment the myocardial edema and scar from multi-sequence cardiac MR images [35], [36]. Figure 6 presents a visualized example.

Fig. 6: Visualized examples in the MyoPS Challenge. The ground truth is shown in 2D projected onto the multi-sequence MR images (b) and in 3D rendering (c).

The winner team 'Zhai & Gu et al.' [37] used a coarse-to-fine framework in which 2D and 2.5D U-Nets were ensembled to obtain the segmentation results. Cross-validation results showed that the 2D U-Net achieved better performance for the edema while the 2.5D U-Net achieved better performance for the scar. To obtain better performance, a weighted method was used for the final ensemble. Specifically, the weight of the edema prediction channel was 0.8 in the 2D U-Net and the weight of the scar prediction channel was 0.8 in the 2.5D U-Net, while the weights of the other prediction channels were 0.5.

TABLE 9: Quantitative results of the winner team 'Zhai & Gu et al.' [37] in the MyoPS challenge. '-' denotes not available.

Target   DSC     Rank
Scar     0.672   1
Edema    -       1
Table 9 presents the quantitative segmentation results of the winner team on the testing set. Zhai & Gu et al. achieved an average DSC of 0.672 for the scar.

The ABCs challenge (https://abcs.mgh.harvard.edu/) included two brain structure segmentation tasks:
• Task 1: segmenting five brain structures, including the falx cerebri, the tentorium cerebelli, the sagittal and transverse brain sinuses, the cerebellum, and the ventricles, which can be used for the automated definition of the clinical target volume (CTV) for radiotherapy treatment.
• Task 2: segmenting ten structures, including the brainstem, the left and right eyes, the left and right optic nerves, the optic chiasm, the lacrimal glands, and the cochleas, which can be used in radiotherapy treatment plan optimization.
The organizers provide 45 cases for training, 15 cases for validation, and 15 cases for testing. Each case consists of one CT image acquired for treatment planning and two post-operative brain MRI images (contrast-enhanced T1-weighted and T2-weighted FLAIR). The CT and MR images were obtained from two different CT scanners and seven different MRI scanners, respectively. The multi-modality images were co-registered and re-sampled to the same resolution and size. The main difficulties are the multi-modality fusion, the imbalanced labels, and the multi-vendor cases. Figure 7 presents visualized examples of the two tasks. Participants are required to submit their segmentation results within 48 hours after downloading the testing set. DSC and Surface DSC (SDSC) at a tolerance of 2 mm are used to evaluate and rank the segmentation results.

Fig. 7: Visualized examples in the ABCs Challenge. The ground truth (b) is shown in 2D projected onto the multi-modality images (a) and in 3D together with a volume rendering of the raw data (c).

Table 10 presents the average DSC and SDSC of the testing set segmentation results of the top-2 teams on the Challenge Leaderboard. Both teams developed their methods based on nnU-Net [8].
15. https://github.com/deepmind/surface-distance
16. https://abcs.mgh.harvard.edu/index.php/leader-board
TABLE 10: Quantitative results of the top-2 teams on the ABCs Challenge Leaderboard. The bold numbers are the best results; '-' denotes not available.
Team          Task 1 (DSC / SDSC)   Task 2 (DSC / SDSC)   Rank
Jarvis [38]   - / -                 - / -                 1
HILab         - / -                 - / -                 2

Specifically, the team 'Jarvis' [38] used a ResU-Net, where residual blocks were introduced in the U-Net encoder. The training process had three main features:
• The training cases '007' and '054' in Task 2 had annotation issues. Thus, the default annotations were replaced with pseudo labels generated by cross-validation.
• Flipping along the x-axis was dropped from the default data augmentation settings in nnU-Net, because the segmentation targets in Task 2 are sensitive to the left and right directions.
• In addition to the default Dice + CE loss in nnU-Net, the Tversky loss [26], [39] was also used to train the ResU-Net.
The final model was an ensemble of the default nnU-Net, the ResU-Net with Dice + CE loss, and the ResU-Net with Tversky loss. The team 'HILab' used a coarse-to-fine framework with nnU-Net [8] for both tasks. Specifically,
• in Task 1, an unconverged model was trained to obtain the coarse segmentations with little overfitting. Then, each organ was cropped from the original images and refined by an independent network. The refined organs were combined as the final segmentation results;
• in Task 2, a coarse model was first trained to segment all organs. Then, each organ was also cropped from the original images and refined by an independent network. The training process differed from Task 1 in that a new data augmentation technique, flipping each organ to the other side, was introduced to enlarge the training set. The final segmentation results were also the combination of the refined organs.
Both teams fused the three modalities by concatenating them as the network input. Model ensembles were also used by both teams, but the ensemble strategies were different. In particular, the team 'Jarvis' used an ensemble of multiple multi-organ segmentation networks, while the team 'HILab' used an ensemble of one multi-organ and multiple single-organ segmentation networks.

The task in the BraTS 2020 challenge is to segment the brain tumor sub-regions (the whole tumor, the tumor core, and the enhancing tumor) from four-modality MR images [40], [41], [42], [43], [44]. The winner team 'MIC-DKFZ', led by Isensee et al. [45], developed its method based on nnU-Net [8] with several BraTS-specific modifications:
• Region-based training: replacing the softmax layer with a sigmoid layer and changing the optimization target to the three tumor sub-regions. The default cross entropy loss term was also replaced with a binary cross entropy where each of the regions was optimized independently;
• Postprocessing: removing the enhancing tumor entirely if the predicted volume was less than a given threshold. The threshold was optimized on training set cross-validation twice, once by maximizing the mean Dice score and once by minimizing the BraTS-like ranking score;
• Increased batch size: increasing the batch size from 2 to 5;
• Extensive data augmentation: using more aggressive augmentations, such as increasing the probability of applying rotation and scaling, the scale range, elastic deformation, and so on;
• Batch normalization: replacing the default instance normalization with batch normalization;
• Batch Dice: computing the Dice loss over all samples in the batch;
• BraTS ranking-based model selection: selecting the best model with the BraTS-like 'rank then aggregate' ranking scheme.
The final model was an ensemble of 25 cross-validation models, including three groups of top-performing models.

Two tied teams ranked second. Specifically, the team 'NPU_PITT', led by Jia et al. [46], proposed a Hybrid High-resolution and Non-local Feature Network (H2NF-Net). The four modalities were concatenated as a four-channel input and processed at five different scales in the network. The edema and the enhancing tumor were segmented by the single H2NF-Net, while the tumor core was segmented by the cascaded H2NF-Net.
The team 'Radicals', led by Wang et al. [47], proposed an end-to-end modality-pairing learning method with paralleled branches and more layer connections to explore the latent relationship among different modalities. Moreover, a consistency loss was introduced to minimize the prediction variance between the branches. The final model was an ensemble of three modality-pairing models and three vanilla nnU-Net [8] models.

TABLE 11: Quantitative results of the top-3 teams on the BraTS 2020 Challenge Leaderboard. The bold numbers are the best results; '-' denotes not available.
Team                  Target            DSC     HD95
MIC-DKFZ              Enhancing Tumor   0.820   -
Isensee et al. [45]   Whole Tumor       -       -
Rank 1st              Tumor Core        -       -
NPU_PITT              Enhancing Tumor   -       -
Jia et al. [46]       Whole Tumor       0.888   -
Rank 2nd (tie)        Tumor Core        -       -
Radicals              Enhancing Tumor   -       -
Wang et al. [47]      Whole Tumor       -       -
Rank 2nd (tie)        Tumor Core        -       -

Table 11 presents the quantitative segmentation results of the top-3 teams on the testing set. Overall, the performance differences are marginal. The team 'MIC-DKFZ' achieved the best HD95 for the tumor core, and the team 'Radicals' achieved the best DSC for the whole tumor. The team 'NPU_PITT' achieved the best performance in the remaining four metrics.

5 DISCUSSION
As the Anna Karenina principle goes, "all happy families are alike", and there are indeed some common components shared by the top methods.

nnU-Net [8] backbone

All the top methods used U-Net-like architectures [4], [9] in the ten 3D segmentation challenges. Remarkably, nnU-Net was used by the top teams in nine out of ten challenges, because it is open-source, powerful, flexible, and out-of-the-box. Participants can easily integrate their new methods into nnU-Net.
Dice-related loss functions
The loss function is one of the most important elements in deep learning-based segmentation methods. nnU-Net uses Dice + cross entropy as the default loss function. For extremely imbalanced segmentation tasks, modifying the loss function can yield better performance. For example, the winner in the HECKTOR challenge used Dice + focal loss, and both the winner and the runner-up in the ADAM challenge used Dice + TopK loss. For a more detailed analysis of segmentation loss functions, please refer to [26]. A minimal sketch of such a compound loss is given below.
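As an illustration, the following PyTorch sketch combines a soft Dice term (Eq. 3) with a TopK cross entropy term for binary segmentation; the choice k = 10% and the other details are our own assumptions rather than any team's exact implementation:

```python
import torch
import torch.nn.functional as F

def dice_topk_loss(logits, target, k=0.1, eps=1e-6):
    """Compound loss: soft Dice (Eq. 3) + TopK cross entropy.
    logits: (B, 1, D, H, W) raw network outputs; target: same shape, {0, 1}."""
    prob = torch.sigmoid(logits)
    # soft Dice term computed over the whole batch
    inter = (prob * target).sum()
    dice = 1 - 2 * inter / (prob.sum() + target.sum() + eps)
    # TopK term: average only the k% largest per-voxel cross entropy values,
    # focusing the gradient on the hardest (often foreground) voxels
    ce = F.binary_cross_entropy_with_logits(logits, target.float(),
                                            reduction="none")
    num = max(1, int(k * ce.numel()))
    topk_ce = ce.flatten().topk(num).values.mean()
    return dice + topk_ce
```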
Cascaded/coarse-to-fine framework
Cropping the region of interest (ROI) can eliminate unrelated background tissues and reduce the computational burden. Thus, one can first train a model to obtain a coarse segmentation and then crop the ROI. After that, a new model is trained on the ROI image (optionally concatenated with the coarse segmentation) to refine the segmentation results. This strategy is quite effective for myocardial pathology and small organ segmentation tasks; it was used by the winners in the EMIDEC and MyoPS challenges and by the runner-up in the ABCs challenge. A sketch of the two-stage inference pattern follows.
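A minimal sketch of this two-stage inference pattern, assuming two already-trained models (coarse_net, fine_net) and an illustrative margin of 16 voxels around the detected ROI:

```python
import numpy as np

def coarse_to_fine_predict(image, coarse_net, fine_net, margin=16):
    """Stage 1: coarse segmentation on the full volume; stage 2: refine
    inside a bounding box cropped around the coarse foreground.
    Assumes the coarse stage finds at least one foreground voxel."""
    coarse = coarse_net(image)                       # binary coarse mask
    zyx = np.argwhere(coarse > 0)
    lo = np.maximum(zyx.min(axis=0) - margin, 0)
    hi = np.minimum(zyx.max(axis=0) + margin, np.array(image.shape))
    roi = image[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    refined_roi = fine_net(roi)                      # refined segmentation of the ROI
    refined = np.zeros_like(coarse)
    refined[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = refined_roi
    return refined
```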
Model Ensembles
Ensembling is an effective way to fuse multiple single models. All the top teams used more than one model in their final solutions. The models were usually trained with different data splits, data augmentation techniques, networks, or loss functions, and then combined by averaging the predictions, by majority voting, or within cascaded frameworks, as sketched below for the common probability-averaging variant.
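A sketch of the probability-averaging variant (the model callables and shapes are assumptions); the weighted ensemble used by the MyoPS winner corresponds to replacing the uniform weights with per-model or per-channel weights:

```python
import numpy as np

def ensemble_predict(image, models, weights=None):
    """Average per-class probability maps from several models, then take
    the argmax. models: callables returning (C, D, H, W) probabilities."""
    probs = np.stack([m(image) for m in models])       # (M, C, D, H, W)
    if weights is None:
        weights = np.ones(len(models)) / len(models)   # uniform ensemble
    fused = np.tensordot(weights, probs, axes=1)       # weighted mean over models
    return fused.argmax(axis=0)                        # final label map
```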
Concatenated input fusion in multi-modality segmentation tasks
How to fuse multiple different images is a key question in multi-modality segmentation tasks. Common deep learning-based image fusion strategies include input-level fusion, feature-level fusion, and output-level fusion. In the five multi-modality segmentation challenges, four out of five winner teams used input-level fusion, which directly concatenates the multiple images as network inputs. The winner team in the ADAM challenge only used one modality, but the runner-up, which achieved similar performance, also used the concatenation strategy to fuse the different modalities, as sketched below.
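Input-level fusion amounts to stacking the co-registered modalities along the channel axis before the first convolution; a PyTorch sketch with illustrative tensor shapes:

```python
import torch

# co-registered PET and CT volumes, each of shape (B, 1, D, H, W)
pet = torch.randn(2, 1, 64, 128, 128)
ct = torch.randn(2, 1, 64, 128, 128)

# input-level fusion: concatenate along the channel dimension, so the
# first convolution of the network sees both modalities jointly
x = torch.cat([pet, ct], dim=1)  # (B, 2, D, H, W), matching a two-channel 3D U-Net
```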
Based on the summary of the ten segmentation challenges, it can be found that deep learning has achieved unprecedented, or even human-level, performance on many medical image segmentation tasks, but several problems remain. In the following, we introduce some of the problems, and also opportunities, that can promote the further development of medical image segmentation methods.
Standardized method reports
Many challenge organizers require the participants to submit a short paper to describe their methods. However, these papers are usually structured in their own ways, and some necessary details may be missing. Currently, challenge quality has been greatly improved by the Biomedical Image Analysis ChallengeS (BIAS) initiative [7], where a checklist is used to standardize the review process and to raise the interpretability and reproducibility of challenge results. Thus, there is also a high demand for quality control of challenge method reports. The winner team in the MICCAI Hackathon Challenge provided an initial attempt (https://github.com/JunMa11/MICCAI-Reproducibility-Checklist) at dealing with method reproducibility via a checklist, but more efforts are required to make this checklist more complete and acceptable to our community.

Publicly available baseline models

nnU-Net has been proven to be a strong baseline. When starting with a new 3D segmentation challenge, most of the participants will train nnU-Net baseline models, which usually costs 72-120 GPU hours (depending on the computational facilities). This can be a great waste of energy and time, because participants are repeatedly doing the same thing. There is a strong demand for publicly available (trained) baseline models when a new segmentation challenge is launched. In this way, participants can pay more attention to developing new methods without spending energy and time on training the baseline models.
Fast and memory efficient models
There is no doubt that accuracy (e.g., DSC, HD) is an important factor for segmentation methods. However, the running time and the GPU memory requirements of segmentation methods are also critical when deploying trained models in clinical practice. Currently, most of the top methods use model ensembles, which can be time-consuming and require substantial computing resources. In order to make deep learning-based medical image segmentation clinically applicable, it is necessary to pay more attention to the models' running efficiency, which is easy to measure, as sketched below.
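As a starting point, the inference latency and the peak GPU memory of a trained PyTorch model can be measured with a few lines (the input shape is a placeholder):

```python
import time
import torch

def profile_inference(model, shape=(1, 1, 128, 192, 192), device="cuda"):
    """Report the latency and peak GPU memory of one forward pass."""
    model = model.to(device).eval()
    x = torch.randn(*shape, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        torch.cuda.synchronize(device)      # exclude pending kernels
        start = time.perf_counter()
        model(x)
        torch.cuda.synchronize(device)      # wait for the forward pass to finish
    latency = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return latency, peak_gb
```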
Theoretical foundations of segmentation models
Current theoretical studies of deep learning usually rely on strong assumptions [48], [49], [50], such as smoothness, infinite width, and so on. However, when it comes to real practice, many open problems remain unsolved. For example, what is the theoretical principle for designing segmentation network architectures? Is there a generalization gap, and how should we estimate it? What does the loss function landscape look like? Does the training process converge to a good solution, and how fast? How much data do we need when starting with a new segmentation task?
Diverse datasets and generalizable segmentation models
Collecting diverse datasets is critical for developing generalizable segmentation models, because clinical practice requires that trained models can be applied to many (unseen) medical centers. According to the challenge results (e.g., M&Ms, BraTS, HECKTOR), we find that segmentation performance drops significantly when the testing sets include unseen cases from new medical centers. Thus, it is important to use diverse datasets to evaluate models' generalization ability when organizing segmentation challenges. Currently, building generalizable models that can be applied consistently across medical centers, diseases, and scanner vendors is still an unsolved and challenging problem.
Extremely imbalanced target segmentation
Imbalanced segmentation has been a long-standing problem in medical image segmentation, especially when the size of the target foreground region is several orders of magnitude smaller than the background. Recent studies have made some progress [10], [51], [52]; however, extremely imbalanced segmentation remains very difficult. For example, in the ADAM challenge, the median target size is 238 voxels, which occupies only a tiny fraction of the median image size. The winning method achieved a DSC score of 0.41, which leaves large room for further improvement.

There are also some important topics that are not covered in this paper. For example, a summary of 2D medical image segmentation challenges [53], [54], [55] has not been included, because we only found three international 2D image segmentation challenges in 2020: thyroid nodule segmentation in ultrasound images (https://tn-scui2020.grand-challenge.org/Home/), optic disc and cup segmentation in fundus images (https://refuge.grand-challenge.org/Home2020/), and cataract segmentation in surgical videos (https://cataracts-semantic-segmentation2020.grand-challenge.org/Home/). The findings would thus be biased by the limited number of challenge samples, and we will provide a similar summary for 2D medical image segmentation methods when there are more international challenges. Moreover, this paper only covers cutting-edge fully supervised segmentation methods, while semi-supervised learning [56], [57], [58], weakly supervised learning [59], [60], and continual learning [61], [62], [63] based segmentation methods are not discussed. This is because, currently, there are few benchmarks or challenges [64], [65] for these topics in the medical image segmentation field.
6 CONCLUSION

Challenges provide an open and fair platform for various research groups to test and validate their segmentation methods on common datasets acquired from the clinical environment. In this paper, we have summarized ten 3D medical image segmentation challenges and the corresponding top methods. In addition, we have identified the widely shared "happy-families" elements in the top methods and pointed out potential future research directions in medical image segmentation. Moreover, we maintain a public GitHub repository (https://github.com/JunMa11/SOTA-MedSeg) to collect cutting-edge segmentation methods from various international segmentation challenges. We expect that this review of cutting-edge 3D image segmentation methods will be beneficial to both early-stage and senior researchers in related fields.

ACKNOWLEDGMENTS
The author would like to thank all the organizers for creating the public datasets and holding these great challenges. The author also highly appreciates Ruochen Gao (the winner in ASOCA), Ran Gu (the winner in MyoPS), Wenhui Lei (the winner in MyoPS and the runner-up in ABCs), Munan Ning (the winner in ABCs), Yixin Wang (the runner-up in BraTS), Yao Zhang (the runner-up in M&Ms), and Yichi Zhang (the winner in EMIDEC) for valuable discussions.

REFERENCES

[1] R. J. Gillies, P. E. Kinahan, and H. Hricak, "Radiomics: images are more than pictures, they are data," Radiology, vol. 278, no. 2, pp. 563-577, 2016.
[2] E. Rietzel, G. T. Chen, N. C. Choi, and C. G. Willet, "Four-dimensional image-based treatment planning: Target volume segmentation and dose calculation in the presence of respiratory motion," International Journal of Radiation Oncology Biology Physics, vol. 61, no. 5, pp. 1535-1550, 2005.
[3] K. Zhang, X. Liu, J. Shen, Z. Li, Y. Sang, X. Wu, Y. Cha, W. Liang, C. Wang, K. Wang et al., "Clinically applicable ai system for accurate diagnosis, quantitative measurements and prognosis of covid-19 pneumonia using computed tomography," Cell, 2020.
[4] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234-241.
[5] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical Image Analysis, vol. 42, pp. 60-88, 2017.
[6] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, "Image segmentation using deep learning: A survey," arXiv preprint arXiv:2001.05566, 2020.
[7] L. Maier-Hein, A. Reinke, M. Kozubek, A. L. Martel, T. Arbel, M. Eisenmann, A. Hanbury, P. Jannin, H. Müller, S. Onogur et al., "Bias: Transparent reporting of biomedical image analysis challenges," Medical Image Analysis, vol. 66, p. 101796, 2020.
[8] F. Isensee, P. F. Jäger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, "nnu-net: a self-configuring method for deep learning-based biomedical image segmentation," Nature Methods, 2020.
[9] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, "3d u-net: learning dense volumetric segmentation from sparse annotation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016, pp. 424-432.
[10] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-net: Fully convolutional neural networks for volumetric medical image segmentation," in International Conference on 3D Vision (3DV), 2016, pp. 565-571.
[11] M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, "The importance of skip connections in biomedical image segmentation," in Deep Learning and Data Labeling for Medical Applications, 2016, pp. 179-187.
[12] J. Ma and Z. Nie, "Exploring large context for cerebral aneurysm segmentation," arXiv preprint arXiv:2012.15136, 2020.
[13] Y. Wang, X. Wei, F. Liu, J. Chen, Y. Zhou, W. Shen, E. K. Fishman, and A. L. Yuille, "Deep distance transform for tubular structure segmentation in ct scans," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3833-3842.
[14] X. Li, W. Wang, X. Hu, and J. Yang, "Selective kernel networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 510-519.
[15] M. T. Löffler, A. Sekuboyina, A. Jacob, A.-L. Grau, A. Scharr, M. El Husseini, M. Kallweit, C. Zimmer, T. Baum, and J. S. Kirschke, "A vertebral segmentation dataset with fracture grading," Radiology: Artificial Intelligence, vol. 2, no. 4, p. e190138, 2020.
[16] A. Sekuboyina, M. Rempfler, A. Valentinitsch, B. H. Menze, and J. S. Kirschke, "Labeling vertebrae with two-dimensional reformations of multidetector ct images: An adversarial approach for incorporating prior knowledge of spine anatomy," Radiology: Artificial Intelligence, vol. 2, no. 2, p. e190074, 2020.
[17] A. Sekuboyina, A. Bayat, M. E. Husseini, M. Löffler, M. Rempfler, J. Kukačka, G. Tetteh, A. Valentinitsch, C. Payer, M. Urschler et al., "Verse: A vertebrae labelling and segmentation benchmark," arXiv preprint arXiv:2001.09193, 2020.
[18] C. Payer, D. Štern, H. Bischof, and M. Urschler, "Integrating spatial configuration into heatmap regression based cnns for landmark localization," Medical Image Analysis, vol. 54, pp. 207-219, 2019.
[19] P. M. Full, F. Isensee, P. F. Jäger, and K. Maier-Hein, "Studying robustness of semantic segmentation under domain shift in cardiac mri," arXiv preprint arXiv:2011.07592, 2020.
[20] Z. Yao, Y. Jiawei, H. Feng, L. Yang, W. Yixin, T. Jiang, Z. Cheng, Z. Yang, and H. Zhiqiang, "Semi-supervised cardiac image segmentation via label propagation and style transfer," arXiv preprint arXiv:2012.14785, 2020.
[21] J. Ma, "Histogram matching augmentation for domain adaptation with application to multi-centre, multi-vendor and multi-disease cardiac image segmentation," arXiv preprint arXiv:2012.13871, 2020.
[22] A. Lalande, Z. Chen, T. Decourselle, A. Qayyum, T. Pommier, L. Lorgis, E. de la Rosa, A. Cochet, Y. Cottin, D. Ginhac et al., "Emidec: A database usable for the automatic evaluation of myocardial infarction from delayed-enhancement cardiac mri," Data, vol. 5, no. 4, p. 89, 2020.
[23] Y. Zhang, "Cascaded convolutional neural network for automatic myocardial infarction segmentation from delayed-enhancement cardiac mri," arXiv preprint arXiv:2012.14128, 2020.
[24] J. Ma, "Cascaded framework for automatic evaluation of myocardial infarction from delayed-enhancement cardiac mri," arXiv preprint arXiv:2012.14556, 2020.
[25] X.-Y. Zhou, J.-Q. Zheng, P. Li, and G.-Z. Yang, "Acnn: a full resolution dcnn for medical image segmentation," in IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 8455-8461.
[26] M. Jun, "Segmentation loss odyssey," arXiv preprint arXiv:2005.13449, 2020.
[27] V. Andrearczyk, V. Oreiller, M. Vallières, J. Castelli, H. Elhalawani, M. Jreige, S. Boughdad, J. O. Prior, and A. Depeursinge, "Automatic segmentation of head and neck tumors and nodal metastases in pet-ct scans," in Medical Imaging with Deep Learning, 2020, pp. 33-43.
[28] V. Andrearczyk, V. Oreiller, M. Vallières, M. Jreige, J. O. Prior, and A. Depeursinge, "Overview of the hecktor challenge at miccai 2020: Automatic head and neck tumor segmentation in pet/ct," in Lecture Notes in Computer Science (LNCS) Challenges, 2021.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 2999-3007, 2018.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.
[31] M. Jun and Y. Xiaoping, "Combining cnn and hybrid active contours for head and neck tumor segmentation in ct and pet images," arXiv preprint arXiv:2012.14207, 2020.
[32] D. Wang, H. Li, X. Wei, and X.-P. Wang, "An efficient iterative thresholding method for image segmentation," Journal of Computational Physics, vol. 350, pp. 657-667, 2017.
[33] D. Wang and X.-P. Wang, "The iterative convolution-thresholding method (ictm) for image segmentation," arXiv preprint arXiv:1904.10917, 2019.
[34] J. Ma, D. Wang, X.-P. Wang, and X. Yang, "A fast algorithm for geodesic active contours with applications to medical image segmentation," arXiv preprint arXiv:2007.00525, 2020.
[35] X. Zhuang, "Multivariate mixture model for cardiac segmentation from multi-sequence mri," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016, pp. 581-588.
[36] ——, "Multivariate mixture model for myocardial segmentation combining multi-source images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 12, pp. 2933-2946, 2018.
[37] S. Zhai, R. Gu, W. Lei, and G. Wang, "Myocardial edema and scar segmentation using a coarse-to-fine framework with weighted ensemble," in Myocardial Pathology Segmentation Combining Multi-Sequence Cardiac Magnetic Resonance Images, X. Zhuang and L. Li, Eds., 2020, pp. 49-59.
[38] N. Munan, B. Cheng, Y. Chenglang, and Z. Yefeng, "Ensembled resunet for anatomical brain barriers segmentation," arXiv preprint arXiv:2012.14567, 2020.
[39] S. S. M. Salehi, D. Erdogmus, and A. Gholipour, "Tversky loss function for image segmentation using 3d fully convolutional deep networks," in International Workshop on Machine Learning in Medical Imaging, 2019, pp. 379-387.
[40] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., "The multimodal brain tumor image segmentation benchmark (brats)," IEEE Transactions on Medical Imaging, vol. 34, no. 10, pp. 1993-2024, 2014.
[41] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos, "Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features," Scientific Data, vol. 4, p. 170117, 2017.
[42] B. Spyridon, A. Hamed, S. Aristeidis, B. Michel, R. Martin, K. Justin, F. John, F. Keyvan, and D. Christos, "Segmentation labels and radiomic features for the pre-operative scans of the tcga-gbm collection," The Cancer Imaging Archive, 2017.
[43] ——, "Segmentation labels and radiomic features for the pre-operative scans of the tcga-lgg collection," The Cancer Imaging Archive, 2017.
[44] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, R. T. Shinohara, C. Berger, S. M. Ha, M. Rozycki et al., "Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge," arXiv preprint arXiv:1811.02629, 2018.
[45] F. Isensee, P. F. Jaeger, P. M. Full, P. Vollmuth, and K. H. Maier-Hein, "nnu-net for brain tumor segmentation," arXiv preprint arXiv:2011.00848, 2020.
[46] J. Haozhe, C. Weidong, H. Heng, and X. Yong, "H2nf-net for brain tumor segmentation using multimodal mr imaging: 2nd place solution to brats challenge 2020 segmentation task," arXiv preprint arXiv:2012.15318, 2020.
[47] Y. Wang, Y. Zhang, F. Hou, Y. Liu, J. Tian, C. Zhong, Y. Zhang, and Z. He, "Modality-pairing learning for brain tumor segmentation," arXiv preprint arXiv:2010.09277, 2020.
[48] T. Poggio, A. Banburski, and Q. Liao, "Theoretical issues in deep networks," Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30039-30045, 2020.
[49] E. Weinan, C. Ma, S. Wojtowytsch, and L. Wu, "Towards a mathematical understanding of neural network-based machine learning: what we know and what we don't," arXiv preprint arXiv:2009.10713, 2020.
[50] H. Fengxiang and T. Dacheng, "Recent advances in deep learning theory," arXiv preprint arXiv:2012.10931, 2020.
[51] J. Ma, Z. Wei, Y. Zhang, Y. Wang, R. Lv, C. Zhu, G. Chen, J. Liu, C. Peng, L. Wang, Y. Wang, and J. Chen, "How distance transform maps boost segmentation cnns: An empirical study," in Medical Imaging with Deep Learning, vol. 121, 2020, pp. 479-492.
[52] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. Ben Ayed, "Boundary loss for highly unbalanced segmentation," Medical Image Analysis, vol. 67, p. 101851, 2021.
[53] N. Kumar, R. Verma, D. Anand, Y. Zhou, O. F. Onder, E. Tsougenis, H. Chen, P.-A. Heng et al., "A multi-organ nucleus segmentation challenge," IEEE Transactions on Medical Imaging, vol. 39, no. 5, pp. 1380-1391, 2019.
[54] T. Roß, A. Reinke, P. M. Full, M. Wagner, H. Kenngott, M. Apitz, H. Hempe, D. M. Filimon, P. Scholz, T. N. Tran et al., "Comparative validation of multi-instance instrument segmentation in endoscopy: results of the robust-mis 2019 challenge," Medical Image Analysis, p. 101920, 2020.
[55] H. Fu, F. Li, X. Sun, X. Cao, J. Liao, J. I. Orlando, X. Tao, Y. Li, S. Zhang, M. Tan, C. Yuan, C. Bian, R. Xie, J. Li, X. Li, J. Wang, L. Geng, P. Li, H. Hao, J. Liu, Y. Kong, Y. Ren, H. Bogunović, X. Zhang, and Y. Xu, "Age challenge: Angle closure glaucoma evaluation in anterior segment optical coherence tomography," Medical Image Analysis, vol. 66, p. 101798, 2020.
[56] V. Cheplygina, M. de Bruijne, and J. P. Pluim, "Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis," Medical Image Analysis, vol. 54, pp. 280-296, 2019.
[57] J. E. Van Engelen and H. H. Hoos, "A survey on semi-supervised learning," Machine Learning, vol. 109, no. 2, pp. 373-440, 2020.
[58] G.-J. Qi and J. Luo, "Small data challenges in big data era: A survey of recent progress on unsupervised and semi-supervised methods," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[59] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding, "Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation," Medical Image Analysis, p. 101693, 2020.
[60] D. Karimi, H. Dou, S. K. Warfield, and A. Gholipour, "Deep learning with noisy labels: exploring techniques and remedies in medical image analysis," Medical Image Analysis, vol. 65, p. 101759, 2020.
[61] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Networks, vol. 113, pp. 54-71, 2019.
[62] A. Soltoggio, K. O. Stanley, and S. Risi, "Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks," Neural Networks, vol. 108, pp. 48-67, 2018.
[63] S. C. Hoi, D. Sahoo, J. Lu, and P. Zhao, "Online learning: A comprehensive survey," arXiv preprint arXiv:1802.02871, 2018.
[64] J. Ma, Y. Wang, X. An, C. Ge, Z. Yu, J. Chen, Q. Zhu, G. Dong, J. He, Z. He, C. Tianjia, Z. Yuntao, N. Ziwei, and Y. Xiaoping, "Towards data-efficient learning: A benchmark for covid-19 ct lung and infection segmentation," Medical Physics, 2020.
[65] J. Ma, Y. Zhang, S. Gu, Y. Zhang, C. Zhu, Q. Wang, X. Liu, X. An, C. Ge, S. Cao, Q. Zhang, S. Liu, Y. Wang, Y. Li, C. Wang, J. He, and X. Yang, "Abdomenct-1k: Is abdominal organ segmentation a solved problem?" arXiv preprint arXiv:2010.14808, 2020.