Brain Tumor Segmentation using 3D-CNNs with Uncertainty Estimation
Laura Mora Ballestar and Veronica Vilaplana
Signal Theory and Communications Department, Universitat Politècnica de Catalunya, BarcelonaTech, Spain. [email protected], [email protected]
Abstract.
Automated segmentation of brain tumors in 3D magnetic resonance images (MRIs) is key to assessing the diagnosis and treatment of the disease. In recent years, convolutional neural networks (CNNs) have shown improved results in the task. However, high memory consumption is still a problem in 3D-CNNs. Moreover, most methods do not include uncertainty information, which is especially critical in medical diagnosis. This work proposes a 3D encoder-decoder architecture, based on V-Net [6], which is trained with patching techniques to reduce memory consumption and decrease the effect of unbalanced data. We also introduce voxel-wise uncertainty, both epistemic and aleatoric, using test-time dropout and data augmentation respectively. Uncertainty maps can provide extra information to expert neurologists, useful for detecting when the model is not confident on the provided segmentation.
Keywords: brain tumor segmentation · deep learning · uncertainty · convolutional neural networks

Brain tumors are categorized into primary, which originate in the brain; and secondary, tumors that have spread from elsewhere and are known as brain metastasis tumors. Among malignant primary tumors, gliomas are the most common in adults, representing 81% of brain tumors [7]. The World Health Organization (WHO) categorizes gliomas into grades I-IV, which can be simplified into two types: (1) low grade gliomas (LGG), grades I-II, which are less common and are characterized by low blood concentration and slow growth, and (2) high grade gliomas (HGG), grades III-IV, which have a faster growth rate and aggressiveness. The extent of the disease is composed of four heterogeneous histological sub-regions, i.e. the peritumoral edematous/invaded tissue, the necrotic core (fluid-filled), and the enhancing and non-enhancing tumor (solid) core. Each region is described by varying intensity profiles across MRI modalities (T1-weighted, post-contrast T1-weighted, T2-weighted, and Fluid-Attenuated Inversion Recovery, FLAIR), which reflect the diverse tumor biological properties and are commonly
used to assess the diagnosis, treatment and evaluation of the disease. These MRI modalities facilitate tumor analysis, but at the expense of performing manual delineation of the tumor regions, which is a challenging and time-consuming process. For this reason, automatic mechanisms for tumor region segmentation have appeared in the last decade thanks to the advancement of deep learning models in computer vision tasks. The Brain Tumor Segmentation (BraTS) [1–5] challenge started in 2012 with a focus on evaluating state-of-the-art methods for glioma segmentation in multi-modal MRI scans. The BraTS 2020 training dataset includes 369 cases (293 HGG and 76 LGG), each with four 3D MRI modalities rigidly aligned, re-sampled to 1 mm isotropic resolution and skull-stripped, with size 240 x 240 x 155. Brain tumor segmentation methods include generative and discriminative approaches. Generative methods try to incorporate prior knowledge and model probabilistic distributions, whereas discriminative methods extract features from image representations. The latter approach has thrived in recent years thanks to the advancement of CNNs, as demonstrated by the winners of previous BraTS editions. The biggest breakthrough in this area was introduced by DeepMedic [10], a 3D CNN that exploits multi-scale features using parallel pathways and incorporates a fully connected conditional random field (CRF) to remove false positives. [11] compares the performance of three 3D CNN architectures, showing the importance of multi-resolution connections to obtain fine details in the segmentation of tumor sub-regions.

This work has been partially supported by the project MALEGRA TEC2016-75976-R financed by the Spanish Ministerio de Economía y Competitividad.
More recently, EMMA [12] creates an ensemble at inference time, which reduces overfitting but at a high computational cost, and [13] proposes a cascade of two CNNs, where the first network produces raw tumor masks and the second network is trained on the vicinity of the tumor to predict tumor regions. The BraTS 2018 winner [14] proposed an asymmetric encoder-decoder architecture with a variational autoencoder to reconstruct the image during training, which is used as a regularizer. Isensee, F. [15] uses a regular 3D U-Net optimized on the evaluation metrics and co-trained with external data. The BraTS 2019 winners [16] use a two-stage cascade U-Net trained end-to-end. Finally, [17] applies several tricks in three categories: data processing, model devising and optimization process to boost the model performance.
Uncertainty information of segmentation results is important, especially in medical imaging, to guide clinical decisions and help understand the reliability of the provided segmentation, hence being able to identify more challenging cases which may require expert review. Segmentation models for brain tumor MRIs tend to label voxels with less confidence in the tissues surrounding the segmentation targets [19], thus indicating regions that may have been mis-segmented. Last year's BraTS challenge already started introducing uncertainty measurements. [18] computes epistemic uncertainty using TTD: a posterior distribution is generated by running several forward passes for each image at test time, and the mean and variance are then used to evaluate the model uncertainty. A different approach is proposed by Wang G. [19], who uses TTD and data augmentation to estimate the voxel-wise uncertainty by computing the entropy instead of the variance. Finally, [20] proposes to incorporate uncertainty measures during training, defining a loss function that models label noise and uncertainty.
The following section details the network architecture as well as the trainingschemes and data processing techniques used to train our model.
MRI intensity values are not standardized, as the data is obtained from different institutions, scanners and protocols. When training a neural network it is particularly important to use normalized data, even more so as MRI modalities are treated like color channels. For this reason, we normalize each modality of each patient independently to have zero mean and unit standard deviation based on non-zero voxels only, which represent the brain region. We also apply data augmentation techniques to prevent over-fitting while disrupting the data minimally. For this, we apply Random Flip (for all 3 axes) with a 50% probability and Random Intensity Shift between (−0.1, 0.1) of the data standard deviation. V-Net [6] and U-Net [25] architectures have proven to be successful and reliable encoder-decoder architectures for medical image segmentation, as they are able to segment fine image structures. Moreover, many past years' participants achieved great results using these architectures as baselines, which motivates our choice.
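A minimal sketch of this preprocessing, assuming each patient's volumes are stored as NumPy arrays (the function names are ours, not the authors'):

```python
import numpy as np

def normalize_modality(volume: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-std normalization computed on non-zero (brain) voxels only."""
    out = volume.astype(np.float32).copy()
    mask = out != 0
    mean, std = out[mask].mean(), out[mask].std()
    out[mask] = (out[mask] - mean) / (std + 1e-8)
    return out

def random_flip(volumes: np.ndarray, p: float = 0.5, rng=None) -> np.ndarray:
    """Randomly flip all modality channels along each spatial axis with probability p.
    `volumes` has shape (C, D, H, W); flipping channels together keeps them aligned."""
    if rng is None:
        rng = np.random.default_rng()
    for axis in (1, 2, 3):
        if rng.random() < p:
            volumes = np.flip(volumes, axis=axis)
    return volumes.copy()
```

Normalizing per patient and per modality, rather than with dataset-wide statistics, makes the scheme robust to scanner and protocol differences across institutions.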
Our work uses a V-Net architecture with four output channels and some minor modifications, such as the usage of Instance Normalization [21] instead of Batch Normalization, which normalizes across each channel for each training example. Moreover, we have doubled the number of feature maps at each level of the network, as proposed in [15], having 32 feature channels at the highest resolution. The network follows a common approach that progressively downsizes the image and feature dimensions by a factor of 2 using strided convolutions instead of pooling layers, see Figure 1.

Fig. 1: We use the V-Net [6] architecture with instance normalization, ELU non-linearities and a doubled number of feature channels. Feature dimensionality is denoted at each block. The network outputs the segmentation in four channels using a Softmax.
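As an illustration only, one encoder stage of such a network might look like the following PyTorch sketch (layer arrangement beyond what is stated above is an assumption, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One encoder stage: a strided 3D convolution halves the spatial size
    (replacing pooling), followed by InstanceNorm and ELU as described above."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.ELU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# 4 MRI modalities in, 32 feature channels at the highest resolution
stem = nn.Sequential(
    nn.Conv3d(4, 32, kernel_size=3, padding=1),
    nn.InstanceNorm3d(32),
    nn.ELU(inplace=True),
)
down = DownBlock(32, 64)
```

Instance normalization is preferable here because the small batch sizes forced by 3D patches make batch statistics unreliable.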
The network is trained end-to-end with randomly selected patches which have a 50% probability of being centered on healthy tissue and a 50% probability of being centered on tumor [22]. Due to memory constraints, we need to make a trade-off between the size of each patch and the batch size. With this, we found that the best results are achieved with patches of size 64x64x64 and a batch size of 8. In our case, bigger patch sizes required smaller batches and were more likely to overfit, thus achieving worse results on the validation set. For optimization, we use SGD with 0.99 momentum and a learning rate of 1e-2. We use the PyTorch learning rate scheduler ReduceLROnPlateau, which decreases the learning rate by a constant factor when the monitored metric stops improving. The model is trained with the Dice loss:

L_{dice} = \frac{2 \sum_i^N p_i g_i}{\sum_i^N p_i + \sum_i^N g_i + \epsilon}

where N is the number of voxels, p_i and g_i correspond to the predicted and ground truth labels per voxel respectively, and \epsilon is added to avoid division by zero. We also train a version of the model where the loss is optimized on the three nested regions (whole tumor, tumor core and enhancing tumor) together with the previous Dice loss. The proposed model shows several false positives in the form of small, separated connected components. Therefore, we keep the two biggest connected components if their proportion is bigger than some threshold (obtained by analysing the training set), as some of the subjects may have several regions with unhealthy tissue.
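As a framework-agnostic illustration, the Dice term above can be sketched in NumPy (a hypothetical helper, not the authors' training code; in practice it is computed on the network's softmax output per channel):

```python
import numpy as np

def dice_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Soft Dice loss for one channel: 1 - 2*sum(p*g) / (sum(p) + sum(g) + eps).
    `pred` holds per-voxel probabilities, `target` the binary ground truth.
    The paper's L_dice is the Dice coefficient itself; minimizing 1 - L_dice
    is the usual way to turn it into a loss."""
    p, g = pred.ravel(), target.ravel()
    dice = (2.0 * np.sum(p * g)) / (np.sum(p) + np.sum(g) + eps)
    return 1.0 - dice
```

Because the coefficient is a ratio of overlap to total mass, it is insensitive to the large number of background voxels, which is why it suits unbalanced segmentation data.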
This year's BraTS includes a third task to evaluate the model uncertainty and reward methods with predictions that are: (a) confident when correct and (b) uncertain when incorrect. In this work, we model the voxel-wise uncertainty of our method at test time, using test-time dropout (TTD) and test-time data augmentation for epistemic and aleatoric uncertainty respectively. We compute epistemic uncertainty as proposed by Gal et al. [23], who use dropout as a Bayesian approximation in order to simplify the task. The idea is to use dropout both at training and testing time. The paper suggests repeating the prediction a few hundred times with random dropout; the final prediction is then the average of all estimations, and the uncertainty is modelled by computing the variance of the predictions. In this work we perform B = 50 iterations and use dropout with a 50% probability of zeroing out a channel. As mentioned, the uncertainty map is estimated with the variance for each sub-region independently. Let Y_i = \{y_{i1}, y_{i2}, ..., y_{iB}\} be the vector that represents the i-th voxel's predicted labels; the voxel-wise uncertainty map, for each evaluation region, is obtained as the variance:

var = \frac{1}{B} \sum_{b=1}^{B} (y_{ib} - y_{i,mean})^2

Uncertainty can also be estimated with the entropy, as [19] showed. However, the entropy will provide a global measure instead of a map per sub-region. In this case, the voxel-wise uncertainty is calculated as:

H(Y_i \mid X) \approx - \sum_{m=1}^{M} \hat{p}_{im} \ln(\hat{p}_{im})

where \hat{p}_{im} is the frequency of the m-th unique value in Y_i and X represents the input image. Finally, we model aleatoric uncertainty by applying the same augmentation techniques from training plus random Gaussian noise. The final prediction and uncertainty maps are computed following the same strategy as in the epistemic uncertainty.
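A minimal sketch of the voxel-wise statistics above, assuming B binary label samples per voxel for one evaluation region (hypothetical helper, not the authors' code):

```python
import numpy as np

def uncertainty_maps(samples: np.ndarray):
    """Given B stochastic predictions (shape (B, *vol), binary labels for one
    region), return the mean prediction, the variance map (epistemic
    uncertainty) and the entropy map over the sampled labels."""
    mean = samples.mean(axis=0)
    # (1/B) * sum_b (y_b - y_mean)^2, computed per voxel
    var = ((samples - mean) ** 2).mean(axis=0)
    # entropy over the empirical label frequencies p and 1-p (binary case, M = 2)
    p = np.clip(mean, 1e-8, 1 - 1e-8)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return mean, var, entropy
```

With binary labels the variance peaks at 0.25 and the entropy at ln 2, both reached when the B samples disagree half the time, i.e. at maximal uncertainty.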
The model has been implemented in PyTorch [24] and trained on the GPI servers, based on 2 Intel(R) Xeon(R) @ 2.40GHz CPUs with 16GB RAM and a 12GB NVIDIA GPU, using the BraTS 2020 training dataset. We report results on both the training (369 cases) and validation (125 cases) datasets. All results, prediction and uncertainty maps, are uploaded to the CBICA Image Processing Portal (IPP) for evaluation of the Dice score, Hausdorff distance (95th percentile), sensitivity and specificity per class. Specific uncertainty evaluation metrics are the ratio of filtered true negatives (FTN) and the ratio of filtered true positives (FTP). The principal metrics to evaluate the segmentation performance are the Dice score, which is an overlap measure for pairwise comparison of a segmentation mask X and ground truth Y:
DSC = \frac{2 |X \cap Y|}{|X| + |Y|}

and the Hausdorff distance, which is the maximum distance of a set to the nearest point in the other set, defined as:

D_H(X, Y) = \max \left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y), \; \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\}

where sup represents the supremum and inf the infimum. In order to have more robust results and to avoid issues with noisy segmentations, the evaluation scheme uses the 95th percentile. Table 1 and Table 2 show results for the training and validation sets respectively. We show the segmentation maps obtained directly from our model and the ones obtained by averaging the predictions from the uncertainty estimation, annotated with (avg). In both sets, the best results are obtained with the V-Net model directly when post-processing is applied. However, both Dice score and Hausdorff distance have lower performance on the validation set, being more noticeable in the ET region,

The UPC Image and Video Processing Group (GPI) is a research group of the Signal Theory and Communications department.

Table 1: Segmentation Results on Training Dataset (369 cases).
                   Dice                      Hausdorff (mm)
Method             WT      TC      ET        WT       TC       ET
V-Net              0.8421  0.7837  0.6752    21.9354  11.7565  31.1815
V-Net+post         -       -       -         -        -        -
V-Net (avg)        0.8414  0.7727  0.6482    22.7816  11.8985  34.0967
V-Net+post (avg)   0.8342  -       -         -        -        -
Table 2: Segmentation Results on Validation Dataset (125 cases)
                   Dice                      Hausdorff (mm)
Method             WT      TC      ET        WT       TC       ET
V-Net              0.8368  0.7499  0.6159    26.4085  13.3398  49.7425
V-Net+post         -       -       -         -        -        -
V-Net (avg)        0.8335  0.7547  -         -        -        -

where it increases from 31 to 47. This high distance is caused because we obtain 13 predictions with the highest Hausdorff distance and 0 Dice, indicating we are predicting ET in tumors that should not have it.
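For illustration, the two evaluation metrics can be sketched as follows (a brute-force NumPy/SciPy version assuming binary masks; the official IPP evaluation is the reference implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dice_score(x: np.ndarray, y: np.ndarray) -> float:
    """DSC = 2|X ∩ Y| / (|X| + |Y|) for binary masks."""
    x, y = x.astype(bool), y.astype(bool)
    return 2.0 * np.logical_and(x, y).sum() / (x.sum() + y.sum())

def hausdorff95(x: np.ndarray, y: np.ndarray) -> float:
    """95th-percentile symmetric Hausdorff distance between two binary masks.
    Brute-force over foreground coordinates: fine for a sketch, slow on full
    240x240x155 volumes."""
    px, py = np.argwhere(x), np.argwhere(y)
    d = cdist(px, py)
    forward = d.min(axis=1)   # each point of X to its nearest point of Y
    backward = d.min(axis=0)  # each point of Y to its nearest point of X
    return max(np.percentile(forward, 95), np.percentile(backward, 95))
```

Taking the 95th percentile instead of the maximum makes the distance robust to a few stray false-positive voxels, which would otherwise dominate the plain Hausdorff distance.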
BraTS requires uploading three uncertainty maps, one for each sub-region (WT, TC, ET), together with the prediction map. Uncertainty maps must be normalized to 0-100, such that "0" represents the most certain prediction and "100" represents the most uncertain. The challenge also measures the ratio of filtered true positives, defined as

FTP = (TP - TP_T) / TP,

where T represents the threshold used to filter out the more uncertain values and TP_T is the number of true positives remaining after thresholding. The ratio of filtered true negatives (FTN) is calculated in a similar manner.

Table 3: Uncertainty Results on Training and Validation Datasets.
                     DICE AUC                 FTP RATIO AUC            FTN RATIO AUC
Method               WT      TC      ET      WT      TC      ET       WT      TC      ET
(train) V-Net+post   0.8506  0.7890  0.6779  0.0029  0.0104  0.0179   0.0008  0.0002  0.0002
(valid) V-Net+post   0.8505  0.7583  0.6274  0.0118  0.0515  0.0815   0.0027  0.0008  0.0005
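The filtered-true-positive ratio can be sketched as follows (hypothetical helper; `T` is on the same 0-100 scale as the normalized uncertainty maps):

```python
import numpy as np

def ftp_ratio(pred: np.ndarray, gt: np.ndarray, unc: np.ndarray, T: float) -> float:
    """Ratio of true positives removed when voxels with uncertainty >= T are
    filtered out: FTP = (TP - TP_T) / TP."""
    tp = np.logical_and(pred == 1, gt == 1)
    kept = unc < T                         # voxels confident enough to keep
    tp_t = np.logical_and(tp, kept).sum()  # true positives surviving the filter
    return (tp.sum() - tp_t) / tp.sum()
```

A low FTP ratio means few correct voxels are discarded when uncertain predictions are filtered, i.e. the model is confident where it is right.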
Fig. 2: Training results on patients 223 and 325 (top-bottom). Image order: (1) FLAIR (2) GT (3) Prediction (4) Average prediction from uncertainty (5) WT uncertainty map (6) TC uncertainty map (7) ET uncertainty map.

Fig. 3: Validation results on patients 007 and 035 (top-bottom). Image order: (1) FLAIR (2) Prediction (3) Average prediction from uncertainty (4) WT uncertainty map (5) TC uncertainty map (6) ET uncertainty map.

Results in Table 3 show that our model has low FTP and FTN ratios, meaning that it is certain of those predictions.
Figure 2 and Figure 3 show some visual results from the training and validation sets respectively. As can be seen, in some of the subjects we obtain a fairly good segmentation, whereas in others we have too many false positives in the WT and ET regions. However, the uncertainty values show that the model is more uncertain in the tumor surroundings and in areas where the prediction has been mis-classified.
In this paper we use a V-Net architecture with minor modifications: a doubled number of feature maps and instance normalization instead of batch normalization.
Results, on both the training and validation sets, show that our model has severe problems to correctly segment the whole tumor and, especially, the enhancing tumor. In both cases, we have a large number of false positives, as can be seen in the Hausdorff distance results. In the case of ET, the distance is much bigger, as a single false-positive voxel in a patient with no enhancing tumor in the ground truth results in a Dice score of 0 and a Hausdorff distance of approximately 373; this is the case for some of the subjects, which has a dramatic effect on the mean value. In order to improve the results, we plan on adding ResNet blocks to the network baseline as well as using the Generalised Dice Loss [26], which is more suited for unbalanced data. Moreover, we believe that some post-processing techniques should also improve the results.
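As a reference for this planned change, the Generalised Dice overlap of [26] can be sketched as follows (hypothetical helper; class weights w_c = 1 / (Σ g_c)^2 as in the original formulation):

```python
import numpy as np

def generalised_dice(preds: np.ndarray, targets: np.ndarray, eps: float = 1e-6) -> float:
    """Generalised Dice overlap: per-class intersection and union sums are
    weighted by w_c = 1 / (sum of ground-truth voxels of class c)^2, so rare
    classes contribute as much as frequent ones.
    preds, targets: shape (C, N), per-class probabilities / one-hot labels."""
    w = 1.0 / (targets.sum(axis=1) ** 2 + eps)
    inter = (w * (preds * targets).sum(axis=1)).sum()
    union = (w * (preds.sum(axis=1) + targets.sum(axis=1))).sum()
    return 2.0 * inter / (union + eps)
```

A training loss would then be 1 minus this overlap; the inverse-square weighting is exactly what addresses the class imbalance between the small ET region and the background.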