A fully 3D multi-path convolutional neural network with feature fusion and feature weighting for automatic lesion identification in brain MRI images
Yunzhe Xue, Meiyan Xie, Fadi G. Farhat, Olga Boukrina, A. M. Barrett, Jeffrey R. Binder, Usman W. Roshan, William W. Graves
Yunzhe Xue
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
[email protected]

Meiyan Xie
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
[email protected]

Fadi G. Farhat
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
[email protected]

Olga Boukrina
Stroke Rehabilitation Research, Kessler Foundation, West Orange, NJ, USA
[email protected]

A. M. Barrett
Emory University and Atlanta VA Medical Center, Atlanta, GA, USA
[email protected]

Jeffrey R. Binder
Department of Neurology, Medical College of Wisconsin, Milwaukee, WI, USA
[email protected]

Usman W. Roshan
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
[email protected]

William W. Graves
Department of Psychology, Rutgers University - Newark, Newark, NJ, USA
[email protected]
Abstract
We propose a fully 3D multi-path convolutional network to predict stroke lesions from 3D brain MRI images. Our multi-path model has independent encoders for different modalities containing residual convolutional blocks, weighted multi-path feature fusion from different modalities, and weighted fusion modules to combine encoder and decoder features. Compared to existing 3D CNNs like DeepMedic, 3D U-Net, and AnatomyNet, our network achieves the highest statistically significant cross-validation accuracy of 60.5% on the large ATLAS benchmark of 220 patients. We also test our model on multi-modal images from the Kessler Foundation and the Medical College of Wisconsin and achieve a statistically significant cross-validation accuracy of 65%, significantly outperforming the multi-modal 3D U-Net and DeepMedic. Overall, our model offers a principled, extensible multi-path approach that outperforms multi-channel alternatives and achieves high Dice accuracies on existing benchmarks.
Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract

Methods

Fully 3D multi-path convolutional neural network
Our contribution is a fully 3D convolutional network for predicting stroke lesions in brain MRI images. Our model contains only 3D convolutional kernels, 3D feature fusion modules, and feature weighting methods. We use separate encoders for each modality, which we then fuse in a custom-designed feature fusion module. The fused features are then added to outputs from the upsampling blocks. We use squeeze-and-excitation to weight channels for different modalities, as well as a simple custom-designed amplified weighting intended to fit small lesions that are hard to detect. Our overall model structure is given in Figure 1; a minimal code sketch of the multi-path design follows the figure. The U-shaped network with encoders, decoders, and connections between them is a widely used structure for segmentation problems [1].
[Figure 1 schematic; legend: Fusion, Residual Blocks, Down Sampling, Conv, Up Sampling.]
Figure 1: Overview of our 3D convolutional neural net system.
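To make the multi-path design concrete, here is a minimal PyTorch sketch of one fusion stage, assuming per-modality encoders whose outputs are concatenated and fused by a 1x1x1 convolution before being added to the decoder's upsampled features. The layer choices (channel counts, InstanceNorm3d, a single downsampling step) are illustrative stand-ins, not the exact implementation.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """One encoder path: a 3D conv block followed by strided downsampling (illustrative)."""
    def __init__(self, in_ch=1, out_ch=16):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, stride=2, padding=1),  # downsample
        )

    def forward(self, x):
        return self.block(x)

class MultiPathFusion(nn.Module):
    """A separate encoder per modality; features are concatenated, fused by a
    1x1x1 conv, and added to the decoder's upsampled features."""
    def __init__(self, n_modalities=2, ch=16):
        super().__init__()
        self.encoders = nn.ModuleList([ModalityEncoder(1, ch) for _ in range(n_modalities)])
        self.fuse = nn.Conv3d(n_modalities * ch, ch, kernel_size=1)

    def forward(self, modalities, upsampled):
        feats = [enc(m) for enc, m in zip(self.encoders, modalities)]
        fused = self.fuse(torch.cat(feats, dim=1))
        return fused + upsampled  # fused features added to decoder output

# Two single-channel modalities at a toy 32^3 resolution
t1 = torch.randn(1, 1, 32, 32, 32)
t2 = torch.randn(1, 1, 32, 32, 32)
up = torch.randn(1, 16, 16, 16, 16)  # decoder features at the downsampled resolution
out = MultiPathFusion()([t1, t2], up)
print(out.shape)  # torch.Size([1, 16, 16, 16, 16])
```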
Variants of our model

Single path:
Here we have a single encoder for all modalities combined. In this version, the different modalities are stacked as multiple channels. With this variant we can see the effect of a multi-channel vs. multi-path approach for different input modalities.
No fusion:
This applies only to ATLAS data, where we have single-modality T1 images. Since the images have just one modality, there are no features to fuse. Thus we replace the fusion module with an identity mapping, as in the original 3D U-Net.
Basic:
This is our basic multi-path model with the basic fusion component.
Squeeze-and-excitation in fusion (SE):
Here we consider the fusion block with the squeeze-and-excitation component [2] to weight channels across different modalities. This is a kind of attention mechanism [3] focusing on channels. The idea is to use pooling to compress each feature map to a point, followed by fully connected layers and a non-linear activation (a sort of mini encoder-decoder setup). The sigmoid layer outputs a number between 0 and 1 that is used to weight the channel.
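A minimal 3D squeeze-and-excitation block along the lines of Hu et al. [2]; the reduction ratio of 4 here is an assumption:

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Squeeze-and-excitation for 3D feature maps: pool each channel to a
    point, pass through a small FC bottleneck, and emit per-channel weights."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)  # squeeze: one value per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel weight in (0, 1)
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w  # excitation: reweight the channels

x = torch.randn(2, 16, 8, 8, 8)
print(SEBlock3D(16)(x).shape)  # torch.Size([2, 16, 8, 8, 8])
```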
SE and Amplified Weighting (SE+AW):
During our experimental design we found that small lesions are particularly hard to map. In fact, our ATLAS benchmark has about a quarter of images with lesion volumes below mm³ and a third quarter below mm³. More interestingly, we found 25 samples in ATLAS that give a 0 Dice coefficient during training. We call these hard samples, and we found their gradient values in the residual block of the decoder to be much smaller than those of other samples (by about one thousand times). One explanation for this is the addition operation between features from the upsampler and the fusion module. Upon closer examination, we find that the output from the fusion module has a much larger magnitude than the upsampler output before entering the residual block (30 to 100 times). As a possible fix, we simply amplify the weights from the upsampler by a factor of 10. We call this Amplified Weighting (AW).
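The amplified-weighting fix is then a one-line change at the point where fusion and upsampler features are added; the factor of 10 comes from the text, while the tensors below are synthetic stand-ins:

```python
import torch

fusion_out = 50.0 * torch.randn(1, 16, 16, 16, 16)  # fusion output: much larger magnitude
upsampled = torch.randn(1, 16, 16, 16, 16)           # upsampler output

# Plain addition lets the fusion branch dominate the sum (and its gradients).
plain = fusion_out + upsampled

# Amplified Weighting (AW): boost the upsampler features by a factor of 10
# before the addition, per the paper's fix for hard, small-lesion samples.
AW_FACTOR = 10.0
amplified = fusion_out + AW_FACTOR * upsampled
```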
ReLU Only:

Instead of amplifying the upsampler weights as above, we consider an alternative approach to balancing the magnitudes of the outputs from the fusion and upsampler modules. In our residual block, we move the first normalization layer to the end, making it the final layer just before the addition. This is known as ReLU-only pre-activation [4]. With this modification, the outputs from the residual blocks get normalized to between 0 and 1. These are then given as inputs to the fusion module, and because they are normalized, the outputs from the fusion module are not much larger than the upsampler outputs (about 10 to 30 times). In fact, the squeeze-and-excitation module that assigns weights to different channels is partly responsible for the weight inflation. This variant includes the SE variant above.
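A sketch of the modified residual block, with the first normalization layer moved to the end so the block output is normalized just before the skip addition; InstanceNorm3d stands in for the paper's normalization layer, which is an assumption:

```python
import torch
import torch.nn as nn

class ReLUOnlyPreActBlock(nn.Module):
    """Residual block in the ReLU-only pre-activation style: the first
    normalization layer is moved to the end of the block, so the branch
    output is normalized just before the skip addition."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReLU(),                                    # activation first
            nn.Conv3d(ch, ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(ch),
            nn.ReLU(),
            nn.Conv3d(ch, ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(ch),                        # final norm before addition
        )

    def forward(self, x):
        return x + self.body(x)

x = torch.randn(1, 16, 8, 8, 8)
print(ReLUOnlyPreActBlock(16)(x).shape)  # torch.Size([1, 16, 8, 8, 8])
```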
Data, other methods, measure of accuracy, and statistical significance

Imaging Data:
We obtained high-resolution (1 mm³) whole-brain MRI scans from 25 patients from the Kessler Foundation (KES), a neuro-rehabilitation facility in West Orange, New Jersey. We also obtained 20 high-resolution scans from the Medical College of Wisconsin (MCW). We included subacute patients. In addition, we used the open-source ATLAS benchmark [5] of 220 patients with single-modality T1 images and manual lesion segmentations.

Comparison of CNN Methods:
We compared our CNN to three recently published state-of-the-art CNNs. (1) DeepMedic [6]: a popular dual-path 3D convolutional neural network with a conditional random field to account for the spatial order of slices. (2) AnatomyNet [7]: a convolutional neural network with residual connections [8] and squeeze-and-excitation blocks [2]. (3) 3D-UNet [1]: a 3D U-Net that obtained third place in the BRATS 2017 multimodal brain tumor segmentation challenge.
Measure of accuracy: Dice coefficient:
The Dice coefficient is typically used to measure the accuracy of predicted lesions in MRI images [9]. Starting with the human binary mask as ground truth, each predicted voxel is determined to be either a true positive (TP, also one in the true mask), a false positive (FP, predicted as one but zero in the true mask), or a false negative (FN, predicted as zero but one in the true mask). The Dice coefficient is formally defined as
$$\mathrm{DICE} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}.$$
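A minimal NumPy implementation over binary masks, for concreteness:

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice overlap between two binary 3D masks (1 = lesion voxel)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()   # true positives
    fp = np.logical_and(pred, ~truth).sum()  # false positives
    fn = np.logical_and(~pred, truth).sum()  # false negatives
    denom = 2 * tp + fp + fn
    # Convention chosen here: two empty masks count as a perfect match.
    return 2 * tp / denom if denom > 0 else 1.0

pred = np.zeros((4, 4, 4), dtype=int); pred[1:3, 1:3, 1:3] = 1
truth = np.zeros((4, 4, 4), dtype=int); truth[1:3, 1:3, 1:4] = 1
print(dice_coefficient(pred, truth))  # 0.8
```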
Measure of statistical significance: Wilcoxon rank sum test:

The Wilcoxon rank sum test [10] (also known as the Mann-Whitney U test) can be used to determine whether the difference between two sets of measurements (methods, in our case) is significant. While the t-test has been shown to have a high Type 1 error, possibly due to violation of independence across the folds [11], the Wilcoxon rank sum test has previously been recommended for testing across datasets [12] and has also been applied on a single dataset [13], as we do.
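With SciPy this test is a one-liner over two sets of per-sample Dice values (the numbers below are made up for illustration):

```python
from scipy.stats import ranksums

# Hypothetical per-sample test Dice values for two methods
dice_ours = [0.62, 0.71, 0.58, 0.66, 0.70]
dice_baseline = [0.55, 0.60, 0.52, 0.61, 0.59]

# Wilcoxon rank sum test of whether the two sets of measurements differ
stat, p_value = ranksums(dice_ours, dice_baseline)
print(f"statistic = {stat:.3f}, p = {p_value:.4f}")
```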
Results
We perform five-fold cross-validation. On KES+MCW our splits are chosen randomly, whereas on ATLAS we first rank the samples by their training accuracy in increasing order. We then group every five together and rotate the test sample by picking the first from each group of five, then the second, and so forth (this split construction is sketched in code below). This gives us five sets of train and test samples. Our goal here is to include hard-to-train samples in the training set so as to better predict hard ones in the test set. We apply all methods to the same set of splits in both datasets. Each of our MRI images is aligned to standard space. We normalize each 3D image by subtracting the mean and dividing by the variance of the image's voxel values.

In Figure 2 we compare the average Dice coefficients over test samples for each of our variants. On KES+MCW (Figure 2(a)) we see that our Basic variant (multi-path approach) has about a 2% improvement over the SinglePath variant that stacks different modalities as multiple channels. Thus treating each modality independently with its own encoder has a noticeable advantage. We see the same in ATLAS (Figure 2(b)) if we consider the flipped image as a second modality. In fact, in ATLAS we see that using the flipped image as another modality brings a considerable improvement over using just the single T1 images alone. The Flip variant in Figure 2(b) has a 58.4% Dice value whereas the Basic variant is at 56.3% (with the difference having a p-value of .086). Beyond the basic model, we see that both the squeeze-and-excitation blocks and the amplified weighting improve overall test accuracy on both datasets. Thus upweighting features from the upsampler (as in our AW variant) helps to improve not just training accuracy on hard samples but also overall test accuracy.

Figure 2: Average five-fold test Dice values of variants of our model on the (a) KES+MCW and (b) ATLAS datasets (axes: model type vs. average Dice value).

In Figure 3 we compare our SE+AW (squeeze-and-excitation in fusion plus amplified weighting) model to three other 3D CNN models on KES+MCW. On ATLAS we show our ReLUOnly model, although the SE+AW variant is only slightly behind in accuracy. On KES+MCW we do not have results for AnatomyNet since it is designed for single-modality images. We show the Dice values for four different thresholds of lesion size, starting from the smallest 25% up to 100%, which includes the entire dataset.

On KES+MCW we see that on the 25% smallest lesions our model has 34.4% accuracy whereas the next best, 3D U-Net, is at 29%. On ATLAS, on the 25% smallest lesions our model has 41.4% accuracy while the next best, AnatomyNet, is at 34.1%. Our weighting techniques (amplified weighting and ReLUOnly) thus give us a 5.4% and 7% improvement over the next best on KES+MCW and ATLAS, respectively. In the case of ATLAS the improvement is statistically significant with a p-value of 0.004.

On KES+MCW (100% threshold) our model has a Dice value of 65.1% while the next best, 3D U-Net, reaches 60.8%. On all of ATLAS our model has a Dice value of 60.5% while the next best, AnatomyNet, reaches 56.1%. In both cases the differences are strongly statistically significant: on KES+MCW the difference has a p-value of 0.0016, and on ATLAS the p-value is smaller still. Thus we see that our model not only improves upon previous methods on small, hard-to-detect lesions, but overall as well.
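The ATLAS split construction described above can be sketched as follows; `difficulty_balanced_folds` is a hypothetical helper, and `train_acc` is a synthetic stand-in for per-sample training Dice values:

```python
import numpy as np

def difficulty_balanced_folds(train_acc, k=5):
    """Rank samples by training accuracy, group every k together, and rotate
    the held-out sample within each group to build k balanced test sets."""
    order = np.argsort(train_acc)                    # increasing training accuracy
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    folds = []
    for j in range(k):
        test = [g[j] for g in groups if len(g) > j]  # j-th member of each group
        train = [i for i in order if i not in test]
        folds.append((train, test))
    return folds

rng = np.random.default_rng(0)
train_acc = rng.random(220)                          # hypothetical per-sample training Dice
folds = difficulty_balanced_folds(train_acc)
print([len(test) for _, test in folds])              # 44 test samples per fold
```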
Figure 3: Average five-fold test Dice values of our SE+AW model compared to three other 3D CNNs on the (a) KES+MCW (DeepMedic, 3D UNet, our model) and (b) ATLAS (DeepMedic, 3D UNet, AnatomyNet, our model) datasets (axes: data set size, sorted by lesion volume, vs. average Dice value).

In Table 1 we see that our model has the fewest FLOPS except for DeepMedic, which uses patches of images and thus has a much smaller input size than the full-volume inputs of all other models. Our model's total parameter count is comparable to that of 3D-UNet. This suggests that our model's performance can be attributed to its architecture and the novel design features we introduce, and not just to higher capacity.

Method                      Our 3D CNN model  AnatomyNet      3D-UNet        DeepMedic
Total trainable parameters  1,739,146         1,359,747       1,781,304      658,295
FLOPS                       1,025,567,394     10,765,767,799  1,720,811,520  91,426,315

Table 1: Total trainable parameters and FLOPS of our model compared to AnatomyNet, 3D-UNet, and DeepMedic.

References

[1] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[2] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.

[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[5] Sook-Lei Liew, Julia M Anglin, Nick W Banks, Matt Sondag, Kaori L Ito, Hosung Kim, Jennifer Chan, Joyce Ito, Connie Jung, Nima Khoshab, et al. A large, open source dataset of stroke anatomical brain images and manual lesion segmentations. Scientific Data, 5:180011, 2018.

[6] Konstantinos Kamnitsas, Christian Ledig, Virginia FJ Newcombe, Joanna P Simpson, Andrew D Kane, David K Menon, Daniel Rueckert, and Ben Glocker. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis, 36:61–78, 2017.

[7] Wentao Zhu, Yufang Huang, Liang Zeng, Xuming Chen, Yong Liu, Zhen Qian, Nan Du, Wei Fan, and Xiaohui Xie. AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy. Medical Physics, 46(2):576–589, 2019.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[9] Alex P Zijdenbos, Benoit M Dawant, Richard A Margolin, and Andrew C Palmer. Morphometric analysis of white matter lesions in MR images: method and validation. IEEE Transactions on Medical Imaging, 13(4):716–724, 1994.

[10] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.

[11] Thomas G Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.

[12] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.

[13] Nathalie Japkowicz and Mohak Shah. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, 2011.