Comparing Deep Learning strategies for paired but unregistered multimodal segmentation of the liver in T1 and T2-weighted MRI
Vincent Couteaux, Mathilde Trintignac, Olivier Nempont, Guillaume Pizaine, Anna Sesilia Vlachomitrou, Pierre-Jean Valette, Laurent Milot, Isabelle Bloch
LTCI, Télécom Paris, Institut Polytechnique de Paris, France; Philips Research Paris, Suresnes, France; Hospices Civils de Lyon, Lyon, France
ABSTRACT
We address the problem of multimodal liver segmentation in paired but unregistered T1 and T2-weighted MR images. We compare several strategies described in the literature, with or without multi-task training, with or without pre-registration. We also compare different loss functions (cross-entropy, Dice loss, and three adversarial losses). All methods achieved comparable performance, with the exception of a multi-task setting that performs both segmentations at once, which performed poorly.
Index Terms — Segmentation, U-net, Multimodal imaging, MRI, Liver
1. INTRODUCTION
Automatic liver segmentation tools are used in a clinical context to ease the interpretation of medical images, as well as for quantitative measurements. For instance, they are used for volumetry or as a preliminary step for automated assessment of hepatic fat fraction or liver tumor burden. Such analyses can be done with multiple modalities, among which T1 and T2-weighted MRIs. T1-weighted MRIs are anatomical images, and thus show a well-contrasted and easy to segment liver. Conversely, T2-weighted images have a lower inter-slice resolution. They are more prone to acquisition artifacts, and the liver is harder to distinguish from its surroundings (see Figure 1). Therefore, accurate manual segmentation of the liver is more tedious in T2 images. T1 and T2-weighted images are acquired a few minutes apart, which induces misalignments between the two images (mainly due to breathing).

In this work, we propose a comparison of different strategies described in the literature to perform accurate automatic segmentation of the liver in pairs of T1 and T2-weighted MRI images. In the recent literature, most proposed segmentation methods are built upon the U-Net architecture [1], either by integrating shape priors [2], modifying the architecture [3], or adapting to a particular problem, such as joint lesion segmentation [4], unsupervised domain adaptation [5], or integration of manual corrections [6]. We therefore chose this setting as a basis for our comparison.

Our experiments revolve around two axes. The first axis is the multi-input and multi-output strategy: it is shown in [7] that it is possible to train efficient unspecialized segmentation networks when no paired data are available, and many works showed the usefulness of dual-input segmentation for registered paired images [3, 8]. We investigate to what extent these strategies are applicable to our particular problem and data set, and how they compare. The goal is to answer the following questions: Does providing both images of the pair as input to the network increase performance? If so, does segmenting both images in one forward pass, in a multi-task way, enable better results? Can we improve the results by applying a non-linear registration algorithm to the images before forwarding them to the network? Our hypothesis is that, by providing anatomical information through the T1 image, a segmentation network may better succeed in segmenting the T2 image, even with misalignments.

The second axis is the objective function we optimize during training. The importance of this parameter in segmentation has been demonstrated in [9, 10], and the benefit of learned segmentation losses by adversarial training for certain applications in [11, 12, 13]. We compare three standard voxel-to-voxel loss functions, and three other functions taking advantage of adversarial training. For this axis, the hypothesis is that adversarial losses, by learning the distribution of liver masks, may lead to more robust predictions and increased performance overall.

Fig. 1. Multisequence MRI pair example. Left: T1-weighted image. Right: T2-weighted image.
2. DATA
We compare the different approaches on a database of 88 pairs of T1-weighted and T2-weighted MRIs centered on the liver, coming from 51 patients who all have hepatic lesions. The T1 images are acquired at a portal enhancement time. We resample every image to a resolution of 3 mm along the z axis, and 1.5 mm along the x and y axes. Images in each pair are aligned using their DICOM metadata, but breathing can induce large movements of the liver (on the order of centimeters). Reference segmentation masks are obtained through manual annotations by a radiologist, using interactive 3D tools. Note that, due to the low contrast of the liver in T2 images, as well as the lower resolution along the z axis, manual annotation of the liver in T2 images is difficult and less accurate than in T1 images. We split the dataset by keeping 12 pairs for testing, and 6 pairs for validation.
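For concreteness, here is a minimal sketch of this kind of resampling step, assuming SimpleITK; the function name and interpolation choice are illustrative, not necessarily the authors' exact pipeline:

```python
import SimpleITK as sitk

def resample_to_spacing(image, new_spacing=(1.5, 1.5, 3.0)):
    """Resample a volume to a fixed voxel spacing (x, y, z in mm),
    keeping origin and orientation; linear interpolation."""
    old_spacing = image.GetSpacing()
    old_size = image.GetSize()
    # new grid size so that the physical extent of the image is preserved
    new_size = [int(round(sz * sp / nsp))
                for sz, sp, nsp in zip(old_size, old_spacing, new_spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(),
                         sitk.sitkLinear, image.GetOrigin(), new_spacing,
                         image.GetDirection(), 0.0, image.GetPixelID())
```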
Fig. 2. Inference of a pair of images for different input/output strategies. Left to right: single input; double-input, single-output; double-input, double-output; double-input, single-output pre-registered. For the specialized setting, I1 is the T1 image of the pair and I2 the T2 image. For the unspecialized setting, the T1 and T2 images are randomly swapped during training, so that the networks do not expect a particular modality in each channel.
3. METHODS

3.1. Architecture and training
In order to focus on the strategies and the objective functions, and for a fair comparison, we keep the following architecture and optimization parameters fixed. We use the 3D U-Net architecture [1, 14], pre-trained with weights provided in [15], as this architecture is now standard for medical image segmentation [16]. We use the Adam optimizer at a fixed learning rate with early stopping, keeping the best network on the validation dataset, for 900 epochs of 150 steps. For memory reasons, we use batches of size 1, and crop inputs into cuboids of random sizes and ratios. We use a random intensity shift as a data augmentation strategy.
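As an illustration, a minimal NumPy sketch of the cropping and intensity augmentation described above; the size fraction and shift range are illustrative assumptions, not values from the paper:

```python
import numpy as np

def augment(volume, mask, min_frac=0.5, shift_frac=0.1):
    """Random cuboid crop of random size/ratio, plus a random intensity shift."""
    d, h, w = volume.shape
    # random crop size per axis (between min_frac and 100% of each dimension)
    cd, ch, cw = (np.random.randint(int(min_frac * s), s + 1) for s in (d, h, w))
    z = np.random.randint(0, d - cd + 1)
    y = np.random.randint(0, h - ch + 1)
    x = np.random.randint(0, w - cw + 1)
    vol = volume[z:z + cd, y:y + ch, x:x + cw]
    msk = mask[z:z + cd, y:y + ch, x:x + cw]
    # additive intensity shift, scaled by the cropped intensity range
    shift = np.random.uniform(-shift_frac, shift_frac) * (vol.max() - vol.min())
    return vol + shift, msk
```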
3.2. Input/output strategies

We evaluate the benefit of multi-modal inputs by comparing different input/output settings (see Figure 2; a tensor-assembly sketch follows the list below). Each setting has two versions, which we refer to as specialized or unspecialized and which we describe below.

Single input: The network has only one channel for input and output. If specialized, two networks are trained, one for each modality. If unspecialized, only one network is trained, taking as input indifferently a T1 or T2 image.
Double input, single output: The network has two channels as input, and one as output. We train it to segment the image in the first channel, the second channel receiving the auxiliary modality. If specialized, two networks are needed, one segmenting T1 images with T2 as auxiliary modality, and vice versa. If unspecialized, only one network is needed, segmenting indifferently T1 or T2 images.
Double input, double output: A single network is trained, where both the input and output layers have two channels, one for each image of the pair, so that both predictions are multi-tasked. When specialized, the first channel is always the T1 image and the second channel always the T2 image, whereas we randomly swap the channels during training when unspecialized, so that the network does not expect a particular modality in each channel.
Pre-registered: To study how registering the images before the segmentation may be beneficial, we performed a non-linear registration of the T1 image onto the T2 image of each pair and trained a double input, single output network segmenting T2 images with T1 as auxiliary modality. As the T2 image is harder to segment, we can expect that, by also providing the aligned T1 image to the network, it can use all relevant information to make a more accurate segmentation.
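The sketch announced above shows how the input tensor might be assembled for each setting, assuming PyTorch and volumes already loaded as (D, H, W) tensors; the function and flag names are illustrative:

```python
import torch

def make_input(t1, t2, strategy, specialized=True, target="t2"):
    """Build the (1, C, D, H, W) input tensor for one image pair."""
    if strategy == "single_input":
        x = (t1 if target == "t1" else t2).unsqueeze(0)
    elif strategy == "double_in_single_out":
        # channel 0 holds the image to segment, channel 1 the auxiliary modality
        main, aux = (t1, t2) if target == "t1" else (t2, t1)
        x = torch.stack([main, aux])
    elif strategy == "double_in_double_out":
        x = torch.stack([t1, t2])  # specialized: fixed T1/T2 channel order
        if not specialized and torch.rand(1).item() < 0.5:
            x = x.flip(0)  # random swap so no channel is tied to a modality
    else:
        raise ValueError(strategy)
    return x.unsqueeze(0)  # add the batch dimension (batch size 1)
```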
3.3. Loss functions

In the following, we consider a dataset of images $\{x_n\} \subset X$ and annotations $\{y_n\} \subset Y$, and a segmentor $S : X \to Y$. To evaluate the influence of the loss function on performance, we test six of them on the single-input, unspecialized strategy. Three loss functions are standard voxel-to-voxel functions:

Binary cross-entropy: $\mathcal{L}_{bce}(x, y) = -\sum y \log(S(x))$, where the sum is taken over all image voxels, as proposed in [1].

Dice: $\mathcal{L}_{Dice}(x, y) = -2 \sum y S(x) \, / \, (\sum y + \sum S(x))$, as proposed in [10]. The normalization enables a better performance for important class imbalance between foreground and background (i.e., when the objects to segment are small) compared to $\mathcal{L}_{bce}$.

Cross-entropy + Dice: $\mathcal{L}_{sum} = \mathcal{L}_{bce} + \mathcal{L}_{Dice}$.
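A minimal PyTorch sketch consistent with the formulas above, assuming the network output $S(x)$ has already been passed through a sigmoid; the epsilon is an illustrative numerical guard:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """L_Dice = -2 * sum(y * S(x)) / (sum(y) + sum(S(x))), summed over voxels."""
    inter = (pred * target).sum()
    return -(2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_dice_loss(pred, target):
    """L_sum = L_bce + L_Dice."""
    return F.binary_cross_entropy(pred, target) + dice_loss(pred, target)
```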
Three loss functions are adversarial losses: we simultaneously train a discriminator network $D : Y \to [0, 1]$ to recognize reference masks. The idea is that, by learning the distribution of liver masks, we can get models that are robust to improbable predictions (with holes or odd shapes for instance). This technology has gained popularity in segmentation [17, 18], and we refer to the related works section in [11] for a good review of adversarial learning in segmentation.

The vanilla GAN loss, as in [12]: $\mathcal{L}_{GAN}(x, y) = \mathcal{L}_{bce}(S(x), y) + \mathcal{L}_{bce}(D(S(x)), 1)$.

The embedding loss, proposed in [13]: $\mathcal{L}_{el}(x, y) = \mathcal{L}_{bce}(S(x), y) + \| D_k(S(x)) - D_k(y) \|^2$, where $D_k$ represents the $k$-th layer of network $D$. The goal is to gain in stability during training.

The gambling loss, described in [11]: instead of the discriminator $D$, we use a gambler $G : X \times Y \to Y$, which takes as input an image and a segmentation, and outputs a normalized betting map. It is trained to minimize $\mathcal{L}_G(x, y) = -\sum G(x, S(x)) \, y \log(S(x))$ while the segmentor minimizes $\mathcal{L}_S = \mathcal{L}_{bce} - \mathcal{L}_G$. The goal is to train the gambler network to recognize the hard parts of the image to segment, so that the segmentor focuses on them.

We start the training with a pre-trained segmentor, with the single-input unspecialized strategy (see Section 3.2), as we found that it performed as well as any other (see Section 4).
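To make the alternating optimization concrete, here is a sketch of one training step with the vanilla GAN loss; the shapes, separate optimizers, and absence of loss weighting are assumptions, not the paper's exact training loop:

```python
import torch
import torch.nn.functional as F

def gan_step(seg, disc, opt_s, opt_d, x, y):
    """One alternating update: discriminator D, then segmentor S."""
    # 1) train D to output 1 on reference masks and 0 on predictions
    pred = seg(x).detach()
    d_real, d_fake = disc(y), disc(pred)
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) train S with L_bce(S(x), y) + L_bce(D(S(x)), 1)
    pred = seg(x)
    d_out = disc(pred)
    s_loss = (F.binary_cross_entropy(pred, y)
              + F.binary_cross_entropy(d_out, torch.ones_like(d_out)))
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()
    return d_loss.item(), s_loss.item()
```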
4. RESULTS, DISCUSSION AND CONCLUSIONS

We report the performance of all approaches using the Dice coefficient. The Hausdorff distance and its 95th percentile resulted in a similar ranking of the approaches. To help with the interpretation of the results, we perform a 4-fold cross-validation on the entire database using the single-input unspecialized strategy, which we repeat 3 times. On average, the Dice scores of the different runs on each fold differ by a small margin; any Dice score difference below that margin will thus not be considered meaningful for comparing strategies.

Table 1 shows a comparison of performance across training strategies, recorded on the test dataset. We note a slightly worse performance for T2 images in specialized strategies compared to their unspecialized counterparts, while the difference in Dice scores for T1 images is limited. Multi-output networks performed significantly worse than multi-input ones. As shown in [19], multi-task learning is not trivial to work with, and the naive approach we tried showed its limits; it may be especially tricky in 3D segmentation, where increasing the capacity of a network is costly in terms of memory.

Let us now assess how much a network relies on the auxiliary modality in double-input settings. To this end, we evaluate the performance of a network when segmenting an image with its matching auxiliary image, and compare it to its performance when given mismatched pairs. We measure this performance gap with the difference of Dice scores, and compile the results in Table 3 (a sketch of this protocol follows the tables below). We record a more important gap for T2 images, which is consistent with the idea that, as this modality is harder to segment, the network learns to fetch information in the T1 modality. This gap is even more important when images are pre-registered, which is also expected as the information is easier to retrieve when images are aligned. Multi-output settings showed the most important need for the auxiliary modality. Contrary to our hypothesis according to which adding T1 information should improve the T2 predictions, comparing Table 1 with Table 3 seems to link an increased usage of the auxiliary channel with bad performance. One explanation may be that the network relies on irrelevant pieces of additional information. Overall, we found that T2 images contain enough information to accurately segment the liver, and the gap in performance between T1 and T2 images can be explained by a difference in annotation quality.

Table 4 shows the performance of networks trained with different loss functions. We found no clear effect of this parameter on performance. Adversarial training did not outperform the other loss functions either, despite an expected behavior during adversarial training and good discriminator performance. This experiment corroborates the idea expressed in [16, 20] that data variety and annotation quality are prominent, and that the gains induced by such innovations are sometimes hard to replicate on custom datasets. Our cross-validation experiment showed high score differences across splits, and showed that the split we chose for comparing strategies was particularly difficult (one cross-validation split achieves a markedly higher average Dice on both T1 and T2). This also corroborates the idea of the prominence of data on performance.

Figure 3 shows some examples of predictions of the single-input, unspecialized method.
We can see that the predictions remain accurate even in the presence of a large tumor burden (leftmost column) or big lesions near the edge of the liver (middle right column). The rightmost column shows a case where the patient has undergone a hepatectomy. As it is the only case with such an atypical anatomy in our database, it is particularly challenging. Despite this difficulty, the network managed to make accurate predictions, especially on the T1 image.

Fig. 3. A few examples from the test database. Top: T1 image, bottom: T2 image. Red: manual annotation, green: prediction from the single-input, unspecialized network.

Table 1. Mean Dice scores vs. training strategy.

                   Unspec.         Spec.
Strategy          T1      T2      T1      T2
1 input         0.961   0.938   0.959   0.929
2 in, 1 out     0.955   0.933   0.954   0.929
2 in, 2 out     0.938   0.907   0.942   0.897
1 out, prereg.    -       -       -     0.925
Table 3. Dice reduction when pairs are mismatched vs. strategy.

                   Unspec.         Spec.
Strategy          T1      T2      T1      T2
2 in, 1 out       <

Table 4. Mean Dice scores vs. loss function.

Loss                    T1      T2
Binary cross-entropy  0.961   0.938
Dice loss             0.959   0.931
Dice + BCE            0.959   0.932
Vanilla GAN           0.950   0.930
Embedding             0.960   0.935
Gambling              0.959   0.930
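As referenced above, here is a sketch of the mismatch protocol behind Table 3; `segment` and `dice` are illustrative placeholders for the double-input network and the Dice metric:

```python
def mismatch_gap(segment, dice, pairs):
    """Mean Dice drop when the auxiliary channel comes from another patient.
    pairs: list of (main_image, aux_image, reference_mask) triplets."""
    matched = [dice(segment(main, aux), ref) for main, aux, ref in pairs]
    # rotate the auxiliary images by one so every pair is mismatched
    aux_rot = [aux for _, aux, _ in pairs]
    aux_rot = aux_rot[1:] + aux_rot[:1]
    mismatched = [dice(segment(main, aux), ref)
                  for (main, _, ref), aux in zip(pairs, aux_rot)]
    return sum(matched) / len(matched) - sum(mismatched) / len(mismatched)
```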
5. REFERENCES

[1] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[2] Qi Zeng, Davood Karimi, Emily HT Pang, Shahed Mohammed, Caitlin Schneider, Mohammad Honarvar, and Septimiu E Salcudean, "Liver segmentation in magnetic resonance imaging via mean shape fitting with fully convolutional neural networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 246–254.

[3] Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, and Pheng-Ann Heng, "H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes," IEEE Transactions on Medical Imaging, vol. 37, no. 12, pp. 2663–2674, 2018.

[4] Eugene Vorontsov, An Tang, Chris Pal, and Samuel Kadoury, "Liver lesion segmentation informed by joint liver segmentation," in IEEE 15th International Symposium on Biomedical Imaging (ISBI). IEEE, 2018, pp. 1332–1335.

[5] Junlin Yang, Nicha C Dvornek, Fan Zhang, Julius Chapiro, MingDe Lin, and James S Duncan, "Unsupervised domain adaptation via disentangled representations: Application to cross-modality liver segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 255–263.

[6] Grzegorz Chlebus, Hans Meine, Smita Thoduka, Nasreddin Abolmaali, Bram van Ginneken, Horst Karl Hahn, and Andrea Schenk, "Reducing inter-observer variability and interaction time of MR liver volumetry by combining automatic CNN-based liver segmentation and manual corrections," PLoS One, vol. 14, no. 5, pp. e0217228, 2019.

[7] Kang Wang, Adrija Mamidipalli, Tara Retson, Naeim Bahrami, Kyle Hasenstab, Kevin Blansit, Emily Bass, Timoteo Delgado, Guilherme Cunha, Michael S Middleton, et al., "Automated CT and MRI liver segmentation and biometry using a generalized convolutional neural network," Radiology: Artificial Intelligence, vol. 1, no. 2, pp. 180022, 2019.

[8] Tongxue Zhou, Su Ruan, and Stéphane Canu, "A review: Deep learning for medical image segmentation using multi-modality fusion," Array, vol. 3-4, pp. 100004, 2019.

[9] Shruti Jadon, "A survey of loss functions for semantic segmentation," arXiv:2006.14822, 2020.

[10] Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso, "Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 240–248. Springer, 2017.

[11] Laurens Samson, Nanne van Noord, Olaf Booij, Michael Hofmann, Efstratios Gavves, and Mohsen Ghafoorian, "I bet you are wrong: Gambling adversarial networks for structured semantic segmentation," in IEEE International Conference on Computer Vision Workshops, 2019.

[12] Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek, "Semantic segmentation using adversarial networks," arXiv:1611.08408, 2016.

[13] Mohsen Ghafoorian, Cedric Nugteren, Nóra Baka, Olaf Booij, and Michael Hofmann, "EL-GAN: Embedding loss driven generative adversarial networks for lane detection," in European Conference on Computer Vision (ECCV). Springer, 2018, pp. 256–272.

[14] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi, "V-net: Fully convolutional neural networks for volumetric medical image segmentation," in Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565–571.

[15] Zongwei Zhou, Vatsal Sodha, Md Mahfuzur Rahman Siddiquee, Ruibin Feng, Nima Tajbakhsh, Michael B Gotway, and Jianming Liang, "Models Genesis: Generic autodidactic models for 3D medical image analysis," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 384–393.

[16] Fabian Isensee, Jens Petersen, André Klein, David Zimmerer, Paul F. Jaeger, Simon Kohl, Jakob Wasserthal, Gregor Koehler, Tobias Norajitra, Sebastian J. Wirkert, and Klaus Maier-Hein, "nnU-Net: Self-adapting framework for U-Net-based medical image segmentation," arXiv:1809.10486, 2018.

[17] Wei-Chih Hung, Yi-Hsuan Tsai, Yan-Ting Liou, Yen-Yu Lin, and Ming-Hsuan Yang, "Adversarial learning for semi-supervised semantic segmentation," in BMVC, 2018.

[18] Yizhe Zhang, Lin Yang, Jianxu Chen, Maridel Fredericksen, David P Hughes, and Danny Z Chen, "Deep adversarial networks for biomedical image segmentation utilizing unannotated images," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 408–416.

[19] Sen Wu, Hongyang R. Zhang, and Christopher Ré, "Understanding and improving information transfer in multi-task learning," in International Conference on Learning Representations, 2020.

[20] Johannes Hofmanninger, Florian Prayer, Jeanny Pan, Sebastian Röhrich, Helmut Prosch, and Georg Langs, "Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem," European Radiology Experimental, vol. 4, no. 1, pp. 1–13, 2020.
6. ACKNOWLEDGMENTS
This work has been partially funded by a grant from the Association Nationale de la Recherche et de la Technologie (ANRT).
7. CONFLICT OF INTEREST
The authors have no relevant financial or non-financial interests to disclose.

8. COMPLIANCE WITH ETHICAL STANDARDS