A Unified Conditional Disentanglement Framework for Multimodal Brain MR Image Translation
Xiaofeng Liu, Fangxu Xing, Georges El Fakhri, Jonghye Woo
Dept. of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
ABSTRACT
Multimodal MRI provides complementary and clinically relevant information to probe tissue condition and to characterize various diseases. However, it is often difficult to acquire sufficiently many modalities from the same subject due to limitations in study plans, while quantitative analysis is still demanded. In this work, we propose a unified conditional disentanglement framework to synthesize any arbitrary modality from an input modality. Our framework hinges on a cycle-constrained conditional adversarial training approach, where it can extract a modality-invariant anatomical feature with a modality-agnostic encoder and generate a target modality with a conditioned decoder. We validate our framework on four MRI modalities, including T1-weighted, T1 contrast enhanced, T2-weighted, and FLAIR MRI, from the BraTS'18 database, showing superior performance on synthesis quality over the comparison methods. In addition, we report results from experiments on a tumor segmentation task carried out with synthesized data.
Index Terms — Image synthesis, Generative Adversarial Networks, Deep learning, Brain tumor
1. INTRODUCTION
Multimodal magnetic resonance (MR) images are often required to provide complementary information for clinical diagnosis and scientific studies [1, 2]. For example, multimodal MR imaging (MRI) with T1-weighted, T1ce (contrast enhanced), T2-weighted, and FLAIR (FLuid-Attenuated Inversion Recovery) MRI can offer greater sensitivity to tumor heterogeneity and growth patterns than a single modality, such as T1ce MRI, thereby benefiting diagnosis, staging, and monitoring of brain metastasis [3]. However, in practice, it is often difficult to acquire sufficiently many modalities due to limitations in study plans, and some modalities could be missing due to imaging artifacts [4, 5].

In recent years, cross-modality synthesis of brain MR images using generative adversarial networks (GANs) has gained popularity [1]. For example, Yu et al. [5] adopted a pair-wise image-to-image network via Pix2Pix [6, 2] for transferring T1-weighted to either T2-weighted or FLAIR MRI. Also, a cycle-reconstruction approach via CycleGAN for unpaired image translation [7] was introduced in [4, 8] to stabilize the training. These methods [1, 5, 4, 8, 2] aimed at modeling the mapping between two specific modalities, requiring two inverse autoencoders to achieve the cycle-reconstruction [7].

However, the aforementioned approaches have a limitation in that they address a cross-modality synthesis problem that is not easily scalable to multiple modalities (i.e., more than two). In other words, in order to learn all mappings among M modalities, M(M − 1) different generators have to be trained and deployed (e.g., 12 possible cross-modality networks for T1-weighted, T1ce, T2-weighted, and FLAIR MRI) [9]. Moreover, each translator cannot fully use the entire training data, but can only learn from a subset of the data (two out of M modalities). Failure to fully use the whole training set is likely to limit the quality of the generated images. To address this, Xin et al. [10] recently proposed a 1-to-3 network to translate T2-weighted to T1-weighted/T1ce/FLAIR MRI based on Pix2Pix [6]. The improvement over Pix2Pix was achieved by utilizing 3× training pairs for one translator [10]. Besides, the closely related multiple tasks mutually reinforced each other [11]. Yet, with the 1-to-3 network, the number of models to be trained was still tied to the number of modalities.

In this paper, we propose to achieve all of the pair-wise translations using a single set of conditional autoencoder and discriminator. Our framework is scalable to many modalities and can effectively use all possible paired cross-modality training data. Several unpaired multi-domain synthesis methods are inherently multimodal translation, but they usually require multiple domain-specific autoencoders [12] and discriminators [13], and do not consider pair-wise training data [12, 13, 14, 11]. Without pair-wise supervision, the largely unconstrained image generation tends to alter important characteristics of the input modality in order to generate diverse outputs. Unlike image-to-image translation in computer vision, in the medical domain it is of paramount importance not to introduce unnecessary changes and artifacts, so that quantitative analyses remain valid [15].

In addition, these methods ignore the inherent connection between different MR modalities [16]. Since multiple MR modalities are acquired with different scan parameters for the same subject, there should be a shared modality-invariant anatomical feature space [16].
Accordingly, we propose to configure a pair-wise disentanglement approach [17, 18] to extract an anatomical feature with a modality-agnostic encoder, and then inject a modality-specific appearance with a conditional decoder.

Specifically, our unified multimodal translation framework hinges on an autoencoder whose decoder is conditioned on the target modality to utilize the paired training data. After the feedforward pass of the autoencoder conditioned on any modality label, the same autoencoder is called again, this time conditioned on the modality label of the original input, to realize the cycle-reconstruction. The disentanglement of anatomical information is simply enforced by the similarity of the encoder output feature maps [18, 17]. We empirically validate the effectiveness of our framework on the BraTS'18 database, showing superior performance over the comparison methods.
2. METHODOLOGY
Given a set of M co-registered MR modalities, a sample x with modality m_x has M − 1 pixel-wise aligned samples with the other modalities. The target modality of image synthesis and the corresponding ground-truth sample are denoted as m_y and x_y, respectively. Given the pair of input sample and target modality {x, m_y}, we propose to learn a parameterized mapping f : {x, m_y} → x̃_y from {x, m_y} to the generated sample with modality m_y, which should closely resemble x_y. Here, m_y denotes a four-dimensional one-hot vector representing the four MR modalities available in the BraTS'18 database. Of note, m_y = m_x indicates self-reconstruction. The proposed framework is shown in Fig. 1.

A straightforward baseline network structure for paired two-modality translation is an autoencoder, constructed with an encoder Enc and a decoder Dec. Briefly, it first maps x to a latent feature z via Enc, and then decodes z to reconstruct the target image via Dec. The target ground-truth image serves as a strong supervision signal, whereas unpaired translation cannot benefit from such regularization [7, 19, 12, 11, 13].

However, the autoencoder has a limitation in that the generated images are likely to be blurry [2], which is partly caused by element-wise criteria such as the L1 or L2 loss [20]. Although recent studies [21] have substantially improved the predicted log-likelihood of the autoencoder, its image generation quality is still inferior to that of GANs. In addition, enforcing pixel-wise similarity is likely to distract the autoencoder from understanding the underlying anatomical structure when the input data are slightly misregistered.
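To make the baseline concrete, the following is a minimal PyTorch sketch of such an encoder-decoder pair trained with an element-wise L1 loss; the layer widths, depths, and tensor shapes are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class Enc(nn.Module):
    """Encoder: maps an input slice x to a latent feature map z."""
    def __init__(self, in_ch=1, base=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class Dec(nn.Module):
    """Decoder: maps the latent feature map back to image space."""
    def __init__(self, out_ch=1, base=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# Paired baseline: reconstruct the co-registered target-modality image with an L1 loss.
enc, dec = Enc(), Dec()
x = torch.randn(8, 1, 240, 240)    # input-modality slices
x_y = torch.randn(8, 1, 240, 240)  # co-registered target-modality ground truth
loss_l1 = (dec(enc(x)) - x_y).abs().mean()
```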
Fig. 1. Illustration of the proposed unified conditional adversarial framework for multimodal co-registered brain MR image translation. Note that only one Enc-Dec set is recalled twice, and only the gray-masked subnets are used for testing.

In order to enforce high-level semantic similarity and to improve the quality of the generated textures, recent cross-modality translation models [4, 5, 10] adopted an additional adversarial loss L_adv with a discriminator Dis following Pix2Pix [6, 2], where the training objectives consist of both the L1 loss and the adversarial GAN loss:

$\min_{Enc,Dec} \mathcal{L}_1(\tilde{x}_y, x_y) = |\tilde{x}_y - x_y|$,  (1)

$\min_{Enc,Dec}\max_{Dis} \mathcal{L}_{adv} = \mathbb{E}_{x_y}[\log Dis(x_y)] + \mathbb{E}_{\tilde{x}_y}[\log(1 - Dis(\tilde{x}_y))]$.  (2)

To extend the Pix2Pix basenet to multimodal translation, we adopt a conditional decoder structure that takes both the feature map extracted by the encoder, Enc(x), and the target modality code m_y as input. The target modality code is spatially replicated and concatenated with the decoder input. Of note, the unpaired multi-domain synthesis networks take the modality code as input to the encoder [12, 13, 14, 11]; therefore, they cannot achieve the disentanglement of modality-agnostic anatomical and modality-specific factors [16, 17, 18]. With this design, a single autoencoder model can be switched among all possible pair-wise cross-modality translations. Therefore, x̃_y in Eqs. (1)-(2) becomes Dec(Enc(x), m_y).
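The conditioning described above can be implemented, for example, by broadcasting the one-hot code over the spatial dimensions of the encoder feature map and concatenating along the channel axis. The sketch below is one such wiring (all names and channel sizes are assumptions), not the authors' exact design.

```python
import torch
import torch.nn as nn

def condition(feat, m_y):
    """Spatially replicate a one-hot modality code and concatenate it with a
    feature map along the channel dimension (one possible wiring)."""
    b, _, h, w = feat.shape
    code = m_y.view(b, -1, 1, 1).expand(-1, -1, h, w)  # B x M x H x W
    return torch.cat([feat, code], dim=1)

class CondDec(nn.Module):
    """Decoder conditioned on the target modality code m_y (sizes are assumptions)."""
    def __init__(self, feat_ch=64, n_mod=4, out_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(feat_ch + n_mod, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, feat, m_y):
        return self.net(condition(feat, m_y))

# x_tilde_y = Dec(Enc(x), m_y): the same weights serve every target modality.
m_y = torch.eye(4)[torch.tensor([1, 3])]  # one-hot target codes, e.g. T1ce and FLAIR
feat = torch.randn(2, 64, 60, 60)         # Enc(x) feature maps for 240 x 240 slices
x_tilde_y = CondDec()(feat, m_y)          # 2 x 1 x 240 x 240
```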
Instead of configuring M discriminators, one for each modality, we introduce an auxiliary modality classifier Dis_mc [22] on top of Dis, which allows a single Dis to handle all modalities. The modality classification loss to be minimized can be formulated as:

$\mathcal{L}_{mc} = \mathbb{E}_{\tilde{x}_y}[-\log Dis_{mc}(m_y \mid \tilde{x}_y)] + \mathbb{E}_{x}[-\log Dis_{mc}(m_x \mid x)]$.  (3)
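A possible instantiation of the multi-task discriminator is a shared convolutional backbone with two heads, one for the real/fake decision (Dis) and one for the modality classification (Dis_mc), trained with a cross-entropy version of Eq. (3). The sketch below is an assumption-laden illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskDis(nn.Module):
    """Single discriminator with a real/fake head (Dis) and an auxiliary
    modality-classification head (Dis_mc); backbone size is illustrative."""
    def __init__(self, in_ch=1, n_mod=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.adv_head = nn.Linear(64, 1)     # real/fake logit
        self.mc_head = nn.Linear(64, n_mod)  # modality logits

    def forward(self, img):
        h = self.backbone(img)
        return self.adv_head(h), self.mc_head(h)

# L_mc (Eq. 3) as cross-entropy over both generated and real images.
dis = MultiTaskDis()
x = torch.randn(2, 1, 240, 240)          # real slices of modality m_x
x_tilde_y = torch.randn(2, 1, 240, 240)  # generated slices of modality m_y
m_x_idx = torch.tensor([0, 2])           # integer modality labels
m_y_idx = torch.tensor([1, 3])
_, mc_fake = dis(x_tilde_y)
_, mc_real = dis(x)
loss_mc = F.cross_entropy(mc_fake, m_y_idx) + F.cross_entropy(mc_real, m_x_idx)
```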
The objective of the conditional GAN with the multi-task discriminator can induce an output distribution over (x̃_y | m_y) that matches the empirical distribution of real images with modality m_y, i.e., x_y. However, the mapping between the two distributions can be largely unconstrained, and many possible translations f induce the same distribution over f({x, m_y}) [7]. Therefore, the learned f cannot guarantee that the individual inputs {x, m_y} and outputs x̃_y are paired up as expected. To mitigate this, CycleGAN [7] introduces an additional cycle-reconstruction constraint for unpaired two-domain image translation. Specifically, the generated output is mapped back to the original image with an inverse translator, and the L1 loss is explicitly used to measure the similarity between the mapped-back image and the original input. In this way, the shape structure can be better maintained. Note that both Pix2Pix [6] and CycleGAN [7] are two-domain translators, and two inverse autoencoders are needed to achieve the cycle-reconstruction [7].

Fig. 2. Illustration of the results of our proposed framework. We use the first row as input and configure four target modalities. The diagonal results are obtained using self-translation (i.e., m_y = m_x).

In addition, rather than configuring two autoencoders with inverse directions [7], we can simply recall the same autoencoder twice with a different conditional modality code in a feedforward pass. Specifically, the second translation always uses the modality of the original input sample, m_x, to achieve the reconstruction of x:

$\min_{Enc,Dec} \mathcal{L}_1(\tilde{x}_x, x) = |\tilde{x}_x - x|$,  (4)

where x̃_x = Dec(Enc(x̃_y), m_x) is expected to reconstruct x.

To achieve the disentanglement of anatomical information and modality-specific factors without an anatomical label, we propose to enforce the similarity between Enc(x) and Enc(x̃_y), the two encoder outputs within a cycle:

$\min_{Enc} \mathcal{L}_{disen} = |Enc(\tilde{x}_y) - Enc(x)|$,  (5)

which explicitly requires the paired co-registered two-modality images to have the same feature map. Their shared factors correspond to the anatomical information, while their differences arise from the modality-specific imaging parameters [16].

Fig. 3. Comparison of different methods for the T1-weighted and T2-weighted MR translation. GT indicates the ground truth x_y.

The feature is also required to be combined with the target modality label to reconstruct the target image, which encourages the two to be independent of and complementary to each other [17]. Therefore, the latent feature space can be anatomically related and modality-invariant (i.e., disentangled from the modality factor) [18, 17]. In addition, the feature-level similarity is related to the perceptual loss [23], which enhances the textures.
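Because the cycle is realized by calling the same conditional autoencoder twice, Eqs. (4) and (5) amount to two extra L1 terms in a single forward pass. The sketch below reuses the hypothetical Enc and CondDec modules defined above; names and shapes remain assumptions.

```python
import torch

# Reuses the hypothetical Enc and CondDec modules sketched above; Enc is assumed to
# output a 64-channel feature map so that the shapes line up.
enc, dec = Enc(), CondDec()
x = torch.randn(2, 1, 240, 240)
m_x = torch.eye(4)[torch.tensor([0, 2])]  # one-hot code of the input modality
m_y = torch.eye(4)[torch.tensor([1, 3])]  # one-hot code of the target modality

feat = enc(x)                       # shared anatomical feature (ideally modality-invariant)
x_tilde_y = dec(feat, m_y)          # first pass: translate to the target modality
feat_cyc = enc(x_tilde_y)           # re-encode the generated image
x_tilde_x = dec(feat_cyc, m_x)      # second pass: same autoencoder, original modality code

loss_cyc = (x_tilde_x - x).abs().mean()      # Eq. (4): cycle-reconstruction L1
loss_disen = (feat_cyc - feat).abs().mean()  # Eq. (5): feature-level disentanglement L1
```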
For simpler implementation, we reformulate the min-max terms as minimization only, in a consistent manner. The objectives of our conditional autoencoder and the adversarial cycle-reconstruction streams can be formulated as:

$\min_{Enc,Dec} \mathcal{L}_1(\tilde{x}_y, x_y) + \alpha \mathcal{L}_1(\tilde{x}_x, x) + \beta \mathcal{L}_{adv} + \lambda_1 \mathcal{L}_{mc}$,  (6)

$\min_{Dis} -\mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{mc}$,  (7)

where α, β, λ_1, and λ_2 are the weighting hyperparameters. Enc and Dec minimize L_adv, while Dis minimizes −L_adv, so that the two sides play a round-based adversarial game to improve each other and reach a saddle point. In practice, we sample the same number of x from each modality during training [6].
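Putting the pieces together, one training iteration alternates a generator (Enc, Dec) step for Eq. (6) and a discriminator step for Eq. (7). The sketch below reuses the hypothetical Enc, CondDec, and MultiTaskDis classes from the previous snippets; the adversarial term is written in a binary cross-entropy form, and the loss weights are placeholder values, since the reported settings are given in Section 3.

```python
import torch
import torch.nn.functional as F

# A toy co-registered batch: x is the input modality, x_y the paired target modality.
x = torch.randn(4, 1, 240, 240)
x_y = torch.randn(4, 1, 240, 240)
m_x_idx, m_y_idx = torch.tensor([0, 1, 2, 3]), torch.tensor([1, 2, 3, 0])
m_x, m_y = torch.eye(4)[m_x_idx], torch.eye(4)[m_y_idx]
enc, dec, dis = Enc(), CondDec(), MultiTaskDis()  # hypothetical modules from the sketches above

# Hypothetical weights following Eqs. (6)-(7); the values below are placeholders.
alpha, beta, lam1, lam2 = 1.0, 0.1, 1.0, 1.0
opt_g = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), betas=(0.5, 0.999))
opt_d = torch.optim.Adam(dis.parameters(), betas=(0.5, 0.999))

def adv(logits, is_real):
    """One common instantiation of the adversarial term as binary cross-entropy."""
    target = torch.full_like(logits, float(is_real))
    return F.binary_cross_entropy_with_logits(logits, target)

# --- Generator (Enc, Dec) step, Eq. (6) ---
opt_g.zero_grad()
feat = enc(x)
x_tilde_y = dec(feat, m_y)
feat_cyc = enc(x_tilde_y)
x_tilde_x = dec(feat_cyc, m_x)
adv_logit, mc_logit = dis(x_tilde_y)
loss_g = ((x_tilde_y - x_y).abs().mean()               # paired L1 term, Eq. (1)
          + alpha * (x_tilde_x - x).abs().mean()       # cycle term, Eq. (4)
          + beta * adv(adv_logit, True)                # fool the discriminator
          + lam1 * F.cross_entropy(mc_logit, m_y_idx)  # modality classification, Eq. (3)
          + (feat_cyc - feat).abs().mean())            # disentanglement, Eq. (5); weight assumed 1
loss_g.backward()
opt_g.step()

# --- Discriminator step, Eq. (7) ---
opt_d.zero_grad()
adv_real, mc_real = dis(x_y)
adv_fake, _ = dis(x_tilde_y.detach())
loss_d = adv(adv_real, True) + adv(adv_fake, False) + lam2 * F.cross_entropy(mc_real, m_y_idx)
loss_d.backward()
opt_d.step()
```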
After training, we can obtain the translation functions by assembling a subset of the subnetworks, i.e., Enc and Dec. Note that our translation is agnostic to the input modality. Therefore, given an input sample x and a target modality m_y, we can generate the corresponding x̃_y = Dec(Enc(x), m_y). The images generated for the target modalities are then concatenated for the downstream tumor segmentation task [10, 8].
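At test time only Enc and the conditional Dec are kept. The usage sketch below (function and variable names are made up for illustration) synthesizes the complementary modalities of an input slice and stacks them as a multi-channel segmentation input.

```python
import torch

MODALITIES = ["T1", "T1ce", "T2", "FLAIR"]  # ordering of the one-hot code is an assumption

@torch.no_grad()
def complete_modalities(x, m_x_idx, enc, dec, n_mod=4):
    """Given a real slice batch x of modality m_x_idx, synthesize the other
    modalities and stack everything as a multi-channel segmentation input."""
    feat = enc(x)
    channels = []
    for m in range(n_mod):
        if m == m_x_idx:
            channels.append(x)  # keep the real slice
        else:
            code = torch.eye(n_mod)[m].repeat(x.size(0), 1)  # one-hot target code m_y
            channels.append(dec(feat, code))                 # x_tilde_y = Dec(Enc(x), m_y)
    return torch.cat(channels, dim=1)  # B x 4 x H x W input for the segmentor

# e.g., seg_input = complete_modalities(x, m_x_idx=2, enc=enc, dec=dec)  # real T2 slice
```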
3. RESULTS AND DISCUSSION
We evaluated our framework on the BraTS'18 multimodal brain tumor database [9], which contains a total of 285 subjects with four MRI modalities: T1-weighted, T1ce, T2-weighted, and FLAIR MRI, each with a volume size of 240 × 240 × 155. The intensity of each volume was normalized to [−1, 1] as in [10, 5], and the volumes were then processed by 2D networks. The axial slices with fewer than 2,000 pixels in the brain area were filtered out as in [10].
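For illustration, the slice selection and intensity scaling described above could be implemented as sketched below; the exact normalization is not fully specified here, so the [−1, 1] scaling and the nonzero-pixel brain criterion are assumptions.

```python
import numpy as np

def preprocess_volume(vol):
    """Scale a brain MR volume to roughly [-1, 1] (assumed normalization) and keep
    only axial slices with at least 2,000 brain (nonzero) pixels, as in [10]."""
    vol = vol.astype(np.float32)
    brain = vol > 0  # BraTS background voxels are zero
    vol = 2.0 * (vol - vol.min()) / (vol.max() - vol.min() + 1e-8) - 1.0
    slices = [vol[:, :, z] for z in range(vol.shape[2])  # volumes are 240 x 240 x 155
              if brain[:, :, z].sum() >= 2000]
    return np.stack(slices) if slices else np.empty((0, *vol.shape[:2]), np.float32)
```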
Table 1. Numerical comparisons of four methods in testing (mean L1 ↓, SSIM ↑, PSNR ↑, IS ↑): 12× Pix2Pix [6], 4× TCMGAN [10], the proposed framework without L_disen, and the proposed framework.

For a fair comparison, we followed [10] to use 100 subjects for training the translator, 85 subjects for testing, and 100 subjects for training the segmentor. We adopted the same backbone for Enc, Dec, and Dis in all comparisons [10].
In addition, the reimplemented Pix2Pix [6] was used as the two-modality transfer baseline model. In order to align the absolute value of each loss, we set the weights α = 1, β = 0., λ_1 = 1, and λ_2 = 1. We used the Adam optimizer with a batch size of 64 for 100 epochs of training. The learning rates were set to lr_{Enc,Dec} = 1e− and lr_{Dis} = 1e−, and the momentum was set to 0.5. Our framework was implemented using PyTorch. Training on an NVIDIA V100 GPU took about 8 hours. In practice, translating one test image with our unified Enc and Dec only took about 0.1 seconds.
In Fig. 2, we illustrate the multimodal generation results for the 12 cross-modality translation tasks and the 4 self-reconstruction tasks. The proposed framework successfully synthesized any modality by simply configuring the target modality code, and the outputs are consistent with the target ground-truth MR images. We note that the self-reconstruction results were not used for the subsequent segmentation, but only for checking the image generation quality in our implementation.

The qualitative comparisons with the 1-to-1 translator Pix2Pix [6] and the 1-to-3 translator TCMGAN [10] are shown in Fig. 3, along with our proposed framework. The proposed framework was able to generate visually pleasing results with better shape and structure consistency when visually assessed. From the red box in the first row, we can see that the tumor area was better maintained with the help of the cycle-constraint, compared with [10], which uses an additional tumor-consistency loss. Also, the artifact in the T1-weighted MR image (i.e., the stripes indicated by the blue circle) was propagated as similar stripes to the T2-weighted MR images synthesized by TCMGAN and Pix2Pix. Our disentangled encoder was able to eliminate the artifact and enforce a latent representation that follows the distribution of normal MR images.
The synthesized images are expected to have realistic-looking textures and to be structurally coherent with their corresponding ground-truth images x_y. For quantitative evaluation, we adopted the widely used metrics of mean L1 error, structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and inception score (IS).
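For reference, the mean L1 error, SSIM, and PSNR can be computed per slice with scikit-image as sketched below; the data_range value assumes intensities scaled to [−1, 1], and the inception score requires a separately pretrained classifier and is omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt, data_range=2.0):
    """Per-slice metrics between a synthesized slice and its ground truth;
    data_range=2.0 assumes intensities scaled to [-1, 1]."""
    l1 = float(np.abs(pred - gt).mean())
    ssim = structural_similarity(gt, pred, data_range=data_range)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=data_range)
    return l1, ssim, psnr
```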
Table 1 lists numerical comparisons between the proposed framework, Pix2Pix [6], and TCMGAN [10] for the 85 testing subjects of the BraTS'18 database. Of note, "Proposed −L_disen" indicates the proposed model without the disentanglement constraint L_disen. The proposed unified conditional disentanglement framework outperformed the other comparison methods w.r.t. these metrics, and the performance of the full framework was better than that of the framework without L_disen.

Table 2. Comparisons of the segmentation accuracy (DICE) for 12× Pix2Pix [6], 4× TCMGAN [10], the proposed framework without L_disen, and the proposed framework.

In Table 2, we followed [10] in using the images synthesized by the different methods to boost tumor segmentation accuracy. Specifically, we sampled a slice of any modality from the testing data and used our unified translation framework to generate its three complementary modalities [10]. Then, we concatenated the real slice and its three generated modalities as the input to the segmentor. We note that simply adding the generated complementary slices already yields improvements over using one real slice alone [10]. For example, the DICE score was 0.7404 when using only T2-weighted MRI for segmentation. We also computed the DICE score using all four real modalities, which served as an "upper bound". The proposed unified conditional disentanglement framework yielded better segmentation performance than the baseline Pix2Pix [6] and TCMGAN [10]. In addition, the DICE score of our conditional disentanglement framework was close to the upper bound computed using the four real modalities. It was seen from our experiments that using all of the pairs in training, together with the cycle-constraint, provided more accurate tumor shape recovery, thus leading to better segmentation results.
4. CONCLUSION
This work presented a unified conditional disentanglement framework for co-registered multimodal translation based on a single set of target-modality-conditioned autoencoder and multi-task discriminator. The encoder is learned to extract the disentangled anatomical information by enforcing the consistency of two co-registered images with different modalities. The autoencoder is simply recalled twice to form a circular processing flow that enforces the cycle-constraint. Our framework is scalable to many modalities and effectively utilizes the entire paired training data. In addition, our framework demonstrated superior performance on the tumor segmentation task over the compared methods using the generated images.

5. COMPLIANCE WITH ETHICAL STANDARDS
This research study was conducted retrospectively using human subject data made available in open access by BraTS'18. Ethical approval was not required, as confirmed by the license attached with the open access data.
6. ACKNOWLEDGMENTS
This work is partially supported by NIH R01DC018511, R01DE027989, and P41EB022544.
7. REFERENCES

[1] K. Armanious, C. Jiang, M. Fischer, T. Küstner, T. Hepp, et al., "MedGAN: Medical image translation using GANs," Computerized Medical Imaging and Graphics, vol. 79, p. 101684, 2020.

[2] X. Liu, F. Xing, C. Yang, C.-C. Jay Kuo, G. El Fakhri, and J. Woo, "Symmetric-constrained irregular structure inpainting for brain MRI registration with tumor pathology," in MICCAI BrainLes, 2020.

[3] Y. Chang, G. C. Sharp, Q. Li, H. A. Shih, G. El Fakhri, J. Ra, and J. Woo, "Subject-specific brain tumor growth modelling via an efficient Bayesian inference framework," in SPIE Medical Imaging, 2018, p. 105742I.

[4] S. Dar, M. Yurt, L. Karacan, A. Erdem, E. Erdem, and T. Çukur, "Image synthesis in multi-contrast MRI with conditional generative adversarial networks," IEEE TMI, pp. 2375–2388, 2019.

[5] B. Yu, L. Zhou, L. Wang, Y. Shi, J. Fripp, and P. Bourgeat, "Ea-GANs: Edge-aware generative adversarial networks for cross-modality MR image synthesis," IEEE TMI, vol. 38, no. 7, pp. 1750–1762, 2019.

[6] P. Isola, J. Zhu, T. Zhou, and A. Efros, "Image-to-image translation with conditional adversarial networks," in CVPR, 2017, pp. 1125–1134.

[7] J. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.

[8] Y. Qu, C. Deng, W. Su, Y. Wang, Y. Lu, and Z. Chen, "Multimodal brain MRI translation focused on lesions," in ICMLC, 2020, pp. 352–359.

[9] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, et al., "The multimodal brain tumor image segmentation benchmark (BRATS)," IEEE TMI, 2014.

[10] B. Xin, Y. Hu, Y. Zheng, and H. Liao, "Multi-modality generative adversarial networks with tumor consistency loss for brain MR image synthesis," in ISBI, 2020.

[11] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in CVPR, 2018, pp. 8789–8797.

[12] X. Huang, M. Liu, S. Belongie, and J. Kautz, "Multimodal unsupervised image-to-image translation," in ECCV, 2018, pp. 172–189.

[13] X. Yu, X. Cai, Z. Ying, T. Li, and G. Li, "SingleGAN: Image-to-image translation by a single-generator network using multiple generative adversarial learning," in ACCV, 2018, pp. 341–356.

[14] W. Yuan, J. Wei, J. Wang, Q. Ma, and T. Tasdizen, "Unified attentional generative adversarial network for brain tumor segmentation from multimodal unpaired images," in MICCAI, 2019.

[15] M. Siddiquee, Z. Zhou, N. Tajbakhsh, R. Feng, M. Gotway, Y. Bengio, and J. Liang, "Learning fixed points in generative adversarial networks: From image-to-image translation to disease detection and localization," in ICCV, 2019, pp. 191–200.

[16] B. Dewey, L. Zuo, A. Carass, Y. He, Y. Liu, E. Mowry, S. Newsome, J. Oh, P. Calabresi, and J. L. Prince, "A disentangled latent space for cross-site MRI harmonization," in MICCAI, 2020, pp. 720–729.

[17] X. Liu, S. Li, L. Kong, W. Xie, P. Jia, J. You, and B.V.K. Kumar, "Feature-level Frankenstein: Eliminating variations for discriminative recognition," in CVPR, 2019.

[18] M. Mathieu, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun, "Disentangling factors of variation in deep representation using adversarial training," in NeurIPS, 2016, pp. 5040–5048.

[19] M. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in NeurIPS, 2017.

[20] A. Larsen, S. Sønderby, and H. Larochelle, "Autoencoding beyond pixels using a learned similarity metric," in ICML, 2016.

[21] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in NeurIPS, 2016, pp. 4743–4751.

[22] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," in ICML, 2017.

[23] X. Liu, B.V.K. Kumar, P. Jia, and J. You, "Hard negative generation for identity-disentangled facial expression recognition,"