A Unified Conditional Disentanglement Framework for Multimodal Brain MR Image Translation
Xiaofeng Liu, Fangxu Xing, Georges El Fakhri, Jonghye Woo
Dept. of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
ABSTRACT
Multimodal MRI provides complementary and clinically relevant information to probe tissue condition and to characterize various diseases. However, it is often difficult to acquire sufficiently many modalities from the same subject due to limitations in study plans, while quantitative analysis is still demanded. In this work, we propose a unified conditional disentanglement framework to synthesize any arbitrary modality from an input modality. Our framework hinges on a cycle-constrained conditional adversarial training approach, where it can extract a modality-invariant anatomical feature with a modality-agnostic encoder and generate a target modality with a conditioned decoder. We validate our framework on four MRI modalities, including T1-weighted, T1 contrast enhanced, T2-weighted, and FLAIR MRI, from the BraTS'18 database, showing superior performance on synthesis quality over the comparison methods. In addition, we report results from experiments on a tumor segmentation task carried out with synthesized data.
Index Terms — Image synthesis, Generative Adversarial Networks, Deep learning, Brain tumor
1. INTRODUCTION
Multimodal magnetic resonance (MR) images are often required to provide complementary information for clinical diagnosis and scientific studies [1, 2]. For example, multimodal MR imaging (MRI) with T1-weighted, T1ce (contrast enhanced), T2-weighted, and FLAIR (FLuid-Attenuated Inversion Recovery) MRI can offer greater sensitivity to tumor heterogeneity and growth patterns than a single modality, such as T1ce MRI, thereby benefiting diagnosis, staging, and monitoring of brain metastasis [3]. However, in practice, it is often difficult to acquire sufficiently many modalities due to limitations in study plans, and some modalities could be missing due to imaging artifacts [4, 5].

In recent years, cross-modality synthesis of brain MR images using generative adversarial networks (GANs) has gained popularity [1]. For example, Yu et al. [5] adopted a pair-wise image-to-image network via Pix2Pix [6, 2] for transferring T1-weighted to either T2-weighted or FLAIR MRI. Also, a cycle-reconstruction approach via CycleGAN for unpaired image translation [7] was introduced in [4, 8] to stabilize the training. These methods [1, 5, 4, 8, 2] aimed at modeling the mapping between two specific modalities, requiring two inverse autoencoders to achieve the cycle-reconstruction [7].

However, the aforementioned approaches have a limitation in that they address a cross-modality synthesis problem that is not easily scalable to multiple modalities (i.e., more than two). In other words, in order to learn all mappings among M modalities, M(M − 1) different generators have to be trained and deployed (e.g., 12 possible cross-modality networks for T1-weighted, T1ce, T2-weighted, and FLAIR MRI) [9]. Moreover, each translator cannot fully use the entire training data, but can only learn from a subset of the data (two out of M modalities). Failure to fully use the whole training set is likely to limit the quality of the generated images. To address this, Xin et al. [10] recently proposed a 1-to-3 network to translate T2-weighted to T1-weighted/T1ce/FLAIR MRI based on Pix2Pix [6]. The improvement over Pix2Pix was achieved by utilizing 3× training pairs for one translator [10]. Besides, the closely related multiple tasks mutually reinforced each other [11]. Yet, with the 1-to-3 network, the number of models to be trained was still tied to the number of modalities.

In this paper, we propose to achieve all of the pair-wise translations using a single set of conditional autoencoder and discriminator. Our framework is scalable to many modalities and can effectively use all possible paired cross-modality training data. Several unpaired multi-domain synthesis methods are inherently multimodal translation, but they usually require multiple domain-specific autoencoders [12] and discriminators [13], and do not consider pair-wise training data [12, 13, 14, 11]. Without pair-wise supervision, the largely unconstrained image generation tends to alter important characteristics of the input modality in order to generate diverse outputs. Unlike image-to-image translation in computer vision, in the medical domain it is of paramount importance not to introduce unnecessary changes and artifacts, so that quantitative analyses remain valid [15].

In addition, these methods ignore the inherent connection between different MR modalities [16]. Since multiple MR modalities are acquired with different scan parameters for the same subject, there should be a shared modality-invariant anatomical feature space [16].
Accordingly, we propose to configure a pair-wise disentanglement approach [17, 18] to extract an anatomical feature with a modality-agnostic encoder, and then inject a modality-specific appearance with a conditional decoder.

Specifically, our unified multimodal translation framework hinges on an autoencoder whose decoder is conditioned on the target modality to utilize the paired training data. After the feedforward pass of the autoencoder conditioned on any modality label, the same autoencoder is called again, this time conditioned on the modality label of the original input, to realize the cycle-reconstruction. The disentanglement of anatomical information is simply enforced by the similarity of the encoder output feature maps [18, 17]. We empirically validate the effectiveness of our framework on the BraTS'18 database, showing superior performance over the comparison methods.
2. METHODOLOGY
Given a set of M co-registered MR modalities, a sample x with modality m_x has M − 1 pixel-wise aligned samples with the other modalities. The target modality of image synthesis and the corresponding ground-truth sample are denoted as m_y and x_y, respectively. Given the pair of input sample and target modality {x, m_y}, we propose to learn a parameterized mapping f : {x, m_y} → x̃_y from {x, m_y} to the generated sample with modality m_y, which should closely resemble x_y. Here, m_y denotes a four-dimensional one-hot vector representing the four MR modalities available in the BraTS'18 database. Of note, m_y = m_x indicates self-reconstruction. The proposed framework is shown in Fig. 1.

A straightforward baseline network structure for paired two-modality translation is an autoencoder, constructed with an encoder Enc and a decoder Dec. Briefly, it first maps x to a latent feature z via Enc, and then decodes z to reconstruct the target image via Dec. The target ground-truth image serves as a strong supervision signal, whereas unpaired translation cannot benefit from such regularization [7, 19, 12, 11, 13].

However, the autoencoder has a limitation in that the generated images are likely to be blurry [2], which is partly caused by element-wise criteria such as the L1 or L2 loss [20]. Although recent studies [21] have substantially improved the predicted log-likelihood of the autoencoder, its image generation quality is still inferior to that of GANs. In addition, enforcing pixel-wise similarity is likely to distract the autoencoder from understanding the underlying anatomical structure when the input data are slightly misregistered.
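To make the baseline concrete, the following is a minimal PyTorch sketch of such an encoder-decoder pair trained with an element-wise L1 loss; the layer widths, depths, and tensor shapes are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class Enc(nn.Module):
    """Encoder: maps an input slice x to a latent feature map z."""
    def __init__(self, in_ch=1, base=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class Dec(nn.Module):
    """Decoder: maps the latent feature map back to image space."""
    def __init__(self, out_ch=1, base=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# Paired baseline: reconstruct the co-registered target-modality image with an L1 loss.
enc, dec = Enc(), Dec()
x = torch.randn(8, 1, 240, 240)    # input-modality slices
x_y = torch.randn(8, 1, 240, 240)  # co-registered target-modality ground truth
loss_l1 = (dec(enc(x)) - x_y).abs().mean()
```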
Fig. 1. Illustration of the proposed unified conditional adversarial framework for multimodal co-registered brain MR image translation. Note that only one Enc-Dec set is recalled twice, and only the gray-masked subnets are used for testing.

In order to enforce high-level semantic similarity and to improve the quality of the generated textures, recent cross-modality translation models [4, 5, 10] adopted an additional adversarial loss L_adv with a discriminator Dis following Pix2Pix [6, 2], where the training objectives consist of both the L1 loss and the adversarial GAN loss:

$\min_{Enc,Dec} \mathcal{L}_1(\tilde{x}_y, x_y) = |\tilde{x}_y - x_y|$,  (1)

$\min_{Enc,Dec}\max_{Dis} \mathcal{L}_{adv} = \mathbb{E}_{x_y}[\log Dis(x_y)] + \mathbb{E}_{\tilde{x}_y}[\log(1 - Dis(\tilde{x}_y))]$.  (2)

To extend the Pix2Pix basenet to multimodal translation, we adopt a conditional decoder structure that takes both the feature map extracted by the encoder, Enc(x), and the target modality code m_y as input. The target modality code is spatially replicated and concatenated with the decoder input. Of note, the unpaired multi-domain synthesis networks take the modality code as input to the encoder [12, 13, 14, 11]; therefore, they cannot achieve the disentanglement of modality-agnostic anatomical and modality-specific factors [16, 17, 18]. With this design, a single autoencoder model can be switched among all possible pair-wise cross-modality translations. Therefore, x̃_y in Eqs. (1)-(2) becomes Dec(Enc(x), m_y).
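The conditioning described above can be implemented, for example, by broadcasting the one-hot code over the spatial dimensions of the encoder feature map and concatenating along the channel axis. The sketch below is one such wiring (all names and channel sizes are assumptions), not the authors' exact design.

```python
import torch
import torch.nn as nn

def condition(feat, m_y):
    """Spatially replicate a one-hot modality code and concatenate it with a
    feature map along the channel dimension (one possible wiring)."""
    b, _, h, w = feat.shape
    code = m_y.view(b, -1, 1, 1).expand(-1, -1, h, w)  # B x M x H x W
    return torch.cat([feat, code], dim=1)

class CondDec(nn.Module):
    """Decoder conditioned on the target modality code m_y (sizes are assumptions)."""
    def __init__(self, feat_ch=64, n_mod=4, out_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(feat_ch + n_mod, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, feat, m_y):
        return self.net(condition(feat, m_y))

# x_tilde_y = Dec(Enc(x), m_y): the same weights serve every target modality.
m_y = torch.eye(4)[torch.tensor([1, 3])]  # one-hot target codes, e.g. T1ce and FLAIR
feat = torch.randn(2, 64, 60, 60)         # Enc(x) feature maps for 240 x 240 slices
x_tilde_y = CondDec()(feat, m_y)          # 2 x 1 x 240 x 240
```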
Instead of configuring M discriminators, one for each modality, we introduce an auxiliary modality classifier Dis_mc [22] on top of Dis, which allows a single Dis to handle all modalities. The modality classification loss to be minimized can be formulated as:

$\mathcal{L}_{mc} = \mathbb{E}_{\tilde{x}_y}[-\log Dis_{mc}(m_y \mid \tilde{x}_y)] + \mathbb{E}_{x}[-\log Dis_{mc}(m_x \mid x)]$.  (3)
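A possible instantiation of the multi-task discriminator is a shared convolutional backbone with two heads, one for the real/fake decision (Dis) and one for the modality classification (Dis_mc), trained with a cross-entropy version of Eq. (3). The sketch below is an assumption-laden illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskDis(nn.Module):
    """Single discriminator with a real/fake head (Dis) and an auxiliary
    modality-classification head (Dis_mc); backbone size is illustrative."""
    def __init__(self, in_ch=1, n_mod=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.adv_head = nn.Linear(64, 1)     # real/fake logit
        self.mc_head = nn.Linear(64, n_mod)  # modality logits

    def forward(self, img):
        h = self.backbone(img)
        return self.adv_head(h), self.mc_head(h)

# L_mc (Eq. 3) as cross-entropy over both generated and real images.
dis = MultiTaskDis()
x = torch.randn(2, 1, 240, 240)          # real slices of modality m_x
x_tilde_y = torch.randn(2, 1, 240, 240)  # generated slices of modality m_y
m_x_idx = torch.tensor([0, 2])           # integer modality labels
m_y_idx = torch.tensor([1, 3])
_, mc_fake = dis(x_tilde_y)
_, mc_real = dis(x)
loss_mc = F.cross_entropy(mc_fake, m_y_idx) + F.cross_entropy(mc_real, m_x_idx)
```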
The objective of the conditional GAN with the multi-task discriminator can induce an output distribution over (x̃_y | m_y) that matches the empirical distribution of real images with modality m_y, i.e., x_y. However, the mapping between the two distributions can be largely unconstrained, and many possible translations f induce the same distribution over f({x, m_y}) [7]. Therefore, the learned f cannot guarantee that the individual inputs {x, m_y} and outputs x̃_y are paired up as expected. To mitigate this, CycleGAN [7] introduces an additional cycle-reconstruction constraint for unpaired two-domain image translation. Specifically, the generated output is mapped back to the original image with an inverse translator, and the L1 loss is explicitly used to measure the similarity between the mapped-back image and the original input. In this way, the shape structure can be better maintained. Note that both Pix2Pix [6] and CycleGAN [7] are two-domain translators, and two inverse autoencoders are needed to achieve the cycle-reconstruction [7].

Fig. 2. Illustration of the results of our proposed framework. We use the first row as input and configure four target modalities. The diagonal results are obtained using self-translation (i.e., m_y = m_x).

In addition, rather than configuring two autoencoders with inverse directions [7], we can simply recall the same autoencoder twice with a different conditional modality code in a feedforward pass. Specifically, the second translation always uses the modality of the original input sample, m_x, to achieve the reconstruction of x:

$\min_{Enc,Dec} \mathcal{L}_1(\tilde{x}_x, x) = |\tilde{x}_x - x|$,  (4)

where x̃_x = Dec(Enc(x̃_y), m_x) is expected to reconstruct x.

To achieve the disentanglement of anatomical information and modality-specific factors without an anatomical label, we propose to enforce the similarity between Enc(x) and Enc(x̃_y), the two encoder outputs within a cycle:

$\min_{Enc} \mathcal{L}_{disen} = |Enc(\tilde{x}_y) - Enc(x)|$,  (5)

which explicitly requires the paired co-registered two-modality images to have the same feature map. Their shared factors correspond to the anatomical information, while their differences arise from the modality-specific imaging parameters [16].

Fig. 3. Comparison of different methods for the T1-weighted and T2-weighted MR translation. GT indicates the ground truth x_y.

The feature is also required to be combined with the target modality label to reconstruct the target image, which encourages the two to be independent of and complementary to each other [17]. Therefore, the latent feature space can be anatomically related and modality-invariant (i.e., disentangled from the modality factor) [18, 17]. In addition, the feature-level similarity is related to the perceptual loss [23], which enhances the textures.
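Because the cycle is realized by calling the same conditional autoencoder twice, Eqs. (4) and (5) amount to two extra L1 terms in a single forward pass. The sketch below reuses the hypothetical Enc and CondDec modules defined above; names and shapes remain assumptions.

```python
import torch

# Reuses the hypothetical Enc and CondDec modules sketched above; Enc is assumed to
# output a 64-channel feature map so that the shapes line up.
enc, dec = Enc(), CondDec()
x = torch.randn(2, 1, 240, 240)
m_x = torch.eye(4)[torch.tensor([0, 2])]  # one-hot code of the input modality
m_y = torch.eye(4)[torch.tensor([1, 3])]  # one-hot code of the target modality

feat = enc(x)                       # shared anatomical feature (ideally modality-invariant)
x_tilde_y = dec(feat, m_y)          # first pass: translate to the target modality
feat_cyc = enc(x_tilde_y)           # re-encode the generated image
x_tilde_x = dec(feat_cyc, m_x)      # second pass: same autoencoder, original modality code

loss_cyc = (x_tilde_x - x).abs().mean()      # Eq. (4): cycle-reconstruction L1
loss_disen = (feat_cyc - feat).abs().mean()  # Eq. (5): feature-level disentanglement L1
```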
For simpler implementation, we reformulate the min-max terms as minimization only, in a consistent manner. The objectives of our conditional autoencoder and the adversarial cycle-reconstruction streams can be formulated as:

$\min_{Enc,Dec} \mathcal{L}_1(\tilde{x}_y, x_y) + \alpha \mathcal{L}_1(\tilde{x}_x, x) + \beta \mathcal{L}_{adv} + \lambda_1 \mathcal{L}_{mc}$,  (6)

$\min_{Dis} -\mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{mc}$,  (7)

where α, β, λ_1, and λ_2 are the weighting hyperparameters. Enc and Dec minimize L_adv, while Dis minimizes −L_adv, so that the two sides play a round-based adversarial game to improve each other and reach a saddle point. In practice, we sample the same number of x from each modality during training [6].
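Putting the pieces together, one training iteration alternates a generator (Enc, Dec) step for Eq. (6) and a discriminator step for Eq. (7). The sketch below reuses the hypothetical Enc, CondDec, and MultiTaskDis classes from the previous snippets; the adversarial term is written in a binary cross-entropy form, and the loss weights are placeholder values, since the reported settings are given in Section 3.

```python
import torch
import torch.nn.functional as F

# A toy co-registered batch: x is the input modality, x_y the paired target modality.
x = torch.randn(4, 1, 240, 240)
x_y = torch.randn(4, 1, 240, 240)
m_x_idx, m_y_idx = torch.tensor([0, 1, 2, 3]), torch.tensor([1, 2, 3, 0])
m_x, m_y = torch.eye(4)[m_x_idx], torch.eye(4)[m_y_idx]
enc, dec, dis = Enc(), CondDec(), MultiTaskDis()  # hypothetical modules from the sketches above

# Hypothetical weights following Eqs. (6)-(7); the values below are placeholders.
alpha, beta, lam1, lam2 = 1.0, 0.1, 1.0, 1.0
opt_g = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), betas=(0.5, 0.999))
opt_d = torch.optim.Adam(dis.parameters(), betas=(0.5, 0.999))

def adv(logits, is_real):
    """One common instantiation of the adversarial term as binary cross-entropy."""
    target = torch.full_like(logits, float(is_real))
    return F.binary_cross_entropy_with_logits(logits, target)

# --- Generator (Enc, Dec) step, Eq. (6) ---
opt_g.zero_grad()
feat = enc(x)
x_tilde_y = dec(feat, m_y)
feat_cyc = enc(x_tilde_y)
x_tilde_x = dec(feat_cyc, m_x)
adv_logit, mc_logit = dis(x_tilde_y)
loss_g = ((x_tilde_y - x_y).abs().mean()               # paired L1 term, Eq. (1)
          + alpha * (x_tilde_x - x).abs().mean()       # cycle term, Eq. (4)
          + beta * adv(adv_logit, True)                # fool the discriminator
          + lam1 * F.cross_entropy(mc_logit, m_y_idx)  # modality classification, Eq. (3)
          + (feat_cyc - feat).abs().mean())            # disentanglement, Eq. (5); weight assumed 1
loss_g.backward()
opt_g.step()

# --- Discriminator step, Eq. (7) ---
opt_d.zero_grad()
adv_real, mc_real = dis(x_y)
adv_fake, _ = dis(x_tilde_y.detach())
loss_d = adv(adv_real, True) + adv(adv_fake, False) + lam2 * F.cross_entropy(mc_real, m_y_idx)
loss_d.backward()
opt_d.step()
```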
After training, we can obtain the translation functions by assembling a subset of the subnetworks, i.e., Enc and Dec. Note that our translation is agnostic to the input modality. Therefore, given an input sample x and a target modality m_y, we can generate the corresponding x̃_y = Dec(Enc(x), m_y). The images generated for the target modalities are then concatenated for the downstream tumor segmentation task [10, 8].
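At test time only Enc and the conditional Dec are kept. The usage sketch below (function and variable names are made up for illustration) synthesizes the complementary modalities of an input slice and stacks them as a multi-channel segmentation input.

```python
import torch

MODALITIES = ["T1", "T1ce", "T2", "FLAIR"]  # ordering of the one-hot code is an assumption

@torch.no_grad()
def complete_modalities(x, m_x_idx, enc, dec, n_mod=4):
    """Given a real slice batch x of modality m_x_idx, synthesize the other
    modalities and stack everything as a multi-channel segmentation input."""
    feat = enc(x)
    channels = []
    for m in range(n_mod):
        if m == m_x_idx:
            channels.append(x)  # keep the real slice
        else:
            code = torch.eye(n_mod)[m].repeat(x.size(0), 1)  # one-hot target code m_y
            channels.append(dec(feat, code))                 # x_tilde_y = Dec(Enc(x), m_y)
    return torch.cat(channels, dim=1)  # B x 4 x H x W input for the segmentor

# e.g., seg_input = complete_modalities(x, m_x_idx=2, enc=enc, dec=dec)  # real T2 slice
```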
3. RESULTS AND DISCUSSION
We evaluated our framework on the BraTS'18 multimodal brain tumor database [9], which contains a total of 285 subjects with four MRI modalities: T1-weighted, T1ce, T2-weighted, and FLAIR MRI, each with a volume size of 240 × 240 × 155. The intensity of each volume was normalized to [−1, 1] as in [10, 5], and the volumes were then processed by 2D networks. The axial slices with fewer than 2,000 pixels in the brain area were filtered out as in [10].
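For illustration, the slice selection and intensity scaling described above could be implemented as sketched below; the exact normalization is not fully specified here, so the [−1, 1] scaling and the nonzero-pixel brain criterion are assumptions.

```python
import numpy as np

def preprocess_volume(vol):
    """Scale a brain MR volume to roughly [-1, 1] (assumed normalization) and keep
    only axial slices with at least 2,000 brain (nonzero) pixels, as in [10]."""
    vol = vol.astype(np.float32)
    brain = vol > 0  # BraTS background voxels are zero
    vol = 2.0 * (vol - vol.min()) / (vol.max() - vol.min() + 1e-8) - 1.0
    slices = [vol[:, :, z] for z in range(vol.shape[2])  # volumes are 240 x 240 x 155
              if brain[:, :, z].sum() >= 2000]
    return np.stack(slices) if slices else np.empty((0, *vol.shape[:2]), np.float32)
```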
Table 1. Numerical comparisons of four methods in testing (mean L1 ↓, SSIM ↑, PSNR ↑, IS ↑): 12× Pix2Pix [6], 4× TCMGAN [10], the proposed framework without L_disen, and the proposed framework.

For a fair comparison, we followed [10] to use 100 subjects for training the translator, 85 subjects for testing, and 100 subjects for training the segmentor. We adopted the same backbone for Enc, Dec, and Dis in all comparisons [10].
In addition, the reimplemented Pix2Pix [6] was used as the two-modality transfer baseline model. In order to align the absolute value of each loss, we set the weights α = 1, β = 0., λ_1 = 1, and λ_2 = 1. We used the Adam optimizer with a batch size of 64 for 100 epochs of training. The learning rates were set to lr_{Enc,Dec} = 1e− and lr_{Dis} = 1e−, and the momentum was set to 0.5. Our framework was implemented using PyTorch. Training on an NVIDIA V100 GPU took about 8 hours. In practice, translating one test image with our unified Enc and Dec only took about 0.1 seconds.
In Fig. 2, we illustrate the multimodal generation results for the 12 cross-modality translation tasks and the 4 self-reconstruction tasks. The proposed framework successfully synthesized any modality by simply configuring the target modality code, and the outputs are consistent with the target ground-truth MR images. We note that the self-reconstruction results were not used for the subsequent segmentation, but only for checking the image generation quality in our implementation.

The qualitative comparisons with the 1-to-1 translator Pix2Pix [6] and the 1-to-3 translator TCMGAN [10] are shown in Fig. 3, along with our proposed framework. The proposed framework was able to generate visually pleasing results with better shape and structure consistency when visually assessed. From the red box in the first row, we can see that the tumor area was better maintained with the help of the cycle-constraint, compared with [10], which uses an additional tumor-consistency loss. Also, the artifact in the T1-weighted MR image (i.e., the stripes indicated by the blue circle) was propagated as similar stripes to the T2-weighted MR images synthesized by TCMGAN and Pix2Pix. Our disentangled encoder was able to eliminate the artifact and enforce a latent representation that follows the distribution of normal MR images.
The synthesized images are expected to have realistic-looking textures and to be structurally coherent with their corresponding ground-truth images x_y. For quantitative evaluation, we adopted the widely used metrics of mean L1 error, structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and inception score (IS).
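For reference, the mean L1 error, SSIM, and PSNR can be computed per slice with scikit-image as sketched below; the data_range value assumes intensities scaled to [−1, 1], and the inception score requires a separately pretrained classifier and is omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt, data_range=2.0):
    """Per-slice metrics between a synthesized slice and its ground truth;
    data_range=2.0 assumes intensities scaled to [-1, 1]."""
    l1 = float(np.abs(pred - gt).mean())
    ssim = structural_similarity(gt, pred, data_range=data_range)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=data_range)
    return l1, ssim, psnr
```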
Table 1 lists numerical comparisons between the proposed framework, Pix2Pix [6], and TCMGAN [10] for the 85 testing subjects of the BraTS'18 database. Of note, "Proposed −L_disen" indicates the proposed model without the disentanglement constraint L_disen. The proposed unified conditional disentanglement framework outperformed the other comparison methods w.r.t. these metrics, and the performance of the full framework was better than that of the framework without L_disen.

Table 2. Comparisons of the segmentation accuracy (DICE) for 12× Pix2Pix [6], 4× TCMGAN [10], the proposed framework without L_disen, and the proposed framework.

In Table 2, we followed [10] in using the images synthesized by the different methods to boost tumor segmentation accuracy. Specifically, we sampled a slice of any modality from the testing data and used our unified translation framework to generate its three complementary modalities [10]. Then, we concatenated the real slice and its three generated modalities as the input to the segmentor. We note that simply adding the generated complementary slices already yields improvements over using one real slice alone [10]. For example, the DICE score was 0.7404 when using only T2-weighted MRI for segmentation. We also computed the DICE score using all four real modalities, which served as an "upper bound". The proposed unified conditional disentanglement framework yielded better segmentation performance than the baseline Pix2Pix [6] and TCMGAN [10]. In addition, the DICE score of our conditional disentanglement framework was close to the upper bound computed using the four real modalities. It was seen from our experiments that using all of the pairs in training, together with the cycle-constraint, provided more accurate tumor shape recovery, thus leading to better segmentation results.
4. CONCLUSION
This work presented a unified conditional disentanglement framework for co-registered multimodal translation based on a single set of target-modality-conditioned autoencoder and multi-task discriminator. The encoder is learned to extract the disentangled anatomical information by enforcing the consistency of two co-registered images with different modalities. The autoencoder is simply recalled twice to form a circular processing flow that enforces the cycle-constraint. Our framework is scalable to many modalities and effectively utilizes the entire paired training data. In addition, our framework demonstrated superior performance on the tumor segmentation task over the compared methods using the generated images.

5. COMPLIANCE WITH ETHICAL STANDARDS
This research study was conducted retrospectively using human subject data made available in open access by BraTS'18. Ethical approval was not required, as confirmed by the license attached with the open access data.
6. ACKNOWLEDGMENTS
This work is partially supported by NIH R01DC018511, R01DE027989, and P41EB022544.
7. REFERENCES

[1] K. Armanious, C. Jiang, M. Fischer, T. Küstner, T. Hepp, et al., "MedGAN: Medical image translation using GANs," Computerized Medical Imaging and Graphics, vol. 79, p. 101684, 2020.

[2] X. Liu, F. Xing, C. Yang, C.-C. Jay Kuo, G. El Fakhri, and J. Woo, "Symmetric-constrained irregular structure inpainting for brain MRI registration with tumor pathology," in MICCAI BrainLes, 2020.

[3] Y. Chang, G. C. Sharp, Q. Li, H. A. Shih, G. El Fakhri, J. Ra, and J. Woo, "Subject-specific brain tumor growth modelling via an efficient Bayesian inference framework," in SPIE Medical Imaging, 2018, p. 105742I.

[4] S. Dar, M. Yurt, L. Karacan, A. Erdem, E. Erdem, and T. Çukur, "Image synthesis in multi-contrast MRI with conditional generative adversarial networks," IEEE TMI, pp. 2375–2388, 2019.

[5] B. Yu, L. Zhou, L. Wang, Y. Shi, J. Fripp, and P. Bourgeat, "Ea-GANs: Edge-aware generative adversarial networks for cross-modality MR image synthesis," IEEE TMI, vol. 38, no. 7, pp. 1750–1762, 2019.

[6] P. Isola, J. Zhu, T. Zhou, and A. Efros, "Image-to-image translation with conditional adversarial networks," in CVPR, 2017, pp. 1125–1134.

[7] J. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.

[8] Y. Qu, C. Deng, W. Su, Y. Wang, Y. Lu, and Z. Chen, "Multimodal brain MRI translation focused on lesions," in ICMLC, 2020, pp. 352–359.

[9] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, et al., "The multimodal brain tumor image segmentation benchmark (BRATS)," IEEE TMI, 2014.

[10] B. Xin, Y. Hu, Y. Zheng, and H. Liao, "Multi-modality generative adversarial networks with tumor consistency loss for brain MR image synthesis," in ISBI, 2020.

[11] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in CVPR, 2018, pp. 8789–8797.

[12] X. Huang, M. Liu, S. Belongie, and J. Kautz, "Multimodal unsupervised image-to-image translation," in ECCV, 2018, pp. 172–189.

[13] X. Yu, X. Cai, Z. Ying, T. Li, and G. Li, "SingleGAN: Image-to-image translation by a single-generator network using multiple generative adversarial learning," in ACCV, 2018, pp. 341–356.

[14] W. Yuan, J. Wei, J. Wang, Q. Ma, and T. Tasdizen, "Unified attentional generative adversarial network for brain tumor segmentation from multimodal unpaired images," in MICCAI, 2019.

[15] M. Siddiquee, Z. Zhou, N. Tajbakhsh, R. Feng, M. Gotway, Y. Bengio, and J. Liang, "Learning fixed points in generative adversarial networks: From image-to-image translation to disease detection and localization," in ICCV, 2019, pp. 191–200.

[16] B. Dewey, L. Zuo, A. Carass, Y. He, Y. Liu, E. Mowry, S. Newsome, J. Oh, P. Calabresi, and J. L. Prince, "A disentangled latent space for cross-site MRI harmonization," in MICCAI, 2020, pp. 720–729.

[17] X. Liu, S. Li, L. Kong, W. Xie, P. Jia, J. You, and B.V.K. Kumar, "Feature-level Frankenstein: Eliminating variations for discriminative recognition," in CVPR, 2019.

[18] M. Mathieu, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun, "Disentangling factors of variation in deep representation using adversarial training," in NeurIPS, 2016, pp. 5040–5048.

[19] M. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in NeurIPS, 2017.

[20] A. Larsen, S. Sønderby, and H. Larochelle, "Autoencoding beyond pixels using a learned similarity metric," in ICML, 2016.

[21] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in NeurIPS, 2016, pp. 4743–4751.

[22] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," in ICML, 2017.

[23] X. Liu, B.V.K. Kumar, P. Jia, and J. You, "Hard negative generation for identity-disentangled facial expression recognition,"