Dual-cycle Constrained Bijective VAE-GAN For Tagged-to-Cine Magnetic Resonance Image Synthesis
Xiaofeng Liu, Fangxu Xing, Jerry L. Prince, Aaron Carass, Maureen Stone, Georges El Fakhri, Jonghye Woo
Dept. of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
Dept. of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA
Dept. of Neural and Pain Sciences, University of Maryland School of Dentistry, Baltimore, MD, USA
ABSTRACT
Tagged magnetic resonance imaging (MRI) is a widely used imaging technique for measuring tissue deformation in moving organs. Due to tagged MRI's intrinsic low anatomical resolution, another matching set of cine MRI with higher resolution is sometimes acquired in the same scanning session to facilitate tissue segmentation, thus adding extra time and cost. To mitigate this, in this work, we propose a novel dual-cycle constrained bijective VAE-GAN approach to carry out tagged-to-cine MR image synthesis. Our method is based on a variational autoencoder backbone with cycle reconstruction constrained adversarial training to yield accurate and realistic cine MR images given tagged MR images. Our framework has been trained, validated, and tested using 1,768, 416, and 1,560 subject-independent paired slices of tagged and cine MRI from twenty healthy subjects, respectively, demonstrating superior performance over the comparison methods. Our method can potentially be used to reduce the extra acquisition time and cost, while maintaining the same workflow for further motion analyses.
Index Terms — Tagged MRI, image synthesis, deep learning, generative adversarial networks.
1. INTRODUCTION
Assessment of an internal organ's deformation has been an important topic in both medical imaging research and clinical practice. One of the most popular modalities to measure internal motion is tagged magnetic resonance (MR) imaging (MRI) with spatially encoded tag patterns [1]. Tagged MR images with horizontal and vertical tag patterns are acquired as the internal organ, such as the tongue, moves; the tag patterns deform together with the organ, recording all deformation information in the deformed tag patterns. These tag deformations can then be analyzed with motion extraction methods, including Harmonic Phase (HARP) based methods, recovering motion fields in either two-dimensional (2D) or three-dimensional (3D) spaces [2, 3]. Tagged MRI is an established method for internal organ motion imaging [4] that continues to enjoy widespread use.

In practice, due to the nature of tagged MRI's encoding method, the anatomical resolution of the acquired images is relatively low. Although motion fields are computed over each whole MR slice, it is usually necessary to restrict the motion field to the organ of interest. As a result, a separate set of dynamic cine MR images without tags is sometimes acquired alongside the tagged MR images as a matching pair [5]. Cine MR images have higher resolution than the tagged MR images and are used for segmenting the region of interest, such as the whole tongue [6], in which to compute the motion field. In addition, cine MR images are used to assess surface motion of the organs. However, this extra acquisition can double the time and cost.

To address this, we systematically investigate a variational autoencoder (VAE)-based generative model with adversarial training objectives to synthesize cine MRI from tagged MRI, as shown in Fig. 1. To the best of our knowledge, this is the first attempt at tagged-to-cine MR image synthesis.
Although the L1 loss in VAEs uses pixel-wise pair similarity to provide a good baseline, the results are often blurry [7, 8, 9]. Therefore, an additional generative adversarial network (GAN) [10, 11] is incorporated into our model to enrich the texture, while mitigating potential differences in texture or contrast within real cine MR images. Considering that image translation is largely unconstrained and even ill-posed [12], the anatomical structure of a specific input-output pair is not necessarily consistent if we only align the distributions with GANs. To address this, we introduce an unpaired cycle reconstruction constraint [12, 13] into our VAE-based model so that our proposed dual-cycle constrained bijective VAE-GAN can maintain the anatomical structure.

Both quantitative and qualitative evaluation results using a total of 3,774 paired slices of tagged and cine MRI from twenty healthy subjects show the validity of our proposed dual-cycle constrained bijective VAE-GAN framework and its superiority to conventional VAE- and GAN-based image style translation methods.

2. METHODOLOGY

Given the paired tagged and cine MR images {x_t, x_c}, we propose to learn a parameterized mapping f: x_t → x̃_{t→c} from the tagged MR images x_t to the generated cine MR images x̃_{t→c}, which should closely resemble the real cine MR images x_c. A straightforward baseline structure for f is the VAE, which is constructed with an encoder Enc and a decoder
Dec. The VAE first maps x_t to a code z in a latent space via Enc, and then decodes z to reconstruct the target image x̃_{t→c} via Dec. For the reconstruction error, we adopt the L1 loss, which can be formulated as L1(x̃_{t→c}, x_c) = |x̃_{t→c} − x_c|. The KL divergence L_KL between the latent code z and the prior standard Gaussian is modeled with the standard reparameterization trick [14]. The training objective is to minimize

L_VAE = L1(x̃_{t→c}, x_c) + α L_KL(Enc(x_t) || N(0, I)),   (1)

where α is a balancing weight for the KL divergence term to penalize the deviation of the distribution of the latent code from the prior distribution.

A limitation of VAEs, however, is that the generated images tend to be blurry [7, 8] due to the injected noise, the limited expressiveness of the inference models, or imperfect element-wise criteria such as the L1 or L2 loss [15]. Although recent studies [14] have greatly improved the predicted log-likelihood, the image generation quality of VAEs still lags behind GANs [7, 16, 17].

In order to enforce perceptual realism with respect to the anatomic structure and improve the quality of the generated textures, we adopt an additional adversarial training procedure [18]. The GAN training aims to learn a mapping f whose output f(x_t) is indistinguishable from real cine MR images by an adversary trained to tell f(x_t) apart from x_c. A possible solution to combine the objectives of GAN and VAE is to follow the VAE-GAN framework [15] and configure a discriminator that discriminates real samples from both the reconstructions and the examples generated from sampled Gaussian noise. Similarly, Pix2Pix [19] proposes to use the real-fake pair as input to the discriminator to enhance the stability of GANs. In theory, the GAN objective can induce an output distribution over f(x_t) that matches the empirical distribution of x_c.
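As an illustrative sketch (not the authors' released implementation), the objective of Eq. (1) can be written in PyTorch. The closed-form KL term assumes a diagonal-Gaussian encoder output, which is standard for VAEs; the tensor shapes and the helper names are assumptions:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ),
    # summed over latent dimensions and averaged over the batch
    return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

def vae_objective(x_gen, x_c, mu, logvar, alpha=1.0):
    # Eq. (1): L1 reconstruction toward the real cine image,
    # plus the alpha-weighted KL penalty on the latent code
    recon = torch.nn.functional.l1_loss(x_gen, x_c)
    return recon + alpha * kl_to_standard_normal(mu, logvar)
```

With a perfect reconstruction and a posterior matching the prior (mu = 0, logvar = 0), both terms vanish, which is a quick sanity check for an implementation.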
The optimal f thereby translates the domain of x_t to a domain f(x_t) distributed identically to x_c.

Fig. 1. Illustration of the proposed dual-cycle constrained bijective VAE-GAN. The encoder, decoder, and discriminator each have two parallel modules for x_t and x_c, respectively. Note that only the gray-masked subnets are used for testing.

However, such a translation does not guarantee that the individual input x_t and output x̃_{t→c} = f(x_t) are paired up in a meaningful way. There are infinitely many mappings f that will induce the same distribution over f(x_t) [12]. While the generated cine MR images may be visually pleasing and realistic, the anatomical shape or structure may not necessarily be consistent with the input tagged MR images. We note that the objective of our tagged-to-cine MR image synthesis is to facilitate the segmentation of the internal organ or to observe surface motion of the organ; therefore, we introduce a stricter requirement on structural consistency than conventional image translation approaches [12, 13].

Recently, CycleGAN and its variants [12, 13] have shown outstanding performance in many unpaired image translation tasks. They enforce that the output generated with f: x_t → x̃_{t→c} can be mapped back to the original image with an additional inverse translator f^{-1}: x_c → x̃_{c→t}. With two bijective mappings, the distribution of x̃_{t→c} is largely constrained, and therefore the anatomic structure can be better maintained compared with conventional approaches. Accordingly, we propose to introduce the unpaired cycle constraint [12, 13] into our VAE-based pair-wise synthesis. Our framework, as illustrated in Fig. 1, is based on dual-branch VAEs and cycle-constrained GANs.

Detailed structure:
The two encoders (i.e., Enc_t and Enc_c) have the same structure and take tagged MR images x_t or cine MR images x_c as input, respectively. The two decoders (i.e., Dec_t and Dec_c) mirror the structure of the encoders and are responsible for tagged or cine MRI generation, respectively. The latent space of both tagged and cine MRI is shared [13].

Similarly, there are two discriminators (i.e., Dis_t and Dis_c) for the adversarial training of tagged or cine MRI. For tagged MR images sampled from the tagged MRI domain, Dis_t should output true, while for images generated by Dec_t, it should output false.

Dec_t can generate two types of images: 1) images from the reconstruction stream, x̃_{t→t} = Dec_t(z ∼ Enc_t(x_t)), and 2) images from the translation stream, x̃_{c→t} = Dec_t(z ∼ Enc_c(x_c)). Since the reconstruction stream is trained in a supervised manner, it suffices to apply the adversarial training only to images from the translation stream, x̃_{c→t}, given by

L_t(x̃_{t→t}, x_t) = |Dec_t(Enc_t(x_t)) − x_t|,   (2)
L_{Dis_t} = log(Dis_t(x_t)) + log(1 − Dis_t(x̃_{c→t})).   (3)

We apply similar processing to Dec_c, where Dis_c is trained to output true for real images sampled from cine MRI and false for images generated from Dec_c, given by

L_c(x̃_{c→c}, x_c) = |Dec_c(Enc_c(x_c)) − x_c|,   (4)
L_{Dis_c} = log(Dis_c(x_c)) + log(1 − Dis_c(x̃_{t→c})).   (5)

Fig. 2. Comparison of different tagged-to-cine MR generation methods, including the vanilla VAE*, VAE + vanilla GAN [15]*, Pix2Pix [19]*, and our proposed dual-cycle constrained bijective VAE-GAN. * indicates the first attempt at tagged-to-cine MR image synthesis.

Training Strategy:
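The per-domain losses of Eqs. (2)–(5) can be sketched as below. The helper names `cycle_recon_loss` and `discriminator_objective` are hypothetical, and a sigmoid-output discriminator (values in (0, 1)) is an assumption, not a statement about the authors' backbone:

```python
import torch

def cycle_recon_loss(dec, enc, x):
    # Eqs. (2)/(4): L1 reconstruction through the same-domain Enc-Dec stream
    return torch.nn.functional.l1_loss(dec(enc(x)), x)

def discriminator_objective(dis_real, dis_fake, eps=1e-8):
    # Eqs. (3)/(5): log Dis(real) + log(1 - Dis(fake)).
    # The discriminator maximizes this; Enc/Dec oppose it on the fake term.
    # eps guards against log(0) for saturated discriminator outputs.
    return (torch.log(dis_real + eps) + torch.log(1.0 - dis_fake + eps)).mean()
```

A discriminator that outputs 1 on real samples and 0 on fakes attains the maximum of Eqs. (3)/(5), log(1) + log(1) = 0, so the objective is bounded above by zero.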
We jointly optimize the learning objectives of our backbone VAE and the adversarial cycle-reconstruction streams:

min_{Enc_t, Dec_t}  L^t_VAE + β L_t(x̃_{t→t}, x_t) + λ L_{Dis_t},   (6)
min_{Enc_c, Dec_c}  L^c_VAE + β L_c(x̃_{c→c}, x_c) + λ L_{Dis_c},   (7)
max_{Dis_t} L_{Dis_t},  max_{Dis_c} L_{Dis_c},   (8)

where L^t_VAE = L1(x̃_{t→c}, x_c) + α L_KL(Enc_t(x_t) || N(0, I)) and L^c_VAE = L1(x̃_{c→t}, x_t) + α L_KL(Enc_c(x_c) || N(0, I)) are the backbone VAE objectives for x_t and x_c, respectively. The hyper-parameters β and λ control the weights of the cycle-based reconstruction objective term and the discriminator term, respectively. Inheriting from GANs, the Enc-Dec pairs and the Dis modules play a round-based min-max adversarial game, improving each other to find a saddle point. In practice, we sample a similar number of x_t and x_c in training [13].

Testing Tagged-to-Cine Translation:
Targeting our tagged-to-cine MR synthesis objective, after training, we obtain the translation function by assembling a subset of the subnetworks, i.e., Enc_t and Dec_c. Therefore, given a tagged MR image x_t, we can generate its corresponding x̃_{t→c} = Dec_c(Enc_t(x_t)). The involved subnets are indicated with the gray mask in Fig. 1.
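To make the alternating optimization of Eqs. (6)–(8) and the test-time assembly concrete, here is a minimal toy sketch. The tiny linear `Enc`/`Dec`/`Dis` modules, all tensor shapes, and the loss-weight defaults are placeholders (the actual backbones follow [19], and the paper's own α, β, λ are set by validation), so this shows only the update structure, not the real architecture:

```python
import torch
from torch import nn

class Enc(nn.Module):
    """Hypothetical toy encoder emitting a diagonal-Gaussian latent code."""
    def __init__(self, dim=16, zdim=8):
        super().__init__()
        self.mu, self.logvar = nn.Linear(dim, zdim), nn.Linear(dim, zdim)
    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return z, mu, logvar

enc_t, enc_c = Enc(), Enc()
dec_t, dec_c = nn.Linear(8, 16), nn.Linear(8, 16)       # toy decoders
dis_t = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())   # toy discriminators
dis_c = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam([*enc_t.parameters(), *dec_t.parameters(),
                          *enc_c.parameters(), *dec_c.parameters()], lr=1e-4)
opt_d = torch.optim.Adam([*dis_t.parameters(), *dis_c.parameters()], lr=1e-4)

def train_step(x_t, x_c, alpha=1.0, beta=1.0, lam=1.0):  # placeholder weights
    kl = lambda mu, lv: (-0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum(1)).mean()
    l1 = nn.functional.l1_loss
    # --- Enc/Dec update, Eqs. (6)-(7) ---
    z_t, mu_t, lv_t = enc_t(x_t)
    z_c, mu_c, lv_c = enc_c(x_c)
    x_tc, x_ct = dec_c(z_t), dec_t(z_c)   # translation streams
    x_tt, x_cc = dec_t(z_t), dec_c(z_c)   # reconstruction streams
    loss_g = (l1(x_tc, x_c) + alpha * kl(mu_t, lv_t)       # L^t_VAE
              + l1(x_ct, x_t) + alpha * kl(mu_c, lv_c)     # L^c_VAE
              + beta * (l1(x_tt, x_t) + l1(x_cc, x_c))     # cycle reconstructions
              + lam * (torch.log(1 - dis_t(x_ct) + 1e-8)   # adversarial terms
                       + torch.log(1 - dis_c(x_tc) + 1e-8)).mean())
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    # --- Dis update, Eq. (8): maximize L_Dis, i.e., minimize its negative ---
    loss_d = -(torch.log(dis_t(x_t) + 1e-8)
               + torch.log(1 - dis_t(x_ct.detach()) + 1e-8)
               + torch.log(dis_c(x_c) + 1e-8)
               + torch.log(1 - dis_c(x_tc.detach()) + 1e-8)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()

def translate(x_t):
    # Test-time assembly: only Enc_t and Dec_c are kept
    with torch.no_grad():
        z, _, _ = enc_t(x_t)
        return dec_c(z)
```

Note the `detach()` on the fake images in the discriminator step, which keeps the two optimizers from stepping each other's parameters within one round of the min-max game.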
3. EXPERIMENTS AND RESULTS

3.1. Data Acquisition and Subjects
For the experiments performed in this study, a total of 3,774 paired tagged and cine MR images from twenty healthy subjects were acquired while speaking an utterance, "a souk," in line with a periodic metronome-like sound. MRI scanning was performed on a Siemens 3.0T TIM Trio system with a 12-channel head coil and a 4-channel neck coil using a segmented gradient echo sequence [3]. The field of view was 240 × 240 mm with an in-plane resolution of 1.87 × 1.87 mm. We used 10 subjects (1,768 slice pairs) for training, 2 subjects (416 slice pairs) for hyper-parameter validation, and 8 subjects (1,560 slice pairs) for evaluation. Tagged MR images with horizontal tag patterns were used in our evaluation. For fair comparison, we resized the tagged and cine MR images to 256 × 256 and adopted the Enc, Dec, and Dis backbones from [19] for all of the methods.

Our framework was implemented using the PyTorch deep learning toolbox. Training was performed on an NVIDIA V100 GPU and took about 4 hours. In practice, translating one tagged MR image to a cine MR image in testing with only
Enc_t and Dec_c took about 0.1 seconds.

In order to align the absolute values of the losses, we set different weights for each part by grid searching on the validation data. Specifically, we set the weights α = 1, β = 1, and λ = 0. . We used the Adam optimizer for training. The learning rates were set at lr_{Enc,Dec} = 1e− and lr_{Dis} = 1e− , and the momentum was set at 0.5.

The synthesis results using VAE [14], VAE-GAN [15], Pix2Pix [19], and our proposed method are shown in Fig. 2. The proposed framework successfully synthesized cine MR images that are consistent with the ground-truth cine MR images. The VAE-only model relies simply on pixel-wise matching, resulting in relatively blurry outputs. There is a relatively large difference in contrast and texture within the tongue region between the real cine MR images, which could affect subsequent analyses; thus, the added-on adversarial objective is expected to enrich the fine-grained tissue texture [18]. VAE-GAN [15] adopts the vanilla GAN, which is difficult to optimize and often leads to the well-known problem of mode collapse. By using pair-wise inputs to the discriminator, mode collapse in Pix2Pix [19] was relatively alleviated. Although its generative outputs were visually realistic, the unconstrained image translation yielded distortions in the anatomical structure. In contrast, our proposed dual-cycle constrained bijective VAE-GAN with the cycle reconstruction constraint generated visually pleasing results with better structural consistency, which is important for subsequent analyses.

The synthesized images were expected to have realistic-looking textures and to be structurally coherent with the corresponding ground truth x_c. For quantitative evaluation, we adopted widely used evaluation metrics: mean L1 error, structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and inception score (IS) [20].

Table 1. Numerical comparisons of four methods in testing across 1,560 slice pairs. The best and the second best results are bold and underlined, respectively.

Methods | L1 ↓ | SSIM ↑ | PSNR ↑ | IS ↑
[Table body: mean ± standard deviation for VAE only, VAE-GAN [15], Pix2Pix [19], and the proposed method; the numerical entries are not recoverable.]

Table 1 lists numerical comparisons between the proposed framework, VAE [14], VAE-GAN [15], and Pix2Pix [19] for the 8 testing subjects. The proposed dual-cycle constrained bijective VAE-GAN outperformed the other three comparison methods with respect to SSIM, PSNR, and IS. We note that all of the compared methods include the L1 minimization objective.
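As a small illustration of two of the reported metrics, the mean L1 error and PSNR can be computed with NumPy as below (SSIM and IS require more machinery and are omitted); the `data_range` of 1.0 assumes images normalized to [0, 1], which is an assumption, not a statement about the paper's preprocessing:

```python
import numpy as np

def l1_error(ref, test):
    # Mean absolute (L1) error between reference and synthesized images
    return float(np.mean(np.abs(ref - test)))

def psnr(ref, test, data_range=1.0):
    # Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE), in dB
    mse = np.mean((ref - test) ** 2)
    return float(10.0 * np.log10(data_range ** 2 / mse))
```

For example, a synthesized image that is uniformly off by 0.1 from its reference has L1 error 0.1 and MSE 0.01, giving a PSNR of exactly 20 dB.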
4. CONCLUSION AND FUTURE WORK
We presented a novel framework to synthesize cine MRI from paired tagged MRI. We systematically investigated a VAE with GAN objectives and proposed a novel dual-cycle constrained bijective VAE-GAN. The adversarial training introduced realistic texture, while mitigating potential differences in contrast and texture within the real cine MR images. In addition, and importantly, the qualitative evaluation shows that the anatomical structure of the tongue was better maintained with the additional cycle reconstruction constraint. Our experimental results demonstrated that our method surpassed the comparison methods as assessed both quantitatively and qualitatively. The synthesized cine MRI can potentially be used for further segmenting the tongue and observing surface motion, and, in future work, we will develop a segmentation network in conjunction with the proposed network using the synthesized cine MR images.

5. COMPLIANCE WITH ETHICAL STANDARDS
The subjects signed an informed consent form, and data were collected in accordance with a protocol approved by the Institutional Review Board (IRB) of the University of Maryland, Baltimore. The retrospective use of the data was granted by the IRB of Massachusetts General Hospital.
6. ACKNOWLEDGMENTS
This work was supported by NIH R01DC014717, R01DC018511, R01CA133015, and P41EB022544.
7. REFERENCES

[1] Nael F. Osman, Elliot R. McVeigh, and Jerry L. Prince, "Imaging heart motion using harmonic phase MRI," IEEE TMI, vol. 19, no. 3, pp. 186–202, 2000.

[2] Fangxu Xing, Jonghye Woo, Arnold D. Gomez, Dzung L. Pham, Philip V. Bayly, Maureen Stone, and Jerry L. Prince, "Phase vector incompressible registration algorithm for motion estimation from tagged magnetic resonance images," IEEE TMI, vol. 36, no. 10, pp. 2116–2128, 2017.

[3] Fangxu Xing, Jonghye Woo, Junghoon Lee, Emi Z. Murano, Maureen Stone, and Jerry L. Prince, "Analysis of 3-D tongue motion from tagged and cine magnetic resonance images," Journal of Speech, Language, and Hearing Research, vol. 59, no. 3, pp. 468–479, 2016.

[4] Caroline Petitjean, Nicolas Rougon, and Philippe Cluzel, "Assessment of myocardial function: a review of quantification methods and results using tagged MRI," Journal of Cardiovascular Magnetic Resonance, vol. 7, no. 2, pp. 501–516, 2005.

[5] Vijay Parthasarathy, Jerry L. Prince, Maureen Stone, Emi Z. Murano, and Moriel NessAiver, "Measuring tongue motion from tagged cine-MRI using harmonic phase (HARP) processing," The Journal of the Acoustical Society of America, pp. 491–504, 2007.

[6] Junghoon Lee, Jonghye Woo, Fangxu Xing, Emi Z. Murano, Maureen Stone, and Jerry L. Prince, "Semi-automatic segmentation of the tongue for 3D motion analysis with dynamic MRI," in ISBI, 2013, p. 1465.

[7] Ian Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," arXiv:1701.00160, 2016.

[8] Xiaofeng Liu, BVK Vijaya Kumar, Ping Jia, and Jane You, "Hard negative generation for identity-disentangled facial expression recognition," Pattern Recognition, vol. 88, pp. 1–12, 2019.

[9] Xiaofeng Liu, Xing Fangxu, Yang Chao, C.-C. Jay Kuo, Georges El Fakhri, and Jonghye Woo, "Symmetric-constrained irregular structure inpainting for brain MRI registration with tumor pathology," in MICCAI BrainLes, 2020.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in NeurIPS, 2014, pp. 2672–2680.

[11] Karim Armanious, Chenming Jiang, Marc Fischer, Thomas Küstner, Tobias Hepp, Konstantin Nikolaou, Sergios Gatidis, and Bin Yang, "MedGAN: Medical image translation using GANs," Computerized Medical Imaging and Graphics, vol. 79, p. 101684, 2020.

[12] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017, pp. 2223–2232.

[13] Ming-Yu Liu, Thomas Breuel, and Jan Kautz, "Unsupervised image-to-image translation networks," in NeurIPS, 2017, pp. 700–708.

[14] Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling, "Improved variational inference with inverse autoregressive flow," in NeurIPS, 2016, pp. 4743–4751.

[15] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther, "Autoencoding beyond pixels using a learned similarity metric," in International Conference on Machine Learning. PMLR, 2016, pp. 1558–1566.

[16] Xiaofeng Liu, Tong Che, Yiqun Lu, Chao Yang, Site Li, and Jane You, "Auto3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation," in ECCV, 2020, pp. 1358–1376.

[17] Xiaofeng Liu, Site Li, Lingsheng Kong, Wanqing Xie, Ping Jia, Jane You, and BVK Kumar, "Feature-level Frankenstein: Eliminating variations for discriminative recognition," in CVPR, 2019, pp. 637–646.

[18] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, vol. 1, MIT Press, Cambridge, 2016.

[19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, "Image-to-image translation with conditional adversarial networks," in CVPR, 2017, pp. 1125–1134.

[20] Yuhang Song, Chao Yang, Zhe Lin, Xiaofeng Liu, Qin Huang, Hao Li, and C.-C. Jay Kuo, "Contextual-based image inpainting: Infer, match, and translate," in