Generation of Multimodal Ground Truth Datasets for Abdominal Medical Image Registration Using CycleGAN
Dominik F. Bauer, Tom Russ, Barbara I. Waldkirch, William P. Segars, Lothar R. Schad, Frank G. Zöllner, Member, IEEE, and Alena-Kathrin Golla (née Schnurr)
Abstract — Sparsity of annotated data is a major limitation in medical image processing tasks such as registration. Registered multimodal image data are essential for the success of various medical procedures. To overcome the shortage of data, we present a method which allows the generation of annotated, multimodal 4D datasets. We use a CycleGAN network architecture to generate multimodal synthetic data from a digital body phantom and real patient data. The generated T1-weighted MRI, CT, and CBCT images are inherently co-registered. Because organ masks are also provided by the digital body phantom, the generated dataset serves as a ground truth for image segmentation and registration. Realistic simulation of respiration and heartbeat is possible within the framework. Compared to real patient data, the synthetic data showed good agreement regarding the image voxel intensity distribution and the noise characteristics. To underline the usability as a registration ground truth, a proof of principle registration was performed. We were able to optimize the registration parameters of the multimodal non-rigid registration in the process, utilizing the liver organ masks for evaluation purposes. The best performing registration setting was able to reduce the average symmetric surface distance (ASSD) of the liver masks from 8.7 mm to 0.8 mm. Thus, we could demonstrate the applicability of synthetic data for the development of medical image registration algorithms. This approach can be readily adapted for multimodal image segmentation.
Index Terms — CycleGAN, Image Registration, Image Synthesis, Liver, Multimodal Imaging
I. INTRODUCTION

Manuscript submitted for review: November, 2020. This research project is part of the Research Campus M²OLIE and funded by the German Federal Ministry of Education and Research (BMBF) within the framework "Forschungscampus: public-private partnership for Innovations" under the funding code 13GW0388A. (Corresponding author: Dominik F. Bauer.) We gratefully acknowledge the support of NVIDIA Corporation with the donation of the NVIDIA Titan Xp used for this research. D. F. Bauer, T. Russ, B. I. Waldkirch, L. R. Schad, F. G. Zöllner, and A.-K. Schnurr are with the Chair of Computer Assisted Clinical Medicine, Mannheim Institute for Intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University, Germany (e-mail: {dominik.bauer, tom.russ, barbara.waldkirch, lothar.schad, frank.zoellner, alena-kathrin.golla}@medma.uni-heidelberg.de). W. P. Segars is with the Carl E. Ravin Advanced Imaging Labs, Department of Radiology, Duke University, United States (e-mail: [email protected]). B. I. Waldkirch is with the Institute for Medical Informatics, Mannheim University of Applied Sciences, Germany. Frank G. Zöllner and Alena-Kathrin Golla (née Schnurr) share senior authorship.

MULTIMODAL imaging plays an important part in the diagnosis of liver diseases. In the case of polycystic liver disease (PLD), multimodal imaging provides information about the character and location of hepatic cysts. This helps to choose an appropriate form of therapy, such as liver transplantation or cyst fenestration with partial hepatic resection [1]. The long-term success of transplantation may be limited by various postoperative complications, and an early diagnosis is important. For this, imaging is crucial and a multimodal approach often is most effective [2].
For the treatment of liver tumors [3] and specifically of hepatocellular carcinoma (HCC), which is the sixth most common malignant tumor worldwide and the third most frequent cause of cancer-related mortality, a vast variety of treatments is available [4]. These include interventional procedures such as transarterial chemoembolization (TACE), radioembolization, radiofrequency ablation, percutaneous liver tumor cryoablation, and microwave thermocoagulation therapy, for which multimodal registration allows pre- and intra-interventional data to be combined to improve treatment planning [5]–[9]. Each imaging modality has strengths and weaknesses. Image registration enables the fusion of the complementary information of each modality.

The lack of convenient ground truth data is a major limitation in the field of medical image segmentation and registration [10], [11]. The generation of organ masks for segmentation requires labor-intensive manual annotation. For the development of image registration algorithms (especially for non-rigid image registration methods) and the validation of registration accuracy, the ground truth is generally not available [11]. This is because the patient positioning in-between scans usually cannot be reproduced, particularly in the case of multimodal imaging. In the abdomen, the variable content of the bladder and bowel and additional patient motion like respiration and heartbeat further exacerbate the problem.

Our approach to bypass the lack of ground truth data is the generation of synthetic data from the 4-dimensional (4D) XCAT phantom [12]. The synthesis is done via the here proposed CycleGAN network. Thereby, we are able to generate an unlimited amount of fully annotated multimodal training data.

Using the XCAT phantom as the basis for synthesis instead of real patient images is beneficial, because CycleGANs perform a style transfer while maintaining the geometry given by the XCAT.
Therefore, it is possible to directly use the organ masks provided by the XCAT as segmentation masks in the synthesized images. By using a modality-specific XCAT phantom, our style transfer is monomodal, whereas a transfer based on patient data would be multimodal. To enforce the preservation of the anatomical geometries, we employ additional loss functions which can only be used in monomodal style transfers.

In this work we synthesize Computed Tomography (CT), Cone Beam Computed Tomography (CBCT), and Magnetic Resonance Imaging (MRI) data. Interventions are often monitored via CBCT, whereas CT and MRI images are taken for diagnosis beforehand to assist the navigation during the intervention [5]. We will demonstrate the usefulness of the dataset as a multimodal registration ground truth for the liver.

A. Related Work
To evaluate registration results or to train deep learning registration approaches, either anatomical multi-label segmentations or landmarks are required [13], [14]. However, generating labeled data is labor-intensive, subjective, or even impractical for large datasets. Recently, evaluation techniques using statistical models like the bootstrap have been developed, which do not depend on ground truth data [15], [16]. These algorithms can only estimate the stochastic part of the registration error and therefore need to be handled with care.

Established ground truth datasets are usually only available for the brain. The Retrospective Image Registration Evaluation Project (RIRE) offers a CT, MRI, and PET gold standard for the brain. The data was registered using bone-implanted fiducial markers, which could be removed without leaving behind any traces [17]. The BrainWeb database consists of simulated MRI imaging sequences (T1-weighted, T2-weighted, and proton density), including optional multiple-sclerosis lesions [18]. The images are perfectly aligned, since they are calculated from the same model. Slice thicknesses, noise levels, and levels of intensity non-uniformity can be varied.

Image synthesis can be used to reduce the multimodal registration problem to a monomodal problem by first converting one modality into the other. Modality reduction has shown improvements in registration accuracy for the brain [19]–[21] and the pelvis [22]. For MRI-only radiotherapy planning, Wolterink et al. demonstrated feasible results using a CycleGAN approach for MRI-to-CT translation and showed that training with unpaired images was superior to training with paired images [23]. A sequential generative adversarial network (GAN) to synthesize multimodal image data has been demonstrated by Yang et al. [24].

Analytical models which transform the XCAT phantom into cardiac or abdominal MRI images have already been developed [25], [26]. A GAN approach developed by Abbasi et al.
synthesizes labeled cardiac MRI images from the XCAT phantom [27]. Tmenova et al. presented a CycleGAN to synthesize X-ray angiograms from the XCAT phantom, which proved to be useful as a data augmentation strategy [28].
B. Contribution
Using a CycleGAN network and the XCAT phantom as input, we generate a synthetic 4D multimodal dataset of the abdomen. The dataset consists of T1-weighted MRI, CT, and CBCT images in the inhaled and exhaled state. The data is perfectly co-registered and includes the displacement fields for respiratory movements and also segmentation masks for all organs. Therefore, it serves as a ground truth dataset for registration and segmentation. Upon publication, the synthetic dataset will be made public. A major advantage of the synthetic data is that there are no legal and ethical issues concerning data sharing [29].

In a previous work we already showed that the synthetic CT images are beneficial for the training of deep learning segmentation networks [30]. To demonstrate the utility of the multimodal dataset for the optimization of registration algorithms, we evaluate a multimodal non-rigid registration for varying parameter settings. We focus on the registration of the liver; however, the registration quality can be assessed for any other organ.
II. MATERIALS AND METHODS
A. CycleGAN Network Architecture
CycleGANs learn the mapping between two domains X and Y given unpaired training samples x ∈ X and y ∈ Y [31]. The mapping functions G : X → Y and F : Y → X are called generators. Two discriminators D_X and D_Y aim to distinguish between real images and generated images. Fig. 1 shows the complete CycleGAN network architecture for the XCAT and CT image domain. CycleGAN networks for MRI and CBCT images were trained analogously. The cycle consistency loss L_cyc(G, F) enforces forward and backward consistency for the generators, i.e. F(G(x)) ≈ x and G(F(y)) ≈ y. With a least squares generative adversarial loss L_adv(G, F, D_X, D_Y), the generators were trained to generate images which cannot be distinguished from real images by the discriminator. The discriminators are 70 × 70 PatchGANs, which were trained with a least squares generative adversarial loss function. For the generators we used the Res-Net architecture shown in Fig. 2. All convolutional layers use the Rectified Linear Unit (ReLU) activation function, except for the final convolution, which employs a hyperbolic tangent (tanh). The upsampling was performed via bilinear interpolation instead of a deconvolution, in order to avoid checkerboard artifacts [32].

Fig. 1. CycleGAN network architecture: The generators G_XCAT→CT and F_CT→XCAT map images from the XCAT domain to the CT domain and vice versa. CycleGAN networks for MRI and CBCT images were trained analogously.
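For illustration, the loss terms named above (least-squares adversarial losses and the cycle consistency loss) can be sketched in plain NumPy. The function names and the L1 form of the cycle term are our own choices for this sketch; they are not taken from the paper's code:

```python
import numpy as np

def lsgan_generator_loss(d_fake):
    """Least-squares adversarial loss for a generator: push D(G(x)) toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares loss for a discriminator: real samples -> 1, fakes -> 0."""
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def cycle_consistency_loss(x, fgx, y, gfy):
    """L_cyc: F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y."""
    return np.mean(np.abs(fgx - x)) + np.mean(np.abs(gfy - y))
```

With perfect reconstruction (F(G(x)) = x and G(F(y)) = y) the cycle loss vanishes, and a generator whose fakes the discriminator scores as 1 incurs zero adversarial loss.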
Fig. 2. Res-Net architecture used for the CycleGAN generators G and F. The numbers inside the arrows indicate the number of output channels of an operation.

B. Training and Loss Functions
Training is performed using the Adam optimizer with a learning rate of 0.0002. The network was trained with 256×256 pixel image patches and a batch size of 4. For each axial slice one random patch was extracted. We trained the networks for 150,000 steps, which corresponds to 100 epochs for MRI, 75 epochs for CT, and 15 epochs for CBCT.

The geometry of the XCAT is mostly well-preserved in the synthesized images. Nevertheless, we previously observed that the CycleGAN sometimes replaced high-contrast structures like bones and air cavities with soft tissue [33]. To enhance the preservation of these high-contrast structures, we extended the generator loss with an intensity loss and a gradient difference loss:

L_int(G, F) = ‖G(x) − x‖ + ‖F(y) − y‖,   (1)

L_gdl(G, x) = Σ_{i,j} | |x_{i,j} − x_{i−1,j}| − |G(x)_{i,j} − G(x)_{i−1,j}| | + | |x_{i,j} − x_{i,j−1}| − |G(x)_{i,j} − G(x)_{i,j−1}| |.   (2)

Modifying the CycleGAN loss function to regularize the mapping between the image domains is a common approach [34]. The intensity loss preserves the signal intensity of the organs provided by the XCAT phantom. This helps to keep the structure of organs intact. The weighting of the intensity loss can easily be adjusted for specific organs by using organ masks. As shown by Nie et al., the gradient difference loss prevents blurring and therefore sharpens the synthesized images [35]. The total generator loss is a combination of the previously defined losses with different weights:

L_gen(G, F, D_X, D_Y) = L_adv(G, F, D_X, D_Y) + λ_cyc L_cyc(G, F) + λ_int L_int(G, F) + λ_gdl (L_gdl(G, x) + L_gdl(F, y)).   (3)

We trained the CycleGAN with empirically chosen combinations of weights λ, which are given in Table II. A combination of the gradient loss and the intensity loss was found to yield the best results [33]. Further increasing the weighting factors led to excessive regularization and thus the networks learned an identity mapping.
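A minimal NumPy sketch of the intensity loss (1) and the gradient difference loss (2) follows. The helper names are ours, and the choice of a mean reduction for the intensity term and a sum for the gradient term mirrors the extracted equations rather than any released implementation:

```python
import numpy as np

def intensity_loss(x, gx, y, fy):
    """L_int, Eq. (1): penalize deviation of each synthesized image from its input."""
    return np.mean(np.abs(gx - x)) + np.mean(np.abs(fy - y))

def gradient_difference_loss(x, gx):
    """L_gdl, Eq. (2): match finite-difference gradient magnitudes of x and G(x)
    along the row (i) and column (j) directions."""
    dx_rows = np.abs(x[1:, :] - x[:-1, :])      # |x_{i,j} - x_{i-1,j}|
    dg_rows = np.abs(gx[1:, :] - gx[:-1, :])    # |G(x)_{i,j} - G(x)_{i-1,j}|
    dx_cols = np.abs(x[:, 1:] - x[:, :-1])      # |x_{i,j} - x_{i,j-1}|
    dg_cols = np.abs(gx[:, 1:] - gx[:, :-1])
    return np.sum(np.abs(dx_rows - dg_rows)) + np.sum(np.abs(dx_cols - dg_cols))
```

An identity mapping G(x) = x drives both terms to zero, which is exactly the over-regularized degenerate solution described above when the weights λ_int and λ_gdl are chosen too large.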
For the MRI networks, lower over-regularization thresholds were found.

C. Data
We train our CycleGAN network to map between XCAT phantom data and real patient data. The goal is to obtain a network that generates realistic looking synthetic data using the XCAT phantoms as input. In the following paragraphs we will address the real patient and the XCAT training data separately.
1) Patient Data:
For the patient training data we used 24 CT, CBCT, and T1-weighted MRI abdominal scans taken in-house from the same patients. We excluded 2 CT images due to strong metal artifacts caused by medical instruments. Since the CBCT images were taken during interventions, artifacts caused by contrast agents or metallic instruments are common. It is desirable for the synthetic CBCT data to mimic those artifacts; therefore, those CBCT images were not excluded. For each modality the patient images were resampled to a unified voxel spacing given in Table I. All scans included the whole liver, and the narrow field of view of the CBCT scan was focused on the liver. The MRI images included arms; the CT and CBCT images did not. As the XCAT phantom does not include the patient couch, we removed the patient couch from the CT patient volumes. For CBCT and MRI no couch was visible in the patient images. The MRI images were acquired at 3 Tesla with a volumetric interpolated breath-hold examination (VIBE) sequence. All scans were acquired on whole-body clinical devices (Siemens Healthineers, Forchheim, Germany; CT: Somatom Emotion 16; CBCT: Artis Zeego; MRI: Magnetom Tim Trio).

The datasets were windowed to the ranges given in Table I. For CT and CBCT a fixed window is used. Since MRI intensities vary widely from image to image, the 10th and 90th percentile of each volume (whole 3D matrix) was used for windowing. For training, a linear intensity transformation was applied to transform the intensities from the windowing interval to [-1, 1]. This normalization was carried out for all modalities. Normalization of the training data is a crucial step to improve the results for image segmentation and image synthesis, and the training performance is robust to the choice of normalization method [36], [37].
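The windowing and [-1, 1] normalization described above can be sketched as follows. The function name and the percentile fallback for the MRI case are modeled on the description, not on released code:

```python
import numpy as np

def window_and_normalize(volume, low=None, high=None):
    """Clip a volume to a windowing interval and map it linearly to [-1, 1].

    For CT/CBCT a fixed window (low, high) is passed; if no window is given
    (the MRI case), the 10th and 90th percentile of the whole volume are used.
    """
    if low is None or high is None:
        low, high = np.percentile(volume, [10, 90])
    clipped = np.clip(volume.astype(np.float64), low, high)
    return 2.0 * (clipped - low) / (high - low) - 1.0
```

The linear map sends the window edges to -1 and 1 and the window center to 0, so all three modalities end up on the same input scale for the CycleGAN.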
2) XCAT Phantom Data:
The XCAT model provides highly detailed whole-body anatomies. Organ masks can be easily obtained within the XCAT framework. The phantom includes female and male models of varying ages. The heart beat and respiratory motions are also included. The anatomy and motion can be adapted by various parameters. This allows the creation of highly individual patient geometries. For the XCAT training data we generated one XCAT volume per XCAT model for each modality with 56 different models of varying ages. The XCAT data includes the whole liver and was generated with the same voxel spacing, windowing, and normalization as the resampled patient data. Arms were included only in the MRI XCATs.

The XCAT phantom provides attenuation coefficients for all organs. The simulated tube energy of the CBCT and CT phantoms was varied from 90–120 keV in steps of 5 keV. This led to a variation of attenuation coefficients in the phantoms. Afterwards, those were transformed into Hounsfield units. To obtain CBCT and MRI XCAT data, we needed to convert the CT XCAT. For the CBCT XCAT we applied a field of view mask obtained from the patient CBCTs, which was centered on the liver. For the MRI phantoms we replaced the attenuation coefficients for each organ with simulated MRI values using the signal equation for the VIBE sequence. This pre-processing step is the first step of the analytical models that convert the CT XCAT into an MRI XCAT [25], [26]. It ensures that the MRI signal is initialized with realistic values matching the MRI training data. This enables us to use the aforementioned intensity and gradient loss for the generation of synthetic MRI images, since the transformation with the CycleGAN is now monomodal. The signal intensity (SI) for the VIBE sequence in terms of the acquisition parameters repetition time TR, echo time TE, and flip angle α and the tissue-specific T1 and T2 relaxation times and proton density ρ is given by:

SI = ρ sin α · (1 − exp(−TR/T1)) / (1 − cos α · exp(−TR/T1)) · exp(−TE/T2).   (4)

We calculated the MRI intensity for all 44 abdominal organs present in the XCAT. The imaging parameters TE = 4. ms, TR = 7. ms, and α = 10° were taken from the patient VIBE scans. The values for the proton density ρ were taken from [25]. T1 and T2 relaxation times at 3 T for blood and the spinal cord were obtained from [38] and the rest from [39]. For organs with no available T1, T2, or ρ we used the values of similar organs. To simulate some organ variability, we randomly varied T1, T2, and ρ by ± using a uniform distribution. A summary of the training data statistics is given in Table I.

TABLE I. Training data statistics for CBCT, CT, and MRI patient and XCAT data: resolution (x/y/z) [mm], windowing interval, inclusion of arms, and number of slices. For CT and MRI the number of slices per image varies in the given interval.

D. Evaluation Metrics
Quantification of the synthetic image quality is difficult, since there are no corresponding real images for comparison [28], [40]. Therefore, metrics that require a one-to-one correspondence, like the mean absolute error (MAE), cannot be calculated between synthetic and real images. We calculate one-to-one corresponding metrics between the synthetic images and the XCATs to investigate the magnitude of change from the XCAT phantoms. Real patient images and synthetic images are then compared by assessing their noise characteristics and voxel intensity distributions.
1) XCAT vs. Synthetic:
The axial slices of the synthetic CT volumes were compared to the corresponding axial slices of the XCAT volumes with respect to anatomical accuracy. The MAE was calculated to assess the change of the intensity values. We excluded the background for the calculation of the MAE. The similarity of structure and features was evaluated using the structural similarity index measure (SSIM) and the feature similarity index measure (FSIM) [41], [42]. Additionally, we calculate the edge preservation ratio (EPR) and edge generation ratio (EGR) [30], [43].
2) Real Patient vs. Synthetic:
Regarding realistic noise characteristics and intensity distribution, the 3D synthetic volumes are compared to the 3D patient volumes. For the noise characteristics, only liver voxels were considered. Limiting the noise considerations to the liver is reasonable, since the liver is a large and mostly homogeneous organ. We manually segmented the liver in 4 patients for each modality. The liver segmentations for the 56 synthetic images were provided by the XCAT phantom. The noise texture was evaluated using an estimation of the radial noise power spectrum (NPS). The radial NPS of the synthetic and patient images was compared by calculating the Pearson correlation, further called the NPS correlation coefficient (NCC) [30]. In addition to the noise texture, we calculated the noise magnitude (NM), i.e. the standard deviation of the liver voxel intensities.

Furthermore, intensity distribution histograms of patient and synthetic images were calculated. To quantify their similarity, the Pearson correlation coefficient between them was calculated (HistCC).
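The noise magnitude (NM) and histogram correlation (HistCC) are straightforward to compute. A sketch could look like this; the helper names and the shared binning over the normalized [-1, 1] range are our own assumptions:

```python
import numpy as np

def noise_magnitude(volume, liver_mask):
    """NM: standard deviation of the voxel intensities inside the liver mask."""
    return float(np.std(volume[liver_mask]))

def histogram_correlation(vol_a, vol_b, bins=256, value_range=(-1.0, 1.0)):
    """HistCC: Pearson correlation between two intensity histograms
    computed on an identical binning so the bins correspond one-to-one."""
    hist_a, _ = np.histogram(vol_a, bins=bins, range=value_range)
    hist_b, _ = np.histogram(vol_b, bins=bins, range=value_range)
    return float(np.corrcoef(hist_a, hist_b)[0, 1])
```

Two volumes with the same intensity distribution give a HistCC of 1 regardless of spatial arrangement, which is exactly why this metric needs no voxel-wise correspondence between patient and synthetic data.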
E. Proof of Principle Registration Evaluation
We performed a proof of principle image registration to demonstrate the feasibility of the multimodal dataset for the evaluation, and thus development, of registration algorithms. Our goal was to investigate different parameter settings to optimize the registration result. We implemented the registration in Python 3.5 with SimpleITK 1.2.4, which provides a simplified interface to the Insight Toolkit (ITK) [44]. A non-rigid B-spline transform with a gradient descent optimizer, a learning rate of 1, and a maximum of 300 iterations was used. Three different registration metrics were considered, namely Mattes Mutual Information (MMI), Normalized Correlation (NC), and Mean Squares (MS). For the MMI, 50 histogram bins were used. The MS metric was only used for the monomodal CT to CT registration, since it is not suited for multimodal images. Additionally, we varied the spacing of the B-spline control points from 50 mm to 150 mm in steps of 20 mm. For the
multimodal and monomodal registrations this results in 12 and 18 different parameter settings, respectively.

The registration was performed on the synthetic data from all 56 XCAT models. We registered the CT, MRI, and CBCT images in the inhaled state to the CT image in the exhaled state. To evaluate the registration, we take advantage of the liver organ masks obtained from the XCAT phantoms. The veins and arteries inside the liver were included in the liver mask by using a morphological closing operation. We applied the registration transform to the liver masks in the inhaled state and compared the result to the CT liver mask in the exhaled state. The similarity of the two masks was assessed by calculating two metrics. Firstly, we employed the average symmetric surface distance (ASSD), which is sensitive to shape and alignment. Secondly, we used the Dice similarity coefficient (DSC) to assess the overlap [45].
III. RESULTS
A. Synthetic Images
TABLE II. The image quality metrics for the evaluation of the synthetic images (columns: CBCT, CT, MRI), together with the chosen loss weight combinations λ_cyc/λ_int/λ_gdl.

We consider the metrics that compare the synthetic images with the XCAT phantoms shown in the upper half of Table II. The FSIM and SSIM indicate that image structures and features are well preserved in the CT and CBCT images, whereas the synthetic MRIs showed little structural and feature similarity to the XCATs. Regarding edges, the EPR is similar for all modalities, whereas the EGR is largest for the CBCT images. The MAE is slightly larger than the NM (synthetic) for every modality. The MAE for CBCT is more than twice as high as the MAE of CT.

A schematic of our simulation framework is shown in Fig. 3. Starting from the CT XCAT phantom, CBCT and MRI XCAT versions are generated by applying a FOV mask or by simulating the VIBE signal equation, respectively. Organ masks for each modality are extracted from the phantoms. Images are synthesized from the XCAT phantoms via CycleGAN networks. On the right-hand side, real axial patient slices of each modality are shown as a comparison to the synthetic images. Qualitatively, the style of the synthesized images is in good agreement with the real patient images.

To quantify this observation we compare the noise characteristics and voxel intensity distribution of the synthetic images to the patient images; the results are listed in the lower half of Table II. A high NCC for all modalities indicates that the noise texture was emulated realistically, albeit the NCC is slightly smaller for the synthetic MRI images. For all modalities, the NM (synthetic) is in excellent agreement with the NM (patient). In Fig. 4 the intensity histograms are shown. In general, the synthetic intensity distributions match the patient intensity distributions nicely. This is underlined by the overall high HistCC values in Table II. However, for CT and CBCT the soft tissue peaks are modeled a bit too narrowly. The lung tissue peak is shifted towards higher CT numbers for the CT. In the MRI, the soft tissue is slightly underrepresented.
B. Proof of Principle Registration
The metrics DSC and ASSD for the evaluation of the proof of principle registration are shown in Fig. 5 and Table III, respectively. Both metrics lead to the same conclusions. The monomodal CT to CT registration yielded good results for all three registration metrics and grid point spacings, with the best result for MMI with 50 mm grid point spacing. For CBCT, the MMI again worked well, whereas the registrations using the NC mostly failed. The best results were again obtained with MMI and a grid point spacing of 50 mm. For MRI, the registrations with MMI and NC yielded similar results, with the best result obtained for NC with a grid point spacing of 150 mm. Overall, the monomodal CT to CT registration achieved the best results.

Coronal views of the registration results for the best settings of each modality are visualized in Fig. 6. Slices of the inhaled state (pre-registration) are shown in the left column and the right column shows slices of the exhaled state (ground truth). The registered images in the middle column show a large similarity to the ground truth. This observation is further supported by the overlaid liver contours. The post-registration liver contour (yellow) is in high agreement with the ground truth liver contour (red).
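The two evaluation metrics used above can be sketched with NumPy/SciPy. Extracting the mask surface via binary erosion and measuring distances with a Euclidean distance transform is one common ASSD implementation choice, not necessarily the one used here:

```python
import numpy as np
from scipy import ndimage

def dice(a, b):
    """Dice similarity coefficient (DSC) of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def assd(a, b, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance (ASSD) of two binary masks,
    in the physical units given by the voxel spacing."""
    a, b = a.astype(bool), b.astype(bool)
    surf_a = a & ~ndimage.binary_erosion(a)
    surf_b = b & ~ndimage.binary_erosion(b)
    # Distance from every voxel to the nearest surface voxel of the other mask.
    dist_to_a = ndimage.distance_transform_edt(~surf_a, sampling=spacing)
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing)
    d_ab = dist_to_b[surf_a]   # surface(a) -> surface(b)
    d_ba = dist_to_a[surf_b]   # surface(b) -> surface(a)
    return (d_ab.sum() + d_ba.sum()) / (len(d_ab) + len(d_ba))
```

The DSC rewards volume overlap, while the ASSD averages boundary distances in both directions, which is why the two metrics respond differently to shape versus alignment errors.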
IV. DISCUSSION
We could demonstrate that, using our CycleGAN approach in combination with the XCAT phantom, it is feasible to generate realistic multimodal image datasets while simultaneously generating a ground truth reference. Applying this generated data to an exemplary multimodal image registration task could demonstrate the value of our synthetic dataset generation framework.

The image quality metrics in Table II show the high quality of the image synthesis. The low SSIM and FSIM for the MR images indicate that the image values of the homogeneous organs in the MRI XCAT phantoms needed to be altered more strongly by the networks in comparison to the CT XCATs. The ratio of MAE to NM (synthetic) is 2.1, 1.3, and 1.5 for CBCT, CT, and MR, respectively. Assuming normally distributed noise, the ratio of MAE to NM is approximately 0.8 [46]. This means that the MAE cannot be attributed to noise alone. The large MAE of the CBCT images compared to the CT images might be due to the introduction of metal artifacts, since the patient CBCTs showed metal artifacts in the liver caused by medical instruments. This is supported by the large EGR for CBCT.
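The 0.8 figure follows from the half-normal mean: for zero-mean Gaussian noise with standard deviation σ, the expected absolute deviation is σ·√(2/π) ≈ 0.798σ. A quick numerical check:

```python
import numpy as np

# Analytic ratio of mean absolute deviation to standard deviation
# for zero-mean Gaussian noise: E|x| / sigma = sqrt(2 / pi) ~= 0.798.
ratio_analytic = np.sqrt(2.0 / np.pi)

# Empirical check on a large sample of standard normal noise.
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=1_000_000)
ratio_empirical = np.mean(np.abs(noise)) / np.std(noise)
```

Any MAE/NM ratio well above 0.8 therefore signals systematic intensity changes (such as artifacts) on top of the noise.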
Fig. 3. Schematic of the simulation framework. Starting point is the CT XCAT, from which CBCT and MRI versions are derived. Synthetic CT, CBCT, and MRI images are created via separately trained CycleGAN networks. Organ masks can be obtained from the XCAT phantoms. Patient images which were used to train the CycleGANs are shown on the right-hand side.

Fig. 4. Intensity histograms of the patient and synthetic images averaged over all volumes. Note that the background peaks are cropped.

Fig. 5. Dice similarity coefficient (DSC) for the proof of principle registrations with 56 data points each. The mean is marked as a "+" and the whiskers indicate the 10th and 90th percentile. All outliers are depicted as black dots. The dashed horizontal line shows the mean pre-registration DSC.
For all modalities a realistic noise texture and magnitude was achieved. Additionally, the voxel intensity distribution was modeled adequately. Most of the discrepancies between the patient and synthetic histograms in Fig. 4 can be explained by inspecting the XCAT phantoms. The deviation of the CT lung peaks (synthetic -780 HU, patient -835 HU) can be explained by an overestimated initial lung value of -760 HU given by the XCAT. The narrow soft tissue peaks for CT and CBCT might be due to too little variation of the organ attenuation coefficients. The underrepresentation of soft tissue in the synthetic MRI is due to the body size of the patients and XCATs. We found that in the MRI patient dataset 66.5 % of
TABLE III. Registration evaluation with the ASSD metric [mm] for the CT to CT, CBCT to CT, and MRI to CT registrations, for grid point spacings of 50, 70, 90, 110, 130, and 150 mm and the MMI, NC, and MS registration metrics. The best result for each modality is highlighted. Mean pre-registration ASSD: 8.7 mm.
the image voxels show the body, whereas for the MRI XCAT dataset it is only 46.5 %. A rather large HistCC of 0.94 ±

Fig. 6. Registration of the pre-registered images (rows: CBCT, CT, MR) to the ground truth CT image. Red contours indicate the ground-truth boundaries of the liver (target). Blue and yellow contours represent the boundaries of the liver before and after deformation, respectively.

V. CONCLUSION
The presented simulation framework can be used to extend small datasets by transferring the style of the dataset onto the geometry given by the XCAT phantom, and it serves as a ground truth for image registration and segmentation. By adjusting the XCAT parameters, the synthetic data was tailored to a given patient collective. It was shown that the multimodal abdominal dataset can be utilized to evaluate and refine registration algorithms.

In the future, the framework will be extended to other modalities, such as T2-weighted MRI or PET, which can further boost the performance of multimodal methods. An extension to other body regions, such as the thorax or pelvis, is also possible. Synthetic images over larger body regions are especially interesting for whole-body segmentation. Expansion of datasets using this method provides a promising tool to overcome the dearth of medical training data.

REFERENCES

[1] D. E. Morgan, M. E. Lockhart, C. L. Canon, M. P. Holcombe, and J. S. Bynon, "Polycystic liver disease: multimodality imaging for complications and transplant evaluation," Radiographics, vol. 26, no. 6, pp. 1655–1668, 2006.
[2] A. H. M. Caiado et al., "Complications of liver transplantation: multimodality imaging approach," Radiographics, vol. 27, no. 5, pp. 1401–1417, 2007.
[3] S. E. Seltzer et al., "Multimodality diagnosis of liver tumors: feature analysis with CT, liver-specific and contrast-enhanced MR, and a computer model," Acad. Radiol., vol. 9, no. 3, pp. 256–269, 2002.
[4] K. Memon, R. J. Lewandowski, L. Kulik, A. Riaz, M. F. Mulcahy, and R. Salem, "Radioembolization for primary and metastatic liver cancer," Semin. Radiat. Oncol., vol. 21, no. 4, pp. 294–302, 2011.
[5] B. Waldkirch, S. Engelhardt, F. G. Zöllner, L. R. Schad, and I. Wolf, "Multimodal image registration of pre- and intra-interventional data for surgical planning of transarterial chemoembolisation," in Proc. SPIE Med. Imag., vol. 10951. SPIE, 2019, p. 109512U.
[6] N. Spahr, S. Thoduka, N. Abolmaali, R. Kikinis, and A. Schenk, "Multimodal image registration for liver radioembolization planning and patient assessment," Int. J. Comput. Assist. Radiol. Surg., vol. 14, no. 2, pp. 215–225, 2019.
[7] D. H. Lee and J. M. Lee, "Recent advances in the image-guided tumor ablation of liver malignancies: radiofrequency ablation with multiple electrodes, real-time multimodality fusion imaging, and new energy sources," Korean J. Radiol., vol. 19, no. 4, pp. 545–559, 2018.
[8] H. Elhawary et al., "Multimodality non-rigid image registration for planning, targeting and monitoring during CT-guided percutaneous liver tumor cryoablation," Acad. Radiol., vol. 17, no. 11, pp. 1334–1344, 2010.
[9] Y.-W. Chen, R. Xu, S.-Y. Tang, S. Morikawa, and Y. Kurumi, "Non-rigid MR-CT image registration for MR-guided liver cancer surgery," in Proc. IEEE/ICME Int. Conf. Complex Med. Eng. IEEE, 2007, pp. 1756–1760.
[10] F. Zöllner, E. Svarstad, A. Munthe-Kaas, L. Schad, A. Lundervold, and J. Rørvik, "Assessment of kidney volumes from MRI: acquisition and segmentation techniques," AJR Am. J. Roentgenol., vol. 199, no. 5, 2012.
[11] F. G. Zöllner, A. Šerifović-Trbalić, G. Kabelitz, M. Kociński, A. Materka, and P. Rogelj, "Image registration in dynamic renal MRI—current status and prospects," Magn. Reson. Mater. Phy., vol. 33, pp. 33–48, 2020.
[12] W. P. Segars, G. Sturgeon, S. Mendonca, J. Grimes, and B. M. W. Tsui, "4D XCAT phantom for multimodality imaging research," Med. Phys., vol. 37, no. 9, pp. 4902–4915, Aug 2010.
[13] Y. Hu et al., "Weakly-supervised convolutional neural networks for multimodal image registration," Med. Image Anal., vol. 49, pp. 1–13, 2018.
[14] Y. Xiao et al., "Evaluation of MRI to ultrasound registration methods for brain shift correction: the CuRIOUS2018 challenge," IEEE Trans. Med. Imag., vol. 39, no. 3, pp. 777–786, 2019.
[15] J. Kybic and D. Smutek, "Image registration accuracy estimation without ground truth using bootstrap," in Proc. Int. Workshop Comp. Vis. Approaches Med. Image Anal. Springer, 2006, pp. 61–72.
[16] C. J. Twining, V. S. Petrović, T. F. Cootes, R. S. Schestowitz, W. R. Crum, and C. J. Taylor, "Evaluating registration without ground truth," arXiv preprint arXiv:2002.10534, 2020.
[17] J. West et al., "Comparison and evaluation of retrospective intermodality brain image registration techniques,"
J. Comput. Assist. Tomogr. , vol. 21,no. 4, pp. 554–568, 1997.[18] C. A. Cocosco, V. Kollokian, R. K.-S. Kwan, G. B. Pike, and A. C.Evans, “Brainweb: Online interface to a 3d mri simulated braindatabase,”
NeuroImage , vol. 5, p. 425, 1997.[19] X. Liu, D. Jiang, M. Wang, and Z. Song, “Image synthesis-based multi-modal image registration framework by using deep fully convolutionalnetworks,”
MBEC , vol. 57, no. 5, pp. 1037–1048, 2019.[20] S. Roy, A. Carass, A. Jog, J. L. Prince, and J. Lee, “Mr to ct registrationof brains using image synthesis,” in
Proc. SPIE Med. Imag. , vol. 9034.SPIE, 2014, p. 903419.[21] M. Chen, A. Carass, A. Jog, J. Lee, S. Roy, and J. L. Prince, “Crosscontrast multi-channel image registration using image synthesis for mrbrain images,”
Med. Image Anal. , vol. 36, pp. 2–14, 2017.[22] X. Cao, J. Yang, Y. Gao, Y. Guo, G. Wu, and D. Shen, “Dual-coresteered non-rigid registration for multi-modal images via bi-directionalimage synthesis,”
Med. Image Anal. , vol. 41, pp. 18–31, 2017. [23] J. M. Wolterink, A. M. Dinkla, M. H. F. Savenije, P. R. Seevinck,C. A. T. van den Berg, and I. Iˇsgum, “Deep mr to ct synthesis usingunpaired data,” in
Proc. Int. Workshop Simul. Synth. Med. Imag. , 2017,pp. 14–23.[24] X. Yang, Y. Lin, Z. Wang, X. Li, and K.-T. Cheng, “Bi-modality medicalimage synthesis using semi-supervised sequential generative adversarialnetworks,”
IEEE J. Biomed. Health Inform. , vol. 24, no. 3, pp. 855–865,2019.[25] L. Wissmann, C. Santelli, W. P. Segars, and S. Kozerke, “Mrxcat:Realistic numerical phantoms for cardiovascular magnetic resonance,”
J. Cardiovasc. Magn. Reson. , vol. 16, no. 1, p. 63, 2014.[26] C. Paganelli, P. Summers, C. Gianoli, M. Bellomi, G. Baroni, andM. Riboldi, “A tool for validating mri-guided strategies: a digitalbreathing ct/mri phantom of the abdominal site,”
MBEC , vol. 55, no. 11,pp. 2001–2014, 2017.[27] S. Abbasi-Sureshjani, S. Amirrajab, C. Lorenz, J. Weese, J. Pluim, andM. Breeuwer, “4d semantic cardiac magnetic resonance image synthesison xcat anatomical model,” arXiv preprint arXiv:2002.07089 , 2020.[28] O. Tmenova, R. Martin, and L. Duong, “Cyclegan for style transfer inx-ray angiography,”
Int. J. Comput. Assist. Radiol. Surg. , vol. 14, no. 10,pp. 1785–1794, 2019.[29] J. Yoon, L. N. Drumright, and M. Van Der Schaar, “Anonymizationthrough data synthesis using generative adversarial networks (ads-gan),”
IEEE J. Biomed. Health Inform. , 2020.[30] T. Russ et al. , “Synthesis of ct images from digital body phantomsusing cyclegan,”
Int. J. Comput. Assist. Radiol. Surg. , vol. 14, no. 10,pp. 1741–1750, 2019.[31] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-imagetranslation using cycle-consistent adversarial networks,” in
Proc. IEEEInt. Conf. Comput. Vis. , 2017, pp. 2223–2232.[32] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboardartifacts,”
Distill , 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard[33] D. F. Bauer et al. , “Synthesis of ct images using cyclegans: Enhancementof anatomical accuracy,” in
Proc. Int. Conf. Med. Imag. Deep Learning ,2019.[34] J. He, C. Wang, D. Jiang, Z. Li, Y. Liu, and T. Zhang, “Cyclegan with animproved loss function for cell detection using partly labeled images,”
IEEE J. Biomed. Health Inform. , 2020.[35] D. Nie et al. , “Medical image synthesis with deep convolutional ad-versarial networks,”
IEEE Trans. Biomed. Eng. , vol. 65, no. 12, pp.2720–2730, 2018.[36] J. C. Reinhold, B. E. Dewey, A. Carass, and J. L. Prince, “Evaluatingthe impact of intensity normalization on mr image synthesis,” in
Proc.SPIE Med. Imag. , vol. 10949. SPIE, 2019, p. 109493H.[37] N. Jacobsen, A. Deistung, D. Timmann, S. L. Goericke, J. R. Reichen-bach, and D. G¨ullmar, “Analysis of intensity normalization for optimalsegmentation performance of a fully convolutional neural network,”
Z.Med. Phys. , vol. 29, no. 2, pp. 128–138, 2019.[38] G. J. Stanisz et al. , “T1, t2 relaxation and magnetization transfer intissue at 3t,”
Magn. Reson. Med. , vol. 54, no. 3, pp. 507–512, 2005.[39] C. M. De Bazelaire, G. D. Duhamel, N. M. Rofsky, and D. C. Alsop,“Mr imaging relaxation times of abdominal and pelvic tissues measuredin vivo at 3.0 t: preliminary results,”
Radiology , vol. 230, no. 3, pp.652–659, 2004.[40] H. Huang, P. S. Yu, and C. Wang, “An introduction to image synthesiswith generative adversarial nets,” arXiv preprint arXiv:1803.04469 ,2018.[41] Z. Wang, A. C. Bovik, and H. R. Sheikh, “Image quality assessment:From error measurement to structural similarity,”
IEEE Trans. ImageProcessing , vol. 13, no. 4, pp. 600 – 612, 2004.[42] Lin Zhang, Lei Zhang, Xuanqin Mou, and D. Zhang, “FSIM: A FeatureSimilarity Index for Image Quality Assessment,”
IEEE Trans. ImageProcessing , vol. 20, no. 8, pp. 2378–2386, Aug 2011.[43] L. Chen, F. Jiang, H. Zhang, S. Wu, S. Yu, and Y. Xie, “Edgepreservation ratio for image sharpness assessment,” in . IEEE, 2016,pp. 1377–1381.[44] B. C. Lowekamp, D. T. Chen, L. Ib´a˜nez, and D. Blezek, “The designof simpleitk,”
Front. Neuroinform. , vol. 7, p. 45, 2013.[45] L. R. Dice, “Measures of the amount of ecologic association betweenspecies,”
Ecology , vol. 26, no. 3, pp. 297–302, 1945.[46] R. C. Geary, “The ratio of the mean deviation to the standard deviationas a test of normality,”