Anonymization of labeled TOF-MRA images for brain vessel segmentation using generative adversarial networks
Tabea Kossen, Pooja Subramaniam, Vince I. Madai, Anja Hennemuth, Kristian Hildebrand, Adam Hilbert, Jan Sobesky, Michelle Livne, Ivana Galinovic, Ahmed A. Khalil, Jochen B. Fiebach, Dietmar Frey
Affiliations: CLAIM - Charité Lab for AI in Medicine, Charité Universitätsmedizin Berlin, Germany; Department of Computer Engineering and Microelectronics, Computer Vision & Remote Sensing, Technical University of Berlin, Berlin, Germany; Department of Electrical Engineering and Computer Science, Technical University of Berlin, Berlin, Germany; School of Computing and Digital Technology, Faculty of Computing, Engineering and the Built Environment, Birmingham City University, Birmingham, UK; Institute for Imaging Science and Computational Modelling in Cardiovascular Medicine, Charité Universitätsmedizin Berlin, Berlin, Germany; Fraunhofer MEVIS, Bremen, Germany; Department VI Computer Science and Media, Beuth University of Applied Sciences, Berlin, Germany; Johanna-Etienne-Hospital, Neuss, Germany; Centre for Stroke Research Berlin, Charité Universitätsmedizin Berlin, Berlin, Germany; Department of Neurology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany; Mind, Brain, Body Institute, Berlin School of Mind and Brain, Humboldt University Berlin, Berlin, Germany; Berlin Institute of Health, Berlin, Germany
ABSTRACT
Anonymization and data sharing are crucial for privacy protection and the acquisition of large datasets for robust medical image analysis. This represents a major challenge, especially for brain imaging research: the unique structure of brain images allows for potential re-identification and thus requires anonymization beyond conventional methods. Generative adversarial networks (GANs) have the potential to provide anonymous images while maintaining their predictive properties. Analyzing brain vessel segmentation, we trained 3 GAN architectures on time-of-flight (TOF) magnetic resonance angiography (MRA) patches of patients with cerebrovascular disease for image-label pair generation: 1) deep convolutional GAN (DCGAN), 2) Wasserstein-GAN with gradient penalty (WGAN-GP) and 3) WGAN-GP with spectral normalization (WGAN-GP-SN). First, the synthesized image-label pairs from each GAN architecture were used to train a U-net for vessel segmentation. The U-nets were then tested on real patient data. In total, 66 patients were used for this analysis. In a second step, we simulated the application of our synthetic patches in a transfer learning approach using a second, independent dataset. Here, for an increasing number of up to 15 patients we evaluated vessel segmentation model performance on real data with and without pre-training on generated patches. Finally, performance of all models was assessed by the Dice similarity coefficient (DSC) and the 95th percentile of the Hausdorff distance (95HD). Comparing the 3 GAN architectures, the U-net model trained on synthetic data generated by the WGAN-GP-SN showed the highest performance in predicting brain vessels (DSC/95HD 0.82/28.97), benchmarked against the U-net trained on real data (0.89/26.61).
The transfer learning approach showed superior performance for the same GAN architecture compared to no pre-training, especially for one labeled patient only (DSC/95HD 0.91/25.68 compared to 0.85/27.36). In a brain imaging segmentation paradigm, synthesized image-label pairs preserved generalizable information and showed good performance for vessel segmentation. Furthermore, we showed that synthetic patches can be used in a transfer learning approach with an independent dataset. These results pave the way to overcoming the crucial challenges of scarce data and anonymization in the medical imaging field. To facilitate further research, our synthetic image-label pairs are made available upon request.
Keywords:
Anonymization, Generative Adversarial Networks, Image Segmentation
INTRODUCTION

Modern deep learning methods have revolutionized the field of natural image analysis (Krizhevsky et al. (2017); Simonyan and Zisserman (2014)). These methods are being translated to medical image analysis with growing success (Litjens et al. (2017); Ronneberger et al. (2015); Livne et al. (2019)). However, in contrast to natural images, datasets in medical image analysis are usually orders of magnitude smaller, since their availability is limited owing to data privacy regulation. This poses a continuous challenge for deep learning research in the medical imaging field. To meet this challenge, anonymization of medical images is an essential method to ensure both data privacy and data availability for research. However, current anonymization methods in neuroimaging such as face blurring or face removal still allow re-identification and thus cannot be applied (Abramian and Eklund (2019); Ravindra and Grama (2019); Wachinger et al. (2015)). These results call for new techniques to anonymize medical neuroimaging data that both protect patient privacy and facilitate research progress. Generative adversarial networks (GANs) have the potential to fulfill this need. GANs have already been applied successfully for medical imaging data synthesis (Neff et al. (2017); Yi et al. (2019); Sorin et al. (2020)). First pilot studies have also made use of GANs for anonymization purposes (Shin et al. (2018); Hukkelas et al. (2019)). However, applications for neuroimages are scarce, and synthesizing images often requires additional patient information such as a segmentation label (Shin et al. (2018)). This means that patient information is still fed into the model and the generated images are then not properly anonymized. Thus, there is a need to investigate the ability of GANs to create state-of-the-art anonymous synthetic neuroimaging data that maintains the predictive properties of the original data.
Importantly, such an approach would have the most beneficial impact if the corresponding labels were created in the same process, since many supervised deep learning applications require time-consuming manual labeling of the dataset by experienced physicians. In this work, we utilize arterial brain vessel segmentation to test the ability of GANs to create synthetic neuroimaging data and corresponding labels. Moreover, we investigate the generalizability of the synthesized data on a second, independent dataset. With respect to the generative architectures, we train 3 different GAN architectures on time-of-flight (TOF) magnetic resonance angiography (MRA) image patches of patients with cerebrovascular disease: 1) deep convolutional GAN (DCGAN), 2) Wasserstein GAN with gradient penalty (WGAN-GP) and 3) WGAN-GP using spectral normalization (WGAN-GP-SN). With each GAN type, we synthesized both the image and the corresponding label. We validate the generated synthetic patches using two different approaches. In the first approach, we evaluate the quality of the generated patches a) using the Fréchet inception distance (FID) and b) by training a vessel segmentation U-net on the synthetic patches. The U-net's performance is then assessed on real test data. In total, 66 patients were utilized for this analysis. In the second approach, we use the synthetic patches to pre-train a vessel segmentation model and apply the network weights in a transfer learning setting to pre-initialize the training of a U-net model using up to 15 patients from a second, independent TOF-MRA dataset. The performance of this model is then compared to a U-net model without any pre-training.
Finally, to facilitate and accelerate future research on arterial vessel segmentation and to corroborate the usefulness of the effective anonymization procedure, we make the synthesized image-label pairs generated in our study available upon request. Taken together, the contributions of this paper are: we present effectively anonymized and labeled TOF-MRA patches for brain vessel segmentation, to our knowledge for the first time. Furthermore, we compare three different state-of-the-art GAN architectures and evaluate our synthesized labeled data on an independent, second dataset in a novel evaluation pipeline. We show that pre-training a vessel segmentation network using our synthetic data yields superior performance compared to no pre-training and can reduce the amount of additional training data. Finally, we make our synthesized data available upon request to facilitate further research.
RELATED WORK

GANs have already been shown to be successful in many applications of data augmentation in medical imaging (Frid-Adar et al. (2018); Sandfort et al. (2019)) as well as in neuroimaging (Bowles et al. (2018); Foroozandeh and Eklund (2020)). Here, real medical images together with synthesized images were used to improve models that were trained on real data only. While we also provide results on data augmentation, this study focused on models trained on purely synthetic data and their generalizability to a new dataset. Generating medical images with labels is not a new idea. Neff et al. (2017) showed that lung x-rays with corresponding segmentation labels can be generated using a GAN architecture. Guibas et al. (2018) demonstrated the synthesis of labeled retina images using two GANs. While these studies focused on 2D medical images, we use a 3D dataset and evaluate the performance on an independent dataset. In the neuroimaging domain, Foroozandeh and Eklund (2020) recently showed that synthesized and labeled MR images can improve tumor segmentation performance. However, the focus there was on augmentation, and models trained on synthesized data alone yielded comparably low performance. Also, in contrast to the present study, only one dataset was used for training and evaluation.
METHODS

The architecture of the proposed DCGAN was adapted from Radford et al. (2016) and Neff et al. (2017). The WGAN-GP is an extension of the original Wasserstein GAN (Arjovsky et al. (2017)) using gradient penalty for regularization (Gulrajani et al. (2017)). For the third architecture, WGAN-GP-SN, spectral normalization was used in the convolutional layers of the WGAN-GP (Miyato et al. (2018)). Our code is available at https://github.com/prediction2020/GANs-for-anonymized-labeled-TOF-MRA-patches. The proposed methods and the structure of the GANs are shown in Fig. 1.

The generator G of all architectures took a noise vector of length 100 sampled from a Gaussian distribution as input. The noise vector was then fed through 6 upsampling convolutional layers using a kernel size of 5 and stride of 2. After each convolution layer, a batch normalization layer and a ReLU activation layer were added, except for the last convolution layer. The activation function used after the last convolution layer is the hyperbolic tangent function. The network then outputs two 96 x 96 images that correspond to one image-label pair $x_{gen} \sim p_{gen}$. The objective function for the generators of all architectures was built upon:

$$L_G = \max_G \mathbb{E}_{x_{gen} \sim p_{gen}}[\log(D(x_{gen}))] \quad (1)$$

The discriminator D of all architectures took two 96 x 96 images as input, corresponding to either a real or a generated image-label pair. The pairs were again fed through 6 convolutional layers with a kernel size of 5 and stride of 2. After each convolution layer, a batch normalization layer and a leaky ReLU (with a slope of 0.2) were added, except for the last convolution layer. The activation function used after the last convolution layer in the DCGAN was a sigmoid function. The objective function of the discriminator for the DCGAN was:

$$L_D = \max_D \mathbb{E}_{x_{real} \sim p_{real}}[\log D(x_{real})] + \mathbb{E}_{x_{gen} \sim p_{gen}}[\log(1 - D(x_{gen}))] \quad (2)$$
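The spatial sizes along the discriminator's six stride-2 convolutions can be checked with a little arithmetic. The text specifies kernel size 5 and stride 2 but not the padding, so a padding of 2 is an assumption in this sketch:

```python
def conv_out(n, kernel=5, stride=2, pad=2):
    """Spatial size after one strided convolution (floor convention)."""
    return (n + 2 * pad - kernel) // stride + 1

# Follow a 96 x 96 input patch through 6 convolutional layers.
sizes = [96]
for _ in range(6):
    sizes.append(conv_out(sizes[-1]))
print(sizes)  # [96, 48, 24, 12, 6, 3, 2]
```

Each layer roughly halves the resolution, which is why six layers suffice to reduce a 96 x 96 patch to a few units before the final activation.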
Figure 1. Workflow of this study (A) and basic architecture of the generative adversarial networks that were trained (B).

Here, $x_{real} \sim p_{real}$ denotes the real image-label pair. For the WGAN-GP and WGAN-GP-SN, a gradient penalty term for regularization was added to the discriminator's loss:

$$loss_D = D(x_{gen}) - D(x_{real}) + \lambda \left( \left\| \nabla D(\varepsilon x_{real} + (1 - \varepsilon) x_{gen}) \right\|_2 - 1 \right)^2, \quad (3)$$

where $\varepsilon \sim U[0, 1]$ and $\lambda = 10$. Since the discriminator acted as a critic, the sigmoid activation function in the last convolutional layer was omitted. Batch normalization was replaced by instance normalization to normalize across features and channels in the WGAN-GP. In the WGAN-GP-SN architecture, spectral normalization was used instead of instance normalization. For training the DCGAN, the Adam optimizer (Kingma and Ba (2017)) with a learning rate of 0.0003 was used.

A total of 121 patient MRA datasets from two studies were used: PEGASUS (N=66) and 1000Plus (N=55). All patients were diagnosed with a cerebrovascular disease. Details on both studies can be found in previous papers; for the PEGASUS study see Mutke et al. (2014), for the 1000Plus study see Hotter et al. (2009). All patients gave their informed written consent. The studies were conducted in accordance with the authorized ethical review committee of Charité - Universitätsmedizin Berlin.
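The gradient penalty of Eq. (3) evaluates the critic's gradient at a random interpolation between a real and a generated sample and penalizes deviations of its norm from 1. A minimal numeric sketch (NumPy stand-ins for the patches; the gradient itself would come from an autodiff framework in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
x_real = rng.normal(size=(96, 96))  # stand-in for a real patch
x_gen = rng.normal(size=(96, 96))   # stand-in for a generated patch

eps = rng.uniform()                 # epsilon ~ U[0, 1], as in Eq. (3)
x_hat = eps * x_real + (1 - eps) * x_gen  # interpolated sample

def gradient_penalty(grad, lam=10.0):
    """lambda * (||grad D(x_hat)||_2 - 1)^2; grad is the critic's
    gradient at x_hat, supplied by an autodiff framework."""
    return lam * (np.linalg.norm(grad) - 1.0) ** 2

# A gradient with unit norm incurs zero penalty:
unit_grad = np.ones(4) / 2.0        # ||.||_2 = 1
print(gradient_penalty(unit_grad))  # 0.0
```

The penalty thereby softly enforces the 1-Lipschitz constraint the Wasserstein formulation requires.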
Scans were performed on a clinical 3T whole-body system (Magnetom Trio, Siemens Healthcare, Erlangen, Germany) using a 12-channel receive radiofrequency coil (Siemens Healthcare) tailored for head imaging.

Parameters PEGASUS: voxel size = 0.5 x 0.5 x 0.7 mm³; matrix size: 312 x 384 x 127; TR/TE = 22 ms/3.86 ms; acquisition time: 3:50 min; flip angle = 18 degrees.

Parameters 1000Plus: voxel size = 0.5 x 0.7 x 0.7 mm³; matrix size: 384 x 268 x 127; TR/TE = 22 ms/3.86 ms; acquisition time: 3:50 min; flip angle = 18 degrees.

For both datasets, skull-stripping was applied. The segmentation labels were produced semi-manually using a standardized pipeline, with 4 raters correcting the labels as described in Livne et al. (2019).

For the anonymization, 41 of the 66 PEGASUS patients were used as a training set, 11 for validation and 14 for testing. For the transfer learning approach, one to 15 patients of the 1000Plus data, in increments of two, were utilized for training. The 1000Plus validation set consisted of 10 and the test set of 40 patients. Due to memory considerations, 2D patches of size 96 x 96 were extracted from each patient instead of using the whole volume. The data contained 1% vessels and 99% background. To compensate for this imbalance, 500 patches per patient with a brain vessel in the center were extracted; then, 500 random patches per patient were added. The input patches were normalized to a range between -1 and 1 for the GANs used for anonymization. For the U-net segmentation model, the input was normalized patch-wise to zero mean and unit variance.
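The two normalization schemes described above can be sketched as follows; the function names are illustrative, not from the paper's codebase:

```python
import numpy as np

def normalize_for_gan(patch):
    """Scale intensities to [-1, 1], matching the generator's tanh output."""
    lo, hi = patch.min(), patch.max()
    return 2.0 * (patch - lo) / (hi - lo) - 1.0

def normalize_for_unet(patch):
    """Patch-wise zero-mean, unit-variance standardization."""
    return (patch - patch.mean()) / patch.std()

patch = np.random.default_rng(1).uniform(0, 255, size=(96, 96))
g = normalize_for_gan(patch)
u = normalize_for_unet(patch)
print(g.min(), g.max())  # -1.0 1.0
```

The [-1, 1] scaling keeps the GAN's real inputs in the same range as its tanh-bounded outputs, while the standardization is the usual input conditioning for the segmentation network.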
The generated images were first visually inspected and then quantitatively compared to the real data using the Fréchet inception distance (FID) (Heusel et al. (2018)). The FID measures the similarity of the real and generated images by feeding both into an Inception-v3 network. The difference between the activations in the pool3 layer of the Inception-v3 network is then calculated as follows:

$$FID = \left\| \mu_{real} - \mu_{gen} \right\|^2 + Tr\left( \sigma_{real} + \sigma_{gen} - 2(\sigma_{real}\sigma_{gen})^{1/2} \right), \quad (4)$$

where $x_{real} \sim \mathcal{N}(\mu_{real}, \sigma_{real})$ and $x_{gen} \sim \mathcal{N}(\mu_{gen}, \sigma_{gen})$ are the distributions of the features in the pool3 layer for the real and generated data, respectively. The FID was calculated for 41,000 generated patches of each of the three architectures against the respective 41,000 real patches. The lower the FID, the higher the similarity of the generated data to the original data.

As a second evaluation, the state-of-the-art "half U-net" used in Livne et al. (2019) was trained with generated data as well as with both real and generated data. The learning rate and dropout rate were tuned with respect to the validation set. Additionally, classical augmentation was used as described in Livne et al. (2019) if this led to an improved performance on the validation set. Each segmentation network was trained for 15 epochs. Then, the performance was evaluated on the binary segmentation maps of the test set by the DSC and the 95th percentile of the Hausdorff distance (95HD):

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}, \quad (5)$$
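Eq. (4) can be made concrete in a simplified setting. The full FID requires a matrix square root of the covariance product (e.g. via scipy's `sqrtm`); assuming diagonal covariances, as in this sketch, the square root reduces to an elementwise operation:

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Eq. (4) specialized to diagonal covariances, where
    Tr((sigma1 * sigma2)^(1/2)) becomes sum(sqrt(var1 * var2))."""
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

mu = np.array([0.5, -1.0, 2.0])
var = np.array([1.0, 0.5, 2.0])
print(fid_diagonal(mu, var, mu, var))  # 0.0 -- identical distributions
```

Identical feature distributions give an FID of 0; any shift in mean or spread increases it, which is why lower FID indicates higher similarity.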
Table 1. Fréchet inception distance (FID) as a quantitative measurement of the generated images' similarity to the real images for each of the three GAN architectures. WGAN-GP-SN showed the highest similarity to the real data in terms of FID.

GAN architecture    FID
DCGAN               105.96
WGAN-GP             52.53
WGAN-GP-SN          38.05

where TP are the true positives, FP the false positives and FN the false negatives. The Hausdorff distance is defined as:

$$HD = \max\left( \max_{i \in [0, N-1]} d(i, P, G),\ \max_{i \in [0, M-1]} d(i, G, P) \right), \quad (6)$$

where N and M denote the number of voxels on the vessel tree of the ground truth G and the prediction P, respectively, and $d(i, P, G)$ is the distance from vessel voxel i in G to the closest vessel voxel in P. The 95HD was then the 95th percentile of these distances, averaged over patients, and measured in millimeters.

In the second part of the analysis, the performance of the U-net trained on generated patches was evaluated on the 1000Plus dataset. For an increasing number of training patients (1, 3, ..., 15) the U-net was trained from scratch and using the weights from the model trained on the generated image-label pairs (transfer learning). The performance of using real data only versus transfer learning was then compared by assessing the DSC and 95HD on the validation (10 patients) and test set (40 patients).

RESULTS

Overall, the generated synthetic patches showed high similarity to the training set patches, in particular those synthesized by the WGAN-GP-SN. The patches generated by the DCGAN showed a lower resolution with slight checkerboard artifacts compared to the original patches. The generated corresponding labels fit the patches well for all models. A subset of the synthesized image-label pairs for all GAN architectures as well as original image-label pairs is shown in Fig. 2A to D.
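Both evaluation metrics can be sketched directly from Eqs. (5) and (6). This illustrative implementation uses brute-force distances on small point sets; the 95HD takes the 95th percentile in place of the maximum of Eq. (6):

```python
import numpy as np

def dsc(pred, gt):
    """Dice similarity coefficient, 2*TP / (2*TP + FP + FN), per Eq. (5)."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return 2 * tp / (2 * tp + fp + fn)

def hd95(pred_pts, gt_pts):
    """95th-percentile symmetric Hausdorff distance over voxel coordinates.
    Brute-force pairwise distances; fine for small vessel trees."""
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    return max(np.percentile(d.min(axis=1), 95),
               np.percentile(d.min(axis=0), 95))

# Toy example: a 4x4 ground-truth square and a prediction shifted by one column.
gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
pred = np.zeros_like(gt); pred[2:6, 3:7] = True
print(round(dsc(pred, gt), 3))  # 0.75
```

In the toy example the one-voxel shift leaves 12 of 16 voxels overlapping, giving a DSC of 24/32 = 0.75.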
In the quantitative assessment, the data generated by the WGAN-GP-SN architecture showed the highest similarity to the real data with an FID of 38.05, compared to 105.96 for the worst performing DCGAN. All FID values can be found in Table 1.

In the first validation approach, the U-net trained on data generated by the WGAN-GP-SN showed the highest performance of all GAN models, with a segmentation performance of 0.82 DSC/28.97 95HD. The U-net trained on real PEGASUS data showed a performance of 0.89 DSC/26.57 95HD. The same model showed a similarly high performance in the external validation on the 1000Plus data with 0.88 DSC/25.68 95HD. Quantitative results for all models trained on generated and/or real data can be found in Table 2.

In the second validation approach applying transfer learning, the U-net pre-initialized with the weights from training on synthesized patches exhibited a higher performance than the model trained from scratch on real data only. Particularly when training on patches from one patient only (n=1000), transfer learning using patch-label pairs generated by the WGAN-GP-SN led to a higher performance in terms of DSC and 95HD (DSC/95HD 0.91/25.68 compared to 0.85/27.36). This performance difference between pre-initialized models and models trained from scratch became smaller when more patients were used for training. Results of the transfer learning approach are visualized in Fig. 3. Fig. 4 shows the error maps for both approaches on one example patient for large vessels (Fig. 4A and C) and small vessels (Fig. 4B and D).

Figure 2. Real and synthesized image patches with corresponding labels. (A) to (C) show image-label pairs generated by DCGAN (A), WGAN-GP (B) and WGAN-GP-SN (C), respectively. (D) shows real patches and corresponding labels. The synthesized patches resemble real vessel patches and the labels fit well to the patches, especially those generated by WGAN-GP-SN (C).
Table 2. Summary of the Dice similarity coefficient (DSC) and the 95th-percentile Hausdorff distance (95HD) of the U-net on validation and test set. Each metric is averaged over patients. The artificial patches were generated by generative adversarial networks (GANs) trained on the PEGASUS dataset. For data augmentation, both real and generated patches were used for training; for anonymization, the U-net was trained on generated patches only. Models trained on anonymized, synthetic data only show performances close to the model trained on real data.

                                          mean DSC          mean 95HD [mm]
                                          val     test      val     test
U-net on real PEGASUS data (Livne et al.) 0.88    0.89      29.50   26.57

Data augmentation (real data (PEGASUS) and generated data)
DCGAN                                     0.89    -         -       -
WGAN-GP                                   0.89    0.89      28.03   30.01
WGAN-GP-SN                                0.89    0.89      29.91   26.51

Anonymization (trained on generated data only)
PEGASUS anonymization models: validated and evaluated on PEGASUS data
DCGAN                                     0.82    0.79      34.58   31.25
WGAN-GP                                   0.82    0.78      34.64   33.70
WGAN-GP-SN                                0.85    0.82      -       28.97

PEGASUS anonymization models evaluated on real 1000Plus data
DCGAN                                     -       0.76      -       26.79
WGAN-GP                                   -       0.85      -       26.98
WGAN-GP-SN                                -       0.88      -       25.68

Figure 3. Performance evaluation of segmentation for an increasing number of patients on the 1000Plus dataset when trained from scratch (green) and using transfer learning (blue). The black dotted lines indicate the performance of the U-net on the real PEGASUS dataset. The error bars show the standard deviation over patients. Especially for up to 5000 data samples, the pre-trained WGAN-GP-SN outperformed the models without any pre-training.

DISCUSSION

We present a Wasserstein-GAN based model for the generation of synthetic TOF-MRA imaging data and corresponding labels. The model generated synthetic data of high quality, as evidenced visually and through the FID measure, and retained much of the predictive properties of the original images. Here, a predictive model for vessel segmentation trained on synthetic data alone showed good performance on one dataset and excellent performance on an external validation set. The synthetic data were also successfully applied in a transfer learning approach in which training was pre-initialized with weights from a model trained on synthetic data; this outperformed the models trained on real data only. Our results mark a significant step towards the use of GAN-based models to generate synthetic and effectively anonymous data. Consequently, this approach has the potential to significantly accelerate research in the field of neuroimaging.

While the image-label pairs synthesized by the DCGAN showed some artifacts, the more recent GAN architectures (WGAN-GP and WGAN-GP-SN) produced higher resolution data that looked similar to the real data (Fig. 2). The superiority of the WGAN approaches was confirmed by lower FID values as well as the improved performance of the U-net segmentation models trained on synthetic data. This can be explained by the inherent differences between Wasserstein-GANs and the DCGAN. In contrast to the DCGAN, the loss function of the WGAN-GP architectures utilizes the Earth Mover's distance and is bounded by a Lipschitz constraint (Arjovsky et al. (2017); Gulrajani et al. (2017)). This works as a robust regularization and enhances training stability while diminishing mode collapse at the same time. This explains why the WGAN-GP produced more realistic looking image-label pairs.
Other studies confirm the superiority of Wasserstein GAN architectures over the DCGAN (Arjovsky et al. (2017); Gulrajani et al. (2017)). A recent addition to GAN architectures was the introduction of spectral normalization. This method additionally restricts the discriminator's weights for each layer in order to stabilize training even for high learning rates (Miyato et al. (2018)). As evidenced in our work, spectral normalization is also beneficial for the application of Wasserstein GANs, and the combination of both regularization techniques (WGAN-GP-SN) yielded the best image quality both by visual inspection and in terms of FID. These techniques have thus supported the preservation of the predictive properties for vessel segmentation within the synthetic patches. Therefore, it is likely that more sophisticated (future) GAN architectures will further improve the generation of synthetic data. Potential current candidate methods are the progressive growing GAN (PG-GAN) or stacked GAN architectures (Karras et al. (2018); Huang et al. (2017)).

Whereas the data generated by the WGAN-GP-SN consistently yielded the highest DSC in the transfer learning approach, this is not as apparent in other parts of the results. First, the 95HD did not show a consistent trend. Since the Hausdorff distance is vulnerable to outliers, we argue that it might not be as reliable as the DSC. This is also corroborated by the high standard deviation over the patients. Secondly, when training the U-net with real data and additional synthesized data (data augmentation), the performance only slightly increased for the WGAN-GP-SN, and the DCGAN seemed to perform slightly better. Here, the real training data used for the U-net was the same as for GAN training. Hence, the generated data contains information from the same underlying distribution and did not add much value.

Figure 4. Error maps for one example patient from the 1000Plus study using one patient when training from scratch (A, B) and using transfer learning from WGAN-GP-SN generated patches (C, D). True positives are shown in red, false positives in green and false negatives in yellow. Transfer learning led to fewer errors, especially on small vessels (B, D).
Further, the best performance achieved by adding DCGAN-generated data needs to be interpreted with caution and does not necessarily mean that this was the overall best performing generative model.

GAN architectures have the potential to generate anonymized data, since the generator does not have direct access to the training data. This also holds true for this study: the generator synthesizes patch-label pairs from a noise vector. However, a recent study by Hayes et al. (2019) shows that DCGANs might be vulnerable to so-called membership inference attacks (Shokri et al. (2017)). Such attacks aim to identify whether a given data sample was part of the original training set or not. To prevent this, differentially private GANs (DPGANs) have been introduced (Xie et al. (2018)). Here, carefully adjusted noise is introduced in the gradients during the discriminator's training. While these GANs have the potential to ensure a certain level of privacy, they show poorer performance to date (Mukherjee et al. (2020)) and have so far only been trained on natural image datasets. Training a DPGAN on sparse medical imaging datasets remains a major challenge. While DPGANs might provide even further advantages in anonymization, we argue that our synthesized patch-label pairs are effectively anonymized. For one, in the WGAN-GP-SN approach, we apply Lipschitz regularization techniques such as gradient penalty and spectral normalization. Wu et al. (2019) found that these techniques might reduce information leakage and might even make the trained models resistant to membership inference attacks. Furthermore, we use randomly sampled 2D patches in this study. Thus, for a successful membership inference attack two events must coincide: first, the real training data, which is protected by state-of-the-art hospital security systems, has to be leaked; second, the patches need to be extracted in exactly the same way as in the GAN training process to allow re-identification.
The minuscule probability of these events happening is comparable to other theoretical scenarios of state-of-the-art anonymization. For example, any tabular data anonymized using state-of-the-art techniques could be re-identified when compared with the leaked original data. Thus, we consider our generated patches anonymous and hence make them available to researchers upon request.

Our results are also promising for AI in healthcare product development (Higgins and Madai (2020)). In the medical AI research setting, a strong focus on performance in homogeneous samples can be observed. This is in stark contrast to the requirements for a medical imaging product. A product is supposed to be used in a real-world setting, confronted with highly heterogeneous data reflecting different settings and multiple hardware options. Thus, product development should focus as much on training on heterogeneous data as on keeping the necessary performance (Higgins and Madai (2020)). This, however, is currently highly challenging, as data is a scarce resource due to limited availability. Our results show that a relatively small amount of data is sufficient to generate robust results. Thus, a GAN-based anonymization approach could allow the generation of high quality data from a smaller number of patients from multiple locations that, in total, reflect the full distribution of soft- and hardware settings in the clinical setting. Here, the possibility to generate high-quality labels, as evidenced by our study, is also a great advantage. Notably, a GAN model also learns the quality of the labels provided during training. Thus, the final performance of any model trained on synthetic data will also depend on the quality of the real labels. Providing high-quality labels is no simple task and usually requires hours of manual labor by highly qualified medical staff. Thus, a novel GAN-based approach to product development could entail the high-quality labeling of relatively small datasets from multiple data providers that are then anonymized and pooled for training. On the one hand, this would keep development costs relatively low, which is a prerequisite for startup success. On the other hand, such an approach would ensure both high performance and low bias, as the chance of out-of-sample data in the clinical setting would be significantly lowered.

Our study has several limitations. The DCGAN is 2D due to computational restrictions. 3D approaches could help extract information about the 3D vessel tree structure and in this way improve the performance of the segmentation task. The computational restrictions also did not allow us to try out more advanced GAN architectures such as the PG-GAN. Another limitation is the calculation of the FID. Due to computational restrictions, it was only calculated to confirm the quality of visually inspected images and not for every epoch in an end-to-end solution. Furthermore, the FID for assessing image quality might not be ideal. Although it is used as a quality measurement in the medical field (Haarburger et al. (2019); Cao et al. (2020)), it was originally designed for natural images and hence might not entirely capture the features relevant for medical imaging. Thus, further research on assessing image quality specific to medical images should be undertaken.
CONCLUSION

This study marks a crucial step towards true anonymization of medical imaging data while maintaining essential predictive features within the image patch. We show that these features might be generalizable to another, independent dataset. Our initial performance for vessel segmentation on the PEGASUS dataset is already relatively high, and we show that training more advanced GAN architectures can further increase the quality of synthesized image-label pairs. By using only one patient from a different cohort, we can achieve a comparably high performance on an independent dataset. Our synthesized image-label pairs allow other researchers to build models that require only little labeled patient data and will significantly facilitate research in this domain. Our framework may achieve similar results on other medical segmentation tasks. This could lower the demand for labeled patient data and allow more sharing of anonymized data. Nevertheless, further studies should assess the generalizability of this analysis to other (more complex) segmentation problems.
DISCLOSURES
Tabea Kossen reported receiving personal fees from ai4medicine outside the submitted work. Dr Madai reported receiving personal fees from ai4medicine outside the submitted work. Adam Hilbert reported receiving personal fees from ai4medicine outside the submitted work. While not related to this work, Dr Sobesky reports receipt of speakers honoraria from Pfizer, Boehringer Ingelheim, and Daiichi Sankyo. Furthermore, Dr Fiebach has received consulting and advisory board fees from BioClinica, Cerevast, Artemida, Brainomix, Biogen, BMS, EISAI, and Guerbet. Dr Frey reported receiving grants from the European Commission, and reported receiving personal fees from and holding an equity interest in ai4medicine outside the submitted work.
ACKNOWLEDGEMENTS
This work has received funding from the German Federal Ministry of Education and Research through (1) the grant Centre for Stroke Research Berlin and (2) a Go-Bio grant for the research group PREDICTioN2020 (lead: DF).
REFERENCES
Abramian, D., Eklund, A., 2019. Refacing: Reconstructing Anonymized Facial Features Using GANs, 5.

Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein GAN. arXiv:1701.07875 [cs, stat]. URL: http://arxiv.org/abs/1701.07875.

Bowles, C., Chen, L., Guerrero, R., Bentley, P., Gunn, R., Hammers, A., Dickie, D.A., Hernández, M.V., Wardlaw, J., Rueckert, D., 2018. GAN Augmentation: Augmenting Training Data using Generative Adversarial Networks. arXiv:1810.10863 [cs]. URL: http://arxiv.org/abs/1810.10863.

Cao, B., Zhang, H., Wang, N., Gao, X., Shen, D., 2020. Auto-GAN: Self-Supervised Collaborative Learning for Medical Image Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence 34, 10486–10493. URL: https://aaai.org/ojs/index.php/AAAI/article/view/6619, doi:10.1609/aaai.v34i07.6619.

Foroozandeh, M., Eklund, A., 2020. Synthesizing brain tumor images and annotations by combining progressive growing GAN and SPADE. arXiv:2009.05946 [cs]. URL: http://arxiv.org/abs/2009.05946.

Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., Greenspan, H., 2018. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 321, 321–331. doi:10.1016/j.neucom.2018.09.013.

Guibas, J.T., Virdi, T.S., Li, P.S., 2018. Synthetic Medical Images from Dual Generative Adversarial Networks. arXiv:1709.01872 [cs]. URL: http://arxiv.org/abs/1709.01872.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C., 2017. Improved Training of Wasserstein GANs, 11.

Haarburger, C., Horst, N., Truhn, D., Broeckmann, M., Schrading, S., Kuhl, C., Merhof, D., 2019. Multiparametric Magnetic Resonance Image Synthesis using Generative Adversarial Networks. Eurographics Workshop on Visual Computing for Biology and Medicine, 5 pages. URL: https://diglib.eg.org/handle/10.2312/vcbm20191226, doi:10.2312/VCBM.20191226.

Hayes, J., Melis, L., Danezis, G., Cristofaro, E.D., 2019. LOGAN: Membership Inference Attacks Against Generative Models. Proceedings on Privacy Enhancing Technologies 2019, 133–152. URL: https://content.sciendo.com/view/journals/popets/2019/1/article-p133.xml, doi:10.2478/popets-2019-0008.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S., 2018. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv:1706.08500 [cs, stat]. URL: http://arxiv.org/abs/1706.08500.

Higgins, D., Madai, V.I., 2020. From bit to bedside: A practical framework for artificial intelligence product development in healthcare. Advanced Intelligent Systems, 2000052. URL: http://dx.doi.org/10.1002/aisy.202000052, doi:10.1002/aisy.202000052.

Hotter, B., Pittl, S., Ebinger, M., Oepen, G., Jegzentis, K., Kudo, K., Rozanski, M., Schmidt, W.U., Brunecker, P., Xu, C., Martus, P., Endres, M., Jungehülsing, G.J., Villringer, A., Fiebach, J.B., 2009. Prospective study on the mismatch concept in acute stroke patients within the first 24 h after symptom onset - 1000Plus study. BMC Neurology 9, 60. URL: https://doi.org/10.1186/1471-2377-9-60, doi:10.1186/1471-2377-9-60.

Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., Belongie, S., 2017. Stacked Generative Adversarial Networks. arXiv:1612.04357 [cs, stat]. URL: http://arxiv.org/abs/1612.04357.

Hukkelas, H., Mester, R., Lindseth, F., 2019. DeepPrivacy: A Generative Adversarial Network for Face Anonymization, in: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Ushizima, D., Chai, S., Sueda, S., Lin, X., Lu, A., Thalmann, D., Wang, C., Xu, P. (Eds.), Advances in Visual Computing. Springer International Publishing, Cham, volume 11844, pp. 565–578. URL: http://link.springer.com/10.1007/978-3-030-33720-9_44, doi:10.1007/978-3-030-33720-9_44.

Karras, T., Aila, T., Laine, S., Lehtinen, J., 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv:1710.10196 [cs, stat]. URL: http://arxiv.org/abs/1710.10196.

Kingma, D.P., Ba, J., 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs]. URL: http://arxiv.org/abs/1412.6980.

Krizhevsky, A., Sutskever, I., Hinton, G.E., 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60, 84–90. URL: http://dl.acm.org/citation.cfm?doid=3098997.3065386, doi:10.1145/3065386.

Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A.W.M., van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88. doi:10.1016/j.media.2017.07.005.

Livne, M., Rieger, J., Aydin, O.U., Taha, A.A., Akay, E.M., Kossen, T., Sobesky, J., Kelleher, J.D., Hildebrand, K., Frey, D., Madai, V.I., 2019. A U-Net Deep Learning Framework for High Performance Vessel Segmentation in Patients With Cerebrovascular Disease. Frontiers in Neuroscience 13. doi:10.3389/fnins.2019.00097.

Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y., 2018. Spectral Normalization for Generative Adversarial Networks. arXiv:1802.05957 [cs, stat]. URL: http://arxiv.org/abs/1802.05957.

Mukherjee, S., Xu, Y., Trivedi, A., Ferres, J.L., 2020. privGAN: Protecting GANs from membership inference attacks at low cost. arXiv:2001.00071 [cs, stat]. URL: http://arxiv.org/abs/2001.00071.

Mutke, M.A., Madai, V.I., von Samson-Himmelstjerna, F.C., Zaro Weber, O., Revankar, G.S., Martin, S.Z., Stengl, K.L., Bauer, M., Hetzer, S., Günther, M., Sobesky, J., 2014. Clinical evaluation of an arterial-spin-labeling product sequence in steno-occlusive disease of the brain. PloS One 9, e87143. doi:10.1371/journal.pone.0087143.

Neff, T., Payer, C., Stern, D., Urschler, M., 2017. Generative Adversarial Network based Synthesis for Supervised Medical Image Segmentation. Proceedings of the OAGM & ARW Joint Workshop Vision, Automation and Robotics. doi:10.3217/978-3-85125-524-9-30.

Radford, A., Metz, L., Chintala, S., 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 [cs]. URL: http://arxiv.org/abs/1511.06434.

Ravindra, V., Grama, A., 2019. De-anonymization Attacks on Neuroimaging Datasets. arXiv:1908.03260 [cs, eess, q-bio]. URL: http://arxiv.org/abs/1908.03260.

Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation, in: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer International Publishing, Cham, volume 9351, pp. 234–241. URL: http://link.springer.com/10.1007/978-3-319-24574-4_28, doi:10.1007/978-3-319-24574-4_28.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., 2016. Improved Techniques for Training GANs. arXiv:1606.03498 [cs]. URL: http://arxiv.org/abs/1606.03498.

Sandfort, V., Yan, K., Pickhardt, P.J., Summers, R.M., 2019. Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Scientific Reports 9, 16884. doi:10.1038/s41598-019-52737-x.

Shin, H.C., Tenenholtz, N.A., Rogers, J.K., Schwarz, C.G., Senjem, M.L., Gunter, J.L., Andriole, K.P., Michalski, M., 2018. Medical Image Synthesis for Data Augmentation and Anonymization Using Generative Adversarial Networks, in: Gooya, A., Goksel, O., Oguz, I., Burgos, N. (Eds.), Simulation and Synthesis in Medical Imaging. Springer International Publishing, Cham, pp. 1–11. doi:10.1007/978-3-030-00536-8_1.

Shokri, R., Stronati, M., Song, C., Shmatikov, V., 2017. Membership Inference Attacks Against Machine Learning Models, in: 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. doi:10.1109/SP.2017.41. ISSN: 2375-1207.

Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs]. URL: http://arxiv.org/abs/1409.1556.

Sorin, V., Barash, Y., Konen, E., Klang, E., 2020. Creating Artificial Images for Radiology Applications Using Generative Adversarial Networks (GANs) – A Systematic Review. Academic Radiology. URL: https://linkinghub.elsevier.com/retrieve/pii/S1076633220300210, doi:10.1016/j.acra.2019.12.024.

Wachinger, C., Golland, P., Kremen, W., Fischl, B., Reuter, M., Alzheimer's Disease Neuroimaging Initiative, 2015. BrainPrint: a discriminative characterization of brain morphology. NeuroImage 109, 232–248. doi:10.1016/j.neuroimage.2015.01.032.

Wu, B., Zhao, S., Chen, C., Xu, H., Wang, L., Zhang, X., Sun, G., Zhou, J., 2019. Generalization in Generative Adversarial Networks: A Novel Perspective from Privacy Protection. arXiv:1908.07882 [cs, stat]. URL: http://arxiv.org/abs/1908.07882.

Xie, L., Lin, K., Wang, S., Wang, F., Zhou, J., 2018. Differentially Private Generative Adversarial Network. arXiv:1802.06739 [cs, stat]. URL: http://arxiv.org/abs/1802.06739.

Yi, X., Walia, E., Babyn, P., 2019. Generative Adversarial Network in Medical Imaging: A Review. Medical Image Analysis 58, 101552. URL: http://arxiv.org/abs/1809.07294.