Measuring Utility and Privacy of Synthetic Genomic Data
Bristena Oprisanu (UCL), Georgi Ganev (UCL and Hazy), Emiliano De Cristofaro (UCL and Alan Turing Institute)
Abstract
Genomic data provides researchers with an invaluable source of information to advance progress in biomedical research, personalized medicine, and drug development. At the same time, however, this data is extremely sensitive, which makes data sharing, and consequently availability, problematic if not outright impossible. As a result, organizations have begun to experiment with sharing synthetic data, which should mirror the real data's salient characteristics, without exposing it. In this paper, we provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data.

First, we assess the performance of the synthetic data on a number of common tasks, such as allele and population statistics, as well as linkage disequilibrium and principal component analysis. Then, we study the susceptibility of the data to membership inference attacks, i.e., inferring whether a target record was part of the data used to train the model producing the synthetic dataset. Overall, there is no single approach for generating synthetic genomic data that performs well across the board. We show how the size and the nature of the training dataset matter, especially in the case of generative models. While some combinations of datasets and models produce synthetic data with distributions close to the real data, there often are target data points that are vulnerable to membership inference. Our measurement framework can be used by practitioners to assess the risks of deploying synthetic genomic data in the wild, and will serve as a benchmark tool for researchers and practitioners in the future.
Progress in genome sequencing is helping pave the way towards prevention, diagnosis, and treatment of several diseases and conditions. Much of this progress is dependent on the availability, and consequently the sharing, of genomic data. Numerous initiatives have been established to support and encourage genomic data sharing, and funding agencies like the National Institutes of Health (NIH) often make it a requirement to fund grant applications [42]. Successful data sharing programs include the International HapMap Project [38], which helped identify common genetic variations and study their involvement in human health and disease, as well as the 1000 Genomes Project [4], which aims to create a catalog of human variation and genotype data.

Overall, data sharing in genomics is crucial to enable progress in Precision Medicine [21]. Unsurprisingly, however, this is inherently at odds with the need to protect individuals' privacy. Genomic data contains sensitive information related to heritage, predisposition to diseases, phenotype traits, etc., which makes it hard to anonymize [23]. Hiding "sensitive" portions of the genome is not effective either, as sensitive information can still be inferred via high-order correlation models [48]. For a thorough review of privacy threats in genomics, please see [8, 40, 59].

As a result, genomics researchers have begun to investigate the possibility of releasing synthetic datasets, rather than real/anonymized data [46]. This follows a general trend in healthcare; for instance, the National Health Service (NHS) in England has recently concluded a project focused on releasing synthetic Emergency Room ("A&E") records [41]. The intuition is to use generative models to learn to generate samples with the same characteristics (more precisely, with the same distribution) as the real data.
That is, rather than releasing data of actual individuals, entities share artificially generated data in such a way that the statistical properties of the original data are preserved, while minimizing the risk of malicious inference of sensitive information [16].
Generative Models and Genomics.
Specific to genomics, previous work has experimented with both statistical and machine learning generative models. Samani et al. [48] propose an inference model based on the recombination rate, which can also be used to generate new synthetic samples. Yelmen et al. [61] use Generative Adversarial Networks (GANs) and Restricted Boltzmann Machines (RBMs) to mimic the distribution of real genomes and capture population structures. Finally, Killoran et al. [33] use ad-hoc training techniques for GANs and architectures designed for computer vision tasks.
Motivation.
Prior work on generating synthetic data in genomics has thus far only scratched the surface with respect to assessing its utility, and more specifically its statistical fidelity. Moreover, we do not really know whether these approaches actually provide any meaningful privacy guarantees. To address this gap, we introduce a novel evaluation framework and perform a series of measurements geared to assess both the utility and the privacy of five state-of-the-art models used to generate human genomic synthetic data. More specifically:

• Utility. We focus on a number of very common computational tasks on genomic data. We measure how well generative models preserve summary statistics (e.g., allele frequencies, population statistics) and linkage disequilibrium. We also assess how close the distributions of synthetic data are to those of real data for principal component analysis.

• Privacy. We mount membership inference attacks [29], having an attacker infer whether a target record was part of the real data used to train the model producing the synthetic dataset. More precisely, we quantify the privacy gained, vis-à-vis this attack, from releasing synthetic data vs. releasing the real dataset. In the process, we also introduce a novel attack where the adversary only has partial information for a target individual.
Main Findings.
Overall, our evaluation shows that there is no single approach for generating genomic synthetic data that performs well across the board, both in terms of utility and privacy. However, some models provide high utility as well as increased privacy protection. Among other things, we find that:

• A high-order correlation model (Recomb) has the best utility metrics for small datasets, but does so at the cost of privacy, even against weaker adversaries who only have partial information available.

• The RBM model performs better with increasing dataset sizes, both in terms of utility and privacy, as more targets benefit from a privacy gain when synthetic data is generated using a larger training set.

• There are combinations of target and training sets for which releasing the synthetic dataset increases the privacy loss, but one cannot meaningfully predict what these combinations will be without actually running the privacy evaluation part of our framework.
In this section, we introduce background information about genomics; then, we present the privacy metrics and datasets used in our evaluation.
Genomes and Genes.
The genome represents the entirety of an organism's hereditary information. It is encoded in DNA (deoxyribonucleic acid); each DNA strand is made up of four chemical units, called nucleotides, represented by the letters A, C, G, and T. The human genome consists of approximately 3 billion nucleotides, which are packaged into thread-like structures called chromosomes. The genome includes both the genes and the non-coding sequences of the DNA. The former determine specific traits, characteristics, or control activity within an organism. We refer to the group of genes that were inherited together from a single parent as a haplotype. An allele is a variant form of a gene; any individual inherits two alleles for each gene, one from each of their parents. The genotype consists of the alleles that an organism has for a particular characteristic.
SNPs and SNVs.
About 99.5% of the genome is shared among all humans; the rest differs due to genetic variations. Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation. They occur at a single position in the genome and in at least 1% of the population. SNPs are usually biallelic and can be encoded by {0, 1, 2}, with 0 denoting a combination of two major (i.e., common) alleles, 2 a combination of two minor alleles, and 1 a combination of a major and a minor allele (also referred to as a heterozygous SNP). Single nucleotide variants (SNVs) are single nucleotide positions in the genomic DNA at which different sequence alternatives exist [18].
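The {0, 1, 2} encoding above can be illustrated with a short snippet (a minimal sketch; the function name and the example genotypes are ours, not from any specific dataset):

```python
# Minimal sketch of the biallelic SNP encoding described above.
# The genotype strings and the choice of major allele ("A") are illustrative only.
def encode_snp(genotype: str, major: str) -> int:
    """Encode a two-allele genotype as 0 (two major), 1 (heterozygous), or 2 (two minor)."""
    return sum(1 for allele in genotype if allele != major)

print(encode_snp("AA", major="A"))  # two major alleles -> 0
print(encode_snp("AG", major="A"))  # heterozygous SNP  -> 1
print(encode_snp("GG", major="A"))  # two minor alleles -> 2
```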
Recombination Rate (RR).
Recombination determines the frequency with which characteristics are inherited together. The RR is the probability that a transmitted haplotype constitutes a new combination of alleles, different from that of either parental haplotype [14].
Genome-Wide Association Studies (GWAS).
GWAS are hypothesis-free methods for identifying associations between genetic regions and traits. A typical GWAS looks for common variants in a number of individuals, both with and without a trait, using genome-wide SNP arrays [17, 39].

Algorithm 1: MIATrain [53]
Input: a generative model GM(), the target record t, a reference dataset R of size n, the number n_s of synthetic test sets of size m, and the number k of shadow models.
Output: MIA_t()
1:  for i = 1, ..., k do
2:    R_i ∼ R^n
3:    f_i ∼ GM(R_i)
4:    for j = 1, ..., n_s do
5:      S_j^m ∼ f_i
6:      S_train ← S_train ∪ S_j^m
7:      l_train ← l_train ∪ {0}
8:    R'_i ← R_i ∪ {t}
9:    f'_i ∼ GM(R'_i)
10:   for j = 1, ..., n_s do
11:     S_j^m ∼ f'_i
12:     S_train ← S_train ∪ S_j^m
13:     l_train ← l_train ∪ {1}
14: MIA_t() ← Classifier(S_train, l_train)

Algorithm 2: MIAGain [53]
MIAGain [53]
Input: a generative model GM(), the target record t, the target training set R_out^t of size n, the size m of the synthetic dataset, the number n_s of synthetic test sets, a reference dataset R_a, and the number k of shadow models.
Output: PG_t
1:  f_out ∼ GM(R_out^t)
2:  for i = 1, ..., n_s do
3:    S_i ∼ f_out^m
4:    S_test ← S_test ∪ S_i
5:  R_in^t ← R_out^t ∪ {t}
6:  f_in ∼ GM(R_in^t)
7:  for i = 1, ..., n_s do
8:    S_i ∼ f_in^m
9:    S_test ← S_test ∪ S_i
10: MIA_t() ← MIATrain(GM(), t, R_a, n, m, n_s, k)
11: MIA_t(S_test) ← Σ_{S_i ∈ S_test} Pr[MIA_t(S_i) = 1] / (2 · n_s)
12: PG_t ← (1 − MIA_t(S_test)) / 2

A well-understood privacy threat in genomics is determining whether the data of a target individual is part of an aggregate genomic dataset, or mixture. This is known as a membership inference attack (MIA) [30, 58, 64]. The ability to infer the presence of an individual's data in a dataset constitutes an inherent privacy leak whenever the dataset has some sensitive attributes. For instance, if a mixture includes DNA from patients with a specific disease, learning that a particular person is part of that mixture exposes their health status. Overall, genomic data contains some of the most sensitive information about a person's past, present, and future; therefore, MIAs against genomic data prompt severe privacy threats, including denial of life or health insurance, revealing predisposition to diseases and conditions, ancestry, etc.

MIAs have also been studied in the context of machine learning, aiming to infer whether or not a target data point was used to train a target model. This has been done both for discriminative [11, 45, 47, 49] and generative models [11, 24, 26]. Inferring training set membership might yield serious privacy violations. For instance, if a model for drug dose prediction is trained using data from patients with a certain disease, or synthetic health images are produced by a generative model trained on patients' images, learning that data of a particular individual was part of the training set leaks information about that person's health. Overall, MIAs are also used as signals that access to a target model is "leaky," and can be a gateway to additional attacks [15].

Recently, Stadler et al. [53] presented a comprehensive evaluation of MIAs in the context of synthetic data, showing that even access to a single synthetic dataset output by the target model can lead to serious privacy leakage.
In our evaluation, we adapt their attacks to the context of genomic data and rely on their "Privacy Gain" metric (presented next) to quantify the difference in privacy leakage when releasing a synthetic dataset instead of the real one.
As our main privacy evaluation metric, we use the Privacy Gain (PG) [53]. The PG quantifies the privacy advantage obtained by a target t, vis-à-vis an MIA adversary, when a synthetic dataset is published as opposed to the raw data.

MIA Training. As depicted in Algorithm 1, the adversary is trained as follows. First, we take a reference dataset (which may or may not overlap with the raw data used to generate the synthetic dataset) and use it to generate a synthetic dataset (lines 1–6). This dataset is labeled as 0, i.e., it does not include the target record (line 7). The target record is then added to the dataset (line 8), and synthetic data is generated from the new dataset, which includes the target record; this dataset is labeled as 1 (lines 10–13). These models are referred to as generative shadow models. Finally, the adversary uses the synthetic datasets to train a classifier (line 14), which distinguishes whether or not the target was used in the training of a generative model.
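To make the procedure concrete, the shadow-model training loop (Algorithm 1) and the PG computation it feeds into can be sketched in Python. The resampling "generative model," the histogram features, and the logistic-regression classifier below are our illustrative stand-ins, not the implementation used in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def gm(train):
    """Toy stand-in for a generative model: resample rows and flip ~1% of SNP values."""
    synth = train[rng.integers(0, len(train), size=len(train))].copy()
    flips = rng.random(synth.shape) < 0.01
    return np.where(flips, 2 - synth, synth)       # values stay in {0, 1, 2}

def features(synth):
    """Histogram feature set: per-attribute frequency of each {0, 1, 2} category."""
    return np.concatenate([(synth == v).mean(axis=0) for v in (0, 1, 2)])

def mia_train(target, reference, n=50, n_s=5, k=10):
    """Algorithm 1 sketch: shadow models with/without the target, then a classifier."""
    X, y = [], []
    for _ in range(k):
        r_i = reference[rng.choice(len(reference), size=n, replace=False)]
        for label, train in ((0, r_i), (1, np.vstack([r_i, target]))):
            for _ in range(n_s):                   # n_s synthetic sets per shadow model
                X.append(features(gm(train)))
                y.append(label)
    return LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))

def privacy_gain(attack, test_sets):
    """PG_t = (1 - MIA_t(S_test)) / 2, averaging Pr[MIA_t(S_i) = 1] over the test sets."""
    probs = [attack.predict_proba(features(s).reshape(1, -1))[0, 1] for s in test_sets]
    return (1 - sum(probs) / len(probs)) / 2

reference = rng.integers(0, 3, size=(200, 100))    # toy data: 200 records x 100 SNPs
target = rng.integers(0, 3, size=(1, 100))
attack = mia_train(target, reference)
pg = privacy_gain(attack, [gm(reference[:50]) for _ in range(10)])
```

By construction, the returned value lies in [0, 0.5], matching the PG range discussed below.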
PG Estimation.
To estimate the PG for a fixed target record and input dataset, we use Algorithm 2. First, the algorithm takes the target training set and generates n_s synthetic datasets without the target record (lines 1–4). Then, the target record is added to the training set and n_s synthetic datasets are generated from this dataset (lines 5–9). Then, the adversary trains their attack model MIATrain (line 10). Finally, the PG for a target t is computed as PG_t = (1 − MIA_t(S_test)) / 2, where MIA_t(S_test) = Σ_{S_i ∈ S_test} Pr[MIA_t(S_i) = 1] / (2 · n_s) (lines 11–12).

Put simply, the PG is quantified as the difference between the probability that an attacker correctly identifies that the target record belongs to the real dataset (which is equal to 1 in this case) and the probability that the attacker correctly identifies that the target record was used in training a generative model that outputs a synthetic dataset.

PG Values.
The PG ranges between 0, when publishing the synthetic dataset leads to the same privacy loss as publishing the real dataset (i.e., MIA_t(S_test) = 1), and 0.5, when publishing the synthetic dataset perfectly protects the target from MIA (i.e., MIA_t(S_test) = 0). This means that PG = 0.25 when the probability of the adversary inferring whether or not a target is part of the training set used to generate the synthetic dataset is the same as random guessing (i.e., MIA_t(S_test) = 0.5).

Dimensionality Reduction.
To reduce the effects of high dimensionality, the MIA attacker first maps the synthetic data to a lower-dimensional feature space. This allows them to more easily detect the influence of the target record on the training dataset. We experiment with four different feature sets, as done in [53]: namely, a naive feature set, which encodes the number of distinct categories plus the most and least frequent category for each attribute; a histogram feature set, which computes the frequency counts for each attribute; a correlation feature set, which encodes pairwise correlations between attributes; and an ensemble feature set, which combines all the previously mentioned feature sets.
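The four feature sets can be sketched as follows, assuming a {0, 1, 2}-encoded synthetic dataset; the function names are ours, and this is an illustration of the idea rather than the implementation from [53]:

```python
import numpy as np

def naive_features(data):
    """Per attribute: number of distinct categories, most and least frequent category."""
    feats = []
    for col in data.T:
        values, counts = np.unique(col, return_counts=True)
        feats += [len(values), values[counts.argmax()], values[counts.argmin()]]
    return np.array(feats, dtype=float)

def histogram_features(data, categories=(0, 1, 2)):
    """Frequency counts of each category for each attribute."""
    return np.concatenate([(data == c).mean(axis=0) for c in categories])

def correlation_features(data):
    """Pairwise correlations between attributes (upper triangle only)."""
    corr = np.corrcoef(data.T)
    return corr[np.triu_indices_from(corr, k=1)]

rng = np.random.default_rng(0)
synthetic = rng.integers(0, 3, size=(100, 20))   # toy synthetic dataset

# Ensemble feature set: concatenation of all of the above.
ensemble = np.concatenate([naive_features(synthetic),
                           histogram_features(synthetic),
                           correlation_features(synthetic)])
```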
In our evaluation, we use data from two projects: HapMap [38] and the 1000 Genomes Project [4]. More specifically, we use 1,000 SNPs from chromosome 13 from the following three datasets:

1. CEU Population (HapMap). Samples from 117 Utah residents with Northern and Western European ancestry, released in phase 2 of the HapMap project.

2. CHB Population (HapMap). Samples from 120 Han Chinese individuals from Beijing, China.

3. 1000 Genomes. Samples from 2,504 individuals from 26 different populations, released in phase 3 of the 1000 Genomes project.
The ability to effectively train statistical models [7] on genomic data is very important in many applications. As many models suffer from the "curse of dimensionality" [5], i.e., models do not usually perform well on small datasets with high dimensionality, statistical and generative models have been proposed not only to mitigate possible privacy concerns of sharing genomic data but also to help "inflate" the size of the datasets for more meaningful analysis.

In this section, we provide an overview of the state-of-the-art models for generating synthetic genomic data. In particular, we discuss the Recombination model proposed by Samani et al. [48], the RBM and GAN models proposed by Yelmen et al. [61], and the WGAN model from Killoran et al. [33]. We also introduce and consider two other "hybrid" models.
Recombination Model (Recomb).
Samani et al. [48] propose the use of a recombination model as an inference method for quantifying individuals' genomic privacy. This is a statistical model, based on high-order SNV correlation, that relates linkage disequilibrium patterns to the underlying recombination rate. Given a set of sampled haplotypes, the model relates their distribution to the underlying recombination rate.

Also, [48] shows how to use this method to generate synthetic samples in order to perform Principal Component Analysis (PCA). The recombination model yields a distribution closer to the real data than models using only linkage disequilibrium and allele frequencies. In order to obtain the underlying recombination rate, the model uses a "genetic map," which includes the recombination rate. This is provided with the dataset for the HapMap datasets, but not for the 1000 Genomes data; for the latter, we use the scripts from [2].
Restricted Boltzmann Machines (RBMs).
RBMs [52] are generative models geared to learn a probability distribution over a set of inputs. RBMs are shallow, two-layer neural networks: the first layer is known as the "visible" (or input) layer and the second as the hidden layer. The two layers are connected via a bipartite graph, i.e., every node in the visible layer is connected to every node in the hidden one, but no two nodes in the same layer are connected to each other, which allows for more efficient training algorithms. The learning procedure consists of maximizing the likelihood function over the visible variables of the model. RBM models re-create data in an unsupervised manner through many forward and backward passes between the two layers, corresponding to sampling from the learned distribution; the output of each pass goes through an activation function, which then becomes the input of the next pass. RBMs are typically used for dimensionality reduction, classification, regression, collaborative filtering, topic modeling, etc.

As mentioned earlier, Yelmen et al. [61] use RBMs to generate synthetic genomic data. In our evaluation, we follow the same RBM settings as [61]. More specifically, we use a ReLU activation function, with the visible layer having the same size as the input we considered (1,000 features) and with the number of hidden nodes set to 100. The learning rate is set to 0.01, the batch size to 32, and we iterate over 2,000 epochs.
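As a rough illustration of this train-then-sample loop, one could use scikit-learn's BernoulliRBM. Note this is only a sketch under simplifying assumptions, not the authors' implementation: BernoulliRBM uses logistic units on binary inputs (rather than ReLU as in [61]), the data below is a random stand-in, and the epoch count is reduced to keep it fast:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
# Toy stand-in for a haplotype matrix: 200 samples x 1,000 binary SNP values.
X = (rng.random((200, 1000)) < 0.3).astype(float)

# Hyperparameters loosely mirroring [61]: 100 hidden nodes, learning rate 0.01,
# batch size 32 (far fewer iterations than the paper's 2,000 epochs).
rbm = BernoulliRBM(n_components=100, learning_rate=0.01, batch_size=32,
                   n_iter=5, random_state=0)
rbm.fit(X)

# Sample synthetic haplotypes via repeated forward/backward (Gibbs) passes.
v = (rng.random((10, 1000)) < 0.5).astype(float)
for _ in range(50):
    v = rbm.gibbs(v)
synthetic = v.astype(int)
```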
Generative Adversarial Networks (GANs).
A GAN is an unsupervised deep learning model consisting of two neural networks, a generator and a discriminator, which compete against each other in a game setting. During training, the generator's goal is to produce synthetic data, while the discriminator evaluates the generated samples against real data samples in order to distinguish the synthetic from the real ones. The training objective is to learn the data distribution so that the data samples produced by the generator cannot be distinguished from real data by the discriminator.

Again, we use the GAN approach proposed by Yelmen et al. [61], mirroring their experimental settings. More precisely, the generator model consists of an input layer with latent dimension set to 600, and two hidden layers of sizes 512 and 1,024, respectively. The discriminator consists of an input layer with size equal to the number of SNPs evaluated (1,000), two hidden layers of sizes 512 and 256, respectively, and an output layer of size 1. The output layer of the generator uses tanh as the activation function, and the output layer of the discriminator uses the sigmoid activation function. We compile both the generator and the discriminator using the Adam optimizer and binary cross-entropy as the loss function.
Recombination RBM (Rec-RBM).
To overcome issues caused by low numbers of training samples, we propose a hybrid approach combining the Recomb and RBM models. That is, we use the former to generate extra samples, which we then use, together with the real data samples, to train the RBM model with the same parameters as before. We do so to explore whether having more data points available to train the model improves the utility of the synthetic data.
Recombination GAN (Rec-GAN).
Similar to Rec-RBM, we use the Recomb model to generate extra training samples for the GAN model, using the same parameters as before. Again, we want to study whether having a larger dataset available for training the GAN improves the overall utility of the synthetic data it outputs.
Wasserstein GAN (WGAN).
Killoran et al. [33] propose an alternative GAN model by treating DNA sequences as a hybrid between natural language and computer vision data. The sequences are one-hot encoded; the GAN is based on a WGAN architecture trained with a gradient penalty [22]; and both the generator and discriminator use convolutional neural networks [34] and a residual architecture [25], which includes skip connections that jump over some layers. The authors also propose a joint method combining the GAN model with an activation maximization design [37, 51, 63] in order to tune the sequences to have desired properties. We do not, however, include the joint model in our evaluation, as we focus on a range of statistics as opposed to a single desired property.

In our evaluation, we use the WGAN model with the default parameters from the implementation in [1]. The generator consists of an input layer with the dimension of the latent space set to 100, followed by a hidden layer with size 100 times the length of the sequence (1,000), which is then reshaped to (length of the sequence, 100), followed by 5 resblocks. Finally, there is a 1-D convolutional layer followed by the output layer, which uses softmax. The discriminator has a very similar architecture but in the reverse order, i.e., it starts with the input layer, to which the one-hot sequences are fed, followed by the 1-D convolutional layer, then the 5 resblocks, followed by the reshape layer and the output layer of size 1. We perform 5 discriminator updates for every generator update. Both the generator and discriminator use the Adam optimizer with learning rates set to 0.0001, while the loss, as mentioned, is adjusted by a gradient penalty. We use a batch size of 64. In our experiments, the WGAN model converges after about 80 iterations; so, as opposed to the 100,000 proposed by the authors, we train the model for 100 iterations.
We now perform a comprehensive utility evaluation of the synthetic data generated by the models presented in Section 3. We look at common summary statistics used in genome-wide association studies, aiming to assess the accuracy loss due to the use of synthetic datasets. More specifically, we analyze how well data generated by the generative models preserves allele frequencies, population statistics, and linkage disequilibrium, and how close the distribution of the synthetic data is to that of the real data for principal component analysis.
Major Allele Frequency (MAF).
In population genetics, the major allele frequency (MAF) is routinely used to provide helpful information to differentiate between common and rare variants in the population, as it quantifies the frequency at which the most common allele occurs in a given population. We start our utility analysis by comparing MAFs in the synthetic data vs. the real data.

In Figure 1, we plot the MAF at each position for the real datasets and for the synthetic samples, over the CEU and CHB populations, and the 1000 Genomes dataset. For CEU/CHB (Figures 1a–1b), we observe that Recomb and WGAN best replicate the allele frequencies in the real data. On the other hand, GAN and Rec-GAN fail to do so, and in fact, the generated samples seem random. The RBM model, even though not as close to the real frequencies as Recomb, performs better than the GAN and Rec-GAN models. In fact, RBM further improves when combined with Recomb (see Rec-RBM).

Figure 1: Major allele frequencies for synthetic data generated by the models, plotted against the real data, for the CEU population (a), the CHB population (b), and the 1000 Genomes dataset (c).

For 1000 Genomes (Figure 1c), Recomb's MAF distribution is also similar to the real data's. However, RBM and Rec-RBM both display MAFs close to the real data, whereas, even with more training samples available, the GAN and Rec-GAN models still seem to produce random results. Moreover, WGAN does not match the MAF distribution for this population as closely. Overall, the difference in the MAF distributions across datasets is likely due to fewer samples being available for the HapMap populations compared to the 1000 Genomes.
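Under the {0, 1, 2} encoding from Section 2, the per-position MAF comparison can be sketched as follows (toy random matrices stand in for the real and synthetic genotype data):

```python
import numpy as np

def major_allele_frequency(genotypes):
    """MAF per SNP from a (samples x SNPs) matrix of {0, 1, 2} minor-allele counts.
    Each sample carries two alleles, so the minor-allele frequency is mean(count) / 2."""
    minor_freq = genotypes.mean(axis=0) / 2
    return np.maximum(minor_freq, 1 - minor_freq)   # frequency of the most common allele

rng = np.random.default_rng(1)
real = rng.integers(0, 3, size=(117, 1000))         # CEU-sized toy matrix (117 samples)
synthetic = rng.integers(0, 3, size=(117, 1000))

# Per-position gap between real and synthetic MAFs, as plotted in Figure 1.
maf_gap = np.abs(major_allele_frequency(real) - major_allele_frequency(synthetic))
```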
Alternate Allele Correlation (AAC).
To evaluate whether the real and synthetic data are genetically different, in Figure 2 we plot the alternate allele correlation (AAC). The more similar two populations are, the closer the SNPs should be to the diagonal, as in the leftmost plots, where we compare the real data against itself. The strongest AAC is with the synthetic data generated by Recomb. At the opposite end of the spectrum, the synthetic data generated by GAN and Rec-GAN have weak correlations. For the CEU and CHB populations, we find Rec-RBM to yield stronger AACs than plain RBM and WGAN. For the 1000 Genomes dataset (Figure 2c), there is a strong correlation between the alternate alleles for the real data and Recomb, RBM, Rec-RBM, and WGAN.
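The diagonal comparison behind Figure 2 can be summarized numerically with a Pearson correlation between per-SNP alternate allele counts; the snippet below is a sketch on toy data, with a hypothetical 5% corruption standing in for generator error:

```python
import numpy as np

def alternate_allele_counts(genotypes):
    """Total alternate (minor) allele count per SNP, with the {0, 1, 2} encoding."""
    return genotypes.sum(axis=0)

rng = np.random.default_rng(2)
real = rng.integers(0, 3, size=(120, 500))           # CHB-sized toy matrix
synthetic = real.copy()
noise = rng.random(synthetic.shape) < 0.05           # hypothetical 5% corruption
synthetic[noise] = rng.integers(0, 3, size=noise.sum())

# Points near the diagonal of (real counts, synthetic counts) indicate similarity;
# the Pearson correlation summarizes how tight that diagonal is.
r = np.corrcoef(alternate_allele_counts(real),
                alternate_allele_counts(synthetic))[0, 1]
```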
Site Frequency Spectrum (SFS).
Another summary statistic that captures essential information about the underlying distribution of the allele frequencies of a given set of SNPs in a population or sample is the SFS [19, 20]. Basically, it provides a histogram whose size depends on the number of sequenced individuals. In Figure 3, we plot the scaled folded SFS, which is the distribution of counts of minor alleles in a sample, calculated over all segregating sites. We scale this value so that a constant value is expected across the spectrum for neutral variation and constant population size, which yields the best visual comparisons. If the distribution of allele frequencies for the synthetic samples matched that of the real data, we would expect to see the two spectra aligned.

Figure 2: Alternate allele correlation for the CEU population, the CHB population, and the 1000 Genomes dataset.

Figure 3: Frequency spectrum analysis for the CEU population, the CHB population, and the 1000 Genomes dataset.

With the HapMap populations (Figures 3a–3b), Rec-GAN suggests an excess of rare variants for a minor allele frequency around 0.1, whereas GAN seems to generate data closer to a neutral expectation, i.e., the synthetic dataset describes a more stable population. Similarly, for the 1000 Genomes (Figure 3c), Rec-GAN has an excess of rare variants for a minor allele frequency below 0.1, which is also displayed, at a lower scale, by the GAN-generated data.

We also compute the Kolmogorov-Smirnov (KS) two-sample test [27] for goodness of fit on the SFS for each dataset vs. the synthetic data (see Table 1). The test compares the agreement between the cumulative distributions of two independent samples. For every two-sample test, the 95% critical value is approximately 0.195 (as we have 100 samples in each dataset), so we can reject the null hypothesis (that there is no difference between the distributions) for all synthetic data above this value. For both CEU and CHB, we cannot reject the null hypothesis only for the samples generated by the Recomb and WGAN models. For the 1000 Genomes dataset, we reject the null hypothesis for synthetic data generated by GAN and Rec-GAN.
Table 1: Two-sample (real vs. synthetic data) Kolmogorov-Smirnov test (D statistic and p-value) performed on the SFS and the percentage of heterozygous samples, for the CEU, CHB, and 1000 Genomes datasets.
Next, we look at population statistics to determine how close the synthetic data is to the real dataset. In particular, we look at the percentage of heterozygous variants for both real and synthetic samples, at the fixation index, and at the Euclidean Genetic Distance.
Heterozygosity.
The condition of having two different alleles at a locus is denoted as heterozygosity. The percentage of heterozygous variants is commonly used in population studies, as a low percentage of heterozygous variants implies less diversity in the population. In Figures 4a–4b, we plot the percentage of heterozygous variants in each sample for the CEU/CHB populations, comparing the real statistics (blue/leftmost bars) vs. those computed on the synthetic data. In both cases, Recomb and WGAN yield percentages similar to the real dataset, whereas, with GAN and RBM, the percentage decreases, suggesting that both models produce more homozygous variants. Moreover, even though Rec-RBM produces variants with major allele frequencies closer to the real data, its percentage of heterozygous variants turns out to be the lowest for both populations. By contrast, Rec-GAN produces a higher percentage of heterozygous variants than GAN, even though the major allele frequencies are not aligned with the original samples.

With the 1000 Genomes (Figure 4c), the percentage of heterozygous samples in the real data is lower across all samples. Once again, and in line with previous results, GAN and Rec-GAN significantly deviate from the percentages of heterozygous samples found in the real data.

We also run a Kolmogorov-Smirnov (KS) two-sample test [27] for goodness of fit on the percentage of heterozygous samples for each dataset vs. the synthetic data (see Table 1). Interestingly, for both Recomb and WGAN, we do not reject the null hypothesis for the CEU dataset, but we do for the CHB dataset. In fact, for all of the models trained on the CHB dataset, we reject the null hypothesis. For the 1000 Genomes dataset, we do not reject the null hypothesis only for RBM.
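With the {0, 1, 2} encoding, heterozygous variants are exactly the positions encoded as 1, so the per-sample percentage reduces to a one-liner (toy data again stands in for the real matrices):

```python
import numpy as np

def pct_heterozygous(genotypes):
    """Percentage of heterozygous variants (encoded as 1) per sample, {0, 1, 2} encoding."""
    return 100 * (genotypes == 1).mean(axis=1)

rng = np.random.default_rng(4)
real = rng.integers(0, 3, size=(117, 1000))   # CEU-sized toy matrix
het = pct_heterozygous(real)                  # one percentage per sample, as in Figure 4
```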
Fixation Index (F_ST). Another way to assess how different population groups are from each other is the fixation index [28], which compares differences in allele frequency, with values ranging from 0 (not different) to 1 (completely different, i.e., no alleles in common). In Figure 5, we compare the fixation index values of the real data against the synthetic samples. For illustration purposes, we also include the F_ST of the real data against itself, which obviously yields F_ST = 0.

Recomb is once again the closest to the real data, which confirms the alignment from Figure 1. The F_ST value for the synthetic data produced by RBM is, for both the CEU and CHB populations, less than 0.10; the hybrid Rec-RBM model further reduces this value to less than 0.04, and so does WGAN. For both populations, data generated by GAN and Rec-GAN has the highest F_ST, although the latter increases it for the CHB population and reduces it for the CEU population. Finally, for the 1000 Genomes dataset, Recomb, RBM, and Rec-RBM all have F_ST close to the real data. While still having a low F_ST, WGAN has a slightly higher value, whereas, with GAN and Rec-GAN, F_ST significantly deviates from the real data, even with the increased number of samples of this dataset.

Euclidean Genetic Distance (EGD).
Since the fixation index does not easily allow for pairwise comparisons among populations, in Figure 6, we plot the Euclidean Genetic Distance (EGD) between the samples in each dataset. EGD is routinely used as a measure of divergence between populations, and shows the number of differences, or mutations, between two populations; a distance of 0 means that there is no difference, i.e., there is an exact match. From Figures 6a–6b, where the EGD on the diagonal is 0, we observe that, for both the CEU and CHB populations, the synthetic samples generated by GAN are closer to each other than those generated by the other models. Rec-GAN generates samples with EGD close to 0, suggesting that there are very few differences between them, as well as samples with a distance of around 30. As for the other population statistics, Recomb generates the samples that most closely match the differences observed in the real data, for both populations. For RBM, the generated samples have fewer differences than the real data. Perhaps more interestingly, Rec-RBM yields samples with a higher divergence than the real data; this can be a consequence of the low percentage of heterozygous variants found in the synthetic samples generated by this model (recall Figure 4). The samples from WGAN match some of the differences observed in the real data, but the model also yields a few samples with a higher divergence.

Figure 4: Percentage of heterozygous variants in each sample in the dataset, for (a) the CEU population, (b) the CHB population, and (c) the 1000 Genomes dataset.

Finally, for the 1000 Genomes dataset (Figure 6c), we find that all samples in the real data have closer EGDs to each other. In fact, the samples generated by RBM yield a similar pattern in the EGD distances. Although Recomb, Rec-RBM, and WGAN do too, they exhibit a lower distance, on average, between samples. As for the CEU/CHB populations, the GAN and Rec-GAN models overall fail to capture the differences between samples.
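The two divergence measures above can be sketched as follows. This is an illustration under a 0/1/2 genotype encoding: the F_ST shown here is a basic Wright-style estimator, (H_T − H_S)/H_T, from allele frequencies, not necessarily the exact estimator of [28], and EGD is the plain Euclidean distance between two genotype vectors.

```python
# Illustrative F_ST (Wright-style) and Euclidean Genetic Distance (EGD),
# for populations given as lists of 0/1/2 genotype records.
from math import sqrt

def allele_freqs(pop):
    """Alternate-allele frequency at each site (each genotype carries 2 alleles)."""
    n_sites = len(pop[0])
    return [sum(ind[j] for ind in pop) / (2.0 * len(pop)) for j in range(n_sites)]

def fst(pop_a, pop_b):
    """F_ST = (H_T - H_S) / H_T, averaged over sites with H_T > 0."""
    num = den = 0.0
    for p, q in zip(allele_freqs(pop_a), allele_freqs(pop_b)):
        pbar = (p + q) / 2.0
        h_t = 2.0 * pbar * (1.0 - pbar)            # total expected heterozygosity
        h_s = (2 * p * (1 - p) + 2 * q * (1 - q)) / 2.0  # mean within-pop value
        if h_t > 0:
            num += h_t - h_s
            den += h_t
    return num / den if den else 0.0

def egd(ind_a, ind_b):
    """Euclidean Genetic Distance between two individuals."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(ind_a, ind_b)))

pop = [[0, 1, 2], [1, 1, 0]]
print(fst(pop, pop))               # 0.0: a population against itself
print(egd([0, 1, 2], [0, 1, 2]))   # 0.0: identical individuals, an exact match
```

As in the text, a population compared against itself yields F_ST = 0, and an EGD of 0 corresponds to an exact match between two individuals.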
Linkage disequilibrium (LD) captures the non-random association of alleles at two or more positions in a general population, i.e., those alleles do not occur randomly with respect to each other. In Genome-Wide Association Studies, LD allows researchers to optimize genetic studies, e.g., by avoiding genotyping SNPs that provide redundant information [10]. In Figure 7, we plot the r value for LD based on the Rogers-Huff method [44]. This ranges from 0 (there is no LD between the two SNPs) to 1 (the SNPs are in complete LD, i.e., the two SNPs have not been separated by recombination and have the same allele frequencies).

For the CEU and CHB populations, RBM generates samples that display a stronger LD than the real data. With more training samples, Rec-RBM yields a weaker LD, but still stronger than in the real data. On the other side of the spectrum, for Rec-GAN, the LD in the synthetic data is the weakest. For the 1000 Genomes dataset, we find a stronger LD between the real samples than with the other two datasets. RBM generates samples that are almost indistinguishable from the real data in terms of LD. The LD in the synthetic datasets generated by Recomb, Rec-RBM, and WGAN shows lower correlations than RBM, with GAN and Rec-GAN both failing to preserve the LD.

Figure 5: Fixation index values for the CEU and CHB populations, and the 1000 Genomes dataset.

Figure 6: Pairwise Euclidean Genetic Distance (EGD) between individuals, for (a) CEU, (b) CHB, and (c) the 1000 Genomes dataset.

Finally, we further study the difference between synthetic and real data by performing a principal component analysis (PCA) on the corresponding samples. We extract the first two principal components and project the real and synthetic datasets onto these two components to show how the synthetic samples are distributed, compared to the real data.

Figure 8 presents this 2D visualization. For both HapMap populations, Recomb has a distribution close to the real data, which, according to [48], is due to the fact that the genetic recombination model considers all the correlations between SNPs and builds a higher-order model. For the 1000 Genomes dataset, once again, the GAN and Rec-GAN models perform quite poorly, generating samples with a different distribution than the real samples. In contrast to the HapMap populations, the samples from Recomb are all centered around 0 and fail to simulate the distribution given by the real data; similar results hold for the samples generated by Rec-RBM.
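The LD statistic used in Figure 7 can be sketched concisely: Rogers and Huff [44] estimate r between two loci directly from unphased genotype counts, and on 0/1/2-coded data their estimator reduces, in essence, to the Pearson correlation between the two genotype vectors (this sketch omits their corrections and edge-case handling).

```python
# Illustrative LD statistic: Pearson correlation between 0/1/2 genotype
# counts at two loci, in the spirit of the Rogers-Huff estimator.
from math import sqrt

def ld_r(geno_x, geno_y):
    """Correlation between genotype counts at two loci (0 if either is constant)."""
    n = len(geno_x)
    mx, my = sum(geno_x) / n, sum(geno_y) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(geno_x, geno_y)) / n
    vx = sum((x - mx) ** 2 for x in geno_x) / n
    vy = sum((y - my) ** 2 for y in geno_y) / n
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

# Two perfectly linked loci (identical genotype columns) give r = 1.
print(ld_r([0, 1, 2, 1], [0, 1, 2, 1]))   # 1.0
```

Squaring this value gives the r² commonly reported in LD heatmaps such as Figure 7.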
Figure 7: Pairwise Linkage Disequilibrium for real vs. synthetic samples, for (a) CEU, (b) CHB, and (c) the 1000 Genomes dataset.

Our utility evaluation shows that, while generative machine learning models perform well in synthesizing image data, there is still a need for progress when it comes to genomic data. Overall, the Recomb model, which is based on high-order SNV correlations, generates synthetic data preserving most statistical properties displayed by the real data, even when few samples are available. We get better utility when the genetic map is included with the data, rather than generated from the existing data. With RBM, more training samples improve the quality of the synthetic data, as evidenced by the difference between the HapMap populations and the 1000 Genomes dataset.

We also find that, when few samples are available for training, the hybrid Rec-RBM approach helps improve the quality of samples compared to just RBM. This is clear from the utility of the synthetic data on the two smaller HapMap datasets. For the 1000 Genomes dataset, it is not surprising that the performance of Rec-RBM is worse compared to RBM, since Recomb does not generate samples as "useful" as for the other two datasets. Finally, the GAN and Rec-GAN models generate samples with the lowest utility, regardless of the number of samples available for training. However, the data generated by WGAN preserves most statistical properties of the real data.
Next, we evaluate the vulnerability of the synthetic data to Membership Inference Attacks (MIAs). To do so, we measure the Privacy Gain (PG, see Section 2.3) obtained by releasing a synthetic dataset instead of the real data. Recall that, if the synthetic data neither hides nor gives any additional information to an MIA attacker, PG_t, for a target record t, should have a value of around 0.25.

We present experiments for both a "standard" MIA and a novel attack, which we denote as MIA with partial information. The latter essentially assumes that the adversary only has access to partial data from the target sequence. We exclude GAN and Rec-GAN from the evaluation since they yield poor utility performance, so there is not really any point in evaluating their privacy.

Throughout our evaluation, we randomly choose 10 targets from each dataset across 10 test runs. In each run, we fix the target and sample a new training cohort. We train the attack classifier using 5 shadow models, using 100 synthetic training sets for each of them. We then evaluate the privacy gain on 100 synthetic datasets, with a split of 50 sets generated from a training set including the target, and 50 sets generated without. Finally, we report the PG for each test and each target as the average PG across all synthetic datasets tested.
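The shadow-model evaluation loop described above can be made concrete with a deliberately simplified sketch. None of the components below are the models or feature sets actually evaluated in the paper: the "generative model" is bootstrap resampling, the feature set is a genotype-value histogram, and the attack classifier is a nearest-centroid rule; they are hypothetical stand-ins that only illustrate how labelled synthetic datasets train and test an MIA classifier.

```python
# Schematic MIA-on-synthetic-data loop with toy stand-ins (NOT the paper's
# models): label = 1 iff the target was in the training data that produced
# the synthetic dataset; the attack classifier tries to recover that label.
import random

random.seed(0)

def generate_synthetic(train, size):
    """Toy 'generative model': bootstrap-resample the training records."""
    return [random.choice(train) for _ in range(size)]

def features(dataset):
    """Toy feature set: histogram of 0/1/2 genotype values over the dataset."""
    flat = [g for rec in dataset for g in rec]
    return [flat.count(v) / len(flat) for v in (0, 1, 2)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

population = [[random.randint(0, 2) for _ in range(20)] for _ in range(50)]
target = population[0]

def make_labelled(n_sets):
    """Feature vectors of synthetic datasets, labelled by target membership."""
    out = []
    for _ in range(n_sets):
        for label in (0, 1):
            train = random.sample(population[1:], 9) + ([target] if label else [])
            out.append((features(generate_synthetic(train, 30)), label))
    return out

# 'Shadow' phase: learn a nearest-centroid attack from labelled examples.
shadow = make_labelled(50)
cent = {l: [sum(f[i] for f, lab in shadow if lab == l) /
            sum(1 for _, lab in shadow if lab == l) for i in range(3)]
        for l in (0, 1)}

# Test phase: attack accuracy over fresh synthetic datasets.
test = make_labelled(50)
acc = sum((dist(f, cent[1]) < dist(f, cent[0])) == (lab == 1)
          for f, lab in test) / len(test)
print(f"attack accuracy: {acc:.2f}")
```

An accuracy near 0.5 means the synthetic datasets carry no usable membership signal for this (weak) attack; accuracy well above 0.5 corresponds to a low privacy gain for the target.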
We use three adversarial classifiers: K-Nearest Neighbors (KNN), Logistic Regression (LogReg), and Random Forest (RandForest). We use four feature sets, as described in Section 2.3: Naive (F_Naive), Histogram (F_Hist), Correlations (F_Corr), and an Ensemble feature set (F_Ens). In Figure 9, we report the PG value for targets randomly chosen from the two HapMap populations.
KNN.
For CEU, using KNN (Figure 9a, left), we find that over 74% of the targets in the synthetic dataset generated by Recomb have a PG lower than the random baseline (0.25) for the Ensemble, Correlations, and Histogram feature sets. With RBM, between 84% and 88% of the targets, depending on the feature set, have a PG of 0.25; in other words, for these targets, the probability of the adversary inferring their presence in the training set is the same as random guessing. However, between 10% and 15% of the targets have no PG at all, whereas there are between 1% and 4% of the targets, depending on the feature set, for which the synthetic data perfectly protects the target from MIA (PG=0.5). With Rec-RBM, at least 59% of the targets have a PG of 0.25 under the four feature sets. With WGAN, at least 48% of the targets have a PG of 0.25, depending on the feature set.

For the CHB population (Figure 9b, left), we find that over 60% of the targets generated by Recomb, with all feature sets, have PG below the random guess baseline. With RBM, between 89% and 97% of targets have PG of exactly 0.25 across all feature sets, i.e., the synthetic dataset generated by RBM neither hides nor gives the attacker new information about these targets' membership in the training set. Interestingly, for the Correlations feature set, there is no target that has PG lower than 0.25. As for the CEU population, about 50% of the targets across all feature sets have a PG of 0.25 for data from Rec-RBM, and at least 47% of targets from WGAN.

Figure 8: 2D Principal Component Analysis (PCA) visualization of the real and synthetic sequences, for (a) the CEU population, (b) the CHB population, and (c) the 1000 Genomes dataset.
LogReg.
Using LogReg, both Recomb and RBM have the lowest PG among all attack classifiers, for both HapMap populations. For CEU (Figure 9a, middle), using the Histogram feature set, 94% (resp., 96%) of the targets from Recomb (resp., RBM) have PG below 0.25, the random guess baseline. Under Correlations, 99% (resp., 97%) of the targets from Recomb (resp., RBM) have PG below 0.25, while, for the Ensemble feature set, 96% (resp., 98%) of the targets from Recomb (resp., RBM) have PG below 0.25. With Rec-RBM, we find that between 52% and 56% of the targets across all feature sets have PG above 0.25, and with WGAN, between 50% and 57% of the targets across all feature sets have PG above 0.25. Moreover, for the Rec-RBM and WGAN-generated data, there is no target that consistently has a lower PG than the random guess baseline across all test runs.

For CHB (Figure 9b, middle), with synthetic data generated by Recomb, the average PG is below the random baseline (0.25) for 99% of the targets with the Histogram feature set, 97% with Correlations, and 96% with Ensemble. For RBM, 79% of the targets with the Histogram feature set have a PG below 0.25. Under the Ensemble feature set, 84% of the targets have PG below 0.25. For synthetic data generated by Rec-RBM, we find that 54% of the targets from the Histogram and Ensemble feature sets have PG lower than 0.25, and 46% have PG over 0.25. For the Naive and Correlations feature sets, respectively, 45% and 47% of the targets have PG lower than the random guess baseline. For WGAN-generated data, the Correlations feature set yields the most targets (55%) with PG < 0.25.

Figure 9: Privacy Gain (PG) of different models over the two HapMap populations, for (a) the CEU population and (b) the CHB population, under the KNN, LogReg, and RandForest attack classifiers.

Figure 10: Privacy Gain (PG) of RBM over the 1000 Genomes dataset, for (a) a random target and (b) an outlier target.

RandForest.
When using RandForest as the attack classifier on data from the CEU population (Figure 9a, right), with RBM-generated data, 73% of the targets from both the Correlations and Histogram feature sets have lower PG than the random baseline; this holds for 69% and 62% of the targets with, respectively, the Ensemble and Naive feature sets. For the synthetic data generated by Rec-RBM, about 51% of targets have a PG of over 0.25, and 49% of the targets have a PG of less than 0.25, across all feature sets. For WGAN, between 46% and 59% of the targets from all feature sets have a PG less than the random guess baseline (0.25), with the Correlations feature set having the least percentage of vulnerable targets (59%).

For CHB (Figure 9b, right), the lowest privacy gain is for the samples generated by Recomb: over 79% of all targets for each of the four feature sets have PG lower than the random baseline. For the synthetic samples from RBM, 71%, 61%, and 53% of the targets under the Naive, Histogram, and Ensemble feature sets, respectively, have PG lower than 0.25. However, for the Correlations feature set, we find that 55% of the targets have a PG of 0.25, meaning that those targets are protected from MIA. For the synthetic samples generated by Rec-RBM, 54% of the targets from the Histogram and Ensemble feature sets, and 47% and 45% of the targets from the Correlations and Naive feature sets, respectively, have PG lower than 0.25. Finally, for WGAN, we find that between 44% and 53% of the targets have PG > 0.25.

1000 Genomes Population.
For the 1000 Genomes population, we focus our analysis on the RBM model, as it generated the synthetic data closest to the real data across all utility metrics evaluated.
Random Target.
In Figure 10a, we plot the PG for the synthetic data generated by RBM, for randomly chosen targets. With the RandForest MIA classifier, we observe that 56%, 75%, 92%, and 50% of the targets, under the four feature sets, have PG higher than the random guess baseline (PG ≥ 0.25).

Outlier Target.
To better understand whether, with more training data, a target's signal in the synthetic dataset is diluted, we also test an "extreme" outlier case. That is, we craft an outlier target that has only minor alleles at all positions. While we are aware that this case would be extremely rare in a real-world scenario, our goal is to observe whether, and how much, this impacts PG. To this end, in Figure 10b, we plot the PG of this outlier case across 10 test runs.

With RandForest, we find that, under the Ensemble feature set, PG is below 0.25 for 8 of the 10 test runs. In fact, this is the only combination of attack classifier and feature set for which a greater percentage of the targets have a lower privacy gain than in the random target case. For the Naive feature set, in only 3 of the test runs is PG below the random baseline. For the Correlations and Histogram feature sets, all test runs yield PG of 0.25 or above. With LogReg, 4 out of 10 of the test runs for the Naive, Histogram, and Correlations feature sets yield PG below 0.25. For Ensemble, this happens for 6 test runs. Finally, with KNN, across all feature sets, PG for all test runs is 0.25, i.e., the synthetic data does not disclose any membership information regarding the outlier.

While there are differences across classifiers and feature sets, the PG, for all test runs in this outlier target case, is centered around 0.25. This is evident from Figure 10b, and implies that, across all test runs, the accuracy of the MIA is not much better than random guessing.
The different combinations of datasets, attack classifiers, and feature sets yield varied results with respect to privacy. This is due to two main reasons: first, not all classifiers have the same accuracy on tasks for the same dataset, as shown in previous work [6]. Second, the features that the generative model preserves after training will be "reflected" in the synthetic data; thus, the PG will depend on the feature extraction method.

On the HapMap populations, while the utility evaluation shows that the Recomb-generated synthetic data is "closest" to the real data, this comes with a significant privacy loss in comparison to the other models. The RBM-generated synthetic data is the most vulnerable under the LogReg classifier, with at least 70% of the targets, across both populations and all feature sets, having PG below the random guess baseline. This suggests that, with few data samples available for training, the RBM model is likely to overfit and is thus susceptible to MIAs.

For Rec-RBM and WGAN, the attacker cannot reliably predict membership, i.e., the addition of extra samples from the Recomb model in the training of Rec-RBM dilutes the target's signal in the training data. However, for both models, we still find combinations of targets and training sets for the attack classifier for which PG is significantly lower than the random guess baseline; i.e., the synthetic data will still expose membership information about the respective targets.

On the 1000 Genomes dataset, PG values have a higher variation overall when the target is chosen randomly from the dataset than for the two smaller HapMap datasets. The results for RBM data confirm our hypothesis that the target's influence is diluted within larger datasets.
However, once again, this does not mean that membership inference is not possible for both Rec-RBM and WGAN, depending on the combination of target, training set, attack classifier, and feature set. Moreover, in the case of an "extreme" outlier (i.e., a target which has minor alleles at all positions), the synthetic data generated by RBM does not have a big impact on PG: across all test runs, the PG is actually close to the random guess baseline.
Next, we introduce a novel attack, which we denote as MIA with Partial Information (MIA-PI). Basically, we only give the attacker access to a fraction of the SNVs from the target sequence, chosen at random. The attacker then uses the Recombination model from [48] as an inference method to predict the rest of the sequence. Compared to the previous attack, here the adversary trains their (attack) classifier using the sequence inferred from the partial data. Thus, the privacy gain formula also needs to be adjusted to account for how likely an adversary is to identify a target within a dataset from partial information.

Figure 11: Accuracy of the Membership Inference Attack with access to the full sequence vs. partial information, for Recomb.

PG for MIA-PI.
Assuming the attacker has partial information t′, i.e., a fraction of the SNVs from t, they first use the Recomb model, as an inference algorithm, to predict the rest of the SNVs of the target sequence, which we denote by t_p. The privacy gain is computed as PG_t = (MIA_{t_p}(R_t) − MIA_{t_p}(S_test)) / 2, where MIA_{t_p}(S_test) = Σ_{S_i ∈ S_test} Pr[MIA_{t_p}(S_i) = 1] / (2 n_s) and MIA_{t_p}(R_t) = Σ_{R_i ∈ R_t} Pr[MIA_{t_p}(R_i) = 1] / (2 n_s). That is, the privacy gain, in this case, is computed as the difference between the probability that the attacker, who has partial information about the target record, correctly identifies the target as being part of the real dataset versus as being part of the training set used to generate the synthetic dataset.

As a result, PG now ranges between −0.5 and 0.5, where 0.5 means that having the real dataset R and the partial information t′ about the target allows the adversary to infer the membership of t in R, while the synthetic dataset reduces the adversary's chance of success (i.e., MIA_{t_p}(R_t) = 1 and MIA_{t_p}(S_test) = 0). A negative PG value means that publishing the synthetic data, instead of the real data, improves the adversary's chance to correctly infer membership of the target t (i.e., MIA_{t_p}(R_t) < MIA_{t_p}(S_test)).

From the experiments above, we find that the attack classifier yielding the lowest PG is Logistic Regression; thus, to ease presentation, we only experiment with that one. In the following, we present the results of the MIA-PI experiments for the CEU population, focusing on the Recomb and RBM models (as mentioned, with a LogReg attack classifier), as these two models yielded the lowest PG in Section 5.1.

Recomb.
In Figure 11, we plot the Cumulative Distribution Function (CDF) of the accuracy of the attack for Recomb when the adversary has access to the full sequence vs. partial information, specifically, a ratio of 0.05, 0.1, and 0.2 of the total SNVs from the target sequence. Interestingly, even when only 0.05 of the target SNVs are available to the attacker, for 90% and 91% of the targets from the Histogram and, respectively, Ensemble feature sets, the accuracy of the attacker is still above the random guess baseline (50% accuracy).
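The PG computation above translates directly into code. This sketch assumes the attack's per-dataset probabilities Pr[MIA(·)=1] over the 2·n_s test sets (drawn with the real data and with the synthetic data, respectively) have already been collected; PG is then half the gap between the two mean success rates.

```python
# Privacy gain for MIA-PI: half the difference between the attack's mean
# success probability on the real-data side and on the synthetic-data side.

def privacy_gain(pr_real, pr_synthetic):
    """PG_t = (MIA(R_t) - MIA(S_test)) / 2, each term a mean over 2*n_s sets."""
    mia_r = sum(pr_real) / len(pr_real)
    mia_s = sum(pr_synthetic) / len(pr_synthetic)
    return (mia_r - mia_s) / 2.0

# Best case: the attack always succeeds on real data, never on synthetic.
print(privacy_gain([1.0] * 100, [0.0] * 100))   # 0.5
# Synthetic data exactly as leaky as the real data: no gain.
print(privacy_gain([0.8] * 100, [0.8] * 100))   # 0.0
```

A negative return value corresponds to the harmful case discussed above, where publishing synthetic data instead of the real data actually helps the adversary.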
Our intuition is that many targets are vulnerable to the attack, even with little partial information, since we use the Recomb model not only for the attack but also as an inference method to predict the rest of the sequence.

To explore how much of the MIA-PI vulnerability is due to the release of synthetic datasets, and not only to how much information the attacker has available, in Figure 12, we plot the CDF of the PG under MIA-PI. In line with the accuracy results, we find that, in the case of the Correlations feature set, the PG is greater than 0 for at least 88% of the targets, for all ratios of partial information tested. However, for the other three feature sets, releasing the synthetic dataset instead of the real data decreases the privacy gain (i.e., PG < 0) for the majority of targets. When the adversary has access to just 5% of the SNVs from the target, there is a negative PG for 61% of the targets under the Histogram and 62% of the targets under the Ensemble feature sets. With 10% of the target sequence available, 54% and 60% of the targets under, respectively, the Histogram and the Ensemble feature sets have negative PG, and with 20%, these numbers go up to 64% and 67%. For the Naive feature set, more targets have negative PG as more partial information becomes available to the attacker, i.e., 59%, 70%, and 84% with, respectively, 5%, 10%, and 20% of the target sequence available. Overall, this shows that releasing the synthetic dataset instead of the real data does not mitigate the privacy risk, even when the attacker does not have access to the full sequence.

RBM.
In Figure 13, we plot the CDF of the accuracy of the attack for the RBM model, for both full and partial information about the target record available to the attacker. Across all feature sets, the accuracy of the attack increases with more information available to the attacker, as expected.
We also look at the CDF of the PG in the case of partial information available to the attacker, in Figure 14. Once again, under the Naive feature set, increasing the partial information available to the attacker is negatively correlated with the percentage of targets that have a negative PG. Under all other feature sets, for the majority of targets, releasing the synthetic dataset instead of the real data yields a positive PG.

Figure 12: Privacy Gain (PG) for synthetic samples generated by Recomb.

Figure 13: Accuracy of the Membership Inference Attack with access to the full sequence vs. partial information, for RBM.

Take-Aways.
We find that not even decreasing the attacker's power, by only giving them partial information from the target sequence, mitigates the privacy risk for the Recomb-generated synthetic data. This is likely due to the fact that, by using the Recomb model both as a generative model and as an inference model, the adversary's inference power is increased, since the feature set extracted from the synthetic data will be closer to the feature set of the predicted target.

However, for the RBM synthetic data, we see an increase in the privacy gained by releasing synthetic data as opposed to real data. This implies that, even if the RBM is likely to overfit when few samples are available for training, it does so on the predicted sequence of the target rather than on the full sequence, which decreases the accuracy of the MIA.

In this section, we review related work on synthetic data, privacy in genomics, and MIAs against machine learning models.

Synthetic Data Initiatives.
In recent years, researchers have focused on the generation of synthetic electronic health records (EHR), aiming to facilitate research in, and adoption of, machine learning in medicine. Choi et al.
[13] use a combination of an autoencoder and a GAN, called medGAN, to generate high-dimensional multi-label discrete data. ADS-GAN [62] uses a quantifiable definition of "identifiability," combined with the discriminator's loss, to minimize the probability of a patient's re-identification, while CorGAN [54] combines convolutional GANs and convolutional autoencoders to capture the correlations between adjacent medical features. Biswal et al. [9] use a variational autoencoder to synthesize sequences of discrete EHR encounters and encounter features. Other initiatives focus on generating synthetic data modeled on primary care data [3, 41, 56, 60]. Researchers have also explored generating synthetic health patient data to detect cancer and other diseases; e.g., RDP-CGAN [55] combines convolutional GANs and convolutional autoencoders, both trained with Rényi differential privacy [36]. Specific to genomes are the works we introduced in Section 3 and evaluated, in terms of utility and privacy, throughout the paper [33, 48, 61].

Privacy in Genomics.
Researchers have focused on studying and mitigating privacy risks in genomics. One of the first attacks on genomic data is the Membership Inference Attack proposed by Homer et al. [30], showing that an adversary can infer the presence of an individual's genotype within a complex DNA mixture. This attack has been improved by Wang et al. [58] using correlation statistics of a few hundred SNPs. Then, Im et al. [31] show that summary information from genome-wide association studies, such as regression coefficients, can also reveal the participation of an individual in the respective study. Membership inference has also been shown possible in the context of the Beacon network [43, 50, 57], a federated service that answers queries of the form "does your data have a specific nucleotide at a specific genomic coordinate?".
Figure 14: Privacy Gain (PG) for synthetic samples generated by RBM.

Close to our work, Chen et al. [12] study the effects of differential privacy protection against membership inference attacks on machine learning for genomic data. However, their study focuses on privacy leakage via sharing trained classification models, whereas we study the privacy leakage from sharing synthetic datasets.

MIAs against Machine Learning Models.
Shokri et al. [49] present the first attack against discriminative models, aiming to identify whether a data record was used in training, using an approach based on shadow models. Hayes et al. [24] present the first MIA against generative models like GANs; they use a discriminator to output the data with the highest confidence values as the original training data. Hilprecht et al. [26] study MIAs against both GANs and Variational AutoEncoders (VAEs), while Chen et al. [11] propose a generic MIA model against GANs.

As mentioned, Stadler et al. [53] evaluate MIAs in the context of synthetic data, and show that even access to a single synthetic dataset output by the target model can lead to privacy leakage. We re-use their framework for quantifying the privacy gain when a synthetic dataset is released as opposed to the real dataset. However, not only do we do so in a specific context (namely, genomics), but we also measure utility (while they only study privacy) and introduce and measure a novel attack whereby the attacker only has access to partial genomic information about the target.

Finally, measurement studies have focused on privacy in machine learning. Jayaraman and Evans [32] evaluate differential privacy to understand the impact that different choices of privacy parameters have on both utility and privacy. Long et al. [35] study membership inference attacks on three different discriminative models, in order to understand why and how they succeed.
This paper presented an in-depth measurement study of state-of-the-art methods to generate synthetic genomic data. We did so vis-à-vis 1) their utility, with respect to a number of common analytical tasks performed by researchers, as well as 2) the privacy protection they provide compared to releasing real data.

High-quality synthetic data must accurately capture the relations between data points; however, this can enable attackers to infer sensitive information about the training data used to generate the synthetic data. This was illustrated by the performance of the Recomb model on the HapMap datasets: while it achieves the best utility, it does so at the cost of significantly reducing privacy. Overall, there is no single method that outperforms the others for all metrics and all datasets. However, we did find that models based on a simple GAN architecture (i.e., GAN and Rec-GAN) are not a good fit for genomic data, as they provide the lowest utility across the board.

Our analysis revealed that the size of the training dataset matters, especially in the case of generative models. Not only did we see an improvement in utility with the addition of samples in the hybrid Rec-RBM approach for the smaller HapMap datasets, and for RBM and WGAN for the 1000 Genomes dataset, but we also measured a decrease in the number of targets exposed to membership inference.

Our measurement framework can be used by practitioners to assess the risks of deploying synthetic genomic data in the wild, and will serve as a benchmark tool for researchers and practitioners in the future. In future work, we plan to integrate differential privacy mechanisms into our framework, as well as novel metrics and algorithms.

Acknowledgments.
This work was partly supported by a Google Faculty Award on “Enabling Progress in Genomic Researchvia Privacy Preserving Data Sharing” and a grant from the NCSC and the Alan Turing Institute on “Evaluating Privacy-Preserving Generative Models In The Wild.” The authors wish to thank Yang Zhang and Christophe Dessimoz for commentsand feedback on the manuscript. 17 eferences Nature , 526(7571), 2015.[5] N. Altman and M. Krzywinski. The curse(s) of dimensionality. Nature Methods , 15(6), 2018.[6] D. R. Amancio, C. H. Comin, D. Casanova, G. Travieso, O. M. Bruno, F. A. Rodrigues, and L. da Fontoura Costa. A systematiccomparison of supervised classifiers. PloS one , 9(4):e94137, 2014.[7] C. Angermueller, T. Pärnamaa, L. Parts, and O. Stegle. Deep learning for computational biology. Molecular Systems Biology , 12(7),2016.[8] E. Ayday, E. De Cristofaro, J.-P. Hubaux, and G. Tsudik. Whole genome sequencing: Revolutionary medicine or privacy nightmare? Computer , 48(2), 2015.[9] S. Biswal, S. Ghosh, J. Duke, B. Malin, W. Stewart, and J. Sun. EVA: Generating Longitudinal Electronic Health Records UsingConditional Variational Autoencoders. arXiv:2012.10020 , 2020.[10] W. S. Bush and J. H. Moore. Genome-wide association studies. PLoS Computational Biology , 8(12), 2012.[11] D. Chen, N. Yu, Y. Zhang, and M. Fritz. GAN-Leaks: A taxonomy of membership inference attacks against GANs. In ACM Conferenceon Computer and Communications Security Machine learning for Healthcare Conference , 2017.[14] G. M. Clarke and L. R. Cardon. Disentangling linkage disequilibrium and linkage from dense single-nucleotide polymorphism triodata. Genetics , 171(4):2085–2095, 2005.[15] E. De Cristofaro. An overview of privacy in machine learning. arXiv:2005.08679 Theoretical Population Biology ,71(1), 2007.[20] R. A. Fisher. The distribution of gene ratios for rare mutations. Proceedings of the Royal Society of Edinburgh , 50, 1931.[21] Genetics Home Reference. What is Precision Medicine? 
https://medlineplus.gov/genetics/understanding/precisionmedicine/definition/, 2020.
[22] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 2017.
[23] M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. Identifying personal genomes by surname inference. Science, 339(6117), 2013.
[24] J. Hayes, L. Melis, G. Danezis, and E. De Cristofaro. LOGAN: Membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies, 2019.
[25] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[26] B. Hilprecht, M. Härterich, and D. Bernau. Monte Carlo and Reconstruction Membership Inference Attacks against Generative Models. Proceedings on Privacy Enhancing Technologies, 2019.
[27] J. L. Hodges. The significance probability of the Smirnov two-sample test. Arkiv för Matematik, 3(5), 1958.
[28] K. E. Holsinger and B. S. Weir. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nature Reviews Genetics, 10(9), 2009.
[29] N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. V. Pearson, D. A. Stephan, S. F. Nelson, and D. W. Craig. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 4(8), 2008.
[31] H. K. Im, E. R. Gamazon, D. L. Nicolae, and N. J. Cox. On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy.
The American Journal of Human Genetics, 90(4), 2012.
[32] B. Jayaraman and D. Evans. Evaluating differentially private machine learning in practice. In USENIX Security, 2019.
[33] N. Killoran, L. Lee, A. Delong, D. Duvenaud, and B. Frey. Generating and designing DNA with deep generative models. arXiv:1712.06148, 2017.
[34] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 1989.
[35] Y. Long, V. Bindschaedler, and C. A. Gunter. Towards measuring membership privacy. arXiv:1712.09136, 2017.
[36] I. Mironov. Rényi differential privacy. In IEEE Computer Security Foundations Symposium, 2017.
[41] NHS England. A&E Synthetic Data. https://data.england.nhs.uk/dataset/a-e-synthetic-data, 2021.
[42] NIH. Genomic Data Sharing. https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing-faqs/, 2021.
[43] J. L. Raisaro, F. Tramer, Z. Ji, D. Bu, Y. Zhao, K. Carey, D. Lloyd, H. Sofia, D. Baker, P. Flicek, et al. Addressing beacon re-identification attacks: quantification and mitigation of privacy risks. Journal of the American Medical Informatics Association, 24(4):799–805, 2017.
[44] A. R. Rogers and C. Huff. Linkage disequilibrium between loci with unknown phase. Genetics, 182, 2009.
[45] A. Sablayrolles, M. Douze, C. Schmid, Y. Ollivier, and H. Jégou. White-box vs Black-box: Bayes Optimal Strategies for Membership Inference. In International Conference on Machine Learning, 2019.
[46] P. Sadrach. How Artificial Intelligence Can Revolutionize Healthcare. https://builtin.com/healthcare-technology/how-ai-revolutionize-healthcare, 2020.
[47] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Network and Distributed System Security Symposium, 2019.
[48] S. S. Samani, Z. Huang, E. Ayday, M. Elliot, J. Fellay, J.-P. Hubaux, and Z. Kutalik.
Quantifying genomic privacy via inference attack with high-order SNV correlations. In IEEE Security and Privacy Workshops, 2015.
[49] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy, 2017.
[50] S. S. Shringarpure and C. D. Bustamante. Privacy risks from genomic data-sharing beacons. The American Journal of Human Genetics, 97(5):631–646, 2015.
[51] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034, 2013.
[52] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.
[53] T. Stadler, B. Oprisanu, and C. Troncoso. Synthetic Data – A Privacy Mirage. arXiv:2011.07018, 2020.
[54] A. Torfi and E. A. Fox. CorGAN: Correlation-capturing convolutional generative adversarial networks for generating synthetic healthcare records. arXiv:2001.09346, 2020.
[55] A. Torfi, E. A. Fox, and C. K. Reddy. Differentially Private Synthetic Medical Data Generation using Convolutional GANs. arXiv:2012.11774, 2020.
[56] A. Tucker, Z. Wang, Y. Rotalinti, and P. Myles. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. NPJ Digital Medicine, 3(1), 2020.
[57] N. Von Thenen, E. Ayday, and A. E. Cicek. Re-identification of individuals in genomic data-sharing beacons via allele inference. Bioinformatics, 35(3):365–371, 2019.
[58] R. Wang, Y. F. Li, X. Wang, H. Tang, and X. Zhou. Learning your identity and disease from research papers: information leaks in genome wide association study. In ACM Conference on Computer and Communications Security, 2009.
[59] S. Wang, X. Jiang, S. Singh, R. Marmor, L. Bonomi, D. Fox, M. Dow, and L. Ohno-Machado.
Genome Privacy: Challenges, Technical Approaches to Mitigate Risk, and Ethical Considerations in the United States. Annals of the New York Academy of Sciences, 1387(1), 2017.
[60] Z. Wang, P. Myles, and A. Tucker. Generating and Evaluating Synthetic UK Primary Care Data: Preserving Data Utility & Patient Privacy. In IEEE International Symposium on Computer-Based Medical Systems, 2019.
[63] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding Neural Networks Through Deep Visualization. arXiv:1506.06579, 2015.
[64] X. Zhou, B. Peng, Y. F. Li, Y. Chen, H. Tang, and X. Wang. To release or not to release: evaluating information leaks in aggregate human-genome data. In European Symposium on Research in Computer Security, pages 607–627. Springer, 2011.