Private data sharing between decentralized users through the privGAN architecture
1st Jean-Francois Rajotte
Data Science Institute, University of British Columbia
Vancouver, [email protected]

2nd Raymond T. Ng
Department of Computer Science, University of British Columbia
Vancouver, [email protected]

Fig. 1: A motivating example: conceptual representation of information sharing with synthetic data from two banks. The data samples here are represented by green smileys and red frownies, which could represent the good and the fraudulent customers' signatures respectively. Banks would benefit from learning from more fraudulent signatures but cannot share their data. The open question is what the banks can share to improve their knowledge of fraudulent users while minimizing the risk of privacy breaches of their customers.
Abstract—More data is almost always beneficial for analysis and machine learning tasks. In many realistic situations, however, an enterprise cannot share its data, either to keep a competitive advantage or to protect the privacy of the data sources, for example the enterprise's clients. We propose a method for data owners to share synthetic or fake versions of their data without sharing the actual data, nor the parameters of models that have direct access to the data. The proposed method is based on the privGAN architecture, where local GANs are trained on their respective data subsets with an extra penalty from a central discriminator that aims to discriminate the origin of a given fake sample. We demonstrate that this approach, when applied to subsets of various sizes, leads to better utility for the owners than the utility obtained from their real small datasets. The only shared pieces of information are the parameter updates of the central discriminator. Privacy is demonstrated with white-box attacks on the most vulnerable elements of the architecture, and the results are close to random guessing. This method would apply naturally in a federated learning setting.
Index Terms—Synthetic data, GAN, Privacy, Distributed data, Federated Learning
I. INTRODUCTION
Sharing data without breaching privacy is a common challenge in many fields because of the lengthy approval processes required to address ethical and privacy requirements. As a motivating example, one can imagine two banks (or two bank branches) who would like to build a model to detect fraudulent handwritten signatures but have too few samples on their own to train a sufficiently accurate fraudulent behaviour classification model. There are many ways to solve this problem, but the situation can be generally represented by Fig. 1. The crux of the issue is the exact definition of
Shared Information. Specifically, the two banks have at least the following options with respect to what is to be shared:
• their data;
• synthetic data generated by a basic Generative Adversarial Network (GAN); or
• access to synthetic samples while the generator is being trained.
We assume in this paper that the first option is too risky, but we will keep it as an ideal baseline that we want to approach as closely as possible. A natural method for using data from multiple sources while keeping the data local is federated learning [1], but that would allow data access by model(s) shared among data owners. One could add a layer of security by creating synthetic data either locally or centrally with GANs [2]. For the banks example, the fraudulent signatures classifier could be trained on the synthetic data. In their typical implementations [3]–[5], however, GANs trained within a federated learning framework necessitate sharing a model that has direct access to the data owners' (e.g. the banks') data, which is again a privacy risk. The approach studied here is a method that generates local synthetic samples that comprise characteristics from all data owners' data while only sharing a central discriminator that accesses synthetic samples. Motivated by real-world use cases such as the banks in Fig. 1, we modified the original privGAN [6] prescription to create disjoint subsets of varying sizes. Another notable difference from the original privGAN is the exclusion of pre-training of the central discriminator on the real data, since the whole point of this work is to circumvent the real data access limitations. As a proxy for the client signatures, we use handwritten digits from which we generate synthetic fake digits. Like the original privGAN, we show that the resulting local synthetic data are more private than local synthetic data resulting from local GANs. In order to demonstrate the validity of this approach, we consider utility as the classification accuracy of a model trained on the synthetic digits and evaluated on a holdout set.

II. RELATED WORKS
Sharing private data or their characteristics has been extensively explored recently. A very natural approach is simply to generate private synthetic data with Differential Privacy [7] through GANs based on either differentially private stochastic gradient descent [8] (e.g. [9]) or the PATE framework [10] (e.g. [11]). Both approaches suffer from low-utility data for a reasonable degree of privacy. A recent paper trying to address this issue shows some promise [12]. Another approach is to train a model in a federated learning setting such that the data never has to be shared; some examples are FedGAN [5], MD-GAN [4] and [13]. Since it has been demonstrated that GANs are vulnerable to privacy attacks [14], various ways have been proposed to provide better privacy protection. GANs trained on distributed datasets with differential privacy (e.g. [15] and [16]) suffer from the same low quality of generated samples as centrally trained GANs, unless they have access to a very large amount of training data, as in this language model application [17]. FedGP [18] addresses these challenges by training a central generator while keeping the discriminators and their parameters local. The privacy protection, however, is obtained through Differential Average-Case Privacy using a "post-hoc privacy analysis framework". Our method allows users to create local synthetic datasets, and privacy protection naturally arises from the architecture without post-processing steps.
III. PRIVGAN

PrivGAN [6] is an extension of the GAN, an architecture whose development has led to notoriously realistic synthetic images. In their original version, GANs comprise a generator G and a discriminator D playing a two-player game, as shown in Fig. 2. The generator aims to create fake samples such that the discriminator estimates their probability of being real as high as possible. The discriminator, on the other hand, tries to estimate the probability that a sample is real rather than fake.
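For reference, this two-player game corresponds to the standard minimax objective of [2], restated here in the notation of Fig. 2 (z denotes the latent noise vector sampled from a prior p_z):

min_G max_D V(D, G) = E_{x ∼ p_data(x)}[ log D(x) ] + E_{z ∼ p_z(z)}[ log(1 − D(G(z))) ].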
Fig. 2: Simple GAN architecture: the generator G aims to create fake samples X_fake that are indistinguishable from the real samples X_real by the discriminator D.

PrivGAN was originally designed to protect against membership inference attacks such as [14]. The architecture comprises N GANs trained on their disjoint, independent and identically distributed (i.i.d.) subsets with an extra loss from a central (private) discriminator D_P, as described in Fig. 3. The numeric subscripts correspond to the models and data associated with a given data owner. In the banks example from Fig. 1, X_1 corresponds to Bank 1's clients' signatures, and G_1-D_1 correspond to the GAN generator-discriminator pair trained on those signatures, with the extra discriminator D_P shared by both banks.
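To make the shared-information boundary concrete, below is a minimal sketch (not the authors' code) of one privGAN training step for data owner i, assuming two owners and PyTorch models G[i], D[i] (the local generator-discriminator pairs) and D_P (the shared central discriminator, which outputs one logit per owner and only ever sees fake samples). The privacy weight lam and the exact form of the origin-hiding term are assumptions based on the description above, not the paper's exact loss.

import torch
import torch.nn.functional as F

def privgan_owner_step(i, G, D, D_P, x_real, opt_G, opt_D, z_dim=100, lam=1.0):
    batch = x_real.size(0)
    z = torch.randn(batch, z_dim)
    owner = torch.full((batch,), i, dtype=torch.long)  # true origin label

    # Local discriminator update: real vs. fake, exactly as in a simple GAN.
    opt_D.zero_grad()
    p_real = D[i](x_real)
    p_fake = D[i](G[i](z).detach())
    loss_D = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) + \
             F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
    loss_D.backward()
    opt_D.step()

    # Generator update: fool the local discriminator AND hide the origin of
    # the fake samples from the central discriminator D_P.
    opt_G.zero_grad()
    x_fake = G[i](z)
    p_fake = D[i](x_fake)
    gan_loss = F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
    priv_loss = -F.cross_entropy(D_P(x_fake), owner)  # reward misattribution
    (gan_loss + lam * priv_loss).backward()
    opt_G.step()

    # D_P itself (the only shared component) is updated elsewhere to correctly
    # attribute fake samples, e.g. with F.cross_entropy(D_P(x_fake.detach()), owner).
    return loss_D.item(), gan_loss.item(), priv_loss.item()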
Fig. 3: Original privGAN architecture with N=2 subsets. In relation to Fig. 1, the shared information corresponds to the central discriminator D_P, which accesses only the fake samples during training.

The authors show that their method "minimally affects the quality of downstream samples as evidenced by the performance on downstream learning tasks such as classification". For the banks, the two important things in this setting are 1) the clients' signatures never leave the bank and 2) architecture elements that have direct access to the real signatures never leave the bank either. In this paper, we explore the application of the privGAN architecture on subsets of different sizes. Our exploration focuses on the utility of the synthetic dataset as a training set for a classification model, which we evaluate on a holdout set from the original data. Our only other modification of the original method is to skip the pre-training of the central discriminator D_P on real data. This is to limit real data access to local components. In our bank example from Fig. 1, that means that the client signatures from Bank 1 are only accessed by the discriminator D_1, which itself is never shared with the other bank. We will see that reducing Bank 1's number of signatures eventually leads to a significant reduction of the utility of the synthetic data, even to the point of uselessness. We will show how Bank 2 could help Bank 1 without compromising its clients' privacy.
IV. EXPERIMENTAL DETAILS

In our experiments, we compare the utility and privacy of the fake data generated with privGAN and with a simple (non-private) GAN. We limit our exploration to two data owners (e.g. two banks), so the architecture is exactly as depicted in Fig. 3. As a proxy for the banks' clients' handwritten signatures, we use MNIST [19] to demonstrate our approach. MNIST contains 70,000 grayscale handwritten digits (60,000 training samples and 10,000 test samples). It contains a balanced number of 28x28 images for each of the 10 classes (digits 0 to 9). We vary the size of both training subsets to cover various scenarios; subset sizes are reported as fractions of the total training set. For example, we show results where a data owner has as little as 1% of the total training dataset (60,000 images), corresponding to 600 images, i.e. 60 images per digit. In the privGAN architecture, we combine this data owner with another one, also with a varying dataset size. We explore combinations of data owners with various fractions, from 1%-1% to 50%-50%, and many in between such as 20%-10%, 8%-50%, etc. In our bank analogy, this corresponds to various scenarios where the two banks have different amounts of signatures to train their synthetic generator.
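As an illustration of this setup, the short sketch below draws two disjoint i.i.d. subsets of the requested fractions from the MNIST training images; the function name and the example fractions are ours, not the authors'.

import numpy as np

def split_owners(x_train, frac_1, frac_2, seed=0):
    """Return two disjoint random (i.i.d.) subsets holding frac_1 and frac_2
    of the full training array x_train."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x_train))
    n1, n2 = int(frac_1 * len(x_train)), int(frac_2 * len(x_train))
    return x_train[idx[:n1]], x_train[idx[n1:n1 + n2]]

# e.g. data owner 1 gets 1% (600 images) and data owner 2 gets 25% (15,000 images):
# x_owner1, x_owner2 = split_owners(x_train, 0.01, 0.25)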
A. Models and training

As in the original privGAN paper, we use standard fully connected networks for both generators and discriminators, because MNIST is a relatively simple dataset. Identical generator and discriminator architectures are used for both GANs and privGANs. We train all the privGAN models with the Adam optimizer with a learning rate of 0.0002 (β=0.5) for 200 epochs. We use the same parameters for the GAN models, except that they are trained for 250 epochs. Following the procedure of the original paper, we run a model for each digit in order to have a label for each sample. As mentioned before, contrary to the original privGAN method, we do not pre-train the central discriminator D_P on real images. We do not expect this to have a significant impact, as the pre-training was only implemented to speed up convergence in the original privGAN paper. When evaluating the performance on downstream classification tasks, we train the classification models with an Adam optimizer with a learning rate of 0.0002 (β=0.5) for 25 epochs on 3000 generated samples. In all cases, we use a batch size of 256 samples. We cap the maximum data in a privGAN subset at 50% of the dataset, which is more than enough to reach very high utility. When evaluating privGAN samples generated from both generators, the evaluation is still performed on 3000 samples, 1500 from each generator.
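For concreteness, here is a minimal sketch of the kind of fully connected generator and discriminator used for MNIST, with the optimizer settings quoted above; the hidden-layer widths and the latent dimension are assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

Z_DIM = 100  # assumed latent dimension

generator = nn.Sequential(
    nn.Linear(Z_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 28 * 28), nn.Tanh(),   # one 28x28 grayscale digit in [-1, 1]
)

discriminator = nn.Sequential(
    nn.Linear(28 * 28, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),      # probability that the input is real
)

# Adam with learning rate 0.0002 and beta_1 = 0.5, as stated in the text.
opt_G = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))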
B. Membership inference attacks
To evaluate the privacy vulnerability of a model and its generated data, we apply membership attacks on the discriminators, the most vulnerable elements of the architecture. The white-box attacks on the simple GANs are outlined in [20], where it is assumed that the adversary is in possession of the trained model along with a large data pool including the training samples. The attacker is also assumed to know what fraction f of the dataset was used for training. The attack then proceeds by using the discriminator of the trained GAN to obtain a probability score for each sample in the dataset. The samples are then sorted in descending order of probability score, and the top f fraction of the scores is predicted as the likely training set. The white-box attack is evaluated by calculating what fraction of the predicted training set was actually in the training set. In our bank example, a successful membership inference attack would trace clients back to their bank, information that should be kept private. It makes the reasonable assumption that the adversary has access to clients' signatures, which can come from many sources (e.g. contracts or petitions signed by the client). Since our privGAN models have two generator-discriminator pairs, the previously described attack cannot be applied to them directly. However, for a successful white-box attack, each of the discriminators should score samples from the training corpus higher than those not used in training. The privGAN authors modified the previous approach by identifying a max probability score, taking the max over the scores from all discriminators. The rationale is that the discriminator that has trained on a particular data sample should give it the largest discriminator score. We then sort the samples by this aggregate score and select the top f fraction of samples as the predicted training set. As a sanity check, we also include the mean probability score in our results.
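A minimal sketch (not the authors' code) of this white-box attack and its privGAN variant follows; the helper names are ours. The attacker ranks the pool by discriminator score, predicts the top f fraction as the training set, and is evaluated by the fraction of the prediction that really is training data. For privGAN, each sample is first scored by the maximum (or, as a sanity check, the mean) over the local discriminators.

import numpy as np

def whitebox_attack_accuracy(scores, is_member, f):
    """scores: discriminator probability for each sample in the attacker's pool.
    is_member: boolean array, True where the sample was in the training set.
    f: fraction of the pool assumed (known to the attacker) to be training data.
    Returns the fraction of the predicted training set that is actually a member."""
    n_pred = int(f * len(scores))
    predicted = np.argsort(scores)[::-1][:n_pred]  # indices of the top-f scores
    return is_member[predicted].mean()

# Simple GAN: score the pool with the single trained discriminator D.
# acc_gan = whitebox_attack_accuracy(D(pool), is_member, f)

# privGAN: aggregate the two local discriminators' scores before ranking.
# scores_max  = np.maximum(D1(pool), D2(pool))
# scores_mean = 0.5 * (D1(pool) + D2(pool))
# acc_priv = whitebox_attack_accuracy(scores_max, is_member, f)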
V. RESULTS

We first look at the images generated by a GAN and by privGAN trained on various training set sizes in Fig. 4. One can see that for small training data (4%-5% of the training set), GANs are just creating noise. PrivGAN, on the other hand, does create recognizable digits even when the other subset is also small (shown at 4% in the figure). We also note that when a data owner has a large amount of data, privGAN makes the images noisier than what a simple GAN would generate. This is not necessarily a bad thing; it can be interpreted as a sign of privacy. Indeed, going back to our bank analogy, a noisier synthetic signature is harder to link to any real signature from the bank's clients. We will see below that this privacy comes at a low cost in utility for synthetic samples created with privGAN. We then explore the effect on utility of an unequal amount of data between the data owners.
Fig. 4: Generated images with GANs (left column) and privGAN's generator 1 (other columns). The rows correspond to the fraction of the training data accessible locally. For the GAN, this corresponds to the total training data size; for privGAN, this corresponds to data owner 1's subset X_real. Columns 2, 3 and 4 correspond to the data size of the other data owner's X_real.

The subsets are i.i.d. samples of different sizes drawn from the parent training population, so the data distributions are the same; only the subset sizes vary. Fig. 5 shows the classification accuracy of a model trained separately on each local synthetic dataset and on their union. For reference, the 50%-50% point at the top right of Fig. 5a corresponds to the original privGAN paper, where the whole training set was used. We note that, for the privGAN architecture, the data owners are interchangeable, which translates into a symmetry along the bottom-left to top-right diagonal; this symmetry is reasonably respected, within some fluctuations. For example, the accuracy at 10%-50% should be the same as at 50%-10%. We further note that the utility remains high for both synthetic datasets as long as one of the owners has sufficient training data. This can be seen in the lower-right and upper-left sections of Fig. 5b.

Fig. 5: Accuracy (as color) of a classification model trained on privGAN synthetic data: (a) trained on samples from both generators; (b) trained on samples from generator 1 (left) and generator 2 (right). All figures share the same color code. The vertical and horizontal axes correspond to the fraction of data available for training the generator-discriminator pair. The fraction is w.r.t. the full training set for each digit, i.e. 100% = 10,000 training samples per digit. The black points correspond to actual accuracies and the color filling between the points is interpolated. Note the log scale on the horizontal and vertical axes.

Fig. 6 compares the same accuracies with those of the same classifier trained on synthetic samples from a local GAN. We first notice that below some threshold (5%), GANs produce synthetic data with essentially no utility: the classifier accuracy is close to random guessing. This corresponds to the GAN generating noise, as noted in Fig. 4. This does not happen for privGAN; a minimal utility is always observed in our exploration. Clearly, privGAN is beneficial when a data owner has a small dataset, even if the other data owner has a small dataset too. For the bank analogy, this means that banks can help each other build better signature classification models even when they have few signatures. As the available training data size increases, all utilities converge to slightly below the GAN utility. Indeed, when a subset reaches a size of 40%, there is no utility gain from the privGAN architecture (although we will see below that there is still a privacy benefit). There is always a subset size threshold, depending on the other data owner's subset size, below which privGAN becomes beneficial in utility. For example, when the other data owner has a 25% subset, privGAN becomes beneficial for subset sizes below roughly 20%.

Fig. 6: Accuracy of a classification model trained on privGAN and GAN synthetic data. The privGAN synthetic data comprises samples from generator 1 only. The horizontal axes correspond to the fraction of the training sample subset used for training the privGAN generator-discriminator pair while keeping the other subset size fixed.

Fig. 7 shows the accuracies of white-box attacks on the discriminators. It shows that privGAN's discriminators are very secure, since the attack successes are always close to random guessing, while the GANs' vulnerability increases as a function of the number of training samples.

Fig. 7: White-box attack accuracies as a function of the fraction of training data used for training privGAN. Here, both subsets have the same fraction of the data, so 100% corresponds to each subset having 50% of the full training data set. The attacks are performed with 200 training samples combined with the (approx. 1000) holdout samples for the given digit, which puts random guessing at 0.2. All privGAN attacks are close to random guessing, while most GAN results are above and increase with the data fraction.

PrivGAN's privacy can also be seen in Fig. 8, where the discriminator's confidence on the training set is very similar to its confidence on the holdout set. For the GAN, on the other hand, we can see a clear distinction between the confidence on the training set and on the holdout set. This is the vulnerability exploited by the white-box attack. We also see from the GAN confidence distributions that the confidence on the training set decreases as the training set size decreases. The opposite holds for the holdout set, such that both confidence distributions overlap for small training set sizes, making the synthetic data more private, as shown in Fig. 7. For the bank analogy, this means that an adversary cannot confidently match a bank's clients' signatures with their bank by inspecting privGAN components. This is usually not the case for the GANs.
Fig. 8: Digit 0 discriminator distributions. The rows correspond to subset fractions of 100%, 50% and 20% of the training data, respectively. The columns correspond to discriminator 1, discriminator 2 and the GAN discriminator, respectively. The red color corresponds to the discriminator probability on its own training data, the blue color corresponds to the holdout data, and the green color corresponds to the training data of the other discriminator (privGAN only). This shows how the privGAN discriminators give similar probabilities to the train and test sets, a characteristic of private models. For the GANs, there is a clear distinction between the distributions.

Fig. 9: Same as Fig. 8 but for digit 1. The GAN distributions are very similar, a sign of privacy (see digit 1 in Fig. 7). This behaviour of the GAN distributions happens only for digit 1.

We show the results for the first four digits to demonstrate the particularity of digit 1 (the digits not shown behave similarly to digits 0, 2 and 3). Indeed, we note that for digit 1, the GAN discriminator happens to be private too. This can also be seen in the discriminator distributions in Fig. 9, where the discriminator has a similar response on the training and test samples. We hypothesise that digit 1s are inherently more private because of their lower diversity, i.e. there are fewer ways to draw a 1 than there are for other digits. As a result, the 1s are more similar and harder to distinguish; hence they are more private.
VI. CONCLUSION
We have shown that the utility of a synthetic dataset can be significantly improved for data owners with small datasets when information is shared through the privGAN architecture. For example, when a data owner has too few data samples, a simple GAN could not even generate recognizable synthetic images, while this was always possible with privGAN. We have attacked the most privacy-vulnerable components of the privGAN architecture, the discriminators, and the membership inference attacks perform no better than random guessing, as opposed to attacks on basic GAN synthetic data. If the data size is large enough, the benefits, besides helping other data owners, are mainly the added privacy of the synthetic data. Our results are encouraging for enterprises wanting to collaborate by sharing information while minimizing the privacy risks.

We have used a bank example as an application of the privGAN method to preserve the clients' privacy while sharing information about their signatures. In this scenario, a bank (or branch of a bank) could help another bank generate a synthetic signature dataset to train a better fraudulent signature detection model while preserving their clients' privacy. We have shown that a bank with too few signatures cannot generate usable synthetic signatures on its own, but it becomes possible with the help of another bank through the privGAN architecture. Depending on the banks' needs and privacy policies, it is possible to vary the trade-off between utility and privacy. Thus, a Pareto front could be explored by modifying certain elements of the privGAN architecture. For example, the strength of the central discriminator D_P can be modified and will affect both privacy and utility. It is also possible to change the neural network model to, say, convolutional neural networks, which are more adapted to image processing than the dense layer networks used here. This modification might affect the privacy vulnerability, as it varies for different types of networks, as demonstrated in [21]. The original privGAN paper demonstrated the utility of their method on more complex datasets such as CIFAR10. Although it remains to be shown, it is reasonable to assume that our conclusions would also hold for more complex datasets and other data types, within the known limitations of simple GANs. As further work, we are exploring the benefits when the data owners have non-i.i.d. subsets. For example, the signatures could come from clients of different age, gender or any other population parameter that could influence signatures.

ACKNOWLEDGMENT
We are grateful to the privGAN authors Sumit Mukherjee, Yixi Xu, Anusua Trivedi and Juan Lavista Ferres for fruitful discussions and for granting access to their code.
REFERENCES

[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," 2016.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[3] S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, and B. A. y Arcas, "Generative models for effective ML on private, decentralized datasets," 2019.
[4] C. Hardy, E. L. Merrer, and B. Sericola, "MD-GAN: Multi-discriminator generative adversarial networks for distributed datasets," 2018.
[5] M. Rasouli, T. Sun, and R. Rajagopal, "FedGAN: Federated generative adversarial networks for distributed data," 2020.
[6] S. Mukherjee, Y. Xu, A. Trivedi, and J. L. Ferres, "privGAN: Protecting GANs from membership inference attacks at low cost," 2019.
[7] C. Dwork and A. Roth, "The algorithmic foundations of differential privacy," Found. Trends Theor. Comput. Sci., vol. 9, no. 3–4, pp. 211–407, Aug. 2014. [Online]. Available: https://doi.org/10.1561/0400000042
[8] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Oct. 2016. [Online]. Available: http://dx.doi.org/10.1145/2976749.2978318
[9] L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou, "Differentially private generative adversarial network," 2018.
[10] N. Papernot, M. Abadi, Ú. Erlingsson, I. Goodfellow, and K. Talwar, "Semi-supervised knowledge transfer for deep learning from private training data," 2016.
[11] J. Yoon, J. Jordon, and M. van der Schaar, "PATE-GAN: Generating synthetic data with differential privacy guarantees," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=S1zk9iRqF7
[12] M. Neunhoeffer, Z. S. Wu, and C. Dwork, "Private post-GAN boosting," 2020.
[13] C. Fan and P. Liu, "Federated generative adversarial learning," 2020.
[14] J. Hayes, L. Melis, G. Danezis, and E. D. Cristofaro, "LOGAN: Evaluating privacy leakage of generative models using generative adversarial networks," CoRR, vol. abs/1705.07663, 2017. [Online]. Available: http://arxiv.org/abs/1705.07663
[15] R. C. Geyer, T. Klein, and M. Nabi, "Differentially private federated learning: A client level perspective," 2017.
[16] D. Chen, T. Orekondy, and M. Fritz, "GS-WGAN: A gradient-sanitized approach for learning differentially private generators," 2020.
[17] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, "Learning differentially private recurrent language models," 2017.
[18] A. Triastcyn and B. Faltings, "Federated generative privacy," 2019.
[19] Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database,"