Differentially Private Generation of Small Images
J. Schwabedal, P. Michel, and M. S. Riontino
Pionic, Berlin, Germany
e-mail: [email protected]
Code: github.com/jusjusjus/noise-in-dpsgd-2020/tree/v1.0.0
May 7, 2020
Abstract.
We explore the training of generative adversarial networks with differential privacy to anonymize image data sets. On MNIST, we numerically measure the privacy-utility trade-off using the parameters of (ε, δ)-differential privacy and the inception score. Our experiments uncover a saturated training regime in which an increasing privacy budget adds little to the quality of generated images. We also explain analytically why differentially private Adam optimization is independent of the gradient clipping parameter. Furthermore, we highlight common errors in previous works on differentially private deep learning that we uncovered in the recent literature. Through our treatment of the subject, we hope to prevent erroneous estimates of anonymity in the future.

1 Introduction

Differential privacy is the de-facto standard for the anonymized release of statistical measurements (Dwork and Roth, 2014). The method information-theoretically limits how much of any individual example in a data set is leaked into the released statistics. Risks of privacy infringement therefore remain bounded independent of the infringement method. This independence is key as increasingly powerful expert systems are trained on cross-referenced statistics, which may lead to accidental, inconceivably intricate, and hard-to-detect infringements. In this work, we explore a method able to anonymize not only derived statistical measurements, but the underlying raw data set with the same differential-privacy guarantees. We follow others in training generative adversarial networks (GANs) with differential privacy.

Abadi et al. (2016) proposed a method of training neural networks with differential privacy in their seminal paper. The authors proposed to make the gradient computations of stochastic gradient descent (SGD) a randomized mechanism by clipping the L-2 norm of each example's parameter gradient, and by adding random Gaussian noise to the gradients. Using the same method, we show that it is possible to train GANs on high-dimensional images (for an overview see Gui et al. (2020)). Their generator networks can then be used to synthesize image data that are of high utility for further processing, but come with guaranteed privacy for the original data source.

Privacy-preserving training of GANs with DP-SGD has been attempted recently. Beaulieu-Jones et al. (2017) trained a generator for labeled blood-pressure trajectories to synthesize anonymous samples from the SPRINT trial using the AC-GAN approach. Zhang et al. (2018) devised a Wasserstein GAN with gradient penalty to train a generator for MNIST and CIFAR10. To improve training with privacy, they grouped parameters according to their gradients and adjusted the clipping boundary for each group. Xie et al. (2018) use the original Wasserstein GAN procedure, wherein the critic's (discriminator's) parameters are clipped. This in turn ensures that gradients are bounded, thus fulfilling the criteria to compute privacy bounds from the noise variance added to the gradients.

In the remainder, we discuss the prior works cited above and point out some important fallacies. We then present our own analysis of differentially private synthetic data generated from the MNIST data set. We discuss how DP parameters affect the quality of generated images, which we measure using the inception score.
2 Methods

2.1 Differentially private SGD

A randomized mechanism h : D → R satisfies (ε, δ)-differential privacy if the following inequality holds for any adjacent pair of data sets d, d′ ∈ D and any S ⊂ R:

P[h(d) ∈ S] ≤ e^ε P[h(d′) ∈ S] + δ.    (1)

The Gaussian randomized mechanism adds to an n-dimensional mechanism f random Gaussian noise with variance C²σ², wherein C is the L-2 sensitivity of the mechanism across neighboring data sets and σ is the noise multiplier. Drawing vector-valued, independent Gaussian variates ξ ∼ N(0, C²σ² I_{n×n}), the mechanism is defined as

h(x) = f(x) + ξ.    (2)

In the training of neural networks with stochastic gradient descent, the mechanism f is the computation of the parameter gradient update g, i.e. the mean over parameter gradients g_j from samples j across a mini-batch B:

g = (1/|B|) Σ_{j∈B} g_j.    (3)

Only one gradient g_k differs in Eq. (3) between adjacent data sets. Let us suppose we can limit the L-2 norm of individual gradients to C. Then g has an L-2 sensitivity of C/|B| across adjacent data sets.
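To make the mechanism concrete, the following minimal NumPy sketch (our own illustration, not code from any of the cited repositories) computes the privatized mean gradient of one mini-batch, given the per-example gradients as rows of a matrix; it corresponds to lines 5 and 6 of Algorithm 1 below.

```python
import numpy as np

def private_mean_gradient(per_example_grads, C, sigma, rng):
    """Clip each example's gradient to L-2 norm C, add Gaussian noise with
    standard deviation sigma * C to the sum, and divide by the batch size."""
    batch_size, dim = per_example_grads.shape
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / C)  # g_i <- g_i / max(1, ||g_i|| / C)
    noise = rng.normal(0.0, sigma * C, size=dim)              # xi ~ N(0, sigma^2 C^2 I)
    return (clipped.sum(axis=0) + noise) / batch_size

rng = np.random.default_rng(0)
grads = rng.normal(size=(64, 10))  # 64 per-example gradients of a 10-parameter model
g = private_mean_gradient(grads, C=1.0, sigma=1.1, rng=rng)
```

Because every clipped summand is bounded by C in L-2 norm, the noise scale σC matches the sensitivity analysis above.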
Based upon this theory, Abadi et al. (2016) formulated differentially private SGD, which we reproduce in Algorithm 1. Note that the gradient's L-2 norm is clipped individually for each example (line 5), and that independent, vector-valued Gaussian variates ξ are added (line 6). One may replace the simple descent step (line 7) with any higher-order or moment-based variant of gradient descent, such as RMSprop or Adam (Kingma and Ba, 2015; Tieleman and Hinton, 2012).

Algorithm 1: Differentially private SGD (Abadi et al., 2016)
Input: Examples {x_1, ..., x_n}, neural-network parameters θ, loss function L(θ, x_i), learning rate η_t, noise scale σ, gradient norm bound C, random noise ξ ∼ N(0, σ²C² I)
Output: Differentially private parameters θ
1: for t = 1, ..., T do
2:   Sample a random mini-batch B
3:   for x_i ∈ B do
4:     g_i ← ∇_{θ_t} L(θ_t, x_i)
5:     g_i ← g_i / max(1, ||g_i||₂ / C)
6:   g ← (1/|B|) (ξ + Σ_i g_i)
7:   θ ← θ − η_t g

2.2 Differentially private WGAN

Zhang et al. (2018) observed that, when training a generative adversarial network for data release, the Gaussian mechanism can be confined either to the generator or to the critic. They argue that applying the Gaussian mechanism only to the critic permits batch-wise normalization techniques to be used in the generator. In Algorithm 2, we reproduce their differentially private Wasserstein GAN with gradient penalty. Note the similarity between lines 9 and 10 in the critic step and lines 5 and 6 in Algorithm 1; these lines implement the Gaussian randomized mechanism. Also note that the generator training step involves neither gradient clipping nor random perturbations. It does not depend on the data set either, as long as the schedule updating the learning rate η_t is differentially private; this includes early stopping.

Algorithm 2: Differentially private WGAN-GP (Zhang et al., 2018)
Input: Examples {x_1, ..., x_n}, neural-network parameters θ (generator) and w (critic), learning rate η_t, noise scale σ, gradient norm bound C, random noise ξ ∼ N(0, σ²C² I)
Output: Differentially private parameters θ
1: for t = 1, ..., T do
2:   for s = 1, ..., n_critic do    /* differentially private critic step */
3:     Sample a random mini-batch B
4:     for x_i ∈ B do
5:       Sample z from P(z) and ρ ∈ [0, 1]
6:       y ← ρ x_i + (1 − ρ) G(z)
7:       L_i ← D(G(z)) − D(x_i) + λ (||∇_x D|_y||₂ − 1)²
8:       g_i ← ∇_w L_i
9:       g_i ← g_i / max(1, ||g_i||₂ / C)
10:    g ← (1/|B|) (ξ + Σ_i g_i)
11:    w ← w − η_t g
12:   Sample |B| instances z_i from P(z)    /* non-private generator step */
13:   g_t ← (1/|B|) Σ_{i=1}^{|B|} ∇_θ D(G(z_i))
14:   θ ← θ − η_t g_t

2.3 Privacy analysis

Differentially private training of neural networks was formulated using the Gaussian mechanism, but its application is only useful if we are able to estimate tight upper bounds on the privacy lost. This loss is quantified by the parameters ε and δ from Eq. (1). Such an upper bound was derived by Mironov (2017) and Mironov et al. (2019) using the theory of Rényi differential privacy (RDP). Their analysis is technically involved, and we only sketch its elements here.

Rényi differential privacy is formulated in terms of the Rényi divergence of two probability distributions p and q:

D_α(p ∥ q) = 1/(α − 1) · log ⟨(p(x)/q(x))^α⟩_{x∼q}.    (4)

A mechanism h : D → R fulfills (α, ε′)-RDP if, for all neighboring data sets d, d′ ∈ D and S ⊂ R, h obeys the inequality

D_α[P(h(d) ∈ S) ∥ P(h(d′) ∈ S)] ≤ ε′.    (5)

Mironov (2017) also linked the two definitions of differential privacy, (1) and (5): each mechanism satisfying (α, ε′)-RDP also satisfies (ε, δ)-DP with

ε = ε′(α) − log δ / (α − 1).    (6)

One is free to choose δ; one then typically chooses the α that minimizes ε. Mironov et al. (2019) give the details on how to compute ε′(α) for one step of the sampled Gaussian randomized mechanism, i.e. stochastic gradient descent. Across multiple steps, the values of ε′ add linearly. Adding the values of ε′ for each optimization step, we compute the privacy budget in terms of α and ε′ and convert it to (ε, δ) using Eq. (6). In sum, this gives us an upper bound on the (ε, δ)-DP of our GAN training, which limits the privacy leaked from the data set into the generator.
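As an illustration of this accounting procedure, the sketch below (our own simplification) composes a per-step Rényi bound over many optimization steps and converts it to (ε, δ)-DP via Eq. (6), minimizing over a grid of orders α. For brevity it uses the closed-form bound ε′(α) = α/(2σ²) of the plain (unsubsampled) Gaussian mechanism with sensitivity 1; the sampled Gaussian mechanism treated by Mironov et al. (2019) requires a more involved computation of ε′(α) and yields tighter values.

```python
import numpy as np

def epsilon_from_rdp(sigma, steps, delta, alphas=np.arange(2, 256)):
    """Compose per-step RDP linearly over steps and convert to (eps, delta)-DP
    via Eq. (6), returning the best epsilon and the minimizing order alpha."""
    eps_per_step = alphas / (2.0 * sigma**2)   # RDP of one Gaussian-mechanism step
    eps_total = steps * eps_per_step           # epsilons add linearly across steps
    eps_dp = eps_total - np.log(delta) / (alphas - 1.0)
    best = int(np.argmin(eps_dp))
    return eps_dp[best], int(alphas[best])

eps, alpha = epsilon_from_rdp(sigma=1.1, steps=10_000, delta=1e-5)
print(f"({eps:.1f}, 1e-05)-DP, attained at alpha = {alpha}")
```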
2.4 Data

The MNIST data set contains 70,000 labeled images of digits. We used the 60,000 examples of its training set to train GANs. To optimize the classifier used to compute inception scores, we also used the 10,000 examples of the test set as a validation set.

2.5 Inception score

To assess the quality of generated images, we adopt the inception score (IS). A classifier K with classes k generates a probability distribution P(k | x) when applied to examples x of a data set X. The conditional probability is related to the marginal P(k) = ⟨P(k | x)⟩_{x∼X} through the Kullback-Leibler divergence

KL(P ∥ Q) = ⟨log P − log Q⟩_{x∼X}.    (7)

The inception score s(X) is defined as the exponential of the mean Kullback-Leibler divergence between the conditional and the marginal class distributions:

s(X) = exp[(1/M) Σ_{k=1}^{M} KL(P(k | x) ∥ P(k))].    (8)

Note that the IS framework requires a classifier of high quality. The score takes values between 1 and the number of classes M. It has been argued that the IS correlates well with subjective image quality because of a subjective bias towards class-distinguishing image features (Salimans et al., 2016).
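For reference, a direct NumPy sketch of the score (our own illustration): `probs` stands for the classifier outputs P(k | x) on a set of images, one row per image and normalized to sum to one; implementations differ slightly in how they average the divergence, and we use the common per-image form here.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, M) array of class probabilities P(k|x) for N images.
    Returns exp of the mean KL divergence between P(k|x) and the marginal P(k)."""
    p_marginal = probs.mean(axis=0)  # P(k) = <P(k|x)>_x
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

uniform = np.full((1000, 10), 0.1)          # indistinct images: score ~ 1 (minimum)
onehot = np.eye(10)[np.arange(1000) % 10]   # sharp and diverse: score ~ 10 (maximum)
print(inception_score(uniform), inception_score(onehot))
```

The two synthetic inputs illustrate the range stated above: uniform classifier outputs give the minimum score of 1, while confident and evenly spread predictions approach the number of classes M.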
2.6 Network architecture

We train Wasserstein generative adversarial networks with gradient penalty using Adam optimization. The critic (discriminator) consists of three strided convolutional layers with leaky-ReLU activation functions and a kernel size of 5; we refer to the number of filters in the first convolutional layer as the network's capacity, and the number of filters doubles with each convolutional layer. The generator starts from a 128-dimensional Gaussian latent space that is processed by three convolutions transposing the structure of the critic. Padding is chosen to match the 28-by-28 pixel images of MNIST. The network is trained in batches using Adam.

3 Related work

Differentially private stochastic gradient descent has previously been used to train generative adversarial networks (Beaulieu-Jones et al., 2017; Xie et al., 2018; Zhang et al., 2018). Beaulieu-Jones et al. (2017) use the original GAN algorithm with a binary classification in the classifier. They clip the gradients after averaging, but not the parameters (W-GAN step). In Methods, they write:

... we limit the maximum distance of any of these [optimization] steps and then add a small amount of random noise.

As we outlined in Sec. 2.1, limiting each step, i.e. clipping the average gradient, is not sufficient to grant differential privacy to each example; each contribution to the gradient needs to be clipped individually. At the time of writing, this error is also present in their published code (github.com/greenelab/SPRINT_gan). Results presented by Beaulieu-Jones et al. (2017) may still be correct because the authors use a batch size of one throughout the paper.
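The difference can be made explicit in a few lines of NumPy (our own illustration): clipping the averaged gradient bounds the size of the optimization step, but leaves a single example's contribution to that step unbounded, so noise calibrated to C no longer masks any individual example.

```python
import numpy as np

grads = np.random.default_rng(1).normal(size=(64, 10))  # per-example gradients
C = 1.0

# Insufficient: clip only the averaged gradient (bounds the step as a whole).
mean_grad = grads.mean(axis=0)
step = mean_grad / max(1.0, np.linalg.norm(mean_grad) / C)

# Required for DP-SGD (Algorithm 1): clip every example before averaging.
norms = np.linalg.norm(grads, axis=1, keepdims=True)
dp_step = (grads / np.maximum(1.0, norms / C)).mean(axis=0)
```

With a batch size of one, the two computations coincide, which is why the results of Beaulieu-Jones et al. (2017) may remain valid despite the error.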
In the public code repository implementing the experiments reported by Zhang et al. (2018), we have encountered a stray factor 1/√|B| in the computation of the noise. At a batch size of 64, the noise is therefore a factor of eight too small to grant the differential-privacy guarantees computed from the reported clipping and noise multiplier. Furthermore, the clipping is performed in a non-standard way: the authors group sets of parameters, e.g. all biases, and clip their gradients' L-2 norms separately. The difference to regular clipping is probably proportional to the number of groups; the clipped gradients are therefore about a factor of 10 larger than when clipped in the standard way. These programming and conceptual errors render the applied privacy analysis inapplicable.

Xie et al. (2018) use the original W-GAN algorithm in which parameters are clipped. They go on to show that bounded network parameters, images, and classifications result in bounded parameter gradients, thus fulfilling the DP conditions. They compute the bound c_g and use it in their DP algorithm. Unfortunately, the authors neither state how they scale the noise with c_g, nor do they publish their algorithm in the public repository (github.com/illidanlab/dpgan).

4 Results

To evaluate the inception score, we trained a classifier on the ten classes of the MNIST training data set. On the test set, the trained classifier achieved an accuracy of 99.6%, which is comparable to the state of the art for an individual neural-network classifier. We also computed the inception score of the original data sets and obtained values close to the maximum of 10 (± 0.05 on the train set, ± 0.04 on the test set).

We used this classifier to compute the inception scores reported in the following sections. Specifically, we describe how the gradient clipping, the noise multiplier, and the network capacity affect the privacy-utility relationship between the privacy parameter ε and the inception score IS. In the computation of ε, we set δ = 10⁻⁵ to align our results with other publications that use the same value; note that there is still no consensus on how to choose δ optimally. In all figures, we counterpose the results with a non-anonymous GAN training and mark its maximum inception score, IS_max = 8.51, by a horizontal dashed line. All experiments were done with the public code repository at github.com/jusjusjus/noise-in-dpsgd-2020/tree/v1.0.0.
4.1 Gradient clipping

For Adam optimization, "[...] the magnitudes of parameter updates are invariant to rescaling of the gradient" (Kingma and Ba, 2015). Specifically, Adam tracks running means of the value m_t and of the square v_t of the incoming averaged gradients g_t at time step t. The parameter update Δp_t is normalized by these running means (some details omitted for clarity):

m_t = β₁ m_{t−1} + (1 − β₁) g_t
v_t = β₂ v_{t−1} + (1 − β₂) g_t²    (9)
Δp_t = η m_t / √v_t

Let us choose C smaller than all individual gradient L-2 norms throughout the whole training process, which is possible if we assume that the gradients are non-zero. Then we can rewrite g_t = C (ξ̂ + g_t/||g_t||), wherein ξ̂ = ξ/C is independent of C. After entering this expression for g_t into Eqs. (9), C cancels if we replace m → C m̂ and v → C² v̂. For such small values of C, the training becomes C-independent due to the normalization property of Adam optimization.

Figure 1: Privacy-utility plot for different gradient L2-norm clips C. For clips of C = 1 and below we observed comparable IS, whereas for C = 100, IS as a function of ε was systematically reduced.

Adam's rescaling property thus divides the L-2 clipping constant C into two regimes, separated by the smallest gradient norm in the data set: (i) for smaller values of C, all gradients are clipped equally; (ii) for larger values, the individual gradient norms weight the gradient sum over the mini-batch.

Empirically, we found that regime (i), in which C is chosen arbitrarily small, showed the largest inception score; larger values of C led to smaller inception scores. We choose C = 1 in the following, which was below the per-example gradient norms in our experiments.
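The cancellation is easy to verify numerically. In the sketch below (our own illustration; bias correction and the numerical-stability constant are omitted, matching the simplified Eqs. (9)), scaling the whole gradient stream by C leaves every Adam update unchanged.

```python
import numpy as np

def adam_updates(grad_stream, eta=1e-3, beta1=0.9, beta2=0.999):
    """Adam parameter updates for a stream of scalar gradients, per Eqs. (9)."""
    m = v = 0.0
    out = []
    for g in grad_stream:
        m = beta1 * m + (1 - beta1) * g       # running mean of gradients
        v = beta2 * v + (1 - beta2) * g**2    # running mean of squared gradients
        out.append(eta * m / np.sqrt(v))      # update is normalized by sqrt(v)
    return np.array(out)

g = np.random.default_rng(2).normal(size=1000)
for C in (1e-3, 1.0, 1e3):
    assert np.allclose(adam_updates(C * g), adam_updates(g))
```

Scaling g by C scales m by C and v by C², so the ratio m/√v, and with it the update, is unchanged; this is precisely the replacement m → C m̂, v → C² v̂ used in the argument above.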
4.2 Noise multiplier

The noise multiplier σ is inversely related to the signal-to-noise ratio in the gradients. One may therefore expect that large σ is detrimental to learning, as optimal parameter values are escaped by random perturbations.

We trained generators with DP-Adam at C = 1 for a variety of values of σ while monitoring their inception score and ε (Eq. (1)). Throughout training, the score initially increased and then approached a maximal level. We found that the maximal inception score showed little dependence on σ for σ < 0.9, reaching about s = 7.2. For σ > 0.9, the score showed a steep break-off, only reaching about s = 3 at optimal levels of the privacy budget (cf. Fig. 2). Herein we observe the existence of a critical noise multiplier beyond which the privacy-utility trade-off will likely be sub-optimal, as we discuss further below.

Figure 2: Privacy-utility plot for different noise multipliers σ. At σ < 0.9, we observed a plateau of IS that was mostly shifted to larger ε for smaller σ. At σ = 1.0, we observed another regime in which training led to much lower values of IS with shallow gains during continued training.

4.3 Network capacity

We trained generators with DP-SGD at a variety of network capacities while monitoring their inception score and privacy loss ε (Eq. (1)). The score approached a capacity-dependent maximum with increasing steps. We found that the maximal inception score was largest for an intermediate network capacity of 32, while larger and smaller capacities reached lower levels.
5 Discussion

Generative anonymization of data through differential privacy promises a quantifiable trade-off between the protection of individual privacy and the ability to use raw data sets for machine learning. In this article, we explored the training of generative adversarial networks for image data under differential-privacy constraints. We found that some of the previous articles exploring this option showed technical and conceptual flaws (Beaulieu-Jones et al., 2017; Zhang et al., 2018). (Our work reproduces experiments initially published by Zhang et al. (2018); we do not compare their results to ours because of the aforementioned errors.) In our introduction, we provided a detailed workup of differential privacy in deep learning, which we hope will further clarify this intricate mathematical theory for future researchers.

It is in the interest of a practitioner to maximize the utility of the generator while staying within a specific privacy budget. We explored this privacy-utility trade-off on the MNIST data set. We uncovered two modes of DP-GAN training showing distinct characteristic dependencies between privacy loss and image utility: an optimal one, in which the utility steeply increases with spent privacy and eventually reaches a plateau, and a sub-optimal one, wherein the slope is shallow (cf. Fig. 1 at C = 100 and Fig. 2 at σ = 1). Sub-optimal privacy-utility characteristics crossed through optimal ones at low levels of utility. The plateau of the optimal characteristics, on the other hand, did not seem to depend on the hyperparameters; in Fig. 2, for example, we observed a stable plateau over an order of magnitude in ε and for different values of the noise multiplier. Increases in σ or C led to a sudden change in the observed characteristics. We hypothesize that too small a signal-to-noise ratio in the anonymized gradient updates makes the stochastic optimization process unable to uncover minima in the parameter space. This hypothesis is consistent with the break-offs upon increases in σ and C, and with the large fluctuations in utility present in the sub-optimal characteristics.

We also explained analytically why DP-GAN training with Adam becomes independent of the gradient clipping for small values of the clipping constant. This result generalizes to other methods of gradient descent with normalization. Furthermore, we found a weak dependence of the utility plateau on the network capacity. In the non-DP case, small capacities show reduced expressivity in the generator, degrading the utility; a reduction was not visible, however, at increased capacities (computations not shown). When training with differential privacy, we found that increased capacity led to a systematic decrease in utility. In these high-capacity networks, more terms enter the gradient L-2 norm, enhancing the effect of the clipping. We hypothesize that this is the main mechanism degrading the utility of larger networks.

Our simple explorations are limited by the approximations with which we probe the privacy-utility relationship: privacy was measured indirectly as an upper privacy bound, and utility was measured indirectly with the inception score. Future work should complement these metrics with direct measures of privacy through membership-inference attacks, and with classification scores on generated images, for example.

In applications, privacy-utility plots could be a useful tool to tune parameters in a data anonymization workflow.
The method needs to be adopted carefully, however, because additional privacy loss is incurred during hyperparameter optimization (Abadi et al., 2016), and because the auxiliary classifier network we used needs to be trained with differential privacy as well.

In sum, we found that it is indeed possible to generate a differentially private synthetic data set within a moderate privacy budget of ε ≤ 10. However, we presume that the reduced utility of parameter-rich networks will be a major hurdle when training DP-GANs on larger, more nuanced image data sets than MNIST. This could become a problem in particular when training close to a sub-optimal regime in the signal-to-noise ratio of the parameter gradients.
Acknowledgements.
We thank Dr. Ilya Mironov for guiding our understanding of privacy-preserving deep learning. This work was funded by the European Union and the German Ministry for Economic Affairs and Energy under the EXIST grant program.
References
Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016. doi: 10.1145/2976749.2978318. URL http://arxiv.org/abs/1607.00133.

Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, and Casey S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing. bioRxiv, page 159756, July 2017. doi: 10.1101/159756.

Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9:211–407, 2014. doi: 10.1561/0400000042.

Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. A review on generative adversarial networks: Algorithms, theory, and applications. arXiv preprint arXiv:2001.06937, 2020.

Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Ilya Mironov. Rényi differential privacy. CoRR, abs/1702.07476, 2017. URL http://arxiv.org/abs/1702.07476.

Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled Gaussian mechanism. pages 1–14, 2019. URL http://arxiv.org/abs/1908.10530.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016. URL https://arxiv.org/abs/1606.03498.

Tijmen Tieleman and Geoffrey E. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. Differentially private generative adversarial network. February 2018. URL http://arxiv.org/abs/1802.06739.

Xinyang Zhang, Shouling Ji, and Ting Wang. Differentially private releasing via deep generative model. 2018. URL http://arxiv.org/abs/1801.01594.