Differentially Private Adversarial Robustness Through Randomized Perturbations
Nan Xu§, Oluwaseyi Feyisetan∗, Abhinav Aggarwal∗, Zekun Xu∗, Nathanael Teissier†

§University of Southern California, Los Angeles, CA, USA
∗Amazon Alexa, Seattle, WA, USA
†Amazon Alexa, Arlington, VA, USA

[email protected], {sey,aggabhin,zeku,natteis}@amazon.com

Abstract
Deep neural networks, despite their great success in diverse domains, are provably sensitive to small perturbations of correctly classified examples, which lead to erroneous predictions. Recently, it was proposed that this behavior can be combatted by optimizing the worst-case loss function over all possible substitutions of training examples. However, this approach can be prone to weighing unlikely substitutions higher, limiting the accuracy gain. In this paper, we study adversarial robustness through randomized perturbations, which has two immediate advantages: (1) by ensuring that substitution likelihood is weighted by proximity to the original word, we circumvent optimizing worst-case guarantees and achieve performance gains; and (2) the calibrated randomness imparts differentially private model training, which additionally improves robustness against adversarial attacks on the model outputs. Our approach uses a novel density-based mechanism based on truncated Gumbel noise, which ensures training on substitutions of both rare and dense words in the vocabulary while maintaining semantic similarity for model robustness.
Introduction

Deep neural networks (DNNs) have found applications within multiple domains: from computer vision (Krizhevsky et al., 2012) and natural language processing (Mikolov et al., 2013), to robotics (Kober et al., 2013) and self-driving cars (Bojarski et al., 2016). However, DNNs have been shown to be vulnerable to adversarial examples. These are small perturbations of examples that are correctly classified by well-trained models but incorrectly classified after the perturbation (Szegedy et al., 2013; Goodfellow et al., 2014).

A few approaches have been proposed to defend against such adversarial attacks. One of the most widely used methods is adding the adversarial examples to the original training set and retraining the model. On most kinds of perturbations, such an augmented training approach has achieved improved robustness without harming accuracy on the original test sets (Jia and Liang, 2017; Iyyer et al., 2018; Ribeiro et al., 2018; Belinkov and Bisk, 2017; Ebrahimi et al., 2017). However, this often leads to the augmented neural network over-fitting to the additional data (Matyasko and Chau, 2017), while failing to perform robustly against other types of adversarial examples (Jia and Liang, 2017; Belinkov and Bisk, 2017). Recently, certified defences have been adopted in the computer vision domain (Lecuyer et al., 2019; Dvijotham et al., 2018; Gowal et al., 2018). To defend against perturbations on text data, the Interval Bound Propagation (IBP) approach was proposed by Jia et al. (2019) to minimize the upper bound on the worst-case loss that word substitutions can induce during the training procedure.

In this paper, we propose a new approach to generate adversarial examples via word substitutions in textual analysis. Our approach is based on randomized mechanisms satisfying metric differential privacy ($d_\chi$-privacy (Andrés et al., 2013)), a variant of differential privacy (DP). DP was proposed by Dwork et al. (2006) and has been established as a de facto standard for privacy-preserving data analysis. It mathematically guarantees, given a privacy parameter $\epsilon$, that an adversary observing separate outputs of computations over adjacent databases (described by a Hamming distance) will make essentially the same inference. As opposed to standard DP, with $d_\chi$-privacy, the guarantees are scaled by a (different) distance metric between adjacent databases, and privacy-preserving noise is sampled from a multivariate (Laplacian) distribution. The distances are over a metric space as defined by word embeddings such as GloVe (Pennington et al., 2014) or fastText (Bojanowski et al., 2017). However, because this noise is calibrated purely by distance, such mechanisms cannot separate close (i.e., more relevant) words from other close but less relevant words. As a result, for a given value of the privacy parameter $\epsilon$, an irrelevant word could have a similar substitution probability as a relevant word. We propose a new metric-DP mechanism, called the truncated Gumbel perturbation mechanism, which considers a smaller range of nearby words than the multivariate Laplace mechanism. The new mechanism samples a value $k$ from a truncated Poisson distribution to determine the set of substitution candidates before perturbation; hence, nearby words with irrelevant meanings are disregarded.
This better preserves word semantics and improves the utility of models trained on perturbed datasets in downstream tasks.

In this paper, we investigate the performance of a well-trained IBP model on classification tasks when the input text is perturbed by a metric-DP mechanism with different values of $\epsilon$, corresponding to different degrees of semantic preservation. Motivated by the success of augmented training with adversarial data such as (Jia and Liang, 2017), we also add the adversarial examples generated by the privacy mechanisms to the original training set and compare the resulting robustness with IBP.

The contributions of this paper are as follows:

• We propose a novel metric-DP mechanism called the truncated Gumbel mechanism, which provides formal privacy guarantees and better preserves semantic meaning than the existing multivariate Laplace mechanisms.

• To the best of our knowledge, we are the first to leverage metric-DP mechanisms to generate adversarial examples and study the performance of different adversarial training approaches at different values of $\epsilon$.

• We empirically demonstrate the benefit of the truncated Gumbel mechanism in preserving semantics and show that augmented training performs better than certifiably robust training, both in clean and adversarial accuracy.
Related Work

Privacy Preservation
DP (Dwork et al., 2006) preserves privacy on the output of a computation by adding noise sampled from a certain distribution (e.g., Laplace). The magnitude of the noise is proportional to the sensitivity of the computation and is controlled by the parameter $\epsilon$. We consider a relaxation of DP, metric DP or $d_\chi$-privacy, that originated in the context of location privacy, where locations close to the user are assigned higher probability than those far away (Andrés et al., 2013; Chatzikokolakis et al., 2013). For text, the corollary to geo-location coordinates are word vectors in an embedding space. To preserve privacy, noise is sampled from a multivariate distribution, such as the multivariate Laplace mechanism in (Fernandes et al., 2019; Feyisetan et al., 2020) or a hyperbolic distribution in (Feyisetan et al., 2019).

Adversarial Attacks
Deep neural networks are vulnerable to adversarial examples, where perturbations applied to examples correctly classified by well-trained models lead to misclassification with significantly high confidence (Szegedy et al., 2013; Goodfellow et al., 2014). In the text domain, adversarial example generation includes techniques for extraneous text insertion (Jia and Liang, 2017), word substitution (Alzantot et al., 2018), paraphrasing (Iyyer et al., 2018; Ribeiro et al., 2018), and character-level noise (Belinkov and Bisk, 2017; Ebrahimi et al., 2017). In this paper, we generate adversarial examples by word-level perturbations without semantic-preservation constraints. Specifically, randomized perturbations satisfying metric DP are employed, with the privacy parameter $\epsilon$ controlling semantic similarity during substitutions.
Adversarial Training
Augmenting training sets with adversarial examples is a common way of improving robustness in adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014). Although it achieves improved robustness without harming accuracy on the original test sets (Jia and Liang, 2017; Iyyer et al., 2018; Ribeiro et al., 2018; Belinkov and Bisk, 2017; Ebrahimi et al., 2017), augmented training is still vulnerable when tested on other adversarial examples (Jia and Liang, 2017; Belinkov and Bisk, 2017). Certified defences, which provide guarantees of robustness to norm-bounded attacks, have become popular in computer vision (Lecuyer et al., 2019; Dvijotham et al., 2018; Gowal et al., 2018). For text, the Interval Bound Propagation (IBP) approach minimizes an upper bound on the worst-case loss during training that any combination of word substitutions can induce (Jia et al., 2019). This requires that the allowed word substitutions are known a priori. In this paper, we study the robustness of an IBP-trained model on adversarial examples generated by metric-DP mechanisms. Furthermore, we analyze how adding adversarial examples to the training set can help improve robustness.
Connections between Privacy Preservation and Adversarial Learning
To the best of our knowledge, this paper is the first to propose perturbing text with metric-DP mechanisms and testing the robustness of adversarial training approaches with these adversarial examples. Connections between privacy and adversarial learning have been studied extensively in different domains (Pinot et al., 2019). Two key properties of DP have been leveraged to add a noise layer to a network's architecture to provide guaranteed robustness against adversarial examples (Lecuyer et al., 2019). Similarly, trade-offs between DP preservation and provable robustness have been studied by learning private model parameters first, followed by rigorous robustness bound computation (Phan et al., 2019a,b).
We begin by providing some background on metric differential privacy and the multivariate Laplace mechanism, which is commonly used in privacy-preserving textual analysis.
Differential Privacy
First proposed by Dwork et al. (2006), DP provides a strong mathematical framework for guaranteeing that the output of a randomized mechanism will remain essentially unchanged on any two neighboring input databases. Formally, a randomized mechanism $M : \mathcal{X} \to \mathcal{Y}$ satisfies $(\epsilon, \delta)$-DP if for any $x, x' \in \mathcal{X}$ that differ in only one entry, it holds for all $Y \subseteq \mathcal{Y}$ that:

$$\Pr[M(x) \in Y] \le e^{\epsilon} \Pr[M(x') \in Y] + \delta, \quad (1)$$

where $\epsilon > 0$ and $\delta \in [0, 1)$ are parameters that quantify the strength of the privacy guarantee. If $\delta = 0$, we say that the mechanism $M$ is $\epsilon$-DP.

This definition can be generalized to other metrics for capturing dataset proximity depending on the application: e.g., the Manhattan distance metric can provide indistinguishability when an individual's registration date differs by at most 5 days in two databases, and the Euclidean distance on the 2-dimensional plane can be used to protect a user's longitude and latitude information (Chatzikokolakis et al., 2015). In particular, for text data, we adopt metric differential privacy (a.k.a. $d_\chi$-privacy), following (Chatzikokolakis et al., 2013; Fernandes et al., 2019; Feyisetan et al., 2020). In this framework, we ensure that for all $y \in \mathcal{Y}$, it holds that:

$$\Pr[M(x) = y] \le e^{\epsilon d(x, x')} \Pr[M(x') = y], \quad (2)$$

where the metric $d(x, x') = \lVert \phi(x) - \phi(x') \rVert$ is the Euclidean distance between the representations of $x$ and $x'$ in some semantic embedding space like GloVe (Pennington et al., 2014). Under this definition, the likelihood of a similar output from the mechanism is weighted in proportion to the distance between the words being substituted.

Multivariate Laplace Mechanism
A popular approach for achieving metric DP is to use a multivariate Laplace mechanism for high-dimensional data (Wu et al., 2017; Feyisetan et al., 2020). Given the embedding vector $\phi(x) \in \mathbb{R}^n$ for each word in the vocabulary, an $n$-dimensional noise vector $\kappa$ is sampled following the distribution $p(\kappa) \propto \exp(-\epsilon \lVert \kappa \rVert)$. This variate is obtained by first sampling a uniform vector in the $n$-dimensional unit ball and scaling it using a Gamma variate sampled from $\Gamma(n, 1/\epsilon)$. The perturbed word $x'$ is the nearest word to $\phi(x) + \kappa$ in the embedding space.
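For concreteness, here is a minimal sketch of this sampler; the dictionary-style embedding lookup, the brute-force nearest-neighbor search, and all names are illustrative rather than taken from the cited papers:

```python
import numpy as np

def multivariate_laplace_perturb(word, embedding, vocab, epsilon, rng):
    """Perturb `word` with noise distributed as p(kappa) ∝ exp(-epsilon * ||kappa||).

    `embedding` maps each vocabulary word to its n-dimensional vector and
    `vocab` is the list of words (both hypothetical helpers).
    """
    v = embedding[word]
    n = v.shape[0]
    # Uniform direction: normalize a standard Gaussian sample.
    direction = rng.standard_normal(n)
    direction /= np.linalg.norm(direction)
    # Uniform point in the unit ball, then Gamma(n, 1/epsilon) scaling.
    radius = rng.random() ** (1.0 / n)
    kappa = direction * radius * rng.gamma(shape=n, scale=1.0 / epsilon)
    noisy = v + kappa
    # The output word is the nearest vocabulary word to the noisy vector.
    dists = [np.linalg.norm(embedding[w] - noisy) for w in vocab]
    return vocab[int(np.argmin(dists))]
```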
Truncated Poisson Sampling

The mechanism we define in this paper uses random variates sampled from a Poisson distribution, but truncated in value if it gets too large. We define this density function below.
Definition 1.
Let $\lambda > 0$ be a real number and $a, b$ be two integers with $1 \le a < b$. We say that a random variable $X$ follows a $\mathrm{TruncatedPoisson}(\lambda; a, b)$ distribution if the following holds:

$$\Pr(X = k) = \begin{cases} e^{-\lambda} \lambda^k / k! & \text{if } a \le k < b, \\ 1 - \sum_{k=a}^{b-1} e^{-\lambda} \lambda^k / k! & \text{if } k = b, \\ 0 & \text{otherwise.} \end{cases}$$

To sample a random variate $X$ following this distribution, we sample $Y \sim \mathrm{Poisson}(\lambda)$ and set $X = Y$ if $a \le Y < b$, and $X = b$ otherwise. An important property of such random variables is that for all $\lambda > 0$, it holds that $\Pr(X = b) > e^{-\lambda}$. This follows from the fact that since $1 \le a < b$, we can write

$$\Pr(X = b) = \sum_{k=0}^{\infty} \frac{e^{-\lambda} \lambda^k}{k!} - \sum_{k=a}^{b-1} \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} + \sum_{k=1}^{a-1} \frac{e^{-\lambda} \lambda^k}{k!} + \sum_{k=b}^{\infty} \frac{e^{-\lambda} \lambda^k}{k!} > e^{-\lambda}.$$

This will be useful in our privacy analysis.
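A direct sampler following Definition 1 (a small illustrative sketch; only numpy's standard Poisson generator is assumed):

```python
import numpy as np

def truncated_poisson(lam, a, b, rng):
    """Sample from TruncatedPoisson(lam; a, b) as in Definition 1.

    Draw Y ~ Poisson(lam) and return Y if a <= Y < b, else b. All the
    probability mass outside [a, b) collapses onto b, which is why
    Pr(X = b) > exp(-lam).
    """
    y = rng.poisson(lam)
    return y if a <= y < b else b

# Example: candidate-set size used by Algorithm 2, with lam = log |W|.
rng = np.random.default_rng(0)
vocab_size = 50000
k = truncated_poisson(np.log(vocab_size), 1, vocab_size, rng)
```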
Gumbel Distribution

Our mechanism uses random variates sampled from the Gumbel distribution, defined over all $x \in \mathbb{R}$ by the cumulative density $\mathrm{Gumbel}(x; \mu, \beta) = \exp(-\exp(-(x - \mu)/\beta))$ for $\mu \in \mathbb{R}$ and $\beta > 0$. We write $X \sim \mathrm{Gumbel}(0, b)$ to denote a Gumbel-distributed random sample with $\mu = 0$ and $\beta = b$.

Lambert-W Function
This is a popular multi-valued function obtained from the inverse relation of the function $f(w) = w e^w$ for any complex-valued $w$. We focus only on the real principal branch of this function, defined whenever $x \ge -1/e$, for which we have the asymptotic identity $W(x) = \ln x - \ln \ln x + \Theta\left(\frac{\ln \ln x}{\ln x}\right)$ (see (Hoorfar and Hassani, 2008)).

We now give an overview of the approaches discussed in this paper for defending against adversarial attacks. Given text input $x \in \mathcal{X}$, we consider classification tasks where a model $f(x; \theta)$, parametrized by $\theta$, should predict a label $y \in \mathcal{Y}$. For sentiment classification tasks, the input $x$ is composed of a string of $l$ words $x_1, x_2, \ldots, x_l$ and is labelled by one of two classes $y \in \{1, -1\}$, where positive sentiment is denoted by $1$ and negative by $-1$. For textual entailment tasks, two texts are given, one the premise $x$ and the other the hypothesis $x'$, and a label is provided based on the relationship between the two: $y \in \{0, 1, 2\}$, denoting the entailment, contradiction, or neutral relationship, respectively. Performance of the classification model is evaluated by the percentage of correct predictions on the test set, $\sum_{x_i \in \mathcal{D}_{\text{test}}} \mathbb{1}(f(x_i; \theta) = y_i) / |\mathcal{D}_{\text{test}}|$, where $\mathbb{1}$ is an indicator function equal to $1$ if the predicted label $f(x_i; \theta)$ is identical to the ground truth $y_i$ and $0$ otherwise, and $|\mathcal{D}_{\text{test}}|$ is the size of the test set.

Adversarial Attacks by Word Substitutions
We evaluate the performance of existing certifiably robust trained models when perturbed texts are provided as inputs. Formally, a word-level perturbation is obtained by substituting a given word $x_i$ with another word $\tilde{x}_i$, such that the semantic similarity between the two is determined by the metric-DP mechanism employed. To achieve this, the additive noise is parametrized by the privacy parameter $\epsilon$: a larger value of $\epsilon$ corresponds to less noise, and vice versa.

For the multivariate Laplace mechanism of Feyisetan et al. (2020), since the noise is scaled purely as a function of the distance from the original word, when $\epsilon$ is small, words in the dense regions of the embedding space are prone to being substituted with dissimilar words (that are further away), compared to words in sparse regions. This is because in areas where embedding vectors are densely located, the distance between two irrelevant words is commensurate with that between two words with similar meanings in a sparse region. Hence, adapting the word-level substitution to variations in the density of the embedding space can help boost the utility of models trained on perturbed datasets. To do this efficiently (and without any expensive computation of local sensitivity each time a substitution is made), we propose a novel mechanism based on a truncated Gumbel distribution and prove that it admits metric DP. Instead of sampling based on the distance from the original word, this approach samples $k$ candidate substitutions following the truncated Poisson distribution and then makes a distance-calibrated random choice from the $k - 1$ nearest neighbors of the original word in the embedding space (see Algorithms 1 and 2). We describe this mechanism in more detail in Section 5, and prove its formal privacy guarantees in Appendix A.

Learning with Adversarial Examples
Motivated by the success of augmented training approaches when text perturbations happen in the form of extraneous text insertion (Jia and Liang, 2017), paraphrasing (Iyyer et al., 2018; Ribeiro et al., 2018), or character-level noise (Belinkov and Bisk, 2017; Ebrahimi et al., 2017), we also investigate the effectiveness of adding adversarial examples generated by metric-DP mechanisms to the training set for retraining. Retaining the label of each sample, we perturb the text four times; every word is perturbed by either the existing multivariate Laplace mechanism or the proposed truncated Gumbel mechanism, as sketched below.
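A compact sketch of this augmentation step; `perturb_sentence` stands in for either mechanism applied word by word, and the function names and data layout are illustrative:

```python
def augment_dataset(train_set, perturb_sentence, copies=4):
    """Add `copies` perturbed variants of every sample, keeping its label.

    `train_set` is a list of (sentence, label) pairs, where a sentence is a
    list of words; `perturb_sentence` applies a metric-DP mechanism
    independently to each word (a hypothetical helper).
    """
    augmented = list(train_set)
    for sentence, label in train_set:
        for _ in range(copies):
            augmented.append((perturb_sentence(sentence), label))
    return augmented
```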
The Truncated Gumbel Perturbation Mechanism

Motivated by the approach proposed by Durfee and Rogers (2019), our density-aware word substitution mechanism uses a Gumbel random variate for selecting amongst a list of candidate perturbations (see Algorithm 2).

Algorithm 1: TRUNCATED-GUMBEL-ARG-MIN
Input: real vector $u = [u_1, \ldots, u_m]$, scale parameter $b > 0$, truncation parameter $C > 0$.
1. Sample $g_1, \ldots, g_m \sim$ i.i.d. $\mathrm{Gumbel}(0, b)$, truncated between $[-C, C]$.
2. Compute $u' = [u_1 + g_1, \ldots, u_m + g_m]$.
3. Return $\arg\min u'$.
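A sketch of Algorithm 1 in numpy follows. The truncated Gumbel draw here uses inverse-CDF sampling restricted to $[-C, C]$, which is one standard way to realize the truncation; the algorithm itself does not prescribe a sampling method:

```python
import numpy as np

def truncated_gumbel(b, c, size, rng):
    """Gumbel(0, b) samples conditioned to lie in [-c, c], via the inverse CDF."""
    cdf = lambda x: np.exp(-np.exp(-x / b))        # Gumbel(0, b) CDF
    u = rng.uniform(cdf(-c), cdf(c), size=size)    # restrict to [F(-c), F(c)]
    return -b * np.log(-np.log(u))                 # invert the CDF

def truncated_gumbel_arg_min(u, b, c, rng):
    """Algorithm 1: index of the smallest entry after adding truncated Gumbel noise."""
    g = truncated_gumbel(b, c, len(u), rng)
    return int(np.argmin(np.asarray(u) + g))
```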
To ensure plausible deniability over the entire vocabulary, the support of the substitution mechanism must include all the words; however, limiting the set of candidate substitutions to only the semantically similar words is necessary to maintain utility. We balance this trade-off by first randomly selecting the $k$ nearest neighbors of the original word using a truncated Poisson variate, with support over the whole vocabulary (see Step 4). The mean number of candidates is set to the natural logarithm of the vocabulary size, to ensure that this number is neither too small nor too large. Next, the closest $k - 1$ words to the original word are obtained (using a nearest-neighbor search) and their distances are recorded (see Steps 5 and 6). A random choice over this set is made using Algorithm 1, where the distances are first noised with Gumbel-distributed random variates and the smallest noised distance then determines the new word (see Step 7). The Gumbel noise is scaled using the privacy parameter $\epsilon$ and the diameter $\Delta$ of the embedding space, and then clipped using a truncation parameter $C > 0$. The process is repeated independently for each word in the input string.

Experiments

We evaluate the proposed privacy mechanism, adversarial attacks, and the defense approach through answers to the following questions:

Q1 How does the privacy parameter $\epsilon$ affect the behavior of the perturbation mechanisms on different text classification tasks?

Q2 Does the proposed truncated Gumbel mechanism lead to a smaller range of word substitutions compared to the multivariate Laplace mechanism?

Q3 How will different adversarial training approaches, i.e., the IBP approach with certified robustness and the proposed augmented training, perform when tested on adversarial examples derived from metric-DP mechanisms?
We evaluate the robustness of models on two text classification tasks: sentiment analysis on the IMDb movie review dataset (Maas et al., 2011) and textual entailment on the premise-hypothesis relation dataset SNLI (Bowman et al., 2015). We use 300-dimensional GloVe vectors for word embedding (Pennington et al., 2014). The statistics of the two datasets are listed in Table 1.
Sentiment Analysis
In IMDb, each movie review is accompanied by either a positive or a negative label. For this binary classification task, we implemented the CNN architecture that achieved the best adversarial-attack and certified accuracy in (Jia et al., 2019).
Textual Entailment
In SNLI, each sample is composed of two sentences: one the premise and the other the hypothesis. The classification task is to label the relationship as an entailment, contradiction, or neutral. Following the implementation in (Alzantot et al., 2018), only words in the hypothesis are allowed to be substituted. Similarly, we adopted the architecture that outperformed others in (Jia et al., 2019) for evaluating different adversarial training approaches.
We compare the robustness of the following two training approaches when adversarial examples are generated using metric-DP perturbation.
Certifiably Robust Trained Approach
Interval Bound Propagation (IBP) was leveraged to minimize the upper bound on the worst-case loss that any combination of word substitutions can induce. Specifically, an upper and lower bound on the activation of a neuron in each layer is computed based on the bounds of the neurons in previous layers that connect to it. Bounds for the input layer are computed based on the smallest axis-aligned box that contains all the possible word substitutions, while the upper bound on the loss in the final layer is combined with the normal cross-entropy loss to optimize the classification performance on the actual word and any other substitutions. The allowed substitutions are based on (Alzantot et al., 2018).
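To make the bound propagation concrete, the following is a minimal sketch of how an interval passes through one affine layer and a monotone activation; this is the standard IBP step rather than the exact architecture of (Jia et al., 2019):

```python
import numpy as np

def ibp_affine(lower, upper, W, bias):
    """Propagate elementwise bounds [lower, upper] through x -> W @ x + bias.

    The interval center moves through the layer exactly; the radius is
    scaled by |W|, yielding valid (if loose) output bounds.
    """
    center = (upper + lower) / 2.0
    radius = (upper - lower) / 2.0
    out_center = W @ center + bias
    out_radius = np.abs(W) @ radius
    return out_center - out_radius, out_center + out_radius

def ibp_relu(lower, upper):
    """Monotone activations map interval endpoints to interval endpoints."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)
```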
Augmented Training

We add the adversarial examples (four perturbations per sample) generated by metric differential privacy mechanisms into the training set and retrain the model.
Algorithm 2: Truncated Gumbel Perturbation Mechanism
Input: string $x = w_1 w_2 \ldots w_\ell \in \mathcal{W}^\ell$, privacy parameter $\epsilon > 0$, word set $\mathcal{W}$.
1. Let $\Delta = \max_{w, w' \in \mathcal{W}} \lVert \phi(w) - \phi(w') \rVert$ be the maximum inter-word distance and $\Delta_0 = \min_{w, w' \in \mathcal{W}, w \ne w'} \lVert \phi(w) - \phi(w') \rVert$ be the minimum inter-word distance.
2. Set $b = \max\left\{\frac{2\Delta}{W(2\alpha\Delta)}, \frac{2\Delta}{\log_e(\alpha\Delta_0)}\right\}$, where $\alpha = \frac{1}{3}\left(\epsilon - \frac{2 + 2\log_e|\mathcal{W}|}{\Delta_0}\right)$ and $W$ denotes the principal branch of the Lambert-W function.
3. Initialize an empty string $\tilde{x}$.
for $w_i \in x$ do
4. Sample $k = \mathrm{TruncatedPoisson}(\log|\mathcal{W}|; 1, |\mathcal{W}|)$.
5. Find the top $k$ closest words to $w_i$ in $\mathcal{W}$ as $u = [u_1, u_2, \ldots, u_j, \ldots, u_k]$, where $u_1 = w_i$.
6. Compute the distances $d = [d_1, d_2, \ldots, d_j, \ldots, d_k]$, where $d_j = \lVert \phi(w_i) - \phi(u_j) \rVert$.
7. Set $\tilde{w}_i = u_j$, where $j = \text{TRUNCATED-GUMBEL-ARG-MIN}(d, b, \Delta)$.
8. Add $\tilde{w}_i$ to $\tilde{x}$.
end
9. Return $\tilde{x}$.
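A runnable sketch of Algorithm 2 over a toy vocabulary follows; it reuses the `truncated_poisson` and `truncated_gumbel_arg_min` sketches from above, computes nearest neighbors by brute force, and takes the scale $b$ from the formula in Step 2 as reconstructed here (all implementation choices are illustrative):

```python
import numpy as np
from scipy.special import lambertw

def truncated_gumbel_mechanism(sentence, vocab, emb, epsilon, rng):
    """Perturb each word of `sentence` as in Algorithm 2.

    `vocab` is a list of words and `emb` a |W| x n array of their
    embeddings; pairwise distances are precomputed, which is only
    practical for a toy vocabulary.
    """
    n_words = len(vocab)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    delta = dists.max()                      # maximum inter-word distance
    delta0 = dists[dists > 0].min()          # minimum inter-word distance
    alpha = (epsilon - (2 + 2 * np.log(n_words)) / delta0) / 3.0
    b = max(2 * delta / np.real(lambertw(2 * alpha * delta)),
            2 * delta / np.log(alpha * delta0))
    out = []
    for w in sentence:
        i = vocab.index(w)
        k = truncated_poisson(np.log(n_words), 1, n_words, rng)
        nearest = np.argsort(dists[i])[:k]   # includes w itself (distance 0)
        j = truncated_gumbel_arg_min(dists[i][nearest], b, delta, rng)
        out.append(vocab[nearest[j]])
    return out
```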
                      IMDb        SNLI
Task type             binary      three-class
Training set size     20,000      550,152
Test set size         1,000       10,000
Total word count      11,856,015  4,614,822
Vocabulary size       145,901     49,895
Avg. sentence length  263.46      –

Table 1: Summary of dataset properties.
Following (Alzantot et al., 2018), a population-based genetic attacker is implemented to search for perturbations that lead to misclassification by the model. Given an original or modified sentence, the attacker randomly substitutes a word from the sentence with a new one based on the perturbation mechanism satisfying metric DP. After multiple substitutions, the attacker obtains a population of new sentences together with their fitness scores (negatively proportional to the probability predicted for the correct label). If the new sentence with the highest fitness score successfully fools the model, the attacker moves forward to the next sentence and starts a new round of testing. Otherwise, the attacker performs crossover and mutation operations: it samples two new sentences as parents from the population according to their fitness scores, and then generates a child sentence by taking each word from either parent at random. Another round of perturbation over the child sentence is then performed to further increase sentence diversity. The model is deemed robust after providing correct predictions over a predefined number of attacks.
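The attack loop can be summarized as in the sketch below; the population size, iteration budget, success test, and helper names are illustrative simplifications of the search described above:

```python
import numpy as np

def genetic_attack(sentence, label, model_prob, perturb_word, rng,
                   pop_size=20, max_iters=100):
    """Search for a perturbed sentence that the model misclassifies.

    `model_prob(s, y)` returns the probability the model assigns to label
    y for sentence s (a hypothetical hook); fitness is higher when that
    probability is lower.
    """
    def mutate(s):
        s = list(s)
        i = rng.integers(len(s))
        s[i] = perturb_word(s[i])          # metric-DP word substitution
        return s

    population = [mutate(sentence) for _ in range(pop_size)]
    for _ in range(max_iters):
        fitness = np.array([1.0 - model_prob(s, label) for s in population])
        best = population[int(np.argmax(fitness))]
        if 1.0 - fitness.max() < 0.5:      # model no longer favors `label`
            return best                     # attack succeeded
        # Sample parents proportionally to fitness; crossover, then mutate.
        probs = fitness / fitness.sum()
        children = []
        for _ in range(pop_size):
            p1, p2 = rng.choice(len(population), size=2, p=probs)
            child = [a if rng.random() < 0.5 else b
                     for a, b in zip(population[p1], population[p2])]
            children.append(mutate(child))
        population = children
    return None                             # attack failed
```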
Based on attributes of the test set, two metrics are used to evaluate model performance:

• Clean Accuracy: the percentage of correct predictions when testing on the original samples.

• Adversarial Accuracy: the percentage of correct predictions when testing on perturbed samples.
In the context of privacy preservation, plausible deniability measures the likelihood of making a correct inference given a sample perturbed by the privacy mechanism. Following (Feyisetan et al., 2020), the following statistics are recorded to empirically evaluate the plausible deniability of the metric-DP mechanisms at different values of $\epsilon$ (over 1,000 experiment runs):

• $N_w$ measures the probability that a word does not get modified by the mechanism. This is approximated by counting the number of times an input word $w$ does not get replaced after running the mechanism 1,000 times.

• $S_w$ is the number of distinct words that are produced as the output of $M(w)$. This is approximated by counting the number of distinct substitutions for an input word $w$ after running the mechanism 1,000 times.
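Both statistics are straightforward to estimate empirically, e.g. (the run count matches the 1,000 used here; `mechanism` is any word-level metric-DP perturbation):

```python
def deniability_stats(word, mechanism, runs=1000):
    """Estimate N_w (times `word` survives) and S_w (distinct outputs)."""
    outputs = [mechanism(word) for _ in range(runs)]
    n_w = sum(1 for out in outputs if out == word)
    s_w = len(set(outputs))
    return n_w, s_w
```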
Plausible Deniability Analysis (Q1)

In Fig. 1, we observe similar trends in the two privacy statistics on both datasets. When samples are perturbed by the multivariate Laplace mechanism (Fig. 1a and Fig. 1b), the number of distinct substitutions $S_w$ decreases from close to 1,000 towards 1 as $\epsilon$ grows, while the number of times the original word is maintained, $N_w$, shows the opposite trend. The empirical values of the two measures are consistent with the definition of metric DP that the multivariate Laplace mechanism satisfies, i.e., $\epsilon \to 0$ provides absolute privacy, as the output produced by the mechanism becomes independent of the input word, while $\epsilon \to \infty$ results in null privacy, where $M(w) = w$.

[Figure 1: Empirical $S_w$ and $N_w$ statistics of the multivariate Laplace mechanism and the truncated Gumbel perturbation mechanism on vocabularies from IMDb and SNLI. Panels: (a) multivariate Laplace on IMDb, (b) multivariate Laplace on SNLI, (c) truncated Gumbel on IMDb, (d) truncated Gumbel on SNLI. The average of each measure is plotted as a curve while the standard deviation is represented by shadows along the curve. The same plot patterns (curve and shadow) represent the same meaning (mean ± std) in the following figures.]

There are two main differences between the truncated Gumbel mechanism (Fig. 1c and Fig. 1d) and the multivariate Laplace mechanism in privacy statistics: 1) a minor increase or decrease in $\epsilon$ does not influence the word substitutions produced by truncated Gumbel, hence the variation of $S_w$ and $N_w$ is plotted against the logarithm of $\epsilon$; 2) because substitutions are restricted to the top $k$ closest words in the vocabulary, the maximum number of distinct substitutions one word can have is around 20 on IMDb and 17.5 on SNLI.

Word Substitution Range Analysis (Q2)
One main advantage of the proposed truncated Gumbel perturbation mechanism over the existing multivariate Laplace mechanism is that it relies on the top-$k$ closest words as substitutions, which helps preserve word semantics and improve the utility of downstream ML tasks for words located in dense areas of the embedding space. To show this property, we compare the number of distinct word substitutions $S_w$ when the number of times the word is kept unchanged, $N_w$, is fixed (Fig. 2). We discover that when the different mechanisms produce the same perturbation effects, the multivariate Laplace mechanism has a much broader range of word substitutions than the proposed truncated Gumbel mechanism, which is likely to raise problems in semantic preservation and result in poor performance on downstream tasks trained on the perturbed dataset.

[Figure 2: Word substitution range comparison (lower $S_w$ is better when $N_w$ is fixed). Panels: (a) $S_w$ against $N_w$ on IMDb, (b) $S_w$ against $N_w$ on SNLI. Due to the different scales of $S_w$ under the two mechanisms, the y-axis indicates the log value of $S_w$ for better visualization.]

We list the performance of the two adversarial training approaches when samples are perturbed by the multivariate Laplace mechanism in Table 3 and by the truncated Gumbel mechanism in Table 2.

[Table 2: Performance of adversarial training approaches on text data with(out) perturbations from the truncated Gumbel perturbation mechanism. Results are recorded at values of $\log \epsilon$ slightly larger than the respective lower bounds on $\epsilon$ for IMDb and SNLI.]
In Table 3, the clean accuracy of the proposed augmented training approach is consistently higher than that of the certifiably robust trained approach, IBP, for any choice of $\epsilon$ on IMDb and for sufficiently large $\epsilon$ on SNLI. Retraining with adversarial examples helps maintain a similar level of clean accuracy as the normal training approach, which is consistent with observations in the literature (Jia and Liang, 2017; Iyyer et al., 2018; Ribeiro et al., 2018; Belinkov and Bisk, 2017; Ebrahimi et al., 2017). When evaluating the model's robustness against word perturbations from the multivariate Laplace mechanism, augmented training outperforms the IBP approach only when the $\epsilon$ value is larger than some threshold on each dataset. This is expected, as augmented training cannot protect against all attacks, especially when small values of $\epsilon$ permit arbitrary word substitutions without regard to semantic preservation. In that regime, the model can hardly learn the hidden relationship between the corrupted new texts and the original text label.

Given the better semantic-preserving capability inherent in the proposed truncated Gumbel mechanism, the augmented training approach outperforms the certifiably robust trained IBP method in both clean and adversarial accuracy for almost every $\epsilon$ value tested. In Table 2, augmented training improves clean accuracy over IBP on both IMDb and SNLI when $\log \epsilon = 50$. At the same time, better performance against adversarial attacks is achieved by the augmented training approach, with higher adversarial accuracy on both IMDb and SNLI. One possible explanation of the inferior adversarial accuracy achieved by the certified defense approach IBP may be attributed to its training procedure, which is based on word substitutions that preserve semantic meanings (Alzantot et al., 2018). However, the testing adversarial examples are generated by randomized perturbations from metric-DP mechanisms, where the semantic meaning is not always preserved but dynamically determined by the privacy parameter $\epsilon$.

Conclusion

We study the performance of different adversarial training approaches against adversarial examples generated by metric-DP mechanisms. To better preserve semantic meanings during word perturbations, we propose a novel truncated Gumbel mechanism, which formally satisfies metric DP (see Appendix A). Empirical experiments demonstrate the advantage of the truncated Gumbel mechanism over the existing multivariate Laplace mechanism due to its smaller range of substitution candidates. In two textual classification tasks, retraining with adversarial examples performs better than the certified defence in both clean and adversarial accuracy.

We think the following aspects are interesting and deserve more investigation in the future: 1) the robustness of other adversarial training approaches based on the metric-DP-inspired adversarial examples, e.g., surrogate-loss minimization; 2) the generalization capability of the well-trained augmented training approach, e.g., performance against other types of adversarial examples; 3) the privacy preservation performance of the proposed truncated Gumbel mechanism, e.g., performance of membership inference attacks (MIA) on perturbed texts.

References
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998.
Miguel E. Andrés, Nicolás E. Bordenabe, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. 2013. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 901–914.
Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
Konstantinos Chatzikokolakis, Miguel E. Andrés, Nicolás Emilio Bordenabe, and Catuscia Palamidessi. 2013. Broadening the scope of differential privacy using metrics. In International Symposium on Privacy Enhancing Technologies Symposium, pages 82–102. Springer.
Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Marco Stronati. 2015. Constructing elastic distinguishability metrics for location privacy. Proceedings on Privacy Enhancing Technologies, 2015(2):156–170.
David Durfee and Ryan M. Rogers. 2019. Practical differentially private top-k selection with pay-what-you-get composition. In Advances in Neural Information Processing Systems, pages 3532–3542.
Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O'Donoghue, Jonathan Uesato, and Pushmeet Kohli. 2018. Training verified learners with learned verifiers. arXiv preprint arXiv:1805.10265.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.
Natasha Fernandes, Mark Dras, and Annabelle McIver. 2019. Generalised differential privacy for text document processing. In International Conference on Principles of Security and Trust, pages 123–148. Springer, Cham.
Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. 2020. Privacy- and utility-preserving textual analysis via calibrated multivariate perturbations. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 178–186.
Oluwaseyi Feyisetan, Tom Diethe, and Thomas Drake. 2019. Leveraging hierarchical representations for preserving privacy and utility in text. arXiv preprint arXiv:1910.08917.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. 2018. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715.
Abdolhossein Hoorfar and Mehdi Hassani. 2008. Inequalities on the Lambert W function and hyperpower function. J. Inequal. Pure and Appl. Math, 9(2):5–9.
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059.
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.
Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. arXiv preprint arXiv:1909.00986.
Jens Kober, J. Andrew Bagnell, and Jan Peters. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. 2019. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pages 656–672. IEEE.
Andrew Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.
Alexander Matyasko and Lap-Pui Chau. 2017. Margin maximization for robust classification using deep learning. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 300–307. IEEE.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
NhatHai Phan, My T. Thai, Han Hu, Ruoming Jin, Tong Sun, and Dejing Dou. 2019a. Scalable differential privacy with certified robustness in adversarial learning. arXiv preprint arXiv:1903.09822.
NhatHai Phan, Minh Vu, Yang Liu, Ruoming Jin, Dejing Dou, Xintao Wu, and My T. Thai. 2019b. Heterogeneous Gaussian mechanism: Preserving differential privacy in deep learning with provable robustness. arXiv preprint arXiv:1906.01444.
Rafael Pinot, Florian Yger, Cédric Gouy-Pailler, and Jamal Atif. 2019. A unified view on differential privacy and robustness to adversarial examples. arXiv preprint arXiv:1906.07982.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. 2017. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1307–1322.
A Privacy Proof for Truncated Gumbel Mechanism
Theorem 1.
The truncated Gumbel perturbation mechanism, defined in Algorithm 2, is $\epsilon d_\chi$-private with respect to the Euclidean metric.

Proof. We first show that for any pair of substitutable words $w$ and $w'$,

$$\frac{\Pr[M(w) = u_i \mid K = n]}{\Pr[M(w') = u_i \mid K = n]} \le \exp\left[\frac{2}{b} e^{2\Delta/b} d(w, w')\right],$$

where $n = |\mathcal{W}|$ and $d(w, w') = \lVert \phi(w) - \phi(w') \rVert$. Conditional on $K = n$,

$$\Pr(M(w) = u_i \mid K = n) = \Pr\left(d_i + g_i < \min_{j \ne i}(d_j + g_j)\right).$$

Since $g_1, \ldots, g_n$ are i.i.d. random variables, we argue for each $i$ independently. Fix $g_{-i} = [g_1, \ldots, g_{i-1}, g_{i+1}, \ldots, g_n]$ as a random draw from $n - 1$ independent Gumbel distributions. Define

$$g^* = \sup\left\{g : d_i + g < \min_{j \ne i}(d_j + g_j)\right\}.$$

Then $g_i < \min_{j \ne i}(d_j + g_j) - d_i$ if and only if $g_i \le g^*$, which means $M(w) = u_i$ if and only if $g_i \le g^*$. Now consider another substitutable word $w'$ with a corresponding distance vector $d' = [d'_1, \ldots, d'_n]$. By the triangle inequality, we have $|d_i - d'_i| \le d(w, w')$ for $i = 1, \ldots, n$. Therefore,

$$\Pr(M(w') = u_i \mid K = n) = \Pr\left(g_i < \min_{j \ne i}(d'_j + g_j) - d'_i\right) \le \Pr\left(g_i < \min_{j \ne i}(d_j + g_j) - d_i + 2 d(w, w')\right) = \Pr(g_i \le g^* + 2 d(w, w')).$$

Therefore,

$$\frac{\Pr(M(w) = u_i \mid K = n)}{\Pr(M(w') = u_i \mid K = n)} \ge \frac{\Pr(g_i \le g^*)}{\Pr(g_i \le g^* + 2 d(w, w'))} = \frac{\exp(-e^{-g^*/b})}{\exp(-e^{-g^*/b} e^{-2 d(w, w')/b})} = \exp\left[-e^{-g^*/b}\left(1 - e^{-2 d(w, w')/b}\right)\right],$$

which is increasing in $g^*$ since $1 - e^{-2 d(w, w')/b} > 0$. Since $g^* \ge -2\Delta$ (the noise is truncated at $-\Delta$ and all distances are at most $\Delta$), and using $1 - e^{-x} \le x$,

$$\frac{\Pr(M(w) = u_i \mid K = n)}{\Pr(M(w') = u_i \mid K = n)} \ge \exp\left(-e^{2\Delta/b}\left(1 - e^{-2 d(w, w')/b}\right)\right) \ge \exp\left[-\frac{2}{b} e^{2\Delta/b} d(w, w')\right].$$

By symmetry of $w$ and $w'$, we also have

$$\frac{\Pr(M(w) = u_i \mid K = n)}{\Pr(M(w') = u_i \mid K = n)} \le \exp\left[\frac{2}{b} e^{2\Delta/b} d(w, w')\right].$$

Recall that $K \sim \mathrm{TruncatedPoisson}(\lambda; 1, n)$. We want an upper bound for $\frac{\Pr(M(w) = u_i)}{\Pr(M(w') = u_i)}$, which is

$$\frac{\Pr(M(w) = u_i)}{\Pr(M(w') = u_i)} = \frac{\sum_{k=1}^{n} \Pr(M(w) = u_i \mid K = k) \Pr(K = k)}{\sum_{k=1}^{n} \Pr(M(w') = u_i \mid K = k) \Pr(K = k)} \le \frac{\sum_{k=1}^{n} \Pr(M(w) = u_i \mid K = k) \Pr(K = k)}{\Pr(M(w') = u_i \mid K = n) \Pr(K = n)} \le \frac{n - 1 + \Pr(M(w) = u_i \mid K = n) \Pr(K = n)}{\Pr(M(w') = u_i \mid K = n) \Pr(K = n)}.$$

Since $\Pr(M(w) = u_i \mid K = n) = \exp(-e^{-g^*/b}) \ge \exp(-e^{2\Delta/b})$ and $\Pr(K = n) \ge e^{-\lambda}$ (from Definition 1),

$$\frac{\Pr(M(w) = u_i)}{\Pr(M(w') = u_i)} \le \exp\left(\frac{2}{b} e^{2\Delta/b} d(w, w')\right)\left(\frac{n - 1}{\exp(-e^{2\Delta/b} - \lambda)} + 1\right) \le n \exp\left(e^{2\Delta/b} + \lambda\right) \exp\left(\frac{2}{b} e^{2\Delta/b} d(w, w')\right).$$

In order to guarantee $\epsilon d_\chi$-privacy, we solve for $b$ using

$$e^{\epsilon d(w, w')} \ge n \exp\left(e^{2\Delta/b} + \lambda\right) \exp\left(\frac{2}{b} e^{2\Delta/b} d(w, w')\right).$$

Taking logarithms on both sides,

$$\epsilon \ge \frac{1}{d(w, w')} \log_e\left(n \exp\left(e^{2\Delta/b} + \lambda\right)\right) + \frac{2}{b} e^{2\Delta/b},$$

so we need to find an upper bound for the right-hand side as a function of $b$. Using $d(w, w') \ge \Delta_0$,

$$\frac{1}{d(w, w')} \log_e\left(n \exp\left(e^{2\Delta/b} + \lambda\right)\right) + \frac{2}{b} e^{2\Delta/b} \le \frac{2 + \log_e n + \lambda}{\Delta_0} + \left(\frac{1}{\Delta_0} + \frac{2}{b}\right) e^{2\Delta/b},$$

which is decreasing in $b$. When $b \le \Delta_0$,

$$\frac{2 + \log_e n + \lambda}{\Delta_0} + \left(\frac{1}{\Delta_0} + \frac{2}{b}\right) e^{2\Delta/b} \le \frac{2 + \log_e n + \lambda}{\Delta_0} + \frac{3}{b} e^{2\Delta/b},$$

and it is sufficient to set

$$b = \frac{2\Delta}{W(2\alpha\Delta)}, \quad \text{where } \alpha = \frac{1}{3}\left(\epsilon - \frac{2 + \log_e n + \lambda}{\Delta_0}\right)$$

and $W$ is the Lambert-W function. When $b > \Delta_0$,

$$\frac{2 + \log_e n + \lambda}{\Delta_0} + \left(\frac{1}{\Delta_0} + \frac{2}{b}\right) e^{2\Delta/b} \le \frac{2 + \log_e n + \lambda}{\Delta_0} + \frac{3}{\Delta_0} e^{2\Delta/b},$$

and it is sufficient to set

$$b = \frac{2\Delta}{\log_e(\alpha \Delta_0)}.$$

Thus, a sufficient condition for

$$\epsilon \ge \frac{1}{d(w, w')} \log_e\left(n \exp\left(e^{2\Delta/b} + \lambda\right)\right) + \frac{2}{b} e^{2\Delta/b}$$

is to set $b$ to be

$$\max\left(\frac{2\Delta}{W(2\alpha\Delta)}, \frac{2\Delta}{\log_e(\alpha\Delta_0)}\right).$$

Now that we have proved that the proposed mechanism $M$ is $\epsilon d_\chi$-private with respect to the Euclidean metric $d$ on a string of one word, we have for any pair of inputs $w, w' \in \mathcal{W}^\ell$ and any output $u \in \mathcal{W}^\ell$,

$$\frac{\Pr(M(w) = u)}{\Pr(M(w') = u)} = \prod_{i=1}^{\ell} \frac{\Pr(M(w_i) = u_i)}{\Pr(M(w'_i) = u_i)} \le \prod_{i=1}^{\ell} \exp(\epsilon d(w_i, w'_i)) = \exp(\epsilon d(w, w')),$$

where $d(w, w') = \sum_{i=1}^{\ell} d(w_i, w'_i)$.

For Algorithm 2, we set $\lambda = \log |\mathcal{W}|$ and $n = |\mathcal{W}|$, so that the value of $b$ used is

$$b = \max\left(\frac{2\Delta}{W(2\alpha\Delta)}, \frac{2\Delta}{\log_e(\alpha\Delta_0)}\right), \quad \alpha = \frac{1}{3}\left(\epsilon - \frac{2 + 2\log_e |\mathcal{W}|}{\Delta_0}\right).$$

For this value of $b$ to be defined, we must ensure that $\epsilon$ is set in a way that the logarithm and the Lambert-W function have positive arguments. This holds whenever the following is true:

$$\epsilon > \frac{2 + 2\log_e |\mathcal{W}|}{\Delta_0}.$$

For the IMDb dataset we have $|\mathcal{W}| = 48{,}210$, and for the SNLI dataset $|\mathcal{W}| = 11{,}673$; plugging these vocabulary sizes and the corresponding minimum inter-word distances into the expression above yields the respective lower bounds on $\epsilon$ referenced in the experiments.

B Fraction of Modified Words
Lemma 1.
For given $\epsilon > 0$, a string $x = w_1 \ldots w_\ell$, and any fixed $k$, the expected fraction of words that get modified using Algorithm 2 is at least $1 - p$, where $p = \exp(-e^{-2\Delta/b})$. In particular, $\mathbb{E}(N_w) \le p |\mathcal{W}|$.

Proof. Fix a word $w_i \in x$. Since $u_1 = w_i$, observe that we can write the probability that it does not get modified as $\Pr(\tilde{w}_i = u_1) = \Pr(g_1 < \min_{j \ge 2}(d_j + g_j))$. Let $g^* = \sup\{g : g < \min_{j \ge 2}(d_j + g_j)\}$. Then, similar to the proof of Theorem 1, $g_1 < \min_{j \ge 2}(d_j + g_j)$ if and only if $g_1 \le g^*$. This gives $\Pr(\tilde{w}_i = u_1) = \Pr(g_1 \le g^*) = \exp(-e^{-g^*/b})$. Since $g^* \le 2\Delta$, we can write $\Pr(\tilde{w}_i = u_1) \le \exp(-e^{-2\Delta/b})$. Thus, the expected fraction of words in $x$ that do not get modified is at most $p$, where $p = \exp(-e^{-2\Delta/b})$. From this, we compute the expected fraction of words that get modified as at least $1 - p$, as desired. The bound on $\mathbb{E}(N_w)$ follows from a simple union bound over all the words in the vocabulary.

Note that $\frac{\partial p}{\partial b} = \frac{\partial}{\partial b} \exp(-e^{-2\Delta/b}) < 0$, and hence $p$ is a decreasing function of $b$, implying that as the privacy increases ($b$ increases), the value of $\mathbb{E}(N_w)$ decreases, as expected.

C Utility Analysis vs. Sparsity of the Embedding Space
We want to analyze how word substitution behaves under the Gumbel versus the Laplace mechanism for different embedding densities. Given a word $w \in \mathcal{W}$ in the vocabulary, we let $\delta(w) = \min_{w' \in \mathcal{W}, w' \ne w} d(w, w')$ denote the distance to the closest word to $w$ in the embedding space. For the same value of $\epsilon$, let $n_{\mathrm{Lap}} \sim \mathrm{Lap}(2/\epsilon)$ be the amount of Laplace noise added to perturb the word, and let $p_{\mathrm{Lap}}(w)$ be the probability of the event $\xi_w : \arg\min_{w' \in \mathcal{W}} \lVert w' - (w + n_{\mathrm{Lap}}) \rVert = w$ (i.e., the word remains unchanged). Then we can compute this probability as follows:

$$p_{\mathrm{Lap}}(w) = \Pr(\xi_w) = \Pr\left(\lVert n_{\mathrm{Lap}} \rVert < \delta(w)/2\right) = 2\int_0^{\delta(w)/2} \frac{\epsilon}{4} e^{-\epsilon x/2} \, dx = 1 - e^{-\epsilon \delta(w)/4}.$$

Thus, as $\delta(w)$ increases (the sparsity around $w$ increases), so does $p_{\mathrm{Lap}}(w)$, implying that under the Laplace mechanism, words inside the sparse regions of the embedding space tend to stay unchanged. However, when $\delta(w)$ approaches $0$ (denser regions), the probability $p_{\mathrm{Lap}}(w)$ vanishes. In such regions, $w$ will get modified with probability approaching one, which can potentially reduce utility.

For the same amount of $\epsilon$, the truncated Gumbel mechanism keeps $w$ unchanged when the noise added to $w$ is smaller than that of any other perturbed candidate. If $p_{\mathrm{Gum}}(w)$ is the probability that $w$ does not change under this perturbation, then we can write the following:

$$p_{\mathrm{Gum}}(w) \ge \Pr(g_1 < \delta(w) + g_2) \Pr(K \ge 2) = \Pr(g_1 - g_2 < \delta(w)) \Pr(K \ge 2).$$

Since the difference of two i.i.d. Gumbel random variables follows a logistic distribution, we obtain the following (by letting $G_b \sim \mathrm{Logistic}(0, b)$):

$$p_{\mathrm{Gum}}(w) \ge \Pr(G_b < \delta(w)) \Pr(K \ge 2) = \left(\frac{1}{1 + e^{-\delta(w)/b}}\right) \Pr(K \ge 2) \ge e^{-e^{-\delta(w)/b}} \Pr(K \ge 2),$$

where the last inequality follows since $1 + x \le e^x$. Thus, even when $\delta(w)$ approaches $0$ (denser regions), there is at least

$$p_{\mathrm{Gum}}(w)\big|_{\delta(w) \to 0} \ge \frac{\Pr(K \ge 2)}{e} = \frac{1}{e}\left(1 - \frac{\log |\mathcal{W}|}{|\mathcal{W}|}\right) \xrightarrow{|\mathcal{W}| \to \infty} \frac{1}{e}$$

probability that $w$ remains unchanged. This helps preserve utility by ensuring that the modified word is likely to be closer to the original word, since there is a significant probability mass on the original word (especially as $|\mathcal{W}|$ increases).
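A quick numeric check of these two quantities (an illustrative sketch; the values of epsilon, b, and the vocabulary size are arbitrary):

```python
import numpy as np

def p_lap(delta_w, epsilon):
    """Probability the Laplace-perturbed word stays unchanged."""
    return 1.0 - np.exp(-epsilon * delta_w / 4.0)

def p_gum_floor(delta_w, b, vocab_size):
    """Lower bound on the stay probability under truncated Gumbel."""
    pk2 = 1.0 - np.log(vocab_size) / vocab_size   # Pr(K >= 2)
    return np.exp(-np.exp(-delta_w / b)) * pk2

# As delta(w) -> 0 (dense region), the Laplace probability vanishes
# while the Gumbel bound stays near 1/e.
for d in [1.0, 0.1, 0.01]:
    print(d, p_lap(d, epsilon=10.0), p_gum_floor(d, b=0.5, vocab_size=50000))
```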
[Table 3: Performance of adversarial training approaches on text data with(out) perturbations from the multivariate Laplace mechanism. The clean accuracy of normal training on IMDb and SNLI is also reported; the accuracy of whichever model is higher in a given setting is marked in boldface.]