Privacy-Preserving Graph Convolutional Networks for Text Classification
Timour Igamberdiev and
Ivan Habernal
Trustworthy Human Language Technologies, Department of Computer Science, Technical University of Darmstadt
Abstract
Graph convolutional networks (GCNs) are a powerful architecture for representation learning and making predictions on documents that naturally occur as graphs, e.g., citation or social networks. Data containing sensitive personal information, such as documents with people's profiles or relationships as edges, are prone to privacy leaks from GCNs, as an adversary might reveal the original input from the trained model. Although differential privacy (DP) offers a well-founded privacy-preserving framework, GCNs pose theoretical and practical challenges due to their training specifics. We address these challenges by adapting differentially-private gradient-based training to GCNs. We investigate the impact of various privacy budgets, dataset sizes, and two optimizers in an experimental setup over five NLP datasets in two languages. We show that, under certain modeling choices, privacy-preserving GCNs perform up to 90% of their non-private variants, while formally guaranteeing strong privacy measures.
1 Introduction

Many text classification tasks naturally occur in the form of graphs where nodes represent text documents and edges are task specific, such as articles citing each other or health records belonging to the same patient. When learning node representations and predicting their categories, models benefit from exploiting information from the neighborhood of each node, as shown in graph neural networks, and graph convolutional networks (GCNs) in particular (Kipf and Welling, 2017), making them superior to other models (Xu et al., 2019; De Cao et al., 2019).

While GCNs are powerful for a variety of NLP problems, like other neural models they are prone to privacy attacks. Adversaries with extensive background knowledge and computational power might reveal sensitive information about the training data from the model, such as reconstructing information about the original classes of a model (Hitaj et al., 2017) or even auditing membership of an individual's data in a model (Song and Shmatikov, 2019). In order to preserve privacy for graph NLP data, models have to protect both the textual nodes and the graph structure, as both sources carry potentially sensitive information.

Privacy-preserving techniques, such as differential privacy (DP) (Dwork and Roth, 2013), prevent information leaks by adding 'just enough' noise during training a model while attaining acceptable performance. Recent approaches to DP in neural models attempt to balance this trade-off between noise and utility, with differentially private stochastic gradient descent (SGD-DP) (Abadi et al., 2016) being a prominent example. However, SGD-DP comes with design choices specific to i.i.d. data, such as batches and 'lots' (see §4.2), and its suitability for graph neural networks remains an open and non-trivial question.

In this work, we ask what privacy guarantees and performance can be provided by differentially private stochastic gradient descent and its variants for GCNs. First, we are interested in how models' accuracies differ under varying privacy 'budgets'. Second, and more importantly, we want to understand to which extent the training data size affects private and non-private performance, and whether simply adding more data would be a remedy for the expected performance drop of DP models. We tackle these questions by adapting SGD-DP (Abadi et al., 2016) to GCNs as well as proposing a differentially private version of Adam (Kingma and Ba, 2015), Adam-DP. We hypothesize that Adam's advantages, i.e., fewer training epochs, would lead to a better privacy/utility trade-off as opposed to SGD-DP.

We conduct experiments on five datasets in two languages (English and Slovak) covering a variety of NLP tasks, including research article classification in citation networks, Reddit post classification, and user interest classification in social networks, where the latter ones inherently carry potentially sensitive information calling for privacy-preserving models. Our main contributions are twofold. First, we show that DP training can be applied to the case of GCNs despite the challenges of non-i.i.d. data. Second, we show that more sophisticated text representations can mitigate the performance drop due to DP noise, resulting in a relative performance of 90% of the non-private variant, while keeping strict privacy ($\epsilon = 2$). To the best of our knowledge, this is the first study that brings differentially private gradient-based training to graph neural networks.
2 Differential Privacy

As DP does not belong to the mainstream methods in NLP, here we shortly outline the principles and present the basic terminology from the NLP perspective. Foundations can be found in (Dwork and Roth, 2013; Desfontaines and Pejó, 2020).

The main idea of DP is that if we query a database of $N$ individuals, the result of the query will be almost indistinguishable from the result of querying a database of $N - 1$ individuals, thus protecting each single individual's privacy to a certain degree. The difference of results obtained from querying any two databases that differ in one individual has a probabilistic interpretation.

Dataset $D$ consists of $|D|$ documents, where each document is associated with an individual whose privacy we want to preserve. (A document can be any arbitrary natural language text, such as a letter, medical record, tweet, personal plain-text password, or a paper review.) Let $D'$ differ from $D$ by one document, so either $|D'| = |D| \pm 1$, or $|D'| = |D|$ with the $i$-th document replaced. $D$ and $D'$ are called neighboring datasets.

Let $A: D \mapsto y \in \mathbb{R}$ be a function applied to a dataset $D$, for example a function returning the average document length or the number of documents in the dataset. This function is also called a query, which is not to be confused with queries in NLP, such as search queries. (In general, the query output is multidimensional, $\mathbb{R}^k$; here we keep it scalar for the sake of simplicity.) In DP, this query function is a continuous random variable associated with a probability density $p(A(D) = y)$. Once the function $A(D)$ is applied on the dataset $D$, the result is a single draw from this probability distribution. This process is also known as a randomized algorithm. For example, a randomized algorithm for the average document length can be a Laplace density such that

$$p(A(D) = y) = \frac{1}{2b} \exp\left(-\frac{|\mu - y|}{b}\right),$$

where $\mu$ is the true average document length and $b$ is the scale (the 'noisiness' parameter). By applying this query to $D$, we obtain $y \in \mathbb{R}$, a single draw from this distribution.

Now we can formalize the backbone idea of DP. Having two neighboring datasets $D, D'$, the privacy loss is defined as

$$\ln \frac{p(A(D) = y)}{p(A(D') = y)}. \quad (1)$$

DP bounds this privacy loss by design. Given $\epsilon \in \mathbb{R}: \epsilon \geq 0$ (the privacy budget hyper-parameter), all values of $y$, and all neighboring datasets $D$ and $D'$, we must ensure that

$$\max_{\forall y} \left| \ln \frac{p(A(D) = y)}{p(A(D') = y)} \right| \leq \epsilon. \quad (2)$$

In other words, the allowed privacy loss of any two neighboring datasets is upper-bounded by $\epsilon$, also denoted as $(\epsilon, 0)$-DP. ($(\epsilon, 0)$-DP is a simplification of the more general $(\epsilon, \delta)$-DP, where $\delta$ is a negligible constant allowing relaxation of the privacy bounds (Dwork and Roth, 2013, p. 18).) The privacy budget $\epsilon$ controls the amount of preserved privacy. If $\epsilon \to 0$, the query outputs of any two datasets become indistinguishable, which guarantees almost perfect privacy but provides very little utility. Similarly, higher $\epsilon$ values provide less privacy but better utility. Finding the sweet spot is thus the main challenge in determining the privacy budget for a particular application (Lee and Clifton, 2011; Hsu et al., 2014). An important feature of $(\epsilon, \delta)$-DP is that once we obtain the result $y$ of the query $A(D) = y$, any further computations with $y$ cannot weaken the privacy guaranteed by $\epsilon$ and $\delta$.

The desired behavior of the randomized algorithm is therefore adding as little noise as possible to maximize utility while keeping the privacy guarantees given by Eq. 2.
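To make the randomized algorithm above concrete, the following is a minimal Python sketch (our own illustration, not from the paper) of a Laplace-based query on a toy corpus; the function names and the toy data are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def avg_doc_length(corpus):
    """The raw (non-private) query A(D): average document length in tokens."""
    return np.mean([len(doc.split()) for doc in corpus])

def laplace_query(corpus, b):
    """Randomized algorithm: one draw from Laplace(mu, b), where mu is the
    true query value and b is the 'noisiness' scale from the text."""
    mu = avg_doc_length(corpus)
    return rng.laplace(loc=mu, scale=b)

D = ["a short document", "a slightly longer document here"]
y = laplace_query(D, b=1.0)   # a single draw, as described above
```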
Query sensitivity. The amount of noise is determined for each particular setup by the sensitivity of the query, $\Delta A$, such that for any neighboring datasets $D, D'$ we have

$$\Delta A = \max_{\forall D, D'} \left( |A(D) - A(D')| \right). \quad (3)$$

The sensitivity corresponds to the 'worst case' range of a particular query $A$, i.e., what is the maximum impact of changing one individual. The larger the sensitivity, the more noise must be added to fulfill the privacy requirements of $\epsilon$ (Eq. 2). For example, in order to be $(\epsilon, 0)$-DP, the Laplace mechanism must add noise $b = \Delta A \cdot \epsilon^{-1}$ (Dwork and Roth, 2013, p. 32). As the query sensitivity directly influences the required amount of noise, it is desirable to design queries with low sensitivity.

The mechanisms described so far consider a scenario where we apply the query only once. To ensure $(\epsilon, \delta)$-DP with multiple queries on the same dataset, proportionally more noise has to be added. (Queries might be different, for example querying the average document length first and then querying the number of documents in the dataset.)

3 Related Work

A wide range of NLP tasks have utilized graph neural networks, specifically graph convolutional networks (GCNs), including text summarization (Xu et al., 2020), machine translation (Marcheggiani et al., 2018) and semantic role labeling (Zheng and Kordjamshidi, 2020). Recent end-to-end approaches combine pre-trained transformer models with GNNs to learn graph representations for syntactic trees (Sachan et al., 2020). Rahimi et al. (2018) demonstrated the strength of GCNs on predicting the geo-location of Twitter users, where nodes are represented by users' tweets and edges by social connections, i.e., mentions of other Twitter users. Their approach shows that a user's neighborhood delivers extra information improving the model's performance. However, if we want to protect user-level privacy, the overall social graph has to be taken into account.

Several recent works in the NLP area deal with privacy using arbitrary definitions. Li et al. (2018) propose an adversarial-based approach to learning latent text representations for sentiment analysis and POS tagging. Although their privacy-preserving model performs on par with non-private models, they admit the lack of formal privacy guarantees. Similarly, Coavoux et al. (2018) train an adversarial model to predict private information on sentiment analysis and topic classification. The adversary's model performance served as a proxy for privacy strength but, despite its strengths, comes with no formal privacy guarantees. Similar potential privacy weaknesses can be found in a recent work by Abdalla et al. (2020), who replaced personal health information with semantically similar words while keeping acceptable accuracy on downstream classification tasks.

Abadi et al. (2016) pioneered the connection of DP and deep learning by bounding the query sensitivity using gradient clipping, as well as formally proving the overall privacy bounds by introducing the 'moments accountant' mechanism (see §4.3). While originally tested on image recognition, they inspired subsequent work in language modeling using LSTMs (McMahan et al., 2018).

General DP over graphs still poses substantial challenges preventing its practical use (Zhu et al., 2017, Sec. 4.4).
Two very recent approaches to local DP, that is, adding noise to each example before passing it to graph model training, transform the latent representation of the input into a binary vector, leading to reduced query sensitivity (Sajadmanesh and Gatica-Perez, 2020; Lyu et al., 2020).
4 Our Approach

4.1 Graph Convolutional Networks

We employ the Graph Convolutional Network (GCN) architecture (Kipf and Welling, 2017) for enabling DP in the domain of graph-based NLP. GCN is a common and simpler variant of more complex types of GNNs, which allows us to focus primarily on the DP analysis and results, allowing for a clear comparison of the DP and non-DP models.

Let $G = (V, E)$ model our graph data, where each node $v_i \in V$ contains a feature vector of dimensionality $d$. GCN aims to learn a node representation by integrating information from each node's neighborhood. The features of each neighboring node of $v_i$ pass through a 'message passing function' (usually a transformation by a weight matrix $\Phi$) and are then aggregated and combined with the current state of the node $h_i^l$ to form the next state $h_i^{l+1}$.

Edges are represented using an adjacency matrix $A \in \mathbb{R}^{n \times n}$. $A$ is then multiplied by the matrix $H \in \mathbb{R}^{n \times f}$, $f$ being the hidden dimension, as well as the weight matrix $\Phi$ responsible for message passing. Additional tweaks by Kipf and Welling (2017) include adding the identity matrix to $A$ to include self-loops in the computation, $\hat{A} = A + I$, as well as normalizing $\hat{A}$ by the degree matrix $\hat{D}$, specifically using the symmetric normalization $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$. This results in the following equation for calculating the next state of the GCN for a given layer $l$, passing through a non-linearity function $\sigma$:

$$H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} \Phi^{(l)}\right) \quad (4)$$

The final layer states for each node are then used for node-level classification, given output labels.

4.2 SGD-DP and Adam-DP

SGD-DP (Abadi et al., 2016) modifies the standard stochastic gradient descent algorithm to be differentially private. The DP 'query' is the gradient computation at time step $t$: $g_t(x_i) \leftarrow \nabla_{\theta_t} \mathcal{L}(\theta_t, x_i)$, for each $i$ in the training set. To ensure DP, the output of this query is distorted by random noise proportional to the sensitivity of the query, which is the range of values that the gradient can take. As the gradient range is unconstrained, possibly leading to extremely large noise, Abadi et al. (2016) clip the gradient vector by its $\ell_2$ norm, replacing each vector $g$ with $\bar{g} = g / \max\left(1, \frac{\|g\|_2}{C}\right)$, $C$ being the clipping threshold. This clipped gradient is altered by a draw from a Gaussian: $\bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 I)$.

Instead of running this process on individual examples, Abadi et al. (2016) actually break up the training set into 'lots' of size $L$, a slightly separate concept from that of 'batches'. Whereas the gradient computation is performed in batches, SGD-DP groups several batches together into lots for the DP calculation itself, which consists of adding noise, taking the average over a lot, and performing the descent $\theta_{t+1} \leftarrow \theta_t - \eta_t \tilde{g}_t$. Incorporating this concept, we obtain the overall core mechanism of SGD-DP:

$$\tilde{g}_t = \frac{1}{L} \left( \sum_{i \in L} \frac{g_t(x_i)}{\max\left(1, \frac{\|g_t(x_i)\|_2}{C}\right)} + \mathcal{N}(0, \sigma^2 C^2 I) \right) \quad (5)$$

In this paper, we also develop a DP version of Adam (Kingma and Ba, 2015), a widely-used default optimizer in NLP (Ruder, 2016). As Adam shares the core principle of gradient computation with SGD, to make it differentially private we add noise to the gradient following Eq. 5 (prior to Adam's moment estimates and parameter update).

Despite their conceptual simplicity, both SGD-DP and Adam-DP have to determine the amount of noise to guarantee $(\epsilon, \delta)$ privacy. Abadi et al. (2016) proposed the moments accountant, which we present in detail below.
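To illustrate the two core computations above, here is a minimal PyTorch sketch (our own illustration, not the paper's released implementation): a single GCN propagation step (Eq. 4) and the clip-and-noise gradient query of SGD-DP (Eq. 5). The variable names are ours.

```python
import torch

def gcn_layer(A, H, Phi):
    """One GCN propagation step (Eq. 4): add self-loops,
    apply symmetric degree normalization, transform, and activate."""
    A_hat = A + torch.eye(A.shape[0])            # self-loops: A + I
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)      # diagonal of D^(-1/2)
    D_inv_sqrt = torch.diag(d_inv_sqrt)
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ Phi)

def dp_noisy_gradient(per_example_grads, C, sigma):
    """Core SGD-DP mechanism (Eq. 5): clip each per-example gradient
    to l2 norm at most C, sum over the lot, add Gaussian noise
    N(0, sigma^2 C^2 I), and average by the lot size L."""
    L = len(per_example_grads)
    clipped = [g / torch.clamp(g.norm(p=2) / C, min=1.0)
               for g in per_example_grads]
    noise = torch.normal(0.0, sigma * C, size=clipped[0].shape)
    return (torch.stack(clipped).sum(dim=0) + noise) / L
```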
4.3 The Moments Accountant

SGD-DP introduces two features, namely (1) a reverse computation of the privacy budget, and (2) tighter bounds on the composition of multiple queries. First, a common DP methodology is to pre-determine the privacy budget $(\epsilon, \delta)$ and add random noise according to these parameters. In contrast, SGD-DP does the opposite: given a pre-defined amount of noise (a hyper-parameter of the algorithm), the privacy budget $(\epsilon, \delta)$ is computed retrospectively. Second, generally in DP, with multiple executions of a 'query' (i.e., a single gradient computation in SGD), we can simply sum up the $\epsilon, \delta$ values associated with each query, such that for $k$ queries with privacy budget $(\epsilon, \delta)$, the overall algorithm is $(k\epsilon, k\delta)$-DP. However, this naive composition leads to a very large privacy budget, as it assumes that each query used up the maximum given privacy budget.

The simplest bound on a continuous random variable $Z$, the Markov inequality, takes into account the expectation $E[Z]$, such that for $\epsilon \in \mathbb{R}^+$:

$$\Pr[Z \geq \epsilon] \leq \frac{E[Z]}{\epsilon} \quad (6)$$

Using the Chernoff bound, a variant of the Markov inequality, on the privacy loss $Z$ treated as a random variable (Eq. 2), we obtain the following formulation by multiplying by $\lambda \in \mathbb{R}$ and exponentiating:

$$\Pr[\exp(\lambda Z) \geq \exp(\lambda \epsilon)] \leq \frac{E[\exp(\lambda Z)]}{\exp(\lambda \epsilon)} \quad (7)$$

where $E[\exp(\lambda Z)]$ is also known as the moment-generating function.

The overall privacy loss $Z$ is composed of a sequence of consecutive randomized algorithms $X_1, \ldots, X_k$ (see §2). Since all $X_i$ are independent, the numerator in Eq. 7 becomes a product of all $E[\exp(\lambda X_i)]$. Converting to log form and simplifying, we obtain

$$\Pr[Z \geq \epsilon] \leq \exp\left(\sum_i \ln E[\exp(\lambda X_i)] - \lambda \epsilon\right) \quad (8)$$

Note the moment-generating function inside the logarithmic expression. Since the above bound is valid for any moment of the privacy loss random variable, we can go through several moments and find the one that gives us the lowest bound.

Since the left-hand side of Eq. 8 is by definition the $\delta$ value, the overall mechanism is $(\epsilon, \delta)$-DP for $\delta = \exp\left(\sum_i \ln E[\exp(\lambda X_i)] - \lambda \epsilon\right)$. The corresponding $\epsilon$ value can be found by rearranging Eq. 8:

$$\epsilon = \frac{\sum_i \ln E[\exp(\lambda X_i)] - \ln \delta}{\lambda} \quad (9)$$

The overall SGD-DP algorithm, given the right noise scale $\sigma$ and clipping threshold $C$, is thus shown to be $(O(q\epsilon\sqrt{T}), \delta)$-differentially private using this accounting method, with $q$ representing the ratio $\frac{L}{N}$ between the lot size $L$ and dataset size $N$, and $T$ being the total number of training steps. See (Abadi et al., 2016) for further details.
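As a small illustration of Eq. 9 (our own sketch, not the paper's accounting code), given composed log moments one can take the tightest bound over the tested moment orders $\lambda$; the numeric values below are hypothetical.

```python
import numpy as np

def epsilon_from_log_moments(log_moments, delta):
    """log_moments: list of (lam, alpha) pairs, where alpha is the
    composed log moment sum_i ln E[exp(lam * X_i)] over all steps.
    Returns the smallest epsilon over the tested moment orders (Eq. 9)."""
    return min((alpha - np.log(delta)) / lam for lam, alpha in log_moments)

# Hypothetical composed log moments for a few orders lambda:
eps = epsilon_from_log_moments([(2, 0.5), (4, 1.3), (8, 3.9)], delta=1e-5)
```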
5 Experiments

5.1 Datasets

We are interested in a text classification use-case where documents are connected via undirected edges, forming a graph. While structurally limiting, this definition covers a whole range of applications. We perform experiments on five single-label multi-class classification tasks.

The Cora, CiteSeer, and PubMed datasets (Yang et al., 2016; Sen et al., 2008; McCallum et al., 2000; Giles et al., 1998) are widely used citation networks of research papers, where citing a paper $i$ from paper $j$ creates an edge $i - j$. The task is to predict the category of the particular paper.

The Reddit dataset (Hamilton et al., 2017) treats the 'original post' as a graph node and connects two posts by an edge if any user commented on both posts. Given the large size of this dataset (230k nodes; all posts from Sept. 2014), causing severe computational challenges, we sub-sampled 10% of posts (only a few days of Sept. 2014). The gold label corresponds to one of the top Reddit communities to which the post belongs.

Unlike the previous English datasets, the
Pokec dataset (Takac and Zabovsky, 2012; Leskovec and Krevl, 2014) contains an anonymized social network in Slovak. Nodes represent users and edges their friendship relations. User-level information contains many attributes in natural language (e.g., 'music', 'perfect evening'). We set up the following binary task: given the textual attributes, predict whether a user prefers dogs or cats. (Perozzi and Skiena (2015) used the Pokec data for user profiling, namely age prediction for ad targeting. We find such a use case unethical. In contrast, our classification task is harmless, yet serves well the demonstration purposes of text classification of social network data.) Pokec's personal information, including friendship connections, shows the importance of privacy-preserving methods to protect this potentially sensitive information. For the preparation details see Appendix B.

The four English datasets adapted from previous work are only available in their encoded form. For the citation networks, each document is represented by a bag-of-words encoding. The Reddit dataset combines GloVe vectors (Pennington et al., 2014) averaged over the post and its comments. Only the Pokec dataset is available as raw texts, so we opted for multilingual BERT (Devlin et al., 2019) and averaged all contextualized word embeddings over each user's textual attributes. (Sentence-BERT (Reimers and Gurevych, 2019) resulted in lower performance; users fill in the attributes such that the text resembles a list of keywords rather than actual discourse.)

The variety of languages, sizes, and different input encodings allows us to compare non-private and private GCNs under different conditions. Table 1 summarizes data sizes and numbers of classes.

Dataset    Classes  Test size  Training size
CiteSeer   6        1,000      1,827
Cora       7        1,000      1,208
PubMed     3        1,000      18,217
Pokec      2        2,000      16,000
Reddit     41       5,643      15,252

Table 1: Dataset statistics; size is the number of nodes.
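As a concrete picture of the input encodings described above, the following sketch (our own, with hypothetical helper names; not the datasets' original preprocessing) averages pre-trained word vectors over a post and its comments to form a node feature vector:

```python
import numpy as np

def node_features(texts, glove, dim=300):
    """Average GloVe vectors over all tokens of a post and its comments.
    `glove` is assumed to be a dict mapping token -> np.ndarray of size dim."""
    vectors = [glove[tok] for text in texts
               for tok in text.lower().split() if tok in glove]
    if not vectors:
        return np.zeros(dim)   # fall back for out-of-vocabulary nodes
    return np.mean(vectors, axis=0)
```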
5.2 Experiment Setup

Experiment A

Vanilla GCN on full datasets: The aim is to train the GCN with access to the largest training data possible, but without any privacy mechanism.
Experiment B
Learning curves on the vanilla GCN: Evaluating the influence of less training data on performance, without privacy, allowing for a comparison of results with the DP settings below.
Experiment C
GCN with DP: We evaluate performance varying the amount of privacy budget, using the full datasets.
Experiment D
GCN with DP: Varying both data size and the amount of privacy budget. This allows us to see the effects on performance of both adding noise and reducing training data.
As the $\delta$ privacy parameter is typically kept 'cryptographically small' (Dwork and Roth, 2013) and, unlike the main privacy budget $\epsilon$, has a limited impact on accuracy (Abadi et al., 2016, Fig. 4), we fixed its value to $10^{-5}$ for all experiments. The clipping threshold is set to 1. We validated our PyTorch implementation by fully reproducing the MNIST results from Abadi et al. (2016). We perform all experiments five times with different random seeds and report the mean and standard deviation. Early stopping is determined using the validation set. See Appendix A for details on other hyperparameters.
[Table 2: Experiments A and C. Random and majority baselines (first two columns), full-dataset F1 scores without DP for SGD and Adam (third and fourth columns), and F1 scores with DP and varying $\epsilon$ (right-most three columns); rows: CiteSeer, Cora, PubMed, Pokec, Reddit.]
6 Results

Experiment A
Table 2 shows the results on the left-hand side under 'Non-DP'. When trained with SGD, both the Cora and CiteSeer datasets achieve fairly good results at 0.77 F1 score each, both having relatively small graphs. Much lower are the PubMed results at 0.49, possibly due to the dataset consisting of a much larger graph. Reddit shows higher performance at 0.68, which could in part be due to its input representations as GloVe embeddings, as opposed to binary-valued word vectors. Finally, Pokec shows the best result at 0.83, possibly because of more expressive representations (BERT) and a simpler task (binary classification). In comparison, and in line with previous research (Ruder, 2016), Adam outperforms SGD in all cases, with Pokec showing the smallest gap (0.826 and 0.832 for SGD and Adam, respectively).
Experiment B
Starting with the SGD results in Figure 1, we can notice three main patterns:

1. Clear improvement as training data increases (e.g., CiteSeer, with 0.70 F1 score at 10% vs. 0.77 at 100%).

2. The exact opposite pattern, with PubMed dropping from 0.57 at 10% to 0.49 at 100%, with a similar pattern for Pokec.

3. Early saturation of results for Reddit and Cora (at 20-30% for Reddit with approximately 0.69 F1 score, 50% for Cora at a score of 0.77), where results do not increase beyond a certain point.

Regarding points (2) and (3) above, we speculate that, with a larger training size, a vanilla GCN has a harder time learning the more complex input representations. In particular, for PubMed and Pokec, the increasing number of training nodes only partially increases the graph degree, so the model fails to learn expressive node representations when limited information from the node's neighborhood is available. By contrast, the Reddit graph degree grows much faster, thus advantaging GCNs.

Comparing each of these patterns for Adam, we see that for (1), datasets also improve (CiteSeer); (2) shows a very similar decrease in results for Pokec, but a mostly constant score throughout for PubMed (at ~0.80); while for (3), Adam shows continued improvement where SGD saturated for Cora and Reddit, suggesting that Adam allows the model to break through the learning bottleneck.
Experiment C
The results of Experiment C can be seen in Table 2, comparing different DP noise values as well as the results with and without DP. We note four main patterns in this experiment:

1. Interestingly, SGD-DP results stay the same, regardless of the noise value added.

2. Adam-DP results are far worse than SGD-DP, but increase with less privacy (less noise).

3. SGD-DP results almost always outperform the baselines (except for PubMed).

4. We see bigger drops in performance in the DP setting for datasets with simpler input representations.
SGD-DP vs. Adam-DP
Points (1) and (2) are both contrary to our expectations. One explanation for the former could be that gradients in SGD are already quite noisy, which may even help the model generalize, so the additional DP noise does not pose much difficulty beyond a certain drop in performance. Regarding Adam-DP, we see that results are far worse and do increase with less privacy (e.g., 0.51 with $\epsilon = 2$ vs. 0.76 F1 with $\epsilon = 137$ for Pokec). Several reasons can account for this, one being that Adam has more required hyperparameters, which could be sensitive with respect to the DP setting.

[Figure 1: Experiment B: F1 with respect to training data size (in %), without DP. One panel per dataset (CiteSeer, Cora, PubMed, Pokec, Reddit); curves show F1 for SGD and Adam.]

Differences in input features
For points (3) and (4) above, we see varying degrees of performance drops, depending on the dataset. Datasets with simpler input features can have results drop by more than half in comparison to the non-DP implementation, although they still outperform a majority baseline (e.g., for CiteSeer). An exception to this is PubMed, which has DP results slightly below the majority baseline; there, the drop in results from non-DP to DP is not as sharp, most probably explained by the fact that the non-DP model was not able to achieve good performance. Reddit shows a smaller drop from non-DP to DP and significantly outperforms the majority baseline. Finally, the best-performing SGD-DP model was Pokec, with a relatively small drop from the non-DP to the DP result. Hence, CiteSeer, Cora and PubMed, all using one-hot textual representations, show fairly low results under strict privacy budgets; Reddit (GloVe) is slightly better, while Pokec (BERT) is by far the best.

Experiment D
Finally, Figure 2 shows the DP results both for varying $\epsilon$ and with different training sub-samples (25%, 50%, 75% and the full 100%). Overall, some parallels and contrasts can be drawn with the learning curves from Experiment B. Datasets which behave similarly in the two experiments are CiteSeer and Cora, where the former improves with more training data and the latter saturates at a certain point. PubMed, Reddit and Pokec show a contrasting pattern, with both PubMed and Reddit staying about the same for all sub-samples apart from the 100% setting, with a slight drop for PubMed and a slight increase for Reddit. In Experiment B, both had more gradual learning curves, with a slow decline for PubMed and a quick plateau for Reddit. Similarly, Pokec here shows the best results with the full data, in contrast to the gradual decline in the non-private setting.

We can see that the patterns for learning curves are not the same in the DP and non-DP settings. While increasing training data may help to some extent, it does not act as a solution to the general drop in performance caused by adding DP noise.
Summary
The main observations of these experiments can be summarized as follows:

1. The network learns useful representations in the SGD-DP setting, outperforming the majority baselines.

2. SGD-DP is fairly robust to noise for these datasets and settings, even for privacy at $\epsilon = 2$.

3. While being superior in the non-private setting, Adam-DP does not perform very well.

4. More complex representations are better for the DP setting, showing a smaller performance drop.

5. Patterns for decreasing training size and increasing noise are not the same; thus, increasing training data does not necessarily mitigate the negative performance effects of DP.

We provide an additional error analysis in Appendix C, where we show that failed predictions in Reddit and CiteSeer are caused by 'hard cases', i.e., examples and classes that are consistently misclassified regardless of training data size or privacy budget. Moreover, Appendix D describes results on the MNIST dataset with varying lot sizes, showing how this hyperparameter affects model results.

[Figure 2: Experiment D: F1 with varying training data size (25%, 50%, 75%, 100%) with respect to privacy budget $\epsilon$, with DP. One panel per dataset (CiteSeer, Cora, PubMed, Pokec, Reddit); curves show SGD-DP and Adam-DP at each training data subset, with standard deviation for the 100% setting.]

7 Discussion

Issues of applying SGD-DP to GCNs
Splitting graph datasets consisting of one large graph into smaller mini-batches is not trivial. Special methods have been developed to specifically deal with such cases, such as sampling and aggregation (Hamilton et al., 2017), as well as pre-computing graph representations (Rossi et al., 2020). Such techniques would be necessary for adapting 'batches' and 'lots' from SGD-DP directly, but this comes with theoretical limitations. Namely, nodes in a graph are not necessarily i.i.d., being by definition related to each other, so there would be potential privacy leakage when performing computations on separate mini-batches of a graph. Further investigation into altering the SGD-DP algorithm and incorporating potential graph mini-batching methods is thus left for future work.

The benefits of our approach of applying SGD-DP and Adam-DP to the GCN case directly are that (1) it is practical, simply adding DP as a wrapper on top of the original model, and (2) it retains the original graph structure, thus not losing important information present in the original dataset and avoiding potential privacy leakage. The downside, however, is that the added noise has to be quite large in order to obtain reasonable $\epsilon, \delta$ values. As we have shown in our experiments, this method is indeed feasible in practice, given enough representational power in the input.
Hyperparameters
We use the same hyperparameters for both the DP and non-DP settings to enable a fair comparison. In an actual deployment, the DP version should have its own hyperparameters optimized, as optimal settings may vary due to the added noise. However, further tuning on the training data comes at an extra price, as it consumes the privacy budget.
Is our model 'bullet-proof' $(\epsilon, \delta)$-DP? While the SGD-DP algorithm does guarantee differential privacy by design, the 'devil is in the details'. Abadi et al. (2016) propose in their implementation that $q < \frac{1}{16\sigma}$, where $q = L/N$ ($L$ being the lot size, $N$ the size of the input dataset). In our case, due to the nature of large one-graph datasets, $q = 1$, since the lot size is equal to the size of the dataset. This detail is not, however, mentioned in (Abadi et al., 2016) directly, but rather in the comments of the original SGD-DP code. (As of 2020, there is only a fork of the original code available at https://tinyurl.com/y2mwmbm9.) Whether this minor implementation detail influences the overall privacy budget computation through the moments accountant remains an open theoretical question.
8 Conclusion

We have explored differentially private training for GCNs, showing the nature of the privacy-utility trade-off. While there is an expected drop in results for the SGD-DP models, they generally perform far better than the baselines, reaching up to 90% of their non-private variants in one setup. In fact, more complexity in the input representations seems to mitigate the negative performance effects of applying DP noise. By adapting global DP to a challenging class of deep learning networks, we are thus a step closer to flexible and effective privacy-preserving NLP.
Acknowledgments
This research work has been funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. Calculations were conducted on the Lichtenberg high-performance computer of the TU Darmstadt.
References
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308-318, Vienna, Austria. ACM.

Mohamed Abdalla, Moustafa Abdalla, Frank Rudzicz, and Graeme Hirst. 2020. Using word embeddings to improve the privacy of clinical notes. Journal of the American Medical Informatics Association, 27(6):901-907.

Maximin Coavoux, Shashi Narayan, and Shay B. Cohen. 2018. Privacy-preserving Neural Representations of Text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1-10, Brussels, Belgium. Association for Computational Linguistics.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question Answering by Reasoning Across Documents with Graph Convolutional Networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306-2317, Minneapolis, Minnesota. Association for Computational Linguistics.

Damien Desfontaines and Balázs Pejó. 2020. SoK: Differential privacies. Proceedings on Privacy Enhancing Technologies, 2020(2):288-313.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Cynthia Dwork and Aaron Roth. 2013. The Algorithmic Foundations of Differential Privacy. Foundations and Trends® in Theoretical Computer Science, 9(3-4):211-407.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An Automatic Citation Indexing System. In Proceedings of the Third ACM Conference on Digital Libraries, pages 89-98, Pittsburgh, PA, USA. ACM Press.

Will Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30, pages 1024-1034, Long Beach, CA, USA. Curran Associates, Inc.

Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. 2017. Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 603-618, Dallas, TX, USA. Association for Computing Machinery.

Justin Hsu, Marco Gaboardi, Andreas Haeberlen, Sanjeev Khanna, Arjun Narayan, Benjamin C. Pierce, and Aaron Roth. 2014. Differential Privacy: An Economic Method for Choosing Epsilon. In Proceedings of the 2014 IEEE 27th Computer Security Foundations Symposium, pages 398-410. IEEE.

Diederik Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego, CA.

Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of International Conference on Learning Representations (ICLR 2017), pages 1-14, Toulon, France.

Jaewoo Lee and Chris Clifton. 2011. How Much Is Enough? Choosing ε for Differential Privacy. In Proceedings of the 14th Information Security Conference (ISC 2011), pages 325-340, Xi'an, China. Springer Berlin / Heidelberg.

Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. Towards Robust and Privacy-preserving Text Representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 25-30, Melbourne, Australia. Association for Computational Linguistics.

Lingjuan Lyu, Yitong Li, Xuanli He, and Tong Xiao. 2020. Towards Differentially Private Text Representations. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1813-1816, Virtual conference. ACM.

Diego Marcheggiani, Jasmijn Bastings, and Ivan Titov. 2018. Exploiting Semantics in Neural Machine Translation with Graph Convolutional Networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 486-492, New Orleans, Louisiana. Association for Computational Linguistics.

Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. 2000. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval, 3(2):127-163.

H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. 2018. Learning Differentially Private Recurrent Language Models. In Proceedings of the 6th International Conference on Learning Representations, pages 1-14, Vancouver, BC, Canada.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, Doha, Qatar. Association for Computational Linguistics.

Bryan Perozzi and Steven Skiena. 2015. Exact Age Prediction in Social Networks. In Proceedings of the 24th International Conference on World Wide Web - WWW '15 Companion, pages 91-92, Florence, Italy. ACM Press.

Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. 2018. Semi-supervised User Geolocation via Graph Convolutional Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2009-2019, Melbourne, Australia. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3980-3990, Hong Kong, China. Association for Computational Linguistics.

Emanuele Rossi, Fabrizio Frasca, Ben Chamberlain, Davide Eynard, Michael Bronstein, and Federico Monti. 2020. SIGN: Scalable Inception Graph Neural Networks. arXiv preprint arXiv:2004.11198.

Sebastian Ruder. 2016. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747.

Devendra Singh Sachan, Yuhao Zhang, Peng Qi, and William Hamilton. 2020. Do Syntax Trees Help Pre-trained Transformers Extract Information? arXiv preprint arXiv:2008.09084.

Sina Sajadmanesh and Daniel Gatica-Perez. 2020. When Differential Privacy Meets Graph Neural Networks. arXiv preprint.

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective Classification in Network Data. AI Magazine, 29(3):93.

Congzheng Song and Vitaly Shmatikov. 2019. Auditing Data Provenance in Text-Generation Models. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 196-206.

Lubos Takac and Michal Zabovsky. 2012. Data Analysis in Public Social Networks. In International Scientific Conference and International Workshop Present Day Trends of Innovations, Łomża, Poland.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Discourse-Aware Neural Extractive Text Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5021-5031, Online. Association for Computational Linguistics.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful Are Graph Neural Networks? In Proceedings of International Conference on Learning Representations (ICLR 2019), pages 1-17, New Orleans, Louisiana.

Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2016. Revisiting Semi-Supervised Learning with Graph Embeddings. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 40-48, New York, NY, USA. PMLR.

Chen Zheng and Parisa Kordjamshidi. 2020. SRLGRN: Semantic Role Labeling Graph Reasoning Network. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8881-8891, Online. Association for Computational Linguistics.

Tianqing Zhu, Gang Li, Wanlei Zhou, and Philip S. Yu. 2017. Differential Privacy and Applications. Springer International Publishing.
A Hyperparameter Configuration
Our GCN model consists of 2 layers with ReLU non-linearity, a hidden size of 32, and dropout of 50%, trained with a learning rate of 0.01. We found that early stopping works better for the non-DP implementations, where we used a patience of 20 epochs. We did not use early stopping for the DP configuration, which shows better results without it. For all SGD runs we used a maximum of 2,000 epochs, while for Adam we used 500.

Due to the smaller number of epochs for Adam, it is possible to add less noise to achieve a lower $\epsilon$ value. Table 3 shows the mapping from the noise values used for each optimizer to the corresponding $\epsilon$.
[Table 3: $\epsilon$ values from experiments C and D, with the corresponding noise values added to the gradient for each optimizer (SGD vs. Adam).]
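For concreteness, a minimal PyTorch sketch of a model matching the configuration above (our own illustration; the paper's exact implementation may differ):

```python
import torch
import torch.nn as nn

class GCN(nn.Module):
    """Two-layer GCN with hidden size 32, ReLU, and 50% dropout,
    as described in this appendix."""
    def __init__(self, in_dim, num_classes, hidden=32):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, num_classes, bias=False)
        self.dropout = nn.Dropout(0.5)

    def forward(self, A_norm, X):
        # A_norm: pre-computed normalized adjacency D^(-1/2) Â D^(-1/2)
        h = torch.relu(A_norm @ self.w1(X))
        h = self.dropout(h)
        return A_norm @ self.w2(h)   # per-node logits

# model = GCN(in_dim=768, num_classes=2)
# opt = torch.optim.SGD(model.parameters(), lr=0.01)
```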
B Pokec Dataset Pre-processing
In order to prepare the binary classification task for the Pokec dataset, the original graph consisting of 1,632,803 nodes and 30,622,564 edges is sub-sampled to only include users that filled out the 'pets' column and had either cats or dogs as their preference, discarding entries with multiple preferences. For each pet type, users were reordered based on the percent completion of their profiles, such that users with most of the information were retained.

For each of the two classes, the top 10,000 users are taken, with the final graph consisting of 20,000 nodes and 32,782 edges. The data was split into 80% training, 10% validation and 10% test partitions.

The textual representations themselves were prepared with 'bert-multilingual-cased' from Huggingface transformers (https://github.com/huggingface/transformers), converting each attribute of user input in Slovak to BERT embeddings with the provided tokenizer for the same model. Embeddings are taken from the last hidden layer of the model, with dimension size 768. The average over all tokens is taken for a given column of user information, with 49 out of the 59 original columns retained. The remaining 10 are left out due to containing less relevant information for textual analysis, such as a user's last login time. To further simplify input representations for the model, the average is taken over all columns for a user, resulting in a final vector representation of dimension 768 for each node in the graph.
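A minimal sketch of this embedding step (our own reconstruction, assuming a recent version of the Huggingface transformers API and the standard 'bert-base-multilingual-cased' model id; the paper's script may differ in details):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

def attribute_embedding(text):
    """Average last-hidden-layer token embeddings (dim 768)
    for one user attribute written in Slovak."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)

def user_embedding(attributes):
    """Average the attribute vectors into one 768-dim node feature."""
    return torch.stack([attribute_embedding(a) for a in attributes]).mean(dim=0)
```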
C Are 'hard' examples consistent between private and non-private models?

To look further into the nature of errors for experiments B and C, we evaluate the 'hard cases'. These are cases for which the model makes an incorrect prediction with the maximum data size and the non-private implementation (results of experiment A). For experiment B, we take the errors for every setting of the experiment (10% training data, 20%, and so forth) and calculate the intersection of those errors with the 'hard cases' from the baseline implementation. This intersection is then normalized by the original number of hard cases to obtain a percentage value. The results for experiment B can be seen in Figure 3. We perform the same procedure for experiment C with different noise values, as seen in Figure 4. This provides a look into how the nature of errors differs among these different settings: whether they stay constant or become more random as we decrease the training size or increase DP noise.

Regarding the errors for experiment C, we can see a strong contrast between datasets such as Reddit and PubMed. For the latter, the more noise we add as $\epsilon$ decreases, the more random the errors become. In the case of Reddit, however, we see that even if we add more noise, the model still fails on the same hard cases. This means that there are hard aspects of the data that remain constant throughout. For instance, out of all the different classes, some may be particularly difficult for the model.

Although the raw data for Reddit does not have references to the original class names and input texts, we can still look into these classes numerically and see which ones are the most difficult in the confusion matrix. In the baseline non-DP model, we notice that many classes are consistently predicted incorrectly. For example, class 10 is predicted 93% of the time to be class 39. Class 18 is never predicted to be correct, but 95% of the time predicted to be class 9. Class 21 is predicted as class 16 83% of the time, and so forth. This model therefore mixes up many of these classes with considerable confidence.

Comparing this with the confusion matrix for the differentially private implementation at an $\epsilon$ value of 2, we can see that the results incorrectly predict these same classes as well, but the predictions are more spread out. Whereas the non-private model seems to be very certain in its incorrect prediction, mistaking one class for another, the private model is less certain and predicts a variety of incorrect classes for the target class.

[Figure 4: Hard cases analysis with DP: percentage of 'hard cases' among false predictions as a function of the privacy budget $\epsilon$.]

[Figure 3: Hard cases in the non-DP setting: percentage of 'hard cases' among false predictions as a function of the training data subset.]

For the analysis of the hard cases of experiment B in Figure 3, we can see some of the same patterns as above, for instance between PubMed and Reddit. Even if the training size is decreased, the model trained on Reddit still makes the same types of errors throughout. In contrast, as training size is decreased for PubMed, the model makes more and more random errors. The main difference between the hard cases of the two experiments is that, apart from Reddit, here we can see that for all other datasets the errors become more random as we decrease training size. For example, Cora goes down from 85% of hard cases at 90% training data to 74% at 10% training data. In the case of experiment C, they stay about the same; for instance, Cora retains just over 70% of the hard cases for all noise values. Overall, while we see some parallels between the hard cases for experiments B and C with respect to patterns of individual datasets such as Reddit and PubMed, the general trend of increasingly distinct errors seen for the majority of datasets with less training size in experiment B is not the same in experiment C, staying mostly constant across different noise values for the latter. The idea that the nature of errors for DP noise and for less training data is the same is thus not always the case, meaning that simply increasing training size may not necessarily mitigate the effects of DP noise.
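The overlap computation used throughout this appendix is a simple set intersection; a minimal sketch (our own, with hypothetical variable names):

```python
def hard_case_overlap(baseline_errors, setting_errors):
    """Fraction of the baseline 'hard cases' (errors of the full-data,
    non-private model) that are also mispredicted under a given
    setting (reduced training size or DP noise)."""
    hard_cases = set(baseline_errors)
    return len(hard_cases & set(setting_errors)) / len(hard_cases)

# e.g., with node ids mispredicted by each model:
# hard_case_overlap(errors_full_nonprivate, errors_dp_eps2)
```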
D MNIST Baselines
Table 4 shows results on the MNIST dataset with different lot sizes and noise values, keeping lot and batch sizes the same. We use a simple feed-forward neural network with a hidden size of 512, dropout of 50%, the SGD optimizer, and a maximum of 2,000 epochs with early stopping of patience 20, with other hyperparameters such as the learning rate being the same as above.

Lot Size  Noise  $\epsilon$  F1    Std.
600       4      1.26        0.90  0.02
6,000     4      4.24        0.84  0.01
60,000    4      15.13       0.45  0.04
60,000    50     0.98        0.39  0.15
60,000    100    0.50        0.10  0.01

Table 4: Results on the MNIST dataset with varying lot sizes and noise values.

We note that the configuration in the first row, with a lot size of 600 and noise 4, is the same as described by Abadi et al. (2016) in their application of the moments accountant, reaching the same $\epsilon$ value of 1.2586.

We can see some important patterns in these results that relate to our main results from the GCN experiments. Maintaining a constant noise of 4, as we increase the lot size, not only does the $\epsilon$ value increase, but we see a dramatic drop in F1 score, especially for a lot size of 60,000, which is the full training set. If we try to increase the noise and maintain that 60,000 lot size, while we are able to lower the $\epsilon$