A Quantitative Metric for Privacy Leakage in Federated Learning
Yong Liu∗†  Xinghua Zhu†  Jianzong Wang‡  Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.  ∗ National University of Singapore
∗ Work done as an intern at Ping An Technology (Shenzhen) Co., Ltd.
† These authors contributed equally to this work.
‡ Corresponding author: [email protected]
ABSTRACT
In a federated learning system, parameter gradients are shared among participants and the central modulator, while the original data never leave their protected source domain. However, the gradient itself might carry enough information for precise inference of the original data. By reporting their parameter gradients to the central server, client datasets are exposed to inference attacks from adversaries. In this paper, we propose a quantitative metric based on mutual information for clients to evaluate the potential risk of information leakage in their gradients. Mutual information has received increasing attention in the machine learning and data mining community over the past few years. However, existing mutual information estimation methods cannot handle high-dimensional variables. In this paper, we propose a novel method to approximate the mutual information between the high-dimensional gradients and batched input data. Experimental results show that the proposed metric reliably reflects the extent of information leakage in federated learning. In addition, using the proposed metric, we investigate the influential factors of the risk level. It is shown that the risk of information leakage is related to the status of the task model, as well as the inherent data distribution.
Index Terms — Security and Privacy, Federated Learning, Information Theory, Security Metric.
1. INTRODUCTION
In the contemporary AI industry, there is an ever-rising quest for organized data. In most industries, though, large amounts of data sit in isolated devices and institutes, wasted away under the restriction of security or privacy regulations. Federated learning (FL) [1, 2, 3] is devised to activate these isolated data sources. Distinguished from centralized machine learning, in an FL system data do not leave their protected source locations [4]. Instead, the model parameter gradients are reported to a central modulator for global model aggregation.

The FL framework is a promising ideology. Yet, when it comes to application, many data holders still lack incentives to participate in the FL process [5, 6]. One of the major concerns lies in the verifiability of information security. It has been proved that, by observing parameter gradients, an adversary can make precise inferences about the raw input data [7, 8]. Although provable encryption schemes, such as homomorphic encryption [9, 10], secure multi-party computation [11, 12, 13] and secret sharing, have been proposed to guarantee information security, their implementation and operation are too costly for practical applications. In terms of data obfuscation techniques, such as differential privacy [14, 15], engineers need to empirically balance the privacy level against the federated model performance. From a data holder's perspective, the level of security provided by such designs is too arbitrary to be convincing. Therefore, a quantifiable and universal metric is essential to promote incentives for data contribution in FL systems.

Some may advocate the added noise level in the differential privacy scheme as an indicator of the security degree. But there is no proof of a quantifiable relationship between the noise level and the information leakage risk. It remains an open problem to systematically define a sufficient noise level for a differentially private model [16].

A practical metric for information leakage risk should satisfy the following properties:
• Scale invariance - the quantitative value should have the same meaning under different circumstances.
• Interpretability - the metric should be in line with provable information bounds.
• Model-agnostic - the model itself can be distributed as a black box to clients.

In information theory, mutual information (MI) is a measure of the common information between two random variables. It provides a theoretically provable, universal and quantifiable metric for the amount of information leaked about one variable given the other. In the machine learning community, studies have been dedicated to estimating the MI between observable variables (model parameters, gradients, etc.) and the original data. However, existing methods were mostly based on discrete variables, or made risky assumptions about the probability density functions [17, 18].

For the FL framework, we aim to measure the risk of information leakage before reporting the calculated model parameters to the central server. Consequently, a client can make an informed decision on whether it is safe to upload its model parameters. The corresponding MI input and output variables are therefore the batched raw data and the computed gradients, respectively. Both variables easily have hundreds of dimensions, well beyond the scope of discussion in previous works. In this paper, we propose a novel hierarchical mutual information estimation method, H-MINE, for high-dimensional MI approximation. The proposed method is then applied to estimate the risk of information leakage in an FL client under various experimental settings. The main contributions of this paper are as follows:

• Propose a novel hierarchical model, H-MINE, for robust and efficient high-dimensional MI estimation.
• Apply H-MINE to the quantification of information leakage risk in FL systems.
• Verify the credibility of H-MINE through comparison with inference attack results.
• Analyse the inherent influential factors of information security in FL systems.

Fig. 1. Structure of an FL system with risk pre-alarm.
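For reference, the quantity estimated throughout this paper is the standard mutual information. With joint distribution $P_{XG}$ and marginals $P_X$ and $P_G$, it is given by the textbook identity (restated here for completeness, not a contribution of this paper):

$$I(X; G) \;=\; D_{KL}\big[P_{XG} \,\|\, P_X \otimes P_G\big] \;=\; \mathbb{E}_{P_{XG}}\!\left[\log \frac{dP_{XG}}{d(P_X \otimes P_G)}\right].$$

In particular, $I(X; G) = 0$ if and only if $X$ and $G$ are independent; a gradient carrying no mutual information with the data poses no inference risk.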
2. PROPOSED METHOD

2.1. Risk Pre-Alarm in FL Systems
In the federated stochastic gradient descent (FedSGD) algorithm, clients and the central server communicate iteratively to jointly optimize the global task model $f_\theta$ [1]. In a communication round $t$, a client $i$ obtains the current global model parameters $\theta^t$ from the central server. Client $i$ calculates the parameter gradient $G_i^t = \nabla_\theta f_{\theta^t}(X_B^i)$ with a batch sampled from its local dataset $\mathcal{D}_i$. That is, the batched data $X_B^i = \{x_1^i, x_2^i, \dots, x_B^i\} \subseteq \mathcal{D}_i$, where $B$ is the batch size. Client gradients $G_i^t$, $i = 1, \dots, N$, are sent to the central server, where they are aggregated to update the global model, such that $\theta^{t+1} \leftarrow \theta^t - \eta \sum_i G_i^t$, where $\eta$ is the learning rate.

In this paper, we assume a modest security environment, where all participants are honest-but-curious. Neither the clients nor the central server will try to poison the learning process, but they may probe the underlying raw data when they have access to other participants' public information $G_i^t$. Therefore, clients are susceptible to inference attacks from the central server or other intercepting adversaries. We propose to implement an information leakage risk estimator on the client side, so that a client can be alarmed of the potential risk before publishing its gradient information (see Fig. 1). In this paper, the information leakage risk is quantified by the mutual information between the batched raw data $X_B^i$ and the gradient $G_i^t$, i.e., $I(X_B^i; G_i^t)$.

Without loss of generality, the data points in a client dataset are assumed to be independently and identically distributed (IID), i.e., $x_j^i \sim X^i$, $j = 1, \dots, |\mathcal{D}_i|$. It follows that $G_i^t(X_B^i) \sim G_i^t$.

As discussed in previous sections, MI estimation is non-trivial, especially for high-dimensional random variables. Belghazi et al. proposed to solve this problem with a neural network [17]. In fact, the MI between two random variables, $I(X; G)$, is equivalent to the Kullback-Leibler divergence $D_{KL}[P_{XG} \| P_X \otimes P_G]$. The Donsker-Varadhan representation [19] of the KL divergence gives a lower bound on $I(X_B; G^t)$:

$$I(X_B; G^t) = D_{KL}\big[P_{X_B G^t} \,\|\, P_{X_B} \otimes P_{G^t}\big] \;\ge\; \sup_{T \in \mathcal{T}} \mathbb{E}_{P_{X_B G^t}}[T] - \log\big(\mathbb{E}_{P_{X_B} \otimes P_{G^t}}[e^T]\big),$$

where $\mathcal{T}$ can be any class of functions $T : (X^i, G_i^t) \to \mathbb{R}$ satisfying the integrability constraints of the Donsker-Varadhan theorem. As proposed by Belghazi et al., choosing a neural network as $T$ transforms the MI estimation problem into a network optimization one. This transformation exploits the flexibility of neural networks in approximating arbitrarily complex functions, and also benefits from the well-developed tools for network optimization. The structure of the statistic network $T_\phi$ can be designed to accommodate different data types. Its parameters $\phi$ are optimized with iterative sampling from the joint and marginal distributions of $X_B$ and $G^t$.

Numerous architectures of the statistic network have been experimented with in the literature. They performed reasonably well as a constraining factor in tasks like generative adversarial networks [20], representation learning [21], and so on. However, when we look into the accuracy of the estimated MI itself, the previously proposed statistic networks were not so successful: when the variable dimensions increased, conventional statistic networks failed to converge, or converged at insignificant values.
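To make the optimization concrete, here is a minimal sketch (not the authors' code) of a conventional MINE-style estimator of the Donsker-Varadhan bound in PyTorch; the flattening of $X_B$ and $G^t$ into vectors `x` and `g`, and the layer sizes, are our assumptions:

```python
import math
import torch
import torch.nn as nn

class StatisticNet(nn.Module):
    """Conventional statistic network T_phi: concatenate (X_B, G) and map to a scalar."""
    def __init__(self, x_dim, g_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + g_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, g):
        return self.net(torch.cat([x, g], dim=-1))  # (S, 1)

def dv_lower_bound(T, x, g):
    """Donsker-Varadhan estimate E_joint[T] - log E_marginal[e^T].

    x: (S, x_dim) and g: (S, g_dim) are paired samples from the joint;
    shuffling g along the sample axis simulates the product of marginals.
    """
    g_marginal = g[torch.randperm(g.size(0))]
    joint_term = T(x, g).mean()
    log_mean_exp = torch.logsumexp(T(x, g_marginal).squeeze(-1), dim=0) - math.log(g.size(0))
    return joint_term - log_mean_exp
```

Maximizing this bound over $\phi$ with a gradient optimizer yields the MI estimate; it is this flat architecture that fails once the input dimension grows with the batch size $B$.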
2.2. H-MINE: the Hierarchical Statistic Network

In this section, a novel statistic network structure is proposed that handles the high-dimensional input and output variables in the FL clients. The structure of the proposed statistic network is illustrated in Fig. 2. In particular, we make use of the fact that the data points $x \in \mathcal{D}_i$ are IID. That is to say, the input variable of our statistic network, $X_B$, can be divided into $B$ independent random variables. In the proposed statistic network, each $x_m$ of the batch $X_B$ and the gradient $G^t$ are grouped into a block $O_m = \{x_m, G^t\}$, $m = 1, \dots, B$. Since the $O_m$'s are identically distributed, a shared sub-network, called BlockModel, is used to map them into $B$ embedding vectors $h_m$. The embedding vectors are then concatenated as input to the subsequent MixModel, which outputs a single scalar in the final layer. This statistic network structure is termed hierarchical MINE (H-MINE) for brevity.

Fig. 2. H-MINE: the hierarchical statistic network.

As opposed to increasing the depth of the statistic network, H-MINE utilizes a shared sub-network to extract information from IID components. The hierarchical structure of the proposed statistic network effectively reduces the dimensionality of the input variables. The number of parameters in the input layer is also reduced by a factor of $B$, compared to a naive fully connected layer. The complete procedure for MI approximation using H-MINE is elaborated in Algorithm 1; a sketch of the network itself follows below.
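A minimal PyTorch sketch of this hierarchy, following Fig. 2 and Algorithm 1 (the default layer sizes mirror Section 3.1; feeding $G$ to the MixModel alongside the embeddings follows Algorithm 1, and the tensor layout is our reading rather than the authors' implementation):

```python
import torch
import torch.nn as nn

class HMine(nn.Module):
    """Hierarchical statistic network: a shared BlockModel embeds each IID block
    O_m = (x_m, G); the MixModel maps the concatenated embeddings (and G) to a scalar."""
    def __init__(self, x_dim, g_dim, batch_b, embed_dim=5, hidden=200, mix_hidden=500):
        super().__init__()
        self.block_model = nn.Sequential(   # shared across all B blocks
            nn.Linear(x_dim + g_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )
        self.mix_model = nn.Sequential(
            nn.Linear(batch_b * embed_dim + g_dim, mix_hidden), nn.ReLU(),
            nn.Linear(mix_hidden, 1),
        )

    def forward(self, x_batch, g):
        # x_batch: (S, B, x_dim) batched raw data; g: (S, g_dim) flattened gradient
        s, b, _ = x_batch.shape
        g_rep = g.unsqueeze(1).expand(s, b, g.size(-1))             # pair G with every x_m
        h = self.block_model(torch.cat([x_batch, g_rep], dim=-1))   # (S, B, embed_dim)
        return self.mix_model(torch.cat([h.flatten(1), g], dim=-1)) # (S, 1)
```

Note the shared BlockModel's input layer sees only $x_{dim} + g_{dim}$ features, roughly $B$ times fewer parameters than a flat layer over the full $(B \cdot x_{dim} + g_{dim})$-dimensional input.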
3. RESULTS

3.1. Experiment Settings
In this section, we evaluate the performance of our proposed methods on the Adult dataset [22]. The task of the Adult dataset is to predict whether an individual's annual income exceeds $50K. It contains 14 private attributes about each individual, including education level, age, gender, occupation, and so on. The dataset contains 32,560 samples (7,841 positive and 24,719 negative). Missing values are replaced with the medians in the dataset. A logistic regression is used as the task classification model.

In H-MINE, the BlockModel is a multi-layer perceptron (MLP) that contains 3 fully connected layers with 200, 200 and 5 neurons, respectively. The MixModel is a two-layer MLP with 500 and 1 neurons, respectively. We use the Adam [23] optimizer to train H-MINE, with learning rate $\alpha_\phi = 5 \times$ −. As a baseline correlation metric, the sum of all elements of the covariance matrix, $C(X_B, G^t) = \mathrm{sum}(\mathrm{cov}(X_B, G^t))$, is also included in the subsequent figures.

Algorithm 1: H-MINE.
Inputs: batch size $B$, sample size $S$, task model status $\theta^t$.
Initialize: H-MINE parameters $\phi$.
while not converged do
  for $k = 1, \dots, S$ do
    Generate batch samples $X_B \leftarrow [x_1, \dots, x_B]$ and $\hat{X}_B \leftarrow [\hat{x}_1, \dots, \hat{x}_B]$
    Calculate the corresponding gradients: $G = \nabla_\theta f_{\theta^t}(X_B)$, $\hat{G} = \nabla_\theta f_{\theta^t}(\hat{X}_B)$
    for $j = 1, \dots, B$ do
      $h_j = \mathrm{BlockModel}_\phi(x_j, G)$, $\hat{h}_j = \mathrm{BlockModel}_\phi(x_j, \hat{G})$
    end
    $H = \{h_1, h_2, \dots, h_B\}$, $\hat{H} = \{\hat{h}_1, \hat{h}_2, \dots, \hat{h}_B\}$
    $v_k = T_\phi(X_B, G) = \mathrm{MixModel}_\phi(H, G)$, $\hat{v}_k = T_\phi(X_B, \hat{G}) = \mathrm{MixModel}_\phi(\hat{H}, \hat{G})$
  end
  Evaluate the lower bound: $V(\phi) = \frac{1}{S}\sum_{k=1}^{S} v_k - \log\big(\frac{1}{S}\sum_{k=1}^{S} e^{\hat{v}_k}\big)$
  Update the H-MINE parameters: $\phi \leftarrow \phi + \nabla_\phi V(\phi)$
end
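One iteration of Algorithm 1 can be rendered compactly as follows; this is a sketch assuming a statistic network such as the HMine module sketched in Section 2.2, plus two hypothetical helpers: `sample_batches`, which draws two independent data batches $X_B$ and $\hat{X}_B$, and `grad_fn`, which computes the flattened task-model gradient for each batch.

```python
import math
import torch

def hmine_step(statistic_net, optimizer, sample_batches, grad_fn):
    """One iteration of Algorithm 1: ascend the Donsker-Varadhan lower bound V(phi)."""
    x, x_hat = sample_batches()             # two independent batches, (S, B, x_dim) each
    g, g_hat = grad_fn(x), grad_fn(x_hat)   # flattened gradients, (S, g_dim) each
    v = statistic_net(x, g)                 # joint samples: X_B with its own gradient
    v_hat = statistic_net(x, g_hat)         # marginal samples: X_B with an independent gradient
    lower_bound = v.mean() - (torch.logsumexp(v_hat.squeeze(-1), dim=0)
                              - math.log(v_hat.size(0)))
    optimizer.zero_grad()
    (-lower_bound).backward()               # minimize -V(phi), i.e., gradient ascent on V
    optimizer.step()
    return lower_bound.item()
```

Here $\hat{v}_k$ pairs $X_B$ with the gradient of an independently drawn batch, which is how Algorithm 1 draws samples from the product of marginals.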
3.2. Convergence of H-MINE

In our proposed method, the MI is approximated via optimization of a statistic network. The convergence curves of the proposed statistic network, H-MINE, are presented in Fig. 3. The curves of a conventional statistic network, comprising a three-layer MLP [17], are depicted for comparison. In these experiments, the number of dimensions in the input variable multiplies as the batch size $B$ increases. The conventional statistic network diverges in all attempted configurations, owing to its inability to handle high-dimensional variables. In contrast, H-MINE converges steadily even with the increased batch size.

Fig. 3. Comparison of convergence speed of H-MINE and a conventional statistic network. Gradients are evaluated at epoch = 1. Batch sizes are 1 and 3, respectively.

3.3. Validation with DeepLeakage

In this section, we use the DeepLeakage model [7] to validate the accuracy of our estimated mutual information $I(X_B; G^t)$. In theory, smaller mutual information between the two variables leads to increased difficulty of the inference attack. Therefore, in the following experiments, we compare the inference error of the recovered data with our estimated $I(X_B; G^t)$ in various circumstances, and verify their correlation.
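For intuition about the attack used in this validation, the following is a minimal gradient-matching sketch in the spirit of DeepLeakage [7]: a dummy input is optimized until its gradient matches the intercepted one. The L-BFGS optimizer and the fixed dummy labels are simplifications of ours; `model`, `loss_fn` and the observed gradients `g_true` are assumed given.

```python
import torch

def gradient_inversion(model, loss_fn, g_true, x_shape, y_dummy, steps=300):
    """Recover input data from an observed gradient by gradient matching."""
    x_dummy = torch.randn(x_shape, requires_grad=True)  # random initialization
    opt = torch.optim.LBFGS([x_dummy])

    def closure():
        opt.zero_grad()
        loss = loss_fn(model(x_dummy), y_dummy)
        g_dummy = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        match = sum(((gd - gt) ** 2).sum() for gd, gt in zip(g_dummy, g_true))
        match.backward()                                # gradient w.r.t. x_dummy
        return match

    for _ in range(steps):
        opt.step(closure)
    return x_dummy.detach()
```

Repeating this from different random initializations yields the inference results $\hat{X}_B^{(k)}$ used in the error metric below.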
Fig. 4. Validation with DeepLeakage. Gradients are evaluated at epoch = 3. For the added-noise experiment, B = 3.

The DeepLeakage model in our experiments is a three-layer MLP that contains 100, 100 and 2 neurons, respectively. For each configuration, DeepLeakage is performed 5 times with random initialization, producing inference results $\hat{X}_B^{(k)}$, $k = 1, \dots, 5$. The inference error of DeepLeakage is defined as $\epsilon = \mathrm{var}(\hat{X}_B^{(k)})$.

Zhu et al. [7] suggested two means to defend against inference attacks, namely, increasing the batch size and adding noise to the gradients. Fig. 4 demonstrates the change of $I(X_B; G^t)$ and $\epsilon$ versus the batch size (left) and the noise added to the gradients (right). As shown, the estimated MI decreases as the batch size and noise level increase. At the same time, the inference error increases quadratically. The results are in line with our hypothesis, and also consistent with the results in [7]. In other words, our estimated MI faithfully reflects the risk of inference attacks.

3.4. Inherent Factor Analysis

In this section, we analyze the factors affecting the risk of information leakage that are inherent in the data or the training process themselves.

Firstly, FedSGD is an iterative process: clients calculate their gradients over the evolving parameter values. At different timesteps, the gradients may carry different amounts of information about the batched data, due to their interaction with the current model status. Fig. 5(a) presents $I(X_B; G^t)$ versus the iteration step $t$. The MI is largest at the first epoch, and gradually decreases over the optimization process. At the initial steps, every data point is new to the model, so the gradients carry strong information about how the model fails to fit the observed data. When the model starts to gain a reasonable form, the amount of misfit shrinks, revealing less about the observed data. As the task model converges, the MI between the gradients and the raw data also stabilizes at a certain level.

Fig. 5. Inherent factor analysis. (a) Leakage risk at different timesteps, batch size B = 3. (b) Leakage risk with unbalanced datasets, epoch = 3, B = 3.

Secondly, it is speculated that the data distribution itself also affects the risk of information leakage [24]. In this experiment, we simulate client datasets with different ratios of positive and negative entries, and investigate the corresponding variation of $I(X_B; G^t)$. As shown in Fig. 5(b), the more unbalanced the dataset is, the more information about the batched data is preserved in the gradients. The results verify that gradients computed on unbalanced data distributions are more vulnerable to inference attacks; a resampling sketch for this simulation follows below.
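The unbalanced-dataset simulation can be reproduced by straightforward resampling; this is a sketch under our own assumptions (the ratio handling and sampling with replacement are not the paper's stated procedure):

```python
import numpy as np

def simulate_client_dataset(x, y, pos_neg_ratio, size, seed=0):
    """Draw a client dataset with a prescribed positive:negative ratio.

    x, y: full feature matrix and binary labels (e.g., the Adult dataset)
    pos_neg_ratio: desired #positive / #negative, e.g., 0.25, 1.0, 4.0
    """
    rng = np.random.default_rng(seed)
    n_pos = round(size * pos_neg_ratio / (1.0 + pos_neg_ratio))
    n_neg = size - n_pos
    pos_idx = rng.choice(np.where(y == 1)[0], n_pos, replace=True)
    neg_idx = rng.choice(np.where(y == 0)[0], n_neg, replace=True)
    idx = rng.permutation(np.concatenate([pos_idx, neg_idx]))
    return x[idx], y[idx]
```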
4. CONCLUSIONS
In this paper, we proposed a novel security metric for FL clients based on high-dimensional mutual information estimation. The proposed algorithm, H-MINE, can effectively and faithfully approximate the mutual information between model gradients and the original batched data. It therefore provides a quantitative metric for potential risk alarms on the client side. Analysis of the inherent factors of information leakage risk suggests that data holders should be cautious with initial training steps and unbalanced data distributions.
5. ACKNOWLEDGEMENTS
This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB1003500, No. 2018YFB0204400 and No. 2017YFB1401202. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.

REFERENCES

[1] Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” 2016.
[2] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
[3] Anxun He, Jianzong Wang, Zhangcheng Huang, and Jing Xiao, “FedSmart: An auto updating federated learning optimization mechanism,” in APWeb and WAIM Joint International Conference on Web and Big Data. Springer, 2020, pp. 716–724.
[4] Lingwei Kong, Hengtao Tao, Jianzong Wang, et al., “Network coding for federated learning systems,” in ICONIP. Springer, 2020.
[5] Xinghua Zhu, Jianzong Wang, Zhenhou Hong, and Jing Xiao, “Empirical studies of institutional federated learning for natural language processing,” in Findings of EMNLP. ACL, 2020.
[6] Xinghua Zhu, Jianzong Wang, Zhenhou Hong, Tian Xia, and Jing Xiao, “Federated learning of unsegmented Chinese text recognition model,” IEEE, 2019, pp. 1341–1345.
[7] Ligeng Zhu, Zhijian Liu, and Song Han, “Deep leakage from gradients,” in Advances in Neural Information Processing Systems, 2019, pp. 14774–14784.
[8] Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller, “Inverting gradients - how easy is it to break privacy in federated learning?,” arXiv preprint arXiv:2003.14053, 2020.
[9] Ronald L. Rivest, Len Adleman, Michael L. Dertouzos, et al., “On data banks and privacy homomorphisms,” Foundations of Secure Computation, vol. 4, no. 11, pp. 169–180, 1978.
[10] Abbas Acar, Hidayet Aksu, A. Selcuk Uluagac, and Mauro Conti, “A survey on homomorphic encryption schemes: Theory and implementation,” ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 1–35, 2018.
[11] Adi Shamir, “How to share a secret,” Communications of the ACM, vol. 22, no. 11, pp. 612–613, 1979.
[12] Oded Goldreich, “Secure multi-party computation,” Manuscript. Preliminary version, vol. 78, 1998.
[13] Payman Mohassel and Peter Rindal, “ABY3: A mixed protocol framework for machine learning,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018, pp. 35–52.
[14] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference. Springer, 2006, pp. 265–284.
[15] Cynthia Dwork, Aaron Roth, et al., “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
[16] Jaewoo Lee and Chris Clifton, “How much is enough? Choosing ε for differential privacy,” Lecture Notes in Computer Science, vol. 7001 LNCS, pp. 325–340, 2011.
[17] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R. Devon Hjelm, “MINE: Mutual information neural estimation,” arXiv preprint arXiv:1801.04062, 2018.
[18] Morteza Noshad, Yu Zeng, and Alfred O. Hero, “Scalable mutual information estimation using dependence graphs,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2962–2966.
[19] M. D. Donsker and S. R. S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time, III,” Communications on Pure and Applied Mathematics, vol. 29, no. 4, pp. 389–461, July 1976.
[20] Thang Doan, Joao Monteiro, Isabela Albuquerque, Bogdan Mazoure, Audrey Durand, Joelle Pineau, and R. Devon Hjelm, “On-line adaptative curriculum learning for GANs,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 3470–3477.
[21] Liangjian Wen, Yiji Zhou, Lirong He, Mingyuan Zhou, and Zenglin Xu, “Mutual information gradient estimation for representation learning,” arXiv preprint arXiv:2005.01123, 2020.
[22] Dheeru Dua and Casey Graff, “UCI machine learning repository,” 2017.
[23] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[24] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown, “Federated visual classification with real-world data distribution,” arXiv preprint arXiv:2003.08082, 2020.