Computation-efficient Deep Model Training for Ciphertext-based Cross-silo Federated Learning

Xue Yang, Yan Feng, Weijun Fang, Jun Shao, Xiaohu Tang, Member, IEEE, Shu-Tao Xia, Member, IEEE, Rongxing Lu, Fellow, IEEE

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, JANUARY 2021
Abstract—Although cross-silo federated learning improves privacy of training data by exchanging model updates rather than raw data, sharing updates (e.g., local gradients or parameters) may still involve risks. To ensure no updates are revealed to the server, industrial FL schemes allow clients (e.g., financial or medical) to mask local gradients by homomorphic encryption (HE). In this case, the server cannot obtain the updates, but the curious clients can obtain this information to infer other clients' private data. To alleviate this situation, the most direct idea is to let clients train deep models on the encrypted domain. Unfortunately, the resulting solution is of poor accuracy and high cost, since the existing advanced HE is incompatible with non-linear activation functions and inefficient in terms of computational cost. In this paper, we propose a computation-efficient deep model training scheme for ciphertext-based cross-silo federated learning to comprehensively guarantee privacy. First, we customize a novel one-time-pad-style model encryption method that directly supports non-linear activation functions and decimal arithmetic operations on the encrypted domain. Then, we design a hybrid privacy-preserving scheme by combining our model encryption method with secret sharing techniques to keep updates secret from the clients and prevent the server from obtaining the local gradients of each client. Extensive experiments demonstrate that for both regression and classification tasks, our scheme achieves the same accuracy as non-private approaches and outperforms the state-of-the-art HE-based scheme. Besides, the training time of our scheme is almost the same as that of non-private approaches and much lower than that of HE-based schemes. Our scheme trains a deep neural network on the MNIST dataset in less than one hour.

Index Terms—Federated learning, homomorphic encryption, deep neural networks, semantic security.
I. INTRODUCTION

With the continued emergence of privacy breaches and data abuse [1], data privacy and security issues gradually impede the flourishing development of deep learning [2]. In order to mitigate such privacy concerns, federated learning (FL) [3] has recently been presented as an appealing solution. As illustrated in Fig. 1, FL is essentially a distributed learning framework where many clients collaboratively train a shared global model under the orchestration of a central server, while ensuring that each client's raw data is stored locally and not exchanged or transferred. According to application scenarios [4], FL can be divided into cross-device FL and cross-silo FL. Concretely, the clients in the cross-device FL [5], [6] are a large number of mobile or IoT
X. Yang, Y. Feng, W. Fang and S. Xia are with Tsinghua Shenzhen International Graduate School, Tsinghua University, and also with the PCL Research Center of Networks and Communications, Peng Cheng Laboratory, Shenzhen, 518055, China. J. Shao is with the School of Computer and Information Engineering, Zhejiang Gongshang University, Zhejiang, 310018, China. X. Tang is with the Information Security and National Computing Grid Laboratory, Southwest Jiaotong University, Chengdu, 610031, China. R. Lu is with the Canadian Institute of Cybersecurity, Faculty of Computer Science, University of New Brunswick, Fredericton, Canada, E3B 5A3. Corresponding author: W. Fang (e-mail: [email protected]).

Fig. 1: The overall framework of federated learning. During the phase of deep model training, clients run the stochastic gradient descent (SGD) algorithm on local training data to compute local gradients and upload them. After receiving local gradients from clients, the server aggregates them, updates the current global model parameters, and then distributes the updated global model parameters to clients for the next iteration.

devices with unreliable communications and limited computing resources, while the clients in the cross-silo FL [7], [8] are a small number of organizations (e.g., medical or financial institutions) with reliable communications and relatively abundant computing power. In this paper, we focus on cross-silo FL, which has significantly more stringent requirements on privacy and learning performance compared with the cross-device FL [4], [9] (i.e., the privacy guarantee should not be ensured by sacrificing learning accuracy, especially for medical or financial institutions). As we all know, data is huge digital wealth for organizations, and thus the adversary, including the curious server and curious clients, wants to obtain as much training data as possible from shared and public information.
Although cross-silo FL improves the privacy of training data by exchanging model updates (i.e., local gradients or current model parameters) rather than raw data, sharing model updates is still a well-known privacy risk [10], [11]. To overcome such privacy leakage, homomorphic encryption (HE) [12] is particularly attractive in the cross-silo FL [9], [13], [14], as it guarantees privacy without sacrificing accuracy. Unfortunately, these HE-based solutions cannot provide a comprehensive privacy guarantee for the cross-silo FL (i.e., they prevent the curious server from obtaining the training data of each client but cannot prevent curious clients from obtaining the training data of others). More specifically, HE can support the curious server performing addition and multiplication operations on encrypted local gradients to complete the aggregation and update processes, which prevents the curious server from obtaining plain local gradients. However, these HE-based solutions do not consider supporting curious clients to train on encrypted global model parameters, and thus cannot prevent curious clients from
obtaining the training data of others. In fact, HE cannot elegantly support deep learning training due to two issues [4], [15]: 1) it cannot handle non-linear activation functions well on the encrypted domain, and 2) training computational costs are too high to be applied in practice. Thus, these HE-based solutions hardly consider preventing curious clients from obtaining plain model parameters.

Actually, how to efficiently allow clients to train deep neural networks locally on encrypted models is still a big challenge [4], [15]. Therefore, in this paper, we aim to efficiently address this challenge and propose a computation-efficient deep model training scheme for ciphertext-based cross-silo federated learning under the semi-trusted security model, which provides a comprehensive privacy guarantee and high learning accuracy at the same time. The main novelty and contributions are four-fold:

• The most attractive novelty is that for the widely-adopted deep learning framework with the
ReLU non-linear activation and mean squared error loss function, we customize a novel one-time-pad-style model encryption method to support each client efficiently executing training on the encrypted global model. Specifically, our method allows the server to select different private keys to encrypt the global model in different iterations during training, and thus each private key is used only once. Unlike advanced HE techniques, our method naturally supports non-linear activation functions and decimal arithmetic operations on the encrypted domain.

• In order to comprehensively guarantee privacy, we further design a hybrid privacy-preserving scheme by combining our model encryption method with the secret sharing technique. Particularly, our model encryption method implemented on the server side ensures that clients cannot obtain local gradients or the global model, and the secret sharing technique initiated on the client side ensures that the server cannot obtain the local gradients of each client.

• In order to evaluate our scheme in terms of privacy-preservation ability, we give a detailed security analysis. In particular, we apply the provable security technique to formally prove that, for curious clients, our model encryption method is semantically secure under the chosen plaintext attack, and meanwhile we give a detailed mathematical argument to formally prove that clients cannot obtain true predictions. Thus, it demonstrates that each client cannot obtain any plaintext information. Besides, we formally prove that the server cannot obtain the local gradients of each client.

• In order to evaluate our method in terms of learning accuracy and efficiency, we conduct extensive experiments on several large-scale datasets. Empirical results show that our scheme achieves better model performance while being faster than the state-of-the-art HE-based approach.

The remainder of this paper is organized as follows: In Section II, we outline preliminaries. In Section III, we state models and design goals.
In Section IV, we present our scheme in detail, followed by a simple example in Section V. Security analysis and performance evaluation are shown in Sections VII and VI, respectively. We discuss related works in Section VIII. Finally, we draw conclusions in Section IX.

II. PRELIMINARIES
In this section, we outline concepts of the cross-silo FL and the Hadamard product, which will serve as the basis of our scheme.
Fig. 2: The structure of a DNN with the input layer, L − 1 hidden layers with ReLU non-linear activation, and the output layer. The l-th layer for l ∈ {1, 2, . . . , L} consists of n_l neurons connected by an n_l × n_{l−1}-dimensional parameter matrix W^(l) ∈ R^{n_l × n_{l−1}}.

A. The Cross-Silo FL
The cross-silo FL [4] is essentially a distributed machine learning framework where a small number of organizations (e.g., medical or financial) with stable communication channels collaboratively train a shared global model with high accuracy. In this paper, we focus on training Deep Neural Networks (DNNs), which have been widely adopted as the infrastructure of deep learning models to solve many complex tasks [16].

Formally, consider a cross-silo FL with K clients, denoted as C = {C_1, C_2, . . . , C_K}, where each client C_k has the local training dataset D_k. The cross-silo FL aims at solving an optimization problem to obtain the optimal global parameter W [3], [17]:

    min_W F(W) ≜ Σ_{k=1}^{K} (|D_k| / |D|) F(W, D_k),    (1)

where |D_k| is the sample size of the client C_k and |D| = Σ_{k=1}^{K} |D_k|. F(W, D_k) is the local objective of C_k defined by

    F(W, D_k) ≜ (1 / |D_k|) Σ_{(x_i, ȳ_i) ∈ D_k} L(W; (x_i, ȳ_i)),    (2)

where (x_i, ȳ_i) is a training sample, x_i and ȳ_i are the corresponding feature vector and the ground-truth label vector, respectively, and L(·;·) is a user-specified loss function. In this paper, we adopt the widely used Mean Squared Error (MSE) loss

    L(W; (x_i, ȳ_i)) = (1/2) ‖y_i − ȳ_i‖²,    (3)

where y_i = f(W, x_i) is the prediction vector of the sample (x_i, ȳ_i), and ‖·‖ is the l₂ norm of a vector.

In order to find the optimal parameters for Eq. (1), the server and all clients collaboratively perform the DNN training. Before starting training, the server and all clients agree on the DNN structure in advance, which is shown in Fig. 2. Then, the server initializes the global model parameters W = {W^(l)}_{l=1}^{L} and distributes them to all clients for training. As shown in Fig. 1, the training process of the DNN in the cross-silo FL mainly includes two steps: local model training and global model update.
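The weighted objective of Eqs. (1)-(3) can be illustrated with a small NumPy sketch; a hypothetical one-layer linear model stands in for f(W, x), and the sizes and data are illustrative. Because the |D_k|/|D| weights cancel the per-client averaging, the federated objective equals the MSE averaged over the pooled samples.

```python
import numpy as np

rng = np.random.default_rng(8)

def mse_loss(W, x, y_bar):
    """Eq. (3) for a stand-in one-layer linear model y = W x."""
    return 0.5 * np.sum((W @ x - y_bar) ** 2)

W = rng.standard_normal((2, 4))
# three clients' local datasets D_1, D_2, D_3 with |D_k| = 30, 50, 20
datasets = [[(rng.standard_normal(4), rng.standard_normal(2)) for _ in range(n)]
            for n in (30, 50, 20)]

def local_objective(W, D_k):
    """Local objective F(W, D_k) of Eq. (2)."""
    return sum(mse_loss(W, x, y) for x, y in D_k) / len(D_k)

# global objective F(W) of Eq. (1)
total = sum(len(D) for D in datasets)
F = sum(len(D) / total * local_objective(W, D) for D in datasets)

# the |D_k| weights telescope: F equals the loss averaged over all pooled samples
pooled = sum(mse_loss(W, x, y) for D in datasets for x, y in D) / total
assert np.isclose(F, pooled)
```
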
1) Local Model Training:
Once receiving W = {W^(l)}_{l=1}^{L}, each client C_k computes local gradients by running the SGD with the local training dataset D_k. Generally, the training process with the SGD mainly includes two steps: forward propagation and backward propagation. Note that since the operation of the SGD for every training sample is the same, we simply take the sample (x_i, ȳ_i) as a representative and ignore the subscript i of (x_i, ȳ_i) in the following description unless otherwise specified.

• Forward propagation is the process of obtaining the prediction through sequential calculation under W after the input samples enter the network. Specifically, for the sample feature vector x, the output vector y^(l) = (y^(l)_1, y^(l)_2, . . . , y^(l)_{n_l}) ∈ R^{n_l} of the l-th layer is computed as

    y^(l) = ReLU(W^(l) y^(l−1)),  for 1 ≤ l ≤ L − 1,
    y^(L) = W^(L) y^(L−1),  for l = L,    (4)

where y^(0) = x and ReLU(·) is the widely used non-linear activation function satisfying that, for any input x,

    y = ReLU(x) = x if x > 0, and 0 if x ≤ 0.

• Backward propagation starts from the loss value and updates the parameter values of the network in reverse, so that the loss value of the updated network decreases. We first show the gradient calculation between adjacent layers, and then give the gradient calculation from the loss to any parameter.

(1) Based on Eq. (4), we obtain the prediction y^(L) after the forward propagation. Thus, with Eq. (3), we can compute

    ∂L(W; (x, ȳ)) / ∂y^(L)_i = y^(L)_i − ȳ_i,  for i = 1, 2, . . . , n_L.    (5)

(2) Then, we compute the gradients between adjacent layers. Specifically, based on Eq. (4), the prediction of the output layer is y^(L) = W^(L) y^(L−1), which is not affected by ReLU, and thus the gradients of y^(L) with respect to W^(L) and y^(L−1) are

    ∂y^(L)_i / ∂w^(L)_ij = y^(L−1)_j  and  ∂y^(L)_i / ∂y^(L−1)_j = w^(L)_ij,    (6)

where i ∈ {1, 2, . . . , n_L} and j ∈ {1, 2, . . . , n_{L−1}}.
The outputs of the hidden layers are y^(l) = ReLU(W^(l) y^(l−1)) for 1 ≤ l ≤ L − 1, which are influenced by ReLU. Concretely, the gradients of y^(l) with respect to W^(l) and y^(l−1) are

    ∂y^(l)_i / ∂w^(l)_ij = y^(l−1)_j if W^(l)_i y^(l−1) > 0, and 0 if W^(l)_i y^(l−1) ≤ 0,    (7)

    ∂y^(l)_i / ∂y^(l−1)_j = w^(l)_ij if W^(l)_i y^(l−1) > 0, and 0 if W^(l)_i y^(l−1) ≤ 0,    (8)

where i ∈ {1, 2, . . . , n_l}, j ∈ {1, 2, . . . , n_{l−1}}, and W^(l)_i = (w^(l)_{i1}, w^(l)_{i2}, . . . , w^(l)_{i n_{l−1}}) is the i-th row of W^(l).

(3) Finally, based on the chain rule, the gradient of any parameter with respect to the loss is computed as

    ∂L(W; (x, ȳ)) / ∂w^(l)_ij = (∂L(W; (x, ȳ)) / ∂y^(L)) (∂y^(L) / ∂y^(L−1)) · · · (∂y^(l+1) / ∂y^(l)_i) (∂y^(l)_i / ∂w^(l)_ij),

where i ∈ {1, 2, . . . , n_l} and j ∈ {1, 2, . . . , n_{l−1}}. Note that we omit the detailed result, which can be easily deduced with the above calculations, due to the page limitation.

Similarly, for all samples in D_k (i.e., (x_i, ȳ_i) ∈ D_k), C_k computes the local gradients ∂L(W; (x_i, ȳ_i)) / ∂W^(l) for l = 1, 2, . . . , L, and then computes the average local gradients as: for l = 1, 2, . . . , L,

    ∇F(W^(l), D_k) = (1 / |D_k|) Σ_{(x_i, ȳ_i) ∈ D_k} ∂L(W; (x_i, ȳ_i)) / ∂W^(l).    (9)

Finally, C_k uploads ∇F(W^(l), D_k) to the server for model update.
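The forward and backward passes of Eqs. (4)-(8) can be sketched in NumPy; the layer sizes and random data below are illustrative, and the manually derived gradients are checked against a central finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(Ws, x):
    """Forward propagation per Eq. (4): ReLU on hidden layers, linear output."""
    ys = [x]
    for l, W in enumerate(Ws):
        z = W @ ys[-1]
        ys.append(z if l == len(Ws) - 1 else np.maximum(z, 0.0))
    return ys

def backward(Ws, ys, y_bar):
    """Backward propagation per Eqs. (5)-(8) with the MSE loss of Eq. (3)."""
    grads = [None] * len(Ws)
    delta = ys[-1] - y_bar                      # Eq. (5)
    for l in range(len(Ws) - 1, -1, -1):
        if l != len(Ws) - 1:
            delta = delta * (ys[l + 1] > 0)     # ReLU mask, Eqs. (7)-(8)
        grads[l] = np.outer(delta, ys[l])       # per-layer gradient matrix
        delta = Ws[l].T @ delta                 # propagate to the previous layer
    return grads

# toy network 4 -> 5 -> 3 -> 2, one sample (x, ȳ)
Ws = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5)), rng.standard_normal((2, 3))]
x, y_bar = rng.standard_normal(4), rng.standard_normal(2)
ys = forward(Ws, x)
grads = backward(Ws, ys, y_bar)

# finite-difference check on one entry of W^(1)
eps = 1e-6
loss = lambda: 0.5 * np.sum((forward(Ws, x)[-1] - y_bar) ** 2)
Ws[0][2, 1] += eps; up = loss()
Ws[0][2, 1] -= 2 * eps; down = loss()
Ws[0][2, 1] += eps
assert abs((up - down) / (2 * eps) - grads[0][2, 1]) < 1e-4
```

Averaging `grads` over all samples of D_k yields ∇F(W^(l), D_k) of Eq. (9).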
2) Global Model Update:
After receiving ∇F(W^(l), D_k) from all K clients, the server aggregates them and updates the current model parameters for the next iteration. Specifically, given the learning rate η, the updated parameter is computed as

    W^(l) ⇐ W^(l) − η Σ_{k=1}^{K} (|D_k| / |D|) ∇F(W^(l), D_k).    (10)

After that, the server distributes the updated parameters W = {W^(l)}_{l=1}^{L} to all clients for local training.

Remark 1.
The server and all clients interactively repeat the operations in Sections II-A1 and II-A2 until convergence. Finally, the server returns the well-trained global model parameters to all clients for inference.

B. Hadamard Product
The Hadamard product [18] takes two matrices of the samedimensions and produces another matrix of the same dimensionas the operands.
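The Hadamard identities used in this subsection are easy to check numerically; a minimal NumPy sketch, where the Hadamard product is the elementwise `*`:

```python
import numpy as np

rng = np.random.default_rng(2)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
d = rng.standard_normal(3)
D = np.diag(d)

# property (11): a diagonal matrix factors through the Hadamard product
assert np.allclose(D @ (A * B), (D @ A) * B)
assert np.allclose((A * B) @ D, (A @ D) * B)

# for column vectors, a ∘ b = D_a b with D_a = diag(a)
a, b = rng.standard_normal(3), rng.standard_normal(3)
assert np.allclose(a * b, np.diag(a) @ b)
```
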
Definition 1.
For two matrices A and B of the same dimension m × n, the Hadamard product A ∘ B (or A ⊙ B) is a matrix of the same dimension as the operands, with elements given by

    (A ∘ B)_ij = (A ⊙ B)_ij = (A)_ij (B)_ij.

Two properties of the Hadamard product are given as follows:

• For any two matrices A and B, and a diagonal matrix D, we have

    D(A ∘ B) = (DA) ∘ B  and  (A ∘ B)D = (AD) ∘ B.    (11)

• For any two column vectors a and b, the Hadamard product is a ∘ b = D_a b, where D_a is the diagonal matrix with the vector a as its main diagonal.

III. MODELS AND DESIGN GOALS
In this part, we introduce the system and threat models considered in this paper, and identify our design goals.
A. System Model
The system model includes two components: a server and a small number of clients, denoted as C = {C_1, C_2, . . . , C_K}.

• The Server is responsible for aggregating local gradients and updating the global model for the next iteration. To keep model parameters secret, the server sends encrypted global model parameters to clients for training or prediction.

• Clients have their own local training data and want to collaboratively train a global model. Specifically, each client C_k runs the SGD algorithm to compute local gradients with its own local training dataset D_k and the received global model parameters, and then returns encrypted local gradients to the server for aggregating and updating. (The (i, j)-th entry of ∂L(W; (x_i, ȳ_i)) / ∂W^(l) is ∂L(W; (x_i, ȳ_i)) / ∂w^(l)_ij.)

Fig. 3: System architecture and the overall protocol of the proposed scheme.

Concretely, as shown in Fig. 3, the overall process of our privacy-preserving cross-silo FL includes the following phases:

• Model training is responsible for generating a model with strong generalization ability, which is the most important part of the cross-silo FL. Specifically, for each iteration, the server and each client C_k ∈ C interactively perform the following four steps:

(1) Global model encryption: The server first encrypts the current global model parameters with our proposed one-time-pad model encryption method and then distributes the encrypted global model parameters to all clients for local training.
(2) Local model training: Based on the received encrypted model parameters, C_k runs the SGD algorithm with the local training data D_k to obtain the encrypted local gradients.

(3) Local gradients perturbation: After obtaining the encrypted local gradients, C_k further perturbs them with the secret-sharing technique and uploads the perturbed and encrypted local gradients to the server.

(4) Global model update: The server first aggregates the perturbed and encrypted local gradients from all clients and then decrypts the aggregated results to update the current global model parameters for the next iteration.

The server and all clients interactively iterate the steps (1)-(4) until convergence. Consequently, the server obtains the well-trained global model parameters.

• Data inference:
The server encrypts the well-trained global model parameters based on our one-time-pad model encryption method with a slight modification, and sends them to all clients for data inference.
B. Threat Model and Design Goals
Similar to [4], we assume the server and clients are honest-but-curious (also called "semi-trusted") [19], [20], which means that they honestly follow the underlying scheme but attempt to infer other entities' private data independently. Specifically, the server tries to infer the private training data of each client from the received local gradients. For industrial clients, data is digital wealth, and they accordingly want to obtain as much data as possible. Thus, curious clients (even colluding with each other) try to obtain the private training data of others from model updates. In theory, there may be collusion between the server and curious clients. In practice, however, such collusion could make the server lose its reputation due to data leakage, so collusion between the server and clients is not profitable. Accordingly, we make the following assumption: the server does not collude with clients.

Consequently, the design goals of our scheme mainly include the following two aspects:

• Confidentiality: The proposed scheme should ensure the confidentiality of local training data. Since curious adversaries may infer private training data from exchanged model updates, it is better to prevent them from obtaining these updates. In particular, clients cannot know the updated global model parameters or the well-trained model parameters. In addition to the aggregated gradients, the server cannot obtain the local gradients of each client from the received information.

• High performance: The proposed scheme should ensure high model accuracy and efficiency, especially in terms of computational costs. Specifically, it is better for our scheme to achieve the same model accuracy as the plain model, e.g., FedAvg [3], and meanwhile the computational costs of our scheme should outperform the state-of-the-art HE-based scheme, like BatchCrypt [9].

IV. PROPOSED SCHEME
In this section, we describe our proposed ciphertext-based deep model training in DNNs with ReLU non-linear activation in detail, which can be easily applied to state-of-the-art models such as Convolutional Neural Networks (CNNs) [21] as well as ResNet [22] and DenseNet [23]. Specifically, the structure of the DNN is shown in Fig. 2, and according to Section III-A, our proposed scheme in the framework of the cross-silo FL mainly includes five steps: global model encryption, local model training, local gradients perturbation, global model update, and data inference.
A. Global Model Encryption
In order to protect the privacy of the model parameters W = {W^(l)}_{l=1}^{L}, the server needs to encrypt them before distributing. Specifically, the parameter encryption consists of two steps:

• Key selection:
The server generates random one-time-used keys for different iterations as follows.

– Randomly select multiplicative noise vectors r^(l) = (r^(l)_1, r^(l)_2, . . . , r^(l)_{n_l}) ∈ R^{n_l}_{>0} for 1 ≤ l ≤ L − 1 and an additive noise vector r^(a) = (r^(a)_1, r^(a)_2, . . . , r^(a)_{n_L}) ∈ R^{n_L} with pairwise different components.

– Define a disjoint partition ⊔_{s=1}^{m} {I_s} of {1, 2, . . . , n_L} such that ∪_{s=1}^{m} I_s = {1, 2, . . . , n_L} and I_i ∩ I_j = ∅ for any i ≠ j. Randomly select noise numbers γ_{I_1}, γ_{I_2}, . . . , γ_{I_m} ∈ R. Then let γ = (γ_1, γ_2, . . . , γ_{n_L}), whose coordinates are given as γ_i = γ_{I_s} for i ∈ I_s and s = 1, 2, . . . , m.

The server keeps ({r^(l)}_{l=1}^{L−1}, γ) secret.

• Parameter encryption:
For the global parameter matrix W^(l) ∈ R^{n_l × n_{l−1}}, compute the corresponding ciphertext as

    Ŵ^(l) = R^(l) ∘ W^(l),  for 1 ≤ l ≤ L − 1,
    Ŵ^(L) = R^(L) ∘ W^(L) + R^(a),  for l = L,    (12)

where R^(l) ∈ R^{n_l × n_{l−1}} and R^(a) ∈ R^{n_L × n_{L−1}} satisfy

    R^(l)_ij = r^(1)_i, when l = 1;  r^(l)_i / r^(l−1)_j, when 2 ≤ l ≤ L − 1;  1 / r^(L−1)_j, when l = L,    (13)

    R^(a)_ij = γ_i · r^(a)_i,    (14)

where i ∈ [1, n_l] and j ∈ [1, n_{l−1}] in Eq. (13), and i ∈ [1, n_L] and j ∈ [1, n_{L−1}] in Eq. (14). Finally, the server sends the encrypted parameters Ŵ = {Ŵ^(l)}_{l=1}^{L} and the public parameter r^(a) to each client for local training.

B. Local Model Training
After receiving Ŵ = {Ŵ^(l)}_{l=1}^{L}, each client C_k ∈ C computes the encrypted local gradients ∇F(Ŵ^(l), D_k) with the local training data D_k. Similar to the plain training introduced in Section II-A1, for each sample (x_i, ȳ_i) ∈ D_k, C_k executes two operations: forward propagation and backward propagation. We again ignore the subscript i of (x_i, ȳ_i) for simplicity unless otherwise specified.
1) Forward Propagation:
Similar to the calculations in Eq. (4), the encrypted output vector of the l-th layer under Ŵ is

    ŷ^(l) = ReLU(Ŵ^(l) ŷ^(l−1)),  for 1 ≤ l ≤ L − 1,
    ŷ^(L) = Ŵ^(L) ŷ^(L−1),  for l = L,    (15)

where ŷ^(0) = x. Specifically, Theorem 1 shows the important relations between the encrypted outputs {ŷ^(l)}_{l=1}^{L} and the plaintext outputs {y^(l)}_{l=1}^{L} given in Eq. (4).

Theorem 1.
For 1 ≤ l ≤ L, the encrypted output vector ŷ^(l) and the plaintext output vector y^(l) have the following relationship:

    ŷ^(l) = r^(l) ∘ y^(l),  for 1 ≤ l ≤ L − 1,    (16)

    ŷ^(L) = y^(L) + α γ ∘ r^(a) = y^(L) + α r,    (17)

where α = Σ_{i=1}^{n_{L−1}} ŷ^(L−1)_i and r = γ ∘ r^(a). (The value of m (1 ≤ m ≤ n_L) determines the security of predictions, which will be analyzed in Theorem 5.)

Proof. See Appendix A.
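Theorem 1 can be checked numerically. The sketch below uses illustrative layer sizes (and m = n_L, so γ has independent coordinates), encrypts a random network per Eqs. (12)-(14), runs the encrypted forward pass of Eq. (15), and verifies Eqs. (16)-(17):

```python
import numpy as np

rng = np.random.default_rng(4)
dims = [4, 5, 3, 2]                 # n_0, ..., n_L with L = 3
L = len(dims) - 1
Ws = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(L)]
r = [rng.uniform(0.5, 2.0, dims[l + 1]) for l in range(L - 1)]   # r^(1..L-1) > 0
r_a = rng.uniform(0.5, 2.0, dims[L])                              # r^(a)
gamma = rng.standard_normal(dims[L])                              # γ

# encrypt per Eqs. (12)-(14)
Ws_enc = []
for l in range(1, L + 1):
    if l == 1:
        Rl = np.outer(r[0], np.ones(dims[0]))          # R^(1)_ij = r^(1)_i
    elif l < L:
        Rl = np.outer(r[l - 1], 1.0 / r[l - 2])        # r^(l)_i / r^(l-1)_j
    else:
        Rl = np.outer(np.ones(dims[L]), 1.0 / r[L - 2])  # 1 / r^(L-1)_j
    C = Rl * Ws[l - 1]
    if l == L:
        C = C + np.outer(gamma * r_a, np.ones(dims[L - 1]))  # + R^(a)
    Ws_enc.append(C)

def forward(Ms, x):
    """Forward pass of Eq. (4)/(15): ReLU on hidden layers, linear output."""
    y, outs = x, []
    for l, M in enumerate(Ms):
        z = M @ y
        y = z if l == len(Ms) - 1 else np.maximum(z, 0.0)
        outs.append(y)
    return outs

x = rng.standard_normal(dims[0])
plain, enc = forward(Ws, x), forward(Ws_enc, x)

# Eq. (16): ŷ^(l) = r^(l) ∘ y^(l) for the hidden layers
for l in range(L - 1):
    assert np.allclose(enc[l], r[l] * plain[l])

# Eq. (17): ŷ^(L) = y^(L) + α (γ ∘ r^(a)) with α = Σ_i ŷ^(L-1)_i
alpha = enc[L - 2].sum()
assert np.allclose(enc[L - 1], plain[L - 1] + alpha * gamma * r_a)
```

The positivity of the r^(l) is what keeps the ReLU pattern of the encrypted pass identical to the plain pass (cf. Remark 2).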
Remark 2.
From Theorem 1, we can observe that our model encryption method efficiently keeps the ReLU activation consistent with non-private training. That is, for 1 ≤ l ≤ L − 1, if the non-private result W^(l) y^(l−1) computed in Eq. (4) satisfies W^(l) y^(l−1) > 0 (resp. W^(l) y^(l−1) ≤ 0), then by the definition of ReLU, y^(l) = W^(l) y^(l−1) (resp. y^(l) = 0). According to Theorem 1, our encrypted result certainly satisfies ŷ^(l) = Ŵ^(l) ŷ^(l−1) > 0 (resp. ŷ^(l) = 0) due to ŷ^(l) = r^(l) ∘ y^(l) and r^(l) ∈ R^{n_l}_{>0}.

2) Backward Propagation: After obtaining {ŷ^(l)}_{l=1}^{L}, the client calculates the corresponding gradients based on the MSE loss function. From Theorem 1, for the sample (x, ȳ), the client can only obtain ŷ^(L) = y^(L) + α r. Thus, the MSE loss shown in Eq. (3) is changed into the ciphertext-based MSE denoted as

    L̂(Ŵ; (x, ȳ)) = (1/2) ‖ŷ^(L) − ȳ‖².    (18)

Similar to the calculations of backward propagation in Section II-A1, the client can obtain the encrypted local gradients ∂L̂(Ŵ; (x, ȳ)) / ∂Ŵ^(l) ∈ R^{n_l × n_{l−1}} with the above ciphertext-based MSE, which have the relation with the plain gradients stated in Theorem 2.

Theorem 2.
For any 1 ≤ l ≤ L, the encrypted gradient matrix ∂L̂(Ŵ; (x, ȳ)) / ∂Ŵ^(l) and the plain gradient matrix ∂L(W; (x, ȳ)) / ∂W^(l) satisfy

    ∂L̂(Ŵ; (x, ȳ)) / ∂Ŵ^(l) = (1 / R^(l)) ∘ ∂L(W; (x, ȳ)) / ∂W^(l) + r^T σ^(l) − υ β^(l),    (19)

where 1 / R^(l) is the n_l × n_{l−1} matrix whose (i, j)-th entry is 1 / R^(l)_ij, σ^(l) = α ∂ŷ^(L) / ∂Ŵ^(l) + (∂L̂(Ŵ; (x, ȳ)) / ∂ŷ^(L))^T ∂α / ∂Ŵ^(l), υ = r^T r, and β^(l) = α ∂α / ∂Ŵ^(l). Note that when l = L, α = Σ_{i=1}^{n_{L−1}} ŷ^(L−1)_i is not a function of Ŵ^(L), and thus ∂α / ∂Ŵ^(L) = 0_{n_L × n_{L−1}}, which implies that σ^(L) = α ∂ŷ^(L) / ∂Ŵ^(L) and β^(L) = 0_{n_L × n_{L−1}}.

Proof. See Appendix B.

From Theorem 2, we can observe that σ^(l) and β^(l) can be computed directly by clients, and these two values decide whether the server can recover the true aggregated model parameters. Hence, in addition to the encrypted gradients, the client needs to compute σ^(l) and β^(l).

Similarly, for any sample in D_k, e.g., (x_i, ȳ_i) ∈ D_k, each client C_k computes the encrypted local gradients ∂L̂(Ŵ; (x_i, ȳ_i)) / ∂Ŵ^(l) and the corresponding noise terms, represented as σ^(l)_{(x_i, ȳ_i)} and β^(l)_{(x_i, ȳ_i)}. Then, C_k computes the average encrypted local gradients ∇F(Ŵ^(l), D_k) and (σ^(l)_k, β^(l)_k) as: for l = 1, 2, . . . , L,

    ∇F(Ŵ^(l), D_k) := (1 / |D_k|) Σ_{(x_i, ȳ_i) ∈ D_k} ∂L̂(Ŵ; (x_i, ȳ_i)) / ∂Ŵ^(l);

    σ^(l)_k := (1 / |D_k|) Σ_{(x_i, ȳ_i) ∈ D_k} σ^(l)_{(x_i, ȳ_i)};

    β^(l)_k := (1 / |D_k|) Σ_{(x_i, ȳ_i) ∈ D_k} β^(l)_{(x_i, ȳ_i)}.    (20)
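The l = L case of Theorem 2 (where β^(L) = 0) admits a compact numerical check. The sketch below uses illustrative sizes and takes ŷ^(L−1) = r^(L−1) ∘ y^(L−1) directly from Theorem 1; for this layer, (r^T σ^(L))_ij reduces to α γ_i r^(a)_i ŷ^(L−1)_j.

```python
import numpy as np

rng = np.random.default_rng(7)
n_prev, n_L = 3, 2                       # n_{L-1} and n_L
W_L = rng.standard_normal((n_L, n_prev)) # last-layer parameters W^(L)
r_prev = rng.uniform(0.5, 2.0, n_prev)   # r^(L-1) > 0
r_a = rng.uniform(0.5, 2.0, n_L)         # r^(a)
gamma = rng.standard_normal(n_L)         # γ
r_vec = gamma * r_a                      # r = γ ∘ r^(a)

y_prev = np.maximum(rng.standard_normal(n_prev), 0)  # plain hidden output y^(L-1)
y_hat_prev = r_prev * y_prev             # ŷ^(L-1), Eq. (16)
y_bar = rng.standard_normal(n_L)         # label ȳ

R_L = np.outer(np.ones(n_L), 1.0 / r_prev)             # R^(L), Eq. (13)
W_L_enc = R_L * W_L + np.outer(r_vec, np.ones(n_prev)) # Ŵ^(L), Eq. (12)

# encrypted and plain last-layer MSE gradients (outer-product form of Eqs. (5)-(6))
y_hat = W_L_enc @ y_hat_prev
grad_enc = np.outer(y_hat - y_bar, y_hat_prev)         # ∂L̂/∂Ŵ^(L)
y_out = W_L @ y_prev
grad_plain = np.outer(y_out - y_bar, y_prev)           # ∂L/∂W^(L)

# Eq. (19) with β^(L) = 0: ∂L̂/∂Ŵ^(L) = (1/R^(L)) ∘ ∂L/∂W^(L) + r^T σ^(L)
alpha = y_hat_prev.sum()
rT_sigma = np.outer(alpha * r_vec, y_hat_prev)
assert np.allclose(grad_enc, (1.0 / R_L) * grad_plain + rT_sigma)
```
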
C. Local Gradients Perturbation
To prevent the server from obtaining the local gradients of each client, each client perturbs the encrypted local gradients with one-time-used random numbers and returns the perturbed and encrypted local gradients to the server.

1) All clients privately agree on three sets of random number matrices {φ^(l)_1, φ^(l)_2, . . . , φ^(l)_K}, {ϕ^(l)_1, ϕ^(l)_2, . . . , ϕ^(l)_K} and {ψ^(l)_1, ψ^(l)_2, . . . , ψ^(l)_K} such that Σ_{k=1}^{K} |D_k| φ^(l)_k = 0_{n_l × n_{l−1}}, Σ_{k=1}^{K} |D_k| ϕ^(l)_k = 0_{n_l × n_{l−1}} and Σ_{k=1}^{K} |D_k| ψ^(l)_k = 0_{n_l × n_{l−1}} for l = 1, 2, . . . , L. Without loss of generality, we assume the k-th client holds {φ^(l)_k, ϕ^(l)_k, ψ^(l)_k}_{l=1}^{L}. Note that for different iterations, clients agree on different sets of random matrices in advance.

2) Each client C_k masks the encrypted local gradients and the corresponding noise terms with {φ^(l)_k, ϕ^(l)_k, ψ^(l)_k}_{l=1}^{L} as:

    ∇F̂(Ŵ^(l), D_k) = ∇F(Ŵ^(l), D_k) + φ^(l)_k,  σ̂^(l)_k = σ^(l)_k + ϕ^(l)_k,  β̂^(l)_k = β^(l)_k + ψ^(l)_k.    (21)

Then the following items can be computed:

    σ̃^(l)_{k,s} := (r^(a)|_{I_s})^T σ̂^(l)_k|_{I_s},  for s = 1, 2, . . . , m,

where r^(a)|_{I_s} is the restriction of the vector r^(a) on I_s and σ̂^(l)_k|_{I_s} is the sub-matrix of σ̂^(l)_k consisting of the rows indexed by I_s.

3) Each client C_k returns {∇F̂(Ŵ^(l), D_k)}_{l=1}^{L} together with {σ̃^(l)_{k,1}, σ̃^(l)_{k,2}, · · · , σ̃^(l)_{k,m}, β̂^(l)_k}_{l=1}^{L}.

D. Global Model Update
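Before detailing the server side, note why the masks of Section IV-C do no harm here: because Σ_k |D_k| φ^(l)_k = 0, they vanish under the |D_k|/|D|-weighted aggregation the server performs. A NumPy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
sizes = np.array([30, 50, 20])          # |D_k| for K = 3 clients
shape = (3, 4)                          # one layer's n_l × n_{l-1}

# zero-sum masks: draw K-1 freely, solve for the last so that Σ_k |D_k| φ_k = 0
phis = [rng.standard_normal(shape) for _ in sizes[:-1]]
phis.append(-sum(n * p for n, p in zip(sizes[:-1], phis)) / sizes[-1])
assert np.allclose(sum(n * p for n, p in zip(sizes, phis)), 0.0)

# masking (Eq. (21)) cancels under the weighted aggregation of Eq. (22)
grads = [rng.standard_normal(shape) for _ in sizes]     # stand-ins for ∇F(Ŵ^(l), D_k)
masked = [g + p for g, p in zip(grads, phis)]
w = sizes / sizes.sum()                                  # |D_k| / |D|
agg_masked = sum(wk * m for wk, m in zip(w, masked))
agg_true = sum(wk * g for wk, g in zip(w, grads))
assert np.allclose(agg_masked, agg_true)
```

The same cancellation applies to the ϕ^(l)_k and ψ^(l)_k masks on σ^(l)_k and β^(l)_k.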
After receiving {∇F̂(Ŵ^(l), D_k)}_{l=1}^{L} and the (m + 1) noise terms {σ̃^(l)_{k,1}, · · · , σ̃^(l)_{k,m}, β̂^(l)_k}_{l=1}^{L} from all K clients, the server aggregates them and recovers the exact model updates for the next iteration. The details are given as follows.

• For l = 1, 2, . . . , L, perform the aggregation operation as:

    ∇F(Ŵ^(l)) = Σ_{k=1}^{K} (|D_k| / |D|) ∇F̂(Ŵ^(l), D_k),
    σ̃^(l)_s = Σ_{k=1}^{K} (|D_k| / |D|) σ̃^(l)_{k,s},  for s = 1, 2, . . . , m,
    β̂^(l) = Σ_{k=1}^{K} (|D_k| / |D|) β̂^(l)_k.    (22)

• According to Theorem 2, compute

    ∇F(W^(l)) = R^(l) ∘ (∇F(Ŵ^(l)) − Σ_{s=1}^{m} γ_{I_s} σ̃^(l)_s + υ β̂^(l)),

where l = 1, 2, . . . , L, and γ_{I_s} and υ are the secret parameters owned by the server. Note that according to Theorem 2, we can prove that our proposed scheme obtains the exact model (i.e., achieves the same learning accuracy as the plaintext training), which is shown in Theorem 3.

Theorem 3. ∇F(W^(l)) is the plain aggregated gradient satisfying

    ∇F(W^(l)) = Σ_{k=1}^{K} (|D_k| / |D|) ∇F(W^(l), D_k),    (23)

where ∇F(W^(l), D_k) is the plain average local gradient denoted in Eq. (9).

Proof. See Appendix C.

• Based on Theorem 3, update the current global model for the next iteration as: for l = 1, 2, . . . , L,

    W^(l) ⇐ W^(l) − η ∇F(W^(l)),

which is equal to Eq. (10), i.e., the plaintext updating.

E. Data Inference
After model training finishes, the server sends the well-trained model to each client. To prevent clients from learning the real model W while still allowing each client to predict locally, the server again encrypts W. Specifically, the operations are similar to those in Section IV-A; the only difference is that in the last layer the server does not adopt the additive noises r^{(a)} and γ. Thus, the encrypted model parameters are computed as

    \widehat{W}^{(l)} = R^{(l)} \circ W^{(l)}, \quad \text{for } 1 \le l \le L,

where R^{(l)} ∈ R^{n_l × n_{l-1}} satisfies Eq. (13). Obviously, without the influence of the additive noises r^{(a)} and γ, it follows from Theorem 1 that ŷ^{(L)} = y^{(L)}, which is the real prediction. Therefore, with our proposed method, each client can make exact predictions without knowing the real model.

V. AN EXAMPLE OF OUR PROPOSED SCHEME
In this section, we give a simple example of one iteration of DNN training for both plain cross-silo FL and our privacy-preserving cross-silo FL, where the structure of the DNN is shown in Fig. 4. Note that in both cases the most complex calculation is the local training, so we describe that process in detail and omit the other processes due to the page limitation.

Fig. 4: A simple example of DNN structure.
A. Process of training in cross-silo FL
Based on Section II-A, the server first sends W = {W^{(1)}, W^{(2)}, W^{(3)}} to each client C_k for local model training. As shown in Eq. (4), for a sample (x, ȳ), C_k performs forward propagation to obtain the output vector of each layer:

    y^{(l)} = \mathrm{ReLU}\big(W^{(l)} y^{(l-1)}\big) \ \text{for } l \in \{1, 2\}; \quad y^{(3)} = W^{(3)} y^{(2)}.   (24)

Then, C_k computes the local gradients of each layer:

    \frac{\partial L(W; (x, \bar{y}))}{\partial w_{ij}^{(3)}} = \frac{\partial L(W; (x, \bar{y}))}{\partial y_i^{(3)}} \frac{\partial y_i^{(3)}}{\partial w_{ij}^{(3)}},

    \frac{\partial L(W; (x, \bar{y}))}{\partial w_{ij}^{(2)}} = \frac{\partial L(W; (x, \bar{y}))}{\partial \mathbf{y}^{(3)}} \frac{\partial \mathbf{y}^{(3)}}{\partial y_i^{(2)}} \frac{\partial y_i^{(2)}}{\partial w_{ij}^{(2)}},

    \frac{\partial L(W; (x, \bar{y}))}{\partial w_{ij}^{(1)}} = \frac{\partial L(W; (x, \bar{y}))}{\partial \mathbf{y}^{(3)}} \frac{\partial \mathbf{y}^{(3)}}{\partial \mathbf{y}^{(2)}} \frac{\partial \mathbf{y}^{(2)}}{\partial y_i^{(1)}} \frac{\partial y_i^{(1)}}{\partial w_{ij}^{(1)}}.

We omit the detailed results of the above calculations, which can be deduced with Eqs. (5)-(8). Finally, according to Eqs. (9) and (10), the current model parameters can be updated.
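The plain forward and backward passes above can be sketched numerically. This is a generic 3-layer ReLU network with a squared loss, not the paper's exact configuration; all shapes and values are illustrative, and one gradient entry is checked by finite differences.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def loss(Ws, x, ybar):
    """0.5 * ||y^(3) - ybar||^2 with ReLU hidden layers, as in Eq. (24)."""
    y = x
    for W in Ws[:-1]:
        y = relu(W @ y)
    return 0.5 * np.sum((Ws[-1] @ y - ybar) ** 2)

rng = np.random.default_rng(3)
W1, W2, W3 = (rng.standard_normal(s) for s in [(3, 2), (3, 3), (2, 3)])
x, ybar = rng.standard_normal(2), rng.standard_normal(2)

# forward propagation, keeping the per-layer outputs y^(l)
y1 = relu(W1 @ x)
y2 = relu(W2 @ y1)
y3 = W3 @ y2

# backward propagation via the chain rule
d3 = y3 - ybar                  # dL/dy^(3)
g3 = np.outer(d3, y2)           # dL/dW^(3)
d2 = (W3.T @ d3) * (y2 > 0)     # chain through the ReLU of layer 2
g2 = np.outer(d2, y1)           # dL/dW^(2)
d1 = (W2.T @ d2) * (y1 > 0)
g1 = np.outer(d1, x)            # dL/dW^(1)

# finite-difference check of one gradient entry
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (loss([W1p, W2, W3], x, ybar) - loss([W1, W2, W3], x, ybar)) / eps
assert abs(numeric - g1[0, 0]) < 1e-4
```

The `(y > 0)` factors implement the ReLU derivative, which is what makes multiplicative masking with positive keys compatible with backpropagation later on.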
B. Process of training in our scheme
In this section, we mainly show the processes of global model encryption and local model training, which form the crux of our model encryption method. According to Section IV, the server performs the global model encryption to protect the privacy of W.
1. Global model encryption:
The server performs key selection and parameter encryption as follows: (1) Key selection:
Randomly select r^{(l)} = (r_1^{(l)}, r_2^{(l)}, r_3^{(l)}) ∈ R^{3}_{>0} for l ∈ {1, 2} and r^{(a)} = (r_1^{(a)}, r_2^{(a)}) ∈ R^{2}. Randomly select γ ∈ R, and let γ = (γ_1, γ_2), where γ_1 = γ_2 = γ. (In fact, γ_1 and γ_2 can be selected to be different according to Section IV-A; in this example we directly set γ_1 = γ_2 = γ for simplicity, as different values would not change the operations.)

(2) Parameter encryption: Based on Eq. (12), compute the encrypted model Ŵ = {Ŵ^{(1)}, Ŵ^{(2)}, Ŵ^{(3)}}. Due to the page limitation, we show Ŵ^{(3)} as an example:

    \widehat{W}^{(3)} = \begin{pmatrix}
        w_{11}^{(3)}/r_1^{(2)} + \gamma r_1^{(a)} & w_{12}^{(3)}/r_2^{(2)} + \gamma r_1^{(a)} & w_{13}^{(3)}/r_3^{(2)} + \gamma r_1^{(a)} \\
        w_{21}^{(3)}/r_1^{(2)} + \gamma r_2^{(a)} & w_{22}^{(3)}/r_2^{(2)} + \gamma r_2^{(a)} & w_{23}^{(3)}/r_3^{(2)} + \gamma r_2^{(a)}
    \end{pmatrix}.

Send Ŵ = {Ŵ^{(1)}, Ŵ^{(2)}, Ŵ^{(3)}} and r^{(a)} to each client C_k.
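A minimal sketch of the last-layer encryption above, assuming (per Eq. (13)) that the multiplicative part of Ŵ^{(3)} is w_{ij}/r_j^{(2)}; the weights and keys are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
W3 = rng.standard_normal((2, 3))          # plaintext last-layer weights
r2 = rng.uniform(0.5, 2.0, 3)             # positive key r^(2) of the previous layer
ra = rng.standard_normal(2)               # additive key r^(a)
gamma = 1.7                               # here gamma_1 = gamma_2 = gamma

# hat(W3)_ij = w_ij / r2_j + gamma * ra_i
W3_hat = W3 / r2[None, :] + gamma * ra[:, None]

# the server, which holds all keys, can invert the encryption exactly
W3_rec = (W3_hat - gamma * ra[:, None]) * r2[None, :]
assert np.allclose(W3_rec, W3)
```

A client that knows only r^{(a)} (which is public) but not γ or r^{(2)} cannot perform this inversion, which is the point of the one-time-pad-style construction.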
2. Local model training: According to Eq. (15), C_k performs the forward propagation to compute {ŷ^{(1)}, ŷ^{(2)}, ŷ^{(3)}}:

    \widehat{W}^{(1)} \hat{y}^{(0)} = \begin{pmatrix}
        r_1^{(1)} w_{11}^{(1)} & r_1^{(1)} w_{12}^{(1)} \\
        r_2^{(1)} w_{21}^{(1)} & r_2^{(1)} w_{22}^{(1)} \\
        r_3^{(1)} w_{31}^{(1)} & r_3^{(1)} w_{32}^{(1)}
    \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
    = r^{(1)} \circ \big(W^{(1)} y^{(0)}\big).

Since r^{(1)} = (r_1^{(1)}, r_2^{(1)}, r_3^{(1)}) ∈ R^{3}_{>0} does not affect the sign of the ReLU input, ŷ^{(1)} can be deduced as

    \hat{y}^{(1)} = \mathrm{ReLU}\big(\widehat{W}^{(1)} \hat{y}^{(0)}\big)
    = \mathrm{ReLU}\big(r^{(1)} \circ (W^{(1)} y^{(0)})\big)
    = r^{(1)} \circ \mathrm{ReLU}\big(W^{(1)} y^{(0)}\big)
    = r^{(1)} \circ y^{(1)}.   (25)

Next,

    \widehat{W}^{(2)} \hat{y}^{(1)} = \begin{pmatrix}
        (r_1^{(2)}/r_1^{(1)}) w_{11}^{(2)} & (r_1^{(2)}/r_2^{(1)}) w_{12}^{(2)} & (r_1^{(2)}/r_3^{(1)}) w_{13}^{(2)} \\
        (r_2^{(2)}/r_1^{(1)}) w_{21}^{(2)} & (r_2^{(2)}/r_2^{(1)}) w_{22}^{(2)} & (r_2^{(2)}/r_3^{(1)}) w_{23}^{(2)} \\
        (r_3^{(2)}/r_1^{(1)}) w_{31}^{(2)} & (r_3^{(2)}/r_2^{(1)}) w_{32}^{(2)} & (r_3^{(2)}/r_3^{(1)}) w_{33}^{(2)}
    \end{pmatrix} \begin{pmatrix} r_1^{(1)} y_1^{(1)} \\ r_2^{(1)} y_2^{(1)} \\ r_3^{(1)} y_3^{(1)} \end{pmatrix}
    = r^{(2)} \circ \big(W^{(2)} y^{(1)}\big).

Similarly, we can obtain

    \hat{y}^{(2)} = \mathrm{ReLU}\big(\widehat{W}^{(2)} \hat{y}^{(1)}\big) = r^{(2)} \circ y^{(2)}.   (26)

Finally,

    \hat{y}^{(3)} = \widehat{W}^{(3)} \hat{y}^{(2)}
    = W^{(3)} y^{(2)} + \begin{pmatrix}
        \gamma r_1^{(a)} & \gamma r_1^{(a)} & \gamma r_1^{(a)} \\
        \gamma r_2^{(a)} & \gamma r_2^{(a)} & \gamma r_2^{(a)}
    \end{pmatrix} \begin{pmatrix} r_1^{(2)} y_1^{(2)} \\ r_2^{(2)} y_2^{(2)} \\ r_3^{(2)} y_3^{(2)} \end{pmatrix}
    = y^{(3)} + \alpha\, \mathbf{r},
    \quad \text{where } \alpha = \sum_{i=1}^{3} \hat{y}_i^{(2)} \text{ and } \mathbf{r} = \gamma \circ r^{(a)}.   (27)

Obviously, Eqs. (25)-(27) verify the correctness of Theorem 1.
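Eqs. (25)-(27) can be checked numerically on this 2-3-3-2 example. The weights and keys below are randomly generated placeholders; the encryption pattern follows Eqs. (12)-(13) as reconstructed above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(5)
W1 = rng.standard_normal((3, 2))
W2 = rng.standard_normal((3, 3))
W3 = rng.standard_normal((2, 3))
x = rng.standard_normal(2)

r1 = rng.uniform(0.5, 2.0, 3)             # positive multiplicative keys
r2 = rng.uniform(0.5, 2.0, 3)
ra, gamma = rng.standard_normal(2), 1.3   # additive keys for the last layer

W1h = W1 * r1[:, None]                    # R^(1)_ij = r1_i
W2h = W2 * (r2[:, None] / r1[None, :])    # R^(2)_ij = r2_i / r1_j
W3h = W3 / r2[None, :] + gamma * ra[:, None]   # last layer, with additive noise

y1, y1h = relu(W1 @ x), relu(W1h @ x)
assert np.allclose(y1h, r1 * y1)                    # Eq. (25)

y2, y2h = relu(W2 @ y1), relu(W2h @ y1h)
assert np.allclose(y2h, r2 * y2)                    # Eq. (26)

y3, y3h = W3 @ y2, W3h @ y2h
alpha = y2h.sum()
assert np.allclose(y3h, y3 + alpha * gamma * ra)    # Eq. (27)
```

The hidden-layer check relies on the positivity of r^{(1)} and r^{(2)}: ReLU commutes with entrywise multiplication by a positive vector, which is exactly why this masking survives the non-linear activations.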
Next, C_k performs the backward propagation to compute the encrypted local gradients ∂L̂(Ŵ;(x, ȳ))/∂Ŵ^{(l)}, whose (i, j)-th entries are

    \frac{\partial \widehat{L}(\widehat{W}; (x, \bar{y}))}{\partial \widehat{w}_{ij}^{(3)}}
    = \frac{\partial \widehat{L}(\widehat{W}; (x, \bar{y}))}{\partial \hat{y}_i^{(3)}} \frac{\partial \hat{y}_i^{(3)}}{\partial \widehat{w}_{ij}^{(3)}}
    = \frac{1}{R_{ij}^{(3)}} \frac{\partial L(W; (x, \bar{y}))}{\partial w_{ij}^{(3)}} + \alpha r_i \frac{\partial \hat{y}_i^{(3)}}{\partial \widehat{w}_{ij}^{(3)}},

    \frac{\partial \widehat{L}(\widehat{W}; (x, \bar{y}))}{\partial \widehat{w}_{ij}^{(2)}}
    = \frac{\partial \widehat{L}(\widehat{W}; (x, \bar{y}))}{\partial \hat{\mathbf{y}}^{(3)}} \frac{\partial \hat{\mathbf{y}}^{(3)}}{\partial \widehat{w}_{ij}^{(2)}}
    = \frac{1}{R_{ij}^{(2)}} \frac{\partial L(W; (x, \bar{y}))}{\partial w_{ij}^{(2)}}
      - \alpha \upsilon \frac{\partial \alpha}{\partial \widehat{w}_{ij}^{(2)}}
      + \mathbf{r}^{T} \Big( \alpha \frac{\partial \hat{\mathbf{y}}^{(3)}}{\partial \widehat{w}_{ij}^{(2)}}
      + \Big( \frac{\partial \widehat{L}(\widehat{W}; (x, \bar{y}))}{\partial \hat{\mathbf{y}}^{(3)}} \Big)^{T} \frac{\partial \alpha}{\partial \widehat{w}_{ij}^{(2)}} \Big),

    \frac{\partial \widehat{L}(\widehat{W}; (x, \bar{y}))}{\partial \widehat{w}_{ij}^{(1)}}
    = \frac{1}{R_{ij}^{(1)}} \frac{\partial L(W; (x, \bar{y}))}{\partial w_{ij}^{(1)}}
      - \alpha \upsilon \frac{\partial \alpha}{\partial \widehat{w}_{ij}^{(1)}}
      + \mathbf{r}^{T} \Big( \alpha \frac{\partial \hat{\mathbf{y}}^{(3)}}{\partial \widehat{w}_{ij}^{(1)}}
      + \Big( \frac{\partial \widehat{L}(\widehat{W}; (x, \bar{y}))}{\partial \hat{\mathbf{y}}^{(3)}} \Big)^{T} \frac{\partial \alpha}{\partial \widehat{w}_{ij}^{(1)}} \Big).

Obviously, the above equations verify the correctness of Theorem 2. We omit the deductive details due to the page limitation and refer readers to Appendix B.

Finally, according to Sections IV-C and IV-D, each client and the server execute the local gradients perturbation and global model update, respectively.
VI. SECURITY ANALYSIS
Based on our design goals, we analyze the security properties of our scheme in this section. In particular, our analysis covers three aspects: (1) the confidentiality of model parameters, (2) the confidentiality of predictions during model training, and (3) the confidentiality of each client's local gradients.
A. Confidentiality of model parameters
As stated in [10], [11], semi-trusted clients may infer sensitive information from model parameters, and thus the intuitive idea is to prevent them from obtaining plaintext model parameters. Hence, we prove that our method is semantically secure against semi-trusted clients.

As introduced in Section IV-A, to keep the model parameters secret, the server encrypts the current global model parameters before distributing them. After getting the encrypted global model parameters Ŵ = {Ŵ^{(l)}}_{l=1}^{L}, each client C_k performs local training to compute local gradients. We observe that C_k performs only linear and derivative operations based on Ŵ, which do not affect the security of the local gradients, and thus the crux is the security of Ŵ. Next, we show that our model encryption method is semantically secure. In particular, the server randomly selects different private keys for different iterations, i.e., each private key is used only once, so we first review the definition of semantic security for a one-time key [24], [25].

Definition 2 (Semantic security of a one-time key cipher [24], [25]). Let E = (E, D) be a cipher, where E and D are the encryption and decryption operations, respectively. Consider an adversary A that selects two messages m_0 and m_1 of the same length from the message space. The challenger then flips a fair binary coin b, encrypts one of the messages as E(k, m_b) using a random key k selected from the key space, and sends it back to A. A then outputs a guess b* ∈ {0, 1} for which message yielded the particular encryption. Let Z_0 be the event that b = 0 and A guesses b* = 1, and let Z_1 be the event that b = 1 and A guesses b* = 1. Then a cipher E is semantically secure if the advantage

    \mathrm{Adv}_{SS}(A, E) = |\Pr(Z_0) - \Pr(Z_1)|

is negligible for all efficient adversaries.

Similar to [24], [25], we show the semantic security of our model encryption method in Theorem 4.
Theorem 4.
Our one-time-pad-style model encryption method is semantically secure.

Proof.
According to Definition 2, we prove this theorem based on the one-time-pad cipher as follows:

• The polynomial-time adversary A chooses two messages W_0 = {W_0^{(l)}}_{l=1}^{L} and W_1 = {W_1^{(l)}}_{l=1}^{L} of equal length, and gives them to the challenger.

• The challenger generates the random parameters {r^{(l)}}_{l=1}^{L-1}, γ and r^{(a)} according to Key selection in Section IV-A, along with a random bit b ∈ {0, 1}, and encrypts the message W_b = {W_b^{(l)}}_{l=1}^{L} as

    \widehat{W}_b^{(l)} = \begin{cases}
        R^{(l)} \circ W_b^{(l)}, & \text{for } 1 \le l \le L-1, \\
        R^{(l)} \circ W_b^{(l)} + R^{(a)}, & \text{for } l = L,
    \end{cases}

where R^{(l)} ∈ R^{n_l × n_{l-1}} and R^{(a)} ∈ R^{n_L × n_{L-1}} satisfy

    R_{ij}^{(l)} = \begin{cases}
        r_i^{(1)}, & l = 1, \\
        r_i^{(l)} / r_j^{(l-1)}, & 2 \le l \le L-1, \\
        1 / r_j^{(L-1)}, & l = L,
    \end{cases}
    \qquad R_{ij}^{(a)} = \gamma_i \cdot r_i^{(a)}.

• The adversary A is then given the resulting ciphertext Ŵ_b = {Ŵ_b^{(l)}}_{l=1}^{L}. Finally, A outputs a guess b* ∈ {0, 1}.

Since ({r^{(l)}}_{l=1}^{L-1}, γ) is randomly selected from the real number space and kept secret from the adversary, Ŵ_0 and Ŵ_1 are also uniformly random, which means that Ŵ_0 and Ŵ_1 are identically distributed (no algorithm can distinguish them). Then Z_0 and Z_1 are identical events, and so

    \mathrm{Adv}_{SS}(A, \text{our method}) = |\Pr(Z_0) - \Pr(Z_1)| = 0,

which is negligible for all adversaries.

B. Confidentiality of predictions
As shown in [26], the adversary can improve the attack presented in [10] by adopting the predictions (i.e., y^{(L)}) during model training. Thus, we analyze the confidentiality of predictions under our proposed scheme.

As shown in Theorem 1, the client can only obtain the encrypted prediction vector ŷ^{(L)} = y^{(L)} + α γ ∘ r^{(a)}, where y^{(L)} is the plaintext prediction vector. The parameter α and r^{(a)} = (r_1^{(a)}, r_2^{(a)}, ..., r_{n_L}^{(a)}) are known to the client, while γ = (γ_1, γ_2, ..., γ_{n_L}) is chosen randomly by the server and is unknown to the client. Recall that there exists a partition ⊔_{s=1}^{m} {I_s} of {1, 2, ..., n_L} such that any i, j in the same I_s satisfy γ_i = γ_j. Thus, the confidentiality of predictions is mainly influenced by the parameter m. Specifically, we characterize the confidentiality of predictions in terms of m under our proposed scheme in the following theorem.

Theorem 5.
When the number of classes n_L = 1, clients cannot obtain any information about the plaintext prediction y^{(L)}. When n_L ≥ 2 and m = 1, the probability that clients obtain the plaintext prediction y^{(L)} from the encrypted prediction ŷ^{(L)} is less than 1. When n_L ≥ 2 and 2 ≤ m ≤ n_L, the corresponding probability is less than or equal to 1/m.

Proof. When n_L = 1, the prediction is one-dimensional, denoted as ŷ^{(L)} = y^{(L)} + αγr^{(a)}, which usually represents regression tasks. Since γ is chosen randomly by the server, clients cannot learn the plaintext prediction y^{(L)}.

When n_L ≥ 2, the prediction is multi-dimensional, denoted as ŷ^{(L)} = y^{(L)} + α γ ∘ r^{(a)}, which usually represents classification tasks. In this case, if m = 1, then γ satisfies γ_1 = γ_2 = ⋯ = γ_{n_L} = γ, and ŷ_i^{(L)} = y_i^{(L)} + αγ r_i^{(a)} for i = 1, 2, ..., n_L. Since r_1^{(a)}, r_2^{(a)}, ..., r_{n_L}^{(a)} are pairwise distinct and γ is randomly selected by the server, the client cannot determine the largest entry among y_1^{(L)}, y_2^{(L)}, ..., y_{n_L}^{(L)}. Hence the probability that clients obtain the plaintext prediction is obviously less than 1.

If 2 ≤ m ≤ n_L, then γ satisfies γ_i = γ_{I_s} for 1 ≤ s ≤ m and i ∈ I_s. Let s′ ∈ I_s be such that y_{s′}^{(L)} = max_{i ∈ I_s} {y_i^{(L)}}; then max_{1 ≤ i ≤ n_L} {y_i^{(L)}} = max {y_{1′}^{(L)}, y_{2′}^{(L)}, ..., y_{m′}^{(L)}}, i.e., the maximal entry among y_1^{(L)}, ..., y_{n_L}^{(L)} equals the maximal one among y_{1′}^{(L)}, ..., y_{m′}^{(L)}. Hence the probability that clients obtain the plaintext prediction is at most the probability that they identify the maximal one among y_{1′}^{(L)}, ..., y_{m′}^{(L)}. Concretely, the noisy prediction vector (ŷ_{1′}^{(L)}, ..., ŷ_{m′}^{(L)}) and the plaintext prediction vector (y_{1′}^{(L)}, ..., y_{m′}^{(L)}) satisfy the following m equations:

    \hat{y}_{1'}^{(L)} = y_{1'}^{(L)} + \alpha \gamma_{I_1} r_{1'}^{(a)}, \quad
    \hat{y}_{2'}^{(L)} = y_{2'}^{(L)} + \alpha \gamma_{I_2} r_{2'}^{(a)}, \quad \ldots, \quad
    \hat{y}_{m'}^{(L)} = y_{m'}^{(L)} + \alpha \gamma_{I_m} r_{m'}^{(a)}.   (28)

Note that (ŷ_{1′}^{(L)}, ..., ŷ_{m′}^{(L)}) and α are known to the clients, and γ_{I_1}, γ_{I_2}, ..., γ_{I_m} are chosen independently at random by the server. Thus, for any m-tuple (y_{1′}^{(L)}, ..., y_{m′}^{(L)}) ∈ R^m, there always exists an m-tuple (γ_{I_1}, ..., γ_{I_m}) satisfying Eq. (28). Hence the client cannot obtain any information about the plaintext prediction vector (y_{1′}^{(L)}, ..., y_{m′}^{(L)}), so the probability that the clients identify the maximal one among y_{1′}^{(L)}, ..., y_{m′}^{(L)} is equal to 1/m. Therefore, the probability that clients obtain the plaintext prediction y^{(L)} from the encrypted prediction ŷ^{(L)} is less than or equal to 1/m.

According to Theorem 5, clients cannot obtain the correct prediction y^{(L)} of a given sample feature x. Hence, a semi-trusted client cannot leverage the predictions to help infer sensitive information about other clients' training data.
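A toy illustration of the m = 1 case: even though r^{(a)} and α are public, the secret γ can flip which masked coordinate is largest, so the masked vector does not reveal the true argmax. The numbers below are arbitrary, chosen only to make the flip visible.

```python
import numpy as np

y = np.array([1.0, 2.0])          # true logits: class 1 wins
ra = np.array([5.0, 1.0])         # public additive key r^(a), pairwise distinct
alpha, gamma = 1.0, 3.0           # gamma is the server's secret (m = 1 case)

yh = y + alpha * gamma * ra       # the encrypted prediction the client sees

assert np.argmax(y) == 1          # plaintext winner is class 1
assert np.argmax(yh) == 0         # masked winner is class 0: argmax is hidden
```

With another γ the masked argmax could agree with the true one, which is exactly why a client observing only ŷ^{(L)} cannot decide which case it is in.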
C. Confidentiality of local gradients
In this section, we show that our proposed scheme prevents the semi-trusted server from obtaining the true local gradients of each client, i.e., it protects the local training data of each client. Specifically, as stated in Section III-B, the server needs to recover the accurate model to ensure model accuracy. Hence, we ensure that the server can only obtain the aggregated gradients rather than the gradients of each individual client.
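The one-time-pad intuition behind the theorem below can be illustrated in a few lines: a masked gradient is consistent with every possible plaintext gradient, so the observation alone carries no information about any individual client. Shapes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
g = rng.standard_normal((2, 2))         # a client's true (encrypted) gradient
phi = rng.standard_normal((2, 2))       # its one-time additive mask
observed = g + phi                      # what the server actually sees

# Any alternative gradient g2 is equally consistent with the observation:
# the pair (g2, observed - g2) explains it just as well as (g, phi).
g2 = rng.standard_normal((2, 2))
phi2 = observed - g2
assert np.allclose(g2 + phi2, observed)
```

Since the mask is drawn fresh per iteration and never reused, the server cannot narrow down the infinite set of consistent (gradient, mask) pairs.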
Theorem 6.
The proposed local gradients perturbation ensures that the probability of the semi-trusted server obtaining the plaintext local gradients of any client is negligible.

Proof.
Similar to [27], because the random number matrices {φ_1^{(l)}, ..., φ_K^{(l)}}_{l=1}^{L}, {ϕ_1^{(l)}, ..., ϕ_K^{(l)}}_{l=1}^{L} and {ψ_1^{(l)}, ..., ψ_K^{(l)}}_{l=1}^{L} added to the encrypted local gradients are uniformly sampled from the real space, the values ∇F̂(Ŵ^{(l)}, D_k), σ̂_k^{(l)} and β̂_k^{(l)} appear uniformly random to the server. Specifically, given ∇F̂(Ŵ^{(l)}, D_k), there exist infinitely many pairs (∇F(Ŵ^{(l)}, D_k), φ_k^{(l)}) satisfying ∇F̂(Ŵ^{(l)}, D_k) = ∇F(Ŵ^{(l)}, D_k) + φ_k^{(l)}. Hence, the probability of identifying the correct solution among the infinitely many solutions consistent with ∇F̂(Ŵ^{(l)}, D_k) is essentially zero. That is, the server cannot identify the encrypted local gradients ∇F(Ŵ^{(l)}, D_k) of the client C_k. Similarly, the server cannot identify the individual additional noisy terms {σ_k^{(l)}, β_k^{(l)}}. Therefore, the probability of recovering the plaintext local gradients ∇F(W^{(l)}, D_k) of the client C_k is negligible even with the private key.

Consequently, it is almost impossible for the server to obtain the plaintext local gradients of each client, let alone the local training data. Note that in cross-silo FL [4], clients are different organizations (e.g., medical or financial), the network connection is relatively stable, and the network bandwidth is relatively large. Thus, we can neglect stragglers that fail to return model updates to the server.

VII. PERFORMANCE EVALUATION
In this section, we empirically evaluate our method on real-world datasets in terms of learning accuracy and efficiency , andadopt the state-of-the-art plain training (i.e., FedAvg [3]) and HE-based training (i.e., BatchCrypt [9]) for comparison.
A. Experimental Setup
We implement our scheme based on the native network layers in PyTorch [28], running on a single Tesla M40 GPU. In all experiments, the training epochs and the batch size of each client are set to and , respectively. Since different values of m hardly affect the correctness of decryption (i.e., they do not affect the learning accuracy), we directly set m = 1 for the experiments on learning accuracy. For the efficiency experiments, we discuss computational and communication costs in terms of m.

Datasets and Metrics. We evaluate performance on two privacy-sensitive datasets covering bank and government scenarios, plus one commonly adopted benchmark dataset.

• UCI Bank Marketing Dataset (UBMD) [29] is related to direct marketing campaigns of a Portuguese banking institution and aims to predict the possibility of clients subscribing to deposits. It contains instances of -dimensional bank data. Following conventional practice, we split the dataset into training/validation/test sets.

• San Francisco Crime Classification (SFCC) contains incidents derived from the San Francisco Police Department Crime Incident Reporting system. SFCC provides training data and test data for the classification of 39 crime types. For each data item, 7 types of information are given, e.g., the date of the incident, the address of the crime incident, or the name of the police station.

• MNIST [30] provides black-and-white hand-written digit images. It contains 60000 training images and 10000 test images, divided into 10 classes for the digits 0 to 9. We resize the images to × .

B. Learning Accuracy on Regression and Classification
We evaluate the learning accuracy of our scheme against theFedAvg [3] and the BatchCrypt [9] on both regression andclassification tasks.
1) Regression:
We evaluate the performance of the three schemes on regression tasks with UBMD for different numbers of layers L ∈ {3, 5, 7} and different numbers of clients K. Table I reports the results of the three schemes for the final converged models on the test sets. From the table, the accuracy of our method closely matches that of FedAvg under all settings, which verifies that our model encryption method does not reduce learning accuracy compared with plaintext DNN model training. Besides, our method compares favorably against BatchCrypt in most cases, since BatchCrypt needs to pre-process the gradients with quantization, which decreases precision.
2) Classification:
In this section, we evaluate our method on the classification task with the SFCC and MNIST datasets for L ∈ {5, 7, 9} and different numbers of clients K. The accuracy of the converged models on the test sets is shown in Tables II and III, respectively. As with the regression tasks, the accuracy of our method closely matches that of FedAvg. Besides, as the number of model parameters grows (i.e., as L increases), the advantage of our method over BatchCrypt becomes even more evident for classification tasks.

C. Communication and Computation
In this section, we compare the three schemes in terms of computational and communication costs. The experiments are conducted on a -layer neural network with an input size of × and a batch size of . For a comprehensive discussion, we also analyze the impact of the number of private-key partitions m on the computational and communication costs of our scheme.
1) Computational Costs:
Table IV provides a detailed comparison of computational costs, where ME, LMT-FP, LMT-BP and MU stand for global model encryption, the forward propagation of local model training, the backward propagation of local model training, and global model update, respectively. From the table, we can see that, compared to FedAvg, the increased computational cost of our scheme is mainly caused by LMT-BP on the client side. More specifically, as described in Section IV-B2, in addition to ∂L̂(Ŵ;(x_i, ȳ_i))/∂Ŵ^{(l)}, the client needs to compute two additional noisy items σ^{(l)}(x_i, ȳ_i) and β^{(l)}(x_i, ȳ_i) for each sample (x_i, ȳ_i), which increases the computational cost of our scheme. Besides, the increased computational cost of our scheme on
TABLE I: MSE results for regression tasks on UBMD (lower MSE means better performance), comparing FedAvg [3], BatchCrypt [9] and our scheme for L = 3, L = 5 and L = 7 with different numbers of clients. Some operations (e.g., dividing by random vectors) that are unavoidable in model training may cause precision errors, but the corresponding effects are negligible (see the results for FedAvg and our scheme).
TABLE II: Accuracy results for the classification task (SFCC dataset), comparing FedAvg [3], BatchCrypt [9] and our scheme for L = 5, L = 7 and L = 9 with different numbers of clients.

TABLE III: Accuracy results for the classification task (MNIST dataset), comparing FedAvg [3], BatchCrypt [9] and our scheme for L = 5, L = 7 and L = 9 with different numbers of clients.

TABLE IV: Crypto-related computation time (in seconds) for 100 epochs of the three approaches on MNIST.

                      FedAvg                              BatchCrypt                                 Ours
          K   ME  LMT-FP   LMT-BP      MU    Total   ME  LMT-FP    LMT-BP      MU     Total   ME     LMT-FP  LMT-BP      MU     Total
 Client   1    0   82.17  1749.66       0  1831.83    0   80.42  12143.74       0  12224.16    0      84.54  2372.19       0   2456.73
          5    0   87.42  1842.63       0  1930.05    0   86.72  12218.91       0  12305.63    0      88.26  2461.32       0   2549.58
         10    0   89.31  1695.80       0  1785.11    0   88.48  12515.94       0  12604.42    0      84.27  2486.13       0   2570.40
         20    0   88.53  1762.56       0  1851.09    0   82.93  12609.53       0  12692.46    0      86.31  2502.46       0   2588.77
 Server   1    0    0        0    1633.27  1633.27    0    0        0     6286.24   6286.24   92.48    0        0    1872.47   1964.95
          5    0    0        0    1724.53  1724.53    0    0        0     6319.46   6319.46   95.36    0        0    1894.31   1989.67
         10    0    0        0    1755.21  1755.21    0    0        0     6401.23   6401.23   94.84    0        0    1929.73   2024.57
         20    0    0        0    1796.42  1796.42    0    0        0     6488.71   6488.71   95.77    0        0    1982.67   2078.44
For convenience, the time of local gradients encryption in BatchCrypt is included in LMT-BP.

the server side is basically negligible, indicating that our model encryption method is very efficient in practice.

Besides, we compare the total training time in Fig. 5. Specifically, for a more comprehensive comparison, we provide the results of our scheme for three settings of m, from m = 1 up to m = n_L (where n_L is the number of label classes), which stand for the cases of minimum, medium and full privacy protection according to Theorem 5. From Fig. 5, we can see that, due to the stronger privacy preservation, the training time of our scheme increases as m increases. However, the training time of our method is quite close to that of FedAvg even with m = n_L. This follows from the fact that m only linearly affects the number of additions, whose computational cost is negligible compared to other operations (e.g., the forward and backward propagation of neural networks). Furthermore, our scheme significantly outperforms BatchCrypt. It is worth noting that in our scheme the clients train the model on the encrypted domain. In contrast, in BatchCrypt the server performs the homomorphic addition and multiplication operations while the clients train the model in the plaintext domain. Based on the introduction in Section II-A, the most complex and time-consuming process is local training; hence, the training time of BatchCrypt would increase dramatically if training were performed on the encrypted domain.

Fig. 5: Training time comparison between different approaches as the training epoch increases.
2) Communication Overhead:
In our scheme, the communication overhead mainly comes from two interactions: 1) the server sends the noisy parameters Ŵ together with the noisy vector r^{(a)} to the clients, and 2) each client returns the local noisy gradients {∇F̂(Ŵ^{(l)}, D_k)}_{l=1}^{L} together with the m + 1 noisy items {σ̃_{k,1}^{(l)}, σ̃_{k,2}^{(l)}, ..., σ̃_{k,m}^{(l)}, β̂_k^{(l)}}_{l=1}^{L}. Obviously, compared to FedAvg, the added communication costs are r^{(a)} and the noisy terms, where the cost of r^{(a)} is negligible. Therefore, the additional communication is theoretically O((m + 1)|W|), where |W| is the size of the model parameters. The experiments also confirm this theoretical result. For example, when m = 1, both the server-to-client and client-to-server overheads in FedAvg are . KB, while the server-to-client and client-to-server overheads in our method are . KB and . KB, respectively.

In summary, in order to achieve privacy preservation, we incur a certain amount of extra computational and communication cost. Nonetheless, we keep this additional cost at a constant level without decreasing model accuracy compared to the original FL.
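The O((m + 1)|W|) accounting above can be sketched as follows, assuming each σ̃ item and the β̂ item are the size of a layer's gradient. The layer shapes and the value of m are hypothetical, not the paper's measured configuration.

```python
# Hypothetical layer shapes for an L = 3 fully-connected network.
layer_shapes = [(128, 784), (128, 128), (10, 128)]
m = 10                                    # number of key-partition blocks

W = sum(rows * cols for rows, cols in layer_shapes)  # |W|: parameter count
fedavg_upload = W                         # FedAvg: masked gradients only
ours_upload = W + (m + 1) * W             # ours: gradients + m sigma terms + beta

extra = ours_upload - fedavg_upload
assert extra == (m + 1) * W               # additional cost is O((m + 1)|W|)
```

Since m is at most n_L (the number of classes), the overhead stays a small constant multiple of the model size rather than growing with the number of clients.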
VIII. RELATED WORK
Federated learning was formally introduced by Google in 2016 [31] to address data privacy in machine learning. However, recent works [10], [11] showed that the original FL schemes still face the risk of privacy leakage. Therefore, many cryptographic technologies, such as secure multi-party computation, homomorphic encryption and differential privacy, have been proposed to address privacy risks in FL. Accordingly, we briefly discuss their suitability for cross-silo FL.
Differential Privacy (DP) [32] is a common tool for protecting the privacy of individual data samples by adding independent DP noise to the local gradients of clients. Due to its high efficiency and easy integration, DP has been widely applied in cross-device FL [33]–[35]. Unfortunately, DP sacrifices model accuracy in exchange for privacy [36], which is not suitable for cross-silo FL, where high model accuracy is required [4], [9].
Secure Multi-Party Computation (MPC) [37] enables multiple parties to collaboratively compute an agreed-upon function over their private inputs such that only the intended output is revealed to each party. This technique combines carefully designed computations and protocols, and can be applied to traditional machine learning algorithms such as k-means clustering [38]. However, as introduced in [9], such protocols between clients are difficult to implement efficiently in the geo-distributed setting of cross-silo FL [2].
Homomorphic Encryption (HE) [39] allows certain computations (e.g., addition and multiplication) to be performed directly on ciphertexts without decrypting them first. Meanwhile, HE can guarantee privacy without sacrificing model accuracy. Thus, this technique was considered very promising for cross-silo FL, where it allows clients to encrypt local gradients [8], [14], [20]. However, complex cryptographic operations (e.g., modular multiplication and exponentiation) cause heavy computational costs, making HE far from practical. Recently, BatchCrypt [9] was proposed to alleviate the computation cost by encoding a batch of quantized gradients into one ciphertext rather than encrypting them one by one. Although it reduces computational costs, it still introduces non-negligible computation costs compared with plain training approaches. Moreover, a recent survey [4] showed that many HE-based schemes cannot guarantee comprehensive privacy, as they cannot prevent curious clients from obtaining private data. As stated in [2], [4], the direct idea is to let clients train models on the encrypted domain. Unfortunately, HE cannot be elegantly applied to deep learning training, as it cannot directly handle non-linear activation functions on the encrypted model and incurs heavy training costs.

Thus, we present a model encryption method that guarantees privacy without sacrificing accuracy, is easy to implement in practice, and ensures high computational efficiency.
IX. CONCLUSION
In this paper, we presented a computation-efficient deep model training scheme for ciphertext-based cross-silo federated learning, which aims to guarantee comprehensive privacy under the semi-trusted security model. We customized a novel one-time-pad-style model encryption method that directly supports non-linear activation functions and decimal arithmetic operations on the encrypted domain, thereby allowing clients to train deep neural networks smoothly on the encrypted model. Furthermore, we combined our model encryption method with the secret sharing technique to ensure that neither the server nor the clients can obtain the local training data of others. Extensive experiments demonstrate that the model accuracy and computational costs of our scheme are almost the same as those of plain training, and better than those of the state-of-the-art HE-based scheme. Future research includes applying our model encryption method to CNNs with the cross-entropy loss, which is more suitable for classification tasks.
APPENDIX A
PROOF OF THEOREM 1

Proof.
Based on Eq. (13), we can deduce that

    R^{(1)} = D_{r^{(1)}} E^{(1)}, \quad
    R^{(L)} D_{r^{(L-1)}} = E^{(L)}, \quad
    R^{(l)} D_{r^{(l-1)}} = D_{r^{(l)}} E^{(l)} \ \text{for } 2 \le l \le L-1,

where D_{r^{(l)}} is the n_l × n_l diagonal matrix whose main diagonal is r^{(l)}, and E^{(l)} is the n_l × n_{l-1} matrix whose entries are all 1s, for l = 1, 2, ..., L.

Next, we prove Eq. (16) by induction. Specifically, when l = 1, the above equations give

    \hat{y}^{(1)} = \mathrm{ReLU}\big(\widehat{W}^{(1)} x\big)
    = \mathrm{ReLU}\big((R^{(1)} \circ W^{(1)}) x\big)
    = \mathrm{ReLU}\big((D_{r^{(1)}} E^{(1)} \circ W^{(1)}) x\big)
    \overset{(a)}{=} \mathrm{ReLU}\big(D_{r^{(1)}} (E^{(1)} \circ W^{(1)}) x\big)
    = \mathrm{ReLU}\big(D_{r^{(1)}} W^{(1)} x\big)
    \overset{(b)}{=} \mathrm{ReLU}\big(r^{(1)} \circ (W^{(1)} x)\big)
    \overset{(c)}{=} r^{(1)} \circ \mathrm{ReLU}\big(W^{(1)} x\big)
    = r^{(1)} \circ y^{(1)},

where (a) and (b) follow from the properties of the Hadamard product (see Definition 1), and (c) follows from r^{(1)} ∈ R^{n_1}_{>0}. Then, for 2 ≤ l ≤ L−1, assume ŷ^{(l-1)} = r^{(l-1)} ∘ y^{(l-1)} by induction. We have

    \hat{y}^{(l)} = \mathrm{ReLU}\big(\widehat{W}^{(l)} \hat{y}^{(l-1)}\big)
    = \mathrm{ReLU}\big((R^{(l)} \circ W^{(l)})(r^{(l-1)} \circ y^{(l-1)})\big)
    = \mathrm{ReLU}\big((R^{(l)} \circ W^{(l)}) D_{r^{(l-1)}} y^{(l-1)}\big)
    = \mathrm{ReLU}\big(((R^{(l)} D_{r^{(l-1)}}) \circ W^{(l)}) y^{(l-1)}\big)
    = \mathrm{ReLU}\big((D_{r^{(l)}} E^{(l)} \circ W^{(l)}) y^{(l-1)}\big)
    = \mathrm{ReLU}\big(D_{r^{(l)}} (E^{(l)} \circ W^{(l)}) y^{(l-1)}\big)
    = \mathrm{ReLU}\big(D_{r^{(l)}} W^{(l)} y^{(l-1)}\big)
    = \mathrm{ReLU}\big(r^{(l)} \circ (W^{(l)} y^{(l-1)})\big)
    = r^{(l)} \circ \mathrm{ReLU}\big(W^{(l)} y^{(l-1)}\big)
    = r^{(l)} \circ y^{(l)}.
Then, we prove Eq. (17) as follows:

    \hat{y}^{(L)} = \widehat{W}^{(L)} \hat{y}^{(L-1)}
    = \big(R^{(L)} \circ W^{(L)} + R^{(a)}\big) \hat{y}^{(L-1)}
    = \big(R^{(L)} \circ W^{(L)}\big)\big(r^{(L-1)} \circ y^{(L-1)}\big) + R^{(a)} \hat{y}^{(L-1)}
    = \big((R^{(L)} D_{r^{(L-1)}}) \circ W^{(L)}\big) y^{(L-1)} + R^{(a)} \hat{y}^{(L-1)}
    = W^{(L)} y^{(L-1)} + R^{(a)} \hat{y}^{(L-1)}
    = y^{(L)} + \alpha\, \gamma \circ r^{(a)}
    = y^{(L)} + \alpha\, \mathbf{r}.

APPENDIX B
PROOF OF THEOREM 2

Proof.
Before giving the proof, we recall some notation for derivatives of vector-valued functions. For any vectors $f \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$, the partial derivative $\frac{\partial f}{\partial x}$ of $f$ with respect to $x$ is the $m \times n$ matrix whose $(i, j)$-th entry is
$$
\Big(\frac{\partial f}{\partial x}\Big)_{ij} = \frac{\partial f_i}{\partial x_j}.
$$
Moreover, when $x$ is a $u \times v$ matrix, we can regard $x$ as a vector in $\mathbb{R}^{uv}$, so $\frac{\partial f}{\partial x}$ remains well-defined. We now prove Theorem 2 in detail. Based on Eq. (18), we have
$$
\frac{\partial \widehat{\mathcal{L}}\big(\widehat{W}; (x, \bar{y})\big)}{\partial \hat{y}^{(L)}} = \big(\hat{y}^{(L)} - \bar{y}\big)^T = \frac{\partial \mathcal{L}(W; (x, \bar{y}))}{\partial y^{(L)}} + \alpha r^T.
$$
Then, by the chain rule, we can derive that
$$
\begin{aligned}
\frac{\partial \widehat{\mathcal{L}}(\widehat{W}; (x, \bar{y}))}{\partial \widehat{W}^{(l)}}
&= \frac{\partial \widehat{\mathcal{L}}(\widehat{W}; (x, \bar{y}))}{\partial \hat{y}^{(L)}} \frac{\partial \hat{y}^{(L)}}{\partial \widehat{W}^{(l)}}
= \Big(\frac{\partial \mathcal{L}(W; (x, \bar{y}))}{\partial y^{(L)}} + \alpha r^T\Big) \frac{\partial \hat{y}^{(L)}}{\partial \widehat{W}^{(l)}} \\
&= \frac{\partial \mathcal{L}(W; (x, \bar{y}))}{\partial y^{(L)}} \frac{\partial \hat{y}^{(L)}}{\partial \widehat{W}^{(l)}} + \alpha r^T \frac{\partial \hat{y}^{(L)}}{\partial \widehat{W}^{(l)}} \\
&= \frac{\partial \mathcal{L}(W; (x, \bar{y}))}{\partial y^{(L)}} \Big(\frac{\partial y^{(L)}}{\partial \widehat{W}^{(l)}} + \frac{\partial (\alpha r)}{\partial \widehat{W}^{(l)}}\Big) + \alpha r^T \frac{\partial \hat{y}^{(L)}}{\partial \widehat{W}^{(l)}} \\
&= \frac{\partial \mathcal{L}(W; (x, \bar{y}))}{\partial \widehat{W}^{(l)}} + \alpha r^T \frac{\partial \hat{y}^{(L)}}{\partial \widehat{W}^{(l)}} + \Big(\frac{\partial \widehat{\mathcal{L}}(\widehat{W}; (x, \bar{y}))}{\partial \hat{y}^{(L)}} - \alpha r^T\Big) \frac{\partial (\alpha r)}{\partial \widehat{W}^{(l)}} \\
&\overset{(a)}{=} \frac{\partial \mathcal{L}(W; (x, \bar{y}))}{\partial \widehat{W}^{(l)}} - \upsilon \alpha \frac{\partial \alpha}{\partial \widehat{W}^{(l)}} + r^T \Bigg(\alpha \frac{\partial \hat{y}^{(L)}}{\partial \widehat{W}^{(l)}} + \Big(\frac{\partial \widehat{\mathcal{L}}(\widehat{W}; (x, \bar{y}))}{\partial \hat{y}^{(L)}}\Big)^T \frac{\partial \alpha}{\partial \widehat{W}^{(l)}}\Bigg) \\
&= \frac{1}{R^{(l)}} \circ \frac{\partial \mathcal{L}(W; (x, \bar{y}))}{\partial W^{(l)}} + r^T \sigma^{(l)} - \upsilon \beta^{(l)},
\end{aligned}
$$
where $\frac{1}{R^{(l)}}$ is the $n_l \times n_{l-1}$ matrix whose $(i, j)$-th entry is $\frac{1}{R^{(l)}_{ij}}$, $\sigma^{(l)} = \alpha \frac{\partial \hat{y}^{(L)}}{\partial \widehat{W}^{(l)}} + \big(\frac{\partial \widehat{\mathcal{L}}(\widehat{W}; (x, \bar{y}))}{\partial \hat{y}^{(L)}}\big)^T \frac{\partial \alpha}{\partial \widehat{W}^{(l)}}$, $\upsilon = r^T r$, and $\beta^{(l)} = \alpha \frac{\partial \alpha}{\partial \widehat{W}^{(l)}}$, and in $(a)$ we use the fact that $\frac{\partial \widehat{\mathcal{L}}(\widehat{W}; (x, \bar{y}))}{\partial \hat{y}^{(L)}} \, r = r^T \big(\frac{\partial \widehat{\mathcal{L}}(\widehat{W}; (x, \bar{y}))}{\partial \hat{y}^{(L)}}\big)^T \in \mathbb{R}$.

APPENDIX C
PROOF OF THEOREM 3

Proof.
Before proving Eq. (23), we first verify the correctness of Eq. (22): for $l = 1, 2, \ldots, L$,
$$
\begin{aligned}
\nabla F(\widehat{W}^{(l)}) &= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \nabla \widehat{F}(\widehat{W}^{(l)}, \mathcal{D}_k)
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \big(\nabla F(\widehat{W}^{(l)}, \mathcal{D}_k) + \phi^{(l)}_k\big) \\
&\overset{(a)}{=} \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \nabla F(\widehat{W}^{(l)}, \mathcal{D}_k) + \mathbf{0}_{n_l \times n_{l-1}}
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \nabla F(\widehat{W}^{(l)}, \mathcal{D}_k),
\end{aligned}
$$
where $(a)$ follows from the condition that $\sum_{k=1}^{K} |\mathcal{D}_k| \phi^{(l)}_k = \mathbf{0}_{n_l \times n_{l-1}}$ for $l = 1, 2, \ldots, L$. Similarly, we can deduce $\widehat{\sigma}^{(l)}$ and $\widehat{\beta}^{(l)}$ as:
$$
\widehat{\sigma}^{(l)} = \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \widehat{\sigma}^{(l)}_k
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \big(\sigma^{(l)}_k + \varphi^{(l)}_k\big)
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \sigma^{(l)}_k + \mathbf{0}_{n_l \times n_{l-1}}
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \sigma^{(l)}_k,
$$
$$
\widehat{\beta}^{(l)} = \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \widehat{\beta}^{(l)}_k
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \big(\beta^{(l)}_k + \psi^{(l)}_k\big)
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \beta^{(l)}_k + \mathbf{0}_{n_l \times n_{l-1}}
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \beta^{(l)}_k,
$$
and then
$$
\sum_{s=1}^{m} \gamma_{I_s} \widetilde{\sigma}^{(l)}_s
= \sum_{s=1}^{m} \gamma_{I_s} \Bigg(\sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \widetilde{\sigma}^{(l)}_{k,s}\Bigg)
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \Bigg(\sum_{s=1}^{m} \gamma_{I_s} \big(r^{(a)}|_{I_s}\big)^T \widehat{\sigma}^{(l)}_k|_{I_s}\Bigg)
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} r^T \widehat{\sigma}^{(l)}_k
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} r^T \sigma^{(l)}_k.
$$
In addition, according to Theorem 2, we can derive that
$$
\begin{aligned}
\nabla F(\widehat{W}^{(l)}, \mathcal{D}_k)
&= \frac{1}{|\mathcal{D}_k|} \sum_{(x_i, \bar{y}_i) \in \mathcal{D}_k} \frac{\partial \widehat{\mathcal{L}}\big(\widehat{W}; (x_i, \bar{y}_i)\big)}{\partial \widehat{W}^{(l)}} \\
&= \frac{1}{|\mathcal{D}_k|} \sum_{(x_i, \bar{y}_i) \in \mathcal{D}_k} \Bigg(\frac{1}{R^{(l)}} \circ \frac{\partial \mathcal{L}(W; (x_i, \bar{y}_i))}{\partial W^{(l)}} + r^T \sigma^{(l)}(x_i, \bar{y}_i) - \upsilon \beta^{(l)}(x_i, \bar{y}_i)\Bigg) \\
&= \frac{1}{R^{(l)}} \circ \nabla F(W^{(l)}, \mathcal{D}_k) + r^T \sigma^{(l)}_k - \upsilon \beta^{(l)}_k.
\end{aligned}
$$
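The leading $\frac{1}{R^{(l)}} \circ$ factor used above comes from the chain rule through the Hadamard-masked weights. A minimal numerical sketch, assuming a single masked linear layer with squared loss and no output noise (so the $\sigma$ and $\beta$ correction terms vanish), checks it by central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)

# One masked layer: W_hat = R ∘ W, squared loss, no output noise,
# so the gradient relation reduces to dL/dW_hat = (1/R) ∘ dL/dW.
W = rng.standard_normal((3, 4))
R = rng.uniform(0.1, 2.0, (3, 4))        # positive Hadamard mask
W_hat = R * W
x = rng.standard_normal(4)
y_bar = rng.standard_normal(3)

def loss_hat(M):
    # Loss of the unmasked network, evaluated through the masked weights M.
    return 0.5 * np.sum(((M / R) @ x - y_bar) ** 2)

grad_W = np.outer(W @ x - y_bar, x)      # analytic dL/dW at the plain weights

# Central finite differences of loss_hat at W_hat.
eps = 1e-6
grad_hat = np.zeros_like(W_hat)
for i in range(3):
    for j in range(4):
        d = np.zeros_like(W_hat)
        d[i, j] = eps
        grad_hat[i, j] = (loss_hat(W_hat + d) - loss_hat(W_hat - d)) / (2 * eps)

assert np.allclose(grad_hat, (1.0 / R) * grad_W, atol=1e-5)
```

Since the loss is quadratic in the weights, the central difference is exact up to rounding, which makes the elementwise-reciprocal relation easy to see numerically.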
Thus, we can obtain that
$$
\begin{aligned}
\nabla F(W^{(l)}) &= R^{(l)} \circ \Bigg(\nabla F(\widehat{W}^{(l)}) - \Bigg(\sum_{s=1}^{m} \gamma_{I_s} \widetilde{\sigma}^{(l)}_s\Bigg) + \upsilon \widehat{\beta}^{(l)}\Bigg) \\
&= R^{(l)} \circ \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \Big(\nabla F(\widehat{W}^{(l)}, \mathcal{D}_k) - r^T \sigma^{(l)}_k + \upsilon \beta^{(l)}_k\Big) \\
&= R^{(l)} \circ \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \Bigg(\frac{1}{R^{(l)}} \circ \nabla F(W^{(l)}, \mathcal{D}_k) + r^T \sigma^{(l)}_k - \upsilon \beta^{(l)}_k - r^T \sigma^{(l)}_k + \upsilon \beta^{(l)}_k\Bigg) \\
&= R^{(l)} \circ \frac{1}{R^{(l)}} \circ \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \nabla F(W^{(l)}, \mathcal{D}_k)
= \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \nabla F(W^{(l)}, \mathcal{D}_k).
\end{aligned}
$$
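The conclusion of Appendix C can be sketched end-to-end: each client reports $\frac{1}{R^{(l)}} \circ \nabla F(W^{(l)}, \mathcal{D}_k)$ plus a zero-sum mask, and the server's elementwise multiplication by $R^{(l)}$ recovers the true weighted-average gradient. The following sketch uses hypothetical shapes and client sizes, and omits the $r^T\sigma$ and $\upsilon\beta$ correction terms (which cancel in the aggregate):

```python
import numpy as np

rng = np.random.default_rng(3)

K, shape = 3, (4, 6)
sizes = np.array([5.0, 15.0, 30.0])                 # |D_k| for each client
R = rng.uniform(0.1, 2.0, shape)                    # shared Hadamard mask

# Each client k holds a plain local gradient g_k but reports the masked
# version (1/R) ∘ g_k + phi_k, with zero-sum masks: sum_k |D_k| phi_k = 0.
g = [rng.standard_normal(shape) for _ in range(K)]
phi = [rng.standard_normal(shape) for _ in range(K - 1)]
phi.append(-sum(s * p for s, p in zip(sizes, phi)) / sizes[-1])
assert np.allclose(sum(s * p for s, p in zip(sizes, phi)), 0.0)

w = sizes / sizes.sum()                             # weights |D_k| / |D|
reported = [(1.0 / R) * gk + pk for gk, pk in zip(g, phi)]
server_agg = sum(wk * rk for wk, rk in zip(w, reported))

# Unmasking step: multiply elementwise by R to recover the true average.
recovered = R * server_agg
true_avg = sum(wk * gk for wk, gk in zip(w, g))
assert np.allclose(recovered, true_avg)
```

The recovery is exact rather than approximate: the masks cancel algebraically under the dataset-size weighting, mirroring the last chain of equalities in the proof.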