Provably Secure Federated Learning against Malicious Clients
Xiaoyu Cao, Jinyuan Jia, Neil Zhenqiang Gong
Duke University
{xiaoyu.cao, jinyuan.jia, neil.gong}@duke.edu

Abstract
Federated learning enables clients to collaboratively learn a shared global model without sharing their local training data with a cloud server. However, malicious clients can corrupt the global model to predict incorrect labels for testing examples. Existing defenses against malicious clients leverage Byzantine-robust federated learning methods. However, these methods cannot provably guarantee that the predicted label for a testing example is not affected by malicious clients. We bridge this gap via ensemble federated learning. In particular, given any base federated learning algorithm, we use the algorithm to learn multiple global models, each of which is learnt using a randomly selected subset of clients. When predicting the label of a testing example, we take a majority vote among the global models. We show that our ensemble federated learning with any base federated learning algorithm is provably secure against malicious clients. Specifically, the label predicted by our ensemble global model for a testing example is provably not affected by a bounded number of malicious clients. Moreover, we show that our derived bound is tight. We evaluate our method on MNIST and Human Activity Recognition datasets. For instance, our method can achieve a certified accuracy of 88% on MNIST when 20 out of 1,000 clients are malicious.
Introduction
Federated learning (Konečný et al. 2016; McMahan et al. 2017) is an emerging machine learning paradigm, which enables many clients (e.g., smartphones, IoT devices, and organizations) to collaboratively learn a model without sharing their local training data with a cloud server. Due to its promise for protecting privacy of the clients' local training data and the emerging privacy regulations such as General Data Protection Regulation (GDPR), federated learning has been deployed by industry. For instance, Google has deployed federated learning for next-word prediction on Android Gboard. Existing federated learning methods mainly follow a single-global-model paradigm. Specifically, a cloud server maintains a global model and each client maintains a local model. The global model is trained via multiple iterations of communications between the clients and server. In each iteration, three steps are performed: 1) the server sends the current global model to the clients; 2) the clients update their local models based on the global model and their local training data, and send the model updates to the server; and 3) the server aggregates the model updates and uses them to update the global model. The learnt global model is then used to predict labels of testing examples.

However, such a single-global-model paradigm is vulnerable to security attacks. In particular, an attacker can inject fake clients into federated learning or compromise existing clients, where we call the fake/compromised clients malicious clients. Such malicious clients can corrupt the global model via carefully tampering their local training data or model updates sent to the server. As a result, the corrupted global model has a low accuracy for the normal testing examples (Fang et al. 2020; Xie, Koyejo, and Gupta 2019) or certain attacker-chosen testing examples (Bagdasaryan et al. 2020; Bhagoji et al. 2019; Xie et al. 2020). For instance, when learning an image classifier, the malicious clients can re-label the cars with certain strips as birds in their local training data and scale up their model updates sent to the server, such that the learnt global model incorrectly predicts a car with the strips as bird (Bagdasaryan et al. 2020).

Various Byzantine-robust federated learning methods have been proposed to defend against malicious clients (Blanchard et al. 2017; Chen, Su, and Xu 2017; Mhamdi, Guerraoui, and Rouault 2018; Yin et al. 2018, 2019; Chen et al. 2018; Alistarh, Allen-Zhu, and Li 2018). The main idea of these methods is to mitigate the impact of statistical outliers among the clients' model updates. They can bound the difference between the global model parameters learnt without malicious clients and the global model parameters learnt when some clients become malicious. However, these methods cannot provably guarantee that the label predicted by the global model for a testing example is not affected by malicious clients. Indeed, studies showed that malicious clients can still substantially degrade the testing accuracy of a global model learnt by a Byzantine-robust method via carefully tampering their model updates sent to the server (Bhagoji et al. 2019; Fang et al. 2020; Xie, Koyejo, and Gupta 2019).

In this work, we propose ensemble federated learning, the first federated learning method that is provably secure against malicious clients. Specifically, given $n$ clients, we define a subsample as a set of $k$ clients sampled from the $n$ clients uniformly at random without replacement.
For each subsample, we can learn a global model using a base federated learning algorithm with the $k$ clients in the subsample. Since there are $\binom{n}{k}$ subsamples with $k$ clients, $\binom{n}{k}$ global models can be trained in total. Suppose we are given a testing example $x$. We define $p_i$ as the fraction of the $\binom{n}{k}$ global models that predict label $i$ for $x$, where $i = 1, 2, \cdots, L$. We call $p_i$ label probability. Our ensemble global model predicts the label with the largest label probability for $x$. In other words, our ensemble global model takes a majority vote among the global models to predict the label for $x$. Since each global model is learnt using a subsample with $k$ clients, a majority of the global models are learnt using normal clients when most clients are normal. Therefore, the majority vote among the global models is secure against a bounded number of malicious clients.

Theory:
Our first major theoretical result is that our ensemble global model provably predicts the same label for a testing example $x$ when the number of malicious clients is no larger than a threshold, which we call certified security level. Our second major theoretical result is that we prove our derived certified security level is tight, i.e., when no assumptions are made on the base federated learning algorithm, it is impossible to derive a certified security level that is larger than ours. Note that the certified security level may be different for different testing examples.

Algorithm:
Computing our certified security level for $x$ requires its largest and second largest label probabilities. When $\binom{n}{k}$ is small (e.g., the $n$ clients are dozens of organizations (Kairouz et al. 2019) and $k$ is small), we can compute the largest and second largest label probabilities exactly via training $\binom{n}{k}$ global models. However, it is challenging to compute them exactly when $\binom{n}{k}$ is large. To address the computational challenge, we develop a Monte Carlo algorithm to estimate them with probabilistic guarantees via training $N$ instead of $\binom{n}{k}$ global models.

Evaluation:
We empirically evaluate our method on MNIST (LeCun, Cortes, and Burges 1998) and Human Activity Recognition datasets (Anguita et al. 2013). We distribute the training examples in MNIST to clients to simulate federated learning scenarios, while the Human Activity Recognition dataset represents a real-world federated learning scenario, where each user is a client. We use the popular FedAvg developed by Google (McMahan et al. 2017) as the base federated learning algorithm. Moreover, we use certified accuracy as our evaluation metric, which is a lower bound of the testing accuracy that a method can provably achieve no matter how the malicious clients tamper their local training data and model updates. For instance, our ensemble FedAvg with $N = 500$ and $k = 10$ can achieve a certified accuracy of 88% on MNIST when evenly distributing the training examples among 1,000 clients and 20 of them are malicious.

In summary, our key contributions are as follows:
• Theory: We propose ensemble federated learning, the first provably secure federated learning method against malicious clients. We derive a certified security level for our ensemble federated learning. Moreover, we prove that our derived certified security level is tight.
• Algorithm: We propose a Monte Carlo algorithm to compute our certified security level in practice.
• Evaluation: We evaluate our methods on MNIST and Human Activity Recognition datasets.

All our proofs are shown in Supplemental Material.

Algorithm 1: Single-global-model federated learning
Input: $\mathcal{C}$, globalIter, localIter, $\eta$, Agg.
Output: Global model $w$.
  $w \leftarrow$ random initialization
  for $Iter_{global} = 1, 2, \cdots, globalIter$ do
    /* Step I */
    The server sends $w$ to the clients.
    /* Step II */
    for $i \in \mathcal{C}$ do
      $w_i \leftarrow w$
      for $Iter_{local} = 1, 2, \cdots, localIter$ do
        Sample a $Batch$ from local training data $D_i$.
        $w_i \leftarrow w_i - \eta \nabla Loss(Batch; w_i)$
      end for
      Send $g_i = w_i - w$ to the server.
    end for
    /* Step III */
    $g \leftarrow Agg(g_1, g_2, \cdots, g_{|\mathcal{C}|})$
    $w \leftarrow w + \eta \cdot g$
  end for
  return $w$
Background on Federated Learning
Assume we have $n$ clients $\mathcal{C} = \{1, 2, \cdots, n\}$ and a cloud server in a federated learning setting. The $i$th client holds a local training dataset $D_i$, where $i = 1, 2, \cdots, n$. Existing federated learning methods (Konečný et al. 2016; McMahan et al. 2017; Wang et al. 2020; Li et al. 2020b) mainly focus on learning a single global model for the $n$ clients. Specifically, the server maintains a global model and each client maintains a local model. Then, federated learning iteratively performs the following three steps, which are shown in Algorithm 1. In Step I, the server sends the current global model to the clients. (The server may select a subset of clients, but we assume the server sends the global model to all clients for convenience.) In Step II, each client trains a local model via fine-tuning the global model on its local training dataset. In particular, each client performs localIter iterations of stochastic gradient descent with a learning rate $\eta$ to train its local model. Then, each client sends its model update (i.e., the difference between the local model and the global model) to the server. In Step III, the server aggregates the clients' model updates according to some aggregation rule Agg and uses the aggregated model update to update the global model. The three steps are repeated for globalIter iterations. Existing federated learning algorithms essentially use different aggregation rules in Step III. For instance, Google developed FedAvg (McMahan et al. 2017), which computes the average of the clients' model updates weighted by the sizes of their local training datasets as the aggregated model update to update the global model.

We call such a federated learning algorithm that learns a single global model a base federated learning algorithm and denote it as $\mathcal{A}$. Note that given any subset of the $n$ clients $\mathcal{C}$, a base federated learning algorithm can learn a global model for them. Specifically, the server learns a global model via iteratively performing the three steps between the server and the given subset of clients.
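To make the three steps concrete, below is a minimal Python sketch of Algorithm 1 with FedAvg's weighted-average aggregation. The parameter vector stands in for the model; `grad` (the stochastic gradient of the local loss), the batch size, and the sign convention by which the server applies the aggregated update are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def fedavg(local_datasets, grad, dim, global_iter=100, local_iter=5, eta=0.01, seed=0):
    """Sketch of single-global-model federated learning (Algorithm 1).

    local_datasets: list of per-client training sets; grad(w, batch): assumed
    callable returning a stochastic gradient of the local loss at w.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)  # random initialization of the global model
    sizes = np.array([len(D) for D in local_datasets], dtype=float)
    for _ in range(global_iter):
        updates = []
        for D in local_datasets:           # Step I: server sends w to each client
            w_i = w.copy()
            for _ in range(local_iter):    # Step II: local SGD from the global model
                idx = rng.choice(len(D), min(32, len(D)), replace=False)
                batch = [D[j] for j in idx]
                w_i = w_i - eta * grad(w_i, batch)
            updates.append(w_i - w)        # model update g_i = w_i - w
        # Step III: FedAvg aggregates updates weighted by local dataset sizes
        g = np.average(np.stack(updates), axis=0, weights=sizes)
        w = w + eta * g                    # server applies the aggregated update
    return w
```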
Our Ensemble Federated Learning

Unlike single-global-model federated learning, our ensemble federated learning trains multiple global models, each of which is trained using the base algorithm $\mathcal{A}$ and a subsample with $k$ clients sampled from the $n$ clients uniformly at random without replacement. Among the $n$ clients $\mathcal{C}$, we have $\binom{n}{k}$ subsamples with $k$ clients. Therefore, $\binom{n}{k}$ global models can be trained in total if we train a global model using each subsample. For a given testing input $x$, these global models may predict different labels for it. We define $p_i$ as the fraction of the $\binom{n}{k}$ global models that predict label $i$ for $x$, where $i = 1, 2, \cdots, L$. We call $p_i$ label probability. Note that $p_i$ is an integer multiple of $1/\binom{n}{k}$, which we will leverage to derive a tight security guarantee of ensemble federated learning. Moreover, $p_i$ can also be viewed as the probability that a global model trained on a random subsample with $k$ clients predicts label $i$ for $x$. Our ensemble global model predicts the label with the largest label probability for $x$, i.e., we define:

$$h(\mathcal{C}, x) = \operatorname{argmax}_i \; p_i, \quad (1)$$

where $h$ is our ensemble global model and $h(\mathcal{C}, x)$ is the label that our ensemble global model predicts for $x$ when the ensemble global model is trained on clients $\mathcal{C}$.
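As a sketch, the ensemble prediction in Equation (1) can be written as an exhaustive majority vote when $\binom{n}{k}$ is small enough to enumerate; `train` (playing the role of the base algorithm $\mathcal{A}$) and `predict` are assumed callables.

```python
from itertools import combinations
from collections import Counter

def ensemble_predict(clients, k, train, predict, x):
    """Majority vote over all C(n, k) global models, one per size-k subsample.

    train(subsample) stands in for the base federated learning algorithm A;
    predict(model, x) returns the label a global model assigns to x. Only
    feasible when C(n, k) is small.
    """
    votes = Counter()
    for subsample in combinations(clients, k):
        model = train(list(subsample))
        votes[predict(model, x)] += 1
    # label probability p_i = votes[i] / C(n, k); h(C, x) = argmax_i p_i
    return votes.most_common(1)[0][0]
```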
Defining provable security guarantees against malicious clients: Suppose some of the $n$ clients $\mathcal{C}$ become malicious. These malicious clients can arbitrarily tamper their local training data and model updates sent to the server in each iteration of federated learning. We denote by $\mathcal{C}'$ the set of $n$ clients with malicious ones. Moreover, we denote by $M(\mathcal{C}')$ the number of malicious clients in $\mathcal{C}'$, e.g., $M(\mathcal{C}') = m$ means that $m$ clients are malicious. Note that we don't know which clients are malicious. For a testing example $x$, our goal is to show that our ensemble global model $h$ provably predicts the same label for $x$ when the number of malicious clients is bounded. Formally, we aim to show the following:

$$h(\mathcal{C}', x) = h(\mathcal{C}, x), \quad \forall \mathcal{C}', M(\mathcal{C}') \le m^*, \quad (2)$$

where $h(\mathcal{C}', x)$ is the label that the ensemble global model trained on the clients $\mathcal{C}'$ predicts for $x$. We call $m^*$ certified security level. When a global model satisfies Equation (2) for a testing example $x$, we say the global model achieves a provable security guarantee for $x$ with a certified security level $m^*$. Note that the certified security level may be different for different testing examples. Next, we derive the certified security level of our ensemble global model.
Deriving certified security level using exact label probabilities: Suppose we are given a testing example $x$. Assume that, when there are no malicious clients, our ensemble global model predicts label $y$ for $x$, $p_y$ is the largest label probability, and $p_z$ is the second largest label probability. Moreover, we denote by $p'_y$ and $p'_z$ respectively the label probabilities for $y$ and $z$ in the ensemble global model when there are malicious clients. Suppose $m$ clients become malicious. Then, a $1 - \binom{n-m}{k}/\binom{n}{k}$ fraction of subsamples with $k$ clients include at least one malicious client. In the worst-case scenario, for each global model learnt using a subsample including at least one malicious client, its predicted label for $x$ changes from $y$ to $z$. Therefore, in the worst-case scenario, the $m$ malicious clients decrease the largest label probability $p_y$ by $1 - \binom{n-m}{k}/\binom{n}{k}$ and increase the second largest label probability $p_z$ by $1 - \binom{n-m}{k}/\binom{n}{k}$, i.e., we have $p'_y = p_y - (1 - \binom{n-m}{k}/\binom{n}{k})$ and $p'_z = p_z + (1 - \binom{n-m}{k}/\binom{n}{k})$. Our ensemble global model still predicts label $y$ for $x$, i.e., $h(\mathcal{C}', x) = h(\mathcal{C}, x) = y$, once $m$ satisfies the following inequality:

$$p'_y > p'_z \iff p_y - p_z > 2\left(1 - \frac{\binom{n-m}{k}}{\binom{n}{k}}\right). \quad (3)$$

In other words, the largest integer $m$ that satisfies inequality (3) is our certified security level $m^*$ for the testing example $x$. Inequality (3) shows that our certified security level is related to the gap $p_y - p_z$ between the largest and second largest label probabilities in the ensemble global model trained on the clients $\mathcal{C}$ without malicious ones. For instance, when a testing example has a larger gap $p_y - p_z$, inequality (3) may be satisfied by a larger $m$, which means that our ensemble global model may have a larger certified security level for the testing example.
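When the exact label probabilities are available, the largest $m$ satisfying inequality (3) can be found by a simple linear scan; a minimal sketch, assuming $p_y \ge p_z$ are the exact top-two label probabilities:

```python
from math import comb

def certified_level_exact(p_y, p_z, n, k):
    """Largest integer m with p_y - p_z > 2 * (1 - C(n-m, k) / C(n, k)),
    i.e., inequality (3). Returns -1 if no m >= 0 qualifies."""
    m_star = -1
    for m in range(0, n - k + 1):
        if p_y - p_z > 2 * (1 - comb(n - m, k) / comb(n, k)):
            m_star = m
        else:
            break  # the right-hand side is non-decreasing in m, so stop early
    return m_star
```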
Deriving certified security level using approximate label probabilities: When $\binom{n}{k}$ is small (e.g., several hundred), we can compute the exact label probabilities $p_y$ and $p_z$ via training $\binom{n}{k}$ global models, and compute the certified security level via inequality (3). However, when $\binom{n}{k}$ is large, it is computationally challenging to compute the exact label probabilities via training $\binom{n}{k}$ global models. For instance, when $n = 100$ and $k = 10$, there are already about $1.7 \times 10^{13}$ global models, training all of which is computationally intractable in practice. Therefore, we also derive the certified security level using a lower bound $\underline{p_y}$ of $p_y$ (i.e., $\underline{p_y} \le p_y$) and an upper bound $\overline{p_z}$ of $p_z$ (i.e., $\overline{p_z} \ge p_z$). We use a lower bound of $p_y$ and an upper bound of $p_z$ because our certified security level is related to the gap $p_y - p_z$ and we aim to estimate a lower bound of the gap. The bounds $\underline{p_y}$ and $\overline{p_z}$ may be estimated by different methods. For instance, in the next section, we propose a Monte Carlo algorithm to estimate a lower bound $\underline{p_y}$ and an upper bound $\overline{p_z}$ via training only $N$ of the $\binom{n}{k}$ global models.

Next, we derive our certified security level based on the probability bounds $\underline{p_y}$ and $\overline{p_z}$. One way is to replace $p_y$ and $p_z$ in inequality (3) by $\underline{p_y}$ and $\overline{p_z}$, respectively. Formally, we have the following inequality:

$$\underline{p_y} - \overline{p_z} > 2\left(1 - \frac{\binom{n-m}{k}}{\binom{n}{k}}\right). \quad (4)$$

If an $m$ satisfies inequality (4), then the $m$ also satisfies inequality (3), because $\underline{p_y} - \overline{p_z} \le p_y - p_z$. Therefore, we can find the largest integer $m$ that satisfies inequality (4) as the certified security level $m^*$. However, we found that the certified security level $m^*$ derived based on inequality (4) is not tight, i.e., our ensemble global model may still predict label $y$ for $x$ even if the number of malicious clients is larger than the $m^*$ derived based on inequality (4). The key reason is that the label probabilities are integer multiples of $1/\binom{n}{k}$. Therefore, we normalize $\underline{p_y}$ and $\overline{p_z}$ to integer multiples of $1/\binom{n}{k}$ to derive a tight certified security level. Specifically, we derive the certified security level as the largest integer $m$ that satisfies the following inequality (formally described in Theorem 1):

$$\frac{\left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil}{\binom{n}{k}} - \frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} > 2 - \frac{2\binom{n-m}{k}}{\binom{n}{k}}. \quad (5)$$

Figure 1 illustrates the relationships between $p_y$, $\underline{p_y}$, and $\lceil \underline{p_y} \cdot \binom{n}{k} \rceil / \binom{n}{k}$, as well as $p_z$, $\overline{p_z}$, and $\lfloor \overline{p_z} \cdot \binom{n}{k} \rfloor / \binom{n}{k}$. When an $m$ satisfies inequality (4), the $m$ also satisfies inequality (5), because $\underline{p_y} - \overline{p_z} \le \lceil \underline{p_y} \cdot \binom{n}{k} \rceil / \binom{n}{k} - \lfloor \overline{p_z} \cdot \binom{n}{k} \rfloor / \binom{n}{k}$. Therefore, the certified security level derived based on inequality (4) is smaller than or equals the certified security level derived based on inequality (5). Note that when $\underline{p_y} = p_y$ and $\overline{p_z} = p_z$, both (4) and (5) reduce to (3), as the label probabilities are integer multiples of $1/\binom{n}{k}$. The following theorem formally summarizes our certified security level.
Theorem 1. Given $n$ clients $\mathcal{C}$, an arbitrary base federated learning algorithm $\mathcal{A}$, a subsample size $k$, and a testing example $x$, we define an ensemble global model $h$ as in Equation (1). $y$ and $z$ are the labels that have the largest and second largest label probabilities for $x$ in the ensemble global model. $\underline{p_y}$ is a lower bound of $p_y$ and $\overline{p_z}$ is an upper bound of $p_z$. Formally, $\underline{p_y}$ and $\overline{p_z}$ satisfy the following conditions:

$$\max_{i \neq y} p_i = p_z \le \overline{p_z} \le \underline{p_y} \le p_y. \quad (6)$$

Then, $h$ provably predicts $y$ for $x$ when at most $m^*$ clients in $\mathcal{C}$ become malicious, i.e., we have:

$$h(\mathcal{C}', x) = h(\mathcal{C}, x) = y, \quad \forall \mathcal{C}', M(\mathcal{C}') \le m^*, \quad (7)$$

where $m^*$ is the largest integer $m$ ($0 \le m \le n - k$) that satisfies inequality (5).

[Figure 1: An example to illustrate the relationships between $p_y$, $\underline{p_y}$, and $\lceil \underline{p_y} \cdot \binom{n}{k} \rceil / \binom{n}{k}$, as well as $p_z$, $\overline{p_z}$, and $\lfloor \overline{p_z} \cdot \binom{n}{k} \rfloor / \binom{n}{k}$.]

Our Theorem 1 is applicable to any base federated learning algorithm, any lower bound $\underline{p_y}$ of $p_y$, and any upper bound $\overline{p_z}$ of $p_z$ that satisfy (6). When the lower bound $\underline{p_y}$ and upper bound $\overline{p_z}$ are estimated more accurately, i.e., when $\underline{p_y}$ and $\overline{p_z}$ are respectively closer to $p_y$ and $p_z$, our certified security level may be larger. The following theorem shows that our derived certified security level is tight, i.e., when no assumptions on the base federated learning algorithm are made, it is impossible to derive a certified security level that is larger than ours for the given probability bounds $\underline{p_y}$ and $\overline{p_z}$.
Theorem 2. Suppose $\underline{p_y} + \overline{p_z} \le 1$. For any $\mathcal{C}'$ satisfying $M(\mathcal{C}') > m^*$, i.e., at least $m^* + 1$ clients are malicious, there exists a base federated learning algorithm $\mathcal{A}^*$ that satisfies (6) but $h(\mathcal{C}', x) \neq y$ or there exist ties.

Computing the Certified Security Level
Suppose we are given $n$ clients $\mathcal{C}$, a base federated learning algorithm $\mathcal{A}$, a subsample size $k$, and a testing dataset $\mathcal{D}$ with $d$ testing examples. For each testing example $x_t$ in $\mathcal{D}$, we aim to compute its label $\hat{y}_t$ predicted by our ensemble global model $h$ and the corresponding certified security level $\hat{m}^*_t$. To compute the certified security level based on our Theorem 1, we need a lower bound $\underline{p_{\hat{y}_t}}$ of the largest label probability $p_{\hat{y}_t}$ and an upper bound $\overline{p_{\hat{z}_t}}$ of the second largest label probability $p_{\hat{z}_t}$. When $\binom{n}{k}$ is small, we can compute the exact label probabilities via training $\binom{n}{k}$ global models. When $\binom{n}{k}$ is large, we propose a Monte Carlo algorithm to estimate the predicted label and the two probability bounds for all testing examples in $\mathcal{D}$ simultaneously with a confidence level $1 - \alpha$ via training $N$ of the $\binom{n}{k}$ global models.
Computing predicted label and probability bounds for one testing example: We first discuss how to compute the predicted label $\hat{y}_t$ and probability bounds $\underline{p_{\hat{y}_t}}$ and $\overline{p_{\hat{z}_t}}$ for one testing example $x_t$. We sample $N$ subsamples with $k$ clients from the $n$ clients uniformly at random without replacement and use them to train $N$ global models $g_1, g_2, \cdots, g_N$. We use the $N$ global models to predict labels for $x_t$ and count the frequency of each label. We treat the label with the largest frequency as the predicted label $\hat{y}_t$. Recall that, based on the definition of label probability, a global model trained on a random subsample with $k$ clients predicts label $\hat{y}_t$ for $x_t$ with the label probability $p_{\hat{y}_t}$. Therefore, the frequency $N_{\hat{y}_t}$ of the label $\hat{y}_t$ among the $N$ global models follows a binomial distribution $B(N, p_{\hat{y}_t})$ with parameters $N$ and $p_{\hat{y}_t}$. Thus, given $N_{\hat{y}_t}$ and $N$, we can use the standard one-sided Clopper-Pearson method (Clopper and Pearson 1934) to estimate a lower bound $\underline{p_{\hat{y}_t}}$ of $p_{\hat{y}_t}$ with a confidence level $1 - \alpha$. Specifically, we have $\underline{p_{\hat{y}_t}} = B(\alpha; N_{\hat{y}_t}, N - N_{\hat{y}_t} + 1)$, where $B(q; v, w)$ is the $q$th quantile of a beta distribution with shape parameters $v$ and $w$. Moreover, we can estimate $\overline{p_{\hat{z}_t}} = 1 - \underline{p_{\hat{y}_t}} \ge 1 - p_{\hat{y}_t} \ge p_{\hat{z}_t}$ as an upper bound of $p_{\hat{z}_t}$.
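As a small sketch (assuming SciPy is available), the one-sided Clopper-Pearson lower bound is just a beta quantile; when $d$ testing examples are certified simultaneously, $\alpha/d$ is passed instead of $\alpha$ (the Bonferroni correction described next).

```python
from scipy.stats import beta

def clopper_pearson_lower(n_y, n_total, alpha):
    """One-sided Clopper-Pearson lower bound on a label probability:
    the alpha-quantile of Beta(n_y, n_total - n_y + 1), where n_y is the
    number of the N trained global models that vote for the label."""
    if n_y == 0:
        return 0.0
    return beta.ppf(alpha, n_y, n_total - n_y + 1)

# Example: 480 of N = 500 global models predict the label:
# p_lower = clopper_pearson_lower(480, 500, 0.001)
```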
Computing predicted labels and probability bounds for $d$ testing examples: One method to compute the predicted labels and probability bounds for the $d$ testing examples is to apply the above process to each testing example individually. However, such a method is computationally intractable because it requires training $N$ global models for every testing example. To address the computational challenge, we propose a method that only needs to train $N$ global models in total. Our idea is to split $\alpha$ among the $d$ testing examples. Specifically, we follow the above process to train $N$ global models and use them to predict labels for the $d$ testing examples. For each testing example $x_t$, we estimate the lower bound $\underline{p_{\hat{y}_t}} = B(\alpha/d; N_{\hat{y}_t}, N - N_{\hat{y}_t} + 1)$ with confidence level $1 - \alpha/d$ instead of $1 - \alpha$. According to the Bonferroni correction, the simultaneous confidence level of estimating the lower bounds for the $d$ testing examples is $1 - \alpha$. Following the above process, we still estimate $\overline{p_{\hat{z}_t}} = 1 - \underline{p_{\hat{y}_t}}$ as an upper bound of $p_{\hat{z}_t}$ for each testing example.

Algorithm 2: Computing Predicted Label and Certified Security Level
Input: $\mathcal{C}$, $\mathcal{A}$, $k$, $N$, $\mathcal{D}$, $\alpha$.
Output: Predicted label and certified security level for each testing example in $\mathcal{D}$.
  $g_1, g_2, \cdots, g_N \leftarrow$ SAMPLE&TRAIN($\mathcal{C}$, $\mathcal{A}$, $k$, $N$)
  for $x_t$ in $\mathcal{D}$ do
    counts[$i$] $\leftarrow \sum_{l=1}^{N} \mathbb{I}(g_l(x_t) = i)$, $i \in \{1, 2, \cdots, L\}$  /* $\mathbb{I}$ is the indicator function */
    $\hat{y}_t \leftarrow$ index of the largest entry in counts (ties are broken uniformly at random)
    $\underline{p_{\hat{y}_t}} \leftarrow B(\alpha/d; N_{\hat{y}_t}, N - N_{\hat{y}_t} + 1)$
    $\overline{p_{\hat{z}_t}} \leftarrow 1 - \underline{p_{\hat{y}_t}}$
    if $\underline{p_{\hat{y}_t}} > \overline{p_{\hat{z}_t}}$ then
      $\hat{m}^*_t \leftarrow$ SEARCHLEVEL($\underline{p_{\hat{y}_t}}$, $\overline{p_{\hat{z}_t}}$, $k$, $|\mathcal{C}|$)
    else
      $\hat{y}_t \leftarrow$ ABSTAIN, $\hat{m}^*_t \leftarrow$ ABSTAIN
    end if
  end for
  return $\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_d$ and $\hat{m}^*_1, \hat{m}^*_2, \cdots, \hat{m}^*_d$
Complete algorithm: Algorithm 2 shows our algorithm to compute the predicted labels and certified security levels for the $d$ testing examples in $\mathcal{D}$. The function SAMPLE&TRAIN randomly samples $N$ subsamples with $k$ clients and trains $N$ global models using the base federated learning algorithm $\mathcal{A}$. Given the probability bounds $\underline{p_{\hat{y}_t}}$ and $\overline{p_{\hat{z}_t}}$ for a testing example $x_t$, the function SEARCHLEVEL finds the certified security level $\hat{m}^*_t$ via finding the largest integer $m$ that satisfies (5). For example, SEARCHLEVEL can simply start $m$ from 0 and iteratively increase it by one until finding $\hat{m}^*_t$.
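A minimal sketch of SEARCHLEVEL, implementing the linear scan over $m$ against inequality (5); exact binomial coefficients come from Python's `math.comb`, and floating-point rounding of the ceil/floor terms is ignored for simplicity.

```python
from math import comb, ceil, floor

def search_level(p_y_lower, p_z_upper, k, n):
    """Largest integer m (0 <= m <= n - k) satisfying inequality (5):
    ceil(p_y_lower * C) / C - floor(p_z_upper * C) / C > 2 - 2 * C(n-m, k) / C,
    where C = C(n, k). Returns -1 if even m = 0 fails."""
    c = comb(n, k)
    lhs = ceil(p_y_lower * c) / c - floor(p_z_upper * c) / c
    m_star = -1
    for m in range(0, n - k + 1):
        if lhs > 2 - 2 * comb(n - m, k) / c:
            m_star = m
        else:
            break  # the right-hand side is non-decreasing in m
    return m_star
```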
Probabilistic guarantees: In Algorithm 2, since we estimate the lower bound $\underline{p_{\hat{y}_t}}$ using the Clopper-Pearson method, there is a probability that the estimated lower bound is incorrect, i.e., $\underline{p_{\hat{y}_t}} > p_{\hat{y}_t}$. When the lower bound is estimated incorrectly for a testing example $x_t$, the certified security level $\hat{m}^*_t$ outputted by Algorithm 2 for $x_t$ may also be incorrect, i.e., there may exist a $\mathcal{C}'$ such that $M(\mathcal{C}') \le \hat{m}^*_t$ but $h(\mathcal{C}', x_t) \neq \hat{y}_t$. In other words, our Algorithm 2 has probabilistic guarantees for its outputted certified security levels. However, in the following theorem, we prove that the probability that Algorithm 2 returns an incorrect certified security level for at least one testing example is at most $\alpha$.
Theorem 3. The probability that Algorithm 2 returns an incorrect certified security level for at least one testing example in $\mathcal{D}$ is bounded by $\alpha$, which is equivalent to:

$$\Pr\left(\cap_{x_t \in \mathcal{D}} \left(h(\mathcal{C}', x_t) = \hat{y}_t, \forall \mathcal{C}', M(\mathcal{C}') \le \hat{m}^*_t \mid \hat{y}_t \neq \text{ABSTAIN}\right)\right) \ge 1 - \alpha. \quad (8)$$

Note that when the probability bounds are estimated deterministically, e.g., when $\binom{n}{k}$ is small and the exact label probabilities can be computed via training $\binom{n}{k}$ global models, the certified security level obtained from our Theorem 1 is also deterministic.

Experiments
Experimental Setup
Datasets, model architectures, and base algorithm:
We use the MNIST (LeCun, Cortes, and Burges 1998) and Human Activity Recognition (HAR) (Anguita et al. 2013) datasets. MNIST is used to simulate federated learning scenarios, while HAR represents a real-world federated learning scenario. Specifically, MNIST has 60,000 training examples and 10,000 testing examples. We consider $n = 1{,}000$ clients and split them into 10 groups. We assign a training example with label $l$ to the $l$th group of clients with probability $q$ and assign it to each remaining group with probability $(1-q)/9$. After assigning a training example to a group, we distribute it to a client in the group uniformly at random. The parameter $q$ controls the local training data distribution on clients, and we call $q$ the degree of non-IID. $q = 0.1$ means that clients' local training data are IID, while a larger $q$ indicates a larger degree of non-IID. By default, we set $q = 0.5$. However, we will study the impact of $q$ (degree of non-IID) on our method.
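As a sketch of this simulation (the equal group sizes and the random seed are illustrative assumptions):

```python
import numpy as np

def distribute_examples(labels, n_clients=1000, n_groups=10, q=0.5, seed=0):
    """Assign each training example to a client: an example with label l goes to
    group l with probability q and to each of the other groups with probability
    (1 - q) / (n_groups - 1), then to a uniformly random client in that group."""
    rng = np.random.default_rng(seed)
    clients_per_group = n_clients // n_groups
    assignment = np.empty(len(labels), dtype=int)
    for t, l in enumerate(labels):
        probs = np.full(n_groups, (1.0 - q) / (n_groups - 1))
        probs[l] = q
        group = rng.choice(n_groups, p=probs)
        assignment[t] = group * clients_per_group + rng.integers(clients_per_group)
    return assignment
```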
[Figure 2: FedAvg vs. ensemble FedAvg. Certified accuracy as a function of the number of malicious clients $m$ on (a) MNIST and (b) HAR.]
[Figure 3: Impact of $k$ on our ensemble FedAvg on (a) MNIST and (b) HAR.]

HAR includes human activity data from 30 users, each of which is a client. The task is to predict a user's activity based on the sensor signals (e.g., acceleration) collected from the user's smartphone. There are 6 possible activities (e.g., walking, sitting, and standing), indicating a 6-class classification problem. There are 10,299 examples in total and each example has 561 features. We use 75% of each user's examples as training examples and the rest as testing examples.

We consider a convolutional neural network (CNN) architecture (shown in Supplemental Material) for MNIST. For HAR, we consider a deep neural network (DNN) with two fully-connected hidden layers, each of which contains 256 neurons and uses ReLU as the activation function. We use the popular FedAvg (McMahan et al. 2017) as the base federated learning algorithm. Recall that a base federated learning algorithm has hyperparameters (shown in Algorithm 1): globalIter, localIter, learning rate $\eta$, and batch size. Table 1 summarizes these hyperparameters for FedAvg in our experiments. In particular, we set globalIter as in Table 1 because FedAvg converges with such settings.

Table 1: Federated learning settings and hyperparameters.
  Dataset              MNIST   HAR
  Model architecture   CNN     DNN
  Number of clients    1,000   30
Evaluation metric: We use certified accuracy as our evaluation metric. Specifically, we define the certified accuracy at $m$ malicious clients (denoted as CA@$m$) for a federated learning method as the fraction of testing examples in the testing dataset $\mathcal{D}$ whose labels are correctly predicted by the method and whose certified security levels are at least $m$. Formally, we define CA@$m$ as follows:

$$\text{CA@}m = \frac{\sum_{x_t \in \mathcal{D}} \mathbb{I}(\hat{y}_t = y_t) \cdot \mathbb{I}(\hat{m}^*_t \ge m)}{|\mathcal{D}|}, \quad (9)$$

where $\mathbb{I}$ is the indicator function, $y_t$ is the true label for $x_t$, and $\hat{y}_t$ and $\hat{m}^*_t$ are respectively the predicted label and certified security level for $x_t$. Intuitively, CA@$m$ means that when at most $m$ clients are malicious, the accuracy of the federated learning method for $\mathcal{D}$ is at least CA@$m$ no matter what attacks the malicious clients use (i.e., no matter how the malicious clients tamper their local training data and model updates). Note that CA@0 reduces to the standard accuracy when there are no malicious clients.

[Figure 4: Impact of $N$ on our ensemble FedAvg on (a) MNIST and (b) HAR.]

[Figure 5: Impact of $\alpha$ on our ensemble FedAvg on (a) MNIST and (b) HAR.]

When we can compute the exact label probabilities via training $\binom{n}{k}$ global models, the CA@$m$ of our ensemble global model $h$ computed using the certified security levels derived from Theorem 1 is deterministic. When $\binom{n}{k}$ is large, we estimate predicted labels and certified security levels using Algorithm 2, and thus our CA@$m$ has a confidence level $1 - \alpha$ according to Theorem 3.
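A small sketch of equation (9), treating an abstention as an example that is never certified:

```python
def certified_accuracy(pred_labels, levels, true_labels, m):
    """CA@m: fraction of testing examples whose predicted label is correct and
    whose certified security level is at least m (None encodes ABSTAIN)."""
    hits = sum(1 for y_hat, m_t, y in zip(pred_labels, levels, true_labels)
               if m_t is not None and y_hat == y and m_t >= m)
    return hits / len(true_labels)
```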
Parameter settings: Our method has three parameters: $N$, $k$, and $\alpha$. Unless otherwise mentioned, we adopt the following default settings for them: $N = 500$, $\alpha = 0.001$, $k = 10$ for MNIST, and $k = 2$ for HAR. Under such a default setting for HAR, we have $\binom{n}{k} = \binom{30}{2} = 435 < N = 500$, and we can compute the exact label probabilities via training 435 global models. Therefore, we have deterministic certified accuracy for HAR under the default setting. We will explore the impact of each parameter while using the default settings for the other two parameters. For HAR, we set $k = 4$ when exploring the impact of $N$ (i.e., Figure 4(b)) and $\alpha$ (i.e., Figure 5(b)), since the default setting $k = 2$ gives deterministic certified accuracy, making $N$ and $\alpha$ not relevant.

Experimental Results
Single-global-model FedAvg vs. ensemble FedAvg:
Figure 2 compares single-global-model FedAvg and ensemble FedAvg with respect to certified accuracy on the two datasets. When there are no malicious clients (i.e., $m = 0$), single-global-model FedAvg is more accurate than ensemble FedAvg. This is because ensemble FedAvg uses a subsample of clients to train each global model. However, single-global-model FedAvg has 0 certified accuracy when just one client is malicious. This is because a single malicious client can arbitrarily manipulate the global model learnt by FedAvg (Blanchard et al. 2017). In contrast, the certified accuracy of ensemble FedAvg reduces to 0 only when up to 61 and 9 clients (6.1% and 30%) are malicious on MNIST and HAR, respectively.
[Figure 6: Impact of the degree of non-IID $q$ on MNIST.]

Note that it is unknown whether existing Byzantine-robust federated learning methods have non-zero certified accuracy when $m > 0$, and thus we cannot compare ensemble FedAvg with them.

Impact of $k$, $N$, and $\alpha$: Figures 3, 4, and 5 show the impact of $k$, $N$, and $\alpha$, respectively. $k$ achieves a trade-off between accuracy under no malicious clients and security under malicious clients. Specifically, when $k$ is larger, the ensemble global model is more accurate at $m = 0$, but the certified accuracy drops more quickly to 0 as $m$ increases. This is because when $k$ is larger, it is more likely for the sampled $k$ clients to include malicious ones. The certified accuracy increases as $N$ or $\alpha$ increases. This is because training more global models or a larger $\alpha$ allows Algorithm 2 to estimate tighter probability bounds and larger certified security levels. When $N$ increases from 100 to 500, the certified accuracy increases significantly. However, when $N$ further grows to 1,000, the increase of certified accuracy is marginal. Our results show that we don't need to train too many global models in practice, as the certified accuracy saturates when $N$ is larger than some threshold.

Impact of degree of non-IID $q$: Figure 6 shows the certified accuracy of our ensemble FedAvg on MNIST when the clients' local training data have different degrees of non-IID. We observe that the certified accuracy drops when $q$ increases from 0.5 to 0.9, which represents a high degree of non-IID. However, the certified accuracy is still high when $m$ is small for $q = 0.9$, e.g., the certified accuracy is still 83% when $m = 10$. This is because although each global model trained using a subsample of clients is less accurate when the local training data are highly non-IID, the ensemble of multiple global models is still accurate.

Related Work
In federated learning, the first category of studies (Smith et al. 2017; Li et al. 2020b; Wang et al. 2020; Liu et al. 2020; Peng et al. 2020) aims to design federated learning methods that can learn more accurate global models and/or analyze their convergence properties. For instance, FedMA (Wang et al. 2020) constructs the global model via matching and averaging the hidden elements in a neural network with similar feature extraction signatures. The second category of studies (Konečný et al. 2016; McMahan et al. 2017; Wen et al. 2017; Alistarh et al. 2017; Lee et al. 2017; Sahu et al. 2018; Bernstein et al. 2018; Vogels, Karimireddy, and Jaggi 2019; Yurochkin et al. 2019; Mohri, Sivek, and Suresh 2019; Wang et al. 2020; Li, Wen, and He 2020; Li et al. 2020c; Hamer, Mohri, and Suresh 2020; Rothchild et al. 2020; Malinovsky et al. 2020) aims to improve the communication efficiency between the clients and server via sparsification, quantization, and/or encoding of the model updates sent from the clients to the server. The third category of studies (Bonawitz et al. 2017; Geyer, Klein, and Nabi 2017; Hitaj, Ateniese, and Perez-Cruz 2017; Melis et al. 2019; Zhu, Liu, and Han 2019; Mohri, Sivek, and Suresh 2019; Wang, Tong, and Shi 2020; Li et al. 2020a) aims to explore the privacy/fairness issues of federated learning and their defenses. These studies often assume a single global model is shared among the clients. Smith et al. (2017) proposed to learn a customized model for each client via multi-task learning.

Our work is on the security of federated learning, which is orthogonal to the studies above. Multiple studies (Fang et al. 2020; Bagdasaryan et al. 2020; Xie, Koyejo, and Gupta 2019; Bhagoji et al. 2019) showed that the global model's accuracy can be significantly downgraded by malicious clients. Existing defenses against malicious clients leverage Byzantine-robust aggregation rules such as Krum (Blanchard et al. 2017), trimmed mean (Yin et al. 2018), coordinate-wise median (Yin et al. 2018), and Bulyan (Mhamdi, Guerraoui, and Rouault 2018). However, they cannot provably guarantee that the global model's predicted label for a testing example is not affected by malicious clients. As a result, they may be broken by strong attacks that carefully craft the model updates sent from the malicious clients to the server, e.g., (Fang et al. 2020). We propose ensemble federated learning, whose predicted label for a testing example is provably not affected by a bounded number of malicious clients.

We note that ensemble methods were also proposed as provably secure defenses (e.g., (Jia, Cao, and Gong 2020)) against data poisoning attacks. However, they are insufficient to defend against malicious clients that can manipulate both the local training data and the model updates. In particular, a provably secure defense against data poisoning attacks guarantees that the label predicted for a testing example is unaffected by a bounded number of poisoned training examples. However, a single malicious client can poison an arbitrary number of its local training examples, breaking the assumption of provably secure defenses against data poisoning attacks.
Conclusion
In this work, we propose ensemble federated learning and derive its tight provable security guarantee against malicious clients. Moreover, we propose an algorithm to compute the certified security levels. Our empirical results on two datasets show that our ensemble federated learning can effectively defend against malicious clients with provable security guarantees. Interesting future work includes estimating the probability bounds deterministically and considering the internal structure of a base federated learning algorithm to further improve our provable security guarantees.

Acknowledgement
We thank the anonymous reviewers for insightful reviews. This work was supported by NSF grant No. 1937786.
References
Alistarh, D.; Allen-Zhu, Z.; and Li, J. 2018. Byzantine stochastic gradient descent. In NeurIPS.
Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; and Vojnovic, M. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NeurIPS.
Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; and Reyes-Ortiz, J. L. 2013. A public domain dataset for human activity recognition using smartphones. In ESANN.
Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; and Shmatikov, V. 2020. How to backdoor federated learning. In AISTATS.
Bernstein, J.; Wang, Y.-X.; Azizzadenesheli, K.; and Anandkumar, A. 2018. signSGD: Compressed Optimisation for Non-Convex Problems. In ICML.
Bhagoji, A.; Chakraborty, S.; Mittal, P.; and Calo, S. 2019. Analyzing Federated Learning through an Adversarial Lens. In ICML.
Blanchard, P.; Mhamdi, E. M. E.; Guerraoui, R.; and Stainer, J. 2017. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In NeurIPS.
Bonawitz, K.; Ivanov, V.; Kreuter, B.; Marcedone, A.; McMahan, H. B.; Patel, S.; Ramage, D.; Segal, A.; and Seth, K. 2017. Practical secure aggregation for privacy-preserving machine learning. In CCS.
Chen, L.; Wang, H.; Charles, Z.; and Papailiopoulos, D. 2018. DRACO: Byzantine-resilient Distributed Training via Redundant Gradients. In ICML.
Chen, Y.; Su, L.; and Xu, J. 2017. Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent. In POMACS.
Clopper, C. J.; and Pearson, E. S. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika.
Fang, M.; Cao, X.; Jia, J.; and Gong, N. Z. 2020. Local model poisoning attacks to Byzantine-robust federated learning. In USENIX Security.
Geyer, R. C.; Klein, T.; and Nabi, M. 2017. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557.
Hamer, J.; Mohri, M.; and Suresh, A. T. 2020. FedBoost: Communication-Efficient Algorithms for Federated Learning. In ICML.
Hitaj, B.; Ateniese, G.; and Perez-Cruz, F. 2017. Deep models under the GAN: information leakage from collaborative deep learning. In CCS.
Jia, J.; Cao, X.; and Gong, N. Z. 2020. Intrinsic certified robustness of bagging against data poisoning attacks. arXiv preprint arXiv:2008.04495.
Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. 2019. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.
Konečný, J.; McMahan, H. B.; Yu, F. X.; Richtárik, P.; Suresh, A. T.; and Bacon, D. 2016. Federated learning: Strategies for improving communication efficiency. In NeurIPS Workshop on Private Multi-Party Machine Learning.
LeCun, Y.; Cortes, C.; and Burges, C. 1998. MNIST handwritten digit database. Available: http://yann.lecun.com/exdb/mnist.
Lee, K.; Lam, M.; Pedarsani, R.; Papailiopoulos, D.; and Ramchandran, K. 2017. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory.
Li, Q.; Wen, Z.; and He, B. 2020. Practical Federated Gradient Boosting Decision Trees. In AAAI.
Li, T.; Sanjabi, M.; Beirami, A.; and Smith, V. 2020a. Fair Resource Allocation in Federated Learning. In ICLR.
Li, X.; Huang, K.; Yang, W.; Wang, S.; and Zhang, Z. 2020b. On the convergence of FedAvg on non-IID data. In ICLR.
Li, Z.; Kovalev, D.; Qian, X.; and Richtárik, P. 2020c. Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization. In ICML.
Liu, F.; Wu, X.; Ge, S.; Fan, W.; and Zou, Y. 2020. Federated Learning for Vision-and-Language Grounding Problems. In AAAI.
Malinovsky, G.; Kovalev, D.; Gasanov, E.; Condat, L.; and Richtarik, P. 2020. From Local SGD to Local Fixed Point Methods for Federated Learning. In ICML.
McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; et al. 2017. Communication-efficient learning of deep networks from decentralized data. In AISTATS.
Melis, L.; Song, C.; De Cristofaro, E.; and Shmatikov, V. 2019. Exploiting unintended feature leakage in collaborative learning. In IEEE S&P.
Mhamdi, E. M. E.; Guerraoui, R.; and Rouault, S. 2018. The Hidden Vulnerability of Distributed Learning in Byzantium. In ICML.
Mohri, M.; Sivek, G.; and Suresh, A. T. 2019. Agnostic Federated Learning. In ICML.
Peng, X.; Huang, Z.; Zhu, Y.; and Saenko, K. 2020. Federated Adversarial Domain Adaptation. In ICLR.
Rothchild, D.; Panda, A.; Ullah, E.; Ivkin, N.; Stoica, I.; Braverman, V.; Gonzalez, J.; and Arora, R. 2020. FetchSGD: Communication-Efficient Federated Learning with Sketching. In ICML.
Sahu, A. K.; Li, T.; Sanjabi, M.; Zaheer, M.; Talwalkar, A.; and Smith, V. 2018. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127.
Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. S. 2017. Federated multi-task learning. In NeurIPS.
Vogels, T.; Karimireddy, S. P.; and Jaggi, M. 2019. PowerSGD: Practical low-rank gradient compression for distributed optimization. In NeurIPS.
Wang, H.; Yurochkin, M.; Sun, Y.; Papailiopoulos, D.; and Khazaeni, Y. 2020. Federated Learning with Matched Averaging. In ICLR.
Wang, Y.; Tong, Y.; and Shi, D. 2020. Federated Latent Dirichlet Allocation: A Local Differential Privacy Based Framework. In AAAI.
Wen, W.; Xu, C.; Yan, F.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2017. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In NeurIPS.
Xie, C.; Huang, K.; Chen, P.-Y.; and Li, B. 2020. DBA: Distributed Backdoor Attacks against Federated Learning. In ICLR.
Xie, C.; Koyejo, S.; and Gupta, I. 2019. Fall of empires: Breaking Byzantine-tolerant SGD by inner product manipulation. In UAI.
Yin, D.; Chen, Y.; Kannan, R.; and Bartlett, P. 2019. Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning. In ICML.
Yin, D.; Chen, Y.; Ramchandran, K.; and Bartlett, P. 2018. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. In ICML.
Yurochkin, M.; Agarwal, M.; Ghosh, S.; Greenewald, K.; Hoang, N.; and Khazaeni, Y. 2019. Bayesian Nonparametric Federated Learning of Neural Networks. In ICML.
Zhu, L.; Liu, Z.; and Han, S. 2019. Deep leakage from gradients. In NeurIPS.

Table 2: The CNN architecture for MNIST. Layers, in order: Input; Convolution + ReLU; Max Pooling; Convolution + ReLU; Max Pooling; Fully Connected + ReLU (512); Softmax (10).
Proof of Theorem 1

We first define a subsample of $k$ clients from $\mathcal{C}$ as $\mathcal{S}(\mathcal{C}, k)$. Then, we define the space of all possible subsamples from $\mathcal{C}$ as $\mathcal{O}_{\mathcal{C}} = \{\mathcal{S}(\mathcal{C}, k)\}$ and the space of all possible subsamples from $\mathcal{C}'$ as $\mathcal{O}'_{\mathcal{C}} = \{\mathcal{S}(\mathcal{C}', k)\}$. Let $\mathcal{O}_o = \{\mathcal{S}(\mathcal{C} \cap \mathcal{C}', k)\} = \mathcal{O}_{\mathcal{C}} \cap \mathcal{O}'_{\mathcal{C}}$ denote the space of all possible subsamples from the set of normal clients $\mathcal{C} \cap \mathcal{C}'$, and $\mathcal{O} = \{\mathcal{S}(\mathcal{C} \cup \mathcal{C}', k)\} = \mathcal{O}_{\mathcal{C}} \cup \mathcal{O}'_{\mathcal{C}}$ denote the space of all possible subsamples from either $\mathcal{C}$ or $\mathcal{C}'$.

[Figure 7: Illustration of $\mathcal{O}_{\mathcal{C}}$, $\mathcal{O}'_{\mathcal{C}}$, and $\mathcal{O}_o$.]

We use a random variable $\mathbf{X}$ to denote a subsample $\mathcal{S}(\mathcal{C}, k)$ and $\mathbf{Y}$ to denote a subsample $\mathcal{S}(\mathcal{C}', k)$ in $\mathcal{O}$. We know that $\mathbf{X}$ and $\mathbf{Y}$ have the following probability distributions:

$$\Pr(\mathbf{X} = s) = \begin{cases} 1/\binom{n}{k}, & \text{if } s \in \mathcal{O}_{\mathcal{C}} \\ 0, & \text{otherwise,} \end{cases} \quad (10)$$

$$\Pr(\mathbf{Y} = s) = \begin{cases} 1/\binom{n}{k}, & \text{if } s \in \mathcal{O}'_{\mathcal{C}} \\ 0, & \text{otherwise.} \end{cases} \quad (11)$$

Recall that given a set of clients $s$, the base federated learning algorithm $\mathcal{A}$ learns a global model. For simplicity, we denote by $\mathcal{A}(s, x)$ the predicted label of a testing example $x$ given by this global model. We have the following equations:

$$p_y = \Pr(\mathcal{A}(\mathbf{X}, x) = y) \quad (12)$$
$$= \Pr(\mathcal{A}(\mathbf{X}, x) = y \mid \mathbf{X} \in \mathcal{O}_o) \cdot \Pr(\mathbf{X} \in \mathcal{O}_o) + \Pr(\mathcal{A}(\mathbf{X}, x) = y \mid \mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)) \cdot \Pr(\mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)), \quad (13)$$

$$p'_y = \Pr(\mathcal{A}(\mathbf{Y}, x) = y) \quad (14)$$
$$= \Pr(\mathcal{A}(\mathbf{Y}, x) = y \mid \mathbf{Y} \in \mathcal{O}_o) \cdot \Pr(\mathbf{Y} \in \mathcal{O}_o) + \Pr(\mathcal{A}(\mathbf{Y}, x) = y \mid \mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) \cdot \Pr(\mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)). \quad (15)$$

Note that we have:

$$\Pr(\mathcal{A}(\mathbf{X}, x) = y \mid \mathbf{X} \in \mathcal{O}_o) = \Pr(\mathcal{A}(\mathbf{Y}, x) = y \mid \mathbf{Y} \in \mathcal{O}_o), \quad (16)$$

$$\Pr(\mathbf{X} \in \mathcal{O}_o) = \Pr(\mathbf{Y} \in \mathcal{O}_o) = \frac{\binom{n-m}{k}}{\binom{n}{k}}, \quad (17)$$

where $m$ is the number of malicious clients. Therefore, we know:

$$\Pr(\mathcal{A}(\mathbf{X}, x) = y \mid \mathbf{X} \in \mathcal{O}_o) \cdot \Pr(\mathbf{X} \in \mathcal{O}_o) = \Pr(\mathcal{A}(\mathbf{Y}, x) = y \mid \mathbf{Y} \in \mathcal{O}_o) \cdot \Pr(\mathbf{Y} \in \mathcal{O}_o). \quad (18)$$

By subtracting (13) from (15), we obtain:

$$p'_y - p_y = \Pr(\mathcal{A}(\mathbf{Y}, x) = y \mid \mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) \cdot \Pr(\mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) - \Pr(\mathcal{A}(\mathbf{X}, x) = y \mid \mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)) \cdot \Pr(\mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)). \quad (19)$$

Similarly, we have the following equation for any $i \neq y$:

$$p'_i - p_i = \Pr(\mathcal{A}(\mathbf{Y}, x) = i \mid \mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) \cdot \Pr(\mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) - \Pr(\mathcal{A}(\mathbf{X}, x) = i \mid \mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)) \cdot \Pr(\mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)). \quad (20)$$

Therefore, we can show:

$$p'_y - p'_i = p_y - p_i + (p'_y - p_y) - (p'_i - p_i) \quad (21)$$
$$= p_y - p_i + [\Pr(\mathcal{A}(\mathbf{Y}, x) = y \mid \mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) - \Pr(\mathcal{A}(\mathbf{Y}, x) = i \mid \mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o))] \cdot \Pr(\mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) - [\Pr(\mathcal{A}(\mathbf{X}, x) = y \mid \mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)) - \Pr(\mathcal{A}(\mathbf{X}, x) = i \mid \mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o))] \cdot \Pr(\mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)). \quad (22)$$

Note that we have:

$$\Pr(\mathcal{A}(\mathbf{Y}, x) = y \mid \mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) - \Pr(\mathcal{A}(\mathbf{Y}, x) = i \mid \mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) \ge -1, \quad (23)$$
$$\Pr(\mathcal{A}(\mathbf{X}, x) = y \mid \mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)) - \Pr(\mathcal{A}(\mathbf{X}, x) = i \mid \mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)) \le 1, \quad (24)$$
$$\Pr(\mathbf{Y} \in (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o)) = \Pr(\mathbf{X} \in (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o)) = 1 - \frac{\binom{n-m}{k}}{\binom{n}{k}}. \quad (25)$$

Therefore, based on (22) and the fact that $p_y$ and $p_i$ are integer multiples of $1/\binom{n}{k}$, we have the following:

$$p'_y - p'_i \ge p_y - p_i + (-1) \cdot \left[1 - \frac{\binom{n-m}{k}}{\binom{n}{k}}\right] - \left[1 - \frac{\binom{n-m}{k}}{\binom{n}{k}}\right] \quad (26)$$
$$= p_y - p_i - \left[2 - \frac{2\binom{n-m}{k}}{\binom{n}{k}}\right] \quad (27)$$
$$= \frac{\left\lceil p_y \cdot \binom{n}{k} \right\rceil}{\binom{n}{k}} - \frac{\left\lfloor p_i \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} - \left[2 - \frac{2\binom{n-m}{k}}{\binom{n}{k}}\right] \quad (28)$$
$$\ge \frac{\left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil}{\binom{n}{k}} - \frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} - \left[2 - \frac{2\binom{n-m^*}{k}}{\binom{n}{k}}\right] \quad (29)$$
$$> 0, \quad (30)$$

where (29) holds because $p_y \ge \underline{p_y}$, $p_i \le p_z \le \overline{p_z}$, and $m \le m^*$, and (30) holds because $m^*$ satisfies inequality (5). This indicates $h(\mathcal{C}', x) = y$.

Proof of Theorem 2
We prove Theorem 2 by constructing a base federated learning algorithm $\mathcal{A}^*$ such that the conditions in (6) are satisfied but $h(\mathcal{C}', x) \neq y$ or there exist ties. We follow the definitions of $\mathcal{O}$, $\mathcal{O}_{\mathcal{C}}$, $\mathcal{O}'_{\mathcal{C}}$, $\mathcal{O}_o$, $\mathbf{X}$, and $\mathbf{Y}$ in the previous section. Next, we consider four cases.

[Figure 8: Illustration of $\mathcal{O}_{\mathcal{C}}$, $\mathcal{O}'_{\mathcal{C}}$, $\mathcal{O}_o$, $\mathcal{O}_A$, and $\mathcal{O}_B$ in the four cases.]

Case 1: $m \ge n - k$. In this case, we know $\mathcal{O}_o = \emptyset$. Let $\mathcal{O}_A \subseteq \mathcal{O}_{\mathcal{C}}$ and $\mathcal{O}_B \subseteq \mathcal{O}_{\mathcal{C}}$ such that $|\mathcal{O}_A| = \lceil \underline{p_y} \cdot \binom{n}{k} \rceil$, $|\mathcal{O}_B| = \lfloor \overline{p_z} \cdot \binom{n}{k} \rfloor$, and $\mathcal{O}_A \cap \mathcal{O}_B = \emptyset$. Since $\underline{p_y} + \overline{p_z} \le 1$, we have:

$$|\mathcal{O}_A| + |\mathcal{O}_B| = \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor \quad (31)$$
$$\le \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \left\lfloor (1 - \underline{p_y}) \cdot \binom{n}{k} \right\rfloor \quad (32)$$
$$= \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \binom{n}{k} - \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil \quad (33)$$
$$= \binom{n}{k} = |\mathcal{O}_{\mathcal{C}}|. \quad (34)$$

Therefore, we can always find such a pair of disjoint sets $(\mathcal{O}_A, \mathcal{O}_B)$. We can construct $\mathcal{A}^*$ as follows:

$$\mathcal{A}^*(s, x) = \begin{cases} y, & \text{if } s \in \mathcal{O}_A \\ z, & \text{if } s \in \mathcal{O}_B \cup \mathcal{O}'_{\mathcal{C}} \\ i, \; i \neq y \text{ and } i \neq z, & \text{otherwise.} \end{cases} \quad (35)$$

We can show that such $\mathcal{A}^*$ satisfies the following probability properties:

$$p_y = \Pr(\mathcal{A}^*(\mathbf{X}, x) = y) = \frac{|\mathcal{O}_A|}{|\mathcal{O}_{\mathcal{C}}|} = \frac{\left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil}{\binom{n}{k}} \ge \underline{p_y}, \quad (36)$$
$$p_z = \Pr(\mathcal{A}^*(\mathbf{X}, x) = z) = \frac{|\mathcal{O}_B|}{|\mathcal{O}_{\mathcal{C}}|} = \frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} \le \overline{p_z}. \quad (37)$$

Therefore, $\mathcal{A}^*$ satisfies the probability conditions in (6). However, we have:

$$p'_z = \Pr(\mathcal{A}^*(\mathbf{Y}, x) = z) = 1, \quad (38)$$

which indicates $h(\mathcal{C}', x) = z \neq y$.

Case 2: $m^* < m < n - k$, $0 \le \underline{p_y} \le 1 - \binom{n-m}{k}/\binom{n}{k}$, and $0 \le \overline{p_z} \le \binom{n-m}{k}/\binom{n}{k}$. Let $\mathcal{O}_A \subseteq \mathcal{O}_{\mathcal{C}} - \mathcal{O}_o$ such that $|\mathcal{O}_A| = \lceil \underline{p_y} \cdot \binom{n}{k} \rceil$. Let $\mathcal{O}_B \subseteq \mathcal{O}_o$ such that $|\mathcal{O}_B| = \lfloor \overline{p_z} \cdot \binom{n}{k} \rfloor$. We can construct a federated learning algorithm $\mathcal{A}^*$ as follows:

$$\mathcal{A}^*(s, x) = \begin{cases} y, & \text{if } s \in \mathcal{O}_A \\ z, & \text{if } s \in \mathcal{O}_B \cup (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o) \\ i, \; i \neq y \text{ and } i \neq z, & \text{otherwise.} \end{cases} \quad (39)$$

We can show that such $\mathcal{A}^*$ satisfies the following probability conditions:

$$p_y = \Pr(\mathcal{A}^*(\mathbf{X}, x) = y) = \frac{|\mathcal{O}_A|}{|\mathcal{O}_{\mathcal{C}}|} = \frac{\left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil}{\binom{n}{k}} \ge \underline{p_y}, \quad (40)$$
$$p_z = \Pr(\mathcal{A}^*(\mathbf{X}, x) = z) = \frac{|\mathcal{O}_B|}{|\mathcal{O}_{\mathcal{C}}|} = \frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} \le \overline{p_z}, \quad (41)$$

which indicates $\mathcal{A}^*$ satisfies (6). However, we have:

$$p'_y - p'_z = \Pr(\mathcal{A}^*(\mathbf{Y}, x) = y) - \Pr(\mathcal{A}^*(\mathbf{Y}, x) = z) \quad (42)$$
$$= 0 - \frac{|\mathcal{O}_B| + |\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o|}{|\mathcal{O}'_{\mathcal{C}}|} \quad (43)$$
$$= -\frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} - \left[1 - \frac{\binom{n-m}{k}}{\binom{n}{k}}\right] \quad (44)$$
$$< 0, \quad (45)$$

which implies $h(\mathcal{C}', x) \neq y$.

Case 3: $m^* < m < n - k$, $0 \le \underline{p_y} \le 1 - \binom{n-m}{k}/\binom{n}{k}$, and $\binom{n-m}{k}/\binom{n}{k} \le \overline{p_z} \le 1 - \underline{p_y}$. Let $\mathcal{O}_A \subseteq \mathcal{O}_{\mathcal{C}} - \mathcal{O}_o$ and $\mathcal{O}_B \subseteq \mathcal{O}_{\mathcal{C}} - \mathcal{O}_o$ such that $|\mathcal{O}_A| = \lceil \underline{p_y} \cdot \binom{n}{k} \rceil$, $|\mathcal{O}_B| = \lfloor \overline{p_z} \cdot \binom{n}{k} \rfloor - \binom{n-m}{k}$, and $\mathcal{O}_A \cap \mathcal{O}_B = \emptyset$. Note that $|\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o| = \binom{n}{k} - \binom{n-m}{k}$, and we have:

$$|\mathcal{O}_A| + |\mathcal{O}_B| = \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor - \binom{n-m}{k} \quad (46)$$
$$\le \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \left\lfloor (1 - \underline{p_y}) \cdot \binom{n}{k} \right\rfloor - \binom{n-m}{k} \quad (47)$$
$$= \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \left[\binom{n}{k} - \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil\right] - \binom{n-m}{k} \quad (48)$$
$$= \binom{n}{k} - \binom{n-m}{k}. \quad (49)$$

Therefore, we can always find a pair of such disjoint sets $(\mathcal{O}_A, \mathcal{O}_B)$. We can construct an algorithm $\mathcal{A}^*$ as follows:

$$\mathcal{A}^*(s, x) = \begin{cases} y, & \text{if } s \in \mathcal{O}_A \\ z, & \text{if } s \in \mathcal{O}_B \cup \mathcal{O}'_{\mathcal{C}} \\ i, \; i \neq y \text{ and } i \neq z, & \text{otherwise.} \end{cases} \quad (50)$$

We can show that such $\mathcal{A}^*$ satisfies the following probability conditions:

$$p_y = \Pr(\mathcal{A}^*(\mathbf{X}, x) = y) = \frac{|\mathcal{O}_A|}{|\mathcal{O}_{\mathcal{C}}|} = \frac{\left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil}{\binom{n}{k}} \ge \underline{p_y}, \quad (51)$$
$$p_z = \Pr(\mathcal{A}^*(\mathbf{X}, x) = z) = \frac{|\mathcal{O}_B| + |\mathcal{O}_o|}{|\mathcal{O}_{\mathcal{C}}|} = \frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} \le \overline{p_z}, \quad (52)$$

which are consistent with the probability conditions in (6). However, we can show the following:

$$p'_z = \Pr(\mathcal{A}^*(\mathbf{Y}, x) = z) = 1, \quad (53)$$

which gives $h(\mathcal{C}', x) = z \neq y$.

Case 4: $m^* < m < n - k$, $1 - \binom{n-m}{k}/\binom{n}{k} < \underline{p_y} \le 1$, and $0 \le \overline{p_z} \le 1 - \underline{p_y} < \binom{n-m}{k}/\binom{n}{k}$. Let $\mathcal{O}_A \subseteq \mathcal{O}_o$ and $\mathcal{O}_B \subseteq \mathcal{O}_o$ such that $|\mathcal{O}_A| = \lceil \underline{p_y} \cdot \binom{n}{k} \rceil + \binom{n-m}{k} - \binom{n}{k}$, $|\mathcal{O}_B| = \lfloor \overline{p_z} \cdot \binom{n}{k} \rfloor$, and $\mathcal{O}_A \cap \mathcal{O}_B = \emptyset$. Note that $|\mathcal{O}_o| = \binom{n-m}{k}$, and we have:

$$|\mathcal{O}_A| + |\mathcal{O}_B| = \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \binom{n-m}{k} - \binom{n}{k} + \left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor \quad (54)$$
$$\le \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \binom{n-m}{k} - \binom{n}{k} + \left\lfloor (1 - \underline{p_y}) \cdot \binom{n}{k} \right\rfloor \quad (55)$$
$$= \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \binom{n-m}{k} - \binom{n}{k} + \left[\binom{n}{k} - \left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil\right] \quad (56)$$
$$= \binom{n-m}{k}. \quad (57)$$

Therefore, we can always find such a pair of disjoint sets $(\mathcal{O}_A, \mathcal{O}_B)$. Next, we can construct an algorithm $\mathcal{A}^*$ as follows:

$$\mathcal{A}^*(s, x) = \begin{cases} y, & \text{if } s \in \mathcal{O}_A \cup (\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o) \\ z, & \text{if } s \in \mathcal{O}_B \cup (\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o) \\ i, \; i \neq y \text{ and } i \neq z, & \text{otherwise.} \end{cases} \quad (58)$$

We can show that $\mathcal{A}^*$ has the following properties:

$$p_y = \Pr(\mathcal{A}^*(\mathbf{X}, x) = y) = \frac{|\mathcal{O}_A| + |\mathcal{O}_{\mathcal{C}} - \mathcal{O}_o|}{|\mathcal{O}_{\mathcal{C}}|} = \frac{\left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil}{\binom{n}{k}} \ge \underline{p_y}, \quad (59)$$
$$p_z = \Pr(\mathcal{A}^*(\mathbf{X}, x) = z) = \frac{|\mathcal{O}_B|}{|\mathcal{O}_{\mathcal{C}}|} = \frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} \le \overline{p_z}, \quad (60)$$

which implies $\mathcal{A}^*$ satisfies the probability conditions in (6). However, we also have:

$$p'_y - p'_z = \Pr(\mathcal{A}^*(\mathbf{Y}, x) = y) - \Pr(\mathcal{A}^*(\mathbf{Y}, x) = z) \quad (61)$$
$$= \frac{|\mathcal{O}_A|}{|\mathcal{O}'_{\mathcal{C}}|} - \frac{|\mathcal{O}_B| + |\mathcal{O}'_{\mathcal{C}} - \mathcal{O}_o|}{|\mathcal{O}'_{\mathcal{C}}|} \quad (62)$$
$$= \frac{\left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil + \binom{n-m}{k} - \binom{n}{k}}{\binom{n}{k}} - \frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor - \binom{n-m}{k} + \binom{n}{k}}{\binom{n}{k}} \quad (63)$$
$$= \frac{\left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil}{\binom{n}{k}} - \frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} - \left[2 - \frac{2\binom{n-m}{k}}{\binom{n}{k}}\right]. \quad (64)$$

Since $m > m^*$, we have:

$$\frac{\left\lceil \underline{p_y} \cdot \binom{n}{k} \right\rceil}{\binom{n}{k}} - \frac{\left\lfloor \overline{p_z} \cdot \binom{n}{k} \right\rfloor}{\binom{n}{k}} \le 2 - \frac{2\binom{n-m}{k}}{\binom{n}{k}}. \quad (65)$$

Therefore, we have $p'_y - p'_z \le 0$, which indicates $h(\mathcal{C}', x) \neq y$ or there exist ties.

To summarize, we have proven that in all possible cases Theorem 2 holds, indicating that our derived certified security level is tight.

Proof of Theorem 3
Based on the Clopper-Pearson method, for each testing example $x_t$, we have:

$$\Pr\left(\underline{p_{\hat{y}_t}} \le \Pr(\mathcal{A}(\mathcal{S}(\mathcal{C}, k), x_t) = \hat{y}_t) \wedge \overline{p_{\hat{z}_t}} \ge \Pr(\mathcal{A}(\mathcal{S}(\mathcal{C}, k), x_t) = i), \forall i \neq \hat{y}_t\right) \ge 1 - \frac{\alpha}{d}. \quad (66)$$

Therefore, for a testing example $x_t$, if our Algorithm 2 does not abstain for $x_t$, the probability that it returns an incorrect certified security level is at most $\alpha/d$. Formally, we have the following:

$$\Pr\left((\exists \mathcal{C}', M(\mathcal{C}') \le \hat{m}^*_t, h(\mathcal{C}', x_t) \neq \hat{y}_t) \mid \hat{y}_t \neq \text{ABSTAIN}\right) \le \frac{\alpha}{d}. \quad (67)$$

Therefore, we have the following:

$$\Pr\left(\cap_{x_t \in \mathcal{D}} \left((\forall \mathcal{C}', M(\mathcal{C}') \le \hat{m}^*_t, h(\mathcal{C}', x_t) = \hat{y}_t) \mid \hat{y}_t \neq \text{ABSTAIN}\right)\right) \quad (68)$$
$$= 1 - \Pr\left(\cup_{x_t \in \mathcal{D}} \left((\exists \mathcal{C}', M(\mathcal{C}') \le \hat{m}^*_t, h(\mathcal{C}', x_t) \neq \hat{y}_t) \mid \hat{y}_t \neq \text{ABSTAIN}\right)\right) \quad (69)$$
$$\ge 1 - \sum_{x_t \in \mathcal{D}} \Pr\left((\exists \mathcal{C}', M(\mathcal{C}') \le \hat{m}^*_t, h(\mathcal{C}', x_t) \neq \hat{y}_t) \mid \hat{y}_t \neq \text{ABSTAIN}\right) \quad (70)$$
$$\ge 1 - d \cdot \frac{\alpha}{d} \quad (71)$$
$$= 1 - \alpha.$$