CaPC Learning: Confidential and Private Collaborative Learning
Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang
Christopher A. Choquette-Choo∗, Natalie Dullerud∗, Adam Dziedzic∗
University of Toronto and Vector Institute
{christopher.choquette.choo,natalie.dullerud}@mail.utoronto.ca, [email protected]
Yunxiang Zhang∗†
The Chinese University of Hong Kong
[email protected]
Somesh Jha‡
University of Wisconsin-Madison and XaiPient
[email protected]
Nicolas Papernot‡
University of Toronto and Vector Institute
[email protected]
Xiao Wang‡
Northwestern University
[email protected]

ABSTRACT
Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other's data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce
Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties.
1 INTRODUCTION
The predictions of machine learning (ML) systems often reveal private information contained in their training data (Shokri et al., 2017; Carlini et al., 2019) or test inputs. Because of these limitations, legislation increasingly regulates the use of personal data (Mantelero, 2013). The relevant ethical concerns prompted researchers to invent ML algorithms that protect the privacy of training data and confidentiality of test inputs (Abadi et al., 2016; Konečný et al., 2016; Juvekar et al., 2018).

∗ Equal contributions, authors ordered alphabetically.
† Work done while the author was at Vector Institute.
‡ Equal contributions, authors ordered alphabetically.
Figure 1:
Confidential and Private Collaborative (CaPC) Learning Protocol: (1a) Querying party P_i* sends encrypted query q to each answering party P_i, i ≠ i*. Each P_i engages in a secure 2-party computation protocol to evaluate Enc(q) on M_i and outputs encrypted logits Enc(r_i). (1b) Each answering party P_i generates a random vector r̂_i and sends Enc(r_i − r̂_i) to the querying party P_i*, who decrypts to get r_i − r̂_i. (1c) Each answering party P_i runs Yao's garbled circuit protocol (Y_i) with querying party P_i* to get s_i for P_i* and ŝ_i for P_i s.t. s_i + ŝ_i is the one-hot encoding of the argmax of the logits. (2) Each answering party sends ŝ_i to the content service provider (CSP). The CSP sums the ŝ_i from each P_i and adds Laplacian or Gaussian noise for DP. The querying party sums the s_i from each Y_i computation. (3) The CSP and the querying party run Yao's garbled circuit Y_s to obtain the argmax of the sum of the querying party's and the CSP's noisy shares. The label is output to the querying party.

Yet, these algorithms require a large dataset stored either in a single location or distributed amongst billions of participants. This is the case, for example, with federated learning (McMahan et al., 2017). Prior algorithms also assume that all parties are collectively training a single model with a fixed architecture. These requirements are often too restrictive in practice. For instance, a hospital may want to improve a medical diagnosis for a patient using data and models from other hospitals. In this case, the data is stored in multiple locations, and there are only a few parties collaborating. Further, each party may also want to train models with different architectures that best serve their own priorities.

We propose a new strategy that lets fewer, heterogeneous parties learn from each other collaboratively, enabling each party to improve their own local models while protecting the confidentiality and privacy of their data. We call this
Confidential and Private Collaborative (CaPC) learning.

Our strategy improves on confidential inference (Boemer, 2020) and PATE, the private aggregation of teacher ensembles (Papernot et al., 2017). Through structured applications of these two techniques, we design a strategy for inference that enables participants to operate an ensemble of heterogeneous models, i.e., the teachers, without having to explicitly join each party's data or teacher model at a single location. This also gives each party control at inference, because inference requires the agreement and participation of each party. In addition, our strategy provides measurable confidentiality and privacy guarantees, which we formally prove. We use the running example of a network of hospitals to illustrate our approach. The hospitals participating in the CaPC protocol need guarantees on both confidentiality (i.e., data from a hospital can only be read by said hospital) and privacy (i.e., no hospital can infer private information about other hospitals' data by observing their predictions).

First, one hospital queries all the other parties over homomorphic encryption (HE), asking them to label an encrypted input using their own teacher models. This prevents the other hospitals from reading the input (Boemer et al., 2019), an improvement over PATE, and allows the answering hospitals to provide a prediction to the querying hospital without sharing their teacher models.

The answering hospitals use multi-party computation (MPC) to compute an aggregated label, and add noise during the aggregation to obtain differential privacy guarantees (Dwork et al., 2014). This is achieved by a content service provider (CSP), which then relays the aggregated label to the querying hospital. The CSP only needs to be semi-trusted: we operate under the honest-but-curious assumption. The use of MPC ensures that the CSP cannot decipher each teacher model's individual prediction, and the noise added via the noisy argmax mechanism gives differential privacy even when there are few participants. This is a significant advantage over prior decentralized approaches like federated learning, which require billions of participants to achieve differential privacy, because the sensitivity of the histogram used in our aggregation is lower than that of the gradients aggregated in federated learning. Unlike our approach, prior efforts involving few participants thus had to prioritize model utility over privacy and only guarantee confidentiality (Sheller et al., 2020).

Finally, the querying hospital can learn from this confidential and private label to improve their local model. Since the shared information is a label rather than a gradient, as used by federated learning, CaPC participants do not need to share a common model architecture; in fact, their architectures can vary throughout their participation in the protocol. This favors model development to a degree that is not possible in prior efforts such as federated learning.

We show how participants can instantiate various forms of active and online learning with the labels returned by our protocol: each party participating in the CaPC protocol may (a) identify deficiencies of its model throughout its deployment and (b) finetune the model with labels obtained by interacting with other parties. Intuitively, we achieve the analog of a doctor querying colleagues for a second opinion on a difficult diagnosis, without having to reveal the patient's medical condition.
This protocol leads to improvements in both the accuracy and fairness (when there is a skew in the data distribution of each participating hospital) of model predictions for each of the CaPC participants.

To summarize, our contributions are the following:
• We introduce CaPC learning: a confidential and private collaborative learning platform that provides both confidentiality and privacy while remaining agnostic to ML techniques.
• Through a structured application of homomorphic encryption, secure MPC, and private aggregation, we design a protocol for CaPC. We use two-party deep learning inference and design an implementation of the noisy argmax mechanism with garbled circuits.
• Our experiments on SVHN and CIFAR10 demonstrate that CaPC enables participants to collaborate and improve the utility of their models, even in the heterogeneous setting where the architectures of their local models differ, and when there are only a few participants.
• Further, when the distribution of data drifts across participating parties, we show that CaPC significantly improves fairness metrics, because querying parties benefit from knowledge learned by other parties on different data distributions, which is distilled in their predictions.
• We release the source code for reproducing all our experiments.
2 BACKGROUND
Before introducing CaPC, we first go over elements of cryptography and differential privacy that are required to understand it. Detailed treatment of these topics can be found in Appendices A and B.

2.1 CRYPTOGRAPHIC PRELIMINARIES FOR CONFIDENTIALITY
The main cryptographic tool used in CaPC is secure multi-party computation (MPC) (Yao, 1986). MPC allows a set of distrusting parties to jointly evaluate a function on their inputs without revealing anything beyond the output. In general, most practical MPC protocols can be classified into two categories: 1) generic MPC protocols that can compute any function with the above security goal (Malkhi et al., 2004); and 2) specialized MPC protocols that can be used to compute only selected functions (e.g., private set intersection (Pinkas et al., 2020), secure machine learning (Mohassel & Zhang, 2017)). Although specialized MPC protocols are less general, they are often more efficient in execution time. Protocols in both categories use similar cryptographic building blocks, including (fully) homomorphic encryption (Gentry, 2009), secret sharing (Shamir, 1979), oblivious transfer (Rabin, 2005), and garbled circuits (Yao, 1986). To understand our protocol, it is not necessary to know all details about these cryptographic building blocks, and thus we describe them in Appendix A.1. Our work uses these cryptographic preliminaries for secure computation at prediction time, unlike recent approaches, which explore new methods to achieving confidentiality at training time (Huang et al., 2020a;b).

The cryptographic protocol designed in this paper uses a specialized MPC protocol for securely evaluating a private ML model on private data, and a generic two-party computation protocol to compute an argmax in different forms. For the generic two-party computation, we use the classical Yao's garbled-circuit protocol, which can compute any function expressed as a Boolean circuit. For secure classification of neural networks, our protocol design is flexible enough to work with most existing protocols (Boemer et al., 2020; 2019; Gilad-Bachrach et al., 2016; Mishra et al., 2020). Existing protocols differ mainly in how they handle linear layers (e.g., convolutions) and non-linear layers (e.g., ReLU). For instance, one can perform all computations using a fully homomorphic encryption scheme, resulting in low communication but very high computation, or using classical MPC techniques with more communication but less computation. Other works (Juvekar et al., 2018) use a hybrid of both and thus enjoy further improvements in performance (Mishra et al., 2020). We discuss this in more detail in Appendix A.2.
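Among these building blocks, additive secret sharing is the one our protocol relies on most directly. The following minimal Python sketch (all names and parameter choices are ours, for illustration only) shows 2-of-2 additive sharing over a prime field: each share alone is uniformly random and reveals nothing about the secret.

```python
import secrets

P = 2**61 - 1  # illustrative prime; in CaPC the field is set by the CKKS parameters

def share(x: int) -> tuple[int, int]:
    """Split x into two additive shares over the field Z_P."""
    r = secrets.randbelow(P)        # uniformly random mask, kept by one party
    return (x - r) % P, r           # the two shares sum to x modulo P

def reconstruct(s0: int, s1: int) -> int:
    """Recombine both shares to recover the secret."""
    return (s0 + s1) % P

s0, s1 = share(42)
assert reconstruct(s0, s1) == 42    # either share alone is uniform over Z_P
```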
2.2 DIFFERENTIAL PRIVACY
Differential privacy is the established framework for measuring the privacy leakage of a randomized algorithm (Dwork et al., 2006). In the context of machine learning, it requires the training algorithm to produce statistically indistinguishable outputs on any pair of datasets that differ by only one data point. This implies that an adversary observing the outputs of the training algorithm (e.g., the model's parameters, or its predictions) can improve its guess by at most a bounded probability when inferring properties of the training data points. Formally, we have the following definition.
Definition 1 (Differential Privacy). A randomized mechanism M with domain D and range R satisfies (ε, δ)-differential privacy if for any subset S ⊆ R and any adjacent datasets d, d′ ∈ D, i.e., ‖d − d′‖ ≤ 1, the following inequality holds:

Pr[M(d) ∈ S] ≤ e^ε Pr[M(d′) ∈ S] + δ.    (1)

In our work, we obtain differential privacy by post-processing the outputs of an ensemble of models with the noisy argmax mechanism of Dwork et al. (2014) (for more details on differential privacy, please refer to Appendix B), à la PATE (Papernot et al., 2017). We apply the improved analysis of PATE (Papernot et al., 2018) to compute the privacy guarantees obtained (i.e., a bound on ε). Our technique differs from PATE in that each of the teacher models is trained by a different party, whereas PATE assumes a centralized learning setting where all of the training and inference is performed by a single party. Note that our technique is used at inference time, which differs from recent works in differential privacy that compare neuron pruning during training with mechanisms satisfying differential privacy (Huang et al., 2020c). We use cryptography to securely decentralize computations.
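As a concrete illustration of the noisy argmax mechanism, the sketch below adds Laplace noise to a histogram of teacher votes and releases only the index of the largest noisy count. The noise scale shown (sensitivity/ε) is illustrative; CaPC calibrates the noise via the PATE analysis rather than this simple rule.

```python
import numpy as np

def noisy_argmax(votes: np.ndarray, epsilon: float, sensitivity: float = 1.0) -> int:
    """Report-noisy-max: perturb each vote count with Laplace noise, release the argmax."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon, size=votes.shape)
    return int(np.argmax(votes + noise))

votes = np.array([12, 30, 28])            # per-class label counts from an ensemble
label = noisy_argmax(votes, epsilon=2.0)  # only this single label is ever released
```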
3 THE CAPC PROTOCOL
We now introduce our protocol for achieving both confidentiality and privacy in collaborative (CaPC) learning. To do so, we formalize and generalize our example of collaborating hospitals from Section 1.
3.1 PROBLEM DESCRIPTION
A small number of parties {P_i}, i ∈ [1, K], each holding a private dataset D_i = {(x_j, y_j or ∅) : j ∈ [1, N_i]} and capable of fitting a predictive model M_i to it, wish to improve the utility of their individual models via collaboration. Due to the private nature of the datasets in question, they cannot directly share data or by-products of data (e.g., model weights) with each other. Instead, they will collaborate by querying each other for labels of the inputs about which they are uncertain. In the active learning paradigm, one party P_i* poses queries in the form of data samples x, and all the other parties {P_i}, i ≠ i*, together provide answers in the form of predicted labels ŷ. Each model {M_i}, i ∈ [1, K], can be exploited in both the querying phase and the answering phase, with the role of querying party alternating between the participants {P_i}, i ∈ [1, K], in the protocol.

Threat Model.
To obtain the strong confidentiality and privacy guarantees that we described, we require a semi-trusted third party called the content service provider (CSP). We assume that the CSP does not collude with any party and that the adversary can corrupt any subset of C parties {P_i}, i ∈ [1, C]. When more than one party is corrupted, this has no impact on the confidentiality guarantee, but the privacy budget ε obtained will degrade by a factor proportional to C because the sensitivity of the aggregation mechanism increases (see Section 3.3). We work in the honest-but-curious setting, a commonly adopted assumption in cryptography which requires the adversary to follow the protocol description correctly while trying to infer information from the protocol transcript.
3.2 CAPC PROTOCOL DESCRIPTION
Our protocol introduces a novel formulation of the private aggregation of teachers, which implements two-party confidential inference and secret sharing to improve upon the work of Papernot et al. (2017) and guarantee confidentiality. Recall that the querying party P_i* initiates the protocol by sending an encrypted input x to all answering parties P_i, i ≠ i*. We use sk and pk to denote the secret and public keys owned by party P_i*. The proposed protocol consists of the following steps (a plaintext mock-up of the full flow is sketched below):

1. For each i ≠ i*, P_i (with model parameters M_i as its input) and P_i* (with x, sk, pk as its input) run a secure two-party protocol. As the outcome, P_i obtains ŝ_i and P_i* obtains s_i such that s_i + ŝ_i = OneHot(argmax(r_i)), where r_i are the predicted logits. This step can be achieved as follows:
   a) P_i* and P_i run a secure two-party ML classification protocol such that P_i* learns nothing while P_i learns Enc_pk(r_i), where r_i are the predicted logits.
   b) P_i generates a random vector r̂_i, performs the computation Enc_pk(r_i) − Enc_pk(r̂_i) = Enc_pk(r_i − r̂_i) on the encrypted data, and sends the encrypted difference to P_i*, who decrypts and obtains (r_i − r̂_i).
   c) P_i (with r̂_i as input) and P_i* (with r_i − r̂_i as input) engage in Yao's two-party garbled-circuit protocol to obtain vector s_i for P_i* and vector ŝ_i for P_i, such that s_i + ŝ_i = OneHot(argmax(r_i)).
2. Each P_i sends ŝ_i to the CSP. The CSP computes ŝ = Σ_{i≠i*} ŝ_i + DPNoise(ε), where DPNoise() is element-wise Laplacian or Gaussian noise whose variance is calibrated to obtain a desired differential privacy guarantee ε; meanwhile, P_i* computes s = Σ_{i≠i*} s_i.
3. The CSP and P_i* engage in Yao's two-party garbled-circuit protocol for computing the argmax: P_i* gets argmax(ŝ + s) and the CSP gets nothing.
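To make the data flow concrete, here is a plaintext mock-up of steps 1-3 (no encryption or garbled circuits; all names and parameters are ours, for illustration). It shows how the per-party shares recombine to the vote histogram, how the CSP's DP noise rides on its aggregated share, and why the querying party only ever sees the final argmax.

```python
import numpy as np

P = 2**61 - 1                   # illustrative prime field for the additive shares
N_CLASSES, N_PARTIES, SIGMA = 10, 5, 2.0
rng = np.random.default_rng(0)

def one_hot(idx: int, n: int) -> np.ndarray:
    v = np.zeros(n, dtype=np.int64)
    v[idx] = 1
    return v

# Step 1: each answering party splits OneHot(argmax(r_i)) into shares s_i, s_hat_i.
def make_shares(logits: np.ndarray):
    onehot = one_hot(int(np.argmax(logits)), N_CLASSES)
    s_hat = rng.integers(0, P, size=N_CLASSES)   # kept by the answering party
    s = (onehot - s_hat) % P                     # learned by the querying party
    return s, s_hat

shares = [make_shares(rng.normal(size=N_CLASSES)) for _ in range(N_PARTIES)]

# Step 2: the querying party sums its shares; the CSP sums the answering
# parties' shares and adds integer-rounded Gaussian noise for DP.
s_sum = np.zeros(N_CLASSES, dtype=np.int64)
s_hat_sum = np.zeros(N_CLASSES, dtype=np.int64)
for s, s_hat in shares:
    s_sum = (s_sum + s) % P                      # reduce as we go to avoid overflow
    s_hat_sum = (s_hat_sum + s_hat) % P
dp_noise = np.rint(rng.normal(0.0, SIGMA, size=N_CLASSES)).astype(np.int64)
s_hat_sum = (s_hat_sum + dp_noise) % P

# Step 3: in CaPC this recombination happens inside a garbled circuit, so the
# noisy histogram is never revealed -- only its argmax is output to the querier.
hist = (s_sum + s_hat_sum) % P
signed = np.where(hist > P // 2, hist - P, hist)  # undo wraparound of negative noise
label = int(np.argmax(signed))
```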
Next, we elaborate on the confidentiality and privacy guarantees achieved by CaPC.

3.3 CONFIDENTIALITY AND DIFFERENTIAL PRIVACY GUARANTEES

Confidentiality Analysis.
We prove in Appendix E that the above protocol reveals nothing to P_i or the CSP and only reveals the final noisy result to P_i*. The protocol is secure against a semi-honest adversary corrupting any subset of parties. Intuitively, the proof follows from the security of the underlying components, including the two-party classification protocol, secret sharing, and Yao's garbled circuit protocol. As discussed in Section 4.1 and Appendix A.1, for secret sharing of unbounded integers, we need to make sure the random padding is picked from a domain much larger than the maximum possible value being shared. Given the above, a corrupted P_i* cannot learn anything about M_i of an honest party due to the confidentiality guarantee of the secure classification protocol; similarly, the confidentiality of x against a corrupted P_i is also protected. Intermediate values are all secretly shared (and only recovered within garbled circuits), so they are not visible to any party.

Differential Privacy Analysis.
Here, any potential privacy leakage in terms of differential privacy is incurred by the answering parties {P_i}, i ≠ i*, for their datasets {D_i}, i ≠ i*, because these parties share the predictions of their models. Before sharing these predictions with P_i*, we follow the PATE protocol: we compute the histogram of label counts ŷ, then add Laplacian or Gaussian noise using a sensitivity of 1, and finally return the argmax of the noisy histogram to P_i*. Since P_i* only sees this noisily aggregated label, both the data-dependent and data-independent differential privacy analyses of PATE apply to P_i* (Papernot et al., 2017; 2018). Thus, when there are enough parties with high consensus, we can obtain a tighter bound on the privacy budget ε, as the true plurality will more likely be returned (refer to Appendix B for more details on how this is achieved in PATE). This setup assumes that only one answering party can be corrupted. If instead C parties are corrupted, the sensitivity of the noisy aggregation mechanism will be scaled by C and the privacy guarantee will deteriorate accordingly. There is no privacy leakage to the CSP; it does not receive any part of the predictions from {P_i}, i ≠ i*.
4 EXPERIMENTS

CaPC aims to improve the model utility of collaborating parties by providing them with new labelled data for training their respective local models. Since we designed the CaPC protocol with techniques for confidentiality (i.e., confidential inference and secret sharing) and differential privacy (i.e., private aggregation), our experiments consider the following three major dimensions:
1. How well does collaboration improve the model utility of all participating parties?
2. What requirements are there to achieve privacy, and how can these be relaxed under different circumstances? What is the trade-off between the privacy and utility provided by CaPC?
3. What is the resulting computational cost of ensuring confidentiality?
4.1 IMPLEMENTATION
We use the HE-transformer library with MPC (MP2ML) by Boemer (2020) in step 1a of our protocol for confidential two-party deep learning inference. To make our protocol flexible to any private inference library, not just those that return the label predicted by the model (HE-transformer only returns logits), we incorporate steps 1b and 1c of the protocol outside of the private inference library. The EMP toolkit (Wang et al., 2016) for generic two-party computation is used to compute the operations, including argmax and sum, via garbled circuits. To secret share the encrypted values, we first convert them into integers over a prime field according to the CKKS parameters, and then perform secret sharing on that domain to obtain perfect secret sharing. We use the single largest logit value for each M_i, obtained on its training set D_i in plain text, to calculate the necessary noise.
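The conversion step above can be pictured as fixed-point encoding into a prime field. The sketch below is our own illustration (the SCALE and P values are placeholders; in CaPC they are fixed by the CKKS parameters): logits become field elements, with the upper half of the field standing in for negative values.

```python
SCALE = 2**12   # fixed-point precision (placeholder; set by the CKKS parameters)
P = 2**61 - 1   # plaintext prime field (placeholder)

def encode(x: float) -> int:
    """Map a real-valued logit to a field element via fixed-point rounding."""
    return round(x * SCALE) % P

def decode(e: int) -> float:
    """Invert encode(), reading the upper half of the field as negatives."""
    v = e if e <= P // 2 else e - P
    return v / SCALE

assert abs(decode(encode(-3.25)) - (-3.25)) < 1 / SCALE
```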
4.2 EVALUATION SETUP

Collaboration.
We use the following setup for experiments unless otherwise noted. We uniformly sample from the training set in use (for the SVHN dataset, we combine its original training set and extra set to get a larger training set), without replacement, to create disjoint partitions D_i of equal size and identical data distribution for each party. We select K = 50 and K = 250 as the number of parties for CIFAR10 and SVHN, respectively (the number is larger for SVHN because we have more data). We select Q = 3 querying parties P_i* and similarly divide part of the test set into Q separate private pools for each P_i* to select queries from, until their privacy budget of ε is reached (using Gaussian noise with σ = 40 on SVHN and on CIFAR10). We are left with , and , evaluation data points from the test sets of CIFAR10 and SVHN, respectively. We fix ε = 2 and for SVHN and CIFAR10, respectively (which leads to ≈ queries per party), and report accuracy on the evaluation set. Querying models are retrained on their D_i plus the newly labelled data; the difference in accuracies is their accuracy improvement.

We use shallower variants of VGG, namely VGG-5 and VGG-7 for CIFAR10 and SVHN, respectively, to accommodate the small size of each party's private dataset. We instantiate VGG-7 with 6 convolutional layers and one final fully-connected layer, so there are 7 functional layers overall. Similarly, VGG-5 has 4 convolutional layers followed by a fully-connected layer. The ResNet-10 architecture starts with a single convolutional layer, followed by 4 basic blocks with 2 convolutional layers in each block, and ends with a fully-connected layer, giving 10 functional layers in total. The ResNet-8 architecture that we use excludes the last basic block and increases the number of neurons in the last (fully-connected) layer. We present more details on the architectures in Appendix F.2.

We first train local models for all parties using their non-overlapping private datasets. Next, we run the CaPC protocol to generate query-answer pairs for each querying party. Finally, we retrain the local model of each querying party using the combination of their original private dataset and the newly obtained query-answer pairs. We report the mean accuracy and class-specific accuracy averaged over 5 runs for all retrained models, where each run uses a different random seed.
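The uniform partitioning described above amounts to shuffling indices and slicing; a minimal sketch (function name and seed are ours, for illustration):

```python
import numpy as np

def disjoint_partitions(n_examples: int, n_parties: int, seed: int = 0):
    """Shuffle example indices and split them into equal-size disjoint partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    usable = n_examples - n_examples % n_parties   # drop the remainder for equal sizes
    return np.split(idx[:usable], n_parties)

partitions = disjoint_partitions(50_000, 50)       # e.g., CIFAR10 with K = 50 parties
```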
Heterogeneity and Data Skew. Where noted, our heterogeneous experiments (recall that this is a newly applicable setting that CaPC enables) use VGG-7, ResNet-8, and ResNet-10 architectures for K parties each. One model of each architecture is used for each of the Q = 3 querying parties. Our data skew experiments use fewer data samples for the classes 'horse', 'ship', and 'truck' on CIFAR10 and fewer data for two of the classes on SVHN. In turn, unfair ML algorithms perform worse on these specific classes, leading to worse balanced accuracy (see Appendix D). We adopt balanced accuracy instead of other fairness metrics because the datasets we use have no sensitive attributes, making them inapplicable. We employ margin, entropy, and greedy k-center active learning strategies, one of which is illustrated below.
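As one example of the acquisition strategies just named, entropy-based selection keeps the pool points whose predictive distributions are most uncertain. A minimal sketch (names and shapes are ours, for illustration):

```python
import numpy as np

def entropy_select(probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k pool points with highest predictive entropy."""
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-ent)[:k]

# probs: (n_pool, n_classes) softmax outputs of the querying party's local model
probs = np.random.dirichlet(np.ones(10), size=100)
query_indices = entropy_select(probs, k=5)   # these points are sent as CaPC queries
```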
4.3 COLLABORATION ANALYSIS
We first investigate the benefits of collaboration for improving each party's model performance in several different settings, namely: homogeneous and heterogeneous model architectures across querying and answering parties, and uniform and non-uniform data sampling for training data. From these experiments, we observe: increased accuracy in both homogeneous and heterogeneous settings for all model architectures (Section 4.3.1), and improved balanced accuracy when there is data skew between parties, i.e., non-uniform private data (Section 4.3.2).
4.3.1 UNIFORMLY SAMPLED PRIVATE DATA
The first setting we consider is a uniform distribution of data amongst the parties; there is no data drift among parties. Our setup for the uniform data distribution experiments is detailed in Section 4.2. We evaluate the per-class and overall accuracy before and after CaPC in both homogeneous and heterogeneous settings on the CIFAR10 and SVHN datasets.

In Figure 2, we see a consistent increase in accuracy for each class and overall, in terms of mean accuracy across all parties on the test sets. We observe these improvements in both the homogeneous and heterogeneous settings for both datasets tested. As demonstrated in Figure 2, there is a greater climb in mean accuracy in the heterogeneous setting than in the homogeneous setting on SVHN. Figures 5, 6, and 7 provide a breakdown of the benefits obtained by each querying party. We can see from these figures that all querying parties observe an increase in overall accuracy in heterogeneous and homogeneous settings with both datasets; additionally, the jump in accuracy is largely constant across different model architectures. In only a small fraction of all cases were any class-specific accuracies degraded, and even then the models still showed a net increase in overall accuracy.