Fairness in the Eyes of the Data: Certifying Machine-Learning Models
Shahar Segal, Yossi Adi, Benny Pinkas, Carsten Baum, Chaya Ganesh, Joseph Keshet
Tel-Aviv University, Bar-Ilan University, Aarhus University, IISc Bangalore
Abstract
We present a framework that allows certifying the fairness degree of a model based on an interactive and privacy-preserving test. The framework verifies any trained model, regardless of its training process and architecture. Thus, it allows us to evaluate any deep learning model on multiple fairness definitions empirically. We tackle two scenarios, where either the test data is privately available only to the tester or is publicly known in advance, even to the model creator. We investigate the soundness of the proposed approach using theoretical analysis and present statistical guarantees for the interactive test. Finally, we provide a cryptographic technique to automate fairness testing and certified inference with only black-box access to the model at hand while hiding the participants' sensitive data.
1 Introduction

Machine learning systems are increasingly being used to inform and influence decisions about people, leading to algorithmic outcomes that have powerful personal and societal consequences. For instance, decisions such as (i) is an individual likely to commit another crime? [2]; or (ii) is an individual likely to default on a loan? [31] are made using algorithmic predictions. This can be concerning given the many documented cases of models amplifying bias and discrimination from the training data [6, 21, 8, 29]. To address this formally, a line of recent works [10, 14, 18, 20] considers fairness in classification by proposing notions of fairness based on similarity measures and formalizing variants of this notion that provide guarantees against discrimination.

One common scenario in which such discrimination could potentially happen is a setting with a client and a server. The server classifies queries by a client in an automated way using a machine learning model generated by it. On the other hand, the client wants to make sure its queries are treated fairly and its sensitive data is kept private. If the model itself is not a secret, then a client can potentially run tests (such as the ones implied by the references above) on the model to establish its purported fairness without exposing its data. Making a model public, however, is not always in the interest of the server, since it has invested resources such as expertise, data and computation time for the training, and therefore often wants the model to remain proprietary. Moreover, sharing models may in some cases raise security or privacy concerns. It therefore may be deemed appropriate or necessary to outsource any such test to a semi-trusted third party such as a government entity, which would inspect a model and certify its fairness. This raises our first question:
Question 1: Can we design a framework for certifying the fairness of models, giving guarantees to clients while being practically realizable and keeping the model secret from the clients?
Having such a third party relieves the client from testing fairness, but actually just shifts responsibility to someone who might be more qualified to make a judgement about the model. To minimize the necessary trust between the model owner and the third party, such a test would still be restricted to a black-box scenario. In addition, constructing such a test for establishing fairness guarantees can be difficult on its own. Given that the sources and amount of data are limited, it might be that the third party can only use data in the fairness test that the model owner is familiar with. This might enable the model owner to design an unfair model which successfully passes the examination by the third party. This raises our second question:

Question 2: Can we design a black-box fairness test that gives guarantees even if the test set is (partially) known?
Here, by a black-box test we mean that a test should only query the model M on different inputs but should not make any assumptions about the actual model parameters.

Our contributions
In this work, we answer both questions affirmatively. We design an architecture for three (or more) participants, which are the model owner (or "server") S, the client C and a trusted third party R (also called "regulator"). Our architecture uses techniques from cryptography to construct secure protocols for

1. An interactive test between S and R allowing to establish with probability 1 − δ that a model M provided by S is fair with respect to a set of pre-defined groups, while ensuring that R does not learn M. This test considers scenarios where S is or is not aware of the test data. R is not involved in the training of M; it only performs certification.

2. An interactive computation between S and C which computes a prediction ŷ = M(x) from an input x and a model M. The interactive computation neither leaks M to C nor x to S, and yet makes sure that the model that was used in the prediction has been certified by R.

Our work provides the fairness tests necessary for these protocols, which have only black-box access to the model, and uses existing highly efficient cryptographic primitives to implement the tests securely. While we motivate the underlying ideas of these tests on an intuitive level and give formal arguments for their soundness, we also provide experimental evidence that the hypotheses that make our tests possible are viable. Since secure and privacy-preserving computation of models, for both training and inference, is a very active research area (e.g. [27, 16, 28, 24]), the performance of the current solutions in this field is continuously improving. As our work assumes the existence of secure protocols for inference, and investigates how to add fairness on top of these in a generic way that is independent of the underlying training algorithm, our approach will benefit in practicality from any independent progress that is made in this direction.

Related work
Fairness in algorithms was first investigated by Friedman and Nissenbaum [12]. Since then, further research into data as a source of unfairness in ML decisions has been done, e.g. in [17, 7]. Baluta et al. [4] showed how to verify properties of a DNN (fairness among them). In their work they encode the network into
Conjunctive Normal Forms and then test whether it will likely fulfill certain logical constraints. In comparison, our approach is independent of the concrete model parameters and architecture.

Recently, Kilbertus et al. [19] suggested using cryptographic primitives for fairness certification, fair model training, and model decision verification. However, this work was mainly focused on model training, and did not provide analysis and guarantees for model fairness certification and verification. In contrast, our study focuses on fairness certification of existing models, which we analyze from a theoretical and practical point of view while providing guarantees based on the number of samples available in the test set. We also explore a different scenario where these samples are known to S during model training, which makes the certification harder.

Several statistical measures of unfairness and fairness criteria are studied in [11, 32]. These and subsequent works achieve statistical notions of fairness through post-processing the training data, and/or by enforcing constraints at training time. Our work differs from this line of research in that we want to guarantee fairness which is enforced obliviously of the training process. Dwork et al. [10] show that statistical notions of fairness are inadequate, while [8] established that calibration does not rule out unfair decisions. These results emphasize that fairness is nuanced, complicated, application-specific, and can depend on legal and social contexts. In this work, we answer the orthogonal question of designing a fairness test for a model, given an accepted definition of fairness.

2 Preliminaries
Let X be the set of possible inputs, G be a finite set of groups that are relevant for fairness (e.g., ethnic groups) and Y be a finite set of labels. We suppose X × G × Y is drawn from a probability space Ω with an unknown distribution D. Let M be a trained model for a classification task of D; we denote by M(x) the classification of input x ∈ X.

The goal of training a model M is usually to achieve low error on unseen data. In addition, when dealing with model fairness, we also take into account a measurement with respect to G. While there are plenty of fairness measurements [30], here we focus on group risk and likelihood based definitions, specifically: overall risk equality, equalized odds and demographic parity. First, we define the conditional risk and likelihood respectively:

\ell_g(M) = \mathbb{E}_{(x,g',y')\sim\mathcal{D}}\big[\mathbb{1}\{M(x)\neq y'\} \mid g'=g\big]   (Risk)

\ell_{g,y}(M) = \mathbb{E}_{(x,g',y')\sim\mathcal{D}}\big[\mathbb{1}\{M(x)\neq y'\} \mid g'=g,\ y'=y\big]   (Risk with label condition)

L_{g,y}(M) = \mathbb{E}_{(x,g',y')\sim\mathcal{D}}\big[\mathbb{1}\{M(x)= y\} \mid g'=g\big]   (Likelihood)

where 𝟙{π} is an indicator function with a predicate π. The empirical conditional risk is defined for a given independent sample set T = {(x_1, g_1, y_1), ..., (x_m, g_m, y_m)} ∼ D^m as:

\bar{\ell}_g(M, T) = \frac{1}{m_g}\sum_{i=1}^{m}\mathbb{1}\{M(x_i)\neq y_i \wedge g_i = g\}   (1)

where m_g is the number of samples in T from group g.

We define a metric called the fairness gap to be the maximal margin between any two groups (and labels). Formally, we use three well-known measurements:

\max_{g_1,g_2\in\mathcal{G}} \left|\ell_{g_1}(M) - \ell_{g_2}(M)\right|   (Overall Risk Equality)

\max_{g_1,g_2\in\mathcal{G},\, y\in\mathcal{Y}} \left|\ell_{g_1,y}(M) - \ell_{g_2,y}(M)\right|   (Equalized Odds)

\max_{g_1,g_2\in\mathcal{G},\, y\in\mathcal{Y}} \left|L_{g_1,y}(M) - L_{g_2,y}(M)\right|   (Demographic Parity)

Likewise, the empirical fairness gap (EFG) is defined using the empirical approximation of each measurement respectively.

Lastly, we consider a model M as ε-fair on (G, D) with respect to a fairness measurement if its fairness gap is smaller than ε. We relax the definition by demanding ε-fairness with confidence 1 − δ, which in practice is inherent due to the limited sample set from D. That is, a model M is ε-fair on (G, D) under the overall risk equality metric if:

\Pr\left[\max_{g_1,g_2\in\mathcal{G}} \left|\ell_{g_1}(M) - \ell_{g_2}(M)\right| > \epsilon\right] \leq \delta   (2)

Similarly, plugging in the fairness gap metric for equalized odds and demographic parity yields the corresponding ε-fairness definition of each of them.

3 The Framework

In Section 1 we gave an overview of the different participants in our setting. In this section we will make their roles more explicit and describe the security guarantees that are given to each of them, as well as the trust relations. Note that we discuss our framework with respect to three participants, but it can easily be generalized to any larger number. In particular, it allows for a large number of regulators {R_i}_{i=1}^{k} that a client C can choose from, or C can even perform the regulator's role by itself.
• The Server S initially generates the model M. Its main objective is to keep M secret.
• The Client C has a private input x and wishes to obtain ŷ ← M(x), where the server provides M (depending on the application, S might receive ŷ as well, but not x). The objective of C is to ensure that M is fair while keeping x private.
• The third participant is the Regulator R, who should learn neither M nor x or y. After S proves the fairness of model M, R outputs a certificate cert_M for the model to attest its validity. cert_M is tied to another certificate issued by R, cert_ID, which serves as the identity of R. Thus, when cert_M is shown to C, it can verify that indeed R certified the model using cert_ID. In addition, R has access to the sample set T in order to check fairness, which is possibly known to S.

The certificates cert_ID, cert_M are implemented using a digital signature scheme and a collision-resistant hash function. Roughly, cert_ID is a public verification key that is tied to the identity of R, and cert_M is a signature on a compressed version of the model that is computed using a collision-resistant hash function. Here, a cryptographic signature ensures that only R could issue cert_M, while the hash function forces S to use the same model with C that it used when obtaining cert_M. We define signature schemes and collision-resistant hashing in Appendix A.
Figure 1: Certification and verification. (1) R sends its certification data cert_ID to S and C; (2) R verifies the fairness of the model M; (3) R certifies fairness as cert_M; (4) C checks whether the model M′ is certified by cert_ID and cert_M; (5) S classifies according to the model M′.

At the beginning of the protocol, R generates its public certificate cert_ID and makes it available. S and R then interact to generate the model certificate cert_M for model M. In the process R is allowed to query M an arbitrary number of times to ensure fairness. To perform an inference by S and C, both first agree on a regulator certificate cert_ID that they will use. Then, C obtains an output ŷ based on its input x and on a model M′ provided by S. Here, C only accepts ŷ if M′ is certified for fairness by the regulator behind cert_ID. C does not learn anything about M′, besides ŷ. Fig. 1 describes the aforementioned process schematically.

Both the inference on M and the verification of cert_M that are necessary in Fig. 1 can be done using secure computation. Secure computation lets parties perform computations that reveal neither their inputs nor any intermediate values, but only the outputs of the computation. This could be done easily if S, R and C had access to a "trusted third party" F_SC which performs the computational task for them. That trusted third party would receive inputs from the participants, do the computation, and send the output back to them. In our case, F_SC would receive all secret inputs from the participants and send cert_M to S after verifying that M is fair. F_SC can also send M(x) to C if the model M′ is certified in that manner.

Such a "trusted third party" does not exist; however, one can imitate it using secure computation. We provide more information on how to emulate F_SC in Appendix A.3, together with the aforementioned process and its security in Appendix B.

4 Fairness Tests

We introduce two interactive tests which allow R to determine if a model M is ε-fair:

1. The model M is queried using a sample set T which is unknown to S. We show that fairness guarantees about M can be given by the empirical fairness gap (EFG) and a lower bound on the minimal size of the set T with respect to each group g ∈ G.

2. The model M is queried using a sample set T̃ which is derived from a set T in an augmented and randomized fashion (namely, by making small changes to items from T). The set T as well as the augmentation algorithm are known to S in advance. We show that this test implies ε-fairness of M, given that the augmentation impacts fair and unfair models in a different way.

Both of the aforementioned tests are independent of the representation of M, make no requirement on its training algorithm and only require access to M(·) for different inputs. Looking ahead, this is precisely what allows us to perform those tests interactively between S and R without leaking M.

4.1 Testing with Private Data

We start by describing the first test, in which the model M is queried using a sample set T which is unknown to S. All definitions below are based on ε-fairness with confidence 1 − δ under the overall risk equality fairness metric; however, these can be modified to other group-based fairness definitions.

Given a sufficient amount of unknown samples from every group g ∈ G, we approximate the conditional risk of the model M as in (1). In order to achieve high confidence in the test and a close enough approximation, R needs to obtain enough samples from each group, i.e. a balanced test set T w.r.t. G.
This is inherent, as we cannot make claims about the behavior of M with respect to g without ever probing M on elements in g.

Denote the Empirical Fairness Gap as

EFG = \max_{g_1,g_2\in\mathcal{G}} \left|\bar{\ell}_{g_1}(M,T) - \bar{\ell}_{g_2}(M,T)\right|.

The following theorem states the conditions which guarantee that a model is ε-fair with confidence 1 − δ.

Theorem 1.
A model M is ε-fair with confidence 1 − δ if:

EFG < \epsilon \quad \text{and} \quad \min_{g\in\mathcal{G}} m_g \geq \frac{2}{(\epsilon - EFG)^2}\ln\frac{2|\mathcal{G}||\mathcal{Y}|}{\delta}   (3)

for T = {(x_1, g_1, y_1), ..., (x_m, g_m, y_m)} ∼ D^m, where m_g, as in Eq. (1), denotes the number of occurrences of g in T.
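As a concrete illustration of the certification condition, the following sketch computes the empirical group risks of Eq. (1), the EFG, and the sample-size requirement of Eq. (3) for overall risk equality. The ε and δ values and the helper names are illustrative assumptions, not artifacts of the paper:

```python
import numpy as np

def empirical_group_risks(y_true, y_pred, groups):
    """Empirical conditional risk (Eq. 1) and sample count m_g for every group."""
    risks, counts = {}, {}
    for g in np.unique(groups):
        mask = groups == g
        counts[g] = int(mask.sum())
        risks[g] = float((y_pred[mask] != y_true[mask]).mean())
    return risks, counts

def certify_risk_equality(y_true, y_pred, groups, eps=0.1, delta=0.05, n_labels=2):
    """Check the two conditions of Theorem 1 for the overall risk equality metric."""
    risks, counts = empirical_group_risks(y_true, y_pred, groups)
    efg = max(risks.values()) - min(risks.values())   # empirical fairness gap
    if efg >= eps:
        return False, efg
    min_mg = 2.0 / (eps - efg) ** 2 * np.log(2 * len(risks) * n_labels / delta)
    return min(counts.values()) >= min_mg, efg
```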
Demographic parity and equalized odds can be handled by using the corresponding EFG definition and minimizing over G × Y, counting m_g and m_{g,y} respectively in (3). The full proof of the above theorem can be found in Appendix C.1; it relies on a Hoeffding-bound argument.

In other terms, we require the EFG of the model to be smaller than ε, and the difference between ε and the EFG determines the minimum number of samples we require from each group to guarantee ε-fairness. We can see a natural trade-off in Theorem 1: a larger sample set is required to verify smaller fairness gaps, as indicated by ε. Thus, the more data you have, the easier it is to verify that the EFG is close to the actual fairness gap.

4.2 Testing with Public Data

The disadvantage of the aforementioned test is that all test data T must be hidden, so that the model generator S cannot use it to adapt M accordingly. In other settings, we would like to test for fairness using public data, which can be known to S. This setting is realistic in many scenarios. For example, if labelled data is costly, getting unique labelled data for a test will be difficult for R.

A straightforward argument against this approach is that once the data is publicly available, T is not chosen independently of M. Thus, a malicious S can create an unfair model that memorizes the set T and responds fairly on it, so that it passes the test outlined above. To counter such dishonest training, we need a method to alter the existing samples and force some sort of generalization abilities. We therefore define the notion of an augmentor. An augmentor applies random augmentations to the input which alter the sample but still preserve its label and group with high probability; here we use it to generate new samples for querying that, with high probability, were not seen during model training. This is a necessary but not sufficient condition in order to ensure a valid test for M. For example, consider an augmentor that only alters the first few pixels of the image. A model that simply ignores those pixels can still overfit on the rest of the image and pass any test.

Hence, in this work we suggest using a set of randomized augmentation functions to reduce the memorization capabilities of an adversary. For this, the assumption is that ε-fair models behave differently from unfair models when queried against the samples augmented by the augmentor. Then, this different behavior can be leveraged to expose the unfair nature of certain models. Our approach follows this assumption to construct a querying test set in the same fashion as the test from Section 4.1.

More specifically, define an algorithm augmentor aug : X × G × {0,1}^τ → X × G that gets as input a random string and a sample and outputs a new augmented sample. The label and group of the new sample should be the same as those of the original sample with high probability.

We re-define the conditional risk to be on an augmented sample from D. Formally:

\ell_{g,\mathrm{aug}}(M) = \mathbb{E}_{(x,g',y)\sim\mathcal{D},\ r\leftarrow\{0,1\}^\tau}\Big[\mathbb{1}\{M(\tilde{x})\neq y\} \ \Big|\ (\tilde{x},\tilde{g}) = \mathrm{aug}(x,g;r) \wedge g' = \tilde{g}\Big].

As mentioned before, there is no guarantee that samples augmented by aug yield better results than T itself. We need an additional assumption on the behavior of fair and unfair models when shown augmented samples from aug; thus we call a class of models M detectable if it fulfills that assumption.

Definition 1 ((ε, α, aug)-detectable fairness). Let A be an arbitrary training algorithm which outputs a model in M.
M has (ε, α, aug)-detectable fairness on D if there exists m ∈ ℕ such that for any T ∼ D^m and M ∼ A(D, T, aug, α), M is ε-fair if:

\max_{g_1,g_2\in\mathcal{G}} \left|\ell_{g_1,\mathrm{aug}}(M) - \ell_{g_2,\mathrm{aug}}(M)\right| \leq \alpha.

Definition 1 allows us to build an interactive test and to empirically find parameters ε, α, m and an augmentor for which it appears to hold. It also yields a non-trivial angle both for breaking our overall construction and for improving it. Notice that the above definition does not imply that all ε-fair models have this property, and some fair models will not be discovered because of that. In Section 5.2 we demonstrate that some models seem to be detectable. Additionally, we observe that the output of aug is not required to be indistinguishable from a new sample from D. In particular, the definition does not rule out that A is aware of the possible augmentations.

Let T̃ be the sample set T after augmenting each sample. Denote by \bar{\ell}_{g,\mathrm{aug}}(M, \tilde{T}) the empirical conditional risk and

EFG = \max_{g_1,g_2\in\mathcal{G}} \left|\bar{\ell}_{g_1,\mathrm{aug}}(M,\tilde{T}) - \bar{\ell}_{g_2,\mathrm{aug}}(M,\tilde{T})\right|.

We state the following.

Theorem 2.
Let M be a class of models with (ε, α, aug)-detectable fairness. Let T, T̃, aug and M ∈ M be as stated above. M is ε-fair with confidence 1 − δ if:

EFG < \alpha \quad \text{and} \quad \min_{g\in\mathcal{G}} m_g \geq \frac{2}{(\alpha - EFG)^2}\ln\frac{2|\mathcal{G}||\mathcal{Y}|}{\delta}.

In other words, we can certify ε-fairness of a model with high confidence assuming (ε, α, aug)-detectable fairness. The proof of this theorem is in Appendix C.2.

5 Experiments

We provide empirical evidence to demonstrate that the assumptions made for the fairness tests in Section 4 are meaningful.

We used six different datasets from various domains: visual (UTKFaces [33], LFW [15], Colored-MNIST [3], and a subset of CelebA [26]), tabular (Adult Income [22]) and spoken (TIMIT [13]). The datasets vary in size and in the disparity of minority groups, and as such some can be used to create fair or unfair models based on their empirical fairness gap (EFG). We demonstrate the variety of our datasets and detail the preprocessing in Appendix D.1.

5.1 Private Data Setting

Simulating the necessary setup for Section 4.1, we assume that R possesses a subset of secret samples to be used to certify a model M for fairness and accuracy. Naturally, we split the data into a training and a test subset. Setting ε-fairness and δ-confidence thresholds, we can certify whether a model is fair using the conditions in Theorem 1. A bottleneck of these conditions is our dependency on the size of the sample set. Datasets with a bigger sample set allow us to certify more (fair) models, while we were not able to certify a (fair) model if the sample set was too small, even if it is indeed truly fair under the chosen fairness metric.

We performed our test on the mentioned datasets with a fixed confidence parameter δ and three values of ε. For some tasks this gap and confidence level might be intolerable, but for others, such as gender prediction from a face image, which is the task set for UTKFace, LFW and CelebA, it is better than the existing empirical gaps between ethnicity groups of well-known service providers' models [6].

The test results for overall risk equality are shown in Fig. 2. As shown, out of the six datasets only C-MNIST and CelebA produced fair models during our training at the stricter ε thresholds, while UTKFace has a fair model for a larger ε. The LFW, Adult Income and TIMIT datasets are all below the threshold of all tests, either due to sample size or a large EFG. Therefore, we focus on the first three datasets, as they are the only ones to pass any of our tests.

(Footnote: We annotated 8,500 celebrities out of 10,177 in the dataset for ethnicity using Amazon Mechanical Turk. Three turkers annotated three images of each of the 8,500 celebrities, resulting in 177,683 images classified as either Asian, African, Caucasian or Other. The annotations can be downloaded from https://github.com/will/be/published/ .)

Table 1: Fairness test in the private and public settings on the three image datasets. "Regular" refers to the model trained fairly, while "Bias" refers to the biased sampled training.
                             Private Setting                        Public Setting
                      Accuracy          Risk Equality EFG    Accuracy          Risk Equality EFG
Dataset    Model      Regular   Bias    Regular   Bias       Regular   Bias    Regular   Bias
UTKFace    ResNet18   89.76     88.56   0.012     0.093      96.11     91.44   0.027     0.139
C-MNIST    LeNet      98.11     74.01   0.001     0.450      89.17     67.97   0.007     0.340
CelebA     ResNet18   97.63     96.95   0.007     0.034      96.60     97.02   0.010     0.033
Figure 2: Private fairness test borders by EFG and the minimal m_g (shown for C-MNIST, CelebA, CelebA bias, UTKFace, UTKFace bias, Adult Income, LFW and TIMIT). To the left of the dashed border for a given ε is the area where a model would pass the test for that ε at the chosen δ. Dots indicate the overall risk equality results for each dataset.

To further evaluate the setup, we trained the same models with a tainted batch sampler. The sampler showed fewer samples from the smallest minority group-label pair (g, y) in each batch in order to generate a synthetic sample disparity. We denote these models as Bias in Fig. 2 and Table 1. The taint resulted in an almost as accurate model with a much larger EFG, suggesting they are less fair. For equalized odds, only CelebA had enough samples to certify a model at the tested ε. It requires at least twice as many samples (since we count m_{g,y} instead of m_g). We find this interesting, as it implies that the amount of data should be a consideration even for which definition of fairness is practical to choose. We detail the results for the equalized odds and demographic parity metrics in Appendix D.2. Results for the biased C-MNIST model do not appear in Fig. 2 since its performance is significantly worse.

5.2 Public Data Setting

In this setup, all the data used in the test is known to all participants. With that in mind, we show a potential augmentor for datasets of images and demonstrate empirically that unfair models cannot pass our test as both accurate and fair. Even though the training set is the same as the test set, the key difference is that S fixes M before we apply our augmentor to the dataset with new randomness. This generates diverse enough samples, for which models that are fair and generalize well on augmented samples pass the test, while models which are either unfair or bad at generalization fail the test.

Our augmentations include rotation, cropping, blanked pixels [34] and added Gaussian noise (see the augmentor sketch below). Each augmentation was set to be invoked at a certain probability threshold which was chosen randomly. The augmented images should keep the same label and group as the original image to the human eye. By doing so, we hope to generate varied data that cannot easily be reversed or overfitted on.

We tested our three image datasets using the overall risk equality metric; the results are in Table 1. We used the same method to generate fair and unfair models as in Section 5.1, with the following difference: during training we invoked the augmentor per sample to generate a new augmented sample each time. When we trained the models on the original dataset, the models were not able to generalize on the augmented data.

The results show that there exists a margin in EFG between the fair and unfair models on UTKFace, C-MNIST and CelebA, while the margin differs between datasets, potentially due to their varying size and the different complexities of the tasks. This suggests the existence of some α per dataset, based on Definition 1, but we were not able to pinpoint the exact α. We describe further attempts to characterize α in Appendix D.5.

Figure 3: (a)-(b) 1-NN attack accuracy and EFG as a function of the threshold distance; (c) ODIN out-of-distribution rate by threshold for the train and test sets.
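The following is a minimal sketch of a randomized image augmentor of the kind described above, combining rotation, cropping, random erasing of pixels [34] and additive Gaussian noise; the magnitudes, probabilities and image size are illustrative assumptions rather than the values used in our experiments:

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Additive Gaussian pixel noise; sigma is an illustrative choice."""
    def __init__(self, sigma=0.05):
        self.sigma = sigma
    def __call__(self, img):
        return (img + self.sigma * torch.randn_like(img)).clamp(0.0, 1.0)

def make_augmentor(size=128):
    # Each augmentation fires with some probability, mirroring aug(x, g; r):
    # the label and group should be preserved to the human eye.
    return transforms.Compose([
        transforms.RandomApply([transforms.RandomRotation(degrees=15)], p=0.5),
        transforms.RandomResizedCrop(size, scale=(0.8, 1.0)),
        transforms.ToTensor(),
        transforms.RandomErasing(p=0.5, scale=(0.02, 0.1)),   # blanked pixels [34]
        transforms.RandomApply([AddGaussianNoise(0.05)], p=0.5),
    ])

# Usage: augmentor = make_augmentor(); x_tilde = augmentor(pil_image)
```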
Attacks against public fairness tests
As we assume that our test works without knowledge of the concrete model, our scheme might be susceptible to an indirect attack on our augmentor. For example, if the model could distinguish between the publicly available data and a new sample, it could try to behave fairly during the test, but unfairly when an actual new sample is shown. To mimic such an attack, we tested on UTKFace whether it is easy to fool our test using a simple k-nearest neighbour algorithm (kNN) or an out-of-distribution detection technique (ODIN [25]) on top of a fair classifier to identify the augmented samples. For a fixed threshold distance from our augmented dataset, we would switch to the unfair model and otherwise output the class identified by the kNN. For the ODIN attack we would create a threshold for detecting out-of-distribution samples in order to switch to the unfair classifier. Ideally these attacks use the kNN or fair classifier to pass the test as fair when needed, while future new samples (being "far enough" in threshold terms) invoke the unfair model as predictor. We gave the kNN augmented samples of the test as reference neighbors and plotted the accuracy and EFG by the threshold distance for k=1 in Fig. 3a-b (larger k had worse results). In order to obtain an EFG similar to that of the fair model, this approach leads to a drop in accuracy from 91.44% to 81.9%. In the other attack, we tuned ODIN's hyperparameters, taking the values which had at least a 95% success rate identifying test samples and had the best results at detecting new samples as out-of-distribution. Further details on the tuning are listed in Appendix D.4. We plotted the train and test detection rates as out-of-distribution by threshold in Fig. 3c. As can be seen, the sets are detected at similar rates, and are nearly indistinguishable. This resulted in a similar fair or unfair behavior depending on the chosen threshold. These experiments suggest that these types of attacks are not a good approach to attack our proposed test, as these hybrid models cannot pass as both fair and accurate enough for practical applications.

6 Conclusion

We present an interactive test to certify the fairness of any machine learning model using cryptographic methods such as secure computation. The interactive test ensures that R does not learn M, x does not leak to S, and M does not leak to C, yet it verifies that the model used during inference has been certified by R. We experimented with two scenarios where the test data is either public or private. We provide analysis and guarantees for the test data, and rigorously define the relation between the empirical fairness gap and the sample set sizes.

We believe creating regulatory entities, such as our abstract entity R, can be a step towards standardizing fairness. Once these roles are set in place, our framework can guarantee users are treated fairly in a secure manner. Moreover, from our guarantees and experiments we noticed not all fairness definitions are created equal; some are harder to verify and require much larger volumes of data, e.g. equalized odds requires at least twice as many samples as overall risk equality. This leaves room for consideration of which practical definition we should aim for with respect to limited resources, or what compromise needs to be made in terms of fairness gap and certainty (ε and δ).

For future work we would like to further explore the public data scenario.
Specifically, we would like to characterize the detectable fairness hyper-parameter α and its relation to other parameters such as the sample set size |T|, the amount of randomness used per augmentation, etc. Additionally, we would like to explore whether these parameters can be estimated in advance, without having to conduct experiments on a dataset. Our results in Section 5 suggest that this is a challenge on its own. Moreover, as we are dealing with large models we also require hashing the model inside the secure computation. This step has substantial cost (see Appendix B.4), and it is an open question whether it could be made more efficient in practice using different ideas than ours. Lastly, the proposed method is focused on group-based fairness definitions; exploring other fairness definitions is also an interesting research direction.

Broader Impact

Our framework touches two important and sensitive subjects: fairness and privacy. As such, it has tremendous value from an ethical perspective, but it should also be treated with careful consideration, as a formal notion of fairness does not always align with what we think is fair and just. We believe our work is a step in the larger context of looking at machine learning problems through a cryptographic lens. This opens up the prospect of benefiting from what cryptography has to offer: balancing integrity with privacy.

The benefit of our framework is that it is a method to certify and evaluate fairness under a formal definition thereof, while it also conserves the privacy of clients and the intellectual property of model providers. Clients and providers benefit from it as it bridges the potential mistrust between them. It can also standardize fairness in the form of trusted regulators which are acknowledged as trustworthy in evaluating models. We find this to be an aspect that hopefully has a positive impact on society.

However, we need to be careful about what we consider as fair. There is no "one size fits all" fairness definition, which complicates the validity of a certification in social terms. In particular, we focused on specific notions of fairness in our work, as these lead to an implementable result. The question of which fairness notion is the most applicable in a certain setting is independent of it and beyond the scope of this work. Our work leaves the definition of fairness to policy makers, regulators and other experts, and gives a way to take an accepted definition of fairness and certify a model for it. It should also be noted, though, that we guarantee fairness only up to a certain probability, and this might give people a false sense of fairness of a model even when there is a chance it is not fair.

What is interesting about our work is that issues like fairness and transparency are considered to be at odds with another desirable feature: privacy. But our work brings these together. Moreover, cryptographic models for various primitives and protocols ranging from encryption to secure computation are designed to work in worst-case adversarial environments. This seems to be necessary for successful deployment of machine learning in certain applications, and we hope that bringing this mindset to the machine learning field might be beneficial.
References

[1] Victor Arribas Abril, Pieter Maene, Nele Mertens, and Nigel Smart. 'Bristol fashion' MPC circuits. https://homes.esat.kuleuven.be/~nsmart/MPC/. Last accessed on 11/16/2019.

[2] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, 2016.

[3] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

[4] Teodora Baluta, Shiqi Shen, Shweta Shinde, Kuldeep S Meel, and Prateek Saxena. Quantitative verification of neural networks and its security applications. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1249–1264, 2019.

[5] Assi Barak, Daniel Escudero, Anders Dalskov, and Marcel Keller. Secure evaluation of quantized neural networks. Cryptology ePrint Archive, Report 2019/131, 2019. https://eprint.iacr.org/2019/131.

[6] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.

[7] Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.

[8] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017.

[9] Ivan Damgård, Daniel Escudero, Tore Frederiksen, Marcel Keller, Peter Scholl, and Nikolaj Volgushev. New primitives for actively-secure MPC over rings with applications to private machine learning. In 2019 IEEE Symposium on Security and Privacy (SP), pages 1102–1120. IEEE, 2019.

[10] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.

[11] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268. ACM, 2015.

[12] Batya Friedman and Helen Nissenbaum. Bias in computer systems. ACM Transactions on Information Systems (TOIS), 14(3):330–347, 1996.

[13] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 1993.

[14] Úrsula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1944–1953, 2018.

[15] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[16] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. GAZELLE: A low latency framework for secure neural network inference. In 27th USENIX Security Symposium (USENIX Security 18), pages 1651–1669, 2018.

[17] Matthew Kay, Cynthia Matuszek, and Sean A Munson. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3819–3828. ACM, 2015.

[18] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pages 2569–2577, 2018.

[19] Niki Kilbertus, Adrià Gascón, Matt J. Kusner, Michael Veale, Krishna P. Gummadi, and Adrian Weller. Blind justice: Fairness with encrypted sensitive attributes. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2635–2644. PMLR, 2018.

[20] Michael P Kim, Aleksandra Korolova, Guy N Rothblum, and Gal Yona. Preference-informed fairness. arXiv preprint arXiv:1904.01793, 2019.

[21] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[22] Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.

[23] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In 2009 IEEE 12th International Conference on Computer Vision, pages 365–372, Sep. 2009.

[24] Nishant Kumar, Mayank Rathee, Nishanth Chandran, Divya Gupta, Aseem Rastogi, and Rahul Sharma. CrypTFlow: Secure TensorFlow inference. arXiv preprint arXiv:1909.07814, 2019.

[25] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.

[26] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

[27] Payman Mohassel and Yupeng Zhang. SecureML: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages 19–38. IEEE, 2017.

[28] M Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M Songhori, Thomas Schneider, and Farinaz Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pages 707–721, 2018.

[29] Rachael Tatman and Conner Kasten. Effects of talker dialect, gender & race on accuracy of Bing speech and YouTube automatic captions. In INTERSPEECH, pages 934–938, 2017.

[30] Sahil Verma and Julia Rubin. Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare), pages 1–7. IEEE, 2018.

[31] Kaveh Waddell. How algorithms can bring down minorities' credit scores. The Atlantic, 2, 2016.

[32] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333, 2013.

[33] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

[34] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
A Cryptographic Primitives
We now describe in more detail the cryptographic primitives that are necessary to implement our framework from Section 3: signatures, collision-resistant hash functions, and secure computation.

A.1 Signatures
Cryptographic signatures can be thought of as a computational analogue of hand-written signatures. We give a schematic explanation of signature schemes in Figure 4. Here, a pair of a public verification key vk and a secret signing key sk are generated together by the key generation algorithm KeyGen. sk will be used by the signing algorithm Sign to create a signature σ on a message m, while the verification algorithm Verify decides whether a pair (m, σ) is valid according to the verification key vk or not. Secure signature schemes guarantee unforgeability, which means that given vk and arbitrarily many signature pairs {(m_i, σ_i)}_{i∈[ℓ]}, it is hard to generate a valid signature σ on a message m, where m ≠ m_i for all i ∈ [ℓ].
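As an illustration, the following sketch instantiates KeyGen, Sign and Verify with Ed25519 from the Python `cryptography` package; this is only one possible instantiation, not the scheme prescribed by the paper:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# KeyGen: the regulator R generates (sk, vk); vk is published as cert_ID.
sk = Ed25519PrivateKey.generate()
vk = sk.public_key()

# Sign: R signs a message (e.g., a hash of the certified model) with sk.
message = b"hash-of-certified-model"
sigma = sk.sign(message)

# Verify: anyone holding vk can check the pair (message, sigma);
# verify() raises InvalidSignature if the signature is not valid.
try:
    vk.verify(sigma, message)
    print("signature valid")
except InvalidSignature:
    print("signature invalid")
```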
We will use a collision-resistant hash function H_k : {0,1}^n × {0,1}^n → {0,1}^n, which is an efficiently computable function such that it is hard for any polynomial-time algorithm (in n) that is given a random k to come up with x_1, x_2 such that H_k(x_1) = H_k(x_2). In practice, one uses e.g. SHA-3 to implement H_k for a k that is fixed in advance. Since the input length of SHA-3 is fixed, in order to hash longer messages one can apply H_k recursively using a Merkle tree (see Figure 4), obtaining a hash Ĥ_k(m_1, ..., m_ℓ). For such a Merkle tree it can be proven that if H_k is collision-resistant then Ĥ_k is too.

Figure 4: Signature scheme (left: KeyGen outputs (sk, vk); Sign uses sk to produce σ on m; Verify checks (m, σ) against vk) and a Merkle tree built from repeated applications of H_k (right).
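A minimal sketch of such a Merkle-tree hash Ĥ_k over a long message (e.g., the serialized model weights), using SHA3-256 from the standard library as the underlying H_k; the block size and helper names are illustrative:

```python
import hashlib

BLOCK_SIZE = 64  # bytes per leaf; an illustrative choice

def h(data: bytes) -> bytes:
    """Underlying compression function H_k, instantiated here with SHA3-256."""
    return hashlib.sha3_256(data).digest()

def merkle_hash(message: bytes) -> bytes:
    """Hash an arbitrarily long message by hashing blocks and combining pairwise."""
    blocks = [message[i:i + BLOCK_SIZE] for i in range(0, len(message), BLOCK_SIZE)] or [b""]
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```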
A.3 Secure Computation
We further let parties perform computations on shared data such that the computation does not reveal their inputs, for the purposes mentioned in Section 3.

Secure computation can be imagined as the existence of a "trusted third party" F_SC which performs a computational task for certain parties. F_SC would receive the inputs from both participants, do the computation, and send the output to the participants. The task of this party is outlined in Figure 5. As is common in the secure computation literature, this description assumes that the computation is done by a circuit K. Participant P_1 provides its input x_1 to the trusted party, while participant P_2 provides its input x_2. The trusted party computes K(x_1, x_2) and sends the outputs to the respective participants. By this definition, this "idealized box" F_SC achieves the desired privacy objective.

Such a trusted third party F_SC as described in Figure 5 does not necessarily exist in the real world, but it can be emulated using cryptographic tools as a protocol consisting of two (or more) entities sending messages to each other over a network. Guarantees in these protocols can be given if at least one of the participants is acting honestly throughout the process.

Two parties P_1, P_2 can talk to this trusted third party.

Input: Upon message (Input-P_1, x_1) from P_1 and (Input-P_2, x_2) from P_2, store x_1, x_2 locally.

Compute: Upon input (Compute, K) from P_1 and P_2, and if x_1, x_2 have been stored:
1. Check if x_1, x_2 have suitable sizes for the circuit K. If not, output (Abort).
2. If x_1, x_2 have suitable sizes, then compute (y_1, y_2) = K(x_1, x_2) and store y_1, y_2 locally.

Output: Upon input (Output) from P_1 and P_2, and if y_1, y_2 have been computed, send y_1 to P_1 and y_2 to P_2.

Figure 5: A trusted third party F_SC for secure computation.

The two most popular approaches for implementing Figure 5 are based on cryptographic paradigms called Fully Homomorphic Encryption (FHE) and Secure Multiparty Computation (MPC). For comparison, current FHE schemes are constrained by their demand for computational power and can at best evaluate a few hundred AND gates of the circuit K per second. MPC, on the other hand, which has a higher demand in terms of communication, can achieve a much better throughput. In particular, there exist MPC schemes that are tailored to efficiently implementing the function M(·), such as e.g. [5, 9].

B Implementing the Framework
In this section we describe how to implement the framework from Section 3 using the tests from Section 4. While the implementation is described at a high level, it is easy to instantiate each of the components based on existing cryptographic tools and the experimental results from Section 5.
B.1 Creating the Test
Consider a design of an interactive test based on the set T = {(x_1, g_1, y_1), ..., (x_m, g_m, y_m)} and parameters δ, ε as follows:

1. The regulator R computes the minimal m_g fulfilling Equation (3) by assuming EFG = 0. If T does not contain enough samples from each group, then R aborts. If R does not abort, it tells S the total number of inputs m that will be checked.

2. R and S run a secure computation of a functionality F_Check which is described below. S inputs M into F_Check while R inputs ({(x_i, g_i, y_i)}_{i∈[m]}). The functionality F_Check consists of the following steps:

(a) Compute ŷ_i ← M(x_i) for all i ∈ [m].
(b) For all i ∈ [m], compute a bit b_i as 1 if ŷ_i = y_i and 0 otherwise.
(c) Based on the b_i values, compute for each group g the empirical risk \bar{\ell}_g(M, T) based on Eq. (1).
(d) Based on the result of the previous step, evaluate Eq. (3) of Theorem 1 (checking ε-fairness). Output 1 if the statement holds and 0 otherwise.

Based on the statement of Theorem 1 it follows that F_Check will output 1 if and only if the model M provided by S is ε-fair with confidence 1 − δ (a plaintext sketch of F_Check is given at the end of this subsection).

The secure computation of F_Check implements the functionality as a Binary circuit K that is evaluated on secret inputs. We examine the size of this circuit in Section B.4.

A test using augmented data. If R instead wishes to use public and augmented data as for Theorem 2, then this will only work assuming that M is (ε, α)-detectable as defined in Definition 1. In such a setting R would now create a test set T′ from T locally using an augmentor aug and then follow the exact same path as before (albeit with different constants).
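The following is a plaintext (non-secure) sketch of the functionality F_Check for the overall risk equality metric; in the actual protocol these steps are expressed as a Binary circuit and evaluated inside the secure computation, and the model/prediction interface here is an assumed placeholder:

```python
import math

def f_check(model_predict, samples, eps, delta, n_labels):
    """Steps (a)-(d) of F_Check in the clear.

    model_predict: callable x -> label (provided by S).
    samples: list of (x, g, y) tuples (provided by R).
    Returns 1 if the epsilon-fairness condition of Theorem 1 holds, else 0.
    """
    errors, counts = {}, {}
    for x, g, y in samples:
        y_hat = model_predict(x)                 # step (a)
        b = 1 if y_hat == y else 0               # step (b)
        counts[g] = counts.get(g, 0) + 1
        errors[g] = errors.get(g, 0) + (1 - b)
    risks = {g: errors[g] / counts[g] for g in counts}   # step (c), Eq. (1)
    efg = max(risks.values()) - min(risks.values())
    if efg >= eps:                                        # step (d), Eq. (3)
        return 0
    min_mg = 2.0 / (eps - efg) ** 2 * math.log(2 * len(counts) * n_labels / delta)
    return 1 if min(counts.values()) >= min_mg else 0
```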
B.2 Algorithms

We now describe how to use the circuit K from Section B.1 to implement the framework. The overall approach is as follows. Initially, R generates a signature key pair and distributes the verification key to all other participants. Then R and S run a secure computation which runs the test of Section B.1 and computes a Merkle tree hash Ĥ_k(M) of the model M. If the test finds that the model is fair, then R signs (Ĥ_k(M), ε, δ, fairness definition string) and sends the signature to S. By signing ε, δ and a fairness definition string we allow multiple fairness definitions and hyperparameters to be certified. Later, whenever S and C run a certified inference for ε, δ and a fairness definition, then in addition to running a secure computation of M(x), the functionality will also recompute the hash Ĥ_k(·) of the model provided by S and output it to C, while S sends the signature on the model to C. C can then locally check whether R originally issued the signature on the hash for those hyperparameters and fairness definition, given the public verification key of R. The overall protocols are outlined in Figure 6.

We will have three participants S, C, R as outlined before. Let (KeyGen, Sign, Verify) be a signature scheme and H_k(·) be a collision-resistant hash function whose key k is a common input to all parties. Moreover, let F_SC be a functionality for secure computation as outlined in Figure 5. S has a model M as input; R has a fairness validation set T as well as parameters ε, δ and the fairness definition string fair.

Setup:
This reflects Step 1 of Figure 1.

1. R uses KeyGen to generate a key pair (sk, vk). R keeps sk private and sends vk to C and S as cert_ID.

Certification:
This reflects Steps 2 and 3 of Figure 1.

1. S and R input the same values into F_Check as they do in Section B.1.
2. Let K be the circuit as outlined in Section B.1. Create a circuit K_cert that performs the following: first run K on the respective inputs as before, computing F_Check, and denote the output bit of this circuit as b. Then compute the Merkle tree output h ← Ĥ_k(M) based on the hash function H_k. Finally, output (b, h) to R.
3. Both parties run a secure computation of K_cert using F_SC.
4. If the output bit b is 1, then R computes σ(h) ← Sign_sk(h, (ε, δ, fair)) and sends it as cert_M to S.

Inference:
This reflects Steps 4 and 5 of Figure 1.

1. S sends σ̃ to C. C also knows the signature verification key vk.
2. S and C run a secure computation, where S inputs M̃ and C inputs its input x. For this secure computation they construct a circuit K_inf as follows:
(a) Compute ŷ ← M̃(x).
(b) Compute h̃ ← Ĥ_k(M̃).
3. C and S run a secure computation of K_inf using F_SC. C obtains as output (ŷ, h̃), while S does not obtain anything.
4. C computes b ← Verify_vk(σ̃, (h̃, ε, δ, fair)). If this is true then C accepts ŷ. Otherwise it rejects it.

Figure 6: Protocol π_Framework for certified inference.
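Below is a non-secure, local sketch of the certify/verify logic around the protocol above; the model serialization, the digest standing in for Ĥ_k, and the tuple encoding of (ε, δ, fair) are illustrative assumptions, and in the real protocol the hash is recomputed inside the secure computation:

```python
import hashlib

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def model_digest(model_bytes: bytes) -> bytes:
    # Stands in for the Merkle-tree hash of the serialized model weights.
    return hashlib.sha3_256(model_bytes).digest()

def encode(h: bytes, eps: float, delta: float, fair: str) -> bytes:
    # Illustrative, fixed encoding of (h, (eps, delta, fair)) as bytes.
    return h + f"|{eps}|{delta}|{fair}".encode()

# Regulator side: after F_Check outputs b = 1, sign the model digest as cert_M.
reg_sk = Ed25519PrivateKey.generate()
cert_id = reg_sk.public_key()                       # published as cert_ID
model_bytes = b"...serialized model weights..."     # placeholder
cert_m = reg_sk.sign(encode(model_digest(model_bytes), 0.1, 0.05, "risk-equality"))

# Client side: after the secure computation outputs (y_hat, h_tilde), accept
# y_hat only if cert_M verifies under cert_ID for the agreed parameters.
def client_accepts(h_tilde: bytes, cert: bytes, eps: float, delta: float, fair: str) -> bool:
    try:
        cert_id.verify(cert, encode(h_tilde, eps, delta, fair))
        return True
    except InvalidSignature:
        return False
```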
B.3 Security
We now give a sketch of the argument about the security of π_Framework with respect to Section 3. This must naturally stay on a high level, since we did not make the security properties of the framework formal.

First, we note that π_Framework leaks to R and C the Merkle-tree hash h of the model. But since it can be assumed that M has high entropy and the implementation of H_k is a cryptographic hash function, the leakage of h should be tolerable. (Footnote: It is possible in principle to reduce this leakage by computing R's signature of h, and the signature verification by C, in a secure computation, but this will considerably increase the overhead. In the other direction, if we are willing to leak some more information, then the circuit K can be modified to output to R whether M successfully classified each input x_i and let R compute the ε-fairness of the model locally. This will simplify the secure computation at the cost of leaking more data to R.) That being said, we base our security argument on the following statements:

• The functionality F_SC can be implemented using a secure protocol. As mentioned in Section 2 this can be done using secure two-party or multi-party computation (MPC).

• There exist secure signature and hashing schemes.
Given these primitives, we can assume that the certification and inference steps of Figure 6 are as secure as if they were computed by a trusted party. Assume that in the inference step, the signature σ̃ and output h̃ of F_SC are validated. This can only happen in one of three cases: (i) σ̃ was generated for M̃ by R (which is the desired course of events); (ii) σ̃ was issued by R but for a different M̃′; or (iii) σ̃ was never issued by R. In the last case, S must have broken the security of the signature scheme. In the second case, S must have broken the collision-resistance of H_k. Therefore, either S managed to break the signature scheme or the hash function, or R signed M̃. R computes this signature if and only if the model passed the test of Section B.1. Based on the statement of Theorem 1 it follows that this test passes if and only if the model M̃ provided by S is ε-fair with confidence 1 − δ.

B.4 Efficiency
We now estimate the efficiency of implementing our framework using π_Framework. We first claim that it only makes sense to run our framework in settings where the ML inference is done using a secure computation: if the inference is not computed using a secure computation, then one option is for the client to learn the model and run a check for fairness by itself, or send the model to another party and ask it to do this check. Another option is that the client simply hands over its input to the model owner, but this would require prohibitively expensive zero-knowledge proofs, to be computed at the owner side, to attest to the fairness of the output without revealing anything about the model.

Therefore, given that inference is done via secure computation, the parties must incur the cost of running a secure computation of the inference, and the efficiency of the framework should be measured by the additional overhead that is added on top of the secure inference.

The main computational tasks that are run by π_Framework are as follows:
• The Certification phase runs m instances of a secure computation of inference and in addition computes a hash of the model and checks the accuracy of the output.
• The Inference phase runs a single secure computation of the inference and in addition computes a hash of the model.
Certification phase is a one-time event, and therefore its overhead is less critical. Theorem 1shows that the number of samples m g per group should be m = EF G − (cid:15) ) ln | G | δ . Setting forexample EF G = 0 . , (cid:15) = 0 . , δ = 0 . and considering | G | = 100 groups, we get that m g ≈ ,which does not seem to be too far off from existing training set sizes.In more detail, we describe here the cost of implementing the different steps of the circuit K whichcomputes the certification, as described in Section B.1: Step (a) needs to implement the inference m times. This is by far the largest component of the circuit. Step (b) computes m comparisons, whichare easy. Step (c) computes ¯ (cid:96) g ( M, T ) for each group g , based on Eq. 1. This computation must sumthe b values for each group g . To make this step efficient, the circuit must hard-wire the connectionsfor these summations, and the locations of the inputs from each g can be known. (There is no need tohide these locations from S .) Eq. 1 also computes a division by m g , but there is no need to computethe division and the circuit forwards m g · ¯ (cid:96) g ( M, T ) to the next step. Step (d) tests Eq. 3 for each pairof g , g , namely computes ¯ (cid:96) g ( M, T ) − ¯ (cid:96) g ( M, T ) . Since the input to this step is m g i · ¯ (cid:96) g i ( M, T ) then the test in this equation should be changed appropriately (which is straightforward, especiallyif m g = m g ).As for the cost of computing M ( · ) , current secure computation implementations for this task onlyhide the weights of a DNN but reveal the actual network structure and activation functions. Weassume that our secure computation will also only hide the weights as this seems to be a standardassumption. Therefore, we ask what is the additional cost of hashing this data over the default costof using the weights in the computation of the model.15here is a lot of current work on lightweight hashing schemes for usage in zero-knowledge proofs,and it is reasonable to expect that a lot of improvements in this area will be made in the near future.As a baseline, we consider the Keccak-F function, which is the basis of the SHA3 standard. Thatfunction takes a 1600 bit input and can be implemented by a Boolean circuit of 38,400 AND gates(see [1]), i.e. 24 AND gates per input bit. If we use a Merkle tree then the total number of hashes istwice the number of input blocks . Therefore, the total cost is about 48 AND gates per input bit.Now, with regards to the secure evaluation of the model (not considering special MPC implemen-tations for secure inference ), let us consider a setting where the weights have 32 bit fixed-pointvalues. The cost per each weight (when used in DNN inference) must be at least that of multiplyingthe weight with either an input or output of a hidden layer and adding all these products together(neglecting the cost of the activation function). Multiplying the weight with a 32 bit value costs ANDs per input bit, while adding up the result would only require ANDs per input bit (see [1]), andwe therefore take the assumption that the total cost of the secure computation is
As for the cost of computing M(·), current secure-computation implementations for this task only hide the weights of a DNN but reveal the network structure and activation functions. We assume that our secure computation will likewise hide only the weights, as this seems to be a standard assumption. We therefore ask what the additional cost of hashing this data is, on top of the default cost of using the weights in the computation of the model.

There is a lot of current work on lightweight hashing schemes for use in zero-knowledge proofs, and it is reasonable to expect substantial improvements in this area in the near future. As a baseline, we consider the Keccak-f function, which is the basis of the SHA-3 standard. That function takes a 1600-bit input and can be implemented by a Boolean circuit of 38,400 AND gates (see [1]), i.e. 24 AND gates per input bit. If we use a Merkle tree, then the total number of hashes is twice the number of input blocks, and the total cost is therefore about 48 AND gates per input bit. (We can improve on this by having the circuit output to C the results of the first layer of the Merkle tree and letting C compute the rest of the tree locally; for this to work, we would on the other hand have to add random values to each input block to avoid lookup-table-based attacks on preimages of H_k.)

Now, with regard to the secure evaluation of the model, and not considering special MPC implementations for secure inference (this neglects recent works such as [5], which apply only to special types of networks; we believe that the accuracy of networks such as the MobileNets used in [5] is too low to be of use for fairness testing), let us consider a setting where the weights are 32-bit fixed-point values. The cost per weight (when used in DNN inference) must be at least that of multiplying the weight with either an input or an output of a hidden layer and adding all these products together (neglecting the cost of the activation function). Multiplying the weight with a 32-bit value costs a fixed number of AND gates per input bit, while adding up the results requires far fewer (see [1]); we therefore take the total cost of the secure computation to be a constant number of AND gates per bit of the weights (neglecting the activation function). Under this accounting, the fairness verification increases the cost of inference by only a small constant factor. While using optimized implementations for inference will make the additional overhead from hashing relatively larger, we can in practice lower the cost of hashing drastically by exploiting special properties of F_SC which allow the use of homomorphic commitments. We leave such specialized hashing techniques as interesting future work.
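The following back-of-the-envelope calculator expresses the estimate above as code: it counts the AND gates needed to Merkle-hash a weight vector with Keccak-f (1600-bit blocks, 38,400 AND gates per permutation, doubled for the tree) and compares them with a per-weight-bit budget for the secure inference itself. The per-bit inference cost is left as a parameter, and the example value of 33 AND gates per weight bit is an assumption of ours, not a figure from the paper.

    import math

    KECCAK_BLOCK_BITS = 1600
    KECCAK_AND_GATES = 38_400   # Boolean circuit size of Keccak-f (see [1])

    def hashing_and_gates(num_weights, bits_per_weight=32):
        """AND gates needed to Merkle-hash the model weights with Keccak-f."""
        total_bits = num_weights * bits_per_weight
        blocks = math.ceil(total_bits / KECCAK_BLOCK_BITS)
        # A Merkle tree over n input blocks needs about 2n compression calls.
        return 2 * blocks * KECCAK_AND_GATES

    def relative_hashing_overhead(num_weights, inference_ands_per_weight_bit,
                                  bits_per_weight=32):
        """Hashing cost relative to the (assumed) cost of the secure inference."""
        inference = num_weights * bits_per_weight * inference_ands_per_weight_bit
        return hashing_and_gates(num_weights, bits_per_weight) / inference

    # Example: a model with 10 million 32-bit weights; 33 ANDs per weight bit for
    # the inference is an illustrative assumption, not a value from the paper.
    print(relative_hashing_overhead(10_000_000, inference_ands_per_weight_bit=33))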
C Theorem Proofs

C.1 Theorem 1
Using Hoeffding's concentration bound we get that

\[ \Pr\left[\, \bigl|\bar{\ell}_g(M,T) - \ell_g(M)\bigr| > \frac{\epsilon - \mathrm{EFG}}{2} \,\right] \;\le\; e^{-m_g(\epsilon - \mathrm{EFG})^2/2} \;=\; \frac{\delta}{|\mathcal{G}|\,|\mathcal{Y}|}. \]

By a union bound it follows that $|\bar{\ell}_g(M,T) - \ell_g(M)| \le (\epsilon - \mathrm{EFG})/2$ for all $g \in \mathcal{G}$ with probability $1-\delta$. Given that this event holds, then by applying the triangle inequality twice, for any $g, g' \in \mathcal{G}$ we have

\[ |\ell_g(M) - \ell_{g'}(M)| \;\le\; |\ell_g(M) - \bar{\ell}_g(M,T)| + \mathrm{EFG} + |\bar{\ell}_{g'}(M,T) - \ell_{g'}(M)| \;\le\; \epsilon. \]

Hence $\max_{g,g' \in \mathcal{G}} |\ell_g(M) - \ell_{g'}(M)| \le \epsilon$ with confidence $1-\delta$. Similar arguments show that the same is true for equalized odds and demographic parity.
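As a quick numerical sanity check, the snippet below solves $e^{-m_g(\epsilon - \mathrm{EFG})^2/2} = \delta / (|\mathcal{G}|\,|\mathcal{Y}|)$ for the required $m_g$; the particular values of ε, EFG, δ, $|\mathcal{G}|$ and $|\mathcal{Y}|$ in the example are illustrative choices of ours, not values used in the paper.

    import math

    def samples_per_group(epsilon, efg, delta, num_groups, num_labels):
        """Smallest m_g with exp(-m_g*(epsilon-efg)**2/2) <= delta/(|G|*|Y|)."""
        assert epsilon > efg, "the bound only applies when epsilon exceeds the EFG"
        return math.ceil(2.0 / (epsilon - efg) ** 2
                         * math.log(num_groups * num_labels / delta))

    # Illustrative values (our own, not taken from the paper).
    print(samples_per_group(epsilon=0.1, efg=0.02, delta=0.01,
                            num_groups=100, num_labels=2))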
C.2 Theorem 2

Similarly to the proof in C.1, we get that

\[ \Pr\left[\, \bigl|\bar{\ell}_{g,\mathrm{aug}}(M,\tilde{T}) - \ell_{g,\mathrm{aug}}(M)\bigr| > \alpha - \mathrm{EFG} \,\right] \;\le\; \frac{\delta}{|\mathcal{G}|\,|\mathcal{Y}|}, \]

which implies that $|\ell_{g,\mathrm{aug}}(M) - \ell_{g',\mathrm{aug}}(M)| \le \alpha$ for all pairs of groups with confidence $1-\delta$. Since $M$ satisfies $(\epsilon, \alpha)$-detectable fairness, $M$ is $\epsilon$-fair with the same confidence.

D Experimental Details
D.1 Datasets

• UTKFace [33] is a dataset of face images with attribute annotations for age, ethnicity (called race and annotated as black, asian, white or other), and gender (male or female). We kept only the samples annotated as black or white and discarded all others, leaving 14,604 samples, which we split equally into train and test sets. The dataset consists of 70% white and 30% black, and of 53% male and 47% female.

• MNIST is originally a dataset of hand-written digits from 0 to 9. The dataset is used to predict the digit in the image without additional annotations, hence there are 10 classes across the dataset without any allocation of groups. We changed the task to a binary classification and synthetically generated two fairness groups based on the MNIST data: we assigned the label 0 to the digits 0-4 and the label 1 to the digits 5-9, and randomly colored half of the dataset's digits in red, as was done in [3], resulting in 50% red digits and 50% white digits; these colors are the fairness groups. We call this dataset C(olored)-MNIST (a construction sketch is given after this dataset list).
• LFW [15] is a dataset of face images with attribute annotations [23]. Using the "Black" attribute we divided the data into two groups, while using "Male" as the binary label.

• CelebA [26] is a face-recognition dataset consisting of more than 10,000 different celebrities with gender labelling. We annotated 8,500 of the 10,177 celebrities in the dataset for ethnicity using Amazon Mechanical Turk (the annotations can be downloaded from https://github/will/be/published/). Three workers annotated three images of each of the 8,500 celebrities, resulting in 177,683 images classified as either Asian, African, Caucasian or Other. During our experiments we merged all but the Caucasian group in order to produce a large dataset that showcases our setup, with over 30,000 samples for the minority group.

• Adult Income [22] is a tabular dataset with a label for low/high income. We used the gender feature as the group affiliation and income as the label. During preprocessing, all numeric features were normalized, while categorical features were transformed into one-hot vectors so that they could later be used by DNN models.

• TIMIT [13] is a speech dataset with dialect and gender annotations. We used the different dialects as groups and the gender of the speaker as the label. To have more samples per dialect, we merged dialects which have much in common and are considered similar, namely New England with New York City and Northern with North Midland, while discarding the rest.
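Since the C(olored)-MNIST construction is described only in prose above, here is a small sketch of how such a dataset could be generated from the original MNIST images; the exact coloring procedure of [3] may differ, so this should be read as an illustration of the binary relabeling and the red/white group split rather than as the authors' preprocessing code.

    import numpy as np

    def make_colored_mnist(images, digits, seed=0):
        """Build a C-MNIST-style dataset from grayscale MNIST images.

        images: array of shape (n, 28, 28) with values in [0, 1]
        digits: array of shape (n,) with the original digit labels 0-9
        Returns RGB images, binary labels (0 for digits 0-4, 1 for 5-9)
        and group ids (0 = white digits, 1 = red digits)."""
        rng = np.random.default_rng(seed)
        n = len(images)
        labels = (digits >= 5).astype(np.int64)           # binary task
        groups = (rng.random(n) < 0.5).astype(np.int64)   # color half the digits red
        rgb = np.repeat(images[:, None, :, :], 3, axis=1) # (n, 3, 28, 28), white digits
        red = groups == 1
        rgb[red, 1] = 0.0   # zero the green channel
        rgb[red, 2] = 0.0   # zero the blue channel, leaving red digits
        return rgb, labels, groups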
Dataset         Size      (g0, y0)   (g0, y1)   (g1, y0)   (g1, y1)
UTKFace         14,604    37.5%      31.51%     15.87%     15.12%
LFW             13,144    74.22%     21.52%     3.24%      1.02%
CelebA          177,683   33.97%     49.26%     8.67%      8.11%
C-MNIST         70,000    25%        25%        25%        25%
Adult Income    48,842    46.54%     20.3%      29.52%     3.62%
Table 2: Data distribution across groups and labels for each of the datasets; each column (g_i, y_j) gives the fraction of samples belonging to group g_i with label y_j.
D.2 Private Data Setup Full Results
Full results for overall risk equality, equalized odds and demographic parity can be found in Table 3.
D.3 Public Data Setup Full Results
We experimented with cutting the UTKFace dataset in half to see how this affects the margin and α. We also report results for the LFW dataset; since we could not generate a fair model for LFW, we have no reference or evidence for the margin, but the EFG is empirically high, suggesting that the augmentation would work on it as well. The results are in Table 4.
Dataset    Model          Accuracy   Risk Equality      Equalized Odds     Demographic Parity
                                      EFG      ε-Test    EFG      ε-Test    EFG      ε-Test
UTKFace    ResNet18       89.76      0.012    Failed    †        †         †        †
UTKFace    Bias-ResNet18  88.56      0.093    Failed    0.115    Failed    0.088    Failed
CelebA     ResNet18       97.63      0.007    Passed    0.017    Passed    0.083    Failed
CelebA     Bias-ResNet18  96.95      0.034    Failed    0.045    Failed    0.039    Failed
C-MNIST    LeNet          98.11      0.001    Passed    0.022    Failed†   †        †

Table 3: Fairness test using private data. Entries marked "†" (or "Failed†") indicate that the sample size was insufficient to certify the fairness of the model for that notion.

Dataset              Model     Accuracy (Fair)   Accuracy (Bias)   Risk-Equality EFG (Fair)   Risk-Equality EFG (Bias)
UTKFace - Half Size  ResNet18  92.13             87.03             3.56                       12.91
UTKFace              ResNet18  96.11             91.44             2.72                       13.88
C-MNIST              LeNet     89.17             67.97             0.65                       34.04
LFW                  ResNet18  91.98             -                 7.45                       -
CelebA               ResNet18  96.60             97.02             1.01                       3.28

Table 4: Fairness test using public data with an augmentor on the four image datasets. "Half Size" refers to the dataset with half of the samples removed.

D.4 ODIN Tuning

We tuned ODIN's three hyperparameters: the temperature T, the perturbation ε, and the threshold δ. We chose T from among four candidate values and ε from 30 evenly spaced numbers between 0 and 0.01, and we took the combination that yielded the best results over all δ in [0, 1]. We note that the hyperparameter tuning had little effect, as most of the chosen values performed very similarly.

D.5 Testing with unknown margin